They should eat their own dogfood and use HP-UX! I'm sure they'll attract lots of developers because of HP-UX's cutting-edge development tools, like debuggers that can't debug!
And users will love patching their OS every week.
atom - still bitter from being my project's HPUX guy a few years ago...
One thing you could have done to avoid this is to compute a checksum of the HTML that you downloaded. You can keep a table/database of checksums of every page that you've downloaded so far on a given site. You should junk any page with a duplicate checksum.
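Here's a rough sketch of the checksum idea in Python. The class name `PageDeduper` and the choice of SHA-256 are my own assumptions for illustration; any stable hash would do, and a real spider would probably keep the checksums in a database per site rather than in memory.

```python
import hashlib


class PageDeduper:
    """Remembers a checksum for every page body seen so far on one site.

    Hypothetical helper, not taken from any particular spider.
    """

    def __init__(self):
        self.seen = set()  # hex digests of page bodies already downloaded

    def is_duplicate(self, html: bytes) -> bool:
        """Return True if an identical page body was seen before."""
        digest = hashlib.sha256(html).hexdigest()
        if digest in self.seen:
            return True
        self.seen.add(digest)
        return False
```

You'd call `is_duplicate(body)` right after each download and junk the page when it returns True. Note this only catches byte-for-byte identical pages; a page that embeds a timestamp or a session id in its body will get a different checksum every time.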
Another thing you should do is record each URL that you download and make sure that you don't download the same URL multiple times.
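Something like the following would handle the URL bookkeeping. This is just a minimal sketch: the `Frontier` class and its method names are made up for this example, and the only normalization it does is dropping the `#fragment`, which is an assumption on my part about what counts as "the same URL".

```python
from collections import deque
from urllib.parse import urldefrag


class Frontier:
    """Queue of URLs to fetch that refuses to enqueue the same URL twice.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        self.visited = set()   # every URL ever enqueued
        self.queue = deque()   # URLs still waiting to be downloaded

    def add(self, url: str) -> bool:
        """Enqueue url unless it was seen before; return True if enqueued."""
        url, _fragment = urldefrag(url)  # "#top" etc. never reaches the server
        if url in self.visited:
            return False
        self.visited.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to download, or None when nothing is left."""
        return self.queue.popleft() if self.queue else None
```

Marking a URL as visited when it is enqueued, rather than when it is downloaded, keeps the same link from piling up in the queue when lots of pages point to it.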
One way that sites screw this approach up is to append a unique session id to each URL. You might need to keep track of the session id, or else you might get into an infinite loop of downloading the same page, but with a different session id each time. The checksum thing might get around this problem.
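One way to deal with session ids is to strip them out of the URL before checking whether you've seen it. A sketch, assuming the session id shows up as a query parameter: the names in `SESSION_PARAMS` are guesses, and you'd have to adjust them per site. Session ids embedded in the path are not handled here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names; real sites use all sorts of names for this.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}


def strip_session_id(url: str) -> str:
    """Return url with any session-id query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

Run every extracted link through `strip_session_id` before comparing it against your list of downloaded URLs, so two links that differ only in session id count as one page.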
I'm also writing a spider, but the emphasis is on indexing dynamic pages (product pages at e-commerce sites).
Of course, the /index.html page of a database-driven site would change constantly.
For instance, www.slashdot.org/ might have an article about Natalie Portman in the root page when the search bot comes by, but it'll be gone in a day.