atom · Slashdot Mirror

User: atom

atom's activity in the archive.

Stories: 0
Comments: 3
First seen: 1999-12-14
Last seen: 2001-03-02

Comments · 3

Why not HP-UX?! HAHAHAHAHA on HP Ditching WindowsCE for Linux on Jornada? · 2001-03-02 01:47 · Score: 2

They should eat their own dogfood, and use HP-UX! I'm sure they'll attract lots of developers because of H-PUX's cutting-edge development tools, like debuggers that can't debug!

And users will love patching their OS every week.

atom - still bitter from being my project's HPUX guy a few years ago...
Re:Black holes on Is the Internet Becoming Unsearchable? · 1999-12-14 00:26 · Score: 1

One thing you could have done to avoid this is to compute a checksum of the HTML that you downloaded. Keep a table or database of the checksums of every page downloaded so far on a given site, and junk any page with a duplicate checksum.
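
A rough sketch of the checksum table, in Python. The `PageDeduper` class and its method name are made up for illustration, SHA-256 is just one reasonable choice of checksum, and a real spider would probably keep the table in a database rather than in memory:

```python
import hashlib


class PageDeduper:
    """Remembers checksums of page bodies already seen, grouped by site.

    Hypothetical helper for illustration; not taken from any real spider.
    """

    def __init__(self):
        # Maps a site name to the set of checksums seen on that site.
        self._seen = {}

    def is_duplicate(self, site, html):
        """Return True if this exact page body was already seen on `site`.

        `html` is the raw downloaded body as bytes.
        """
        digest = hashlib.sha256(html).hexdigest()
        digests = self._seen.setdefault(site, set())
        if digest in digests:
            return True
        digests.add(digest)
        return False


deduper = PageDeduper()
page = b"<html><body>same content</body></html>"
first = deduper.is_duplicate("example.com", page)   # new page, keep it
second = deduper.is_duplicate("example.com", page)  # same bytes, junk it
```

Note that this only catches byte-for-byte duplicates. A page that embeds a timestamp or a session id in its body will get a different checksum every time.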

Another thing you should do is record each URL that you download and make sure that you don't download the same URL multiple times.

One way that sites screw this approach up is to append a unique session ID to each URL. You might need to keep track of the session ID, or else you might get into an infinite loop of downloading the same page with a different session ID each time. The checksum approach might get around this problem.
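
Something like the following could handle both the URL bookkeeping and the session ID problem. It is only a sketch: the parameter names in `SESSION_PARAMS` are guesses (every site names its session parameter differently), `canonical_url` and `should_fetch` are invented names, and it only handles session IDs passed in the query string, not ones embedded in the path.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; adjust per site.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

# Canonical forms of every URL fetched so far.
seen_urls = set()


def canonical_url(url):
    """Return `url` with session parameters and the fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def should_fetch(url):
    """Return True the first time a URL is seen, ignoring its session ID."""
    key = canonical_url(url)
    if key in seen_urls:
        return False
    seen_urls.add(key)
    return True
```

With this, `http://example.com/item?id=7&sessionid=abc` and `http://example.com/item?id=7&sessionid=xyz` both reduce to `http://example.com/item?id=7`, so the second one is skipped.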

I'm also writing a spider, but the emphasis is on indexing dynamic pages (product pages at ecommerce sites).
Re:Database driven web pages are 'spam' on Is the Internet Becoming Unsearchable? · 1999-12-14 00:15 · Score: 1

Of course, the /index.html page of a db driven site would change constantly.

For instance, www.slashdot.org/ might have an article about Natalie Portman in the root page when the search bot comes by, but it'll be gone in a day.