Searching the 'Deep Web'


Posted by ryuzaki0 on Tuesday March 9, 2004 @01:50AM from the sounds-more-like-the-deep-hurting dept.

abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"

Maybe I'm just missing the point... by robslimo · 2004-03-09 01:56 · Score: 5, Interesting

...but I don't want to see the guts of a web form. If I understand correctly, they're talking about crawling into databases, actually parsing a Microsoft Access file, for instance. I see that as having dubious merit, and potentially pissing off web site owners. Web site designers go to a lot of trouble to provide the interface they want you to see to their data. This would just sidestep the interface and dump you into the data.

At the very least, it might require an overhaul or extension to the robots exclusion specification to keep spiders out of your data.
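
For reference, here is a minimal sketch of how a well-behaved spider can check the existing robots exclusion rules before fetching anything, using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` agent name, the `can_crawl` helper and the example.com URLs are all made up for illustration. Whether a deep-web crawler would actually honor rules like these is an assumption, not something the article confirms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep every crawler out of a /data/ directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /data/
"""


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # An ordinary page is allowed; a database file under /data/ is not.
    print(can_crawl(ROBOTS_TXT, "ExampleBot", "http://example.com/index.html"))
    print(can_crawl(ROBOTS_TXT, "ExampleBot", "http://example.com/data/customers.mdb"))
```

The current rules only match on URL path prefixes, so this keeps a spider out only if the data sits under a path you can name in a `Disallow` line.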

Re:PHP? by DeadSea · 2004-03-09 02:14 · Score: 5, Interesting

Keep in mind that Googlebot comes in two flavors: freshbot and deepbot.
Freshbot is meant to update the Google cache for pages that change frequently. For really popular pages, it may pull them as often as every couple of hours.
Deepbot goes out once every month or two and follows links. The higher your PageRank, the deeper into your site it will go. If you want more of your site to get crawled, here are some tips:
1. Make your pages *look* static (end in .html)
2. Avoid CGI parameters except for handling form data (no ? in url)
3. Put all pages in the document root, or in very shallow subdirectories. Google crawls less and less as the directories get deeper.
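
The three tips above can be turned into a rough URL check. This is only a sketch of the heuristics as stated in this comment, not how Google actually scores pages. The `crawl_friendliness` name, the `max_depth` cutoff of 2, and the example.com URLs are assumptions made up for illustration.

```python
from urllib.parse import urlparse


def crawl_friendliness(url: str, max_depth: int = 2) -> dict:
    """Check a URL against the three tips; max_depth is an arbitrary cutoff."""
    parts = urlparse(url)
    path = parts.path or "/"
    segments = [s for s in path.split("/") if s]
    # Directory depth: every segment is a directory if the path ends in "/",
    # otherwise the last segment is the file name and does not count.
    depth = len(segments) if path.endswith("/") else max(len(segments) - 1, 0)
    return {
        # Tip 1: the page looks static (ends in .html, or is a directory index).
        "looks_static": path.endswith((".html", "/")),
        # Tip 2: no CGI parameters, i.e. nothing after a "?" in the URL.
        "no_query_string": parts.query == "",
        # Tip 3: the page lives in the document root or a shallow subdirectory.
        "shallow": depth <= max_depth,
    }


if __name__ == "__main__":
    print(crawl_friendliness("http://example.com/docs/guide.html"))
    print(crawl_friendliness("http://example.com/a/b/c/d/page.php?id=7"))
```

The first URL passes all three checks; the second fails all three.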
It is likely that deepbot just hasn't run since you updated your site, so freshbot is just pulling your front page occasionally.
BTW: I noticed you have a link to my cheat sheet on your links page. Thanks! :-)