Slashdot Mirror


Searching the 'Deep Web'

abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"

6 of 193 comments

  1. With the 10% that is crawled by Trigun · · Score: 5, Funny

    being pretty much total crap, I'd really hate to see the other 90%!

  2. Deep Web? by Traicovn · · Score: 5, Insightful

    I bet you this new 'Deep Web' search technology would be something that does not observe robots.txt...

    --

    [Something witty and intelligent should have appeared here.]
    {Traicovn}
  3. Maybe I'm just missing the point... by robslimo · · Score: 5, Interesting

    ...but I don't want to see the guts of a web form. If I understand correctly, they're talking about crawling into databases, actually parsing a Microsoft Access file, for instance. I see that as having dubious merit, and potentially pissing off web site owners. Web site designers go to a lot of trouble to provide the interface they want you to see to their data. This would just sidestep the interface and dump you into the data.

    At the very least, it might require an overhaul or extension to the robots exclusion specification to keep spiders out of your data.
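    For reference, here is a rough sketch of how a well-behaved spider could honor the existing robots exclusion rules before touching a data file. It uses Python's standard urllib.robotparser; the "DeepCrawler" agent name, the example.com URLs, and the robots.txt contents are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the agent name and paths are invented for this sketch.
ROBOTS_TXT = """\
User-agent: DeepCrawler
Disallow: /data/
Disallow: /search

User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Build a parser from robots.txt text instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The hypothetical deep crawler is kept out of the raw database file...
print(parser.can_fetch("DeepCrawler", "http://example.com/data/customers.mdb"))
# ...but may still fetch the ordinary front page.
print(parser.can_fetch("DeepCrawler", "http://example.com/index.html"))
# Other agents fall back to the wildcard rules.
print(parser.can_fetch("OtherBot", "http://example.com/data/customers.mdb"))
```

    This only helps against spiders that choose to check the file; nothing in the mechanism enforces it.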

  4. Top 4 by UncleBiggims · · Score: 5, Informative
    About.com lists the top 4 places to search the deep web as:

    Anybody use any of these sites? Are they any good? Just wondering why this is getting to be news if sites like these already exist.

    Are you Corn Fed?
  5. Re:PHP? by DeadSea · · Score: 5, Interesting
    Keep in mind that googlebot comes in two flavors: freshbot and deepbot.

    Freshbot is meant to update the Google cache for pages that change frequently. Freshbot may pull really popular, frequently changing pages as often as every couple of hours.

    Deepbot goes out once every month or two and follows links. The higher your pagerank, the deeper into your site it will go. If you want more of your site to get crawled, here are some tips:

    1. Make your pages *look* static (end in .html)
    2. Avoid CGI parameters except for handling form data (no ? in url)
    3. Put all pages in the document root, or in very shallow subdirectories. Google crawls less and less as the directories get deeper.
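
    The three tips above can be turned into a quick self-check for your own URLs. This is only a sketch of the heuristics as stated here, not anything Google publishes; the function name and the max_depth cutoff of 2 are assumptions picked for illustration.

```python
from urllib.parse import urlparse


def crawl_warnings(url, max_depth=2):
    """Return warnings for a URL based on the three tips above.

    max_depth is an arbitrary cutoff chosen for this sketch.
    """
    parsed = urlparse(url)
    path = parsed.path or "/"
    warnings = []

    # Tip 1: the page should look static (file name ends in .html or .htm).
    filename = path.rsplit("/", 1)[-1]
    if filename and not filename.endswith((".html", ".htm")):
        warnings.append("does not look static")

    # Tip 2: no CGI parameters (no ? in the URL).
    if parsed.query:
        warnings.append("has query parameters")

    # Tip 3: keep pages in the document root or shallow subdirectories.
    depth = len([seg for seg in path.split("/")[:-1] if seg])
    if depth > max_depth:
        warnings.append(f"directory depth {depth} exceeds {max_depth}")

    return warnings


print(crawl_warnings("http://example.com/faq.html"))
print(crawl_warnings("http://example.com/cgi-bin/view.php?id=7"))
print(crawl_warnings("http://example.com/a/b/c/d/page.html"))
```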

    It is likely that deepbot just hasn't run since you updated your site, so freshbot is just pulling your front page occasionally.

    BTW: I noticed you have a link to my cheat sheet on your links page. Thanks! :-)

  6. True nature of the deep database problem by andygrace · · Score: 5, Informative

    I don't think most posters understand the issue. Most websites are now run out of content management systems, and search engines just trawl the web storing current pages. This is fine for a static internet, but pages now change on a minute-by-minute basis. Take a news site that pulls out the latest headlines: all you're going to have indexed in Google is what's on the page today.

    Now say I was looking for info from a few weeks ago. Google is not necessarily the best way of finding it. It's all still sitting there in the database, but it's not on the site's front page. archive.org may have a copy of it, but it would be much better to have google.com talk XML in a standard way to the news site's content management system, and have ALL the data there for a search.
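
    To make the idea concrete, here is a minimal sketch of what such an exchange might look like from the search engine's side. No such standard is described in this thread, so the XML layout, element names, article ids, and dates below are all invented; the point is only that an export straight from the CMS would include stories that have already dropped off the front page.

```python
import xml.etree.ElementTree as ET

# Hypothetical CMS export: every element, attribute, id and date is made up.
FEED = """\
<articles>
  <article id="101" published="2004-02-20">
    <title>Old headline about search engines</title>
    <body>Story that has already dropped off the front page.</body>
  </article>
  <article id="102" published="2004-03-09">
    <title>Current headline</title>
    <body>Story currently on the front page.</body>
  </article>
</articles>
"""


def parse_feed(xml_text):
    """Turn the exported XML into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    docs = []
    for node in root.findall("article"):
        docs.append({
            "id": node.get("id"),
            "published": node.get("published"),
            "title": node.findtext("title", default=""),
            "body": node.findtext("body", default=""),
        })
    return docs


def search(docs, term):
    """Return ids of articles whose title or body contains the term."""
    term = term.lower()
    return [d["id"] for d in docs
            if term in (d["title"] + " " + d["body"]).lower()]


docs = parse_feed(FEED)
# The older story is found even though it is no longer on the front page.
print(search(docs, "search engines"))
```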