Slashdot Mirror


Searching the 'Deep Web'

abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"

3 of 193 comments

  1. AKA goodbye robots.txt by Anonymous Coward · · Score: -1, Redundant

    AKA "What's a robots.txt file?" says the innocent web crawling robot. :P

  2. Get ready to tighten up those dynamic site scripts by pubjames · · Score: 0, Redundant


    My guess is that they will be looking at ways of automatically polling dynamic web sites to extract all the data from the database. So if a site has a page, for instance

    www.site.com/index.asp?content=10,

    the search engine will try content=1 to content=n to see what it gets.
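    The guess above can be sketched roughly as follows. This is only an illustration of the idea, not how any real search engine is known to work: the function name enumerate_param and the range 1 to 3 are made up for the example, and the sketch only builds the candidate URLs without fetching anything.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def enumerate_param(url, param, start, stop):
    """Yield copies of `url` with query parameter `param` set to each
    integer from `start` to `stop` inclusive.

    Hypothetical helper for illustration only; it builds URLs and does
    not make any network requests.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if param not in [key for key, _ in query]:
        raise ValueError(f"parameter {param!r} not found in {url!r}")
    for value in range(start, stop + 1):
        # Replace only the chosen parameter, keep the others untouched.
        new_query = [(k, str(value) if k == param else v) for k, v in query]
        yield urlunsplit(parts._replace(query=urlencode(new_query)))


# The example page from the comment, with an explicit scheme added so the
# URL parses cleanly (the scheme is an assumption, not in the original).
candidates = list(
    enumerate_param("http://www.site.com/index.asp?content=10", "content", 1, 3)
)
for candidate in candidates:
    print(candidate)
```

    A crawler doing this would still have to guess where n ends, for example by stopping after a run of error pages or empty results.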

  3. Deep crawling my hard drive? by pieterh · · Score: -1, Redundant

    Surely if it's not been published on a web site, it's not meant to be accessible and indexed. The hidden 90% is mostly confidential data, private documents, porn, and miscellaneous files. Why would anyone want to crawl this?