Slashdot Mirror


Searching the 'Deep Web'

abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"

8 of 193 comments

  1. Maybe I'm just missing the point... by robslimo · · Score: 5, Interesting

    ...but I don't want to see the guts of a web form. If I understand correctly, they're talking about crawling into databases, actually parsing a Microsoft Access file, for instance. I see that as having dubious merit, and potentially pissing off web site owners. Web site designers go to a lot of trouble to provide the interface they want you to see to their data. This would just sidestep the interface and dump you into the data.

    At the very least, it might require an overhaul or extension to the robots exclusion specification to keep spiders out of your data.
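
    For reference, here is a minimal sketch of how the existing robots exclusion rules can already fence off a data directory from well-behaved spiders. The robots.txt contents, the bot name "ExampleBot" and the example.com URLs are all made up for illustration; it uses Python's standard urllib.robotparser and says nothing about how Yahoo's program actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that wants its data files left alone.
ROBOTS_TXT = """\
User-agent: *
Disallow: /data/
Disallow: /cgi-bin/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The public page is allowed; the Access file under /data/ is not.
print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/index.html"))
print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/data/customers.mdb"))
```

    This only helps against crawlers that choose to honor robots.txt; it is a convention, not an access control.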

  2. PHP? by TGK · · Score: 4, Interesting

    Since I moved my site over to a PHP-based system, nothing beyond my index page gets a second look from google. As web content moves away from static pages to more dynamic solutions (particularly XML) a more sophisticated crawler is needed, one that can read over this bewildering maelstrom of data and extract from it meaning and content.

    While I find it highly unlikely that this system will do well with large databases (or even databases at all for that matter) it is a step in the right direction. Google will probably have their version up on labs inside a month.

    --
    Killfile(TGK)
    No trees were killed in the creation of this post. However, many electrons were inconvenienced.
    1. Re:PHP? by DeadSea · · Score: 5, Interesting
      Keep in mind that googlebot comes in two flavors: freshbot and deepbot.

      Freshbot is meant to update the google cache for pages that change frequently. Freshbot may pull pages as much as every couple hours for really popular pages that change frequently.

      Deepbot goes out once every month or two and follows links. The higher your pagerank, the deeper into your site it will go. If you want more of your site to get crawled here are some tips:

      1. Make your pages *look* static (end in .html)
      2. Avoid CGI parameters except for handling form data (no ? in url)
      3. Put all pages in the document root, or in very shallow subdirectories. Google goes after less and less as the directories get deeper.
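
      The three tips above can be turned into a rough URL check. This is only a sketch of the heuristics as stated in this comment, not anything Google documents: the function name crawl_warnings, the max_depth default of 2 and the example.com URLs are assumptions chosen for illustration.

```python
from urllib.parse import urlparse


def crawl_warnings(url: str, max_depth: int = 2) -> list[str]:
    """List which of the three tips a URL appears to violate."""
    parts = urlparse(url)
    path = parts.path or "/"
    warnings = []

    # Tip 1: the page should look static.
    if not path.endswith((".html", ".htm", "/")):
        warnings.append("does not look static")

    # Tip 2: no CGI parameters in the URL.
    if parts.query:
        warnings.append("has query parameters")

    # Tip 3: keep pages in shallow directories.
    # Depth is the number of directories below the document root.
    depth = max(path.count("/") - 1, 0)
    if depth > max_depth:
        warnings.append(f"nested {depth} directories deep")

    return warnings


print(crawl_warnings("http://example.com/articles/deep-web.html"))
print(crawl_warnings("http://example.com/a/b/c/view.php?id=7"))
```

      A dynamic site can usually pass all three checks by rewriting URLs on the server side while still generating the pages from PHP.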

      It is likely that deepbot just hasn't run since you updated your site, so freshbot is just pulling your front page occasionally.

      BTW: I noticed you have a link to my cheat sheet on your links page. Thanks! :-)

  3. Spiders? by Vo0k · · Score: 4, Interesting

    ...and I wonder about something different.
    Has anyone tried this yet? Change your user agent string to one matching the googlebot and crawl the web. I'm pretty sure many "registration only" websites would magically open themselves, but I wonder about other differences too :)
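
    For anyone wondering what "change your user agent string" amounts to, here is a minimal sketch using Python's standard urllib.request. The User-Agent value is an assumed crawler-style string and example.com is a placeholder; the snippet only builds the request and inspects the header, it does not send anything.

```python
import urllib.request

# Assumed crawler-style User-Agent string, used here purely for illustration.
CRAWLER_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://example.com/", CRAWLER_UA)

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

    Whether a given site reacts to the header at all is up to that site; some verify crawlers by IP address rather than by what the client claims to be.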

    --
    Anagram("United States of America") == "Dine out, taste a Mac, fries"
  4. Bad kitty! by Underholdning · · Score: 4, Interesting

    There's a perfectly good reason why a webcrawler doesn't (and shouldn't) crawl the backend databases. I have customers with items and prices in their database. They update that on a daily basis. I have customers that provide directory solutions. We update that information on a daily basis. Now, imagine the turmoil that will arise when people find outdated items using their favorite search engine, which crawls the database once in a blue moon. Nuff said. Bad idea.

  5. Funny by BenBenBen · · Score: 4, Interesting

    Google's always been good enough for me.

    --
    The Slashdot Paradox: "100% Overrated"
  6. Re:How?? by MImeKillEr · · Score: 4, Interesting

    People submit their site, google goes to their site and visits every link it can find on the main page, then every link it finds on those other pages etc. So that pretty much the whole site is included.
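
    The link-following described above is essentially a breadth-first walk over pages. Here is a toy sketch of that idea, assuming a made-up in-memory link table (LINKS) instead of real HTTP fetches; it is not a description of how Google's crawler is actually built.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
LINKS = {
    "/": ["/photos", "/stats"],
    "/photos": ["/photos/2003", "/"],
    "/photos/2003": [],
    "/stats": ["/stats/ut2k3"],
    "/stats/ut2k3": ["/"],
    "/orphan": ["/"],  # nothing links here, so a crawler never finds it
}


def crawl(start: str, links: dict) -> list:
    """Visit every page reachable from `start`, breadth first."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


print(crawl("/", LINKS))
```

    Note that "/orphan" never shows up: a page with no inbound links is invisible to this kind of crawl, which is why an unsubmitted site only gets found once something links to it.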

    Google doesn't just search pages submitted - I've got an Apache webserver running at home, doling out pages for family photos and stats for a local UT2K3 server. I hadn't enabled robots.txt to stop search engines from crawling it (didn't think I needed to) and one day entered my URL in google, only to find it.

    I've never submitted the URL to google.

    Should we assume that Google's already crawled a majority of the sites out there?

    BTW, Yahoo has no record of my site in their database.

    --
    Cruising the internet on my TI-99/4A @ a whopping 300 baud!
  7. On a related note... by cr0sh · · Score: 4, Interesting
    What about the "invisible web"?

    The so-called invisible web is indirectly related to the "deep web", with the exception that most of it isn't connected at all to the main web. Slashdot has had some articles regarding these hidden segments of the web - but has any progress been made on finding these "lost networks"?

    Current theory on networks explains how and why these networks form and separate from the main web of connections, mainly due to loss of one of the tenuous threads from a supernode to the outlier nodes. When this loss occurs (an intermediary site goes offline, or popularity wanes, or a large meganode dies or stagnates), the network fragments - and getting back to the pages/sites within is nearly impossible, unless you already have a link to the inside, or a friend provides it to you.
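
    A toy illustration of that fragmentation, assuming a tiny made-up graph (WEB) in which a single "bridge" site is the only path from the hub to two outlying pages. The node names and the reachable helper are hypothetical; real web graphs are directed and vastly larger.

```python
def reachable(graph: dict, start: str, removed=frozenset()) -> set:
    """Return every node reachable from `start`, skipping removed nodes."""
    if start in removed:
        return set()
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in graph.get(node, ()):
            if neighbor not in removed and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


# Hypothetical link structure, treated as undirected for simplicity.
WEB = {
    "hub": ["a", "bridge"],
    "a": ["hub"],
    "bridge": ["hub", "x"],
    "x": ["bridge", "y"],
    "y": ["x"],
}

before = reachable(WEB, "hub")
after = reachable(WEB, "hub", removed={"bridge"})

# Pages that still exist but can no longer be reached from the hub.
lost = before - after - {"bridge"}
print(sorted(lost))
```

    Once "bridge" goes offline, "x" and "y" still link to each other but nothing in the main web leads to them anymore.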

    Now, it is a good thing that this phenomenon exists - it seems to exist in all robust, evolving networks - whether those networks be electronically connected, socially connected (ie, Friendster, Orkut, or plain-ole social groupings), or bio/chemo connected (ie, the brain, the body, etc).

    Even so, I wonder at all the information out there which I *can't* access, because it isn't indexed in some way. Sometimes you come across fragments and echoes in other archives (news, mail, irc) that lead to these far-off and displaced "locations" - but it is rare, and tedious to do unless you are looking for very needful information.

    So I ask again, has anything been done to further the "searching" within/for the "invisible web"?

    --
    Reason is the Path to God - Anon