Slashdot Mirror


Searching the 'Deep Web'

abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"

22 of 193 comments

  1. With the 10% that is crawled by Trigun · · Score: 5, Funny

    being pretty much total crap, I'd really hate to see the other 90%!

    1. Re:With the 10% that is crawled by Zone-MR · · Score: 4, Informative

      It could actually be useful content.

      Let me give you an example. I run a forum. The main index page doesn't contain much information, just an overview of the latest posts and a brief introduction.

      The rest of the content is what people submit. Here is the problem. The pages are generated dynamically. They end up having URLs like http://domain/index.php?act=showpost&postid=1244

      Google sees index.php as one page, and does not attempt to submit any data via get/post. This means that effectively the most valuable content is missed.

      Of course making it crawl /?yada=yada links has problems, namely the possibility of getting stuck in an infinite loop where data and links are tracked using sessions, and an infinite number of URLs could potentially yield valid, although very similar, results.
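
      A minimal sketch of one way a crawler might sidestep that loop: normalize each URL before deciding whether it has been seen. The parameter names in SESSION_PARAMS and the example URLs are assumptions for illustration, not anything a particular search engine is known to use.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names that carry session state rather than content.
SESSION_PARAMS = {"s", "sid", "phpsessid", "sessionid"}

def canonicalize(url):
    """Drop session parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k.lower() not in SESSION_PARAMS]
    params.sort()
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(params), ""))

def unseen(urls):
    """Yield the first URL for each canonical form, skipping session-only duplicates."""
    seen = set()
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            yield url
```

      With this, two links to the same post that differ only in a session token collapse to one crawl target. It does not help with scripts that generate genuinely distinct pages forever; those would need a depth or page-count limit as well.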

  2. Deep Web? by Traicovn · · Score: 5, Insightful

    I bet you this new 'Deep Web' search technology would be something that does not observe robots.txt...
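
    For reference, a minimal sketch of what observing robots.txt looks like from the crawler's side, using Python's standard urllib.robotparser. The rules, the "ExampleBot" agent name and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

    A well-behaved spider would run a check like this before every request; nothing in the protocol forces it to.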

    --

    [Something witty and intelligent should have appeared here.]
    {Traicovn}
  3. Deep web? by hookedup · · Score: 4, Funny

    Doesn't crap sink? Not sure I want to know what the other 90-odd percent is. After tubgirl, goatse, etc.. what else could possibly be next..

  4. deep web? by rjelks · · Score: 4, Funny

    Is it just me, or does this sound like we're gonna get more pr0n when we search?

    -

  5. Maybe I'm just missing the point... by robslimo · · Score: 5, Interesting

    ...but I don't want to see the guts of a web form. If I understand correctly, they're talking about crawling into databases, actually parsing a Microsoft Access file, for instance. I see that as having dubious merit, and potentially pissing off web site owners. Web site designers go to a lot of trouble to provide the interface they want you to see to their data. This would just sidestep the interface and dump you into the data.

    At the very least, it might require an overhaul or extension to the robots exclusion specification to keep spiders out of your data.

  6. PHP? by TGK · · Score: 4, Interesting

    Since I moved my site over to a PHP-based system, nothing beyond my index page gets a second look from Google. As web content moves away from static pages to more dynamic solutions (particularly XML) a more sophisticated crawler is needed, one that can read over this bewildering maelstrom of data and extract from it meaning and content.

    While I find it highly unlikely that this system will do well with large databases (or even databases at all for that matter) it is a step in the right direction. Google will probably have their version up on labs inside a month.

    --
    Killfile(TGK)
    No trees were killed in the creation of this post. However, many electrons were inconvenienced.
    1. Re:PHP? by DeadSea · · Score: 5, Interesting
      Keep in mind that googlebot comes in two flavors, freshbot, and deepbot.

      Freshbot is meant to update the google cache for pages that change frequently. Freshbot may pull pages as much as every couple hours for really popular pages that change frequently.

      Deepbot goes out once every month or two and follows links. The higher your pagerank, the deeper into your site it will go. If you want more of your site to get crawled here are some tips:

      1. Make your pages *look* static (end in .html)
      2. Avoid CGI parameters except for handling form data (no ? in url)
      3. Put all pages in the document root, or in very shallow subdirectories. Google crawls less and less as the directories get deeper.

      It is likely that deepbot just hasn't run since you updated your site, so freshbot is just pulling your front page occasionally.

      BTW: I noticed you have a link to my cheat sheet on your links page. Thanks! :-)

    2. Re:PHP? by Xner · · Score: 4, Informative
      I'm not exactly sure what you mean. If it is accessible by clicking on links, most search engines should be able to index it. If you want to be extra-friendly you can use $PATH_INFO to make dynamic pages look more like static ones, e.g.:

      http://site.com/blah/prog.php/stat/1
      instead of
      http://site.com/blah/prog.php?stat=1

      I use it all the time and it works really well.
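
      A rough sketch of the idea in Python rather than PHP: the script receives the trailing "/stat/1" as path info and turns it back into parameters. Reading the segments as alternating key/value pairs is an assumption made here for illustration; the helper names are hypothetical.

```python
from urllib.parse import quote, unquote

def to_path_style(script_url, params):
    """Build a static-looking URL such as prog.php/stat/1 from a parameter mapping."""
    tail = "".join("/%s/%s" % (quote(str(k), safe=""), quote(str(v), safe=""))
                   for k, v in params.items())
    return script_url + tail

def parse_path_info(path_info):
    """Turn path info like '/stat/1' back into {'stat': '1'}."""
    segments = [unquote(s) for s in path_info.split("/") if s]
    if len(segments) % 2:
        raise ValueError("expected key/value pairs in %r" % path_info)
    # Even-indexed segments are keys, odd-indexed segments are their values.
    return dict(zip(segments[0::2], segments[1::2]))
```

      The page is just as dynamic as before; only the URL a spider sees has lost its question mark.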

      --
      Pathman, Free (as in GPL) 3D Pac Man
  7. From the article by sczimme · · Score: 4, Insightful


    Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.

    There is a reason for this: a Google search should turn up pointers to the items in the so-called "deep web" (*gag*). To use one of the examples above: if I am looking for information on patents, the search terms I use should point me to the US Patent and Trademark Office. It shouldn't have to point me to all 12 bajillion patent filings.

    Besides, what makes anyone think this is going to fly after all the hubbub over "deep-linking"?

    --
    I want to drag this out as long as possible. Bring me my protractor.
  8. Spiders? by Vo0k · · Score: 4, Interesting

    ...and I wonder about something different.
    Has anyone tried this yet? Change your user agent string to one matching the googlebot and crawl the web. I'm pretty sure many "registration only" websites would magically open themselves, but I wonder about other differences too :)

    --
    Anagram("United States of America") == "Dine out, taste a Mac, fries"
  9. Top 4 by UncleBiggims · · Score: 5, Informative
    About.com lists the top 4 places to search the deep web as:

    Anybody use any of these sites? Are they any good? Just wondering why this is getting to be news if sites like these already exist.

    Are you Corn Fed?
  10. 1 percent,? by zonix · · Score: 4, Insightful
    The article alleges that current search services like Google manage to access less than 1% of the web [...]

    1 percent, and I still don't have a problem feeling lucky almost every time I do a search on google.

    z
    --
    What would an EWOULDBLOCK block, if an EWOULDBLOCK could block would? -- me
  11. Relevancy by Traicovn · · Score: 4, Insightful

    Judging by the problems with relevancy that often occur in current search engines (I think of the problem with meta keywords, which for many search engines are now completely useless, and google-bombing), why would a customer pay to add more data to the search engine? The idea of course is 'because they'll be more relevant and because they have more information will come up more often'. However, if search engines start searching more and more of this 'deep web', how badly will relevancy be affected? I mean, the more data that is in there, the more chances there are of relevancy being broken, and if the weighting is in favor of these 'featured' searches, then relevancy may be even more broken. Sure, these companies will have more traffic directed to them, but will it merely be useless traffic by frustrated users searching for something else?

    I run a search engine for an educational institution, and I will admit, Google misses a significant number of our documents. On the other hand, some of those documents are scripts that when queried will create a (virtually) infinite amount of data (calendar scripts, etc). How deep do we really need to go though? Do we really need to include calendar entries for the year 2452?

    I'm also confused, is this search service 'pay by the searcher' or 'pay by the content provider'. It seems to be content provider to me.

    --

    [Something witty and intelligent should have appeared here.]
    {Traicovn}
  12. Bad kitty! by Underholdning · · Score: 4, Interesting

    There's a perfectly good reason why a webcrawler doesn't (and shouldn't) crawl the backend databases. I have customers with items and prices in their database. They update that on a daily basis. I have customers that provide directory solutions. We update that information on a daily basis. Now, imagine the turmoil that will arise, when people find outdated items using their favorite search engine which crawls the database once in a blue moon. Nuff said. Bad idea.

  13. Re:Oh yeah, a whole new pair of dimes by dsanfte · · Score: 4, Insightful

    I wish you luck using that credit card number without the appropriate expiration date. The FUD spreaders rarely mention the fact that exp dates are almost never stored with the numbers themselves.

    --
    occultae nullus est respectus musicae - originally a Greek proverb
  14. True nature of the deep database problem by andygrace · · Score: 5, Informative

    I don't think most posters understand the issue - most websites are now run out of content management systems, and search engines just trawl the web storing current pages. This is fine in a static internet, but with pages changing on a minute-by-minute basis (for example, a news site that pulls out the latest headlines), all you're going to have indexed in Google is what's on the page today.

    Now say I was looking for info from a few weeks ago - Google is not necessarily the best way of finding this info. It's all still sitting there in the database, but it's not on the site's front page. archive.org may have a copy of it, but it would be much better to have google.com talk XML in a standard method to the news site's content management system, and have ALL the data there for a search.

  15. Funny by BenBenBen · · Score: 4, Interesting

    Google's always been good enough for me.

    --
    The Slashdot Paradox: "100% Overrated"
  16. Re:Limitations of Google by Stiletto · · Score: 4, Insightful


    Solution: Web designers, stop trying to be so clever.

    If you want your site to be spiderable, don't hide it behind javascript and flash!

  17. Re:How?? by MImeKillEr · · Score: 4, Interesting

    People submit their site, google goes to their site and visits every link it can find on the main page, then every link it finds on those other pages etc. So that pretty much the whole site is included.

    Google doesn't just search pages submitted - I've got an Apache webserver running at home, doling out pages for family photos and stats for a local UT2K3 server. I hadn't enabled robots.txt to stop search engines from crawling it (didn't think I needed to) and one day entered my URL in google, only to find it.

    I've never submitted the URL to google.

    Should we assume that Google's already crawled a majority of the sites out there?

    BTW, Yahoo has no record of my site in their database.

    --
    Cruising the internet on my TI-99/4A @ a whopping 300 baud!
  18. On a related note... by cr0sh · · Score: 4, Interesting
    What about the "invisible web"?

    The so-called invisible web is indirectly related to the "deep web", with the exception that most of it isn't connected at all to the main web. Slashdot has had some articles regarding these hidden segments of the web - but has any progress been made on finding these "lost networks"?

    Current theory on networks explains how and why these networks form and separate from the main web of connections, mainly due to loss of one of the tenuous threads from a supernode to the outlier nodes. When this loss occurs (an intermediary site goes offline, or popularity wanes, or a large meganode dies or stagnates), the network fragments - and getting back to the pages/sites within is nearly impossible, unless you already have a link to the inside, or a friend provides it to you.
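
    The fragmentation described here can be sketched as a reachability check on a link graph: a crawler only ever finds pages it can reach by following links from its starting points. The graph representation and the page names in the usage note are made up for illustration.

```python
from collections import deque

def reachable(links, seeds):
    """Return every page reachable from the seed pages by following links.

    links maps a page name to the list of pages it links to.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def without_page(links, dead):
    """Return a copy of the graph with one page, and all links to it, removed."""
    return {page: [t for t in targets if t != dead]
            for page, targets in links.items() if page != dead}
```

    For example, with links = {"portal": ["hub"], "hub": ["a", "b"], "a": ["b"], "b": []}, a crawl from "portal" reaches all four pages. Remove "hub" and the same crawl reaches only "portal", even though "a" and "b" are still online and still link to each other.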

    Now, it is a good thing that this phenomenon exists - it seems to exist in all robust, evolving networks - whether those networks be electronically connected, socially connected (ie, Friendster, Orkut, or plain-ole social groupings), or bio/chemo connected (ie, the brain, the body, etc).

    Even so, I wonder at all the information out there which I *can't* access, because it isn't indexed in some way. Sometimes you come across fragments and echoes in other archives (news, mail, irc) that lead to these far-off and displaced "locations" - but it is rare, and tedious to do unless you are looking for very needful information.

    So I ask again, has anything been done to further the "searching" within/for the "invisible web"?

    --
    Reason is the Path to God - Anon
  19. Re:But if you bypass the front pages... by CAIMLAS · · Score: 4, Insightful

    but how many children are now going to be able to bypass the disclaimer pages on porn sites because of deep linking?

    Hello, 1996 is calling; they want their paranoia back!

    Goodness, you aren't serious, are you? Have you used a search engine in the last couple years? Have you not ever looked for porn yourself? Just hop over to images.google.com and enter the name of a porn star - bam, shitloads of smut. Not only that, but search google.com for a porn star's name (many of which you could easily find by searching for 'famous porn stars', I'm sure) and you'll find gallery after gallery of porn, open and free.

    There is no such thing as protecting your kids from porn on the internet anymore. If you don't want to have them looking at porn, don't let them online or police their actions.

    --
    ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers