Searching the 'Deep Web'
abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"
I bet you this new 'Deep Web' search technology would be something that does not observe robots.txt...
[Something witty and intelligent should have appeared here.]
{Traicovn}
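For anyone wondering what "observing robots.txt" amounts to in practice, here is a rough sketch using Python's standard urllib.robotparser. The robots.txt contents, the "ExampleBot" user agent and the example.com URLs are all made up for illustration; nothing here is taken from Yahoo's program or the Salon article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. A real crawler would fetch this
# from the site's /robots.txt instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cgi-bin/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is an invented crawler name.
    for url in ("http://example.com/index.html",
                "http://example.com/search?q=patents"):
        print(url, allowed("ExampleBot", url))
```

A crawler that respects the file simply skips any URL for which allowed() returns False. A "deep web" crawler that submits forms or queries databases would have to ignore rules like the Disallow lines above, or get the content handed to it by the site owner.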
Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.
There is a reason for this: a Google search should turn up pointers to the items in the so-called "deep web" (*gag*). To use one of the examples above: if I am looking for information on patents, the search terms I use should point me to the US Patent and Trademark Office. It shouldn't have to point me to all 12 bajillion patent filings.
Besides, what makes anyone think this is going to fly after all the hubbub over "deep-linking"?
I want to drag this out as long as possible. Bring me my protractor.
1 percent, and I still don't have a problem feeling lucky almost every time I do a search on Google.
What would an EWOULDBLOCK block, if an EWOULDBLOCK could block would? -- me
Judging by the relevancy problems that often occur in current search engines (I'm thinking of meta keywords, which are now completely useless for many search engines, and of google-bombing), why would a customer pay to add more data to the search engine? The idea, of course, is 'because they'll be more relevant, and because they have more information they will come up more often'. However, if search engines start searching more and more of this 'deep web', how badly will relevancy be affected? I mean, the more data that is in there, the more chances there are of relevancy being broken, and if the weighting is in favor of these 'featured' searches, then relevancy may be even more broken. Sure, these companies will have more traffic directed to them, but will it merely be useless traffic from frustrated users searching for something else?
I run a search engine for an educational institution, and I will admit, Google misses a significant number of our documents. On the other hand, some of those documents are scripts that, when queried, will create a (virtually) infinite amount of data (calendar scripts, etc.). How deep do we really need to go, though? Do we really need to include calendar entries for the year 2452?
I'm also confused: is this search service 'pay by the searcher' or 'pay by the content provider'? It seems to be content provider to me.
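To make the calendar problem concrete, here is a minimal sketch of one way a crawler could avoid walking a date-driven script forever: refuse any URL whose year falls outside a small window around the present. The "year" query parameter, the calendar.cgi URL and the two-year window are assumptions for illustration, not how any particular search engine does it.

```python
from urllib.parse import parse_qs, urlparse


def within_window(url: str, current_year: int, max_years: int = 2) -> bool:
    """Return True if url is worth crawling under a simple year window.

    Assumes (hypothetically) that the calendar script takes its date as
    a "year" query parameter. URLs without that parameter are accepted.
    """
    years = parse_qs(urlparse(url).query).get("year")
    if not years:
        return True  # not a dated page, so no limit applies
    try:
        year = int(years[0])
    except ValueError:
        return False  # malformed year: safer to skip than to follow
    return abs(year - current_year) <= max_years


if __name__ == "__main__":
    for url in ("http://example.edu/calendar.cgi?year=2005",
                "http://example.edu/calendar.cgi?year=2452"):
        print(url, within_window(url, current_year=2004))
```

A general crawler would more likely cap link depth or the number of pages per script, since it cannot know every site's parameter names. The point is only that some bound has to exist.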
[Something witty and intelligent should have appeared here.]
{Traicovn}
I wish you luck using that credit card number without the appropriate expiration date. The FUD spreaders rarely mention the fact that exp dates are almost never stored with the numbers themselves.
occultae nullus est respectus musicae - originally a Greek proverb
Solution: Web designers, stop trying to be so clever.
If you want your site to be spiderable, don't hide it behind javascript and flash!
but how many children are now going to be able to bypass the disclaimer pages on porn sites because of deep linking?
Hello, 1996 is calling; they want their paranoia back!
Goodness, you aren't serious, are you? Have you used a search engine in the last couple years? Have you not ever looked for porn yourself? Just hop over to images.google.com and enter the name of a porn star - bam, shitloads of smut. Not only that, but search google.com for a porn star's name (many of which you could easily find by searching for 'famous porn stars', I'm sure) and you'll find gallery after gallery of porn, open and free.
There is no such thing as protecting your kids from porn on the internet anymore. If you don't want to have them looking at porn, don't let them online or police their actions.
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers