Searching the 'Deep Web'
abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"
These new deep-web crawlers try to ignore the robot access control files. They try to intelligently determine whether they're in some type of infinite looping situation, but basically this is how they work.
- Complete Planet
- The Invisible Web Directory
- Librarian's Index to the Internet
- INFOMINE
Anybody use any of these sites? Are they any good? Just wondering why this is getting to be news if sites like these already exist.
Are you Corn Fed?
When Yahoo announced its Content Acquisition Program on March 2, press coverage zeroed in on its controversial paid inclusion program, whereby customers can pony up in exchange for enhanced search coverage and a vaunted "trusted feed" status. But lost amid the inevitable search-wars storyline was another, more intriguing development: the unlocking of the deep Web.
Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.
Today, the deep Web remains invisible except when we engage in a focused transaction: searching a catalog, booking a flight, looking for a job. That's about to change. In addition to Yahoo, outfits like Google and IBM, along with a raft of startups, are developing new approaches for trawling the deep Web. And while their solutions differ, they are all pursuing the same goal: to expand the reach of search engines into our cultural, economic and civic lives.
As new search spiders penetrate the thickets of corporate databases, government documents and scholarly research databanks, they will not only help users retrieve better search results but also siphon transactions away from the organizations that traditionally mediate access to that data. As organizations commingle more of their data with the deep Web search engines, they are entering into a complex bargain, one they may not fully understand.
Case in point: In 1999, the CIA issued a revised edition of "The Chemical and Biological Warfare Threat," a report by Steven Hatfill (the bio-weapons specialist who became briefly embroiled in the 2001 anthrax scare). It's a public document, but you won't find it on Google. To find a copy, you need to know your way around the U.S. Government Printing Office catalog database.
The world's largest publisher, the U.S. federal government generates millions of documents every year: laws, economic forecasts, crop reports, press releases and milk pricing regulations. The government does maintain an ostensible government-wide search portal at FirstGov -- but it performs no better than Google at locating the Hatfill report. Other government branches maintain thousands of other publicly accessible search engines, from the Library of Congress catalog to the U.S. Federal Fish Finder.
"The U.S. Government Printing Office has the mandate of making the documents of the democracy available to everyone for free," says Tim Bray, CTO of Antarctica Systems. "But the poor guys have no control over the upstream data flow that lands in their laps." The result: a sprawling pastiche of databases, unevenly tagged, independently owned and operated, with none of it searchable in a single authoritative place.
If deep Web search engines can penetrate the sprawling mass of government output, they will give the electorate a powerful lens into the public record. And in a world where we can Google our Match.com dates, why shouldn't we expect that kind of visibility into our government?
When former Treasury Secretary Paul O'Neill gave reporter Ron Suskind 19,000 unclassified government files as background for the recently published "Price of Loyalty," Suskind decided to conduct "an experiment in transparency," scanning in some of the documents and posting them to his Web site. If it weren't for the work of Suskind (or at least his intern), Yahoo Search would never find Alan Greenspan's scathing 2002 comments about corporate-governance reform.
The CIA and Dick Cheney notwithstanding, there is no secret government conspiracy to hide public documents from view; it's largely a matter of bureaucratic inertia. Federal information technology organizations may not solve that proble
http://site.com/blah/prog.php/stat/1
instead of
http://site.com/blah/prog.php?stat=1
I use it all the time and it works really well.
Pathman, Free (as in GPL) 3D Pac Man
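For anyone wondering what the path-style URL trick above looks like in practice, here is a minimal Python sketch. It assumes the script receives the extra path segments as alternating key/value pairs; the function names `query_to_path` and `path_info_to_params` are made up for illustration, and values containing a slash would need extra escaping that this sketch does not attempt.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl


def query_to_path(url: str) -> str:
    """Rewrite ?key=value pairs as /key/value path segments."""
    parts = urlsplit(url)
    segments = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        segments.extend([key, value])
    path = parts.path.rstrip("/")
    if segments:
        path += "/" + "/".join(segments)
    # The query string is dropped because it now lives in the path.
    return urlunsplit((parts.scheme, parts.netloc, path, "", parts.fragment))


def path_info_to_params(path_info: str) -> dict:
    """Turn '/stat/1' back into {'stat': '1'}, as the script would read it."""
    pieces = [p for p in path_info.split("/") if p]
    # Even positions are keys, odd positions are values.
    return dict(zip(pieces[0::2], pieces[1::2]))


print(query_to_path("http://site.com/blah/prog.php?stat=1"))
print(path_info_to_params("/stat/1"))
```

The point is only that the crawler sees a plain-looking path with no question mark, while the script can still recover the same parameters.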
I don't think most posters understand the issue. Most websites are now run out of content management systems, and search engines just trawl the web storing current pages. This is fine for a static internet, but pages now change on a minute-by-minute basis. Take a news site that pulls out the latest headlines: all you're going to have indexed in Google is what's on the page today.
Now say I was looking for info from a few weeks ago - Google is not necessarily the best way of finding this info. It's all still sitting there in the database, but it's not on the site's front page. archive.org may have a copy of it, but it would be much better to have google.com talk XML in a standard method to the news site's content management system, and have ALL the data there for a search.
It could actually be useful content.
/?yada=yada links has problems, namely the possibility of getting stuck in an infinite loop where data and links are tracked using sessions, and an infinite number of URLs could potentially yield valid, although very similar, results.
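One common way a crawler can reduce the session-ID loop problem described above is to canonicalize each URL before checking whether it has been seen. This is a rough Python sketch, not how any particular search engine does it. The parameter names in `SESSION_PARAMS` are guesses at typical session keys, and `canonicalize` and `should_visit` are hypothetical helper names.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed names of session-tracking parameters; real sites vary widely.
SESSION_PARAMS = {"sid", "phpsessid", "sessionid", "jsessionid"}


def canonicalize(url: str) -> str:
    """Drop session-like parameters and sort the rest into a stable order."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def should_visit(url: str, seen: set) -> bool:
    """Return True only the first time a canonical form of the URL appears."""
    key = canonicalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True


seen = set()
print(should_visit("http://domain/index.php?sid=abc&act=showpost&postid=1244", seen))
print(should_visit("http://domain/index.php?postid=1244&act=showpost&sid=xyz", seen))
```

This only catches URLs that differ by session parameter or parameter order. Pages that are genuinely distinct but nearly identical would still need something else, such as a per-site page cap.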
Let me give you an example. I run a forum. The main index page doesn't contain much information, just an overview of the latest posts and a brief introduction.
The rest of the content is what people submit. Here is the problem. The pages are generated dynamically. They end up having URLs like http://domain/index.php?act=showpost&postid=1244
Google sees index.php as one page, and does not attempt to submit any data via GET/POST. This means that effectively the most valuable content is missed.
Of course making it crawl
If I'm not mistaken, the original reason for robots.txt was to prevent endless loops from confusing spiders, not to "cover" some information that would otherwise be easily accessible. Of course, others use it for other things now...
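As an illustration of using robots.txt to fence off an endless-loop area, here is a small sketch with Python's standard `urllib.robotparser`. The robots.txt contents, the `/calendar/` path and the `ExampleBot` user agent are invented for the example, and `allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: keep every crawler out of an auto-generated calendar
# that links to "next month" forever.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "ExampleBot", "http://site.com/calendar/2004/03/"))
print(allowed(ROBOTS_TXT, "ExampleBot", "http://site.com/index.php"))
```

A crawler that ignores the file, as the deep-web crawlers reportedly try to do, would simply never make this check.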
Exactly. The article mentions things like flight schedules and classified ads. Those sorts of rapidly and constantly changing information sources need a completely different system to search them effectively. Fortunately, such systems have already been invented. Orbitz, Cheap Tickets, and Expedia are a few of many that handle flight schedules. Any website for a local newspaper probably does a decent job with classified ads.
If I want to find cheap airline tickets, I put "airline tickets" into Google, and it'll give me a list of websites that are designed to help me find airline tickets. It doesn't try to find the actual flights for me, and that's OK.
This deep web browser idea is going to end up being a feature-bloated search engine that does lots of things, but does them all poorly, and does nothing particularly well.
One time I threw a brick at a duck.
I agree that the search engines do not index dynamically generated pages very well. This page on my site, http://www.dealsites.net/index.php?module=MyHeadlines&func=view&myh=menu&gid=22&pid=2&eid=504&tid=300&context= , hasn't seemed to attract any of the search engines yet. I'm not sure why: the data changes hourly and I have a direct link to that page on my site.
However, when search engines do start doing deep crawls, especially if they do POSTs and GETs, then the bandwidth of the web site will go up tremendously. While it is important to get crawled, what happens when your site uses more bandwidth for search engines than users? Also, what would prevent other companies from developing their own search engines? Then you might have 20 or more search engines doing deep crawls every month. Many websites are operated on low-cost, low-bandwidth hosting plans.
Yeah, I can see that Google sometimes lists pages with GET content in its index. It doesn't want to do it for a lot of pages though, and I haven't figured out why. There seems to be nothing different in the HTML.
One word: backlinks. Pages, even with request parameters, that get linked to from lots of popular (high-pagerank) sites get indexed.
Go somewhere random
I can't speak for everyone, but here we check not only a spider's User Agent string, but also whether the request is coming from Google's IP range or elsewhere. So your results may not be so great.
Then again, I've defeated many registration (er, pr0n) gateways by just setting a Referer header identical to the URL I'm requesting, so some defenses are better than others...
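A server-side check of the kind described above, user agent plus source IP range, could look roughly like this in Python. The network below is a placeholder from the documentation address range, not a real crawler range, and `is_genuine_crawler` is a hypothetical name. A real setup would have to maintain the actual published ranges.

```python
import ipaddress

# Placeholder network (reserved documentation range), NOT a real crawler range.
TRUSTED_CRAWLER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_genuine_crawler(user_agent: str, remote_ip: str, token: str = "Googlebot") -> bool:
    """Accept a request only if both the UA string and the source IP match."""
    if token.lower() not in user_agent.lower():
        return False
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in network for network in TRUSTED_CRAWLER_NETWORKS)


ua = "Mozilla/5.0 (compatible; Googlebot/2.1)"
print(is_genuine_crawler(ua, "192.0.2.15"))    # spoofable UA, trusted IP
print(is_genuine_crawler(ua, "198.51.100.7"))  # same UA from elsewhere
```

Faking the User-Agent header alone fails the second test, which is why spoofing a spider's identity may not get you far on sites that check both.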