Searching the 'Deep Web'
abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private insitutions?"
- Complete Planet
- The Invisible Web Directory
- Librarian's Index to the Internet
- INFOMINE
Anybody use any of these sites? Are they any good? Just wondering why this is getting to be news if sites like these already exist.Are you Corn Fed?
http://site.com/blah/prog.php/stat/1
instead of
http://site.com/blah/prog.php?stat=1
I use it all the time and it works really well.
Pathman, Free (as in GPL) 3D Pac Man
I dont think most posters understand the issue - most websites are now run out of content management systems, and search engines just trawl the web storing current pages. This is fine in a static internet, but with pages changing on a minute by minute basis; for example a new site that pulls out the latest headlines - all you're going to have indexed in Google is what's on the page today.
Now say I was looking for info from a few weeks ago - Google is not necessarily the best way of finding this info. It's all still sitting there in the database, but it's not on the site's front page. archive.org may have a copy of it, but it would be much better to have google.com talk XML in a standard method to the news site's content management system, and have ALL the data there for a search.
It could actually be useful content.
/?yada=yada links has problems, namely the possibilty of getting stuck in an infinite loop where data and links are tracked using sessions, and an infinite number of URLs could potentially yeild valid, although very similar results.
Let me give you an example. I run a forum. The main index page doesn't contain much information, just an overview of the latest posts and a brief introduction.
The rest of the content is what people submit. Here is the problem. The pages are generated dynamically. They end up having url's like http://domain/index.php?act=showpost&postid=12 44
Google sees index.php as one page, and does not attempt to submit any data via get/post. This means that effectively the most valuable content is missed.
Of course making it crawl