Searching the 'Deep Web'

Posted by ryuzaki0 on Tuesday March 9, 2004 @01:50AM from the sounds-more-like-the-deep-hurting dept.

abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"

Top 4 by UncleBiggims · 2004-03-09 02:02 · Score: 5, Informative
About.com lists the top 4 places to search the deep web as:
Anybody use any of these sites? Are they any good? Just wondering why this is getting to be news if sites like these already exist.

Are you Corn Fed?
Re:PHP? by Xner · 2004-03-09 02:17 · Score: 4, Informative

I'm not exactly sure what you mean. If it is accessible by clicking on links, most search engines should be able to index it. If you want to be extra-friendly you can use $PATH_INFO to make dynamic pages look more like static ones, e.g.:
http://site.com/blah/prog.php/stat/1
instead of
http://site.com/blah/prog.php?stat=1
I use it all the time and it works really well.
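
For illustration, here is a minimal Python sketch of the idea behind the $PATH_INFO trick: everything after the script name is treated as alternating key/value segments. The function names and the ".php" suffix check are assumptions made for this example, not part of PHP or of any particular site's code.

```python
from urllib.parse import urlsplit


def split_script_and_path_info(url, script_suffix=".php"):
    """Split a URL path into (script path, trailing PATH_INFO-style part)."""
    path = urlsplit(url).path
    idx = path.find(script_suffix)
    if idx == -1:
        # No script found in the path: nothing to treat as path info.
        return path, ""
    end = idx + len(script_suffix)
    return path[:end], path[end:]


def path_info_to_params(path_info):
    """Read '/stat/1/page/2' as alternating keys and values."""
    segments = [s for s in path_info.split("/") if s]
    # A trailing key with no value is silently dropped by zip().
    return dict(zip(segments[0::2], segments[1::2]))


script, info = split_script_and_path_info("http://site.com/blah/prog.php/stat/1")
params = path_info_to_params(info)
```

With the URL from the post, `script` is "/blah/prog.php" and `params` carries the same stat=1 pair that the query-string form would.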

--
Pathman, Free (as in GPL) 3D Pac Man
True nature of the deep database problem by andygrace · 2004-03-09 02:19 · Score: 5, Informative

I don't think most posters understand the issue. Most websites are now run out of content management systems, and search engines just trawl the web storing current pages. This is fine for a static internet, but pages now change on a minute-by-minute basis. Take a news site that pulls out the latest headlines: all you're going to have indexed in Google is what's on the page today.

Now say I was looking for info from a few weeks ago - Google is not necessarily the best way of finding this info. It's all still sitting there in the database, but it's not on the site's front page. archive.org may have a copy of it, but it would be much better to have google.com talk XML in a standard method to the news site's content management system, and have ALL the data there for a search.
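
As a rough sketch of what such an exchange could look like, the Python below has a content management system dump its whole article archive as XML for a crawler to read. The element names, the `export_articles` function and the sample records are all hypothetical. No standard format is implied here, and no search engine is known to consume this one.

```python
import xml.etree.ElementTree as ET


def export_articles(articles):
    """Serialize every stored article (not just the front page) as XML."""
    root = ET.Element("articles")
    for article in articles:
        item = ET.SubElement(root, "article", id=str(article["id"]))
        ET.SubElement(item, "title").text = article["title"]
        ET.SubElement(item, "published").text = article["published"]
        ET.SubElement(item, "url").text = article["url"]
    return ET.tostring(root, encoding="unicode")


# Made-up records standing in for rows in the site's database.
archive = [
    {"id": 1, "title": "Old headline", "published": "2004-02-17",
     "url": "http://news.example/story/1"},
    {"id": 2, "title": "Today's headline", "published": "2004-03-09",
     "url": "http://news.example/story/2"},
]
xml_feed = export_articles(archive)
```

The point of the sketch is only that the weeks-old story appears in the export alongside today's, even though it is no longer linked from the front page.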
Re:With the 10% that is crawled by Zone-MR · 2004-03-09 02:27 · Score: 4, Informative

It could actually be useful content.

Let me give you an example. I run a forum. The main index page doesn't contain much information, just an overview of the latest posts and a brief introduction.

The rest of the content is what people submit. Here is the problem: the pages are generated dynamically. They end up having URLs like http://domain/index.php?act=showpost&postid=1244

Google sees index.php as one page, and does not attempt to submit any data via GET/POST. This means that effectively the most valuable content is missed.

Of course, making it crawl /?yada=yada links has problems, namely the possibility of getting stuck in an infinite loop where data and links are tracked using sessions, and an infinite number of URLs could potentially yield valid, although very similar, results.
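
One common way a crawler can guard against that loop is to normalize each URL before deciding whether it has been seen, and to cap the number of pages per site. The Python below is a minimal sketch of that idea, not a description of how Google or any other engine works. The session parameter names and the `max_pages` limit are assumptions chosen for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names for session-tracking parameters; real sites vary.
SESSION_PARAMS = {"sid", "phpsessid", "sessionid"}


def canonicalize(url):
    """Drop session parameters and sort the rest so equivalent URLs match."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    query.sort()
    # The fragment is discarded: it never reaches the server.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(query), "")
    )


def should_visit(url, seen, max_pages=1000):
    """Return True only for URLs not yet seen, up to a per-site cap."""
    key = canonicalize(url)
    if key in seen or len(seen) >= max_pages:
        return False
    seen.add(key)
    return True


seen = set()
first = should_visit("http://domain/index.php?act=showpost&postid=1244&sid=abc", seen)
second = should_visit("http://domain/index.php?postid=1244&act=showpost&sid=xyz", seen)
```

Here `first` is True and `second` is False, because both URLs reduce to the same post once the session ID is stripped. The cap is a blunt backstop for near-duplicate pages that normalization cannot detect.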