Searching the 'Deep Web'


Posted by ryuzaki0 on Tuesday March 9, 2004 @01:50AM from the sounds-more-like-the-deep-hurting dept.

abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"

19 of 193 comments

No... by Anonymous Coward · 2004-03-09 01:56 · Score: 1, Interesting

but it will get us 90% more useless results. The regular search spam on Google is bad enough (it's getting to the level of bad results AltaVista had before Google took over the throne) without this extra noise...
Maybe I'm just missing the point... by robslimo · 2004-03-09 01:56 · Score: 5, Interesting

...but I don't want to see the guts of a web form. If I understand correctly, they're talking about crawling into databases, actually parsing a Microsoft Access file, for instance. I see that as having dubious merit, and potentially pissing off web site owners. Web site designers go to a lot of trouble to provide the interface they want you to see to their data. This would just sidestep the interface and dump you into the data.

At the very least, it might require an overhaul or extension to the robots exclusion specification to keep spiders out of your data.
PHP? by TGK · 2004-03-09 01:57 · Score: 4, Interesting

Since I moved my site over to a PHP-based system, nothing beyond my index page gets a second look from google. As web content moves away from static pages to more dynamic solutions (particularly XML), a more sophisticated crawler is needed, one that can read over this bewildering maelstrom of data and extract from it meaning and content.

While I find it highly unlikely that this system will do well with large databases (or even databases at all for that matter) it is a step in the right direction. Google will probably have their version up on labs inside a month.

--
Killfile(TGK)
No trees were killed in the creation of this post. However, many electrons were inconvenienced.
1. Re:PHP? by DeadSea · 2004-03-09 02:14 · Score: 5, Interesting
  Keep in mind that googlebot comes in two flavors: freshbot and deepbot.
  Freshbot is meant to update the google cache for pages that change frequently. Freshbot may pull really popular, frequently changing pages as often as every couple of hours.
  Deepbot goes out once every month or two and follows links. The higher your pagerank, the deeper into your site it will go. If you want more of your site to get crawled, here are some tips:
  
  Make your pages *look* static (end in .html)
  
  Avoid CGI parameters except for handling form data (no ? in url)
  
  Put all pages in the document root, or in very shallow subdirectories. Google goes after less and less as the directories get deeper.
  
  It is likely that deepbot just hasn't run since you updated your site, so freshbot is just pulling your front page occasionally.
  BTW: I noticed you have a link to my cheat sheet on your links page. Thanks! :-)
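
The first two tips above (static-looking URLs, no query string) amount to a URL-mapping layer in front of the dynamic script. A minimal sketch of that idea, assuming a hypothetical `article.php?id=N` script and a hypothetical `/article-N.html` naming scheme, neither of which comes from the thread:

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Hypothetical scheme: /article-42.html stands in for /article.php?id=42
STATIC_ARTICLE = re.compile(r"^/article-(\d+)\.html$")


def to_static(article_id: int) -> str:
    """Build the static-looking URL that gets published in links."""
    return f"/article-{article_id}.html"


def to_dynamic(path: str) -> Optional[str]:
    """Translate a static-looking path back into the URL the script expects.

    Returns None when the path does not follow the hypothetical scheme.
    """
    match = STATIC_ARTICLE.match(path)
    if match is None:
        return None
    return "/article.php?" + urlencode({"id": match.group(1)})
```

In practice this translation would usually live in the web server's rewrite rules rather than in application code; the sketch only shows the mapping itself.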
Spiders? by Vo0k · 2004-03-09 01:58 · Score: 4, Interesting

...and I wonder about something different.
Has anyone tried this yet? Change your user agent string to one matching the googlebot and crawl the web. I'm pretty sure many "registration only" websites would magically open themselves, but I wonder about other differences too :)
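
For anyone who wants to try it, the experiment is just a request with a different User-Agent header. A minimal sketch using Python's standard library; the user-agent string below is an assumption about what a Googlebot-style crawler sends, and a site that checks the client's IP address rather than the header would not be fooled:

```python
import urllib.request

# Assumption: a Googlebot-style identifier; the exact string real crawlers send may differ.
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def build_request(url: str, user_agent: str = GOOGLEBOT_UA) -> urllib.request.Request:
    """Prepare (but do not send) a request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

Passing the result to `urllib.request.urlopen` sends it; comparing that response with one fetched under a normal browser string would show whether the site treats the two differently.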

--
Anagram("United States of America") == "Dine out, taste a Mac, fries"
1. Re:Spiders? by MyHair · 2004-03-09 03:07 · Score: 2, Interesting
  
  Good question. I haven't tried it yet, but I've run into several sites that Google indexes but the site refuses me entry until I register (which I don't). Some of them are clever enough to put Javascript (or something) in to prevent you from looking at Google's cache of that page. Yeah, I could get around that, but usually by then I figure I don't care what that site has to say.
2. Re:Spiders? by Anonymous Coward · 2004-03-09 08:37 · Score: 1, Interesting
  
  This is intriguing...
  
  Tell us more.
  
  (May I request an example URL?)
Privacy and Crap by jackb_guppy · 2004-03-09 02:00 · Score: 2, Interesting

Going after the other 90% does not mean that new things will come to the top. Oh, there may be a few cool items like "Who really shot JFK" or the launch codes for a Trident.

But in reality the other 90% is most likely best left unfound. Who really wants to learn that their parents were not married in the manner they claimed?

Just as in archaeology, you will find a nice vase or two... but the rest is rubble.

You understand that a garbage dump is the best place to dig for things in archaeology, because people cleaned their houses back then too. That is what the other 90% is... a dump of information.
What color hat will the robots have? by w3weasel · 2004-03-09 02:15 · Score: 1, Interesting

Search my database????

How the fsck is the bot gonna have the DBI string to interface my DB without knowing the name of the DB, the name of the account that created the DB or the user account on the DB with correct permissions to read the info??????????????

Hmmm... sounds like marketing hype.

--
Just as irrigation is the lifeblood of the Southwest, lifeblood is the soup of cannibals. -- Jack Handy
Bad kitty! by Underholdning · 2004-03-09 02:17 · Score: 4, Interesting

There's a perfectly good reason why a webcrawler doesn't (and shouldn't) crawl the backend databases. I have customers with items and prices in their database. They update that on a daily basis. I have customers that provide directory solutions. We update that information on a daily basis. Now, imagine the turmoil that will arise when people find outdated items using their favorite search engine, which crawls the database once in a blue moon. Nuff said. Bad idea.

--
Underholdning.info
Re:Deep Web? by Anonymous Coward · 2004-03-09 02:18 · Score: 2, Interesting

User-agent: *
Disallow: /s3kr3t/

trawler: "Hey cool, thx for the tip I never would have thought to try /s3kr3t/"
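
The joke works because robots.txt is advisory: a polite crawler checks it before fetching, while anyone can read the same file as a list of interesting paths. A small sketch with Python's `urllib.robotparser`; the `ExampleBot` agent name and the `disallowed_paths` helper are made up for illustration:

```python
from typing import List
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /s3kr3t/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """What a well-behaved crawler asks before fetching a URL."""
    return parser.can_fetch(agent, url)


def disallowed_paths(robots_txt: str) -> List[str]:
    """What an impolite reader learns: every path the file asks crawlers to skip."""
    return [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("disallow:")
    ]
```

Nothing in the file enforces anything; real protection for /s3kr3t/ has to come from authentication on the server.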
Funny by BenBenBen · 2004-03-09 02:21 · Score: 4, Interesting

Google's always been good enough for me.

--
The Slashdot Paradox: "100% Overrated"
Re:With the 10% that is crawled by Turing+Machine · 2004-03-09 02:54 · Score: 2, Interesting

http://domain/index.php?act=showpost&postid=1244

Google sees index.php as one page, and does not attempt to submit any data via get/post.

Hmm... I see plenty of pages in Google that have URLs with GET parameters, so there must be some way of getting it to crawl them. Or am I misunderstanding what you're saying? Maybe the key here is to provide an alternate route to those pages without doing anything fancy (drop-down menus, radio buttons, etc.). Just generate another page that contains a regular link to all your pages. You could hide that page from your regular users by, say, linking it to a 1x1 pixel transparent GIF. A robot will find it, but most of your users won't even notice.
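
The "page that contains a regular link to all your pages" can be generated mechanically. A rough sketch, assuming the list of URLs is already available (for a forum it would presumably come from the post table); the function name is hypothetical:

```python
from html import escape
from typing import Iterable


def build_link_index(urls: Iterable[str]) -> str:
    """Return a bare HTML page with one plain <a href> link per URL."""
    items = "\n".join(
        f'  <li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in urls
    )
    return f"<html><body>\n<ul>\n{items}\n</ul>\n</body></html>"
```

The ampersands in query strings are escaped so the links stay valid HTML; a crawler that follows ordinary links needs nothing else.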
Re:With the 10% that is crawled by Zone-MR · 2004-03-09 02:59 · Score: 2, Interesting

Hmm... I see plenty of pages in Google that have URLs with GET parameters, so there must be some way of getting it to crawl them. Or am I misunderstanding what you're saying? Maybe the key here is to provide an alternate route to those pages without doing anything fancy (drop-down menus, radio buttons, etc.). Just generate another page that contains a regular link to all your pages. You could hide that page from your regular users by, say, linking it to a 1x1 pixel transparent GIF. A robot will find it, but most of your users won't even notice.

Yeah, I can see that google sometimes lists pages with GET content in its index. It doesn't want to do it for a lot of pages though, and I haven't figured out why. There seems to be nothing different in the HTML.

Hypothetically speaking, what's there to stop someone from doing:

<?php
// rand() is concatenated in; "${rand()}" inside a string would not call the function.
print("<a href='thispage.php/" . rand() . "'>Some page...</a>");
?>
... and looping google?
How?? by Haydn+Fenton · 2004-03-09 03:01 · Score: 3, Interesting

I think I have a pretty good understanding of how google works.

People submit their site, google goes to their site and visits every link it can find on the main page, then every link it finds on those other pages etc. So that pretty much the whole site is included.

This obviously means pages which are not linked do not get included in google's search, so I'm not surprised at the fact that less than 1% is ever crawled.
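
The link-following behaviour described above is a breadth-first traversal. A toy sketch over an in-memory link graph (the site layout is invented for illustration) shows why a page nobody links to never turns up:

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/about", "/photos"],
    "/about": ["/"],
    "/photos": ["/photos/2004"],
    "/photos/2004": [],
    "/orphan": ["/"],  # exists on the server, but nothing links to it
}


def crawl(start, links):
    """Follow links breadth-first from start and return every page reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen
```

Starting from "/", the crawl reaches four pages and never sees "/orphan", even though that page links back to the front page.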

So how does this new method of crawling work? How can it possibly know what files are on the server if they are not linked in any way. The only way I can think of is a brute-force type method, which seems extremely stupid to me, since that would consume so much of the search engine's resources.

This also brings me onto the next point. Like a few people have mentioned, there are certain pages on the web which append strings onto the end or before the beginning of the URL, for example yourname.ismyfriend.com or www.somegamesite.com/attack.php?player=bob&attacks=5, so how many times would the crawler decide was enough before moving on to the next link?
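
One plausible answer is a fixed budget per host: the crawler stops fetching from a site once it has taken a set number of pages, however many new URLs the site keeps producing. The sketch below is an assumption about how such a limit could work, not a description of any real search engine; `endless_site` is a stand-in for a page that always links to one more URL:

```python
from collections import Counter, deque
from urllib.parse import urlsplit


def crawl_with_budget(start, fetch_links, max_pages_per_host=5):
    """Breadth-first crawl that gives up on a host after max_pages_per_host pages."""
    visited = []
    per_host = Counter()
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        host = urlsplit(url).netloc
        if per_host[host] >= max_pages_per_host:
            continue  # budget for this host is spent; skip without fetching
        per_host[host] += 1
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


def endless_site(url):
    """Hypothetical trap: every page links to the same script with the counter bumped by one."""
    base, _, count = url.rpartition("=")
    return [f"{base}={int(count) + 1}"]
```

With the default budget, a crawl starting at attacks=5 fetches five pages and then moves on instead of looping forever.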

Also, since most of the internet is porn, and this newfound technology will reveal another 90 percent or so of the internet, are we suddenly going to be showered with explicit sites?
1. Re:How?? by MImeKillEr · 2004-03-09 03:41 · Score: 4, Interesting
  
  People submit their site, google goes to their site and visits every link it can find on the main page, then every link it finds on those other pages etc. So that pretty much the whole site is included.
  
  Google doesn't just search pages submitted - I've got an Apache webserver running at home, doling out pages for family photos and stats for a local UT2K3 server. I hadn't enabled robots.txt to stop search engines from crawling it (didn't think I needed to) and one day entered my URL in google, only to find it.
  
  I've never submitted the URL to google.
  
  Should we assume that Google's already crawled a majority of the sites out there?
  
  BTW, Yahoo has no record of my site in their database.
  
  --
  Cruising the internet on my TI-99/4A @ a whopping 300 baud!
Re:From the article by Professeur+Shadoko · 2004-03-09 03:55 · Score: 2, Interesting

Right.
But if you are interested in a specific subject..
Let's say you have a technical problem.
Chances are somewhere on the planet someone submitted the same problem on a web-based forum.

Now you want google to give you THAT specific message.
You don't want google to tell you "hmmm... I guess the solution must be in one of those zillions of forums here, here, and here".
another form of DOS by ramar · 2004-03-09 04:03 · Score: 2, Interesting

If the boys with fat pipes start indexing "deeper" into sites, I think we're going to see a lot of sites going offline until they've been refactored to handle this sort of thing.

The frontend webservers that serve the static pages are fine (they're already being spidered now), but the dynamic content, largely dependent on databases and such, very likely wasn't built to handle this sort of load. Once the new engines get their hooks into these pieces, they're going to be in trouble.
On a related note... by cr0sh · 2004-03-09 06:05 · Score: 4, Interesting

What about the "invisible web"?
The so-called invisible web is indirectly related to the "deep web", with the exception that most of it isn't connected at all to the main web. Slashdot has had some articles regarding these hidden segments of the web - but has any progress been made on finding these "lost networks"?
Current theory on networks explains how and why these networks form and separate from the main web of connections, mainly due to loss of one of the tenuous threads from a supernode to the outlier nodes. When this loss occurs (an intermediary site goes offline, or popularity wanes, or a large meganode dies or stagnates), the network fragments - and getting back to the pages/sites within is nearly impossible, unless you already have a link to the inside, or a friend provides it to you.
Now, it is a good thing that this phenomenon exists - it seems to exist in all robust, evolving networks - whether those networks be electronically connected, socially connected (i.e., Friendster, Orkut, or plain-ole social groupings), or bio/chemo connected (i.e., the brain, the body, etc).
Even so, I wonder at all the information out there which I *can't* access, because it isn't indexed in some way. Sometimes you come across fragments and echoes in other archives (news, mail, irc) that lead to these far-off and displaced "locations" - but it is rare, and tedious to do unless you are looking for badly needed information.
So I ask again, has anything been done to further the "searching" within/for the "invisible web"?
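
The fragmentation described above can be illustrated with a toy link graph: remove the one intermediary that bridges two clusters and the far cluster becomes unreachable from the main web, although its pages still link to each other. All node names here are invented:

```python
def reachable(start, links, removed=frozenset()):
    """Return every node reachable from start by following links, skipping removed nodes."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen or node in removed:
            continue
        seen.add(node)
        stack.extend(links.get(node, []))
    return seen


# Hypothetical web: "hub" is the only thread connecting the portal to the ring sites.
WEB = {
    "portal": ["hub", "news"],
    "news": ["portal"],
    "hub": ["ring-a"],
    "ring-a": ["ring-b"],
    "ring-b": ["ring-a"],
}
```

With "hub" offline, a crawl from "portal" sees only "portal" and "news"; the ring is still intact, but only for someone who already holds a link into it.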

--
Reason is the Path to God - Anon