Google Crawls The Deep Web


Posted by Zonk on Wednesday April 16, 2008 @09:13AM from the delved-too-deeply dept.

mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
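
The form-filling approach quoted in the summary can be roughly sketched in Python. This is only an illustration under stated assumptions, not Google's actual implementation: `GetFormParser`, `candidate_urls`, and the `words` argument are hypothetical names, only select menus and text boxes are handled, and the sketch skips POST forms entirely.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class GetFormParser(HTMLParser):
    """Collect select options and text-box names from the GET forms on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []      # one dict per GET form found
        self._form = None    # GET form currently being parsed, if any
        self._select = None  # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._form = None
            # A missing method attribute defaults to GET in HTML.
            if (a.get("method") or "get").lower() == "get":
                self._form = {"action": a.get("action") or "", "selects": {}, "texts": []}
                self.forms.append(self._form)
            return
        if self._form is None:
            return  # outside any form, or inside a POST form that we skip
        if tag == "select" and a.get("name"):
            self._select = a["name"]
            self._form["selects"][self._select] = []
        elif tag == "option" and self._select and a.get("value"):
            self._form["selects"][self._select].append(a["value"])
        elif tag == "input" and a.get("type", "text") == "text" and a.get("name"):
            self._form["texts"].append(a["name"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form":
            self._form = None


def candidate_urls(page_url, html, words):
    """Build one GET URL per combination of select values and candidate words."""
    parser = GetFormParser()
    parser.feed(html)
    urls = []
    for form in parser.forms:
        names = list(form["selects"]) + form["texts"]
        if not names:
            continue  # nothing to fill in
        # Select menus use the values found in the HTML; text boxes use `words`.
        choices = list(form["selects"].values()) + [words] * len(form["texts"])
        for combo in product(*choices):
            query = urlencode(list(zip(names, combo)))
            urls.append(urljoin(page_url, form["action"]) + "?" + query)
    return urls
```

In this sketch `words` stands in for the "words from the site" mentioned in the quote. The number of URLs grows multiplicatively with the number of fields, so a real crawler would presumably need to sample rather than enumerate.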


Re:Forums? by brunascle · 2008-04-16 09:26 · Score: 2, Informative

as TFA states, it's only GET requests, not POSTs. so it would mostly be search queries.
Re:Just think! by Anonymous Coward · 2008-04-16 09:44 · Score: 1, Informative

I had a search not for "allinurl:select from where" but for "allinurl: delete from" ... throws up a bunch of phpBBAdmin pages with "Do you really want to do this" and "Yes" and "No" buttons .... which one will Google click :)
directions like 'nofollow' are still respected by frovingslosh · 2008-04-16 09:53 · Score: 5, Informative

"Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
Maybe they shouldn't be, at least not in all cases. Several years back I had done many Google searches for some information that was very important to me, but never could find anything. Then a few months later (too late to be of use), pretty much by a fortunate combination of factors but with no help from Google, I came across the exact information, on a .GOV website in a publicly filed IPO document. As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record. When these nofollow directives are overused by mindless and unaccountable bureaucrats, perhaps someone needs to make the decision that these records should be public and that this isn't best served by hiding them deep down a long list of links where they are hard to locate. In cases like this I would applaud any search engine that ignores the "suggestion" not to index public pages just because of an inappropriate tag in the HTML. In fact, if I knew of any search engine that was indexing in spite of this tag, I would switch to them as my first choice search engine in an instant. For starters, I would suggest that any .GOV and any State TLD website should have this tag ignored unless there were a darn good reason to do otherwise.

--
I'm an American. I love this country and the freedoms that we used to have.
Re:Oops... by Bogtha · 2008-04-16 09:57 · Score: 4, Informative

This won't post forms of that sort. In the blog post, they say that they are only doing this for GET forms, which are safe to automate as per the HTTP specification.

This is for things like product catalogue searches where you pick criteria from drop-down boxes. Not so common for run-of-the-mill e-commerce sites, but I've seen a lot on B2B sites.

--
Bogtha Bogtha Bogtha
Re:Just think! by Lillesvin · 2008-04-16 10:05 · Score: 3, Informative

... maybe a borked machine?

Yeah, maybe your machine... That SQL-error looks more like bad session handling on the server hosting your Drupal installation than Google trying to do an SQL-injection... Actually, it looks nothing like an SQL-injection at all. MySQL is merely being asked to insert a duplicate value in a column specified as unique (`sid`), which it refuses because it's not unique. Don't expect an answer, since it's most likely not an error on Google's end.
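
The duplicate-key situation described above can be reproduced with a minimal Python sketch. It rests on assumptions: SQLite stands in for MySQL, and the `sessions` table layout and the `make_db` and `insert_session` helpers are hypothetical, not Drupal's actual schema or code. Only the unique `sid` column comes from the error being discussed.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a unique session-id column."""
    conn = sqlite3.connect(":memory:")
    # `sid` is declared UNIQUE, like the column named in the error message.
    conn.execute("CREATE TABLE sessions (sid TEXT UNIQUE, uid INTEGER)")
    return conn


def insert_session(conn, sid, uid):
    """Return True if the row was inserted, False if `sid` already exists."""
    try:
        conn.execute("INSERT INTO sessions (sid, uid) VALUES (?, ?)", (sid, uid))
        return True
    except sqlite3.IntegrityError:
        # The database refuses the second row with the same sid; no injection involved.
        return False
```

Inserting the same `sid` twice makes the second call fail, which is the kind of error a session handler produces when it tries to create a session that already exists.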

A little more on topic though, what exactly is Google looking for there? I mean, what content (of any interest to anyone) is hiding behind forms? Many sites that require registration (like NY Times (IIRC) and others) already check if the UserAgent string is that of a Google crawler and let it index if so, in order for people to be able to search e.g. NY Times articles on Google but only read them if they register (or change their UserAgent string or use BugMeNot).

And how does Google make sure they don't end up accidentally editing a crapload of wikis by filling out random forms on random sites and hitting submit?

--
"Live free or don't."
Re:What about register forms? by stephanruby · 2008-04-16 10:59 · Score: 2, Informative

"Does that mean I'll have to introduce methods that waste people's time in order to prevent google from registering on my site multiple times?"
Yes, if you require all your human visitors to read your robots.txt, and then require them to check a checkbox to mean that they clearly read and understood the entire body of your robots.txt. Then yes, you'll have to introduce some sort of almost impossible-to-read translucent captcha written in classical Chinese.
Re:Forums, and "web 2.0" sites. by Z80xxc! · 2008-04-16 11:36 · Score: 2, Informative

Seems to me it would be easy enough to detect the googlebot user agent, then if so, automatically redirect it to the page on the other end (or even send it to a random 404 page or something), all without processing the form data at all.
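<?php
// Match on the "Googlebot" substring; the full user agent string varies.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (strpos($ua, 'Googlebot') !== false) {
    header('Location: /landing_page.php');
    exit; // stop here so the form data is never processed
} else {
    processtheform();
}
?>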
Of course, this would have to be implemented, which would be a PITA, but it seems to me that it would work just fine.
Re:Just think! by dartarrow · 2008-04-16 16:01 · Score: 3, Informative

I think you mean this: http://thedailywtf.com/Articles/The_Spider_of_Doom.aspx

--
I love humanity, it is people I hate
Re:Opt IN by dave420 · 2008-04-16 23:56 · Score: 2, Informative

Of course they could link to a site and make the browser perform a POST. That's trivial. A form and some javascript will do that no problem. They seem to not be doing that because GET forms should be non-destructive, whereas POST forms can be quite destructive.
Re:Forms that create agreements by dave420 · 2008-04-17 00:00 · Score: 2, Informative

That is a POST form, which Google have said they will not mess with.
Re:Just think! by solaraddict · 2008-04-17 19:46 · Score: 2, Informative

At that time it was stated that any link that performs an action, such as delete, should be a post(...) [clears his throat]
And the RFC 2616 opened its mouth and said:
In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.
It must be true, the fRFC confirms it!
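
For site owners, the practical consequence of that RFC passage can be sketched in Python: a handler that refuses to change state on anything but POST, so that a crawler following GET links cannot delete anything. `handle_delete`, the `store` set, and the returned `(status, headers)` pair are hypothetical names for illustration, not any particular framework's API.

```python
def handle_delete(method, store, item_id):
    """Delete `item_id` from `store` only when the request method is POST.

    Returns a (status_code, headers) pair.
    """
    if method.upper() != "POST":
        # GET and HEAD are meant to be safe, so they must not delete anything.
        return 405, {"Allow": "POST"}
    store.discard(item_id)
    return 200, {}
```

With a handler like this, a spider that requests the delete URL with GET gets a 405 response and the data stays put.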