Google Crawls The Deep Web

Posted by Zonk on Wednesday April 16, 2008 @09:13AM from the delved-too-deeply dept.

mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directives like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
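
For readers wondering what honoring 'nofollow' and 'noindex' might look like from the crawler's side, here is a minimal sketch, assuming the directives arrive in a page's robots meta tag. The summary does not say how Google actually checks them; `RobotsMetaParser` and `robots_directives` are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None (bare attributes), so fall back to "".
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="NOINDEX, nofollow"></head></html>'
directives = robots_directives(page)
# A polite crawler would skip indexing and would not follow links here.
should_index = "noindex" not in directives
should_follow = "nofollow" not in directives
```

A site that wants to stay out of this kind of crawl would serve such a tag on the pages in question.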

23 of 197 comments

Just think! by scubamage · 2008-04-16 09:14 · Score: 5, Funny

Soon, they'll start injecting SQL too to help map databases! Google is so useful indeed! :)
1. Re:Just think! by AKAImBatman · 2008-04-16 09:28 · Score: 3, Funny
  
  Hmm... that reminds me of this DailyWTF. Who knew that Mr. Test User was such a big customer? :-P
  
  --
  Javascript + Nintendo DSi = DSiCade
2. Re:Just think! by Anonymous Coward · 2008-04-16 16:45 · Score: 1, Funny
  
  If you put a "delete this page" button on any page, I would honestly be shocked if Google got to it before some punk-ass kid did...
Oops... by JohnnyDanger · 2008-04-16 09:16 · Score: 5, Funny

They just bought everything on Amazon.
1. Re:Oops... by Firehed · 2008-04-16 10:03 · Score: 2, Funny
  
  HTTP spec be damned - has IE taught you nothing?
  
  --
  How are sites slashdotted when nobody reads TFAs?
Forums? by fishybell · 2008-04-16 09:18 · Score: 5, Funny

Well, I certainly hope that they put in some decent smarts to prevent it from making posts onto forums, blogs, /., etc.

On the plus side, this should enable Google to get by the "Must be 18 to view" buttons ;)

--
><));>
1. Re:Forums? by spintriae · 2008-04-16 11:53 · Score: 3, Funny
  
  Google's only 12 years old. It shouldn't be visiting those sites.
HELLO I AM GOOGLEBOT by Anonymous Coward · 2008-04-16 09:19 · Score: 5, Funny

I am just submitting this form to see what's behind it. PLEASE IGNORE ME.
1. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · 2008-04-16 09:21 · Score: 5, Funny
  
  I am just submitting this form to see what's behind it. PLEASE IGNORE ME.
2. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · 2008-04-16 09:26 · Score: 4, Funny
  
  I am just submitting this form to see what's behind it. PLEASE IGNORE ME.
3. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · 2008-04-16 09:40 · Score: 1, Funny
  
  I am just submitting this form to see what's behind it. PLEASE IGNORE ME.
I'm in your Intarwebs by Mathus · 2008-04-16 09:32 · Score: 2, Funny

Cracking your forms. Sorry, could not help myself.
robots.txt by B3ryllium · 2008-04-16 09:37 · Score: 4, Funny

Okay, so how long until the spec for robots.txt is updated to have a "DontBeStupid" directive?
Note to self... by fahrbot-bot · 2008-04-16 09:38 · Score: 3, Funny

our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML...
...post invoice forms ordering expensive items to be shipped to Google. Be sure to log incoming IP addresses for verification.

--
It must have been something you assimilated. . . .
Re:Will it solve captchas? by skraps · 2008-04-16 09:39 · Score: 5, Funny

Just what we need, some 'bot adding its insightful comments based on other words in the same document.
Are such questions on your mind often?
..then again, on most sites, would you be able to tell the difference between Google posting something and some 1337 kiddiez?!?!!1eleven?
What does that suggest to you?

--
Karma: -2147483648 (Mostly affected by integer overflow)
The Internet is for Porn by kiehlster · 2008-04-16 09:49 · Score: 5, Funny

If you haven't already noticed, AdSense has features now to tell Google how to log into your website so it can catalog your user-only pages. You know what that means. Porn sites are going to start using this so that Googlebot can confirm that its age is over 18. We'll be showered with a gigantic wave of pornographic information. We will soon have to press juvenile charges against a corporate entity because it lied about its age on web forms to gain access to pornography and forum discussions.
Re:Will it solve captchas? by urcreepyneighbor · 2008-04-16 10:05 · Score: 4, Funny

You whore! You told me you loved me, Eliza! You said you'd call!

--
"The fight for freedom has only just begun." - Geert Wilders
Anecdote from Google by arrrrg · 2008-04-16 10:12 · Score: 5, Funny

When I interned at Google, someone told me a funny anecdote about a guy who emailed their tech support insisting that the Google crawler had deleted his web site. At first, I think he was told that "Just because we download a copy of your site, doesn't mean your local copy is gone." (à la obligatory bash.) But, the guy insisted, and finally they double checked and his site was in fact gone. Turns out that it was a home-brewed wiki-style site, and each page had a "delete" button. The only problem was, the "delete" button sent its query via GET, not POST, and so the Google spider happily followed those links one-by-one and deleted the poor guy's entire site. The Google guys were feeling charitable and so they sent him a backup of his site, but told him he wouldn't be so lucky the next time, and he should change any forms that make changes to POSTs -- GETs are only for queries.

So, long story short, I wonder how Google will avoid more of this kind of problem if they're really going off the deep end and submitting random data on random forms on the web. Like the above guy, people may not design their site with such a spider in mind, and despite their lack of foresight this could kill a lot of goodwill if done improperly.
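
The "GETs are only for queries" advice can be sketched roughly like this. It is a toy illustration under stated assumptions, not the actual site from the anecdote: the in-memory `pages` dict and the `handle` function are hypothetical, and a real wiki would sit behind a web framework.

```python
from http import HTTPStatus

# HTTP methods that are meant to be "safe", i.e. never change server state.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE"}

# Hypothetical in-memory wiki standing in for the home-brewed site.
pages = {"Home": "welcome", "About": "about us"}


def handle(method: str, action: str, page: str) -> int:
    """Return an HTTP status code for a request against the toy wiki."""
    if action == "view":
        return HTTPStatus.OK if page in pages else HTTPStatus.NOT_FOUND
    if action == "delete":
        # A spider that just follows links only issues GETs,
        # so a destructive action must refuse them.
        if method.upper() in SAFE_METHODS:
            return HTTPStatus.METHOD_NOT_ALLOWED
        if pages.pop(page, None) is None:
            return HTTPStatus.NOT_FOUND
        return HTTPStatus.NO_CONTENT
    return HTTPStatus.BAD_REQUEST


# A crawler-style GET on the delete link leaves the page alone...
crawler_status = handle("GET", "delete", "Home")
# ...while an explicit POST from the form actually removes it.
form_status = handle("POST", "delete", "About")
```

This only protects against link-following spiders. A crawler that submits POST forms would still need authentication or a confirmation step in front of anything destructive.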
Re:good and bad by martin-boundary · 2008-04-16 12:05 · Score: 2, Funny

Fix your damn site if you're worried about this particular attack.
Nope. I'll just refer them to the DMCA anti circumvention provisions. Let those damn phd kids fix their damn algorithms or get the hell off my damn lawn :)
Eeek, you found me! by CheeseTroll · 2008-04-16 14:08 · Score: 2, Funny

n/t

--
A post a day keeps productivity at bay.
Re:Google, consider this... by Kristoph · 2008-04-16 17:26 · Score: 3, Funny

Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create?

If any forms which feed your DB are GET style, aren't user authenticated and/or don't use a CAPTCHA then you already have a huge trash data problem. At least the googlebot won't offer to enlarge your penis.

]{
Re:Forums, and "web 2.0" sites. by enoz · 2008-04-16 17:35 · Score: 2, Funny

Any forum that can't stop a "good" bot is going to have spam all over it anyway from the "bad" ones... C'mon there's no point in Google launching a war against phpBB, there are more than enough spambots doing that already.
Great! by dsouza42 · 2008-04-17 01:56 · Score: 2, Funny

Now they'll finally be able to index all kinds of Google searches... oh, wait.