Google Crawls The Deep Web

← Back to Stories (view on slashdot.org)

Posted by Zonk on Wednesday April 16, 2008 @09:13AM from the delved-too-deeply dept.

mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search.'"

3 of 197 comments (clear)

Min score:

Reason:

Sort:

Will it solve captchas? by lastninja · 2008-04-16 09:17 · Score: 4, Interesting

only half kidding

--
John Carmack fan, browsing at +5 since 1999.
Re:Just think! by Ariven · 2008-04-16 12:45 · Score: 5, Interesting

I remember an article while back where someone had cut/pasted some articles from one section of their site to another.. and as a result had edit and delete links in the live content instead of on their internal web interface.

And a search engine (I think it was google) crawled the site, hit the delete links and deleted all the pages of the site. At that time it was stated that any link that performs an action, such as delete, should be a post, via form so that search engines wouldnt do that very thing..

And now, they are gonna start submitting forms? the fallout is gonna be entertaining.

--
ariven.com
Re:Just think! by jc42 · 2008-04-16 15:08 · Score: 4, Interesting

I had similar problems a few years ago. The database had a lot of data in a compact format, and I wrote some retrieval pages that would extract the data and run it through any of a list of formatters to give clients the output format they wanted. Very practical. Over time, the list of output formats slowly grew, as did the database. Then one day, the machine was totally bogged down with http requests. It turned out that a search site had figured out how to use my format-conversion form, and had requested all of our data in every format that my code delivered.

Google wasn't too bad, because at least they spread the requests out over time. But other search sites hit our poor server with requests as fast as the Internet would deliver them. I ended up writing code that spotted this pattern of requests, and put the offending searcher on a blacklist. From then on, they only got back pages saying that they were blacklisted, with an email address to write if this was an error. That address never got any mail, and the problem went away.

Since then, I've done periodic scans of the server logs for other bursts of requests that look like an attempt to extract everything in every format. I've had to add a few more gimmicks (kludges) to spot these automatically and blacklist the clients.

I wonder if google's new code will get past my defenses? I've noticed that googlebot addresses are in the "no CGI allowed" portion of my blacklist, though they are allowed to retrieve the basic data. I'll be on the lookout for symptoms of a breakthrough.

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.