Google Crawls The Deep Web
mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search.'"
Soon, they'll start injecting SQL too to help map databases! Google is so useful indeed! :)
Several years ago, I tried a demo of Bright Planet's Deep Query Manager that would essentially do these searches through a client on your machine in batch-like jobs. Oh, the bandwidth and resources you'll hog!
...
Their stats on how much of the web they hit that Google missed was always impressive (true or not) but perhaps their days are numbered with this new venture by Google.
Quite an interesting concept if you think about it. I always presupposed that companies would hate it but never got 'blocked' from doing it to sites.
Here, suck up my bandwidth without generating ad revenue! Sounds like a lose situation for the data provider in my mind
My work here is dung.
They just bought everything on Amazon.
only half kidding
John Carmack fan, browsing at +5 since 1999.
On the plus side, this should enable Google to get by the "Must be 18 to view" buttons
><));>
I am just submitting this form to see what's behind it. PLEASE IGNORE ME.
Well first of all, it's about time they learn how to read advanced sites! If your site is dependent on input from the user to display content, you're basically invisible to google. Now all they need is something to read text in flash files and they've got something going. But on the other hand, this is almost auto-fuzzing which could be considered hacking and I bet they'll often get results they didn't intend to and expose data that's supposed to be protected and private.
Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
Cracking your forms. Sorry, could not help myself.
Okay, so how long until the spec for robots.txt is updated to have a "DontBeStupid" directive?
It must have been something you assimilated. . . .
If you haven't already noticed, AdSense has features now to tell Google how to log into your website so it can catalog your user-only pages. You know what that means. Porn sites are going to start using this so that Googlebot can confirm that it's age is over 18. We'll be showered with a gigantic wave of pornographic information. We will soon have to press juvenile charges against a corporate entity because it lied about its age on web forms to gain access to pornography and forum discussions.
Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create? I speak mainly of feedback forms, e-mail signups, and the like.
If your site uses GET for a non-idempotent action like sending a feedback form or signing up for an email newsletter, you're doing it Wrong.
Maybe they shouldn't be, at least not in all cases. Several years back I had done many Google searches for some information that was very important to me, but never could find anything. Then a few months later (too late to be of use), pretty much by a fortunate combination of factors but with no help from Google, I came across the exact information, on a .GOV website in a publicly filed IPO document. As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record. When these nofollow directives are over used by mindless and unaccountable bureaucrats, perhaps someone needs to make the decision that these records should be public and that isn't best served by hiding them deep down a long list of links where they are hard to locate. In cases like this I would applaud any search engine that ignores the "suggestion" not to index public pages just because of an inappropriate tag in the HTML. In fact, if I knew of any search engine that was indexing in spite of this tag, I would switch to them as my first choice search engine in an instant. For starters, I would suggest that any .GOV and any State TLD website should have this tag ignored unless there were darn good reason to do otherwise.
I'm an American. I love this country and the freedoms that we used to have.
Sweet, now Google will be Fuzzing the entire web.
How will this work for forms that perform translations, validations and similar kinds of operations on other websites? Try to pull the entire internet through each such site it finds?
And then not every web development environment forces GET to not change data. In Ruby on Rails, adding "?methond=post" to the end of a url fakes a post, even though it is actually a GET, which I disabled in the company I work for. Not everyone is going to do that.
If I have nothing to hide, don't search me
When I interned at Google, someone told me a funny anecdote about a guy who emailed their tech support insisting that the Google crawler had deleted his web site. At first, I think he was told that "Just because we download a copy of your site, doesn't mean your local copy is gone." (a'la obligatory bash.) But, the guy insisted, and finally they double checked and his site was in fact gone. Turns out that it was a home-brewed wiki-style site, and each page had a "delete" button. The only problem was, the "delete" button sent its query via GET, not POST, and so the Google spider happily followed those links one-by-one and deleted the poor guy's entire site. The Google guys were feeling charitable and so they sent him a backup of his site, but told him he wouldn't be so lucky the next time, and he should change any forms that make changes to POSTs -- GETs are only for queries.
So, long story short, I wonder how Google will avoid more of this kind of problem if they're really going off the deep end and submitting random data on random forms on the web. Like the above guy, people may not design their site with such a spider in mind, and despite their lack of foresight this could kill a lot of goodwill if done improperly.
Seems to me it would be easy enough to detect the googlebot user agent, then if so, automatically redirect it to the page on the other end (or even send it to a random 404 page or something), all without processing the form data at all.
<? if ($_SERVER['HTTP_USER_AGENT']=="User_agentMozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"); { header( 'Location:Of course, this would have to be implemented, which would be a PITA, but it seems to me that it would work just fine.
n/t
A post a day keeps productivity at bay.
"Technology: Google fills your backend database with garbage"
Why, no, I haven't meta-moderated lately. Thanks for asking!
- Scripts with potentially infinite results. If you have a calendar script on your site, that shows this month's calendar with a link to "next month" and "previous month" then without Robots.txt, the search engine could index back into prehistoric times and past the death of the Sun, with blank event calendars for each month. This is stupid. With your robots.txt file you tell the spider what URLs it's in BOTH your best interests not to crawl. You save server resources and bandwidth, Google saves their time and resources.
- If you have a duplicate copy or copies of your site for development, or perhaps an experimental "beta" version of your site, you don't want it competing with the real site for search engine placement, or worse, causing SE spiders to think you're a filthy spammer with duplicate content all over the place. So you disallow the dupes with robots.txt. Now sure ideally that server could be inside your firewall instead of on the Web, but it gets more challenging when your dev team is on a different continent.
- Temporary crap that has no value to the outside world, once again, it's a waste of both yours and the search engine's time to index it.
The above are all reasons why you might want some or all of the content on a site not indexed.
Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create?
If any forms which feed your DB are GET style, aren't user authenticated and/or don't use a CAPTCH then you already have a huge trash data problem. At least the googlebot won't offer to enlarge your penis.
]{
Of course they could link to a site and make the browser perform a POST. That's trivial. A form and some javascript will do that no problem. They seem to not be doing that because GET forms should be non-destructive, whereas POST forms can be quite destructive.
That is a POST form, which Google have said they will not mess with.
Every time anyone raises a question like this, someone trots out robots.txt as if it is some sort of magic solution to all the potential problems. It is not.
For one thing, it is voluntary. Google and some other major search engines may respect it today, but they are under no obligation to do so, nor to continue to do so if they do now.
For another thing, depending on robots.txt makes the whole game opt-out. This is, IMHO, the wrong default for potentially unwelcome visits. We can't keep pretending it's OK to hit sites with huge increases in traffic because "if it's on the Internet, they expect people to visit". Sure, they expect people to visit — not automated systems run by companies with vast resources that can push a typical small site into paying for extra bandwidth or being taken off-line in a matter of minutes.
It is not OK to just Slashdot a site out of the blue. It is not OK to aggressively attack every form on a site to see what you can find. It is not OK to set up a 1,000,000 computer botnet and then effect a DDoS attack against a web site your client doesn't like. It is not OK to send me so much spam that I have to waste hours of my life sorting through it to find legitimate e-mail. These are all variations of exactly the same principle: knowingly causing a huge, unexpected and potentially expensive or damaging increase in traffic to someone without their knowledge or consent. And most of them are already illegal in a lot of jurisdictions.
It doesn't take a genius to spot that this is unethical behaviour, and it's long past time we stopped pretending it was OK because Google can Do No Evil(TM) and we like Slashdot. The current approach is unsustainable, and since the Internet's days as an unmetered, untaxed medium appear to be numbered in the current political climate, the sooner the robots.txt advocates get it, the better.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
Now they'l finally be able to index all kinds of Google searches... oh, wait.