Google Crawls The Deep Web
mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directives like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
Soon, they'll start injecting SQL too to help map databases! Google is so useful indeed! :)
Several years ago, I tried a demo of Bright Planet's Deep Query Manager that would essentially do these searches through a client on your machine in batch-like jobs. Oh, the bandwidth and resources you'll hog!
...
Their stats on how much of the web they hit that Google missed was always impressive (true or not) but perhaps their days are numbered with this new venture by Google.
Quite an interesting concept if you think about it. I always presupposed that companies would hate it but never got 'blocked' from doing it to sites.
Here, suck up my bandwidth without generating ad revenue! Sounds like a lose situation for the data provider in my mind
My work here is dung.
They just bought everything on Amazon.
only half kidding
John Carmack fan, browsing at +5 since 1999.
On the plus side, this should enable Google to get by the "Must be 18 to view" buttons
><));>
I am just submitting this form to see what's behind it. PLEASE IGNORE ME.
This brings up a concern from the description.
So Googlebot will come across a web page.
It follows a link.
The link leads to a page with a form.
Googlebot fills out the form based on content already on the site.
Googlebot clicks submit.
Googlebot goes to the next page, and continues to follow links.
The problem comes when that form was a post form like the one I am typing on right now for a forum, or some other type of form to create user generated content. This makes it seem like google will see the text box and input random content from the site, then post it.
What keeps googlebot from becoming a nonsensical spambot? Yes, you can use nofollow, but there is such a huge quantity of web forms that don't have that now because they've never needed it. Retrofitting all of them web wide is not the most realistic of goals.
Touch everywhere, even when inappropriate.
...Google will rule the world
Distributed proteome folding @ WorldCommunityGrid.org
Team Slashdot - Members:#1 Run Time:#1 Points:#1 Results:#1
Well first of all, it's about time they learn how to read advanced sites! If your site is dependent on input from the user to display content, you're basically invisible to google. Now all they need is something to read text in flash files and they've got something going. But on the other hand, this is almost auto-fuzzing which could be considered hacking and I bet they'll often get results they didn't intend to and expose data that's supposed to be protected and private.
Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create? I speak mainly of feedback forms, e-mail signups, and the like. Also, what about the excess click-throughs that some websites may be paying an outside entity for? Finally, what of the time spent by IIS in examining the logs for yet another anomaly? Maybe these are unlikely possibilities, or maybe not, but it will come back to affect your image. Just a thought exercise: Consider the fun to be had in leading Google through dynamically generated pages, when a Google Deep Web crawler comes to visit >:-)
V for Vendetta: People should not be afraid of their governments. Governments should be afraid of their people.
They'll have to be careful how they go about this. If they start filling in forms with bogus data on blogs, forums etc., there are going to be a lot of pissed off website owners out there. Just imagine the number of admins who'll have to update their robots.txt for this. Just my 2c.
Does that mean I'll have to introduce methods that waste people's time in order to prevent google from registering on my site multiple times?
Cracking your forms. Sorry, could not help myself.
Okay, so how long until the spec for robots.txt is updated to have a "DontBeStupid" directive?
It must have been something you assimilated. . . .
While you can already have links that perform actions and change information, submitting forms is a good recipe for massive changes, from comment spam to anything else. The sky is the limit.
Now you can't see what is on the web, by crawling, without changing it.
Here's how Google will respond when you complain to them about junk data in your forms: "We're sorry to hear about the problems with the way GoogleBot indexes your web site. Please note that GoogleBot strictly follows the robots exclusion standard and found no indication that your forms were not suitable for being accessed by automated processes. To avoid unwanted accesses, please update your robots.txt to correctly indicate which forms you don't want to be accessed by GoogleBot. Our webmastertools-service can help you make these updates."
So it will search recursively through .... Google.
Or probably benefit from altavista/yahoo/... results.
(just joking).
If you haven't already noticed, AdSense has features now to tell Google how to log into your website so it can catalog your user-only pages. You know what that means. Porn sites are going to start using this so that Googlebot can confirm that its age is over 18. We'll be showered with a gigantic wave of pornographic information. We will soon have to press juvenile charges against a corporate entity because it lied about its age on web forms to gain access to pornography and forum discussions.
Maybe they shouldn't be, at least not in all cases. Several years back I had done many Google searches for some information that was very important to me, but never could find anything. Then a few months later (too late to be of use), pretty much by a fortunate combination of factors but with no help from Google, I came across the exact information, on a .GOV website in a publicly filed IPO document. As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record. When these nofollow directives are over used by mindless and unaccountable bureaucrats, perhaps someone needs to make the decision that these records should be public and that isn't best served by hiding them deep down a long list of links where they are hard to locate. In cases like this I would applaud any search engine that ignores the "suggestion" not to index public pages just because of an inappropriate tag in the HTML. In fact, if I knew of any search engine that was indexing in spite of this tag, I would switch to them as my first choice search engine in an instant. For starters, I would suggest that any .GOV and any State TLD website should have this tag ignored unless there were darn good reason to do otherwise.
I'm an American. I love this country and the freedoms that we used to have.
http://www.robotstxt.org/
Dang, that was hard. Damn you, GOOGLE! Damn you to HELL! You blew it up! You finally blew up the web!
Or not.
Wimps. Index it all, who cares if the site doesn't want it. If it's public-facing, it deserves to be indexed.
---- Booth was a patriot ----
How about online stores? Google is going to get some merchandise...
Sweet, now Google will be Fuzzing the entire web.
How will this work for forms that perform translations, validations and similar kinds of operations on other websites? Try to pull the entire internet through each such site it finds?
And not every web development environment forces GET to not change data. In Ruby on Rails, adding "?method=post" to the end of a URL fakes a POST even though the request is actually a GET. I disabled that at the company I work for, but not everyone is going to do that.
If I have nothing to hide, don't search me
And a few relevant URLs from helpful sponsors?
Now you just need to hire a few sweatshop workers to get past those pesky captchas...
This explains the sudden increase in users registered as Username==Username, with Password ==Password across the interwebs. To reply please send an email to name@domain.com
When I interned at Google, someone told me a funny anecdote about a guy who emailed their tech support insisting that the Google crawler had deleted his web site. At first, I think he was told that "Just because we download a copy of your site, doesn't mean your local copy is gone." (a'la obligatory bash.) But, the guy insisted, and finally they double checked and his site was in fact gone. Turns out that it was a home-brewed wiki-style site, and each page had a "delete" button. The only problem was, the "delete" button sent its query via GET, not POST, and so the Google spider happily followed those links one-by-one and deleted the poor guy's entire site. The Google guys were feeling charitable and so they sent him a backup of his site, but told him he wouldn't be so lucky the next time, and he should change any forms that make changes to POSTs -- GETs are only for queries.
So, long story short, I wonder how Google will avoid more of this kind of problem if they're really going off the deep end and submitting random data on random forms on the web. Like the above guy, people may not design their site with such a spider in mind, and despite their lack of foresight this could kill a lot of goodwill if done improperly.
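For anyone wondering what "GETs are only for queries" looks like in practice, here is a minimal sketch. The in-memory `PAGES` dict and the `handle_delete` function are made up for illustration; they are not how that wiki (or any particular framework) works. The only point is that the destructive branch is unreachable from a plain link-following GET.

```python
# Hypothetical in-memory "wiki"; page names are illustrative only.
PAGES = {"home": "Welcome", "about": "About us"}


def handle_delete(method: str, page: str, pages: dict) -> int:
    """Return an HTTP-style status code for a request to delete a page.

    Only POST may change state. A spider that follows links issues GET
    requests, so it can never reach the deletion branch.
    """
    if method.upper() != "POST":
        return 405  # Method Not Allowed: GET, HEAD, etc. must not delete
    if page not in pages:
        return 404  # nothing to delete
    del pages[page]
    return 200
```

With this shape, a crawler requesting the delete URL via GET gets a 405 and the page survives; only an explicit form submission via POST removes it.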
The stuff behind forms is normally of no use to someone doing a search on Google, so how does this fit in with their "Do-No-Evil" motto? What's the use of indexing a confirmation page to a support ticket system? Is someone going to do a search for: "A support ticket has been created. Your reference number is .... bla... bla... bla..." Anyway, how do you expect someone to visit a dynamic confirmation page without filling out a form? Is Google going to hack our CAPTCHA scripts and anti-spam measures just to get past our forms?
"Nevertheless, directions like 'nofollow' and 'noindex' are still respected". It's like the stupid CAN SPAM law, spammers are allowed to spam us until we tell them to stop. Google automatically allow themselves the privilege of fiddling with our e-mail forms until we tell them to stop.
www.cybertopcops.com
.....The first thing that popped into my head was someone out there figuring out how to use this to access password protected sites/accounts.
"Hey! Look at this! I googled "World of Warcraft Forums" and just got 10 million hits, all logged in as a user!"
I saw this a few months ago while grepping through our apache log. Googlebot was submitting search requests for some weird stuff to our online catalog (for example: "Ctnblnd"). After some research I found that Googlebot was the only client which had ever searched for most of these terms and that they were abbreviations that our accounting department uses. I was guessing that they were doing something like this in the lab for words that they "didn't know" but ultimately put our search url into robots.txt because I didn't like our search results showing up in theirs.
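If you want to check that a robots.txt rule like that actually covers your search URL, Python's standard `urllib.robotparser` can evaluate it offline. This is only a sketch: the `/catalog/search` path and the example.com URLs are assumptions for illustration, not the poster's real setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; "/catalog/search" is an assumed path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /catalog/search
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here `is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/catalog/search?q=Ctnblnd")` comes back False, while an ordinary catalog page outside that prefix is still allowed. Keep in mind this only tells you what a well-behaved crawler should do.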
In a few months, there'll be a new blog post - Google will attempt to access and index all sites' password-protected pages by matching usernames found elsewhere on the site (e.g. from email addresses) with intelligent guesses at passwords, based on information it's gleaned regarding those individuals. Failing that, it'll run through entries found in various cracker dictionaries.
#DeleteChrome
Google has announced that Google Phones (beta) will soon unveil the results of its having wardialed all 6,800,000,000 U.S. telephone numbers. Visitors to the Google Phones site will be able to search individual phone numbers to determine (without personally dialing the number) whether the number belongs to a landline telephone, cell phone, fax, or modem.
On phone numbers where a VMS is detected, Google plans to dial "#0#" and other codes in order to determine how to reach a human.
"Since we are a big, rich entity, the laws don't apply to us. We can do black-hat hacking exploits that would cause law enforcement to raid your home if you did the same thing," said a Google spokesman.
As per Google Webmaster Central:
"Similarly, we only retrieve GET forms and avoid forms that require any kind of user information. For example, we omit any forms that have a password input or that use terms commonly associated with personal information such as logins, userids, contacts, etc. We are also mindful of the impact we can have on web sites and limit ourselves to a very small number of fetches for a given site."
Stuff like login forms, contact forms or forms for user generated content should be using the POST method not GET, so there shouldn't be any concern for web developers who know how to design their sites. If you are using GET in the wrong places, then it is your own fault.
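To make the quoted policy concrete, here is a rough sketch of the kind of filter it describes: keep only GET forms, and skip any form with a password input or a field name that smells like personal information. This is a guess at the general idea, not Google's actual code; the `PERSONAL_TERMS` list and all function names are assumptions.

```python
from html.parser import HTMLParser

# Assumed keyword list, loosely taken from the terms named in the quote.
PERSONAL_TERMS = ("login", "userid", "contact")


class FormCollector(HTMLParser):
    """Collect each form's method and the (type, name) of its fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # HTML treats a missing method attribute as GET.
            method = (attrs.get("method") or "get").lower()
            self._current = {"method": method, "fields": []}
        elif tag in ("input", "select", "textarea") and self._current is not None:
            field_type = (attrs.get("type") or "text").lower()
            name = (attrs.get("name") or "").lower()
            self._current["fields"].append((field_type, name))

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def looks_crawlable(form: dict) -> bool:
    """Apply the quoted rules: GET only, no password or personal fields."""
    if form["method"] != "get":
        return False
    for field_type, name in form["fields"]:
        if field_type == "password":
            return False
        if any(term in name for term in PERSONAL_TERMS):
            return False
    return True


def crawlable_forms(html: str) -> list:
    """Return the forms in html that pass looks_crawlable."""
    collector = FormCollector()
    collector.feed(html)
    return [form for form in collector.forms if looks_crawlable(form)]
```

Fed a page with a GET search box, a POST comment form, and a GET sign-in form with a password field, this keeps only the search box.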
The moral of this story: read the actual post/article in detail before overreacting to something posted out of context by a slashdotter. (And yes, I'm also guilty of this.)
www.cybertopcops.com
...that Google's Deep Crawl is already emulating the kidd33z.
Also, note that Google is not being altruistic when they say they will only process GET forms. From a programming POV, it is no harder for them to submit POST forms than it is to submit GET forms. The problem is that they index their resulting data by storing URLs (which a GET request provides and a POST does not), so they would have no way to redirect a person clicking on the result list from Google to the POST form results (that's just not supported by the browsers). So we are talking about a technical limitation, not an altruistic self-limitation.
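A quick sketch of why a GET submission is linkable: the whole form state ends up in the URL, so a search result can point straight at it. The `get_result_url` helper and the example.com address are made up for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def get_result_url(action: str, fields: dict) -> str:
    """Build the URL a browser would request when submitting a GET form."""
    return f"{action}?{urlencode(fields)}"


def fields_from_url(url: str) -> dict:
    """Recover the submitted fields from such a URL."""
    return parse_qs(urlsplit(url).query)
```

For example, `get_result_url("http://example.com/search", {"q": "deep web", "page": "2"})` gives `http://example.com/search?q=deep+web&page=2`, and `fields_from_url` gets the fields back out. A POST carries the same fields in the request body, so there is no URL a search engine could store that reproduces the result page.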
Repeatedly querying to extract every permutation of their API could be much larger than their underlying data (think of the combinatorics of only 5 query fields of only 5 values each, against only a couple of hundred values in the database, like many at sites), and far too much traffic for small sites (and probably for big sites, too, if their combinations of queries at all matches their traffic).
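To put a number on that: 5 fields with 5 values each is 5^5 = 3125 distinct queries, against a database of maybe a couple of hundred rows. A small sketch, with field names and values invented purely for the count:

```python
import math
from itertools import product


def count_queries(field_values: dict) -> int:
    """Number of distinct form submissions: the product of the option counts."""
    return math.prod(len(values) for values in field_values.values())


# Hypothetical form: 5 fields, each offering 5 values.
fields = {f"field{i}": [f"v{j}" for j in range(5)] for i in range(5)}

# Enumerating every combination confirms the closed-form count.
all_combinations = list(product(*fields.values()))
```

Both `count_queries(fields)` and `len(all_combinations)` come out at 3125, so exhaustively probing even this small form would mean more than fifteen requests per row of a 200-row database.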
What could be even better, for sites that don't want to take that huge load just to have their data searchable in Google, would be a "getindex" keyword, rather than just "noindex", that specifies a URL from which the site's data index can be retrieved by Google. The getindex keyword would also point to a schema URL that would let Google decode the index.
That way, the site in question could let all its data get added to Google's centrally searchable index, if it wants to allow it (otherwise, it would post "noindex"). Sites might even find themselves using Google's copy of their index for their own queries, rather than use CPU time querying their own local copy. Just like sites today use Google's index for searching their site's documents.
All they'd do would be to regenerate their index whenever they want, and maybe ping a Google API at Google.com that reloads their index from their site back into Google's updated index. Such an "index hosting" service would quickly become the norm for many sites, just as searching sites by searching their document index at Google is the norm today, but would have been considered weird a decade ago.
--
make install -not war
Luckily, no damage done, since it's a harmless operation, but I am concerned about being penalized for a high number of "duplicate" pages since the response to each one is probably identical ("Sorry, no help is available on the topic "impediment to." and similar crap.)
It also doesn't seem to know when to quit - it's responsible for more hits per month than I am, and as a nervous webmaster that's saying a lot.
michael
I'm having flashbacks of The Venture Brothers, episode Twenty Years To Midnight.
Google searching websites like the Grand Inquisitor -- IGNORE ME!
The problem with their searching is a form like this one: http://quaker.org/users.cgi It's *meant* to keep people out unless they've entered into a legal agreement.
Don't piss off The Angry Economist
> According to the HTTP specs, GET requests should always be idempotent.
Wrong. It would be correct to say "according to the newer HTTP specs, GET requests should always be idempotent." I have pages still in production that I wrote before that requirement was added. While it is simple to change "METHOD=GET" to "METHOD=POST" in a static HTML text file, it's not so simple when I wrote some of the pages over twelve years ago and many are HTML generated with cgi programs that are compiled C using many external libraries. Most haven't been recompiled since then. My company made our big move for backend systems to Linux/C from MAI Basic Four in the fall of 1993, and moved our web site and ported all of our C cgi programs over to Linux the fall of 1995 from basic running on BSDi systems.
My plan is to password protect the directories. A human could screw us up like this new Google bot so this change needs to be done anyway.
Windows is a bonfire, Linux is the sun. Linux only looks smaller if you lack perspective.
n/t
A post a day keeps productivity at bay.
Soon enough they will dig too deep and unleash a terror the world has long forgotten. Maybe a Deep Crow? How exciting! :)
"Technology: Google fills your backend database with garbage"
Why, no, I haven't meta-moderated lately. Thanks for asking!
they should use them as well. And bugmenot. If only to show webmasters how pointless it is to require a captcha to search and a login to post to a forum. Anonymous cowards everywhere!
... they hit the Solar Dynamics Observatory database next year. It'll be collecting several petabytes of images...
until the google trawler starts making its own first posts.
Googlebot, is that you?
I run a Rails site. There is a particular action on the site which, at the moment, sits in my password protected admin's area. Two people accessing it simultaneously would lock the server for 2 seconds. That isn't a problem for me when I get about 1,000 users a day and the action is only accessible to a single admin, but it would not be unreasonable to push that action to the public site, because the chances of temporal collision between users are low and the cost for a collision is negligible.
However, if you were sucking down 10k pages from my site with a spider, you could DOS my site pretty trivially. (Call it the Googlebot effect.) That's why you let me say "If you are capable of routinely generating page views with scale, feel free to go anywhere but Door #1."
Help poke pirates in the eyepatch, arr.
Every time anyone raises a question like this, someone trots out robots.txt as if it is some sort of magic solution to all the potential problems. It is not.
For one thing, it is voluntary. Google and some other major search engines may respect it today, but they are under no obligation to do so, nor to continue to do so if they do now.
For another thing, depending on robots.txt makes the whole game opt-out. This is, IMHO, the wrong default for potentially unwelcome visits. We can't keep pretending it's OK to hit sites with huge increases in traffic because "if it's on the Internet, they expect people to visit". Sure, they expect people to visit — not automated systems run by companies with vast resources that can push a typical small site into paying for extra bandwidth or being taken off-line in a matter of minutes.
It is not OK to just Slashdot a site out of the blue. It is not OK to aggressively attack every form on a site to see what you can find. It is not OK to set up a 1,000,000 computer botnet and then effect a DDoS attack against a web site your client doesn't like. It is not OK to send me so much spam that I have to waste hours of my life sorting through it to find legitimate e-mail. These are all variations of exactly the same principle: knowingly causing a huge, unexpected and potentially expensive or damaging increase in traffic to someone without their knowledge or consent. And most of them are already illegal in a lot of jurisdictions.
It doesn't take a genius to spot that this is unethical behaviour, and it's long past time we stopped pretending it was OK because Google can Do No Evil(TM) and we like Slashdot. The current approach is unsustainable, and since the Internet's days as an unmetered, untaxed medium appear to be numbered in the current political climate, the sooner the robots.txt advocates get it, the better.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
Now they'll finally be able to index all kinds of Google searches... oh, wait.
Looks like it's time to reformat the Internet. Sure, theoretically this shouldn't cause problems, but we all know (and many of us have probably been guilty of early in our development careers) that practice doesn't always follow theory.
And I was hoping this would be about google finally indexing things like Freenet and the .onion domain.