Slashdot Mirror


Google Crawls The Deep Web

mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directives like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
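As a rough illustration of the form-filling idea in the summary, here is a minimal Python sketch, not Google's actual method. It assumes the field values have already been extracted (option values from the HTML, words picked from the site) and simply builds one GET URL per combination. The function name `candidate_urls`, the `/search` action, the field names and the `limit` cap are all hypothetical.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action, choices, limit=10):
    """Build GET URLs for a form, one per combination of field values.

    `choices` maps each field name to the values to try: option values
    taken from the HTML for select menus, or words picked from the site
    for text boxes. `limit` caps the number of URLs so that a form with
    many fields does not explode into millions of requests.
    """
    names = sorted(choices)
    urls = []
    for combo in product(*(choices[name] for name in names)):
        if len(urls) >= limit:
            break
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action}?{query}")
    return urls


# Hypothetical catalogue search form with one select menu and one text box.
form_choices = {
    "category": ["books", "music"],
    "q": ["python", "deep web"],
}
for url in candidate_urls("/search", form_choices):
    print(url)
```

The number of combinations is the product of the number of values per field, which is why some cap like `limit` seems unavoidable for any crawler taking this approach.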

197 comments

  1. Just think! by scubamage · · Score: 5, Funny

    Soon, they'll start injecting SQL too to help map databases! Google is so useful indeed! :)

    1. Re:Just think! by WGFCrafty · · Score: 1

      Why?

    2. Re:Just think! by rawdirt · · Score: 1

      already there?

      I sent this drupal log trace to abuse@google.com

      Message Duplicate entry cb489713c1455ab98723be737bfe8ca7 for key 1 query: INSERT INTO sessions (sid, uid, cache, hostname, session, timestamp) VALUES (cb489713c1455ab98723be737bfe8ca7, 0, 0, 66.249.85.133, , 1207635444) in /var/www/drupal-5.5/includes/database.mysql.inc on line 172.
      Severity error
      Hostname 66.249.85.133

      no response from them yet, maybe a borked machine?

    3. Re:Just think! by AKAImBatman · · Score: 3, Funny

      Hmm... that reminds me of this DailyWTF. Who knew that Mr. Test User was such a big customer? :-P

    4. Re:Just think! by Anonymous Coward · · Score: 1, Informative

      I had a search not for "allinurl:select from where" but for "allinurl: delete from" ... throws up a bunch of phpBBAdmin pages with "Do you really want to do this" and "Yes" and "No" buttons .... which one will Google click :)

    5. Re:Just think! by Lillesvin · · Score: 3, Informative

      ... maybe a borked machine?

      Yeah, maybe your machine... That SQL-error looks more like bad session handling on the server hosting your Drupal installation than Google trying to do an SQL-injection... Actually, it looks nothing like an SQL-injection at all. MySQL is merely being asked to insert a duplicate value in a column specified as unique (`sid`), which it refuses because it's not unique. Don't expect an answer, since it's most likely not an error on Google's end.
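      The duplicate-key behaviour described above is easy to reproduce. The sketch below is only an illustration: it uses Python's built-in SQLite rather than MySQL, a simplified two-column `sessions` table, and a hypothetical `save_session` helper. It is not Drupal's session code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sid is the primary key, so it must be unique (simplified table).
conn.execute("CREATE TABLE sessions (sid TEXT PRIMARY KEY, hostname TEXT)")


def save_session(sid, hostname):
    """Insert a session row; return False if the sid already exists."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO sessions (sid, hostname) VALUES (?, ?)",
                (sid, hostname),
            )
        return True
    except sqlite3.IntegrityError:
        # The database refuses the second row with the same key.
        return False


first = save_session("cb489713c1455ab98723be737bfe8ca7", "66.249.85.133")
second = save_session("cb489713c1455ab98723be737bfe8ca7", "66.249.85.133")
print(first, second)
```

      The second insert fails on the server side regardless of who sent the request, which is the point being made: the error says something about session handling, not about the client.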

      A little more on topic though, what exactly is Google looking for there? I mean, what content (of any interest to anyone) is hiding behind forms? Many sites that require registration (like NY Times (IIRC) and others) already check whether the UserAgent string is that of a Google crawler and, if so, let it index, so that people can search e.g. NY Times articles on Google but only read them if they register (or change their UserAgent string or use BugMeNot).

      And how does Google make sure they don't end up accidentally editing a crapload of wikis by filling out random forms on random sites and hitting submit?

      --
      "Live free or don't."
    6. Re:Just think! by Nullav · · Score: 1

      +1, Informative!

      --
      I just read Slashdot for the articles.
    7. Re:Just think! by Anonymous Coward · · Score: 0

      I see you have been reading Neal Stephenson...

    8. Re:Just think! by Anonymous Coward · · Score: 0

      I like being fingered (I'm a guy) and yeah, it feels really good. But what does this have to do with Google? Exactly. Nothing.

    9. Re:Just think! by CastrTroy · · Score: 1

      Actually, we (the web) have had problems with this before. Web accelerators started following links on pages before you clicked them. If the link happened to link to an action deleting something, it would delete it just by visiting a page with the delete link on it. Granted, you should never do anything destructive with a GET request, but now Google is starting to submit forms. I wonder how much stuff they will end up deleting with their program that automatically submits forms with values it thinks should be correct.

      --

      Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
    10. Re:Just think! by Ariven · · Score: 5, Interesting

      I remember an article a while back where someone had cut/pasted some articles from one section of their site to another, and as a result had edit and delete links in the live content instead of on their internal web interface.

      And a search engine (I think it was Google) crawled the site, hit the delete links and deleted all the pages of the site. At that time it was stated that any link that performs an action, such as delete, should be a POST via a form, so that search engines wouldn't do that very thing.

      And now, they are gonna start submitting forms? the fallout is gonna be entertaining.

    11. Re:Just think! by LordKronos · · Score: 1

      I've seen a number of users come crying in the mythtv forum that somehow all of their recordings mysteriously disappeared. Seems having your mythweb completely unsecured isn't such a good thing.

      For those people, this move by Google is great news. You see, the delete links were all simple GET requests, so the spiders were able to delete content. However, the scheduling is all done via POST'ed forms, so nothing would ever get recorded. This move on Google's part is really just an attempt to combat this. The other spiders delete all your recordings, then Google comes in and schedules your box to record every single show for the next 2 weeks. Sure, there's bound to be a few conflicts (unless your box has 150 tuners), but at least you won't have to resort to LiveTV to actually find something to watch.

    12. Re:Just think! by jc42 · · Score: 4, Interesting

      I had similar problems a few years ago. The database had a lot of data in a compact format, and I wrote some retrieval pages that would extract the data and run it through any of a list of formatters to give clients the output format they wanted. Very practical. Over time, the list of output formats slowly grew, as did the database. Then one day, the machine was totally bogged down with http requests. It turned out that a search site had figured out how to use my format-conversion form, and had requested all of our data in every format that my code delivered.

      Google wasn't too bad, because at least they spread the requests out over time. But other search sites hit our poor server with requests as fast as the Internet would deliver them. I ended up writing code that spotted this pattern of requests, and put the offending searcher on a blacklist. From then on, they only got back pages saying that they were blacklisted, with an email address to write if this was an error. That address never got any mail, and the problem went away.
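      The poster does not say how the pattern-spotting code worked. One common way to catch this kind of burst is a sliding-window counter per client, sketched below in Python. The class name `BurstBlacklist`, the thresholds and the client names are all made up for illustration.

```python
from collections import defaultdict, deque


class BurstBlacklist:
    """Blacklist clients that send too many requests in a short window."""

    def __init__(self, max_requests=20, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # client -> recent request times
        self.blacklisted = set()

    def allow(self, client, now):
        """Record a request at time `now`; return False if blocked."""
        if client in self.blacklisted:
            return False
        stamps = self.recent[client]
        stamps.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        if len(stamps) > self.max_requests:
            self.blacklisted.add(client)
            return False
        return True


guard = BurstBlacklist(max_requests=3, window_seconds=1.0)
# A polite crawler spaces its requests out; a greedy one does not.
polite = [guard.allow("polite-bot", t) for t in (0.0, 2.0, 4.0, 6.0)]
greedy = [guard.allow("greedy-bot", t) for t in (0.0, 0.1, 0.2, 0.3, 5.0)]
print(polite)
print(greedy)
```

      In this sketch a blacklisted client stays blocked until someone removes it by hand, which matches the "write to this address if this was an error" approach described above.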

      Since then, I've done periodic scans of the server logs for other bursts of requests that look like an attempt to extract everything in every format. I've had to add a few more gimmicks (kludges) to spot these automatically and blacklist the clients.

      I wonder if google's new code will get past my defenses? I've noticed that googlebot addresses are in the "no CGI allowed" portion of my blacklist, though they are allowed to retrieve the basic data. I'll be on the lookout for symptoms of a breakthrough.

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    13. Re:Just think! by dartarrow · · Score: 3, Informative
      --
      I love humanity, it is people I hate
    14. Re:Just think! by Anonymous Coward · · Score: 1, Funny

      If you put a "delete this page" button on any page, I would honestly be shocked if Google got to it before some punk-ass kid did...

    15. Re:Just think! by Arancaytar · · Score: 1

      Or just deleting those databases in order to reduce the set of information it has to index. Google "Google Purge onion"! :P

    16. Re:Just think! by Ed+Avis · · Score: 1

      I hope the forms they submit are only GET request forms and not POST ones.

      --
      -- Ed Avis ed@membled.com
    17. Re:Just think! by Anonymous Coward · · Score: 1, Insightful

      I can't help but feel that any site that doesn't perform proper access control really needs this kind of wake up call.

      The way you put it, it sounds like Google has somehow done something wrong there, but it's not like every user of a site is going to be courteous enough to just not hit the delete link themselves. The responsibility for this problem lies entirely at the feet of the site's developer: even if the link was out there, it absolutely should never have just gone ahead and deleted the content without first checking who was trying to delete it.

      There are certainly valid concerns about Google's plans but incompetent web developers shouldn't be one of them and as I pointed out originally, someone's going to take advantage of their incompetence eventually whether it's the Google bot or someone else.

    18. Re:Just think! by Anonymous Coward · · Score: 0

      I'm sure this will forever be recorded in the Annals of Internet History.

    19. Re:Just think! by Ariven · · Score: 1

      Nah, I don't think that Google did anything wrong, but my point is that before, when this happened, the message that was put across was "don't use links to perform actions like delete, use a POST and a form".... now, even that isn't a guarantee of safety.

      I think that the programmers of the website have full responsibility to properly protect against stuff like this, but, being a realist, I know there are going to be enough that don't, that I will be properly entertained as a result. :)

    20. Re:Just think! by Ariven · · Score: 1

      Yup, that's the one... I tried a quick search for it when I posted before, but couldn't find it right away... my google-fu failed me. ;)

    21. Re:Just think! by Goaway · · Score: 1

      It still is. Google won't do POST, just GET.

    22. Re:Just think! by Goaway · · Score: 1

      They are.

    23. Re:Just think! by Garridan · · Score: 1

      Alexa did that to me once. The Alexa plugin scraped my boss's username/password for our admin backend, and spidered a bunch of delete links. Oops. So I made 'em into buttons. And a quick search of Alexa indicates that it still spiders the admin site, robots.txt be damned.

    24. Re:Just think! by solaraddict · · Score: 2, Informative

      At that time it was stated that any link that performs an action, such as delete, should be a post(...) [clears his throat]
      And the RFC 2616 opened its mouth and said:

      In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.
      It must be true, the fRFC confirms it!
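      To make the quoted convention concrete, here is a small Python sketch of a request handler that refuses to take a destructive action on a "safe" method. Everything in it (the `handle` function, the in-memory `pages` dict, the `delete` action name) is hypothetical, and a real site would also need to check who is asking, as other posters point out.

```python
SAFE_METHODS = {"GET", "HEAD"}  # methods that should only retrieve

pages = {"about": "About us", "news": "Latest news"}


def handle(method, action, page):
    """Return (status, body) for a request against the in-memory pages."""
    if action == "delete":
        if method in SAFE_METHODS:
            # A crawler following a link lands here: refuse, change nothing.
            return 405, "Method Not Allowed"
        pages.pop(page, None)
        return 200, "deleted"
    if page in pages:
        return 200, pages[page]
    return 404, "Not Found"


print(handle("GET", "delete", "about"))   # crawler-style request
print("about" in pages)                   # page is still there
print(handle("POST", "delete", "about"))  # explicit form submission
print("about" in pages)
```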
  2. Bright Planet's DQM by eldavojohn · · Score: 3, Interesting

    Several years ago, I tried a demo of Bright Planet's Deep Query Manager that would essentially do these searches through a client on your machine in batch-like jobs. Oh, the bandwidth and resources you'll hog!

    Their stats on how much of the web they hit that Google missed was always impressive (true or not) but perhaps their days are numbered with this new venture by Google.

    Quite an interesting concept if you think about it. I always presupposed that companies would hate it but never got 'blocked' from doing it to sites.

    Here, suck up my bandwidth without generating ad revenue! Sounds like a lose situation for the data provider in my mind ...

    --
    My work here is dung.
    1. Re:Bright Planet's DQM by Anonymous Coward · · Score: 0

      It doesn't bring in money directly, but getting those pages listed in Google will bring more people to your site, and from that comes more ad revenue.

    2. Re:Bright Planet's DQM by menace3society · · Score: 2, Interesting

      You could build a really interesting "Deep Web" crawler by ignoring robots.txt. In fact, an index just of robots.txt files would be pretty cool in its own right. Call it "Sweet Sixteen" (10**100 in binary) or something.

    3. Re:Bright Planet's DQM by cheater512 · · Score: 1

      The more content they have off your site, the more visitors they send.

      The visitors *do* generate ad revenue. :)

    4. Re:Bright Planet's DQM by enoz · · Score: 2, Interesting

      One time when I was Deep Crawling a particular website I decided to take a peek at their robots.txt file. To my amazement they had listed all the folders that they didn't want anyone to find, yet had provided absolutely no security to prevent you accessing the content if you knew the location.
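      The point that robots.txt is a request and not a lock can be shown with Python's standard `urllib.robotparser`. The robots.txt content, the paths and the `ExampleBot` name below are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the kind described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-reports/
Disallow: /old-admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler asks before fetching...
print(parser.can_fetch("ExampleBot", "/private-reports/q1.html"))
print(parser.can_fetch("ExampleBot", "/index.html"))

# ...but anyone can read the file and list the "hidden" paths.
hidden = [
    line.split(":", 1)[1].strip()
    for line in ROBOTS_TXT.splitlines()
    if line.lower().startswith("disallow:")
]
print(hidden)
```

      Nothing in this exchange stops a client that never calls `can_fetch` at all, so the listed folders still need real access control.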

      It's cases like that where doing a half-arsed job is worse than not trying at all.

    5. Re:Bright Planet's DQM by bothwell · · Score: 1

      That's not really so unusual, surely? My main domain's robots.txt is set to disallow all search engines from spidering any of the content, but it's still accessible to humans. Kinda like... y'know, it's cool if you wanna look at my collection of Hansard re-written as slash fiction, but I don't want google associating it with my moniker. Obviously I could just choose to store it somewhere else under a different pseudonym but whichever. My point (such as it is) is that surely your guys could have been doing something similar. Obviously I don't know what the offending content was, could have been much worse than Hansard slash fic. If such a thing exists.

    6. Re:Bright Planet's DQM by jo42 · · Score: 1

      whitehouse.gov?

    7. Re:Bright Planet's DQM by menace3society · · Score: 1

      I prefer .com

  3. Oops... by JohnnyDanger · · Score: 5, Funny

    They just bought everything on Amazon.

    1. Re:Oops... by Bogtha · · Score: 4, Informative

      This won't post forms of that sort. In the blog post, they say that they are only doing this for GET forms, which are safe to automate as per the HTTP specification.

      This is for things like product catalogue searches where you pick criteria from drop-down boxes. Not so common for run-of-the-mill e-commerce sites, but I've seen a lot on B2B sites.

      --
      Bogtha Bogtha Bogtha
    2. Re:Oops... by Firehed · · Score: 2, Funny

      HTTP spec be damned - has IE taught you nothing?

      --
      How are sites slashdotted when nobody reads TFAs?
    3. Re:Oops... by Anonymous Coward · · Score: 0

      HTML != HTTP

    4. Re:Oops... by Anonymous Coward · · Score: 0

      They just bought everything on Amazon ...using your credit card! Oops indeed!
    5. Re:Oops... by Jarjarthejedi · · Score: 1

      IE's horridness transcends the mere concept of acronyms.

      --
      There are two kinds of fool One says 'This is old therefore good' Another says 'This is new therefore better'- Dean Ing
    6. Re:Oops... by orkysoft · · Score: 5, Insightful

      Unfortunately, there are tons of sites whose developers did not understand the part about GET being for looking up stuff, and POST being for making changes on the server.

      --

      I suffer from attention surplus disorder.
    7. Re:Oops... by cheater512 · · Score: 1

      No, IE ignores chunks of HTTP as well as HTML.

    8. Re:Oops... by UnderCoverPenguin · · Score: 1

      This won't post forms of that sort. In the blog post, they say that they are only doing this for GET forms, which are safe to automate as per the HTTP specification.

      I have seen plenty of forms that use GET for commanding actions, including making purchases. (For example, I used to work for a web company; one of the page designers only ever used GET, and then would wonder why my code replied with an error when his GET requests exceeded the size limit.)

      --
      Don't try to out wierd me, three-eyes. I get stranger things than you, free with my breakfast cereal. --Zaphod Beeblebr
    9. Re:Oops... by jrumney · · Score: 1

      Unfortunately there are also tons of sites whose developers did not understand the part about POST being for creating new resources, and PUT being for making changes on the server.

      HTTP verb semantics are a very dangerous thing for Google or any other third party to rely on, unless they are using a documented API where the developers have explicitly followed REST principles.

    10. Re:Oops... by Anonymous Coward · · Score: 0

      They just bought everything on Amazon.

      I wonder if someone will be able to modify it so that it tricks the bot into actually purchasing something or agreeing to some license.
    11. Re:Oops... by orkysoft · · Score: 1

      Thanks for the typical Slashdot nitpick :-P

      Nobody is using PUT, and I doubt whether the popular browsers even support it.

      I was taught seven bits to the byte, GET and POST, and that's the way I likes it!

      --

      I suffer from attention surplus disorder.
    12. Re:Oops... by Anonymous Coward · · Score: 0

      Just sold them a million dollars worth of AAAAAAAAAAA. Now where the hell do I score a million dollars worth of AAAAAAAAAAB?

    13. Re:Oops... by FooAtWFU · · Score: 1

      What's it to Google (or a third party) if they mess up your pathetically-designed form? It's not like they're going to "accidentally purchase something" (like some people suggested) unless they have their robots equipped with billing information submission functions (somehow I doubt it).

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    14. Re:Oops... by exp(pi*sqrt(163)) · · Score: 1

      Do you think the DoD uses GET or POST for launching nuclear warheads? Is there a guideline about that?

      --
      Doesn't it make you feel good to know that our freedoms are protected by politicans, lawyers and journalists.
    15. Re:Oops... by Anonymous+Brave+Guy · · Score: 2, Interesting

      What's it to Google (or a third party) if they mess up your pathetically-designed form?

      That depends. If they effectively launch a denial-of-service attack and eat zilliabytes of people's limited bandwidth by attempting to submit with all possible combinations of form controls and large amounts of random data in text fields, would that be:

      1. antisocial?
      2. negligent?
      3. the almost immediate end of their reign as most popular search engine as numerous webmasters blocked their robots?
      4. illegal?
      5. all of the above?
      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    16. Re:Oops... by WizzardX · · Score: 0, Redundant
    17. Re:Oops... by jlarocco · · Score: 1

      HTTP is a documented API.

      What makes you think somebody who's just fucked up HTTP isn't going to go right ahead and fuck up "REST principles" while they're at it?

    18. Re:Oops... by rtb61 · · Score: 1
      Well that does bring up a point. Should you have to include extra coding in your html to block google, or should google only be allowed to deep search sites that have extra coding that invites them in.

      Google in a way is saying that if you fail to properly secure your site that they have a right to data mine it and generate profits from your data. Perhaps, mind you, just perhaps, that really, legally, is not appropriate and perhaps a legal investigation is required to clarify this before everyone starts doing it.

      --
      Chaos - everything, everywhere, everywhen
    19. Re:Oops... by enoz · · Score: 1

      Methinks the developers were sniffing too much MIME

    20. Re:Oops... by enoz · · Score: 1

      Obviously they should be using DELETE

    21. Re:Oops... by Anonymous Coward · · Score: 0

      robots.txt

      google listens. It works; it's magic.

  4. Will it solve captchas? by lastninja · · Score: 4, Interesting

    only half kidding

    --
    John Carmack fan, browsing at +5 since 1999.
    1. Re:Will it solve captchas? by Firefalcon · · Score: 1
    2. Re:Will it solve captchas? by fishybell · · Score: 1

      Just what we need, some 'bot adding its insightful comments based on other words in the same document...then again, on most sites, would you be able to tell the difference between Google posting something and some 1337 kiddiez?!?!!1eleven?

      --
      ><));>
    3. Re:Will it solve captchas? by skraps · · Score: 5, Funny

      Just what we need, some 'bot adding its insightful comments based on other words in the same document.
      Are such questions on your mind often?

      ..then again, on most sites, would you be able to tell the difference between Google posting something and some 1337 kiddiez?!?!!1eleven?
      What does that suggest to you?
      --
      Karma: -2147483648 (Mostly affected by integer overflow)
    4. Re:Will it solve captchas? by urcreepyneighbor · · Score: 4, Funny

      You whore! You told me you loved me, Eliza! You said you'd call!

      --
      "The fight for freedom has only just begun." - Geert Wilders
    5. Re:Will it solve captchas? by Kemanorel · · Score: 1

      What does that suggest to you?

      A new Turing test is needed?
      --
      Mess not in the affairs of dragons, for you are crunchy and good with ketchup.
    6. Re:Will it solve captchas? by Anonymous Coward · · Score: 0

      Your nick fits you perfectly.

    7. Re:Will it solve captchas? by holyspidoo · · Score: 0

      Actually, my fear is that even MORE captchas will emerge out of this.

      If I had to enter a bleepin captcha every time I did a search on ebay or amazon, I'd be searching exclusively for "rope +hanging"

    8. Re:Will it solve captchas? by DrWho520 · · Score: 1

      This is why I read /.

      --
      The cancel button is your friend. Do not hesitate to use it.
    9. Re:Will it solve captchas? by Sobrique · · Score: 1

      Don't worry. It's all OK. They've managed to automate CAPTCHA filling.
      Problem sorted.

  5. Forums? by fishybell · · Score: 5, Funny
    Well, I certainly hope that they put in some decent smarts to prevent it from making posts onto forums, blogs, /., etc.


    On the plus side, this should enable Google to get by the "Must be 18 to view" buttons ;)

    --
    ><));>
    1. Re:Forums? by brunascle · · Score: 2, Informative

      as TFA states, it's only GET requests, not POSTs. so it would mostly be search queries.

    2. Re:Forums? by fishybell · · Score: 1

      ...and porn. You can't forget the porn.

      --
      ><));>
    3. Re:Forums? by MenTaLguY · · Score: 1

      Unfortunately a lot of developers misuse GET requests for actions which modify state. (I suppose this'll teach them...)

      --

      DNA just wants to be free...
    4. Re:Forums? by Bogtha · · Score: 1

      The usual excuse for that is that they want a link — for aesthetic purposes, to put in an email, etc. If you're using a form anyway, those reasons disappear. I'm sure there are a few developers who screw this up, but it won't be anywhere near as common as the problems GWA uncovered.

      --
      Bogtha Bogtha Bogtha
    5. Re:Forums? by spintriae · · Score: 3, Funny

      Google's only 12 years old. It shouldn't be visiting those sites.

    6. Re:Forums? by Anonymous Coward · · Score: 0

      Why should I have to create an entire form with a hidden variable containing the "action" when a single link will do? If I have 10 buttons why should I have 10 forms??? Screw this I'm using JavaScript for EVERYTHING.

    7. Re:Forums? by Mr.+Slippery · · Score: 1

      The usual excuse for that is that they want a link -- for aesthetic purposes, to put in an email, etc.

      If a link works better for interface purposes, Javascript event handlers let you do a POST by clicking a link. I've done that where I needed to pass some login information via a POST to what should have otherwise been a GET. (Yes, it should really be done by cookies or by a session ID in the URL, but that wasn't practical in this case for historical reasons.)

      I'd never really considered how all those verification links we send in e-mail break the rule about GET/POST semantics and side effects and idempotency. Maybe it's worth more deep thought. But if it's doing something more complicated and significant than confirming an action, the link should take the user to a page that verifies "Do you really want to do this?", and that page should do a POST.

      --
      Tom Swiss | the infamous tms | my blog
      You cannot wash away blood with blood
  6. HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 5, Funny

    I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

    1. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 5, Funny

      I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

    2. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 4, Funny

      I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

    3. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 2, Interesting

      I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

    4. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 0

      I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

    5. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 1, Funny

      I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

    6. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 1, Insightful

      I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

    7. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 0

      I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

    8. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 0

      I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

    9. Re:HELLO I AM GOOGLEBOT by Anonymous+Brave+Guy · · Score: 1

      HTTP/1.1 426 Upgrade Required
      Upgrade: Common courtesy/1.0, HTTP/1.1

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    10. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 0

      I am just submitting this form to see what's behind the form I submitted. PLEASE IGNORE ME.

  7. Forums, and "web 2.0" sites. by PyroMosh · · Score: 1

    This brings up a concern from the description.

    So Googlebot will come across a web page.
    It follows a link.
    The link leads to a page with a form.
    Googlebot fills out the form based on content already on the site.
    Googlebot clicks submit.
    Googlebot goes to the next page, and continues to follow links.

    The problem comes when that form was a post form like the one I am typing on right now for a forum, or some other type of form to create user generated content. This makes it seem like google will see the text box and input random content from the site, then post it.

    What keeps googlebot from becoming a nonsensical spambot? Yes, you can use nofollow, but there is such a huge quantity of web forms that don't have that now because they've never needed it. Retrofitting all of them web wide is not the most realistic of goals.

    1. Re:Forums, and "web 2.0" sites. by Idiomatick · · Score: 1

      Google indexes more than any other search engine by expanding the web themselves. It was moving too slow for them.

      Really though, I don't think this will be a problem. People at Google are pretty smart and I'm sure they've thought of this. Even if you believe Google is evil, there's no evil corporate benefit to spamming garbled text to the entire internet.

    2. Re:Forums, and "web 2.0" sites. by mmkkbb · · Score: 1

      They will use Markov chains which may end up sounding more intelligent than many forum denizens. Fark, Free Republic, LGF, etc. won't even notice.

      --
      -mkb
    3. Re:Forums, and "web 2.0" sites. by Nos. · · Score: 1

      Not only that, but suppose I search for something that is hidden behind a form. Assuming I click the link on the search results, I'm going to (most likely) be taken to an error page saying I have to fill out the form.

    4. Re:Forums, and "web 2.0" sites. by Simon+(S2) · · Score: 1

      This makes it seem like google will see the text box and input random content from the site, then post it.

      No. Googlebot will only do gets, not posts.
      --
      I just don't trust anything that bleeds for five days and doesn't die.
    5. Re:Forums, and "web 2.0" sites. by lordSaurontheGreat · · Score: 1

      ...This makes it seem like google will see the text box and input random content from the site, then post it. ...


      No, the Google Bot sees raw HTML and CSS code, plus maybe some basic JavaScript. All of /.'s new js additions will throw off the bots immediately.


      In addition, you're forgetting that GET and POST are completely different things. They look identical to you, but the HTML is different, and it's not difficult to differentiate between them. One is <form method="GET"> and the other is <form method="POST">
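      Telling the two apart from the HTML alone can be sketched with Python's standard `html.parser`. The `FormMethodFinder` class and the sample page are made up for illustration. Note that a form with no method attribute defaults to GET, so a crawler would presumably treat it as such.

```python
from html.parser import HTMLParser


class FormMethodFinder(HTMLParser):
    """Collect (action, method) for every <form> in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            attributes = dict(attrs)
            # HTML treats a missing method attribute as GET.
            method = (attributes.get("method") or "get").upper()
            self.forms.append((attributes.get("action", ""), method))


page = """
<form action="/search" method="GET"><input name="q"></form>
<form action="/comment" method="post"><textarea name="body"></textarea></form>
<form action="/filter"><select name="year"><option>2008</option></select></form>
"""
finder = FormMethodFinder()
finder.feed(page)
crawlable = [action for action, method in finder.forms if method == "GET"]
print(crawlable)
```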

      --
      Consider yourself spoken to.
    6. Re:Forums, and "web 2.0" sites. by menace3society · · Score: 1

      I am tempted to copy and paste that and post it as my reply, but I think that would be insufferably clever. So, too, is referring to the fact that I could be insufferably clever, but choose not to be. Etc...

    7. Re:Forums, and "web 2.0" sites. by Z80xxc! · · Score: 2, Informative

      Seems to me it would be easy enough to detect the googlebot user agent, then if so, automatically redirect it to the page on the other end (or even send it to a random 404 page or something), all without processing the form data at all.

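      <?php
      // Match any Googlebot user agent instead of one exact string.
      $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
      if (strpos($ua, 'Googlebot') !== false) {
          // Send the crawler to a landing page without processing the form.
          header('Location: /landing_page.php');
          exit;
      } else {
          processtheform();
      }
      ?>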

      Of course, this would have to be implemented, which would be a PITA, but it seems to me that it would work just fine.

    8. Re:Forums, and "web 2.0" sites. by enoz · · Score: 2, Funny

      Any forum that can't stop a "good" bot is going to have spam all over it anyway from the "bad" ones... C'mon there's no point in Google launching a war against phpBB, there are more than enough spambots doing that already.

    9. Re:Forums, and "web 2.0" sites. by maglor_83 · · Score: 1

      It would have been even more insufferably clever if it hadn't already been done almost an hour before your post.

    10. Re:Forums, and "web 2.0" sites. by PyroMosh · · Score: 1

      It would work just fine, but so would a "nofollow" tag, which would be even easier to implement. The problem is that it is indeed a PITA to implement across the internet.

  8. And now that Captcha has been cracked... by UnCivil+Liberty · · Score: 1

    ...Google will rule the world

    --
    Distributed proteome folding @ WorldCommunityGrid.org
    Team Slashdot - Members:#1 Run Time:#1 Points:#1 Results:#1
  9. good and bad by ILuvRamen · · Score: 3, Insightful

    Well first of all, it's about time they learn how to read advanced sites! If your site is dependent on input from the user to display content, you're basically invisible to google. Now all they need is something to read text in flash files and they've got something going. But on the other hand, this is almost auto-fuzzing which could be considered hacking and I bet they'll often get results they didn't intend to and expose data that's supposed to be protected and private.

    --
    Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
    1. Re:good and bad by QuoteMstr · · Score: 5, Insightful

      And should we not make any progress because we might step on a few toes while doing it? If Google can get into your uber-secret-private-database, so can a random user, or a random Russian cracker. Fix your damn site if you're worried about this particular attack.

    2. Re:good and bad by Bogtha · · Score: 4, Insightful

      Now all they need is something to read text in flash files and they've got something going.

      They've indexed Flash for about four years now.

      I bet they'll often get results they didn't intend to and expose data that's supposed to be protected and private.

      No doubt. There are a lot of clueless developers out there who insist on ignoring security and specifications time and time again. I have no sympathy for people bitten by this, you'd think they'd have learnt from GWA that GET is not off-limits to automated software.

      --
      Bogtha Bogtha Bogtha
    3. Re:good and bad by Metasquares · · Score: 1

      expose data that's supposed to be protected and private
      Ugh, it's the friend class of the entire Internet!
    4. Re:good and bad by martin-boundary · · Score: 2, Funny

      Fix your damn site if you're worried about this particular attack.
      Nope. I'll just refer them to the DMCA anti circumvention provisions. Let those damn phd kids fix their damn algorithms or get the hell off my damn lawn :)
    5. Re:good and bad by inline_four · · Score: 1

      What I think is really needed is the ability to process JavaScript. Take Google's own WebKit -- a page it produces will contain next to no interesting static HTML to index, while all the real content gets loaded through JavaScript, well before any kind of user interaction. A page like that is invisible to most web crawlers as far as I know.

      --
      Alexey
  10. Google, consider this... by Kiralan · · Score: 0

    Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create? I speak mainly of feedback forms, e-mail signups, and the like. Also, what about the excess click-throughs that some websites may be paying an outside entity for? Finally, what of the time spent by IIS in examining the logs for yet another anomaly? Maybe these are unlikely possibilities, or maybe not, but it will come back to affect your image. Just a thought exercise: Consider the fun to be had in leading Google through dynamically generated pages, when a Google Deep Web crawler comes to visit >:-)

    --
    V for Vendetta: People should not be afraid of their governments. Governments should be afraid of their people.
    1. Re:Google, consider this... by chris_mahan · · Score: 1

      The best would be for the app to be hosted on google appengine, then it would take the app down, and the culprit would be google. So when google comes and bills you for bandwidth, CPU and storage usage, you bill them right back, citing Bot activity.

      hehehe

      --

      "Piter, too, is dead."

    2. Re:Google, consider this... by poot_rootbeer · · Score: 3, Insightful

      Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create? I speak mainly of feedback forms, e-mail signups, and the like.

      If your site uses GET for a non-idempotent action like sending a feedback form or signing up for an email newsletter, you're doing it Wrong.

    3. Re:Google, consider this... by Kristoph · · Score: 3, Funny

      Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create?

      If any forms which feed your DB are GET style, aren't user authenticated and/or don't use a CAPTCHA, then you already have a huge trash data problem. At least the googlebot won't offer to enlarge your penis.

      ]{

  11. This could cause problems by tehcmn · · Score: 0, Redundant

    They'll have to be careful how they go about this. If they start filling in forms with bogus data on blogs, forums etc., there are going to be a lot of pissed off website owners out there. Just imagine the number of admins who'll have to update their robots.txt for this. Just my 2c.

    1. Re:This could cause problems by profplump · · Score: 1

      They are only submitting forms with a GET method. According to the HTTP specs, GET requests should always be idempotent. If you've got forms that use the GET method and aren't idempotent you should *already* be taking extra precautions to avoid accidental use by bots and other automated tools.

    2. Re:This could cause problems by GryMor · · Score: 1

      You do realize that 'delete' is idempotent, right?

      Idempotence simply requires that:
      f(STATE) == f(f(STATE))
      It doesn't require that:
      STATE == f(STATE)

      So Idempotent actions can cause state changes, such as deleting an item.
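A minimal sketch of the distinction drawn above, in Python. The helper names (`delete_item`, `append_item`, `is_idempotent`, `is_safe`) are made up for illustration; the point is only that a delete satisfies f(STATE) == f(f(STATE)) while still changing STATE, whereas something like appending a comment is not idempotent at all.

```python
# Hypothetical helpers illustrating "idempotent" versus "leaves state unchanged".

def delete_item(state: frozenset) -> frozenset:
    """Delete the item 'page-1' from the state (a set of page names)."""
    return state - {"page-1"}


def append_item(state: tuple) -> tuple:
    """Append a new comment to the state (a sequence of comments)."""
    return state + ("new comment",)


def is_idempotent(f, state) -> bool:
    """True when applying f twice gives the same result as applying it once."""
    return f(state) == f(f(state))


def is_safe(f, state) -> bool:
    """True when applying f does not change the state at all."""
    return f(state) == state


pages = frozenset({"page-1", "page-2"})
comments = ("first comment",)

print("delete idempotent:", is_idempotent(delete_item, pages))
print("delete safe:      ", is_safe(delete_item, pages))
print("append idempotent:", is_idempotent(append_item, comments))
```

So a crawler that repeats an idempotent request does no extra damage on the second try, but the first try can still wipe out data.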

      --
      Realities just a bunch of bits.
  12. What about register forms? by Anonymous Coward · · Score: 0, Flamebait

    Does that mean I'll have to introduce methods that waste people's time in order to prevent google from registering on my site multiple times?

    1. Re:What about register forms? by stephanruby · · Score: 2, Informative

      Does that mean I'll have to introduce methods that waste people's time in order to prevent google from registering on my site multiple times?
      Yes, if you require all your human visitors to read your robots.txt, and then require them to check a checkbox to mean that they clearly read and understood the entire body of your robots.txt. Then yes, you'll have to introduce some sort of almost impossible-to-read translucent captcha written in classical Chinese.
  13. I'm in your Intarwebs by Mathus · · Score: 2, Funny

    Cracking your forms. Sorry, could not help myself.

  14. robots.txt by B3ryllium · · Score: 4, Funny

    Okay, so how long until the spec for robots.txt is updated to have a "DontBeStupid" directive?

  15. Note to self... by fahrbot-bot · · Score: 3, Funny
    our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML...

    ...post invoice forms ordering expensive items to be shipped to Google. Be sure to log incoming IP addresses for verification.

    --
    It must have been something you assimilated. . . .
  16. Heisenberg for web by gmuslera · · Score: 1

    Even though you can already have links that perform actions and change information, submitting forms is a good recipe for massive changes, from comment spam to anything else; the sky is the limit.

    Now you can't see what is on the web, by crawling, without changing it.

  17. They'll make it your fault. by Anonymous Coward · · Score: 0

    Here's how Google will respond when you complain to them about junk data in your forms: "We're sorry to hear about the problems with the way GoogleBot indexes your web site. Please note that GoogleBot strictly follows the robots exclusion standard and found no indication that your forms were not suitable for being accessed by automated processes. To avoid unwanted accesses, please update your robots.txt to correctly indicate which forms you don't want to be accessed by GoogleBot. Our webmastertools-service can help you make these updates."

  18. Google() {Google();} by unforkable · · Score: 1

    So it will search recursively through .... Google. Or probably benefit from altavista/yahoo/... results. (just joking).

    1. Re:Google() {Google();} by TheRaven64 · · Score: 1

      I'm surprised it isn't doing this already, from the number of 'search results' pages an average Google search turns up.

      --
      I am TheRaven on Soylent News
  19. The Internet is for Porn by kiehlster · · Score: 5, Funny

    If you haven't already noticed, AdSense has features now to tell Google how to log into your website so it can catalog your user-only pages. You know what that means. Porn sites are going to start using this so that Googlebot can confirm that its age is over 18. We'll be showered with a gigantic wave of pornographic information. We will soon have to press juvenile charges against a corporate entity because it lied about its age on web forms to gain access to pornography and forum discussions.

    1. Re:The Internet is for Porn by CheeseTroll · · Score: 1

      Didn't you learn during the 90s that dotcoms age in Internet Years? :-)

      --
      A post a day keeps productivity at bay.
    2. Re:The Internet is for Porn by Anonymous Coward · · Score: 0

      We'll be showered with a gigantic wave of pornographic information.
      Will those showers be golden?

  20. directions like 'nofollow' are still respected by frovingslosh · · Score: 5, Informative
    Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search.

    Maybe they shouldn't be, at least not in all cases. Several years back I had done many Google searches for some information that was very important to me, but never could find anything. Then a few months later (too late to be of use), pretty much by a fortunate combination of factors but with no help from Google, I came across the exact information, on a .GOV website in a publicly filed IPO document. As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record. When these nofollow directives are overused by mindless and unaccountable bureaucrats, perhaps someone needs to make the decision that these records should be public and that isn't best served by hiding them deep down a long list of links where they are hard to locate. In cases like this I would applaud any search engine that ignores the "suggestion" not to index public pages just because of an inappropriate tag in the HTML. In fact, if I knew of any search engine that was indexing in spite of this tag, I would switch to them as my first choice search engine in an instant. For starters, I would suggest that any .GOV and any State TLD website should have this tag ignored unless there were a darn good reason to do otherwise.

    --
    I'm an American. I love this country and the freedoms that we used to have.
    1. Re:directions like 'nofollow' are still respected by QuantumHobbit · · Score: 2, Insightful

      But they don't want you to find out that the moon landing was faked and that Jimmy Hoffa shot Kennedy while driving a car that runs on water. I agree with you. If you don't want people to know something, don't put it on the web. If you want people to know, put it on the web and let google send the people to you. It's all bureaucracy inaction.

    2. Re:directions like 'nofollow' are still respected by Christophotron · · Score: 3, Interesting

      As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record.

      I'd mod you up if I had some points. I'm sure there are ethical implications or something when it comes to respecting the website owner's wishes not to index, but it's all public information anyway. If it's on the web and I can look at it, then Google should be able to look at it and index it.

      I had no idea that government sites don't allow themselves to be indexed. That is BULLSHIT. People often NEED information from .gov sites and ALL of it should be made easy to find. Refusing to allow indexing such information is akin to hiding or obfuscating it: you don't actually want anyone to read it or anything, but you can say it's available on the web so your ass is covered. IMO there should be a law stating that all of .gov MUST be indexed by search engines.

      Is there a law saying that search engines MUST follow these robots.txt, nofollow, etc? If it's not breaking the law, then Google should have some serious competition. A new search engine that indexes ALL VIEWABLE SITES regardless of the owner's wishes would be fucking great.

    3. Re:directions like 'nofollow' are still respected by bigbigbison · · Score: 1

      While I don't see Google doing it because of the backlash I'm a bit surprised that no other search engine has touted ignoring "nofollow" and "noindex" as a "feature" of their search engine in the attempt to look like they are better than Google.

      --
      http://www.popularculturegaming.com -- my blog about the culture of videogame players
    4. Re:directions like 'nofollow' are still respected by Anonymous Coward · · Score: 0

      A new search engine that indexes ALL VIEWABLE SITES regardless of the owner's wishes would be fucking great.

      There are still sites that should never be indexed for the content that they hold. Sure, sure, Gov sites shouldn't contain any of the content that I speak of, but there are sites that work hard to keep children out of them via registration with SurfWatch, CyberPatrol, CyberSitter, Net Nanny and RSACi (among others).
    5. Re:directions like 'nofollow' are still respected by STrinity · · Score: 1

      Is there a law saying that search engines MUST follow these robots.txt, nofollow, etc?
      No, only Internet standards. No need to follow those antiquated things. Google can become the search equivalent of IE.
      --
      Les Miserables Volume 1 now up with my reading of
    6. Re:directions like 'nofollow' are still respected by enoz · · Score: 1

      The search equivalent to IE.... so being the dominant player, using a feature-limited interface, and prone to leaking private information?

      I think Google is already there.

    7. Re:directions like 'nofollow' are still respected by Anonymous Coward · · Score: 0

      One good workaround is the fact that most government sites (all?) are public domain and fall under various laws like FOIA, Sunshine, etc.

      That means they are fair game to any type of technological fix used to get the information from them. If it is published on the web, it is public record.

      Mirror the entire site, and strip the nofollow tags. It would be a useful public service.

    8. Re:directions like 'nofollow' are still respected by Arancaytar · · Score: 1

      What about the millions and millions of search spam comments on blogs that are only kept in check by nofollow?

  21. ROBOTS.TXT & CONTENT="NOINDEX", "NOFOLLOW" by Chyeld · · Score: 1

    http://www.robotstxt.org/

    Dang, that was hard. Damn you, GOOGLE! Damn you to HELL! You blew it up! You finally blew up the web!

    Or not.

  22. sites can still be excluded by nurb432 · · Score: 1, Flamebait

    Wimps. Index it all, who cares if the site doesn't want it. If it's public facing it deserves to be indexed.

    --
    ---- Booth was a patriot ----
    1. Re:sites can still be excluded by Christophotron · · Score: 1
      +1 hell yes. Sure, indexing may be abused at the expense of clueless web developers, but they'll clean up their act very rapidly in the wake of all the security breaches.

      Non-indexing may be abused as well. As someone said in an earlier comment, .gov sites like to disallow indexing. What possible purpose could this serve other than to make people's lives miserable when dealing with the government?

      IANA(web developer), but I never understood the point of robots.txt crap. Why put the site up if you don't want people to find it?

    2. Re:sites can still be excluded by Omnius · · Score: 1

      IANA(web developer), but I never understood the point of robots.txt crap. Why put the site up if you don't want people to find it?
      Then as a web developer I'll explain it to you. I pay for the storage of my web site as well as every single byte that goes in or out of my web site (bandwidth). So, every query that is done against my web site by a query engine (or a user) costs me money. Generally I am willing to spend that money to get my content indexed in the various search engines, but it should be MY CHOICE since it is MY MONEY. The way I limit that today is using robots.txt and other techniques. Now, if the search engine wants to pay me to index my content, that's another thing entirely.

      BTW, I agree that government sites should not use exclusion rules for public data, but the right thing to do is to complain to the oversight committee for that government web site, not blame the search engines.

    3. Re:sites can still be excluded by nurb432 · · Score: 1

      Then don't put your site public facing.

      --
      ---- Booth was a patriot ----
    4. Re:sites can still be excluded by Anonymous Coward · · Score: 0

      The robots exclusion standard (aka robots.txt) was designed to prevent technical problems which occurred when automated processes tried to crawl very large URL spaces, like those of database backed websites with dynamically generated URLs. This original intention of the robots exclusion standard is still used today to trap spiders which ignore the standard: Some websites have infinite junk page trees linked invisibly and excluded in robots.txt. So instead of just blocking the IP address of an ignorant spider, which doesn't help much against distributed spiders running on botnets, these websites "poison the harvest", so to speak.

    5. Re:sites can still be excluded by Anonymous Coward · · Score: 0

      Don't be an idiot. It's his site, he can do what he wants with it. The alternative to a civilized declarative standard is server-side robot-detection and exclusion. It can be done and it is done, but it's an arms race. If you are not keen on constantly proving that you're not a machine, follow the standard.

    6. Re:sites can still be excluded by WeblionX · · Score: 1

      You mean besides the fact that every bloody search engine doesn't need to waste my resources poking about my forums viewing every single bloody post? My SQL server has enough problems when it's just users; blocking search engines with robots.txt cut out 90% of error reports of the SQL server failing in some manner.

      --
      (\(\
      (=_=) Bani!
      (")")
    7. Re:sites can still be excluded by mysidia · · Score: 1

      Or block all the IP ranges belonging to whichever search provider is foolish enough to ignore your wishes and apply automated crawling anyways, and pursue legal action.

      You probably forget that it's easy to set up tarpits designed to make robots have an infinite number of pages to index.

      It's also possible to make links that are invisible to humans not viewing page source, but if clicked by a robot will automatically ban that ip from accessing the web server.

    8. Re:sites can still be excluded by Omnius · · Score: 1

      It's also possible to make links that are invisible to humans not viewing page source, but if clicked by a robot will automatically ban that ip from accessing the web server.
      This might work if you don't want to be indexed at all, but usually you have a part of your public site that you want to be indexed and part of it that you don't want them to index. Usually banning an IP comes only after examining error logs and finding a heinous abuser and is not something I would feel comfortable doing in an automated fashion as you could end up banning sites and people who should be your friend (remember that proxies, potentially one IP, are often not just one person and could be thousands and some of those could be programs).
    9. Re:sites can still be excluded by danielsfca2 · · Score: 3, Insightful

      I never understood the point of robots.txt crap. Why put the site up if you don't want people to find it?
      Well I'm glad you asked. The presence (and continued following) of the robots.txt standard is crucial for these reasons:

      - Scripts with potentially infinite results. If you have a calendar script on your site that shows this month's calendar with a link to "next month" and "previous month", then without robots.txt the search engine could index back into prehistoric times and past the death of the Sun, with blank event calendars for each month. This is stupid. With your robots.txt file you tell the spider which URLs it is in BOTH your best interests not to crawl. You save server resources and bandwidth, Google saves their time and resources.

      - If you have a duplicate copy or copies of your site for development, or perhaps an experimental "beta" version of your site, you don't want it competing with the real site for search engine placement, or worse, causing SE spiders to think you're a filthy spammer with duplicate content all over the place. So you disallow the dupes with robots.txt. Now sure ideally that server could be inside your firewall instead of on the Web, but it gets more challenging when your dev team is on a different continent.

      - Temporary crap that has no value to the outside world, once again, it's a waste of both yours and the search engine's time to index it.

      The above are all reasons why you might want some or all of the content on a site not indexed.
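A rough sketch of the three cases above, checked with Python's standard `urllib.robotparser`. The paths `/calendar/`, `/beta/` and `/tmp/`, and the `example.com` host, are invented for the example; a real site would list whatever URL prefixes match its own calendar script, dev copies and temporary junk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt covering the three cases: an endless calendar,
# a duplicate "beta" copy of the site, and temporary files.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /beta/
Disallow: /tmp/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

for path in ("/calendar/2999/12", "/beta/index.html", "/tmp/scratch.txt", "/articles/1"):
    allowed = parser.can_fetch("Googlebot", "http://example.com" + path)
    print(path, "allowed" if allowed else "disallowed")
```

A well-behaved spider runs a check like this before every fetch, which is why the file only helps against crawlers that choose to follow the standard.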
  23. stores by sveard · · Score: 1

    How about online stores? Google is going to get some merchandise...

  24. Fuzzing the world by corsec67 · · Score: 2, Insightful

    Sweet, now Google will be Fuzzing the entire web.

    How will this work for forms that perform translations, validations and similar kinds of operations on other websites? Try to pull the entire internet through each such site it finds?

    And then not every web development environment forces GET to not change data. In Ruby on Rails, adding "?method=post" to the end of a url fakes a post, even though it is actually a GET, which I disabled in the company I work for. Not everyone is going to do that.

    --
    If I have nothing to hide, don't search me
    1. Re:Fuzzing the world by Bogtha · · Score: 2, Insightful

      In Ruby on Rails, adding "?method=post" to the end of a url fakes a post, even though it is actually a GET, which I disabled in the company I work for. Not everyone is going to do that.

      More precisely: Not everyone has been doing that. I'm sure when Google comes along and exposes all their bugs they will quickly take the hint.

      I don't really see the problem. The developers who know what they are doing, like you, won't be adversely affected, while the incompetent developers have to scurry around fixing their bugs every time something like this happens.

      --
      Bogtha Bogtha Bogtha
    2. Re:Fuzzing the world by mysidia · · Score: 1

      It's not exactly smart behavior for Rails to allow faking a POST.

      But use of Rails is such a common case that Google, being as non-evil as they are, will most likely take the sensible action and avoid submitting GET forms that have a 'method=post' attribute, when they find out about that.

      Nothing actually forces them to discriminate only on GET/POST. They can also check form values and make sure they don't submit a GET form that has an attribute like 'username', 'message', 'subject', or 'e-mail address'

    3. Re:Fuzzing the world by Bogtha · · Score: 1

      I wouldn't be surprised if they did that, after all they did a similar thing with GWA and URLs with query strings. But I can't help but think it's a silly path to take. It makes an "unwritten rule" of HTTP that certain magic strings are off-limits, and of course, no specification contains a list of these magic strings, you have to reverse engineer other software for them.

      --
      Bogtha Bogtha Bogtha
  25. Evil Bot by Arancaytar · · Score: 1

    For text boxes, our computers automatically choose words from the site that has the form


    And a few relevant URLs from helpful sponsors?

    Now you just need to hire a few sweatshop workers to get past those pesky captchas...
    1. Re:Evil Bot by geekangel · · Score: 1

      Getting past CAPTCHAs... isn't that what Amazon's Mechanical Turk is for?

  26. Username == Username by QuantumHobbit · · Score: 1

    This explains the sudden increase in users registered as Username==Username, with Password ==Password across the interwebs. To reply please send an email to name@domain.com

  27. Anecdote from Google by arrrrg · · Score: 5, Funny

    When I interned at Google, someone told me a funny anecdote about a guy who emailed their tech support insisting that the Google crawler had deleted his web site. At first, I think he was told that "Just because we download a copy of your site, doesn't mean your local copy is gone." (à la obligatory bash.) But the guy insisted, and finally they double checked and his site was in fact gone. Turns out that it was a home-brewed wiki-style site, and each page had a "delete" button. The only problem was, the "delete" button sent its query via GET, not POST, and so the Google spider happily followed those links one-by-one and deleted the poor guy's entire site. The Google guys were feeling charitable and so they sent him a backup of his site, but told him he wouldn't be so lucky the next time, and he should change any forms that make changes to POSTs -- GETs are only for queries.

    So, long story short, I wonder how Google will avoid more of this kind of problem if they're really going off the deep end and submitting random data on random forms on the web. Like the above guy, people may not design their site with such a spider in mind, and despite their lack of foresight this could kill a lot of goodwill if done improperly.
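To make the anecdote concrete, here is a small sketch in Python of a wiki-style handler that refuses to delete on GET. Everything in it (`handle_request`, the `view` and `delete` actions, the 405 response) is a made-up illustration, not how the site in the story or any particular framework works.

```python
from typing import Dict, Tuple


def handle_request(method: str, action: str, page: str,
                   pages: Dict[str, str]) -> Tuple[int, str]:
    """Toy request handler: reads are allowed on GET, changes only on POST."""
    if action == "view":
        if page in pages:
            return 200, pages[page]
        return 404, "no such page"
    if action == "delete":
        if method != "POST":
            # A spider following plain links only ever sends GET, so it ends up here.
            return 405, "delete requires POST"
        pages.pop(page, None)
        return 200, "deleted"
    return 400, "unknown action"


def crawl(pages: Dict[str, str]) -> None:
    """Simulate a spider that GETs every link on the site, 'delete' links included."""
    for name in list(pages):
        handle_request("GET", "view", name, pages)
        handle_request("GET", "delete", name, pages)


wiki = {"Home": "welcome", "About": "about us"}
crawl(wiki)
print(sorted(wiki))  # both pages survive the crawl

print(handle_request("POST", "delete", "About", wiki))
print(sorted(wiki))
```

With the check removed, the same `crawl` call would empty the dictionary, which is roughly what happened to the poor guy's wiki.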

    1. Re:Anecdote from Google by Arimus · · Score: 1

      [blockquote]
      end and submitting random data on random forms
      [/blockquote]

      Sod worrying about zapping sites, what will happen when they crawl the nuclear launch site and enter random data into the authorisation field, and in a rare feat of sod's law end up getting the code just right....

      (oh and what's the betting they'll put redmond in as a target string?)

      --
      --- Users are like bacteria -> Each one causing a thousand tiny crises until the host finally gives up and dies.
    2. Re:Anecdote from Google by Colonel+Korn · · Score: 0, Troll

      The world needs web hosts that block all Google IPs.

      --
      "I zero-index my hamsters" - Willtor (147206)
    3. Re:Anecdote from Google by Anonymous Coward · · Score: 0

      They already exist, but you probably can't find them... since they're not in Google.

    4. Re:Anecdote from Google by IdeaMan · · Score: 1

      Google-fu young padawan lacking are.

      --
      They ARE out to get you simply because They are in it for themselves and they don't care about you.
    5. Re:Anecdote from Google by RyoShin · · Score: 1

      This could be the incident you speak of. :)

      (Or at least super similar.)

    6. Re:Anecdote from Google by sootman · · Score: 1

      That happened to me on a database demo site that I did. The 'edit,' 'details,' and, yes, 'delete' buttons were just plain old text links. I posted the URL of the page to a mailing list, Google came in through that, and methodically 'clicked' on each link, including the 'delete' ones. (There was even a confirmation page with 'Are you sure you want to delete this? _Yes_ or _No_'--as links, of course.) I went to show someone it one day and all the data was gone. It was just sample data, so no great loss. I figured it was just some bored person who deleted everything but I looked at my access_log and there was Googlebot all over the place.

      Sort of like the server version of BattleBots. Coming to Comedy Central this fall!

      --
      Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
    7. Re:Anecdote from Google by garphik · · Score: 1

      ... but told him he wouldn't be so lucky the next time
      It's as if they rule the world

      wide web

  28. What's the use? by cppgenius · · Score: 1

    The stuff behind forms is normally of no use to someone doing a search on Google, so how does this fit in with their "Do-No-Evil" motto? What's the use of indexing a confirmation page to a support ticket system? Is someone going to do a search for: "A support ticket has been created. Your reference number is .... bla... bla... bla..."? Anyway, how do you expect someone to visit a dynamic confirmation page without filling out a form? Is Google going to hack our CAPTCHA scripts and anti-spam measures just to get past our forms? "Nevertheless, directions like 'nofollow' and 'noindex' are still respected". It's like the stupid CAN SPAM law: spammers are allowed to spam us until we tell them to stop. Google automatically allows itself the privilege of fiddling with our e-mail forms until we tell them to stop.

    --
    www.cybertopcops.com
    1. Re:What's the use? by aXis100 · · Score: 1

      I'd say there is plenty of value behind forms. They're not just for submitting an application, some places use them as a navigation front end.

      What about online stores with combos / search fields, but no direct index?
      What about forums with a guest login?

    2. Re:What's the use? by I'm+Don+Giovanni · · Score: 1

      How would your example of "online stores with search fields" work? Google sends millions of different search queries to the server and indexes each result?

      The idea of sending automated form submissions seems very spammy.

      Seems kind of weird (and wrong, frankly) to force another's server to handle automated bogus form submissions.

      If this were an opt-in thing, then sure, those that want the content that resides behind forms to be indexed by google could opt for that. But google is making this an opt-out thing. If the web manager has content that is only accessible via forms, then he probably doesn't intend for that content to be indexed. Google is saying, "Well use 'nofollow' links or edit the robots.txt file or whatever" (I'm not a web dev so I know next to nothing about such things), but it still forces the webdev to go back and "fix" his site to prevent form-hidden content from being googlized, which he didn't have to do before.

      Anyway,

      --
      -- "I never gave these stories much credence." - HAL 9000
  29. Correct me if I am wrong........ by Anachragnome · · Score: 1

    .....The first thing that popped into my head was someone out there figuring out how to use this to access password protected sites/accounts.

    "Hey! Look at this! I googled "World of Warcraft Forums" and just got 10 million hits, all logged in as a user!"

  30. Saw this a few months ago.. by Kenny+Austin · · Score: 1

    I saw this a few months ago while grepping through our apache log. Googlebot was submitting search requests for some weird stuff to our online catalog (for example: "Ctnblnd"). After some research I found that Googlebot was the only client which had ever searched for most of these terms and that they were abbreviations that our accounting department uses. I was guessing that they were doing something like this in the lab for words that they "didn't know" but ultimately put our search url into robots.txt because I didn't like our search results showing up in theirs.

    1. Re:Saw this a few months ago.. by corsec67 · · Score: 1

      ... I didn't like our search results showing up in theirs.

      And I hate it when a search result goes to... another page of search results. "You searched for 'perpetual motion engine'. Here are links to pages of us doing that search on other sites as well." Not very useful.

      It isn't easy to programmatically tell the difference, but it seems like this would make that happen much more often.
      --
      If I have nothing to hide, don't search me
  31. I bet I know what's next by 93+Escort+Wagon · · Score: 1

    In a few months, there'll be a new blog post - Google will attempt to access and index all sites' password-protected pages by matching usernames found elsewhere on the site (e.g. from email addresses) with intelligent guesses at passwords, based on information it's gleaned regarding those individuals. Failing that, it'll run through entries found in various cracker dictionaries.

    --
    #DeleteChrome
  32. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  33. In other news, by mbstone · · Score: 1, Insightful

    Google has announced that Google Phones (beta) will soon unveil the results of its having wardialed all 6,800,000,000 U.S. telephone numbers. Visitors to the Google Phones site will be able to search individual phone numbers to determine (without personally dialing the number) whether the number belongs to a landline telephone, cell phone, fax, or modem.

    On phone numbers where a VMS is detected, Google plans to dial "#0#" and other codes in order to determine how to reach a human.

    "Since we are a big, rich entity, the laws don't apply to us. We can do black-hat hacking exploits that would cause law enforcement to raid your home if you did the same thing," said a Google spokesman.

    1. Re:In other news, by Rogan's+Heroes · · Score: 0, Troll

      Well if you're a stupid enough developer that someone can hack your site by purely using GET requests, then you probably deserved it.

  34. Uh, you missed one critical point by cppgenius · · Score: 1

    As per Google Webmaster Central:

    "Similarly, we only retrieve GET forms and avoid forms that require any kind of user information. For example, we omit any forms that have a password input or that use terms commonly associated with personal information such as logins, userids, contacts, etc. We are also mindful of the impact we can have on web sites and limit ourselves to a very small number of fetches for a given site."

    Stuff like login forms, contact forms or forms for user generated content should be using the POST method not GET, so there shouldn't be any concern for web developers who know how to design their sites. If you are using GET in the wrong places, then it is your own fault.
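    To make the quoted policy concrete, here is a rough sketch of the kind of filter it describes: keep only GET forms, and skip any form with a password input or a field name that looks personal. This is not Google's code; the term list is an assumption based on the examples in the quote, and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser

# Terms loosely taken from the quoted post. The real list is not public,
# so this tuple is an assumption.
SENSITIVE_TERMS = ("login", "userid", "contact", "password")

class FormScanner(HTMLParser):
    """Collect, per <form>, its method and whether it looks personal."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per form, in document order
        self._current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                # HTML forms default to GET when no method is given.
                "method": (attrs.get("method") or "get").lower(),
                "personal": False,
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            name = (attrs.get("name") or "").lower()
            kind = (attrs.get("type") or "text").lower()
            if kind == "password" or any(t in name for t in SENSITIVE_TERMS):
                self._current["personal"] = True

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None

def crawlable_actions(html: str) -> list:
    """Return the actions of forms that are GET and not personal."""
    scanner = FormScanner()
    scanner.feed(html)
    return [f["action"] for f in scanner.forms
            if f["method"] == "get" and not f["personal"]]

if __name__ == "__main__":
    page = """
    <form action="/help/search" method="get"><input type="text" name="q"></form>
    <form action="/login" method="post"><input name="user"><input type="password" name="pw"></form>
    <form action="/lookup" method="get"><input name="userid"></form>
    """
    print(crawlable_actions(page))  # ['/help/search']
```

    Only the plain search box survives: the login form is POST, and the lookup form asks for a userid.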

    The moral of this story: read the actual post/article in detail before overreacting to something posted out of context by a slashdotter. (And yes, I'm also guilty of this.)

    --
    www.cybertopcops.com
  35. It suggests... by Web+Goddess · · Score: 1

    ...that Google's Deep Crawl is already emulating the kidd33z.

  36. Opt IN by Omnius · · Score: 1
    It seems to me that a much better plan would be to extend robots.txt to include a way for web sites to OPT IN to having their form-fronted "deep" data indexed. This would make sure that only sites that are ready for this kind of intrusion (and have data worth indexing behind their forms) get indexed. Why go for an OPT OUT methodology when the vast majority of the forms on the web front for stuff that wouldn't benefit from indexing?

    Also, note that Google is not being altruistic when they say they will only process GET forms. From a programming POV, yes, it is no harder for them to submit POST forms than it is to submit GET forms. The problem is that they index their resulting data by storing URLs (which a GET request provides and a POST does not) so they would have no way to redirect a person clicking on the result list from Google to the POST form results (that's just not supported by the browsers). So we are talking about a technical limitation, not an altruistic self-limitation.
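    A small sketch of the URL point, using only Python's standard `urllib.parse`. The form action and field name are made up for illustration. A GET submission is entirely described by its URL, so it can be stored and replayed as an ordinary link; a POST submission carries its fields in the request body, so the URL alone doesn't reproduce it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def get_form_url(action: str, fields: dict) -> str:
    """Build the URL a browser would request for a GET form submission."""
    return action + "?" + urlencode(fields)

def fields_from_url(url: str) -> dict:
    """Recover the submitted fields from a stored URL."""
    return parse_qs(urlsplit(url).query)

if __name__ == "__main__":
    # Hypothetical search form on a hypothetical site.
    url = get_form_url("http://example.com/help/search", {"q": "music teacher"})
    print(url)                   # http://example.com/help/search?q=music+teacher
    print(fields_from_url(url))  # {'q': ['music teacher']}
```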

    1. Re:Opt IN by dave420 · · Score: 2, Informative

      Of course they could link to a site and make the browser perform a POST. That's trivial. A form and some javascript will do that no problem. They seem to not be doing that because GET forms should be non-destructive, whereas POST forms can be quite destructive.

  37. "getindex" Google Keyword? by Doc+Ruby · · Score: 1

    Repeatedly querying a site to extract every permutation of its form fields could produce far more results than the underlying data holds (think of the combinatorics of only 5 query fields of only 5 values each, against only a couple of hundred values in the database, as at many sites), and far too much traffic for small sites (and probably for big sites, too, if the number of query combinations scales at all with their traffic).
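    The arithmetic, as a quick sketch: 5 fields with 5 values each gives 5^5 = 3,125 distinct submissions. Taking "a couple of hundred" to mean 200 rows (an assumption), that is more than 15 requests per row of actual data.

```python
from itertools import product

FIELDS = 5            # query fields on the form (from the example above)
VALUES_PER_FIELD = 5  # selectable values per field
ROWS = 200            # "a couple of hundred" records, assumed to be 200

def count_queries(fields: int, values: int) -> int:
    """Count distinct form submissions by enumerating every combination."""
    return sum(1 for _ in product(range(values), repeat=fields))

if __name__ == "__main__":
    queries = count_queries(FIELDS, VALUES_PER_FIELD)
    print(queries)         # 3125
    print(queries / ROWS)  # 15.625
```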

    What could be even better, for sites that don't want to get that huge load just to have their data searchable in Google, would be a "getindex" keyword, rather than just "noindex", that specifies a URL from which the site's data index can be retrieved by Google. The getindex keyword would also point to a schema URL that would let Google decode the index.

    That way, the site in question could let all its data get added to Google's centrally searchable index, if it wants to allow it (otherwise, it would post "noindex"). Sites might even find themselves using Google's copy of their index for their own queries, rather than use CPU time querying their own local copy. Just like sites today use Google's index for searching their site's documents.

    All they'd do would be to regenerate their index whenever they want, and maybe ping a Google API at Google.com that reloads their index from their site back into Google's updated index. Such an "index hosting" service would quickly become the norm for many sites, just as searching sites by searching their document index at Google is the norm today, but would have been considered weird a decade ago.

    --
    make install -not war

  38. Now it all makes sense... by mtconnol · · Score: 1
    I've been seeing this behavior in my logs for months on my website (http://www.learningmusician.com/ - find a music teacher!) - now it makes total sense. In the Help section of my site, I use a form to search for a help topic - and for several months now, one #%(*&@% persistent machine from a Google subnet has been submitting all kinds of trash into that form - looks like just about every word from my site.

    Luckily, no damage done, since it's a harmless operation, but I am concerned about being penalized for a high number of "duplicate" pages since the response to each one is probably identical ("Sorry, no help is available on the topic "impediment to." and similar crap.)

    It also doesn't seem to know when to quit - it's responsible for more hits per month than I am, and as a nervous webmaster that's saying a lot.

    michael

  39. I'm seeing an image by cyberfunkr · · Score: 1

    I'm having flashbacks of The Venture Brothers, episode Twenty Years To Midnight.

    Google searching websites like the Grand Inquisitor -- IGNORE ME!

  40. Forms that create agreements by Russ+Nelson · · Score: 1

    The problem with their searching is a form like this one: http://quaker.org/users.cgi It's *meant* to keep people out unless they've entered into a legal agreement.

    --
    Don't piss off The Angry Economist
    1. Re:Forms that create agreements by dave420 · · Score: 2, Informative

      That is a POST form, which Google have said they will not mess with.

  41. Re:This *WILL* could cause problems by Anonymous Coward · · Score: 0

    > According to the HTTP specs, GET requests should always be idempotent.

    Wrong. It would be correct to say "according to the newer HTTP specs, GET requests should always be idempotent." I have pages still in production that I wrote before that requirement was added. While it is simple to change "METHOD=GET" to "METHOD=POST" in a static HTML text file, it's not so simple when I wrote some of the pages over twelve years ago and many are HTML generated with cgi programs that are compiled C using many external libraries. Most haven't been recompiled since then. My company made our big move for backend systems to Linux/C from MAI Basic Four in the fall of 1993, and moved our web site and ported all of our C cgi programs over to Linux the fall of 1995 from basic running on BSDi systems.

    My plan is to password protect the directories. A human could screw us up like this new Google bot so this change needs to be done anyway.

  42. How to become a millionaire in four easy steps! by PAjamian · · Score: 1
    1. Set up a shopping cart which is lax on security and uses GET forms instead of POST forms
    2. Put one item in the shopping cart, a used tic tac box for 1 million dollars (it's a collector's item)
    3. Wait for the google bot to buy the tic tacs with the corporate credit card
    4. Profit!!!!
    --
    Windows is a bonfire, Linux is the sun. Linux only looks smaller if you lack perspective.
  43. Eeek, you found me! by CheeseTroll · · Score: 2, Funny

    n/t

    --
    A post a day keeps productivity at bay.
  44. Like the dwarves... by Aggrav8d · · Score: 1

    Soon enough they will dig too deep and unleash a terror the world has long forgotten. Maybe a Deep Crow? How exciting! :)

  45. Title Correction by awyeah · · Score: 2, Insightful

    "Technology: Google fills your backend database with garbage"

    --
    Why, no, I haven't meta-moderated lately. Thanks for asking!
  46. Now that captchas are solved by Anonymous Coward · · Score: 0

    they should use them as well. And bugmenot. If only to show webmasters how pointless it is to require a captcha to search and a login to post to a forum. Anonymous cowards everywhere!

  47. They're in for a surprise when... by Dr.+Zowie · · Score: 1

    ... they hit the Solar Dynamics Observatory database next year. It'll be collecting several petabytes of images...

  48. good and bad-motivation by Anonymous Coward · · Score: 0

    Well first of all, it's about time they learn how to read advanced sites! If your site is dependent on input from the user to display content, you're basically invisible to google. A good thing for Slashdot.

  49. I can't wait by sentientbrendan · · Score: 1

    until the google trawler starts making its own first posts.

  50. Re:mod 04 by Hanners1979 · · Score: 1

    Googlebot, is that you?

  51. Conservation of server resources by patio11 · · Score: 1

    I run a Rails site. There is a particular action on the site which, at the moment, sits in my password protected admin's area. Two people accessing it simultaneously would lock the server for 2 seconds. That isn't a problem for me when I get about 1,000 users a day and the action is only accessible to a single admin, but it would not be unreasonable to push that action to the public site, because the chances of temporal collision between users are low and the cost for a collision is negligible.

    However, if you were sucking down 10k pages from my site with a spider, you could DOS my site pretty trivially. (Call it the Googlebot effect.) That's why you let me say "If you are capable of routinely generating page views with scale, feel free to go anywhere but Door #1."

  52. Robots.txt is not the answer by Anonymous+Brave+Guy · · Score: 2, Insightful

    Every time anyone raises a question like this, someone trots out robots.txt as if it is some sort of magic solution to all the potential problems. It is not.

    For one thing, it is voluntary. Google and some other major search engines may respect it today, but they are under no obligation to do so, nor to continue to do so if they do now.

    For another thing, depending on robots.txt makes the whole game opt-out. This is, IMHO, the wrong default for potentially unwelcome visits. We can't keep pretending it's OK to hit sites with huge increases in traffic because "if it's on the Internet, they expect people to visit". Sure, they expect people to visit — not automated systems run by companies with vast resources that can push a typical small site into paying for extra bandwidth or being taken off-line in a matter of minutes.

    It is not OK to just Slashdot a site out of the blue. It is not OK to aggressively attack every form on a site to see what you can find. It is not OK to set up a 1,000,000 computer botnet and then effect a DDoS attack against a web site your client doesn't like. It is not OK to send me so much spam that I have to waste hours of my life sorting through it to find legitimate e-mail. These are all variations of exactly the same principle: knowingly causing a huge, unexpected and potentially expensive or damaging increase in traffic to someone without their knowledge or consent. And most of them are already illegal in a lot of jurisdictions.

    It doesn't take a genius to spot that this is unethical behaviour, and it's long past time we stopped pretending it was OK because Google can Do No Evil(TM) and we like Slashdot. The current approach is unsustainable, and since the Internet's days as an unmetered, untaxed medium appear to be numbered in the current political climate, the sooner the robots.txt advocates get it, the better.

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    1. Re:Robots.txt is not the answer by firewrought · · Score: 1

      Every time anyone raises a question like this, someone trots out robots.txt as if it is some sort of magic solution to all the potential problems. It is not.
      Robots.txt defines the difference between good bots and bad bots. The fact that it's voluntary is not pertinent when you're talking about an ethical standard: all ethical standards are "voluntary". The fact that it's opt-out is not a significant problem when all you have to do is add two lines to a text file. TWO LINES!

      Here are some random facts you may find pertinent to your evaluation of this issue:
      • The WWW was originally created on the assumption that bots would crawl it. This is reflected in the HTTP verb semantics as well as HTML itself (all that H1, H2, H3 stuff was so that automated indices could be built).
      • In Blake vs. Google, Blake sued Google for crawling his site. The courts found Blake to be a dumbass for not using robots.txt.
      • If Google were to purposefully ignore robots.txt, it would come back to haunt them socially and legally.
      Finally, your post mentions no alternatives to robots.txt. What exactly are you advocating? Please make sure that you propose a workable solution with an advantage that outweighs its implementation cost (you'll probably need an order-of-magnitude gain to society to make this happen). Please provide a transition plan, especially if your solution is opt-in. Please describe the expected impact on the overall usability of the web and an estimate of how many sites would disappear from search engines as a result of your change.
      --
      -1, Too Many Layers Of Abstraction
    2. Re:Robots.txt is not the answer by Anonymous+Brave+Guy · · Score: 1

      It's not about two lines, it's about awareness — and I don't just mean knowing about robots.txt, I mean being aware of all possible eventualities that might adversely affect your site, being technically competent to configure your robots.txt such that they do not happen, and being able to deal with any sites that do not respect robots.txt anyway. I claim that a very substantial number of people who put web sites up, possibly even the majority, do not have this knowledge.

      For the lawsuit you mentioned, did you mean Field vs. Google? If so, the case of Blake Field is hardly representative, given that the guy pretty actively set Google up and then tried to extract money from them in court. He was not your average guy setting up a family home page, your average volunteer setting up a simple web app for a local charity, or your average small-time contractor setting up a basic e-commerce site for a local speciality store.

      As for alternatives, I am simply proposing that it should be necessary by default to opt-in to things that may be detrimental. I really don't care if that makes life harder for organisations that want to do those detrimental things. I also really don't care if it means some search engines won't index sites that don't opt-in.

      There is no reason to believe that switching to an opt-in model would mean the end of humanity as we know it. Millions of people have worked out how to put a web page on-line, complete with hand-crafted HTML or learning to use some mark-up on their hosting service, dealing with ISPs, and so on. I'm quite sure they could work out how to add a file or flick a switch on their hosting service's control panel saying "willing to be indexed" or something. (If they can't, it rather undermines the credibility of arguing that robots.txt is an acceptable mechanism to opt-out anyway.)

      I'm also not convinced that those who don't opt-in would be losing much. You seem to take it as an axiom that the Internet relies on search engines to be effective and that there would be some terrible cost to society by not implementing these disruptive things that we don't have at the moment and no-one really misses. Clearly I take a different position. We have no way of knowing what would have evolved or could still evolve if search engines were not as dominant as they are today. We do know that hyperlinks to related sites work very well as a way for people to find new information of interest, that numerous sites are willing to provide many such links to their visitors, that mass human scanning and indexing in centralised places is feasible (DMOZ), that centralised tools allowing people with similar interests to find sites you've enjoyed are feasible (StumbleUpon), and that alternative models such as social networks can become every bit as far-reaching and ubiquitous as search engines (see how fast the latest meme spreads through LiveJournal, Facebook and the like). It is far from clear to me that the world would be a worse place if Google hadn't arrived and other networking effects had effectively provided the same kind of functionality. After all, you're not going to get much of a page rank on Google until other credible sites have already linked to you anyway.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  53. Great! by dsouza42 · · Score: 2, Funny

    Now they'll finally be able to index all kinds of Google searches... oh, wait.

  54. Time to Reformat the Internet... by Homncruse · · Score: 1

    Looks like it's time to reformat the Internet. Sure, theoretically this shouldn't cause problems, but we all know (and many of us have probably been guilty of early in our development careers) that practice doesn't always follow theory.

  55. Freenet and Onion by Anonymous Coward · · Score: 0

    And I was hoping this would be about google finally indexing things like Freenet and the .onion domain.