Slashdot Mirror


Google Crawls The Deep Web

mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
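For readers wondering what "choose words from the site" and "choose from among the values of the HTML" could look like in practice, here is a minimal sketch. It is not Google's crawler: the class `FormCollector`, the function `candidate_urls`, and the `site_words` argument are all made up for illustration, and it only looks at text inputs and select menus on the first GET form of a page.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormCollector(HTMLParser):
    """Collects the first GET form's action and its candidate field values."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}      # field name -> list of values, or None for text boxes
        self._in_form = False
        self._select = None   # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and self.action is None:
            # Only GET forms are considered; POST forms are skipped entirely.
            if (a.get("method") or "get").lower() == "get":
                self.action = a.get("action") or ""
                self._in_form = True
        elif not self._in_form:
            return
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
            self.fields[self._select] = []
        elif tag == "option" and self._select and a.get("value") is not None:
            self.fields[self._select].append(a["value"])
        elif tag == "input" and (a.get("type") or "text") == "text" and a.get("name"):
            self.fields[a["name"]] = None  # filled from words found on the site

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False
        elif tag == "select":
            self._select = None


def candidate_urls(page_url, html, site_words):
    """Return one GET URL per combination of candidate field values."""
    parser = FormCollector()
    parser.feed(html)
    if parser.action is None:
        return []
    names = list(parser.fields)
    choices = [site_words if parser.fields[n] is None else parser.fields[n]
               for n in names]
    base = urljoin(page_url, parser.action)
    return [base + "?" + urlencode(dict(zip(names, combo)))
            for combo in product(*choices)]
```

Note that the number of URLs grows multiplicatively with the number of fields, which is one reason a real crawler would have to sample rather than enumerate.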

30 of 197 comments

  1. Just think! by scubamage · · Score: 5, Funny

    Soon, they'll start injecting SQL too to help map databases! Google is so useful indeed! :)

    1. Re:Just think! by AKAImBatman · · Score: 3, Funny

      Hmm... that reminds me of this DailyWTF. Who knew that Mr. Test User was such a big customer? :-P

    2. Re:Just think! by Lillesvin · · Score: 3, Informative

      ... maybe a borked machine?

      Yeah, maybe your machine... That SQL-error looks more like bad session handling on the server hosting your Drupal installation than Google trying to do an SQL-injection... Actually, it looks nothing like an SQL-injection at all. MySQL is merely being asked to insert a duplicate value in a column specified as unique (`sid`), which it refuses because it's not unique. Don't expect an answer, since it's most likely not an error on Google's end.
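To make the point about the unique column concrete, here is a small sketch. It uses SQLite from the Python standard library as a stand-in for MySQL, and the `sessions` table and `insert_session` helper are invented for the example; only the `sid` column name comes from the error being discussed.

```python
import sqlite3


def insert_session(conn, sid):
    """Try to insert a session id; return False if it is already present."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO sessions (sid) VALUES (?)", (sid,))
        return True
    except sqlite3.IntegrityError:
        # The database refuses the duplicate. No injection is involved.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (sid TEXT UNIQUE)")

print(insert_session(conn, "abc123"))  # True
print(insert_session(conn, "abc123"))  # False
```

The second insert fails purely because of the uniqueness constraint, which is what buggy session handling would trigger when it reuses an id.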

      A little more on topic though, what exactly is Google looking for there? I mean, what content (of any interest to anyone) is hiding behind forms? Many sites that require registration (like NY Times (IIRC) and others) already check if the UserAgent string is that of a Google crawler and let it index if so, in order for people to be able to search e.g. NY Times articles on Google but only read them if they register (or change their UserAgent string or use BugMeNot).

      And how does Google make sure they don't end up accidentally editing a crapload of wikis by filling out random forms on random sites and hitting submit?

      --
      "Live free or don't."
    3. Re:Just think! by Ariven · · Score: 5, Interesting

      I remember an article a while back where someone had cut/pasted some articles from one section of their site to another, and as a result had edit and delete links in the live content instead of on their internal web interface.

      And a search engine (I think it was Google) crawled the site, hit the delete links and deleted all the pages of the site. At that time it was stated that any link that performs an action, such as delete, should be a POST, via a form, so that search engines wouldn't do that very thing.

      And now, they are gonna start submitting forms? The fallout is gonna be entertaining.

    4. Re:Just think! by jc42 · · Score: 4, Interesting

      I had similar problems a few years ago. The database had a lot of data in a compact format, and I wrote some retrieval pages that would extract the data and run it through any of a list of formatters to give clients the output format they wanted. Very practical. Over time, the list of output formats slowly grew, as did the database. Then one day, the machine was totally bogged down with http requests. It turned out that a search site had figured out how to use my format-conversion form, and had requested all of our data in every format that my code delivered.

      Google wasn't too bad, because at least they spread the requests out over time. But other search sites hit our poor server with requests as fast as the Internet would deliver them. I ended up writing code that spotted this pattern of requests, and put the offending searcher on a blacklist. From then on, they only got back pages saying that they were blacklisted, with an email address to write if this was an error. That address never got any mail, and the problem went away.

      Since then, I've done periodic scans of the server logs for other bursts of requests that look like an attempt to extract everything in every format. I've had to add a few more gimmicks (kludges) to spot these automatically and blacklist the clients.
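The actual detection code isn't shown in the comment, so the following is only a rough sketch of the general idea, swapping in a plain sliding-window rate threshold for the pattern matching described. The class name `BurstBlacklist` and the default thresholds are assumptions picked for illustration.

```python
from collections import defaultdict, deque


class BurstBlacklist:
    """Blacklists a client that sends too many requests in a short window."""

    def __init__(self, max_requests=20, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # client -> timestamps inside the window
        self.blacklisted = set()

    def allow(self, client, now):
        """Record a request at time `now`; return False if the client is blocked."""
        if client in self.blacklisted:
            return False
        times = self.recent[client]
        times.append(now)
        # Drop timestamps that have fallen out of the window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        if len(times) > self.max_requests:
            self.blacklisted.add(client)
            del self.recent[client]
            return False
        return True
```

A crawler that spreads its requests out over time never trips the threshold, while one that hammers the server gets blocked and stays blocked until someone removes it from the set by hand.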

      I wonder if Google's new code will get past my defenses? I've noticed that Googlebot addresses are in the "no CGI allowed" portion of my blacklist, though they are allowed to retrieve the basic data. I'll be on the lookout for symptoms of a breakthrough.

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    5. Re:Just think! by dartarrow · · Score: 3, Informative
      --
      I love humanity, it is people I hate
  2. Bright Planet's DQM by eldavojohn · · Score: 3, Interesting

    Several years ago, I tried a demo of Bright Planet's Deep Query Manager that would essentially do these searches through a client on your machine in batch-like jobs. Oh, the bandwidth and resources you'll hog!

    Their stats on how much of the web they hit that Google missed were always impressive (true or not) but perhaps their days are numbered with this new venture by Google.

    Quite an interesting concept if you think about it. I always presupposed that companies would hate it but never got 'blocked' from doing it to sites.

    Here, suck up my bandwidth without generating ad revenue! Sounds like a lose situation for the data provider in my mind ...

    --
    My work here is dung.
  3. Oops... by JohnnyDanger · · Score: 5, Funny

    They just bought everything on Amazon.

    1. Re:Oops... by Bogtha · · Score: 4, Informative

      This won't post forms of that sort. In the blog post, they say that they are only doing this for GET forms, which are safe to automate as per the HTTP specification.

      This is for things like product catalogue searches where you pick criteria from drop-down boxes. Not so common for run-of-the-mill e-commerce sites, but I've seen a lot on B2B sites.

      --
      Bogtha Bogtha Bogtha
    2. Re:Oops... by orkysoft · · Score: 5, Insightful

      Unfortunately, there are tons of sites whose developers did not understand the part about GET being for looking up stuff, and POST being for making changes on the server.

      --
      I suffer from attention surplus disorder.
  4. Will it solve captchas? by lastninja · · Score: 4, Interesting

    only half kidding

    --
    John Carmack fan, browsing at +5 since 1999.
    1. Re:Will it solve captchas? by skraps · · Score: 5, Funny

      Just what we need, some 'bot adding its insightful comments based on other words in the same document.
      Are such questions on your mind often?

      ..then again, on most sites, would you be able to tell the difference between Google posting something and some 1337 kiddiez?!?!!1eleven?
      What does that suggest to you?
      --
      Karma: -2147483648 (Mostly affected by integer overflow)
    2. Re:Will it solve captchas? by urcreepyneighbor · · Score: 4, Funny

      You whore! You told me you loved me, Eliza! You said you'd call!

      --
      "The fight for freedom has only just begun." - Geert Wilders
  5. Forums? by fishybell · · Score: 5, Funny
    Well, I certainly hope that they put in some decent smarts to prevent it from making posts onto forums, blogs, /., etc.


    On the plus side, this should enable Google to get by the "Must be 18 to view" buttons ;)

    --
    ><));>
    1. Re:Forums? by spintriae · · Score: 3, Funny

      Google's only 12 years old. It shouldn't be visiting those sites.

  6. HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 5, Funny

    I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

    1. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 5, Funny

      I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

    2. Re:HELLO I AM GOOGLEBOT by Anonymous Coward · · Score: 4, Funny

      I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

  7. good and bad by ILuvRamen · · Score: 3, Insightful

    Well first of all, it's about time they learned how to read advanced sites! If your site is dependent on input from the user to display content, you're basically invisible to Google. Now all they need is something to read text in flash files and they've got something going. But on the other hand, this is almost auto-fuzzing which could be considered hacking and I bet they'll often get results they didn't intend to and expose data that's supposed to be protected and private.

    --
    Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
    1. Re:good and bad by QuoteMstr · · Score: 5, Insightful

      And should we not make any progress because we might step on a few toes while doing it? If Google can get into your uber-secret-private-database, so can a random user, or a random Russian cracker. Fix your damn site if you're worried about this particular attack.

    2. Re:good and bad by Bogtha · · Score: 4, Insightful

      Now all they need is something to read text in flash files and they've got something going.

      They've indexed Flash for about four years now.

      I bet they'll often get results they didn't intend to and expose data that's supposed to be protected and private.

      No doubt. There are a lot of clueless developers out there who insist on ignoring security and specifications time and time again. I have no sympathy for people bitten by this; you'd think they'd have learnt from GWA (Google Web Accelerator) that GET is not off-limits to automated software.

      --
      Bogtha Bogtha Bogtha
  8. robots.txt by B3ryllium · · Score: 4, Funny

    Okay, so how long until the spec for robots.txt is updated to have a "DontBeStupid" directive?

  9. Note to self... by fahrbot-bot · · Score: 3, Funny
    our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML...

    ...post invoice forms ordering expensive items to be shipped to Google. Be sure to log incoming IP addresses for verification.

    --
    It must have been something you assimilated. . . .
  10. The Internet is for Porn by kiehlster · · Score: 5, Funny

    If you haven't already noticed, AdSense has features now to tell Google how to log into your website so it can catalog your user-only pages. You know what that means. Porn sites are going to start using this so that Googlebot can confirm that its age is over 18. We'll be showered with a gigantic wave of pornographic information. We will soon have to press juvenile charges against a corporate entity because it lied about its age on web forms to gain access to pornography and forum discussions.

  11. Re:Google, consider this... by poot_rootbeer · · Score: 3, Insightful

    Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create? I speak mainly of feedback forms, e-mail signups, and the like.

    If your site uses GET for a non-idempotent action like sending a feedback form or signing up for an email newsletter, you're doing it Wrong.
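Strictly speaking, the HTTP spec's term for "does not change anything on the server" is a safe method, and GET, HEAD, and OPTIONS are among the methods it defines as safe. A minimal sketch of enforcing that on the server side might look like this; the `dispatch` function and the shape of the `handlers` table are hypothetical, not taken from any framework.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def dispatch(method, action, handlers):
    """Run an action, refusing side-effecting actions on safe HTTP methods.

    `handlers` maps an action name to (has_side_effects, callable).
    Returns a (status_code, body) pair.
    """
    has_side_effects, func = handlers[action]
    if has_side_effects and method.upper() in SAFE_METHODS:
        # A crawler following links or probing GET forms ends up here.
        return 405, "Method Not Allowed"
    return 200, func()


signups = []
handlers = {
    "list": (False, lambda: list(signups)),
    "signup": (True, lambda: signups.append("someone@example.com") or "ok"),
}

print(dispatch("GET", "signup", handlers))   # (405, 'Method Not Allowed')
print(dispatch("POST", "signup", handlers))  # (200, 'ok')
```

With a check like this in place, a bot that only issues GET requests cannot fill the signup list with trash, whatever it types into the form.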

  12. directions like 'nofollow' are still respected by frovingslosh · · Score: 5, Informative
    Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search.

    Maybe they shouldn't be, at least not in all cases. Several years back I had done many Google searches for some information that was very important to me, but never could find anything. Then a few months later (too late to be of use), pretty much by a fortunate combination of factors but with no help from Google, I came across the exact information, on a .GOV website in a publicly filed IPO document. As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record. When these nofollow directives are overused by mindless and unaccountable bureaucrats, perhaps someone needs to make the decision that these records should be public and that isn't best served by hiding them deep down a long list of links where they are hard to locate. In cases like this I would applaud any search engine that ignores the "suggestion" not to index public pages just because of an inappropriate tag in the HTML. In fact, if I knew of any search engine that was indexing in spite of this tag, I would switch to them as my first choice search engine in an instant. For starters, I would suggest that any .GOV and any State TLD website should have this tag ignored unless there were darn good reason to do otherwise.

    --
    I'm an American. I love this country and the freedoms that we used to have.
    1. Re:directions like 'nofollow' are still respected by Christophotron · · Score: 3, Interesting

      As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record.

      I'd mod you up if I had some points. I'm sure there are ethical implications or something when it comes to respecting the website owner's wishes not to index, but it's all public information anyway. If it's on the web and I can look at it, then Google should be able to look at it and index it.

      I had no idea that government sites don't allow themselves to be indexed. That is BULLSHIT. People often NEED information from .gov sites and ALL of it should be made easy to find. Refusing to allow indexing such information is akin to hiding or obfuscating it: you don't actually want anyone to read it or anything, but you can say it's available on the web so your ass is covered. IMO there should be a law stating that all of .gov MUST be indexed by search engines.

      Is there a law saying that search engines MUST follow these robots.txt, nofollow, etc? If it's not breaking the law, then Google should have some serious competition. A new search engine that indexes ALL VIEWABLE SITES regardless of the owner's wishes would be fucking great.

  13. Anecdote from Google by arrrrg · · Score: 5, Funny

    When I interned at Google, someone told me a funny anecdote about a guy who emailed their tech support insisting that the Google crawler had deleted his web site. At first, I think he was told that "Just because we download a copy of your site, doesn't mean your local copy is gone." (à la obligatory bash.) But the guy insisted, and finally they double checked and his site was in fact gone. Turns out that it was a home-brewed wiki-style site, and each page had a "delete" button. The only problem was, the "delete" button sent its query via GET, not POST, and so the Google spider happily followed those links one-by-one and deleted the poor guy's entire site. The Google guys were feeling charitable and so they sent him a backup of his site, but told him he wouldn't be so lucky the next time, and he should change any forms that make changes to POSTs -- GETs are only for queries.

    So, long story short, I wonder how Google will avoid more of this kind of problem if they're really going off the deep end and submitting random data on random forms on the web. Like the above guy, people may not design their site with such a spider in mind, and despite their lack of foresight this could kill a lot of goodwill if done improperly.

  14. Re:sites can still be excluded by danielsfca2 · · Score: 3, Insightful

    I never understood the point of robots.txt crap. Why put the site up if you don't want people to find it?

    Well I'm glad you asked. The presence (and continued following) of the robots.txt standard is crucial for these reasons:

    - Scripts with potentially infinite results. If you have a calendar script on your site that shows this month's calendar with a link to "next month" and "previous month", then without robots.txt the search engine could index back into prehistoric times and past the death of the Sun, with blank event calendars for each month. This is stupid. With your robots.txt file you tell the spider which URLs it is in BOTH of your best interests not to crawl. You save server resources and bandwidth, Google saves their time and resources.

    - If you have a duplicate copy or copies of your site for development, or perhaps an experimental "beta" version of your site, you don't want it competing with the real site for search engine placement, or worse, causing SE spiders to think you're a filthy spammer with duplicate content all over the place. So you disallow the dupes with robots.txt. Now sure, ideally that server could be inside your firewall instead of on the Web, but it gets more challenging when your dev team is on a different continent.

    - Temporary crap that has no value to the outside world. Once again, it's a waste of both your and the search engine's time to index it.

    The above are all reasons why you might want some or all of the content on a site not indexed.
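As a rough illustration of the three cases above, here is a sketch that checks a few URLs against a robots.txt using Python's standard `urllib.robotparser`. The paths `/calendar/`, `/beta/`, and `/tmp/`, the host `example.com`, and the bot name `ExampleBot` are all invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt covering the calendar, the beta copy, and temp files.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /beta/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, agent="ExampleBot"):
    """Return True if a well-behaved crawler is allowed to fetch the URL."""
    return rp.can_fetch(agent, url)


print(may_crawl("http://example.com/articles/42"))        # True
print(may_crawl("http://example.com/calendar/2999/01"))   # False
print(may_crawl("http://example.com/beta/index.html"))    # False
```

This only helps with crawlers that choose to ask; robots.txt is a convention, not an access control.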
  15. Re:Google, consider this... by Kristoph · · Score: 3, Funny

    Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create?

    If any forms which feed your DB are GET style, aren't user authenticated and/or don't use a CAPTCHA then you already have a huge trash data problem. At least the Googlebot won't offer to enlarge your penis.

    ]{