Slashdot Mirror


Google Buys reCAPTCHA For Better Book Scanning

TimmyC writes "This story may interest the Slashdot folk, many of whom use the reCAPTCHA anti-spam service. Well, reCAPTCHA is now owned by Google. Apparently, what attracted Google to reCAPTCHA is that the company has linked its core authentication service with efforts to digitize print books and periodicals. The search giant has a massive (and controversial) effort underway in that area for its Google Books and Google News Archive services. Every time people solve a CAPTCHA from the company, they are also, as a byproduct, helping to turn scanned words into plain text that can be indexed and made searchable by search engines. Interesting times indeed."

6 of 138 comments

  1. Well... by vikhyat · · Score: 4, Interesting

    This should improve Google's indecipherable CAPTCHA.

  2. Won't this eventually defeat the purpose? by natehoy · · Score: 3, Interesting

    Google is doing this in order to prevent spam and to improve OCR. But once OCR is improved to the point where it can read poorer scans, won't spammers be able to use that new technology to eventually defeat CAPTCHA?

    Don't get me wrong, I think this is a marvelous idea, potentially using volunteer labor of humans as OCR to interpret a book one poorly-scanned word at a time. But it does seem to have the side effect of eventually destroying the original purpose of what they bought. Maybe CAPTCHA is worth more as a "crowdsourced OCR solution" than it ever was as spam prevention anyway...

    --
    "This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
  3. Re:maybe they should use CAPTCHAs... by Rik Sweeney · · Score: 3, Interesting

    Funny you should say that:

    http://mailhide.recaptcha.net/

  4. Re:Mod up by mrcaseyj · · Score: 5, Interesting

    I agree that the idea is ingenious. But on the only one I ran into, the word was completely indecipherable. I don't mean that it was really hard, I mean that it was a word so thoroughly mangled that it was clearly impossible to read by anyone, especially without context. The lack of context is one of the big weaknesses of the system. When a word is unclear, it's the words around it that give critical clues to what it is.

  5. Re:WTF Summary by slyborg · · Score: 2, Interesting

    I still don't get it. How do you know that the person correctly identified the second word? I don't see how a priori decoding the first word means that the second was correct. I would expect that the individual bad data rate from this technique would be substantial.

    I do enjoy the fact that Google, a ridiculously profitable company by virtue of its near-monopoly on Internet search advertising, is using the public who pays it via these ad impressions to do its work for free, and using the technique invented and used by spammers to crowd-source solve CAPTCHAs to get into Gmail and the like!
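    On the question above of how the second word can be trusted: the usual description of the scheme is that one of the two words is a control word whose answer is already known, and the other is the unknown scanned word. A reply only counts as a vote on the unknown word if the control word was typed correctly, and a transcription is accepted only once several independent replies agree. The following is a minimal sketch of that idea, not reCAPTCHA's actual code; the names `submit` and `accepted_transcription` and the thresholds `MIN_VOTES` and `MIN_SHARE` are assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical thresholds: how many votes are needed, and how strongly
# they must agree, before a transcription is accepted.
MIN_VOTES = 3
MIN_SHARE = 0.75

# unknown-word id -> tally of transcriptions from users who passed the control
votes = defaultdict(Counter)


def normalize(text):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return text.strip().lower()


def submit(control_answer, control_truth, unknown_id, unknown_answer):
    """Return True if the user passes the CAPTCHA.

    The user is judged only on the control word. If they get it right,
    their answer for the unknown word is recorded as one vote.
    """
    if normalize(control_answer) != normalize(control_truth):
        return False  # failed the known word: discard the other answer too
    votes[unknown_id][normalize(unknown_answer)] += 1
    return True


def accepted_transcription(unknown_id):
    """Return the agreed transcription, or None if there is no consensus yet."""
    counts = votes.get(unknown_id, Counter())
    total = sum(counts.values())
    if total < MIN_VOTES:
        return None
    word, n = counts.most_common(1)[0]
    return word if n / total >= MIN_SHARE else None
```

    Under these assumptions a single wrong or malicious answer does not become the transcription; it only delays consensus. The sketch does nothing about coordinated voting, which is the weakness comment 6 alludes to.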

  6. Re:WTF Summary by Anonymous Coward · · Score: 3, Interesting

    Interesting you should say that.

    Unfortunately, it won't work - 4chan already ruined it for everyone.

    http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/