Slashdot Mirror


reCAPTCHA Hard At Work, Rescuing Fading Texts

sciencehabit writes "Computer scientists have developed a program, called reCAPTCHA, which is being used in lieu of CAPTCHA by several sites, to help digitize old books and newspapers. The reCAPTCHA takes entries from old and faded texts that optical scanners and digital-text readers have trouble with. So every time you solve that string of crooked letters, you may actually be helping historians digitally reconstruct a page from the 1908 New York Times." The Science Now story links to the longer and more informative article at Ars Technica. (We last mentioned this program last year — and now it's good to get some sense of how well it's working.)

21 of 112 comments

  1. Not new by JazzyMusicMan · · Score: 4, Informative

    Ticketmaster and other sites have already been doing this for a while. Go to Ticketmaster and search for tickets, and you'll see two words. One is known and the other is unknown. If you don't believe me, try to guess which one they know and misspell the other one on purpose (or don't, this is for historic posterity =) )

    1. Re:Not new by felipekk · · Score: 3, Funny

      Facebook uses reCAPTCHA. I guess you can make something useful out of the millions of useless teenagers wasting their time on Facebook.

    2. Re:Not new by grahamd0 · · Score: 5, Funny

      Facebook uses reCAPTCHA. I guess you can make something useful out of the millions of useless teenagers wasting their time on Facebook.

      That's not fair.

      Plenty of useless adults waste their time on Facebook.

    3. Re:Not new by erbmjw · · Score: 3, Informative

      From the reCAPTCHA FAQ:

      When showing reCAPTCHA to the user, is it possible not to show the reCAPTCHA logo?

      We allow you to customize the theme of reCAPTCHA with our Client API. You are still required to have text on your website which states that you are using reCAPTCHA, however with our theming API, you are free to do this in a way that blends in to your site.

    4. Re:Not new by Your+Pal+Dave · · Score: 3, Informative

      Quoting from the NPR story which aired earlier today:

      more than 40,000 Web sites -- including popular ones such as Ticketmaster, Facebook and Craigslist -- are using a new kind of security program called reCAPTCHA.

    5. Re:Not new by Alzheimers · · Score: 3, Funny

      But you...

      *sigh* ...Nevermind. It's Friday. Go have a beer or something.

  2. Validate your data, guys! by Anonymous Coward · · Score: 3, Funny

    I can usually tell which of the two words is from a real old text. With high probability (>90%) I can correctly answer the real CAPTCHA and replace someone's OCR'd word with "penis".

    I've only ever done this maybe ten or twenty times, but it could easily become an automatic part of using the system.

    1. Re:Validate your data, guys! by PPH · · Score: 3, Interesting

      Since they use entries from several users to validate correct transcriptions for OCR'ed text, this probably won't cause them major problems. OTOH, I wonder if they can track the accuracy of each user's inputs and, if it becomes evident that a user is either incompetent or attempting to screw with the system, take appropriate measures.

      When someone's karma starts dropping into the negative range, they should let us know how well this worked out. If anyone can see their posts, that is.
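A rough sketch of the idea discussed in this sub-thread: a guess for the unknown word is only counted when the known (control) word was typed correctly, and a transcription is only accepted once several users agree. This is not reCAPTCHA's actual code; the function names, the `min_votes` and `min_share` thresholds, and the sample words are all assumptions made up for illustration.

```python
from collections import Counter


def record_answer(votes, control_word, control_answer, unknown_id, unknown_answer):
    """Count the guess for the unknown word only if the control word was right."""
    if control_answer.strip().lower() != control_word.lower():
        return False  # failed the real CAPTCHA, so the guess is discarded
    tally = votes.setdefault(unknown_id, Counter())
    tally[unknown_answer.strip().lower()] += 1
    return True


def consensus(votes, unknown_id, min_votes=3, min_share=0.6):
    """Return the agreed transcription, or None if agreement is still too weak."""
    counts = votes.get(unknown_id)
    if not counts:
        return None
    total = sum(counts.values())
    word, top = counts.most_common(1)[0]
    if total >= min_votes and top / total >= min_share:
        return word
    return None


votes = {}
record_answer(votes, "morning", "morning", "scan-17", "upon")
record_answer(votes, "morning", "morning", "scan-17", "upon")
record_answer(votes, "morning", "morning", "scan-17", "xxxx")   # a prankster
record_answer(votes, "morning", "mourning", "scan-17", "xxxx")  # control wrong, ignored
print(consensus(votes, "scan-17"))  # prints: upon
```

Under these assumed thresholds a single prank answer is simply outvoted, which is the point PPH makes above.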

      --
      Have gnu, will travel.
  3. Cool possible uses by Irish_Samurai · · Score: 4, Interesting

    Man, I would love to see the results if this technique were used for an ontological purpose.

    Please type in the word from the choices below that most closely relates to this word: OLD

    HISTORIC
    LIFESPAN

    Interesting shit indeed.

    1. Re:Cool possible uses by burgundysizzle · · Score: 5, Funny

      Or perhaps SLASHDOT-READER:

      OVERWEIGHT

      GEEK

      SPENDS-TOO-MUCH-TIME-USING-COMPUTERS

      ALL-OF-THE-ABOVE

      I fit into the category ALL-OF-THE-ABOVE. The only generalisation that is missing about slashdotters is the one about girlfriends.

  4. DMCA Violation by Nymz · · Score: 5, Funny

    The feature known as FADING was designed to protect copyright works from being pirated by becoming illegible before the work could fall into the public domain.

  5. Prior art by armanox · · Score: 4, Funny

    I think that erosion on stone tablets predates fading by quite a bit....

    --
    I'm starting to think GNU is the problem with "GNU/Linux" these days.
  6. Image Captchas by pembo13 · · Score: 3, Informative

    I've found implementing a simple "please choose the name of the item seen below" eliminates a large amount of spam (all?) but has the problem of not being viable for blind people.

    --
    "Thanks for all the money you paid to us. We've used it to buy off ISO among other things" -Microsoft
    1. Re:Image Captchas by Martz · · Score: 4, Funny

      Just use an alt tag.

  7. Re:One Problem by Anonymous Coward · · Score: 3, Funny

    The following security test allows us to validate you are a human and not an automated script.

    please type the following two words in the text box below

    you moron

    ____________ _____________

  8. Re:reCAPTCHA and Open Source by corbettw · · Score: 3, Informative

    There are multiple libraries for reCAPTCHA already published, all under the MIT License. Just see http://code.google.com/p/recaptcha/ for a list of them.

    --
    God invented whiskey so the Irish would not rule the world.
  9. Re:AC for the plain old CAPTCHA by grahamd0 · · Score: 4, Funny

    Let me introduce you to my friend, the question mark.

  10. Re:One Problem by RedWizzard · · Score: 4, Funny

    One FUNDAMENTAL problem with this

    ... is that you didn't RTFA.

  11. Re:Problems With ReCaptcha by Robotech_Master · · Score: 3, Informative

    I've seen one ReCAPTCHA string that was just a distorted entirely illegible blob of ink.

    Just do what I did: click the "refresh" button to the right for a new word pair and enter that one.

    --
    Editor Emeritus and Senior Writer, TeleRead.org
  12. Recaptcha doesn't recapture context by Mumei+no+koshinuke · · Score: 5, Interesting

    When solving these I sometimes find that there's more than one possibility for an illegible word, yet I can't tell which it is without knowing the context.
    For example, in some fonts "cost" and "cast" might be indistinguishable in the image shown. But given the context of the sentence it's trivial for a human to tell the difference.
    Suppose that they found these words on which people disagreed and had another captcha system which showed the full sentence. I'd guess they could improve their accuracy significantly in this case. Since they could prescreen for ambiguous words using the current captcha system, even if fewer people were willing to solve the "large" captcha, they would still get all the solutions they needed.
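The prescreening step proposed here could be sketched roughly as follows: collect the answers given for each word, and route a word to the hypothetical full-sentence captcha only when enough people have answered and no single reading dominates. The names `needs_context` and `split_queues`, the thresholds, and the sample data are assumptions for illustration, not anything reCAPTCHA is known to do.

```python
from collections import Counter


def needs_context(answers, min_answers=4, max_top_share=0.7):
    """True when enough people answered but no single reading dominates."""
    if len(answers) < min_answers:
        return False  # too few answers to call the word ambiguous yet
    counts = Counter(a.strip().lower() for a in answers)
    top = counts.most_common(1)[0][1]
    return top / len(answers) < max_top_share


def split_queues(answers_by_word):
    """Partition word ids into (ambiguous, other), preserving input order."""
    ambiguous = [w for w, a in answers_by_word.items() if needs_context(a)]
    other = [w for w in answers_by_word if w not in ambiguous]
    return ambiguous, other


answers = {
    "w1": ["cost", "cast", "cost", "cast"],  # readers split evenly
    "w2": ["tree", "tree", "tree", "tree"],  # clear agreement
    "w3": ["cast"],                          # not enough answers yet
}
print(split_queues(answers))  # prints: (['w1'], ['w2', 'w3'])
```

Only the words in the first list would need the "large" captcha, so even a small pool of willing solvers could cover them.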

  13. Use to hide your own email addy by RJFerret · · Score: 5, Informative

    You can also use reCaptcha to protect your own email address, and be more willing to provide it "publicly", since anyone who wants it would have to answer the reCaptcha to get to the mailto link. See reCaptcha mailhide.