Carnegie Mellon CAPTCHA Digitization Project Now Underway

Posted by Zonk on Tuesday October 2, 2007 @12:44AM from the way-more-fun-than-the-usual-kind dept.

tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but that was before it got off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie Mellon and one of which isn't. If the user correctly deciphers the known word, then their answer for the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
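
The scheme in the summary can be sketched in a few lines of Python. This is only an illustration of the description above, not reCAPTCHA's actual code: the `WordPool` class, its method names, and the case-insensitive comparison are all assumptions made for the example.

```python
from collections import defaultdict

# From the summary: two different users must agree on an unknown word.
REQUIRED_AGREEMENT = 2


class WordPool:
    """Hypothetical tracker for unknown words awaiting agreement."""

    def __init__(self, required=REQUIRED_AGREEMENT):
        self.required = required
        # unknown word id -> {answer -> set of user ids who gave it}
        self.votes = defaultdict(lambda: defaultdict(set))
        self.resolved = {}  # unknown word id -> accepted transcription

    def submit(self, user_id, known_answer, known_truth, unknown_id, unknown_answer):
        """Return True if the user passes the CAPTCHA."""
        # The known word is the only thing that decides pass or fail.
        if known_answer.strip().lower() != known_truth.strip().lower():
            return False  # failed the known word, so the other guess is ignored
        if unknown_id in self.resolved:
            return True
        guess = unknown_answer.strip().lower()
        voters = self.votes[unknown_id][guess]
        voters.add(user_id)  # a set, so one user cannot vote twice
        if len(voters) >= self.required:
            self.resolved[unknown_id] = guess  # taken off the list
            del self.votes[unknown_id]
        return True
```

Because the user is never told which of the two words is the known one, a wrong answer to either word risks failing the CAPTCHA.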

12 of 119 comments

I want to participate... by DrWho520 · 2007-10-02 00:50 · Score: 3, Interesting

Where can I sign up? Sounds like a great way to burn a few hours on a rainy, Saturday afternoon!

--
The cancel button is your friend. Do not hesitate to use it.
1. Re:I want to participate... by EvilGrin666 · 2007-10-02 00:54 · Score: 4, Informative
  
  Here's the website, http://recaptcha.net/
Re:Problems by AltGrendel · 2007-10-02 01:13 · Score: 3, Insightful

I agree, but if you think about it, it's really a win-win for Carnegie Mellon. Either way, they get the text translated.

--
The simple truth is that interstellar distances will not fit into the human imagination
- Douglas Adams
Re:Problems by jsight · 2007-10-02 01:22 · Score: 5, Insightful

I agree... I don't understand why people find so many silly faults with this.

1. It's not twice as annoying. Compared to how faded and scrambled many "one-word" captchas are, this is significantly less annoying.
2. People seem to be acting like someone will fill out one word correctly and then intentionally scramble the other to screw up the project. Not many people are crazy enough to even want to do that. But even if they were, how do they know which word is the known, and which is the unknown?
3. Endless Supply - Each word that is correctly translated is another word that is "known" and therefore can be safely used as a known in a new captcha.
4. Verification - Thanks to #3, they could also potentially maintain the verification % rate for various words to later determine the accuracy or inaccuracy of past translations (assuming that they ever find that to be a problem).

Yeah, we all know that captchas are not perfect, but this project is a better idea than most. And because it is centralized, they can update the image generation scheme centrally if it is broken.

In practice, these seem to get broken less often than people think.

--
Throw the bums out!
Re:I'm not so sure this is a good idea. by necro81 · 2007-10-02 01:26 · Score: 3, Insightful

"There still needs to be a human reviewing the work before it's truly accepted, and that human might as well be doing it in the first place, with the context still there to help them."
That is, for all intents and purposes, impractical, which was the entire point. The backlog of work was never going to get done in a reasonable timescale with dedicated humans correcting all the errors. A dedicated human, even with the context, will still make mistakes or get stumped.

Most people, when presented with a CAPTCHA, make an honest effort to try and get it right - otherwise they can't get their precious Facebook account. The number of people who understand what's going on with this reCAPTCHA thing is probably pretty small. Finally, those who know what it is about are probably inclined to not be jackasses and purposefully screw it up. I'd say that honest errors and malicious errors are an overwhelmingly small portion of reCAPTCHA responses. While flawed, this system might still be, say, 95% correct. So, for accepting a certain amount of error, you are able to get as much character recognition done as you are able to supply. As the article says:
"Given that it takes about 10 seconds to decipher a reCAPTCHA and type in the answer, this represents the equivalent of almost three thousand man hours a day spent deciphering words that CMU's computers find illegible."
3000 man-hours a day at 95% accuracy versus, maybe, a few dozen man-hours a day at slightly higher accuracy. You tell me which is better.
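
For a rough sense of scale, the quoted figures can be turned into a word count. The 10 seconds and 3000 man-hours come from the article quote; the 95% accuracy is only the guess made in this comment, and "almost three thousand" makes the result an upper bound.

```python
SECONDS_PER_SOLVE = 10      # from the article quote
MAN_HOURS_PER_DAY = 3000    # "almost three thousand", so an upper bound
ASSUMED_ACCURACY_PCT = 95   # the commenter's guess, not a measured figure

# Convert hours of effort into the number of CAPTCHAs solved per day.
solves_per_day = MAN_HOURS_PER_DAY * 3600 // SECONDS_PER_SOLVE

# Integer arithmetic avoids floating-point rounding in the estimate.
correct_per_day = solves_per_day * ASSUMED_ACCURACY_PCT // 100

print(f"{solves_per_day:,} solves/day, about {correct_per_day:,} correct")
```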
Re:I'm not so sure this is a good idea. by smallfries · 2007-10-02 01:27 · Score: 4, Insightful

Wouldn't the easy solution be to present the context as part of the reCAPTCHA? Rather than two single words from isolated contexts, present two "lines" with a word or two either side, and a slight colour change on the target words to indicate which ones the system is after. This would make your validation easier but wouldn't aid OCR in any way.

For your other point, there should be a "not a word" button to hit in that case to flag up that the original OCR has screwed up the word boundary.

I thought it was a really novel project, reminds me of the image tagging "games" that people came up with last year, but in a new problem domain.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Re:I'm not so sure this is a good idea. by MrMr · 2007-10-02 01:40 · Score: 5, Funny

I've already had words like 'Alau' and '45-618' in the few I've done, and since there's an ugly line through them, I can't be close to sure it's right... They make no sense, but they look like that.

Congratulations,
you managed to fail the Turing test.
Re:I'm not so sure this is a good idea. by Falkkin · 2007-10-02 02:07 · Score: 5, Informative

"And that's not even counting malice where people deliberately put wrong words in."

We're already getting several million legitimate solutions a day. The chance that a few malicious people would happen to get the same CAPTCHA is relatively small. Also, for many of our words, the OCR's answer happens to be correct -- it just doesn't have high confidence in the word. If a single person agrees with the OCR in this case, we can mark the word as "read" with no further human confirmation. For this reason, many of the words will only ever be shown to a single human.
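
A minimal sketch of that acceptance rule, assuming a hypothetical `resolve_word` helper and case-insensitive matching. The single-human shortcut is what the comment describes; the two-matching-answers fallback is taken from the story summary, and each answer is assumed to come from a different user.

```python
from collections import Counter


def resolve_word(ocr_guess, human_answers, required=2):
    """Return the accepted transcription, or None if still undecided.

    ocr_guess may be None when the OCR produced nothing usable.
    human_answers is assumed to hold one answer per distinct user.
    """
    ocr = ocr_guess.strip().lower() if ocr_guess else None
    tallies = Counter()
    for raw in human_answers:
        answer = raw.strip().lower()
        # One human agreeing with the low-confidence OCR guess is enough.
        if ocr is not None and answer == ocr:
            return answer
        # Otherwise wait until enough humans give the same answer.
        tallies[answer] += 1
        if tallies[answer] >= required:
            return answer
    return None
```

With this rule, a word whose OCR guess was right is retired after a single showing, which matches the remark that many words will only ever be shown to one person.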
Re:I'm not so sure this is a good idea. by Alzheimers · 2007-10-02 02:11 · Score: 4, Funny

ELIZA > And how does this make you feel?
Drupal Module makes it simple by Slashdot+Parent · 2007-10-02 02:19 · Score: 3, Interesting

For all of you Drupal admins out there, I just wanted to let you know that there is a reCAPTCHA module that makes using reCAPTCHA a snap.

I'm not affiliated with the project, other than as a happy, comment-spam-free user of it.

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Re:`CowboyNeal' answer to all CAPTCHAs by Falkkin · 2007-10-02 02:21 · Score: 5, Informative

Sorry, but we've already thought of this attack :)

We can compute the daily frequency of each human-provided solution and automatically flag anything that suddenly jumps in popularity. It's especially suspicious if these answers always disagree with the OCR's guess (often the OCR happens to be right, but just doesn't have high confidence).
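
One way such a check might look, as a sketch only. The `flag_suspicious` function, the `min_count` and `jump` thresholds, and the data layout are assumptions for illustration; the comment does not give the real criteria.

```python
from collections import Counter


def flag_suspicious(previous_days, today_pairs, min_count=50, jump=10.0):
    """Flag answers whose daily count suddenly jumps.

    previous_days: list of Counters, one per earlier day (answer -> count).
    today_pairs:   list of (human_answer, ocr_guess) tuples for today.
    Returns (answer, count, share_disagreeing_with_ocr), most frequent first.
    """
    today = Counter(answer for answer, _ in today_pairs)
    disagree = Counter(answer for answer, ocr in today_pairs if answer != ocr)
    days = max(len(previous_days), 1)
    flagged = []
    for answer, count in today.items():
        # Average daily count of this answer over the earlier days.
        baseline = sum(day.get(answer, 0) for day in previous_days) / days
        if count >= min_count and count > jump * max(baseline, 1.0):
            flagged.append((answer, count, disagree[answer] / count))
    return sorted(flagged, key=lambda item: -item[1])
```

A disagreement share near 1.0 corresponds to the extra warning sign mentioned above: a suddenly popular answer that never matches the OCR's guess.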
Re:I'm not so sure this is a good idea. by Falkkin · 2007-10-02 02:38 · Score: 3, Informative

You said "people" putting in wrong words (à la the suggestion someone made below about "everyone fill in CowboyNeal!"), which is quite different from automated attacks. For that, we have numerous scripts that notice various forms of anomalous behavior from any given IP. We manually review these to make sure the answers are reasonable. We are also working with CERT, who have a large database of botnetted machines, to detect attacks. I'm not going to give complete details of everything we check, but rest assured that we are very active in preventing attacks -- our goal is to be the best CAPTCHA in the world, and we take security threats very seriously.

In terms of the digital output, we spot-check some of the transcribed pages every day. These spot-checks will also turn up any anomalous solutions, with high probability.