Carnegie Mellon CAPTCHA Digitization Project Now Underway

Posted by Zonk on Tuesday October 2, 2007 @12:44AM from the way-more-fun-than-the-usual-kind dept.

tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but it was prior to it getting off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie Mellon and one of which isn't. If the user correctly deciphers the known word, then the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
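
A minimal sketch of the scheme as the summary describes it, not reCAPTCHA's actual code. The names (Challenge, Transcriber, votes_needed) and the case-insensitive comparison are assumptions made for illustration; only the "known word gates the answer, two matching answers settle the unknown word" logic comes from the summary.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Challenge:
    known_answer: str  # text of the control word (already verified)
    unknown_id: str    # hypothetical identifier for the word OCR could not read


@dataclass
class Transcriber:
    votes_needed: int = 2  # per the summary: two users must agree
    votes: dict = field(default_factory=dict)     # unknown_id -> Counter of guesses
    resolved: dict = field(default_factory=dict)  # unknown_id -> accepted text

    def submit(self, challenge, known_guess, unknown_guess):
        """Return True if the user passes the CAPTCHA."""
        # Only the control word decides pass/fail.
        if known_guess.strip().lower() != challenge.known_answer.lower():
            return False  # the guess for the unknown word is discarded

        # The user passed, so count their reading of the unknown word as a vote.
        guess = unknown_guess.strip().lower()
        tally = self.votes.setdefault(challenge.unknown_id, Counter())
        tally[guess] += 1

        # Once enough users agree, take the word off the list.
        if tally[guess] >= self.votes_needed:
            self.resolved[challenge.unknown_id] = guess
        return True
```

A user who misreads the control word contributes nothing, so the only votes counted come from people who demonstrably read at least one word correctly.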

Re:Rock on by cheater512 · 2007-10-02 00:56 · Score: 1, Insightful

I've found a flaw.

It gives you two words to enter in but you only have to get the right one correct in order to get through.

Spammers could fill the left word with nonsense and OCR the right one and the system would crumble.
Who cares if the OCR isn't 100% accurate? It'll be good enough to get a lot of spam through.
Re:Rock on by Fluffy+Bunnies · 2007-10-02 01:10 · Score: 2, Insightful

Where in TFA does it say that the one on the right is always the right one?
Re:Rock on by Smidge204 · 2007-10-02 01:11 · Score: 2, Insightful

You don't know which word is known (and checked against) and which is unknown. This makes your OCR attack less effective because you must get BOTH words right in order to guarantee success.

Also, if the first two people to decipher the unknown word don't agree, then the word is recycled back into the system until "a lot more people" submit the same answer. This greatly reduces the threat of a "garbage attack" because any random input is unlikely to be repeated by the second person to get that word, or anyone else for that matter.

You didn't even have to RTFA to get that much...
=Smidge=
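
A rough back-of-the-envelope sketch of the two points above. It assumes the control word is equally likely to be in either position and that garbage is uniformly random lowercase letters; neither assumption comes from TFA, and the function names are made up for illustration.

```python
def pass_probability(ocr_accuracy, words_attempted):
    """Chance a bot passes one challenge.

    Assumes the checked (known) word is equally likely to be on the
    left or the right, so a bot that OCRs only one word picks the
    checked one half of the time.
    """
    if words_attempted == 2:
        return ocr_accuracy        # the checked word is always among those attempted
    if words_attempted == 1:
        return 0.5 * ocr_accuracy  # right position AND right reading
    return 0.0                     # pure garbage never matches the known word


def garbage_collision_probability(length, alphabet=26):
    """Chance that a second random string of the same length repeats the first."""
    return 1 / alphabet ** length
```

Under these assumptions, OCRing a single word halves the bot's pass rate, and two independent 5-letter garbage strings agree with probability 1 in 26^5, which is why random input is unlikely to be accepted as a transcription.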
Re:Problems by AltGrendel · 2007-10-02 01:13 · Score: 3, Insightful

I agree, but if you think about it, it's really a win-win for Carnegie Mellon. Either way, they get the text translated.

--
The simple truth is that interstellar distances will not fit into the human imagination
- Douglas Adams
Re:Problems by jsight · 2007-10-02 01:22 · Score: 5, Insightful

I agree... I don't understand why people find so many silly faults with this.

1. It's not twice as annoying. Compared to how faded and scrambled many "one-word" captchas are, this is significantly less annoying.
2. People seem to be acting like someone will fill out one word correctly and then intentionally scramble the other to screw up the project. Not many people are crazy enough to even want to do that. But even if they were, how do they know which word is the known, and which is the unknown?
3. Endless Supply - Each word that is correctly translated is another word that is "known" and therefore can be safely used as a known in a new captcha.
4. Verification - Thanks to #3, they could also potentially maintain the verification % rate for various words to later determine the accuracy or inaccuracy of past translations (assuming that they ever find that to be a problem).
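
Points 3 and 4 could be sketched roughly like this. It is purely illustrative: the KnownWord class, the suspicious helper, and the threshold values are all hypothetical, and nothing in TFA says the project tracks words this way.

```python
from dataclasses import dataclass


@dataclass
class KnownWord:
    text: str         # accepted transcription, now usable as a control word
    shown: int = 0    # times it has been served as the known word
    matched: int = 0  # times the user's answer agreed with `text`

    def record(self, guess):
        """Log one user's answer to this control word."""
        self.shown += 1
        if guess.strip().lower() == self.text.lower():
            self.matched += 1

    @property
    def verification_rate(self):
        """Fraction of answers that agreed, or None if never shown."""
        return self.matched / self.shown if self.shown else None


def suspicious(words, threshold=0.5, min_shown=10):
    """Words that users keep 'failing' may have been transcribed wrongly."""
    return [
        w.text
        for w in words
        if w.shown >= min_shown and w.verification_rate < threshold
    ]
```

If an accepted word like "tbe" keeps getting answered as "the", its rate drops and it can be flagged for another round of checking.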

Yeah, we all know that captchas are not perfect, but this project is a better idea than most. And because it is centralized, they can update the image generation scheme centrally if it is broken.

In practice, these seem to get broken less often than people think.

--
Throw the bums out!
Re:I'm not so sure this is a good idea. by necro81 · 2007-10-02 01:26 · Score: 3, Insightful

There still needs to be a human reviewing the work before it's truly accepted, and that human might as well be doing it in the first place, with the context still there to help them.
That is, for all intents and purposes, impractical, which was the entire point. The backlog of work was never going to get done in a reasonable timescale with dedicated humans correcting all the errors. A dedicated human, even with the context, will still make mistakes or get stumped.

Most people, when presented with a CAPTCHA, make an honest effort to try and get it right - otherwise they can't get their precious Facebook account. The number of people who understand what's going on with this reCAPTCHA thing is probably pretty small. Finally, those who know what it is about are probably inclined to not be jackasses and purposefully screw it up. I'd say that honest errors and malicious errors together are a very small portion of reCAPTCHA responses. While flawed, this system might still be, say, 95% correct. So, for accepting a certain amount of error, you are able to get as much character recognition done as you are able to supply. As the article says:
Given that it takes about 10 seconds to decipher a reCAPTCHA and type in the answer, this represents the equivalent of almost three thousand man hours a day spent deciphering words that CMU's computers find illegible.
3000 man-hours a day at 95% accuracy versus, maybe, a few dozen man-hours a day at slightly higher accuracy. You tell me which is better.
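
To put numbers on that, here is the arithmetic as a small sketch. The 10 seconds and the roughly 3000 man-hours are the article's figures; the 95% is only the guess made above, not a measured rate.

```python
SECONDS_PER_HOUR = 3600


def captchas_per_day(man_hours, seconds_each):
    """How many reCAPTCHAs fit into the given number of man-hours."""
    return man_hours * SECONDS_PER_HOUR // seconds_each


solved = captchas_per_day(man_hours=3000, seconds_each=10)
print(solved)                # 1080000 reCAPTCHAs a day
print(round(solved * 0.95))  # 1026000 correct, IF the 95% guess holds
```

So "almost three thousand man hours" works out to roughly a million reCAPTCHAs solved per day.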
Re:I'm not so sure this is a good idea. by smallfries · 2007-10-02 01:27 · Score: 4, Insightful

Wouldn't the easy solution be to present the context as part of the reCAPTCHA? Rather than two single words from isolated contexts, present two "lines" with a word or two either side, and a slight colour change on the target words to indicate which ones the system is after. This would make your validation easier but wouldn't aid OCR in any way.

For your other point, there should be a "not a word" button to hit in that case to flag up that the original OCR has screwed up the word boundary.

I thought it was a really novel project, reminds me of the image tagging "games" that people came up with last year, but in a new problem domain.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Re:I want to participate... by Falkkin · 2007-10-02 02:00 · Score: 2, Insightful

Our demo at http://recaptcha.net/fastcgi/demo/recaptcha keeps track of the number of words you've digitized. :)