Carnegie Mellon CAPTCHA Digitization Project Now Underway

Posted by Zonk on Tuesday October 2, 2007 @12:44AM from the way-more-fun-than-the-usual-kind dept.

tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but that was before it got off the ground. Users entering text act as a sort of distributed computing project. Basically, each CAPTCHA is made up of two words, one of which is already known to Carnegie Mellon and one of which isn't. If the user correctly deciphers the known word, then their answer for the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer for the same unknown word before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
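
The two-word scheme described in the summary can be sketched roughly as follows. This is only an illustration of the idea, not reCAPTCHA's actual code: the names `UnknownWord`, `submit`, and `REQUIRED_AGREEMENT`, and the case-insensitive comparison, are assumptions made for the example.

```python
from collections import defaultdict

# Assumption taken from the summary: two different users must agree.
REQUIRED_AGREEMENT = 2


class UnknownWord:
    """Collects answers for one word that is not yet known (hypothetical helper)."""

    def __init__(self):
        self.votes = defaultdict(set)  # normalized answer -> ids of users who gave it
        self.resolved = None           # accepted reading, once enough users agree

    def record(self, user_id, answer):
        answer = answer.strip().lower()
        self.votes[answer].add(user_id)  # a set, so repeat votes by one user count once
        if self.resolved is None and len(self.votes[answer]) >= REQUIRED_AGREEMENT:
            self.resolved = answer
        return self.resolved


def submit(known_word, unknown, user_id, known_answer, unknown_answer):
    """Return True if the user passes the CAPTCHA by typing the known word correctly."""
    if known_answer.strip().lower() != known_word.lower():
        return False  # failed the known word, so the other answer is ignored
    unknown.record(user_id, unknown_answer)
    return True
```

Only answers from users who got the known word right are counted, and a word is resolved once two distinct users give the same reading.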

Re:I want to participate... by EvilGrin666 · 2007-10-02 00:54 · Score: 4, Informative

Here's the website: http://recaptcha.net/
Re:I'm not so sure this is a good idea. by Falkkin · 2007-10-02 02:07 · Score: 5, Informative

"And that's not even counting malice where people deliberately put wrong words in."

We're already getting several million legitimate solutions a day. The chance that a few malicious people would happen to get the same CAPTCHA is relatively small. Also, for many of our words, the OCR's answer happens to be correct -- it just doesn't have high confidence in the word. If a single person agrees with the OCR in this case, we can mark the word as "read" with no further human confirmation. For this reason, many of the words will only ever be shown to a single human.
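
As a rough sketch of the acceptance rule described in this comment (one human agreeing with the OCR is enough; otherwise humans must agree with each other), something like the following would do. The function name `word_status`, the default of two agreeing humans, and the case-insensitive matching are assumptions for illustration, not the project's real implementation.

```python
def word_status(ocr_guess, human_answers, required_agreement=2):
    """Return the accepted reading of a word, or None if it is still undecided.

    ocr_guess: the OCR's low-confidence reading, or None if it has none.
    human_answers: answers in arrival order, one per distinct user.
    """
    normalized_guess = ocr_guess.strip().lower() if ocr_guess is not None else None
    counts = {}
    for answer in human_answers:
        answer = answer.strip().lower()
        if answer == normalized_guess:
            return answer  # a single human agreeing with the OCR is enough
        counts[answer] = counts.get(answer, 0) + 1
        if counts[answer] >= required_agreement:
            return answer  # otherwise wait until enough humans agree with each other
    return None
```

For example, `word_status("their", ["their"])` accepts the word after one answer, while `word_status("thcir", ["their"])` stays undecided until a second user also types "their".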
Re:`CowboyNeal' answer to all CAPTCHAs by Falkkin · 2007-10-02 02:21 · Score: 5, Informative

Sorry, but we've already thought of this attack :)

We can compute the daily frequency of each human-provided solution and automatically flag anything that suddenly jumps in popularity. It's especially suspicious if these answers always disagree with the OCR's guess (often the OCR happens to be right, but just doesn't have high confidence).
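
A minimal sketch of that kind of check might look like the code below. The function name `flag_spikes`, the thresholds `spike_factor` and `min_count`, and the use of an average over previous days as the baseline are all assumptions chosen for the example; the comment does not say how the real system sets them.

```python
from collections import Counter


def flag_spikes(todays_submissions, past_daily_counts, spike_factor=10.0, min_count=20):
    """Flag answers whose count today jumps far above their recent daily average.

    todays_submissions: iterable of (human_answer, ocr_guess) pairs.
    past_daily_counts: list of Counters, one per earlier day (answer -> count).
    Returns {answer: fraction of today's uses that disagreed with the OCR}.
    """
    today = Counter()
    disagreements = Counter()
    for answer, ocr_guess in todays_submissions:
        answer = answer.strip().lower()
        today[answer] += 1
        if answer != ocr_guess.strip().lower():
            disagreements[answer] += 1

    days = max(len(past_daily_counts), 1)
    flagged = {}
    for answer, count in today.items():
        baseline = sum(day.get(answer, 0) for day in past_daily_counts) / days
        # Treat the baseline as at least 1 so brand-new answers need a real jump.
        if count >= min_count and count > spike_factor * max(baseline, 1.0):
            flagged[answer] = disagreements[answer] / count
    return flagged
```

An answer that appears in the result with a disagreement rate close to 1.0 matches the "especially suspicious" case: it is suddenly popular and never matches what the OCR guessed.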