Carnegie Mellon CAPTCHA Digitization Project Now Underway
tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but that was before it got off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie Mellon and one of which isn't. If the user correctly deciphers the known word, then the answer given for the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
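For anyone who wants to see the scheme in the summary spelled out, here is a minimal sketch in Python. It only models what the summary describes: a known control word gates the answer, and an unknown word is retired once two different users agree on it. The class name `WordPool`, its method names, and the case-insensitive comparison are assumptions made for illustration, not details of the real reCAPTCHA service.

```python
from collections import defaultdict


class WordPool:
    """Toy model of the two-word scheme (hypothetical names throughout)."""

    def __init__(self, agreement_needed=2):
        # Number of distinct users who must give the same answer.
        self.agreement_needed = agreement_needed
        # unknown word id -> candidate answer -> set of user ids who gave it
        self.votes = defaultdict(lambda: defaultdict(set))
        # unknown word id -> accepted transcription
        self.resolved = {}

    def submit(self, user_id, known_expected, known_answer,
               unknown_id, unknown_answer):
        """Return True if the user passes the CAPTCHA."""
        # Assumption: answers are compared case-insensitively.
        if known_answer.strip().lower() != known_expected.lower():
            # Control word failed: reject and ignore the unknown-word guess.
            return False
        if unknown_id in self.resolved:
            return True
        guess = unknown_answer.strip().lower()
        voters = self.votes[unknown_id][guess]
        voters.add(user_id)  # a set, so one user cannot vote twice
        if len(voters) >= self.agreement_needed:
            # Enough distinct users agree: take the word off the list.
            self.resolved[unknown_id] = guess
            del self.votes[unknown_id]
        return True
```

Usage: create a `WordPool()`, call `submit(...)` once per solved CAPTCHA, and read `resolved` for the words that have been retired. Tracking voters in a set is what keeps a single account from confirming its own answer by submitting it twice.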
Where can I sign up? Sounds like a great way to burn a few hours on a rainy Saturday afternoon!
The cancel button is your friend. Do not hesitate to use it.
Interesting idea, but here are the immediate problems as I see them...
Captchas are now twice as annoying for the user, since you have to type two words (but maybe the fact that there is some value in it will appease the user).
Some algorithms these days are quite literally better than humans at detecting the hidden text in captchas. Pictures, not text, are better for this purpose.
Testing the answer against another user's answer is a good idea in principle (it's how they make sure no one is cheating in distributed computing projects), but giving the same answer as another user is not difficult when both are running the same algorithm. We can assume that any algorithm being applied against this captcha is trying to do loads of work (that is, after all, why you write such a program), and so it will be answering the same question multiple times.
Am I right on these points? (I just woke up).
> The test, known as a CAPTCHA (Completely Automated Turing Test To Tell Computers and Humans Apart), was originally designed at Carnegie Mellon to help to keep out automated programs known as "bots."
Where did they get the "P" from?
If all slashdotters decided to answer "CowboyNeal" for the second CAPTCHA word, there is a good chance of his name appearing in one of the deciphered old texts. CowboyNeal to the Old Testament! This points out one major disadvantage of the system: since the computer can't check whether the answer is correct, a large group of people can abuse it, and their chances of success grow over time. Since there is no incentive to answer the second CAPTCHA word correctly, making it widely known that the second word is not checked was less than a good idea. Good cause undermined by wide publicity. I, for one, welcome our new old-text-obfuscating slashdotter overlords.
Attitudes make the difference between Space and Time: we want to MAX our temporal, and MIN our spatial extension.
For all of you Drupal admins out there, I just wanted to let you know that there is a reCAPTCHA module that makes using reCAPTCHA a snap.
I'm not affiliated with the project, other than as a happy, comment-spam-free user of it.
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Still won't work. It's safe to assume the distortion/noise added to the text to prevent simple OCR would be different for each instance of the image; that's the whole point, after all. Hashes of the image data are useless in that case.
Storing the hashes of successfully identified images is useless too... once a word is identified by at least two parties, it is removed from circulation. That means if the attacker IDs a word correctly, chances are it won't stay in the system much longer. Even if the attackers manage to find a way to identify the same word despite the random distortions mentioned above (which would effectively beat *all* CAPTCHA systems anyway), using that data more than a few times guarantees the word will be removed from circulation.
=Smidge=
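To make the point about hashes concrete, here is a small sketch in Python. It assumes, as the comment above does, that each served image of a word gets its own random distortion. The `render_distorted` function is a hypothetical stand-in that just appends seeded noise bytes to the word; a real CAPTCHA would warp pixels instead, but the effect on a hash of the image data is the same.

```python
import hashlib
import random


def render_distorted(word, seed):
    """Hypothetical stand-in for an image renderer.

    Returns bytes representing `word` plus per-instance noise derived
    from `seed`. Real systems distort pixels; this only imitates the
    fact that every served image differs at the byte level.
    """
    rng = random.Random(seed)
    noise = bytes(rng.randrange(256) for _ in range(16))
    return word.encode("utf-8") + noise


def image_hash(image_bytes):
    """SHA-256 hex digest of the raw image data."""
    return hashlib.sha256(image_bytes).hexdigest()


# The same word served twice with different distortion.
first = render_distorted("vances", seed=1)
second = render_distorted("vances", seed=2)

# A lookup table keyed on the first image's hash misses the second image.
lookup = {image_hash(first): "vances"}
print(image_hash(second) in lookup)
```

A table keyed on exact image hashes only ever matches a byte-identical image, so it is no help to an attacker unless the server reuses the same rendered image.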
It doesn't seem like these reCAPTCHAs require that the user type in the correct case for letters. Won't this be a problem for the transcribed text? Even if they don't absolutely require it, they should at least request that the user use the correct case.
After doing a hundred or so, I can see several issues that may hurt accuracy even when the text is human-readable:
1) Hyphenated word fragments broken over lines, e.g. "vances" where you can't see the "ad-" from the previous line.
2) Dialectal spellings of English words, e.g. British spelling where "s" replaces "z" in verb forms such as "categorise"
3) Numbers with commas/decimals. Is that thirteen-thousand "13,000" or a precise thirteen "13.000" to three places?
4) Archaic spellings and outdated words. Because these are old books being digitized (only books before 1923 are out of copyright) this is quite common.
But it's a brilliant idea and for the majority of the text samples there was no ambiguity.
-- Insert witty one-liner here. --