Carnegie Mellon CAPTCHA Digitization Project Now Underway

Posted by Zonk on Tuesday October 2, 2007 @12:44AM from the way-more-fun-than-the-usual-kind dept.

tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but it was prior to it getting off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie Mellon and one of which isn't. If the user correctly deciphers the known word, then the answer for the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"

Fiery church? by gEvil (beta) · 2007-10-02 00:48 · Score: 2, Funny

Is this proof that Carnegie Mellon (and the BBC) support religious terrorism?

--
This guy's the limit!
I want to participate... by DrWho520 · 2007-10-02 00:50 · Score: 3, Interesting

Where can I sign up? Sounds like a great way to burn a few hours on a rainy Saturday afternoon!

--
The cancel button is your friend. Do not hesitate to use it.
1. Re:I want to participate... by morgan_greywolf · 2007-10-02 00:54 · Score: 2, Funny
  
  Apparently, you have to go down to the fiery church to burn a few hours...
  
  --
  My blog
2. Re:I want to participate... by EvilGrin666 · 2007-10-02 00:54 · Score: 4, Informative
  
  Here's the website, http://recaptcha.net/
3. Re:I want to participate... by Falkkin · 2007-10-02 02:00 · Score: 2, Insightful
  
  Our demo at http://recaptcha.net/fastcgi/demo/recaptcha keeps track of the number of words you've digitized. :)
Problems by David_Shultz · 2007-10-02 01:07 · Score: 2, Interesting

Interesting idea, but here are the immediate problems as I see them...

Captchas are now twice as annoying for the user, since you have to type two words (but maybe the fact that there is some value in it will appease the user).

Some algorithms these days are quite literally better than humans at detecting the hidden text in captchas. Pictures, not text, are better for this purpose.

Testing the answer against another user's answer is a good idea in principle (it's how they make sure no one is cheating in distributed computing projects), but giving the same answer as another user is not difficult when both are using the same algorithm. We can assume that any algorithm being applied against this captcha is trying to do loads of work (that is, after all, why you write such a program) and so it will be answering the same question multiple times.

Am I right on these points? (I just woke up).
1. Re:Problems by AltGrendel · 2007-10-02 01:13 · Score: 3, Insightful
  
  I agree, but if you think about it, it's really a win-win for Carnegie Mellon. Either way, they get the text translated.
  
  --
  The simple truth is that interstellar distances will not fit into the human imagination
  - Douglas Adams
2. Re:Problems by jsight · 2007-10-02 01:22 · Score: 5, Insightful
  
  I agree... I don't understand why people find so many silly faults with this.
  
  1. It's not twice as annoying. Compared to how faded and scrambled many "one-word" captchas are, this is significantly less annoying.
  2. People seem to be acting like someone will fill out one word correctly and then intentionally scramble the other to screw up the project. Not many people are crazy enough to even want to do that. But even if they were, how do they know which word is the known, and which is the unknown?
  3. Endless Supply - Each word that is correctly translated is another word that is "known" and therefore can be safely used as a known in a new captcha.
  4. Verification - Thanks to #3, they could also potentially maintain the verification % rate for various words to later determine the accuracy or inaccuracy of past translations (assuming that they ever find that to be a problem).
  
  Yeah, we all know that captchas are not perfect, but this project is a better idea than most. And because it is centralized, they can update the image generation scheme centrally if it is broken.
  
  In practice, these seem to get broken less often than people think.
  
  --
  Throw the bums out!
Re:Rock on by Fluffy Bunnies · 2007-10-02 01:10 · Score: 2, Insightful

Where in TFA does it say that the one on the right is always the right one?
Give it a go! by cookieinc · 2007-10-02 01:11 · Score: 2, Informative

You can try it out at the top of this page.
Re:Rock on by Smidge204 · 2007-10-02 01:11 · Score: 2, Insightful

You don't know which word is known (and checked against) and which is unknown. This makes your OCR attack less effective because you must get BOTH words right in order to guarantee success.

Also, if the first two people to decipher the unknown word don't agree, then the word is recycled back into the system until "a lot more people" submit the same answer. This greatly reduces the threat of a "garbage attack" because any random input is unlikely to be repeated by the second person to get that word, or anyone else for that matter.

You didn't even have to RTFA to get that much...
=Smidge=
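
The agreement rule described in this comment can be sketched roughly as follows. This is only an illustration, not reCAPTCHA's actual code: the class name `UnknownWord`, the case-insensitive comparison, and the value 5 standing in for "a lot more people" are all assumptions.

```python
from collections import Counter


class UnknownWord:
    """Tracks human answers for one word the OCR could not read (hypothetical model)."""

    def __init__(self, escalated_threshold=5):
        # Assumed stand-in for "a lot more people"; the real value is not given.
        self.escalated_threshold = escalated_threshold
        self.votes = Counter()
        self.accepted = None

    def submit(self, answer):
        """Record one answer; return the accepted transcription, or None if still open."""
        if self.accepted is not None:
            return self.accepted
        # Assumption: answers are compared case-insensitively.
        self.votes[answer.strip().lower()] += 1
        total = sum(self.votes.values())
        top, count = self.votes.most_common(1)[0]
        if total == 2 and count == 2:
            # The first two users agree: take the word off the list.
            self.accepted = top
        elif count >= self.escalated_threshold:
            # After a disagreement, wait until one answer has many supporters.
            self.accepted = top
        return self.accepted
```

Under this model a single garbage answer only delays acceptance; it is never accepted unless enough other users independently repeat it.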
Re:I'm not so sure this is a good idea. by necro81 · 2007-10-02 01:26 · Score: 3, Insightful

> There still needs to be a human reviewing the work before it's truly accepted, and that human might as well be doing it in the first place, with the context still there to help them.
That is, for all intents and purposes, impractical, which was the entire point. The backlog of work was never going to get done in a reasonable timescale with dedicated humans correcting all the errors. A dedicated human, even with the context, will still make mistakes or get stumped.

Most people, when presented with a CAPTCHA, make an honest effort to try and get it right - otherwise they can't get their precious Facebook account. The number of people who understand what's going on with this reCAPTCHA thing is probably pretty small. Finally, those who know what it is about are probably inclined to not be jackasses and purposefully screw it up. I'd say that honest errors and malicious errors are a very small portion of reCAPTCHA responses. While flawed, this system might still be, say, 95% correct. So, for accepting a certain amount of error, you are able to get as much character recognition done as you are able to supply. As the article says:
> Given that it takes about 10 seconds to decipher a reCAPTCHA and type in the answer, this represents the equivalent of almost three thousand man hours a day spent deciphering words that CMU's computers find illegible.
3000 man-hours a day at 95% accuracy versus, maybe, a few dozen man-hours a day at slightly higher accuracy. You tell me which is better.
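
For a sense of scale, the figures in this comment can be turned into a rough count of solved CAPTCHAs per day. The 10 seconds and the (rounded-up) 3000 man-hours come from the article quote; the 95% figure is only the guess made above, not a measured accuracy.

```python
SECONDS_PER_SOLVE = 10      # from the article quote
MAN_HOURS_PER_DAY = 3000    # "almost three thousand", rounded up
ASSUMED_ERROR_PERCENT = 5   # the comment's guess of 95% accuracy, not a measurement

# 3000 hours * 3600 seconds per hour / 10 seconds per solve
solves_per_day = MAN_HOURS_PER_DAY * 3600 // SECONDS_PER_SOLVE
expected_errors = solves_per_day * ASSUMED_ERROR_PERCENT // 100

print(solves_per_day)   # 1080000
print(expected_errors)  # 54000
```

So "almost three thousand man hours" corresponds to roughly a million solutions a day, of which about 54,000 would be wrong if the 95% guess held.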
Re:I'm not so sure this is a good idea. by smallfries · 2007-10-02 01:27 · Score: 4, Insightful

Wouldn't the easy solution be to present the context as part of the reCAPTCHA? Rather than two single words from isolated contexts, present two "lines" with a word or two either side, and a slight colour change on the target words to indicate which ones the system is after. This would make your validation easier but wouldn't aid OCR in any way.

For your other point, there should be a "not a word" button to hit in that case to flag up that the original OCR has screwed up the word boundary.

I thought it was a really novel project, reminds me of the image tagging "games" that people came up with last year, but in a new problem domain.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
CATTTTCHA? by MichailS · 2007-10-02 01:35 · Score: 2, Interesting

> The test, known as a CAPTCHA (Completely Automated Turing Test To Tell Computers and Humans Apart), was originally designed at Carnegie Mellon to help to keep out automated programs known as "bots."

Where did they get the "P" from?
Re:I'm not so sure this is a good idea. by MrMr · 2007-10-02 01:40 · Score: 5, Funny

> I've already had words like 'Alau' and '45-618' in the few I've done, and since there's an ugly line through them, I can't be close to sure it's right... They make no sense, but they look like that.

Congratulations,
you managed to fail the Turing test.
`CowboyNeal' answer to all CAPTCHAs by gyepi · 2007-10-02 01:47 · Score: 2, Interesting

If all Slashdotters decided to answer the second CAPTCHA question with CowboyNeal, there is a large chance of his name appearing in one of the deciphered old texts. CowboyNeal to the Old Testament! This points out one major disadvantage of the system: since the computer can't check whether the answer is correct, a large group of people can abuse it, with a probability of success that grows over time. Since there is no incentive to answer the second CAPTCHA correctly, making it widely known that the second CAPTCHA is not checked was less than a good idea. Good cause undermined by wide publicity. I, for one, welcome our new old-text-obfuscating Slashdotter overlords.

--
Attitudes make the difference between Space and Time: we want to MAX our temporal, and MIN our spatial extension.
1. Re:`CowboyNeal' answer to all CAPTCHAs by Falkkin · 2007-10-02 02:21 · Score: 5, Informative
  
  Sorry, but we've already thought of this attack :)
  
  We can compute the daily frequency of each human-provided solution and automatically flag anything that suddenly jumps in popularity. It's especially suspicious if these answers always disagree with the OCR's guess (often the OCR happens to be right, but just doesn't have high confidence).
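
A rough sketch of the kind of frequency check described here, assuming only that answers are logged per day. The function name `flag_spikes` and the thresholds `min_count` and `ratio` are illustrative inventions, not values from the reCAPTCHA team, and the comparison against the OCR's guess mentioned above is not modeled.

```python
from collections import Counter


def flag_spikes(yesterday, today, min_count=20, ratio=10.0):
    """Return answers whose share of today's answers jumped sharply.

    yesterday / today: iterables of submitted answer strings.
    min_count and ratio are illustrative thresholds, not real values.
    """
    prev, curr = Counter(yesterday), Counter(today)
    prev_total = max(sum(prev.values()), 1)
    curr_total = max(sum(curr.values()), 1)
    flagged = []
    for answer, count in curr.items():
        if count < min_count:
            continue  # too rare today to be worth flagging
        curr_share = count / curr_total
        # Add-one smoothing so answers unseen yesterday do not divide by zero.
        prev_share = (prev[answer] + 1) / (prev_total + 1)
        if curr_share / prev_share >= ratio:
            flagged.append(answer)
    return sorted(flagged)


yesterday = ["the"] * 500 + ["church"] * 5
today = ["the"] * 500 + ["cowboyneal"] * 300
print(flag_spikes(yesterday, today))  # ['cowboyneal']
```

A common word like "the" keeps a stable share from day to day and is ignored, while an answer that goes from nowhere to a large share is flagged for review.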
2. Re:`CowboyNeal' answer to all CAPTCHAs by gyepi · 2007-10-02 03:18 · Score: 2
  
  Is there any word on how CAPTCHA decoders, like PWNtcha, perform against the current reCAPTCHA?
  
  In case reCAPTCHA can be automatically deciphered efficiently, a slightly altered malevolent attack might still be feasible. Let D be a roughly complete list of English words (a dictionary), together with the relative frequencies of the words occurring in standard English texts. Generate a fixed mapping f from D to D such that words are going to be assigned to each other only in case their occurrence frequencies are roughly the same - e.g. `banana' could be mapped to `orange' since their relative frequency (I guess) is roughly the same.
  Now let your deciphering program attack the reCAPTCHA service such that it guesses the two words from the presented CAPTCHA, gives the correct answer to one of them (at random), and gives the permuted answer (according to f) to the other. You will see no bumps in the frequencies, and roughly every second attempt will put in false information to the database. Since f is fixed, sooner or later the same word will come up again, in case the false answer is going to be verified.
  
  Even without an efficient automated reCAPTCHA decipherer, you could do the same with a bunch of people, just tell them that as a first attempt always go to a website where a small cgi script gives you back f(Word) for any given Word. I'm not claiming that you can find enough evil people for that around here, of course...
  
  ((Obviously the efficiency of this attack can be increased by mapping a very common word - say, "with" - to an uncommon one, and mapping a whole bunch of uncommon words to "with" so that, on the basis of relative occurrence frequencies in standard texts and the estimated ratio of malevolent/benevolent users, you see no frequency bumps. The advantage of the simpler but less efficient method above is that it doesn't require a guess of the ratio of the malevolent/benevolent users.))
  
  --
  Attitudes make the difference between Space and Time: we want to MAX our temporal, and MIN our spatial extension.
3. Re:`CowboyNeal' answer to all CAPTCHAs by Falkkin · 2007-10-02 03:31 · Score: 2
  
  PWNtcha does not defeat reCAPTCHA, nor are we aware of any existing OCR or CAPTCHA-breaking algorithms that do. We are working with research groups at a couple universities who are trying to break our CAPTCHA (and if they can, we'll obviously fix it). In case we do notice a break, it's trivial for us to switch to a completely different kind of CAPTCHA (using different distortions). Because our system is a web service, if there is a security breach, we can fix it for all sites at once by simply changing the distortions on our challenge images. This is a big security benefit compared to other CAPTCHA systems that are difficult (at best) to patch and update.
  
  As you point out, if we did get broken on a wide scale, it would be possible to seed bad data into the system. However, it's easy enough for us to simply distrust all responses that happened during the vulnerable period.
"Turing" test by DrLex · 2007-10-02 01:49 · Score: 2, Informative

Well, this finally makes CAPTCHAs somewhat useful. I won't try to formulate it in some sugar-coated way: I personally hate CAPTCHAs. On some types (especially the ones from Digg), I fail about 50% of them, and that's getting quite annoying after a while. Especially when your code is rejected even if you believe there is no doubt about what you've read in the image.
I believe CAPTCHAs are the wrong solution to the wrong problem. It's a bit exaggerated to call them a "Turing test", because I'm quite sure that OCR systems will be made in the near future that are better than humans in reading CAPTCHAs. A simple text-based question that requires actual intelligence is a much better Turing test, and also a much smaller nuisance for people with impaired vision. Of course, writing a foolproof system that can produce a nearly infinite amount of such questions is a challenging problem by itself.
Re:I'm not so sure this is a good idea. by Falkkin · 2007-10-02 02:07 · Score: 5, Informative

"And that's not even counting malice where people deliberately put wrong words in."

We're already getting several million legitimate solutions a day. The chance that a few malicious people would happen to get the same CAPTCHA is relatively small. Also, for many of our words, the OCR's answer happens to be correct -- it just doesn't have high confidence in the word. If a single person agrees with the OCR in this case, we can mark the word as "read" with no further human confirmation. For this reason, many of the words will only ever be shown to a single human.
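
The shortcut described here (one human agreeing with a low-confidence OCR guess is enough) might look something like the sketch below. The function name `resolve`, the case-insensitive matching, and the fallback of two agreeing humans are assumptions for illustration; the comment does not spell out the exact rules.

```python
def resolve(ocr_guess, human_answers, votes_needed=2):
    """Return the accepted transcription for one word, or None if unresolved.

    ocr_guess: the OCR's low-confidence reading (may be None).
    human_answers: answers received so far, in order.
    """
    # Assumption: matching ignores case and surrounding whitespace.
    normalized = [a.strip().lower() for a in human_answers]
    if ocr_guess is not None:
        guess = ocr_guess.strip().lower()
        # A single human matching the OCR guess marks the word as read.
        if guess in normalized:
            return guess
    # Otherwise fall back to agreement between humans.
    for answer in normalized:
        if normalized.count(answer) >= votes_needed:
            return answer
    return None


print(resolve("fiery", ["Fiery"]))           # fiery
print(resolve("flery", ["fiery"]))           # None
print(resolve("flery", ["fiery", "fiery"]))  # fiery
```

When the OCR guess is right, the word is retired after one showing, which is why many words would only ever reach a single human.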
Peekaboom by EnsilZah · 2007-10-02 02:08 · Score: 2

Sounds like what they're doing at Peekaboom and The ESP Game, harnessing humans to solve problems that are difficult for computers.
Here's a nice video on the subject.
Re:I'm not so sure this is a good idea. by Alzheimers · 2007-10-02 02:11 · Score: 4, Funny

ELIZA > And how does this make you feel?
Drupal Module makes it simple by Slashdot Parent · 2007-10-02 02:19 · Score: 3, Interesting

For all of you Drupal admins out there, I just wanted to let you know that there is a reCAPTCHA module that makes using reCAPTCHA a snap.

I'm not affiliated with the project, other than as a happy, comment-spam-free user of it.

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Re:I'm not so sure this is a good idea. by Falkkin · 2007-10-02 02:38 · Score: 3, Informative

You said "people" putting in wrong words (à la the suggestion someone made below about "everyone fill in CowboyNeal!"), which is quite different from automated attacks. For that, we have numerous scripts that notice various forms of anomalous behavior from any given IP. We manually review these to make sure the answers are reasonable. We are also working with CERT, who have a large database of botnetted machines, to detect attacks. I'm not going to give complete details of everything we check, but rest assured that we are very active in preventing attacks -- our goal is to be the best CAPTCHA in the world, and we take security threats very seriously.

In terms of the digital output, we spot-check some of the transcribed pages every day. These spot-checks will also turn up any anomalous solutions, with high probability.
Re:I'm not so sure this is a good idea. by Eponymous Bastard · 2007-10-02 02:46 · Score: 2

I got "derground". If they are getting this from digitized books, they have to work on undoing hyphenation before presenting it to the user.

I wonder whether, after this has been running for a while, most of the unknown words will be nonsense (jabberwocky, snickersnee), proper or made-up names (Elric of Melnibone? I saw Benoit in the third captcha I solved, and I now got one that looks like Visscher), numbers, and other things people wouldn't work through.

The other problem is with common words that OCR gets wrong. I've/me are common enough that they might be overrepresented, or undertranslated.

In the end, since this is a university project, the end product is not the product itself (translated books) but rather the papers and master/PhD theses you can write with the data. Are people better at OCR than computers? By how much? How much is people's ability to recognize a word impaired by cutting off the context? Are people better at common words than at proper names and unknown words?
JS is almost unavoidable for logins now. by Kadin2048 · 2007-10-02 02:57 · Score: 2, Informative

Unfortunately I think most CAPTCHAs use JS; it's been a while since I've been to a site that didn't make me turn it on to get through login/registration. I have no idea why this is, since people have been doing login pages since before JS was around or popular, but now it seems like the way every idiot is doing it.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Re:Rock on by Smidge204 · 2007-10-02 03:22 · Score: 2, Interesting

Still won't work. It's safe to assume the distortion/noise added to the text to prevent simple OCR would be different for each instance of the image; that's the whole point, after all. Hashes of the image data are useless in that case.

Storing the hashes for successfully identified images is also useless... once a word is identified by at least two parties, it is removed from circulation. That means if the attacker IDs a word correctly, chances are it won't stay in the system much longer. Even if the attackers manage to find a way to identify the same word despite the random distortions mentioned above (which would effectively beat *all* CAPTCHA systems anyway), using that data more than a few times guarantees it will be removed from circulation.
=Smidge=
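
A tiny demonstration of the first point, assuming (as the comment does) that every rendering of a word carries different noise. The byte strings are made-up stand-ins for image data, not real reCAPTCHA images.

```python
import hashlib

# Two hypothetical renderings of the same word; the differing trailing
# byte stands in for per-instance distortion and noise.
image_a = b"IMAGE:church" + bytes([0x10])
image_b = b"IMAGE:church" + bytes([0x11])

hash_a = hashlib.sha256(image_a).hexdigest()
hash_b = hashlib.sha256(image_b).hexdigest()

# A lookup table keyed on hash_a never matches the second rendering.
print(hash_a == hash_b)  # False
```

Any change to the image bytes yields an unrelated digest, so a table of hashes for previously solved images only helps if the exact same file is served twice.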
Not case sensitive? Uh oh by cshay · 2007-10-02 03:54 · Score: 2, Interesting

It doesn't seem like these reCAPTCHAs require that the user type in the correct case for letters. Won't this be a problem for translated text? Even if they don't absolutely require it, they should at least request that the user use the correct case.
Minor problems but good overall by MrKevvy · 2007-10-02 04:58 · Score: 2, Interesting

After doing a hundred or so, I can see several problems that may hurt accuracy even if the text is human-readable:

1) Hyphenated word fragments broken over lines, e.g. "vances" where you can't see the "ad-" from the previous line.
2) Dialectal spellings of English words, e.g. British spelling where "s" replaces "z" in verb forms such as "categorise"
3) Numbers with commas/decimals. Is that thirteen-thousand "13,000" or a precise thirteen "13.000" to three places?
4) Archaic spellings and outdated words. Because these are old books being digitized (only books before 1923 are out of copyright) this is quite common.

But it's a brilliant idea and for the majority of the text samples there was no ambiguity.

--
-- Insert witty one-liner here. --