Carnegie Mellon CAPTCHA Digitization Project Now Underway
tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but that was before it got off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie, and one of which isn't. If the user correctly deciphers the known word, then their reading of the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
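For anyone who wants the scheme from the summary spelled out, here is a rough sketch in Python. It is not the project's actual code: the names (`known_answers`, `submit`, the image and user IDs) and the case-insensitive comparison are all assumptions made for illustration. Only the two rules come from the summary: the known word gates the CAPTCHA, and two different users must agree on the unknown word.

```python
from collections import defaultdict

AGREEMENT_NEEDED = 2  # from the summary: two different users must agree

# Hypothetical data: one control word whose answer is already known.
known_answers = {"img-known-1": "morning"}

# unknown image id -> answer -> set of user ids who gave that answer
votes = defaultdict(lambda: defaultdict(set))
resolved = {}  # unknown image id -> accepted transcription


def normalize(text):
    # Assumption: answers are compared case-insensitively, ignoring outer spaces.
    return text.strip().lower()


def submit(user_id, known_id, known_guess, unknown_id, unknown_guess):
    """Return True if the user passes the CAPTCHA."""
    if normalize(known_guess) != known_answers[known_id]:
        # Control word is wrong: fail the user and ignore the other guess.
        return False
    if unknown_id not in resolved:
        guess = normalize(unknown_guess)
        votes[unknown_id][guess].add(user_id)
        if len(votes[unknown_id][guess]) >= AGREEMENT_NEEDED:
            resolved[unknown_id] = guess  # taken off the list
            del votes[unknown_id]
    return True
```

Tracking user IDs in a set means the same person repeating an answer does not count as a second confirmation.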
Here's the website, http://recaptcha.net/
I agree... I don't understand why people find so many silly faults with this.
1. It's not twice as annoying. Compared to how faded and scrambled many "one-word" captchas are, this is significantly less annoying.
2. People seem to be acting like someone will fill out one word correctly and then intentionally scramble the other to screw up the project. Not many people are crazy enough to even want to do that. But even if they were, how do they know which word is the known, and which is the unknown?
3. Endless Supply - Each word that is correctly translated is another word that is "known" and therefore can be safely used as a known in a new captcha.
4. Verification - Thanks to #3, they could also potentially maintain the verification % rate for various words to later determine the accuracy or inaccuracy of past translations (assuming that they ever find that to be a problem).
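Points 3 and 4 could look something like the sketch below. This is only a guess at how one might do it, not anything the reCAPTCHA team has described: the `ControlWordStats` class, the `suspicious_words` helper, and the thresholds are all made up for illustration. The idea is that once a word is promoted to "known", you keep counting how often users match it, and a low match rate hints that the accepted transcription was wrong.

```python
from dataclasses import dataclass


@dataclass
class ControlWordStats:
    """Per-word bookkeeping for a word that has been promoted to 'known'."""
    answer: str
    shown: int = 0
    matched: int = 0

    def record(self, guess: str) -> None:
        # Count every time the word is shown as a control word.
        self.shown += 1
        if guess.strip().lower() == self.answer:
            self.matched += 1

    @property
    def match_rate(self) -> float:
        return self.matched / self.shown if self.shown else 0.0


def suspicious_words(stats, min_shown=20, min_rate=0.5):
    """Return ids of known words that users rarely match (hypothetical thresholds)."""
    return [
        word_id
        for word_id, s in stats.items()
        if s.shown >= min_shown and s.match_rate < min_rate
    ]
```

A word that has been shown often but is matched less than half the time would be a candidate for going back on the "unknown" list.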
Yeah, we all know that captchas are not perfect, but this project is a better idea than most. And because it is centralized, they can update the image generation scheme centrally if it is broken.
In practice, these seem to get broken less often than people think.
Throw the bums out!
Wouldn't the easy solution be to present the context as part of the reCAPTCHA? Rather than two single words from isolated contexts, present two "lines" with a word or two either side, and a slight colour change on the target words to indicate which ones the system is after. This would make your validation easier but wouldn't aid OCR in any way.
For your other point, there should be a "not a word" button to hit in that case to flag up that the original OCR has screwed up the word boundary.
I thought it was a really novel project, reminds me of the image tagging "games" that people came up with last year, but in a new problem domain.
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
I've already had words like 'Alau' and '45-618' in the few I've done, and since there's an ugly line through them, I can't be close to sure they're right... They make no sense, but they look like that.
Congratulations,
you managed to fail the Turing test.
"And that's not even counting malice where people deliberately put wrong words in."
We're already getting several million legitimate solutions a day. The chance that a few malicious people would happen to get the same CAPTCHA is relatively small. Also, for many of our words, the OCR's answer happens to be correct -- it just doesn't have high confidence in the word. If a single person agrees with the OCR in this case, we can mark the word as "read" with no further human confirmation. For this reason, many of the words will only ever be shown to a single human.
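To make the rule concrete, here is a small sketch of the decision as described in this comment. The function name `resolve_word`, the normalization, and the `agreement_needed` parameter are assumptions for illustration; the real system surely has more going on. A single human answer that matches the OCR's low-confidence guess settles the word, and otherwise it falls back to requiring agreement between humans.

```python
from collections import Counter


def resolve_word(ocr_guess, human_answers, agreement_needed=2):
    """Return the accepted transcription, or None if the word is still open."""
    answers = [a.strip().lower() for a in human_answers]

    # One human agreeing with the OCR's guess is enough.
    if ocr_guess is not None:
        ocr = ocr_guess.strip().lower()
        if ocr in answers:
            return ocr

    # Otherwise require several humans to agree with each other.
    counts = Counter(answers)
    if counts:
        answer, n = counts.most_common(1)[0]
        if n >= agreement_needed:
            return answer
    return None
```

With this rule, a word where the OCR was right is retired after a single human sees it, which matches the point that many words are only ever shown once.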
ELIZA > And how does this make you feel?
Sorry, but we've already thought of this attack :)
We can compute the daily frequency of each human-provided solution and automatically flag anything that suddenly jumps in popularity. It's especially suspicious if these answers always disagree with the OCR's guess (often the OCR happens to be right, but just doesn't have high confidence).
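A toy version of that check might look like the following. The function name `flag_spikes`, the day-over-day comparison, and both thresholds are invented for illustration; the comment above only says that daily frequencies are computed and sudden jumps are flagged.

```python
def flag_spikes(yesterday, today, min_count=50, ratio=10.0):
    """Flag answers whose daily count jumped sharply.

    yesterday, today: dicts mapping an answer string to how many times
    it was submitted that day. Thresholds are hypothetical.
    """
    flagged = []
    for answer, count in today.items():
        # Treat unseen answers as a baseline of 1 to avoid dividing by zero.
        baseline = max(yesterday.get(answer, 0), 1)
        if count >= min_count and count > ratio * baseline:
            flagged.append(answer)
    return sorted(flagged)
```

A fuller version would also look at how often each flagged answer disagrees with the OCR's guess, since the comment calls that out as especially suspicious.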