Carnegie Mellon CAPTCHA Digitization Project Now Underway
tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but that was before it got off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie Mellon and one of which isn't. If the user correctly deciphers the known word, then the answer for the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
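For anyone who wants the scheme in the summary spelled out, here is a minimal Python sketch of it. This is not reCAPTCHA's actual code: the class name `UnknownWordPool`, the `submit` signature, and the case-insensitive `normalize` step are all assumptions made for illustration. Only the "check the known word, then require two different users to agree on the unknown word" rule comes from the summary.

```python
from collections import defaultdict

# Per the summary: two different users must give the same answer.
AGREEMENT_THRESHOLD = 2


def normalize(text):
    # Assumption: answers are compared ignoring case and surrounding spaces.
    return text.strip().lower()


class UnknownWordPool:
    """Hypothetical store of human guesses for words OCR could not read."""

    def __init__(self):
        # word_id -> normalized guess -> set of user ids who gave that guess
        self.votes = defaultdict(lambda: defaultdict(set))
        # word_id -> transcription that enough users agreed on
        self.accepted = {}

    def submit(self, user_id, control_answer, control_truth, word_id, guess):
        """Return True if the user passes, i.e. the known word matches."""
        if normalize(control_answer) != normalize(control_truth):
            # Failed the known word: the unknown-word guess is discarded.
            return False
        if word_id not in self.accepted:
            voters = self.votes[word_id][normalize(guess)]
            voters.add(user_id)  # a set, so one user cannot vote twice
            if len(voters) >= AGREEMENT_THRESHOLD:
                self.accepted[word_id] = normalize(guess)
        return True
```

Note that in this sketch a user who gets the known word right always passes, whatever they type for the unknown word, which is the property several comments below poke at.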
Is this proof that Carnegie Mellon (and the BBC) support religious terrorism?
This guy's the limit!
Good idea, congrats to all the smart people who came up with this one.
Where can I sign up? Sounds like a great way to burn a few hours on a rainy, Saturday afternoon!
The cancel button is your friend. Do not hesitate to use it.
It's been a while since I looked at reCAPTCHA, but IIRC it relies on JavaScript and document.write(), so it's useless for any XHTML site. The audio CAPTCHAs likewise assume the screen reader is capable of running scripts.
If signing up to a wiki, or creating a bogus mail account means a little beneficial work is done, then even after replacing all the useful content with links, or sending out hundreds of spams your actions would still be karma neutral, right?
Time to get linking...
So, the plan is to take already hard-to-read words, make them harder to read, pair them with another hard to read word, and see how many people agree it's the same word? I've already had words like 'Alau' and '45-618' in the few I've done, and since there's an ugly line through them, I can't be close to sure it's right... They make no sense, but they look like that. I'm betting at least 1 other person agrees and puts the same thing I did, accepting that translation into the database...
And that's not even counting malice where people deliberately put wrong words in... Chances are they won't both put the wrong word for the same word, but it -can- happen, especially with malicious intent.
It's a neat idea, but I don't think it'll work all that well. There still needs to be a human reviewing the work before it's truly accepted, and that human might as well be doing it in the first place, with the context still there to help them.
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
Interesting idea, but here are the immediate problems as I see them...
Captchas are now twice as annoying for the user, since you have to type two words (but maybe the fact that there is some value in it will appease the user).
Some algorithms these days are quite literally better than humans at detecting the hidden text in captchas. Pictures, not text, are better for this purpose.
Testing the answer against another user's answer is a good idea in principle (it's how they make sure no one is cheating in distributed computing projects), but giving the same answer as another user is not difficult when both are using the same algorithm. We can assume that any algorithm being applied against this CAPTCHA is trying to do loads of work (that is, after all, why you write such a program), and so it will be answering the same question multiple times.
Am I right on these points? (I just woke up).
You can try it out at the top of this page.
Take this OCR software, we still own you! And now you come crawling back!
Hey, it's even resistant to typos! I got "terson reported", typed in "tersonn reportted", and it said "Correct!". ...hey, wait a minute....
There is a presentation about similar topics by Luis von Ahn on here. The presentation talks about using what he calls human computation, basically using people on the internet to perform various tasks that are difficult for computers to do. One idea is using people playing a game to label images on the internet so that they can be indexed with much greater accuracy than the current google image search.
> The test, known as a CAPTCHA (Completely Automated Turing Test To Tell Computers and Humans Apart), was originally designed at Carnegie Mellon to help to keep out automated programs known as "bots."
Where did they get the "P" from?
I did that protect your email address with OCR thing at http://mailhide.recaptcha.net/ and tried solving it myself. I mistyped one of the words accidentally and noticed a second after I hit enter. It said 'Congrats you're a human!' and proceeded to give me the address.
If all Slashdotters decided to answer the second CAPTCHA word with CowboyNeal, there is a good chance of his name appearing in one of the deciphered old texts. CowboyNeal to the Old Testament! This points out one major disadvantage of the system: since the computer can't check whether the answer is correct, a large group of people can abuse it, and the probability of a bad answer getting through grows over time. Since there is no incentive to answer the second CAPTCHA word correctly, making it widely known that the second word is not checked was less than a good idea. Good cause undermined by wide publicity. I, for one, welcome our new old-text-obfuscating Slashdotter overlords.
Attitudes make the difference between Space and Time: we want to MAX our temporal, and MIN our spatial extension.
Well, this finally makes CAPTCHAs somewhat useful. I won't try to formulate it in some sugar-coated way: I personally hate CAPTCHAs. On some types (especially the ones from Digg), I fail about 50% of them, and that's getting quite annoying after a while. Especially when your code is rejected even if you believe there is no doubt about what you've read in the image.
I believe CAPTCHAs are the wrong solution to the wrong problem. It's a bit exaggerated to call them a "Turing test", because I'm quite sure that OCR systems will be made in the near future that are better than humans in reading CAPTCHAs. A simple text-based question that requires actual intelligence is a much better Turing test, and also a much smaller nuisance for people with impaired vision. Of course, writing a foolproof system that can produce a nearly infinite amount of such questions is a challenging problem by itself.
Sounds like what they're doing at Peekaboom and The ESP Game, harnessing humans to solve problems that are difficult for computers.
Here's a nice video on the subject.
I supervise an America's Army clan website which uses phpBB for the forums. Spam bots were barely slowed down by the standard CAPTCHA registration requirement. I'd get dozens of bogus registration requests a day from bots that used OCR to get in.
:-)
A couple of months ago I switched to recaptcha.net's plugin for phpBB and it stemmed the tide. The number of spam bots getting through decreased greatly. For those that did, I felt slightly better when I deleted their unfulfilled registration requests. Their Evil CPU cycles had been reclaimed for Good!
Now, if this gains momentum, I expect the spam bots to get tweaked OCR that handles reCAPTCHA images better. I also expect it will happen like before, where the annoyance slowly ramped up for me. During that time, there will be an increase in positive results for CMU, which is a good thing.
Once the bots get good enough that I (and other forum admins) change, I expect CMU's OCR algorithms to have improved enough to not need this service.
Learning HOW to think is more important than learning WHAT to think.
For all of you Drupal admins out there, I just wanted to let you know that there is a reCAPTCHA module that makes using reCAPTCHA a snap.
I'm not affiliated with the project, other than as a happy, comment-spam-free user of it.
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Don't worry. The system only accepts a word as correct after two people give the same answer. Hopefully the next person to get your challenge won't make the same typo you did. :)
I can see a serious privacy problem with this, since it divulges the IP address of visitors to a third party (Carnegie Mellon). The API is fundamentally broken, since both the website visitor and the website need to contact the central server (rather than the website alone), which allows said third party to generate personalized profiles of web surfers.
This was reported on Slashdot about a year ago, and after I read about it I set up a CAPTCHA on my page to reveal my email...
:) And for the people who want to participate, you just have to "hide" your email behind a link which will show a CAPTCHA (with the two phrases).
Other than that, it is really nice.
Ubuntu is an African word meaning 'I can't configure Debian'
Always enter "foo" as the second word, just for the heck of it!
Unfortunately I think most CAPTCHAs use JS; it's been a while since I've been to a site that didn't make me turn it on to get through login/registration. I have no idea why this is, since people have been doing login pages since before JS was around or popular, but now it seems like the way every idiot is doing it.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Spammers can use the 'get a human to do it' approach as easily as anyone else can.
They can set up fake porn sites with registrations (collecting more email addresses to spam in the process), and when someone wants to 'register' for the free porn, the spammer's site scrapes a CAPTCHA from the site they want to get into with a bot and shows it to the user trying to sign up for porn. The eager pornhound dutifully types in the answer, which the spammer's scripts can then supply to the site the CAPTCHA originally came from. They can even feed back the results: if the answer doesn't work at the real site, then the user made a mistake and gets another one.
(+5 Frightening)
It doesn't seem like these reCAPTCHAs require that the user type in the correct case for letters. Won't this be a problem for the transcribed text? Even if they don't absolutely require it, they should at least request that the user use the correct case.
What happens if the answer for the unknown word is wrong? (Well, the probability is still there)... ermm... can we replace the word with a random string (a mix of letters and numbers)?
Deciphering Mayan hieroglyphics!
Champollion is rolling in his grave in frustration because he didn't think of this...
I want to play Free Market with a drowning Libertarian.
How does this thing handle capitalizations? What are the chances that two people will be too lazy to Capitalize the proper nouns and acronyms? Two matches to verify a word seems low. Crap I just checked it. I found a group with two capitalized words and entered them without caps. It accepted it.
After doing a hundred or so, I can see several problems that may hurt accuracy even if the text is human-readable:
1) Hyphenated word fragments broken over lines, e.g. "vances" where you can't see the "ad-" from the previous line.
2) Dialectal spellings of English words, e.g. British spelling where "s" replaces "z" in verb forms such as "categorise"
3) Numbers with commas/decimals. Is that thirteen-thousand "13,000" or a precise thirteen "13.000" to three places?
4) Archaic spellings and outdated words. Because these are old books being digitized (only books before 1923 are out of copyright) this is quite common.
But it's a brilliant idea and for the majority of the text samples there was no ambiguity.
-- Insert witty one-liner here. --
"AAAAAAAAAAAAAAAAAAARGHHH!!!"
Hear that? It's the sound of X number of spammers crying out in agony/frustration/pain/rage.
"All hands, BRACE FOR IMPACT!"
I wish I had mod points! Those three links are classics! I could waste HOURS on this!
I imagine that it works like any OCR... they have a guess for what it is and a confidence level. If a character falls below some confidence threshold, they will feed it to a reCAPTCHA user. They may know with 99.5% certainty that the word is "?og", but only 85% certainty that the word is "Dog". Whether a user enters 'd' or 'D' is largely irrelevant.
I could see it being a problem with 'Z' and 'z', or something like that. I'm sure they can parse the language, though, and intelligently decide if it is likely to be a situation that calls for a capital letter in those rare situations.
i agree here....
www.tdobson.net #### Dare to Dream #### blog.tdobson.net
If they hypothetically feed you the words "dark market", they may know with 99.9% confidence that the second word is "market". For the first word, they may know with 99.9% confidence that the word is "?ark"... that first wildcard, though, may be 'd' (85% confidence), 'p' (20%), 'sh' (0.5%), or any number of other things... there is some probability that the wildcard is any given character. If they predict it to be 'd' with 85% confidence (but it is below the threshold), they will take a 'd' or a 'D' as confirmation of that. They aren't going to assume that it is a capital 'D', which might have some negligibly low probability, just because you type a capital 'D'.
Their algorithms are almost certainly smarter than that.
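To make the guess above concrete, here is a rough Python sketch of "ask a human only below a confidence threshold, then treat the typed answer as a case-insensitive confirmation of an OCR candidate". Everything in it is hypothetical: the 0.95 cutoff, the function names `needs_human` and `confirm`, and the idea that candidates arrive as a dict of guess to confidence are assumptions, and nothing here is known to match what CMU actually runs.

```python
# Hypothetical cutoff below which a word is handed to a human.
CONFIDENCE_THRESHOLD = 0.95


def needs_human(candidates):
    """candidates maps each OCR guess (a string) to its confidence."""
    return max(candidates.values()) < CONFIDENCE_THRESHOLD


def confirm(candidates, human_answer):
    """Return the OCR candidate that the typed answer confirms, or None.

    The comparison ignores case, and the result keeps the OCR's own
    casing, so typing 'D' instead of 'd' does not change the outcome.
    """
    typed = human_answer.strip().lower()
    matches = [guess for guess in candidates if guess.lower() == typed]
    if not matches:
        return None
    # If several candidates differ only by case, prefer the likeliest one.
    return max(matches, key=candidates.get)
```

With the numbers from the comment, `{"dark": 0.85, "park": 0.20, "shark": 0.005}` would go to a human, and a typed "Dark" would confirm "dark". An answer that matches no candidate comes back as `None` in this sketch; what a real system would do with such an answer is left open.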
I can't see any reason for this.
Is there really a shortage of willing volunteer transcribers? I seem to remember Project Gutenberg getting far more volunteers than they could use, without even asking...
And speaking for myself, I'm sure I could transcribe a couple full sentences more quickly than I could two arbitrary words, so I'd call this a terrible use of the available volunteer resources as well.
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Won't the technology developed by this program be useful for breaking Captchas? If we can teach computers to decipher them, their usefulness as a human-only readable key is lost.
It's only good for supplying the captchas on your website. Nowhere is there a link to start doing translations.
Can't see if this is mentioned, but there is a plug-in for the WordPress software that implements reCAPTCHA. A particularly appropriate use is on http://www.ifshakespeare.co.uk/ which is a literary blog.
Why don't they just have it randomly choose whether to put the unknown word first or second each time? Then you have an incentive to translate both words correctly.
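A quick Python sketch of that suggestion, assuming a hypothetical `build_challenge` helper (the name, the return shape, and the optional `rng` parameter are made up for illustration): the known word goes first or second with equal probability, and only the server remembers which position it checks.

```python
import random


def build_challenge(known_word, unknown_word, rng=random):
    """Return (words_in_display_order, index_of_known_word).

    The index stays on the server; the user only sees the two words
    and cannot tell which one is being checked.
    """
    words = [known_word, unknown_word]
    known_index = 0
    if rng.random() < 0.5:
        # Swap so the unknown word is shown first.
        words.reverse()
        known_index = 1
    return words, known_index
```

Position alone only helps if the two images are also visually indistinguishable, which this sketch does not address.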
So what happens when little Suzy from myponyandme.com gets an excerpt from 'The Sex Manual'?