Carnegie Mellon CAPTCHA Digitization Project Now Underway

Fiery church? by gEvil (beta) · 2007-10-02 00:48 · Score: 2, Funny

Is this proof that Carnegie Mellon (and the BBC) support religious terrorism?

--
This guy's the limit!

Rock on by riffzifnab · 2007-10-02 00:48 · Score: 1

Good idea, congrats to all the smart people who came up with this one.

Re:Rock on by cheater512 · 2007-10-02 00:56 · Score: 1, Insightful

I've found a flaw.

It gives you two words to enter in but you only have to get the right one correct in order to get through.

Spammers could fill the left word with nonsense and OCR the right one and the system would crumble.
Who cares if the OCR isn't 100% accurate? It'll be good enough to get a lot of spam through.
Re:Rock on by mapkinase · 2007-10-02 01:08 · Score: 1

It is very easy to set up the image, so that the user does not know which word is known to the system and which is not.

--
I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
Re:Rock on by Draconian · 2007-10-02 01:09 · Score: 1

You are assuming they will be using a generated captcha for the "known" part that can be OCRed. They can use a word that was previously unknown (i.e. not OCRable) but has been identified by previous reCAPTCHA users.
Re:Rock on by Fluffy Bunnies · 2007-10-02 01:10 · Score: 2, Insightful

Where in TFA does it say that the one on the right is always the right one?
Re:Rock on by Smidge204 · 2007-10-02 01:11 · Score: 2, Insightful

You don't know which word is known (and checked against) and which is unknown. This makes your OCR attack less effective because you must get BOTH words right in order to guarantee success.

Also, if the first two people to decipher the unknown word don't agree, then the word is recycled back into the system until "a lot more people" submit the same answer. This greatly reduces the threat of a "garbage attack" because any random input is unlikely to be repeated by the second person to get that word, or anyone else for that matter.

You didn't even have to RTFA to get that much...
=Smidge=
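
A minimal sketch of the acceptance rule described above, not the project's actual code. The helper name `resolve_word` and the threshold of five matching answers for the "a lot more people" case are assumptions; the thread does not state the real number.

```python
from collections import Counter

def resolve_word(answers, fallback_quorum=5):
    """Return the accepted transcription, or None if the word is recycled.

    answers: human answers for one unknown word, in arrival order.
    fallback_quorum: assumed stand-in for "a lot more people".
    """
    normalized = [a.strip().lower() for a in answers]
    if len(normalized) < 2:
        return None  # not enough answers yet
    if normalized[0] == normalized[1]:
        return normalized[0]  # the first two readers agree
    # Otherwise keep showing the word until one answer has enough votes.
    top, count = Counter(normalized).most_common(1)[0]
    return top if count >= fallback_quorum else None
```

With this rule a single garbage answer only delays acceptance: `resolve_word(["alau", "xqzt"])` returns None, and the word is settled once enough later readers agree.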
Re:Rock on by phantomcircuit · 2007-10-02 02:17 · Score: 1

Performing a garbage attack is still possible so long as some information is shared between attackers.

All that is necessary is that a hash of the image is stored and the same garbage is sent both times the image appears.

What's more, the more images are attacked in this manner, the faster the attack would progress, as more of the known images would be absolutely known to the attacker as well.
Re:Rock on by Anonymous Coward · 2007-10-02 02:24 · Score: 0

"the right is always the right one?"

I'd say that's pretty obvious, wouldn't you? :)
Re:Rock on by Anonymous Coward · 2007-10-02 03:12 · Score: 0

It doesn't but go to the CM site and you will see that the right one is always right.
Re:Rock on by Smidge204 · 2007-10-02 03:22 · Score: 2, Interesting

Still won't work. It's safe to assume the distortion/noise added to the text to prevent simple OCR would be different for each instance of the image; that's the whole point, after all. Hashes of the image data are useless in that case.

Storing the hashes for successfully identified images is also useless... once a word is identified by at least two parties, it is removed from circulation. That means if the attacker IDs a word correctly, chances are it won't stay in the system much longer. Even if the attackers manage to find a way to identify the same word despite the random distortions mentioned above (which would effectively beat *all* CAPTCHA systems anyway) then using that data more than a few times guarantees it will be removed from circulation.
=Smidge=
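
To illustrate the hashing point, here is a small sketch using SHA-256 from Python's standard library. The byte strings are made-up stand-ins for image data, and `image_fingerprint` is a hypothetical helper; the only claim is that changing even one pixel value gives an unrelated digest, so an exact-match hash lookup fails if each serving of a word is distorted differently.

```python
import hashlib

def image_fingerprint(image_bytes: bytes) -> str:
    """Hex digest an attacker might store as a lookup key (hypothetical)."""
    return hashlib.sha256(image_bytes).hexdigest()

# Made-up pixel data: the same word served twice with slightly different noise.
first_serving = bytes([0, 0, 255, 255, 0, 0, 255, 255])
second_serving = bytes([0, 1, 255, 255, 0, 0, 255, 255])

same = image_fingerprint(first_serving) == image_fingerprint(first_serving)
differs = image_fingerprint(first_serving) != image_fingerprint(second_serving)
print(same, differs)
```

A perceptual or fuzzy hash would behave differently, which is the "find a way to identify the same word despite the random distortions" case mentioned above.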
Re:Rock on by cheater512 · 2007-10-02 10:16 · Score: 1

The article doesn't say. However go ahead and try it.

I did it about 10 times putting garbage in the left. Every time I got it correct.
Re:Rock on by Clandestine_Blaze · 2007-10-02 12:57 · Score: 1

I can confirm this. I did about 100 in less than 15 minutes. For quite a few of them, I either entered garbage or nothing at all as the word on the left and still got through successfully.

--
Best "String" Ever!

I want to participate... by DrWho520 · 2007-10-02 00:50 · Score: 3, Interesting

Where can I sign up? Sounds like a great way to burn a few hours on a rainy, Saturday afternoon!

--
The cancel button is your friend. Do not hesitate to use it.

Re:I want to participate... by morgan_greywolf · 2007-10-02 00:54 · Score: 2, Funny

Apparently, you have to go down to the fiery church to burn a few hours...

--
My blog
Re:I want to participate... by EvilGrin666 · 2007-10-02 00:54 · Score: 4, Informative

Here's the website, http://recaptcha.net/
Re:I want to participate... by somersault · 2007-10-02 01:13 · Score: 1

Dude. A 'few hours'? Sounds like you need a hobby!

--
which is totally what she said
Re:I want to participate... by gronofer · 2007-10-02 01:46 · Score: 1

If you want to spend hours, try Distributed Proofreaders.
Re:I want to participate... by Falkkin · 2007-10-02 02:00 · Score: 2, Insightful

Our demo at http://recaptcha.net/fastcgi/demo/recaptcha keeps track of the number of words you've digitized. :)
Re:I want to participate... by bunratty · 2007-10-02 03:27 · Score: 1

Personally, I like to play Peekaboom, also by von Ahn and others at Carnegie Mellon.

--
What a fool believes, he sees, no wise man has the power to reason away.
Re:I want to participate... by MrKevvy · 2007-10-02 04:09 · Score: 1

You can use a live demo on their about page so no sign-up required and you can start digitizing words immediately.

--
-- Insert witty one-liner here. --
Re:I want to participate... by s0lar · 2007-10-02 05:58 · Score: 1

The implementation looks nice, but the actual word images are awful. They are twisted and crossed out making it sometimes difficult to tell an "e" from an "o".

Also, they should really give whole sentences: this provides context and would yield better results. Otherwise many short words would be misinterpreted.

Rock off by Anonymous Coward · 2007-10-02 00:55 · Score: 0

It's been a while since I looked at reCAPTCHA but IIRC it relies on JavaScript and document.write() so it's useless for any XHTML site. The audio captchas likewise assume the screen reader is capable of scripting.

does that mean it's ok to spam now? by dgym · 2007-10-02 00:59 · Score: 1

If signing up to a wiki, or creating a bogus mail account means a little beneficial work is done, then even after replacing all the useful content with links, or sending out hundreds of spams your actions would still be karma neutral, right?

Time to get linking...

I'm not so sure this is a good idea. by Aladrin · 2007-10-02 01:06 · Score: 1

So, the plan is to take already hard-to-read words, make them harder to read, pair them with another hard to read word, and see how many people agree it's the same word? I've already had words like 'Alau' and '45-618' in the few I've done, and since there's an ugly line through them, I can't be close to sure it's right... They make no sense, but they look like that. I'm betting at least 1 other person agrees and puts the same thing I did, accepting that translation into the database...

And that's not even counting malice where people deliberately put wrong words in... Chances are they won't both put the wrong word for the same word, but it -can- happen, especially with malicious intent.

It's a neat idea, but I don't think it'll work all that great. There still needs to be a human reviewing the work before it's truly accepted, and that human might as well be doing it in the first place, with the context still there to help them.

--
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM

Re:I'm not so sure this is a good idea. by necro81 · 2007-10-02 01:26 · Score: 3, Insightful

There still needs to be a human reviewing the work before it's truly accepted, and that human might as well be doing it in the first place, with the context still there to help them.
That is, for all intents and purposes, impractical, which was the entire point. The backlog of work was never going to get done in a reasonable timescale with dedicated humans correcting all the errors. A dedicated human, even with the context, will still make mistakes or get stumped.

Most people, when presented with a CAPTCHA, make an honest effort to try and get it right - otherwise they can't get their precious Facebook account. The number of people who understand what's going on with this reCAPTCHA thing is probably pretty small. Finally, those who know what it is about are probably inclined to not be jackasses and purposefully screw it up. I'd say that honest errors and malicious errors are a very small portion of reCAPTCHA responses. While flawed, this system might still be, say, 95% correct. So, for accepting a certain amount of error, you are able to get as much character recognition done as you are able to supply. As the article says:
Given that it takes about 10 seconds to decipher a reCAPTCHA and type in the answer, this represents the equivalent of almost three thousand man hours a day spent deciphering words that CMU's computers find illegible.
3000 man-hours a day at 95% accuracy versus, maybe, a few dozen man-hours a day at slightly higher accuracy. You tell me which is better.
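
A quick back-of-the-envelope check of those figures, as a sketch only. The 10 seconds and 3000 man-hours come from the quoted article; the 95% is the guess made above, and the 36 dedicated hours is an assumed stand-in for "a few dozen".

```python
SECONDS_PER_SOLUTION = 10      # from the quoted article
MAN_HOURS_PER_DAY = 3000       # "almost three thousand" in the article
ASSUMED_ACCURACY_PCT = 95      # the guess made in this comment
DEDICATED_HOURS_PER_DAY = 36   # assumed value for "a few dozen"

solutions_per_day = MAN_HOURS_PER_DAY * 3600 // SECONDS_PER_SOLUTION
correct_per_day = solutions_per_day * ASSUMED_ACCURACY_PCT // 100
dedicated_per_day = DEDICATED_HOURS_PER_DAY * 3600 // SECONDS_PER_SOLUTION

print(solutions_per_day)   # 1080000
print(correct_per_day)     # 1026000
print(dedicated_per_day)   # 12960
```

Even if the dedicated transcribers were perfect, they would be out-produced by roughly two orders of magnitude under these assumptions.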
Re:I'm not so sure this is a good idea. by smallfries · 2007-10-02 01:27 · Score: 4, Insightful

Wouldn't the easy solution be to present the context as part of the reCAPTCHA? Rather than two single words from isolated contexts, present two "lines" with a word or two either side, and a slight colour change on the target words to indicate which ones the system is after. This would make your validation easier but wouldn't aid OCR in any way.

For your other point, there should be a "not a word" button to hit in that case to flag up that the original OCR has screwed up the word boundary.

I thought it was a really novel project, reminds me of the image tagging "games" that people came up with last year, but in a new problem domain.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Re:I'm not so sure this is a good idea. by MrMr · 2007-10-02 01:40 · Score: 5, Funny

I've already had words like 'Alau' and '45-618' in the few I've done, and since there's an ugly line through them, I can't be close to sure it's right... They make no sense, but they look like that.

Congratulations,
you managed to fail the Turing test.
Re:I'm not so sure this is a good idea. by Aladrin · 2007-10-02 01:55 · Score: 1

I tend towards optimism, so whenever I catch myself going 'Wow, that's great!' I back off and take another look. They stated only 2 people had to confirm to accept a translation... If that were more like 4 or 5, I'd be a lot happier... It's a -lot- more duplication of effort, but rules out a lot of mischief, too. If it ends up like SETI, though, they'll have so much help that they end up processing all their data many years ahead of schedule.

I plan to use at least the mailhide recaptcha on my site. I don't have a forum or other feedback method, so not much need for the regular one. If everyone helps a little like that, at least this method can be tried. I'm not terribly confident in the results, but it's not going to hurt me any, and just might work. (Or lead to something that does.)

--
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
Re:I'm not so sure this is a good idea. by Falkkin · 2007-10-02 02:07 · Score: 5, Informative

"And that's not even counting malice where people deliberately put wrong words in."

We're already getting several million legitimate solutions a day. The chance that a few malicious people would happen to get the same CAPTCHA is relatively small. Also, for many of our words, the OCR's answer happens to be correct -- it just doesn't have high confidence in the word. If a single person agrees with the OCR in this case, we can mark the word as "read" with no further human confirmation. For this reason, many of the words will only ever be shown to a single human.
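
A rough sketch of that shortcut, under stated assumptions: `accept_transcription` is a hypothetical name, answers are compared case-insensitively, and the fallback of two matching humans is taken from the discussion in this thread rather than from any published reCAPTCHA code.

```python
from collections import Counter

def accept_transcription(ocr_guess, human_answers, humans_needed=2):
    """Return the accepted text for a word, or None if more answers are needed.

    One human matching the low-confidence OCR guess is enough; otherwise
    wait until `humans_needed` humans agree with each other (assumed value).
    """
    guess = ocr_guess.strip().lower()
    answers = [a.strip().lower() for a in human_answers]
    if guess and guess in answers:
        return guess  # a human confirmed the OCR's guess
    if not answers:
        return None
    top, count = Counter(answers).most_common(1)[0]
    return top if count >= humans_needed else None
```

So `accept_transcription("private", ["Private"])` settles the word after one viewing, while a wrong OCR guess such as "prlvate" still needs two humans to agree.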
Re:I'm not so sure this is a good idea. by Aladrin · 2007-10-02 02:09 · Score: 1

lol Okay, that -is- funny. And it made me look up 'Alau'... It's an island. But having to Google captchas is where I draw the line. ;)

--
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
Re:I'm not so sure this is a good idea. by NeilTheStupidHead · 2007-10-02 02:10 · Score: 1

For your other point, there should be a "not a word" button to hit in that case to flag up that the original OCR has screwed up the word boundary.
That would defeat the point of the project. Words scanned from real books contain all manner of 'not a word' combinations of letters and numbers; the principle is the same. I came across several portions of words that had been hyphenated at the margin of a page. Many CAPTCHA-type systems use random strings of characters. Any non-English words that show up should be treated as a string of characters.

--
Lose: misplace or fail || Loose: not bound together
Re:I'm not so sure this is a good idea. by Alzheimers · 2007-10-02 02:11 · Score: 4, Funny

ELIZA > And how does this make you feel?
Re:I'm not so sure this is a good idea. by Aladrin · 2007-10-02 02:23 · Score: 1

"Never underestimate the power of stupid people in large groups."

I'm sure you've got most bases covered, but intentional malice goes way beyond 'a few malicious people'. In this case, it involves at least 1 malicious person, a captcha breaker, a few thousand anonymous free proxies, and a lot of malice. I'll admit that I find this idea trivial because I'm a programmer, but I think most (non-script-kiddie) hackers will find it trivial as well.

I sincerely hope nobody tries to sabotage your project, but I'd feel better if you at least seemed to be taking this (currently non-existent) 'threat' more seriously.

I do wish you the best of luck in the project, and plan to support it in the little ways that I can.

--
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
Re:I'm not so sure this is a good idea. by Falkkin · 2007-10-02 02:38 · Score: 3, Informative

You said "people" putting in wrong words (ala the suggestion someone said below about "everyone fill in CowboyNeal!"), which is quite different from automated attacks. For that, we have numerous scripts that notice various forms of anomalous behavior from any given IP. We manually review these to make sure the answers are reasonable. We are also working with CERT, who have a large database of botnetted machines, to detect attacks. I'm not going to give complete details of everything we check, but rest assured that we are very active in preventing attacks -- our goal is to be the best CAPTCHA in the world, and we take security threats very seriously.

In terms of the digital output, we spot-check some of the transcribed pages every day. These spot-checks will also turn up any anomalous solutions, with high probability.
Re:I'm not so sure this is a good idea. by Eponymous Bastard · 2007-10-02 02:46 · Score: 2

I got "derground". If they are getting this from digitized books, they have to work on undoing hyphenation before presenting it to the user.

I wonder, after this is running for a while, most of the unknown words will be nonsense (jabberwocky, snickersnee), proper or made-up names (Elric of Melnibone? I saw Benoit in the third captcha I solved, I now got one that looks like Visscher), numbers and other things people wouldn't work through.

The other problem is with common words that OCR gets wrong. I've/me are common enough that they might be overrepresented, or undertranslated.

In the end, since this is a university project, the end product is not the product itself (translated books) but rather the papers and master/PhD theses you can write with the data. Are people better at OCR than computers? By how much? How much is people's ability to recognize a word impaired by cutting off the context? Are people better at common words than at proper names and unknown words?
Re:I'm not so sure this is a good idea. by Falkkin · 2007-10-02 03:00 · Score: 1

Since this is a university project, we do actually care quite a bit about transcribing books :) In fact, that's the aspect of the system that I'm primarily responsible for. However, there is a lot of really interesting data along the lines of what you're suggesting, and I'm sure some of that data will eventually make it into papers.

"I wonder, after this is running for a while, most of the unknown words will be nonsense"

It's already been running for a few months, and we're getting millions of solutions a day, and there's still a pretty good mix of words in the system :) Most words in the source documents aren't nonsense.
Re:I'm not so sure this is a good idea. by smallfries · 2007-10-02 03:56 · Score: 1

We're using a different definition of word :) I meant that if the presented substring didn't have "word" boundaries on either side then it would screw up the spacing in the output. I didn't mean that the symbols didn't form a dictionary word.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Re:I'm not so sure this is a good idea. by Taxman415a · 2007-10-02 08:10 · Score: 1

That's true there's always the OCR confidence metric to take into account. What concerns me is that I haven't seen anything that applies random sampling in checking the final accepted answers. What the method description says is if two people agree on a new word it's accepted. Why not scale that number based on the OCR confidence? You mention doing that to reduce the number of people that need to solve it, but why not to increase it? That and/or figure out some procedure to randomly sample accepted answers and send them out again. If a decent percentage of those randomly sampled third tries do not agree with the first two, then you know that 2 isn't the right number for default acceptance. I'm guessing you've thought of this, but I didn't see it anywhere in the description or FAQ.
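
The random re-check proposed above could look something like this sketch. Everything here is hypothetical: the function names, the image IDs, and the idea that accepted answers are stored in a dict keyed by image ID.

```python
import random

def audit_sample(accepted, k, seed=0):
    """Pick k accepted image IDs at random to send out for a third opinion."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return rng.sample(sorted(accepted), k)

def disagreement_rate(accepted, third_answers):
    """Fraction of re-checked words whose third answer differs from the accepted one."""
    checked = [image_id for image_id in third_answers if image_id in accepted]
    if not checked:
        return 0.0
    wrong = sum(
        1 for image_id in checked
        if third_answers[image_id].strip().lower() != accepted[image_id]
    )
    return wrong / len(checked)

# Hypothetical accepted answers and third opinions.
accepted = {"img1": "bliss", "img2": "etnamese", "img3": "alau", "img4": "reported"}
third_answers = {"img1": "Bliss", "img2": "vietnamese"}
rate = disagreement_rate(accepted, third_answers)
```

If the measured rate stays high, that is evidence that two matching answers is too low a bar for default acceptance.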
Re:I'm not so sure this is a good idea. by iarnell · 2007-10-02 08:21 · Score: 1

Why do you ask?
Re:I'm not so sure this is a good idea. by NeilTheStupidHead · 2007-10-02 08:24 · Score: 1

Yes, upon review I see that I mis-read your post, my apologies.

--
Lose: misplace or fail || Loose: not bound together
Re:I'm not so sure this is a good idea. by Alzheimers · 2007-10-02 09:38 · Score: 1

I don't want to talk about this anymore.
Re:I'm not so sure this is a good idea. by Cytotoxic · 2007-10-02 10:05 · Score: 1

and since there's an ugly line through them, I can't be close to sure it's right...

I have to agree with this point. I tried about 20 of them and there were at least 4 that were impossible to be sure of because of the wavy line running through the critical part of a character - this was particularly an issue on numbers, where there is no possible context to give you the correct answer. I guess it all comes out in the wash because you just re-present the images until a consensus develops.
Re:I'm not so sure this is a good idea. by pikine · 2007-10-02 15:16 · Score: 1

I was presented with the two words "Bliss" and "etnamese", the latter I presume should be "Vietnamese" but for some reason the word breaker dropped the initial "Vi". If they can't even do word breaking correctly, I wonder how reCAPTCHA is going to help.

--
I once had a signature.
Re:I'm not so sure this is a good idea. by xant · 2007-10-02 16:39 · Score: 1

> If a single person agrees with the OCR in this case, we can mark the word as "read" with no further human confirmation

Wow, that seems like a major mistake if you're actually doing that. It's quite possible for a human to make a mistake on a word, for exactly the same reason the OCR makes a mistake. In fact, the most likely error for a human to make is the same one the OCR made. Which means you will be accepting as 'read' many errors simply because the human agreed.

--
It's rare that you're presented with a knob whose only two positions are Make History and Flee Your Glorious Destiny.
Re:I'm not so sure this is a good idea. by Aladrin · 2007-10-02 21:57 · Score: 1

Well, as for that, it could be a problem with the original text as well... There might have been a mis-print, or a letter rubbed/torn off, or something... I have to say that a poor translation of the books in the next 5 years is better than a good translation in the next 400 years. (That's the estimate at the current pace... 400 years.)

Still, as someone else noted, there -should- be a way to note that one of the words appears to be nonsense and that you'd like to flag it for a human to interpret instead. Especially since that nonsense word will make it into the 'correct words' list to be used for captchas.

--
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
Re:I'm not so sure this is a good idea. by Anonymous Coward · 2007-10-02 23:31 · Score: 0

1) The obscuring pattern had better be different on the same word for each participant.

2) Two participants getting the same result is not sufficient confirmation. There are going to be a lot of matching mistakes.

Problems by David_Shultz · 2007-10-02 01:07 · Score: 2, Interesting

Interesting idea, but here are the immediate problems as I see them...

Captchas are now twice as annoying for the user, since you have to type two words (but maybe the fact that there is some value in it will appease the user).

Some algorithms these days are quite literally better than humans at detecting the hidden text in captchas. Pictures, not text, are better for this purpose.

Testing the answer against another user's answer is a good idea in principle (it's how they make sure no one is cheating in distributed computing projects) but giving the same answer as another user is not difficult when they are using the same algorithm. We can assume that any algorithm being applied against this captcha is trying to do loads of work (that is, after all, why you write such a program) and so it will be answering the same question multiple times.

Am I right on these points? (I just woke up).

Re:Problems by AltGrendel · 2007-10-02 01:13 · Score: 3, Insightful

I agree, but if you think about it, it's really a win-win for Carnegie Mellon. Either way, they get the text translated.

--
The simple truth is that interstellar distances will not fit into the human imagination
- Douglas Adams
Re:Problems by Anonymous Coward · 2007-10-02 01:22 · Score: 1, Funny

Am I right on these points? (I just woke up).
No. Not even close. Get some coffee, then RTFA.
Re:Problems by jsight · 2007-10-02 01:22 · Score: 5, Insightful

I agree... I don't understand why people find so many silly faults with this.

1. It's not twice as annoying. Compared to how faded and scrambled many "one-word" captchas are, this is significantly less annoying.
2. People seem to be acting like someone will fill out one word correctly and then intentionally scramble the other to screw up the project. Not many people are crazy enough to even want to do that. But even if they were, how do they know which word is the known, and which is the unknown?
3. Endless Supply - Each word that is correctly translated is another word that is "known" and therefore can be safely used as a known in a new captcha.
4. Verification - Thanks to #3, they could also potentially maintain the verification % rate for various words to later determine the accuracy or inaccuracy of past translations (assuming that they ever find that to be a problem).

Yeah, we all know that captchas are not perfect, but this project is a better idea than most. And because it is centralized, they can update the image generation scheme centrally if it is broken.

In practice, these seem to get broken less often than people think.

--
Throw the bums out!
Re:Problems by niceone · 2007-10-02 01:32 · Score: 1

I agree, but if you think about it, it's really a win-win for Carnegie Mellon. Either way, they get the text translated.

I think the GP's worry is that the spammers use OCR and there are a lot of them, so the two challenges you are relying on for checking both get answered by the same OCR spambot code - so they could match even though they're wrong.

--
ccalam - acoustic versions of new songs.
Re:Problems by InvisblePinkUnicorn · 2007-10-02 01:56 · Score: 1

"Testing the answer against another user's answer is a good idea in principle (it's how they make sure no one is cheating in distributed computing projects) but giving the same answer as another user is not difficult when they are using the same algorithm."

Please RTFA. How do you propose that the same bot gets the same word twice in one sitting, let alone with the same warping and strikethrough so as to guarantee the same word is typed both times?

Check out recaptcha.net to test it out.
Re:Problems by Ed Avis · 2007-10-02 01:56 · Score: 1

Some algorithms these days are quite literally better than humans at detecting the hidden text in captchas.
As the article said, by selection, these are bits of text that OCR algorithms cannot read. We can assume that CM is using the best available OCR, so even 'some algorithms' that you mention, which are better than humans at reading captchas in most ordinary cases, will be ineffective for these particular images.

--
-- Ed Avis ed@membled.com
Re:Problems by Falkkin · 2007-10-02 01:58 · Score: 1

A couple things:

1) We've done some studies at CMU that show that recognizing and typing 2 real English words is much easier and faster than typing 6 or 7 random letters and numbers. Would you rather type "private much" (which is what just showed up for reCAPTCHA) or "KXd2cM" (which is what showed up for Yahoo's CAPTCHA)?

2) Any given CAPTCHA is only shown to a couple of users. We're getting millions of legitimate solutions a day, so even a relatively sophisticated bot would have little chance of seeing the same image twice.
Re:Problems by Anonymous Coward · 2007-10-02 04:13 · Score: 0

Some algorithms these days are quite literally better than humans at detecting the hidden text in captchas. Pictures, not text, are better for this purpose.

Then why aren't those algorithms used to directly decipher the OCR text, instead of 1) warping it more and 2) presenting it to a human? I'm sceptical of your claim.

Give it a go! by cookieinc · 2007-10-02 01:11 · Score: 2, Informative

You can try it out at the top of this page.

Ha! by omgamibig · 2007-10-02 01:15 · Score: 0, Flamebait

Take this OCR software, we still own you! And now you come crawling back!

Ignores typos? by AySz88 · 2007-10-02 01:30 · Score: 0, Offtopic

Hey, it's even resistant to typos! I got "terson reported", typed in "tersonn reportted", and it said "Correct!". ...hey, wait a minute....

Presentation about human computation by gambino21 · 2007-10-02 01:32 · Score: 1

There is a presentation about similar topics by Luis von Ahn on here. The presentation talks about using what he calls human computation, basically using people on the internet to perform various tasks that are difficult for computers to do. One idea is using people playing a game to label images on the internet so that they can be indexed with much greater accuracy than the current google image search.

CATTTTCHA? by MichailS · 2007-10-02 01:35 · Score: 2, Interesting

> The test, known as a CAPTCHA (Completely Automated Turing Test To Tell Computers and Humans Apart), was originally designed at Carnegie Mellon to help to keep out automated programs known as "bots."

Where did they get the "P" from?

Re:CATTTTCHA? by jas79 · 2007-10-02 01:40 · Score: 1

Public according to wikipedia.
Re:CATTTTCHA? by alerante · 2007-10-02 06:01 · Score: 1

CAPTCHA actually stands for "Completely Automated Public Turing test to tell Computers and Humans Apart".
Re:CATTTTCHA? by Anonymous Coward · 2007-10-02 06:22 · Score: 0

>> The test, known as a CAPTCHA (Completely Automated Turing Test To Tell Computers and Humans Apart), was originally designed at Carnegie Mellon to help to keep out automated programs known as "bots."

>Where did they get the "P" from?

From Pimp. Completely Automated Pimp Test to tell Computers and Humans Apart.
Re:CATTTTCHA? by Anonymous Coward · 2007-10-02 07:22 · Score: 0

I'll pimp-slap you if you make another post about pimps, since you aren't really a pimp.

- A Real Pimp

Possible problem by thatblackguy · 2007-10-02 01:37 · Score: 1

I did that protect your email address with OCR thing at http://mailhide.recaptcha.net/ and tried solving it myself. I mistyped one of the words accidentally and noticed a second after I hit enter. It said 'Congrats you're a human!' and proceeded to give me the address.

`CowboyNeal' answer to all CAPTCHAs by gyepi · 2007-10-02 01:47 · Score: 2, Interesting

If all slashdotters decided to answer the second CAPTCHA question with CowboyNeal, there is a large chance of his name appearing in one of the deciphered old texts. CowboyNeal to the Old Testament! This points out one major disadvantage of the system: since the computer can't check whether the answer is correct, a large group of people can abuse it with a growing probability in time. Since there is no incentive to answer the second CAPTCHA correctly, making it widely known that the second CAPTCHA is not checked was less than a good idea. Good cause undermined by wide publicity. I, for one, welcome our new old-text-obfuscating slashdotter overlords.

--
Attitudes make the difference between Space and Time: we want to MAX our temporal, and MIN our spatial extension.

Re:`CowboyNeal' answer to all CAPTCHAs by pha95mlb · 2007-10-02 02:18 · Score: 1

Not correct - the 'known' and 'unknown' CAPTCHAs are presented in a random order. You don't know which is the first or which is the second.
Re:`CowboyNeal' answer to all CAPTCHAs by Falkkin · 2007-10-02 02:21 · Score: 5, Informative

Sorry, but we've already thought of this attack :)

We can compute the daily frequency of each human-provided solution and automatically flag anything that suddenly jumps in popularity. It's especially suspicious if these answers always disagree with the OCR's guess (often the OCR happens to be right, but just doesn't have high confidence).
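
A minimal sketch of that kind of popularity check. The function name, the thresholds, and the sample counts are all made up for illustration, and the extra signal about disagreeing with the OCR's guess is not modeled here.

```python
def flag_spikes(baseline, today, min_count=50, ratio=10.0):
    """Return answers whose daily count jumped far above their usual level.

    baseline: usual daily count per answer (hypothetical data).
    today: today's count per answer.
    min_count, ratio: assumed thresholds, not the project's real values.
    """
    flagged = []
    for answer, count in today.items():
        usual = baseline.get(answer, 0)
        # max(usual, 1) avoids treating every never-seen answer as an infinite jump.
        if count >= min_count and count > ratio * max(usual, 1):
            flagged.append(answer)
    return sorted(flagged)

baseline = {"the": 9000, "cowboyneal": 0}
today = {"the": 9500, "cowboyneal": 4000, "alau": 3}
print(flag_spikes(baseline, today))  # ['cowboyneal']
```

Common words stay under the ratio, rare one-off answers stay under the minimum count, and only the coordinated answer stands out.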
Re:`CowboyNeal' answer to all CAPTCHAs by gyepi · 2007-10-02 03:18 · Score: 2

Is there any word on how CAPTCHA decoders, like PWNtcha, perform against the current reCAPTCHA?

In case reCAPTCHA can be automatically deciphered efficiently, a slightly altered malevolent attack might still be feasible. Let D be a roughly complete list of English words (a dictionary), together with the relative frequencies of the words occurring in standard English texts. Generate a fixed mapping f from D to D such that words are assigned to each other only if their occurrence frequencies are roughly the same - i.e. `banana' could be mapped to `orange' since their relative frequency (I guess) is roughly the same.
Now let your deciphering program attack the reCAPTCHA service such that it guesses the two words from the presented CAPTCHA, gives the correct answer to one of them (at random), and gives the permuted answer (according to f) to the other. You will see no bumps in the frequencies, and roughly every second attempt will put in false information to the database. Since f is fixed, sooner or later the same word will come up again, in case the false answer is going to be verified.

Even without an efficient automated reCAPTCHA decipherer, you could do the same with a bunch of people, just tell them that as a first attempt always go to a website where a small cgi script gives you back f(Word) for any given Word. I'm not claiming that you can find enough evil people for that around here, of course...

((Obviously the efficiency of this attack can be increased by mapping a very common word - say, "with" - to an uncommon one, and mapping a whole bunch of uncommon words to "with" so that, on the basis of relative occurrence frequencies in standard texts and the estimated ratio of malevolent/benevolent users, you see no frequency bumps. The advantage of the simpler but less efficient method above is that it doesn't require a guess of the ratio of the malevolent/benevolent users.))

--
Attitudes make the difference between Space and Time: we want to MAX our temporal, and MIN our spatial extension.
Re:`CowboyNeal' answer to all CAPTCHAs by Falkkin · 2007-10-02 03:31 · Score: 2

PWNtcha does not defeat reCAPTCHA, nor are we aware of any existing OCR or CAPTCHA-breaking algorithms that do. We are working with research groups at a couple universities who are trying to break our CAPTCHA (and if they can, we'll obviously fix it). In case we do notice a break, it's trivial for us to switch to a completely different kind of CAPTCHA (using different distortions). Because our system is a web service, if there is a security breach, we can fix it for all sites at once by simply changing the distortions on our challenge images. This is a big security benefit compared to other CAPTCHA systems that are difficult (at best) to patch and update.

As you point out, if we did get broken on a wide scale, it would be possible to seed bad data into the system. However, it's easy enough for us to simply distrust all responses that happened during the vulnerable period.

"Turing" test by DrLex · 2007-10-02 01:49 · Score: 2, Informative

Well, this finally makes CAPTCHAs somewhat useful. I won't try to formulate it in some sugar-coated way: I personally hate CAPTCHAs. On some types (especially the ones from Digg), I fail about 50% of them, and that's getting quite annoying after a while. Especially when your code is rejected even if you believe there is no doubt about what you've read in the image.
I believe CAPTCHAs are the wrong solution to the wrong problem. It's a bit exaggerated to call them a "Turing test", because I'm quite sure that OCR systems will be made in the near future that are better than humans in reading CAPTCHAs. A simple text-based question that requires actual intelligence is a much better Turing test, and also a much smaller nuisance for people with impaired vision. Of course, writing a foolproof system that can produce a nearly infinite amount of such questions is a challenging problem by itself.

Re:"Turing" test by Anonymous Coward · 2007-10-02 02:23 · Score: 0

The good doctor will now surely provide us with... say 3 examples of this system you propose?
Re:"Turing" test by iangoldby · 2007-10-02 02:41 · Score: 1

A simple text-based question that requires actual intelligence is a much better Turing test... writing a foolproof system that can produce a nearly infinite amount of such questions is a challenging problem by itself.
I think it is more than a challenge. I have introduced a system like this on a public forum that I administer. It's a phpBB mod that asks a question during the registration phase to which the registrant is required to give a correct answer.

The problem is that I have found it very hard to come up with even a relatively small number of questions and answers that require understanding, have unambiguous answers, and do not assume any cultural or 'trivia' knowledge (other than understanding of the language).

Here are some examples that I came up with, along with my critique:
What is the third word of this sentence?
I think this is quite a good one. No knowledge other than understanding the language is required.
What is the result of three multiplied by three?
Mathematical question - I imagine this is probably the easiest category to crack by AI.
What day of the week comes after Wednesday?
Can probably assume that anyone with understanding of the language knows the answer, but strictly, this is a trivia question, and therefore unsuitable.
What is a shape with three sides called?
Another trivia question.
What colour is a ripe tomato?
Another trivia question. Additionally, a blind person might conceivably not know the answer.
How many days are there in a fortnight?
Trivia again.

As you can see, these are not very good questions. In fact, I think the first is the only one that does not depend on any specific knowledge.

Can anyone come up with better questions?
Re:"Turing" test by Eponymous+Bastard · 2007-10-02 02:55 · Score: 1

If you assume English knowledge:

What language is this in?
What are the first five letters of the alphabet?
What are the five vowels?

Other stuff:
Are you a human or a computer program?
What is the name of this site? (see title bar)
Pick a number, any number. (Any number is taken as correct)
Leave the following space blank.

Of course, the biggest problem with a limited dictionary of questions like this is that a spammer can sit through them, answer them all, or at least a portion, and then put a script to replay the answers. If the script gets a new question it just refreshes.
Re:"Turing" test by DrLex · 2007-10-02 05:13 · Score: 1
Your 'trivia' questions are not particularly problematic unless you want to make sure that even 4-year olds or people who can hardly read and write English can post on your forum. Which is something you might not really desire. Even if someone doesn't know the days of the week, or what color a ripe tomato has, looking it up or asking someone by phone or chat is pretty trivial. For a visually impaired person, a captcha is a much higher barrier.
Or if you mean that they would be too easy for a robot to answer, I have yet to see a system that can read and answer any 'trivia' question. If someone builds one, well, that would actually be a useful contribution to computer science.

The main problem with any anti-robot system is that the more standard it becomes, the more rewarding it becomes to crack it. If I write an OCR system that can read CAPTCHAs of a certain kind, all sites using this system become vulnerable. Similarly, if everyone used the same set of text questions, spammers would eventually build a database with the answers. But changing a set of questions is a lot easier and more user-friendly for your visitors than making your CAPTCHAs harder to read. The uniqueness of your questions is more important than the amount of intelligence required to answer them. For instance, the following questions all require the user to simply type an 'a'.
- You will have to type a letter 'a' in this field.
- Please enter the first letter of the word 'alphabet'.
- If my name is Ann, what letter does my name start with?
- Type "a" here, without the quotes.
- There are 26 letters in the alphabet. What's the first? If you don't know, it's easy to guess from the word itself.
- When I say my ABC, what letter do I start with?
- What's the last letter in the word 'CAPTCHA'?
- I will repeat a certain vowel now: AAAAA. Type it once.
- In the following list, one letter is different from the rest, type it. E E A E E E.
- Give the first vowel in the word 'slashdot'.
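
A minimal sketch of how a site-specific question list like the one above could be checked at registration time. This is an illustration only: `QUESTIONS`, `normalize`, `pick_question` and `check_answer` are hypothetical names, not part of phpBB or any existing mod, and the normalization rules (trim whitespace, drop surrounding quotes, ignore case) are an assumption.

```python
import random

# Site-specific question bank; every question here expects the answer 'a'.
QUESTIONS = [
    ("Please enter the first letter of the word 'alphabet'.", "a"),
    ("What's the last letter in the word 'CAPTCHA'?", "a"),
    ("Give the first vowel in the word 'slashdot'.", "a"),
]

def normalize(answer):
    """Trim whitespace and surrounding quotes, then lowercase."""
    return answer.strip().strip("'\"").lower()

def pick_question(rng=random):
    """Choose one (question, expected_answer) pair at random."""
    return rng.choice(QUESTIONS)

def check_answer(question, submitted):
    """True if the submitted text matches the stored answer for the question."""
    expected = dict(QUESTIONS).get(question)
    return expected is not None and normalize(submitted) == expected
```

Because the list is just data, swapping in a fresh set of questions is a one-line change per question, which is the property the comment argues for.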
Re:"Turing" test by iangoldby · 2007-10-02 06:05 · Score: 1

Those are all excellent questions and I hope you don't mind if I adapt them for my forum.

My point about trivia questions is that they are often very culturally-dependent. What is obvious and very easy for an average American (or English person) may not be at all obvious to someone from Burkina Faso (for example).
Re:"Turing" test by snarkh · 2007-10-02 06:15 · Score: 1

What colour is a ripe tomato?

Can be yellow, brown, purple or even green!
Re:"Turing" test by DrLex · 2007-10-03 02:08 · Score: 1

Of course I don't mind. It's not like there's a license agreement attached to each of my posts :)

Peekaboom by EnsilZah · 2007-10-02 02:08 · Score: 2

Sounds like what they're doing at Peekaboom and The ESP Game, harnessing humans to solve problems that are difficult for computers.
Here's a nice video on the subject.

Re:Peekaboom by Taxman415a · 2007-10-02 08:23 · Score: 1

That's all the same guy (in collaboration with various others). So no wonder you picked up the connection. :) http://www.cs.cmu.edu/~biglou/research.html
Re:Peekaboom by Anonymous Coward · 2007-10-02 15:42 · Score: 0

The ESP Game sounds almost exactly like what I've seen using the Google Image Labeler before.
But copyright 2005, I assume The ESP Game is older.

Practical Use by chill · 2007-10-02 02:18 · Score: 1

I supervise an America's Army clan website which uses phpBB for the forums. Spam bots were barely slowed down by the standard CAPTCHA registration requirement. I'd get dozens of bogus registration requests a day from bots that used OCR to get in.

A couple of months ago I switched to recaptcha.net's plugin for phpBB and it stemmed the tide. The number of spam bots getting thru decreased greatly. Those that did, I felt slightly better when I deleted their registration requests unfulfilled. Their Evil cpu cycles had been reclaimed for Good! :-)

Now, I expect that if this gains momentum, the spam bots will get tweaked OCR that better handles recaptcha images. I also expect it will happen like before, with the annoyance slowly ramping up for me. During that time, there will be an increase in positive results for CMU, which is a good thing.

Once the bots get good enough that I (and other forum admins) change, I expect CMU's OCR algorithms to have improved enough to not need this service.

--
Learning HOW to think is more important than learning WHAT to think.

Drupal Module makes it simple by Slashdot+Parent · 2007-10-02 02:19 · Score: 3, Interesting

For all of you Drupal admins out there, I just wanted to let you know that there is a reCAPTCHA module that makes using reCAPTCHA a snap.

I'm not affiliated with the project, other than as a happy, comment-spam-free user of it.

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock

Re:Drupal Module makes it simple by BacMan · 2007-10-02 02:46 · Score: 1

I'm a happy user as well. Ever since implementing it on my feedback form, spam dropped to zero! Here is my recaptcha using the recaptcha4j API.

http://www.testdesigner.com/about/contact/
Re:Drupal Module makes it simple by HorsePunchKid · 2007-10-02 05:48 · Score: 1

I use both the Drupal module you've mentioned and the MediaWiki plugin that the CMU team (apparently) maintains. If you don't use either of those, you've still got a lot of options.

--
Steven N. Severinghaus

Does it stop spam? by Mr_Blank · 2007-10-02 02:22 · Score: 1

From their learn more page:

If you get email spam we have a method that will help you to reduce it. Many spammers crawl the web looking for email addresses. When they see an email address on a web page, they send spam to the address. Mailhide allows you to safely post your email address on the web. Mailhide takes an address such as jsmith@example.com and turns it into jsm...@example.com. In order to reveal the address, a user must click on the "..." and solve a reCAPTCHA. If you use the Mailhide version of your email address, spammers won't be able to find your real email address and you'll get less spam.

Does that work? Or are there a thousand ways for the spammers to break this?
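
The display side of the transformation quoted above can be sketched as follows. This is only a guess at the visible behaviour, not Mailhide's actual code: the function name `mailhide_display` and the rule of keeping the first three characters are assumptions taken from the single jsmith example. The protection itself comes from the full address being served only after the CAPTCHA is solved, which this sketch does not cover.

```python
def mailhide_display(address, visible=3):
    """Return the address with all but the first few local-part characters hidden.

    The 'visible=3' default is inferred from the one example in the quote.
    """
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError("not an email address: %r" % address)
    return local[:visible] + "..." + "@" + domain
```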

Re:Does it stop spam? by Belacgod · 2007-10-02 02:59 · Score: 1

Any way the spammers break this involves improved OCR. Said improved OCR will be available to Carnegie Mellon too, thus in any event the stuff will be translated faster (and if they restrict reCaptcha offerings to things their OCR has in fact choked on, it will retain its effectiveness even as OCR technology improves).
Re:Does it stop spam? by The+Cisco+Kid · 2007-10-02 03:18 · Score: 1

If you are able to install this mailhide script, it would be simpler, instead of posting your email address, to post a link to a form where someone wanting to contact you can type their message, give you their email address (or link to their contact form, if they like:), and then have it submit to a script that emails you the contents of the form (make sure your email address is hardcoded in the script, and *not* included in a hidden form field)
Re:Does it stop spam? by Anonymous Coward · 2007-10-02 05:13 · Score: 0

Any way the spammers break this involves improved OCR.

Or cheap human labor. If someone really wants your email address to send a message promoting their "enlarge your mortgage with this stock" spam, they can still get it, and sell it to whoever they want, but it will at least slow down the automated crawlers.

Don't worry by Slashdot+Parent · 2007-10-02 02:23 · Score: 1

Don't worry. The system only accepts a word as correct after two people give the same answer. Hopefully the next person to get your challenge won't make the same typo you did. :)
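
A minimal sketch of the "two people give the same answer" rule, assuming answers are compared after trimming and lowercasing (other comments in this thread report that case is ignored). The function name and the `required_matches` parameter are hypothetical; the real service may weigh answers differently.

```python
from collections import Counter

def accepted_transcription(answers, required_matches=2):
    """Return the first answer submitted required_matches times, or None.

    answers: submissions for one unknown word, in the order they arrived.
    """
    counts = Counter()
    for raw in answers:
        answer = raw.strip().lower()
        counts[answer] += 1
        if counts[answer] >= required_matches:
            return answer
    return None  # no agreement yet; the word would go back into the pool
```

A single typo never reaches the threshold on its own, so it is only accepted if a second person makes exactly the same mistake.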

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock

Privacy by Random+Walk · 2007-10-02 02:43 · Score: 1

I can see a serious privacy problem with this, since it divulges the IP address of visitors to a third party (Carnegie Mellon). The API is fundamentally broken, since both the website visitor and the website need to contact the central server (rather than the website alone), which allows said third party to generate personalized profiles of web surfers.

OLD NEWS... and a dupe by xtracto · 2007-10-02 02:47 · Score: 1

This was reported on Slashdot about a year ago, and after I read about it I set up a captcha on my page to reveal my email...

other than that, it is really nice :) and for the people that want to participate you just have to "hide" your email behind a link which will show a captcha (with the two phrases)

--
Ubuntu is an African word meaning 'I can't configure Debian'

Say Foo! by Anonymous Coward · 2007-10-02 02:55 · Score: 1, Funny

Always enter "foo" as the second word, just for the heck of it!

JS is almost unavoidable for logins now. by Kadin2048 · 2007-10-02 02:57 · Score: 2, Informative

Unfortunately I think most CAPTCHAs use JS; it's been a while since I've been to a site that didn't make me turn it on to get through login/registration. I have no idea why this is, since people have been doing login pages since before JS was around or popular, but now it seems like the way every idiot is doing it.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

Sadly by The+Cisco+Kid · 2007-10-02 03:22 · Score: 1

Spammers can use the 'get a human to do it' approach as easily as anyone else can.

They can set up fake porn sites with registrations (collecting more email addresses to spam in the process), and when someone wants to 'register' for the free porn, the spammer's site scrapes a captcha from the site they want to get into with a bot, and shows it to the user trying to sign up for porn. The eager pornhound dutifully types in the answer, which the spammer's scripts can then supply to the site the captcha originally came from. They can even feed back the results: if the answer doesn't work at the real site, then the user made a mistake, and gets another.

Re:Sadly by Falkkin · 2007-10-02 03:41 · Score: 1

This is quite possibly an Internet urban legend. It certainly sounds plausible, but I've never seen a report of such an attack "in the wild". In addition, doing this attack with reCAPTCHA would require a high level of sophistication, as we have security features in place specifically to detect this man-in-the-middle attack.

We have noticed one such "humans filling out CAPTCHAs for spammers" attack on reCAPTCHA, but in this case it was offshore workers being paid to solve CAPTCHAs. We shut them out of the system promptly. (But even if we hadn't, it's still a win over using nothing, because at least the spammers are incurring a non-trivial economic cost for every CAPTCHA solved.)
Re:Sadly by The+Cisco+Kid · 2007-10-03 01:56 · Score: 1

Trust me - I've worked with the anti-spam community. It's not an urban legend. (And no, I don't have any specific examples I can give you)

The government is the only terrorist by Anonymous Coward · 2007-10-02 03:30 · Score: 0

(+5 Frightening)

Not case sensitive? Ut oh by cshay · 2007-10-02 03:54 · Score: 2, Interesting

It doesn't seem like these reCAPTCHAs require that the user type in the correct case for letters. Won't this be a problem for translated text? Even if they don't absolutely require it, they should at least request that the user use the correct case.

How if..... by aman534 · 2007-10-02 04:17 · Score: 1

What happens if the answer for the unknown word is wrong? (Well, the probability is still there.) Erm... can we replace the word with a random string (a mix of characters and numbers)?

Next Step: by CptPicard · 2007-10-02 04:56 · Score: 1

Deciphering Mayan hieroglyphics!

Champollion is rolling in his grave in frustration because he didn't think of this...

--
I want to play Free Market with a drowning Libertarian.

Caps? by DeadPanDan · 2007-10-02 04:58 · Score: 1

How does this thing handle capitalizations? What are the chances that two people will be too lazy to Capitalize the proper nouns and acronyms? Two matches to verify a word seems low. Crap I just checked it. I found a group with two capitalized words and entered them without caps. It accepted it.

Minor problems but good overall by MrKevvy · 2007-10-02 04:58 · Score: 2, Interesting

After doing a hundred or so, I can see several problems that may hurt accuracy even if the text is human-readable:

1) Hyphenated word fragments broken over lines, e.g. "vances" where you can't see the "ad-" from the previous line.
2) Dialectal spellings of English words, e.g. British spelling where "s" replaces "z" in verb forms such as "categorise"
3) Numbers with commas/decimals. Is that thirteen-thousand "13,000" or a precise thirteen "13.000" to three places?
4) Archaic spellings and outdated words. Because these are old books being digitized (only books before 1923 are out of copyright) this is quite common.

But it's a brilliant idea and for the majority of the text samples there was no ambiguity.

--
-- Insert witty one-liner here. --

Re:Minor problems but good overall by DeadPanDan · 2007-10-02 05:19 · Score: 1

I don't see how archaic spelling and fragmented words are a problem. It's not important that you know the word, only that you can spell it. If you correctly spell "ad-" and someone else correctly spells "vances" they'll get stitched together to form the correct word.
Re:Minor problems but good overall by MrKevvy · 2007-10-02 05:43 · Score: 1

re: "I don't see how archaic spelling and fragmented words are a problem"

Context. If the text is difficult to read so that one or more letters are ambiguous, if you know that the word is a modern American English word then you can fill in the blank(s). I failed to mention proper nouns (ie names) and that is more common because there are no standardized spellings of them. They are turning up quite often in the text.

Also some of the scanned text was a number with a fraction, and some had accent marks and the input doesn't take Unicode. :^)

--
-- Insert witty one-liner here. --

Re:whoosh! by zenhkim · 2007-10-02 05:25 · Score: 1

"AAAAAAAAAAAAAAAAAAARGHHH!!!"

Hear that? It's the sound of X number of spammers crying out in agony/frustration/pain/rage.

--
"All hands, BRACE FOR IMPACT!"

MOD PARENT UP by Chapter80 · 2007-10-02 05:36 · Score: 1

I wish I had mod points! Those three links are classics! I could waste HOURS on this!

It's not like they have NO idea what the word is.. by smitth1276 · 2007-10-02 06:24 · Score: 1

I imagine that it works like any OCR... they have a guess for what it is and a confidence level. If a character falls below some confidence threshold, they will feed it to a reCAPTCHA user. They may know with 99.5% certainty that the word is "?og", but only 85% certainty that the word is "Dog". Whether a user enters 'd' or 'D' is largely irrelevant.

I could see it being a problem with 'Z' and 'z', or something like that. I'm sure they can parse the language, though, and intelligently decide if it is likely to be a situation that calls for a capital letter in those rare situations.

MOD grandparent PARENT UP by Virgil+Tibbs · 2007-10-02 06:50 · Score: 1

i agree here....

--
www.tdobson.net #### Dare to Dream #### blog.tdobson.net

Caps aren't relevant by smitth1276 · 2007-10-02 06:51 · Score: 1

If they hypothetically feed you the words "dark market", they may know with 99.9% confidence that the second word is "market". For the first word, they may know with 99.9% confidence that the word is "?ark"... that first wildcard, though, may be 'd' (85% confidence), 'p' (20%), 'sh' (0.5%), or any number of other things... there is some probability that the wildcard is any given character. If they predict it to be 'd' with 85% confidence (but it is below the threshold), they will take a 'd' or a 'D' as confirmation of that. They aren't going to assume that it is a capital 'D', which might have some negligibly low probability, just because you type a capital 'D'.

Their algorithms are almost certainly smarter than that.
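
The guess described above (which is speculation about how the system works, not a documented design) can be sketched like this. The function names and the 0.95 threshold are assumptions for illustration; the confidence figures are the ones from the comment.

```python
def needs_human_check(candidates, threshold=0.95):
    """Return (best_guess, True/False) for one uncertain position.

    candidates: dict mapping a guessed character (or characters) to the
    OCR's confidence in it. The threshold value is an illustrative guess.
    """
    best = max(candidates, key=candidates.get)
    return best, candidates[best] < threshold

def confirms_guess(best_guess, typed):
    """A typed answer confirms the OCR's best guess regardless of case."""
    return typed.lower() == best_guess.lower()
```

With the "?ark" example, `needs_human_check({"d": 0.85, "p": 0.20, "sh": 0.005})` picks 'd' and reports that it falls below the threshold, and a user typing either 'd' or 'D' then counts as confirmation.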

Why? by evilviper · 2007-10-02 08:21 · Score: 1

I can't see any reason for this.

Is there really a shortage of willing volunteer transcribers? I seem to remember Project Gutenberg getting far more volunteers than they could use, without even asking...

And speaking for myself, I'm sure I could transcribe a couple full sentences more quickly than I could two arbitrary words, so I'd call this a terrible use of the available volunteer resources as well.

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant

Isn't this self-defeating? by Miqel · 2007-10-02 09:16 · Score: 1

Won't the technology developed by this program be useful for breaking Captchas? If we can teach computers to decipher them, their usefulness as a human-only readable key is lost.

That site is useless for signing up. by Anonymous Coward · 2007-10-02 10:17 · Score: 0

It's only good for supplying the captchas on your website. Nowhere is there a link to start doing translations.

There's a Wordpress plug-in... by vague+disclaimer · 2007-10-02 10:47 · Score: 1

Can't see if this is mentioned, but there is a plug-in for the Wordpress software that implements reCaptcha. A particularly appropriate use is on http://www.ifshakespeare.co.uk/ which is a literary blog.

Simpler way to stop malicious users by Anonymous Coward · 2007-10-02 13:02 · Score: 0

Why don't they just have it randomly choose whether to put the unknown word first or second each time? Then you have an incentive to translate both words correctly.

Look out Suzy... by mhannibal · 2007-10-04 04:51 · Score: 0

So what happens when little Suzy from myponyandme.com gets an excerpt from 'The Sex Manual'?


119 comments