Fill Out CAPTCHAs, Digitize Books At The Same Time

Posted by Zonk on Thursday May 24, 2007 @11:15AM from the i-would-like-to-subscribe-to-your-newsletter dept.

alphadogg wrote with a link to a Networld article about a noble endeavor: putting CAPTCHAs to work for the good of humanity. A scientist at Carnegie Mellon is looking to create a new type of security check that will assist in a project meant to digitize and make searchable text from books and printed materials. Above and beyond that, the offering would probably be more secure than most current systems. "Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project."

26 of 121 comments

Verification? by traindirector · 2007-05-24 11:16 · Score: 5, Insightful

CAPTCHAs work because the computers sending them already know what the text says; they start with it in text form and change it into a hard-to-read image. In the system discussed in the article, how will the computer verify that the user response actually matches the text? Sure, it could compare the response to its best guess, but if a program trying to guess the text were as sophisticated as the guessing computer, its guess would match.

I imagine the computer sending the image of hard-to-read text will further obfuscate it in a way that makes it even more difficult for the computer on the receiving end to decipher, but the article doesn't acknowledge that this is one of the first logical questions in conceiving of or implementing such a system. The article really should cover this...
1. Re:Verification? by greatgregg · 2007-05-24 11:19 · Score: 5, Informative
  
  From recaptcha.net: "But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct."
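A minimal sketch of the two-word check quoted above, in Python. The names `Challenge`, `grade`, and the `scan-0001` identifier are hypothetical, and the case-insensitive comparison is an assumption; the source does not say how reCAPTCHA actually compares answers.

```python
from dataclasses import dataclass


@dataclass
class Challenge:
    control_answer: str  # word whose transcription is already known
    unknown_id: str      # hypothetical identifier of the word OCR could not read


def grade(challenge, typed_control, typed_unknown):
    """Return (passed, vote); vote is (unknown_id, text) or None."""
    # Only the control word decides whether the user passes as human.
    passed = typed_control.strip().lower() == challenge.control_answer.lower()
    if not passed:
        return False, None
    # The answer for the unknown word is kept as one vote, not trusted outright.
    return True, (challenge.unknown_id, typed_unknown.strip().lower())


challenge = Challenge(control_answer="between", unknown_id="scan-0001")
print(grade(challenge, "Between", "society"))
print(grade(challenge, "betwen", "society"))
```

The vote returned for the unknown word would still need confirmation from other users before being accepted, as the quoted text says.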
2. Re:Verification? by bugnuts · 2007-05-24 11:48 · Score: 4, Funny
  The problem is that any unsophisticated captcha interpreter can spit out the text that's known, and make a (bad) guess at what is hard to read. Then, if there are any significant number of spammers, we end up with exactly the same issue - computers having trouble with OCR.
  
  e.g., /. puts in a captcha to translate the following two sections:
  12345
  l1il1
  
  The captcha software knows the "12345"
  but it doesn't know the "l1il1". A human could figure out both.
  
  But spammer captcha deciphering can figure out 12345, and is allowed to incorrectly guess 11ii1 for the 2nd part. End result is:
  
  - a spammer is posting something as indecipherable as this message, except it insults your penis size
  - some OCRed book is now committed to a false interpretation
  - I have to change the password on my luggage.
3. Re:Verification? by codename.matrix · 2007-05-24 11:55 · Score: 2, Informative
  
  See http://recaptcha.net/security.html - the words are additionally distorted, with lines and warps added, so that a computer cannot read them.
4. Re:Verification? by Bjarke Roune · 2007-05-24 11:59 · Score: 4, Insightful
  
  This is not a problem if the known word is a hard image that has been solved by humans in previous captchas. This scheme works as long as the system has a small pool of known images to start the process off.
  
  --
  Bjarke Roune
5. Re:Verification? by Falkkin · 2007-05-24 12:00 · Score: 4, Insightful
  
  "So if you want to screw with it, all you have to do is intentionally get exactly one word wrong each time."
  
  Well... sort of. Multiple agreements are required before the system will accept that it knows the spelling of a previously unknown word. So you're not going to singlehandedly subvert the system; at the very least you need a cabal of friends. But with millions of words available in the system, the chance that you and a bunch of friends will all get the same word and write in the same bogus data is pretty close to zero. I'm not saying this system is impossible to game, but I think it'd be a heck of a lot easier (and more rewarding, if it's the sort of thing that floats your boat) to vandalize Wikipedia instead.
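The "pretty close to zero" estimate can be made concrete with a rough calculation. This sketch assumes each colluder is served one unknown word drawn independently and uniformly from the pool, which the source does not state; the pool size of one million and the function name are illustrative only.

```python
from fractions import Fraction


def same_word_probability(pool_size, colluders):
    """Chance that every colluder is served the same unknown word,
    assuming independent, uniform draws from the pool."""
    if pool_size < 1 or colluders < 1:
        raise ValueError("pool_size and colluders must be positive")
    # The first colluder fixes the word; each of the others must match it.
    return Fraction(1, pool_size) ** (colluders - 1)


p = same_word_probability(pool_size=1_000_000, colluders=3)
print(float(p))  # 1e-12
```

This covers a single challenge per person; colluders who request many challenges each would raise the odds, so treat the figure as a rough bound under the stated assumptions.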
6. Re:Verification? by Mr. Underbridge · 2007-05-24 12:07 · Score: 2, Interesting
  
  This is one of the most creative ideas I've heard all year. Human-based distributed computing with captchas? Awesome!
7. Re:Verification? by autophile · 2007-05-24 12:25 · Score: 3, Informative
  
  Yeah, but it's not like you're only allowed to present a given unknown word once. Present it many times, and use the word with the most hits.
  --Rob
  
  --
  Towards the Singularity.
8. Re:Verification? by DeathElk · 2007-05-24 12:50 · Score: 2, Informative
  
  RTFA's TFA
9. Re:Verification? by poopdeville · 2007-05-24 14:06 · Score: 2, Insightful
  
  Because nobody wants to wait around for another person to verify the CAPTCHA before posting on /. That is, you need two CAPTCHA images because you still want them to work as a CAPTCHA.
  
  --
  After all, I am strangely colored.
Better links by Falkkin · 2007-05-24 11:21 · Score: 4, Informative

The article is lacking some information. Here are some better links:

Official reCAPTCHA site
Hide your email address with reCAPTCHA (super easy!)
A more detailed blog post about how the system works

Disclaimer: I work with Luis von Ahn, who's the professor running the reCAPTCHA project.
1. Re:Better links by inKubus · 2007-05-24 11:54 · Score: 4, Interesting
  
  Also, Amazon has a pretty cool program where you can perform HITs (Human Intelligence Tasks) for a few cents each. They have a lot of stuff like transcribing podcasts, identifying stuff in satellite images, etc.
  
  --
  Cool! Amazing Toys.
Official reCAPTCHA site by traindirector · 2007-05-24 11:23 · Score: 4, Informative

I originally missed the link to the official site - D'oh. The article also doesn't mention that the system is already in use! http://recaptcha.net/
1. Re:Official reCAPTCHA site by caffeinemessiah · 2007-05-24 12:46 · Score: 4, Informative
  
  There's an interesting solution to this problem -- the "scientist at Carnegie Mellon" is Luis von Ahn who was recently awarded a MacArthur genius award. In optical recognition tasks like this where the "true" answer is not known, how do you verify that a human agent correctly did the recognition? Just see if a bunch of other users type the same thing. It's a clever twist on consensus voting, and was recently snatched up by Google as "Google image labeler" here.
  
  --
  An old-timer with old-timey ideas.
2. Re:Official reCAPTCHA site by MindStalker · 2007-05-24 14:35 · Score: 2, Interesting
  
  Problem is, for the first few people seeing a new CAPTCHA the computer will have to let them through even if they guess wrong, so the lock feature of the CAPTCHA doesn't work.
  
  As others mentioned, this system gives you a known word and then an unknown one, though I think it's stupid that it makes the unknown one even harder by putting a slash through it and making it wavy. Helloo, if your system had a hard time recognizing it, why do you want to make it harder to recognize? I saw several examples in which the word was non-English, and I had a hard time guessing the correct spelling because I couldn't make out a letter. There needs to be an "I don't know" button as well :)
Re:Exactly what I was wondering by Falkkin · 2007-05-24 11:29 · Score: 3, Informative

The system serves two words to the user. The system knows the correct answer to one of these words -- this is the one used to test whether the user is a human or a bot. If the user got the test word right, then there's a good chance they also got the unknown word right. If a bunch of humans all agree on the same transcription of a given unknown word, the system will eventually "know" the correct spelling of the unknown word and can then serve it as a "known" word in the future.
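The promotion step described above (an unknown word becomes a "known" word once enough humans agree) might look roughly like this in Python. The threshold of 3 and all names here are assumptions for illustration; the source gives no actual number or data model.

```python
from collections import Counter, defaultdict

AGREEMENT_THRESHOLD = 3  # hypothetical; the source gives no number

votes = defaultdict(Counter)  # unknown word id -> tally of transcriptions
known = {}                    # word id -> accepted transcription


def record_vote(word_id, text):
    """Tally one transcription; return True once the word is known."""
    if word_id in known:
        return True
    votes[word_id][text.strip().lower()] += 1
    best, count = votes[word_id].most_common(1)[0]
    if count >= AGREEMENT_THRESHOLD:
        # Enough users agree: promote the word so it can serve as a test word.
        known[word_id] = best
        del votes[word_id]
        return True
    return False


for answer in ["society", "socicty", "society", "Society"]:
    promoted = record_vote("scan-0001", answer)
print(promoted, known)
```

Only answers from users who passed the test word would be fed into `record_vote`, so a bot that fails the known word never contributes a vote.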
More than just digitizing text by penguinbroker · 2007-05-24 11:37 · Score: 3, Informative

This would also be a great approach to a lot of NLP/Translation annotation tasks. Although these types of tasks generally require a robustness (knowing which answers to trust and which to ignore) that anonymity makes difficult.
http://yro.slashdot.org/article.pl?sid=07/04/03/2211258
I believe amazon.com has filed a patent for a solution to this problem which attributes every annotation input to a unique user id. They then claim to use the average accuracy over the history of that user to whittle away the worst answers (2 out of 10, I think the patent says).
I'm sure some form of quality control/check will be needed, and I wonder if such a solution would infringe on this patent?
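As a loose illustration of the filtering idea recalled above (not the patented method, whose details are not given here), answers could be ranked by each user's historical accuracy and the lowest-ranked ones dropped. The function name, the data shapes, and the choice to treat users with no history as accuracy 0.0 are all assumptions.

```python
def drop_least_trusted(answers, accuracy_history, drop=2):
    """answers: list of (user_id, answer) pairs.
    accuracy_history: user_id -> list of booleans, True where a past
    answer by that user was judged correct."""

    def average_accuracy(user_id):
        history = accuracy_history.get(user_id, [])
        # Assumption: users with no history get the lowest trust.
        return sum(history) / len(history) if history else 0.0

    # Stable sort, most trusted first; ties keep their submission order.
    ranked = sorted(answers, key=lambda pair: average_accuracy(pair[0]), reverse=True)
    return ranked[:max(len(ranked) - drop, 0)]


answers = [("u1", "cat"), ("u2", "cat"), ("u3", "cot"), ("u4", "cat")]
history = {
    "u1": [True, True],
    "u2": [True, False],
    "u3": [False, False],
    "u4": [True, True, True],
}
kept = drop_least_trusted(answers, history, drop=1)
print(kept)
```

With the default `drop=2` applied to ten answers, this mirrors the "2 out of 10" figure the comment recalls.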
Booger by Tablizer · 2007-05-24 11:41 · Score: 2, Insightful

What if the OCR cannot read a word because there was a booger on it during the scan? A human won't be able to determine it either because it will be mostly a blotch. How are they gonna know the difference between human-decipherable words and lost-cause words (such as booger blotches)?

--
Table-ized A.I.
How it could work by AaronW · 2007-05-24 11:45 · Score: 2, Insightful

I can see how this would work, but in order to also provide security, extra letters or words would also need to be in the captcha. I.e. if there's an un-OCRable word "between", the captcha could contain "frog between" or something like that, and the first word could be a previous un-OCRable word that has been validated by enough people.

Another method might be to separate out the un-OCRable letters from words and sprinkle them with known letters, though this might be less effective since people can often recognize words far better than individual letters. If one or two letters in a word cannot be interpreted, a person can often still read the entire word.

--
This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
Re:I got my digitized copy of the US Constitution by multipartmixed · 2007-05-24 12:02 · Score: 2, Funny

Constitution, consititution...

Oh! You mean the "E. Plebnista?"

--

Do daemons dream of electric sleep()?
A pain for users by EssenceLumin · 2007-05-24 12:11 · Score: 2, Insightful

Great, so now I would have to fill out two of those stupid things instead of one. Why would a company want to inflict this on its users?
Re:Exactly what I was wondering by hpavc · 2007-05-24 12:17 · Score: 3, Insightful

It likely has a good idea of the 'unknown' word as well. In the example, "This aged portion of society were distinguished from", the OCR didn't cut it, but it did kick-start a guess. At least with "This -> niis" it can see easily enough that the word isn't 'ZOMG' or 'Fark'.

Also, it wouldn't take much to add some grammar to pad the guessing. While we see two words, the system sees them in at least two contexts.

Obviously it has an actual dictionary to help it basically spell-check the words we submit to it. If the words we give it are complete garbage, it's unlikely to go for them. That is how it knows that "niis" needs a correction.

--
members are seeing something, your seeing an ad
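The dictionary check speculated about in the comment above could be sketched with Python's standard `difflib`. The tiny word list, the 0.8 cutoff, and the function name are assumptions; nothing in the source confirms that reCAPTCHA works this way.

```python
import difflib

# Hypothetical word list; a real system would load a full dictionary.
DICTIONARY = ["this", "aged", "portion", "society", "distinguished"]


def closest_dictionary_word(candidate, cutoff=0.8):
    """Return the nearest dictionary word, or None if nothing is close."""
    word = candidate.strip().lower()
    matches = difflib.get_close_matches(word, DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_dictionary_word("socicty"))
print(closest_dictionary_word("zomg"))
```

A check like this would wrongly reject proper nouns and non-English words, so it fits better as one signal among several than as a hard filter.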
Great CAPTCHA solution to solve people not RTFA! by GNUALMAFUERTE · 2007-05-24 14:08 · Score: 5, Interesting

Come on people, start using your brains, please! Just a little! Half the posters have been asking the same 2 stupid questions, or even worse, posting the same 2 stupid questions with the question mark removed, as if they were facts.

We should put a CAPTCHA system on slashdot:

When you want to post, you get to type in a CAPTCHA. The image for it is generated this way:

- The links to the article/s actually link to a page with a javascript wrapper that loads the article text, but replaces certain words with the graphical representation of that word, in the form of a CAPTCHA.
- These words form a phrase that the user must type in if he wants to post. There are different combinations of phrases selected from the article, and each poster gets one randomly.

This technology should be called CAPSSAA (for Completely Automated Public Stupidity test to tell Slashdotters and Assholes Apart)

--
WTF am I doing replying to an AC at 5 A.M on a Friday night?
Re:A better scheme by tepples · 2007-05-24 14:25 · Score: 2, Informative

A better scheme would be to give out the same CAPTCHA to 2 or more users. If they agree on the answer, then there's a better chance that the text is correct. The article states that the system already does this.
Image spam by JeremyR · 2007-05-24 16:12 · Score: 4, Interesting

Maybe this technique can be adapted to fight image spam more effectively :-)
Hmmm, That Looks Like A... by WiseWeasel · 2007-05-24 20:35 · Score: 2, Funny

Damnit, where's the smushed bug key?!?

--
"I like systems, their application excepted", George Sand (French)