Fill Out CAPTCHAs, Digitize Books At The Same Time

Posted by Zonk on Thursday May 24, 2007 @11:15AM from the i-would-like-to-subscribe-to-your-newsletter dept.

alphadogg wrote with a link to a Networld article about a noble endeavor: putting CAPTCHAs to work for the good of humanity. A scientist at Carnegie Mellon is looking to create a new type of security check that will assist in a project meant to digitize and make searchable text from books and printed materials. Above and beyond that, the offering would probably be more secure than most current systems. "Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project."

Re:Verification? by greatgregg · 2007-05-24 11:19 · Score: 5, Informative

From recaptcha.net: "But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct."
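
The check described in that quote can be sketched roughly as follows in Python. This is only an illustration: the function name, the in-memory `votes` store, and the case-insensitive comparison are assumptions, not reCAPTCHA's actual implementation.

```python
from collections import defaultdict

# Hypothetical in-memory store: unknown-word image id -> guesses from users
# who passed the control word. A real service would persist this somewhere.
votes = defaultdict(list)


def normalize(text):
    """Assumed normalization: ignore case and surrounding whitespace."""
    return text.strip().lower()


def check_challenge(known_answer, known_guess, unknown_id, unknown_guess):
    """Pass the user only if the control word matches.

    When the control word is right, the guess for the unknown word is
    recorded as one vote; it is not trusted on its own.
    """
    passed = normalize(known_guess) == normalize(known_answer)
    if passed:
        votes[unknown_id].append(normalize(unknown_guess))
    return passed
```

Note that a failed control word discards the accompanying guess entirely, so a bot that cannot read the known word never gets to vote on the unknown one.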
Better links by Falkkin · 2007-05-24 11:21 · Score: 4, Informative

The article is lacking some information. Here are some better links:

Official reCAPTCHA site
Hide your email address with reCAPTCHA (super easy!)
A more detailed blog post about how the system works

Disclaimer: I work with Luis von Ahn, who's the professor running the reCAPTCHA project.
Official reCAPTCHA site by traindirector · 2007-05-24 11:23 · Score: 4, Informative

I originally missed the link to the official site - D'oh. The article also doesn't mention that the system is already in use! http://recaptcha.net/
1. Re:Official reCAPTCHA site by caffeinemessiah · 2007-05-24 12:46 · Score: 4, Informative
  
  There's an interesting solution to this problem -- the "scientist at Carnegie Mellon" is Luis von Ahn who was recently awarded a MacArthur genius award. In optical recognition tasks like this where the "true" answer is not known, how do you verify that a human agent correctly did the recognition? Just see if a bunch of other users type the same thing. It's a clever twist on consensus voting, and was recently snatched up by Google as "Google image labeler" here.
  
  --
  An old-timer with old-timey ideas.
Re:Exactly what I was wondering by Falkkin · 2007-05-24 11:29 · Score: 3, Informative

The system serves two words to the user. The system knows the correct answer to one of these words -- this is the one used to test whether the user is a human or a bot. If the user got the test word right, then there's a good chance they also got the unknown word right. If a bunch of humans all agree on the same transcription of a given unknown word, the system will eventually "know" the correct spelling of the unknown word and can then serve it as a "known" word in the future.
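
A rough Python sketch of that last step, where an unknown word becomes "known" once enough humans agree. The thresholds (`min_votes`, `min_share`) and the function name are made up for illustration; the comments here do not say what values the real system uses.

```python
from collections import Counter


def promote_if_agreed(guesses, min_votes=3, min_share=0.75):
    """Return the agreed transcription, or None if there is no consensus yet.

    guesses: transcriptions from users who got the test word right.
    min_votes and min_share are assumed thresholds, not documented ones.
    """
    if len(guesses) < min_votes:
        return None
    word, count = Counter(guesses).most_common(1)[0]
    if count / len(guesses) >= min_share:
        return word  # can now be served as a "known" word
    return None
```

For example, `promote_if_agreed(["upon", "upon", "upon", "up0n"])` returns `"upon"`, while two votes alone, or an even split, return `None` and the word stays in the unknown pool.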
More than just digitizing text by penguinbroker · 2007-05-24 11:37 · Score: 3, Informative

This would also be a great approach to a lot of NLP/translation annotation tasks, although these types of tasks generally require a robustness (knowing which answers to trust and which to ignore) that anonymity makes difficult.
http://yro.slashdot.org/article.pl?sid=07/04/03/2211258
I believe Amazon.com has filed a patent for a solution to this problem which attributes every annotation input to a unique user ID. They then claim to use the average accuracy over the history of that user for whittling away the worst answers (2 out of 10, I think the patent says).
I'm sure some form of quality control/check will be needed, and I wonder if such a solution would infringe on this patent?
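
To make the idea concrete, here is a small Python sketch of filtering by historical accuracy as recalled above. It is not taken from the patent: the data shapes, the function name, and the 20% default (from the "2 out of 10" recollection) are all assumptions.

```python
def drop_least_reliable(answers, history, drop_fraction=0.2):
    """Discard answers from the users with the worst track record.

    answers: {user_id: answer} for one annotation item.
    history: {user_id: (correct, total)} over that user's past answers.
    drop_fraction: assumed share of users to discard (2 out of 10).
    """
    def accuracy(user_id):
        correct, total = history.get(user_id, (0, 0))
        # Users with no history are treated as least reliable (an assumption).
        return correct / total if total else 0.0

    ranked = sorted(answers, key=accuracy, reverse=True)
    keep = len(ranked) - int(len(ranked) * drop_fraction)
    return {user_id: answers[user_id] for user_id in ranked[:keep]}
```

This only works when each input is tied to a stable user ID, which is exactly what an anonymous CAPTCHA lacks.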
Re:Verification? by nwbvt · 2007-05-24 11:42 · Score: 1, Informative

Considering all the other people who asked that question, they really needed to make that clear in their press releases.
So if you want to screw with it, all you have to do is intentionally get exactly one word wrong each time. Yeah, it will often take two tries to get it right, but it's not like CAPTCHAs usually work fine on one try anyways... And hey, if you just try for only one word (and leave the other blank), you will end up on average typing the same amount.
The article makes comparisons to SETI@Home, but that's a bit different since that is relying on the computer to do the work, not the actual users. That means it's fairly consistent and you really are not impacting users all that much (with the exception of pegging their CPU when they are away from the computer).

--
Mathematics is made of 50 percent formulas, 50 percent proofs, and 50 percent imagination.
Re:Verification? by codename.matrix · 2007-05-24 11:55 · Score: 2, Informative

http://recaptcha.net/security.html: the words are additionally distorted, and they add lines and warps so that a computer cannot read them.
Re:Verification? by autophile · 2007-05-24 12:25 · Score: 3, Informative

Yeah, but it's not like you're only allowed to present a given unknown word once. Present it many times, and use the word with the most hits.
--Rob

--
Towards the Singularity.
Re:Verification? by DeathElk · 2007-05-24 12:50 · Score: 2, Informative

RTFA's TFA
Re:A better scheme by tepples · 2007-05-24 14:25 · Score: 2, Informative

A better scheme would be to give out the same CAPTCHA to 2 or more users. If they agree on the answer, then there's a better chance that the text is correct. The article states that the system already does this.