Fill Out CAPTCHAs, Digitize Books At The Same Time

Posted by Zonk on Thursday May 24, 2007 @11:15AM from the i-would-like-to-subscribe-to-your-newsletter dept.

alphadogg wrote with a link to a Networld article about a noble endeavor: putting CAPTCHAs to work for the good of humanity. A scientist at Carnegie Mellon is looking to create a new type of security check that will assist in a project meant to digitize and make searchable text from books and printed materials. Above and beyond that, the offering would probably be more secure than most current systems. "Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project."

Verification? by traindirector · 2007-05-24 11:16 · Score: 5, Insightful

CAPTCHAs work because the computers sending them already know what the text says; they start with it in text form and change it into a hard-to-read image. In the system discussed in the article, how will the computer verify that the user response actually matches the text? Sure, it could compare the response to its best guess, but if a program trying to guess the text was as sophisticated as the guessing computer, its guess would match.

I imagine the computer sending the image of hard-to-read text will further obfuscate it in a way that makes it even more difficult for the computer on the receiving end to decipher, but the article doesn't acknowledge that this is one of the first logical questions in conceiving of / implementing this system in a functional way. The article really should cover this...
1. Re:Verification? by Bjarke Roune · 2007-05-24 11:59 · Score: 4, Insightful
  
  This is not a problem if the known word is a hard image that has been solved by humans in previous captchas. This scheme works as long as the system has a small pool of known images to start the process off.
  
  --
  Bjarke Roune
2. Re:Verification? by Falkkin · 2007-05-24 12:00 · Score: 4, Insightful
  
  "So if you want to screw with it, all you have to do is intentionally get exactly one word wrong each time."
  
  Well... sort of. Multiple agreements are required before the system will accept that it knows the spelling of a previously unknown word. So you're not going to singlehandedly subvert the system; at the very least you need a cabal of friends. But with millions of words available in the system, the chance that you and a bunch of friends will all get the same word and write in the same bogus data is pretty close to zero. I'm not saying this system is impossible to game, but I think it'd be a heck of a lot easier (and more rewarding, if it's the sort of thing that floats your boat) to vandalize Wikipedia instead.
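
The two replies above describe a scheme that can be sketched in a few lines: each challenge pairs a word the system already knows with one it does not, only the known word decides pass or fail, and an unknown word is accepted once enough visitors agree on it. This is a minimal illustration, not the project's actual code. The `WordPool` class, the image ids, and the threshold of three agreements are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def normalize(text):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return text.strip().lower()


class WordPool:
    """Hypothetical pool of word images: some verified, some still unknown."""

    def __init__(self, known, agreements_needed=3):
        # image id -> verified text (the small seed pool of known images)
        self.known = {image_id: normalize(text) for image_id, text in known.items()}
        # image id -> tally of human answers for words OCR could not read
        self.votes = defaultdict(Counter)
        # assumed threshold; the thread only says "multiple agreements"
        self.agreements_needed = agreements_needed

    def submit(self, known_id, known_answer, unknown_id, unknown_answer):
        """Return True if the visitor passes the check."""
        # Only the verified word decides pass or fail.
        if normalize(known_answer) != self.known[known_id]:
            return False
        # Nothing to record if the second word has already been promoted.
        if unknown_id in self.known:
            return True
        # Record the answer for the unknown word as one vote.
        answer = normalize(unknown_answer)
        self.votes[unknown_id][answer] += 1
        # Promote the word once enough visitors typed the same thing.
        if self.votes[unknown_id][answer] >= self.agreements_needed:
            self.known[unknown_id] = answer
            del self.votes[unknown_id]
        return True
```

With this sketch, a single visitor typing a bogus answer for the unknown word adds one vote and changes nothing else; the word only joins the known pool after three matching answers from visitors who also got the known word right.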
Re:Exactly what I was wondering by hpavc · 2007-05-24 12:17 · Score: 3, Insightful

It likely has a good idea about the 'unknown' word as well. In the example, "This aged portion of society were distinguished from", the OCR didn't cut it, but it did kick-start a guess. At least with "This -> niis" it can see easily enough that the answer is not 'ZOMG' or 'Fark'.

Also, it wouldn't take much to add some grammar to pad the guessing. While we see two words, the system sees them in at least two contexts.

Obviously it has an actual dictionary to help it basically spell-check the words we submit to it. If the words we give it are complete garbage, it's unlikely to go for it. That is where knowing that "niis" needs a correction comes in.

--
members are seeing something, your seeing an ad
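
The sanity check speculated about in the comment above (compare a submission against a dictionary and against the OCR's own rough guess) could look something like the sketch below. It is only an illustration of the idea. Nothing in the thread says how the real system scores answers, so the character-level similarity from Python's `difflib`, the tiny dictionary, and the 0.4 cutoff are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between two words, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_plausible(answer, ocr_guess, dictionary, min_similarity=0.4):
    """Accept an answer only if it is a real word that resembles the OCR guess.

    min_similarity is an assumed cutoff chosen for this example.
    """
    word = answer.lower()
    if word not in dictionary:
        # Complete garbage is rejected by the spell-check step.
        return False
    # A real word still has to look somewhat like what the OCR saw.
    return similarity(word, ocr_guess) >= min_similarity


# Hypothetical dictionary for the example sentence in the comment above.
DICTIONARY = {"this", "aged", "portion", "of", "society", "fark"}
```

Here `is_plausible("This", "niis", DICTIONARY)` passes, because "this" is a dictionary word and shares the trailing "is" with the OCR guess, while "ZOMG" fails the dictionary step and "Fark" fails the similarity step.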