Fill Out CAPTCHAs, Digitize Books At The Same Time
alphadogg wrote with a link to a Networld article about a noble endeavor: putting CAPTCHAs to work for the good of humanity. A scientist at Carnegie Mellon is looking to create a new type of security check that will assist in a project meant to digitize and make searchable text from books and printed materials. Above and beyond that, the offering would probably be more secure than most current systems. "Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project."
CAPTCHAs work because the computers sending them already know what the text says; they start with it in text form and change it into a hard-to-read image. In the system discussed in the article, how will the computer verify that the user response actually matches the text? Sure, it could compare the response to its best guess, but if a program trying to guess the text were as sophisticated as the guessing computer, its guess would match.
I imagine the computer sending the image of hard-to-read text will further obfuscate it in a way that makes it even more difficult for a computer on the receiving end to decipher. The article doesn't acknowledge this, even though it's one of the first logical questions in conceiving of or implementing a system like this in a functional way. The article really should cover this...
The article is lacking some information. Here are some better links:
Official reCAPTCHA site
Hide your email address with reCAPTCHA (super easy!)
A more detailed blog post about how the system works
Disclaimer: I work with Luis von Ahn, who's the professor running the reCAPTCHA project.
I originally missed the link to the official site - D'oh. The article also doesn't mention that the system is already in use! http://recaptcha.net/
The system serves two words to the user. The system knows the correct answer to one of these words -- this is the one used to test whether the user is a human or a bot. If the user got the test word right, then there's a good chance they also got the unknown word right. If a bunch of humans all agree on the same transcription of a given unknown word, the system will eventually "know" the correct spelling of the unknown word and can then serve it as a "known" word in the future.
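The scheme described above could be sketched roughly as follows. This is only an illustration of the control-word/unknown-word idea, not reCAPTCHA's actual code: the class name, the image ids, and the agreement threshold of 3 are all assumptions made up for the example.

```python
from collections import Counter, defaultdict

# Assumed threshold: how many matching human answers promote a word to "known".
AGREEMENT_THRESHOLD = 3


class TwoWordCaptcha:
    """Hypothetical sketch: one control word gates trust in one unknown word."""

    def __init__(self, known):
        # known: image id -> verified transcription
        self.known = dict(known)
        # votes: unknown image id -> Counter of submitted transcriptions
        self.votes = defaultdict(Counter)

    def check(self, control_id, control_answer, unknown_id, unknown_answer):
        """Return True if the user typed the control word correctly."""
        expected = self.known[control_id].lower()
        if control_answer.strip().lower() != expected:
            # Failed the control word: treat as a bot and ignore the other answer.
            return False
        guess = unknown_answer.strip().lower()
        self.votes[unknown_id][guess] += 1
        if self.votes[unknown_id][guess] >= AGREEMENT_THRESHOLD:
            # Enough humans agree, so the word can be served as "known" later.
            self.known[unknown_id] = guess
            del self.votes[unknown_id]
        return True
```

Note that the user is only ever graded on the control word; the answer for the unknown word is just recorded as a vote.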
http://yro.slashdot.org/article.pl?sid=07/04/03/2211258
I believe amazon.com has filed a patent for a solution to this problem that attributes every annotation to a unique user ID. They then claim to use each user's average accuracy over their history to whittle away the worst answers (2 out of 10, I think the patent says).
I'm sure some form of quality control will be needed, and I wonder whether such a solution would infringe on this patent.
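For what it's worth, the kind of filtering described there might look something like this. It is a guess at the general idea only, not what the patent actually claims: the function name, the `(correct, total)` history format, and the choice to score users with no history as 0 are all assumptions.

```python
def filter_by_user_accuracy(answers, history, drop_fraction=0.2):
    """Drop the answers whose submitters have the worst track record.

    answers: list of (user_id, text) pairs for one item.
    history: user_id -> (correct, total) counts from past submissions.
    drop_fraction: share of answers to discard (0.2 = "2 out of 10").
    """
    def accuracy(user_id):
        correct, total = history.get(user_id, (0, 0))
        # Assumption: users with no history get the lowest score.
        return correct / total if total else 0.0

    # Best submitters first; sorted() is stable, so ties keep input order.
    ranked = sorted(answers, key=lambda a: accuracy(a[0]), reverse=True)
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[:len(ranked) - n_drop]
```

With ten answers and the default fraction, the two answers from the least accurate users are thrown away before any agreement check happens.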
It likely has a good idea about the 'unknown' word as well. In the example "This aged portion of society were distinguished from", the OCR didn't cut it, but it did kick-start a guess. At least with "This -> niis" it can see it's not 'ZOMG' or 'Fark' easily enough.
Also, it wouldn't take much to add some grammar to pad the guessing. While we see two words, the system sees them in at least two contexts.
Obviously it has an actual dictionary to help it basically spell-check the words we submit to it. If the words we give it are complete garbage, it's unlikely to go for it. That is also how it knows that "niis" needs a correction.
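A dictionary check like that is easy to sketch with Python's standard `difflib`. The tiny word list below stands in for a real dictionary, and the function names are made up for the example; nothing here is taken from the actual system.

```python
import difflib

# Stand-in dictionary: just the words from the example sentence.
WORDS = ["this", "aged", "portion", "of", "society", "were",
         "distinguished", "from"]


def in_dictionary(word, words=WORDS):
    """True if the submitted word is a real (listed) word."""
    return word.strip().lower() in words


def suggest(word, words=WORDS, cutoff=0.6):
    """Return dictionary words that look similar to a misread word."""
    return difflib.get_close_matches(word.strip().lower(), words, cutoff=cutoff)


# A submission like "zomg" fails the dictionary check outright,
# while a near miss such as "societv" gets a sensible suggestion.
print(in_dictionary("zomg"))
print(suggest("societv"))
```

How close a misreading has to be before it counts depends on `cutoff`, so a badly mangled OCR result may still need human answers to settle it.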
members are seeing something, you're seeing an ad
Come on people, start using your brains, please! Just a little! Half the posters have been asking the same 2 stupid questions or, even worse, posting the same 2 stupid questions with the question mark removed, as if they were facts.
We should put a CAPTCHA system on slashdot:
When you want to post, you have to type in a CAPTCHA. The image for it is generated this way:
- The links to the article(s) actually go to a page with a JavaScript wrapper that loads the article text but replaces certain words with a graphical representation of that word, in the form of a CAPTCHA.
- These words form a phrase that the user must type in if he wants to post. There are different combinations of phrases selected from the article, and each poster gets one at random.
This technology should be called CAPSSAA (for Completely Automated Public Stupidity test to tell Slashdotters and Assholes Apart)
WTF am I doing replying to an AC at 5 A.M on a Friday night?
Maybe this technique can be adapted to fight image spam more effectively :-)