reCAPTCHA Hard At Work, Rescuing Fading Texts
sciencehabit writes "Computer scientists have developed a program, called reCAPTCHA, which is being used in lieu of CAPTCHA by several sites, to help digitize old books and newspapers. The reCAPTCHA takes entries from old and faded texts that optical scanners and digital-text readers have trouble with. So every time you solve that string of crooked letters, you may actually be helping historians digitally reconstruct a page from the 1908 New York Times." The Science Now story links to the longer and more informative article at Ars Technica. (We last mentioned this program last year — and now it's good to get some sense of how well it's working.)
Ticketmaster and other sites have already been doing this for a while. Go to Ticketmaster and search for tickets, and you'll see two words. One is known and the other is unknown. If you don't believe me, try to guess which one they know and misspell the other one on purpose (or don't, this is for historic posterity =) )
I can usually tell which of the two words is from a real old text. With high probability (>90%) I can correctly answer the real CAPTCHA and replace someone's OCR'd word with "penis".
I've only ever done this maybe ten or twenty times, but it could easily become an automatic part of using the system.
Man, I would love to see the results if this technique was used for an ontological purpose.
Please type in the word from the choices below that most closely relates to this word: OLD
HISTORIC
LIFESPAN
Interesting shit indeed.
The New York Times is already online from 1851 onwards. The concept is cool, truly, but why not CAPTCHA something not already accomplished? Oh, I know. That was, like, a metaphor, right?
How about a moderation of -1 pedantic.
The feature known as FADING was designed to protect copyright works from being pirated by becoming illegible before the work could fall into the public domain.
I think that erosion on stone tablets predates fading by quite a bit....
I'm starting to think GNU is the problem with "GNU/Linux" these days.
A little OT, I know, but is anyone else having a bad time with Gmail's captchas? I've tried signing up several of our customers for Gmail recently and it's becoming really hard to get them right. The "audio" playback used to be the saving grace, but the last two I did sounded like ten people were talking to me all at once with no discernible key voice. (And the last time I succeeded, the string to be entered was spoken in three groups, by three different voices.)
I work for the Department of Redundancy Department.
I've found that implementing a simple "please choose the name of the item seen below" eliminates a large amount of spam (all?) but has the problem of not being viable for blind people.
"Thanks for all the money you paid to us. We've used it to buy off ISO among other things" -Microsoft
The following security test allows us to validate you are a human and not an automated script.
please type the following two words in the text box below
you moron
____________ _____________
The software presents one optically unreadable word and one "control" CAPTCHA word. Getting the control word right identifies the user as a human, and the program records his or her response to the unreadable word and adds it to a database.
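For anyone wondering what that looks like in practice, here is a minimal Python sketch of the check as the article describes it. The names `Challenge`, `check_response`, and the in-memory `guesses` store are hypothetical, and the case-insensitive exact match is an assumption; the real service may well tolerate small typos in the control word.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Challenge:
    control_answer: str  # word whose text is already known
    unknown_id: str      # identifier of the scanned word OCR could not read


# Hypothetical in-memory store: unknown word id -> list of human guesses.
guesses = defaultdict(list)


def check_response(challenge, control_typed, unknown_typed):
    """Return True if the user passes; record the guess only on a pass."""
    passed = control_typed.strip().lower() == challenge.control_answer.lower()
    if passed:
        # The control word was right, so treat the other answer as a human guess.
        guesses[challenge.unknown_id].append(unknown_typed.strip().lower())
    return passed
```

Note that only the control word decides pass or fail; the answer to the unreadable word is stored but never graded at this step.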
So, there is the real CAPTCHA, and another reCAPTCHA.
Took me a bit to get past the new security measures, But I got a coupon 5 cents off my next shoe purchase.
Modding me -1 troll doesn't make me wrong.
Right about now, I'm wondering what the implications would be of including reCAPTCHA in an open source project (a PHP-based blog I'm working on). Right now the blog is read-only, since I have yet to build my own working CAPTCHA system, and putting up an unprotected reply form is sheer idiocy since it will be a whole five minutes before the spam bots find it. My project is GPLv3, so would including reCAPTCHA cause me some sort of licensing problem?
"It is a denial of justice not to stretch out a helping hand to the fallen; that is the common right of humanity."
I've seen a number of issues with reCaptcha that I don't really know how to handle (i.e. what to enter):
1. Multiple word strings
2. Foreign characters
3. Illegible text
4. A single word for both entries
5. Words that look like one thing initially, but are really another when you look closer
Let me introduce you to my friend, the question mark.
is full of hyperbole, dogma, propaganda, and meaningless blatherings.
If video games influenced behavior the Pac Man generation would be eating pills and running away from their problems.
One FUNDAMENTAL problem with this
... is that you didn't RTFA.
How are they able to tell if I've accurately solved an unknown. If the word is "Yesterday" and I enter "Fucktard", not only will the society get some very wrong data, but I'll also have passed the CAPTCHA without entering the actual letters.
UTF-8: There and Back Again
COWBOY NEAL
That slashdot's Goatse troll server guy proves useful.
Note: This is not a troll. One of the guys that offers open web services to slashdot trolls is also responsible for considerable development of CAPTCHA breakage and is an eminent Debian developer. This is why I've said that we should respect his efforts despite the unpleasant side effects. The truly brilliant we should grant exceptions from social behavior because they discover things more proper folk would not.
Help stamp out iliturcy.
When solving these I sometimes find that there's more than one possibility for an illegible word, yet I can't tell which it is without knowing the context.
For example, in some fonts "cost" and "cast" might be indistinguishable in the image shown. But given the context of the sentence it's trivial for a human to tell the difference.
Suppose that they found these words on which people disagreed and had another captcha system which showed the full sentence. I'd guess they could improve their accuracy significantly in this case. Since they could prescreen for ambiguous words using the current captcha system, even if fewer people were willing to solve the "large" captcha, they would still get all the solutions they needed.
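The prescreening step could be as simple as checking how split the answers are. A rough Python sketch, assuming the answers for each word are already collected in a list; the function name `needs_context` and both thresholds are made up for illustration, not anything reCAPTCHA is known to use.

```python
from collections import Counter


def needs_context(answers, min_votes=5, min_agreement=0.75):
    """Flag a word whose answers are split, e.g. between "cost" and "cast"."""
    if len(answers) < min_votes:
        return False  # not enough data yet to call it ambiguous
    top_count = Counter(answers).most_common(1)[0][1]
    # Ambiguous when even the most popular reading lacks a clear majority.
    return top_count / len(answers) < min_agreement
```

Words that come back flagged would be the only ones routed to the slower full-sentence captcha.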
One FUNDAMENTAL problem with this, isn't the point of a captcha to descramble the letters to get access? If the contents of the image shown are unknown, then doesn't that defeat the point entirely?
Actually, you are correct that it won't work if the *entire image* is unknown. But with reCAPTCHA it is not. You see, reCAPTCHA works by showing two words, one of which is known and the other unknown. When the user gets the known word correct, it is assumed that the unknown word is at least partially correct. This both validates the captcha and allows them to build their database of scanned "known" words. Of course, to prevent database poisoning, the "unknown" words are still given many times, in order to "cross reference" and reduce the chance of human error.
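To make the "cross reference" part concrete, here is a small Python sketch of one way it could work: a reading is only accepted once enough independent users have typed the same thing. The function name `accepted_reading` and the quorum of 3 are assumptions for illustration; the comment above does not say what rule the real system applies.

```python
from collections import Counter


def accepted_reading(answers, quorum=3):
    """Return the reading once `quorum` users agree on it, else None."""
    if not answers:
        return None
    word, count = Counter(answers).most_common(1)[0]
    return word if count >= quorum else None
```

Under a rule like this, one joker typing garbage for the unknown word never reaches the quorum, so the bad answer just gets outvoted.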
kernel: lp0 on fire
Why don't they just use whatever software is used by the crackers to bombard us with spam email to go through all of these books at whatever speed they're capable of? If compromised PCs can send tens of thousands of fake emails, why not just set a few up to figure out these words?
How much worse is this than trusting users to correctly identify the text? I ask because I honestly don't know the success rate of the automated system.
The authors also tested software designed to crack CAPTCHAs against images created using reCAPTCHA, and found that they failed completely. The authors ascribe this to the fact that the letters in scanned images contain distortions that are not the result of a clean mathematical transformation. User response times were also measured, but there were no significant differences between the time it took users to handle traditional systems and that required to use reCAPTCHA.
You can also use reCaptcha for your own email address, and be more willing to provide it "publicly" since they'd have to answer the reCaptcha to get to the mailto... reCaptcha mailhide
My company is working on digitizing a large volume of old text (19th century government documents). There are a number of problems unique to old text:
- OCR breaks down due to archaic letter shapes, smudging, letter damage and paper deterioration.
- We evaluated OCR versus having the entire text retyped by Indians, and ended up going with the Indians. The only way to get sufficient accuracy (>99%) was to have everything done twice and do a comparison.
- Even then, the typed text has to be checked using both automated and manual processes. The text is highly structured, which makes automatic checks possible, but we can't catch everything that way. Then again, the checks necessary for our text are more extensive than for an old newspaper.
- For old texts, your average spelling checker is useless. You end up adding loads of words to the dictionary.
ReCAPTCHA solves one of these problems (text entry), but I suspect a fair amount of work remains. E.g. sometimes you need context to decipher a word correctly.
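The "done twice and do a comparison" step from the list above can be sketched in a few lines of Python with the standard `difflib` module. This is only an illustration of the idea, not the poster's actual tooling; the function name `disagreements` and the whitespace word split are assumptions.

```python
from difflib import SequenceMatcher


def disagreements(first_pass, second_pass):
    """Return (words_from_first, words_from_second) pairs where the passes differ."""
    a, b = first_pass.split(), second_pass.split()
    # autojunk=False keeps frequent words from being ignored in long texts.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

For example, `disagreements("the cost of the land", "the cast of the land")` returns `[(['cost'], ['cast'])]`, which is exactly the kind of spot a human checker would then resolve from context.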
I feel pretty good about opening my porn bookmarks now that they've adopted reCAPTCHA.
Just to try it out I set up a mechanical turk using reCAPTCHA. So if you like the idea you can keep at it, instead of just solving one of them once. It can be a bit addicting.
"Education is not the filling of a pail, but the lighting of a fire." -- William Butler Yeats
Ok, the guy didn't RTFA, like many don't, but is this kind of reply really necessary? I would moderate it as rude rather than funny. (like this one a lot better). Guess I am getting old. :-(
The recaptcha.net site does not have links to the OCRed text data they are accumulating. It's nowhere to be found in the F.A.Q. or the wiki. Everything just deals with implementing the API and such. If we, the public are helping to create this archive, where can we download plaintext results of the system?