Google Buys reCAPTCHA For Better Book Scanning
TimmyC writes "This story may interest the Slashdot folk, many of whom use the reCAPTCHA anti-spam service. Well, reCAPTCHA is now owned by Google. Apparently, what attracted Google to reCAPTCHA is that the company has linked its core authentication service with efforts to digitize print books and periodicals. The search giant has a massive (and controversial) effort underway in that area for its Google Books and Google News Archive services. Every time people solve a CAPTCHA from the company, they are also, as a byproduct, helping to turn scanned words into plain text that can be indexed and made searchable by search engines. Interesting times indeed."
This should improve Google's indecipherable CAPTCHA.
I suppose most people write fast enough to allow sentence captchas already.
You're asked to enter TWO words; one known; one not.
From: recaptcha.net:
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
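A minimal Python sketch of the pairing that quote describes. The names (`Challenge`, `submit`, `unknown_id`) and the case-insensitive exact match are assumptions for illustration only, not reCAPTCHA's actual code.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Challenge:
    """One two-word puzzle: a control word with a known answer plus an unread word."""
    control_answer: str   # answer the system already knows
    unknown_id: str       # hypothetical ID of the word OCR could not read
    guesses: Counter = field(default_factory=Counter)  # tally of human readings


def submit(challenge: Challenge, control_guess: str, unknown_guess: str) -> bool:
    """Return True if the user passes; record their reading of the unknown word."""
    # Only the control word decides whether the user passes.
    if control_guess.strip().lower() != challenge.control_answer.lower():
        return False
    # The known word was right, so the reading of the unknown word is kept
    # as one vote, to be confirmed later by other users.
    challenge.guesses[unknown_guess.strip().lower()] += 1
    return True
```

A user who misses the control word fails and contributes no vote, so junk typed into the second box only counts when the first box is right.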
After Bill Clinton's first erection as President, he proceeded .....
It's NOT me! It's the meds! I'm on 1000mg of Fukitol.
"Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. "
That explains why half the time I can't even read the word. I swear every time I reach a captcha I have to refresh it 5x before I finally land on two words I can read.
I must say this system is ingenious. Distributed OCR: let millions of internet users figure out what the words are. Maybe next election when there's hanging chads they can use that as a captcha.
my karma will be here long after I'm gone
Google is doing this in order to prevent spam and to improve OCR. But once OCR is improved to the point where it can read poorer scans, won't spammers be able to use that new technology to eventually defeat CAPTCHA?
Don't get me wrong, I think this is a marvelous idea, potentially using volunteer labor of humans as OCR to interpret a book one poorly-scanned word at a time. But it does seem to have the side effect of eventually destroying the original purpose of what they bought. Maybe CAPTCHA is worth more as a "crowdsourced OCR solution" than it ever was as spam prevention anyway...
"This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
The best part is, it automatically selects for words which are invulnerable to OCR-based attacks. And if the user's presented with an illegible scanned CAPTCHA, they aren't penalised for getting it wrong.
No kidding!!! What do you say at this point?
"Hey everyone, let's all sit refreshing the google gmail account creation page, and always type "boobs" for the second captcha value..."
Funny you should say that
http://mailhide.recaptcha.net/
Summation 2
I have to say, reCAPTCHA is one of the most elegant solutions I've ever seen to a problem.
It's not even killing two birds with one stone, it's killing two birds with one of the birds.
Question everything
I agree that the idea is ingenious. But on the only one I ran into, the word was completely indecipherable. I don't mean that it was really hard, I mean that it was a word so thoroughly mangled that it was clearly impossible to read by anyone, especially without context. The lack of context is one of the big weaknesses of the system. When a word is unclear, it's the words around it that give critical clues to what it is.
I still don't get it. How do you know that the person correctly identified the second word? I don't see how a priori decoding the first word means that the second was correct. I would expect that the individual bad data rate from this technique would be substantial.
I do enjoy the fact that Google, a ridiculously profitable company by virtue of its near-monopoly on Internet search advertising, is using the public who pays it via these ad impressions to do its work for free, and using the technique invented and used by spammers to crowd-source solve CAPTCHAs to get into Gmail and the like!
wisdow
OCR error?
Quidnam Latine loqui modo coepi?
You don't assume.
For the purposes of the captcha, typing one word correctly suffices, as long as it is the right one (the known 'good' word).
For the purposes of distributed OCR, the "how do you know if the unknown word was ID'ed correctly" issue is simply solved by having the word ID'ed several times. Given you don't know which word is the 'test' word and which is the one actually needing IDing, there shouldn't be a problem with people guessing "Penis!" or "Boobies!" all the time.
So as long as a majority of the people ID the word the same way, you can have a high level of confidence that it's being ID'ed correctly.
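A rough Python sketch of that majority check. The function name `consensus` and the thresholds (`min_votes`, `min_share`) are made up for illustration; the real service's rules are not given in this thread.

```python
from collections import Counter
from typing import Optional


def consensus(guesses: list[str], min_votes: int = 3, min_share: float = 0.5) -> Optional[str]:
    """Return the agreed reading of a word, or None if there is no clear majority yet."""
    # Too few readings: keep showing the word to more people.
    if len(guesses) < min_votes:
        return None
    tally = Counter(g.strip().lower() for g in guesses)
    word, count = tally.most_common(1)[0]
    # Accept only if strictly more than min_share of the readers agree.
    return word if count / len(guesses) > min_share else None
```

With these assumed defaults, three readers typing "upon" outvote one prankster, while a two-way split or three different answers leaves the word undecided.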
Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance of getting it right, whilst increasing the chance of a spam bot getting it wrong.
We're all hypocrites. We all have hidden parts, it's the contrast between them that make us more a hypocrite than others
Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance of getting it right, whilst increasing the chance of a spam bot getting it wrong.
...and increasing the rate of people saying "F- it, the captcha should not be longer than my comment." - hence the limit of two words to allow for "me too!" comments.
Which gives rise to the question...
Don't you mean, "Which begs the question..."?!
(ducks)
I only post comments when someone on the internet is wrong.
Google's probably not going to add this to their default search engine. They've already got a good audience using this where it's appropriate - to keep spambots from joining or posting to forums or in other contexts where you want to determine if your web client is human or bot.
Google SEARCH exists and is popular because it's fast and convenient. I can't see them adding a 2-word CAPTCHA to do a simple search only because that would drive search traffic (which is already very profitable) to their competition.
Google is very, very clever at designing mutually beneficial arrangements. They craft all of their products so the user is receiving some significant benefit in return for the information or work they provide to Google. reCAPTCHA only provides a benefit when users see a forum is pretty clean from spam and crap because CAPTCHA is there, so they'll go to the effort of joining those forums. Forum master and user both see a tangible benefit - reduced spam - and will happily compensate Google with 5 seconds' work.
Interesting you should say that.
Unfortunately, it won't work - 4chan already ruined it for everyone.
http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/
hence the limit of two words to allow for "me too!" comments.
lol
So, a project is trying to digitize historical books, newspapers, and documents, preserving them in a form that would allow our history to be kept near-losslessly for the first time since humans started writing -- and you are trying to purposely pollute their data. Okay then...