Google Buys reCAPTCHA For Better Book Scanning
TimmyC writes "This story may interest the Slashdot folk, many of whom use the reCAPTCHA anti-spam service. Well, reCAPTCHA is now owned by Google. Apparently, what attracted Google to ReCAPTCHA is that the company has linked its core authentication service with efforts to digitize print books and periodicals. The search giant has a massive (and controversial) effort underway in that area for its Google Books and Google News Archive services. Every time people solve a CAPTCHA from the company, they are also, as a byproduct, helping to turn scanned words into plain text that can be indexed and made searchable by search engines. Interesting times indeed."
How slow is searching the internet going to be if you have to fill out a stupid obscured word each time?!
This should improve Google's indecipherable CAPTCHA.
I suppose most people write fast enough to allow sentence captchas already.
You're asked to enter TWO words; one known; one not.
From: recaptcha.net:
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
Check out this Google book.... about the 7th page down.
http://www.google.com/books?id=Y0OOlnDFUM8C&printsec=frontcover&dq=Le+Morte+d'Arthur&as_brr=1#v=onepage&q=&f=false
I thought these were scanned in by robots? If so it looks like it has well kept fingernails.
As a control, the system sends out one word that it knows the answer to. You don't know which of the two is the unknown word beforehand. Also, I think that the same unknown word is kept in rotation for a couple of iterations just to double-check that it was entered correctly.
At least, that's how I'd implement it.
It is the wisdow of the crowds. There are two words, one is a normal mangled (and known beforehand) captcha, the other is one that the best OCR google got its hands on couldn't solve.
People still have to solve the first one correctly, and if enough people give the same answer to the second one, it is considered correct.
ReCAPTCHA is a free service that usually integrates into forums, blogs, and other such anonymous comment-posting services to help eliminate bot spamming. I think they will not use it on Google search pages, but exploit ReCAPTCHA users of all of those sites that do use it already. Sounds to me like a really good idea...
I'm interested though how they are going to know what a correct entry by a user would be for a scanned word in order to validate it if they only have a scan...
The interface uses two words: one which is verified and one which isn't. Assuming the first one is typed in correctly, they present the second to a bunch of people until they get a consensus (three the same, I think) and then it goes in the "verified" pile. Thus, even if the second word's not verified yet, a spammer will still get caught out by the other one.
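A rough Python sketch of that flow, for anyone who wants to see it spelled out. Everything here is made up for illustration (the class name, the image id, the threshold of three taken from the guess above); it is not reCAPTCHA's actual code.

```python
from collections import Counter, defaultdict

CONSENSUS = 3  # assumed threshold, from the "three the same, I think" guess


class TwoWordCaptcha:
    """Toy model: one control word with a known answer, one unverified scan."""

    def __init__(self):
        # votes[image_id] counts the answers given for an unverified image
        self.votes = defaultdict(Counter)
        # verified[image_id] holds answers that reached consensus
        self.verified = {}

    def submit(self, control_answer, control_guess, unknown_id, unknown_guess):
        """Return True if the user passes; record a vote for the unknown word."""
        if control_guess.strip().lower() != control_answer.lower():
            return False  # control word wrong: reject, ignore the other guess
        guess = unknown_guess.strip().lower()
        if unknown_id not in self.verified:
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= CONSENSUS:
                self.verified[unknown_id] = guess  # goes in the "verified" pile
        return True


captcha = TwoWordCaptcha()
for guess in ["bacon", "Bacon", "penis", "bacon"]:
    captcha.submit("morning", "morning", "img-42", guess)
print(captcha.verified)  # {'img-42': 'bacon'}
```

Note that the lone junk vote never reaches the threshold, and anyone who misses the control word contributes no vote at all.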
sig:- (wit >= sarcasm)
I'm fairly certain the scanners read the text and get a good idea of what it says, then the system asks several people to tell it what it says. As more people type the text in, it becomes more certain of what it says. I've used reCaptcha a number of times and find it to work pretty well, though I have wondered the same thing you're wondering.
MABASPLOOM!
ReCaptcha does that:
One of the words is generated or known, and the other is the new word they are trying to scan. You have to give both to access the protected system, since you don't know which is the known word and which is the new word.
http://en.wikipedia.org/wiki/ReCAPTCHA
If I have nothing to hide, don't search me
Just wait until some soccer mom needs to protect her genius of a brat from all the bad things there are. Latest crusade? A 'bad' word in a CAPTCHA. Just you wait, it will happen.
They could have a list of possibilities, generated by computer or human. Then they throw out the same word several times and aggregate the answers. Comparing elements in the aggregates, they see how many people chose a particular word. When the probability that the word is wrong gets near zero, they add it to the database. I don't know how they actually did it; that's just one of the ways they could have. Not a cure-all, but it helps with the scans, I suppose.
WTF Post?
This is not just any captcha, but recaptcha. This captcha system will challenge you to recognize two words, one of which it understands and one it cannot understand. It assumes that, if sufficient people map the unrecognized word to the same set of letters (and also get the known word right), the image indeed maps to these letters.
This is, indeed, a neat idea for OCR.
After Bill Clinton's first erection as President, he proceeded .....
It's NOT me! It's the meds! I'm on 1000mg of Fukitol.
"Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. "
That explains why half the time I can't even read the word. I swear every time I reach a captcha I have to refresh it 5x before I finally land on two words I can read.
I must say this system is ingenious. Distributed OCR: let millions of internet users figure out what the words are. Maybe next election when there's hanging chads they can use that as a captcha.
my karma will be here long after I'm gone
The system works by having you validate 2 words. One of the words is a word that has already been verified to be correct, a known quantity. The other word is the unknown word. If you get the first one correct, it assumes you got the other one correct too. Error correction is done by having multiple people evaluate the same unknown word. If 3 people agree that the unknown word is "Bacon", the word is then taken to be bacon.
Random people trying to mess up the system will not succeed. However, if you convinced everyone to simply enter "Bacon" we could have some amazing Google book searches.
Google is doing this in order to prevent spam and to improve OCR. But once OCR is improved to the point where it can read poorer scans, won't spammers be able to use that new technology to eventually defeat CAPTCHA?
Don't get me wrong, I think this is a marvelous idea, potentially using volunteer labor of humans as OCR to interpret a book one poorly-scanned word at a time. But it does seem to have the side effect of eventually destroying the original purpose of what they bought. Maybe CAPTCHA is worth more as a "crowdsourced OCR solution" than it ever was as spam prevention anyway...
"This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
The best part is, it automatically selects for words which are invulnerable to OCR-based attacks. And if the user's presented with an illegible scanned CAPTCHA, they aren't penalised for getting it wrong.
No kidding!!! What do you say at this point?
"Hey everyone, let's all sit refreshing the google gmail account creation page, and always type "boobs" for the second captcha value..."
I totally agree, this is pure genius. Distributed Human-engined OCR is certainly the best solution to traditional OCR problems, and at the same time it leaves many doors to unforeseen traps ajar.
I find reCaptcha highly readable. This isn't like other captcha techniques where there are really thin letters and random objects strewn about; it's just blurry, zoomed-in typewritten words that are hard for a computer to distinguish.
Reviewing just the first hour of video games.
Funny you should say that
http://mailhide.recaptcha.net/
Summation 2
That's really interesting. I've always wondered why I have passed these CAPTCHAs even when I had to make wild guesses on some of the words because they were so hard to read.
However, how long will it be before a lot of users realize that it is irrelevant what you enter for the unknown word? Even if you don't know for sure which of the words is the unknown one, knowing the above I think the risk is high that you just type nonsense if you can't read one of the words.
If enough people do this the system will be quite ineffective. reCAPTCHA will probably not accept the wrong solution very often, but it will take a lot of time to get enough users with the same solution to accept it. But with a massive amount of users, even a small amount of the total might be enough to keep it running?
I have to say, reCAPTCHA is one of the most elegant solutions I've ever seen to a problem.
It's not even killing two birds with one stone, it's killing two birds with one of the birds.
Question everything
The other is to track how users browse the web, for ad targeting. All they need to do is put a cookie in your browser and read it next time you see a captcha or load a Google analytics script.
http://images.google.com/imagelabeler/
Well, yeah, but the OCR attacker also just needs to get the OCR readable word right...
Kaetemi
I still don't get it. How do you know that the person correctly identified the second word? I don't see how a priori decoding the first word means that the second was correct. I would expect that the individual bad data rate from this technique would be substantial.
I do enjoy the fact that Google, a ridiculously profitable company by virtue of its near-monopoly on Internet search advertising, is using the public who pays it via these ad impressions to do its work for free, and using the technique invented and used by spammers to crowd-source solve CAPTCHAs to get into Gmail and the like!
wisdow
OCR error?
Quidnam Latine loqui modo coepi?
Because it presents the same words to many, many people. Yes, 10 people can all be wrong, but how likely is it that more than half of 100 people are all wrong in exactly the same way?
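To put a number on that, here is a back-of-the-envelope Python calculation. It assumes each reader independently makes one specific wrong reading with probability p, and the value p = 0.1 is invented purely for illustration. A badly scanned image could push many readers toward the same mistake, in which case the independence assumption breaks down.

```python
from math import comb


def prob_majority_same_error(n, p):
    """P(more than half of n independent readers give one specific wrong
    reading), where each reader makes that exact mistake with probability p."""
    need = n // 2 + 1  # smallest count that is more than half
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# p = 0.1 is a made-up figure: one reader in ten makes the identical mistake
print(prob_majority_same_error(10, 0.1))
print(prob_majority_same_error(100, 0.1))
```

Under those assumptions the figure for 100 readers comes out many orders of magnitude smaller than the figure for 10.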
It's not necessarily the second word that's unknown.
The first one is not a normal mangled word. It's another word that could not be OCR'd but has already been identified by the crowd.
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
You don't assume.
For the purposes of captcha, typing one word correctly suffices, as long as the word you get right is the known 'good' word.
For the purposes of distributed OCR, the "how do you know if the unknown word was ID'ed correctly" issue is simply solved by having the word ID'ed several times. Given you don't know which word is the 'test' word and which is the one actually needing IDing, there shouldn't be a problem with people guessing "Penis!" or "Boobies!" all the time.
So as long as a majority of the people ID the word the same way, you can have a high level of confidence that it's being ID'ed correctly.
That is what happened with the Anonymous attack on the Time poll, with the 'penis' attack.
They looked at both words, saw which one was the least readable, filled in the good one and filled in 'penis' for the second one, in the hopes of poisoning the database so that they only had to enter the first word correctly.
Would be kind of amusing to see a couple of books showing up on Google Books with the word 'penis' randomly inserted in pages where reCaptcha was used.
That'd involve designing a pattern-recognition system which can reliably decide which of two OCR words is less readable, mind you.
No kidding!!! What do you say at this point?
Have you paranoiacs figured out how Google is going to use this to spy on you or otherwise do evil?
Utilizing the synergization of benchmark e-solutions to pre-workaround action items!
no they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer used one finger typing... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, can't even touch type.
I don't know about London, but in the U.S., the 1-2 finger typing is usually accomplished by a community college dropout, whose fingernail extensions are about 2 inches long, and who types either by carefully and slowly pressing one key at a time with the nail extension, or with the second knuckle of her middle finger. She will also scream: "Can I help you" with enough contempt to burn your eyebrows off. When you get to the counter, she will look you over with as much spite as humanly possible, then get her Sidekick out and text someone for a couple of minutes. And god help you if you are still with her (inevitably) when 12pm or 1pm comes about. She will get up and leave for lunch (or unroll her food), whether you're waiting or not. Actually, she'd prefer you to wait there.
She is a ubiquitous inhabitant of government offices of all sorts, as well as front desks in companies that don't respect themselves. She will need the supervisor/manager to resolve any issue that goes beyond typing your name (incorrectly), but she will march on city hall with the rest of her co-workers if they don't get another 5% raise in the middle of the recession.
One KNOWN, one not. The known word is not necessarily going to be OCR readable... you can seed the database with 100 or so images which are known, but maybe not OCR readable. Of course it works better if the known words are NOT OCR readable.
The point is OCR can have typos as well, so just because OCR returns a result doesn't mean it should be trusted. The known word of the two is likely independently analyzed, probably by a human.
Once enough people put the same answer for an unknown word, it becomes trustworthy. That is not easy to hack by making repeated requests with your OCR tool (which does not get GOOD results, but does get CONSISTENT results, therefore the same answer each time) and putting incorrect answers in the database - one of the millions of human users will likely get one of the words being attacked, and respond differently. So you will have several different answers and no clear winner, leaving it an unknown word.
This was actually done by the guys at 4chan /b/:
http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/
I thought I had some hazy recollection that reCAPTCHA was being used for some open projects, like helping to OCR out-of-copyright works...
...so now it is being used to fuel Google's massive, still-very-much-copyrighted, proprietary book scanning effort?
So how's this going to benefit people? I'm, of course, assuming the details are spotty at the moment and I'm terribly interested to hear more details from Google's official "do no evil" department on how they intend to contribute to the world.
I just got a correct response from a clearly incorrect answer.
The image was of Beloved but being difficult I answered 8cloved and got accepted.
It did the job of proving that I wasn't a bot, but if there are enough difficult people (like me) out there then we could really screw Google over.
[Intentionally left blank]
Interesting you should say that.
Unfortunately, it won't work - 4chan already ruined it for everyone.
http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/
I always intentionally smash the keyboard with my palm for the second word.
Well, it doesn't have to be the first word known and the second word unknown, it could be the opposite, or random.
If it is at random, one of the following will happen: I will either screw up the known word, in which case my OCR will not be trusted, or I will screw up the OCR word and get through. It should only take a few tries to get through, and there is no chance of helping with OCR.
Yeah, the multiple answers idea occurred to me later. I'm actually not talking about deliberate garbage answers, just people getting it wrong. If it is badly scanned, etc., you will get multiple answers for the unknown text, and the split may not be 100:1 but two answers at 100:90 or something of that order, so you still don't know which is more correct. Or maybe, because of the nature of the image, the vast majority of people may actually converge on a wrong answer.
The 'known' word wasn't necessarily OCR readable. And their methods of OCR are probably not quite the same as the attacker's.
Building on the sibling replies, I'd also like to point out that for third-world human-powered captcha-entering sweatshops, there is no advantage to randomly guessing the second word versus just entering both words correctly. You'll end up having to enter the same amount of correct words per successful captcha attempt either way.
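The arithmetic behind that, as a small Python sketch. It assumes the solver cannot tell which word is the control and that the control is equally likely to be either word, so a guesser who types only one word properly passes half the time. Both assumptions are mine, not anything reCAPTCHA has published.

```python
def words_per_pass(words_typed_correctly, pass_probability):
    """Expected number of correctly typed words per accepted CAPTCHA.

    Attempts until the first success follow a geometric distribution,
    so the expected number of attempts is 1 / pass_probability.
    """
    return words_typed_correctly / pass_probability


# Honest solver: types both words correctly and always passes.
honest = words_per_pass(2, 1.0)
# Guesser: types one word correctly and junk for the other; passes only when
# the word they bothered with happens to be the control (assumed 50/50).
guesser = words_per_pass(1, 0.5)
print(honest, guesser)  # 2.0 2.0
```

Same typing effort either way, so guessing buys nothing.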
Yeah. I often get combinations like "WORD vjfkjsmxs" or worse, "WORD [illegible smudge]".
I tend to simply put a dash for the smudge. They're not using that word to verify, after all, they just want to know what it says. So I tell them, "nothing". Likely, they'll get a lot of different results for it, and if the scoring algorithm is good it will eventually determine the word is illegible (or at least show it to a moderator of some kind).
You keep running it until one answer dominates in a statistical sense. With the amount of data they are getting, it wouldn't be hard to construct a pretty accurate probabilistic model. If you never get a satisfactory probability for the most frequent answer, you could flag it for a developer to look at.
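Something like the following Python sketch, where a plain vote-share cutoff stands in for a proper probabilistic model. The function name and all three thresholds are invented for illustration.

```python
from collections import Counter


def classify(answers, min_votes=10, min_share=0.6, max_votes=50):
    """Return ('accepted', word), ('pending', None) or ('flagged', None).

    All three thresholds are made-up values, not anything reCAPTCHA uses.
    """
    if len(answers) < min_votes:
        return ("pending", None)  # not enough data yet, keep showing the word
    word, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_share:
        return ("accepted", word)  # one answer dominates
    if len(answers) >= max_votes:
        return ("flagged", None)  # never converged: hand it to a human
    return ("pending", None)


print(classify(["beloved"] * 8 + ["8cloved"] * 2))  # ('accepted', 'beloved')
print(classify(["claw"] * 25 + ["clan"] * 25))      # ('flagged', None)
```

The second call is the 100:90 sort of case: two readings stay close, so the word is flagged rather than accepted.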
Suppose 50% of people filling in the CAPTCHA are malicious. They type in things like "penis", "B00BIES", "qwerty", "asdf", etc. 12.5% of people fail at deciphering the captcha completely. 12.5% of people fail, but succeed in providing near matches with one or two letters wrong. 25% of people succeed in deciphering the CAPTCHA.
I'm just taking a guess at the percentages. But still, with a bit of analysis, it would become quite easy for reCAPTCHA to filter out the noise. The only way reCAPTCHA would fail at the analysis is if the malicious people organize with the explicit purpose of poisoning the reCAPTCHA results. While possible, I think this is unlikely unless reCAPTCHA starts say... sponsoring expeditions to kill baby seals.