Google Buys reCAPTCHA For Better Book Scanning

Posted by CmdrTaco on Thursday September 17, 2009 @02:06AM from the when-spammers-give-you-lemons dept.

TimmyC writes "This story may interest the Slashdot folk, many of whom use the reCAPTCHA anti-spam service. Well, reCAPTCHA is now owned by Google. Apparently, what attracted Google to reCAPTCHA is that the company has linked its core authentication service with efforts to digitize print books and periodicals. The search giant has a massive (and controversial) effort underway in that area for its Google Books and Google News Archive services. Every time people solve a CAPTCHA from the company, they are also, as a byproduct, helping to turn scanned words into plain text that can be indexed and made searchable by search engines. Interesting times indeed."

Well... by vikhyat · 2009-09-17 02:10 · Score: 4, Interesting

This should improve Google's indecipherable CAPTCHA.
Why just words? by Thanshin · 2009-09-17 02:11 · Score: 3, Insightful

I suppose most people write fast enough to allow sentence captchas already.
1. Re:Why just words? by Canazza · 2009-09-17 02:21 · Score: 4, Insightful
  
  No they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer used one finger typing... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, and she can't even touch type.
  I know a lot of people who still have to look at the keys when they type, and while it's generally faster than that bint, it's still painfully slow.
  Not to mention children. When it comes to touch typing, kids can be fast learners, but before they get the hang of it, they can be very slow too.
  
  --
  It pays to be obvious, especially if you have a reputation for being subtle.
Re:WTF Summary by duguk · 2009-09-17 02:14 · Score: 5, Informative

You're asked to enter TWO words; one known; one not.

From: recaptcha.net:
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
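The check described in that quote can be sketched roughly as follows. This is only an illustration of the idea, not reCAPTCHA's actual code: the function name, the in-memory `guesses` store, and the case-insensitive comparison are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical in-memory store: unknown-word image id -> guesses collected so far.
guesses = defaultdict(list)


def check_submission(control_answer, typed_control, unknown_id, typed_unknown):
    """Pass the user if the known (control) word matches.

    When the control word is right, the user's reading of the unknown
    word is recorded as one vote; it is never graded on its own.
    """
    passed = typed_control.strip().lower() == control_answer.strip().lower()
    if passed:
        guesses[unknown_id].append(typed_unknown.strip().lower())
    return passed
```

A user who gets the control word wrong is rejected and their guess for the unknown word is thrown away, so a failed attempt never adds a vote for that word.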
I hope they have a couple of tests! by NoYob · 2009-09-17 02:25 · Score: 4, Funny

As I get older, I find that I'm having a harder time reading from computer monitors and especially captchas. I confuse words all the time. For acample: erection with election. Not so bad, but if Google doesn't pass that unknown to multiple folks, it could get embarrassing. Text from a Bill Clinton bio:
After Bill Clinton's first erection as President, he proceeded .....

--
It's NOT me! It's the meds! I'm on 1000mg of Fukitol.
Re:WTF Summary by iamhassi · 2009-09-17 02:28 · Score: 4, Insightful

"Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. "

That explains why half the time I can't even read the word. I swear every time I reach a captcha I have to refresh it 5x before I finally land on two words I can read.

I must say this system is ingenious. Distributed OCR: let millions of internet users figure out what the words are. Maybe next election when there's hanging chads they can use that as a captcha.

--
my karma will be here long after I'm gone
Won't this eventually defeat the purpose? by natehoy · 2009-09-17 02:35 · Score: 3, Interesting

Google is doing this in order to prevent spam and to improve OCR. But once OCR is improved to the point where it can read poorer scans, won't spammers be able to use that new technology to eventually defeat CAPTCHA?
Don't get me wrong, I think this is a marvelous idea, potentially using volunteer labor of humans as OCR to interpret a book one poorly-scanned word at a time. But it does seem to have the side effect of eventually destroying the original purpose of what they bought. Maybe CAPTCHA is worth more as a "crowdsourced OCR solution" than it ever was as spam prevention anyway...

--
"This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
1. Re:Won't this eventually defeat the purpose? by slim · 2009-09-17 02:54 · Score: 5, Insightful
  
  What you get in the captcha is the scanned word, plus some warping and obfuscation. Therefore if OCR advances to the point where it has no trouble with the original scan, it would still have trouble with the captcha.
  Spammers already have a neat way around captchas -- they proxy them to people on porn and warez sites. If you ever fill in a captcha on such a site, you're probably helping a spambot out.
2. Re:Won't this eventually defeat the purpose? by Hurricane78 · 2009-09-17 03:50 · Score: 2, Insightful
  
  No it's not warped and obfuscated. ReCaptcha gives you the word as-is.
  GP is using faulty logic (circular reasoning I think).
  If ReCaptcha improves OCR algorithms, then not only spammers will have access to them, but so does the effort behind ReCaptcha.
  So the now scannable words would be scanned and never turn up there. ReCaptcha would just present you with those words that would still not be scannable by any OCR.
  
  --
  Any sufficiently advanced intelligence is indistinguishable from stupidity.
3. Re:Won't this eventually defeat the purpose? by koick · 2009-09-17 05:34 · Score: 2, Informative
  
  In this interview on Wired, Luis von Ahn explains that they do indeed warp it: http://www.youtube.com/watch?v=3PuZ55kyf7E
4. Re:Won't this eventually defeat the purpose? by Hays · 2009-09-17 06:09 · Score: 2, Insightful
  
  The text is warped and obfuscated. Look at example captchas -- do you really think the geometric swirls were in the source documents?
5. Re:Won't this eventually defeat the purpose? by ChaosDiscord · 2009-09-17 08:05 · Score: 2, Informative
  
  "No it's not warped and obfuscated. ReCaptcha gives you the word as-is."
  Go here. Bounce on the reload button a few times to see some example reCAPTCHA. Tell me with a straight face that they're not warped. Perhaps they're scanning books printed on silly putty? As for obfuscated see the example here. They used to slap a line across each word. They don't appear to be doing so any more, but they used to.
  
  --
  Search 2010 Gen Con events
Re:WTF Summary by Sockatume · 2009-09-17 02:36 · Score: 4, Informative

The best part is, it automatically selects for words which are invulnerable to OCR-based attacks. And if the user's presented with an illegible scanned CAPTCHA, they aren't penalised for getting it wrong.

--
No kidding!!! What do you say at this point?
Re:WTF Summary by Anonymous Coward · 2009-09-17 02:38 · Score: 5, Funny

"Hey everyone, let's all sit refreshing the google gmail account creation page, and always type "boobs" for the second captcha value..."
Re:maybe they should use CAPTCHAs... by Rik+Sweeney · 2009-09-17 02:54 · Score: 3, Interesting

Funny you should say that
http://mailhide.recaptcha.net/

--
Summation 2
reCAPTCHA is awesome by Thaelon · 2009-09-17 03:00 · Score: 5, Funny

I have to say, reCAPTCHA is one of the most elegant solutions I've ever seen to a problem.
It's not even killing two birds with one stone, it's killing two birds with one of the birds.

--
Question everything
Re:Mod up by mrcaseyj · 2009-09-17 03:06 · Score: 5, Interesting

I agree that the idea is ingenious. But on the only one I ran into, the word was completely indecipherable. I don't mean that it was really hard, I mean that it was a word so thoroughly mangled that it was clearly impossible to read by anyone, especially without context. The lack of context is one of the big weaknesses of the system. When a word is unclear, it's the words around it that give critical clues to what it is.
Re:WTF Summary by slyborg · 2009-09-17 03:08 · Score: 2, Interesting

I still don't get it. How do you know that the person correctly identified the second word? I don't see how a priori decoding the first word means that the second was correct. I would expect that the individual bad data rate from this technique would be substantial.
I do enjoy the fact that Google, a ridiculously profitable company by virtue of its near-monopoly on Internet search advertising, is using the public who pays it via these ad impressions to do its work for free, and using the technique invented and used by spammers to crowd-source solve CAPTCHAs to get into Gmail and the like!
Re:WTF Summary by digitig · 2009-09-17 03:14 · Score: 2, Funny

wisdow
OCR error?

--
Quidnam Latine loqui modo coepi?
Re:WTF Summary by Chyeld · 2009-09-17 03:21 · Score: 2, Insightful

You don't assume.
For the purposes of captcha, typing one word correctly suffices, as long as it's the right word (the known 'good' word).
For the purposes of distributed OCR, the "how do you know if the unknown word was ID'ed correctly" issue is simply solved by having the word ID'ed several times. Given you don't know which word is the 'test' word and which is the one actually needing IDing, there shouldn't be a problem with people guessing "Penis!" or "Boobies!" all the time.
So as long as a majority of the people ID the word the same way, you can have a high level of confidence that it's being ID'ed correctly.
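To make the majority idea concrete, here is a minimal sketch. The function name and both thresholds (`min_votes`, `min_share`) are made up for illustration; nothing in the thread says what numbers reCAPTCHA really uses.

```python
from collections import Counter


def consensus(guesses, min_votes=3, min_share=0.5):
    """Return the agreed transcription of an unknown word, or None.

    guesses   -- strings typed by different users for the same image
    min_votes -- assumed minimum number of guesses before deciding
    min_share -- assumed fraction the top answer must strictly exceed
    """
    if len(guesses) < min_votes:
        return None
    word, count = Counter(guesses).most_common(1)[0]
    return word if count / len(guesses) > min_share else None
```

With these assumed thresholds, one joker typing "boobs" among three honest readers is simply outvoted, and a word nobody agrees on stays undecided until more people have seen it.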
Re:Mod up by Chabil+Ha' · 2009-09-17 03:50 · Score: 2, Insightful

Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance at getting it right, whilst increasing the chance of a spam bot of getting it wrong.

--
We're all hypocrites. We all have hidden parts, it's the contrast between them that make us more a hypocrite than others
Re:Mod up by Anonymous Coward · 2009-09-17 04:06 · Score: 2, Funny

"Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance at getting it right, whilst increasing the chance of a spam bot of getting it wrong."
...and increasing the rate of people saying "F- it, the captcha should not be longer than my comment." - hence the limit of two words to allow for "me too!" comments.
Re:Mod up by Kozz · 2009-09-17 04:07 · Score: 2, Funny

Which gives rise to the question...
Don't you mean, "Which begs the question..."?!
(ducks)

--
I only post comments when someone on the internet is wrong.
Re:Imagine! by natehoy · 2009-09-17 04:28 · Score: 2, Insightful

Google's probably not going to add this to their default search engine. They've already got a good audience using this where it's appropriate - to keep spambots from joining or posting to forums or in other contexts where you want to determine if your web client is human or bot.
Google SEARCH exists and is popular because it's fast and convenient. I can't see them adding a 2-word CAPTCHA to do a simple search only because that would drive search traffic (which is already very profitable) to their competition.
Google is very, very clever at designing mutually beneficial arrangements. They craft all of their products so the user is receiving some significant benefit in return for the information or work they provide to Google. reCAPTCHA only provides a benefit when users see a forum is pretty clean from spam and crap because CAPTCHA is there, so they'll go to the effort of joining those forums. Forum master and user both see a tangible benefit - reduced spam - and will happily compensate google with 5 seconds' work.

--
"This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
Re:WTF Summary by Anonymous Coward · 2009-09-17 04:50 · Score: 3, Interesting

Interesting you should say that.
Unfortunately, it won't work - 4chan already ruined it for everyone.
http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/
Re:Mod up by selven · 2009-09-17 10:17 · Score: 2, Funny

hence the limit of two words to allow for "me too!" comments.
lol
Re:Imagine! by SnowZero · 2009-09-17 16:19 · Score: 2, Insightful

So, a project is trying to digitize historical books, newspapers, and documents, preserving them in a form that would allow our history to be kept near-losslessly for the first time since humans started writing -- and you are trying to purposely pollute their data. Okay then...