Slashdot Mirror


Google Buys reCAPTCHA For Better Book Scanning

TimmyC writes "This story may interest the Slashdot folk, many of whom use the reCAPTCHA anti-spam service. Well, reCAPTCHA is now owned by Google. Apparently, what attracted Google to ReCAPTCHA is that the company has linked its core authentication service with efforts to digitize print books and periodicals. The search giant has a massive (and controversial) effort underway in that area for its Google Books and Google News Archive services. Every time people solve a CAPTCHA from the company, they are also, as a byproduct, helping to turn scanned words into plain text that can be indexed and made searchable by search engines. Interesting times indeed."

138 comments

  1. Imagine! by rallymatte · · Score: 1

    How slow is searching the internet going to be if you have to fill out a stupid obscured word each time?!

    1. Re:Imagine! by Anonymous Coward · · Score: 1, Insightful

      As slow as searching most forums

    2. Re:Imagine! by natehoy · · Score: 2, Insightful

      Google's probably not going to add this to their default search engine. They've already got a good audience using this where it's appropriate - to keep spambots from joining or posting to forums or in other contexts where you want to determine if your web client is human or bot.

      Google SEARCH exists and is popular because it's fast and convenient. I can't see them adding a 2-word CAPTCHA to do a simple search only because that would drive search traffic (which is already very profitable) to their competition.

      Google is very, very clever at designing mutually beneficial arrangements. They craft all of their products so the user is receiving some significant benefit in return for the information or work they provide to Google. reCAPTCHA only provides a benefit when users see a forum is pretty clean from spam and crap because CAPTCHA is there, so they'll go to the effort of joining those forums. Forum master and user both see a tangible benefit - reduced spam - and will happily compensate google with 5 seconds' work.

      --
      "This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
    3. Re:Imagine! by mysidia · · Score: 1

      I could see Google using it to protect their account signup/login/new service signup processes, not their search function.

      What's more valuable is other people using reCaptcha technology. Google can now benefit from their use, by using the service to assist their book scanning/OCR efforts.

    4. Re:Imagine! by Anonymous Coward · · Score: 0

      Do what I do. For the last year or so I've been "penis bombing" reCAPTCHA. Since only one word of the two is known, if you type in "penis" for the harder to read one it will still validate. More people need to do this to get a good effect.

    5. Re:Imagine! by SnowZero · · Score: 2, Insightful

      So, a project is trying to digitize historical books, newspapers, and documents, preserving them in a form that would allow our history to be kept near-losslessly for the first time since humans started writing -- and you are trying to purposely pollute their data. Okay then...

    6. Re:Imagine! by agnosticnixie · · Score: 1

      They'd probably not do it because they'd be bound to be sued for it.

    7. Re:Imagine! by mysidia · · Score: 1

      Not likely. They already use Captcha technology to protect their signup pages. It's just a matter of replacing their in-house custom built implementation with the reCaptcha one.

    8. Re:Imagine! by agnosticnixie · · Score: 1

      Any large-scale implementation of CAPTCHA is in breach of section 501 (which applies to Google as they're a US company) due to its inaccessibility (W3C report for the interested) - this kind of lawsuit has already happened in the past when a company had an important enough presence and a problematic enough implementation (and no, having sound samples is not unproblematic) to add a significant barrier to access to internet utilities.

      We'll just have to see how far and wide they go with it.

    9. Re:Imagine! by yodleboy · · Score: 1

      that's interesting because i've seen many sites that use captcha and also have a plain button or link next to it that says "listen to this word". When clicked, the words in the captcha are played for you. So, unless you are blind AND deaf, in which case the internet in its entirety is "inaccessible", then you should be able to complete the process. Sounds like compliance to me and I'd doubt Google wouldn't implement this functionality.

    10. Re:Imagine! by agnosticnixie · · Score: 1

      So, unless you are blind AND deaf

      Not only

      in which case the internet in its entirety is "inaccessible"

      Wrong, I know it's /., but talking shit about stuff you know jack about is not a good argument.

  2. Well... by vikhyat · · Score: 4, Interesting

    This should improve Google's indecipherable CAPTCHA.

    1. Re:Well... by Anonymous Coward · · Score: 0

      As long as it doesn't replace them with one i had from ReCAPTCHA that day, ooowee that was a toughy.
      I just can't seem to place my finger on it... pretty sure i was drunk.

    2. Re:Well... by Anonymous Coward · · Score: 0

      This should improve Google's indecipherable CAPTCHA.

      Google's purchase of ReCaptcha is brought on by the poor job of both digitizing and OCRing the books that they have scanned. Had they taken the time to do it right the first time...

      I buy my books from kirtasbooks.com, their image quality is superb, and their Digitize on Demand and Print on Demand service is the best. They only have 800K books now, but I am sure they will move to the front of the pack based on quality alone.

  3. WTF Summary by afxgrin · · Score: 0, Troll

    How does solving a captcha help the database? That doesn't make ANY sense at all - a captcha needs to be solved before hand to make sure that the user authenticates the correct word. You don't just type into the captcha input box any random word, and it lets you through!

    Heh I can just see these spamming guys trying to modify an OCR system for captcha breaking, and suddenly realizing they can just input any word.

    1. Re:WTF Summary by duguk · · Score: 5, Informative

      You're asked to enter TWO words; one known; one not.

      From: recaptcha.net:
      But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
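
      A minimal sketch of the pairing described in that quote, in Python. The names `guesses` and `check_challenge`, and the case-insensitive comparison, are assumptions made for illustration and are not reCAPTCHA's actual code. The user passes or fails on the known word alone, and a guess for the unknown word is only logged when the known word was right.

```python
from collections import defaultdict

# Hypothetical store: guesses collected so far for each unknown word image.
guesses = defaultdict(list)

def check_challenge(control_answer, control_guess, unknown_id, unknown_guess):
    """Pass the user on the control word alone; log the other guess."""
    passed = control_guess.strip().lower() == control_answer.lower()
    if passed:
        # Only trust a reading of the unknown word from users who
        # solved the word whose answer is already known.
        guesses[unknown_id].append(unknown_guess.strip().lower())
    return passed
```

      The logged guesses would then be compared across many users before any of them is believed.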

    2. Re:WTF Summary by Anonymous Coward · · Score: 1, Interesting

      As a control, the system sends out one word that it knows the answer to. You don't know which of the two is the unknown word beforehand. Also, I think that the same unknown word is kept in rotation for a couple of iterations just to double-check that it was entered correctly.

      At least, that's how I'd implement it.

    3. Re:WTF Summary by vivaoporto · · Score: 1

      It is the wisdow of the crowds. There are two words, one is a normal mangled (and known beforehand) captcha, the other is one that the best OCR google got its hands on couldn't solve.

      People still have to solve the first one correctly, and if enough people give the same answer to the second one, it is considered correct.

    4. Re:WTF Summary by Anonymous Coward · · Score: 0

      And then, imagine what happens if they actually use the results for their online book contents...

    5. Re:WTF Summary by Anonymous Coward · · Score: 0

      It can be done... Ever had a captcha come back stating there was a problem and you need to try again (with a different piece of gobbledygook)? What if the problem was faked and it was an unknown piece of text being used? Record the guesses, and once you have a sufficient number of agreeing values, you might assume that the word has been solved and record it as such. Might be a bit sneaky, underhanded and very inefficient, but it might work.

    6. Re:WTF Summary by neoform · · Score: 1

      I'm fairly certain the scanners read the text, get a good idea of what it says, then ask several people to tell them what it says; as more people type the text in, they become more clear on what it says. I've used reCaptcha a number of times and find it to work pretty well. Though I have wondered the same thing you're wondering.

      --
      MABASPLOOM!
    7. Re:WTF Summary by corsec67 · · Score: 1

      ReCaptcha does that:
      One of the words is generated or known, and the other is the new word they are trying to scan. You have to give both to access the protected system, since you don't know which is the known word and which is the new word.

      http://en.wikipedia.org/wiki/ReCAPTCHA

      --
      If I have nothing to hide, don't search me
    8. Re:WTF Summary by narfman0 · · Score: 1

      They could have a list of possibilities, generated by computer or human. Then- They throw out the same word several times and aggregate the answers. Comparing elements in the aggregates they see how many people chose a particular word. When the probability that the word is wrong reaches near zero, they introduce it to the database. Don't know how they did it, that's just one of the ways they could have. Not a cure-all, but it helps with the scans I suppose.

    9. Re:WTF Summary by iammani · · Score: 1

      WTF Post?

      This is not just any captcha, but recaptcha. This captcha system will challenge you to recognize two words, one of which it understands and one it cannot understand. It assumes that, if sufficient people map the unrecognized word to the same set of letters (and also get the known word right), the image indeed maps to these letters.

      This is, indeed, a neat idea for OCR.

    10. Re:WTF Summary by iamhassi · · Score: 4, Insightful

      "Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. "

      That explains why half the time I can't even read the word. I swear every time I reach a captcha I have to refresh it 5x before I finally land on two words I can read.

      I must say this system is ingenious. Distributed OCR: let millions of internet users figure out what the words are. Maybe next election when there's hanging chads they can use that as a captcha.

      --
      my karma will be here long after I'm gone
    11. Re:WTF Summary by Useful+Wheat · · Score: 1

      The system works by having you validate 2 words. One of the words is a word that has already been verified to be correct, a known quantity. The other word is the unknown word. If you get the first one correct, it assumes you got the other one correct too. Error correction is done by having multiple people evaluate the same unknown word. If 3 people agree that the unknown word is "Bacon", the word is then taken to be bacon.

      Random people trying to mess up the system will not succeed. However, if you convinced everyone to simply enter "Bacon" we could have some amazing google book searches.
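
      The "3 people agree" rule could be sketched like this in Python. The function name `accepted_reading` and the default threshold of 3 are taken from the example above or invented for illustration; the real service's threshold is not stated anywhere in this thread.

```python
from collections import Counter

def accepted_reading(readings, needed=3):
    """Return the reading that reached `needed` matching votes, else None."""
    if not readings:
        return None
    # Compare case-insensitively so "Bacon" and "bacon" count as one answer.
    word, votes = Counter(r.lower() for r in readings).most_common(1)[0]
    return word if votes >= needed else None
```

      Scattered junk answers never reach the threshold, but a group that coordinates on one word would, which is the "everyone enters Bacon" case.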

    12. Re:WTF Summary by Sockatume · · Score: 4, Informative

      The best part is, it automatically selects for words which are invulnerable to OCR-based attacks. And if the user's presented with an illegible scanned CAPTCHA, they aren't penalised for getting it wrong.

      --
      No kidding!!! What do you say at this point?
    13. Re:WTF Summary by Anonymous Coward · · Score: 5, Funny

      "Hey everyone, let's all sit refreshing the google gmail account creation page, and always type "boobs" for the second captcha value..."

    14. Re:WTF Summary by hansamurai · · Score: 1

      I find reCaptcha highly readable; this isn't like other captcha techniques where there are really thin letters and random objects strewn about, it's just blurry, zoomed-in typewritten words that are hard for a computer to distinguish.

    15. Re:WTF Summary by Granis · · Score: 1

      That's really interesting. I've always wondered why I have passed these CAPTCHAs even when I had to make wild guesses on some of the words because they were so hard to read.

      However, how long will it be before a lot of users realize that it is irrelevant what you enter for the unknown word? Even if you don't know for sure which of the words is the unknown one, knowing the above I think the risk is high that you just type nonsense if you can't read one of the words.

      If enough people do this the system will be quite ineffective. reCAPTCHA will probably not accept the wrong solution very often, but it will take a lot of time to get enough users with the same solution to accept it. But with a massive amount of users, even a small amount of the total might be enough to keep it running?

    16. Re:WTF Summary by Kaetemi · · Score: 1

      Well, yeah, but the OCR attacker also just needs to get the OCR readable word right...

      --
      Kaetemi
    17. Re:WTF Summary by slyborg · · Score: 2, Interesting

      I still don't get it. How do you know that the person correctly identified the second word? I don't see how a priori decoding the first word means that the second was correct. I would expect that the individual bad data rate from this technique would be substantial.

      I do enjoy the fact that Google, a ridiculously profitable company by virtue of its near-monopoly on Internet search advertising, is using the public who pays it via these ad impressions to do its work for free, and using the technique invented and used by spammers to crowd-source solve CAPTCHAs to get into Gmail and the like!

    18. Re:WTF Summary by digitig · · Score: 2, Funny

      wisdow

      OCR error?

      --
      Quidnam Latine loqui modo coepi?
    19. Re:WTF Summary by mckinleyn · · Score: 1

      Because it presents the same words to many, many people. Yes, 10 people can all be wrong, but how likely is it that more than half of 100 people are all wrong in exactly the same way?

    20. Re:WTF Summary by Ziwcam · · Score: 1

      It's not necessarily the second word that's unknown.

    21. Re:WTF Summary by complete+loony · · Score: 1

      The first one is not a normal mangled word. It's another word that could not be OCR'd but has already been identified by the crowd.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    22. Re:WTF Summary by Chyeld · · Score: 2, Insightful

      You don't assume.

      For the purposes of captcha, typing one word correct suffices. As long as you get the right word (the known 'good' word) correct.

      For the purposes of distributed OCR, the "how do you know if the unknown word was ID'ed correctly" issue is simply solved by having the word ID'ed several times. Given you don't know which word is the 'test' word and which is the one actually needing IDing, there shouldn't be a problem with people guessing "Penis!" or "Boobies!" all the time.

      So as long as a majority of the people ID the word the same way, you can have a high level of confidence that it's being ID'ed correctly.

    23. Re:WTF Summary by bami · · Score: 1

      That is what happened with the Anonymous attack on the Time poll, with the 'penis' attack.

      They looked at both words, saw which one was the least readable, filled in the good one and filled in 'penis' for the other one, in the hopes of poisoning the database so that they would only have to enter the first word correctly.

      Would be kind of amusing to see a couple of books showing up on Google Books with the word 'penis' randomly inserted in pages where reCaptcha was used.

    24. Re:WTF Summary by Sockatume · · Score: 1

      That'd involve designing a pattern-recognition system which can reliably decide which of two OCR words is less readable, mind you.

      --
      No kidding!!! What do you say at this point?
    25. Re:WTF Summary by Rich0 · · Score: 0, Offtopic

      Maybe next election when there's hanging chads they can use that as a captcha.

      It would certainly be a lot more fair than the current process - which is a bunch of cronies each interpreting the results to their preferred candidate's advantage and then a judge settling it.

      Of course, the better solution is to not have such ambiguity in the first place.

      If you wanted to implement a system for interpreting analog votes here is what I'd do:

      1. All ambiguous votes are digitized. Of course, the definition of "ambiguous" is itself ambiguous - if somebody solidly fills in one circle and leaves one dot in another, is that ambiguous? What constitutes a stray mark vs a double-vote? I guess you could err on the side of caution, or maybe put all votes through the digitizer.

      2. The digitizer chops up each vote into individual boxes and then presents them to a user in random order. For example, if the Gore box is on the left on the ballot, it could be on the left or on the right in the presented ballot.

      3. The human interprets the vote. They have no cues to actually determine who the vote is for - just whether a given box was selected.

      4. Each vote is given to sufficient numbers of people that a high-confidence vote can be selected. If you get 3 people who agree then maybe that's enough. If you get any disagreements maybe you keep asking for opinions until one response has a significant margin. Maybe votes are tossed entirely at some threshold.

      The key is that those looking at ballots should not be able to tell which boxes correspond to which candidates. That will eliminate the bias from the system.
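
      Steps 2 and 3 above can be sketched in Python as follows. Everything here is hypothetical: the function names, the idea of passing box images in a dict keyed by candidate, and the candidate labels are assumptions made only to show the blinding step.

```python
import random

def present_boxes(ballot_boxes, rng):
    """Shuffle one ballot's boxes; return (images_to_show, hidden_order)."""
    order = list(ballot_boxes)                  # candidate labels, kept hidden
    rng.shuffle(order)
    shown = [ballot_boxes[c] for c in order]    # reviewer sees images only
    return shown, order

def record_answer(hidden_order, marked_flags):
    """Map the reviewer's per-position answers back to candidates."""
    return dict(zip(hidden_order, marked_flags))
```

      The reviewer only ever answers "is this box marked?" per position, and the mapping back to candidates happens where the reviewer cannot see it.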

      Again, in my opinion computers should generate human-readable ballots - so that the computer validates the ballot BEFORE the voter submits it. No issue with stray marks if there are no pencils in the room.

    26. Re:WTF Summary by b4dc0d3r · · Score: 1

      One KNOWN, one not. The known word is not necessarily going to be OCR readable... you can seed the database with 100 or so images which are known, but maybe not OCR readable. Of course it works better if the known words are NOT OCR readable.

      The point is OCR can have typos as well, so just because OCR returns a result doesn't mean it should be trusted. The known word of the two is likely independently analyzed, probably by a human.

      Once enough people put the same answer for an unknown word, it becomes trustworthy. That is not easy to hack by making repeated requests with your OCR tool (which does not get GOOD results, but does get CONSISTENT results, therefore the same answer each time) and putting incorrect answers in the database - one of the millions of human users will likely get one of the words being attacked, and respond differently. So you will have several different answers and no clear winner, leaving it an unknown word.

    27. Re:WTF Summary by Anonymous Coward · · Score: 0

      One thing they should do, is that once they have a substantial amount of words done for a book, set up a neural network and train it to the font that's known on the words, and allow it to translate the rest of the book into text and index it. It would save a lot of time, as people don't need to confirm each and every word of the book, only a fraction.

    28. Re:WTF Summary by Anonymous Coward · · Score: 3, Interesting

      Interesting you should say that.

      Unfortunately, it won't work - 4chan already ruined it for everyone.

      http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/

    29. Re:WTF Summary by melikamp · · Score: 0, Troll

      I must say this system is ingenious.

      I respectfully disagree. I hate CAPTCHA because it discriminates against AI. Instead, Web-based systems should be designed to accommodate AI participants. I hate reCAPTCHA even more because it is even more annoying and I have no idea who I am working for. I always intentionally smash the keyboard with my palm for the second word. I think that tricking people into working for you is by far the least decent way of distributing this process. It would be better to have an "OCR box" which has nothing to do with CAPTCHA and is known to be a part of a copyleft or public domain project, like Wikimedia. It should display, as others have suggested, single sentences or sentence fragments, so that the reader can use the context, and it should be completely unrelated to CAPTCHA, which is just a discriminatory practice, and, as such, unethical.

    30. Re:WTF Summary by koxkoxkox · · Score: 1

      I always intentionally smash the keyboard with my palm for the second word.

      Well, it doesn't have to be the first word known and the second word unknown, it could be the opposite, or random.

    31. Re:WTF Summary by Anonymous Coward · · Score: 0

      This was attempted and completed during Time Magazine's 2009 Time 100 Internet vote (with "penis" instead of "boobs")[1]. A representative from reCAPTCHA said actually poisoning the database would take a more concentrated effort and that they have other ways to prevent such poisoning. That particular vote only started implementing reCAPTCHA in the final stretch, which broke all the autovoters and required a brute force attack (which, as you can see[2], was successful).

      [1] http://musicmachinery.com/tag/4chan/
      [2] http://en.wikipedia.org/wiki/Time_100#Hacking

      Incidentally, when presented with a reCAPTCHA I only have to read one of the words.

    32. Re:WTF Summary by melikamp · · Score: 1

      If it is at random, one of the following will happen: I will either screw up the known word, in which case my OCR will not be trusted, or I will screw up the OCR word and get through. It should only take a few tries to get through, and there is no chance of helping with OCR.

    33. Re:WTF Summary by slyborg · · Score: 1

      Yeah, the multiple answers idea occurred to me later. I'm actually not talking about deliberate garbage answers, just people getting it wrong; if it is badly scanned, etc., you will get multiple answers for the unknown text, possibly not split 100:1, but maybe 2 answers that split 100:90 or something of that order - you still don't know which is more correct. Or maybe because of the nature of the image, the vast majority of people may actually converge on a wrong answer.

    34. Re:WTF Summary by mysidia · · Score: 1

      The 'known' word wasn't necessarily OCR readable. And their methods of OCR are probably not quite the same as the attacker's.

    35. Re:WTF Summary by joelpt · · Score: 1

      Building on the sibling replies, I'd also like to point out that for third-world human-powered captcha-entering sweatshops, there is no advantage to randomly guessing the second word versus just entering both words correctly. You'll end up having to enter the same amount of correct words per successful captcha attempt either way.

    36. Re:WTF Summary by Arancaytar · · Score: 1

      Yeah. I often get combinations like "WORD vjfkjsmxs" or worse, "WORD [illegible smudge]".

      I tend to simply put a dash for the smudge. They're not using that word to verify, after all, they just want to know what it says. So I tell them, "nothing". Likely, they'll get a lot of different results for it, and if the scoring algorithm is good it will eventually determine the word is illegible (or at least show it to a moderator of some kind).

    37. Re:WTF Summary by SnowZero · · Score: 1

      You keep running it until one answer dominates in a statistical sense. With the amount of data they are getting, it wouldn't be hard to construct a pretty accurate probabilistic model. If you never get a satisfactory probability for the most frequent answer, you could flag it for a developer to look at.
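
      One way to read "until one answer dominates" is a simple share-of-votes rule, sketched below in Python. The function name `resolve` and the thresholds `min_votes`, `min_share` and `max_votes` are assumptions chosen for illustration, and this is a plain majority-share check rather than a real probabilistic model.

```python
from collections import Counter

def resolve(readings, min_votes=10, min_share=0.75, max_votes=100):
    """Return ('accepted', word), ('pending', None) or ('flagged', None)."""
    total = len(readings)
    if total >= min_votes:
        word, votes = Counter(readings).most_common(1)[0]
        if votes / total >= min_share:
            return ("accepted", word)
    if total >= max_votes:
        # Plenty of answers and still no dominant one: hand it to a human.
        return ("flagged", None)
    return ("pending", None)
```

      A near-even split, such as 50 votes each for two readings, never reaches the share threshold and ends up flagged instead of accepted.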

    38. Re:WTF Summary by GravityStar · · Score: 1

      Suppose 50% of people filling in the CAPTCHA are malicious. They type in things like "penis", "B00BIES", "qwerty", "asdf", etc. 12,5% of people fail at deciphering the captcha completely. 12,5% of people fail, but succeed in providing near matches with one or two letters wrong. 25% of people succeed in deciphering the CAPTCHA.

      I'm just taking a guess at the percentages. But still, with a bit of analysis, it would become quite easy for reCAPTCHA to filter out the noise. The only way reCAPTCHA would fail at the analysis is if the malicious people organize with the explicit purpose of poisoning the reCAPTCHA results. While possible, I think this is unlikely unless reCAPTCHA starts say... sponsoring expeditions to kill baby seals.
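
      Those guessed percentages can be tried out on made-up data. In the Python sketch below, the 200 submissions, the target word "morning" and the specific wrong answers are all invented for illustration; only the 50 / 12.5 / 12.5 / 25 split comes from the comment.

```python
from collections import Counter

# 200 hypothetical submissions for one image whose true text is "morning".
malicious = ["penis", "b00bies", "qwerty", "asdf"] * 25      # 50%: 100 votes
garbage = [f"unreadable{i}" for i in range(25)]              # 12.5%: no two alike
near = ["mornlng"] * 9 + ["rnorning"] * 8 + ["moming"] * 8   # 12.5%: 25 votes
correct = ["morning"] * 50                                   # 25%: 50 votes

tally = Counter(malicious + garbage + near + correct)
winner, votes = tally.most_common(1)[0]
runner_up_votes = tally.most_common(2)[1][1]
```

      With the malicious half spread over four different junk words, the correct reading leads 50 votes to 25. If all 100 malicious users typed the same word, that word would win instead, which is the organized case mentioned above.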

  4. Why just words? by Thanshin · · Score: 3, Insightful

    I suppose most people write fast enough to allow sentence captchas already.

    1. Re:Why just words? by Canazza · · Score: 4, Insightful

      no they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer used one finger typing... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, can't even touch type.
      I know alot of people who still have to look at the keys when they type, and while it's generally faster than that bint, it's still painfully slow.
      Not to mention Children, when it comes to touch typing, kids can be fast learners, but before they get the hang of it, they can be very slow too.

      --
      It pays to be obvious, especially if you have a reputation for being subtle.
    2. Re:Why just words? by Anonymous Coward · · Score: 0

      alot of people who still have to look at the keys when they type, and while it's generally faster than that bint, it's still painfully slow.

      maybe you don't know many programmers? (like many other developers) I have to look at my hands (not constantly, but at least a glance every 3rd word) to type. But when my job was solely programming I was well into the 50 wpm range (ie faster than I speak.) Simply I don't have to look at anything else in the middle of development, but was typing constantly so learned speed. I doubt my watching has much effect other than a mental requirement... But I can't enter data/transpose from paper for crap (fingers seem to lose confidence without constant feedback.)

    3. Re:Why just words? by crazyjimmy · · Score: 1

      Not to mention Children, when it comes to touch typing, kids can be fast learners, but before they get the hang of it, they can be very slow too.

      Don't hate on the children. Most keyboards are way too big for the li'l ones anyways. We should be getting them netbooks... and maybe cellphone keyboards. They could probably type great on those, with their tiny little fingers.

      Lord knows, I can't do it. :)

      --Jimmy

    4. Re:Why just words? by British · · Score: 1

      I admit, I'm great with a standard QWERTY keyboard, but when it comes to remote controls for cable boxes/vcrs, etc, I slow down to a crawl. Perhaps it's just what you are used to. I almost never look at my keyboard(maybe for typing in tough passwords), but for my VCR remote control(infrequently used), it's a bit more difficult.

    5. Re:Why just words? by Anonymous Coward · · Score: 0

      Why not whole articles? And not cosmopolitan ones, I mean technical articles! People may even learn something out of captchas. ;) Well worth the effort!

    6. Re:Why just words? by BetterSense · · Score: 1

      I can touch-type Dvorak at 80+wpm. I'm reduced to hunt-and-peck mode with Qwerty, however. Which proves the superiority of Dvorak of course.

    7. Re:Why just words? by Abcd1234 · · Score: 1

      80 wpm? Isn't dvorak supposed to be faster or something? ;)

    8. Re:Why just words? by Abcd1234 · · Score: 1

      (like many other developers) I have to look at my hands (not constantly, but at least a glance every 3rd word) to type.

      "like many other developers"??? Jebus, I hope not. I've never met a single developer who can't touch type. And in the company I work for, the average is in the 60-70 wpm range (and I'm definitely on the higher end, averaging about 120 wpm).

      As for the looking at the keyboard, TBH, I'd just find that annoying... when I'm in the "flow", I prefer to keep my eyes on the screen... having to pause periodically to look down at the keyboard would drive me *batty*.

  5. Stupid by Anonymous Coward · · Score: 0

    Why didn't they just spend the money on improving their character recognition AI? Ultimately, they will end up having an AI that defeats the purpose of this company anyways...

  6. Great by Anonymous Coward · · Score: 0

    Here's to the prospect, for those of us who don't permit random web sites to run code on our computers, of yet more javascript dependent captchas to manually hack through.

    In related (and more important) news mozilla at last have a working 64-bit JIT for tracemonkey.

  7. Is that a finger cot? by AmigaHeretic · · Score: 1

    Check out this Google book.... about the 7th page down.

    http://www.google.com/books?id=Y0OOlnDFUM8C&printsec=frontcover&dq=Le+Morte+d'Arthur&as_brr=1#v=onepage&q=&f=false

    I thought these were scanned in by robots? If so it looks like it has well kept fingernails.

    1. Re:Is that a finger cot? by KDR_11k · · Score: 1

      Presumably the robot wasn't the only one ever to handle that book.

      --
      Justice is the sheep getting arrested while an impartial judge declares the vote void.
    2. Re:Is that a finger cot? by Anonymous Coward · · Score: 1, Funny

      Presumably the robot wasn't the only one ever to handle that book.

      Maybe not. But I know that when I'm done handling a book I usually don't leave my hands there with it.

    3. Re:Is that a finger cot? by quercus.aeternam · · Score: 1

      Humans - the new replacement for robots.

      Why drop half a million dollars on a machine when you can pay someone 25k a year to do the same job!

      But really, they probably do have robots that do some of the work - but to my (very limited) knowledge, even the best are somewhat destructive.

    4. Re:Is that a finger cot? by Jared555 · · Score: 1

      They probably also have some that were manually scanned, or there are probably cases where pages stick together and require human intervention. If the robot scans a book and then later it is discovered a page didn't get scanned they probably are going to manually scan it.

    5. Re:Is that a finger cot? by Anonymous Coward · · Score: 0

      Auto-book scanners, even good "nondestructive" ones, can be hard on books. If the book is old/rare/fragile, it's best to have a human flipping the pages.

  8. Good idea, but how? by Nesa2 · · Score: 1, Interesting

    ReCAPTCHA is a free service that usually integrates into forums, bLogs, and other such anonymous comment-posting services to help eliminate bot spamming. I think they will not use it on Google search pages, but exploit ReCAPTCHA users of all of those sites that do use it already. Sounds to me like a really good idea...

    I'm interested though how they are going to know what a correct entry by a user would be for a scanned word in order to validate it if they only have a scan...

    1. Re:Good idea, but how? by Anonymous Coward · · Score: 0

      I'm interested though how they are going to know what a correct entry by a user would be for a scanned word in order to validate it if they only have a scan...

      1. The CAPTCHA test to submit the form is still based on a known word.
      2. The unknown word is shown to multiple users, so even if some percentage get it wrong, eventually the system will have a majority opinion on the correct value.
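
      The two steps above can be sketched in Python. This is only a minimal illustration, not reCAPTCHA's actual code: the function name `handle_submission`, the in-memory `votes` store, and the case-insensitive comparison are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: unknown-word image id -> tally of guesses.
votes = defaultdict(Counter)

def handle_submission(control_answer, control_guess, unknown_id, unknown_guess):
    """Return True if the form passes; record the unknown-word guess only then."""
    # Step 1: the pass/fail decision depends only on the known (control) word.
    passed = control_guess.strip().lower() == control_answer.strip().lower()
    if passed:
        # Step 2: a user who got the control word right contributes
        # one vote toward the value of the unknown word.
        votes[unknown_id][unknown_guess.strip().lower()] += 1
    return passed
```

      A guess for the unknown word never affects whether the form is accepted; it is only tallied for later comparison with other users' guesses.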

    2. Re:Good idea, but how? by city · · Score: 1

      There is a really good talk by the reCAPTCHA founder, Von Ahn, describing their method for validating a word and how they are using it to digitize old NYT articles. I think it's this one: http://www.youtube.com/v/Aszl5avDtekhl=en&%23038;fs=1&%23038;rel=0

      --
      I am a v1ral sig. Plse c0py me and h3lp me spread. Thank y0u?
    3. Re:Good idea, but how? by koick · · Score: 1

      Your linky no worky. This one does: http://www.youtube.com/watch?v=3PuZ55kyf7E (interview on Wired)

    4. Re:Good idea, but how? by Anonymous Coward · · Score: 0

      I'm interested though how they are going to know what a correct entry by a user would be for a scanned word in order to validate it if they only have a scan...

      That was my first thought. In order for a captcha to be effective, must you not already know what it contains?

    5. Re:Good idea, but how? by Anonymous Coward · · Score: 0

      They display two words. A known word to confirm the captcha and an unknown word to identify.

      If you get the captcha word right then the unknown word entry is put forward as a possible solution.

      This is then crowd sourced, so once enough people suggest the same solution to an unknown word, we have a winner.
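
      The "enough people suggest the same solution" rule might look roughly like the sketch below. The function name `consensus` and the default threshold of three matching answers are assumptions; the real threshold is not stated here.

```python
from collections import Counter

def consensus(guesses, min_agree=3):
    """Return the winning transcription, or None if no guess has enough support."""
    if not guesses:
        return None
    # most_common(1) yields the single most frequent guess and its count.
    word, count = Counter(guesses).most_common(1)[0]
    return word if count >= min_agree else None
```

      Until some answer reaches the threshold, the word simply stays in the unknown pile and keeps being shown to new users.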

    6. Re:Good idea, but how? by Anonymous Coward · · Score: 0

      i saw a talk on recaptcha...
      they pair the scanned word with a regular captcha, and issue the same scanned word to multiple people.
      if a person gets the regular captcha correct, then there is some confidence that they also got the scanned word correct.
      if multiple do this, then confidence is further increased.

    7. Re:Good idea, but how? by Akral · · Score: 1

      Simple.
      They present two words - one is computer generated and is, in fact, the real CAPTCHA test. The other is a word from a book that OCR failed to read. People fill in both words because they don't know which is which. They show the same failed OCR word to a hundred people and get a stable result by majority of people, even if somebody tries to abuse the system and write some bad words instead.

      --
      Don't worry, be happy!
    8. Re:Good idea, but how? by Anonymous Coward · · Score: 0

      I'm interested though how they are going to know what a correct entry by a user would be for a scanned word in order to validate it if they only have a scan...

      Majority rules!

      Google releases a new recaptcha that has not been deciphered automatically by OCR software. The recording of the word will actually be the sound of a dog barfing in a swimming pool.

      On Google's scale, this new word image might be seen by 1000 or so users in any given second. Of the 1000 selected group of users currently seeing the new recaptcha, 900 of them reply "GOOGLE", while 50 reply "GOGGLE", 31 reply "OOGGLE" and 19 reply "PISSOFF".

      Result: scanned image must be "GOOGLE", even if it was in fact "GOGGLE". This would then be used as the correct answer until proven otherwise.

      I can't wait to read the newly re-written books!!!

  9. Er... no. Read the reCAPTCHA info by djkitsch · · Score: 1

    The interface uses two words: one which is verified and one which isn't. Assuming the first one is typed in correctly, they present the second to a bunch of people until they get a consensus (three the same, I think) and then it goes in the "verified" pile. Thus, even if the second word's not verified yet, a spammer will still get caught out by the other one.

    --
    sig:- (wit >= sarcasm)
    1. Re:Er... no. Read the reCAPTCHA info by Tony+Hoyle · · Score: 1

      So if enough people type ' penis' as the result, eventually 3 people will identify the captcha as 'penis' and it gets in the list of known words.

  10. I'm real giddy about this by Kokuyo · · Score: 1, Interesting

    Just wait until some soccer mom needs to protect her genius of a brat from all the bad things there are. Latest crusade? A 'bad' word in a CAPTCHA. Just you wait, it will happen.

  11. I hope they have a couple of tests! by NoYob · · Score: 4, Funny

    As I get older, I find that I'm having a harder time reading from computer monitors and especially captchas. I confuse words all the time. For acample: erection with election. Not so bad, but if Google doesn't pass that unknown to multiple folks, it could get embarrassing. Text from a Bill Clinton bio:

    After Bill Clinton's first erection as President, he proceeded .....

    --
    It's NOT me! It's the meds! I'm on 1000mg of Fukitol.
    1. Re:I hope they have a couple of tests! by Anonymous Coward · · Score: 0

      "After Bill Clinton's first erection as President, he proceeded ....."

      I don't see any typos or errors in that sentence.

    2. Re:I hope they have a couple of tests! by HipToday · · Score: 1

      Or acample with example.

    3. Re:I hope they have a couple of tests! by ElSupreme · · Score: 1

      I find that reCAPTCHA is MUCH easier to decipher than standard ones. I mean, I have tens of years of experience deciphering text on the curve of a book, with cheap printing, versus the ones made hard to read on purpose.

      But a few of the reCAPTCHAs are just misprinted and would require someone to read the sentence to figure out what should go there.

      --
      My addiction: Arguing with idiots. AKA Slashdot!
    4. Re:I hope they have a couple of tests! by Anonymous Coward · · Score: 0

      > I confuse words all the time. For acample: erection with election.

      In that case I thoroughly recommend visiting www.sensibleelection.com

      (Captcha: Nothing particularly relevant)

    5. Re:I hope they have a couple of tests! by natehoy · · Score: 1

      Most CAPTCHA solutions have at least two ways you can solve them. Some offer an audio version of the words that is only slightly garbled (enough to defeat voice recognition) that you can listen to in addition to or instead of the CAPTCHA word, and some allow you to solve some simple word problem instead of CAPTCHA if your hearing AND eyesight are both bad.

      As far as the Clinton example, funny, but in reality people are going to be looking at one word at a time. The Clinton bio example would be frequently made (humorously or maliciously) due to context. But if the word "election" was put on a CAPTCHA, most people would interpret it correctly. A few might get funny and try "erection" just to see if it's the "non significant" word, but I doubt that would be EVERYONE. If you checked the word against a dozen people, you'd have to have at least (at a guess) 10 of them with the exact same sense of humor to get the word automatically accepted as "erection" and not "election".

      I don't know Google's algorithm for re-checking words, but the article clearly says they'll be doing some rechecking for reliability by having a number of different randomly-chosen people interpret the same word. I imagine that words where the answers are all identical might get 4-5 checks, while words that prove less consistent will get checked at least a dozen times or so, and those that continue being unreliable would probably get an authoritative check.

      If, say, 4 people chose "erection" and the remaining 8 chose "election", the word would probably be flagged as "unreliable" by the automated CAPTCHA system and reviewed by a Google employee in proper context for final verification. Then the word would be corrected. Exactly which of the two words is chosen would probably depend on the political affiliation of the Google employee. :)
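
      The escalation guessed at above could be sketched as follows. To be clear, this mirrors only the speculation in this comment, not Google's actual algorithm: the function name `review_status`, the 5-check quick round, the 12-check full round, and the 10-of-12 acceptance ratio are all assumed values.

```python
from collections import Counter

def review_status(guesses, quick_checks=5, max_checks=12,
                  accept_num=10, accept_den=12):
    """Return (status, word): 'accepted', 'needs_more_checks', or 'human_review'."""
    n = len(guesses)
    if n < quick_checks:
        return "needs_more_checks", None
    word, count = Counter(guesses).most_common(1)[0]
    # Unanimous answers are accepted after the short first round.
    if count == n:
        return "accepted", word
    # Inconsistent answers keep collecting guesses up to the full round.
    if n < max_checks:
        return "needs_more_checks", None
    # After the full round, accept only a strong majority (integer math
    # avoids floating-point comparison); otherwise escalate to a person.
    if count * accept_den >= accept_num * n:
        return "accepted", word
    return "human_review", None
```

      With 8 votes for "election" and 4 for "erection", this sketch returns "human_review", matching the scenario described above.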

      --
      "This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
    6. Re:I hope they have a couple of tests! by Hurricane78 · · Score: 1

      Protip: Ctrl-+

      Seriously. Or change the freakin' resolution of your display.

      There, was it that hard? ^^

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    7. Re:I hope they have a couple of tests! by BForrester · · Score: 1

      CTRL-+ just makes "erection" bigger. Since you ask: yes, it's hard.

  12. maybe they should use CAPTCHAs... by Anonymous Coward · · Score: 0

    to allow people to send emails to "higher class of service" mailboxes. Hey, I should patent that idea before Nathan the ex-Microsoftie gets to it.

    1. Re:maybe they should use CAPTCHAs... by Rik+Sweeney · · Score: 3, Interesting

      Funny you should say that

      http://mailhide.recaptcha.net/

  13. Won't this eventually defeat the purpose? by natehoy · · Score: 3, Interesting

    Google is doing this in order to prevent spam and to improve OCR. But once OCR is improved to the point where it can read poorer scans, won't spammers be able to use that new technology to eventually defeat CAPTCHA?

    Don't get me wrong, I think this is a marvelous idea, potentially using volunteer labor of humans as OCR to interpret a book one poorly-scanned word at a time. But it does seem to have the side effect of eventually destroying the original purpose of what they bought. Maybe CAPTCHA is worth more as a "crowdsourced OCR solution" than it ever was as spam prevention anyway...

    --
    "This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
    1. Re:Won't this eventually defeat the purpose? by CSMatt · · Score: 1

      CAPTCHAs can be defeated right now by using mechanical turk or social engineering to get humans to solve the CAPTCHAs for the spammers.

    2. Re:Won't this eventually defeat the purpose? by slim · · Score: 5, Insightful

      What you get in the captcha is the scanned word, plus some warping and obfuscation. Therefore if OCR advances to the point where it has no trouble with the original scan, it would still have trouble with the captcha.

      Spammers already have a neat way around captchas -- they proxy them to people on porn and warez sites. If you ever fill in a captcha on such a site, you're probably helping a spambot out.

    3. Re:Won't this eventually defeat the purpose? by funfail · · Score: 1

      CAPTCHAs can also be defeated with a system like reCAPTCHA.

    4. Re:Won't this eventually defeat the purpose? by Anonymous Coward · · Score: 1, Interesting

      If spammers figure out how to defeat reCAPTCHA, Google will probably hire them to automatically digitise books; that probably pays a lot better than spamming. You can think of it as trying to set all the ingenuity of the world's spammers working at the same problem...

    5. Re:Won't this eventually defeat the purpose? by delete2kill · · Score: 0

      CAPTCHA solving is a lucrative industry: $5-8 per 1000, out in the Far East, sometimes solicited over Craigslist... kind of like gold farming. With enough solved CAPTCHAs you could create a program to defeat an algorithm, but of course reCAPTCHA is different.

    6. Re:Won't this eventually defeat the purpose? by jacktherobot · · Score: 1

      In addition to just showing a scanned word, the captcha image is contorted and corrupted. This makes captchas much, much harder to solve compared to standard OCR problems. Improving and perfecting OCR is unlikely to have as much of an adverse impact on captchas as spammers hiring poor folks to solve them.

    7. Re:Won't this eventually defeat the purpose? by maxume · · Score: 1

      All you have to do is add a level of indirection. Take the reCAPTCHA images and present them to users of your rereCAPTCHA system, and then use the results to solve the reCAPTCHA tests.

      I suppose keeping up with the turnover of the reCAPTCHA might be an issue, but if the problem were valuable enough to solve...

      --
      Nerd rage is the funniest rage.
    8. Re:Won't this eventually defeat the purpose? by Hurricane78 · · Score: 2, Insightful

      No it's not warped and obfuscated. ReCaptcha gives you the word as-is.

      GP is using faulty logic (circular reasoning I think).

      If ReCaptcha improves OCR algorithms, then not only spammers will have access to them, but so does the effort behind ReCaptcha.
      So the now scannable words would be scanned and never turn up there. ReCaptcha would just present you with those words that would still not be scannable by any OCR.

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    9. Re:Won't this eventually defeat the purpose? by Anonymous Coward · · Score: 0

      The idea is that there is no way to lose. Either you have an effective mechanism to fight spam, or you have a better method for scanning books. In both cases it is a win for the society.

    10. Re:Won't this eventually defeat the purpose? by koick · · Score: 2, Informative

      In this interview on Wired, Luis von Ahn explains that they do indeed warp it: http://www.youtube.com/watch?v=3PuZ55kyf7E

    11. Re:Won't this eventually defeat the purpose? by Hays · · Score: 2, Insightful

      The text is warped and obfuscated. Look at example captchas -- do you really think the geometric swirls were in the source documents?

    12. Re:Won't this eventually defeat the purpose? by Anonymous Coward · · Score: 0

      Google, as usual on their quest to become the New Microsoft, continues to miss the point, which is that CAPTCHA has been defeated and discredited as an anti-abuse tool. It's no accident that Gmail, with its tiny user base compared to its rivals, is the main source of abuse I see these days. Google desperately clings to the CAPTCHA long after the bad guys have defeated it both programmatically and BY USING HUMANS.

      Google needs to get its head out of a certain other body part.

    13. Re:Won't this eventually defeat the purpose? by ChaosDiscord · · Score: 2, Informative

      No it's not warped and obfuscated. ReCaptcha gives you the word as-is.

      Go here. Bounce on the reload button a few times to see some example reCAPTCHA. Tell me with a straight face that they're not warped. Perhaps they're scanning books printed on silly putty? As for obfuscated see the example here. They used to slap a line across each word. They don't appear to be doing so any more, but they used to.

    14. Re:Won't this eventually defeat the purpose? by Anonymous Coward · · Score: 0

      If spammers could develop software designed to solve these captchas, it would kind of make the entire thing unnecessary, since the words needed for the captchas could just be OCR'd successfully in the first place.

      Of course then ReCaptcha would still have been successful. Just instead of crowdsourcing people to OCR words in books, they would have successfully crowd sourced people into developing better OCR technology.

      The main problem is if spammers make something that is correct 50% of the time, it could defeat the artificial 'check' ReCaptcha while giving false information to the legitimate scanned ReCaptcha. In some cases I have had a fairly good idea which was the real word they wanted because it wasn't quite decipherable. I think we will still end up with some false words where the original scan is just too hard to read but looks close enough to something that most people make the same mistake.

    15. Re:Won't this eventually defeat the purpose? by Anonymous Coward · · Score: 0

      http://recaptcha.net/security.html
      Actually they do obfuscate the word more. See step 3.

    16. Re:Won't this eventually defeat the purpose? by DNS-and-BIND · · Score: 1

      Spammers already have a way around captchas - getting Indians to solve them. I turned the flow of spam off my website for about a month by installing a captcha for registration. Then, I get a few enterprising young businessmen from India solving the captchas and spamming the comments by hand. You can't win.

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
    17. Re:Won't this eventually defeat the purpose? by totallymeat · · Score: 1

      But once OCR is improved to the point where it can read poorer scans, won't spammers be able to use that new technology to eventually defeat CAPTCHA?

      If you look at CAPTCHAs as a method for improving artificial intelligence, then the issue you're raising gets turned into a benefit of ever-improving Turing tests.

      1. Find a problem that AI cannot readily solve, yet humans can (obfuscated word recognition).
      2. Develop a CAPTCHA out of the problem (reCAPTCHA).
      3. Either the problem remains unsolved and the CAPTCHA must be solved by humans...
      4. ... or we develop stronger AI capable of solving the original problem.

      In either outcome, we get something useful, either better Turing tests or more robust AI. The hardest part of this loop is developing novel CAPTCHAs, but at least the cycle results in useful outcomes every time.

    18. Re:Won't this eventually defeat the purpose? by natehoy · · Score: 1

      Excellent point.

      Someone please mod parent insightful. Thanks! :)

      --
      "This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
  14. Mod up by Anne+Honime · · Score: 1

    I totally agree, this is pure genius. Distributed Human-engined OCR is certainly the best solution to traditional OCR problems, and at the same time it leaves many doors to unforeseen traps ajar.

    1. Re:Mod up by mrcaseyj · · Score: 5, Interesting

      I agree that the idea is ingenious. But on the only one I ran into, the word was completely indecipherable. I don't mean that it was really hard, I mean that it was a word so thoroughly mangled that it was clearly impossible to read by anyone, especially without context. The lack of context is one of the big weaknesses of the system. When a word is unclear, it's the words around it that give critical clues to what it is.

    2. Re:Mod up by Chabil+Ha' · · Score: 2, Insightful

      Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance at getting it right, whilst increasing the chance of a spam bot of getting it wrong.

      --
      We're all hypocrites. We all have hidden parts, it's the contrast between them that make us more a hypocrite than others
    3. Re:Mod up by Anonymous Coward · · Score: 2, Funny

      Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance at getting it right, whilst increasing the chance of a spam bot of getting it wrong.

      ...and increasing the rate of people saying "F- it, the captcha should not be longer than my comment." - hence the limit of two words to allow for "me too!" comments.

    4. Re:Mod up by Kozz · · Score: 2, Funny

      Which gives rise to the question...

      Don't you mean, "Which begs the question..."?!

      (ducks)

      --
      I only post comments when someone on the internet is wrong.
    5. Re:Mod up by hipifreq · · Score: 1

      But if the people behind reCaptcha are really doing this well, then they remember the words that you refreshed. If a word gets refreshed enough then a real human can go to the real book and figure out the meaning of the word.
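
      A rough sketch of that idea, assuming a simple per-word refresh counter. Nothing here is confirmed reCAPTCHA behaviour: the `refresh_counts` store, the `record_refresh` name, and the threshold of five are hypothetical.

```python
from collections import Counter

# Hypothetical store: word-image id -> number of times users skipped it.
refresh_counts = Counter()

def record_refresh(word_id, threshold=5):
    """Count a skipped word; return True once it should go to a human reviewer."""
    refresh_counts[word_id] += 1
    return refresh_counts[word_id] >= threshold
```

      A word that keeps getting refreshed is probably unreadable out of context, so flagging it lets someone check it against the original page.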

    6. Re:Mod up by Tim+C · · Score: 1

      Because having to read and enter a single, hard to read word is enough hassle for most people; two is stretching it. An entire sentence would be too much.

    7. Re:Mod up by ChienAndalu · · Score: 1

      True. I think they could however highlight one or two words and ask the user to enter the highlighted words

    8. Re:Mod up by selven · · Score: 2, Funny

      hence the limit of two words to allow for "me too!" comments.

      lol

  15. reCAPTCHA is awesome by Thaelon · · Score: 5, Funny

    I have to say, reCAPTCHA is one of the most elegant solutions I've ever seen to a problem.

    It's not even killing two birds with one stone, it's killing two birds with one of the birds.

    --

    Question everything

    1. Re:reCAPTCHA is awesome by pHus10n · · Score: 1

      Your analogy made me lol IRL.

    2. Re:reCAPTCHA is awesome by Sockatume · · Score: 1

      I've already posted so I can't mod you up, but that might be the greatest analogy I've ever heard. I'm already thinking up applications for it.

      --
      No kidding!!! What do you say at this point?
  16. Psst, scanning books is just one goal by melted · · Score: 1

    The other is to track how users browse the web, for ad targeting. All they need to do is put a cookie in your browser and read it next time you see a captcha or load a Google analytics script.

    1. Re:Psst, scanning books is just one goal by Anonymous Coward · · Score: 0

      To be honest, this was my first thought as well. I use reCaptcha on my sites, but I reject google analytics because I don't want to help google gather data on my users. It is really frustrating because reCaptcha is a great tool that I was happy to take advantage of. I might have to re-evaluate that decision now.

  17. Tinfoil hat by Anonymous Coward · · Score: 0

    What is stopping them from including their analytics code (or else something that scrapes behaviour of a user over different websites) behind the scenes?

    A corporate motto?

  18. Evil? by AP31R0N · · Score: 1

    Have you paranoiacs figured out how Google is going to use this to spy on you or otherwise do evil?

    --
    Utilizing the synergization of benchmark e-solutions to pre-workaround action items!
  19. Familiar Creature by TheMeuge · · Score: 1

    no they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer used one finger typing... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, can't even touch type.

    I don't know about London, but in the U.S., the 1-2 finger typing is usually accomplished by a community college dropout, whose fingernail extensions are about 2 inches long, and who types either by carefully and slowly pressing one key at a time with the nail extension, or with the second knuckle of her middle finger. She will also scream: "Can I help you" with enough contempt to burn your eyebrows off. When you get to the counter, she will look you over with as much spite as humanly possible, then get her Sidekick out and text someone for a couple of minutes. And god help you if you are still with her (inevitably) when 12pm or 1pm comes about. She will get up and leave for lunch (or unroll her food), whether you're waiting or not. Actually, she'd prefer you to wait there.

    She is a ubiquitous inhabitant of government offices of all sorts, as well as front desks in companies that don't respect themselves. She will need the supervisor/manager to resolve any issue that goes beyond typing your name (incorrectly), but she will march on city hall with the rest of her co-workers if they don't get another 5% raise in the middle of the recession.

    1. Re:Familiar Creature by Anonymous Coward · · Score: 0

      Stereotype much?

      It's a question of needs, training and habit. My grandfather ran a small newspaper in the 1920s-1950s, but had never learned to touch type properly - he only used two fingers on each hand. Apparently he could routinely type well over 70 wpm on a manual typewriter, without looking at the keyboard, or what he had typed.

      There are many ways to skin a cat.

    2. Re:Familiar Creature by Anonymous Coward · · Score: 0

      in the U.S., the 1-2 finger typing is usually accomplished by a community college dropout

      You put most college grads above this? Seriously?

      You must be a female English Lit BA that now is looking for some job that has to do with your degree, but yet pays more than $10 / hour. Hell, you might even be male, and considering this is ./, it is very likely. Here, I'll misspeel something to give you validation in life.

  20. The Machine by Anonymous Coward · · Score: 0

    I once caught my dad doing something similar via one of those "make money on the internet" sites. I told him that he was most likely assisting a programmer to design "character-by-shape"-recognition software....that he was in essence making the machine smarter.

    -Oz

  21. Marble cake, also, the game by dazjorz · · Score: 1

    This was actually done by the guys at 4chan /b/: http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/

  22. Waiiiiit.... by WWWWolf · · Score: 1

    I thought I had some hazy recollection that reCAPTCHA was being used for some open projects, like helping to OCR out-of-copyright works...

    ...so now it is being used to fuel Google's massive, still-very-much-copyrighted, proprietary book scanning effort?

    So how's this going to benefit people? I'm, of course, assuming the details are spotty at the moment and I'm terribly interested to hear more details from Google's official "do no evil" department on how they intend to contribute to the world.

  23. Beloved != 8cloved by EdgeyEdgey · · Score: 1

    I just got a correct response from a clearly incorrect answer.
    The image was of Beloved but being difficult I answered 8cloved and got accepted.
    It did the job of proving that I wasn't a bot, but if there are enough difficult people (like me) out there then we could really screw Google over.

    --
    [Intentionally left blank]
    1. Re:Beloved != 8cloved by /dev/trash · · Score: 1

      read up on the implementation to see why you are wrong.

  24. They only bought it to bash the Internet Archive by Anonymous Coward · · Score: 0

    Brewster Kahle, aka the Internet Archive, is the beneficiary of reCaptcha's work. Convenient way to knock off someone who wants to release for free, what you're hoping to make money off of.
    Presumably there's something in the legal language to guarantee that the Internet Archive will continue to benefit from reCaptcha, but I'm afraid I see this as nothing more than a "slapping back" attempt by Google.
    As a good, truly evil company should do...