Slashdot Mirror


Google Buys reCAPTCHA For Better Book Scanning

TimmyC writes "This story may interest the Slashdot folk, many of whom use the reCAPTCHA anti-spam service. Well, reCAPTCHA is now owned by Google. Apparently, what attracted Google to ReCAPTCHA is that the company has linked its core authentication service with efforts to digitize print books and periodicals. The search giant has a massive (and controversial) effort underway in that area for its Google Books and Google News Archive services. Every time people solve a CAPTCHA from the company, they are also, as a byproduct, helping to turn scanned words into plain text that can be indexed and made searchable by search engines. Interesting times indeed."

103 of 138 comments

  1. Imagine! by rallymatte · · Score: 1

    How slow is searching the internet going to be if you have to fill out a stupid obscured word each time?!

    1. Re:Imagine! by Anonymous Coward · · Score: 1, Insightful

      As slow as searching most forums

    2. Re:Imagine! by natehoy · · Score: 2, Insightful

      Google's probably not going to add this to their default search engine. They've already got a good audience using this where it's appropriate - to keep spambots from joining or posting to forums or in other contexts where you want to determine if your web client is human or bot.

      Google SEARCH exists and is popular because it's fast and convenient. I can't see them adding a 2-word CAPTCHA to do a simple search only because that would drive search traffic (which is already very profitable) to their competition.

      Google is very, very clever at designing mutually beneficial arrangements. They craft all of their products so the user is receiving some significant benefit in return for the information or work they provide to Google. reCAPTCHA only provides a benefit when users see a forum is pretty clean from spam and crap because CAPTCHA is there, so they'll go to the effort of joining those forums. Forum master and user both see a tangible benefit - reduced spam - and will happily compensate google with 5 seconds' work.

      --
      "This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
    3. Re:Imagine! by mysidia · · Score: 1

      I could see Google using it to protect their account signup/login/new service signup processes, not their search function.

      What's more valuable is other people using reCaptcha technology. Google can now benefit from their use, by using the service to assist their book scanning/OCR efforts.

    4. Re:Imagine! by SnowZero · · Score: 2, Insightful

      So, a project is trying to digitize historical books, newspapers, and documents, preserving them in a form that would allow our history to be kept near-losslessly for the first time since humans started writing -- and you are trying to purposely pollute their data. Okay then...

    5. Re:Imagine! by agnosticnixie · · Score: 1

      They'd probably not do it because they'd be bound to be sued for it.

    6. Re:Imagine! by mysidia · · Score: 1

      Not likely. They already use Captcha technology to protect their signup pages. It's just a matter of replacing their in-house custom built implementation with the reCaptcha one.

    7. Re:Imagine! by agnosticnixie · · Score: 1

      Any large-scale implementation of CAPTCHA is in breach of section 501 (which applies to Google as they're a US company) due to its inaccessibility (W3C report for the interested). This kind of lawsuit has already happened in the past when the company had an important enough presence and a problematic enough implementation (and no, having sound samples is not unproblematic) to add a significant barrier to access to internet utilities.

      We'll just have to see how far and wide they go with it.

    8. Re:Imagine! by yodleboy · · Score: 1

      That's interesting, because I've seen many sites that use CAPTCHA and also have a plain button or link next to it that says "listen to this word". When clicked, the words in the CAPTCHA are played for you. So, unless you are blind AND deaf, in which case the internet in its entirety is "inaccessible", then you should be able to complete the process. Sounds like compliance to me, and I doubt Google would leave this functionality out.

    9. Re:Imagine! by agnosticnixie · · Score: 1

      So, unless you are blind AND deaf

      Not only

      in which case the internet in its entirety is "inaccessible"

      Wrong, I know it's /., but talking shit about stuff you know jack about is not a good argument.

  2. Well... by vikhyat · · Score: 4, Interesting

    This should improve Google's indecipherable CAPTCHA.

  3. Why just words? by Thanshin · · Score: 3, Insightful

    I suppose most people write fast enough to allow sentence captchas already.

    1. Re:Why just words? by Canazza · · Score: 4, Insightful

      No, they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer used one finger typing... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, and she can't even touch type.
      I know a lot of people who still have to look at the keys when they type, and while it's generally faster than that bint, it's still painfully slow.
      Not to mention children: when it comes to touch typing, kids can be fast learners, but before they get the hang of it, they can be very slow too.

      --
      It pays to be obvious, especially if you have a reputation for being subtle.
    2. Re:Why just words? by crazyjimmy · · Score: 1

      Not to mention children: when it comes to touch typing, kids can be fast learners, but before they get the hang of it, they can be very slow too.

      Don't hate on the children. Most keyboards are way too big for the li'l ones anyways. We should be getting them netbooks... and maybe cellphone keyboards. They could probably type great on those, with their tiny little fingers.

      Lord knows, I can't do it. :)

      --Jimmy

    3. Re:Why just words? by British · · Score: 1

      I admit, I'm great with a standard QWERTY keyboard, but when it comes to remote controls for cable boxes, VCRs, etc., I slow down to a crawl. Perhaps it's just what you are used to. I almost never look at my keyboard (maybe for typing in tough passwords), but for my VCR remote control (infrequently used), it's a bit more difficult.

    4. Re:Why just words? by BetterSense · · Score: 1

      I can touch-type Dvorak at 80+wpm. I'm reduced to hunt-and-peck mode with Qwerty, however. Which proves the superiority of Dvorak of course.

    5. Re:Why just words? by Abcd1234 · · Score: 1

      80 wpm? Isn't dvorak supposed to be faster or something? ;)

    6. Re:Why just words? by Abcd1234 · · Score: 1

      (like many other developers) I have to look at my hands (not constantly, but at least a glance every 3rd word) to type.

      "like many other developers"??? Jebus, I hope not. I've never met a single developer who can't touch type. And in the company I work for, the average is in the 60-70 wpm range (and I'm definitely on the higher end, averaging about 120 wpm).

      As for looking at the keyboard, TBH, I'd just find that annoying... when I'm in the "flow", I prefer to keep my eyes on the screen... having to pause periodically to look down at the keyboard would drive me *batty*.

  4. Re:WTF Summary by duguk · · Score: 5, Informative

    You're asked to enter TWO words; one known; one not.

    From: recaptcha.net:
    But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
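
    The two-word check quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not reCAPTCHA's actual code: the function names, the in-memory `votes` store, the case-insensitive comparison, and the threshold of three agreeing answers are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: unknown word image ID -> tally of human answers.
votes = defaultdict(Counter)


def normalize(text):
    """Compare answers case-insensitively and ignore surrounding spaces (an assumption)."""
    return text.strip().lower()


def check_challenge(control_answer, control_guess, unknown_id, unknown_guess):
    """Return True if the user passes the CAPTCHA.

    Only the control word (whose answer is known) decides pass/fail.
    The guess for the unknown word is recorded as a vote, and only
    when the control word was typed correctly.
    """
    if normalize(control_guess) != normalize(control_answer):
        return False
    votes[unknown_id][normalize(unknown_guess)] += 1
    return True


def resolved_answer(unknown_id, min_agreeing=3):
    """Return the leading answer once min_agreeing users gave it, else None.

    The threshold of 3 is an assumed value for illustration.
    """
    if not votes[unknown_id]:
        return None
    answer, count = votes[unknown_id].most_common(1)[0]
    return answer if count >= min_agreeing else None
```

    In this sketch a user who gets the control word wrong never contributes a vote, so a bot that fails the real test cannot influence the transcription of the unknown word.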

  5. Is that a finger cot? by AmigaHeretic · · Score: 1

    Check out this Google book.... about the 7th page down.

    http://www.google.com/books?id=Y0OOlnDFUM8C&printsec=frontcover&dq=Le+Morte+d'Arthur&as_brr=1#v=onepage&q=&f=false

    I thought these were scanned in by robots? If so it looks like it has well kept fingernails.

    1. Re:Is that a finger cot? by KDR_11k · · Score: 1

      Presumably the robot wasn't the only one ever to handle that book.

      --
      Justice is the sheep getting arrested while an impartial judge declares the vote void.
    2. Re:Is that a finger cot? by Anonymous Coward · · Score: 1, Funny

      Presumably the robot wasn't the only one ever to handle that book.

      Maybe not. But I know that when I'm done handling a book I usually don't leave my hands there with it.

    3. Re:Is that a finger cot? by quercus.aeternam · · Score: 1

      Humans - the new replacement for robots.

      Why drop half a million dollars on a machine when you can pay someone 25k a year to do the same job!

      But really, they probably do have robots that do some of the work - but to my (very limited) knowledge, even the best are somewhat destructive.

    4. Re:Is that a finger cot? by Jared555 · · Score: 1

      They probably also have some that were manually scanned, or there are probably cases where pages stick together and require human intervention. If the robot scans a book and then later it is discovered a page didn't get scanned they probably are going to manually scan it.

  6. Re:WTF Summary by Anonymous Coward · · Score: 1, Interesting

    As a control, the system sends out one word that it knows the answer to. You don't know which of the two is the unknown word beforehand. Also, I think that the same unknown word is kept in rotation for a couple of iterations just to double-check that it was entered correctly.

    At least, that's how I'd implement it.

  7. Re:WTF Summary by vivaoporto · · Score: 1

    It is the wisdow of the crowds. There are two words, one is a normal mangled (and known beforehand) captcha, the other is one that the best OCR google got its hands on couldn't solve.

    People still have to solve the first one correctly, and if enough people give the same answer to the second one, it is considered correct.

  8. Good idea, but how? by Nesa2 · · Score: 1, Interesting

    ReCAPTCHA is a free service that usually integrates into forums, blogs, and other such anonymous comment-posting services to help eliminate bot spamming. I think they will not use it on Google search pages, but exploit ReCAPTCHA users of all of those sites that do use it already. Sounds to me like a really good idea...

    I'm interested though how they are going to know what a correct entry by a user would be for a scanned word in order to validate it if they only have a scan...

    1. Re:Good idea, but how? by city · · Score: 1

      There is a really good talk by the reCAPTCHA founder, von Ahn, describing their method for validating a word and how they are using it to digitize old NYT articles. I think it's this one: http://www.youtube.com/v/Aszl5avDtekhl=en&%23038;fs=1&%23038;rel=0

      --
      I am a v1ral sig. Plse c0py me and h3lp me spread. Thank y0u?
    2. Re:Good idea, but how? by koick · · Score: 1

      Your linky no worky. This one does: http://www.youtube.com/watch?v=3PuZ55kyf7E (interview on Wired)

    3. Re:Good idea, but how? by Akral · · Score: 1

      Simple.
      They present two words - one is computer generated and is, in fact, the real CAPTCHA test. The other is a word from a book that failed OCR. People fill in both words, because they don't know which is which. They show the same failed OCR word to a hundred people and get a stable result from the majority, even if somebody tries to abuse the system and writes some bad words instead.

      --
      Don't worry, be happy!
  9. Er... no. Read the reCAPTCHA info by djkitsch · · Score: 1

    The interface uses two words: one which is verified and one which isn't. Assuming the first one is typed in correctly, they present the second to a bunch of people until they get a consensus (three the same, I think) and then it goes in the "verified" pile. Thus, even if the second word's not verified yet, a spammer will still get caught out by the other one.

    --
    sig:- (wit >= sarcasm)
    1. Re:Er... no. Read the reCAPTCHA info by Tony+Hoyle · · Score: 1

      So if enough people type 'penis' as the result, eventually 3 people will identify the captcha as 'penis' and it gets in the list of known words.

  10. Re:WTF Summary by neoform · · Score: 1

    I'm fairly certain the scanners read the text, get a good idea of what it says, then ask several people to tell them what it says; as more people type the text in, they become more sure of what it says. I've used reCaptcha a number of times and find it to work pretty well. Though I have wondered the same thing you're wondering.

    --
    MABASPLOOM!
  11. Re:WTF Summary by corsec67 · · Score: 1

    ReCaptcha does that:
    One of the words is generated or known, and the other is the new word they are trying to scan. You have to give both to access the protected system, since you don't know which is the known word and which is the new word.

    http://en.wikipedia.org/wiki/ReCAPTCHA

    --
    If I have nothing to hide, don't search me
  12. I'm real giddy about this by Kokuyo · · Score: 1, Interesting

    Just wait until some soccer mom needs to protect her genius of a brat from all the bad things there are. Latest crusade? A 'bad' word in a CAPTCHA. Just you wait, it will happen.

  13. Re:WTF Summary by narfman0 · · Score: 1

    They could have a list of possibilities, generated by computer or human. Then they throw out the same word several times and aggregate the answers. Comparing elements in the aggregates, they see how many people chose a particular word. When the probability that the word is wrong reaches near zero, they introduce it to the database. Don't know how they did it, that's just one of the ways they could have. Not a cure-all, but it helps with the scans I suppose.

  14. Re:WTF Summary by iammani · · Score: 1

    WTF Post?

    This is not just any captcha, but recaptcha. This captcha system will challenge you to recognize two words, one of which it understands and one it cannot understand. It assumes that, if sufficient people map the unrecognized word to the same set of letters (and also get the known word right), the image indeed maps to these letters.

    This is, indeed, a neat idea for OCR.

  15. I hope they have a couple of tests! by NoYob · · Score: 4, Funny

    As I get older, I find that I'm having a harder time reading from computer monitors and especially captchas. I confuse words all the time. For acample: erection with election. Not so bad, but if Google doesn't pass that unknown to multiple folks, it could get embarrassing. Text from a Bill Clinton bio:

    After Bill Clinton's first erection as President, he proceeded .....

    --
    It's NOT me! It's the meds! I'm on 1000mg of Fukitol.
    1. Re:I hope they have a couple of tests! by HipToday · · Score: 1

      Or acample with example.

    2. Re:I hope they have a couple of tests! by ElSupreme · · Score: 1

      I find that reCAPTCHA is MUCH easier to decipher than standard ones. I mean, I have tens of years of experience deciphering text on the curve of a book, with cheap printing, versus the ones made hard to read on purpose.

      But a few of the reCAPTCHAs are just misprinted and would require someone to read the sentence to figure out what should go there.

      --
      My addiction: Arguing with idiots. AKA Slashdot!
    3. Re:I hope they have a couple of tests! by natehoy · · Score: 1

      Most CAPTCHA solutions have at least two ways you can solve them. Some offer an audio version of the words that is only slightly garbled (enough to defeat voice recognition) that you can listen to in addition to or instead of the CAPTCHA word, and some allow you to solve some simple word problem instead of CAPTCHA if your hearing AND eyesight are both bad.

      As far as the Clinton example, funny, but in reality people are going to be looking at one word at a time. The Clinton bio example would be frequently made (humorously or maliciously) due to context. But if the word "election" was put on a CAPTCHA, most people would interpret it correctly. A few might get funny and try "erection" just to see if it's the "non significant" word, but I doubt that would be EVERYONE. If you checked the word against a dozen people, you'd have to have at least (at a guess) 10 of them with the exact same sense of humor to get the word automatically accepted as "erection" and not "election".

      I don't know Google's algorithm for re-checking words, but the article clearly says they'll be doing some rechecking for reliability by having a number of different randomly-chosen people interpret the same word. I imagine that words where the answers are all identical might get 4-5 checks, while words that prove less consistent will get checked at least a dozen times or so, and those that continue being unreliable would probably get an authoritative check.

      If, say, 4 people chose "erection" and the remaining 8 chose "election", the word would probably be flagged as "unreliable" by the automated CAPTCHA system and reviewed by a Google employee in proper context for final verification. Then the word would be corrected. Exactly which of the two words is chosen would probably depend on the political affiliation of the Google employee. :)
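
      The escalation guessed at above could look something like the sketch below. The numbers (4 quick checks, 12 full checks, 10 agreeing answers) are taken from the guesses in this comment, not from anything Google or reCAPTCHA has published, and `review_word` is a made-up name.

```python
from collections import Counter


def review_word(answers, quick_checks=4, full_checks=12, min_agreeing=10):
    """Classify an unknown word from the answers collected so far.

    Returns ("accepted", word), ("needs_more", None) or ("human_review", None).
    All thresholds are assumed values for illustration.
    """
    # Only the first full_checks answers are considered.
    cleaned = [a.strip().lower() for a in answers][:full_checks]
    tally = Counter(cleaned)

    # Fast path: the first few readers all agree.
    if len(cleaned) >= quick_checks and len(tally) == 1:
        return ("accepted", cleaned[0])

    # Disagreement (or too few answers): keep showing the word to more people.
    if len(cleaned) < full_checks:
        return ("needs_more", None)

    # Slow path: a dozen readings, accept only on a strong majority.
    word, count = tally.most_common(1)[0]
    if count >= min_agreeing:
        return ("accepted", word)

    # Still inconsistent: hand the word to a person who can see the context.
    return ("human_review", None)
```

      With the split from the example, `review_word(["election"] * 8 + ["erection"] * 4)` falls through to `("human_review", None)`, since 8 of 12 is below the assumed threshold of 10.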

      --
      "This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
    4. Re:I hope they have a couple of tests! by Hurricane78 · · Score: 1

      Protip: Ctrl-+

      Seriously. Or change the freakin' resolution of your display.

      There, was it that hard? ^^

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    5. Re:I hope they have a couple of tests! by BForrester · · Score: 1

      CTRL-+ just makes "erection" bigger. Since you ask: yes, it's hard.

  16. Re:WTF Summary by iamhassi · · Score: 4, Insightful

    "Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. "

    That explains why half the time I can't even read the word. I swear every time I reach a captcha I have to refresh it 5x before I finally land on two words I can read.

    I must say this system is ingenious. Distributed OCR: let millions of internet users figure out what the words are. Maybe next election when there's hanging chads they can use that as a captcha.

    --
    my karma will be here long after I'm gone
  17. Re:WTF Summary by Useful+Wheat · · Score: 1

    The system works by having you validate 2 words. One of the words is a word that has already been verified to be correct, a known quantity. The other word is the unknown word. If you get the first one correct, it assumes you got the other one correct too. Error correction is done by having multiple people evaluate the same unknown word. If 3 people agree that the unknown word is "Bacon", the word is then taken to be bacon.

    Random people trying to mess up the system will not succeed. However, if you convinced everyone to simply enter "Bacon" we could have some amazing google book searches.

  18. Won't this eventually defeat the purpose? by natehoy · · Score: 3, Interesting

    Google is doing this in order to prevent spam and to improve OCR. But once OCR is improved to the point where it can read poorer scans, won't spammers be able to use that new technology to eventually defeat CAPTCHA?

    Don't get me wrong, I think this is a marvelous idea, potentially using volunteer labor of humans as OCR to interpret a book one poorly-scanned word at a time. But it does seem to have the side effect of eventually destroying the original purpose of what they bought. Maybe CAPTCHA is worth more as a "crowdsourced OCR solution" than it ever was as spam prevention anyway...

    --
    "This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
    1. Re:Won't this eventually defeat the purpose? by CSMatt · · Score: 1

      CAPTCHAs can be defeated right now by using mechanical turk or social engineering to get humans to solve the CAPTCHAs for the spammers.

    2. Re:Won't this eventually defeat the purpose? by slim · · Score: 5, Insightful

      What you get in the captcha is the scanned word, plus some warping and obfuscation. Therefore if OCR advances to the point where it has no trouble with the original scan, it would still have trouble with the captcha.

      Spammers already have a neat way around captchas -- they proxy them to people on porn and warez sites. If you ever fill in a captcha on such a site, you're probably helping a spambot out.

    3. Re:Won't this eventually defeat the purpose? by funfail · · Score: 1

      CAPTCHAs can also be defeated with a system like reCAPTCHA.

    4. Re:Won't this eventually defeat the purpose? by Anonymous Coward · · Score: 1, Interesting

      If spammers figure out how to defeat reCAPTCHA, Google will probably hire them to automatically digitise books; that probably pays a lot better than spamming. You can think of it as trying to set all the ingenuity of the world's spammers working at the same problem...

    5. Re:Won't this eventually defeat the purpose? by jacktherobot · · Score: 1

      In addition to just showing a scanned word, the captcha image is contorted and corrupted. This makes captchas much, much harder to solve compared to standard OCR problems. Improving and perfecting OCR is unlikely to have as much of an adverse impact on captchas as spammers hiring poor folks to solve them.

    6. Re:Won't this eventually defeat the purpose? by maxume · · Score: 1

      All you have to do is add a level of indirection. Take the reCAPTCHA images and present them to users of your rereCAPTCHA system, and then use the results to solve the reCAPTCHA tests.

      I suppose keeping up with the turnover of the reCAPTCHA might be an issue, but if the problem were valuable enough to solve...

      --
      Nerd rage is the funniest rage.
    7. Re:Won't this eventually defeat the purpose? by Hurricane78 · · Score: 2, Insightful

      No it's not warped and obfuscated. ReCaptcha gives you the word as-is.

      GP is using faulty logic (circular reasoning I think).

      If ReCaptcha improves OCR algorithms, then not only spammers will have access to them, but so does the effort behind ReCaptcha.
      So the now scannable words would be scanned and never turn up there. ReCaptcha would just present you with those words that would still not be scannable by any OCR.

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    8. Re:Won't this eventually defeat the purpose? by koick · · Score: 2, Informative

      In this interview on Wired, Luis von Ahn explains that they do indeed warp it: http://www.youtube.com/watch?v=3PuZ55kyf7E

    9. Re:Won't this eventually defeat the purpose? by Hays · · Score: 2, Insightful

      The text is warped and obfuscated. Look at example captchas -- do you really think the geometric swirls were in the source documents?

    10. Re:Won't this eventually defeat the purpose? by ChaosDiscord · · Score: 2, Informative

      No it's not warped and obfuscated. ReCaptcha gives you the word as-is.

      Go here. Bounce on the reload button a few times to see some example reCAPTCHA. Tell me with a straight face that they're not warped. Perhaps they're scanning books printed on silly putty? As for obfuscated see the example here. They used to slap a line across each word. They don't appear to be doing so any more, but they used to.

    11. Re:Won't this eventually defeat the purpose? by DNS-and-BIND · · Score: 1

      Spammers already have a way around captchas - getting Indians to solve them. I turned the flow of spam off my website for about a month by installing a captcha for registration. Then, I get a few enterprising young businessmen from India solving the captchas and spamming the comments by hand. You can't win.

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
    12. Re:Won't this eventually defeat the purpose? by totallymeat · · Score: 1

      But once OCR is improved to the point where it can read poorer scans, won't spammers be able to use that new technology to eventually defeat CAPTCHA?

      If you look at CAPTCHAs as a method for improving artificial intelligence, then the issue you're raising gets turned into a benefit of ever-improving Turing tests.

      1. Find a problem that AI cannot readily solve, yet humans can (obfuscated word recognition).
      2. Develop a CAPTCHA out of the problem (reCAPTCHA).
      3. Either the problem remains unsolved and the CAPTCHA must be solved by humans...
      4. ... or we develop stronger AI capable of solving the original problem.

      In either outcome, we get something useful, either better Turing tests or more robust AI. The hardest part of this loop is developing novel CAPTCHAs, but at least the cycle results in useful outcomes every time.

    13. Re:Won't this eventually defeat the purpose? by natehoy · · Score: 1

      Excellent point.

      Someone please mod parent insightful. Thanks! :)

      --
      "This post contains words, known to the State of California to cause thought. Wash brain thoroughly after reading."
  19. Re:WTF Summary by Sockatume · · Score: 4, Informative

    The best part is, it automatically selects for words which are invulnerable to OCR-based attacks. And if the user's presented with an illegible scanned CAPTCHA, they aren't penalised for getting it wrong.

    --
    No kidding!!! What do you say at this point?
  20. Re:WTF Summary by Anonymous Coward · · Score: 5, Funny

    "Hey everyone, let's all sit refreshing the google gmail account creation page, and always type "boobs" for the second captcha value..."

  21. Mod up by Anne+Honime · · Score: 1

    I totally agree, this is pure genius. Distributed Human-engined OCR is certainly the best solution to traditional OCR problems, and at the same time it leaves many doors to unforeseen traps ajar.

    1. Re:Mod up by mrcaseyj · · Score: 5, Interesting

      I agree that the idea is ingenious. But on the only one I ran into, the word was completely indecipherable. I don't mean that it was really hard, I mean that it was a word so thoroughly mangled that it was clearly impossible to read by anyone, especially without context. The lack of context is one of the big weaknesses of the system. When a word is unclear, it's the words around it that give critical clues to what it is.

    2. Re:Mod up by Chabil+Ha' · · Score: 2, Insightful

      Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance at getting it right, whilst increasing the chance of a spam bot of getting it wrong.

      --
      We're all hypocrites. We all have hidden parts, it's the contrast between them that make us more a hypocrite than others
    3. Re:Mod up by Anonymous Coward · · Score: 2, Funny

      Which gives rise to the question: Why isn't captcha giving us complete sentences? Not only would you be OCRing more words, but the context gives the human a greater chance at getting it right, whilst increasing the chance of a spam bot of getting it wrong.

      ...and increasing the rate of people saying "F- it, the captcha should not be longer than my comment." - hence the limit of two words to allow for "me too!" comments.

    4. Re:Mod up by Kozz · · Score: 2, Funny

      Which gives rise to the question...

      Don't you mean, "Which begs the question..."?!

      (ducks)

      --
      I only post comments when someone on the internet is wrong.
    5. Re:Mod up by hipifreq · · Score: 1

      But if the people behind reCaptcha are really doing this well, then they remember the words that you refreshed. If a word gets refreshed enough then a real human can go to the real book and figure out the meaning of the word.

    6. Re:Mod up by Tim+C · · Score: 1

      Because having to read and enter a single, hard to read word is enough hassle for most people; two is stretching it. An entire sentence would be too much.

    7. Re:Mod up by ChienAndalu · · Score: 1

      True. I think they could however highlight one or two words and ask the user to enter the highlighted words

    8. Re:Mod up by selven · · Score: 2, Funny

      hence the limit of two words to allow for "me too!" comments.

      lol

  22. Re:WTF Summary by hansamurai · · Score: 1

    I find reCaptcha highly readable. This isn't like other captcha techniques where there are really thin letters and random objects strewn about; it's just blurry, zoomed-in typewritten words that are hard for a computer to distinguish.

  23. Re:maybe they should use CAPTCHAs... by Rik+Sweeney · · Score: 3, Interesting

    Funny you should say that

    http://mailhide.recaptcha.net/

  24. Re:WTF Summary by Granis · · Score: 1

    That's really interesting. I've always wondered why I have passed these CAPTCHAs even when I had to make wild guesses on some of the words because they were so hard to read.

    However, how long will it be before a lot of users realize that it is irrelevant what you enter for the unknown word? Even if you don't know for sure which of the words is the unknown one, knowing the above I think the risk is high that you just type nonsense if you can't read one of the words.

    If enough people do this the system will be quite ineffective. reCAPTCHA will probably not accept the wrong solution very often, but it will take a lot of time to get enough users with the same solution to accept it. But with a massive amount of users, even a small amount of the total might be enough to keep it running?

  25. reCAPTCHA is awesome by Thaelon · · Score: 5, Funny

    I have to say, reCAPTCHA is one of the most elegant solutions I've ever seen to a problem.

    It's not even killing two birds with one stone, it's killing two birds with one of the birds.

    --
    Question everything

    1. Re:reCAPTCHA is awesome by pHus10n · · Score: 1

      Your analogy made me lol IRL.

    2. Re:reCAPTCHA is awesome by Sockatume · · Score: 1

      I've already posted so I can't mod you up, but that might be the greatest analogy I've ever heard. I'm already thinking up applications for it.

      --
      No kidding!!! What do you say at this point?
  26. Psst, scanning books is just one goal by melted · · Score: 1

    The other is to track how users browse the web, for ad targeting. All they need to do is put a cookie in your browser and read it next time you see a captcha or load a Google analytics script.

  27. Re:WTF Summary by Kaetemi · · Score: 1

    Well, yeah, but the OCR attacker also just needs to get the OCR readable word right...

    --
    Kaetemi
  28. Re:WTF Summary by slyborg · · Score: 2, Interesting

    I still don't get it. How do you know that the person correctly identified the second word? I don't see how a priori decoding the first word means that the second was correct. I would expect that the individual bad data rate from this technique would be substantial.

    I do enjoy the fact that Google, a ridiculously profitable company by virtue of its near-monopoly on Internet search advertising, is using the public who pays it via these ad impressions to do its work for free, and using the technique invented and used by spammers to crowd-source solve CAPTCHAs to get into Gmail and the like!

  29. Re:WTF Summary by digitig · · Score: 2, Funny

    wisdow

    OCR error?

    --
    Quidnam Latine loqui modo coepi?
  30. Re:WTF Summary by mckinleyn · · Score: 1

    Because it presents the same words to many, many people. Yes, 10 people can all be wrong, but how likely is it that more than half of 100 people are all wrong in exactly the same way?
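
    As a rough check on that intuition, here is a minimal sketch. It assumes (this is an assumption for illustration, not anything reCAPTCHA has published) that each user independently types one particular wrong answer with probability p, and computes the chance that more than half of n users do so. The error rates below are hypothetical.

```python
from math import comb

def prob_majority_same_wrong(n: int, p: float) -> float:
    """Probability that more than n/2 of n independent users give the
    same specific wrong answer, each doing so with probability p."""
    threshold = n // 2 + 1  # strictly more than half
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(threshold, n + 1))

# Hypothetical per-user error rates; the real ones are unknown.
for p in (0.05, 0.2, 0.4):
    print(f"p={p}: {prob_majority_same_wrong(100, p):.3g}")
```

    The independence assumption is the weak point: if the scan itself is misleading, errors are correlated and this estimate is too optimistic.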

  31. Re:WTF Summary by Ziwcam · · Score: 1

    It's not necessarily the second word that's unknown.

  32. Re:WTF Summary by complete+loony · · Score: 1

    The first one is not a normal mangled word. It's another word that could not be OCR'd but has already been identified by the crowd.

    --
    09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
  33. Re:WTF Summary by Chyeld · · Score: 2, Insightful

    You don't assume.

    For the purposes of the captcha, typing one word correctly suffices, as long as it is the right word (the known 'good' word).

    For the purposes of distributed OCR, the "how do you know if the unknown word was ID'ed correctly" issue is simply solved by having the word ID'ed several times. Given you don't know which word is the 'test' word and which is the one actually needing IDing, there shouldn't be a problem with people guessing "Penis!" or "Boobies!" all the time.

    So as long as a majority of the people ID the word the same way, you can have a high level of confidence that it's being ID'ed correctly.
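
    The two roles described above can be sketched roughly as follows. This is an illustration only: the `grade` function, the in-memory `votes` store, and the exact-match comparison are all assumptions, not reCAPTCHA's actual implementation (which apparently tolerates some near misses).

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: image id -> tally of answers typed for it.
votes = defaultdict(Counter)

def grade(control_answer: str, typed_control: str,
          unknown_id: str, typed_unknown: str) -> bool:
    """Pass the user if the control word matches; only then count their
    reading of the unknown word as one vote."""
    if typed_control.strip().lower() != control_answer.lower():
        return False  # failed the CAPTCHA, so the other word is not trusted
    votes[unknown_id][typed_unknown.strip().lower()] += 1
    return True
```

    The user never learns which of the two words was the control, so a junk answer for one of them risks failing the test outright.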

  34. Re:WTF Summary by bami · · Score: 1

    That is what happened with the Anonymous attack on the Time poll, with the 'penis' attack.

    They looked at both words, saw which one was less readable, filled in the good one and filled in 'penis' for the other one, in the hopes of poisoning the database so that they would only have to enter the first word correctly.

    Would be kind of amusing to see a couple of books showing up on Google Books with the word 'penis' randomly inserted in pages where reCaptcha was used.

  35. Re:WTF Summary by Sockatume · · Score: 1

    That'd involve designing a pattern-recognition system which can reliably decide which of two OCR words is less readable, mind you.

    --
    No kidding!!! What do you say at this point?
  36. Evil? by AP31R0N · · Score: 1

    Have you paranoiacs figured out how Google is going to use this to spy on you or otherwise do evil?

    --
    Utilizing the synergization of benchmark e-solutions to pre-workaround action items!
  37. Familiar Creature by TheMeuge · · Score: 1

    no they don't. I was transferring flights at London Heathrow and there was only one window open, and a massive queue. I get to the front and I find the woman at the computer used one finger typing... ONE FINGER, not even one on each hand, one feking finger. This was someone who was supposedly trained to do this job, can't even touch type.

    I don't know about London, but in the U.S., the 1-2 finger typing is usually accomplished by a community college dropout, whose fingernail extensions are about 2 inches long, and who types either by carefully and slowly pressing one key at a time with the nail extension, or with the second knuckle of her middle finger. She will also scream: "Can I help you" with enough contempt to burn your eyebrows off. When you get to the counter, she will look you over with as much spite as humanly possible, then get her Sidekick out and text someone for a couple of minutes. And god help you if you are still with her (inevitably) when 12pm or 1pm comes about. She will get up and leave for lunch (or unroll her food), whether you're waiting or not. Actually, she'd prefer you to wait there.

    She is a ubiquitous inhabitant of government offices of all sorts, as well as front desks in companies that don't respect themselves. She will need the supervisor/manager to resolve any issue that goes beyond typing your name (incorrectly), but she will march on city hall with the rest of her co-workers if they don't get another 5% raise in the middle of the recession.

  38. Re:WTF Summary by b4dc0d3r · · Score: 1

    One KNOWN, one not. The known word is not necessarily going to be OCR readable... you can seed the database with 100 or so images which are known, but maybe not OCR readable. Of course it works better if the known words are NOT OCR readable.

    The point is OCR can have typos as well, so just because OCR returns a result doesn't mean it should be trusted. The known word of the two is likely independently analyzed, probably by a human.

    Once enough people put the same answer for an unknown word, it becomes trustworthy. That is not easy to hack by making repeated requests with your OCR tool (which does not get GOOD results, but does get CONSISTENT results, therefore the same answer each time) and putting incorrect answers in the database - one of the millions of human users will likely get one of the words being attacked, and respond differently. So you will have several different answers and no clear winner, leaving it an unknown word.

  39. Marble cake, also, the game by dazjorz · · Score: 1

    This was actually done by the guys at 4chan /b/: http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/

  40. Waiiiiit.... by WWWWolf · · Score: 1

    I thought I had some hazy recollection that reCAPTCHA was being used for some open projects, like helping to OCR out-of-copyright works...

    ...so now it is being used to fuel Google's massive, still-very-much-copyrighted, proprietary book scanning effort?

    So how's this going to benefit people? I'm, of course, assuming the details are spotty at the moment and I'm terribly interested to hear more details from Google's official "do no evil" department on how they intend to contribute to the world.

  41. Beloved != 8cloved by EdgeyEdgey · · Score: 1

    I just got a correct response from a clearly incorrect answer.
    The image was of Beloved but being difficult I answered 8cloved and got accepted.
    It did the job of proving that I wasn't a bot, but if there are enough difficult people (like me) out there then we could really screw Google over.

    --
    [Intentionally left blank]
    1. Re:Beloved != 8cloved by /dev/trash · · Score: 1

      read up on the implementation to see why you are wrong.

  42. Re:WTF Summary by Anonymous Coward · · Score: 3, Interesting

    Interesting you should say that.

    Unfortunately, it won't work - 4chan already ruined it for everyone.

    http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/

  43. Re:WTF Summary by koxkoxkox · · Score: 1

    I always intentionally smash the keyboard with my palm for the second word.

    Well, it doesn't have to be the first word known and the second word unknown, it could be the opposite, or random.

  44. Re:WTF Summary by melikamp · · Score: 1

    If it is at random, one of the following will happen: I will either screw up the known word, in which case my OCR will not be trusted, or I will screw up the OCR word and get through. It should only take a few tries to get through, and there is no chance of helping with OCR.

  45. Re:WTF Summary by slyborg · · Score: 1

    Yeah, the multiple answers idea occurred to me later. I'm actually not talking about deliberate garbage answers, just people getting it wrong. If the word is badly scanned, etc., you will get multiple answers for the unknown text, and possibly not split 100:1, but maybe two answers split 100:90 or something of that order, so you still don't know which is more correct. Or maybe, because of the nature of the image, the vast majority of people may actually converge on a wrong answer.

  46. Re:WTF Summary by mysidia · · Score: 1

    The 'known' word wasn't necessarily OCR readable. And their methods of OCR are probably not quite the same as the attacker's.

  47. Re:WTF Summary by joelpt · · Score: 1

    Building on the sibling replies, I'd also like to point out that for third-world human-powered captcha-entering sweatshops, there is no advantage to randomly guessing the second word versus just entering both words correctly. You'll end up having to enter the same number of correct words per successful captcha attempt either way.

  48. Re:WTF Summary by Arancaytar · · Score: 1

    Yeah. I often get combinations like "WORD vjfkjsmxs" or worse, "WORD [illegible smudge]".

    I tend to simply put a dash for the smudge. They're not using that word to verify, after all, they just want to know what it says. So I tell them, "nothing". Likely, they'll get a lot of different results for it, and if the scoring algorithm is good it will eventually determine the word is illegible (or at least show it to a moderator of some kind).

  49. Re:WTF Summary by SnowZero · · Score: 1

    You keep running it until one answer dominates in a statistical sense. With the amount of data they are getting, it wouldn't be hard to construct a pretty accurate probabilistic model. If you never get a satisfactory probability for the most frequent answer, you could flag it for a developer to look at.
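
    A minimal sketch of such a stopping rule, under stated assumptions: the thresholds (`min_votes`, `max_votes`, `min_share`) are made-up numbers, and a simple share-of-votes test stands in for whatever probabilistic model would really be used.

```python
from collections import Counter
from typing import Optional, Tuple

def decide(answers, min_votes: int = 10, max_votes: int = 50,
           min_share: float = 0.75) -> Tuple[str, Optional[str]]:
    """Return ('accept', word), ('wait', None) or ('flag', None)."""
    n = len(answers)
    if n < min_votes:
        return ("wait", None)   # too little data, keep showing the word
    word, count = Counter(answers).most_common(1)[0]
    if count / n >= min_share:
        return ("accept", word)  # one answer dominates
    if n >= max_votes:
        return ("flag", None)    # never converged; hand it to a human
    return ("wait", None)
```

    For example, nine votes for one spelling and one for another would be accepted, while an even two-way split that persists up to `max_votes` would be flagged.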

  50. Re:WTF Summary by GravityStar · · Score: 1

    Suppose 50% of people filling in the CAPTCHA are malicious. They type in things like "penis", "B00BIES", "qwerty", "asdf", etc. 12.5% of people fail at deciphering the CAPTCHA completely. 12.5% of people fail, but succeed in providing near matches with one or two letters wrong. 25% of people succeed in deciphering the CAPTCHA.

    I'm just taking a guess at the percentages. But still, with a bit of analysis, it would become quite easy for reCAPTCHA to filter out the noise. The only way reCAPTCHA would fail at the analysis is if the malicious people organize with the explicit purpose of poisoning the reCAPTCHA results. While possible, I think this is unlikely unless reCAPTCHA starts say... sponsoring expeditions to kill baby seals.
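
    To make that concrete, here is a small worked tally for 1000 hypothetical users split by the guessed percentages above. The target word "wisdom", the particular near misses, and the way the malicious half divides among junk strings are all made up for illustration.

```python
from collections import Counter

def top_answer(groups: dict) -> str:
    """Most frequent answer, given answer -> number of users who typed it."""
    return Counter(groups).most_common(1)[0][0]

# Malicious users each pick their own junk string (no coordination).
uncoordinated = {
    "wisdom": 250,                # 25% read the word correctly
    "wisdow": 70, "wisdon": 55,   # 12.5% near misses, spread out
    "penis": 125, "b00bies": 125, "qwerty": 125, "asdf": 125,  # 50% malicious
    # the remaining 12.5% type unrelated garbage, each string roughly unique
}
# Same population, but every malicious user agrees on one string.
coordinated = {"wisdom": 250, "wisdow": 70, "wisdon": 55, "penis": 500}

print(top_answer(uncoordinated))
print(top_answer(coordinated))
```

    With the junk spread over several strings the correct reading is still the single most common answer; only when the malicious half agrees on one string does it overtake the correct one.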