Slashdot Mirror


Fill Out CAPTCHAs, Digitize Books At The Same Time

alphadogg wrote with a link to a Networld article about a noble endeavor: putting CAPTCHAs to work for the good of humanity. A scientist at Carnegie Mellon is looking to create a new type of security check that will assist in a project meant to digitize and make searchable text from books and printed materials. Above and beyond that, the offering would probably be more secure than most current systems. "Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project."

121 comments

  1. Verification? by traindirector · · Score: 5, Insightful

    CAPTCHAs work because the computers sending them already know what the text says; they start with it in text form and change it into a hard-to-read image. In the system discussed in the article, how will the computer verify that the user response actually matches the text? Sure, it could compare the response to its best guess, but if a program trying to guess the text was as sophisticated as the guessing computer, the guess would match.

    I imagine the computer sending the picture of the image of hard-to-read text will further obfuscate the image in a way that makes it even more difficult for the computer on the receiving end to decipher, but the article doesn't acknowledge that this is one of the first logical questions in conceiving of / implementing this system in a functional way. The article really should cover this...

    1. Re:Verification? by greatgregg · · Score: 5, Informative

      From recaptcha.net: "But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct."
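
The two-word scheme quoted above can be sketched roughly as follows. This is only an illustration under assumptions: the pool names, image ids, and the case-insensitive comparison are hypothetical and not taken from recaptcha.net.

```python
import random

# Hypothetical pools; ids and contents are illustrative only.
KNOWN = {"img_001": "between", "img_002": "society"}  # image id -> verified text
UNKNOWN_GUESSES = {"img_900": []}                     # image id -> guesses so far


def make_challenge(rng=random):
    """Pick one known and one unknown image and shuffle their order."""
    pair = [rng.choice(list(KNOWN)), rng.choice(list(UNKNOWN_GUESSES))]
    rng.shuffle(pair)  # the user must not learn which word is the control
    return pair


def check_response(challenge, answers):
    """Pass or fail on the known word only; record the other answer on a pass."""
    known_id = next(i for i in challenge if i in KNOWN)
    unknown_id = next(i for i in challenge if i in UNKNOWN_GUESSES)
    typed = dict(zip(challenge, answers))
    if typed[known_id].strip().lower() != KNOWN[known_id]:
        return False  # failed the control word: the other guess is discarded
    UNKNOWN_GUESSES[unknown_id].append(typed[unknown_id].strip().lower())
    return True
```

The guess for the unknown word never affects whether the user passes; it is only stored for later comparison with other users' guesses.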

    2. Re:Verification? by mikee805 · · Score: 1

      I read the article trying to find an answer to that. This system works for what they want: it gets disrupted OCR done by humans, but it does not seem to perform the job of a CAPTCHA.

      --
      B5 71 ED FB 55 D6 4E 68 07 25 E2 FA CA 93 F0 2F, is mine! All mine!
    3. Re:Verification? by 26199 · · Score: 0, Redundant

      That was my thought. I suppose you could let the first five people through automatically, then use their answers to check everyone else; but what's the point of a CAPTCHA that lets a certain minimum portion through?

      Turning people away when they actually got it right is worse, though; that way you potentially lose customers in trying to fight spam.

      Seems like an interesting idea, but I don't see how it can work...

    4. Re:Verification? by anarchy_man3 · · Score: 1

      Yea that doesn't quite seem to make sense. I can only see this leading to poor transcriptions with no security benefit.

    5. Re:Verification? by Solder+Fumes · · Score: 1

      What you said. Also, register-bots will destroy this because their OCR will come up with something close to what the serving computer already knows. And it'll put in incorrect results, which will pass security AND be added to the digital book, and now your precious digital book has more OCR typos than it would have in the first place.

    6. Re:Verification? by redwoodtree · · Score: 1

      The article states:

      "I think it's a brilliant idea -- using the Internet to correct OCR mistakes,"

      Suggesting that the words have been OCR'd, and that the user is to correct the mistakes. This goes on to suggest that there is a margin of error that takes into account OCR mistakes but will allow the corrected text.

      With a little imagination, it's easy to think of many permutations to this, along with the idea of just asking for a new captcha if the first one doesn't work.

      The article also states there's a speakable text version of itself for the hearing impaired.

    7. Re:Verification? by 26199 · · Score: 1

      And poster above has explained nicely how it works. Thanks. They could have put that in the article... (or summary!)

    8. Re:Verification? by penguinbroker · · Score: 1
      from the website :: http://recaptcha.net/learnmore.html

      "But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct."

    9. Re:Verification? by alstor · · Score: 1

      I had the exact same thoughts when I skimmed the article...

    10. Re:Verification? by joek1010 · · Score: 0, Redundant

      "But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct."

      http://recaptcha.net/learnmore.html

    11. Re:Verification? by nwbvt · · Score: 1, Informative

      Considering all the other people who asked that question, they really needed to make that clear in their press releases.

      So if you want to screw with it, all you have to do is intentionally get exactly one word wrong each time. Yeah, it will often take two tries to get it right, but it's not like CAPTCHAs usually work fine on one try anyways... And hey, if you just try for only one word (and leave the other blank), you will end up on average typing the same amount.

      The article makes comparisons to SETI@Home, but that's a bit different since that is relying on the computer to do the work, not the actual users. That means it's fairly consistent and you really are not impacting users all that much (with the exception of pegging their CPU when they are away from the computer).

      --
      Mathematics is made of 50 percent formulas, 50 percent proofs, and 50 percent imagination.
    12. Re:Verification? by bugnuts · · Score: 4, Funny
      The problem is that any unsophisticated captcha interpreter can spit out the text that's known, and make a (bad) guess at what is hard to read. Then, if there is any significant amount of spammers, we end up with exactly the same issue - computers having trouble with OCR.

      e.g., /. puts in a captcha to translate the following two sections:
      12345
      l1il1

      The captcha software knows the "12345"
      but it doesn't know the "l1il1". A human could figure out both.

      But spammer captcha deciphering can figure out 12345, and is allowed to incorrectly guess 11ii1 for the 2nd part. End result is
      • a spammer is posting something as indecipherable as this message except insults your penis size
      • some OCRed book is now committed to a false interpretation
      • I have to change the password on my luggage.

    13. Re:Verification? by codename.matrix · · Score: 2, Informative

      http://recaptcha.net/security.html the words are additionally distorted and they add lines and warps so that a computer cannot read it.

    14. Re:Verification? by Bjarke+Roune · · Score: 4, Insightful

      This is not a problem if the known word is a hard image that has been solved by humans in previous captchas. This scheme works as long as the system has a small pool of known images to start the process off.

    15. Re:Verification? by Falkkin · · Score: 4, Insightful

      "So if you want to screw with it, all you have to do is intentionally get exactly one word wrong each time."

      Well... sort of. Multiple agreements are required before the system will accept that it knows the spelling of a previously unknown word. So you're not going to singlehandedly subvert the system; at the very least you need a cabal of friends. But with millions of words available in the system, the chance that you and a bunch of friends will all get the same word and write in the same bogus data is pretty close to zero. I'm not saying this system is impossible to game, but I think it'd be a heck of a lot easier (and more rewarding, if it's the sort of thing that floats your boat) to vandalize Wikipedia instead.

    16. Re:Verification? by Mr.+Underbridge · · Score: 2, Interesting

      This is one of the most creative ideas I've heard all year. Human-based distributed computing with captchas? Awesome!

    17. Re:Verification? by autophile · · Score: 3, Informative

      Yeah, but it's not like you're only allowed to present a given unknown word once. Present it many times, and use the word with the most hits.

      --Rob

      --
      Towards the Singularity.
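
The "present it many times, and use the word with the most hits" idea in the comment above amounts to a plurality vote. A minimal sketch, assuming a hypothetical `min_votes` cutoff that nobody in the thread specifies:

```python
from collections import Counter


def consensus(guesses, min_votes=3):
    """Return the most common guess once it has at least min_votes, else None."""
    if not guesses:
        return None
    normalized = (g.strip().lower() for g in guesses)
    word, votes = Counter(normalized).most_common(1)[0]
    return word if votes >= min_votes else None
```

For example, `consensus(["which", "witch", "which", "Which "])` settles on "which", while two conflicting guesses return `None` and the image stays in the unknown pool.
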
    18. Re:Verification? by jambarama · · Score: 1

      That sounds fine, but I'd still be worried about the accuracy of the read. As long as we're going to have users read and answer two captchas, why not do this: present the unknown captcha to two people, if they agree on the answer accept it, if not show the same captcha to a third person. Accuracy is probably more important than speed for books that still haven't been digitized.
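
The two-readers-then-a-tiebreaker proposal above could look something like this. It is a sketch only; the function name and the decision to leave the word unresolved when all three readers disagree are assumptions.

```python
def resolve(first, second, third=None):
    """Accept when two readers agree; otherwise use a third reader as tiebreaker.

    Returns the accepted word, or None if the word is still unresolved.
    """
    a, b = first.strip().lower(), second.strip().lower()
    if a == b:
        return a
    if third is None:
        return None  # caller should show the image to a third reader
    c = third.strip().lower()
    return c if c in (a, b) else None  # three different answers: unresolved
```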

    19. Re:Verification? by InakaBoyJoe · · Score: 0, Redundant

      Exactly, it doesn't make sense to automatically "trust" a user's entire response based on one correct word.

      Why not show the same image to multiple users, and assume the response is correct only if two or more of them concur? This has the effect of doubling (or tripling) the effort required to solve, but gives you at least some verifiability. Sort of like using google as a spell-checker.

      Of course, you'd still have to fall-back on the "one correct word" idea for verification when the user makes the entry, but in terms of adding text to the database, some statistical verification would be a good thing...

    20. Re:Verification? by DeathElk · · Score: 2, Informative

      RTFA's TFA

    21. Re:Verification? by Anonymous Coward · · Score: 0

      There's always someone that doesn't get it and they have to question the article with faulty logic.

      1. The spammer doesn't know which is the known word and which is the unknown word.
      2. You have a huge sample size for each word.

    22. Re:Verification? by nwbvt · · Score: 1

      It doesn't need to be planned. For instance if the given text is very close to something dirty, a lot of people will get the same idea and will put in the same text. And if you doubt the power pranksters like this can have, look back at the Google bombing episodes.

      The Wikipedia is a bit different as you have to make an effort here. People are not required to write Wikipedia articles to sign up for an email account or post on a message board. If they were, the resulting information would be even less consistent than it currently is.

      --
      Mathematics is made of 50 percent formulas, 50 percent proofs, and 50 percent imagination.
    23. Re:Verification? by poopdeville · · Score: 2, Insightful

      Because nobody wants to wait around for another person to verify the CAPTCHA before posting on /. That is, you need two CAPTCHA images because you still want them to work as a CAPTCHA.

      --
      After all, I am strangely colored.
    24. Re:Verification? by Endo13 · · Score: 1
      Um... did you even read all of the post you're replying to?

      The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct. That sounds like a pretty good system to me.

      You do realize that "a number of other people" here could refer to even several dozen or several hundred?

      So clearly it's not going to be just one person that determines the answer for unknown captchas.
      --
      There is no -1 Disagree mod. Slashdot.org/faq defines mod options. USE IT.
    25. Re:Verification? by Bluedove · · Score: 1

      it's even easier than that. the same captcha gets presented randomly to two people (quasi-)simultaneously. Unless both parties type the same word, neither is verified. This gives an added incentive to people to type what it really says. Of course, it leaves open some room for the sub-class of assholes who will gleefully spend hours typing the wrong answer to captchas, just to fsck with the other guy. That's why you always have to give a failed verification a couple of chances. Hey, sometimes i mistype my password a few times, too. It's no big deal.

    26. Re:Verification? by Anonymous Coward · · Score: 0

      My submissions to this system have already included the words:
      Cunnilingus, Prostate, queer, twat, Cock and Cunt.

    27. Re:Verification? by delinear · · Score: 1

      As the other poster stated, the image is distorted so that standard OCR technology won't be able to make a very good guess with any certainty. Additionally I think a double-blind is used, where two words are displayed, one the system knows in advance and one it doesn't. The OCR would have to get the first one exactly correct in order to distort the results of the second one (since if the first verification check is failed the suggested word is discarded completely and has no skewing effects on results). If OCR was already capable of that, captchas would already be pretty useless, so it seems like a fairly secure method.

    28. Re:Verification? by Anonymous Coward · · Score: 0

      If they solve the one for which the answer is known, the system assumes their answer is correct for the new one.

      That's a REALLY bad assumption. I get captchas wrong all the time. Sometimes I swear a computer would have an easier task than me on some of these damned captchas, particularly when you have a captcha that is one letter off from a completely different word.

      Computers don't get drunk, stoned, or tired. I do. And I haven't had enough coffee yet.

      Is that captcha "matter" or "natter"? With no context it's nearly impossible to tell. I'll just have to take a guess...

      -mcgrew

    29. Re:Verification? by spikedvodka · · Score: 1

      Why not have the CAPTCHA look something like this: (use K for a known character and U for an unknown character)
      KUKUKUK (or some other random permutation of K & U in the desired length)

      This way, you can
      a) check all the K's for validity, if so, then ACCEPT
      b) Break up words so that they aren't as easily recognizable
      c) Still allows you to compare different people's answers for U's, as you aren't using them for validity
      d) I would think that this method would reduce the number of "Jackasses" because you never know which characters are K's and which are U's

      My thought would be:
      If all K's don't Match, die('Sorry, wrong, please try again')
      else
      store the user entered value for U[n] in a list
      When number of entries in list for any given U >= $StatisticallyRelevantSampleSize & Conformity of results >=95%, accept result and move the U to a K, as we now know the correct value
      if number of entries in list for any given U > 3*$StatisticallyRelevantSampleSize & Conformity of results < 90%, remove U, and send it to a "Trusted Human" for analysis.

      This reduces the work load on the "Trusted Human", but it also assumes that the computer can find character boundaries

      --
      I will not give in to the terrorists. I will not become fearful.
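
The threshold rules in the comment above translate fairly directly into code. The sketch below is one reading of them, not a definitive implementation: `SAMPLE_SIZE` stands in for `$StatisticallyRelevantSampleSize` with a made-up value, and the second rule is read as "conformity below 90%", which is an assumption.

```python
from collections import Counter

SAMPLE_SIZE = 20  # stand-in for $StatisticallyRelevantSampleSize; value is a guess


def classify(entries, sample_size=SAMPLE_SIZE):
    """Decide the fate of one unknown item U from the answers collected so far.

    Returns ("known", value), ("escalate", None) or ("pending", None).
    """
    n = len(entries)
    if n == 0:
        return ("pending", None)
    value, votes = Counter(entries).most_common(1)[0]
    conformity = votes / n  # share of answers matching the top answer
    if n >= sample_size and conformity >= 0.95:
        return ("known", value)    # promote U to K
    if n > 3 * sample_size and conformity < 0.90:
        return ("escalate", None)  # hand off to a "Trusted Human"
    return ("pending", None)
```

Items that fall between the two rules simply stay pending and keep collecting answers.
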
    30. Re:Verification? by poot_rootbeer · · Score: 1

      it's not like you're only allowed to present a given unknown word once. Present it many times, and use the word with the most hits.

      True. But captchas generally require prompt feedback; you want to know right away whether or not the user has passed the Turing test, not leave it unknown for a couple hours until a sufficient number of other users have submitted their answers to establish a consensus.

    31. Re:Verification? by GuldKalle · · Score: 1

      That's why you use the first word (also a digitized word, but one that has been verified by the technique) for the Turing test, and the second word to extend the Turing test vocabulary.

      --
      What?
    32. Re:Verification? by Anonymous Coward · · Score: 0

      Using your notation, consider the two CAPTCHAs:

      KKKKKKKK, and UUUUUUUU.

      The idea as presented in the article is to have the user fill out both of them, and use only the known one as verification for CAPTCHA-correctness. Remember, the user won't know which is which. Assuming the known CAPTCHA is passed, the user's data for the unknown CAPTCHA is stored for later statistical analysis.

      Your idea is a variation on this. But it won't work as well because of kerning and ligatures.

  2. Better links by Falkkin · · Score: 4, Informative

    The article is lacking some information. Here are some better links:

    Official reCAPTCHA site
    Hide your email address with reCAPTCHA (super easy!)
    A more detailed blog post about how the system works

    Disclaimer: I work with Luis von Ahn, who's the professor running the reCAPTCHA project.

    1. Re:Better links by inKubus · · Score: 4, Interesting

      Also, Amazon has a pretty cool program where you can perform HITs (Human Intelligence Tasks) for a few cents each. They have a lot of stuff like transcribing podcasts, identifying stuff in satellite images, etc.

      --
      Cool! Amazing Toys.
    2. Re:Better links by JonathanR · · Score: 1

      identifying stuff in satellite images, etc.

      I hope George Tenet is not the director of that business unit...
    3. Re:Better links by Arancaytar · · Score: 1

      HITs (Human Intelligence Tasks) for a few cents each


      Brings to mind a dystopian (and fictional) future where robots lord it over us but still need us to process large amounts of data for them. Like the Matrix, but without violating the laws of Thermodynamics. That'd make a cool SF novel, I think...
  3. Official reCAPTCHA site by traindirector · · Score: 4, Informative

    I originally missed the link to the official site - D'oh. The article also doesn't mention that the system is already in use! http://recaptcha.net/

    1. Re:Official reCAPTCHA site by caffeinemessiah · · Score: 4, Informative

      There's an interesting solution to this problem -- the "scientist at Carnegie Mellon" is Luis von Ahn who was recently awarded a MacArthur genius award. In optical recognition tasks like this where the "true" answer is not known, how do you verify that a human agent correctly did the recognition? Just see if a bunch of other users type the same thing. It's a clever twist on consensus voting, and was recently snatched up by Google as "Google image labeler" here.

      --
      An old-timer with old-timey ideas.
    2. Re:Official reCAPTCHA site by MindStalker · · Score: 2, Interesting

      Problem is, for the first few people seeing a new captcha the computer will have to let you through even if you guess wrong, so the lock feature of the captcha doesn't work.

      As others mentioned, this system gives you a known then an unknown, though I think it's stupid that it further makes it difficult by putting a slash through it and making it wavy. Helloo, if your system had a hard time recognizing it, why do you want to make it harder to recognize? I saw several in the examples in which the word was non-English and I had a hard time guessing the correct spelling because I couldn't make out a letter. There needs to be an "I don't know" button as well :)

    3. Re:Official reCAPTCHA site by Pollardito · · Score: 1

      There's an interesting solution to this problem -- the "scientist at Carnegie Mellon" is Luis von Ahn who was recently awarded a MacArthur genius award. In optical recognition tasks like this where the "true" answer is not known, how do you verify that a human agent correctly did the recognition? Just see if a bunch of other users type the same thing. It's a clever twist on consensus voting, and was recently snatched up by Google as "Google image labeler" here.

      it was also previously available as The ESP Game, from...(wait for it)...Carnegie Mellon
    4. Re:Official reCAPTCHA site by maaskaas · · Score: 1

      There is a Google Techtalk with "this scientist" available here: http://video.google.com/videoplay?docid=-8246463980976635143. This way of using simple techniques is so innovative, because it can perform really complex tasks that no one thought about before. Some of his other projects include online "games" (Peekaboom) which collect relevant data from the users playing them so that a computer can determine what kind of object is in a digital picture, and where it is located. It is really a big step for innovative and smart computing...

    5. Re:Official reCAPTCHA site by name*censored* · · Score: 1

      No, because the "lock" feature of the captcha isn't the word that is being digitised, it is an unmodified run-of-the-mill captcha (GPP). This merely adds a little bit of work for the user (and I'm assuming once the mystery word has been verified by consensus, they'd begin to exclusively use the mystery word). They may even just take the mystery word in the original text (the one being digitised) and couple it with the one before/after that *WAS* translatable, but was then captcha-tised for regular captcha use. This would help the user by providing some context for the word (that might appear to be one of many solutions - as you said, they're sometimes indecipherable).

      --
      Commodore64_love: I don't comprehend people who're so frightened of death that they'll bankrupt themselves to stay alive
    6. Re:Official reCAPTCHA site by neongrau · · Score: 1

      ok, so they "ripped off" the concept of /. meta moderation ;)

      sue them!

      j/k

    7. Re:Official reCAPTCHA site by evilviper · · Score: 1

      how do you verify that a human agent correctly did the recognition? Just see if a bunch of other users type the same thing.

      I'm not going to type in a captcha and just wait around on the page for an hour until X other people try to answer it... This system of yours gives priority to the answers of the first few people that see it, which may well be the OCR system of some spammers.

      Even more, once you've got the first few answers, then it's just a typical captcha, as you already have had it entered, and know what it says.

      And if you want some hybrid approach... Where the first word is known, and the second one isn't, you're making twice as much work for the people, with a captcha that is only half as dependable at catching spammers. In addition, I mistype things like captchas all the time... If my typo is in the unknown word, this whole system is hosed.

      If you want people to transcribe a book... just ask them to volunteer a few minutes of their time. There's no shortage. If, however, you want to do a captcha, do it the normal, reliable, old fashioned way.
      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    8. Re:Official reCAPTCHA site by Anonymous Coward · · Score: 0

      the "scientist at Carnegie Mellon" is Luis von Ahn who was recently awarded a MacArthur genius award

      Cool! I feel I received that award too. Upon reading the /. headline I asked myself "uh oh!? how do they know if/when the answer is correct then?". Then I thought about it for about 5 seconds and realized: "easy, simply ask several people" (solving the "first answer" problem being left as an exercise to the reader).

      So I guess I qualify for that genius award.

      Five seconds.

    9. Re:Official reCAPTCHA site by GeneJoker · · Score: 1

      So I take a try and google image label, and the SECOND GODDAMN PICTURE was furry porn.
      I hate you internet.
      (I'm sorry I didn't mean it we'll never fight again)

    10. Re:Official reCAPTCHA site by rholliday · · Score: 1

      There needs to be an "I don't know" button as well :)
      There is. If you take another look you'll note there's a button to reload it and get a new one, as well as one to get an audio challenge.
      --
      Xbox reviews.. We think they're funny.
  4. Exactly what I was wondering by raygundan · · Score: 1

    They need a way to verify if the answer is correct... if they know the answer, they don't need help digitizing the hard-to-read text. If they don't know the answer, it won't work as a CAPTCHA.

    Am I missing something fundamental here?

    1. Re:Exactly what I was wondering by Falkkin · · Score: 3, Informative

      The system serves two words to the user. The system knows the correct answer to one of these words -- this is the one used to test whether the user is a human or a bot. If the user got the test word right, then there's a good chance they also got the unknown word right. If a bunch of humans all agree on the same transcription of a given unknown word, the system will eventually "know" the correct spelling of the unknown word and can then serve it as a "known" word in the future.

    2. Re:Exactly what I was wondering by hpavc · · Score: 3, Insightful

      It likely has a good idea of the 'unknown' word as well. In the example "This aged portion of society were distinguished from", the OCR didn't cut it but it did kick-start a guess. At least on "This -> niis" it can see it's not 'ZOMG' or 'Fark' easily enough.

      Also it wouldn't take much to add some grammar to pad the guessing. While we see two words, the system sees them in at least two contexts.

      Obviously it has the actual dictionary to help it basically spell check the words we submit to it. If the words we give it are complete garbage, it's unlikely to go for it. Which is how it knows that "niis" needs a correction.

      --
      members are seeing something, your seeing an ad
    3. Re:Exactly what I was wondering by camperdave · · Score: 1

      It isn't being used as a captcha. It is being disguised as a captcha in order to get a person to translate:

      Computer: [unreadable scribble]
      User: Bartholomew
      Computer: Please try again.
      [Captcha image]

      User:Red49
      Computer: Access Granted!
      Computer to OCR Central: [unreadable scribble]="Bartholomew"

      --
      When our name is on the back of your car, we're behind you all the way!
  5. They should put a million monkeys on this by Anonymous Coward · · Score: 0

    They should put a million monkeys on this

  6. A captcha doesn't have to function as a password by aeschenkarnos · · Score: 1
    It can instead be a "little job" that must be done before you get to the pr0n.

    However the Iron Internet Law of "lolz > human decency" applies ... and we can look forward to books being translated as "chucknorrischucknorrischucknorrischurknorris..."

  7. idea by brunascle · · Score: 1

    someone set up a database of what the words really say along with what we should type instead, and make it public. it'll be fun! like mad libs!

    1. Re:idea by Kaetemi · · Score: 1

      Apparently it seems to be having some trouble with seeing the difference between "which" and "witch" ^^

      --
      Kaetemi
  8. Re:OK, how to defeat - by Anonymous Coward · · Score: 0

    if you wanted everyone to do it, you shouldn't have used something political. at least you went with a likely majority...

  9. the obvious question by Anonymous Coward · · Score: 0

    Nicked from the appropriate website, should answer my (and likely a lot of other people's) first question

    But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.

  10. More than just digitizing text by penguinbroker · · Score: 3, Informative
    This would also be a great approach to a lot of NLP/Translation annotation tasks. Although these types of tasks generally require a robustness (knowing which answers to trust and which to ignore) that anonymity makes difficult.

    http://yro.slashdot.org/article.pl?sid=07/04/03/2211258

    I believe amazon.com has filed a patent for a solution to this problem which attributes every annotation input to a unique user id. They then claim to use the average accuracy over the history of that user for whittling away the, 2 out of 10 i think the patent says, worst answers.

    i'm sure some form of quality control/check will be needed and i wonder if such a solution would infringe on this patent?
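
The quality-control idea described above (track each user's historical accuracy and discard the least trustworthy answers) might be sketched like this. It is not based on the patent text: the function name, the data shapes, and the default of dropping 20% (the "2 out of 10" recalled above) are all assumptions for illustration.

```python
def trim_by_history(answers, accuracy, drop_fraction=0.2):
    """Drop the answers submitted by the users with the worst track record.

    answers:  list of (user_id, answer) pairs for one item
    accuracy: dict mapping user_id to historical accuracy in [0, 1];
              users with no history are treated as 0.0 (an assumption)
    """
    ranked = sorted(answers, key=lambda ua: accuracy.get(ua[0], 0.0), reverse=True)
    keep = len(ranked) - int(len(ranked) * drop_fraction)
    return ranked[:keep]
```

With ten answers and the default fraction, the two answers from the lowest-accuracy users are removed before any vote is taken.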

    1. Re:More than just digitizing text by Chysn · · Score: 1

      > This would also be a great approach to a lot of NLP/Translation annotation tasks.

      This would also be a great approach to solving captchas on other sites. Wanna buy tickets with a bot but have a captcha in your way? Set up a third-party captcha server to have humans solve your captchas for you!

      --
      --I'm so big, my sig has its own sig.
      -- See?
  11. Booger by Tablizer · · Score: 2, Insightful

    What if the OCR cannot read a word because there was a booger on it during the scan? A human won't be able to determine it either because it will be mostly a blotch. How are they gonna know the difference between human-decipherable words and lost-cause words (such as booger blotches)?

    1. Re:Booger by penguinbroker · · Score: 1

      they could offer a button that states : "this is not text" and then the user would be given a whole new captcha

    2. Re:Booger by Tablizer · · Score: 1

      they could offer a button that states : "this is not text" and then the user would be given a whole new captcha

      But then a bot may exploit it by passing on until it finds one it can process.

    3. Re:Booger by penguinbroker · · Score: 1

      a lot of login boxes have a system of recording successive requests from an ip address to stop brute force login hacks. a similar method could be used here.
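
A per-IP limit on "give me a new captcha" requests, as suggested above, could be a simple sliding window. This is a minimal in-memory sketch; the class name and the limits are hypothetical, and a real deployment would need shared storage and would have to cope with many users behind one address.

```python
import time
from collections import defaultdict, deque


class SkipLimiter:
    """Allow at most max_skips "new captcha" requests per IP within window seconds."""

    def __init__(self, max_skips=5, window=60.0):
        self.max_skips = max_skips
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of recent skips

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # forget skips that have aged out of the window
        if len(hits) >= self.max_skips:
            return False
        hits.append(now)
        return True
```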

  12. How it could work by AaronW · · Score: 2, Insightful

    I can see how this would work, but in order to also provide security, extra letters or words would also need to be in the captcha. I.e. if there's an un-OCRable word "between", the captcha could contain "frog between" or something like that, and the first word could be a previous un-OCRable word that has been validated by enough people.

    Another method might be to separate out the un-OCRable letters from words and sprinkle them with known letters, though this might be less effective since people can often recognize words far better than individual letters. If one or two letters in a word cannot be interpreted, a person can often still read the entire word.

    --
    This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
  13. A spam tactic? by yogikoudou · · Score: 1
    Spammers are already using CAPTCHA techniques to automate account creations on protected websites:
    1. Have a person subscribe to a porn website by typing in a CAPTCHA image that comes from a legitimate website.
    2. The user provides the correct word while subscribing
    3. Not even a "???" step
    4. Profit! The protected website is spammed.
    I'm wondering whether this system will be used for legitimate OCR purposes or for more spam...
    1. Re:A spam tactic? by Falkkin · · Score: 1

      CAPTCHAs simply tell whether there's a person sitting at the other end of the machine. No CAPTCHA can tell you whether that person is a malicious user or not. With this approach, at least the spammers are helping to digitize books.

    2. Re:A spam tactic? by 1u3hr · · Score: 1
      > Spammers are already using CAPTCHA techniques to automate account creations on protected websites

      Really? How do you know this? Can you give an example of a porn site that asks for captchas? If not, it's an urban legend.

      I've seen this suggested as an attack on captchas, but never heard of any site that put it into practice. Probably it is simpler to pay some third-world computer sweatshop worker to solve hundreds of them per hour for a few dollars a day. But that's equally a conjecture.

      Dodgy free porn sites (which is a large percentage) often have nasty exploits to infect your computer -- never browse porn with IE.

  14. Re:How does this help? by Anonymous Coward · · Score: 0

    Er, how about reading the actual idea? Or at least a few comments explaining it?

    The concept is as follows: the software has a list of known words (graphical data and transcriptions) and a list of unknown words (graphical data only). As a CAPTCHA, it presents one known and one unknown word. If the user transcribes the known word correctly, they pass the test, and data is contributed for the unknown word, which will eventually, by this process, make its way into the "known words" list.
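
    The pairing described above can be sketched in a few lines of Python. This is only an illustration: the `known` table, the image ids and `check_challenge` are hypothetical names, not taken from reCAPTCHA itself.

```python
from collections import Counter, defaultdict

known = {"img-001": "between"}   # image id -> verified transcription (sample data)
guesses = defaultdict(Counter)   # unknown image id -> tally of submitted answers


def check_challenge(known_id, known_answer, unknown_id, unknown_answer):
    """Pass or fail on the known word only; record the other answer if passed."""
    passed = known_answer.strip().lower() == known[known_id].lower()
    if passed:
        # The unknown word never decides the outcome; it is only collected.
        guesses[unknown_id][unknown_answer.strip().lower()] += 1
    return passed
```

    Note that an answer for the unknown word is stored only when the known word was typed correctly, so failed attempts contribute nothing to the tally.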

  15. Look up the human computation google talk by Rix · · Score: 1

    They probably just accept the first x entries until they have a base for comparison. The entries will converge on correctness.

    1. Re:Look up the human computation google talk by doyoulikeworms · · Score: 1

      http://video.google.com/videoplay?docid=-8246463980976635143 Pretty interesting. I especially love the porn aspect...

  16. Re:OK, how to defeat - by wsanders · · Score: 0

    OK, for the humor impaired:

    BUSH IS AN IDIOT

    then you can leave off the Obama part.

    Oh, come on, somebody mod this funny - it's even on-topic. Puhleeez?

    --
    Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
  17. A better scheme by Yossarian45793 · · Score: 1

    A better scheme would be to give out the same CAPTCHA to 2 or more users. If they agree on the answer, then there's a better chance that the text is correct.

    1. Re:A better scheme by tepples · · Score: 2, Informative

      > A better scheme would be to give out the same CAPTCHA to 2 or more users. If they agree on the answer, then there's a better chance that the text is correct.

      The article states that the system already does this.
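
      One way the agreement check might look in Python, as a sketch only. The thresholds `min_votes` and `min_share` are assumptions picked for illustration; the article does not say how many matching answers the real system requires.

```python
from collections import Counter


def consensus(answers, min_votes=3, min_share=0.75):
    """Return the agreed transcription, or None if agreement is too weak."""
    if len(answers) < min_votes:
        return None  # not enough people have seen the word yet
    tally = Counter(a.strip().lower() for a in answers)
    word, votes = tally.most_common(1)[0]
    return word if votes / len(answers) >= min_share else None
```

      Until `consensus` returns a word, the image stays in the "unknown" pool and keeps being shown to new users.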
  18. So... If you thought that CAPTCHAs were hard... by vitalyb · · Score: 1

    ...Wait till you see these new CAPTCHAS.

    Mushed text with letters that slide into each other, bad lighting and every other kind of bad scanning you can imagine. Hell, you'd be lucky if you can recognize letters at all.

    Question is, if the machine couldn't figure out what the word is, how will it verify your answer? Is it going to be something along the lines of "by the popular vote"?

    Something is very not right in all this.

  19. Re:I got my digitized copy of the US Constitution by multipartmixed · · Score: 2, Funny

    Constitution, consititution...

    Oh! You mean the "E. Plebnista?"

    --

    Do daemons dream of electric sleep()?
  20. A pain for users by EssenceLumin · · Score: 2, Insightful

    Great, so now I would have to fill out two of those stupid things instead of one. Why would a company want to inflict this on its users?

    1. Re:A pain for users by Peppersnail · · Score: 1

      Poor you! Now, every other week where you'll want to join a new forum (that's a LOT of forums), you'll have to solve two CAPTCHAs instead of one! Waaaaaah!

    2. Re:A pain for users by EssenceLumin · · Score: 1

      Oh whatever. When I do want to join a site they sometimes have captchas that I can't figure out and have to try five of them or however many. It's frustrating. Throw in a second one which was put there because it is illegible and I'll say oh well, forget it. Forums aren't the only places using them either.

    3. Re:A pain for users by Falkkin · · Score: 1

      I think it's faster for me to process and type two English words than to type one word that's just random letters. My brain is trained to read and type English; it's not trained to type nonsense.

  21. Here's an early test phrase... by jpellino · · Score: 1

    owha tajer kiam

    --
    "Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."
  22. Re:OK, how to defeat - by Anonymous Coward · · Score: 0

    If you can tell me what Obama's platform and beliefs really are, then OK, fine. But I'm not even sure Obama supporters know anything other than that he seems to be a smart guy.

    Bush may or may not be our best selection ever (definitely not the best, but he's in office and we need to support him not make his job harder so you can spout more reasons why he sucks), but it sounds like you'd vote for the donkey.

  23. Re:I got my digitized copy of the US Constitution by arodland · · Score: 1

    I bow to you, sir.

  24. This explains by Joebert · · Score: 1, Funny
    Please type the characters you see in the following image to register.

    George Bush

    Type: Miserable Failure

    Thank you, click here to proceed.
    --
    Wanna fight ? Bend over, stick your head up your ass, and fight for air.
  25. Amazon's Mechanical Turk by scott_karana · · Score: 1

    This project isn't the first of its sort: Amazon has the Mechanical Turk project, where users perform various tasks similar to CAPTCHAs for amazon.com credit.

    http://www.mturk.com/

  26. Great CAPTCHA solution to solve people not RTFA! by GNUALMAFUERTE · · Score: 5, Interesting

    Come on people, start using your brains, please! Just a little! Half the posters have been asking the same 2 stupid questions, or even worse, posting the same 2 stupid questions with the question mark removed, as if they were facts.

    We should put a CAPTCHA system on slashdot:

    When you want to post, you get to type in a CAPTCHA. The image for this is generated in this way:

      - The links to the article/s actually link to a page with a javascript wrapper that loads the article text, but replaces certain words with the graphical representation of that word, in the form of a CAPTCHA.
      - These words form a phrase that the user must type in if he wants to post. There are different combinations of phrases selected from the article, and each poster gets one randomly.

    This technology should be called CAPSSAA (for Completely Automated Public Stupidity test to tell Slashdotters and Assholes Apart)

    --
    WTF am I doing replying to an AC at 5 A.M on a Friday night?
  27. Working as intended by mythar · · Score: 1

    In a hole in the ground there lived a penis. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort.

    (yeah, i'd trust the internet community to digitize my books. why don't we just cut out the middle-man, and create a wiki-gutenberg project?)

  28. Re:OK, how to defeat - by Anonymous Coward · · Score: 0

    > Bush may or may not be our best selection ever (definitely not the best, but he's in office and we need to support him not make his job harder so you can spout more reasons why he sucks), but it sounds like you'd vote for the donkey.

    We do? I don't remember many Republicans supporting Clinton during the Lewinsky affair. Nor do I remember many people supporting Nixon during Watergate -- and rightfully so. Bush has done worse to the country than either of them.

  29. Mod parent up by xzqx · · Score: 1

    Brilliant.

  30. How stupid by holophrastic · · Score: 1

    So, let me get this straight. There are systems out there, in the wild so to speak, that offer security by presenting a task that humans can do easily but machines have trouble doing. And now, this very same system is going to assist machines in solving the very inability upon which the system is based.

    That's the dumbest most retarded (traditional sense of the word) thing that I've ever heard.

  31. What if it can't be typed? by Anonymous Coward · · Score: 0
    I found one problem when trying this on their site. How should it handle things that can't be typed on a keyboard? For example, two of the CAPTCHAs were

    law. (that should be "law." followed by "10" in superscript)

    and

    ædicule
    I don't think either of these would be done correctly. Most people would just type "law.10" for the first, and "aedicule" for the second. Ones that include characters without a similar key, e.g. (that should be the symbol Omega), might wind up not being solved at all.
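
    If the checker wanted to accept the keyboard-friendly spellings, one option is to fold both the stored text and the typed answer before comparing them. The sketch below is an assumption about how that could be done, not something the reCAPTCHA site describes; `fold` and the `LIGATURES` table are hypothetical.

```python
import unicodedata

# Hypothetical ligature table; NFKD does not split these on its own.
LIGATURES = {"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE"}


def fold(answer):
    """Reduce an answer to a keyboard-friendly form for comparison."""
    expanded = "".join(LIGATURES.get(ch, ch) for ch in answer)
    # NFKD turns superscript digits into plain digits and splits off accents.
    decomposed = unicodedata.normalize("NFKD", expanded)
    # Drop the combining accents left over after decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()
```

    Folding handles the superscript and the ligature, but a symbol with no keyboard equivalent at all, like the Omega, would still need a skip button or similar.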
  32. Missed opportunity by h4ter · · Score: 1

    From the security page of the reCAPTCHA site: "if somebody writes a program that can read our distorted images, we can add more distortions in very little time"

    If someone can write a program to solve the distorted images of OCR-unreadable words, don't you just hire that guy to do your OCR and get out of the CAPTCHA business?

  33. Image spam by JeremyR · · Score: 4, Interesting

    Maybe this technique can be adapted to fight image spam more effectively :-)

  34. CAPTCHA+CAPTCHA by k3vlar · · Score: 1, Redundant

    I thought the point of CAPTCHAs was to compare what a user types with information stored on the hosting server. If the hosting server doesn't know what the book says, then how can it validate the CAPTCHA?

    --
    Unlike porn, which yada yada rimshot hey-ooh!
  35. Still a dumb solution for a CAPTCHA by Moraelin · · Score: 1

    See, for recognizing words, that's ok. You can give it to 200 users spread over 10 days and see what most said. So, yes, I'm not surprised that Google does the same thing, but the catch is: not as a captcha.

    It's just about the most idiotic idea I've ever heard for a _CAPTCHA_. Here's why:

    1. What about the first person that sees any given word? Do you let them get in regardless of what they type (remember, there is no consensus yet about that word)? Or will I have to wait another 2 weeks to see if my post is allowed on Slashdot?

    What about the second or third attempts at a word, for that matter? If the first two guys said it's "goatse" and I say it's "apple", how do you know which of us is the bullshitter? Did you stumble upon two jokers on the first two tries, or am I a bot? Basically for each word there is a sizable window where you still don't know what it means yet. Statistical consensus doesn't exist yet. At that point you're basically stuck accepting anything whatsoever. And since you'll want to use more than one word, that window of opportunity will come again and again. Maybe a quarter of the time you're essentially not yet knowing what it says and whether the user is bullshitting you.

    Or in other words, will (A) an attacker just have to try until he stumbles upon a word for which no consensus exists yet? Or (B) you'll inconvenience legitimate users even more than the idiotic captchas already do?

    2. It necessarily involves repetition. Otherwise you can't build consensus. So it's actually worse than current captchas. You can still crack them by paying a couple of unskilled workers in Elbonia to just crack captchas for $1 per hour, but this time you can also cache the ones they already cracked. The same image is bound to appear again sooner or later, and then a computer can crack it automatically.

    3. Most of the words scanned from books are actually easier to automatically crack by OCR. Yeah, the OCR might fuck-up a letter somewhere, but it's easy to run that through a spellchecker to make an educated guess. Or even just take a random statistical guess. Even guessing at the ratio of consonants to vowels will give you better odds for most languages than the current captchas. So if someone wants to use bots to spam, you've just made his job _easier_.

    4. However, a good portion are actually harder for an average user. E.g., if it comes from some manuscript in some medieval gothic script, and some worn/discoloured/whatever manuscript at that, I might get a headache trying to decipher it even as a human. Or what if it contains some phrase in Cyrillic, Greek, or some made-up script? To a machine it looks like just the next word in the sequence. Captchas are already a usability nightmare, this would just make it an even bigger nightmare for a lot of people.

    5. It can be deliberately poisoned. Even with two words (one known, one unknown), it only takes an army of jokers or bots who pick the first or second to answer right, and answer "goatse" to the other. You'll still get your majority eventually, but it will take longer and, as statistics flukes work, occasionally you'll get 5 "goatse" answers for a word before you get even one right answer. Do you start rejecting people who said something else yet?

    6. It solves none of the _real_ problems with captchas. E.g., they're still crackable by proxy, or by sweatshops with 1-2 guys cracking captchas at $1-$2 per hour. E.g., it still is a usability nightmare for a lot of very real people.

    So I don't care how much of a genius he might be in an unrelated domain, or who else uses the same approach... for a completely different problem. Both are just appeals to false authority.

    Even geniuses occasionally get a dumb idea. Tesla, for example, was one of the greatest geniuses of this century. He did get a _lot_ of SF ideas, though, like time travel machines, death rays, thought photography, walls of light, etc. Stuff which can't possibly work. E.g., his thought photography was based on the idea that mental im

    --
    A polar bear is a cartesian bear after a coordinate transform.
    1. Re:Still a dumb solution for a CAPTCHA by marcello_dl · · Score: 1

      You raise some good points. Anyway, if the submitted answers are weighted by their string similarity to the OCR's (wrong) guess and to a dictionary, it becomes easier to spot spammers, unless the spammers run their own OCR and post the second-best dictionary guess. Either way, no "goatse" instead of "slashdot". Add to that the captcha server being centralized, so bots can be analysed by submission and/or error frequency and demoted or given "impossible" captchas.

      --
      ---- MISSING MISCELLANEOUS DATA SEGMENT --- [sigdash] trolololol
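
      The weighting idea in this reply could look roughly like the Python below. It is a sketch under assumptions: `answer_weight`, the 0.5 penalty for non-dictionary words and the sample word set are invented for illustration, and `difflib.SequenceMatcher` stands in for whatever string-similarity measure a real system would use.

```python
from difflib import SequenceMatcher


def answer_weight(answer, ocr_guess, dictionary):
    """Score an answer in [0, 1]: close to the OCR guess and a real word scores high."""
    answer = answer.strip().lower()
    similarity = SequenceMatcher(None, answer, ocr_guess.lower()).ratio()
    # Assumed penalty: answers that are not dictionary words count half.
    in_dictionary = 1.0 if answer in dictionary else 0.5
    return similarity * in_dictionary
```

      With an OCR guess of "slashd0t", a human answer of "slashdot" scores far higher than "goatse", so junk answers can be given little weight when the votes are tallied.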
  36. Hmmm, That Looks Like A... by WiseWeasel · · Score: 2, Funny

    Damnit, where's the smushed bug key?!?

    --
    "I like systems, their application excepted", George Sand (French)
  37. Its a Game by Anonymous Coward · · Score: 0

    Replace one of the words with nonsense and hit submit. If it passes you get a point. The challenge is to figure out what word the computer is least likely to know.

  38. You're all missing the point by kilgoretrout99 · · Score: 1

    The second request is by definition not a CAPTCHA, since the answer is not known. They're using you to try and determine that answer. This after they've met their security criteria by using a real CAPTCHA. That means this is just unpaid labour! Wait 'till my union rep finds out about this, there'll be trouble!!

  39. Re:Great CAPTCHA solution to solve people not RTFA by xtracto · · Score: 1

    Uh that sounds a lot like the Prince of Persia "anti-pirate" feature which asked you to drink the bottle with the letter in:
    "Page 13, Line 4, Word 5, Letter 2", after ending the first level...

    Nothing that a Hex editor operation in the .SAV file could not fix ;-)

    --
    Ubuntu is an African word meaning 'I can't configure Debian'
  40. Re:Great CAPTCHA solution to solve people not RTFA by parkrrrr · · Score: 1

    Your solution assumes that it's actually possible to tell Slashdotters and assholes apart.

    I believe it is doomed to fail.

  41. CAPTCHAs are bad design by ElForesto · · Score: 1

    Any method of anti-spam that causes the user to jump through hoops is a bad design. CAPTCHAs are no more effective than a battery of tests against content at preventing spam, period. While an unscrupulous website operator can lift the CAPTCHA and get unwitting users to submit it, they can't fool systems like, say, Spam Karma that test for the characteristics of spam. I've been using it for quite a while and it's been 100% accurate in telling me what is or is not spam while providing zero inconvenience to the end user. About the only way for spammers to sneak it by is to *gasp* leave comments using a real person, a task so expensive that it's not worth it.

    --
    There is a difference between "insightful" and "inciteful" other than spelling.
  42. This sounds like it won't work by unablepostAC · · Score: 1

    How do they know if what I type is the real text, if they don't know in advance what it says?
    And if they already know what it says, then why would they need someone else to type it for the first time?
    It shows the extent to which academics can be out of touch with reality.

  43. As you probably noticed... by Moraelin · · Score: 1

    As you probably noticed, my 3rd objection was, essentially, "but spammers could run it through an OCR and then guess at the 1-2 misshapen letters". So you're telling me that then the system would do the same to validate that you're not a bot.

    I dunno... it seems to me that, au contraire, you just described a way to make it easier for bots to pass. Magna cum laude.

    You even have the exact way to tune it for maximum effect: the guys with the same OCR software are more likely to pass. Even if you don't exactly know which algorithm they're using, you can just try several and see which gets through those captchas more often.

    Note however that you don't even need to be _too_ well tuned. If your OCR software misses maybe a letter in each word, you have a 1/26 chance to pass by just picking a random letter there. If it missed two, you have a 1/676 chance by sheer random chance. Those are _excellent_ odds to get a bot through. A distributed army of zombies could create tens of thousands of spam accounts per day that way.

    --
    A polar bear is a cartesian bear after a coordinate transform.
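
    The 1/26 and 1/676 figures above are just powers of 1/26, which a few lines of Python can confirm. This assumes uniform random guessing over a 26-letter alphabet and that the guessed word is the one used for verification; `guess_odds` and `expected_passes` are illustrative names.

```python
from fractions import Fraction


def guess_odds(missing_letters, alphabet_size=26):
    """Chance of filling every unread letter correctly by uniform guessing."""
    return Fraction(1, alphabet_size) ** missing_letters


def expected_passes(attempts, missing_letters):
    """Average number of challenges passed across a number of attempts."""
    return attempts * guess_odds(missing_letters)
```

    For instance, `expected_passes(676000, 2)` works out to 1000 under these assumptions, which is the scale of the zombie-army scenario described above.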
    1. Re:As you probably noticed... by marcello_dl · · Score: 1

      But they don't have the same data as the "captcha server" to run OCR on; they have the distorted version. The similarity is weighed against the original data, so bots would need a pretty good OCR program, equivalent to a pretty good captcha decoder.

      --
      ---- MISSING MISCELLANEOUS DATA SEGMENT --- [sigdash] trolololol
  44. Re:Great CAPTCHA solution to solve people not RTFA by mopower70 · · Score: 1

    Or better yet, make the CAPTCHA the text of the entire article, thereby forcing people to actually RTFA before being able to post.

  45. Source Material by Flwyd · · Score: 1
    --
    Ceci n'est pas une signature.
  46. Still a great solution for digitizing books by 5of0 · · Score: 1
    Uhh, you seem to be stuck on the idea that they only use the mystery word for captcha.
    Maybe you should actually read the comments you replied to...which quote the reCAPTCHA website:

    > But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.

    In other words, several of your points are entirely invalid. The mystery word is not used at all for verification.
    Several of your other arguments are about how it's not any better of a CAPTCHA. Which is preposterous, because it's not supposed to be better at being a CAPTCHA. It's supposed to be better at digitizing books, which the current CAPTCHA scheme has exactly 0 effectiveness at. Complaining that this isn't any better of a CAPTCHA is like complaining that charity golf tournaments aren't any better than the Masters. It's a ridiculous argument, because charity golf tournaments have an entirely different focus, while still being a golf tournament.
    And it's got a reload button for hard words. You've got a somewhat valid point in that being not random, it is easier to guess. But /. CAPTCHAs are real words too, and they don't do anything towards digitizing books.
    You've still got some fragment of an argument left, but most of it is destroyed by the simple facts that:
    a) Verification is not based on the "mystery" word
    b) It's supposed to be better at digitizing books, not better at being a CAPTCHA

    It's just as good as most CAPTCHAs out there, and it digitizes books. It's a good idea.
    --
    You all have Oo.o and Firefox, so get World Wind.
  47. World's Best CAPTCHA by PGillingwater · · Score: 1

    This class of CAPTCHA is not always going to work first time, every time. It depends upon the subjective opinion or skill of the user. In my view, the ultimate CAPTCHA has been released:

    www.hotcaptcha.com

    --
    Paul Gillingwater
    MBA, CISSP, CISM
  48. Re:Great CAPTCHA solution to solve people not RTFA by GNUALMAFUERTE · · Score: 1

    Oh man, you just brought my childhood back =)

    Thank you for the good memories ... oh so many hours wasted ...

    --
    WTF am I doing replying to an AC at 5 A.M on a Friday night?