Fill Out CAPTCHAs, Digitize Books At The Same Time
alphadogg wrote with a link to a Networld article about a noble endeavor: putting CAPTCHAs to work for the good of humanity. A scientist at Carnegie Mellon is looking to create a new type of security check that will assist in a project meant to digitize and make searchable text from books and printed materials. Above and beyond that, the offering would probably be more secure than most current systems. "Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project."
CAPTCHAs work because the computers sending them already know what the text says; they start with it in text form and change it into a hard-to-read image. In the system discussed in the article, how will the computer verify that the user response actually matches the text? Sure, it could compare the response to its best guess, but if a program trying to guess the text was as sophisticated as the guessing computer, the guess would match.
I imagine the computer sending the picture of the image of hard-to-read text will further obfuscate the image in a way that makes it even more difficult for the computer on the receiving end to decipher, but the article doesn't acknowledge that this is one of the first logical questions in conceiving of / implementing this system in a functional way. The article really should cover this...
The article is lacking some information. Here are some better links:
Official reCAPTCHA site
Hide your email address with reCAPTCHA (super easy!)
A more detailed blog post about how the system works
Disclaimer: I work with Luis von Ahn, who's the professor running the reCAPTCHA project.
I originally missed the link to the official site - D'oh. The article also doesn't mention that the system is already in use! http://recaptcha.net/
They need a way to verify if the answer is correct... if they know the answer, they don't need help digitizing the hard-to-read text. If they don't know the answer, it won't work as a CAPTCHA.
Am I missing something fundamental here?
They should put a million monkeys on this
However the Iron Internet Law of "lolz > human decency" applies ... and we can look forward to books being translated as "chucknorrischucknorrischucknorrischurknorris..."
someone set up a database of what the words really say along with what we should type instead, and make it public. it'll be fun! like mad libs!
if you wanted everyone to do it, you shouldn't have used something political. at least you went with a likely majority...
Nicked from the appropriate website; should answer my (and likely a lot of other people's) first question:
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
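A minimal sketch of the pairing described in that quote, in Python. Every name here (`check_challenge`, `unknown_votes`, the `scan-0001` id) is made up for illustration and is not taken from reCAPTCHA; the case-insensitive comparison is an assumption as well.

```python
from collections import Counter, defaultdict

# Hypothetical store: unknown-image id -> tally of transcriptions typed for it.
unknown_votes = defaultdict(Counter)

def normalize(text):
    # Assumption: comparison ignores case and surrounding whitespace.
    return text.strip().lower()

def check_challenge(control_answer, typed_control, unknown_id, typed_unknown):
    """Pass or fail on the control word only; record the other word as a vote."""
    passed = normalize(typed_control) == normalize(control_answer)
    if passed:
        # The unknown word never decides pass/fail; it is only tallied.
        unknown_votes[unknown_id][normalize(typed_unknown)] += 1
    return passed

ok = check_challenge("frog", "Frog", "scan-0001", "between")
bad = check_challenge("frog", "forg", "scan-0001", "goatse")
```

Here `ok` passes and contributes one vote for "between", while `bad` fails the control word, so its guess for the unknown word is thrown away.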
http://yro.slashdot.org/article.pl?sid=07/04/03/2211258
I believe amazon.com has filed a patent for a solution to this problem which attributes every annotation input to a unique user id. They then claim to use the average accuracy over the history of that user to whittle away the worst answers (2 out of 10, I think the patent says).
i'm sure some form of quality control/check will be needed and i wonder if such a solution would infringe on this patent?
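For what it's worth, the filtering described above could look roughly like this in Python. This is only a guess at the general idea, not the patent's actual method; the function name, the `accuracy` table and the 20% cut-off (taken from the "2 out of 10" recollection) are all assumptions.

```python
def drop_least_reliable(answers, accuracy, drop_fraction=0.2):
    """Keep answers from the users with the best historical accuracy.

    answers:  list of (user_id, transcription) pairs
    accuracy: dict mapping user_id -> historical accuracy in [0, 1]
    """
    # Unknown users are treated as least reliable (an assumption).
    ranked = sorted(answers, key=lambda a: accuracy.get(a[0], 0.0), reverse=True)
    keep = len(ranked) - int(len(ranked) * drop_fraction)
    return ranked[:keep]

history = {"u1": 0.9, "u2": 0.8, "u3": 0.1, "u4": 0.7, "u5": 0.95}
answers = [("u1", "between"), ("u2", "between"), ("u3", "goatse"),
           ("u4", "between"), ("u5", "between")]
kept = drop_least_reliable(answers, history)
```

With five answers and a 20% cut, the single answer from the least accurate user (`u3`) is dropped before any vote is counted.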
What if the OCR cannot read a word because there was a booger on it during the scan? A human won't be able to determine it either because it will be mostly a blotch. How are they gonna know the difference between human-decipherable words and lost-cause words (such as booger blotches)?
Table-ized A.I.
I can see how this would work, but in order to also provide security, extra letters or words would also need to be in the captcha. I.e. if there's an un-OCRable word "between", the captcha could contain "frog between" or something like that, and the first word could be a previous un-OCRable word that has been validated by enough people.
Another method might be to separate out the un-OCRable letters from words and sprinkle them with known letters, though this might be less effective since people can often recognize words far better than individual letters. If one or two letters in a word cannot be interpreted, a person can often still read the entire word.
This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
- Have a person subscribe to a porn website by typing in a CAPTCHA image that comes from a legitimate website.
- The user provides the correct word while subscribing
- Not even a "???" step
- Profit! The protected website is spammed.
I'm wondering whether this system will be used for legitimate OCR purposes or for more spam...
Er, how about reading the actual idea? Or at least a few comments explaining it?
The concept is as follows: the software has a list of known words (graphical data and transcriptions) and a list of unknown words (graphical data). As a CAPTCHA, it presents one known and one unknown word. If the user transcribes the known word correctly, they pass the test, and data is contributed for the unknown word, which will eventually by this process make its way into the "known words" list.
They probably just accept the first x entries until they have a base for comparison. The entries will converge on correctness.
OK, for the humor impaired:
BUSH IS AN IDIOT
then you can leave off the Obama part.
Oh, come on, somebody mod this funny - it's even on-topic. Puhleeez?
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
A better scheme would be to give out the same CAPTCHA to 2 or more users. If they agree on the answer, then there's a better chance that the text is correct.
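A rough Python sketch of that agreement check. The thresholds (`min_votes`, `min_share`) are arbitrary picks for illustration, not anything the reCAPTCHA people have published.

```python
from collections import Counter

def consensus(answers, min_votes=3, min_share=0.6):
    """Return the agreed transcription, or None if there is no consensus yet."""
    counts = Counter(a.strip().lower() for a in answers)
    if not counts:
        return None
    word, votes = counts.most_common(1)[0]
    total = sum(counts.values())
    # Require both enough answers overall and a clear majority for one word.
    if total >= min_votes and votes / total >= min_share:
        return word
    return None

agreed = consensus(["apple", "Apple", "goatse"])
too_few = consensus(["apple", "goatse"])
split = consensus(["apple", "goatse", "frog"])
```

Two matching answers out of three clear the bar here; two answers in total, or a three-way split, leave the word undecided.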
...Wait till you see these new CAPTCHAS.
Mushed text with letters that slide into each other, bad lighting and every other kind of bad scanning you can imagine. Hell, you'd be lucky if you can recognize letters at all.
Question is, if the machine couldn't figure out what the word is, how will it verify your answer? Is it going to be something along the lines of "by the popular vote"?
Something is very not right in all this.
Constitution, consititution...
Oh! You mean the "E. Plebnista?"
Do daemons dream of electric sleep()?
Great, so now I would have to fill out two of those stupid things instead of one. Why would a company want to inflict this on its users?
owha tajer kiam
"Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."
If you can tell me what Obama's platform and beliefs really are, then OK, fine. But I'm not even sure Obama supporters know anything other than that he seems to be a smart guy.
Bush may or may not be our best selection ever (definitely not the best, but he's in office and we need to support him not make his job harder so you can spout more reasons why he sucks), but it sounds like you'd vote for the donkey.
I bow to you, sir.
Type: Miserable Failure
Thankyou, click here to proceed.
Wanna fight ? Bend over, stick your head up your ass, and fight for air.
This project isn't the first of its sort: Amazon has the Mechanical Turk project, where users perform various tasks similar to CAPTCHAs for amazon.com credit.
http://www.mturk.com/
Come on people, start using your brains, please! Just a little! Half the posters have been asking the same 2 stupid questions or, even worse, posting the same 2 stupid questions with the question mark removed, as if they were facts.
We should put a CAPTCHA system on slashdot:
When you want to post, You get to type-in a CAPTCHA. The Image for this is generated in this way:
- The links to the article/s actually link to a page with a javascript wrapper that loads the article text, but replaces certain words with the graphical representation of that word, in the form of a CAPTCHA.
- These words form a phrase that the user must type in if he wants to post. There are different combinations of phrases selected from the article, and each poster gets one randomly.
This technology should be called CAPSSAA (for Completely Automated Public Stupidity test to tell Slashdotters and Assholes Apart)
WTF am I doing replying to an AC at 5 A.M on a Friday night?
In a hole in the ground there lived a penis. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort.
(yeah, i'd trust the internet community to digitize my books. why don't we just cut out the middle-man, and create a wiki-gutenberg project?)
Bush may or may not be our best selection ever (definitely not the best, but he's in office and we need to support him not make his job harder so you can spout more reasons why he sucks), but it sounds like you'd vote for the donkey.
We do? I don't remember many Republicans supporting Clinton during the Lewinski affair. Nor do I remember many people supporting Nixon during Watergate -- and rightfully so. Bush has done worse to the country than either of them.
Brilliant.
So, let me get this straight. There are systems out there, in the wild so to speak, that offer security by presenting a task that humans can do easily but machines have trouble doing. And now, this very same system is going to assist machines in solving the very inability upon which the system is based.
That's the dumbest most retarded (traditional sense of the word) thing that I've ever heard.
From the security page of the reCAPTCHA site: "if somebody writes a program that can read our distorted images, we can add more distortions in very little time"
If someone can write a program to solve the distorted images of OCR-unreadable words, don't you just hire that guy to do your OCR and get out of the CAPTCHA business?
Maybe this technique can be adapted to fight image spam more effectively :-)
I thought the point of CAPTCHAs was to compare what a user types with information stored on the hosting server. If the hosting server doesn't know what the book says, then how can it validate the CAPTCHA?
Unlike porn, which yada yada rimshot hey-ooh!
See, for recognizing words, that's ok. You can give it to 200 users spread over 10 days and see what most said. So, yes, I'm not surprised that Google does the same thing, but the catch is: not as a captcha.
It's just about the most idiotic idea I've ever heard for a _CAPTCHA_. Here's why:
1. What about the first person that sees any given word? Do you let them get in regardless of what they type (remember, there is no consensus yet about that word)? Or will I have to wait another 2 weeks to see if my post is allowed on Slashdot?
What about the second or third attempts at a word, for that matter? If the first two guys said it's "goatse" and I say it's "apple", how do you know which of us is the bullshitter? Did you stumble upon two jokers on the first two tries, or am I a bot? Basically for each word there is a sizable window where you still don't know what it means yet. Statistical consensus doesn't exist yet. At that point you're basically stuck accepting anything whatsoever. And since you'll want to use more than one word, that window of opportunity will come again and again. Maybe a quarter of the time you essentially don't yet know what it says or whether the user is bullshitting you.
Or in other words, will (A) an attacker just have to try until he stumbles upon a word for which no consensus exists yet? Or (B) you'll inconvenience legitimate users even more than the idiotic captchas already do?
2. It necessarily involves repetition. Otherwise you can't build consensus. So it's actually worse than current captchas. You can still crack them by paying a couple of unskilled workers in Elbonia to just crack captchas for $1 per hour, but this time you can also cache the ones they already cracked. The same image is bound to appear again sooner or later, and then a computer can crack it automatically.
3. Most of the words scanned from books are actually easier to automatically crack by OCR. Yeah, the OCR might fuck-up a letter somewhere, but it's easy to run that through a spellchecker to make an educated guess. Or even just take a random statistical guess. Even guessing at the ratio of consonants to vowels will give you better odds for most languages than the current captchas. So if someone wants to use bots to spam, you've just made his job _easier_.
4. However a good portion are actually harder for an average user. E.g., if it comes from some manuscript in some medieval gothic script, and some worn/discoloured/whatever manuscript at that, I might get a headache trying to decipher it even as a human. Or what if it contains some phrase in Cyrillic, Greek, or some made-up script? To a machine it looks like just the next word in the sequence. Captchas are already a usability nightmare, this would just make it an even bigger nightmare for a lot of people.
5. It can be deliberately poisoned. Even with two words (one known, one unknown), it only takes an army of jokers or bots who pick the first or second to answer right, and answer "goatse" to the other. You'll still get your majority eventually, but it will take longer and, as statistics flukes work, occasionally you'll get 5 "goatse" answers for a word before you get even one right answer. Do you start rejecting people who said something else yet?
6. It solves none of the _real_ problems with captchas. E.g., they're still crackable by proxy, or by sweatshops with 1-2 guys cracking captchas at $1-$2 per hour. E.g., it still is a usability nightmare for a lot of very real people.
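To make objection 2 concrete, here is a toy Python sketch of the caching it describes. It assumes the server hands out byte-identical images on repeat, which is purely an assumption; the class name and the sample bytes are invented.

```python
import hashlib

class SolvedImageCache:
    """Toy cache mapping the exact bytes of an image to a known answer."""

    def __init__(self):
        self._answers = {}

    @staticmethod
    def _key(image_bytes):
        # Exact-match key: changing even a single byte produces a different hash.
        return hashlib.sha256(image_bytes).hexdigest()

    def remember(self, image_bytes, answer):
        self._answers[self._key(image_bytes)] = answer

    def lookup(self, image_bytes):
        return self._answers.get(self._key(image_bytes))

cache = SolvedImageCache()
cache.remember(b"fake-image-bytes", "between")
hit = cache.lookup(b"fake-image-bytes")
miss = cache.lookup(b"fake-image-bytes-redistorted")
```

The second lookup misses, which shows the flip side: if the same word is distorted differently each time it is served, an exact-match cache like this one never gets a hit.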
So I don't care how much of a genius he might be in an unrelated domain, or who else uses the same approach... for a completely different problem. Both are just appeals to false authority here.
Even geniuses occasionally get a dumb idea. Tesla, for example, was one of the greatest geniuses of this century. He did get a _lot_ of SF ideas, though, like time travel machines, death rays, thought photography, walls of light, etc. Stuff which can't possibly work. E.g., his thought photography was based on the idea that mental im
A polar bear is a cartesian bear after a coordinate transform.
Damnit, where's the smushed bug key?!?
"I like systems, their application excepted", George Sand (French)
Replace one of the words with nonsense and hit submit. If it passes you get a point. The challenge is to figure out what word the computer is least likely to know.
The second request is by definition not a CAPTCHA, since the answer is not known. They're using you to try and determine that answer. This after they've met their security criteria by using a real CAPTCHA. That means this is just unpaid labour! Wait 'till my union rep finds out about this, there'll be trouble!!
Uh that sounds a lot like the Prince of Persia "anti-pirate" feature which asked you to drink the bottle with the letter in:
"Page 13, Line 4, Word 5, Letter 2", after ending the first level...
Nothing that a Hex editor operation in the .SAV file could not fix ;-)
Ubuntu is an African word meaning 'I can't configure Debian'
Your solution assumes that it's actually possible to tell Slashdotters and assholes apart.
I believe it is doomed to fail.
Any method of anti-spam that causes the user to jump through hoops is a bad design. CAPTCHAs are no more effective than a battery of tests against content at preventing spam, period. While an unscrupulous website operator can lift the CAPTCHA and get unwitting users to submit it, they can't fool systems like, say, Spam Karma that test for the characteristics of spam. I've been using it for quite a while and it's been 100% accurate in telling me what is or is not spam while providing zero inconvenience to the end user. About the only way for spammers to sneak it by is to *gasp* leave comments using a real person, a task so expensive that it's not worth it.
There is a difference between "insightful" and "inciteful" other than spelling.
How do they know if what I type is the real text, if they don't know in advance what it says?
And if they already know what it says, then why would they need someone else to type it for the first time?
the extent of how academics can be so out of touch with reality.
As you probably noticed, my 3rd objection was, essentially, "but spammers could run it through an OCR and then guess at the 1-2 misshapen letters". So you're telling me that then the system would do the same to validate that you're not a bot.
I dunno... it seems to me that, au contraire, you just described a way to make it easier for bots to pass. Magna cum laude.
You even have the exact way to tune it for maximum effect: the guys with the same OCR software are more likely to pass. Even if you don't exactly know which algorithm they're using, you can just try several and see which gets through those captchas more often.
Note however that you don't even need to be _too_ well tuned. If your OCR software misses maybe a letter in each word, you have a 1/26 chance to pass by just picking a random letter there. If it missed two, you have a 1/676 chance by sheer random chance. Those are _excellent_ odds to get a bot through. A distributed army of zombies could create tens of thousands of spam accounts per day that way.
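The arithmetic behind those odds, as a quick Python check. It assumes a 26-letter alphabet and uniformly random guesses, as in the comment above; the one million attempts is a made-up figure, not a measured one.

```python
from fractions import Fraction

def guess_odds(missing_letters, alphabet_size=26):
    """Chance of filling every missing letter correctly by uniform random guessing."""
    return Fraction(1, alphabet_size ** missing_letters)

def expected_passes(attempts, missing_letters):
    """Expected number of successful guesses over a number of attempts."""
    return float(attempts * guess_odds(missing_letters))

one_letter = guess_odds(1)
two_letters = guess_odds(2)
per_million = expected_passes(1_000_000, 2)
```

Here `one_letter` is 1/26 and `two_letters` is 1/676, so a hypothetical million attempts with two unknown letters each would be expected to get through somewhere between 1479 and 1480 times.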
A polar bear is a cartesian bear after a coordinate transform.
Or better yet, make the CAPTCHA the text of the entire article, thereby forcing people to actually RTFA before being able to post.
Maybe they can help piece together secrets from East Germany.
Ceci n'est pas une signature.
Maybe you should actually read the comments you replied to...which quote the reCAPTCHA website:
In other words, several of your points are entirely invalid. The mystery word is not used at all for verification.
Several of your other arguments are about how it's not any better of a CAPTCHA. Which is preposterous, because it's not supposed to be better at being a CAPTCHA. It's supposed to be better at digitizing books, which the current CAPTCHA scheme has exactly 0 effectiveness at. Complaining that this isn't any better of a CAPTCHA is like complaining that charity golf tournaments aren't any better than the Masters. It's a ridiculous argument, because charity golf tournaments have an entirely different focus, while still being a golf tournament.
And it's got a reload button for hard words. You've got a somewhat valid point in that being not random, it is easier to guess. But
You've still got some fragment of an argument left, but most of it is destroyed by the simple facts that:
a) Verification is not based on the "mystery" word
b) It's supposed to be better at digitizing books, not better at being a CAPTCHA
It's just as good as most CAPTCHAs out there, and it digitizes books. It's a good idea.
You all have Oo.o and Firefox, so get World Wind.
This class of CAPTCHA is not always going to work first time, every time. It depends upon the subjective opinion or skill of the user. In my view, the ultimate CAPTCHA has been released:
www.hotcaptcha.com
Paul Gillingwater
MBA, CISSP, CISM
Oh man, you just brought my childhood back =)
... oh so many hours wasted ...
Thank you for the good memories
WTF am I doing replying to an AC at 5 A.M on a Friday night?