Fill Out CAPTCHAs, Digitize Books At The Same Time
alphadogg wrote with a link to a Networld article about a noble endeavor: putting CAPTCHAs to work for the good of humanity. A scientist at Carnegie Mellon is looking to create a new type of security check that will assist in a project meant to digitize and make searchable text from books and printed materials. Above and beyond that, the offering would probably be more secure than most current systems. "Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project."
CAPTCHAs work because the computers sending them already know what the text says; they start with it in text form and change it into a hard-to-read image. In the system discussed in the article, how will the computer verify that the user response actually matches the text? Sure, it could compare the response to its best guess, but if a program trying to guess the text was just as sophisticated as the guessing computer, the guess would match.
I imagine the computer sending the picture of the image of hard-to-read text will further obfuscate the image in a way that makes it even more difficult for the computer on the receiving end to decipher, but the article doesn't acknowledge that this is one of the first logical questions in conceiving of / implementing this system in a functional way. The article really should cover this...
What's the point here? If they can obfuscate the word, they know the word. What am I missing?
Hello,
Last week I bought 4GB of Kingston HyperX DDR2-800 memory running at CL4. I'm pretty happy with it, except for one thing: I want 8GB.
So I started looking around at the same store (in order to do an exchange) for a replacement and I discovered that the 4GB kits are all CL5.
Does anyone know what sort of hit I should expect from moving from CL4 to CL5?
Thank you in advance!
(please refrain from commenting on 4GB vs 8GB, that would be off-topic)
The article is lacking some information. Here are some better links:
Official reCAPTCHA site
Hide your email address with reCAPTCHA (super easy!)
A more detailed blog post about how the system works
Disclaimer: I work with Luis von Ahn, who's the professor running the reCAPTCHA project.
The typed-in text has to be verified against some known text, so wouldn't the source material already have to be known in order to verify that the captcha is correct? If the source text is already known then this process doesn't seem to accomplish anything. Perhaps I'm missing the point.
- Just replace every word with "BUSH IS AN IDIOT - OBAMA IN 08!".
....
Now, EVERYONE has to do this for this hack to work
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
I originally missed the link to the official site - D'oh. The article also doesn't mention that the system is already in use! http://recaptcha.net/
They need a way to verify if the answer is correct... if they know the answer, they don't need help digitizing the hard-to-read text. If they don't know the answer, it won't work as a CAPTCHA.
Am I missing something fundamental here?
They should put a million monkeys on this
However the Iron Internet Law of "lolz > human decency" applies ... and we can look forward to books being translated as "chucknorrischucknorrischucknorrischurknorris..."
someone set up a database of what the words really say along with what we should type instead, and make it public. it'll be fun! like mad libs!
Nicked from the appropriate website, should answer my (and likely a lot of other people's) first question:
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
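The pairing scheme quoted above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `check_response` and `votes`, the image id format, and the case-insensitive comparison are all made up here and are not taken from the reCAPTCHA site.

```python
from collections import Counter, defaultdict

# Hypothetical store: unknown-word image id -> tally of human transcriptions.
votes = defaultdict(Counter)


def normalize(text):
    """Compare answers case-insensitively and ignore surrounding whitespace."""
    return text.strip().lower()


def check_response(known_answer, known_guess, unknown_id, unknown_guess):
    """Pass or fail the user on the known word only.

    If the known word is right, the guess for the unknown word is recorded
    as one vote; it never decides whether this user passes.
    """
    if normalize(known_guess) != normalize(known_answer):
        return False
    votes[unknown_id][normalize(unknown_guess)] += 1
    return True
```

Under these assumptions, a user who types the known word correctly passes and contributes one vote for the new word, while a user who gets the known word wrong fails and their answer for the new word is thrown away.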
http://yro.slashdot.org/article.pl?sid=07/04/03/2211258
I believe amazon.com has filed a patent for a solution to this problem which attributes every annotation input to a unique user id. They then claim to use the average accuracy over the history of that user for whittling away the, 2 out of 10 i think the patent says, worst answers.
i'm sure some form of quality control/check will be needed and i wonder if such a solution would infringe on this patent?
We, the people of EARTH LIBERATION FRONT UBER ALLES, in order to form a more perfect How cud tehy vote 4 Jordin!!!11, establish justice, Haha, spamming you....
What if the OCR cannot read a word because there was a booger on it during the scan? A human won't be able to determine it either because it will be mostly a blotch. How are they gonna know the difference between human-decipherable words and lost-cause words (such as booger blotches)?
Table-ized A.I.
I can see how this would work, but in order to also provide security, extra letters or words would also need to be in the captcha. I.e. if there's an un-OCRable word "between", the captcha could contain "frog between" or something like that, and the first word could be a previous un-OCRable word that has been validated by enough people.
Another method might be to separate out the un-OCRable letters from words and sprinkle them with known letters, though this might be less effective since people can often recognize words far better than individual letters. If one or two letters in a word cannot be interpreted, a person can often still read the entire word.
This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
- Have a person subscribe to a porn website by typing in a CAPTCHA image that comes from a legitimate website.
- The user provides the correct word while subscribing
- Not even a "???" step
- Profit! The protected website is spammed.
I'm wondering whether this system will be used for legitimate OCR purposes or for more spam...They probably just accept the first x entries until they have a base for comparison. The entries will converge on correctness.
A better scheme would be to give out the same captcha to 2 or more users. If they agree on the answer, then there's a better chance that the text is correct.
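A rough sketch of that agreement check in Python. The function name `consensus` and both thresholds (`min_votes`, `min_share`) are assumptions picked for illustration, not values from any real system.

```python
from collections import Counter


def consensus(answers, min_votes=2, min_share=0.75):
    """Return the agreed transcription, or None if there is no agreement yet.

    answers: strings typed by different users for the same image.
    min_votes and min_share are illustrative thresholds, not known values.
    """
    # Ignore blank answers and compare case-insensitively.
    cleaned = [a.strip().lower() for a in answers if a.strip()]
    if len(cleaned) < min_votes:
        return None
    word, count = Counter(cleaned).most_common(1)[0]
    if count / len(cleaned) >= min_share:
        return word
    return None
```

With these thresholds a single answer is never enough, two matching answers are accepted, and a one-to-one split stays undecided until more users have seen the image.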
...Wait till you see these new CAPTCHAS.
Mushed text with letters that slide into each other, bad lighting and every other kind of bad scanning you can imagine. Hell, you'd be lucky if you can recognize letters at all.
Question is, if the machine couldn't figure out what the word is, how will it verify your answer? Is it going to be something along the lines of "by the popular vote"?
Something is very not right in all this.
Great, so now I would have to fill out two of those stupid things instead of one. Why would a company want to inflict this on its users?
owha tajer kiam
"Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."
Type: Miserable Failure
Thankyou, click here to proceed.
Wanna fight ? Bend over, stick your head up your ass, and fight for air.
This project isn't the first of its sort: Amazon has the Mechanical Turk project, where users perform various tasks similar to CAPTCHAs for amazon.com credit.
http://www.mturk.com/
Come on people, start using your brains please!, just a little!, half the posters have been asking the same 2 stupid questions, or even worse, posting the same 2 stupid questions with question mark removed, as if they were facts.
We should put a CAPTCHA system on slashdot:
When you want to post, You get to type-in a CAPTCHA. The Image for this is generated in this way:
- The links to the article/s actually link to a page with a javascript wrapper that loads the article text, but replaces certain words with the graphical representation of that word, in the form of a CAPTCHA.
- These words form a phrase that the user must type in if he wants to post. There are different combinations of phrases selected from the article, and each poster gets one randomly.
This technology should be called CAPSSAA (for Completely Automated Public Stupidity test to tell Slashdotters and Assholes Apart)
WTF am I doing replying to an AC at 5 A.M on a Friday night?
In a hole in the ground there lived a penis. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort.
(yeah, i'd trust the internet community to digitize my books. why don't we just cut out the middle-man, and create a wiki-gutenberg project?)
Brilliant.
So, let me get this straight. There are systems out there, in the wild so to speak, that offer security by presenting a task that humans can do easily but machines have trouble doing. And now, this very same system is going to assist machines in solving the very inability upon which the system is based.
That's the dumbest most retarded (traditional sense of the word) thing that I've ever heard.
From the security page of the reCAPTCHA site: "if somebody writes a program that can read our distorted images, we can add more distortions in very little time"
If someone can write a program to solve the distorted images of OCR-unreadable words, don't you just hire that guy to do your OCR and get out of the CAPTCHA business?
Maybe this technique can be adapted to fight image spam more effectively :-)
Listen up, sizzle chest - and you are an enemy of my account because you are such a fucking asshole: First off, the post to which you are replying was the second post in this story. The fact that there are 50 posts explaining to the first poster where he was wrong does not mean that my post did not appear within seconds after the first post. Second, the article DOES NOT GIVE AN EXPLANATION. You have to click through to a second article. I read the fucking article. It wasn't clear. Do I have to exhaust every link in every article before I can post?
I thought the point of CAPTCHAs was to compare what a user types with information stored on the hosting server. If the hosting server doesn't know what the book says, then how can it validate the CAPTCHA?
Unlike porn, which yada yada rimshot hey-ooh!
See, for recognizing words, that's ok. You can give it to 200 users spread over 10 days and see what most said. So, yes, I'm not surprised that Google does the same thing, but the catch is: not as a captcha.
It's just about the most idiotic idea I've ever heard for a _CAPTCHA_. Here's why:
1. What about the first person that sees any given word? Do you let them get in regardless of what they type (remember, there is no consensus yet about that word)? Or will I have to wait another 2 weeks to see if my post is allowed on Slashdot?
What about the second or third attempts at a word, for that matter? If the first two guys said it's "goatse" and I say it's "apple", how do you know which of us is the bullshitter? Did you stumble upon two jokers on the first two tries, or am I a bot? Basically for each word there is a sizable window where you still don't know what it means yet. Statistical consensus doesn't exist yet. At that point you're basically stuck accepting anything whatsoever. And since you'll want to use more than one word, that window of opportunity will come again and again. Maybe a quarter of the time you're essentially not yet knowing what it says and whether the user is bullshitting you.
Or in other words, will (A) an attacker just have to try until he stumbles upon a word for which no consensus exists yet? Or (B) you'll inconvenience legitimate users even more than the idiotic captchas already do?
2. It necessarily involves repetition. Otherwise you can't build consensus. So it's actually worse than current captchas. You can still crack them by paying a couple of unskilled workers in Elbonia to just crack captchas for $1 per hour, but this time you can also cache the ones they already cracked. The same image is bound to appear again sooner or later, and then a computer can crack it automatically.
3. Most of the words scanned from books are actually easier to automatically crack by OCR. Yeah, the OCR might fuck-up a letter somewhere, but it's easy to run that through a spellchecker to make an educated guess. Or even just take a random statistical guess. Even guessing at the ratio of consonants to vowels will give you better odds for most languages than the current captchas. So if someone wants to use bots to spam, you've just made his job _easier_.
4. However, a good portion are actually harder for an average user. E.g., if it comes from some manuscript in some medieval gothic script, and some worn/discoloured/whatever manuscript at that, I might get a headache trying to decipher it even as a human. Or what if it contains some phrase in Cyrillic, Greek, or some made-up script? To a machine it looks like just the next word in the sequence. Captchas are already a usability nightmare; this would just make it an even bigger nightmare for a lot of people.
5. It can be deliberately poisoned. Even with two words (one known, one unknown), it only takes an army of jokers or bots who pick the first or second to answer right, and answer "goatse" to the other. You'll still get your majority eventually, but it will take longer and, as statistics flukes work, occasionally you'll get 5 "goatse" answers for a word before you get even one right answer. Do you start rejecting people who said something else yet?
6. It solves none of the _real_ problems with captchas. E.g., they're still crackable by proxy, or by sweatshops with 1-2 guys cracking captchas at $1-$2 per hour. E.g., it still is a usability nightmare for a lot of very real people.
So I don't care how much of a genius he might be in an unrelated domain, or who else uses the same approach... for a completely different problem. Both are just appeals to false authority here.
Even geniuses occasionally get a dumb idea. Tesla, for example, was one of the greatest geniuses of this century. He did get a _lot_ of SF ideas, though, like time travel machines, death rays, thought photography, walls of light, etc. Stuff which can't possibly work. E.g., his thought photography was based on the idea that mental im
A polar bear is a cartesian bear after a coordinate transform.
Damnit, where's the smushed bug key?!?
"I like systems, their application excepted", George Sand (French)
Replace one of the words with nonsense and hit submit. If it passes you get a point. The challenge is to figure out what word the computer is least likely to know.
The second request is by definition not a CAPTCHA, since the answer is not known. They're using you to try and determine that answer. This after they've met their security criteria by using a real CAPTCHA. That means this is just unpaid labour! Wait 'till my union rep finds out about this, there'll be trouble!!
Uh that sounds a lot like the Prince of Persia "anti-pirate" feature which asked you to drink the bottle with the letter in:
"Page 13, Line 4, Word 5, Letter 2", after ending the first level...
Nothing that a Hex editor operation in the .SAV file could not fix ;-)
Ubuntu is an African word meaning 'I can't configure Debian'
Your solution assumes that it's actually possible to tell Slashdotters and assholes apart.
I believe it is doomed to fail.
Any method of anti-spam that causes the user to jump through hoops is a bad design. CAPTCHAs are no more effective than a battery of tests against content at preventing spam, period. While an unscrupulous website operator can lift the CAPTCHA and get unwitting users to submit it, they can't fool systems like, say, Spam Karma that test for the characteristics of spam. I've been using it for quite a while and it's been 100% accurate in telling me what is or is not spam while providing zero inconvenience to the end user. About the only way for spammers to sneak it by is to *gasp* leave comments using a real person, a task so expensive that it's not worth it.
There is a difference between "insightful" and "inciteful" other than spelling.
How do they know if what I type is the real text, if they don't know in advance what it says?
And if they already know what it says, then why would they need someone else to type it for the first time?
the extent of how academics can be so out of touch with reality.
As you probably noticed, my 3rd objection was, essentially, "but spammers could run it through an OCR and then guess at the 1-2 misshapen letters". So you're telling me that then the system would do the same to validate that you're not a bot.
I dunno... it seems to me that, au contraire, you just described a way to make it easier for bots to pass. Magna cum laude.
You even have the exact way to tune it for maximum effect: the guys with the same OCR software are more likely to pass. Even if you don't exactly know which algorithm they're using, you can just try several and see which gets through those captchas more often.
Note however that you don't even need to be _too_ well tuned. If your OCR software misses maybe a letter in each word, you have a 1/26 chance to pass by just picking a random letter there. If it missed two, you have a 1/676 chance by sheer random chance. Those are _excellent_ odds to get a bot through. A distributed army of zombies could create tens of thousands of spam accounts per day that way.
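Those odds can be worked out directly. A small Python sketch, assuming each missed letter is guessed uniformly from the 26 lowercase English letters and that attempts are independent; the function names are made up for illustration.

```python
from fractions import Fraction

ALPHABET_SIZE = 26  # assumes lowercase English letters only


def pass_chance(missed_letters):
    """Chance of guessing every missed letter by uniform random choice."""
    return Fraction(1, ALPHABET_SIZE) ** missed_letters


def expected_passes(attempts, missed_letters):
    """Expected number of successful attempts, assuming independence."""
    return attempts * pass_chance(missed_letters)
```

`pass_chance(1)` is 1/26 and `pass_chance(2)` is 1/676, matching the figures above. With a hypothetical 1,000,000 attempts and two missed letters per word, `expected_passes(1_000_000, 2)` comes to roughly 1,479 successes.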
A polar bear is a cartesian bear after a coordinate transform.
Or better yet, make the CAPTCHA the text of the entire article, thereby forcing people to actually RTFA before being able to post.
Maybe they can help piece together secrets from East Germany.
Ceci n'est pas une signature.
Maybe you should actually read the comments you replied to...which quote the reCAPTCHA website:
In other words, several of your points are entirely invalid. The mystery word is not used at all for verification.
Several of your other arguments are about how it's not any better of a CAPTCHA. Which is preposterous, because it's not supposed to be better at being a CAPTCHA. It's supposed to be better at digitizing books, which the current CAPTCHA scheme has exactly 0 effectiveness at. Complaining that this isn't any better of a CAPTCHA is like complaining that charity golf tournaments aren't any better than the Masters. It's a ridiculous argument, because charity golf tournaments have an entirely different focus, while still being a golf tournament.
And it's got a reload button for hard words. You've got a somewhat valid point in that being not random, it is easier to guess. But
You've still got some fragment of an argument left, but most of it is destroyed by the simple facts that:
a) Verification is not based on the "mystery" word
b) It's supposed to be better at digitizing books, not better at being a CAPTCHA
It's just as good as most CAPTCHAs out there, and it digitizes books. It's a good idea.
You all have Oo.o and Firefox, so get World Wind.
This class of CAPTCHA is not always going to work first time, every time. It depends upon the subjective opinion or skill of the user. In my view, the ultimate CAPTCHA has been released:
www.hotcaptcha.com
Paul Gillingwater
MBA, CISSP, CISM
Oh man, you just brought my childhood back =)
... oh so many hours wasted ...
Thank you for the good memories
WTF am I doing replying to an AC at 5 A.M on a Friday night?