Building a Better CAPTCHA
jcatcw writes "Steven J. Vaughan-Nichols reports that CAPTCHA cracking isn't that difficult these days. It has even become a business. For example, DeCaptcher.com will solve CAPTCHAs for your spamming needs at a rate of $2 per 1,000 successfully cracked CAPTCHAs. In response, newer systems are in development. Both Carnegie Mellon and Penn State (is there something about the water in PA?) are working on image-based systems. ESP-PIX and SQ-PIX both require the viewer to interpret pictures. Imagination CAPTCHA from Penn State has the user find the center of an image. The idea is that humans are better at image recognition than computers, but humans can legitimately disagree on their interpretations and some humans are color blind. Problems remain. For now, sites would be well advised to look at reCAPTCHA (the system that works with Google Books and the Internet Archive to digitize printed texts), which comes with a wide variety of application and programming plug-ins and an open API."
Very true, though you can turn that around. That is, create a 3rd site where users are rewarded with porn for categorizing a posting as spam or legit. If it's the former, it is deleted from your forum.
Any CAPTCHA system can easily be cracked by building a large database of inputs and outputs that were actually solved by humans and then saved for lookup later. The inputs don't need to be text; they can contain images (or hash codes representing images), or CSS, or whatever is needed to define the input data. The only feasible way to stop this kind of caching of answers is to have no duplicate tests. For example, show a large field of randomly colored circles that all vary in size and position and move slowly around, then tell the user to hover the mouse over the largest blue circle, then have them move the mouse over the green triangle, etc. Then base their "pass or fail" on whether they could move the mouse quickly and accurately enough. And change the test often, like: put the mouse over the shape that looks like a bunny, etc.
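To make the lookup attack concrete, here is a rough Python sketch. It assumes each challenge reaches the attacker as a byte string (image bytes, markup, whatever) and that repeats are byte-for-byte identical. The `AnswerCache` class, the sample challenge bytes, and the answer "xk7p2" are all made up for illustration.

```python
import hashlib


class AnswerCache:
    """Maps a hash of the raw challenge bytes to an answer a human gave earlier."""

    def __init__(self):
        self._answers = {}

    @staticmethod
    def _key(challenge: bytes) -> str:
        # Hashing keeps the table small even when challenges are large images.
        return hashlib.sha256(challenge).hexdigest()

    def record(self, challenge: bytes, answer: str) -> None:
        """Store an answer that a human solver produced for this challenge."""
        self._answers[self._key(challenge)] = answer

    def lookup(self, challenge: bytes):
        """Return the stored answer, or None if this challenge is new."""
        return self._answers.get(self._key(challenge))


cache = AnswerCache()
cache.record(b"<bytes of challenge image #1>", "xk7p2")

# A repeated challenge is answered straight from the cache...
repeat = cache.lookup(b"<bytes of challenge image #1>")
# ...but a challenge that has never been served before is a miss.
fresh = cache.lookup(b"<bytes of challenge image #2>")
```

An exact hash only matches identical bytes, which is why never serving the same test twice defeats this particular kind of cache.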
I really hate image-based CAPTCHAs, because they discriminate against lynx users. I seriously remember at least one occasion where I was using lynx for whatever obscure reason, and I came upon "enter the text shown in the box at the left". Fail. I like the math problem ones better.
Why jumble the images? Computer monitors refresh 75-100 times a second, or more. The human eye will superimpose two images that are 1/12.5 seconds apart, which is why PAL televisions using interlace can trick the eye into seeing a single fluidly-moving picture when playing at 25 frames per second (and thus 12.5 updates on a given line per second).
You should be able to use this to create an animated page, in which you scatter pixels through time, such that persistence of vision tricks the eye into seeing the actual page when an analysis of a single frame would show only random dots.
What you'd end up with is something that a screen scraper or image capture program could never process, but the human brain (because you're exploiting its limitations) can.
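A minimal sketch of the pixel scattering described above, in Python. It assumes a 1-bit image stored as a list of rows of 0/1. The function names and the tiny `glyph` are made up, and it only shows the splitting step, not the animation or any timing.

```python
import random


def scatter_frames(image, n_frames, seed=None):
    """Split a 1-bit image across n_frames frames.

    Each lit pixel lands in exactly one randomly chosen frame, so no single
    frame carries the whole picture.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    frames = [[[0] * width for _ in range(height)] for _ in range(n_frames)]
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                frames[rng.randrange(n_frames)][y][x] = 1
    return frames


def combine(frames):
    """OR all frames together, standing in for what the eye would superimpose."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [int(any(frame[y][x] for frame in frames)) for x in range(width)]
        for y in range(height)
    ]


glyph = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
frames = scatter_frames(glyph, n_frames=4, seed=1)
restored = combine(frames)  # identical to glyph: no pixel is lost or duplicated
```

Whether the cycled frames actually fuse into a readable image depends on the display and the frame rate, which this sketch doesn't model.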
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
The idea is that humans are better at image recognition than computers, but humans can legitimately disagree on their interpretations and some humans are color blind.
COLOR blind? Some humans are BLIND blind. Others have various vision or vision processing impairments that would make meatware-visual-coprocessor-test CAPTCHAs reject them.
IMHO most CAPTCHAs are already, and obviously, in violation of the Americans with Disabilities Act. So now, in the info-war between weapons and armor (which weapons always win anyhow), even more of us less-than-Aryan-Supermen become collateral damage.
Dogs are (allegedly) color blind and "... on the Internet nobody can tell you're a dog!". Well, maybe PEOPLE can't. But now the web applications can. B-(
The solution to being attacked by better weapons is not better armor. That's only a stopgap. The solution is to hunt down those who misuse weapons and make them incapable of or unwilling to continue.
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
...even though CraigsList uses reCAPTCHA and the article talks about a utility that helps spammers automatically post on CL.
Besides, it's fairly easy to set up a Mechanical Turk HIT for users to solve CAPTCHAs for a penny apiece. Assuming you make more than a penny per CAPTCHA solved, you're set. If not, make someone successfully solve more than one CAPTCHA per HIT submission.
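The arithmetic behind that, as a quick Python sketch. The two prices are the ones quoted in this discussion ($2 per 1,000 from the summary, a penny per HIT here). The batch size of 5 and the helper names are assumptions for illustration only.

```python
# Prices quoted in the discussion; everything else is illustrative.
DECAPTCHER_USD_PER_1000 = 2.00  # "$2 per 1,000 successfully cracked CAPTCHAs"
TURK_USD_PER_HIT = 0.01         # a penny per HIT


def cost_per_solve(price_per_hit, solves_per_hit=1):
    """Cost of one solved CAPTCHA when a single HIT covers several solves."""
    return price_per_hit / solves_per_hit


def is_profitable(revenue_per_solve, cost):
    """The spammer comes out ahead only if each solve earns more than it costs."""
    return revenue_per_solve > cost


decaptcher_cost = DECAPTCHER_USD_PER_1000 / 1000
turk_cost = cost_per_solve(TURK_USD_PER_HIT)
# Hypothetical batch: 5 CAPTCHAs per one-cent HIT.
turk_cost_batched = cost_per_solve(TURK_USD_PER_HIT, solves_per_hit=5)
```

At one solve per HIT the penny rate is five times the DeCaptcher price; batching five per HIT brings the two level.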
I claim first use of "Error No. 0B" - or "No. 0B error." It'll be the new ID 10T!
I'm not sure how, yet, but I want people to start thinking about it this way.
Just like DRM.
See, with DRM, start with the assumption that all DRM can and will be cracked, and that all software and media can and will be pirated. Your challenge, then, is to make the legitimate product provide at least the quality and value of the pirated copy (something most DRM'd solutions fail miserably at), and ideally make it desirable enough that your price starts to seem reasonable, even when the alternative is "free".
So, the same applies to CAPTCHAs. Start with the assumption that all CAPTCHAs can and will be cracked, even if "cracking" means "using Mechanical Turk and/or a real sweatshop to have humans crack it". Now, start thinking in terms of economics. Build a system which doesn't have sufficiently good payoff for cracking it for anyone to bother -- a system which, by its very nature, can't be spammed.
If you can at least get it to where the only waste is bandwidth and disk space, you're doing pretty good. That's about my current spam situation -- it's a statistical filter which operates on the entire message, but it works incredibly well.
Until then, an automated hack that seems to work well, at least to stop blog spam, is to require AJAX: send a bit of programmatically generated (but always different) JavaScript and verify that it was executed. This will stop most automated systems until they start specifically targeting you with embedded JavaScript engines. Next: make it computationally expensive, so that they have to use a botnet if they're to get any real results.
Don't thank God, thank a doctor!
I was thinking brute force isn't feasible when every failure generates a new question.
But let me take another stab at it.
What if the question wasn't always "what is in the picture?"
Take a database of 1000 basic images (animals, shapes, fruits, vegetables), each matched to the word for what it is and its category (animal, fruit, etc.). Now the CAPTCHA shows 6 of them in 6 little squares (~985 quadrillion ordered combinations). It can ask a nearly endless list of questions using simple formulae:
What is the third image?
How many animals are shown? Spell the number.
Type the first 2 letters of each fruit.
Type the shape names using no spaces.
Instead of always asking "what are the 5 digits", now we're asking for an answer of almost arbitrary length. And there are 6 images that have to be ID'd.
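Roughly what that could look like, as a Python sketch. The eight-entry `IMAGES` table stands in for the 1000-image database, and the ids, words, and function names are all invented for illustration. The four question templates are the ones listed above.

```python
import math
import random

# Hypothetical miniature database: image id -> (word, category).
IMAGES = {
    "img01": ("dog", "animal"),
    "img02": ("cat", "animal"),
    "img03": ("apple", "fruit"),
    "img04": ("pear", "fruit"),
    "img05": ("circle", "shape"),
    "img06": ("square", "shape"),
    "img07": ("carrot", "vegetable"),
    "img08": ("horse", "animal"),
}
NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five", "six"]
ORDINALS = ["first", "second", "third", "fourth", "fifth", "sixth"]

# Ordered ways to fill 6 squares from 1000 images: 1000 * 999 * ... * 995.
ARRANGEMENTS = math.perm(1000, 6)


def make_challenge(rng, n_shown=6):
    """Pick images for the squares, then one question template with its answer."""
    shown = rng.sample(sorted(IMAGES), n_shown)
    items = [IMAGES[image_id] for image_id in shown]
    pos = rng.randrange(n_shown)
    templates = [
        (f"What is the {ORDINALS[pos]} image?", items[pos][0]),
        ("How many animals are shown? Spell the number.",
         NUMBER_WORDS[sum(cat == "animal" for _, cat in items)]),
        ("Type the first 2 letters of each fruit.",
         "".join(word[:2] for word, cat in items if cat == "fruit")),
        ("Type the shape names using no spaces.",
         "".join(word for word, cat in items if cat == "shape")),
    ]
    # Skip templates whose answer would be empty (e.g. no fruit on screen).
    question, answer = rng.choice([(q, a) for q, a in templates if a])
    return shown, question, answer


def check(expected, response):
    """Compare answers ignoring case and surrounding whitespace."""
    return response.strip().lower() == expected


shown, question, answer = make_challenge(random.Random())
```

`ARRANGEMENTS` comes to about 9.85e17, which is where the ~985 quadrillion figure comes from when the order of the squares counts.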
Did I beat the OCR problem w/o introducing any fatal new ones?
Operator, give me the number for 911!