Baffling the Spam Bots
dumpster_dave writes "Scientific American is running an article, Baffling the Bots, on techniques to outsmart and subvert spam bots and their chat-room cousins via CAPTCHA. You have probably seen this in the form of images containing text as gate-keepers to various on-line services. The latest evolution is using non-words and distorting the text such that even the best AI systems cannot decipher them, yet humans cannot help but do so [cf. Gestalt Psychology]."
Instead of sending an image of distorted text, send a wave file of distorted speech: easy for the human ear to discern, but much harder for run-of-the-mill speech recognition tools to decode.
There are no karma whores, only moderation johns
This is a losing battle.
Smart humans will outsmart computers for quite a while. The average human is already dis-comforted with such a test (what's the middle word in the second image?!).
But those systems should work for the dumbest (within reason) humans. They're trying to design a test that's passed by the dumbest of six million, yet makes the smartest of a few (bots) fail.
I give in.
*comment about spambot overlords*
Yes this is a great solution if the only people you want to email you are a little towards the smart side. But speaking as someone who has to deal with "joe sixpack" daily I've seen people who are confused by user@NOSPAMdomain.com and when I tell them to go to http://webmail.domain.com/ to get their webmail they put www. on the front!
These same people if I were verbally giving them the url to slashdot would end up at http://www.slash..org/ (god I wish I were trying to make a joke but seriously I've had this happen).
Because of this my email is plainly visible on our web site, in my forums, on a few other forums and in the occasional usenet message. With a combination of RBLs, Bayesian filtering, procmail soup and other goodies my spam count per day is kept to a low roar (double figures rather than four figures; again, I wish I were joking).
--- www.f-theocean.com
Easy: when you generate your mangled GIF image, also create a wav/mp3 containing the same information (e.g. using TTS software, or by concatenating pre-recorded audio files).
Most blind users are running windows with JAWS or similar screen-reading software, and sites like ACB release a lot of their content as mp3's already, so I'd assume that most are well equipped to handle web audio.
455fe10422ca29c4933f95052b792ab2
Earthlink has an optional system like this, where unknown senders are blocked by default. They receive an autoreply giving them a URL to go to where they must enter the text from a CAPTCHA.
Unfortunately, the system does not work very well. My dad sells on eBay, and a buyer of one of his auctions had an Earthlink account, which blocked the message that told how much the shipping would be, where to send payment, etc. When my dad went to the specified URL and entered the CAPTCHA text as requested, he would simply get an error message that he had entered it incorrectly. He forwarded me the Earthlink email and asked me if it was just him; it wasn't; I couldn't get it to work either. The random string of numbers and letters was very distorted, and there were four possible meanings; I tried those plus at least ten more with no success. The message never got through.
There are many problems with this type of system. Consider: what if both parties have CAPTCHA-enabled accounts, from different providers? The confirmation messages from both parties get blocked. Smarter systems whitelist people as messages are sent to them, but as in the eBay case, the recipient had no way of knowing my dad's email until AFTER a message from him was received. It's a Catch-22.
And for people who are visually impaired, universal deployment of this system makes email essentially impossible. Earthlink's page had a link "if you cannot see the picture, click here" and when you got to that they said to call their 1-800 number if you have any problems. Right.
Adding CAPTCHAs to everyone's email systems is NOT the way to solve the spam problem. We need a more realistic, permanent solution. For example, cryptographically authenticating the sender (the "From" field) at the level of the originating ISP (and rejecting messages from senders it cannot authenticate, by password or whatever means), and then having each relay in turn authenticate the previous relay if it trusts it. Headers can be inserted in the emails, signing the previous headers with private encryption keys whose public counterparts are obtainable from the ISPs by simple DNS lookups. This will build a chain of trust, which stops when a message gets outside of the sender's network, and therefore allows the original sender to be properly identified back through their ISP. Once we know who messages are from, people can be held responsible. And at that point, anti-spam laws can handle the rest.
It's hard for thee to kick against the pricks.
A big problem with CAPTCHAs is that they can be "broken" with some vigilance and know-how, although not 100% of the time. Yahoo!'s has been broken by a UC Berkeley group, who claim a 92% success rate. The UCB algorithm looks at the image, then searches through a dictionary to find the most probable matches and spits them out (you can actually see on the site how it chooses and how close it gets when it misses, mistaking 'grip' for 'slip' and so on).
:)
What is really needed for a *good* CAPTCHA is not pure image obscurity, but rather something that combines hard-to-read images with aspects of language that humans know intuitively, while at the same time being very difficult for computers to sort out. Take word associations, for example. You probably learned how words are associated with each other in 1st grade, so for humans it is a very simple task to pick out words that have a common theme. Computers are a different story. Have a CAPTCHA randomly spit out 10 words to the screen and have the user pick the 3 that are associated with one another, say for example HOUSE, LOG, FRONT, CAT, BROWN, DOG, CART, RUNNING, HOUR, MOUSE.
Even if the algorithm were to correctly identify all 10 words, it would still have to figure out what the association is and then correctly identify the words that fit the association. Assuming that it did correctly identify all of the words, at that point random guessing would yield a success rate of 0.83% (1 in 120, the number of ways to pick 3 words out of 10), less if it misidentifies even just one of the words. Combine something like this with a slightly smarter word obfuscator and I think it'd be something that would be very hard to beat...unless you're human, of course.
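For anyone who wants to check the 0.83% figure: it is one over the number of ways to choose 3 words from 10. A quick sketch in Python, assuming exactly one triple counts as correct and the bot guesses uniformly at random (the function name is made up for illustration):

```python
from math import comb


def random_guess_rate(total_words: int, picks: int) -> float:
    """Chance of hitting the single correct subset by uniform random guessing.

    Assumes exactly one subset of `picks` words is accepted as the answer.
    """
    return 1 / comb(total_words, picks)


rate = random_guess_rate(10, 3)
print(f"{comb(10, 3)} possible triples, {rate:.2%} chance per guess")
# prints: 120 possible triples, 0.83% chance per guess
```

Note this only covers blind guessing; a bot with even a crude word-association model would presumably do better than 1 in 120.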
Don't count yourself lucky just yet!
:-(
I used the same method, and my own mailserver with aggressive filters, and it worked very well until... a Russian spammer started to send out spam with my mail address as the sender address. He did this via hacked systems (open proxies) so it was not possible to do any blocking.
The load of crap that came in was just unbelievable, and all attempts to contact his spamvertized site or their providers just had no result.
In the end the only thing I could do was remove the MX record for the domain. I pointed it to the spamvertized site instead. Hopefully they are happy with their own bounces.
Of course I cannot receive any legitimate mail on that address anymore
.. is that they can be brokered. If you give me a puzzle, *I* don't have to solve it; all I have to do is induce someone, somewhere, to solve it, and give me the answer. That means I can set up a CAPTCHA-solving factory in Taiwan, or field a porn site where users pay for their pictures in CAPTCHA answers. (*My* CAPTCHAs, the ones my script was assigned to answer in order to make Paypal transactions, not new ones I made up on the spot.)
Suppose that a human can solve your CAPTCHA in an average of five seconds. Suppose unskilled labor costs $6/hour. Then it costs a bit under a cent to find the solution to your CAPTCHA, assuming that I want to solve at least a few thousand a day. As a result it is impractical to protect a service worth more than a penny with a CAPTCHA.
Actually unskilled labor costs far less than $6/hour in some parts of the world, so if CAPTCHAs see wide use the value of the services they can protect is even lower. A tenth of a cent?
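The arithmetic behind those numbers, as a small Python sketch. The $6/hour and five-second figures are the ones from the comment above; the function names are made up, and the break-even wage is just the same formula run backwards, not a claim about what labor actually costs anywhere:

```python
SECONDS_PER_HOUR = 3600


def cost_per_solve_cents(hourly_wage_usd: float, seconds_per_solve: float) -> float:
    """Labor cost, in cents, of one human-solved CAPTCHA."""
    return hourly_wage_usd * 100 * seconds_per_solve / SECONDS_PER_HOUR


def wage_for_cost_usd(target_cents: float, seconds_per_solve: float) -> float:
    """Hourly wage (USD) at which one solve costs `target_cents`."""
    return target_cents / 100 * SECONDS_PER_HOUR / seconds_per_solve


# $6/hour at 5 seconds per puzzle: a bit under a cent each.
print(f"{cost_per_solve_cents(6.0, 5.0):.2f} cents per solve")   # 0.83 cents per solve

# Wage at which a solve costs a tenth of a cent.
print(f"${wage_for_cost_usd(0.1, 5.0):.2f}/hour")                # $0.72/hour
```

This ignores overhead (bandwidth, recruiting, people getting the answer wrong), so treat it as a floor on the attacker's cost rather than an estimate.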
CAPTCHAs should be seen as a proof-of-work mechanism, like "hash cash", not as an oracle that can determine whether a transaction was initiated by a human or a machine. Unlike proof-of-work schemes that burn CPU time, the value of a CAPTCHA won't be inevitably halved every 18 months by Moore's law; on the other hand, it could be suddenly reduced to zero by breakthroughs in image processing.
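For comparison, here is roughly what the CPU-burning kind of proof of work looks like. This is a simplified hashcash-style puzzle, not the actual Hashcash stamp format: the challenge string, the `challenge:nonce` layout and the use of SHA-256 are all assumptions made for the sketch. The sender grinds through nonces until the hash has enough leading zero bits; the receiver checks with a single hash.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the front of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()  # zeros above the highest set bit
        break
    return bits


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Cheap check: one hash, done by the receiver."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def mint(challenge: str, difficulty_bits: int) -> int:
    """Expensive search: about 2**difficulty_bits hashes on average."""
    for nonce in count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce


# Hypothetical challenge string; 12 bits keeps the demo fast.
nonce = mint("user@example.com:2003-10-20", 12)
print(verify("user@example.com:2003-10-20", nonce, 12))  # True
```

Each extra difficulty bit doubles the expected work, which is exactly why faster hardware erodes the price of a stamp in a way it does not erode the price of a human-solved CAPTCHA.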
How many people are unwittingly giving away CAPTCHA answers? The link {to a CGI script which puts out image data} must take a parameter to tell it what image to display, since it can't return any data to the calling page {it's just an image and doesn't have a full set of headers, just a MIME-type} and can't use a temporary file {in case of multiple users accessing it in parallel}. That parameter is probably also present in a hidden field in the form, so that the form processor knows what the user should have typed {or the referring URL itself could be the hidden field}. You only need to see one image, then resubmit the form as though that was the image you were shown.
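One way a site could close that particular hole, sketched in Python. This is an assumption about a possible fix, not something any site mentioned here is known to do, and every name in it is made up. The idea is that the form carries only a random one-time token; the expected answer stays on the server and the token is thrown away on the first check, so resubmitting an old image/answer pair gets you nothing.

```python
import hmac
import secrets

# Hypothetical in-memory store: token -> expected answer.
# A real site would want expiry and shared storage; omitted here.
_pending = {}


def issue_challenge(answer: str) -> str:
    """Remember the answer server-side; hand the page only an opaque token."""
    token = secrets.token_urlsafe(16)
    _pending[token] = answer
    return token


def check_response(token: str, typed: str) -> bool:
    """Validate one submission. The token is consumed whether it passes or not."""
    expected = _pending.pop(token, None)
    if expected is None:
        return False  # unknown or already-used token
    return hmac.compare_digest(expected.encode(), typed.encode())


token = issue_challenge("grip")
print(check_response(token, "grip"))  # True: first, correct submission
print(check_response(token, "grip"))  # False: same token replayed
```

The token would go both in the image URL and in the hidden field. Since it maps to nothing once used, seeing one image no longer lets you pass the form again.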
You have to remember that there are idiots out there who think all there is is IE and Windows. I have seen, and made use of, a few sites which have unwittingly given away access to premium services {hence the ACness -- gotta have that plausible deniability} because their security measures were either non-existent or depended on software I was not using. {I see it a bit like taking a few sheets of toilet paper from an unlocked privy; nobody's ever gonna miss it if they find it's gone, but they'll be annoyed enough to throw the book at you if they find out it was you that took it}.