Armoring Spam Against Anti-Spam Filters
moggyf points to a BBC article about how spam can be successfully tweaked to slip past current filtering methods, excerpting "To find out how to beat the filters Mr Graham-Cumming sent himself the same message 10,000 times but to each one added a fixed number of random words. When a message got through he trained an 'evil' filter that helped to tune the perfect collection of additional words."
iluvspam adds "It's an interview with POPFile author John Graham-Cumming that summarizes his talk at the recent MIT Spam Conference. You can still listen to the technical details here (choose the Afternoon 1 session, he starts about 75 minutes in)."
As technology gets more complicated, so does the spam. The only way to protect yourself is to not give out your address. Period. Heck, I don't even give my work e-mail address to my parents.
Would that be the same John Graham-Cumming referenced in the article who figured out how to defeat said filter?
Why is it that the proponents of "one nation under God" are so eager to get rid of "liberty and justice for all"?
All spammers have to do is read this analysis of the filter, then include the non-spam weighted strings while avoiding the spam weighted strings. Pretty simple to blow past their filter.
You do realize you've just committed a pretty serious Federal crime, don't you? I know you're kidding or just emoting the same frustration many others, myself included, feel about the willful disregard spammers seem to have for many things.
But you might've wanted to add a smiley...
If you've whitelisted your email, that crap won't get through if you're not on the whitelist. That goes regardless of your Subject line. Same story if you do challenge/response, for that matter. Or you can munge, as I do.
I still say spamming needs to be a felony, though.
This post made with the Dvorak layout.
"Friends don't let friends use QWERTY"
Yes. He says there's ways to beat it, but that they're complicated to do.
This would, for most slashdotters, be nothing to worry about. For those of you who didn't RTFA, the entire attack is limited by this particular little gem of info:
He had to send himself thousands of copies of the same message each one holding an encoded chunk of HTML that reported back to him when it got past the filter.
The concept is that the spammer has to find words that are so common in a person's ham that including them in spam would fool the filter. However, as those words are unique to each person, a lot (thousands or more) of spam must be sent to test the filter. The problem for the spammer is to figure out which spam actually got through (in order to identify the important words) - something s/he's not able to do for users with a decent email client...
I still feel quite confident that SpamBayes will keep my inbox free from spam.
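To make the "common ham words fool the filter" idea concrete, here is a minimal sketch of a word-level naive Bayes score. It is not SpamBayes or POPFile code; the training messages, the add-one smoothing, and the function names are all assumptions made up for illustration, and equal class priors are assumed so the prior term is left out.

```python
import math
from collections import Counter

def train(messages):
    """Count how many messages each word appears in."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts

def spam_score(text, spam_counts, ham_counts, n_spam, n_ham):
    """Sum of per-word log-odds (spam vs. ham) with add-one smoothing.

    Positive means the message leans spam, negative means it leans ham.
    """
    score = 0.0
    for word in set(text.lower().split()):
        p_spam = (spam_counts[word] + 1) / (n_spam + 2)
        p_ham = (ham_counts[word] + 1) / (n_ham + 2)
        score += math.log(p_spam / p_ham)
    return score

# Made-up training data for one hypothetical user.
spam = ["cheap pills online", "cheap mortgage online now", "buy pills now"]
ham = ["wireless meeting at the marriott", "comment on the wireless patch",
       "marriott booking comment"]

spam_counts, ham_counts = train(spam), train(ham)

plain = "buy cheap pills"
# The same spam, padded with words that are common in this user's ham.
padded = plain + " wireless marriott comment"

plain_score = spam_score(plain, spam_counts, ham_counts, len(spam), len(ham))
padded_score = spam_score(padded, spam_counts, ham_counts, len(spam), len(ham))
print(plain_score, padded_score)
```

With this toy data the padded message scores below zero while the plain one scores above it. The padding words only work because they come from this particular user's ham, which is the limitation described above.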
May we live long and die out
If people working in anti-spam don't try to break their own filters the spammers will do it for them and we'll be worse off.
There's a direct analogy with cryptographic techniques where breaking them is most of the work... that way we know that they are secure.
John.
A previous story talked about the noise level of spam increasing.
And a very entertaining NYT article that is in the process of expiring.
The upshot is that spam is being forced to look more and more like line noise. It will probably become less and less effective as the message has to submerge to the point where people can't recognize it.
"Provided by the management for your protection."
Of course I can break my own Bayesian filtering.
What matters is that while one person's spam might be very similar to another person's spam, their ham isn't. At best, it would require a semi-personal approach to sneak in spam. That's why you need to continually train your filter in the first place. Rinse and repeat, that's what it's all about.
What's being described is not really a flaw, but rather a saturation point at which it's time to retrain your filter and perhaps even start over with a new database. The old one gets too much 'noise' after some time.
They do point out one thing, be it from the spammer's POV: Bayesian filtering is a continuous process and not an end-all solution. It requires fresh input and gets less effective if you keep old crud around for too long and if you train it too much on virtually the same spam/ham.
It's still a much better solution than blacklists.
Yes, it's dedication to research. He sent himself the 10k messages to see if he could outwit his own Bayesian filtering of spam messages. He effectively deduced that if the incoming message can be made similar enough to items that the end-user of the Bayesian spam filter has specifically marked as non-spam, it will not be marked as spam.
There's a cunning recursiveness to this which is at that fine line between clever and stupid. The difficulty is, as he also deduces, that each person's Bayesian rules for spam vs. nonspam are unique and will require many attempts in order to infer the pass-through words that will create a false negative and allow the spam to come through. The one step that people are missing is that if the evil spammer wishes to work on spamming a domain (both in the internet sense and in the "domain of expertise/specialization" sense) she can tailor the pass-through words to the market. If she's sending spam to Intel or AMD corporate addresses, then lithography might be the magic word; if she's spamming Xilinx, then fpga will route through the Bayesian filter; if she's spamming Dave Barry, then debenture and fish falling from the sky might help spam make it through. Natalie may or may not make it through a /.'ers filter; actually, including slashdot in the subject or as the name will usually make it through a slashdotter's filter.
And the ease of this lies in that tailoring the open sesame words to a market will probably open the doors to all of the e-mail recipients at a domain, particularly if the spam filtering is done at the mail-server level and not at the end-user level. Thus rather than having to send 10k messages to a single user to crack open the spam doors, sending those 10k messages to multiple users at a domain and analysing which ones get through will effectively open the floodgates for all of the users at that internet domain. And using the concept of a priori probability distributions makes the hunt for these sesame words {[tm] /me :) } easier by limiting the dictionary to be searched to the keywords of the field/domain about to be spammed. That is what makes this dangerous.
The counterattack from the corporate mail-server will be to look for these similarly unique messages being sent to multiple users.
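One way a mail server might look for "similarly unique" messages is near-duplicate detection. The sketch below compares message bodies by Jaccard similarity over three-word shingles. The technique is swapped in for illustration, not something the article describes, and the sample inbox, the 0.5 threshold, and every function name are assumptions.

```python
from itertools import combinations

def shingles(text, k=3):
    """Set of k-word windows from the message body."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Overlap of two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicate_pairs(inbox, threshold=0.5):
    """Return recipient pairs whose messages overlap at or above the threshold."""
    sets = {rcpt: shingles(body) for rcpt, body in inbox.items()}
    return [(r1, r2) for r1, r2 in combinations(sorted(sets), 2)
            if jaccard(sets[r1], sets[r2]) >= threshold]

# Hypothetical messages seen by one server: two probes that differ only in the
# appended pass-through word, plus one legitimate mail.
inbox = {
    "alice": "buy cheap pills online today from our trusted store lithography",
    "bob": "buy cheap pills online today from our trusted store debenture",
    "carol": "minutes from the lithography process review are attached",
}

suspects = near_duplicate_pairs(inbox)
print(suspects)
```

Here only the alice/bob pair is flagged, since the two probes share most of their shingles and the legitimate mail shares none. A real server would have to do this at scale, and a spammer who varies the body as well as the padding would lower the overlap.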
I don't know about you but here in France we have rules to deal with illicit poster ads. It's a 100-year-old law that people/companies put up on their walls, stating that those who put up posters will be prosecuted as well as those for whom they are advertising. This takes care of that. If spam laws also targeted the retail stores advertised in the spam, then far fewer Viagra/Nigerian etc. stores would be paying spammers to do this. It's as simple as that, so why can't it be done? Don't tell me these stores are abroad; there are international laws for that. Also, most of these spam-advertised companies are US based.
Artificial intelligence is no match for natural stupidity
That's an overly strong statement to make, and even a little bit irritating to people like myself who actually implement statistical content filters, natural language systems, etc.
If you are equating "content based filtering" to "Bayesian filtering" then you really only understand 1% of the current state of document classification. Bayesian filtering is a rage right now because it's a linear time algorithm (i.e., implementable on PC hardware). There are document classification schemes that will eat Bayesian for lunch, which are not appropriate for email filtering at this time because of their computational cost. But with continual progress on the algorithms, new methods for reducing search spaces via extremely clever sense-similarity heuristics, and with computers doubling in speed every 18 months, it's closer than you think.
The spam/ham problem is what data mining researchers would call a "toy problem." You want us to classify documents into only two classifications? Only two? Piece of cake. The problem is, you want us to do it on PC hardware where it isn't feasible to run O(n^2) or O(n^3) machine learning algorithms.
Let the researchers continue what they're doing. People are just now starting to apply SVMs and other cool techniques to the problem of spam filtering. You'd be amazed at how many of the well-known data mining and statistical NLP researchers have not even thought of using their arsenal against spam.
It's coming, please be patient.
He managed to, randomly, find words that were high in _HIS_ "ham" list.
He could have saved himself a lot of time and trouble and just looked in that file.
And that file will be different for EVERY installation. So the words he found ("Berkshire", "Marriott", "wireless", "touch" and "comment") would NOT get spam past MY filter.
So, the spammers have to keep (and update) a word list for EVERY PERSON on their lists.
Which means that, with an incredible amount of effort, the spammers will be able to get spam to the people least likely to purchase a product from a spammer.
There is no problem.
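The "just look in that file" point can be sketched as follows. The token tables are invented, and the ranking by a smoothed ham-to-spam ratio is only an assumption about how one might read the hammiest words out of a filter's word counts; it is not POPFile's actual file format or scoring.

```python
def hammiest_words(ham_counts, spam_counts, top=3):
    """Rank words by (ham count + 1) / (spam count + 1), highest first."""
    def ratio(word):
        return (ham_counts.get(word, 0) + 1) / (spam_counts.get(word, 0) + 1)

    words = set(ham_counts) | set(spam_counts)
    # Sort by descending ratio, breaking ties alphabetically.
    return sorted(words, key=lambda w: (-ratio(w), w))[:top]

# Hypothetical per-installation word counts for two different users.
user_a_ham = {"berkshire": 40, "marriott": 25, "wireless": 30, "free": 2}
user_a_spam = {"free": 50, "wireless": 1}

user_b_ham = {"lithography": 60, "fpga": 35, "wafer": 20, "wireless": 1}
user_b_spam = {"wireless": 30, "free": 45}

top_a = hammiest_words(user_a_ham, user_a_spam)
top_b = hammiest_words(user_b_ham, user_b_spam)
print(top_a, top_b)
```

With these made-up tables the two users' top words do not overlap at all, and "wireless" is hammy for one user and spammy for the other, which is why a word list tuned against one installation says little about another.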
How exactly is attacking me going to help? Unless you yourself are a spammer? Since I make a living working on anti-spam and released POPFile for free I can't see how attacking me is going to make the spam problem any better.
Perhaps you didn't read the article: I am not a spammer, I work for a company that makes anti-spam software.
John.
Nope, because my Bayesian filter works just as well for 0bfu5c4t3d words as it does for properly spelled ones. They are all just sequences of letters, and anything that is deliberately misspelled is going to become identified as spammy very quickly.
Or why you're getting so much spam where the text content is contained primarily in images rather than plaintext?
Nope, because I have images turned off by default in my mail viewer. If a stranger wants me to read his email, he'll need to send it as plain text, because (as you point out) HTML email with images is used as a spam vector and little else.
BTW, this article explains why there will never be a filtering-based solution to solving spam until SMTP itself is made more secure.
Funny, my Bayesian filter is working fine at this very moment. Who should I believe, your article or my own eyes?
Jeremy
I don't care if it's 90,000 hectares. That lake was not my doing.
The problem with obfuscated words is that there is a pretty sizable set of permutations for any given word. If one obfuscated variant ends up in your spam word list, that doesn't take care of the thousands of other obfuscated versions of the exact same word.
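A rough way to see how fast the variants multiply is to enumerate look-alike substitutions. The substitution table below is a small made-up assumption; it ignores inserted spaces, punctuation, and repeated letters, so the counts are lower bounds.

```python
from itertools import product

# Hypothetical look-alike substitutions; each letter maps to itself plus stand-ins.
SUBSTITUTIONS = {"a": "a4@", "e": "e3", "i": "i1!", "o": "o0", "s": "s5$"}

def variant_count(word):
    """Number of spellings reachable by per-letter substitution."""
    n = 1
    for ch in word.lower():
        n *= len(SUBSTITUTIONS.get(ch, ch))
    return n

def variants(word):
    """All spellings reachable by per-letter substitution, original included."""
    choices = [SUBSTITUTIONS.get(ch, ch) for ch in word.lower()]
    return ["".join(combo) for combo in product(*choices)]

print(variant_count("viagra"))      # 1 * 3 * 3 * 1 * 1 * 3 = 27
print(variant_count("obfuscated"))  # 2 * 3 * 3 * 2 = 36
```

Even this tiny table gives dozens of spellings per word, and each one is a separate token to a filter that treats words as plain sequences of letters. Whether that helps the spammer depends on how quickly each variant gets trained as spammy, which is the point argued in the parent comment.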
Nope, because I have images turned off by default in my mail viewer. If a stranger wants me to read his email, he'll need to send it as plain text, because (as you point out) HTML email with images is used as a spam vector and little else.
Ahh..yes! I have them turned off, too! But isn't the whole point of Bayesian filtering to stop the spam before it reaches your inbox? Sure, you've got images turned off so you don't see the spam, but if Bayesian is so great, why is the spam in your inbox to begin with?
Funny, my Bayesian filter is working fine at this very moment. Who should I believe, your article or my own eyes?
You can believe your own eyes if you wish, but your misconception is assuming that if Bayesian is working for you it is also working for everyone else. Don't get me wrong...Bayesian filtering is a pretty nifty technology. But let's not pretend it's a universal solution that works for everyone.
For whatever reason, the mix of spam I get isn't caught all that effectively by my Bayesian filter. So, believe your eyes if you wish, but don't claim that my eyes must see exactly what yours do.
Shame on Google.
Well, you know the great thing about standards is that you have so many to choose from!
However, if you choose the current (dated 26 January 2000) W3C XHTML recommendations then yes, the quotes are required.
I like my women like my coffee... pale and bitter.
I choose to view all headers, but then I click the [-] in the top left corner of the headers and then see a single line with Subject:, From:, and time. Then when I want to reclassify something, I click the [+] (same place as the [-]) and copy the X-POPFile-Link header to Firebird or whatever browser you use. <http://bugzilla.mozilla.org/show_bug.cgi?id=23114> is probably what they were referring to when they said this is an email client issue. If that bug is fixed, POPFile will be perfect for me. (Remember that Bugzilla doesn't take /. referrals--you'll have to copy and paste the link location.)