Armoring Spam Against Anti-Spam Filters
moggyf points to a BBC article about how spam can be successfully tweaked to slip past current filtering methods, excerpting "To finding out how to beat the filters Mr Graham-Cumming sent himself the same message 10,000 times but to each one added a fixed number of random words. When a message got through he trained an 'evil' filter that helped to tune the perfect collection of additional words."
iluvspam adds "It's an interview with POPFile author John Graham-Cumming that summarizes his talk at the recent MIT Spam Conference. You can still listen to the technical details here (choose the Afternoon 1 session, he starts about 75 minutes in)."
SO the ultimate spam protection mechanism would be an infinite number of monkeys type my list of words to associate w/ spam. :)
Yep, I never spell check.
More incorrect spellings can be found he
I will pay 1000$ to anyone who seeks out and beats the living daylights out of a spammer. With as many pics on the web as possible for posterity.
Screw these filters and shit. Start creaming spammers worldwide and they'll think twice about it.
Tom
Someday, I'll have a real sig.
POPFile, maintained by John Graham-Cumming, is the best spam filter I've used. There may be small flaws with the fundamental concept of Bayesian filters, but POPFile still blocks all my spam.
Didn't they know something as simple as...
"Make it idiot-proof, and someone will make a better idiot"
Like many other academic studies, such as skinning humans alive to see how long they can live, I think this one should only be placed into the right hands.
It's a pisser that spammers now have another tool to circumvent filters; on the other hand, the people who write the filters know exactly what a spammer would do to make "better" spam.
The question is: who will implement first?
Who do you get to be an expert to tell you something's not obvious? The least insightful person you can find? -J Roberts
If people working in anti-spam don't try to break their own filters the spammers will do it for them and we'll be worse off.
There's a direct analogy with cryptographic techniques where breaking them is most of the work... that way we know that they are secure.
John.
Yes, it's dedication to research. He sent himself the 10k messages to see if he could outwit his own Bayesian filtering of spam messages. He effectively deduced that if the incoming message can be similar enough to items that have been specifically marked non-spam by the end-user of the Bayesian-spam-filter, it will be not be marked as spam.
/.'ers filter, actually usually including slashdot in the subject or as the name usually will make it through a slashdotter's filter. And the ease of this lies in that tailoring the open sesame words to a market will probably open the doors to all of the e-mail recipients at a domain, particularly is the spam filtering is done at the mail-server level and not at the end-user level. Thus rather than having to send 10k messages to a single user to crack open the spam doors, sending those 10k messages to multiple users at a domain and analysing which ones get through will effectively open the floodgates for all of the users at that internet domain. And using the concept of a priori probability distributions makes the hunt for these sesame words {[tm] /me :) } easier by limiting the dictionary to be searched to the keywords of the field/domain about to be spammed. That is what makes this dangerous.
There's a cunning recursiveness to this which is at that fine line between clever and stupid. The difficulty is, as he also deduces, that each person's Bayesian rules for spam vs. nonspam are unique and will require many attempt in order to infer the pass-through words that will create a false negative and allow the spam to come through. The one step that people are missing is that if the evil spammer wishes to work on spamming a domain (both in the internet sense and in the "domain of expertise/specialization" sense) she can tailor the pass through words to the market. If she's sending spam to Intel or AMD corporate addresses, then lithography might be the magic word; if she's spamming Xilinx, the fpga will route through the Bayesian filter; if she's spamming Dave Barry, then debenture and fish falling from the sky might help spam make it through, Natalie may or may not make it through a
The counterattack from the corportate mail-server will be to look for these similarly unique messages being sent to multiple users.
but how do you combat the spammer?
1. Find spammer
2. Kill spammer
3. Become hero of the interweb
4. Write book from prison
5. ???
6. Profit!
Your question is exactly why the death penalty belongs on the street, not in prison.
Eve Fairbanks says I drive a hybrid!LOL
That's an overly strong statement to make, and even a little bit irritating to people like myself who actually implement statistical content filters, natural language systems, etc.
If you are equating "content based filtering" to "Bayesian filtering" then you really only understand 1% of the current state of document classification. Bayesian filtering is a rage right now because it's a linear time algorithm (i.e., implementable on PC hardware). There are document classification schemes that will eat Bayesian for lunch, which are not appropriate for email filtering at this time because of their computational cost. But with continual progress on the algorithms, new methods for reducing search spaces via extremely clever sense-similarity heuristics, and with computers doubling in speed every 18 months, it's closer than you think.
The spam/ham problem is what data mining researchers would call a "toy problem." You want us to classify documents into only two classifications? Only two? Piece of cake. The problem is, you want us to do it on PC hardware where it isn't feasible to run O(n^2) or O(n^3) machine learning algorithms.
Let the researchers continue what they're doing. People are just now starting to apply SVMs and other cool techniques to the problem of spam filtering. You'd be amazed at how many of the well-known data mining and statistical NLP researchers have not even thought of using their arsenal against spam.
It's coming, please be patient.