Armoring Spam Against Anti-Spam Filters

Posted by timothy on Wednesday February 4, 2004 @03:18AM from the take-two-viagra-and-call-nigeria-in-the-a.m. dept.

moggyf points to a BBC article about how spam can be successfully tweaked to slip past current filtering methods, excerpting "To find out how to beat the filters Mr Graham-Cumming sent himself the same message 10,000 times but to each one added a fixed number of random words. When a message got through he trained an 'evil' filter that helped to tune the perfect collection of additional words." iluvspam adds "It's an interview with POPFile author John Graham-Cumming that summarizes his talk at the recent MIT Spam Conference. You can still listen to the technical details here (choose the Afternoon 1 session, he starts about 75 minutes in)."
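A minimal sketch of why added words can help a message through, assuming a filter that combines per-token spam probabilities naive-Bayes style. The token table, the probabilities in it, and the `spam_score` helper are all invented for illustration; this is not POPFile's code or the method from the talk.

```python
import math

# Hypothetical per-token spam probabilities a trained filter might hold.
# Every value here is made up for the example.
TOKEN_SPAM_PROB = {
    "viagra": 0.99, "cheap": 0.90, "pills": 0.95,
    "lithography": 0.05, "meeting": 0.10, "compiler": 0.05, "kernel": 0.08,
}
UNKNOWN_PROB = 0.4  # assumed score for tokens the filter has never seen


def spam_score(tokens):
    """Combine per-token probabilities naive-Bayes style, in log space."""
    log_spam = 0.0
    log_ham = 0.0
    for tok in tokens:
        p = TOKEN_SPAM_PROB.get(tok, UNKNOWN_PROB)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Turn the log-odds back into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


payload = ["cheap", "viagra", "pills"]
padding = ["lithography", "meeting", "compiler", "kernel"]

plain = spam_score(payload)
padded = spam_score(payload + padding)
print(f"payload only: {plain:.4f}")
print(f"with padding: {padded:.4f}")
```

With these made-up numbers the padded message scores well below the unpadded one, which is the effect the random-word experiment is probing for. Which words count as "hammy" differs per recipient, so the spammer has to discover them by trial.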


infinite monkeys by bluelip · 2004-02-04 03:19 · Score: 5, Funny

So the ultimate spam protection mechanism would be an infinite number of monkeys typing my list of words to associate w/ spam. :)

--

Yep, I never spell check.
More incorrect spellings can be found he
1. Re:infinite monkeys by Jonas+the+Bold · 2004-02-04 03:36 · Score: 5, Funny
  
  You kids and your monkeys
  
  In my day we didn't have monkeys. We had to filter spam by hand. And we liked it!
  
  You kids and your infinite monkeys... Shakespeare wouldn't have used monkeys were he alive today. He would have rolled up his sleeves and written Hamlet the right way!
  
  Damn kids..
  
  --
  Everything seemed to be going so nice
  'till the end of all beings punched right through the ice
2. Re:infinite monkeys by TheDigitalRaven · 2004-02-04 03:42 · Score: 5, Funny
  
  Hands? Them're luxury! When I were a lad, hands were summat only posh people had. The rest of us had to make do with paws which hadn't evolved fully yet, and we had to filter all of our spam from each mailbox manually, but we had to go to the mailbox - across a river of lava, mind - to collect each message but couldn't filter it until we got back. We'd sort spam twenty six hours a day, getting up two hours before going to bed, and had to eat cold poison while we were doing it. And we had to pay for the privilege of being allowed to filter our own!
3. Re:infinite monkeys by letxa2000 · 2004-02-04 04:00 · Score: 5, Insightful
  
  I'm not sure I understand why they think this is a problem with Bayesian filtering. Basically, they're saying that if a spammer sends you the same message thousands of times but inserts a few slightly different words each time, and if the thousands of messages get through the Bayesian filter to the user, and if the user doesn't disable HTML bugs on his email client, then we have a problem...?
  First, if the spammer sends thousands of copies of the same message and just changes the "extra words" that he is testing, it will take very little time for Bayesian to adapt to the rest of the message. Suddenly, the rest of the message that previously contained non-spammy words will be considered very spammy and will overwhelm the "extra words" that each message contains. Each time the message is caught as spam, the probability that any future tests get through--regardless of the "extra words"--will be reduced even further.
  Second, as the article said, it's a lot of work on the part of the spammer. They'd have to send out thousands of messages to each target to "sniff them out," and most of those wouldn't even be effective: most would be caught by filters, and of the few that got through, very few would load the HTML bugs to identify themselves.
  Finally, it assumes that those that are using Bayesian filters are filtering their email but leaving their security (inasmuch as HTML bugs) wide open. While there may be some people that use Bayesian and leave HTML bugs active, it has to be a small minority.
  In short, it seems to me they've "found" a way to get around Bayesian that won't work, so to speak. I just don't see the problem.... ??
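The adaptation argument above can be sketched with a toy token-count filter. This assumes Laplace-smoothed per-class frequencies; `TinyBayes`, the training messages, and the counts are hypothetical and do not reproduce any real filter.

```python
from collections import Counter


class TinyBayes:
    """Toy token-count filter for illustration only."""

    def __init__(self):
        self.spam = Counter()  # number of spam messages containing each token
        self.ham = Counter()   # number of ham messages containing each token
        self.n_spam = 0
        self.n_ham = 0

    def train(self, tokens, is_spam):
        counts = self.spam if is_spam else self.ham
        counts.update(set(tokens))  # count each token once per message
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def token_spam_prob(self, tok):
        # Laplace-smoothed frequency of the token within each class.
        s = (self.spam[tok] + 1) / (self.n_spam + 2)
        h = (self.ham[tok] + 1) / (self.n_ham + 2)
        return s / (s + h)


f = TinyBayes()
for _ in range(20):
    f.train(["cheap", "pills"], True)      # older, unrelated spam
for _ in range(18):
    f.train(["meeting", "notes"], False)   # ordinary mail
for _ in range(2):
    f.train(["meeting", "offer"], False)   # "offer" sometimes appears in ham

body = ["special", "offer", "inside"]      # the fixed part of the probe message
before = f.token_spam_prob("offer")
for i in range(10):
    # Each probe keeps the same body and swaps in one different extra word.
    f.train(body + [f"randomword{i}"], True)
after = f.token_spam_prob("offer")

print(f"P(spam | 'offer') before probes: {before:.2f}")
print(f"P(spam | 'offer') after probes:  {after:.2f}")
```

In this toy setup the fixed body token starts out leaning towards ham and ends up leaning towards spam after ten caught probes, while each extra word has been seen only once. It only shows the trend when the user marks the probes as spam; it says nothing about how fast a real filter would adapt.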
4. Re:infinite monkeys by Theresa1 · 2004-02-04 04:42 · Score: 5, Funny
  
  cold poison ?! you lucky buggers.
  We were so poor we had to eat spam.
  
  --
  This is a manual signature virus. Copy to your signature file and help me spread.
5. Re:infinite monkeys by Tripster · 2004-02-04 06:37 · Score: 5, Funny
  
  Don't know about you but my wife won't let me have one!
Ok fuck it by tomstdenis · 2004-02-04 03:21 · Score: 5, Funny

I will pay 1000$ to anyone who seeks out and beats the living daylights out of a spammer. With as many pics on the web as possible for posterity.

Screw these filters and shit. Start creaming spammers worldwide and they'll think twice about it.

Tom

--
Someday, I'll have a real sig.
1. Re:Ok fuck it by nigelc · 2004-02-04 03:43 · Score: 5, Funny
  
  Ahh, an international terrorist proposing an attack. We should be invading Canada any day now...
  
  --
  
  Cthulhu Barata Nikto
Obligatory POPFile Link by rmohr02 · 2004-02-04 03:21 · Score: 5, Interesting

POPFile, maintained by John Graham-Cumming, is the best spam filter I've used. There may be small flaws with the fundamental concept of Bayesian filters, but POPFile still blocks all my spam.
Tch tch... by supersam · 2004-02-04 03:22 · Score: 5, Insightful

Didn't they know something as simple as...

"Make it idiot-proof, and someone will make a better idiot"
Re:Hmmm... by somethinghollow · 2004-02-04 03:25 · Score: 5, Insightful

Like many other academic studies, such as skinning humans alive to see how long they can live, I think this one should only be placed into the right hands.

It's a pisser that spammers now have another tool to circumvent filters; on the other hand, the people who write the filters know exactly what a spammer would do to make "better" spam.

The question is: who will implement first?
Re:Great by stevesliva · 2004-02-04 03:30 · Score: 5, Funny

Guess which words all tomorrow's SPAM will contain...
Touch my wireless Berkshire Marriot?

--
Who do you get to be an expert to tell you something's not obvious? The least insightful person you can find? -J Roberts
Re:Hmmm... by JohnGrahamCumming · 2004-02-04 03:32 · Score: 5, Informative

If people working in anti-spam don't try to break their own filters the spammers will do it for them and we'll be worse off.

There's a direct analogy with cryptographic techniques where breaking them is most of the work... that way we know that they are secure.

John.
Re:That's dedication... :( by kris_lang · 2004-02-04 03:55 · Score: 5, Informative

Yes, it's dedication to research. He sent himself the 10k messages to see if he could outwit his own Bayesian filtering of spam messages. He effectively deduced that if the incoming message can be similar enough to items that have been specifically marked non-spam by the end-user of the Bayesian spam filter, it will not be marked as spam.

There's a cunning recursiveness to this which is at that fine line between clever and stupid. The difficulty is, as he also deduces, that each person's Bayesian rules for spam vs. nonspam are unique and will require many attempts in order to infer the pass-through words that will create a false negative and allow the spam to come through.

The one step that people are missing is that if the evil spammer wishes to work on spamming a domain (both in the internet sense and in the "domain of expertise/specialization" sense) she can tailor the pass-through words to the market. If she's sending spam to Intel or AMD corporate addresses, then lithography might be the magic word; if she's spamming Xilinx, then FPGA will route through the Bayesian filter; if she's spamming Dave Barry, then debenture and fish falling from the sky might help spam make it through. Natalie may or may not make it through a /.'er's filter; actually, including slashdot in the subject or as the name will usually make it through a slashdotter's filter.

And the ease of this lies in that tailoring the open sesame words to a market will probably open the doors to all of the e-mail recipients at a domain, particularly if the spam filtering is done at the mail-server level and not at the end-user level. Thus rather than having to send 10k messages to a single user to crack open the spam doors, sending those 10k messages to multiple users at a domain and analysing which ones get through will effectively open the floodgates for all of the users at that internet domain. And using the concept of a priori probability distributions makes the hunt for these sesame words {[tm] /me :) } easier by limiting the dictionary to be searched to the keywords of the field/domain about to be spammed. That is what makes this dangerous.

The counterattack from the corporate mail-server will be to look for these messages, nearly identical apart from a few unique words, being sent to multiple users.
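One way a mail server might look for such near-identical messages is sketched below, assuming word-shingle sets compared by Jaccard similarity. The helper names, the thresholds, and the example.com addresses are all hypothetical, and a production server would need something far more scalable than this pairwise comparison.

```python
def shingles(text, k=3):
    """Set of k-word shingles from a lowercased message body."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_campaigns(messages, threshold=0.5, min_recipients=3):
    """Group (recipient, body) pairs whose bodies are near-duplicates.

    Returns the recipient sets of groups that reached min_recipients.
    """
    groups = []  # each group keeps the shingles of its first message
    for recipient, body in messages:
        sig = shingles(body)
        for group in groups:
            if jaccard(sig, group["sig"]) >= threshold:
                group["recipients"].add(recipient)
                break
        else:
            groups.append({"sig": sig, "recipients": {recipient}})
    return [g["recipients"] for g in groups
            if len(g["recipients"]) >= min_recipients]


base = ("buy cheap pills now from our trusted online pharmacy "
        "with fast discreet shipping today")
messages = [
    ("alice@example.com", base + " lithography"),
    ("bob@example.com", base + " fpga"),
    ("carol@example.com", base + " wafer"),
    ("dave@example.com",
     "minutes from the design review meeting are attached for your comments"),
]

flagged = flag_campaigns(messages)
print(flagged)
```

Here the three probes share all but one shingle, so they land in one group, while the unrelated message stays on its own. A spammer who varies more of the body would push the similarity below the threshold, so the cutoff is a trade-off, not a fix.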
Re:"and can be combated." by GMontag · 2004-02-04 04:07 · Score: 5, Funny

but how do you combat the spammer?

1. Find spammer

2. Kill spammer

3. Become hero of the interweb

4. Write book from prison

5. ???

6. Profit!

Your question is exactly why the death penalty belongs on the street, not in prison.

--
Eve Fairbanks says I drive a hybrid!LOL
Re:Here's a sneaky one... by pclminion · 2004-02-04 04:30 · Score: 5, Informative

"People just have to realise that filtering based on content doesn't work, and will never work, until perhaps we have strong AI."
That's an overly strong statement to make, and even a little bit irritating to people like myself who actually implement statistical content filters, natural language systems, etc.
If you are equating "content based filtering" to "Bayesian filtering" then you really only understand 1% of the current state of document classification. Bayesian filtering is all the rage right now because it's a linear time algorithm (i.e., implementable on PC hardware). There are document classification schemes that will eat Bayesian for lunch, which are not appropriate for email filtering at this time because of their computational cost. But with continual progress on the algorithms, new methods for reducing search spaces via extremely clever sense-similarity heuristics, and with computers doubling in speed every 18 months, it's closer than you think.
The spam/ham problem is what data mining researchers would call a "toy problem." You want us to classify documents into only two classifications? Only two? Piece of cake. The problem is, you want us to do it on PC hardware where it isn't feasible to run O(n^2) or O(n^3) machine learning algorithms.
Let the researchers continue what they're doing. People are just now starting to apply SVMs and other cool techniques to the problem of spam filtering. You'd be amazed at how many of the well-known data mining and statistical NLP researchers have not even thought of using their arsenal against spam.
It's coming, please be patient.