Using gzip As A Spam Filter

Posted by timothy on Monday January 27, 2003 @01:15AM from the showing-some-adaptability dept.

captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."
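
As a rough illustration of the quoted idea (a sketch only; the article's actual procedure is not reproduced here), the ratio of compressed size to original size can be measured with Python's zlib module, which implements the same DEFLATE compression that gzip uses. The function name and both sample strings are made up for this example.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Return compressed size / original size (lower means more repetition)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot compute a ratio for empty text")
    return len(zlib.compress(raw, 9)) / len(raw)

# Hypothetical sample texts: one highly repetitive, one with little repetition.
repetitive = "buy now! " * 50
varied = "The quick brown fox jumps over the lazy dog near the riverbank at dusk."

print(compression_ratio(repetitive))
print(compression_ratio(varied))
```

The repetitive string compresses to a small fraction of its size, while the short varied sentence barely compresses at all, which is the signal the quoted passage describes.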

14 of 268 comments

Raw data by gazbo · 2003-01-27 01:20 · Score: 5, Informative

This article will make much more sense if you look at the raw data in tabular form.
Re:Text of the full article by Anonymous Coward · 2003-01-27 01:24 · Score: 5, Insightful

> The current fad among spam filters is word-counting, with various statistical heuristics applied to the results.

The current fad is in fact Bayesian filtering, sophisticated statistical analysis.

gzip used this way can be viewed as a very poor Bayesian analysis with substantially lower effectiveness. Let's just skip the half-assed attempt and go straight to the real thing.
Meet the Bayesian Filtering Algorithm by dpete4552 · 2003-01-27 01:25 · Score: 5, Informative

http://www.paulgraham.com/spam.html

--
http://www.archive.org/details/ThePowerOfNightmares
1. Re:Meet the Bayesian Filtering Algorithm by coyul · 2003-01-27 03:34 · Score: 5, Informative
  
  > OTOH, it seems to me that some other model, such as a scheme that gives legitimate senders explicit advance AUTHORIZATION to send you email, might be what's needed.
  
  I understand what you're saying, but there are a couple of problems with this, depending on how you implement it. If you allow potential correspondents to request authorization by email, you'll still have to process at least one message per originating address. That obviously won't work to eliminate spam (or even cut it down to size...). The other option is to force potential correspondents to request authorization over another channel (phone, fax, whatever), but this neatly destroys a lot of the convenience of email. It also eliminates the impersonal nature of email (by forcing a personal contact), when it is partly this impersonality that distinguishes email in the first place (and encourages some first-time correspondents to make contact at all...).
  
  > May not be the ultimate filter (and I doubt it could be), but it's real interesting, I think, that this appears to have considerably greater than zero accuracy.
  
  Actually, the Bayesian filter implemented by POPFile is remarkably accurate. A friend of mine has been using it since it debuted on Slashdot in November, and it has correctly classified all of the spam he's received since (76% of his email in total, unfortunately...).
  
  You can also set up POPFile to process the headers of your messages as well as the body, so it will effectively learn the email addresses of people you're willing to receive email from anyway. Depending on how you define words (what you use as token separators), you could attempt to make it generalize to domains as well.
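
To make the point about token separators concrete, here is a minimal sketch. It is not POPFile's actual tokenizer; the `tokenize` helper, the separator patterns, and the sample header are all assumptions made for illustration. It shows how adding "@" to the separators turns the sender's domain into a token of its own, so a word-based filter could learn domains as well as full addresses.

```python
import re

def tokenize(text: str, separators: str) -> list[str]:
    """Split text on a regex of separator characters, dropping empty tokens."""
    return [tok.lower() for tok in re.split(separators, text) if tok]

# Hypothetical header line.
header = "From: Alice Example <alice@example.org>"

# Whitespace, colons, and angle brackets as separators:
# the whole address survives as a single token.
whole_address = tokenize(header, r"[\s:<>]+")

# Adding "@" as a separator makes the domain a separate token.
with_domain = tokenize(header, r"[\s:<>@]+")

print(whole_address)
print(with_domain)
```

With the first pattern the filter only ever sees `alice@example.org`; with the second it also sees `example.org`, which any other sender at that domain would share.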
HTML by Pilferer · 2003-01-27 01:26 · Score: 5, Interesting

That's because most spam includes large amounts of HTML.

My friends do not use HTML in email. Ads for "Crimescene Cocksuckers" do.
Excellent by Phosphor3k · 2003-01-27 01:28 · Score: 5, Funny

Slashdot can use it to filter out duplicate stories.
Quantitative, not qualitative by psplay · 2003-01-27 01:30 · Score: 5, Interesting

It's not simply the words that are used in a mail, but the way they are used (the order) that gives a sentence its meaning.

For example, take two emails:

1 (ham) "You have won a brand new Convertible, from the competition you entered."

and

2 (spam) "A brand new convertible to be won, have you entered?"

By word matching alone, the ham would match the spam by about 80%.

Word matching is a blunt instrument, as mentioned. The English language is far too complex for simple calculations, and this fact should be taken into consideration when applying a 'Spam Likelihood' rating to emails.
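
The overlap between the two example emails can be checked with a short sketch. The tokenization rule here (lowercase, letters and apostrophes only, duplicates ignored) is an assumption; a real filter may split words differently and get a slightly different figure.

```python
import re

def words(text: str) -> set[str]:
    """Return the set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))

ham = "You have won a brand new Convertible, from the competition you entered."
spam = "A brand new convertible to be won, have you entered?"

shared = words(ham) & words(spam)
overlap = len(shared) / len(words(spam))

print(f"{len(shared)} of {len(words(spam))} distinct spam words also appear in the ham")
```

Under this tokenization, 8 of the 10 distinct words in the spam also occur in the ham, which is consistent with the "about 80%" figure even though the two messages mean different things.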
Not that different by Synonymous Soured · 2003-01-27 01:32 · Score: 5, Interesting

A Bayesian spam filter uses an underlying order-0 Markov model of email messages. gzip uses an underlying order-1 Markov model.

A Bayesian filter uses words as "symbols." gzip uses bytes as symbols.

The right thing to do would be to combine them. Take a gzip-style Markov model, using bytes as symbols and conditional probabilities, and plug it into a Bayesian filter. That would (1) make the filter more powerful and (2) make the filter applicable to any sort of data, arbitrary binary or readable text. Negligible computational overhead, sharper discrimination.
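
One way to read this proposal is sketched below: an order-1 Markov model over bytes (each byte conditioned on the previous one) trained separately on spam and ham, with the two likelihoods combined by Bayes' rule into log-odds. This is an illustrative interpretation, not code from gzip or from any existing filter; the class name, the add-one smoothing, the prior, and the tiny training strings are all assumptions.

```python
import math
from collections import defaultdict

class ByteBigramModel:
    """Order-1 Markov model over bytes with add-one smoothing."""

    def __init__(self) -> None:
        self.pair_counts = defaultdict(int)     # (previous byte, byte) -> count
        self.context_counts = defaultdict(int)  # previous byte -> count

    def train(self, data: bytes) -> None:
        for prev, cur in zip(data, data[1:]):
            self.pair_counts[(prev, cur)] += 1
            self.context_counts[prev] += 1

    def log_likelihood(self, data: bytes) -> float:
        """Sum of log P(byte | previous byte) over the message."""
        total = 0.0
        for prev, cur in zip(data, data[1:]):
            num = self.pair_counts.get((prev, cur), 0) + 1
            den = self.context_counts.get(prev, 0) + 256  # 256 possible bytes
            total += math.log(num / den)
        return total

def spam_log_odds(message: bytes, spam_model: ByteBigramModel,
                  ham_model: ByteBigramModel, prior_spam: float = 0.5) -> float:
    """Bayes' rule in log form: positive means spam is the likelier class."""
    prior = math.log(prior_spam / (1.0 - prior_spam))
    return prior + spam_model.log_likelihood(message) - ham_model.log_likelihood(message)

# Toy training data, far too small for real use.
spam_model, ham_model = ByteBigramModel(), ByteBigramModel()
spam_model.train(b"buy cheap pills now, cheap pills, buy now")
ham_model.train(b"meeting moved to tuesday, see you at lunch")

print(spam_log_odds(b"cheap pills", spam_model, ham_model))
print(spam_log_odds(b"see you tuesday", spam_model, ham_model))
```

Because the symbols are bytes, the same code accepts binary attachments as readily as text, which is the second benefit the comment claims.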
Same old problem... by artemis67 · 2003-01-27 01:34 · Score: 5, Insightful

Filtering is not a true spam solution. All it takes is one false positive, where a Really Important Email is accidentally deleted, to totally destroy the value of any filtering system.

Given that, the alternative to having tagged emails automatically deleted is to collect them in a folder and scan the message senders and subject lines. If you're doing that, then the spammer is getting a pitch through to you in the subject line. This therefore does not lessen the incentive for the spammer, but simply causes him to change tactics and put his best pitch in his subject line.

Right now, I get 60-80 spams a day. What happens when I start getting 600-800 a day? Again, filtering starts to break down, because I have SO MANY messages to scan every day that the possibility of me missing a legitimate one is very high.
Spammers will adjust their tactics by ultrabot · 2003-01-27 01:34 · Score: 5, Interesting

Obviously it wouldn't be a big problem for the spammers to run their creative gems through gzip, and alter the content until they achieve a lower compression ratio. Even including a bunch of garbage after the message might do the trick. I believe equivalent analysis can be done cheaper with non-gzip tools...
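
The garbage-padding trick is easy to demonstrate. The sketch below assumes a filter that looks only at compressed size divided by original size for the single message; the message text, the padding length, and the fixed random seed are invented for the example.

```python
import random
import zlib

def compression_ratio(data: bytes) -> float:
    """Return compressed size / original size (higher means it compresses worse)."""
    return len(zlib.compress(data, 9)) / len(data)

rng = random.Random(0)  # fixed seed so the example is repeatable
message = b"Make money fast! " * 40
# Random bytes are essentially incompressible, so they drag the ratio up.
garbage = bytes(rng.randrange(256) for _ in range(len(message)))

plain = compression_ratio(message)
padded = compression_ratio(message + garbage)

print(plain, padded)
```

The padded message compresses far less well than the original, so a threshold tuned on unpadded spam would no longer fire.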

--
Save your wrists today - switch to Dvorak
Re:It's all spam by greenjinjo · 2003-01-27 01:39 · Score: 5, Interesting

You know, I noticed something peculiar. If you're from a non-English speaking country, like I am, you can filter the spam by looking at the language of the subject. In my case, if it is English it is almost certainly spam.
Do English-speaking people receive spam in foreign languages?
Yay! by Anonymous Coward · 2003-01-27 01:42 · Score: 5, Funny

What an idea!

I could use this to avoid those people who keep saying the same thing all the time, over and over again...

Now, how can I convince my mother to use e-mail?
Re:Grep it instead! by Walterk · 2003-01-27 02:19 · Score: 5, Funny

Just egrep for '(penis|enlarge|money|auction|cash|advance|fortune)'. And hope no hot babes email you complimenting your penis, or mention they want their breasts enlarged, offer you money, auction off your award-winning Lego collection or anything like that.

--

"If anyone needs me, I'm in the angry dome."
Sorry, that's not right by martin-boundary · 2003-01-27 02:35 · Score: 5, Interesting

Only naive Bayesian models are 0-order Markov. The "naive" refers precisely to the zero-order independence assumption. You can have 1-order, 2-order, or n-th order Bayesian models if you like. Those are called n-gram models. After that, you can have Bayesian phrase-based models if you like, or paragraph-based ones.
Bayesian only refers to how you use the probabilities.
Now gzip implements similar ideas to LZW compression, which uses variable-sized prefixes, which is quite different from a 1-order Markov model. For example, an order-1 Markov model is not allowed to depend on more than the current and immediately preceding symbol.
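
The notion of model order can be made concrete with a small sketch. The helper name `ngram_contexts` and the sample tokens are assumptions for illustration. For each symbol it lists the context an order-n model is allowed to condition on: nothing at order 0 (the naive case), exactly one preceding symbol at order 1, and so on.

```python
def ngram_contexts(tokens, order):
    """Yield (context, symbol) pairs, where context is the `order` preceding tokens."""
    for i in range(order, len(tokens)):
        yield tuple(tokens[i - order:i]), tokens[i]

tokens = "a brand new convertible".split()

for order in (0, 1, 2):
    print(order, list(ngram_contexts(tokens, order)))
```

The context length is fixed by the order, whereas an LZ-style compressor matches repeated strings of whatever length happens to occur, which is the distinction drawn above.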