Using gzip As A Spam Filter
captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."
That's because most spam includes large amounts of HTML.
My friends do not use HTML in email. Ads for "Crimescene Cocksuckers" does.
Its not simply the words that are used in a mail, but the way they are used (the order) that gives a sentence its meaning.
for example Two Emails:
1 (ham) "You have won a brand new Convertible, from the competition you entered."
and
2 (spam) "A brand new convertible to be won, have you entered?"
Ham would match about 80% with spam.
Word matching is a blunt instrument as mentioned. The English language is far too complex for simple calculations, this fact should be taken into consideration, when applying a 'Spam Likelihood' rating to Emails.
A Bayesian spam filter uses an underlying order-0 Markov model of email messages. gzip uses an underlying order-1 Markov model.
A Bayesian filter uses words as "symbols." gzip uses bytes as symbols.
The right thing to do would be to combine them.Ttake a gzip-style Markov model, using bytes as symbols and conditional probabilities, and plug it into a Bayesian filter. That would (1) make the filter more powerful and (2) make the filter applicable to any sort of data, arbitrary binary or readable text. Negligible computational overhead, sharper discrimination.
Obviously it wouldn't be a big problem for the spammers to run their creative gems through gzip, and alter the content until they achieve lower compression ratio. Even including a bunch of garbage after the message might do the trick. I believe equivalent analysis can be done cheaper with non-gzip tools...
Save your wrists today - switch to Dvorak
You know, I noticed something peculiar. If you're from a non-English speaking country, like I am, you can filter the spam by looking at the language of the subject. In my case, if it is English it is almost certainly spam.
Do English-speaking people receive spam in foreign languages?
Bayesian only refers to how you use the probabilities.
Now gzip implements similar ideas to LZW compression, which uses variable sized prefixes, which is quite different from an 1-order Markov model. For example, and order 1 Markov model is not allowed to depend on more than the current and immediately preceding symbol.