Slashdot Mirror


Using gzip As A Spam Filter

captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."
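To make the quoted idea concrete, here is a minimal sketch that measures how repetitive a piece of data is by how well it compresses. It uses Python's `zlib` module, which implements the same DEFLATE algorithm gzip uses; the helper name `compression_ratio` and the sample strings are illustrative assumptions, not taken from the Kuro5hin article.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller = more repetition)."""
    if not data:
        raise ValueError("data must be non-empty")
    return len(zlib.compress(data, 9)) / len(data)


# A highly repetitive message versus random bytes of the same length.
# The random bytes are a stand-in for data with essentially no repeats.
repetitive = b"Buy now! Limited offer! " * 40
varied = os.urandom(len(repetitive))

print(f"repetitive: {compression_ratio(repetitive):.2f}")
print(f"varied:     {compression_ratio(varied):.2f}")
```

Real email falls somewhere between these two extremes, so any threshold for "too repetitive" would have to be tuned on actual mail.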

6 of 268 comments

  1. HTML by Pilferer · · Score: 5, Interesting

    That's because most spam includes large amounts of HTML.

    My friends do not use HTML in email. Ads for "Crimescene Cocksuckers" do.

  2. Quantitative, not qualitative by psplay · · Score: 5, Interesting

    It's not simply the words that are used in a mail, but the way they are used (the order) that gives a sentence its meaning.

    For example, two emails:

    1 (ham) "You have won a brand new Convertible, from the competition you entered."

    and

    2 (spam) "A brand new convertible to be won, have you entered?"

    Ham would match about 80% with spam.

    Word matching is a blunt instrument, as mentioned. The English language is far too complex for simple calculations, and this should be taken into consideration when applying a 'Spam Likelihood' rating to emails.
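One way to put a number on the word-matching point is the sketch below, which ignores word order entirely and compares only the sets of distinct words in the two example emails. The measure (shared words divided by the size of the smaller word set) and the helper names are assumptions chosen for illustration; the poster does not say how the 80% figure was obtained.

```python
import re


def word_set(text: str) -> set:
    """Lowercase the text and return its distinct words (order is discarded)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Fraction of the smaller message's distinct words found in the other."""
    wa, wb = word_set(a), word_set(b)
    if not wa or not wb:
        raise ValueError("both messages must contain at least one word")
    return len(wa & wb) / min(len(wa), len(wb))


ham = "You have won a brand new Convertible, from the competition you entered."
spam = "A brand new convertible to be won, have you entered?"

print(overlap(ham, spam))  # 0.8: eight of the spam's ten distinct words appear in the ham
```

Under this particular measure the two messages score 0.8 even though they read very differently, which is the weakness being described.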

  3. Not that different by Synonymous+Soured · · Score: 5, Interesting

    A Bayesian spam filter uses an underlying order-0 Markov model of email messages. gzip uses an underlying order-1 Markov model.

    A Bayesian filter uses words as "symbols." gzip uses bytes as symbols.

    The right thing to do would be to combine them. Take a gzip-style Markov model, using bytes as symbols and conditional probabilities, and plug it into a Bayesian filter. That would (1) make the filter more powerful and (2) make the filter applicable to any sort of data, arbitrary binary or readable text. Negligible computational overhead, sharper discrimination.
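A rough sketch of the combination described here might look like the following: one order-1 model over bytes per class, with the class chosen by comparing log-probabilities in the usual Bayesian way. It illustrates the suggestion as stated, not gzip's actual algorithm; the class name `ByteMarkov`, the add-one smoothing, the equal priors, and the toy training strings are all assumptions.

```python
import math
from collections import defaultdict


class ByteMarkov:
    """Order-1 byte model: P(current byte | previous byte), add-one smoothed."""

    def __init__(self) -> None:
        self.pair_counts = defaultdict(int)     # (prev, cur) -> count
        self.context_counts = defaultdict(int)  # prev -> count

    def train(self, data: bytes) -> None:
        for prev, cur in zip(data, data[1:]):
            self.pair_counts[(prev, cur)] += 1
            self.context_counts[prev] += 1

    def log_likelihood(self, data: bytes) -> float:
        total = 0.0
        for prev, cur in zip(data, data[1:]):
            # Add-one smoothing over the 256 possible byte values.
            num = self.pair_counts.get((prev, cur), 0) + 1
            den = self.context_counts.get(prev, 0) + 256
            total += math.log(num / den)
        return total


def classify(message: bytes, spam: ByteMarkov, ham: ByteMarkov,
             prior_spam: float = 0.5) -> str:
    """Pick the class with the higher log posterior (ties go to ham)."""
    spam_score = math.log(prior_spam) + spam.log_likelihood(message)
    ham_score = math.log(1 - prior_spam) + ham.log_likelihood(message)
    return "spam" if spam_score > ham_score else "ham"


spam_model, ham_model = ByteMarkov(), ByteMarkov()
spam_model.train(b"WIN CASH NOW!!! CLICK HERE TO WIN CASH NOW!!!")
ham_model.train(b"Are we still meeting for lunch on Thursday? Let me know.")

print(classify(b"CLICK NOW TO WIN CASH", spam_model, ham_model))
print(classify(b"Let me know about lunch", spam_model, ham_model))
```

Because the symbols are raw bytes, nothing here depends on the input being readable text, which is the second advantage the post claims.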

  4. Spammers will adjust their tactics by ultrabot · · Score: 5, Interesting

    Obviously it wouldn't be a big problem for the spammers to run their creative gems through gzip, and alter the content until they achieve a lower compression ratio. Even including a bunch of garbage after the message might do the trick. I believe equivalent analysis can be done cheaper with non-gzip tools...
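The padding trick can be sketched as follows. Here the ratio is defined as compressed size over original size, so a message that compresses less well shows up as a number closer to 1. The helper names, the seeded random padding, and the sample pitch are assumptions made for the example.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (closer to 1 = compresses poorly)."""
    return len(zlib.compress(data, 9)) / len(data)


def pad_with_noise(message: bytes, n_bytes: int, seed: int = 0) -> bytes:
    """Append n_bytes of pseudo-random garbage after the message."""
    rng = random.Random(seed)
    noise = bytes(rng.getrandbits(8) for _ in range(n_bytes))
    return message + noise


pitch = b"Act now! Free offer! " * 30
padded = pad_with_noise(pitch, len(pitch))

print(f"original: {compression_ratio(pitch):.2f}")
print(f"padded:   {compression_ratio(padded):.2f}")
```

A filter that looks only at the whole-message ratio would see the padded version as far less repetitive, even though the pitch itself is unchanged.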

    --
    Save your wrists today - switch to Dvorak
  5. Re:It's all spam by greenjinjo · · Score: 5, Interesting

    You know, I noticed something peculiar. If you're from a non-English speaking country, like I am, you can filter the spam by looking at the language of the subject. In my case, if it is English it is almost certainly spam.

    Do English-speaking people receive spam in foreign languages?

  6. Sorry, that's not right by martin-boundary · · Score: 5, Interesting
    Only naive Bayesian models are 0-order Markov. The "naive" refers precisely to the zero-order independence assumption. You can have 1-order, 2-order, n-th order Bayesian models if you like. Those are called n-gram models. After that, you can have Bayesian phrase-based models if you like, or paragraph-based ones.

    Bayesian only refers to how you use the probabilities.

    Now gzip implements similar ideas to LZW compression, which uses variable-sized prefixes, which is quite different from a 1-order Markov model. For example, an order-1 Markov model is not allowed to depend on more than the current and immediately preceding symbol.
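To illustrate what "order" means in this comment, the sketch below counts which symbol follows each fixed-length context. With order 0 there is no context at all (the naive independence case); with order 1 only the immediately preceding symbol is visible, and so on. The function name `context_counts` and the toy token list are assumptions for the example.

```python
from collections import Counter, defaultdict


def context_counts(symbols, order):
    """Map each length-`order` context to a Counter of the symbols that follow it."""
    table = defaultdict(Counter)
    for i in range(order, len(symbols)):
        context = tuple(symbols[i - order:i])  # empty tuple when order == 0
        table[context][symbols[i]] += 1
    return dict(table)


tokens = "buy now buy later buy now".split()

order0 = context_counts(tokens, 0)  # plain word frequencies
order1 = context_counts(tokens, 1)  # next word given the previous word
order2 = context_counts(tokens, 2)  # next word given the previous two words

print(order0[()])
print(order1[("buy",)])
print(order2[("later", "buy")])
```

The context length is fixed in advance for every position, whereas an LZ-style compressor matches repeated strings of whatever length it happens to find.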