Slashdot Mirror


Using gzip As A Spam Filter

captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."
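
To make the quoted idea concrete, here is a minimal Python sketch that uses compressed size relative to original size as a rough measure of repetition. It assumes Python's `zlib` module (the same DEFLATE algorithm gzip uses, in a different wrapper) as a stand-in for gzip; the `compression_ratio` helper and both sample strings are made up for illustration and do not come from the Kuro5hin article.

```python
import random
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by original size; lower means more repetition."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical sample inputs: one phrase repeated many times versus
# pseudo-random letters of the same length (fixed seed for repeatability).
repetitive = "buy now, limited offer! " * 40
rng = random.Random(0)
varied = "".join(
    rng.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(len(repetitive))
)

print(f"repetitive: {compression_ratio(repetitive):.2f}")
print(f"varied:     {compression_ratio(varied):.2f}")
```

The repeated phrase should come out with a far lower ratio than the pseudo-random text, which is the property the article relies on.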

6 of 268 comments

  1. Raw data by gazbo · · Score: 5, Informative

    This article will make much more sense if you look at the raw data in tabular form.

  2. Meet the Bayesian Filtering Algorithm by dpete4552 · · Score: 5, Informative

    http://www.paulgraham.com/spam.html

    --
    http://www.archive.org/details/ThePowerOfNightmares
    1. Re:Meet the Bayesian Filtering Algorithm by coyul · · Score: 5, Informative

      OTOH, it seems to me that some other model, such as a scheme that gives legitimate senders explicit advance AUTHORIZATION to send you email, might be what's needed.

      I understand what you're saying, but there are a couple of problems with this, depending on how you implement it. If you allow potential correspondents to request authorization by email, you'll still have to process at least one message per originating address. That obviously won't eliminate spam (or even cut it down to size). The other option is to force potential correspondents to request authorization over another channel (phone, fax, whatever), but this neatly destroys a lot of the convenience of email. It also eliminates the impersonal nature of email by forcing a personal contact, when it is partly this impersonality that distinguishes email in the first place (and encourages some first-time correspondents to make contact at all).

      May not be the ultimate filter (and I doubt it could be), but it's real interesting, I think, that this appears to have considerably greater than zero accuracy.

      Actually, the Bayesian filter implemented by POPFile is remarkably accurate. A friend of mine has been using it since it debuted on Slashdot in November, and it has correctly classified all of the spam he's received since then (76% of his email in total, unfortunately).

      You can also set up POPFile to process the headers of your messages as well as the body, so it will effectively learn the email addresses of people you're willing to receive email from anyway. Depending on how you define words (what you use as token separators), you could attempt to make it generalize to domains as well.

  3. Re:Text of the full article by Hal-9001 · · Score: 4, Informative

    The scheme described in the article is not Bayesian at all. It's more like a very crude hash comparison. If two similar messages are concatenated, they should compress very well. If two dissimilar messages are concatenated, they will not compress as well.
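
    A rough Python sketch of that concatenation test follows. It assumes `zlib` as a stand-in for gzip, and the helper names (`csize`, `extra_bytes`, `classify`) and all three sample texts are hypothetical, not taken from the article.

```python
import zlib


def csize(data: bytes) -> int:
    """Size of the data after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: bytes, message: bytes) -> int:
    """Bytes the message adds when it is compressed after the reference."""
    return csize(reference + message) - csize(reference)


def classify(message: bytes, spam_ref: bytes, ham_ref: bytes) -> str:
    """Label the message by whichever reference it compresses better with."""
    if extra_bytes(spam_ref, message) < extra_bytes(ham_ref, message):
        return "spam"
    return "ham"


# Made-up reference texts and test message, for illustration only.
spam_ref = (b"Click here to claim your free prize now. "
            b"Limited time offer, act today! Lowest prices guaranteed.")
ham_ref = (b"Hi Bob, the meeting on Thursday has moved to 3pm. "
           b"Please bring the quarterly report and the draft agenda.")
message = b"Claim your free prize now, limited time offer, act today!"

print(extra_bytes(spam_ref, message), extra_bytes(ham_ref, message))
print(classify(message, spam_ref, ham_ref))
```

    The message shares long substrings with the spam reference, so it should add fewer compressed bytes there than after the ham reference. Real reference corpora would need to fit inside the compressor's window for this to work.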

    An actual Bayesian filter would perform a statistical analysis of an existing body of spam and non-spam messages, identify key words or phrases that identify a message as spam or non-spam, and calculate the probability for every key word that a message containing that word is spam. Then every new message is classified as spam or non-spam by running a statistical analysis on its content, and the statistics of that message update and improve the probability model.
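
    For contrast, here is a minimal naive Bayes sketch of those steps in Python. It is a simplified illustration, not the algorithm from Paul Graham's essay or POPFile's implementation: the `NaiveBayesFilter` class, the whitespace tokenizer, the add-one smoothing, and the training messages are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy word-presence naive Bayes classifier (illustrative only)."""

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.msg_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text: str) -> set:
        # Assumption: lowercase whitespace-separated words are the tokens.
        return set(text.lower().split())

    def train(self, text: str, label: str) -> None:
        """Update per-class message and word-presence counts."""
        self.msg_counts[label] += 1
        self.word_counts[label].update(self.tokens(text))

    def spam_probability(self, text: str) -> float:
        """Combine per-word evidence into one probability via log-odds."""
        n_spam, n_ham = self.msg_counts["spam"], self.msg_counts["ham"]
        log_odds = math.log((n_spam + 1) / (n_ham + 1))  # smoothed prior
        for word in self.tokens(text):
            # Add-one smoothing keeps unseen words from giving zero.
            p_spam = (self.word_counts["spam"][word] + 1) / (n_spam + 2)
            p_ham = (self.word_counts["ham"][word] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


nb = NaiveBayesFilter()
nb.train("free prize click now", "spam")
nb.train("win a free prize today", "spam")
nb.train("meeting moved to thursday", "ham")
nb.train("please send the report before the meeting", "ham")

print(nb.spam_probability("free prize"))      # should be well above 0.5
print(nb.spam_probability("meeting report"))  # should be well below 0.5
```

    Calling `train` again on each newly classified message is how the model would keep updating, as described above.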

    --
    "It take 9 months to bear a child, no matter how many women you assign to the job."
  4. Yawn -- read your papers by Anonymous Coward · · Score: 4, Informative

    There was a paper published in PRL a couple of years ago that tried to identify languages using gzip (Benedetto et al., "Language Trees and Zipping"). It sure sounded cool, but was quickly forgotten when Joshua Goodman took a closer look (the link is down at the moment, probably IIS; a text version is in the Google cache).

  5. bzip2 results by K-Man · · Score: 4, Informative

    Several knowledgeable people pointed out that the first try was limited by gzip's 32k window size, so I did a quick run with bzip2, which uses a 900k block, and put the results here. Somewhat different, but still a spread between spam/ham.
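
    The window limit is easy to see in a small Python sketch. This is not the script behind the results above: it assumes `zlib` as a stand-in for gzip (same DEFLATE algorithm, 32 KB window) and uses seeded pseudo-random bytes as hypothetical input, so that the only redundancy is one repeat lying more than 32 KB back.

```python
import bz2
import random
import zlib

# Hypothetical input: a 50,000-byte pseudo-random block, repeated once.
# The second copy starts more than 32 KB after the first, so it is out
# of reach of DEFLATE's window but well inside one bzip2 block (about
# 900 KB at compression level 9).
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(50_000))
data = block + block

zlib_size = len(zlib.compress(data, 9))
bz2_size = len(bz2.compress(data, 9))

print(f"original: {len(data)} bytes")
print(f"zlib:     {zlib_size} bytes")
print(f"bzip2:    {bz2_size} bytes")
```

    zlib should barely shrink this input at all, while bzip2 should do noticeably better because it sees both copies in one block.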

    And, of course, do try this at home.

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger