Slashdot Mirror


Using gzip As A Spam Filter

captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."

10 of 268 comments

  1. Raw data by gazbo · · Score: 5, Informative

    This article will make much more sense if you look at the raw data in tabular form.

  2. Meet the Bayesian Filtering Algorithm by dpete4552 · · Score: 5, Informative

    http://www.paulgraham.com/spam.html

    --
    http://www.archive.org/details/ThePowerOfNightmares
    1. Re:Meet the Bayesian Filtering Algorithm by coyul · · Score: 5, Informative

      OTOH, it seems to me that some other model, such as a scheme that gives legitimate senders explicit advance AUTHORIZATION to send you email, might be what's needed.

      I understand what you're saying, but there are a couple of problems with this, depending on how you implement it. If you allow potential correspondents to request authorization by email, you'll still have to process at least one message per originating address. That obviously won't work to eliminate spam (or even cut it down to size). The other option is to force potential correspondents to request authorization over another channel (phone, fax, whatever), but this neatly destroys a lot of the convenience of email. It also eliminates the impersonal nature of email (by forcing a personal contact), when it is partly this impersonality that distinguishes it in the first place (and encourages some first-time correspondents to make contact at all).

      May not be the ultimate filter (and I doubt it could be), but it's real interesting, I think, that this appears to have considerably greater than zero accuracy.

      Actually, the Bayesian filter implemented by POPFile is remarkably accurate. A friend of mine has been using it since it debuted on Slashdot in November, and it has correctly classified all of the spam he's received since (76% of his email in total, unfortunately).

      You can also set up POPFile to process the headers of your messages as well as the body, so it will effectively learn the email addresses of people you're willing to receive email from anyway. Depending on how you define words (what you use as token separators), you could attempt to make it generalize to domains as well.
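
To illustrate the point about token separators, here is a minimal sketch. The `tokenize` helper and the sample header are hypothetical and are not POPFile's actual tokenizer; the only point is that the choice of separators decides whether a domain shows up as its own token.

```python
import re


def tokenize(text: str, separators: str) -> list:
    """Split on any character in `separators`, dropping empty tokens."""
    pattern = "[" + re.escape(separators) + "]+"
    return [tok for tok in re.split(pattern, text.lower()) if tok]


# Made-up header line, for illustration only.
header = "From: Alice Example <alice@example.com>"

# Whitespace and angle brackets only: the whole address is one token.
whole = tokenize(header, " <>")

# Adding "@" splits the address, so the domain becomes its own token
# and a filter could learn it independently of the mailbox name.
split = tokenize(header, " <>@")
```

With the second separator set, mail from any address at the same domain shares the `example.com` token, which is the kind of generalization the comment describes.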

  3. Re:this is nice by gazbo · · Score: 3, Informative
    No, the lameness filter does nothing like this. The lameness filter (strictly the postercomment compression filter) just sees how well the isolated text compresses. Too high compression implies too much repetition (hence likely repeatedly copy+pasted junk); too low compression implies random chars, since English contains plenty of redundancy.
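
A rough sketch of that kind of single-text check, using Python's `zlib` (the same DEFLATE algorithm gzip uses). The function names and the two thresholds are assumptions made up for illustration; they are not the values Slash uses.

```python
import random
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by original size (lower means more repetitive)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot compute a ratio for empty text")
    return len(zlib.compress(raw, 9)) / len(raw)


def looks_lame(text: str, low: float = 0.2, high: float = 0.8) -> bool:
    """Flag text that compresses suspiciously well or suspiciously badly.

    Both thresholds are illustrative assumptions, not Slash's settings.
    """
    ratio = compression_ratio(text)
    return ratio < low or ratio > high


# Copy+pasted junk: the same phrase over and over.
pasted = "first post! " * 200

# Random printable characters, seeded so the run is repeatable.
rng = random.Random(0)
noise = "".join(chr(rng.randrange(33, 127)) for _ in range(2400))
```

The pasted text falls far below the lower threshold and the random characters land above the upper one; ordinary English prose would be expected to sit somewhere in between.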

    This, on the other hand, talks about gzipping the mail in the context of corpora of known spam or known ham. Thus it serves as a classification of types of English text, whereas the Slashdot system only tries to classify whether or not it is actually English text at all.

  4. How to stop spam.... by oliverthered · · Score: 3, Informative

    1: Get an email account with unlimited addresses.
    2: When registering, use a unique address, e.g. slashdot@mydomain.com
    3: Make sure you check/uncheck the "give my email address to mailing lists" box.
    4: If ever you get spam to that account get litigious.

    Use something like mailinglists@mydomain.com, and block anything that doesn't come from mailing lists you've subscribed to.

    --
    thank God the internet isn't a human right.
  5. Re:Text of the full article by Hal-9001 · · Score: 4, Informative

    The scheme described in the article is not Bayesian at all. It's more like a very crude hash comparison. If two similar messages are concatenated, they should compress very well. If two dissimilar messages are concatenated, they will not compress as well.
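
The concatenation idea can be sketched in a few lines. This is only an illustration under assumptions: `zlib` stands in for gzip, the helper names are hypothetical, and the two tiny corpora are made up rather than taken from the article.

```python
import zlib


def added_bytes(corpus: bytes, message: bytes) -> int:
    """Extra compressed bytes needed when `message` is appended to `corpus`."""
    return len(zlib.compress(corpus + message, 9)) - len(zlib.compress(corpus, 9))


def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Label the message by whichever corpus it compresses better against."""
    msg = message.encode("utf-8")
    spam_cost = added_bytes(spam_corpus.encode("utf-8"), msg)
    ham_cost = added_bytes(ham_corpus.encode("utf-8"), msg)
    return "spam" if spam_cost < ham_cost else "ham"


# Tiny made-up corpora, purely for illustration.
SPAM = (
    "Buy cheap pills online now. Limited time offer, click here to order. "
    "You have won a free prize, click here to claim your free prize now. "
)
HAM = (
    "Can we move the project meeting to Thursday afternoon? "
    "I attached the draft report, let me know what you think of section two. "
)

spam_label = classify("Hello, click here to claim your free prize now.", SPAM, HAM)
ham_label = classify("Hello, let me know what you think of section two.", SPAM, HAM)
```

The sample messages reuse phrases from the corpora almost verbatim, which makes the effect easy to see; real mail would share far less text with the corpus, so the difference in cost would be smaller.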

    An actual Bayesian filter would perform a statistical analysis of an existing body of spam and non-spam messages, identify key words or phrases that identify a message as spam or non-spam, and calculate the probability for every key word that a message containing that word is spam. Then every new message is classified as spam or non-spam by running a statistical analysis on its content, and the statistics of that message update and improve the probability model.
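
For contrast, a minimal word-level naive Bayes sketch of the approach just described. The class name, the add-one smoothing, and the training messages are all assumptions chosen for illustration; real filters weight and select tokens more carefully.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Tiny word-level naive Bayes classifier with add-one smoothing."""

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text: str) -> list:
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text: str, label: str) -> None:
        """Fold one labelled message into the per-class word statistics."""
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text: str) -> float:
        words = self.tokenize(text)
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior from how often each class was seen in training.
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            for word in words:
                # Add-one smoothing; the extra +1 leaves room for unseen words.
                score += math.log((counts[word] + 1) / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into P(spam | words), clamped to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


spam_filter = NaiveBayesFilter()
spam_filter.train("cheap pills click here free prize", "spam")
spam_filter.train("win a free prize now click here", "spam")
spam_filter.train("meeting moved to thursday see attached report", "ham")
spam_filter.train("thanks for the report see you at the meeting", "ham")
```

Calling `train` on each newly classified message is what lets the probabilities keep improving, as the comment notes.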

    --
    "It take 9 months to bear a child, no matter how many women you assign to the job."
  6. Re:Slashdot filter by pudge · · Score: 3, Informative

    Um, except that Slash uses gzip for its compression. So, no. :-)

    What is different, as has been pointed out, is that Slash compresses a particular post and looks at how well it compresses, but does not compress/compare with other posts.

  7. 32k Window... by pridkett · · Score: 3, Informative

    The fact is that unless your SPAM corpus and HAM corpus are both under 32k, this won't work. Gzip is fast because it only has a 32k sliding window, meaning that it only searches for like strings in the 32k preceding what you're currently compressing. Hate to break it to you, but 32k is not enough for a corpus. I think bzip2 uses something larger (900k?), but I forget what it is.
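
The window limit is easy to observe directly. The sketch below is an illustration with hypothetical helper names: it repeats a block of random bytes after a gap of random filler and measures how many compressed bytes the repeat costs, using `zlib` (DEFLATE, as in gzip).

```python
import random
import zlib


def random_bytes(rng: random.Random, n: int) -> bytes:
    """Incompressible filler from a seeded RNG, so runs are repeatable."""
    return bytes(rng.getrandbits(8) for _ in range(n))


def cost_of_repeat(gap: int, block_size: int = 1000, seed: int = 0) -> int:
    """Compressed bytes added by repeating a block `gap` bytes after it ends."""
    rng = random.Random(seed)
    block = random_bytes(rng, block_size)
    filler = random_bytes(rng, gap)
    before = len(zlib.compress(block + filler, 9))
    after = len(zlib.compress(block + filler + block, 9))
    return after - before


# The repeat starts 2,000 bytes after the original: well inside the window.
near = cost_of_repeat(gap=1_000)

# The repeat starts 41,000 bytes after the original: outside the 32 KiB window.
far = cost_of_repeat(gap=40_000)
```

When the repeat is close, it is encoded as a handful of back-references and costs little. When it is more than 32 KiB away, the compressor can no longer see the original, and the repeat costs roughly its full 1,000 bytes.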

    I'll be happy with SpamAssassin until I get CRM114 (and mailfilter) trained and working.

    --
    My Slashdot account is old enough to drink...
  8. Yawn -- read your papers by Anonymous Coward · · Score: 4, Informative

    There was a paper published in PRL a couple of years ago that wanted to identify languages using gzip (Benedetto et al: Language Trees and Zipping). It sure sounded cool, but was quickly forgotten when Joshua Goodman took a closer look (link is down at the moment, probably IIS; text version in Google Cache).

  9. bzip2 results by K-Man · · Score: 4, Informative

    Several knowledgeable people pointed out that the first try was limited by gzip's 32k window size, so I did a quick run with bzip2, which uses a 900k block, and put the results here. Somewhat different, but still a spread between spam/ham.
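
One way to repeat this kind of comparison is to make the compressor a parameter. The sketch below is not the script behind the posted results: the helper names, the sample sentences, and the synthetic corpora are assumptions for illustration. It uses Python's `zlib` and `bz2` modules, where `bz2.compress(data, 9)` selects 900k blocks.

```python
import bz2
import random
import zlib
from typing import Callable

Compressor = Callable[[bytes], bytes]


def gzip_like(data: bytes) -> bytes:
    return zlib.compress(data, 9)


def bzip2_like(data: bytes) -> bytes:
    return bz2.compress(data, 9)  # level 9 means 900k blocks


def added_bytes(corpus: bytes, message: bytes, compress: Compressor) -> int:
    """Extra compressed bytes needed when `message` is appended to `corpus`."""
    return len(compress(corpus + message)) - len(compress(corpus))


def spread(message: bytes, spam: bytes, ham: bytes, compress: Compressor) -> int:
    """Ham cost minus spam cost; positive means closer to the spam corpus."""
    return added_bytes(ham, message, compress) - added_bytes(spam, message, compress)


def build_corpus(lines: list, size: int, seed: int) -> bytes:
    """Synthetic corpus: randomly chosen lines until `size` bytes are reached."""
    rng = random.Random(seed)
    chosen, total = [], 0
    while total < size:
        line = rng.choice(lines)
        chosen.append(line)
        total += len(line) + 1
    return ("\n".join(chosen) + "\n").encode("utf-8")


# Made-up sentences, purely for illustration.
SPAM_LINES = [
    "click here to claim your free prize now",
    "cheap pills shipped overnight, no prescription needed",
    "limited time offer, order today and save",
    "you have been selected for a special reward",
]
HAM_LINES = [
    "can we move the meeting to thursday afternoon",
    "i attached the draft report for your review",
    "thanks for the patch, i will test it tonight",
    "lunch at noon works for me, see you there",
]

spam_corpus = build_corpus(SPAM_LINES, 60_000, seed=1)
ham_corpus = build_corpus(HAM_LINES, 60_000, seed=2)
message = b"you have been selected for a special reward\nclick here to claim your free prize now\n"

gzip_spread = spread(message, spam_corpus, ham_corpus, gzip_like)
bzip2_spread = spread(message, spam_corpus, ham_corpus, bzip2_like)
```

On this synthetic data both compressors give a positive spread for a spam-like message. The corpora here are highly repetitive, so this says nothing about which compressor separates real mail better.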

    And, of course, do try this at home.

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger