Slashdot Mirror


Using gzip As A Spam Filter

captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."

19 of 268 comments (clear)

  1. Raw data by gazbo · · Score: 5, Informative

    This article will make much more sense if you look at the raw data in tabular form.

  2. Meet the Bayesian Filtering Algorythm by dpete4552 · · Score: 5, Informative

    http://www.paulgraham.com/spam.html

    --
    http://www.archive.org/details/ThePowerOfNightmares
    1. Re:Meet the Bayesian Filtering Algorythm by dilute · · Score: 2, Informative

      Baysian filtering looks at word occurrence statistics. This is saying just compare the bulk redundancies of a message as compared to a collection of test messages of a known type, without even looking at the "words". May not be the ultimate filter (and I doubt it could be), but it's real interesting, I think, that this appears to have considerably greater than zero accuracy.

      OTOH, it seems to me that some other model, such as a scheme that gives legitimate senders explicit advance AUTHORIZATION to send you email, might be what's needed. How to implement that is, well, left as "an exercise for the reader" -- actually, this has been discussed on /.

    2. Re:Meet the Bayesian Filtering Algorythm by coyul · · Score: 5, Informative

      OTOH, it seems to me that some other model, such as a scheme that gives legitimate senders explicit advance AUTHORIZATION to send you email, might be what's needed.

      I understand what you're saying, but there are a couple of problems with this, depending on how you implement it. If you allow potential correspondents to request authorization by email, you'll still have to process at least one message per originating address. That obviously won't work to eliminate spam (or even cut it down to size...) The other option is to force potential correspondents to request authorization over another channel (phone, fax, whatever), but this neatly destroys a lot of the convenience of email. It also eliminates the impersonal nature of email (by forcing a personal contact) when it is partly this impersonality that distinguishes it in the first place (and encourages some first time correspondents to make contact at all...)

      May not be the ultimate filter (and I doubt it could be), but it's real interesting, I think, that this appears to have considerably greater than zero accuracy.

      Actually, the Bayesian filter implemented by POPFile is remarkably accurate. A friend of mine has been using it since it debuted on slashdot in November and it has correctly classified all of the spam he's received since (76% of his email in total, unfortunately...)

      You can also set up POPFile to process the headers of your messages as well as the body, so it will effectively learn the email addresses of people you're willing to receive email from anyway. Depending on how you define words (what you use as token separators), you could attempt to make it generalize to domains as well.

  3. Re:this is nice by gazbo · · Score: 3, Informative
    No, the lameness filter does nothing like this. The lameness filter (strictly the postercomment compression filter) just sees how well the isolated text compresses. Too high compression implies too much repetition (hence likely repeatedy copy+pasted junk), too low compression implies random chars - English contains plenty of redundancy.

    This, on the other hand, talks about gziping the mail in the context of corpora of known spam or known ham. Thus it serves as a classification of types of Englishg text, whereas the slashdot system only tries to classify whether or not it is actually English text at all.

  4. How to stop spam.... by oliverthered · · Score: 3, Informative

    1: Get an email account with unlimited addresses.
    2: when registering use a unique address e.g. slashdot@mydomain.com
    3: Make sure you check/uncheck the give my email address to mailing lists.
    4: If ever you get spam to that account get litigious.

    Use something like mailinglists@mydomain.com, and block anything that doesn't come from mailing lists you've subscribed to.

    --
    thank God the internet isn't a human right.
    1. Re:How to stop spam.... by NoseyNick · · Score: 2, Informative

      I've been doing this for years, and in practice, it just means I get 12 copies of most spams, because they got my address from 12 different places, usually web archives of the mailing-lists.

      You can't refuse mail from non-lists to mailinglists@your.domain, because then nobody can contact you saying "I saw your post on foo-list and was wondering if I could get a copy of foo-prog and if you could tell me how you made it foo bar baz".

      --
      Nick Waterman, Sr Tech Director, #include <stddisclaimer>
  5. Re:Same old problem... by isorox · · Score: 2, Informative

    I usually cope by having a couple of folders in kmail I flush spam into

    BODY contains "The following message was sent to you as an opt-in subscriber to RB Express."
    FROM contains Trivia
    TO or CC contains "johnsmith@isorox.co.ku"
    FROM contains theracingpost.com
    TO or CC contains "spam" (I use sitespam@isorox to sign up to sites)
    BODY contains "to receive" AND "more of these offers"
    Move to a Spam folder

    If TO or CC doesnt contain
    isorox.co.ku
    exeter.ac.ku
    ex.ac.ku

    Move to possible Spam

    That gets about 80-90% of my spam.

    I skim Possible Spam when I get time, usually once or twice a day. I skim Spam about once every 2 days. i've got a couple of rules that just delete the spam straight off (known junk addresses that I'll never need, certain subjects, etc). Keep all my spam too, and check it when I get time, just in case.

  6. Re:Text of the full article by Hal-9001 · · Score: 4, Informative

    The scheme described in the article is not Bayesian at all. It's more like a very crude hash comparison. If two similar messages are concatenated, they should compress very well. If two dissimilar messages are concatenated, they will not compress as well.

    An actual Bayesian filter would perform a statistical analysis of an existing body of spam and non-spam messages, identify key words or phrases that identify a message as spam or non-spam, and calculate the probability for every key word that a message containing that word is spam. Then every new message is classified as spam or non-spam by running a statistical analysis on its content, and the statistics of that message update and improve the probability model.

    --
    "It take 9 months to bear a child, no matter how many women you assign to the job."
  7. Re:Slashdot filter by pudge · · Score: 3, Informative

    Um, except that Slash uses gzip for its compression. So, no. :-)

    What is different, as has been pointed out, is that Slash compresses a particular post and looks at how well it compresses, but does not compress/compare with other posts.

  8. 32k Window... by pridkett · · Score: 3, Informative

    The fact is, that unless your SPAM corpus and HAM corpus are both under 32k, this won't work. Gzip is fast because it only has a 32k sliding window, meaning that it only searches for like strings in a 32k window around what you're currently compressing. Hate to break it to you, but 32k is not enough for a corpus. I think Bzip2 uses something larger (900k?), but I forget what it is.

    I'll be happy with spam assassin until I get CRM114 (and mailfilter) trained and working.

    --
    My Slashdot account is old enough to drink...
  9. Similar article on heise was published a year ago by hanzwurst · · Score: 2, Informative

    German newsticker heise had a similar article a year ago, altough it does not cover spam explicitly.
    The article has a link to another article published in "Physical Review Letters" which deals with the topic of identifying content/author by applying compression algorithms.
    The underlying idea is that LZ77 compressed data is near to the entropy of a message.

  10. Re:Text of the full article by NoseyNick · · Score: 2, Informative

    > The current fad is in fact Bayesian filtering, sophisticated statistical analysis.

    Baysian filtering IS word-counting with (not very sophisticated) statistical heuristics applied to the results.

    --
    Nick Waterman, Sr Tech Director, #include <stddisclaimer>
  11. Even Better by HereAllNight · · Score: 2, Informative

    Who needs all of these complicated schemes? I just filter the sending domains as they come. Filter every sender containing "specials", "optin", "offer", "special", "deal", "email", "reward", "value", "promotion", "special" and "super, and all subject lines starting with "friend", and 85% is taken care of right away. So far my formula has had no false positives.

  12. Yawn -- read your papers by Anonymous Coward · · Score: 4, Informative

    There was a paper published in PRL a couple of years ago that wanted to identify languages using gzip (Benedetto et al: Language Trees and Zipping). It sure sounded cool, but was quickly forgotten when Joshua Goodman took a closer look (link is down at the moment, probably IIS, Text version in Google Cache).

  13. Re:HTML by phrantic · · Score: 2, Informative

    Another problem with html is that, if there is some level of sophistication on the part of the spammer they can embedd a file (a gif or jpg) in the html that has a unique name that is uniquely associated with your email address. You open the mail, the file is requested (it doesn't even have to exist) but the 404 error or the html get can be logged on the server, and then it is a simple matter of matching the requested files to the email address and you have a list of good email addresses. This is a really useful technique for "closed loop marketing" which is the corporate edition of Spam.

    --
    --My sig is bigger than your sig--
  14. Re:Text of the full article by Arkham · · Score: 2, Informative

    Baysian filtering IS word-counting with (not very sophisticated) statistical heuristics applied to the results

    This may be the case, but most of the newer filters available now are not really Bayesian filtering by this definition. I use spambayes, and it has some very sophisticated algorithms to determine the statistical probability of the "spamminess" of a ham/spam.

    Some of these fancier algorithms were developed by Gary Robinson and are discussed in some detail here. You can see the results of these different classification techniques (gary combining, chi-squared) in some nice graphs here.

    On a related note, spambayes is VERY accurate in catching spam for me. Amazingly so in fact. It does a far better job than SpamAssassin or the Bayesian filter in Mail.app in my personal experience.

    --
    - Vincit qui patitur.
  15. bzip2 results by K-Man · · Score: 4, Informative

    Several knowledgeable people pointed out that the first try was limited by gzip's 32k window size, so I did a quick run with bzip2, which uses a 900k block, and put the results here. Somewhat different, but still a spread between spam/ham.

    And, of course, do try this at home.

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
  16. Re:Sorry, that's not right by Synonymous+Soured · · Score: 2, Informative

    Pre-coffee fog. Sorry. Typing got ahead of brain. Tripped up confounding the words-as-symbols/bytes-as-symbols distinction with the model markovity.

    You are correct about the order-1 assertion. That should indeed have been order-N, where N is the length of the longest prefix string maintained explicitly or implicitly by a Ziv-Lempel dictionary or backpointer set. The Ziv-Lempel engines can be regarded as using shortened N-grams to represent classes of longer, yet-unseen N-grams; and they do use Markov models, where the stationary and transition probabilities are all set equal. In these cases, the probabilities only count for being zero or non-zero.

    A "Bayesian Spam Filter" is order-0 if it relies only on token frequencies, where the tokens are complete strings, and not conditional occurrences of word pairs. The assertion is that a spam filter mechanism would be improved if it relied on a higher-order underlying model, and if the symbols were taken to be bytes and not words. The probability of a string is thus the product of the probabilities of its symbol sequence under the order-N model. But any higher-order model, even one using within-message word digrams or trigrams, would probably be an improvement.