Using gzip As A Spam Filter
captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."
This article will make much more sense if you look at the raw data in tabular form.
http://www.paulgraham.com/spam.html
http://www.archive.org/details/ThePowerOfNightmares
This, on the other hand, talks about gzipping the mail in the context of corpora of known spam or known ham. Thus it serves as a classifier for types of English text, whereas the Slashdot system only tries to determine whether or not it is actually English text at all.
1: Get an email account with unlimited addresses.
2: When registering, use a unique address, e.g. slashdot@mydomain.com
3: Make sure you check/uncheck the "give my email address to mailing lists" box.
4: If you ever get spam at that address, get litigious.
Use something like mailinglists@mydomain.com, and block anything that doesn't come from mailing lists you've subscribed to.
thank God the internet isn't a human right.
I usually cope by having a couple of folders in KMail that I flush spam into:
BODY contains "The following message was sent to you as an opt-in subscriber to RB Express."
FROM contains Trivia
TO or CC contains "johnsmith@isorox.co.ku"
FROM contains theracingpost.com
TO or CC contains "spam" (I use sitespam@isorox to sign up to sites)
BODY contains "to receive" AND "more of these offers"
Move to a Spam folder
If TO or CC doesn't contain
isorox.co.ku
exeter.ac.ku
ex.ac.ku
Move to possible Spam
That gets about 80-90% of my spam.
I skim Possible Spam when I get time, usually once or twice a day. I skim Spam about once every 2 days. I've got a couple of rules that just delete the spam straight off (known junk addresses that I'll never need, certain subjects, etc). I keep all my spam too, and check it when I get time, just in case.
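For anyone who wants the same rules outside a mail client, here is a minimal Python sketch of the filter described above. The `Mail` class, the `classify` function, the "Inbox" fallback and the case-insensitive matching are all assumptions made for illustration; they are not KMail's actual behaviour.

```python
from dataclasses import dataclass

# Domains that legitimate mail is addressed to (taken from the rules above).
MY_DOMAINS = ("isorox.co.ku", "exeter.ac.ku", "ex.ac.ku")


@dataclass
class Mail:
    """Hypothetical container for the three fields the rules look at."""
    sender: str      # From header
    recipients: str  # To and Cc headers joined into one string
    body: str


def classify(mail: Mail) -> str:
    """Return the folder a message would be moved to."""
    sender = mail.sender.lower()
    rcpt = mail.recipients.lower()
    body = mail.body.lower()

    spam_rules = (
        "the following message was sent to you as an opt-in subscriber "
        "to rb express." in body,
        "trivia" in sender,
        "johnsmith@isorox.co.ku" in rcpt,
        "theracingpost.com" in sender,
        "spam" in rcpt,  # catches sign-up addresses like sitespam@isorox
        "to receive" in body and "more of these offers" in body,
    )
    if any(spam_rules):
        return "Spam"
    # Not addressed to any of my domains: probably a blind-copied mass mailing.
    if not any(domain in rcpt for domain in MY_DOMAINS):
        return "Possible Spam"
    return "Inbox"


if __name__ == "__main__":
    print(classify(Mail("news@theracingpost.com", "me@isorox.co.ku", "Tips")))
```

Substring rules like these are cheap to evaluate, but every new spam source needs a new rule, which is what the statistical approaches in this thread try to avoid.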
The scheme described in the article is not Bayesian at all. It's more like a very crude hash comparison. If two similar messages are concatenated, they should compress very well. If two dissimilar messages are concatenated, they will not compress as well.
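A rough sketch of that concatenation test in Python, using the standard `zlib` module (the same DEFLATE algorithm gzip uses). The function names and the tiny reference texts are made up for illustration, and this is not the article's actual script.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the DEFLATE-compressed data."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: bytes, message: bytes) -> int:
    """How much the compressed size grows when message is appended.

    A message that repeats strings from the reference adds few bytes;
    a dissimilar message adds close to its own compressed size.
    """
    return compressed_size(reference + message) - compressed_size(reference)


def closer_to(message: bytes, spam_ref: bytes, ham_ref: bytes) -> str:
    """Label the message by whichever reference it compresses better with."""
    if extra_bytes(spam_ref, message) < extra_bytes(ham_ref, message):
        return "spam"
    return "ham"


if __name__ == "__main__":
    spam_ref = (b"Buy cheap pills now! Limited offer, "
                b"click here to claim your free reward today.")
    ham_ref = (b"The meeting about the compiler patch is moved to Thursday; "
               b"please review the diff before then.")
    print(closer_to(b"click here to claim your free reward today",
                    spam_ref, ham_ref))
```

Nothing here estimates probabilities; it only measures shared substrings, which is why "crude hash comparison" is a fair description.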
An actual Bayesian filter would perform a statistical analysis of an existing body of spam and non-spam messages, identify key words or phrases that identify a message as spam or non-spam, and calculate the probability for every key word that a message containing that word is spam. Then every new message is classified as spam or non-spam by running a statistical analysis on its content, and the statistics of that message update and improve the probability model.
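To make the contrast concrete, here is a minimal naive Bayes sketch along those lines. The tokenizer, the class name and the add-one smoothing are assumptions chosen for brevity; real filters are considerably more careful about all three.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> set:
    """Distinct lowercase word tokens in a message (a very crude tokenizer)."""
    return set(re.findall(r"[a-z0-9$']+", text.lower()))


class NaiveBayesFilter:
    def __init__(self):
        # Number of messages of each class containing each word.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Add one labelled message to the statistics."""
        self.word_counts[label].update(tokens(text))
        self.message_counts[label] += 1

    def _word_prob(self, word: str, label: str) -> float:
        # Fraction of messages in the class that contain the word,
        # with add-one smoothing so unseen words never give zero.
        return ((self.word_counts[label][word] + 1)
                / (self.message_counts[label] + 2))

    def spam_probability(self, text: str) -> float:
        """Posterior probability of spam, assuming independent words."""
        log_odds = math.log((self.message_counts["spam"] + 1)
                            / (self.message_counts["ham"] + 1))
        for word in tokens(text):
            log_odds += math.log(self._word_prob(word, "spam")
                                 / self._word_prob(word, "ham"))
        return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    f = NaiveBayesFilter()
    f.train("free pills offer", "spam")
    f.train("meeting notes attached", "ham")
    message = "free offer inside"
    p = f.spam_probability(message)
    # Feeding the classified message back in is the "update" step.
    f.train(message, "spam" if p > 0.5 else "ham")
    print(p)
```

The `math.exp` call can overflow for very long messages with extreme odds; a production filter would clamp the log-odds or cap the number of words considered.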
"It take 9 months to bear a child, no matter how many women you assign to the job."
Um, except that Slash uses gzip for its compression. So, no. :-)
What is different, as has been pointed out, is that Slash compresses a particular post and looks at how well it compresses, but does not compress/compare with other posts.
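The single-post check can be sketched in a few lines of Python. The 0.3 threshold and the function names are invented for illustration; I don't know what cut-off Slash actually uses.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by original size (lower = more repetitive)."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0  # nothing to compress; treat as not repetitive
    return len(zlib.compress(raw, 9)) / len(raw)


def looks_repetitive(text: str, threshold: float = 0.3) -> bool:
    """Flag posts that compress suspiciously well.

    The threshold is a hypothetical value picked for this sketch.
    """
    return compression_ratio(text) < threshold


if __name__ == "__main__":
    print(looks_repetitive("buy now " * 200))
    print(looks_repetitive("An ordinary sentence with little repetition."))
```

Short posts barely compress at all because of the fixed header overhead, so a check like this only says something useful about longer texts.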
The fact is that unless your spam corpus and ham corpus are both under 32k, this won't work. Gzip is fast because it only has a 32k sliding window, meaning that it only searches for matching strings in the 32k of data preceding what you're currently compressing. Hate to break it to you, but 32k is not enough for a corpus. I think bzip2 uses something larger (900k?), but I forget what it is.
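The window limit is easy to see for yourself. This sketch compresses a block of pseudo-random bytes followed by an exact copy of itself, once with the copy inside the 32k window and once with it outside. The helper names and block sizes are arbitrary choices for the demonstration.

```python
import random
import zlib

WINDOW = 32 * 1024  # DEFLATE's maximum back-reference distance


def noise(n: int, seed: int = 0) -> bytes:
    """n pseudo-random bytes, which are incompressible on their own."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


def doubled_ratio(block: bytes) -> float:
    """Compression ratio of the block followed by an exact copy of itself."""
    doubled = block + block
    return len(zlib.compress(doubled, 9)) / len(doubled)


if __name__ == "__main__":
    # Copy starts 20,000 bytes back: inside the window, so it is found.
    print("20k block:", round(doubled_ratio(noise(20_000)), 2))
    # Copy starts 40,000 bytes back: outside the window, so it is missed.
    print("40k block:", round(doubled_ratio(noise(40_000)), 2))
```

The first ratio should come out at roughly one half, since the second copy costs almost nothing, while the second stays near 1 because the compressor never sees the repeat.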
I'll be happy with SpamAssassin until I get CRM114 (and mailfilter) trained and working.
My Slashdot account is old enough to drink...
German newsticker heise had a similar article a year ago, although it does not cover spam explicitly.
The article has a link to another article published in "Physical Review Letters" which deals with the topic of identifying content/author by applying compression algorithms.
The underlying idea is that the size of LZ77-compressed data comes close to the entropy of the message.
> The current fad is in fact Bayesian filtering, sophisticated statistical analysis.
Bayesian filtering IS word-counting with (not very sophisticated) statistical heuristics applied to the results.
Nick Waterman, Sr Tech Director, #include <stddisclaimer>
Who needs all of these complicated schemes? I just filter the sending domains as they come. Filter every sender containing "specials", "optin", "offer", "special", "deal", "email", "reward", "value", "promotion" and "super", and all subject lines starting with "friend", and 85% is taken care of right away. So far my formula has had no false positives.
There was a paper published in PRL a couple of years ago that wanted to identify languages using gzip (Benedetto et al: Language Trees and Zipping). It sure sounded cool, but was quickly forgotten when Joshua Goodman took a closer look (link is down at the moment, probably IIS, Text version in Google Cache).
Another problem with HTML mail is that a spammer with some level of sophistication can embed an image (a GIF or JPEG) in the HTML under a unique file name that is associated with your email address. When you open the mail, the file is requested (it doesn't even have to exist), and the resulting 404 error or HTTP GET is logged on the server. It is then a simple matter of matching the requested files to email addresses to get a list of good addresses. This is a really useful technique for "closed loop marketing", which is the corporate edition of spam.
--My sig is bigger than your sig--
> Bayesian filtering IS word-counting with (not very sophisticated) statistical heuristics applied to the results.
This may be the case, but most of the newer filters available now are not really Bayesian filtering by this definition. I use spambayes, and it has some very sophisticated algorithms to determine the statistical probability of the "spamminess" of a ham/spam.
Some of these fancier algorithms were developed by Gary Robinson and are discussed in some detail here. You can see the results of these different classification techniques (gary combining, chi-squared) in some nice graphs here.
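For the curious, here is a small sketch of chi-squared combining as I understand it: the per-word spam probabilities are combined with Fisher's method into a "spam" indicator S and a "ham" indicator H, and the final score is (S - H + 1) / 2. The function names are my own and this is not copied from spambayes, so treat the details as an approximation.

```python
import math


def chi2q(x2: float, dof: int) -> float:
    """P(chi-squared with dof degrees of freedom >= x2), for even dof only."""
    assert dof > 0 and dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1


def combine(word_probs: list) -> float:
    """Combine per-word spam probabilities (each strictly between 0 and 1).

    Returns a score near 1 for spam, near 0 for ham, and near 0.5 when the
    evidence is weak or contradictory.
    """
    n = len(word_probs)
    if n == 0:
        return 0.5
    ln_not_spam = sum(math.log(1.0 - p) for p in word_probs)
    ln_spam = sum(math.log(p) for p in word_probs)
    s = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)  # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)      # evidence for ham
    return (s - h + 1.0) / 2.0


if __name__ == "__main__":
    print(combine([0.99, 0.95, 0.9]))   # strongly spammy words
    print(combine([0.99, 0.01]))        # contradictory evidence
```

The useful property is the middle ground: a message with strong clues in both directions lands near 0.5 instead of being forced to one extreme, which gives a natural "unsure" bucket.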
On a related note, spambayes is VERY accurate in catching spam for me. Amazingly so in fact. It does a far better job than SpamAssassin or the Bayesian filter in Mail.app in my personal experience.
- Vincit qui patitur.
Several knowledgeable people pointed out that the first try was limited by gzip's 32k window size, so I did a quick run with bzip2, which uses a 900k block, and put the results here. Somewhat different, but still a spread between spam/ham.
And, of course, do try this at home.
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
Pre-coffee fog. Sorry. Typing got ahead of brain. Tripped up confounding the words-as-symbols/bytes-as-symbols distinction with the model markovity.
You are correct about the order-1 assertion. That should indeed have been order-N, where N is the length of the longest prefix string maintained explicitly or implicitly by a Ziv-Lempel dictionary or backpointer set. The Ziv-Lempel engines can be regarded as using shortened N-grams to represent classes of longer, yet-unseen N-grams; and they do use Markov models, where the stationary and transition probabilities are all set equal. In these cases, the probabilities only count for being zero or non-zero.
A "Bayesian Spam Filter" is order-0 if it relies only on token frequencies, where the tokens are complete strings, and not conditional occurrences of word pairs. The assertion is that a spam filter mechanism would be improved if it relied on a higher-order underlying model, and if the symbols were taken to be bytes and not words. The probability of a string is thus the product of the probabilities of its symbol sequence under the order-N model. But any higher-order model, even one using within-message word digrams or trigrams, would probably be an improvement.
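As a rough illustration of the byte-level order-N idea, here is a sketch in which the log-probability of a message is the sum of the log-probabilities of each byte given the N bytes before it. The class name, the zero-byte padding and the add-one smoothing over 256 byte values are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class OrderNModel:
    """Counts how often each byte follows each N-byte context."""

    def __init__(self, order: int = 3):
        self.order = order
        self.next_counts = defaultdict(Counter)

    def _padded(self, data: bytes) -> bytes:
        # Zero bytes stand in for the missing context at the start.
        return b"\x00" * self.order + data

    def train(self, data: bytes) -> None:
        padded = self._padded(data)
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.next_counts[context][padded[i]] += 1

    def log_prob(self, data: bytes) -> float:
        """Sum of log P(byte | previous N bytes), with add-one smoothing."""
        padded = self._padded(data)
        total = 0.0
        for i in range(self.order, len(padded)):
            counts = self.next_counts.get(padded[i - self.order:i])
            seen = counts[padded[i]] if counts else 0
            context_total = sum(counts.values()) if counts else 0
            total += math.log((seen + 1) / (context_total + 256))
        return total


if __name__ == "__main__":
    spam = OrderNModel(order=3)
    ham = OrderNModel(order=3)
    spam.train(b"buy cheap pills now, buy cheap pills today")
    ham.train(b"the compiler patch needs review before the release")
    msg = b"buy cheap pills"
    print("spam" if spam.log_prob(msg) > ham.log_prob(msg) else "ham")
```

Comparing the two log-probabilities amounts to asking which model would need fewer bits to encode the message, which is the same question the gzip trick asks in a much blunter way.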