Proving Which Spam Filters work Best

Posted by ryuzaki0 on Wednesday August 2, 2006 @04:14PM from the get-rid-of-it dept.

pirateninja writes "Dr. Gord Cormack decided to find and prove what the best spam filter is. In his study he looked at the major spam filters (DSPAM, SpamAssassin, etc.) along with those submitted by various academics. The results are quite surprising, with a previously unheard-of spam filter, which uses ideas from various compression algorithms, performing the best overall. He recently presented the results and methodology used in a presentation titled 'Spam Filters, Do they Work? and Can you prove it?'" Note that this is a video of his presentation.
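
As a rough illustration of how compression ideas can drive a spam filter, here is a minimal sketch. It is not the filter from the study: it swaps in general-purpose `zlib` compression, and the names `extra_bytes`, `classify`, `SPAM` and `HAM` plus the toy corpora are assumptions made up for this example. The idea is that a message resembling a corpus adds few bytes when compressed together with it.

```python
import zlib


def extra_bytes(corpus: bytes, message: bytes) -> int:
    """Bytes the compressed corpus grows by when the message is appended."""
    with_message = len(zlib.compress(corpus + message, 9))
    without_message = len(zlib.compress(corpus, 9))
    return with_message - without_message


def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Label the message by whichever corpus it compresses better against."""
    msg = message.encode("utf-8")
    spam_cost = extra_bytes(spam_corpus.encode("utf-8"), msg)
    ham_cost = extra_bytes(ham_corpus.encode("utf-8"), msg)
    return "spam" if spam_cost < ham_cost else "ham"


# Toy training text (hypothetical); a real filter would learn from real mail.
SPAM = "Buy cheap pills now! Limited offer, click here to claim your free prize today. " * 3
HAM = "The project meeting is moved to Thursday; please review the attached agenda first. " * 3

print(classify("click here to claim your free prize today", SPAM, HAM))
print(classify("please review the attached agenda first", SPAM, HAM))
```

With corpora this small the result is only a toy; the point is the comparison of compressed sizes, not the accuracy.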

In my experience... by vivin · 2006-08-02 16:24 · Score: 4, Informative

... the ones which have worked best (for me) are Bayesian Spam Filters (A Plan for Spam, SpamBayes - a free filter) and CRM114 The Controllable Regex Mutilator (Paul Graham mentions it here). I've always had a very high success rate with these.

--
Vivin Suresh Paliath
http://vivin.net

1. Re:In my experience... by Red Alastor · 2006-08-02 18:00 · Score: 4, Informative
  
  I like popfile because it's a bayesian filter that sorts into any arbitrary categories you want, not just spam and ham.
  http://popfile.sourceforge.net/
  
  --
  Slashdot anagrams to "Sad Sloth"
Got to go with Brightmail by saha · 2006-08-02 16:46 · Score: 4, Informative

We use Brightmail on our campus, and our users love it for its very low false-positive rate and pretty accurate flagging of spam. Another campus uses DSPAM, and some people are up in arms at the prospect of losing their Brightmail in a switch to DSPAM. In my experience, DSPAM isn't nearly as good: it has flagged many legitimate messages and sent them to the Junk folder.

I also echo a gripe of other posters. It's nice to have a video, but a 500MB video file is a bit much. A 50KB pie chart or bar graph would have been nice.
Flaw in the test by lheal · 2006-08-02 16:48 · Score: 5, Informative

The spammers actively try to subvert the more popular filters. That gives a lesser-known one a decided advantage, one which will go away as it becomes more popular.

As with most choices like this, factors such as ease of use, speed, and resource efficiency can overshadow selectivity. No system is perfect, so it's perfectly reasonable to go with a system that's pretty good if you already are using it, rather than switching to the latest cool thing.

I have found that using two dissimilar systems in a chain is quite effective.
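
The post doesn't say which two systems are chained or how, so the following is only a sketch of one possible reading: a message is flagged as soon as any stage in the chain flags it. The two stages (`keyword_filter`, `link_density_filter`), their thresholds, and the `chain` helper are all hypothetical stand-ins, not real filter products.

```python
from typing import Callable, Iterable

# A filter takes the message text and returns True when it looks like spam.
Filter = Callable[[str], bool]


def keyword_filter(message: str) -> bool:
    """Stage 1 (hypothetical): flag messages containing known spam phrases."""
    banned = ("viagra", "free prize", "click here")
    text = message.lower()
    return any(phrase in text for phrase in banned)


def link_density_filter(message: str) -> bool:
    """Stage 2 (hypothetical): flag messages that are mostly links."""
    words = message.split()
    if not words:
        return False
    links = sum(1 for w in words if w.lower().startswith(("http://", "https://")))
    return links / len(words) > 0.3


def chain(filters: Iterable[Filter]) -> Filter:
    """Combine filters so a message is spam if any stage flags it."""
    stages = list(filters)

    def run(message: str) -> bool:
        # any() stops at the first stage that returns True.
        return any(stage(message) for stage in stages)

    return run


is_spam = chain([keyword_filter, link_density_filter])
print(is_spam("Click here for your FREE PRIZE"))
print(is_spam("Lunch at noon tomorrow?"))
```

Flagging on any stage catches more spam but also adds up the false positives of both stages; requiring every stage to agree would trade the other way.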

--
Raise your children as if you were teaching them to raise your grandchildren, because you are.
text versions of the material by martin-boundary · 2006-08-02 17:13 · Score: 5, Informative

For those who don't relish downloading 400MB worth of video (why can't somebody cut out the audio as a standalone MP3?), the material of the talk is also available in text mode.
The official tests of spam filters were done at last year's TREC conference; you can read the writeup here (or pdf overview).
You can duplicate those tests yourself if you download the evaluation toolkit (GPL). It's a modular system where you can add a mail corpus (either one of the public TREC ones, or you can make your own trivially), and add a spamfilter package (there are 10 or so to download from the web, or create your own as per documentation).
There's also a video talk given at Microsoft Research which should cover pretty much the same ground, if text mode is slashdotted :).
There's a new scheduled test towards the end of the year at TREC 2006.
Possible Text Version by sciop101 · 2006-08-02 18:35 · Score: 4, Informative

On-line Supervised Spam Filter Evaluation
Gordon Cormack and Thomas Lynam

Full Text, May 29, 2006 - PDF Format

http://plg.uwaterloo.ca/~gvcormac/spamcormack.html

--
The only thing new in this world is the history that you don't know.[Harry Truman]
Out of Date and Worthless by prandal · 2006-08-02 20:09 · Score: 4, Informative

This paper's a complete waste of time.

He tested spamassassin 2.3 - that's ancient! I'd imagine the other tools are similarly obsolete.

We currently use SA 3.1.4 with a well-trained Bayes database and Razor, Pyzor, and DCC.

Throw in a few custom rules and a selection of rules from http://www.rulesemporium.com/ and the results are outstanding.

With the new sa-update feature, the core rules are updated between point releases. That came in useful this week in dealing with the new image spam, which seemed to be designed to avoid detection by spamassassin. Thanks Theo.

And the folk on the spamassassin-users mailing list really rock.