Proving Which Spam Filters work Best

Posted by ryuzaki0 on Wednesday August 2, 2006 @04:14PM from the get-rid-of-it dept.

pirateninja writes "Dr. Gord Cormack decided to find and prove what the best spam filter is. In his study he looked at the major spam filters (DSPAM, SpamAssassin, etc.) along with those submitted by various academics. The results are quite surprising, with a previously unheard-of spam filter, which uses ideas from various compression algorithms, performing the best overall. He recently presented the results and methodology used in a presentation titled 'Spam Filters, Do they Work? and Can you prove it?'" Note that this is a video of his presentation.

34 of 263 comments

In my experience... by vivin · 2006-08-02 16:24 · Score: 4, Informative

... the ones which have worked best (for me) are Bayesian Spam Filters (A Plan for Spam, SpamBayes - a free filter) and CRM114 The Controllable Regex Mutilator (Paul Graham mentions it here). I've always had a very high success rate with these.

--
Vivin Suresh Paliath
http://vivin.net

1. Re:In my experience... by coffeeisclassy · 2006-08-02 16:29 · Score: 3, Insightful
  
  What's surprising is that, while Bayesian spam filters work well in his tests, the one that performs the best was never really heard of before. I wonder how long it will be before we see something using these methods. Who wants to bet open source will beat closed source to implementing this?
2. Re:In my experience... by ozmanjusri · 2006-08-02 16:30 · Score: 5, Funny
  
  I've always had a very high success rate with these.
  I haven't tested this one myself (the Barrett Filter), but I understand it is 100% effective at reducing spam from known sources. False positives may be a problem, however.
  
  --
  "I've got more toys than Teruhisa Kitahara."
3. Re:In my experience... by Red+Alastor · 2006-08-02 18:00 · Score: 4, Informative
  
  I like popfile because it's a bayesian filter that sorts into any arbitrary categories you want, not just spam and ham.
  http://popfile.sourceforge.net/
  
  --
  Slashdot anagrams to "Sad Sloth"
4. Re:In my experience... by I!heartU · 2006-08-02 19:34 · Score: 3, Insightful
  
  Domain keys... now just get everyone to use it.
5. Re:In my experience... by KlaymenDK · 2006-08-02 21:47 · Score: 4, Insightful
  
  "False positives may be a problem, however."
  
  False positives are a HUGE problem compared to the occasional false negative.
  
  I'd rather have a small trickle of spam emails (I can't believe I'm saying this, but hear me out) than risk missing out on that one truly important email.
  
  --
  "Good news, everyone!"
6. Re:In my experience... by jank1887 · 2006-08-03 00:27 · Score: 3, Insightful
  Hello. Welcome to the internet.
  First, spam does not need to make sense to make money. Here's some of my latest received headlines:
  
  placing LEDhas
  
  pJapans mission
  
  capture Todays architect shared
  
  6MZ
  
  and the body text (with an attached image):
  -----
  malware
  USDA databases crop
  entente cordial: admission relation contract GB giveaway andd
  studios another page:
  ... (etc.,etc.)
  -------
  AND IT STILL MAKES MONEY!!!
  Spam is funded by idiots. We will never run out of idiots on the net. Thus, spam will always be profitable under the current email system, no matter what filters are used. Filters don't fix the spam problem any more than virus scanners stop viruses from spreading. It's all reactive, which translates to 'fighting a never-ending battle on the losing side'.
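The idea Red+Alastor mentions above, a Bayesian filter that sorts mail into any number of categories rather than just spam and ham, can be sketched as a small multinomial naive Bayes classifier. This is a minimal illustration under stated assumptions, not POPFile's actual code: the class name `BucketClassifier`, the tokenizer, the add-one smoothing, and the sample messages are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


class BucketClassifier:
    """Toy naive Bayes classifier with arbitrary user-defined buckets (hypothetical)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word frequencies
        self.message_counts = Counter()          # bucket -> number of training messages

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase alphanumeric tokens are good enough for a demo.
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, bucket, text):
        self.word_counts[bucket].update(self.tokenize(text))
        self.message_counts[bucket] += 1

    def classify(self, text):
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_messages = sum(self.message_counts.values())
        words = self.tokenize(text)

        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            # Log prior: fraction of training messages filed in this bucket.
            score = math.log(self.message_counts[bucket] / total_messages)
            total_words = sum(counts.values())
            for word in words:
                # Add-one (Laplace) smoothing so unseen words never give log(0).
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


classifier = BucketClassifier()
classifier.train("spam", "make loads of money work at home")
classifier.train("spam", "cheap pills free prize")
classifier.train("work", "meeting agenda for the project review")
classifier.train("work", "please review the budget report")
classifier.train("family", "dinner on friday with grandma")
classifier.train("family", "photos from the birthday party")

print(classifier.classify("free money prize"))
print(classifier.classify("project budget meeting"))
print(classifier.classify("birthday dinner photos"))
```

Adding a new category is just a matter of calling `train` with a new bucket name, which is what makes this approach convenient for sorting mail beyond a two-way spam/ham split.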
Why not just douse the server in gas... by shotgunefx · 2006-08-02 16:25 · Score: 3, Funny

400MB?

Why not just douse the server in gas if you want to see it melt.

--

-William Shatner can be neither created nor destroyed.
1. Re:Why not just douse the server in gas... by Tsiangkun · 2006-08-02 16:28 · Score: 5, Funny
  
  I'm getting 8kb/s downloads from the site, it's just like the good old days !
  
  I'll post more next week after I watch the video.
Fantastic Spam Filters Which Work Best Proving! by _vSyncBomb · 2006-08-02 16:30 · Score: 5, Funny

Hey Slashdot, what's up, man! Dude, I read your thing and like totally agree about Best Work Proving Spam Site Work! Dude, that's awesome!

Bro, in the same vein, I was totally checking out this dope ass site which you might wanna check out too man. Guys like us that dig Spam Which Proving and Best work Filters will be all over this before long...

OK, man take care until I see you this Friday at the dinner thing, Slashdot!

Cheers,
John
Under present IST policy... by patio11 · 2006-08-02 16:39 · Score: 3, Funny

... they are not allowed to douse the servers in gas.

--
Help poke pirates in the eyepatch, arr.
RTFA? by glowworm · 2006-08-02 16:39 · Score: 4, Insightful

So, how are we supposed to RTFA when the FA is a video file of over 470MB? Why not just a nice simple text summary, Mr Submitter? But nooooo, that would just be too easy!

--
Orationem pulchram non habens, scribo ista linea in lingua Latina
1. Re:RTFA? by emag · 2006-08-02 16:55 · Score: 5, Funny
  
  "We are sorry that these talks are not available as plain HTML, PDF, or text, however under present IST policy we are not allowed to provide plain HTML, PDF, or text."
  
  --
  "The urge to save humanity is almost always a false front for the urge to rule." --H.L. Mencken
Not surprising... by RealGrouchy · 2006-08-02 16:45 · Score: 4, Insightful

Although I haven't WTFV (watched the video), it doesn't seem surprising that spam filters which use techniques that aren't used widely would be most successful.

If they aren't used widely, it would either be because they don't work, or they do work but they haven't caught on [yet].

It's like any other fad. As an example, when the original Survivor series came out, it was really popular because it achieved its goal (attracting viewers) in a way that was original. Heck, even I watched the original one. Now that all the networks are doing the reality TV thing, it has become hackneyed, and each successive version of Survivor does a worse job of achieving its goal. And I've given up watching TV.

With antispam, new techniques are effective, but as they become more popular and more widely used, spammers will find equally innovative ways of getting around them.

I've noticed that at any given time, there will be a particular style of (non-blank) spam that manages to get through Gmail's filters fairly consistently, but every now and then Gmail adapts its spam filters to block the successful spam type of the season, and eventually a new type will make its way through.

- RG>

--
Hey pal, this isn't a pleasantforest, so don't waste my time with pleasantries!
Got to go with Brightmail by saha · 2006-08-02 16:46 · Score: 4, Informative

We use Brightmail on our campus, and our users love it for its very low false-positive rate and pretty accurate flagging of spam. Another campus uses DSPAM, and some people are up in arms at the prospect of losing their Brightmail in a switch to DSPAM. In my experience, DSPAM isn't nearly as good: it has flagged many legitimate messages and sent them to the Junk folder.

I also echo a gripe of other posters. It's nice to have a video, but a 500MB video file is a bit much. A 50KB pie chart or bar graph would have been nice.
Flaw in the test by lheal · 2006-08-02 16:48 · Score: 5, Informative

The spammers actively try to subvert the more popular filters. That gives a lesser-known one a decided advantage, one which will go away as it becomes more popular.

As with most choices like this, factors such as ease of use, speed, and resource efficiency can overshadow selectivity. No system is perfect, so it's perfectly reasonable to go with a system that's pretty good if you already are using it, rather than switching to the latest cool thing.

I have found that using two dissimilar systems in a chain is quite effective.

--
Raise your children as if you were teaching them to raise your grandchildren, because you are.
Harder! by Profane+MuthaFucka · 2006-08-02 16:53 · Score: 5, Funny

I uuencoded the video file, translated it into Sumerian cuneiform, and pressed it into a billion little clay tablets. They are cooking in my oven right now. Now, the Internet is NOT some kind of truck you can just dump stuff onto, so if you want to get the data you're going to have to come to my house.

--
Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
1. Re:Harder! by rts008 · 2006-08-02 17:05 · Score: 4, Funny
  
  I can't come to your house, you insensitive clod!, teh tubes are clogged with clay tablets!
  
  I won't be able to download my internet until Friday now!
  
  Turn that crap down, and get off of my lawn! Damn kids!
  
  --
  Down With Slashdot BETA!!! I've been around the corner and seen the oliphant; you can only abuse me from your perspecti
2. Re:Harder! by cruachan · 2006-08-02 21:43 · Score: 4, Insightful
  
  Don't knock it, cuneiform on baked clay is the single most successful format for long-term storage ever invented - 3000 years and counting. Heck, most of our modern storage formats can't even manage 30 - tried to read an 8" floppy recently?
3. Re:Harder! by Squalish · 2006-08-03 00:30 · Score: 4, Insightful
  
  Am I the only one that read the means of presentation as a hilarious attack on a university policy of blocking bittorrent? Given that adding 470MB doesn't really add any usable information to a discussion about spam filters over a piece of text, and all.
  
  Your college doesn't like bandwidth-efficient delivery? Flood them with a Slashdot effect on a 500mb file, an extra $500 in bandwidth charges, and maybe they'll change their tune.
  
  --
  People in Soviet Russia, however, appear to be afflicted with amusing juxtapositions of the aforementioned situation
Re: Very Interesting And Generally Really Amusing by Anonymous Coward · 2006-08-02 16:55 · Score: 5, Funny

Hey _vSyncBomb,

Having trouble pleasing your woman? I've got something Very Interesting And Generally Really Amusing that you could try!!!

Your buddy,
_vAnoymousCoward
I got the 400M download! by Ossifer · 2006-08-02 16:55 · Score: 3, Funny

And I printed out every frame so I could scan them. I'll be posting the TIFFs on my website shortly...
text versions of the material by martin-boundary · 2006-08-02 17:13 · Score: 5, Informative

For those who don't relish downloading 400MB worth of video (why can't somebody cut out the audio as a standalone MP3?), the material of the talk is also available in text mode.
The official tests of spamfilters were done in last year's TREC conference, you can read the writeup here (or pdf overview).
You can duplicate those tests yourself if you download the evaluation toolkit (GPL). It's a modular system where you can add a mail corpus (either one of the public TREC ones, or you can make your own trivially), and add a spamfilter package (there are 10 or so to download from the web, or create your own as per documentation).
There's also a video talk given at Microsoft research which should cover pretty much the same ground, if text mode is slashdotted :).
There's a new scheduled test towards the end of the year at TREC 2006.
Ask Slashdot ... by Anonymous Coward · 2006-08-02 17:34 · Score: 5, Funny

Dear Slashdot,
At the university where I work, they have recently adopted a pesky policy banning the use of BitTorrent.
What can I do to fix this?
Yours faithfully,
Dr. Gord Cormack
No bittorrent... No credibility by bgog · 2006-08-02 18:33 · Score: 4, Insightful

Why exactly should we give any weight to anything from an organization so ignorant as to disallow BitTorrent? It takes someone pretty darn ignorant to disallow a protocol because some use it to transport illegal content. Why haven't they banned TCP? It is an evil technology used every day to violate copyright.

This guy should spend his time educating the fools at his institution.
Possible Text Version by sciop101 · 2006-08-02 18:35 · Score: 4, Informative

On-line Supervised Spam Filter Evaluation
Gordon Cormack and Thomas Lynam

Full Text, May 29, 2006 - PDF Format

http://plg.uwaterloo.ca/~gvcormac/spamcormack.html

--
The only thing new in this world is the history that you don't know.[Harry Truman]
1. Re:Possible Text Version by gvc · 2006-08-03 01:45 · Score: 3, Informative
  
  Bogofilter works great. Or SpamAssassin but only if you force-feed it its own judgements. In both cases you have to correct classification errors.
  Fidelis Assis (who has now gone solo after having participated in the CRM114 project) shows great results for his recent solo effort: OSBF-lua. Bratko's PPM spam filter -- the one that did great at TREC -- is not yet packaged as a drop-in filter. Same for my DMC spam filter.
  The actual TREC 2005 tests referred to in TFA are here.
GMail Spam Filter by foxylad · 2006-08-02 18:48 · Score: 5, Interesting

I use greylisting (gld to be specific) which works wonderfully. A couple of customers wanted even better filtering...

First I tried DSPAM, but they refused to train it so the results weren't good. Then I tried SpamAssassin, which also let through a surprising amount of spam - a lot more than my personal account on Gmail.

So I set up accounts on Gmail for them, and forwarded their mail to those accounts (after greylisting - don't want to burden Gmail too much!). Gmail lets you set up forwarding, so I simply forwarded all the filtered mail back to a second account on my mailserver for the customer to pick up. Finally I wrote a Python script that logs in to Gmail once a week to prevent the account being closed due to non-use.

A tad involved, but it works like a dream. Yet again Google comes out on top, this time in a market it doesn't even know it's in!

--
Do as you would be done to.
Out of Date and Worthless by prandal · 2006-08-02 20:09 · Score: 4, Informative

This paper's a complete waste of time.

He tested spamassassin 2.3 - that's ancient! I'd imagine the other tools are similarly obsolete.

We currently use SA 3.1.4 with a well-trained Bayes database and Razor, Pyzor, and DCC.

Throw in a few custom rules and a selection of rules from http://www.rulesemporium.com/ and the results are outstanding.

With the new sa-update feature the core rules are updated between point releases, which came in useful this week dealing with the new image spams which seemed to be designed to avoid detection by spamassassin. Thanks Theo.

And the folk on the spamassassin-users mailing list really rock.
1. Re:Out of Date and Worthless by gvc · 2006-08-03 01:18 · Score: 3, Informative
  I assume the paper that you are describing is the 2004 study. The paper described in the talk (which was given 6 months ago or so) described results of the TREC 2005 Spam Track which took place in November 2005. It included a test of SpamAssassin 3.x, not 2.3.
  TREC 2006 evaluations are now underway.
  While it is reasonable to conjecture that spam has changed so as to defeat spam filtering techniques, or will change so as to defeat the PPM technique that did well at TREC, the historical evidence does not support this conjecture. In particular:
  
  The spam filters tested in 2004 give pretty well exactly the same performance on 2005 and 2006 data.
  
  New versions of the filters are a little bit better, but not by leaps and bounds, and also get about the same results over the last 2.5 years of data.
  
  There is no evidence that "Bayesian poisoning" is a viable technique for defeating statistical spam filters in anything but a very artificial laboratory environment where the poisoner has access to the recipient's inbox.
  
  The subject of the paper -- and the talk -- is primarily about testing methodology and the need for controlled scientific investigation. So I hesitate to endorse the simplistic notion of a "winner" of the TREC evaluation. However the technique that did very well was indeed quite novel, so here's a characterization.
  Andrej Bratko used PPM -- a well-known data compression technique -- to compress ham and spam separately. Well, actually he didn't compress them but just built the statistical model necessary to compress them. Then he simply (tentatively) added the unknown message to each model and chose the one that compressed it best. The general technique of using compression has been mentioned here and elsewhere, but Bratko used a much stronger compression scheme and was somewhat clever about it.
  I later reproduced Bratko's results using DMC -- a compression scheme that I invented 20 years ago -- and got some interesting results. We have a journal article in press describing it and also an evaluation paper at CEAS 2006.
  Bratko A., Cormack G. V., Filipic B., Lynam T. R. and Zupan B., Spam Filtering Using Statistical Data Compression Models
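The procedure gvc describes above (model ham and spam separately, tentatively add the unknown message to each, and pick the model that compresses it best) can be sketched in a few lines. This is a rough illustration only: it substitutes zlib's DEFLATE for PPM or DMC, which is a much weaker compressor than the ones used at TREC, and the function names and tiny sample corpora are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_bytes(corpus: bytes, message: bytes) -> int:
    """Approximate cost of encoding message given a model built from corpus.

    Assumption: compressing corpus + message and subtracting the size of the
    compressed corpus stands in for "adding the message to the model".
    """
    return compressed_size(corpus + message) - compressed_size(corpus)


def classify(message: bytes, ham_corpus: bytes, spam_corpus: bytes) -> str:
    """Label the message with whichever corpus compresses it more cheaply."""
    ham_cost = extra_bytes(ham_corpus, message)
    spam_cost = extra_bytes(spam_corpus, message)
    # Ties go to ham, since false positives are the costlier mistake.
    return "spam" if spam_cost < ham_cost else "ham"


# Hypothetical training text, far smaller than any real corpus.
HAM = (b"Meeting moved to Thursday at ten. "
       b"Please review the attached draft before the seminar. "
       b"Lunch on Friday with the research group? "
       b"The build server is back up after the disk replacement. ")
SPAM = (b"Make loads of money working at home! "
        b"Cheap pills, no prescription needed. "
        b"Click here to claim your free prize now! "
        b"Lowest mortgage rates guaranteed, act today. ")

print(classify(b"Claim your free prize now! Make loads of money working at home!", HAM, SPAM))
print(classify(b"Please review the attached draft before the seminar on Thursday.", HAM, SPAM))
```

Note that DEFLATE only looks back 32KB, so with realistic corpus sizes this stand-in would ignore most of the training data; an adaptive statistical model such as PPM or DMC does not have that limitation.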
Amusingly, POPFile caught you by patio11 · 2006-08-02 20:15 · Score: 4, Interesting

I ran your message through a Perl script to mail it to me for giggles (I do research on spam filtering at ye olde day job). Regretfully, you didn't make it through. Aside from header garbage, which was a mixed bag (half spam tokens, half "known-good automated email" tokens), you ran into problems with dope, ass, wanna, and... work*. Which is just as well, as I have no desire to speak to anyone who uses those words.

* Last 15 occurrences in my mailbox are all of the "Make l0ads of $$$ work @ h0m3!" variety.

--
Help poke pirates in the eyepatch, arr.
Torrent by vivin · 2006-08-02 22:55 · Score: 3, Informative

Here is a torrent I made of the xvid file. It should work (I hope).

--
Vivin Suresh Paliath
http://vivin.net

Dspam floats my boat by Zzeep · 2006-08-02 23:20 · Score: 3, Informative

I receive (no kidding) around 600 spam mails per day, versus approximately 30 real e-mails. I've been using dspam for over a year now (with very faithful training), and there is maybe 1 false positive every few weeks (less than 1 in 10,000) and every few days a few (usually "new") spam mails get through, which I of course immediately train, to never see that kind again. So I am very, very positive about dspam. What I do miss, though, is something like a good and reliable service (better than the RBLs I know) that can block SMTP clients on the fly (like DSL home users and such) to reduce the immense load on our mailservers (I work for an ISP) caused by all the spam (that also has to go through a virus scanner, clamav).
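The figures in this comment give two quite different false-positive rates depending on the denominator. A quick sketch of the arithmetic follows, assuming "every few weeks" means three weeks (21 days); that interval and the function name are assumptions, not numbers from the comment.

```python
def false_positive_rates(spam_per_day, ham_per_day, days_per_false_positive):
    """Return (rate over all mail, rate over legitimate mail) for one false positive."""
    ham_seen = ham_per_day * days_per_false_positive
    all_seen = (spam_per_day + ham_per_day) * days_per_false_positive
    return 1 / all_seen, 1 / ham_seen


# 600 spam and 30 real mails per day are from the comment; 21 days is assumed.
per_message, per_ham = false_positive_rates(600, 30, 21)
print(f"1 in {round(1 / per_message):,} of all mail")
print(f"1 in {round(1 / per_ham):,} of legitimate mail ({per_ham:.2%})")
```

Under that assumption the "less than 1 in 10,000" figure holds when counted against all mail, but counted against legitimate mail alone it is roughly 1 in 630, which is the number that matters when the worry is losing a real message.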
Re:Combo of SpamAssassin and Spamhaus by jdowland · 2006-08-02 23:45 · Score: 3, Funny

The key is still: don't give out your address. Once you've done that, you're going to be screwed eventually.

Nah, that's such a half measure. The real solution is to not have an email address at all.