Seven Spam Filters Compared

Posted by timothy on Saturday August 23, 2003 @06:54AM from the spam-is-evil dept.

Goo.cc writes "Those wondering how their spam filtering software performs in comparison to others may want to read this article on Freshmeat, where Sam Holden performs comparative testing of various popular e-mail filters. The filters tested include Bayesian Mail Filter, Bogofilter, dbacl, Quick Spam Filter, SpamAssassin, SpamProbe, and SPASTIC."

Re:Link Please by Neophytus · 2003-08-23 06:56 · Score: 3, Informative

people/editors need to learn the a tag
Re:Link Please by woodhouse · 2003-08-23 06:56 · Score: 2, Informative

clicky
The Link. by AndyFewt · 2003-08-23 06:56 · Score: 2, Informative

Spam Filters
Good testing, but not enough samples by TexTex · 2003-08-23 07:00 · Score: 4, Informative

The author makes a good attempt at comparing these products, but I don't think his samples are in-depth enough to come up with real-world results.

For Bayes testing, he used 68 spam and 68 ham messages. SpamAssassin, for one, won't even activate Bayes until it has learned from 200 messages; it's not uncommon for those who regularly deal with spam management on the server side to use corpora of 5,000-10,000 messages to test new rule additions and to train filters.

The low number might have a slight effect if most of your mail contains similar characteristics, but I'd much rather have seen bigger numbers of samples.

--
-Barkeep, a draft of your most hazardous brew, for the world is slowly stepping into focus, and I don't like what I see.
1. Re:Good testing, but not enough samples by cly · 2003-08-23 07:13 · Score: 5, Informative
  
  I guess you wrote this after reading the first two experiments.
  
  In the third he used 1200.
  
  Nice way to jump the gun.
2. Re:Good testing, but not enough samples by arth1 · 2003-08-23 08:03 · Score: 4, Informative
  
  I guess you wrote this after reading the first two experiments.
  
  In the third he used 1200.
  
  1273, out of which 1073 were spam. That leaves 200 non-spam messages, which isn't enough for SpamAssassin's Bayesian filtering to kick in, even if all messages were to be classified as ham or spam, and not just let through.
  
  To quote sa-learn's man page:
  Another thing to be aware of, is that typically you should aim to train with at least 1000 messages of spam, and 1000 ham messages, if possible. More is better, but anything over about 5000 messages does not improve accuracy significantly in our tests.
  The low number of emails, combined with no apparent manual reading on the part of the author, makes me want to disregard this whole survey as pure drivel.
  
  Regards,
  --
  *Art
Obligatory "here's my perfect spam solution" by ceswiedler · 2003-08-23 07:02 · Score: 2, Informative

IMO, the best way to go with spam is to combine a heuristic filter with a text/Bayesian filter, in my case SpamAssassin and SpamProbe. I run them both, and together they do a noticeably better job than either running alone.

SpamProbe can be fooled by clever spammers who insert lots of common words in non-visible HTML. A Bayesian filter can't really catch that, but a heuristic filter can be written to notice the pattern.

Also, set up your Bayesian filter to re-learn regularly from your spam folder. SpamProbe adds a unique ID to each message, so it won't process a message twice. Therefore, you can just manually move any false-negative spams into the folder, and they'll be learned from.
Flawed Tests by Plix · 2003-08-23 07:14 · Score: 3, Informative

As was noted earlier, the set of messages given to the filters for learning was terribly small. Furthermore, SpamAssassin wasn't tested in a way useful to most, as the tests in this article didn't take into account SA's Bayesian filter or its network-based tests (Razor, etc.).
Re:Mozilla? by thinkninja · 2003-08-23 07:17 · Score: 2, Informative

Very true. I downloaded 1600 messages with Thunderbird today (backlog) and only about 30 weren't spam. That's a huge waste of bandwidth.

--
"The number of Unix installations has grown to ten, with more expected." (Unix Programmer's Manual, 2nd ed.; june 1972)
Active Spam Killer by Admiral Llama · 2003-08-23 07:24 · Score: 2, Informative

How the heck could Active Spam Killer be left out? I used to get about 150 spams a day and now I get ZERO. No false positives, no false negatives.
It is an autoresponder that checks the sender against a whitelist and a blacklist. If a new e-mail is in neither, then it bounces back an e-mail asking for a confirmation that the sender is a human. Simple!
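
The whitelist/blacklist/challenge flow described here can be sketched roughly as follows. This is only an illustration of the idea, not Active Spam Killer's actual code: the function name `triage`, the return labels, and the sample addresses are all made up for the example.

```python
def triage(sender, whitelist, blacklist):
    """Decide what to do with a message based on its sender address.

    Hypothetical helper: illustrates the challenge/response idea only.
    """
    address = sender.strip().lower()
    if address in whitelist:
        return "deliver"      # known good sender
    if address in blacklist:
        return "reject"       # known bad sender
    return "challenge"        # unknown: hold the mail and ask for confirmation


# Hypothetical address lists for the demonstration.
whitelist = {"friend@example.org"}
blacklist = {"spammer@example.net"}

for sender in ("Friend@example.org", "spammer@example.net", "stranger@example.com"):
    print(sender, "->", triage(sender, whitelist, blacklist))
```

A real challenge/response system would also need to queue the held message and move the sender to the whitelist once the confirmation arrives; that part is left out of the sketch.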
What About PopFile by MBCook · 2003-08-23 07:29 · Score: 4, Informative

What about PopFile? I've tried SpamAssassin and a few others, and I like PopFile the best. After a little training it's EXTREMELY accurate. It survived the deluge of mail I've gotten in the last few days (due to viruses) with flying colors.

According to its internal statistics, it has classified 2821 messages as of the time I type this. It has made only 95 errors (often close calls, so I don't blame it). That puts it at an accuracy of 96.63%. For the record, of the e-mail I've gotten, 308 messages were ham and 2513 were spam.
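
The accuracy figure can be rechecked from the numbers in the post. A minimal sketch, assuming "errors" means misclassified messages out of all messages classified:

```python
classified = 2821   # messages classified so far (from the post)
errors = 95         # misclassifications reported (from the post)
ham, spam = 308, 2513

# The two categories add up to the reported total.
assert ham + spam == classified

accuracy = (classified - errors) / classified
print(f"accuracy: {accuracy:.2%}")             # accuracy: 96.63%
print(f"spam share: {spam / classified:.2%}")  # spam share: 89.08%
```

Note that with roughly 89% of the mail being spam, a filter that labelled everything as spam would already score about 89%, so the per-class error counts matter more than the headline number.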

I have only been using PopFile since June 7th of this year, but it's working fantastically. The only thing I've used that's this good was Cloudmark's SpamNet, which stabbed the community in the back, so I switched to something else. I'm glad I've found PopFile, and I suggest you try it too if you're looking for something good.

--
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
PSAM by po8 · 2003-08-23 07:29 · Score: 4, Informative

See our PSAM project site for a refereed paper evaluating several machine learning spam filtering techniques (although not specific filters). This site also contains large standardized corpora for evaluation. The paper contains a number of tips on evaluating ML spam filters.

The /.-referenced article has some good ideas about evaluation. I particularly liked the explicit discussion of the false positives. The recommendations at the end are excellent. On the other hand, the evaluation isn't across a broad or obviously representative corpus, many of the tests are a bit odd, and the ROC tradeoffs are not discussed. In particular, the evaluation set for the tests did not include enough ham to be able to accurately estimate the false positive rate: consider what would happen to the precision estimates if 0.5 were added to each of the numbers in the false positive table.

Overall, though, this was an interesting evaluation, and I'm glad that the author published it.
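
To see why a small ham set makes false-positive estimates fragile, here is a minimal sketch of the "add 0.5" thought experiment. All counts below are hypothetical and are not taken from the article's tables; the ham-set size of 200 merely echoes the figure quoted elsewhere in this thread.

```python
def precision(true_pos, false_pos):
    """Fraction of messages flagged as spam that really were spam."""
    return true_pos / (true_pos + false_pos)


ham_total = 200   # hypothetical size of the ham evaluation set
true_pos = 1000   # hypothetical number of spam messages correctly flagged

for false_pos in (0, 1, 2):      # hypothetical observed false positives
    bumped = false_pos + 0.5     # the "add 0.5" perturbation
    print(
        f"FP={false_pos}: rate {false_pos / ham_total:.2%} -> {bumped / ham_total:.2%}, "
        f"precision {precision(true_pos, false_pos):.4f} -> {precision(true_pos, bumped):.4f}"
    )
```

With only 200 ham messages, half a message of uncertainty moves a measured false positive rate of 0% to 0.25%, and a measured 0.50% to 0.75%. Differences between filters that are smaller than this cannot be distinguished on such a set.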
Re:Mozilla? by Anonymous Coward · 2003-08-23 07:32 · Score: 1, Informative

I've been using Mailfilter for a while now and I've built a pretty comprehensive list of keywords in the subjects of spam. It seems to just pull the message headers from the server without downloading the body.

One example rule:
DENY = ^Subject:.*v[i1l!|][a4@][g8]e?r[a4@]

Then I filter whatever gets through that with SpamAssassin.
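
The example rule can be tried out against a few subject lines. A minimal sketch using Python's `re` module, assuming case-insensitive matching; Mailfilter's own regex dialect and defaults may differ, and the sample subjects are made up.

```python
import re

# The rule from the post, compiled with Python's re module.
# Assumption: matching is case-insensitive; Mailfilter's own regex
# dialect and defaults may differ from Python's.
RULE = re.compile(r"^Subject:.*v[i1l!|][a4@][g8]e?r[a4@]", re.IGNORECASE)

# Hypothetical header lines for the demonstration.
headers = [
    "Subject: Cheap V1@gra here",
    "Subject: vi4ger@ for less",
    "Subject: Meeting moved to Friday",
]

for header in headers:
    verdict = "DENY" if RULE.search(header) else "pass"
    print(f"{verdict}: {header}")
```

The character classes absorb the common letter substitutions, so one rule covers many obfuscated spellings. Anything the pattern does not anticipate still gets through, which is why a second stage such as SpamAssassin is useful.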
Re:Spamassassin and Bayes? by numbski · 2003-08-23 07:36 · Score: 4, Informative

Yup. I use it all the time. Save up spam and ham in separate folders. Then do this:

sa-learn --spam --mbox ~/mail/myspamfolder
sa-learn --ham --mbox ~/mail/myhamfolder

As I get more spam, I set it aside into a folder, and in tcsh I have this alias set:

alias spamadd 'sa-learn --spam --mbox ~/mail/got-through && rm ~/mail/got-through && touch ~/mail/got-through'

--
Karma: Chameleon (mostly due to the fact that you come and go).
WRONG. by imsabbel · 2003-08-23 07:38 · Score: 5, Informative

Of course your Bayesian filter will QUICKLY learn that HTML tags that create invisible text are VERY common in spam and nowhere else -> problem solved.
Don't forget that the filter sees more than the eye...
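
The point can be illustrated with a toy per-token estimate. This is a simplified sketch, not the algorithm of SpamProbe or any other specific filter; the training messages and the `color=#ffffff` token are hypothetical.

```python
from collections import Counter


def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) from per-class message frequencies.

    Simplified per-token estimate; real filters add smoothing and
    combine many tokens per message.
    """
    spam_freq = spam_counts[token] / n_spam
    ham_freq = ham_counts[token] / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5                      # unseen token: no evidence either way
    p = spam_freq / (spam_freq + ham_freq)
    return min(0.99, max(0.01, p))      # clamp so no token is absolute proof


# Hypothetical training data: each message is reduced to a set of tokens,
# and HTML markup is tokenized like any other text.
spam_msgs = [
    {"font", "color=#ffffff", "free", "offer"},
    {"font", "color=#ffffff", "meeting", "report"},   # padded with ham-like words
    {"free", "offer", "color=#ffffff"},
]
ham_msgs = [
    {"meeting", "report", "font"},
    {"report", "lunch"},
    {"meeting", "lunch"},
]

spam_counts = Counter(t for m in spam_msgs for t in m)
ham_counts = Counter(t for m in ham_msgs for t in m)

for token in ("color=#ffffff", "font", "meeting"):
    p = token_spam_probability(token, spam_counts, ham_counts, len(spam_msgs), len(ham_msgs))
    print(f"{token}: {p:.2f}")
```

In this toy corpus the white-on-white marker appears only in spam and hits the 0.99 cap, while the padding word "meeting" stays near 0.33. The markup used to hide the padding becomes a stronger signal than the padding itself.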

--
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
Web interface for spamprobe by bigberk · 2003-08-23 07:42 · Score: 2, Informative

If you decide to try out SpamProbe or another Bayesian filter, try this web interface, which lets you easily reclassify mail, even messages marked as spam. I found that "training" the Bayesian filters was the hardest part; this definitely simplifies the process.
Off topic but... by CGP314 · 2003-08-23 07:44 · Score: 2, Informative

It wasn't mentioned in the article, but I really must plug popfile. It filters out my spam yes, but it is also a general mail categorizer. It sorts ten yahoo groups for me, personal, work, and school related emails. I know you think you could do this with rules for the emails, but for example, I get several hundred emails a day from the Harry Potter for Grownups List. Popfile can sort them into 'probably interesting' and 'probably not' for me. Very nice.
Mozilla's Filters + SA = Kick ass solution! by BrookHarty · 2003-08-23 07:59 · Score: 3, Informative

Don't know why we didn't see Mozilla's filters. (Maybe that's covered under Bayesian filters?)

I'm using the standalone Thunderbird and it catches everything that gets past SpamAssassin. Spam is marked but never deleted, so I can go back and check. Some spam programs will delete email, which could delete a good email; that's unacceptable.

Basically, I'm using a Mandrake Linux box, IMAP, procmail, fetchmail and SpamAssassin. It's easy, I can send/receive email from my Linux box, and port 25 is blocked from the Net so nobody can use me as a bouncer.

The only problem I had was that there was no complete document for setting this up; I had to piece each part together.

So for anyone who wants to know, here are the quick steps.

1. I'm using Mandrake, but had to update SA for the sa-learn utils. (Gotta train SpamAssassin.)
2. Set up fetchmail in your personal account.
3. Set up .procmailrc in your home dir:

DROPPRIVS=YES
VERBOSE=ON
LOGFILE=/home/useraccount/procmail.log

:0fw
| /usr/bin/spamc
4. Set up your user_prefs in your local directory for SA. (This is mine; I'm no SA expert, but it works.)
required_hits 5
rewrite_subject 0
use_terse_report 1
report_safe 1
use_bayes 1
auto_learn 1
ok_locales en
use_pyzor 1
pyzor_max 9
pyzor_add_header 1
use_razor2 1
always_add_headers 1
always_add_report 1
spam_level_stars 1
skip_rbl_checks 0
#timelog_path /home/useraccount/.spamassassin/timelog

5. As root, make sure IMAP and SpamAssassin are running.
6. Load Thunderbird, use IMAP, and use filters on the X-headers.
Consumer Reports did an article on that too by Stavr0 · 2003-08-23 08:48 · Score: 3, Informative

Ratings - Spam-blocking software
SAProxy for Windows (Based on SpamAssassin) got the highest marks.
Five Bayesian filters were enough by Sits · 2003-08-23 09:09 · Score: 4, Informative

Here's a quote from the article:

Also, SpamAssassin has a Bayesian classifier built in, but it wasn't used in these tests, since having five was enough.

If you reread the slightly ambiguous sentence in context you will realise he meant he had evaluated five Bayesian filters and felt that was enough. Nothing to do with SpamAssassin's point system...
Re:massing spam for training purposes. by bobbozzo · 2003-08-23 09:55 · Score: 3, Informative

YES: http://spamarchive.org/
Also remember that you need to feed non-spam to Bayesian filters too.

--
Nothing to see here; Move along.
Re:Mozilla? by Blain · 2003-08-23 11:36 · Score: 2, Informative
I have been using POPFile for months now, with a fairly complex setup; being able to use many buckets is one of the things I like about POPFile versus the others I've seen (which are two- or three-bucket systems). It has been classifying more than 99% accurately every month for the past three or four months (I reset my statistics around the first of every month) and has never been less than 95% accurate in a month (including its training month). For an idea of what my loads and buckets are like, here is a list of my buckets and the number of messages classified into them since the first of the month:
- ads -- 25 (0.58%)
- bounces -- 2 (0.04%)
- business -- 18 (0.42%)
- family -- 10 (0.23%)
- forwards -- 8 (0.18%)
- list -- 3,242 (75.72%)
- personal -- 68 (1.58%)
- politics -- 11 (0.25%)
- pornspam -- 136 (3.17%)
- scams -- 24 (0.56%)
- spam -- 678 (15.83%)
- webgenerated -- 57 (1.33%)
- website -- 2 (0.04%)
I've been using TB for a couple of months now, and very much like it. I've used the built-in junk filtering since I first got it, and have found that it is only getting about 1/3 to 1/2 of the things already categorized for my spam buckets, with a higher rate of false positives than POPFile. I would like to see something more reliable, and hope updating the algorithm will help.

As complicated as my buckets may look, this system works very well for me -- with the addition of a "misc" folder that anything not classified goes into, and some filters based on the X-Classified line, almost nothing that gets into my inbox is anything other than personal email.
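
Filtering on the classification header can be sketched as below. This is a rough illustration rather than the poster's actual filter setup: the function name `folder_for` and the sample messages are made up, and the header name `X-Text-Classification` is an assumption (the post only refers to "the X-Classified line"), so it is left as a parameter.

```python
from email import message_from_string


def folder_for(raw_message, header="X-Text-Classification", default="misc"):
    """Return the folder name for a message based on its classification header.

    The header name is an assumption; the post calls it "the X-Classified line".
    """
    msg = message_from_string(raw_message)
    bucket = msg.get(header)
    if bucket is None or not bucket.strip():
        return default                 # unclassified mail goes to "misc"
    return bucket.strip().lower()


# Hypothetical message carrying a classification header.
raw = (
    "From: list@example.org\n"
    "Subject: digest\n"
    "X-Text-Classification: list\n"
    "\n"
    "body text\n"
)
print(folder_for(raw))                                    # list
print(folder_for("Subject: hello\n\nno header here\n"))   # misc
```

Routing on a single header keeps the mail client's rules trivial: one rule per bucket, plus the "misc" fallback for anything that arrives unclassified.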
Re:Spamassassin and Bayes? by arth1 · 2003-08-23 16:34 · Score: 2, Informative

As I get more spam, I set it aside into a folder, and in tcsh I have this alias set:

alias spamadd 'sa-learn --spam --mbox ~/mail/got-through && rm ~/mail/got-through && touch ~/mail/got-through'

In addition to the above, it might be smart to create three files called "ham", "spam" and "forget":
#!/bin/sh
# ham
/usr/bin/sa-learn --ham --no-rebuild --single

#!/bin/sh
# spam
/usr/bin/sa-learn --spam --no-rebuild --single

#!/bin/sh
# forget
/usr/bin/sa-learn --forget --single
Complement with a cron job that runs sa-learn --rebuild every night.

Then, if you read your mail on the same box, and the headers don't say it was auto-learned, simply pipe the email to either ham or spam. If it was wrongly auto-learned as spam, pipe it to forget. If using pine, it's really easy:
| ham

Of course, if you use Razor or other online services that let you report spam, you might want to pipe some of the spam mails that weren't recognized to "spamassassin -r".

Regards,
--
*Art