Seven Spam Filters Compared
Goo.cc writes "Those wondering how their spam filtering software performs in comparison to other's may want to read this article on Freshmeat, where Sam Holden performs comparative testing of various popular e-mail filters. The filters tested includes Bayesian Mail Filter, Bogofilter, dbacl, Quick Spam Filter, SpamAssassin, SpamProbe, and SPASTIC."
people/editors need to learn the a tag
The author makes a good attempt at comparing these products, but I don't think his samples are indepth enough to come up with real-world results.
For Bayes testing, he used 68 spam and 68 ham messages. Spamassassin for one won't even activate bayes until it's learned from 200 messages; it's not uncommon for those who regularly deal with spam management on the server side to use 5000-10,000 message corpuses to test new rule additions and to train spam.
The low number might have a slight effect if most of your mail contains similar characteristics, but I'd much rather have seen bigger numbers of samples.
-Barkeep, a draft of your most hazardous brew, for the world is slowly stepping into focus, and I don't like what I see.
As was noted earlier, the set of messages given to the filters for learning was terribly small. Furthermore, SpamAssassin wasn't tested in a way useful to most as the tests in this article didn't take into account SA's Bayesian filter nor it's network-based tests (Razor, etc).
What about PopFile? I've tried SpamAssassin and a few others, and I like PopFile the best. After a little training it's EXTREEMLY accurate. It survived the deluge of mail I've gotten in the last few days (due to virii) with flying colors.
According it it's internal statistics, it has classified 2821 messages as of the time I type this. It has made only 95 errors (often close calls, so I don't blame it). That puts it at an accuracy of 96.63%. For the record, of the e-mail I've gotten, it's 308 messages of ham, 2513 spam.
I have only been using PopFile since June 7th of this year, but it's working fantastic. The only thing I've used that's this good was Cloudmark's SpamNet, who stabbed the community in the back, so I switched to something else. I'm glad I've found PopFile, and I suggest you try it too if you're looking for something good.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
See our PSAM project site for a refereed paper evaluating several machine learning spam filtering techniques (although not specific filters). This site also contains large standardized corpora for evaluation. The paper contains a number of tips on evaluating ML spam filters.
The /.-referenced article has some good ideas about evaluation. I particularly liked the explicit discussion of the false positives. The recommendations at the end are excellent. On the other hand, the evaluation isn't across a broad or obviously representative corpus, many of the tests are a bit odd, the ROC tradeoffs are not discussed. In particular, the evaluation set for the tests did not include enough ham to be able to accurately estimate the false positive rate: consider what would happen to the precision estimates if 0.5 were added to each of the numbers in the false positive table.
Overall, though, this was an interesting evaluation, and I'm glad that the author published it.
Yup. I use it all the time. Save up spam and ham in seperate folders. Then do this:
sa-learn --spam --mbox ~/mail/myspamfolder
sa-learn --ham --mbox ~/mail/myhamfolder
As I get more spam, I set it aside into a folder, and in tcsh I have this alias set:
alias spamadd 'sa-learn --spam --mbox ~/mail/got-through && rm ~/mail/got-through && touch ~/mail/got-through'
Karma: Chameleon (mostly due to the fact that you come and go).
Of couse your baysian filter will QUICKLY learn that html tags that create invisible text are VERY common in spam and nowhere else-> problem solved
Dont forget that the filter sees more than the eye...
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
I'm using the standalone Thunderbird and it catchs everything that passes by Spamassassin. Spam is marked but never deleted, so I can go back and check. Some spam programs will delete email, which could delete a good email, unacceptable.
Basically, I'm using a mandrake linux box, imap, procmail, fetchmail and spamassassin. Easy, and I can send/receive email from my linux box, and port 25 is blocked from the Net so nobody can use me as a bouncer.
Only problem I had was, there was no complete document to set this up, I had to piece each part together.
So for anyone who wants to know, heres the quick steps.
1. I'm using mandrake, but had to update SA for the sa-learn utils. (Gotta train SpamAssassin)
2. Setup fetchmail in your personal account.
3. Setup
DROPPRIVS=YES
VERBOSE=ON
LOGFILE=/home/userac
|
4. Setup your user_prefs in your local directory for SA. (mine, but im no SA expert, but it works)
required_hits 5
rewrite_subject 0
use_terse_report 1
report_safe 1
use_bayes 1
auto_learn 1
ok_locales en
use_pyzor 1
pyzor_max 9
pyzor_add_header 1
use_razor2 1
always_add_headers 1
always_add_report 1
spam_level_stars 1
pyzor_add_header 1
skip_rbl_checks 0
#timelog_path
5. As root make sure Imap,Spamassassin is running.
6. Load Thunderbird, use Imap, use filters on x-headers.
SAProxy for Windows (Based on SpamAssassin) got the highest marks.
If you reread the slightly ambiguous sentence in context you will realise he meant he had evaluated five baysian filters and felt that was enough. Nothing to do with Spamassassins point system...
Also remember you need to feed nonspams to bayesian filters also.
Nothing to see here; Move along.