Seven Spam Filters Compared
Goo.cc writes "Those wondering how their spam filtering software performs in comparison to others may want to read this article on Freshmeat, where Sam Holden performs comparative testing of various popular e-mail filters. The filters tested include Bayesian Mail Filter, Bogofilter, dbacl, Quick Spam Filter, SpamAssassin, SpamProbe, and SPASTIC."
Also, what's with keeping the spam threshold score secret?
They started off by quoting John Graham-Cumming, yet they didn't include his brainchild POPFile.
Check it out Here.
An interesting thread here about how TMDA, a C/R filter, used in conjunction with SpamAssassin, can provide the best of both worlds. While TMDA is by itself effective, there seem to be some humanistic issues involving the assumption that all e-mailers are spammers unless they prove otherwise. The thread explains how Bayesian filtering can be improved by using a decent C/R filter like TMDA without alienating people that send legitimate e-mail.
Personally, I figure anyone thin-skinned enough to be insulted by my C/R filter probably isn't worth talking to anyways, but I digress...
I use SpamBayes (free) with Outlook on my W2K machine. I trained it with over 400 SPAM and over 1000 non-SPAM emails. I get about 45 SPAM each day and my ISP, attglobal, filters out about 40 of them. The SPAM that reaches my mailbox is whatever passes through the attglobal filter, and that filter has NEVER given me a false positive in more than 2000 SPAM. Those SPAM are put in a special folder on the server for inspection, but I now just delete them en masse every week or so.
That means that SpamBayes is filtering only the hardest emails to classify and so far it has only given me one false positive. I got one false negative after training it for the first time. SpamBayes also has a folder for messages that it is not sure of and so far they have all been SPAM. I seldom have to do more than inspect the sender and subject to confirm that they are SPAM.
Each time a message is automatically moved to the SPAM folder (or moved back to the Incoming folder) the training set is adjusted for that email so I don't have to re-train.
To sum up, I'm really impressed by well-designed Bayesian filters and this one in particular. I think it's worthwhile to take the time to build up a corpus of SPAM and "good" messages, as I can then evaluate competing filters.
Nate
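For readers who have not looked inside a Bayesian filter, here is a minimal sketch of the idea described above: train on known SPAM and good mail, score new mail, and send anything in between to an "unsure" folder. This is a plain naive Bayes toy, not SpamBayes's actual scoring method, and the class name, tokenizer, and cutoff values are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes scorer (illustrative only, not SpamBayes's algorithm)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Each distinct word counts once per message.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text, label):
        # Training is incremental: call this whenever a message is filed.
        self.messages[label] += 1
        self.counts[label].update(self.tokens(text))

    def score(self, text):
        """Return an estimate of P(spam | text) between 0 and 1."""
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in self.tokens(text):
            # Add-one smoothing so unseen words do not zero out the product.
            p_spam = (self.counts["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, text, ham_cutoff=0.2, spam_cutoff=0.9):
        # The cutoffs are assumed values; anything in between is "unsure".
        s = self.score(text)
        if s >= spam_cutoff:
            return "spam"
        if s <= ham_cutoff:
            return "ham"
        return "unsure"


f = TinyBayes()
f.train("cheap viagra buy now", "spam")
f.train("buy cheap pills now", "spam")
f.train("meeting notes attached for monday", "ham")
f.train("lunch on monday with the team", "ham")
print(f.classify("buy cheap viagra"))
print(f.classify("monday team meeting"))
print(f.classify("hello"))
```

With a corpus this small the scores are crude, which is the point of building up a few hundred messages of each kind before trusting the result.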
since the filters do better after being trained with lots of spam, anyone think of gathering up a huge collection of spam to give to other people? i mean exporting a corpus of spam from outlook, sticking it up for download somewhere, and letting other people import it into a spam folder. then other people could run their filter of choice and train it!
you could even make it all official-like, and somehow guarantee that the spam that's up for downloading is "official" and "virus-free" and "safe for your computer." you know, do geek stuff like check hashes or whatever it takes to verify that the spam collection is legit. whatever it takes to ensure that someone else hasn't filled it with a ton of virus/trojan/etc. attachments. or whatever. i dunno. you know, somehow guarantee it's safe.
imagine it! download spambayes, get spambayes to connect to the official spambayes spamcorpus server, and download the latest 2000 spams! instant training.
anyway. just an idea. mod me down as -1, herrd0kt0r. 8P
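The "check hashes" part of that idea could look roughly like the sketch below: the corpus maintainer publishes a SHA-256 digest and the downloader compares it before importing anything. The function names are made up for illustration. Note that a matching digest only shows the file is the one the maintainer hashed; it says nothing about whether the attachments inside are harmless.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of the downloaded corpus bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_corpus(data: bytes, published_digest: str) -> bool:
    """True if the download matches the digest the maintainer published."""
    return sha256_of(data) == published_digest.strip().lower()


# Hypothetical corpus contents, standing in for a downloaded mbox file.
corpus = b"From spammer@example.com\nSubject: buy now\n\n...\n"
published = sha256_of(corpus)  # what the maintainer would post next to the file

print(verify_corpus(corpus, published))
print(verify_corpus(corpus + b"tampered", published))
```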
I use bogofilter, and it seems to me it would take far too much of my time to manually feed my own spam to it for training purposes. What I do instead is this:
/home/bogofilter/spamlist.db
We have several spamtrap addresses on our sendmail server. They were not intentionally set up as spamtraps, but in looking at my mail logs I noticed that there were many email addresses receiving spam attempts that are not and never were valid addresses on our system. These invalid addresses somehow got into spammers' email databases and they receive nothing but spam.
So I set up entries in my aliases file to automatically redirect all mail for these accounts to bogofilter's spam database. Here is a sample...
nikola: "|/usr/local/bin/bogofilter -s "
cal: "|/usr/local/bin/bogofilter -s "
bwilson: "|/usr/local/bin/bogofilter -s "
fayre: "|/usr/local/bin/bogofilter -s "
(If you are also using sendmail's access.db to filter mail based on the source IP address, you may want to set up the spamtrap addresses as "spam friends" so that spam directed to them is not filtered out by your IP address filters.)
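Finding candidate spamtrap addresses in the mail log can be partly automated. The sketch below is only a rough illustration: it assumes log lines that contain a `to=<address>` field and the text `User unknown`, and the function names and the `min_hits` threshold are invented here. Review the output by hand before adding it to the aliases file, since a rejected address may just be a typo of a real user.

```python
import re
from collections import Counter

# Assumed log layout: "... to=<user@host>, ... stat=User unknown"
TO_RE = re.compile(r"to=<?([^>,\s]+)>?")


def unknown_recipients(log_lines):
    """Count rejected local parts on lines that mention 'User unknown'."""
    hits = Counter()
    for line in log_lines:
        if "User unknown" not in line:
            continue
        match = TO_RE.search(line)
        if match:
            local_part = match.group(1).split("@")[0].lower()
            hits[local_part] += 1
    return hits


def alias_lines(hits, min_hits=5, command="/usr/local/bin/bogofilter -s"):
    """Format alias entries for addresses rejected at least min_hits times."""
    return [
        f'{name}: "|{command}"'
        for name, count in sorted(hits.items())
        if count >= min_hits
    ]


sample_log = [
    "sendmail[101]: a1: to=<nikola@example.com>, dsn=5.1.1, stat=User unknown",
    "sendmail[102]: a2: to=<nikola@example.com>, dsn=5.1.1, stat=User unknown",
    "sendmail[103]: a3: to=<bob@example.com>, dsn=2.0.0, stat=Sent",
    "sendmail[104]: a4: to=<cal@example.com>, dsn=5.1.1, stat=User unknown",
]
for entry in alias_lines(unknown_recipients(sample_log), min_hits=2):
    print(entry)
```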
To keep the spam database fresh and to keep it from growing to an excessive size, I use a daily cron job that automatically deletes spam entries older than 30 days...
# remove records older than 30 days from spamlist.db
/usr/local/bin/bogoutil -a30 -m
This gives me an 8-megabyte spamlist.db with about 14,000 emails in it, which is constantly refreshed to keep up with the latest spam trends.
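The effect of that age-based cleanup can be pictured with a toy model. The sketch below is not how bogoutil stores or prunes its data; it simply assumes each token carries a count and a last-seen date, and drops anything last seen more than 30 days ago.

```python
from datetime import date, timedelta


def prune(records, today, max_age_days=30):
    """Keep only tokens last seen within max_age_days of today.

    records maps token -> (count, last_seen_date); this layout is an
    assumption for illustration, not bogofilter's on-disk format.
    """
    cutoff = today - timedelta(days=max_age_days)
    return {
        token: (count, last_seen)
        for token, (count, last_seen) in records.items()
        if last_seen >= cutoff
    }


today = date(2004, 3, 31)
spamlist = {
    "viagra": (120, date(2004, 3, 30)),    # seen yesterday, kept
    "mortgage": (45, date(2004, 3, 1)),    # exactly 30 days old, kept
    "y2k": (3, date(2004, 1, 15)),         # stale, dropped
}
print(sorted(prune(spamlist, today)))
```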
Maintaining the non-spam database isn't quite as easy. I use bogofilter's -u option on my own incoming email, which tells Bogofilter to update its databases with my incoming mail based on its classification of the message as spam or non-spam. I never get a false positive, but I do occasionally get a false negative which requires me to make a correcting entry in the database.
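The update-then-correct cycle described above can be sketched as follows. This is a simplified model, not bogofilter's implementation: every incoming message is registered under whatever label the filter assigned, and a misclassified message is later moved from the wrong word list to the right one. The class and method names are hypothetical.

```python
from collections import Counter


class WordLists:
    """Toy spam/non-spam word lists with auto-update and correction."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.msgs = {"spam": 0, "ham": 0}

    def register(self, tokens, label):
        # Auto-update step: record the message under the filter's own verdict.
        self.words[label].update(set(tokens))
        self.msgs[label] += 1

    def correct(self, tokens, wrong, right):
        # Undo the mistaken registration, then register under the right label.
        self.words[wrong].subtract(set(tokens))
        self.words[wrong] = +self.words[wrong]  # drop zero or negative counts
        self.msgs[wrong] -= 1
        self.register(tokens, right)


lists = WordLists()
message = ["free", "offer", "click"]
lists.register(message, "ham")          # a false negative slipped through
lists.correct(message, "ham", "spam")   # the manual correcting entry
print(lists.msgs)
```

The risk with this approach is that an uncorrected mistake reinforces itself, which is why the occasional correcting entry matters.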