Seven Spam Filters Compared
Goo.cc writes "Those wondering how their spam filtering software performs in comparison to others may want to read this article on Freshmeat, where Sam Holden performs comparative testing of various popular e-mail filters. The filters tested include Bayesian Mail Filter, Bogofilter, dbacl, Quick Spam Filter, SpamAssassin, SpamProbe, and SPASTIC."
Also, what's with keeping the spam threshold score secret?
They started off by quoting John Graham-Cumming, yet they didn't include his brainchild PopFile.
Check it out Here.
I use SpamAssassin with the flag threshold set at 5, the default. I have procmail send any message from 5-10 into a spam mailbox which I clean out occasionally, and messages at 10+ straight to /dev/null (after a couple of months of also keeping those in the spam mailbox).
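The routing described above can be sketched in a few lines. This is only an illustration of the thresholds, not the actual procmail recipe: the destination names are hypothetical, and treating a score of exactly 10 as "discard" is an assumption.

```python
def route(score, flag_at=5.0, discard_at=10.0):
    """Return the destination for a message with the given spam score.

    flag_at and discard_at mirror the 5 and 10 thresholds mentioned in the
    comment; the destination labels are made up for illustration.
    """
    if score >= discard_at:
        return "/dev/null"      # assumed: a score of exactly 10 is discarded
    if score >= flag_at:
        return "spam-mailbox"   # kept for occasional manual review
    return "inbox"


for s in (-2.3, 4.9, 5.0, 9.9, 15.2):
    print(s, "->", route(s))
```

Keeping a review band between the two thresholds is what makes it safe to discard the high scorers outright.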
Having a properly trained Bayes database makes a huge difference, not just for flagging spam but for not flagging mail. This is because messages which get a low Bayes probability receive a negative score (from the Bayes test, which offsets any heuristic tests that the message may happen to trip). I now find that nearly all legitimate mail comes in below zero, and nearly all spam comes in above 15. I have never once seen a false positive - either in my testing period, or since I started trashing spam (I occasionally look through the procmail log just to make sure). I see a false negative once every couple of weeks, which is just fine (it's remarkable how inoffensive spam becomes when it's an occasional thing ;).
So yes, now that you've trained it, you should be able to move the threshold again (I assume by "lowered" you actually mean you raised it, i.e. had it flag messages as spam only when they scored 7.0 or higher).
Oh yeah, just two problems:
1: If I sent someone a mail and got a request to first prove myself, I'd just write that person off.
2: Just wait for a spammer to fake your address in a spam to another person using that software; you get a nice ping-pong game.
Simple!
An interesting thread here about how TMDA, a C/R filter, used in conjunction with SpamAssassin, can provide the best of both worlds. While TMDA is by itself effective, there seem to be some humanistic issues involving the assumption that all e-mailers are spammers unless they prove otherwise. The thread explains how Bayesian filtering can be improved by using a decent C/R filter like TMDA without alienating people that send legitimate e-mail.
Personally, I figure anyone thin-skinned enough to be insulted by my C/R filter probably isn't worth talking to anyways, but I digress...
I use SpamBayes (free) with Outlook on my W2K machine. I trained it with over 400 SPAM and over 1000 non-SPAM emails. I get about 45 SPAM each day and my ISP, attglobal, filters out about 40 of them. The SPAM that get to my mailbox are the ones that pass through the attglobal filter, and that filter has NEVER given me a false positive in more than 2000 SPAM. Those SPAM are put in a special folder on the server for inspection, but I now just delete them en masse every week or so.
That means that SpamBayes is filtering only the hardest emails to classify and so far it has only given me one false positive. I got one false negative after training it for the first time. SpamBayes also has a folder for messages that it is not sure of and so far they have all been SPAM. I seldom have to do more than inspect the sender and subject to confirm that they are SPAM.
Each time a message is automatically moved to the SPAM folder (or moved back to the Incoming folder) the training set is adjusted for that email so I don't have to re-train.
To sum up, I'm really impressed by well-designed Bayesian filters and this one in particular. I think it's worthwhile to take the time to build up a corpus of SPAM and "good" messages, as I can then evaluate competing filters.
Nate
"this will take exactly as much effort as it would have to just check the e-mail when it first came in"
Not so. It's much easier to manually filter when you have a good idea what to expect. Since the content of the probable-spam mailbox is, er, probably spam, going through it is vastly quicker and more reliable than trying to sift out the randomly distributed real mail from a single unfiltered mailbox. Likewise, the few false negatives in one's inbox stick out much sorer when most spam has been diverted. Doing a best-automated-guess sort into separate piles beforehand really capitalises on the way a human brain distinguishes items in a set. The relative distribution within a pile is important.
Your earlier points are interesting, though.
since the filters do better after being trained with lots of spam, anyone think of gathering up a huge collection of spam to give to other people? i mean exporting a corpus of spam from outlook, sticking it up for download somewhere, and letting other people import it into a spam folder. then other people could run their filter of choice and train it!
you could even make it all official-like, and somehow guarantee that the spam that's up for downloading is "official" and "virus-free" and "safe for your computer." you know, do geek stuff like check hashes or whatever it takes to verify that the spam collection is legit. whatever it takes to ensure that someone else hasn't filled it with a ton of virus/trojan/etc. attachments. or whatever. i dunno. you know, somehow guarantee it's safe.
imagine it! download spambayes, get spambayes to connect to the official spambayes spamcorpus server, and download the latest 2000 spams! instant training.
anyway. just an idea. mod me down as -1, herrd0kt0r. 8P
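The "check hashes" step in the idea above could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, SHA-256 is just one reasonable choice of digest, and no such official corpus server is known to exist.

```python
import hashlib


def sha256_of(raw: bytes) -> str:
    """Hex digest of one raw message."""
    return hashlib.sha256(raw).hexdigest()


def unverified(messages, published_digests):
    """Return the indices of messages whose digest is not in the published list."""
    return [i for i, raw in enumerate(messages)
            if sha256_of(raw) not in published_digests]


# Hypothetical corpus: the publisher lists digests, the downloader re-checks them.
corpus = [b"Subject: cheap pills\n\nbuy now", b"Subject: you won\n\nclaim prize"]
published = {sha256_of(m) for m in corpus}

tampered = [corpus[0], corpus[1] + b"\nX-Extra: injected attachment"]
print(unverified(corpus, published))    # nothing flagged
print(unverified(tampered, published))  # the altered message is flagged
```

Note that a matching hash only shows the download is what the publisher listed; it says nothing about whether the publisher's copy was safe in the first place.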
I use PopFile as well and am equally satisfied. I make sure to reclassify all false negatives and positives. Accuracy is at 97.65%; of the 5,432 mails I've received since I installed it, 2,802 were spam.
When me and my friend had a site featured on Yahoo, USA Today, NYT, etc. the spam just went THROUGH THE ROOF. But, thanks to PopFile I didn't have to see any of it.
I'm not disagreeing with the posters who stated that he has a low sample size. It might be one of the reasons why he doesn't have a higher catch or recall rate.
The main problem I see with Bayesian filters is that they are complicated and nontrivial to set up. I've been playing with Bogofilter for several months. And even with corpora of fewer than 1000 messages, I get a very high catch rate (somewhere above 90%, though I don't have exact numbers).
The method that I've employed is to start with small ham and spam corpora of three hundred or so messages, then to train on error over time. It's a pain in the ass because I still have to continually inspect the results and tweak the databases.
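For anyone unfamiliar with "train on error", here is a minimal sketch of the loop. The TinyFilter class is a toy word-count classifier invented for illustration; it is not how Bogofilter scores mail, and the sample messages are made up.

```python
from collections import Counter


class TinyFilter:
    """Toy word-count classifier standing in for a real Bayesian filter."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        self.counts[label].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        spam_hits = sum(self.counts["spam"][w] for w in words)
        ham_hits = sum(self.counts["ham"][w] for w in words)
        return "spam" if spam_hits > ham_hits else "ham"


def train_on_error(flt, labelled_messages):
    """Classify each message and train only on the ones the filter got wrong."""
    corrections = 0
    for text, label in labelled_messages:
        if flt.classify(text) != label:
            flt.train(text, label)
            corrections += 1
    return corrections


flt = TinyFilter()
flt.train("cheap pills online", "spam")        # small starting corpus
flt.train("meeting agenda attached", "ham")

stream = [
    ("cheap watches online", "spam"),
    ("lunch meeting tomorrow", "ham"),
    ("refinance your mortgage", "spam"),
]
print(train_on_error(flt, stream), "correction(s) on the first pass")
```

The tedious part the comment mentions is the labelling: someone still has to inspect the results to know which classifications were errors.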
In addition to that, there are at least half a dozen parameters that contribute to the success or error rates. So much so that bogofilter actually comes with bogotune, which analyzes the corpora to suggest optimal parameters.
So give the guy a break. I wouldn't say his results are robust enough for an academic publication, but it isn't worthless. It's interesting enough for a read. It's more work than many of us are willing to do.
Also an interesting read is Comparing Bayes Chain Rule with Fisher's Method for Combining Probabilities.
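As a rough illustration of the two approaches named in that title, the sketch below combines per-word spam probabilities both ways. It is an assumption-laden sketch, not the linked article's exact formulation: the Fisher variant shown is the two-sided chi-square combination commonly described for spam filters, and the input probabilities are made up. All inputs must lie strictly between 0 and 1.

```python
import math


def chain_rule(probs):
    """Naive Bayes chain rule: prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def chi2_sf_even(x, dof):
    """Chi-square survival function, valid for an even number of degrees of freedom."""
    m = x / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher(probs):
    """Fisher's method applied in both directions, folded into one score in [0, 1]."""
    n = len(probs)
    s = 1.0 - chi2_sf_even(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2_sf_even(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


for probs in ([0.9, 0.9, 0.9], [0.5, 0.5], [0.99, 0.01]):
    print(probs, round(chain_rule(probs), 4), round(fisher(probs), 4))
```

With a few moderately spammy words the chain rule jumps closer to 1 than the Fisher score does, which is the kind of difference such comparisons tend to focus on.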
Seems to me like it isn't an artificial constraint, but merely a practical one. It sounds like he scripted the programs to run through his data all at once, so querying the online resources a thousand times an hour would not be feasible. The Bayesian filters were at a similar disadvantage because of the automated testing: normally, each false negative gets added to the spam corpus, which would have improved their accuracy over time.
I use bogofilter, and it seems to me it would take far too much of my time to manually feed my own spam to it for training purposes. What I do instead is this:
/home/bogofilter/spamlist.db
We have several spamtrap addresses on our sendmail server. They were not intentionally set up as spamtraps, but in looking at my mail logs I noticed that there were many email addresses receiving spam attempts that are not and never were valid addresses on our system. These invalid addresses somehow got into spammers' email databases and they receive nothing but spam.
So I set up entries in my aliases file to automatically redirect all mail for these accounts to bogofilter's spam database. Here is a sample...
nikola: "|/usr/local/bin/bogofilter -s "
cal: "|/usr/local/bin/bogofilter -s "
bwilson: "|/usr/local/bin/bogofilter -s "
fayre: "|/usr/local/bin/bogofilter -s "
(If you are also using sendmail's access.db to filter mail based on the source IP address, you may want to set up the spamtrap addresses as "spam friends" so that spam directed to them is not filtered out by your IP address filters.)
To keep the spam database fresh and to keep it from growing to an excessive size, I use a daily cron job that automatically deletes spam entries older than 30 days...
# remove records older than 30 days from spamlist.db
/usr/local/bin/bogoutil -a30 -m
This gives me an 8 Megabyte spamlist.db with about 14,000 emails in it which is constantly refreshed to keep up with the latest spam trends.
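The effect of that daily age-out can be pictured with a small sketch. This does not model bogoutil's database format or its exact age rules; the token-to-last-seen-date mapping and the sample tokens are assumptions for illustration only.

```python
from datetime import date, timedelta


def prune(entries, today, max_age_days=30):
    """Keep only entries last seen within max_age_days of today."""
    cutoff = today - timedelta(days=max_age_days)
    return {token: seen for token, seen in entries.items() if seen >= cutoff}


# Hypothetical database: token -> date it was last seen in a spam message.
today = date(2003, 6, 30)
db = {
    "mortgage": date(2003, 6, 29),
    "viagra": date(2003, 6, 1),
    "herbal": date(2003, 4, 15),
}
print(sorted(prune(db, today)))
```

Dropping stale entries keeps the database bounded and biased toward whatever the spamtraps have seen recently.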
Maintaining the non-spam database isn't quite as easy. I use bogofilter's -u option on my own incoming email, which tells Bogofilter to update its databases with my incoming mail based on its classification of the message as spam or non-spam. I never get a false positive, but I do occasionally get a false negative which requires me to make a correcting entry in the database.