Comparison of Bayesian POP3 Spam Filters
kreide writes "Spam e-mail has become an ever-increasing problem, and these days it is next to impossible to use e-mail without receiving it in large amounts. Although various techniques exist to combat the problem, spammers seemed to be winning the war - until a new, powerful weapon appeared on the scene: Bayesian filters, our last, best hope for spam-free inboxes. In this review I compare POP3-based Bayesian spam filters." We did an Ask Slashdot on this a few weeks ago.
I would have liked to see how my favorite Bayesian spam filter, K9, would have fared in your comparison, but it failed to meet your first requirement of being cross-platform. It's freeware written in C, is about a 60KB-100KB download depending on whether you get it with the self-installer, is easy to use, and has a very small memory footprint. Before today it had sorted my email with over 99.8% accuracy, excluding the first couple of days of training, and after only a couple of weeks of use, though now it's down to 99.7%.
I have used PopFile in the past on both Windows and Linux, but found K9 to be better suited for environments where Windows is an option. It's very easy to use, having a windowed interface, and it seemed to learn much faster than PopFile did.
I haven't used SpamBayes. I'll have to give it a shot.
The article didn't mention SpamProbe. It is what I use, and it has worked quite well for the past month or so that I've been using it. Perhaps this is just because the author didn't test this spam filter yet, but I like it quite a lot with my current mutt/procmail setup. Take this for what it's worth.
- I love animals. I try to eat at least one a day.
I don't disagree. I think that eventually we should move to a better email model - something like TMDA perhaps, where there is no guarantee that spammers can reach mailboxes. Or better legislation to make spamming punishable, controls on mail routers for million-message mailouts, etc. Or djb's Internet Mail 2000, which moves the onus onto the sender's network to store all 1m messages at a time, until people pick them up.
The other thing you can do is impose a microcost for mailing - at 1c/mail, spamming isn't economical any more. But then that is going to penalise the people who have legitimate reasons to send a million emails at a time - you'd have to have a very good micropayment system working on the Internet to do this.
However, those things need widespread change, and they need people in positions of power. Joe User at home can push for it, but they still get spam and they still want a short-term solution. I suggest that even if they're filtering, the action of having to check their spam filter will make them irate enough. I see it as being like IPv6 - everyone would really have to change at once for the system to be most effective. (I use Freenet6, do you?)
Now that viruses are public, caught quickly, and Microsoft are being a lot less lax with security (I am in no way commending their effort, but they at least mostly fixed the Outlooks), you don't see people writing them nearly as often. I feel spam will go the same way.
As far as I know, many of those filters are based on a decision rule of the form
P(mail is spam | words X, Y, Z, ... are in it) > 1-epsilon
The computation is then done using Bayes' rule (P(A|B)=P(B|A)*P(A)/P(B)) under certain independence assumptions which make it tractable.
So this is actually Bayesian filtering.
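For the curious, here is a minimal sketch of that decision rule in Python. The function names, the per-word probability tables, and the default prior of 0.5 are all assumptions made for illustration; I am not claiming any of the filters discussed here works exactly this way. It assumes the words are independent given the class, which is what makes the product tractable.

```python
import math

def spam_probability(words, p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """P(mail is spam | words) via Bayes' rule, assuming the words are
    conditionally independent given the class (spam or not spam)."""
    # Work in log space so that long mails do not underflow to zero.
    log_spam = math.log(p_spam)
    log_ham = math.log(1.0 - p_spam)
    for word in words:
        # Assumption: words missing from either table are simply ignored.
        if word in p_word_given_spam and word in p_word_given_ham:
            log_spam += math.log(p_word_given_spam[word])
            log_ham += math.log(p_word_given_ham[word])
    # P(spam | words) = a / (a + b); subtract the max for numerical stability.
    m = max(log_spam, log_ham)
    a = math.exp(log_spam - m)
    b = math.exp(log_ham - m)
    return a / (a + b)

def is_spam(words, p_word_given_spam, p_word_given_ham, epsilon=0.01, p_spam=0.5):
    """Decision rule: flag the mail when P(spam | words) > 1 - epsilon."""
    p = spam_probability(words, p_word_given_spam, p_word_given_ham, p_spam)
    return p > 1.0 - epsilon
```

A smaller epsilon means fewer legitimate mails are flagged, at the price of letting more spam through. A real filter would estimate the two probability tables from mail the user has already sorted.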
My favorite filter is spamoracle
No it's not.
I get spam at the rate of 1 spam mail per 6 months or so. Or maybe even less. I can't remember getting a single spam email on my actual email address for about a year.
If you have an account on a crapless domain (i.e. not hotmail.com, msn.com, aol.com and the like),
it all comes down to this very simple rule:
Do not, under any circumstance, have your email address posted publicly accessible ANYWHERE on the web.
It WILL get trawled. And then it will be spammed relentlessly.
If you have an existing address you don't want to give up, or an address at hotmail.com or a similar place, dump it.
Then exercise a bit of common sense about where you use your actual address.
I have a domain which catches email to unknown addresses and puts it in my regular mailbox.
Whenever I have to give an email address to some place on the web, I use *domain-i-am-currently-visiting*@mydomain.com. So if I am visiting foobar.com, I would put in foobar.com@mydomain.com.
I have been doing this for years. It enables me to see what was the source of the leak when I get spam on one of the addresses.
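The scheme is simple enough to sketch in a few lines of Python. The helper names here are hypothetical, and mydomain.com is just the placeholder from above; this only shows the bookkeeping, not an actual mail server setup.

```python
def address_for(site, my_domain="mydomain.com"):
    """Build the per-site address to hand out, e.g. when visiting foobar.com."""
    return f"{site.lower()}@{my_domain}"

def leak_source(recipient, my_domain="mydomain.com"):
    """Given the address a spam arrived on, return the site it was given to,
    or None if the address is not on the catch-all domain."""
    local, sep, domain = recipient.rpartition("@")
    if not sep or domain.lower() != my_domain.lower():
        return None
    return local.lower()

# Usage: the address handed to foobar.com identifies foobar.com if it ever leaks.
handed_out = address_for("foobar.com")
print(handed_out, "->", leak_source(handed_out))
```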
It has taught me one thing: I have never, ever, ever, in all my years of online shopping, forum posting, etc., come across a single website that has ignored its own privacy statement. Ever. Even the slightly sketchy sites (like divx subtitle sites) don't leak addresses.
I was surprised to realize this.
The only addresses I ever get spam on are the ones I know to be publicly displayed on the web.
So it's that easy to avoid spam.
Give me liberty or give me kill -s 9