Comparison of Bayesian POP3 Spam Filters

← Back to Stories (view on slashdot.org)

Comparison of Bayesian POP3 Spam Filters

Posted by michael on Sunday August 10, 2003 @08:02PM from the spam-i-am dept.

kreide writes "Spam e-mail has become an ever increasing problem, and these days it is next to impossible to use e-mail without receiving it in large amounts. Although various techniques exits to combat the problem, spammers seemed to be winning the war - until a new, powerful weapon appeared on the scene: Bayesian filters, our last, best hope for spam-free inboxes. In this review I compare POP3 based bayesian spam filters." We did an Ask Slashdot on this a few weeks ago.

6 of 326 comments (clear)

Min score:

Reason:

Sort:

Other filters by dtfinch · 2003-08-10 20:13 · Score: 4, Informative

I would have liked to see how my favorite bayesian spam filter, K9, would have faired in your comparison, but it failed to meet your first requirement of being cross platform. It's freeware written in C, is about a 60kb-100kb download, depending on if you get it with the self installer, is easy to use, and has a very small memory footprint. Before today it had sorted my email with over 99.8% accuracy, excluding the first couple days of training, and after only a couple weeks of use, though now it's down to 99.7%.

I have used PopFile in the past on both Windows and Linux, but found K9 to be better suited for environments where Windows is an option. It's very easy to use, having a windowed interface, and it seemed to learn much faster than PopFile did.

I haven't used SpamBayes. I'll have to give it a shot.
Spamprobe by 1029 · 2003-08-10 20:14 · Score: 5, Informative

The article didn't mention SpamProbe. It is what I use, and it has worked quite well for the past month or so that I've been using it. Perhaps this is just because the author didn't test this spam filter yet, but I like it quite a lot with my current mutt/procmail setup. Take this for what it's worth.

--
- I love animals. I try to eat at least one a day.
Re:You really just don't get it by Plug · 2003-08-10 20:52 · Score: 4, Informative

I don't disagree. I think that eventually we should move to a better email model - something like TMDA perhaps, where there is no guarantee that spammers can reach mailboxes. Or better legislation to make spamming punishable, controls on mail routers on million message mailouts, etc. Or djb's Internet Mail 2000, which moves the onus onto the senders network to store all 1m messages at a time, until people pick them up.

The other thing you can do is impose a microcost for mailing - at 1c/mail, spamming isn't economical any more. But then that is going to penalise the people who have legitimate reasons to send a million emails at a time - you'd have to have a very good micropayment system working on the Internet to do this.

However, those things need widespread change, and they need people in positions of power. Joe User at home can push for it, but they still get spam and they still want a short term solution. I suggest that even if they're filtering, the action of having to check their spam filter will make them irate enough. I see it as being like IPV6 - everyone would really have to change at once for the system to be most effective. (I use Freenet6, do you?)

Now that viruses are public, caught quickly, and Microsoft are being a lot less lax with security (I am in no way commending their effort, but they at least mostly fixed the Outlooks), you don't see people writing them nearly as often. I feel spam will get the same.
Re:"Bayesian" by file-exists-p · 2003-08-10 21:58 · Score: 5, Informative

As far as I know, many of those filters are based on a decision rule of the form

P(mail is spam | words X, Y, Z, ... are in it) > 1-epsilon

The computation is then done using Bayse's rule (P(A|B)=P(B|A)*P(A)/P(B)) under certain independance assumption which makes it tractable.

So this is actually bayesian filtering ...

My favorite filter is spamoracle
Re:Nitpick... by spongman · 2003-08-10 23:01 · Score: 4, Informative

Here's a bit from the excellent SpamBayes background page:
A remarkable property of chi-combining is that people have generally been sympathetic to its "Unsure" ratings: people usually agree that messages classed Unsure really are hard to categorize. For example, commercial HTML email from a company you do business with is quite likely to score as Unsure the first time the system sees such a message from a particular company. Spam and commercial email both use the language and devices of advertising heavily, so it's hard to tell them apart. Training quickly teaches the system all sorts of things about the commercial email you want, though, ranging from which company sent it and how they addressed you, to the kinds of products and services it's offering.
It's virtually impossible to not get spam? by setien · 2003-08-11 01:24 · Score: 5, Informative

No it's not.
I get spam at the rate of 1 spam mail per 6 months or so. Or maybe even less. I can't remember getting a single spam email on my actual email address for about a year.

If you have an account on a crapless domain (i.e. not hotmail.com, msn.com, aol.com and the likes),
it all comes down to this very simple rule:
Do not, under any circumstance, have your email address posted publicly accessible ANYWHERE on the web.
It WILL get trawled. And then it will be spammed relentlessly.

If you have an existing address you don't want to give up, or an address at hotmail.com or a similar place, dump it.
Then exercise a bit of common sense about where you use your actual address.

I have a domain which catches email to unknown addresses and put them in my regular mailbox.
Whenever I have to give an email address to some place on the web, I use *domain-i-am-currently-visiting*@mydomain.com. So if I am visiting foobar.com, I would put in foorbar.com@mydomain.com.
I have been doing this for years. It enables me to see what was the source of the leak when I get spam on one of the addresses.
It has taught me one thing: I have never, ever, ever, in all my years of online shopping, forum posting etc, come across a single website that have ignored their own privacy statement. Ever. Even the slightly sketchy sites (like divx subtitle sites) don't leak addresses.
I was surprised to realize this.

The only addresses I ever get spam on are the ones I know to be publicly displayed on the web.

So it's that easy to avoid spam.

--
Give me liberty or give me kill -s 9