Sorting the Spam from the Ham

← Back to Stories (view on slashdot.org)

Sorting the Spam from the Ham

Posted by CmdrTaco on Thursday June 26, 2003 @05:27AM from the for-the-beginners dept.

MrClever writes "The Sydney Morning Herald (Aust) is running an article about the merits of Bayesian filtering and a good plain-english description of how it works. Might be handy if you need to explain it to non-technophiles. The main thing that may be useful is a Bayesian spam filter written to drop straight into Outlook 2k/XP available here and written in Python by Mark Hammond." Math buffs might enjoy reading these pages or browsing this writeup and its many links.

3 of 249 comments (clear)

Min score:

Reason:

Sort:

Spambayes by Chromodromic · 2003-06-26 05:35 · Score: 5, Informative

I use Spambayes with Outlook 2000, and it takes a little tweaking, but it works as advertised. Ahhh, the magic of mathematics. Just now, brought up Outlook, checked my mail and three little messages offering a free Sony headset, 70% off cell accessories, and a chance to take an IQ test just got tossed into my spam folder. Thanks anyway, but I think that means I just passed my IQ test.

Every so often I go in and take out some old, old spam, just to make sure my current preferences are being represented and that's all the maintenance that's required.

This is, however, the second time I've trained the filter. The first time, it incorrectly identified my FreeBSD status mails as spam, and from then on was throwing those into the Spam folder. My own fault, though, since I hadn't included any of these messages in my representative ham.

If you run Outlook, download this filter and use it. You'll be doing yourself, and a world that doesn't need fat-injected, herbally enhanced penises, a favor.

--
Chr0m0Dr0m!C
SpamBayes not Marc Hammond's work only by mpieters · 2003-06-26 06:13 · Score: 5, Informative
SpamBayes was originally conceived by Tim Peters and co at Python Labs, who improved on the orginal algorithm considerably. From there on out, many people helped tune and perfect the implementation, making it the most effective Baysian-based spam filtering tool currently available (IMNSHO).
Mark Hammond then wrote the Outlook plugin, which, admittedly, is considerably more code than SpamBayes, but not SpamBayes itself.
For the complete background on why SpamBayes is so good at what it does, and it's history, see:
- SpamBayes Background
Marc's is not the only application frontend for SpamBayes, here is a list of others:
- SpamBayes Applications
No apologies for this my pedantry offered.
--
"The truth shall make ye fret" -- The Truth, Terry Pratchett
Re:What I want by leshert · 2003-06-26 06:53 · Score: 5, Informative

Spamassassin learns in two ways:
1. Manual training: there is a tool called 'sa-learn'. You can pipe a message to it, or point it to a mailbox, and specify whether the mail is spam or ham.
2. Automatic training: if the score of the mail is significantly low (definitely spam) or significantly high (definitely ham), it will automatically train on the message. This may seem useless, but it's useful in that SA will then start to figure out patterns in spam or ham that don't trigger its rules.

I read mail with Mutt, and I've remapped the 'd'elete key to instead throw the message into a 'ham' mbox, and added a 'S'pam mapping to throw the message into a 'spam' mbox. Then I added a nightly cron job to run sa-learn over the two mboxes and truncate them. This has worked very, very well for me... In I haven't had a single false positive since Bayes kicked in about two months ago, and I got my first false negative in about two weeks today. I typically trap 10-15 spams a day.

One thing to notice: even if you enable it, Bayesian filtering won't kick in until you've recognized at least 200 spam and 200 ham messages. Took me a long time to figure that out (I had plenty of spam, but I wasn't training it on ham at all, which is why I started remapping the mutt commands).

As far as installing it on a server, your users don't have to be able to read each others' mail. I have it installed so that my wife and I each have our own bayes dbs, so neither of us has to read each others' mail. Plus, different users will regard different mail as spam: anything about the Pittsburgh Steelers going to my mailbox is probably spam, but not hers; similarly, anything regarding Linux going to her mailbox is probably spam, but not mine.