Bayesian Filtering For Dummies

← Back to Stories (view on slashdot.org)

Bayesian Filtering For Dummies

Posted by timothy on Monday May 26, 2003 @09:27AM from the to-increase-the-peace dept.

Dynamoo writes "Bayesian filtering for spam is awfully clever stuff, touched on by Slashdot several times before. There's a very accessible article at BBC News explaining in fairly simple terms the drawbacks of current keyword-based filtering. It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name. Those Vikings have a lot to answer for."

19 of 281 comments (clear)

Min score:

Reason:

Sort:

A bit of info on Bayesian filtering by jat850 · 2003-05-26 09:32 · Score: 5, Informative

The BBC article mentions Paul Graham, and I found his page (and some more information on Bayesian networks for spam filtering) here:

Paul Graham's spam page

He talks a little bit more about the technical aspects there.

--
the blood has stopped pumping, and he's left to decay
the me that you know is now made up of wires
1. Re:A bit of info on Bayesian filtering by Rosco+P.+Coltrane · 2003-05-26 10:05 · Score: 2, Informative
  
  From Paul Graham's page :
  
  A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.
  
  Tough filter for users of dating services who happen to be gynecologists ...
  
  --
  "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
Speaking of dummies... by Anonymous Coward · 2003-05-26 09:34 · Score: 5, Informative

Someone needs to learn the meaning of "ironic". (Hint: it doesn't mean "weird coincidence".)

Paul
Re:Origin of SPAM by jat850 · 2003-05-26 09:36 · Score: 5, Informative

Good question ... through Google Groups I found this page.

--
the blood has stopped pumping, and he's left to decay
the me that you know is now made up of wires
Wrong pic... by Mondoz · 2003-05-26 09:39 · Score: 4, Informative

It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name.
Why then, does the article show a pic from a Monty Python animation about the black spot who goes to seek his fortune...
You'd think they'd use the actual pic of the skit with the Vikings in the cafe...

--
/sig
Re:who're the vikings? by Evil-G · 2003-05-26 09:39 · Score: 5, Informative

A group of vikings in a monty python sketch drowned out normal conversation by shouting the word "spam" louder and louder. The word was then adopted for all the crap drowning out normal conversation on usenet.
Re:Spam = /dev/null by GammaTau · 2003-05-26 09:45 · Score: 4, Informative

Bayesian filtering could stop all the spam that easily? This is great! Where can I download a filter like this?

You can try bogofilter, ifile, SpamBayes, or POPFile. The newer versions of SpamAssassin also implement some kind of Bayesian filtering.
Re:Apple's Mail app... by Anonymous Coward · 2003-05-26 10:05 · Score: 5, Informative

Actually, the latent semantic analysis (LSA) that Apple uses is not a form of Bayesian reasoning; it uses a singular value decomposition (SVD) to perform generalized factor analysis. However, there is a probabilistic version of LSA out there.
Brief Tech Notes on Bayesian Filtering by robbyjo · 2003-05-26 10:06 · Score: 5, Informative

Well, the type of Bayesian learning used in this spam filtering is called "Naive Bayesian" and the engine is trained using "supervised learning" technique. Naive Bayes has been proven very successful for text categorization. Spam filtering is even more successful because we essentially categorize e-mails to two labels: "spam" or "not spam".

Supervised learning basically works like this. Feed the engine with multiple examples (in this case, e-mails) with labels (in this case, "spam" or "not spam"). The training usually takes thousands of examples to get good enough accuracy. And take note that we need both "spam" and "not spam" examples to enable the learning engine to distinguish them.

How Naive Bayes works? Well, think of the full Bayesian Network. Bayes net is basically a causal-effect graph with annotated Conditional Probability Table (CPT) on each node denoting the probabilities of possible values. Full Bayes Net takes Directed Acyclic Graph (DAG), but Naive Bayes takes a form of tree instead due to some "naive" assumptions. (Okay, I handwaved a whole lot of details here) And in Learning Naive Bayes, we basically try to construct the tree out of the examples.

Let P(spam) be the percentage of training e-mails that is labelled as "spam" and P(not spam) be the percentage of "not spam" e-mails.

First, let the filter reads all e-mails and collect the words out of them. Weed out duplicates and stop words (common words like "I", "you", "the", etc). Let NumVocab be the number of words after weeding.

Second, process e-mail one by one. Do weeding phase like the above. Let "n" be the number of words on that particular e-mail after the weeding. Scan the word one by one. Let "w" be the current word scanned and "nw" be the number of times word "w" occur in that e-mail. Imagine you have a big two dimensional array to store the result (let's call the array "P"). If the e-mail is labeled "spam", then store (nw+1)/(n+NumVocab) to P[w][spam].

Repeat until all training e-mails are read.

And here comes the testing phase...

When you encounter an e-mail and want to classify whether it's spam or not, you'll need to look up the array P you created earlier. First, you do the weeding phase and scan the word one by one. The algo is like this:

pspam = P(spam); pnospam = P(not spam); foreach unique words w in e-mail do pspam = pspam * P[w][spam]; pnospam = pnospam * P[w][nospam]; endfor if (pspam > pnospam) then return IS_SPAM; else return IS_NO_SPAM;

Hope this helps.

--

--
Error 500: Internal sig error
Re:Evolution and by C3ntaur · 2003-05-26 10:23 · Score: 3, Informative

Yes, I've done it and here's how:

1. Get and install bogofilter.

2. Make a shell wrapper script that runs bogofilter in passthrough mode, redirect stdout and stderr to files in /tmp for debugging and training bogofilter. Here's mine:

#!/bin/bash /usr/bin/bogofilter -p -u > /tmp/bogo.out 2> /tmp/bogo.err
status=$?
exit $status

3. Make a new local mail folder in evolution to collect spam.

4. Make a filter in evolution that runs the wrapper script. Tools->Filters, choose Incoming, choose Add. Add a criterion that looks like this:
Pipe message to shell command, (path to your wrapper script), returns, 0. Add an action to move the message to your local spam filter.

5. Be sure evolution is set to apply filters to new mail in the inbox(es) you want bogofilter to act on. Tools->Settings, choose Mail Accounts, choose desired inbox(es), choose edit, choose the Receiving Options tab, check the Apply filters to new messages in INBOX on this server.

Please be sure to RTF bogofilter M. You will need to train (and retrain) bogofilter with spam and non-spam samples over time. The switches to do this have CHANGED from version to version. If you have set things up as above, you can use /tmp/bogo.out to retrain bogofilter for the last message processed when necessary.

Good luck, and happy spam filtering!

--
Loading...
Re:Origin of SPAM by Anne+Thwacks · 2003-05-26 10:26 · Score: 4, Informative

While the Monty Python sketch may have inspired the use of the term, the Monty Python usage was in fact a rehash of a sketch by Peter Sellers, dating back to the 1950's which referred to the wartime situation where Cafe's often had fancy things on the Menu, but when you came to order, the item in question was not available.
The sketch is to be found on the album "The Bset of Sellers" - probably released in about 1958, and which also features the nursery rhyme
"Up on the chair behind the door,
hey diddle, diddle,
Hear comes Poppa
so up with the chopper
and split 'im down the middle
And "Balham, gateway to the South" a spoof of the travalogue films that often apepared in the cenema at the time.

--
Sent from my ASR33 using ASCII
Comment removed by account_deleted · 2003-05-26 10:28 · Score: 2, Informative

Comment removed based on user account deletion
naive implementation of naive Bayesian by g4dget · 2003-05-26 10:58 · Score: 3, Informative

Graham's method is called "naive Bayesian", and it's called "naive" for a reason. It works surprisingly well, but it barely scratches the surface of what people are doing with statistical models of text.

The lack of references on Graham's web site to prior work on text classification makes one wonder whether he just is unfamiliar with a huge body of literature going back decades or whether he just deliberately ignores them. Either way, Graham didn't invent any of the techniques and they are far from state-of-the-art. (Incidentally, you'll probably find Octave or Perl/PDL a more convenient language for implementing this stuff than Lisp.)

Anybody seriously interested in text filtering should at least do a little bit of background reading. "Readings in Information Retrieval" by Jones and Willett covers some of the basic papers.
Mozilla does this by Anonymous Coward · 2003-05-26 11:01 · Score: 3, Informative

Mozilla incorporates a twostep filter:

1. Is the sender in the address book? If yes, is not spam, otherwise:
2. Does the message have a probability of 90% that it is spam based on the Bayes filter? If so, flag as spam, otherwise not spam.
Re:who're the vikings? by mlk · 2003-05-26 11:52 · Score: 2, Informative

Not many people know were the term "spam" comes from, but everyone[1] knows what it means (email wise), and it does come from the Monty Python sketch.

However it did not orginally go with bulk email, but instead with some wanker on posting the same post over and over again on a newsgroup or IRC.

[1] Including "normal" users.

--
Wow, I should not post when knackered.
Re:Spam = /dev/null by sabaco · 2003-05-26 11:59 · Score: 2, Informative

Don't forget SpamProbe as well. I've been using it for a couple weeks, and it has been working very well for me. I've gotten around 1400 messages, and so far 1 false positive and 6 false negatives. I don't know how well the other filters work, but that seems pretty good to me. It's sure a hell of a lot better than the DNS blacklists I use. (I'm still using those. After all, they filter out the first 70% of my incoming mail and are probably faster anyway.)

--
This is SO educational! -- Kintaro Oe
Re:It's not bad... by pyr0 · 2003-05-26 13:21 · Score: 2, Informative

That's exactly how spamassassin works. My school has it set up on their exchange servers. The problem with that is...you still get the spam. So even if it does go in another folder, which in turn still wastes your bandwidth and time in deleting it, as far as the spammer knows it went through. The still gets paid for having sent out a successful spam. The only real advantage is being able to read your legit messages without sorting through the spam, and even that is starting to become an issue because it seems the spammers have become much smarter. Thus, I still get around 10 spams a day in my main inbox...although that is still an improvement from the actual 100 plus or minus a day I actually get. I still like the idea of those spam tarpits restricting the bandwidth from known spam domains. All email gets through eventually...which isn't a problem for most normal messages. It is a problem though when the spammers suddenly find their connection to remote mail servers throttled and can't send all they want to in as little time as it would normally take.

--
Project Steve
Re:Browser ad-blocking the same way? by bhtooefr · 2003-05-26 14:06 · Score: 2, Informative

Did you have Windows? If you did, it was probably WebWasher. It is free for home use. The download link is buried in the front page, so here's a direct link to the WebWasher Classic site: http://www.webwasher.com/client/home/index.html?la ng=de_EN
DON'T implement it like the parent by NoOneInParticular · 2003-05-26 23:28 · Score: 3, Informative

If you do it like the parent:
pspam = P(spam); pnospam = P(not spam); foreach unique words w in e-mail do pspam = pspam * P[w][spam]; pnospam = pnospam * P[w][nospam]; endfor if (pspam > pnospam) then return IS_SPAM; else return IS_NO_SPAM;

You'll soon be running out of bits to store the floating point results. Implement it by adding logarithms of probabilities instead of products of them, thus:
lpspam = log(P(spam)); lpnospam = log(P(not spam)); foreach unique words w in e-mail do lpspam = lpspam + log(P[w][spam]); lpnospam = lpnospam + log(P[w][nospam]); endfor if (lpspam > lpnospam) then return IS_SPAM; else return IS_NO_SPAM;

If you have a couple of hundred key-words, this will make a lot of difference concerning the accuracy of the predictions.