Slashdot Mirror


Bayesian Filtering For Dummies

Dynamoo writes "Bayesian filtering for spam is awfully clever stuff, touched on by Slashdot several times before. There's a very accessible article at BBC News explaining in fairly simple terms the drawbacks of current keyword-based filtering. It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name. Those Vikings have a lot to answer for."

6 of 281 comments (clear)

  1. A bit of info on Bayesian filtering by jat850 · · Score: 5, Informative

    The BBC article mentions Paul Graham, and I found his page (and some more information on Bayesian networks for spam filtering) here:

    Paul Graham's spam page

    He talks a little bit more about the technical aspects there.

    --
    the blood has stopped pumping, and he's left to decay
    the me that you know is now made up of wires
  2. Speaking of dummies... by Anonymous Coward · · Score: 5, Informative

    Someone needs to learn the meaning of "ironic". (Hint: it doesn't mean "weird coincidence".)

    Paul

  3. Re:Origin of SPAM by jat850 · · Score: 5, Informative

    Good question ... through Google Groups I found this page.

    --
    the blood has stopped pumping, and he's left to decay
    the me that you know is now made up of wires
  4. Re:who're the vikings? by Evil-G · · Score: 5, Informative

    A group of vikings in a monty python sketch drowned out normal conversation by shouting the word "spam" louder and louder. The word was then adopted for all the crap drowning out normal conversation on usenet.

  5. Re:Apple's Mail app... by Anonymous Coward · · Score: 5, Informative

    Actually, the latent semantic analysis (LSA) that Apple uses is not a form of Bayesian reasoning; it uses a singular value decomposition (SVD) to perform generalized factor analysis. However, there is a probabilistic version of LSA out there.

  6. Brief Tech Notes on Bayesian Filtering by robbyjo · · Score: 5, Informative

    Well, the type of Bayesian learning used in this spam filtering is called "Naive Bayesian" and the engine is trained using "supervised learning" technique. Naive Bayes has been proven very successful for text categorization. Spam filtering is even more successful because we essentially categorize e-mails to two labels: "spam" or "not spam".

    Supervised learning basically works like this. Feed the engine with multiple examples (in this case, e-mails) with labels (in this case, "spam" or "not spam"). The training usually takes thousands of examples to get good enough accuracy. And take note that we need both "spam" and "not spam" examples to enable the learning engine to distinguish them.

    How Naive Bayes works? Well, think of the full Bayesian Network. Bayes net is basically a causal-effect graph with annotated Conditional Probability Table (CPT) on each node denoting the probabilities of possible values. Full Bayes Net takes Directed Acyclic Graph (DAG), but Naive Bayes takes a form of tree instead due to some "naive" assumptions. (Okay, I handwaved a whole lot of details here) And in Learning Naive Bayes, we basically try to construct the tree out of the examples.

    Let P(spam) be the percentage of training e-mails that is labelled as "spam" and P(not spam) be the percentage of "not spam" e-mails.

    First, let the filter reads all e-mails and collect the words out of them. Weed out duplicates and stop words (common words like "I", "you", "the", etc). Let NumVocab be the number of words after weeding.

    Second, process e-mail one by one. Do weeding phase like the above. Let "n" be the number of words on that particular e-mail after the weeding. Scan the word one by one. Let "w" be the current word scanned and "nw" be the number of times word "w" occur in that e-mail. Imagine you have a big two dimensional array to store the result (let's call the array "P"). If the e-mail is labeled "spam", then store (nw+1)/(n+NumVocab) to P[w][spam].

    Repeat until all training e-mails are read.

    And here comes the testing phase...

    When you encounter an e-mail and want to classify whether it's spam or not, you'll need to look up the array P you created earlier. First, you do the weeding phase and scan the word one by one. The algo is like this:

    pspam = P(spam); pnospam = P(not spam);
    foreach unique words w in e-mail do
    pspam = pspam * P[w][spam];
    pnospam = pnospam * P[w][nospam];
    endfor

    if (pspam > pnospam) then return IS_SPAM; else
    return IS_NO_SPAM;

    Hope this helps.

    --

    --
    Error 500: Internal sig error