Slashdot Mirror


Bayesian Filtering For Dummies

Dynamoo writes "Bayesian filtering for spam is awfully clever stuff, touched on by Slashdot several times before. There's a very accessible article at BBC News explaining in fairly simple terms the drawbacks of current keyword-based filtering. It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name. Those Vikings have a lot to answer for."

24 of 281 comments

  1. Yes, we must filter out the dummies by Anonymous Coward · · Score: 5, Funny

    I suggest Slashdot immediately implement this "Bayesian Filter for Dummies" to remove most of the trolls, etc.

    1. Re:Yes, we must filter out the dummies by zoikes · · Score: 5, Interesting

      The moderation system (esp. in its current form - moderation by +karma /.ers) will always be better than automated filtering.

      The key problem is adaptation. Bayesian filtering is better than simple keyword filtering, but its performance will degrade over time unless its rules are continuously updated (via analysis of new data). And there's the problem that a troll in one story context may be an insightful comment in another.

      Moderation by humans adapts rapidly, accommodates a variety of contexts, and will reflect (and grow with) the overall /. "culture".

    2. Re:Yes, we must filter out the dummies by dJCL · · Score: 5, Interesting

      I've been using a Bayesian spam filter for months now and I understand how they work... Even though people find the comment funny, a Bayesian troll filter on Slashdot would work...

      If you were to run every Slashdot post through my mail filter as an e-mail message and properly mark the trolls and others you don't want, and the ones you do want, suddenly you would only get the actual good posts, and trolling would die quickly... And because of the user classification system currently in place, Slashdot has a huge db to build up the word stats, so it could happen immediately or faster...

      Seriously, I ask that the Slashdot admins consider adding this to slashcode... even if Slashdot does not use it, others would... there are too many trolls out there as it is on the net, and many people put them only a few rungs higher than spammers on the evolutionary ladder (but lower than an amoeba still)

      The logic behind this can actually be extended, to allow a user to start filtering stories so that they only get ones that interest them, or even to filter submissions to get rid of the cruft. How often do you think that the trolls post troll story submissions? Save work for the site admins...

      I'm curious if an extension of this idea is how Google News works... anyone know?

      Enjoy.

      --
      On Arrakis: early worm gets the bird. Magister mundi sum!
    3. Re:Yes, we must filter out the dummies by bluelan · · Score: 5, Insightful
      This wouldn't work.

      Bayesian filters for spam work because spam has a significantly different vocabulary distribution than useful e-mail. This is true because spam must deliver a commercial message and play on people's uncertainties.

      Good trolls, on the other hand, look ALMOST like insightful, well-written articles. The vocabulary distribution in good trolls is not significantly different from the vocabulary distribution of useful posts. So, Bayesian filters would be useless, unless you come up with some smarter characteristics on which to train the filter.

      You could easily develop a filter for ascii-art porno. But those are offtopic or flamebait, not trolls.

      --

      I used to be a narrator for bad mimes. (wright)

    4. Re:Yes, we must filter out the dummies by DeadSea · · Score: 4, Interesting
      Bayesian filters for email really only work because spammers can't see which messages you classify as spam. If you implemented a Bayesian filter for trolls on Slashdot, the trolls would see what words constitute a troll and stop using those words. They would stuff their messages with non-troll words, avoiding the Bayesian filter.

      The same thing would happen to your mail if the words in your Bayesian filter were the same as the words in everybody else's. Spammers would be able to see what makes an email seem spammy and they wouldn't do that. Bayesian filtering works for email right now because everybody's filters are a bit different. There is currently no magic bullet to get through everybody's spam filters. Also, spammers cannot see your filter, so they don't know if their message was filtered. If you opened your archive to me, I could quite easily craft a spam that would land square in your inbox.

  2. A bit of info on Bayesian filtering by jat850 · · Score: 5, Informative

    The BBC article mentions Paul Graham, and I found his page (and some more information on Bayesian networks for spam filtering) here:

    Paul Graham's spam page

    He talks a little bit more about the technical aspects there.

    --
    the blood has stopped pumping, and he's left to decay
    the me that you know is now made up of wires
    1. Re:A bit of info on Bayesian filtering by letxa2000 · · Score: 5, Insightful
      A gynecologist probably wouldn't have a corpus that indicates that "sex" is a .97 spam probability. That's the great thing about Bayesian: the spam probability for each word depends on the mail and spam YOU receive. It works dang well, just as Paul Graham claims. I'm averaging 99.7% accuracy this week, and the one spam that got through was written in German.

    2. Re:A bit of info on Bayesian filtering by GnuVince · · Score: 5, Insightful
      No, because if they have a lot of legitimate mails with words like "sex", "sexy", "penis", "vagina", "viagra", etc., the filter will adapt. That's the whole point. For PG, "sexy" is a sure sign of spam, but for a sexologist, it is not. You train the filter to recognize your spam. So if "sex" appears as often in your legitimate mail as in your spam, "sex" will not be considered a sign of spam.

      Bayesian filters adapt, that's why they work so well.
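
      To make the per-user point concrete, here is a minimal Python sketch. It is not Paul Graham's exact formula: it assumes equal weight for spam and legitimate mail and does no clamping, and the function name and the sample mailboxes are made up for illustration.

```python
def spam_probability(word, spam_mails, ham_mails):
    """Rough per-user estimate of how 'spammy' a word is.

    Compares the share of this user's spam containing the word with the
    share of this user's legitimate mail containing it (hypothetical helper).
    """
    spam_hits = sum(word in mail.lower().split() for mail in spam_mails)
    ham_hits = sum(word in mail.lower().split() for mail in ham_mails)
    spam_rate = spam_hits / len(spam_mails)
    ham_rate = ham_hits / len(ham_mails)
    if spam_rate + ham_rate == 0:
        return 0.5  # word never seen by this user: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Both users receive the same spam, but their legitimate mail differs.
spam = ["hot sex now", "cheap sex pills"]
programmer_ham = ["build failed again", "lunch at noon"]
sexologist_ham = ["sex therapy session notes", "patient sex education leaflet"]

print(spam_probability("sex", spam, programmer_ham))
print(spam_probability("sex", spam, sexologist_ham))
```

      With these toy mailboxes the same word scores as a strong spam signal for the programmer and as neutral for the sexologist, because each score comes only from that user's own mail.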

  3. Speaking of dummies... by Anonymous Coward · · Score: 5, Informative

    Someone needs to learn the meaning of "ironic". (Hint: it doesn't mean "weird coincidence".)

    Paul

  4. Re:Origin of SPAM by jat850 · · Score: 5, Informative

    Good question ... through Google Groups I found this page.

    --
    the blood has stopped pumping, and he's left to decay
    the me that you know is now made up of wires
  5. Ironic? by popeydotcom · · Score: 4, Funny

    Interesting yes, ironic, no.

    What's your name, Alanis Morissette ?

    1. Re:Ironic? by DavyByrne · · Score: 4, Insightful

      Actually, I've long wondered whether Alanis was quite clever in choosing a title for that song.

      You see, none of the events she describes in the song is an example of irony, making the choice of the title "Ironic," well, ironic.

  6. Do spammer's techniques work on slashdot ? by Rosco+P.+Coltrane · · Score: 4, Funny

    Viagra often spelled V-l-a-g-r-a online

    I-f I t-r-o-l-l l-i-k-e t-h-i-s, w-i-l-l i-t p-a-s-s S-l-a-s-h-d-o-t.'s t-r-o-l-l f-i-l-t-e-r ?

    --
    "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
  7. Wrong pic... by Mondoz · · Score: 4, Informative
    It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name.

    Why then, does the article show a pic from a Monty Python animation about the black spot who goes to seek his fortune...
    You'd think they'd use the actual pic of the skit with the Vikings in the cafe...

    --
    /sig
  8. Re:who're the vikings? by Evil-G · · Score: 5, Informative

    A group of Vikings in a Monty Python sketch drowned out normal conversation by shouting the word "spam" louder and louder. The word was then adopted for all the crap drowning out normal conversation on Usenet.

  9. Re:Spam = /dev/null by GammaTau · · Score: 4, Informative

    Bayesian filtering could stop all the spam that easily? This is great! Where can I download a filter like this?

    You can try bogofilter, ifile, SpamBayes, or POPFile. The newer versions of SpamAssassin also implement some kind of Bayesian filtering.

  10. I don't receive spam by Rosco+P.+Coltrane · · Score: 4, Interesting

    In my home mailbox, I don't receive spam. And I only got two 419 Nigerian investment frauds on my professional address in a whole year, despite the fact that my corporate email address is widely publicized and easy to find on Google. And amazingly, I never receive spam in my "special bogus registration" hotmail account (useful for programs like RealPlayer, or nytimes.com).

    So existing mail filters work for me, more or less. The few unwanted mails that pass through are easily taken care of by my trusted delete button. This leads me to ask :

    - Do other people really receive that much spam, or am I an isolated case ?

    - Do people who receive spam purchase things online, or register software and other services with their real names and email ?

    --
    "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
    1. Re:I don't receive spam by letxa2000 · · Score: 4, Insightful
      There are in fact two big problems with Bayesian filtering (or any content-based filtering) from the perspective of an ISP or company... 1) one person's spam is another person's necessity

      But that's why Bayesian filtering calls for every user to have their own Bayesian statistics. It's not a "one size fits all" for the entire ISP or company, as is the case with most keyword filters. Every user has a different set of Bayesian statistics, which is why it is very difficult for spammers to get around this filter: they have no way of knowing what words are in each user's statistics.

      2) you still have to waste your bandwidth and CPU before you reject it.

      It's better to waste your bandwidth and your CPU than to waste the time of those receiving the spam. IMHO...

      So Bayesian filters are a good tool of last resort, but there are many other tools that should be used too.

      The quicker everyone uses Bayesian filters (as opposed to waiting until all the other filters are incapable of keeping up with spam), the sooner the spammers will be in trouble. I personally use both a Bayesian filter and an up-to-date blacklist of known spamvertised domains, etc. I find that, quite simply, the simple keyword filters catch spam from known spam sites and Bayesian catches the rest. But if I turned off my normal filters, Bayesian would have caught it all, since those spams are always assigned a high Bayesian score too. It almost makes sense to turn off the other filters, but they can be useful if a spammer comes up with a truly unique spam and someone else has already identified the domain name. It's rare, but it can happen. So a combination of technologies is probably the best... but a combination that lacks Bayesian is a combination that could be better.

  11. Apple's Mail app... by useruser · · Score: 4, Interesting

    ...supposedly uses some form of Bayesian reasoning. I've been using it for a year now. I trained it for a couple of weeks, turned it on "automatic filtering" mode, and now I can count the number of times it's misclassified a message on my two hands. I used to get more spam than legit mail; now I can't help but wonder why spam is a problem for people. Until I remember that most people don't use a Mac. Every once in a while, I flip it back into training mode so that I can see the lovely sea of brown-colored spam messages that flood my inbox. I flip it back to automatic mode, Mail automatically moves them to my junk folder, and I can forget about them.

    1. Re:Apple's Mail app... by Anonymous Coward · · Score: 5, Informative

      Actually, the latent semantic analysis (LSA) that Apple uses is not a form of Bayesian reasoning; it uses a singular value decomposition (SVD) to perform generalized factor analysis. However, there is a probabilistic version of LSA out there.

  12. Crude but effective by MrWorf · · Score: 5, Insightful

    I simply got to the point that I could count the number of real emails on my hands. So I reversed my previous filter. Instead of filtering spam to my spam folder, I made it default *ALL* mail to the spam folder except from certain known addresses (such as work, friends and my own domain). So far, it has only made one wrong decision, and that was because I hadn't written the email address of a friend correctly.

    This is waaaaay better than any other filter method I've tried and requires no learning period at all :)
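
    A default-to-spam rule like this can be sketched in a few lines of Python. The addresses, the domain, and the `route` function are placeholders invented for illustration, not taken from any particular mail client.

```python
# Hypothetical allow-list: exact addresses plus whole trusted domains.
KNOWN_ADDRESSES = {"boss@work.example", "friend@mail.example"}
KNOWN_DOMAINS = {"mydomain.example"}


def route(sender):
    """Return the folder for a message: inbox only for known senders."""
    addr = sender.strip().lower()
    domain = addr.rpartition("@")[2]  # text after the last "@"
    if addr in KNOWN_ADDRESSES or domain in KNOWN_DOMAINS:
        return "inbox"
    return "spam"  # everything else defaults to the spam folder


print(route("Friend@Mail.example"))
print(route("freind@mail.example"))  # a mistyped entry would misroute like this
```

    The second call shows the one failure mode described above: an address that does not match the list exactly lands in the spam folder.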

  13. Brief Tech Notes on Bayesian Filtering by robbyjo · · Score: 5, Informative

    Well, the type of Bayesian learning used in this spam filtering is called "Naive Bayesian" and the engine is trained using a "supervised learning" technique. Naive Bayes has proven very successful for text categorization. Spam filtering is even more successful because we essentially categorize e-mails into two labels: "spam" or "not spam".

    Supervised learning basically works like this. Feed the engine with multiple examples (in this case, e-mails) with labels (in this case, "spam" or "not spam"). The training usually takes thousands of examples to get good enough accuracy. And take note that we need both "spam" and "not spam" examples to enable the learning engine to distinguish them.

    How does Naive Bayes work? Well, think of the full Bayesian network. A Bayes net is basically a cause-and-effect graph with an annotated Conditional Probability Table (CPT) on each node denoting the probabilities of possible values. A full Bayes net is a Directed Acyclic Graph (DAG), but Naive Bayes takes the form of a tree instead due to some "naive" assumptions. (Okay, I handwaved a whole lot of details here.) And in learning Naive Bayes, we basically try to construct the tree out of the examples.

    Let P(spam) be the fraction of training e-mails that are labelled as "spam" and P(not spam) be the fraction of "not spam" e-mails.

    First, let the filter read all e-mails and collect the words out of them. Weed out duplicates and stop words (common words like "I", "you", "the", etc). Let NumVocab be the number of words after weeding.

    Second, process the e-mails one by one, applying the same weeding to each, and keep a running tally per label. For the "spam" label, let "n" be the total number of words (after weeding) in all the e-mails labelled "spam", and let "nw" be the number of times word "w" occurs in those e-mails. Imagine you have a big two-dimensional array to store the result (let's call the array "P"). Then store (nw+1)/(n+NumVocab) in P[w][spam], and do the same over the "not spam" e-mails for P[w][nospam].

    Repeat until all training e-mails are read.

    And here comes the testing phase...

    When you encounter an e-mail and want to classify whether it's spam or not, you'll need to look up the array P you created earlier. First, you do the weeding phase and scan the word one by one. The algo is like this:

    pspam = P(spam); pnospam = P(not spam);
    foreach unique word w in e-mail do
        pspam = pspam * P[w][spam];
        pnospam = pnospam * P[w][nospam];
    endfor

    if (pspam > pnospam) then return IS_SPAM;
    else return IS_NO_SPAM;
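
    For anyone who wants to try it, here is a rough Python sketch of the training and testing phases. It makes some assumptions that are not in the description: word counts are pooled over all e-mails sharing a label, logs are summed instead of multiplying raw probabilities (to avoid underflow on long e-mails), every occurrence of a word counts at test time, and words never seen in training are skipped. The stop-word list and all function names are made up for illustration.

```python
import math
from collections import Counter

STOP_WORDS = {"i", "you", "the", "a", "to", "is"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split on whitespace, and weed out stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def train(examples):
    """examples: list of (text, label), label 'spam' or 'nospam'.

    Both labels must appear at least once, or the prior would be zero.
    """
    word_counts = {"spam": Counter(), "nospam": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["nospam"])
    total_docs = sum(doc_counts.values())
    model = {"vocab": vocab, "prior": {}, "cond": {}}
    for label, counts in word_counts.items():
        n = sum(counts.values())  # total words seen under this label
        model["prior"][label] = doc_counts[label] / total_docs
        # (nw + 1) / (n + NumVocab), i.e. add-one smoothing
        model["cond"][label] = {
            w: (counts[w] + 1) / (n + len(vocab)) for w in vocab
        }
    return model


def classify(model, text):
    """Return 'spam' or 'nospam', whichever has the higher log score."""
    scores = {}
    for label in ("spam", "nospam"):
        score = math.log(model["prior"][label])
        for w in tokenize(text):
            if w in model["vocab"]:  # skip words never seen in training
                score += math.log(model["cond"][label][w])
        scores[label] = score
    return "spam" if scores["spam"] > scores["nospam"] else "nospam"


training = [
    ("buy cheap viagra now", "spam"),
    ("cheap pills buy now", "spam"),
    ("meeting agenda for monday", "nospam"),
    ("lunch on monday", "nospam"),
]
model = train(training)
print(classify(model, "buy cheap pills"))
print(classify(model, "agenda for lunch meeting"))
```

    Four training e-mails are far too few for real use; as noted above, it usually takes thousands of examples to get good accuracy.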

    Hope this helps.

    --
    Error 500: Internal sig error
  14. Slight modification: white-list+Bayesian is useful by Jeremi · · Score: 4, Interesting
    I've found that if you add a small tweak to the Bayesian Filter, it becomes even more useful. The tweak is this: Any time you tell the Bayesian filter that an email is "non-spam", it auto-adds the From address of that email to a white-list, so that from then on any emails from that address are automatically marked as "non-spam" by the filter, no matter what they contain. (conversely, any time you mark an email as "spam", the source address of that email is removed from the white-list, if it is present)


    This allows your single spam/non-spam feedback to the system to do double duty, so that once the program knows that you consider an email source to be "trusted", it will allow even spammy-looking stuff (read: mailing list digests, plane schedules, bank statements, etc) through to your non-spam folder.


    Of course, if spammers start constructing google-style databases of who your friends are and impersonating their accounts, then this won't work anymore... but if they start that, all hell is going to break loose anyway.
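
    The tweak can be sketched roughly as a thin wrapper around whatever Bayesian classifier you already have. The class, its method names, and the stand-in classifier below are hypothetical, for illustration only.

```python
class WhitelistingFilter:
    """Wraps a content classifier with an auto-maintained sender white-list."""

    def __init__(self, bayes_is_spam):
        # bayes_is_spam: any callable taking message text, returning True for spam
        self.bayes_is_spam = bayes_is_spam
        self.whitelist = set()

    def mark(self, sender, text, is_spam):
        """Record the user's spam/non-spam feedback for one message."""
        if is_spam:
            self.whitelist.discard(sender)  # no-op if the sender is absent
        else:
            self.whitelist.add(sender)
        # A real filter would also retrain its word statistics on `text` here.

    def is_spam(self, sender, text):
        if sender in self.whitelist:
            return False  # trusted source: accept regardless of content
        return self.bayes_is_spam(text)


# Stand-in for a trained Bayesian filter (assumption, for the demo only).
f = WhitelistingFilter(lambda text: "offer" in text.lower())
before = f.is_spam("bank@example.com", "Special offer on loans")
f.mark("bank@example.com", "Special offer on loans", is_spam=False)
after = f.is_spam("bank@example.com", "Special offer on loans")
print(before, after)
```

    The same spammy-looking message is rejected before the sender is marked as non-spam and accepted afterwards.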

    --


    I don't care if it's 90,000 hectares. That lake was not my doing.
  15. Re:Origin of SPAM by Anne+Thwacks · · Score: 4, Informative
    While the Monty Python sketch may have inspired the use of the term, the Monty Python usage was in fact a rehash of a sketch by Peter Sellers dating back to the 1950s, which referred to the wartime situation where cafés often had fancy things on the menu, but when you came to order, the item in question was not available.

    The sketch is to be found on the album "The Best of Sellers", probably released in about 1958, which also features the nursery rhyme

    "Up on the chair behind the door,
    hey diddle, diddle,
    Here comes Poppa
    so up with the chopper
    and split 'im down the middle"

    and "Balham, Gateway to the South", a spoof of the travelogue films that often appeared in the cinema at the time.

    --
    Sent from my ASR33 using ASCII