Slashdot Mirror


Working Bayesian Mail Filter

zonker writes "A real, working, honest-to-god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well, here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand, the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software, as I see the Google results for it are growing."

6 of 312 comments

  1. spambayes.sf.net by supton · · Score: 5, Informative

    Saw this a few weeks back... Spam filter in Python using Naive Bayes.

  2. Re:Whas that? by dvk · · Score: 5, Informative
    From what I understand, it is a mail filter which determines what to filter out based on a statistics-based machine learning system called "Bayesian Learning".

    A couple of URLs quickly found on Google:
    http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-7.html
    http://www.csse.monash.edu.au/courseware/cse5230/assets/images/week09.pdf

    Also, any decent AI/machine learning textbook ought to cover the topic.

    -DVK

    --
    "The right to figure things out for yourself is the only true freedom everyone shares. Go use it"-R.A.Heinlein
  3. Re:Sure it's promising by outlier · · Score: 5, Informative

    While spammers will undoubtedly continue to refine the content of their messages, one of the strengths of using a Bayesian filter like this is that it uses the user's own spam and non-spam (ham) as the basis for its calculations. This means that messages are categorized not only by whether they contain spammy words, but also by whether they contain the hammy words from your own messages. So, even if spammers could refrain from using words like "free", "mortgage", "sluts", and "spam", they probably wouldn't use words that discriminate your own ham from others' (e.g., if you are a computer scientist, your mail may include hammy words like "algorithm", "compile", "project", or "stargate" that would help distinguish ham from spam). The challenge to the spammer would then be to target you with spam that looks like *your* ham (which is probably different from the ham of others).
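    A minimal sketch of that idea in Python, assuming a plain word-count Naive Bayes with add-one smoothing. This is not POPFile's or Paul Graham's actual scoring; the class name, the tokenizer, and the training messages are all made up for illustration. The point is only that the counts come from one user's own spam and ham:

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Toy two-class word-count classifier (hypothetical, for illustration)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.msg_counts = Counter()

    def train(self, label, text):
        """Add one of the user's own messages to the 'spam' or 'ham' counts."""
        self.word_counts[label].update(tokenize(text))
        self.msg_counts[label] += 1

    def spam_probability(self, text):
        """Return the estimated Pr(spam | text) under the naive Bayes assumption."""
        if not self.msg_counts["spam"] or not self.msg_counts["ham"]:
            raise ValueError("train on at least one spam and one ham message")
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_msgs = sum(self.msg_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Start from the log prior: the share of this user's mail in the class.
            score = math.log(self.msg_counts[label] / total_msgs)
            for word in tokenize(text):
                if word not in vocab:
                    continue  # never seen in training, so it carries no evidence
                # Add-one smoothing keeps unseen (word, class) pairs above zero.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Normalise in log space to avoid underflow on long messages.
        top = max(log_score.values())
        weight = {label: math.exp(s - top) for label, s in log_score.items()}
        return weight["spam"] / (weight["spam"] + weight["ham"])


nb = NaiveBayes()
nb.train("spam", "free mortgage offer, act now")
nb.train("spam", "free pills, free offer")
nb.train("ham", "the algorithm fails to compile in our project")
nb.train("ham", "stargate rerun tonight, then back to the project")

print(nb.spam_probability("free offer on your mortgage"))
print(nb.spam_probability("free stargate project algorithm compile"))
```

    With this toy training set the first message scores well above 0.5 and the second below it, even though both contain "free": the hammy words from this user's own mail pull the second one back.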

    Future systems (assuming faster processors and more HD space) could include semantic analysis (e.g., Latent Semantic Analysis) to do an even better job and go beyond the word level.

  4. Re:Bayes Explained by johnynek · · Score: 5, Informative

    That's /. for you. You guys have modded up to 5 a post that is wrong in both of the equations it posts.

    It should be:

    Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)

    and:

    Pr("SPAM"|Email) = Pr(Email|"SPAM") * (proportion of spam) / (probability of getting this particular Email)
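
    To see the two equations with numbers plugged in, here is a small Python sketch. The function names and the probabilities are made up for illustration, and Pr(Email) is expanded with the law of total probability over just two classes, spam and ham, which is an assumption of this sketch:

```python
def posterior(likelihood, prior, evidence):
    """Bayes' rule: Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)."""
    if evidence <= 0:
        raise ValueError("Pr(D) must be positive")
    return likelihood * prior / evidence


def spam_posterior(p_email_given_spam, p_email_given_ham, p_spam):
    """Pr("SPAM"|Email), assuming every email is either spam or ham."""
    p_ham = 1.0 - p_spam
    # Pr(Email) = Pr(Email|SPAM) * Pr(SPAM) + Pr(Email|HAM) * Pr(HAM)
    p_email = p_email_given_spam * p_spam + p_email_given_ham * p_ham
    return posterior(p_email_given_spam, p_spam, p_email)


# Made-up numbers: half of all mail is spam, and this particular email is
# eight times as likely to turn up among spam as among ham.
print(spam_posterior(p_email_given_spam=0.8, p_email_given_ham=0.1, p_spam=0.5))
```

    If the email is equally likely under both classes, the result collapses to the proportion of spam, which is what you would expect when the message itself tells you nothing.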

    --
    jabber: johnynek@jabber.org
  5. Re:Whas that? by sfe_software · · Score: 5, Informative

    If you had just clicked the POPFile link, you would see the explanation.

    I also highly recommend this link, as it goes into quite a lot of detail on this filtering technique. After reading it, I am going to give the Perl variation a shot.

    --
    NGWave - Fast Sound Editor for Windows
  6. Missing the point? by crisco · · Score: 5, Informative
    I think lots of people here are missing the point of POPFile. Everyone is happy to point out that there are already several assorted solutions to Bayesian mail filtering in many different languages. Nearly all of these work on the mail server. Now, lots of us are qualified and interested in setting up our own mail server, customizing the mail processing in our own One True Way, and happily enjoying an inbox free of spam. But the average Windows user has no idea how to set up a mail server. Others could easily do it but feel their time is better spent on other things, not administering a mail server.

    This is what POPFile is for. It's a POP3 proxy server: it sits between your POP3 client and the server and simply adds a classification to the headers (or the subject line for braindead mail clients).
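
    The tagging step on its own can be sketched in Python as below. This is not POPFile's code (POPFile is Perl) and it leaves out the POP3 proxying entirely. The function name is hypothetical, and the header name X-Text-Classification is an assumption of this sketch, so check what your copy of POPFile actually emits:

```python
from email import message_from_string

# Assumed header name for this sketch; not confirmed by the post above.
CLASSIFICATION_HEADER = "X-Text-Classification"


def tag_message(raw_message, bucket, tag_subject=False):
    """Return the raw message with a classification header added.

    With tag_subject=True the bucket is also prefixed to the subject line,
    for mail clients that cannot filter on arbitrary headers.
    """
    msg = message_from_string(raw_message)
    # Drop any header of the same name that arrived with the message, so a
    # sender cannot pre-classify their own mail. Deleting a missing header
    # is a no-op.
    del msg[CLASSIFICATION_HEADER]
    msg[CLASSIFICATION_HEADER] = bucket
    if tag_subject:
        new_subject = f"[{bucket}] {msg['Subject'] or ''}"
        if "Subject" in msg:
            msg.replace_header("Subject", new_subject)
        else:
            msg["Subject"] = new_subject
    return msg.as_string()


raw = "From: someone@example.com\nSubject: Hello\n\nBody text\n"
print(tag_message(raw, "spam", tag_subject=True))
```

    The mail client then needs only one ordinary rule, on that header or on the subject prefix, to file the message.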

    Currently POPFile is a bit rough on computer newbies: it needs a Perl install and such. However, if you read the forums, it is intended to end up as an easily installed executable for Windows users and to remain a nifty little Perl script for the rest of the platforms where it might come in handy. So when those pesky friends and relatives come asking about all the viagra and farmyard spam they get (and you haven't already set them up on your tightly filtered mail server), set up POPFile for them.

    Also, it's not just for spam filtering. Think of what you could do if you could go beyond simple rules for your inbox. Want email you think is important forwarded to your phone? Create a category for important email, go through your archives, and feed POPFile the email you would have wanted forwarded instantly. Create a new folder to receive those mails and watch it for a few days, retraining POPFile until it is getting reasonably good at putting important mail in there. Now set up your mail system to forward those to your phone. Will it work? I don't know, but based on the results I'm getting, it probably would. How about using it to filter help desk emails?

    --

    Bleh!