Slashdot Mirror


Working Bayesian Mail Filter

zonker writes "A real, working, honest-to-god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well, here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand, the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software, as I see the Google results for it are growing."

12 of 312 comments

  1. spambayes.sf.net by supton · · Score: 5, Informative

    Saw this a few weeks back... Spam filter in Python using Naive Bayes.

  2. Mozilla in Process of adding Bayesian filter by AT · · Score: 5, Interesting

    The Mozilla mail client is getting a Bayesian mail filter, too. See http://bugzilla.mozilla.org/show_bug.cgi?id=163188 . Unfortunately, it probably won't show up until after version 1.2 is released.

  3. Bayesian? Wow!!! I'm sooo excited. (Irony!) by davids-world.com · · Score: 5, Interesting
    A true Bayesian filter, wow. Let's face it, statistical classifiers based on Bayes' formula are not really state of the art. They make false assumptions about the data (independence of features).

    More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines and, somewhat older, Maximum Entropy models.

    Enough nerd talk for today :-)

  4. Forget Bayes by Evil+Adrian · · Score: 5, Funny

    We need the Stalin Mail Filter (TM) -- it detects spam, hunts down the spammer, and exiles them to Siberia.

    --
    evil adrian
  5. Re:Whas that? by Evil+Adrian · · Score: 5, Funny

    If you had just clicked the POPFile link, you would see the explanation.

    Initiative is your friend.

    Hyperlinks are your friend.

    Don't be afraid, just click.

    --
    evil adrian
  6. Re:Whas that? by dvk · · Score: 5, Informative
    From what I understand, it is a mail filter which determines what to filter out based on a statistics-based machine learning system called "Bayesian Learning".

    A couple of URLs quickly found on Google:
    http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-7.html
    http://www.csse.monash.edu.au/courseware/cse5230/assets/images/week09.pdf

    Also, any decent AI/machine learning textbook ought to cover the topic.

    -DVK

    --
    "The right to figure things out for yourself is the only true freedom everyone shares. Go use it"-R.A.Heinlein
  7. Re:Sure it's promising by outlier · · Score: 5, Informative

    While spammers will undoubtedly continue to refine the content of their messages, one of the strengths of using a Bayesian filter like this is that it uses the user's own spam and non-spam (ham) as the basis for its calculations. This means that messages are categorized not only by whether they contain spammy words, but also by whether they contain the hammy words from your own messages. So, even if spammers could refrain from using words like "free", "mortgage", "sluts" and "spam", they probably wouldn't use words that discriminate your own ham from others (e.g., if you are a computer scientist, your mail may include hammy words like "algorithm", "compile", "project" or "stargate" that would help distinguish ham from spam). The challenge to the spammer would then be to target you with spam that looks like *your* ham (which is probably different from the ham of others).
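    The idea of scoring a message against the user's own spam and ham can be sketched roughly as follows. This is a minimal word-level naive Bayes illustration, not POPFile's actual code; the class name, the tokenizer, the add-one smoothing and the sample messages are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic words are good enough as features.
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Hypothetical two-bucket (spam/ham) word-count classifier."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.msg_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        # Learn from the user's own mail, one labelled message at a time.
        self.word_counts[label].update(tokenize(text))
        self.msg_counts[label] += 1

    def log_score(self, text, label):
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocab = len(set().union(*self.word_counts.values()))
        total_msgs = sum(self.msg_counts.values())
        # Add-one smoothing on the prior and on each word, so unseen
        # words or an untrained bucket never produce log(0).
        score = math.log((self.msg_counts[label] + 1) / (total_msgs + 2))
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / (total_words + vocab + 1))
        return score

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


nb = NaiveBayes()
nb.train("free mortgage offer act now", "spam")
nb.train("cheap mortgage rates free quote", "spam")
nb.train("the compile step for the algorithm project failed", "ham")
nb.train("stargate rerun tonight after the project meeting", "ham")

print(nb.classify("free mortgage quote"))
print(nb.classify("project algorithm compile notes"))
```

    Because the ham counts come from one user's mailbox, words such as "compile" or "stargate" pull a message toward ham for this user even though they would mean nothing to a filter trained on someone else's mail.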

    Future systems (assuming faster processors and more HD space) could include semantic analysis (e.g., Latent Semantic Analysis) to do an even better job and go beyond the word level.

  8. Re:Bayes Explained by johnynek · · Score: 5, Informative

    That's /. for you. You guys have modded up to 5 a post that is wrong in both of the equations it posts.

    It should be:

    Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)

    and:

    Pr("SPAM"|Email) = Pr(Email|"SPAM") * (proportion of spam) / (probability of getting this particular Email)
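    To make the corrected formula concrete, here is a small numeric sketch. All of the probabilities are made-up values for illustration, and Pr(Email) is expanded over the two cases (spam and not spam) using the law of total probability.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' rule: Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)."""
    if evidence <= 0:
        raise ValueError("Pr(D) must be positive")
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only to illustrate the arithmetic.
p_spam = 0.4               # proportion of spam in the mailbox
p_email_given_spam = 0.05  # Pr(Email | "SPAM")
p_email_given_ham = 0.001  # Pr(Email | not "SPAM")

# Pr(Email): probability of getting this particular email at all.
p_email = p_email_given_spam * p_spam + p_email_given_ham * (1 - p_spam)

p_spam_given_email = posterior(p_email_given_spam, p_spam, p_email)
print(round(p_spam_given_email, 4))
```

    With these assumed inputs the email is about 50 times more likely under the spam hypothesis than under ham, so the posterior ends up far above the 0.4 prior.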

    --
    jabber: johnynek@jabber.org
  9. Re:Whas that? by sfe_software · · Score: 5, Informative

    If you had just clicked the POPFile link, you would see the explanation.

    I also highly recommend this link, as it goes into quite a lot of detail on this filtering technique. After reading it, I am going to give the Perl variation a shot.

    --
    NGWave - Fast Sound Editor for Windows
  10. Missing the point? by crisco · · Score: 5, Informative
    I think lots of people here are missing the point of POPFile. Everyone is happy to point out that there are already several assorted solutions to Bayesian mail filtering in many different languages. Nearly all of these work on the mail server. Now, lots of us are qualified and interested in setting up our own mail server, customizing the mail processing to our own One True Way and happily enjoying an inbox free of spam. But the average Windows user has no idea how to set up a mail server. Others could easily do it but feel their time is better spent on other things, not administering a mail server.

    This is what POPFile is for. It's a POP3 proxy server: it sits between your POP3 client and the server and simply adds a classification to the headers (or the subject line for braindead mail clients).
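    The tagging step described above might look something like this sketch. It only shows how a proxy could mark a message once it has been classified; the function name, the header name X-Text-Classification and the "[bucket]" subject prefix are assumptions for illustration, and the POP3 networking itself is left out.

```python
from email import message_from_string


def tag_message(raw_message, bucket, tag_subject=False):
    """Return the message text with a classification added.

    Hypothetical helper: the header name and subject format are
    illustrative choices, not taken from POPFile's source.
    """
    msg = message_from_string(raw_message)

    # Replace any existing classification header instead of stacking them.
    del msg["X-Text-Classification"]
    msg["X-Text-Classification"] = bucket

    # Fallback for mail clients that can only filter on the subject line.
    if tag_subject:
        subject = msg.get("Subject", "")
        del msg["Subject"]
        msg["Subject"] = f"[{bucket}] {subject}"

    return msg.as_string()


raw = "From: someone@example.com\nSubject: Hello\n\nBody text\n"
print(tag_message(raw, "spam", tag_subject=True))
```

    The mail client then needs only an ordinary rule on that header (or on the subject prefix) to file the message, which is why no server-side setup is required.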

    Currently POPFile is a bit rough on computer newbies: it needs a Perl install and such. However, if you read the forums, it is intended to end up as an easily installed executable for Windows users and to remain a nifty little Perl script for the rest of the platforms where it might come in handy. So when those pesky friends and relatives come asking about all the viagra and farmyard spam they get (and you haven't already set them up on your tightly filtered mail server), set up POPFile for them.

    Also, it's not just for spam filtering. Think of what you could do if you could go beyond simple rules for your inbox. Want email you think is important forwarded to your phone? Create a category for important email, then go through your archives and feed POPFile email you would have wanted forwarded instantly. Create a new folder to receive those mails and watch it for a few days, retraining POPFile until it is getting reasonably good at putting important mail in there. Now set up your mail system to forward those to your phone. Will it work? I don't know, but based on the results I'm getting, it probably would. How about using it to filter help desk emails?

    --

    Bleh!

  11. Bayes by John+Garvin · · Score: 5, Funny

    Now we can tell spammers: "All your Bayes are belong to us."

  12. Re:Sure it's promising by Alsee · · Score: 5, Funny

    (e.g., if you are a computer scientist, your mail may include hammy words like "algorithm" "compile" "project" or "stargate" that would help distinguish ham from spam.

    I have a cousin that lives in Nigeria and we regularly discuss tips on penis enlargement. He works at a bank refinancing mortgages and his wife is a professor at an accredited university. I work in a Las Vegas casino producing shows featuring live nude showgirls. He offered to help me pay some bills and get out of debt (a generous offer, but I told him I just found a second part time job working from home earning thousands of dollars per week). My wife is a stock broker and I regularly let my cousin in on hot stock tips. I have an herb garden, I take viagra, and use rogaine. Since we both own the same brand of printer we've been working out the best way to refill the ink cartridges. I've been trying to lose weight, but it comes right back as soon as I quit smoking.

    I don't quite understand this "beysian filter" stuff, but I can't wait to try it out!

    -

    --
    - - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.