Slashdot Mirror


Working Bayesian Mail Filter

zonker writes "A real, working, honest-to-god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well, here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand, the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software, as I see the Google results for it are growing."

6 of 312 comments

  1. Mozilla in Process of adding Bayesian filter by AT · · Score: 5, Interesting

    The mozilla mail client is getting a Bayesian mail filter, too. See http://bugzilla.mozilla.org/show_bug.cgi?id=163188 . Unfortunately, it probably won't show up until after version 1.2 is released.

  2. Bayesian? Wow!!! I'm sooo excited. (Irony!) by davids-world.com · · Score: 5, Interesting
    A true Bayesian filter, wow. Let's face it, statistical classifiers based on Bayes' formula are not really state of the art. They make false assumptions about the data (independence of features).
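
    To make the independence assumption concrete, here is a minimal naive Bayes sketch in Python. It is not POPFile's code or anything from Paul Graham's paper; the function names, the whitespace tokenizer, the add-one smoothing, and the four training messages are all assumptions made up for illustration. The point to notice is the loop in log_score: each word contributes its own term, as if words occurred independently of one another given the class.

```python
import math
from collections import Counter

# Toy training data, invented for this sketch.
TRAINING = [
    ("spam", "buy cheap pills now"),
    ("spam", "cheap pills cheap offer"),
    ("ham", "meeting notes for monday"),
    ("ham", "lunch on monday"),
]


def train(messages):
    """Count words per class and messages per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for label, text in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def log_score(text, label, word_counts, doc_counts):
    """log P(label) + sum over words of log P(word | label)."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in text.lower().split():
        # Independence assumption: every word adds its own term,
        # regardless of which other words appear in the message.
        # Add-one smoothing keeps unseen words from producing log(0).
        count = word_counts[label][word] + 1
        score += math.log(count / (total_words + len(vocab)))
    return score


def classify(text, word_counts, doc_counts):
    """Return the label with the higher log score."""
    return max(
        ("spam", "ham"),
        key=lambda label: log_score(text, label, word_counts, doc_counts),
    )


model = train(TRAINING)
print(classify("cheap pills", *model))
print(classify("monday meeting", *model))
```

    Because the per-word terms are simply added in log space, the classifier cannot express that two words matter only in combination, which is the limitation this comment is pointing at.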

    More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines and, somewhat older, Maximum Entropy models.

    Enough nerd talk for today :-)

  3. Re:Server-side solutions? by cmeans · · Score: 4, Interesting
    James is a 100% Java email server (SMTP, POP3, NNTP, and IMAP soon) that supports mail-server extensions via the Mailet API. I developed a Java implementation of the Bayesian rules discussed, so that they could be used in any configuration, but also provided a mailet-wrapped implementation so that the filtering (or flagging) could be done on the server side.

  4. Re:Server-side solutions? by koreth · · Score: 4, Interesting

    I've been using SpamProbe (which gets invoked from procmail) with excellent results.

  5. Re:Sure it's promising by Tim+Browse · · Score: 4, Interesting

    One interesting fact that came out of these statistical analyses of spam was from one that was featured a while back on Slashdot. The guy was doing word analysis and looking for good spam indicators/correlations. He expected "sex" or "teens" to be a good match, but the best word was, surprisingly, "ff0000". This was because so much spam uses HTML mail with red text.

    So if nothing else, it will force spammers to stop using red text - that has to be some kind of victory :-)
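
    For anyone wondering how a color code can outrank "sex", here is a rough Python sketch of that kind of word analysis. It is not the method from the analysis mentioned above; the scoring ratio, the tokenizer, the function name rank_tokens, and the six sample messages are all made up for illustration. Each token gets the fraction of spam messages containing it divided by the sum of its spam and ham fractions, so a token seen only in spam scores 1.0.

```python
import re
from collections import Counter

# Toy corpus, invented for this sketch.
SPAM = [
    '<font color="#ff0000">free offer</font>',
    '<font color="#ff0000">hot teens</font>',
    '<font color="#ff0000">sex pills</font>',
]
HAM = [
    "the sex of the kitten is unknown",
    '<font color="blue">meeting at noon</font>',
    "lunch offer still stands",
]


def tokens(text):
    """Lowercase alphanumeric runs, so HTML attribute values count as tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_tokens(spam_msgs, ham_msgs):
    """Return (token, score, spam_message_count), best spam indicators first."""
    spam_df = Counter(tok for msg in spam_msgs for tok in tokens(msg))
    ham_df = Counter(tok for msg in ham_msgs for tok in tokens(msg))
    ranked = []
    for tok in set(spam_df) | set(ham_df):
        spam_frac = spam_df[tok] / len(spam_msgs)
        ham_frac = ham_df[tok] / len(ham_msgs)
        score = spam_frac / (spam_frac + ham_frac)
        ranked.append((tok, score, spam_df[tok]))
    # Highest score first; ties go to the token seen in more spam messages.
    ranked.sort(key=lambda item: (-item[1], -item[2], item[0]))
    return ranked


ranked = rank_tokens(SPAM, HAM)
scores = {tok: score for tok, score, _ in ranked}
print(ranked[0][0], scores["ff0000"], scores["sex"])
```

    In this toy corpus "sex" shows up in legitimate mail as often as in spam, so it scores in the middle, while "ff0000" appears in every spam message and no legitimate one.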

    Tim

  6. Growing a spam filter -- a firsthand experience by devphil · · Score: 4, Interesting


    So, the graduate CS course I'm taking this quarter is Evolutionary Computing, which is all about the convoluted nonlinear multidimensional-search-space problems, and guess what our current homework is? That's right, taking statistics on spam data, and using genetic algorithms to evolve a working spam filter.

    Due to one typo and two thinkos in my fitness evaluation function, my algorithm evolves -- within only a few dozen generations -- a solution which looks like this:

    Ignore the actual contents of the message. 34% of the time, it's spam.

    And it's right.

    --
    You cannot apply a technological solution to a sociological problem. (Edwards' Law)