Slashdot Mirror


More on Bayesian Spam Filtering

michaeld writes "The "Bayesian" techniques for spam filtering recently publicized in Paul Graham's essay A Plan for Spam doesn't actually seem to have anything Bayesian about it, according to Gary Robinson (an expert on collaborative filtering). It is based on a non-Bayesian probabilistic approach. It works well enough, because it is frequently the case that technology doesn't have to be 100% perfect in order to do something that really needs to be done. The problem interested Robinson, and he posted his thoughts about trying to fix the problems in the Graham approach, including adding an actual Bayesian element to the calculations."

3 of 251 comments (clear)

  1. Post your results here by Jeffrey+Baker · · Score: 5, Interesting
    I'd like to head the results of anyone who has implemented one of these probabilistic filtering systems. I implemented a modifed version of Paul Graham's system and so far it kicks ass. So far it has trapped over 600 spams without any false positives. I receive almost 100 spams a day and over the last week I have generally only had to delete one or two by hand. The rest go directly to jail.

    I'd like to hear about modifications to this system. I removed Graham's doubling of "good" word frequencies, and I trained my filter using digrams. I also tried all the various methods supplied by the program "rainbow", with good results, but the implmentation was too slow and klunky to place in the middle of my email delivery system. What are other possible modifications?

    1. Re:Post your results here by Jeffrey+Baker · · Score: 5, Interesting
      I hacked it together in Perl, to make use of the Berkeley DB interfaces and the MIME parsing modules. Took about 30 minutes. I'm working on a C library that could be linked into mutt or pine or whatever, but I'm finding the available MIME code in C cumbersome.

      You can grab the source here, but it is specific to the exact way that my mail gets delivered (via offlineimap into maildirs).

  2. The proof of the pudding... by ajm · · Score: 5, Interesting

    ...is in the eating. I think the same applies to spam. Paul showed, to his satisfaction, that the technique he used worked for his samples. Gary proposes some changes that would improve the filter's accuracy, but does not test these theories.

    We will now have many slashdot posts saying "I've not tested this but I think A (or B, or C, or X)"

    Here's where the scientific method comes into its own. Anyone who cares enough can actually test and post their results. I'd be interested in seeing what they look like. I don't have a database of spam to test against (and please don't volunteer to sign me up for some :) but it would be interesting to see whether what looks convincing in theory pays off in practice.