More on Bayesian Spam Filtering
michaeld writes "The "Bayesian" technique for spam filtering recently publicized in Paul Graham's essay A Plan for Spam doesn't actually seem to have anything Bayesian about it, according to Gary Robinson (an expert on collaborative filtering). It is based on a non-Bayesian probabilistic approach. It works well enough, because it is frequently the case that technology doesn't have to be 100% perfect in order to do something that really needs to be done. The problem interested Robinson, and he posted his thoughts about trying to fix the problems in the Graham approach, including adding an actual Bayesian element to the calculations."
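For readers wondering what an "actual Bayesian element" can look like, here is a minimal sketch of the adjustment as I understand it from Robinson's essay: the raw per-word spam probability p(w) is pulled toward a prior guess x, and the pull weakens as the word is seen in more messages (n). The function name and the defaults s=1 and x=0.5 are illustrative assumptions, not values taken from the summary above.

```python
def robinson_adjusted(p_w: float, n: int, s: float = 1.0, x: float = 0.5) -> float:
    """Blend a raw per-word spam probability with a prior guess.

    p_w: raw probability that a message containing the word is spam
    n:   number of training messages that contained the word
    s:   strength given to the prior guess (assumed default)
    x:   prior guess for a word with no data at all (assumed default)
    """
    if n < 0 or s <= 0:
        raise ValueError("n must be >= 0 and s must be > 0")
    return (s * x + n * p_w) / (s + n)


# A word seen once, only in spam, is no longer treated as certain spam.
print(robinson_adjusted(1.0, 1))
# With many observations the data dominates the prior.
print(robinson_adjusted(1.0, 99))
```

With no data (n=0) the function returns x unchanged, which is the point of the adjustment: rare words stop producing extreme probabilities.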
I'd like to hear about modifications to this system. I removed Graham's doubling of "good" word frequencies, and I trained my filter using digrams. I also tried all the various methods supplied by the program "rainbow", with good results, but the implementation was too slow and clunky to place in the middle of my email delivery system. What are other possible modifications?
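The two changes mentioned above can be sketched roughly as follows, assuming Graham's per-token formula as published in A Plan for Spam (good counts doubled, result clamped to [0.01, 0.99], 0.4 for unseen tokens). The helper names, the tokenizer regex and the good_weight parameter are hypothetical. Setting good_weight=1.0 corresponds to removing the doubling, and digrams() produces word pairs instead of single words. The sketch leaves out Graham's rule of ignoring tokens seen fewer than five times.

```python
import re
from collections import Counter


def digrams(text: str) -> list[str]:
    """Split text into lowercase words and return adjacent word pairs."""
    words = re.findall(r"[a-z0-9$'-]+", text.lower())
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def token_prob(token: str, good: Counter, bad: Counter,
               ngood: int, nbad: int,
               good_weight: float = 2.0, unknown: float = 0.4) -> float:
    """Graham-style spam probability for one token.

    good/bad map tokens to counts in the ham and spam corpora;
    ngood/nbad are the numbers of ham and spam messages.
    """
    if ngood <= 0 or nbad <= 0:
        raise ValueError("need at least one ham and one spam message")
    g = good_weight * good.get(token, 0)
    b = bad.get(token, 0)
    if g + b == 0:
        return unknown  # token never seen in training
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


good = Counter(digrams("free money is rarely free"))
bad = Counter(digrams("free money free money free money now"))
print(token_prob("free money", good, bad, ngood=10, nbad=10))                   # doubled
print(token_prob("free money", good, bad, ngood=10, nbad=10, good_weight=1.0))  # not doubled
```

Dropping the doubling raises the spam probability of any token that appears in both corpora, so it trades some protection against false positives for a more aggressive filter.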
...is in the eating. I think the same applies to spam. Paul showed, to his satisfaction, that the technique he used worked for his samples. Gary proposes some changes that would improve the filter's accuracy, but does not test these theories.
:) but it would be interesting to see whether what looks convincing in theory pays off in practice.
We will now have many Slashdot posts saying "I've not tested this but I think A (or B, or C, or X)."
Here's where the scientific method comes into its own. Anyone who cares enough can actually test and post their results. I'd be interested in seeing what they look like. I don't have a database of spam to test against (and please don't volunteer to sign me up for some
At UCSD, Bob Boyer and I wrote a neural net spam filter. Neural nets, as everyone knows, are not really like biological brains; they are just statistical engines, similar to the approach the guy above described.
Our approach worked pretty well (95-97% accuracy), and we had to deal with the same issues that the above "Bayesian" approach did, i.e., weighting the neurons so that false positives occur much less frequently than false negatives, etc. We built it using data on spam collected from the UCI machine learning repository.
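The trade-off described here does not depend on the classifier. As a stand-in for the neural net (this is not the authors' code), the sketch below assumes a filter that outputs a spam score per message and picks the lowest cutoff that keeps the false-positive rate on known-good mail under a chosen limit, accepting more missed spam in return. The function names and the sample scores are made up for illustration.

```python
def error_rates(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate) at a cutoff.

    A message is flagged as spam when its score is >= threshold.
    labels[i] is True when message i really is spam.
    """
    ham = [s for s, is_spam in zip(scores, labels) if not is_spam]
    spam = [s for s, is_spam in zip(scores, labels) if is_spam]
    if not ham or not spam:
        raise ValueError("need at least one ham and one spam example")
    fp = sum(s >= threshold for s in ham) / len(ham)
    fn = sum(s < threshold for s in spam) / len(spam)
    return fp, fn


def pick_threshold(scores, labels, max_fp_rate=0.0):
    """Lowest observed score usable as a cutoff without exceeding max_fp_rate."""
    for t in sorted(set(scores)):
        fp, _ = error_rates(scores, labels, t)
        if fp <= max_fp_rate:
            return t
    return float("inf")  # no cutoff qualifies: flag nothing


scores = [0.1, 0.2, 0.8, 0.7, 0.9, 0.95]
labels = [False, False, False, True, True, True]
t = pick_threshold(scores, labels)
print(t, error_rates(scores, labels, t))
```

On this toy data a cutoff of 0.5 would catch every spam but flag one good message in three, while the stricter cutoff chosen by pick_threshold flags no good mail and lets one spam in three through.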
It ties in with procmail. I'm not really a windows guy, so if anyone knows how to put a filter between an IMAP server and Microsoft Outlook/Netscape Communicator, I'd be interested in hearing how it's done.
The README for it is at: http://www-cse.ucsd.edu/~wkerney/spamfilter.READM
And you can download it at:
http://www-cse.ucsd.edu/~wkerney/spamfilter.
-Bill Kerney
wkerney at ucsd.edu
SpamAssassin works great for me. It eats about 90% of my spam, you just hack up a little procmail file for it, and you're done.
With so many people using SpamAssassin these days, I can't see how this is a timely or newsworthy item. More like from the been-there-done-that-dept.
On the web, see: Assoc. for Uncertainty in Artificial Intelligence -- this is the primary conference devoted to belief networks, which are a class of graphical (in the circles and arrows sense) Bayesian probability models. There are tutorials and other papers on the main AUAI web page, and links to the last several years of conference proceedings. By the way, Heckerman and Horvitz, now doing belief networkish work at MS Research, are in the AUAI crowd.
In print, my favorite reference is E.T. Jaynes, "Probability Theory: The Logic of Science", which is due out soon. See this web site devoted to Jaynes' work for the status. I am also fond of Castillo, Gutierrez, & Hadi, "Expert Systems and Probabilistic Network Models".
There is a vast (well, maybe just large) number of alternative models for classifying things; a good introduction is Hastie, Tibshirani, & Friedman, "Elements of Statistical Learning". Incidentally, they use spam classification to illustrate several kinds of models.
Finally, if you're wondering what the heck is the difference between Bayesian probability and any other kind -- just google the posts in sci.stat.math; there is a Bayesian vs frequentist flame war about once a year. :^)