Bayesian Filter Testing?
pu33y asks: "Since the publication of Paul Graham's A Plan For Spam, several programs that perform Bayesian filtering have become available, including CRM114 and Bogofilter. But missing is any serious testing to see how they perform in relation to each other and to other, non-Bayesian filters. Searching Google has turned up nothing, and when I asked Paul Graham, he was unaware of any such testing as well. Can anyone point to any such testing or provide the results of their own personal experiences with Bayesian filters?"
The problem with this is that spam for one person is not spam for another. That's the beauty of Bayes. If you are a proctologist, for example, you probably get a lot of legitimate email with the word penis in it. If you are a plastic surgeon, you may get legitimate email that discusses body part enlargement. There are hundreds of examples. Because the filter is trained on your own mail, you can make it work for you; it doesn't have to be all-encompassing.
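To make that concrete, here is a minimal sketch of how the same word can score differently for two people. It assumes equal spam/ham priors and a simplified per-token estimate (not Graham's exact formula, which weights and clamps the numbers differently). The function name and the toy messages are made up for illustration.

```python
def token_spam_probability(token, spam_msgs, ham_msgs):
    """Estimate P(spam | token) from one user's own mail.

    Simplifying assumptions: equal priors for spam and ham, whitespace
    tokenization, and no smoothing or clamping.
    """
    spam_hits = sum(token in msg.lower().split() for msg in spam_msgs)
    ham_hits = sum(token in msg.lower().split() for msg in ham_msgs)
    spam_rate = spam_hits / len(spam_msgs)
    ham_rate = ham_hits / len(ham_msgs)
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical toy corpora: both users receive the same spam,
# but their legitimate mail is very different.
spam = ["enlargement pills cheap", "cheap enlargement offer", "win money now"]
surgeon_ham = [
    "patient asks about enlargement procedure",
    "enlargement consult tuesday",
    "lunch friday",
]
engineer_ham = ["build failed on main", "code review tomorrow", "lunch friday"]

surgeon_p = token_spam_probability("enlargement", spam, surgeon_ham)
engineer_p = token_spam_probability("enlargement", spam, engineer_ham)
print(f"surgeon:  {surgeon_p:.2f}")
print(f"engineer: {engineer_p:.2f}")
```

For the surgeon the word is no evidence at all, while for the engineer it is a strong spam indicator. That is why a single shared corpus cannot stand in for everyone's training data.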
The SpamAssassin people have talked about this in the past. They have a corpus of spam that they use to test rules and people have asked to download it to seed their own Bayes, but the SA people don't want to do that (a good thing) as Bayes is a personal thing.
What you are proposing will work for general spam checking, but not for Bayes, which is what the original poster asked about. In reality, it's hard to test Bayes in a general case. All I know is that it's worked wonders for me (using SA).
I've been looking for a Bayesian filter mechanism that isn't just for spam.
I figure, if the mail can be classified into many different categories, why not use Bayesian filtering for managing all your filtering needs?
It would be very valuable to have the Bayesian filter learn what kind of mail I put in which folders, so that when my mail comes in, it can auto-sort it into the appropriate folder for me. Trouble is, all the current implementations of Bayesian email filtering that I've found are a single SPAM/NOTSPAM test. It would be nice to see an implementation that could take multiple corpora and use them to decide what the mail is. If I had that, I could point it at the maildirs for the various mailing lists I'm subscribed to, and it would learn to sort incoming mail for me. *sigh*
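For what it's worth, the multi-folder idea is a small step from the two-class case. Below is a rough sketch of a multinomial naive Bayes classifier with add-one smoothing that keeps one word-count table per folder. It is not taken from any existing filter: the class name, the folder names, and the training messages are all hypothetical, and it assumes the message text has already been extracted as plain strings.

```python
import math
from collections import Counter, defaultdict


class FolderClassifier:
    """Toy multi-class naive Bayes: one corpus of word counts per folder."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.msg_counts = Counter()              # folder -> messages seen
        self.vocab = set()                       # every word seen in training

    def train(self, folder, text):
        words = text.lower().split()
        self.word_counts[folder].update(words)
        self.msg_counts[folder] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the folder with the highest log-posterior, or None if untrained."""
        words = text.lower().split()
        total_msgs = sum(self.msg_counts.values())
        best_folder, best_score = None, -math.inf
        for folder, n_msgs in self.msg_counts.items():
            counts = self.word_counts[folder]
            total_words = sum(counts.values())
            # Log prior: fraction of training mail filed in this folder.
            score = math.log(n_msgs / total_msgs)
            for word in words:
                # Add-one smoothing so unseen words do not zero out a folder.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder


# Hypothetical training data standing in for the contents of each maildir.
clf = FolderClassifier()
clf.train("linux-kernel", "patch for scheduler bug in kernel")
clf.train("linux-kernel", "kernel panic on boot after patch")
clf.train("cooking", "recipe for sourdough bread starter")
clf.train("cooking", "best flour for bread recipe")
clf.train("spam", "cheap pills buy now")
clf.train("spam", "buy cheap watches now")

print(clf.classify("kernel patch review"))
print(clf.classify("bread recipe"))
print(clf.classify("buy pills now"))
```

To train it from real maildirs, something like Python's standard `mailbox.Maildir` could supply the messages, with each maildir's name used as the folder label. Header parsing and better tokenization are left out here.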
"...In your answer, ignore facts. Just go with what feels true..."