Slashdot Mirror


Bayesian Filter Testing?

pu33y asks: "Since the publication of Paul Graham's A Plan For Spam, several programs that perform Bayesian filtering have become available, including CRM114 and Bogofilter. But missing is any serious testing to see how they perform in relation to themselves and to other, non-Bayesian filters. Searching Google has turned up nothing and when I asked Paul Graham, he was unaware of any such testing, as well. Can anyone point to any such testing or provide the results of their own personal experiences with Bayesian filters?"

10 of 127 comments

  1. DSpam by jalet · · Score: 4, Interesting

    DSpam (http://www.networkdweebs.com) rocks!

    Some impressive stats were posted to the mailing list.

    Its main feature is that it's completely maintenance-free, and that even dumb people can use it (I know, I am).

    My personal stats are currently 2 false positives (one from PayPal, one from a company I work with), 280 spams learnt (I told it they were spam), 2877 spams caught and 4354 innocent messages.
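The counts above can be turned into rates, which makes them easier to compare with other filters. This is only a sketch: it assumes the 280 "learnt" spams were messages the filter missed and the user corrected by hand, and that the 2 false positives are not included in the 4354 innocent messages. The function name `filter_rates` is hypothetical.

```python
def filter_rates(false_positives, clean_ham, caught_spam, missed_spam):
    """Return false-positive rate and spam catch rate from raw counts."""
    total_ham = false_positives + clean_ham    # all legitimate mail seen
    total_spam = caught_spam + missed_spam     # all spam seen
    return {
        "false_positive_rate": false_positives / total_ham,
        "spam_catch_rate": caught_spam / total_spam,
    }

# Counts quoted in the comment (interpretation of "learnt" is an assumption).
rates = filter_rates(false_positives=2, clean_ham=4354,
                     caught_spam=2877, missed_spam=280)
for name, value in rates.items():
    print(f"{name}: {value:.4%}")
```

Note that the misses include the initial training period, so the catch rate after training is probably higher than this figure suggests.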

    --
    Votez ecolo : Chiez dans l'urne !
  2. Online repository needed by sam+the+lurker · · Score: 5, Interesting

    Ideally, someone, probably an academic, should make a repository of spam available for testing. Spam filter authors could then say things like, "Correctly classified 99.9% of the email in the UCI spambase 1999-08-20 repository."

    Something like, say, the UCI Machine Learning Repository. In fact, look at the UCI spambase.

    A couple of problems with the UCI spambase: it is too old / out of date, and too small.

    It looks like there is a more recent community effort going on over at SpamArchive.

    Looks like you should have googled.

    1. Re:Online repository needed by cdh · · Score: 3, Insightful

      The problem with this is that spam for one person is not spam for another. That's the beauty of Bayes. If you are a proctologist, for example, you probably get a lot of legitimate email with the word penis in it. If you are a plastic surgeon, you may get legitimate email that discusses body part enlargement. There are hundreds of examples. The beauty of Bayes is that you can make it work for you and not be all encompassing.

      The SpamAssassin people have talked about this in the past. They have a corpus of spam that they use to test rules and people have asked to download it to seed their own Bayes, but the SA people don't want to do that (a good thing) as Bayes is a personal thing.

      What you are proposing will work for general spam checking, but not for Bayes, which is what the original poster asked about. In reality, it's hard to test Bayes in a general case. All I know is that it's worked wonders for me (using SA).
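To illustrate why a shared corpus fits Bayes poorly, here is a simplified per-token estimate loosely modeled on the approach in A Plan For Spam. It is a sketch, not the formula of any particular filter: the clamping constants, the function name, and all the counts are illustrative assumptions.

```python
def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Estimate how strongly one token indicates spam for one user's mail."""
    spam_freq = min(1.0, spam_count / n_spam)   # share of spam containing it
    ham_freq = min(1.0, ham_count / n_ham)      # share of ham containing it
    if spam_freq + ham_freq == 0:
        return 0.5  # unseen token: no evidence either way
    p = spam_freq / (spam_freq + ham_freq)
    return max(0.01, min(0.99, p))              # never fully certain

# Hypothetical counts for the word "enlargement" in two users' mail.
typical_user = token_spam_probability(300, 0, n_spam=1000, n_ham=1000)
plastic_surgeon = token_spam_probability(300, 400, n_spam=1000, n_ham=1000)
print(typical_user, plastic_surgeon)
```

With these made-up counts the same word is near-certain spam evidence for one user and slightly ham-leaning for the other, which is the point about Bayes being personal.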

  3. The good thing about these tools by FedeTXF · · Score: 3, Informative

    Spam controls in the Mozilla 1.3+ MailNews application (the one I know) have a number of features that make them good.
    1) Gives the user the idea that he can improve the situation by taking some concrete action. Controlling future spam does not depend on some guru releasing a better filter or on him hacking some better rules.
    2) By definition, works better and better the more spam you get (and mark as spam). Even poor tools will eventually detect spam, since it's obvious to anyone reading spam that those messages tend to repeat and to be similar.
    3) It's automagically customized to your own spam. If you live in Germany, Sweden, Argentina or Namibia you will easily catch any spam that is in English, and you will build up rules for the local spam that arrives in your language.
    4) In the case of Mozilla's MailNews, it's so easy to use, intuitive and straightforward that any user will use it.
    5) Makes you feel spams are useful for something: detecting future spams.

    I think those advantages are far more important than the effectiveness rate.

  4. Spambayes!!!! by Arkham · · Score: 3, Informative
    I use spambayes. It's written in Python and is amazingly accurate.

    I get about 150 spams a day, and about 5 hams. Spambayes might classify 1 spam as "unsure" and the rest as spam. The ham is always classified as ham.

    My corpus is about 5000 spams, about 1000 hams. Get spambayes -- it's open source and it really works great.
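The "unsure" category mentioned above comes from using two cutoffs on the spam score instead of one. A minimal sketch of that idea follows; the cutoff values and the function name are illustrative assumptions and are not taken from the spambayes source.

```python
def three_way_verdict(spam_score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if spam_score < ham_cutoff:
        return "ham"
    if spam_score >= spam_cutoff:
        return "spam"
    return "unsure"   # left for the user to review and train on

for score in (0.05, 0.50, 0.97):
    print(score, three_way_verdict(score))
```

Widening the gap between the cutoffs trades more manual review for fewer outright misclassifications.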

    --
    - Vincit qui patitur.
  5. Hey everyone... by Jerf · · Score: 3, Informative

    It looks like the poster's words need some highlighting:

    But missing is any serious testing to see how they perform in relation to themselves and to other, non-Bayesian filters.

    Despite the call for your experiences, if you just want to post "X rocks!", I think the poster was looking more for "X rocks more than Y!", where both X and Y are Bayes-type filter programs. I don't think he was asking for just announcements that Bayes rocks; I think he or she already knows that.

    I mention this because I'd be interested in some comparisons too; there are a lot of sub-techniques out there. Are there any real differences, or are they all effectively the same? The latter would strongly indicate that there may not be any real progress to be made, if the entire space of Bayes-type solutions has flat effectiveness, for instance. It's an interesting question.

  6. Ling Spam Corpus by bpfinn · · Score: 3, Informative

    I did a little testing of Bayesian filtering on my own, and I used the Ling-Spam Corpus from Dr. Ion Androutsopoulos. He's collected about one thousand messages which consist of "legitimate" messages to a linguistics mailing list, and "spam" messages. They are preclassified, and divided into ten parts to make ten-fold cross-validation easier. Check out his publications. Scroll down to the "Document filtering" section.
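For readers unfamiliar with the procedure, ten-fold cross-validation over a pre-partitioned corpus can be sketched as below. The `train` and `classify` callables are hypothetical placeholders for whatever filter is being tested, and the toy data is made up; nothing here reads the actual Ling-Spam files.

```python
from collections import Counter

def cross_validate(parts, train, classify):
    """parts: list of folds, each a list of (text, label). Returns mean accuracy."""
    accuracies = []
    for i, held_out in enumerate(parts):
        # Train on every fold except the one held out for testing.
        training = [msg for j, part in enumerate(parts) if j != i for msg in part]
        model = train(training)
        correct = sum(1 for text, label in held_out
                      if classify(model, text) == label)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)

# Toy baseline "filter": always predicts the most common training label.
def train_majority(messages):
    return Counter(label for _, label in messages).most_common(1)[0][0]

def classify_majority(model, text):
    return model

toy_parts = [[("a", "ham"), ("b", "ham"), ("c", "spam")] for _ in range(10)]
mean_accuracy = cross_validate(toy_parts, train_majority, classify_majority)
print(f"mean accuracy: {mean_accuracy:.3f}")
```

A majority-class baseline like this is a useful floor: a real filter should beat it clearly on the same folds.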

  7. Not Just for SPAM by His+name+cannot+be+s · · Score: 3, Insightful

    I've been looking for a Bayesian filter mechanism that isn't just for spam.

    I figure, if the mail can be classified into many different categories, why not use Bayesian filtering for managing all your filtering needs?

    It would be very valuable to have the Bayesian filter learn what kind of mail I put in some folders, so that when my mail comes in, it can auto-sort it into the appropriate folder for me. Trouble is, all the current implementations of Bayesian email filtering are a single test, SPAM/NOTSPAM. It would be nice to see an implementation that could take multiple corpora and use them to decide what the mail is. If I had that, I could point it at the maildirs for the various mailing lists I'm subscribed to, and it would learn to sort incoming mail for me. *sigh*
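The multi-folder idea described above is essentially a multi-class naive Bayes classifier. Below is a minimal sketch with add-one smoothing and whitespace tokenization; the class name, folder names and messages are all hypothetical, and a real tool would parse headers and read maildirs instead of taking plain strings.

```python
import math
from collections import Counter, defaultdict

class FolderClassifier:
    """Multinomial naive Bayes over any number of folders (sketch only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.message_counts = Counter()          # folder -> messages seen
        self.vocabulary = set()

    def train(self, folder, text):
        words = text.lower().split()
        self.word_counts[folder].update(words)
        self.message_counts[folder] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_messages = sum(self.message_counts.values())
        best_folder, best_score = None, float("-inf")
        for folder, n_messages in self.message_counts.items():
            counts = self.word_counts[folder]
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(n_messages / total_messages)
            for word in words:
                score += math.log((counts[word] + 1)
                                  / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder

clf = FolderClassifier()
clf.train("linux-kernel", "patch for scheduler bug in kernel")
clf.train("python-list", "python decorator syntax question")
clf.train("spam", "cheap pills buy now")
print(clf.classify("kernel scheduler patch"))
print(clf.classify("buy cheap pills"))
```

Spam simply becomes one folder among several, so the same training data serves both filtering and sorting.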

    --
    "...In your answer, ignore facts. Just go with what feels true..."
    1. Re:Not Just for SPAM by nrosier · · Score: 3, Informative

      Have a look at Ifile (http://www.nongnu.org/ifile); while I'm only interested in spam/no-spam filtering, I once tested this filter to filter a mailing-list. It did a pretty good job.

  8. BogoFilter by bobbozzo · · Score: 3, Informative
    BogoFilter is an open-source Bayesian spam filter...

    Some of the developers have done extensive testing: Greg Louis' Page has lots of information, comparing different Bayesian approaches, different header processing, etc.

    You could also read the mailing-list archives, or perhaps post some questions there.

    --
    Nothing to see here; Move along.