Bayesian Filter Testing?

Posted by Cliff on Tuesday July 1, 2003 @10:09PM from the verifying-new-spam-fighting-techniques dept.

pu33y asks: "Since the publication of Paul Graham's A Plan For Spam, several programs that perform Bayesian filtering have become available, including CRM114 and Bogofilter. But missing is any serious testing to see how they perform in relation to each other and to other, non-Bayesian filters. Searching Google has turned up nothing, and when I asked Paul Graham, he was unaware of any such testing as well. Can anyone point to any such testing or provide the results of their own personal experiences with Bayesian filters?"

DSpam by jalet · 2003-07-01 22:25 · Score: 4, Interesting

Dspam (http://www.networkdweebs.com) rocks!

Some impressive stats were posted to the mailing list.

Its main feature is that it's completely maintenance free, and that even dumb people can use it (I know, I am).

My personal stats so far are 2 false positives (one from PayPal, one from a company I work with), 280 spams learnt (I told it they were spam), 2877 spams caught and 4354 innocent.
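
As a rough illustration of how counts like these turn into comparable rates, here is a minimal Python sketch. It assumes that the 280 hand-taught spams were misses (false negatives) and that the 4354 innocent messages do not include the 2 false positives; the post confirms neither, and the function name is made up for the example.

```python
def rates(false_pos, missed_spam, caught_spam, clean_ham):
    """Return (false-positive rate, fraction of spam caught) from raw counts."""
    total_ham = false_pos + clean_ham        # assumption: FPs counted separately
    total_spam = missed_spam + caught_spam   # assumption: "learnt" spams were misses
    return false_pos / total_ham, caught_spam / total_spam


fp_rate, recall = rates(false_pos=2, missed_spam=280,
                        caught_spam=2877, clean_ham=4354)
print(f"false-positive rate: {fp_rate:.4%}, spam caught: {recall:.1%}")
```

Reporting both numbers matters because a filter can trade one for the other.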

--
Votez ecolo : Chiez dans l'urne !
Online repository needed by sam+the+lurker · 2003-07-02 00:43 · Score: 5, Interesting

Ideally, someone, probably an academic, should make a repository of spam available for testing. Spam filter authors could then say things like, "Correctly classified 99.9% of the email in the UCI spambase 1999-08-20 repository."

Something like, say, the UCI Machine Learning Repository. In fact, look at the UCI spambase. There are a couple of problems with the UCI spambase, though: it is too old / out of date, and it is too small.

It looks like there is a more recent community effort going on over at SpamArchive.

Looks like you should have googled.
1. Re:Online repository needed by cdh · 2003-07-02 02:38 · Score: 3, Insightful
  
  The problem with this is that spam for one person is not spam for another. That's the beauty of Bayes. If you are a proctologist, for example, you probably get a lot of legitimate email with the word penis in it. If you are a plastic surgeon, you may get legitimate email that discusses body part enlargement. There are hundreds of examples. The beauty of Bayes is that you can make it work for you and not be all encompassing.
  
  The SpamAssassin people have talked about this in the past. They have a corpus of spam that they use to test rules and people have asked to download it to seed their own Bayes, but the SA people don't want to do that (a good thing) as Bayes is a personal thing.
  
  What you are proposing will work for general spam checking, but not for Bayes, which is what the original poster asked about. In reality, it's hard to test Bayes in a general case. All I know is that it's worked wonders for me (using SA).
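
To make the "Bayes is a personal thing" point concrete, here is a minimal Python sketch of a per-user filter. It is a plain naive Bayes with add-one smoothing, not Paul Graham's exact formula and not SpamAssassin's implementation; the class name, tokenizer, and training messages are all invented for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Distinct lowercase word tokens in a message."""
    return set(re.findall(r"[a-z$']+", text.lower()))


class PersonalBayes:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        # Work in log-odds; add-one smoothing avoids zero probabilities.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Numerically stable logistic function.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        odds = math.exp(log_odds)
        return odds / (1 + odds)


# Two users receive the same spam but have different legitimate mail.
surgeon, other = PersonalBayes(), PersonalBayes()
for user in (surgeon, other):
    user.train("cheap enlargement pills now", "spam")
surgeon.train("patient asks about enlargement surgery", "ham")
surgeon.train("enlargement consultation schedule", "ham")
other.train("meeting schedule for monday", "ham")
other.train("lunch on friday", "ham")

surgeon_score = surgeon.spam_probability("enlargement")
other_score = other.spam_probability("enlargement")
```

With this toy data the same one-word message scores below 0.5 for the surgeon and above 0.5 for the other user, which is why a shared corpus cannot fully stand in for personal training.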
Ella: OpenField Software by biodork · 2003-07-02 00:44 · Score: 2, Interesting

I use Ella from OpenField Software. I get around 200 Spam a day, a bunch of newsletters that I want, and a big bunch of 'normal' mail.

I have had it for about 2 weeks. In the last 3 days I have had 2 false positives (messages in Spam that shouldn't be there) and 4 that went to the newsletter folder that shouldn't have.

--
Gavin Fischer
The good thing about these tools by FedeTXF · 2003-07-02 01:10 · Score: 3, Informative

Spam controls in the Mozilla 1.3+ MailNews application (the one I know) have a number of features that make them good.
1) Gives the user the idea that he can improve the situation by taking some concrete action. Controlling future spam does not depend on some guru releasing a better filter or on the user hacking together better rules.
2) By definition, works better and better the more spam you get (and mark as spam). Even poor tools will eventually detect spam, since it's obvious to anyone reading spam that those messages tend to repeat and to be similar.
3) It's automagically customized to your own spam. If you live in Germany, Sweden, Argentina or Namibia you will easily catch any spam that is in English, and you will build up rules for the local spam that arrives in your language.
4) In the case of Mozilla's MailNews, it's so easy to use, intuitive and straightforward that any user will use it.
5) Makes you feel spams are useful for something: detecting future spams.

I think those advantages are far more important than the raw effectiveness rate.
Spambayes!!!! by Arkham · 2003-07-02 02:07 · Score: 3, Informative

I use spambayes. It's written in python and is amazingly accurate.
I get about 150 spams a day, and about 5 hams. Spambayes might classify 1 spam as "unsure" and the rest as spam. The ham is always classified as ham.
My corpus is about 5000 spams, about 1000 hams. Get spambayes -- it's open source and it really works great.
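
The ham / unsure / spam split described above can be sketched as a pair of score cutoffs. This is a minimal Python illustration, not SpamBayes' actual code; the function name and the cutoff values 0.20 and 0.90 are assumptions chosen for the example.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0, 1] to one of three labels.

    The cutoffs are illustrative assumptions, not confirmed defaults.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


labels = [classify(s) for s in (0.01, 0.55, 0.99)]
```

The middle band gives the filter a way to decline a decision, so borderline mail goes to a human instead of becoming a false positive.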

--
- Vincit qui patitur.
Hey everyone... by Jerf · 2003-07-02 02:43 · Score: 3, Informative

It looks like the poster's words need some highlighting:

But missing is any serious testing to see how they perform in relation to themselves and to other, non-Bayesian filters.

Despite the call for your experiences, if you just want to post "X rocks!", I think the poster was looking more for "X rocks more than Y!", where both X and Y are Bayes-type filter programs. I don't think he was asking for just announcements that Bayes rocks; I think he or she already knows that.

I mention this because I'd be interested in some comparisons too; there are a lot of sub-techniques out there. Are there any real differences, or are they all effectively the same? The latter would strongly indicate that there may not be any real progress to be made, if the entire space of Bayes-type solutions has flat effectiveness, for instance. It's an interesting question.
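
An "X versus Y" comparison needs both filters run over the same labeled messages and scored the same way. Here is a minimal Python sketch of such a harness; the two toy filters and the four-message corpus are invented stand-ins, not real products or real data.

```python
def evaluate(filter_fn, labeled_messages):
    """Count false positives and false negatives for one filter."""
    false_pos = false_neg = 0
    for text, is_spam in labeled_messages:
        flagged = filter_fn(text)
        if flagged and not is_spam:
            false_pos += 1
        elif not flagged and is_spam:
            false_neg += 1
    return {"false_positives": false_pos, "false_negatives": false_neg}


def keyword_filter(text):
    # Hypothetical filter X: flags one fixed keyword.
    return "viagra" in text.lower()


def shouting_filter(text):
    # Hypothetical filter Y: flags mostly upper-case messages.
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.5


corpus = [
    ("Cheap VIAGRA here", True),
    ("BUY NOW LIMITED OFFER", True),
    ("Lunch on Friday?", False),
    ("URGENT: SERVER DOWN", False),
]
results = {f.__name__: evaluate(f, corpus)
           for f in (keyword_filter, shouting_filter)}
```

Swapping real filters in for the two toy functions keeps the scoring identical, which is the part informal "it works for me" reports cannot provide.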
Mozilla's Junk-mail Filters by asa · 2003-07-02 04:09 · Score: 2, Informative

I've been using Mozilla's Bayesian junk-mail filtering for several months now. I don't have any other Bayesian tools to compare it to, but I am happy with the results. Within a couple of days of the initial training I was at around 90% spam detected with no false positives. Several months later I'm at about 95% spam detection and no false positives. While the last 5% would be nice to kill, I'm quite satisfied with how effective Mozilla's system is, and as long as it holds steady (or gets better) I've got no reason to look for any other solution.

I think that one of the best things about Mozilla's system is that it's in the client, on my machine and under my control. While server-side solutions, distributed corpus tools, etc. might be more accurate, not ever having to install or update any 3rd-party apps is really nice.

--Asa
Ling Spam Corpus by bpfinn · 2003-07-02 05:11 · Score: 3, Informative

I did a little testing of Bayesian filtering on my own, and I used the Ling-Spam Corpus from Dr. Ion Androutsopoulos. He's collected about one thousand messages, which consist of "legitimate" messages to a linguistics mailing list and "spam" messages. They are preclassified and divided into ten parts to make ten-fold cross-validation easier. Check out his publications. Scroll down to the "Document filtering" section.
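
For readers unfamiliar with ten-fold cross-validation, here is a minimal Python sketch of how ten pre-divided parts are used: each part is held out once for testing while the other nine are used for training. The function name and placeholder message IDs are assumptions; loading the actual Ling-Spam files is not shown.

```python
def cross_validation_splits(parts):
    """Yield (training, held_out) pairs, holding out each part once."""
    for i, held_out in enumerate(parts):
        training = [msg
                    for j, part in enumerate(parts) if j != i
                    for msg in part]
        yield training, held_out


# Placeholder data: ten parts of three message IDs each.
parts = [[f"msg-{i}-{k}" for k in range(3)] for i in range(10)]
splits = list(cross_validation_splits(parts))
```

Averaging the error rate over all ten runs gives a steadier estimate than a single train/test split on a corpus this small.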
Not Just for SPAM by His+name+cannot+be+s · 2003-07-02 05:27 · Score: 3, Insightful

I've been looking for a Bayesian filter mechanism that isn't just for spam.

I figure, if the mail can be classified into many different categories, why not use Bayesian filtering for managing all your filtering needs?

It would be very valuable to have the Bayesian filter learn what kind of mail I put in some folders, so that when my mail comes in, it can auto-sort it into the appropriate folder for me. Trouble is, all the current implementations of Bayesian email filtering perform a single test: SPAM/NOTSPAM. It would be nice to see an implementation that could take multiple corpora and use them to decide what the mail is. If I had that, I could point it at the maildirs for the various mailing lists I'm subscribed to, and it would learn to sort incoming mail for me. *sigh*
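
The multi-folder idea can be sketched as a naive Bayes classifier with one word-count table per folder instead of just two. This is a minimal Python illustration with add-one smoothing; the class name, folder names, and training text are hypothetical, and reading real maildirs is left out.

```python
import math
import re
from collections import Counter, defaultdict


def words(text):
    return re.findall(r"[a-z']+", text.lower())


class FolderSorter:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word frequencies
        self.message_counts = Counter()          # folder -> messages seen
        self.vocabulary = set()

    def train(self, folder, text):
        seen = words(text)
        self.word_counts[folder].update(seen)
        self.message_counts[folder] += 1
        self.vocabulary.update(seen)

    def classify(self, text):
        """Return the most probable folder, or None if untrained."""
        total = sum(self.message_counts.values())
        known = [w for w in words(text) if w in self.vocabulary]
        best_folder, best_score = None, float("-inf")
        for folder, n_messages in self.message_counts.items():
            counts = self.word_counts[folder]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_messages / total)
            for w in known:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder


sorter = FolderSorter()
sorter.train("kernel-list", "patch for the scheduler kernel")
sorter.train("python-list", "list comprehension syntax question")
guess = sorter.classify("kernel scheduler patch review")
```

Spam filtering is then just the special case where the only two folders are spam and not-spam.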

--
"...In your answer, ignore facts. Just go with what feels true..."
1. Re:Not Just for SPAM by nrosier · 2003-07-02 06:36 · Score: 3, Informative
  
  Have a look at Ifile (http://www.nongnu.org/ifile); while I'm only interested in spam/no-spam filtering, I once tested this filter to filter a mailing-list. It did a pretty good job.
BogoFilter by bobbozzo · 2003-07-02 06:43 · Score: 3, Informative

BogoFilter is an open-source bayesian spam filter...
Some of the developers have done extensive testing: Greg Louis' Page has lots of information, comparing different bayesian approaches, different header processing, etc.
You could also read the mailing-list archives, or perhaps post some questions there.

--
Nothing to see here; Move along.
Re:Ja rulez by Blkdeath · 2003-07-02 07:32 · Score: 2, Informative

The problem is, even with Bayesian techniques, there is no way to guarantee that only spam was sorted out. I highly suggest a white list, in addition to filters, as the only way of ensuring that at least known mail is always received.

With Mozilla, you get the best of both worlds. You've got Bayesian filtering with an optional whitelist component. You can select any of your address books as the source of your whitelist (default is "Personal Addresses"), so any of your friends can send you all the SPAM they want without being caught. ;)
Being optional, you can choose to disable it if, say, your friends' addresses have been harvested for "Joe Job" SPAM runs. (I know one or two of mine have).
I've actually used the whitelist to my advantage when I requested a sample of a particular new type of SPAM from a friend so I could watch for it and mark it if Mozilla missed it.
Which brings me to the other big advantage of Mozilla/Bayesian; when SPAMmers adapt, so does it. New SPAM type? Click the trash can and it'll go away.
Nothing can really be a perpetual 100% guarantee of blocking SPAM, but IME, Bayesian filters are the best possible solution we have right now and that's why I emphatically recommend them to all my friends, family, and customers.
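
The whitelist-plus-Bayes combination described above can be sketched as a check on the sender before the score is consulted. This is a minimal Python illustration, not Mozilla's code; the function name, the 0.9 threshold, and the sample addresses are assumptions.

```python
from email.utils import parseaddr


def is_junk(from_header, bayes_score, whitelist,
            use_whitelist=True, threshold=0.9):
    """Return True if the message should go to the junk folder.

    `whitelist` is a set of lowercase addresses (e.g. an address book).
    """
    _, address = parseaddr(from_header)
    if use_whitelist and address.lower() in whitelist:
        return False  # known senders bypass the Bayesian score
    return bayes_score >= threshold


address_book = {"alice@example.org"}
friend = is_junk("Alice <Alice@Example.org>", 0.99, address_book)
stranger = is_junk("Bob <bob@example.net>", 0.99, address_book)
forged = is_junk("Alice <Alice@Example.org>", 0.99, address_book,
                 use_whitelist=False)
```

The `use_whitelist` switch mirrors the point about forged sender addresses: once a friend's address is being abused, turning the whitelist off lets the score decide again.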

--
BD Phone Home!
Shameless plug. Like you weren't expecting it.
Try here by drew_kime · 2003-07-02 08:32 · Score: 2, Informative

From here:

I've been tracking email spam trends for a while; my personal accounts have gone from 3-6 spams daily in 2001 to about 30 spams daily at present. I filter this with SpamAssassin, so the inbox impact is pretty slight, but the traffic is becoming significant, and the trend (doubling in four months) is downright troubling.
Graphs, methodology, links to more stats.
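
To see what "doubling in four months" implies, here is a small Python projection. It assumes the doubling rate stays constant, which the quoted figures do not establish; the function name is made up for the example.

```python
def projected_daily_spam(current_per_day, months_ahead, doubling_months=4):
    """Extrapolate daily spam volume under a constant doubling time."""
    return current_per_day * 2 ** (months_ahead / doubling_months)


in_one_year = projected_daily_spam(30, 12)
print(f"about {in_one_year:.0f} spams per day after 12 months")
```

Starting from 30 a day, three doublings in a year would mean roughly eight times the current volume.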

--
Nope, no sig
my simple filter by Xtifr · 2003-07-02 08:43 · Score: 2, Interesting

For years, the only spam filter I used was a very simple one: if the mail's not from a list I'm on, and not addressed to me, it's spam. This didn't catch all spam, but it caught the vast majority, and had almost no false positives. (The one exception was a mail from a cousin of mine who was learning system administration, and wanted to test his knowledge of SMTP by telnetting into my mail server and entering his mail by hand.)

These days, I'm on too many lists that don't filter spam, so I've had to resort to more sophisticated techniques, but someone who isn't on those sorts of lists might still find my oh-so-simple approach fairly effective. Not to disparage Bayesian filtering, but if you want something to compare against...
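
As a baseline to compare against, the rule above might look like this in Python. The post does not say how list mail was recognized, so matching on the List-Id header is an assumption, as are the function name and the example addresses.

```python
from email import message_from_string
from email.utils import getaddresses


def looks_like_spam(raw_message, my_addresses, list_ids):
    """Spam unless the mail is addressed to me or comes from a known list."""
    msg = message_from_string(raw_message)
    recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _, addr in getaddresses(recipient_headers)}
    addressed_to_me = bool(recipients & my_addresses)
    # Assumption: list traffic is identified by its List-Id header.
    list_header = msg.get("List-Id", "")
    from_known_list = any(list_id in list_header for list_id in list_ids)
    return not (addressed_to_me or from_known_list)


me = {"me@example.org"}
lists = {"devel.lists.example.org"}

direct = "From: a@example.com\nTo: me@example.org\nSubject: hi\n\nhello"
listed = ("From: b@example.com\nTo: devel@lists.example.org\n"
          "List-Id: <devel.lists.example.org>\nSubject: patch\n\nbody")
stray = "From: c@example.com\nTo: someone@example.net\nSubject: deal\n\nbuy"

verdicts = [looks_like_spam(m, me, lists) for m in (direct, listed, stray)]
```

A rule this cheap makes a useful control in any comparison, since a Bayesian filter has to beat it to justify the training effort.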
Re:PC mag test results by drfreak · 2003-07-02 13:19 · Score: 2, Funny

blocked 22 of 29 spam messages, and only legitimate e-mail ended up in their spam folder

Sounds like an ideal mail filter to me!