Seven Spam Filters Compared

Unadvertised by Anonymous Coward · 2003-08-23 06:56 · Score: 5, Funny

Sounds great, but until I hear about software products like these in my morning mailbox, I don't really trust that they're any good.

Good testing, but not enough samples by TexTex · 2003-08-23 07:00 · Score: 4, Informative

The author makes a good attempt at comparing these products, but I don't think his samples are in-depth enough to come up with real-world results.

For Bayes testing, he used 68 spam and 68 ham messages. SpamAssassin, for one, won't even activate Bayes until it has learned from 200 messages; it's not uncommon for those who regularly deal with spam management on the server side to use corpora of 5,000-10,000 messages to test new rule additions and to train the filter.

The low number might have a slight effect if most of your mail contains similar characteristics, but I'd much rather have seen bigger numbers of samples.

--
-Barkeep, a draft of your most hazardous brew, for the world is slowly stepping into focus, and I don't like what I see.

Re:Good testing, but not enough samples by cly · 2003-08-23 07:13 · Score: 5, Informative

I guess you wrote this after reading the first two experiments.

In the third he used 1200.

Nice way to jump the gun.
Re:Good testing, but not enough samples by Sanctuary · 2003-08-23 07:18 · Score: 4, Insightful

They didn't train SpamAssassin to use the Bayes filter once during the test, and they used it without all the other scoring tools for SpamAssassin. This review really didn't test SpamAssassin's full potential.
Re:Good testing, but not enough samples by arth1 · 2003-08-23 08:03 · Score: 4, Informative

I guess you wrote this after reading the first two experiments.

In the third he used 1200.

1273, out of which 1073 were spam. That leaves 200 non-spam messages, which isn't enough for SpamAssassin's Bayesian filtering to kick in, even if all messages were to be classified as ham or spam, and not just let through.

To quote sa-learn's man page:
Another thing to be aware of, is that typically you should aim to train with at least 1000 messages of spam, and 1000 ham messages, if possible. More is better, but anything over about 5000 messages does not improve accuracy significantly in our tests.
The low number of emails, combined with no apparent manual reading on part of the author, makes me want to disregard this whole survey as pure drivel.

Regards,
--
*Art
Re:Good testing, but not enough samples by hamster+foo · 2003-08-23 09:03 · Score: 5, Insightful

"Also, SpamAssassin has a Bayesian classifier built in, but it wasn't used in these tests, since having five was enough."

While I'm sure the recommendations set forth in Spam Assassin's man page are probably a good idea for all Bayesian training sets, he wasn't using the Bayesian filtering included in Spam Assassin, so you can't really fault him for not reading a section of the man page for a feature he was choosing to leave out.

It would have been nice to see him turn on Spam Assassin's Bayesian filtering at least in some of the tests. I don't think test results with a feature turned off that I would imagine the vast majority of users would use make a very good comparison of the different packages' abilities.

--
- b

Mozilla? by HBI · 2003-08-23 07:02 · Score: 4, Insightful

I have seen at least two of these comparisons and no one seems to want to roll Mozilla's spam filter into the mix and compare it. Therefore, the comparisons are kind of useless to me. I am guessing I am not the only person using Moz either, for specifically this reason (ease of use for Bayesian filtering).

What's up with that? I know it's not a proxy, so the methodology is different than most of the products in the comparison. I'm very interested in how well the filter works however, compared to these other products.

--
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.

Re:Mozilla? by wilfie · 2003-08-23 07:25 · Score: 5, Insightful

The loss of bandwidth is not the main cost of spam these days. Certainly not internal bandwidth between our mail server and desktops. The excellent features of doing it on my desktop are that the filter is learning about what _I_ consider to be spam and ham, and that I have the stuff that's classified as spam to hand and can check it through once in a while. So far for me it's only thrown false positives when colleagues have sent stuff that was spammy in content. I have a presentiment that our CEO's habit of writing in red HTML (full of ff0000) will cause a false hit one day.
Re:Mozilla? by hdw · 2003-08-23 07:39 · Score: 5, Insightful

Most people can't filter their email at the server, since most people don't have access to a server to filter at.

So the majority has to filter locally, either in the client or with a local pop/imap proxy (like PopFile).

// hdw

--
Executive Pope (small) Kallisti Engineering

OT: Disturbing? by Lead+Butthead · 2003-08-23 07:08 · Score: 4, Insightful

Does anyone find it disturbing that --

a. Spam Filter software company is now a "viable business."
b. Spam Filter is needed AT ALL?

--
ELOI, ELOI, LAMA SABACHTHANI!?

SpamAssassin had Bayesian turned off?! by SuperBanana · 2003-08-23 07:25 · Score: 4, Insightful

I noticed immediately that the author turned off SpamAssassin's Bayesian filter, claiming "it already has 5 points, that's enough". WTF does that mean? The whole point of SpamAssassin is to run a lot of tests, add the scores together, and then let you set the threshold you want (something he also doesn't modify; I changed my threshold after looking at the scores spams were getting and such).
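The score-and-threshold idea described above can be sketched roughly as follows. This is only an illustration: the rule names and scores are hypothetical (they are not real SpamAssassin rules or weights), and the 5.0 default threshold and the "meets or exceeds" comparison are assumptions.

```python
# Illustrative sketch only: rule names and scores are hypothetical,
# not real SpamAssassin rules or weights.

def classify(rule_scores, threshold=5.0):
    """Sum the scores of every rule that fired and compare to the threshold.

    Assumes a message counts as spam when its total meets or exceeds
    the threshold.
    """
    total = sum(rule_scores.values())
    return total, total >= threshold

# Hypothetical rules that fired on one borderline message.
hits = {
    "html_only_body": 1.5,
    "suspicious_sender_address": 1.2,
    "misconfigured_client_headers": 0.9,
}

default_total, default_is_spam = classify(hits)                  # threshold 5.0
lowered_total, lowered_is_spam = classify(hits, threshold=3.0)   # lowered by 2 points

print(f"total score: {default_total:.1f}")
print(f"spam at threshold 5.0? {default_is_spam}")
print(f"spam at threshold 3.0? {lowered_is_spam}")
```

The same message lands on different sides of the line depending on the threshold, which is why a comparison that never touches the threshold only tests one point on the tradeoff.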

I trained SA's Bayesian filter off of about 3 years of spam and legitimate email sent directly to me. SA as a whole is working nearly flawlessly; the only messages it has tagged as spam were those from users with improperly configured email clients AND suspicious email addresses AND using only HTML. I.e., a message that would damn well look like spam. However, like I said, I lowered SA's threshold by 2 points because I was having too many false positives (that was before I had properly trained the Bayesian filter, so perhaps I'll kick it up a point now).

One important note: when you get a falsely classified message, it's REALLY important to tell SpamAssassin's Bayesian filter about it. It's as easy as cut+paste if you do sa-learn --spam/--ham --single, hit enter, paste the message, hit Control-D. Done!

--
Please help metamoderate.

What? No PopFile? by MrEnigma · 2003-08-23 07:28 · Score: 4, Interesting

They started off by quoting John Graham-Cumming, yet they didn't include his brainchild PopFile.

Check it out Here.

--
GeekWares - Buy and Download Today!

What About PopFile by MBCook · 2003-08-23 07:29 · Score: 4, Informative

What about PopFile? I've tried SpamAssassin and a few others, and I like PopFile the best. After a little training it's EXTREMELY accurate. It survived the deluge of mail I've gotten in the last few days (due to viruses) with flying colors.

According to its internal statistics, it has classified 2821 messages as of the time I type this. It has made only 95 errors (often close calls, so I don't blame it). That puts it at an accuracy of 96.63%. For the record, of the e-mail I've gotten, it's 308 messages of ham, 2513 spam.
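For anyone who wants to check the arithmetic, here is a small sketch using only the counts quoted above. The spam-share line is an added point of comparison, not something PopFile reports.

```python
# Counts quoted in the comment above.
total_messages = 2821
errors = 95
ham = 308
spam = 2513

# The ham and spam counts should add up to the total.
assert ham + spam == total_messages

accuracy_percent = round((total_messages - errors) / total_messages * 100, 2)

# Share of the mail that was spam: the score a filter would get on this
# mix by labelling every message as spam.
spam_share_percent = round(spam / total_messages * 100, 2)

print(f"accuracy:   {accuracy_percent}%")
print(f"spam share: {spam_share_percent}%")
```

With a mix this lopsided, the overall accuracy says little about how the ham is treated, so the split of the 95 errors between false positives and false negatives would be the more telling number.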

I have only been using PopFile since June 7th of this year, but it's working fantastic. The only thing I've used that's this good was Cloudmark's SpamNet, who stabbed the community in the back, so I switched to something else. I'm glad I've found PopFile, and I suggest you try it too if you're looking for something good.

--
Comment forecast: Bits of genius surrounded by a sea of mediocrity.

PSAM by po8 · 2003-08-23 07:29 · Score: 4, Informative

See our PSAM project site for a refereed paper evaluating several machine learning spam filtering techniques (although not specific filters). This site also contains large standardized corpora for evaluation. The paper contains a number of tips on evaluating ML spam filters.

The /.-referenced article has some good ideas about evaluation. I particularly liked the explicit discussion of the false positives. The recommendations at the end are excellent. On the other hand, the evaluation isn't across a broad or obviously representative corpus, many of the tests are a bit odd, and the ROC tradeoffs are not discussed. In particular, the evaluation set for the tests did not include enough ham to be able to accurately estimate the false positive rate: consider what would happen to the precision estimates if 0.5 were added to each of the numbers in the false positive table.
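The "add 0.5" thought experiment can be tried with a few lines of code. This is a sketch under stated assumptions: the 200 ham messages is the figure quoted elsewhere in this thread for the third experiment, while the spam-caught count and the false-positive counts are hypothetical and are not taken from the article's tables.

```python
def false_positive_rate(false_positives, ham_total):
    """Fraction of ham messages wrongly flagged as spam."""
    return false_positives / ham_total


def precision(true_positives, false_positives):
    """Fraction of flagged messages that really were spam."""
    return true_positives / (true_positives + false_positives)


HAM_TOTAL = 200      # ham count quoted in this thread for the third test
SPAM_CAUGHT = 1000   # hypothetical: spam correctly flagged by some filter

for observed_fp in (0, 1, 2):  # hypothetical false-positive counts
    adjusted_fp = observed_fp + 0.5
    print(
        f"observed FP={observed_fp}: "
        f"FP rate {false_positive_rate(observed_fp, HAM_TOTAL):.4f} -> "
        f"{false_positive_rate(adjusted_fp, HAM_TOTAL):.4f}, "
        f"precision {precision(SPAM_CAUGHT, observed_fp):.5f} -> "
        f"{precision(SPAM_CAUGHT, adjusted_fp):.5f}"
    )
```

With only a couple of hundred ham messages, a shift of half a message moves the estimated false positive rate by a quarter of a percentage point, which is the same order as the differences one would want to measure between filters.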

Overall, though, this was an interesting evaluation, and I'm glad that the author published it.

Re:Spamassassin and Bayes? by numbski · 2003-08-23 07:36 · Score: 4, Informative

Yup. I use it all the time. Save up spam and ham in separate folders. Then do this:

sa-learn --spam --mbox ~/mail/myspamfolder
sa-learn --ham --mbox ~/mail/myhamfolder

As I get more spam, I set it aside into a folder, and in tcsh I have this alias set:

alias spamadd 'sa-learn --spam --mbox ~/mail/got-through && rm ~/mail/got-through && touch ~/mail/got-through'

--

Karma: Chameleon (mostly due to the fact that you come and go).

Use Spam Filters To Enlarge Your Penis by Tablizer · 2003-08-23 07:37 · Score: 5, Funny

That's right! Our company has found a high-tech way to use various anti-spam tools to enlarge your penis. My pennis is noww sso lrage that i Cannnot type curretcly. Itt gtes in teh way.

Please visit www.spamfilters2enlarge.com

Act before midnight and get a $30 discount.

--
Table-ized A.I.

WRONG. by imsabbel · 2003-08-23 07:38 · Score: 5, Informative

Of course your Bayesian filter will QUICKLY learn that HTML tags that create invisible text are VERY common in spam and nowhere else -> problem solved.
Don't forget that the filter sees more than the eye...
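A rough sketch of why that works: a token that shows up often in spam and almost never in ham ends up with a spam probability close to 1. The formula below is a simplified per-token estimate in the spirit of Bayesian filtering, not the exact formula of any particular filter, and all counts are hypothetical.

```python
def token_spam_probability(spam_hits, ham_hits, spam_total, ham_total):
    """Simplified per-token spam estimate from message frequencies.

    spam_hits / ham_hits: messages of each kind containing the token.
    spam_total / ham_total: size of each training set.
    """
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical counts: markup that hides text appears in 400 of 1000 spam
# messages but in only 1 of 1000 ham messages.
hidden_text_markup = token_spam_probability(400, 1, 1000, 1000)

# Hypothetical counts for an ordinary word seen about equally in both.
ordinary_word = token_spam_probability(300, 310, 1000, 1000)

print(f"hidden-text markup: {hidden_text_markup:.4f}")
print(f"ordinary word:      {ordinary_word:.4f}")
```

The trick meant to fool the filter becomes one of its strongest spam indicators, because the filter scores the raw markup and not just what the reader sees.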

--
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?

A message from a spammer by Anonymous Coward · 2003-08-23 07:58 · Score: 5, Insightful

As a professional sender of UCE, I just want to tell you slashdotters to keep on playing with your spam filters. As long as you use spam filters on your e-mail, I can continue to reach my real intended targets, those non-slashdotters who do not know better and will buy my products or click through to my clients' websites. Your filters really help cut down on the complaints to the internet service providers I do business with, and as long as not too many complaints come in their marketing people assure me we can do business. Of course, I still waste your bandwidth and mailbox capacity, but you no longer complain to uce@ftc.gov, my access providers, or anyone else who might cause me problems. My yahoo and hotmail and other accounts for replies are lasting much longer before getting shut down because someone complained to these service providers. And my clients are even reporting that they can start mailing out 800 numbers like 1-800-901-3719 again and they will not have you damn spammers set up their modems to keep autodialing them, since you spend your own time and effort to filter the e-mail and only clueless users who might actually call see the numbers.

Please don't bother your Congressmen or Senators proposing legislation that might not work 100%. Just keep on filtering the spam I send you, I know you would have never bought from me anyway. That you can filter legitimizes my business and my waste of your bandwidth.

P.S. To be sure of not getting a false positive, be sure to send all filtered mail to a special folder. Waste your storage space storing the mail until you manually go through every piece to be sure you didn't accidentally filter something important. Of course, this will take exactly as much effort as it would have to just check the e-mail when it first came in, not to mention the extra effort spent in setting up the filters and the extra space for storing your incoming spam folder, but what the heck. You geeks enjoy wasting time this way, and I certainly appreciate it. It makes the work of all us spammers much easier.

SpamBayes works really well for Outlook. by RNLockwood · 2003-08-23 08:21 · Score: 5, Interesting

I use SpamBayes (free) with Outlook on my W2K machine. I trained it with over 400 SPAM and over 1000 non-SPAM emails. I get about 45 SPAM each day and my ISP, attglobal, filters out about 40 of them. The SPAM that gets to my mailbox is what passes through the attglobal filter, and that filter has NEVER given me a false positive in more than 2000 SPAM. Those SPAM are put in a special folder on the server for inspection but I now just delete them en masse every week or so.

That means that SpamBayes is filtering only the hardest emails to classify and so far it has only given me one false positive. I got one false negative after training it for the first time. SpamBayes also has a folder for messages that it is not sure of and so far they have all been SPAM. I seldom have to do more than inspect the sender and subject to confirm that they are SPAM.

Each time a message is automatically moved to the SPAM folder (or moved back to the Incoming folder) the training set is adjusted for that email so I don't have to re-train.

To sum up, I'm really impressed by well designed Bayesian filters and this one in particular. I think it's worthwhile to take the time to build up a corpus of SPAM and "good" messages as I can then evaluate competing filters.

--
Nate

Five Bayesian filters were enough by Sits · 2003-08-23 09:09 · Score: 4, Informative

Here's a quote from the article:

Also, SpamAssassin has a Bayesian classifier built in, but it wasn't used in these tests, since having five was enough.

If you reread the slightly ambiguous sentence in context you will realise he meant he had evaluated five Bayesian filters and felt that was enough. Nothing to do with SpamAssassin's point system...

Automatic Spam Training by Stinky+Cheese+Man · 2003-08-23 10:40 · Score: 5, Interesting

I use bogofilter, and it seems to me it would take far too much of my time to manually feed my own spam to it for training purposes. What I do instead is this: We have several spamtrap addresses on our sendmail server. They were not intentionally set up as spamtraps, but in looking at my mail logs I noticed that there were many email addresses receiving spam attempts that are not and never were valid addresses on our system. These invalid addresses somehow got into spammers' email databases and they receive nothing but spam. So I set up entries in my aliases file to automatically redirect all mail for these accounts to bogofilter's spam database. Here is a sample...

nikola: "|/usr/local/bin/bogofilter -s "
cal: "|/usr/local/bin/bogofilter -s "
bwilson: "|/usr/local/bin/bogofilter -s "
fayre: "|/usr/local/bin/bogofilter -s "

(If you are also using sendmail's access.db to filter mail based on the source IP address, you may want to set up the spamtrap addresses as "spam friends" so that spam directed to them is not filtered out by your IP address filters.)

To keep the spam database fresh and to keep it from growing to an excessive size, I use a daily cron job that automatically deletes spam entries older than 30 days...

# remove records older than 30 days from spamlist.db
/usr/local/bin/bogoutil -a30 -m /home/bogofilter/spamlist.db

This gives me an 8 Megabyte spamlist.db with about 14,000 emails in it which is constantly refreshed to keep up with the latest spam trends.

Maintaining the non-spam database isn't quite as easy. I use bogofilter's -u option on my own incoming email, which tells bogofilter to update its databases with my incoming mail based on its classification of the message as spam or non-spam. I never get a false positive, but I do occasionally get a false negative which requires me to make a correcting entry in the database.
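The age-based expiry described above can be illustrated with a small sketch. It only shows the general idea and makes no claim about how bogoutil stores or prunes its data: the token-to-date mapping, the function name, and the sample tokens are all hypothetical.

```python
from datetime import date, timedelta


def prune_old_tokens(last_seen, today, max_age_days=30):
    """Keep only tokens last seen within max_age_days of today.

    last_seen maps a token to the date it was most recently recorded
    (a hypothetical structure used for illustration).
    """
    cutoff = today - timedelta(days=max_age_days)
    return {token: seen for token, seen in last_seen.items() if seen >= cutoff}


# Hypothetical spam tokens and the dates they were last recorded.
spam_tokens = {
    "refinance": date(2003, 6, 1),    # stale: older than 30 days
    "mortgage": date(2003, 7, 24),    # exactly 30 days old: kept
    "enlarge": date(2003, 8, 20),     # recent: kept
}

fresh = prune_old_tokens(spam_tokens, today=date(2003, 8, 23))
print(sorted(fresh))
```

Running something like this daily keeps the spam side of the database weighted toward whatever the spamtraps have received recently.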
