Spamassassin Beats CRM-114 In Anti-Spam Shootout

Correct link to CRM-114 by athakur999 · 2004-06-22 15:27 · Score: 5, Informative

CRM-114

The link in the article points to SpamBayes again.

--
"People that quote themselves in their signatures bother me" - athakur999

No HTML, Just ps or pdf, conclusions inside by randyest · 2004-06-22 15:34 · Score: 5, Informative

And a long document it is (funny placeholder images, though). Here are the conclusions for those who are impatient but want a little more than the summary:

Supervised spam filters are effective tools for attenuating spam. The best-performing filters reduced the volume of incoming spam from about 150 messages per day to about 2 messages per day. The corresponding risk of mail loss, while minimal, is difficult to quantify. The best-performing filters misclassified a handful of spam messages early in the test suite; none within the second half (25,000 messages). A larger study will be necessary to distinguish the asymptotic probability of ham misclassification from zero.

Most misclassified ham messages are advertising, news digests, mailing list messages, or the results of electronic transactions. From this observation, and the fact that such messages represent a small fraction of incoming mail, we may conclude that the filters find them more difficult to classify. On the other hand, the small number of misclassifications suggests that the filter rapidly learns the characteristics of each advertiser, news service, mailing list, or on-line service from which the recipient wishes to receive messages. We might also conjecture that these misclassifications are more likely to occur soon after subscribing to the particular service (or soon after starting to use the filter), a time at which the user would be more likely to notice, should the message go astray, and retrieve it from the spam file. In contrast, the best filters misclassified no personal messages, and no delivery error messages, which comprise the largest and most critical fraction of ham.

A supervised filter contributes significantly to the effectiveness of Spamassassin's static component, as measured by both ham and spam misclassification probabilities. Two unsupervised configurations also improved the static component, but by a smaller margin. The supervised filter alone performed better than the static rules alone, but not as well as the combination of the two.

The choice of threshold parameters dominates the observed differences in performance among the four filters implementing methods derived from Graham's and Robinson's proposals. Each shows a different tradeoff between ham accuracy and spam accuracy. ROC analysis shows that the differences not accountable to threshold setting, if any, are small and observable only when the ham misclassification probability is low (i.e. hm

CRM-114 and DSPAM exhibit substantially inferior performance to the other filters, regardless of threshold setting. Both exhibit substantial learning throughout the email stream, leading us to conjecture that their performance might asymptotically approach that of the other filters. From a practical standpoint, this learning rate would be too slow for personal email filtering as it would take several years at the observed rate to achieve the same misclassification rates as the other systems. Both these systems were designed to be used in a train on error configuration, and do not self-train. This configuration could account for a slow learning rate as each system avails itself of the information in only about 1,000 of the 50,000 test messages. In an effort to ensure that we had not misinterpreted the installation instructions, we ran CRM-114 in a train-on-everything configuration and, as predicted by the author, the result was substantially worse.

Spam filter designers should incorporate interfaces making them amenable for testing and deployment in the supervised configuration (figure 4). We propose the three interface functions used in algorithm 1 - filterinit, filtereval, and filtertrain - as a standardized interface. Systems that self-train should provide an option to self-train on everything (subject to correction via filtertrain) as in algorithm 2.
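
To make the proposed interface concrete, here is a minimal Python sketch of a supervised evaluation loop built around the three function names from the quoted conclusions. Only the names filterinit, filtereval, and filtertrain come from the paper; the signatures, the ToyFilter class, and the token-counting rule are assumptions made for illustration, and the loop assumes that "supervised" means the filter is told the true class after every message.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real filter: token counts per class."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}


def filterinit():
    """Create an empty filter state (assumed signature)."""
    return ToyFilter()


def filtereval(f, message):
    """Classify a message by comparing token hits seen in each class."""
    tokens = message.lower().split()
    spam_hits = sum(f.counts["spam"][t] for t in tokens)
    ham_hits = sum(f.counts["ham"][t] for t in tokens)
    return "spam" if spam_hits > ham_hits else "ham"


def filtertrain(f, message, true_class):
    """Feed the true class back to the filter (assumed signature)."""
    f.counts[true_class].update(message.lower().split())


def run_supervised(stream):
    """stream yields (message, gold_label); returns per-class errors and totals."""
    f = filterinit()
    errors = {"ham": 0, "spam": 0}
    totals = {"ham": 0, "spam": 0}
    for message, gold in stream:
        guess = filtereval(f, message)
        totals[gold] += 1
        if guess != gold:
            errors[gold] += 1
        # Assumption: the user reports the true class for every message.
        filtertrain(f, message, gold)
    return errors, totals
```

Keeping the ham and spam error counts in separate buckets matches the recommendation that the two misclassification proportions be reported separately.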

Ham and spam misclassification proportions should be reported separately. Accuracy, weighted accuracy, and precision should be avoided as primary evaluation measures as th

--
everything in moderation

SpamAssassin is great! by JohnFromCanada · 2004-06-22 15:35 · Score: 2, Informative

I have been using SpamAssassin in conjunction with Evolution and it has cut my spam to virtually nothing. I wish it were built right into Evolution so that it was a little faster; however, it is worth the wait, as I barely ever get any spam in my Inbox anymore. I set it up with Evolution very much as shown here. I really like using it with Evolution, but I am curious whether anyone knows of anything that would work faster and just as efficiently in conjunction with Evolution.

I've had CRM114 running for a few months . . . by klevin · 2004-06-22 15:38 · Score: 4, Informative

CRM114's best was about 80%, which lasted for a few weeks (weeks 3-5). Before and after that, it's doing good to catch 25% of the spam. I'm not sure why, but for the last month it's only been catching about 10%. When one gets through, I run it through mailfilter.crm with the learnspam switch. It'll say it's learned it, but if I have it check the spam again, it still lets it past.

Re:I've had CRM114 running for a few months . . . by CoolGopher · 2004-06-22 16:19 · Score: 2, Informative

I've been running CRM114 for about a year now, and it's performing extremely well. Far better than my Mozilla filter. In fact, just the other week I scrapped Mozilla's junk filter completely and am now relying on CRM alone. It's very rare that I get any misses in either direction.

If I was to make an estimate, I'd say that the error rate is something like .1%, quite possibly less (say 1 miss/5 days, with 200 mails per day). This is having started with clean corpus files and train-on-error only.

Good results with spamprobe by bigberk · 2004-06-22 15:38 · Score: 2, Informative

I have been using spamprobe for some time, with the webfilt front-end, and I'm very pleased with the speedy spamprobe program (written in C++).

I receive approximately 10 legit emails/day and about 300 spam/day. I have only had 2 false positives overall (that's 2 out of about 100,000 total emails received) and on average only 2 spams/day slip past the filter. Now I'm testing Spambayes on one of my most spammed accounts, but it's definitely much slower than spamprobe and not more accurate as far as I can tell.

Re:The Mozilla ThunderBird SPAM filter by ImpTech · 2004-06-22 15:39 · Score: 2, Informative

Of course, it's pretty easy to hook SpamAssassin, bogofilter, or what have you into Evolution. Tutorials abound if you search Google. Thunderbird's nice, but IMO Evolution's still a bit nicer, so it was worth my time to plug in a spam filter manually.

compute farms for anti-spam AI? by potus98 · 2004-06-22 15:39 · Score: 4, Informative

From page 24: Hidalgo suggests the use of ROC curves, originally from signal detection theory and used extensively in medical testing, as better capturing the important aspects of spam filter performance.
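
For readers unfamiliar with ROC curves, the idea can be sketched in a few lines of Python: sweep the spam threshold and record, at each setting, what fraction of ham is wrongly flagged and what fraction of spam is caught. The function name and the toy scores are assumptions for illustration, not anything taken from the paper.

```python
def roc_points(scores, labels):
    """Return (ham_flagged_rate, spam_caught_rate) for each threshold.

    scores: spamminess score per message (higher means more spam-like).
    labels: True for spam, False for ham.
    Assumes at least one ham and one spam message are present.
    """
    n_spam = sum(labels)
    n_ham = len(labels) - n_spam
    points = []
    # Try every distinct score as a threshold, strictest first.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        spam_caught = sum(1 for f, is_spam in zip(flagged, labels) if f and is_spam)
        ham_flagged = sum(1 for f, is_spam in zip(flagged, labels) if f and not is_spam)
        points.append((ham_flagged / n_ham, spam_caught / n_spam))
    return points
```

Two filters can then be compared over the whole curve rather than at whatever single threshold each happened to ship with.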

Perhaps a distributed analysis system (similar to SETI@home) could be used to combat spam. Not only could the idle time of bazillions of CPUs be leveraged to improve "signal" analysis, but perhaps the clients could analyze local incoming mail to correlate new trends in spam originators and then share that information with all of the other clients. Then you could combine that with the genetic evolution improvements of the F1 sim-cars recently mentioned on /.

So there's the high-level idea, now you smart people go make it work. :-)

--
This one gang kept wanting me to join cause I'm pretty good with a bo staff.

Re:compute farms for anti-spam AI? by damiangerous · 2004-06-22 16:38 · Score: 4, Informative

There are already spam packages that do this, at least the collaborative part. Vipul's Razor (which is under the Artistic license) at the personal level and Brightmail (which is closed and not free) at the enterprise/ISP level, off the top of my head.

Spamassassin uses collaborative spam-tracking by vivek7006 · 2004-06-22 15:43 · Score: 2, Informative

Razor: Vipul's Razor is a collaborative spam-tracking database, which works by taking a signature of spam messages. Since spam typically operates by sending an identical message to hundreds of people, Razor short-circuits this by allowing the first person to receive a spam to add it to the database -- at which point everyone else will automatically block it.
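
The collaborative idea can be sketched roughly as follows. This is a toy model only: the class, the whitespace normalization, and the use of a plain SHA-1 digest are assumptions for illustration and are not a description of how Razor actually computes or shares its signatures.

```python
import hashlib
import re


def signature(body: str) -> str:
    """Hash a whitespace- and case-normalized body (illustrative scheme only)."""
    normalized = re.sub(r"\s+", " ", body.strip().lower())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SharedSpamDB:
    """Hypothetical in-memory stand-in for a shared signature database."""

    def __init__(self):
        self._signatures = set()

    def report(self, body: str) -> None:
        """The first recipient reports the spam."""
        self._signatures.add(signature(body))

    def is_listed(self, body: str) -> bool:
        """Later recipients check whether the same message was reported."""
        return signature(body) in self._signatures
```

An exact hash like this is defeated by changing a single character per recipient, so a deployed system would presumably need a signature that tolerates small mutations.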

This is really cool.

Re:Spamassassin uses collaborative spam-tracking by bigberk · 2004-06-22 15:53 · Score: 4, Informative

It gets better. Vernon Schryver, networking genius, is responsible for the Distributed Checksum Clearinghouse which does something similar, but as I understand it, is much more efficient for large servers. When our university turned on DCC filtering combined with greylisting, the daily spam to inboxes dropped from hundreds daily to ZERO (I kid you not). I am not aware of any false positives, at least on my account. DCC blew my mind.

Re:Spamassassin uses collaborative spam-tracking by Anonymous Coward · 2004-06-22 16:18 · Score: 1, Informative

What protection does it have against users (intentionally or unintentionally) adding non-spam to the database, thus blocking legitimate e-mail to everyone who uses Razor?

People have done this before by adding mailing list posts to Razor. But SpamAssassin doesn't automatically block messages listed in Razor, it just assigns them a higher spam score.

Razor has some protection too, like the truth evaluation system - see this page for info.

So I'm not the only one... by sholden · 2004-06-22 15:44 · Score: 4, Informative

I did a *much* smaller test of spam filters earlier this year (which was published in hakin9 but not in English).

I also found that crm114 gave poor results in comparison to other filters - but figured I must have set something up incorrectly...

Re:Mozilla Messenger / Thunderbird Performance? by k.ellsworth · 2004-06-22 15:47 · Score: 2, Informative

100% agreed. I use the Mozilla Thunderbird spam filter (after some human training) and it works marvelously. I have a "spam-me" account that I use on Usenet, on some forums, and for anything I know will become a spam source but where I still need to give a valid email address. That account receives ~38K spams a month and Thunderbird only misses 3 or 4 per day. I sometimes look through its Junk folder and I haven't seen any false positives so far. Thunderbird is THE email client: it works on Linux and Windoze, the spam filter is better than 99% accurate, and it has many other tricks.

--
Putting a windows cd backwards, plays evil messages, but it gets worse, putting it right, installs windows.

Problems with Bayesian filtering by dlevitan · 2004-06-22 15:54 · Score: 4, Informative

Up to this past weekend I was using only bogofilter (which is a pure Bayesian filter). I seem to get about 200 spam a day on my main account. Until about a month or two ago bogofilter was amazing - I'd get maybe 1 or 2 spam a day, if that many. Then recently I suddenly started getting hit with 20 spam messages a day, and I noticed most of those were using lots of common words to bypass bogofilter. Most spam was still being removed by bogofilter, but enough got through to annoy me. This past weekend I also enabled SpamAssassin (without its Bayes filter, though), and it's cut down the number of spam to maybe 5 a day, but it's still too much for me. I'm hoping we have the next breakthrough in spam filtering technology soon (akin to Bayesian filtering) because it seems that every new technique we use to filter the spam is eventually targeted by the spammers and bypassed.

Re:Problems with Bayesian filtering by swillden · 2004-06-22 17:55 · Score: 2, Informative

Then recently I suddenly started getting hit with 20 spam messages a day, and I noticed most of those were using lots of common words to bypass bogofilter.
This is very surprising to me, and it's not my experience at all (also using bogofilter). My bogofilter doesn't seem to be fooled one bit by those common words, at least not in a way that causes it to misclassify spam. That makes sense, actually, since most common words end up being viewed by the filter as neutral, and if the spammers want to sell their wares, they still have to put the spammy words in. So that big chunk of text from "Huckleberry Finn" at the beginning doesn't fool bogofilter at all.
Well, sort of. What I have noticed is that since lots of spam started putting chunks of non-spammy text in, bogofilter has begun occasionally misclassifying ham. This also is logical. A word that happens never to have been used in any ham messages may show up in many fool-the-filter blocks in spam messages and therefore be perceived by the filter as a spammy word, with bad results when a ham message shows up that does use it.
One thing that I find very helpful is to use bogofilter's optional three-way classification, which allows you to set two different thresholds. Messages which score above the higher threshold are considered spam, messages which score below the lower threshold are considered ham and messages that fall in between are unknown. Using this system I find that I can pretty safely assume that everything in my Inbox is ham and everything in my Spam folder is spam. About 20 messages per day make it into the "Possible" box, about half spam. So, out of the 2000 e-mail messages that arrive daily (about half spam -- and no I don't read all of my ham), I have to examine 20 for spamminess.
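
A rough Python sketch of the two-threshold idea, for anyone who wants to see it spelled out. The function name, parameter names, and cutoff values are illustrative assumptions and have not been checked against bogofilter's own configuration; how a score that lands exactly on a cutoff is treated is also an arbitrary choice here.

```python
def classify(score, ham_cutoff=0.2, spam_cutoff=0.9):
    """Three-way classification of a spamminess score in [0, 1].

    Returns "spam", "ham", or "possible" (the in-between folder).
    Cutoff values are made up for the example.
    """
    if not 0.0 <= ham_cutoff < spam_cutoff <= 1.0:
        raise ValueError("need 0 <= ham_cutoff < spam_cutoff <= 1")
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "possible"
```

Widening the gap between the two cutoffs makes the Inbox and Spam folders more trustworthy at the cost of more messages to review by hand.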
Another issue I've run into, probably mostly because I set my "possible" range very wide, is the problem of "persistent possibles". When a message shows up in the possible box, I drop it into one of two folders "IdentifiedHam" and "IdentifiedSpam". A cron job grabs the messages out of these folders, retrains bogofilter appropriately and then puts them back into the mail queue for reprocessing. The persistent ones still fall into the possible range even after retraining, and it can be very difficult to get them to finally drop into the right category.
My solution is to automatically continue retraining on a message until it evaluates correctly, up to a point. After trying various limits I've found that a maximum of 20 training cycles gives pretty good results. Going much higher tends to cause overtraining problems, so the cron job will retrain at most 20 times on each message before giving up and just putting the message back into the queue. When it shows up in the possible folder again, I just delete it.
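
The capped retraining loop might look something like this in Python. The classify and train callables are hypothetical hooks standing in for whatever wraps the real filter; only the cap of 20 cycles comes from the description above.

```python
def retrain_until_correct(classify, train, message, true_label, max_cycles=20):
    """Retrain on one message until it classifies correctly or the cap is hit.

    classify(message) -> label and train(message, label) are hypothetical
    callables wrapping the real filter.
    Returns the number of cycles used, or None if the cap was reached.
    """
    for cycle in range(1, max_cycles + 1):
        train(message, true_label)
        if classify(message) == true_label:
            return cycle
    # Give up: the caller puts the message back in the queue unchanged.
    return None
```

Returning None on failure lets the calling job distinguish a persistent possible from a message that was fixed on the last allowed cycle.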
Speaking of overtraining, I've found that to be a more general problem. When I first started using bogofilter, the accuracy was terrible the first day, good after the first week, amazing after the first three weeks, but then started to decline after about three months. The problem was that it was overtrained, and was putting too much weight on some words. There's no perfect way to avoid this problem (and the retraining my scripts do tends to exacerbate it a little), but I've found that cleaning out database entries older than 30 days does a pretty good job of keeping the filter operating at peak performance. A daily cron job keeps my filter clean and fresh.
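
As a sketch of the age-based cleanup, assuming the word database can be viewed as a mapping from token to counts plus a last-seen timestamp (a made-up layout for illustration, not bogofilter's actual storage format):

```python
from datetime import datetime, timedelta


def prune_old_tokens(db, now, max_age_days=30):
    """Drop tokens not seen within max_age_days.

    db maps token -> (spam_count, ham_count, last_seen), where last_seen
    is a datetime. This layout is hypothetical.
    Returns a new dict and leaves the original untouched.
    """
    cutoff = now - timedelta(days=max_age_days)
    return {token: rec for token, rec in db.items() if rec[2] >= cutoff}
```

A daily job would call this with the current time and write the result back to wherever the word list is stored.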

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.

Re:Why don't people use catch-all accounts? by sr180 · 2004-06-22 16:13 · Score: 4, Informative

Wait till the spammers decide to spam your whole domain. They can start with aaaaaaaa@yourdomain.com and keep going till they get to zzzzzzzz@yourdomain.com, and your mailserver will accept and pass on every single one of these emails.

I would recommend not using a catch all account, but if you have the domain, create, delete and rename email accounts as you need to...

--
In Soviet Russia the insensitive clod is YOU!

Spamgourmet (antichef) and SpamSieve by dougman · 2004-06-22 16:38 · Score: 4, Informative

Why people don't use disposable accounts is beyond me. Once you start using Spamgourmet you'll never go back. I've been active with them over two years and here are my current stats:

Your message stats: 339 forwarded, 43,796 eaten. You have 155 disposable address(es).

Yeah, that's right, thanks to disposable addresses I *haven't* read 43,457 spam emails! When I do need (want) to use my real address, I use SpamSieve (with Entourage X) - very good Bayesian filter (not sure if it's Mac-only or not).

Re:Why don't people use catch-all accounts? by lewko · 2004-06-22 17:00 · Score: 4, Informative

I used to do the same. Now I'm paying for it.
Several viruses were sent to jane@mydomain, pete@mydomain, sedlskjl@mydomain etc.

Inevitably these same addresses are now being used for Spam and viruses as the source OR destination address (meaning I get bounce messages as well).

I HATE it when moron anti-Virus gateway administrators set them up to return confirmed viruses to sender with a polite note - except I am NOT the sender, my address was spoofed.

Unfortunately I have been using the catch-all trick for so long (e.g. ebay.com@mydomain etc.) that it's not as simple as turning it off or setting up filters - I don't even know what all the 'legit' addresses are as I used to create them on the fly and may only get email to some of them once a year or so.

I only ever busted one person for passing on the account details which was satisfying, but I am getting PLENTY of Spam/viruses now instead.

I use the excellent Spam Gourmet now for instantly creating disposable addresses with the added advantage that they can actually die when I want/need them to.

--
Do you or your partner snore? - Visit www.snoring.com.au

SpamBayes + Thunderbird by Anthracks · 2004-06-22 17:03 · Score: 2, Informative

Thunderbird already has integrated significant improvements based on SpamBayes, I believe. See http://bugzilla.mozilla.org/show_bug.cgi?id=230093 , which was closed about a month ago. The test data from that patch is encouraging, although obviously results will be different for everyone since not everyone gets the same type of spam. If you want to keep tabs on upcoming refinements to junk mail filtering, take a look at the dependencies of this meta bug: http://bugzilla.mozilla.org/show_bug.cgi?id=228674 . Please don't "spam" up that bug with comments though; if you have something to say, put it in a specific bug or file a new one if something relevant doesn't exist.

--
Rock over London, Rock on Chicago. Wheaties: Breakfast of Champions.

Re:Why don't people use catch-all accounts? by dasmegabyte · 2004-06-22 17:06 · Score: 2, Informative

Why would I wait until spammers did that?

Already if a server tries to send the same email to more than three fake addresses at my company, I blacklist the IP for two days. Not just for email, but for any IP traffic. I did this to prevent trojans, but it's a somewhat effective spam deterrent as well.
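
A toy Python model of that rule, for illustration. It simplifies by counting distinct invalid recipients per IP rather than per individual message, and every name in it is hypothetical; the real setup presumably lives in the mail server and firewall rather than in application code.

```python
from collections import defaultdict

BAN_SECONDS = 2 * 24 * 3600   # two days, as described above
MAX_BAD_RECIPIENTS = 3        # ban once more than three fake addresses are tried


class Blacklist:
    """Hypothetical tracker of invalid-recipient attempts per sending IP."""

    def __init__(self, valid_addresses):
        self.valid = set(valid_addresses)
        self.bad_seen = defaultdict(set)   # ip -> distinct invalid recipients
        self.banned_until = {}             # ip -> ban expiry timestamp

    def is_banned(self, ip, now):
        return self.banned_until.get(ip, 0) > now

    def record_attempt(self, ip, recipient, now):
        """Return True if this delivery attempt should be accepted."""
        if self.is_banned(ip, now):
            return False
        if recipient in self.valid:
            return True
        self.bad_seen[ip].add(recipient)
        if len(self.bad_seen[ip]) > MAX_BAD_RECIPIENTS:
            self.banned_until[ip] = now + BAN_SECONDS
            self.bad_seen.pop(ip)
        return False
```

Timestamps are passed in explicitly so the logic can be exercised without waiting two days.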

--
Hey freaks: now you're ju

Re:DSPAM by Daniel+Quinlan · 2004-06-22 17:30 · Score: 3, Informative

Quoting the (unfinished) paper:

CRM-114 and DSPAM exhibit substantially inferior performance to the other filters, regardless of threshold setting. Both exhibit substantial learning throughout the email stream, leading us to conjecture that their performance might asymptotically approach that of the other filters. From a practical standpoint, this learning rate would be too slow for personal email filtering as it would take several years at the observed rate to achieve the same misclassification rates as the other systems.

This is interesting considering the harsh words the DSPAM author directs towards SpamAssassin in the DSPAM FAQ. In contrast, I think, the SpamAssassin developers say they are interested in testing the "dobly" noise reduction technique that DSPAM employs, see SpamAssassin bug 3078.

Re:Mozilla Messenger / Thunderbird Performance? by Anonymous Coward · 2004-06-22 17:45 · Score: 1, Informative

I have measured Mozilla at 97% accurate and SpamProbe at 99.6% accurate. My mail is very skewed, since I get about 20 times more spam than ham.

Mozilla is OK if you only get about 100 spams a day, but I get about 4000 spams a day - and less than 20 legit messages, so I need something better than Mozilla.

For me, Spamprobe had zero false positives, after 18 months of use, so I think if it ever does make a mistake, it would be a message so close to spam that I would not want to read it anyway.

How Apple Mail filters Spam by jjga · 2004-06-22 17:46 · Score: 2, Informative

There is a somewhat interesting article where they more or less explain how the Mac OS X Mail application works regarding Spam:

http://www.macdevcenter.com/pub/a/mac/2004/05/18/spam_pt2.html

CRM114 Author Response by Anonymous Coward · 2004-06-22 23:56 · Score: 3, Informative

I am the author of CRM114 and I corresponded with Professor Cormack for setup assistance during this study; he did have some problems with CRM114 that he brought to my attention and which were possibly never quite resolved.

I can also state that I *do* run CRM114 myself; I also run SpamAssassin (regularly maintained by the systems staff) on a parallel account. I find that SA gets about 90+ percent of what makes it past the firewall's immediate RBL lists (which matches Prof. Cormack's Figure 8 pretty closely); CRM114 nails 99.9% or more (this week, ending June 21, 2004, my CRM114 stats are 2528 nonspam and 1114 spam messages, with just 1 error (a false reject), which is 99.972% accuracy).
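
For anyone checking the arithmetic, the figure can be reproduced with a few lines of Python, which also show how one overall accuracy number splits into separate ham and spam rates. The function name is made up, and the sketch assumes "false reject" means one nonspam message was rejected as spam.

```python
def rates(ham_total, spam_total, ham_errors, spam_errors):
    """Overall accuracy plus per-class misclassification proportions."""
    total = ham_total + spam_total
    return {
        "accuracy": 1 - (ham_errors + spam_errors) / total,
        "ham_misclassification": ham_errors / ham_total,
        "spam_misclassification": spam_errors / spam_total,
    }


# Week ending June 21, 2004: 2528 nonspam, 1114 spam, 1 error.
# Assumption: the single error was a ham message rejected as spam.
week = rates(ham_total=2528, spam_total=1114, ham_errors=1, spam_errors=0)
```

Here the accuracy comes out at roughly 99.97%, while the ham misclassification proportion is 1 in 2528 and the spam misclassification proportion is zero.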

I have gotten reports from some very happy users who are seeing similar accuracies; I've also gotten sad reports similar to Prof. Cormack's that show very weak accuracy.

I can conclude from this (and other reports) that filter performance varies _greatly_ with spam mix - that is to say, Your Mileage Will Vary.

Further, consider Fig 15, which compares CRM114's accuracy with respect to nonspam v. spam. Note that the two curves are displaced considerably, by a factor of between 3 and 5 in accuracy!

This is odd, because CRM114 is _entirely_ symmetrical; it does NOT have any predisposition toward (or against) erring on the side of caution; the only difference between nonspam and spam is the names of their files, which could be changed to "foo.css" and "bar.css" (or even interchanged) without affecting anything else.

Therefore, the two accuracy curves _should_ lie on top of each other; there is no difference in the processing. The fact that the nonspam v. spam curves seem to differ by a factor of 3 to 5 in magnitude gives me some reason to believe that the setup issues Prof. Cormack encountered never really were completely addressed.

-Bill Yerazunis

Postfix Address Verification by DispassionateObserve · 2004-06-23 00:10 · Score: 2, Informative

Turning on Postfix 2.1's "address verification" feature immediately eliminated 90% of the spam that my company was receiving! (SpamAssassin + ClamAV + CRM114 catch the rest). This feature confirms that the incoming email is coming from an account that also accepts email. (Spambots don't normally accept mail, of course...) The spam email never even makes it into your system this way, because the SMTP transaction is deferred until the address is verified. - Mike

And SpamAssassin is just getting better by KjetilK · 2004-06-23 00:22 · Score: 3, Informative

I've been using SA 2.63 for some time now. At first, my statistics were about 90% rejected at SMTP-time, 0.1% false negatives and 0.01% false positives. Spammers have learned to adapt, so now I have about 2% false negatives.

But SpamAssassin is just getting better and better. Version 3.0 is coming up, and 3.0-pre1 was recently released. I do not have a test system available for it, but those who have may want to take it for a spin.

Especially for large sites, this is extremely interesting. It adds relational database support for the Bayes database, so it should be a lot easier to set up on a large site.

I find individual training to be the main reason why SA works so well for me; the lack of it is why SA did not work very well at my old university.

--
Employee of Inrupt, Project Release Manager and Community Manager for Solid

Re:Mozilla Messenger / Thunderbird Performance? by WuphonsReach · 2004-06-23 01:35 · Score: 2, Informative

I used Thunderbird and the SpamBayes proxy concurrently for a while. SB kicks the crap out of the Thunderbird.

Definitely agree.

I use the SpamBayes MSOutlook plugin for my work e-mail and it is extremely good at discriminating spam from ham. I use Thunderbird for my non-corporate e-mail. SpamBayes has two additional (and rather important) features that Thunderbird/Mozilla just don't have:

1. SpamBayes (at least the Outlook plug-in) actually has (3) levels of classification... definite ham, maybe, and definite spam; and you can route the "maybe" and "definite spam" to two different folders. That means, instead of having to sift through 229 spam messages for false positives, I really only have to closely examine the 29 "maybes". The other 200 I can just give a cursory glance at.

2. SpamBayes keeps track of the folder where a spam message was found. Then, if you click the "you goof! that's ham!" button, SpamBayes is smart enough to put the message back into that folder. Moz's junk mail filter just turns off the junk flag and leaves the message to rot in the junk folder. Sounds like a small thing, but it's a big usability issue.

--
Wolde you bothe eate your cake, and have your cake?

Re:DSPAM by More+Trouble · 2004-06-23 15:13 · Score: 2, Informative

Here's a response from the DSPAM author.

:w

Slashdot Mirror
