Spamassassin Beats CRM-114 In Anti-Spam Shootout
Simon Lyall writes "A new study of antispam software shows that Spamassassin performed well in various configurations along with Spamprobe , Bogofilter and Spambayes also came out good while CRM-114
failed to live up to its previous claims . The study shows: 'The best-performing filters reduced the volume of incoming spam from about 150 messages per day to about 2 messages per day.'"
CRM-114
The link in the article points to SpamBayes again.
"People that quote themselves in their signatures bother me" - athakur999
And a long document it is (funny placeholder images though.) Here's the conclusions for the impatient but interested in a little more than the summary:
Supervised spam filters are effective tools for attenuating spam. The best-performing filters reduced the volume of incoming spam from about 150 messages per day to about 2 messages per day. The corresponding risk of mail loss, while minimal, is difficult to quantify. The best-performing filters misclassified a handful of spam messages early in the test suite; none within the second half (25,000 messages). A larger study will be necessary to distinguish the asymptotic probability of ham misclassification from zero.
Most misclassified ham messages are advertising, news digests, mailing list messages, or the results of electronic transactions. From this observation, and the fact that such messages represent a small fraction of incoming mail, we may conclude that the filters find them more difficult to classify. On the other hand, the small number of misclassifications suggests that the filter rapidly learns the characteristics of each advertiser, news service, mailing list, or on-line service from which the recipient wishes to receive messages. We might also conjecture that these misclassifications are more likely to occur soon after subscribing to the particular service (or soon after starting to use the filter), a time at which the user would be more likely to notice, should the message go astray, and retrieve it from the spam file. In contrast, the best filters misclassified no personal messages, and no delivery error messages, which comprise the largest and most critical fraction of ham.
A supervised filter contributes significantly to the effectiveness of Spamassassin's static component, as measured by both ham and spam misclassification probabilities. Two unsupervised configurations also improved the static component, but by a smaller margin. The supervised filter alone performed better than than the static rules alone, but not as well as the combination of the two.
The choice of threshold parameters dominates the observed differences in performance among the four filters implementing methods derived from Graham's and Robinson's proposals. Each shows a different tradeoff between ham accuracy and spam accuracy. ROC analysis shows that the differences not accountable to threshold setting, if any, are small and observable only when the ham misclassification probability is low (i.e. hm
CRM-114 and DSPAM exhibit substantially inferior performance to the other filters, regardless of threshold setting. Both exhibit substantial learning throughout the email stream, leading us to conjecture that their performance might asymptotically approach that of the other filters. From a practical standpoint, this learning rate would be too slow for personal email filtering as it would take several years at the observed rate to achieve the same misclassification rates as the other systems. Both these systems were designed to be used in a train on error configuration, and do not self-train. This configuration could account for a slow learning rate as each system avails itself of the information in only about 1,000 of the 50,000 test messages. In an effort to ensure that we had not misinterpreted the installation instructions, we ran CRM-114 in a train-on-everything configuration and, as predicted by the author, the result was substantially worse.
Spam filter designers should incorporate interfaces making them amenable for testing and deployment in the supervised configuration (figure 4). We propose the three interface functions used in algorithm 1 - filterinit, filtereval, and filtertrain - as a standardized interface. Systems that self-train should provide an option to self-train on everything (subject to correction via filtertrain) as in algorithm 2.
Ham and spam misclassification proportions should be reported separately. Accuracy, weighted accuracy, and precision should be avoided as primary evaluation measures as th
everything in moderation
CRM114's best was about 80%, which lasted for a few of weeks (weeks 3-5). Before and after that, it's doing good to catch 25% of the spam. I'm not sure why, but for the last month it's only been catching about 10%. When one gets through, I run it through mailfilter.crm with the learnspam switch. It'll say it's learned it, but if I have it check the spam again, it still lets it past.
From page 24: Hidalgo suggests the use of ROC curves, originally from signal detection theory and used extensively in medical testing, as better capturing the important aspects of spam filter performance.
Perhaps a distributed analysis system (similar to SETI@home) could be used to combat spam. Not only could the idle time of bazillions of CPUs be levereaged to improve "signal" analysis, but perhaps the clients could analyize local incoming mail to corelate new trends in spam originators and then share that information with all of the other clients. Then you could combine that with the genetic evolution improvements of the F1 sim-cars recently mentioned on /.
So there's the high-level idea, now you smart people go make it work. :-)
This one gang kept wanting me to join cause I'm pretty good with a bo staff.
I did a *much* smaller test of spam filters earlier this year (which was published in hakin9 but not in English).
I also found that crm114 gave poor results in comparison to other filters - but figured I must have set something up incorrectly...
It gets better. Vernon Schryver, networking genius, is responsible for the Distributed Checksum Clearinghouse which does something similar, but as I understand it, is much more efficient for large servers. When our university turned on DCC filtering combined with greylisting, the daily spam to inboxes dropped from hundreds daily to ZERO (I kid you not). I am not aware of any false positives, at least on my account. DCC blew my mind.
Up to this past weekend I was using only bogofilter (which is a pure bayesian filter). I seem to get about 200 spam a day on my main account. Until about a month or two ago bogofilter was amazing - I'd get maybe 1 or 2 spam a day, if that many. Then recently I suddenly started getting hit with 20 spam messages a day, and I noticed most of those were using lots of common words to bypass bogofilter. Most spam was still being removed by bogofilter, but enough to make me annoyed. This past weekend I also enabled spamassassin (without its bayes filter though), and its cut down the number of spam to maybe 5 a day, but its still too much for me. I'm hoping we have the next breakthrough in spam filtering technology soon (akin to bayesian filtering) because it seems that every new technique we use to filter the spam is eventually targeted by the spammers and bypassed.
I would recommend not using a catch all account, but if you have the domain, create, delete and rename email accounts as you need to...
In Soviet Russia the insensitive clod is YOU!
Why people don't use disposable accounts is beyond me. Once you start using Spamgourmet you'll never go back. I've been active with them over two years and here's my current stats:
Your message stats: 339 forwarded, 43,796 eaten. You have 155 disposable address(es).
yeah, that's right, thanks to disposable addresses I *haven't* read 43,457 spam emails! When I do need (want) to use my real address, I use SpamSieve (with Entourage X) - very good baysean filter (not sure if it Mac only or not).
I used to do the same. Now I'm paying for it.
Several viruses were sent to jane@mydomain, pete@mydomain, sedlskjl@mydomain etc.
Inevitably these same addresses are now being used for Spam and viruses as the source OR destination address (meaning I get bounce messages as well).
I HATE it when moron anti-Virus gateway administrators set them up to return confirmed viruses to sender with a polite note - except I am NOT the sender, my address was spoofed.
Unfortunately I have been using the catch-all trick for so long (e.g. ebay.com@mydomain etc.) that it's not as simple as turning it off or setting up filters - I don't even know what all the 'legit' addresses are as I used to create them on the fly and may only get email to some of them once a year or so.
I only ever busted one person for passing on the account details which was satisfying, but I am getting PLENTY of Spam/viruses now instead.
I use the excellent Spam Gourmet now for instantly creating disposable addresses with the added advantage that they can actually die when I want/need them to.
Do you or your partner snore? - Visit www.snoring.com.au
This is interesting considering the harsh words the DSPAM author directs towards SpamAssassin in the DSPAM FAQ. In contrast, I think, the SpamAssassin developers say they are interested in testing the "dobly" noise reduction technique that DSPAM employs, see SpamAssassin bug 3078.
I am the author of CRM114 and I corresponded with Professor Carmack for setup assistance during this study; he did have some problems with CRM114 that he brought to my attention and which were possibly never quite resolved.
I can also state that *do* run CMR114 myself; I also run SpamAssassin (regularly maintained by the systems staff) on a parallel account. I find that SA gets about 90+ percent of what makes it past the firewall's immediate RBL lists (which matches Prof. Cormack's Figure 8 pretty closely); CRM114 nails 99.9% or more (this week, ending June 21, 2004, my CRM114 stats are 2528 nonspam and 1114 spam messages, and had just 1 error (a false reject) which is 99.972% accuracy.
I have gotten reports from some very happy users who are seeing similar accuracies; I've also gotten sad reports similar to Prof. Carmack's that show very weak accuracy.
I can conclude from this (and other reports) that filter performance varies _greatly_ with spam mix - that is to say, Your Mileage Will Vary.
Further, consider Fig 15, which compares CRM114's accuracy with respect to nonspam v. spam. Note that the two curves are displaced considerably, by a factor of accuracy between 3 and 5 times!
This is odd, because CRM114 is _entirely_ symmetrical; it does NOT have any predisposition toward (or against) erring on the side of caution; the only difference between nonspam and spam is the names of their files, which could be changed to "foo.css" and "bar.css" (or even interchanged) without affecting anything else.
Therefore, the two accuracy curves _should_ therefore lie on top of each other; there is no difference in the processing. The fact that the nonspam v. spam curves seem to differ by a factor of 3 to 5 in magnitude gives me some reason to believe that the setup issues Prof. Carmack encountered never really were completely addressed.
-Bill Yerazunis
But SpamAssassin is just getting better and better. Version 3.0 is coming up, and 3.0-pre1 was recently released. I do not have a test system available for it, but those who have may want to take it for a spin.
Especially for large sites, this is extremely interesting. It adds relational database support for the Bayes database, so it should be a lot easier to set up on a large site.
I find the lack of individual training the main reason why SA works so well for me, but not very well at my old university.
Employee of Inrupt, Project Release Manager and Community Manager for Solid