Spamassassin Beats CRM-114 In Anti-Spam Shootout

Correct link to CRM-114 by athakur999 · 2004-06-22 15:27 · Score: 5, Informative

CRM-114

The link in the article points to SpamBayes again.

--
"People that quote themselves in their signatures bother me" - athakur999

The Mozilla ThunderBird SPAM filter by k.ellsworth · 2004-06-22 15:30 · Score: 5, Interesting

The Mozilla spam filter does a very good job too; once it learns enough, it becomes over 95% accurate. I dropped Evolution for it and never looked back.

--
Putting a windows cd backwards, plays evil messages, but it gets worse, putting it right, installs windows.

Re:The Mozilla ThunderBird SPAM filter by norton_I · 2004-06-22 20:01 · Score: 5, Insightful

Better to do spam filtering with your MTA/MDA anyway, if possible. That way, the same filter is used no matter which email client you use from which computer. Plus, it means you don't have to download spams to your MUA when on a slow connection.

Now if only I could get the rest of my mail configuration to be shared between evolution, mutt, and squirrelmail.

Quit acting like goddamn babies... by Anonymous Coward · 2004-06-22 15:32 · Score: 5, Funny

Baysian, gaysian. Real men hit delete.

Re:Quit acting like goddamn babies... by fireman+sam · 2004-06-22 16:30 · Score: 4, Funny

Pfft, Real men have this as the ~/.bashrc

#!/bin/sh
rm -f /var/spool/mail/$USER

Who needs email.

--
it is only after a long journey that you know the strength of the horse.

No HTML, Just ps or pdf, conclusions inside by randyest · 2004-06-22 15:34 · Score: 5, Informative

And a long document it is (funny placeholder images, though). Here are the conclusions for those who are impatient but want a little more than the summary:

Supervised spam filters are effective tools for attenuating spam. The best-performing filters reduced the volume of incoming spam from about 150 messages per day to about 2 messages per day. The corresponding risk of mail loss, while minimal, is difficult to quantify. The best-performing filters misclassified a handful of spam messages early in the test suite; none within the second half (25,000 messages). A larger study will be necessary to distinguish the asymptotic probability of ham misclassification from zero.

Most misclassified ham messages are advertising, news digests, mailing list messages, or the results of electronic transactions. From this observation, and the fact that such messages represent a small fraction of incoming mail, we may conclude that the filters find them more difficult to classify. On the other hand, the small number of misclassifications suggests that the filter rapidly learns the characteristics of each advertiser, news service, mailing list, or on-line service from which the recipient wishes to receive messages. We might also conjecture that these misclassifications are more likely to occur soon after subscribing to the particular service (or soon after starting to use the filter), a time at which the user would be more likely to notice, should the message go astray, and retrieve it from the spam file. In contrast, the best filters misclassified no personal messages, and no delivery error messages, which comprise the largest and most critical fraction of ham.

A supervised filter contributes significantly to the effectiveness of Spamassassin's static component, as measured by both ham and spam misclassification probabilities. Two unsupervised configurations also improved the static component, but by a smaller margin. The supervised filter alone performed better than the static rules alone, but not as well as the combination of the two.

The choice of threshold parameters dominates the observed differences in performance among the four filters implementing methods derived from Graham's and Robinson's proposals. Each shows a different tradeoff between ham accuracy and spam accuracy. ROC analysis shows that the differences not accountable to threshold setting, if any, are small and observable only when the ham misclassification probability is low (i.e. hm

CRM-114 and DSPAM exhibit substantially inferior performance to the other filters, regardless of threshold setting. Both exhibit substantial learning throughout the email stream, leading us to conjecture that their performance might asymptotically approach that of the other filters. From a practical standpoint, this learning rate would be too slow for personal email filtering as it would take several years at the observed rate to achieve the same misclassification rates as the other systems. Both these systems were designed to be used in a train on error configuration, and do not self-train. This configuration could account for a slow learning rate as each system avails itself of the information in only about 1,000 of the 50,000 test messages. In an effort to ensure that we had not misinterpreted the installation instructions, we ran CRM-114 in a train-on-everything configuration and, as predicted by the author, the result was substantially worse.

Spam filter designers should incorporate interfaces making them amenable for testing and deployment in the supervised configuration (figure 4). We propose the three interface functions used in algorithm 1 - filterinit, filtereval, and filtertrain - as a standardized interface. Systems that self-train should provide an option to self-train on everything (subject to correction via filtertrain) as in algorithm 2.
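The three-function interface proposed above can be sketched in Python. The function names come from the quoted conclusions; the signatures, the word-count "classifier", and the choice to train on every message with its corrected label are assumptions made for illustration, not the paper's algorithm 1. Ham and spam errors are counted separately, in line with the reporting advice in the conclusions.

```python
from collections import Counter

def filterinit():
    """Return fresh filter state (hypothetical shape: word counts per class)."""
    return {"ham": Counter(), "spam": Counter()}

def filtereval(state, message):
    """Toy classifier: pick the class whose training data shares more words."""
    words = message.lower().split()
    spam_votes = sum(state["spam"][w] for w in words)
    ham_votes = sum(state["ham"][w] for w in words)
    # Ties (including the untrained state) fall back to "ham".
    return "spam" if spam_votes > ham_votes else "ham"

def filtertrain(state, message, true_label):
    """Feed the corrected label back to the filter."""
    state[true_label].update(message.lower().split())

def run_supervised(stream):
    """Classify each message in order, then train on the gold label."""
    state = filterinit()
    errors = {"ham": 0, "spam": 0}   # misclassifications, per true class
    totals = {"ham": 0, "spam": 0}
    for message, gold in stream:
        guess = filtereval(state, message)
        totals[gold] += 1
        if guess != gold:
            errors[gold] += 1
        filtertrain(state, message, gold)
    return errors, totals

stream = [
    ("buy pills now", "spam"),
    ("lunch at noon", "ham"),
    ("buy cheap pills", "spam"),
    ("noon meeting lunch", "ham"),
]
errors, totals = run_supervised(stream)
```

On this tiny made-up stream the only error is the very first spam, which arrives before the filter has seen any training data.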

Ham and spam misclassification proportions should be reported separately. Accuracy, weighted accuracy, and precision should be avoided as primary evaluation measures as th

--
everything in moderation

Mozilla Messenger / Thunderbird Performance? by Mark_MF-WN · 2004-06-22 15:34 · Score: 5, Interesting

I wonder how Mozilla Messenger/Thunderbird's spam filtering stacks up against these filters? I've heard some negative comments about the Mozilla filtering system, but it's worked wonders for me.

A little advice by Anonymous Coward · 2004-06-22 15:37 · Score: 5, Funny

You don't want to face an assassin in a shootout. Maybe a pie eating contest, or a spelling bee... but not a shootout.

I've had CRM114 running for a few months . . . by klevin · 2004-06-22 15:38 · Score: 4, Informative

CRM114's best was about 80%, which lasted for a few weeks (weeks 3-5). Before and after that, it did well to catch 25% of the spam. I'm not sure why, but for the last month it's only been catching about 10%. When one gets through, I run it through mailfilter.crm with the learnspam switch. It'll say it's learned it, but if I have it check the spam again, it still lets it past.

compute farms for anti-spam AI? by potus98 · 2004-06-22 15:39 · Score: 4, Informative

From page 24: Hidalgo suggests the use of ROC curves, originally from signal detection theory and used extensively in medical testing, as better capturing the important aspects of spam filter performance.
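For readers unfamiliar with ROC curves, here is a minimal sketch of how one is built for a spam filter: sweep the decision threshold over the filter's scores and record, at each setting, the fraction of ham lost and the fraction of spam caught. The scores below are made up, and `roc_points` is a hypothetical helper rather than anything from the paper; it assumes the input contains at least one ham and one spam message.

```python
def roc_points(scored):
    """scored: list of (score, is_spam) pairs; higher score means spammier.

    Returns one (threshold, ham_lost_rate, spam_caught_rate) tuple per
    distinct score, treating every message at or above the threshold as spam.
    """
    n_spam = sum(1 for _, is_spam in scored if is_spam)
    n_ham = len(scored) - n_spam
    points = []
    for threshold in sorted({score for score, _ in scored}):
        flagged = [is_spam for score, is_spam in scored if score >= threshold]
        spam_caught = sum(1 for is_spam in flagged if is_spam)
        ham_lost = len(flagged) - spam_caught
        points.append((threshold, ham_lost / n_ham, spam_caught / n_spam))
    return points

# Made-up scores: three spams and two hams.
scored = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.1, False)]
points = roc_points(scored)
```

Each point is one possible ham/spam tradeoff, which is why two filters can only be compared fairly across the whole curve rather than at a single threshold.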

Perhaps a distributed analysis system (similar to SETI@home) could be used to combat spam. Not only could the idle time of bazillions of CPUs be leveraged to improve "signal" analysis, but perhaps the clients could analyze local incoming mail to correlate new trends in spam originators and then share that information with all of the other clients. Then you could combine that with the genetic evolution improvements of the F1 sim-cars recently mentioned on /.

So there's the high-level idea, now you smart people go make it work. :-)

--
This one gang kept wanting me to join cause I'm pretty good with a bo staff.

Re:compute farms for anti-spam AI? by damiangerous · 2004-06-22 16:38 · Score: 4, Informative

There are already spam packages that do this, at least the collaborative part. Vipul's Razor (which is under the Artistic license) at the personal level and Brightmail (which is closed and not free) at the enterprise/ISP level, off the top of my head.

Re:in related news by bigberk · 2004-06-22 15:42 · Score: 4, Insightful

Content-based spam filtering is a waste of time. . . RBLs WORK

But content-based filters can very accurately determine what is spam and what's not, and so they can feed RBLs/DNSBLs. Let real spam to real user accounts form the blocklist! One such project is WPBL.
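As a rough illustration of how a mail server consults a DNSBL: the conventional scheme reverses the octets of the connecting IPv4 address and looks that name up under the list's DNS zone. The sketch below only builds the query name; the zone `dnsbl.example.org` is a placeholder, not WPBL's real zone, and the actual DNS lookup is left out.

```python
import ipaddress

def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

# 192.0.2.99 is a documentation-only address; the zone is hypothetical.
name = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
```

If a resolver returns an address record for that name, the IP is listed; a "no such name" answer means it is not.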

Isn't Human Accuracy always 100% by PetoskeyGuy · 2004-06-22 15:43 · Score: 4, Insightful

From the CRM-114 site...

News Flash: As of Feb 1 through March 1, 2004, 8738 messages (4240 spam, 4498 nonspam), and my total error rate was ONE. That translates to better than 99.984% accuracy, which is over ten times more accurate than human accuracy

Maybe I'm missing something, but isn't human accuracy always going to be 100%? I tell the computer what is spam, it learns. I may decide that regardless of what it thinks, this last message is OK. So aside from clicking too fast or changing your mind (which is a common thing to do), how can a filter ever suggest it is better than people at deciding what people want to see?
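The arithmetic behind the quoted figures can be checked in a few lines. The message and error counts come from the quote; reading "ten times more accurate" as "one tenth the error rate" is an assumption about what the CRM-114 author meant.

```python
messages, errors = 8738, 1

# Accuracy implied by one error in 8738 messages, as a percentage.
accuracy_pct = 100 * (1 - errors / messages)

# If the filter's error rate is a tenth of a human's (an interpretation),
# the implied human error rate is ten times the filter's.
implied_human_error_pct = 10 * (100 - accuracy_pct)

print(f"filter accuracy: {accuracy_pct:.4f}%")
print(f"implied human error rate: {implied_human_error_pct:.3f}%")
```

Under that reading the claim amounts to a human misfiling a little more than one message in a thousand.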

Re:Isn't Human Accuracy always 100% by sholden · 2004-06-22 15:50 · Score: 4, Insightful

People make mistakes.

Yes, given one message to classify as spam or ham you are going to get it right 100% of the time.

Given 8000 messages to classify, the wonders of boredom are going to mean you make a mistake every so often (not an "oops I clicked the wrong button" mistake, but an "oops I put it in the wrong folder because the subject looked spammy and I couldn't be bothered checking the body" mistake).

In practice though, those stats on human accuracy are provided by having one person classify email that has been classified by others - which of course means some of the mistakes may in fact be disagreements...

Re:Isn't Human Accuracy always 100% by fireman+sam · 2004-06-22 16:28 · Score: 4, Funny

Remember, an email being classified as spam is subjective. For example, you might consider a message from a Nigerian bank manager spam, but I may consider it a way to pay off the house :)

Or, personally, I consider all email I get from hotmail.com to be spam. But that is my opinion.

OT: btw, a friend at work actually got a Nigerian scam letter in the post. Because it was not email, he thought it was real.

--
it is only after a long journey that you know the strength of the horse.

Re:Isn't Human Accuracy always 100% by Anonymous Coward · 2004-06-22 16:35 · Score: 4, Funny

OT: you need smarter friends.

So I'm not the only one... by sholden · 2004-06-22 15:44 · Score: 4, Informative

I did a *much* smaller test of spam filters earlier this year (which was published in hakin9 but not in English).

I also found that crm114 gave poor results in comparison to other filters - but figured I must have set something up incorrectly...

Why don't people use catch-all accounts? by mattkinabrewmindspri · 2004-06-22 15:44 · Score: 5, Interesting

When you register with a hosting company, very frequently, they set up what's called a catch-all account, and any email to your domain that's not addressed to a real address goes there. This is how I use it:

I only use my main email address with friends and family, and never post it online.
Whenever I post an email address or register for anything online, I put thatsite@mydomain.com as my email address.
All email is received by one account, but each message can have a different "to:" header. I set my filters to filter mail to different boxes. Email sent to amazon@mydomain.com goes to the amazon folder. Same with ebay, slashdot, whatever.
Any time I start receiving spam, I just set my mail server to disregard email sent to whatever email address is getting the spam, and I can stop doing business with the company that sold my email address.

I receive on average 0 spams per day.
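The per-site address scheme described above boils down to a small routing rule: file each message by the local part of its "To:" address, and drop anything sent to an address that has been burned. A minimal sketch, assuming a private local part of `me` and folder names equal to the local part (both hypothetical choices, not details from the comment):

```python
from email.utils import parseaddr

MAIN_LOCAL_PART = "me"  # hypothetical private address, never posted online

def route(to_header, blocked):
    """Return a folder name for the message, or None to discard it."""
    _, address = parseaddr(to_header)
    local_part = address.split("@", 1)[0].lower()
    if local_part in blocked:
        return None          # this address was sold or leaked; drop the mail
    if local_part == MAIN_LOCAL_PART:
        return "inbox"       # friends and family
    return local_part        # e.g. amazon@... goes to the "amazon" folder

blocked = {"ebay"}  # example: the ebay address started receiving spam
```

With this setup, `route("amazon@mydomain.com", blocked)` files the message under "amazon", while anything addressed to `ebay@mydomain.com` is dropped.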

--
Albuquerque PC

Re:Why don't people use catch-all accounts? by sr180 · 2004-06-22 16:13 · Score: 4, Informative

Wait till the spammers decide to spam your whole domain. They can start with aaaaaaaa@yourdomain.com and keep going till they get to zzzzzzzz@yourdomain.com, and your mailserver will accept and pass on every single one of these emails.
I would recommend not using a catch all account, but if you have the domain, create, delete and rename email accounts as you need to...

--
In Soviet Russia the insensitive clod is YOU!

Re:Why don't people use catch-all accounts? by lewko · 2004-06-22 17:00 · Score: 4, Informative

I used to do the same. Now I'm paying for it.
Several viruses were sent to jane@mydomain, pete@mydomain, sedlskjl@mydomain etc.

Inevitably these same addresses are now being used for Spam and viruses as the source OR destination address (meaning I get bounce messages as well).

I HATE it when moron anti-Virus gateway administrators set them up to return confirmed viruses to sender with a polite note - except I am NOT the sender, my address was spoofed.

Unfortunately I have been using the catch-all trick for so long (e.g. ebay.com@mydomain etc.) that it's not as simple as turning it off or setting up filters - I don't even know what all the 'legit' addresses are as I used to create them on the fly and may only get email to some of them once a year or so.

I only ever busted one person for passing on the account details which was satisfying, but I am getting PLENTY of Spam/viruses now instead.

I use the excellent Spam Gourmet now for instantly creating disposable addresses with the added advantage that they can actually die when I want/need them to.

--
Do you or your partner snore? - Visit www.snoring.com.au

Another data point. by juuri · 2004-06-22 15:45 · Score: 4, Interesting

OS X's built-in Mail seems to be pretty close to the accuracy numbers listed in the above summary. I tend to have one to three pieces of spam slip through which are almost always entirely image based with some poetry or equivalent attached.

I must say I've been pleasantly surprised with the spam filtering it provides and it has been a lot easier than the hoops I used to utilize to clean out my inbox.

--
--- I do not moderate.

DSPAM by More+Trouble · 2004-06-22 15:48 · Score: 4, Insightful

In real-world deployments of statistical filters, something like DSPAM's "global user" feature is necessary. The ability to begin with a relatively mature dictionary is critical to the user experience. Personally, DSPAM is filtering around 200 SPAMs per day for me, allowing one through every few days. It's 99.985% effective for me.

:w

Re:Spamassassin uses collaborative spam-tracking by bigberk · 2004-06-22 15:53 · Score: 4, Informative

It gets better. Vernon Schryver, networking genius, is responsible for the Distributed Checksum Clearinghouse which does something similar, but as I understand it, is much more efficient for large servers. When our university turned on DCC filtering combined with greylisting, the daily spam to inboxes dropped from hundreds daily to ZERO (I kid you not). I am not aware of any false positives, at least on my account. DCC blew my mind.
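For anyone who hasn't met greylisting: the server temporarily rejects the first delivery attempt from an unfamiliar (client IP, sender, recipient) triplet and accepts a retry after a short delay, on the theory that legitimate mail servers retry and much spam software does not. The sketch below shows only that core idea; the 300-second delay and the class itself are illustrative assumptions, not the university's configuration.

```python
class Greylist:
    """Minimal in-memory greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, min_delay_seconds=300):
        self.min_delay = min_delay_seconds
        self.first_seen = {}  # triplet -> time of first delivery attempt

    def check(self, client_ip, sender, recipient, now):
        """Return "accept" or "tempfail" for a delivery attempt at time now."""
        triplet = (client_ip, sender.lower(), recipient.lower())
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return "accept"    # the sender retried after waiting long enough
        return "tempfail"      # ask the sending server to try again later

greylist = Greylist()
```

A real deployment would also expire old entries and whitelist known senders, which this sketch leaves out.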

Problems with Bayesian filtering by dlevitan · 2004-06-22 15:54 · Score: 4, Informative

Up to this past weekend I was using only bogofilter (which is a pure Bayesian filter). I seem to get about 200 spam a day on my main account. Until about a month or two ago bogofilter was amazing - I'd get maybe 1 or 2 spam a day, if that many. Then recently I suddenly started getting hit with 20 spam messages a day, and I noticed most of those were using lots of common words to bypass bogofilter. Most spam was still being removed by bogofilter, but enough got through to annoy me. This past weekend I also enabled spamassassin (without its bayes filter though), and it's cut down the number of spam to maybe 5 a day, but that's still too much for me. I'm hoping we have the next breakthrough in spam filtering technology soon (akin to bayesian filtering) because it seems that every new technique we use to filter the spam is eventually targeted by the spammers and bypassed.
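To see why padding a spam with common words can slip it past a Bayesian filter, here is a toy naive Bayes scorer. It is a simplified stand-in, not bogofilter's actual algorithm: the training data is made up, the smoothing is plain add-one, and only the words present in a message are scored.

```python
import math
from collections import Counter

def train(messages):
    """Count, per class, how many messages contain each word."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = {"spam": 0, "ham": 0}
    for text, label in messages:
        counts[label].update(set(text.lower().split()))
        totals[label] += 1
    return counts, totals

def spam_log_odds(text, counts, totals):
    """Sum per-word log-likelihood ratios; above zero leans spam."""
    score = math.log(totals["spam"] / totals["ham"])  # prior odds
    for word in set(text.lower().split()):
        p_spam = (counts["spam"][word] + 1) / (totals["spam"] + 2)
        p_ham = (counts["ham"][word] + 1) / (totals["ham"] + 2)
        score += math.log(p_spam / p_ham)
    return score

training = [
    ("cheap pills online", "spam"),
    ("cheap watches online", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
counts, totals = train(training)

plain = spam_log_odds("cheap pills", counts, totals)
padded = spam_log_odds("cheap pills meeting agenda lunch tomorrow", counts, totals)
```

Here `plain` comes out positive (spam-leaning) while `padded` comes out negative: every ham-flavoured word added to the message drags the combined score toward ham.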

Issues with testing corpus by w_mute · 2004-06-22 16:00 · Score: 5, Interesting

I haven't read everything in detail yet, but one of the things that stands out is that their 'gold standard' representing the best result consists of 9,038 ham messages (18.4%) and 40,048 spams (81.6%). While large, the dataset is unbalanced. One of the things that is recommended by many of the filters is training on equal proportions of ham/spam in order to prevent biasing (overfitting).

Their train-on-errors approach may simulate what goes on with some filters, but it doesn't reflect the scenario where there is an initial dataset to be trained on _before_ new messages are processed. Instead, each message is in essence 'new'. So in their tests the machine learning filters start out knowing nothing, but SpamAssassin starts out with its inbuilt ruleset. Not exactly fair.

-Greg

I'm running SpamAssassin at work. by khasim · 2004-06-22 16:21 · Score: 4, Insightful

People LOVE it.

There are some false positives and some false negatives.

But I have it set to delete anything 12+. That gets rid of the worst of the worst spam. So far, not a single complaint of any email being deleted.

Everything else has the subject re-written so people can run their own rule set against it.

In the past 8 hours
1867 messages received
375 messages deleted
1266 messages flagged as spam

So, only a few hundred actual, good emails.
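The policy described here is a two-threshold rule on the SpamAssassin score, and the eight-hour figures can be checked the same way. The 12-point delete cutoff is from the comment; the 5.0 flagging cutoff is an assumption (it is SpamAssassin's usual default required score, but the comment does not say what this site uses).

```python
DELETE_AT = 12.0  # from the comment: "delete anything 12+"
FLAG_AT = 5.0     # assumption: the site uses SpamAssassin's default score

def disposition(score):
    """Map a SpamAssassin score to what happens to the message."""
    if score >= DELETE_AT:
        return "delete"
    if score >= FLAG_AT:
        return "flag"     # subject is rewritten so users can filter on it
    return "deliver"

# The eight-hour figures quoted above.
received, deleted, flagged = 1867, 375, 1266
clean = received - deleted - flagged
clean_pct = 100 * clean / received
```

Here `clean` works out to 226 messages, roughly 12% of everything received in that window.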

Of course, that's only 4 hours during the regular work day (and 4 hours after work). But you can see the proportions. It saves people a TON of time.

And it makes them happier when they don't have to constantly dig through crap to see if any real messages have arrived.

Now, those spam messages are NOT distributed evenly. Our HR manager had her email address posted on the website. So she gets about 20-25% of the spam.

It's not exactly Big Brother 'cause no human sees the deleted spam.

Re:in related news by Crudely_Indecent · 2004-06-22 16:24 · Score: 4, Interesting

I can certainly see how waiting on our government will decrease the number of messages transmitted through my mail servers daily.

It's reassuring to know that the "authorities" have effectively reduced the number of messages through my server by 10-14k per day......What great guys, those 'authorities', aren't they thoughtful and quick to respond. We've only been waiting for a spam-relief law for....10 years and they finally gave one to us. Oh wait....SpamAssassin is what reduced those messages.

The reason we don't wait for the gov to step in and take care of business is that THEY'VE DONE NOTHING SO FAR. You expect me to believe the government will solve my spam problems? I'm not holding my breath.

A combination of RBLs, DNSBLs, F-Prot, and SpamAssassin is what reduced the number of messages sent through my servers. I'm interested in results NOW, not legislation tomorrow.

--

"Lame" - Galaxar

Spamgourmet (antichef) and SpamSieve by dougman · 2004-06-22 16:38 · Score: 4, Informative

Why people don't use disposable accounts is beyond me. Once you start using Spamgourmet you'll never go back. I've been active with them over two years and here's my current stats:

Your message stats: 339 forwarded, 43,796 eaten. You have 155 disposable address(es).

Yeah, that's right, thanks to disposable addresses I *haven't* read 43,457 spam emails! When I do need (want) to use my real address, I use SpamSieve (with Entourage X) - a very good Bayesian filter (not sure if it's Mac-only or not).

Re:POPFile? by puppetman · 2004-06-22 17:53 · Score: 4, Interesting

Yah, I ran this for about a year before I switched ISPs (and got a new, spam-free email account).

It was amazingly accurate, with about one mistake per thousand emails once I had it trained. I'll go back to it if I start to get a bunch of crap in my in-box. I remember reading that spammers would test their emails against the most popular anti-spam filters, but they still almost never got through POPFile.

I tried SpamAssassin as well, after I had some issues with POPFile (it would stop responding after a large volume of email), and it was more difficult to set up, and didn't have the nice configuration options of POPFile.

Counterintuitive Advertising by KalvinB · 2004-06-22 19:36 · Score: 4, Interesting

Some guy a few stories back mentioned he was getting 3000 ad impressions and 15 clicks a day or so with AdSense. Which is terrible. At first I assumed he was just oversaturating his visitors with ads. But his ad placement is also terrible. It's at the very bottom of the page where few are going to see it. But he is also oversaturating. His pages are very busy with information and the ads are on every single page.

What happens when you constantly shove something in someone's face is that they learn to ignore it. Either consciously or subconsciously. In the case of advertising if someone is shown an ad and they aren't interested and another ad is shown there's a very good chance they won't even notice it. Even if they would have been interested in what it was offering. This is because they were annoyed by the first ad so they just mentally block any additional ads.

This is why the response rate to spam is so terrible. People for the most part just subconsciously ignore it. It's just noise.

Advertisers like radio stations because it tends to be a captive audience. People are very unlikely to turn the station when ads come on. However there is one local station that I've learned to turn the channel on when the ads start because I know I'm going to get to my destination before another song comes on. There are other stations that I don't change the channel on because I know it's just a short break.

Just like the guy pumping out 2985 ads that no one clicks on, spammers would benefit immensely by pulling a large chunk of the ads. People are more likely to notice when they aren't bombarded by ads and the response percentage goes up.

It seems counterintuitive that less advertising means a greater response but that's actually the case.

I normally notice the ad banners on Slashdot because that's pretty much all the advertising there is. I rarely ever notice the text ads. Even though they're placed on the left side in the best position as anyone who scrolls the page is probably going to see them. Slashdot's problem is that the ads blend in with the web-site's color scheme too well so they're pretty much invisible to anyone with a scroll wheel.

On GameDev the site is so littered with advertising that I never notice it anymore. By the time I close the stupid popup ads that circumvent Google's pop up blocker using evil little tricks I'm too annoyed to even look at the other ads.

Web-sites get desperate and think more ads == more money. And the actual result is less valuable ad space because the click thru rate is so low and fewer clicks because users tune the ads out which results in less money than if they had focused on the click thru percentage rather than the number of impressions. If you have a web-site with a high click thru rate advertisers are more likely to pay more because they know that if they show an ad there's a very good chance they'll get a click thru.

But then I'm guessing spammers have never taken a course in marketing or bothered to think about things from their potential customers' perspective.

Keeping ineffective ads visible hurts the effectiveness of the better ads. Spammers are in effect destroying themselves in that area. As are ad happy web-sites.

Ben

--
Work Safe Porn

DSPAM. by asackett · 2004-06-22 20:20 · Score: 4, Interesting

I've been using DSPAM for nearly a year now, and it's just kept on getting better. I can't imagine life without it now.

I have 17 DNS-based blacklists in front of it, because I would rather block the messages at the network interface than filter them with my own resources, but those that slip through don't stand much of a chance of reaching my inbox. I have had my current email address out there on the web and in Usenet for six years, so I see a lot of junk -- DSPAM stops all but one or two per month. SpamAssassin can't even come close to that.

--

Warning: This signature may offend some viewers.

Slashdot Mirror
