Spamassassin Beats CRM-114 In Anti-Spam Shootout
Simon Lyall writes "A new study of antispam software shows that SpamAssassin performed well in various configurations. SpamProbe, Bogofilter and SpamBayes also came out well, while CRM-114 failed to live up to its previous claims. The study shows: 'The best-performing filters reduced the volume of incoming spam from about 150 messages per day to about 2 messages per day.'"
CRM-114
The link in the article points to SpamBayes again.
"People that quote themselves in their signatures bother me" - athakur999
The Mozilla spam filter does a very good job too; once it has learned enough it becomes over 95% accurate. I dropped Evolution for it and never looked back.
I must admit that I am not up to date on these new anti-spam software packages that operate on the server side. However, what is the probability of real mail getting rejected by these things? It seems almost like an invasion of privacy to block my own email, even if it is from a "benevolent big brother" perspective.
I guess that is why there are privacy policies though.
aj
...false positives?
Baysian, gaysian. Real men hit delete.
The best-performing filters reduced the volume of incoming spam from about 150 messages per day to about 2 messages per day.
How many false positives though?
I use Spamassassin. Surviving mail then goes through CRM-114. At least in my case, it works better than each of the filters on its own.
And a long document it is (funny placeholder images though). Here are the conclusions for the impatient who want a little more than the summary:
Supervised spam filters are effective tools for attenuating spam. The best-performing filters reduced the volume of incoming spam from about 150 messages per day to about 2 messages per day. The corresponding risk of mail loss, while minimal, is difficult to quantify. The best-performing filters misclassified a handful of spam messages early in the test suite; none within the second half (25,000 messages). A larger study will be necessary to distinguish the asymptotic probability of ham misclassification from zero.
Most misclassified ham messages are advertising, news digests, mailing list messages, or the results of electronic transactions. From this observation, and the fact that such messages represent a small fraction of incoming mail, we may conclude that the filters find them more difficult to classify. On the other hand, the small number of misclassifications suggests that the filter rapidly learns the characteristics of each advertiser, news service, mailing list, or on-line service from which the recipient wishes to receive messages. We might also conjecture that these misclassifications are more likely to occur soon after subscribing to the particular service (or soon after starting to use the filter), a time at which the user would be more likely to notice, should the message go astray, and retrieve it from the spam file. In contrast, the best filters misclassified no personal messages, and no delivery error messages, which comprise the largest and most critical fraction of ham.
A supervised filter contributes significantly to the effectiveness of Spamassassin's static component, as measured by both ham and spam misclassification probabilities. Two unsupervised configurations also improved the static component, but by a smaller margin. The supervised filter alone performed better than the static rules alone, but not as well as the combination of the two.
The choice of threshold parameters dominates the observed differences in performance among the four filters implementing methods derived from Graham's and Robinson's proposals. Each shows a different tradeoff between ham accuracy and spam accuracy. ROC analysis shows that the differences not accountable to threshold setting, if any, are small and observable only when the ham misclassification probability is low (i.e. hm
CRM-114 and DSPAM exhibit substantially inferior performance to the other filters, regardless of threshold setting. Both exhibit substantial learning throughout the email stream, leading us to conjecture that their performance might asymptotically approach that of the other filters. From a practical standpoint, this learning rate would be too slow for personal email filtering as it would take several years at the observed rate to achieve the same misclassification rates as the other systems. Both these systems were designed to be used in a train on error configuration, and do not self-train. This configuration could account for a slow learning rate as each system avails itself of the information in only about 1,000 of the 50,000 test messages. In an effort to ensure that we had not misinterpreted the installation instructions, we ran CRM-114 in a train-on-everything configuration and, as predicted by the author, the result was substantially worse.
Spam filter designers should incorporate interfaces making them amenable for testing and deployment in the supervised configuration (figure 4). We propose the three interface functions used in algorithm 1 - filterinit, filtereval, and filtertrain - as a standardized interface. Systems that self-train should provide an option to self-train on everything (subject to correction via filtertrain) as in algorithm 2.
Ham and spam misclassification proportions should be reported separately. Accuracy, weighted accuracy, and precision should be avoided as primary evaluation measures as th
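The conclusions name three interface functions (filterinit, filtereval, filtertrain), but the thread does not reproduce algorithm 1 itself. As a rough Python sketch of what such a supervised test loop might look like: the `KeywordFilter` class and its learning rule are invented purely for illustration, and the `train_on_error` switch is an assumption modelled on the train-on-error configuration mentioned in the conclusions. Ham and spam error rates are returned separately, as the authors recommend.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordFilter:
    """Toy stand-in for a real filter; the learning rule is hypothetical."""
    spam_words: set = field(default_factory=set)
    ham_words: set = field(default_factory=set)

    def filtereval(self, message: str) -> str:
        # Classify by whichever word set the message overlaps more.
        words = set(message.lower().split())
        spam_hits = len(words & self.spam_words)
        ham_hits = len(words & self.ham_words)
        return "spam" if spam_hits > ham_hits else "ham"

    def filtertrain(self, message: str, truth: str) -> None:
        # Remember the words of a labelled message; ham words win ties.
        words = set(message.lower().split())
        if truth == "spam":
            self.spam_words |= words - self.ham_words
        else:
            self.ham_words |= words
            self.spam_words -= words


def filterinit() -> KeywordFilter:
    return KeywordFilter()


def run_supervised(stream, train_on_error=False):
    """Classify each message in arrival order, then feed back the true label.

    Returns (ham misclassification rate, spam misclassification rate).
    """
    flt = filterinit()
    errors = {"ham": 0, "spam": 0}
    totals = {"ham": 0, "spam": 0}
    for message, truth in stream:
        guess = flt.filtereval(message)
        totals[truth] += 1
        if guess != truth:
            errors[truth] += 1
        # Train on every message, or only on mistakes if requested.
        if not train_on_error or guess != truth:
            flt.filtertrain(message, truth)
    return tuple(
        errors[label] / totals[label] if totals[label] else 0.0
        for label in ("ham", "spam")
    )
```

Keeping the two rates apart matters because a lost ham message usually costs far more than a spam that slips through, which a single accuracy figure would hide.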
I wonder how Mozilla Messenger/Thunderbird's spam filtering stacks up against these filters? I've heard some negative comments about the Mozilla filtering system, but it's worked wonders for me.
I have been using SpamAssassin in conjunction with Evolution and it has cut my spam to virtually nothing. I wish it were built right into Evolution so that it was a little faster; however, it is worth the wait, as I barely ever get any spam in my Inbox anymore. I set it up with Evolution very similar to how it is shown here. I really like using it with Evolution, but I am curious whether anyone knows of anything that would work faster and as efficiently in conjunction with Evolution.
Is to do away with current email protocols and go with new ones with verification.
That should take care of the problems. The gov is now concentrating on this.
How exactly does the US (or other first world country) go about writing a code of law that puts Nigerian spammers in jail?
You don't want to face an assassin in a shootout. Maybe a pie eating contest, or a spelling bee... but not a shootout.
CRM114's best was about 80%, which lasted for a few weeks (weeks 3-5). Before and after that, it's doing well to catch 25% of the spam. I'm not sure why, but for the last month it's only been catching about 10%. When one gets through, I run it through mailfilter.crm with the learnspam switch. It'll say it's learned it, but if I have it check the spam again, it still lets it past.
I have been using spamprobe for some time, with the webfilt front-end, and I'm very pleased with the speedy spamprobe program (written in C++).
I receive approximately 10 legit emails/day and about 300 spam/day. I have only had 2 false positives overall (that's 2 out of about 100,000 total emails received) and on average only 2 spams/day slip past the filter. Now I'm testing Spambayes on one of my most spammed accounts, but it's definitely much slower than spamprobe and not more accurate as far as I can tell.
From page 24: Hidalgo suggests the use of ROC curves, originally from signal detection theory and used extensively in medical testing, as better capturing the important aspects of spam filter performance.
Perhaps a distributed analysis system (similar to SETI@home) could be used to combat spam. Not only could the idle time of bazillions of CPUs be leveraged to improve "signal" analysis, but perhaps the clients could analyze local incoming mail to correlate new trends in spam originators and then share that information with all of the other clients. Then you could combine that with the genetic evolution improvements of the F1 sim-cars recently mentioned on /.
So there's the high-level idea, now you smart people go make it work. :-)
Content RBLs have been working fairly well for me
Maybe I'm missing something, but isn't human accuracy always going to be 100%? I tell the computer what is spam, and it learns. I may decide that regardless of what it thinks, this last message is OK. So aside from clicking too fast or changing your mind (which is a common thing to do), how can a filter ever be said to be better than people at deciding what people want to see?
Filtering tools work fairly well, but more importantly they work right now. Waiting for the authorities to "wake from their slumber" might take years, if it ever even happens.
Razor: Vipul's Razor is a collaborative spam-tracking database, which works by taking a signature of spam messages. Since spam typically operates by sending an identical message to hundreds of people, Razor short-circuits this by allowing the first person to receive a spam to add it to the database -- at which point everyone else will automatically block it.
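To make the idea concrete, here is a minimal Python sketch of a shared signature database. It is not Razor's actual protocol or signature scheme: the `SignatureDB` class is hypothetical, it lives in memory rather than on a network, and it uses an exact SHA-1 digest of lightly normalized text where a real system would need something more tolerant of per-recipient variation.

```python
import hashlib


class SignatureDB:
    """In-memory stand-in for a shared spam-signature service (hypothetical)."""

    def __init__(self):
        self._known = set()

    @staticmethod
    def signature(body: str) -> str:
        # Collapse whitespace and case so trivial reformatting does not
        # change the digest. This is an exact-match digest, nothing fuzzier.
        normalized = " ".join(body.lower().split())
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def report(self, body: str) -> None:
        # The first recipient of a spam adds its signature.
        self._known.add(self.signature(body))

    def is_spam(self, body: str) -> bool:
        # Everyone else checks incoming mail against the reported signatures.
        return self.signature(body) in self._known
```

The obvious weakness of exact matching shows up immediately: adding a single random word to each copy changes the digest, so the lookup misses.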
This is really cool.
I did a *much* smaller test of spam filters earlier this year (which was published in hakin9 but not in English).
I also found that crm114 gave poor results in comparison to other filters - but figured I must have set something up incorrectly...
- I only use my main email address with friends and family, and never post it online.
- Whenever I post an email address or register for anything online, I put thatsite@mydomain.com as my email address.
- All email is received by one account, but each message can have a different "to:" header. I set my filters to filter mail to different boxes. Email sent to amazon@mydomain.com goes to the amazon folder. Same with ebay, slashdot, whatever.
- Any time I start receiving spam, I just set my mail server to disregard email sent to whatever email address is getting the spam, and I can stop doing business with the company that sold my email address.
I receive on average 0 spams per day.
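The per-site address scheme in the list above could be sketched roughly like this in Python, using the standard `email` package. `mydomain.com` is the placeholder from the comment; the `BLOCKED` set, the `route` function and the "inbox" fallback are assumptions made for the example, not the commenter's actual mail server setup.

```python
from email import message_from_string
from email.utils import getaddresses

MY_DOMAIN = "mydomain.com"  # placeholder domain from the comment
BLOCKED = {"leakysite"}     # hypothetical: addresses that started receiving spam


def route(raw_message: str):
    """Return a folder name for the message, or None to discard it."""
    msg = message_from_string(raw_message)
    for _name, addr in getaddresses(msg.get_all("To", [])):
        local, _, domain = addr.lower().partition("@")
        if domain != MY_DOMAIN:
            continue
        if local in BLOCKED:
            return None   # this address was sold or leaked; drop the mail
        return local      # e.g. amazon@mydomain.com goes to folder "amazon"
    return "inbox"        # no per-site address found
```

For example, `route("To: amazon@mydomain.com\n\nYour order shipped")` returns `"amazon"`, while mail to a blocked address returns `None`.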
OSX's built in mail seems to be pretty close to the accuracy numbers listed in the above summary. I tend to have one to three pieces of spam slip through which are almost always entirely image based with some poetry or equivalent attached.
I must say I've been pleasantly surprised with the spam filtering it provides and it has been a lot easier than the hoops I used to utilize to clean out my inbox.
Whatever. Your "never-ending battle of updating filters and formulas" works fine.
Anyone know that three letter prefix to get through the CRM-114?
In real world deploys of statistical filters, something like DSPAM's "global user" feature is necessary. The ability to begin with a relatively mature dictionary is critical to the user experience. Personally, DSPAM is filtering around 200 SPAMs per day for me, allowing one through every few days. It's 99.985% effective for me.
It's unfortunate that DSPAM was left out of this very good report. I have personally used SpamAssassin, SpamProbe and DSPAM.
After using each for a couple of months at a time, I found DSPAM to be by far the most effective (after it was properly trained).
DSPAM's claim "DSPAM (as in De-Spam) is an extremely scalable, open-source statistical hybrid anti-spam filter. While most commercial solutions only provide a mere 95% accuracy (1 error in 20), a majority of DSPAM users frequently see between 99.95% (1 error in 2000) all the way up to 99.991% (2 errors in 22,786). DSPAM is currently effective as both a server-side agent for UNIX email servers and a developer's library for mail clients, other anti-spam tools, and similar projects requiring drop-in spam filtering. DSPAM has been implemented on many large and small scale systems with the largest systems being reported at about 125,000 mailboxes." was quite accurate for me.
Up to this past weekend I was using only bogofilter (which is a pure Bayesian filter). I seem to get about 200 spam a day on my main account. Until about a month or two ago bogofilter was amazing - I'd get maybe 1 or 2 spam a day, if that many. Then recently I suddenly started getting hit with 20 spam messages a day, and I noticed most of those were using lots of common words to bypass bogofilter. Most spam was still being removed by bogofilter, but enough got through to annoy me. This past weekend I also enabled spamassassin (without its bayes filter though), and it's cut down the number of spam to maybe 5 a day, but it's still too much for me. I'm hoping we have the next breakthrough in spam filtering technology soon (akin to Bayesian filtering) because it seems that every new technique we use to filter the spam is eventually targeted by the spammers and bypassed.
Not everyone is as much of an RBL cheerleader as you are.
Only 2 messages out of 150 normally get through that are spam? Good god, I normally get 5-10 spam messages a day that get through SpamAssassin. That's 750-1,500 spam e-mails a day! I thought it was bad before I enabled SpamAssassin a few months ago... but Jesus, man am I glad I got SA!
users. those silly, silly users. i was in charge of spam for my company for the greater part of a year. using an outdated KEYWORD based system > I was forced to read every.caught.message to look for false positives. ...
did you catch that? yeah...i had to go through EVERY 'spam' tagged e-mail that went through the company.
needless to say, after the first week i was ready to gouge my eyes out. but hey, at least i earned that 'i read your e-mail' sticker!
anyways, the point that i'm failing to make here is the cause of the spam...
the damn users.
whether it be responding to spam, putting their e-mail address in every single webform they encounter while surfing instead of working, signing up for spam voluntarily, or whatever the cause may be..
i ran some numbers on the logs, and came to an astounding find.
a few people were getting literally a thousand messages blocked, per month.
i, on the other hand, had maybe one or two a month.
and i'm not a nazi with my e-mail address....but i do take some care in what places i type it in.
an ounce of prevention goes a long way folks.
SpamAssassin used to be super-good for me, but recently it has become a nightmare... even with Bayes filters on and training it with almost 2000 spam messages that have escaped it before, I STILL get an enormous amount of spam every day... maybe I'm doing something wrong with the config, I admit that I haven't spent that much time on that, but it seems like it should be working better :-((.
Spam sucks. Everyone stop buying the products advertised and it'll be over. But then again, people will always be too dumb for an easy solution like that (reminds me of the gooback southpark...)
In this message you claim that no content-based filter "comes close" to the 95% accuracy of your RBLs, but some of the content-based filters in this story do better than that (which is consistent with my own personal accuracy rate from SpamBayes, with e.g. a spam misclassification rate of maybe ~2%).
I haven't read everything in detail yet, but one of the things that stands out is that their 'gold standard' representing the best result consists of 9,038 ham messages (18.4%) 40,048 spams (81.6%). While large, the dataset is unbalanced. One of the things that is recommended by many of the filters is training on equal proportions of ham/spam in order to prevent biasing (overfitting).
Their train-on-errors approach may simulate what goes on with some filters, but it doesn't reflect the scenario where there is an initial dataset to be trained on _before_ new messages are processed. Instead, each message is in essence 'new'. So in their tests the machine learning filters start out knowing nothing, but SpamAssassin starts out with its inbuilt ruleset. Not exactly fair.
-Greg
I have tried a number of Baysian type filters and none of them filter the spam when I send it...
just my humble opinion...
I use email for business and receive many letters from clients. I'm just afraid to lose any of these because of a spam filter. Therefore, even when I used one, I checked all the emails anyway.
I use Netscape's Bayesian filter as a second tier, and that removes about 60% of the remaining spam.
SpamCop was better, until IronPort bought them and they went black-hat, with Bonded Spammer and the Spam Engine.
Now I have gmail.
RBLs only work against honest admins, getting them to clean up the holes in their security. Spammers aren't honest, and as you say, will just use worms to invade machines to create proxies.
RBLs have been around for years, but the amount of spam Spamassassin catches on its way in to me is ever-increasing. If RBLs worked, the spam problem would have been solved years ago.
On the other hand, the amount of spam getting past Spamassassin to me is pretty steady. I guess that indicates it's getting better. Mostly what gets past is what the article calls "backscatter": delivery failure messages caused by spammers forging my email address.
Should systems that send backscatter be blacklisted? I'd tend to say yes: they should only send failure notices to senders who pass some sort of verification like SPF. Putting them in an RBL really would encourage them to do that.
...hammer the spammer's ISP with complaints until the advertised website is DEAD, DEAD, DEAD.
STOP MISUSING APOSTROPHES, YOU MORONS!!!
The shift key is next to the Z on the left of the keyboard, and next to the / on the right.
It's often used on the first letter after a full stop - '.' character.
People LOVE it.
There are some false positives and some false negatives.
But I have it set to delete anything 12+. That gets rid of the worst of the worst spam. So far, not a single complaint of any email being deleted.
Everything else has the subject re-written so people can run their own rule set against it.
In the past 8 hours
1867 messages received
375 messages deleted
1266 messages flagged as spam
So, only a few hundred actual, good emails.
Of course, that's only 4 hours during the regular work day (and 4 hours after work). But you can see the proportions. It saves people a TON of time.
And it makes them happier when they don't have to constantly dig through crap to see if any real messages have arrived.
Now, those spam messages are NOT distributed evenly. Our HR manager had her email address posted on the website. So she gets about 20-25% of the spam.
It's not exactly Big Brother 'cause no human sees the deleted spam.
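The two-tier policy described in this comment (delete at 12 or above, rewrite the subject on everything else flagged as spam) could be sketched as below. The 12-point cutoff comes from the comment; the 5.0 tagging threshold and the "[SPAM]" subject prefix are assumptions, since the commenter states neither.

```python
DELETE_AT = 12.0  # from the comment: anything scoring 12+ is deleted
TAG_AT = 5.0      # assumption: the tagging threshold is not stated


def disposition(score: float, subject: str):
    """Return (action, subject) for a message with the given spam score."""
    if score >= DELETE_AT:
        return ("delete", None)
    if score >= TAG_AT:
        # Rewritten subject lets each user apply their own mail rules.
        return ("deliver", "[SPAM] " + subject)
    return ("deliver", subject)


# The eight-hour tally quoted above: received, deleted, flagged.
received, deleted, flagged = 1867, 375, 1266
unflagged = received - deleted - flagged  # 226 messages reach users untagged
```

On those figures, 226 of 1867 messages (about 12%) arrived without a spam tag, which matches the "only a few hundred actual, good emails" estimate.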
RBLs WORK. This is why spammers are forced to use worms to invade users' machines to create proxies. As soon as the authorities wake from their slumber and start prosecuting these scumbags who break into others' machines, the whole spam thing will essentially be over. But don't tell that to the little content-based-filtering-fools. They obviously have money to burn.
In case you haven't heard, most of us with real jobs that require spam control can't wait for 'authorities to wake up' and cannot be expected to take advice from people that do, whatever the fuck it is you do, which is OBVIOUSLY not related at all to protecting people and/or resources from the abuse of spammers.
No false positives, disgusting amounts of spams killed. 'Tis a glorious thing.
I can certainly see how waiting on our government will decrease the number of messages transmitted through my mail servers daily.
It's reassuring to know that the "authorities" have effectively reduced the number of messages through my server by 10-14k per day......What great guys, those 'authorities', aren't they thoughtful and quick to respond. We've only been waiting for a spam-relief law for....10 years and they finally gave one to us. Oh wait....SpamAssassin is what reduced those messages.
The reason we don't wait for the gov to step in and take care of business is that THEY'VE DONE NOTHING SO FAR. You expect me to believe the government will solve my spam problems? I'm not holding my breath.
A combination of RBLs, DNSBLs, F-Prot, and SpamAssassin is what reduced the number of messages sent through my servers. I'm interested in results NOW, not legislation tomorrow.
"Lame" - Galaxar
And it has just now learned to filter out almost all the spam. IIRC, SpamAssassin said it would learn what to mark as spam after a couple hundred obvious spams and the same number of obvious non-spams. I still get the occasional false positive.
The first person who says gmail is getting shot. By me.
This article from the beeb puts human accuracy over machine accuracy...
Yahoo! allows you to have suspected spam automatically deleted or moved to a spam folder. It also allows you to disable the spam filter completely. (Mail Options -> Spam Protection)
As for SpamAssassin, I've been using it for about a week on my mail server. There have been about 500 filtered spams and one false positive - an AOL greeting card.
Why people don't use disposable accounts is beyond me. Once you start using Spamgourmet you'll never go back. I've been active with them over two years and here's my current stats:
Your message stats: 339 forwarded, 43,796 eaten. You have 155 disposable address(es).
Yeah, that's right, thanks to disposable addresses I *haven't* read 43,457 spam emails! When I do need (want) to use my real address, I use SpamSieve (with Entourage X) - a very good Bayesian filter (not sure if it is Mac-only or not).
Putting them in an RBL really would encourage them to do that.
Err, you can try that, but I would not recommend it. I think you would quickly find there are not many servers out there that you could talk to.
Maybe once X% of the internet adopts some sort of sender verification, an RBL may stand a chance.
Still, spammers would just send backscatter to you through hosts that are permitted to send for a domain. Ever seen how many 0wn3d Windows boxes there are out there?
Where there is a hole, spammers will find it. Too bad there isn't a spammer death sentence.
For an INDIVIDUAL, Bayesian filter works far better than just the regular SpamAssassin rulesets.
That's because the Bayesian system will LEARN from you what you consider to be spam and ham.
I use SpamAssassin with Bayesian filtering turned on and it catches over 90% of the spam. But then I've fed it a decent sized corpus.
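For anyone wondering what "learning from you" amounts to, here is a deliberately tiny Python sketch of per-token Bayesian scoring. It is not SpamAssassin's, Bogofilter's or SpamBayes's actual algorithm: the `TinyBayes` class, the add-one smoothing and the whitespace tokenizer are all simplifying assumptions.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy per-user Bayesian scorer (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        # Count each distinct token once per message.
        self.counts[label].update(set(text.lower().split()))
        self.messages[label] += 1

    def spam_probability(self, text: str) -> float:
        # Start from the prior odds, add per-token log-odds with add-one
        # smoothing, then squash the total back into the range 0..1.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for token in set(text.lower().split()):
            p_spam = (self.counts["spam"][token] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][token] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))
```

Because every token contributes, padding a spam with words that look like the user's ham pulls the score back toward neutral, which is the word-salad trick other commenters in this thread describe.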
Thunderbird already has integrated significant improvements based on SpamBayes, I believe. See http://bugzilla.mozilla.org/show_bug.cgi?id=230093 , which was closed about a month ago. The test data from that patch is encouraging, although obviously results will be different for everyone since not everyone gets the same type of spam. If you want to keep tabs on upcoming refinements to junk mail filtering, take a look at the dependencies of this meta bug: http://bugzilla.mozilla.org/show_bug.cgi?id=228674 . Please don't "spam" up that bug with comments though; if you have something to say, put it in a specific bug or file a new one if something relevant doesn't exist.
That's funny. Evolution under MDK 10 uses Spamassassin.
It has to be said, did they set the CRM-114 to discriminator to OPE or some other arrangment of P,O,E cause ya'll know unless you specify the code prefix you can't recall the spam and the doomsday device will go off.. cause that spam can get in there real low, I mean if the spammer is _really_ good he can fly, er send that e-mail right under their radar
No time to read it, son, just email it to me.
Would be interesting to see how that message sample reacts against more spam filtering technologies, or even webmails with spam protection integration.
maybe I'm doing something wrong (wouldn't be the first time). I run the spamd command as root (tried it with the -d option too), pointed sa-learn at 3000+ spams and about 200 hams and set up kmail filters to pipe everything less than 250k through spamc and move anything with X-Spam-Flag=Yes to junk. It's slow as heck and only filters about 60% of my spam. Bogofilter was doing about 80% (it's more trouble to set up though). But I keep reading posts of people with 98% filter rates.
This is probably a good life lesson for you.
Learning not to rely on the government for something as trivial as spam legislation will help ease you into not relying on the government for more critical things that it could screw up, like healthcare.
I have used it sitewide (small site, about a dozen active mailboxes) for a few months. Currently it has an error rate of about 1 or 2 mistakes per week per mailbox (in mailboxes that get 100+ spams per day). I did have to do a lot of work to configure it properly though, which may be the reason the authors saw poor performance from it; the "forward to yourself to train" didn't work at all because both my IMAP server and my mail reader would slightly reformat my headers, meaning that CRM114 was training on different text than it saw when it was filtering! So I put together my own system to save pristine copies of all inbound mail and train on them as needed. Maybe the reason CRM114 fared so poorly is the difficulty in setting it up properly?
There is a somewhat interesting article where they more or less explain how the Mac OS X Mail application works regarding Spam:
http://www.macdevcenter.com/pub/a/mac/2004/05/18/spam_pt2.html
Content-based spam filtering is a waste of time. [...] It's a never-ending battle of updating filters and formulas.
I update my SpamAssassin config file once a year or so. This hardly seems burdensome. And generally my updates have to do with which RBLs it uses for assigning point values. Other than that, I use the defaults plus the Bayesian filter.
Since the filter self-trains based in part on the RBL scores, it autoadjusts to new spam. And if you have spamtrap addresses, you can feed those back in, too.
My setup is well over 99% accurate, with no false positives in months.
RBLs WORK.
Yes, and I use those, too. Some I use for outright rejection of connections, and some count toward the spamminess score. As soon as they get the URL-based RBLs working, I'll use those, too. Why wouldn't you use all the tools at your disposal?
I'd like to second SpamSieve. If more than one piece of spam gets through in a day (where each day I receive > 500 pieces of email), I am truly surprised. My stats for June are:
Works for me. Oh, the false positive was a list that I just signed up for. They sent a confirmation mail, I checked to see if it was caught (it was), and marked it as "good". Piece of cake.
Are you sure you trained it on a proper corpus? You have to look at your mail with a real mail reader, eg. Mutt. You will probably find that your good mail corpus is full of spam that was marked for deletion but isn't really deleted. This will cause the filter to train badly.
I think 200 hams is way too small. Keep sorting and it should improve.
I lost a ton of emails in v2.63 of spamassassin. I use a chain of fetchmail -> postfix -> kmail get -> filter through spamc -> kmail inbox/spam.
I had to turn off spamc processing because I lost a bunch of email. Maybe it was a bad interaction with kmail, but it was disheartening nonetheless. Taking out the spamc filter, I did not run into the problem again.
Firstly it should be remembered that the 'owned' part is a bit subjective as most of the project could live on regardless of 'ownership' thanks to it being opensource. But regardless of that.. am I the only one that finds the prospect of microsoft buying SpamAssassin a bit odd?
Microsoft to buy Network Associates?
At the very least they'd be buying the name and the tarted up version of SpamAssassin sold as SpamKiller.
Some guy a few stories back mentioned he was getting 3000 ad impressions and 15 clicks a day or so with AdSense. Which is terrible. At first I assumed he was just oversaturating his visitors with ads. But his ad placement is also terrible. It's at the very bottom of the page where few are going to see it. But he is also over saturating. His pages are very busy with information and the ads are on every single page.
What happens when you constantly shove something in someone's face is that they learn to ignore it. Either consciously or subconsciously. In the case of advertising if someone is shown an ad and they aren't interested and another ad is shown there's a very good chance they won't even notice it. Even if they would have been interested in what it was offering. This is because they were annoyed by the first ad so they just mentally block any additional ads.
This is why the response rate to spam is so terrible. People for the most part just subconsciously ignore it. It's just noise.
Advertisers like radio stations because it tends to be a captive audience. People are very unlikely to turn the station when ads come on. However there is one local station that I've learned to turn the channel on when the ads start because I know I'm going to get to my destination before another song comes on. There are other stations that I don't change the channel on because I know it's just a short break.
Just like the guy pumping out 2985 ads that no one clicks on, spammers would benefit immensely by pulling a large chunk of the ads. People are more likely to notice when they aren't bombarded by ads and the response percentage goes up.
It seems counterintuitive that less advertising means a greater response but that's actually the case.
I normally notice the ad banners on Slashdot because that's pretty much all the advertising there is. I rarely ever notice the text ads. Even though they're placed on the left side in the best position as anyone who scrolls the page is probably going to see them. Slashdot's problem is that the ads blend in with the web-site's color scheme too well so they're pretty much invisible to anyone with a scroll wheel.
On GameDev the site is so littered with advertising that I never notice it anymore. By the time I close the stupid popup ads that circumvent Google's pop up blocker using evil little tricks I'm too annoyed to even look at the other ads.
Web-sites get desperate and think more ads == more money. And the actual result is less valuable ad space because the click thru rate is so low and fewer clicks because users tune the ads out which results in less money than if they had focused on the click thru percentage rather than the number of impressions. If you have a web-site with a high click thru rate advertisers are more likely to pay more because they know that if they show an ad there's a very good chance they'll get a click thru.
But then I'm guessing spammers have never taken a course in marketing or bothered to think about things from their potential customer's perspective.
Keeping ineffective ads visible hurts the effectiveness of the better ads. Spammers are in effect destroying themselves in that area. As are ad happy web-sites.
Ben
RBLs and DNSBLs have way too many false positives on their own! - Especially if you use the big lists like SPEWS and SpamHaus. As they list all IP space of ISPs regardless of whether they're spam-sources or not, you'd end up blocking 99%+ ham from ISPs who simply (allegedly) provide spam-support (often just dns-hosting or less).
Using them in conjunction with SpamAssassin is a much better idea. Then ham will not score above the threshold (only spam-characteristic is the source of the mail), while true spam will get a boost in score and thus pass the threshold with much more certainty.
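The scoring idea above can be sketched in a few lines. This is only an illustration: the threshold of 5.0, the 2.0-point blocklist bonus, and the function names are assumptions made up for the example, not SpamAssassin's real rule names or weights.

```python
# Hypothetical numbers for illustration; real rule weights differ per setup.
THRESHOLD = 5.0       # score at or above which a message is treated as spam
DNSBL_POINTS = 2.0    # assumed bonus for a sender found on a blocklist


def total_score(content_score: float, listed_on_dnsbl: bool) -> float:
    """Add the blocklist bonus to the score from the content rules."""
    return content_score + (DNSBL_POINTS if listed_on_dnsbl else 0.0)


def is_spam(content_score: float, listed_on_dnsbl: bool) -> bool:
    """A listing alone never decides; only the combined score does."""
    return total_score(content_score, listed_on_dnsbl) >= THRESHOLD


# Ham from a listed ISP: the listing is its only spam-like trait.
ham_from_listed_isp = is_spam(0.5, True)
# Borderline spam: the listing is what pushes it over the threshold.
borderline_listed = is_spam(3.5, True)
borderline_unlisted = is_spam(3.5, False)
```

With these made-up weights, the listed ham stays under the threshold, while the borderline message is only caught when its source is also listed.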
Remember that SPEWS and SpamHaus are not listing spam-sources, they're actually listing the opposite as about 98% of the listed IP-space is not (and has never been) a spam-source. The purpose is blackmail of course, and the people behind these lists clearly don't realize (or refuse to realize) that the victims cannot do anything to change an ISP's policy. Often they cannot switch ISPs either, either for practical reasons (no unlisted alternatives) or for financial reasons (it can easily cost far more than the yearly proceeds to move hundreds of servers, renumber many thousands of domains etc.).
I've been getting spams lately that seem to be trying to get around the highly effective statistical solutions, such as SpamAssassin, that have been implemented. Spammers seem to be adding random, or possibly even carefully selected dictionary words to skew their statistical rating. Here is an example from the several I've received lately--has anyone seen information about this on /. or elsewhere?
[spammers irritating message snipped]
Thu, 17 Jun 2004 19:42:34 -0500
No Thanks
beatify
sacred atom drank deprecate cathodic thermionic sherman delinquent hanley swum wooster asteroidal bilayer haiti saudi wink bijective reserpine baronial gloss ambrose threadbare chianti predatory earmark bilingual angora palazzi chartres alveolar phosphate civet radish barricade diem laurie minutem! en crusty
camilla jade lineman bendix masonic dublin incontrovertible defecate generous buddhist yesterday endow bitten conley trunk pitchfork beret bloat gelatine dovetail gambia medea niggardly blackburn suey dialogue ilyushin anastigmatic berth abort bodied contractor of ridden embarcadero corset trademark
ID: W993gt72
carnation
constructor maltese bantam airfield pique douglas pungent criterion cloudburst illiterate sausage career stile pebble bonnie shim carbonium
magnesite pembroke abrade jogging dynast physiochemical stochastic sumac conference obtain villain midwinter incompetent eradicable madhouse airline antony household cursory instinctual gratuitous clown shaven des cornflower
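Why word padding like the sample above might help a spammer can be shown with a toy calculation. This is a sketch under loud assumptions: the per-token probabilities are invented, and the naive-Bayes style combination used here is a textbook formula, not the combiner that SpamAssassin or any other filter named in this thread necessarily uses.

```python
from math import prod


def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities, naive-Bayes style.

    Each value is an assumed P(spam | token); tokens are treated as independent.
    """
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


# Hypothetical probabilities: two strongly spammy tokens from the sales pitch.
spammy_tokens = [0.99, 0.95]
# Ten padding words that the filter has mostly seen in legitimate mail.
padding = [0.2] * 10

clean_score = combined_spam_probability(spammy_tokens)
padded_score = combined_spam_probability(spammy_tokens + padding)
```

In this toy model the padded message scores far lower than the unpadded one. The trick only works if the padding words really look ham-like to the recipient's filter; filters that weigh only the most extreme tokens, or that have been trained on padded spam, should be affected less.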
I've been using DSPAM for nearly a year now, and it's just kept on getting better. I can't imagine life without it now.
I have 17 DNS-based blacklists in front of it, because I would rather block the messages at the network interface than filter them with my own resources, but those that slip through don't stand much of a chance of reaching my inbox. I have had my current email address out there on the web and in Usenet for six years, so I see a lot of junk -- DSPAM stops all but one or two per month. SpamAssassin can't even come close to that.
Warning: This signature may offend some viewers.
Can anyone suggest the best way of filtering spam received into a mail server running Exchange 2003?
I have no sig yet I must scream.
150 spams a *day*?????
Hell, that's what gets through my filters. On a bad day I get 1000/hour.
I run both spamprobe and bogofilter and find that the OR of the two is noticeably better than either alone. Haven't managed to spot why, but the moral is that if you have CPU to burn it can be worth getting a second opinion.
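Taking the OR of two filters, as described above, might look like the following sketch. The two toy classifiers are hypothetical stand-ins; in practice each would wrap a call to a real filter such as spamprobe or bogofilter, whose command lines are not shown here.

```python
from typing import Callable, Iterable

# A classifier returns True when it thinks the message is spam.
Classifier = Callable[[str], bool]


def flag_if_any(classifiers: Iterable[Classifier]) -> Classifier:
    """Build a classifier that flags a message if any member flags it."""
    members = list(classifiers)  # copy so the result can be called repeatedly

    def combined(message: str) -> bool:
        return any(classify(message) for classify in members)

    return combined


# Toy stand-ins for two independently trained filters (assumptions).
def filter_a(message: str) -> bool:
    return "viagra" in message.lower()


def filter_b(message: str) -> bool:
    return "mortgage" in message.lower()


second_opinion = flag_if_any([filter_a, filter_b])
```

The trade-off is that an OR also takes the union of both filters' false positives, so it suits setups where each filter on its own rarely flags ham.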
_O_
.|< The named which can be named is not the true named
I didn't see results for how much spam GPG and PGP block.
It's normally around 100% on my pc, but sometimes about 110%.
thank God the internet isn't a human right.
I've been using it since March and the stats speak for themselves:
My spamfilters stats.
It's worth mentioning that I don't get false positives with SA, and CRM114 gets 1 every now and then. On a daily basis I get 70 spams caught by CRM114 and not by SA.
EOF
I can empathise with that. The other problem with using disposable accounts as far as business contacts or clients is the potential fall-out from the LACK OF TRUST! What would your contact or client think if you give them a spamgourmet address and they know what spamgourmet does? Or, if you give them a sneakemail address... "Can you spell your sneak e-mail address to me again please? That's a-5-b-z-what?"
--- root@127.0.0.1
Even if Microsoft buys NAI, they would not get the SpamAssassin trademark.
SpamAssassin is in the process of becoming a project in the Apache Software Foundation. That process requires the trademark to be assigned to the ASF, which is already in progress as can be seen in this status report.
I'm also using CRM114. On a bad week, maybe 4-5 spam messages sneak by and I probably get 200-300 messages a day (which is overwhelmingly spam). There was a bug in previous versions of CRM that would cause the filter to claim it learned when you trained on an error and would give you an "I already know that's spam, I don't need to relearn that" message upon training. You can fix that by getting the new version and using a training command like this:
command password spam force
That VASTLY improved performance for me.
Seriously. Exposing an Exchange server directly to the net is just asking for trouble. Your best bet is to put the exchange server behind your firewall and relay all incoming mail through a hardened unix machine running your favourite email transport (like postfix). Then you can use any of the well documented spam sifters discussed here and offer your exchange server more protection from the elements.
You really need to make sure Spam Assassin is using DNS RBL and the like. I was seeing the same kind of thing - lots of spam getting through. Once I turned on the RBL checkers, spam levels dropped immediately. The other thing to do is make sure SA is the latest version - the newest spam techniques beat old SA versions.
Make trash your inbox. 100% effective.
Ask me about my vow of silence!
I am the author of CRM114 and I corresponded with Professor Cormack for setup assistance during this study; he did have some problems with CRM114 that he brought to my attention and which were possibly never quite resolved.
I can also state that I *do* run CRM114 myself; I also run SpamAssassin (regularly maintained by the systems staff) on a parallel account. I find that SA gets about 90+ percent of what makes it past the firewall's immediate RBL lists (which matches Prof. Cormack's Figure 8 pretty closely); CRM114 nails 99.9% or more (this week, ending June 21, 2004, my CRM114 stats are 2528 nonspam and 1114 spam messages, with just 1 error (a false reject), which is 99.972% accuracy).
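As a quick check of the arithmetic in the figures above, the sketch below recomputes the overall accuracy from the quoted counts. The variable names are only for illustration; the counts come straight from the comment.

```python
# Counts quoted above for the week ending June 21, 2004.
nonspam = 2528
spam = 1114
errors = 1  # a single false reject

total = nonspam + spam
accuracy_percent = 100 * (total - errors) / total

# The one error was a nonspam message, so the false-reject rate is
# measured against the nonspam count only.
false_reject_percent = 100 * errors / nonspam

print(f"{total} messages, accuracy {accuracy_percent:.2f}%")
```

One error in 3642 messages works out to roughly 99.97% overall, in line with the figure quoted, and to roughly 0.04% of the nonspam mail being rejected.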
I have gotten reports from some very happy users who are seeing similar accuracies; I've also gotten sad reports similar to Prof. Cormack's that show very weak accuracy.
I can conclude from this (and other reports) that filter performance varies _greatly_ with spam mix - that is to say, Your Mileage Will Vary.
Further, consider Fig 15, which compares CRM114's accuracy with respect to nonspam v. spam. Note that the two curves are displaced considerably, by a factor of accuracy between 3 and 5 times!
This is odd, because CRM114 is _entirely_ symmetrical; it does NOT have any predisposition toward (or against) erring on the side of caution; the only difference between nonspam and spam is the names of their files, which could be changed to "foo.css" and "bar.css" (or even interchanged) without affecting anything else.
The two accuracy curves _should_ therefore lie on top of each other; there is no difference in the processing. The fact that the nonspam v. spam curves seem to differ by a factor of 3 to 5 in magnitude gives me some reason to believe that the setup issues Prof. Cormack encountered never really were completely addressed.
-Bill Yerazunis
Personally, I look forward to Bayesian categorization. Not just Spam, but Personal, Work, Bills, etc. It would be splendid if I could have some more dynamic rules instead of doing this stuff manually.
I tried CRM-114 after the previous Slashdot article. I paid a lot of attention to my email and did all the required training. After getting over the initial hump of misclassified email it got to a steady low level. Once it made a mistake and I had to train it, though, I would get a run of false positives and negatives for a bit until it settled out again.
What sent me back to SA was that a number of CRM-114 misclassifications were marking ham as spam. Losing a real message in the sea of spam is much more of a concern for me than getting a bit of spam with the regular stuff. It is very rare that I get ham classified as spam in SA.
Turning on Postfix 2.1's "address verification" feature immediately eliminated 90% of the spam that my company was receiving! (SpamAssassin + ClamAV + CRM114 catch the rest). This feature confirms that the incoming email is coming from an account that also accepts email. (Spambots don't normally accept mail, of course...) The spam email never even makes it into your system this way, because the SMTP transaction is deferred until the address is verified. - Mike
But SpamAssassin is just getting better and better. Version 3.0 is coming up, and 3.0-pre1 was recently released. I do not have a test system available for it, but those who have may want to take it for a spin.
Especially for large sites, this is extremely interesting. It adds relational database support for the Bayes database, so it should be a lot easier to set up on a large site.
I find individual training is the main reason why SA works so well for me; the lack of it is why SA did not work very well at my old university.
Employee of Inrupt, Project Release Manager and Community Manager for Solid
I evaluated SA as a possible filtering solution where I work, and it was a full order of magnitude slower than bogofilter even with every test disabled. And that *is* running spamc/spamd. Without the daemon it was even worse.
So while it may be a nice solution for people who are running it on a small scale, for large installations (e.g. we get over six million SMTP connections a day) it requires a lot more hardware thrown at it.
One thing I really like about SA is how they are very careful to make sure their error rate is on the right side. It's better to let some spam get through than to mark good mail as spam.
My ISP implemented postini the other day and it had collected 30000 messages before I realized that it was blocking my Mexican cousin's email- his trip to visit was almost fubar'd.
And the only way to get the messages back was via frickin scroll, click click several hundred times. (Or open the ssl client scripting can of worms)
I've also been using POPFile for about a year, and it's done an amazing job - 99.87% accuracy, very few false-positives, and great summary info with six email accounts collectively filtered through it.
I recently helped a few friends install it on their machines, and, rather than just having them start from scratch, I copied my Spam corpus for them. With the spam corpus already in place, all of them noticed spam drop to close to zero while they trained their other buckets.
- Jack
I (and the company I work for) use ASSP and have been very impressed with its results. Spam in my boss's inbox went from 100-200 messages per day down to a handful... I'd like to see it compared to the other anti-spam packages mentioned.
Read my keyboard review.
The problem with using embedded URLs in spam is that the spammers have already adapted to this method. They create new domains every day to get around this type of content filtering. For instance, I might receive a spam message with 239e29.23ijei.com and the next day I'll receive the same spam message with hsh9x.39u329.com.
I found a much more reliable way to detect spam, unfortunately I will not share it here because I am sure spammers will read this post and adapt.
No comparison of Bayesian systems would be complete without some method to normalize the training of them. In other words, different Bayesian approaches to anti-spam will learn differently from a different training set. So ironically, this comparison is only as good as the completeness of the spam used to train the filters.
http://tinyurl.com/4ny52
amen brother....or sister.....or whoever you are.
"Lame" - Galaxar
We've been using ASSP for just over 200 days now. http://assp.sourceforge.net.
785621 messages processed
334565 messages rejected as spam
159278 viruses blocked (attachments of .exe, .pif, etc)
Major points in ASSP's favor include the fact that it blocks the email at the network interface (it takes over port 25 and forwards only the stuff on to sendmail that isn't spam), it's easy to install at the server-wide level, anyone on the whitelist can help train the spam filter by emailing it, and it rejects most viruses immediately which keeps the machine running smoothly even during M$ virus blitzkriegs.
But what about the emails I'm missing, you say? They get a message telling them they were rejected and why. Better yet, they get the message even if they are a spammer using a fake return address, which gives you a chance to "opt out" in a fashion they can't legally ignore (yeah, like they care, but still...). We've gotten no complaints from valid users so far and the message tells them to use our phone number to get whitelisted.
What still gets through? Bayesian poisoned emails do occasionally make it past--usually about 3-5 a day, but the spam rate is quite low compared to the deluge we would be under without it.
Examples.
A mom reading an email from her daughter saying "help, i'm being sexually assaulted by a football team" is far less likely to go "gee, that contains the word rape so it's spam"
A CEO reading an email from his biggest customer saying "you're getting rich. we're placing the order you need to survive" won't dismiss it because of spam words.
Spam filters have a higher chance of deleting the important emails than these overall percentages suggest.
From the article "'The best-performing filters reduced the volume of incoming spam from about 150 messages per day to about 2 messages per day.'"
I don't give a damn if they reduce 150 spams to 2 or 3 or 4 or 5. I care that they do NOT delete the one important email hidden there. Spam filter writers -- start focusing on avoiding false positives, not on trying to delete everything.
The article admits that they didn't follow the training guidelines for CRM114. Its HOWTO and FAQ clearly indicate that training of the type used in the Shootout decreases accuracy significantly. I followed the author's recommendations carefully (having found his rationale for them very rational) and have had very good results.
"The more spam you get the less you read" is what somebody told me at a recent user group meeting.
The trick is to train the spamfilter against the spamtrap addrs so that when they hit the good addrs the spamfilter knows they're spam.
I use CRM114 train-on-error, so my .procmailrc includes something like this...
:0
* ^X-CRM114-Status: Good
{
  :0
  * ^TO_compromisedaddr@mydomain.org
  {
    :0 c
    | $HOME/my/etc/crm114/learnspam
    :0:
    $MAILDIR/checkspam-learnt
  }
}
Of the million or so emails I process per day, 80% are marked as spam. Of those approximately 75% are caught by the RBLs before it even reaches the spamassassin engine.
:)
I highly recommend RBLs to anyone. Not only are they fast and usually pretty accurate, but they are very fast learners usually.
One of my favorites is the SURBL which seems to catch a good chunk of it. Bayes filters are always gonna be thrown off by the dummy words thrown in there but the minute they try to link the person to their site BAM the surbl gets them.
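How a blocklist lookup of this kind works can be sketched as follows. IP-based lists are queried by reversing the IPv4 octets and appending the list's zone; URI-based lists like SURBL are queried with the domain found in the message body. The zone names `dnsbl.example` and `uribl.example` are placeholders, not real lists, and the helper names are made up for this example.

```python
import socket


def dnsbl_query_name(ipv4: str, zone: str) -> str:
    """Build the DNS name to look up for an IP-based blocklist."""
    octets = ipv4.split(".")
    if len(octets) != 4:
        raise ValueError(f"not a dotted IPv4 address: {ipv4!r}")
    return ".".join(reversed(octets)) + "." + zone


def uribl_query_name(domain: str, zone: str) -> str:
    """Build the DNS name to look up for a domain found in a message body."""
    return domain.rstrip(".").lower() + "." + zone


def is_listed(query_name: str) -> bool:
    """Return True if the name resolves, which is how lists signal a hit.

    Note: a lookup failure for any other reason (for example, no network)
    is also reported as "not listed" by this simple sketch.
    """
    try:
        socket.gethostbyname(query_name)
        return True
    except OSError:
        return False


ip_query = dnsbl_query_name("192.0.2.99", "dnsbl.example")
uri_query = uribl_query_name("Spammer-Domain.example.", "uribl.example")
```

Calling `is_listed(ip_query)` against a real zone would perform the actual DNS lookup; it is left out of the example run so the sketch works offline.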
I'm surprised greylisting hasn't become more widely used... I've not used it personally, but it sounds effective & fairly benign for non-spam mails.
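For readers who have not met greylisting: the server temporarily rejects the first delivery attempt from an unknown (client IP, sender, recipient) triplet and accepts a retry after a short delay, on the theory that real mail servers retry and many spam senders do not. A minimal in-memory sketch, with the 300-second delay and the class and method names chosen purely for illustration:

```python
import time


class Greylist:
    """Toy greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, min_delay: float = 300.0, clock=time.time):
        self.min_delay = min_delay  # assumed delay; real setups vary
        self.clock = clock          # injectable so the logic can be tested
        self.first_seen = {}

    def check(self, client_ip: str, sender: str, recipient: str) -> str:
        """Return "accept" or "tempfail" (an SMTP 4xx, try again later)."""
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Remember when this triplet was first seen; keep the earliest time.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"


# Usage with a fake clock so no real waiting is needed.
fake_now = [0.0]
greylist = Greylist(min_delay=300.0, clock=lambda: fake_now[0])
first_try = greylist.check("192.0.2.1", "a@example.org", "b@example.net")
fake_now[0] = 60.0
early_retry = greylist.check("192.0.2.1", "a@example.org", "b@example.net")
fake_now[0] = 301.0
late_retry = greylist.check("192.0.2.1", "a@example.org", "b@example.net")
```

The cost for legitimate mail is a delay on first contact, which is why it counts as fairly benign rather than free.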
"A great democracy must be progressive or it will soon cease to be a great democracy." --Theodore Roosevelt
And here is a very long and detailed response on the DSpam site by Jonathan A. Zdziarski himself.
Without RBLs, your content-based system wouldn't work nearly as well. It's like adding caviar to kool-aid. It might make the drink more palatable, but it's more efficient to cut out the kool-aid.
Well, according to my data, it would work a bit less well. And my data doesn't support your kool-aid analogy at all. Why don't you show us your data? You do have data to back up your claims, right?
I do have data. My RBL knocks out about 97% of all spam. And that's without much maintenance. When I get proactive and start monitoring worm-infected PCs, I can up this rate to 99.5%. This is with virtually no measurable legitimate mail being blocked, something the content-based systems can't say without whitelisting.
I do have data. My RBL knocks out about 97% of all spam.
Thanks, I already have data that RBLs can help get rid of spam. That's why I use them. I also have data that content-sensitive approaches can help get rid of spam. What I don't have is any data to back up your claim that RBLs are to SpamAssassin's content-related filtering as caviar is to koolaid.
And that's without much maintenance. When I get proactive and start monitoring worm-infected PCs, I can up this rate to 99.5%.
As I said in my original post, I can get the same rates, including the lack of false positives, using a combination approach. I get 99.5+% with minimal maintenance.
If you don't know how to make use of content-related tools like bayesian filters, fine. Don't use them. But I'm telling you that they work great for me as part of a combination approach, and I have the data to back it up.
I have exactly the same setup as you do. As some of the others said, you need to keep running sa-learn and it will eventually work.
I was doing this for about two weeks with no noticeable effect, and all of a sudden it started to catch well over 90% of all spam.
With the razor and other remote site checking in place it is very slow, though.
Alas gallinaceas de urbe bovis volo