Seven Spam Filters Compared
Goo.cc writes "Those wondering how their spam filtering software performs in comparison to others may want to read this article on Freshmeat, where Sam Holden performs comparative testing of various popular e-mail filters. The filters tested include Bayesian Mail Filter, Bogofilter, dbacl, Quick Spam Filter, SpamAssassin, SpamProbe, and SPASTIC."
What article on Freshmeat? It would be nice to have a link.
Slashdot can't even provide a link for an address in an article? Editors must be slacking today.
Sounds great, but until I hear about software products like these in my morning mailbox, I don't really trust that they're any good.
Spam Filters
http://freshmeat.net/articles/view/964/comparative
Seems the link was caught in the filters as well.
Click here.
The only one that's really useful, and not included? Typical freshmeat.
The author makes a good attempt at comparing these products, but I don't think his samples are in-depth enough to come up with real-world results.
For Bayes testing, he used 68 spam and 68 ham messages. SpamAssassin for one won't even activate Bayes until it's learned from 200 messages; it's not uncommon for those who regularly deal with spam management on the server side to use 5,000-10,000 message corpora to test new rule additions and to train the filter.
The low number might have a slight effect if most of your mail contains similar characteristics, but I'd much rather have seen bigger numbers of samples.
-Barkeep, a draft of your most hazardous brew, for the world is slowly stepping into focus, and I don't like what I see.
Finally a benchmark kind-of-test for spam filters. Now I need to tell my friend about this to see if he wants to switch over from SpamAssassin to SpamProbe, because I know it filters some newsletters as spam, and, not being an admin of any sort, it's hard for me to fix.
Note: someone update the text so that the link is htmlized.
IMO, the best way to go with spam is to combine a heuristic filter with a text/Bayesian filter, in my case SpamAssassin and SpamProbe. I run them both, and it does a noticeably better job than either running alone.
SpamProbe can be fooled by clever spammers who insert lots of common words in non-visible HTML. A Bayesian filter can't really catch that, but a heuristic filter can be written to notice the pattern.
Also, set up your Bayesian filter to re-learn regularly from your spam folder. SpamProbe adds a unique ID to each message, so it won't process a message twice. Therefore, you can just manually move any false negative spams into the folder, and they'll be learned from.
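For anyone wondering what the "won't process a message twice" part looks like in practice, here is a rough Python sketch. It is not SpamProbe's actual code: the function names, the `learn` callback, and the use of a SHA-256 digest as the unique ID are all assumptions made for illustration.

```python
import hashlib


def message_id(raw_message: str) -> str:
    """Stand-in unique ID: a digest of the raw message text (an assumption)."""
    return hashlib.sha256(raw_message.encode("utf-8")).hexdigest()


def relearn_spam_folder(folder, seen_ids, learn):
    """Feed each message in the spam folder to `learn` at most once.

    `folder` is a list of raw message strings, `seen_ids` is a set of IDs
    already learned, and `learn` is a hypothetical training callback.
    Returns how many messages were newly learned on this pass.
    """
    learned = 0
    for raw in folder:
        mid = message_id(raw)
        if mid in seen_ids:
            continue  # already trained on this one, skip it
        learn(raw)
        seen_ids.add(mid)
        learned += 1
    return learned
```

Run nightly over the spam folder, something like this only ever trains on the messages you dragged in since the last pass, which is why moving false negatives into the folder by hand is enough.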
I have seen at least two of these comparisons and no one seems to want to roll Mozilla's spam filter into the mix and compare it. Therefore, the comparisons are kind of useless to me. I am guessing I am not the only person using Moz either, for specifically this reason (ease of use for Bayesian filtering).
What's up with that? I know it's not a proxy, so the methodology is different than most of the products in the comparison. I'm very interested in how well the filter works however, compared to these other products.
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
Does anyone find it disturbing that --
a. Spam Filter software company is now a "viable business."
b. Spam Filter is needed AT ALL?
ELOI, ELOI, LAMA SABACHTHANI!?
As was noted earlier, the set of messages given to the filters for learning was terribly small. Furthermore, SpamAssassin wasn't tested in a way useful to most, as the tests in this article didn't take into account SA's Bayesian filter or its network-based tests (Razor, etc).
Also, what's with keeping the spam threshold score secret?
How the heck could Active Spam Killer be left out? I used to get about 150 spams a day and now I get ZERO. No false positives, no false negatives.
It is an autoresponder that checks the sender against a whitelist and a blacklist. If a new e-mail is in neither, then it bounces back an e-mail asking for a confirmation that the sender is a human. Simple!
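The whitelist/blacklist/confirm logic described above boils down to a three-way decision. A minimal Python sketch follows; the function name and return values are made up here and are not taken from Active Spam Killer itself.

```python
def classify_sender(sender, whitelist, blacklist):
    """Decide what to do with mail from `sender`.

    Returns "deliver" for whitelisted senders, "reject" for blacklisted
    ones, and "challenge" for anyone unknown, meaning a confirmation
    request should be sent back. All three labels are hypothetical.
    """
    address = sender.strip().lower()
    if address in whitelist:
        return "deliver"
    if address in blacklist:
        return "reject"
    return "challenge"
```

In this sketch a sender who answers the challenge would simply be added to `whitelist`, so later mail from the same address goes straight through.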
I noticed immediately that the author turned off SpamAssassin's Bayesian filter, claiming "it already has 5 points, that's enough". WTF does that mean? The whole point of SpamAssassin is to do a lot of tests, add the scores together, and then set the threshold you want (something he also doesn't modify; I changed my threshold after looking at the scores spams were getting and such).
I trained SA's Bayesian filter off of about 3 years of spam and legitimate email sent directly to me. SA as a whole is working nearly flawlessly; the only messages it has tagged as spam were those from users with improperly configured email clients AND suspicious email addresses AND using only HTML. I.e., a message that would damn well look like spam. However, like I said, I lowered SA's threshold by 2 points because I was having too many false positives (that was before I had properly trained the Bayesian filter, so perhaps I'll kick it up a point now).
One important note: when you get a falsely classified message, it's REALLY important to tell SpamAssassin's Bayesian filter about it. It's as easy as cut+paste if you do sa-learn --spam/--ham --single, hit enter, paste the message, hit control D. Done!
Please help metamoderate.
- Contains my initials. I simply ask my friends to insert my initials in the subject line. They're all happy to comply.
- If I opt-in to something, like /. updates, I allow *that* domain (*@slashdot.org, for example). No third party co-brands are accepted.
Fair enough?
There's that and the fact that the greatest spam filter wasn't entered. The "I'm gonna get'ya sucka!" filter.
Instead of going after Spammers, why not go after the companies that hire them to send us Viagra/Penis Enlargement/etc mails? Without them, no Spam. Also, I'd like to know who the fucktards are that respond to these mails and buy their products.
They started off by quoting John Graham-Cumming, yet they didn't include his brainchild PopFile.
Check it out Here.
GeekWares - Buy and Download Today!
What about PopFile? I've tried SpamAssassin and a few others, and I like PopFile the best. After a little training it's EXTREMELY accurate. It survived the deluge of mail I've gotten in the last few days (due to viruses) with flying colors.
According to its internal statistics, it has classified 2821 messages as of the time I type this. It has made only 95 errors (often close calls, so I don't blame it). That puts it at an accuracy of 96.63%. For the record, of the e-mail I've gotten, it's 308 messages of ham, 2513 spam.
I have only been using PopFile since June 7th of this year, but it's working fantastic. The only thing I've used that's this good was Cloudmark's SpamNet, who stabbed the community in the back, so I switched to something else. I'm glad I've found PopFile, and I suggest you try it too if you're looking for something good.
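The accuracy figure quoted above is just (classified - errors) / classified. A tiny Python check, with a helper name invented for the purpose, assuming accuracy is reported as a percentage rounded to two decimals:

```python
def accuracy_percent(classified: int, errors: int) -> float:
    """Percentage of messages classified correctly, rounded to 2 decimals."""
    if classified <= 0:
        raise ValueError("need at least one classified message")
    return round(100.0 * (classified - errors) / classified, 2)


# Numbers from the post: 308 ham + 2513 spam = 2821 messages, 95 errors.
popfile_accuracy = accuracy_percent(308 + 2513, 95)
```

With the numbers from the post this comes out to 96.63, matching what PopFile reported.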
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
See our PSAM project site for a refereed paper evaluating several machine learning spam filtering techniques (although not specific filters). This site also contains large standardized corpora for evaluation. The paper contains a number of tips on evaluating ML spam filters.
The /.-referenced article has some good ideas about evaluation. I particularly liked the explicit discussion of the false positives. The recommendations at the end are excellent. On the other hand, the evaluation isn't across a broad or obviously representative corpus, many of the tests are a bit odd, the ROC tradeoffs are not discussed. In particular, the evaluation set for the tests did not include enough ham to be able to accurately estimate the false positive rate: consider what would happen to the precision estimates if 0.5 were added to each of the numbers in the false positive table.
Overall, though, this was an interesting evaluation, and I'm glad that the author published it.
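To make the "add 0.5 to each false positive count" point concrete, here is a small Python sketch. The counts are made up for illustration and are not taken from the article's tables.

```python
def precision(true_positives: float, false_positives: float) -> float:
    """Fraction of messages flagged as spam that really were spam."""
    return true_positives / (true_positives + false_positives)


# Hypothetical counts: 60 spams caught, zero ham wrongly flagged.
observed = precision(60, 0)

# Same counts after adding 0.5 to the false positive cell.
smoothed = precision(60, 0 + 0.5)
```

With so few messages, a "perfect" score and a score that would be unacceptable at mail-server volumes sit less than one misclassified message apart, which is why a larger ham set is needed before trusting the false positive rate.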
I find it funny that so many people have problems with spam. I have never gotten a spam message with my current email address. Don't spend time trying to filter; get an obscure email address like saf4502@E8Hkl3.biz
The Television Wiki
Whoops! The real url is
http://popfile.sourceforge.net/
That's right! Our company has found a high-tech way to use various anti-spam tools to enlarge your penis. My pennis is noww sso lrage that i Cannnot type curretcly. Itt gtes in teh way.
Please visit www.spamfilters2enlarge.com
Act before midnight and get a $30 discount.
Table-ized A.I.
Of course your Bayesian filter will QUICKLY learn that HTML tags that create invisible text are VERY common in spam and nowhere else. Problem solved.
Don't forget that the filter sees more than the eye...
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
If you decide to try out spamprobe or another bayesian filter, try this web interface which lets you easily reclassify mail, even those marked as spam. I found that "training" the bayesian filters was the hardest part; this definitely simplifies the process.
It wasn't mentioned in the article, but I really must plug popfile. It filters out my spam yes, but it is also a general mail categorizer. It sorts ten yahoo groups for me, personal, work, and school related emails. I know you think you could do this with rules for the emails, but for example, I get several hundred emails a day from the Harry Potter for Grownups List. Popfile can sort them into 'probably interesting' and 'probably not' for me. Very nice.
An interesting thread here about how TMDA, a C/R filter, used in conjunction with SpamAssassin, can provide the best of both worlds. While TMDA is by itself effective, there seem to be some humanistic issues involving the assumption that all e-mailers are spammers unless they prove otherwise. The thread explains how Bayesian filtering can be improved by using a decent C/R filter like TMDA without alienating people that send legitimate e-mail.
Personally, I figure anyone thin-skinned enough to be insulted by my C/R filter probably isn't worth talking to anyways, but I digress...
The quickest way to stop spam in the U.S. would be to have a respected person such as the Surgeon General of the United States say that
1) There is no way to increase the size of your body parts,
2) The cheap Viagra is not Viagra,
3) and so on.
We can help by telling everyone we know not to buy anything from spam. Next time you are at a party or family gathering, make that point.
Spam would disappear if there were no buyers. We need to make it culturally unacceptable to buy anything that is advertised through spam.
I don't know if thunderbird uses the same filter as mozilla, but for me, thunderbird is horrible at spam recognition. I have an account that gets about 50 spams a day, and one legit email from 'word a day'. It consistently screws up even after weeks of training. Thunderbird couldn't find a spam labeled 'young virgin sluts selling herbal viagra from the Congo'. But then again, it's only 0.1, so I'm more than willing to cut it some slack.
Whatever happened to this project:
http://www-cse.ucsd.edu/~wkerney/spamfilter.README
?
Please don't bother your Congressmen or Senators proposing legislation that might not work 100%. Just keep on filtering the spam I send you, I know you would have never bought from me anyway. That you can filter legitimizes my business and my waste of your bandwidth.
P.S. To be sure of not getting a false positive, be sure to send all filtered mail to a special folder. Waste your storage space storing the mail until you manually go through every piece to be sure you didn't accidentally filter something important. Of course, this will take exactly as much effort as it would have to just check the e-mail when it first came in, not to mention the extra effort spent in setting up the filters and the extra space for storing your incoming spam folder, but what the heck. You geeks enjoy wasting time this way, and I certainly appreciate it. It makes the work of all us spammers much easier.
I'm using the standalone Thunderbird and it catches everything that gets past SpamAssassin. Spam is marked but never deleted, so I can go back and check. Some spam programs will delete email, which could delete a good email. Unacceptable.
Basically, I'm using a mandrake linux box, imap, procmail, fetchmail and spamassassin. Easy, and I can send/receive email from my linux box, and port 25 is blocked from the Net so nobody can use me as a bouncer.
Only problem I had was, there was no complete document to set this up, I had to piece each part together.
So for anyone who wants to know, here are the quick steps.
1. I'm using mandrake, but had to update SA for the sa-learn utils. (Gotta train SpamAssassin)
2. Setup fetchmail in your personal account.
3. Setup
DROPPRIVS=YES
VERBOSE=ON
LOGFILE=/home/userac
|
4. Setup your user_prefs in your local directory for SA. (This is mine; I'm no SA expert, but it works.)
required_hits 5
rewrite_subject 0
use_terse_report 1
report_safe 1
use_bayes 1
auto_learn 1
ok_locales en
use_pyzor 1
pyzor_max 9
pyzor_add_header 1
use_razor2 1
always_add_headers 1
always_add_report 1
spam_level_stars 1
pyzor_add_header 1
skip_rbl_checks 0
#timelog_path
5. As root, make sure IMAP and SpamAssassin are running.
6. Load Thunderbird, use Imap, use filters on x-headers.
I couldn't agree more. I get about 250 spams a day, and after weeks Thunderbird misses the boat (false positive and false negative) about half the time.
Anyone care to point out a decent way to use SA's bayesian filter with this setup:
I have a linux box running as my web/mail server that has spamassassin on it for anyone who wants to use it (setup .forward and .procmailrc to do this). I'm currently deleting spam (score = 5)
The problem is how to get spam and ham from Outlook back to the linux box correctly. To my knowledge, outlook doesn't export mail in any way that's readable by the sa-learn script. I'd like to setup a bayesian filter, but it seems like a lot of effort to get rid of the 4 or 5 spams that SA actually does let through each day.
"It is seldom that liberty of any kind is lost all at once." -David Hume
I use SpamBayes (free) with Outlook on my W2K machine. I trained it with over 400 SPAM and over 1000 non-SPAM emails. I get about 45 SPAM each day and my ISP, attglobal, filters out about 40 of them. The SPAM that gets to my mailbox are the ones that pass through the attglobal filter, and that filter has NEVER given me a false positive for more than 2000 SPAM. Those SPAM are put in a special folder on the server for inspection, but I now just delete them en masse every week or so.
That means that SpamBayes is filtering only the hardest emails to classify and so far it has only given me one false positive. I got one false negative after training it for the first time. SpamBayes also has a folder for messages that it is not sure of and so far they have all been SPAM. I seldom have to do more than inspect the sender and subject to confirm that they are SPAM.
Each time a message is automatically moved to the SPAM folder (or moved back to the Incoming folder) the training set is adjusted for that email so I don't have to re-train.
To sum up I'm really impressed by well designed Bayesian filters and this one in particular. I think it's worth while to take the time to build up a corpus of SPAM and "good" messages as I can then evaluate competing filters.
Nate
since the filters do better after being trained with lots of spam, anyone think of gathering up a huge collection of spam to give to other people? i mean exporting a corpus of spam from outlook, sticking it up for download somewhere, and letting other people import it into a spam folder. then other people could run their filter of choice and train it!
you could even make it all official-like, and somehow guarantee that the spam that's up for downloading is "official" and "virus-free" and "safe for your computer." you know, do geek stuff like check hashes or whatever it takes to verify that the spam collection is legit. whatever it takes to ensure that someone else hasn't filled it with a ton of virus/trojan/etc. attachments. or whatever. i dunno. you know, somehow guarantee it's safe.
imagine it! download spambayes, get spambayes to connect to the official spambayes spamcorpus server, and download the latest 2000 spams! instant training.
anyway. just an idea. mod me down as -1, herrd0kt0r. 8P
SAProxy for Windows (Based on SpamAssassin) got the highest marks.
Simply adding random text to a message is not enough to get it past SpamAssassin.
I run SpamAssassin, I know that it catches that stuff.
The reason it does catch it is because it uses a WEIGHTED system for classification. If the message has the characteristics of spam, but has random words in it, it will still be considered spam UNLESS those random words have been used previously in ham messages that it has learned.
Now, the odds of the spammer hitting upon words that my version of SpamAssassin has learned as ham are very slim.
And if he did manage it, those same words would most likely not be in someone else's ham list.
So spam that can get through to me will not get through to 90% of the other SpamAssassin users.
I'm running SpamAssassin at work and it is catching over 1,000 spam messages for every false positive or false negative that it lets through. Despite the spammers including random words and random text and all of their other tricks.
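The weighted classification described above amounts to summing per-test scores and comparing the total to a threshold. Here is a minimal Python sketch of that idea; the rule names and weights are invented for illustration and are not SpamAssassin's real rules or scores. The default threshold of 5 just mirrors the value mentioned elsewhere in this thread.

```python
def score_message(hits, weights, threshold=5.0):
    """Sum the weight of every rule that fired and compare to the threshold.

    `hits` is a list of rule names that matched the message and `weights`
    maps rule name to score. Ham-like evidence carries a negative weight.
    Returns (total_score, is_spam).
    """
    total = sum(weights.get(rule, 0.0) for rule in hits)
    return total, total >= threshold


# Hypothetical rules and weights, not SpamAssassin's actual ones.
example_weights = {
    "HTML_ONLY": 2.0,
    "FORGED_SENDER": 3.5,
    "LEARNED_HAM_WORDS": -2.5,
}
```

Random filler words only help the spammer if they trigger something like the negative-weight entry, which depends on what each recipient's own filter has learned as ham.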
i use apple's mail.app with bayesian filtering. i have received maybe 4 or 5 true spam emails in over a year. i haven't yet missed any real emails either. i would have to say that's pretty good. otoh, our groupwise system at work is fscking horrible. i get tons of fscking spam. i have had to set dozens of rules, and it still doesn't matter.
My problem? I was perfectly gruntled, until some numbnuts came by and dissed me.
All of these anti-spam tools depend on content. In addition, they actually increase the workload on the mail server, not decrease it. That's workload my mail server should never have to do, and would never have to do if only the spammers were prevented from accessing the same internet I've been using since 1986.
Spam is about conSent, not conTent.
In fact I actually receive what some might consider to be spam, because I want it and have let the senders of it know that I want it. How is some heuristic going to know the relationship between the owner of one mailbox and the sender of some email that recipient actually wants and others might consider to be spam?
What good is having a separate spam folder if you keep having to look in there to find missing mail, or mail that you think might be missing but hasn't even been sent? Spammers know you'll eventually look, so they keep it full.
Spam chews up my 28.8k bandwidth to home. The only viable solution is to prevent the spam traffic from even being sent in the first place. Letting the message body be sent during the SMTP session defeats the purpose. Stopping everything before the DATA phase is essential.
Stopping the SYN packet in the first place would be the ultimate goal. Those spamming operations are clearly scraping addresses and in some cases even making them up. Spammers have sent email to hundreds of different email addresses on my mail server for which no mailbox or user has ever even existed (so how could they have ever consented?). Their ISPs need to drop them out of their network. Certain extreme abuser ISPs have even been blocked entirely from my network as well as hundreds of others.
So excuse me if I see no point in content-based anti-spam tools. It just doesn't do the job for me at all. My metric is not about false positives or false negatives; it's about reducing costs.
now we need to go OSS in diesel cars
I use Pine when I'm at work (ssh into my box at home), but generally speaking, I use Outlook when I'm in winders or Moz when I'm booted into linux on my desktop.
I suppose, as the other poster mentioned, that I could turn on IMAP, but like I said before, it sure seems like a gigantic pain in the ass to do nothing more than filter out a few extra emails a day.
"It is seldom that liberty of any kind is lost all at once." -David Hume
If you reread the slightly ambiguous sentence in context you will realise he meant he had evaluated five Bayesian filters and felt that was enough. Nothing to do with SpamAssassin's point system...
My time and the time of 100,000 users is not.
And since the stuff like the spam filters are getting pretty generic, they can be configured and replicated to numpty users reducing spamming effectiveness by several orders of magnitude.
Poor attempt at irony BTW.
Government of the people, by corporate executives, for corporate profits.
I'm not disagreeing with the posters that stated that he has low sample size. It might be one of the problems why he doesn't have a higher catch or recall rate.
The main problem I see with bayesian filters is that they are complicated and nontrivial to set up. I've been playing with Bogofilter for several months. And even with sub 1000 corpuses, I get a very high catch rate (greater than 90-some %, though I don't have exact numbers).
The method that I've employed is to start with a small set of three hundred or so ham and spam corpuses, then train on error over time. It's a pain in the ass because I still have to continually inspect the results and tweak the databases.
In addition to that, there are at least a half a dozen parameters that contribute to the success or error rates. So much so that bogofilter actually comes with bogotune to analyze the corpuses to suggest optimal parameters.
So give the guy a break. I wouldn't say his results are robust enough for an academic publication, but it isn't worthless. It's interesting enough for a read. It's more work than many of us are willing to do.
Also an interesting read is Comparing Bayes Chain Rule with Fisher's Method for Combining Probabilities.
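The "train on error" routine mentioned a few paragraphs up can be sketched in Python as below. This is only an illustration of the idea: the `classify` and `learn` callbacks and the toy word-list classifier are assumptions, and none of it is Bogofilter's real interface.

```python
def train_on_error(messages, classify, learn):
    """Train only on messages the filter currently gets wrong.

    `messages` is a list of (text, true_label) pairs checked by hand.
    `classify(text)` returns the filter's guess and `learn(text, label)`
    updates the filter. Returns the number of corrections made.
    """
    corrections = 0
    for text, true_label in messages:
        if classify(text) != true_label:
            learn(text, true_label)
            corrections += 1
    return corrections


# Toy stand-in filter: anything containing a known spam word is spam.
spam_words = set()


def toy_classify(text):
    words = text.lower().split()
    return "spam" if any(w in spam_words for w in words) else "ham"


def toy_learn(text, label):
    if label == "spam":
        spam_words.update(text.lower().split())
```

The manual part that makes this a pain is producing the `true_label` for each message, which is exactly the continual inspection complained about above.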
I second the comment about using IMAP. I have been using it very successfully and it makes it easy to move spam inbox messages from whatever email program I'm using into a spam mailbox. I then have a script called learn-spam.sh that could be set to run each night to reclassify spam / ham.
I use IMAP to read my mail, mostly because that makes it easy to read from both work and home, and occasionally when I'm on the road. Right now I'm using the Bayesian filter in Mozilla. It's great, but since it's client-based that means I have three separate filters I need to train. Sometimes I'll run into weird problems where two of the filters think an email is good but the third thinks it's spam. If I accidentally left the third one running at home when I went to work, it will sometimes decide to re-classify my inbox and make messages "magically" pop in to the junk mail folder behind my back. Not good.
What I'd love is a filter that I could run on my server box at home and point at the IMAP mailboxes at my ISP. I'd want it to filter the messages and move the spam to the Junk IMAP folder rather than a local one. That way all of my mail clients would be seeing the same thing and using the same training data. I'm not sure what the UI to this would be -- there would need to be some way to train the filter in both bulk (this folder is all spam) and individual (this one message is spam) modes.
I've done a bit of looking for a tool like this, but I haven't found anything that looks ideal yet. Some of the filters mention that they support IMAP, but it's unclear whether they're optimized for a multiple-client setup like this. For example, the IMAP-aware Outlook plugins (in SpamBayes?) wouldn't do the trick.
Does anyone know if such a thing exists? I'd prefer one that ran on Windows, since that's what my server runs right now. (I know, I know. But it was very easy to set up, and I'd rather spend time improving my programming skills instead of learning to be a Linux admin. I was a 4.3BSD admin way back in the day, but it's been a while.) If there were a great solution that only ran on Linux, that might motivate me to switch, though.
Any advice?
Messages classified: 6,116
Classification errors: 88
---
Accuracy: 98.56%
And THAT is with 8, yes EIGHT, different buckets for sorting my mail. Of course 79% of my mail is spam so :)
Your hair look like poop, Bob! - Wanker.
Surely this article should have been written by Spam Holden?
sig:- (wit >= sarcasm)
When, of course, most spam has forged senders.
Whee, looks like another idiotic pattern I have to block.
If corporations are people, aren't stockholders guilty of slavery?
One year SpamCop mail account: $30
Setup SpamCop filtering options: $0
Setup my mail software: $0
Sending out hundreds of spam reports in a few clicks without having to worry about spam filtering technology: Priceless
Please don't bother your Congressmen or Senators proposing legislation that might not work 100%. Just keep on filtering the spam I send you, I know you would have never bought from me anyway. That you can filter legitimizes my business and my waste of your bandwidth.
P.S. To be sure of not getting a false positive, be sure to send all filtered mail to a special folder. Waste your storage space storing the mail until you manually go through every piece to be sure you didn't accidentally filter something important. Of course, this will take exactly as much effort as it would have to just check the e-mail when it first came in, not to mention the extra effort spent in setting up the filters and the extra space for storing your incoming spam folder, but what the heck. If you think that you can scan e-mail for false positives faster this way, you are just fooling yourselves: if you are scanning faster through e-mail that you expect to be all spam, you will miss the very false positives that you think you are looking for. You geeks enjoy wasting time this way, and I certainly appreciate it. It makes the work of all us spammers much easier.
Think you've seen this before? Don't complain. Just go through lots more work to set up special filters on your computer so that you will not see it again. You should have to do that. It's the true geek solution, and I would really like it if you did.
Sam's article was a very interesting read, but his results need to be taken with a grain of salt.
To show that one piece of software outperforms another, you need to prove statistical significance. This can be done in two ways:
The first method is called the pairwise t-test. What you need to do is to run k tests using different training and test data. For each of these tests, you find the accuracy of the classifier (#success/#trials). Then, you form the "t-statistic," t = d/sqrt(sigma_d^2 / k), where d is the difference of the means of the two classifiers, sigma_d^2 is the variance of the difference samples and k is the number of samples. Then, you compare your t-statistic to the Student's distribution with k-1 degrees of freedom. Typically, you want a confidence level of 90% or 95%, so you find the number of standard deviations away from the mean for the specific t-test (e.g. the 90% critical value for a t-test with 9 degrees of freedom is 1.38). If your t-statistic is greater than the number of standard deviations, then the difference between the two classifiers is statistically significant with X% confidence. Read more about this in Witten and Frank's Data Mining book.
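As a rough illustration of the formula above, here is a short Python sketch that computes t = d/sqrt(sigma_d^2 / k) from two lists of per-run accuracies. The function name and the sample numbers in the usage note are made up; looking up the critical value is left to a t-table.

```python
import math
import statistics


def paired_t_statistic(acc_a, acc_b):
    """Paired t-statistic for two classifiers evaluated on the same k runs.

    acc_a[i] and acc_b[i] are the accuracies of classifiers A and B on
    run i. Uses the sample variance (k - 1 denominator) of the per-run
    differences, so at least two runs with differing gaps are needed.
    """
    if len(acc_a) != len(acc_b):
        raise ValueError("both classifiers need the same number of runs")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    k = len(diffs)
    if k < 2:
        raise ValueError("need at least two runs")
    mean_d = statistics.mean(diffs)      # d in the formula
    var_d = statistics.variance(diffs)   # sigma_d^2 in the formula
    return mean_d / math.sqrt(var_d / k)
```

For example, with hypothetical accuracies of 90, 92 and 94 percent for one filter and 88, 89 and 93 for another, the differences are 2, 3 and 1, and the statistic works out to 2/sqrt(1/3), about 3.46, which you would then compare against the table value for k-1 = 2 degrees of freedom.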
The other method is called Analysis of Variance (ANOVA). I'm not familiar enough with this method to explain it here, but it allows you to choose from a set of experiments which ones really are above the average. Dig around in your statistics books or on the web for more information.
Sam should have made use of either of these techniques when doing his analysis. Since he only ran one experiment per configuration of his classifier, you can draw no real conclusions from the data presented (it's a Student's distribution with 0-degree of freedom... essentially flat!).
Since most of us only have a small number of corpora kicking around (maybe even only one!), you can use a method called "cross validation" to give yourself a larger number of data sets than you actually have. When doing a cross validation, you divide your corpus up into k "folds" and then perform k experiments. In each experiment, you set aside one fold of your data for testing and train on the other k-1 folds. Since you're using different test data each time, each experiment can be considered to be different and then you can use a pairwise t-test to prove statistical significance. There are other methods that you can use such as "leave one out" where you have as many folds as you do pieces of training data and "bootstrapping" where you sample your training data with replacement and test with whatever wasn't sampled for training.
However, cross validation may not be appropriate for incremental learning algorithms if your data is on a timeline (such as e-mail). You can break your corpus up into pieces and do your evaluation on that.
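The k-fold procedure described above can be sketched in a few lines of Python. This is only one way to cut the folds (every k-th message goes into the same fold, with no shuffling), and the function name is invented for the example.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs for k-fold cross validation.

    Each item lands in exactly one test fold. In every experiment one
    fold is held out for testing and the other k - 1 folds are merged
    into the training set.
    """
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    folds = [items[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [
            item
            for index, fold in enumerate(folds)
            if index != held_out
            for item in fold
        ]
        yield train, test
```

Each (train, test) pair is one experiment, so k folds give the k accuracy figures needed for a pairwise t-test. Setting k to the number of messages gives the "leave one out" variant. For mail on a timeline, contiguous chunks in arrival order are probably the more honest split, per the caveat above.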
Proving statistical significance is very easy and allows you to be confident in the conclusions that you make in your publications. It's the scientific method!
Good luck!
Henry
No Karma is given if one is modded up "funny".
Personally, I wish that he had included DSPAM and CRM114 in his testing. Otherwise, I thought that it was an enjoyable review.
Could you please forward that email to me... I have a friend from Nigeria that would probably like to become a business partner with this group from the Congo.
--Kevin
/joeyo
2^5
I use bogofilter, and it seems to me it would take far too much of my time to manually feed my own spam to it for training purposes. What I do instead is this:
We have several spamtrap addresses on our sendmail server. They were not intentionally set up as spamtraps, but in looking at my mail logs I noticed that there were many email addresses receiving spam attempts that are not and never were valid addresses on our system. These invalid addresses somehow got into spammers' email databases and they receive nothing but spam.
So I set up entries in my aliases file to automatically redirect all mail for these accounts to bogofilter's spam database. Here is a sample...
nikola: "|/usr/local/bin/bogofilter -s "
cal: "|/usr/local/bin/bogofilter -s "
bwilson: "|/usr/local/bin/bogofilter -s "
fayre: "|/usr/local/bin/bogofilter -s "
(If you are also using sendmail's access.db to filter mail based on the source IP address, you may want to set up the spamtrap addresses as "spam friends" so that spam directed to them is not filtered out by your IP address filters.)
To keep the spam database fresh and to keep it from growing to an excessive size, I use a daily cron job that automatically deletes spam entries older than 30 days...
# remove records older than 30 days from spamlist.db
/usr/local/bin/bogoutil -a30 -m /home/bogofilter/spamlist.db
This gives me an 8 Megabyte spamlist.db with about 14,000 emails in it which is constantly refreshed to keep up with the latest spam trends.
Maintaining the non-spam database isn't quite as easy. I use bogofilter's -u option on my own incoming email, which tells Bogofilter to update its databases with my incoming mail based on its classification of the message as spam or non-spam. I never get a false positive, but I do occasionally get a false negative which requires me to make a correcting entry in the database.
1) imagine if two people used this thing or something similar. now one of them tries sending the other one mail. bam: infinite loop. congratulations.
2) imagine a person using this to mailbomb somebody. fake the from address, send mail to you, now *you*'ve bombed that person.
3) make up your own situation where this scheme utterly fails and even ends up being dangerous. try a bit, there are plenty.
and, last of all, if I send you mail, then it's because I think it's useful for you. if you don't want to receive it, bad for you. fat chance I'll answer some stupid automated mail or click on a link in such a mail.
Mod parent up
I'm an American. I love this country and the freedoms that we used to have.
Complete BS.
Geeks are the ones that set up the spam filters for everyone else. End users will no more have to install spam filters than they have to install DNS entries, multi-peered lines to the backbone, etc. (In fact, the problem is that often ISPs don't tell you they are filtering, or give you the chance to turn it off.)
Your filters really help cut down on the complaints to the Internet service providers I do business with, and as long as not too many complaints come in their marketing people assure me we can do business.
Sorry, but my delete key is tied to your ISP's abuse box.
Ok, I actually have a separate "this is spam" key that sends the spam off to spamcop. I also use the following procmail script to report anything that scores too high on spamassassin:
The spam_report script is very simple, it just encodes the spam and sends it off to spamcop. It can be found on http://spamcop.net/reporter.pl. I modify the number of stars (spamassassin score) depending on how much time I have on my hands right now. If too many reports get sent to spamcop for me to deal with, I increase the number of stars; when a spammer pisses me off, I decrease the score. Even a small number of vindictive anti-spammers reporting spam will get the spammer's IP address onto spamcop's DNSBL, which feeds back into things like spamassassin.
The amount of spam that reaches my inbox in the last 6 months has been far lower than any time since the mid 1990s. Even with the reporting to spamcop, I'm spending less time dealing with spam now than two or three years ago. Over the last year or so, I've come to believe that spammers' days are numbered.
Oh, one final note. The original article complained about the fact that spamassassin MIME-defangs the spam and then says that it is hard to get the original email back. This isn't true at all. On older versions, you just run it through "spamassassin -d". While you can still do that with newer versions (as per my scripts above), they now create an attachment so you can just click on it if you want to see it.
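The "number of stars" knob can be sketched in a few lines. This is only an illustration in Python, not the procmail recipe referred to above; it assumes SpamAssassin is configured to add an X-Spam-Level header made of asterisks, and the function names and the default threshold are made up:

```python
import email


def star_count(raw_message):
    """Count the asterisks in the X-Spam-Level header (0 if it is missing)."""
    msg = email.message_from_string(raw_message)
    return (msg.get("X-Spam-Level") or "").count("*")


def should_report(raw_message, min_stars=10):
    # min_stars is the adjustable knob: raise it when there are too many
    # reports to deal with, lower it when a spammer deserves extra attention.
    return star_count(raw_message) >= min_stars


if __name__ == "__main__":
    sample = "X-Spam-Level: ************\nSubject: test\n\nbody\n"
    print(star_count(sample), should_report(sample))
```

Only messages at or above the threshold would be handed to the reporting script; everything else is left alone.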
SPF support for most open source mail servers can be found at libspf2.
I recently switched from bogofilter to SpamBayes. While it still shows the minor issues a young project always has (incompatibility with the dumbdbm in Python 2.2.2 of SuSE 8.2 so you have to use gdbm as the internal db driver etc), I consider it one of the most promising spam filters around.
Sure, it's only one data point, and next week will be different, but I think I'll stick with SpamBayes for now.
It's a combination of Bayesian filtering and whitelists based on your address book. When you first start the application, it goes into pure training mode in which junk mail is flagged but not filtered out of your inbox automatically. You train it for a while, labeling junk yourself and correcting false positives. After the training is sufficient (no more false positives at all for a set period of time, though there usually aren't any as anyone in your address book is whitelisted and everyone you hold a correspondence with is added automatically) the filter then prompts you to go into automatic mode, in which it separates junk into its own box. After 10 days or so in the junk box (you can set the exact time, including never), the messages get deleted. And for those annoying people who forward jokes to you but are whitelisted anyway, enough training can actually selectively overcome the whitelist. It's really very cool. For the occasional piece of spam that makes its way into your mailbox, you can select it and press the junk button, and it immediately banishes it to the junk box and learns from the mistake.
I understand that it wouldn't have worked out considering the methodology behind the tests but I'd be interested to see how Apple's Mail.app compares.
SpamAssassin is used in quite a few of the commercial offerings, it can also filter before passing on for internal delivery. I'd guess that one or two of the others can too, it's not difficult
Custom Rules For SpamAssassin
It seems that the worst part about spam is wasted bandwidth and processing power. Electricity is wasted shoving undesired messages through fiber optic cable, then even more power goes into processing them to discover whether they are spam or not, and then you have false positives. I think a better solution would be to weed out all the spammers, maybe take the internet away from countries that allow spammers or something?
Sig: I stole this sig.
So the results aren't quite up to date. I've trained it on a couple months of spam and non-spam and it seems to significantly improve its classification.
Messages classified: 3,545
Classification errors: 110
Accuracy: 96.89%
This is with 4 buckets. My spam bucket received 2,561 ( 72.24%) of those e-mails, with 7 false positives and 9 false negatives.
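For anyone who wants to check the arithmetic behind those figures, here is a small sketch in Python. The function and key names are made up, and it assumes the 110 total errors include the 16 that involve the spam bucket:

```python
def summarize(classified, errors, spam_bucket, false_pos, false_neg):
    """Derive the headline ratios from the raw counts quoted above."""
    return {
        # Share of all messages that were classified correctly.
        "accuracy": (classified - errors) / classified,
        # Share of all messages that ended up in the spam bucket.
        "spam_share": spam_bucket / classified,
        # Errors that did not involve the spam bucket at all
        # (assumes the spam errors are part of the overall total).
        "other_errors": errors - (false_pos + false_neg),
    }


if __name__ == "__main__":
    stats = summarize(3545, 110, 2561, 7, 9)
    print("accuracy:     %.3f%%" % (stats["accuracy"] * 100))
    print("spam share:   %.2f%%" % (stats["spam_share"] * 100))
    print("other errors: %d" % stats["other_errors"])
```

Under that assumption, most of the misclassifications are mail shuffled between the non-spam buckets, not spam decisions gone wrong.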
Oh yeah, POPFile is cross-platform...Windows, Linux, anything that will run Perl (Windows users, don't be afraid. The installer installs an interpreter for you - you'll never know it's there!)
the crippled SpamAssassin did pretty damn good though.
I don't think it's totally unfair to run SpamAssassin with the bayes disabled in these tests, a lot of people run it that way in the real world, especially on mail gateways where no provision has been made for training & retraining.
We just need to remember that on every score for SpamAssassin in those tests, it can do a lot better. I've heard good things about a few of the others, but SpamAssassin's nothing short of a miracle here, 2 false negatives and one false positive last month on approx 54K messages.
Custom Rules For SpamAssassin
How many times are we going to re-review the same old crap over and over again?
:)
btw I agree with most readers here, the comparison is useless.
This is aside from the fact that it's pointless for windoze users (the generators of most spam). Where is a review of POPFile?
Ohh, I love the BS line about (Paraphrased!) "we turned bayesian filters off for spamassassin because 5 other filters were good enough" - wtf ?
With low data-sets like that, the article is useless, plus this is not a valid method of dealing with spam anyhow.
Has anyone else noticed how this topic keeps getting regurgitated over and over, ad nauseam?
blatant plug - anyone who wants to discuss anti-spam in real terms contact me (I'm in the process of setting up a sourceforge page too!)
I've been thinking lately that SpamAssassin might have the best Bayesian implementation, with only a slight change.
AFAIK, most/all Bayesian scanners out there simply tokenize the mail and then use the tokens as the basis of the rating system.
However, SpamAssassin adds an X-Spam-Status header to all mails (by default), which contains a list of the various tests (regex, network, or Bayesian) that the mail triggered. If SA were to move the Bayesian scan to after all other tests have completed, then this list of tests passed could be (or might already be) considered by the tokenizer for the Bayesian algorithm.
The benefit to this is that regexes can discern more patterns in the code (or more correctly, equate patterns) and the network tests are fairly reliable. In a large sense, this is using Bayesian techniques to develop a self-adjusting rating scheme for the tests. Using this, one could assess, for instance, how much having a host in the relay chain in an RBL influences the spamminess of an email (for instance, a large amount of email originating from SPEWS-listed IPs is not spam; this would imply that SPEWS would have a lower confidence rating in picking out a spam).
the external filtering stages of Sendmail, postfix, qmail or Exchange's SMTP engine. You know, the place where you can run an external program on the email message. Since ALL of the reviewed spam classifiers were chosen because they run from the command line with only the message as standard input and a classification as the output, I'm sure you can write a quick Perl script to use it in that context and achieve the mail accept/reject feature you need.
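To make the idea concrete, here is a rough sketch in Python of a tokenizer that feeds the triggered test names to a Bayesian classifier as extra tokens. It is not how SpamAssassin works internally; the function names, the "rule:" prefix and the sample test names are made up, and it assumes a non-multipart message carrying an X-Spam-Status header with a tests= list:

```python
import email
import re


def rule_tokens(raw_message):
    """Turn the tests= list of X-Spam-Status into 'rule:NAME' tokens."""
    msg = email.message_from_string(raw_message)
    status = msg.get("X-Spam-Status") or ""
    # The list may be folded across lines; rejoin it after each comma.
    status = re.sub(r",\s+", ",", status)
    match = re.search(r"tests=(\S+)", status)
    if not match:
        return []
    names = match.group(1).split(",")
    return ["rule:" + n for n in names if n and n.lower() != "none"]


def tokenize(raw_message):
    """Body words plus rule tokens, ready for a Bayesian classifier."""
    msg = email.message_from_string(raw_message)
    body = "" if msg.is_multipart() else msg.get_payload()
    words = re.findall(r"[a-z0-9$'-]+", body.lower())
    return words + rule_tokens(raw_message)


if __name__ == "__main__":
    sample = ("X-Spam-Status: Yes, hits=7.2 required=5.0\n"
              "\ttests=RCVD_IN_EXAMPLE_RBL,\n\tHTML_MESSAGE version=2.55\n"
              "Subject: x\n\nBuy now\n")
    print(tokenize(sample))
```

The classifier would then learn a spam probability for each rule token the same way it does for words, which is the self-adjusting rating described above.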
Maybe you add in an extra header: (X-Int-Spam: Yes) to let downstream clients deal with delivery options.
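A quick sketch of such a wrapper, in Python rather than Perl. It assumes the classifier reads the message on standard input and signals spam with exit status 0; that convention, the function names and the example command are assumptions to adapt to whichever filter you pick:

```python
import email
import subprocess


def classify_with_command(command, raw_message):
    """Run an external classifier with the message on stdin.

    Assumption: exit status 0 means spam. Check your filter's docs.
    """
    result = subprocess.run(command, input=raw_message.encode("utf-8"))
    return result.returncode == 0


def tag_message(raw_message, is_spam):
    """Return the message with an X-Int-Spam: Yes/No header added."""
    msg = email.message_from_string(raw_message)
    # Drop any copy of the header the sender may have forged.
    del msg["X-Int-Spam"]
    msg["X-Int-Spam"] = "Yes" if is_spam(raw_message) else "No"
    return msg.as_string()


if __name__ == "__main__":
    import sys
    raw = sys.stdin.read()
    # Hypothetical command line; substitute the filter you actually use.
    command = ["/usr/local/bin/bogofilter"]
    sys.stdout.write(tag_message(raw, lambda m: classify_with_command(command, m)))
```

Downstream clients can then file or reject on the header without knowing which classifier produced it.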
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
The only spam filter that is acceptable is one that utilizes challenge-response. There are NO false positives with filters based on this. I cannot tolerate even one false positive with my email. And the only spam that gets through are those with actual real-life return addresses. These are rare, and since there's a live address on the other end I have the luxury of sending the spammer a bitch-o-gram.
POPFile rocks. It is incredibly easy to use, and very, very accurate. I initially started using it for spam reduction purposes but now I find its best use is actually sorting my mail... waaaaay better than pre-defined mail filters.
I strongly recommend people check it out if they want a very effective solution that is easy to use and configure.
Read Pynchon.
Hey, it's not my fault that you can't appreciate sarcasm.
You're not the same Timbo of "Timbo's goals" fame are you? If so, any predictions for the season? I can't believe we didn't try to get Mendieta if he was available for free.
"Accept that some days you are the pigeon, and some days you are the statue." - David Brent, Wernham Hogg
Try cleaning your training file (\training.dat) and retraining it, there's probably something wrong with it, and it can't "unlearn" whatever is screwed. It should be able to do a LOT better than that.
And yes, it uses the same filter.
that was supposed to be \training.dat, but /. ate it.
I use it, and it works from message one, not after message 200 or something. Adding just two regex filtering criteria based on looking through spam headers, I was able to rescue two people at the office from their daily flood. The one manager was about to do his nut. This is despite corporate level filtering on the mail server (which is being upgraded as I write). In 4-6 weeks both reported a huge drop from several hundred spams a day down to less than a dozen. The 'bounce' feature in MailWasher is very cool, as some spammers get the bounce and take you off their list - why spend $0.0000001 going after a "non-existent" email address? Best feature: it's free to get and run, but a few bucks takes the scrolling about-the-author away. Also it took me 1 click on 2 occasions to rid myself of two large corporates who were fishing my email for sales.
I like the articles and their suggestions. However, Sam Holden's suggestion to save marked spam and later review it ignores the problem of scale, access, and time.
I do not know how many messages I receive daily, but let's say 200+ for sake of discussion (mostly spam). Since I am currently traveling a lot, often in places without any access, it may be a week or more before I can wade through the tons of junk. Yesterday I came in to 80+ messages in my inbox (mostly spam) and over 1200 messages tagged by spamassassin. At that rate it is not worth my time to skim the spam file. I am sure I am not the only one that receives email like this. How much time can any of us afford to deal with checking the filtered spam?
My current spam filter is SpamOracle. It's a simple procmail filter based on Bayes' formula. It's really efficient, I haven't had a spam mail in my inbox for a week. The only bad thing is that it's written in ocaml which might not be on everybody's machine. Mandrake users can install a contribs package and don't need ocaml at all.
What I'd really like to see is something that will:
-Try to click the "remove me" or "unsubscribe" link.
-If I still get email from this spammer after X days, email abuse@theirdomain.
-If it doesn't have a remove link, email abuse@theirdomain.
I'm not sure about others, but I find that a lot of spammers respect the "unsubscribe" link. Every few weeks I go through my inbox and do this by hand. Seems fairly easy to automate.
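The decision logic for the three rules above is simple enough to sketch. This Python fragment only decides what to do when a new message from a known sender arrives; actually following links or sending mail is left out, and the function name, action names and 14-day default are made up for illustration:

```python
from datetime import date, timedelta


def next_action(has_unsubscribe_link, unsubscribed_on=None,
                today=None, grace_days=14):
    """Pick the next step for a message from a given sender.

    unsubscribed_on is the date an unsubscribe was already attempted,
    or None if this sender has not been handled before.
    """
    today = today or date.today()
    if not has_unsubscribe_link:
        # No way to opt out: go straight to abuse@theirdomain.
        return "report_to_abuse"
    if unsubscribed_on is None:
        return "unsubscribe"
    if today - unsubscribed_on > timedelta(days=grace_days):
        # Still mailing after the grace period (the X days above).
        return "report_to_abuse"
    return "wait"


if __name__ == "__main__":
    print(next_action(True))
    print(next_action(True, date(2003, 9, 1), today=date(2003, 9, 20)))
    print(next_action(False))
```

The tool would need to remember, per sender, when it last tried to unsubscribe.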
- lather - rinse - repeat -