Seven Spam Filters Compared
Goo.cc writes "Those wondering how their spam filtering software performs in comparison to others may want to read this article on Freshmeat, where Sam Holden performs comparative testing of various popular e-mail filters. The filters tested include Bayesian Mail Filter, Bogofilter, dbacl, Quick Spam Filter, SpamAssassin, SpamProbe, and SPASTIC."
I have seen at least two of these comparisons and no one seems to want to roll Mozilla's spam filter into the mix and compare it. Therefore, the comparisons are kind of useless to me. I am guessing I am not the only person using Moz either, for specifically this reason (ease of use for Bayesian filtering).
What's up with that? I know it's not a proxy, so the methodology is different from most of the products in the comparison. I'm very interested, however, in how well the filter works compared to these other products.
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
Does anyone find it disturbing that --
a. Spam Filter software company is now a "viable business."
b. Spam Filter is needed AT ALL?
ELOI, ELOI, LAMA SABACHTHANI!?
They didn't train SpamAssassin to use the Bayes filter once during the test, and they used it without all the other scoring tools for SpamAssassin. This review really didn't completely test SpamAssassin's full potential.
I noticed immediately that the author turned off SpamAssassin's Bayesian filter, claiming "it already has 5 points, that's enough". WTF does that mean? The whole point of SpamAssassin is to do a lot of tests, add the scores together, and then set the threshold you want (something he also doesn't modify; I changed my threshold after looking at the scores spams were getting and such).
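To make the "add the scores together, then compare to a threshold" idea concrete, here is a minimal Python sketch. The rule names and scores are hypothetical and are not SpamAssassin's real rule set; the 5.0 default threshold and the "greater than or equal" comparison are assumptions for illustration.

```python
# Hypothetical rule names and scores, for illustration only;
# they are not SpamAssassin's real rules or values.
RULE_SCORES = {
    "HTML_ONLY": 1.5,
    "SUSPICIOUS_FROM": 2.0,
    "BAD_CLIENT_HEADERS": 1.0,
    "BAYES_HIGH": 3.0,
}


def total_score(hits):
    """Sum the scores of every rule that matched the message."""
    return sum(RULE_SCORES[name] for name in hits)


def is_spam(hits, threshold=5.0):
    """Tag as spam when the summed score reaches the threshold (assumed >=)."""
    return total_score(hits) >= threshold


hits = ["HTML_ONLY", "SUSPICIOUS_FROM"]
print(total_score(hits))              # 3.5
print(is_spam(hits))                  # False
print(is_spam(hits, threshold=3.0))   # True
```

The same message flips from "not spam" to "spam" purely because the threshold moved, which is why leaving the threshold and the Bayes rule untouched changes what a test measures.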
I trained SA's Bayesian filter on about 3 years of spam and legitimate email sent directly to me. SA as a whole is working nearly flawlessly: the only messages it has tagged as spam were those from users with improperly configured email clients AND suspicious email addresses AND using only HTML. I.e., a message that would damn well look like spam. However, like I said, I lowered SA's threshold by 2 points because I was having too many false positives (that was before I had properly trained the Bayesian filter, so perhaps I'll kick it up a point now).
One important note: when you get a falsely classified message, it's REALLY important to tell SpamAssassin's Bayesian filter about it. It's as easy as cut+paste if you do sa-learn --spam/--ham --single, hit enter, paste the message, hit Control-D. Done!
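To see why feeding corrections back matters, here is a toy Python sketch of a token-counting Bayesian classifier. It is not SpamAssassin's actual algorithm or API; the class TinyBayes, its methods, and the sample messages are all made up for illustration.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive Bayes over whitespace tokens; illustration only."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        """Record one message as 'spam' or 'ham'."""
        self.tokens[label].update(text.lower().split())
        self.messages[label] += 1

    def spam_probability(self, text):
        """Return P(spam | text) using add-one smoothing."""
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        # Known vocabulary plus one slot for never-seen tokens.
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"])) + 1
        spam_total = sum(self.tokens["spam"].values())
        ham_total = sum(self.tokens["ham"].values())
        for tok in text.lower().split():
            p_spam = (self.tokens["spam"][tok] + 1) / (spam_total + vocab)
            p_ham = (self.tokens["ham"][tok] + 1) / (ham_total + vocab)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


f = TinyBayes()
f.learn("cheap pills buy now", "spam")
f.learn("meeting notes attached", "ham")

msg = "buy cheap notes now"
before = f.spam_probability(msg)
f.learn(msg, "ham")  # tell the filter it got this one wrong
after = f.spam_probability(msg)
print(before > 0.5, after < 0.5)  # True True
```

With this tiny training set, one correction is enough to move the message from the spam side of 0.5 to the ham side. A real filter trained on years of mail moves far less per message, but the direction is the same.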
Please help metamoderate.
Please don't bother your Congressmen or Senators proposing legislation that might not work 100%. Just keep on filtering the spam I send you, I know you would have never bought from me anyway. That you can filter legitimizes my business and my waste of your bandwidth.
P.S. To be sure of not getting a false positive, be sure to send all filtered mail to a special folder. Waste your storage space storing the mail until you manually go through every piece to be sure you didn't accidentally filter something important. Of course, this will take exactly as much effort as it would have to just check the e-mail when it first came in, not to mention the extra effort spent in setting up the filters and the extra space for storing your incoming spam folder, but what the heck. You geeks enjoy wasting time this way, and I certainly appreciate it. It makes the work of all us spammers much easier.
This is a pretty bogus "fix". It might work if you set up such an account and never use it, but if it's used and gets into a spam database the computers can propagate this e-mail address just like they can any other. The spam database computers simply don't care if the name is "joe" or "saf4502", they deal with both exactly the same. All you'll really do is make it harder for you to pass along an e-mail address verbally to someone.
Spammers get these addresses any number of ways. Many are harvested tens of thousands at a time. If you ever use that e-mail address in a Usenet newsgroup, for example, it will get harvested. Of course, you can munge it and give instructions in the post for how someone wanting to reply should unmunge it (replace the number in my name with the square root of the number), but realistically few are going to bother to go to the extra work to unmunge an e-mail address, so if you made a post to really try to get some information back rather than to just hear yourself talk, that's a big waste.
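For what it's worth, the square-root munging rule described above is trivial for a program to undo, even if few people will bother by hand. A minimal Python sketch, assuming the munged number sits in the part before the @ and is a perfect square; the function name and the example address are hypothetical:

```python
import math
import re


def unmunge(address):
    """Replace each number in the local part with its integer square root."""
    local, domain = address.split("@", 1)

    def root(match):
        n = int(match.group())
        r = math.isqrt(n)
        if r * r != n:
            raise ValueError(f"{n} is not a perfect square")
        return str(r)

    return re.sub(r"\d+", root, local) + "@" + domain


print(unmunge("joe144@example.com"))  # joe12@example.com
```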
Same if you want to post a contact e-mail on your website.
Businesses you deal with are even less likely to unmunge your e-mail address, and if they do you certainly have no protection that they are not the ones about to sell their mailing list database to a spammer.
And even if you just keep your e-mail address for close personal contacts, one of them may eventually come across what they think is a "cute" electronic greeting card site on the web and give them your address to send some damn picture of a dancing bunny, or use your e-mail address on some site with an "e-mail to a friend" link for a story they think you would be interested in, or even just let their computer get infested with some worm that goes through address books, and then your address is in some spam database, soon to be in thousands. Having a hard-to-remember e-mail address is no more protection than having an easy-to-use one.
I even created a dummy e-mail address one time on Mindspring, with a very uncommon name and numbers. Never used it. It started getting spam after a while. Either Mindspring sold the names, or they had a bad security system and some employee sold the names, or they had a really bad security system and someone hacked in and harvested the names.
I'm an American. I love this country and the freedoms that we used to have.
Agreed. The author made up the artificial constraint that "no program is allowed to contact the network", which means that SpamAssassin wasn't able to check the DNS blacklists for things like exploited open proxies/relays in the Received chain, or to check with distributed signature services like RAZOR/DCC, etc.
If you're not going to let the program use its full capabilities, why test it?
Analogously, what kind of hardware review site would do a review along the lines of "This motherboard supports this extra feature that will improve CPU speed noticeably, but we're going to disable it for our tests (even though most of you would want to use it.)"
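For anyone wondering what the DNS blacklist check mentioned above involves: a DNSBL lookup reverses the IPv4 octets of a relay address and appends the list's zone, then asks DNS whether that name exists. A minimal Python sketch that only builds the query name and does not touch the network; the zone dnsbl.example.org and the function name are placeholders, not a real list or a SpamAssassin API:

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the hostname a DNSBL lookup would resolve for an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


# Placeholder zone and a documentation-range IP, for illustration only.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# 99.2.0.192.dnsbl.example.org
```

A real check would then resolve that name: an answer means the relay is listed, and a "no such name" response means it is not. That lookup is exactly the step the "no network" rule disables.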
"Also, SpamAssassin has a Bayesian classifier built in, but it wasn't used in these tests, since having five was enough."
While I'm sure the recommendations set forth in SpamAssassin's man page are probably a good idea for all Bayesian training sets, he wasn't using the Bayesian filtering included in SpamAssassin, so you can't really fault him for not reading a section of the man page for a feature he was choosing to leave out.
It would have been nice to see him turn on SpamAssassin's Bayesian filtering in at least some of the tests. I don't think test results with a feature turned off that I would imagine the vast majority of users would use are a very good comparison of the different packages' abilities.
- b