Slashdot Mirror


Seven Spam Filters Compared

Goo.cc writes "Those wondering how their spam filtering software performs in comparison to others' may want to read this article on Freshmeat, where Sam Holden performs comparative testing of various popular e-mail filters. The filters tested include Bayesian Mail Filter, Bogofilter, dbacl, Quick Spam Filter, SpamAssassin, SpamProbe, and SPASTIC."

20 of 213 comments

  1. Mozilla? by HBI · · Score: 4, Insightful

    I have seen at least two of these comparisons and no one seems to want to roll Mozilla's spam filter into the mix and compare it. Therefore, the comparisons are kind of useless to me. I am guessing I am not the only person using Moz either, for specifically this reason (ease of use for Bayesian filtering).

    What's up with that? I know it's not a proxy, so the methodology is different than most of the products in the comparison. I'm very interested in how well the filter works however, compared to these other products.

    --
    HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
    1. Re:Mozilla? by bobintetley · · Score: 2, Insightful

      Sensible people filter their email at the server and try to waste as little bandwidth as possible.

      Mozilla is no good for this, as you have to download the mail via POP3/IMAP to filter it.

      Don't get me wrong - Moz' spam filter is good at the user level, but you really would want to try and ditch the spam before then (particularly if you run a server for a number of users).

    2. Re:Mozilla? by wilfie · · Score: 5, Insightful

      The loss of bandwidth is not the main cost of spam these days. Certainly not internal bandwidth between our mail server and desktops. The excellent features of doing it on my desktop are that the filter is learning about what _I_ consider to be spam and ham, and that I have the stuff that's classified as spam to hand and can check it through once in a while. So far for me it's only thrown false positives when colleagues have sent stuff that was spammy in content. I have a presentiment that our CEO's habit of writing in red HTML (full of ff0000) will cause a false hit one day.

    3. Re:Mozilla? by hdw · · Score: 5, Insightful

      Most people can't filter their email at the server, since most people don't have access to a server to filter at.

      So the majority has to filter locally, either in the client or with a local pop/imap proxy (like PopFile).

      // hdw

      --
      Executive Pope (small) Kallisti Engineering
  2. OT: Disturbing? by Lead+Butthead · · Score: 4, Insightful

    Does anyone find it disturbing that --

    a. Spam Filter software company is now a "viable business."
    b. Spam Filter is needed AT ALL?

    --
    ELOI, ELOI, LAMA SABACHTHANI!?
  3. Re:Good testing, but not enough samples by Sanctuary · · Score: 4, Insightful

    They didn't train SpamAssassin to use the Bayes filter once during the test, and they used it without all the other scoring tools for SpamAssassin. This review really didn't completely test SpamAssassin's full potential.

  4. SpamAssassin had Bayesian turned off?! by SuperBanana · · Score: 4, Insightful

    I noticed immediately that the author turned off SpamAssassin's Bayesian filter, claiming "it already has 5 points, that's enough". WTF does that mean? The whole point of SpamAssassin is to do a lot of tests, and add the scores together - and then set the threshold you want (something he also doesn't modify - I changed my threshold after looking at the scores spams were getting and such).

    I trained SA's Bayesian filter off of about 3 years of spam and legitimate email sent directly to me. SA as a whole is working nearly flawlessly - the only messages it has tagged as spam were those from users with improperly configured email clients AND suspicious email addresses AND using only HTML. I.e., a message that would damn well look like spam. However, like I said, I lowered SA's threshold by 2 points because I was having too many false positives (that was before I had properly trained the Bayesian filter, so perhaps I'll kick it up a point now).
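
    A rough sketch of the add-up-the-scores-and-compare idea described above, in Python. The rule names, scores, and message fields are made up for illustration and are not SpamAssassin's actual rule set; only the additive scoring and the adjustable threshold come from the comment.

```python
# Hypothetical rules: (name, predicate over a message dict, score).
# These are illustrative only, not SpamAssassin's real rules or scores.
RULES = [
    ("HTML_ONLY", lambda msg: msg.get("html_only", False), 2.0),
    ("SUSPICIOUS_SENDER", lambda msg: msg.get("suspicious_sender", False), 2.5),
    ("BROKEN_CLIENT_HEADERS", lambda msg: msg.get("broken_headers", False), 1.5),
]


def score_message(msg, rules=RULES):
    """Run every rule and return (total score, names of rules that hit)."""
    hits = [(name, score) for name, test, score in rules if test(msg)]
    total = sum(score for _, score in hits)
    return total, [name for name, _ in hits]


def is_spam(msg, threshold=5.0):
    """Tag as spam when the summed score reaches the chosen threshold."""
    total, _ = score_message(msg)
    return total >= threshold


# Example message: a dict of hypothetical feature flags.
example = {"html_only": True, "suspicious_sender": True}
total, hits = score_message(example)
print(total, hits)
print("threshold 5.0:", is_spam(example, threshold=5.0))
print("threshold 3.0:", is_spam(example, threshold=3.0))
```

    Moving the threshold trades missed spam against false positives, so it's worth tuning against the scores your own mail actually gets.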

    One important note - when you get a falsely classified message, it's REALLY important to tell SpamAssassin's Bayesian filter about it. It's as easy as cut+paste if you do sa-learn --spam/--ham --single, hit enter, paste the message, hit control D. Done!

  5. Going after the wrong people... by Anonymous Coward · · Score: 1, Insightful

    Instead of going after Spammers, why not go after the companies that hire them to send us Viagra/Penis Enlargement/etc mails? Without them, no Spam. Also, I'd like to know who the fucktards are that respond to these mails and buy their products.

  6. Stop spam the low-tech way. by Futurepower(R) · · Score: 2, Insightful


    The quickest way to stop spam in the U.S. would be to have a respected person such as the Surgeon General of the United States say that

    1) There is no way to increase the size of your body parts,

    2) The cheap Viagra is not Viagra,

    3) and so on.

    We can help by telling everyone we know not to buy anything from spam. Next time you are at a party or family gathering, make that point.

    Spam would disappear if there were no buyers. We need to make it culturally unacceptable to buy anything that is advertised through spam.

  7. Re:Active Spam Killer by Admiral+Llama · · Score: 2, Insightful

    1. If you thought it was worthwhile to send me an e-mail in the first place, then you'll probably click the respond button for the bounce message. If not, then I probably don't want to hear from you anyway.

    2. If someone spoofs an e-mail to me from a spam victim, the spam victim will get an e-mail asking them to prove they're real. Fat chance of them ever doing that. Who knows? Maybe the spam victim will be so impressed with the sheer brutality of Active Spam Killer, they'll try it too.

  8. A message from a spammer by Anonymous Coward · · Score: 5, Insightful
    As a professional sender of UCE, I just want to tell you slashdotters to keep on playing with your spam filters. As long as you use spam filters on your e-mail, I can continue to reach my real intended targets, those non-slashdotters who do not know better and will buy my products or click through to my clients' websites. Your filters really help cut down on the complaints to the internet service providers I do business with, and as long as not too many complaints come in their marketing people assure me we can do business. Of course, I still waste your bandwidth and mailbox capacity, but you no longer complain to uce@ftc.gov, my access providers, or anyone else who might cause me problems. My yahoo and hotmail and other accounts for replies are lasting much longer before getting shut down because someone complained to these service providers. And my clients are even reporting that they can start mailing out 800 numbers like 1-800-901-3719 again and they will not have you damn spammers set up their modems to keep autodialing them, since you spend your own time and effort to filter the e-mail and only clueless users who might actually call see the numbers.

    Please don't bother your Congressmen or Senators proposing legislation that might not work 100%. Just keep on filtering the spam I send you, I know you would have never bought from me anyway. That you can filter legitimizes my business and my waste of your bandwidth.

    P.S. To be sure of not getting a false positive, be sure to send all filtered mail to a special folder. Waste your storage space storing the mail until you manually go through every piece to be sure you didn't accidentally filter something important. Of course, this will take exactly as much effort as it would have to just check the e-mail when it first came in, not to mention the extra effort spent in setting up the filters and the extra space for storing your incoming spam folder, but what the heck. You geeks enjoy wasting time this way, and I certainly appreciate it. It makes the work of all us spammers much easier.

    1. Re:A message from a spammer by jpetts · · Score: 2, Insightful

      This might be considered interesting, but I think it is really just a troll.

      However, one interesting point that trollboy makes, is that the 1-800 numbers end up in the spam, and we don't see them: why not modify the filter so it automagically pulls out all such numbers from the spam, so that they can be easily on hand for those people who want to set up autodialers? In a way this is poetic justice, being analogous to the way the scumbag spammers harvest email addresses from web pages. So yet again, the classification allows an easy way to harvest spam 1-800 numbers from genuine ones.

      Thanks for the suggestion, spammer or troll, whatever you are!!

      PS Googled for the 1-800 number the idiot mentioned in his email, but nothing came up. Did anybody dial it? I'm nowhere near a public telephone at the moment. I'll try when I get back to civilisation if nobody else has already done it...

      --
      Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender
    2. Re:A message from a spammer by mce · · Score: 3, Insightful

      There's more to the time-spent-on-spam comparison than what you wrote. If you filter all the spam and quickly check it once a day or once a week, you only look at it whenever you "want" to: i.e. probably during a dead moment in between meetings or some such. But if you let it get into your inbox, whatever you're doing may needlessly get interrupted every so many minutes/hours. After all, each e-mail that reaches your inbox might (for instance) be that one important reply you're waiting for and have to process asap...

  9. Re:So weird by arcanumas · · Score: 2, Insightful

    I am not sure about getting spam with such an address as saf4502@E8Hkl3.biz. I AM certain, however, that I would not receive regular mail.
    You cannot put it on a business card; people will always type it wrong. You definitely cannot pronounce it over the phone.
    In fact, most would give up on contacting me through e-mail just looking at this monster.

    --
    Slashdot Sig. version 0.1alpha. Use at your own risk.
  10. Re:So weird by frovingslosh · · Score: 3, Insightful
    Don't spend time trying to filter-- get an obscure email address like saf4502@E8Hkl3.biz

    This is a pretty bogus "fix". It might work if you set up such an account and never use it, but if it's used and gets into a spam database the computers can propagate this e-mail address just like they can any other. The spam database computers simply don't care if the name is "joe" or "saf4502", they deal with both exactly the same. All you'll really do is make it harder for you to pass along an e-mail address verbally to someone.

    Spammers get these addresses any number of ways. Many are harvested tens of thousands at a time. If you ever use that e-mail address in a usenet news group, for example, it will get harvested. Of course, you can munge it and give instructions in the post for how someone wanting to reply should unmunge it (replace the number in my name with the square root of the number) but realistically few are going to bother to go to extra work to unmunge an e-mail address, so if you made a post to really try to get some information back rather than to just hear yourself talk, that's a big waste.

    Same if you want to post a contact e-mail on your website.

    Businesses you deal with are even less likely to unmunge your e-mail address, and if they do you certainly have no protection that they are not the ones about to sell their mailing list database to a spammer.

    And even if you just keep your e-mail address for close personal contacts, one of them may eventually come across what they think is a "cute" electronic greeting card site on the web and give them your address to send some damn picture of a dancing bunny, or use your e-mail address on some site with an "e-mail to a friend" link for a story they think you would be interested in, or even just let their computer get infested with some worm that goes through address books, and your address is in some spam database, soon to be in thousands. Having a hard to remember e-mail address is no more protection than having an easy to use one is.

    I even created a dummy e-mail address one time on Mindspring, with a very uncommon name and numbers. Never used it. It started getting spam after a while. Either Mindspring sold the names, or they had a bad security system and some employee sold the names, or they had a really bad security system and someone hacked in and harvested the names.

    --
    I'm an American. I love this country and the freedoms that we used to have.
  11. Re:Good testing, but not enough samples by skookum · · Score: 3, Insightful

    Agreed. The author made up the artificial constraint that "no program is allowed to contact the network" which means that SpamAssassin wasn't able to check the DNS blacklists for things like exploited open proxies/relays in the Received chain, or to check with distributed signature services like RAZOR/DCC, etc.

    If you're not going to let the program use its full capabilities, why test it?

    Analogously, what kind of hardware review site would do a review along the lines of "This motherboard supports this extra feature that will improve CPU speed noticeably, but we're going to disable it for our tests (even though most of you would want to use it.)"
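
    For anyone wondering what the DNS blacklist check mentioned above involves: the usual DNSBL convention is to reverse the IPv4 octets, append the list's zone, and see whether that name resolves. A minimal Python sketch; the zone name is a placeholder rather than a real blacklist, and this is not SpamAssassin's own code.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets plus the list zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the blacklist zone has an A record for this address.

    This performs a real DNS query, which is exactly what a
    "no network access" test setup rules out.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # name did not resolve: not listed (or lookup failed)
    return True


# "dnsbl.example.org" is a hypothetical zone used only for illustration.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

    Each relay address pulled from the Received chain would get one such lookup per blacklist, which is why the check can't run at all with the network disabled.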

  12. Re:Good testing, but not enough samples by hamster+foo · · Score: 5, Insightful

    "Also, SpamAssassin has a Bayesian classifier built in, but it wasn't used in these tests, since having five was enough."

    While I'm sure the recommendations set forth in Spam Assassin's man page are probably a good idea for all Bayesian training sets, he wasn't using the Bayesian filtering included in Spam Assassin, so you can't really fault him for not reading a section of the man page for a feature he was choosing to leave out.

    It would have been nice to see him turn on Spam Assassin's Bayesian filtering at least in some of the tests. I don't think test results with a feature I would imagine the vast majority of users would use turned off is a very good comparison of the different packages' abilities.

    --
    - b
  13. Re:How about Spam Filter + Authentication? by DavidTC · · Score: 2, Insightful
    So, you're taking a message you suspect might be spam, and sending a message to the 'sender'.

    When, of course, most spam has forged senders.

    Whee, looks like another idiotic pattern I have to block.

    --
    If corporations are people, aren't stockholders guilty of slavery?
  14. Interesting article but unsound methodology by Henry+Stern · · Score: 2, Insightful

    Sam's article was a very interesting read, but his results need to be taken with a grain of salt.

    To show that one piece of software outperforms another, you need to prove statistical significance. This can be done in two ways:

    The first method is called the pairwise t-test. What you need to do is to run k tests using different training and test data. For each of these tests, you find the accuracy of the classifier (#success/#trials). Then, you form the "t-statistic," t = d/sqrt(sigma_d^2 / k), where d is the difference of the means of the two classifiers, sigma_d^2 is the variance of the difference samples and k is the number of samples. Then, you compare your t-statistic to the Student's distribution with k-1 degrees of freedom. Typically, you want a confidence level of 90% or 95% so you find the number of standard deviations away from the mean for the specific t-test (e.g. the 90% statistic 9-degree of freedom t-test is 1.38). If your t-statistic is greater than the number of standard deviations, then the difference between the two classifiers is statistically significant with X% confidence. Read more about this in Witten and Frank's Data Mining book.
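
    Here is a small Python sketch of the t-statistic above, using only the standard library. The accuracy figures are invented purely to show the arithmetic; they are not results from Sam's article or from any real filter.

```python
import math
from statistics import mean, variance


def paired_t_statistic(acc_a, acc_b):
    """Return (t, degrees of freedom) for paired per-run accuracies.

    t = d / sqrt(sigma_d^2 / k), where d is the mean of the per-run
    differences, sigma_d^2 their sample variance, and k the number of runs.
    """
    if len(acc_a) != len(acc_b):
        raise ValueError("both classifiers need one accuracy per run")
    k = len(acc_a)
    if k < 2:
        raise ValueError("need at least two runs to estimate a variance")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    var_d = variance(diffs)  # sample variance (divides by k - 1)
    if var_d == 0:
        raise ValueError("differences are identical; t is undefined")
    return mean(diffs) / math.sqrt(var_d / k), k - 1


# Made-up accuracies for two filters over k = 10 runs.
filter_a = [0.98, 0.97, 0.99, 0.96, 0.98, 0.97, 0.99, 0.98, 0.97, 0.98]
filter_b = [0.95, 0.96, 0.97, 0.94, 0.96, 0.95, 0.97, 0.95, 0.96, 0.96]

t, dof = paired_t_statistic(filter_a, filter_b)
CRITICAL_90_DOF_9 = 1.38  # the 90%, 9-degrees-of-freedom value quoted above
print(f"t = {t:.2f} with {dof} degrees of freedom")
print("significant at 90%:", t > CRITICAL_90_DOF_9)
```

    With a single run per configuration the function refuses to compute anything, which is the same objection raised against the article's data.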

    The other method is called Analysis of Variance (ANOVA). I'm not familiar enough with this method to explain it here, but it allows you to choose from a set of experiments which ones really are above the average. Dig around in your statistics books or on the web for more information.

    Sam should have made use of either of these techniques when doing his analysis. Since he only ran one experiment per configuration of his classifier, you can draw no real conclusions from the data presented (it's a Student's distribution with 0-degree of freedom... essentially flat!).

    Since most of us only have a small number of corpora kicking around (maybe even only one!), you can use a method called "cross validation" to give yourself a larger number of data sets than you actually have. When doing a cross validation, you divide your corpus up into k "folds" and then perform k experiments. In each experiment, you set aside one fold of your data for testing and train on the other k-1 folds. Since you're using different test data each time, each experiment can be considered to be different and then you can use a pairwise t-test to prove statistical significance. There are other methods that you can use such as "leave one out" where you have as many folds as you do pieces of training data and "bootstrapping" where you sample your training data with replacement and test with whatever wasn't sampled for training.
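
    The k-fold procedure just described can be sketched in a few lines of Python. The function name and the shuffling seed are my own choices for illustration; plug your own train/test routine into the loop.

```python
import random


def k_fold_splits(items, k, seed=0):
    """Yield (train, test) pairs: each fold is held out for testing once."""
    shuffled = list(items)
    if not 2 <= k <= len(shuffled):
        raise ValueError("k must be between 2 and the number of items")
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs repeatable
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal folds
    for i, test in enumerate(folds):
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, test


# Ten stand-in message ids split into 5 folds: 5 experiments, 8 train / 2 test each.
for train, test in k_fold_splits(range(10), k=5):
    print(len(train), "train,", sorted(test), "test")
```

    Setting k to the number of messages gives the "leave one out" variant mentioned above.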

    However, cross validation may not be appropriate for incremental learning algorithms if your data is on a timeline (such as e-mail). You can break your corpus up into pieces and do your evaluation on that.
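
    One way to read "break your corpus up into pieces" for time-ordered mail, sketched in Python: sort by date, cut into consecutive pieces, and always test on the piece that follows the training data. This is my interpretation rather than a prescribed procedure, and the (timestamp, message) tuple layout is an assumption.

```python
def chronological_splits(messages, pieces):
    """Yield (train, test) pairs where training mail always predates test mail.

    messages: iterable of (timestamp, message) tuples (assumed layout).
    """
    ordered = sorted(messages, key=lambda m: m[0])
    n = len(ordered)
    if not 2 <= pieces <= n:
        raise ValueError("pieces must be between 2 and the number of messages")
    bounds = [i * n // pieces for i in range(pieces + 1)]
    for i in range(1, pieces):
        train = ordered[:bounds[i]]               # everything seen so far
        test = ordered[bounds[i]:bounds[i + 1]]   # the next piece in time
        yield train, test


# Six stand-in messages with integer timestamps, cut into 3 pieces.
corpus = [(t, f"msg-{t}") for t in (5, 3, 1, 4, 2, 0)]
for train, test in chronological_splits(corpus, pieces=3):
    print([t for t, _ in train], "->", [t for t, _ in test])
```

    Unlike shuffled folds, this never lets the filter train on mail from the "future", which is closer to how an incremental learner is actually used.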

    Proving statistical significance is very easy and allows you to be confident in the conclusions that you make in your publications. It's the scientific method!

    Good luck!

    Henry

  15. Re:SpamBayes works really well for Outlook. by jpetts · · Score: 2, Insightful

    I think it's worth while to take the time to build up a corpus of SPAM and "good" messages as I can then evaluate competing filters.

    Anybody looking for a can of spam might want to check out the Ling Spam corpus created by Ion Androutsopoulos, also available here.

    --
    Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender