Slashdot Mirror


Comparison of Bayesian POP3 Spam Filters

kreide writes "Spam e-mail has become an ever increasing problem, and these days it is next to impossible to use e-mail without receiving it in large amounts. Although various techniques exist to combat the problem, spammers seemed to be winning the war - until a new, powerful weapon appeared on the scene: Bayesian filters, our last, best hope for spam-free inboxes. In this review I compare POP3-based Bayesian spam filters." We did an Ask Slashdot on this a few weeks ago.

29 of 326 comments

  1. Other filters by dtfinch · · Score: 4, Informative

    I would have liked to see how my favorite Bayesian spam filter, K9, would have fared in your comparison, but it failed to meet your first requirement of being cross-platform. It's freeware written in C, is about a 60kb-100kb download, depending on whether you get it with the self installer, is easy to use, and has a very small memory footprint. Before today it had sorted my email with over 99.8% accuracy, excluding the first couple of days of training, and after only a couple of weeks of use, though now it's down to 99.7%.

    I have used PopFile in the past on both Windows and Linux, but found K9 to be better suited for environments where Windows is an option. It's very easy to use, having a windowed interface, and it seemed to learn much faster than PopFile did.

    I haven't used SpamBayes. I'll have to give it a shot.

  2. Spamprobe by 1029 · · Score: 5, Informative

    The article didn't mention SpamProbe. It is what I use, and it has worked quite well for the past month or so that I've been using it. Perhaps this is just because the author didn't test this spam filter yet, but I like it quite a lot with my current mutt/procmail setup. Take this for what it's worth.

    --
    - I love animals. I try to eat at least one a day.
  3. Re:Nitpick... by RatFink100 · · Score: 2, Informative

    It's a Babylon 5 reference.

  4. YFI list by usotsuki · · Score: 1, Informative
    1. E-mail contains HTML tags of any sort, except for <A>
    2. E-mail contains attachments (unless solicited; whitelist)
    3. With all non-alphanumeric characters removed, certain case-insensitive keyword matches can detect spam
    4. E-mail is a forward or looks like chainmail / Nigerian scam
    5. E-mail contains junk strings in subject or sender
    6. E-mail comes from you, but header doesn't match your send name
    7. E-mail is excessively large (>20K) and unsolicited (whitelist)
    8. E-mail headers and/or text contain Mojibake, if unsolicited (whitelist) - this will block anything in Chinese or Russian, for example
    9. Badly formed headers
    10. Address doesn't match reverse lookup
    If ANY of these apply, then, IMO, YOU FAIL IT!!

    I think this would be a perfect filter system, if it could be coded. I have a homemade POP3 client that I could stand to add some of this to, I guess...
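
    Some of the rules in the list above can be coded fairly directly. What follows is only a rough sketch using Python's standard `email` package, covering rules 1, 2 and 7. The names `fails_yfi`, `WHITELIST` and `MAX_SIZE` are made up for illustration, the whitelist entry is a placeholder, and the HTML check is a crude regular expression rather than a real parser.

```python
import re
from email import message_from_string
from email.utils import parseaddr

WHITELIST = {"friend@example.org"}  # hypothetical: senders whose mail counts as solicited
MAX_SIZE = 20 * 1024                # rule 7: "excessively large (>20K)"

# Rule 1: any HTML-looking tag except <A ...> and </A> (a heuristic, not a parser).
HTML_TAG = re.compile(r"</?(?!a\b)[a-z][a-z0-9]*(?:\s[^>]*)?/?>", re.IGNORECASE)

def fails_yfi(raw):
    """Return the numbers of the rules (1, 2, 7 only) that a raw message fails."""
    msg = message_from_string(raw)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    solicited = sender in WHITELIST

    text_parts = []
    has_attachment = False
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_filename() is not None:
            has_attachment = True
        elif part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            text_parts.append(payload.decode("latin-1"))

    failed = []
    if HTML_TAG.search("\n".join(text_parts)):
        failed.append(1)
    if has_attachment and not solicited:
        failed.append(2)
    # Size is measured in characters of the raw text, an approximation of bytes.
    if len(raw) > MAX_SIZE and not solicited:
        failed.append(7)
    return failed
```

    An empty list means the message passed these three checks; the remaining rules (reverse lookup, Mojibake, keyword matching) would need DNS access and more careful text analysis than this sketch attempts.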

    -uso.
    --
    Dreams, dreams, don't doubt dreams, dreaming children's dreaming dreams. Sailor Moon SS
    1. Re:YFI list by Oddly_Drac · · Score: 2, Informative

      "Address doesn't match reverse lookup"

      You'd be surprised how many DNS servers are completely misconfigured for this, but I think that a simple ping to the address given could actually show if it _existed_.

      Personally I've found that I can reduce my spam by a huge amount by never viewing HTML...which brings a thought about tracking and tracing the webbugs in any given piece of HTML email...

      --
      Oddly Draconis
      Too cynical to live, too stubborn to die.
  5. Re:And the winner is... by Gaza · · Score: 3, Informative

    SpamBayes has a very well done pop3 proxy that will work with ANY pop3 mail client, including Eudora. There is also an IMAP filter for those that like IMAP and for those procmail fans it also has an app called hammiefilter which is a command line version of the SpamBayes tools.

    SpamBayes also has a very well done and integrated Outlook plugin which leads to the common misconception that SpamBayes will only work with Outlook.

    Also note the review mentioned that both SpamBayes and POPFile work on multiple platforms and he is reviewing the POP3 proxy on both of them, not their counterpart Outlook plugins.

  6. Re:You really just don't get it by Plug · · Score: 4, Informative

    I don't disagree. I think that eventually we should move to a better email model - something like TMDA perhaps, where there is no guarantee that spammers can reach mailboxes. Or better legislation to make spamming punishable, controls on mail routers on million message mailouts, etc. Or djb's Internet Mail 2000, which moves the onus onto the sender's network to store all 1m messages at a time, until people pick them up.

    The other thing you can do is impose a microcost for mailing - at 1c/mail, spamming isn't economical any more. But then that is going to penalise the people who have legitimate reasons to send a million emails at a time - you'd have to have a very good micropayment system working on the Internet to do this.

    However, those things need widespread change, and they need people in positions of power. Joe User at home can push for it, but they still get spam and they still want a short term solution. I suggest that even if they're filtering, the action of having to check their spam filter will make them irate enough. I see it as being like IPV6 - everyone would really have to change at once for the system to be most effective. (I use Freenet6, do you?)

    Now that viruses are public, caught quickly, and Microsoft are being a lot less lax with security (I am in no way commending their effort, but they at least mostly fixed the Outlooks), you don't see people writing them nearly as often. I feel spam will get the same.

  7. Re:great by advocate_one · · Score: 2, Informative
    No ads, quality programming, small fee.

    No ads??? No, it's stuffed to the brim with promos for their own stuff though... (Gardening magazine, History magazine, Nature magazine, Radio Times, Teletubbies toys, Fimbles stuff, trailers for upcoming programmes and series)


    Quality programming??? it's gone really downmarket in the last few years..


    Small fee??? That fee is your license for receiving _all_ television programs, even cable and satellite... not just the BBC. Although that license money goes to the BBC, really a goodly share of it should go to the other service providers as well.

    --
    Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
  8. Re:great by Zog+The+Undeniable · · Score: 1, Informative
    Well, I rarely find anything watchable on BBC1 or BBC2 these days (too many soaps, trashy sitcoms and repeats), and the licence fee, while it's cheap compared to Murdoch's Sky subscription, is a tax on watching TV, not an optional payment. Even if you only watch satellite or cable channels you have to pay the BBC.

    You used to get a free satellite viewing card for your licence fee giving access to all the "terrestrial" public channels on satellite, which was great if you had a spare decoder and crappy terrestrial reception like where I live. To save a few quid, the BBC no longer fund these cards and have gone unencrypted, which means I've lost the other terrestrial channels upstairs. Thanks guys.

    --
    When I am king, you will be first against the wall.
  9. Re:Filtering by gfody · · Score: 2, Informative

    you might find this site particularly useful. it will let you set up a temporary address based on a naming convention that forwards to your real address but expires after a few emails. you can set up something like rusxxxxx@asdf.com where xxxxx is whatever you want and it will fwd to your real address so if the badguys get your email it's no big deal, the temp addy will just stop working.

    --

    bite my glorious golden ass.
  10. Re:Nitpick... by Anonymous Coward · · Score: 1, Informative

    What the hell is wrong with using "last" in that context? What did you do last week? Whatever it was, you sure as hell didn't make it to this week, given your narrow definition. Summary: the adjective "last" is perfectly acceptable as "most recent", see a dictionary. "such a person sure as all hell shouldn't be given an audience on /." ... stfu. (... and "CS technology"?? wtf.)

  11. In related news by heli0 · · Score: 3, Informative

    If you have ever signed up with the Direct Marketing Association's Mail Preference Service (list of people not to send junk mail to), but continue to receive stacks of crap every day, here is what you can do about it: Prohibitory Order

    Links to pdf's you need to print and mail in included.

    "A little-known Federal law allows individuals to send a Prohibitory Order against companies that are sending unsolicited sexually provocative or erotically arousing mail. The Supreme Court went one step further, allowing individuals to decide what constitutes "erotically arousing" mail. The law makes it illegal for a company to send mail to an individual within thirty days of receiving the Order."

    "Postmasters may not refuse to accept a Form 1500 because the advertisement in question does not appear to be sexually oriented. Only the addressee may make that determination."

    --
    Whenever the offence inspires less horror than the punishment, the rigour of penal law is obliged to give way...
  12. Re:Mozilla - filters on client not server by pe1chl · · Score: 3, Informative

    It would be nice if there was filtering done on the server. Then you would not need the packages that are reviewed here.

    However, that means a change to the server, and a change to the POP3 protocol. The ISP would have to install a filtering plugin or a modified version of the server, and the client would subscribe to this service and train it (every client would have his own dictionary). With the first few messages there would be some special POP3 report back to the server indicating that you consider it spam, and from then on the server would filter on its own.

    However, that would be difficult/impractical to roll out, so you will have to live with clientside filtering like in Mozilla.

  13. Re:"Bayesian" by file-exists-p · · Score: 5, Informative

    As far as I know, many of those filters are based on a decision rule of the form

    P(mail is spam | words X, Y, Z, ... are in it) > 1-epsilon

    The computation is then done using Bayes' rule (P(A|B)=P(B|A)*P(A)/P(B)) under certain independence assumptions which make it tractable.

    So this is actually bayesian filtering ...
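
    The decision rule above can be sketched in a few lines of Python. This is only an illustration, not the code of any particular filter: the count dictionaries, the add-one smoothing and the default `epsilon` are all assumptions, and words are treated as independent given the class so that P(words | class) becomes a product.

```python
import math

def spam_probability(words, spam_counts, ham_counts, n_spam, n_ham):
    """P(spam | words) via Bayes' rule, with words independent given the class.

    spam_counts / ham_counts map a word to the number of training messages
    of that class containing it; n_spam / n_ham are the class sizes.
    """
    # Priors P(spam) and P(ham), estimated from the training set sizes.
    log_spam = math.log(n_spam / (n_spam + n_ham))
    log_ham = math.log(n_ham / (n_spam + n_ham))
    for w in set(words):
        # P(word | class), add-one smoothed so unseen words never give zero.
        log_spam += math.log((spam_counts.get(w, 0) + 1) / (n_spam + 2))
        log_ham += math.log((ham_counts.get(w, 0) + 1) / (n_ham + 2))
    # Dividing by P(words) is the same as normalising over the two classes.
    m = max(log_spam, log_ham)
    s, h = math.exp(log_spam - m), math.exp(log_ham - m)
    return s / (s + h)

def is_spam(words, spam_counts, ham_counts, n_spam, n_ham, epsilon=0.1):
    """Apply the rule P(mail is spam | words) > 1 - epsilon."""
    p = spam_probability(words, spam_counts, ham_counts, n_spam, n_ham)
    return p > 1 - epsilon
```

    Working in log space avoids underflow when a message contains many words; the smaller `epsilon` is, the fewer false positives and the more spam slips through.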

    My favorite filter is spamoracle

  14. Re:great by Zog+The+Undeniable · · Score: 2, Informative

    Yahoo uses captchas to prevent scripted sign-ups, so if you get anything from a Yahoo mail account, there was once a human (OK, a subhuman) at the other end.

    --
    When I am king, you will be first against the wall.
  15. Re:I changed my mind. Simpler is better. by The+Grassy+Knoll · · Score: 2, Informative

    > There's a commercial solution using this system right now, although the URL escapes me

    Spam Arrest?

    --
    They will never know the simple pleasure of a monkey knife fight
  16. Re:Only useful to a point by spongman · · Score: 3, Informative

    I've been using SpamBayes for about 9 months now and I've never had any problem with this 'new kind of spam' you mention. I just don't see it. I don't have to do anything, write any rules, configure anything, it just gets junked. I've never once had any false positives either. I get about 30 spams/day, and out of the 8,200+ spams I have in my spambox, fewer than 100 are categorized as having less than 90% probability of being spam.

  17. Re:Nitpick... by AndroidCat · · Score: 3, Informative
    The "last, best hope" was used by Lincoln in the American civil war, "We shall nobly save, or meanly lose, the last best hope of earth."

    It's quite possible that it goes back further to a version of the Bible or Shakespeare. (Always the two to bet on when finding the source of a phrase in one fell swoop.)

    --
    One line blog. I hear that they're called Twitters now.
  18. Re:great by Anonymous Coward · · Score: 1, Informative

    hey - it pays for the radio too guys.. Just listen to Radios 4 and 3 if you want a bit of quality.

  19. Re:Nitpick... by spongman · · Score: 4, Informative
    Here's a bit from the excellent SpamBayes background page:
    A remarkable property of chi-combining is that people have generally been sympathetic to its "Unsure" ratings: people usually agree that messages classed Unsure really are hard to categorize. For example, commercial HTML email from a company you do business with is quite likely to score as Unsure the first time the system sees such a message from a particular company. Spam and commercial email both use the language and devices of advertising heavily, so it's hard to tell them apart. Training quickly teaches the system all sorts of things about the commercial email you want, though, ranging from which company sent it and how they addressed you, to the kinds of products and services it's offering.
  20. Re:A new poll is required by Cato · · Score: 3, Informative

    See http://death2spam.net - this is a commercial mailbox service that appears to have really good bayesian-style spam filtering (referenced by Paul Graham in a recent article) - they even fetch URLs in some messages to filter based on website content. They don't require individuals to train on their own messages, which may be controversial but also makes it feasible to deploy this at large scale in ISPs.

    Without major ISP deployments, the response rates to spam will not go down, since the clued-up individuals who deploy filtering themselves would never have responded to spam anyway.

    Your RF analogy is interesting but it breaks down for people with wireless mobile phone links, dialup when travelling, and so on. The best thing is to make spam unprofitable so it goes away.

  21. SpamPal by UpnAtom · · Score: 3, Informative

    I did my own investigation of spam filters about a week ago. I didn't test the actual algorithms, just the features.
    SpamPal with the add-on Bayesian filter (search Google for it) came out top. It works as a proxy and also provides blacklist/whitelist/known Spammer list checking.

  22. Re:POPFile is more than just a spam tool by BradleyUffner · · Score: 2, Informative

    I agree, I just discovered POPFile last week when it was shown on BBSpot. I use an Exchange plugin called Outcast that allows POPFile to work over Exchange also. I have several buckets set up to help sort incoming email into the correct folder for different projects and it works fantastically. I've only been training it for about 3 days and it already sorts with almost perfect accuracy.

    POPFile, and Outcast rock.

  23. Re:POPFile is more than just a spam tool by topham · · Score: 2, Informative

    I installed POPFile on my parents' computers; I was worried because I thought the interface (web interface) would be confusing to them, since you couldn't do everything within the email client itself.

    Works great. My father, who gets far more spam than the average person (why I don't know) has virtually 100% success rate.

  24. It's virtually impossible to not get spam? by setien · · Score: 5, Informative

    No it's not.
    I get spam at the rate of 1 spam mail per 6 months or so. Or maybe even less. I can't remember getting a single spam email on my actual email address for about a year.

    If you have an account on a crapless domain (i.e. not hotmail.com, msn.com, aol.com and the likes),
    it all comes down to this very simple rule:
    Do not, under any circumstance, have your email address posted publicly accessible ANYWHERE on the web.
    It WILL get trawled. And then it will be spammed relentlessly.

    If you have an existing address you don't want to give up, or an address at hotmail.com or a similar place, dump it.
    Then exercise a bit of common sense about where you use your actual address.

    I have a domain which catches email to unknown addresses and put them in my regular mailbox.
    Whenever I have to give an email address to some place on the web, I use *domain-i-am-currently-visiting*@mydomain.com. So if I am visiting foobar.com, I would put in foobar.com@mydomain.com.
    I have been doing this for years. It enables me to see what was the source of the leak when I get spam on one of the addresses.
    It has taught me one thing: I have never, ever, ever, in all my years of online shopping, forum posting etc, come across a single website that has ignored its own privacy statement. Ever. Even the slightly sketchy sites (like divx subtitle sites) don't leak addresses.
    I was surprised to realize this.
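
    The per-site address scheme described above is easy to automate. A small sketch in Python, assuming a catch-all domain called `mydomain.com` as in the example; the function names `tagged_address` and `leak_source` are made up for illustration.

```python
def tagged_address(site, my_domain="mydomain.com"):
    """Address to hand out when signing up at `site`."""
    return f"{site.lower()}@{my_domain}"

def leak_source(recipient, my_domain="mydomain.com"):
    """Given the address a spam arrived on, return the site it was given to.

    Returns None when the address is not one of the catch-all addresses.
    """
    local, _, domain = recipient.rpartition("@")
    if domain.lower() != my_domain or not local:
        return None
    return local.lower()
```

    For example, a spam delivered to `foobar.com@mydomain.com` points back at foobar.com, and that single address can then be blocked without touching the others.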

    The only addresses I ever get spam on are the ones I know to be publicly displayed on the web.

    So it's that easy to avoid spam.

    --
    Give me liberty or give me kill -s 9
  25. Re:Mozilla - filters on client not server by HermanAB · · Score: 2, Informative

    I run SpamProbe on the server. For any given business, everybody will receive pretty much the same sort of mail. So a single database works like a charm, with typically 99.5% accuracy and zero false positives. This works because SpamProbe also counts word pairs, something that no other word-counting filter does. To cope with the enormous increase in computational load, it uses Berkeley DB as a backend. For corrections, I create a user called spam. Corrections can then be forwarded to this user, to reverse the database entry for that message.
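
    To show what counting word pairs means in practice, here is a rough Python sketch of a tokenizer that emits both single words and adjacent pairs. It is not SpamProbe's actual tokenizer; the function name `tokens_with_pairs` and the character set treated as part of a word are assumptions.

```python
import re

def tokens_with_pairs(text):
    """Return single-word tokens followed by adjacent word-pair tokens."""
    words = re.findall(r"[a-z0-9$'-]+", text.lower())
    # Each pair joins a word with the one that follows it.
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs
```

    A message of n words yields n single tokens plus n-1 pairs, and the number of distinct pairs across a mailbox grows far faster than the vocabulary, which is why the token database gets so much larger.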

    --
    Oh well, what the hell...
  26. Knowspam by KermitAndLadyHoliday · · Score: 1, Informative

    No one appears to have mentioned Knowspam yet. 100% spam blocking. No filters. Just a simple "prove you're human" auto-reply sent to the sender and a "friends" list. http://knowspam.net/

  27. MIMEDefang + SpamAssassin + Razor by wytcld · · Score: 3, Informative

    SpamAssassin has Bayesian learning, which I have running but not for long enough to test. I recently set up MIMEDefang as a Sendmail milter calling SpamAssassin (which calls Razor). This setup allows Sendmail to reject e-mail beyond an arbitrary SpamAssassin score. The remote mail daemon is informed the mail cannot be delivered.

    Setting that score at 8 has resulted in no false positives over a week (I log From and Subject information - it's all obvious spam). Then stuff that scores between 5 and 8 I divert to a separate mail box, which I comb through every day or two. There have been two false positives that ended up in that over the week. This is with hundreds of e-mails for a half-dozen users coming in a day. I also end up, with this setup, with 2-4 spams making it through to my own mailbox (the busiest on the system). These are, because of the filtering, the least obnoxious, and easily enough reported to Razor to spare others. Meanwhile, I like to keep a window open to the mail server running "tail -f mail.info | grep REJECT" and watch a dozen or so attempted spams an hour refused acceptance with a message like "554 5.7.1 SpamAssassin score of 15, rejected" back to the origin, which is enough that if it wasn't spam any good mail daemon will inform the sender, and they can find another way to get through.

    Even if this gives spammers a clue about ducking SpamAssassin, the spams that can get by it are by far the least obnoxious. I look forward to seeing if the Bayesian feature helps (it feeds itself anything it scores at over 15 by default). But it's a pretty good system short of that. If it became standard for ISPs to reject all mail with a SpamAssassin score of 8 or higher, the loss of legitimate communications would be exceedingly rare, and politeness standards would be encouraged.
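
    The two-threshold policy described above boils down to a small routing decision. The sketch below is plain Python for illustration only, not MIMEDefang filter code; the names `route`, `REJECT_AT` and `QUARANTINE_AT` are made up, and treating a score of exactly 8 as a reject is an assumption.

```python
REJECT_AT = 8.0      # refuse at SMTP time at or above this score
QUARANTINE_AT = 5.0  # divert to a separate mailbox at or above this score

def route(score):
    """Decide what to do with a message given its SpamAssassin score."""
    if score >= REJECT_AT:
        # The sending mail daemon sees this as a permanent failure.
        return f"554 5.7.1 SpamAssassin score of {score:g}, rejected"
    if score >= QUARANTINE_AT:
        return "quarantine"
    return "deliver"
```

    Rejecting during the SMTP conversation, rather than accepting and silently dropping, is what lets a legitimate sender find out that the message did not arrive.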

    --
    "with their freedom lost all virtue lose" - Milton
  28. Re:Nitpick... by tim_one · · Score: 2, Informative

    The way spambayes estimates the probability that a msg is spam given that it contains a specific word is thoroughly Bayesian, as described on Gary Robinson's web page, and in his March "Linux Journal" article.

    The way spambayes combines probabilities ("chi-squared combining") is indeed not Bayesian at all. The probability combining scheme Paul Graham suggested isn't correctly Bayesian either, unless you assume the universe consists of equal numbers of ham and spam messages (so that the prior probability of spam is 0.5).
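
    The point about the prior can be checked numerically. The sketch below is an illustration, not code from spambayes or from Paul Graham's article: it assumes each per-word value is P(spam | word) computed under a 50/50 prior, and that words are independent given the class. The function names `combine` and `graham` are made up.

```python
import math

def combine(probs, prior=0.5):
    """Naive-Bayes combination of per-word spam probabilities with a prior.

    Each p must lie strictly between 0 and 1.
    """
    # Start from the prior odds of spam, then multiply in one likelihood
    # ratio p / (1 - p) per word (done here as a sum of logs).
    log_odds = math.log(prior / (1 - prior))
    for p in probs:
        log_odds += math.log(p / (1 - p))
    return 1 / (1 + math.exp(-log_odds))

def graham(probs):
    """prod(p) / (prod(p) + prod(1 - p)), the combining formula without a prior."""
    prod_p = math.prod(probs)
    prod_q = math.prod(1 - p for p in probs)
    return prod_p / (prod_p + prod_q)
```

    With `prior=0.5` the prior odds are 1 and `combine` gives the same number as `graham`; with any other prior the two differ, which is the sense in which the simpler formula silently assumes equal amounts of ham and spam.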