Slashdot Mirror


Sorting the Spam from the Ham

MrClever writes "The Sydney Morning Herald (Aust) is running an article about the merits of Bayesian filtering and a good plain-English description of how it works. Might be handy if you need to explain it to non-technophiles. The main thing that may be useful is a Bayesian spam filter written to drop straight into Outlook 2k/XP available here and written in Python by Mark Hammond." Math buffs might enjoy reading these pages or browsing this writeup and its many links.

29 of 249 comments

  1. Spambayes by Chromodromic · · Score: 5, Informative

    I use Spambayes with Outlook 2000, and it takes a little tweaking, but it works as advertised. Ahhh, the magic of mathematics. Just now, brought up Outlook, checked my mail and three little messages offering a free Sony headset, 70% off cell accessories, and a chance to take an IQ test just got tossed into my spam folder. Thanks anyway, but I think that means I just passed my IQ test.

    Every so often I go in and take out some old, old spam, just to make sure my current preferences are being represented and that's all the maintenance that's required.

    This is, however, the second time I've trained the filter. The first time, it incorrectly identified my FreeBSD status mails as spam, and from then on was throwing those into the Spam folder. My own fault, though, since I hadn't included any of these messages in my representative ham.

    If you run Outlook, download this filter and use it. You'll be doing yourself, and a world that doesn't need fat-injected, herbally enhanced penises, a favor.

    --
    Chr0m0Dr0m!C
    1. Re:Spambayes by Anonymous Coward · · Score: 1, Informative

      And for those of you that use OE (or any other mail client), get Mailwasher. The one account version is free. If you have multiple POP accounts, you have to pay a small amount to get the "full" version. I've been using it for at least three months and it has labeled every spam mail accordingly and in very few instances does a legitimate mail get tagged. You should also download the extra filters from here. MW just works.

  2. Written by more than hammond by adamhupp · · Score: 4, Informative
    The Outlook plugin may have been written by Mark Hammond but spambayes is very much a group effort. The project can be found at spambayes.sf.net.

    I've been using spambayes for months now and it really is quite amazing. Now, when I get the occasional spam in my mailbox it's actually interesting because I want to figure out why it made it in. The number of false positives is almost nil, and the ones that do get hit are spammy-looking autogenerated receipts from purchases I've made. It's made reading email a much more enjoyable activity.

    -Adam

  3. Better Bayesian Filtering by Anonymous Coward · · Score: 2, Informative

    The first discovery I'd like to present here is an algorithm for lazy evaluation of research papers. Just write whatever you want and don't cite any previous work, and indignant readers will send you references to all the papers you should have cited. I discovered this algorithm after "A Plan for Spam" [1] was on Slashdot.

    Spam filtering is a subset of text classification, which is a well established field, but the first papers about Bayesian spam filtering per se seem to have been two given at the same conference in 1998, one by Pantel and Lin [2], and another by a group from Microsoft Research [3].

    When I heard about this work I was a bit surprised. If people had been onto Bayesian filtering four years ago, why wasn't everyone using it? When I read the papers I found out why. Pantel and Lin's filter was the more effective of the two, but it only caught 92% of spam, with 1.16% false positives.

    When I tried writing a Bayesian spam filter, it caught 99.5% of spam with less than .03% false positives [4]. It's always alarming when two people trying the same experiment get widely divergent results. It's especially alarming here because those two sets of numbers might yield opposite conclusions. Different users have different requirements, but I think for many people a filtering rate of 92% with 1.16% false positives means that filtering is not an acceptable solution, whereas 99.5% with less than .03% false positives means that it is.

    So why did we get such different numbers? I haven't tried to reproduce Pantel and Lin's results, but from reading the paper I see five things that probably account for the difference.

    One is simply that they trained their filter on very little data: 160 spam and 466 nonspam mails. Filter performance should still be climbing with data sets that small. So their numbers may not even be an accurate measure of the performance of their algorithm, let alone of Bayesian spam filtering in general.

    But I think the most important difference is probably that they ignored message headers. To anyone who has worked on spam filters, this will seem a perverse decision. And yet in the very first filters I tried writing, I ignored the headers too. Why? Because I wanted to keep the problem neat. I didn't know much about mail headers then, and they seemed to me full of random stuff. There is a lesson here for filter writers: don't ignore data. You'd think this lesson would be too obvious to mention, but I've had to learn it several times.

    Third, Pantel and Lin stemmed the tokens, meaning they reduced e.g. both "mailing" and "mailed" to the root "mail". They may have felt they were forced to do this by the small size of their corpus, but if so this is a kind of premature optimization.

    Fourth, they calculated probabilities differently. They used all the tokens, whereas I only use the 15 most significant. If you use all the tokens you'll tend to miss longer spams, the type where someone tells you their life story up to the point where they got rich from some multilevel marketing scheme. And such an algorithm would be easy for spammers to spoof: just add a big chunk of random text to counterbalance the spam terms.
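    The effect of limiting the calculation to the most significant tokens can be sketched as follows. This is only an illustration: it assumes "significant" means farthest from a neutral 0.5, and it combines per-token probabilities with the naive Bayes product rule, which the comment above does not spell out. The function names, the clamping bounds, and the sample numbers are all made up for the example.

```python
from math import prod


def clamp(p, low=0.01, high=0.99):
    """Keep a probability away from 0 and 1 so the products never collapse."""
    return max(low, min(high, p))


def combined_spam_probability(token_probs, limit=15):
    """Combine per-token spam probabilities using only the `limit` tokens
    whose probability lies farthest from the neutral value 0.5."""
    ranked = sorted(
        (clamp(p) for p in token_probs),
        key=lambda p: abs(p - 0.5),
        reverse=True,
    )
    chosen = ranked[:limit]
    spam = prod(chosen)                  # evidence that the mail is spam
    ham = prod(1.0 - p for p in chosen)  # evidence that the mail is ham
    return spam / (spam + ham)


# Hypothetical long spam: five strongly spammy tokens buried in
# two hundred mildly innocent-looking filler tokens.
long_spam = [0.99] * 5 + [0.4] * 200

top_15_score = combined_spam_probability(long_spam, limit=15)
all_tokens_score = combined_spam_probability(long_spam, limit=len(long_spam))
```

    With these made-up numbers the 15-token score comes out close to 1, while the all-token score is dragged down close to 0 by the filler, which is the "long spam" failure described above.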

    Finally, they didn't bias against false positives. I think any spam filtering algorithm ought to have a convenient knob you can twist to decrease the false positive rate at the expense of the filtering rate. I do this by counting the occurrences of tokens in the nonspam corpus double.
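    A minimal sketch of that knob, assuming the per-token probability is computed from relative frequencies in the two corpora. Only the idea of counting nonspam occurrences double comes from the text above; the function name, the neutral value for unseen tokens, and the clamping bounds are assumptions for illustration.

```python
def token_spam_probability(spam_count, ham_count, n_spam, n_ham, ham_weight=2.0):
    """Estimate how spammy one token is.

    spam_count / ham_count: occurrences of the token in each corpus.
    n_spam / n_ham: number of messages in each corpus.
    ham_weight: the bias knob; 2.0 counts nonspam occurrences double.
    """
    spam_freq = min(1.0, spam_count / n_spam)
    ham_freq = min(1.0, ham_weight * ham_count / n_ham)
    if spam_freq + ham_freq == 0:
        return 0.5  # never-seen token: treated as neutral (an assumption)
    p = spam_freq / (spam_freq + ham_freq)
    return max(0.01, min(0.99, p))


# The same token, equally common in both corpora, with and without the bias.
unbiased = token_spam_probability(10, 10, 100, 100, ham_weight=1.0)
biased = token_spam_probability(10, 10, 100, 100, ham_weight=2.0)
```

    Raising `ham_weight` pushes every token that also appears in good mail toward the innocent side, trading some filtering rate for fewer false positives.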

    I don't think it's a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure. Spam filtering is not just classification, because false positives are so much worse than false negatives that you should treat them as a different kind of error. And the source of error is not just random variation, but a live hum

  4. Eudora users... by Control-Z · · Score: 2, Informative


    Eudora 6.0 beta has spam filtering which seems to be Bayesian. It's a little slower to learn than PopFile, but it's pretty good so far, and of course integrated with the Eudora UI.

    http://eudora.com/betas

  5. Re:What I want by franimal · · Score: 3, Informative

    Personally, I really like Spambayes and Procmail for use with my IMAP server. It's easy to set up for each user and they can train their own SPAM database. You can even run the training script as a cron job, so the users only need to shuffle unknowns to the spam folder. Works well, because users never even have to see the spam if they don't want to.

  6. Mozilla Mail by respite · · Score: 3, Informative

    In case anyone hasn't tried it yet, the Bayesian filters in the mail client of the Mozilla suite are really impressive. They have worked close to flawlessly for me.

  7. Re:SpamAssassin works for me (even on Exchange) by IthnkImParanoid · · Score: 2, Informative

    SpamAssassin is nice, but it's nowhere near the 99% elimination claim in the article (a vaporous claim in the article? The hell you say!)

    SpamAssassin, set at 5 (after I got a false positive at 4), stops about 75-80% of spam, but with some more rules from me (how did SpamAssassin let 'huge c-cks' get through?!) it stops closer to 90%.

    The only solution I've tried that worked well has been white lists, but that only works so well because I don't make a lot of new friends :)

    --
    It's nothing but crumpled porno and Ayn Rand.
  8. Fight Spam with SpamProbe by steveha · · Score: 2, Informative

    I wrote an article on how to set up SpamProbe on a server, and make it easy to train. You could also use Bogofilter or any other trainable spam filter, set up the same way.

    I get at least 100 spam messages a day now, and I only see about a half-dozen or so. SpamProbe deals with the rest, and I don't have any problems with false positives. (SpamAssassin thinks that ads for LinuxWorld Expo are spam, but as I have it trained, SpamProbe doesn't.)

    steveha

    --
    lf(1): it's like ls(1) but sorts filenames by extension, tersely
  9. Popfile by isn't+my+name · · Score: 2, Informative

    I use PopFile. What I like about it is that it easily lets me use multiple personalities in Eudora, Outlook or any other mail client. Nice web based interface and a very active development community.

    You can run it locally on Windows or Linux. But, you can also set it up on a server and then use it to filter e-mail from multiple client machines. That's what I like about it. I have a home machine in my basement office but also upstairs in the TV room. Unlike plug-ins that only work locally, I can have my reclassification decisions apply to multiple client machines.

    Right now, they do not have multiple user capabilities so that my wife and I can both use the same instance and not have our classifications interfere with each other. However, you can set up multiple instances bound to different ports. The developers list multi-user capability as a priority.

    Worth checking out along with the other choices.

  10. Re:This is bad news!!! by joeflies · · Score: 2, Informative

    Beta testers tell me the next revision of the Outlook client contains a spam filtering function that works pretty well too. I do like the Mozilla 1.4 junk mail features, though; they work about as well as I could have hoped.

  11. Mozilla by Little+Dave · · Score: 2, Informative

    Having used the spam filtering built in to Mozilla for the last six months, I can testify to its effectiveness. In very little time at all, I'd trained it to send 95% of the filth to the spam directory and avoid doing the same for 95% of good mails. For me, not having to run a "middle man" piece of software was a real boon.

    However, my life isn't totally spam free, as I find that I become neurotic about those 5% false positives that get unhelpfully moved to the spam directory, so still end up having to sift through the grot every once in a while. On the plus side, I now have a solution to my tiny cock problem, I've arranged cheaper home insurance and I have the email address of several horny co-eds who I'm assured are hungry for man juice.

  12. Re:This is bad news!!! by aborchers · · Score: 2, Informative

    Er, wouldn't that first involve switching them to Linux? Come on, man, I have to take baby steps with people who need convincing to leave Outlook! :-)

    --
    Trouble making decisions? Just flip for it.
  13. Dirty Spammer Tricks by dprice · · Score: 2, Informative

    I have been using the Mozilla junk mail filter for a couple of months now. One pop mail account is one that I started using in 1996. It is a spam magnet. In the time I have been using Mozilla, it has accumulated over 12,000 spam messages. That should be plenty of training for the Bayesian filter.

    Mozilla's filter does a reasonably good job at catching spam, but I still get a handful of messages every day that slip through the filter. The ones that slip through seem to be messages that have intentionally munged the spammy words with spaces, numbers, and misspellings. The spammers know that people are filtering, and they are successfully getting through the filter with their dirty tricks. Another trick spammers use is to send a message with nothing but a graphic ad. The filter doesn't have enough words to judge the spam, so the message slips through.

    I also had some 'ham' messages get filtered, so I still have the annoyance of having to check the 'junk' folder periodically for wanted messages. The filtering makes life easier, but it is still not an ideal solution to the spam problem.

  14. Re:What you want by Anonymous Coward · · Score: 1, Informative

    You might take a look at Spam Sleuth Enterprise. I suspect it has what you want, since it has trainable Bayesian filtering (individual to each user), works with any e-mail server, has a web client interface, and a lot more that you may or may not be interested in.

  15. SpamBayes not Mark Hammond's work only by mpieters · · Score: 5, Informative
    SpamBayes was originally conceived by Tim Peters and co at Python Labs, who improved on the original algorithm considerably. From there on out, many people helped tune and perfect the implementation, making it the most effective Bayesian-based spam filtering tool currently available (IMNSHO).

    Mark Hammond then wrote the Outlook plugin, which, admittedly, is considerably more code than SpamBayes, but not SpamBayes itself.

    For the complete background on why SpamBayes is so good at what it does, and its history, see:

    Mark's is not the only application frontend for SpamBayes; here is a list of others:

    No apologies offered for my pedantry.
    --
    "The truth shall make ye fret" -- The Truth, Terry Pratchett
  16. Re:This is totally useless. by serbanp · · Score: 3, Informative
    No it's not.

    At work I have Outlook always running with the excellent FREE Bayesian filter Spammunition (www.upserve.com). I also check the mailbox from home over a dial-up connection.

    If I didn't use Spammunition, I would spend a lot of time downloading spam messages; as it is right now, I get just the ham (several messages instead of many).

    Serban

  17. Mail.app by Anonymous Coward · · Score: 1, Informative

    Isn't this similar to what is used in Apple's Mail.app for sorting junk mail?
    http://www.apple.com/macosx/jaguar/mail.html

  18. Re:Remote Images in spam... by zerocool^ · · Score: 3, Informative

    Does anyone know if any spam filters pick up on this pattern or lack of pattern (after all there are no words in the body usually.)

    Um, only read emails in plain text? Use mh.
    inc; scan; show last
    By the way, those images are baaaad. Usually they're something like img src="blahblah.jpg?userid=32898392" and then, when you open it, there's a log of the image with the userid 32898392 being fetched. Therefore, they know that your email address is valid. So, it's a good idea to filter out images anyway.

    But, come on. Email is a medium for transmitting text. It's not supposed to have flowery backgrounds, blinking text, and embedded images. Maybe I'm a purist? But it's another thing that used to be beautifully simple that the explosion of advertising on the internet has rendered unusable.

    --
    sig?
  19. Re:Spam filtering altogether by greed · · Score: 2, Informative

    I don't know spambayes, but bogofilter most definitely can operate in a "ranking" mode:

    • X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.12.2
    • X-Bogosity: Unsure, tests=bogofilter, spamicity=0.499150, version=0.12.2
    • X-Bogosity: Yes, tests=bogofilter, spamicity=0.969917, version=0.12.2

    Then you can header-match in your MUA all you want--or not. (I run it all through procmail, but that's because I want all the filtering done before it hits my IMAP server.)
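    As a rough sketch of that header matching outside an MUA, the snippet below pulls the verdict and spamicity out of an X-Bogosity header shaped like the three examples above. It assumes the header always follows that layout; the function names and the folder names are hypothetical.

```python
from email import message_from_string


def parse_bogosity(raw_message):
    """Return (verdict, spamicity) from the X-Bogosity header, or None
    if the message carries no such header."""
    header = message_from_string(raw_message).get("X-Bogosity")
    if header is None:
        return None
    # Layout assumed: "<verdict>, key=value, key=value, ..."
    verdict, *pairs = [part.strip() for part in header.split(",")]
    fields = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
    spamicity = fields.get("spamicity")
    return verdict, (float(spamicity) if spamicity is not None else None)


def choose_folder(raw_message):
    """Map the three verdicts onto made-up folder names."""
    parsed = parse_bogosity(raw_message)
    if parsed is None:
        return "INBOX"
    return {"Yes": "Spam", "Unsure": "Unsure"}.get(parsed[0], "INBOX")


sample = (
    "X-Bogosity: Unsure, tests=bogofilter, spamicity=0.499150, version=0.12.2\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
```

    Keeping the "Unsure" verdict in its own folder gives a small pile of borderline mail to hand-classify and feed back into training.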

  20. Re:Remote Images in spam... by rusty0101 · · Score: 2, Informative

    This is one of the reasons I have configured Evolution to not display remote images unless I request them. The other is that pulling remote images has the effect of verifying your e-mail address. (The server operator generates a couple million unique random numbers, creates a table of associations between e-mail addresses and the random numbers, and sends each e-mail address its random number as an img src=protocol://server/uniqueRandomNumber/image.php, which does a lookup on the uniqueRandomNumber and confirms your e-mail address. The spammer sells the list of confirmed e-mail addresses, and you get more spam.)

    Suggestion: if your e-mail client does not allow you to disable remote image retrieval, at the very least turn off preview panes. Better is to find a client that does allow you to disable remote image retrieval.
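    For anyone filtering on the server side instead, here is a minimal sketch of how an HTML body could be checked for remotely hosted images before it is displayed. It uses only Python's standard library; the class and function names and the example URL are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class RemoteImageFinder(HTMLParser):
    """Collect the src of every <img> that points at an http(s) server."""

    def __init__(self):
        super().__init__()
        self.remote_sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        # Images embedded in the message itself (cid:) or inline data
        # URLs do not contact a remote server, so they are skipped.
        if urlparse(src).scheme in ("http", "https"):
            self.remote_sources.append(src)


def find_remote_images(html_body):
    finder = RemoteImageFinder()
    finder.feed(html_body)
    return finder.remote_sources


html_body = (
    '<p>Great offer!</p>'
    '<img src="http://example.invalid/blahblah.jpg?userid=32898392">'
    '<img src="cid:logo">'
)
tracked = find_remote_images(html_body)
```

    Any non-empty result means that rendering the message would fetch something from a server the sender controls, which is exactly the address confirmation described above.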

    -Rusty

    --
    You never know...
  21. SpamNet by SunPin · · Score: 2, Informative

    I use spamnet by cloudmark. It catches everything. I can't remember the last time I had to click the "block" button. I'm very conscious of where my email ends up and I'm a hardcore advocate of email aliases. As a result, since September (last major crash), spamnet has blocked 4000 pieces while I've actively blocked only 11.

    That's pretty f'n good in my book. So good, in fact, that I send all blocked messages to the "Delete" folder instead of the default "spam" folder and set outlook to permanently delete on close.

    I have two concerns about this program:

    --Money. They are now charging and pretty much deserve it from the average user.
    --Reliability. This company could disappear tomorrow and sell off the server that has compiled spam data.

    Since mathematics isn't going anywhere, I'm leaning towards switching to an open source Bayesian alternative but, as mentioned above, all my spam gets thrown out the door on contact.

    What is the approximate training time of a Bayesian filter?

    --
    Laws are for people with no friends.
  22. The math by bpfinn · · Score: 2, Informative

    I think Tom Mitchell did a good job in explaining the math in his book Machine Learning. It's a very pricy book, so maybe you can look for a used copy.

  23. Re:What I want by leshert · · Score: 5, Informative

    Spamassassin learns in two ways:
    1. Manual training: there is a tool called 'sa-learn'. You can pipe a message to it, or point it to a mailbox, and specify whether the mail is spam or ham.
    2. Automatic training: if the score of the mail is significantly high (definitely spam) or significantly low (definitely ham), it will automatically train on the message. This may seem useless, but it lets SA start to figure out patterns in spam or ham that don't trigger its rules.

    I read mail with Mutt, and I've remapped the 'd'elete key to instead throw the message into a 'ham' mbox, and added a 'S'pam mapping to throw the message into a 'spam' mbox. Then I added a nightly cron job to run sa-learn over the two mboxes and truncate them. This has worked very, very well for me... I haven't had a single false positive since Bayes kicked in about two months ago, and today I got my first false negative in about two weeks. I typically trap 10-15 spams a day.
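    A nightly job along those lines might look like the sketch below. It assumes sa-learn is on the PATH and is given the --spam/--ham and --mbox options; the mbox locations and the function names are hypothetical, and this is not the actual script described above.

```python
import subprocess
from pathlib import Path


def sa_learn_command(kind, mbox_path):
    """Build the sa-learn invocation for one mbox ('spam' or 'ham')."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", f"--{kind}", "--mbox", str(mbox_path)]


def train_and_truncate(kind, mbox_path, run=subprocess.run):
    """Train on the mbox, then empty it. Returns False if there was
    nothing to learn from."""
    mbox = Path(mbox_path)
    if not mbox.exists() or mbox.stat().st_size == 0:
        return False
    run(sa_learn_command(kind, mbox), check=True)  # raises if sa-learn fails
    mbox.write_text("")  # truncate only after a successful run
    return True


def nightly(mail_dir):
    """Entry point for cron; mail_dir holds the 'spam' and 'ham' mboxes."""
    for kind in ("spam", "ham"):
        train_and_truncate(kind, Path(mail_dir) / kind)
```

    Calling something like `nightly(Path.home() / "Mail")` from cron would cover both mboxes. Note that a message saved between the training run and the truncation would be lost, so a real job may want to move the mbox aside first.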

    One thing to notice: even if you enable it, Bayesian filtering won't kick in until you've recognized at least 200 spam and 200 ham messages. Took me a long time to figure that out (I had plenty of spam, but I wasn't training it on ham at all, which is why I started remapping the mutt commands).

    As far as installing it on a server, your users don't have to be able to read each other's mail. I have it installed so that my wife and I each have our own bayes dbs, so neither of us has to read the other's mail. Plus, different users will regard different mail as spam: anything about the Pittsburgh Steelers going to my mailbox is probably spam, but not hers; similarly, anything regarding Linux going to her mailbox is probably spam, but not mine.

  24. Re:SpamAssassin works for me (even on Exchange) by vanyel · · Score: 4, Informative

    I run a small ISP with spamassassin installed, and I had to increase the default quota when I upgraded to the version with Bayesian filtering and its multi-megabyte databases per user. Combined with spamd bugs forcing me to switch back to running spamassassin individually and the fact that spamd still doesn't serialize processing, so the system still gets hammered by a flood of spam, I'm looking forward to greylisting to help take the load off spamassassin.

  25. Spammunition by Anonymous Coward · · Score: 1, Informative

    Spammunition is a great Outlook plugin that does this.

    Come on!! Give some credit!

  26. Outlook - turn off HTML mail by siamSam · · Score: 2, Informative

    Turn off html mail for Outlook and help keep them from validating your address through this method.

    Place these two keys in .reg files of their own and be able to quickly switch between viewing html and plain text mail. taah dahhh!

    [HKEY_CURRENT_USER\Software\Microsoft\Office\10.0\Outlook\Options\Mail]
    "ReadAsPlain"=dword:00000001

    OR to turn it back on and view those pretty pictures

    [HKEY_CURRENT_USER\Software\Microsoft\Office\10.0\Outlook\Options\Mail]
    "ReadAsPlain"=dword:00000000

  27. Re:What I want by JohnGrahamCumming · · Score: 2, Informative

    I agree with you and we are planning to get to that ASAP. There's some underlying work we need to do on performance first (that's planned for v0.20.0) and then we'll have the foundation for multiusers, pretty much as you describe. If anyone out there wants to write an IMAP module (subclass of Proxy::Proxy) then I'd be very happy to accept it. John.

  28. I use SpamAssassin with Hotmail by esanbock · · Score: 3, Informative

    1. Use Debian
    2. apt-get install spamassassin
    3. apt-get install hotway
    4. Add this to your /etc/inetd.conf: pop3 stream tcp nowait nobody /usr/sbin/tcpd /usr/bin/hotwayd
    5. Switch to Kmail
    6. Menu: Settings|Configure Filters
    7. Add first filter.
    a. Select Match Any of the following
    b. Select size 250000
    c. Filter action: PIPE THROUGH spamassassin
    8. Add second filter
    a. Select 'Match any of the following'
    b. Type 'X-Spam-Flag' (no quotes)
    c. Select equals. Type 'YES'
    d. Filter action: Move to folder [your spam folder]
    9. It's crucial that the second filter happens after the first (use the arrows to the left).

    There you have it - a spam-free Hotmail account. Not quite setup.exe, but this is Linux after all.
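    The two-filter chain in steps 7 to 9 can be sketched in a few lines, which also shows why the order matters: the X-Spam-Flag header only exists after the message has been through SpamAssassin. This assumes the size rule means "smaller than 250000 bytes"; the function and folder names are hypothetical, and `scan` stands in for the pipe through spamassassin.

```python
from email import message_from_bytes

SIZE_LIMIT = 250000  # bytes; an assumed reading of the "size 250000" rule


def first_filter(raw, scan):
    """Pipe the message through the scanner unless it is too large."""
    return scan(raw) if len(raw) < SIZE_LIMIT else raw


def second_filter(raw, spam_folder="spam", inbox="inbox"):
    """Route on the X-Spam-Flag header that the scanner may have added."""
    flag = message_from_bytes(raw).get("X-Spam-Flag", "")
    return spam_folder if flag.strip().upper() == "YES" else inbox


def deliver(raw, scan):
    """Run the filters in the required order and return the target folder."""
    return second_filter(first_filter(raw, scan))


def fake_scan(raw):
    """Stand-in scanner for the example: flags every message as spam."""
    return b"X-Spam-Flag: YES\r\n" + raw


message = b"Subject: cheap pills\r\n\r\nbuy now\r\n"
folder = deliver(message, fake_scan)
```

    Outside of Kmail, a real `scan` could be something like `subprocess.run(["spamassassin"], input=raw, capture_output=True).stdout`, since spamassassin reads a message on standard input and writes the tagged message to standard output.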