Slashdot Mirror


Bayesian Filter Testing?

pu33y asks: "Since the publication of Paul Graham's A Plan For Spam, several programs that perform Bayesian filtering have become available, including CRM114 and Bogofilter. But missing is any serious testing to see how they perform in relation to themselves and to other, non-Bayesian filters. Searching Google has turned up nothing, and when I asked Paul Graham, he was unaware of any such testing as well. Can anyone point to any such testing or provide the results of their own personal experiences with Bayesian filters?"

47 of 127 comments

  1. DSpam by jalet · · Score: 4, Interesting

    Dspam (http://www.networkdweebs.com) rocks !

    Some impressive stats were posted to the mailing list.

    Its main feature is that it's completely maintenance-free, and that even dumb people can use it (I know, I am).

    My personal stats so far: 2 false positives (one from PayPal, one from a company I work with), 280 spams learnt (I told it they were spam), 2877 spams caught and 4354 innocent messages.

    --
    Vote green: crap in the ballot box!
    1. Re:DSpam by sleeper0 · · Score: 1

      I am using SpamAssassin with Bayesian filtering turned on.

      My experience is that the Bayesian filtering is extremely effective, far better than any other spam filtering I had tried before and far better than SpamAssassin before Bayesian filtering was added.

      I was using SpamAssassin before Bayesian filtering was available, and I found that while it had been mostly effective, it was becoming less and less so even though I kept up with software upgrades. It was not uncommon for 5-10 spam mails to get through per day while it blocked 40-80 pieces of spam (with about 10 legit emails per day).

      Now that I use the Bayesian filtering in combination with SpamAssassin, I find that most days it catches 100% of my spam. I will get maybe 1-3 pieces of spam that aren't filtered per week, usually no more than one in a day (out of about 60-100 pieces of spam per day).

      I trained it with about 700 pieces of spam and about 700 pieces of legit email when I started. I could have started with much less. I now only train on errors, and it auto-trains itself on very high-scoring spam (over 10 on the SpamAssassin scale).

      It seems to me that the combination of these two types of spam filtering is more effective than either one individually. I often find spam that would have been treated as good if the Bayesian scoring weren't included, and I also often find spam that would have been treated as good if the SpamAssassin rules didn't augment the low Bayesian score some mail gets.

      Because SpamAssassin includes a report with individual scores for each rule, including the Bayesian score, you could analyze a batch of old mail for effectiveness. Last quarter I received about 7000 pieces of spam (kind of a guess, but I think that's right). A program could go through this old spam of mine, take the final SpamAssassin score of each message, and subtract the Bayesian modifier from it. While I haven't done this, I am confident that it would show at least a thousand messages that Bayesian filtering caught over and above what SpamAssassin alone would.
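
      The re-scoring idea above could be sketched roughly as follows. This is a minimal sketch, not SpamAssassin code: it assumes each stored message carries an X-Spam-Status header shaped like "Yes, hits=7.3 required=4.0 tests=BAYES_90,HTML_MESSAGE", and the per-rule values in BAYES_SCORES are placeholders (the two shown are the ones this poster reports using) that would have to match the local configuration. The function names are hypothetical.

```python
import re

# Placeholder per-rule scores (assumption): copy the real values from your own
# SpamAssassin configuration, since they differ between versions and setups.
BAYES_SCORES = {"BAYES_80": 3.9, "BAYES_90": 4.0}

# Assumed header shape: "Yes, hits=7.3 required=4.0 tests=RULE_A,RULE_B ..."
# (some versions write "score=" instead of "hits=").
STATUS_RE = re.compile(
    r"(?:hits|score)=(-?\d+(?:\.\d+)?)\s+"
    r"required=(-?\d+(?:\.\d+)?)\s+"
    r"tests=((?:[A-Z0-9_]+(?:,\s*)?)*)"
)


def split_bayes(status_header, bayes_scores=BAYES_SCORES):
    """Return (total, total_without_bayes, required), or None if unparseable."""
    match = STATUS_RE.search(status_header)
    if match is None:
        return None
    total = float(match.group(1))
    required = float(match.group(2))
    tests = [t.strip() for t in match.group(3).split(",") if t.strip()]
    bayes_part = sum(bayes_scores.get(t, 0.0) for t in tests)
    return total, total - bayes_part, required


def caught_only_by_bayes(status_header, bayes_scores=BAYES_SCORES):
    """True if the message is over the threshold only because of Bayes rules."""
    parsed = split_bayes(status_header, bayes_scores)
    if parsed is None:
        return False
    total, without_bayes, required = parsed
    return total >= required and without_bayes < required
```

      Counting the messages in an old spam folder for which caught_only_by_bayes() is true would give the "over and above" number guessed at here, at least to the extent that the headers were written with the same scores.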

      For the record, I have made a few modifications to the SpamAssassin scoring and filtering. I changed the spam threshold to 4 (instead of 5), auto-training to 10 (instead of 15), and the scores for bayes_90 to 4 (instead of 3) and bayes_80 to 3.9 (instead of 2.9). I've found this is much more effective, while the only mistagged good email I find is the occasional newsletter.

      hope this helps

  2. Serious testing?? by RayOfLight · · Score: 1

    Oh yeah, just check my mailbox!

    1. Re:Serious testing?? by palutke · · Score: 1

      I agree. I may not be able to publish the results of formal controlled testing, but the success POPFile has had filtering my spam speaks for itself.

      --
      'I ain't a liar, baby, and I ain't proud I just want what I'm not allowed.' -- Violent Femmes, 36-24-36
    2. Re:Serious testing?? by Blkdeath · · Score: 1
      Oh yeah, just check my mailbox!

      Absofrigginlutely.

      Mozilla Mail & News is watching over my mail, including upwards of a dozen mailing lists and works almost flawlessly. Especially good is the fact that I access my mail via IMAP from as many as six different Mozilla clients in various locations, and at this point they're all trained in my e-mail habits.

      It took longer for me to train it, due to the fact that I'd previously kept my address(es) close to my chest, so my SPAM intake was perhaps 2-3 messages/month at the most. Now, however, I average 4-5/week (not terrible, but annoying enough to warrant a filter) and Mozilla has only missed a small handful of them during training.

      Once you've gone Bayesian, you can never go back. Keyword filters, white/blacklists, DNSBLs; they're all ancient history. The future is now! {groan}

      --
      BD Phone Home!

      Shameless plug. Like you weren't expecting it.

    3. Re:Serious testing?? by laa · · Score: 1

      Abso-f**king-lutely! I get around 500 spam emails a week. I suppose it's not the world record, but it's enough to make my inbox unusable without filtering. SpamAssassin has so far had a hit ratio of about 99%, with no real mail being classified as spam. I don't know how "good" SpamAssassin's Bayesian filtering really is, but it's certainly good enough for me.

      --
      Why does the kernel go through stable and then unstable forks? Can't it always be a stable build, like with Windows?
  3. Online repository needed by sam+the+lurker · · Score: 5, Interesting

    Ideally, someone, probably an academic, should make a repository of spam available for testing. Software spam filters can say things like, "Correctly classified 99.9% of the email in the UCI spambase 1999-08-20 repository"

    Something like, say, the UCI Machine Learning Repository. In fact, look at the UCI spambase. There are a couple of problems with the UCI spambase: it is too old / out of date, and too small.

    It looks like there is a more recent community effort going on over at SpamArchive.

    Looks like you should have googled.

    1. Re:Online repository needed by pu33y · · Score: 1

      I did but not on those words. I was searching for Bayesian filter testing and after Paul told me he wasn't aware of any testing, I decided to ask here.

      --
      You are what you eat.
    2. Re:Online repository needed by cdh · · Score: 3, Insightful

      The problem with this is that spam for one person is not spam for another. That's the beauty of Bayes. If you are a proctologist, for example, you probably get a lot of legitimate email with the word penis in it. If you are a plastic surgeon, you may get legitimate email that discusses body part enlargement. There are hundreds of examples. The beauty of Bayes is that you can make it work for you and not be all encompassing.

      The SpamAssassin people have talked about this in the past. They have a corpus of spam that they use to test rules and people have asked to download it to seed their own Bayes, but the SA people don't want to do that (a good thing) as Bayes is a personal thing.

      What you are proposing will work for general spam checking, but not for Bayes, which is what the original poster asked about. In reality, it's hard to test Bayes in a general case. All I know is that it's worked wonders for me (using SA).
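
      To illustrate why the same word can score differently for different people, here is a minimal sketch of a per-token spam probability in the style of Paul Graham's A Plan for Spam (good counts doubled, result clamped to the range 0.01 to 0.99). The token counts are invented for illustration and the function name is hypothetical; real filters such as SpamAssassin use their own variants of this calculation.

```python
def token_spam_probability(good_count, bad_count, n_good_msgs, n_bad_msgs):
    """Graham-style estimate of P(spam | token) from one user's own mail."""
    g = 2 * good_count  # good counts are doubled to bias against false positives
    b = bad_count
    if g + b < 5:
        return None  # token is too rare to say anything useful
    bad_ratio = min(1.0, b / n_bad_msgs)
    good_ratio = min(1.0, g / n_good_msgs)
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


# Invented counts for the same token in two different mailboxes,
# each holding 1000 legitimate and 1000 spam messages.
typical_user = token_spam_probability(good_count=0, bad_count=200,
                                      n_good_msgs=1000, n_bad_msgs=1000)
medical_user = token_spam_probability(good_count=150, bad_count=200,
                                      n_good_msgs=1000, n_bad_msgs=1000)
```

      With these made-up counts the token sits at the 0.99 ceiling for the typical user but only around 0.4 for the medical user, which is the sense in which a shared training corpus cannot stand in for a personal one.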

    3. Re:Online repository needed by douglips · · Score: 1
      If you are a proctologist, for example, you probably get a lot of legitimate email with the word penis in it.

      Please tell me you meant urologist. I don't wanna see the proctologist who gets those kinds of emails.
    4. Re:Online repository needed by cdh · · Score: 1

      Uh, yeah. :)

      (Repeat after me, don't post while at work...)

    5. Re:Online repository needed by sam+the+lurker · · Score: 1
      What you are proposing will work for general spam checking, but not for Bayes, which is what the original poster asked about. In reality, it's hard to test Bayes in a general case.

      The original question was regarding testing to see how they perform in relation to themselves and to other, non-Bayesian filters. So while it is of course best for you to test all of the different spam filters with your own spam, that is not as practical as having each developer test their own spam filter against a common, known spam database. If the algorithm is "robust", then it should perform consistently well on lots of different, large training and testing databases.

      Actually, what I am talking about is the basic design and testing of statistical pattern recognition algorithms. Check out the seminal work on the subject: Fukunaga, Keinosuke. Introduction to Statistical Pattern Recognition. New York, Academic Press, 1972. And its revised edition: Introduction to Statistical Pattern Recognition (Computer Science and Scientific Computing Series) by Keinosuke Fukunaga. Or another classic: Pattern Classification (2nd Edition) by Richard O. Duda, Peter E. Hart, and David G. Stork.

      Maybe someday someone will take the ideas of David B. Fogel and apply them to spam filtering.
  4. Ella: OpenField Software by biodork · · Score: 2, Interesting

    I use Ella from OpenField Software. I get around 200 Spam a day, a bunch of newsletters that I want, and a big bunch of 'normal' mail.

    I have had it for about 2 weeks. In the last 3 days I have had 2 false positives (messages in Spam that shouldn't be there) and 4 that went to the newsletter folder that shouldn't have.

    --
    Gavin Fischer
  5. The good thing about these tools by FedeTXF · · Score: 3, Informative

    Spam controls in the Mozilla 1.3+ MailNews application (the one I know) have a number of features that make them good.
    1) Gives the user the idea that he can improve the situation by taking some concrete action. Controlling future spam does not depend on some guru releasing a better filter or on him hacking together better rules.
    2) By definition, works better and better the more spam you get (and mark as spam). Even poor tools will eventually detect spam, since it's obvious to anyone reading spam that those messages tend to repeat and to be similar.
    3) It's automagically customized to your own spam. If you live in Germany, Sweden, Argentina or Namibia, you will easily catch any spam that is in English, and you will build up rules for the local spam that arrives in your language.
    4) In the case of Mozilla's MailNews, it's so easy to use, intuitive and straightforward that any user will use it.
    5) Makes you feel spams are useful for something: detecting future spams.

    I think those advantages are far more important than the rate of effectiveness.

    1. Re:The good thing about these tools by amrust · · Score: 1
      5) Makes you feel spams are useful for something: detecting future spams.

      Man, I never thought I'd agree that spam is good for anything, but I do wholeheartedly agree. I actually enjoy watching it go through its paces, moving and marking mail as spam. Makes me feel as if I'm accomplishing something.

      I also understand I possibly need to get out of the house more.

      --
      VOTE!
  6. Ja rulez by 2TecTom · · Score: 1

    I'm not quite sure what the fuss is about. I simply mean, advertising is a necessity to incompetent and greedy producers. Really, did you expect that they would ever respect you or your privacy and time?

    Personally, my white list and non-Bayesian rules eliminate 99.9% of the crap and abuse. However, sooner or later, ja rulez try to sort out a known recipient, which is where the white list shines.

    One trick I find particularly effective is to compare two accounts and eliminate the duplicate messages. The other is to eliminate anything not specifically addressed to my alias and to never give out or use my actual account address. Ninety percent of the spam I get goes to an address I've never used.
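
    The duplicate-message trick could be sketched like this. It is only a sketch under assumptions: messages are given as raw strings, "duplicate" is taken to mean an identical body after whitespace and case normalization, and the function names are hypothetical.

```python
import email
import hashlib


def body_fingerprint(raw_message):
    """Hash the whitespace-normalized body so header differences are ignored."""
    msg = email.message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload()
        if isinstance(payload, str):
            parts.append(payload)
    normalized = " ".join(" ".join(parts).split()).lower()
    return hashlib.sha1(normalized.encode("utf-8", "replace")).hexdigest()


def duplicates_across_accounts(account_a, account_b):
    """Return the messages in account_a whose body also appears in account_b."""
    seen_in_b = {body_fingerprint(raw) for raw in account_b}
    return [raw for raw in account_a if body_fingerprint(raw) in seen_in_b]
```

    Spam that personalizes each copy would slip past an exact-body comparison like this one, so a real setup would probably need a fuzzier fingerprint.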

    The problem is, even with Bayesian techniques, there is no way to guarantee that only spam was sorted out. I highly suggest a white list, in addition to filters, as the only way of ensuring that at least known mail is always received.

    --
    Words to men, as air to birds.
    1. Re:Ja rulez by Blkdeath · · Score: 2, Informative
      The problem is, even with Bayesian techniques, there is no way to guarantee that only spam was sorted out. I highly suggest a white list, in addition to filters, as the only way of ensuring that at least known mail is always received.

      With Mozilla, you get the best of both worlds. You've got Bayesian filtering with an optional whitelist component. You can select any of your address books as the source of your whitelist (default is "Personal Addresses"), so any of your friends can send you all the SPAM they want without being caught. ;)

      Being optional, you can choose to disable it if, say, your friends' addresses have been harvested for "Joe Job" SPAM runs. (I know one or two of mine have.)

      I've actually used the whitelist to my advantage when I requested a sample of a particular new type of SPAM from a friend so I could watch for it and mark it if Mozilla missed it.

      Which brings me to the other big advantage of Mozilla/Bayesian; when SPAMmers adapt, so does it. New SPAM type? Click the trash can and it'll go away.

      Nothing can really be a perpetual 100% guarantee of blocking SPAM, but IME, Bayesian filters are the best possible solution we have right now, and that's why I emphatically recommend them to all my friends, family, and customers.

      --
      BD Phone Home!

      Shameless plug. Like you weren't expecting it.

  7. Spambayes!!!! by Arkham · · Score: 3, Informative
    I use spambayes. It's written in python and is amazingly accurate.

    I get about 150 spams a day, and about 5 hams. Spambayes might classify 1 spam as "unsure" and the rest as spam. The ham is always classified as ham.

    My corpus is about 5000 spams, about 1000 hams. Get spambayes -- it's open source and it really works great.

    --
    - Vincit qui patitur.
    1. Re:Spambayes!!!! by killmenow · · Score: 1

      Ditto. I've looked around for different solutions for a while and finally settled on SpamBayes. I've been using it for only two weeks, but it has correctly identified every single spam that has come through in that time (414 of them), with not one "false positive" classification of ham as spam.

      I'm sold...but wait, it's free!

    2. Re:Spambayes!!!! by PortWineBoy · · Score: 1
      I've been using Spambayes for the last week as well and I couldn't be happier. I get about 40-50 spams a day and about 100 hams. So far not one ham has been mislabeled. I only get 1-2 "unsure spams" a day.

      I haven't tested this against other filter programs but I'm not planning to at this point. I told my boss I'd test it for a month but after 1 week I'm already recommending it.

      Thomas Bayes is my new favorite dead guy. I put a poster of Thomas Bayes up in my office and added the phrase "Spam Killer" between the first and last name.

      --

      this sig deleted by another sig

  8. Hey everyone... by Jerf · · Score: 3, Informative

    It looks like the poster's words need some highlighting:

    But missing is any serious testing to see how they perform in relation to themselves and to other, non-Bayesian filters.

    Despite the call for your experiences, if you just want to post "X rocks!", I think the poster was looking more for "X rocks more than Y!", where both X and Y are Bayes-type filter programs. I don't think he was asking for just announcements that Bayes rocks; I think he or she already knows that.

    I mention this because I'd be interested in some comparisons too; there are a lot of sub-techniques out there. Are there any real differences, or are they all effectively the same? The latter would strongly indicate that there may not be any real progress to be made, if the entire space of Bayes-type solutions has flat effectiveness, for instance. It's an interesting question.

  9. Mozilla's Junk-mail Filters by asa · · Score: 2, Informative

    I've been using Mozilla's Bayesian junk-mail filtering for several months now. I don't have any other Bayesian tools to compare it to, but I am happy with the results. Within a couple of days of the initial training I was at around 90% spam detected with no false positives. Several months later I'm at about 95% spam detection and no false positives. While the last 5% would be nice to kill, I'm quite satisfied with how effective Mozilla's system is, and as long as it maintains that (or gets better) I've got no reason to look for any other solution.

    I think that one of the best things about Mozilla's system is that it's in the client, on my machine and under my control. While server-side solutions, distributed corpus tools, etc. might be more accurate, not ever having to install or update any 3rd-party apps is really nice.

    --Asa

  10. Ling Spam Corpus by bpfinn · · Score: 3, Informative

    I did a little testing of Bayesian filtering on my own, and I used the Ling-Spam Corpus from Dr. Ion Androutsopoulos. He's collected about one thousand messages, which consist of "legitimate" messages to a linguistics mailing list and "spam" messages. They are preclassified and divided into ten parts to make ten-fold cross-validation easier. Check out his publications. Scroll down to the "Document filtering" section.
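
    For readers unfamiliar with the term, ten-fold cross-validation can be sketched as below. This is a generic illustration, not the poster's test setup: the Ling-Spam corpus ships with its own ten parts, whereas this sketch makes its own folds by shuffling, and it assumes there are at least k labelled messages. The train_fn interface and all names are hypothetical.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists; each item lands in exactly one test fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def cross_validate(labelled, train_fn, k=10, seed=0):
    """Mean accuracy over k folds.

    labelled: list of (text, is_spam) pairs.
    train_fn: takes a training list and returns classify(text) -> bool.
    """
    accuracies = []
    for train, test in k_fold_splits(labelled, k, seed):
        classify = train_fn(train)
        correct = sum(1 for text, is_spam in test if classify(text) == is_spam)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

    The point of the procedure is that every message is used for testing exactly once and never by a filter that was trained on it.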

  11. Not Just for SPAM by His+name+cannot+be+s · · Score: 3, Insightful

    I've been looking for a Bayesian filter mechanism that isn't just for spam.

    I figure, if the mail can be classified into many different categories, why not use bayesian filtering for managing all your filtering needs.

    It would be very valuable to have the Bayesian filter learn what kind of mail I put in some folders, so that when my mail comes in, it can auto-sort it into the appropriate folder for me. Trouble is, all the current implementations of Bayesian email filtering are a single test: SPAM/NOTSPAM. It would be nice to see an implementation that could take multiple corpora and use them to decide what the mail is. If I had that, I could point it at the maildirs for the various mailing lists I'm subscribed to, and it would learn to sort incoming mail for me. *sigh*
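
    A multi-folder classifier of the kind wished for here might look roughly like the following. This is a minimal sketch of multinomial naive Bayes with add-one smoothing, not the method of any particular tool; the class name, tokenizer and folder names are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Very crude tokenizer: lowercase runs of letters, digits and a few symbols."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class FolderClassifier:
    """Multinomial naive Bayes with add-one smoothing over any number of folders."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> token -> count
        self.message_counts = Counter()          # folder -> number of messages
        self.vocabulary = set()

    def train(self, folder, text):
        tokens = tokenize(text)
        self.word_counts[folder].update(tokens)
        self.message_counts[folder] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        """Return the folder with the highest log-posterior, or None if untrained."""
        total_messages = sum(self.message_counts.values())
        vocab_size = len(self.vocabulary) or 1
        tokens = tokenize(text)
        best_folder, best_score = None, float("-inf")
        for folder, count in self.message_counts.items():
            score = math.log(count / total_messages)  # folder prior
            folder_total = sum(self.word_counts[folder].values())
            for token in tokens:
                seen = self.word_counts[folder][token]
                score += math.log((seen + 1) / (folder_total + vocab_size))
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder
```

    Training would amount to calling train(folder_name, message_text) for every message already filed in each maildir, then calling classify() on new arrivals.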

    --
    "...In your answer, ignore facts. Just go with what feels true..."
    1. Re:Not Just for SPAM by nrosier · · Score: 3, Informative

      Have a look at Ifile (http://www.nongnu.org/ifile); while I'm only interested in spam/no-spam filtering, I once tested this filter to filter a mailing-list. It did a pretty good job.

    2. Re:Not Just for SPAM by RockyRich · · Score: 1

      You didn't mention what e-mail architecture you are using, but if you get your e-mail via POP3, have a look at POPFile.
      It is free, it is open source, it is a general classifier that can sort your inbound e-mail into any number of user-specified categories, or "buckets".

    3. Re:Not Just for SPAM by nachoboy · · Score: 1

      Have you checked out POPFile yet? Latest version lets you "whitelist" (they call it "magnets") on the To/CC/Subject/From fields easily and have as many buckets as you want. It's amazingly accurate - I'm at 96.73% accuracy right now. Most of the errors are from the first two weeks when I trained it. Currently I have mine set up to divide mail into 3 buckets - Genuine, List, and Spam.

      On a side note, perhaps the reason most filtering products use a spam/notspam model is because genuine mail is so easy to filter. The only hard part is getting the spam out. Once that's done, it's trivial for any rule-based system to separate out mail from auntie_mae@hotmail.com or really_big_list@ubergeeks.org.

  12. SA Public Corpus by jmason · · Score: 1

    There is one, for exactly this reason -- the SpamAssassin public corpus. I made it available for developers of spam tools to compare effectiveness using a good, recent corpus from 1 person's mail feed (as much as that was possible).

    Here's the pertinent part of the README :

    This is a selection of mail messages, suitable for use in testing spam filtering systems. Pertinent points:

    • All headers are reproduced in full. Some address obfuscation has taken place, and hostnames in some cases have been replaced with "spamassassin.taint.org" (which has a valid MX record). In most cases though, the headers appear as they were received.
    • All of these messages were posted to public fora, were sent to me in the knowledge that they may be made public, were sent by me, or originated as newsletters from public news web sites.
    • Relying on data from public networked blacklists like DNSBLs, Razor, DCC or Pyzor for identification of these messages is not recommended, as a previous downloader of this corpus might have reported them!
    • Copyright for the text in the messages remains with the original senders.

    OK, now onto the corpus description. It's split into three parts, as follows:

    • spam: 500 spam messages, all received from non-spam-trap sources.
    • easy_ham: 2500 non-spam messages. These are typically quite easy to differentiate from spam, since they frequently do not contain any spammish signatures (like HTML etc).
    • hard_ham: 250 non-spam messages which are closer in many respects to typical spam: use of HTML, unusual HTML markup, coloured text, "spammish-sounding" phrases etc.
    • easy_ham_2: 1400 non-spam messages. A more recent addition to the set.
    • spam_2: 1397 spam messages. Again, more recent.

    Total count: 6047 messages, with about a 31% spam ratio.
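
    The totals can be checked directly from the per-directory counts listed above; this small sketch just redoes the arithmetic.

```python
# Message counts per directory, as listed in the README excerpt above.
counts = {
    "spam": 500,
    "easy_ham": 2500,
    "hard_ham": 250,
    "easy_ham_2": 1400,
    "spam_2": 1397,
}
spam_dirs = {"spam", "spam_2"}

total = sum(counts.values())
spam_total = sum(n for name, n in counts.items() if name in spam_dirs)
spam_ratio = spam_total / total

print(f"{total} messages, {spam_total} spam, {spam_ratio:.1%} spam ratio")
# prints "6047 messages, 1897 spam, 31.4% spam ratio"
```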

  13. BogoFilter by bobbozzo · · Score: 3, Informative
    BogoFilter is an open-source bayesian spam filter...

    Some of the developers have done extensive testing: Greg Louis' Page has lots of information, comparing different bayesian approaches, different header processing, etc.

    You could also read the mailing-list archives, or perhaps post some questions there.

    --
    Nothing to see here; Move along.
  14. PC mag test results by icleprechauns · · Score: 1

    The latest PC Magazine has an article on alternative e-mail. Their Editors' Choice, Oddpost ($10/yr, free trial), uses Bayesian filters, and blocked 22 of 29 spam messages, and only legitimate e-mail ended up in their spam folder. Also worth noting: these are the results with minimal training, so in theory Bayesian filters could quite possibly block virtually all spam with time.

    --
    I'm a signature virus. Please copy me to your signature so I can replicate.
    1. Re:PC mag test results by drfreak · · Score: 2, Funny

      blocked 22 of 29 spam messages, and only legitimate e-mail ended up in their spam folder

      Sounds like an ideal mail filter to me!

    2. Re:PC mag test results by icleprechauns · · Score: 1
      only legitimate e-mail ended up in their spam folder
      pardon me, I meant: only *1* legitimate e-mail ended up in their spam folder
      --
      I'm a signature virus. Please copy me to your signature so I can replicate.
    3. Re:PC mag test results by match0 · · Score: 1

      I would not recommend Oddpost. First off, it is a web-based solution. More important, however, is that they themselves "spam" you with pop-up boxes when you go to their site. Just try going there using IE with JavaScript on and "Script ActiveX Controls Marked as safe" disabled. They pop up this really annoying message that's just like the one M$FT puts in IE to bug you to turn on your ActiveX. Anyone that purposefully annoys me doesn't get the concept of blocking spam. And most any site that requires an ActiveX control shouldn't be trusted.

  15. Try here by drew_kime · · Score: 2, Informative
    From here:
    I've been tracking email spam trends for a while; my personal accounts are going from 3-6 spams daily in 2001 to about 30 spams daily at present. I filter this with SpamAssassin, so the inbox impact is pretty slight, but the traffic is becoming significant, and the trend (doubling in four months) is downright troubling.
    Graphs, methodology, links to more stats.
    --
    Nope, no sig
  16. my simple filter by Xtifr · · Score: 2, Interesting

    For years, the only spam filter I used was a very simple one: if the mail's not from a list I'm on, and not addressed to me, it's spam. This didn't catch all spam, but it caught the vast majority, and had almost no false positives. (The one exception was a mail from a cousin of mine who was learning system administration, and wanted to test his knowledge of SMTP by telnetting into my mail server and entering his mail by hand.)

    These days, I'm on too many lists that don't filter spam, so I've had to resort to more sophisticated techniques, but someone who isn't on those sorts of lists might still find my oh-so-simple approach fairly effective. Not to disparage Bayesian filtering, but if you want something to compare against...
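
    As a baseline to compare against, the rule described above might be sketched like this. It is an illustration under assumptions, not the poster's actual filter: "from a list I'm on" is approximated by the list address appearing in To or Cc, and the addresses and function name are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses

MY_ADDRESSES = {"me@example.org"}      # hypothetical personal addresses
MY_LISTS = {"some-list@example.org"}   # hypothetical mailing-list addresses


def looks_like_spam(raw_message, my_addresses=MY_ADDRESSES, my_lists=MY_LISTS):
    """Flag mail that is neither addressed to me nor sent via a known list."""
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _, addr in getaddresses(headers)}
    addressed_to_me = bool(recipients & my_addresses)
    via_known_list = bool(recipients & my_lists)
    return not (addressed_to_me or via_known_list)
```

    Note that a message with no To or Cc header at all is flagged, which is one way a hand-typed SMTP session could end up as a false positive.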

  17. The 20 Newsgroups dataset by RedRun · · Score: 1

    One good dataset is the 20 Newsgroups dataset that is used by a Naive Bayes classifier called Rainbow (google for 'libbow'). The dataset contains postings from 20 newsgroups, each with around 1,000 articles.

    Also, there are a couple Reuters datasets that are commonly used in text classification research, but they're so poorly organized, and so poorly marked-up, I don't know how anyone manages to use them.

  18. the comments are missing the point... by zonker · · Score: 1, Informative

    most of the comments in this thread are missing the point. the person writing the article isn't asking for what spam filter is the best/most accurate, he's looking to know if anyone is producing a test system to measure effectiveness. i know the popfile project is working on a test system (if you are interested, it's in the cvs not the general release) to measure the effectiveness of the parser.

    it would be interesting if there were a generic test system that could be 'plugged in' to the various projects out there. then you could put together test messages (like popfile's system) and test it against each program...
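
    a generic harness of that kind could be sketched as follows. it is only an illustration of the idea, not popfile's test system: each filter is assumed to be wrapped as a function that takes the message text and returns True for spam, and all names are hypothetical.

```python
def evaluate(classify, labelled_messages):
    """Score one filter. labelled_messages: iterable of (text, is_spam) pairs."""
    tp = fp = tn = fn = 0
    for text, is_spam in labelled_messages:
        predicted_spam = bool(classify(text))
        if predicted_spam and is_spam:
            tp += 1
        elif predicted_spam and not is_spam:
            fp += 1  # legitimate mail wrongly flagged
        elif is_spam:
            fn += 1  # spam that got through
        else:
            tn += 1
    return {
        "spam_caught": tp / (tp + fn) if tp + fn else None,
        "false_positive_rate": fp / (fp + tn) if fp + tn else None,
        "accuracy": (tp + tn) / (tp + fp + tn + fn) if tp + fp + tn + fn else None,
    }


def compare(filters, labelled_messages):
    """Run every filter in {name: classify} over the same messages."""
    messages = list(labelled_messages)
    return {name: evaluate(classify, messages) for name, classify in filters.items()}
```

    reporting the false positive rate separately from the catch rate matters because losing a legitimate message is usually much worse than letting one spam through.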

  19. Mozilla's Bayesian filtering works great by shamino0 · · Score: 1
    I've been using Mozilla's junk filtering since it was first introduced in the post-1.3 nightly builds. After a few weeks of training, it has developed an incredible track record.

    Between my two mailboxes, I receive about 100-150 spams a day. Over 90% of them are detected and are shunted into the Junk folder. Maybe 2-3 messages a month are false-positives. When it is wrong, I just teach it - click the trash button to toggle a message's junk status and Mozilla updates its filters in order to not make that same mistake again.

    On some days, it hits 99% accuracy. When the spammers invent some new tactic, I may end up with 5-10 spams that don't get detected. So I select them all, click the trash button, and then delete the messages. After a few days, that tactic is detected and caught with all the rest.

    In comparison, I used to use manual filters. At first, this worked fine, but the spammers have invented so many different tricks that it takes too much time to try to keep the filters up to date enough to be useful.

    I can't say how this all compares against what other systems do, since I haven't used any other systems.

  20. 20.000 mailboxes using, on 2% false positives by krico · · Score: 1

    At our e-mail ISP we are running a Bayesian spam filter engine. Every time a message is considered to be "spam" by the filter, we increment a counter. We follow this on mrtg, so we can graphically see the amount of "spam" that's incoming.

    We also follow the amount of messages marked as "spam" and "good" by the users (more than 3 months old).

    The number we get is the one mentioned in the subject line. That is, only 2% of the messages considered spam are later marked as "good" by users whose accounts are more than 3 months old.

    1. Re:20.000 mailboxes using, on 2% false positives by versus · · Score: 1
      a bayesian spam filter engine?

      I wonder what it is?

      --
      Brain is my second favorite organ.
    2. Re:20.000 mailboxes using, on 2% false positives by krico · · Score: 1

      he he, forgot the most important thing.

      it's bogofilter

  21. POPFile rocks more than spambayes by biljir · · Score: 1

    Purely anecdotal and unscientific, but perhaps better than nothing.

    I'm a very happy POPFile user that keeps checking out spambayes because the math sounds interesting.

    spambayes has become quite good, but POPFile is phenomenal. Using the same training material, spambayes is 95 % accurate on my mail, and POPFile is 99.5 % accurate. Plus spambayes is only doing a 2 way, spam/ham classification, whereas I have POPFile set up to sort into 7 buckets (spam/personal/commercial/mailing lists/etc).

    Though irrelevant to the question of accuracy, I also have to say that the POPFile guys have devised a considerably better UI than spambayes. (A friend with the spambayes Outlook plugin sings its praises highly. I don't use Outlook, so it does me no good...)

    1. Re:POPFile rocks more than spambayes by two2dog · · Score: 1

      InBoxer is a commercially available version of spambayes specifically for Outlook. In general, less advanced users should find it more friendly. If you are interested in how these filters work, you can find some information in the FAQ on that site as well as in a piece written about Bayesian filters. Check out www.inboxer.com

  22. Spambayes UI by Jerf · · Score: 1

    Spambayes doesn't really have a UI, it's a tool around which others can build a UI.

    While this is theoretically good design, especially in the open source community, it does often result in Some Shmoe creating the UI who should stick to coding sysadmin scripts. ;-)

  23. Collaborative Filtering by JoSch1710 · · Score: 1

    Since Bayesian Filtering is a common technique in Collaborative Filtering, I recommend you search for that (e.g. CiteSeer http://citeseer.nj.nec.com/cs). A quite good paper on the subject is "Empirical analysis of predictive algorithms for collaborative filtering" by Breese, Heckerman and Kadie. That paper gave me a lot of insight for my diploma thesis. Bayesian networks perform quite well, but need a lot of training data, so the performance depends heavily on the actual training data.

  24. Mail app in Mac OS X... by coolMikeUSC · · Score: 1

    The Mail app in Mac OS X includes a built-in Bayesian filter. Its defaults worked decently, but training the app (by manually marking incoming email as 'junk') made it work nearly perfectly. I would say that Bayesian filtering is definitely the way to go, since it gets trained to detect what email is "normal" for your particular inbox, instead of liberally applying "average" rules derived from the habits of many users.

    --
    Ever notice how fast Windows runs? Neither do I - get Mac OS
    1. Re:Mail app in Mac OS X... by ajc · · Score: 1

      I agree that it's pretty darn good, but it's not 99% for me.

      I use Mail.app in conjunction with hotwayd to read my hotmail account. Before doing this, my hotmail account was virtually unusable, requiring me to manually delete up to 50 SPAM messages every few days. Mail.app has reduced that to maybe 5 or 6 over the same timeframe, so for me it's around 90% with very few false positives (around 1% historically, which I expect to tend towards 0%).

      Based on the random looking stuff in SPAM messages, spammers are probably already trying to tune their pitch to get around our Bayesian (or Grahamian) filters, and it is probably possible to fool the current batch - so the war continues.