Slashdot Mirror


More on Bayesian Spam Filtering

michaeld writes "The "Bayesian" technique for spam filtering recently publicized in Paul Graham's essay A Plan for Spam doesn't actually seem to have anything Bayesian about it, according to Gary Robinson (an expert on collaborative filtering). It is based on a non-Bayesian probabilistic approach. It works well enough, because technology often doesn't have to be 100% perfect to do something that really needs to be done. The problem interested Robinson, and he posted his thoughts on fixing the problems in the Graham approach, including adding an actual Bayesian element to the calculations."

25 of 251 comments

  1. Tutorial on Bayesian Inference by rbrito · · Score: 5, Informative

    The timing of this article seems impeccable, since I am myself trying to learn about Bayesian Statistics.

    I am a Computer Science student studying Computational Biology (more specifically, Sequence Alignments) and while I have a bit of background on Classical Statistics, I was (and still am) completely ignorant about Bayesian Statistics.

    It is only now, while trying to learn about Hidden Markov Models and their applications to Sequence Alignment, that I finally decided to learn the basic hypotheses of Bayesian Statistics and how they differ from those of Classical Statistics.

    While searching for introductory material on Bayesian Statistics, I found this course page, which has some nice introductory notes, including notes on Bayesian Statistics.

    I hope that other people find this resource as useful as I did.

    1. Re:Tutorial on Bayesian Inference by Wile+E.+Heresiarch · · Score: 3, Interesting
      Here are some additional references, on-line & off, about Bayesian probability.

      On the web, see: Assoc. for Uncertainty in Artificial Intelligence -- this is the primary conference devoted to belief networks, which are a class of graphical (in the circles and arrows sense) Bayesian probability models. There are tutorials and other papers on the main AUAI web page, and links to the last several years of conference proceedings. By the way, Heckerman and Horvitz, now doing belief networkish work at MS Research, are in the AUAI crowd.

      In print, my favorite reference is E.T. Jaynes, "Probability Theory: The Logic of Science", which is due out soon. See this web site devoted to Jaynes' work for the status. I am also fond of Castillo, Gutierrez, & Hadi, "Expert Systems and Probabilistic Network Models".

      There are a vast (well, maybe just large) number of alternative models to classify things; a good introduction is Hastie, Tibshirani, & Friedman, "Elements of Statistical Learning". Incidentally, they use spam classification to illustrate several kinds of models.

      Finally, if you're wondering what the heck is the difference between Bayesian probability and any other kind -- just google the posts in sci.stat.math; there is a Bayesian vs frequentist flame war about once a year. :^)

  2. Post your results here by Jeffrey+Baker · · Score: 5, Interesting
    I'd like to hear the results of anyone who has implemented one of these probabilistic filtering systems. I implemented a modified version of Paul Graham's system and so far it kicks ass. So far it has trapped over 600 spams without any false positives. I receive almost 100 spams a day and over the last week I have generally only had to delete one or two by hand. The rest go directly to jail.

    I'd like to hear about modifications to this system. I removed Graham's doubling of "good" word frequencies, and I trained my filter using digrams. I also tried all the various methods supplied by the program "rainbow", with good results, but the implementation was too slow and klunky to place in the middle of my email delivery system. What are other possible modifications?
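
    A rough sketch of what the two changes described above might look like: tokens are adjacent word pairs ("digrams"), and the per-token estimate does not double the "good" count. The token pattern, the function names, and the 0.01/0.99 clamp are assumptions made for illustration; this is not the poster's Perl code.

```python
import re
from collections import Counter

# Assumed token pattern; the post does not say how words were split.
TOKEN_RE = re.compile(r"[a-z0-9$'-]+")

def digrams(text):
    """Return adjacent word pairs ("digrams") from a message."""
    words = TOKEN_RE.findall(text.lower())
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def train(messages):
    """Count how often each digram occurs across a corpus of messages."""
    counts = Counter()
    for msg in messages:
        counts.update(digrams(msg))
    return counts

def token_spam_probability(token, spam_counts, good_counts, n_spam, n_good):
    """Per-token spam estimate with no doubling of the good count.

    n_spam and n_good are the number of messages in each corpus.
    Returns None for a token seen in neither corpus.
    """
    b = min(1.0, spam_counts[token] / n_spam)
    g = min(1.0, good_counts[token] / n_good)  # no 2x weight on good mail
    if b + g == 0:
        return None
    # Clamp so that no single token is treated as absolute proof.
    return max(0.01, min(0.99, b / (b + g)))
```

    Digrams multiply the number of distinct tokens, so a filter built this way will likely need a larger training set before rare pairs become useful predictors.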

    1. Re:Post your results here by ajm · · Score: 3, Insightful

      Just out of interest what's your code written in and would you consider posting it?

    2. Re:Post your results here by kwerle · · Score: 3, Interesting

      I implemented Paul's system without the changes you mentioned, and am seeing >95% success (and climbing). 0 false positives. I will be submitting it to sourceforge this week.

    3. Re:Post your results here by Jeffrey+Baker · · Score: 5, Interesting
      I hacked it together in Perl, to make use of the Berkeley DB interfaces and the MIME parsing modules. Took about 30 minutes. I'm working on a C library that could be linked into mutt or pine or whatever, but I'm finding the available MIME code in C cumbersome.

      You can grab the source here, but it is specific to the exact way that my mail gets delivered (via offlineimap into maildirs).

    4. Re:Post your results here by kwerle · · Score: 3, Interesting

      So have you been retraining the system as you get more spam

      I continue to train.

      or did you train it initially and leave it that way. How large is your training set?

      I started off with a base.

      Details! My training set was 300 spams and 3500 not-spams.

      I started with a little more than 300 spam, and around 1000 valid messages.
      My count is now:
      Good messages read: 1194
      Bad messages read: 644

      That's because I only train on deleted mail, and I don't tend to delete my mailing lists except for once a month or 2...

      With digrams, my filter traps 618 out of 621 spams in my spam folder, which is 99.5%

      Against my start set, I nailed about 97%, including refiling 2 false positives from my old anti-spam system as being not spam. I've noticed that the system is really good at nailing stuff it already knows about, but the learning curve is a little steep for 'new spam types'. Still, I'm pretty happy with it.

    5. Re:Post your results here by XDG · · Score: 3, Interesting
      I've implemented it in part -- my code is in perl and will flag e-mails, but I haven't worked it into a filter yet.

      My experience is that I get a few percent false negatives and about 1% false positives. I'm not seeing zero false positives, like many people are, but that probably has to do with the training sets used. Statistically speaking, you always have to trade off false negatives against false positives, so it's reasonable in my 'real world' tests.

      As a side note, everyone should test out of sample. E.g. set aside half your good e-mails and half your spam e-mails, build the filter on one half, and then test on the other half. That's the only way to get a fair test of the filter.
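
      The out-of-sample test described above can be sketched roughly as follows. The helper names and the `is_spam` callable are hypothetical; any trained filter that returns True for spam would fit.

```python
import random

def split_half(messages, seed=0):
    """Shuffle a corpus and split it into (training, held_out) halves."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def evaluate(is_spam, held_out_spam, held_out_good):
    """Return (false_negative_rate, false_positive_rate) on held-out mail.

    is_spam is any callable built from the training halves only.
    """
    false_neg = sum(1 for m in held_out_spam if not is_spam(m))
    false_pos = sum(1 for m in held_out_good if is_spam(m))
    return false_neg / len(held_out_spam), false_pos / len(held_out_good)
```

      The filter should be trained only on the first half of each corpus; scoring it on mail it has already seen will overstate its accuracy.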

      For my "good" email corpus, I dumped my entire e-mail archive since 1995. That included personal e-mail, receipts from online shopping, some mailing lists, etc. The few things that get flagged as spam (a) are almost always sent in HTML format, and (b) are very short with little real content. (E.g., "Hey, looking forward to seeing you this weekend. Call me if you go out. My number is... Bye.")

      The spam corpus I took from an online resource while I build up my own. The e-mails that slip by unflagged are usually (a) short and (b) phrased like a friend making a suggestion. (E.g., "Hi, I just thought you'd be interested in hearing about this new, cool website, http://...") It seems to be close enough to a real message to slip through. Thankfully, few of them are like that.

      I'm including subject lines, from addresses, and the body so far. I'm not parsing ip addresses or html tags specially, however, just basic words using a simple perl regexp.

      Interestingly, "COLOR" is one of the most often flagged words indicating spam. HTML formatting text seems to be the biggest culprit in my false positives. I might explicitly exclude the ones that show up in good mail (e.g. from friends who use crappy e-mail programs like aol) like COLOR, FONT, FACE, etc., but leave in the ones that spammers use like TD, TR, etc.
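
      One way the exclusion idea above could look, as a sketch only: the regular expression, the function name, and the case-insensitive matching are assumptions, and the exclusion list holds just the three formatting words named in the comment.

```python
import re

# Assumption: "basic words" means runs of letters; the original regexp is not given.
WORD_RE = re.compile(r"[A-Za-z]+")

# HTML formatting words the comment suggests ignoring because they also
# appear in legitimate HTML mail. TD and TR are deliberately left in.
EXCLUDED = {"COLOR", "FONT", "FACE"}

def tokens(subject, sender, body, excluded=EXCLUDED):
    """Extract words from subject, from address, and body, minus excluded ones."""
    text = " ".join((subject, sender, body))
    return [w for w in WORD_RE.findall(text) if w.upper() not in excluded]
```

      HTML tags are not parsed specially here, so tag and attribute names come through as ordinary words, which matches the behaviour the comment describes.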

      -XDG

    6. Re:Post your results here by Eric+Seppanen · · Score: 3, Insightful

      You might want to consider collaborating with the group working on bogofilter, which is basically the same thing, done in C.

      --
      314-15-9265
  3. The proof of the pudding... by ajm · · Score: 5, Interesting

    ...is in the eating. I think the same applies to spam. Paul showed, to his satisfaction, that the technique he used worked for his samples. Gary proposes some changes that would improve the filter's accuracy, but does not test these theories.

    We will now have many slashdot posts saying "I've not tested this but I think A (or B, or C, or X)"

    Here's where the scientific method comes into its own. Anyone who cares enough can actually test and post their results. I'd be interested in seeing what they look like. I don't have a database of spam to test against (and please don't volunteer to sign me up for some :) but it would be interesting to see whether what looks convincing in theory pays off in practice.

    1. Re:The proof of the pudding... by shadow303 · · Score: 3, Insightful

      From what I can observe from the writeup, Gary appears to be one of the "experts" that I refer to as "theory whores". Hard problems need to be tested, but some people seem to think that they can arrive at good results from an unproven theory. Anybody who has actually tested difficult problems to any extent could tell you that things don't always go as planned. An improvement that might work in theory sometimes results in disaster due to minor points that the theory does not take into account.
      Also, it bothered me that he objected to Paul's work biasing one side. It was almost like he thought it was a bug, but there was a good reason for biasing (reduce false positives). So my advice for Gary is, until you actually implement your idea, don't go trying to say that it is better than somebody else's method.

      --
      I've got a mind like a steel trap - it's got an animal's foot stuck in it.
  4. poor Hotmail users are still in the cold... by saskboy · · Score: 4, Funny

    I have some tricks for Hotmail users who cannot benefit from the technique above:
    Filter any message without the @ in the address.
    Filter Britney, Boobs, Penis, Inches, WIN, ___ ..... and your own email address userid.
    Now you only have about 40 spams a day to deal with instead of 100.
    Uncheck your information from being in the MSN directory too.

    Enjoy :-)
    John

    --
    Saskboy's blog is good. 9 out of 10 dentists agree.
  5. Terrible Spam Filters by DonkeyJimmy · · Score: 3, Informative

    It's good that work is being done to make a good weighted spam filter.

    It's funny how bad the standard Microsoft spam filter is (the one present in outlook). It's simply a word lookup, where if the word is present the message is marked as spam. It looks for things like "for free?". You can see the full list here, near the bottom. It's a little old, but not outdated (I think you can upgrade your spam filters, but I tested these, and the ones I tested work).

    The adult filter isn't any better.

    --
    "Probably the toughest time in anyone's life is when you have to murder a loved one because they're the devil." -Philips
  6. Let's see by sam_handelman · · Score: 5, Funny

    P (This is spam) = P (This is Spam | It will enlarge my penis) * P (It will enlarge my penis)

    Now, given that I have prior knowledge that:
    P (It will enlarge my penis)

    is very low,

    and given that, having never encountered anything which enlarges my penis in any permanent way, I have no knowledge of
    P (This is Spam | It will enlarge my penis)

    and we have the product of one probability which I know is low, and another of which I have no posterior knowledge, so we conclude that P (It is Spam) is also low, and that I must have requested more information on their new penile enlargement technique.

    So, that message goes into the keepers.

    Meanwhile,

    P (It is Spam) = P (It is Spam | Frank is getting married) * P (Frank is getting married)

    So, I know Frank is getting married, since he sent me this e-mail I'm considering filtering as Spam, and whether or not it is spam is pretty much independent of whether or not Frank is getting married, so.... it's Spam. Away it goes.

    P.S. I've deliberately made a hash of this for a joke. The actual rule is:

    P (A & B) = P (A | B) * P (B)

    --
    The good and new comes from no quarter where it is looked for, and is always something different from what is expected.
  7. filtering not the answer - maybe this is by frovingslosh · · Score: 5, Insightful
    Sadly, unless you are an ISP or other mail service provider, filtering does nothing. The spammers work in volume. They count on hitting everyone to reach that .1% that will respond. That response is what they are after and what they get paid for. You likely know better than to ever deal with anyone who spams you or to ever respond to their spam. Filtering your own e-mail has absolutely no effect on the spammer; you were not going to respond anyway. By the time you filter they have already wasted your bandwidth, and perhaps mailbox capacity and even forwarding limits from a forwarding service. Your filtering is useless, puny human!

    Here is a suggestion for something that might make an impact on spammers: If I open my firewall, I see several attempts a day from people trying to get into my mail server. Of course, I don't have a mail server, but spammers are always looking for open relay points they can spam from. My suggestion: Give them a nice open relay server they can send mail to. Of course, you don't want to piss off your service provider by sending spam, and your upstream speed might limit you to less than you can receive, so rather than run a full mail server let's modify some mail server code to just accept mail and send it to the bit bucket. Maybe we can even misconfigure existing code to do this with no programming changes.

    No valid user will be affected, assuming you don't otherwise run a mail server. All that bandwidth you pay for can be used to receive e-mail from spammers before it ever goes out. Eventually their customers will see the response go from .1% to 0% and their business will dry up. This will impact spammers; blocking your own spam after it's been delivered will not.

    This need not even impact your own bandwidth. You can run the server when you are done using your system (Might make a nice screen saver - a black screen that just shows how many spammed addresses were prevented from getting spammed). Or you can impose limits on bandwidth at a firewall or router, or even restrict hours of access.

    If we set up enough different false open relay servers I think we could have a real impact on the spammers.

    --
    I'm an American. I love this country and the freedoms that we used to have.
    1. Re:filtering not the answer - maybe this is by stienman · · Score: 3, Insightful

      Interesting idea, but easy to verify. Send one thousand emails, and include a verifiable email in it. Check the email a few hours later - if it's not there, then don't use the relay.

      -Adam

  8. Neural Net Spam Filtering by ShakaUVM · · Score: 3, Interesting

    At UCSD, Bob Boyer and I wrote a neural net spam filter. Neural Nets, as everyone knows, are not really like biological brains, but really just statistical engines similar to the approach the guy above claimed to do.

    Our approach worked pretty well (95-97% accuracy), and we had to deal with the same issues that the above "Bayesian" approach did. I.e., weighing the neurons so that false positives occur much less frequently than false negatives, etc. We built it using data on spam collected from the UCI machine learning repository.

    It ties in with procmail. I'm not really a windows guy, so if anyone knows how to put a filter between an IMAP server and Microsoft Outlook/Netscape Communicator, I'd be interested in hearing how it's done.

    The README for it is at: http://www-cse.ucsd.edu/~wkerney/spamfilter.README
    And you can download it at:
    http://www-cse.ucsd.edu/~wkerney/spamfilter.tar.gz

    -Bill Kerney
    wkerney at ucsd.edu

  9. SpamAssassin - duh by Gothmolly · · Score: 3, Interesting

    SpamAssassin works great for me. It eats about 90% of my spam, you just hack up a little procmail file for it, and you're done.

    With so many people using SpamAssassin these days, I can't see how this is a timely or newsworthy item. More like from the been-there-done-that-dept.

    --
    I want to delete my account but Slashdot doesn't allow it.
    1. Re:SpamAssassin - duh by Eric+Seppanen · · Score: 3, Insightful
      Reasons why I don't use SpamAssassin:
      1. It tends to rely on blocklists, many of which have demonstrated unfair practices in the past.
      2. The more SpamAssassin is used, the more spammers will specifically avoid doing things SpamAssassin checks for.
      3. It's a gigantic heap of perl, the Write-Only (tm) language. I hate the fact that every perl program demands I mess up the package manager on my system by blindly downloading a half-dozen new modules. And it's slow!
      4. Bogofilter is better. duh.
      --
      314-15-9265
  10. Re:I still think passive euthanasia is the best wa by ivan256 · · Score: 3, Informative

    For once a restrictive legislation would get 99% support... you don't see that every day. Like I mentioned before, I don't get our politicians, they say they work for us, they try to find clever ways to tax us, remove control that we used to have and all, but something on which they would get unprecedented support, they are simply sitting on the issue...


    Perhaps the problem is that the law would gain them fewer votes than a few hundred thousand dollars in campaign financing would. A large portion of the population isn't online, and a large portion of those who are don't care about spam, so your politician doesn't care either.

    Since this is such a trivial technical problem to solve, it's not really a big deal either way. I daily reduce 800 spam messages to five or six that make it through to my inbox just using procmail scoring, and I haven't had a false positive in years. I spend five minutes updating my procmailsc every six months to keep it effective. I suppose that I could use an automated system to generate my score file similar to what Paul Graham described, but when I only spend ten minutes a year updating my rules, it's going to be a lot of years before writing all that code pays for itself. No need for sweeping legislation.

  11. Re:filtering not the answer - maybe SPOOFSERVERS by netringer · · Score: 4, Insightful

    I'm fairly sure a false relay won't work. Just like snail mail list sellers, the spammers salt their victim lists with their own valid addresses that they can check to see if the message is getting out.

    BUT, an early spam filter at an ISP worked just like that. The design parameters were 1) that spam filtering require no more resources than actual delivery of the message, and 2) the filter give no indication to the spammer that the message was not going to be delivered. This gives the spammer no feedback and forces THEM to waste CPU cycles which will slow them down.

    --
    Ever dream you could fly? Get up from the Flight Sim. I Fly
  12. Bayesian vs not isn't really the point by XDG · · Score: 4, Insightful
    Gary is both right in some respects and irrelevant in others. Here's the key line in his article that deflates it a bit:
    It is untested as of now. It is based purely on theoretical reasoning. If anyone wants to try and test it in comparison to other techniques, I'd be very interested in hearing the outcome.
    On the other hand Paul Graham has actually tested his model and it works. I've worked it up in perl and tested it on my own data set and it works there, too. Paul acknowledges that he's being a bit fast and dirty, but the proof is in the pudding. The rest is just academic quibbling over the fine points.

    I'm not sure why this particular article needed to be posted, as it's just one of several alternative approaches and an untested one at that. On Paul's page, he also lists several published academic papers with other alternatives -- all actually tested, of course.

    Gary is basically right in questioning the use of the word "Bayesian". Paul's approach is more about weighing "evidence" as given by the appearance of certain words, rather than in figuring out the probability of spam assuming a "prior". See Paul's explanation, but if you check the article he references at the end, you'll note that the method Paul uses is only one of several methods to solve an underspecified problem. It's a reasonable guess, not necessarily the only guess.

    Looking at another article Paul references, given the word independence assumption, the more formal Naive Bayesian approach calculates as follows:
    p(spam|word1,...,wordn) = [ p(spam)*p(word1|spam)*...*p(wordn|spam) ] / [ p(spam)*p(word1|spam)*...*p(wordn|spam) + p(!spam)*p(word1|!spam)*...*p(wordn|!spam) ]

    This is similar to Paul's approach except for including a "prior" assumption of p(spam) -- the expected probability of any email being spam, calculated from the historically observed frequency of spam. By leaving it out, Paul implicitly assumes that 50% of mail is spam -- that's his "prior" estimate of the spam rate. Given the other adjustments he makes to his sample, that appears to be acceptable in practice. (Paul overweights the spam prior, but also overweights the effects of "good" words.)
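
    As an illustration, the Naive Bayesian formula above, with an explicit prior, might be coded like this. The function and parameter names are made up for the sketch, the per-word likelihoods are assumed to be strictly between 0 and 1, and the sums are done in log space only to avoid underflow on long messages.

```python
import math

def naive_bayes_posterior(words, p_word_given_spam, p_word_given_good,
                          prior_spam=0.5):
    """Posterior probability of spam given the words, assuming independence.

    p_word_given_spam / p_word_given_good map each word to p(word|spam)
    and p(word|!spam). prior_spam is p(spam); 0.5 reproduces the implicit
    prior discussed above.
    """
    log_spam = math.log(prior_spam)
    log_good = math.log(1.0 - prior_spam)
    for w in words:
        log_spam += math.log(p_word_given_spam[w])
        log_good += math.log(p_word_given_good[w])
    # Subtract the larger log before exponentiating to keep values in range.
    m = max(log_spam, log_good)
    spam, good = math.exp(log_spam - m), math.exp(log_good - m)
    return spam / (spam + good)
```

    With p(free|spam) = 0.8 and p(free|!spam) = 0.2, a message containing only "free" scores 0.8 under a 50% prior, but only about 0.31 if just 10% of mail is assumed to be spam. That gap is what the choice of prior controls.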

    I'd personally prefer to overweight the "good" e-mails entirely rather than just put a "good-multiplier" on them like Paul does, but that's just quibbling over small bits.

    As to the bit that Gary raises about Paul assuming a spam probability for an unknown word -- Paul originally said .2, then revised to .4, but really should have put it at .5 or just excluded it from all calculations. A new word has no robustness as a predictor (which is why Paul dropped words that didn't appear five times anyway). In practice, a new word at .4 isn't going to be among the 15 most interesting words to make the calculation from, anyway.
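
    A sketch of why the unknown-word value rarely matters. It assumes the combining rule from Paul's essay is prod(p) / (prod(p) + prod(1-p)) over the tokens whose probability lies farthest from 0.5, that each distinct token counts once, and that per-token probabilities are already clamped away from 0 and 1. The function name and defaults are illustrative.

```python
def combined_spam_probability(token_probs, tokens, unknown=0.4, top_n=15):
    """Combine per-token spam probabilities for one message.

    token_probs maps known tokens to values strictly between 0 and 1.
    Unknown tokens get `unknown`; only the top_n "most interesting"
    tokens (largest distance from 0.5) enter the product.
    """
    # sorted() makes tie-breaking deterministic; the sort below is stable.
    probs = [token_probs.get(t, unknown) for t in sorted(set(tokens))]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam_product = 1.0
    good_product = 1.0
    for p in probs[:top_n]:
        spam_product *= p
        good_product *= 1.0 - p
    return spam_product / (spam_product + good_product)
```

    A token at 0.4 sits only 0.1 away from neutral, so in any message with fifteen or more strongly scored tokens it never reaches the product; it only has an effect on very short messages.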

    -XDG

  13. Re:Why just spam? by McFly777 · · Score: 3, Informative

    Easy. Just re-run the spam filter on your 'cleaned' mail using a ruleset generated by splitting the mail into topical vs. everything else.

    --

    McFly777
    - - -
    "What do people mean when they say the computer went down on them?" -Marilyn Pittman
  14. microsofts trademark by portal9 · · Score: 3, Informative

    why are we even considering this method when microsoft has a trademark on it? nothing can be done.. they have a lock on it. trademark here

    1. Re:microsofts trademark by thogard · · Score: 3, Informative

      another stupid patent? This isn't new, it's been done with spam on usenet for years. Maybe someone should dig out the cancelmoose's friends as prior art?