Slashdot Mirror


Response to Gordon Cormack's Study of Spam Detection

Nuclear Elephant writes "In light of Gordon Cormack's Study of Spam Detection recently posted on Slashdot, I felt compelled to architect an appropriate response to Cormack's technical errors in testing which ultimately explain why one of the world's most accurate spam filters (CRM114) could possibly end up at the bottom of the list, underneath SpamAssassin. I spend some time explaining what is a correct test process and keep my grievances simplified about the shortcomings of Cormack's research."

25 of 229 comments (clear)

  1. How I do by mirko · · Score: 5, Interesting

    I set many aliases to my official email and I gave all of these to and only to spammers.
    So, whenever I get a mail more than 95% similar to a mail that I know is a spam, I dump it.
    This combined with Apple's Mail.app Bayesian filter and there may only be a few spams left.

    --
    Trolling using another account since 2005.
    1. Re:How I do by julesh · · Score: 3, Informative

      Mail.app's filter isn't Bayesian. Please see previous slashdot article on how it works (I'm too lazy to find the reference right now).

  2. Excellent review by XMichael · · Score: 5, Informative

    On the origional forum, I was saying something of the similair (except not nearly as well written!! hehe)

    DSPAM, IMHO, provides far better results than this report was leading too. A properly trained Bayes filter, but a somewhat intellegent person provides simply amazing results. I swear I can go weeks on end without a single spam getting through, no false positives -- and between 20 and 100 SPAM in my "spam" box per day!

    DSpam using Bayes algorithm is by far the best filtering method i've used. And I've used alot! (From SpamAssassin to SpamProbe and all the inbetweens). The only setback, DSpam takes a couple weeks to train...


    Priceless Photos

  3. Studies create discussion by Timesprout · · Score: 5, Insightful

    I usually frown when I see many of these so called studies offering conclusions, several of which differ radically from my own experience. There recent Java/C++ performance one was a classic example. It gets annoying when a pro MS result is immediately decried as marketing FUD because it just cant be better and a pro Linux result is taken gospel truth here on /. Usually I tend to take all results with a grain of salt or just plain ignore them and focus on the debate around them.

    The benifit of these studies though is that fantical crap aside informed people will usually take the time to interpret results or suggest corrections/improvements that actually benifit developers and improve their knowledge base more than any information provided by the actual study.

    --
    Do not try to read the dupe, thats impossible. Instead, only try to realize the truth
    What truth?
    There is no dupe
    1. Re:Studies create discussion by killjoe · · Score: 4, Insightful

      "This is defeatest bullshit. Ignoring your problems doesn't make them go away. "

      You miss an important point. This is not "our" problem, it's YOUR problem. I don't need a GIS program and neither to millions of other other people. YOU need one and too bad for you they cost tens of thousands of dollars. You have no right to complain that somebody else hasn't taken the time and effort required to give you a free equavalent.

      What you need to understand is the open source is nothing but scratching an itch. This is your itch and you need to scratch it.

      OPEN SOURCE ONLY WORKS IF PEOPLE CONTRIBUTE. This very simple and obvious point seems to be lost on most people. You are not supposed to sit around till somebody else does the work and give you something for nothing. You need to contribute.

      You need to start an organization and start raising money to fund an open source development effort or to accelerate and existing one. You need to get involved and contribute. BTW bitching on slashdot does not count as contributing.

      "This is like blaming McDonalds for your big, fat ass, or blaming Microsoft because you got a virus when you didn't run the patch they released to prevent it."

      Or blaming the open source community because they didn't give you something for free.

      --
      evil is as evil does
  4. Spamassasin is good but not that good... by Shoeler · · Score: 5, Informative

    For any users of spamassassin's 2.x branch (2.63 is current as of this writing), we all know how dated its signatures are right now. When the 2.6 branch was first released, I got zero spam and 100% ham for the first few weeks. Now that 3.x is being integrated as an ASF and being apache-ized, updates have been slow and 3.x is still awaiting deployment.

    Point being - I was darn surprised to see SA at the top of his charts.

    Now - if only mimedefang would easily use another spam-checker....

  5. Re:You don't like my software so I'll flame you by Threni · · Score: 3, Insightful

    > This guy seems a little harsh and just a bit jealous of the success of Gordon
    > Cormack's article.

    Articles aren't 'successful` - they're either useful, or they're just fun to read. Perhaps his is the latter.

    From the response:
    ---
    It turned out that Cormack was using the wrong flags, didn't understand how to train correctly, and seemed very reluctant to fully read the documentation. I don't mean to ride on Cormack, but proper testing requires a significant amount of research, and research seems to be the one thing lacking from this research paper.
    ---

    One thing I've noticed is that more and more people seem to want an answer NOW - even if it's not the correct answer, or even if the original question asked wasn't the correct one.

    > I'd like to know what makes his opinion any more valid than Gordon's.

    Everyones opinion is as valid as you - the observer - decide it to be.

    But in terms of which filter is the best - what does anyone's opinion have to do with it? If you're bothered about this issue, why not read both articles, think about it, and then perform the tests yourself? Or wait for an impartial third party to perform the relavent tests. There doesn't appear to be any alternative.

  6. Re:You don't like my software so I'll flame you by Otter · · Score: 5, Insightful
    There are some technical objections in there (old versions of software, the fact that Spam Assassin was tested with a spam collection generated by spam assassin). But honestly, after wading through all the whining and sneering, I didn't have the energy to pick the points out of the overall flow.

    Jonathan, next time:

    • Start by summarizing your technical objections.
    • Continue by detailing your technical objections.
    • Leave the nasty rants to the end, or better yet, leave them out entirely.
    • Stop talking about "geeks" in every paragraph.
    • Please stop referring to spam filter comparisons as "science".
  7. Re:You don't like my software so I'll flame you by pclminion · · Score: 4, Insightful
    This guy seems a little harsh and just a bit jealous of the success of Gordon Cormack's article.

    Let me explain why he's irritated, as somebody who has conducted spam filter statistical tests and made publications on the topic.

    Yes, it is irritating when somebody demonstrates that his method is better than yours. However, most researchers are able to accept this, and continue improving their own work.

    However, what is far more irritating (by an order of magnitude at least) is when somebody "demonstrates" the inferiority of your work, and they do so in a completely scientifically bogus way.

    Let me give a concrete example. Suppose you were Galileo. You have just put forth the postulate that all objects fall at the same speed regardless of mass. A "debunker" attempts to demonstrate that this isn't true by dropping an iron ball and a feather. Obviously, the feather falls much more slowly.

    "Ha ha, neener, neener!" cries the debunker. Of course, Galileo knows his method is flawed. If people actually listen to this supposed debunker, Galileo might become very, very irritated indeed.

  8. I'm not saying we wouldn't get our hair mussed... by VAXcat · · Score: 3, Funny

    I prefer using the original CRM114 discriminator and it's host platform on spammers. If you're not familiar with the original CRM114 and it's delivery platform, it was featured in the following movie... http://www.imdb.com/title/tt0057012/combined

    --
    There is no God, and Dirac is his prophet.
  9. I wouldn't take this critique too seriously by EsbenMoseHansen · · Score: 5, Interesting

    There are several warning signs in this article.

    1. The author spends a lot of time trying to discredit the author on such terms as impartialness and experience. While such can lead credence to a strong case, it bodes when mentioned as the very first points. Also note the beginning of the article: "Many misled CS student...".
    2. The author has no statistical or published backings for his claim
    3. Most of the arguments are flawed, in my opionion. Yes, the corpus was trained on SpamAssassin, but the other filters' mistakes were, as far as I recall, examined for errors individually. Thus, any mistakes would be spotted or credit each filter equally.
    4. I also always find it suspect when someone claims: "Yes, the program did not perform, but with a different configuration it might/in the latest version it might". While it could be true, such claims needs backing.
    5. He claims that X's email was atypical, even for geeks. I would like to state here that I have 3 email accounts, of which none lie near his "typical" spam quotient (60%): 2 with >90% spams and 1 with <1% spam.

    That said, he does raise a few valid points, such as the timeline:

    1. If filters expunge old data based on time, this would not work in the test. That gives SpamAssisins' static rules an egde
    2. Configurations should really have been published. I see no reason why not.
    --
    Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
    1. Re:I wouldn't take this critique too seriously by int2str · · Score: 4, Interesting

      Yes, I agree with your points. The author spends way too much time dicrediting the study.

      I also have to say that my experience was much more along the line of Cormacks. I've tried DSPAM for a while on my server, starting from scratch. Training on error with only new emails. On a small mail server with about 10 users of different types (geeks, businesses, moms etc).
      - DSPAM took way too long to produce any kind of results
      - 2500 emails before advanced features kick in is *a lot* for the average soccer mom
      - DPSAM produced way too many false positives early on
      - The spam filtering accuracy leveled off at about 80% (number from DSPAMs web interfac)

      So this is not another overzealus CS student here, but real world testing.

      The DSPAM author does not address any of the real points and just rags on Cormack.

      Not much of a "rebutal" in my book.

  10. What is typical by Anonymous Coward · · Score: 4, Insightful
    Due to X's extremely high volume of traffic and the fact that X's email addresses were available to harvest bots on the Web and in newsgroups for 20 years, it is no surprise that X has an abnormally high spam ratio, 81.6%.


    I'm not happy about this, first he says that this account has a abnormally high spam ratio and then says that a normal user can have 60%. Where do we get these figures from I would like to know as my average is pushing up against 100%. I don't think that there is such as thing as an average user, some people seem to get nearly no spam and the rest of us get almost complete spam.

    Reviewing todays inbox reveals around 200 emails, of which 8 were legit. You do the maths, I would be making progress if it was only 81%.
  11. To cut through the spam by NigelJohnstone · · Score: 4, Insightful

    Oh boy he goes on and on, if ever you wanted to cut out the spam in an article...

    His main points (at least the ones I agreed with):

    1. No training period, many features only turn on after lots of real emails have been processed. Fair enough.

    2. No purge window, stale emails get purged over time (e.g. 4 months), but in a test everthing is shoved through at once (in minutes) and so nothing gets purged. Again fair.

    The rest of it complains about the tester, or complains that it was less than ideal conditions & settings for the particular filter.
    We call that 'the real world' here.

    Sys admins are not experts in configuring filters.

    Also he should realise that any new filter gets a better rating than the dominant filter. Spammers try to defeat the most popular filter of the day. So sure a new filter might perform better than an existing one *initially* simply because the spammers are targetting it. Until it becomes dominant and then the spammers adjust the spam to defeat the new dominant filter.

    So in the real world the data set will always be unusual because the spammers make it that way.

  12. Confirmed: Architect IS a verb by cperciva · · Score: 4, Informative
    Quoth the OED:
    architect v. To design (a building). Also transf. and fig. Hence architected ppl. a., designed by an architect; architecting vbl. n. and ppl. a.

    The use of "architect" as a verb isn't even recently invented: Keats wrote "This was architected thus By the great Oceanus" in 1818.
  13. Re:And to that... by calebb · · Score: 3, Insightful

    "You mean like any other normal person who might be wanting to use such a product?"

    And to that, I would say... Someone writing an article for publication in a peer-reviewed journal should become experienced in their area of research before attempting to publish their results!

    For example, I'm sure you don't have much experience with Nuclear Magnetic Resonance imaging - And you might or might not have experience with X11 forwarding. But unless you are fluent with both of those topics, I would not expect you to attempt to publish a paper in a peer-reviewed journal discussing those topics!
    (Like I did, last December)

    However, for the sake of presenting some evidence to back up what I'm saying here, I'll take your example of Consumer Reports.

    From their site: CR has the most comprehensive auto-test program and reliability survey data of any U.S. publication; its auto experts have decades of experience in driving, testing, and reporting on cars.

    ...nevermind, I don't need to say anything else.

  14. Re:You don't like my software so I'll flame you by julesh · · Score: 3, Interesting

    He made a few very good points, but the overall tone was a little too ranty.

    This was the most important point, I think, and was buried 2/3rds of the way down:

    The emails being 8 months old, heuristic rules were clearly updated during this time to detect spams from the past eight months. The tests perform no analysis of how well SpamAssassin would do up against emails received the next day, or the next eight months. Essentially, by the time the tests were performed, SpamAssassin had already been told (by a programmer) to watch for these spams. [...] What good is a test to detect spam filter accuracy when the filter has clearly been programmed to detect its test set?

  15. Constructing arguments by cynicalmoose · · Score: 4, Informative

    As far as I understand, Cormack accepted that he was testing only on one person's corpus, and qualified his findings as such.

    This is something that is featured throughout the rebuttal - an argument that runs:
    a) Such and such was done incorrectly
    b) Therefore the system was inaccurate
    c) Therefore CRM-114 is better than stated

    The ultimate point where I lost patience was where he claimed that the results were invalid because they didn't conform to accepted, real world knowledge. The study was empirical; it shows something, based on how it was set up; and what it shows is valuable. If you discarded results each time they contradicted agreed wisdom we would still think of a geocentric universe.

    --
    Exercise your right not to vote. thinkoutside.org
  16. POPFile OTOH by JohnGrahamCumming · · Score: 3, Informative

    Actually publishes statistics from real users. If the user is willing POPFile sends back accuracy information to a central server and then a nightly cron job analyzes it and publishes information on the web for all to see.

    No need to read a study, or even the author's opinion. No wild claims made, just real data.

    Here it is:

    http://www.usethesource.com/popfile_stats.html

    Shows that POPFile has an _average accuracy_ over all users, including the training period of 95%. After it's seen 500 emails it has an accuracy of 97%. And the average POPFile user has 5 categories of classification.

    John.

  17. The problem w/ Bayes by king_ramen · · Score: 3, Informative

    As the author of this article states OVER and OVER, it is REALLY EASY to mess up your filters, and it is very tedious (with lots of permutations) to properly build your corpus. For a centralized spam filtering solution, the goals are: 1. Insulate the users from spam 2. Insulate the users from "administration" 3. Do no harm (no false positives) For these goals, I would take a "dumb" filter, set it conservatively, and hope for 80% catch rate and zero false positives. DSpam has a complicated workflow that requires EACH AND EVERY end user to complete a feedback loop. This is WAY to much to expect from people who are barely capable of finding Google. Unless the ONLY access to the mail is web-based, with a VERY clear "This is Spam" button, Bayes is a sysadmin's nightmare. My only gripe w/ SpamAssassin is performance. If I could get SPAMD to analyze headers in 25ms instead of 2000ms I'd never look back. As it is, DSPAM's performance has me very jealous.

    --
    ----- Refactoring is the reason why man does not mistake himself for a god.
  18. Re:Just read it - by Henry+Stern · · Score: 3, Informative

    1. Cormack is very inexperienced in the area of statistical filtering.

    Disagreed. Gordon Cormack has been doing information retrieval for 20 years. He is fairly well known in the area. See his publication history at DBLP.

    A far more likely conclusion about what's going on here is that Zdiarski's ego has been hurt. Both he and Dr. Yerazunis engage in some very sketchy statistics in their papers and I think that it has caught up to them.

    1. Yerazunis' study of "human classification performance" is fundamentally flawed. He did a "user study" where he sat down and re-classified a few thousand of his personal e-mails and wrote down how many mistakes he made. He repeats this experiment once and calls his results "conclusive." There are several reasons why this is not a sound methodology:

    a) He has only one test subject (himself). You cannot infer much about the population from a sample size of 1.

    b) He has already seen the messages before. We have very good associative memory. You will also notice that he makes fewer mistakes on the second run which indicates that a human's classification accuracy (on the same messages) increases with experience. For this very reason, it is of the utmost importance to test classification performance on unseen data. After all, the problem tends towards "duplicate detection" when you've seen the data before hand.

    c) He evaluates his own performance. When someone's own ego is on the line, you would expect that it would be very difficult to remain objective.

    2. Both Yerazunis and Zdziarski make use of "chained tokens" in their software. This is referred to in other circles as an "n-gram" model. As with many nonlinear models (the complexity of an n-gram model is exponential with n), it is very easy to over-fit the n-gram model to the training data. Natural language tends to follow the Pareto law (sometimes called the 80/20 rule) where the ranking of a term is inversely proportional to the frequency of occurence of that term. The exponential complexity of the n-gram model contributes to the sparse distribution of text leading to a database with noisy probability estimates.

    3. Zdziarski uses a "noise reduction algorithm" called Dobly to smooth out probability estimates in the messages. Aside from his unsubstantiated claim of increased accuracy, I have never seen anything to suggest that it actually works as advertised.

    Considering these points, I was not surprised at all by the results of Dr. Cormack's study. While one may argue that his experimental configuration can use some improvement, his evaluation methods are logically and statistically sound. What I personally saw in the results of this paper was that two classifiers that use unproven technology did not perform as advertised. After all, every other Bayes-based spam filter performed acceptably well.

    Lastly, I won't really touch his flawed arguments about how using domain knowledge about spam (i.e. SpamAssassin's heuristic) somehow hinders the classifier over time when you are also using a personalised classifier. You'll notice that SpamAssassin still did acceptably well when all of the rules were disabled.

    Go read some more of Zdziarski's work and draw your own conclusions about his work. Pay careful attention to his use of personal attacks when comparing his filter to that of others.

  19. It's a decent paper, but take it with some salt... by Ayanami+Rei · · Score: 4, Interesting

    ...this guy seriously believes the earth is a scant 10000 years old. And he dismisses all evidence to the contrary without a throuogh explanation. I can't help but wonder if he treat's other people's research with the same disregard.

    --
    THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
  20. Cormack and Lynam re Zdziarski's factual errors by gvc · · Score: 4, Informative
    We shall not respond to Mr. Zdziarski's attacks, except to identify the most outstanding factual errors and to note that ad hominem arguments are irrelevant in assessing the validity of our work.

    We encourage interested parties to read our paper and our points of fact re Zdziarski.

    Thomas Lynam
    Gordon Cormack
    June 24, 2004

  21. CRM114 is impossible to get installed by Anonymous Coward · · Score: 3, Insightful

    I remember going through the CRM114 installation docs, and vividly remember the 20 or so steps that I had to go through, and after about 3 or 4 hours of trying to get it installed, I finally gave up. I think part of the goal of software design is to make your software so that people will be able to quickly install and use it. The author of this program lost sight of this important point. I'm not going to sit there and reverse engineer some esoteric codebase just to get it working, and I'm sure alot of other people feel the same way. Therefore, I use SpamAssassin among other things, and it works really well and was quick and relatively painless to get working. I didn't have to go through their source code to figure out how to get it installed.

  22. the corpus was *not* classified by SA alone by jmason · · Score: 5, Informative

    My $.02. disclaimer: I'm one of the SA developers.

    • "The Corpus was Classified by SpamAssassin, for SpamAssassin", and "The Accuracy of the Test Subject's Corpus is Questionable":

      No, this is incorrect. Firstly, he states that he used user feedback to reclassify FNs and FPs (p. 4).

      The misunderstanding probably comes from p. 6, where he notes that he also ran SpamAssassin 2.63 over the "gold standard" corpus once it was complete, to verify his original classifications.

      However, in addition to that, he states 'all subsequent disagreements between the gold standard and later runs were also manually adjudicated, and all runs were repeated with the updated gold standard. The results presented here are based on this revised standard, in which all cases of disagreement have been vetted manually.' So in other words, the "gold standard" should be as near as possible to 100% accurate, since all the tested filters and the human classification have "had a shot" at classifying every mail, and the human has had final say on every misclassification.

      In other words, if any misclassifications remain in the "gold standard" corpus, every one of the tested filters agreed on that misclassification.

      IMO, that's as good as a hand-classified corpus can get.

    • "old versions of software were used":

      It's unrealistic to expect the author to use the most up-to-date versions of filters available by the time the paper is made available to the public. That's the difference between results and a paper -- it takes time to analyze results, write it up and come to valid conclusions, once the testing results are obtained. IMO, the author can't be faulted for spending some time on that end of things.

      Given that, using 6-month old release versions of the software under test seems reasonable.

      SpamAssassin 2.60, when new SpamAssassin rules were last added to a released ruleset, is 9 months old (released 2003-09-22); so logically, in testing against DSPAM 2.8 (released 2003-11-26), DSPAM should therefore have had the edge. ;)

    • "test started with untrained filters":

      IMO, that's the real world. People don't start with fully-trained filters.

      In addition, the graphs on pp. 15-20 show accuracy over the course of the entire 8 month period, so "post-training" accuracy can be viewed there.

    • "spam in the test is as old as 14 months":

      Nope, he states (p. 4) that the corpus uses mail between August 2003 and March 2004.

    • "it should purge old data":

      SpamAssassin purges its Bayes databases automatically, based on the age of messages in the corpus. We call it "expiry".

      In that test, the "SA-Standard" dataset would be using this, so stating "Cormack did not perform any purge simulation at all" is not accurate. However, that would not have increased SpamAssassin's accuracy figures, since we have generally have found that while it keeps the overhead of bayes database sizes and memory down, it marginally reduces accuracy, instead of increasing it (at the default settings).

      (Also worth noting that it can deal with being run from an en-masse check over a static corpus, as it uses the timestamp information in the Received headers rather than the current system time. So even if this test was run in the course of 4 hours, it'd still be an accurate simulation of what would happen in "real world" use over the course of 8 months.)

    And finally, what Henry said in comment 9520473.

    --j.