Slashdot Mirror


Response to Gordon Cormack's Study of Spam Detection

Nuclear Elephant writes "In light of Gordon Cormack's Study of Spam Detection recently posted on Slashdot, I felt compelled to architect an appropriate response to Cormack's technical errors in testing which ultimately explain why one of the world's most accurate spam filters (CRM114) could possibly end up at the bottom of the list, underneath SpamAssassin. I spend some time explaining what is a correct test process and keep my grievances simplified about the shortcomings of Cormack's research."

4 of 229 comments (clear)

  1. How I do by mirko · · Score: 5, Interesting

    I set many aliases to my official email and I gave all of these to and only to spammers.
    So, whenever I get a mail more than 95% similar to a mail that I know is a spam, I dump it.
    This combined with Apple's Mail.app Bayesian filter and there may only be a few spams left.

    --
    Trolling using another account since 2005.
  2. I wouldn't take this critique too seriously by EsbenMoseHansen · · Score: 5, Interesting

    There are several warning signs in this article.

    1. The author spends a lot of time trying to discredit the author on such terms as impartialness and experience. While such can lead credence to a strong case, it bodes when mentioned as the very first points. Also note the beginning of the article: "Many misled CS student...".
    2. The author has no statistical or published backings for his claim
    3. Most of the arguments are flawed, in my opionion. Yes, the corpus was trained on SpamAssassin, but the other filters' mistakes were, as far as I recall, examined for errors individually. Thus, any mistakes would be spotted or credit each filter equally.
    4. I also always find it suspect when someone claims: "Yes, the program did not perform, but with a different configuration it might/in the latest version it might". While it could be true, such claims needs backing.
    5. He claims that X's email was atypical, even for geeks. I would like to state here that I have 3 email accounts, of which none lie near his "typical" spam quotient (60%): 2 with >90% spams and 1 with <1% spam.

    That said, he does raise a few valid points, such as the timeline:

    1. If filters expunge old data based on time, this would not work in the test. That gives SpamAssisins' static rules an egde
    2. Configurations should really have been published. I see no reason why not.
    --
    Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
    1. Re:I wouldn't take this critique too seriously by int2str · · Score: 4, Interesting

      Yes, I agree with your points. The author spends way too much time dicrediting the study.

      I also have to say that my experience was much more along the line of Cormacks. I've tried DSPAM for a while on my server, starting from scratch. Training on error with only new emails. On a small mail server with about 10 users of different types (geeks, businesses, moms etc).
      - DSPAM took way too long to produce any kind of results
      - 2500 emails before advanced features kick in is *a lot* for the average soccer mom
      - DPSAM produced way too many false positives early on
      - The spam filtering accuracy leveled off at about 80% (number from DSPAMs web interfac)

      So this is not another overzealus CS student here, but real world testing.

      The DSPAM author does not address any of the real points and just rags on Cormack.

      Not much of a "rebutal" in my book.

  3. It's a decent paper, but take it with some salt... by Ayanami+Rei · · Score: 4, Interesting

    ...this guy seriously believes the earth is a scant 10000 years old. And he dismisses all evidence to the contrary without a throuogh explanation. I can't help but wonder if he treat's other people's research with the same disregard.

    --
    THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON