Response to Gordon Cormack's Study of Spam Detection

Posted by michael on Thursday June 24, 2004 @03:25AM from the even-stephen dept.

Nuclear Elephant writes "In light of Gordon Cormack's Study of Spam Detection recently posted on Slashdot, I felt compelled to architect an appropriate response to Cormack's technical errors in testing which ultimately explain why one of the world's most accurate spam filters (CRM114) could possibly end up at the bottom of the list, underneath SpamAssassin. I spend some time explaining what is a correct test process and keep my grievances simplified about the shortcomings of Cormack's research."

6 of 229 comments

How I do by mirko · 2004-06-24 03:32 · Score: 5, Interesting

I set up many aliases for my official email address and gave them to spammers, and only to spammers.
So, whenever I get a mail that is more than 95% similar to a mail I know is spam, I dump it.
Combine this with Apple's Mail.app Bayesian filter and only a few spams may be left.
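
A minimal sketch of the "more than 95% similar" check, assuming plain-text message bodies and Python's standard `difflib` as the similarity measure. The poster does not say how similarity was computed, so the function names, the sample messages, and the choice of `SequenceMatcher` are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0..1 similarity ratio between two message bodies."""
    return SequenceMatcher(None, a, b).ratio()

def looks_like_known_spam(message, known_spam, threshold=0.95):
    """True if the message is more than `threshold` similar to any stored spam."""
    return any(similarity(message, spam) > threshold for spam in known_spam)

# Hypothetical spam collected through a honeypot alias.
known_spam = ["Buy cheap watches now! Visit our site today for the best deals."]

# A near-copy of the stored spam, and an unrelated message.
near_copy = "Buy cheap watches now! Visit our site today for the best deal."
unrelated = "Lunch at noon tomorrow?"

print(looks_like_known_spam(near_copy, known_spam))
print(looks_like_known_spam(unrelated, known_spam))
```

Comparing every incoming mail against every stored spam gets slow as the collection grows, so a real setup would probably hash or index the messages first.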

--
Trolling using another account since 2005.
I wouldn't take this critique too seriously by EsbenMoseHansen · 2004-06-24 04:12 · Score: 5, Interesting
There are several warning signs in this article.
1. The author spends a lot of time trying to discredit the study's author on such grounds as impartiality and experience. While such points can lend credence to a strong case, it bodes ill when they are raised as the very first points. Also note the beginning of the article: "Many misled CS student...".
2. The author has no statistical or published backing for his claims.
3. Most of the arguments are flawed, in my opinion. Yes, the corpus was trained on SpamAssassin, but the other filters' mistakes were, as far as I recall, examined for errors individually. Thus, any mistakes would be spotted or would credit each filter equally.
4. I also always find it suspect when someone claims: "Yes, the program did not perform, but with a different configuration it might/in the latest version it might". While it could be true, such claims need backing.
5. He claims that X's email was atypical, even for geeks. I would like to state here that I have 3 email accounts, none of which lies near his "typical" spam quotient (60%): 2 with >90% spam and 1 with <1% spam.
That said, he does raise a few valid points, such as the timeline:
1. If filters expunge old data based on time, this would not work in the test. That gives SpamAssassin's static rules an edge.
2. Configurations should really have been published. I see no reason why not.
--
Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
1. Re:I wouldn't take this critique too seriously by int2str · 2004-06-24 04:49 · Score: 4, Interesting
  
  Yes, I agree with your points. The author spends way too much time discrediting the study.
  
  I also have to say that my experience was much more along the lines of Cormack's. I tried DSPAM for a while on my server, starting from scratch and training on error with only new emails, on a small mail server with about 10 users of different types (geeks, businesses, moms, etc.).
  - DSPAM took way too long to produce any kind of results
  - 2500 emails before advanced features kick in is *a lot* for the average soccer mom
  - DSPAM produced way too many false positives early on
  - The spam filtering accuracy leveled off at about 80% (number from DSPAM's web interface)
  
  So this is not another overzealous CS student here, but real world testing.
  
  The DSPAM author does not address any of the real points and just rags on Cormack.
  
  Not much of a "rebuttal" in my book.
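
For readers unfamiliar with the term, "training on error" means the filter learns only from messages it got wrong. A minimal sketch, assuming a toy token-presence classifier: the `TinyFilter` class and its scoring are hypothetical and are not DSPAM's algorithm.

```python
import math
from collections import Counter

class TinyFilter:
    """Toy classifier based on which tokens appear in spam vs. ham messages."""

    def __init__(self):
        self.doc_freq = {"spam": Counter(), "ham": Counter()}
        self.n_msgs = {"spam": 0, "ham": 0}

    def score(self, text):
        """Sum of per-token log-odds of spam, with add-one smoothing."""
        total = 0.0
        for tok in set(text.lower().split()):
            p_spam = (self.doc_freq["spam"][tok] + 1) / (self.n_msgs["spam"] + 2)
            p_ham = (self.doc_freq["ham"][tok] + 1) / (self.n_msgs["ham"] + 2)
            total += math.log(p_spam / p_ham)
        return total

    def classify(self, text):
        return "spam" if self.score(text) > 0 else "ham"

    def train(self, text, label):
        self.doc_freq[label].update(set(text.lower().split()))
        self.n_msgs[label] += 1

def train_on_error(flt, stream):
    """Feed (text, label) pairs in order; train only on misclassified ones."""
    errors = 0
    for text, label in stream:
        if flt.classify(text) != label:
            flt.train(text, label)
            errors += 1
    return errors

# Made-up mail stream for illustration.
stream = [
    ("cheap pills online", "spam"),
    ("meeting at noon", "ham"),
    ("cheap pills online", "spam"),
]
flt = TinyFilter()
print(train_on_error(flt, stream))
```

Starting from an empty filter, every early mistake is also a training event, which is one plausible reason a from-scratch install looks bad in its first weeks.
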
Re:You don't like my software so I'll flame you by julesh · 2004-06-24 04:37 · Score: 3, Interesting

He made a few very good points, but the overall tone was a little too ranty.

This was the most important point, I think, and was buried 2/3rds of the way down:

"The emails being 8 months old, heuristic rules were clearly updated during this time to detect spams from the past eight months. The tests perform no analysis of how well SpamAssassin would do up against emails received the next day, or the next eight months. Essentially, by the time the tests were performed, SpamAssassin had already been told (by a programmer) to watch for these spams. [...] What good is a test to detect spam filter accuracy when the filter has clearly been programmed to detect its test set?"
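
The quoted objection amounts to asking for a chronological split: score a rule set only on mail received after the rules were last updated. A minimal sketch of that idea, assuming each message carries a received date; the dates, messages, and `keyword_rule` below are made up for illustration and come from neither the study nor the rebuttal.

```python
from datetime import date

def split_by_cutoff(messages, cutoff):
    """Split (received, text, label) tuples into on-or-before and after a cutoff."""
    before = [m for m in messages if m[0] <= cutoff]
    after = [m for m in messages if m[0] > cutoff]
    return before, after

def accuracy(classify, messages):
    """Fraction of messages whose predicted label matches the true label."""
    if not messages:
        return 0.0
    hits = sum(1 for _, text, label in messages if classify(text) == label)
    return hits / len(messages)

def keyword_rule(text):
    """Hypothetical static rule written after seeing the older spam."""
    return "spam" if "cheap" in text else "ham"

# Made-up corpus: two messages predate the rule update, two arrive later.
messages = [
    (date(2003, 10, 1), "cheap pills", "spam"),
    (date(2003, 11, 5), "meeting notes", "ham"),
    (date(2004, 6, 1), "ch3ap p1lls", "spam"),
    (date(2004, 6, 2), "lunch plans", "ham"),
]
rules_updated_on = date(2004, 1, 1)

seen, unseen = split_by_cutoff(messages, rules_updated_on)
print(accuracy(keyword_rule, seen), accuracy(keyword_rule, unseen))
```

Only the second number says anything about how the rule would fare against mail it was not written for.
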
DSPAM by Big+Boss · 2004-06-24 04:49 · Score: 2, Interesting

I don't claim to have done any scientific studies on the subject, but I have tried a number of different anti-spam solutions over the past few years. In my experience, the best solution is a multi-pronged approach that takes advantage of the strong points of a few setups.

If you want to talk about the results from a single filter in my current arsenal, I would give DSPAM the highest marks. I found it to catch more spams than a trained and customized SpamAssassin with no false positives. It's also very fast, unlike SA. My current setup is as follows...

1) RBLs via Postfix. I probably block 80% of inbound spam this way. I choose my RBLs carefully to limit false positives.

2) DSPAM. I typically get better than 99% of the ones that slip through the RBLs with DSPAM.

3) A complex procmail.rc that uses some statistical rules and a few simple checks, such as "is the mail addressed to me". I also use procmail to sort my mailing list messages into IMAP boxes and it includes a simple whitelist.

4) SpamAssassin. This doesn't run much anymore, but I keep it around anyway as a last resort checker. If a mail makes it through all the above, SA gets a shot at it.
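
To see what the first two layers add up to, here is a rough calculation using the figures quoted in the list (about 80% blocked by RBLs, then 99% of the remainder caught by DSPAM). It assumes each stage's rate applies to whatever the earlier stages let through; the function name is made up for illustration.

```python
def combined_catch_rate(stage_rates):
    """Overall fraction caught when each stage sees only what earlier stages missed."""
    missed = 1.0
    for rate in stage_rates:
        missed *= 1.0 - rate
    return 1.0 - missed

# RBLs at roughly 80%, then DSPAM at roughly 99% of what slips through.
overall = combined_catch_rate([0.80, 0.99])
print(round(overall, 4))
```

Under those assumptions about 0.2% of inbound spam would get past both layers, before procmail and SpamAssassin even see it.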

I tried using SA as my only post-RBL filter for a couple of months, but it wasn't getting the job done. I then added the procmail script, but still wasn't happy. Putting DSPAM in front of it all seems to work best for me. I now find that only a few spams per month make it past DSPAM (they sort into separate boxes so I can track their performance) and I haven't seen a false positive in quite some time, over a month anyway. I've only been using DSPAM for a few months.

What works for me may be crap for you. Try a few things till you find something that works for you and use that. If you're trying statistical filters, keep in mind that it takes a while to train them. I found I got better than 90% with DSPAM after a small corpus feed and about a week of training.
It's a decent paper, but take it with some salt... by Ayanami+Rei · 2004-06-24 06:33 · Score: 4, Interesting

...this guy seriously believes the earth is a scant 10000 years old. And he dismisses all evidence to the contrary without a thorough explanation. I can't help but wonder if he treats other people's research with the same disregard.

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON