Response to Gordon Cormack's Study of Spam Detection

Posted by michael on Thursday June 24, 2004 @03:25AM from the even-stephen dept.

Nuclear Elephant writes "In light of Gordon Cormack's Study of Spam Detection recently posted on Slashdot, I felt compelled to write an appropriate response to the technical errors in Cormack's testing, which ultimately explain how one of the world's most accurate spam filters (CRM114) could end up at the bottom of the list, underneath SpamAssassin. I spend some time explaining what a correct test process is and keep my grievances about the shortcomings of Cormack's research simple."

How I do it by mirko · 2004-06-24 03:32 · Score: 5, Interesting

I set up many aliases for my official email address and gave them to spammers, and only to spammers.
So, whenever I get a mail that is more than 95% similar to a mail I know is spam, I dump it.
Combine this with Apple's Mail.app Bayesian filter and there may be only a few spams left.
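
A rough sketch of the "95% similar" check described above. This is only an illustration, not mirko's actual setup: the function name `is_known_spam` and the use of Python's `difflib.SequenceMatcher` as the similarity measure are assumptions, and the known-spam list stands in for whatever arrives at the spammer-only aliases.

```python
from difflib import SequenceMatcher

# The "more than 95% similar" cutoff from the comment, as a ratio.
SIMILARITY_THRESHOLD = 0.95


def is_known_spam(body, known_spam_bodies, threshold=SIMILARITY_THRESHOLD):
    """Return True if body is more than `threshold` similar to any known spam.

    known_spam_bodies is a hypothetical list of message bodies collected
    from aliases that were only ever handed to spammers.
    """
    for spam in known_spam_bodies:
        # ratio() is 2*M/T: matched characters over total characters.
        ratio = SequenceMatcher(None, body, spam, autojunk=False).ratio()
        if ratio > threshold:
            return True
    return False
```

Comparing every incoming mail against every trapped spam is quadratic in message length per pair, so a real setup would probably hash or fingerprint messages first; the sketch only shows the thresholding idea.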

--
Trolling using another account since 2005.
Excellent review by XMichael · 2004-06-24 03:33 · Score: 5, Informative

On the original forum, I was saying something similar (except not nearly as well written!! hehe)

DSPAM, IMHO, provides far better results than this report was leading to. A Bayes filter properly trained by a somewhat intelligent person provides simply amazing results. I swear I can go weeks on end without a single spam getting through, no false positives -- and between 20 and 100 spams in my "spam" box per day!

DSPAM, using a Bayes algorithm, is by far the best filtering method I've used. And I've used a lot! (From SpamAssassin to SpamProbe and all the in-betweens.) The only setback: DSPAM takes a couple of weeks to train...

Priceless Photos

--
Gamblers Forum
Studies create discussion by Timesprout · 2004-06-24 03:39 · Score: 5, Insightful

I usually frown when I see many of these so-called studies offering conclusions, several of which differ radically from my own experience. The recent Java/C++ performance one was a classic example. It gets annoying when a pro-MS result is immediately decried as marketing FUD because it just can't be better, and a pro-Linux result is taken as gospel truth here on /. Usually I tend to take all results with a grain of salt or just plain ignore them and focus on the debate around them.

The benefit of these studies, though, is that, fanatical crap aside, informed people will usually take the time to interpret results or suggest corrections/improvements that actually benefit developers and improve their knowledge base more than any information provided by the actual study.

--
Do not try to read the dupe, that's impossible. Instead, only try to realize the truth
What truth?
There is no dupe
SpamAssassin is good but not that good... by Shoeler · 2004-06-24 03:41 · Score: 5, Informative

For any users of SpamAssassin's 2.x branch (2.63 is current as of this writing), we all know how dated its signatures are right now. When the 2.6 branch was first released, I got zero spam and 100% ham for the first few weeks. Now that 3.x is being brought into the ASF and apache-ized, updates have been slow and 3.x is still awaiting deployment.

Point being - I was darn surprised to see SA at the top of his charts.

Now - if only MIMEDefang would easily use another spam-checker....
Re:You don't like my software so I'll flame you by Otter · 2004-06-24 03:52 · Score: 5, Insightful
There are some technical objections in there (old versions of software, the fact that SpamAssassin was tested with a spam collection generated by SpamAssassin). But honestly, after wading through all the whining and sneering, I didn't have the energy to pick the points out of the overall flow.
Jonathan, next time:
- Start by summarizing your technical objections.
- Continue by detailing your technical objections.
- Leave the nasty rants to the end, or better yet, leave them out entirely.
- Stop talking about "geeks" in every paragraph.
- Please stop referring to spam filter comparisons as "science".
--
What I'm listening to now on Pandora...
I wouldn't take this critique too seriously by EsbenMoseHansen · 2004-06-24 04:12 · Score: 5, Interesting
There are several warning signs in this article.
1. The author spends a lot of time trying to discredit the study's author on grounds such as impartiality and experience. While such points can lend credence to a strong case, it bodes ill when they are raised as the very first points. Also note the beginning of the article: "Many misled CS student...".
2. The author has no statistical or published backing for his claims.
3. Most of the arguments are flawed, in my opinion. Yes, the corpus was trained on SpamAssassin, but the other filters' mistakes were, as far as I recall, examined for errors individually. Thus, any mistakes would either be spotted or would credit each filter equally.
4. I also always find it suspect when someone claims: "Yes, the program did not perform, but with a different configuration it might/in the latest version it might". While it could be true, such claims need backing.
5. He claims that X's email was atypical, even for geeks. I would like to state here that I have 3 email accounts, of which none lie near his "typical" spam quotient (60%): 2 with >90% spams and 1 with <1% spam.
That said, he does raise a few valid points, such as the timeline:
1. If filters expunge old data based on time, this would not work in the test. That gives SpamAssassin's static rules an edge.
2. Configurations should really have been published. I see no reason why not.
--
Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
the corpus was *not* classified by SA alone by jmason · 2004-06-24 08:09 · Score: 5, Informative
My $.02. Disclaimer: I'm one of the SA developers.
- "The Corpus was Classified by SpamAssassin, for SpamAssassin", and "The Accuracy of the Test Subject's Corpus is Questionable":
  
  No, this is incorrect. Firstly, he states that he used user feedback to reclassify FNs and FPs (p. 4).
  
  The misunderstanding probably comes from p. 6, where he notes that he also ran SpamAssassin 2.63 over the "gold standard" corpus once it was complete, to verify his original classifications.
  
  However, in addition to that, he states 'all subsequent disagreements between the gold standard and later runs were also manually adjudicated, and all runs were repeated with the updated gold standard. The results presented here are based on this revised standard, in which all cases of disagreement have been vetted manually.' So in other words, the "gold standard" should be as near as possible to 100% accurate, since all the tested filters and the human classification have "had a shot" at classifying every mail, and the human has had final say on every misclassification.
  
  In other words, if any misclassifications remain in the "gold standard" corpus, every one of the tested filters agreed on that misclassification.
  
  IMO, that's as good as a hand-classified corpus can get.
- "old versions of software were used":
  
  It's unrealistic to expect the author to use the most up-to-date versions of filters available by the time the paper is made available to the public. That's the difference between results and a paper -- it takes time to analyze results, write it up and come to valid conclusions, once the testing results are obtained. IMO, the author can't be faulted for spending some time on that end of things.
  
  Given that, using 6-month old release versions of the software under test seems reasonable.
  
  SpamAssassin 2.60, when new SpamAssassin rules were last added to a released ruleset, is 9 months old (released 2003-09-22); so logically, in testing against DSPAM 2.8 (released 2003-11-26), DSPAM should therefore have had the edge. ;)
- "test started with untrained filters":
  
  IMO, that's the real world. People don't start with fully-trained filters.
  
  In addition, the graphs on pp. 15-20 show accuracy over the course of the entire 8 month period, so "post-training" accuracy can be viewed there.
- "spam in the test is as old as 14 months":
  
  Nope, he states (p. 4) that the corpus uses mail between August 2003 and March 2004.
- "it should purge old data":
  
  SpamAssassin purges its Bayes databases automatically, based on the age of messages in the corpus. We call it "expiry".
  
  In that test, the "SA-Standard" dataset would be using this, so stating "Cormack did not perform any purge simulation at all" is not accurate. However, that would not have increased SpamAssassin's accuracy figures, since we have generally found that while it keeps the overhead of Bayes database sizes and memory down, it marginally reduces accuracy instead of increasing it (at the default settings).
  
  (Also worth noting that it can deal with being run from an en-masse check over a static corpus, as it uses the timestamp information in the Received headers rather than the current system time. So even if this test was run in the course of 4 hours, it'd still be an accurate simulation of what would happen in "real world" use over the course of 8 months.)
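
To illustrate the point about expiry keyed to message time rather than wall-clock time, here is a minimal sketch. It is not SpamAssassin's actual expiry code: the helper names `received_time` and `expire_tokens`, the token store as a plain dict of last-seen timestamps, and the 90-day window are all assumptions made for the example.

```python
from datetime import timedelta
from email import message_from_string
from email.utils import parsedate_to_datetime


def received_time(raw_message):
    """Timestamp from the topmost Received header, or None if unavailable.

    The date in a Received header is the text after the last ';'.
    """
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if not received:
        return None
    stamp = received[0].rsplit(";", 1)[-1].strip()
    try:
        return parsedate_to_datetime(stamp)
    except (TypeError, ValueError):
        return None


def expire_tokens(last_seen, newest_message_time, max_age=timedelta(days=90)):
    """Keep only tokens seen within max_age of the newest message.

    The cutoff is relative to the mail stream's own clock, not to now(),
    so replaying months of mail in a few hours expires the same tokens
    that would have expired in live use.
    """
    cutoff = newest_message_time - max_age
    return {tok: seen for tok, seen in last_seen.items() if seen >= cutoff}
```

Because the cutoff comes from the messages themselves, running the corpus through in one batch and running it live over 8 months would drop the same tokens at the same points in the stream.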
And finally, what Henry said in comment 9520473.

--j.