Response to Gordon Cormack's Study of Spam Detection

Posted by michael on Thursday June 24, 2004 @03:25AM from the even-stephen dept.

Nuclear Elephant writes "In light of Gordon Cormack's Study of Spam Detection recently posted on Slashdot, I felt compelled to architect an appropriate response to Cormack's technical errors in testing which ultimately explain why one of the world's most accurate spam filters (CRM114) could possibly end up at the bottom of the list, underneath SpamAssassin. I spend some time explaining what is a correct test process and keep my grievances simplified about the shortcomings of Cormack's research."

Excellent review by XMichael · 2004-06-24 03:33 · Score: 5, Informative

On the original forum, I was saying something similar (except not nearly as well written!! hehe)

DSPAM, IMHO, provides far better results than this report would lead you to believe. A properly trained Bayes filter, run by a somewhat intelligent person, provides simply amazing results. I swear I can go weeks on end without a single spam getting through, no false positives -- and between 20 and 100 SPAM in my "spam" box per day!

DSPAM using the Bayes algorithm is by far the best filtering method I've used. And I've used a lot! (From SpamAssassin to SpamProbe and everything in between.) The only setback: DSPAM takes a couple of weeks to train...

Priceless Photos

--
Gamblers Forum
SpamAssassin is good but not that good... by Shoeler · 2004-06-24 03:41 · Score: 5, Informative

For any users of SpamAssassin's 2.x branch (2.63 is current as of this writing), we all know how dated its signatures are right now. When the 2.6 branch was first released, I got zero spam and 100% ham for the first few weeks. Now that 3.x is being integrated as an ASF project and being apache-ized, updates have been slow and 3.x is still awaiting deployment.

Point being - I was darn surprised to see SA at the top of his charts.

Now - if only mimedefang would easily use another spam-checker....
Just read it - by calebb · 2004-06-24 03:44 · Score: 2, Informative

I just read the whole article - it does repeat itself a few times, but the author provides additional evidence each time he reiterates his theses:

1. Cormack is very inexperienced in the area of statistical filtering. Agreed!!!
2. Cormack went into the testing with many presuppositions. Also Agreed!!

And in case you're not familiar with the word presupposition:
1. To believe or suppose in advance.
2. To require or involve necessarily as an antecedent condition.

Overall, this is a very good article; check it out if you haven't already done so!
1. Re:Just read it - by Henry+Stern · 2004-06-24 06:03 · Score: 3, Informative
  
  1. Cormack is very inexperienced in the area of statistical filtering.
  
  Disagreed. Gordon Cormack has been doing information retrieval for 20 years. He is fairly well known in the area. See his publication history at DBLP.
  
  A far more likely conclusion about what's going on here is that Zdziarski's ego has been hurt. Both he and Dr. Yerazunis engage in some very sketchy statistics in their papers and I think that it has caught up to them.
  
  1. Yerazunis' study of "human classification performance" is fundamentally flawed. He did a "user study" where he sat down and re-classified a few thousand of his personal e-mails and wrote down how many mistakes he made. He repeats this experiment once and calls his results "conclusive." There are several reasons why this is not a sound methodology:
  
  a) He has only one test subject (himself). You cannot infer much about the population from a sample size of 1.
  
  b) He has already seen the messages before. We have very good associative memory. You will also notice that he makes fewer mistakes on the second run, which indicates that a human's classification accuracy (on the same messages) increases with experience. For this very reason, it is of the utmost importance to test classification performance on unseen data. After all, the problem tends towards "duplicate detection" when you've seen the data beforehand.
  
  c) He evaluates his own performance. When someone's own ego is on the line, you would expect that it would be very difficult to remain objective.
  
  2. Both Yerazunis and Zdziarski make use of "chained tokens" in their software. This is referred to in other circles as an "n-gram" model. As with many nonlinear models (the complexity of an n-gram model is exponential with n), it is very easy to over-fit the n-gram model to the training data. Natural language tends to follow Zipf's law (a close relative of the Pareto law, sometimes called the 80/20 rule), where the frequency of occurrence of a term is inversely proportional to its rank. The exponential complexity of the n-gram model contributes to the sparse distribution of text, leading to a database with noisy probability estimates.
  
  3. Zdziarski uses a "noise reduction algorithm" called Dobly to smooth out probability estimates in the messages. Aside from his unsubstantiated claim of increased accuracy, I have never seen anything to suggest that it actually works as advertised.
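
The sparsity argument in point 2 can be seen on a toy scale. The sketch below is only an illustration in Python, not how CRM114 or DSPAM actually tokenize: the helper names and the sample text are made up. It counts distinct token chains of length 1, 2 and 3 and reports what share of them occur exactly once, which is the kind of entry that yields a noisy probability estimate.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-token chains (as tuples) in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def singleton_share(tokens, n):
    """Fraction of distinct n-grams that occur exactly once."""
    counts = Counter(ngrams(tokens, n))
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Made-up sample text, purely for illustration.
text = ("buy cheap pills now buy cheap watches now "
        "cheap pills and cheap watches buy now").split()

for n in (1, 2, 3):
    distinct = len(set(ngrams(text, n)))
    print(f"n={n}: {distinct} distinct chains, "
          f"{singleton_share(text, n):.0%} seen only once")
```

On this sample the share of chains seen only once rises with each increase in n, even though the underlying text is identical. A real corpus is far larger, but the same pressure applies.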
  
  Considering these points, I was not surprised at all by the results of Dr. Cormack's study. While one may argue that his experimental configuration can use some improvement, his evaluation methods are logically and statistically sound. What I personally saw in the results of this paper was that two classifiers that use unproven technology did not perform as advertised. After all, every other Bayes-based spam filter performed acceptably well.
  
  Lastly, I won't really touch his flawed arguments about how using domain knowledge about spam (i.e. SpamAssassin's heuristic) somehow hinders the classifier over time when you are also using a personalised classifier. You'll notice that SpamAssassin still did acceptably well when all of the rules were disabled.
  
  Go read some more of Zdziarski's work and draw your own conclusions about his work. Pay careful attention to his use of personal attacks when comparing his filter to that of others.
False positives. by Christopher+Thomas · 2004-06-24 04:18 · Score: 2, Informative

I swear I can go weeks on end without a single spam getting through, no false positives -- and between 20 and 100 SPAM in my "spam" box per day!

This is what I don't get - in order to be sure you have no false positives, you have to comb through all of the spam by hand, which for the most part defeats the purpose of a spam filter. If you don't do so, then you can't claim zero false positives - you can only claim that you haven't _noticed_ any false positives.

I have a whitelist at work, and it works quite well, but combing through and emptying the spam bucket is still an annoying part of each day.

However, without doing so, I'll never know if I missed that one message in (about) a thousand that's from a vendor that's not in my whitelist.
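
To put a number on "haven't noticed any": if n messages from the spam folder are checked by hand and none turns out to be legitimate, the share of legitimate mail in that folder can still only be bounded, not declared zero. A minimal sketch in Python, assuming the checked messages are a random sample; the function name and the sample sizes are made up for illustration. The well-known "rule of three" (roughly 3/n at 95% confidence) is printed alongside for comparison.

```python
def fp_share_upper_bound(checked, confidence=0.95):
    """Upper bound on the share of legitimate mail in the spam folder after
    `checked` hand-reviewed messages turned up zero legitimate ones.

    Solves (1 - p) ** checked = 1 - confidence for p.
    """
    if checked <= 0:
        raise ValueError("need at least one checked message")
    return 1.0 - (1.0 - confidence) ** (1.0 / checked)


for n in (100, 1000, 10000):
    bound = fp_share_upper_bound(n)
    print(f"{n:>6} checked, none legitimate -> share below {bound:.4%} "
          f"(rule of three: {3 / n:.4%})")
```

The bound only tightens with the number of messages actually inspected, which is the point: without combing through the bucket, nothing can be claimed about the messages never looked at.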

QOTD: "I don't have a solution, but I do admire the problem.".
1. Re:False positives. by Xentax · 2004-06-24 07:21 · Score: 2, Informative
  
  This *is* already done - statistical filters are trained on both words that are 'spamlike' (words that show up only, or mostly, in lots of email marked by the user as spam), and words that are NOT (words that show up only, or mostly, in email marked not spam).
  
  This is (AFAIK) done against tokens in both the mail body and the headers, which pays dividends if the delivery paths are clustered (for example, if your whole family has accounts with MyISP.com, you'll probably get good filtering provided the spam isn't originating from MyISP.com as well).
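
The training described above can be sketched as a small naive-Bayes-style scorer. This is a generic illustration in Python, not the actual algorithm of DSPAM, CRM114 or SpamAssassin; the function names, the add-one smoothing and the four training messages are all assumptions made for the example. Header lines could be tokenized and fed in the same way as body text.

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (text, is_spam) pairs.

    Returns per-class counts of messages containing each token,
    plus the number of messages seen per class.
    """
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(set(text.lower().split()))
        totals[is_spam] += 1
    return counts, totals


def spam_probability(text, counts, totals):
    """Score a message from the tokens it contains (add-one smoothing).

    Assumes at least one spam and one non-spam message were trained.
    """
    log_odds = math.log(totals[True] / totals[False])
    for token in set(text.lower().split()):
        p_spam = (counts[True][token] + 1) / (totals[True] + 2)
        p_ham = (counts[False][token] + 1) / (totals[False] + 2)
        log_odds += math.log(p_spam / p_ham)
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1 + e)


# Made-up training data: two messages marked spam, two marked not spam.
training = [
    ("cheap pills buy now", True),
    ("buy cheap watches now", True),
    ("meeting notes attached", False),
    ("lunch at noon tomorrow", False),
]
counts, totals = train(training)
print(spam_probability("buy cheap pills", counts, totals))
print(spam_probability("meeting at noon", counts, totals))
```

Tokens that appear mostly in mail marked as spam push the score up, and tokens that appear mostly in mail marked as not spam pull it down, which is why both kinds of training matter.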
  
  Xentax
  
  --
  You shouldn't verb words.
Confirmed: Architect IS a verb by cperciva · 2004-06-24 04:24 · Score: 4, Informative

Quoth the OED:
architect v. To design (a building). Also transf. and fig. Hence architected ppl. a., designed by an architect; architecting vbl. n. and ppl. a.

The use of "architect" as a verb isn't even recently invented: Keats wrote "This was architected thus By the great Oceanus" in 1818.

--
Tarsnap: Online backups for the truly paranoid
Re:How I do by julesh · 2004-06-24 04:35 · Score: 3, Informative

Mail.app's filter isn't Bayesian. Please see previous slashdot article on how it works (I'm too lazy to find the reference right now).
Constructing arguments by cynicalmoose · 2004-06-24 04:40 · Score: 4, Informative

As far as I understand, Cormack accepted that he was testing only on one person's corpus, and qualified his findings as such.

This is something that is featured throughout the rebuttal - an argument that runs:
a) Such and such was done incorrectly
b) Therefore the system was inaccurate
c) Therefore CRM-114 is better than stated

The ultimate point where I lost patience was where he claimed that the results were invalid because they didn't conform to accepted, real world knowledge. The study was empirical; it shows something, based on how it was set up; and what it shows is valuable. If you discarded results each time they contradicted agreed wisdom we would still think of a geocentric universe.

--
Exercise your right not to vote. thinkoutside.org
POPFile OTOH by JohnGrahamCumming · 2004-06-24 04:47 · Score: 3, Informative

Actually publishes statistics from real users. If the user is willing, POPFile sends back accuracy information to a central server and then a nightly cron job analyzes it and publishes information on the web for all to see.

No need to read a study, or even the author's opinion. No wild claims made, just real data.

Here it is:

http://www.usethesource.com/popfile_stats.html

Shows that POPFile has an _average accuracy_ over all users, including the training period, of 95%. After it's seen 500 emails it has an accuracy of 97%. And the average POPFile user has 5 categories of classification.

John.
The problem w/ Bayes by king_ramen · 2004-06-24 05:12 · Score: 3, Informative

As the author of this article states OVER and OVER, it is REALLY EASY to mess up your filters, and it is very tedious (with lots of permutations) to properly build your corpus.

For a centralized spam filtering solution, the goals are:
1. Insulate the users from spam
2. Insulate the users from "administration"
3. Do no harm (no false positives)

For these goals, I would take a "dumb" filter, set it conservatively, and hope for 80% catch rate and zero false positives. DSpam has a complicated workflow that requires EACH AND EVERY end user to complete a feedback loop. This is WAY too much to expect from people who are barely capable of finding Google. Unless the ONLY access to the mail is web-based, with a VERY clear "This is Spam" button, Bayes is a sysadmin's nightmare.

My only gripe w/ SpamAssassin is performance. If I could get SPAMD to analyze headers in 25ms instead of 2000ms I'd never look back. As it is, DSPAM's performance has me very jealous.

--
----- Refactoring is the reason why man does not mistake himself for a god.
Atypical, high volume of traffic? by dougmc · 2004-06-24 05:36 · Score: 2, Informative

This seems very atypical. The test subject does not represent typical email behavior, except among the most hardcore geeks. Even still, typical hardcore geeks will adjust this behavior in an attempt to curve spam. The typical technical user (someone who makes his living online) will have the same email address for perhaps five or more years, and the typical non-technical user (a majority of the users on the Internet, lest we forget) will change email addresses every couple of years. In either case, most sane users use one or two variants at the most.
Who is Jonathan to decide what constitutes sanity?
Maybe I'm a hardcore geek, but I do do exactly what Gordon does -- have several accounts feeding a `master' mail account, using addresses I've owned for over a decade. I also post to Usenet and mailing lists with my unobfuscated mailing address -- I want people to be able to reach me, and I refuse to let the spammers take that away from me.
And I think I'm very sane, thank you.
49,000 emails in eight months is also absurd.
I agree. That's an absurdly *small* amount. I personally receive over 1500 spams/day -- so I'd have 49,000 in under a month. Obviously the amount of spam I receive is because I set myself up as a target, but I'm hardly the only one. Even Jonathan's email address is clearly listed on his page, unobfuscated, so he's doing it too, at least to some degree.
(As a piece of anecdotal evidence, Spamassassin catches all but about 4/day of the spams I get, and false positives are extremely rare. Of course, I have spent a good deal of time tweaking SA to work best with my email, and it now works very well.)
A good test should have included independent tests with corpora from 10-15 different test subject, of all walks of life - geek, doctor, etc.
That sounds fine in theory, but in practice it's hard to do. How many people from all non-geek walks of life save *all* their email, including spam, and are willing to give it to you so you can analyze it?
And merely capturing all their email won't do it -- they need to categorize it for you, because they're the only ones who can reliably decide what's spam *for them* and what's not.
I do agree that the study had more than its share of issues, but this critique goes way over the top.
Cormack and Lynam re Zdziarski's factual errors by gvc · 2004-06-24 06:48 · Score: 4, Informative

We shall not respond to Mr. Zdziarski's attacks, except to identify the most outstanding factual errors and to note that ad hominem arguments are irrelevant in assessing the validity of our work.
We encourage interested parties to read our paper and our points of fact re Zdziarski.
Thomas Lynam
Gordon Cormack
June 24, 2004
the corpus was *not* classified by SA alone by jmason · 2004-06-24 08:09 · Score: 5, Informative
My $.02. disclaimer: I'm one of the SA developers.
- "The Corpus was Classified by SpamAssassin, for SpamAssassin", and "The Accuracy of the Test Subject's Corpus is Questionable":
  
  No, this is incorrect. Firstly, he states that he used user feedback to reclassify FNs and FPs (p. 4).
  
  The misunderstanding probably comes from p. 6, where he notes that he also ran SpamAssassin 2.63 over the "gold standard" corpus once it was complete, to verify his original classifications.
  
  However, in addition to that, he states 'all subsequent disagreements between the gold standard and later runs were also manually adjudicated, and all runs were repeated with the updated gold standard. The results presented here are based on this revised standard, in which all cases of disagreement have been vetted manually.' So in other words, the "gold standard" should be as near as possible to 100% accurate, since all the tested filters and the human classification have "had a shot" at classifying every mail, and the human has had final say on every misclassification.
  
  In other words, if any misclassifications remain in the "gold standard" corpus, every one of the tested filters agreed on that misclassification.
  
  IMO, that's as good as a hand-classified corpus can get.
- "old versions of software were used":
  
  It's unrealistic to expect the author to use the most up-to-date versions of filters available by the time the paper is made available to the public. That's the difference between results and a paper -- it takes time to analyze results, write it up and come to valid conclusions, once the testing results are obtained. IMO, the author can't be faulted for spending some time on that end of things.
  
  Given that, using 6-month old release versions of the software under test seems reasonable.
  
  SpamAssassin 2.60, when new SpamAssassin rules were last added to a released ruleset, is 9 months old (released 2003-09-22); so logically, in testing against DSPAM 2.8 (released 2003-11-26), DSPAM should therefore have had the edge. ;)
- "test started with untrained filters":
  
  IMO, that's the real world. People don't start with fully-trained filters.
  
  In addition, the graphs on pp. 15-20 show accuracy over the course of the entire 8 month period, so "post-training" accuracy can be viewed there.
- "spam in the test is as old as 14 months":
  
  Nope, he states (p. 4) that the corpus uses mail between August 2003 and March 2004.
- "it should purge old data":
  
  SpamAssassin purges its Bayes databases automatically, based on the age of messages in the corpus. We call it "expiry".
  
  In that test, the "SA-Standard" dataset would be using this, so stating "Cormack did not perform any purge simulation at all" is not accurate. However, that would not have increased SpamAssassin's accuracy figures, since we have generally found that while it keeps the overhead of Bayes database sizes and memory down, it marginally reduces accuracy, instead of increasing it (at the default settings).
  
  (Also worth noting that it can deal with being run from an en-masse check over a static corpus, as it uses the timestamp information in the Received headers rather than the current system time. So even if this test was run in the course of 4 hours, it'd still be an accurate simulation of what would happen in "real world" use over the course of 8 months.)
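
The idea of ageing tokens by message time rather than wall-clock time can be sketched as follows. This is a hypothetical illustration in Python, not SpamAssassin's actual expiry code: the `TokenStore` class, its method names and the 30-day window are assumptions made for the example. The only library call relied on is `email.utils.parsedate_to_datetime` from the Python standard library.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def received_time(received_header):
    """Parse the timestamp that follows the last ';' in a Received header."""
    return parsedate_to_datetime(received_header.rsplit(";", 1)[1].strip())


class TokenStore:
    """Hypothetical token database that ages entries by message time."""

    def __init__(self, max_age=timedelta(days=30)):  # window is an assumption
        self.max_age = max_age
        self.last_seen = {}  # token -> time of the newest message containing it
        self.newest = None   # newest message time seen so far

    def learn(self, tokens, received_header):
        when = received_time(received_header)
        if self.newest is None or when > self.newest:
            self.newest = when
        for token in tokens:
            if token not in self.last_seen or when > self.last_seen[token]:
                self.last_seen[token] = when

    def expire(self):
        """Drop tokens not seen within max_age of the newest message."""
        if self.newest is None:
            return 0
        cutoff = self.newest - self.max_age
        stale = [t for t, seen in self.last_seen.items() if seen < cutoff]
        for t in stale:
            del self.last_seen[t]
        return len(stale)


# Two made-up messages, seven months apart, learned back to back.
store = TokenStore()
store.learn(["viagra"], "from a.example by b.example; Fri, 1 Aug 2003 10:00:00 +0000")
store.learn(["mortgage"], "from a.example by b.example; Mon, 1 Mar 2004 10:00:00 +0000")
removed = store.expire()
print(removed, sorted(store.last_seen))
```

Because the cutoff is computed from the newest message time, replaying months of mail in a few hours ages the database the same way as receiving it in real time.
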
And finally, what Henry said in comment 9520473.

--j.
Or better yet... by Anonymous Coward · 2004-06-24 08:35 · Score: 0, Informative

Just fucking try the software yourself. Quite simply, SpamAssassin blows, and this is the consistent opinion of the ~4000 people here who have been stuck with it for now. Testing out CRM114 and DSPAM on limited (100 each) groups of people is showing both to be an order of magnitude better than SA. I can't say which is better, but I can say for certain both are in a whole other league from SA, which lets in 1/20 or so spams, and likes to flag obnoxious HTML-laden email from management types as spam, much to their disdain. Both of the statistical filters are much better, with test people seeing between 1/100 and 1/500 spams getting through, with only a handful of false positives.