Response to Gordon Cormack's Study of Spam Detection
Nuclear Elephant writes "In light of Gordon Cormack's Study of Spam Detection recently posted on Slashdot, I felt compelled to architect an appropriate response to Cormack's technical errors in testing which ultimately explain why one of the world's most accurate spam filters (CRM114) could possibly end up at the bottom of the list, underneath SpamAssassin. I spend some time explaining what is a correct test process and keep my grievances simplified about the shortcomings of Cormack's research."
I set up many aliases for my official email address and gave them to spammers, and only to spammers.
So whenever I get a mail that is more than 95% similar to one I know is spam, I dump it.
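For anyone wondering what a "more than 95% similar" check might look like, here is a minimal sketch using Python's standard difflib. The poster does not say how similarity is measured, so the SequenceMatcher ratio, the function names, and the sample message are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List

# Threshold taken from the comment above: "more than 95% similar".
SIMILARITY_THRESHOLD = 0.95


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two message bodies."""
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long inputs.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def looks_like_known_spam(message: str, known_spam: List[str],
                          threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """True if the message is more than `threshold` similar to any known spam."""
    return any(similarity(message, spam) > threshold for spam in known_spam)


# Hypothetical corpus: mail that arrived at the spammer-only aliases.
known_spam = ["Buy cheap pills now! Visit http://example.invalid today"]

print(looks_like_known_spam(
    "Buy cheap pills now! Visit http://example.invalid todaY", known_spam))
print(looks_like_known_spam("Lunch at noon tomorrow?", known_spam))
```

Comparing every incoming mail against every known spam gets slow as the corpus grows, so a real setup would probably hash or fingerprint messages first. That part is left out here.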
Combined with the Bayesian filter in Apple's Mail.app, this leaves only a few spams.
Trolling using another account since 2005.
There are several warning signs in this article.
That said, he does raise a few valid points, such as the timeline:
Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
He made a few very good points, but the overall tone was a little too ranty.
This was the most important point, I think, and it was buried two-thirds of the way down:
The emails being 8 months old, heuristic rules were clearly updated during this time to detect spams from the past eight months. The tests perform no analysis of how well SpamAssassin would do up against emails received the next day, or the next eight months. Essentially, by the time the tests were performed, SpamAssassin had already been told (by a programmer) to watch for these spams. [...] What good is a test to detect spam filter accuracy when the filter has clearly been programmed to detect its test set?
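The point in that quote is that a fair test only scores a filter on mail received after the filter was last configured or trained. A rough sketch of such a chronological split follows. The tuple layout, the function name, and the sample dates are made up for illustration and are not taken from Cormack's study or from the response.

```python
from datetime import datetime
from typing import List, Tuple

# Assumed record layout: (received_at, body, is_spam)
Message = Tuple[datetime, str, bool]


def chronological_split(messages: List[Message],
                        cutoff: datetime) -> Tuple[List[Message], List[Message]]:
    """Split mail so the filter is only tested on mail newer than its rules.

    Mail received before `cutoff` may be used to train or tune the filter;
    mail received at or after `cutoff` is held back for scoring.
    """
    train = [m for m in messages if m[0] < cutoff]
    test = [m for m in messages if m[0] >= cutoff]
    return train, test


# Hypothetical corpus and cutoff (the date the filter's rules were frozen).
corpus = [
    (datetime(2004, 1, 10), "old spam", True),
    (datetime(2004, 2, 5), "old ham", False),
    (datetime(2004, 9, 1), "new spam", True),
]
train, test = chronological_split(corpus, cutoff=datetime(2004, 6, 1))
print(len(train), len(test))
```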
I don't claim to have done any scientific studies on the subject, but I have tried a number of different anti-spam solutions over the past few years. In my experience, the best solution is a multi-pronged approach that takes advantage of the strong points of a few setups.
If you want to talk about the results from a single filter in my current arsenal, I would give DSPAM the highest marks. I found it to catch more spams than a trained and customized SpamAssassin with no false positives. It's also very fast, unlike SA. My current setup is as follows...
1) RBLs via Postfix. I probably block 80% of inbound spam this way. I choose my RBLs carefully to limit false positives.
2) DSPAM. I typically get better than 99% of the ones that slip through the RBLs with DSPAM.
3) A complex procmail.rc that uses some statistical rules and a few simple checks, such as "is the mail addressed to me". I also use procmail to sort my mailing list messages into IMAP boxes and it includes a simple whitelist.
4) SpamAssassin. This doesn't run much anymore, but I keep it around anyway as a last-resort checker. If a mail makes it through all of the above, SA gets a shot at it.
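The four steps above amount to a chain where each stage either gives a verdict or passes the mail to the next one. Here is a toy sketch of that control flow. The stage functions, thresholds, and message fields are hypothetical stand-ins and do not reflect the actual Postfix, DSPAM, procmail, or SpamAssassin interfaces.

```python
from typing import Optional, Tuple

# Hypothetical data for illustration only.
BLOCKLIST = {"192.0.2.1"}            # assumed RBL contents
MY_ADDRESS = "me@example.org"        # assumed recipient address
WHITELIST = {"friend@example.org"}   # assumed whitelist

Verdict = Optional[str]  # "spam", "ham", or None to pass the mail along


def rbl_stage(msg: dict) -> Verdict:
    """Step 1: reject mail from blocklisted senders."""
    return "spam" if msg.get("sender_ip") in BLOCKLIST else None


def statistical_stage(msg: dict) -> Verdict:
    """Step 2: stand-in for a trained statistical filter's score."""
    return "spam" if msg.get("spam_probability", 0.0) >= 0.9 else None


def rules_stage(msg: dict) -> Verdict:
    """Step 3: whitelist plus an "is the mail addressed to me" check."""
    if msg.get("sender") in WHITELIST:
        return "ham"
    if MY_ADDRESS not in msg.get("recipients", []):
        return "spam"
    return None


def last_resort_stage(msg: dict) -> Verdict:
    """Step 4: stand-in for a rule-based score checked last."""
    return "spam" if msg.get("rule_score", 0.0) >= 5.0 else None


STAGES = [
    ("rbl", rbl_stage),
    ("statistical", statistical_stage),
    ("rules", rules_stage),
    ("last_resort", last_resort_stage),
]


def classify(msg: dict) -> Tuple[str, str]:
    """Return (verdict, name of the stage that decided)."""
    for name, stage in STAGES:
        verdict = stage(msg)
        if verdict is not None:
            return verdict, name
    return "ham", "default"  # nothing objected, so deliver


print(classify({"sender_ip": "192.0.2.1"}))
```

Recording which stage made the call is what lets you track how each layer performs, as described in the next paragraph.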
I tried using SA as my only post-RBL filter for a couple of months, but it wasn't getting the job done. I then added the procmail script, but still wasn't happy. Putting DSPAM in front of it all seems to work best for me. I now find that only a few spams per month make it past DSPAM (they sort into separate boxes so I can track each filter's performance), and I haven't seen a false positive in quite some time, over a month anyway. I've only been using DSPAM for a few months.
What works for me may be crap for you. Try a few things till you find something that works for you and use that. If you're trying statistical filters, keep in mind that it takes a while to train them. I found I got better than 90% with DSPAM after a small corpus feed and about a week of training.
...this guy seriously believes the earth is a scant 10000 years old. And he dismisses all evidence to the contrary without a thorough explanation. I can't help but wonder if he treats other people's research with the same disregard.