Paul Graham on Fighting Spam

Posted by CmdrTaco on Friday August 16, 2002 @04:08AM from the near-and-dear-to-my-heart dept.

Ramakrishnan M writes "Paul Graham, the Lisp Guru, is back with a great technique to fight spam. It is based on a trust metric, and he claims only 5 out of 1000 spams got leaked out of this system with 0 false positives. Worth looking at."

6 of 675 comments

This is not news ... by dougmc · 2002-08-16 04:20 · Score: 5, Informative

"The statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to try to write software that recognizes individual properties of spam."
And he's correct. A few years ago, most spam filters did look for individual properties of spam.
BUT, now, the best spam filters out there already use statistical properties. SpamAssassin does this, for example, and it works *extremely* well. Before I found SpamAssassin, I had a huge procmail recipe that used its scoring mechanism to do basically the same thing -- but of course SpamAssassin does it better, so I switched :)
Re:spamassasin by tomknight · 2002-08-16 04:21 · Score: 4, Informative

As you appear to have difficulty reading articles, I'll give you a helping hand:
"But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The Bayesian approach assigns an actual probability. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam."
Tom.
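
For anyone who wants to check the arithmetic in that quote, here is a minimal Python sketch. It assumes the two words are independent evidence and that spam and non-spam are equally likely beforehand, which gives the combination ab / (ab + (1-a)(1-b)). The function name `combine` is made up for this example.

```python
from math import prod

def combine(probs):
    """Combine per-word spam probabilities into one probability.

    Assumes the words are independent pieces of evidence and that spam and
    non-spam are equally likely a priori (both are simplifying assumptions).
    """
    spam = prod(probs)                  # all words seen, message is spam
    ham = prod(1.0 - p for p in probs)  # all words seen, message is not spam
    return spam / (spam + ham)

# "sex" at .97 and "sexy" at .99, the two figures from the quote
combined = combine([0.97, 0.99])
print(f"{combined:.2%}")
```

With the two figures from the quote this comes out at roughly 99.97%, matching the number given there.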

--
Oh arse
Too bad! Patented By Microsoft by kotku · 2002-08-16 04:58 · Score: 4, Informative

Microsoft is one step ahead of everyone. Here is the patent summary.
"Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set"
The full details of the patent can be seen here.
Patent Link
I'm surprised you guys don't check at the patent office first before you get all excited about a new idea. Doh!

--
The bikini - security through obscurity since 1943
The design goals of SpamAssassin by belphegore · 2002-08-16 05:30 · Score: 4, Informative

Paul is taking an interesting approach here, but he's not correct in saying that SpamAssassin doesn't use a statistical approach. He has a bit of a point in noting that his system generates a prediction probability, which is a more intuitive measure of how likely a message is to be spam than SpamAssassin's score. But there is also something attractive about the simple, non-mathematical way SA uses scores, which makes them more understandable to non-math people.
It seems like a number of the points Paul makes in the article (that spammers are defeatable, that they must get their message through in order to be successful, and that the war on spam is winnable) are extensions from my interview with Salon a few months back. But his statistical approach fails to make use of one factor which I believe is critical, and which SpamAssassin attempts to exploit: those commercial messages must convey a commercial message. In other words, they have to be a message, with some sort of linguistic component which encourages the reader to do something. A purely statistical approach to spam filtering will lose the power of doing analysis of higher-order linguistic concepts.
SpamAssassin's approach is to use the universe's best known natural language processors (humans) to build rules which they believe can differentiate linguistic elements of spam vs nonspam messages, and then use the best optimization and statistical tools we have (currently only using decent tools, not the best tools) to determine how to score those rules against individual messages. The scoring system is very simplistic today, just being a simple sum of the scores of the various rules (though it's slightly nonlinear because of the properties of some of the rules, like the auto-whitelist). Future SpamAssassin development directions include extending the scoring system to be much more non-linear, including examining statistically the frequency of occurrence of combinations of rule triggers.
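
To make the "simple sum of the scores" idea concrete, here is a rough Python sketch of an additive rule scorer. The rule names, patterns, scores and the threshold are all invented for illustration; none of them are taken from SpamAssassin's actual rule set or code.

```python
import re

# Hypothetical hand-written rules: (name, pattern, score).
# Names and scores are made up for this sketch.
RULES = [
    ("SUBJECT_ALL_CAPS", re.compile(r"^Subject: [^a-z\n]+$", re.MULTILINE), 1.5),
    ("MENTIONS_FREE", re.compile(r"\bfree\b", re.IGNORECASE), 0.8),
    ("CLICK_HERE", re.compile(r"\bclick here\b", re.IGNORECASE), 2.0),
    ("REMOVE_ME", re.compile(r"\bto be removed\b", re.IGNORECASE), 2.2),
]

THRESHOLD = 5.0  # assumed cutoff for this sketch

def score_message(text):
    """Return the summed score and the names of the rules that fired."""
    hits = [(name, score) for name, pattern, score in RULES if pattern.search(text)]
    total = sum(score for _, score in hits)
    return total, [name for name, _ in hits]

def is_spam(text):
    """Flag the message when the summed rule scores reach the threshold."""
    total, _ = score_message(text)
    return total >= THRESHOLD

spam = "Subject: MAKE MONEY FAST\n\nClick here for your FREE offer. Reply to be removed."
ham = "Subject: Lunch on Friday?\n\nAre you free at noon?"
print(score_message(spam))
print(score_message(ham))
```

In this sketch the humans write the patterns and only the per-rule scores would be tuned against a corpus, which is the division of labour described above.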
Automated rule-creation certainly has its place (for example, SpamAssassin's spam-phrase rule, or the auto-whitelist), but I truly believe that the ideal spam filtering system will always have to make the best use it can of human language processing skills. Using this combination of human/computer power, I believe that SpamAssassin can (and often does for many existing users) achieve better ROC performance than anything else.
Re:Another way to stop Spam by FattMattP · 2002-08-16 05:32 · Score: 4, Informative

What you've described is exactly what TMDA does.

--
Prevent email address forgery. Publish SPF records for y
Re:Incorrect statistics by Broccolist · 2002-08-16 07:05 · Score: 4, Informative
"In other words, only if knowing that the word "sex" appears tells you nothing about how likely the word "sexy" is to appear, can you reason as he is doing above. That's probably a very poor assumption in this case."
Graham is using a naive Bayes text classifier here, which is a pretty common approach. The naive classifier, as you perceptively point out, does rely on the obviously incorrect assumption that the appearance of any word is independent of all other words. But:
1. It's computationally impossible to be as statistically rigorous as you would like. If we had to keep a probability table of every word given every other word, we'd have an awful combinatorial explosion. Even today's most powerful supercomputers would be unable to classify spam :).
2. The naive Bayes classifier, despite the incorrect assumption, has been empirically shown to be one of the best algorithms for dividing text documents into categories. Because of the variety of words and very small correlation between words in different sentences, the assumption seems to do very little harm.
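
To show what the independence assumption buys, here is a minimal sketch of a generic naive Bayes text classifier in Python. It is a textbook version with add-one smoothing, not Graham's exact algorithm, and the class name, method names and training messages are invented for the example. Note that it stores one count per word per class rather than a table of every word given every other word.

```python
import math
from collections import Counter

class NaiveBayes:
    """Two-class naive Bayes over whitespace-separated words (illustrative)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, label, text):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_score(self, label, text):
        # Assumes at least one message has been trained for each label.
        counts = self.word_counts[label]
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(counts.values())
        # Log prior: fraction of training messages carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in text.lower().split():
            # Independence assumption: each word contributes its own factor.
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(label, text))

nb = NaiveBayes()
nb.train("spam", "cheap pills buy now")
nb.train("spam", "buy cheap watches now")
nb.train("ham", "meeting notes for the lisp group")
nb.train("ham", "lunch now or after the meeting")
print(nb.classify("buy cheap pills"))
print(nb.classify("notes for the meeting"))
```

Working in log space avoids underflow when many small per-word probabilities are multiplied together.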
Your objection is one of the reasons why AI researchers shunned Bayesian methods for so long: in practice it's impossible to implement them rigorously. Unfortunately, building a completely rational system is not tractable without a planet-sized computer. The only viable solution is to make compromises: just like humans do, when they skip steps and make not-100%-warranted assumptions in their reasoning.