Paul Graham on Fighting Spam


Posted by CmdrTaco on Friday August 16, 2002 @04:08AM from the near-and-dear-to-my-heart dept.

Ramakrishnan M writes "Paul Graham, the Lisp guru, is back with a great technique to fight spam. It is based on a trust metric, and he claims only 5 out of 1000 spams leaked through this system, with 0 false positives. Worth looking at."

17 of 675 comments

If you use Outlook... by Anonymous Coward · 2002-08-16 04:15 · Score: 2, Informative

(Yeah, yeah, I know...)

But if you do, check out Cloudmark's SpamNet. I've been quite pleased with its ability to stop spam, and it gets better the more people use it.
This is not news ... by dougmc · 2002-08-16 04:20 · Score: 5, Informative

The statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to try to write software that recognizes individual properties of spam.
And he's correct. A few years ago, most spam filters did look for individual properties of spam.
BUT, now, the best spam filters out there already use statistical properties. Spamassassin does this, for example, and it works *extremely* well. Before I found Spamassassin, I had a huge procmail recipe that used its scoring mechanism to do basically the same thing -- but of course spamassassin does it better, so I switched :)
1. Re:This is not news ... by DVega · 2002-08-16 06:56 · Score: 3, Informative
  Bayesian filters for spam have extensively been studied and compared in the last few years.
  
  An evaluation of Naive Bayesian anti-spam filtering
  
  An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages
  
  Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach
  
  Recently more filtering methods have been studied.
  It's good to see someone implementing these techniques
  --
  MOD THE CHILD UP!
Re:spamassasin by tomknight · 2002-08-16 04:21 · Score: 4, Informative

As you appear to have difficulty reading articles, I'll give you a helping hand:
"But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The Bayesian approach assigns an actual probability. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam."
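The 99.97% figure in that quote can be checked with a few lines of Python. This is a minimal sketch, assuming the two-class form of Bayes' rule with equal priors and treating the words as independent, which is what the quoted numbers imply. The function name `combine` is made up for illustration and is not taken from the article.

```python
def combine(probs):
    """Combine per-word spam probabilities into one probability.

    Assumes equal priors for spam and non-spam and that the words
    are independent of each other (the "naive" assumption).
    """
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p          # evidence that the mail is spam
        ham *= 1.0 - p     # evidence that the mail is not spam
    return spam / (spam + ham)


# "sex" -> 0.97 and "sexy" -> 0.99, as in the quoted passage
print(f"{combine([0.97, 0.99]):.2%}")  # prints 99.97%
```

Two strong indicators reinforce each other, so the combined probability comes out higher than either word gives on its own.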
Tom.

--
Oh arse
Re:Another way to stop Spam by Brendan+Byrd · 2002-08-16 04:41 · Score: 3, Informative

SpamAssassin already has this. It's called automatic-whitelisting.

--
Zodiac Survey
Re:spamassasin by KMitchell · 2002-08-16 04:47 · Score: 3, Informative

The theory (as I understand it) is that there are enough "legit words" in the "Sexy email to your gf" (i.e. her/your name/nickname, her/your email addy etc) that they'd cancel out the "bad words"

The big shift in thinking, going from looking for phrases to scoring each and every word in an email, is that the rest of the email is just as saving/damning as the stuff that filters look for.
news.admin.net-abuse.sightings by 13013dobbs · 2002-08-16 04:55 · Score: 3, Informative

Look in UseNet. The group news.admin.net-abuse.sightings is where people post their spams. Enjoy!

--
No replies made to AC posts. Please log in.
Too bad! Patented By Microsoft by kotku · 2002-08-16 04:58 · Score: 4, Informative

Microsoft is one step ahead of everyone. Here is the patent summary.
"Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training set and re-training the classifier based on the updated training set"
The full details of the patent can be seen here.
Patent Link
I'm surprised you guys don't check at the patent office first before you get all excited about a new idea. Doh!

--
The bikini - security through obscurity since 1943
Re:Easy way to beat spam 100% by phamlen · 2002-08-16 05:02 · Score: 2, Informative
Well, you've got a ready-made list of messages to filter *out* of your other mailboxes
This doesn't work because spam messages are not identical. That's the whole problem in a nutshell - how do you determine that one email matches another?
1. Spammers routinely change the wording/spacing/non-essential elements in a message so that they don't match exactly.
2. If you cut down to searching for "parts of a message", then you're back to "content-filtering".
3. The same thing occurs if you check for email address, etc.
Also, it's worth noting that BrightMail and other companies have been using "spam honeypots" for years. Their effectiveness isn't very good.

What is interesting, though, is that you could use this technique extremely powerfully with the Bayesian filter. Instead of writing a script for yourself, have the script automatically move the message into your "spam" corpus. You'll get your spam blocking up hugely without ever having to see spam.
Re:Ok, that is hot.... by shayne321 · 2002-08-16 05:11 · Score: 3, Informative

I'd rather have a software package that has 50% filtering and 0 false positives than 100% filtering and 1 false positive. I _never_ want to miss an actual email directed at me.
I have to respectfully disagree here. First, you should NEVER trust an automated mechanism to delete e-mail before you open it (I'm not saying you are, just saying it should never be done). When e-mail comes in to my inbox, generally it's a user problem or a network-down situation. Mozilla beeps at me, and I drop what I'm doing to see what e-mail has just arrived. If it's spam, I've wasted the effort in losing my train of thought on whatever I was working on, plus whatever amount of time it takes me to refile it in my spam folder and adjust my filters so it doesn't happen again.
Using spamassassin, I filter all e-mails marked as spam off into a "spam" folder which I browse through about once a day at the end of the day just to be sure no legit e-mail has been filed over there. Takes only a second, and generally if the e-mail is "spammish" enough for spamassassin to file it over there it's not an important e-mail, but maybe a package ship notice from UPS, or an order update from amazon.com (though with effective whitelisting you can reduce how often this happens).
Not trying to change your opinion, just wanted to offer an alternate viewpoint. IMHO one of the things that makes spamassassin so good is that you can alter your threshold, so that if you can live with some false positives but hate spam, you can use a lower threshold. If you can live with some spam and never want to miss "legitimate" e-mail, you can use a higher threshold.
Shayne

--
Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
The design goals of SpamAssassin by belphegore · 2002-08-16 05:30 · Score: 4, Informative

Paul is taking an interesting approach here, but he's not correct in saying that SpamAssassin doesn't use a statistical approach. He has a bit of a point in noting that his system will generate a prediction probability which is more intuitive than SpamAssassin's scoring system in terms of determining how likely a message is to be spam, but there is also an attractive element to the simplified, non-math way that SA uses scores, which allows them to be more understandable to non-math people.
Seems like a number of the points which Paul makes in the article about spammers being defeatable, about the basic premise that they must get their message through in order to be successful, and that the war on spam is winnable are extensions from my interview with Salon a few months back, but his statistical approach fails to make use of one factor which I believe is critical (and which SpamAssassin attempts to exploit), which is that those commercial messages must convey a commercial message, in other words, they have to be a message, and have some sort of linguistic component which encourages the reader to do something. A purely statistical approach to spam filtering will lose the power of doing analysis of higher-order linguistic concepts.
SpamAssassin's approach is to use the universe's best known natural language processors (humans) to build rules which they believe can differentiate linguistic elements of spam vs nonspam messages, and then use the best optimization and statistical tools we have (currently only using decent tools, not the best tools) to determine how to score those rules against individual messages. The scoring system is very simplistic today, just being a simple sum of the scores of the various rules (though it's slightly nonlinear because of the properties of some of the rules, like the auto-whitelist). Future SpamAssassin development directions include extending the scoring system to be much more non-linear, including examining statistically the frequency of occurrence of combinations of rule triggers.
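To make the "simple sum of the scores of the various rules" idea concrete, here is a rough sketch. Everything in it is invented for illustration: the rule names, patterns, point values and threshold are hypothetical and are not SpamAssassin's actual rules, scores or API.

```python
import re

# Hypothetical hand-written rules: (name, pattern, score).
# A negative score plays the role of a whitelist entry.
RULES = [
    ("MENTIONS_FREE", re.compile(r"\bfree\b", re.I), 1.5),
    ("CLICK_HERE", re.compile(r"click here", re.I), 2.0),
    ("LOTS_OF_EXCLAMATION", re.compile(r"!{3,}"), 1.0),
    ("KNOWN_SENDER", re.compile(r"^From: alice@example\.com$", re.M), -3.0),
]


def score(message):
    """Return (total score, names of the rules that fired)."""
    hits = [(name, pts) for name, pattern, pts in RULES if pattern.search(message)]
    return sum(pts for _, pts in hits), [name for name, _ in hits]


def is_spam(message, threshold=4.0):
    """Flag the message when the summed rule scores reach the threshold."""
    total, _ = score(message)
    return total >= threshold


body = "\n\nClick here for FREE stuff!!!"
print(score("From: bob@example.net" + body))
print(is_spam("From: bob@example.net" + body))                  # True
print(is_spam("From: bob@example.net" + body, threshold=5.0))   # False
print(is_spam("From: alice@example.com" + body))                # False
```

Raising the threshold trades fewer false positives for more missed spam, and a negative-scoring rule pulls mail from a known sender back under the line.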
Automated rule-creation certainly has its place (for example, SpamAssassin's spam-phrase rule, or the auto-whitelist), but I truly believe that the ideal spam filtering system will always have to make the best use it can of human language processing skills. Using this combination of human/computer power, I believe that SpamAssassin can (and often does for many existing users) achieve better ROC performance than anything else.
Re:Another way to stop Spam by FattMattP · 2002-08-16 05:32 · Score: 4, Informative

What you've described is exactly what TMDA does.

--
Prevent email address forgery. Publish SPF records for y
Re:Another way to stop Spam by 21mhz · 2002-08-16 05:34 · Score: 2, Informative

It already exists.

--
My exception safety is -fno-exceptions.
Re:Please explain the LISP code by bsd-mon · 2002-08-16 05:55 · Score: 2, Informative

LISP is prefix, so instead of a+b you'd have (+ a b)
IIRC in C this would be similar (LISP gurus, please correct me):

int g(char* word) {
/* if word is in good hash, return weight,
else return 0 */
return 2*good_word_weight;
}
int b(char* word) {
/* if word is in bad hash, return weight,
else return 0 */
return bad_word_weight;
}

int main() {
if (g(word) + b(word)

--
To read makes our speaking English good. - X. Harris
Re:Please explain the LISP code by brausch · 2002-08-16 06:07 · Score: 2, Informative

OK, I'll try. He's trying to score the word on a scale from .01 to .99. The value is the probability that the word is a spam word.

g = 2 * (count of how many previous "good messages" the word has appeared in)
b = (count of how many previous "bad messages" the word has appeared in)

if( g+b < 5 ) // word hasn't occurred enough in previous messages
return 0; // to have a valid score

fb = b / nbad // nbad is number of bad messages in database
fg = g / ngood // ngood is number of good messages in database

score is fb / (fb + fg)
minimum valid score is .01, maximum is .99
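
For anyone who prefers something runnable, here is the same description as a small Python sketch. It follows the steps listed above rather than the original Lisp. The function name `word_score`, the dict-based tables, and returning `None` instead of 0 for "no valid score" are my own choices for illustration.

```python
def word_score(word, good, bad, ngood, nbad):
    """Score a word between 0.01 and 0.99 as described above.

    good/bad map each word to the number of good/bad messages it has
    appeared in; ngood/nbad are the message totals (assumed > 0).
    Returns None when the word is too rare to score.
    """
    g = 2 * good.get(word, 0)   # good counts are doubled
    b = bad.get(word, 0)
    if g + b < 5:               # not enough occurrences for a valid score
        return None
    fb = b / nbad
    fg = g / ngood
    return max(0.01, min(0.99, fb / (fb + fg)))


good = {"meeting": 10}
bad = {"viagra": 20, "meeting": 1}
print(word_score("viagra", good, bad, ngood=100, nbad=100))   # 0.99
print(word_score("meeting", good, bad, ngood=100, nbad=100))
print(word_score("rare", good, bad, ngood=100, nbad=100))     # None
```

Doubling the good count biases the score toward "not spam", so a word has to lean clearly spammy before it counts against a message.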

--
"Almost every wise saying has an opposite one, no less wise, to balance it." - George Santayana
Re:Incorrect statistics by Broccolist · 2002-08-16 07:05 · Score: 4, Informative
In other words, only if knowing that the word "sex" appears tells you nothing about how likely the word "sexy" is to appear, can you reason as he is doing above. That's probably a very poor assumption in this case.
Graham is using a naive Bayes text classifier here, which is a pretty common approach. The naive classifier, as you perceptively point out, does rely on the obviously incorrect assumption that the appearance of any word is independent of all other words. But:
1. It's computationally impossible to be as statistically rigorous as you would like. If we had to keep a probability table of every word given every other word, we'd have awful combinatorial explosion. Even today's most powerful supercomputers would be unable to classify spam :).
2. The naive Bayes classifier, despite the incorrect assumption, has been empirically shown to be one of the best algorithms for dividing text documents into categories. Because of the variety of words and very small correlation between words in different sentences, the assumption seems to do very little harm.
Your objection is one of the reasons why AI researchers shunned Bayesian methods for so long: in practice it's impossible to implement them rigorously. Unfortunately, building a completely rational system is not tractable without a planet-sized computer. The only viable solution is to make compromises: just like humans do, when they skip steps and make not-100%-warranted assumptions in their reasoning.
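The combinatorial explosion in point 1 is easy to put rough numbers on. This is only a back-of-the-envelope sketch: the function name `table_sizes` and the 1000-word vocabulary are assumptions for illustration, and the counts are per class.

```python
import math


def table_sizes(vocab_size):
    """Rough table sizes per class for a given vocabulary size.

    Returns (naive, pairwise, joint_digits):
      naive        - one P(word | class) entry per word
      pairwise     - one entry per ordered pair of distinct words
      joint_digits - number of decimal digits in 2**vocab_size, the
                     count of possible present/absent word combinations
    """
    naive = vocab_size
    pairwise = vocab_size * (vocab_size - 1)
    joint_digits = math.floor(vocab_size * math.log10(2)) + 1
    return naive, pairwise, joint_digits


print(table_sizes(1000))  # (1000, 999000, 302)
```

Even a toy 1000-word vocabulary gives a full joint table whose size is a 302-digit number, which is why the independence shortcut gets used despite being wrong.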
Re:Too bad! Patented By Microsoft by Anonymous Coward · 2002-08-16 15:44 · Score: 1, Informative

I saw Eric Horvitz demo this (along with a lot of other impressive stuff) when I was at MSFT. The spam filtering works very well for him. And yes, he's already written an Outlook COM plugin that does it.

The problem is that Eric works in MS Research, not on a product team. MSR does an excellent job developing cool new technology, and a very bad job working with the product groups to ship it out the door. (Likewise, the product teams do a poor job working with Research.)

The ultimate example of that is the MSAgent technology... otherwise known as "Clippy". Horvitz was the brain behind the original (and very cool) concept. But the Office product team couldn't take the concept and ship it in a useful form, so it shipped in the painful form we all know and hate.

Eventually, Microsoft will figure out how to do successful technology transfer from MSR to the product teams. Hopefully spam filtering will be the first one to get it right.