Paul Graham on Fighting Spam
Ramakrishnan M writes "Paul Graham, the Lisp guru, is back with a great technique to fight spam. It is based on a trust metric, and he claims that only 5 out of 1000 spams leaked through the system, with 0 false positives. Worth looking at."
(Yeah, yeah, I know...)
But if you do, check out Cloudmark's SpamNet. I've been quite pleased with its ability to stop spam, and it gets better the more people use it.
BUT, now, the best spam filters out there already use statistical properties. Spamassassin does this, for example, and it works *extremely* well. Before I found Spamassassin, I had a huge procmail recipe that used its scoring mechanism to do basically the same thing -- but of course spamassassin does it better, so I switched :)
"But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The Bayesian approach assigns an actual probability. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam."
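For anyone who wants to check the arithmetic in that quote, here is a minimal sketch of the combination step. It assumes the words are independent pieces of evidence and that spam and non-spam are equally likely beforehand; the function name `combine` is made up for illustration and is not taken from the article.

```python
from math import prod

def combine(probs):
    """Combine per-word spam probabilities into one probability.

    Assumes the words are independent evidence and equal prior odds
    for spam and non-spam (both are simplifying assumptions).
    """
    spam = prod(probs)                      # P(words | spam), up to a constant
    ham = prod(1.0 - p for p in probs)      # P(words | not spam), same constant
    return spam / (spam + ham)

# "sex" at .97 and "sexy" at .99, as in the quote
p = combine([0.97, 0.99])
print(round(p, 4))  # about 0.9997, i.e. the 99.97% quoted above
```

Note that two neutral words (0.5 each) combine to 0.5, so only words that lean one way or the other move the result.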
Tom.
Oh arse
SpamAssassin already has this. It's called automatic-whitelisting.
Zodiac Survey
The theory (as I understand it) is that there are enough "legit words" in the "Sexy email to your gf" (i.e. her/your name/nickname, her/your email addy etc) that they'd cancel out the "bad words"
The big shift in thinking, from looking for phrases to scoring each and every word in an email, is that the rest of the email is just as saving/damning as the stuff that filters look for.
Look in UseNet. The group news.admin.net-abuse.sightings is where people post their spams. Enjoy!
No replies made to AC posts. Please log in.
"Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training set and re-training the classifier based on the updated training set"
The full details of the patent can be seen here.
Patent Link
I'm surprised you guys don't check at the patent office first before you get all excited about a new idea. Doh!
The bikini - security through obscurity since 1943
This doesn't work because spam messages are not identical. That's the whole problem in a nutshell - how do you determine that one email matches another?
Also, it's worth noting that BrightMail and other companies have been using "spam honeypots" for years. Their effectiveness isn't very good.
What is interesting, though, is that you could use this technique extremely powerfully with the Bayesian filter. Instead of writing a script for yourself, have the script automatically move the message into your "spam" corpus. You'll get your spam blocking up hugely without ever having to see spam.
I'd rather have a software package that has 50% filtering and 0 false positives than 100% filtering and 1 false positive. I _never_ want to miss an actual email directed at me.
I have to respectfully disagree here. First, you should NEVER trust an automated mechanism to delete e-mail before you open it (I'm not saying you are, just saying it should never be done). When e-mail comes in to my inbox generally it's a user problem or network down situation. Mozilla beeps at me, and I drop what I'm doing to see what e-mail has just arrived. If it's spam, I've wasted the effort in losing my train of thought on whatever I was working on, plus whatever amount of time it takes me to refile it in my spam folder and adjust my filters so it doesn't happen again.
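A rough sketch of what "have the script move the message into your spam corpus" could look like. Everything here is hypothetical: the `SpamCorpus` class, the `on_honeypot_delivery` hook, and the tokenizer are illustrative names, and counting each word once per message is an assumption, not something the article prescribes.

```python
import re
from collections import Counter

def tokenize(text):
    """Return the set of distinct lowercase tokens in a message."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

class SpamCorpus:
    """Running word statistics for messages known to be spam."""

    def __init__(self):
        self.word_counts = Counter()  # word -> number of spam messages containing it
        self.messages = 0             # total spam messages seen

    def add(self, text):
        self.word_counts.update(tokenize(text))
        self.messages += 1

def on_honeypot_delivery(corpus, text):
    """Hypothetical hook: anything sent to the honeypot address is treated as spam."""
    corpus.add(text)

corpus = SpamCorpus()
on_honeypot_delivery(corpus, "Buy cheap pills, cheap!")
print(corpus.messages, corpus.word_counts["cheap"])
```

The point of the design is that the honeypot address never receives legitimate mail, so its messages can be labelled as spam without a human looking at them.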
Using spamassassin, I filter all e-mails marked as spam off into a "spam" folder which I browse through about once a day at the end of the day just to be sure no legit e-mail has been filed over there. Takes only a second, and generally if the e-mail is "spammish" enough for spamassassin to file it over there it's not an important e-mail, but maybe a package ship notice from UPS, or an order update from amazon.com (though with effective whitelisting you can reduce how often this happens).
Not trying to change your opinion, just wanted to offer an alternate viewpoint. IMHO one of the things that makes spamassassin so good is that you can alter your threshold, so that if you can live with some false-positives but hate spam, you can use a lower threshold. If you can live with some spam and never want to miss "legitimate" e-mail, you can use a higher threshold.
Shayne
Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
Paul is taking an interesting approach here, but he's not correct in saying that SpamAssassin doesn't use a statistical approach. He has a bit of a point in noting that his system will generate a prediction probability which is more intuitive than SpamAssassin's scoring system in terms of determining how likely a message is to be spam, but there is also an attractive element to the simplified, non-math way that SA uses scores, which allows them to be more understandable to non-math people.
Seems like a number of the points which Paul makes in the article about spammers being defeatable, about the basic premise that they must get their message through in order to be successful, and that the war on spam is winnable are extensions from my interview with Salon a few months back, but his statistical approach fails to make use of one factor which I believe is critical (and which SpamAssassin attempts to exploit), which is that those commercial messages must convey a commercial message, in other words, they have to be a message, and have some sort of linguistic component which encourages the reader to do something. A purely statistical approach to spam filtering will lose the power of doing analysis of higher-order linguistic concepts.
SpamAssassin's approach is to use the universe's best known natural language processors (humans) to build rules which they believe can differentiate linguistic elements of spam vs nonspam messages, and then use the best optimization and statistical tools we have (currently only using decent tools, not the best tools) to determine how to score those rules against individual messages. The scoring system is very simplistic today, just being a simple sum of the scores of the various rules (though it's slightly nonlinear because of the properties of some of the rules, like the auto-whitelist). Future SpamAssassin development directions include extending the scoring system to be much more non-linear, including examining statistically the frequency of occurrence of combinations of rule triggers.
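To make the "simple sum of the scores of the various rules" concrete, here is a toy sketch. The rule names, tests, and weights are invented for illustration and are not SpamAssassin's real rules or API; the threshold of 5.0 is likewise just an assumed cutoff.

```python
import re

# Hypothetical hand-written rules: (name, test on the message, score).
RULES = [
    ("SUBJECT_ALL_CAPS", lambda m: m["subject"].isupper(), 1.5),
    ("MENTIONS_FREE_MONEY",
     lambda m: re.search(r"free money", m["body"], re.I) is not None, 2.5),
    ("HAS_UNSUBSCRIBE_LINE", lambda m: "unsubscribe" in m["body"].lower(), 1.0),
]

def score(message):
    """Return (total score, names of rules that fired) for a message dict."""
    hits = [(name, weight) for name, test, weight in RULES if test(message)]
    return sum(weight for _, weight in hits), [name for name, _ in hits]

def is_spam(message, threshold=5.0):
    """Flag the message when the summed rule scores reach the threshold."""
    total, _ = score(message)
    return total >= threshold

spam = {"subject": "FREE MONEY NOW",
        "body": "Get free money today. To unsubscribe, reply."}
ham = {"subject": "Lunch on Friday?", "body": "Are you free at noon?"}
print(score(spam), is_spam(spam), is_spam(ham))
```

Because the total is a plain sum, raising or lowering `threshold` is the only knob for trading missed spam against false positives in this sketch.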
Automated rule-creation certainly has its place (for example, SpamAssassin's spam-phrase rule, or the auto-whitelist), but I truly believe that the ideal spam filtering system will always have to make the best use it can of human language processing skills. Using this combination of human/computer power, I believe that SpamAssassin can (and often does for many existing users) achieve better ROC performance than anything else.
What you've described is exactly what TMDA does.
Prevent email address forgery. Publish SPF records for y
It already exists.
My exception safety is -fno-exceptions.
LISP is prefix, so instead of a+b you'd have (+ a b)
IIRC in C this would be similar (LISP gurus, please correct me):
int g(char* word) {
    /* if word is in good hash, return weight,
       else return 0 */
    return 2*good_word_weight;
}
int b(char* word) {
    /* if word is in bad hash, return weight,
       else return 0 */
    return bad_word_weight;
}
int main() {
    if (g(word) + b(word)
To read makes our speaking English good. - X. Harris
OK, I'll try. He's trying to score the word on a scale from .01 to .99. The value is the probability that the word is a spam word.
g = 2 * (count of how many previous "good messages" the word has appeared in)
b = (count of how many previous "bad messages" the word has appeared in)
// word hasn't occurred enough in previous messages
// to have a valid score
if( g+b < 5 )
    return 0;
// nbad is number of bad messages in database
// ngood is number of good messages in database
fb = b / nbad
fg = g / ngood
score is fb / (fb + fg)
minimum valid score is .01, maximum is .99
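Here is the same pseudocode as a small runnable sketch. It follows the steps in this post rather than the article itself; the function name `word_score` is made up, it assumes `ngood` and `nbad` are both greater than zero, and it returns `None` instead of 0 for words that have not been seen often enough, so that "no score" cannot be confused with a real probability.

```python
def word_score(good_count, bad_count, ngood, nbad):
    """Estimate the probability that a word indicates spam.

    good_count / bad_count: how many good / bad messages contained the word.
    ngood / nbad: total good / bad messages in the database (assumed > 0).
    """
    g = 2 * good_count   # good occurrences are doubled to bias against false positives
    b = bad_count
    if g + b < 5:
        return None      # word hasn't occurred enough to have a valid score
    fb = b / nbad
    fg = g / ngood
    score = fb / (fb + fg)
    return max(0.01, min(0.99, score))  # clamp to the .01 to .99 range

print(word_score(0, 10, 100, 100))   # only ever seen in spam: clamped to 0.99
print(word_score(5, 10, 100, 100))   # 0.5 after doubling the good count
print(word_score(1, 1, 100, 100))    # None: too rare to score
```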
"Almost every wise saying has an opposite one, no less wise, to balance it." - George Santayana
Graham is using a naive Bayes text classifier here, which is a pretty common approach. The naive classifier, as you perceptively point out, does rely on the obviously incorrect assumption that the appearance of any word is independent of all other words. But:
Your objection is one of the reasons why AI researchers shunned Bayesian methods for so long: in practice it's impossible to implement them rigorously. Unfortunately, building a completely rational system is not tractable without a planet-sized computer. The only viable solution is to make compromises: just like humans do, when they skip steps and make not-100%-warranted assumptions in their reasoning.
I saw Eric Horvitz demo this (along with a lot of other impressive stuff) when I was at MSFT. The spam filtering works very well for him. And yes, he's already written an Outlook COM plugin that does it.
The problem is that Eric works in MS Research, not on a product team. MSR does an excellent job developing cool new technology, and a very bad job working with the product groups to ship it out the door. (Likewise, the product teams do a poor job working with Research.)
The ultimate example of that is the MSAgent technology... otherwise known as "Clippy". Horvitz was the brain behind the original (and very cool) concept. But the Office product team couldn't take the concept and ship it in a useful form, so it shipped in the painful form we all know and hate.
Eventually, Microsoft will figure out how to do successful technology transfer from MSR to the product teams. Hopefully spam filtering will be the first one to get it right.