Working Bayesian Mail Filter


Posted by CmdrTaco on Sunday November 3, 2002 @06:05AM from the stuff-to-play-with dept.

zonker writes "A real, working, honest-to-god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well, here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand, the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software, as I see the Google results for it are growing."

What's that? by cos(0) · 2002-11-03 06:08 · Score: 2, Interesting

Would anyone care to explain what a "Bayesian" mail filter is?
Server-side solutions? by Quixote · 2002-11-03 06:12 · Score: 3, Interesting

Any server-side solutions (MTA==qmail, MDA==procmail) using this (Naive-Bayesian) technique out there?
1. Re:Server-side solutions? by cmeans · 2002-11-03 06:33 · Score: 4, Interesting
  
  James is a 100% Java email server (SMTP, POP3, NNTP, and IMAP soon) that supports mail-server extensions via the Mailet API. I developed a Java implementation of the Bayesian rules discussed, so that they could be used in any configuration, but also provided a mailet-wrapped implementation so that the filtering (or flagging) could be done on the server side.
  
  --
  Give a hand, not a hand-out.
2. Re:Server-side solutions? by koreth · 2002-11-03 06:44 · Score: 4, Interesting
  
  I've been using SpamProbe (which gets invoked from procmail) with excellent results.
Mozilla in Process of adding Bayesian filter by AT · 2002-11-03 06:14 · Score: 5, Interesting

The mozilla mail client is getting a Bayesian mail filter, too. See http://bugzilla.mozilla.org/show_bug.cgi?id=163188 . Unfortunately, it probably won't show up until after version 1.2 is released.
Bayesian? Wow!!! I'm sooo excited. (Irony!) by davids-world.com · 2002-11-03 06:16 · Score: 5, Interesting

A true Bayesian filter, wow. Let's face it, statistical classifiers based on Bayes' formula are not really state of the art. They make false assumptions about the data (independence of features).
More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines and, somewhat older, Maximum Entropy models.
Enough nerd talk for today :-)
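For anyone wondering what the independence assumption looks like on paper, here is a sketch of the naive Bayes rule as it is usually written. The notation is chosen here and is not from the thread: $S$ means "the message is spam" and $w_1, \dots, w_n$ are the words in the message.

```latex
P(S \mid w_1, \dots, w_n)
  = \frac{P(S)\,\prod_{i=1}^{n} P(w_i \mid S)}
         {P(S)\,\prod_{i=1}^{n} P(w_i \mid S) \;+\; P(\neg S)\,\prod_{i=1}^{n} P(w_i \mid \neg S)}
```

The products are where the "naive" part lives: each word is treated as if it appeared independently of every other word once the class is known, which is clearly not true of real text.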
1. Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by Anonymous Coward · 2002-11-03 07:35 · Score: 1, Interesting
  
  A True Jedi Nerd would use compression based classification. Make two zip/gz/bz2/lzw/whatever archives, one containing known-not-spam and one containing known-spam. For each incoming mail, add to both archives, see which compresses better, bingo, that's the category it's supposed to be in. Obviously needs some tweaking (blocksizes etc.) but that's the gist of it.
  
  Apparently, it does work, though I can't whip out the references just now.
  
  Anyway, naive bayes is interesting mostly because it's so damn fast and only requires one pass through the data; and it works well, it just makes you feel stoopid because it's called "naive".
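A minimal sketch of the compression idea, assuming zlib as the compressor and two tiny made-up corpora (the names `HAM`, `SPAM` and `classify_by_compression` are invented for illustration). Rather than comparing absolute archive sizes, it compares how many extra bytes the new message costs on top of each corpus:

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify_by_compression(message: str, ham_corpus: str, spam_corpus: str) -> str:
    """Label the message by the corpus it adds fewer compressed bytes to."""
    ham_cost = compressed_size(ham_corpus + "\n" + message) - compressed_size(ham_corpus)
    spam_cost = compressed_size(spam_corpus + "\n" + message) - compressed_size(spam_corpus)
    return "spam" if spam_cost < ham_cost else "ham"


# Toy corpora, invented for illustration only.
HAM = "\n".join([
    "Lunch meeting moved to noon on Thursday, see you in the conference room.",
    "Attached is the patch for the parser bug we discussed on the mailing list.",
    "Can you review my changes to the build script before the release?",
])
SPAM = "\n".join([
    "Click here now for a free offer, guaranteed to make money fast!!!",
    "Lowest prices on prescription pills, order now, no doctor visit required!",
    "You have been selected to receive a free vacation, click here to claim.",
])

if __name__ == "__main__":
    incoming = "Click here now for a free offer, guaranteed to make money fast!!!"
    print(classify_by_compression(incoming, HAM, SPAM))
```

Measuring the added bytes instead of the total size is one possible bit of the "tweaking" mentioned above. Note also that zlib only looks back 32 KB, so a real corpus would need a different compressor or some sampling scheme.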
2. Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by Lenbok · 2002-11-03 07:41 · Score: 3, Interesting
  
  Actually, compression-based techniques don't work particularly well, mainly because they are very sensitive to the amount of training data. If you have a lot of non-spam mail, your non-spam compressor will compress better than your spam compressor.
  
  In the long view, all compression is machine learning anyway :-)
product of marketrons by hfastedge · 2002-11-03 06:26 · Score: 2, Interesting

I don't know if it is true Bayesian

You know, on this issue, you really depress me. You are clearly not of the academic nature, so your stance toward something that's probably way above your head really frustrates part of me.

As long as you're not developing the idea, it shouldn't matter how it works as long as it works.

I read the original article here, as you did too. After all the mumbo jumbo about learning, I picked out one effective tip from the article on filtering my email: filter out HTML.

With 1 line of regex I eliminate 95% of my spam:
match and throw it out.

--
-- -- --
Help my mini cause: My journal
Re:Sure it's promising by bmwm3nut · 2002-11-03 06:26 · Score: 2, Interesting

that's the beauty of this approach. the filter learns all the time (or at least you can set it up that way). so if spammers get smart, it doesn't take long until the filter adjusts. what i'd love to see is this filter built into a mail client where you have two buttons for delete. one, just to delete the mail, the other to delete it and mark it as spam. when you press that button the filter would scan the email and update its rules.
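A rough sketch of what such a "delete as spam" button could call behind the scenes, assuming a plain word-count naive Bayes model with add-one smoothing. The class name `LearningFilter` and its methods are hypothetical, not POPFile's actual code:

```python
import math
import re
from collections import Counter


class LearningFilter:
    """Tiny naive Bayes spam filter that is retrained one message at a time."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9$'-]+", text.lower())

    def train(self, text, label):
        """Update the counts; a 'delete as spam' button would call train(text, 'spam')."""
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Posterior probability that text is spam under the current counts."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(vocab)
            # Smoothed prior, so a class with no messages yet does not give log(0).
            score = math.log((self.message_counts[label] + 1) / (total + 2))
            for token in self.tokenize(text):
                if token in vocab:  # never-seen words carry no evidence either way
                    score += math.log((counts[token] + 1) / denom)
            log_score[label] = score
        # Turn the two log scores into a probability, clamped to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    f = LearningFilter()
    f.train("meeting agenda for the project review", "ham")
    f.train("lunch on friday with the project team", "ham")
    print(f.spam_probability("cheap pills click here"))
    f.train("cheap pills click here to order now", "spam")  # the "mark as spam" click
    print(f.spam_probability("cheap pills click here"))
```

The point is only that training is a cheap count update, so every click can adjust the filter immediately.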
You know what I'd kill for? by Saint+Aardvark · 2002-11-03 06:34 · Score: 3, Interesting

A version of this for Outlook Express.
I work on the helpdesk of a small ISP; I also take care of the spam filtering, and answer abuse@. We recently added SpamAssassin, and God does it rock. (The big spike you see is me getting MRTG to graph what SA catches now; it's 6-10 times better than what we used to catch.)
But I still get complaints from our customers about spam that gets through. Just the other day a crapload got through because it was relatively subdued spam (no webbugs, NO LINE OF YELLING, etc); unfortunately, it also advertised pictures of young boys having sex. It's hard to explain why it's very, very hard to filter for this sort of thing, especially when I'm going through the talk for the nth time this week. (I need a good analogy that non-geeks can understand; I'm still looking.)
The good folks at DeerSoft have a version of SpamAssassin for Outlook, and are promising one for OE Real Soon Now. But I would loooooooooooooooooooooooove a good spam program -- this or SA or something else -- that I could point our customers to. Download, double-click, say yes, and bam it's installed. I can figure out how to install this on a Unix box; I could probably, eventually figure out how to do it on a Windows box; there's no way the customers could do it.
Or am I missing good, free spam filtering for Windows? Can anyone point me in the right direction?
Slightly OT: There has got to be a huge market for setting up spam filtering for small businesses. My idea: Tell 'em that if they provide the box -- an old Pentium or 486 will do -- I'll set up spam filtering and a firewall on it, set up some maintenance tools (whitelist this, firewall that). They get great mail service, I get $x00.

--
Carousel is a lie!
Staged Categories by irritating+environme · 2002-11-03 07:05 · Score: 2, Interesting

An advertised false positive rate of 0% is nice, but why not do additional research into the spam, attempting to categorize it into blatant spam, probable spam, borderline, and non-spam, and see whether false positives can be plopped into the borderline category?
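One way the staging could look, as a sketch: take whatever spam probability the filter produces and bucket it. The function name `stage` and the cut-offs (0.99, 0.90, 0.50) are assumptions picked for illustration, not values from any existing filter:

```python
def stage(spam_probability: float) -> str:
    """Map a spam probability in [0, 1] to one of four review buckets."""
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if spam_probability >= 0.99:   # assumed cut-off
        return "blatant spam"
    if spam_probability >= 0.90:   # assumed cut-off
        return "probable spam"
    if spam_probability >= 0.50:   # assumed cut-off
        return "borderline"
    return "non-spam"


if __name__ == "__main__":
    for p in (0.995, 0.93, 0.6, 0.1):
        print(p, stage(p))
```

Only the top bucket would be deleted unseen; the borderline bucket is where a human would look for false positives.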

Also, from what I saw in the article, there will already be a next level that spam can take: image-based messages, misspellings of key words (klik, Clic, Clik, etc), using 0xfe0000 for almost-bright-red.

--

Hey, I'm just your average shit and piss factory.
What about random misspellings? by archeopterix · 2002-11-03 07:46 · Score: 2, Interesting

Hm... what about an anti-anti-spam filter that mangles the message, inserting random misspellings into the spam-identifying words? The Bayesian filter would perceive this as a message consisting of many 'unclassified' words, just like a message in some unknown language. Sure, the short words probably haven't got many possible misspellings (cock, c0ck, coock, cokc - hm... starts to look undecipherable), so they would probably get classified after some time. And this would hopefully lower the spam success ratio. But the possibility still remains...
1. Re:What about random misspellings? by PigleT · 2002-11-03 08:21 · Score: 3, Interesting
  
  Dual feedback loops. Every mail that matches spam gets fed back into the system so both the is-spam wordlist AND the is-good wordlists become more "concentrated" over time.
  Ifile does this, bogofilter does this with some wangling in procmail, ...
  
  That way, if someone sends something that's still mostly spam (one or two words in common with spam, enough to tip the balance) then all the neutral words will be tarnished as well.
  
  --
  ~Tim
  --
  .|` Clouds cross the black moonlight,
  Rushing on down to the circle of the turn
2. Re:What about random misspellings? by archeopterix · 2002-11-03 08:45 · Score: 2, Interesting
  
  Dual feedback loops. Every mail that matches spam gets fed back into the system so both the is-spam wordlist AND the is-good wordlists become more "concentrated" over time. Ifile does this, bogofilter does this with some wangling in procmail, ... That way, if someone sends something that's still mostly spam (one or two words in common with spam, enough to tip the balance) then all the neutral words will be tarnished as well.
  This is clever, but might have some undesirable side effects. Suppose a spammer attaches a long list of neutral words to his e-mail in order to 'dilute' the bad words. This way some innocent words might get assigned positive spam probability thus resulting in false positives later.
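A small sketch of how the misspelling and dilution tricks play out at classification time. It assumes per-word spam probabilities are already known and combines only the most extreme ones; the constants (`UNKNOWN = 0.4`, `TOP_N = 15`), the probability table, and the function name `combine` are all assumptions for illustration. It does not address the training-side worry about innocent words being tarnished.

```python
from math import prod

UNKNOWN = 0.4   # assumed score for a token the filter has never seen
TOP_N = 15      # assumed number of tokens that get a vote


def combine(token_probs, tokens, top_n=TOP_N):
    """Combine per-token spam probabilities using only the most extreme tokens."""
    scores = [token_probs.get(t, UNKNOWN) for t in set(tokens)]
    # "Interesting" means far from the neutral 0.5, in either direction.
    scores.sort(key=lambda p: abs(p - 0.5), reverse=True)
    chosen = scores[:top_n]
    spam = prod(chosen)
    ham = prod(1.0 - p for p in chosen)
    return spam / (spam + ham)


# Made-up probabilities; a real table would come from training counts,
# kept strictly between 0 and 1 so the division above is safe.
PROBS = {"viagra": 0.99, "free": 0.95, "click": 0.90, "meeting": 0.05}

if __name__ == "__main__":
    spammy = ["viagra", "free", "click"]
    misspelled = ["v1agra", "fr33", "cl1ck"]
    diluted = spammy + [f"filler{i}" for i in range(30)]
    print(combine(PROBS, spammy))
    print(combine(PROBS, misspelled))                     # all unknown words
    print(combine(PROBS, diluted))                        # only the top 15 vote
    print(combine(PROBS, diluted, top_n=len(diluted)))    # every word votes
```

In this toy setup the misspelled message does slip through until the mangled words get trained, while padding with neutral words only works if every word is allowed to vote.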
Multi-purpose tool by B'Trey · 2002-11-03 08:10 · Score: 3, Interesting

An interesting idea that I haven't seen discussed is using this concept for more general uses. If we can sort spam from non-spam, how about business from personal? Technical from administrative? All you'd need is multiple databases of word probabilities, the ability to assign emails to multiple categories and a hierarchical method of sorting.
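A sketch of the multi-database idea, assuming one word-count table per bucket and a flat (not hierarchical) choice of the best-scoring bucket. The class name `BucketClassifier` and the training sentences are invented for illustration:

```python
import math
import re
from collections import Counter, defaultdict


class BucketClassifier:
    """Naive Bayes over any number of named buckets, not just spam/non-spam."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # one word table per bucket
        self.doc_counts = Counter()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, bucket):
        self.word_counts[bucket].update(self.tokenize(text))
        self.doc_counts[bucket] += 1

    def classify(self, text):
        """Return the bucket with the highest log score, or None if untrained."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            denom = sum(counts.values()) + len(vocab)
            score = math.log(self.doc_counts[bucket] / total_docs)
            for token in self.tokenize(text):
                if token in vocab:  # skip words no bucket has seen
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


if __name__ == "__main__":
    c = BucketClassifier()
    c.train("quarterly budget invoice for the client contract", "business")
    c.train("birthday party at mom's house this weekend", "personal")
    c.train("kernel patch fixes the segfault in the network driver", "technical")
    print(c.classify("please send the invoice for the contract"))
```

A hierarchical version could run one such classifier per level, for example spam versus non-spam first and business versus personal second.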

--
"The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.
Re:Sure it's promising by Tim+Browse · 2002-11-03 08:51 · Score: 4, Interesting

One interesting fact that came out of these statistical analyses of spam was from one that was featured a while back on slashdot - the guy was doing word analysis, and was looking for good spam indicators/correlations, and expected "sex" or "teens" to be a good match, but the best word was, surprisingly, "ff0000". This was because so much spam uses HTML mail with red text.

So if nothing else, it will force spammers to stop using red text - that has to be some kind of victory :-)

Tim
Growing a spam filter -- a firsthand experience by devphil · 2002-11-03 11:30 · Score: 4, Interesting

So, the graduate CS course I'm taking this quarter is Evolutionary Computing, which is all about the convoluted nonlinear multidimensional-search-space problems, and guess what our current homework is? That's right, taking statistics on spam data, and using genetic algorithms to evolve a working spam filter.

Due to one typo and two thinkos in my fitness evaluation function, my algorithm evolves -- within only a few dozen generations -- a solution which looks like this:
Ignore the actual contents of the message. 34% of the time, it's spam.

And it's right.

--
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
Re:Professional Looking Spam May Be Impossible by Anonymous Coward · 2002-11-03 12:09 · Score: 1, Interesting

Actually, in my experience, spam is written by very intelligent people to look a very specific way to reach a very specific audience.

There is nothing accidental or slap-dash about the layout, or use of colour, or any of the factors involved in laying out an email that will generate sales. I know this because it's my job to know about - I'm in the porn business.

You might hate spam - I know I do - but it works. It works very well. And the way the email looks makes it work best of all.
SpamAssassin by fireboy1919 · 2002-11-03 12:59 · Score: 3, Interesting

This seems to be about using strange approaches to spam filtering, but really... a Bayesian network seems to be a natural step for a system that until now was composed of a series of heuristics with no knowledge of which is more important.

(Why hasn't it been done? Bayesian networks are only taught in AI and statistics classes.)

What really interests me is that SpamAssassin claims to use a genetic algorithm to rate how likely an e-mail is to be spam.
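To make the idea concrete, here is a toy sketch of a genetic algorithm tuning the weights of a few heuristic rules. It is not SpamAssassin's code: the three rules, the training rows, the threshold of 5.0, and every function name are assumptions made up for the example.

```python
import random

THRESHOLD = 5.0  # assumed cut-off: a total score at or above this means spam

# Each row: (which of three hypothetical rules fired, is the message really spam?)
TRAINING = [
    ((1, 1, 0), True),
    ((1, 0, 1), True),
    ((0, 1, 1), True),
    ((1, 0, 0), False),
    ((0, 0, 1), False),
    ((0, 0, 0), False),
]


def fitness(weights):
    """Number of training messages this weight vector classifies correctly."""
    correct = 0
    for fired, is_spam in TRAINING:
        score = sum(w for w, hit in zip(weights, fired) if hit)
        correct += (score >= THRESHOLD) == is_spam
    return correct


def evolve(generations=50, population_size=20, seed=0):
    """Evolve rule weights with elitism, uniform crossover and one-gene mutation."""
    rng = random.Random(seed)
    population = [[rng.uniform(0, 5) for _ in range(3)] for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]  # best half survives unchanged
        children = []
        while len(survivors) + len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(mother, father)]  # crossover
            gene = rng.randrange(len(child))
            child[gene] += rng.gauss(0, 0.5)  # mutate a single weight
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve()
    print(best, fitness(best), "of", len(TRAINING))
```

The fitness function is the whole game here: the search only learns which rules matter to the extent that the training messages say so.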

--
Mod me down and I will become more powerful than you can possibly imagine!