Working Bayesian Mail Filter

Posted by CmdrTaco on Sunday November 3, 2002 @06:05AM from the stuff-to-play-with dept.

zonker writes "A real, working, honest-to-god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software as I see the Google results for it are growing."

12 of 312 comments

Server-side solutions? by Quixote · 2002-11-03 06:12 · Score: 3, Interesting

Any server-side solutions (MTA==qmail, MDA==procmail) using this (Naive-Bayesian) technique out there?
1. Re:Server-side solutions? by cmeans · 2002-11-03 06:33 · Score: 4, Interesting
  
  James is a 100% Java email server (SMTP, POP3, NNTP, and IMAP soon) that supports mail-server extensions via the Mailet API. I developed a Java implementation of the Bayesian rules discussed, so that they could be used in any configuration, but also provided a mailet-wrapped implementation so that the filtering (or flagging) could be done on the server side.
  
  --
  Give a hand, not a hand-out.
2. Re:Server-side solutions? by koreth · 2002-11-03 06:44 · Score: 4, Interesting
  
  I've been using SpamProbe (which gets invoked from procmail) with excellent results.
Mozilla in Process of adding Bayesian filter by AT · 2002-11-03 06:14 · Score: 5, Interesting

The mozilla mail client is getting a Bayesian mail filter, too. See http://bugzilla.mozilla.org/show_bug.cgi?id=163188 . Unfortunately, it probably won't show up until after version 1.2 is released.
Bayesian? Wow!!! I'm sooo excited. (Irony!) by davids-world.com · 2002-11-03 06:16 · Score: 5, Interesting

A true Bayesian filter, wow. Let's face it, statistical classifiers based on Bayes' formula are not really state of the art. They make false assumptions about the data (independence of features).
More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines and, somewhat older, Maximum Entropy models.
Enough nerd talk for today :-)
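
For reference, the independence assumption mentioned above can be written out as follows. This is the standard naive Bayes form; the symbols $S$ (the class "spam") and $w_i$ (the i-th word of the message) are chosen here for illustration and do not come from the post.

```latex
P(S \mid w_1, \dots, w_n) \;\propto\; P(S) \prod_{i=1}^{n} P(w_i \mid S)
```

The product is only exact if the words $w_i$ are conditionally independent given the class, which is rarely true of real mail.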
1. Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) by Lenbok · 2002-11-03 07:41 · Score: 3, Interesting
  
  Actually compression-based techniques don't work particularly well, mainly because they are very sensitive to the amount of training data. If you have a lot of non-spam mail, your non-spam compressor will compress better than your spam compressor.
  
  In the long view, all compression is machine learning anyway :-)
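
A minimal sketch of what a compression-based classifier might look like, assuming zlib as the compressor and tiny made-up corpora. The names `extra_bytes`, `classify_by_compression`, `SPAM_CORPUS`, `HAM_CORPUS` and `MESSAGE` are hypothetical and not taken from any tool mentioned in this thread.

```python
import zlib

def extra_bytes(corpus: str, message: str) -> int:
    """Bytes the message adds when compressed together with a corpus."""
    base = len(zlib.compress(corpus.encode("utf-8"), 9))
    combined = len(zlib.compress((corpus + message).encode("utf-8"), 9))
    return combined - base

def classify_by_compression(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Pick the class whose corpus 'explains' the message more cheaply."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "non-spam"

# Toy data, invented for illustration only.
SPAM_CORPUS = "buy cheap pills now limited offer click here to order today "
HAM_CORPUS = "meeting moved to thursday please review the attached design notes before lunch "
MESSAGE = "limited offer click here to order today"

print(classify_by_compression(MESSAGE, SPAM_CORPUS, HAM_CORPUS))
```

The decision depends only on two byte counts, so anything that changes how well one corpus compresses (such as having much more of it) shifts the result, which is the sensitivity described above.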
You know what I'd kill for? by Saint+Aardvark · 2002-11-03 06:34 · Score: 3, Interesting

A version of this for Outlook Express.
I work on the helpdesk of a small ISP; I also take care of the spam filtering, and answer abuse@. We recently added SpamAssassin, and God does it rock. (The big spike you see is me getting MRTG to graph what SA catches now; it's 6-10 times better than what we used to catch.)
But I still get complaints from our customers about spam that gets through. Just the other day a crapload got through because it was relatively subdued spam (no webbugs, NO LINE OF YELLING, etc); unfortunately, it also advertised pictures of young boys having sex. It's hard to explain why it's very, very hard to filter for this sort of thing, especially when I'm going through the talk for the nth time this week. (I need a good analogy that non-geeks can understand; I'm still looking.)
The good folks at DeerSoft have a version of SpamAssassin for Outlook, and are promising one for OE Real Soon Now. But I would loooooooooooooooooooooooove a good spam program -- this or SA or something else -- that I could point our customers to. Download, double-click, say yes, and bam it's installed. I can figure out how to install this on a Unix box; I could probably, eventually figure out how to do it on a Windows box; there's no way the customers could do it.
Or am I missing good, free spam filtering for Windows? Can anyone point me in the right direction?
Slightly OT: There has got to be a huge market for setting up spam filtering for small businesses. My idea: Tell 'em that if they provide the box -- an old Pentium or 486 will do -- I'll set up spam filtering and a firewall on it, set up some maintenance tools (whitelist this, firewall that). They get great mail service, I get $x00.

--
Carousel is a lie!
Multi-purpose tool by B'Trey · 2002-11-03 08:10 · Score: 3, Interesting

An interesting idea that I haven't seen discussed is using this concept for more general uses. If we can sort spam from non-spam, how about business from personal? Technical from administrative? All you'd need is multiple databases of word probabilities, the ability to assign emails to multiple categories and a hierarchical method of sorting.
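
A rough sketch of the multiple-database idea, assuming a plain naive Bayes classifier with add-one smoothing and one word-count table per category. The class name `MultiCategoryFilter` and the training phrases are invented for illustration; this is a flat classifier, and the hierarchical sorting is not shown.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

class MultiCategoryFilter:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> messages seen

    def train(self, category, text):
        self.word_counts[category].update(tokenize(text))
        self.doc_counts[category] += 1

    def scores(self, text):
        """Log-score per category: log prior plus smoothed log likelihoods."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        result = {}
        for category, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[category] / total_docs)
            for word in words:
                # Add-one smoothing so unseen words do not zero out a category.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            result[category] = score
        return result

    def classify(self, text):
        s = self.scores(text)
        return max(s, key=s.get)

mail_filter = MultiCategoryFilter()
mail_filter.train("business", "quarterly invoice budget meeting contract")
mail_filter.train("personal", "dinner weekend family birthday party")
mail_filter.train("spam", "free money click offer winner")
print(mail_filter.classify("budget meeting about the contract"))
```

Assigning a message to several categories at once could be approximated by keeping every category whose score is close to the best one, though the right cutoff would have to be tuned on real mail.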

--
"The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.
Re:What about random misspellings? by PigleT · 2002-11-03 08:21 · Score: 3, Interesting

Dual feedback loops. Every mail that matches spam gets fed back into the system so both the is-spam AND the is-good wordlists become more "concentrated" over time.
Ifile does this, bogofilter does this with some wangling in procmail, ...

That way, if someone sends something that's still mostly spam (one or two words in common with spam, enough to tip the balance) then all the neutral words will be tarnished as well.
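
A toy sketch of that feedback loop, assuming a deliberately simplified score (the average per-word spam ratio rather than a real Bayesian combination). `FeedbackFilter` and the seed words are hypothetical and do not reflect how ifile or bogofilter work internally.

```python
from collections import Counter

class FeedbackFilter:
    def __init__(self):
        self.spam_words = Counter()
        self.good_words = Counter()

    def score(self, words):
        """Average spam ratio over known words; 0.5 if no word is known."""
        ratios = []
        for w in words:
            s, g = self.spam_words[w], self.good_words[w]
            if s + g > 0:
                ratios.append(s / (s + g))
        return sum(ratios) / len(ratios) if ratios else 0.5

    def classify_and_learn(self, text, threshold=0.5):
        """Classify a message, then feed all its words into the matching list."""
        words = text.lower().split()
        is_spam = self.score(words) > threshold
        target = self.spam_words if is_spam else self.good_words
        target.update(words)
        return is_spam

f = FeedbackFilter()
f.spam_words.update(["free", "viagra"])
f.good_words.update(["meeting", "report"])

first = f.classify_and_learn("free viagra for your meeting")
print(first, f.score(["your"]), f.score(["meeting"]))
```

After the one message is classified as spam, the previously unseen "your" counts as a spam word and "meeting" moves from clearly good to neutral. That is the tarnishing effect, and also the reason a misclassification can reinforce itself.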

--
~Tim
--
.|` Clouds cross the black moonlight,
Rushing on down to the circle of the turn
Re:Sure it's promising by Tim+Browse · 2002-11-03 08:51 · Score: 4, Interesting

One interesting fact that came out of these statistical analyses of spam was from one that was featured a while back on Slashdot. The guy was doing word analysis, looking for good spam indicators/correlations, and expected "sex" or "teens" to be a good match, but the best word was, surprisingly, "ff0000". This was because so much spam uses HTML mail with red text.

So if nothing else, it will force spammers to stop using red text - that has to be some kind of victory :-)
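
A small sketch of how such an indicator ranking could be computed, assuming equally sized spam and non-spam samples and a smoothed per-token ratio. The messages below are made up, and `spam_indicators` is a hypothetical helper, not the analysis the original author ran.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric tokens, so HTML colour codes survive as words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def spam_indicators(spam_msgs, ham_msgs):
    """Rank tokens by a smoothed estimate of P(spam | token appears)."""
    spam_df, ham_df = Counter(), Counter()
    for msg in spam_msgs:
        spam_df.update(set(tokenize(msg)))  # count each token once per message
    for msg in ham_msgs:
        ham_df.update(set(tokenize(msg)))
    scores = {
        tok: (spam_df[tok] + 1) / (spam_df[tok] + ham_df[tok] + 2)
        for tok in set(spam_df) | set(ham_df)
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Invented toy messages, for illustration only.
spam = [
    "<font color=ff0000> hot teens",
    "<font color=ff0000> free sex",
    "make money ff0000 fast",
]
ham = [
    "font rendering bug in the sex education article",
    "free lunch on friday",
    "teens volunteer at the library",
]

ranking = spam_indicators(spam, ham)
print(ranking[:3])
```

In this toy data "sex" and "teens" show up in legitimate mail too, while the colour code appears only in spam, so the markup token ends up ranked highest.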

Tim
Growing a spam filter -- a firsthand experience by devphil · 2002-11-03 11:30 · Score: 4, Interesting

So, the graduate CS course I'm taking this quarter is Evolutionary Computing, which is all about the convoluted nonlinear multidimensional-search-space problems, and guess what our current homework is? That's right, taking statistics on spam data, and using genetic algorithms to evolve a working spam filter.

Due to one typo and two thinkos in my fitness evaluation function, my algorithm evolves -- within only a few dozen generations -- a solution which looks like this:
Ignore the actual contents of the message. 34% of the time, it's spam.

And it's right.

--
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
SpamAssassin by fireboy1919 · 2002-11-03 12:59 · Score: 3, Interesting

This seems to be about using strange approaches to spam filtering, but really... a Bayesian network seems to be a natural step for a system that until now was composed of a series of heuristics with no knowledge of which is more important.

(Why hasn't it been done? Bayesian networks are only taught in AI and statistics classes).

What really interests me is that SpamAssassin claims to use a genetic algorithm to rate how likely an e-mail is to be spam.

--
Mod me down and I will become more powerful than you can possibly imagine!