Working Bayesian Mail Filter
zonker writes "A real, working, honest-to-god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well, here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand, the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software, as I see the Google results for it are growing."
Saw this a few weeks back... Spam filter in Python using Naive Bayes.
And I'm going to check it out right now :) But one long-standing fear I have with such solutions is spammers adapting to new environments (changing the wording used, making the emails look more professional). Sure, they're dumb shits but they're still humans with brains.
Any server-side solutions (MTA==qmail, MDA==procmail) using this (Naive-Bayesian) technique out there?
The mozilla mail client is getting a Bayesian mail filter, too. See http://bugzilla.mozilla.org/show_bug.cgi?id=163188 . Unfortunately, it probably won't show up until after version 1.2 is released.
Try searching for "bayesian email filter" instead of just "bayes email filter" (as in the news post). You'll get better results and more hits since Google doesn't match "*bayes*" (as one would think) when searching for "bayes", but only the actual word "bayes".
Beware: In C++, your friends can see your privates!
More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines and, somewhat older, Maximum Entropy models.
Enough nerd talk for today :-)
Bayesian is statistical theory and methods useful in the solution of theoretical and applied problems in science, industry and government. http://www.bayesian.org/
We need the Stalin Mail Filter (TM) -- it detects spam, hunts down the spammer, and exiles them to Siberia.
evil adrian
Can someone explain why this filter would be useful to me?
"The lesson to be learned is not to take the comments on slashdot too literally." --Vinnie Falco, BearShare
This isn't exactly the first Bayesian mail filter out there. I've been using ESR's bogofilter for weeks now, and I must say it works better than I could have ever imagined. Bogofilter, however, is simply for sorting out spam, while it appears this filter can sort out other things. But honestly, I can set up some simple filters to separate personal emails from work emails, so I'm not entirely sure the extra stuff is that useful.
-Stype
Bus error -- driver executed.
If you had just clicked the POPFile link, you would see the explanation.
Initiative is your friend.
Hyperlinks are your friend.
Don't be afraid, just click.
evil adrian
A couple of URLs quickly found on Google:
http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-7.html
http://www.csse.monash.edu.au/courseware/cse5230/assets/images/week09.pdf
Also, any decent AI/machine learning textbook ought to cover the topic.
-DVK
"The right to figure things out for yourself is the only true freedom everyone shares. Go use it"-R.A.Heinlein
I work on the helpdesk of a small ISP; I also take care of the spam filtering, and answer abuse@. We recently added SpamAssassin, and God does it rock. (The big spike you see is me getting MRTG to graph what SA catches now; it's 6-10 times better than what we used to catch.)
But I still get complaints from our customers about spam that gets through. Just the other day a crapload got through because it was relatively subdued spam (no webbugs, NO LINE OF YELLING, etc); unfortunately, it also advertised pictures of young boys having sex. It's hard to explain why it's very, very hard to filter for this sort of thing, especially when I'm going through the talk for the nth time this week. (I need a good analogy that non-geeks can understand; I'm still looking.)
The good folks at DeerSoft have a version of SpamAssassin for Outlook, and are promising one for OE Real Soon Now. But I would loooooooooooooooooooooooove a good spam program -- this or SA or something else -- that I could point our customers to. Download, double-click, say yes, and bam it's installed. I can figure out how to install this on a Unix box; I could probably, eventually figure out how to do it on a Windows box; there's no way the customers could do it.
Or am I missing good, free spam filtering for Windows? Can anyone point me in the right direction?
Slightly OT: There has got to be a huge market for setting up spam filtering for small businesses. My idea: Tell 'em that if they provide the box -- an old Pentium or 486 will do -- I'll set up spam filtering and a firewall on it, set up some maintenance tools (whitelist this, firewall that). They get great mail service, I get $x00.
Carousel is a lie!
That's /. for you. You guys have modded up to 5 a post that is wrong in both of the equations it posts.
It should be:
Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)
and:
Pr("SPAM"|Email) = Pr(Email|"SPAM") * (proportion of spam) / (probability of getting this particular Email)
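For what it's worth, the denominator is rarely measured directly. It is usually expanded with the law of total probability over the two classes. A sketch of that step, where "ham" (non-spam) and the symbol E for the email are my labels, not the poster's:

```latex
\Pr(\text{spam} \mid E)
  = \frac{\Pr(E \mid \text{spam})\,\Pr(\text{spam})}
         {\Pr(E \mid \text{spam})\,\Pr(\text{spam}) + \Pr(E \mid \text{ham})\,\Pr(\text{ham})}
```

With Pr(ham) = 1 - Pr(spam), everything on the right can be estimated from a sorted pile of old mail.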
jabber: johnynek@jabber.org
SquirrelMail is a WebMail client implemented in PHP. I use the client, but not the plugin (I use Razor).
Bogofilter has been out since August, and does this Bayesian spam stuff in C, which will probably run a bit faster than the Perl or Python versions just because of its compiled-ness. I've never run it myself, but people on Debian lists say it works better or not as good as SpamAssassin.
You can't see this if you have sigs turned off.
This may be self-regulating. Consider the Chinese room; if something is capable of perfectly emulating recognition of Chinese, then it can be said to recognize Chinese. Likewise, if a spammer becomes sufficiently skilled at writing undetectable prose, he or she will have reached a skill level at which he or she can pursue more profitable writing ventures. The margins in spam are pretty small. Those spams are being written by morons because morons are cheap.
Stop-Prism.org: Opt Out of Surveillance
I just received the November edition of the TPJ which included a fine article "perlcc & Compiling Perl Script".
In short, the filter script could be compiled to C and built into a native binary for a variety of platforms, eliminating the need for a Perl interpreter.
If you had just clicked the POPFile link, you would see the explanation.
I also highly recommend this link, as it goes into quite a lot of detail on this filtering technique. After reading it, I am going to give the Perl variation a shot.
NGWave - Fast Sound Editor for Windows
Read the referenced article. The only way to avoid the filter is to make your email sound like a normal message. In essence, the filter recognizes the sales pitch. If you remove the sales pitch to get your spam past the filter, you've removed the whole point of sending the spam.
"The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.
Just because it's the first one that actually makes the slashdot frontpage it doesn't mean it's the only one.
Do a freshmeat search for bayespam, bogofilter and spamprobe, they're all working and quite mature bayesian filters (or should we say "paulgrahamian" in order to appease the "true bayesian" crowd). Hell, even a search for "bayes" will turn out a few more hits, like ifilter, which aims to automatically classify mail in different folders, but could be easily tuned to filter out spam.
Of these, I think spamprobe is becoming the true "swiss army knife" of "bayesian" filtering; I did find both bogofilter and bayespam spartan, but they work well. spamprobe, on the other hand, is very actively maintained, is under constant improvement by the author, Brian Burton, and has given me excellent results getting rid of over 90% of my spam.
In my testing (over the last 30 mins) I discovered that filtering is employed when the POP3 "RETR" (retrieve entire message) command is used but no filtering is done when the equally useful "TOP" (show me the headers and X lines of the body) command is issued by a client.
A huge advantage of also doing the filtering for the TOP command would be that mail clients such as The Bat, Pimmy, JBMail and PocoMail will let you preview all headers while leaving mail on the server (or deleting it, whatever) but without actually downloading the full message bodies.
If this is only intended for client side use then it still doesn't address the issue of all the bandwidth that spam wastes. Wouldn't it just be a better project to help all the idiots close the open relays on their servers? Or maybe require authentication on all SMTP servers?
This is what POPFile is for. It's a POP3 proxy server: it sits between your POP3 client and the server and simply adds a classification to the headers (or to the subject line for braindead mail clients).
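To make the "adds a classification to the headers" part concrete, here is a minimal Python sketch of just the tagging step. It is not POPFile's code: the header name `X-Classification`, the function name `tag_message` and the `classify` callback are all made up for illustration, and the proxy/socket plumbing is left out entirely.

```python
from email import message_from_string

# Hypothetical header name chosen for this sketch; not taken from POPFile.
HEADER = "X-Classification"


def tag_message(raw, classify, subject_fallback=False):
    """Return the raw message with a classification header added.

    classify is any callable that takes the parsed message and returns
    a label such as "spam" or "work" (an assumption of this sketch).
    """
    msg = message_from_string(raw)
    label = classify(msg)

    # Drop any stale copy of the header, then add ours.
    del msg[HEADER]
    msg[HEADER] = label

    # For clients that cannot filter on arbitrary headers,
    # also put the label in the subject line.
    if subject_fallback:
        subject = msg.get("Subject", "")
        del msg["Subject"]
        msg["Subject"] = f"[{label}] {subject}"

    return msg.as_string()
```

The mail client then only needs an ordinary rule such as "header X-Classification contains spam, move to Junk".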
Currently POPFile is a bit rough on computer newbies; it needs a Perl install and such. However, if you read the forums, it is intended to end up as an easily installed executable for Windows users and to remain a nifty little Perl script for the rest of the platforms where it might come in handy. So when those pesky friends and relatives come asking about all the viagra and farmyard spam they get (and you haven't already set them up on your tightly filtered mail server), set up POPFile for them.
Also, it's not just for spam filtering. Think of what you could do if you could go beyond simple rules for your inbox. Want email you think is important forwarded to your phone? Create a category for important email, then go through your archives and feed POPFile email you would have wanted forwarded instantly. Create a new folder to receive those mails and watch it for a few days, retraining POPFile until it is getting reasonably good at putting important mail in there. Now set up your mail system to forward those to your phone. Will it work? I don't know, but based on the results I'm getting, it probably would. How about using it to filter help desk emails?
Bleh!
To put this in simpler terms, consider this scenario: 90% of all X-rays from women with breast cancer have a certain feature. That is an easy statistic to compute; you have the X-rays and you follow up with the patients.
The trick is to derive a statement like: "If an X-ray has this feature, the patient has an NN% chance of having breast cancer." THAT's useful for screening, but it doesn't follow from the first statement (without some serious statistical calculations).
Bayes' theorem has all sorts of applications in prediction. In the case of E-mail, we can greatly oversimplify and say "We found that X% of spam E-mails have this subject line." "We conclude that an E-mail with this subject line has a Y% chance of being spam." Note that these are two very different statements. If we can find Y for the second statement and set a threshold we're comfortable with, say 95%, then we can create a filter with 95% confidence of correctness; it may well be wrong 5% of the time.
Other responses have done a good job with the math so I won't repeat it here.
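For anyone who wants to see how far apart the two statements can be, here is a small Python sketch. All three numbers are invented for illustration: the feature shows up in 90% of X-rays from patients with cancer, in 5% of X-rays from patients without it, and 1% of screened patients have cancer.

```python
def posterior(p_feature_given_cancer, p_feature_given_healthy, p_cancer):
    """Pr(cancer | feature) via Bayes' theorem."""
    # Overall chance of seeing the feature, over both groups of patients.
    p_feature = (p_feature_given_cancer * p_cancer
                 + p_feature_given_healthy * (1.0 - p_cancer))
    return p_feature_given_cancer * p_cancer / p_feature


# Invented numbers, see the paragraph above.
chance = posterior(0.90, 0.05, 0.01)
print(f"{chance:.1%}")
```

With those made-up inputs the answer is roughly 15%, nowhere near 90%, because healthy patients vastly outnumber sick ones. The same arithmetic applies when "feature" is a subject line and "cancer" is spam.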
Now we can tell spammers: "All your Bayes are belong to us."
An interesting idea that I haven't seen discussed is using this concept for more general uses. If we can sort spam from non-spam, how about business from personal? Technical from administrative? All you'd need is multiple databases of word probabilities, the ability to assign emails to multiple categories and a hierarchical method of sorting.
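A rough Python sketch of the "multiple databases of word probabilities" idea, using plain multinomial naive Bayes with add-one smoothing. The class name, the tokenizer and the category names are all my own choices for illustration. It is flat (one winning category per message), so the hierarchical and multi-category parts are not covered.

```python
import math
import re
from collections import Counter, defaultdict


class BucketClassifier:
    """One word-count table per category; picks the most probable category."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.message_counts = Counter()          # category -> messages seen
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        # Crude tokenizer, an assumption of this sketch.
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, category, text):
        words = self.tokenize(text)
        self.word_counts[category].update(words)
        self.message_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        total_messages = sum(self.message_counts.values())
        words = self.tokenize(text)
        best, best_score = None, -math.inf
        for category, n_messages in self.message_counts.items():
            counts = self.word_counts[category]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus log likelihood of each word (add-one smoothing).
            score = math.log(n_messages / total_messages)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best, best_score = category, score
        return best
```

Usage would look something like `c.train("business", text)` for each archived message, then `c.classify(new_text)` on incoming mail.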
"The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.
These technologies are interesting, but the problem of spam should be solved at the source. Why should we waste our time, money, CPU and drive space trying to outwit spam with clever software? As has been said before, if you filter spam at the inbox, a lot of resources have already been wasted by the time it arrives.
Spam is anti-social behavior - a perversion of technology to make a quick buck. It's a cancer, and we should try to kill it. If you try to fight it any other way, you will constantly be playing catch-up, as the spammers have technology on their side too.
Dual feedback loops. Every mail that matches spam gets fed back into the system so both the is-spam wordlist AND the is-good wordlists become more "concentrated" over time. ...
Ifile does this, bogofilter does this with some wangling in procmail,
That way, if someone sends something that's still mostly spam (one or two words in common with spam, enough to tip the balance) then all the neutral words will be tarnished as well.
~Tim
--
Rushing on down to the circle of the turn
Actually, irony is generally considered to be "use of words to express something different from and often opposite to their literal meaning".
Sarcasm is often defined as a form of irony (but not necessarily), intended to be cutting/offensive etc.
So while his comment may have been sarcasm, it was also irony.
And I'm not pedantic, I'm pernickety. :-)
Tim
I don't think an
In general this illuminates one of the advantages of Unix. Lots of programs are written as filters that read from STDIN (standard input) and write to STDOUT (standard output). My own mail filtering script, for example, does that. I didn't have to learn any mailer-specific API, and my script can be used in different contexts. (Actually my script doesn't write to STDOUT - it saves the message to the appropriate folder.)
Windows does not lend itself to the everything-is-a-filter idea because, among other things, process creation is slow and expensive. When a filter is invoked, a process is launched. Unix has more efficient process creation, and Linux has especially efficient and light process creation. Therefore on Windows a mail filter should be implemented as a reusable software component (probably a COM object) that can be called by the mail client.
Also, most mail clients on Unix use the same mail folder format (mbox) which is basically just the literal messages from the network written to a file. Since it is the assumed common language of mail folders, it encourages software to interoperate on the file level, which my script does by writing messages to mail folders. (Unix is file-centric.) Windows mail clients, in contrast, seem to store mail folders in proprietary formats. That's because Windows philosophy is that an application serves as gatekeeper to "its" files - the file is not a unit of interoperability. In our case it means a standalone mail filter probably couldn't write messages to the mail folder.
Unix is a more friendly, efficient development environment because you can write a mail filter as a standalone program and test it without building a test harness.
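As a sketch of what such a standalone filter can look like (this is not the poster's script; `file_message`, `choose_folder` and `folder_root` are names invented here), the following Python reads one message from a stream and appends it to an mbox file picked by a rule:

```python
import email
import mailbox
import os


def file_message(stream, choose_folder, folder_root="."):
    """Read one message from stream and append it to an mbox folder.

    choose_folder is any callable that takes the parsed message and
    returns a folder name (an assumption of this sketch).
    """
    msg = email.message_from_file(stream)
    folder = choose_folder(msg)
    box = mailbox.mbox(os.path.join(folder_root, folder))
    try:
        box.lock()    # keep other mail programs out while we append
        box.add(msg)
    finally:
        box.close()   # flushes, unlocks and closes the file
    return folder
```

Hooked up to an MDA, the call would be `file_message(sys.stdin, my_rule)`. Testing needs no harness: pipe a saved message into it and look at the resulting file.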
I think you may have misunderstood that comment. Since Paul Graham started talking about Bayesian filtering, there's been some tendency here to refer to all learning spam filters as Bayesian. Which results in complaints, which results in the designation "pseudo-Bayesian" for the many independently-discovered learning algorithms that don't have a theoretical underpinning.
Put another way: if an algorithm outputs a dimensionless "score", and the author can't set an upper bound on the score, it's at most pseudo-Bayesian. If it outputs a probability that the message meets certain criteria, then it could be "true Bayesian". Additional implication: the "pseudo-Bayesian" filter may have a stack of rules in addition to its table of probabilities.
I don't think we're splitting hairs on some deep statistical issue. I think we're groping for very rough categories in a new field of application software. If you can establish clearer categories, that might help.
Graham addresses this in the article. One can identify most spam with a simple rules-based engine. That tends to make one lazy in reading the spam folder, which means false positives can languish unread. Enhancing the rules-based engine becomes an ongoing project as the volume and cleverness of spam increase. Hopefully Bayesian filtering can automate this.
Welcome to the future: the mail client in Mac OS X 10.2 uses latent semantic analysis. (This isn't just marketingspeak--my mail folder includes "LSMMap"--LS as in "latent semantic".)
How long until we can set up Bayesian by-word filtering on Slashdot comments?
-- Ed Avis ed@membled.com
So, the graduate CS course I'm taking this quarter is Evolutionary Computing, which is all about the convoluted nonlinear multidimensional-search-space problems, and guess what our current homework is? That's right, taking statistics on spam data, and using genetic algorithms to evolve a working spam filter.
Due to one typo and two thinkos in my fitness evaluation function, my algorithm evolves -- within only a few dozen generations -- a solution which looks like this:
And it's right.
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
This seems to be about using strange approaches to spam filtering, but really... a Bayesian network seems to be a natural step for a system that until now was composed of a series of heuristics with no knowledge of which is more important.
(Why hasn't it been done? Bayesian networks are only taught in AI and statistics classes).
What really interests me is that SpamAssassin claims to use a genetic algorithm to rate how likely an e-mail is to be spam.
Mod me down and I will become more powerful than you can possibly imagine!
This whole methodology is already patented by Microsoft. ANY implementation not licensed by Microsoft is going to be a violation... And now that you know, it is treble damages...
patent 6,161,130