Working Bayesian Mail Filter
zonker writes "A real, working honest to god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training was I able to get very effective filtering out of it. From what I understand the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalant in client software as I see the Google results for it are growing."
Saw this a few weeks back... Spam filter in Python using Naive Bayes.
The mozilla mail client is getting a Bayesian mail filter, too. See http://bugzilla.mozilla.org/show_bug.cgi?id=163188 . Unfortunately, it probably won't show up until after version 1.2 is released.
More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines and, somewhat older, Maximum Entropy models.
Enough nerd talk for today :-)
We need the Stalin Mail Filter (TM) -- it detects spam, hunts down the spammer, and exiles them to Siberia.
evil adrian
If you had just clicked the POPFile link, you would see the explanation.
Initiative is your friend.
Hyperlinks are your friend.
Don't be afraid, just click.
evil adrian
A couple of URLs quickly found on Google:/ section-7.html a ssets/images/week09.pdf
http://www.faqs.org/faqs/ai-faq/neural-nets/part3
http://www.csse.monash.edu.au/courseware/cse5230/
Also, any decent AI/machine learning textbook ought to cover the topic.
-DVK
"The right to figure things out for yourself is the only true freedom everyone shares. Go use it"-R.A.Heinlein
While spammers will undoubtedly continue to refine the content of their messages, one of the strengths of using a Bayesian filter like this is that it uses the user's own spam and non-spam (ham) as the basis for its calculations. This means that messages are categorized not only by whether they contain spammy words, but also whether they contain the hammy words from your own messages. So, even if spammers could refrain from using words like "free" "mortgage" "sluts" and "spam", they probably wouldn't use words that discriminate your own ham from others (e.g., if you are a computer scientist, your mail may include hammy words like "algorithm" "compile" "project" or "stargate" that would help distinguish ham from spam. The challenge to the spammer would then be to target you with spam that looks like *your* ham (which is probably different from the ham of others).
Future systems (assuming faster processors and more HD space) could include semantic analysis (e.g., Latent Semantic Analysis) to do an even better job and go beyond the word level.
That's /. for you. You guys have modded up to 5 a post that is wrong in both of the equations it posts.
It should be:
Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)
and:
Pr("SPAM"|Email) = Pr(Email|"SPAM") * (proportion of spam) / (probability of getting this paticular Email)
jabber: johnynek@jabber.org
If you had just clicked the POPFile link, you would see the explanation.
I also highly recommend this link, as it goes into quite a lot of detail on this filtering technique. After reading it, I am going to give the Perl variation a shot.
NGWave - Fast Sound Editor for Windows
This is what POPFile is for. Its a pop3 proxy server, it sits between your pop3 client and the server and simply adds a classification to the headers (or the subject line for braindead mail clients).
Currently POPFile is a bit rough on computer newbies, it needs a Perl install and such. However, if you read the forums it is intended to end up as an easily installed executable for windows users and to remain a nifty little perl script for the rest of the platforms where it might come in handy. So when those pesky friends and relatives come asking about all the viagra and farmyard spam they get (and you haven't already set them up on your tightly filtered mail server) set up POPFile for them.
Also, its not just for spam filtering. Think of what you could do if you could go beyond simple rules for your inbox. Want email you think is important forwarded to your phone? Create a category for important email and go through your archives and feed POPFile email you would have wanted forwarded instantly. Create a new folder to recieve those mails and watch it for a few days, retraining POPFile until it is getting reasonably good at putting important mail in there. Now set up your mail system to forward those to your phone. Will it work? I don't know, but based on the results I'm getting, it probably would. How about using it to filter help desk emails?
Bleh!
Now we can tell spammers: "All your Bayes are belong to us."
(e.g., if you are a computer scientist, your mail may include hammy words like "algorithm" "compile" "project" or "stargate" that would help distinguish ham from spam.
I have a cousin that lives in Nigeria and we regularly discuss tips on penis enlargement. He works at a bank refinancing mortgages and his wife is a professor at an accredited university. I work in in a Las Vegas casino producing shows featuring live nude showgirls. He offered to help me pay some bills and get out of debt (a generous offer, but I told him I just found a second part time job working from home earning thousands of dollars per week). My wife is a stock broker and I regularly let my cousin in on hot stock tips. I have an herb garden, I take viagra, and use rogaine. Since we both own the same brand of printer we've been working out the best way to refill the ink cartridges. I've been trying to lose weight, but it comes right back as soon as I quit smoking.
I don't quite understand this "beysian filter" stuff, but I can't wait to try it out!
-
- - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.