Bayesian Filtering For Dummies
Dynamoo writes "Bayesian filtering for spam is awfully clever stuff, touched on by Slashdot several times before. There's a very accessible article at BBC News explaining in fairly simple terms the drawbacks of current keyword-based filtering. It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name. Those Vikings have a lot to answer for."
I suggest Slashdot immediatly implement this "Bayesian Filter for Dummies" to remove most of the trolls, etc.
The BBC article mentions Paul Graham, and I found his page (and some more information on Bayesian networks for spam filtering) here:
Paul Graham's spam page
He talks a little bit more about the technical aspects there.
the blood has stopped pumping, and he's left to decay
the me that you know is now made up of wires
I've been using it for a bit on my own e-mail, and it seems to work out. But it's not at the point where I'd be happy to see ISPs implementing it for their customers -- even ignoring the Freedom of Speech issue, it still has the occasional false positive.
Try not. Do or do not, there is no try.
-- Dr. Spock, stardate 2822-3.
It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name.
Does anyone have proof thats where the name comes from?
Mouse powered Chips, Open source Processors and Lego
When all you have is a hammer, everything looks like a skull.
I have used a bayesian filter for some time now and while it is the BEST filter type I have ever used nothing is 100% reliable. While this is the best technology for the average user it is most cirtainly not perfect. Instead I use a combination of moderate bayesian filtering and good old fasion "block sender" filtering.
Someone needs to learn the meaning of "ironic". (Hint: it doesn't mean "weird coincidence".)
Paul
Interesting yes, ironic, no.
What's your name, Alanis Morissette ?
This whole spam thing reminds me of a story I read while in 7th grade. In it, the postage for sending junk mail was decreased to practically nothing. Then, junk mail buried America. Hundreds of years later, archeologists came back and investigated the remains. Their conclusions about our society are kind of humorous. However, the idea of junk mail burying us when the postage goes way down has kind of been proved with spam. Maybe a small tax for spam wouldn't be a bad idea.
Thank you for your support.
Is this truly the only Earth I can live on?
Viagra often spelled V-l-a-g-r-a online
I-f I t-r-o-l-l l-i-k-e t-h-i-s, w-i-l-l i-t p-a-s-s S-l-a-s-h-d-o-t.'s t-r-o-l-l f-i-l-t-e-r ?
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
Why then, does the article show a pic from a Monty Python animation about the black spot who goes to seek his fortune...
You'd think they'd use the actual pic of the skit with the Vikings in the cafe...
/sig
A group of vikings in a monty python sketch drowned out normal conversation by shouting the word "spam" louder and louder. The word was then adopted for all the crap drowning out normal conversation on usenet.
So this filter works on analysis of previously filtered mail?
I can see the casual (mis)use of this technique by your average user rapidly becoming a problem - putting just one email from a legit e-mail sender into the bayesian filter could concievably snowball into a block on a lot of legit traffic under certain circumstances.
Above and Below knows I have enough hassle with users and their e-mail already
The Monty Python Comedy troupe did a rather famous (in some Geek circles) skit in which the virtues of canned Spiced Ham are literally sung. Inexplicably, a group of Vikings join in the song.
The poster, obviously better schooled in British farce than luncheon meats, is under the impression that the widely accepted nickname for unsolicited e-mail is derived from the comedy sketch and not from Spam(tm), the food.
I don't know for certain if he's wrong, but I have a hunch he is. I'm guessing a lot more people have eaten Spam than have digested the Python skit...
You can try bogofilter, ifile, SpamBayes, or POPFile. The newer versions of SpamAssassin also implement some kind of Bayesian filtering.
"The sheer number of spam mail sent means that even tiny response rates, reportedly 0.0001%, means junk mailers turn a profit. "
And this is why I say that educating users is just about as important as implementing spam filtering technology. If people know that they are perpetuating a serious problem by replying to spam, then that's bad news for spammers.
About another fact mentioned in the article: It said Paul Graham's filter extracts "the top 15 features that define them as spam." 15? I thought that most Bayesian filters use many more spam-defining features. Because I'd say that there are quite a few more. Just think of the many features that spam tends to have. But he says his filter works well. Interesting.
In my home mailbox, I don't receive spam. And I only got two 419 nigerian invesment frauds on my professional address in a whole year, despite the fact that my corporate email address is widly publicized and easy to find on google. And amazingly, I never receive spam in my "special bogus registration" hotmail account (useful for programs like RealPlayer, or nytimes.com).
:
So existing mail filters work for me, more or less. The few unwanted mails that pass through are easily taken care of by my trusted delete button. This leads me to ask
- Do other people really receive that much spam, or am I an isolated case ?
- Do people who receive spam purchase things online, or register software and other services with their real names and email ?
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
...supposedly uses some form of Baysian reasoning. I've been using it for a year now. I trained it for a couple of weeks, turned it on "automatic filtering" mode, and now I can count the number of times its misclassified a message on my two hands. I used to get more spam than legit mail, now I can't help but wonder why spam is a problem for people. Until I remember that most people don't use a mac. Every once in a while, I flip it back into training mode so that I can see the lovely see of brown-colored spam messages that flood my inbox. I flip it back to automatic mode, Mail automatically moves them to my junk folder, and I can forget about them.
I have a simple questions, is there a way to impliment a Bayesian Filter for Evolution without having to add an extra stop for the email, (ie a mail server on my computer from which evolution picks mail up locally).
I do security
I find in our case it stops 98-99% of spam dead in its tracks. There have been a few false positives, and you do need check from time to time just in case an genuine emails are misclassified, but it's surprising just how quickly the filter sorts the wheat from the chaff.
Don't expect miracles but they can save you a lot of time... what I find cool is that it learns so quickly, almost like a complicated neural net should, but it's such a simple idea. I wonder if there are any other uses for this kind of thing?
Sorry, but my karma just ran over your dogma.
I simply got to the point that I could count the number of real emails on my hands. So I reversed my previous filter. Instead of filtering spam to my spam folder, I made it default *ALL* mail to the spam folder except from certain known addresses (such as work, friends and my own domain). So far, it has only made one wrong decision, and that was because I hadn't written the email address of a friend correctly.
:)
This is waaaaay better than any other filtermethod I've tried and requires no learning period at all
I like SpamBayes for its ability to be trained on past spam. You can point it to a folder full of past spam and it scores them all, which is much faster than gradually teaching the software to recognize spam through individual email updates.
POPFile does not have this convenient ability (yet), though it does do general purpose sorting (i.e. not just differentiate between spam and non-spam, but stuff like work, school, linux or whatever you want). It does take a while to train though.
Well, the type of Bayesian learning used in this spam filtering is called "Naive Bayesian" and the engine is trained using "supervised learning" technique. Naive Bayes has been proven very successful for text categorization. Spam filtering is even more successful because we essentially categorize e-mails to two labels: "spam" or "not spam".
Supervised learning basically works like this. Feed the engine with multiple examples (in this case, e-mails) with labels (in this case, "spam" or "not spam"). The training usually takes thousands of examples to get good enough accuracy. And take note that we need both "spam" and "not spam" examples to enable the learning engine to distinguish them.
How Naive Bayes works? Well, think of the full Bayesian Network. Bayes net is basically a causal-effect graph with annotated Conditional Probability Table (CPT) on each node denoting the probabilities of possible values. Full Bayes Net takes Directed Acyclic Graph (DAG), but Naive Bayes takes a form of tree instead due to some "naive" assumptions. (Okay, I handwaved a whole lot of details here) And in Learning Naive Bayes, we basically try to construct the tree out of the examples.
Let P(spam) be the percentage of training e-mails that is labelled as "spam" and P(not spam) be the percentage of "not spam" e-mails.
First, let the filter reads all e-mails and collect the words out of them. Weed out duplicates and stop words (common words like "I", "you", "the", etc). Let NumVocab be the number of words after weeding.
Second, process e-mail one by one. Do weeding phase like the above. Let "n" be the number of words on that particular e-mail after the weeding. Scan the word one by one. Let "w" be the current word scanned and "nw" be the number of times word "w" occur in that e-mail. Imagine you have a big two dimensional array to store the result (let's call the array "P"). If the e-mail is labeled "spam", then store (nw+1)/(n+NumVocab) to P[w][spam].
Repeat until all training e-mails are read.
And here comes the testing phase...
When you encounter an e-mail and want to classify whether it's spam or not, you'll need to look up the array P you created earlier. First, you do the weeding phase and scan the word one by one. The algo is like this:
Hope this helps.
--
Error 500: Internal sig error
This allows your single spam/non-spam feedback to the system to do double duty, so that once the program knows that you consider an email source to be "trusted", it will allow even spammy-looking stuff (read: mailing list digests, plane schedules, bank statements, etc) through to your non-spam folder.
Of course, if spammers start constructing google-style databases of who your friends are and impersonating their accounts, then this won't work anymore... but if they start that, all hell is going to break loose anyway.
I don't care if it's 90,000 hectares. That lake was not my doing.
I wonder if a Bayesian classifier could sort out banner ads? I currently use Guidescope to block them, but it would be far better not to rely on a third party to decide what's an ad URL. It think it would work, but training it might be hard.
(And before anyone says "Don't do that, websites will die" my response would be "Good, let most of them die." I hate ads.)
Comment removed based on user account deletion
Instead I filter all of my mail for wanted/expected mail into a (large) tree of input folders, mailing lists, company mailings etc.
Most of what's left is spam, so a quick scan of the inbox (and creation of new rules) weeds out the uncaught desirables and the rest gets dropped in the bitbucket.
The point being that legitimate mail doesn't try to spoof my filters. I haven't (yet) had any spam arriving where it shouldn't. I'd rather my ISP dumped all the crud in the bin for me, but my marginal cost is low as I'm on ADSL. I now also use a distinct email for each purpose, making it easy to spot where spammers got it from and to create new rules as needed. It's a shame I didn't do this at the start as I have a couple of early ones that are spammed but I can't dump.
I hereby inform you that I have NOT been required to provide any decryption keys.
I'm using this now, and it works great!
Get it here.
-ted
Graham's method is called "naive Bayesian", and it's called "naive" for a reason. It works surprisingly well, but it barely scratches the surface of what people are doing with statistical models of text.
The lack of references on Graham's web site to prior work on text classification makes one wonder whether he just is unfamiliar with a huge body of literature going back decades or whether he just deliberately ignores them. Either way, Graham didn't invent any of the techniques and they are far from state-of-the-art. (Incidentally, you'll probably find Octave or Perl/PDL a more convenient language for implementing this stuff than Lisp.)
Anybody seriously interested in text filtering should at least do a little bit of background reading. "Readings in Information Retrieval" by Jones and Willett covers some of the basic papers.
Mozilla incorporates a twostep filter:
1. Is the sender in the address book? If yes, is not spam, otherwise:
2. Does the message have a probability of 90% that it is spam based on the Bayes filter? If so, flag as spam, otherwise not spam.
Great. Wanna filter my email for me? ;)
The sheer number of spam mail sent means that even tiny response rates, reportedly 0.0001%, means junk mailers turn a profit.
Are we missing a critical factor of the end user who actually responds to SPAM?
If spammers survive on 0.0001% response rate, how many people are actually clicking/buying? Are these people who provide the customers for spammers going to stop or use any sort of filters?
That's what we refer to as "Alanis irony", 8^)
I don't use email. Yes, I have a few addresses but I havent checked them in months. Email is kinda dead way of communication anyway, beaten by things such as mobile phones and instant messaging.
On my mailbox outside my apartment I have a "No Junk Mail please" sticker... This actually works. I tried to put the same sticker on my pc, but the junk mail just keeps on comming... I don't understand....
Ask me no questions, and I'll tell you no lies...
Why go through all the work of training some software to read your email and decide if you might want to read it when most email programs have white list capabilities?
If I don't know you, that means I don't want to talk to you. Your email goes straight a junk folder, which I can quickly scan once every few days for from names I recognize. I can add these names to my white list if I so choose.
Granted, my job does not involve me soliciting contacts from the public at large, so this wouldn't work for everyone. I use it on my personal Hotmail account though, and I get to not even consider lots of crap every day.
You can never put too much water in a nuclear reactor.
Not many people know were the term "spam" comes from, but everyone[1] knows what it means (email wise), and it does come from the Monty Python sketch.
However it did not orginally go with bulk email, but instead with some wanker on posting the same post over and over again on a newsgroup or IRC.
[1] Including "normal" users.
Wow, I should not post when knackered.
Don't forget SpamProbe as well. I've been using it for a couple weeks, and it has been working very well for me. I've gotten around 1400 messages, and so far 1 false positive and 6 false negatives. I don't know how well the other filters work, but that seems pretty good to me. It's sure a hell of a lot better than the DNS blacklists I use. (I'm still using those. After all, they filter out the first 70% of my incoming mail and are probably faster anyway.)
This is SO educational! -- Kintaro Oe
You'll soon be running out of bits to store the floating point results. Implement it by adding logarithms of probabilities instead of products of them, thus:
If you have a couple of hundred key-words, this will make a lot of difference concerning the accuracy of the predictions.