Bayesian Filtering For Dummies
Dynamoo writes "Bayesian filtering for spam is awfully clever stuff, touched on by Slashdot several times before. There's a very accessible article at BBC News explaining in fairly simple terms the drawbacks of current keyword-based filtering. It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name. Those Vikings have a lot to answer for."
I suggest Slashdot immediately implement this "Bayesian Filter for Dummies" to remove most of the trolls, etc.
The BBC article mentions Paul Graham, and I found his page (and some more information on Bayesian networks for spam filtering) here:
Paul Graham's spam page
He talks a little bit more about the technical aspects there.
the blood has stopped pumping, and he's left to decay
the me that you know is now made up of wires
Someone needs to learn the meaning of "ironic". (Hint: it doesn't mean "weird coincidence".)
Paul
Good question ... through Google Groups I found this page.
the blood has stopped pumping, and he's left to decay
the me that you know is now made up of wires
Interesting yes, ironic, no.
What's your name, Alanis Morissette ?
Viagra often spelled V-l-a-g-r-a online
I-f I t-r-o-l-l l-i-k-e t-h-i-s, w-i-l-l i-t p-a-s-s S-l-a-s-h-d-o-t.'s t-r-o-l-l f-i-l-t-e-r ?
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
Why then, does the article show a pic from a Monty Python animation about the black spot who goes to seek his fortune...
You'd think they'd use the actual pic of the skit with the Vikings in the cafe...
/sig
A group of Vikings in a Monty Python sketch drowned out normal conversation by shouting the word "spam" louder and louder. The word was then adopted for all the crap drowning out normal conversation on Usenet.
You can try bogofilter, ifile, SpamBayes, or POPFile. The newer versions of SpamAssassin also implement some kind of Bayesian filtering.
In my home mailbox, I don't receive spam. And I only got two Nigerian 419 investment frauds on my professional address in a whole year, despite the fact that my corporate email address is widely publicized and easy to find on Google. And amazingly, I never receive spam in my "special bogus registration" Hotmail account (useful for programs like RealPlayer, or nytimes.com).
So existing mail filters work for me, more or less. The few unwanted mails that pass through are easily taken care of by my trusted delete button. This leads me to ask:
- Do other people really receive that much spam, or am I an isolated case?
- Do people who receive spam purchase things online, or register software and other services with their real names and email?
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
...supposedly uses some form of Bayesian reasoning. I've been using it for a year now. I trained it for a couple of weeks, turned it on "automatic filtering" mode, and now I can count the number of times it's misclassified a message on my two hands. I used to get more spam than legit mail; now I can't help but wonder why spam is a problem for people. Until I remember that most people don't use a Mac. Every once in a while, I flip it back into training mode so that I can see the lovely sea of brown-colored spam messages that flood my inbox. I flip it back to automatic mode, Mail automatically moves them to my junk folder, and I can forget about them.
I simply got to the point that I could count the number of real emails on my hands. So I reversed my previous filter. Instead of filtering spam to my spam folder, I made it default *ALL* mail to the spam folder except from certain known addresses (such as work, friends and my own domain). So far, it has only made one wrong decision, and that was because I hadn't written the email address of a friend correctly.
:)
This is waaaaay better than any other filter method I've tried and requires no learning period at all.
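The reversed filter described above (everything goes to spam unless the sender is known) can be sketched in a few lines of Python. This is only an illustration: the function name, the addresses, and the domain are made up, and a real mail client would express the same rule in its own filter settings.

```python
def route_message(sender, trusted_addresses, trusted_domains):
    """Return 'inbox' for known senders and 'spam' for everyone else."""
    address = sender.strip().lower()
    # Everything after the last '@' is treated as the domain.
    domain = address.rpartition("@")[2]
    if address in trusted_addresses or domain in trusted_domains:
        return "inbox"
    return "spam"


# Hypothetical whitelist: work, friends, and the user's own domain.
trusted_addresses = {"friend@example.org", "boss@example.com"}
trusted_domains = {"mydomain.example"}

print(route_message("Friend@Example.org", trusted_addresses, trusted_domains))
print(route_message("offers@bulk-mailer.example", trusted_addresses, trusted_domains))
```

Note that a mistyped whitelist entry sends a friend's mail to the spam folder, which is exactly the one wrong decision mentioned above.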
Well, the type of Bayesian learning used in this spam filtering is called "Naive Bayesian", and the engine is trained using a "supervised learning" technique. Naive Bayes has proven very successful for text categorization. Spam filtering is even more successful because we essentially categorize e-mails into just two labels: "spam" or "not spam".
Supervised learning basically works like this. Feed the engine with multiple examples (in this case, e-mails) with labels (in this case, "spam" or "not spam"). The training usually takes thousands of examples to get good enough accuracy. And take note that we need both "spam" and "not spam" examples to enable the learning engine to distinguish them.
How does Naive Bayes work? Well, think of a full Bayesian network. A Bayes net is basically a cause-and-effect graph with a Conditional Probability Table (CPT) annotated on each node, denoting the probabilities of its possible values. A full Bayes net takes the form of a Directed Acyclic Graph (DAG), but Naive Bayes takes the form of a tree instead, due to some "naive" assumptions. (Okay, I handwaved a whole lot of details here.) And in learning Naive Bayes, we basically try to construct the tree out of the examples.
Let P(spam) be the percentage of training e-mails that are labelled as "spam" and P(not spam) be the percentage of "not spam" e-mails.
First, let the filter read all e-mails and collect the words out of them. Weed out duplicates and stop words (common words like "I", "you", "the", etc). Let NumVocab be the number of words after weeding.
Second, process the e-mails one by one. Do the same weeding as above. Let "n" be the number of words in that particular e-mail after the weeding. Scan the words one by one. Let "w" be the current word scanned and "nw" be the number of times word "w" occurs in that e-mail. Imagine you have a big two-dimensional array to store the result (let's call the array "P"). If the e-mail is labeled "spam", then store (nw+1)/(n+NumVocab) to P[w][spam]; if it is labeled "not spam", store the same quantity to P[w][not spam].
Repeat until all training e-mails are read.
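The training phase can be sketched roughly as follows in Python. This is a sketch, not a definitive implementation: the stop-word list, the `weed` and `train` names, and the whitespace tokenizer are all made up for illustration. It also makes one assumption the description above leaves open: "nw" and "n" are pooled over all e-mails sharing a label, so that each P[w][label] ends up with a single value instead of being overwritten by each e-mail.

```python
from collections import Counter

# Illustrative stop words only; a real filter would use a longer list.
STOP_WORDS = {"i", "you", "the", "a", "an", "and", "to", "of"}
LABELS = ("spam", "not spam")


def weed(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def train(examples):
    """examples: list of (text, label) pairs with label in LABELS.

    Returns (priors, P, vocab), where
    P[w][label] = (nw + 1) / (n + NumVocab).
    Assumption: nw and n are counted over all e-mails with that label.
    """
    counts = {label: Counter() for label in LABELS}
    docs = Counter()
    for text, label in examples:
        counts[label].update(weed(text))
        docs[label] += 1

    vocab = set(counts["spam"]) | set(counts["not spam"])
    num_vocab = len(vocab)
    priors = {label: docs[label] / len(examples) for label in LABELS}
    totals = {label: sum(counts[label].values()) for label in LABELS}

    P = {}
    for w in vocab:
        P[w] = {
            label: (counts[label][w] + 1) / (totals[label] + num_vocab)
            for label in LABELS
        }
    return priors, P, vocab


examples = [
    ("cheap viagra offer", "spam"),
    ("viagra cheap cheap", "spam"),
    ("meeting agenda for monday", "not spam"),
]
priors, P, vocab = train(examples)
print(priors["spam"], P["cheap"]["spam"], P["cheap"]["not spam"])
```

The "+1" in the numerator means a word never seen under one label still gets a small non-zero probability there, which keeps a single unseen word from zeroing out the whole product later.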
And here comes the testing phase...
When you encounter an e-mail and want to classify whether it's spam or not, you'll need to look up the array P you created earlier. First, you do the weeding phase and scan the words one by one. The algo is like this:
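A minimal Python sketch of that lookup, not the poster's own listing. It assumes the usual Naive Bayes rule: score each label as P(label) times the product of P[w][label] over the words in the e-mail, and pick the larger score. Logs are summed instead of multiplying raw probabilities to avoid underflow on long e-mails. The `classify` name and the hand-filled table are hypothetical, and skipping words that never appeared in training is an assumption.

```python
import math


def classify(words, priors, P):
    """Return the label with the highest log-score.

    score(label) = log P(label) + sum of log P[w][label] over known words.
    Words absent from P (never seen in training) are skipped.
    Assumes every prior is strictly greater than zero.
    """
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for w in words:
            if w in P:
                score += math.log(P[w][label])
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical hand-filled table standing in for a trained one.
priors = {"spam": 0.5, "not spam": 0.5}
P = {
    "viagra": {"spam": 0.30, "not spam": 0.01},
    "meeting": {"spam": 0.02, "not spam": 0.25},
}

print(classify(["viagra", "pills"], priors, P))
print(classify(["meeting"], priors, P))
```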
Hope this helps.
--
Error 500: Internal sig error
This allows your single spam/non-spam feedback to the system to do double duty, so that once the program knows that you consider an email source to be "trusted", it will allow even spammy-looking stuff (read: mailing list digests, plane schedules, bank statements, etc) through to your non-spam folder.
Of course, if spammers start constructing google-style databases of who your friends are and impersonating their accounts, then this won't work anymore... but if they start that, all hell is going to break loose anyway.
I don't care if it's 90,000 hectares. That lake was not my doing.
The sketch is to be found on the album "The Best of Sellers" - probably released in about 1958, and which also features the nursery rhyme
"Up on the chair behind the door,
hey diddle, diddle,
Here comes Poppa
so up with the chopper
and split 'im down the middle"
And "Balham, Gateway to the South", a spoof of the travelogue films that often appeared in the cinema at the time.
Sent from my ASR33 using ASCII