Bayesian Filtering For Dummies

More Spam! by James Littiebrant · 2003-05-26 09:33 · Score: 3, Insightful

I have used a Bayesian filter for some time now, and while it is the BEST filter type I have ever used, nothing is 100% reliable. While this is the best technology for the average user, it is most certainly not perfect. Instead I use a combination of moderate Bayesian filtering and good old-fashioned "block sender" filtering.

Hmmm by Anonymous Coward · 2003-05-26 09:43 · Score: 2, Insightful

So this filter works on analysis of previously filtered mail?

I can see the casual (mis)use of this technique by your average user rapidly becoming a problem - putting just one email from a legit e-mail sender into the Bayesian filter could conceivably snowball into a block on a lot of legit traffic under certain circumstances.

Above and Below knows I have enough hassle with users and their e-mail already

Re:Hmmm by letxa2000 · 2003-05-26 10:30 · Score: 3, Insightful

I can see the casual (mis)use of this technique by your average user rapidly becoming a problem - putting just one email from a legit e-mail sender into the Bayesian filter could conceivably snowball into a block on a lot of legit traffic under certain circumstances.
It's natural to think that is the case, but in reality it isn't. Accidentally putting one email in the wrong corpus ("good" or "spam") will not be enough to kill you. If you consistently fail to put them in the right corpus then over time, yes, things would snowball. But that'll only happen over time. A mistake now and then isn't enough to mess things up.
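To put rough numbers on why one misfiled message barely matters, here is a minimal sketch. The formula (a word's share of per-message frequency that comes from spam, clamped to 0.01..0.99) is a simplified assumption rather than any particular product's method, and the counts are made up for illustration.

```python
def word_spam_prob(spam_hits, n_spam, ham_hits, n_ham):
    """Fraction of a word's per-message frequency that comes from spam.

    Simplified, assumed formula; the result is clamped to [0.01, 0.99].
    """
    b = spam_hits / max(n_spam, 1)   # how often the word shows up in spam
    g = ham_hits / max(n_ham, 1)     # how often it shows up in good mail
    if b + g == 0:
        return 0.5                   # never seen: treat as neutral
    return min(0.99, max(0.01, b / (b + g)))

# Hypothetical corpus: "meeting" appears in 200 of 1000 good mails
# and in 2 of 1000 spams.
before = word_spam_prob(2, 1000, 200, 1000)

# One good mail containing "meeting" is filed as spam by mistake.
after = word_spam_prob(3, 1001, 199, 999)

print(f"before: {before:.4f}  after: {after:.4f}")
```

With these made-up counts the word's score moves only slightly and stays firmly on the "good" side; it would take many consistent mistakes to drag it toward 0.5 or beyond.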

Re:who're the vikings? by RobotRunAmok · 2003-05-26 09:45 · Score: 2, Insightful

The Monty Python Comedy troupe did a rather famous (in some Geek circles) skit in which the virtues of canned Spiced Ham are literally sung. Inexplicably, a group of Vikings join in the song.

The poster, obviously better schooled in British farce than luncheon meats, is under the impression that the widely accepted nickname for unsolicited e-mail is derived from the comedy sketch and not from Spam(tm), the food.

I don't know for certain if he's wrong, but I have a hunch he is. I'm guessing a lot more people have eaten Spam than have digested the Python skit...

Evolution and by Gyorg_Lavode · 2003-05-26 09:53 · Score: 2, Insightful

I have a simple question: is there a way to implement a Bayesian filter for Evolution without having to add an extra stop for the email (i.e. a mail server on my computer from which Evolution picks mail up locally)?

--
I do security

Here's one I've used by wiggys · 2003-05-26 09:55 · Score: 3, Insightful

I set up POPFile a few weeks ago at work to stop the deluge of spam one of our POP3 accounts was getting. I've never used a spam filter before (other than the usual basic keyword-based ones) and I must say that Bayesian filtering is very impressive!

I find in our case it stops 98-99% of spam dead in its tracks. There have been a few false positives, and you do need to check from time to time just in case any genuine emails are misclassified, but it's surprising just how quickly the filter sorts the wheat from the chaff.

Don't expect miracles but they can save you a lot of time... what I find cool is that it learns so quickly, almost like a complicated neural net should, but it's such a simple idea. I wonder if there are any other uses for this kind of thing?

--

Sorry, but my karma just ran over your dogma.

Crude but effective by MrWorf · 2003-05-26 09:59 · Score: 5, Insightful

I simply got to the point that I could count the number of real emails on my hands. So I reversed my previous filter. Instead of filtering spam to my spam folder, I made it default *ALL* mail to the spam folder except from certain known addresses (such as work, friends and my own domain). So far, it has only made one wrong decision, and that was because I hadn't written the email address of a friend correctly.
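A minimal sketch of that reversed filter, for anyone who wants to script it. The function name, folder names and addresses are hypothetical; only `email.utils.parseaddr` is a real standard-library call. Note that the From header is trivially forged, so this assumes spammers don't know who is on your list.

```python
from email.utils import parseaddr

def folder_for(from_header, allowed_addresses, allowed_domains):
    """Return "Inbox" for known senders; everything else defaults to "Spam"."""
    _, addr = parseaddr(from_header)      # "Alice <a@b.c>" -> "a@b.c"
    addr = addr.lower()
    domain = addr.rpartition("@")[2]      # text after the last "@"
    if addr in allowed_addresses or domain in allowed_domains:
        return "Inbox"
    return "Spam"

# Hypothetical allow-list: individual friends plus whole trusted domains.
friends = {"alice@example.com"}
trusted_domains = {"mywork.example", "mydomain.example"}

print(folder_for("Alice <Alice@Example.com>", friends, trusted_domains))
print(folder_for("Hot Deals <win@bulkmail.example>", friends, trusted_domains))
```

A mistyped address in the allow-list simply never matches, which is exactly the one failure described above.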

This is waaaaay better than any other filter method I've tried and requires no learning period at all :)

Re:Required Reading by E-mail Users by dJCL · 2003-05-26 10:01 · Score: 3, Insightful

From my understanding of his full explanation (I read it a while ago, can't remember where, dig around some), each e-mail has every word examined and given a rating from 0.01 (good) to 0.99 (spam). Then the 15 words farthest from 0.50 are selected, some averaging is done, and if the score is over some threshold (say 0.90) it is called spam and trashed. I use spamunition for my Outlook e-mail (working on moving my e-mail over to Linux, hopefully soon, so I can del my Windows boxen) and it can give the stats for each e-mail, and it appears to use the same formula...
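The scoring step described above can be sketched roughly like this. The "averaging" is written here as the product rule from Paul Graham's essay "A Plan for Spam"; whether that is exactly the explanation being recalled, or what spamunition does internally, is an assumption. The function name and the example ratings are made up.

```python
from math import prod

def spam_score(word_ratings, n_interesting=15, threshold=0.90):
    """Combine per-word ratings (each in 0.01..0.99) into one message score.

    Only the n_interesting ratings farthest from the neutral 0.50 are used.
    Returns (score, is_spam).
    """
    interesting = sorted(word_ratings,
                         key=lambda p: abs(p - 0.5),
                         reverse=True)[:n_interesting]
    spam = prod(interesting)                  # evidence for spam
    ham = prod(1.0 - p for p in interesting)  # evidence for good mail
    score = spam / (spam + ham)
    return score, score > threshold

# Hypothetical ratings for the words of two messages.
print(spam_score([0.99, 0.99, 0.50, 0.20]))   # two very spammy words dominate
print(spam_score([0.01, 0.01, 0.60]))         # two very "good" words dominate
```

Because the ratings are clamped away from 0 and 1, neither product can reach zero, so the division is safe.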

Part of the reason this all works is that spammers slowly change their wording over time to beat the static filters, but the Bayesian filter will still catch it on other parts of the message, and add the new wording to the db... the only spam that ever gets through to me now is stuff that is worded exactly like a normal e-mail, and even then they have a hard time, yet all my friends have no problems...

I think the key here (with this software) is to never delete any e-mail: spam goes to the spam folder, sort the other stuff, and stuff you wanted but don't need, move to another folder just to allow the filter to know what to look for... I have 5200 spam e-mails saved and about 1000 legit mails saved and my accuracy level is about 99.9...

Read up on it, this stuff really does work.

Enjoy

--
On Arrakis: early worm gets the bird. Magister mundi sum!

Re:Spam = /dev/null by mnemonic_ · 2003-05-26 10:03 · Score: 2, Insightful

I like SpamBayes for its ability to be trained on past spam. You can point it to a folder full of past spam and it scores them all, which is much faster than gradually teaching the software to recognize spam through individual email updates.

POPFile does not have this convenient ability (yet), though it does do general purpose sorting (i.e. not just differentiate between spam and non-spam, but stuff like work, school, linux or whatever you want). It does take a while to train though.

Browser ad-blocking the same way? by DrJAKing · 2003-05-26 10:19 · Score: 2, Insightful

I wonder if a Bayesian classifier could sort out banner ads? I currently use Guidescope to block them, but it would be far better not to rely on a third party to decide what's an ad URL. I think it would work, but training it might be hard.

(And before anyone says "Don't do that, websites will die" my response would be "Good, let most of them die." I hate ads.)

Re:A bit of info on Bayesian filtering by letxa2000 · 2003-05-26 10:23 · Score: 5, Insightful

A gynecologist probably wouldn't have a corpus that indicates that "sex" is a .97 spam probability. That's the great thing about Bayesian: the spam probability for each word depends on the mail and spam YOU receive. It works dang well, just as Paul Graham claims. I'm averaging 99.7% accuracy this week, and the one spam that got through was written in German.

Re:It's not bad... by letxa2000 · 2003-05-26 10:27 · Score: 2, Insightful

The question is, which produces more false positives: The occasional Bayesian false positive, or the occasional (or not so occasional) good mail that you'll accidentally delete when you're deleting 150 spams per day? If I'm getting 150 spams per day that's 1050 spams per week, which is an awful lot of "deletes." You don't think you're going to accidentally throw out a good message now and then when manually deleting that much spam? I'd venture to say that you'll probably delete more yourself by accident than Bayesian will toss as false positives.

Re:A bit of info on Bayesian filtering by GnuVince · 2003-05-26 10:32 · Score: 5, Insightful

No, because if they have a lot of legitimate mails with words like "sex", "sexy", "penis", "vagina", "viagra", etc., the filter will adapt. That's the whole point. For PG, "sexy" is a sure sign of spam, but for a sexologist, it is not. You train the filter to recognize your spam. So if "sex" appears as much in your legitimate mail as in your spam, "sex" will not be considered a sign of spam.

Bayesian filters adapt, that's why they work so well.
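A small sketch of that adaptation, with two users training separate filters on their own mail. The class, the per-word formula and the sample messages are all hypothetical and heavily simplified; the point is only that the same word ends up with a different rating for each user.

```python
import re
from collections import Counter

class UserFilter:
    """Per-user word statistics (simplified, assumed formula)."""

    def __init__(self):
        self.spam_words = Counter()   # word -> number of spams containing it
        self.ham_words = Counter()    # word -> number of good mails containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        words = set(re.findall(r"[a-z]+", text.lower()))
        if is_spam:
            self.spam_words.update(words)
            self.n_spam += 1
        else:
            self.ham_words.update(words)
            self.n_ham += 1

    def word_prob(self, word):
        b = self.spam_words[word] / max(self.n_spam, 1)
        g = self.ham_words[word] / max(self.n_ham, 1)
        if b + g == 0:
            return 0.5                # unseen word: neutral
        return min(0.99, max(0.01, b / (b + g)))

spam = "hot sex pics click here"

sexologist = UserFilter()
sexologist.train("patient questions about sex therapy", is_spam=False)
sexologist.train(spam, is_spam=True)

engineer = UserFilter()
engineer.train("the build server is down again", is_spam=False)
engineer.train(spam, is_spam=True)

print(sexologist.word_prob("sex"), engineer.word_prob("sex"))
```

For the sexologist the word is as common in good mail as in spam, so it carries no weight; for the engineer it only ever appears in spam.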

Re:I don't receive spam by letxa2000 · 2003-05-26 10:44 · Score: 4, Insightful

There are in fact two big problems with Bayesian filtering (or any content-based filtering) from the perspective of an ISP or company... 1) one person's spam is another person's necessity

But that's why the Bayesian approach calls for every user having their own Bayesian statistics. It's not a "one size fits all" for the entire ISP or company, as is the case with most keyword filters. Every user has a different set of Bayesian statistics, which is why it is very difficult for spammers to get around this filter--they have no way of knowing what words are in each user's statistics.

2) you still have to waste your bandwidth and CPU before you reject it.

It's better to waste your bandwidth and your CPU than to waste the time of those receiving the spam. IMHO...

So Bayesian filters are a good tool of last resort, but there are many other tools that should be used too.

The quicker everyone uses Bayesian filters (as opposed to waiting until all the other filters are incapable of keeping up with spam) the sooner the spammers will be in trouble. I personally use a Bayesian filter together with an up-to-date blacklist of known spamvertised domains, etc. I find that, quite simply, the simple keyword filters catch spam from known spam sites and Bayesian catches the rest. But if I turned off my normal filters Bayesian would have caught it all, since those spams are always assigned a high Bayesian score, too. It almost makes sense to turn off the other filters, but they can be useful if a spammer comes up with a truly unique spam and someone else has already identified the domain name. It's rare, but it can happen. So a combination of technologies is probably the best... but a combination that lacks Bayesian is a combination that could be better.

I don't even try to filter spam out. by belroth · 2003-05-26 10:47 · Score: 2, Insightful

Instead I filter all of my mail for wanted/expected mail into a (large) tree of input folders, mailing lists, company mailings etc.
Most of what's left is spam, so a quick scan of the inbox (and creation of new rules) weeds out the uncaught desirables and the rest gets dropped in the bitbucket.
The point being that legitimate mail doesn't try to spoof my filters. I haven't (yet) had any spam arriving where it shouldn't. I'd rather my ISP dumped all the crud in the bin for me, but my marginal cost is low as I'm on ADSL. I now also use a distinct email for each purpose, making it easy to spot where spammers got it from and to create new rules as needed. It's a shame I didn't do this at the start as I have a couple of early ones that are spammed but I can't dump.

--
I hereby inform you that I have NOT been required to provide any decryption keys.

Re:Yes, we must filter out the dummies by Drakin · 2003-05-26 11:03 · Score: 2, Insightful

Unfortunately there's also the problem of some uneducated people with mod points who can't tell the difference between a truly insightful post and one that is a well-written troll. Then there are the people who confuse a troll with humor that's on topic in terms of a given discussion.

So while it works, there are still some holes in the system.

0.0001% response rates by rippie78 · 2003-05-26 11:06 · Score: 3, Insightful

The sheer volume of spam sent means that even tiny response rates, reportedly 0.0001%, let junk mailers turn a profit.
Are we missing a critical factor of the end user who actually responds to SPAM?
If spammers survive on 0.0001% response rate, how many people are actually clicking/buying? Are these people who provide the customers for spammers going to stop or use any sort of filters?

Re:Brief Tech Notes on Bayesian Filtering by DuSTman31 · 2003-05-26 11:08 · Score: 2, Insightful

Spam filtering is even more successful because we essentially categorize e-mails to two labels: "spam" or "not spam"

True. You could simply have a spam and a not spam category. I don't think that'll necessarily lead to the highest accuracies though.

Spam naturally seems to come in several categories - porn, penis enlargements, mortgages etc. However, it's unlikely that any one spam will simultaneously advertise porn and mortgages. Simply having a "spam" and a "not" category will not take advantage of distinctions such as that.

When setting up systems such as POPFile, consider creating subcategories for each type of spam you tend to get. More work to train, true, but likely to be more accurate once you're done.
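For illustration, a minimal multi-bucket classifier along those lines. This is a generic naive Bayes sketch with add-one smoothing, not POPFile's actual code; the class name, bucket names and training texts are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

class BucketClassifier:
    """Naive Bayes over any number of user-defined buckets."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.msg_counts = Counter()              # bucket -> messages trained

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, bucket, text):
        self.word_counts[bucket].update(self.tokens(text))
        self.msg_counts[bucket] += 1

    def classify(self, text):
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_msgs = sum(self.msg_counts.values())
        best, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # log prior for the bucket, then add-one-smoothed log likelihoods
            score = math.log(self.msg_counts[bucket] / total_msgs)
            for w in self.tokens(text):
                score += math.log((counts[w] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best, best_score = bucket, score
        return best

c = BucketClassifier()
c.train("inbox", "meeting agenda for the project tomorrow")
c.train("spam-mortgage", "refinance your mortgage at low rates")
c.train("spam-pills", "cheap pills enlargement guaranteed")

print(c.classify("low mortgage rates"))
print(c.classify("project meeting tomorrow"))
```

Anything landing in a bucket whose name starts with "spam-" can be treated as spam, so the extra buckets cost nothing at delivery time; whether they actually improve accuracy depends on how consistently they are trained.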

Re:Ironic? by DavyByrne · 2003-05-26 11:18 · Score: 4, Insightful

Actually, I've long wondered whether Alanis was quite clever in choosing a title for that song.

You see, none of the events she describes in the song is an example of irony, making the choice of the title "Ironic," well, ironic.

Re:Yes, we must filter out the dummies by bluelan · 2003-05-26 11:46 · Score: 5, Insightful

This wouldn't work.

Bayesian filters for spam work because spam has a significantly different vocabulary distribution than useful e-mail. This is true because spam must deliver a commercial message and play on people's uncertainties.

Good trolls, on the other hand, look ALMOST like insightful, well-written articles. The vocabulary distribution in good trolls is not significantly different from the vocabulary distribution of useful posts. So, Bayesian filters would be useless, unless you come up with some smarter characteristics on which to train the filter.

You could easily develop a filter for ascii-art porno. But those are offtopic or flamebait, not trolls.

--

I used to be a narrator for bad mimes. (wright)

Re:Brief Tech Notes on Bayesian Filtering by Ian Bicking · 2003-05-26 12:04 · Score: 3, Insightful

Spam naturally seems to come in several categories - porn, penis enlargements, mortgages etc. However, it's unlikely that any one spam will simultaneously advertise porn and mortgages. Simply having a "spam" and a "not" category will not take advantage of distinctions such as that.

Why does it matter what category? The user doesn't care what kind of spam it is, merely that it's spam. And this isn't just a UI issue -- the filter is not meant to indicate authoritatively what is spam and what is not. Instead it learns what the particular user considers spam. You're only going to introduce inaccuracy if you create more categories, because the user is sometimes going to miscategorize spams (e.g., porn in penis enlargement). The user is not invested in the result of that subcategorization, so it's not a good goal for training.

Certainly there are other categorizations that are useful, e.g., work vs. private mail. Bayesian techniques can be used for further categorization, but they should only be used to categorize as far as the user cares to have their mail categorized.

Bayesian techniques for non-spam wouldn't be that useful, anyway, because non-statistical rules generally work well for everything but spam -- it's only because spammers are specifically trying to defeat non-statistical rules that we need statistical analysis. The only other place for Bayesian techniques, IMHO, is where the user can't articulate the basis of the categorization they desire (but that's probably quite common).

Re:Brief Tech Notes on Bayesian Filtering by nackrm · 2003-05-26 14:03 · Score: 2, Insightful

Pooling spam to teach isn't such a good idea. The problem you might run into is that some people, like say a plastic surgeon, might get many emails that have words like penis, vagina, sex, larger, etc. So their filter info might allow some spam to get through. This is also the reason that Mozilla's mail client wouldn't be "pretrained" for you. Instead the email probably had some key qualities to it that were dead giveaways to being spam. One of those is the really long strings of characters used by spam mailers to track live email addresses. There are lots of possibilities there.

--

Be a man! View at -1
acm.cs.uwec.edu
