Slashdot Mirror


Bayesian Filtering For Dummies

Dynamoo writes "Bayesian filtering for spam is awfully clever stuff, touched on by Slashdot several times before. There's a very accessible article at BBC News explaining in fairly simple terms the drawbacks of current keyword-based filtering. It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name. Those Vikings have a lot to answer for."

6 of 281 comments

  1. Crude but effective by MrWorf · · Score: 5, Insightful

    I simply got to the point that I could count the number of real emails on my hands. So I reversed my previous filter. Instead of filtering spam to my spam folder, I made it default *ALL* mail to the spam folder except from certain known addresses (such as work, friends and my own domain). So far, it has only made one wrong decision, and that was because I hadn't written the email address of a friend correctly.

    This is waaaaay better than any other filter method I've tried and requires no learning period at all :)
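The allow-list approach MrWorf describes can be sketched in a few lines of Python. This is only an illustration, not his actual filter rules: the addresses, the domain, and the `choose_folder` helper are hypothetical placeholders.

```python
from email.utils import parseaddr

# Hypothetical allow-list; these addresses and this domain are placeholders.
ALLOWED_ADDRESSES = {"boss@work.example", "friend@mail.example"}
ALLOWED_DOMAINS = {"mydomain.example"}


def choose_folder(from_header: str) -> str:
    """Return 'inbox' for known senders and 'spam' for everyone else."""
    # parseaddr splits "Name <user@host>" into (name, address).
    _, address = parseaddr(from_header)
    address = address.lower()
    domain = address.rpartition("@")[2]
    if address in ALLOWED_ADDRESSES or domain in ALLOWED_DOMAINS:
        return "inbox"
    # Default: everything not explicitly allowed goes to the spam folder.
    return "spam"
```

A mistyped allow-list entry fails in the way the comment reports: mail from the real address never matches the entry, so it lands in the spam folder until the entry is corrected.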

  2. Re:A bit of info on Bayesian filtering by letxa2000 · · Score: 5, Insightful
    A gynecologist probably wouldn't have a corpus that indicates that "sex" is a .97 spam probability. That's the great thing about Bayesian: the spam probability for each word depends on the mail and spam YOU receive. It works dang well, just as Paul Graham claims. I'm averaging 99.7% accuracy this week, and the one spam that got through was written in German.

  3. Re:A bit of info on Bayesian filtering by GnuVince · · Score: 5, Insightful
    No, because if they have a lot of legitimate mails with words like "sex", "sexy", "penis", "vagina", "viagra", etc., the filter will adapt. That's the whole point. For PG, "sexy" is a sure sign of spam, but for a sexologist, it is not. You train the filter to recognize your spam. So if "sex" appears as much in your legitimate mail as in your spam, "sex" will not be considered a sign of spam.

    Bayesian filters adapt, that's why they work so well.
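The adaptation described in the two comments above comes down to computing each word's spam probability from one user's own mail. Here is a minimal sketch, assuming a simplified ratio of per-corpus frequencies clamped to [0.01, 0.99]. It is not Paul Graham's exact formula; the `spam_probability` helper and all the counts are made up for illustration.

```python
def spam_probability(word, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how strongly `word` indicates spam in one user's own mail."""
    # Fraction of this user's spam and legitimate mail containing the word.
    spam_rate = min(1.0, spam_counts.get(word, 0) / n_spam)
    ham_rate = min(1.0, ham_counts.get(word, 0) / n_ham)
    if spam_rate + ham_rate == 0:
        return 0.5  # never-seen word: treated as neutral (an assumption)
    p = spam_rate / (spam_rate + ham_rate)
    # Clamp so that no single word is ever treated as absolute proof.
    return max(0.01, min(0.99, p))


# Made-up counts for two hypothetical users, each with 100 spam
# and 100 legitimate messages in their training corpus.
programmer = spam_probability("sex", {"sex": 60}, {"sex": 0}, 100, 100)
sexologist = spam_probability("sex", {"sex": 60}, {"sex": 60}, 100, 100)
```

With these counts the word is a near-certain spam marker for the first user and a neutral one for the second, even though both receive the same spam.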

  4. Re:I don't receive spam by letxa2000 · · Score: 4, Insightful
    There are in fact two big problems with Bayesian filtering (or any content-based filtering) from the perspective of an ISP or company... 1) one person's spam is another person's necessity

    But that's why Bayesian advocates every user having their own Bayesian statistics. It's not a "one size fits all" for the entire ISP or company, as is the case with most keyword filters. Every user has a different set of Bayesian statistics which is why it is very difficult for spammers to get around this filter--they have no way of knowing what words are in each user's statistics.

    2) you still have to waste your bandwidth and CPU before you reject it.

    It's better to waste your bandwidth and your CPU than to waste the time of those receiving the spam. IMHO...

    So Bayesian filters are a good tool of last resort, but there are many other tools that should be used too.

    The quicker everyone uses Bayesian filters (as opposed to waiting until all the other filters are incapable of keeping up with spam) the sooner the spammers will be in trouble. I personally use both a Bayesian filter and an up-to-date blacklist of known spamvertised domains, etc. I find that, quite simply, the simple keyword filters catch spam from known spam sites and Bayesian catches the rest. But if I turned off my normal filters Bayesian would have caught it all since those spams are always assigned a high Bayesian score, too. It almost makes sense to turn off the other filters, but they can be useful if a spammer comes up with a truly unique spam and someone else has already identified the domain name. It's rare, but it can happen. So a combination of technologies is probably the best... but a combination that lacks Bayesian is a combination that could be better.
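The combination letxa2000 describes, a domain blacklist backed by a Bayesian score, might look roughly like this. It is a sketch under stated assumptions: the blacklist entry, the 0.9 threshold, and the `combined_score` and `is_spam` helpers are hypothetical, and the per-word probabilities are assumed to be already clamped away from 0 and 1.

```python
import math
import re

# Placeholder blacklist of spamvertised domains (hypothetical entry).
BLACKLISTED_DOMAINS = {"spamvertised.example"}


def combined_score(token_probs):
    """Combine per-word spam probabilities, treating words as independent."""
    spam = math.prod(token_probs)
    ham = math.prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


def is_spam(body, token_probs, threshold=0.9):
    """Flag a message if it links to a blacklisted domain or scores high."""
    # The blacklist catches known domains even when the wording is novel.
    domains = set(re.findall(r"https?://([^/\s]+)", body.lower()))
    if domains & BLACKLISTED_DOMAINS:
        return True
    # Otherwise fall back on the Bayesian score of the message's words.
    return combined_score(token_probs) > threshold
```

The first branch covers the rare case from the comment: a truly unique spam whose words score low but whose advertised domain someone else has already identified.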

  5. Re:Ironic? by DavyByrne · · Score: 4, Insightful

    Actually, I've long wondered whether Alanis was quite clever in choosing a title for that song.

    You see, none of the events she describes in the song is an example of irony, making the choice of the title "Ironic," well, ironic.

  6. Re:Yes, we must filter out the dummies by bluelan · · Score: 5, Insightful
    This wouldn't work.

    Bayesian filters for spam work because spam has a significantly different vocabulary distribution than useful e-mail. This is true because spam must deliver a commercial message and play on people's uncertainties.

    Good trolls, on the other hand, look ALMOST like insightful, well-written articles. The vocabulary distribution in good trolls is not significantly different than the vocabulary distribution of useful posts. So, Bayesian filters would be useless, unless you come up with some smarter characteristics on which to train the filter.

    You could easily develop a filter for ASCII-art porno. But, those are offtopic or flamebait, not trolls.

    --

    I used to be a narrator for bad mimes. (wright)