Bayesian Filtering For Dummies

Posted by timothy on Monday May 26, 2003 @09:27AM from the to-increase-the-peace dept.

Dynamoo writes "Bayesian filtering for spam is awfully clever stuff, touched on by Slashdot several times before. There's a very accessible article at BBC News explaining in fairly simple terms the drawbacks of current keyword-based filtering. It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name. Those Vikings have a lot to answer for."

It's not bad... by Sheetrock · 2003-05-26 09:32 · Score: 2, Interesting

I've been using it for a bit on my own e-mail, and it seems to work out. But it's not at the point where I'd be happy to see ISPs implementing it for their customers -- even ignoring the Freedom of Speech issue, it still has the occasional false positive.

--

Try not. Do or do not, there is no try.
-- Dr. Spock, stardate 2822-3.
Origin of SPAM by brejc8 · 2003-05-26 09:32 · Score: 2, Interesting

It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name.
Does anyone have proof that's where the name comes from?

--
Mouse powered Chips, Open source Processors and Lego
Re:Yes, we must filter out the dummies by zoikes · 2003-05-26 09:42 · Score: 5, Interesting

The moderation system (esp. in its current form - moderation by +karma /.ers) will always be better than automated filtering.

The key problem is adaptation. Bayesian filtering is better than simple keyword filtering, but its performance will degrade over time unless its rules are continuously updated (via analysis of new data). And there's the problem that a troll in one story context may be an insightful comment in another.

Moderation by humans adapts rapidly, accommodates a variety of contexts, and will reflect (and grow with) the overall /. "culture".
Required Reading by E-mail Users by Shackleford · 2003-05-26 09:46 · Score: 3, Interesting

This "Bayesian Filtering for Dummies" article, titled "How to spot and stop spam" on the BBC web site, gave much useful information on the problem of spam and the filtering method used to get around it. It is quite comprehensible, as you certainly don't need to know the probability theory behind Bayesian filtering to understand it. I'd say that this sort of article is required reading for all those who use e-mail. Why? Because it states this fact:
"The sheer number of spam mail sent means that even tiny response rates, reportedly 0.0001%, means junk mailers turn a profit. "
And this is why I say that educating users is just about as important as implementing spam filtering technology. If people know that they are perpetuating a serious problem by replying to spam, then that's bad news for spammers.
About another fact mentioned in the article: It said Paul Graham's filter extracts "the top 15 features that define them as spam." 15? I thought that most Bayesian filters use many more spam-defining features. Because I'd say that there are quite a few more. Just think of the many features that spam tends to have. But he says his filter works well. Interesting.
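For readers wondering how only 15 features can be enough, here is a minimal sketch of the combining step, assuming each token already has an estimated spam probability. It follows the naive Bayes combination as described in Graham's "A Plan for Spam" essay; the function name, the dictionary layout, and the example tokens are hypothetical, not taken from his code.

```python
from math import prod


def combined_spam_probability(token_probs, n_interesting=15):
    """Combine per-token spam probabilities into one score.

    token_probs: dict mapping token -> estimated P(spam | token), in (0, 1).
    Only the n_interesting tokens farthest from the neutral value 0.5
    contribute; these are the "top features" of the message.
    """
    # Rank probabilities by distance from 0.5 and keep the most extreme ones.
    interesting = sorted(
        token_probs.values(), key=lambda p: abs(p - 0.5), reverse=True
    )[:n_interesting]
    spam_side = prod(interesting)
    ham_side = prod(1.0 - p for p in interesting)
    return spam_side / (spam_side + ham_side)


# Hypothetical probabilities for the tokens of one message.
example = {"viagra": 0.99, "free": 0.9, "meeting": 0.1}
score = combined_spam_probability(example)
```

Because tokens near 0.5 carry almost no evidence, discarding everything but the most extreme handful changes the result very little, which may be why a small number of features works in practice.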
1. Re:Required Reading by E-mail Users by kindbud · 2003-05-26 17:26 · Score: 2, Interesting
  
  I have 5200 spam e-mails saved and about 1000 legit mail saved and my accuracy level is about 99.9...
  
  Yes, but you haven't reduced your exposure to spam. In fact, it looks like now you have to track your spam intake assiduously so as to keep the filter trained. Not many people would consider this an improvement. :)
  
  --
  Edith Keeler Must Die
Re:Yes, we must filter out the dummies by dJCL · 2003-05-26 09:47 · Score: 5, Interesting

I've been using a Bayesian spam filter for months now and I understand how they work... Even though people find the comment funny, a Bayesian troll filter on slashdot would work...

If you were to run every slashdot post through my mail filter as an e-mail message and properly mark the trolls and others you don't want, and the ones you do want, suddenly you would only get the actual good posts, trolling would die quickly... And because of the user classification system currently in place, slashdot has a huge db to build up the word stats, so it could happen immediately or faster...
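Building up the word stats from already-moderated posts might look roughly like the sketch below. It is only an illustration: the tokenizer, the `(text, is_troll)` input format, the clamping bounds, and all names are assumptions, and it has nothing to do with actual slashcode.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer: lowercase runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def train(labeled_posts):
    """Estimate P(troll | token) from (text, is_troll) pairs."""
    troll_counts, good_counts = Counter(), Counter()
    n_troll = n_good = 0
    for text, is_troll in labeled_posts:
        tokens = set(tokenize(text))  # count each token once per post
        if is_troll:
            troll_counts.update(tokens)
            n_troll += 1
        else:
            good_counts.update(tokens)
            n_good += 1

    probs = {}
    for token in set(troll_counts) | set(good_counts):
        t = troll_counts[token] / max(n_troll, 1)  # share of troll posts with token
        g = good_counts[token] / max(n_good, 1)    # share of good posts with token
        # Clamp so that no single token is treated as absolute proof.
        probs[token] = min(0.99, max(0.01, t / (t + g)))
    return probs


# Hypothetical moderated posts: True means the post was marked as a troll.
posts = [
    ("first post lol", True),
    ("first post", True),
    ("interesting kernel patch", False),
    ("kernel post", False),
]
probs = train(posts)
```

A real deployment would need far more data per token and a rule for ignoring rare words, both of which are left out here.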

Seriously, I ask that the slashdot admins consider adding this to slashcode... even if slashdot does not use it, others would... there are too many trolls out there as it is on the net and many people put them only a few rungs higher than spammers on the evolutionary ladder (but lower than an amoeba still)

The logic behind this can actually be extended, to allow a user to start filtering stories so that they only get ones that interest them, or even to filtering submissions to get rid of the cruft, how often do you think that the trolls post troll story submissions? Save work for the site admins...

I'm curious if an extension of this idea is how Google News works... anyone know?

Enjoy.

--
On Arrakis: early worm gets the bird. Magister mundi sum!
I don't receive spam by Rosco+P.+Coltrane · 2003-05-26 09:48 · Score: 4, Interesting

In my home mailbox, I don't receive spam. And I only got two 419 Nigerian investment frauds on my professional address in a whole year, despite the fact that my corporate email address is widely publicized and easy to find on google. And amazingly, I never receive spam in my "special bogus registration" hotmail account (useful for programs like RealPlayer, or nytimes.com).

So existing mail filters work for me, more or less. The few unwanted mails that pass through are easily taken care of by my trusted delete button. This leads me to ask:

- Do other people really receive that much spam, or am I an isolated case?

- Do people who receive spam purchase things online, or register software and other services with their real names and email?

--
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
Apple's Mail app... by useruser · 2003-05-26 09:49 · Score: 4, Interesting

...supposedly uses some form of Bayesian reasoning. I've been using it for a year now. I trained it for a couple of weeks, turned it on "automatic filtering" mode, and now I can count the number of times it's misclassified a message on my two hands. I used to get more spam than legit mail; now I can't help but wonder why spam is a problem for people. Until I remember that most people don't use a Mac. Every once in a while, I flip it back into training mode so that I can see the lovely sea of brown-colored spam messages that flood my inbox. I flip it back to automatic mode, Mail automatically moves them to my junk folder, and I can forget about them.
Slight modification: white-list+Bayesian is useful by Jeremi · 2003-05-26 10:13 · Score: 4, Interesting

I've found that if you add a small tweak to the Bayesian Filter, it becomes even more useful. The tweak is this: Any time you tell the Bayesian filter that an email is "non-spam", it auto-adds the From address of that email to a white-list, so that from then on any emails from that address are automatically marked as "non-spam" by the filter, no matter what they contain. (conversely, any time you mark an email as "spam", the source address of that email is removed from the white-list, if it is present)

This allows your single spam/non-spam feedback to the system to do double duty, so that once the program knows that you consider an email source to be "trusted", it will allow even spammy-looking stuff (read: mailing list digests, plane schedules, bank statements, etc) through to your non-spam folder.

Of course, if spammers start constructing google-style databases of who your friends are and impersonating their accounts, then this won't work anymore... but if they start that, all hell is going to break loose anyway.
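The tweak described above can be sketched in a few lines. This is an assumption-laden illustration, not the poster's actual setup: `score_fn` stands in for whatever Bayesian filter is in use, the 0.9 threshold is arbitrary, and retraining the underlying filter on each mark is omitted.

```python
class WhitelistingFilter:
    """Wrap a spam scorer with a white-list driven by user feedback."""

    def __init__(self, score_fn, threshold=0.9):
        self.score_fn = score_fn    # hypothetical: text -> spam probability in [0, 1]
        self.threshold = threshold  # assumed cut-off, not from the original post
        self.whitelist = set()

    def is_spam(self, sender, text):
        # Trusted senders bypass the content check entirely.
        if sender.lower() in self.whitelist:
            return False
        return self.score_fn(text) >= self.threshold

    def mark(self, sender, is_spam):
        # One piece of feedback does double duty: it also updates the white-list.
        if is_spam:
            self.whitelist.discard(sender.lower())
        else:
            self.whitelist.add(sender.lower())


# Usage with a stand-in scorer that finds everything spammy.
f = WhitelistingFilter(lambda text: 0.95)
f.mark("digest@lists.example.org", is_spam=False)
trusted = not f.is_spam("digest@lists.example.org", "BUY NOW!!! weekly digest")
```

Addresses are compared case-insensitively here, which is a simplification; the From header is also trivially forgeable, which is the weakness the post itself points out.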

--

I don't care if it's 90,000 hectares. That lake was not my doing.
Where to get a nice Bayesian filter. by zerofoo · 2003-05-26 10:54 · Score: 2, Interesting

I'm using this now, and it works great!

Get it here.

-ted
My solution. by Lord+Kholdan · 2003-05-26 11:37 · Score: 3, Interesting

I don't use email. Yes, I have a few addresses but I haven't checked them in months. Email is kind of a dead way of communication anyway, beaten by things such as mobile phones and instant messaging.
The best email filter by Spud+the+Ninja · 2003-05-26 11:46 · Score: 2, Interesting

Why go through all the work of training some software to read your email and decide if you might want to read it when most email programs have white list capabilities?

If I don't know you, that means I don't want to talk to you. Your email goes straight to a junk folder, which I can quickly scan once every few days for From names I recognize. I can add these names to my white list if I so choose.

Granted, my job does not involve me soliciting contacts from the public at large, so this wouldn't work for everyone. I use it on my personal Hotmail account though, and I get to not even consider lots of crap every day.

--
You can never put too much water in a nuclear reactor.
Re:Yes, we must filter out the dummies by DeadSea · 2003-05-26 12:22 · Score: 4, Interesting

Bayesian filters for email really only work because spammers can't see which messages you classify as spam. If you implemented a Bayesian filter for trolls on slashdot, the trolls would see what words constitute a troll and stop using those words. They would stuff their messages with non-troll words, avoiding the Bayesian filter.
The same thing would happen to your mail if the words that your Bayesian filter keyed on were the same as the words in everybody else's. Spammers would be able to see what makes an email seem spammy and they wouldn't do that. Bayesian filtering works for email right now because everybody's filters are a bit different. There is currently no magic bullet to get through everybody's spam filters. Also, spammers cannot see your filter, so they don't know if their message was filtered. If you opened your archive to me, I could quite easily craft a spam that would land square in your inbox.
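A toy illustration of the stuffing attack, assuming the attacker knows the filter's word table. The scoring function uses the naive Bayes combination over the most extreme tokens; the table values, the 0.4 default for unseen words, and all names are made up for the example.

```python
from math import prod


def score(tokens, token_probs, unknown=0.4, n_interesting=15):
    """Spam score of a message given a known token -> P(spam | token) table."""
    probs = [token_probs.get(t, unknown) for t in set(tokens)]
    # Keep only the tokens whose probability is farthest from neutral.
    probs = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n_interesting]
    spam_side = prod(probs)
    ham_side = prod(1.0 - p for p in probs)
    return spam_side / (spam_side + ham_side)


# Hypothetical word table leaked from (or shared by) the victim's filter.
table = {"viagra": 0.99, "free": 0.9, "kernel": 0.01, "patch": 0.01, "compile": 0.05}

plain = "free viagra".split()
# Same pitch, padded with words the table says are strongly non-spam.
stuffed = plain + "kernel patch compile".split()

plain_score = score(plain, table)      # high: caught by the filter
stuffed_score = score(stuffed, table)  # low: slips through
```

With a private, per-user table the attacker has to guess which padding words count as innocent for each recipient, which is the protection the post is describing.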
Re:Yes, we must filter out the dummies by bluGill · 2003-05-26 14:52 · Score: 2, Interesting

Ahh, but a troll that looks genuine at first, and appears on topic is worth a reading for the laugh. It needs to be marked funny, and depending on how good it is might need some explanation in a followup post to keep those not in the know from thinking the wrong thing.
OTOH, first post is always useless and a waste of time. So are a few other posts. ASCII art might be easy to filter, but can you filter the porn ASCII art without blocking the guy trying to make a diagram of some sort so we can better understand what is going on?
Re:"Alanis irony" by joeytsai · 2003-05-27 01:01 · Score: 2, Interesting

Actually it is ironic when you write a song called "ironic" and there are no ironies in it.

--
http://www.talknerdy.org