Bayesian Filtering For Dummies
Dynamoo writes "Bayesian filtering for spam is awfully clever stuff, touched on by Slashdot several times before. There's a very accessible article at BBC News explaining in fairly simple terms the drawbacks of current keyword-based filtering. It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name. Those Vikings have a lot to answer for."
I've been using it for a bit on my own e-mail, and it seems to work out. But it's not at the point where I'd be happy to see ISPs implementing it for their customers -- even ignoring the Freedom of Speech issue, it still has the occasional false positive.
Try not. Do or do not, there is no try.
-- Dr. Spock, stardate 2822-3.
It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name.
Does anyone have proof that's where the name comes from?
Mouse powered Chips, Open source Processors and Lego
The moderation system (esp. in its current form - moderation by +karma /.ers) will always be better than automated filtering.
The key problem is adaptation. Bayesian filtering is better than simple keyword filtering, but its performance will degrade over time unless its rules are continuously updated (via analysis of new data). And there's the problem that a troll in one story context may be an insightful comment in another.
Moderation by humans adapts rapidly, accommodates a variety of contexts, and will reflect (and grow with) the overall /. "culture".
"The sheer number of spam mail sent means that even tiny response rates, reportedly 0.0001%, means junk mailers turn a profit. "
And this is why I say that educating users is just about as important as implementing spam filtering technology. If people know that they are perpetuating a serious problem by replying to spam, then that's bad news for spammers.
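To put that response rate in perspective, here is a rough back-of-the-envelope sketch. Only the 0.0001% figure comes from the article; the volume, the revenue per reply and the cost per message below are made-up numbers for illustration, and `expected_responses` and `profit` are just names I picked.

```python
from fractions import Fraction

def expected_responses(sent, rate_percent):
    """Expected replies for `sent` messages at a response rate given in percent."""
    # Fraction keeps the tiny rate exact (0.0001% is 1 in 1,000,000).
    return sent * Fraction(rate_percent) / 100

def profit(sent, rate_percent, revenue_per_reply, cost_per_message):
    """Revenue from replies minus sending cost. All inputs are hypothetical."""
    revenue = expected_responses(sent, rate_percent) * Fraction(revenue_per_reply)
    cost = sent * Fraction(cost_per_message)
    return revenue - cost

# Hypothetical campaign: 10 million messages, 50 units earned per reply,
# 0.00001 units spent per message sent.
replies = expected_responses(10_000_000, "0.0001")
net = profit(10_000_000, "0.0001", 50, "0.00001")
print(replies, net)
```

The point is only that sending is so cheap that one reply per million messages can still cover the cost, which is why cutting the reply rate further actually matters.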
About another fact mentioned in the article: It said Paul Graham's filter extracts "the top 15 features that define them as spam." 15? I thought that most Bayesian filters use many more spam-defining features. Because I'd say that there are quite a few more. Just think of the many features that spam tends to have. But he says his filter works well. Interesting.
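For what it's worth, the "top 15" part can be sketched roughly like this. As I recall from Graham's essay, the filter picks the tokens whose spam probability is farthest from a neutral 0.5 and combines only those with the usual naive Bayes formula. The names (`combine`, `score`, `token_probs`) and the 0.4 default for unseen words are assumptions of mine for illustration, not his actual code.

```python
from math import prod

def combine(probs):
    """Naive Bayes combination of per-token spam probabilities.

    Assumes every probability is strictly between 0 and 1.
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

def score(token_probs, tokens, n=15, unknown=0.4):
    """Score a message using only its n most 'interesting' tokens."""
    # Each distinct token counts once; unseen tokens get a mildly innocent default.
    seen = {t: token_probs.get(t, unknown) for t in set(tokens)}
    # "Interesting" means far from the neutral value 0.5, in either direction.
    interesting = sorted(seen.values(), key=lambda p: abs(p - 0.5), reverse=True)[:n]
    return combine(interesting)

# Hypothetical per-token probabilities learned from earlier mail.
token_probs = {"viagra": 0.99, "meeting": 0.05}
print(score(token_probs, ["viagra", "free"]))
print(score(token_probs, ["meeting"]))
```

A handful of strongly spammy or strongly innocent tokens dominates the product, which may be why 15 is enough: the other words in the message sit close to 0.5 and barely move the result.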
I've been using a Bayesian spam filter for months now and I understand how they work... Even though people find the comment funny, a Bayesian troll filter on Slashdot would work...
If you were to run every Slashdot post through my mail filter as an e-mail message and properly mark the trolls and others you don't want, and the ones you do want, suddenly you would only get the actual good posts, and trolling would die quickly... And because of the user classification system currently in place, Slashdot has a huge db to build up the word stats, so it could happen immediately or faster...
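A minimal sketch of the "build up the word stats" step, assuming the existing moderation can be boiled down to a troll / not-troll label per post. The function names, the tokenizer and the 0.01 to 0.99 clamping are illustrative choices, not anything taken from slashcode or from a particular mail filter.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train(posts):
    """posts: iterable of (text, is_troll). Returns {word: troll probability}."""
    troll, good = Counter(), Counter()
    n_troll = n_good = 0
    for text, is_troll in posts:
        words = set(tokenize(text))  # count each word once per post
        if is_troll:
            troll.update(words)
            n_troll += 1
        else:
            good.update(words)
            n_good += 1
    probs = {}
    for word in set(troll) | set(good):
        t = troll[word] / max(n_troll, 1)  # share of troll posts containing the word
        g = good[word] / max(n_good, 1)    # share of good posts containing the word
        # Clamp so no single word is ever treated as absolute proof.
        probs[word] = min(0.99, max(0.01, t / (t + g)))
    return probs

# Hypothetical moderated posts.
posts = [
    ("first post lol", True),
    ("first post again", True),
    ("the kernel patch fixes the bug", False),
]
probs = train(posts)
print(probs["first"], probs["kernel"])
```

Retraining on fresh moderation data would just mean calling `train` again, so the word stats can follow whatever the trolls switch to.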
Seriously, I ask that the Slashdot admins consider adding this to slashcode... even if Slashdot does not use it, others would... there are too many trolls out there as it is on the net and many people put them only a few rungs higher than spammers on the evolutionary ladder (but lower than an amoeba still).
The logic behind this can actually be extended to allow a user to start filtering stories so that they only get ones that interest them, or even to filter submissions to get rid of the cruft. How often do you think the trolls post troll story submissions? Save work for the site admins...
I'm curious if an extension of this idea is how Google News works... anyone know?
Enjoy.
On Arrakis: early worm gets the bird. Magister mundi sum!
In my home mailbox, I don't receive spam. And I only got two Nigerian 419 investment frauds on my professional address in a whole year, despite the fact that my corporate email address is widely publicized and easy to find on Google. And amazingly, I never receive spam in my "special bogus registration" hotmail account (useful for programs like RealPlayer, or nytimes.com).
So existing mail filters work for me, more or less. The few unwanted mails that pass through are easily taken care of by my trusted delete button. This leads me to ask:
- Do other people really receive that much spam, or am I an isolated case?
- Do people who receive spam purchase things online, or register software and other services with their real names and email?
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
...supposedly uses some form of Bayesian reasoning. I've been using it for a year now. I trained it for a couple of weeks, turned it on "automatic filtering" mode, and now I can count the number of times it's misclassified a message on my two hands. I used to get more spam than legit mail; now I can't help but wonder why spam is a problem for people. Until I remember that most people don't use a Mac. Every once in a while, I flip it back into training mode so that I can see the lovely sea of brown-colored spam messages that flood my inbox. I flip it back to automatic mode, Mail automatically moves them to my junk folder, and I can forget about them.
This allows your single spam/non-spam feedback to the system to do double duty, so that once the program knows that you consider an email source to be "trusted", it will allow even spammy-looking stuff (read: mailing list digests, plane schedules, bank statements, etc) through to your non-spam folder.
Of course, if spammers start constructing Google-style databases of who your friends are and impersonating their accounts, then this won't work anymore... but if they start that, all hell is going to break loose anyway.
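The "double duty" idea can be sketched like this, assuming the mail program keeps a set of trusted senders next to whatever spam score its filter produces. The class name, the 0.9 threshold and the method names are made up for illustration; I don't know how any real mail client implements it.

```python
class TrustAwareFilter:
    """Toy filter: one spam / not-spam click also updates a trusted-sender set."""

    def __init__(self, threshold=0.9):
        self.trusted = set()
        self.threshold = threshold  # hypothetical cutoff on the content score

    def feedback(self, sender, is_spam):
        """Record the user's verdict on a message from `sender`."""
        key = sender.strip().lower()
        if is_spam:
            self.trusted.discard(key)
        else:
            self.trusted.add(key)

    def is_junk(self, sender, spam_score):
        """Trusted senders always get through; others are judged on the score."""
        if sender.strip().lower() in self.trusted:
            return False
        return spam_score >= self.threshold

f = TrustAwareFilter()
print(f.is_junk("digest@lists.example.org", 0.95))   # unknown sender, spammy score
f.feedback("digest@lists.example.org", is_spam=False)
print(f.is_junk("digest@lists.example.org", 0.95))   # same score, now trusted
```

As the caveat above says, this only holds while the sender address can be believed; a forged From line from a trusted address would sail straight through a check this simple.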
I don't care if it's 90,000 hectares. That lake was not my doing.
I'm using this now, and it works great!
Get it here.
-ted
I don't use email. Yes, I have a few addresses but I haven't checked them in months. Email is kind of a dead way of communication anyway, beaten by things such as mobile phones and instant messaging.
Why go through all the work of training some software to read your email and decide if you might want to read it when most email programs have white list capabilities?
If I don't know you, that means I don't want to talk to you. Your email goes straight to a junk folder, which I can quickly scan once every few days for names I recognize. I can add these names to my white list if I so choose.
Granted, my job does not involve me soliciting contacts from the public at large, so this wouldn't work for everyone. I use it on my personal Hotmail account though, and I get to not even consider lots of crap every day.
You can never put too much water in a nuclear reactor.
The same thing would happen to your mail if the words your Bayesian filter keyed on were the same as the words in everybody else's. Spammers would be able to see what makes an email seem spammy and they would avoid doing that. Bayesian filtering works for email right now because everybody's filters are a bit different. There is currently no magic bullet to get through everybody's spam filters. Also, spammers cannot see your filter, so they don't know if their message was filtered. If you opened your archive to me, I could quite easily craft a spam that would land square in your inbox.
Ahh, but a troll that looks genuine at first and appears on topic is worth a read for the laugh. It needs to be marked funny, and depending on how good it is, it might need some explanation in a followup post to keep those not in the know from thinking the wrong thing.
OTOH, first post is always useless and a waste of time. So are a few other posts. ASCII art might be easy to filter, but can you filter the porn ASCII art without blocking the guy trying to make a diagram of some sort so we can better understand what is going on?
Actually it is ironic when you write a song called "ironic" and there are no ironies in it.
http://www.talknerdy.org