Comparison of Bayesian POP3 Spam Filters
kreide writes "Spam e-mail has become an ever-increasing problem, and these days it is next to impossible to use e-mail without receiving it in large amounts. Although various techniques exist to combat the problem, spammers seemed to be winning the war - until a new, powerful weapon appeared on the scene: Bayesian filters, our last, best hope for spam-free inboxes. In this review I compare POP3-based Bayesian spam filters." We did an Ask Slashdot on this a few weeks ago.
I still believe that we should have a hunting season for spammers, just like we do for ducks...
Never underestimate the predictability of human stupidity...
Spam is effective because it reaches millions of people who are not installing these filters on their systems. Until ISPs start applying these filters to all mail by default, the spam filters will have no effect at all: exactly the same number of marks will be reached and will respond, whether or not the people who know better than to respond to spam filter their e-mail!
I'm an American. I love this country and the freedoms that we used to have.
The article didn't mention SpamProbe. It is what I use, and it has worked quite well for the past month or so that I've been using it. Perhaps this is just because the author didn't test this spam filter yet, but I like it quite a lot with my current mutt/procmail setup. Take this for what it's worth.
- I love animals. I try to eat at least one a day.
As someone who acquired a B.S. in mathematics several days ago, I understand how these filters work. They are an excellent way to fight spam compared with the older methods.
;)
However, I think that ultimately this sort of thing misses the point. Spam needs to be fought in the courts, not in the battlefield. I'm afraid that the success of these filters will cause spam NOT to become illegal, and thus lead to a world where we have a constant trickle of spam, albeit in small amounts.
I think we all agree that we want spam to be gone entirely, as is evidenced by the first post being labeled as "troll".
- I am a viral sig. Please copy me and help me spread. [strain #2] Thank you
Your server and its hard drives still end up being a storage bin for it, and the spammers will continue to send as long as your machine allows it to be received. Always remember that spam differs from postal junk mail, in that the -receiver- pays for it. Unsolicited postage due mail.
Spam must be -blocked- and the ISPs that allow/encourage its continued spread must be re-educated, or be put out of business. Only when spam becomes costly to send will it diminish.
The currently proposed laws on the subject focus on content rather than consent. They don't mind if you get spammed with hundreds of ads, provided what is being advertised isn't fraudulent. They overlook the fact that the claim that you 'opted in' to the spam is in itself the lie and the fraud.
--Teh
I have long been an advocate of Bayesian or keyword based spam filters, but have recently been forced to change my outlook, and to argue that MULTIPLE SIMULTANEOUS solutions are the answer.
I encountered a very simple but unique anti-spam system which works entirely on the sender's address. Simply, you create a small database with the domains/addresses you want to whitelist. Then, a program screens your mail, and if the sender is not in your whitelist, it sends an e-mail BACK to the sender with a simple URL (or even an actual link for HTML e-mail clients) that they follow to confirm that they REALLY want to send the e-mail to its destination. When this is done, they are added to the whitelist. Therefore, mails from forged remote addresses are no longer a problem, and neither are mails from trusted sources. And, better than SPEWS or similar blacklists, the sender gets a SECOND CHANCE to send their mail to you.
There's a commercial solution using this system right now, although the URL escapes me.
Of course, one could encounter problems when ordering online, say. Droids at Amazon will not be clicking your links to make sure your order receipt got through. One could argue that you'd put things like Amazon.com in the whitelist, but what if someone used amazon.com as a spoofed e-mail domain/address? Ay, there's the rub. But if this system were tied in with a Bayesian system, it'd be pretty unbeatable. What's more, the Bayesian system would have extra data for negative matches, in the form of e-mails that were never 'approved', and positive data in the form of those that were.
So, I'd be more interested in producing a homebrew system that used MULTIPLE weaker systems than one supposed 'sure fire' method, as I feel no one method is perfect, whereas multiple systems can approach this nirvana.
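The whitelist-plus-confirmation scheme described above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the class name, method names and token format are hypothetical, nothing here is taken from the commercial product mentioned, and actually sending the challenge e-mail and serving the confirmation URL are left out.

```python
import secrets


class ChallengeResponseFilter:
    """Hold mail from unknown senders until they confirm via a one-time token.

    Hypothetical sketch: names and structure are assumptions, not a real product's API.
    """

    def __init__(self, whitelist):
        # Whitelist entries may be full addresses or bare domains.
        self.whitelist = {entry.lower() for entry in whitelist}
        self.pending = {}  # token -> (sender, held message)

    def is_whitelisted(self, sender):
        sender = sender.lower()
        domain = sender.rsplit("@", 1)[-1]
        return sender in self.whitelist or domain in self.whitelist

    def receive(self, sender, message):
        """Return ('deliver', message) or ('challenge', token)."""
        if self.is_whitelisted(sender):
            return ("deliver", message)
        # Unknown sender: hold the message and issue a token that would be
        # embedded in the confirmation URL mailed back to the sender.
        token = secrets.token_urlsafe(16)
        self.pending[token] = (sender.lower(), message)
        return ("challenge", token)

    def confirm(self, token):
        """Called when the sender follows the confirmation URL."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return None  # unknown or already used token
        sender, message = entry
        self.whitelist.add(sender)  # future mail from this sender is delivered
        return message
```

Messages still sitting in `pending` after some timeout are the "never approved" mails that could be fed to a Bayesian filter as spam examples, and confirmed ones as non-spam examples. Note that a whitelisted domain such as amazon.com is trusted on the claimed sender address alone, so the spoofing weakness raised above applies to this sketch as well.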
NEVER?....Try the BBC?
No ads, quality programming, small fee.
But you still do get spam. Exactly as much, if not more, because you use Bayesian filtering. Spam still wastes your bandwidth, since you have to download it before it can be filtered. Spam still eats into any inbox size limit your ISP might impose. Spam cuts into any quota a forwarding service might now or in the future impose on your account, or it could take you to a higher charge level if you pay for a forwarding service. It costs your ISP money, costs that one way or another are eventually paid by you. Even the processing power for that Bayesian filtering costs you CPU cycles, while having no negative effect on the spammers whatsoever.
While you might not think you care how much spam I get, you might care if dozens, hundreds or thousands of other users at your work also get tons of spam, particularly when all of that spam significantly cuts into your bandwidth. And you will care when overload from spam on your mail server is so bad that it causes failures, effectively causing a D.O.S. situation.
And as long as geeks happily play with their little Bayesian filters, they stop seeing spam and so stop complaining to the providers that are letting spam get through. They stop doing other things that might make spammers' lives difficult. Heck, I fully expect some spam haters with an attitude like yours to say within earshot of a Congressman or Senator something like "Oh, I never get any spam. Spam can be filtered easily and nothing should be done about it". The spammers should love Bayesian filtering: it takes the pressure off them while allowing them to reach exactly the same number of marks with a mailing.
I'm an American. I love this country and the freedoms that we used to have.
I know this is slightly off topic, but can someone answer a reasonably simple question that's been bugging me for a while?
Why, instead of hunting down the spammers, do we not hunt down the people who are selling and advertising their junk via the spammers?
The spammers purposely make themselves difficult to find, but surely it must be easier to track down a company that is collecting money and sending out products? Why not make using spammers' services illegal, and fine and punish those who do so?
I think I'm correct in saying (and please tell me if I'm wrong) that here in the UK a similar situation is "fly-posting". In these cases, if advertising posters are put up somewhere illegal or unwanted, it is not the person who put the poster up who is fined, but the club, record label, or whoever is being advertised that takes the rap.
Just my 0.02p
As far as I know, many of those filters are based on a decision rule of the form
P(mail is spam | words X, Y, Z, ... are in it) > 1-epsilon
The computation is then done using Bayes' rule (P(A|B)=P(B|A)*P(A)/P(B)) under certain independence assumptions which make it tractable.
So this is actually Bayesian filtering.
My favorite filter is spamoracle.
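The decision rule and the Bayes' rule computation described above can be sketched as a small naive Bayes classifier. This is a minimal illustration, not how spamoracle or any filter in the review is implemented: the class name, the whitespace tokenizer, the add-one smoothing and the default epsilon are all assumptions made for the example.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy naive Bayes spam filter (hypothetical names, illustration only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct word once per mail: "word X is in it".
        self.mail_counts[label] += 1
        self.word_counts[label].update(set(text.lower().split()))

    def spam_probability(self, text):
        """P(mail is spam | words in it). Needs at least one spam and one ham trained."""
        total = self.mail_counts["spam"] + self.mail_counts["ham"]
        log_score = {}
        for label in ("spam", "ham"):
            # Prior P(label), in log space to avoid underflow.
            score = math.log(self.mail_counts[label] / total)
            for word in set(text.lower().split()):
                # Independence assumption: multiply P(word | label) per word.
                # Add-one smoothing keeps unseen words from giving zero.
                p = (self.word_counts[label][word] + 1) / (self.mail_counts[label] + 2)
                score += math.log(p)
            log_score[label] = score
        # Bayes' rule: divide by P(words), i.e. normalize over both labels.
        top = max(log_score.values())
        spam = math.exp(log_score["spam"] - top)
        ham = math.exp(log_score["ham"] - top)
        return spam / (spam + ham)

    def is_spam(self, text, epsilon=0.1):
        # The decision rule: P(spam | words) > 1 - epsilon.
        return self.spam_probability(text) > 1 - epsilon


f = NaiveBayesFilter()
f.train("buy cheap viagra now", "spam")
f.train("cheap pills buy now", "spam")
f.train("meeting at noon tomorrow", "ham")
f.train("lunch tomorrow with the team", "ham")
print(f.is_spam("buy cheap pills"))
print(f.is_spam("meeting tomorrow"))
```

A smaller epsilon makes the filter more conservative: fewer legitimate mails are flagged, at the cost of letting more spam through.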
Speaking from experience, I know for a fact that many of the harvesting programs (written in Perl, running on Linux, written by geeks) are very robust at deciphering most email obfuscation methods. You all sit and shake your fists, and the spamware writers are laughing their asses off.
You have the easy answer: don't obfuscate your email, don't even bother putting it on your posts.
No it's not.
I get spam at the rate of 1 spam mail per 6 months or so. Or maybe even less. I can't remember getting a single spam email on my actual email address for about a year.
If you have an account on a crapless domain (i.e. not hotmail.com, msn.com, aol.com and the like), it all comes down to this very simple rule:
Do not, under any circumstance, have your email address posted where it is publicly accessible ANYWHERE on the web.
It WILL get trawled. And then it will be spammed relentlessly.
If you have an existing address you don't want to give up, or an address at hotmail.com or a similar place, dump it.
Then exercise a bit of common sense about where you use your actual address.
I have a domain which catches email to unknown addresses and puts it in my regular mailbox.
Whenever I have to give an email address to some place on the web, I use *domain-i-am-currently-visiting*@mydomain.com. So if I am visiting foobar.com, I would put in foobar.com@mydomain.com.
I have been doing this for years. It enables me to see what was the source of the leak when I get spam on one of the addresses.
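The per-site address trick above needs no special software beyond the catch-all, but the bookkeeping can be sketched in a few lines. The function names are hypothetical, and mydomain.com is simply the placeholder domain used in this comment.

```python
def alias_for(site, my_domain="mydomain.com"):
    """Build the address to hand out to a given site, e.g. foobar.com@mydomain.com."""
    return f"{site.lower()}@{my_domain}"


def leak_source(recipient, my_domain="mydomain.com"):
    """Given the address a spam arrived on, return the site it was originally given to.

    Returns None if the address is not on the catch-all domain.
    """
    local, _, domain = recipient.lower().rpartition("@")
    if domain != my_domain or not local:
        return None
    return local


print(alias_for("foobar.com"))
print(leak_source("foobar.com@mydomain.com"))
```

The local part of the recipient address is the record of where the address was given out, so no separate database of aliases is needed.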
It has taught me one thing: I have never, ever, ever, in all my years of online shopping, forum posting etc., come across a single website that has ignored its own privacy statement. Ever. Even the slightly sketchy sites (like divx subtitle sites) don't leak addresses.
I was surprised to realize this.
The only addresses I ever get spam on are the ones I know to be publicly displayed on the web.
So it's that easy to avoid spam.
Give me liberty or give me kill -s 9