Comparison of Bayesian POP3 Spam Filters
kreide writes "Spam e-mail has become an ever-increasing problem, and these days it is next to impossible to use e-mail without receiving it in large amounts. Although various techniques exist to combat the problem, spammers seemed to be winning the war - until a new, powerful weapon appeared on the scene: Bayesian filters, our last, best hope for spam-free inboxes. In this review I compare POP3-based Bayesian spam filters." We did an Ask Slashdot on this a few weeks ago.
I love spam protection programs. I've been using them for years, but have to switch every couple of months because of the friggen spammers. The people that make the spamming software don't just sit around cackling about how evil they are. They reverse engineer every anti-spam protection out there in an attempt to get around it. While this seems like a good idea (and I will be playing around with these two programs for a while), it's unfortunately only good up to the point when spammers figure a way around it.
I wish the government would somehow make the practice illegal, but I doubt they'll ever get anything to stick. The far better option at this point is to have a class action suit of server owners (who provide mail accounts) against developers of spamming software and spammers. I've gotten enough warnings from my university to know that bandwidth costs money. Millions of spams a year sent into any one e-mail server can account for a serious chunk of bandwidth, used at significant cost to the provider. It won't stop spam altogether, but it will bankrupt anybody that has been doing it.
It's not stupid. It's advanced.
Given that I get 100+ spams a day, I've found that it's a good idea to at least use tagging. For example, when posting on Usenet I use usenet@domain.com with something in my sig saying my actual email is me at domain dot com. Anything sent to the usenet address is automatically deleted. Doesn't stop the flow by any means, but at least I can track where the spam came from.
If you are feeling clever you can even use addresses that expire after a week. So something like epochseconds@domain.com
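A rough sketch of the expiring-address idea in Python. The function names and the one-week lifetime constant are made up for illustration, and it assumes your mail server lets you run a check like this on the recipient address before accepting mail.

```python
import time

WEEK_SECONDS = 7 * 24 * 60 * 60  # one-week lifetime, as suggested in the post


def make_expiring_address(domain, now=None):
    """Return an address whose local part is the current Unix epoch time."""
    now = int(time.time()) if now is None else int(now)
    return f"{now}@{domain}"


def is_expired(address, now=None, lifetime=WEEK_SECONDS):
    """True if the address is not a timestamped one or is older than `lifetime`."""
    now = int(time.time()) if now is None else int(now)
    local_part = address.split("@", 1)[0]
    if not local_part.isdigit():
        return True  # not one of our epoch-style addresses
    issued = int(local_part)
    return now - issued > lifetime
```

Mail to an expired address can then be dropped, and the timestamp tells you roughly when and where the address was harvested.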
Just my 0.02p
Rus
Cheap UK and US VPS
I have long been an advocate of Bayesian or keyword based spam filters, but have recently been forced to change my outlook, and to argue that MULTIPLE SIMULTANEOUS solutions are the answer.
I encountered a very simple but unique spam system which works entirely on the sender's address. Simply, you create a small database with the domains/addresses you want to whitelist. Then, a program screens your mail, and if the sender is not in your whitelist, it sends an e-mail BACK to the sender with a simple URL (or even an actual link for HTML e-mail clients) that they visit to confirm that they REALLY want to send the e-mail to its destination. When this is done, they are added to the whitelist. Therefore, mails from forged remote addresses are no longer a problem, and neither are mails from trusted sources. And, better than SPEWS or similar blacklists, the sender gets a SECOND CHANCE to send their mail to you.
There's a commercial solution using this system right now, although the URL escapes me.
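For what it's worth, the whitelist-plus-confirmation flow described above could be sketched like this in Python. The class and method names are hypothetical and this is not how any particular commercial product works. Sending the challenge e-mail and serving the confirmation URL are left out.

```python
import secrets


class ChallengeResponseFilter:
    """Hold mail from unknown senders until they confirm via a one-time token."""

    def __init__(self, whitelist=()):
        # Entries may be full addresses or bare domains.
        self.whitelist = {entry.lower() for entry in whitelist}
        self.pending = {}  # token -> (sender, held message)

    def receive(self, sender, message):
        """Return ('deliver', message) or ('challenge', token)."""
        sender = sender.lower()
        domain = sender.split("@")[-1]
        if sender in self.whitelist or domain in self.whitelist:
            return ("deliver", message)
        token = secrets.token_urlsafe(16)  # would be embedded in the URL
        self.pending[token] = (sender, message)
        return ("challenge", token)

    def confirm(self, token):
        """Called when the sender follows the URL; releases the held message."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return None  # unknown or already-used token
        sender, message = entry
        self.whitelist.add(sender)
        return message
```

Messages whose challenge is never confirmed stay in `pending`, which is exactly the pile of unapproved mail a Bayesian filter could learn from.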
Of course, one could encounter problems when ordering online, say. Droids at Amazon will not be clicking your links to make sure your order receipt got through. One could argue that you'd put things like Amazon.com in the whitelist, but what if someone used amazon.com as a spoofed e-mail domain/address? Ay, there's the rub. But if this system were tied in with a Bayesian system, it'd be pretty unbeatable. What's more the Bayesian system would have extra data for negative matches, in the form of e-mails that were never 'approved', and positive data in the form of those that were.
So, I'd be more interested in producing a homebrew system that used MULTIPLE weaker systems, than one supposed 'sure fire' method.. as I feel no one method is perfect, whereas multiple systems can approach this nirvana.
Ideally, I think the client should take care of the filtering. Pour your resources into improving context-based filtering and let the individual clients do the dumping. Widespread usage of this kind of filtering could make spam even less profitable. Since spam is entirely business-driven, that would likely reduce the amount of it passing through the network.
From a sysadmin's POV, this doesn't halt the issue of spam eating bandwidth or disk space. I'll address that next.
Disk space depends on what kind of e-mail your organization uses. For POP3, most people delete e-mail on the server after it's downloaded, so while the disk space may be consumed with spam, it would be temporary. That is unless you have a lot of dead or rarely used accounts. In that case, you should have policies in place for when to wipe users' accounts out after a set period of time. Or set up some kind of forwarding policy. If you're using something like IMAP, then using a server-wide content filtering system as mentioned above would be effective.
For bandwidth, the only way to halt spam from consuming your bandwidth is by blocking packets at the router. If you use SPEWS to dump the e-mail at your e-mail server, it's still consumed your bandwidth. So you'd have to block the packets directly. I think this is draconian and should be avoided, for the net's sake. Unfortunately there really is no good solution to this, for as long as spam flows, it flows and consumes bandwidth. The only way to halt it is to halt the initial spamming to begin with. As mentioned above, when the spammer's audience disappears as a result of good content filtering, the spam will be unprofitable and lessen somewhat.
Attacking users and their ISPs won't do much good, aside from causing spammers to jump from ISP to ISP, something they're readily willing to do. Attacking regular users just makes you a big jerk.
Yes it does, the developers have created a test suite and a very extensive tokenizer. Any additional pseudowords or new ideas for tokenizing a message are tested very thoroughly before they are added (as most tend to actually lower accuracy instead of raising it). There have even been tests using SpamBayes on just headers and just message bodies, and both have worked very well.
I'll second this. Have been using spamprobe since December. It took longer than a month before it was fully trained. These days it's very good. And the best thing (except once when someone quoted the full body of a spam when complaining about spams on a mailing list): It has never given me a false positive.
I'd personally go for the last option... Maybe the next-to-last if their suit takes place in a really democratic place (there are 278 million American citizens and 2.2 million of them are in jail, this is a *lot*).
Trolling using another account since 2005.
Speaking of filtering for others... I don't - but I do run my own little mail server.
Even after implementing all the postfix UCE rules and adding in the RBLs - and using spamassassin... I still saw some spam slipping in...
So I hacked together a tiny little Perl script that monitors my mail log... after any IP address gets more than 3 "554" messages (generated by the RBLs) the source IP gets a lovely little teergrube.
I waste their resources and prevent them from trying to deliver any other shit that might get through spamassassin...
Script can be found here but is only good for postfix/linux/iptables people.
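For anyone who wants the gist without the Perl, here is a minimal Python sketch of the counting half. This is not the script mentioned above. The regular expression assumes reject lines that look like `reject: RCPT from host[1.2.3.4]: 554 ...`, so adjust it to whatever your own mail log actually contains. What you then do with the offending IPs (teergrube, firewall rule) is deliberately left out.

```python
import re
from collections import Counter

# Assumed log shape: "... reject: RCPT from unknown[10.0.0.1]: 554 ..."
REJECT_RE = re.compile(
    r"reject: \S+ from \S*\[(\d{1,3}(?:\.\d{1,3}){3})\]: 554 "
)


def offenders(log_lines, threshold=3):
    """Return the IPs that collected more than `threshold` 554 rejections."""
    counts = Counter()
    for line in log_lines:
        match = REJECT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(ip for ip, n in counts.items() if n > threshold)
```

Feed it an iterable of log lines (an open file works) and hand the result to whatever does the tarpitting.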
e3 :: blogging the wireless freenet
How do Bayesian filters solve the problem of pure-image spams, i.e. HTML mails that contain nothing but an IMG tag? I only see collaborative filters solving this problem - SPAMfighter would be an example of this.
Moz's Bayesian filtering works well, but its Achilles heel is that it doesn't work on the POP3 server, so you still have to download everything. As POP3 allows the header and the first part of the message body to be read without downloading it, surely there could be an option - once Moz has been trained and you're fairly sure the false positive rate is negligible - for filters to operate on the server and delete spam from there?
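A sketch of what such server-side filtering might look like from a client, in Python. It relies on the POP3 TOP command, which Python's `poplib.POP3` exposes as `top(which, howmuch)`. The `purge_spam` name and the `looks_like_spam` callback are hypothetical stand-ins for a trained classifier, and this is not how Mozilla's filter works.

```python
def purge_spam(conn, looks_like_spam, body_lines=20):
    """Mark messages for deletion based on headers plus the first body lines.

    `conn` is expected to behave like an authenticated poplib.POP3 object
    (list/top/dele). `looks_like_spam` takes the preview text and returns
    True for spam. Returns the message numbers marked for deletion.
    """
    deleted = []
    _resp, listing, _octets = conn.list()
    for entry in listing:
        msg_num = int(entry.split()[0])  # entries look like b"1 1234"
        # TOP fetches the headers and only `body_lines` lines of the body.
        _resp, lines, _octets = conn.top(msg_num, body_lines)
        preview = b"\n".join(lines).decode("latin-1")
        if looks_like_spam(preview):
            conn.dele(msg_num)
            deleted.append(msg_num)
    return deleted

# Usage, assuming a reachable server and valid credentials:
#   import poplib
#   conn = poplib.POP3("mail.example.com")
#   conn.user("me"); conn.pass_("secret")
#   purge_spam(conn, my_classifier)
#   conn.quit()  # deletions only take effect when the session ends with QUIT
```

The catch is that the classifier only sees the preview, so accuracy may differ from filtering the full message.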
When I am king, you will be first against the wall.
I agree with everything that you said about filters being ineffective. But I strongly disagree with your "only thing" statement. Particularly if you mean it as any of the systems I've ever heard about, such as "If it's not in the address book, the sender must acknowledge a challenge message" type of approaches. The problem with such systems is that many of us get quite a bit of e-mail each day from people who are not in our regular address books, some of it quite important to us. We do not want that mail lost because the system at the other end was not in our address book and did not waste their time responding to a challenge-and-response type system. For example, say I purchased something on-line from a vendor I had never dealt with before. Their e-mail system may automatically kick out an e-mail that informs me the product was shipped and gives me an important FedEx or UPS tracking number. I'm glad they do such things with their shipping systems, and I don't expect them to manually respond to every challenge they get back; realistically they will send any such challenges to the bit bucket, and people who want e-mail that is important to them will end up never getting it.
So I do not believe that authentication of senders, at least in any of the traditionally suggested ways, is the correct approach. Much of the spam problem we have is due to what I consider flaws in SMTP. I would very much like to see a replacement for SMTP that considered the spam problems (as well as other problems inherent in SMTP). As an example, another post here mentioned a system where the mail is held, not on your ISP or upstream provider's system until you download it, but rather is held on the sender's or sender's ISP's system. The recipient would presumably receive only a very short indicator of where they have mail waiting, and would fetch it themselves when they are ready to receive it. This puts the burden of storage on the sender or the service provider for the sender, and avoids considerable bandwidth wasted by senders who supposedly send out e-mail with addresses generated to match all combinations of up to x characters (the excuse Mindspring gave to me when addresses that I created but never gave out or used started getting spam, not that I believe them). In addition to putting this burden on the sender, it would ensure that there was a good address in the e-mail to fetch the mail from, so spammers would have a much harder time injecting their spam into the system and would be much more traceable. And while I'm not foolish enough to think that laws could completely stop spam, we've seen how laws did drastically curtail fax spam, and some fax spammers have recently been made to pay serious fines. I do think laws would have a big effect on spammers; there are a lot of spammers who just don't want to have to move out of the country to keep up spamming, and those of us who hate spam will track the spam back to US sources if we have a law with teeth in it to impose fines (or worse) on them when we do.
Of course, any change to or replacement of SMTP must be phased in over time. It's not a short-term solution to spam. But I expect SMTP would quickly go the way of gopher or archie or the rest if a viable new protocol was presented that addressed these problems effectively, and this is where I think our greatest chances for success are.
I'm an American. I love this country and the freedoms that we used to have.
Another good one is if the domain from the envelope sender doesn't have an MX record. Bam, guaranteed spam. The other one is to verify the sender, not just the domain. This kills all those spams from lkiqprejbn@yahoo.com which are obviously bulldust.
That alone kills off about 70% (IMO) of the spam that comes through servers that I administer, and as far as I know, only 2 emails (over the last 4 years or so) that weren't meant to be rejected were rejected because they had invalid sender envelopes.
HTH
cya
Andrew
One of the things I love about popfile is that it is not a spam filter. It is a general mail filter. I have about ten categories of mail that it sorts out for me. This also helps cut out false positives. 'Work', 'Personal', and 'Friends' are all much more similar to each other than to 'Spam'.
Bulk emailing, like any business is a numbers game. By significantly decreasing the # of successful responses to a set of SPAM (through filters) the business costs remain the same w/ the returns dropping. Eventually the business is no longer feasible.
[INCREASE TONE]
SPAM absolutely does not need to be fought in the courts when the markets can work this out on their own (as we see w/ these filters). In the end we'll have better technology for sorting and filtering emails which can be applied to other applications and the spammers will be gone or significantly reduced.
[BREATHE... BREATHE...]
Legislation would only be valid in the country in which the legislation was enacted so spammers could simply move their operations to a SPAM friendly country.
[GRADUALLY INCREASE TONE]
Also, what constitutes spam? What if I only send 10,000 emails out? What if I change the email each time I send it so it's unique to you? What if I'm not selling anything? What if someone compromised my system and sent all the emails from my PC? Why shouldn't ISPs be liable too... yeah, why are they letting people send those SPAMs... let's sue them too... somebody get a rope!!
[BEGIN ALL OUT RANT!]
So the moral of the story is... everyone remain calm... keep working on your filters and other new technologies... and soon we'll have fewer spammers and better tech and some intelligent hacker out there will have a whole heap load of cash for coming up w/ the solution.
Of course w/ all of the existing hideous legislation we have today... SCO may announce that they are diversifying into bulk emailing and that they have a patent on any spam filtering algorithms and therefore if you ever remove any of their emails you must send them a $699 licensing fee for the use of their IP.
POPFile really got shortchanged by this review. It serves as much more than a spam filter. I thought I'd give SpamBayes a try anyway but the Outlook plugin won't install on my XP machine. Some problem with an unresolved dependency in shlwapi.dll... boring. The point is, the SpamBayes site doesn't have a tech support forum where I can ask for help with this kind of problem.
The power of Christ compiles you!
Speaking from experience, I know for a fact that many of the harvesting programs (written in perl, running on linux, written by geeks) are very robust at deciphering most email obfuscation methods. You all sit and shake your fists, and the spamware writers are laughing their asses off.
You have the easy answer: don't obfuscate your email, don't even bother putting it on your posts.
Actually SpamBayes isn't Bayesian at all. It uses a chi^2-based algorithm which was shown in the SpamBayes team's (extensive) tests to be superior to regular Bayesian filtering.
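To give a feel for what "chi^2-based" means, here is a small Python sketch of chi-squared combining in the style of Fisher's method, as I understand the general approach. It is not code taken from SpamBayes, the function names are my own, and the per-token probabilities are assumed to be given and strictly between 0 and 1.

```python
import math


def chi2q(x2, dof):
    """Probability that a chi-squared variable with even `dof` exceeds x2."""
    if dof % 2 != 0:
        raise ValueError("dof must be even")
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1


def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    # Evidence for spam: how unlikely these probabilities would be for ham.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence for ham: how unlikely these probabilities would be for spam.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0
```

A message full of spammy tokens scores near 1, one full of hammy tokens scores near 0, and a message with strong evidence on both sides lands near 0.5, which is the sort of mail that ends up in an "unsure" pile.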
You just don't get the whole concept of Bayesian spam filtering. It works on a personal basis; don't forget that, statistically speaking, one man's spam is another man's legitimate personal e-mail. For example, if you send and receive a disproportionately large amount of messages containing cock jokes and talking about tits and sex (which, being a 20-year-old male, I can tell you is about 80% of my friends), under a "typical" or system-wide Bayesian filter that might be installed by some ISP, you're almost certainly going to lose a lot of messages that weren't spam. Which is the worst-case scenario for a spam filter. What's worse, the ISP would have to employ some sort of "spam czar" to monitor (people's private) incoming e-mail and make judgement calls as to what is and is not spam. That's a call I want to make, not one that I want made for me.
The best way to eliminate spam, to me, is a two-part system whereby the ISP (via procmail, etc.) eliminates all mail that is definitely spam, and then passes along anything questionable to the user. Bayesian filter should be implemented in the client, which, thankfully, is becoming more and more common. ISPs should think about bundling clients that already support Bayesian sampling, enabling it by default, explaining in very clear terms how to use it, etc., but that's about all they can do.
I think there is a world market for maybe five personal web logs.
$0.04US charge for every Email SENT. College accounts can get their costs refunded by delivering a sent mail list.
This will stop spamming quick... or at least make it slow way down.
1,000,000 spams = $40,000.00US, more than the entire net worth of the most successful spammer.
Do not look at laser with remaining good eye.
Spammers love to use open proxies to hide, and are now engaged not only in scans to find them, but also in campaigns to create them, using trojans and worms like SoBig. While each offense is small, it's on a scale large enough to have them behind bars for quite a while.
One line blog. I hear that they're called Twitters now.
POPFile's utility does not lie just in managing the spam menace. To me, the real utility in POPFile is the ability to create x number of buckets and train it to sort your mail. SpamBayes looks great for spam but has no further utility. I like having POPFile sort my work from personal emails, and file all my mailing lists in another bucket, and even jokes. Of course there is the spam folder that I check every now and then. I look forward to it being able to support IMAP servers as well.
Spammers should love Bayesian filtering, it takes the pressure off them while allowing them to reach exactly the same number of marks with a mailing.
I'm afraid you've made the cardinal mistake of thinking that spammers follow logic.
First question: Why do people install filters on their mailboxes?
Answer: To stop spam.
Now, take a look at any interview with any spammer.. you'll note that when they're asked, the spammer will say "I don't send it to people who don't want it."
They'll also say "we're always coming up with ways to bypass filters."
Now, you'd think that with the two statements, that one of them is false - however (besides the fact that spammers lie), any sociologist will tell you that the spammer actually believes he's telling the truth in each of these statements..
How he justifies it in his mind is that he believes that even though someone has installed a spam filter, that this person only wants to filter spam from other spammers - that his spam is somehow "special".
Spammers are sociopaths, and like all sociopaths, they believe the rules do not apply to them.
If spammers weren't sociopaths, and were capable of applied logic, then they'd realize that any filter (not just Bayesian) would benefit them.. but then, if they weren't sociopaths, they wouldn't be spammers in the first place.
say I purchased something on-line from a vendor I had never dealt with before. Their e-mail system may automatically kick out an e-mail
Using TMDA, you would generate a "keyword" address: a unique address, identified by a keyword embedded in the address, which would allow your vendor to bypass the C/R system. If the keyword address starts being abused then (1) you can easily disable it, and (2) you know not to do business with that vendor again.
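To illustrate the general idea of a keyword address that cannot be guessed or forged, here is a Python sketch using an HMAC over the keyword. The address layout, function names, and revocation list are all assumptions made for this example and should not be read as TMDA's actual format or API.

```python
import hashlib
import hmac


def _mac(keyword, secret):
    """Short keyed digest tying the keyword to our secret (illustrative)."""
    return hmac.new(secret, keyword.encode(), hashlib.sha1).hexdigest()[:8]


def keyword_address(user, domain, keyword, secret):
    """Build an address like user-keyword-<keyword>.<mac>@domain."""
    return f"{user}-keyword-{keyword}.{_mac(keyword, secret)}@{domain}"


def check_keyword_address(address, secret, revoked=()):
    """Return the keyword if the address is genuine and not revoked, else None."""
    local, _, _domain = address.partition("@")
    _user, sep, tagged = local.partition("-keyword-")
    keyword, dot, mac = tagged.rpartition(".")
    if not sep or not dot or keyword in revoked:
        return None
    if hmac.compare_digest(mac, _mac(keyword, secret)):
        return keyword
    return None
```

Mail to a valid keyword address skips the challenge. If the keyword shows up in spam, add it to `revoked` and you also know which vendor leaked it.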
As an example, another post here mentioned a system where the mail is held, not on your ISP or upstream provider's system until you download it, but rather is held on the sender's or sender's ISP's system.
This system quickly breaks down, though, as delays are introduced by having to wait to fetch each piece of mail. People bothered by such delays will write/obtain software that automatically fetches the mail at a predetermined time, which would then shift the bandwidth problem (part of it, anyways) back to the recipient.
The other problem with sender authentication is who, exactly, determines whether a sender is authenticated? I run my own e-mail server. Will I have to pay out bucks for an "authority" to confirm that my sending address is valid? Right now, some ISPs (notably Time-Warner offshoots) are denying access to their SMTP servers under the guise of reducing spam. If your IP happens to fall within a certain range, they simply don't allow you access. We will end up in the same morass RBL has put us in: Who plays God in determining whether a sender is truly "authentic" or "worthy"?
The "unsure" feature directly combats the latest Spammer technique -- filter poisoning.
You've all seen it work; the Spammers don't just send you the same spam once, they send you it 5 to 20 times, and they include a clipping from the headlines or something under their pitch.
They're not doing it to get that one mail past to you. They're actually HOPING that you classify all 20 mails as spam.
Why?
Because every time you classify that mail as spam, EVERY SINGLE WORD of that news clipping is "poisoned" inside the filter, and becomes an indicator of a spam. Then you turn around, and get an email from someone legitimate using those common words... and it gets wrongly classified too.
Enough false positives, and the spammers win, because they'll get you to turn the filter back off.
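A toy Python illustration of the poisoning effect described above. The per-token estimate used here (share of spam containing the token versus share of ham containing it) is a deliberately simplified assumption and not the formula of any particular filter, and the `TokenStats` class is made up for the example.

```python
from collections import Counter


class TokenStats:
    """Track in how many spam and ham messages each token has appeared."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        tokens = set(text.lower().split())  # count each token once per message
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def spam_prob(self, token):
        """Simplified estimate of how spammy a single token looks."""
        s = self.spam[token] / self.n_spam if self.n_spam else 0.0
        h = self.ham[token] / self.n_ham if self.n_ham else 0.0
        if s + h == 0:
            return 0.5  # never seen: no opinion
        return s / (s + h)
```

Train it on ordinary mail where "senate" only ever shows up in legitimate messages, then on 20 copies of a spam with a pasted news clipping, and the estimate for "senate" jumps from clean to spammy. That is the mechanism behind the false positives.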
Enough is enough -- time to establish open hunting season on Spammers.
As a network/web/computer manager, my email has been provided to dozens of companies and trade shows. I still remember the day (August, 3 years ago) when someone first sold my address to a spam list. I went from 2-3 spams per day to 15-20. This spring brought another explosion, this time into the 100+ range. I am currently receiving over 6,000 spam messages every month! Obviously my main email address was useless and needed to be burned on a pyre to purge the evil.
After a week or two of this, I installed SpamBayes in the form of its Outlook plugin. I showed it my email archive as my "good" messages, and a bunch of spam gleaned from my deleted folder as "bad". My mailbox is now perfectly clean. I have received at least 15,000 spam messages since installing SpamBayes, and I have probably had to hit the "Delete As Spam" button about 10 times for ones that it missed, most of those being variations on the Nigerian scheme. It has never grabbed a real message, and the "Unsure" feature localizes everything that I really need to look at in one place.
If you have a spam problem, get SpamBayes. It is that simple. There is no need to speculate about that better method that you thought up, or how it really won't work because of XYZ theory... it works almost perfectly, and it lets you know about anything that it is not sure about with the "Unsure" folder, so it never throws the baby out with the bathwater. In short, this is almost the perfect Spam filter. It even caught the emails that were using GIFs to avoid being filtered on content, placing them in unsure until I said "this is spam", after which I never saw another one. Pretty darned cool!
It is actually kind of fun to watch this thing work. I came in this morning to find 568 new messages in my spam folder, 3 in unsure, all of which were spam. No spam anywhere to be found in my inbox, just 15 unread messages that were correctly left alone by SpamBayes. Just imagine having to flip through 600 emails to find 15 real messages! Now I just hit "CTRL-A DEL" in my spam folder and it is all gone! 5 seconds a day to deal with spam, I can live with that....
The other good news from that study is that they show that spam does decrease after you remove your email address from websites... in other words, they don't keep the addresses as long as we generally believe. You aren't on every spammer's list forever just because they get your address once.
I was more than a little disappointed to see that Apple's Mail.app was not included in the comparison. It wouldn't surprise me in the least if it were already the most widely used Bayesian spam filter. Unsurprisingly, it is also very easy to use.
Mail.app also combines Bayesian filtering with the Address book -- any mail from a known correspondent won't be tagged as Junk. This reduces the risk of false positives. This is an integration cheat not available to stand-alone spam filters, because Apple supplies the Address book app and provides other integration between the two applications. But, (as a self-centered end-user) I don't care that it is a cheat, I am merely happy that it all works well. (And I cross my fingers and hope that somehow, Apple's C/C++/Objective-C programmers are less prone to leaving buffer overflow holes than Microsoft's programmers clearly are.)
The author needs to read Edward Tufte's books on presenting information (e.g., The Visual Display of Quantitative Information).