Comparison of Bayesian POP3 Spam Filters

Posted by michael on Sunday August 10, 2003 @08:02PM from the spam-i-am dept.

kreide writes "Spam e-mail has become an ever-increasing problem, and these days it is next to impossible to use e-mail without receiving it in large amounts. Although various techniques exist to combat the problem, spammers seemed to be winning the war - until a new, powerful weapon appeared on the scene: Bayesian filters, our last, best hope for spam-free inboxes. In this review I compare POP3-based Bayesian spam filters." We did an Ask Slashdot on this a few weeks ago.

Only useful to a point by KU_Fletch · 2003-08-10 20:16 · Score: 4, Interesting

I love spam protection programs. I've been using them for years, but have to switch every couple of months because of the friggen spammers. The people that make the spamming software don't just sit around cackling about how evil they are. They reverse engineer every anti-spam protection out there in an attempt to get around it. While this seems like a good idea (and I will be playing around with these two programs for a while), it's unfortunately only good up to the point when spammers figure a way around it.

I wish the government would somehow make the practice illegal, but I doubt they'll ever get anything to stick. The far better option at this point is to have a class action suit of server owners (who provide mail accounts) against developers of spamming software and spammers. I've gotten enough warnings from my university to know that bandwidth costs money. Sending millions of spams a year into any one e-mail server can account for a serious chunk of bandwidth, at significant cost to the provider. A suit won't stop spam altogether, but it will bankrupt anybody that has been doing it.

--
It's not stupid. It's advanced.
I changed my mind. Simpler is better. by Peter+Cooper · 2003-08-10 20:24 · Score: 5, Interesting

I have long been an advocate of Bayesian or keyword based spam filters, but have recently been forced to change my outlook, and to argue that MULTIPLE SIMULTANEOUS solutions are the answer.

I encountered a very simple but unique spam system which works entirely on the sender's address. Simply, you create a small database with the domains/addresses you want to whitelist. Then, a program screens your mail, and if the sender is not in your whitelist, it sends an e-mail BACK to the sender with a simple URL (or even an actual link for HTML e-mail clients) that they visit to confirm that they REALLY want to send the e-mail to its destination. When this is done, they are added to the whitelist. Therefore, mails from forged remote addresses are no longer a problem, and neither are mails from trusted sources. And, better than SPEWS or similar blacklists, the sender gets a SECOND CHANCE to send their mail to you.

There's a commercial solution using this system right now, although the URL escapes me.

Of course, one could encounter problems when ordering online, say. Droids at Amazon will not be clicking your links to make sure your order receipt got through. One could argue that you'd put things like Amazon.com in the whitelist, but what if someone used amazon.com as a spoofed e-mail domain/address? Ay, there's the rub. But if this system were tied in with a Bayesian system, it'd be pretty unbeatable. What's more, the Bayesian system would have extra data for negative matches, in the form of e-mails that were never 'approved', and positive data in the form of those that were.

So, I'd be more interested in producing a homebrew system that used MULTIPLE weaker systems, than one supposed 'sure fire' method.. as I feel no one method is perfect, whereas multiple systems can approach this nirvana.
1. Re:I changed my mind. Simpler is better. by ctr2sprt · 2003-08-10 20:38 · Score: 4, Interesting
  
  Any approach that triggers an automatic action on your behalf is bad, because it can be turned against you. It's not likely that email would make a terribly good DDoS service, but a system like the one you describe would certainly be vulnerable to it. And I think it would only last a week, at most, before spammers figured out a way around it. They can already handle "NOSPAM" being inserted in email addresses, and recently added the ability to reverse and combine email addresses until they get something plausible.
  I do agree with you that we need multiple layers of safeguards in order to solve spam - or at least to hide it away so nobody has to look at it - but I don't think your specific example is very good.
2. Re:I changed my mind. Simpler is better. by scj · 2003-08-10 22:38 · Score: 5, Interesting
  I had thought of something similar for fighting spam. Here's how I'd handle each email:
  
  If the email is from someone in my whitelist, allow the mail to go through and feed it as 'ham' to the Bayesian filter.
  
  If the email is not in my whitelist, run it through spam filtering software (Spamassassin works well) to determine if it is likely to be spam.
  
  If it seems like spam, then use a challenge-response system (like TMDA) to find out if a human sent the email.
  
  If the mail doesn't seem like spam, just deliver it. If I get 3 non-spammy messages from the same person (separated by a day or more) then add them to my whitelist automatically.
  
  If someone responds to the TMDA challenge, put them in the whitelist and deliver the original email.
  
  If no one responds to the TMDA challenge after a week, feed the mail as 'spam' to the Bayesian filter.
  
  In addition, I'd use a system like Sneakemail to generate random email addresses to give out to businesses I want to do business with and use to sign up to mailing lists. These email addresses would be added to my whitelist so they could send me mail without going through the challenge-response system. If they start spamming me, I put the random email I gave them on my blacklist.
  
  This system has the following benefits:
  
  Business mail I want (like receipts and newsletters from companies I do business with) always gets through, since the Sneakemail-type address is whitelisted. This solves the problem of businesses not responding to TMDA challenges.
  
  My real email address is protected from businesses who are likely to sell it and from people farming addresses from mailing lists.
  
  Personal email that the spam filter sees as non-spam gets delivered without bothering the sender with a challenge-response system.
  
  Personal email that the filter does find spammy still has a second chance to make it through via the challenge-response system. This should reduce false positives to include only spammy emails from people who don't respond to the challenge.
  
  The Bayesian filter is automatically trained based on mails from people in my whitelist and mails from people who never respond to the challenge-response.
  
  You would still get spam with this system (personal email that your filter thinks is non-spam), but hopefully your false-positive rate would be zero. Also, you don't annoy other people much by only sending challenge-response messages to spam-like emails. Finally, this would be easy for end users to use. They don't have to train the spam filter, since it should train itself. The only complicated part would be generating and using the random emails that you give to businesses and mailing lists.
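The routing rules described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the names `MailState`, `route` and `resolve_challenge` are hypothetical, the spam verdict is passed in as a boolean rather than computed by SpamAssassin, and "separated by a day or more" is approximated by counting distinct calendar dates.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class MailState:
    whitelist: set = field(default_factory=set)    # trusted senders and aliases
    blacklist: set = field(default_factory=set)    # burned throwaway aliases
    clean_days: dict = field(default_factory=dict) # sender -> dates with non-spammy mail


def route(state, sender, alias, looks_spammy, today):
    """Decide what to do with one incoming message.

    `alias` is the throwaway address the mail was sent to (or None),
    `looks_spammy` is the verdict of whatever content filter is in use.
    """
    if alias is not None and alias in state.blacklist:
        return "reject"
    if sender in state.whitelist or alias in state.whitelist:
        return "deliver+train_ham"
    if looks_spammy:
        return "challenge"
    # Non-spammy mail from an unknown sender: deliver it, and whitelist the
    # sender once clean mail has arrived on three different days.
    days = state.clean_days.setdefault(sender, set())
    days.add(today)
    if len(days) >= 3:
        state.whitelist.add(sender)
    return "deliver"


def resolve_challenge(state, sender, answered, sent_on, today):
    """Follow up on a challenge that was sent on `sent_on`."""
    if answered:
        state.whitelist.add(sender)
        return "deliver"
    if today - sent_on >= timedelta(days=7):
        return "train_spam"
    return "wait"
```

Keeping the filter verdict as an input means the same routing could sit in front of any content filter; only the unanswered challenges and whitelisted mail feed the training data.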
A new poll is required by mirko · 2003-08-10 20:42 · Score: 4, Interesting
How should spammers be dealt with ?
- Ban their original networks
- Throw them in jail
- Kill them
- Fine them 0.01$/email and improve third world infrastructures with the money.
- Filter/Ignore them.
I'd personally go for the last option... Maybe the next-to-last if their suit takes place in a really democratic place (there are 278 million American citizens and 2.2 million of them are in jail; this is a *lot*).
--
Trolling using another account since 2005.
Re:You just don't get it by Anonynmous+Cow · 2003-08-10 20:46 · Score: 4, Interesting

Speaking of filtering for others... I don't - but I do run my own little mail server.

Even after implementing all the postfix uce rules and adding in the RBL's - and using spamassassin... I still saw some spam slipping in...

So I hacked together a tiny little perl script that monitors my mail log... after any IP address gets more than 3 "554" messages (generated by the RBL's) the source IP gets a lovely little teergrube.

I waste their resources and prevent them from trying to deliver any other shit that might get through spamassassin...

Script can be found here but is only good for postfix/linux/iptables people.
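The counting half of such a log monitor might look roughly like the sketch below, written in Python rather than the Perl of the script mentioned above. It is not that script: the function name `offenders` is hypothetical, and the regular expression assumes a log line shaped like `... from host[1.2.3.4]: 554 ...`, which should be checked against the actual mail log. What to do with a flagged address (a teergrube, a firewall rule) is left to the caller.

```python
import re
from collections import Counter

# Assumed log shape: "... from somehost[1.2.3.4]: 554 ...".
# The dotted-quad requirement keeps process IDs such as "smtpd[1234]" from matching.
REJECT = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]: 554 ")


def offenders(log_lines, threshold=3):
    """Return IPs that collected more than `threshold` 554 rejections.

    Each IP is reported once, in the order it crossed the threshold.
    """
    counts = Counter()
    flagged = []
    for line in log_lines:
        match = REJECT.search(line)
        if match is None:
            continue
        ip = match.group(1)
        counts[ip] += 1
        if counts[ip] == threshold + 1:
            flagged.append(ip)
    return flagged
```

A caller could feed this from a file being tailed and hand each returned address to whatever tarpit mechanism the host provides.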

--
e3 :: blogging the wireless freenet
Something he misses about popfile. by CGP314 · 2003-08-10 21:49 · Score: 4, Interesting

One of the things I love about popfile is it is not a Spam filter. It is a general mail filter. I have about ten categories of mail that it sorts out for me. This also helps cut out false positives. 'Work', 'Personal' and 'Friends' are all much more similar to each other than to 'Spam'.
Re:hmm, if you really are so clever by Anonymous Coward · 2003-08-10 22:20 · Score: 5, Interesting

Very good.
Speaking from experience, I know for a fact that many of the harvesting programs (written in perl, running on linux, written by geeks) are very robust at deciphering most email obfuscation methods. You all sit and shake your fists, and the spamware writers are laughing their asses off.
You have the easy answer: don't obfuscate your email, don't even bother putting it on your posts.
Re:Nitpick... by spongman · 2003-08-10 22:22 · Score: 4, Interesting

Actually SpamBayes isn't Bayesian at all. It uses a chi^2-based algorithm which was shown in the SpamBayes team's extensive tests to be superior to regular Bayesian filtering.
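For readers who have not seen chi-squared combining, here is a minimal sketch of the idea as generally attributed to Gary Robinson: per-word spam probabilities are combined into a spam indicator S and a ham indicator H, and the final score is (S - H + 1) / 2. The function names and the cutoff values 0.2 and 0.9 are assumptions made for illustration, not a statement of what SpamBayes ships, and every input probability must lie strictly between 0 and 1.

```python
import math


def chi2q(x2, v):
    """Chance that a chi-squared variable with v (even) degrees of freedom exceeds x2."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Combine per-word spam probabilities (each in (0, 1)) into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)  # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)      # evidence for ham
    return (s - h + 1.0) / 2.0


def classify(probs, ham_cutoff=0.2, spam_cutoff=0.9):
    """Three-way verdict; the cutoffs are illustrative assumptions."""
    score = combine(probs)
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"
```

When strong spam clues and strong ham clues appear in the same message, S and H are both high and the score lands near 0.5, which is what produces a middle "unsure" band instead of a forced guess.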
POPFile is more than just a spam tool by rediguana · 2003-08-10 23:20 · Score: 4, Interesting

POPFile's utility does not lie just in managing the spam menace. To me, the real utility in POPFile is the ability to create x number of buckets and train it to sort your mail. SpamBayes looks great for spam but has no further utility. I like having POPFile sort my work from personal emails, and file all my mailing lists in another, and even jokes. Of course there is the spam folder that I check every now and then. I look forward to it being able to support IMAP servers as well.
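Sorting into an arbitrary number of buckets is ordinary multi-class naive Bayes. The sketch below shows the general technique only; it is not POPFile's code, the class name `BucketClassifier` is hypothetical, tokenising is a bare whitespace split, and add-one smoothing is an assumed choice.

```python
import math
from collections import Counter, defaultdict


class BucketClassifier:
    """Toy multi-bucket naive Bayes text classifier."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.doc_counts = Counter()              # bucket -> messages trained

    def train(self, bucket, text):
        self.word_counts[bucket].update(text.lower().split())
        self.doc_counts[bucket] += 1

    def classify(self, text):
        """Return the most likely bucket, or None if nothing has been trained."""
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(self.doc_counts.values())
        words = text.lower().split()
        best, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior plus add-one smoothed log likelihood of each word.
            score = math.log(self.doc_counts[bucket] / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best, best_score = bucket, score
        return best
```

A short usage example, with made-up training text:

```python
c = BucketClassifier()
c.train("work", "meeting agenda quarterly report")
c.train("personal", "dinner saturday family")
c.train("spam", "cheap pills buy now")
print(c.classify("buy cheap pills"))
```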
Re:You really just don't get it by schon · 2003-08-11 02:25 · Score: 5, Interesting

spammers should love Bayesian filtering, it takes the pressure off them while allowing them to reach exactly the same number of marks with a mailing.

I'm afraid you've made the cardinal mistake of thinking that spammers follow logic.

First question: Why do people install filters on their mailboxes?

Answer: To stop spam.

Now, take a look at any interview with any spammer.. you'll note that when they're asked, the spammer will say "I don't send it to people who don't want it."

They'll also say "we're always coming up with ways to bypass filters."

Now, you'd think that with the two statements, that one of them is false - however (besides the fact that spammers lie), any sociologist will tell you that the spammer actually believes he's telling the truth in each of these statements..

How he justifies it in his mind is that he believes that even though someone has installed a spam filter, that this person only wants to filter spam from other spammers - that his spam is somehow "special".

Spammers are sociopaths, and like all sociopaths, they believe the rules do not apply to them.

If spammers weren't sociopaths, and were capable of applied logic, then they'd realize that any filter (not just Bayesian) would benefit them.. but then, if they weren't sociopaths, they wouldn't be spammers in the first place.
The real reason SpamBayes wins... by Moryath · 2003-08-11 02:56 · Score: 4, Interesting

The "unsure" feature directly combats the latest Spammer technique -- filter poisoning.

You've all seen it work; the Spammers don't just send you the same spam once, they send you it 5 to 20 times, and they include a clipping from the headlines or something under their pitch.

They're not doing it to get that one mail past to you. They're actually HOPING that you classify all 20 mails as spam.

Why?

Because every time you classify that mail as spam, EVERY SINGLE WORD of that news clipping is "poisoned" inside the filter, and becomes an indicator of a spam. Then you turn around, and get an email from someone legitimate using those common words... and it gets wrongly classified too.

Enough false positives, and the spammers win, because they'll get you to turn the filter back off.
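A toy calculation shows the mechanism being described. It uses a simplified frequency ratio for a single word, with made-up counts; real filters add smoothing and clamping, and the function name `spam_prob` is hypothetical.

```python
def spam_prob(word_in_spam, word_in_ham, n_spam, n_ham):
    """Simplified per-word spam probability from message counts."""
    s = word_in_spam / n_spam if n_spam else 0.0
    h = word_in_ham / n_ham if n_ham else 0.0
    if s + h == 0:
        return 0.5  # word never seen: no evidence either way
    return s / (s + h)


# An ordinary word that appears in 10 of 100 legitimate mails and in no spam.
before = spam_prob(word_in_spam=0, word_in_ham=10, n_spam=100, n_ham=100)

# The same word after 20 copies of a spam carrying it are trained as spam.
after = spam_prob(word_in_spam=20, word_in_ham=10, n_spam=120, n_ham=100)

print(before, after)
```

With these counts the word moves from a clean ham indicator to leaning spam, which is the drift the comment attributes to padded messages.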

Enough is enough -- time to establish open hunting season on Spammers.
SpamBayes Testimonial by Cytotoxic · 2003-08-11 03:37 · Score: 4, Interesting

As a network/web/computer manager, my email has been provided to dozens of companies and trade shows. I still remember the day (August, 3 years ago) when someone first sold my address to a spam list. I went from 2-3 spams per day to 15-20. This spring brought another explosion, this time into the 100+ range. I am currently receiving over 6,000 spam messages every month! Obviously my main email address was useless and needed to be burned on a pyre to purge the evil.
After a week or two of this, I installed SpamBayes in the form of its Outlook plugin. I showed it my email archive as my "good" messages, and a bunch of spam gleaned from my deleted folder as "bad". My mailbox is now perfectly clean. I have received at least 15,000 spam messages since installing SpamBayes, and I have probably had to hit the "Delete As Spam" button about 10 times for ones that it missed, most of those being variations on the Nigerian scheme. It has never grabbed a real message, and the "Unsure" feature localizes everything that I really need to look at in one place.
If you have a spam problem, get SpamBayes. It is that simple. There is no need to speculate about that better method that you thought up, or how it really won't work because of XYZ theory... it works almost perfectly, and it lets you know about anything that it is not sure about with the "Unsure" folder, so it never throws the baby out with the bathwater. In short, this is almost the perfect Spam filter. It even caught the emails that were using GIFs to avoid being filtered on content, placing them in unsure until I said "this is spam", after which I never saw another one. Pretty darned cool!
It is actually kind of fun to watch this thing work. I came in this morning to find 568 new messages in my spam folder, 3 in unsure, all of which were spam. No spam anywhere to be found in my inbox, just 15 unread messages that were correctly left alone by SpamBayes. Just imagine having to flip through 600 emails to find 15 real messages! Now I just hit "CTRL-A DEL" in my spam folder and it is all gone! 5 seconds a day to deal with spam, I can live with that....