Comparison of Bayesian POP3 Spam Filters

Posted by michael on Sunday August 10, 2003 @08:02PM from the spam-i-am dept.

kreide writes "Spam e-mail has become an ever-increasing problem, and these days it is next to impossible to use e-mail without receiving it in large amounts. Although various techniques exist to combat the problem, spammers seemed to be winning the war - until a new, powerful weapon appeared on the scene: Bayesian filters, our last, best hope for spam-free inboxes. In this review I compare POP3-based Bayesian spam filters." We did an Ask Slashdot on this a few weeks ago.

4 of 326 comments

I changed my mind. Simpler is better. by Peter+Cooper · 2003-08-10 20:24 · Score: 5, Interesting

I have long been an advocate of Bayesian or keyword-based spam filters, but have recently been forced to change my outlook, and to argue that MULTIPLE SIMULTANEOUS solutions are the answer.

I encountered a very simple but unique anti-spam system which works entirely on the sender's address. Simply, you create a small database with the domains/addresses you want to whitelist. Then, a program screens your mail, and if the sender is not in your whitelist, it sends an e-mail BACK to the sender with a simple URL (or even an actual link for HTML e-mail clients) that they visit to confirm that they REALLY want to send the e-mail to its destination. When this is done, they are added to the whitelist. Therefore, mails from forged remote addresses are no longer a problem, and neither are mails from trusted sources. And, better than SPEWS or similar blacklists, the sender gets a SECOND CHANCE to send their mail to you.
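
A minimal sketch of the whitelist-plus-confirmation idea described above. This is not the commercial product's implementation; the `ChallengeGate` class, its method names, and the token format are all made up for illustration, and actually sending the challenge e-mail is left out.

```python
import secrets


class ChallengeGate:
    """Hold mail from unknown senders until they confirm via a one-time token."""

    def __init__(self, whitelist=()):
        # Entries may be full addresses or bare domains (hypothetical format).
        self.whitelist = {entry.lower() for entry in whitelist}
        self.pending = {}  # token -> (sender, held message)

    def is_whitelisted(self, sender):
        sender = sender.lower()
        domain = sender.rsplit("@", 1)[-1]
        return sender in self.whitelist or domain in self.whitelist

    def receive(self, sender, message):
        """Return ('deliver', message) or ('challenge', token)."""
        if self.is_whitelisted(sender):
            return ("deliver", message)
        # The token would be embedded in the URL mailed back to the sender.
        token = secrets.token_urlsafe(16)
        self.pending[token] = (sender.lower(), message)
        return ("challenge", token)

    def confirm(self, token):
        """Called when the sender visits the confirmation URL."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return None  # unknown or already-used token
        sender, message = entry
        self.whitelist.add(sender)
        return message  # now safe to deliver
```

A forged sender never sees the challenge, so the held message simply stays in `pending`; a real sender confirms once and is not challenged again.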

There's a commercial solution using this system right now, although the URL escapes me.

Of course, one could encounter problems when ordering online, say. Droids at Amazon will not be clicking your links to make sure your order receipt got through. One could argue that you'd put things like Amazon.com in the whitelist, but what if someone used amazon.com as a spoofed e-mail domain/address? Ay, there's the rub. But if this system were tied in with a Bayesian system, it'd be pretty unbeatable. What's more, the Bayesian system would have extra training data: spam examples in the form of e-mails that were never 'approved', and legitimate examples in the form of those that were.
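
To make the training idea concrete, here is a rough sketch of a word-count naive Bayes classifier fed by challenge outcomes: approved mail is trained as 'ham', never-approved mail as 'spam'. It is a simplified textbook variant with add-one smoothing, not the algorithm used by any of the reviewed filters, and `TinyBayes` and its methods are hypothetical names.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Multinomial naive Bayes over lowercase word tokens, add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        """label is 'ham' (approved mail) or 'spam' (never approved)."""
        self.counts[label].update(self.tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total_docs = self.docs["spam"] + self.docs["ham"]
        log_score = {}
        for label in ("spam", "ham"):
            total = sum(self.counts[label].values())
            # Smoothed class prior plus smoothed per-token likelihoods.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            for tok in self.tokens(text):
                score += math.log((self.counts[label][tok] + 1) / (total + vocab + 1))
            log_score[label] = score
        # Turn the two log scores into P(spam | text), clamped to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))
```

With no training data the classifier returns 0.5 for everything, which is why the approved and unapproved piles from the challenge system are useful: they supply labels without the user sorting mail by hand.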

So, I'd be more interested in producing a homebrew system that used MULTIPLE weaker systems than one supposedly 'sure fire' method, as I feel no one method is perfect, whereas multiple systems together can approach this nirvana.
1. Re:I changed my mind. Simpler is better. by scj · 2003-08-10 22:38 · Score: 5, Interesting
  I had thought of something similar for fighting spam. Here's how I'd handle each email:
  
  If the email is from someone in my whitelist, allow the mail to go through and feed it as 'ham' to the Bayesian filter.
  
  If the sender is not in my whitelist, run the email through spam filtering software (SpamAssassin works well) to determine if it is likely to be spam.
  
  If it seems like spam, then use a challenge-response system (like TMDA) to find out if a human sent the email.
  
  If the mail doesn't seem like spam, just deliver it. If I get 3 non-spammy messages from the same person (separated by a day or more) then add them to my whitelist automatically.
  
  If someone responds to the TMDA challenge, put them in the whitelist and deliver the original email.
  
  If no one responds to the TMDA challenge after a week, feed the mail as 'spam' to the Bayesian filter.
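
A minimal sketch of the routing steps above, under some loudly stated assumptions: `Router` and its methods are hypothetical names, the spam verdict arrives as a plain `looks_spammy` flag from whatever filter is in use, and "separated by a day or more" is approximated as three distinct calendar dates. It does not call TMDA or SpamAssassin.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Router:
    whitelist: set = field(default_factory=set)
    clean_days: dict = field(default_factory=dict)  # sender -> dates of non-spammy mail
    pending: dict = field(default_factory=dict)     # sender -> [(message, challenge date)]
    ham: list = field(default_factory=list)         # training examples for the filter
    spam: list = field(default_factory=list)

    def route(self, sender, message, looks_spammy, today):
        if sender in self.whitelist:
            self.ham.append(message)  # whitelisted mail trains the filter as ham
            return "deliver"
        if looks_spammy:
            self.pending.setdefault(sender, []).append((message, today))
            return "challenge"
        days = self.clean_days.setdefault(sender, set())
        days.add(today)
        if len(days) >= 3:  # three non-spammy mails on different days
            self.whitelist.add(sender)
        return "deliver"

    def challenge_answered(self, sender):
        """Whitelist the sender and release the mail that was held."""
        self.whitelist.add(sender)
        return [msg for msg, _ in self.pending.pop(sender, [])]

    def expire(self, today, max_age=timedelta(days=7)):
        """Feed mail whose challenge went unanswered for a week as spam."""
        for sender in list(self.pending):
            held = self.pending[sender]
            self.spam.extend(m for m, sent in held if today - sent >= max_age)
            keep = [(m, sent) for m, sent in held if today - sent < max_age]
            if keep:
                self.pending[sender] = keep
            else:
                del self.pending[sender]
```

The `ham` and `spam` lists stand in for calls to the Bayesian filter's training interface, so the filter is trained only by whitelist hits and expired challenges.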
  
  In addition, I'd use a system like Sneakemail to generate random email addresses to give out to businesses I want to do business with and use to sign up to mailing lists. These email addresses would be added to my whitelist so they could send me mail without going through the challenge-response system. If they start spamming me, I put the random email I gave them on my blacklist.
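
The per-business address scheme could be sketched roughly as below. This is not Sneakemail's interface; `AliasBook`, its methods, and the random alias format are hypothetical, and the key point is only that mail is classified by the address it was sent TO.

```python
import secrets


class AliasBook:
    """Issue one random address per business and revoke it if it gets spammed."""

    def __init__(self, domain):
        self.domain = domain
        self.issued = {}        # alias address -> who it was given to
        self.blacklist = set()  # aliases that started receiving spam

    def issue(self, given_to):
        alias = f"{secrets.token_hex(6)}@{self.domain}"
        self.issued[alias] = given_to
        return alias

    def revoke(self, alias):
        self.blacklist.add(alias)

    def check(self, to_address):
        """Decide what to do with mail based on the recipient address."""
        if to_address in self.blacklist:
            return "reject"
        if to_address in self.issued:
            return "deliver"  # skips the challenge-response step
        return "filter"       # real address: normal filtering applies
```

Because each alias maps back to one business in `issued`, a revoked alias also tells you who leaked or sold the address.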
  
  This system has the following benefits:
  
  Business mail I want (like receipts and newsletters from companies I do business with) always gets through, since the Sneakemail-type address is whitelisted. This solves the problem of businesses not responding to TMDA challenges.
  
  My real email address is protected from businesses who are likely to sell it and from people farming addresses from mailing lists.
  
  Personal email that the spam filter sees as non-spam gets delivered without bothering the sender with a challenge-response system.
  
  Personal email that the filter does consider spammy still has a second chance to make it through via the challenge-response system. This should reduce false positives to only spammy emails from people who don't respond to the challenge.
  
  The Bayesian filter is automatically trained based on mails from people in my whitelist and mails from people who never respond to the challenge-response.
  
  You would still get some spam with this system (spam that your filter mistakes for non-spam personal email), but hopefully your false-positive rate would be zero. Also, you don't annoy other people much, since challenge-response messages are only sent for spam-like emails. Finally, this would be easy for end users to use. They don't have to train the spam filter, since it should train itself. The only complicated part would be generating and using the random emails that you give to businesses and mailing lists.
Re:hmm, if you really are so clever by Anonymous Coward · 2003-08-10 22:20 · Score: 5, Interesting

Very good.
Speaking from experience, I know for a fact that many of the harvesting programs (written in Perl, running on Linux, written by geeks) are very robust at deciphering most email obfuscation methods. You all sit and shake your fists, and the spamware writers are laughing their asses off.
You have the easy answer: don't obfuscate your email, don't even bother putting it on your posts.
Re:You really just don't get it by schon · 2003-08-11 02:25 · Score: 5, Interesting

spammers should love Bayesian filtering, it takes the pressure off them while allowing them to reach exactly the same number of marks with a mailing.

I'm afraid you've made the cardinal mistake of thinking that spammers follow logic.

First question: Why do people install filters on their mailboxes?

Answer: To stop spam.

Now, take a look at any interview with any spammer. You'll note that, when asked, the spammer will say "I don't send it to people who don't want it."

They'll also say "we're always coming up with ways to bypass filters."

Now, you'd think that one of these two statements must be false. However (besides the fact that spammers lie), any sociologist will tell you that the spammer actually believes he's telling the truth in each of them.

He justifies it in his mind by believing that even though someone has installed a spam filter, this person only wants to filter spam from other spammers, and that his own spam is somehow "special".

Spammers are sociopaths, and like all sociopaths, they believe the rules do not apply to them.

If spammers weren't sociopaths, and were capable of applied logic, then they'd realize that any filter (not just Bayesian) would benefit them. But then, if they weren't sociopaths, they wouldn't be spammers in the first place.