Working Bayesian Mail Filter

Posted by CmdrTaco on Sunday November 3, 2002 @06:05AM from the stuff-to-play-with dept.

zonker writes "A real, working, honest-to-god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well, here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand, the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software, as I see the Google results for it are growing."

Sure it's promising by bigberk · 2002-11-03 06:12 · Score: 4, Insightful

And I'm going to check it out right now :) But one long-standing fear I have with such solutions is spammers adapting to new environments (changing the wording used, making the emails look more professional). Sure, they're dumb shits, but they're still humans with brains.
1. Re:Sure it's promising by tsg · 2002-11-03 08:46 · Score: 4, Insightful
  
  Any solution that requires spammers to be more clever is going to reduce the number of spammers. And that is the end goal.
  
  --
  People's desire to believe they are right is much stronger than their desire to be right.
2. Re:Sure it's promising by stand · 2002-11-03 18:30 · Score: 2, Insightful
  
  This is true, but remember, with computer-related ventures like a spam operation, all you need is one clever person to write the clever program that gets distributed to all the morons. This spam filter is a perfect example. I'm not clever enough to write something like that myself, but I'm certainly clever enough to download it and use it.
  
  --
  Four fifths of all our troubles in this life would disappear if we would just sit down and keep still. -C. Coolidge
That Google search... by Jugalator · 2002-11-03 06:15 · Score: 4, Insightful

Try searching for "bayesian email filter" instead of just "bayes email filter" (as in the news post). You'll get better results and more hits since Google doesn't match "*bayes*" (as one would think) when searching for "bayes", but only the actual word "bayes".

--
Beware: In C++, your friends can see your privates!
IMAP by Evil+Adrian · 2002-11-03 06:22 · Score: 2, Insightful

Does anyone know of any spam solutions for IMAP? Everything I've seen out there is POP3, but goddammit I like my IMAP folders!!! (Not to mention that the server on which my e-mail resides gets backed up nightly...)

--
evil adrian
As effective as a well trained secretary by Gribflex · 2002-11-03 06:29 · Score: 1, Insightful

As I understand it, the Bayesian mail filtering system works by:
a) you receiving mail
b) designating where it should go
c) the filter tries to understand your reasoning
d) in the future, before step a) occurs, the filter tries to interpret whether or not you want the mail based upon statistical analysis of what you have done in the past
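
The train-then-predict loop in steps a) through d) can be sketched in a few lines of Python. This is a minimal naive Bayes illustration under stated assumptions, not POPFile's actual code: the class name `TinyBayes`, the whitespace tokenizer, the bucket names, and the add-one smoothing are all choices made for the example.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive Bayes classifier (hypothetical; not POPFile's implementation)."""

    def __init__(self):
        self.word_counts = {}            # bucket -> Counter of words seen
        self.message_counts = Counter()  # bucket -> number of training messages

    def train(self, bucket, text):
        # Steps a) and b): a message arrives and the user files it in a bucket.
        self.word_counts.setdefault(bucket, Counter()).update(text.lower().split())
        self.message_counts[bucket] += 1

    def classify(self, text):
        # Step d): pick the bucket with the highest log-probability.
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_messages = sum(self.message_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            # Prior: how often the user has filed mail in this bucket.
            score = math.log(self.message_counts[bucket] / total_messages)
            bucket_total = sum(counts.values())
            for word in text.lower().split():
                # Add-one smoothing so unseen words do not zero out the product.
                score += math.log((counts[word] + 1) / (bucket_total + len(vocab)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


f = TinyBayes()
f.train("spam", "buy cheap pills now")
f.train("spam", "cheap loans buy now")
f.train("inbox", "meeting notes for monday")
f.train("inbox", "lunch on monday")
print(f.classify("cheap pills"))
print(f.classify("monday meeting"))
```

With only four training messages the result is fragile; the point is just that the filter learns from where you file things rather than from hand-written rules.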

Whereas current mail filtering techniques work by culling your mail on exact specifications (they don't try to interpret; if they don't know, they do nothing).

I quite like the idea of my mail filtering software becoming intelligent over time; however, I can see a potential for email traffic being lost using this method. The Bayesian mail filter is essentially as effective as a (hopefully well trained) secretary. When you first get your secretary, she brings you everything. Then she starts culling the most obvious junk mail. Then she would start examining the normal letters... are they important? Relevant? Is this the person who should be dealing with it?

After time, you have your secretary very well trained, and she culls out everything which is not of immediate importance. In real life, this leads to the following problems:

a) you receive mail from an unknown source which could be important (some guy's discovered a new way to _________) but who isn't credible by your standards. His mail gets tossed aside, or redirected to someone else who probably doesn't care.

b) you receive mail from a trusted source at a bad address, e.g. your son is in Zimbabwe (sp?) on vacation. He sends you a letter postmarked from Zimbabwe, on museum letterhead (couldn't find anything else handy). Knowing that you do not have dealings in Zimbabwe, and that this is most likely someone asking for charity, your secretary trashes it.

We've all heard stories of the first example, and it's not too hard to imagine the second. My worry is that, just like a good secretary, my mail filtering software will begin to filter for me. I will lose some control and, for the convenience of not having to hit the delete key a few extra times, I may miss potentially important email.

Chance is never a good thing to bring into your business.
1. Re:As effective as a well trained secretary by bmwm3nut · 2002-11-03 06:48 · Score: 2, Insightful
  
  but, unlike your secretary not showing you things, you can just set up the filter to put the spam in a spam folder. you can then periodically look at it and see if there are any false positives. or you can tell the filter to delete things that are 95% spam, but put things that are still most likely spam in a special folder. that's what's great about learning algorithms, they can always adapt to what you want (if you teach them enough).
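
The two-tier policy described in this reply (delete at 95%, quarantine anything still more likely spam than not) might look roughly like the sketch below. The function name `route`, the folder names, and the 0.5 cutoff for "most likely spam" are assumptions for illustration, not settings from any particular filter.

```python
def route(spam_probability, delete_above=0.95, quarantine_above=0.5):
    """Return 'delete', 'spam-folder', or 'inbox' for a message.

    Both thresholds are illustrative assumptions, not values from a real filter.
    """
    if spam_probability >= delete_above:
        return "delete"
    if spam_probability >= quarantine_above:
        return "spam-folder"
    return "inbox"


for p in (0.99, 0.80, 0.10):
    print(p, route(p))
```

Raising `delete_above` trades a fuller spam folder for fewer irrecoverable false positives.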
Not integrated solution by unfortunateson · 2002-11-03 06:32 · Score: 2, Insightful

What will make this thing work is if it is integrated with the e-mail client.

With this tool, you unfortunately have to manually add a message of a certain classification (work, pr0n, spam, family...) to the program through the Perl script -- very awkward.

A tool like this needs to run as a daemon and 'notice' when a message is added to a folder. Unfortunately, with different formats for e-mail folders, it's a much tougher job.

As it stands, with something like Outlook, I'd have to export each message individually, then run the Perl script. I can probably add a macro to do that (with its own pains -- you add a VBA macro to Outlook and it gripes every time you start up), and possibly even one that responds to filing in a folder.... hmm... maybe I will try this out.

--
Design for Use, not Construction!
Professional Looking Spam May Be Impossible by Bob9113 · 2002-11-03 06:44 · Score: 4, Insightful

This may be self-regulating. Consider the Chinese room: if something is capable of perfectly emulating recognition of Chinese, then it can be said to recognize Chinese. Likewise, if a spammer becomes sufficiently skilled at writing undetectable prose, he or she will have reached a skill level at which he or she can pursue more profitable writing ventures. The margins in spam are pretty small. Those spams are being written by morons because morons are cheap.

--
Stop-Prism.org: Opt Out of Surveillance
perlcc by Camel+Pilot · 2002-11-03 06:53 · Score: 3, Insightful

I just received the November edition of the TPJ which included a fine article "perlcc & Compiling Perl Script".

In short, the filter script could be compiled to C and built to a native binary for a variety of platforms, eliminating the need for a Perl interpreter.
Developers missed this... by bigberk · 2002-11-03 07:15 · Score: 3, Insightful

In my testing (over the last 30 mins) I discovered that filtering is employed when the POP3 "RETR" (retrieve entire message) command is used but no filtering is done when the equally useful "TOP" (show me the headers and X lines of the body) command is issued by a client.

A huge advantage of also doing the filtering for the TOP command would be that mail clients such as The Bat, Pimmy, JBMail and PocoMail will let you preview all headers while leaving mail on the server (or deleting it, whatever) but without actually downloading the full message bodies.
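
As a rough sketch of the idea, a filtering POP3 proxy could decide per command whether the server's reply carries message text worth classifying. The helper name `carries_message_content` is hypothetical and this is not POPFile's code; the command shapes (`RETR msg`, `TOP msg n`) follow RFC 1939.

```python
def carries_message_content(command_line):
    """Return True if the reply to this POP3 command contains message text.

    RETR returns the whole message; TOP returns the headers plus the first
    n lines of the body (RFC 1939). A filtering proxy could classify both.
    """
    parts = command_line.strip().split()
    if not parts:
        return False
    verb = parts[0].upper()  # POP3 keywords are case-insensitive
    if verb == "RETR":
        return len(parts) == 2 and parts[1].isdigit()
    if verb == "TOP":
        return len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit()
    return False


for line in ("RETR 1", "top 3 20", "LIST", "TOP 3"):
    print(repr(line), carries_message_content(line))
```

One caveat with this approach: a TOP reply may include little or none of the body, so a classification based on it would presumably lean on the headers and could be less reliable than one based on RETR.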
Is this intended for server, client, or both? by Rooney444 · 2002-11-03 07:29 · Score: 3, Insightful

If this is only intended for client side use then it still doesn't address the issue of all the bandwidth that spam wastes. Wouldn't it just be a better project to help all the idiots close the open relays on their servers? Or maybe require authentication on all SMTP servers?
Image-based spam by Anonymous Coward · 2002-11-03 07:43 · Score: 1, Insightful

Why wouldn't spammers do something like this to circumvent the filter (i.e. simple image-based spam with text that doesn't raise any alarms):

Content-Type: multipart/related;
type="multipart/alternative";
boundary="----=_NextPart...."

This is a multi-part message in MIME format.

------=_NextPart_....
Content-Type: multipart/alternative;
boundary="----=_NextPart_...."

------=_NextPart_....
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Hi

------=_NextPart_....
Content-Type: image/jpeg;
name="Spam goes here.jpg"
Content-Transfer-Encoding: base64
Content-ID: /9j/4AAQSkZJRgABAgEBLAEsAAD/7QlMUGhvdG9zaG9wIDMuMA A4QklNA+0KUmVzb2x1dGlvbgAA
etc...
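
One possible counter, sketched here with Python's standard `email` package, is to treat the shape of the message itself as evidence: an image part with almost no accompanying text. The function name `looks_image_only` and the 20-character cutoff are assumptions for illustration; this is a heuristic, not a feature of POPFile or any specific filter.

```python
from email.message import EmailMessage


def looks_image_only(msg, min_text_chars=20):
    """Heuristic: True if the message carries an image but almost no text."""
    text_chars = 0
    has_image = False
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            has_image = True
        elif part.get_content_type() == "text/plain":
            text_chars += len(part.get_content().strip())
    return has_image and text_chars < min_text_chars


# Build a message shaped like the example above: "Hi" plus a JPEG attachment.
msg = EmailMessage()
msg.set_content("Hi")
msg.add_attachment(b"\xff\xd8\xff\xe0 placeholder bytes", maintype="image",
                   subtype="jpeg", filename="Spam goes here.jpg")
print(looks_image_only(msg))
```

A learning filter could feed a flag like this in as just another token next to the words and headers, so it would only count against a message if the user's own image-heavy mail is mostly spam.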
Re:what is the point then? by rgmoore · 2002-11-03 07:55 · Score: 2, Insightful

Well, there are potentially three points. One is that hopefully after a while the filter will work well enough that you can develop some real confidence in it and you won't have to check every time to see that it's working right. I'm pretty close to that point with bogofilter; I so rarely see any false positives that I can almost afford to flush the messages without checking. Actually, I assume that what I'll really do is to change the rules a bit so that alleged spam is sent to a waiting folder and doesn't even show up in my main inbox.
That gets to point two: now I'll be able to check for spam in batch mode. Instead of going through my inbox every time I look for messages, marking some as spam and reading others, I'll be able to read just about everything in my inbox without worrying about spam. Then once a week I can check my spam box and see if there's actually anything legitimate there. This is going to be faster than doing it every time a new message shows up in my inbox.
I'm not a compulsive mail reader, but for some people this would also be really useful because it would protect them from distractions. They are working on something and then their mailbox beeps them to let them know that a message has arrived. Unfortunately, when they check it out it turns out that their train of thought has been needlessly disrupted by another spam. If they can filter out the spam before the notification while still being alerted promptly when a real message shows up, that's a big win.

--
There's no point in questioning authority if you aren't going to listen to the answers.
this battle cannot be won by mboedick · 2002-11-03 08:13 · Score: 4, Insightful

These technologies are interesting, but the problem of spam should be solved at the source. Why should we waste our time, money, CPU and drive space trying to outwit spam with clever software? As has been said before, if you filter spam at the inbox, a lot of resources have already been wasted by the time it arrives.

Spam is anti-social behavior - a perversion of technology to make a quick buck. It's a cancer, and we should try to kill it. If you try to fight it any other way, you will constantly be playing catch-up, as the spammers have technology on their side too.
1. Re:this battle cannot be won by shayne321 · 2002-11-03 10:48 · Score: 4, Insightful
  
  These technologies are interesting, but the problem of spam should be solved at the source.
  And how do you propose we solve the problem at its source? Make it illegal? They'll just find loopholes in the law and/or move to a country where it is legal. Hunt them down and murder their wife and kids in front of them then hang them from a tree? Satisfying though it may be, last I checked murder was illegal.
  Techniques like this CAN eventually solve the problem. As others have pointed out, for someone to buy something from a spammer they have to READ the spam. If they send out 1 million spams and 500,000 people read them and 20 of them buy something, they'll keep doing it. If they send out 1 million and only 500 people read it and 1 person buys something, they'll lose their source of income and have to find a new line of work.
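
The arithmetic in that paragraph, worked through as a small Python sketch. The figures are the ones given above; the function name `buyers_per_message` is made up for the example, and the "factor" line assumes revenue per sale stays the same.

```python
from fractions import Fraction


def buyers_per_message(sent, buyers):
    """Exact fraction of sent messages that end in a sale."""
    return Fraction(buyers, sent)


unfiltered = buyers_per_message(1_000_000, 20)  # 500,000 readers, 20 buyers
filtered = buyers_per_message(1_000_000, 1)     # 500 readers, 1 buyer

print(f"unfiltered: {float(unfiltered):.4%} of messages lead to a sale")
print(f"filtered:   {float(filtered):.4%} of messages lead to a sale")
print(f"income falls by a factor of {unfiltered / filtered}")
```

The sending cost is the same in both cases, so the whole drop comes out of the spammer's margin.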
  Also, for each obstacle we put in their way (checksum databases, open relay databases, filters, etc) it costs them more time, effort and therefore, money to send their crap - all for less income.
  Shayne
  
  --
  Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
2. Re:this battle cannot be won by crucini · 2002-11-03 10:56 · Score: 3, Insightful
  
  It's all very well to say that spam should be stopped at the source, but how do you plan to do that? Blocklists that pressure the ISP? SPEWS is pretty effective, but Verio, UUNet and Sprint are deeply committed to spam. They won't dislodge their pet spammers until they feel financial pain. Want the government to stop spam at the source? I see lots of problems with that. One of them is the creation of another eternal government responsibility like the war on drugs. They will forever need more funding for "the war on spam" because spammers are getting more clever. These federal agencies develop a symbiotic relationship with the "problems" they're trying to "solve".
  
  In practice, a multipronged approach will work best, combining prosecution, litigation, blocklists, content-based filtering, complaints to upstream providers and education of new users. Graham's article, in fact, shows how attempts to avoid prosecution push spammers into the arms of content-based filtering.
  
  I don't ask for a 100% solution to spam, because any such solution will have awful side effects.
Re:product of marketrons by crucini · 2002-11-03 09:41 · Score: 3, Insightful

You know, on this issue, you really depress me. You are clearly not of the academic nature, so your stance toward something thats probably way above your head really frustrates part of me.

I think you may have misunderstood that comment. Since Paul Graham started talking about Bayesian filtering, there's been some tendency here to refer to all learning spam filters as Bayesian. Which results in complaints, which results in the designation "pseudo-Bayesian" for the many independently-discovered learning algorithms that don't have a theoretical underpinning.

Put another way: if an algorithm outputs a dimensionless "score", and the author can't set an upper bound on the score, it's at most pseudo-Bayesian. If it outputs a probability that the message meets certain criteria, then it could be "true Bayesian". Additional implication: the "pseudo-Bayesian" filter may have a stack of rules in addition to its table of probabilities.
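
To make the distinction concrete, here is a hedged Python sketch contrasting the two styles. `rule_score` is a made-up additive scorer with invented weights; `combined_probability` uses the combining formula from Graham's article, which assumes the per-word probabilities are independent and kept strictly between 0 and 1 (Graham clamps them to the range .01 to .99).

```python
import math


def rule_score(text, rules):
    """Additive score: dimensionless and unbounded (the 'pseudo' style)."""
    lowered = text.lower()
    return sum(weight for word, weight in rules.items() if word in lowered)


def combined_probability(word_probs):
    """Combine per-word spam probabilities into one value in [0, 1]."""
    spam = math.prod(word_probs)
    ham = math.prod(1 - p for p in word_probs)
    return spam / (spam + ham)


# Hypothetical weights and probabilities, chosen only for illustration.
print(rule_score("Buy cheap pills", {"cheap": 2.5, "pills": 4.0, "free": 1.0}))
print(combined_probability([0.99, 0.90, 0.20]))
```

Adding more rules can push the first number arbitrarily high, while the second stays a probability no matter how many words are combined.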

I don't think we're splitting hairs on some deep statistical issue. I think we're groping for very rough categories in a new field of application software. If you can establish clearer categories, that might help.

With 1 line of regex I eliminate 95% of my spam: match and throw it out.

Graham addresses this in the article. One can identify most spam with a simple rules-based engine. That tends to make one lazy in reading the spam folder, which means false positives can languish unread. Enhancing the rules-based engine becomes an ongoing project as the volume and cleverness of spam increase. Hopefully Bayesian filtering can automate this.