Slashdot Mirror


Working Bayesian Mail Filter

zonker writes "A real, working honest to god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training was I able to get very effective filtering out of it. From what I understand the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalant in client software as I see the Google results for it are growing."

11 of 312 comments (clear)

  1. Sure it's promising by bigberk · · Score: 4, Insightful

    And I'm going to check it out right now :) But one long standing I fear with such solutions is spammer's adapting to new environments (changing wording used, making the emails look more professional). Sure, they're dumb shits but they're still humans with brains.

    1. Re:Sure it's promising by tsg · · Score: 4, Insightful

      Any solution that requires spammers to be more clever is going to reduce the number of spammers. And that is the end goal.

      --
      People's desire to believe they are right is much stronger than their desire to be right.
  2. That Google search... by Jugalator · · Score: 4, Insightful

    Try searching for "bayesian email filter" instead of just "bayes email filter" (as in the news post). You'll get better results and more hits since Google doesn't match "*bayes*" (as one would think) when searching for "bayes", but only the actual word "bayes".

    --
    Beware: In C++, your friends can see your privates!
  3. Professional Looking Spam May Be Impossible by Bob9113 · · Score: 4, Insightful

    This may be self-regulating. Consider the Skinner box; if something is capable of perfectly emulating recognition of Chinese, then it can be said to recognize Chinese. Likewise, if a spammer becomes sufficiently skilled at writing undetectable prose, he or she will have reached a skill level at which he or she can pursue more profitable writing ventures. The margins in spam are pretty small. Those spams are being written by morons because morons are cheap.

  4. perlcc by Camel+Pilot · · Score: 3, Insightful

    I just received the November edition of the TPJ which included a fine article "perlcc & Compiling Perl Script".

    In short, the filter script could be compiled to C and built to a native binary for a variety of platforms eliminating the need for a Perl interperter.

  5. Developers missed this... by bigberk · · Score: 3, Insightful

    In my testing (over the last 30 mins) I discovered that filtering is employed when the POP3 "RETR" (retrieve entire message) command is used but no filtering is done when the equally useful "TOP" (show me the headers and X lines of the body) command is issued by a client.

    A huge advantage of also doing the filtering for the TOP command would be that mail clients such as The Bat, Pimmy, JBMail and PocoMail will let you preview all headers while leaving mail on the server (or deleting it, whatever) but without actually downloading the full message bodies.

  6. Is this intended for server, client, or both? by Rooney444 · · Score: 3, Insightful

    If this is only intended for client side use then it still doesn't address the issue of all the bandwidth that spam wastes. Wouldn't it just be a better project to help all the idiots close the open relays on their servers? Or maybe require authentication on all SMTP servers?

  7. this battle cannot be won by mboedick · · Score: 4, Insightful

    These technologies are interesting, but the problem of spam should be solved at the source. Why should we waste our time, money, CPU and drive space trying to outwit spam with clever software? As has been said before, if you filter spam at the inbox, a lot of resources have already been wasted by the time it arrives.

    Spam is anti-social behavior - a perversion of technology to make a quick buck. It's a cancer, and we should try to kill it. If you try to fight it any other way, you will constantly be playing catch-up, as the spammers have technology on their side too.

    1. Re:this battle cannot be won by shayne321 · · Score: 4, Insightful

      These technologies are interesting, but the problem of spam should be solved at the source.

      And how do you propose we solve the problem at its source? Make it illegal? They'll just find loopholes in the law and/or move to a country where it is legal. Hunt them down and murder their wife and kids in front of them then hang them from a tree? Satisfying though it may be, last I checked murder was illegal.

      Techniques like this CAN eventually solve the problem.. As others have pointed out, for someone to buy something from a spammer they have to READ the spam. If they send out 1 million spams and 500,000 read them and 20 of them buy something, they'll keep doing it. If they send out 1 million and only 500 people read it and 1 person buys something, they'll loose their source of income and have to find a new line of work.

      Also, for each obstacle we put in their way (checksum databases, open relay databases, filters, etc) it costs them more time, effort and therefore, money to send their crap - all for less income.

      Shayne

      --
      Today I didn't even have to use my AK; I got to say it was a good day -- Icecube
    2. Re:this battle cannot be won by crucini · · Score: 3, Insightful

      It's all very well to say that spam should be stopped at the source, but how do you plan to do that? Blocklists that pressure the ISP? SPEWS is pretty effective, but Verio, UUNet and Sprint are deeply committed to spam. They won't dislodge their pet spammers until they feel financial pain. Want the government to stop spam at the source? I see lots of problems with that. One of them is the creation of another eternal government responsibility like the war on drugs. They will forever need more funding for "the war on spam" because spammers are getting more clever. These federal agencies develop a symbiotic relationship with the "problems" they're trying to "solve".

      In practice, a multipronged approach will work best, combining prosecution, litigation, blocklists, content-based filtering, complaints to upstream providers and education of new users. Graham's article, in fact, shows how attempts to avoid prosecution push spammers into the arms of content-based filtering.

      I don't ask for a 100% solution to spam, because any such solution will have awful side effects.

  8. Re:product of marketrons by crucini · · Score: 3, Insightful
    You know, on this issue, you really depress me. You are clearly not of the academic nature, so your stance toward something thats probably way above your head really frustrates part of me.

    I think you may have misunderstood that comment. Since Paul Graham started talking about Bayesian filtering, there's been some tendency here to refer to all learning spam filters as Bayesian. Which results in complaints, which results in the designation "pseudo-Bayesian" for the many independently-discovered learning algorithms that don't have a theoretical underpinning.

    Put another way: if an algorithm outputs a dimensionless "score", and the author can't set an upper bound on the score, it's at most pseudo-Bayesian. If it outputs a probability that the message meets certain criteria, then it could be "true Bayesian". Additional implication: the "pseudo-Bayesian" filter may have a stack of rules in addition to its table of probabilities.

    I don't think we're splitting hairs on some deep statistical issue. I think we're groping for very rough categories in a new field of application software. If you can establish clearer categories, that might help.
    With 1 line of regex I eliminate 95% of my spam: match and throw it out.

    Graham addresses this in the article. One can identify most spam with a simple rules-based engine. That tends to make one lazy in reading the spam folder, which means false positives can languish unread. Enhancing the rules-based engine becomes an ongoing project as the volume and clerverness of spam increase. Hopefully Bayesian filtering can automate this.