Slashdot Mirror


Comparison of Bayesian POP3 Spam Filters

kreide writes "Spam e-mail has become an ever increasing problem, and these days it is next to impossible to use e-mail without receiving it in large amounts. Although various techniques exist to combat the problem, spammers seemed to be winning the war - until a new, powerful weapon appeared on the scene: Bayesian filters, our last, best hope for spam-free inboxes. In this review I compare POP3-based Bayesian spam filters." We did an Ask Slashdot on this a few weeks ago.

32 of 326 comments

  1. Only useful to a point by KU_Fletch · · Score: 4, Interesting

    I love spam protection programs. I've been using them for years, but have to switch every couple of months because of the friggen spammers. The people that make the spamming software don't just sit around cackling about how evil they are. They reverse engineer every anti-spam protection out there in an attempt to get around it. While this seems like a good idea (and I will be playing around with these two programs for a while), it's unfortunately only good up to the point when spammers figure a way around it.

    I wish the government would somehow make the practice illegal, but I doubt they'll ever get anything to stick. The far better option at this point is to have a class action suit of server owners (who provide mail accounts) against developers of spamming software and spammers. I've gotten enough warnings from my university to know that bandwidth costs money. Sending millions of spams a year into any one e-mail server can account for a serious chunk of the bandwidth used, at significant cost to the provider. It won't stop spam altogether, but it will bankrupt anybody that has been doing it.

    --
    It's not stupid. It's advanced.
    1. Re:Only useful to a point by Steve+B · · Score: 2, Interesting
      They reverse engineer every anti-spam protection out there in an attempt to get around it.

      This is why a real anti-spam legal reform would clearly equate circumvention of an anti-spam filter with circumvention of a password prompt. Both are attempts to crack into someone else's computer without permission -- indeed, against an express prohibition -- and the former ought to carry the same penalties as the latter.

      --
      /. If the government wants us to respect the law, it should set a better example.
  2. Filtering by rf0 · · Score: 3, Interesting

    Given that I get 100+ spams a day, I've found that it's a good idea to at least use tagging. For example, when posting on Usenet I use usenet@domain.com, with something in my sig saying my actual email is me at domain dot com. Anything sent to usenet is automatically deleted. Doesn't stop the flow by any means but at least I can track where the spam came from.

    If you are feeling clever you can even use addresses that expire after a week. So something like epochseconds@domain.com
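
A minimal sketch of the expiring-address idea, assuming the local part is simply the creation time in epoch seconds and that addresses live for one week. The function names, the `DOMAIN` constant and the lifetime are illustrative assumptions, not anything from the post.

```python
import time

# Hypothetical settings: the domain and the one-week lifetime are assumptions.
DOMAIN = "domain.com"
LIFETIME_SECONDS = 7 * 24 * 60 * 60


def make_expiring_address(now=None):
    """Return an address whose local part is the creation time in epoch seconds."""
    created = int(time.time() if now is None else now)
    return f"{created}@{DOMAIN}"


def is_still_valid(address, now=None):
    """Accept mail only if the address parses and is at most a week old."""
    now = time.time() if now is None else now
    local, _, domain = address.partition("@")
    if domain.lower() != DOMAIN:
        return False
    # Only plain ASCII digits count as a timestamp.
    if not (local.isascii() and local.isdigit()):
        return False
    age = now - int(local)
    return 0 <= age <= LIFETIME_SECONDS
```

A bare timestamp is guessable, so a real setup would probably want something harder to forge in the local part; this only shows the expiry check.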

    Just my 0.02p

    Rus

  3. I changed my mind. Simpler is better. by Peter+Cooper · · Score: 5, Interesting

    I have long been an advocate of Bayesian or keyword based spam filters, but have recently been forced to change my outlook, and to argue that MULTIPLE SIMULTANEOUS solutions are the answer.

    I encountered a very simple but unique spam system which works entirely on the sender's address. Simply, you create a small database with the domains/addresses you want to whitelist. Then, a program screens your mail, and if the sender is not in your whitelist, it sends an e-mail BACK to the sender with a simple URL (or even an actual link for HTML e-mail clients) which states that they REALLY want to send the e-mail to its destination. When this is done, they are added to the whitelist. Therefore, mails from forged remote addresses are no longer a problem, and neither are mails from trusted sources. And, better than SPEWS or similar blacklists, the sender gets a SECOND CHANCE to send their mail to you.

    There's a commercial solution using this system right now, although the URL escapes me.

    Of course, one could encounter problems when ordering online, say. Droids at Amazon will not be clicking your links to make sure your order receipt got through. One could argue that you'd put things like Amazon.com in the whitelist, but what if someone used amazon.com as a spoofed e-mail domain/address? Ay, there's the rub. But if this system were tied in with a Bayesian system, it'd be pretty unbeatable. What's more the Bayesian system would have extra data for negative matches, in the form of e-mails that were never 'approved', and positive data in the form of those that were.

    So, I'd be more interested in producing a homebrew system that used MULTIPLE weaker systems, than one supposed 'sure fire' method.. as I feel no one method is perfect, whereas multiple systems can approach this nirvana.

    1. Re:I changed my mind. Simpler is better. by ctr2sprt · · Score: 4, Interesting
      Any approach that triggers an automatic action on your behalf is bad, because it can be turned against you. It's not likely that email would make a terribly good DDoS service, but a system like the one you describe would certainly be vulnerable to it. And I think it would only last a week, at most, before spammers figured out a way around it. They can already handle "NOSPAM" being inserted in email addresses, and recently added the ability to reverse and combine email addresses until they get something plausible.

      I do agree with you that we need multiple layers of safeguards in order to solve spam - or at least to hide it away so nobody has to look at it - but I don't think your specific example is very good.

    2. Re:I changed my mind. Simpler is better. by scj · · Score: 5, Interesting
      I had thought of something similar for fighting spam. Here's how I'd handle each email:
      1. If the email is from someone in my whitelist, allow the mail to go through and feed it as 'ham' to the Bayesian filter.
      2. If the email is not in my whitelist, run it through spam filtering software (Spamassassin works well) to determine if it is likely to be spam.
      3. If it seems like spam, then use a challenge-response system (like TMDA) to find out if a human sent the email.
      4. If the mail doesn't seem like spam, just deliver it. If I get 3 non-spammy messages from the same person (separated by a day or more) then add them to my whitelist automatically.
      5. If someone responds to the TMDA challenge, put them in the whitelist and deliver the original email.
      6. If no one responds to the TMDA challenge after a week, feed the mail as 'spam' to the Bayesian filter.
      In addition, I'd use a system like Sneakemail to generate random email addresses to give out to businesses I want to do business with and use to sign up to mailing lists. These email addresses would be added to my whitelist so they could send me mail without going through the challenge-response system. If they start spamming me, I put the random email I gave them on my blacklist.

      This system has the following benefits:
      • Business mail I want (like receipts and newsletters from companies I do business with) always gets through, since the Sneakemail-type address is whitelisted. This solves the problem of businesses not responding to TMDA challenges.
      • My real email address is protected from businesses who are likely to sell it and from people farming addresses from mailing lists.
      • Personal email that the spam filter sees as non-spam gets delivered without bothering the sender with a challenge-response system.
      • Personal email that does seem spammy to the filter still has a second chance to make it through via the challenge-response system. This should reduce false positives to include only spammy emails from people who don't respond to the challenge.
      • The Bayesian filter is automatically trained based on mails from people in my whitelist and mails from people who never respond to the challenge-response.
      You would still get spam with this system (personal email that your filter thinks is non-spam), but hopefully your false-positive rate would be zero. Also, you don't annoy other people much by only sending challenge-response messages to spam-like emails. Finally, this would be easy for end users to use. They don't have to train the spam filter, since it should train itself. The only complicated part would be generating and using the random emails that you give to businesses and mailing lists.
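
The six steps above can be sketched roughly as follows. This is an illustration under assumptions, not TMDA or SpamAssassin code: `MailPolicy`, its method names and the `looks_spammy` callable are hypothetical, "separated by a day or more" is approximated by counting distinct day numbers, and the Sneakemail-style addresses are left out.

```python
from dataclasses import dataclass, field


@dataclass
class MailPolicy:
    whitelist: set = field(default_factory=set)
    clean_days: dict = field(default_factory=dict)  # sender -> days with non-spammy mail
    pending: dict = field(default_factory=dict)     # sender -> [(message, day challenged)]
    ham: list = field(default_factory=list)         # training data for the Bayesian filter
    spam: list = field(default_factory=list)

    def handle(self, sender, message, looks_spammy, day):
        # Step 1: whitelisted senders are delivered and used as ham.
        if sender in self.whitelist:
            self.ham.append(message)
            return "deliver"
        # Steps 2-3: spammy-looking mail is held and the sender is challenged.
        if looks_spammy(message):
            self.pending.setdefault(sender, []).append((message, day))
            return "challenge"
        # Step 4: clean mail is delivered; three clean days earn a whitelist entry.
        days = self.clean_days.setdefault(sender, set())
        days.add(day)
        if len(days) >= 3:
            self.whitelist.add(sender)
        return "deliver"

    def challenge_answered(self, sender):
        # Step 5: a human answered, so whitelist them and release the held mail.
        self.whitelist.add(sender)
        return [message for message, _ in self.pending.pop(sender, [])]

    def expire_challenges(self, today, max_age_days=7):
        # Step 6: challenges unanswered for a week become spam training data.
        for sender in list(self.pending):
            kept = []
            for message, day in self.pending[sender]:
                if today - day >= max_age_days:
                    self.spam.append(message)
                else:
                    kept.append((message, day))
            if kept:
                self.pending[sender] = kept
            else:
                del self.pending[sender]
```

Feeding `ham` and `spam` into whatever Bayesian filter is in use would cover the self-training part of the description.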
  4. Re:great by Tirel · · Score: 2, Interesting

    Ideally, I think the client should take care of the filtering. Pour your resources into improving context-based filtering and let the individual clients do the dumping. Widespread usage of this kind of filtering could make spam even less profitable. Since spam is entirely business related, it would likely reduce the amount of it passing through the network.

    From a sysadmin's POV, this doesn't halt the issue of spam eating bandwidth or disk space. I'll address that next.

    Disk space depends on what kind of e-mail your organization uses. For POP3, most people delete e-mail on the server after it's downloaded, so while the disk space may be consumed with spam, it would be temporary. That is, unless you have a lot of dead or rarely used accounts. In that case, you should have policies in place for when to wipe users' accounts out after a set period of time. Or set up some kind of forwarding policy. If you're using something like IMAP, then using a server-wide content filtering system as mentioned above would be effective.

    For bandwidth, the only way to halt spam from consuming your bandwidth is by blocking packets at the router. If you use SPEWS to dump the e-mail at your e-mail server, it has still consumed your bandwidth. So you'd have to block the packets directly. I think this is draconian and should be avoided, for the net's sake. Unfortunately there really is no good solution to this, for as long as spam flows, it flows and consumes bandwidth. The only way to halt it is to halt the initial spamming to begin with. As mentioned above, when your spammer's audience never exists as a result of good content filtering, the spam will be unprofitable and lessen somewhat.

    Attacking users and their ISPs won't do much good, aside from causing spammers to jump from ISP to ISP, something they're readily willing to do. Attacking regular users just makes you a big jerk.

  5. Re:What about features other than text? by Gaza · · Score: 3, Interesting

    Yes it does. The developers have created a test suite and a very extensive tokenizer. Any additional pseudowords, or new ideas for tokenizing a message, are tested very thoroughly before they are added (as most tend to actually lower accuracy instead of raising it). There have even been tests using SpamBayes on just headers and just message bodies, and both have worked very well.

  6. Re:Spamprobe by opk · · Score: 3, Interesting

    I'll second this. Have been using spamprobe since December. It took longer than a month before it was fully trained. These days it's very good. And the best thing (except once when someone quoted the full body of a spam when complaining about spams on a mailing list): It has never given me a false positive.

  7. A new poll is required by mirko · · Score: 4, Interesting
    How should spammers be dealt with ?
    • Ban their original networks
    • Throw them in jail
    • Kill them
    • Fine them 0.01$/email and improve third world infrastructures with the money.
    • Filter/Ignore them.


    I'd personally go for the last option... Maybe the next-to-last if their suit takes place in a really democratic place (there are 278 million American citizens and 2.2 million of them are in jail; this is a *lot*).
    --
    Trolling using another account since 2005.
    1. Re:A new poll is required by anubi · · Score: 2, Interesting
      I like your last option best, too. I hate to suppress anyone's right to say whatever they want to, but then I want to reserve my right to what I choose to pay attention to.

      Under the existing technology, a spammer is like the royal pest on a city bus which takes advantage of the captive audience. The analogy here is that we have to download our POP box, we have no way of arranging our affairs to where the signals exist, but we deliberately choose not to tap into them.

      I believe the technology must change. I am loath to try to settle what I consider a technological issue by passing some sort of law... doing this just makes immense profits for litigators, but does little to solve the underlying problems.

      If the technology could change so that ISPs could provide individual Bayesian-type filters at the server level, screening messages against criteria that each individual sets, this could let the ISP off the hook for dropping messages, as well as for having to supply any long-term storage for them... Somehow I get the idea that spammed messages are going to be very similar and should show a very marked correlation to the same spam sent to other accounts at that ISP. Once a significant number of accounts' filters have flagged a particular mailing as spam, the ISP would have the opportunity to store only ONE copy of the spam, while putting only pointers to it in the subscribers' mailboxes.

      So, what I would think would solve this is if the internet became more like radio transmissions. I support the idea that anybody can transmit whatever they want to the public, and if anyone wants to listen in, fine. But, like RF, it has to make it through the filters before it gets to the listener. The damn-near infinite advantage to the net-based paradigm is we have an almost infinite bandwidth in the notion that anyone can set up his transmitter and not step on someone else's signal. ( i.e, there's only so many "channels" in the AM, FM, or TV broadcast bands, whereas the internet does not have this limitation. ).

      Anyway, that's my two cents worth.

      --
      "Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]

  8. Re:You just don't get it by Anonynmous+Cow · · Score: 4, Interesting

    Speaking of filtering for others... I don't - but I do run my own little mail server.

    Even after implementing all the postfix uce rules and adding in the RBL's - and using spamassassin... I still saw some spam slipping in...

    So I hacked together a tiny little perl script that monitors my mail log... after any IP address gets more than 3 "554" messages (generated by the RBL's) the source IP gets a lovely little teergrube.

    I waste their resources and prevent them from trying to deliver any other shit that might get through spamassassin...

    The script can be found here, but it is only good for postfix/linux/iptables people.
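
For readers without the Perl script, here is a rough Python sketch of just the counting part. The log line shape matched by `REJECT_RE` is an assumption about how Postfix reports RBL rejections, and `offenders` is a hypothetical name; the teergrube/iptables step is deliberately left out.

```python
import re
from collections import Counter

# Assumed shape of a Postfix reject line, e.g.
#   ... postfix/smtpd[101]: NOQUEUE: reject: RCPT from unknown[192.0.2.7]: 554 ...
REJECT_RE = re.compile(r"reject: \w+ from \S*\[(?P<ip>[0-9a-fA-F.:]+)\]: 554 ")


def offenders(log_lines, threshold=3):
    """Return the IPs that collected more than `threshold` 554 rejections."""
    counts = Counter()
    for line in log_lines:
        match = REJECT_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return sorted(ip for ip, hits in counts.items() if hits > threshold)
```

Each returned IP would then be handed to whatever tarpit or firewall rule is in use.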

  9. Spammers will just use HTML with images.. by Anonymous Coward · · Score: 1, Interesting

    How do Bayesian filters solve the problem of pure-image spams? I.e., HTML mails that contain nothing but an IMG tag. I only see collaborative filters solving this problem - SPAMfighter would be an example of this.

  10. Mozilla - filters on client not server by Zog+The+Undeniable · · Score: 3, Interesting

    Moz's Bayesian filtering works well, but its Achilles heel is that it doesn't work on the POP3 server, so you still have to download everything. As POP3 allows the header and the first part of the message body to be read without downloading it, surely there could be an option - once Moz has been trained and you're fairly sure the false positive rate is negligible - for filters to operate on the server and delete spam from there?
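
A sketch of what such an option might do, using the POP3 TOP command through an object shaped like Python's `poplib.POP3` (`stat`, `top`, `dele`). The `is_spam` classifier and the `purge_spam_on_server` name are hypothetical, and this is not how Mozilla works; TOP is also an optional POP3 command, so some servers may not offer it.

```python
def purge_spam_on_server(conn, is_spam, body_lines=20):
    """Mark for deletion every message whose headers and first lines look like spam.

    `conn` is assumed to behave like a logged-in poplib.POP3 connection;
    `is_spam` is an assumed, already trained classifier taking a text preview.
    """
    count, _mailbox_size = conn.stat()
    deleted = []
    for number in range(1, count + 1):
        # TOP returns the headers plus the first `body_lines` lines of the body.
        _response, lines, _octets = conn.top(number, body_lines)
        preview = b"\n".join(lines).decode("utf-8", errors="replace")
        if is_spam(preview):
            conn.dele(number)
            deleted.append(number)
    return deleted
```

With a real `poplib.POP3` connection the deletions only take effect once `quit()` is called, which leaves room to back out if the result looks wrong.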

    --
    When I am king, you will be first against the wall.
    1. Re:Mozilla - filters on client not server by letxa2000 · · Score: 2, Interesting
      You have pretty much described PrismEmail. It, among other things, does Bayesian filtering. It's server-based so you don't have to download the spam. It's user-specific so you have your own Bayesian corpus that applies only to you, not server-wide. You can inspect blocked email on the server at any time or wait for a single spam report each night to see a list of all email blocked--a quick click will then release any message that was misclassified. And you can just click on a link in the headers of a message if it was spam and it got through.

      Really, all the people that think that server-side Bayesian filtering is impossible are confused. No, you can't have a single corpus that applies to everyone on the server--that defeats the purpose of Bayesian. But you definitely can do the user-specific filtering on the server. Let the server do the work, you only download the good stuff, and there's nothing to install locally.

  11. Re:Authentication of senders by frovingslosh · · Score: 3, Interesting
    The only thing that can truly save email is to switch to a service that requires authentication of senders.

    I agree with everything that you said about filters being ineffective. But I strongly disagree with your "only thing" statement. Particularly if you mean it as any of the systems I've ever heard about, such as "If it's not in the address book, the sender must acknowledge a challenge message" type of approaches. The problem with such systems is that many of us get quite a bit of e-mail each day from people who are not in our regular address books, some of it quite important to us. We do not want that mail lost because the system at the other end was not in our address book and did not waste their time responding to a challenge and response type system. For example, say I purchased something on-line from a vendor I had never dealt with before. Their e-mail system may automatically kick out an e-mail that informs me the product was shipped and give me an important Fed-ex or UPS tracking number. I'm glad they do such things with their shipping systems, and I don't expect them to manually respond to every challenge they get back; realistically they will send any such challenges to the bit bucket and people who want e-mail that is important to them will end up never getting it.

    So I do not believe that authentication of senders, at least in any of the traditionally suggested ways, is the correct approach. Much of the spam problem we have is due to what I consider flaws in SMTP. I would very much like to see a replacement for SMTP that considered the spam problems (as well as other problems inherent in SMTP). As an example, another post here mentioned a system where the mail is held, not on your ISP or upstream provider's system until you download it, but rather is held on the sender's or sender's ISP's system. The recipient would presumably receive only a very short indicator of where they have mail waiting, and would fetch it themselves when they are ready to receive it. This puts the burden of storage on the sender or the service provider for the sender, and avoids considerable bandwidth wasted by senders who supposedly send out e-mail with addresses generated to match all combinations of up to x characters (the excuse Mindspring gave to me when addresses that I created but never gave out or used started getting spam, not that I believe them). In addition to putting this burden on the sender, it would ensure that there was a good address in the e-mail to fetch the mail from, so spammers would have a much harder time injecting their spam into the system and would be much more traceable. And while I'm not foolish enough to think that laws could completely stop spam, we've seen how laws did drastically curtail fax spam, and some fax spammers have recently been made to pay serious fines. I do think laws would have a big effect on spammers; there are a lot of spammers who just don't want to have to move out of the country to keep up spamming, and those of us who hate spam will track the spam back to US sources if we have a law with teeth in it to impose fines (or worse) on them when we do.

    Of course, any change to or replacement of SMTP must be phased in over time. It's not a short term solution to spam. But I expect SMTP would quickly go the way of gopher or archie or the rest if a viable new protocol was presented that addressed these problems effectively, and this is where I think our greatest chances for success are.

    --
    I'm an American. I love this country and the freedoms that we used to have.
  12. Re:YFI list by aduxorth · · Score: 2, Interesting

    Another good one is if the domain from the envelope sender doesn't have an MX record. Bam, guaranteed spam. The other one is to verify the sender, not just the domain. This kills all those spams from lkiqprejbn@yahoo.com which are obviously bulldust.

    That alone kills off about 70% (IMO) of the spam that comes through servers that I administer, and as far as I know, only 2 emails (over the last 4 years or so) that weren't meant to be rejected were rejected because they had invalid sender envelopes.

    HTH
    cya
    Andrew

  13. Something he misses about popfile. by CGP314 · · Score: 4, Interesting

    One of the things I love about popfile is that it is not a Spam filter. It is a general mail filter. I have about ten categories of mail that it sorts out for me. This also helps cut out false positives. 'Work', 'Personal' and 'Friends' are all much more similar to each other than to 'Spam'.

  14. Re:Missing the point? by Ingolfke · · Score: 2, Interesting

    Bulk emailing, like any business is a numbers game. By significantly decreasing the # of successful responses to a set of SPAM (through filters) the business costs remain the same w/ the returns dropping. Eventually the business is no longer feasible.

    [INCREASE TONE]
    SPAM absolutely does not need to be fought in the courts when the markets can work this out on their own (as we see w/ these filters). In the end we'll have better technology for sorting and filtering emails which can be applied to other applications and the spammers will be gone or significantly reduced.

    [BREATHE... BREATHE...]
    Legislation would only be valid in the country in which the legislation was enacted so spammers could simply move their operations to a SPAM friendly country.

    [GRADUALLY INCREASE TONE]
    Also, what constitutes spam? What if I only send 10,000 emails out? What if I change the email each time I send it so it's unique to you? What if I'm not selling anything? What if someone compromised my system and sent all the emails from my PC? Why shouldn't ISPs be liable too... yeah, why are they letting people send those SPAMs... let's sue them too... somebody get a rope!!

    [BEGIN ALL OUT RANT!]
    So the moral of the story is... everyone remain calm... keep working on your filters and other new technologies... and soon we'll have fewer spammers and better tech and some intelligent hacker out there will have a whole heap load of cash for coming up w/ the solution.

    Of course w/ all of the existing hideous legislation we have today... SCO may announce that they are diversifying into bulk emailing and that they have a patent on any spam filtering algorithms and therefore if you ever remove any of their emails you must send them a $699 licensing fee for the use of their IP.

  15. Eh... by hendrix69 · · Score: 2, Interesting

    POPfile really got shortchanged by this review. It serves as much more than a spam filter. I thought I'd give SpamBayes a try anyway, but the Outlook plugin won't install on my XP machine. Some problem with an unresolved dependency in shlwapi.dll... boring. The point is, the SpamBayes site doesn't have a tech support forum where I can ask for help with this kind of problem.

    --
    The power of Christ compiles you!
  16. Re:hmm, if you really are so clever by Anonymous Coward · · Score: 5, Interesting
    Very good.

    Speaking from experience, I know for a fact that many of the harvesting programs (written in perl, running on linux, written by geeks) are very robust at deciphering most email obfuscation methods. You all sit and shake your fists, and the spamware writers are laughing their asses off.

    You have the easy answer: don't obfuscate your email, don't even bother putting it on your posts.

  17. Re:Nitpick... by spongman · · Score: 4, Interesting

    Actually SpamBayes isn't bayesian at all. It uses a chi^2-based algorithm which was shown in (the extensive spambayes team's) tests to be superior to regular bayesian filtering.
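
For the curious, a minimal sketch of Fisher-style chi-squared combining of per-word spam probabilities, which is the general technique the name refers to. This is an illustration under assumptions, not code taken from SpamBayes: the function names, the clamping constant and the final `(S - H + 1) / 2` scoring are my own choices.

```python
import math


def chi2q(x2, dof):
    """Probability that a chi-squared variable with even `dof` exceeds `x2`."""
    assert dof > 0 and dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(word_probs):
    """Combine per-word spam probabilities into one score between 0 and 1."""
    if not word_probs:
        return 0.5  # no evidence either way
    # Clamp so that log() never sees exactly 0 (the bound is an arbitrary choice).
    probs = [min(max(p, 1e-6), 1.0 - 1e-6) for p in word_probs]
    n = len(probs)
    ln_s = sum(math.log(1.0 - p) for p in probs)
    ln_h = sum(math.log(p) for p in probs)
    spam_evidence = 1.0 - chi2q(-2.0 * ln_s, 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * ln_h, 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0
```

Strong evidence on both sides cancels out and lands near 0.5, which is one natural place for an "unsure" bucket to come from.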

  18. Re:You just don't get it by drix · · Score: 1, Interesting

    You just don't get the whole concept of Bayesian spam filtering. It works on a personal basis; don't forget that, statistically speaking, one man's spam is another man's legitimate personal e-mail. For example, if you send and receive a disproportionately large amount of messages containing cock jokes and talking about tits and sex (which, being a 20-year-old male, I can tell you is about 80% of my friends), under a "typical" or system-wide Bayesian filter that might be installed by some ISP, you're almost certainly going to lose a lot of messages that weren't spam. Which is the worst-case scenario for a spam filter. What's worse, the ISP would have to employ some sort of "spam czar" to monitor (people's private) incoming e-mail and make judgement calls as to what is and is not spam. That's a call I want to make, not one that I want made for me.

    The best way to eliminate spam, to me, is a two-part system whereby the ISP (via procmail, etc.) eliminates all mail that is definitely spam, and then passes along anything questionable to the user. Bayesian filtering should be implemented in the client, which, thankfully, is becoming more and more common. ISPs should think about bundling clients that already support Bayesian filtering, enabling it by default, explaining in very clear terms how to use it, etc., but that's about all they can do.

    --

    I think there is a world market for maybe five personal web logs.
  19. simplest solution... by Lumpy · · Score: 1, Interesting

    $0.04US charge for every Email SENT. College accounts can get refunded costs by delivering a sent mail list.

    This will stop spamming quick... or at least make it slow way down.
    1,000,000 spams = $40,000.00US, more than the entire net worth of the most successful spammer.

    --
    Do not look at laser with remaining good eye.
  20. Re:Bayesian filters are useful, but... by AndroidCat · · Score: 2, Interesting

    Spammers love to use open proxies to hide, and are now engaged not only in scans to find them, but also in campaigns to create them, using trojans and worms like SoBig. While each offense is small, it's on a scale large enough to have them behind bars for quite a while.

    --
    One line blog. I hear that they're called Twitters now.
  21. POPFile is more than just a spam tool by rediguana · · Score: 4, Interesting

    POPFile's utility does not lie just in managing the spam menace. To me, the real utility in POPFile is the ability to create x number of buckets and train it to sort your mail. SpamBayes looks great for spam but has no further utility. I like having POPFile sort my work from personal emails, and file all my mailing lists in another, and even jokes. Of course there is the spam folder that I check every now and then. I look forward to it being able to support IMAP servers as well.
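
To make the "buckets" idea concrete, here is a toy multi-bucket naive Bayes classifier with add-one smoothing. It is a sketch of the general technique only; `BucketClassifier`, the whitespace tokenizer and the smoothing are assumptions and say nothing about POPFile's internals.

```python
import math
from collections import Counter, defaultdict


class BucketClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages trained

    @staticmethod
    def tokenize(text):
        return text.lower().split()

    def train(self, bucket, text):
        self.word_counts[bucket].update(self.tokenize(text))
        self.message_counts[bucket] += 1

    def classify(self, text):
        """Return the most likely bucket, or None if nothing has been trained."""
        vocabulary = {w for counts in self.word_counts.values() for w in counts}
        total_messages = sum(self.message_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            # Log prior from how much mail each bucket has seen.
            score = math.log(self.message_counts[bucket] / total_messages)
            denominator = sum(counts.values()) + len(vocabulary)
            for word in self.tokenize(text):
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket
```

Adding a new category is just a matter of training on a new bucket name; spam is simply one bucket among several.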

  22. Re:You really just don't get it by schon · · Score: 5, Interesting

    spammers should love Bayesian filtering, it takes the pressure off them while allowing them to reach exactly the same number of marks with a mailing.

    I'm afraid you've made the cardinal mistake of thinking that spammers follow logic.

    First question: Why do people install filters on their mailboxes?

    Answer: To stop spam.

    Now, take a look at any interview with any spammer.. you'll note that when they're asked, the spammer will say "I don't send it to people who don't want it."

    They'll also say "we're always coming up with ways to bypass filters."

    Now, you'd think that with the two statements, that one of them is false - however (besides the fact that spammers lie), any sociologist will tell you that the spammer actually believes he's telling the truth in each of these statements..

    How he justifies it in his mind is that he believes that even though someone has installed a spam filter, that this person only wants to filter spam from other spammers - that his spam is somehow "special".

    Spammers are sociopaths, and like all sociopaths, they believe the rules do not apply to them.

    If spammers weren't sociopaths, and were capable of applied logic, then they'd realize that any filter (not just Bayesian) would benefit them.. but then, if they weren't sociopaths, they wouldn't be spammers in the first place.

  23. Re:Authentication of senders by pongo000 · · Score: 2, Interesting

    say I purchased something on-line from a vendor I had never dealt with before. Their e-mail system may automatically kick out an e-mail

    Using TMDA, you would generate a "keyword" address: a unique address, identified by a keyword embedded in the address, which would allow your vendor to bypass the C/R system. If the keyword address starts being abused then (1) you can easily disable it, and (2) you know not to do business with that vendor again.
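
One way such keyword addresses could be made unforgeable is to append a short HMAC of the keyword. The sketch below shows that idea only; the `user-keyword-tag@domain` layout, the secret and the function names are assumptions for illustration and are not TMDA's actual address format or API.

```python
import hashlib
import hmac

# Hypothetical values for illustration only.
SECRET = b"change-me"
USER, DOMAIN = "me", "domain.com"


def _tag(keyword):
    return hmac.new(SECRET, keyword.encode(), hashlib.sha256).hexdigest()[:8]


def keyword_address(keyword):
    """Build an address for one vendor; the keyword must not contain '-'."""
    return f"{USER}-{keyword}-{_tag(keyword)}@{DOMAIN}"


def check_address(address, disabled=frozenset()):
    """Return the keyword if the address is genuine and still enabled, else None."""
    local, _, domain = address.partition("@")
    parts = local.split("-")
    if domain != DOMAIN or len(parts) != 3 or parts[0] != USER:
        return None
    _, keyword, tag = parts
    genuine = hmac.compare_digest(tag.encode(), _tag(keyword).encode())
    if not genuine or keyword in disabled:
        return None
    return keyword
```

Mail to a valid keyword address would skip the challenge; adding the keyword to `disabled` is the "easily disable it" step, and the keyword itself tells you which vendor leaked the address.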

    As an example, another post here mentioned a system where the mail is held, not on your ISP or upstream provider's system until you download it, but rather is held on the sender's or sender's ISP's system.

    This system quickly breaks down, though, as delays are introduced by having to wait to fetch each piece of mail. People bothered by such delays will write/obtain software that automatically fetches the mail at a predetermined time, which would then shift the bandwidth problem (part of it, anyways) back to the recipient.

    The other problem with sender authentication is who, exactly, determines whether a sender is authenticated? I run my own e-mail server. Will I have to pay out bucks for an "authority" to confirm that my sending address is valid? Right now, some ISPs (notably Time-Warner offshoots) are denying access to their SMTP servers under the guise of reducing spam. If your IP happens to fall within a certain range, they simply don't allow you access. We will end up in the same morass RBL has put us in: Who plays God in determining whether a sender is truly "authentic" or "worthy"?

  24. The real reason SpamBayes wins... by Moryath · · Score: 4, Interesting

    The "unsure" feature directly combats the latest Spammer technique -- filter poisoning.

    You've all seen it work; the Spammers don't just send you the same spam once, they send you it 5 to 20 times, and they include a clipping from the headlines or something under their pitch.

    They're not doing it to get that one mail past to you. They're actually HOPING that you classify all 20 mails as spam.

    Why?

    Because every time you classify that mail as spam, EVERY SINGLE WORD of that news clipping is "poisoned" inside the filter, and becomes an indicator of a spam. Then you turn around, and get an email from someone legitimate using those common words... and it gets wrongly classified too.

    Enough false positives, and the spammers win, because they'll get you to turn the filter back off.
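
A toy illustration of the poisoning effect described above, assuming a simple per-word estimate (the fraction of spam containing the word against the fraction of ham containing it). `TokenStats` and the formula are illustrative assumptions, not the internals of any particular filter.

```python
from collections import Counter


class TokenStats:
    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        words = set(text.lower().split())  # count each word once per message
        if is_spam:
            self.spam_words.update(words)
            self.n_spam += 1
        else:
            self.ham_words.update(words)
            self.n_ham += 1

    def spam_probability(self, word):
        """Estimated chance that a message containing `word` is spam."""
        spam_rate = self.spam_words[word] / max(self.n_spam, 1)
        ham_rate = self.ham_words[word] / max(self.n_ham, 1)
        if spam_rate + ham_rate == 0:
            return 0.5  # never seen: no opinion
        return spam_rate / (spam_rate + ham_rate)
```

Train it on a handful of ordinary mails, then on twenty copies of a pitch with a news clipping attached, and the score for an innocent word from the clipping climbs from near 0 towards 1.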

    Enough is enough -- time to establish open hunting season on Spammers.

  25. SpamBayes Testimonial by Cytotoxic · · Score: 4, Interesting

    As a network/web/computer manager, my email has been provided to dozens of companies and trade shows. I still remember the day (August, 3 years ago) when someone first sold my address to a spam list. I went from 2-3 spams per day to 15-20. This spring brought another explosion, this time into the 100+ range. I am currently receiving over 6,000 spam messages every month! Obviously my main email address was useless and needed to be burned on a pyre to purge the evil.
    After a week or two of this, I installed SpamBayes in the form of its Outlook plugin. I showed it my email archive as my "good" messages, and a bunch of spam gleaned from my deleted folder as "bad". My mailbox is now perfectly clean. I have received at least 15,000 spam messages since installing SpamBayes, and I have probably had to hit the "Delete As Spam" button about 10 times for ones that it missed, most of those being variations on the Nigerian scheme. It has never grabbed a real message, and the "Unsure" feature localizes everything that I really need to look at in one place.
    If you have a spam problem, get SpamBayes. It is that simple. There is no need to speculate about that better method that you thought up, or how it really won't work because of XYZ theory... it works almost perfectly, and it lets you know about anything that it is not sure about with the "Unsure" folder, so it never throws the baby out with the bathwater. In short, this is almost the perfect Spam filter. It even caught the emails that were using GIFs to avoid being filtered on content, placing them in unsure until I said "this is spam", after which I never saw another one. Pretty darned cool!
    It is actually kind of fun to watch this thing work. I came in this morning to find 568 new messages in my spam folder, 3 in unsure, all of which were spam. No spam anywhere to be found in my inbox, just 15 unread messages that were correctly left alone by SpamBayes. Just imagine having to flip through 600 emails to find 15 real messages! Now I just hit "CTRL-A DEL" in my spam folder and it is all gone! 5 seconds a day to deal with spam, I can live with that....

  26. Re:hmm, if you really are so clever by Wilk4 · · Score: 2, Interesting
    According to Why Am I Getting All This Spam? Unsolicited Commercial E-mail Research Six Month Report, most harvesters really *aren't* that smart, so even simple email address obfuscation and removal from websites can have a dramatic impact on how much spam you get.

    The other good news from that study is that they show that spam does decrease after you remove your email address from websites... in other words, they don't keep the addresses as much as we generally believe. You aren't on every spammer's list forever just because they get your address once.

  27. Mail.app, remark on graphics by dr2chase · · Score: 2, Interesting

    I was more than a little disappointed to see that Apple's Mail.app was not included in the comparison. It wouldn't surprise me in the least if it were already the most widely used Bayesian spam filter. Unsurprisingly, it is also very easy to use.

    Mail.app also combines Bayesian filtering with the Address book -- any mail from a known correspondent won't be tagged as Junk. This reduces the risk of false positives. This is an integration cheat not available to stand-alone spam filters, because Apple supplies the Address book app and provides other integration between the two applications. But, (as a self-centered end-user) I don't care that it is a cheat, I am merely happy that it all works well. (And I cross my fingers and hope that somehow, Apple's C/C++/Objective-C programmers are less prone to leaving buffer overflow holes than Microsoft's programmers clearly are.)

    The author needs to read Edward Tufte's books on presenting information (e.g., The Visual Display of Quantitative Information).