Sorting the Spam from the Ham
MrClever writes "The Sydney Morning Herald (Aust) is running an article about the merits of Bayesian filtering and a good plain-English description of how it works. Might be handy if you need to explain it to non-technophiles. The main thing that may be useful is a Bayesian spam filter written to drop straight into Outlook 2k/XP available here and written in Python by Mark Hammond."
Math buffs might enjoy reading these pages or browsing this writeup and its many links.
You're getting modded as funny, but I just figured out exactly how true this is. My main email account is used primarily for work, so it was very easy to set up white lists for 30 or so email addresses, with a few family and friends thrown in, and route them to a special folder. I still check the default folder, of course, but I turned off notification for everything except the white folder.
I went from checking my email every 5-10 minutes to a handful of times a day.
It's nothing but crumpled porno and Ayn Rand.
For an article in an "IT tech" section of a paper, this is really very weak.
It really doesn't do much more than precis Paul Graham's arguments, then ends in a blatant plug for just one Outlook addon.
I suppose if there are still people in the column's audience who haven't heard this all before, and it gets the message out that spam can be effectively filtered, it's a minor goodness.
As I wrote only late last night, using Bayesian classification with only two categories (spam and "non-spam") is somewhat short-sighted, since if properly trained, a Bayes classifier can do a much better job than ordinary mail filtering (procmail, Mozilla or Mail.app filters, you name it).
In fact, if I had to bet on the next "killer apps", mail sorting and RSS filtering based on Bayesian classification would be right at the top of my list, based solely on the actual time-saving benefits for users. And I can't see any reason for Bayesian filtering not being included in Mozilla Mail and Apple's own (revamped) Mail.app.
I have to use Outlook at work, and after setting up Outclass (which requires POPfile) with several "buckets" to classify my corporate e-mail by project and field, I'm definitely not going back. Outlook, even with extensive use of Rules Wizard and categories, simply cannot cope with the diverse kinds of project-related e-mail I swap with colleagues, and Outclass is the only thing I could find that could deal with Exchange, PST folders and multiple Bayesian "buckets".
Come on, do the right thing and tell Apple and The Mozilla Project that you want configurable Bayesian filtering on their mail clients.
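For anyone wondering what "several buckets" means in practice, here is a minimal sketch of a multi-category naive Bayes classifier. It is not how POPfile or Outclass are actually implemented; the class name, the bucket names and the training messages are all made up for illustration, and it uses plain add-one smoothing.

```python
import math
import re
from collections import Counter, defaultdict


class BucketClassifier:
    """Toy multi-bucket naive Bayes classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.doc_counts = Counter()              # bucket -> messages trained
        self.vocab = set()                       # every word seen in training

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, bucket, text):
        words = self.tokenize(text)
        self.word_counts[bucket].update(words)
        self.doc_counts[bucket] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket in self.doc_counts:
            # Log prior: how much of the training mail landed in this bucket.
            score = math.log(self.doc_counts[bucket] / total_docs)
            total_words = sum(self.word_counts[bucket].values())
            for word in words:
                count = self.word_counts[bucket][word]
                # Add-one smoothing so unseen words do not zero out a bucket.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


# Hypothetical buckets and training mail.
classifier = BucketClassifier()
classifier.train("spam", "buy cheap pills now cheap offer")
classifier.train("project-x", "firmware build for the microcontroller board failed")
classifier.train("family", "dinner at mom's house on sunday")
```

The point is that nothing in the math cares whether there are two buckets or twenty; spam is just one more bucket to train.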
Bayesian is more or less word based, so graphics-only messages fly right by my Mozilla mail filters. I believe it does the check after the HTML has rendered. If they ran the filter before the HTML was rendered, they might have slightly better results. Eventually all spammers will learn the undetectable patterns that only a handful seem to know now, and it will once again render mail filters useless. I hate HTML e-mail.
I have seen all of the local client software and I personally have never bothered with it.
I always felt that the whole point of spam being annoying was that it wasted bandwidth. It gets sent to my server, and then I have to download it all from my server, and then it gets sorted away from my eyes in my client.
It is fairly trivial if you get enough regular mail for it to matter, and you are on a fast connection.
But I can't tell you how annoying it is to be on a slow dial-up connection and download 50 messages and then see that they all got filtered into the spam folder and that there were no "real" messages.
While there is a nice feeling of seeing them all get caught, it is annoying to have to wait for a download (and pay for it) and then get no return on the investment.
That is why I always try to have the spam blocking on the server side. That said, I now spend most of my time reading mail over ssh on my server, so none of it gets downloaded until I want to see something.
Perhaps if I combined SA on the server with a client-side filter, I would get everything properly blocked (the only reason stuff gets through my server setup right now is that when the server is under a high load, my SA script times out and the mail gets through).
There are some odd things afoot now, in the Villa Straylight.
I know your post was meant to be funny, but it brings up a point:
So what? If more computer products benefit, don't we all? Anything that makes Outlook better is good in my book. Perhaps this will eliminate some virus-and-worm-carrying spam--and that's good for /all/ of us on teh intarweb. ;-)
Mikey-San
Karma: +Eleventy billion (mostly affected by watching Celebrity Jeopardy)
Umm, a naive Bayesian filter would score duplicate posts highly, because after all they contain all the same words that were good last time.
"I believe that the cult of the particular brings only death - for it bases order on likeness." St.-Exupery
No, I don't think I'll trust you on that... :)
I have already seen the effectiveness of Popfile drop from 99% to 95% in the last 3 months.
That's very strange, but based on what you said below it seems that that's due to a limitation of Popfile as opposed to Bayesian itself. I've seen my Bayesian effectiveness INCREASE in the last 3 months.
Now spammers are including several paragraphs of unrelated (ie, un-spammy) text at the end of their message
There is a common misconception--both among spammers and anti-spammers--that doing the above will get your messages through. In some rare cases it might, but you have to remember that a good Bayesian filter is only going to pay attention to the most spammy and least spammy words. Just entering a useless, non-spammy paragraph is not enough. Unless that non-spammy paragraph happens to contain quite a few words that are downright NON-SPAM in my corpus, all that verbiage isn't going to do squat to lower the overall spam score of the message.
Basically, you need to know that my email typically talks about microcontrollers, I have a friend named Nathan, or my mom is named Angie. Just flooding me with words that don't appear in spam will do nothing unless you flood me with words that are extremely non-spammy in my particular corpus. And it's unlikely some random paragraph will manage to do that.
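To make that concrete, here is a rough sketch of the "only the most interesting tokens count" scoring that Paul Graham describes: keep the tokens whose spam probability is furthest from 0.5 and combine just those. The probabilities in the table are invented for the example, and I use 3 interesting tokens instead of Graham's 15 so the effect shows up on a tiny message.

```python
from math import prod


def spam_score(tokens, token_probs, n_interesting=15, unknown=0.4):
    """Combine only the tokens whose probability is furthest from neutral."""
    # Unknown words get a mildly innocent default, as in Graham's write-up.
    probs = [token_probs.get(t, unknown) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:n_interesting]
    spammy = prod(top)
    hammy = prod(1 - p for p in top)
    return spammy / (spammy + hammy)


# Hypothetical per-token spam probabilities from one user's corpus.
TOKEN_PROBS = {
    "viagra": 0.99, "unsubscribe": 0.95, "click": 0.90,
    "weather": 0.50, "garden": 0.45, "river": 0.50,
    "microcontroller": 0.01, "nathan": 0.02, "angie": 0.02,
}

plain = ["viagra", "unsubscribe", "click"]
random_padding = plain + ["weather", "garden", "river", "walk"]
targeted_padding = plain + ["microcontroller", "nathan", "angie"]

score_plain = spam_score(plain, TOKEN_PROBS, n_interesting=3)
score_random = spam_score(random_padding, TOKEN_PROBS, n_interesting=3)
score_targeted = spam_score(targeted_padding, TOKEN_PROBS, n_interesting=3)
```

With these made-up numbers the random padding never makes it into the top tokens, so the score does not move at all. Only the padding that happens to use my own strongly non-spam words drags the score down, and a spammer has no way of knowing those.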
So now Popfile will have to have a MIME decoder?
You mean it doesn't now? This is what causes me to think that this is a limitation of Popfile more than a limitation of Bayesian and, perhaps, is why my Bayesian effectiveness is climbing and yours is falling.
And then they'll send their SPAM in GIFs.
At which point the fact that a message contains just an IMG is going to receive a high spam score. No-one says Bayesian can just score words. You can create a token that means "Message only contains an IMG" or something like that. Bayesian doesn't mean we're done developing--it just means that the logical work is done. Now all we need to do is keep our eyes out for new "characteristics" of spam that can be detected and considered to be a "token."
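A quick sketch of what such a token could look like, using Python's standard `html.parser`. The token name `*image-only*` and the 20-character threshold are arbitrary choices of mine, not something any existing filter is known to use.

```python
import re
from html.parser import HTMLParser


class BodyScanner(HTMLParser):
    """Collects visible text and counts <img> tags in an HTML body."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1

    def handle_data(self, data):
        self.text_parts.append(data)


def tokenize(html_body, min_text_chars=20):
    """Return word tokens, plus a synthetic token for image-only mail."""
    scanner = BodyScanner()
    scanner.feed(html_body)
    scanner.close()
    text = " ".join(scanner.text_parts)
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Hypothetical characteristic: at least one image and almost no text.
    if scanner.images > 0 and len(text.strip()) < min_text_chars:
        tokens.append("*image-only*")
    return tokens


image_spam = '<html><body><img src="http://example.invalid/offer.gif"></body></html>'
real_mail = '<p>Hi Nathan, the microcontroller board arrived today.</p><img src="board.jpg">'
```

The synthetic token then gets trained and scored like any other word, so the filter decides for itself how damning it is.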
So then Popfile will have to use some kind of text-to-graphic weighting factor (note: no longer pure Bayesian/Naive filtering...)
Very doubtful for the reason mentioned above. You look at characteristics of the mail. And if you find that the message is basically just an IMG, that's a major strike against it. I severely doubt that you have to OCR the image unless your real email is also sent to you as images instead of text.
And then they'll start attaching a megabyte of unrelated text to the SPAM.
Again, just adding "innocent" text is not enough to get past Bayesian. You have to have the RIGHT innocent text, and that's different for each person. And, again, if they start adding megabytes of useless text you add a characteristic for "Text of message is over 100k". Suddenly Bayesian will realize that 99% of those messages are spam...
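Same idea in code, as a sketch: emit a size token and let the training counts say how spammy it is. The token name and the counts below are invented, and the probability estimate is a simplified frequency ratio rather than the exact formula any particular filter uses.

```python
def size_tokens(raw_message, limit=100 * 1024):
    """Emit a synthetic token when the raw message exceeds `limit` characters."""
    return ["*size-over-100k*"] if len(raw_message) > limit else []


def token_spam_probability(spam_with, n_spam, ham_with, n_ham):
    """Estimate how spammy a token is from how often it shows up in each pile."""
    spam_rate = spam_with / n_spam
    ham_rate = ham_with / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # never seen in either pile: treat as neutral
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical counts: the size token shows up in 99 of 200 spams
# and in 1 of 200 legitimate messages.
p_size = token_spam_probability(spam_with=99, n_spam=200, ham_with=1, n_ham=200)
```

With those made-up counts the size token comes out at roughly 0.99, which would make it one of the most incriminating tokens in the message.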
And note that the countermeasures will have increased the size of the average spam from 3-5K to a megabyte plus. Great for bandwidth.
I doubt that will be the case for the reasons mentioned above. Spammers are adding useless paragraphs now because they don't understand Bayesian.
Again, you just need to remember that 1) Bayesian isn't fooled just by adding paragraphs or megabytes of meaningless text. 2) Bayesian doesn't mean we never have to think about spam. It just means the hard work of deciding whether or not a message is spam is done. Now all we need to do is keep our eyes open for new "identifying characteristics" that often appear in spam. The rest falls into place automatically.
As for your indictment of spam filtering providers, could you please explain where the spamassassin devteam is making money?
My choices with regards to spam at the moment are simple. Use spamassassin or something like it, or wade through spam myself. I know which I'd prefer.
Any sufficiently advanced technology is indistinguishable from a rigged demo
--Andy Finkel (J. Klass?)