Plan for Spam, Version 2
bugbear writes "I just posted a new version of the Plan for Spam Bayesian filtering algorithm. The big change is to mark tokens by context. The new version decreases spams missed by 50%, to 2.5 per 1000, even though spam has gotten harder to filter since the summer. I also talk about how spam will evolve, and what to do about it."
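For anyone wondering what "marking tokens by context" might look like in practice, here is a minimal sketch. It assumes that context means the header a word appeared in, and that a marked token is written as "Header*word"; the header list, the token pattern, and the function name are illustrative assumptions, not taken from the summary above.

```python
import re
from email import message_from_string

# Assumption: these are the headers whose words get a context prefix.
CONTEXT_HEADERS = ("Subject", "From", "To")
# Assumption: a token is a run of letters, digits, and a few punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")

def tokenize_with_context(raw_message):
    """Return a token list where header words are prefixed with 'Header*'."""
    msg = message_from_string(raw_message)
    tokens = []
    for header in CONTEXT_HEADERS:
        value = msg.get(header, "")
        tokens += [f"{header}*{tok}" for tok in TOKEN_RE.findall(value)]
    body = msg.get_payload()
    if isinstance(body, str):  # this sketch ignores multipart messages
        tokens += TOKEN_RE.findall(body)
    return tokens

sample = "Subject: FREE offer\nFrom: a@example.com\n\nGet it free today\n"
tokens = tokenize_with_context(sample)
```

The point of the prefix is that "FREE" in a subject line and "free" in the body become different tokens, so the filter can learn a separate spam probability for each.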
The latest development version of SpamAssassin has an interesting application of Bayesian filtering. Basically, it takes all of SA's existing heuristics, uses them to develop a sense of what is and is not spam, and then pumps the results through a Bayesian filter that learns from these messages.
As with any other SA test, no single element of the chain is trusted enough to definitively call something spam, but if a message would have squeaked through before, this new filter can put the final nail in its coffin through word analysis against previous spam.
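A rough sketch of the "no single test decides" idea, in Python. Everything here is hypothetical: the rule names, the point values, the threshold, and the way a Bayesian probability is turned into points are made up for illustration and are not SpamAssassin's actual rules or scores.

```python
# Hypothetical heuristic rules and their point values.
HEURISTIC_SCORES = {"ALL_CAPS_SUBJECT": 1.5, "MENTIONS_FREE": 1.0}
# Hypothetical cutoff: a message at or above this total is called spam.
THRESHOLD = 5.0

def bayes_points(spam_probability):
    """Turn a Bayesian spam probability into points (made-up mapping)."""
    if spam_probability >= 0.99:
        return 3.0
    if spam_probability >= 0.80:
        return 1.5
    if spam_probability <= 0.10:
        return -2.0
    return 0.0

def total_score(rules_hit, spam_probability):
    """Sum the points of every heuristic that fired, plus the Bayes points."""
    heuristic_total = sum(HEURISTIC_SCORES[rule] for rule in rules_hit)
    return heuristic_total + bayes_points(spam_probability)

def is_spam(rules_hit, spam_probability):
    return total_score(rules_hit, spam_probability) >= THRESHOLD

rules = ["ALL_CAPS_SUBJECT", "MENTIONS_FREE"]
without_bayes_help = is_spam(rules, 0.50)   # heuristics alone fall short
with_bayes_help = is_spam(rules, 0.995)     # Bayes points push it over
```

With these made-up numbers, the two heuristics alone are not enough to cross the threshold, which is the "squeaked through" case; a confident Bayesian verdict supplies the missing points.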
So, why did I use a subject about "ENDING spam"? Because one of the tools that spammers have is SA itself. They can use it to score their messages and determine how "spamish" it is. The problem now is that each SA installation will have subtly different scoring, and the message may be "ok" according to the spammer's version, but my version has a better sense of the mail that *I* get.
SpamAssassin is definitely a tool worth checking out if you have not already. Install it in daemon mode (spamd) and then use "spamc -f" in your procmailrc or the equiv for your MTA.
Very nice tool, and a real time-saver for me.
I'm really excited about all of the neat stuff happening with Bayesian filtering and related technologies, but I just wanted to put in a plug for TMDA, Tagged Message Delivery Agent, which uses a whitelist-centric strategy. Since I began using it, the amount of spam I have to look at is virtually at zero. If you haven't read about it yet, check it out.
The article mentions compiling a vast collection of spam. Such a project is already underway at SpamArchive.
Content-Type: text/html (or text/plain)
Content-Transfer-Encoding: base64
Because a lot of filters don't know how to decipher this. For me, this makes it a lot easier to filter, though. I get no legitimate e-mail encoded this way, so I just have procmail dump any e-mail encoded this way. Problem solved, and without the CPU burden of decoding or running expensive spam filters.
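The same check is easy to sketch outside procmail. This is a Python illustration of the idea, not the poster's actual recipe; the function name is made up, and it only looks at the top-level headers, so a base64 part buried inside a multipart message would not be caught.

```python
from email import message_from_string

def is_base64_text(raw_message):
    """True if the top-level message is text/* sent with base64 encoding."""
    msg = message_from_string(raw_message)
    encoding = (msg.get("Content-Transfer-Encoding") or "").strip().lower()
    return msg.get_content_maintype() == "text" and encoding == "base64"

spammy = (
    "Content-Type: text/html\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "PGI+aGk8L2I+\n"
)
plain = "Content-Type: text/plain\n\nhello\n"

drop_spammy = is_base64_text(spammy)
drop_plain = is_base64_text(plain)
```

Note that the body is never decoded, which is where the CPU saving comes from. Whether dropping on this test is safe depends entirely on whether you get legitimate mail encoded this way.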
The URL for the project is popfile.sourceforge.net
I haven't tried it yet, but I will try it really soon now!
OK, signal and noise. What if the signal was all in one frequency band and the noise all in another? Problem separating them? No.
What if, in effect, a similar distinction held for spam in the transmission channel - that spam by itself selected a pathway to the recipient that was never used by the signal? Block that pathway and the spam never gets through.
Spam doesn't select a pathway but spammers do. If you could block relay spam at the open relays it would be dead. You can't, of course - the open relays are controlled by people who don't know the need to block spam. You know that, I know that. If you can't change the people then change the open relays (from the spammers' points of view.) Set up a system that looks like an open relay and stop the spam. An open relay honeypot.
I asked an operator of such a honeypot how he did last year:
> How did 2002 end?
From March 7 to December 26 2002, the total was:
235,624,232
Using one Pentium 90 he stopped spam to 235 million recipients. Think about that number when you see filter people reporting what they stop just for their own domains. This was spam to recipients all over, not simply to the honeypot operator's domain: he operates at the relay level. He stopped 100% of the spam, no deception deceived him, no tuning was needed, no valid email was caught - it is perfect filtering. Perfect filtering - who else has that?
And you can do it at home on your DSL or cable connection (the guy above uses sendmail -bd, but Windows users have a program they can use):
http://jackpot.uk.net/
Yeah, I know, spammers are switching to open proxies. So, write an open proxy honeypot. That, too, will be 100% efficient. In addition you now are giving spammers reason to fear every open relay and every open proxy they detect. FEAR. The SPAMMERS have to scramble. They have to scramble and they have to show everything they do to overcome the technique - there is no stealth way to look for open relays and open proxies.
The problem is solved, it is a matter of implementation and of getting active systems everywhere in the net space (so there's no safe IP space for the spammers anywhere.)
Remember: A single Pentium 90, 235 million spam messages stopped in 10 months.
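To put that total in perspective, here is the arithmetic on the figures quoted above, as a short Python sketch. The only assumption is that the count covers the span from March 7 up to (not including) December 26, 2002.

```python
from datetime import date

start = date(2002, 3, 7)
end = date(2002, 12, 26)
total_recipients = 235_624_232  # figure quoted by the honeypot operator

days = (end - start).days               # length of the reporting period
per_day = total_recipients / days       # average recipients per day
per_second = per_day / 86_400           # 86,400 seconds in a day

print(f"{days} days, about {per_day:,.0f} per day, about {per_second:.1f} per second")
```

That works out to roughly 800,000 spam deliveries stopped per day on average.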
It's all fine and dandy to have a spamtrap account if you never plan to read it, but what if you want to get online bank statement notifications or other important notices? I just noticed my friendly credit card company (Capital One) took it upon themselves to introduce my previously spam-free e-mail account to their business partners so they could introduce me to the wonderful world of buying fucking flowers for Valentine's Day. Thanks a lot, assholes. And no, they have NO option to opt out of this fucking crap. The spam is posted from the same address as the statement notifications with a friendly disclaimer saying they're not in any way affiliated. Nice.
I went from over 500 spam a day down to about 3 or so, and I figured out that those last 3 are due to the fact that they are bypassing the filter (I have a bunch of different urls and the server that it is all hosted on also has its own name - so mail sent to that username at that host doesn't get sent through any filters, and the way that the filters are set up there - pair.com - I can't trap that particular servername).
I have been very impressed with SA and am writing scripts to track the stats even better (I love seeing what it has pulled out every day).
So far I have had zero false positives out of about 1-2 megs of mail being filtered every day for nearly a month now.
SA has multiple different ways of searching the mail - any one of them can be easily bypassed by any given e-mail - but all of them together are really damn good at getting rid of spam.
I'm very impressed with it and how well it learns, although straight "out of the box" (or perhaps I should say "straight out of the tar.gz") it brought me down from 500+ spam to 5-10 a day, and then I tweaked how my accounts were filtering into SA and that fixed the rest.
Praed argued, very eloquently & persuasively (hey, he's a lawyer :) that there are laws on the books banning spam in nearly every state. All you have to do is find a way to bring those laws to your assistance. In particular, note that:
As a lawyer who has successfully prosecuted a number of spammers, Praed was able to talk about all of this with some authority. He cautioned everyone though that laws will never eradicate spam -- as he put it, "people still rob banks since that's where the money is". But legislation & prosecution can still be a very valuable tool in fighting spam, and an important supplement to things like better mail filters. This is a big problem, and is going to need a variety of tiered solutions to control it.
Hi, that was me. Unfortunately this only works for Outlook (not even Outlook Express), but it's been working great for me.
As others have pointed out, Vipul's Razor is a great open-source solution.
Checking SourceForge, I found the following additional packages:
BogoFilter
SpamAssassin
JoeEmail
Bayesian anti-spam classifier
Anti-Spam SMTP Proxy Server
Bayesian Mail Filter
JunkFilter
SpamProbe - fast bayesian spam filter
Mailfilter
IMAPAssassin
That's just from the first page of search results. If you'd like to see all the results (I did a search for "spam" from their search box), click here.