Computationally Cheap Spam Filtering?
"Ideally, I'd prefer something that does reject the message if it's spam (SMTP result code 550 or something like that), unlike current Spamassassin or spamprobe setups that accept the message and only later decide whether it's spam. Solutions like MAPS RBL, ORBS are acceptable, although commentary on their accuracy would be welcome. Other possibilities I've thought of include checksumming (Vipul's razor or DCC) and simple header checks that could be implemented for instance in a sendmail milter.
Are several quick checks (DCC + RBL) accurate enough and still cheaper than one slow check (Spamassassin, bayesian filtering)? Does stacking of similar techniques improve accuracy significantly (DCC + Razor, RBL + ORBS)? How can the good but expensive techniques be made cheaper (Spamassassin's spamproxyd, hashed wordlists for bayesian filters, and so on)? Discussion on all these aspects would yield some interesting conclusions on quick and efficient spam filtering."
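For anyone weighing how cheap an RBL-style check really is: a DNS blocklist lookup is a single DNS query on the client's IP with the octets reversed and the list's zone appended. Below is a minimal Python sketch of that. The zone name dnsbl.example.org is a placeholder, not a real list, and treating every lookup failure as "not listed" is an assumption you may not want in production.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name used to ask a blocklist about an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blocklist zone answers with an A record for this IP."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # NXDOMAIN or any resolver failure is treated as "not listed" here.
        return False


if __name__ == "__main__":
    # Placeholder zone; substitute the list you actually use.
    print(dnsbl_query_name("217.40.92.155", "dnsbl.example.org"))
```

With a caching DNS server on the same box, repeat offenders cost almost nothing after the first query.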
Personally, I would recommend going with SpamAssassin; RBLs and the like catch too much non-spam and cause user problems. I would recommend that next to your mail server you set up a load balancer + DHCP server. Then modify a copy of Knoppix to run as a SpamAssassin server, use MySQL for the SpamAssassin configuration, then buy several cheap consumer-level PCs and plug them in to cheaply scale your processing power. Then run SpamAssassin for 50,000 users.
At our university, Virginia Tech, the hardware e-mail virus scanners (Mirapoint Messaging Server) also do SpamAssassin now; it puts info in headers (sample below). Filter for "X-Junkmail: UCE" and you've got a spam filter (though I run a more aggressive SA on my workstation, since I can customize it there).
Return-Path:
Received: from vt.edu (gkar.cc.vt.edu [198.82.161.196]) by xxxx.xxxx.vt.edu (8.12.8/linuxconf) with ESMTP id h47JISRm004277 for ; Wed, 7 May 2003 15:18:28 -0400
Received: from steiner.cc.vt.edu ([10.1.1.14]) by gkar.cc.vt.edu (Sun Internet Mail Server sims.3.5.2001.05.04.11.50.p10) with ESMTP id for noone@xxxx.xxxx.vt.edu; Wed, 7 May 2003 15:18:31 -0400 (EDT)
Received: from aol.com (host217-40-92-155.in-addr.btopenworld.com [217.40.92.155]) by steiner.cc.vt.edu (Mirapoint Messaging Server MOS 3.3.2-CR) with SMTP id BIE36579; Wed, 07 May 2003 15:18:17 -0400 (EDT)
Date: Thu, 08 May 2003 03:13:26 -0800
From: Kate Welsh
Subject: [SPAM] Remember me?
To: spam@vt.edu
Message-id:
MIME-version: 1.0
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
Importance: Normal
X-Junkmail: UCE(58)
X-Priority: 3
X-Spam-Status: Yes, hits=13.0 required=5.0 tests=ALL_NATURAL,BASE64_ENC_TEXT,BIG_FONT,CLICK_BELOW,CLICK_HERE_LINK,DATE_IN_FUTURE_12_24,HGH,HTML_FONT_COLOR_CYAN,HTML_FONT_COLOR_GRAY,HTML_FONT_COLOR_NAME,HTML_FONT_COLOR_RED,HTML_FONT_COLOR_UNSAFE,HTML_FONT_COLOR_YELLOW,NO_QS_ASKED,RCVD_IN_DSBL,REMOVE_PAGE,SPAM_PHRASE_13_21,SPAM_REDIRECTOR,SUSPICIOUS_RECIPS,USER_AGENT_OUTLOOK,VERY_SUSP_RECIPS version=2.44
X-Spam-Flag: YES
X-Spam-Level: *************
X-Spam-Checker-Version: SpamAssassin 2.44 (1.115.2.24-2003-01-30-exp)
X-Spam-Prev-Content-Type: MULTIPART/MIXED; BOUNDARY="Boundary_(ID_d+Bzp/dF6h/2OkPD89OTbQ)"
Content-Type: text/plain; CHARSET=US-ASCII
X-Evolution-Source: imap://jackie@localhost/
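If you'd rather do the header filtering in a script than in your mail client, something like the following would work. It's a rough Python sketch using the standard email module; the function name is made up for illustration, and it assumes the upstream scanner has already added X-Junkmail or X-Spam-Flag headers like the ones in the sample above.

```python
from email import message_from_string


def is_tagged_spam(raw_message: str) -> bool:
    """Return True if an upstream scanner already tagged this message as spam."""
    msg = message_from_string(raw_message)
    junk = str(msg.get("X-Junkmail", "")).strip().upper()
    flag = str(msg.get("X-Spam-Flag", "")).strip().upper()
    # Either the Mirapoint-style tag or the SpamAssassin flag counts.
    return junk.startswith("UCE") or flag == "YES"


if __name__ == "__main__":
    sample = (
        "Subject: [SPAM] Remember me?\n"
        "X-Junkmail: UCE(58)\n"
        "X-Spam-Flag: YES\n"
        "\n"
        "body text\n"
    )
    print(is_tagged_spam(sample))
```

This only reads headers, so it costs next to nothing per message; all the expensive work was done once on the scanner.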
I am, and always will be, an idiot. Karma: Coma (mostly effected by
I can't say that I don't give a fuck. I've just run out of fuck to give.
Ok, I'm not the email admin here at work; I avoid the whole mail subsystem 'cause I already have enough to do elsewhere.
Anyway, for our 1500 users we use SpamAssassin with RBL and blacklists, and our meager server (PIII 1.26GHz with 512MB RAM) doesn't even reach 0.20. The heuristics are turned down due to the processor usage, but it filters about 90% of the spam with very little load.
I, personally, use Popfile (search Sourceforge) as my personal filter. Its database is not that big right now, just some 8MB with over 200,000 emails since training (from my huge spam database) and normal usage over the past year for me and a dozen other users. Very easy to set up and use; you just need to train it with a good database. Its stats state that it has a 99.85% correctness rate. The machine has reached .20 running Popfile during a Monday morning after a long weekend, but the machine is half the other one, a PIII 667MHz with 256MB RAM.
Ash nazg durbatuluk, ash nazg gimbatul Ash nazg thrakatuluk agh burzum-ishi krimpatul
Classifying spam is essentially the same problem as classifying programs into those that terminate and those that don't (the halting problem). This leads us to the following conclusions:
1) Filtering spam is not trivial. A program that filters spam X% better than another program will be X^2% more complicated or worse.
2) You can't write a program that will filter perfectly. At best, all you can do is develop a set of heuristics that you hope aren't too complicated. The less complicated the heuristic, the fewer resources it will require.
3) There's a limit to how simple your heuristics can be.
4) The system of spam is not just the message: it's the spammer, plus the message, plus the recipient. This is because a certain message considered spam by some will not be considered spam by others. That means that the heuristics that account for the person reading the spam will be better than those that don't. The source of a spam is also important: a message consisting of a spam report to a spam newsgroup is not a spam, though it may contain a complete spam message.
5) The best spam filters will eventually be AIs that understand human language. That means that the ultimate spam filter will require enough processing power to model human cognitive abilities. In short, you're going to see an endless increase in the number of processor cycles consumed by spam filters, asymptotically approaching the requirements of a full-up human brain simulation.
On the other hand, this will sell a hell of a lot of computers.
If tits were wings it'd be flying around.
The more "low hanging fruit" you pick off the less your computationally expensive filters have to do. For example, if the other system greets you with:
EHLO your.machine.ip.address
or
EHLO your.machine.name
then it IS a spammer. Reject now. There are some patches and configurations for Postfix so you can require that mail claiming to come from certain domains like Yahoo and Hotmail be verified to have a matching EHLO (e.g. a hotmail one) that properly resolves. This is more expensive since a DNS lookup is required, but the result will probably be cached locally pretty quickly.
You can also unceremoniously drop any connection that starts pipelining before you say it is OK to pipeline and any EHLO that has an illegal hostname.
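To make the EHLO checks concrete, here is a small Python sketch of the decision only; wiring it into an actual MTA or policy daemon is left out. The function names, the example host mx.example.com, and the 192.0.2.10 address are all made up for illustration. Note that a bare IP that isn't yours still passes the hostname syntax test here, so tighten that if you want to be stricter.

```python
import ipaddress
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")


def is_valid_hostname(name: str) -> bool:
    """Syntax-only check; does not resolve the name."""
    name = name.rstrip(".")
    if not name or len(name) > 253:
        return False
    return all(_LABEL.match(label) for label in name.split("."))


def is_address_literal(arg: str) -> bool:
    """True for bracketed literals such as [198.51.100.7] or [IPv6:2001:db8::1]."""
    if not (arg.startswith("[") and arg.endswith("]")):
        return False
    inner = arg[1:-1]
    if inner.lower().startswith("ipv6:"):
        inner = inner[5:]
    try:
        ipaddress.ip_address(inner)
        return True
    except ValueError:
        return False


def should_reject_ehlo(ehlo_arg: str, my_names: set, my_ips: set) -> bool:
    """Reject clients that greet us with our own identity or an illegal name."""
    arg = ehlo_arg.strip().lower()
    if arg in my_names or arg.strip("[]") in my_ips:
        return True  # the client claims to be us
    if is_address_literal(arg):
        return False
    return not is_valid_hostname(arg)
```

Everything here is string comparison, so it runs before any DNS or content scanning is needed.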
This, at least, reduces the work your scanning engines will have to do. Still, even if you catch nearly all the spam with the easy checks you will only reduce your mail volume by ~40% (current estimated overall spam volume) so that leaves you with 60% to scan.
I suppose your main MX could do the easy checks then send the remainder off to as many round-robin scanners as necessary which in turn could pass the mail on for delivery.
One starts to realize why some places just roll over and pay tens of thousands of dollars to someone else to do it for them.
~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
SpamAssassin can run as a daemon so it doesn't have to start up the Perl interpreter for each message. This is the preferred mode for large installations.
People report processing times in the range of 0.2 to 0.5 seconds per message with basic tests (no pyzor 2). Get a fast machine with dual processors, plenty of RAM, a caching DNS server, set spamd/spamc to have an appropriate number of child processes, and you should be good to go.
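For a back-of-the-envelope feel for those numbers, here is a tiny Python sketch. The count of ten child processes is an assumption picked purely for illustration, and the result is an upper bound: it ignores DNS latency, bursts, and anything else that keeps the children from being busy 100% of the time.

```python
SECONDS_PER_DAY = 86_400


def messages_per_day(seconds_per_message: float, children: int) -> int:
    """Ideal scanning capacity if every child process is always busy."""
    return round(children * SECONDS_PER_DAY / seconds_per_message)


if __name__ == "__main__":
    # 10 children is a made-up figure; plug in your own spamd setting.
    for secs in (0.2, 0.5):
        print(secs, messages_per_day(secs, children=10))
```

At 0.5 seconds per message, ten children top out around 1.7 million messages a day, so the per-message cost matters less than whether your peak hour fits.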
It's certainly going to be cheaper than the sexual harassment lawsuit that one of those 50,000 users is going to file for being forced to look at pornographic material (we require employees to read their e-mail, don't you?).
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
At the SMTP server
At the SMTP Filter Proxy Server or LDA
Just remember to shortcut the process along the way. If email can be dropped or tagged for any reason, do so immediately and quit processing it.
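A sketch of what "shortcut the process" could look like, in Python: run the checks cheapest-first and stop at the first one that gives a reason to reject. The three checks here are stand-ins I made up (a forged-EHLO test, a blocklist stub backed by a hard-coded set, and a crude body test); real ones would call your actual tools.

```python
from typing import Callable, Iterable, Optional, Tuple

# A check returns a rejection reason, or None to let the message continue.
Check = Callable[[dict], Optional[str]]

expensive_calls = []  # records when the costly check actually ran


def helo_check(msg: dict) -> Optional[str]:
    return "forged EHLO" if msg["ehlo"] == "mx.example.com" else None


def blocklist_check(msg: dict) -> Optional[str]:
    # Stub: a real check would query a DNS blocklist.
    return "listed" if msg["client_ip"] in {"203.0.113.9"} else None


def content_check(msg: dict) -> Optional[str]:
    expensive_calls.append(msg["client_ip"])
    return "spammy body" if "click here" in msg["body"].lower() else None


# Ordered cheapest first, most expensive last.
CHECKS = [("helo", helo_check), ("blocklist", blocklist_check), ("content", content_check)]


def run_checks(message: dict, checks: Iterable[Tuple[str, Check]]) -> Tuple[str, str]:
    """Stop at the first check that rejects; later checks never run."""
    for name, check in checks:
        reason = check(message)
        if reason is not None:
            return ("reject", f"{name}: {reason}")
    return ("accept", "passed all checks")
```

The point of the ordering is that the content scanner only ever sees mail that survived the cheap tests.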
assert(expired(knowledge));
You could take some steps on the user education side of things. Before being given an account, they should learn a few things about how to keep their address safe, like:
Also, if you're working for an organization which may want to expose user addresses to the internet via a web site, you may want to work with the web master and legal to create a click-through agreement that would stop spam harvesting robots while only requiring a couple extra clicks for the legitimate public. Or work with the web master to create a standard human-only readable way to post email addresses, e.g. "email lauren at our domain of example.com".
You may wish to register an additional domain or two to provide disposable email address services to your users.
Consider a piece of software that blocks IPs attempting to brute-force email addresses. Some filter monitoring the logs for excessive bounces from an IP and passing it to the firewall would work. I don't know of any examples of this software, but if you're doing a large email service you may get these kinds of attacks.
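The log-watching part is simple enough to sketch in Python. The log line format below is invented for the example (real MTA logs look different, so the regex would need adjusting), and handing the resulting IPs to a firewall is left out entirely.

```python
import re
from collections import Counter

# Hypothetical log format, e.g.:
#   reject ip=203.0.113.9 rcpt=bob@example.com reason=user_unknown
REJECT_LINE = re.compile(r"\breject ip=(?P<ip>\S+) .*reason=user_unknown\b")


def ips_to_block(log_lines, threshold=20):
    """Return IPs with at least `threshold` unknown-recipient rejections."""
    counts = Counter()
    for line in log_lines:
        match = REJECT_LINE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return sorted(ip for ip, n in counts.items() if n >= threshold)
```

You'd want to run this over a sliding time window rather than the whole log, and expire blocks after a while so a shared or reassigned IP isn't locked out forever.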
Scanning messages for spam and rejecting at the SMTP level is a very bad idea. I'm the sysadmin for a company where about 25% of our email message traffic is spam. However, we also have a hard-working sales department who actually need commercial and sales messages. If a message from a client is marked as 'spam' because they're negotiating a sales deal, the sales staff still need to see this message. If a client's counter-offer is rejected at the mail server with a "you sent us SPAM" message, you can kiss that potential income goodbye.
False positives can be more harmful than messages getting through the spam filter.