Computationally Cheap Spam Filtering?
"Ideally, I'd prefer something that does reject the message if it's spam (SMTP result code 550 or something like that), unlike current SpamAssassin or spamprobe setups that accept the message and only later decide whether it's spam. Solutions like MAPS RBL and ORBS are acceptable, although commentary on their accuracy would be welcome. Other possibilities I've thought of include checksumming (Vipul's Razor or DCC) and simple header checks that could be implemented, for instance, in a sendmail milter.
Are several quick checks (DCC + RBL) accurate enough and still cheaper than one slow check (SpamAssassin, Bayesian filtering)? Does stacking similar techniques improve accuracy significantly (DCC + Razor, RBL + ORBS)? How can the good but expensive techniques be made cheaper (SpamAssassin's spamproxyd, hashed wordlists for Bayesian filters, and so on)? Discussion on all these aspects would yield some interesting conclusions on quick and efficient spam filtering."
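For readers wondering why an RBL-style check counts as "quick": it usually comes down to a single DNS lookup. The sketch below shows the common DNSBL convention of reversing the IPv4 octets and appending the list's zone. The function names and the zone `dnsbl.example.org` are made up for illustration, and treating any answer as "listed" is a simplification, since real lists document their own return codes.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the name to look up: reversed IPv4 octets followed by the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers with an address record for this IP.

    Assumption: any successful answer means "listed". A lookup failure
    (including NXDOMAIN) is treated as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical zone; substitute the list you actually use.
    print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Because the answer can come from a local caching DNS server, the per-message cost is small compared with parsing and scoring the whole message body.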
Personally, I would recommend going with SpamAssassin; RBLs and the like catch too much non-spam and cause problems for users. I would set up a load balancer and a DHCP server next to your mail server, then modify a copy of Knoppix to run as a SpamAssassin server, using MySQL to hold the SpamAssassin configuration. Then buy several cheap consumer-level PCs and plug them in to scale your processing power cheaply. That way you can run SpamAssassin for 50,000 users.
SpamAssassin can run as a daemon (spamd), so it doesn't have to start up the Perl interpreter for each message. This is the preferred mode for large installations.
People report processing times in the range of 0.2 to 0.5 seconds per message with basic tests (no pyzor 2). Get a fast machine with dual processors, plenty of RAM, a caching DNS server, set spamd/spamc to have an appropriate number of child processes, and you should be good to go.
It's certainly going to be cheaper than the sexual harassment lawsuit that one of those 50,000 users is going to file for being forced to look at pornographic material (we require employees to read their e-mail, don't you?).
You could take some steps on the user education side of things. Before being given an account, they should learn a few things about how to keep their address safe, like:
Also, if you're working for an organization that may want to expose user addresses to the internet via a web site, you may want to work with the webmaster and the legal department to create a click-through agreement that would stop spam-harvesting robots while requiring only a couple of extra clicks from the legitimate public. Or work with the webmaster to create a standard way to post email addresses that only a human can read, e.g. "email lauren at our domain of example.com".
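If the web site is generated from templates, that spelled-out form can be produced automatically. A minimal sketch, assuming the wording from the example above; the function name is hypothetical:

```python
def human_readable_address(address: str) -> str:
    """Rewrite 'user@domain' in a spelled-out form for display on a web page."""
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a simple user@domain address: {address!r}")
    return f"{local} at our domain of {domain}"


if __name__ == "__main__":
    print(human_readable_address("lauren@example.com"))
```

This only raises the bar for naive harvesters; a robot written to understand the phrasing would still get the address.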
You may wish to register an additional domain or two to provide disposable email address services to your users.
Consider a piece of software that blocks IPs attempting to brute-force email addresses. Some filter monitoring the logs for excessive bounces from an IP and passing it to the firewall would work. I don't know of any examples of this software, but if you're doing a large email service you may get these kinds of attacks.
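A rough sketch of such a log filter follows. Everything specific in it is an assumption: the log line format matched by `REJECT_RE` is invented for illustration (adapt it to whatever your MTA actually writes), the threshold of 20 is arbitrary, and handing the resulting addresses to the firewall is left out because it depends entirely on your setup.

```python
import re
from collections import Counter

# Hypothetical log line format, e.g.:
#   reject: RCPT from unknown[203.0.113.5]: 550 5.1.1 <x@example.com>: User unknown
REJECT_RE = re.compile(
    r"reject: RCPT from \S*\[(?P<ip>[0-9.]+)\]: 550 .*User unknown"
)


def offenders(log_lines, threshold=20):
    """Return the sorted IPs with at least `threshold` unknown-user rejections."""
    counts = Counter()
    for line in log_lines:
        match = REJECT_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return sorted(ip for ip, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    sample = [
        "reject: RCPT from unknown[203.0.113.5]: 550 5.1.1 <a@example.com>: User unknown",
    ] * 25
    for ip in offenders(sample):
        print(ip)  # feed these to whatever blocks addresses at your firewall
```

In practice you would also want the counts to expire over time, so that a legitimate server with a few stale addresses on its list is not blocked forever.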
Scanning messages for spam and rejecting at the SMTP level is a very bad idea. I'm the sysadmin for a company where about 25% of our email message traffic is spam. However, we also have a hard-working sales department who actually need commercial and sales messages. If a message from a client is marked as 'spam' because they're negotiating a sales deal, the sales staff still need to see this message. If a client's counter-offer is rejected at the mail server with a "you sent us SPAM" message, you can kiss that potential income goodbye.
False positives can be more harmful than messages getting through the spam filter.
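One way to act on that asymmetry, if you filter at the server at all, is to reserve outright rejection for only the most blatant cases and merely tag everything in the grey zone so the recipient still sees it. A minimal sketch, assuming a numeric spam score from whatever filter you run; the function name and both thresholds are illustrative, not recommendations:

```python
def smtp_decision(score: float, tag_at: float = 5.0, reject_at: float = 15.0) -> str:
    """Map a spam score to an action, erring on the side of delivery."""
    if score >= reject_at:
        return "reject"   # refuse at SMTP time, e.g. with a 550
    if score >= tag_at:
        return "tag"      # deliver, but mark it so the user can filter it
    return "deliver"


if __name__ == "__main__":
    for score in (1.2, 7.5, 22.0):
        print(score, smtp_decision(score))
```

With a policy like this, a sales negotiation that trips a few rules lands in the "tag" band and is still delivered.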