Computationally Cheap Spam Filtering?
"Ideally, I'd prefer something that does reject the message if it's spam (SMTP result code 550 or something like that), unlike current SpamAssassin or spamprobe setups that accept the message and only later decide whether it's spam. Solutions like MAPS RBL and ORBS are acceptable, although commentary on their accuracy would be welcome. Other possibilities I've thought of include checksumming (Vipul's Razor or DCC) and simple header checks that could be implemented, for instance, in a sendmail milter.
Are several quick checks (DCC + RBL) accurate enough and still cheaper than one slow check (SpamAssassin, Bayesian filtering)? Does stacking of similar techniques improve accuracy significantly (DCC + Razor, RBL + ORBS)? How can the good but expensive techniques be made cheaper (SpamAssassin's spamproxyd, hashed wordlists for Bayesian filters, and so on)? Discussion on all these aspects would yield some interesting conclusions on quick and efficient spam filtering."
Personally, I would recommend going with SpamAssassin; RBLs and the like catch too much non-spam and cause problems for users. I would recommend that you set up a load balancer and a DHCP server next to your mail server. Then modify a copy of Knoppix to run as a SpamAssassin server, use MySQL for SpamAssassin's configuration, and buy several cheap consumer-level PCs to plug in so you can scale your processing power cheaply. Then run SpamAssassin for 50,000 users.
Caveat: I know very little about email in the real world. These are just my thoughts.
What are your requirements? Do you have very limited hardware to work with? Do you need a particularly low latency for delivery? How many messages do you need to process per minute? (or per second)
If it's possible, having a separate spam filtering box might be a good idea. If that gets loaded down you could even make a cluster of them. I'm not sure that high-level spam filtering really takes as much CPU time as you're implying, but even so it should be pretty simple to set up something like this.
Another possibility is to limit the amount of CPU time that the spam filtering process takes and simply bypass it when it can't finish in time. Perhaps a mail can wait a maximum of 10 seconds before being automatically sent on. This could even be combined with a separate server or cluster approach. I have no idea how this would be implemented, though I have a hunch that qmail or exim would be at least extensible enough to allow it. I think it's worth jumping through hoops to keep the high-level spam filtering.
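For what it's worth, the "wait at most 10 seconds, then send it on" idea could look roughly like the sketch below. This is only an illustration in Python, not how qmail or exim do anything: classify_with_deadline, slow_filter and fast_filter are made-up names, and the real filter would be whatever you plug in as the classify function.

```python
import concurrent.futures
import time

# Shared worker pool; the size is an arbitrary assumption for this sketch.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def classify_with_deadline(classify, message, timeout=10.0):
    """Return classify(message), or "unfiltered" if it overruns the deadline.

    Failing open means a busy filter delays mail by at most `timeout`
    seconds instead of backing up the whole queue.
    """
    future = _POOL.submit(classify, message)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the job has not started; a running job
        # keeps using its worker thread until it finishes on its own.
        future.cancel()
        return "unfiltered"


def slow_filter(message):
    """Stand-in for an overloaded filter (hypothetical)."""
    time.sleep(0.3)
    return "spam"


def fast_filter(message):
    """Stand-in for a filter that answers in time (hypothetical)."""
    return "ham"


if __name__ == "__main__":
    print(classify_with_deadline(fast_filter, "hello", timeout=1.0))
    print(classify_with_deadline(slow_filter, "hello", timeout=0.05))
```

Note that a thread that has already started can't be killed this way, so a truly stuck filter still ties up a worker. For a hard CPU limit you would probably want a separate process you can kill.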
Thirdly, you could try turning the mail filter into a "server" program itself, so you don't have to start a new process for each email you filter.
See you, space cowboy...
I've gotten many requests to tag people's mail rather than deleting it. Within a month, they all say 'fuckit, just toss it.'
--Dan
SpamAssassin can run as a daemon (spamd), so it doesn't have to start up the Perl interpreter for each message. This is the preferred mode for large installations.
People report processing times in the range of 0.2 to 0.5 seconds per message with basic tests (no pyzor 2). Get a fast machine with dual processors, plenty of RAM, a caching DNS server, set spamd/spamc to have an appropriate number of child processes, and you should be good to go.
It's certainly going to be cheaper than the sexual harassment lawsuit that one of those 50,000 users is going to file for being forced to look at pornographic material (we require employees to read their e-mail, don't you?).
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
I personally use a large number of DNS blacklists. I call them from Sendmail and reject mail with them. Many people don't like DNSBLs; of course I believe these people are ignorant fools who couldn't admin a mail system if their lives depended on it. That's ok. At the very least you should be able to use the DNSBLs that list open relays, open proxies, open SOCKS boxes, and vulnerable formmail.cgi web servers. We can surely all agree that you don't want your mail server talking to another mail server that's known to be vulnerable. Most of these specific lists require that an open * be abused before they list it. I'd also contend that we can all justify using Spamhaus's Spamhaus Block List (SBL). It lists known spammers and is very specific about it. You can block roughly 75% of spam with that list alone. Where you use these DNSBLs is up to you. Like I said above, I call all of mine straight from Sendmail. You can also configure SpamAssassin to call these DNSBLs for you and assign a score you define. It's pretty easy. This way you can still use lists like SPEWS, which rely on collateral damage, to score mail but not outright block it. I use SPEWS and love it, but it does block some legit mail by design. If you only score off of SPEWS you can minimize the FPs while still maximizing your spam filtering efforts. I am preparing to score foreign countries and RFC-Ignorant domains this way as well.
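To make the block-versus-score distinction concrete, here is a rough Python sketch of how a DNSBL check works in general: reverse the sender's IPv4 octets, append the list's zone, and see whether the name resolves. The zone names and scores below are placeholders I made up (they are not real lists and not SpamAssassin's rule scores), and this is not how Sendmail or SpamAssassin implement it internally.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the lookup name: reversed IPv4 octets followed by the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """True if the DNSBL returns any A record for this IP."""
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # Lookup failures (including "no such name") are treated as
        # "not listed", so a dead list never blocks mail.
        return False


def dnsbl_score(ip, zone_scores, resolve=socket.gethostbyname):
    """Add up the scores of every list the IP appears on."""
    return sum(
        score
        for zone, score in zone_scores.items()
        if is_listed(ip, zone, resolve)
    )


if __name__ == "__main__":
    # Hypothetical zones: a narrow list weighted heavily, a broad one lightly.
    zones = {"relays.dnsbl.example": 5.0, "wide.dnsbl.example": 1.5}
    print(dnsbl_query_name("192.0.2.1", "relays.dnsbl.example"))
    print(dnsbl_score("192.0.2.1", zones))
```

A narrowly targeted list can carry a score high enough to reject on its own, while a collateral-damage list only nudges the total.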
I do not recommend you use the DCC. I highly recommend you use Razor, which IMHO addresses the shortcomings in DCC. Submissions to Razor have to be confirmed, unlike in the DCC. This way other people confirm that the message someone submits is actually spam and not JCPenney's spring mailing list. SpamAssassin can make these calls as well.
The mail system you're describing is going to be fairly large. This isn't something you want a single box handling. Ideally you'd put the spam and AV checks on a mailhub ahead of the actual MTA or cluster of MTAs. These boxes act as a spam firewall of sorts and take the CPU-intensive tasks you mentioned off of the actual mail server. I'm not actually using this type of setup myself but I will be eventually. There was a Slashdot article a while back about a setup roughly your size and what a guy did to make it work. It was quite a nice setup. I can't find the link now. IIRC, he scored mail and then sent probable spam via a separate mail queue to a separate spool for each user. Then, using IMAP, the user could check their probable spam for FPs.
You also mentioned Bayesian filtering. Let me make something very clear. Bayesian filters must be applied on a user-by-user basis. You can't simply enable Bayes for all 50,000 as one lump sum. It will never be able to learn what is and isn't spam that way. You have to let it learn on a user-by-user basis. The existing Bayes abilities within SpamAssassin don't work well (or at least easily) when SA is called from MIMEDefang. There are supposedly hacks for this but I have yet to see a working one. Along those same lines user-defined preferences also don't work well (or at least easily) fro
I would disagree strongly here.
... and then only at the beginning.
...
I use spambayes for my spam filtering.
I get about 50 items classified as spam per day. These have a spam probability according to my spam and ham corpuses of > 90% (usually 100%).
However, I also get 2-6 things classified as *possibly* spam each day. These are things which have a spam probability of > 15%.
These get mixed up in among 200-400 other messages each day.
Once I got things set up, I have *never* had anything classified definitely spam which wasn't.
Most of the stuff that's classified as *possibly* spam, isn't. Most of that tends to be company announcements, which (even though I've included all of them as ham) have enough spam indicators to confuse things.
I've had very few things which were spam get through.
Anyway, my point is that by separating out the *possible* spam from the *definite* spam, you greatly reduce how much you need to look over. I barely even glance at what is in my spam folder, but I consider each piece that goes into my possible spam folder.
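The three-way split is simple enough to sketch. The cutoffs are the ones from my setup above (over 90% is spam, over 15% is possible spam); the function name is made up for illustration and this is not spambayes' own API, just the idea.

```python
SPAM_CUTOFF = 0.90    # above this: definite spam, barely glanced at
UNSURE_CUTOFF = 0.15  # above this: possible spam, reviewed by hand


def triage(spam_probability):
    """Sort a message into one of three folders by its spam probability."""
    if spam_probability > SPAM_CUTOFF:
        return "spam"
    if spam_probability > UNSURE_CUTOFF:
        return "possible spam"
    return "ham"


if __name__ == "__main__":
    for p in (1.0, 0.5, 0.02):
        print(p, triage(p))
```

The point of the middle band is that only those few messages a day need a human look.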
In addition, spambayes requires a spam corpus to be maintained. Couldn't do that if it didn't let spam through.
Spambayes isn't designed to be used for a large number of people, but there's no reason it couldn't be, apart from the stated reasons of computation and storage space, and that it works best on an individual or small-group basis.
You could take some steps on the user education side of things. Before being given an account, they should learn a few things about how to keep their address safe, like:
Also, if you're working for an organization which may want to expose user addresses to the internet via a web site, you may want to work with the web master and legal to create a click-through agreement that would stop spam harvesting robots while only requiring a couple extra clicks for the legitimate public. Or work with the web master to create a standard human-only readable way to post email addresses, e.g. "email lauren at our domain of example.com".
You may wish to register an additional domain or two to provide disposable email address services to your users.
Consider a piece of software that blocks IPs attempting to brute-force email addresses. Some filter monitoring the logs for excessive bounces from an IP and passing it to the firewall would work. I don't know of any examples of this software, but if you're doing a large email service you may get these kinds of attacks.
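As a rough idea of what such a log monitor might look like, here is a small Python sketch. The log line format in the regular expression is an assumption for illustration (adjust it to whatever your MTA actually writes), the threshold is arbitrary, and handing the resulting IPs to the firewall is left out.

```python
import re
from collections import Counter

# Assumed log format, for illustration only, e.g.:
#   reject: RCPT from unknown[203.0.113.9]: 550 User unknown
REJECT_LINE = re.compile(
    r"reject: RCPT from \S*\[(?P<ip>[0-9.]+)\].*User unknown"
)


def ips_to_block(log_lines, threshold=20):
    """Return IPs with at least `threshold` unknown-user rejections."""
    counts = Counter()
    for line in log_lines:
        match = REJECT_LINE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return sorted(ip for ip, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    sample = (
        ["reject: RCPT from unknown[203.0.113.9]: 550 User unknown"] * 3
        + ["reject: RCPT from mail[198.51.100.4]: 550 User unknown"]
    )
    print(ips_to_block(sample, threshold=3))
```

In practice you would run this over a recent window of the log (say the last few minutes) so that one-off typos from legitimate senders don't add up to a block.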
Scanning messages for spam and rejecting at the SMTP level is a very bad idea. I'm the sysadmin for a company where about 25% of our email message traffic is spam. However, we also have a hard-working sales department who actually need commercial and sales messages. If a message from a client is marked as 'spam' because they're negotiating a sales deal, the sales staff still need to see this message. If a client's counter-offer is rejected at the mail server with a "you sent us SPAM" message, you can kiss that potential income goodbye.
False positives can be more harmful than messages getting through the spam filter.