Computationally Cheap Spam Filtering?

Posted by Cliff on Thursday May 8, 2003 @08:08AM from the cheap-and-inexpensive-crap-detection dept.

Roadmaster asks: "Usually, the most effective spam filtering techniques are somewhat resource intensive. Heuristic checkers like Spamassassin, or bayesian filters like spamprobe are processor and storage hungry. This is fine for small setups; I've been using spamprobe to filter spam for 3 users with great results. I'm now however faced with a big challenge: a mail server that will eventually be handling mail for over 50,000 users and needs to have some sort of anti-spam measures. What are some good and computationally cheap spam prevention measures?"

"Ideally, I'd prefer something that does reject the message if it's spam (SMTP result code 550 or something like that), unlike current Spamassassin or spamprobe setups that accept the message and only later decide whether it's spam. Solutions like MAPS RBL, ORBS are acceptable altough commentary on their accuracy would be welcome. Other possibilities I've thought of include checksumming (Vipul's razor or DCC) and simple header checks that could be implemented for instance in a sendmail milter.

Are several quick checks (DCC + RBL) accurate enough and still cheaper than one slow check (Spamassassin, bayesian filtering)? does stacking of similar techniques improve accuracy significantly? (DCC + Razor, RBL + ORBS). How can the good but expensive techniques be made cheaper? (Spamassassin's spamproxyd, hashed wordlists for bayesian filters, and so on). Discussion on all these aspects would yield some interesting conclusions on quick and efficient spam filtering."

1 of 85 comments (clear)

Min score:

Reason:

Sort:

Run spamd/spamc version of SpamAssassin by bill_mcgonigle · 2003-05-08 09:20 · Score: 5, Interesting

SpamAssassin can run as a daemon (see here) so it doesn't have to start up the perl interpreter for each message. This is the preferred mode for large installations.

People report processing times in the range of 0.2 to 0.5 seconds per message with basic tests (no pyzor 2). Get a fast machine with dual processors, plenty of RAM, a caching DNS server, set spamd/spamc to have an appropriate number of child processes, and you should be good to go.

It's certainly going to be cheaper than the sexual harassment lawsuit that one of those 50,000 users is going to file for being forced to look at pornographic material (we require employees to read their e-mail, don't you?).

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)