Computationally Cheap Spam Filtering?

Posted by Cliff on Thursday May 8, 2003 @08:08AM from the cheap-and-inexpensive-crap-detection dept.

Roadmaster asks: "Usually, the most effective spam filtering techniques are somewhat resource intensive. Heuristic checkers like SpamAssassin, or Bayesian filters like spamprobe, are processor and storage hungry. This is fine for small setups; I've been using spamprobe to filter spam for 3 users with great results. Now, however, I'm faced with a big challenge: a mail server that will eventually be handling mail for over 50,000 users and needs to have some sort of anti-spam measures. What are some good and computationally cheap spam prevention measures?"

"Ideally, I'd prefer something that rejects the message during the SMTP transaction if it's spam (SMTP result code 550 or something like that), unlike current SpamAssassin or spamprobe setups that accept the message and only later decide whether it's spam. Solutions like MAPS RBL and ORBS are acceptable, although commentary on their accuracy would be welcome. Other possibilities I've thought of include checksumming (Vipul's Razor or DCC) and simple header checks that could be implemented, for instance, in a sendmail milter.

Are several quick checks (DCC + RBL) accurate enough and still cheaper than one slow check (SpamAssassin, Bayesian filtering)? Does stacking similar techniques (DCC + Razor, RBL + ORBS) improve accuracy significantly? How can the good but expensive techniques be made cheaper (SpamAssassin's spamproxyd, hashed wordlists for Bayesian filters, and so on)? Discussion on all these aspects would yield some interesting conclusions on quick and efficient spam filtering."
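
For readers unfamiliar with why RBL-style checks are cheap: a DNS blacklist lookup is a single DNS query for the client's IPv4 address with its octets reversed, prepended to the list's zone. A rough Python sketch follows; the zone name `dnsbl.example.org` and both function names are placeholders for illustration, not a real list or API.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a blacklist zone."""
    # IPv4Address() validates the input and raises ValueError on garbage.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers with an A record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # No such name (or the lookup failed): treat the client as not listed.
        return False


# 192.0.2.99 is a documentation address; the zone is hypothetical.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

A listed client can then be refused with a 550 before any message data is transferred, which is where the savings come from.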

FP? by GrendelT · 2003-05-08 08:12 · Score: 2, Insightful

What kind of platform are you running on? If you'll have that many clients to support, you might consider a dedicated spam-filtering machine. That way you don't have to worry about resource-hogging filters on the mail server itself.
One very fast check is extremely effective: by -dsr- · 2003-05-08 08:59 · Score: 2, Insightful

One very fast check is extremely effective: look at the first line of each MIME attachment to see if it's a Microsoft executable file. If it is, quarantine it.

(I wish I had thought of this, but Russell Nelson did.)
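
A minimal Python sketch of this kind of check, assuming the whole message is available as bytes. Note that it is a variant, not the original recipe: instead of matching the first base64 line of the attachment, it decodes each MIME leaf part and looks for the two-byte "MZ" signature that DOS/Windows executables start with. The function name is made up for illustration.

```python
from email import message_from_bytes


def has_windows_executable(raw_message: bytes) -> bool:
    """Return True if any MIME leaf part starts with the 'MZ' signature."""
    msg = message_from_bytes(raw_message)
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        if payload and payload[:2] == b"MZ":
            return True
    return False
```

Only the first two decoded bytes of each part are inspected, so the cost is dominated by MIME parsing. A caller would quarantine the message when this returns True.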
Multi level approach by linuxwrangler · 2003-05-08 09:18 · Score: 3, Insightful

The more "low hanging fruit" you pick off the less your computationally expensive filters have to do. For example, if the other system greets you with:
EHLO your.machine.ip.address
or
EHLO your.machine.name
then it IS a spammer. Reject now. There are some patches and configurations for Postfix that let you require mail claiming to come from certain domains, like yahoo and hotmail, to arrive with a matching EHLO (e.g. a hotmail EHLO) that properly resolves. This is more expensive, as a DNS lookup is required, but the result will probably be cached locally pretty quickly.

You can also unceremoniously drop any connection that starts pipelining before you say it is OK to pipeline and any EHLO that has an illegal hostname.
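
The EHLO checks above (client claims to be us, or sends a syntactically illegal name) could be sketched in Python roughly as follows. The names in `OUR_NAMES` are placeholders for your own server's hostname and address, and the functions are hypothetical helpers, not part of Postfix or any other MTA. The pipelining check is not covered here, since it depends on the server's I/O loop.

```python
import ipaddress
import re

# Placeholder identity of our own server (assumption for illustration).
OUR_NAMES = frozenset({"mx.example.org", "192.0.2.10"})

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def is_legal_helo(name: str) -> bool:
    """Syntax check only: a dotted hostname or a bracketed address literal."""
    if name.startswith("[") and name.endswith("]"):
        literal = name[1:-1]
        if literal.lower().startswith("ipv6:"):
            literal = literal[5:]
        try:
            ipaddress.ip_address(literal)
            return True
        except ValueError:
            return False
    if len(name) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in name.split("."))


def helo_verdict(helo_name: str, our_names=OUR_NAMES) -> str:
    """Return 'ok' or a short rejection reason for an EHLO/HELO argument."""
    name = helo_name.strip().lower().rstrip(".")
    if name in our_names or name.strip("[]") in our_names:
        return "reject: client claims to be us"
    if not is_legal_helo(name):
        return "reject: illegal hostname"
    return "ok"
```

Both tests are pure string comparisons with no DNS traffic, which is why they make good first-line filters.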

This, at least, reduces the work your scanning engines will have to do. Still, even if you catch nearly all the spam with the easy checks you will only reduce your mail volume by ~40% (current estimated overall spam volume) so that leaves you with 60% to scan.

I suppose your main MX could do the easy checks then send the remainder off to as many round-robin scanners as necessary which in turn could pass the mail on for delivery.
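
As a toy illustration of that hand-off, the main MX could pick the next scanner in strict rotation. The scanner hostnames below are invented, and a real deployment would more likely rely on DNS round-robin or a load balancer than on application code.

```python
from itertools import cycle

# Hypothetical pool of scanning hosts behind the main MX.
SCANNERS = ["scan1.example.org", "scan2.example.org", "scan3.example.org"]


def make_dispatcher(scanners):
    """Return a function that yields scanner hosts in round-robin order."""
    ring = cycle(list(scanners))

    def next_scanner():
        return next(ring)

    return next_scanner


pick = make_dispatcher(SCANNERS)
for _ in range(4):
    print(pick())  # cycles through the pool, then wraps around
```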

One starts to realize why some places just roll over and pay tens of thousands of dollars to someone else to do it for them.

--

~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
Re:A few comments... by aridhol · 2003-05-08 09:28 · Score: 4, Insightful

If it's for a company email system, make sure that the higher-ups know exactly the implications of this system. And I don't mean just the CEO. Make sure that every department head knows. Go to a meeting with them first, to discuss it. Do not enforce your own policy. That's not your job.
Make sure that people know that they can (and probably will) lose legitimate email. Make sure there's a way to bypass the filters. For example, hold the email until you can confirm the sender (reply to the sender, and if your message bounces or isn't answered within n days, delete the held mail). Let users set up their own configuration (scores, whitelists, etc.), but be able to override some things (e.g. don't let them blacklist internal mail).
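
The hold-until-confirmed idea might look something like this in Python. It is an in-memory sketch only: the class and method names are invented, sending the confirmation request and detecting bounces are left out, and `max_age_days` stands in for the "n days" above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class HeldMessage:
    sender: str
    body: bytes
    held_at: datetime


class HoldQueue:
    """Hold mail from unknown senders until they confirm, or until it expires."""

    def __init__(self, max_age_days: int):
        self.max_age = timedelta(days=max_age_days)
        self.confirmed = set()
        self.held = []

    def submit(self, sender: str, body: bytes, now: datetime) -> bool:
        """Return True if the message may be delivered now, False if held."""
        sender = sender.lower()
        if sender in self.confirmed:
            return True
        self.held.append(HeldMessage(sender, body, now))
        return False

    def confirm(self, sender: str) -> list:
        """The sender answered the confirmation request: release their mail."""
        sender = sender.lower()
        self.confirmed.add(sender)
        released = [m for m in self.held if m.sender == sender]
        self.held = [m for m in self.held if m.sender != sender]
        return released

    def expire(self, now: datetime) -> int:
        """Delete held mail that reached max_age; return how many were dropped."""
        before = len(self.held)
        self.held = [m for m in self.held if now - m.held_at < self.max_age]
        return before - len(self.held)
```

Internal senders would be pre-loaded into `confirmed` so that user settings can never cause internal mail to be held.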

--
I can't say that I don't give a fuck. I've just run out of fuck to give.
Filter via proxy, not LDA by runswithd6s · 2003-05-08 10:52 · Score: 3, Insightful
Spam and virus filtering in an efficient manner for anyone is a major issue, and it has already been mentioned that there are multiple ways of accomplishing this. In designing your process, think in terms of dropping or rejecting email from the process loop as soon as possible.
At the SMTP server
- Drop email from known blacklisted servers via your email server access file
- Allow email from known whitelisted servers or addresses
- Use RBL lists
- Filter out misbehaving SMTP servers, ones that don't follow standard protocols
- Disable SMTP commands that give the spammer access to your local user lists (VRFY, EXPN, etc.)
- Only relay email from authenticated servers and users
- Impose a size limit to messages (50k) if possible.
At the SMTP Filter Proxy Server or LDA
- Allow emails from recipient-based whitelists
- Drop emails from recipient-based blacklists
- Process Tagged messages (from TMDA)
- Run your faster classification programs: clamav (for viruses), bogofilter (Bayesian)
- Run your slower classification programs: procmail, spamassassin
Just remember to shortcut the process along the way. If email can be dropped or tagged for any reason, do so immediately and quit processing it.
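
The "shortcut the process" rule can be sketched as an ordered list of checks, cheapest first, where the first check to return a verdict ends processing. Everything here is illustrative: the message is modelled as a plain dict, and the addresses, whitelist entry, and check functions are made-up stand-ins for the real lookups listed above.

```python
from typing import Callable, Optional, Sequence

# A check returns a verdict string, or None to pass the message along.
Check = Callable[[dict], Optional[str]]

BLACKLISTED_IPS = {"203.0.113.5"}            # placeholder data
WHITELISTED_SENDERS = {"boss@example.org"}   # placeholder data
MAX_SIZE = 50 * 1024                         # the 50k limit suggested above


def blacklist_check(msg: dict) -> Optional[str]:
    return "reject" if msg["client_ip"] in BLACKLISTED_IPS else None


def whitelist_check(msg: dict) -> Optional[str]:
    return "deliver" if msg["sender"] in WHITELISTED_SENDERS else None


def size_check(msg: dict) -> Optional[str]:
    return "reject" if msg["size"] > MAX_SIZE else None


def run_pipeline(msg: dict, checks: Sequence[Check]) -> str:
    """Run checks in order and stop at the first one that decides."""
    for check in checks:
        verdict = check(msg)
        if verdict is not None:
            return verdict  # later, more expensive checks never run
    return "deliver"


CHEAP_FIRST = [blacklist_check, whitelist_check, size_check]
print(run_pipeline(
    {"client_ip": "203.0.113.5", "sender": "x@example.net", "size": 1200},
    CHEAP_FIRST,
))
```

The slow scanners (spamassassin and friends) would simply be appended to the end of the list, so they only see mail that every cheap check has passed.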
--
assert(expired(knowledge)); /* core dump */