Slashdot Mirror


Computationally Cheap Spam Filtering?

Roadmaster asks: "Usually, the most effective spam filtering techniques are somewhat resource intensive. Heuristic checkers like SpamAssassin, or Bayesian filters like spamprobe, are processor and storage hungry. This is fine for small setups; I've been using spamprobe to filter spam for 3 users with great results. However, I'm now faced with a big challenge: a mail server that will eventually be handling mail for over 50,000 users and needs to have some sort of anti-spam measures. What are some good and computationally cheap spam prevention measures?"

"Ideally, I'd prefer something that rejects the message during the SMTP transaction if it's spam (SMTP result code 550 or something like that), unlike current SpamAssassin or spamprobe setups that accept the message and only later decide whether it's spam. Solutions like MAPS RBL or ORBS are acceptable, although commentary on their accuracy would be welcome. Other possibilities I've thought of include checksumming (Vipul's Razor or DCC) and simple header checks that could be implemented, for instance, in a sendmail milter.

Are several quick checks (DCC + RBL) accurate enough and still cheaper than one slow check (SpamAssassin, Bayesian filtering)? Does stacking of similar techniques improve accuracy significantly (DCC + Razor, RBL + ORBS)? How can the good but expensive techniques be made cheaper (SpamAssassin's spamproxyd, hashed wordlists for Bayesian filters, and so on)? Discussion on all these aspects would yield some interesting conclusions on quick and efficient spam filtering."
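
As a rough illustration of why an RBL-style check is cheap: it comes down to one DNS lookup on the connecting IP address, which can be done before the message body is ever accepted. The sketch below is only that, a sketch. The zone name `dnsbl.example.org` is a placeholder rather than a real list, and `dnsbl_query_name`, `is_listed` and `smtp_reply` are hypothetical helper names, not part of sendmail or any milter API.

```python
import ipaddress
import socket


def dnsbl_query_name(client_ip: str, zone: str) -> str:
    """Build the DNS name to query: reversed IPv4 octets followed by the list's zone."""
    octets = str(ipaddress.IPv4Address(client_ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(client_ip: str, zone: str) -> bool:
    """Return True if the zone has an A record for this client address.

    Assumption: any answer means "listed". A lookup failure (including a
    DNS outage) is treated as "not listed" so that mail keeps flowing.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(client_ip, zone))
        return True
    except socket.gaierror:
        return False


def smtp_reply(listed: bool) -> str:
    """Pick the SMTP reply: 550 rejects permanently, 250 accepts."""
    if listed:
        return "550 5.7.1 Rejected: connecting host is on a blocklist"
    return "250 OK"


if __name__ == "__main__":
    # Placeholder zone; substitute the list you actually subscribe to.
    print(dnsbl_query_name("217.40.92.155", "dnsbl.example.org"))
```

Because the decision is made at connection time, the sender gets the rejection directly and the server never has to store or scan the message.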

4 of 85 comments

  1. Hardware Virus Checker by Kalak · · Score: 4, Informative

    At our university, Virginia Tech, the hardware e-mail virus scanners (Mirapoint Messaging Server) now run SpamAssassin as well, and they put the results in the headers (sample below). Filter for "X-Junkmail: UCE" and you've got a spam filter (though I run a more aggressive SA on my workstation, since I can customize it there).

    Return-Path:
    Received: from vt.edu (gkar.cc.vt.edu [198.82.161.196]) by xxxx.xxxx.vt.edu (8.12.8/linuxconf) with ESMTP id h47JISRm004277 for ; Wed, 7 May 2003 15:18:28 -0400
    Received: from steiner.cc.vt.edu ([10.1.1.14]) by gkar.cc.vt.edu (Sun Internet Mail Server sims.3.5.2001.05.04.11.50.p10) with ESMTP id for noone@xxxx.xxxx.vt.edu; Wed, 7 May 2003 15:18:31 -0400 (EDT)
    Received: from aol.com (host217-40-92-155.in-addr.btopenworld.com [217.40.92.155]) by steiner.cc.vt.edu (Mirapoint Messaging Server MOS 3.3.2-CR) with SMTP id BIE36579; Wed, 07 May 2003 15:18:17 -0400 (EDT)
    Date: Thu, 08 May 2003 03:13:26 -0800
    From: Kate Welsh
    Subject: [SPAM] Remember me?
    To: spam@vt.edu
    Message-id:
    MIME-version: 1.0
    X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
    Importance: Normal
    X-Junkmail: UCE(58)
    X-Priority: 3
    X-Spam-Status: Yes, hits=13.0 required=5.0 tests=ALL_NATURAL,BASE64_ENC_TEXT,BIG_FONT,CLICK_BELOW,CLICK_HERE_LINK,DATE_IN_FUTURE_12_24,HGH,HTML_FONT_COLOR_CYAN,HTML_FONT_COLOR_GRAY,HTML_FONT_COLOR_NAME,HTML_FONT_COLOR_RED,HTML_FONT_COLOR_UNSAFE,HTML_FONT_COLOR_YELLOW,NO_QS_ASKED,RCVD_IN_DSBL,REMOVE_PAGE,SPAM_PHRASE_13_21,SPAM_REDIRECTOR,SUSPICIOUS_RECIPS,USER_AGENT_OUTLOOK,VERY_SUSP_RECIPS version=2.44
    X-Spam-Flag: YES
    X-Spam-Level: *************
    X-Spam-Checker-Version: SpamAssassin 2.44 (1.115.2.24-2003-01-30-exp)
    X-Spam-Prev-Content-Type: MULTIPART/MIXED; BOUNDARY="Boundary_(ID_d+Bzp/dF6h/2OkPD89OTbQ)"
    Content-Type: text/plain; CHARSET=US-ASCII
    X-Evolution-Source: imap://jackie@localhost/
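
    Filtering on a tag like the one in the sample above takes only a header lookup on the client side. Here is a minimal sketch using Python's standard `email` module. The function name `is_tagged_spam` is made up for illustration, and also checking `X-Spam-Flag` is my assumption based on the sample headers, not something the scanner documentation is known to promise.

```python
from email import message_from_string


def is_tagged_spam(raw_message: str) -> bool:
    """Return True if an upstream scanner already tagged this message as spam."""
    msg = message_from_string(raw_message)
    # Header names are matched case-insensitively; missing headers become "".
    junk = msg.get("X-Junkmail", "").strip().upper()
    flag = msg.get("X-Spam-Flag", "").strip().upper()
    return junk.startswith("UCE") or flag == "YES"


# Trimmed-down version of the sample headers, plus an untagged message.
tagged = (
    "From: Kate Welsh\n"
    "Subject: [SPAM] Remember me?\n"
    "X-Junkmail: UCE(58)\n"
    "X-Spam-Flag: YES\n"
    "\n"
    "message body\n"
)
untagged = "From: someone\nSubject: lunch?\n\nmessage body\n"

if __name__ == "__main__":
    print(is_tagged_spam(tagged), is_tagged_spam(untagged))
```

    The expensive scan happens once on the server; each client only reads a header.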

    --
    I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)
  2. A few comments... by aridhol · · Score: 4, Informative
    Ideally, I'd prefer something that does reject the message if it's spam
    Careful. I'm assuming that these will be clients or co-workers or similar. BE CAREFUL. You do not want to drop messages. What happens if a client's email is lost because it looks like spam (it refers to money, etc.)? Better to tag it, and let the user decide.

    Are several quick checks (DCC + RBL) accurate enough and still cheaper than one slow check (Spamassassin, bayesian filtering)? does stacking of similar techniques improve accuracy significantly? (DCC + Razor, RBL + ORBS).
    SpamAssassin is a stacking of checks. You can set up its config to skip those checks that you don't want to bother with. You may need to adjust scores if you do that, however.
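
    To make the stacking concrete: each test that fires adds its score, and the total is compared against a threshold (the sample header in the first comment shows hits=13.0 against required=5.0). The sketch below is a toy model of that idea and not SpamAssassin's code. The test names are borrowed from that sample header, but every score value and the helper names `spam_score` and `classify` are invented for illustration.

```python
# Hypothetical per-test scores; real rule sets assign their own values.
SCORES = {
    "RCVD_IN_DSBL": 2.5,          # network (blocklist) test
    "DATE_IN_FUTURE_12_24": 2.0,  # header test
    "BIG_FONT": 0.5,              # body test
    "CLICK_BELOW": 0.5,           # body test
}


def spam_score(hits, scores, disabled=frozenset()):
    """Sum the scores of the tests that fired, skipping disabled tests."""
    return sum(scores.get(name, 0.0) for name in hits if name not in disabled)


def classify(hits, scores, required=5.0, disabled=frozenset()):
    """Return (total, is_spam); a message is spam at or above the threshold."""
    total = spam_score(hits, scores, disabled)
    return total, total >= required


if __name__ == "__main__":
    hits = ["RCVD_IN_DSBL", "DATE_IN_FUTURE_12_24", "BIG_FONT", "CLICK_BELOW"]
    print(classify(hits, SCORES))
    # Skipping the network test lowers the total, so the same message slips through...
    print(classify(hits, SCORES, disabled={"RCVD_IN_DSBL"}))
    # ...unless the threshold (or the remaining scores) is adjusted to compensate.
    print(classify(hits, SCORES, required=3.0, disabled={"RCVD_IN_DSBL"}))
```

    This is why scores may need adjusting after you switch checks off: the threshold was tuned with all of them contributing.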
    --
    I can't say that I don't give a fuck. I've just run out of fuck to give.
  3. Spam Assassin and/or Popfile by Universal+Nerd · · Score: 5, Informative

    Ok, I'm not the email admin here at work, I avoid the whole mail subsystem 'cause I already have enough to do elsewhere.

    Anyway, for our 1500 users we use SpamAssassin with RBLs and blacklists, and our meager server (PIII 1.26 GHz with 512 MB RAM) doesn't even reach a load of 0.20. The heuristics are turned down because of their processor usage, but it still filters about 90% of the spam with very little load.

    I personally use Popfile (search SourceForge) as my own filter. Its database is not that big right now: about 8 MB after more than 200,000 emails, counting the initial training (from my huge spam database) and a year of normal usage by me and a dozen other users. It is very easy to set up and use; you just need to train it with a good database. Its stats report a 99.85% correctness rate. That machine has reached a load of 0.20 running Popfile on a Monday morning after a long weekend, but it is half the machine the other one is: a PIII 667 MHz with 256 MB RAM.
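
    For anyone wondering what "training" means here, below is a toy word-count (naive Bayes) classifier. It is a minimal sketch of the general technique and not Popfile's actual code; the class name `TinyBayes`, the whitespace tokenizer and the add-one smoothing are all my own assumptions.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive Bayes text classifier with two buckets: spam and ham."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, label, text):
        """Count the words of one message under the given label."""
        self.tokens[label].update(text.lower().split())
        self.messages[label] += 1

    def log_score(self, label, text):
        """Log prior plus the sum of smoothed log word likelihoods.

        Needs at least one trained message per label, or the prior is log(0).
        """
        counts = self.tokens[label]
        total = sum(counts.values())
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"]))
        prior = math.log(self.messages[label] / sum(self.messages.values()))
        # Add-one smoothing keeps unseen words from zeroing out the score.
        return prior + sum(
            math.log((counts[word] + 1) / (total + vocab))
            for word in text.lower().split()
        )

    def classify(self, text):
        """Return whichever label scores higher for this text."""
        return max(("spam", "ham"), key=lambda label: self.log_score(label, text))


if __name__ == "__main__":
    nb = TinyBayes()
    nb.train("spam", "buy cheap pills now")
    nb.train("spam", "cheap pills click here")
    nb.train("ham", "meeting notes attached")
    nb.train("ham", "lunch meeting tomorrow")
    print(nb.classify("cheap pills"), nb.classify("meeting tomorrow"))
```

    The stored state is just word counts, which is why the database stays small, and classifying a message costs one table lookup per word.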

    --
    Ash nazg durbatuluk, ash nazg gimbatul Ash nazg thrakatuluk agh burzum-ishi krimpatul
  4. The problem is hard by PD · · Score: 5, Informative

    Classifying spam is essentially the same problem as classifying programs into those that terminate and those that don't (the halting problem). This leads us to the following conclusions:

    1) Filtering spam is not trivial. A program that filters spam X% better than another program will be X^2% more complicated or worse.

    2) You can't write a program that will filter perfectly. At best, all you can do is develop a set of heuristics that you hope aren't too complicated. The less complicated the heuristic, the fewer resources it will require.

    3) There's a limit to how simple your heuristics can be.

    4) The system of spam is not just the message: it's the spammer, plus the message, plus the recipient. This is because a certain message considered spam by some will not be considered spam by others. That means that the heuristics that account for the person reading the spam will be better than those that don't. The source of a spam is also important: a message consisting of a spam report to a spam newsgroup is not a spam, though it may contain a complete spam message.

    5) The best spam filters will eventually be AIs that understand human language. That means that the ultimate spam filter will require enough processing power to model human cognitive abilities. In short, you're going to see an endless increase in the number of processor cycles consumed by spam filters, asymptotically approaching the requirements of a full-up human brain simulation.

    On the other hand, this will sell a hell of a lot of computers.