Slashdot Mirror


Computationally Cheap Spam Filtering?

Roadmaster asks: "Usually, the most effective spam filtering techniques are somewhat resource intensive. Heuristic checkers like SpamAssassin, or Bayesian filters like spamprobe, are processor and storage hungry. This is fine for small setups; I've been using spamprobe to filter spam for 3 users with great results. However, I'm now faced with a big challenge: a mail server that will eventually be handling mail for over 50,000 users and needs to have some sort of anti-spam measures. What are some good and computationally cheap spam prevention measures?"

"Ideally, I'd prefer something that does reject the message if it's spam (SMTP result code 550 or something like that), unlike current SpamAssassin or spamprobe setups that accept the message and only later decide whether it's spam. Solutions like MAPS RBL and ORBS are acceptable, although commentary on their accuracy would be welcome. Other possibilities I've thought of include checksumming (Vipul's Razor or DCC) and simple header checks that could be implemented, for instance, in a sendmail milter.

Are several quick checks (DCC + RBL) accurate enough and still cheaper than one slow check (SpamAssassin, Bayesian filtering)? Does stacking of similar techniques improve accuracy significantly? (DCC + Razor, RBL + ORBS). How can the good but expensive techniques be made cheaper? (SpamAssassin's spamproxyd, hashed wordlists for Bayesian filters, and so on). Discussion on all these aspects would yield some interesting conclusions on quick and efficient spam filtering."

85 comments

  1. FP? by GrendelT · · Score: 2, Insightful

    What kind of platform are you running on? If you'll have that many clients to support, you might consider having a dedicated spam-filtering machine. That way you don't have to worry about resource-hogging filters.

  2. Solution by m0rph3us0 · · Score: 3, Interesting

    Personally, I would recommend going with SpamAssassin; RBL and the like catch too much non-spam and cause user problems. I would recommend that you set up a load balancer + DHCP server next to your mail server. Then modify a copy of Knoppix to run as a SpamAssassin server and use MySQL for the SpamAssassin configuration, then buy several cheap consumer-level PCs and plug them in to cheaply scale your processing power. Then run SpamAssassin for 50,000 users.

    1. Re:Solution by Anonymous Coward · · Score: 0

      what do you mean by several for the PCs?

      We have an oldish computer here that does nothing but SpamAssassin for like 15 users and it's too slow. Email is getting held up for hours. Plus, spammers are attaching huge junk files to the emails to trigger SA's auto-time out. So the spams are still getting through, but also loading down the SA machine as much as possible. Our timeout's at like 30 seconds. So a spam to everyone here takes 7-8 minutes to look at it, and then timeout and ultimately pass on. If only 100 spams a day come in per user, you're looking at 13 hours of computer delay time wasted on those spams which aren't even being caught and which a faster computer won't help. Email starts clogging up.
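
      As a rough check on those figures, the arithmetic can be sketched as follows. It assumes (this is not stated in the post) that messages are scanned one at a time and that every spam runs into the full 30-second timeout; the function name is purely illustrative.

```python
# Back-of-envelope check of the delay figures quoted in the post.
USERS = 15                     # mailboxes behind the SpamAssassin box
TIMEOUT_S = 30                 # per-message timeout, from the post
SPAMS_PER_USER_PER_DAY = 100   # "if only 100 spams a day come in per user"


def timeout_cost_seconds(users, timeout_s, spams_per_user):
    """Seconds spent waiting on scans that time out anyway (serial scanning assumed)."""
    return users * timeout_s * spams_per_user


one_spam_to_everyone = timeout_cost_seconds(USERS, TIMEOUT_S, 1)
per_day = timeout_cost_seconds(USERS, TIMEOUT_S, SPAMS_PER_USER_PER_DAY)

print(one_spam_to_everyone / 60, "minutes for one spam sent to every user")
print(per_day / 3600, "hours of scanner time per day")
```

      That works out to 7.5 minutes for one spam sent to all 15 users and 12.5 hours per day, in line with the "7-8 minutes" and "13 hours" quoted above.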

      RBL and its ilk are the devil though. Big probs for our users.

    2. Re:Solution by Gordonjcp · · Score: 1

      You have other problems then. I run SpamAssassin for around 15 users on an old 486/100, and it takes around 1-2 seconds to handle each mail message. Try fixing your mail configuration first.

  3. Hardware Virus Checker by Kalak · · Score: 4, Informative

    At our university, Virginia Tech, the hardware e-mail virus scanners (Mirapoint Messaging Server) also do SpamAssassin now, and it puts info in the headers (sample below). Filter for "X-Junkmail: UCE" and you've got a spam filter (though I run a more aggressive SA on my workstation, since I can customize it there).

    Return-Path:
    Received: from vt.edu (gkar.cc.vt.edu [198.82.161.196]) by xxxx.xxxx.vt.edu (8.12.8/linuxconf) with ESMTP id h47JISRm004277 for ; Wed, 7 May 2003 15:18:28 -0400
    Received: from steiner.cc.vt.edu ([10.1.1.14]) by gkar.cc.vt.edu (Sun Internet Mail Server sims.3.5.2001.05.04.11.50.p10) with ESMTP id for noone@xxxx.xxxx.vt.edu; Wed, 7 May 2003 15:18:31 -0400 (EDT)
    Received: from aol.com (host217-40-92-155.in-addr.btopenworld.com [217.40.92.155]) by steiner.cc.vt.edu (Mirapoint Messaging Server MOS 3.3.2-CR) with SMTP id BIE36579; Wed, 07 May 2003 15:18:17 -0400 (EDT)
    Date: Thu, 08 May 2003 03:13:26 -0800
    From: Kate Welsh
    Subject: [SPAM] Remember me?
    To: spam@vt.edu
    Message-id:
    MIME-version: 1.0
    X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
    Importance: Normal
    X-Junkmail: UCE(58)
    X-Priority: 3
    X-Spam-Status: Yes, hits=13.0 required=5.0 tests=ALL_NATURAL,BASE64_ENC_TEXT,BIG_FONT,CLICK_BELOW,CLICK_HERE_LINK,DATE_IN_FUTURE_12_24,HGH,HTML_FONT_COLOR_CYAN,HTML_FONT_COLOR_GRAY,HTML_FONT_COLOR_NAME,HTML_FONT_COLOR_RED,HTML_FONT_COLOR_UNSAFE,HTML_FONT_COLOR_YELLOW,NO_QS_ASKED,RCVD_IN_DSBL,REMOVE_PAGE,SPAM_PHRASE_13_21,SPAM_REDIRECTOR,SUSPICIOUS_RECIPS,USER_AGENT_OUTLOOK,VERY_SUSP_RECIPS version=2.44
    X-Spam-Flag: YES
    X-Spam-Level: *************
    X-Spam-Checker-Version: SpamAssassin 2.44 (1.115.2.24-2003-01-30-exp)
    X-Spam-Prev-Content-Type: MULTIPART/MIXED; BOUNDARY="Boundary_(ID_d+Bzp/dF6h/2OkPD89OTbQ)"
    Content-Type: text/plain; CHARSET=US-ASCII
    X-Evolution-Source: imap://jackie@localhost/

    --
    I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)
    1. Re:Hardware Virus Checker by itwerx · · Score: 1

      I just looked around on Mirapoint's site and I can't find any reference to SpamAssassin. Where did you hear that's what they use? Or did I misunderstand the post?

    2. Re:Hardware Virus Checker by cymen · · Score: 1

      Well if he's getting this line in his email headers:

      "X-Spam-Checker-Version: SpamAssassin 2.44 (1.115.2.24-2003-01-30-exp)"

      Then....

    3. Re:Hardware Virus Checker by Kalak · · Score: 1

      Our mail server admins let VT know about it during testing. It was recently purchased after a successful trial. I'm not sure the arrangement, but you can probably ask postmaster at vt dot edu and get a good referral on the setup.

      --
      I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)
    4. Re:Hardware Virus Checker by itwerx · · Score: 1

      That's kind of interesting because McAfee recently bought DeerSoft, which was the only commercial provider of SpamAssassin I could find recently when our corporate powers-that-be mandated that we would have a spam filter and that it would not be open-source.
      Think I could get an "official" statement from MiraPoint that it uses SpamAssassin? We'd rather not wait for McAfee to get their act together...
      Any contacts there I could call/email? (If you don't want to reply publicly I can be reached via slashdot@itwerx.net :)

      Thanks!

    5. Re:Hardware Virus Checker by Kalak · · Score: 1

      postmaster at vt dot edu would be the best start.

      --
      I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)
    6. Re:Hardware Virus Checker by itwerx · · Score: 1

      Tried that but no response yet as of the time of this posting. :(

  4. Client side? by Drakon · · Score: 2, Informative

    how about filtering the mail client side?

    For Outlook users, I recommend Spammunition, and I just use Mozilla's spam filtering, which works great.

    1. Re:Client side? by shepd · · Score: 1

      Query: Why is it that cool software for outlook never exists for outlook express?

      Do the authors get a special deal from M$ for not making their software work with express?

      --
      If you could be told what you can see or read, then it follows that you could be told what to say or think - BoC
    2. Re:Client side? by jbert · · Score: 1

      Because Outlook is based on MAPI (allowing a certain degree of plug-in behaviour) and has its own plug-in architecture which makes it very extensible, whereas Outlook Express is (was?) much more monolithic.

      Very different programs in all but name.

  5. Some Ideas by rmull · · Score: 2, Interesting

    Caveat: I know very little about email in the real world. These are just my thoughts.

    What are your requirements? Do you have very limited hardware to work with? Do you need a particularly low latency for delivery? How many messages do you need to process per minute? (or per second)

    If it's possible, having a separate spam filtering box might be a good idea. If that gets loaded down you could even make a cluster of them. I'm not sure that high-level spam filtering really takes as much CPU time as you're implying, but even so it should be pretty simple to set up something like this.

    Another possibility is to limit the amount of CPU time that the spam filtering process takes and simply bypass it when it can't be done. Perhaps a mail can wait a maximum of 10 seconds before being automatically sent on. This could even be combined with a separate server or cluster approach. I have no idea how this would be implemented, though I have a hunch that qmail or exim would be at least extensible enough to allow it. I think it's worth jumping through hoops to keep the high-level spam filtering.
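
    A minimal sketch of the "give up after N seconds" idea, assuming the filter is an external command that reads the message on stdin and exits with status 1 for spam. That exit-code convention, the function name and the stand-in commands are assumptions for illustration, not the interface of any particular filter or MTA.

```python
import subprocess
import sys


def classify_with_deadline(message: bytes, filter_cmd, deadline_s: float) -> str:
    """Return 'spam', 'ham', or 'unscanned' if the filter misses the deadline."""
    try:
        result = subprocess.run(
            filter_cmd,
            input=message,
            timeout=deadline_s,          # the child is killed when this expires
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "unscanned"               # bypass the filter and deliver untagged
    # Assumed convention: exit status 1 means spam, anything else means ham.
    return "spam" if result.returncode == 1 else "ham"


# Stand-in "filters" so the sketch runs anywhere: one hangs, one says spam.
SLOW_FILTER = [sys.executable, "-c", "import time; time.sleep(5)"]
SPAM_FILTER = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(classify_with_deadline(b"Subject: hi\n\nhello\n", SLOW_FILTER, 0.5))
print(classify_with_deadline(b"Subject: hi\n\nhello\n", SPAM_FILTER, 10))
```

    The trade-off is the one raised elsewhere in this thread: anything that forces a timeout gets through unscanned, so the deadline needs to be generous enough that padding a message is not an easy bypass.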

    Thirdly, you could try turning the mail filter into a "server" program itself, so you don't have to start a new process for each email you filter.

    --
    See you, space cowboy...
  6. Another cheap way.... by m0rph3us0 · · Score: 2, Informative

    Use TDMA http://tmda.net/.

    1. Re:Another cheap way.... by joshuac · · Score: 1, Funny

      nah, CDMA is way better.

  7. A few comments... by aridhol · · Score: 4, Informative
    Ideally, I'd prefer something that does reject the message if it's spam
    Careful. I'm assuming that these will be clients or co-workers or similar. BE CAREFUL. You do not want to drop messages. What happens if a client's email is lost because it looks like spam (refers to money, etc). Better to tag it, and let the user decide.

    Are several quick checks (DCC + RBL) accurate enough and still cheaper than one slow check (SpamAssassin, Bayesian filtering)? Does stacking of similar techniques improve accuracy significantly? (DCC + Razor, RBL + ORBS).
    SpamAssassin is a stacking of checks. You can set up its config to skip those checks that you don't want to bother with. You may need to adjust scores if you do that, however.
    --
    I can't say that I don't give a fuck. I've just run out of fuck to give.
    1. Re:A few comments... by Harik · · Score: 2, Interesting
      Careful. I'm assuming that these will be clients or co-workers or similar. BE CAREFUL. You do not want to drop messages. What happens if a client's email is lost because it looks like spam (refers to money, etc). Better to tag it, and let the user decide.
      Um, don't even bother. Either filter and drop the spam, or just let everything through. Having someone go through all the marked spam messages is just as wasteful as going through the unmarked ones. If you're that afraid of dropping something, consider this: People select-all *SPAM* delete. Why should that be part of their daily routine? Why waste the storage space and network bandwidth to make a human do what you can do on the mailserver?

      I've gotten many requests to tag people's mail rather than deleting it. Within a month, they all say 'fuckit, just toss it.'

      --Dan

    2. Re:A few comments... by aridhol · · Score: 4, Insightful
      If it's for a company email system, make sure that the higher-ups know exactly the implications of this system. And I don't mean just the CEO. Make sure that every department head knows. Go to a meeting with them first, to discuss it. Do not enforce your own policy. That's not your job.

      Make sure that people know that they can (and probably will) lose legitimate email. Make sure there's a way to bypass the filters. For example, hold the email until you can confirm the sender (reply to sender, and if your message bounces or isn't replied to in n days, delete). Let users setup their own configuration (scores, whitelists, etc), but be able to override some things (eg don't let them blacklist internal mail).

      --
      I can't say that I don't give a fuck. I've just run out of fuck to give.
    3. Re:A few comments... by tdelaney · · Score: 2, Interesting

      I would disagree strongly here.

      I use spambayes for my spam filtering.

      I get about 50 items classified as spam per day. These have a spam probability according to my spam and ham corpuses of > 90% (usually 100%).

      However, I also get 2-6 things classified as *possibly* spam each day. These are things which have a spam probability of > 15%.

      These get mixed up in among 200-400 other messages each day.

      Once I got things set up, I have *never* had anything classified definitely spam which wasn't.

      Most of the stuff that's classified as *possibly* spam, isn't. Most of that tends to be company announcements, which (even though I've included all of them as ham) have enough spam indicators to confuse things.

      I've had very few things which were spam get through ... and then only at the beginning.

      Anyway, my point is that by separating out the *possible* spam from the *definite* spam, you greatly reduce how much you need to look over. I barely even glance at what is in my spam folder, but I consider each piece that goes into my possible spam folder.
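
      The three-way split described above can be sketched like this. The cutoffs are the ones quoted in this post (over 90% for spam, over 15% for possible spam); the function and folder names are made up for illustration and are not spambayes' own interface.

```python
SPAM_CUTOFF = 0.90     # "spam probability ... of > 90%"
UNSURE_CUTOFF = 0.15   # "spam probability of > 15%"


def folder_for(spam_probability: float) -> str:
    """Route a message by classifier score into one of three folders."""
    if spam_probability > SPAM_CUTOFF:
        return "spam"            # barely needs a glance
    if spam_probability > UNSURE_CUTOFF:
        return "possible-spam"   # the handful worth reviewing by hand
    return "inbox"


for score in (1.0, 0.5, 0.05):
    print(score, "->", folder_for(score))
```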

      In addition, spambayes requires a spam corpus to be maintained. Couldn't do that if it didn't let spam through ...

      Spambayes isn't designed to be used for a large number of people, but there's no reason it couldn't be, apart from the stated reasons of computation and storage space, and that it works best on an individual or small group basis.

    4. Re:A few comments... by milkman_matt · · Score: 2, Informative
      Careful. I'm assuming that these will be clients or co-workers or similar. BE CAREFUL. You do not want to drop messages. What happens if a client's email is lost because it looks like spam (refers to money, etc). Better to tag it, and let the user decide.

      I was wondering this myself when I set up spamassassin on our mail server here (We're a web hosting provider using Communigate Pro). It filters mail for everyone by just changing the subject line to let them know that it's been marked as spam. From there they can read it, delete it, filter it based on the new common subject line into a junk folder or whatnot.. I got sick of doing that, so what I did is this: I set it up to be read in by the mail server, then I applied a rule to it which states If the message is tagged as spam, reply with the following custom message. To paraphrase.. that message basically says "the message you sent was considered spam by our email system's spam filter for reasons stated in the header of this message. if you'd like to get ahold of me about this you can call me at (number+ext)." This has worked out pretty well for us so far.

      -matt

  8. Spam Assassin and/or Popfile by Universal+Nerd · · Score: 5, Informative

    Ok, I'm not the email admin here at work, I avoid the whole mail subsystem 'cause I already have enough to do elsewhere.

    Anyway, for our 1500 users we use SpamAssassin with RBL and blacklists, and the load on our meager server (PIII 1.26GHz with 512MB RAM) doesn't even reach 0.20. The heuristics are turned down because of the processor usage, but it filters about 90% of the spam with very little load.

    I, personally, use Popfile (search Sourceforge) as my personal filter. Its database is not that big right now, just some 8MB after over 200,000 emails since training (from my huge spam database) and normal usage over the past year for me and a dozen other users. It's very easy to set up and use; you just need to train it with a good database. Its stats state that it has a 99.85% correctness rate. The machine has reached .20 running Popfile during a Monday morning after a long weekend, but that machine is half the other one, a PIII 667MHz with 256MB RAM.

    --
    Ash nazg durbatuluk, ash nazg gimbatul Ash nazg thrakatuluk agh burzum-ishi krimpatul
    1. Re:Spam Assassin and/or Popfile by Magus311X · · Score: 1

      I'm also going to recommend POPfile, especially with the SMTP proxy coming in 0.19.

      I set up POPFile my last week at my old job. 20 users, 1000 emails a day, 80%+ of which was spam.

      PIII 1GHz, 512M of RAM running Win2K server. Load on that box, despite running Exchange to boot, was maybe 5-6% CPU when idling, and their accuracy after only 2 months is 99.98%, and we've never even reset the statistics!
      -----

  9. Loadbalancing by sporty · · Score: 1

    What you may need is load balancing and multiple servers. Granted, it's a function of how much mail you have to filter overall, but some form of load balancing will be needed.

    Options include round-robin DNS, a load-balancing machine, or a firewall that can do the like (BigIP, yuck, I hate them).

    Your question is geared towards SMTP, but it's generally a network service question and how to handle X amount of traffic with Y resources.

    --

    -
    ping -f 255.255.255.255 # if only

  10. Cloudmark Authority - a consensus based blocker by Cuchullain · · Score: 2, Informative

    I have had pretty decent results with the spamnet client on our mail clients. They are now using the data that the free clients generate to block at the server. The product is called Authority, or something like that.

    I receive about 50-70 spam mails per day, and the client has been blocking 98 percent of them every day. I have been very impressed by it.

    See if their server product is appropriate for you. It simply uses a consensus derived list from client users to block messages at the server. Kind of a blacklist thing.

    Cuchullain

    --
    "If sharing a thing in no way diminishes it, it is not rightly owned if it is not shared." -St. Augustine
    1. Re:Cloudmark Authority - a consensus based blocker by galaxy300 · · Score: 1

      Yeah...but I just noticed that the Cloudmark website has started announcing "Download Spamnet -- 30 days free!". I think that the client is going to be a pay service and it still doesn't work on Outlook Express. Time to upgrade to the server product, a little bait and switch?

  11. The problem is hard by PD · · Score: 5, Informative

    Classifying spam is essentially the same problem as classifying programs into those that terminate and those that don't (the halting problem). This leads us to the following conclusions:

    1) Filtering spam is not trivial. A program that filters spam X% better than another program will be X^2% more complicated or worse.

    2) You can't write a program that will filter perfectly. At best, all you can do is develop a set of heuristics that you hope aren't too complicated. The less complicated the heuristic, the fewer resources it will require.

    3) There's a limit to how simple your heuristics can be.

    4) The system of spam is not just the message: it's the spammer, plus the message, plus the recipient. This is because a certain message considered spam by some will not be considered spam by others. That means that the heuristics that account for the person reading the spam will be better than those that don't. The source of a spam is also important: a message consisting of a spam report to a spam newsgroup is not a spam, though it may contain a complete spam message.

    5) The best spam filters will eventually be AIs that understand human language. That means that the ultimate spam filter will require enough processing power to model human cognitive abilities. In short, you're going to see an endless increase in the number of processor cycles consumed by spam filters, asymptotically approaching the requirements of a full-up human brain simulation.

    On the other hand, this will sell a hell of a lot of computers.

    1. Re:The problem is hard by cperciva · · Score: 1

      Classifying spam is essentially the same problem as classifying programs into those that terminate and those that don't (the halting problem).

      What an amazing claim. Could you elaborate on this? For example, how do you apply Cantor's diagonalization argument to email?

    2. Re:The problem is hard by PD · · Score: 1

      Simple. Spams or programs can be represented as sequences of numbers, making them indistinguishable at the machine representation level. Take all of those bits of those numbers in a row and read them as a single number that represents that spam.

      At this point, Cantor's diagonalization is trivial.

    3. Re:The problem is hard by cperciva · · Score: 1

      At this point, Cantor's diagonalization is trivial.

      You have an interesting definition of "trivial".

      Ok, suppose I have a function which I claim distinguishes with perfect accuracy between spam and non-spam. How do you propose to construct a message which it mis-identifies?

    4. Re:The problem is hard by PD · · Score: 1

      Let's get real theoretical now: :-)

      Suppose you constructed a message that said in part: 'You have an interesting definition of "trivial".' You ran this through your perfect classification function and it said "NOT SPAM".

      I received the message and said to myself "well, look at that spam." Clearly the spam classifier mis-identified the spam as being ham, because I would consider it spam.

      I'm not trying to duck your question. OK, well maybe I am. I spent a bit of time thinking about it, and it seems that diagonalization would be hard, if not impossible.

      My original point is that spam is partly in the eye of the beholder, and I'll leave my argument at that, and abandon the statement about exact equivalence to the halting problem.

    5. Re:The problem is hard by cperciva · · Score: 1

      That's my point. The problems with building a spam-classifier are practical, not theoretical. Until we agree upon exactly how we define "spam" we can't build a perfect classifier; but there is no analogue to the halting problem (Turing), the incompleteness-inconsistency theorem (Gödel), or the diagonal argument (Cantor).

    6. Re:The problem is hard by alienmole · · Score: 1
      The best spam filters will eventually be AIs that understand human language. That means that the ultimate spam filter will require enough processing power to model human cognitive abilities.

      That may not be true, and it goes to the same issue of definition of spam that makes this not an example of the halting problem. Since the final definition of spam is in the eye of the beholder, only a perfect model of a particular person's cognitive processes will be able to definitively distinguish spam from non-spam.

      OTOH, some relatively dumb processes, such as Bayesian analysis, have already proved reasonably effective in identifying spam. The kinds of analyses these programs do is actually slow and difficult for humans to do. This leads me to the conclusion that "the best spam filters will eventually be" (and already are) programs that pre-filter email using analyses that humans would find tedious and inefficient. But the final analysis is likely to have to remain up to the human recipient, even given strong AI.
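
      To show how "dumb" such a process can be, here is a toy naive-Bayes-style scorer. It is a sketch under simplifying assumptions (whitespace tokens, token presence only, add-one smoothing, a four-message made-up corpus) and is not the algorithm of any filter named in this thread.

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (text, is_spam). Returns token counts and message totals per class."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(set(text.lower().split()))  # count each token once per message
        totals[is_spam] += 1
    return counts, totals


def spam_probability(text, counts, totals):
    """Combine per-token evidence as log-odds, then squash to a probability."""
    log_odds = math.log((totals[True] + 1) / (totals[False] + 1))  # smoothed prior
    for token in set(text.lower().split()):
        p_spam = (counts[True][token] + 1) / (totals[True] + 2)    # add-one smoothing
        p_ham = (counts[False][token] + 1) / (totals[False] + 2)
        log_odds += math.log(p_spam / p_ham)
    if log_odds >= 0:                       # numerically stable logistic function
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1 + e)


corpus = [
    ("cheap pills buy now", True),
    ("buy cheap watches", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
counts, totals = train(corpus)
print(round(spam_probability("buy cheap pills", counts, totals), 3))
print(round(spam_probability("monday agenda", counts, totals), 3))
```

      The work is just counting and multiplying small ratios per token, which is tedious for a person and trivial for a machine; the storage cost comes from keeping the token counts.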

      If you doubt that, ask yourself if another person could be 100% reliable at filtering spam for you. The answer is no - I can give counterexamples if you think it's yes. Another person represents about the strongest AI we can hope for, but even if we had stronger (more intelligent?) AI, that doesn't mean it would be able to perfectly predict our thought processes.

  12. Most effective spam filtering I know of by Sevn · · Score: 1

    I've been using these filters for quite some
    time, and they catch about 95 percent of it.

    Enjoy:

    If From/Sender or Subject contain:

    post-line, yahoo, mail, (your account name),
    postforme, photos, hot, degree, earthlink, aol,
    opt, cum, young, hollywood, notme, naked, penis,
    bigger, usa, model, women, girl, slut, prize, won
    msn, horny, dirty, gang, where, winner, price,
    teen, printer

    Move to folder spam. :)
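
    A sketch of that rule as code, using only a few of the words above for brevity. It assumes plain substring matching on the From, Sender and Subject headers, which is how the rule reads; the function name and sample messages are made up.

```python
from email import message_from_string

# A small subset of the word list above; extend as needed.
KEYWORDS = ("prize", "winner", "degree", "printer", "hollywood")


def route_by_keyword(raw_message: str, keywords=KEYWORDS) -> str:
    """Return 'spam' if any keyword appears in From, Sender or Subject, else 'inbox'."""
    msg = message_from_string(raw_message)
    haystack = " ".join(msg.get(h, "") for h in ("From", "Sender", "Subject")).lower()
    return "spam" if any(word in haystack for word in keywords) else "inbox"


print(route_by_keyword("From: a@example.com\nSubject: You are a WINNER\n\nbody"))
print(route_by_keyword("From: boss@example.com\nSubject: Lunch\n\nbody"))
print(route_by_keyword("From: it@example.com\nSubject: Printer toner order\n\nbody"))
```

    The last sample is a legitimate message that still lands in the spam folder, which is the kind of false positive mentioned next. Cost per message is a handful of substring searches, so this is about as cheap as filtering gets.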

    You'll get some false positives once in a great
    while, but it's nice to have all your spam in
    one folder and weeded out of your regular mail.

    I'd also like to give a big thumbs up to
    register.com's webmail spam filtering. It's easy,
    and it works very well. My spam has dropped from
    400 a week to about 10 a week.

    --
    For every annoying gentoo user, are three even more annoying anti-gentoo crybabies. Take Yosh from #Gimp for example.
  13. Easiest Solution: by Canthros · · Score: 1

    Have someone else do it for you. There are companies around that do mail filtering and so forth; they use Postini here. Costs a couple bucks per month per mailbox, but they also hold onto mail if the server here becomes unavailable, and it's never your headache to keep up with.

    --
    Canthros
  14. I've got a better solution by .@. · · Score: 2, Informative

    I've been thinking about this problem, and its various dead-end solutions (micropayments, rewriting SMTP, strong client/server auth, third-party circles of trust), and have come to the conclusion that none are necessary, or particularly desirable.

    I've put together the beginnings of an alternate proposal, which draws on some of the good aspects of the above approaches, without the need to rewrite SMTP. It's a community-based, peer-based approach that leaves the power in the hands of the operator. Plus, there's no profit motive (except that it's in an operator's best interest, and thus the corporate owner's best interest, to maintain his/her server's level of trust).

    --
    .@.
  15. Here's the best: by Randolpho · · Score: 1

    mail.yahoo.com. :)

    --
    "Times have not become more violent. They have just become more televised."
    -Marilyn Manson
    1. Re:Here's the best: by maxume · · Score: 1

      This can seemingly be improved upon. For reasons that are no longer clear to me, I have a yahoo mail account that does not match the username of the yahoo account. I get about 1 spam per month to this address; Of those, I think some of them are coming to a different address that I have forwarded to the yahoo account. Because of this, and use of another yahoo account as a spam trap, I have yet to experience the hell that many people seem to be going thru with spam. Add in something like YahooPops(check sf.net) and email is still pretty useful for me.

      --
      Nerd rage is the funniest rage.
  16. One very fast check is extremely effective: by -dsr- · · Score: 2, Insightful

    One very fast check is extremely effective: look at the first line of each MIME attachment to see if it's a Microsoft executable file. If it is, quarantine it.

    (I wish I had thought of this, but Russell Nelson did.)
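
    A rough sketch of the idea with Python's standard email package. Note the swap: the check described above looks at the first line of the still-encoded attachment, while this version decodes each part and looks for the "MZ" bytes that Windows executables start with (base64-encoded, those bytes show up as a line beginning with "TV"). Function names and the sample message are illustrative.

```python
from email import message_from_bytes
from email.message import EmailMessage


def has_windows_executable(raw: bytes) -> bool:
    """True if any non-multipart MIME part decodes to data starting with b'MZ'."""
    msg = message_from_bytes(raw)
    for part in msg.walk():
        if part.is_multipart():
            continue                                # containers carry no payload bytes
        payload = part.get_payload(decode=True)     # undoes base64 / quoted-printable
        if payload and payload[:2] == b"MZ":
            return True
    return False


def build_sample(attachment: bytes) -> bytes:
    """Build a small message with one binary attachment (for demonstration only)."""
    msg = EmailMessage()
    msg["Subject"] = "Remember me?"
    msg.set_content("see attached")
    msg.add_attachment(attachment, maintype="application",
                       subtype="octet-stream", filename="file.bin")
    return msg.as_bytes()


print(has_windows_executable(build_sample(b"MZ\x90\x00\x03")))
print(has_windows_executable(build_sample(b"%PDF-1.4")))
```

    Only the first two bytes of each part matter, so a production version could stop decoding after the first encoded line instead of decoding whole attachments.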

  17. any large-ish mail server needs the following by Anonymous Coward · · Score: 0

    Since one of the major issues of spam is the bandwidth cost - it is favorable to be able to reject spam before accepting the entire message body. On a busy-enough mail server message bodies for spam may account for many gigabytes of wasted bandwidth.

    It would be interesting to see a spam-filter that would just drop SMTP connection as soon as it received enough information to determine if a message is spam (be that just the header, or half of message's body). I would imagine that for any major mail server this is a MUST - otherwise filtering spam is just a convenience to the users rather than a way to fight resource waste.

  18. Bogofilter rocks! by bobv-pillars-net · · Score: 2, Informative

    I had the same problem.

    Was pretty happy with spamassassin, but our mailserver was crumbling under the load.

    Switched to bogofilter and, after a training period, we're now getting better accuracy (97.6%) with spam recognition than we did with SpamAssassin, with MUCH reduced server load.

    --
    The Web is like Usenet, but
    the elephants are untrained.
    1. Re:Bogofilter rocks! by Darnit · · Score: 1

      Same here.

    2. Re:Bogofilter rocks! by Anonymous Coward · · Score: 0

      Bogofilter kicks ass with sylpheed-claws for individual users too. Here's a how-to on oreilly.net.

  19. I've used by motha_chucker · · Score: 2, Informative

    Xwall because we are running Exchange. It supports Bayesian filtering as well as MAPS/RBL rejection and virus scanning. Currently I am running only MAPS/RBL and have found it to be very effective with very few false positives. To answer your question regarding effectiveness of quick checks, I would have to say in my experience that they are effective. I have not stopped 100% of incoming spam but I would say around 98% and feel that is acceptable. Xwall is also cheap, $300.00 USD. Unfortunately it will only run on Windows.

  20. Multi level approach by linuxwrangler · · Score: 3, Insightful

    The more "low hanging fruit" you pick off the less your computationally expensive filters have to do. For example, if the other system greets you with:
    EHLO your.machine.ip.address
    or
    EHLO your.machine.name
    then it IS a spammer. Reject now. There are some patches and configurations for Postfix so you can declare that RCPT from certain domains like yahoo and hotmail be verified to have a hotmail EHLO that properly resolves. This is more expensive as a dns lookup is required but this will probably be cached locally pretty quickly.

    You can also unceremoniously drop any connection that starts pipelining before you say it is OK to pipeline and any EHLO that has an illegal hostname.
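
    The two greeting checks can be sketched as a small function that returns an SMTP reply line. The server identities, the reply texts and the hostname pattern are assumptions for illustration; the pipelining check needs access to the socket and is not covered here, and a real MTA would do all of this in its own configuration.

```python
import re

# Hypothetical identities of the receiving server (documentation address range).
MY_IDENTITIES = {"mail.example.org", "192.0.2.25"}

LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
HOSTNAME_RE = re.compile(LABEL + r"(?:\." + LABEL + r")*")
ADDRESS_LITERAL_RE = re.compile(r"\[[^\[\]]+\]")    # e.g. [192.0.2.1]


def check_helo(argument: str, my_identities=MY_IDENTITIES) -> str:
    """Return the SMTP reply for a HELO/EHLO argument using cheap string checks only."""
    name = argument.strip().lower()
    if name.strip("[]") in my_identities:
        # The client claims to be us: reject with a permanent error.
        return "550 HELO/EHLO argument is this server's own identity"
    if not (HOSTNAME_RE.fullmatch(name) or ADDRESS_LITERAL_RE.fullmatch(name)):
        return "501 Syntactically invalid HELO/EHLO argument"
    return "250 OK"


for greeting in ("192.0.2.25", "mail.example.org", "bad_host!", "mx.sender.example"):
    print(greeting, "->", check_helo(greeting))
```

    Both checks are pure string comparisons with no DNS lookup, so they cost almost nothing per connection.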

    This, at least, reduces the work your scanning engines will have to do. Still, even if you catch nearly all the spam with the easy checks you will only reduce your mail volume by ~40% (current estimated overall spam volume) so that leaves you with 60% to scan.

    I suppose your main MX could do the easy checks then send the remainder off to as many round-robin scanners as necessary which in turn could pass the mail on for delivery.
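
    The hand-off to round-robin scanners can be sketched in a few lines. The scanner host names are hypothetical, and the sketch only decides which scanner gets which message; actually relaying the mail is left to the MTA.

```python
from itertools import cycle

# Hypothetical pool of scanning hosts behind the main MX.
SCANNERS = ["scan1.example.org", "scan2.example.org", "scan3.example.org"]


def assign(message_ids, scanners):
    """Pair each message that survived the cheap checks with the next scanner in rotation."""
    rotation = cycle(scanners)
    return [(message_id, next(rotation)) for message_id in message_ids]


for message_id, scanner in assign(["a", "b", "c", "d"], SCANNERS):
    print(message_id, "->", scanner)
```

    Plain rotation ignores how busy each scanner is; it is the simplest policy that spreads the load evenly on average.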

    One starts to realize why some places just roll over and pay tens of thousands of dollars to someone else to do it for them.

    --

    ~~~~~~~
    "You are not remembered for doing what is expected of you." - Atul Chitnis
    1. Re:Multi level approach by morzel · · Score: 1
      You can also unceremoniously drop any connection that starts pipelining before you say it is OK to pipeline and any EHLO that has an illegal hostname.
      Dropping connections like this is not a good thing since the other party (ofc. depending on the implementation) will assume that due to network problems the connection failed; resulting in a re-connect after some time-out.
      This may effectively drain more resources than you were trying to save. Always send a 5xx return code (permanent error) to the server, so the other party knows that it should not attempt delivery again.

      --
      Okay... I'll do the stupid things first, then you shy people follow.
      [Zappa]
    2. Re:Multi level approach by Anonymous Coward · · Score: 0

      I've yet to see a mailserver that honors the 5xx error codes. My servers send out more than 20,000 5xx error codes a week and that doesn't stop those same servers from connecting again and again. Even recent versions of Postfix and Sendmail keep retrying.

  21. Run spamd/spamc version of SpamAssassin by bill_mcgonigle · · Score: 5, Interesting

    SpamAssassin can run as a daemon (see here) so it doesn't have to start up the perl interpreter for each message. This is the preferred mode for large installations.

    People report processing times in the range of 0.2 to 0.5 seconds per message with basic tests (no pyzor 2). Get a fast machine with dual processors, plenty of RAM, a caching DNS server, set spamd/spamc to have an appropriate number of child processes, and you should be good to go.
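
    To get a feel for what those timings mean at scale, here is a back-of-envelope estimate. The child count of 10 is a hypothetical setting, and the calculation assumes every child stays busy and that throughput scales linearly, which real hardware will not quite deliver.

```python
SECONDS_PER_DAY = 86_400
USERS = 50_000            # from the original question
CHILDREN = 10             # hypothetical number of spamd child processes


def messages_per_day(children, seconds_per_message):
    """Upper bound on daily throughput if all children scan continuously."""
    return children * SECONDS_PER_DAY / seconds_per_message


for seconds in (0.5, 0.2):   # the range reported above
    capacity = messages_per_day(CHILDREN, seconds)
    print(seconds, "s/msg:", int(capacity), "msgs/day,",
          round(capacity / USERS, 1), "per user")
```

    With 10 children that is roughly 1.7 to 4.3 million messages per day, or about 35 to 86 messages per user per day across 50,000 users.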

    It's certainly going to be cheaper than the sexual harassment lawsuit that one of those 50,000 users is going to file for being forced to look at pornographic material (we require employees to read their e-mail, don't you?).

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    1. Re:Run spamd/spamc version of SpamAssassin by Anonymous Coward · · Score: 1, Informative

      We're seeing spammers pad out the emails so that SA times out and passes them on as legit.

    2. Re:Run spamd/spamc version of SpamAssassin by bill_mcgonigle · · Score: 1

      We're seeing spammers pad out the emails so that SA times out and passes them on as legit.

      You mean in length? I have SpamAssassin processing many many multimegabyte mails without timing out...

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  22. Error in URL by blacksqr · · Score: 0

    Should be

    http://tmda.net

  23. Why not just... by Apreche · · Score: 1

    Instead of having a centralized spam filter, why not have a spam filter on each user's individual machine? Sure spam will get through, but each user will take care of their own spam problem. This will also make it the user's fault if something gets blocked. I mean it doesn't even have to be the best filtering in the world. Just use Mozilla Mail and the spam filter it has built in.

    --
    The GeekNights podcast is going strong. Listen!
  24. Tarproxy by blacksqr · · Score: 2, Informative

    Tarproxy (http://www.martiansoftware.com/tarproxy) seems like such an excellent solution, I don't know why it's not more visible. I would think that if just a few large mail servers started using it, spam might virtually stop overnight; thus rendering discussions of efficiency of filters moot.

    I can only think that commercial spam filtering companies are terrified of it, thus are somehow keeping it out of the public eye.

  25. Answers by Anonymous Coward · · Score: 1, Interesting
    You said you'd like to actually reject some mail. For this to work it has to be done during the SMTP transaction. You can't wait until the LDA gets its hands on the message. You have to do it at the MTA level. SpamAssassin can still do this. However now you need to glue it to Sendmail via a Milter. I highly recommend MIMEDefang for your milter. Actually if you're rolling it out for 50,000 users then I recommend you purchase the commercial version called CanIt. That way you get support and features that aren't in the open-source version. MIMEDefang is a wonder tool. David did a helluva job on it.

    I personally use a large number of DNS blacklists. I call them from Sendmail and reject mail with them. Many people don't like DNSBLs; of course I believe these people are ignorant fools who couldn't admin a mail system if their life depended on it. That's ok. At the very least you should be able to use the DNSBLs that list open relays, open proxies, open SOCKS boxes, and vulnerable formmail.cgi web servers. We can surely all agree that you don't want your mail server talking to another mail server that's known to be vulnerable. Most of these specific lists require that an open * be abused before they list them. I'd also contend that we can all justify using Spamhaus's Spamhaus Block List (SBL). It lists known spammers and is very specific about it. You can block roughly 75% of spam with that list alone. Where you use these DNSBLs is up to you. Like I said above, I call all of mine straight from Sendmail. You can configure SpamAssassin to call these DNSBLs for you and assign a score you define. It's pretty easy. This way you can still use lists like SPEWS that rely on collateral damage to score mail but not outright block it. I use SPEWS and love it but it does block some legit mail by design. If you only score off of SPEWS you can minimize the FPs while still maximizing your spam filtering efforts. I am preparing to score foreign countries and RFC-Ignorant domains off of this as well.
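
    For anyone who hasn't looked at how a DNSBL is queried, here is a rough sketch of the lookup and of scoring instead of outright blocking. A DNSBL query is an ordinary DNS lookup of the client's IPv4 octets reversed and prepended to the list's zone. The zone names and weights in the usage note are placeholders, and is_listed and dnsbl_score are names made up for this sketch.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets plus the list's zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """True if the list answers with an A record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # No such name (or the lookup failed): treat as not listed.
        return False


def dnsbl_score(ip: str, weighted_zones: dict, lookup=is_listed) -> float:
    """Sum the weights of every list that has the address (score, don't block)."""
    return sum(weight for zone, weight in weighted_zones.items() if lookup(ip, zone))
```

    Something like dnsbl_score(client_ip, {"relays.example": 3.0, "collateral.example": 1.5}) would let the aggressive lists add to a score while the conservative ones stay as hard rejects in the MTA. Point it at a local caching resolver, or the lookups become the slow part.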

    I do not recommend you use the DCC. I highly recommend you use Razor which IMHO addresses the shortcomings in DCC. Submissions to Razor have to be confirmed unlike in the DCC. This way other people confirm that the message someone submits is actually spam and not JCPenney's spring mailing list. SpamAssassin can make these calls as well.

    The mail system you're describing is going to be fairly large. This isn't something you want a single box handling. Ideally you'd put the spam and AV checks on a mailhub ahead of the actual MTA or cluster of MTAs. These boxes act as a spam firewall of sorts and take the CPU intensive tasks you mentioned off of the actual mail server. I'm not actually using this type of setup myself but I will be eventually. There was a Slashdot article a while back about a setup roughly your size and what a guy did to make it work. It was quite a nice setup. I can't find the link now. IIRC, he scored mail and then sent probable spam via a separate mail queue to a separate spool for each user. Then using IMAP the user could check their probable spam for FPs. It was a nice setup.

    You also mentioned Bayesian filtering. Let me make something very clear. Bayesian filters must be applied on a user by user basis. You can't simply enable Bayes for all 50,000 as one lump sum. It will never be able to learn what is and isn't spam that way. You have to let it learn on a user by user basis. The existing Bayes abilities within SpamAssassin don't work well (or at least easily) when SA is called from MIMEDefang. There are supposedly hacks for this but I have yet to see a working one. Along those same lines user-defined preferences also don't work well (or at least easily) fro

  26. Multiple Filters? by TubeSteak · · Score: 1
    I know it's an uneducated question, but would it be possible to put two (or more) spam filters/programs back to back? You could find some cheap and dirty filters, lower their spam threshold(s) to reduce/eliminate false positives and hopefully still stop a majority of the spam.

    is this too complicated to implement?

    --
    [Fuck Beta]
    o0t!
    1. Re:Multiple Filters? by perlchild · · Score: 1

      You would end up processing the body of the message twice with one test each, which would be computationally harder than one system doing two tests...

  27. we use it, no problem by sprzepiora · · Score: 1

    Our mail server is for a hosting company, it currently does about 10000 emails a day. The server itself is a quad PPro 200 with 512 megs of ram. Not only is the load below 1% all the time, our setup is more cpu intensive. Spamassassin is run out of procmail so it can be turned off for people who don't want it.

  28. amavisd-new+spamassassin+clamav by Pointer80 · · Score: 2, Informative

    We're currently handling mail for 4k+ accounts using 2 frontend servers running postfix that do all of the filtering and then pass the messages back to our backend mail server.

    The frontend mail servers are running amavisd-new which is configured to use spamassassin and clamav. You can use DNS RR or just have multiple MX recs to load balance as many of these filtering servers as you need. Our filtering servers are cheap XP2100+s (w/1GB of ram) in a rack mount case that cost us ~$650 each. Amavis is just tagging the message headers with X-Spam and X-Virus headers as necessary.

    The backend server is currently sendmail (migrating to postfix+cyrus). Once the migration is complete, our users will be given access to squirrelmail with a modified version of the avelsieve plugin (wizard-like with radio buttons) that will automatically create sieve scripts to drop spam/viruses into their own folders for later examination. We'll then use cyrus's builtin utility to purge those folders (spam/viruses) of messages that are more than X days old to keep disk usage under control.

    I've documented a similar setup that I'm using on my home system here. The only difference between the two (work/home) is that on my home system everything is on one box.

    I've heard claims that clamav doesn't work well. One of the 2 filtering servers has blocked 12135 viruses between 03/06 and 05/08. That works for me. :) Our mail system handles (including rejects) ~500k messages a week, so it's by no means a large system.

    Good luck with your project.

    /pointer

    --
    [%- PROCESS life -%]
    1. Re:amavisd-new+spamassassin+clamav by Rastor · · Score: 2, Informative

      I second this proposal. The amavisd-new+spamassassin combination is a highly efficient way to eliminate spam (rejecting the message if it's spam, as you requested) with near 100% accuracy.

    2. Re:amavisd-new+spamassassin+clamav by penthouseplayah · · Score: 1

      I've myself been looking at antivirus solutions for our dormitory's userbase of 400-500. We found out that it would be quite expensive (Sophos, HBedv and others), until we saw RAVantivirus. $300 for the first year and then 80% off in the following years ($60 a year). This is for a single mailserver with up to 5000 mailboxes, and it works perfectly, installs in 10 minutes and has easy configuration. Look at www.ravantivirus.com

    3. Re:amavisd-new+spamassassin+clamav by Pointer80 · · Score: 1

      The number of domains that we host makes it much more expensive than $300. :(

      /pointer

      --
      [%- PROCESS life -%]
  29. Filter via proxy, not LDA by runswithd6s · · Score: 3, Insightful
    Spam and virus filtering in an efficient manner for anyone is a major issue, and it has already been mentioned that there are multiple ways of accomplishing this. In designing your process, think in terms of dropping or rejecting email from the process loop as soon as possible.

    At the SMTP server

    • Drop email from known blacklisted servers via your email server access file
    • Allow email from known whitelisted servers or addresses
    • Use RBL lists
    • Filter out mis-behaving SMTP servers, ones that don't follow standard protocols
    • Disable ESMTP commands that give the spammer access to your local users lists (VRFY, etc..)
    • Only relay email from authenticated servers and users
    • Impose a size limit to messages (50k) if possible.

    At the SMTP Filter Proxy Server or LDA

    • Allow emails from recipient-based whitelists
    • Drop emails from recipient-based blacklists
    • Process Tagged messages (from TMDA)
    • Run your faster classification programs: clamav (for viruses), bogofilter (bayesian)
    • Run your slower classification programs: procmail, spamassassin

    Just remember to shortcut the process along the way. If email can be dropped or tagged for any reason, do so immediately and quit processing it.
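
    To make the shortcut idea concrete, here is a small sketch of an ordered chain of checks that stops at the first verdict. The message fields, the lists, the size limit, and the content_scan placeholder are all hypothetical stand-ins for whatever your MTA and filters actually provide.

```python
from typing import Callable, List, Optional, Tuple

# A check returns "reject", "accept", or None (no opinion, keep going).
Check = Callable[[dict], Optional[str]]

BLACKLIST = {"192.0.2.66"}          # made-up example data
WHITELIST = {"boss@example.com"}    # made-up example data
MAX_SIZE = 50 * 1024                # the 50k limit suggested above


def blacklisted(msg: dict) -> Optional[str]:
    return "reject" if msg["ip"] in BLACKLIST else None


def whitelisted(msg: dict) -> Optional[str]:
    return "accept" if msg["sender"] in WHITELIST else None


def too_big(msg: dict) -> Optional[str]:
    return "reject" if msg["size"] > MAX_SIZE else None


def content_scan(msg: dict) -> Optional[str]:
    # Placeholder for the expensive step (spamassassin, bogofilter, ...).
    msg["scanned"] = True
    return "reject" if "buy now" in msg["body"].lower() else None


# Cheapest checks first, most expensive last.
CHECKS: List[Tuple[str, Check]] = [
    ("blacklist", blacklisted),
    ("whitelist", whitelisted),
    ("size", too_big),
    ("content", content_scan),
]


def run_checks(msg: dict, checks: List[Tuple[str, Check]] = CHECKS) -> Tuple[str, str]:
    """Run checks in order and stop at the first one that gives a verdict."""
    for name, check in checks:
        verdict = check(msg)
        if verdict is not None:
            return verdict, name
    return "accept", "default"
```

    A message from a blacklisted IP never reaches content_scan, so the expensive filter only pays for mail the cheap tests could not decide.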

    --
    assert(expired(knowledge)); /* core dump */
    1. Re:Filter via proxy, not LDA by afidel · · Score: 1

      50K per message, are you INSANE??? I hate it when servers limit me to 2 or 5 MB let alone something so insane. Let me guess, you never send anything with attachments, get real.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    2. Re:Filter via proxy, not LDA by Nathaniel · · Score: 1
      "50K per message, are you INSANE??? I hate it when servers limit me to 2 or 5 MB let alone something so insane. Let me guess, you never send anything with attachments, get real."

      Why are you passing by value instead of passing by reference?

      That is to say, why are you sending an attachment instead of a URL or some other pointer to the file?

      Chewie did say "... if possible". That hardly sounds insane to me.

    3. Re:Filter via proxy, not LDA by afidel · · Score: 1

      Because my personal web account has a 5MB limit and most of it is already used.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  30. user education by Parsec · · Score: 3, Interesting

    You could take some steps on the user education side of things. Before being given an account, they should learn a few things about how to keep their address safe, like:

    • spamgourmet.com and other disposable email address providers.
    • The ethics of buying from spammers (some people really don't know!) Make the counterpoint that it's perfectly acceptable to buy from sponsors of lists that they want to be subscribed to, to help support the list.
    • Always watch for checkboxes with tricky text used to gain permission when submitting their email somewhere.
    • When to click that unsubscribe link (which spam may be legitimate).
    • Offer to teach the finer points of tracking down and reporting spam. Report not just the sending IP, but also advertised web site, using the various web whois interfaces.
    • Point users to legislative possibilities that they may wish to contact their governmental representative to support.

    Also, if you're working for an organization which may want to expose user addresses to the internet via a web site, you may want to work with the web master and legal to create a click-through agreement that would stop spam harvesting robots while only requiring a couple extra clicks for the legitimate public. Or work with the web master to create a standard human-only readable way to post email addresses, e.g. "email lauren at our domain of example.com".

    You may wish to register an additional domain or two to provide disposable email address services to your users.

    Consider a piece of software that blocks IPs attempting to brute-force email addresses. Some filter monitoring the logs for excessive bounces from an IP and passing it to the firewall would work. I don't know of any examples of this software, but if you're doing a large email service you may get these kinds of attacks.
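
    The counting part of such a monitor is small. Here is a sketch, assuming something else parses the mail log and calls record_reject for every unknown-user bounce; the class name, the limit, and the window are invented for illustration, and the firewall hook is left out entirely.

```python
from collections import defaultdict, deque


class BounceWatcher:
    """Flag an IP once it causes too many rejects inside a sliding time window."""

    def __init__(self, limit: int = 20, window: float = 600.0):
        self.limit = limit      # rejects tolerated per window
        self.window = window    # window length in seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent rejects

    def record_reject(self, ip: str, now: float) -> bool:
        """Record one reject at time `now`; return True if the IP should be blocked."""
        hits = self._hits[ip]
        hits.append(now)
        # Forget rejects that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        return len(hits) > self.limit
```

    When record_reject returns True, the caller would hand the IP to whatever firewall tool is in use. Choosing the limit takes care: a busy legitimate relay with a stale address list can also produce bursts of bounces.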

  31. followup:user education by Parsec · · Score: 1

    To my fourth paragraph: Spamgourmet is apparently a sourceforge project

  32. Sendmail patches / config? by DamienMcKenna · · Score: 1

    Do you know of any sites that explain how to do some of these for sendmail, either via patches or (preferable) using some config changes? Thanks.

    1. Re:Sendmail patches / config? by cbcbcb · · Score: 2, Informative

      SAUCE applies aggressive correctness checks to incoming mail. Works with exim, but apparently could be adapted: http://www.chiark.greenend.org.uk/~ian/sauce/

  33. Tips for Sendmail configuration? by DamienMcKenna · · Score: 1

    Does anyone have tips for sendmail configuration for some of these, eg disabling ESMTP user listing, blocking irregularly configured incoming SMTP servers, etc?

    1. Re:Tips for Sendmail configuration? by anon+mouse-cow-aard · · Score: 1

      extract from a sendmail.mc :-)

      INPUT_MAIL_FILTER(`spamassassin', `S=local:/var/run/spamassassin.sock, F=, T=C:15m;S:4m;R:4m;E:10m')

      INPUT_MAIL_FILTER(`mimedefang', `S=unix:/var/spool/MIMEDefang/mimedefang.sock, F=T, T=S:60s;R:60s;E:5m')

      define(`confPRIVACY_FLAGS', `authwarnings,novrfy,noexpn,restrictqrun')dnl

  34. Another vote for Popfile... by aquarian · · Score: 1

    I've been using Popfile too. It doesn't seem to slow things down any, downloading 3-400 messages a day over a cable modem (most of which is spam, and is successfully marked as such). FWIW I have a 700 Mhz machine, and use Win2k Pro w/ Outlook Express.

    Now that I'm looking to deal with all my mail from a server, I'm trying to find a way to use the Popfile filters I've so carefully trained over the last few months!

    BTW, my Popfile's accuracy is also just under 99%.

  35. Client side: Eudora by User+956 · · Score: 2, Informative

    For Outlook users, I recommend Spammunition [upserve.com] and I just use Mozilla's spam filtering, which works great.

    Eudora users can use Spamnix. Works like a charm.

    --
    The theory of relativity doesn't work right in Arkansas.
  36. Redundancy checking by kinema · · Score: 1

    I don't know too much about the real world of spam checking, but it seems from the spam that I do receive that much of it is redundant per domain. Would it reduce the computation if you checked for redundant messages (mass spam messages) before you fed the stream of messages into the spam filter? --adam
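
    A rough sketch of what such a pre-check could look like: hash a normalized copy of each body and count repeats across the domain. The normalization, the threshold, and the names are guesses for illustration, and an exact hash like this is much cruder than the fuzzy checksums that DCC or Razor use.

```python
import hashlib
import re
from collections import Counter


def body_digest(body: str) -> str:
    """Hash the body after collapsing whitespace and case, so trivial edits still match."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class BulkDetector:
    """Count identical bodies and flag the ones seen many times."""

    def __init__(self, threshold: int = 50):
        self.threshold = threshold
        self.seen = Counter()  # digest -> number of times seen

    def is_bulk(self, body: str) -> bool:
        """Count this body; True once it has been seen `threshold` or more times."""
        digest = body_digest(body)
        self.seen[digest] += 1
        return self.seen[digest] >= self.threshold
```

    A hit only says the message is bulk, not that it is spam, so a legitimate mailing list would trip it too. Spammers who add random text to each copy would slip past an exact hash.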

  37. Your DNS is probably hosed. by Ashurbanipal · · Score: 1

    If it takes more than a couple of seconds to process normal emails with spamassassin, you are horribly misconfigured.

    I recommend you examine some log files (what a concept!) and do some tests of name resolution. The timeouts you describe are typical of a mailserver with a completely b0rked DNS.

    You should always run a local name resolver on a mailserver anyway, with query access limited to 127.0.0.1 (loopback) so other hosts cannot use the machine as a nameserver. That way, you can set up dummy zones for various purposes (like, communicating with sites incapable of managing DNS properly).

    Check /etc/nsswitch.conf if your machine runs the name service switch (Sun, HP and most modern unix workalikes); check /etc/resolv.conf if your nsswitch specifies "files" for host lookups; and use dig or nslookup to test.

  38. SMTP rejecting of spam considered harmful by Mozai · · Score: 3, Interesting

    Scanning messages for spam and rejecting at the SMTP level is a very bad idea. I'm the sysadmin for a company where about 25% of our email message traffic is spam. However, we also have a hard-working sales department who actually need commercial and sales messages. If a message from a client is marked as 'spam' because they're negotiating a sales deal, the sales staff still need to see this message. If a client's counter-offer is rejected at the mail server with a "you sent us SPAM" message, you can kiss that potential income goodbye.

    False positives can be more harmful than messages getting through the spam filter.

  39. $300 is cheap? Pass the caviar! by Ashurbanipal · · Score: 1

    I bet you could support a couple of small central African villages for $300 a year...

    "When I was a boy, you could get a Baby Ruth bar for a nickel, and it was as big around as your leg."

  40. Excellent point by Andy+Dodd · · Score: 1

    That's what I was going to suggest.

    I would start with a static domain-based blocking scheme. It requires a bit of maintenance (I need to add 10 or so domains/week), but I reject a LOT of mail with no false positives.

    Then use a more computationally intensive filter to catch what gets past the domain-based blocker. Potentially tie them together. (Have the computationally intensive checker make a list of domains. Then you can checkmark ones you want to block. I get legit mail from Yahoo users, so I can't block them, which is where a heuristic or bayesian filter would be useful. On the other hand, blocking Azoogle.com takes care of 10% of my spam. That number used to be 25%+)

    --
    retrorocket.o not found, launch anyway?
  41. BAD by SuiteSisterMary · · Score: 1
    Ideally, I'd prefer something that does reject the message if it's spam (SMTP result code 550 or something like that), unlike current Spamassassin or spamprobe setups that accept the message and only later decide whether it's spam.

    If it's not directly aimed at you, you DO NOT delete or reject it. PERIOD. You tag it. Maybe you even quarantine it. But you DO NOT reject it out of hand.

    --
    Vintage computer games and RPG books available. Email me if you're interested.
  42. Re:$300 is cheap? Pass the caviar! by cdh · · Score: 1

    Yes, for something that can support anything close to 50K users, $300 is way cheap. The OP said that he's using Exchange, if you're laying out for Exchange licenses, then again, $300 is way cheap.

    The "real world" sometimes requires you to actually spend money. Sometimes paying for things is cheaper than trying to piece together a bunch of other unrelated packages.

  43. use decision trees by g4dget · · Score: 1
    Most of the "Bayesian" spam filtering is naive Bayesian, which is really just an important-sounding excuse for using a computationally expensive and simplistic classification method.

    If you want computationally efficient methods for detecting spam, look into decision trees (search on Google for decision trees and spam filtering). If you set them up properly, they result in a sequence of simple tests like "Is this addressed to me?", "Does the subject line contain the word 'penis' or 'breast'?", etc. Like the so-called Bayesian spam filters, decision trees also give you probabilities. Properly trained, they probably work at least as well as Bayesian methods, and they should run a lot faster.

    There are a bunch of open source packages available for deriving decision trees from data. Furthermore, while computing decision trees is non-trivial, they can be converted automatically into a few lines of Perl or C code--no runtime libraries required.
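
    To give a feel for what the generated code looks like, here is a tiny hand-written tree. The tests, their order, and the leaf probabilities are invented for illustration; a real tree would be derived from your own training mail by one of those packages.

```python
def spam_probability(msg: dict, my_address: str) -> float:
    """Walk a small decision tree; each leaf returns a made-up spam probability."""
    addressed_to_me = my_address in msg["to"]   # msg["to"] is a list of addresses
    subject = msg["subject"].lower()
    body = msg["body"].lower()

    if not addressed_to_me:
        # Not addressed to this user: already suspicious.
        if "unsubscribe" in body:
            return 0.95
        return 0.70
    if any(word in subject for word in ("penis", "breast")):
        return 0.90
    return 0.05
```

    Each message costs at most a handful of string tests, with no word database to load or update at delivery time. The expensive part is the training, which happens offline.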

  44. Thanks! by shepd · · Score: 1

    I was always wondering what the difference was. :-)

    --
    If you could be told what you can see or read, then it follows that you could be told what to say or think - BoC
  45. Spam filtering minus system resources by Anonymous Coward · · Score: 0

    Tired of using up system resources on spam software that on its best day has a 75% success rate? I work for a company that offers a spam solution at the pre-gateway level (route your MX record) and we use a combination of four methods to determine whether or not a message is spam. They are: heuristics, Distributed Checksum Clearinghouse, Bayes, and RBLs. Our results have been better than we expected, we get a 99% success rate with zero false positives!!! Any sys admin who would like to try it on their domain for free contact me at Jmeindl@electricmail.com

  46. Re:$300 is cheap? Pass the caviar! by Anonymous Coward · · Score: 0
    for something that can support anything close to 50K users, $300 is way cheap
    Sendmail (free, except for the brain damage it causes the mail admin) scales to 50K+ users. And it is far more robust and reliable than Exchange. I have used both - in fact I was a beta-tester for the first three versions of Exchange - and you will spend more time and hardware resources on $$expensive Exchange than on free Sendmail.

    The ROI proposition on Exchange absolutely sucks compared to free alternatives like postfix, qmail, exim, and sendmail. The free mailers lead in scalability, reliability, hardware optimization, virus scanning, and spam detection. The much-vaunted groupware functions of Exchange/Outlook are easily served by free PHP engines running on Apache, and that situation gets better every day as the alternatives evolve. Soon Microsoft will be completely out of the competition there as well.

    Yes, you sometimes have to spend money. But spending money on crap is not good for your business. Spend your money on competent admins instead of braindead software designed for admins making $20K salary a year. Then invest your savings in the Open Source projects that make your business run, and prosper into the new economy.