Slashdot Mirror


Ask Slashdot: Speeding Up Personal Anti-Spam Filters?

New submitter hmilz writes "I've been using procmail for years to filter my incoming mail, and over time a long list of spam patterns was created. The good thing about the patterns is, there are practically no false positives, and practically no false negatives, i.e. I see each new spam exactly once, and lose no legit mail. This works by using an external spam-patterns file, containing one pattern per line, and running an 'egrep -F' against it. As simple as this is, with a long pattern list this becomes rather slow and CPU consuming. An average mail currently needs about 15 seconds to be grepped. In other words, this has become quite clumsy over time, and I would like to replace it by a more (CPU, hence energy) efficient method. I was thinking about a small indexed database or something. What would you recommend and use if you were me? Is sqlite something to look at?"

31 of 190 comments

  1. spamassassin by mdaitc · · Score: 5, Insightful

    have you tried spamassassin?

    1. Re:spamassassin by Scutter · · Score: 2

      Latest News: 2011-06-16: SpamAssassin 3.3.2 has been released, a minor new release primarily to support perl-5.12 and later. Visit the downloads page to pick it up, and for more info.

      Last update was more than two years ago. I know you can refresh your rule sets periodically, but is the software even still maintained?

      --

      "Tell me doctor, with all of your defenses, are there any provisions for an attack by killer bees?"
    2. Re:spamassassin by dbIII · · Score: 4, Informative

      There is still stuff going on in the dev version with an svn commit listed on August 30 2013.
      http://spamassassin.markmail.org/search/?q=#query:%20list%3Aorg.apache.spamassassin.commits+page:1+state:facets

    3. Re:spamassassin by wvmarle · · Score: 4, Informative

      Add greylisting to the mix. For me it stops approx. 90% of junk at the gate. That alone saves >90% of your server's spam workload (90% of the spam checker; a bit extra due to the mail server not having to process the mail at all).

      Of course I don't know about legitimate mail but if someone is trying to send legitimate mail through a spam-type minimised mail server that doesn't retry, that's their problem...
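
      For readers who haven't met greylisting, here is a minimal Python sketch of the usual idea, assuming the common (client IP, envelope sender, envelope recipient) triplet. The class name `Greylist`, the 300-second delay and the in-memory dict are assumptions for illustration, not any particular mail server's implementation; real setups also expire old entries.

```python
import time
from typing import Dict, Optional, Tuple

# (client IP, envelope sender, envelope recipient)
Triplet = Tuple[str, str, str]


class Greylist:
    """Temporarily refuse unknown triplets; accept them once they retry late enough."""

    def __init__(self, min_delay: float = 300.0) -> None:
        # min_delay is a hypothetical value; tune it for your own site.
        self.min_delay = min_delay
        self.first_seen: Dict[Triplet, float] = {}

    def check(self, triplet: Triplet, now: Optional[float] = None) -> int:
        """Return an SMTP reply code: 451 (try again later) or 250 (accept)."""
        now = time.time() if now is None else now
        # Remember when this triplet was first seen; later calls reuse that time.
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return 250
        return 451
```

      A sender that never retries keeps getting 451, while a server that retries after the delay is accepted.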

    4. Re:spamassassin by wvmarle · · Score: 4, Insightful

      Maybe the software is pretty much finished? In that case there's not much more to do - no new features to add, and sooner or later you'll run out of bugs to fix.

    5. Re:spamassassin by FridayBob · · Score: 2

      On the mail servers I maintain, I employ SpamAssassin only as a last resort because it is resource-intensive. Submitter hmilz's approach is not only resource-intensive, but also labor-intensive, so I would never recommend it.

      I've used Exim for my MTA since 2001 and my main defense against spam has always been to filter it out before SpamAssassin comes into play based on analysis of header information and checking against DNS black lists. Actually, the first thing I do is look for obvious fakes from a limited number of well-known domains: gmail messages that are not sent from a Google server, eBay messages not from an ebay.com server, etc. Such messages are rejected immediately. However, the bulk of the filters I've collected and developed over the years check a number of items: whether the sender's reverse DNS address is in order, the HELO is correct, whether the sending IP address or any domains mentioned in the header lines are blacklisted, whether the callout works and any DKIM signature is valid, if an RFC-compliant date and To are included, whether any attachments are included with file types that I consider risky (e.g. .bat, .btm, .cmd, .com, .cpl, .dat, .dll, .exe, etc.), if the message headers contain non-ASCII or characters from some unspecified character set, whether any SPF record says that the sending server really is authorized to send mail for the sender's domain, and finally if the incoming message is using one of my domains in its message ID. For all of these types of checks I often have multiple filter statements.

      In the past I would usually reject messages that matched any of these filters. I would hardly ever receive any spam, but would see lots of false-positives, so I had to maintain very long white lists. When I finally got tired of that, I modified the above filtering system so that each filter was categorized. Each category has a variable that starts out as zero, but gets changed to a one with a match for any of the filters in that category. Later in the process the system counts the number of category variables that equal one. Generally, I figure "three strikes and you're out" is a good rule to apply.

      Moreover, my MTA configuration works with a spambox system. For instance, if an incoming message scores only one or two category matches, and/or the message scores less than a certain number of SpamAssassin points, then it gets deposited in the user's spambox instead of their inbox. I've been running the four MTAs in my care like this for the last three years and they've been very reliable and almost totally free of maintenance. They don't require very much in the way of resources either. But best of all, I've had no more complaints from the users at all.
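
      The category counting and spambox routing can be sketched in Python (the real thing lives in Exim configuration, which the post doesn't show). The filter names, the category map, the 5.0 SpamAssassin threshold and the reading that a moderately high score alone is enough for the spambox are all assumptions for illustration.

```python
from typing import Dict, Iterable

# Hypothetical filters mapped to hypothetical categories.
CATEGORY_OF: Dict[str, str] = {
    "no_rdns": "dns",
    "bad_helo": "helo",
    "helo_is_bare_ip": "helo",
    "dnsbl_hit": "blacklist",
    "spf_fail": "auth",
}


def count_strikes(matched_filters: Iterable[str],
                  category_of: Dict[str, str] = CATEGORY_OF) -> int:
    """Each category counts once, however many of its filters matched."""
    return len({category_of[name] for name in matched_filters})


def route(strikes: int, sa_score: float, sa_spambox_threshold: float = 5.0) -> str:
    """Three strikes rejects; one or two strikes, or a high-ish score, goes to the spambox."""
    if strikes >= 3:
        return "reject"
    if strikes >= 1 or sa_score >= sa_spambox_threshold:
        return "spambox"
    return "inbox"
```

      Counting categories rather than individual filters means two HELO-related matches still count as a single strike.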

    6. Re:spamassassin by Scutter · · Score: 2

      I don't think there's any such thing as "pretty much finished", especially with a piece of software involved in the arms race that is spam vs. filtering. There's only so much you can do with rules before you need to revisit your engine. Also, it's not just the software that's been stagnant for two years. The website itself hasn't been updated in as long. Not a single news item since 2011. The other respondent mentioned that dev is still active, but dev is not production. Dev is dev. Ever since SpamAssassin moved to Apache, it's been pretty much dead.

      --

      "Tell me doctor, with all of your defenses, are there any provisions for an attack by killer bees?"
    7. Re:spamassassin by bill_mcgonigle · · Score: 4, Informative

      The rule sets are updated pretty frequently - that's where the front lines of the battle are. As others have said, the engine is pretty mature.

      The question, I guess, is what do you want spamassassin to do that can't be expressed with the current rules language?

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    8. Re:spamassassin by Architect_sasyr · · Score: 2

      One client in 4 years of greylisting has had that problem, for something like 40,000 unique senders per month. I like those numbers.

      --
      Me failed English...
      FreeBSD over Linux. If my comments seem odd, this may explain...
    9. Re:spamassassin by DNS-and-BIND · · Score: 2

      No, today's technology user has been brainwashed by mobile applications that update frequently. I have seen complaints on perfectly good software: "Has not been updated in a year whats wrong this software sux 1 star". Developers also use software updates as a sort of beta test: push it out, and if it crashes a lot of systems then update it again. Iterate as necessary. I've seen three releases in a day and five in a week using this "plan". The users don't help by considering mature (i.e. un-updated, essentially finished) software as garbage.

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
  2. You could speed up your current solution by russotto · · Score: 5, Interesting

    Write something that uses a regular expression library (RE2 would be ideal, if your expressions are actually regular), and keeps the compiled patterns resident. Most of your time is likely spent parsing the patterns.

    1. Re:You could speed up your current solution by PetiePooo · · Score: 5, Informative

      ...Most of your time is likely spent parsing the patterns.

      I second that. And as your rules have built up, there are likely some that have never been used beyond when they were first put in. I'd instrument your next solution to identify outliers and cull them over time so your parser doesn't have to work so hard.
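
      A rough sketch of that instrumentation in Python, assuming the patterns are regular expressions kept one per line. The function names are made up for illustration. It deliberately checks every pattern instead of stopping at the first hit, so the counts are complete.

```python
import re
from collections import Counter
from typing import Iterable, List


def compile_patterns(lines: Iterable[str]) -> List["re.Pattern[str]"]:
    """Compile one regex per non-blank line."""
    stripped = (line.strip() for line in lines)
    return [re.compile(line) for line in stripped if line]


def record_hits(patterns: List["re.Pattern[str]"], message: str, hits: Counter) -> bool:
    """Count every pattern that matches this message; return True if any did."""
    matched = False
    for pattern in patterns:
        if pattern.search(message):
            hits[pattern.pattern] += 1
            matched = True
    return matched


def unused(patterns: List["re.Pattern[str]"], hits: Counter) -> List[str]:
    """Patterns that have never matched anything: candidates for culling."""
    return [p.pattern for p in patterns if hits[p.pattern] == 0]
```

      Persisting `hits` between runs (for example as a small JSON file) would be needed to judge patterns over weeks rather than one session.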

  3. Database? by K.+S.+Kyosuke · · Score: 2, Insightful

    What would the database achieve? I'm not sure what is the exact nature of the patterns (an example would really help here), but perhaps writing a compiler from the patterns into some decision procedure in something reasonably efficient yet featuring quick start, such as SBCL or Gambit, could help.

    --
    Ezekiel 23:20
  4. bogofilter by jon787 · · Score: 4, Informative

    http://bogofilter.sourceforge.net/

    I haven't timed it to see how well it's been doing in the 6 years I've had it though.

    --
    X(7): A program for managing terminal windows. See also screen(1).
  5. What a forking awful solution... by Anonymous Coward · · Score: 2, Informative

    Sorry, couldn't resist the pun.

    Your problem (besides not using existing Bayesian tools...) is that every single egrep is a fork. As others have pointed out, you should rewrite your script in something like Python and use the native regex libraries. Even if you have to read and 'compile' the regex list every time, you're saving a *massive* amount of OS-level overhead.
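
    A minimal sketch of that rewrite, assuming the pattern file holds fixed strings (as the submitter's use of -F suggests), one per line. The function names and the usage line are illustrative assumptions; whether Python's `re` is fast enough for a very long list would need measuring.

```python
import re
import sys
from typing import Iterable, List, Optional


def build_matcher(lines: Iterable[str]) -> Optional["re.Pattern[str]"]:
    """Join all fixed strings into one compiled alternation; None if the list is empty."""
    literals = [line.rstrip("\n") for line in lines if line.strip()]
    if not literals:
        return None
    # re.escape makes each line match literally, like grep -F.
    return re.compile("|".join(re.escape(s) for s in literals))


def is_spam(matcher: Optional["re.Pattern[str]"], message: str) -> bool:
    """True if any of the fixed strings occurs anywhere in the message."""
    return matcher is not None and matcher.search(message) is not None


def main(argv: List[str]) -> int:
    """Hypothetical usage: spamcheck.py PATTERN_FILE < message; returns 0 on a match, like grep."""
    with open(argv[1], encoding="utf-8", errors="replace") as fh:
        matcher = build_matcher(fh)
    return 0 if is_spam(matcher, sys.stdin.read()) else 1
```

    Everything happens in one process per message: one file read, one compile, one scan.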

  6. ragel by Anonymous Coward · · Score: 2, Interesting

    Try compiling your patterns using Ragel: http://www.complang.org/ragel/

    Union them all together and you'll see orders of magnitude improvement in performance (e.g. 10x - 100x) over other regular expression engines, although GNU grep is using Aho–Corasick with the -F switch, so you're likely to see less of an improvement.

    Many people use re2c, but it has nowhere near the performance or capabilities of Ragel. Ragel has a steep learning curve, but it's well worth the effort to master. It's well maintained, and has been for years.
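
    For anyone curious about the Aho–Corasick approach mentioned above, here is a compact textbook-style sketch in Python. It illustrates matching every fixed string in a single pass over the text; it is not how GNU grep or Ragel are actually implemented, and the class name is made up.

```python
from collections import deque
from typing import Dict, Iterable, List


class AhoCorasick:
    """Match many fixed strings in one left-to-right pass over the text."""

    def __init__(self, words: Iterable[str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # trie edges per node
        self.fail: List[int] = [0]              # failure link per node
        self.out: List[List[str]] = [[]]        # words ending at each node
        for word in words:
            if word:
                self._add(word)
        self._build_failure_links()

    def _add(self, word: str) -> None:
        node = 0
        for ch in word:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(word)

    def _build_failure_links(self) -> None:
        # Breadth-first, so shallower nodes are finished before deeper ones.
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit words that end at the failure target (proper suffixes).
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find_all(self, text: str) -> List[str]:
        """Return every word occurrence found, in order of where each match ends."""
        found: List[str] = []
        node = 0
        for ch in text:
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            found.extend(self.out[node])
        return found
```

    The automaton is built once per pattern list; scanning a message then costs time proportional to the message length plus the number of matches, regardless of how many patterns there are.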

  7. Re:Or... by asmkm22 · · Score: 3, Insightful

    Which pretty much defeats the whole point of hosting your own email...

  8. Re:Or... by bmo · · Score: 4, Informative

    Well, the OP wasn't exactly clear if it was just his personal account or whether it's a corporate server. My "least amount of work" thing is to forward every email address I have to gmail, and pull mail from there via imap. I get a few hundred spams a day just on one mail account, and I haven't lost any real mail due to Gmail's filtering.

    >postini

    You do realize that is being EOLed, yes?

    http://postini-transition.googleapps.com/

    >why gmail filters better than postini

    Probably separate spam databases. Stuff like that happens. Gmail probably gets orders of magnitude more spam to "teach" the system.

    YMMV.

    Using grep and procmail is the stone-knives-and-bearskins approach to filtering. There are a lot of other filtering systems that will be much more efficient on Unix systems. He can begin by using greylisting to filter out the non-compliant "fire and forget" spambots and then filter the winnowed pile o' crap. At least greylisting's not server intensive (it throws the load back to the sender) since 5xx and 4xx errors are cheap.

    Also blocking mail from dynamic IPs is a good idea.

    At this point he can then run the mail through a series of weighted RBLs. Reach a certain score and it's tossed. That's the processor intensive bit, but it's at the end and non-intensive filtering has already happened.
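
    A sketch of weighted RBL scoring in Python. DNS blacklists are queried by reversing the IPv4 octets and appending the list's zone. The weights, the threshold and the zone `dnsbl.example.net` are made-up placeholders, and treating any 127.x answer as "listed" is a simplifying assumption; individual lists document what their specific return codes mean.

```python
import socket
from typing import Callable, Dict


def dnsbl_name(ipv4: str, zone: str) -> str:
    """192.0.2.10 + zen.spamhaus.org -> 10.2.0.192.zen.spamhaus.org"""
    return ".".join(reversed(ipv4.split("."))) + "." + zone


def is_listed(ipv4: str, zone: str) -> bool:
    """Real DNS lookup: a 127.x answer means listed, a failed lookup means not listed."""
    try:
        answer = socket.gethostbyname(dnsbl_name(ipv4, zone))
    except OSError:
        return False
    return answer.startswith("127.")


def rbl_score(ipv4: str, weights: Dict[str, float],
              lookup: Callable[[str, str], bool] = is_listed) -> float:
    """Sum the weights of every list that has this address."""
    return sum(w for zone, w in weights.items() if lookup(ipv4, zone))


def should_toss(ipv4: str, weights: Dict[str, float], threshold: float,
                lookup: Callable[[str, str], bool] = is_listed) -> bool:
    """Toss the message once the combined score reaches the threshold."""
    return rbl_score(ipv4, weights, lookup) >= threshold
```

    The `lookup` parameter exists so the scoring can be tried out with a fake resolver before pointing it at real lists.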

    --
    BMO

  9. Matching multiple simultaneous regular expressions by careysb · · Score: 2

    Many years ago I worked with a Unix development tool called LEX that could handle matching multiple patterns simultaneously. Perhaps there is an updated tool that would do the same thing. Java has a 3rd party library called ANTLR that might do the trick. It would involve re-compiling every time a new pattern is added but it should be extremely fast.

  10. Sqlite will be awesome by swillden · · Score: 2

    Sqlite, or anything that uses an index, will be screaming fast.

    Your statement of your current solution makes me wonder, though.. are you using "egrep -F -f pattern_file e_mail_message"? Or are you running egrep many times, once per line of the pattern file, or once per line of the message? I would think that given a pattern file egrep would be smart enough to do something better than repeatedly scanning the input, but based on the time it's taking, it sounds like that's happening.

    --
    Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  11. Re:perl or python or whatever by retchdog · · Score: 2

    doing anything but repeated egreps is probably fast enough. he should do whatever is easiest, which probably isn't lisp.

    --
    "They were pure niggers." – Noam Chomsky
  12. Re:Short circuit by CanadianMacFan · · Score: 2

    You might also want to look at how patterns are added to the file too. If they are added to the end then the latest spam of the day message will need to parse all of the patterns until it hits the latest pattern. Of course ideally you might want to set something up that looks at the hits each pattern gets so that you could parse the most likely patterns first followed by the latest patterns.
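
    A small Python sketch of that ordering, assuming fixed-string patterns checked one at a time with a stop at the first hit. The function names and the hit-count dictionary are illustrative; with a single combined matcher the order matters far less.

```python
from typing import Dict, List, Optional


def first_match(patterns: List[str], message: str) -> Optional[str]:
    """Short circuit: return the first pattern found in the message, or None."""
    for pattern in patterns:
        if pattern in message:
            return pattern
    return None


def reorder(patterns: List[str], hits: Dict[str, int]) -> List[str]:
    """Most-hit patterns first; among equal counts, the most recently added first."""
    indexed = list(enumerate(patterns))  # position in the file stands in for age
    indexed.sort(key=lambda item: (-hits.get(item[1], 0), -item[0]))
    return [pattern for _, pattern in indexed]
```

    Here the file order is assumed to be oldest first, so a higher index means a newer pattern.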

  13. Re:Or... by Count+Fenring · · Score: 3, Interesting

    Is it? I've never had a false positive in all the years I've been using GMail.

    That you noticed. There's a fairly high bias inherent there; it just has to not have hit something that was both noticeable and that you knew was incoming.

  14. Problem spotted. by girlintraining · · Score: 4, Insightful

    The problem is that you're using egrep in the first place. Here's the thing -- the overwhelming majority of your cycles are getting sucked up loading, initializing, executing, then unloading that process. It's not that using regular expressions is processor-intensive... it's that repeatedly launching the same executable is.

    Use something that can load once, read in the patterns, check all the e-mails that are queued, sort them, then exit. Your execution time will go from 15 seconds to 150 milliseconds.

    --
    #fuckbeta #iamslashdot #dicemustdie
    1. Re:Problem spotted. by complete+loony · · Score: 3, Interesting

      If you have sufficient programming experience, I'd recommend basing this solution on redgrep. It's an LLVM-based expression compiler that should be able to combine multiple expressions into a single machine code state machine, assuming it doesn't run out of memory in the process. With a bit of effort you could output all of your compiled expressions into a single executable so you'll only need to wait for the compilation time when you add more filters.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
  15. Procmail is a fine tool -- but the wrong tool by Arrogant-Bastard · · Score: 5, Informative

    If spam has made it far enough that it's actually reached your personal instance of procmail, then there's been a problem earlier in the chain. Procmail rulesets should be a last resort, and they should only be asked to deal with minor issues that aren't dealt with via earlier rulesets.

    The first line of defense is your perimeter routers. They should implement BCP 38, they should block bogons, and they should bidirectionally deny all traffic to/from the Spamhaus DROP list. In addition, they should block inbound port 25 traffic from everywhere on the planet that you don't need email from. In other words, the fact that someone in country X wants to email you is unimportant unless you actually wish to receive mail from them. Yes, this is a reversal of default-permit, for a simple reason: default-permit for SMTP stopped being reasonable around 2000. Use http://www.ipdeny.com/ to pick up the ranges per-country and only permit what you need. (Obviously a major research university can't do this. But Joe's Furniture, which does not have customers in Peru or Pakistan or Greece, can.)

    Then use blacklists, the best defense against spam we've ever developed. (Source: 30+ years of email experience) Spamhaus's Zen blacklist is a good one with a low FP rate and a tolerable FN rate. Augment these with local blacklists based on domains and network allocations. Augment those with as much blocking of generic hostnames and dynamic IP space as possible: real mail servers have real hostnames and are on static addresses.

    Then enforce RFC requirements: sending host must have rDNS, that PTR must resolve, what it resolves to should be the sending host's IP. Sending host must HELO as FQDN or bracketed dotted-quad; if FQDN, must resolve. Sending host must not send traffic pre-greeting. And so on. Enforcing these DOES mean occasionally you block mail sent by non-spamming entities: but since they are incompetent non-spamming entities, why would you want mail from them?
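
    The rDNS part of that check (PTR exists, PTR name resolves, and it resolves back to the connecting IP) can be sketched in Python as below. The function names are made up, the standard-library calls used here only handle IPv4, and a real MTA would normally do this in its own configuration rather than in a script.

```python
import socket
from typing import Callable, List, Optional


def ptr_lookup(ip: str) -> Optional[str]:
    """Reverse lookup: the host name the PTR record points to, or None."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(host: str) -> List[str]:
    """Forward lookup: the IPv4 addresses the name resolves to (empty on failure)."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def rdns_confirmed(ip: str,
                   ptr: Callable[[str], Optional[str]] = ptr_lookup,
                   forward: Callable[[str], List[str]] = forward_lookup) -> bool:
    """True only if the PTR name exists and resolves back to the same IP."""
    host = ptr(ip)
    return host is not None and ip in forward(host)
```

    The two lookups are passed in as parameters so the logic can be exercised without touching DNS.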

    Add greylisting. It'll handle a lot of annoying hosts that haven't learned to retry yet.

    Rate-limit based on normative values for your site. For example: if analysis of a year's worth of mail logs shows that during that time you never received more than 10 messages a day from ANY host, then rate-limit at 30 or 40. You'll never hit in normal practice; but if you get hammered by a fast-sending host, you'll blunt the attack. Note that these don't have to be perfect to work: provided you send deferrals (SMTP response codes 4xx) instead of refusals (5xx) the worst that happens is that you will mistakenly impose a delay.
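
    A toy version of that per-host limit in Python, answering with a deferral rather than a refusal once a host goes over. The class name, the fixed UTC-day buckets and the default of 30 are assumptions for illustration; old buckets are never purged here.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class DailyRateLimiter:
    """Count messages per sending host per day; defer once the limit is exceeded."""

    def __init__(self, limit_per_day: int = 30) -> None:
        self.limit = limit_per_day
        self.counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def check(self, host: str, now: float) -> int:
        """Return 250 to accept, or 451 (a temporary failure, so the sender retries)."""
        day = int(now // 86400)  # whole days since the epoch
        self.counts[(host, day)] += 1
        return 250 if self.counts[(host, day)] <= self.limit else 451
```

    Because the over-limit answer is a 4xx, a legitimate sender that trips the limit is only delayed until the next day's bucket.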

    There's more -- it's possible to get quite crafty about this. But note that NONE of these measures pay any attention to content. There's a reason for that: spammers can defeat content-based measures at will. They won't have it so easy with these.

    Deployed in production in various setups ranging from a dozen to eight million users, these steps yield a FP rate of about 10e-6 to 10e-7 and a FN rate around 10e-5 to 10e-6. Tuning helps, of course: initial rates can be higher but log analysis (which all sensible postmasters do) readily brings them down. If you have the luxury of running your own mail server just for yourself, then you can REALLY tune this setup: you should be able to get the FN rate down to 10e-7 after a few months.

  16. Re:Or... by dbIII · · Score: 4, Interesting

    and I haven't lost any real mail due to Gmail's filtering.

    Email from people at one site I look after used to vanish into a black hole at gmail until I convinced them to replace the GIF of their corporate logo attached to all their emails with a PNG version. That's some real mail lost due to gmail's filtering.

    IMHO it's better to do the filtering somewhere where you have access to the stuff that is discarded. False positives may be rare now but they still happen. That's why I like stuff such as MailScanner (open source wrapper for spamassassin+your choice of commercial antivirus and/or clamav+other open source stuff+distributed updating rulesets) run on site. There's plenty of others that give you this function including some of the commercial "appliances" and outsourced email filtering.

    Also blocking mail from dynamic IPs is a good idea.

    It used to be the case that one IP address I have a mail server on would get blocked for a couple of days every year because some idiot at a blacklist would load in an obsolete list of dynamic IP addresses from what is now a decade ago. As IPv4 addresses diminish expect the lists of dynamic addresses to become outdated very quickly.

  17. regular expression optimiser by lkcl · · Score: 2

    i'd be interested to see what happens if you run those regexes through this:
            http://bisqwit.iki.fi/source/regexopt.html

    btw can we please get a copy of the patterns you're using? i think they might prove useful for other people. also i'd like to test them myself against regexopt.

    oh - to the other person who suggested spamassassin? i tried that, i set it up to run at MTA-time. it often took THIRTY SECONDS to process a message. in fact it was so bad that i was forced to set a limit of 100k on incoming messages, as a lot of virus-ridden word documents (etc) were typically over 100k. that cut down the amount of CPU cycles but it was still far far too much memory and far too CPU intensive.

    the one thing that did work well is greylisting, however the problem with greylisting i find is that if you happen not to be at the computer or have direct access to the server and people on the phone say "i'm sending you a message now, have you got it?" you *know* it's going to be at least an hour before it'll arrive. so, unless you can whitelist them in advance (which you can't always do) greylisting does actually interfere with legitimate business.

    anyway: in the end i gave up and went to gmail, but with gmail fucking up how they're doing things i have to revisit this and set up a mail server again. thus we come full circle...

  18. Re:Or... by pongo000 · · Score: 2

    At this point he can then run the mail through a series of weighted RBLs.

    Fuck you and your RBLs. RBLs are a draconian solution that do immeasurable damage to those of us who (1) aren't spammers, and (2) choose to run our own mailservers on business-class IPs. I can't tell you how many times various IPs I use for outbound mail (I run several mailing lists) end up on an RBL for absolutely no fucking reason.

    Oh, because someone in the same /24 block sent spam? Really? That's a good reason to block an entire /24 subnet?

    RBLs are a solution in search of a problem. Some of them are nothing more than moneymakers for the people that run them: In order to get off their list, they blackmail you into paying money.

    Want to do the world a favor? Don't use RBLs. You'll just end up finding yourself blacklisted at some point anyway.

  19. Use perl by Forever+Wondering · · Score: 2

    A long time ago I benchmarked perl's regex engine against about 5 others. At the time, it was 10x faster than the nearest competitor for the same regex/data.

    Also, you can use perl's "study". Or, split the regexes across threads.

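    Also, with perl you can do some hierarchical savings. For example:
        if (/Ffoo/) { ... }
        if (/Fbar/) { ... }
        if (/Fbaz/) { ... }

    Could be redone as:
        if (/F/) {
            # only messages containing "F" reach the more specific tests
            if (/Ffoo/) { ... }
            if (/Fbar/) { ... }
            if (/Fbaz/) { ... }
        }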

    The above is trivial example, but you get the idea.

    Also, how much time is spent compiling (vs. executing) the regexes in egrep? I imagine a lot and you have to do this for each incoming message.

    Note that spamassassin (and hence perl) can be set up as a daemon where the regexes are compiled once. The messages are passed through a socket to the daemon. This means that the only CPU time spent is on executing the regexes--a considerable savings.

    Additionally, perl regexes have [considerably] more functionality/utility than egrep ones. You might be able to recode/consolidate yours and get the same [or better] bang for less buck.

    --
    Like a good neighbor, fsck is there ...
  20. Re:Or... by CaptQuark · · Score: 2

    Last night I sent my Gmail account an email from my ISP email system, then waited for it to show up. Nothing. So I resent it. Second time nothing.

    The email contained two screen captures I needed at the office. The subject line was "Steve on telework". Nothing obvious that would trip Postini's spam filter. It is now 24 hours later and neither has shown up. I wonder how many other emails I don't get.
