Ask Slashdot: Speeding Up Personal Anti-Spam Filters?
New submitter hmilz writes "I've been using procmail for years to filter my incoming mail, and over time a long list of spam patterns was created. The good thing about the patterns is, there are practically no false positives, and practically no false negatives, i.e. I see each new spam exactly once, and lose no legit mail. This works by using an external spam-patterns file, containing one pattern per line, and running an 'egrep -F' against it. As simple as this is, with a long pattern list this becomes rather slow and CPU consuming. An average mail currently needs about 15 seconds to be grepped. In other words, this has become quite clumsy over time, and I would like to replace it by a more (CPU, hence energy) efficient method. I was thinking about a small indexed database or something. What would you recommend and use if you were me? Is sqlite something to look at?"
have you tried spamassassin?
You could route everything through gmail and wash out the spam.
Gmail's spam detection is spectacular.
inb4 gmail hate.
--
BMO
Look up CRM114.
Write something that uses a regular expression library (RE2 would be ideal, if your expressions are actually regular), and keeps the compiled patterns resident. Most of your time is likely spent parsing the patterns.
What would the database achieve? I'm not sure what is the exact nature of the patterns (an example would really help here), but perhaps writing a compiler from the patterns into some decision procedure in something reasonably efficient yet featuring quick start, such as SBCL or Gambit, could help.
Ezekiel 23:20
http://bogofilter.sourceforge.net/
I haven't timed it to see how well it's been doing in the 6 years I've had it, though.
X(7): A program for managing terminal windows. See also screen(1).
a big one. 15s per email???
holy smokes you are so fired.
Sorry, couldn't resist the pun.
Your problem (besides not using existing Bayesian tools...) is that every single egrep is a fork. As others have pointed out, you should rewrite your script in something like Python and use the native regex libraries. Even if you have to read and 'compile' the regex list every time, you're saving a *massive* amount of OS-level overhead.
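To make the suggestion concrete, here is a minimal Python sketch of the "one process, native matching" idea. It assumes the patterns are fixed strings, one per line (as the submitter's use of -F implies); the function names and the file name in the usage comment are made up for illustration and are not from the original setup.

```python
def load_patterns(lines):
    """Return the non-empty patterns from an iterable of lines."""
    return [line.rstrip("\r\n") for line in lines if line.strip()]


def first_match(message, patterns):
    """Return the first fixed-string pattern found in message, else None."""
    for pattern in patterns:
        if pattern in message:      # plain substring test, no regex engine
            return pattern
    return None


# Hypothetical usage:
#     with open("spam-patterns.txt") as fh:
#         patterns = load_patterns(fh)
#     verdict = first_match(raw_message, patterns)
```

Whether this beats a single "grep -F -f" call depends on how the original script invokes grep; the saving described above only applies if one process is being started per pattern.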
It seems you could easily distribute the load on multiple machines, each doing a subset of the regex.
Try compiling your patterns using Ragel: http://www.complang.org/ragel/
Union them all together and you'll see orders of magnitude improvement in performance (e.g. 10x - 100x) over other regular expression engines, although GNU grep is using Aho–Corasick with the -F switch, so you're likely to see less of an improvement.
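For readers who have not met the Aho-Corasick algorithm mentioned above, the following is a small illustrative Python sketch of it: all fixed strings are merged into one automaton, and the message is scanned once regardless of how many patterns there are. It is a teaching sketch, not a reproduction of what GNU grep or Ragel actually generate, and the function names are invented here.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton from non-empty fixed strings."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> patterns that end at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: shallower states are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

For a spam filter only the first hit matters, so a real version would return as soon as `out[state]` is non-empty.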
Many people use re2c, but it has nowhere near the performance or capabilities of Ragel. Ragel has a steep learning curve, but it's well worth the effort to master. It's well maintained, and has been for years.
15 seconds per email? That must be one heck of a pattern list. I used to rely on procmail for filtering. In simpler times it did everything I needed.
First of all, set up grey listing. 99.99% of the emails you're receiving never make it past grey listing. You can nearly forget about filtering again once grey listing is enabled.
Second, add a reject client like zen.spamhaus.org to your mail server to stop the emails that make it past grey listing.
You can continue to filter anything that makes it past those two barriers, but I think you'll find your filters are redundant at that point. In fact you can probably cut procmail from the process entirely, unless you use it to do other stuff with the mail.
http://www.gigamonkeys.com/book/practical-a-spam-filter.html has the nuts and bolts. CL-PPCRE does perl regex matching faster than perl.
Don't complain about syntax, grammar, or spelling. There is no.hell like input on android.
Does your process require that all of the regexes are tried in turn or is it the case that if it hits one of your patterns that it's marked as spam? If the latter, are you able to rank the patterns from most likely to least likely to be matched? And, if so, can you stop your process once a match is made? If all of those things are true, then you should be able to cut the time/CPU/energy required to do the filtering
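A rough Python sketch of the ranking idea follows, assuming that a single matching pattern is enough to mark a message as spam. The class and method names are hypothetical; the hit counts would need to be persisted between runs for the ranking to be useful.

```python
from collections import Counter


class RankedMatcher:
    """Try fixed-string patterns in order and stop at the first hit."""

    def __init__(self, patterns):
        self.patterns = list(patterns)
        self.hits = Counter()       # pattern -> number of messages it caught

    def is_spam(self, message):
        for pattern in self.patterns:
            if pattern in message:
                self.hits[pattern] += 1
                return True         # early exit: later patterns are skipped
        return False

    def rerank(self):
        """Move the most frequently matching patterns to the front."""
        # list.sort is stable, so ties keep their existing relative order.
        self.patterns.sort(key=lambda p: -self.hits[p])
```

Note that legitimate mail still has to be checked against every pattern, so ranking only speeds up the spam case.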
Consider using a proper learning filter, like dspam. You can pipe it through procmail just as easily, and you can feed your corpus of spam into it. You won't get 100%, but it'll recognize spam you haven't seen.
:0f
*
| /usr/bin/dspam --deliver=stdout
I've heard, but never timed it myself, that perl is faster for regexp-type stuff than even the specialized tools, just from the massive amount of optimization it has accrued over the years; here is a completely unbiased source. Use a perl or python script, and consider using Storable (perl) or pickle (python) to serialize the data structure, I guess, but just having the whole list in memory will help.
According to this, perl regexps are (unsurprisingly) a superset of egrep's.
I don't see how introducing SQL could do much to help speed, or anything else, in this application.
"They were pure niggers." – Noam Chomsky
Many years ago I worked with a Unix development tool called LEX that could handle matching multiple patterns simultaneously. Perhaps there is an updated tool that would do the same thing. Java has a 3rd party library called ANTLR that might do the trick. It would involve re-compiling every time a new pattern is added, but it should be extremely fast.
Sqlite, or anything that uses an index, will be screaming fast.
Your statement of your current solution makes me wonder, though.. are you using "egrep -F -f pattern_file e_mail_message"? Or are you running egrep many times, once per line of the pattern file, or once per line of the message? I would think that given a pattern file egrep would be smart enough to do something better than repeatedly scanning the input, but based on the time it's taking, it sounds like that's happening.
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
... I just gave up on email. Even w/o spam it's more hassle than I like.
now we need to go OSS in diesel cars
Use Perl. Its regex engine is highly optimized and very fast. It should really fly on fixed strings.
There's a Stack Overflow question that addresses this very thing with some Perl code you can try.
You could run whitelisting rules first to allow messages that are obviously non-spam through without them having to pass through all of the spam rules. This could be the standard address book whitelisting so all of your friends' and colleagues' messages pass immediately.
For a bit more complex solution you could run messages through something like SpamAssassin first -- for any messages that have a spam score above a certain threshold you run them through your custom rule set. Since you have a high degree of trust in your rule set you could make this threshold quite low -- again mainly so SpamAssassin will just act as a whitelist to let clearly good messages through immediately.
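The whitelist-first ordering might look like the Python sketch below. It assumes an address-book style whitelist of lower-case sender addresses and fixed-string spam patterns; `classify` and its parameters are illustrative names only. Keep in mind that the From header is trivially forged, so a whitelist keyed on it trades some safety for speed.

```python
from email import message_from_string
from email.utils import parseaddr


def classify(raw_message, whitelist, spam_patterns):
    """Return 'ham' or 'spam'; whitelisted senders skip the pattern scan."""
    msg = message_from_string(raw_message)
    _, sender = parseaddr(msg.get("From", ""))
    if sender.lower() in whitelist:
        return "ham"                # trusted sender: no pattern matching
    if any(pattern in raw_message for pattern in spam_patterns):
        return "spam"
    return "ham"
```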
Just route everything from Facebook, LinkedIn, my dad, Apple and "i*" to the spam folder, and most of it is covered.
This issue is a bit more complicated than you think.
Junkemailfilter
Http://www.junkemailfilter.com
Outsource
The problem is that you're using egrep in the first place. Here's the thing -- the overwhelming majority of your cycles are getting sucked up loading, initializing, executing, then unloading that process. It's not that using regular expressions is processor-intensive... it's that repeatedly launching the same executable is.
Use something that can load once, read in the patterns, check all the e-mails that are queued, sort them, then exit. Your execution time will go from 15 seconds to 150 milliseconds.
#fuckbeta #iamslashdot #dicemustdie
If spam has made it far enough that it's actually reached your personal instance of procmail, then there's been a problem earlier in the chain. Procmail rulesets should be a last resort, and they should only be asked to deal with minor issues that aren't dealt with via earlier rulesets.
The first line of defense is your perimeter routers. They should implement BCP 38, they should block bogons, and they should bidirectionally deny all traffic to/from the Spamhaus DROP list. In addition, they should block inbound port 25 traffic from everywhere on the planet that you don't need email from. In other words, the fact that someone in country X wants to email you is unimportant unless you actually wish to receive mail from them. Yes, this is a reversal of default-permit, for a simple reason: default-permit for SMTP stopped being reasonable around 2000. Use http://www.ipdeny.com/ to pick up the ranges per-country and only permit what you need. (Obviously a major research university can't do this. But Joe's Furniture, which does not have customers in Peru or Pakistan or Greece, can.)
Then use blacklists, the best defense against spam we've ever developed. (Source: 30+ years of email experience) Spamhaus's Zen blacklist is a good one with a low FP rate and a tolerable FN rate. Augment these with local blacklists based on domains and network allocations. Augment those with as much blocking of generic hostnames and dynamic IP space as possible: real mail servers have real hostnames and are on static addresses.
Then enforce RFC requirements: sending host must have rDNS, that PTR must resolve, what it resolves to should be the sending host's IP. Sending host must HELO as FQDN or bracketed dotted-quad; if FQDN, must resolve. Sending host must not send traffic pre-greeting. And so on. Enforcing these DOES mean occasionally you block mail sent by non-spamming entities: but since they are incompetent non-spamming entities, why would you want mail from them?
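As an illustration of one of these checks, here is a Python sketch that tests only the syntax of the HELO argument: it must look like an FQDN or a bracketed dotted-quad. The DNS parts (does the name resolve, does the PTR match) need a resolver and are deliberately left out; in practice these checks are normally configured in the MTA itself. The function name is invented for this example.

```python
import ipaddress
import re

# One DNS label: letters, digits and hyphens, not starting or ending with "-".
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def helo_is_acceptable(helo):
    """True if helo is an FQDN or a bracketed IPv4 literal like [192.0.2.1]."""
    if helo.startswith("[") and helo.endswith("]"):
        try:
            ipaddress.IPv4Address(helo[1:-1])
        except ValueError:
            return False
        return True
    labels = helo.rstrip(".").split(".")
    # Need at least two labels, and a numeric last label means a bare IP.
    if len(labels) < 2 or labels[-1].isdigit():
        return False
    return all(_LABEL.fullmatch(label) for label in labels)
```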
Add greylisting. It'll handle a lot of annoying hosts that haven't learned to retry yet.
Rate-limit based on normative values for your site. For example: if analysis of a year's worth of mail logs shows that during that time you never received more than 10 messages a day from ANY host, then rate-limit at 30 or 40. You'll never hit in normal practice; but if you get hammered by a fast-sending host, you'll blunt the attack. Note that these don't have to be perfect to work: provided you send deferrals (SMTP response codes 4xx) instead of refusals (5xx) the worst that happens is that you will mistakenly impose a delay.
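A per-host rate limit of this kind can be sketched in a few lines of Python. This is only a model of the logic, with made-up names and an in-memory table; a real deployment would use the MTA's own rate-limiting features. The limit of 40 per day mirrors the example figures above, and the deferral is expressed as a 4xx reply code so that a mistaken hit only delays mail.

```python
import time
from collections import defaultdict, deque


class HostRateLimiter:
    """Allow at most `limit` messages per host in any `window` seconds."""

    def __init__(self, limit=40, window=86400):
        self.limit = limit
        self.window = window
        self.seen = defaultdict(deque)   # host -> timestamps of accepted mail

    def check(self, host, now=None):
        """Return 250 to accept, or 451 (a temporary failure) to defer."""
        now = time.time() if now is None else now
        stamps = self.seen[host]
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()             # forget mail outside the window
        if len(stamps) >= self.limit:
            return 451
        stamps.append(now)
        return 250
```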
There's more -- it's possible to get quite crafty about this. But note that NONE of these measures pay any attention to content. There's a reason for that: spammers can defeat content-based measures at will. They won't have it so easy with these.
Deployed in production in various setups ranging from a dozen to eight million users, these steps yield a FP rate of about 10e-6 to 10e-7 and a FN rate around 10e-5 to 10e-6. Tuning helps, of course: initial rates can be higher but log analysis (which all sensible postmasters do) readily brings them down. If you have the luxury of running your own mail server just for yourself, then you can REALLY tune this setup: you should be able to get the FN rate down to 10e-7 after a few months.
When my ISP discontinued the use of procmail filters, I moved it to my home computer and configured two filters in Evolution: the first one to auto-remove mail marked by my ISP as suspected spam, and the next to pipe the mail through bmf and remove it if it tested positive for spam. When I say "auto-remove", I mean it's moved to a spam folder where I can double-check it in case false positives get through.
http://sourceforge.net/projects/bmf/
Install CRM114, set it up, and begin teaching it spam from non-spam.
Very quickly it will "learn" and you'll seldom ever see a spam message.
http://crm114.sourceforge.net/
I've used Fail2Ban and some regular expressions to help filter out things. For example, when you email someone and you get the address wrong, you get an email kicked back with the 450 error code.
So, I use Fail2Ban to look for 450 error codes, and if it sees that 5x within 10 minutes, it blocks your IP address for 24 hours.
Couple that with blocking entire countries' IP ranges (China, Russia, etc.), and I see little to no spam at all.
I've been using popfile for years. Works great! Try it.
i'd be interested to see what happens if you run those regex's through this:
http://bisqwit.iki.fi/source/regexopt.html
btw can we please get a copy of the patterns you're using? i think they might prove useful for other people. also i'd like to test them myself against regexopt.
oh - to the other person who suggested spamassassin? i tried that, i set it up to run at MTA-time. it often took THIRTY SECONDS to process a message. in fact it was so bad that i was forced to set a limit of 100k on incoming messages, as a lot of virus-ridden word documents (etc) were typically over 100k. that cut down the amount of CPU cycles but it was still far far too much memory and far too CPU intensive.
the one thing that did work well is greylisting, however the problem with greylisting i find is that if you happen not to be at the computer or have direct access to the server and people on the phone say "i'm sending you a message now, have you got it?" you *know* it's going to be at least an hour before it'll arrive. so, unless you can whitelist them in advance (which you can't always do) greylisting does actually interfere with legitimate business.
anyway: in the end i gave up and went to gmail, but with gmail fucking up how they're doing things i have to revisit this and set up a mail server again. thus we come full circle...
Unless it's a fun hobby for you, it makes much more sense to just pay for email and let somebody else do it. Personal email can be had for about $2/month.
I don't respond to AC's.
The canonical spam solution checklist.
I'm going with: Specifically, your plan fails to account for: (x) Users of email will not put up with it.
Help stamp out iliturcy.
Provided your pattern file is under 340K, 'agrep -f' is about twice as fast.
there are million dollar companies that can detect it faster and even better than your OSS bullshit half assed script for free
quit pretending, its not 1994 anymore
depending on where your time is going, consider splitting the file up into pieces and run each piece in a different thread.
vice chair orange county java users group (ocjug.org).
A long time ago I benchmarked perl's regex engine against about 5 others. At the time, it was 10x faster than the nearest competitor for the same regex/data.
Also, you can use perl's "study". Or, split the regexes across threads.
Also, with perl you can do some hierarchical savings. For example:
/Ffoo/ ...
/Fbar/ ...
/Fbaz/ ...
Could be redone as:
if (/F/) {
    ... if (/Ffoo/)
    ... if (/Fbar/)
    ... if (/Fbaz/)
}
The above is a trivial example, but you get the idea.
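The same guard idea, transliterated into Python as a sketch: a cheap substring test on the shared prefix decides whether the more detailed patterns are tried at all. The helper names are hypothetical, and the suffixes are treated as regex fragments.

```python
import re


def build_guarded(prefix, suffixes):
    """Return the literal guard plus the compiled patterns it protects."""
    detailed = [re.compile(re.escape(prefix) + suffix) for suffix in suffixes]
    return prefix, detailed


def matches(text, guard, detailed):
    if guard not in text:       # one cheap scan rules out most messages
        return False
    return any(rx.search(text) for rx in detailed)
```

The saving is only real when the guard is rare in ordinary mail; a guard that almost always matches just adds one more scan.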
Also, how much time is spent compiling (vs. executing) the regexes in egrep? I imagine a lot and you have to do this for each incoming message.
Note that spamassassin (and hence perl) can be set up as a daemon where the regexes are compiled once. The messages are passed through a socket to the daemon. This means that the only CPU time spent is on executing the regexes--a considerable savings.
Additionally, perl regexes have [considerably] more functionality/utility than egrep ones. You might be able to recode/consolidate yours and get the same [or better] bang for less buck.
Like a good neighbor, fsck is there
Here are several things you can do to make this faster.
1) first don't keep invoking egrep. this has to parse the command line and then re-load the egrep command itself every time. Instead do this from within a loaded program. Perl is a very good choice for this
2) the perl command can pre-compile the regular expression. So you can leave the perl program running as a process then simply feed it new data to analyse.
3) given you are searching for words, you probably want to split the incoming stream on white space one-time not every time.
4) even better than that, take the e-mail, parse it to words, then parse each word into all 3,4,5,6,7,8 consecutive strings. Then just look these up in a hash table.
5) if you are only trying to match from the start of the word, (not interior word strings) then this hashing becomes trivial.
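Steps 3 to 5 above can be sketched in Python as follows. The sketch assumes the spam patterns are single tokens of 3 to 8 characters with no whitespace, stored lower-case in a set; patterns that span several words would need a different approach. All names are illustrative.

```python
def word_ngrams(word, min_len=3, max_len=8):
    """Yield every substring of word whose length is min_len..max_len."""
    for size in range(min_len, min(max_len, len(word)) + 1):
        for start in range(len(word) - size + 1):
            yield word[start:start + size]


def find_spam_tokens(message, spam_tokens):
    """Return the known spam tokens that occur inside any word."""
    found = set()
    for word in message.lower().split():    # split on whitespace once
        for gram in word_ngrams(word):
            if gram in spam_tokens:         # hash lookup per candidate
                found.add(gram)
    return found
```

For the prefix-only case in step 5, the inner loop shrinks to looking up `word[:n]` for each allowed length n.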
Some drink at the fountain of knowledge. Others just gargle.
Start with spamd and get your spam levels down immensely.
man page: http://www.openbsd.org/cgi-bin/man.cgi?query=spamd&sektion=8
I used various procmail setups, but for years now I have relied on bogofilter.
I have since disabled autolearn, as that's the part that takes time.
I trained it with a couple of megabytes of ham and spam and was done. From time to time, when something gets classified wrong, I push it in for learning.
I've never had the wish to look for something else.
I don't even use spam blockers. Instead I've purchased a domain, which is quite affordable nowadays. I have a catch-all redirect, so I receive any mail addressed to *@mydomain.com.
Then, I give a unique username to each organisation. e.g. slashdot@mydomain.com. If I receive spam at this address, I inform them, then kill the username. I can also just create slashdot2@mydomain.com if I want to keep dealing with their company.
Now, I receive only a few spam emails each year, so I need to do zero automated filtering. I also don't have to deal with the worry of false positives at all.
http://en.wikipedia.org/wiki/Greylisting
A project I worked on many years ago re-wrote a monitoring system in Java.
It was Perl, running a rather large list of regex's over syslog files.
The process of converting it to Java resulted in a 100x speed up - despite Perl possibly having a faster regex implementation. The regular expressions are compiled once on start-up. Regular expressions can be very fast - they're just slow to parse and compile.
I don't know what you are doing to make it run so massively slow, but I have a similar setup running on an ancient P2 400MHz server with thousands of regexp filter rules in procmail, and scanning each incoming non-matching email takes only around a hundred milliseconds; matching spam emails takes much less.
if you have regexes like this:
re1
re2
re3
You can just combine them into
(re1|re2|re3)
and have it fork just one copy. Even better if you can compile them.
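In Python the combination step might be sketched like this; the function names are invented for the example. Note that Python's re module is a backtracking engine, so a large alternation is not guaranteed to behave like the single compiled automaton a DFA-based tool would build, and patterns that use numbered backreferences would need renumbering after being joined.

```python
import re


def combine(patterns):
    """Join many regex sources into one compiled alternation."""
    # Wrap each source in a non-capturing group so that a "|" inside one
    # pattern cannot leak into its neighbours.
    return re.compile("|".join("(?:%s)" % p for p in patterns))


def combine_literals(strings):
    """Same idea for fixed strings: escape them, longest first."""
    ordered = sorted(strings, key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in ordered))
```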
How about instead of sitting there watching it process you just block your own access to viewing this 15 second delay and ignore it. Just don't care about it. Pretend it doesn't happen and your mail just arrived in your inbox.
I can see no situation where email being delayed by 15 seconds is going to cause an issue.
There is a comparison of blacklists: http://dnsbl.inps.de/analyse.cgi?type=monthly&lang=en
nosig today
You get so much mail so furiously that you can't suffer a 15 second delay? I presume you're talking about a personal mail server... if you're hosting mail for 1000 people, then yeah, that's a problem.
Yes Francis, the world has gone crazy.
ASSP, Anti-Spam SMTP Proxy.
Ran it for a few years with a domain of several hundred users. What I liked best is that it blocks spam during the SMTP conversation with the spammy sender.
Just forward your mail through gmail. That way all the spam disappears and the NSA can get their data without trouble.
Excuse me, but please get off my Pennisetum Clandestinum, eh!
If you require using sophisticated procmail filters on your personal account then it seems like your setup is wrong from the get-go. Your incoming mail server should be taking the brunt of the work and using a progressive and efficient filtering before any filtering by content.
I use a spamdyke based front end that has a whole arsenal of white, black, and gray filtering of emails using RBLs, RHSBLs, reverse lookups, etc. It can also do header "pattern" filtering, but I currently don't use that feature. This blocks almost all spam quickly and efficiently. The last stage is to run it through spamassassin for those things that are in the gray (not a simple reject/accept, but a cumulative scoring) area. Worst case mail delays are on the order of a few seconds through the whole chain. Spamassassin only gets a small number of incoming emails to work on. The stragglers usually come via accounts at yahoo, live, etc.
The nice thing about spamdyke and other systems like it is that it does its job very fast. For example, the blacklists and whitelists in spamdyke can be set up as a directory tree structure, so it is a very quick lookup to determine whether to accept or reject the specified domain or ip address.
I also use systems like honeypots and hunter-seekers. The latter looks at what is graylisted or accepted by spamdyke and does http checks on the domain to see if it should be blacklisted. It also may decide to do tests in ip address neighbors to see if more should be blacklisted.
Like all systems, you must be proactive at identifying mail that shouldn't have been rejected. It is a rare situation, but there are a few companies with badly configured mail servers (like no reverse dns entries). However, after many years of operation my whitelist contains only a handful of domains. The automated blacklist process sends me email when it adds a domain, just in case.
It's called CRM114, at http://crm114.sourceforge.net/.
It's a technically fascinating tool, named after the old Dr. Strangelove movie's tool for filtering authorized communication. It doesn't get the attention it deserves because it's never been well packaged: the author publishes it open source but hasn't cooperated with wrapping it in "autoconf" or some other build structure to build Debian or Red Hat based packages. It uses Markovian, *not* Bayesian pattern matching, which makes an enormous improvement in its pattern matching.
Instead of working from a programmed set of filters with programmed keywords, which professional spammers tune their spam to avoid, it builds its own filters from those Markovian matches of what you don't personally want to see, and relies on you deciding "spam/not-spam" to update its rules, much as Google does these days. But because the filters are individual and embedded in a neural net, it's very difficult to *deduce* the rules, and they change. In fact, it's even possible to train it with a data set that no one else is allowed to see and put it on an outgoing mail filter. This turns out to be useful for filtering outgoing, confidential data from doctors or stock brokers or intelligence agencies.
Thank you for posting that checklist, that's a vital document for any spam planning.
SpamAssassin, executed through procmail on the mail client's email, is indeed resource intensive and does not scale well for an organization. Other people have mentioned other upstream filtering techniques, such as grey listing and DNS blacklists, but those are limited because of the large numbers of zombied Windows clients around the world, which have their resources rented as botnets to send spam from legitimate environments around the world, partly to evade these filters.
My experience is that spam requires management, not silver bullets. Layers of defense such as supporting SPF, which filters very early and cheaply based on DNS records, help eliminate most forged gmail.com and hotmail.com and other large domain phishing. More powerful, more expensive filters such as SpamAssassin can be applied on the vastly reduced volume of email that gets past the earlier filters. Unfortunately, if you're processing with a local "procmail" by pulling the email from the mail server to your local machine, it's already too late to activate DNS blacklists or SPF, so the increasing burden on SpamAssassin is predictable.
I'm afraid I don't have a great solution for the original poster except to push the filtering upstream, to the mail server itself, to reduce the load with those lightweight filters such as SPF or blacklists.
No - I did my homework to find out exactly how somebody managed to fuck up communication and greatly delay messages from one end, and found that the answer was greylisting implemented very poorly at the remote end. My comment above is because I "did my homework" and observed the downsides. Those downsides are now listed in the wikipedia article.
For the record of yourself and the other idiot making noise about MS Exchange, I had not configured either of the two servers and instead came in after the problem came to light. So it's not just "flamebait" it's also a stupid jump to a conclusion just because I'm critical of yet another flawed anti-spam stopgap that can backfire if care is not taken. Spammers are channelling stuff via real mail servers now or getting their bots to resend so greylisting is losing what effectiveness it had anyway.
We see people complaining about this problem a lot, and yet for some reason they are afraid to actually put energy into a real solution. Repeat after me : filters can never end spam. That's right, never. All your filters (same can be said for every filter, everywhere) do is encourage the spammers to make their spam more obfuscated to improve their odds of passing future filters. It is a huge waste of time and resources and it's an arms race that the spammers will win.
If you want to actually end spam, you need to collaborate with other people who want to end spam. The way to end spam is not through technology but through economics, as there is only one reason why spam is sent: it is profitable. If you can interrupt the flow of money to the spammer they will move on to a different venture. Until then you're only spinning your wheels and wasting time, storage, and CPU cycles.
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
Note first, I am _not_ saying to replace your call to grep with a call to perl. Perl _is_ fast on assembling strings into a great matching system, but it still takes a _very_ long time to parse, say, 65000 separate strings.
So combine them all into one. Use Regexp::Assemble. With a little bit of fidgetting, it works with GNU grep, as well. Here's an example script, that I've named regex-opt:
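!BEGIN regex-opt.pl!
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Assemble;

# Pass -gnu as the first argument to emit GNU grep basic-regex syntax.
my $gnu = 0;
if ((defined $ARGV[0]) && $ARGV[0] eq '-gnu') {
    shift;
    $gnu = 1;
}

# Read one pattern per line from the named files or from stdin.
my $ra = Regexp::Assemble->new;
while (<>) {
    chomp;
    $ra->add($_);
}

my $string = $ra->as_string();
if ($gnu) {
    $string =~ s/\\d/[0-9]/g;          # GNU grep has no \d outside -P
    $string =~ s/\(\?:/\(/g;           # non-capturing groups become plain groups
    $string =~ s/([()?|{}])/\\$1/g;    # basic syntax writes these operators with a backslash
}
print $string;
!END!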
So, you have a file with your tens of thousands of lines of patterns to match. Ok, ./regex-opt < patterns.txt > matchpattern.re. This may work with egrep, but it's perl regex syntax, so maybe not completely -- procmail | egrep -f matchpattern.re
With 65000 lines, GNU grep takes about half an hour for the tasks I give it. After assembling all 65000 lines into one expression, even when that expression is _megabytes_ in size, it loads quickly and has the speed of a decision tree.
So, as you accumulate new patterns, output them to a file. Also, _always_ keep your list of separate match patterns -- I'm not sure how well this package can handle reparsing a regex back into itself. Do matches like so:
egrep -f <(cat matchpattern.re newpatterns.txt)
and once a week,
cat allpatterns.txt newpatterns.txt | regex-opt > matchpattern.re; sort -u allpatterns.txt newpatterns.txt > temp.txt && mv temp.txt allpatterns.txt && rm newpatterns.txt
Consider using the Berkeley Database:
http://linux.about.com/cs/linux101/g/libdb.htm
> SpamAssassin, executed through procmail on the mail client's
> email, is indeed resource intensive and does not scale well for
> an organization.
it does scale much better if run through amavis as a persistent process, rather than forked from procmail for each incoming message - much of the CPU usage is from compiling (and re-compiling) the regular expressions over and over again.
pre-processing your regexp lists to consolidate them into far fewer but much longer regexps also gives huge benefits - e.g. instead of 1000 RE rules of 1 line each, join them with '|' and reduce them to 10 or 50. it's far less computational work to match against 50 long and slightly complicated REs than against 1000 simple REs.
in practice, this means generating your spamassassin local.cf file with a script, from one or more "source" files.
even without amavis, SA comes with spamd which provides the same benefit of avoiding RE-recompile - but IMO is a lot more work to configure and maintain than using amavis
even so, i try to reject as much spam as possible in the MTA before the mail gets passed to amavis & SA for final checking.
> My experience is that spam requires management,
> not silver bullets. Layers of defense [...]
yes! SPF, greylisting (even a 5 or 10 second greylisting delay is enough to filter out a huge amount of spam), careful use of RBLs (spamhaus are ethical and have reasonable policies), RHSBLs, DULs, MX-record checks (e.g. reject mail if MX record points to 127.0.0.1), HELO/EHLO checks (block mail claiming to be from my domains or IP address), blocking mail from specific senders and sender domains, tarpitting spammers, and more.
another useful technique is to use well-crafted fail2ban rules to monitor /var/log/mail.log and create temporary iptables rules to block persistent spam sources.
on my home mail server, i also block all mail from specific countries, using IP address and TLD blocking lists - but that's not a good option when spam-filtering mail for a company or organisation.
That seems a very sophisticated, enlightened, multi-layered approach. It can be very difficult to implement so broadly if your mail services are in the hands of another corporate group. MS Exchange managers, for example, can become quite concerned and upset if you want to implement greylisting and SPF blacklists before it even reaches their mail servers, but that's where it's most effective.
Merging the SpamAssassin checks into larger but more efficient regexp statements is a useful technique that I'd encourage you to publish, especially if you publish the tools to build those new rules and move aside the old ones.
My own tests on a Core 2 show the ANTLR Java lexer to get about 1-2 MB/sec throughput. The C output gets around 4-5, and using -flto hits about 8-9.
At 1000 emails/day, that's 15000 seconds of processing time, or over 4 hours. That seems a bit excessive, to say nothing of having your email delayed a cumulative 4+ hours/day.
Like it is done here:
http://www.sanesecurity.co.uk/
i thought merging REs was standard practice by now. i've been doing it since long before I started using SpamAssassin, when I was still mostly using postfix body_checks and header_checks.
here's some of my anti-spam stuff.
the scripts are old, but pretty close to what i actually still use today to generate postfix body/header checks and spamassassin rules.
they're not packaged software you can just install and use - think of them as examples of a particular approach to managing anti-spam rulesets.
BTW, note that with SpamAssassin, fewer and larger rules require less CPU time to run, but reduce the likelihood of multiple matches if there are multiple spammy phrases in an email - max one match per rule. this is why the scripts are configured to generate max of 500-character rule lines, when SA can easily handle 5000 or more characters per line. also, shorter lines are easier to read when debugging problems, and each rule is generated with a unique identifier so I can see which rules are matching for each msg
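The packing step described here can be sketched in Python as below. This is not the poster's script: it is a hypothetical helper that escapes literal phrases, joins them with "|" and starts a new rule whenever the next phrase would push the line past a length cap (500 characters, as in the post), giving each rule a unique name. The name prefix is made up, and writing the pairs out in SpamAssassin or postfix syntax is left to the reader.

```python
import re


def chunk_rules(phrases, max_len=500, prefix="LOCAL_SPAM"):
    """Pack escaped phrases into alternations of at most max_len characters.

    Returns (rule_name, regex_source) pairs. A single phrase longer than
    max_len still gets a rule of its own.
    """
    rules, current, size = [], [], 0
    for phrase in phrases:
        alt = re.escape(phrase)
        extra = len(alt) + (1 if current else 0)   # +1 for the "|" separator
        if current and size + extra > max_len:
            rules.append(current)                  # close the full rule
            current, size, extra = [], 0, len(alt)
        current.append(alt)
        size += extra
    if current:
        rules.append(current)
    return [("%s_%03d" % (prefix, n), "|".join(alts))
            for n, alts in enumerate(rules, 1)]
```

Lowering `max_len` gives more, smaller rules and so more chances for multiple matches per message, which is the trade-off described above.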
Use CRIU (Checkpoint Restore in Userspace) to checkpoint a hot version of grep that has been started and given a couple seconds to load in the dictionary and build its pattern matcher and is thus just awaiting stdin (which you haven't given it). Restore a fresh instance for every new email, and pass the new email into the just-opened stdin for that restored, hot, waiting to go instance.
Instead of launching a fresh grep and initializing it with your corpus, this will create a grep that you can bring online already initialized, ready to go and awaiting input.
Ma-fucking-gic.
Traditionally one could achieve this effect by forking child workers, but that's a fucking huge pain in the ass as far as program design goes, making things really complicated: instead of a single program doing a single thing, it couples many uses of a program into a single program's lifecycle. Daemonized apps require system level management and have to be running. Service apps require complex interfaces to handle the different servicings they are performing. Decouple concerns (stay unix'y: stdin->program->stdout), and CRIU the bitch. Just use a hot program, rather than a cold one.
If the problem persists: fuck grep, its pattern matching is rubbish and it's worthless. Please let us know. You might also consider 'head' 'ing the first 64k or some such of your email to avoid pattern matching the entire doc.
It's possible to use a hottened egrep by booting up one egrep, checkpointing it, then restoring that checkpoint again and again whenever you need an instance.
http://ask.slashdot.org/comments.pl?sid=4150171&cid=44759217
The problem is not using egrep, the problem is not using an existing already launched copy of egrep. Which, you CAN do. And I'd even recommend doing so, because it's manageable and uses sane well known and unfancy tools that are decoupled from each other.
Thanks for writing GIT. So many in this thread immediately jump into alternative options without discussing what's really at the heart of this problem. Grep is fine software and is known to do its job well. As you say, the problem is simply that grep has startup costs, but those can be almost entirely ameliorated.