Slashdot Mirror


Ask Slashdot: Speeding Up Personal Anti-Spam Filters?

New submitter hmilz writes "I've been using procmail for years to filter my incoming mail, and over time a long list of spam patterns was created. The good thing about the patterns is, there are practically no false positives, and practically no false negatives, i.e. I see each new spam exactly once, and lose no legit mail. This works by using an external spam-patterns file, containing one pattern per line, and running an 'egrep -F' against it. As simple as this is, with a long pattern list this becomes rather slow and CPU consuming. An average mail currently needs about 15 seconds to be grepped. In other words, this has become quite clumsy over time, and I would like to replace it with a more (CPU, hence energy) efficient method. I was thinking about a small indexed database or something. What would you recommend and use if you were me? Is sqlite something to look at?"
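One way to picture the "indexed" alternative the submitter asks about is to compile the pattern list once into a single automaton and scan each mail in one pass, instead of launching a fresh grep per message. The sketch below is a minimal pure-Python Aho-Corasick matcher for fixed strings. It is an illustration only, not what grep or procmail do internally; the function names `build_automaton` and `first_match` and the sample patterns are made up for the example.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton for a list of fixed strings."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # patterns recognised when a state is reached
    for pat in patterns:
        if not pat:
            continue  # ignore empty lines in the pattern file
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                goto.append({})
                fail.append(0)
                out.append([])
                nxt = len(goto) - 1
                goto[state][ch] = nxt
            state = nxt
        out[state].append(pat)
    # Breadth-first pass: failure links point to the longest proper
    # suffix of the current prefix that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def first_match(text, automaton):
    """Return the first pattern found in text, or None if nothing matches."""
    goto, fail, out = automaton
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state]:
            return out[state][0]
    return None


# Hypothetical patterns; a real list would be read from the pattern file.
automaton = build_automaton(["cheap pills", "lottery winner"])
print(first_match("You are a lottery winner!", automaton))
```

The scan time depends on the length of the mail rather than on the number of patterns, so the list can keep growing. A long-running filter process would build the automaton once and reuse it for every message.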

190 comments

  1. spamassassin by mdaitc · · Score: 5, Insightful

    have you tried spamassassin?

    1. Re:spamassassin by Scutter · · Score: 2

      Latest News: 2011-06-16: SpamAssassin 3.3.2 has been released, a minor new release primarily to support perl-5.12 and later. Visit the downloads page to pick it up, and for more info.

      Last update was more than two years ago. I know you can refresh your rule sets periodically, but is the software even still maintained?

      --

      "Tell me doctor, with all of your defenses, are there any provisions for an attack by killer bees?"
    2. Re:spamassassin by Anonymous Coward · · Score: 0

      Sendmail, MIMEdefang and spamassassin. Hell MIMEdefang alone is quite powerful.

      Also, I'd like to point out to the guy that the fact that he submitted this question to slashdot, makes it look like he's asking "How do I shot web?"

    3. Re:spamassassin by dbIII · · Score: 4, Informative

      There is still stuff going on in the dev version with an svn commit listed on August 30 2013.
      http://spamassassin.markmail.org/search/?q=#query:%20list%3Aorg.apache.spamassassin.commits+page:1+state:facets

    4. Re:spamassassin by wvmarle · · Score: 4, Informative

      Add greylisting to the mix. For me it stops approx. 90% of junk at the gate. That alone saves >90% of your server's spam workload (90% of the spam checker; a bit extra due to the mail server not having to process the mail at all).

      Of course I don't know about legitimate mail but if someone is trying to send legitimate mail through a spam-type minimised mail server that doesn't retry, that's their problem...
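
      For readers who haven't met greylisting: the server temporarily refuses the first delivery attempt from an unknown (client IP, sender, recipient) triplet and accepts the retry once a delay has passed. A minimal sketch of that bookkeeping follows. The class name, the 300-second delay and the reply texts are assumptions for illustration; a real setup would use the MTA's own greylisting hook and persistent storage.

```python
import time


class Greylist:
    """In-memory greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, delay_seconds=300, clock=time.time):
        self.delay = delay_seconds   # assumed delay, tune to taste
        self.clock = clock           # injectable so the logic can be tested
        self.first_seen = {}

    def check(self, client_ip, sender, recipient):
        """Return an (SMTP code, text) pair for this delivery attempt."""
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Remember when this triplet was first seen; keep the old time on retries.
        seen = self.first_seen.setdefault(key, now)
        if now - seen < self.delay:
            # 4xx is a temporary failure: a compliant server will retry later.
            return 451, "Greylisted, please try again later"
        return 250, "OK"


greylist = Greylist()
print(greylist.check("192.0.2.1", "someone@example.org", "me@example.net"))
```

      Fire-and-forget spam senders never come back after the temporary failure, which is where the saving comes from.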

    5. Re:spamassassin by wvmarle · · Score: 4, Insightful

      Maybe the software is pretty much finished? In that case there's not much more to do - no new features to add, and sooner or later you'll run out of bugs to fix.

    6. Re:spamassassin by FridayBob · · Score: 2

      On the mail servers I maintain, I employ SpamAssassin only as a last resort because it is resource-intensive. Submitter hmilz's approach is not only resource-intensive, but also labor-intensive, so I would never recommend it.

      I've used Exim for my MTA since 2001 and my main defense against spam has always been to filter it out before SpamAssassin comes into play based on analysis of header information and checking against DNS black lists. Actually, the first thing I do is look for obvious fakes from a limited number of well-known domains: gmail messages that are not sent from a Google server, eBay messages not from an ebay.com server, etc. Such messages are rejected immediately. However, the bulk of the filters I've collected and developed over the years check a number of items: whether the sender's reverse DNS address is in order, the HELO is correct, whether the sending IP address or any domains mentioned in the header lines are blacklisted, whether the callout works and any DKIM signature is valid, if an RFC-compliant date and To are included, whether any attachments are included with file types that I consider risky (e.g. .bat, .btm, .cmd, .com, .cpl, .dat, .dll, .exe, etc.), if the message headers contain non-ASCII or characters from some unspecified character set, whether any SPF record says that the sending server really is authorized as an MX for the sender's domain, and finally if the incoming message is using one of my domains in its message ID. For all of these types of checks I often have multiple filter statements.

      In the past I would usually reject messages that matched any of these filters. I would hardly ever receive any spam, but would see lots of false-positives, so I had to maintain very long white lists. When I finally got tired of that, I modified the above filtering system so that each filter was categorized. Each category has a variable that starts out as zero, but gets changed to a one with a match for any of the filters in that category. Later in the process the system counts the number of category variables that equal one. Generally, I figure "three strikes and you're out" is a good rule to apply.

      Moreover, my MTA configuration works with a spambox system. For instance, if an incoming message scores only one or two category matches, and/or the message scores less than a certain number of SpamAssassin points, then it gets deposited in the user's spambox instead of their inbox. I've been running the four MTA's in my care like this for the last three years and they've been very reliable and almost totally free of maintenance. They don't require very much in the way of resources either. But best of all, I've had no more complaints from the users at all.
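
      The category scheme described above can be sketched in a few lines. This is not the poster's Exim configuration: the message fields, the example checks and the SpamAssassin threshold `sa_spambox` are hypothetical, and the exact way the SpamAssassin score feeds into the spambox decision is an assumption, since the comment leaves it open.

```python
def category_flags(message, filters):
    """Set a category to True as soon as any of its checks matches."""
    return {
        name: any(check(message) for check in checks)
        for name, checks in filters.items()
    }


def deliver(strikes, sa_score, strike_limit=3, sa_spambox=5.0):
    """Decide where a message goes: 'reject', 'spambox' or 'inbox'."""
    if strikes >= strike_limit:
        return "reject"      # three strikes and you're out
    if strikes >= 1 or sa_score >= sa_spambox:
        return "spambox"     # suspicious, but kept where the user can see it
    return "inbox"


# Hypothetical checks, grouped by category; each takes a message dict.
filters = {
    "dns": [lambda m: not m["rdns_ok"]],
    "helo": [lambda m: not m["helo_ok"]],
    "blacklist": [lambda m: m["ip_listed"], lambda m: m["domain_listed"]],
    "headers": [lambda m: "Date" not in m["headers"]],
}

message = {
    "rdns_ok": False,
    "helo_ok": True,
    "ip_listed": True,
    "domain_listed": True,
    "headers": {"Date": "Mon, 2 Sep 2013 10:00:00 +0000"},
}

flags = category_flags(message, filters)
strikes = sum(flags.values())
print(flags, deliver(strikes, sa_score=2.0))
```

      Counting categories rather than individual filters means two blacklist hits still count as one strike, which is what keeps a single noisy check from rejecting mail on its own.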

    7. Re:spamassassin by whoever57 · · Score: 1

      have you tried spamassassin?

      Indeed. I just looked at logs on a server that acts as an incoming mail filter for a small company. The range of times for spamassassin (spamd) to filter the incoming emails was about 1 to 7 seconds, with most being in the range of 2-4 seconds. This is without bypassing spamd for large emails (spam can be relied upon to be small).

      --
      The real "Libtards" are the Libertarians!
    8. Re:spamassassin by Scutter · · Score: 2

      I don't think there's any such thing as "pretty much finished", especially with a piece of software involved in the arms race that is spam vs. filtering. There's only so much you can do with rules before you need to revisit your engine. Also, it's not just the software that's been stagnant for two years. The website itself hasn't been updated in as long. Not a single news item since 2011. The other respondent mentioned that dev is still active, but dev is not production. Dev is dev. Ever since Spamassassin moved to Apache, it's been pretty much dead.

      --

      "Tell me doctor, with all of your defenses, are there any provisions for an attack by killer bees?"
    9. Re:spamassassin by bill_mcgonigle · · Score: 4, Informative

      The rules sets are updated pretty frequently - that's where the front lines of the battle are. As others have said, the engine is pretty mature.

      The question, I guess, is what do you want spamassassin to do that can't be expressed with the current rules language?

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    10. Re:spamassassin by pongo000 · · Score: 0

      have you tried spamassassin?

      Don't follow this advice. SA has become so slow that it's almost useless. On a VM with 1GB RAM, it takes anywhere from 15-60 seconds to process a single e-mail, and is an incredible resource hog. I've been running SA for years, run the latest stuff, and have pretty much done every tweak imaginable. And the default rules are about useless now as well: The scores are set so low that you have to set a low threshold, increasing your false positive rate. About 50% of the mail on my mail server (personal use, maybe 200-300 inbound messages a day, 90% spam) just gets passed due to spamd timing out.

      Unfortunately, there appear to be no decent alternatives out there. Greylisting is nice, but spammers are wising up to it, and simply resend spam. There was a time about 3-4 years ago that zero spam came through (same inbound volume)...now, it's more like 5-10 a day. Not that I'm complaining. My point being that switching over to SA will not solve any of the submitter's resource woes with procmail.

    11. Re:spamassassin by Chetchez · · Score: 1, Funny

      I express my spam filter rules with interpretive ribbon dancing.

    12. Re:spamassassin by Anonymous Coward · · Score: 0

      But best of all, I've had no more complaints from the users at all.

      That's a clear warning sign. Users always complain. Better recheck your rules to see if you're dropping all mails from your users.

    13. Re:spamassassin by xrayspx · · Score: 1

      There's a lot to do to SA to make it "good". I shared your opinion a year ago. I run a relatively low volume personal mail server for a few domains and a few users. I had SA, but it didn't do much, and I had bigger fish to fry dealing with much larger mail sites than my stupid personal nonsense. I typically get about 300-500 spams a day, and very few legit mails. I was getting false positives, so I'd just never see the mail, and tons of false negatives. About 20% of the daily spam was hitting my inbox, making it unlikely that I'd ever even check my personal mail. If you mailed me, and I didn't have an existing filter from you, there was maybe a 60% chance I would notice your mail in time for it to matter.

      I decided one day to fix all this, regardless of what that entailed. I lowered the threshold for SA to a score of 4 (which they bark at you not to do, but fuck 'em, I've seen maybe 6 legit mails with a score higher than 4.5, in my world anyway). The key components were: enabling remote checks, RAZOR and DCC, and having SA train its filters off of my false negatives. I use the Train SA script, so I drop any false negatives in a Train Spam folder, and this picks them up and runs them through SA's filters to train it.

      My false negative rate dropped pretty much immediately from 20% to ~3% to 5% on weekdays, and zero to 1% on weekends, which I can live with. In the year or so since I actually put my back into fixing this, I've gotten maybe 2 false positives.

      I don't see long processing times, mail comes through pretty much as I send it in my tests on my VPS, but again, I only get a few hundred mails/day. If I had volume over a few dozen thousand/day, I'd probably just bite the bullet and pay Google (Postini) to make it go away.

    14. Re:spamassassin by jgrahn · · Score: 1

      I don't think there's any such thing as "pretty much finished",

      There is; software designed according to "do one thing and do it well" ... for example the Unix cat(1) command is probably pretty stable by now. Same with fgrep(1).

      especially with a piece of software involved in the arms race that is spam vs. filtering.

      ... but yeah, well, I don't know Spamassassin but I suspect it has broader and more loosely-defined goals.

    15. Re:spamassassin by Anonymous Coward · · Score: 0

      I use razor collaborative spam filter calling it directly from procmail, but my local rules usually catch any spam before razor-check is called.

    16. Re:spamassassin by Architect_sasyr · · Score: 2

      One client in 4 years of greylisting has had that problem, for something like 40,000 unique senders per month. I like those numbers.

      --
      Me failed English...
      FreeBSD over Linux. If my comments seem odd, this may explain...
    17. Re:spamassassin by Anonymous Coward · · Score: 0

      Don't follow this advice. SA has become so slow that it's almost useless. On a VM with 1GB RAM, it takes anywhere from 15-60 seconds to process a single e-mail,

      Well, so don't use "a VM" then. What is this sickness about running everything in a VM these days? Virtualization is a resource hog itself, and should therefore only be used when it actually is useful. Which isn't every time.

      Use just one OS, and run all your services there directly. A good OS (i.e. not Windows) provides enough isolation between processes to do this safely. One piece of software does not generally mess up another, if you aren't on Windows.

    18. Re:spamassassin by DNS-and-BIND · · Score: 2

      No, today's technology user has been brainwashed by mobile applications that update frequently. I have seen complaints on perfectly good software: "Has not been updated in a year whats wrong this software sux 1 star". Developers also use software updates as a sort of beta test: push it out, and if it crashes a lot of systems then update it again. Iterate as necessary. I've seen three releases in a day and five in a week using this "plan". The users don't help by considering mature (i.e. un-updated, essentially finished) software as garbage.

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
    19. Re:spamassassin by bofh29a · · Score: 1

      If you're CPU bound, use sa-compile. It made a P4 regularly hitting 1.0 load drop down to 0.6

      http://wiki.apache.org/spamassassin/FasterPerformance

    20. Re:spamassassin by hmilz · · Score: 1

      Yes sure, SA is also in the chain. But SA alone was never able to do the job even if I fed each mail that got through to sa-learn --spam (which I still do). This is why I started the local spam pattern file. It's a personal account btw, maybe that was unclear.

    21. Re:spamassassin by fredklein · · Score: 1

      Just switch over to Email Certification.

      Long story short, everyone who wants to send Certified mail has to be 'certified' by their ISP. (UN-certified mail would still be possible, if
      you wish.) Getting certified is nothing more than providing enough information to positively identify you, and costs a nominal fee.
      In return, you create a public/private key pair, and give the public one to the certifier. The private key goes into your email server, which
      adds some headers to each outgoing email. One of these is encrypted with the private key. When someone with a certification-compliant email
      program receives a certified email, the program reads the headers, connects to the certifier's certification server, and downloads the public
      key. It then uses the public key to decrypt the encrypted header. If successful, it proves that email came from the specified server, and no one
      else.

      If you get spam, your email client has a big 'report certified spam' button. Click it, and an email is auto-launched to the certifier of the
      sender. The certifier contacts the sender and demands an explanation. If sender was hacked, they fix the security hole and tell certifier they
      did so. If spam was not spam, or a misunderstanding, they explain.

      If, OTOH, the sender does not reply, then the certifier revokes their certification, and from that moment on, all their (the sender's) emails are
      UN-certified.

      What if a Certifier themselves is 'evil'? Well, it's certainly possible to have blacklists like they do now, but, instead of blacklisting IP
      addresses, which get re-assigned and cause trouble for their new owners, it would be evil Certifiers that get listed and blocked.
      Eventually, it'll reach a point where any spam that is sent out will get the sender 'de-certified' almost immediately. That means everyone else
      probably never ends up seeing the spam at all (depending on how their clients handle un-certified emails. Most people will probably auto-trash
      them.)

      However, white lists are still possible. If you like getting emails from certain un-certified sources, just white-list them, and you'll
      continue to get them. You can also use challenge-response or keyword set-ups for people sending you un-certified email.

      TL;DR:
      By proving who sent the email (or, more precisely, which server did), Email Certification can hold the server owner responsible. If they send
      spam, they get de-certified, which means in all likelihood, they lose the ability to email anyone at all. Spammers who can't get certified
      can't send emails anyone will see.

    22. Re:spamassassin by fisted · · Score: 1

      What a shitty idea.

    23. Re:spamassassin by fredklein · · Score: 1

      What a well-thought-out and detailed response. I particularly like the way you went into detail on every point you raised, weighing the pros and the cons.

    24. Re:spamassassin by Anonymous Coward · · Score: 0

      Of note, Yahoo! Groups will not attempt to deliver e-mail twice. If the first attempt fails, even with a temporary failure code, they will refuse to retry. Worse, when they send e-mail advising you of the situation, they send it from a different mailserver, which both retries temporary failures and prevents their listservering hosts from ever getting onto the whitelist.

      Have I mentioned I hate dealing with Yahoo! Groups and advise people to steer clear of them?

    25. Re:spamassassin by paenguin · · Score: 1

      By proving who sent the email (or, more precisely, which server did), Email Certification can hold the server owner responsible. If they send
      spam, they get de-certified, which means in all likelihood, they lose the ability to email anyone at all. Spammers who can't get certified
      can't send emails anyone will see.

      I see you've never had your server compromised. And of course nobody can spoof an email header or perform a Joe Job.

      These are just two obvious holes. There are certainly more.

      --
      We should start referring to processes which run in the background by their correct technical name... paenguins.
    26. Re:spamassassin by fredklein · · Score: 1

      I see you've never had your server compromised.

      "The certifier contacts the sender and demands an explanation. If sender was hacked, they fix the security hole and tell certifier they
      did so. If spam was not spam, or a misunderstanding, they explain."

      A hacked server might result in the revocation of the certification (and thus the UN-certification of all the emails sent by it), but the company can simply re-certify (with a new key pair).

      And of course nobody can spoof an email header or perform a Joe Job.

      That's what the public-key cryptography is for. No email can pretend to be from your server, unless it has an encrypted header encrypted with your private key. Which is, you know, private.

      These are just two obvious holes. There are certainly more.

      Actually, they're not holes at all.

    27. Re:spamassassin by cas2000 · · Score: 1

      yeah, i agree too. and it's good to see you've changed your mind when presented with compelling info.

      "what a shitty idea" is about as much detail in response that a shitty idea like that deserves. in fact, it's more than what it deserves.

    28. Re:spamassassin by fredklein · · Score: 1

      Oh, I always change my mind when presented with irrefutable evidence such as "That sucks" or "shitty idea". I do my best to ignore things like 'arguments', 'reasoning' or 'logic'- those only serve to inflame the situation.

    29. Re:spamassassin by Meski · · Score: 1

      Mod the parent up. Stable software beats weekly releases.

    30. Re:spamassassin by Meski · · Score: 1

      Look back over previous comments on email certification if you want the more considered response. Hell, someone's even made a form for it.

    31. Re:spamassassin by hermitdev · · Score: 1

      Better recheck your rules to see if you're dropping all mails from your users.

      But, that's a feature, not a bug.

  2. Or... by bmo · · Score: 0

    You could route everything through gmail and wash out the spam.

    Gmail's spam detection is spectacular.

    inb4 gmail hate.

    --
    BMO

    1. Re:Or... by Anonymous Coward · · Score: 0

      Then why isn't Postini's? We use it at work, and it can't seem to deal with any sort of Chinese spam or most phishing attempts.

    2. Re:Or... by Anonymous Coward · · Score: 1

      GMail's false positive rate is absurdly high.

    3. Re:Or... by Anonymous Coward · · Score: 0

      Outsourcing the management and storage of your emails to Google's big ears when you are capable of doing it yourself is definitely stupid.

    4. Re:Or... by asmkm22 · · Score: 3, Insightful

      Which pretty much defeats the whole point of hosting your own email...

    5. Re:Or... by bmo · · Score: 4, Informative

      Well, the OP wasn't exactly clear if it was just his personal account or whether it's a corporate server. My "least amount of work" thing is to forward every email address I have to gmail, and pull mail from there via imap. I get a few hundred spams a day just on one mail account, and I haven't lost any real mail due to Gmail's filtering.

      >postini

      You do realize that is being EOLed, yes?

      http://postini-transition.googleapps.com/

      >why gmail filters better than postini

      Probably separate spam databases. Stuff like that happens. Gmail probably gets orders of magnitude more spam to "teach" the system.

      YMMV.

      Using grep and procmail is the stone-knives-and-bearskins approach to filtering. There are a lot of other filtering systems that will be much more efficient on Unix systems. He can begin by using greylisting to filter out the non-compliant "fire and forget" spambots and then filter the winnowed pile o' crap. At least greylisting's not server intensive (it throws the load back to the sender) since 5xx and 4xx errors are cheap.

      Also blocking mail from dynamic IPs is a good idea.

      At this point he can then run the mail through a series of weighted RBLs. Reach a certain score and it's tossed. That's the processor intensive bit, but it's at the end and non-intensive filtering has already happened.
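
      A rough sketch of the weighted-RBL idea: each list contributes its weight when the sending IP is listed, and the mail is tossed once the total reaches a threshold. The zone names, weights and threshold below are placeholders, not recommendations. The lookup relies on the usual DNSBL convention of querying the reversed IPv4 octets under the list's zone.

```python
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to query: reversed IPv4 octets plus the list's zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def dns_is_listed(ip, zone):
    """True if the list answers for this IP (any A record counts as listed)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


def rbl_score(ip, weights, is_listed=dns_is_listed):
    """Add up the weights of every list that has this IP."""
    return sum(weight for zone, weight in weights.items() if is_listed(ip, zone))


def should_reject(ip, weights, threshold, is_listed=dns_is_listed):
    return rbl_score(ip, weights, is_listed) >= threshold


# Placeholder zones: trusted lists get a high weight, so-so lists a low one.
weights = {"rbl-a.example": 3, "rbl-b.example": 2, "rbl-c.example": 1}

# Offline stand-in for the DNS lookup, so the scoring can be tried without a network.
fake_listings = {("192.0.2.10", "rbl-a.example"), ("192.0.2.10", "rbl-c.example")}


def fake_lookup(ip, zone):
    return (ip, zone) in fake_listings


print(rbl_score("192.0.2.10", weights, fake_lookup))
print(should_reject("192.0.2.10", weights, threshold=5, is_listed=fake_lookup))
```

      With a threshold above the largest single weight, no one list can reject mail by itself, which softens the damage from a list that makes mistakes.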

      --
      BMO

    6. Re:Or... by YukariHirai · · Score: 1

      Is it? I've never had a false positive in all the years I've been using GMail.

    7. Re:Or... by Count+Fenring · · Score: 3, Interesting

      Is it? I've never had a false positive in all the years I've been using GMail.

      That you noticed. There's a fairly high bias inherent there; it just has to not have hit something that was both noticeable and that you knew was incoming.

    8. Re: Or... by Anonymous Coward · · Score: 0

      Um... there's a "spam" folder that one can check, too. I have, and I do, but I'm not noticing any false positives.

    9. Re:Or... by dbIII · · Score: 4, Interesting

      and I haven't lost any real mail due to Gmail's filtering.

      Email from people at one site I look after used to vanish into a black hole at gmail until I convinced them to replace the GIF of their corporate logo attached to all their emails with a PNG version. That's some real mail lost due to gmail's filtering.

      IMHO it's better to do the filtering somewhere where you have access to the stuff that is discarded. False positives may be rare now but they still happen. That's why I like stuff such as MailScanner (open source wrapper for spamassassin+your choice of commercial antivirus and/or clamav+other open source stuff+distributed updating rulesets) run on site. There's plenty of others that give you this function including some of the commercial "appliances" and outsourced email filtering.

      Also blocking mail from dynamic IPs is a good idea.

      It used to be the case that one IP address I have a mail server on would get blocked for a couple of days every year because some idiot at a blacklist would load in an obsolete list of dynamic IP addresses from what is now a decade ago. As IPv4 addresses diminish expect the lists of dynamic addresses to become outdated very quickly.

    10. Re:Or... by FridayBob · · Score: 1

      And allow Google and the US government to scan all of my mail? No thanks. The same goes for Hotmail, Yahoo and any other commercial email service provider (and certainly those based in the US). These days it makes more sense than ever to maintain your own MTA.

    11. Re:Or... by vux984 · · Score: 1

      And allow Google and the US government to scan all of my mail?

      Well, routing it outside of Gmail keeps Gmail's hands off it, well, the half that doesn't originate with them to start with. But I think you'll need to do quite a bit more to keep the US govt out of it.

    12. Re:Or... by Anonymous Coward · · Score: 0

      Email from people at one site I look after used to vanish into a black hole at gmail until I convinced them to replace the GIF of their corporate logo attached to all their emails with a PNG version. That's some real mail lost due to gmail's filtering.

      Bullshit. Google either refuses the mail during the SMTP transaction or accepts it and queues it, then it either delivers it to your inbox or delivers it to your spam folder. Gmail does not black hole emails. It's more likely that the senders were ignoring/deleting the bounces, the sender's mail server administrator was too lazy or incompetent to look at server logs, or you were too stupid to check your spam folder.

      It used to be the case that one IP address I have a mail server on would get blocked for a couple of days every year because some idiot at a blacklist would load in an obsolete list of dynamic IP addresses from what is now a decade ago. As IPv4 addresses diminish expect the lists of dynamic addresses to become outdated very quickly.

      I'll bet that IP has a generic hostname in rDNS, too. If you're gonna run a mail server but can't be bothered to set and maintain a non-generic hostname, don't expect your IP to stay off dynamic IP lists. How do you think those IPs ended up on the list in the first place?

    13. Re:Or... by Nemyst · · Score: 1

      In years of using the service I've never once been told that I'd missed an email by anyone I know. I'll take that as a sufficient confirmation that I haven't missed anything important. Things that I have missed which people haven't followed up on or notified me of are just about as good as junk mail anyway.

    14. Re:Or... by macraig · · Score: 1

      Gmail's spam detection is spectacular.

      Are you new here? Its false positives are equally spectacular. Some days I could swear there's some mean bored Google sysadmin who happens to be a less-than-friendly coworker from my past who's just sitting there randomly applying the Spam label to messages in my Inbox.

    15. Re:Or... by pongo000 · · Score: 2

      At this point he can then run the mail through a series of weighted RBLs.

      Fuck you and your RBLs. RBLs are a draconian solution that do immeasurable damage to those of us who (1) aren't spammers, and (2) choose to run our own mailservers on business-class IPs. I can't tell you how many times various IPs I use for outbound mail (I run several mailing lists) end up on an RBL for absolutely no fucking reason.

      Oh, because someone in the same /24 block sent spam? Really? That's a good reason to block an entire /24 subnet?

      RBLs are a solution in search of a problem. Some of them are nothing more than moneymakers for the people that run them: In order to get off their list, they blackmail you into paying money.

      Want to do the world a favor? Don't use RBLs. You'll just end up finding yourself blacklisted at some point anyway.

    16. Re:Or... by CaptQuark · · Score: 2

      Last night I sent my Gmail account an email from my ISP email system, then waited for it to show up. Nothing. So I resent it. Second time nothing.

      The email contained two screen captures I needed at the office. The subject line was "Steve on telework". Nothing obvious that would trip Postini's spam filter. It is now 24 hours later and neither has shown up. I wonder how many other emails I don't get.

      ~~

    17. Re:Or... by dbIII · · Score: 1

      OK then - maybe dumped them into the recipient's spam folder, all I know is the recipients never became aware of the emails until the GIF files were changed to PNG. Maybe gmail policy has changed since or maybe they only said they looked in their spam folders but never actually did. Either way the emails were not getting read if they had a GIF on them.

      How do you think those IPs ended up on the list in the first place?

      All your insulting bets are wrong, all that happened is some IP addresses got reassigned quite a few years ago. What used to be dynamic with one ISP became static with the company that bought them out.

    18. Re:Or... by jgrahn · · Score: 1

      Fuck you and your RBLs. RBLs are a draconian solution that do immeasurable damage to those of us who (1) aren't spammers, and (2) choose to run our own mailservers on business-class IPs. [...] Oh, because someone in the same /24 block sent spam? Really? That's a good reason to block an entire /24 subnet?

      That sucks. And more generally, I believe the anti-spam fundamentalists have caused as much damage to the use of reliable mail, as the actual spammers. They are much too willing to accept collateral damage. Sad, when you consider how much effort went into designing (a) guaranteed delivery or (b) guaranteed notification of non-delivery.

      For this particular situation: I accepted early on that many misconfigured mail servers "work" like you describe, so I make sure that my ISPs let me relay via them.

    19. Re:Or... by CBravo · · Score: 1

      I send many millions of emails a week for mailing lists. I never get blacklistings 'for nothing'. What is annoying about blacklists is that they see so many spammers that they fail to see that normal people make mistakes (a lot). Since it is 'one strike and you are out', it is difficult to fix things in a legitimate situation.

      Example: A medium size company mails its 2000 customers, this time in Finland, and we end up with a blacklisting. This customer comes up with a new list: The opt-outs and bounces (he is a sucker). However, we see 6% more permanent/hard-bounces than in the list. This can be because the spamfilters identified the spam and replied with hardbounces or it can be because the hardbounce list is bad (and there are more spamtraps). I already identified the list earlier as 'bad'. If no blacklist would have bothered, I would have allowed an opt-in mailing, asking recipients (again) for permission. Now I have to tell them to go to another ESP because I do not want to risk being a 'repeat offender' in their eyes. For the recipient, the first would be better. This is not a question you can ask a blacklist guy.

      I guess the bottom line is, it is hard to communicate with blacklist people to solve legitimate situations. How can you fix that (without offering bad guys a vector for DOS-sing them)?

      --
      nosig today
    20. Re:Or... by Anonymous Coward · · Score: 0

      You've checked your spam folder, right?

    21. Re:Or... by FridayBob · · Score: 1

      Well, routing it outside of Gmail keeps Gmail's hands off it, well, the half that doesn't originate with them to start with. But I think you'll need to do quite a bit more to keep the US govt out of it.

      Of course you're right about that, especially if I correspond with someone with e.g. a gmail account. But, why make it any easier for the NSA? It gets harder for them to listen in when I correspond with people who also maintain their own email servers, and a whole lot harder when those servers can also do automatic encryption. If what they're doing is unconstitutional and a violation of your privacy, why make life easy for them and knowingly play into their hands?

    22. Re:Or... by jon3k · · Score: 1

      Bullshit.

    23. Re:Or... by jonbryce · · Score: 1

      RBLs block around 95% of my incoming mail with very minimal false positives. Sorry if you don't like them, but people use RBLs because they work.

    24. Re:Or... by jonbryce · · Score: 1

      I got a lot of them. Mostly emails from things like newspapers that I specifically asked them to send me.

    25. Re:Or... by Anonymous Coward · · Score: 0

      Don't be fucking stupid.

    26. Re:Or... by bmo · · Score: 1

      Fuck you and your RBLs.

      Not every RBL is created equal. Anyone with half a brain knows which ones are good and which ones are so-so. The trick is to /weight/ them. Give a smaller score to the RBL you think makes mistakes.

      Go take your blind hate elsewhere.

      --
      BMO
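
      A rough sketch of the weighting idea in Python. The list names, weights, and threshold are made-up illustrations, not recommendations, and the DNS lookups that produce the "listed on" names are assumed to happen elsewhere:

```python
# Hypothetical per-list weights: a trusted list scores high, a so-so one low.
RBL_WEIGHTS = {
    "trusted.rbl.example": 3.0,
    "decent.rbl.example": 1.5,
    "sloppy.rbl.example": 0.5,
}
SPAM_THRESHOLD = 3.0  # assumed cut-off; tune it against your own mail


def rbl_score(listed_on):
    """Sum the weights of every RBL that lists the sending IP."""
    return sum(RBL_WEIGHTS.get(name, 0.0) for name in listed_on)


def is_spam(listed_on, other_score=0.0):
    """Combine the RBL score with any other evidence before deciding."""
    return rbl_score(listed_on) + other_score >= SPAM_THRESHOLD
```

      With numbers like these, a hit on the sloppy list alone never rejects a message; it only tips the balance when other evidence agrees.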

    27. Re:Or... by bmo · · Score: 1

      And allow Google and the US government to scan all of my mail?

      Where have you been for the past (looks at calendar) 25 years?

      The government archives every bit of email (and generic net traffic, for that matter) it can get its hands on. Remember Echelon? Remember Total Information Awareness? Snowden's "revelation" was something that everyone, who's paid attention for a few decades, already knew or assumed. The cost of a gigabyte is 3.5 cents retail these days. That's a lot of mail archiving for not a lot of money.

      Unless you encrypt your mail locally before sending (and not rely on MTAs that use SSL) and your friends encrypt /their/ mail, then some computer along the way is going to read it.

      It depends on how much you want to work at this. Most people don't care. There are people who do care, but don't know what to do, and this is a problem, because there are no turnkey/idiot-proof encrypted messaging systems out there that can be installed by Joe-User. It would be helpful if, by default, Thunderbird came bundled with gpg, for example, but configuring gpg isn't Joe-User proof.

      Maintaining your own MTA is not enough. It has to happen client-side, before the mail even hits the MTA.

      --
      BMO

      To give you an idea of how long I've been doing this shit, I remember the line eater.

    28. Re:Or... by pongo000 · · Score: 1

      I apologize for the personal attack. Not sure why I did that. Guess I let my emotions get the better of me.

      Still, fuck RBLs. Sadly, many who should know better do not weight RBLs, and instead outright reject any mail that scores a hit. These operators are slowly destroying the email infrastructure by not only fragmenting and marginalizing the smaller email providers (including individuals who choose to responsibly run their own SMTP service), but by implicitly forcing individuals to seek mail services through corporate providers (think "do no evil"). I have gotten to the point where I simply tell subscribers to the lists I admin that they will have to use another ISP if they want to subscribe because their email provider blindly defers to one or more RBLs, most of which are dodgy to begin with (think pay to play, or let's ban entire subnets because we aren't technologically adept enough to filter on just one IP address).

    29. Re:Or... by bmo · · Score: 1

      Then the problem isn't with the RBLs, but bad admins that never even skimmed news.admin.net-abuse.email. And yes, there are more bad admins and RBLs out there than you can shake a stick at.

      But then there are excellent services out there like Spamhaus.

      --
      BMO - Lumber Cartel #2501

    30. Re:Or... by bmo · · Score: 1

      To follow up, I have to say something -

      Even with greylisting on an account that I've had since last century, it gets a few dozen spams. Without greylisting, it got hundreds a day - literally hundreds.

      And technically, greylisting "breaks" email. If you have a non-compliant server that doesn't re-send 15 minutes later (as is the default in Sendmail, I believe), whoever you're trying to reach is never going to see your mail.

      But without filtering, email is utterly useless. There used to be the idea that advertising through email might be useful, but that well has been poisoned and dead goats thrown in it, nearly poisoning the entire water table.

      --
      BMO

    31. Re:Or... by Anonymous Coward · · Score: 0

      many who should know better do not weight RBLs, and instead outright reject any mail that scores a hit.

      This. A single infected machine in a network of 2,000 machines can therefore completely shut down your ability to send out any mails. W00t! NOT! :/

  3. CRM114 by Anonymous Coward · · Score: 1

    Look up CRM114.

  4. You could speed up your current solution by russotto · · Score: 5, Interesting

    Write something that uses a regular expression library (RE2 would be ideal, if your expressions are actually regular), and keeps the compiled patterns resident. Most of your time is likely spent parsing the patterns.
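
    A minimal sketch of "compile once, keep resident", using Python's standard re module as a stand-in for RE2 (it is a backtracking engine, so it shows the structure rather than RE2's speed). The class and function names are hypothetical:

```python
import re


def compile_patterns(lines):
    """Compile each non-empty line once; return a list of compiled regexes."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


class ResidentFilter:
    """Holds compiled patterns in memory so each message only pays for matching."""

    def __init__(self, pattern_lines):
        self.patterns = compile_patterns(pattern_lines)

    def is_spam(self, message_text):
        # Stop at the first pattern that matches anywhere in the message.
        return any(p.search(message_text) for p in self.patterns)
```

    A long-running process would build one ResidentFilter from the spam-patterns file at startup and call is_spam() for each delivered message; how procmail hands the message over (pipe, socket, spool directory) is left open here.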

    1. Re:You could speed up your current solution by PetiePooo · · Score: 5, Informative

      ...Most of your time is likely spent parsing the patterns.

      I second that. And as your rules have built up, there are likely some that have never been used beyond when they were first put in. I'd instrument your next solution to identify outliers and cull them over time so your parser doesn't have to work so hard.
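
      One way the instrumentation could look, as a sketch that assumes fixed-string patterns and keeps the counters in memory only (a real setup would have to persist them between runs). All names are hypothetical:

```python
from collections import Counter


class InstrumentedPatterns:
    """Fixed-string patterns plus a hit counter per pattern."""

    def __init__(self, patterns):
        self.patterns = list(patterns)
        self.hits = Counter()

    def check(self, message_text):
        """Record every pattern found in the message; return True if any hit."""
        matched = [p for p in self.patterns if p in message_text]
        self.hits.update(matched)
        return bool(matched)

    def never_hit(self):
        """Patterns with zero recorded hits: candidates for culling."""
        return [p for p in self.patterns if self.hits[p] == 0]
```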

    2. Re:You could speed up your current solution by grcumb · · Score: 1

      Write something that uses a regular expression library (RE2 would be ideal, if your expressions are actually regular), and keeps the compiled patterns resident. Most of your time is likely spent parsing the patterns.

      I'm probably going to get shat on by kids who don't know any better, but....

      Use Perl. If a complex set of regular expressions is taking 15 seconds per email, then there's clearly something wrong with the implementation. I suspect you're doing too much backtracking. I've been guilty of the same in the past. In one case, simply anchoring my regular expressions to the start and end of the string reduced running time literally by two orders of magnitude. Just glom the whole message into a string and go nuts.

      And before someone makes a 'write-only' joke about Perl regular expressions, I'd suggest you take a look at Perl 6 regex grammars, which provide you with the ability to lay out complex rulesets with ease - and make them vastly easier to read.

      As with any programming issue, it's horses for courses, and when it comes to parsing text with regular expressions, Perl is still at the head of its class.

      --
      Crumb's Corollary: Never bring a knife to a bun fight.
    3. Re:You could speed up your current solution by niftymitch · · Score: 1

      Write something that uses a regular expression library (RE2 would be ideal, if your expressions are actually regular), and keeps the compiled patterns resident. Most of your time is likely spent parsing the patterns.

      Yes but a more resource friendly set of tools might begin with the OP's procmail to move the mail
      onto a local machine quickly. Filters inside of procmail are hobbled. Do this as one message
      per file (http://unix.stackexchange.com/questions/62563/savings-emails-as-individual-files-using-procmail).
      Procmail locks and gates are OS dependent but still slow.

      Next test each message with one or more simple "grep expressions" that then pass it or gate it
      to more complex expressions. On a multi core machine with a SSD disk this might be quick.

      Now move or pull the files to a location for a human reader in folders or dirs or whatever
      the reader expects.

      Better filters do exist and are well recommended. You can teach them honey pot style with a public account on gmail
      and slurp the service discovered spam into a training file. To some degree there is a need to isolate but not delete
      trouble files so you can retrain the filter.

      Do watch out for MIME attachments where the content differs by reader/tool. A safe text reader can
      be fooled because an unsafe pile of poo is wrapped in a MIME part that it does not display. Then a rich,
      handy-dandy visual mail tool will open the trouble payload based on the text content.

      --
      Truth is stranger than fiction, but it is because Fiction is obliged to stick to possibilities; Truth isn't. Mark Twain.
    4. Re:You could speed up your current solution by jafo · · Score: 1

      Doubtful that the time is largely spent compiling the regexes... But without knowing more about the OP's exact setup, it's hard to say. In particular, we don't know how many rules the OP has in their corpus. It could easily be tens of thousands or hundreds of thousands, if they just throw a bunch of strings they've seen in spam into a list of "don't let me see this message again" expressions. egrep is probably already compiling any expressions, it's just doing a *LOT* of matching.

      You could try doing statistical matching on the corpus and moving more frequent matches earlier, so that matches cause the rules to terminate more quickly. "-q" might help speed it up by short-circuiting on the first match (not sure if it does this or not, but I see no reason why "-q" wouldn't).

      But to really improve the performance, you're probably going to have to simply be more clever than looking for a bunch of strings. For example, using something like razor fingerprinting or bayesian matching.

      You can't just drop your corpus into a database and solve it, you'd need to come up with a way of indexing the data such as fingerprinting to get something that you can index.

      You might also want to do different checks depending on whether the message is directly addressed to you or not. For example, any e-mail that doesn't mention one of my addresses in the To or Cc, or that comes from specific mailing lists, gets stored into a separate folder that I look at very rarely. The vast majority of spam that I get goes into that box.

      Sender IP is VERY easy to use for a database lookup. When I get spam from an IP, I will often set up a blacklist for IPs around that address. Unless it is something like gmail or another big mail service that I recognize. It's surprising how often I get spam from a bunch of very similar IPs (in the same /24 or same /22).
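
      Since the OP asked about sqlite, here is a sketch of that lookup with Python's sqlite3 and ipaddress modules. The table name, helper names, and the choice to store networks as CIDR strings are all assumptions made for illustration:

```python
import ipaddress
import sqlite3


def open_db(path=":memory:"):
    """Create (or open) a table of blocked networks keyed by CIDR string."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS blocked_nets (cidr TEXT PRIMARY KEY)")
    return db


def block_around(db, ip, prefix=24):
    """Blacklist the whole /prefix network that contains ip."""
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    db.execute("INSERT OR IGNORE INTO blocked_nets VALUES (?)", (str(net),))
    db.commit()


def is_blocked(db, ip, prefixes=(24, 22)):
    """Look up the enclosing network of ip at each prefix length in turn."""
    for prefix in prefixes:
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        row = db.execute(
            "SELECT 1 FROM blocked_nets WHERE cidr = ?", (str(net),)
        ).fetchone()
        if row:
            return True
    return False
```

      Each check is at most two primary-key lookups, no matter how many networks are listed. That works because an IP maps to an exact key; free-text spam patterns do not, which is the indexing problem described above.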

    5. Re:You could speed up your current solution by stoatwblr · · Score: 1

      "Use Perl."

      Even better, use a long-lived perl process and feed that (Spamd, etc). Perl's startup load can overwhelm even a lightly loaded mailserver if there are a bunch of parallel invocations.

      2nd, 3rded and amen to the suggestion to use spam assassin. It's mature. The rulesets are updated regularly and it works. Greylisting is becoming less effective with every passing year.

  5. Database? by K.+S.+Kyosuke · · Score: 2, Insightful

    What would the database achieve? I'm not sure what is the exact nature of the patterns (an example would really help here), but perhaps writing a compiler from the patterns into some decision procedure in something reasonably efficient yet featuring quick start, such as SBCL or Gambit, could help.

    --
    Ezekiel 23:20
    1. Re:Database? by jgrahn · · Score: 1

      What would the database achieve?

      Some people believe the solution to any problem involving data is a database ...

    2. Re:Database? by swillden · · Score: 1

      If he's using egrep -F, the patterns are fixed strings.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  6. bogofilter by jon787 · · Score: 4, Informative

    http://bogofilter.sourceforge.net/

    I haven't timed it to see how well it's been doing in the 6 years I've had it though.

    --
    X(7): A program for managing terminal windows. See also screen(1).
    1. Re:bogofilter by SigmundFloyd · · Score: 1

      http://bogofilter.sourceforge.net/

      Seconded. Procmail + bogofilter + spam.mbox = no problem.
      I keep - and periodically review - a "spam" mbox for the rare false positive.

      I haven't timed it to see how well it's been doing in the 6 years I've had it though.

      It's written in C, so it's very likely much faster and leaner than Spamassassin.

      --
      Knowledge is power; knowledge shared is power lost.
    2. Re:bogofilter by Anonymous Coward · · Score: 0

      bogofilter is fast though it takes some training before it knows what _you_ consider to be spam and what _you_ consider to be ham.

  7. tokens and a lookup table by Anonymous Coward · · Score: 0

    a big one. 15s per email???
    holy smokes you are so fired.

  8. What a forking awful solution... by Anonymous Coward · · Score: 2, Informative

    Sorry, couldn't resist the pun.

    Your problem (besides not using existing Bayesian tools...) is that every single egrep is a fork. As others have pointed out, you should rewrite your script in something like Python and use the native regex libraries. Even if you have to read and 'compile' the regex list every time, you're saving a *massive* amount of OS-level overhead.

    1. Re:What a forking awful solution... by Anonymous Coward · · Score: 1

      Or you could use a more pleasant language for regexp, like perl. It's also faster.

    2. Re:What a forking awful solution... by Anonymous Coward · · Score: 0

      The way he describes it in the summary it's one call of egrep per message. That's not responsible for the overhead.

  9. Distribute by manu0601 · · Score: 1

    It seems you could easily distribute the load on multiple machines, each doing a subset of the regex.

    1. Re:Distribute by Anonymous Coward · · Score: 1

      There's something hilarious about having to distribute email filtering across several machines.

    2. Re:Distribute by Cryacin · · Score: 1

      Yeah, just use "the cloud" to "drag and drop" your email into, so that you can "sanitise your synergy" and "reclaim your potential".

      Phew, enough markitechture for one day. I have to go shower now.

      --
      Science advances one funeral at a time- Max Planck
    3. Re:Distribute by jgrahn · · Score: 1

      It seems you could easily distribute the load on multiple machines, each doing a subset of the regex.

      There's something hilarious about having to distribute email filtering across several machines.

      No; it's tragic that people reach for distributed, or "multi-core", whenever something runs too slowly. If filtering a mail according to a set of REs takes 15 seconds of CPU time like the OP writes, he's clearly doing something wrong, or hitting some limitation of procmail's design (being unable to amortize work, such as reading and parsing the REs, between mails).

      Distributing wasteful work is not the right solution, in particular not if energy efficiency is a major concern like the OP suggests. (And I find it hard to believe that a 15s delay of mail delivery is his main concern!)

  10. ragel by Anonymous Coward · · Score: 2, Interesting

    Try compiling your patterns using Ragel: http://www.complang.org/ragel/

    Union them all together and you'll see orders of magnitude improvement in performance (e.g. 10x - 100x) over other regular expression engines, although GNU grep is using Aho–Corasick with the -F switch, so you're likely to see less of an improvement.

    Many people use re2c, but it has nowhere near the performance or capabilities of Ragel. Ragel has a steep learning curve, but it's well worth the effort to master. It's well maintained, and has been for years.
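
    For readers who have not met Aho-Corasick, here is a compact pure-Python sketch of the algorithm: all fixed strings go into one automaton and the text is scanned once. It illustrates the technique only; it is not how GNU grep or Ragel implement it, and the function names are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build Aho-Corasick tables: goto (trie), fail links, and outputs per state."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        if not pat:
            continue  # an empty pattern would match everywhere
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    # Breadth-first pass to fill in failure links, shallow states first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return the set of patterns occurring in text, scanning it once."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found
```

    The cost of find_all() grows with the length of the message, not with the number of patterns, which is why a fixed-string matcher built this way should not need seconds per mail.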

    1. Re:ragel by Anonymous Coward · · Score: 0

      I should mention that I include RE2 in that "orders of magnitude" improvement statement. RE2 falls down pretty quickly under memory pressure when you try to union too many expressions together. You can union thousands of patterns together using Ragel, whereas RE2 can blow up after dozens.

  11. Grey Listing and zen.spamhaus.org by Anonymous Coward · · Score: 0

    15 seconds per email? That must be one heck of a pattern list. I used to rely on procmail for filtering. In simpler times it did everything I needed.

    First of all, set up grey listing. 99.99% of the emails you're receiving never make it past grey listing. You can nearly forget about filtering again once grey listing is enabled.

    Second, add a reject client like zen.spamhaus.org to your mail server to stop the emails that make it past grey listing.

    You can continue to filter anything that makes it past those two barriers, but I think you'll find your filters are redundant at that point. In fact you can probably cut procmail from the process entirely, unless you use it to do other stuff with the mail.

    1. Re:Grey Listing and zen.spamhaus.org by sumdumass · · Score: 1

      I second grey listing and using Spamhaus filter lists.

      It is well worth it. I had a 30-user environment go from about 100 to 300 spam emails a day per user to approximately 10 apiece making it through. Then when I activated the spam filtering it went to about 10 a week for about 1/3 of the users. The biggest problem is third party user machines being compromised and the spam being sent through their ISP's email servers (grey listing doesn't stop legitimate servers and most big ISPs don't make it on the spamhaus lists). Usually this contains a virus attached to it and the antivirus on the mail server catches it.

      PS.. the reason the spam was so high is because they sign up for all sorts of crap from their work computers and use their work email almost as a personal email. The partners don't mind as long as their work is getting done. The biggest offenders were the partners (read owners) themselves.

    2. Re:Grey Listing and zen.spamhaus.org by dbIII · · Score: 1

      Yes, used greylisting for a couple of days and now am well aware of an inherent flaw that could cost the people who use it their jobs. Consider how it works and then consider that people at the top of organisations like to think of email as a nearly instant communications system and really don't like it when that last minute tender they've been working on all night gets delayed for half an hour (as is typical, or several hours with insane greylisting settings I've seen used) just because the person they sent it to has never had email from them before. It's a nice idea from an IT perspective but from a business perspective it sucks dog balls. A lot of places have far too many of those edge cases to make it worthwhile. Other places are more slow moving and it just doesn't matter if the most urgent email arrives a few hours late, but good luck trying to explain why if you're not in one of those places :)

      Reducing the time in greylisting does reduce the potential damage but then it's a balance between the patience of the people that expect instant communications and the patience of spammers. While most spambots only tried once in 2005 things have moved on since - greylisting is well behind the spam arms race.

      grey listing doesn't stop legitimate servers

      It stops them long enough for it to be a problem in enough cases that I kept getting a lot of "why doesn't X have my email yet" phone calls when greylisting first became popular. I then ran it myself for a while to see what was going on and to see where some of those who were using it were applying frankly insane settings, and how even less tight settings were problematic on occasion.

      Sometimes it's better to look at entire systems to resolve problems instead of a tightly focused technical only approach. If you guys are going to call yourselves "engineers" you should act like them and consider entire systems instead of single bolts or what the manual tells you to do. Cute tricks that fuck around with communication policy shouldn't be used unless you can take the consequences of changing communication policy. If it's going to put your boss on the carpet in front of the CEO you have a duty to your boss of explaining to them why you are doing it.

    3. Re:Grey Listing and zen.spamhaus.org by sumdumass · · Score: 1

      Yes, used greylisting for a couple of days and now am well aware of an inherent flaw that could cost the people who use it their jobs. Consider how it works and then consider that people at the top of organisations like to think of email as a nearly instant communications system and really don't like it when that last minute tender they've been working on all night gets delayed for half an hour (as is typical, or several hours with insane greylisting settings I've seen used) just because the person they sent it to has never had email from them before. It's a nice idea from an IT perspective but from a business perspective it sucks dog balls

      The default resend interval on most mail systems is between 2 and 15 minutes that I know of. If you are using it, it doesn't impact anyone you are sending to. If who you are sending it to is using it, it is their problem not yours.

      I'm also surprised by the comment of " never had email from them before". First, are you confusing grey listing with white listing and a challenge response? Second, I'm not sure I would be sending something I worked on all night to someone I never communicated with before by email. It might be possible that the specific person is a different person, but the email should work domain wide (if I emailed your secretary I should be able to email you without the grey listing).

      It stops them long enough for it to be a problem in enough cases that I kept getting a lot of "why doesn't X have my email yet" phone calls when greylisting first became popular. I then ran it myself for a while to see what was going on and to see where some of those who were using it were applying frankly insane settings, and how even less tight settings were problematic on occasion.

      If grey listing stops the email from being received for hours, there is something wrong with the server sending it. Most default times are minutes. I'm not sure I know of a scenario that would require specific minute by minute communications that wouldn't warrant a phone call or something more instantaneous. Even without grey listing, the emails can be delayed for several minutes. Most email clients do not update more often than every couple of minutes anyway.

      The longest I have seen an email delayed from grey listing is about 5 minutes. Anyways, I'm not sure we are talking about the same things. Grey listing simply drops the first connection attempt by a server not in a white list and requires the SMTP server to retry at its set interval. There aren't a whole lot of settings you can change besides specifically allowing domains or servers and specifically denying them. The person sending the email might have their servers jacked around but that is their problem, not yours. If a route goes down somehow and the communication is interrupted, it will retry the communications anyways - this is no different except it's on purpose.

      Sometimes it's better to look at entire systems to resolve problems instead of a tightly focused technical only approach. If you guys are going to call yourselves "engineers" you should act like them and consider entire systems instead of single bolts or what the manual tells you to do. Cute tricks that fuck around with communication policy shouldn't be used unless you can take the consequences of changing communication policy. If it's going to put your boss on the carpet in front of the CEO you have a duty to your boss of explaining to them why you are doing it.

      I'm not convinced we are talking about the same things here. Like I said, the longest I have seen an email delayed is about 5 minutes. I say about because the log measures seconds between connection attempts when I check them and I have never seen them go over 300-350 seconds unless something was wrong with the route (traceroute fails to complete).

    4. Re:Grey Listing and zen.spamhaus.org by dbIII · · Score: 1

      Please stop roleplaying someone stupid with your current game of presenting the incorrect suggestion that the greylisting time set on the receiving server doesn't matter. You're misleading the newbies that haven't worked out from your handle and posting history that you like to pretend to be dumb as shit and offer utterly stupid suggestions as if they are viable. People playing games like yours make it difficult to have an honest and factual discussion in this place.

    5. Re:Grey Listing and zen.spamhaus.org by Anonymous Coward · · Score: 0

      Sometimes it's better to look at entire systems to resolve problems instead of a tightly focused technical only approach. If you guys are going to call yourselves "engineers" you should act like them and consider entire systems instead of single bolts or what the manual tells you to do. Cute tricks that fuck around with communication policy shouldn't be used unless you can take the consequences of changing communication policy. If it's going to put your boss on the carpet in front of the CEO you have a duty to your boss of explaining to them why you are doing it.

      You sound like the kind of prick who has jacked around their MS Exchange server with 'cute tricks' to the point that it can't properly respond to standard mail communication protocols anymore. Or maybe you just don't know how to do anything beyond clicking the 'start mail server' button and left the settings on defaults so that it only ever plays nice with other Exchange servers. Then when your boss misses an important email you blame it on greylisting so you don't get your ass wiped up and down the hallway for being the incompetent ass that you are. You clearly don't understand how grey listing works any more than you understand how the 'entire system' works.

      Go back to designing web pages and stop posing as some kind of IT admin.

    6. Re:Grey Listing and zen.spamhaus.org by Anonymous Coward · · Score: 0

      Translation: "I really have no idea what I'm talking about, but I'm pretty sure you're wrong."

    7. Re:Grey Listing and zen.spamhaus.org by sumdumass · · Score: 1

      Please stop roleplaying someone stupid with your current game of presenting the incorrect suggestion that the greylisting time set on the receiving server doesn't matter.

      Lol.. Even if the particular grey listing you are using allows you to set the interval between reconnection attempts, you do not need to do so for grey listing to work. 90% or better of spam email sent will never reconnect after being dropped once. The legitimate SMTP servers will reconnect in a couple of minutes and the email will be received, then subjected to other spam filtering processes if present on the servers. There is no pretending involved other than maybe your supposed experience with it.

      http://projects.puremagic.com/greylisting/

    8. Re:Grey Listing and zen.spamhaus.org by nabsltd · · Score: 1

      Please stop roleplaying someone stupid with your current game of presenting the incorrect suggestion that the greylisting time set on the receiving server doesn't matter.

      Unless you are insanely stupid and use a greylist wait of a couple of hours, it really doesn't matter. Typical values for the wait are in the sub-5 minute range. Since it's resource intensive to retry at intervals smaller than about 5 minutes, good client configurations aren't going to even get to the first retry until after the greylist wait has expired. But, based on my own logs, you could set the wait to as little as 10 seconds and still have an extremely effective front line defense against spam.

      Once an IP passes the wait successfully, e-mail from that address isn't ever delayed again (although I personally set a 40 day timeout on the whitelist to account for dynamic IP addresses). Add in the ability to pre-seed the list with known-to-retry systems (gmail, Yahoo, Amazon, etc.) and known common senders to your domain, and most people really will never notice any delay. And, if you run for about a month in "log only" mode, you'll get a great starting list of IPs you might need to whitelist. And, you need to think about server farms, so you might want to accept a retry from the same /24 subnet (which is what I do).

      Also, if you were dumb enough to greylist mail that came from your internal network (which is the only way that a greylisting config would cause your boss to ask "why doesn't X have my email yet"), you don't understand greylisting enough to be in a discussion about it. For external senders (who should never be asking you that question, anyway), you can always explain the reason the e-mail is delayed: whoever configured their e-mail server didn't feel that a quick retry after a temporary failure was required.

      So, this might be flamebait, but what it really comes down to is that your issues with greylisting were likely because you didn't do your homework and create a config that was as non-intrusive as possible.
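
      A sketch of the policy described above, in Python. It keys on the sender's /24 only (classic greylisting keys on the IP, sender, recipient triplet), assumes IPv4, and takes "now" as a parameter so it can be exercised without a clock. The 60-second wait is an arbitrary illustrative value; the 40-day whitelist timeout is the one mentioned above. All names are hypothetical:

```python
import ipaddress

GREYLIST_WAIT = 60              # seconds a new sender must wait (assumed value)
WHITELIST_TTL = 40 * 24 * 3600  # forget senders that passed after 40 days


class Greylist:
    def __init__(self):
        self.first_seen = {}  # network -> time of first attempt
        self.passed = {}      # network -> time the sender last got through

    @staticmethod
    def key(ip):
        # Treat the whole /24 as one client so a server farm can retry
        # from a sibling address.
        return str(ipaddress.ip_network(f"{ip}/24", strict=False))

    def check(self, ip, now):
        """Return 'accept' or 'tempfail' for one delivery attempt at time now."""
        k = self.key(ip)
        if k in self.passed and now - self.passed[k] <= WHITELIST_TTL:
            self.passed[k] = now
            return "accept"
        first = self.first_seen.setdefault(k, now)
        if now - first >= GREYLIST_WAIT:
            self.passed[k] = now
            del self.first_seen[k]
            return "accept"
        return "tempfail"
```

      Pre-seeding known-to-retry senders amounts to filling in the passed dictionary before the filter goes live.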

    9. Re:Grey Listing and zen.spamhaus.org by dbIII · · Score: 1

      Now that's an interesting reading comprehension failure and blaming victims instead of perpetrators. The problems with greylisting impact the sender more than those that have implemented greylisting. An effective setup requires someone who knows at least as much as could be picked up from a quick read of the Wikipedia article instead of the misleading rubbish a few posts above. Quick retries don't save you from idiots that tune their settings to reduce spam but forget the entire point of email; it just means you have a lot more retries before they let you in.

  12. DIY by Jmc23 · · Score: 1

    http://www.gigamonkeys.com/book/practical-a-spam-filter.html has the nuts and bolts. CL-PPCRE does perl regex matching faster than perl.

    --
    Don't complain about syntax, grammar, or spelling. There is no.hell like input on android.
  13. Short circuit by thulben · · Score: 1

    Does your process require that all of the regexes are tried in turn, or is it the case that if it hits one of your patterns it's marked as spam? If the latter, are you able to rank the patterns from most likely to least likely to be matched? And, if so, can you stop your process once a match is made? If all of those things are true, then you should be able to cut the time/CPU/energy required to do the filtering.

    1. Re:Short circuit by CanadianMacFan · · Score: 2

      You might also want to look at how patterns are added to the file too. If they are added to the end then the latest spam of the day message will need to parse all of the patterns until it hits the latest pattern. Of course ideally you might want to set something up that looks at the hits each pattern gets so that you could parse the most likely patterns first followed by the latest patterns.
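
      A sketch of "most likely patterns first, stop at the first hit" in Python, assuming fixed-string patterns and a plain dict of hit counts (hypothetical names). It re-sorts on every call for brevity; a real filter would re-rank only occasionally:

```python
def first_match(message_text, patterns, hits):
    """Try patterns from most- to least-hit; stop at the first one found.

    Returns (matching pattern or None, number of patterns tried).
    """
    ordered = sorted(patterns, key=lambda p: hits.get(p, 0), reverse=True)
    for tried, pattern in enumerate(ordered, start=1):
        if pattern in message_text:
            hits[pattern] = hits.get(pattern, 0) + 1
            return pattern, tried
    return None, len(ordered)
```

      Note that ranking only helps for spam: a legitimate message still has to be checked against every pattern before it can be passed.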

  14. dspam by Rob+Bos · · Score: 1

    Consider using a proper learning filter, like dspam. You can pipe it through procmail just as easily, and you can feed your corpus of spam into it. You won't get 100%, but it'll recognize spam you haven't seen.

    :0f
    *
    | /usr/bin/dspam --deliver=stdout

    1. Re: dspam by paulc · · Score: 1

      I'll give a +1 for dspam. I run it on a couple of accounts via procmail under qmail and it works really well. When spam gets through to my inbox I just move it to a spam training folder and an hourly cron task passes that folder's contents back to dspam for training.

    2. Re:dspam by mishehu · · Score: 1

      If I had mod points I'd have given you a +1 as well. I've been using dspam for my own systems as well as clients' systems for years now, with MySQL as the backend (InnoDB tables though, not MyISAM). The only downside is that it can end up eating a fair amount of disk space, but it's extremely fast and highly accurate. Combine that with other methods like RBL, SPF checks, DK, etc., and I get but a false-positive once every 3 months or more, and a false-negative once every 6-12 months.

      SHR Spam Hit Rate 98.48%
      HSR Ham Strike Rate: 0.23%
      PPV Positive predictive value: 99.93%
      OCA Overall Accuracy: 98.77%

      And this is with the same database for the past 3-5 years or so now.

  15. perl or python or whatever by retchdog · · Score: 1

    I've heard, but never timed it myself, that perl is faster for regexp-type stuff than even the specialized tools, just from the massive amount of optimization it has accrued over the years; here is a completely unbiased source. Use a perl or python script, and consider using Storable (perl) or pickle (python) to serialize the data structure, I guess, but just having the whole list in memory will help.

    According to this, perl regexps are (unsurprisingly) a superset of egrep's.

    I don't see how introducing SQL could do much to help speed, or anything else, in this application.

    --
    "They were pure niggers." – Noam Chomsky
    1. Re:perl or python or whatever by Jmc23 · · Score: 1

      regexps in cl-ppcre are faster than perl.

      --
      Don't complain about syntax, grammar, or spelling. There is no.hell like input on android.
    2. Re:perl or python or whatever by retchdog · · Score: 2

      doing anything but repeated egreps is probably fast enough. he should do whatever is easiest, which probably isn't lisp.

      --
      "They were pure niggers." – Noam Chomsky
    3. Re:perl or python or whatever by Jmc23 · · Score: 1
      Don't bring your prejudices into this!

      It doesn't get much easier than someone not only handing you the code but also holding your hand and walking through every single function. Unless you want to use a magic black box and where's the fun in that?

      --
      Don't complain about syntax, grammar, or spelling. There is no.hell like input on android.
  16. Matching multiple simultaneous regular expressions by careysb · · Score: 2

    Many years ago I worked with a Unix development tool called LEX that could handle matching multiple patterns simultaneously. Perhaps there is an updated tool that would do the same thing. Java has a 3rd party library called ANTLR that might do the trick. It would involve re-compiling every time a new pattern is added but it should be extremely fast.

  17. Sqlite will be awesome by swillden · · Score: 2

    Sqlite, or anything that uses an index, will be screaming fast.

    Your statement of your current solution makes me wonder, though.. are you using "egrep -F -f pattern_file e_mail_message"? Or are you running egrep many times, once per line of the pattern file, or once per line of the message? I would think that given a pattern file egrep would be smart enough to do something better than repeatedly scanning the input, but based on the time it's taking, it sounds like that's happening.

    --
    Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    1. Re:Sqlite will be awesome by neonsignal · · Score: 1

      I doubt that he is using "grep -F -f ...", because fgrep can search for a hundred thousand patterns in a megabyte of data in under a second even on a modest machine (and most of the time is building up the regex state machine). I suspect he is using "egrep -f", and lots of patterns with wildcards. Worse, he will be running it once on each email, which means rebuilding the regex state machine each time.

    2. Re:Sqlite will be awesome by Anonymous Coward · · Score: 0

      "..using an external spam-patterns file, containing one pattern per line, and running an 'egrep -F' against it."

      Sadly more likely, he is iterating the patterns file and grepping the email one pattern at a time.

    3. Re:Sqlite will be awesome by idunham · · Score: 1

      Agreed.
      egrep -F is the same as fgrep, and it uses fixed strings.
      If he says "patterns", it's obviously egrep or grep, probably egrep -f.
      egrep -f patterns -lr maildir is likely to be faster, because of startup costs.

    4. Re:Sqlite will be awesome by hmilz · · Score: 1

      It's actually a single "egrep -i -o -f " per mail. For each mail, egrep is forked exactly once, which means writing my own tool will not reduce the OS overhead. I might give perl a try though, but I doubt that forking perl will be much faster than forking grep.

    5. Re:Sqlite will be awesome by jon3k · · Score: 1

      Can someone explain what the big O notation would be for this? I'm still trying to wrap my brain around big O notation.

      I'd think it would be O(n^2) but that can't be right because it's two different sets of data (not N raised to itself). So is there even a: O(N^X)?

      I'm assuming that for each mail message (outer loop) each RegEx is processed, which might be an incorrect assumption.

    6. Re:Sqlite will be awesome by swillden · · Score: 1

      O(NM).

      If the pattern file has N lines and the e-mail has M lines, and if we count comparing one line of the pattern file against one line of the e-mail as one operation, then for each of the N lines of the file, we have to do M comparisons.

      There are better algorithms than this obvious one, though, and it would surprise me if egrep didn't use one of them when given the whole list of patterns at once.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    7. Re:Sqlite will be awesome by nabsltd · · Score: 1

      It's actually a single "egrep -i -o -f " per mail. For each mail, egrep is forked exactly once, which means writing my own tool will not reduce the OS overhead. I might give perl a try though, but I doubt that forking perl will be much faster than forking grep.

      So, stop forking. There are lots of spam filters that use perl as the engine and run as daemons with a socket to write to. This keeps the compiled perl regular expressions in memory (assuming you're not swapping because of low memory).

      Spamassassin can use pretty much the same file you have right now as a source for patterns, and runs quicker than what you are seeing for your setup. I don't use any custom rules for SA, and only see about 5 spam e-mails per week in my mail client, and all end up in the "marked as spam by SA" folder. I do use greylisting and strict SMTP syntax checks to stop a lot before SA even sees them.

    8. Re:Sqlite will be awesome by Anonymous Coward · · Score: 0

      O(L(1+N+NM)), with L emails needing 1 fork+exec each... feel free to simplify if you're obsessed with that.

    9. Re:Sqlite will be awesome by swillden · · Score: 1

      O(L(1+N+NM)) = O(NM)

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  18. The simpler solution is ... by Skapare · · Score: 1

    ... I just gave up on email. Even w/o spam it's more hassle than I like.

    --
    now we need to go OSS in diesel cars
  19. Use Perl by Anonymous Coward · · Score: 1

    Use Perl. Its regex engine is highly optimized and very fast. It should really fly on fixed strings.

    There's a Stack Overflow question that addresses this very thing with some Perl code you can try.

  20. Whitelist first by Anonymous Coward · · Score: 0

    You could run whitelisting rules first to allow messages that are obviously non-spam through without them having to pass through all of the spam rules. This could be the standard address book whitelisting so all of your friends' and colleagues' messages pass immediately.

    For a bit more complex solution you could run messages through something like SpamAssassin first -- for any messages that have a spam score above a certain threshold you run them through your custom rule set. Since you have a high degree of trust in your rule set you could make this threshold quite low -- again mainly so SpamAssassin will just act as a whitelist to let clearly good messages through immediately.
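
    A rough sketch of the whitelist-first ordering in Python. The function name, the whitelist set and the pattern list are hypothetical; the point is only that trusted senders return before any pattern is tried. It assumes the From header can be trusted, which is not true in general because headers can be forged.

```python
import re
from email.utils import parseaddr


def classify(from_header, body, whitelist, spam_patterns):
    """Return 'ham' or 'spam'; whitelisted senders skip the pattern scan."""
    _, sender = parseaddr(from_header)
    if sender.lower() in whitelist:
        return "ham"  # trusted sender: the expensive rules never run
    if any(rx.search(body) for rx in spam_patterns):
        return "spam"
    return "ham"
```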
     

  21. Easy way to handle spam... by NotQuiteReal · · Score: 1

    Just route everything from Facebook, LinkedIn, my dad, Apple and "i*" to the spam folder, and most of it is covered.

    --
    This issue is a bit more complicated than you think.
  22. Oneword by Anonymous Coward · · Score: 0

    Junkemailfilter

    Http://www.junkemailfilter.com

    Outsource

  23. Problem spotted. by girlintraining · · Score: 4, Insightful

    The problem is that you're using egrep in the first place. Here's the thing -- the overwhelming majority of your cycles are getting sucked loading, initializing, executing, then unloading, that thread. It's not that using regular expressions is processor-intensive... it's that repeatedly launching the same executable is.

    Use something that can load once, read in the patterns, check all the e-mails that are queued, sort them, then exit. Your execution time will go from 15 seconds to 150 milliseconds.
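
    A minimal sketch of "load once, check everything queued" in Python, assuming one regex per line in the pattern file and messages already read into strings. The function names are illustrative, and no timing is claimed for it.

```python
import re


def load_patterns(lines):
    """Compile every non-empty pattern line exactly once."""
    return [re.compile(line.strip(), re.IGNORECASE)
            for line in lines if line.strip()]


def classify_queue(compiled, messages):
    """Return (spam, ham) lists for a batch of message strings."""
    spam, ham = [], []
    for msg in messages:
        # any() stops at the first pattern that matches this message.
        if any(rx.search(msg) for rx in compiled):
            spam.append(msg)
        else:
            ham.append(msg)
    return spam, ham
```

    With a real pattern file, something like load_patterns(open(path)) would feed it; the compile cost is then paid once per batch instead of once per message.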

    --
    #fuckbeta #iamslashdot #dicemustdie
    1. Re:Problem spotted. by complete+loony · · Score: 3, Interesting

      If you have sufficient programming experience, I'd recommend basing this solution on redgrep. It's an llvm based expression compiler that should be able to combine multiple expressions into a single machine code state machine, assuming it doesn't run out of memory in the process. With a bit of effort you could output all of your compiled expressions into a single executable so you'll only need to wait for the compilation time when you add more filters.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    2. Re:Problem spotted. by arth1 · · Score: 1

      You mean like doing an egrep +F instead of multiple egreps? I sure hope he already does.

    3. Re:Problem spotted. by Anonymous Coward · · Score: 0

      this is what I was thinking. egrep loading a pattern matching file once should be much much faster than 15 seconds. launching an egrep for every line in a pattern file, and then matching it against the email is a terrible way to do this.

    4. Re:Problem spotted. by jafo · · Score: 1

      grep *CAN* take a bunch of patterns, we simply don't know if the user in question is using it in that way. Agreed though, if you are running egrep once for every pattern you are looking for, that is probably your problem and simply putting the patterns in a file and having egrep load the patterns from it via the "-f" flag will likely reduce this dramatically. However, doing many matches is still relatively expensive.

  24. Procmail is a fine tool -- but the wrong tool by Arrogant-Bastard · · Score: 5, Informative

    If spam has made it far enough that it's actually reached your personal instance of procmail, then there's been a problem earlier in the chain. Procmail rulesets should be a last resort, and they should only be asked to deal with minor issues that aren't dealt with via earlier rulesets.

    The first line of defense are your perimeter routers. They should implement BCP 38, they should block bogons, and they should bidirectionally deny all traffic to/from the Spamhaus DROP list. In addition, they should block inbound port 25 traffic from everywhere on the planet that you don't need email from. In other words; the fact that someone in country X wants to email you is unimportant unless you actually wish to receive mail from them. Yes, this is a reversal of default-permit, for a simple reason: default-permit for SMTP stopped being reasonable around 2000. Use http://www.ipdeny.com/ to pick up the ranges per-country and only permit what you need. (Obviously a major research university can't do this. But Joe's Furniture, which does not have customers in Peru or Pakistan or Greece, can.)

    Then use blacklists, the best defense against spam we've ever developed. (Source: 30+ years of email experience) Spamhaus's Zen blacklist is a good one with a low FP rate and a tolerable FN rate. Augment these with local blacklists based on domains and network allocations. Augment those with as much blocking of generic hostnames and dynamic IP space as possible: real mail servers have real hostnames and are on static addresses.

    Then enforce RFC requirements: sending host must have rDNS, that PTR must resolve, what it resolves to should be the sending host's IP. Sending host must HELO as FQDN or bracketed dotted-quad; if FQDN, must resolve. Sending host must not send traffic pre-greeting. And so on. Enforcing these DOES mean occasionally you block mail sent by non-spamming entities: but since they are incompetent non-spamming entities, why would you want mail from them?
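
    As an illustration of the HELO rule only, here is a hedged Python sketch of the syntax half of that check: the argument must look like an FQDN or a bracketed dotted-quad. The function name is made up, and it deliberately does no DNS lookups, so the "must resolve" and rDNS parts are not covered.

```python
import ipaddress
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def helo_looks_valid(arg):
    """Syntax check only: FQDN with at least two labels, or [a.b.c.d]."""
    if arg.startswith("[") and arg.endswith("]"):
        try:
            ipaddress.IPv4Address(arg[1:-1])
            return True
        except ValueError:
            return False
    labels = arg.rstrip(".").split(".")
    return (len(labels) >= 2
            and all(_LABEL.match(label) for label in labels)
            and not labels[-1].isdigit())  # rejects a bare, unbracketed IP
```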

    Add greylisting. It'll handle a lot of annoying hosts that haven't learned to retry yet.

    Rate-limit based on normative values for your site. For example: if analysis of a year's worth of mail logs shows that during that time you never received more than 10 messages a day from ANY host, then rate-limit at 30 or 40. You'll never hit in normal practice; but if you get hammered by a fast-sending host, you'll blunt the attack. Note that these don't have to be perfect to work: provided you send deferrals (SMTP response codes 4xx) instead of refusals (5xx) the worst that happens is that you will mistakenly impose a delay.
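
    The rate-limit logic can be sketched in a few lines of Python. This is an assumption-laden toy, not an MTA configuration: the class name is hypothetical, the limit of 40 is taken from the example above, and 450 stands in for "some 4xx deferral".

```python
from collections import defaultdict


class DailyRateLimiter:
    """Count messages per sending host per day; defer once over the limit."""

    def __init__(self, limit=40):
        self.limit = limit
        self.counts = defaultdict(int)  # (host, day) -> messages seen

    def check(self, host, day):
        self.counts[(host, day)] += 1
        if self.counts[(host, day)] > self.limit:
            return 450  # temporary failure: a legitimate sender will retry
        return 250      # accept
```

    Because the over-limit answer is a deferral rather than a refusal, a wrongly limited host only sees a delay.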

    There's more -- it's possible to get quite crafty about this. But note that NONE of these measures pay any attention to content. There's a reason for that: spammers can defeat content-based measures at will. They won't have it so easy with these.

    Deployed in production in various setups ranging from a dozen to eight million users, these steps yield a FP rate of about 10e-6 to 10e-7 and a FN rate around 10e-5 to 10e-6. Tuning helps, of course: initial rates can be higher but log analysis (which all sensible postmasters do) readily brings them down. If you have the luxury of running your own mail server just for yourself, then you can REALLY tune this setup: you should be able to get the FN rate down to 10e-7 after a few months.

    1. Re:Procmail is a fine tool -- but the wrong tool by thegarbz · · Score: 1

      That's a very informative post, but the first part is making a big assumption that someone has that level of control over the network. Much of what you say is exactly the type of filtering that is applied by spamassassin and other various tools at the end of the chain. Many of us don't have the option to work higher up the chain.

    2. Re:Procmail is a fine tool -- but the wrong tool by SigmundFloyd · · Score: 0

      should block inbound port 25 traffic from everywhere on the planet that you don't need email from. In other words; the fact that someone in country X wants to email you is unimportant unless you actually wish to receive mail from them.

      Distributed botnets make that solution ineffective.

      --
      Knowledge is power; knowledge shared is power lost.
    3. Re:Procmail is a fine tool -- but the wrong tool by Anonymous Coward · · Score: 0

      You always have "that kind of control". Of course, some people don't have control over "perimeter routers" - another department or outside contractor does that. But so what? If you can't drop packets at the perimeter routers, drop them at the host using exactly the same rules. You'll stop the same amount of spam, for the price of a little more network traffic.

    4. Re:Procmail is a fine tool -- but the wrong tool by hmilz · · Score: 1

      Thanks for the hints - the main problem is an architectural one, though. My machine is a UUCP leaf node, and I have currently no intention of changing this. Although this may sound 90's, it has a number of advantages - I don't need to keep my incoming SMTP port clean (with the downside that I cannot filter directly there), and I have a complete mail subdomain for myself.

    5. Re:Procmail is a fine tool -- but the wrong tool by jafo · · Score: 1

      In an ideal world, many of the tips you mention would be fine and not produce any false positives. Unfortunately, we don't live in that world and users *WILL* receive e-mail from servers without proper PTR records, that don't know how to properly deal with greylisting (sending from multiple IPs or sender addresses, immediately bouncing on a 4xx response), from an IP that is on a blacklist... And god forbid you have any users, because they often will squeeze you from both ends: "I've *GOT* to receive this e-mail RIGHT NOW", but also: "Why am I getting so much spam?"

    6. Re:Procmail is a fine tool -- but the wrong tool by Anonymous Coward · · Score: 0

      Add greylisting. It'll handle a lot of annoying hosts that haven't learned to retry yet.

      Alas, this includes Yahoo! Groups. Personally I'm inclined to get those mailing lists off Yahoo! Groups, but people cite that it is easy to use and there aren't very many freely available list servers out there.

  25. Bayesian Mail Filter by Trevin · · Score: 1
    I've used bmf via procmail on my ISP shell account for years, and it was extremely reliable and accurate. As an added bonus, it automatically forwarded spam to uce@ftc.gov.

    When my ISP discontinued the use of procmail filters, I moved it to my home computer and configured two filters in Evolution: the first one to auto-remove mail marked by my ISP as suspected spam, and the next to pipe the mail through bmf and remove it if it tested positive for spam. When I say "auto-remove", I mean it's moved to a spam folder where I can double-check it in case false positives get through.

    http://sourceforge.net/projects/bmf/

  26. Install CRM114 by Anonymous Coward · · Score: 0

    Install CRM114, set it up, and begin teaching it spam from non-spam.

    Very quickly it will "learn" and you'll seldom ever see a spam message.

    http://crm114.sourceforge.net/

  27. Fail2Ban by Anonymous Coward · · Score: 0

    I've used Fail2Ban and some regular expressions to help filter out things. For example, when you email someone and you get the address wrong, you get an email kicked back with the 450 error code.

    So, I use Fail2Ban to look for 450 error codes, and if it sees that 5x within 10 minutes, it blocks your IP address for 24 hours.
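
    The counting rule described here (5 hits within 10 minutes, then a 24 hour block) can be sketched in Python as below. This is not Fail2Ban configuration or its API, just the same logic with hypothetical names, and timestamps are plain seconds supplied by the caller.

```python
from collections import defaultdict, deque

WINDOW = 10 * 60          # seconds in which failures are counted
MAX_FAILURES = 5
BAN_TIME = 24 * 60 * 60   # seconds an address stays blocked


class Banner:
    def __init__(self):
        self.failures = defaultdict(deque)  # ip -> timestamps of recent failures
        self.banned_until = {}              # ip -> time the ban expires

    def record_failure(self, ip, now):
        recent = self.failures[ip]
        recent.append(now)
        # Drop failures that have fallen out of the window.
        while recent and recent[0] <= now - WINDOW:
            recent.popleft()
        if len(recent) >= MAX_FAILURES:
            self.banned_until[ip] = now + BAN_TIME

    def is_banned(self, ip, now):
        return self.banned_until.get(ip, 0) > now
```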

    Couple that with blocking entire countries' IP ranges (China, Russia, etc.), and I see little to no spam at all.

  28. Popfile by duke_cheetah2003 · · Score: 1

    I've been using popfile for years. Works great! Try it.

    1. Re:Popfile by _Shorty-dammit · · Score: 1

      I don't know why anyone would use anything other than gmail, but I guess some people have their uses. I'll cast another vote for popfile, as I used that before gmail, but I don't really know how it might perform with a large volume of email. It was awesome for just my own personal mail while I was using it.

  29. regular expression optimiser by lkcl · · Score: 2

    i'd be interested to see what happens if you run those regex's through this:
            http://bisqwit.iki.fi/source/regexopt.html

    btw can we please get a copy of the patterns you're using? i think they might prove useful for other people. also i'd like to test them myself against regexopt.

    oh - to the other person who suggested spamassassin? i tried that, i set it up to run at MTA-time. it often took THIRTY SECONDS to process a message. in fact it was so bad that i was forced to set a limit of 100k on incoming messages, as a lot of virus-ridden word documents (etc) were typically over 100k. that cut down the amount of CPU cycles but it was still far far too much memory and far too CPU intensive.

    the one thing that did work well is greylisting, however the problem with greylisting i find is that if you happen not to be at the computer or have direct access to the server and people on the phone say "i'm sending you a message now, have you got it?" you *know* it's going to be at least an hour before it'll arrive. so, unless you can whitelist them in advance (which you can't always do) greylisting does actually interfere with legitimate business.

    anyway: in the end i gave up and went to gmail, but with gmail fucking up how they're doing things i have to revisit this and set up a mail server again. thus we come full circle...

    1. Re:regular expression optimiser by thegarbz · · Score: 1

      oh - to the other person who suggested spamassassin? i tried that, i set it up to run at MTA-time. it often took THIRTY SECONDS to process a message. in fact it was so bad that i was forced to set a limit of 100k on incoming messages, as a lot of virus-ridden word documents (etc) were typically over 100k.

      I'm sorry to say it but you must be doing something wrong. I have a very default installation of spamassassin and sendmail also running at MTA time and on my really crappy old spare parts server it never takes more than a second or two to process a mail item. This also does not appear to vary depending on email size, a 10MB email seems to take just as long as a plain text one. I don't think out of the box spam-assassin checks attachments or any type of external content.

    2. Re:regular expression optimiser by lkcl · · Score: 1

      thanks thegarbz - i didn't mention that i added in pyzor and razor, and i think clamav as well. also as my domain's been up for a while it does receive a considerable amount of spam. the load just got to be too much. i'll investigate alternatives and also bear in mind that spamassassin worked well for you.

  30. Pay somebody by DogDude · · Score: 1

    Unless it's a fun hobby for you, it makes much more sense to just pay for email and let somebody else do it. Personal email can be gotten for about $2/month.

    --
    I don't respond to AC's.
  31. O Hai. Has this been posted? by symbolset · · Score: 1

    The canonical spam solution checklist.

    I'm going with "Specifically, your plan fails to account for: (x) Users of email will not put up with it."

    --
    Help stamp out iliturcy.
  32. agrep by rudick · · Score: 1

    Provided your pattern file is under 340K, 'agrep -f' is about twice as fast.

  33. quit wasting your time by Anonymous Coward · · Score: 0

    there are million dollar companies that can detect it faster and even better than your OSS bullshit half assed script for free

    quit pretending, its not 1994 anymore

    1. Re:quit wasting your time by gringer · · Score: 1

      there are million dollar companies that can detect it faster and even better than your OSS bullshit half assed script for free

      The NSA, for example. Use a US server as your email service provider, and you get filtering for free!

      --
      Ask me about repetitive DNA
  34. multithread? by rtayek · · Score: 1

    depending on where your time is going, consider splitting the file up into pieces and run each piece in a different thread.

    --
    vice chair orange county java users group (ocjug.org).
  35. Use perl by Forever+Wondering · · Score: 2

    A long time ago I benchmarked perl's regex engine against about 5 others. At the time, it was 10x faster than the nearest competitor for the same regex/data.

    Also, you can use perl's "study". Or, split the regexes across threads.

    Also, with perl you can do some hierarchical savings. For example:
    /Ffoo/ ...
    /Fbar/ ...
    /Fbaz/ ...

    Could be redone as:
        if (/F/) {
            ... if (/Ffoo/)
            ... if (/Fbar/)
            ... if (/Fbaz/)
        }

    The above is a trivial example, but you get the idea.
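
    The same hierarchical idea as a runnable sketch, written in Python rather than Perl. The pattern names mirror the toy example above; the module-level constants and the function name are illustrative only.

```python
import re

PREFILTER = re.compile(r"F")
SPECIFIC = [re.compile(p) for p in (r"Ffoo", r"Fbar", r"Fbaz")]


def matches(text):
    """Return the first specific pattern that matches, or None."""
    # Cheap test first: none of the specific patterns can match without an "F".
    if not PREFILTER.search(text):
        return None
    for rx in SPECIFIC:
        if rx.search(text):
            return rx.pattern
    return None
```

    The saving depends on how often the cheap test fails; if nearly every message contains the common prefix, the extra check buys nothing.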

    Also, how much time is spent compiling (vs. executing) the regexes in egrep? I imagine a lot and you have to do this for each incoming message.

    Note that spamassassin (and hence perl) can be set up as a daemon where the regexes are compiled once. The messages are passed through a socket to the daemon. This means that the only CPU time spent is on executing the regexes--a considerable savings.

    Additionally, perl regexes have [considerably] more functionality/utility than egrep ones. You might be able to recode/consolidate yours and get the same [or better] bang for less buck.

    --
    Like a good neighbor, fsck is there ...
  36. Current solution is awful by goombah99 · · Score: 1

    Here's several things you can do to make this faster.
    1) first don't keep invoking egrep. this has to parse the command line and then re-load the egrep command itself every time. Instead do this from within a loaded program. Perl is a very good choice for this
    2) the perl command can pre-compile the regular expression. So you can leave the perl program running as a process then simply feed it new data to analyse.
    3) given you are searching for words, you probably want to split the incoming stream on white space one-time not every time.
    4) even better than that, take the e-mail, parse it to words, then parse each word into all 3,4,5,6,7,8 consecutive strings. Then just look these up in a hash table.
    5) if you are only trying to match from the start of the word, (not interior word strings) then this hashing becomes trivial.
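
    Steps 3 and 4 can be sketched as follows in Python. It assumes the spam patterns are fixed lowercase strings of 3 to 8 characters with no whitespace, which may well not hold for a real regex list; the function names are hypothetical.

```python
def substrings(word, min_len=3, max_len=8):
    """Yield every consecutive substring of word with length min_len..max_len."""
    for size in range(min_len, max_len + 1):
        for start in range(len(word) - size + 1):
            yield word[start:start + size]


def find_hits(message, bad_strings):
    """Return the fixed strings from bad_strings found inside any word."""
    hits = set()
    for word in message.lower().split():  # split on whitespace once
        for piece in substrings(word):
            if piece in bad_strings:      # hash lookup, independent of list size
                hits.add(piece)
    return hits
```

    The work per message then depends on the message length, not on how many strings are in the set.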

    --
    Some drink at the fountain of knowledge. Others just gargle.
    1. Re:Current solution is awful by Anonymous Coward · · Score: 0

      Actually "egrep" and "-F" conflict... -E and -F are exclusive.
      And the algorithms used for -F are supposed to be more efficient for many search strings case than -E ... for the matching part.

      TBH I'd like to see actual benchmark data besides just theorycrafting, since this kind of scenario tends to have more variables than OP mentions.

  37. OpenBSD spamd by Anonymous Coward · · Score: 0

    Start with spamd and get your spam levels down immensely.

    man page: http://www.openbsd.org/cgi-bin/man.cgi?query=spamd&sektion=8

  38. bogofilter by Anonymous Coward · · Score: 0

    I used various procmail setups in the past, but for years now I have been relying on bogofilter.

    I have since disabled autolearn, as that's the part that takes the time.

    I trained it with a couple of megabytes of ham and spam and was done. From time to time, when something gets classified wrongly, I'll push it through for learning.

    I've never had the wish to look for something else.

  39. Buy a domain by postglock · · Score: 1

    I don't even use spam blockers. Instead I've purchased a domain, which is quite affordable nowadays. I have a catch-all redirect, so I receive any mail addressed to *@mydomain.com.

    Then, I give a unique username to each organisation. e.g. slashdot@mydomain.com. If I receive spam at this address, I inform them, then kill the username. I can also just create slashdot2@mydomain.com if I want to keep dealing with their company.

    Now, I receive only a few spam emails each year, so I need to do zero automated filtering. I also don't have to deal with the worry of false positives at all.

    1. Re:Buy a domain by jon3k · · Score: 1

      You can also do this via gmail. Gmail will accept email addressed to +@gmail.com and deliver it to you. Try it out.

      So anytime you sign up for something, just use: postglock+slashdot@gmail.com. Then if you get spam, just look at the "To:" address, you can even write a filter based on the + sign in the "To:" field, if you wanted.
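
      A small Python sketch of the "look at the To: address" step, pulling out whatever follows the + sign. The function name is made up for illustration; what to do with the tag afterwards (file it, block it) is left open.

```python
from email.utils import parseaddr


def plus_tag(to_header):
    """Return the tag after '+' in the local part of an address, or None."""
    _, addr = parseaddr(to_header)
    local, sep, _domain = addr.partition("@")
    if not sep or "+" not in local:
        return None
    return local.split("+", 1)[1]
```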

    2. Re:Buy a domain by jon3k · · Score: 1

      sorry, replying to myself. First line should have been: any email delivered to name+any_string@gmail.com

    3. Re:Buy a domain by postglock · · Score: 1

      I did hear about this, but I hadn't thought about writing a filter after receiving spam. That's a cool idea.

      The only part that makes me slightly wary is that since so many use gmail, you'd think that spammers would automatically remove the +slashdot part pretty soon.

    4. Re:Buy a domain by jon3k · · Score: 1

      Entirely possible - but here's something cool. I have Google for Your Domain set up for a personal domain. I just tested it, and I was able to send an email to: jon+test@[mydomain].com. Now there's no way for a spammer to know if Google is handling my mail (easily) so they'd have to assume that the + was a legitimate character. I mean, in theory, they could look up the MX records and if they point to google, strip the +[characters up to]@ off, but I seriously doubt many, if any at all, would do this.

    5. Re:Buy a domain by Sancho · · Score: 1

      Can you explain more about this service you have? Do you just mean Google Apps for Domains?

    6. Re:Buy a domain by jon3k · · Score: 1

      Yeah just apps for domain. I believe they shut down the free version, but I've heard you can still get access by signing up for Google App Engine.

    7. Re:Buy a domain by postglock · · Score: 1

      Nice one. I use Google Apps too. I think that inactivating specific accounts is probably just as quick as creating a filter, in my estimation. It's a bit cleaner too, so I'll continue using it for now.

  40. Go to the definition instead of tricksters by dbIII · · Score: 1
  41. Pre-compiled regex. by viperidaenz · · Score: 1

    A project I worked on many years ago re-wrote a monitoring system in Java.
    It was Perl, running a rather large list of regex's over syslog files.

    The process of converting it to Java resulted in a 100x speed up - despite Perl possibly having a faster regex implementation. The regular expressions are compiled once on start-up. Regular expressions can be very fast - they're just slow to parse and compile.

    1. Re:Pre-compiled regex. by zippthorne · · Score: 1

      If you're going to leave the process running, perl can compile the regexes ahead of time, too...

      --
      Can you be Even More Awesome?!
  42. Hm? My similar procmail setup takes 0.1s per email by Anonymous Coward · · Score: 0

    I don't know what you are doing that runs so massively slowly, but I have a similar setup running on an ancient P2 400MHz server with thousands of regexp filter rules in procmail. Scanning each incoming non-matching email takes only around a hundred milliseconds, and matching spam emails take much less.

  43. Why not just make them one big RULE by Anonymous Coward · · Score: 0

    if you have regexes like this:

    re1
    re2
    re3

    You can just combine them into
    (re1|re2|re3)

    and have it fork just one copy. Even better if you can compile them.
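
    A minimal sketch of the combining step in Python, assuming the individual patterns are valid regexes on their own. The function name is illustrative.

```python
import re


def combine(patterns):
    """Join patterns into one alternation, each in its own non-capturing group."""
    return re.compile("|".join("(?:%s)" % p for p in patterns), re.IGNORECASE)
```

    Wrapping each pattern in (?:...) keeps a pattern that already contains | from bleeding into its neighbours. One caveat: numbered backreferences inside the individual patterns would point at the wrong group once everything is merged.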

  44. Just don't check the Spam Filtering? by wadeal · · Score: 1

    How about instead of sitting there watching it process you just block your own access to viewing this 15 second delay and ignore it. Just don't care about it. Pretend it doesn't happen and your mail just arrived in your inbox.

    I can see no situation where email being delayed by 15 seconds is going to cause an issue.

  45. Comparison of blacklists by CBravo · · Score: 1

    There is a comparison of blacklists: http://dnsbl.inps.de/analyse.cgi?type=monthly&lang=en

    --
    nosig today
  46. You get that much mail? by Mysticalfruit · · Score: 1

    You get so much mail so furiously that you can't suffer a 15 second delay? I presume you're talking about a personal mail server... if you're hosting mail for a 1000 people then yeah that's a problem.

    --
    Yes Francis, the world has gone crazy.
    1. Re:You get that much mail? by jon3k · · Score: 1

      If it's running for 15 seconds maybe it's just putting an annoyingly high load on the server. Also consider that for every legitimate mail, you could be getting a lot of spam. I know I would be annoyed if my CPU load shot up randomly every 5 or 10 minutes when a piece of spam came in.

    2. Re:You get that much mail? by jonbryce · · Score: 1

      I get around 500 emails per day to my mail server of which maybe one or two are legitimate. A 15 second delay means a maximum theoretical capacity of 5760 emails per day before emails arrive at the server faster than the spam filter can process them. Even lower overall numbers will cause substantial bottlenecks at busy times of the day.
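
      The arithmetic behind that figure, as a short Python sketch. It assumes messages are filtered strictly one at a time; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_messages_per_day(seconds_per_message):
    """Upper bound on throughput when messages are filtered one at a time."""
    return SECONDS_PER_DAY // seconds_per_message


def utilisation(messages_per_day, seconds_per_message):
    """Fraction of the day the filter is busy at a given load."""
    return messages_per_day * seconds_per_message / SECONDS_PER_DAY
```

      At 15 seconds per message the ceiling is 86400 / 15 = 5760 messages, and 500 messages a day keep the filter busy for just under 9% of the day on average, before allowing for bursts.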

  47. If you're running your own MX, recommend ASSP by Anonymous Coward · · Score: 0

    ASSP, Anti-Spam SMTP Proxy.

    Ran it for a few years with a domain of a several hundred users. What I liked best is that it blocks spam during the SMTP conversation with the spammy sender.

  48. Gmail forwarder by flyingfsck · · Score: 1

    Just forward your mail through gmail. That way all the spam disappears and the NSA can get their data without trouble.

    --
    Excuse me, but please get off my Pennisetum Clandestinum, eh!
  49. This sounds like a bad setup to me by ggendel · · Score: 1

    If you require using sophisticated procmail filters on your personal account then it seems like your setup is wrong from the get-go. Your incoming mail server should be taking the brunt of the work and using a progressive and efficient filtering before any filtering by content.

    I use a spamdyke based front end that has a whole arsenal of white, black, and gray filtering of emails using RBLs, RHSBLs, reverse lookups, etc. It can also do header "pattern" filtering, but I currently don't use that feature. This blocks almost all spam quickly and efficiently. The last stage is to run it through spamassassin for those things that are in the gray (not a simple reject/accept, but a cumulative scoring) area. Worst case mail delays are on the order of a few seconds through the whole chain. Spamassassin only gets a small number of incoming emails to work on. The stragglers usually come via accounts at yahoo, live, etc.

    The nice thing about spamdyke and other systems like it is that it does its job very fast. For example, the blacklists and whitelists in spamdyke can be set up as a directory tree structure, so it is a very quick lookup to determine whether to accept or reject the specified domain or IP address.
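
    To show why a directory tree makes the lookup cheap, here is a rough Python sketch. The on-disk layout (reversed domain labels as nested directories, plus a marker file called `_listed`) is my own assumption for illustration and is not spamdyke's actual format; check the spamdyke documentation for the real one.

```python
import re
from pathlib import Path

MARKER = "_listed"  # hypothetical marker file name, not a spamdyke convention
LABEL_RE = re.compile(r"[a-z0-9-]+")  # rejects '/', '..' and empty labels


def _labels(domain: str):
    """Split a domain into labels, TLD first; return None if any label is unsafe."""
    labels = domain.lower().rstrip(".").split(".")
    if not all(LABEL_RE.fullmatch(label) for label in labels):
        return None
    return list(reversed(labels))


def add_domain(root: Path, domain: str) -> None:
    """List a domain (and, implicitly, all of its subdomains)."""
    labels = _labels(domain)
    if labels is None:
        raise ValueError(f"unsafe domain: {domain!r}")
    directory = root.joinpath(*labels)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / MARKER).touch()


def is_listed(root: Path, domain: str) -> bool:
    """Walk from the TLD down; cost grows with label count, not list size."""
    labels = _labels(domain)
    if labels is None:
        return False
    current = root
    for label in labels:
        current = current / label
        if (current / MARKER).is_file():
            return True
        if not current.is_dir():
            return False
    return False
```

    With `spam.example` added, both `spam.example` and `mail.spam.example` are reported as listed, while `ham.example` is not. The lookup touches at most a few directory entries no matter how many domains are stored.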

    I also use systems like honeypots and hunter-seekers. The latter looks at what is graylisted or accepted by spamdyke and does http checks on the domain to see if it should be blacklisted. It may also decide to test neighboring IP addresses to see if more should be blacklisted.

    Like all systems, you must be proactive at identifying mail that shouldn't have been rejected. It is a rare situation, but there are a few companies with badly configured mail servers (like no reverse dns entries). However, after many years of operation my whitelist contains only a handful of domains. The automated blacklist process sends me email when it adds a domain, just in case.

  50. CRM114, most accurate spam filter by Anonymous Coward · · Score: 0

    It's called CRM114, at http://crm114.sourceforge.net/.

    It's a technically fascinating tool, named after the old Dr. Strangelove movie's tool for filtering authorized communication. It doesn't get the attention it deserves because it's never been well packaged: the author publishes it as open source but hasn't cooperated with wrapping it in "autoconf" or some other build structure to build Debian or Red Hat based packages. It uses Markovian, *not* Bayesian pattern matching, which enormously improves its matching.

    Instead of working from a programmed set of filters with programmed keywords, which professional spammers tune their spam to avoid, it builds its own filters from those Markovian matches of what you don't personally want to see, and relies on you deciding "spam/not-spam" to update its rules, much as Google does these days. But because the filters are individual and embedded in a neural net, it's very difficult to *deduce* the rules, and they change. In fact, it's even possible to train it with a data set that no one else is allowed to see and put it on an outgoing mail filter. This turns out to be useful for filtering outgoing, confidential data from doctors or stock brokers or intelligence agencies.

  51. Re:O Hai. Has this been posted? by Antique+Geekmeister · · Score: 1

    Thank you for posting that checklist, that's a vital document for any spam planning.

    SpamAssassin, executed through procmail on the mail client's email, is indeed resource intensive and does not scale well for an organization. Other people have mentioned other upstream filtering techniques, such as grey listing and DNS blacklists, but those are limited because of the large numbers of zombied Windows clients around the world, which have their resources rented as botnets to send spam from legitimate environments around the world, partly to evade these filters.

    My experience is that spam requires management, not silver bullets. Layers of defense such as supporting SPF, which filters very early and cheaply based on DNS records, help eliminate most forged gmail.com and hotmail.com and other large domain phishing. More powerful, more expensive filters such as SpamAssassin can be applied on the vastly reduced volume of email that gets past the earlier filters. Unfortunately, if you're processing with a local "procmail" by pulling the email from the mail server to your local machine, it's already too late to activate DNS blacklists or SPF, so the increasing burden on SpamAssassin is predictable.

    I'm afraid I don't have a great solution for the original poster except to push the filtering upstream, to the mail server itself, to reduce the load with those lightweight filters such as SPF or blacklists.

  52. What is it with blaming the observer? by dbIII · · Score: 1

    So, this might be flamebait, but what it really comes down to is that your issues with greylisting were likely because you didn't do your homework

    No - I did my homework to find out exactly how somebody managed to fuck up communication and greatly delay messages from one end, and found that the answer was greylisting implemented very poorly at the remote end. My comment above is because I "did my homework" and observed the downsides. Those downsides are now listed in the wikipedia article.
    For the record of yourself and the other idiot making noise about MS Exchange, I had not configured either of the two servers and instead came in after the problem came to light. So it's not just "flamebait" it's also a stupid jump to a conclusion just because I'm critical of yet another flawed anti-spam stopgap that can backfire if care is not taken. Spammers are channelling stuff via real mail servers now or getting their bots to resend so greylisting is losing what effectiveness it had anyway.

  53. The collective, barking up the wrong tree together by damn_registrars · · Score: 1

    We see people complaining about this problem a lot, and yet for some reason they are afraid to actually put energy into a real solution. Repeat after me: filters can never end spam. That's right, never. All your filters (same can be said for every filter, everywhere) do is encourage the spammers to make their spam more obfuscated to improve their odds of passing future filters. It is a huge waste of time and resources and it's an arms race that the spammers will win.

    If you want to actually end spam, you need to collaborate with other people who want to end spam. The way to end spam is not through technology but through economics, as there is only one reason why spam is sent: it is profitable. If you can interrupt the flow of money to the spammer they will move on to a different venture. Until then you're only spinning your wheels and wasting time, storage, and CPU cycles.

    --
    Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
  54. Regexp::Assemble by __aawavt7683 · · Score: 1

    Note first, I am _not_ saying to replace your call to grep with a call to perl. Perl _is_ fast on assembling strings into a great matching system, but it still takes a _very_ long time to parse, say, 65000 separate strings.

    So combine them all into one. Use Regexp::Assemble. With a little bit of fiddling, it works with GNU grep as well. Here's an example script that I've named regex-opt:

    !BEGIN regex-opt.pl!
    #!/usr/bin/perl
    use strict;
    use Regexp::Assemble;

    my $gnu = 0;
    if ((defined $ARGV[0]) && $ARGV[0] eq '-gnu') {
            shift;
            $gnu = 1;
    }

    my $ra = Regexp::Assemble->new;
    while (<>) {
            chomp;    # drop the trailing newline so it is not part of the pattern
            $ra->add($_);
    }

    my $string = $ra->as_string();

    if ($gnu) {
            $string =~ s/\\d/[0-9]/g;
            $string =~ s/\(\?:/\(/g;
            $string =~ s/([()?|{}])/\\$1/g;
    }
    print $string;
    !END!

    So, you have a file with your tens of thousands of lines of patterns to match. Ok, ./regex-opt < patterns.txt > matchpattern.re. This may work with egrep, but it's perl regex syntax, so maybe not completely -- procmail | egrep -f matchpattern.re

    With 65000 lines, GNU grep takes about half an hour for the tasks I give it. After assembling all 65000 lines into one expression, even when that expression is _megabytes_ in size, it loads quickly and has the speed of a decision tree.

    So, as you accumulate new patterns, output them to a file. Also, _always_ keep your list of separate match patterns -- I'm not sure how well this package can handle reparsing a regex back into itself. Do matches like so:
    egrep -f <(cat matchpattern.re newpatterns.txt)

    and once a week,
    cat allpatterns.txt newpatterns.txt | regex-opt > matchpattern.re; sort -u allpatterns.txt newpatterns.txt > temp.txt && mv temp.txt allpatterns.txt && rm newpatterns.txt
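
    For anyone wondering why one assembled expression beats thousands of separate ones: shared prefixes get merged, so the matcher walks something like a decision tree. Below is a small Python sketch of that idea for fixed strings only (like the submitter's `egrep -F` list). It is an illustration I wrote for this post, not Regexp::Assemble's actual algorithm, and all names in it are made up.

```python
import re
from typing import Dict, Iterable

END = ""  # key marking "a pattern ends here"; never collides with a real character


def build_trie(words: Iterable[str]) -> Dict:
    """Build a character trie from literal strings, skipping empty ones."""
    trie: Dict = {}
    for word in words:
        if not word:
            continue
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return trie


def trie_to_regex(node: Dict) -> str:
    """Turn a trie into one regex; recursion depth equals the longest pattern."""
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != END]
    if not branches:
        return ""
    if len(branches) == 1 and END not in node:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # If a pattern ends at this node, everything after it is optional.
    return body + "?" if END in node else body


def compile_literals(words: Iterable[str]) -> "re.Pattern":
    """Compile all literal patterns into a single regular expression."""
    source = trie_to_regex(build_trie(words))
    if not source:
        raise ValueError("no patterns given")
    return re.compile(source)


print(trie_to_regex(build_trie(["via", "viagra", "vicodin"])))
# vi(?:a(?:gra)?|codin)
```

    The three sample words share the prefix "vi", so it appears once in the output instead of three times. With tens of thousands of patterns the saving is much larger.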

  55. LibDB by Anonymous Coward · · Score: 0

    Consider using the Berkeley Database:

    http://linux.about.com/cs/linux101/g/libdb.htm

  56. Re:O Hai. Has this been posted? by cas2000 · · Score: 1

    > SpamAssassin, executed through procmail on the mail client's
    > email, is indeed resource intensive and does not scale well for
    > an organization.

    it does scale much better if run through amavis as a persistent process, rather than forked from procmail for each incoming message - much of the CPU usage is from compiling (and re-compiling) the regular expressions over and over again.

    pre-processing your regexp lists to consolidate them into far fewer but much longer regexps also gives huge benefits - e.g. instead of 1000 RE rules of 1 line each, join them with '|' and reduce them to 10 or 50. it's far less computational work to match against 50 long and slightly complicated REs than against 1000 simple REs.

    in practice, this means generating your spamassassin local.cf file with a script, from one or more "source" files.

    even without amavis, SA comes with spamd which provides the same benefit of avoiding RE-recompile - but IMO is a lot more work to configure and maintain than using amavis
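
    the "compile once, reuse for every message" idea in a few lines of Python. this is only a sketch of the principle, not how amavis or spamd work internally; the class and method names are invented for the example.

```python
import re
from typing import Iterable


class PatternFilter:
    """Compile the pattern list once; reuse the instance for every message."""

    def __init__(self, patterns: Iterable[str]):
        cleaned = [p for p in (line.strip() for line in patterns) if p]
        # One alternation of all patterns; None when the list is empty.
        self._regex = (
            re.compile("|".join(f"(?:{p})" for p in cleaned)) if cleaned else None
        )

    def is_spam(self, message: str) -> bool:
        """True if any pattern matches; no recompilation happens here."""
        return self._regex is not None and self._regex.search(message) is not None
```

    a long-running process would build one `PatternFilter` at startup and call `is_spam()` per message, instead of paying the compile cost in a fresh process each time.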

    even so, i try to reject as much spam as possible in the MTA before the mail gets passed to amavis & SA for final checking.

    > My experience is that spam requires management,
    > not silver bullets. Layers of defense [...]

    yes! SPF, greylisting (even a 5 or 10 second greylisting delay is enough to filter out a huge amount of spam), careful use of RBLs (spamhaus are ethical and have reasonable policies), RHSBLs, DULs, MX-record checks (e.g. reject mail if MX record points to 127.0.0.1), HELO/EHLO checks (block mail claiming to be from my domains or IP address), blocking mail from specific senders and sender domains, tarpitting spammers, and more.
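
    since greylisting keeps coming up, here is the core logic as a Python sketch: the first delivery attempt for a (client IP, sender, recipient) triplet gets a temporary failure, and a retry after the delay is accepted. a real implementation lives in the MTA or a policy daemon, answers with an SMTP 4xx code, and expires old entries; the class below is hypothetical and leaves all of that out.

```python
import time
from typing import Callable, Dict, Tuple

Triplet = Tuple[str, str, str]  # (client IP, envelope sender, envelope recipient)


class Greylist:
    """In-memory greylist; the clock is injectable so it can be tested."""

    def __init__(self, delay_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self.delay_seconds = delay_seconds
        self.clock = clock
        self.first_seen: Dict[Triplet, float] = {}

    def check(self, client_ip: str, sender: str, recipient: str) -> str:
        """Return 'tempfail' until the triplet has waited out the delay."""
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        first = self.first_seen.setdefault(key, now)  # remember first attempt
        if now - first >= self.delay_seconds:
            return "accept"
        return "tempfail"
```

    spam bots that never retry are dropped for free; legitimate servers retry and get through after the delay.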

    another useful technique is to use well-crafted fail2ban rules to monitor /var/log/mail.log and create temporary iptables rules to block persistent spam sources.

    on my home mail server, i also block all mail from specific countries, using IP address and TLD blocking lists - but that's not a good option when spam-filtering mail for a company or organisation.

  57. Re:O Hai. Has this been posted? by Antique+Geekmeister · · Score: 1

    That seems a very sophisticated, enlightened, multi-layered approach. It can be very difficult to implement so broadly if your mail services are in the hands of another corporate group. MS Exchange managers, for example, can become quite concerned and upset if you want to implement greylisting and SPF blacklists before it even reaches their mail servers, but that's where it's most effective.

    Merging the SpamAssassin checks into larger but more efficient regexp statements is a useful technique that I'd encourage you to publish, especially if you publish the tools to build those new rules and move aside the old ones.

  58. Re:Matching multiple simultaneous regular expressi by Anonymous Coward · · Score: 0

    My own tests on a Core 2 show the ANTLR Java lexer to get about 1-2 MB/sec throughput. The C output gets around 4-5, and using -flto hits about 8-9.

  59. Re:You get that much mail? (??much?) by lpq · · Score: 1

    @ 1000 emails/day, that's 15000 seconds processing time, or over 4 hours. That seems a bit excessive, since it means having your email delayed a cumulative 4+ hours/day.

  60. Use ClamAV engine by Anonymous Coward · · Score: 0

    Like it is done here:
    http://www.sanesecurity.co.uk/

  61. Re:O Hai. Has this been posted? by cas2000 · · Score: 1

    i thought merging REs was standard practice by now. i've been doing it since long before I started using SpamAssassin, when I was still mostly using postfix body_checks and header_checks.

    here's some of my anti-spam stuff.

    the scripts are old, but pretty close to what i actually still use today to generate postfix body/header checks and spamassassin rules.

    they're not packaged software you can just install and use - think of them as examples of a particular approach to managing anti-spam rulesets.

    BTW, note that with SpamAssassin, fewer and larger rules require less CPU time to run, but reduce the likelihood of multiple matches if there are multiple spammy phrases in an email - max one match per rule. this is why the scripts are configured to generate max of 500-character rule lines, when SA can easily handle 5000 or more characters per line. also, shorter lines are easier to read when debugging problems, and each rule is generated with a unique identifier so I can see which rules are matching for each msg
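
    a rough Python sketch of the chunking idea described above: join patterns with '|' until a rule would exceed a length cap, then start a new rule with its own identifier. this is not one of my actual scripts; the `LOCAL_RE` prefix and the function name are placeholders, and writing the result out as SpamAssassin or postfix config lines is left out.

```python
from typing import Iterable, List, Tuple


def chunk_patterns(patterns: Iterable[str], max_chars: int = 500,
                   prefix: str = "LOCAL_RE") -> List[Tuple[str, str]]:
    """Group patterns into (rule_id, regex) pairs of at most max_chars each.

    The cap applies to the joined alternation, not counting the "(?:...)"
    wrapper. A single pattern longer than max_chars gets a rule of its own.
    """
    rules: List[Tuple[str, str]] = []
    current: List[str] = []
    current_len = 0

    def flush() -> None:
        nonlocal current, current_len
        if current:
            rule_id = f"{prefix}_{len(rules) + 1:04d}"  # unique, sortable id
            rules.append((rule_id, "(?:" + "|".join(current) + ")"))
            current, current_len = [], 0

    for pattern in patterns:
        pattern = pattern.strip()
        if not pattern:
            continue
        added = len(pattern) + (1 if current else 0)  # +1 for the '|' separator
        if current and current_len + added > max_chars:
            flush()
            added = len(pattern)
        current.append(pattern)
        current_len += added
    flush()
    return rules
```

    a smaller `max_chars` gives more rules (more chances for multiple matches, easier debugging); a larger one gives fewer, cheaper rules.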

  62. Grep, Hot, no Sugar: Checkpoint-Restore by LordMyren · · Score: 1

    Use CRIU (Checkpoint Restore in Userspace) to checkpoint a hot version of grep that has been started and given a couple seconds to load in the dictionary and build its pattern matcher and is thus just awaiting stdin (which you haven't given it). Restore a fresh instance for every new email, and pass the new email into the just-opened stdin for that restored, hot, waiting to go instance.

    Instead of launching a fresh grep and initializing it with your corpus, this will create a grep that you can bring online which will be ready to go, awaiting input.

    Ma-fucking-gic.

    Traditionally one could achieve this effect by forking child workers, but that's a fucking huge pain in the ass as far as program design goes, making things really complicated: instead of a single program doing a single thing, it couples many uses of a program into a single program's lifecycle. Daemonized apps require system level management and have to be running. Service apps require complex interfaces to handle the different servicings they are performing. Decouple concerns (stay unix'y: stdin->program->stdout), and CRIU the bitch. Just use a hot program, rather than a cold one.

    If the problem persists: fuck grep, its pattern matching is rubbish and it's worthless. Please let us know. You might also consider 'head' 'ing the first 64k or some such of your email to avoid pattern matching the entire doc.
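
    The "only look at the first 64k" part is easy to sketch in Python. The function name and the default limit are just illustrative choices, and the obvious trade-off applies: anything spammy past the limit goes unseen.

```python
import re

MAX_SCAN_BYTES = 64 * 1024  # assumed cutoff, matching the "first 64k" suggestion


def head_matches(path: str, regex: "re.Pattern",
                 limit: int = MAX_SCAN_BYTES) -> bool:
    """Search only the first `limit` bytes of the file at `path`.

    `regex` must be compiled from a bytes pattern, since the file is read raw.
    """
    with open(path, "rb") as handle:
        head = handle.read(limit)
    return regex.search(head) is not None
```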

  63. Answer found: hot egrep. by LordMyren · · Score: 1

    It's possible to use a hottened egrep by booting up one egrep, checkpointing it, then restoring that checkpoint again and again whenever you need an instance.
    http://ask.slashdot.org/comments.pl?sid=4150171&cid=44759217

    The problem is not using egrep, the problem is not using an existing already launched copy of egrep. Which, you CAN do. And I'd even recommend doing so, because it's manageable and uses sane well known and unfancy tools that are decoupled from each other.

    Thanks for writing GIT. So many in this thread immediately jump into alternative options without discussing what's really at the heart of this problem. Grep is fine software and is known to do its job well. As you say, the problem is simply that grep has startup costs, but those can be almost entirely eliminated.