DSPAM v3.2 Released
Nuclear Elephant writes "After four months of development DSPAM v3.2 has been released, bringing many new enhancements and filtering technologies. These include distributed computing support, implementation of Bill Yerazunis' Sparse Binary Polynomial Hashing algorithm (from CRM114), and v1.2 of Bayesian Noise Reduction. Other enhancements include SQLite support and many significant performance enhancements for PostgreSQL. DSPAM's official release is next week, but you can download the preview release now. Users of the project have also contributed towards creating a new logo for this release."
Naah, meanwhile SpamAssassin supports Bayesian filtering as well, besides its rule-based filtering.
Here's what it shows.
ONLY the 3.2 Preview Release 1 is currently out!
I have been using DSpam for my network for quite some time now (~a month or two) and have not received a complaint from any users since. Seems to me it works better than CRM114.
What you want is ClamAV:
http://clamav.sourceforge.net/
This is a legit message from someone's mail system. You are receiving this because someone has been infected with a virus. Their computer is sending messages from your email address, and some of these messages are going to non-existent mail addresses. Because they are spoofing your mail address in the From: header, you are receiving all the bounces.
So technically, this isn't spam or junk mail. It's someone's email system doing what it's supposed to: returning 'your' email because the recipient didn't exist.
Unfortunately, probably not much you can do about this without blocking all such legit system messages.
It's your server and hopefully you'll never have to suffer the 'collateral damage' of living near a spammer (network neighbourhood wise). It has happened to me a couple of times. The first time I actually spent time sending my reply from my gmail account, and told the guy about it. The second time I didn't even bother.
Netblock blacklisting is a really poor solution. In some cases a single spammer causes a /24 and then a /16 to be blocked. It doesn't make sense to me. OTOH, I discovered some time ago that blocking Windows boxes works wonderfully, and it's extremely easy to do with OpenBSD's pf :-)
Btw, do you understand that changing ISP may not be an option?
I'm running a mailserver with postfix, dspam, squirrelmail, courier pop/imap, amavis and Postfix Admin where I also integrated the DSPAM phpControlCenter.
DSPAM has currently given me 0 false positives.
The clue with dspam is to start with a clean database for each user and let them start to 'sort out their spam'. For imap it's stupidly simple. Everyone has two folders, "spam" and "notspam", where you can drag&drop an email to the right folder. A script picks up any emails in each folder every hour and does the necessary add-spam/not-spam processing.
For pop it's just a matter of forwarding the email to the add-spam/not-spam addresses.
This works so very well because each user gets to decide which emails he thinks are spam and which emails he would like to receive.
Also, if they log on to their webmail they can control what emails are marked as spam from their DSPAM phpControlCenter, and also correct any false positives, if there are any, or choose to block sender addresses and more.
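For anyone wanting to try the hourly folder sweep described above, here is a rough sketch of what such a script might look like. It is not the poster's script. The Maildir++ layout (`.spam` and `.notspam` under the user's Maildir), the function names, and the exact `dspam` flags (`--user`, `--class`, `--source=error`) are assumptions; check them against the README of the DSPAM version you run.

```python
import mailbox
import subprocess
from pathlib import Path

# Folder name -> DSPAM class. Folder names are taken from the comment above;
# the class names are an assumption about the dspam command line.
FOLDER_CLASS = {"spam": "spam", "notspam": "innocent"}


def retrain_command(user, dspam_class):
    """Build the retraining command (flags are assumed, verify for your version)."""
    return ["dspam", "--user", user, f"--class={dspam_class}", "--source=error"]


def sweep(user, maildir_root, run=subprocess.run):
    """Pipe every message in the two folders to dspam.

    Returns the (folder, key) pairs that were retrained successfully, so the
    caller can move or delete them and avoid retraining them next hour.
    """
    done = []
    for folder, dspam_class in FOLDER_CLASS.items():
        path = Path(maildir_root) / f".{folder}"  # assumed Maildir++ subfolder
        if not path.is_dir():
            continue
        box = mailbox.Maildir(str(path), factory=None, create=False)
        for key in list(box.keys()):
            result = run(retrain_command(user, dspam_class), input=box.get_bytes(key))
            if result.returncode == 0:
                done.append((folder, key))
    return done
```

Run it from cron once an hour per user, e.g. `sweep("alice", "/home/alice/Maildir")` (hypothetical path). What happens to a message after retraining is left to the caller on purpose: something dragged into "notspam" probably belongs back in the inbox rather than in the bin.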
Spamcop and Spamhaus I agree with. SORBS demand payment for removal of clean servers (albeit not to them). That just doesn't chime when people spam through an isp's smtp server and get caught.
There are a lot of things you can (and should) do to keep small databases in DSPAM when disk is an issue. The problem is some of this is in the FAQ rather than the docs... but you can change your training mode to TOE (which only trains on error), set up merged groups (which use a global db, with each user storing only corrections; almost as accurate), do some creative purging, and if you're really paranoid about disk, turn off some features like chained tokens (although I don't think it's necessary).
As for a gray area, DSPAM has a confidence level (has for many versions now) which you can use to greylist messages, or you can set up classification networks and neural networks to have DSPAM consult other users' dictionaries (neural networks is kind of cool because it seeks out the most reliable users for classifying your mail).
So yeah, it's done what you want for quite a while now. I've managed to get my system down to about 5MB per user using merged groups and TOE, and most of my users get 99.9% or better.
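To make the "gray area" idea concrete, here is a minimal sketch of triaging on the classifier's confidence. It only illustrates the thresholding; the function name, the 0.0-1.0 scale, and the 0.80 cutoff are made up for the example and are not DSPAM settings.

```python
def triage(is_spam, confidence, review_below=0.80):
    """Decide what to do with a classified message.

    is_spam      -- the filter's verdict
    confidence   -- how sure the filter is, assumed here to be 0.0..1.0
    review_below -- hypothetical cutoff for the gray area
    """
    if not is_spam:
        return "deliver"
    if confidence < review_below:
        return "review"      # gray area: hold for a human to look at
    return "quarantine"      # confident spam


if __name__ == "__main__":
    for verdict, conf in [(False, 0.99), (True, 0.55), (True, 0.97)]:
        print(verdict, conf, triage(verdict, conf))
```

Sorting the quarantine by confidence, lowest first, gives the same effect in a web interface: the few messages worth checking float to the top.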
Why does DSPAM get front page treatment when the latest POPFile release (which now handles POP3, IMAP, SMTP and NNTP filtering, has an XML-RPC external interface, supports different databases, etc.) gets rejected as a story?
/. has recently turned into some combination of Freshmeat and PC Magazine? Yes.
Perhaps it's because I don't tend to make super-wild claims about POPFile's accuracy? Or come up with cool marketing names for the internal technology?
POPFile's the only Bayesian filter that can:
1. Do more than spam vs. anti-spam and
2. Filter POP3, IMAP, SMTP and NNTP (that's right Usenet news)
Do I have an axe to grind with Jonathan and DSPAM? No, it's a cool project. Does it annoy me that
John.
Did you run the nightly and weekly purge scripts, as documented? (purge.sql for your DBI driver)
Did you also change the model to TUM from the default? ( MUCH more accurate results over TOE or TEFT in our case, and we get a lot of spam!)
I'm not sure what this means, but I've never personally had this problem. dspam gives each spam a percentage, which I can sort on using the web interface. Those with a lower percentage "might be" spam, but need to be checked. Those with a higher percentage (confidence) ARE spam. After 6 months of running dspam, I hardly ever check the quarantine now, because they're all spam. It's learned what is and what is not spam, and delivers accordingly.
I, like you, used SA for a year or two, and had it trained down to a 2.0 threshold (from the default of 5.0). I also had over 300 custom rulesets that blocked based on incoming subject at the MTA side, before even accepting the mail message and sending it to SA. I also used 13 RBLs. We were getting over 5,000 incoming spam a day, and about a dozen would slip through to the users' mailboxes. After 2 years and all of that, we were only at about 90% effectiveness (and yes, my SA rulesets were kept updated all the time).
After 2 weeks of using dspam, we were already at 98%, and not a single spam had slipped through to any user's mailbox. Granted, in the early period of using it for us, some messages were marked as False Positives, but that hasn't happened for ANY user in several months now.
We also stopped using the custom MTA rulesets, and don't use any RBLs either.
dspam absolutely blows away SA (currently, until/unless SA changes) in our particular subset of the mail we receive.
I got nothing against content-filtering measures, as long as one is aware that this should be just the last layer of defense against spam. Think about it: if your SMTP server has already swallowed the spammer's email content, you have already lost precious bandwidth.
Especially if you host your own SMTP, you should put up a layered system of defenses: RBL lists, maybe tarpitting, white/graylisting, and then content filtering.
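A small sketch of that layering, with the cheap connection-time checks first and content filtering last. Everything here is illustrative: the layer names, the `session` dict, and the zone `dnsbl.example` are invented, and the checks are stubs. The one concrete detail is how a DNSBL query name is formed for an IPv4 address (octets reversed, zone appended).

```python
def dnsbl_query_name(ipv4, zone):
    """Name to look up in a DNSBL: reversed octets followed by the list's zone."""
    return ".".join(reversed(ipv4.split("."))) + "." + zone


def first_rejection(session, layers):
    """Run the layers in order; return the name of the first one that rejects."""
    for name, accepts in layers:
        if not accepts(session):
            return name
    return None  # every layer accepted the message


# Hypothetical layers: each takes the session and returns True to accept.
LAYERS = [
    ("rbl", lambda s: not s.get("listed", False)),
    ("greylist", lambda s: s.get("seen_before", False)),
    ("content", lambda s: not s.get("looks_like_spam", False)),
]

if __name__ == "__main__":
    print(dnsbl_query_name("192.0.2.10", "dnsbl.example"))
    print(first_rejection({"listed": True}, LAYERS))
```

The ordering is the point: a sender rejected by the first two layers never gets to transmit the message body, so the bandwidth is not spent.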
the database did grow huge... ...performance was terrible.
Did you try TOE mode? Instead of analyzing everything, it just uses the errors. That means significantly less utilization of your data backend. From the FAQ:
Switch to TOE Mode. DSPAM v2.10 supports TOE (Train-On-Error) mode, which only performs writes to the database in the event that a misclassification has occurred (or if a user has fewer than 4000 innocent messages in corpus). Train-on-error mode should make a significant reduction in the number of writes (and therefore locks) being performed on your database, and may actually improve accuracy as TOE has been known to do so. The default mode of learning is TEFT (Train Everything). This performs a much more detailed training of incoming messages and can more easily adapt to new types of email behavior for users, but does use up a significant number of resources. This is a definite thing to try if you're bottlenecked!
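To see why TOE cuts database writes, here is a toy model of the rule as the FAQ text above states it. This is not DSPAM's code; the function names are made up, and the 4000-message warm-up figure is simply taken from the quote.

```python
def should_write(mode, misclassified, innocent_count, warmup=4000):
    """Return True if this message would trigger a database write.

    TEFT trains on everything. TOE trains only on a misclassification, or
    while the user still has fewer than `warmup` innocent messages.
    """
    if mode == "TEFT":
        return True
    if mode == "TOE":
        return misclassified or innocent_count < warmup
    raise ValueError(f"unknown training mode: {mode}")


def count_writes(mode, errors, innocent_count):
    """Count writes for a stream of messages; `errors` flags each misclassification."""
    return sum(should_write(mode, err, innocent_count) for err in errors)


if __name__ == "__main__":
    stream = [False] * 99 + [True]  # 100 messages, one misclassified
    print("TEFT:", count_writes("TEFT", stream, innocent_count=5000))
    print("TOE: ", count_writes("TOE", stream, innocent_count=5000))
```

For a user past the warm-up, writes scale with the error rate instead of the mail volume, which is where the drop in locking comes from.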