Computationally Cheap Spam Filtering?
"Ideally, I'd prefer something that does reject the message if it's spam (SMTP result code 550 or something like that), unlike current SpamAssassin or spamprobe setups that accept the message and only later decide whether it's spam. Solutions like MAPS RBL and ORBS are acceptable, although commentary on their accuracy would be welcome. Other possibilities I've thought of include checksumming (Vipul's Razor or DCC) and simple header checks that could be implemented, for instance, in a sendmail milter.
Are several quick checks (DCC + RBL) accurate enough and still cheaper than one slow check (SpamAssassin, Bayesian filtering)? Does stacking of similar techniques improve accuracy significantly (DCC + Razor, RBL + ORBS)? How can the good but expensive techniques be made cheaper (SpamAssassin's spamproxyd, hashed wordlists for Bayesian filters, and so on)? Discussion on all these aspects would yield some interesting conclusions on quick and efficient spam filtering."
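One way to picture the "several quick checks before one slow check" idea from the question is the rough Python sketch below. Everything in it is an illustrative assumption rather than any real tool's API: the function names, the message dictionary keys, and the stand-in checks are all hypothetical. A real setup would replace them with RBL, DCC or Razor lookups and a SpamAssassin or Bayesian scan.

```python
from typing import Callable, Iterable

# A check takes a message (here just a dict) and returns True if it looks like spam.
Check = Callable[[dict], bool]


def classify(message: dict, cheap_checks: Iterable[Check], expensive_check: Check) -> str:
    """Return 'reject' or 'accept', running the slow check only when needed."""
    for check in cheap_checks:
        if check(message):
            # A cheap check fired: reject right away (think SMTP 550)
            # and skip the expensive scan entirely.
            return "reject"
    return "reject" if expensive_check(message) else "accept"


# Hypothetical stand-ins for real checks.
BLOCKED_IPS = {"192.0.2.1"}  # documentation address, pretend it is blocklisted


def ip_blocklisted(message: dict) -> bool:
    return message["client_ip"] in BLOCKED_IPS


def content_scan(message: dict) -> bool:
    return "click here" in message["body"].lower()
```

With this ordering the expensive scan only costs CPU for mail that the cheap checks let through, which is where any saving would come from. Whether the combined accuracy is good enough is an empirical question the sketch does not answer.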
At our university, Virginia Tech, the hardware e-mail virus scanners (Mirapoint Messaging Server) also run SpamAssassin now. It puts info in the headers (sample below). Filter for "X-Junkmail: UCE" and you've got a spam filter (though I run a more aggressive SA on my workstation, since I can customize it there).
Return-Path:
Received: from vt.edu (gkar.cc.vt.edu [198.82.161.196]) by xxxx.xxxx.vt.edu (8.12.8/linuxconf) with ESMTP id h47JISRm004277 for ; Wed, 7 May 2003 15:18:28 -0400
Received: from steiner.cc.vt.edu ([10.1.1.14]) by gkar.cc.vt.edu (Sun Internet Mail Server sims.3.5.2001.05.04.11.50.p10) with ESMTP id for noone@xxxx.xxxx.vt.edu; Wed, 7 May 2003 15:18:31 -0400 (EDT)
Received: from aol.com (host217-40-92-155.in-addr.btopenworld.com [217.40.92.155]) by steiner.cc.vt.edu (Mirapoint Messaging Server MOS 3.3.2-CR) with SMTP id BIE36579; Wed, 07 May 2003 15:18:17 -0400 (EDT)
Date: Thu, 08 May 2003 03:13:26 -0800
From: Kate Welsh
Subject: [SPAM] Remember me?
To: spam@vt.edu
Message-id:
MIME-version: 1.0
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
Importance: Normal
X-Junkmail: UCE(58)
X-Priority: 3
X-Spam-Status: Yes, hits=13.0 required=5.0 tests=ALL_NATURAL,BASE64_ENC_TEXT,BIG_FONT,CLICK_BELOW,CLICK_HERE_LINK,DATE_IN_FUTURE_12_24,HGH,HTML_FONT_COLOR_CYAN,HTML_FONT_COLOR_GRAY,HTML_FONT_COLOR_NAME,HTML_FONT_COLOR_RED,HTML_FONT_COLOR_UNSAFE,HTML_FONT_COLOR_YELLOW,NO_QS_ASKED,RCVD_IN_DSBL,REMOVE_PAGE,SPAM_PHRASE_13_21,SPAM_REDIRECTOR,SUSPICIOUS_RECIPS,USER_AGENT_OUTLOOK,VERY_SUSP_RECIPS version=2.44
X-Spam-Flag: YES
X-Spam-Level: *************
X-Spam-Checker-Version: SpamAssassin 2.44 (1.115.2.24-2003-01-30-exp)
X-Spam-Prev-Content-Type: MULTIPART/MIXED; BOUNDARY="Boundary_(ID_d+Bzp/dF6h/2OkPD89OTbQ)"
Content-Type: text/plain; CHARSET=US-ASCII
X-Evolution-Source: imap://jackie@localhost/
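For anyone who wants to act on that header outside a mail client's rule editor, here is a minimal Python sketch of the "filter for X-Junkmail: UCE" idea. The function name is made up for illustration, and it assumes the scanner marks junk with a value beginning with "UCE", as in the sample above. It uses only the standard library email module.

```python
from email import message_from_string


def is_flagged_junk(raw_message: str) -> bool:
    """Return True if the message carries an X-Junkmail header starting with 'UCE'."""
    msg = message_from_string(raw_message)
    # Header lookup is case-insensitive; a missing header yields the default "".
    junk = msg.get("X-Junkmail", "")
    return junk.strip().upper().startswith("UCE")


# Example with a cut-down version of the headers shown above.
sample = "X-Junkmail: UCE(58)\nSubject: [SPAM] Remember me?\n\nbody text\n"
clean = "Subject: lunch?\n\nbody text\n"
```

This only trusts a verdict that an upstream scanner already wrote into the message, so it costs next to nothing on the client, but it is only as good as that scanner.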
I am, and always will be, an idiot. Karma: Coma (mostly effected by
How about filtering the mail client side?
For Outlook users, I recommend Spammunition, and I just use Mozilla's spam filtering, which works great.
Buttsex.
Use TMDA http://tmda.net/.
I can't say that I don't give a fuck. I've just run out of fuck to give.
Ok, I'm not the email admin here at work; I avoid the whole mail subsystem 'cause I already have enough to do elsewhere.
Anyway, for our 1500 users we use SpamAssassin with RBL and blacklists, and our meager server (PIII 1.26GHz with 512MB RAM) doesn't even reach 0.20. The heuristics are turned down due to the processor usage, but it filters about 90% of the spam with very little load.
I, personally, use Popfile (search Sourceforge) as my personal filter. Its database is not that big right now, just some 8MB with over 200,000 emails since training (from my huge spam database) and normal usage over the past year for me and a dozen other users. Very easy to set up and use, you just need to train it with a good database. Its stats state that it has a 99.85% correctness rate. The machine has reached 0.20 running Popfile during a Monday morning after a long weekend, but the machine is half the other one, a PIII 667MHz with 256MB RAM.
Ash nazg durbatuluk, ash nazg gimbatul Ash nazg thrakatuluk agh burzum-ishi krimpatul
I have had pretty decent results with the spamnet client on our mail clients. They are now using the data that the free clients generate to block at the server. The product is called Authority, or something like that.
I receive about 50-70 spam mails per day, and the client has been blocking 98 percent of them every day. I have been very impressed by it.
See if their server product is appropriate for you. It simply uses a consensus derived list from client users to block messages at the server. Kind of a blacklist thing.
Cuchullain
"If sharing a thing in no way diminishes it, it is not rightly owned if it is not shared." -St. Augustine
Classifying spam is essentially the same problem as classifying programs into those that terminate and those that don't (the halting problem). This leads us to the following conclusions:
1) Filtering spam is not trivial. A program that filters spam X% better than another program will be X^2% more complicated or worse.
2) You can't write a program that will filter perfectly. At best, all you can do is develop a set of heuristics that you hope aren't too complicated. The less complicated the heuristic, the fewer resources it will require.
3) There's a limit to how simple your heuristics can be.
4) The system of spam is not just the message: it's the spammer, plus the message, plus the recipient. This is because a certain message considered spam by some will not be considered spam by others. That means that the heuristics that account for the person reading the spam will be better than those that don't. The source of a spam is also important: a message consisting of a spam report to a spam newsgroup is not a spam, though it may contain a complete spam message.
5) The best spam filters will eventually be AIs that understand human language. That means that the ultimate spam filter will require enough processing power to model human cognitive abilities. In short, you're going to see an endless increase in the number of processor cycles consumed by spam filters, asymptotically approaching the requirements of a full-up human brain simulation.
On the other hand, this will sell a hell of a lot of computers.
If tits were wings it'd be flying around.
I've been thinking about this problem, and its various dead-end solutions (micropayments, rewriting SMTP, strong client/server auth, third-party circles of trust), and have come to the conclusion that none are necessary, or particularly desirable.
I've put together the beginnings of an alternate proposal, which draws on some of the good aspects of the above approaches, without the need to rewrite SMTP. It's a community-based, peer-based approach that leaves the power in the hands of the operator. Plus, there's no profit motive (except that it's in an operator's best interest, and thus the corporate owner's best interest, to maintain his/her server's level of trust).
.@.
I had the same problem.
Was pretty happy with spamassassin, but our mailserver was crumbling under the load.
Switched to bogofilter and, after a training period, we're now getting better accuracy (97.6%) with spam recognition than we did with SpamAssassin, with MUCH reduced server load.
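For readers wondering why a trained statistical filter can be light on the server, here is a toy naive Bayes scorer in Python. To be clear, this is not bogofilter's actual algorithm or code. It is a deliberately simplified sketch with made-up names, whitespace tokenization, and add-one smoothing, meant only to show that classification boils down to a few table lookups and additions per token.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy two-class (spam/ham) naive Bayes over word presence."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}  # messages containing each token
        self.totals = {"spam": 0, "ham": 0}                  # messages trained per class

    def train(self, text: str, label: str) -> None:
        # Count each distinct token once per message.
        self.counts[label].update(set(text.lower().split()))
        self.totals[label] += 1

    def spam_probability(self, text: str) -> float:
        # Start from the smoothed prior odds, then add each token's log-likelihood ratio.
        log_odds = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for tok in set(text.lower().split()):
            p_spam = (self.counts["spam"][tok] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


f = TinyBayes()
f.train("cheap pills click here", "spam")
f.train("click here to win money", "spam")
f.train("meeting notes attached", "ham")
f.train("lunch at noon tomorrow", "ham")
```

After the four training messages above, a message like "click here for cheap pills" scores well above 0.5 and "meeting tomorrow at noon" well below it. Real filters differ in tokenization, how they combine probabilities, and how they store the word database, which is where the accuracy and speed differences come from.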
The Web is like Usenet, but
the elephants are untrained.
Xwall because we are running Exchange. It supports Bayesian filtering as well as MAPS/RBL rejection and virus scanning. Currently I am running only MAPS/RBL and have found it to be very effective with very few false positives. To answer your question regarding the effectiveness of quick checks, I would have to say that in my experience they are effective. I have not stopped 100% of incoming spam, but I would say around 98%, and I feel that is acceptable. Xwall is also cheap, $300.00 USD. Unfortunately it will only run on Windows.
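The reason RBL-style checks are so cheap is that each one is a single DNS query: reverse the octets of the connecting IP, append the blocklist's zone, and see whether the name resolves. A rough Python sketch follows. The zone "dnsbl.example.org" is a placeholder rather than a real list, and the function names are made up for illustration.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up, e.g. 192.0.2.1 -> 1.2.0.192.<zone>."""
    # IPv4Address validates the input and raises ValueError on garbage.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blocklist zone has an A record for this IP."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name (or the lookup failed): treat as not listed.
        return False
```

Note that this sketch treats a failed lookup the same as "not listed", so a DNS outage lets mail through rather than blocking it. A server doing this at SMTP time would answer 550 when is_listed comes back True.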
Tarproxy (http://www.martiansoftware.com/tarproxy) seems like such an excellent solution that I don't know why it's not more visible. I would think that if just a few large mail servers started using it, spam might virtually stop overnight, rendering discussions of the efficiency of filters moot.
I can only think that commercial spam filtering companies are terrified of it, thus are somehow keeping it out of the public eye.
We're currently handling mail for 4k+ accounts using 2 frontend servers running postfix that do all of the filtering and then pass the messages back to our backend mail server.
Our mail system handles (including rejects) ~500k messages a week, so it's by no means a large system. :)
The frontend mail servers are running amavisd-new which is configured to use spamassassin and clamav. You can use DNS RR or just have multiple MX recs to load balance as many of these filtering servers as you need. Our filtering servers are cheap XP2100+s (w/1GB of ram) in a rack mount case that cost us ~$650 each. Amavis is just tagging the message headers with X-Spam and X-Virus headers as necessary.
The backend server is currently sendmail (migrating to postfix+cyrus). Once the migration is complete, our users will be given access to squirrelmail with a modified version of the avelsieve plugin (wizard-like with radio buttons) that will automatically create sieve scripts to drop spam/viruses into their own folders for later examination. We'll then use cyrus's builtin utility to purge those folders (spam/viruses) of messages that are more than X days old to keep disk usage under control.
I've documented a similar setup that I'm using on my home system here. The only difference between the two (work/home) is that on my home system everything is on one box.
I've heard claims that clamav doesn't work well. One of the 2 filtering servers has blocked 12135 viruses between 03/06 and 05/08. That works for me.
Good luck with your project.
[%- PROCESS life -%]
We're seeing spammers pad out the emails so that SA times out and passes them on as legit.
For Outlook users, I recommend Spammunition [upserve.com], and I just use Mozilla's spam filtering, which works great.
Eudora users can use Spamnix. Works like a charm.
The theory of relativity doesn't work right in Arkansas.
SAUCE applies aggressive correctness checks to incoming mail. Works with exim, but apparently could be adapted: http://www.chiark.greenend.org.uk/~ian/sauce/