Proving Which Spam Filters Work Best
pirateninja writes "Dr. Gord Cormack decided to find and prove what the best spam filter is. In his study he looked at the major spam filters (DSPAM, SpamAssassin, etc.) along with those submitted by various academics. The results are quite surprising, with a previously unheard-of spam filter, which uses ideas from various compression algorithms, performing the best overall. He recently presented the results and methodology used in a presentation titled 'Spam Filters, Do they Work? and Can you prove it?'" Note that this is a video of his presentation.
400 Megs that is.......
Isn't there an easier way to display the results, like a chart or something? 400M per file download is a bit extreme.
... the ones which have worked best (for me) are Bayesian Spam Filters (A Plan for Spam, SpamBayes - a free filter) and CRM114 The Controllable Regex Mutilator (Paul Graham mentions it here). I've always had a very high success rate with these.
Vivin Suresh Paliath
http://vivin.net
I like
400MB?
Why not just douse the server in gas if you want to see it melt.
-William Shatner can be neither created nor destroyed.
At work we've set up a combination of SpamAssassin and Spamhaus. Personally I've gone from about 10 spams per day to about 1 every two weeks.
Hey Slashdot, what's up, man! Dude, I read your thing and like totally agree about Best Work Proving Spam Site Work! Dude, that's awesome!
Bro, in the same vein, I was totally checking out this dope ass site which you might wanna check out too man. Guys like us that dig Spam Which Proving and Best work Filters will be all over this before long...
OK, man take care until I see you this Friday at the dinner thing, Slashdot!
Cheers,
John
... they are not allowed to douse the servers in gas.
Help poke pirates in the eyepatch, arr.
So, how are we supposed to RTFA when the FA is over 470MB and a video file? Why not just a nice simple text summary, Mr Submitter? But nooooo, that would just be too easy!
Orationem pulchram non habens, scribo ista linea in lingua Latina
DUL = DialUp List... a bit of a misnomer as it commonly refers to all dynamic hosts. My spam went down dramatically after starting to use Trend's DUL (formerly MAPS). Alas, it's a pay service, but it all comes down to your pain threshold. Mine is low relative to my income.
"I'd rather be a lightning rod than a seismometer." -Ken Kesey
Although I haven't WTFV (watched the video), it doesn't seem surprising that spam filters which use techniques that aren't used widely would be most successful.
If they aren't used widely, it would either be because they don't work, or they do work but they haven't caught on [yet].
It's like any other fad. As an example, when the original Survivor series came out, it was really popular because it achieved its goal (attracting viewers) in a way that was original. Heck, even I watched the original one. Now that all the networks are doing the reality TV thing, it has become hackneyed, and each successive version of survivor does a worse job of achieving its goal. And I've given up watching TV.
With antispam, new techniques are effective, but as they become more popular and more widely used, spammers will find equally innovative ways of getting around them.
I've noticed that at any given time, there will be a particular style of (non-blank) spam that manages to get through Gmail's filters fairly consistently, but every now and then Gmail adapts its spam filters to block the successful spam type of the season, and eventually a new type will make its way through.
- RG>
Hey pal, this isn't a pleasantforest, so don't waste my time with pleasantries!
We use Brightmail on our campus and our users love it for its very low false positive rate and pretty accurate flagging of spam. Another campus uses DSPAM, and some people are up in arms at the prospect of losing their Brightmail to switch to DSPAM. Personally, I find DSPAM isn't nearly as good; it has flagged many legitimate messages and sent them to the Junk folder.
I also echo a gripe of other posters. It's nice to have a video, but a 500MB video file is a bit much. A 50KB pie chart or bar graph would have been nice.
The spammers actively try to subvert the more popular filters. That gives a lesser-known one a decided advantage, one which will go away as it becomes more popular.
As with most choices like this, factors such as ease of use, speed, and resource efficiency can overshadow selectivity. No system is perfect, so it's perfectly reasonable to go with a system that's pretty good if you already are using it, rather than switching to the latest cool thing.
I have found that using two dissimilar systems in a chain is quite effective.
Raise your children as if you were teaching them to raise your grandchildren, because you are.
It may not be coincidence that a little-known filter algorithm produces the best results; many spammers probably test their spew on the more popular filters to try and fool them. If this new filter becomes more popular you may see its reliability decay.
whitelist
Against viruses and spam. For obvious reasons - hackers and spammers put their efforts into circumnavigating the major systems since this will maximize the impact of their work. That's why smaller anti-suckware products will often do a better job since the focus isn't on them.
:p
It leads to a sad but inevitable cycle of products being improved, gaining popularity, then losing their effectiveness since they are now a bigger target.
At least, until a watertight (rather than guess-work) solution is found. I believe this is impossible without changing the way email works at a fundamental level. Even the much praised challenge-response is subject to email spoofing.
Reminds me of why I like living in Australia - globally speaking we're relatively irrelevant, making us a relatively small target. Hopefully we'll stay relatively irrelevant, lol
I uuencoded the video file, translated it into Sumerian cuneiform, and pressed it into a billion little clay tablets. They are cooking in my oven right now. Now, the Internet is NOT some kind of truck you can just dump stuff onto, so if you want to get the data you're going to have to come to my house.
Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
Hey _vSyncBomb,
Having trouble pleasing your woman? I've got something Very Interesting And Generally Really Amusing that you could try!!!
Your buddy,
_vAnoymousCoward
And I printed out every frame so I could scan them. I'll be posting the TIFFs on my website shortly...
Yes, I do get a lot of spam dealing with asses of the Asian variety. Luckily, most of it is tagged as such by Gmail's filter.
I see that the organization is not authorised to host a torrent. Would it be possible for someone who has downloaded the video to put one up somewhere? I'd be interested to see what kind of speed we would get out of a /. torrent too...
Postini.
"To err is human, to mod Funny divine."
I never used email because of the spam problem and the rampant use of IMs, but since I started using G-Mail I never get spam in my inbox, and my instant-message time has dropped 70%, I'd say. Whatever G-Mail uses is the one I would use if I were using a client to download my emails.
All someone needs to do is rig this video up to the wonderful Microsoft Voice Recognition software, and then post the resultant transcript. Surely it won't have that many errors...
The official tests of spamfilters were done in last year's TREC conference, you can read the writeup here (or pdf overview).
You can duplicate those tests yourself if you download the evaluation toolkit (GPL). It's a modular system where you can add a mail corpus (either one of the public TREC ones, or you can make your own trivially), and add a spamfilter package (there are 10 or so to download from the web, or create your own as per documentation).
There's also a video talk given at Microsoft research which should cover pretty much the same ground, if text mode is slashdotted :).
There's a new scheduled test towards the end of the year at TREC 2006.
SPAM FILTERS WORK
important filters - SPAM
Download Spam Filters in a number of formats: ,XviD(473M) ,DiVX(473M) SEX ,MPG(472M) ,OGG/Theora(481M) ,Real Media(471M) ,WIN ,Windows Media(476M) ,FREE ,SEX ,WIN
BUY SPAM FILTERS
Gord Cormack talk about the science, logistics, and politics of Spam Filter Evaluation.
I think it's trying to communicate with us...
Game... blouses.
Is there any filter that doesn't give false positives? I don't mean "almost none", I mean zero. It isn't a matter of "holding out for perfect". Some of us simply can't afford to have a key email discarded as "spam".
44% [===============> ] 220,996,832 89.21K/s ETA 48:16
almost got it!
Dear Slashdot,
At the university where I work, they have recently adopted a pesky policy banning the use of bitTorrent.
What can I do to fix this ?
Yours faithfully,
Dr. Gord Cormack
Survivor didn't spawn the current reality TV craze -- Who Wants to be a Millionaire did. (Though Survivor was already in development).
A 400MB video file? Is this a joke? WTF is everyone thinking, that everything on the web needs to be on video all of a sudden? I just blogged about this today: http://www.anotherblogger.com/2006/08/02/please-no-more-gratuitous-videoblogging/
Spam about asian donkeys is a new one on me, though.
Martin Brooks / Slayer99 #linux / UIN 2178117
Thanks for blogging about it... but did it really have to be a video blog?
(just kidding)
Whoever stated that signature sizes should be limited to one hundred and twenty characters can just go ahead and kiss my
So... um... I really don't want to wait 8 hours or more to find out which mysterious and generally "unheard of" spam filter performed the best. Does anybody know where a text version of the results can be found?
I use the built-in spam filter in Exchange 2k3 set to level 8. All "filtered" e-mails are archived. I get maybe 3 or 4 a day (on a "bad" day) that make it through. Once a week (or more if I can be bothered) I view the archive, send on any that aren't spam (<1%), and junk those that are. I do this using a little tool I wrote that displays the From, To and Subject of all these e-mails. If I can't tell from these fields whether an e-mail is spam or not (and it generally is anyway) then I can view the contents of the .eml file.
P**s easy, effective and "Free".
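The poster's tool isn't shown, so the following is only a minimal sketch of the same idea in Python, assuming the archived messages sit in one folder as .eml files. The function names and the folder layout are made up for illustration; this is not the Exchange tool described above.

```python
from email import policy
from email.parser import BytesParser
from pathlib import Path


def summarize_eml(path):
    """Return (From, To, Subject) for one .eml file, parsing headers only."""
    with open(path, "rb") as fh:
        msg = BytesParser(policy=policy.default).parse(fh, headersonly=True)
    # Missing headers come back as empty strings rather than None.
    return (
        str(msg.get("From", "")),
        str(msg.get("To", "")),
        str(msg.get("Subject", "")),
    )


def list_quarantine(folder):
    """Yield (filename, From, To, Subject) for every .eml file in folder."""
    for path in sorted(Path(folder).glob("*.eml")):
        yield (path.name, *summarize_eml(path))
```

Printing the tuples from `list_quarantine("archive")` gives the one-line-per-message view that makes a weekly skim quick; anything ambiguous can still be opened by filename.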
dnuof eruc rof aixelsid
Does anyone else find it mildly amusing that U.W., one of the top tech schools in North America, due to their regressive policy disallowing the use of torrents, now has a server getting a proper slashdotting?
Anyone care to post a link?
Why exactly should we give any weight to anything from an organization so ignorant as to disallow BitTorrent? It takes someone pretty darn ignorant to disallow a protocol because some use it to transport illegal content. Why haven't they banned TCP? It is an evil technology used every day to violate copyright.
This guy should spend his time educating the fools at his institution.
Gordon Cormack and Thomas Lynam
Full Text, May 29, 2006 - PDF Format
http://plg.uwaterloo.ca/~gvcormac/spamcormack.html/
The only thing new in this world is the history that you don't know.[Harry Truman]
In my office, the IT department is so cool they implemented the best spam filter ever (when the Email server is up): Manual Filtering. It's awesome. Some of us can trash all of our spam before we even read it by carefully reviewing the subject line and sender. We never have false positives, so we don't miss anything. Granted, most people spend 3+ hours a day Emailing, but it's OK. We filter out all spam, never miss anything. Some people even collect spam instead of junking it.
Funny though, we keep buying a particular brand of hard drives for Email storage in our servers. The IT guys keep talking about their sea gate retirement plans. Good to see they want to spend their late years on sunny Mexican beaches.
I use greylisting (gld to be specific) which works wonderfully. A couple of customers wanted even better filtering...
First I tried DSPAM, but they refused to train it so the results weren't good. Then I tried SpamAssassin, which also let through a surprising amount of spam - a lot more than my personal account on Gmail.
So I set up accounts on Gmail for them, and forwarded their mail to those accounts (after greylisting - don't want to burden GMail too much!). Gmail lets you set up forwarding, so I simply forwarded all the filtered mail back to a second account on my mailserver for the customer to pick up. Finally I wrote a python script that logs in to Gmail once a week to prevent the account being closed due to non-use.
A tad involved, but it works like a dream. Yet again Google comes out on top, this time in a market it doesn't even know it's in!
Do as you would be done to.
So which one is the "unheard of spam filter?"
Wouldn't it make sense to put this in the /. submission (or at least a link)?
Did I miss the obvious "and the winner is..." some place?
I have to push this as it usually gets missed from reviews as it's a hybrid P2P solution and not a straightforward filter, but Cloudmark's safetybar product (http://www.cloudmark.com/) is just about perfect for me. I get an average of about 20 spam emails a day and it has a false positive result of 0% and has had for months. In fact I've been using the product for several years now and I think the last time I saw a false positive was a couple of years back.
On the efficiency side it has a hit rate of nearly 100%. I would have said it was 100% a couple of months back, but just recently it's been having a bit of a problem with one stock-pushing spam.
Anyway, that aside it's the best spam filter I've ever seen by a very long way, and I'd highly recommend the service. It costs a few $ a month, but it's probably the best value subscription I have.
I have no connection with the company; I'm just a very satisfied customer who's been using it since the beta some years ago. I have a publicly available email address which I've had for years and which must be on many spam lists. Without Cloudmark it would be unusable; with it, it's no problem at all. I recently installed it for my wife, who was starting to get a lot of spam. On her machine it took about two weeks to train it not to junk a few mailing list emails she was on, but since then it's been just as reliable as my installation.
IMHO, the criterion for best spam filter is very simple. It is the filter that is able to consistently maintain the highest spam to false positive ratio.
:D
Feel free to add to it.
Sometimes I wish I was a plumber, then I'd know how to deal with other people's shit.
The most effective way I have found to stop spam is greylisting. In the last two months, I have had zero spam messages go through to my mail server. I use GLST (http://www.xmailserver.org/glst-mod.html), which is mostly for the XMail mail server (http://www.xmailserver.org/) but will work anywhere.
You should also check this article http://www.freesoftwaremagazine.com/articles/focus_spam_postfix?page=0%2C0, lots and lots of good advice on spam filtering.
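For anyone who hasn't met greylisting: the mail server temporarily refuses the first delivery attempt from an unknown (client IP, sender, recipient) triplet and accepts a retry once a minimum delay has passed. A lot of spam software never retries. Below is a toy sketch of that decision in Python; the class name, the 300-second delay and the in-memory dict are assumptions for illustration, and this is not how GLST itself is implemented.

```python
import time


class Greylist:
    """Toy greylist keyed on the (client IP, sender, recipient) triplet."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a new triplet must wait (assumed value)
        self.clock = clock          # injectable so the logic can be tested
        self.first_seen = {}        # triplet -> time of its first attempt

    def check(self, client_ip, sender, recipient):
        """Return 'accept' or 'tempfail' for one delivery attempt."""
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Remember when this triplet was first seen; keep the old time on retries.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        # A real server would answer with a 4xx temporary SMTP error here.
        return "tempfail"
```

The same logic shows where the complaints about lost mail come from: a legitimate sender that never retries is treated exactly like a spammer.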
This paper's a complete waste of time.
He tested spamassassin 2.3 - that's ancient! I'd imagine the other tools are similarly obsolete.
We currently use SA 3.1.4 with a well-trained Bayes database and Razor, Pyzor, and DCC.
Throw in a few custom rules and a selection of rules from http://www.rulesemporium.com/ and the results are outstanding.
With the new sa-update feature the core rules are updated between point releases, which came in useful this week dealing with the new image spams which seemed to be designed to avoid detection by spamassassin. Thanks Theo.
And the folk on the spamassassin-users mailing list really rock.
I ran your message through a perl script to mail it to me for giggles (I do research on spam filtering at ye olde day job). Regretfully, you didn't make it through. Aside from header garbage, which was a mixed bag (half spam tokens, half "known-good automated email" tokens), you ran into problems with dope, ass, wanna, and... work*. Which is just as well, as I have no desire to speak to anyone who uses those words. * Last 15 occurrences in my mailbox are all of the "Make l0ads of $$$ work @ h0m3!" variety.
Help poke pirates in the eyepatch, arr.
How much spam do you get a day? I get hundreds. Half of them are not in my native language (much like half the mail in my inbox), which means it takes more than a split-second glance to figure out what is going on. I'd guess my accuracy in split-second decisions is probably on the order of 95%, which if I were a spam filter would earn me a D-. Paul Graham, who probably has more typical email habits when compared with the average Slashdotter, says he misses about 3 per 2,000. http://www.paulgraham.com/wsy.html There are systems which are better than that.
In Soviet Spam Filter, the computer doesn't trust YOU to filter the email.
Help poke pirates in the eyepatch, arr.
By the time that I have downloaded the video the war will have moved on a couple of iterations ...
Why do spammers even bother to try to get around spam filters? If someone is actively blocking spam, it stands to reason that they are the least likely to buy any of your HerB0l Vi.aGra anyway, so what's the point in attempting to get into their inbox?
Cress, cress, lovely lovely cress
It looks like another win for compression algorithms. Not only do they maximize entropy in your data while shortening it, they can also be used successfully to earmark pieces of text as being written in a certain language, or written by a certain author, and now they can be used for spam detection. The usefulness just keeps on coming. Colour me impressed.
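To make the idea concrete, here is a rough sketch of compression-based classification in Python: a message is assigned to whichever corpus it compresses best against, measured by how many extra bytes it adds. It uses zlib purely as a stand-in; the filter in the talk is not described in this thread, and nothing here should be read as its actual method. The function names are made up.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Label message by which corpus lets it compress into fewer extra bytes."""
    msg = message.encode("utf-8")
    spam = spam_corpus.encode("utf-8")
    ham = ham_corpus.encode("utf-8")
    # Extra bytes needed to encode the message after each corpus.
    # Note: zlib only looks back 32 KB, so this toy works for small corpora only.
    spam_cost = compressed_size(spam + b"\n" + msg) - compressed_size(spam)
    ham_cost = compressed_size(ham + b"\n" + msg) - compressed_size(ham)
    return "spam" if spam_cost < ham_cost else "ham"
```

The intuition is that text resembling the corpus reuses strings the compressor has already seen, so it costs fewer bytes. Ties go to "ham", on the theory that losing real mail is worse than letting one spam through.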
Religion is what happens when nature strikes and groupthink goes wrong.
dont use spam filters and we reply to every damn email we get no matter what
Clearly that's the new fork of SpamAssassin that ensures only Vi4gra, penis-enlargement pills and "meet h0T n4k3d t33n s1uts" invitations get through...
Everything in moderation, including moderation itself
I haven't watched the video (it's still downloading), but after reading some of the comments it seems that spammers try to circumvent the most popular spam blockers. So after watching this video, if the best spam blocker becomes the most popular, won't that then make it less effective? Dammit. If that happens I'll have to waste another day downloading the next video.
You want fun, go home and buy a monkey!
And if it started getting worse you could move to tassie and get that feeling of irrelevancy back.
http://michaelsmith.id.au
Here is a torrent I made of the xvid file. It should work (I hope).
See here
The key paragraph:
If you'd like a more topical example, consider "spam". People began altering their e-mail "From:" lines in order to make their addresses harder to guess or aggregate; people began doing pattern matching in order to catch known-bad messages and either sideline or reject them. Many defenders used many small tricks to protect their inboxes. The result has not been that less spam is sent or even that less spam is received, on an aggregate basis. Things are worse now than they've ever been. (I say this as co-founder of MAPS LLC, by which I hope to establish my credentials in the spam field for those of you who do not know me.) Today a small number of highly advanced defenders is spam-immune only because they are a small number and their techniques are not widely effective against the attackers; and a small number of highly advanced attackers can "spam at will" a far larger population than ever before. And the trend is that things are getting worse, and getting worse faster than ever before.
I can throw myself at the ground, and miss.
INFORMATIVE? Mod the parent FUNNY, please.
I receive (no kidding) around 600 spam mails per day, versus approximately 30 real e-mails. I've been using dspam for over a year now (with very faithful training), and there is maybe 1 false positive every few weeks (less than 1 in 10,000), and every few days a few (usually "new") spam mails get through, which I of course immediately train on, never to see that kind again. So I am very, very positive about dspam. What I do miss, though, is a good and reliable service (better than the RBLs I know) that can block SMTP clients on the fly (like DSL home users and such) to reduce the immense load on our mailservers (I work for an ISP) caused by all the spam (which also has to go through a virus scanner, clamav).
I have the impression the Java mail program has languished a bit and I haven't used it for years, but the best spam filter I've used I built up myself using their filtering capabilities. You didn't just "add" filter criteria. You could link them together in "AND", "OR" and "XOR".
I don't know what they use or how they do it, but my main email address is with Yahoo and they seem to have solved the problem.
Each month I get maybe 800 odd spam emails and a dozen or so real emails.
Once every month or so I get an email in my inbox which is spam and I click the 'this is spam' button. About September last year was the last time a real email ended up the bulk (spam) email folder. I used to check and delete the spam emails every couple of days, but now just let it build up and be deleted by the 30 day time out.
Obviously a web-based account is not quite as convenient as a local account, but it seems to handle the spam onslaught and still be useful.
Sadly, the way this was done, there is no way to test how well Greylisting would have helped.
IMarv
Trusting software vendors is no smarter than trus
I would like to add my voice to that of the original poster. Brightmail is remarkably good at eliminating spam, and I do not know of any false positives in the years I have used it. (and yes, I have the habit of doing a quick eyeball scan of my spam folder before dumping it)
RTFM; please, I beg you.
I don't want to read your blog ... could you make a video?
There's only 15 or so people using the Exchange server I set up, but it still gets around 150 spams a day. I installed GFI MailEssentials; it works ok, but lately the amount of spam getting through is increasing - I would say at least 2 or 3 per user per day. I get especially angry at all the Cialis/Viagra e-mails... they could at least throw in some female-targeted spams once in a while! Do they assume if you use e-mail that you're a man?? Or that women are already perfect?? LOL
99% means 3-4 junk emails in my inbox every day. That ain't so good.
I don't know about you, but these new anti-spam measures are starting to scare me...
Typos... that's just how I role.
Personally, I can't accept unpredictable delays in my email, so I have opted out of greylisting. Also greylisting has a non-zero and very hard-to-measure false positive rate.
at least the download is quite slow. Who wouldn't want to wait 4 hours for the amazing revelation that you have to train your filter well or you will get false positives.
Gmail, I suspect, is taking a brute-force approach to classifying e-mail as spam: if a large number of hundreds of thousands (millions?) of users say it's spam using the Report Spam function, it probably is.
-- Old Man Kensey
Your model is flawed. I used to do the same thing, but EVENTUALLY that one 'private' address WILL escape into the wild, and then you:
:)
(a) are fscked, or
(b) must create a NEW address to keep private (and cross our fingers again).
For me, my 'private' address is "@.", so creating a new one is not a valid option (being dave2205 is okay on Hotmail, but not on the family domain...).
Add to that the fact that I frequently access my email from different computers (locations). Using IMAP and webmail is a must, and while our host does use some form of spam filter it's nowhere near as good as a well-trained Bayesian.
It's now so bad that I've all but given up on using 'alias addresses' and just give everyone my once-private address. That would rid me of the hassle of managing the aliases at the expense of presumably only slightly more spam.
Unless you have a better idea.
"Good news, everyone!"
I'm sure that if there were a pressing need to use BitTorrent for something academic that could not easily be done any other way, an exception would be made (but the only thing I can think of that would fall into this category would be research on BitTorrent itself).
-- Old Man Kensey
My 'private' address is "FirstName@LastName.TLD".
Forgot about using GT and LT signs...
"Good news, everyone!"
Here are the slides from the 400MB video presentation.
Feh, I scoff at your breakable clay tablets. If you want durability, you can't do better than spreading ochre on the walls of a cave. Cave paintings have lasted for tens of thousands of years!
The only spam filter I have ever used that doesn't seem to degrade significantly over time is Cloudmark SpamNet (they renamed it to Desktop or something). Every other filter I used got progressively worse. Don't know how they do it, but highly recommended.
My company bought a Barracuda a year and a half ago, it worked great for almost a year. 99.9% of all spam got stopped by it, but in the last 6 months, more and more spam just cruises right thru it. Barracuda's vendor-supplied filter rules have become totally ineffective now, it's almost like they've lost all their talent and ability to create new rule sets in their periodic updates that are able to counter the spammers' latest tactics.
What I imagine would be a good spam filter, would be like this:
First, run through a spellcheck and grammarcheck with a fairly lax spellcheck (we all make mistakes, but not all the time). And filter out anything but Norwegian, Danish, Swedish, French and English text. That oughta kill a good 60% of my spam. Next, some technology to kill the image-only spams (checksums? content likeness to known spam?). Then, run through a bayesian filter (or some such technology).
Now, I just need a good spellcheck/grammarcheck library etc, and then maybe I can beat the spammers for a good while.
What do you guys think? Should I spend some quality time with Perl?
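For what it's worth, the staged idea above could be sketched roughly like this, in Python rather than Perl. Everything named here is hypothetical: the stopword list is a crude placeholder for a real language check, the spelling/grammar stage is left out entirely, exact checksums only catch byte-identical images, and `bayes_score` stands for whatever trained filter you plug in.

```python
import hashlib
import re

# Tiny hypothetical sample of common words from the accepted languages
# (English, French, Norwegian, Danish, Swedish). A real check needs far more.
ACCEPTED_STOPWORDS = {
    "the", "and", "is", "you", "of",        # English
    "le", "la", "et", "est", "vous",        # French
    "og", "ikke", "jeg", "det", "er",       # Norwegian / Danish
    "och", "inte", "jag", "är",             # Swedish
}


def looks_like_accepted_language(text, min_hits=2):
    """Crude check: does the text contain enough known common words?"""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return sum(1 for w in words if w in ACCEPTED_STOPWORDS) >= min_hits


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def filter_message(text, images, known_hashes, bayes_score=None, threshold=0.9):
    """Run the cheap stages first; return 'accept' or a 'reject: ...' reason."""
    if not looks_like_accepted_language(text):
        return "reject: language"
    if any(sha256_hex(img) in known_hashes for img in images):
        return "reject: known image"
    if bayes_score is not None and bayes_score(text) >= threshold:
        return "reject: bayes"
    return "accept"
```

Ordering the stages from cheapest to most expensive means the statistical filter only sees what the simple checks could not decide.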
Stop the brainwash
Has greylisting been used more widely than just by pair.com?
The problem with greylisting is not as much the delay, but the messages that never get delivered. I've missed two or three important emails over the last year (i.e. whose absence I later noticed: a renewal notification from a domain registrar, etc.) because of greylisting.
17779 eligible voters in a district, 17779 'vote' as one. This is Russia.
...Spam Filters test... you. :)
Actually, yes -- 8" floppies hold up really well and I hardly ever have trouble reading them. Those 3.5" things are another story though. And QIC-40/QIC-80 tapes didn't really work even when they were new. We may have already peaked when it comes to reliable data storage.
Most of the spam that does get through is the "Poetry Spam". Excerpts from The Bible, Harry Potter, Playboy Forum, etc. have been dumped into my mail server, and some have gotten through because they had some "legitimate" text, but the attached images (most often GIFs) are the ones carrying viruses and/or "mal-links".
hey all,
this spam really needs a perm solution. other than hacking people's white lists i thought having a system similar to earthlink's "proove yer a real person by typing in a random code" would be good. an even simpler idea would be click the correct pic (from 8-9 pics to choose from) and then it would finally send to their email.
send a mail, bounces back w/ an html to click/type into, gets into their inbox. i'm surprised it doesn't yet exist.
i'd give up newsletters in order to have something like this.
analysing email is a waste as spammers find ways around it. if the pics could be changed up randomly or the code to enter changes then it makes their work much harder to get any spam in.
if i got 1-2/day i wouldn't mind but gmail drops 50 in my spam folder (which of course most skim through anyways).
if this does exist please drop me a line and let me know. i'd love to use it...
bluetigerbc
at gmail dot com
you can use automatically created disposable e-mail addresses.
Fast forward to 39:42 into the movie to see his rankings.
Here's what I saw (YMMV):
1) bogofilter
2) ijsSPAM2
3) spamprobe
4) spamasas-b (learning only)
5) crmSPAM3 (1:40 ham eaten)
Of course, he immediately showed other views of the data and had different rankings. Basically, you need to decide how much real email you are willing to lose to fight **any** spam getting in.
After sitting through the full 58 minutes I was truly disappointed that Dr. Cormack was still pacing back and forth. The style and delivery of the presentation was truly horrid. 58 minutes of his pontification was enough. I still need to sort through the "data" he presented in a very unclear manner to see if it makes sense. Too much "off the top of the head" and not enough deliberate, directed information.
CanIt works a charm for me. It's free ( beer free ) for 50 users, and uses open-source tools to get the job done. I used to get 30 - 50 spam messages per day ( and this was years back, before there was so much spam ). I might get 2 per week now, and the bayesian filter learns from experience, so whatever comes in at least helps you block more of the same stuff.
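For readers wondering what "learns from experience" means for a Bayesian filter, here is a minimal naive-Bayes-style sketch in Python with add-one smoothing. It is an illustration only: the class and its token rule are made up, and it is neither CanIt's implementation nor the exact formula from A Plan for Spam.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy trainable spam scorer: per-token message counts plus add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}  # token -> messages containing it
        self.messages = {"spam": 0, "ham": 0}                # messages seen per label

    @staticmethod
    def tokens(text):
        # Each distinct token counts once per message.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text, label):
        """Record one message as 'spam' or 'ham'."""
        self.counts[label].update(self.tokens(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Combine smoothed per-token likelihood ratios into a probability."""
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in self.tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Numerically stable logistic function.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1 + e)
```

Every message you mark as spam or ham goes through `train`, which is why a spam that slips through once makes the next similar one more likely to be caught.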