Proving Which Spam Filters work Best
pirateninja writes "Dr. Gord Cormack decided to find and prove what the best spam filter is. In his study he looked at the major spam filters (DSPAM, SpamAssassin, etc.) along with those submitted by various academics. The results are quite surprising, with a previously unheard-of spam filter, which uses ideas from various compression algorithms, performing the best overall. He recently presented the results and methodology used in a presentation titled 'Spam Filters, Do they Work? and Can you prove it?'" Note that this is a video of his presentation.
Isn't there an easier way to display the results, liek a chart or something. 400M per file download is a bit extream.
... the ones which have worked best (for me) are Bayesian Spam Filters (A Plan for Spam, SpamBayes - a free filter) and CRM114 The Controllable Regex Mutilator (Paul Graham mentions it here). I've always had a very high success rate with these.
Vivin Suresh Paliath
http://vivin.net
I like
400MB?
Why not just douse the server in gas if you want to see it melt.
-William Shatner can be neither created nor destroyed.
At work we've set up a combination of SpamAssassin and Spamhaus. Personally I've went from about 10 spams per day to about 1 every two weeks.
Hey Slashdot, what's up, man! Dude, I read your thing and like totally agree about Best Work Proving Spam Site Work! Dude, that's awesome!
Bro, in the same vein, I was totally checking out this dope ass site which you might wanna check out too man. Guys like us that dig Spam Which Proving and Best work Filters will be all over this before long...
OK, man take care until I see you this Friday at the dinner thing, Slashdot!
Cheers,
John
... they are not allowed to douse the servers in gas.
Help poke pirates in the eyepatch, arr.
So, how are we supposed to RTFA then the FA is over 470MB and a video file. Why not just a nice simple text summary Mr Submitter, but nooooo that would just be too easy!
Orationem pulchram non habens, scribo ista linea in lingua Latina
Although I haven't WTFV (watched the video), it doesn't seem surprising that spam filters which use techniques that aren't used widely would be most successful.
If they aren't used widely, it would either be because they don't work, or they do work but they haven't caught on [yet].
It's like any other fad. As an example, when the original Survivor series came out, it was really popular because it achieved its goal (attracting viewers) in a way that was original. Heck, even I watched the original one. Now that all the networks are doing the reality TV thing, it has become hackneyed, and each successive version of survivor does a worse job of achieving its goal. And I've given up watching TV.
With antispam, new techniques are effective, but as they become more popular and more widely used, spammers will find equally innovative ways of getting around them.
I've noticed that at any given time, there will be a particular style of (non-blank) spam that manages to get through Gmail's filters fairly consistently, but every now and then Gmail adapts its spam filters to block the successful spam type of the season, and eventually a new type will make its way through.
- RG>
Hey pal, this isn't a pleasantforest, so don't waste my time with pleasantries!
We use Brightmail on our campus and our users love it with its very low false positive and pretty accurate flagging of SPAM. Another campus uses DSPAM and some people are up in arms at the prospect of losing their Brightmail to switch to DSPAM. Personally, DSPAM isn't nearly as good and has flagged many legitamate messages and sent them to the Junk folder.
I also echo a gripe of other posters. Its nice to have a video but 500MB video file it a bit much. A 50KB pie chart or bar graph would have been nice.
The spammers actively try to subvert the more popular filters. That gives a lesser-known one a decided advantage, one which will go away as it becomes more popular.
As with most choices like this, factors such as ease of use, speed, and resource efficiency can overshadow selectivity. No system is perfect, so it's perfectly reasonable to go with a system that's pretty good if you already are using it, rather than switching to the latest cool thing.
I have found that using two dissimilar systems in a chain is quite effective.
Raise your children as if you were teaching them to raise your grandchildren, because you are.
I uuencoded the video file, translated it into Sumerian cuneiform, and pressed it into a billion little clay tablets. They are cooking in my oven right now. Now, the Internet is NOT some kind of truck you can just dump stuff onto, so if you want to get the data you're going to have to come to my house.
Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
Hey _vSyncBomb,
Having trouble pleasing your woman? I've got something Very Interesting And Generally Really Amusing that you could try!!!
Your buddy,
_vAnoymousCoward
And I printed out every frame so I could scan them. I'll be posting the TIFFs on my website shortly...
I wonder how hard it would be for Slashdot/OSTG to host a tracker for large, article-related files like this. I don't think it would require a lot of funding to run, and it would certainly help with convention presentation videos.
The official tests of spamfilters were done in last year's TREC conference, you can read the writeup here (or pdf overview).
You can duplicate those tests yourself if you download the evaluation toolkit (GPL). It's a modular system where you can add a mail corpus (either one of the public TREC ones, or you can make your own trivially), and add a spamfilter package (there are 10 or so to download from the web, or create your own as per documentation).
There's also a video talk given at Microsoft research which should cover pretty much the same ground, if text mode is slashdotted :).
There's a new scheduled test towards the end of the year at TREC 2006.
There is no classification system with zero real risk, except for delivering all mail to the Inbox. Sorry.
If your mail is that important, you should be using couriers instead of email.
Dear Slashdot,
At the university where I work, they have recently adopted a pesky policy banning the use of bitTorrent.
What can I do to fix this ?
Yours faithfully,
Dr. Gord Cormack
I hope you also have another word, because the Postini service is incredibly bad. I had it enabled on my account at acm.org, and the Postini system was generating roughly one false positive for every 10 true positives. I disabled the Postini filtering and started using Spamassassin. Both the false positive and false negative rates are much improved. Among the traffic that Postini was flagging as spam were the Wikipedia article of the day, my daily email from musicbrainz.org, all messages to the BATN mailing list, many replies to my items for sale on craigslist, and other kinds of completely legitimate traffic. Among the mail they chose to deliver were messages in Korean, Cyrillic, other scripts I can't read, and known viruses.
Their main problem is the system doesn't learn. Using their web interface, I look through the spam folder and request delivery of all the false positives. The next day, nearly-identical mails are still generating false positives. You'd think it would be easy these days to design a filter that learns from negative reinforcement.
Spam about asian donkeys is a new one on me, though.
Martin Brooks / Slayer99 #linux / UIN 2178117
Why exactly should be give any weight to anything from and organization so ignorant as to disallow bittorrent? I take someone pretty darn ignorant to disallow a protocol because some use it to transport illegal content. Why havn't then banned TCP? It is an evil technology used every day to violate copyright.
This guy should spend his time educating the fools at his institution.
Gordon Cormack and Thomas Lynam
Full Text, May 29, 2006 - PDF Format
http://plg.uwaterloo.ca/~gvcormac/spamcormack.html /
The only thing new in this world is the history that you don't know.[Harry Truman]
I use greylisting (gld to be specific) which works wonderfully. A couple of customers wanted even better filtering...
First I tried DSPAM, but they refused to train it so the results weren't good. Then I tried Spam Assasin, which also let through a suprising amount of spam - a lot more than my personal account on Gmail.
So I set up accounts on Gmail for them, and forwarded their mail to those accounts (after greylisting - don't want to burden GMail too much!). Gmail lets you set up forwarding, so I simply forwarded all the filtered mail back to a second account on my mailserver for the customer to pick up. Finally I wrote a python script that logs in to Gmail once a week to prevent the account being closed due to non-use.
A tad involved, but it works like a dream. Yet again Google comes out on top, this time in a market it doesn't even know it's in!
Do as you would be done to.
So which one is the "unheard of spam filter?"
/. submission (or at least a link).
Wouldn't it make sense to put this in the
Did I miss the obvious "and the winner is..." some place?
I have to push this as it usually gets missed from reviews as it's a hybrid P2P solution and not a straightforward filter, but Cloudmark's safetybar product (http://www.cloudmark.com/) is just about perfect for me. I get an average of about 20 spam emails a day and it has a false positive result of 0% and has had for months. In fact I've been using the product for several years now and I think the last time I saw a false positive was a couple of years back.
On the efficiency side it has a hit rate of nearly 100%. I would have said it was 100% a couple of months back, but just recently it's been having a bit of a problem with one stock-pushing spam.
Anyway, that aside it's the best spam filter I've ever seen by a very long way, and I'd highly recommend the service. It costs a few $ a month, but it's probably the best value subscription I have.
I have no connection with the company, just a very satisfied customer who's been using it since the beta some years ago. I have a publically available email address which I've had for years and must be on many spam lists, without Cloudmark it would be unusable, with it it's no problem at all. I recently installed it for my wife who was starting to get a lot of spam - on that I noticed it took about two weeks to get it trained not to junk a few mailing list emails she was on, but after that it's been just as highly reliable as my installation.
This paper's a complete waste of time.
He tested spamassassin 2.3 - that's ancient! I'd imagine the other tools are similarly obsolete.
We currently use SA 3.1.4 with a well-trained Bayes database and Razor, Pyzor, and DCC.
Throw in a few custom rules and a selection of rules from http://www.rulesemporium.com/ and the results are outstanding.
With the new sa-update feature the core rules are updated between point releases, which came in useful this week dealing with the new image spams which seemed to be designed to avoid detection by spamassassin. Thanks Theo.
And the folk on the spamassassin-users mailing list really rock.
I ran your message through a perl script to mail it to me for giggles (I do research on spam filtering at ye olde day job). Regretfully, you didn't make it through. Aside from header garbage, which was a mixed bag (half spam tokens, half "known-good automated email" tokens), you ran into problems with dope, ass, wanna, and... work*. Which is just as well, as I have no desire to speak to anyone who uses those words. * Last 15 occurrences in my mailbox are all of the "Make l0ads of $$$ work @ h0m3!" variety.
Help poke pirates in the eyepatch, arr.
However, note that we are talking about two separate scenarios:
- a home server for an user with no responsibilities
- a project/ISP-wide mail server
In the former, delaying mail for weeks may be acceptable -- but even then, I wouldn't touch something with a 1:500 false positive ratio with a long stick.The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
By the time that I have downloaded the video the war will have moved on a couple of iterations ...
It looks like another win for compression algorithms. Not only do they maximize entropy in your data while shortening it, they can also be used successfully to earmark pieces of text as being written in a certain language, or written by a certain author, and now they can be used for spam detection. The usefullness just keeps on coming. Colour me impressed.
Religion is what happens when nature strikes and groupthink goes wrong.
Here is a torrent I made of the xvid file. It should work (I hope).
Vivin Suresh Paliath
http://vivin.net
I like
See here
The key paragraph:
If you'd like a more topical example, consider "spam". People began altering their e-mail "From:" lines in order to make their addresses harder to guess or aggregate; people began doing pattern matching in order to catch known-bad messages and either sideline or reject them. Many defenders used many small tricks to protect their inboxes. The result has not been that less spam is sent or even that less spam is received, on an aggregate basis. Things are worse now than they've ever been. (I say this as co-founder of MAPS LLC, by which I hope to establish my credentials in the spam field for those of you who do not know me.) Today a small number of highly advanced defenders is spam-immune only because they are a small number and their techniques are not widely effective against the attackers; and a small number of highly advanced attackers can "spam at will" a far larger population than ever before. And the trend is that things are getting worse, and getting worse faster than ever before.
I can throw myself at the ground, and miss.
I receive (no kidding) around 600 spam mails per day, versus approximayely 30 real e-mails. I've been using dspam for over a year now (with very faithful training), and there is maybe 1 false positive every few weeks (less than 1 in 10.000) and every few days a few (usually "new") spam mails get through, which I ofcourse immediately train, to never see those kind again. So I am very very positive about dspam. What I do miss though is something like a good and reliable service (better than the RBL's I know) that can block SMTP clients on the fly (like DSL home users and such) to reduce the immense load on our mailservers (I work for an ISP) caused by all the spam (that also has to go through a virus scanner, clamav).
If an end user is trying to block spam, then yes, they are probably not the sort of person likely to buy your product. At least until spam-blocking becomes more main stream in email clients (e.g Mozilla Thunderbird).
However, its very often the end user's ISP doing the spam filtering - and this has no direct bearing on the gullibility of the email recipient.
Here are the slides from the 400MB video presentation.