Proving Which Spam Filters Work Best
pirateninja writes "Dr. Gord Cormack decided to find and prove what the best spam filter is. In his study he looked at the major spam filters (DSPAM, SpamAssassin, etc.) along with those submitted by various academics. The results are quite surprising, with a previously unheard-of spam filter, which uses ideas from various compression algorithms, performing the best overall. He recently presented the results and methodology used in a presentation titled 'Spam Filters, Do they Work? and Can you prove it?'" Note that this is a video of his presentation.
Isn't there an easier way to display the results, like a chart or something? 400M per file download is a bit extreme.
What's surprising is that, while Bayesian spam filters work well in his tests, the one that performs the best was never really heard of before. I wonder how long it will be before we see something using the methods available. Who wants to bet open source will beat closed source to implementing this?
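For anyone wondering how a filter built on compression ideas can classify mail at all, here is a minimal sketch of the general principle. It is not the filter from the study: zlib stands in for whatever compression model that filter uses, and the tiny corpora and all function names are made up for illustration. A message is assigned to whichever class's training text lets it compress into fewer extra bytes.

```python
import zlib

# Hypothetical, tiny training corpora; a real filter would learn from
# thousands of labeled messages.
SPAM_EXAMPLES = "\n".join([
    "Buy cheap pills now, limited offer, click here to order today",
    "Hot stock tip: this penny stock will triple by Friday, buy now",
    "You have won a free prize, click here to claim your reward",
])
HAM_EXAMPLES = "\n".join([
    "The graduate orientation starts at nine on Monday in room 204",
    "Can you review my patch to the mail server config before lunch?",
    "Minutes from the department meeting are attached for your comments",
])


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus: str, message: str) -> int:
    """Bytes the message adds when compressed together with the corpus."""
    return compressed_size(corpus + "\n" + message) - compressed_size(corpus)


def classify(message: str, spam_corpus: str = SPAM_EXAMPLES,
             ham_corpus: str = HAM_EXAMPLES) -> str:
    """Label the message by the corpus that compresses it more cheaply."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    # Ties go to "ham": a missed spam is cheaper than a lost real message.
    return "spam" if spam_cost < ham_cost else "ham"
```

Text that shares long substrings with one corpus costs very few extra bytes against that corpus, which is what drives the decision. Breaking ties toward "ham" is a design choice to keep false positives down.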
Bah. We use SpamAssassin and multiple DNSBLs, and I still get hundreds per day, most of them to addresses published on websites (unavoidable).
The key is still: don't give out your address. Once you've done that, you're going to be screwed eventually.
So, how are we supposed to RTFA when the FA is a video file of over 470MB? Why not just a nice simple text summary, Mr. Submitter? But nooooo, that would just be too easy!
Orationem pulchram non habens, scribo ista linea in lingua Latina
Although I haven't WTFV (watched the video), it doesn't seem surprising that spam filters which use techniques that aren't used widely would be most successful.
If they aren't used widely, it would either be because they don't work, or they do work but they haven't caught on [yet].
It's like any other fad. As an example, when the original Survivor series came out, it was really popular because it achieved its goal (attracting viewers) in a way that was original. Heck, even I watched the original one. Now that all the networks are doing the reality TV thing, it has become hackneyed, and each successive version of Survivor does a worse job of achieving its goal. And I've given up watching TV.
With antispam, new techniques are effective, but as they become more popular and more widely used, spammers will find equally innovative ways of getting around them.
I've noticed that at any given time, there will be a particular style of (non-blank) spam that manages to get through Gmail's filters fairly consistently, but every now and then Gmail adapts its spam filters to block the successful spam type of the season, and eventually a new type will make its way through.
- RG>
Hey pal, this isn't a pleasantforest, so don't waste my time with pleasantries!
Heh, even if you are reasonably diligent in protecting your email address, nine times out of ten it will still get out (though maybe not as badly). All it takes is one recipient with a compromised Windows box and your address can be all over the spammers' lists in no time.
Or, as in my case, you could assume that a university you apply to will not send out a giant mass email to all the incoming graduate students inviting them to the graduate orientation. So now I have the email address of every grad student entering the University of Minnesota this year (and probably a few that aren't), and they have mine. All it takes is one infected box and my previously spam-free Gmail account will no longer stay that way. The kicker is that I decided not to go to UMN because they didn't offer me funding... oy!
Monstar L
The problem with spam filters, as you have stated, is that eventually a spammer figures out how to craft a spam which avoids the feature detection systems. Right now there's some zombie network sending around a stock market scam, of which I am getting roughly 300 copies per hour, even though SpamAssassin correctly classifies virtually all other unwanted mail.
Lately, I've been thinking about this problem a lot. The classic methods of computer classification (Bayes, SVM, whatever) are all based on trying to detect features in a set of objects which separate the objects into two classes. But there is only one feature which is shared by all spam, and which is not shared by mail I wish to receive: all spam is sent by assholes. The problem is, you can't algorithmically detect the asshole coefficient solely from the contents of an SMTP transmission. Therefore I have recently come to the conclusion that we need to revert to a web of trust for accepting email. I have long avoided webs of trust because they seem difficult to manage, but I've come to believe that they are the only way to solve this spam problem.
There is no classification system with zero real risk, except for delivering all mail to the Inbox. Sorry.
If your mail is that important, you should be using couriers instead of email.
A 400MB video file? Is this a joke? WTF is everyone thinking, that everything on the web needs to be on video all of a sudden? I just blogged about this today: http://www.anotherblogger.com/2006/08/02/please-no-more-gratuitous-videoblogging/
Why exactly should we give any weight to anything from an organization so ignorant as to disallow BitTorrent? It takes someone pretty darn ignorant to disallow a protocol because some use it to transport illegal content. Why haven't they banned TCP? It is an evil technology used every day to violate copyright.
This guy should spend his time educating the fools at his institution.
DomainKeys... now just get everyone to use it.
The most effective way I have found to stop spam is greylisting. In the last two months, I have had zero spam messages go through to my mail server. I use GLST (http://www.xmailserver.org/glst-mod.html), which is mostly for the XMail mail server (http://www.xmailserver.org/) but will work anywhere.
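For readers who haven't run into greylisting: the server temporarily rejects the first delivery attempt from an unknown (client IP, sender, recipient) triplet with an SMTP 4xx code and accepts a retry that arrives after a minimum delay. Real mail servers retry; a lot of spamware doesn't. The sketch below shows only that decision logic. It is not how GLST is implemented, and the class name, the 300-second delay, and the in-memory storage are all assumptions made for illustration.

```python
import time
from typing import Dict, Optional, Tuple

# (client IP, envelope sender, envelope recipient)
Triplet = Tuple[str, str, str]


class Greylist:
    """Hypothetical in-memory greylist; a real one would persist and expire entries."""

    def __init__(self, min_delay: float = 300.0) -> None:
        # Assumed minimum number of seconds a sender must wait before a retry counts.
        self.min_delay = min_delay
        self.first_seen: Dict[Triplet, float] = {}

    def check(self, ip: str, sender: str, recipient: str,
              now: Optional[float] = None) -> str:
        """Return "defer" (answer with a 4xx temporary failure) or "accept"."""
        now = time.time() if now is None else now
        key = (ip, sender.lower(), recipient.lower())
        first = self.first_seen.get(key)
        if first is None:
            # Never seen this triplet: remember it and ask the sender to retry.
            self.first_seen[key] = now
            return "defer"
        if now - first < self.min_delay:
            # Retried too quickly; keep deferring until the delay has passed.
            return "defer"
        return "accept"
```

Usage would look like `Greylist().check("192.0.2.1", "a@example.com", "b@example.org")` on each incoming SMTP transaction. The cost is a delivery delay for first-time correspondents.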
You should also check this article: http://www.freesoftwaremagazine.com/articles/focus_spam_postfix?page=0%2C0, lots and lots of good advice on spam filtering.
By the time that I have downloaded the video the war will have moved on a couple of iterations ...
Well, the spammers have heard of the other methods too and try to subvert them. So give them time and see how it performs if and when it becomes more commonly used and the spammers are trying to beat it.
Don't knock it, cuneiform on baked clay is the single most successful format for long-term storage ever invented - 3000 years and counting. Heck, most of our modern storage formats can't even manage 30 - tried to read an 8" floppy recently?
"False positives may be a problem, however."
False positives are a HUGE problem compared to the occasional false negative.
I'd rather have a small trickle of spam emails (I can't believe I'm saying this, but hear me out) than risk missing out on that one truly important email.
"Good news, everyone!"
I'm not going to knock it, but your statement is very far from the truth. Determining the "most successful" long-term storage method ever invented would require waiting until the year 5000-something to see if something we've currently invented beats cuneiform. Even then it's pretty hard to prove one way or another, since I suspect a lot of the cuneiform we have today is being carefully taken care of to prolong its lifetime (though I have no confirmation of that part).
Yep, you're right. The best long-term information storage medium ever invented is poetry.
"I've got more toys than Teruhisa Kitahara."
If an end user is trying to block spam, then yes, they are probably not the sort of person likely to buy your product. At least until spam-blocking becomes more mainstream in email clients (e.g. Mozilla Thunderbird).
However, it's very often the end user's ISP doing the spam filtering, and this has no direct bearing on the gullibility of the email recipient.
First, spam does not need to make sense to make money. Here's some of my latest received headlines:
- placing LEDhas
- pJapans mission
- capture Todays architect shared
- 6MZ
and the body text (with an attached image):
-----
malware
USDA databases crop
entente cordial: admission relation contract GB giveaway andd
studios another page:
-------
AND IT STILL MAKES MONEY!!!
Spam is funded by idiots. We will never run out of idiots on the net. Thus, spam will always be profitable under the current email system, no matter what filters are used. Filters don't fix the spam problem any more than virus scanners stop viruses from spreading. It's all reactive, which translates to 'fighting a never-ending battle on the losing side'.
Am I the only one that read the means of presentation as a hilarious attack on a university policy of blocking bittorrent? Given that adding 470MB doesn't really add any usable information to a discussion about spam filters over a piece of text, and all.
Your college doesn't like bandwidth-efficient delivery? Flood them with a Slashdot effect on a 500MB file and an extra $500 in bandwidth charges, and maybe they'll change their tune.
People in Soviet Russia, however, appear to be afflicted with amusing juxtapositions of the aforementioned situation
A web of trust will work only until the computer of someone you trust gets subverted. The zombie network you mentioned doesn't happen by itself. Now, the smaller and more technically proficient the web of trust, the less likely it is to be subverted, but it's still vulnerable to someone you trust having their computer hijacked.