Slashdot Mirror


Proving Which Spam Filters work Best

pirateninja writes "Dr. Gord Cormack decided to find and prove what the best spam filter is. In his study he looked at the major spam filters (DSPAM, SpamAssassin, etc.) along with those submitted by various academics. The results are quite surprising, with a previously unheard-of spam filter, which uses ideas from various compression algorithms, performing the best overall. He recently presented the results and methodology used in a presentation titled 'Spam Filters, Do they Work? and Can you prove it?'" Note that this is a video of his presentation.

66 of 263 comments (clear)

  1. Easier? by Ec|ipse · · Score: 2, Insightful

    Isn't there an easier way to display the results, like a chart or something? 400M per file download is a bit extreme.

  2. In my experience... by vivin · · Score: 4, Informative

    ... the ones which have worked best (for me) are Bayesian Spam Filters (A Plan for Spam, SpamBayes - a free filter) and CRM114 The Controllable Regex Mutilator (Paul Graham mentions it here). I've always had a very high success rate with these.

    --
    Vivin Suresh Paliath
    http://vivin.net

    1. Re:In my experience... by coffeeisclassy · · Score: 3, Insightful

      What's surprising is, while Bayesian spam filters work well in his tests, the one that performs the best was never really heard of before.... I wonder how long it will be before we see something using the methods available. Who wants to bet open source will beat closed source to implementing this?

    2. Re:In my experience... by ozmanjusri · · Score: 5, Funny
      I've always had a very high success rate with these.

      I haven't tested this one myself (the Barrett Filter), but I understand it is 100% effective at reducing spam from known sources. False positives may be a problem, however.

      --
      "I've got more toys than Teruhisa Kitahara."
    3. Re:In my experience... by Red+Alastor · · Score: 4, Informative
      I like popfile because it's a bayesian filter that sorts into any arbitrary categories you want, not just spam and ham.

      http://popfile.sourceforge.net/

      --
      Slashdot anagrams to "Sad Sloth"
    4. Re:In my experience... by I!heartU · · Score: 3, Insightful

      Domain keys... now just get everyone to use it.

    5. Re:In my experience... by 1u3hr · · Score: 2, Insightful
      What's surprising is, while Bayesian spam filters work well in his tests, the one that performs the best was never really heard of before

      Well, the spammers have heard of the other methods too and try to subvert them. So give them time and see how it performs if and when it becomes more commonly used and the spammers are trying to beat it.

    6. Re:In my experience... by KlaymenDK · · Score: 4, Insightful

      "False positives may be a problem, however."

      False positives are a HUGE problem compared to the occasional "true negative"(?).

      I'd rather have a small trickle of spam emails (I can't believe I'm saying this, but hear me out) than I would risk missing out on that one truly important email.

    7. Re:In my experience... by jank1887 · · Score: 3, Insightful
      Hello. welcome to the internet.
      First, spam does not need to make sense to make money. Here's some of my latest received headlines:
      • placing LEDhas
      • pJapans mission
      • capture Todays architect shared
      • 6MZ
      and the body text (with an attached image):

      -----
      malware

      USDA databases crop

      entente cordial: admission relation contract GB giveaway andd

      studios another page:

      ... (etc.,etc.)
      -------
      AND IT STILL MAKES MONEY!!!
      Spam is funded by idiots. We will never run out of idiots on the net. Thus, spam will always be profitable under the current email system, no matter what filters are used. Filters don't fix the spam problem any more than virus scanners stop viruses from spreading. It's all reactionary, which translates to 'fighting a never-ending battle on the losing side'.

    8. Re:In my experience... by ajs · · Score: 2, Informative
      In my experience, the commercial offerings (such as mail frontier) aren't too bad. As far as open source stuff, my personal setup of choice is:
      • Spamhaus SBL/XBL filtering (hard SMTP-time DNSBLing) based on my experience with them and their consistent listing of VIOLATORS, not just anyone who shares a netblock with a spammer (i.e. they may not catch as much as some others, but they don't have the FP rate that others do)
      • Greylisting. This is controversial because many people can't tolerate the delay it introduces. I found a radical decrease in spam when using it (because honeypots have already located a spammer by the time they try again), and only marginal headaches introduced by the delays of new senders. YMMV, and I wouldn't use it in a production environment.
      • SpamAssassin. I tweak the RBL settings (I *never* want to even score SORBS, for example), and configure razor, but otherwise pretty much leave it in its default configuration, and it works great!
      • Thunderbird mail filtering. I use Evolution and Thunderbird. I don't bother turning on mail filtering in Evolution, since it uses SpamAssassin, and there's no point using SA twice on the same message. I *do* use Thunderbird's filtering as yet another layer of filtering when I'm using that, and it does a good job of classifying what little spam is left.
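
The greylisting step in the list above can be sketched roughly as follows. This is a minimal illustration, not how any particular greylisting daemon works: the `Greylist` class, the 300-second delay and the (client IP, sender, recipient) key are all assumptions made for the example. The first attempt from an unknown triplet gets a temporary failure, and a retry after the delay is accepted, on the theory that real mail servers retry while much spam software does not.

```python
import time


class Greylist:
    """Toy in-memory greylist (hypothetical helper, not a real daemon's API)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a new triplet must wait (assumed value)
        self.clock = clock          # injectable so the logic can be tested
        self.first_seen = {}        # triplet -> time of first delivery attempt

    def check(self, client_ip, sender, recipient):
        """Return "tempfail" for triplets that are too new, else "accept"."""
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Record the first attempt; later attempts reuse the stored timestamp.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"
```

In a real deployment the two answers would map onto an SMTP temporary-failure reply and normal acceptance; the injectable `clock` is only there so the behaviour can be exercised without waiting.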


      YMMV. Good luck.
  3. Why not just douse the server in gas... by shotgunefx · · Score: 3, Funny

    400MB?

    Why not just douse the server in gas if you want to see it melt.

    --

    -William Shatner can be neither created nor destroyed.
    1. Re:Why not just douse the server in gas... by Tsiangkun · · Score: 5, Funny

      I'm getting 8kb/s downloads from the site, it's just like the good old days !

      I'll post more next week after I watch the video.

    2. Re:Why not just douse the server in gas... by coffeeisclassy · · Score: 2, Informative

      It's round-robin mirrored across a whole bunch of different servers, so if you're only getting 8kb/s you could try cancelling and downloading again to see if it goes faster.

  4. Combo of SpamAssassin and Spamhaus by hyperion454 · · Score: 2, Interesting

    At work we've set up a combination of SpamAssassin and Spamhaus. Personally I've gone from about 10 spams per day to about 1 every two weeks.

    1. Re:Combo of SpamAssassin and Spamhaus by emag · · Score: 2, Informative

      And turn off SMTP VRFY. Either that, or having windows systems @ my ISP managed to get the address associated with my account on spam lists. This is an address that's *only* used internally by my ISP (I use pobox or my own domain whenever someone asks for an address). Even that wasn't enough to prevent it from getting harvested. :-(

      --
      "The urge to save humanity is almost always a false front for the urge to rule." --H.L. Mencken
    2. Re:Combo of SpamAssassin and Spamhaus by antifoidulus · · Score: 2, Insightful

      Heh, even if you are reasonably diligent in protecting your email address, 9/10 it will still get out (though maybe not as bad). All it takes is one recipient with a compromised windows box and your address can be all over the spammers' lists in no time.
      Or, as in my case, you could assume that a university you apply to will not send out a giant mass email to all the incoming graduate students inviting them to the graduate orientation. So now I have the email address of every grad student entering the University of Minnesota this year(and probably a few that aren't) and they have mine. All it takes is one infected box and my previously spam-free gmail account will no longer stay that way. The kicker is that I decided not to go to UMN because they didn't offer me funding...oy!

    3. Re:Combo of SpamAssassin and Spamhaus by jdowland · · Score: 3, Funny
      The key is still: don't give out your address. Once you've done that, you're going to be screwed eventually.


      Nah, that's such a half measure. The real solution is to not have an email address at all.
  5. Fantastic Spam Filters Which Work Best Proving! by _vSyncBomb · · Score: 5, Funny

    Hey Slashdot, what's up, man! Dude, I read your thing and like totally agree about Best Work Proving Spam Site Work! Dude, that's awesome!

    Bro, in the same vein, I was totally checking out this dope ass site which you might wanna check out too man. Guys like us that dig Spam Which Proving and Best work Filters will be all over this before long...

    OK, man take care until I see you this Friday at the dinner thing, Slashdot!

    Cheers,
    John

  6. Under present IST policy... by patio11 · · Score: 3, Funny

    ... they are not allowed to douse the servers in gas.

  7. RTFA? by glowworm · · Score: 4, Insightful

    So, how are we supposed to RTFA when the FA is over 470MB and a video file? Why not just a nice simple text summary, Mr Submitter? But nooooo, that would just be too easy!

    --
    Orationem pulchram non habens, scribo ista linea in lingua Latina
    1. Re:RTFA? by emag · · Score: 5, Funny

      "We are sorry that these talks are not available as plain HTML, PDF, or text, however under present IST policy we are not allowed to provide plain HTML, PDF, or text."

      --
      "The urge to save humanity is almost always a false front for the urge to rule." --H.L. Mencken
  8. Not surprising... by RealGrouchy · · Score: 4, Insightful

    Although I haven't WTFV (watched the video), it doesn't seem surprising that spam filters which use techniques that aren't used widely would be most successful.

    If they aren't used widely, it would either be because they don't work, or they do work but they haven't caught on [yet].

    It's like any other fad. As an example, when the original Survivor series came out, it was really popular because it achieved its goal (attracting viewers) in a way that was original. Heck, even I watched the original one. Now that all the networks are doing the reality TV thing, it has become hackneyed, and each successive version of survivor does a worse job of achieving its goal. And I've given up watching TV.

    With antispam, new techniques are effective, but as they become more popular and more widely used, spammers will find equally innovative ways of getting around them.

    I've noticed that at any given time, there will be a particular style of (non-blank) spam that manages to get through Gmail's filters fairly consistently, but every now and then Gmail adapts its spam filters to block the successful spam type of the season, and eventually a new type will make its way through.

    - RG>

    --
    Hey pal, this isn't a pleasantforest, so don't waste my time with pleasantries!
  9. Got to go with Brightmail by saha · · Score: 4, Informative

    We use Brightmail on our campus and our users love it for its very low false positive rate and pretty accurate flagging of SPAM. Another campus uses DSPAM and some people are up in arms at the prospect of losing their Brightmail to switch to DSPAM. Personally, DSPAM isn't nearly as good and has flagged many legitimate messages and sent them to the Junk folder.

    I also echo a gripe of other posters. It's nice to have a video, but a 500MB video file is a bit much. A 50KB pie chart or bar graph would have been nice.

    1. Re:Got to go with Brightmail by hacker · · Score: 2, Informative
      Personally, DSPAM isn't nearly as good and has flagged many legitimate messages and sent them to the Junk folder.

      And what happened when you retrained those false positives as ham? Did you see future mails of the same/similar type get caught again? I bet you didn't.

      I've been using dspam for a very long time for my users, and they love it. They love having zero spam in their mailbox, they love the simplicity of the user interface. They love how it treats users on a per-user basis, not globally (i.e. some users WANT html emails, some do not. Each can mark them as they see fit.)

      Here's an example of my own stats..

      hacker: TP True Positives: 122601
      TN True Negatives: 124711
      FP False Positives: 211
      FN False Negatives: 1046
      SC Spam Corpusfed: 3708
      NC Nonspam Corpusfed: 456
      TL Training Left: 0
      SHR Spam Hit Rate 99.15%
      HSR Ham Strike Rate: 0.17%
      OCA Overall Accuracy: 99.49%
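
For anyone who wants to check the percentages in those stats, here is a small sketch. The three formulas are assumptions (the usual definitions of recall on spam, false positive rate on ham, and overall accuracy); they are not taken from DSPAM's source, although they do reproduce the figures quoted above. The function name `dspam_rates` is made up for the example.

```python
def dspam_rates(tp, tn, fp, fn):
    """Return (spam hit rate, ham strike rate, overall accuracy) in percent."""
    shr = 100.0 * tp / (tp + fn)                   # share of spam that was caught
    hsr = 100.0 * fp / (fp + tn)                   # share of ham wrongly flagged
    oca = 100.0 * (tp + tn) / (tp + tn + fp + fn)  # share of all mail classified correctly
    return shr, hsr, oca


# Counts copied from the stats above.
shr, hsr, oca = dspam_rates(tp=122601, tn=124711, fp=211, fn=1046)
print(f"SHR {shr:.2f}%  HSR {hsr:.2f}%  OCA {oca:.2f}%")
```

Note that the 0.17% ham strike rate works out to roughly one legitimate message in 590 being flagged, which is the number to look at if false positives are the main worry.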

  10. Flaw in the test by lheal · · Score: 5, Informative

    The spammers actively try to subvert the more popular filters. That gives a lesser-known one a decided advantage, one which will go away as it becomes more popular.

    As with most choices like this, factors such as ease of use, speed, and resource efficiency can overshadow selectivity. No system is perfect, so it's perfectly reasonable to go with a system that's pretty good if you already are using it, rather than switching to the latest cool thing.

    I have found that using two dissimilar systems in a chain is quite effective.

    --
    Raise your children as if you were teaching them to raise your grandchildren, because you are.
    1. Re:Flaw in the test by Jeffrey+Baker · · Score: 2, Insightful

      The problem with the spam filters, which you have stated, is that eventually a spammer figures out how to craft a spam which avoids the feature detection systems. Right now there's some zombie network sending around a stock market scam, of which I am getting roughly 300 copies per hour, even though spamassassin correctly classifies virtually all other unwanted mail.

      Lately, I've been thinking about this problem a lot. The classic methods of computer classification (Bayes, SVM, whatever) are all based on trying to detect features in a set of objects which separate the objects into two classes. But there is only one feature which is shared by all spam, and which is not shared by mail I wish to receive: all spam is sent by assholes. The problem is, you can't algorithmically detect the asshole coefficient solely from the contents of an SMTP transmission. Therefore I have recently come to the conclusion that we need to revert to a web of trust for accepting email. I have long avoided webs of trust because they seem difficult to manage, but I've come to believe that they are the only way to solve this spam problem.

    2. Re:Flaw in the test by shawn.fox · · Score: 2, Interesting

      Right now there's some zombie network sending around a stock market scam, of which I am getting roughly 300 copies per hour, even though spamassassin correctly classifies virtually all other unwanted mail.

      Do you happen to use Ameritrade? I started receiving these emails this Sunday myself (July 30). Since I always use disposable email addresses I immediately noticed that the email was being sent to the disposable address I had created for Ameritrade. I sent them an email complaining about it and accusing them of either giving away my email address to some third party who was spamming me or having had customer account information stolen from them. I have yet to hear any response back from them.

    3. Re:Flaw in the test by perlchild · · Score: 2, Insightful

      A web of trust will work only until the computer of someone you trust gets subverted. The zombie network you mentioned doesn't happen by itself. Now, the smaller and more technically proficient the web of trust, the less likely it is to be subverted, but it's still vulnerable to someone you trust having their computer hijacked.

  11. Harder! by Profane+MuthaFucka · · Score: 5, Funny

    I uuencoded the video file, translated it into Sumerian cuneiform, and pressed it into a billion little clay tablets. They are cooking in my oven right now. Now, the Internet is NOT some kind of truck you can just dump stuff onto, so if you want to get the data you're going to have to come to my house.

    --
    Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
    1. Re:Harder! by rts008 · · Score: 4, Funny

      I can't come to your house, you insensitive clod!, teh tubes are clogged with clay tablets!

      I won't be able to download my internet until Friday now!

      Turn that crap down, and get off of my lawn! Damn kids!

      --
      Down With Slashdot BETA!!! I've been around the corner and seen the oliphant; you can only abuse me from your perspecti
    2. Re:Harder! by cruachan · · Score: 4, Insightful

      Don't knock it, cuneiform on baked clay is the single most successful format for long-term storage ever invented - 3000 years and counting. Heck, most of our modern storage formats can't even manage 30 - tried to read an 8" floppy recently?

    3. Re:Harder! by Jartan · · Score: 2, Insightful

      I'm not going to knock it but your statement is very far from the truth. Determining the "most successful" long term storage method invented would require waiting till the year 5xxx something to see if something we've currently invented beats cuneiform. Even then it's pretty hard to prove one way or another, since I suspect a lot of the cuneiform we have today is being carefully taken care of to prolong its lifetime (though I have no confirmation of that part).

    4. Re:Harder! by ozmanjusri · · Score: 2, Insightful
      I'm not going to knock it but your statement is very far from the truth.

      Yep, you're right. The best long-term information storage media ever invented is poetry.

      --
      "I've got more toys than Teruhisa Kitahara."
    5. Re:Harder! by Squalish · · Score: 4, Insightful

      Am I the only one that read the means of presentation as a hilarious attack on a university policy of blocking bittorrent? Given that adding 470MB doesn't really add any usable information to a discussion about spam filters over a piece of text, and all.

      Your college doesn't like bandwidth-efficient delivery? Flood them with a Slashdot effect on a 500mb file, an extra $500 in bandwidth charges, and maybe they'll change their tune.

      --
      People in Soviet Russia, however, appear to be afflicted with amusing juxtapositions of the aforementioned situation
    6. Re:Harder! by Crayon+Kid · · Score: 2, Funny

      Bwahaha, I'm moving my blog to clay tablets. They will undoubtedly survive the next Ice Age and the people of year 5000 will be forced to read about my cat, how I hate Emo's and that guy at work who doesn't wash. But first I'll change my blog nick to "Earth Imperial Overlord Supreme", just to fuck with them future dudes.

      --
      i ate crayons when i was a kid and now i have two braincells and the blue ones taste nicer
    7. Re:Harder! by Morkano · · Score: 2, Interesting

      You know, for a university with supposedly the best engineering and CS programs in Canada, their actual use of technology is pretty crazy. You'd think they'd understand it well enough to realize that bit torrent is a great delivery method.

      I remember when I applied to go there, I didn't get the email stating my acceptance until weeks and weeks after I got the physical package. Ha!

      --
      Victory or awesome!
    8. Re:Harder! by Anonymous Coward · · Score: 2, Informative

      Hey, we can't help it if people decide to post our videos to ./ and Digg!
      [/innocence]

      Here are UW's traffic stats, in case anyone's interested:
      http://noc.uwaterloo.ca/cgi-bin/14all.cgi?log=cn-rtext_gi2&cfg=cn-rtext.cfg

      Also note the spikes on Monday and Tuesday from when we posted our last two talks.

    9. Re:Harder! by cruachan · · Score: 2, Informative

      True, but as per below, there are literally mounds of baked clay tablets because they are so indestructible. Apparently they used to get shovelled into foundations and the like. The estimate I heard was that at current rates it will take scholars several hundred years to translate what we've found already. Compare that to parchment records where the discovery of even a few new scraps is a major event (http://news.bbc.co.uk/1/hi/sci/tech/5235894.stm and particularly http://news.bbc.co.uk/1/hi/world/europe/5216320.stm). Point is, in the race for the most successful long term storage mechanism cuneiform on baked clay is way ahead of the field, nothing else comes close.

      Excellent 'In Our Time' programme on Babylon and its Literature here - http://www.bbc.co.uk/radio4/history/inourtime/inourtime_20040603.shtml

  12. Re: Very Interesting And Generally Really Amusing by Anonymous Coward · · Score: 5, Funny

    Hey _vSyncBomb,

      Having trouble pleasing your woman? I've got something Very Interesting And Generally Really Amusing that you could try!!!

    Your buddy,
    _vAnoymousCoward

  13. I got the 400M download! by Ossifer · · Score: 3, Funny

    And I printed out every frame so I could scan them. I'll be posting the TIFFs on my website shortly...

  14. Re:Torrents by Pantero+Blanco · · Score: 2, Interesting

    I wonder how hard it would be for Slashdot/OSTG to host a tracker for large, article-related files like this. I don't think it would require a lot of funding to run, and it would certainly help with convention presentation videos.

  15. text versions of the material by martin-boundary · · Score: 5, Informative
    For those who don't relish downloading 400MB worth of video (why can't somebody cut out the audio as a standalone MP3?), the material of the talk is also available in text mode.

    The official tests of spamfilters were done in last year's TREC conference, you can read the writeup here (or pdf overview).

    You can duplicate those tests yourself if you download the evaluation toolkit (GPL). It's a modular system where you can add a mail corpus (either one of the public TREC ones, or you can make your own trivially), and add a spamfilter package (there are 10 or so to download from the web, or create your own as per documentation).

    There's also a video talk given at Microsoft research which should cover pretty much the same ground, if text mode is slashdotted :).

    There's a new scheduled test towards the end of the year at TREC 2006.

  16. Re:Only one question... by Jeffrey+Baker · · Score: 2, Insightful

    There is no classification system with zero real risk, except for delivering all mail to the Inbox. Sorry.

    If your mail is that important, you should be using couriers instead of email.

  17. Ask Slashdot ... by Anonymous Coward · · Score: 5, Funny

    Dear Slashdot,
    At the university where I work, they have recently adopted a pesky policy banning the use of bitTorrent.
    What can I do to fix this ?
    Yours faithfully,
    Dr. Gord Cormack

  18. Re:I have one word: by Jeffrey+Baker · · Score: 2, Informative

    I hope you also have another word, because the Postini service is incredibly bad. I had it enabled on my account at acm.org, and the Postini system was generating roughly one false positive for every 10 true positives. I disabled the Postini filtering and started using Spamassassin. Both the false positive and false negative rates are much improved. Among the traffic that Postini was flagging as spam were the Wikipedia article of the day, my daily email from musicbrainz.org, all messages to the BATN mailing list, many replies to my items for sale on craigslist, and other kinds of completely legitimate traffic. Among the mail they chose to deliver were messages in Korean, Cyrillic, other scripts I can't read, and known viruses.

    Their main problem is the system doesn't learn. Using their web interface, I look through the spam folder and request delivery of all the false positives. The next day, nearly-identical mails are still generating false positives. You'd think it would be easy these days to design a filter that learns from negative reinforcement.

  19. Good job the I don't filter web content by slayer99 · · Score: 2, Funny
    "In his study he looked at the major spam filters ( DSPAM, SpamAssasian"

    Spam about asian donkeys is a new one on me, though.

    --
    Martin Brooks / Slayer99 #linux / UIN 2178117
  20. No bittorrent... No credibility by bgog · · Score: 4, Insightful

    Why exactly should we give any weight to anything from an organization so ignorant as to disallow bittorrent? It takes someone pretty darn ignorant to disallow a protocol because some use it to transport illegal content. Why haven't they banned TCP? It is an evil technology used every day to violate copyright.

    This guy should spend his time educating the fools at his institution.

  21. Possible Text Version by sciop101 · · Score: 4, Informative
    On-line Supervised Spam Filter Evaluation
    Gordon Cormack and Thomas Lynam

    Full Text, May 29, 2006 - PDF Format

    http://plg.uwaterloo.ca/~gvcormac/spamcormack.html

    --
    The only thing new in this world is the history that you don't know.[Harry Truman]
    1. Re:Possible Text Version by gvc · · Score: 3, Informative
      Bogofilter works great. Or SpamAssassin but only if you force-feed it its own judgements. In both cases you have to correct classification errors.

      Fidelis Assis (who has now gone solo after having participated in the CRM114 project) shows great results for his recent solo effort, OSBF-lua. Bratko's PPM spam filter -- the one that did great at TREC -- is not yet packaged as a drop-in filter. Same for my DMC spam filter.

      The actual TREC 2005 tests referred to in TFA are here.

  22. GMail Spam Filter by foxylad · · Score: 5, Interesting

    I use greylisting (gld to be specific) which works wonderfully. A couple of customers wanted even better filtering...

    First I tried DSPAM, but they refused to train it so the results weren't good. Then I tried SpamAssassin, which also let through a surprising amount of spam - a lot more than my personal account on Gmail.

    So I set up accounts on Gmail for them, and forwarded their mail to those accounts (after greylisting - don't want to burden GMail too much!). Gmail lets you set up forwarding, so I simply forwarded all the filtered mail back to a second account on my mailserver for the customer to pick up. Finally I wrote a python script that logs in to Gmail once a week to prevent the account being closed due to non-use.

    A tad involved, but it works like a dream. Yet again Google comes out on top, this time in a market it doesn't even know it's in!

    --
    Do as you would be done to.
    1. Re:GMail Spam Filter by sd.fhasldff · · Score: 2, Interesting

      This is actually something Google could sell. Access to their mail filter. I do realize that they have "corporate email", but that still smacks a lot of GMail and some businesses would rather avoid that. Instead, they could provide a simple access to their spam filter. Yes, requiring all email to be piped through a Google server if they don't want to make the filter available as a binary (presumably updated regularly).

      To minimize bandwidth consumption and (partly, at least) allay privacy / corporate secrecy worries, the email piped through Google's servers could be limited to anything that didn't pass a white-list filter (e.g. removing all internal corporate email, as well as email from established business partners).
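
The white-list idea above could look something like this. It is a rough sketch only: messages are assumed to be plain dicts with a "from" key, and `split_for_external_filter` and the domain list are hypothetical names invented for the example. It also trusts the From address as-is, which is forgeable, so a real setup would want to key on something harder to fake.

```python
def split_for_external_filter(messages, trusted_domains):
    """Split mail into (deliver directly, send to the outside filter)."""
    trusted = {d.lower() for d in trusted_domains}
    deliver, to_filter = [], []
    for msg in messages:
        # Everything after the last "@" is treated as the sender's domain.
        domain = msg["from"].rsplit("@", 1)[-1].lower()
        if domain in trusted:
            deliver.append(msg)    # internal or partner mail never leaves the site
        else:
            to_filter.append(msg)  # only this part is piped through the third party
    return deliver, to_filter


inbox = [
    {"from": "alice@corp.example", "subject": "Q3 numbers"},
    {"from": "bob@partner.example", "subject": "Contract draft"},
    {"from": "x9@bulk.example", "subject": "Hot stock tip"},
]
deliver, to_filter = split_for_external_filter(
    inbox, ["corp.example", "partner.example"]
)
```

Only the last message in the sample inbox would be handed to the outside filter, which is the bandwidth and privacy saving being described.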

  23. So Which One Won? by ryanisflyboy · · Score: 2, Interesting

    So which one is the "unheard of spam filter?"

    Wouldn't it make sense to put this in the /. submission (or at least a link).

    Did I miss the obvious "and the winner is..." some place?

  24. Cloudmark's SpamNet by cruachan · · Score: 2, Interesting

    I have to push this as it usually gets missed from reviews as it's a hybrid P2P solution and not a straightforward filter, but Cloudmark's safetybar product (http://www.cloudmark.com/) is just about perfect for me. I get an average of about 20 spam emails a day and it has a false positive result of 0% and has had for months. In fact I've been using the product for several years now and I think the last time I saw a false positive was a couple of years back.

    On the efficiency side it has a hit rate of nearly 100%. I would have said it was 100% a couple of months back, but just recently it's been having a bit of a problem with one stock-pushing spam.

    Anyway, that aside it's the best spam filter I've ever seen by a very long way, and I'd highly recommend the service. It costs a few $ a month, but it's probably the best value subscription I have.

    I have no connection with the company, just a very satisfied customer who's been using it since the beta some years ago. I have a publicly available email address which I've had for years and which must be on many spam lists. Without Cloudmark it would be unusable; with it, it's no problem at all. I recently installed it for my wife, who was starting to get a lot of spam. On that install I noticed it took about two weeks to get it trained not to junk a few mailing list emails she was on, but after that it's been just as highly reliable as my installation.

  25. Out of Date and Worthless by prandal · · Score: 4, Informative

    This paper's a complete waste of time.

    He tested spamassassin 2.3 - that's ancient! I'd imagine the other tools are similarly obsolete.

    We currently use SA 3.1.4 with a well-trained Bayes database and Razor, Pyzor, and DCC.

    Throw in a few custom rules and a selection of rules from http://www.rulesemporium.com/ and the results are outstanding.

    With the new sa-update feature the core rules are updated between point releases, which came in useful this week dealing with the new image spams which seemed to be designed to avoid detection by spamassassin. Thanks Theo.

    And the folk on the spamassassin-users mailing list really rock.

    1. Re:Out of Date and Worthless by gvc · · Score: 3, Informative
      I assume the paper that you are describing is the 2004 study. The paper described in the talk (which was given 6 months ago or so) described results of the TREC 2005 Spam Track which took place in November 2005. It included a test of SpamAssassin 3.x, not 2.3.

      TREC 2006 evaluations are now underway.

      While it is reasonable to conjecture that spam has changed so as to defeat spam filtering techniques, or will change so as to defeat the PPM technique that did well at TREC, the historical evidence does not support this conjecture. In particular:

      • The spam filters tested in 2004 give pretty well exactly the same performance on 2005 and 2006 data.
      • New versions of the filters are a little bit better, but not by leaps and bounds, and also get about the same results over the last 2.5 years of data.
      • There is no evidence that "Bayesian poisoning" is a viable technique for defeating statistical spam filters in anything but a very artificial laboratory environment where the poisoner has access to the recipient's inbox.
      The subject of the paper -- and the talk -- is primarily about testing methodology and the need for controlled scientific investigation. So I hesitate to endorse the simplistic notion of a "winner" of the TREC evaluation. However the technique that did very well was indeed quite novel, so here's a characterization.
      Andrej Bratko used PPM -- a well-known data compression technique -- to compress ham and spam separately. Well, actually he didn't compress them but just built the statistical model necessary to compress them. Then he simply (tentatively) added the unknown message to each model and chose the one that compressed it best. The general technique of using compression has been mentioned here and elsewhere but Bratko used a much stronger compression scheme and was somewhat clever about it.
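
The compress-each-class idea described above can be tried in a few lines. This sketch is only an illustration under stated assumptions: it swaps in zlib (DEFLATE) from the Python standard library for PPM, recompresses the whole corpus instead of maintaining an adaptive statistical model, and the names `compressed_size` and `classify` are made up for the example. It is not Bratko's filter and should not be expected to approach its accuracy.

```python
import zlib
from typing import Sequence


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def classify(message: str, ham_corpus: Sequence[str], spam_corpus: Sequence[str]) -> str:
    """Label a message by which corpus makes it cheaper to compress."""
    ham = "\n".join(ham_corpus).encode("utf-8")
    spam = "\n".join(spam_corpus).encode("utf-8")
    msg = message.encode("utf-8")
    # Extra bytes needed to encode the message once the corpus has been seen.
    ham_cost = compressed_size(ham + b"\n" + msg) - compressed_size(ham)
    spam_cost = compressed_size(spam + b"\n" + msg) - compressed_size(spam)
    # Ties fall to "ham", the safer mistake when false positives matter most.
    return "spam" if spam_cost < ham_cost else "ham"
```

With corpora of only a few messages the byte counts are dominated by noise, and zlib only looks back 32 KB, so this toy version needs small but representative samples to do anything sensible.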

      I later reproduced Bratko's results using DMC -- a compression scheme that I invented 20 years ago -- and got some interesting results. We have a journal article in press describing it and also an evaluation paper at CEAS 2006.

      Bratko A., Cormack G. V., Filipic B., Lynam T. R. and Zupan B., Spam Filtering Using Statistical Data Compression Models

  26. Amusingly, POPFile caught you by patio11 · · Score: 4, Interesting

    I ran your message through a perl script to mail it to me for giggles (I do research on spam filtering at ye olde day job). Regretfully, you didn't make it through. Aside from header garbage, which was a mixed bag (half spam tokens, half "known-good automated email" tokens), you ran into problems with dope, ass, wanna, and... work*. Which is just as well, as I have no desire to speak to anyone who uses those words. * Last 15 occurrences in my mailbox are all of the "Make l0ads of $$$ work @ h0m3!" variety.

  27. Re:MS Anti Spam... by KiloByte · · Score: 2, Informative
    A false positive rate of 1:100
    No, better than 1:100 - that's what <1% means. It's actually around 1:500
    And thus still 200 times worse than the acceptable rate.
    Usually, anti-spam solutions which give more than 1:100000 are considered worthless
    Got links, or is that just your opinion?
    There was a massive flamefest on debian-devel about spam filtering recently, but false positive ratios in that range were something commonly used by most participants in the discussion. I don't have the time to find a bunch of such posts right now, but the most recent thread is "greylisting on debian.org". This particular thread deals mostly with acceptable delays, but it does include quite a bit of statistics.
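
    The "200 times worse" figure is just the quotient of the two ratios quoted above. A quick check, with both rates taken from the posts as stated rather than measured:

```python
from fractions import Fraction

observed = Fraction(1, 500)        # false positive rate claimed for the filter
acceptable = Fraction(1, 100_000)  # threshold cited in the quoted post

# How many times more legitimate mail is lost at the observed rate
ratio = observed / acceptable
print(ratio)  # 200
```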

    However, note that we are talking about two separate scenarios:

    • a home server for a user with no responsibilities
    • a project/ISP-wide mail server
    In the former, delaying mail for weeks may be acceptable -- but even then, I wouldn't touch something with a 1:500 false positive ratio with a long stick.
    --
    The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
  28. It is a war by Alain+Williams · · Score: 2, Insightful
    Spam is a war between the spammers and the system administrators/spam filters. The spam filters adopt a new technique; the spammers then work round it; the spam filters advance; ...

    By the time that I have downloaded the video the war will have moved on a couple of iterations ...

  29. Way to go compression ! by bytesex · · Score: 2, Interesting

    It looks like another win for compression algorithms. Not only do they maximize entropy in your data while shortening it, they can also be used successfully to earmark pieces of text as being written in a certain language, or written by a certain author, and now they can be used for spam detection. The usefulness just keeps on coming. Colour me impressed.

    --
    Religion is what happens when nature strikes and groupthink goes wrong.
  30. Torrent by vivin · · Score: 3, Informative

    Here is a torrent I made of the xvid file. It should work (I hope).

    --
    Vivin Suresh Paliath
    http://vivin.net

    I like
    1. Re:Torrent by jsharkey · · Score: 2, Informative

      Go get VideoLAN client and you can stream download the OGG version. Just open the URL as a Network Stream:

      http://www.csclub.uwaterloo.ca/media/files/cormack-spam.ogg

      Very handy use of VLC! :)

    2. Re:Torrent by wayne · · Score: 2, Informative

      Your tracker is still 440'ing, so I have put up an alternative tracker. As I write this, I only have about 9% of the avi downloaded, so if someone else can seed the complete cormack-spam-xvid.avi file, I would greatly appreciate it.

      --
      SPF support for most open source mail servers can be found at libspf2.
  31. Paul Vixie on botnets and spam by dodobh · · Score: 2, Interesting

    See here

    The key paragraph:

    If you'd like a more topical example, consider "spam". People began altering their e-mail "From:" lines in order to make their addresses harder to guess or aggregate; people began doing pattern matching in order to catch known-bad messages and either sideline or reject them. Many defenders used many small tricks to protect their inboxes. The result has not been that less spam is sent or even that less spam is received, on an aggregate basis. Things are worse now than they've ever been. (I say this as co-founder of MAPS LLC, by which I hope to establish my credentials in the spam field for those of you who do not know me.) Today a small number of highly advanced defenders is spam-immune only because they are a small number and their techniques are not widely effective against the attackers; and a small number of highly advanced attackers can "spam at will" a far larger population than ever before. And the trend is that things are getting worse, and getting worse faster than ever before.

    --
    I can throw myself at the ground, and miss.
  32. Dspam floats my boat by Zzeep · · Score: 3, Informative

    I receive (no kidding) around 600 spam mails per day, versus approximately 30 real e-mails. I've been using dspam for over a year now (with very faithful training), and there is maybe 1 false positive every few weeks (less than 1 in 10,000), and every few days a few (usually "new") spam mails get through, which I of course immediately train on, so I never see that kind again. So I am very, very positive about dspam. What I do miss, though, is something like a good and reliable service (better than the RBLs I know) that can block SMTP clients on the fly (like DSL home users and such) to reduce the immense load on our mailservers (I work for an ISP) caused by all the spam (which also has to go through a virus scanner, clamav).

  33. Re:Why do they try? by maubp · · Score: 2, Insightful

    If an end user is trying to block spam, then yes, they are probably not the sort of person likely to buy your product. At least until spam-blocking becomes more mainstream in email clients (e.g. Mozilla Thunderbird).

    However, it's very often the end user's ISP doing the spam filtering - and this has no direct bearing on the gullibility of the email recipient.

  34. Slides from the presentation by gvc · · Score: 2, Informative