Slashdot Mirror


Two Spam Filters 10 Times As Accurate As Humans

Nuclear Elephant writes "The authors of two spam filters, CRM114 and DSPAM, announced recently that their filters have achieved accuracy rates ten times better than a human is capable of. Based on a study by Bill Yerazunis of CRM114, the average human is only 99.84% accurate. Both filters are reported to have reached accuracy levels between 99.983% and 99.984% (1 misclassification in 6250 messages) using completely different approaches (CRM114 touts Markovian, while DSPAM implements a Dolby-type noise reduction algorithm called Dobly). If you're looking for a way to rid your inbox of spam, roll on over to one of these authors' websites."

139 of 487 comments

  1. Outclassed... by Klatoo55 · · Score: 5, Funny

    I'm sorry, Dave... That Nigerian guy looks suspicious and I can't let you send him money.

    --
    ------- "A true friend stabs you in the front." -Eliot
  2. Comment removed by account_deleted · · Score: 5, Insightful

    Comment removed based on user account deletion

  3. IM Spam by jeffskyrunner · · Score: 5, Interesting

    Once Email Spam is eliminated, then IM spam will begin...

    --
    Jeff
    1. Re:IM Spam by Vancouverite · · Score: 2, Informative

      Far too late for that. ICQ has had IM Spam for some time, as has Yahoo, MSChat, and AOL.

      What *will* happen is that trawling robots will now also trawl for IM addresses, rather than just email addresses. As it is, only deliberate IM spammers (who are usually in an IM chat group with an intellectually stimulating name such as "Yung Hunnies 4 Married Men") are harvesting the IM addresses that show up in these chat groups. In the future, don't have your ICQ # or Jabber ID on your website, or you are setting yourself up for more spam.

      Hmmm... a use for reverse 3133t spelling? "Contact me at ICQ #lEloAAT" (1310447)

      --
      We are the Music Makers, and We are the Dreamers of Dreams...
  4. wait, WTF? by PedanticSpellingTrol · · Score: 5, Insightful

    I presume they mean more accurate than a human that was only looking at the subject line? I fail to see how someone could misclassify an email after they'd already opened it unless it was some kind of marathon testing, which would be totally unrepresentative of any real life situation. Once you're getting 6,000 messages, it's time to reach for "Delete All" and change your address, methinks

    1. Re:wait, WTF? by LBArrettAnderson · · Score: 2, Interesting

      look at it this way... you've just tuned in to your favorite radio station and you hear your favorite DJ talking about something. Sometimes you can't tell whether what he's saying is an advertisement or something he's discussing for the sake of discussing.

      i'm sure there's spam out there that makes it seem like it's one of your friends talking to you (sending with "nick" or "john" as the sender name) and talks to you in a friendly manner about how great this product is.

      i've got a few of those, but luckily all my friends have weird names.

    2. Re:wait, WTF? by HeelToe · · Score: 3, Interesting

      6000 over what period?

      This represents 8 days worth of spam for me. Yes, ~800 per day.

      My address has been valid for 10 years. Why should I change it? Bogofilter is currently letting 2-3 per day into my inbox. I generally check for false-positives, but as the training has progressed, I am finding none anymore.

      I plan to implement a single-shot, one try notification sender. I.e., if the mail gets classified as spam: lookup the mx record for the envelope return address, if it's nonexistent, lookup the a record. Make a connection and try to deliver a message indicating their message (include subject reference) was identified as spam, include a way for them to reliably get a message through to me. If any of the smtp exchange or address lookup fails, just forget it, they're probably not real anyway.

  5. 2+2=3 by Chess_the_cat · · Score: 2
    the average human is only 99.84% accurate. Both filters are reporting to have reached accuracy levels between 99.983% and 99.984%

    Am I crazy or is that nowhere near "10 times better"?

    --
    Support the First Amendment. Read at -1
    1. Re:2+2=3 by Celandro · · Score: 3, Informative

      No, you are just bad at math
      1 - .9984 = .0016
      1 - .99984 = .00016

      A factor of 10 in reduced error rates

      16 errors per 10 thousand vs 1.6.

    2. Re:2+2=3 by Deraj+DeZine · · Score: 2, Funny

      Yeah, "10 times better" should be 998.4%, right?

      And that's impossible. No one can give more than one hundred percent. By definition that is the most anyone can give

      --
      True story.
    3. Re:2+2=3 by kfg · · Score: 4, Insightful

      Congratulations, Mon Ami.

      You have just unlocked the secret of virtually every news report that says "ten times more likely."

      To get cancer. To have a heart attack. To suffer from the heartbreak of psoriasis. Whatever.

      Yes, these numbers indicate "10 times better," and if you were to ask the reporter how likely am I to avoid cancer in both situations, these are the sorts of numbers he would show you.

      Eat health food and your chance of having a heart attack is 99.984%. Eat too many donuts and your chance of having a heart attack is 99.983%, 10 times worse!

      Always, always, always ask to see the raw numbers so that you know what "10 times worse" means.

      Then ask if the numbers were collected by phone survey. If they were, throw them all away and have a donut and a cup of coffee.

      KFG

    4. Re:2+2=3 by kfg · · Score: 2, Insightful

      Yeah, I was waiting for someone to nail me on that. In fact I was waiting for someone to agree with me. :)

      I totally buggered that whole section, but it was just so funny I let it stand with the errata note that I had buggered it.

      Ironically people know I "eat healthy," so I'm frequently asked where they should go to buy healthy food, to which I almost always reply:

      "For God's sake man, whatever you do, don't go in the health food store!"

      "Well. . . where do I go then?"

      "They've got these things now called "Supermarkets." Look, over here, brown rice, dried beans and lentils. Over here, the produce aisle. You need frickin' binoculars to see the end of the thing. Broccoli, Bok Choy, squash, potatoes to the ceiling, it's the middle of February and there are crates of oranges that were hanging on the tree a few days ago. Why go anywhere else?"

      "But, but . . . what about organic?"

      "Here, take my binoculars, look down there. No, to the right a little, yeah, see? A whole organic section if you want. Supermarkets today aren't the supermarkets of 20 years ago. They're catering to customer demand. Go figure.

      But really, if you want my advice? Save your money. Only buy organic if the price is the same. If you eat the "normal" stuff there's a 99.84% chance it won't kill you. If you eat the organic there's a 99.984% chance it won't kill you, and they got those numbers by taking a phone survey, or from the I Ching, or something like that."

      KFG

  6. can it be used with SA? by Chuck+Bucket · · Score: 4, Interesting

    can this be used with SpamAssassin, or is it a standalone program? Does it need something like Amavis to run?

    CB

    1. Re:can it be used with SA? by Neil+Blender · · Score: 2, Funny

      can this be used with SpamAssassin, or is it a standalone program? Does it need something like Amavis to run?

      I'd tell you, but I'm not 100% sure.

    2. Re:can it be used with SA? by Scott+Laird · · Score: 2, Informative

      My personal problem with SA is that it's really just a muddled average of a bunch of guessed-at filters for recognizing spam. The individual filters aren't very accurate, but the idea is that the average across a bunch of filters will be more accurate than any individual filter.

      Bayes-based filters, on the other hand, directly calculate the probability of specific words appearing in spam vs. non-spam messages. Newer versions calculate the probability of short phrases, HTML tags, and mail headers as well. There's no guesswork involved (unlike SA)--if you feed them enough of yesterday's spam, then they're going to be really good with today and tomorrow's spam. The spammers keep evolving, so sooner or later messages will get through, but the filters keep evolving, too, and it's really hard to beat a good filter these days.
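      A rough sketch of the per-word calculation described above, for the curious. This is a toy: the sample messages, the function names, the add-one smoothing and the equal-priors assumption are all mine, and it is not how SpamProbe or any particular filter is actually implemented.

```python
from collections import Counter

def train(messages):
    """Count in how many messages each lowercase token appears."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts

def spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) with add-one smoothing, assuming equal priors."""
    p_token_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_token_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_token_spam / (p_token_spam + p_token_ham)

# Made-up training data: "yesterday's" spam and legitimate mail.
spam = ["cheap viagra now", "free viagra offer", "free debt relief now"]
ham = ["lunch at noon", "meeting notes attached", "free for lunch tomorrow"]

spam_counts, ham_counts = train(spam), train(ham)
for token in ("viagra", "free", "lunch"):
    p = spam_probability(token, spam_counts, ham_counts, len(spam), len(ham))
    print(f"{token}: {p:.2f}")
```

      A word seen only in spam scores well above 0.5, a word seen only in good mail scores well below, and a word seen in both lands in between.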

      I've been using SpamProbe for almost 6 months, and it's amazingly accurate. I haven't had a false positive in months, and I only see a couple false negatives per month.

  7. Who is sending that one? by ObviousGuy · · Score: 5, Funny

    If your email is indistinguishable from spam by a human, perhaps the problem isn't the receiver. It's the sender.

    Forgive me if I don't feel any pity that some moron's email gets filtered to the junk bin because I couldn't discern it from spam.

    --
    I have been pwned because my /. password was too easy to guess.
  8. SPAM definition by Embedded+Geek · · Score: 2, Insightful

    Isn't the rough definition of SPAM "Anything I don't want in my mailbox"? If that's the case, isn't the human score going to be 100% (at least for the intended recipient)?

    --

    "Prepare for the worst - hope for the best."

  9. To get this new spam filter... by Anonymous Coward · · Score: 5, Funny

    Just enter a valid email address, and hit submit!

  10. Re:Huh? Aren't humans 100%? by MarkJensen · · Score: 5, Informative

    I haven't been 100% accurate.

    I received an email from my sister-in-law from her work, and the address looked suspicious (one of those weird-looking "letter and number" jumbles).

    I deleted it. It happens.

  11. Re:Huh? Aren't humans 100%? by msgmonkey · · Score: 2, Informative

    Humans sometimes make mistakes, that's where the inaccuracy comes from.

  12. Better by gid13 · · Score: 4, Interesting

    Well, it certainly sounds better than the pay-per-email "postage" idea. If postage hasn't stopped snail spam, why would it stop e-mail spam?

  13. Re:Huh? Aren't humans 100%? by hatrisc · · Score: 2, Interesting

    but can you identify spam before opening it 100% of the time? Now, I realize that the mail program is looking at the actual data as well, which gives it an advantage, but on the other hand, how else can IT detect spam?

    --
    I write code.
  14. Number of significant digits... by jsimon12 · · Score: 4, Informative

    Human=99.84
    New proggie=99.984

    So the human misses .16% and the machine only misses .016%, hence the machine is 10 times better.

  15. Re:Huh? Aren't humans 100%? by Behrooz · · Score: 4, Insightful

    I suppose it depends how you're defining spam. Perhaps the ultimate spam messages that don't get past them are capable of passing a turing test... hence fooling those gullible human recipients into thinking that it isn't even spam!

    Fortunately, soon we will all be able to use the superhuman spam-detection capabilities of these filters to save us from ourselves. Imagine all of those pesky e-mails from your 'friends' getting caught by your spam filter before they even impinge upon your consciousness.

    It'd be a wonderful world.

    --
    "We have to go forth and crush every world view that doesn't believe in tolerance and free speech." - David Brin
  16. less thought for me... by Digitus1337 · · Score: 3, Funny

    ...and only one locked pod bay door per 6250, I like those odds.

  17. Re:Huh? Aren't humans 100%? by gid13 · · Score: 5, Insightful

    If you read the post, it quotes a study and says humans are only accurate 99.84% of the time.

    Kinda makes you wonder how they can know the filters are right though. :)

    (please don't reply telling me how)

  18. Hmmmm by Anonymous Coward · · Score: 5, Funny

    Probably used those same people who open viruses as test subjects.

  19. i tend to think... by caino59 · · Score: 3, Funny

    that i'm 100% accurate.

    maybe some of those people just dont know where their 'del' key is, or what it does...

  20. It is 10 times better by flicken · · Score: 2, Informative
    Think of it in terms of an error rate:
    100%-99.84% = 0.16%
    100%-99.984% = 0.016%

    0.16% = 10 * 0.016%
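    For anyone who wants to check the arithmetic, here is a rough Python sketch. The variable names and the `errors_per` helper are made up for illustration; only the two accuracy figures come from the story.

```python
# Accuracy figures quoted in the story summary.
human_accuracy = 0.9984     # 99.84%
filter_accuracy = 0.99984   # 99.984%

# "10 times better" refers to the error rate, not the accuracy itself.
human_error = 1 - human_accuracy
filter_error = 1 - filter_accuracy

def errors_per(n_messages, error_rate):
    """Expected number of misclassifications in n_messages."""
    return n_messages * error_rate

ratio = human_error / filter_error
print(f"human:  {errors_per(10_000, human_error):.1f} errors per 10,000 messages")
print(f"filter: {errors_per(10_000, filter_error):.1f} errors per 10,000 messages")
print(f"ratio:  {ratio:.1f}x")
print(f"one filter error every {1 / filter_error:.0f} messages")
```

    The last line is just the "1 misclassification in 6250 messages" figure from the summary, restated.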
    --
    20 mil and I will! Learn Esperanto with 20M others.
  21. Combined accuracy? by LagDemon · · Score: 2, Interesting

    Does this mean that if I use the 2 together, I get a 99.99999728% accuracy? Awesome! That means it would take months for me to see a single error!

    --


    Beware of he who would deny you access to information, for in his heart he dreams himself your master.
    1. Re:Combined accuracy? by canajin56 · · Score: 2, Interesting

      No, that only works if the probability of system X being wrong is independent of the particular message it is checking. (This also means that their figures are dependent on the makeup of the e-mail you are getting) Also, you couldn't really combine them usefully. If one says yes and the other says no, what do you do? You could either accept in these cases, or reject. But either way you could increase the error over just using one or the other.
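      To put numbers on that, a small sketch. It assumes the two filters' errors are independent and that each filter has the same error rate on spam and on good mail; neither assumption comes from the article, and real filters probably violate both. The "flag only when both agree" policy is just one way to settle a disagreement.

```python
e1 = 0.00017  # error rate for 99.983% accuracy
e2 = 0.00016  # error rate for 99.984% accuracy

# Both filters wrong on the same message (only valid if errors are independent).
both_wrong = e1 * e2
# At least one of the two filters wrong on a message.
either_wrong = 1 - (1 - e1) * (1 - e2)

# Policy: call a message spam only when BOTH filters say spam.
# A good message is lost only if both err, but a spam gets through if either errs.
false_positive_rate = both_wrong
missed_spam_rate = either_wrong

print(f"both wrong:   {both_wrong:.3e} (accuracy {100 * (1 - both_wrong):.8f}%)")
print(f"either wrong: {either_wrong:.3e}")
```

      So the 99.99999728% figure only describes one kind of mistake under one policy; the other kind of mistake gets roughly twice as likely as with a single filter.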

      --
      ASCII stupid question, get a stupid ANSI
  22. how to lie with statistics.. by isaac338 · · Score: 2, Interesting

    1 in 6250?

    Who wants to bet that they only sent two 'spam' and one of them was disguised well? ;)

  23. Obligatory Q... When will mozilla/TB have them? by sisukapalli1 · · Score: 5, Interesting

    I reached the conclusion of "two filters better than humans" by using two sequential filters:
    server side spamassassin, and a couple of simple procmail recipes. They have kept almost all the SPAM away.

    However, it is good to see such good techniques becoming available and we can hope to see them as straightforward, usable tools.

    So, when will mozilla/TB (or your favourite server side or client side filter) get them?

    S

  24. knowspam.net by flyingrobots · · Score: 2, Interesting

    I still think it is the best 'filter' available, since filtering is a lookup into a database of 'good senders' http://www.knowspam.net

    1. Re:knowspam.net by perlchild · · Score: 2, Insightful

      until the next "trinoo-like" proxy allows spammers to send email from a desktop near you...

  25. actually by Digitus1337 · · Score: 5, Funny

    it's not that humans are not as accurate, it's that 1 in X times we really do want a mini camera or free porn. It is what separates us from those cold, heartless machines.... mini cameras and porn....

    1. Re:actually by Deraj+DeZine · · Score: 2, Funny

      What about that 1 in 6250 for the automated filters? Your computer might be spying on you at this very moment!

      This is indeed a disturbing development.

      --
      True story.
  26. News story Headline by tacokill · · Score: 3, Funny

    "My Machine outthinks me!!"

    I've seen better stories in Highlights for Children

  27. Re:Huh? Aren't humans 100%? by mattkime · · Score: 5, Insightful

    Obviously you've never seen someone new to the internet sit in front of their computer. Lots of people don't know what popups are. Lots of people read some spam not knowing what it is. To these people, a computer is merely an interesting string of sensations.

    --
    Know what I like about atheists? I've yet to meet one that believes God is on their side.
  28. Re:Spamassassin by pclminion · · Score: 2, Interesting
    It's hard to believe that a single approach like this is better than SpamAssassin.

    SpamAssassin is a single approach. It looks at a bunch of features, then combines them linearly and compares the result against a threshold function. It's a relatively simplistic method, compared to these two. Not hard to see how more sophisticated methods could do better.
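    A minimal sketch of "combine features linearly and compare against a threshold", in Python. The rule names, weights and the 5.0 cut-off are invented for illustration; they are not SpamAssassin's actual rules or scores.

```python
# Hypothetical rule names and weights, for illustration only.
RULE_WEIGHTS = {
    "subject_all_caps": 1.5,
    "mentions_viagra": 3.0,
    "html_only_body": 1.0,
    "sender_in_address_book": -2.5,
}
THRESHOLD = 5.0  # assumed cut-off

def score(hits):
    """Sum the weights of the rules that fired for a message."""
    return sum(RULE_WEIGHTS[name] for name in hits)

def is_spam(hits, threshold=THRESHOLD):
    """Linear combination of features compared against a fixed threshold."""
    return score(hits) >= threshold

print(is_spam({"subject_all_caps", "mentions_viagra", "html_only_body"}))
print(is_spam({"subject_all_caps", "mentions_viagra", "html_only_body",
               "sender_in_address_book"}))
```

    Every rule contributes a fixed amount no matter what else is in the message, which is the simplification the statistical filters try to get past.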

  29. *slams head against wall* by Faust7 · · Score: 5, Funny

    I received an email from my sister-in-law from her work

    Yeah, so did I. The subject line was "I want you so bad."

    I deleted it. Turned out the message was genuine. I'll never forgive myself...

    1. Re:*slams head against wall* by Bendebecker · · Score: 4, Funny

      If you can't forgive yourself, I'll forgive you... as soon as I receive your sister-in-law's email address.

      --
      There's a growing sense that even if The Future comes,
      most of us won't be able to afford it.
      -- Lemmy
    2. Re:*slams head against wall* by maddskillz · · Score: 5, Funny

      If it was your sister-in-law sending you that subject line, you probably did the right thing and deleted it

  30. I'm sure they're great, but... by LesPaul75 · · Score: 5, Insightful

    I'm also sure that Yahoo's "SpamGuard" was great when they first introduced it. Now, it catches roughly half of all the spam I get. Why? Because people have figured out how it works and taken advantage of it. The same will happen with any content-recognition-based spam software. In the extreme case, even if a piece of software were 100% accurate at saying "This piece of e-mail looks like spam," then spammers would just make their e-mails look exactly like e-mail from one of your buddies. How could software ever tell the difference between:

    Hey, dude, check out this website I found. There are some hot naked chicks and stuff. Sweet.
    Signed,
    Your Buddy


    and

    Hey, dude, check out this website I found. There are some hot naked chicks and stuff. Sweet.
    Signed,
    SpamKiddy


    Even a human can't tell the difference. The only real difference is who they're from.

    1. Re:I'm sure they're great, but... by RedWizzard · · Score: 2, Insightful
      Even a human can't tell the difference. The only real difference is who they're from.
      And that is all you need. I want website recommendations from friends, I don't want them from random spambots. That's enough for a human or a program to decide that one of those messages is spam and one is not.
  31. Re:Huh? Aren't humans 100%? by Celandro · · Score: 4, Insightful

    Perhaps they mean that Human A is reading email intended for Human B and attempting to classify the email as spam or not spam. I wouldn't be surprised if a computer could do a better job at that sort of task. Besides, I'm sure Human B wouldn't want Human A reading that cyber sex chat log.

  32. Re:How can a human be wrong? by pclminion · · Score: 4, Informative
    No matter what, in the end, the human CAN'T be wrong... right?

    [*Bing* -- mail from VP of sales pops into my inbox. Subject: "Making money fast!"]

    [*Bam* -- I hit delete, thinking "Stupid Spam!"]

    Ahh, shit! Lookie, a human screwed up.

    The filter would have actually examined the message and probably decided that it was legitimate.

  33. Here's the real test by Otter · · Score: 2, Interesting
    I'm very happy with POPFile but there's one thing it just can't handle -- bounces from spam with my domain forged in the header when the original text isn't included. And how could it know? The response is the same whether it's to my mail or to spam. The domain is a clue, I guess, but otherwise it seems like an impossible task. I just let them be sorted into my inbox and delete them manually.

    If these filters can hit 99.99% with those, I'd be quite impressed.

  34. Adaptive adversaries by Pendersempai · · Score: 5, Insightful

    It's really easy to design an effective solution when the problem is purely mechanical or natural. As long as you're working with spammers who don't adapt, you can slice through their shitstorms very effectively.

    But when a single solution becomes mainstream, spammers will adapt to it. Bayesian filters tend to work very well, but now spammers are adding sprawls of randomly generated green-light text to offset the filter's score.

    Google found an excellent way to rank websites, but then it became widespread enough that webmasters began to game the system it had created. It's been playing catch-up ever since.

    Once the adversary begins to adapt, we lapse into the same cat-and-mouse game of technological barriers and counter-barriers that we've seen so many times before.

    1. Re:Adaptive adversaries by kindbud · · Score: 2, Informative

      Bayesian filters tend to work very well, but now spammers are adding sprawls of randomly generated green-light text to offset the filter's score.

      That does not work. If anything, it makes the spam easier to identify, especially dictionary-salad-type spams that just list random words most of which real people hardly ever use in actual emails. Dictionary salad just gives the Bayesian classifier more spam terms to work with. The rest of the terms, the ones that are common in real emails, converge on a neutral score real quick, and simply stop counting one way or another.
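      Here is one way to see why neutral padding "stops counting". The sketch assumes a combining rule in the style of Paul Graham's "A Plan for Spam" (take the tokens whose probabilities are farthest from 0.5 and multiply them together); whether a given filter works exactly like this is an assumption, and the probabilities are made up.

```python
import math

def combine(probs, n_interesting=15):
    """Combine per-token spam probabilities using the tokens farthest from 0.5."""
    interesting = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n_interesting]
    spam_score = math.prod(interesting)
    ham_score = math.prod(1 - p for p in interesting)
    return spam_score / (spam_score + ham_score)

spammy = [0.99, 0.98, 0.95]   # strongly spam-associated tokens
padding = [0.5] * 200         # common words that have converged on neutral

print(combine(spammy))
print(combine(spammy + padding))
```

      Tokens sitting at 0.5 multiply both sides by the same amount, so two hundred of them leave the verdict where it was. Padding only helps the spammer if its words score well below 0.5 for that particular recipient.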

      --
      Edith Keeler Must Die
    2. Re:Adaptive adversaries by JuggleGeek · · Score: 2, Informative
      But when a single solution becomes mainstream, spammers will adapt to it. Bayesian filters tend to work very well, but now spammers are adding sprawls of randomly generated green-light text to offset the filter's score.

      I can't see how that would change anything. The "bad" keywords are still in the spam. The gobbledy-gook words (usually short clips of random books/stories/something) are legitimate words, but aren't very likely to have a high coincidence of words found in my legitimate email.

      I'm not using bayesian filtering, but I can't see those making much difference.

    3. Re:Adaptive adversaries by KjetilK · · Score: 2, Insightful
      It doesn't work for people who train their filters themselves. Indeed, with my well-trained SA install, my Bayes marks those spams as BAYES_99.

      But at my old university, which has 40000 users, this has completely defeated their Bayesian filters. They say that the disk and CPU needed for per-user Bayesian training are prohibitively expensive, and they found that training for all users at once was doing more harm than good.

      So, we definitely need more approaches to the problem.

      --
      Employee of Inrupt, Project Release Manager and Community Manager for Solid
  35. Re:Huh? Aren't humans 100%? by evilmrhenry · · Score: 5, Insightful

    Quite simple:
    With 10 messages (after automatic spam detection) humans are 100% accurate.

    With 1,000 messages, (before automatic spam detection)
    humans are less than 100% accurate.

    The experiment was done on 5849 messages.

    Remember; one thing computers are good at is doing boring things repeatedly.

  36. Re:Huh? Aren't humans 100%? by Elwood+P+Dowd · · Score: 5, Interesting

    No, humans are not 100%.

    If you see a strange name in your inbox with an odd title, that might be a Nigerian businessman, or it might be your long lost Nigerian brother.

    I recently tried to order a t-shirt from this guy for a band he used to be in. I found his band because we have the same (semi-uncommon) name. So, he got an email From: himself. I had to send him two emails because he deleted the first one assuming it was spam.

    I ordered some RAM for my dad a while back. He gets 200 spam emails a day (email addy in resume & web page), and he deleted the confirmation email from the RAM vendor. The RAM never shipped, and it took us a week to figure out that there was a problem.

    People make mistakes all the time. Why is this an unexpected result? People are jackasses. This should be obvious.

    --

    There are no trails. There are no trees out here.
  37. Could somebody explain this to me... by heldlikesound · · Score: 5, Interesting

    I order all kinds of stuff online, wouldn't the receipt emails look like spam? My current spam solution is very simple:

    1. display my email online as little as possible

    2. use a number of addresses that all filter into one account, then filter by the sent-to address... this has turned up some VERY interesting results, for instance. I used dellorders@mydomain.com for an order from Dell, and NEVER used it or even typed it anywhere again, and started getting spam about 6 months later, and I mean the nasty stuff, not just innocent stuff from Dell resellers...

    3. I built a rudimentary filter that looks for viagra,free,debt,enlarge, etc... if the sender is not in my address book, and the email contains these words, it is sent to a "check these out" folder...
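    Step 3 could look something like the following in Python. This is a guess at the logic, not the poster's actual filter: the function name, the example addresses and the word-splitting regex are assumptions; only the four keywords and the folder name come from the description.

```python
import re

SUSPECT_WORDS = {"viagra", "free", "debt", "enlarge"}

def route(sender, body, address_book):
    """Return the folder a message should be filed in."""
    # Known senders always go straight to the inbox.
    if sender.lower() in address_book:
        return "inbox"
    # Unknown sender: look for any of the suspect words in the body.
    words = set(re.findall(r"[a-z]+", body.lower()))
    if words & SUSPECT_WORDS:
        return "check these out"
    return "inbox"

book = {"friend@example.com"}
print(route("friend@example.com", "free lunch?", book))
print(route("orders@shop.example", "Your order has shipped", book))
print(route("x@spam.example", "Enlarge your debt", book))
```

    A shipping confirmation passes as long as it avoids the word list, but a receipt that mentions "free shipping" from an unknown sender would land in the review folder, which is the weakness of a fixed keyword list.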

    How might a spam filter help me out without zapping confirmation type emails?

    --


    Cloud City Digital: DVD Production at its cheapest/finest
    1. Re:Could somebody explain this to me... by caseih · · Score: 4, Informative

      If you don't control the mail server to create aliases for yourself, you can also employ RFC-compliant suffixes to your e-mail address. For example:
      foobar+dellorders@mydomain.com.
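      On the receiving side, sorting on the suffix is only a few lines. A sketch in Python; `split_subaddress` is a made-up helper, and whether your mail server actually delivers `user+tag` to `user` depends on how it is configured.

```python
def split_subaddress(address, separator="+"):
    """Split 'user+tag@domain' into (user, tag, domain); tag is None if absent."""
    local, _, domain = address.rpartition("@")
    user, sep, tag = local.partition(separator)
    return user, (tag if sep else None), domain

print(split_subaddress("foobar+dellorders@mydomain.com"))
print(split_subaddress("foobar@mydomain.com"))
```

      The tag tells you which vendor the address was given to, so spam arriving at that tag points at who leaked or sold it.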

    2. Re:Could somebody explain this to me... by Anonymous Coward · · Score: 2, Funny

      I used dellorders@mydomain.com for an order from Dell, and NEVER used it or even typed it anywhere again, and started getting spam about 6 months later, and I mean the nasty stuff, not just innocent stuff from Dell resellers...

      You probably ticked off "Eric" the Indian tech. I talked to that guy yesterday. What a jerk.

    3. Re:Could somebody explain this to me... by Fnkmaster · · Score: 3, Informative

      Unfortunately, even though it's RFC-compliant, I've found probably half the sites I have to give my email address to won't grok the username+filtername@mydomain.com syntax. It's convenient when it works, but it doesn't work enough to rely on. No, throw-away spam-bait email addresses that you use for 6 months at a time for all online ordering and the like, then eventually trash when they get too spam-ridden are the best solution I know of.

    4. Re:Could somebody explain this to me... by mdfst13 · · Score: 3, Informative

      username+filtername@domain.com should go to username@domain.com as per the RFC (the +filtername is carried but not used by servers, or at least it shouldn't be). Some email clients will allow you to use this for such things as folder sorting (i.e. username+foldername goes into foldername automatically). If this worked consistently, it would be good for people who don't have the ability to make more usernames.

      AFAIK, username-filtername will still just go to username-filtername, i.e. you have to configure your mail server to handle username-filtername separately from username. This works great when you can specify as many usernames as you want (i.e. if you manage your own server or have a catch-all on your domain).

      Maybe you are talking about something different than the original poster?

      One reason why the - would work when the + does not is that the - can appear multiple times, so it is just another valid character (like a letter, number, or underscore). The + can only appear once, so many servers can ignore it, drop it, or puke on it.

      Interestingly enough, while the (optional) challenge/response system is what gets the press, the main purpose of TMDA is to create aliases like username-filter (and then filter based on them). Thus the name: *Tagged* Message Delivery Agent. The -filter is the tag of Tagged.

  38. Operating on a different scale... by ptolemu · · Score: 2, Interesting

    I think these guys are trying to put the focus on the server side of things where they emphasize greater speed and efficiency in eliminating spam from a large number of accounts as opposed to a single one. Just out of curiosity, do Thunderbird and iMail use similar filtering techniques with their junk mail controls?

  39. Re:Huh? Aren't humans 100%? by Dulimano · · Score: 2, Interesting

    No, imaginary humans with infinite time and dedication are 100%. But real humans are not. The percent goes down with time and dedication continuously, so I really don't understand what this 99.84% means.

  40. This is just carp. by corian · · Score: 3, Insightful

    Spam is what is defined by humans as Spam.

    To determine the accuracy of a spam detector, it is necessary first to come up with a sample of what is or isn't Spam. (I'd assume a human would do this?) So the best result we can get by evaluating humans is how often they agree with the result of the initial label.

    This figure probably won't be 100%. People have slightly different concepts of what mail is requested vs. unwanted, and what is advertising or useful information. So there is a valid possibility of disagreement.

    That doesn't mean humans can't do the job accurately. (After all, if they couldn't, then the initial human-made labels would themselves be wrong and any data based on them meaningless!)

    If the training data is labeled with the same criteria as the test data, it is obviously possible that a trained system can achieve results which more closely agree with the test data. They are being trained on similar data. But that doesn't mean that the system is MORE accurate at detecting spam than humans. It means that the system agrees with a particular human (or set of humans) more than other people do in a labelling of spam/non-spam.

    For all we know, the evaluator's idea of spam is "wrong".

    1. Re:This is just carp. by sholden · · Score: 4, Insightful

      They are learning algorithms. For measuring their accuracy you have to assume that the data is correctly classified so you can see how they do.

      The point is that humans also aren't perfect. Have a person classify 10000 emails and they will make a few mistakes. Point out those mistakes, and they will say "yes, I got that wrong it is an email from my wife reminding me to pick up milk and not a spam trying to sell me printer ink, I must have been day dreaming."

      Just like if you give a person a document and say "find all the spelling errors" they will probably miss some. This is not because they have a different definition of how those words are spelt, it is because they made some mistakes.

      For the training/testing data, some double checking needs to be done to find the mistakes the human classifying it almost certainly made.

      It's a pretty normal situation in any machine learning application, you don't have to be perfect to be as good as a human - after all humans are only human.

  41. Re:Huh? Aren't humans 100%? by dbarclay10 · · Score: 4, Interesting
    How can a spam filter be more accurate than humans? Humans are always the last step in spam filtering.. i use popfile and it catches 99% but it still needs me.. because im the only one capable of identifying spam 100% of the time.

    And if the study posted about is accurate, of those 1% that are left, you will (if you're a perfectly average person) accidentally delete 0.16% of good messages. Surely you've deleted a valid message by accident before? I do it regularly, deleting 25 spam messages with a single good one embedded in it when I just woke up before I had my coffee is not a good thing ;)

    At the very least, if you were given the same data as these tests, that would be true. Consider if you *didn't* use popfile - how many spams would you be deleting every day, and how many good messages would be accidentally deleted? I know that if I had to manually delete the two or three hundred spams interspersed with good messages, my false-positive rate (the percentage of good mail I accidentally deleted) would skyrocket.

    So just be glad you've got popfile. Not only do you not have to go through as much spam, but you're also more accurate while going through the little you must.

    --

    Barclay family motto:
    Aut agere aut mori.
    (Either action or death.)
  42. Re:Huh? Aren't humans 100%? by BillyBlaze · · Score: 2, Insightful

    If you have no spam filters, then classifying email amounts to "delete, delete, delete, delete, down-arrow, delete, delete, down-arrow, delete, delete, whoops!" That one mistake just dropped your average to 90%. Frankly, I'm amazed humans scored as well as they did.

  43. The true test of a spam filter... by GrpA · · Score: 5, Insightful

    Results of new spam filters cannot help but be bogus... The true test of a filter is how well it works *after* all the spammers know how it works and try to circumvent it.

    --
    Enjoy science fiction? "Turing Evolved" - AI, Mecha, Androids and rail-gun battles. What more could you want?
    1. Re:The true test of a spam filter... by Anonymous Coward · · Score: 2, Interesting

      Statistical/Probabilistic filters are adaptive, and are capable of learning new characteristics of spam. This is the biggest difference between SpamAssassin (which has a set of predefined "rules") and these two filters. These filters break down each message into tokens and statistically weigh the tokens based on prior learning. If one of them makes a mistake, you can teach it. AFAIK these have been around for at least a couple years, and have only increased in accuracy over time.
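
      The token-weighing idea described above can be sketched roughly as follows. This is a minimal illustration of a generic naive-Bayes-style filter, not the actual code of CRM114 or DSPAM; the class name `TokenFilter`, the tokenizer regex, and the add-one smoothing are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word-like tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9$!']+", text.lower())


class TokenFilter:
    """Toy statistical filter: learns per-token counts for spam and ham."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct token once per message.
        self.counts[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spam_probability(self, text):
        # Sum the log-odds contributed by each token, with add-one smoothing
        # so that unseen tokens stay neutral instead of dividing by zero.
        log_odds = 0.0
        for token in set(tokenize(text)):
            p_spam = (self.counts["spam"][token] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][token] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


f = TokenFilter()
f.train("cheap viagra buy now", "spam")
f.train("buy cheap pills now", "spam")
f.train("meeting notes attached", "ham")
f.train("lunch at noon tomorrow", "ham")

print(f.spam_probability("buy cheap viagra"))  # leans towards spam
print(f.spam_probability("call me now"))       # misjudged as spam at first
f.train("call me now", "ham")                  # "teach" the filter its mistake
print(f.spam_probability("call me now"))       # now leans towards ham
```

      The last three lines show the "if it makes a mistake, you can teach it" step: retraining on the misclassified message shifts the token weights without anyone editing a rule.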

  44. Spinal Tap? by Anonymous Coward · · Score: 2, Funny

    Hasn't anybody noticed the obvious Spinal Tap reference?

    Jeanine: You know, it might have been better if the, uh, album had been mixed right.
    David: Well I suppose you could cry about that, of course it's true. I mean it's true.
    Jeanine: It wasn't...it was mixed all wrong, wasn't it?
    Nigel: It was mixed wrong?
    Jeanine: Yeah....
    Nigel: Were you there?
    Jeanine: ...you couldn't hear the...
    Nigel: How do you know it was mixed wrong?
    David: But she's...she's heard the...she's heard the record.
    Jeanine: No, but I've heard the album.
    Nigel: So your judgement is that it was mixed wrong.
    Jeanine: You couldn't hear the lyrics all over it.
    David: You don't agree that you can't hear the vocals?
    Nigel: No, I don't. I do not agree. No.
    David: Well I think maybe....
    Nigel: It's interesting that she's bringing it up.
    David: Well she'd like to hear the vocals.
    Nigel: I mean it's like it's me saying, you know, you're using the wrong conditioner for your hair.
    David: Don't be stupid.
    Jeanine: You don't, you don't do heavy metal in dobly, you know, I mean...it's
    Nigel: In what??? In what???
    Jeanine: In dobly...
    Nigel: In dublin!?! What's that?
    David: She means Dolby, alright? She means Dolby, you know? You know perfectly well what she means.

    This spam filter goes to 11!

  45. This spam goes up to 11... by Anonymous Coward · · Score: 2, Funny

    DSPAM implements a Dolby-type noise reduction algorithm called Dobly

    Despite the musical reference on the DSPAM site, I figured some people still won't get the joke. So here it is:

    JEANINE: You don't- you don't do heavy metal in dobly, you know, I mean...it's--
    NIGEL: In what??? In what???
    JEANINE: In dobly...
    NIGEL (GRINS): In doubly!?! What's that?
    DAVID: She means Dolby, alright? She means Dolby, you know? You know perfectly well what she means.


    --from the movie "Spinal Tap"

  46. Re:Bleh. by Mmm_Coco · · Score: 2, Insightful

    Programs outperform humans all the time. Where am I? My GPS knows. What was that person's number? My PDA knows. What is 2365 times 8675309? Just use a calculator: 20517105785. Wow, I was just outperformed three times in the space of a minute.

  47. Re:can it be used with SA? -yes by wideangle · · Score: 5, Informative

    A CRM114 plugin for SA is available, thanks to Devin Nate:

    http://bugzilla.spamassassin.org/show_bug.cgi?id=2301

  48. Image Noise Reduction and Machine Learning by use_compress · · Score: 3, Interesting

    I find it interesting that an algorithm that was originally for image noise reduction found its way to Machine Learning through a company whose purpose is to implement noise reduction in audio. From my Googling, I think this is the first time anyone has used Bayesian Noise Reduction in Machine Learning. Does anyone know otherwise?

  49. Re:Huh? Aren't humans 100%? by helzerr · · Score: 3, Funny
    To these people, a computer is merely an interesting string of sensations.

    Best phrase I've read all week... Oh, yeah, it's only Monday! This one will probably hold me over 'till Friday, though. ;-)

  50. More accurate than what..? by EdMcMan · · Score: 2, Insightful

    If humans don't have 100% accuracy, who/what is defining what spam is?

  51. Let's get this straight people! by mabu · · Score: 4, Insightful

    client/server-side filtering does NOT solve the problem!

    The biggest problem with spam is the invasion of third party computers on the Internet. The ILLEGAL activity spammers perpetrate by breaking into machines, forging headers and hijacking servers.

    Any filtering method does not address this most serious problem, and even if you do not see any spam in your inbox, you're still paying for the bandwidth and system resources these spammers steal.

    Stop with the filtering algorithms and take some of that energy and contact your local Attorney General, DA and FBI and demand that they prosecute these people who are BREAKING THE LAW.

    1. Re:Let's get this straight people! by mabu · · Score: 2, Interesting

      As an ISP that has to try to do my best to provide my clients with "spam free" e-mail, I have to pass these costs onto the clients, whether they're in the form of charges for additional bandwidth or ineffective server-side filtering systems.

      When you filter e-mail at the client or server side based on content, the spammers have no idea that their efforts are truly ineffective. At least RBLs send them a message. Content-based filtering is TOTALLY, TOTALLY ineffective. Yeah, it makes the spam go away for a short period, but it adds the burden of having to deal with legitimate mail being blocked, and you still have to waste 70+% of resources you wouldn't normally need to handle legitimate e-mail. When you're not managing systems that are constantly under attack, you might not realize what a complete fucking mess it is.

      On any given day, I have at least 20-30 probes and attempts to DOS my open ports into breaking down and giving these spammers some form of access. I'm having to build new systems to handle the existing load, not because my clients need more resources, but because the spammers progressively eat up more and more system resources. E-mail IS an almost-instantaneous communication medium. BUT, because of spammers, it no longer is in many cases, especially with larger ISPs. The spammers, because the authorities won't shut them down, are screwing everything up, and content-based filtering is something they LOVE because it's completely ineffective in the long run.

    2. Re:Let's get this straight people! by sootman · · Score: 2, Informative

      Laws don't stop people from driving drunk*, and drunk drivers are in this country and even (by definition) driving out in public, in plain sight of everyone. How, exactly, would US law enforcement prosecute a $NATIONALITY1 spammer who's using a hijacked $NATIONALITY2 computer?

      Laws are fine, but what would *really* work is if everyone were filtering spam, and everyone tells all their newbie friends & relatives what spam is and installs blocking software for them. If sending 1,000,000 spams no longer results in 10 sales, spam *will* stop.

      * yes, laws do stop *some* people from driving while drunk, but laws have not eliminated the problem of drunk driving.

      --
      Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
  52. You joke, but... by Ancient+Devices+King · · Score: 2, Interesting

    I know a guy who has a Korean grad student who doesn't speak English very well. He manages to produce subject lines for the messages he sends that get him blocked by spam filters nearly all the time. Not his fault really, but it happens.

    --
    -"It seems like you're trying to exploit a security hole. Would you like help?"
  53. Re:Huh? Aren't humans 100%? by kfg · · Score: 3, Interesting

    People are jackasses.

    Hence we have spam in the first place.

    KFG

  54. Don't worry by sik0fewl · · Score: 4, Funny

    Don't worry, I can forward you the one she sent me. Sounds like the same email.

    --
    I remember when legal used to mean lawful, now it means some kind of loophole. - Leo Kessler
  55. Re:Huh? Aren't humans 100%? by queen+of+everything · · Score: 3, Funny

    I work with someone who uses her computer every single day, has had an email address for years, and still buys what she reads in an email. Photoshop for $50...sure! Herbal viagra...why not?

    Well, she always has a big smile on her face, maybe there's something to this spam thing.

    --
    "Wisdom is not a product of schooling but of the life-long attempt to acquire it." -Albert Einstein
  56. Re:Huh? by fprefect · · Score: 2, Insightful

    How can you be sure that you've never deleted an important email as spam?

    --
    Matt Slot / Bitwise Operator / Ambrosia Software, Inc.
  57. Re:Huh? Aren't humans 100%? by DougWhite · · Score: 2, Insightful

    Not to sound like a litigation whore, but ...

    I wonder if it would be possible to sue these spammers for interfering with a business transaction. Granted, the amount in question here is minimal, but just the possibility that a spammer could be found liable for this might deter some of them.

    If that doesn't work we should sign up every megacorp CEO on every spammer list possible, and hope s/he misses an important memo costing megacorp millions. Then megacorp could sue spammer into oblivion.

  58. Re:Huh? Aren't humans 100%? by rixstep · · Score: 4, Funny

    Lots of people don't know what popups are.

    Uh, sure they do. Popups - that's like those porn storms, isn't it? Some people say it only happens with IE and Windows, but I talked to my service provider and they told me 'just pull the power plug out of the wall when that happens'.

    Easily fixed.

  59. Re:Huh? Aren't humans 100%? by Trejkaz · · Score: 4, Interesting

    That actually makes humans much more accurate. We can eliminate many of the messages just by looking at the subject.

    The further question is, if humans aren't as accurate as the computer, how are they measuring the accuracy at all? That is, how do they know that the 1 in 6250 messages is wrong, if a human, known to be inaccurate, was testing for accuracy?

    --
    Karma: It's all a bunch of tree-huggin' hippy crap!
  60. One number not enough by blamanj · · Score: 4, Insightful

    Saying an algorithm is x% accurate is not sufficient, because there are two kinds of errors: false acceptance of spam, and false rejection of non-spam. Personally, I'd settle for a filter that caught only 90% of spam if I knew it never rejected legitimate mail, rather than have a program that was 99% accurate at both.
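
    To make the two kinds of error concrete, here is a small sketch that reports them separately from overall accuracy. The function name `error_rates` and the toy labels are assumptions for illustration only; they are not taken from either filter's test harness.

```python
def error_rates(actual, predicted):
    """Return overall accuracy plus the two error rates, each on its own base."""
    pairs = list(zip(actual, predicted))
    spam_total = sum(1 for a, _ in pairs if a == "spam")
    ham_total = sum(1 for a, _ in pairs if a == "ham")
    # False acceptance: spam that was let through as ham.
    false_accepts = sum(1 for a, p in pairs if a == "spam" and p == "ham")
    # False rejection: legitimate mail that was thrown out as spam.
    false_rejects = sum(1 for a, p in pairs if a == "ham" and p == "spam")
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "false_accept_rate": false_accepts / spam_total if spam_total else 0.0,
        "false_reject_rate": false_rejects / ham_total if ham_total else 0.0,
    }


actual = ["spam"] * 5 + ["ham"] * 5
lenient = ["ham"] + ["spam"] * 4 + ["ham"] * 5   # lets one spam through
strict = ["spam"] * 5 + ["spam"] + ["ham"] * 4   # rejects one good message

print(error_rates(actual, lenient))
print(error_rates(actual, strict))
```

    Both toy filters have the same headline accuracy, but only the second one loses legitimate mail, which is the distinction a single number hides.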

  61. How not to evaluate filters by Daniel+Quinlan · · Score: 5, Insightful
    The study referenced is:
    • On the author's mail (where all he does is probably talk about CRM114 and probably does not subscribe to many newsletters or non-technical mailing lists).
    • A pre-trained filter. It can't be compared apples-to-apples with any filter that doesn't require training.
    • Using his own filter on his own mail! Of course it does well.

    ... to mention a few of the problems. The statistics and methodology behind these claims are really questionable. I think Consumer Reports and PC Magazine have both done better evaluations of spam filters (read that however you want).

    Also, I wonder how many people have actually looked at CRM114 and tried to use it.

    The really interesting thing about CRM114 is the windowed polynomial hashing technique it uses, although there's some evidence that it can work just as well (if not better) on a much smaller window of only two tokens. I'm hoping someone will do a full exploration of the idea for SpamAssassin's Bayes module.
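
    As a rough illustration of what a token "window" means here, the sketch below pairs each token with the ones that follow it inside a sliding window. It is a deliberately simplified stand-in and not CRM114's actual hashing scheme; the function name `window_features` and the (token, gap, token) feature shape are assumptions for the example.

```python
def window_features(tokens, window=5):
    """Pair every token with each later token that falls inside the window.

    Each feature records the gap between the two tokens, so "buy ... now"
    with one word in between is distinct from "buy now".
    """
    features = []
    for i, first in enumerate(tokens):
        for gap in range(1, window):
            if i + gap < len(tokens):
                features.append((first, gap, tokens[i + gap]))
    return features


tokens = ["buy", "cheap", "pills", "now"]
print(window_features(tokens, window=2))  # adjacent pairs only
print(window_features(tokens, window=3))  # also pairs that skip one token
```

    With a window of two, the features reduce to plain adjacent token pairs, which is the smaller configuration mentioned above; a wider window produces more features per message at the cost of a larger table to train.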

  62. Re:Huh? by Mysteray · · Score: 2, Insightful
    Would someone like to explain how a program (even if it's right 99.something% of the time) is more accurate than me (100%)?

    That's an easy one. The computer is 10 times better at recognizing what it has decided is spam. We humans are lucky to even be in the same league.

    Now that you understand that, you're one step closer to being "computer literate".

  63. Do we buy viagra 0.16% of the time by nri · · Score: 3, Insightful

    If we humans are only 99.84% accurate, then 0.16% of the time we will incorrectly think the email is real and buy viagra? I don't think so.
    I read the email and delete it. Exactly the same as the spam filters do it, only MORE accurately. I think the tests applied would have been between a human reading the header of an email and deciding whether to open it or not versus the spam filter making the decision for us. BUT the spam filter makes its decision by opening the email. Therefore to have a proper comparison I should be allowed to open the email as well before I make the decision. Therefore I am 100% accurate.

    --
    if :w! doesn't work, try :!cvs commit -m""
  64. The CRM114? by tramm · · Score: 3, Funny

    I bet it allows messages from General Jack D Ripper or any email that contains the secret phrase "purity of essence", "peace on earth" or "precious bodily fluids".

    --
    -- http://www.swcp.com/~hudson/
  65. Re:Huh? Aren't humans 100%? by Trejkaz · · Score: 4, Funny

    Presumably they must use a superhuman who has 100.00% accuracy.

    --
    Karma: It's all a bunch of tree-huggin' hippy crap!
  66. They're trying to sell you something by brucmack · · Score: 2, Insightful

    The thing with spam is that it's supposed to be a way for somebody to make money... i.e. they are trying to sell you something, be it directly or indirectly. I can't think offhand of an email I have recently received that could be misconstrued as trying to sell me something. From that simple viewpoint, spam can never look exactly like regular mail, because it has a different purpose.

  67. Re:Huh? Aren't humans 100%? by gvc · · Score: 3, Interesting
    Last week I ran a spam filter on all the email I received for the last several months. The filter came up with a dozen 'false positives' - messages that I had not flagged as spam when I manually classified them. 11 of them were clearly errors I made in my original classification. The 12th was a solicitation from the alumni association of my alma mater ....

    Before I used a spam filter, I once missed a very important message whose subject line was something to the effect of "URGENT - DON'T REBOOT THIS MORNING." That was a bad one to miss.

    Of course humans make mistakes, and it is entirely possible for an automated or semi-automated system to be more accurate than a human alone.

  68. Re:Huh? Aren't humans 100%? by Stunning+Tard · · Score: 2, Funny

    Maybe it's the 0.16% of people who are responding to spam.

  69. CRM is more than just a spam filter. by k_head · · Score: 2, Informative

    CRM is actually quite a fascinating product. It's like a super grep where you can match against blocks of text instead of just lines. It also has some logic operators and such. I think there is a quote on his web site that refers to it as "grep bitten by a radioactive spider" and it's true.

    You can use it for a lot more than spam processing; it's a really neat all-purpose tool.

    --
    The best way to support the US war effort is to continue buying American products.
  70. Better filter AND make money too ! by PHPhD2B · · Score: 2, Funny
    I have developed a spam filter that is 100 percent (ONE HUNDRED PER CENT) effective at deleting Unwanted Messages.

    In addition, every user will get special discounts on software, mp3s and computer parts with my partners, and two FREE MP3'S every month.

    There are also special savings on 100% all-natural and effective male enhancement products. A portion of the rebates will go towards a $100000 fund needed to get 100,000,000 dollars (ONE HUNDRED MILLION DOLLARS) from Liberia into an account in Switzerland. If you provide your social security number (SSN) and your checking and savings account number you will get part of the ONE HUNDRED MILLION US DOLLARS. Only the first 100 people will qualify, so hurry up and don't miss this offer!

    --
    --I am Sun Tzu of the Borg. Resistance is feudal.
  71. Thats a problem. by geekoid · · Score: 2, Interesting

    If there is no universal bottom line of what Spam is, we can never manage it.

    I think 'unsolicited request for money from a for-profit organization' will fit into everybody's base definition. Some people will expand on it, but we need a defined place to start.

    --
    The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
  72. Re:Huh? Aren't humans 100%? by bhanafee · · Score: 4, Insightful

    No, humans aren't 100% and yes, you can test for that. Try a thought experiment: fill a bin with 50,000 red balls and 50,000 blue balls. Ask a human to sort them all. The result probably won't be 100%, but you can still check the result and figure out how accurate the human is without relying on a superhuman ability to tell the balls apart. Same thing for spam: if you start with a known training set, you can test humans to see how well the spam is identified by manual sorting.

  73. Human accuracy doesn't scale linearly by Kaboom13 · · Score: 5, Insightful

    I'm not surprised a filter beat the human, considering the study used a sample of 5849 messages. As the sample size increases, the filter's accuracy will increase, and the human's will decrease. Furthermore, the higher the spam/real ratio, the better the filter will do in comparison to a human trying to sort at a reasonable speed. The reason is that humans tend to skim, and rarely actually read entire subjects, much less messages. Give a human 5000 messages and an hour and he will probably make some mistakes. On the other hand, in 10 messages, the human will probably be 100% correct. Most email filters rely on this already, as they tend to err on the side of caution. With the bulk of the spam taken out, it is not a burden to have the human check the iffy bits. Furthermore, the type of email can mislead humans. A business-type email sent to someone's personal email is much more likely to be mistaken as spam, and vice versa. The main disadvantage of automated filtering is that people generally have an idea of when a really important e-mail is going to come (the type for which false positives are completely unacceptable) and who it will be from.

  74. Re:Huh? Aren't humans 100%? by Anonymous Coward · · Score: 5, Insightful

    The post quotes "a study" which gives the 99.84% figure. In fact, the 99.84% figure is mentioned in the one paper as "the human author's measured accuracy as an antispam filter...on the first pass". This is what we who understand statistics call "nonsense". An individual human had an estimated accuracy of 99.84% when looking at one particular sample set of data, once. This is not a meaningful number, and it sure as heck ain't "a study".

  75. Re:Huh? Aren't humans 100%? by Pieroxy · · Score: 3, Funny

    I have never deleted an email I meant to keep

    How could you possibly know? You deleted it!!

  76. Re:Huh? Aren't humans 100%? by SLot · · Score: 3, Funny

    Then megacorp could sue spammer into oblivion.

    Or more likely, megacorp fires its mail administrators for being incompetent and goes on about its business.

  77. Re:Huh? by iMoron · · Score: 2, Insightful

    By your definition, every spam message is a mistake for the spam filter because it "reads" all of them (at least to the same extent as it "reads" any non-spam email). The filter is more accurate because it is fast enough to be more thorough than any human can possibly be expected to be. If we could thoroughly analyze hundreds of emails in a matter of seconds, we would have no need for spam filters. We have spam filters because we don't have the time (or the patience, for that matter) to be as careful as a filter.

  78. Re:Huh? Aren't humans 100%? by Marvin_OScribbley · · Score: 4, Funny

    I talked to my service provider and they told me 'just pull the power plug out of the wall when that happens'.

    Ok, now the screen dimmed a little and I heard the hard drive spin down, but the pop ups are still a comin! Oh, and something about "battery level at 98%" or something.

    --
    I'm not a journalist, but I play one on slashdot
  79. Re:Help setting this up by Anonymous Coward · · Score: 2, Funny

    would love to rtfm, but I want a fairly simple answer to this, how can I do a 30 minute job of integrating this into the mozilla mail client, or does it have to be tied into the server itself? I was wondering if this was a quick, easy fix, or if it is an all weekend type of project.

    Most likely it will take at least as long as reading an article. So you might as well not bother.

  80. Re:Huh? Aren't humans 100%? by Elwood+P+Dowd · · Score: 2, Interesting

    Or your dad is an idiot who doesn't know how to route his email.

    But I was only contesting the great-grandparent poster, who said that humans are by definition 100% accurate.

    While my dad may be an idiot, he is also human. I am correct, great-grandparent poster is incorrect, and you are off topic. As far as I can tell, I've never deleted an email I meant to keep either. But you and I aren't the only people worth discussing.

    --

    There are no trails. There are no trees out here.
  81. Re:Huh? Aren't humans 100%? by ergean · · Score: 5, Funny

    There goes my business idea. I wanted to start a business that puts humans in an Eastern European country to sort corporate e-mail.

    Now I have to think again about putting humans to decorticate sunflower seeds; it's cheaper than all those machines.

  82. Re:Huh? Aren't humans 100%? by QuantumFTL · · Score: 4, Interesting

    The further question is, if humans aren't as accurate as the computer, how are they measuring the accuracy at all? That is, how do they know that the 1 in 6250 messages is wrong, if a human, known to be inaccurate, was testing for accuracy?

    I believe that humans can be 100% accurate (or thereabouts) if they read the *ENTIRE* message, however that's exactly the point - if you have to read an entire message to tell that it's spam, the spam has succeeded.

    Their number probably concerns how people can tell without reading the entire message whether or not the message is spam. My brother accidentally deleted a few messages I had sent to him, however if he had read them fully he would have known they were legit.

    Cheers,
    Justin

  83. Re:Huh? Aren't humans 100%? by Andrew+Cady · · Score: 2, Informative

    If every individual human has an accuracy of 99.983%, then two independent humans have an accuracy of 1 - .00017^2 or 99.99999711%. This would allow ample accuracy to judge the computer, except that it's not true[1]. A better answer is the one you suggest: humans must judge spam from subject/author alone, whereas computers get to look at the whole message. Humans reading the whole message, and possibly even following included links, responding, etc., can be assumed to have full accuracy, within epistemic bounds. Indeed, merely re-checking your work, etc. - being consciously more diligent than the average spam-sorter - should ensure your accuracy is better than average.

    As for how accuracy was actually judged in this particular study, I suppose you would have to read the article for that. I haven't, myself...

    [1] It assumes the probability of error is equal for every message (i.e., that error is random rather than systematic), which is obviously not true. The real accuracy of two humans in concert is surely much lower; OTOH, it is still sure to be much, much higher than the accuracy of a single human.
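
    The independent-reviewer arithmetic above can be checked with a few lines. This sketch only covers the idealized case where reviewers' errors are fully independent, which the footnote already flags as unrealistic; the function name `combined_accuracy` is an assumption for the example.

```python
def combined_accuracy(accuracy, reviewers=2):
    """Accuracy when a message is misjudged only if every reviewer misjudges it.

    Assumes each reviewer errs independently with the same probability,
    so this is an upper bound on what real double-checking achieves.
    """
    error = 1 - accuracy
    return 1 - error ** reviewers


single = 0.99983
print(f"{combined_accuracy(single, 2):.10%}")
```

    Any systematic overlap in the mistakes people make (the same confusing subject line fooling both reviewers, say) pulls the real figure below this bound.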

  84. Re:Huh? Aren't humans 100%? by Trejkaz · · Score: 4, Insightful

    But the computer reads the entire message, so it's not really a fair comparison, is it? How many more lines of information was the computer allowed to look at to create its superior result?

    --
    Karma: It's all a bunch of tree-huggin' hippy crap!
  85. Current Spam filters by Anonymous Coward · · Score: 2, Interesting


    Current spam filters may be "10x" better than humans, but current spam filters may be terrible on future spam.

    Filters beating spam and spam beating filters is a continuous arms race. In the limit, optimal spam filtering is equivalent to solving NLP (natural language processing); unless you build a filter that can fully understand the text (syntax, semantics, pragmatics, world knowledge, the whole shebang), an adversary can always construct spam to defeat your filter.

  86. Re:Huh? Aren't humans 100%? by Fnkmaster · · Score: 4, Funny
    Well, she always has a big smile on her face, maybe there's something to this spam thing.


    You mean you've never noticed this before? Idiots are some of the happiest people I know.

  87. Re:X-CRM114-code-prefix: OPE by ajlitt · · Score: 2, Funny

    Please give this man a drink of grain alcohol and rainwater.

  88. Spot the reference... by Maj.+Kong · · Score: 5, Informative
    CRM114 was a piece of encryption gear in Major Kong's...err, my B-52 in the movie Dr. Strangelove. It allowed only properly coded messages to be received by the crew. When the Soviet SAM detonated near the airframe, the CRM114 was damaged and the crew could not get the recall order.
    Kong: (announcing through headset intercom)

    This is your attack profile: to insure that the enemy cannot monitor voice transmission or plant false transmission, the CRM114 is to be switched into all the receiver circuits. Emergency phase code prefix is to be set on the dials of the CRM. This'll block any transmission other than those preceded by code prefix. Stand by to set code prefix.

    ObKubrick: In 2001: A Space Odyssey, one of the pods was marked with the designation CRM-114. And in Clockwork Orange, Alex is injected with serum 114. I suppose CRM-114 is to Kubrick as THX1138 is to Lucas.

    Dobly, on the other hand, is from This is Spinal Tap, a mispronunciation of "Dolby" by David St. Hubbins's girlfriend:

    Jeanine Pettibone: You don't do heavy metal in Dobly, you know.

    Not to mention that it probably avoids trademark infringement (though I wouldn't put it past Dolby Labs or Thomas Dolby to raise a stink).

    Maj. Kong
    --

    Shoot, a fella' could have a pretty good weekend in Vegas with all that stuff.
    1. Re:Spot the reference... by metamatic · · Score: 2, Informative

      In fact, Thomas Dolby was sued for trademark violation by Dolby Labs. The court found in his favor, as he'd been known as "Thomas Dolby" as a nickname since his school days, when he used to play with tape decks all the time.

      --
      GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
  89. Re:Huh? Aren't humans 100%? by po8 · · Score: 4, Informative
    How do you know your training set is correct?

    Good question! We're working on this problem, among other things, at the PSAM project. We have a project to produce high-quality benchmark corpora for spam filter testing. Watch that space for ongoing work, or e-mail us an offer to pitch in and help---we could use it!

  90. Re:Huh? Aren't humans 100%? by Harinezumi · · Score: 5, Informative
    Computers are neither lazy nor pressed for time, and therefore can afford to read and evaluate every single line of every single message. Humans generally can't be bothered to be so diligent, and while they have the ability to get a 100% rate, in most cases they devote so little attention to the task of filtering email that the success rate drops.

    When these factors are considered, I think it's quite possible to write software that in the long run has a higher success rate than a human who has better things to do than filter his mail all day.

  91. Dolby-type noise reduction algorithm called Dobly? by omeomi · · Score: 4, Interesting

    Dolby noise reduction works by filtering a spectrum into a bunch of bands, each of which is compressed (in an audio sense, not in a digital sense), and recorded to tape. On playback, they go through an expander...how does that concept translate to spam filtering? It can't be "dolby-type", that doesn't make any sense...

  92. Digital signatures and a public key infrastructure by Tracy+Reed · · Score: 2, Insightful

    ...are still the only real solution to the issue of trust, reputation, and accountability on the Internet. We need it for so many other things in addition to guaranteeing email legitimacy.

    If every user or at least every server had a key and we all signed each other's keys, creating a web of trust, and only accepted signed and trusted mail, the spam problem would be solved. I really dislike the way SSL certificates are handed out. A central CA is a very bad idea due to the cost and browser lock-in issues etc. With GPG and web of trust, if you want to run a mail server you need to talk to a friend who is already running one and get them to sign your key. Perhaps we could even use DNS to propagate and cache the keys and sigs. If you sign a key that turns out to be a spammer's, you'd better revoke that signature fast before the person upstream from you revokes yours. Problem solved. Now if only we could get the big guys to go along with it...

  93. Not the best idea by Vainglorious+Coward · · Score: 5, Insightful

    What you're planning has already been done, it's called TMDA, and it's not such a good idea. You're going to send out 800 "challenge" emails per day - have you given any thought to how many of those will be genuine addresses, but have nothing to do with the spam you receive because they just happen to be the joe-job victim? These kinds of challenge/response systems may slightly alleviate your own suffering through spam, but at a cost to all those unfortunate enough to have had their email addresses faked. And if the sheer impoliteness of such net behaviour doesn't put you off, note that you're using up more of your own bandwidth to send out such challenges.

    If any of the smtp exchange or address lookup fails, just forget it, they're probably not real anyway

    It would make a lot more sense to make these kinds of checks when you're receiving the email in the first place. Reject at the SMTP level - you never accept and process the spam in the first place.

    --
    My next sig will be ready soon, but subscribers can beat the rush
    1. Re:Not the best idea by warrax_666 · · Score: 2, Interesting

      I don't think SMTP allows for a "reject" after getting to the DATA portion of the SMTP transaction. That prevents most (effective) spam filters from working at SMTP time. If it were possible, wouldn't everybody be doing this?

      Hmm... maybe it's time to update SMTP to allow for this? (Sure, bandwidth is still being consumed, but at least legitimate senders would know that their message didn't get through because of "spamminess")

      --
      HAND.
    2. Re:Not the best idea by Continental+Drift · · Score: 2, Interesting

      I disagree. I think that a white list with challenge auto-replies, as I use, is clearly effective and adds just a little to mail traffic. I encourage others to use such a system, which would eliminate problems from having the spam reply-to being a real address. Since applying this scheme, I've gotten exactly one spam message in my inbox. That's an excellent percentage.

    3. Re:Not the best idea by Vainglorious+Coward · · Score: 2, Insightful

      I've gotten exactly one spam message in my inbox. That's an excellent percentage.

      Excellent *for you* that is. How many unwanted emails have you sent out to joe-job victims? Here's my basic problem - after black/white list weeding, you're always left with a body of messages that you need to decide what to do with. Rather than taking on that burden yourself, you lay it off on others. That's just plain rude, and little different than the MO of a spammer - "let other people bear the costs of my own selfish actions"

      --
      My next sig will be ready soon, but subscribers can beat the rush
  94. It _can't_ know which pr0n I think is spam vs good by ron_ivi · · Score: 4, Funny
    I signed up for lots of junk mail lists; some solicited, some not -- sometimes from the same organizations.

    How would it know if I consider brunettes non-spam but blondes spam? I did opt-in for one of those email categories, but not the other.

  95. Re:Case study in linguistics by acb · · Score: 2, Insightful

    From what I gather, Pinker's theory is that language is implemented by a dedicated module in the human brain. This module is just neurological hardware, operating entirely by physical means, and does not invoke any sort of deus ex machina; therefore, what it does is an algorithm.

    The language module does invoke other parts of the brain, such as general knowledge; however, there's nothing in the process that depends on it being in a human brain. Given that cognition is a physical process, one could postulate a computer program that could achieve the same results, even if drawing on a very large database of cultural information. The suggestion that language is "innately human" sounds a bit too much like carbon chauvinism, the belief that intelligence is an exclusive property of carbon-based life.

  96. Re:Huh? Aren't humans 100%? by Trejkaz · · Score: 3, Interesting

    I dunno. I'm running CRM114 now, and it's taking something like 1.5 seconds to identify emails. I am on a slow machine though, which used to do SpamAssassin at around 4 seconds, and inaccurately to boot. CRM114 is a big improvement, and if it trains well after the first fortnight I'll kiss TMDA goodbye.

    --
    Karma: It's all a bunch of tree-huggin' hippy crap!
  97. Re:Huh? Aren't humans 100%? by fferreres · · Score: 2, Informative

    Yes, but it is meaningful nonetheless. If you just think that it's very likely that after reviewing 650 messages, you may have missed one email that you thought was spam, then the "study" is right. I don't care if the number is 900 or 400 emails. Those 400 mails are making me lose a _lot_ of time, and if I value my time, I am losing a lot of productivity, and also missing an important email.

    If the program can have a .99 accuracy, then it's a real time saver, and if it only makes a mistake every 2000 emails, then SURELY it will be more accurate than me. That depends, of course, on how much spam you get. I get around a 20 to 1 ratio of spam to real meat, and I get around 100 spam messages a day. I can't spend 1 hour a day cleaning spam with 99.9% accuracy, so I am forced to quick sweep. This thing could make me regain the time, and its false positives would mean I make even fewer mistakes than I do manually.

    The important thing is how accurate the antispam tool is, and how accurate I am (ratio of spam to meat, and how much a miss costs me). How often other people make mistakes is not really that important. Everybody knows how much time they have, and how much spam to meat they have, and thus, it's very likely that if they don't have a LOT of time to waste, they will be making a mistake for every 200 to 600 spam messages.
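To put the numbers in this comment side by side, here is a rough sketch that turns "one mistake per N messages" into "days between mistakes". It assumes the figures quoted above (about 100 spam a day, a 20 to 1 spam-to-real ratio, a hurried human erring once per 200 to 600 spam, a filter erring once per 2000 emails); the function name is made up for illustration.

```python
def days_between_mistakes(messages_per_day, messages_per_mistake):
    """Average number of days that pass between two misclassifications."""
    return messages_per_mistake / messages_per_day


spam_per_day = 100
real_per_day = spam_per_day / 20          # 20 to 1 ratio, so about 5 real mails

# Quick manual sweep: one mistake per 200 to 600 spam messages.
human_worst = days_between_mistakes(spam_per_day, 200)
human_best = days_between_mistakes(spam_per_day, 600)

# Filter: one mistake per 2000 emails of any kind.
filter_days = days_between_mistakes(spam_per_day + real_per_day, 2000)

print(f"manual sweep: a mistake every {human_worst:.0f} to {human_best:.0f} days")
print(f"filter: a mistake roughly every {filter_days:.0f} days")
```

Under those assumptions the filter goes several times longer between mistakes than the quick manual sweep, which is the comparison the comment cares about.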

    --
    unfinished: (adj.)
  98. Re:Help setting this up by PugMajere · · Score: 3, Informative

    Umm, Fetchmail + procmail on your local machine?

    Not sure exactly why you need a pop3 proxy involved, just use Fetchmail to deliver locally, run things through procmail.

    Set your local mailserver (sendmail/qmail/postfix/exim/whatever) to use your ISP's SMTP server as a smarthost, and it'll send everything it doesn't recognize as local off to them to handle.

  99. Share the luxury by bigberk · · Score: 5, Interesting

    Having such a powerful statistical spam filter is definitely a luxury. I have no difficulty believing the accuracy values presented here. I have had experience with spamprobe, CRM114, bogofilter, spambayes, and spamassassin and all of these do an amazing job to the point where spam no longer exists (for you).

    Which leads to me plug a little project called WPBL that uses exactly these types of statistical spam filters to spot spam sources in a distributed fashion. Each project member uploads hourly the IPs they see relaying spam and non-spam, where the 'decision' is made by these extremely reliable filters. This effectively converts your regular mail account into an intelligent spam-trap that feeds a central blocklist.

    The more members we get, the better we can identify active spam sources around the world. This information is then used by some sites for quite large-scale blocking. Since you're doing all this filtering processing anyway, why not also share "what you learn" (the IPs that are spamming you)?

    If this grabs your interest, read up on the reporting scripts or alternatively, the open WPBL data upload protocol if you want to code your own report generator. Bandwidth usage is minimal.
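For anyone wondering what the central side of such a scheme might look like, here is a minimal sketch of aggregating member reports into a blocklist. It is not the actual WPBL upload protocol or listing policy: the report tuples, the thresholds and the function name are all assumptions made for illustration.

```python
from collections import defaultdict


def build_blocklist(reports, min_spam=3, max_ham_ratio=0.1):
    """Combine (ip, spam_count, ham_count) reports into a sorted blocklist.

    Hypothetical policy: list an IP once members have seen at least
    `min_spam` spam messages from it and hardly any legitimate mail.
    """
    totals = defaultdict(lambda: [0, 0])
    for ip, spam, ham in reports:
        totals[ip][0] += spam
        totals[ip][1] += ham

    blocked = []
    for ip, (spam, ham) in totals.items():
        if spam >= min_spam and ham <= max_ham_ratio * spam:
            blocked.append(ip)
    return sorted(blocked)


# Example hourly reports from three members (documentation-range addresses).
reports = [
    ("192.0.2.1", 5, 0),
    ("192.0.2.1", 2, 0),
    ("198.51.100.7", 1, 4),   # mostly legitimate mail, never listed
    ("203.0.113.9", 2, 0),    # not enough spam seen yet
]
blocklist = build_blocklist(reports)
```

Counting non-spam alongside spam is what keeps a busy legitimate relay from being listed on the strength of a few bad messages.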

  100. Re:Huh? Aren't humans 100%? by bananahammock · · Score: 2, Insightful

    That should explain why Dubya's always smiling even when he's trying to be serious.

  101. Re:Help setting this up by SethJohnson · · Score: 4, Insightful


    ModernGeek,

    I recommend you stick with hotmail. Dabbling in stuff like SpamAssassin is going to be just too much work for someone as lazy as you sound. Apple makes a good built-in spam filter on its Mail client app. Why don't you go there?
  102. Sample by Anonymous Coward · · Score: 2, Insightful

    I say get a bunch of honeypots and do the test again.

    A human doesn't have to determine if it's spam simply by the title.

    The human should have all the advantages these filters have: body / header / IP.

    Cheers

  103. Well by DRACO- · · Score: 3, Insightful

    Well if the human was given the chance to read the body text as well like the filters do, then they would be 100% able to delete their own spam.

    DRACO-

    --
    Consider yourself blessed if you are sneezed on by a dragon and only get wet, it could have been a fireball.
  104. lies, damned lies, and... by stile · · Score: 2, Insightful

    statistics.

    This headline is misleading. I refuse to RTFA, because I imagine the "10 times as effective" figure comes from the article itself.

    Come on, folks. The figures do, in fact, show a 10 times increase in effectiveness between humans and these filters. But what the heck does that mean? I have to question the studies. How did they come up with this 99.84% figure? Does it mean that one person will mis-classify about 16 emails in 10000 (a small number indeed)? Or did one or two outliers taint the data?

    The important thing here is that we're comparing three averages. Were the conditions between the trials the same? Were the humans given time limits? Were the accounting methods accurate? Were the spam messages the same?

    It's quite possible that these averages were bounded by possible error quantities (they should have been!) and that these were tossed when reporting the numbers to us. This was so that a startling result (10 times as effective as a human) could be shown in a headline. It's all about coming up with a flashy "fact".

    It's very easy to make numbers say what you want them to say, so I'd be a little wary of running around to your friends "citing" this 10x improvement figure without doing some deep delving into the processes involved in arriving at the number.
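For reference, the arithmetic behind the headline can be checked in a few lines. This sketch only uses the percentages quoted in the story summary (99.84% for humans, 99.984% for the filters) and says nothing about how those figures were measured; the function name is made up for illustration.

```python
def errors_per(accuracy_percent, messages):
    """Expected misclassifications in `messages` mails at a given accuracy."""
    return (100.0 - accuracy_percent) / 100.0 * messages


human = errors_per(99.84, 10000)      # about 16 per 10000
filt = errors_per(99.984, 10000)      # about 1.6 per 10000
ratio = human / filt                  # how many times fewer errors

# "1 misclassification in N messages" for the filter figure.
one_in = 1 / ((100.0 - 99.984) / 100.0)

print(f"human: about {human:.1f} errors per 10000 messages")
print(f"filter: about {filt:.1f} errors per 10000 messages")
print(f"error ratio: about {ratio:.0f}x, or 1 error in about {one_in:.0f} messages")
```

So "10 times as accurate" here means ten times fewer errors, not ten times the accuracy percentage, and the calculation carries no error bars at all.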

  105. Re:Huh? Aren't humans 100%? by gujo-odori · · Score: 3, Informative

    I write spam filters for a living, and I promise you that they can eliminate many of the spams just by looking at the subject too.

    Of course, so can I. Now, since I write the filter based on my human judgement of what constitutes spam, which is more accurate?

  106. Re:Huh? Aren't humans 100%? by R.Caley · · Score: 4, Insightful
    fill a bin with 50,000 red balls and 50,000 blue balls. Ask a human to sort them all.

    Not comparable. The job of a junk mail filter is to drop things I don't want to read. It is trying to match my evaluation, not to match a semi-objective criterion like red or blue.

    If I read 1000 messages and say which I wish I hadn't read, then I am 100% accurate by definition.

    Of course, if they are really talking about a pure spam filter -- ie one which identifies unsolicited commercial email -- then they can be more accurate than me, but at an uninteresting, perhaps even counter-productive, task:

    I may get unsolicited commercial email I do want to read one day. Almost happened once (I had inadvertently signed up for it, so it was not really unsolicited, and I didn't actually buy the piece of kit they had on special offer that week, but was tempted). I also get stuff I don't want which isn't spam (notably email from virus infected machines).

    The referenced study seems to be a very sloppy job from this POV. They don't define what their criterion of success is, and to the extent they put in a hand waving attempt it is clearly nonsense:

    Because spam (sometimes termed "unsolicited commercial email" or "marketing messages") is neither expected nor desired[...]
    `Unsolicited' does not imply `not desired'. If they don't tease those two apart, they can't get interesting results for real world applications. Eg, someone mailing my work address with a commercial proposition may well be a very welcome unsolicited commercial email.
    --
    _O_
    .|<
    The named which can be named is not the true named
  107. Overkill by mdfst13 · · Score: 2, Interesting

    We don't need to trust the *person* sending the mail. It would be sufficient to trust the machine that is doing so.

    Look at http://spf.pobox.com/ which is sufficient. With SPF, you know that if you are getting SPAM saying it is from @ultraviolet.org, then it really is from @ultraviolet.org (or at least someone who ultraviolet.org trusts).

    Your solution requires a certain level of technical proficiency (setting up and managing the key) of *all* participants. SPF's solution only requires technical proficiency from those who manage DNS settings and those who manage email servers (in particular the person who manages your email server).

    Also, what about *stolen* keys? And who handles key checking? SSL certificates are restricted to a few root signers, but you don't want a central certificate authority. PGP/GPG work well because they only involve small numbers of people. In general, you know the person directly. Occasionally it will be a friend of a friend message. What do you do when the chain is 10 or a 100 or a 1000 keys long? How long will it take for you to find out that 978 has since revoked their signature for 977 (counting in steps from you, i.e. you are 0 and 1000 is the original signer of this chain)? Or how long will it take you to verify all 1000 keys if you try to do it real time (i.e. when you get the message)?
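A rough sketch of the machine-level check that the SPF paragraph above relies on: the receiving server compares the connecting IP against the address ranges the claimed domain has published. A real SPF check fetches the domain's policy from DNS and handles more mechanisms than plain IP ranges; here the policy is a hard-coded dictionary, and the ranges listed for ultraviolet.org are invented for the example.

```python
import ipaddress

# Hypothetical published policies. A real check would look these up in DNS;
# the ranges below are documentation addresses, not anyone's real servers.
PUBLISHED = {
    "ultraviolet.org": ["192.0.2.0/24", "198.51.100.25/32"],
}


def sender_permitted(domain, client_ip, policies=PUBLISHED):
    """Return True/False if the domain lists ranges, None if it has no policy."""
    networks = policies.get(domain)
    if networks is None:
        return None  # domain publishes nothing, so this check cannot decide
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in networks)


ok = sender_permitted("ultraviolet.org", "192.0.2.44")
forged = sender_permitted("ultraviolet.org", "203.0.113.5")
unknown = sender_permitted("example.net", "203.0.113.5")
```

Nothing in this check needs a per-user key, which is the point being made about where the technical burden falls.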

  108. 10 times...? by holizz · · Score: 2, Funny

    If humans are 99.84% accurate and these filters are ten times as accurate, wouldn't that make these filters 998.4% accurate or am I missing something?

  109. Re:Help setting this up by gwynevans · · Score: 2, Informative

    Sounds like POPfile was what you were actually looking for!