Two Spam Filters 10 Times As Accurate As Humans

Posted by timothy on Monday February 23, 2004 @01:13PM from the dev/null-is-getting-fatter dept.

Nuclear Elephant writes "The authors of two spam filters, CRM114 and DSPAM, announced recently that their filters have achieved accuracy rates ten times better than a human is capable of. Based on a study by Bill Yerazunis of CRM114, the average human is only 99.84% accurate. Both filters are reported to have reached accuracy levels between 99.983% and 99.984% (1 misclassification in 6250 messages) using completely different approaches (CRM114 touts Markovian, while DSPAM implements a Dolby-type noise reduction algorithm called Dobly). If you're looking for a way to rid your inbox of spam, roll on over to one of these authors' websites."

Re:Huh? Aren't humans 100%? by MarkJensen · 2004-02-23 13:17 · Score: 5, Informative

I haven't been 100% accurate.

I received an email from my sister-in-law from her work, and the address looked suspicious (one of those weird-looking "letter and number" jumbles).

I deleted it. It happens.
Re:Huh? Aren't humans 100%? by msgmonkey · 2004-02-23 13:17 · Score: 2, Informative

Humans sometimes make mistakes, that's where the inaccuracy comes from.
Number of significant digits... by jsimon12 · 2004-02-23 13:18 · Score: 4, Informative

Human=99.84
New proggie=99.984

So the human misses .16% and the machine only misses .016%; hence the machine is 10 times better.
Re:2+2=3 by Celandro · 2004-02-23 13:19 · Score: 3, Informative

No, you are just bad at math
1 - .9984 = .0016
1 - .99984 = .00016

A factor of 10 in reduced error rates

160 errors per 100 thousand vs 16.
It is 10 times better by flicken · 2004-02-23 13:21 · Score: 2, Informative

Think of it in terms of an error rate:
100% - 99.84% = 0.16%
100% - 99.984% = 0.016%
0.16% = 10 * 0.016%
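
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `error_rate` is made up for illustration, and exact fractions are used so that rounding does not blur the factor of ten.

```python
from fractions import Fraction

def error_rate(accuracy_percent: str) -> Fraction:
    """Return the error rate as an exact fraction, given accuracy in percent."""
    return 1 - Fraction(accuracy_percent) / 100

human = error_rate("99.84")    # 0.16% of messages misclassified
filt = error_rate("99.984")    # 0.016% of messages misclassified

print(human / filt)            # ratio of the two error rates -> 10
print(1 / human, 1 / filt)     # messages per misclassification -> 625 6250
```

The second line of output matches the "1 misclassification in 6250 messages" figure from the story, and puts the human at roughly one mistake per 625 messages.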

--
20 mil and I will! Learn Esperanto with 20M others.
Re:How can a human be wrong? by pclminion · 2004-02-23 13:27 · Score: 4, Informative

No matter what, in the end, the human CAN'T be wrong... right?
[*Bing* -- mail from VP of sales pops into my inbox. Subject: "Making money fast!"]
[*Bam* -- I hit delete, thinking "Stupid Spam!"]
Ahh, shit! Lookie, a human screwed up.
The filter would have actually examined the message and probably decided that it was legitimate.
Re:IM Spam by Vancouverite · 2004-02-23 13:29 · Score: 2, Informative

Far too late for that. ICQ has had IM Spam for some time, as has Yahoo, MSChat, and AOL.

What *will* happen is that trawling robots will now also trawl for IM addresses, rather than just email addresses. As it is, only deliberate IM spammers (who are usually in an IM chat group with an intellectually stimulating name such as "Yung Hunnies 4 Married Men") are harvesting the IM addresses that show up in these chat groups. In the future, don't have your ICQ # or Jabber ID on your website, or you are setting yourself up for more spam.

Hmmm... a use for reverse 3133t spelling? "Contact me at ICQ #lEloAAT" (1310447)

--
We are the Music Makers, and We are the Dreamers of Dreams...
Re:Spamassassin by Anonymous Coward · 2004-02-23 13:33 · Score: 1, Informative

It's not a single approach: Mr. Yerazunis's setup for CRM114 sits behind several DNS blacklists, which pre-filter a huge amount of it. (I know his sys-admin.)

But it is far superior to SpamAssassin because it now examines groups of words. The short phrases and words identified by SpamAssassin are avoided by spammers, who are now adding huge amounts of un-displayed random text and terrible HTML tricks to avoid SpamAssassin and similar filters and to avoid the various hash functions that detect familiar phrases.
Re:can it be used with SA? -yes by wideangle · 2004-02-23 13:41 · Score: 5, Informative

A CRM114 plugin for SA is available, thanks to Devin Nate:

http://bugzilla.spamassassin.org/show_bug.cgi?id=2301
Re:can it be used with SA? by Scott Laird · 2004-02-23 13:46 · Score: 2, Informative

My personal problem with SA is that it's really just a muddled average of a bunch of guessed-at filters for recognizing spam. The individual filters aren't very accurate, but the idea is that the average across a bunch of filters will be more accurate than any individual filter.

Bayes-based filters, on the other hand, directly calculate the probability of specific words appearing in spam vs. non-spam messages. Newer versions calculate the probability of short phrases, HTML tags, and mail headers as well. There's no guesswork involved (unlike SA)--if you feed them enough of yesterday's spam, then they're going to be really good with today's and tomorrow's spam. The spammers keep evolving, so sooner or later messages will get through, but the filters keep evolving, too, and it's really hard to beat a good filter these days.

I've been using SpamProbe for almost 6 months, and it's amazingly accurate. I haven't had a false positive in months, and I only see a couple false negatives per month.
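
To make the "probability of specific words appearing in spam vs. non-spam" idea concrete, here is a minimal sketch. It is not how SpamProbe, CRM114, or DSPAM are actually implemented. The function names, the toy messages, the add-one smoothing, and the assumption that spam and non-spam are equally likely a priori are all illustrative choices.

```python
from collections import Counter

def train(messages):
    """Count in how many messages each lowercase word appears."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts

def spam_probability(word, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | word) from per-class message frequencies.

    Add-one smoothing keeps unseen words at 0.5 rather than at 0 or 1.
    Equal prior odds of spam and non-spam are assumed.
    """
    p_word_spam = (spam_counts[word] + 1) / (n_spam + 2)
    p_word_ham = (ham_counts[word] + 1) / (n_ham + 2)
    return p_word_spam / (p_word_spam + p_word_ham)

# Hypothetical toy corpora standing in for "yesterday's" mail.
spam = ["make money fast", "cheap pills fast", "money back guarantee"]
ham = ["meeting moved to monday", "quarterly sales numbers attached", "lunch on monday"]

spam_counts, ham_counts = train(spam), train(ham)
for word in ("money", "monday", "unseen"):
    p = spam_probability(word, spam_counts, ham_counts, len(spam), len(ham))
    print(f"{word}: {p:.2f}")   # money: 0.75, monday: 0.25, unseen: 0.50
```

A real filter would combine the scores of many tokens per message and would tokenize phrases and headers as well, as described above; this only shows where the per-word numbers come from.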
Re:Could somebody explain this to me... by caseih · 2004-02-23 13:56 · Score: 4, Informative

If you don't control the mail server to create aliases for yourself, you can also append RFC-compliant suffixes to your e-mail address. For example:
foobar+dellorders@mydomain.com.
CRM is more than just a spam filter. by k_head · 2004-02-23 14:12 · Score: 2, Informative

CRM is actually quite a fascinating product. It's like a super grep where you can match against blocks of text instead of just lines. It also has some logic operators and such. I think there is a quote on his web site that refers to it as "grep bitten by a radioactive spider" and it's true.

You can use it for a lot more than spam processing; it's a really neat all-purpose tool.

--
The best way to support the US war effort is to continue buying American products.
Re:Adaptive adversaries by kindbud · 2004-02-23 14:28 · Score: 2, Informative

Bayesian filters tend to work very well, but now spammers are adding sprawls of randomly generated green-light text to offset the filter's score.

That does not work. If anything, it makes the spam easier to identify, especially dictionary-salad-type spams that just list random words, most of which real people hardly ever use in actual emails. Dictionary salad just gives the Bayesian classifier more spam terms to work with. The rest of the terms, the ones that are common in real emails, converge on a neutral score real quick, and simply stop counting one way or another.
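
A rough sketch of why neutral padding "stops counting": one common combining rule (an assumption here, not confirmed as what any particular filter does) looks only at the tokens whose spam probability is farthest from 0.5 and multiplies those together. The probabilities below are invented for illustration.

```python
from math import prod

def combined_score(token_probs, top_n=15):
    """Combine per-token spam probabilities, using only the most extreme ones."""
    # Rank tokens by distance from the neutral value 0.5 and keep the top few.
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    spammy = prod(interesting)
    hammy = prod(1 - p for p in interesting)
    return spammy / (spammy + hammy)

spam_terms = [0.99, 0.97, 0.95]   # hypothetical scores for the actual sales pitch
padding = [0.5] * 200             # common words that have converged on neutral

plain = combined_score(spam_terms)
padded = combined_score(spam_terms + padding)
print(plain, padded)              # the padding leaves the score unchanged
```

A token at exactly 0.5 multiplies both the "spammy" and "hammy" products by the same factor, so it cancels out of the ratio. Padding only helps the spammer if it contains words that score well below 0.5 in the recipient's own mail.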

--
Edith Keeler Must Die
Re:Adaptive adversaries by JuggleGeek · 2004-02-23 14:36 · Score: 2, Informative

But when a single solution becomes mainstream, spammers will adapt to it. Bayesian filters tend to work very well, but now spammers are adding sprawls of randomly generated green-light text to offset the filter's score.
I can't see how that would change anything. The "bad" keywords are still in the spam. The gobbledy-gook words (usually short clips of random books/stories/something) are legitimate words, but aren't very likely to have a high coincidence with words found in my legitimate email.
I'm not using bayesian filtering, but I can't see those making much difference.
human == correspondence secretary by Anonymous Coward · 2004-02-23 14:39 · Score: 1, Informative

All who are harping on about human spam detection rates, the article states:

"By comparison, a human is only about 99.84% accurate in filtering spam and nonspam, so any of these filters is more effective than a human "correspondence secretary"."

So, they define "human" to be a secretary, not an uber geek.
Re:Huh? Aren't humans 100%? by Andrew Cady · 2004-02-23 15:14 · Score: 2, Informative

If every individual human has an accuracy of 99.983%, then two independent humans have an accuracy of 1 - .00017^2 or 99.99999711%. This would allow ample accuracy to judge the computer, except that it's not true[1]. A better answer is the one you suggest: humans must judge spam from subject/author alone, whereas computers get to look at the whole message. Humans reading the whole message, and possibly even following included links, responding, etc., can be assumed to have full accuracy, within epistemic bounds. Indeed, merely re-checking your work, etc. - being consciously more diligent than the average spam-sorter - should ensure your accuracy is better than average.

As for how accuracy was actually judged in this particular study, I suppose you would have to read the article for that. I haven't, myself...

[1] It assumes the probability of error is equal for every message (i.e., that error is random rather than systematic), which is obviously not true. The real accuracy of two humans in concert is surely much lower; OTOH, it is still sure to be much, much higher than the accuracy of a single human.
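
The two-reviewer figure can be reproduced with a short calculation. This sketch bakes in exactly the independence assumption that footnote [1] calls unrealistic, so it gives an upper bound rather than a prediction; the function name is made up for illustration.

```python
def combined_accuracy(single_accuracy, reviewers=2):
    """Accuracy when a message is misjudged only if every reviewer errs,
    assuming their errors are independent of one another."""
    error = 1 - single_accuracy
    return 1 - error ** reviewers

print(f"{combined_accuracy(0.99983):.8%}")   # 99.99999711%
```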
Re:Could somebody explain this to me... by Fnkmaster · 2004-02-23 15:53 · Score: 3, Informative

Unfortunately, even though it's RFC-compliant, I've found probably half the sites I have to give my email address to won't grok the username+filtername@mydomain.com syntax. It's convenient when it works, but it doesn't work enough to rely on. No, throw-away spam-bait email addresses that you use for 6 months at a time for all online ordering and the like, then eventually trash when they get too spam-ridden are the best solution I know of.
Spot the reference... by Maj. Kong · 2004-02-23 16:09 · Score: 5, Informative

CRM114 was a piece of encryption gear in Major Kong's...err, my B-52 in the movie Dr. Strangelove. It allowed only properly coded messages to be received by the crew. When the Soviet SAM detonated near the airframe, the CRM114 was damaged and the crew could not get the recall order.

Kong: (announcing through headset intercom)

This is your attack profile: to insure that the enemy cannot monitor voice transmission or plant false transmission, the CRM114 is to be switched into all the receiver circuits. Emergency phase code prefix is to be set on the dials of the CRM. This'll block any transmission other than those preceded by code prefix. Stand by to set code prefix.

ObKubrick: In 2001: A Space Odyssey, one of the pods was marked with the designation CRM-114. And in Clockwork Orange, Alex is injected with serum 114. I suppose CRM-114 is to Kubrick as THX1138 is to Lucas.

Dobly, on the other hand, is from This is Spinal Tap, a mispronunciation of "Dolby" by David St. Hubbins's girlfriend:

Jeanine Pettibone: You don't do heavy metal in Dobly, you know.

Not to mention that it probably avoids trademark infringement (though I wouldn't put it past Dolby Labs or Thomas Dolby to raise a stink).

Maj. Kong

--

Shoot, a fella' could have a pretty good weekend in Vegas with all that stuff.
1. Re:Spot the reference... by metamatic · 2004-02-24 11:19 · Score: 2, Informative
  
  In fact, Thomas Dolby was sued for trademark violation by Dolby Labs. The court found in his favor, as he'd been known as "Thomas Dolby" as a nickname since his school days, when he used to play with tape decks all the time.
  
  --
  GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
Re:Huh? Aren't humans 100%? by po8 · 2004-02-23 16:27 · Score: 4, Informative

How do you know your training set is correct?

Good question! We're working on this problem, among other things, at the PSAM project. We have a project to produce high-quality benchmark corpora for spam filter testing. Watch that space for ongoing work, or e-mail us an offer to pitch in and help---we could use it!
Re:Huh? Aren't humans 100%? by Harinezumi · 2004-02-23 16:29 · Score: 5, Informative

Computers are neither lazy nor pressed for time, and therefore can afford to read and evaluate every single line of every single message. Humans generally can't be bothered to be so diligent, and while they have the ability to get a 100% rate, in most cases they devote so little attention to the task of filtering email that the success rate drops.
When these factors are considered, I think it's quite possible to write software that in the long run has a higher success rate than a human who has better things to do than filter his mail all day.
Re:Huh? Aren't humans 100%? by fferreres · 2004-02-23 17:43 · Score: 2, Informative

Yes, but it is meaningful nonetheless. If you just think that it's very likely that after reviewing 650 messages, you may have missed one email that you thought was spam, then the "study" is right. I don't care if the number is 900 or 400 emails. Those 400 mails are making me lose a _lot_ of time, and if I value my time, I am losing a lot of productivity, and also missing an important email.

If the program can have a .99 accuracy, then it's a real time saver, and if it only makes a mistake every 2000 emails, then SURELY it will be more accurate than me. That depends, of course, on how much spam you get. I get around a 20 to 1 ratio of spam to real meat, and I get around 100 spam messages a day. I can't spend 1 hour a day cleaning spam with 99.9% accuracy, so I am forced to do a quick sweep. This thing could let me regain the time, and its false positive rate would mean I make even fewer mistakes than I do manually.

The important thing is how accurate the antispam tool is, and how accurate I am (ratio of spam to meat, and how much a miss costs me). How often other people make mistakes is not really that important. Everybody knows how much time they have, and how much spam to meat they have, and thus, it's very likely that if they don't have a LOT of time to waste, they will be making a mistake for every 200 to 600 spam messages.

--
unfinished: (adj.)
Re:Let's get this straight people! by sootman · 2004-02-23 17:50 · Score: 2, Informative

Laws don't stop people from driving drunk*, and drunk drivers are in this country and even (by definition) driving out in public, in plain sight of everyone. How, exactly, would US law enforcement prosecute a $NATIONALITY1 spammer who's using a hijacked $NATIONALITY2 computer?

Laws are fine, but what would *really* work is if everyone were filtering spam, and everyone tells all their newbie friends & relatives what spam is and installs blocking software for them. If sending 1,000,000 spams no longer results in 10 sales, spam *will* stop.

* yes, laws do stop *some* people from driving while drunk, but laws have not eliminated the problem of drunk driving.

--
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
Re:Help setting this up by PugMajere · 2004-02-23 18:13 · Score: 3, Informative

Umm, Fetchmail + procmail on your local machine?

Not sure exactly why you need a pop3 proxy involved, just use Fetchmail to deliver locally, run things through procmail.

Set your local mailserver (sendmail/qmail/postfix/exim/whatever) to use your ISP's SMTP server as a smarthost, and it'll send everything it doesn't recognize as local off to them to handle.
Re:Huh? Aren't humans 100%? by gujo-odori · 2004-02-23 22:26 · Score: 3, Informative

I write spam filters for a living, and I promise you that they can eliminate many of the spams just by looking at the subject too.

Of course, so can I. Now, since I write the filter based on my human judgement of what constitutes spam, which is more accurate?
Re:Could somebody explain this to me... by mdfst13 · 2004-02-24 00:59 · Score: 3, Informative

username+filtername@domain.com should go to username@domain.com as per the RFC (the +filtername is carried but not used by servers, or at least it shouldn't be). Some email clients will allow you to use this for such things as folder sorting (i.e. username+foldername goes into foldername automatically). If this worked consistently, it would be good for people who don't have the ability to make more usernames.

AFAIK, username-filtername will still just go to username-filtername, i.e. you have to configure your mail server to handle username-filtername separately from username. This works great when you can specify as many usernames as you want (i.e. if you manage your own server or have a catch-all on your domain).

Maybe you are talking about something different than the original poster?

One reason why the - would work when the + does not is that the - can appear multiple times, so it is just another valid character (like a letter, number, or underscore). The + can only appear once, so many servers can ignore it, drop it, or puke on it.

Interestingly enough, while the (optional) challenge/response system is what gets the press, the main purpose of TMDA is to create aliases like username-filter (and then filter based on them). Thus the name: *Tagged* Message Delivery Agent. The -filter is the tag of Tagged.
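
For anyone wanting to sort mail on such tags themselves, the split is simple. This is only a sketch: `split_tag` is a hypothetical helper, it assumes a well-formed address containing an "@", and whether "+" or "-" is honored as a separator depends entirely on how the receiving server is configured.

```python
def split_tag(address, separator="+"):
    """Split 'user+tag@domain' into (base address, tag).

    The tag is an empty string when the separator does not appear
    in the local part. Assumes the address contains an '@'.
    """
    local, _, domain = address.rpartition("@")
    user, _, tag = local.partition(separator)
    return f"{user}@{domain}", tag

print(split_tag("foobar+dellorders@mydomain.com"))
# ('foobar@mydomain.com', 'dellorders')
print(split_tag("username-filtername@domain.com", separator="-"))
# ('username@domain.com', 'filtername')
```

A delivery rule could then file the message into a folder named after the tag, or flag the tag as burned once spam starts arriving on it.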
Re:Help setting this up by gwynevans · 2004-02-24 22:05 · Score: 2, Informative

Sounds like POPfile was what you were actually looking for!