Two Spam Filters 10 Times As Accurate As Humans
Nuclear Elephant writes "The authors of two spam filters, CRM114 and DSPAM, announced recently
that their filters have achieved accuracy rates ten times better than a human is capable of. Based on a study by Bill Yerazunis of CRM114, the average human is only 99.84% accurate. Both filters are reported to have reached accuracy levels between 99.983% and 99.984% (1 misclassification in 6250 messages) using completely different approaches (CRM114 touts Markovian, while DSPAM implements a Dolby-type noise reduction algorithm called Dobly). If you're looking for a way to rid your inbox of spam, roll on over to one of these authors' websites."
I haven't been 100% accurate.
I received an email from my sister-in-law from her work, and the address looked suspicious (one of those weird-looking "letter and number" jumbles).
I deleted it. It happens.
Humans sometimes make mistakes; that's where the inaccuracy comes from.
Human=99.84
New proggie=99.984
So the human misses .16% and the machine only misses .016%; hence the machine is 10 times better.
No, you are just bad at math: 1 - .9984 = .0016 and 1 - .99984 = .00016.
A factor of 10 in reduced error rates:
160 errors per 100 thousand vs. 16.
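For anyone who wants to check the arithmetic, here is a minimal sketch using exact fractions. It takes the two accuracy figures from the summary (99.84% and 99.984%); the function and variable names are made up for illustration.

```python
from fractions import Fraction


def error_rate(accuracy_percent: str) -> Fraction:
    """Convert an accuracy percentage such as "99.84" into an exact error rate."""
    return 1 - Fraction(accuracy_percent) / 100


human = error_rate("99.84")    # figure quoted for a human reader
filt = error_rate("99.984")    # upper figure quoted for the filters

ratio = human / filt

print(f"human error rate:  {float(human):.5f}")
print(f"filter error rate: {float(filt):.5f}")
print(f"human errors per filter error: {ratio}")
print(f"one human error per {1 / human} messages")
print(f"one filter error per {1 / filt} messages")
```

The "ten times" claim is about error rates, not about the accuracy percentages themselves, which differ by only 0.144 points.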
20 mil and I will! Learn Esperanto with 20M others.
[*Bing* -- mail from VP of sales pops into my inbox. Subject: "Making money fast!"]
[*Bam* -- I hit delete, thinking "Stupid Spam!"]
Ahh, shit! Lookie, a human screwed up.
The filter would have actually examined the message and probably decided that it was legitimate.
Far too late for that. ICQ has had IM Spam for some time, as has Yahoo, MSChat, and AOL.
What *will* happen is that trawling robots will now also trawl for IM addresses, rather than just email addresses. As it is, only deliberate IM spammers (who are usually in an IM chat group with an intellectually stimulating name such as "Yung Hunnies 4 Married Men") are harvesting the IM addresses that show up in these chat groups. In the future, don't have your ICQ # or Jabber ID on your website, or you are setting yourself up for more spam.
Hmmm... a use for reverse 3133t spelling? "Contact me at ICQ #lEloAAT" (1310447)
We are the Music Makers, and We are the Dreamers of Dreams...
It's not a single approach: Mr. Yerazunis's setup for CRM114 sits behind several DNS blacklists, which pre-filter a huge amount of the spam. (I know his sys-admin.)
But it is far superior to SpamAssassin because it now examines groups of words. The short phrases and words identified by SpamAssassin are avoided by spammers, who are now adding huge amounts of un-displayed random text and terrible HTML tricks to avoid SpamAssassin and similar filters and to avoid the various hash functions that detect familiar phrases.
A CRM114 plugin for SA is available, thanks to Devin Nate:
http://bugzilla.spamassassin.org/show_bug.cgi?id=2301
My personal problem with SA is that it's really just a muddled average of a bunch of guessed-at filters for recognizing spam. The individual filters aren't very accurate, but the idea is that the average across a bunch of filters will be more accurate than any individual filter.
Bayes-based filters, on the other hand, directly calculate the probability of specific words appearing in spam vs. non-spam messages. Newer versions calculate the probability of short phrases, HTML tags, and mail headers as well. There's no guesswork involved (unlike SA)--if you feed them enough of yesterday's spam, then they're going to be really good with today's and tomorrow's spam. The spammers keep evolving, so sooner or later messages will get through, but the filters keep evolving, too, and it's really hard to beat a good filter these days.
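To make the "probability of specific words" idea concrete, here is a rough sketch of a per-word estimate computed from a tiny training set. It is a simplified illustration, not how SpamProbe, CRM114, or DSPAM actually work: the toy messages, the function names, and the choice to score unseen words as a neutral 0.5 are all assumptions.

```python
from collections import Counter


def train(messages):
    """Count, for each word, how many messages it appears in."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


def word_spam_probability(word, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | word) assuming spam and non-spam are equally likely."""
    s = spam_counts[word] / n_spam   # fraction of spam containing the word
    h = ham_counts[word] / n_ham     # fraction of non-spam containing the word
    if s + h == 0:
        return 0.5                   # unseen word: treated as neutral (assumption)
    return s / (s + h)


# Hypothetical toy corpora, for illustration only.
spam = ["buy cheap pills now", "cheap pills online", "make money now"]
ham = ["meeting notes attached", "lunch now or later", "project notes and budget"]

spam_counts, ham_counts = train(spam), train(ham)


def p(word):
    return word_spam_probability(word, spam_counts, ham_counts, len(spam), len(ham))


for w in ("cheap", "notes", "now", "zebra"):
    print(w, round(p(w), 3))
```

Real filters train on thousands of messages and smooth these estimates so that a word seen once does not get an extreme score.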
I've been using SpamProbe for almost 6 months, and it's amazingly accurate. I haven't had a false positive in months, and I only see a couple false negatives per month.
If you don't control the mail server to create aliases for yourself, you can also add RFC-compliant suffixes to your e-mail address. For example:
foobar+dellorders@mydomain.com.
CRM is actually quite a fascinating product. It's like a super grep where you can match against blocks of text instead of just lines. It also has some logic operators and such. I think there is a quote on his web site that refers to it as "grep bitten by a radioactive spider" and it's true.
You can use it for a lot more than spam processing; it's a really neat all-purpose tool.
The best way to support the US war effort is to continue buying American products.
Bayesian filters tend to work very well, but now spammers are adding sprawls of randomly generated green-light text to offset the filter's score.
That does not work. If anything, it makes the spam easier to identify, especially dictionary-salad-type spams that just list random words most of which real people hardly ever use in actual emails. Dictionary salad just gives the Bayesian classifier more spam terms to work with. The rest of the terms, the ones that are common in real emails, converge on a neutral score real quick, and simply stop counting one way or another.
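A small sketch of why neutral padding "stops counting": under the usual naive-Bayes product rule for combining per-word probabilities, a word scored at exactly 0.5 multiplies the spam and non-spam evidence by the same factor and cancels out. The combining rule and the per-word scores below are assumptions for illustration, not the formula any particular filter is confirmed to use.

```python
from math import prod


def combine(probabilities):
    """Combine per-word spam probabilities with the naive-Bayes product rule."""
    spam = prod(probabilities)
    ham = prod(1 - p for p in probabilities)
    return spam / (spam + ham)


# Hypothetical per-word scores for the "bad" keywords in a spam message.
spammy = [0.99, 0.95, 0.90]

# The same message padded with 50 random words that score as neutral.
padded = spammy + [0.5] * 50

print(f"without padding: {combine(spammy):.6f}")
print(f"with padding:    {combine(padded):.6f}")
```

Padding only helps the spammer if the extra words score well below 0.5 for the specific recipient, which random dictionary words generally won't.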
Edith Keeler Must Die
I can't see how that would change anything. The "bad" keywords are still in the spam. The gobbledy-gook words (usually short clips of random books/stories/something) are legitimate words, but aren't very likely to have a high coincidence of words found in my legitimate email.
I'm not using Bayesian filtering, but I can't see those making much difference.
All who are harping on about human spam detection rates, the article states:
"By comparison, a human is only about 99.84% accurate in filtering spam and nonspam, so any of these filters is more effective than a human "correspondence secretary"."
So, they define "human" to be a secretary, not an uber geek.
If every individual human has an accuracy of 99.983%, then two independent humans have an accuracy of 1 - .00017^2 or 99.99999711%. This would allow ample accuracy to judge the computer, except that it's not true[1]. A better answer is the one you suggest: humans must judge spam from subject/author alone, whereas computers get to look at the whole message. Humans reading the whole message, and possibly even following included links, responding, etc., can be assumed to have full accuracy, within epistemic bounds. Indeed, merely re-checking your work, etc. - being consciously more diligent than the average spam-sorter - should ensure your accuracy is better than average.
As for how accuracy was actually judged in this particular study, I suppose you would have to read the article for that. I haven't, myself...
[1] It assumes the probability of error is equal for every message (i.e., that error is random rather than systematic), which is obviously not true. The real accuracy of two humans in concert is surely much lower; OTOH, it is still sure to be much, much higher than the accuracy of a single human.
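The two-reviewer figure can be reproduced with a few lines. This is only a sketch of the idealized case: it assumes each reviewer errs independently and with the same probability on every message, which is exactly the assumption the footnote calls unrealistic. The function name is made up for illustration.

```python
def combined_accuracy(single_accuracy, reviewers=2):
    """Accuracy when a message is misjudged only if every reviewer misjudges it.

    Assumes independent, identically distributed errors (an idealization).
    """
    error = 1 - single_accuracy
    return 1 - error ** reviewers


acc = combined_accuracy(0.99983)
print(f"{acc * 100:.8f}%")
```

With systematic errors (both reviewers fooled by the same kind of message) the real figure sits somewhere between the single-reviewer accuracy and this upper bound.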
Unfortunately, even though it's RFC-compliant, I've found probably half the sites I have to give my email address to won't grok the username+filtername@mydomain.com syntax. It's convenient when it works, but it doesn't work enough to rely on. No, throw-away spam-bait email addresses that you use for 6 months at a time for all online ordering and the like, then eventually trash when they get too spam-ridden are the best solution I know of.
ObKubrick: In 2001: A Space Odyssey, one of the pods was marked with the designation CRM-114. And in A Clockwork Orange, Alex is injected with serum 114. I suppose CRM-114 is to Kubrick as THX1138 is to Lucas.
Dobly, on the other hand, is from This Is Spinal Tap, a mispronunciation of "Dolby" by David St. Hubbins's girlfriend.
Not to mention that it probably avoids trademark infringement (though I wouldn't put it past Dolby Labs or Thomas Dolby to raise a stink).
Maj. Kong
Shoot, a fella' could have a pretty good weekend in Vegas with all that stuff.
Good question! We're working on this problem, among other things, at the PSAM project. We have a project to produce high-quality benchmark corpora for spam filter testing. Watch that space for ongoing work, or e-mail us an offer to pitch in and help---we could use it!
When these factors are considered, I think it's quite possible to write software that in the long run has a higher success rate than a human who has better things to do than filter his mail all day.
Yes, but it is meaningful nonetheless. If you just think that it's very likely that after reviewing 650 messages, you may have missed one email that you thought was spam, then the "study" is right. I don't care if the number is 900 or 400 emails. Those 400 mails are making me lose a _lot_ of time, and if I value my time, I am losing a lot of productivity, and also missing an important email.
If the program can have a .99 accuracy, then it's a real time saver, and if it only makes a mistake every 2000 emails, then SURELY it will be more accurate than me. That depends, of course, on how much spam you get. I get around a 20 to 1 ratio of spam to real meat, and I get around 100 spam messages a day. I can't spend 1 hour a day cleaning spam with 99.9% accuracy, so I am forced to do a quick sweep. This thing could let me regain that time, and its false-positive rate would mean I make even fewer mistakes than I do manually.
The important thing is how accurate the antispam tool is and how accurate I am (given the ratio of spam to meat, and how much a miss costs me). How often other people make mistakes is not really that important. Everybody knows how much time they have and how much spam to meat they get, and thus it's very likely that if they don't have a LOT of time to waste, they will be making a mistake for every 200 to 600 spam messages.
unfinished: (adj.)
Laws don't stop people from driving drunk*, and drunk drivers are in this country and even (by definition) driving out in public, in plain sight of everyone. How, exactly, would US law enforcement prosecute a $NATIONALITY1 spammer who's using a hijacked $NATIONALITY2 computer?
Laws are fine, but what would *really* work is if everyone were filtering spam, and everyone tells all their newbie friends & relatives what spam is and installs blocking software for them. If sending 1,000,000 spams no longer results in 10 sales, spam *will* stop.
* yes, laws do stop *some* people from driving while drunk, but laws have not eliminated the problem of drunk driving.
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
Umm, Fetchmail + procmail on your local machine?
Not sure exactly why you need a pop3 proxy involved, just use Fetchmail to deliver locally, run things through procmail.
Set your local mailserver (sendmail/qmail/postfix/exim/whatever) to use your ISP's SMTP server as a smarthost, and it'll send everything it doesn't recognize as local off to them to handle.
I write spam filters for a living, and I promise you that they can eliminate many of the spams just by looking at the subject too.
Of course, so can I. Now, since I write the filter based on my human judgement of what constitutes spam, which is more accurate?
username+filtername@domain.com should go to username@domain.com as per the RFC (the +filtername is carried but not used by servers, or at least it shouldn't be). Some email clients will allow you to use this for such things as folder sorting (i.e. username+foldername goes into foldername automatically). If this worked consistently, it would be good for people who don't have the ability to make more usernames.
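As a rough sketch of the folder-sorting idea, here is how a client-side script might split a tagged address into user, tag, and domain. The function name is made up, and treating "+" as the tag separator is an assumption about how the receiving server is configured, not something every server honors.

```python
def split_subaddress(address, separator="+"):
    """Split 'user+tag@domain' into (user, tag, domain); tag is None if absent."""
    local, _, domain = address.rpartition("@")
    user, sep, tag = local.partition(separator)
    return user, (tag if sep else None), domain


print(split_subaddress("foobar+dellorders@mydomain.com"))
print(split_subaddress("username@domain.com"))
# A setup that tags with "-" instead can pass a different separator.
print(split_subaddress("username-filtername@domain.com", separator="-"))
```

A sorting rule could then file the message under the tag when one is present and fall back to the inbox otherwise.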
AFAIK, username-filtername will still just go to username-filtername, i.e. you have to configure your mail server to handle username-filtername separately from username. This works great when you can specify as many usernames as you want (i.e. if you manage your own server or have a catch-all on your domain).
Maybe you are talking about something different than the original poster?
One reason why the - would work when the + does not is that the - can appear multiple times, so it is just another valid character (like a letter, number, or underscore). The + can only appear once, so many servers can ignore it, drop it, or puke on it.
Interestingly enough, while the (optional) challenge/response system is what gets the press, the main purpose of TMDA is to create aliases like username-filter (and then filter based on them). Thus the name: *Tagged* Message Delivery Agent. The -filter is the tag of Tagged.
Sounds like POPFile was what you were actually looking for!