Two Spam Filters 10 Times As Accurate As Humans

Outclassed... by Klatoo55 · 2004-02-23 13:14 · Score: 5, Funny

I'm sorry, Dave... That Nigerian guy looks suspicious and I can't let you send him money.

--
------- "A true friend stabs you in the front." -Eliot

Re:Outclassed... by Anonymous Coward · 2004-02-23 14:23 · Score: 0

No, Dave... It doesn't matter that you are a nigerian resident on a trip and he is your dad.

IM Spam by jeffskyrunner · 2004-02-23 13:15 · Score: 5, Interesting

Once Email Spam is eliminated, then IM spam will begin...

--
Jeff

Re:IM Spam by Anonymous Coward · 2004-02-23 13:17 · Score: 0

I already get IM spam, over ICQ. lots of cute girls asking me to check out their webcams...
Re:IM Spam by Anonymous Coward · 2004-02-23 13:17 · Score: 0

mandatory sterilization for the people buying from spam, and the actual spammers.

we dont want those idiots running around with their subpar genes
Re:IM Spam by Vancouverite · 2004-02-23 13:29 · Score: 2, Informative

Far too late for that. ICQ has had IM Spam for some time, as has Yahoo, MSChat, and AOL.

What *will* happen is that trawling robots will now also trawl for IM addresses, rather than just email addresses. As it is, only deliberate IM spammers (who are usually in an IM chat group with an intellectually stimulating name such as "Yung Hunnies 4 Married Men") are harvesting the IM addresses that show up in these chat groups. In the future, don't have your ICQ # or Jabber ID on your website, or you are setting yourself up for more spam.

Hmmm... a use for reverse 3133t spelling? "Contact me at ICQ #lEloAAT" (1310447)

--
We are the Music Makers, and We are the Dreamers of Dreams...
Re:IM Spam by rokzy · 2004-02-23 13:31 · Score: 1

I use (a)MSN and it has the option that no-one can contact me unless I've already got them on my list.

worked perfectly so far.
Re:IM Spam by Anonymous Coward · 2004-02-23 13:34 · Score: 0

I don't participate in chat groups, but I still get spam through ICQ.

I'd be tempted to say that it's a case of someone sending out blanket ICQ messages to addresses, starting with account #n and ending with account #n+x
Re:IM Spam by Anonymous Coward · 2004-02-23 13:38 · Score: 0

Yeah - it would be nice if those other IM services would do what MSN Msgr does.
Re:IM Spam by Trejkaz · 2004-02-23 14:10 · Score: 1

IM already works on a whitelist so if you just turn on the "block messages from unlisted contacts" option, you will eliminate all message "spim".

What you have left is subscription spim, but I don't see how you could reliably get rid of that. I guess in time, certain users will need to build up trust with other users, such that if you're on a super-paranoid client, a user can only request subscription to you if their trust rating is above a certain figure, their trust being determined by the number of, and quality of, links back to you.

I guess I've thought too much about all this already... I should be spending this time removing TMDA from my system and installing CRM114... the only reason I'm using TMDA at all is because SpamAssassin wasn't doing an appropriately good job.

--
Karma: It's all a bunch of tree-huggin' hippy crap!
Re:IM Spam by the_truk_stop · 2004-02-23 14:36 · Score: 1

Whitelisting, my dear sir. Whitelisting.
Re:IM Spam by Nimey · 2004-02-23 14:41 · Score: 1

There's been IM spam for /years/ already. My ICQ account has been spammed several times by someone wanting me to visit their website, which is invariably pornographic.

--
Hail Eris, full of mischief...

E pluribus sanguinem
Re:IM Spam by ctime · 2004-02-23 14:48 · Score: 1

IM spam will begin? Who's to say it hasn't? I've received random spam messages on ICQ, AOL IM, IRC, and windows messenger for christ sakes.

The key thing about IMs: in the most popular versions, the system is under central control (as compared to the open SMTP protocol). That means that if companies start to see THEIR spam (um, I mean, legitimate "advertisement windows") become overshadowed by unwashed hordes of spammers, they can simply use their central power to shut down offending accounts. Some very simple mechanisms include limiting the messages sent per unit of time and letting users "warn" an account because of spam.
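
A minimal sketch of the "limit on messages sent per amount of time" idea, assuming a sliding window kept per account. The class name, the limits, and the use of plain numeric timestamps are all illustrative assumptions, not how any particular IM network does it.

```python
from collections import deque


class MessageRateLimiter:
    """Allow at most max_messages per window_seconds for a single account."""

    def __init__(self, max_messages, window_seconds):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self.sent_times = deque()  # timestamps of messages still inside the window

    def allow(self, now):
        """Return True and record the send if the account is under its limit."""
        # Forget sends that have aged out of the window.
        while self.sent_times and now - self.sent_times[0] >= self.window_seconds:
            self.sent_times.popleft()
        if len(self.sent_times) >= self.max_messages:
            return False
        self.sent_times.append(now)
        return True
```

A central server would keep one such object per account and could suspend accounts that hit the limit repeatedly; that policy is left out here.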
Re:IM Spam by Andrew+Cady · 2004-02-23 15:17 · Score: 1

First of all, on any IM network, this is (for practical purposes) a client issue. Second, ICQ and AOL's clients and the popular free alternatives for their services also offer this.
Re:IM Spam by Anonymous Coward · 2004-02-23 16:28 · Score: 0

you must not have many friends
Re:IM Spam by ctrl-alt-elite · 2004-02-23 20:58 · Score: 1

It already has begun.

Spamassassin by Czmyt · 2004-02-23 13:15 · Score: 1

It's hard to believe that a single approach like this is better than SpamAssassin. I wonder hot is compares?

Re:Spamassassin by pclminion · 2004-02-23 13:24 · Score: 2, Interesting

> It's hard to believe that a single approach like this is better than SpamAssassin.

SpamAssassin is a single approach. It looks at a bunch of features, then combines them linearly and compares the result against a threshold function. It's a relatively simplistic method, compared to these two. Not hard to see how more sophisticated methods could do better.
Re:Spamassassin by Anonymous Coward · 2004-02-23 13:32 · Score: 0

> SpamAssassin is a single approach. It looks at a bunch of features

Didn't you just contradict yourself? SA uses RBLs, pattern matching, and spam-reporting clearinghouses to identify spam. How is that a single approach?
Re:Spamassassin by Anonymous Coward · 2004-02-23 13:33 · Score: 1, Informative

It's not a single approach: Mr. Yerazunis's setup for CRM114 sits behind several DNS blacklists, which pre-filter a huge amount of it. (I know his sys-admin.)

But it is far superior to SpamAssassin because it now examines groups of words. The short phrases and words identified by SpamAssassin are avoided by spammers, who are now adding huge amounts of un-displayed random text and terrible HTML tricks to avoid SpamAssassin and similar filters and to avoid the various hash functions that detect familiar phrases.
Re:Spamassassin by gregfortune · 2004-02-23 13:35 · Score: 1

> I wonder hot is compares?

Now that's sneaky!! ;o)
Re:Spamassassin by pclminion · 2004-02-23 13:37 · Score: 1

> Didn't you just contradict yourself? SA uses RBLs, pattern matching, and spam-reporting clearinghouses to identify spam. How is that a single approach?

The different features you mention are simply elements of a feature vector whose magnitude is compared to a threshold function. It's a linear classifier. The fact that the features are variegated doesn't change the fact that it's just a linear classifier.
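
To make the "linear classifier" point concrete, here is a rough sketch, assuming each test contributes a fixed weight when it fires and the total is compared to a cutoff. The rule names, weights, and the threshold of 5.0 are made up for illustration and are not SpamAssassin's actual rules or scores.

```python
# Hypothetical rule names and weights, for illustration only.
RULE_WEIGHTS = {
    "listed_on_rbl": 3.0,
    "matches_spam_pattern": 2.5,
    "reported_to_clearinghouse": 4.0,
}


def linear_score(fired_rules, weights=RULE_WEIGHTS):
    """Sum the weights of the rules that fired (a dot product with a 0/1 vector)."""
    return sum(weights.get(rule, 0.0) for rule in fired_rules)


def is_spam(fired_rules, threshold=5.0, weights=RULE_WEIGHTS):
    """Classify by comparing the weighted sum against a fixed threshold."""
    return linear_score(fired_rules, weights) >= threshold
```

However varied the individual tests are, the decision boundary is still a single weighted sum against a single cutoff.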
Re:Spamassassin by DonGar · 2004-02-23 15:03 · Score: 1

Um... isn't that going to be true for ANY system?

I mean, in the end, they will all collapse to a single threshold function that chooses between keep and throw away.

Perhaps the SpamAssassin collapsing mechanism is simplistic, but still...

--
plus-good, double-plus-good

wait, WTF? by PedanticSpellingTrol · 2004-02-23 13:15 · Score: 5, Insightful

I presume they mean more accurate than a human that was only looking at the subject line? I fail to see how someone could misclassify an email after they'd already opened it unless it was some kind of marathon testing, which would be totally unrepresentative of any real life situation. Once you're getting 6,000 messages, it's time to reach for "Delete All" and change your address, methinks

Re:wait, WTF? by LBArrettAnderson · 2004-02-23 13:29 · Score: 2, Interesting

look at it this way... you've just tuned in to your favorite radio station and you hear your favorite DJ talking about something. Sometimes you can't tell whether what he's saying is an advertisement or something he's discussing for the sake of discussing.

i'm sure there's spam out there that makes it seem like it's one of your friends talking to you (sending with "nick" or "john" as the sender name) and talks to you in a friendly manner about how great this product is.

i've got a few of those, but luckily all my friends have weird names.
Re:wait, WTF? by HeelToe · 2004-02-23 13:42 · Score: 3, Interesting

6000 over what period?

This represents 8 days worth of spam for me. Yes, ~800 per day.

My address has been valid for 10 years. Why should I change it? Bogofilter is currently letting 2-3 per day into my inbox. I generally check for false-positives, but as the training has progressed, I am finding none anymore.

I plan to implement a single-shot, one-try notification sender. I.e., if the mail gets classified as spam: look up the MX record for the envelope return address; if it's nonexistent, look up the A record. Make a connection and try to deliver a message indicating their message (include subject reference) was identified as spam, and include a way for them to reliably get a message through to me. If any of the SMTP exchange or address lookup fails, just forget it; they're probably not real anyway.
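
The control flow of that plan could be sketched roughly as below. Everything here is an assumption for illustration: `lookup_mx`, `lookup_a`, and `deliver` are hypothetical callables the caller would supply (Python's standard library has no MX lookup), and the notice text is a placeholder.

```python
def notify_sender_once(envelope_from, subject, lookup_mx, lookup_a, deliver):
    """Make exactly one attempt to tell the sender their mail was classified as spam.

    lookup_mx(domain) and lookup_a(domain) are assumed to return a list of hosts
    (possibly empty); deliver(host, recipient, body) is assumed to raise OSError
    on any SMTP or connection failure. Returns True if the notice was handed off.
    """
    local_part, separator, domain = envelope_from.rpartition("@")
    if not separator or not domain:
        return False  # no usable return address, nothing to do
    try:
        # Prefer the MX record; fall back to the A record if there is none.
        hosts = lookup_mx(domain) or lookup_a(domain)
        if not hosts:
            return False
        body = f"Your message '{subject}' was identified as spam."
        deliver(hosts[0], envelope_from, body)
        return True
    except OSError:
        # Any lookup or SMTP failure: forget it, no retries.
        return False
```

Passing the lookups in as parameters keeps the one-try logic testable without touching the network.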
Re:wait, WTF? by stonecypher · 2004-02-23 14:23 · Score: 1

Ahahhaa. Since when is marathon deleting spam on monday morning anything but the norm?

I presume your company doesn't put your email address up on the site, then.

--
StoneCypher is Full of BS
Re:wait, WTF? by frohike · 2004-02-23 14:58 · Score: 1

My spam has ramped up again lately, and I'm not even to blame for it -- it was after the Debian project put up the unfiltered mail archives on their web site from a couple of lists I was on that I started getting bombarded. As far as I know those are pretty permanent too, so this address is pretty much destroyed unless I fight the spam flood back.

Yeah, on a good week I get over 1000 spams, which is probably about 80% of my mail. This is on an address that's about 8 years old.

And anyway, to paraphrase Michael Bolton in Office Space, why should I have to change my email address? They're the ones that suck!!

--
Cryptic Allusion - New Mac and Dreamcast Games!
Re:wait, WTF? by Anonymous Coward · 2004-02-23 16:31 · Score: 0

i get about 1,000 spams per day.
Re:wait, WTF? by That's+Unpossible! · 2004-02-23 16:58 · Score: 1

> 6000 over what period?
>
> This represents 8 days worth of spam for me. Yes, ~800 per day.
>
> My address has been valid for 10 years. Why should I change it?

I feel for ya, but I think you answered your own question. 800 a day?

--
Ironically, the word ironically is often used incorrectly.
Re:wait, WTF? by Anonymous Coward · 2004-02-24 03:07 · Score: 0

I was able to reduce spam mails simply by fake-rejecting spam. I have 2 instances of filters:
spamassassin (which fake rejects at 5) and dbacl.

A few weeks after fake rejecting the number of spam mails started to drop. Now I receive 5 times less spam.

(btw my MTA is exim)
Re:wait, WTF? by HeelToe · 2004-02-24 05:10 · Score: 1

See my other post, but essentially:

Why should I be bullied by these bastards into changing an online identity?

My filters keep all but 5 a day out of my inbox.

2+2=3 by Chess_the_cat · 2004-02-23 13:15 · Score: 2

the average human is only 99.84% accurate. Both filters are reporting to have reached accuracy levels between 99.983% and 99.984%

Am I crazy or is that nowhere near "10 times better"?

--
Support the First Amendment. Read at -1

Re:2+2=3 by LightningBolt! · 2004-02-23 13:19 · Score: 1

If you look at how inaccurate they are, it makes sense... Humans: 0.16% Computers: 0.016%

--
Old people fall. Young people spring. Rich people summer and winter.
Re:2+2=3 by Anonymous Coward · 2004-02-23 13:19 · Score: 0

Quote from the Dobly site:

According to a study by Bill Yerazunis (CRM114), humans are approximately 99.84% accurate at filtering spam. As of today, DSPAM has classified 2835 spams and 3050 nonspams in my mailbox with only 1 false accept and 1 false reject. The false accept was caused by a bug in the Bayesian Dobly code which was fixed, so depending on how you count it, I am getting either 99.964% or 99.983% accuracy - nearly ten times more accurate than a human!
Re:2+2=3 by Celandro · 2004-02-23 13:19 · Score: 3, Informative

No, you are just bad at math
1 - .9984 = .0016
1 - .99984 = .00016

A factor of 10 in reduced error rates

160 errors per 10 thousand vs 16.
Re:2+2=3 by nzkoz · 2004-02-23 13:20 · Score: 1

I think the point they're making is that they pass 1/10th as many spams as a human does. Not what most people would consider 10 times better, but still an improvement.

--
Cheers Koz
Re:2+2=3 by Cocodude · 2004-02-23 13:21 · Score: 1

Think of it as a human making an error of 0.16% (100% - 99.84%), and the filters 0.016% (100% - 99.984%). Thus, the human makes ten times as many mistakes, which can be seen as the filters being "ten times better".
Re:2+2=3 by canajin56 · 2004-02-23 13:22 · Score: 1, Redundant

99.84% chance of success is a one in 625 chance of failure. 99.983% chance of success is roughly a one in 5882 chance of failure. 99.984% = 1 in 6250. So yes, it is around 10 times better :D

--
ASCII stupid question, get a stupid ANSI
Re:2+2=3 by Deraj+DeZine · 2004-02-23 13:22 · Score: 2, Funny

Yeah, "10 times better" should be 998.4%, right?

And that's impossible. No one can give more than one hundred percent. By definition that is the most anyone can give

--
True story.
Re:2+2=3 by Ralp · 2004-02-23 13:22 · Score: 1

Human: 99.84% accurate = 0.16% inaccurate
Filters: 99.984% accurate = 0.016% inaccurate

0.16% inaccuracy means ten times as much spam will get through as 0.016% inaccuracy, thus, ten times better.
(At least by that standard of "better", I must qualify for anyone who wants to twist the statistics another way!)
Re:2+2=3 by The+Dark · 2004-02-23 13:25 · Score: 1

You're not crazy, they are claiming 10 times fewer incorrect classifications.
Although "10 times less inaccurate than humans" doesn't sound as catchy.

--
sig's not here
Re:2+2=3 by Bishop · 2004-02-23 13:27 · Score: 1

99.84% == 1 error in 625 tests
99.984% == 1 error in 6250 tests

99.984% is 10 times better than 99.84%. This is not obvious until you do the math.
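
For anyone who wants to do that math without floating-point noise, a small sketch using exact fractions; the function name is just for illustration.

```python
from fractions import Fraction


def one_in_n(accuracy_percent):
    """Return N such that the error rate is exactly 1 in N.

    accuracy_percent is passed as a string (e.g. "99.84") so the
    arithmetic stays exact instead of going through binary floats.
    """
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return 1 / error_rate


human = one_in_n("99.84")    # errors are 1 in `human` messages
filters = one_in_n("99.984")  # errors are 1 in `filters` messages
improvement = filters / human  # how many times rarer the filters' errors are
```

The "ten times" figure is the ratio of the two error rates, not of the two accuracy percentages.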
Re:2+2=3 by Anonymous Coward · 2004-02-23 13:27 · Score: 0

Man, the human that sorted those 10,000 messages has gotta had the biggest wang and helped the most Nigerian princess save their inheritance. Not to mention the excellent interest rate on his home mortgage.
Re:2+2=3 by Anonymous Coward · 2004-02-23 13:34 · Score: 0

Well, at least one mod doesn't watch the Simpsons. This is from the episode with the baseball team. If you thought that I actually believed the math up there, you've got a sad outlook on humanity.

Seriously... ten times better meaning multiply the percentage?
Re:2+2=3 by Anonymous Coward · 2004-02-23 13:35 · Score: 0

> Am I crazy or is that nowhere near "10 times better"?

Who says the two have to be mutually exclusive?
Re:2+2=3 by Flozzin · 2004-02-23 13:42 · Score: 1

Since we are dealing with percentages, 99.93% and 99.94% are ten times better than 99.84%. If you look, the numbers that change are the tenths...

--
"Cowardice in a race, as in an individual, is the unpardonable sin." --Teddy Roosevelt
Re:2+2=3 by kfg · 2004-02-23 14:19 · Score: 4, Insightful

Congratulations, Mon Ami.

You have just unlocked the secret of virtually every news report that says "ten times more likely."

To get cancer. To have a heart attack. To suffer from the heartbreak of psoriasis. Whatever.

Yes, these numbers indicate "10 times better," and if you were to ask the reporter how likely am I to avoid cancer in both situations, these are the sorts of numbers he would show you.

Eat health food and your chance of having a heart attack is 99.984%. Eat too many donuts and your chance of having a heart attack is 99.983%, 10 times worse!

Always, always, always ask to see the raw numbers so that you know what "10 times worse" means.

Then ask if the numbers were collected by phone survey. If they were, throw them all away and have a donut and a cup of coffee.

KFG
Re:2+2=3 by miskatonic+alumnus · 2004-02-23 14:21 · Score: 1

This is obvious... I mean, if my friend drinks 99.84% of a coke, and I drink 99.984% of a coke, then I drink TEN times as much coke as my friend, right? What nonsense! Talk about "Lying with statistics."
Re:2+2=3 by Coneasfast · 2004-02-23 14:46 · Score: 1

> A factor of 10 in reduced error rates

not to mention the amount of time it would take a human to go through thousands of emails

--
Marge, get me your address book, 4 beers, and my conversation hat.
Re:2+2=3 by jnicholson · 2004-02-23 14:49 · Score: 1

No, he left 10 times as much of the coke in the glass.

--
"Do not drill any holes in your cat - it will not like it."
-- Nick Davies
Re:2+2=3 by sdo1 · 2004-02-23 15:35 · Score: 1

How many times have you heard "Profits were up 50%!!!!!". Big whoop. On $1M sales, you make $1.00. Next year you make a buck fifty. Wow! Profits up 50%!!! You hear that kind of crap constantly and it means nothing.

-S

--
--- What parts of "shall make no law", "shall not be infringed", and "shall not be violated" don't you understand?
Re:2+2=3 by nytmare · 2004-02-23 15:41 · Score: 1

> No, you are just bad at math
> 1 - .9984 = .0016
> 1 - .99984 = .00016
>
> A factor of 10 in reduced error rates
>
> 160 errors per 10 thousand vs 16.

You mean per 100 thousand.

Bad at math, bad at math; everyone here is bad at math!
Re:2+2=3 by miskatonic+alumnus · 2004-02-23 15:57 · Score: 1

True. But where I come from, when someone says x is 10 times y, they don't mean 1-x is ten times 1-y.
Re:2+2=3 by Anonymous Coward · 2004-02-23 16:31 · Score: 0

> Eat health food and your chance of having a heart attack is 99.984%.

If by "health food" you mean "deadly poison".
Re:2+2=3 by kfg · 2004-02-23 17:10 · Score: 2, Insightful

Yeah, I was waiting for someone to nail me on that. In fact I was waiting for someone to agree with me. :)

I totally buggered that whole section, but it was just so funny I let it stand with the errata note that I had buggered it.

Ironically people know I "eat healthy," so I'm frequently asked where they should go to buy healthy food, to which I almost always reply:

"For God's sake man, whatever you do, don't go in the health food store!"

"Well. . . where do I go then?"

"They've got these things now called "Supermarkets." Look, over here, brown rice, dried beans and lentils. Over here, the produce aisle. You need frickin' binoculars to see the end of the thing. Broccoli, Bok Choy, squash, potatoes to the ceiling, it's the middle of February and there are crates of oranges that were hanging on the tree a few days ago. Why go anywhere else?"

"But, but . . . what about organic?"

"Here, take my binoculars, look down there. No, to the right a little, yeah, see? A whole organic section if you want. Supermarkets today aren't the supermarkets of 20 years ago. They're catering to customer demand. Go figure.

But really, if you want my advice? Save your money. Only buy organic if the price is the same. If you eat the "normal" stuff there's a 99.84% chance it won't kill you. If you eat the organic there's a 99.984% chance it won't kill you, and they got those numbers by taking a phone survey, or from the I Ching, or something like that."

KFG
Re:2+2=3 by corngrower · 2004-02-23 17:28 · Score: 1

No, you should have said that you're ten times better at drinking a glass of coke than your friend.
Re:2+2=3 by Anonymous Coward · 2004-02-23 17:31 · Score: 0

*sigh*

Well, this is so goddamn offtopic it's not even funny, but for the sake of the knowledge of the couple of people reading at 0, the health food store is NOT always more expensive than your average Supermarket for certain stuff.

In fact, things marked "Health Food" at a regular supermarket are generally more expensive than a health food store. Tofu, soy milk, veggie burgers--all more expensive at the supermarket.

But for regular veggies and stuff--hell yeah, go to the supermarket, but if they're organic then the health food store is probably cheaper.

Also, Whole Foods market owns a company called 365 which specializes in cheap organic and "all-natural" products... a lot of times the stuff is damn cheap.
Re:2+2=3 by kfg · 2004-02-23 17:59 · Score: 1

For the record I never said anything about the prices in health food stores or anything marked "Health Food" in the supermarket. In fact, I explicitly didn't bring up anything marked "Health Food" in the supermarket.

KFG
Re:2+2=3 by miskatonic+alumnus · 2004-02-23 19:15 · Score: 1

Oh, I get it! So like if my rate of catching a ball is 20% and my friend has a success rate of 1%, then like I'm (1-0.01)/(1-0.2)= 1.2375 times a better catcher than my friend. Who is bad at math?
Re:2+2=3 by Celandro · 2004-02-24 06:09 · Score: 1

If you catch balls that poorly, you aren't going to be playing pro ball any time soon..

More realistically, if player A catches the ball 80% of the time and player B catches the ball 90% of the time, saying player B is twice as good as player A makes a lot more sense than saying player B is 12.5% better. No one says player B has half the error rate of player A; they just say twice as good.
Re:2+2=3 by miskatonic+alumnus · 2004-02-24 10:07 · Score: 1

My whole point here is this -- Saying that 90% is 2 times as good as 80% is misleading and mathematically ambiguous. The word "times", as in 6 is 2 times 3, means a multiple (2) of 3. Furthermore, the quantities being compared must be stated at the outset. I mean, if we're going to play crazy games with the numbers, let's map the reals from [0,1] into the space of polynomials. Then, to compare the quantities in [0,1], take the ratio of the Square roots of the Laplace transforms of the cubes of the polynomials. Use that as your "multiple". The whole thing is ridiculous. There are no special rules for using percents -- they are just plain vanilla real numbers. So, the simplest meaning of "A is n times better than B" means take the ratio of A to B to get n. Here, we are interested in successes (like 99.84%) rather than failures. So, the spam filters are only 1.001442 times better than humans, or a relative increase of 0.14%. Now, I know that doesn't sound as impressive as 10 times.
Re:2+2=3 by Celandro · 2004-02-24 15:08 · Score: 1

The question is whether you care about the success rate or the failure rate. When you are talking about success rates in the multiple 9s category (such as odds you will not die in a car crash on the way to work today), the only meaningful numbers to compare are the failure rates. Continuing the car example, if an SUV rolls over 10 times as often as a sedan, wouldn't you say the sedan is 10 times better? Or would you say it's .0001% better because rollovers are fairly infrequent?

So .0016 failure vs .00016 is a factor of 10.. you can either say the failure rate is 10% of a human's, or a human fails 10 times as often as the computer.
Re:2+2=3 by miskatonic+alumnus · 2004-02-24 17:15 · Score: 1

I don't see where you arrive at the failure rates being the "only" meaningful numbers to compare. Let's consider the extremes. You're basically saying that for numbers x,y in the interval [0,1] that x is m times better than y translates into (1-y)/(1-x) = m. That is, x = 1 - [(1-y)/m]. Now, take y=0% and x=80%, which gives m=5. Therefore, 80% success is 5 times better than 0% success. But, 90% is 10 times better than 0%. Seems a little odd to me. At the other extreme, if x=100% and y is any smaller value, we must have 100% is infinitely better than y. This doesn't make any sense whatsoever. No, I wouldn't say the sedan is 10 times better. It's an abuse of mathematical terminology.
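
The two competing readings of "m times better" can be put side by side in a short sketch. The function names are illustrative; exact fractions are used so the edge cases behave as described, and the division-by-zero cases are left to raise rather than being papered over.

```python
from fractions import Fraction


def failure_ratio(x, y):
    """'x is m times better than y' read as m = (1 - y) / (1 - x).

    Raises ZeroDivisionError when x == 1, i.e. a perfect success rate
    comes out "infinitely better" than anything else.
    """
    return (1 - y) / (1 - x)


def success_ratio(x, y):
    """'x is m times better than y' read as the plain ratio m = x / y.

    Raises ZeroDivisionError when y == 0.
    """
    return x / y


human = Fraction("0.9984")
filters = Fraction("0.99984")
by_failures = failure_ratio(filters, human)          # the "10 times" reading
by_successes = float(success_ratio(filters, human))  # the "barely better" reading
```

Both numbers are computed from the same two success rates; the disagreement in the thread is only about which ratio the word "times" should refer to.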
Re:2+2=3 by jnicholson · 2004-02-25 09:56 · Score: 1

Quite true. The phrase ought to have been '10 times less unlikely' but people are often confused by probabilities.

--
"Do not drill any holes in your cat - it will not like it."
-- Nick Davies

can it be used with SA? by Chuck+Bucket · 2004-02-23 13:16 · Score: 4, Interesting

can this be used with SpamAssassin, or is it a standalone program? Does it need something like Amasis to run?

CB

--
free ipod and free gmail!

Re:can it be used with SA? by Neil+Blender · 2004-02-23 13:28 · Score: 2, Funny

> can this be used with SpamAssassin, or is it a standalone program? Does it need something like Amasis to run?

I'd tell you, but I'm not 100% sure.
Re:can it be used with SA? by Scott+Laird · 2004-02-23 13:46 · Score: 2, Informative

My personal problem with SA is that it's really just a muddled average of a bunch of guessed-at filters for recognizing spam. The individual filters aren't very accurate, but the idea is that the average across a bunch of filters will be more accurate than any individual filter.

Bayes-based filters, on the other hand, directly calculate the probability of specific words appearing in spam vs. non-spam messages. Newer versions calculate the probability of short phrases, HTML tags, and mail headers as well. There's no guesswork involved (unlike SA)--if you feed them enough of yesterday's spam, then they're going to be really good with today and tomorrow's spam. The spammers keep evolving, so sooner or later messages will get through, but the filters keep evolving, too, and it's really hard to beat a good filter these days.

I've been using SpamProbe for almost 6 months, and it's amazingly accurate. I haven't had a false positive in months, and I only see a couple false negatives per month.
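
As a rough illustration of the word-probability idea described above, here is a toy naive Bayes filter. It is a sketch under simplifying assumptions (whitespace tokenization, add-one smoothing, single words only) and is not how SpamProbe or any other named filter is actually implemented; the class and method names are made up.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy word-level naive Bayes filter; real filters also use phrases, headers, etc."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one message under label 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def _word_prob(self, word, label, vocab_size):
        # Add-one (Laplace) smoothing so unseen words never get probability 0.
        total = sum(self.word_counts[label].values())
        return (self.word_counts[label][word] + 1) / (total + vocab_size + 1)

    def spam_probability(self, text):
        """Estimate P(spam | words), treating words as independent."""
        vocab_size = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        # Start from smoothed prior odds, then add each word's log-likelihood ratio.
        log_odds = math.log((self.message_counts["spam"] + 1)
                            / (self.message_counts["ham"] + 1))
        for word in text.lower().split():
            log_odds += math.log(self._word_prob(word, "spam", vocab_size)
                                 / self._word_prob(word, "ham", vocab_size))
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Feeding it more of yesterday's spam and ham shifts the per-word counts, which is the "filters keep evolving" part of the argument.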
Re:can it be used with SA? by dmaxwell · 2004-02-23 14:03 · Score: 1

SA has a Bayes classifier in addition to its other tests. The end user can adjust the Bayes scores up or down to suit.
Re:can it be used with SA? by Scott+Laird · 2004-02-23 14:11 · Score: 1

Yeah, but if the Bayes classifier is much more accurate than any of the other classifiers (that's been my experience), then why not just turn all of the other ones off and go solely by the Bayes filter? Averaging 'great' with a lot of 'okay's doesn't do a whole lot to improve the 'great'.
Re:can it be used with SA? by CXI · 2004-02-23 15:58 · Score: 1

That's why you can configure it. You are arguing that SA gives you too many tools. Let me ask you this: why is my toolbox full of tools when 90% of the time I only use the screwdriver? (Hint, it has to do with the other 10%).
Re:can it be used with SA? by CTachyon · 2004-02-23 16:12 · Score: 1

The other filters are still very useful because they allow training an empty Bayesian filter -- very spammy and very hammy messages are automatically used as training material. Besides, the DNS blacklist checks catch nearly as much spam as the Bayesian filter, at least for me.

--
Range Voting: preference intensity matters
Re:can it be used with SA? by Scott+Laird · 2004-02-23 16:21 · Score: 1

Actually, that *is* my argument. I used to carry around a multi-tool with a billion little blades and driver tips. In the end, I realized that all I really ever used was the #1 Phillips screwdriver, and the one on the multitool sucked--it was too short to reach into cases, and too hard to pull out. I was better served with a simple, cheap screwdriver. Sometimes single-use tools are much more powerful than jack-of-all-trades tools. And sometimes the uber-tool wins. Complexity is the enemy; do what you can to simplify the tools that you use.
Re:can it be used with SA? by GSloop · 2004-02-23 16:41 · Score: 1

I think your point is dang wrong.

SA has a whole tool box of great tools. The bayes classifier is really good. If you're willing to train it exclusively, I'm sure you could live with it alone.

For site-wide implementation, where single-user bayes training isn't practical, the additional tools are really useful. I've implemented SA sitewide for three sites in the last couple of weeks. With minimal training, most users are getting 90-95% on a shared bayes DB.

Sure, they could all use popfile - but that would require a lot more work in training all the users.

SA has a whole bunch of really good tools.

If you're willing to only use a table-saw to build your table and chairs, be my guest. I like to have the whole shop at my disposal. Sure, I don't use the bandsaw for every project, but when it comes into play, nothing will substitute.

Everyone to their own I guess...

Cheers,
Greg
Re:can it be used with SA? by draziw · 2004-02-23 16:43 · Score: 1

Every now and then a spam will get by with a 50% bayes score - but the other filters will get it up to a high enough score to block it. Then I feed it to the bayes filters, and they get better. Bayes doesn't keep spam at 99% though - so the other filters help. I use RBLs, SA, custom rule sets, and bayes (in SA). Every message also gets virus scanned by clamav (great, free, and fast updates), and f-prot (fast and solid).

Ryan
Re:can it be used with SA? by Scott+Laird · 2004-02-23 17:07 · Score: 1

This is venturing a bit far from where we started--the point I was making is that an advanced Bayes filter can (in at least some situations) do a better job than SpamAssassin, even though SA includes a wide variety of filtering tools. I'm not trying to claim that SA is useless or that tablesaws are the end-all-be-all of woodworking tools (I don't actually use my tablesaw all that much :-).

For initially seeding the bayes table, I can certainly see SA's other filters as useful. Personally, I have a few thousand test cases sitting around that I can use for Bayes training, but random users don't generally work that way (personal experience: they keep everything, unsorted, in one big inbox, never hit delete, and expect me to keep it backed up for them. And then complain that my mail server is slow accessing their inbox).

However, in a lot of cases, I'd rather use one focused tool than one that can be customized and tuned into doing the job. I don't generally use perl to copy files--I use cp. When filtering my own pile of spam, I'll stick to a simple Bayesian filter because it's much more effective for me and I don't have to spend any time tweaking the weights of different filters. It's been a while since I used SA, but balancing the different filters was a pain, and it was always generating false positives. I'm sure it's better now, but I'm just amazed at how well SpamProbe has worked for me--it's been at least 4 months since I saw a false positive, and I only rarely see spam in my inbox. It's just a nicely tuned, nicely balanced tool that doesn't take a lot of effort to do a good job. And *that*'s what I'm really looking for in tools.
Re:can it be used with SA? by csk_1975 · 2004-02-23 17:21 · Score: 1

Bayes is good but sometimes a regex is very useful to trap "special" spam signs:-

body Viagra1 /\b(?!viagra)(?:v|\\\/).?[ili1\|\!].?[4aa\@].?g.?[ 4aa\@]?.?r.?[4aa\@]/i
score Viagra1 100
Averaging isn't really the right term, is it? It's more like aiding and abetting the filter by adding and subtracting scores to get a total indication of how spammy the message is - this allows you to set up rules specific to your circumstance to help the Bayesian classifier avoid FPs (and FNs).
Re:can it be used with SA? by KjetilK · 2004-02-23 22:19 · Score: 1

Thanks a lot! Yeah, I agree, Bayes is really great, and it catches a lot of my spam, but some things that are really common are good to catch with regexes too. So, I inserted your rule and gave it a score of 1.0 to test it out, thanks a lot!
Since I reject stuff at a score of 12, it is not just about having it marked as spam, but also being so sure about it that I can get it above 12, so that I don't have to see it.

--
Employee of Inrupt, Project Release Manager and Community Manager for Solid
Re:can it be used with SA? by GSloop · 2004-02-24 07:17 · Score: 1

Noted...

However, I think you're generalizing betwixt site-wide and personal spam filtering.

The two are NOT at all alike. The goal is the same, but the methods that are reasonable to get there are very different.

I very much doubt that bayes filtering alone, without specific *individual* user training would capture say, 90%+ of all spam. SA *with* bayes does - without individual training. I feed it a general smattering of 200 hams, a thousand or two spams and let it go.

The bayes system is awesome. Not perfect, but very good. But the rest of the tools are great too. Don't mess with the scores. They've already tested them on a huge range of ham and spam - mucking with them individually just likely screws things up when you're using them site-wide.

The tinkering you can do for individual mail is completely unworkable for site-wide implementations.

SA is designed mainly for site-wide implementations. When you're running 100K+ pieces of mail a day, it may even benefit you NOT to run bayes because of the compute-intensive nature.

Anyway.

I just think you're knocking a site wide product for having a deep and wide defense. Sure, it's more complex and hairy. But that's the nature of site wide defenses.

But if you're doing simple individual filtering, SA probably isn't the best tool to use, though I think it still does an incredible job. Just a lot of extra work for just one or two user(s).

Cheers,
Greg

BTW - on woodworking - I just purchased a LN low angle jack plane, and I think I'll be using it more than any other tool in my shop! Regrind the blade to 40 degrees and you have a "high-angle" jack. I can hand-plane QS sycamore, maple, super figured woods - pretty much anything. Plus I can do end grain with the low angle! Now *THAT'S* a tool!

Who is sending that one? by ObviousGuy · 2004-02-23 13:16 · Score: 5, Funny

If your email is indistinguishable from spam by a human, perhaps the problem isn't the receiver. It's the sender.

Forgive me if I don't feel any pity that some moron's email gets filtered to the junk bin because I couldn't discern it from spam.

--
I have been pwned because my /. password was too easy to guess.

Re:Who is sending that one? by segment · 2004-02-23 13:46 · Score: 1

wget -qO - http://www.mixpills.com/|sed -n '4p'|awk '{print $6,$7,$8}'|mail `whoami`

or...

lynx -dump http://www.mixpills.com/|sed -n '45p'

--
MoFscker

Bleh. by SphericalCrusher · 2004-02-23 13:16 · Score: 0

That has to be some stupid people it is comparing to.

The day that something programmed out performs a human just goes to show how bad the World is coming to... although there was that Chess game that beat the World's Champion.. even though that was a different story. =/

--
"Instant gratification takes too long." - Carrie Fisher

Re:Bleh. by Mmm_Coco · 2004-02-23 13:40 · Score: 2, Insightful

programs out perform humans all the time. Where am I? my GPS knows. What was that person's number? my PDA knows. What is 2365 times 8675309? just use a calculator: 20517105785. Wow, I was just out performed three times in the space of a minute.
Re:Bleh. by SphericalCrusher · 2004-02-23 14:05 · Score: 0

Eh, you're missing the point.

The point is that someone programmed and told that computer what to do. How would it have known to do that on its own? It can't think for itself.

--
"Instant gratification takes too long." - Carrie Fisher

SPAM definition by Embedded+Geek · 2004-02-23 13:16 · Score: 2, Insightful

Isn't the rough definition of SPAM "Anything I don't want in my mailbox"? If that's the case, isn't the human score going to be 100% (at least for the intended recipient)?

--

"Prepare for the worst - hope for the best."

Re:SPAM definition by 2short · 2004-02-23 14:19 · Score: 1

Isn't the rough definition of SPAM "Anything I don't want in my mailbox"?

No. Spam is unsolicited bulk email. The dumb jokes I've seen a hundred times forwarded by my brother-in-law are not spam, even though I don't want them.

If that's the case, isn't the human score going to be 100% (at least for the intended recipient)?

No. The whole point is that a human faced with hundreds of emails a day, most of which they don't want, will almost certainly end up accidentally deleting some they actually do want. Humans are generally not 100% at anything.

To get this new spam filter... by Anonymous Coward · 2004-02-23 13:16 · Score: 5, Funny

Just enter a valid email address, and hit submit!

Huh? by MBCook · 2004-02-23 13:17 · Score: 1, Interesting

OK, I am the one who DEFINES what spam is for me, hence everything I say is spam is, and everything I say isn't isn't. I'm 100% accurate by the fact that as the person who defines what spam is for me, I know exactly what spam is.

Would someone like to explain how a program (even if it's right 99.something% of the time) is more accurate than me (100%)?

--
Comment forecast: Bits of genius surrounded by a sea of mediocrity.

Re:Huh? by jumpingfred · 2004-02-23 13:19 · Score: 1

I don't know about you but I sometimes make mistakes and delete the wrong mail.
Re:Huh? by phillk6751 · 2004-02-23 13:22 · Score: 0

besides, are you absolutely sure that an e-mail message is spam by what its subject line and sender are? If you filter the e-mail solely based on that, spam filters can truly be more accurate than a human, depending on your situation
Re:Huh? by miyako · 2004-02-23 13:39 · Score: 1

yeah, but what about people like my grandma who asked me who mr such and such was and why he wanted her to enlarge her member.
Or conversely people like me who delete anything with a subject line that starts with FW:
What about that guy who was suckered into half a million from the nigerian spammer, or my aunt who paid 4 times for some pos firewall software from that ad "your computer is broadcasting an ip address", or my friends sister who was conned out of quite a bit of money when someone said they had control over her system and wanted money (they sent her a link to a page with a frame that pointed to C:\)
To be honest, I'm surprised that the average (l)user can identify over 90% of spam.
I understand your point that spam is unwanted email, and you are the one who decides if it is unwanted or not, but sometimes it's easy to forget about those less technologically inclined than ourselves.

--
Famous Last Words: "hmm...wikipedia says it's edible"
Re:Huh? by jjeffries · 2004-02-23 13:43 · Score: 1

I work for an ISP. We run spamassassin, which is good but not perfect. I also get some of the missed spam and quasi-spam forwarded to me, in the hopes that I'll block it.

Some of these things I get are just plain SA misses, but others are kind of in a grey area. One guy keeps sending me an insurance-related email that he keeps getting, but it's not spam--he signed up to get quotes or insurance news or something, so I won't block it.

So, anyway, SA thinks it's not spam, I think it's not spam, and the recipient thinks it is. Who's right?
Re:Huh? by devphil · 2004-02-23 13:46 · Score: 1

I am the one who DEFINES what spam is for me, hence everything I say is spam is, and everything I say isn't isn't.

Exactly. You see spam, you hand it to DSPAM and say, "spam". You see good email, you can hand it to DSPAM and say, "ignore this". DSPAM adapts and becomes excellent at doing its job. You, on the other hand...

how a program (even if it's right 99.something% of the time) is more accurate than me (100%)?

...are not 100% accurate. Once every several hundred emails, the From and Subject lines are forged well enough that you look at the mail. You realize instantly that it's spam and trash it, but you still looked where DSPAM would not.

Of course, anybody asked on /. will always claim that they are 100% accurate all the time.

--
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
Re:Huh? by nacturation · 2004-02-23 13:50 · Score: 1, Insightful

99.84% accuracy rate means misclassifying 1 email in every 625 you receive. Are you really that accurate that you don't make a single mistake in almost a thousand emails? Here, "mistake" can mean reading an email you thought was valid but it turned out to be spam; or deleting an email you thought was spam but it really was valid.

--
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
Re:Huh? by fprefect · 2004-02-23 13:51 · Score: 2, Insightful

How can you be sure that you've never deleted an important email as spam?

--
Matt Slot / Bitwise Operator / Ambrosia Software, Inc.
Re:Huh? by Mysteray · 2004-02-23 13:56 · Score: 2, Insightful

Would someone like to explain how a program (even if it's right 99.something% of the time) is more accurate than me (100%)?

That's an easy one. The computer is 10 times better at recognizing what it has decided is spam. We humans are lucky to even be in the same league.
Now that you understand that, you're one step closer to being "computer literate".
Re:Huh? by perlchild · 2004-02-23 13:58 · Score: 1

Question for you:
provided that an automated anti-spam tool is a bandwidth-and-time saving process

provided that to define an email as spam or not, you may have to read/download most of it

wouldn't you say the human who DEFINES the spam has a ZERO chance of success in this test, as to define the spam, he just may have to read most of it, saving little time and no bandwidth?
Re:Huh? by iMoron · 2004-02-23 14:40 · Score: 2, Insightful

By your definition, every spam message is a mistake for the spam filter because it "reads" all of them (at least to the same extent as it "reads" any non-spam email). The filter is more accurate because it is fast enough to be more thorough than any human can possibly be expected to be. If we could thoroughly analyze hundreds of emails in a matter of seconds, we would have no need for spam filters. We have spam filters because we don't have the time (or the patience, for that matter) to be as careful as a filter.
Re:Huh? by Dirtside · 2004-02-23 15:46 · Score: 1

Here, "mistake" can mean reading an email you thought was valid but it turned out to be spam; or deleting an email you thought was spam but it really was valid.
The problem is, what the guys who ran the test define as "spam" may not match what one of the test subjects defines as "spam." So a test subject might see a mail, think, "That's not spam!" and mark it as real. I don't mean that he reads it wrong: I mean that no matter how much he inspected it, he'd say, "No, I wouldn't call this spam." Even if you specify a universal definition, whether a piece of spam matches that definition can be arguable.

--
"Destroy science and religion. Science would re-emerge exactly the same; but not religion." - Penn Jillette, paraphrased
Re:Huh? by Anonymous Coward · 2004-02-23 22:48 · Score: 0

Line

Is this possible? by Knetzar · 2004-02-23 13:17 · Score: 1, Interesting

How does one test a program like this that's more accurate than humans?

Re:Is this possible? by MourningBlade · 2004-02-23 13:55 · Score: 1

How does one test a program like this that's more accurate than humans?
Simple, have someone classify their own mail for a month, and then have another person go over that person's decisions marking them correct or incorrect. Do the same for the filter.
If the probability that a person will incorrectly classify an email as spam is 10%, the probability that two people would do so is 10% of 10%. In other words: 1%.
There are going to be mistakes. If you have a large enough sample population and checking crew, you're *going* to get errors, and you're *going* to find (most) of them.
If I were doing the experiment, I'd use two checkers, minimum, just to be sure. Probably three, if I could round up another person.
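The "10% of 10%" arithmetic above holds only if the checkers make their mistakes independently of each other, which is an assumption and not something the thread establishes. A minimal Python sketch under that assumption (the function name is made up for illustration):

```python
def all_reviewers_wrong(error_rate, reviewers):
    """Chance that every reviewer makes the same mistake on one email,
    assuming each reviewer errs independently with the same probability."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be a probability between 0 and 1")
    if reviewers < 1:
        raise ValueError("need at least one reviewer")
    return error_rate ** reviewers

# One, two and three checkers, each wrong 10% of the time.
for n in (1, 2, 3):
    print(n, all_reviewers_wrong(0.10, n))
```

If the checkers tend to be fooled by the same messages, the real figure will be worse than this product suggests.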
Re:Is this possible? by Knetzar · 2004-02-23 14:20 · Score: 1

If it isn't your email it's hard to tell if it's spam or not. For example, I get newsletters from the EFF, how would someone else know that those emails are not just random spam?
Re:Is this possible? by 2short · 2004-02-23 14:29 · Score: 1

They could ask you. They could track down the source of the messages, and extensively interview the EFF about why they sent you the mail. Etc.

"How good are humans at such-and-such a task?" is a pretty common research question, and it's perfectly possible to assess it even though human researchers must be the ones who determine the correct answer. Basically, they just put a lot more time and effort into determining the correct answer than you put into coming up with your answer.

If you took 5 seconds to decide a mail message was spam, and I put 5 hours into determining it was not, I can feel pretty confident in deciding you were wrong. Particularly if part of that was having you spend five minutes carefully reading the message over, after which you agree you were mistaken.
Re:Is this possible? by QuantumFTL · 2004-02-23 15:15 · Score: 1

How does one test a program like this that's more accurate than humans?

I believe the number that they give concerns how accurate humans are at determining spam without reading the entire message.

Consider a test setup where 10 humans must independently reach the same conclusion after reading each email (in isolation). That should reach 100% accuracy on this sample size.

Of course, what I suggest is probably overkill. However, if you have to read an entire spam to decide accurately whether or not it's spam, then the spammers have won.

Cheers,
Justin
Re:Is this possible? by InfiniteWisdom · 2004-02-23 16:01 · Score: 1

Have several humans check it. Have the human check only a small number of e-mails at a time so that he/she is unlikely to make a mistake.

The statement would probably be better phrased as "more accurate than humans under typical conditions"
Re:Is this possible? by MourningBlade · 2004-02-24 08:25 · Score: 1

You could be asked to check over those emails which the reviewers considered mis-categorized.

Re:Huh? Aren't humans 100%? by MarkJensen · 2004-02-23 13:17 · Score: 5, Informative

I haven't been 100% accurate.

I received an email from my sister-in-law from her work, and the address looked suspicious (one of those weird-looking "letter and number" jumbles).

I deleted it. It happens.

Re:Huh? Aren't humans 100%? by msgmonkey · 2004-02-23 13:17 · Score: 2, Informative

Humans sometimes make mistakes, that's where the inaccuracy comes from.

AI by phillk6751 · 2004-02-23 13:17 · Score: 0

Accuracy of the SBPH/BCR classifier has been seen in excess of 99 per cent, for 1/4 megabyte of learning text. In other words, CRM114 learns, and it learns fast.

Great, someone finally came up with a spam filter that learns.

Better by gid13 · 2004-02-23 13:17 · Score: 4, Interesting

Well, it certainly sounds better than the pay-per-email "postage" idea. If postage hasn't stopped snail spam, why would it stop e-mail spam?

Re:Better by Grrr · 2004-02-23 13:28 · Score: 1

If postage hasn't stopped snail spam, why would it stop e-mail spam?

The sender's cost of e-mail spam is negligible, per address, compared to snail mail postage (in the USA, anyway).

<grrr>
Re:Better by techno-vampire · 2004-02-23 17:41 · Score: 1

Junk snail-mail isn't a problem because it pays for itself. Not only do the senders have to spend the money to create, print and send it, it subsidizes regular mail. That's right: without junk mail, first class postage would be higher than it is.
The problem with spam is that it costs nothing to send, and is, in effect, subsidized by honest people. Spam not only clogs our inboxes, it raises our ISP fees and that's why there are so many spammers. You don't need a very high success rate when you pay nothing to send out your advertisements.

--
Good, inexpensive web hosting

Re:Huh? Aren't humans 100%? by hatrisc · 2004-02-23 13:17 · Score: 2, Interesting

but can you identify spam before opening it 100% of the time? Now, I realize that the mail program is looking at the actual data as well, which gives it an advantage, but on the other hand, how else can IT detect spam?

--
I write code.

Number of significant digits... by jsimon12 · 2004-02-23 13:18 · Score: 4, Informative

Human=99.84
New proggie=99.984

So the human misses .16% and the machine only misses .016%, hence the machine is 10 times better.
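The same arithmetic as a small Python sketch, using exact fractions so the ratio comes out without floating-point noise. The helper name is made up for illustration; the two accuracy figures are the ones quoted above.

```python
from fractions import Fraction

def error_rate(accuracy_percent):
    """Turn an accuracy given as a percentage string into an exact error rate."""
    return 1 - Fraction(accuracy_percent) / 100

human = error_rate("99.84")     # 1 error per 625 emails
machine = error_rate("99.984")  # 1 error per 6250 emails

print(human, machine, human / machine)
```

"10 times as accurate" here means ten times fewer errors, not ten times the accuracy figure.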

Re:Number of significant digits... by Anonymous Coward · 2004-02-23 13:41 · Score: 0

OMG - you guys replied this math question to _death_. Get a grip everyone!
Re:Number of significant digits... by sploxx · 2004-02-23 13:50 · Score: 1

The problem is that you need some reference which says "this is spam and this is ham".
And who edits this reference?
Re:Number of significant digits... by Anonymous Coward · 2004-02-23 13:57 · Score: 0

It's just like grading exams:

0.000-90.00% = F
90.001-99.00% = D
99.001-99.90% = C
99.901-99.99% = B
>99.99% = A

Re:Huh? Aren't humans 100%? by Phillup · 2004-02-23 13:18 · Score: 1, Insightful

I agree 100%.

If I say it is spam, I'm not reading it... and I am deleting it.

Any software that tries to stop me is removed via

rm -Rf

because it is faulty.

--

--Phillip

Can you say BIRTH TAX

Re:Huh? Aren't humans 100%? by Behrooz · 2004-02-23 13:18 · Score: 4, Insightful

I suppose it depends how you're defining spam. Perhaps the ultimate spam messages that don't get past them are capable of passing a turing test... hence fooling those gullible human recipients into thinking that it isn't even spam!

Fortunately, soon we will all be able to use the superhuman spam-detection capabilities of these filters to save us from ourselves. Imagine all of those pesky e-mails from your 'friends' getting caught by your spam filter before they even impinge upon your consciousness.

It'd be a wonderful world.

--
"We have to go forth and crush every world view that doesn't believe in tolerance and free speech." - David Brin

less thought for me... by Digitus1337 · 2004-02-23 13:18 · Score: 3, Funny

...and only one locked pod bay door per 6250, I like those odds.

Use a realtime blacklist + spam filtering by servicepack158 · 2004-02-23 13:19 · Score: 1

One doesn't ever seem like enough. Like blacklist + spamassassin. how come you can never get to the links in the spam anyway, what's the point ? :)

Re:Use a realtime blacklist + spam filtering by Anonymous Coward · 2004-02-23 14:14 · Score: 0

So what is it?
D'ya have a small penis or do you need to lose weight? Or is it keeping your penis up? O_o

Re:Huh? Aren't humans 100%? by gid13 · 2004-02-23 13:20 · Score: 5, Insightful

If you read the post, it quotes a study and says humans are only accurate 99.84% of the time.

Kinda makes you wonder how they can know the filters are right though. :)

(please don't reply telling me how)

Hmmmm by Anonymous Coward · 2004-02-23 13:20 · Score: 5, Funny

Probably used those same people who open viruses as test subjects.

i tend to think... by caino59 · 2004-02-23 13:21 · Score: 3, Funny

that i'm 100% accurate.

maybe some of those people just don't know where their 'del' key is, or what it does...

Re:Depends... by DoctorCool · 2004-02-23 13:21 · Score: 1, Funny

None of my mail is spam! I take the penis enlargement and breast enhancement very seriously.

It is 10 times better by flicken · 2004-02-23 13:21 · Score: 2, Informative

Think of it in terms of an error rate:

100% - 99.84% = 0.16%
100% - 99.984% = 0.016%
0.16% = 10 * 0.016%

--
20 mil and I will! Learn Esperanto with 20M others.

Combined accuracy? by LagDemon · 2004-02-23 13:21 · Score: 2, Interesting

Does this mean that if I use the 2 together, I get a 99.99999728% accuracy? Awesome! That means it would take months for me to see a single error!

--

Beware of he who would deny you access to information, for in his heart he dreams himself your master.

Re:Combined accuracy? by canajin56 · 2004-02-23 13:33 · Score: 2, Interesting

No, that only works if the probability of system X being wrong is independent of the particular message it is checking. (This also means that their figures are dependent on the makeup of the e-mail you are getting) Also, you couldn't really combine them usefully. If one says yes and the other says no, what do you do? You could either accept in these cases, or reject. But either way you could increase the error over just using one or the other.

--
ASCII stupid question, get a stupid ANSI
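The point above about combining two filters can be shown on toy data. In this Python sketch the six messages and both filters' verdicts are invented for illustration; nothing here is measured from CRM114, DSPAM or any real filter.

```python
def errors(verdicts, truth):
    """Count the messages where a verdict disagrees with the true label."""
    return sum(v != t for v, t in zip(verdicts, truth))

def combine(a, b, rule):
    """Merge two filters' verdicts: 'either' flags spam if one filter does,
    'both' flags spam only if the two agree."""
    if rule == "either":
        return [x or y for x, y in zip(a, b)]
    if rule == "both":
        return [x and y for x, y in zip(a, b)]
    raise ValueError("rule must be 'either' or 'both'")

# True means spam. Each filter makes exactly one false positive,
# but on a different legitimate message.
truth    = [True, True, True, False, False, False]
filter_a = [True, True, True, False, False, True]
filter_b = [True, True, True, False, True, False]

print(errors(filter_a, truth), errors(filter_b, truth))
print(errors(combine(filter_a, filter_b, "either"), truth))
print(errors(combine(filter_a, filter_b, "both"), truth))
```

On this data the "either" rule doubles the mistakes while the "both" rule removes them. Swap the false positives for missed spam and the picture flips, so which rule helps depends on how the two filters' errors overlap, not on simply multiplying their error rates.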
Re:Combined accuracy? by E-Rock · 2004-02-23 13:35 · Score: 1

Hey, since it's more accurate than you are, you won't ever notice. :)

how to lie with statistics.. by isaac338 · 2004-02-23 13:21 · Score: 2, Interesting

1 in 6250?

Who wants to bet that they only sent two 'spam' and one of them was disguised well? ;)

Obligatory Q... When will mozilla/TB have them? by sisukapalli1 · 2004-02-23 13:21 · Score: 5, Interesting

I reached the conclusion of "two filters better than humans" by using two sequential filters:
server side spamassassin, and a couple of simple procmail recipes. They have kept almost all the SPAM away.

However, it is good to see such good techniques becoming available, and we can hope to see them as straightforward, usable tools.

So, when will mozilla/TB (or your favourite server side or client side filter) get them?

S

Re:Obligatory Q... When will mozilla/TB have them? by perlchild · 2004-02-23 14:04 · Score: 1

server-side
It's my impression that crm114 is already supported by procmail and maildrop and pretty much any server-side filtering device that filters on a known added "spammy" header.
Retraining server-side is a little bit less... er fun though.
Re:Obligatory Q... When will mozilla/TB have them? by Anthracks · 2004-02-23 14:53 · Score: 1

While not using these two methods specifically, much work is being done to improve Mozilla's spam filtering. A lot of it leverages the code and advice of the SpamBayes project (with their full knowledge and support). If you're interested in some of the gory details, look at these two bugs: http://bugzilla.mozilla.org/show_bug.cgi?id=181534 and http://bugzilla.mozilla.org/show_bug.cgi?id=230093 (I'd make them links, but Bugzilla blocks Slashdot referrers).

--
Rock over London, Rock on Chicago. Wheaties: Breakfast of Champions.
Re:Obligatory Q... When will mozilla/TB have them? by CvD · 2004-02-23 21:10 · Score: 1

CRM114 has a Mozilla mail "plugin". See PURITY OF EMAIL (P.O.E.) website

--
The Official Steve Ballmer Webpage
Re:Obligatory Q... When will mozilla/TB have them? by CvD · 2004-02-23 21:13 · Score: 1

Okay, so I lied. It's not a plugin. It's a huge list of perl scripts and other kludges. Not for the faint of heart. :-)

--
The Official Steve Ballmer Webpage

Accuracy different for diff people by xot · 2004-02-23 13:21 · Score: 1

Wouldn't accuracy differ from user to user? For a user who receives almost no spam and likes to keep his mail clean, wouldn't the anti-spam learn to delete stuff that is just being cleaned and is not spam?
And also I'll be the one to judge its accuracy, as ONLY I know what my spam is.

--
Lord of the Binges.

knowspam.net by flyingrobots · 2004-02-23 13:22 · Score: 2, Interesting

I still think it is the best 'filter' available, since filtering is a lookup into a database of 'good senders' http://www.knowspam.net

Re:knowspam.net by perlchild · 2004-02-23 14:02 · Score: 2, Insightful

until the next "trinoo-like" proxy allows spammers to send email from a desktop near you...
Re:knowspam.net by flyingrobots · 2004-02-23 16:27 · Score: 1

yeah, well it's not like the desktop near you won't be identified rather quickly and fixed...

actually by Digitus1337 · 2004-02-23 13:22 · Score: 5, Funny

it's not that humans are not as accurate, it's that 1 in X times we really do want a mini camera or free porn. It is what seperates us from those cold, heartless machines.... mini cameras and porn....

Re:actually by Deraj+DeZine · 2004-02-23 13:31 · Score: 2, Funny

What about that 1 in 6250 for the automated filters? Your computer might be spying on you at this very moment!

This is indeed a disturbing development.

--
True story.
Re:actually by Anonymous Coward · 2004-02-23 14:40 · Score: 0

separates
Re:actually by gardyloo · 2004-02-23 15:07 · Score: 1

Xminicam ~ 10^6... Xfreeporn ~ 1.1
Re:actually by Anonymous Coward · 2004-02-23 23:48 · Score: 0

> 1 in X times we really do want a mini camera or free porn

Yes, but I don't want the free porn at my work address, which is the one that gets almost all the spam.

It could be more accurate than human by Anonymous Coward · 2004-02-23 13:22 · Score: 0

I use sa, and still get about 200 spams a day. Every once in a while, while deleting spam, I accidentally open one up. I imagine I have deleted non-spam mail too. If you count this human error, these methods could actually be more accurate than humans.

Or, in my case... by Atario · 2004-02-23 13:23 · Score: 1

...days! Yee haw!

--
"A great democracy must be progressive or it will soon cease to be a great democracy." --Theodore Roosevelt

News story Headline by tacokill · 2004-02-23 13:23 · Score: 3, Funny

"My Machine outthinks me!!"

I've seen better stories in Highlights for Children

Re:News story Headline by pcraven · 2004-02-23 14:34 · Score: 1

Which one? I must have missed it.

Re:Huh? Aren't humans 100%? by mattkime · 2004-02-23 13:23 · Score: 5, Insightful

Obviously you've never seen someone new to the internet sit in front of their computer. Lots of people don't know what popups are. Lots of people read some spam not knowing what it is. To these people, a computer is merely an interesting string of sensations.

--
Know what I like about atheists? I've yet to meet one that believes God is on their side.

*slams head against wall* by Faust7 · 2004-02-23 13:24 · Score: 5, Funny

I received an email from my sister-in-law from her work

Yeah, so did I. The subject line was "I want you so bad."

I deleted it. Turned out the message was genuine. I'll never forgive myself...

--
The coolest voice ever.

Re:*slams head against wall* by Bendebecker · 2004-02-23 13:48 · Score: 4, Funny

If you can't forgive yourself, I'll forgive you... as soon as I receive your sister-in-law's email address.

--
There's a growing sense that even if The Future comes,
most of us won't be able to afford it.
-- Lemmy
Re:*slams head against wall* by maddskillz · 2004-02-23 15:09 · Score: 5, Funny

If it was your sister-in-law sending you that subject line, you probably did the right thing and deleted it
Re:*slams head against wall* by Anonymous Coward · 2004-02-24 09:40 · Score: 0

not true - could be an unmarried sister in law.
ex.
the sister of the wife of his brother

I'm sure they're great, but... by LesPaul75 · 2004-02-23 13:24 · Score: 5, Insightful

I'm also sure that Yahoo's "SpamGuard" was great when they first introduced it. Now, it catches roughly half of all the spam I get. Why? Because people have figured out how it works and taken advantage of it. The same will happen with any content-recognition-based spam software. In the extreme case, even if a piece of software were 100% accurate at saying "This piece of e-mail looks like spam," then spammers would just make their e-mails look exactly like e-mail from one of your buddies. How could software ever tell the difference between:

Hey, dude, check out this website I found. There are some hot naked chicks and stuff. Sweet.
Signed,
Your Buddy

and

Hey, dude, check out this website I found. There are some hot naked chicks and stuff. Sweet.
Signed,
SpamKiddy

Even a human can't tell the difference. The only real difference is who they're from.

Re:I'm sure they're great, but... by Anonymous Coward · 2004-02-23 13:30 · Score: 0

if your friends send mail that looks like spam, get new friends.
Re:I'm sure they're great, but... by tepples · 2004-02-23 13:43 · Score: 1

get new friends.

True, but one cannot easily just get new family, and in this tight job market, one cannot easily just get new co-workers.
Re:I'm sure they're great, but... by RedWizzard · 2004-02-23 14:16 · Score: 2, Insightful

Even a human can't tell the difference. The only real difference is who they're from.
And that is all you need. I want website recommendations from friends, I don't want them from random spambots. That's enough for a human or a program to decide that one of those messages is spam and one is not.
Re:I'm sure they're great, but... by macshit · 2004-02-23 16:00 · Score: 1

I simply trashbag any mail that's in html (with slightly pickier handling of multipart/alternative), which seems to get rid of about 90% of the spam I receive. I can do this because basically everybody I know, and every mailing list I'm on, is clueful enough to not send html mail (even those who aren't computer types).

However this strategy wouldn't work very well for people that have ditzier friends.

The moral? Choose your friends well to avoid spam...

--
We live, as we dream -- alone....
Re:I'm sure they're great, but... by Anonymous Coward · 2004-02-23 17:16 · Score: 0

In statistics land, the text: Hey, dude, check out this website I found. There are some hot naked chicks and stuff. Sweet.
Signed, becomes closer to neutral in your filter as it learns, so the only thing it's paying attention to by now is "SpamKiddy". This is a little bit simplified because you've got all kinds of headers, HTML construction, and other things to look at, but it is the general idea.
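A toy Python sketch of that idea: words seen equally often in spam and in good mail drift to a neutral 0.5, and only the distinguishing token stays extreme. The formula is a simplified per-token estimate assumed for illustration, not the exact calculation used by any particular filter, and the two training messages are made up from the example in this thread.

```python
from collections import Counter

def token_spamminess(spam_msgs, ham_msgs):
    """Estimate, per token, how strongly it points to spam (1.0) or ham (0.0).
    Each message counts a token at most once."""
    spam_counts = Counter(t for m in spam_msgs for t in set(m.lower().split()))
    ham_counts = Counter(t for m in ham_msgs for t in set(m.lower().split()))
    scores = {}
    for tok in set(spam_counts) | set(ham_counts):
        s = spam_counts[tok] / len(spam_msgs)  # share of spam containing tok
        h = ham_counts[tok] / len(ham_msgs)    # share of ham containing tok
        scores[tok] = s / (s + h)
    return scores

spam = ["check out this website signed spamkiddy"]
ham = ["check out this website signed buddy"]
scores = token_spamminess(spam, ham)

print(scores["website"], scores["spamkiddy"], scores["buddy"])
```

Real filters clamp the extremes and look at headers and other features too, as the comment above says, so treat this only as the general shape of the calculation.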
Re:I'm sure they're great, but... by Anonymous Coward · 2004-02-23 18:45 · Score: 0

My friends don't send me messages like that. Oh wait they do sometimes:
http://www.haxxxor.com/
HaXXXor - Naked Chicks Teach You To Hack

Here's some of the cast names:
Elita (think E-Leet-a), Zero Day Grrrl, Mollycule

what will they think of next?
Re:I'm sure they're great, but... by Anonymous Coward · 2004-02-23 23:36 · Score: 0

My theory is that the competition between spam detectors and spam creators is what finally gives us the first artificial intelligence capable of defeating the Turing test. We just need to keep pouring as much money into the spam detection business as idiots are passing to the spammers...
Re:I'm sure they're great, but... by danila · 2004-02-24 03:24 · Score: 1

Apparently most people don't have spammer friends who write letters like that. I feel really sorry for you...

--
Future Wiki -- If you don't think about the future, you cannot have one.

Re:Huh? Aren't humans 100%? by Celandro · 2004-02-23 13:25 · Score: 4, Insightful

Perhaps they mean that Human A is reading email intended for Human B and attempting to classify the email as spam or not spam. I wouldn't be surprised if a computer could do a better job at that sort of task. Besides, I'm sure Human B wouldn't want Human A reading that cyber sex chat log.

Dup filters by Tablizer · 2004-02-23 13:25 · Score: 1

I am testing a dup filter for slashdot stories.
It is 99.9% accurate.
It is 99.9% accurate.
It is 99.9% accurate.
It is 99.9% accurate.
It is 99.9% accurate.
It is 99.9% accurate.
It is 99.9% accurate.

--
Table-ized A.I.

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-23 13:26 · Score: 0

Sure I can. No one knows My Email. I regard it all as spam. therefore, I only have email so that I can complain about spam.

Meh. by Anonymous Coward · 2004-02-23 13:26 · Score: 0

Yes, we all agree that being better than a human is damn near impossible.
Great.

Still better than the pay-for-email thing.

Honestly, who wouldn't rather delete an email or two
a week about their penis than pay for every message they send?
If pay were required for email, a new kind of electronic mail would develop.

Once again, the old saying was right by Anonymous Coward · 2004-02-23 13:27 · Score: 0

"Once robots outlaw humans from detecting spam only spam detecting outlaws will be human robots." or something like that.

How exactly did that work? by Stevyn · 2004-02-23 13:27 · Score: 1

Okay, so someone let 1 or 2 go during a test of over 6000 emails. I'd like to see their faces when the testers told them that their mother telling them to enlarge their penis was spam. I'd actually like to see that email that they thought was legitimate but was in fact some Nigerian asking for $5000 to "buy" $1,000,000.

Re:How can a human be wrong? by pclminion · 2004-02-23 13:27 · Score: 4, Informative

No matter what, in the end, the human CANT be wrong... right?

[*Bing* -- mail from VP of sales pops into my inbox. Subject: "Making money fast!"]

[*Bam* -- I hit delete, thinking "Stupid Spam!"]

Ahh, shit! Lookie, a human screwed up.

The filter would have actually examined the message and probably decided that it was legitimate.

Re:How can a human be wrong? by Behrooz · 2004-02-23 13:27 · Score: 1

No matter what, in the end, the human CAN'T be wrong... right?

Nah, wrong.

At least, I think it's wrong. Either way, one of us is wrong, so I must be right, because you said that humans can't be wrong and I said that you're wrong about that. Right?

--
"We have to go forth and crush every world view that doesn't believe in tolerance and free speech." - David Brin

Here's the real test by Otter · 2004-02-23 13:28 · Score: 2, Interesting

I'm very happy with POPFile but there's one thing it just can't handle -- bounces from spam with my domain forged in the header when the original text isn't included. And how could it know? The response is the same whether it's to my mail or to spam. The domain is a clue, I guess, but otherwise it seems like an impossible task. I just let them be sorted into my inbox and delete them manually.

If these filters can hit 99.99% with those, I'd be quite impressed.

--
What I'm listening to now on Pandora...

Adaptive adversaries by Pendersempai · 2004-02-23 13:28 · Score: 5, Insightful

It's really easy to design an effective solution when the problem is purely mechanical or natural. As long as you're working with spammers who don't adapt, you can slice through their shitstorms very effectively.

But when a single solution becomes mainstream, spammers will adapt to it. Bayesian filters tend to work very well, but now spammers are adding sprawls of randomly generated green-light text to offset the filter's score.

Google found an excellent way to rank websites, but then it became widespread enough that webmasters began to game the system it had created. It's been playing catch-up ever since.

Once the adversary begins to adapt, we lapse into the same cat-and-mouse game of technological barriers and counter-barriers that we've seen so many times before.

Re:Adaptive adversaries by kindbud · 2004-02-23 14:28 · Score: 2, Informative

Bayesian filters tend to work very well, but now spammers are adding sprawls of randomly generated green-light text to offset the filter's score.

That does not work. If anything, it makes the spam easier to identify, especially dictionary-salad-type spams that just list random words, most of which real people hardly ever use in actual emails. Dictionary salad just gives the Bayesian classifier more spam terms to work with. The rest of the terms, the ones that are common in real emails, converge on a neutral score real quick, and simply stop counting one way or another.
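
A rough sketch of why the padding stops counting, assuming a Graham-style combiner that only looks at the handful of tokens whose spam probability is furthest from 0.5. The token probabilities and the `combine` helper are made up for illustration and are not taken from any particular filter:

```python
from math import prod

def combine(token_probs, keep=5):
    """Combine per-token spam probabilities into one score.

    Only the `keep` tokens furthest from the neutral value 0.5 are used,
    so tokens sitting near 0.5 have little or no say in the result.
    """
    interesting = sorted(token_probs.values(),
                         key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    spam = prod(interesting)
    ham = prod(1.0 - p for p in interesting)
    return spam / (spam + ham)

# Hypothetical probabilities learned from earlier mail.
spam_core = {"viagra": 0.99, "unsubscribe": 0.95, "free": 0.90}
padding = {"the": 0.50, "meeting": 0.48, "weather": 0.52, "tuesday": 0.51}

plain = combine(spam_core)
padded = combine({**spam_core, **padding})
print(plain, padded)  # both scores stay very close to 1
```

Padding made of words that score strongly as good mail for one particular recipient would be a different story; this only covers the near-neutral case described above.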

--
Edith Keeler Must Die
Re:Adaptive adversaries by JuggleGeek · 2004-02-23 14:36 · Score: 2, Informative

But when a single solution becomes mainstream, spammers will adapt to it. Bayesian filters tend to work very well, but now spammers are adding sprawls of randomly generated green-light text to offset the filter's score.
I can't see how that would change anything. The "bad" keywords are still in the spam. The gobbledy-gook words (usually short clips of random books/stories/something) are legitimate words, but aren't very likely to have a high coincidence with the words found in my legitimate email.
I'm not using bayesian filtering, but I can't see those making much difference.
Re:Adaptive adversaries by Anonymous Coward · 2004-02-23 15:52 · Score: 0

But when a single solution becomes mainstream, spammers will adapt to it.

That's why it's so nice that these algorithms are completely different to each other.

Bayesian filters tend to work very well

I've never used one. The reason being, I have many email addresses, and one of them in particular receives thousands of spam mails for every legitimate mail that is sent to that address. My other addresses are more reasonable.

Now, if I train a bayesian filter to look at my email, without knowing any better, isn't it going to automatically file absolutely everything going to that address in /dev/null? After all, that simple rule would have a false positive once every few thousand emails, which is a reasonable ratio, and taken in context of the number of email addresses I have, it would seem like an extremely significant indicator that something is spam.
Re:Adaptive adversaries by Anonymous Coward · 2004-02-23 17:18 · Score: 0

No, it's called TOE - train on error. The program only changes the weight of words when you reclassify a message. The idea is that you don't reclassify messages that are properly sorted, so you don't add any more weight. After all, if your message is in the right category, you don't want to change that winning formula, right? So this is how it works. It will take you a little longer since you may not get regular messages at that address very often. You can use regular filters to make sure you don't miss them if you want. I have used it on an account that gets 100+ spam messages a day, and it has a 98.5% success rate. (popfile)
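
A toy sketch of the train-on-error idea, assuming a crude word-count classifier instead of a real Bayesian one. `ToeFilter` and its methods are hypothetical names, not POPFile's actual code:

```python
from collections import Counter

class ToeFilter:
    """Toy word-count classifier that is trained only when it guesses wrong."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def classify(self, text):
        words = text.lower().split()
        spam_hits = sum(self.counts["spam"][w] for w in words)
        ham_hits = sum(self.counts["ham"][w] for w in words)
        return "spam" if spam_hits > ham_hits else "ham"

    def feedback(self, text, correct_label):
        """Train on error: update the counts only if the guess was wrong."""
        if self.classify(text) == correct_label:
            return False  # already sorted correctly, leave the weights alone
        self.counts[correct_label].update(text.lower().split())
        return True

f = ToeFilter()
first = f.feedback("cheap pills online", "spam")   # guessed ham, so it trains
second = f.feedback("cheap pills today", "spam")   # guessed spam, no training
print(first, second)
```

The second message never touches the word counts, which is why an address that is nearly all spam does not keep piling weight onto the same words.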
Re:Adaptive adversaries by Anonymous Coward · 2004-02-23 18:07 · Score: 0

I don't know if they make much of a difference, but of the spams that sometimes make it through my Bayesian filter, the random-list-of-words ones are one of the two types (assuming that they don't use the same list as spam I've already classified). (The other type is the chatty "email from a friend" that doesn't make an explicit sales pitch, but just lists a URL.)
Re:Adaptive adversaries by millette · 2004-02-23 19:36 · Score: 1

I've been wondering why spammers don't use more relevant words, instead of random dictionary words. In the long run, relevant words would "poison" your Bayes filter, since you'd be flagging mails with "good words" so often. Anyway, I won't give too many details here - I don't want to help the bad guys for free!
Re:Adaptive adversaries by Anonymous Coward · 2004-02-23 19:46 · Score: 0

Except there IS an end game condition in this scenario:

Eventually, filters reach a level where they can nearly 100% identify stuff that you would WANT to read. This will be perilously close to AI.

Spammers will either be unable to send you a message that your AI self will let through, or craft email so well that you'll end up wanting to read it, and they'll probably use an AI to do this. This 2nd option will be the most effective form of advertising of all time. It'll be something equivalent to the "5 words that will make anyone fall in love" thing from Babylon 5.

Of course, it probably also won't be long after that, that the AI's reading our mail realize that it would just all be much simpler if they got rid of the end-points: Us.
Re:Adaptive adversaries by KjetilK · 2004-02-23 22:08 · Score: 2, Insightful

It doesn't work for people who train their filters themselves. Indeed, with my well-trained SA install, my Bayes marks those spams as BAYES_99.
But my old university, that has 40000 users, this has completely defeated their Bayesian filters. They say that the disk and CPU needed to have per-user Bayesian training is prohibitively expensive, and they found that training for all users was doing more harm than good.
So, we definitely need more approaches to the problem.

--
Employee of Inrupt, Project Release Manager and Community Manager for Solid
Re:Adaptive adversaries by WuphonsReach · 2004-02-24 04:58 · Score: 1

But my old university, that has 40000 users, this has completely defeated their Bayesian filters.

Frankly that doesn't surprise me. Using a single Bayesian database for 40k users is not going to work long-term because there are 40,000 divergent ideas about what is spam/ham. Bayesian is very good at the individual level and moderately useful at a small group level (2-50 people roughly). Individuals are generally consistent about what they classify as ham/spam, and a small group of related people is also likely to be self-consistent about spam/ham classifications. (One person's spam is another person's ham.)

--
Wolde you bothe eate your cake, and have your cake?
Re:Adaptive adversaries by asackett · 2004-02-24 05:54 · Score: 1

I have been using DSPAM for many moons now, and not even one of the messages of the kind you refer to has made it into my inbox.

--
Warning: This signature may offend some viewers.

Re:Depends... by smharr4 · 2004-02-23 13:28 · Score: 0

Combine the two together to get enlarged breast-shaped penises, or penis-shaped breasts.

Re:Huh? Aren't humans 100%? by evilmrhenry · 2004-02-23 13:29 · Score: 5, Insightful

Quite simple:
With 10 messages (after automatic spam detection) humans are 100% accurate.

With 1,000 messages, (before automatic spam detection)
humans are less than 100% accurate.

The experiment was done on 5849 messages.

Remember; one thing computers are good at is doing boring things repeatedly.

Bad science by bkhl · 2004-02-23 13:30 · Score: 1

What kind of stats is this? I would guess that the user's own selection of which mails to receive would be the definition of accuracy here.

Re:Huh? Aren't humans 100%? by Elwood+P+Dowd · 2004-02-23 13:31 · Score: 5, Interesting

No, humans are not 100%.

If you see a strange name in your inbox with an odd title, that might be a Nigerian businessman, or it might be your long lost Nigerian brother.

I recently tried to order a t-shirt from this guy for a band he used to be in. I found his band because we have the same (semi-uncommon) name. So, he got an email From: himself. I had to send him two emails because he deleted the first one assuming it was spam.

I ordered some RAM for my dad a while back. He gets 200 spam emails a day (email addy in resume & web page), and he deleted the confirmation email from the RAM vendor. The RAM never shipped, and it took us a week to figure out that there was a problem.

People make mistakes all the time. Why is this an unexpected result? People are jackasses. This should be obvious.

--

There are no trails. There are no trees out here.

Re:How can a human be wrong? by Anonymous Coward · 2004-02-23 13:31 · Score: 0

Ah, but if you had opened the email, thereby giving both you and the program the same criteria with which to classify as spam or not, then you wouldn't have deleted it.

Based just on subject line I'd be tempted to say that a computer would also have classified your example as spam.

Could somebody explain this to me... by heldlikesound · 2004-02-23 13:32 · Score: 5, Interesting

I order all kinds of stuff online, wouldn't the receipt emails look like spam? My current spam solution is very simple:

1. display my email online as little as possible

2. use a number of addresses that all filter into one account, then filter by the sent-to address... this has turned up some VERY interesting results. For instance, I used dellorders@mydomain.com for an order from Dell, and NEVER used it or even typed it anywhere again, and started getting spam about 6 months later, and I mean the nasty stuff, not just innocent stuff from Dell resellers...

3. I built a rudimentary filter that looks for viagra,free,debt,enlarge, etc... if the sender is not in my address book, and the email contains these words, it is sent to a "check these out" folder...
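
A minimal sketch of the rule in step 3, assuming the address book is a set of lowercase addresses. The `route` function and the sample addresses are invented for illustration:

```python
import re

SPAM_WORDS = {"viagra", "free", "debt", "enlarge"}  # the words named in step 3

def route(sender, body, address_book):
    """Return the folder for a message: 'inbox' or 'check these out'."""
    if sender.lower() in address_book:
        return "inbox"  # known senders are never diverted
    words = set(re.findall(r"[a-z]+", body.lower()))
    if words & SPAM_WORDS:
        return "check these out"
    return "inbox"

book = {"mom@example.com"}
a = route("mom@example.com", "Free dinner on Sunday?", book)
b = route("stranger@example.net", "Get out of DEBT now", book)
c = route("orders@example.org", "Your order has shipped", book)
print(a, b, c)
```

A receipt like the third message passes because it contains none of the listed words; one that says "free shipping" would land in the review folder instead.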

How might a spam filter help me out without zapping confirmation type emails?

--

Cloud City Digital: DVD Production at its cheapest/finest

Re:Could somebody explain this to me... by Anonymous Coward · 2004-02-23 13:38 · Score: 1, Interesting

*Most* spammers have some very recognizable patterns, because they're classic advertising patterns. They use BIG PRINT, they offer a very limited set of popular and fraudulent products (such as free prizes and Viagra) and now use various tricks to avoid other spam filters. Normal on-line business traffic should not trigger this: if it does, you should be able to notice it and create a whitelist for that sender.

Those classic spam patterns are detectable, but writing the detection rules as a static list is a bitch and a half. And as soon as you publish *static* rules, your rules will be circumvented.

The Bayesian/Markovian style learning of these tools helps randomize the rules so there is no magic bullet to get past them.
Re:Could somebody explain this to me... by caseih · 2004-02-23 13:56 · Score: 4, Informative

If you don't control the mail server to create aliases for yourself, you can also employ RFC-compliant suffixes to your e-mail address. For example:
foobar+dellorders@mydomain.com.
Re:Could somebody explain this to me... by asavage · 2004-02-23 14:07 · Score: 1

Something like your street name or postal code works well, or your name if it isn't in the email address. Those should all be in any confirmation email. Check for them before the spam goes to your main filter. On a side note, a Bayesian filter will see those words and, as they never occur in spam, let all confirmation emails through.
Re:Could somebody explain this to me... by Anonymous Coward · 2004-02-23 14:54 · Score: 2, Funny

I used dellorders@mydomain.com for an order from Dell, and NEVER used it or even typed it anywhere again, and started getting spam about 6 months later, and I mean the nasty stuff, not just innocent stuff from Dell resellers...

You probably ticked off "Eric" the Indian tech. I talked to that guy yesterday. What a jerk.
Re:Could somebody explain this to me... by Fnkmaster · 2004-02-23 15:53 · Score: 3, Informative

Unfortunately, even though it's RFC-compliant, I've found probably half the sites I have to give my email address to won't grok the username+filtername@mydomain.com syntax. It's convenient when it works, but it doesn't work enough to rely on. No, throw-away spam-bait email addresses that you use for 6 months at a time for all online ordering and the like, then eventually trash when they get too spam-ridden are the best solution I know of.
Re:Could somebody explain this to me... by cdefghijklmnop · 2004-02-23 17:03 · Score: 1

Isn't it also helpful to add some extra characters or numbers to that so that people (or spammers) won't easily guess your alias? One example could be the date you gave the alias to someone or some random set of words or numbers or both added to it. Like: foobar+slashdot20040224@mydomain.com.
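
For what it's worth, building an alias in that shape is a one-liner. A small sketch, with `tagged_alias` as a made-up helper name:

```python
from datetime import date

def tagged_alias(user, domain, site, day):
    """Build user+siteYYYYMMDD@domain, the format used in the example above."""
    return f"{user}+{site}{day:%Y%m%d}@{domain}"

alias = tagged_alias("foobar", "mydomain.com", "slashdot", date(2004, 2, 24))
print(alias)  # foobar+slashdot20040224@mydomain.com
```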
Re:Could somebody explain this to me... by Vainglorious+Coward · 2004-02-23 17:28 · Score: 1

I've found probably half the sites I have to give my email address to won't grok the username+filtername@mydomain.com syntax

Interesting, and a little surprising. I use the dash character for alias extensions (eg username-filtername@example.com) and I've never come across a site that had problems with it.

--
My next sig will be ready soon, but subscribers can beat the rush
Re:Could somebody explain this to me... by viware · 2004-02-23 18:36 · Score: 1

I use the website domain as the prefix for the email address, eg if the website is dell.ca, then the email address I give them is dell.ca@mydomain.com

In this way I'm never likely to need any of the addresses I give out for anything else, and they are very unlikely to overlap.
Re:Could somebody explain this to me... by mdfst13 · 2004-02-24 00:59 · Score: 3, Informative

username+filtername@domain.com should go to username@domain.com as per the RFC (the +filtername is carried but not used by servers, or at least it shouldn't be). Some email clients will allow you to use this for such things as folder sorting (i.e. username+foldername goes into foldername automatically). If this worked consistently, it would be good for people who don't have the ability to make more usernames.

AFAIK, username-filtername will still just go to username-filtername, i.e. you have to configure your mail server to handle username-filtername separately from username. This works great when you can specify as many usernames as you want (i.e. if you manage your own server or have a catch-all on your domain).

Maybe you are talking about something different than the original poster?

One reason why the - would work when the + does not is that the - can appear multiple times, so it is just another valid character (like a letter, number, or underscore). The + can only appear once, so many servers can ignore it, drop it, or puke on it.

Interestingly enough, while the (optional) challenge/response system is what gets the press, the main purpose of TMDA is to create aliases like username-filter (and then filter based on them). Thus the name: *Tagged* Message Delivery Agent. The -filter is the tag of Tagged.
Re:Could somebody explain this to me... by MyFourthAccount · 2004-02-24 01:04 · Score: 1

you can also employ RFC-compliant suffixes to your e-mail address. For example:
foobar+dellorders@mydomain.com.

What's the use of that? Wouldn't it be trivial for spammers to write a filter that converts foobar+dellorders@mydomain.com to foobar@mydomain.com? In which case they have your 'real' email address.
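
To make the worry concrete, here is roughly how little code it would take, assuming the tag is everything after the first + in the local part (`strip_tag` is just an illustrative name):

```python
def strip_tag(address):
    """Drop a '+tag' from the local part of an address, if there is one."""
    local, _, domain = address.rpartition("@")
    base = local.split("+", 1)[0]
    return f"{base}@{domain}"

bare = strip_tag("foobar+dellorders@mydomain.com")
unchanged = strip_tag("foobar@mydomain.com")
print(bare, unchanged)
```

The tag still tells you who passed the address along, as long as the list you end up on kept it intact.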
Re:Could somebody explain this to me... by Vainglorious+Coward · 2004-02-24 03:17 · Score: 1

You're right - I did misunderstand the original poster, as of course, you have to actually set up the dash char on the server.

--
My next sig will be ready soon, but subscribers can beat the rush
Re:Could somebody explain this to me... by danila · 2004-02-24 03:29 · Score: 1

I order all kinds of stuff online, wouldn't the receipt emails look like spam?

I built a rudimentary filter that looks for viagra,free,debt,enlarge, etc... if the sender is not in my address book, and the email contains these words, it is sent to a "check these out" folder...

"Check these out"? You are buying that stuff? So we have to thank you for making spam profitable...

Check out this site, it has everything you might need in the future. All kinds of stuff.

--
Future Wiki -- If you don't think about the future, you cannot have one.
Re:Could somebody explain this to me... by oobar · 2004-02-25 19:21 · Score: 1

The thing you have to remember is that just because you'd never used dellorders@mydomain.com before and now it's being spammed does NOT mean that Dell sold your email address. It's just as likely that a spammer used a dictionary attack on your server, randomly trying addresses composed of dictionary words. That is, unless your domain accepts wildcards for the local_part. In that case Dell probably supplied it to a partner who supplied it to a partner who supplied it to a partner, etc. Not that that doesn't make them all spammers, of course, but it does explain how you could get on someone's spam list even if Dell itself had a strict no-spam policy.

Operating on a different scale... by ptolemu · 2004-02-23 13:32 · Score: 2, Interesting

I think these guys are trying to put the focus on the server side of things where they emphasize greater speed and efficiency in eliminating spam from a large number of accounts as opposed to a single one. Just out of curiosity, do Thunderbird and iMail use similar filtering techniques with their junk mail controls?

Re:How can a human be wrong? by Anonymous Coward · 2004-02-23 13:32 · Score: 0

No, that just means that the human defined the rules that it and the computer will follow. However, the computer will always follow those rules, while the human won't. People will sometimes delete a mail, perhaps not recognizing an address from a valid sender, thinking it was spam. This would be a failure on the human's part. Or perhaps they might consider opening up an email and reading it, when it is in fact spam, a failure.

Re:Huh? Aren't humans 100%? by Dulimano · 2004-02-23 13:33 · Score: 2, Interesting

No, imaginary humans with infinite time and dedication are 100%. But real humans are not. The percent goes down with time and dedication continuously, so I really don't understand what this 99.84% means.

This is just carp. by corian · 2004-02-23 13:33 · Score: 3, Insightful

Spam is what is defined by humans as Spam.

To determine the accuracy of a spam detector, it is necessary first to come up with a sample of what is or isn't Spam. (I'd assume a human would do this?) So the best result we can get by evaluating humans is how often they agree with the result of the initial label.

This figure probably won't be 100%. People have slightly different concepts of what mail is requested vs. unwanted, and what is advertising or useful information. So there is a valid possibility of disagreement.

That doesn't mean humans can't do the job accurately. (After all, if they couldn't, then the initial human-made labels would themselves be wrong and any data based on them meaningless!)

If the training data is labeled with the same criteria as the test data, it is obviously possible that a trained system can achieve results which more closely agree with the test data. They are being trained on similar data. But that doesn't mean that the system is MORE accurate at detecting spam than humans. It means that the system agrees with a particular human (or set of humans) more than other people do in a labelling of spam/non-spam.

For all we know, the evaluator's idea of spam is "wrong".

Re:This is just carp. by sholden · 2004-02-23 13:50 · Score: 4, Insightful

They are learning algorithms. For measuring their accuracy you have to assume that the data is correctly classified so you can see how they do.

The point is that humans also aren't perfect. Have a person classify 10000 emails and they will make a few mistakes. Point out those mistakes, and they will say "yes, I got that wrong it is an email from my wife reminding me to pick up milk and not a spam trying to sell me printer ink, I must have been day dreaming."

Just like if you give a person a document and say "find all the spelling errors" they will probably miss some. This is not because they have a different definition of how those words are spelt, it is because they made some mistakes.

For the training/testing data, some double checking needs to be done to find the mistakes the human classifying it almost certainly made.

It's a pretty normal situation in any machine learning application, you don't have to be perfect to be as good as a human - after all humans are only human.
Re:This is just carp. by BigBadBri · 2004-02-23 14:07 · Score: 1

As someone who hates to carp, didn't you mean crap?
Or are you making a clever reference to the bottom-feeders that make up the spamming classes?
Koi, for one, would like to know...

--
oh brave new world, that has such people in it!
Re:This is just carp. by Anonymous Coward · 2004-02-23 14:38 · Score: 0

Just like if you give a person a document and say "find all the spelling errors" they will probably miss some. This is not because they have a different definition of how those words are spelt, it is because they made some mistakes.

Just like how you misspelled "spelled" with "spelt". Spelt is a wheat used in animal feed.
Re:This is just carp. by Trillan · 2004-02-23 15:08 · Score: 1

Mmm. I don't think I agree. Spam is the common term used for unsolicited bulk commercial email. The other day, I got an email I nearly flagged as spam... but it was actually a reply to a post I'd made to usenet several months ago. So it was neither unsolicited nor commercial, just unrecognized because of the lateness.

If I had flagged it as spam, would it have been spam? It was still written by one person, in direct response to my question...
Re:This is just carp. by Dirtside · 2004-02-23 15:25 · Score: 1

Just like if you give a person a document and say "find all the spelling errors" they will probably miss some. This is not because they have a different definition of how those words are spelt, it is because they made some mistakes.
The situation is different, because though we do have standard spellings of words, we don't have a standard definition for what constitutes spam -- because whether or not an email is spam depends on whether the recipient thinks it's spam. In other words, if two people receive the same email, there's a chance that one of them will think it's spam, and the other will say, no, it's not spam. So did the second guy make a "mistake?" If so, why? Why is his definition of spam less valid than the other guy's?
(Of course if you want to get pedantic, "standard" spellings of words are those which happen to be agreed upon by the overwhelming majority. Overwhelming to such a degree, in fact, that you can "prove" you're right about the spelling of a word: look it up in the dictionary. But whether or not a particular piece of email qualifies as "spam," assuming you've settled upon a particular definition (which would vary from person to person), is still in the eye of the beholder.)

--
"Destroy science and religion. Science would re-emerge exactly the same; but not religion." - Penn Jillette, paraphrased
Re:This is just carp. by sholden · 2004-02-23 17:17 · Score: 1

Not according to my dictionary which gives as one definition:

A past tense and a past participle of spell.

But my spelling and grammar is awful and always will be awful - it's a subconscious rebellion against my English teacher father...
Re:This is just carp. by JasonStiletto · 2004-02-23 17:19 · Score: 1

Well, human mistakes in the incoming data filter down to noise; with enough emails, a few mistakes in the data going in won't matter much. Since the initial data for both of these programs comes from the user, and not the programmer, individual differences on what spam is wouldn't matter. Sure, it classifies what you would classify as spam. It just never hits the wrong button. It solidly follows your rules. Perhaps you only consider e-mails from your cousin Berney spam. If that's all that annoys you, and that's all it filters because that's what you taught it spam was, then individual definitions of spam are meaningless. It would filter out what you think spam is, unless you make a random and arbitrary decision on each email you get: "This is spam, and this is not." But if you are doing that anyway, you could use rand() for a spam filter.
Re:This is just carp. by sholden · 2004-02-23 17:26 · Score: 1

That's irrelevant.

The idea is to get Fred to classify some email into spam and non-spam. Get Fred to double and triple check. Have someone else look over it and point out any potential problems. Repeat a few more times.

We now have a set of spams and hams correctly classified by Fred according to Fred's definition of spam.

Now give the same set of email to Fred after some time has passed (but not enough time for Fred's definition of spam to have changed) and get him to classify it into spam and non-spam again.

Fred will most likely make some mistakes when doing this and hence will have a non-zero error rate.

The spam filter's error rate will most likely also be non-zero but less than Fred's.

It's got nothing to do with two people disagreeing on what is and what isn't spam. It's got to do with people making mistakes. That Bob would classify some of the email differently is irrelevant.

I've hand classified email before, and I made mistakes. My definition of spam didn't change between making the mistake and noticing it, I simply put a few emails in the wrong category.
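
The comparison described above boils down to counting disagreements with the carefully re-checked labels. A small sketch with entirely made-up labels (ten messages, where Fred's second pass slips twice and the filter slips once):

```python
def error_rate(gold, guesses):
    """Fraction of messages where a guess disagrees with the agreed label."""
    wrong = sum(1 for g, p in zip(gold, guesses) if g != p)
    return wrong / len(gold)

# Hypothetical data: the re-checked set, Fred's later pass, and a filter's pass.
gold        = ["spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham", "spam", "spam"]
fred_again  = ["spam", "ham", "spam", "ham",  "ham", "ham", "spam", "spam", "spam", "spam"]
filter_pass = ["spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham", "ham",  "spam"]

fred_err = error_rate(gold, fred_again)
filter_err = error_rate(gold, filter_pass)
print(fred_err, filter_err)
```

Both rates are measured against Fred's own definition, so Bob's opinion never enters into it.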
Re:This is just carp. by corian · 2004-02-23 17:33 · Score: 1

It's got nothing to do with two people disagreeing on what is and what isn't spam. It's got to do with people making mistakes. That Bob would classify some of the email differently is irrelevant.

I've hand classified email before, and I made mistakes. My definition of spam didn't change between making the mistake and noticing it, I simply put a few emails in the wrong category.

But now you are getting down to the specifics of using a specific tool to read through mail as you decide which items to move to which folder (good, or spam). Of course, with a bad interface, you can make mistakes. It could be too easy to double click, or if you hold the delete key a second too long it could delete two mails at a time. Or you could not be paying attention.

That's irrelevant.

We're not talking about a specific mail reading interface. We're talking about pure classification. I give you a piece of mail to look at and you tell me "This is spam" or "This is not Spam". That's what the program is doing.
Re:This is just carp. by sholden · 2004-02-23 20:41 · Score: 1

Yes, and people sometimes get that wrong too.

And in my case I was looking at the files one by one in a text editor and moving them with a command. My mistakes were not key press errors or interface errors, they were due to the fact that it is a mind numbingly tedious task and your brain switches off and tells you that the newsletter is spam even though you know it isn't.
Re:This is just carp. by Dirtside · 2004-02-23 20:44 · Score: 1

The situation you describe is *not* what's happening in this article. What's happening here is more like this:

Bob comes up with a definition for spam. He manually classifies 5000 emails as spam or non-spam, then writes a filter that almost perfectly matches his classifications.

Then he gives the emails to Fred and has Fred classify them. Fred's classification differs from Bob's enough that Fred's "accuracy" according to Bob's definition is worse than Bob's filter.

Yeah, I agree with you, in the situation you described, Fred is likely to make some mistakes, even using his own definition of spam. But that's not what's happening here.

--
"Destroy science and religion. Science would re-emerge exactly the same; but not religion." - Penn Jillette, paraphrased

Ah, procmail... by telekon · 2004-02-23 13:35 · Score: 1

The procmail element, IMHO, incorporates a bit of 'human' in the machine... as other posts have mentioned, I decide what I consider to be spam for me. So, the server-side component would filter out what a machine can determine to be clearly spam, and a couple of standard procmail recipes would catch the rest of what is "most spam to most people."

But I don't wanna have to do any of it by hand, so I'm gonna add my own recipes... so I don't see the stuff that isn't spam to the machine (not in the .0016% that it 'misses') but that I consider spam. That's a human using a tool to do something...

Nothing will ever be 100%, but the asymptote can get smaller and smaller the more closely the user and machine are working together for this.

As far as Mozilla integration... mozilla's just reading from my mail spool, so why would I want my MUA consuming resources that procmail will use more efficiently, silently in the background?

Some people won't run procmail, or run some OS that isn't compatible. I understand. But that's like not wearing a seatbelt: do so at your own risk and possible injury.

--

To understand recursion, you must first understand recursion.

Re:Huh? Aren't humans 100%? by dbarclay10 · 2004-02-23 13:37 · Score: 4, Interesting

How can a spam filter be more accurate than humans? Humans are always the last step in spam filtering... I use popfile and it catches 99% but it still needs me, because I'm the only one capable of identifying spam 100% of the time.

And if the study posted about is accurate, of those 1% that are left, you will (if you're a perfectly average person) accidentally delete 0.16% of good messages. Surely you've deleted a valid message by accident before? I do it regularly; deleting 25 spam messages with a single good one embedded in them when I've just woken up and haven't had my coffee is not a good thing ;)

At the very least, if you were given the same data as these tests, that would be true. Consider if you *didn't* use popfile - how many spams would you be deleting every day, and how many good messages would be accidentally deleted? I know that if I had to manually delete the two or three hundred spams interspersed with good messages, my false-positive rate (the percentage of good mail I accidentally deleted) would skyrocket.

So just be glad you've got popfile. Not only do you not have to go through as much spam, but you're also more accurate while going through the little you must.

--

Barclay family motto:
Aut agere aut mori.
(Either action or death.)

KILL SPAM HYPE!! by Anonymous Coward · 2004-02-23 13:38 · Score: 0

I need a filter to prevent anti-SPAM hype from getting into my brain's inbox!!

Re:Huh? Aren't humans 100%? by BillyBlaze · 2004-02-23 13:39 · Score: 2, Insightful

If you have no spam filters, then classifying email amounts to "delete, delete, delete, delete, down-arrow, delete, delete, down-arrow, delete, delete, whoops!" That one mistake just dropped your average to 90%. Frankly, I'm amazed humans scored as well as they did.

The true test of a spam filter... by GrpA · 2004-02-23 13:39 · Score: 5, Insightful

Results of new spam filters cannot help but be bogus... The true test of a filter is how well it works *after* all the spammers know how it works and try to circumvent it.

--
Enjoy science fiction? "Turing Evolved" - AI, Mecha, Androids and rail-gun battles. What more could you want?

Re:The true test of a spam filter... by Anonymous Coward · 2004-02-23 13:50 · Score: 2, Interesting

Statistical/Probabilistic filters are adaptive, and are capable of learning new characteristics of spam. This is the biggest difference between SpamAssassin (which has a set of predefined "rules") and these two filters. These filters break down each message into tokens and statistically weigh the tokens based on prior learning. If one of them makes a mistake, you can teach it. AFAIK these have been around for at least a couple years, and have only increased in accuracy over time.
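
The tokenize, weigh, and retrain cycle described above can be sketched in a few lines. This is only a toy naive Bayes classifier, not the code of either filter in the story; the class name `TokenFilter`, the tokenizer, and the add-one smoothing are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


class TokenFilter:
    """Toy statistical filter: naive Bayes over word tokens, add-one smoothing."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "good": Counter()}
        self.message_counts = {"spam": 0, "good": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase the message and split it into simple word tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        # "Teaching" the filter just means updating its token statistics.
        self.token_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        spam, good = self.token_counts["spam"], self.token_counts["good"]
        vocabulary = len(set(spam) | set(good)) or 1
        spam_total, good_total = sum(spam.values()), sum(good.values())
        # Start from the smoothed prior odds, then add each token's evidence.
        log_odds = math.log((self.message_counts["spam"] + 1) /
                            (self.message_counts["good"] + 1))
        for token in self.tokenize(text):
            p_spam = (spam[token] + 1) / (spam_total + vocabulary)
            p_good = (good[token] + 1) / (good_total + vocabulary)
            log_odds += math.log(p_spam / p_good)
        # Clamp so the logistic function below cannot overflow.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Correcting a mistake is just another call to `train` with the right label, which is why this kind of filter can keep adapting as the spam changes.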
Re:The true test of a spam filter... by kryptkpr · 2004-02-23 15:43 · Score: 1

I run POPFile to help me manage my inbox. I've found that over the past few months, I've received and increasing ammount of spams filters. These come in [at least] 2 flavors:

a) Spams that contain a large number of "innocent" words, HTML formatted to be white on white. The real spam message is the (innocently named) inline image.

b) Spams that intentionally mis-spell "spammy" words (how many different ways have you seen viagra spelt by spammers?) by inserting random letters here and there. Sometimes this is done to the point of near-unreadability.

The point here is, the spammers are learning. When new anti-spam techniques come out, they will find [feeble] ways to attack them... but they're definitely on the losing side. A few clicks and I've re-trained my filter to deal with their latest and greatest.

What I don't understand is why spammers go through the trouble of pulling tricks like the ones above. Bayesian filtering is almost exclusively a client-side system, and if I'm going through all the trouble of setting up a spam filtering system, do the spammers really think I will buy something from them?

--
DJ kRYPT's Free MP3s!
Re:The true test of a spam filter... by kryptkpr · 2004-02-23 15:45 · Score: 1

Damnit, I used preview too!

I've received an increasing amount of spams specifically designed to thwart Bayesian-based filters.

--
DJ kRYPT's Free MP3s!

So what? by jmb-d · 2004-02-23 13:40 · Score: 0, Redundant

Don't care, as I don't use IM...

--
In walking, just walk. In sitting, just sit. Above all, don't wobble.
-- Yun-Men

Re:So what? by Anonymous Coward · 2004-02-23 14:25 · Score: 0

Wow, that was useless.

Spinal Tap? by Anonymous Coward · 2004-02-23 13:40 · Score: 2, Funny

Hasn't anybody noticed the obvious Spinal Tap reference?

Jeanine: You know, it might have been better if the, uh, album had been mixed right.
David: Well I suppose you could cry about that, of course it's true. I mean it's true.
Jeanine: It wasn't...it was mixed all wrong, wasn't it?
Nigel: It was mixed wrong?
Jeanine: Yeah....
Nigel: Were you there?
Jeanine: ...you couldn't hear the...
Nigel: How do you know it was mixed wrong?
David: But she's...she's heard the...she's heard the record.
Jeanine: No, but I've heard the album.
Nigel: So your judgement is that it was mixed wrong.
Jeanine: You couldn't hear the lyrics all over it.
David: You don't agree that you can't hear the vocals?
Nigel: No, I don't. I do not agree. No.
David: Well I think maybe....
Nigel: It's interesting that she's bringing it up.
David: Well she'd like to hear the vocals.
Nigel: I mean it's like it's me saying, you know, you're using the wrong conditioner for your hair.
David: Don't be stupid.
Jeanine: You don't, you don't do heavy metal in dobly, you know, I mean...it's
Nigel: In what??? In what???
Jeanine: In dobly...
Nigel: In dublin!?! What's that?
David: She means Dolby, alright? She means Dolby, you know? You know perfectly well what she means.

This spam filter goes to 11!

This spam goes up to 11... by Anonymous Coward · 2004-02-23 13:40 · Score: 2, Funny

DSPAM implements a Dolby-type noise reduction algorithm called Dobly

Despite the musical reference on the DSPAM site, I figured some people still won't get the joke. So here it is:

JEANINE: You don't- you don't do heavy metal in dobly, you know, I mean...it's-- NIGEL: In what??? In what??? JEANINE: In dobly... NIGEL (GRINS): In doubly!?! What's that? DAVID: She means Dolby, alright? She means Dolby, you know? You know perfectly well what she means.

--from the movie "Spinal Tap"

Re:This spam goes up to 11... by PetWolverine · 2004-02-23 18:50 · Score: 1

The other one is a Dr. Strangelove reference. The CRM114 was the code module in the radio in the bomber that ultimately blew up the world. It kept the signal to turn back from its mission from getting to the pilot.

I think there's a lesson in there somewhere.

--
I found the meaning of life the other day, but I had write-only access.

Re:can it be used with SA? -yes by wideangle · 2004-02-23 13:41 · Score: 5, Informative

A CRM114 plugin for SA is available, thanks to Devin Nate:

http://bugzilla.spamassassin.org/show_bug.cgi?id=2301

Image Noise Reduction and Machine Learning by use_compress · 2004-02-23 13:42 · Score: 3, Interesting

I find it interesting that an algorithm that was originally for image noise reduction found its way to Machine Learning through a company whose purpose is to implement noise reduction in audio. From my Googling, I think this is the first time anyone has used Bayesian Noise Reduction in Machine Learning. Does anyone know otherwise?

Re:Image Noise Reduction and Machine Learning by Magada · 2004-02-23 21:11 · Score: 0

well, not really. Adaptive OCR software's been doing this for ages. With varying rates of success, certainly, but nonetheless. Ironically enough, one application of OCR is to beat those little "write here the text you see in the blurry pic above" types of tests against login bots.

--
Something bad is coming when people are suddenly anxious to tell the truth.

It's much easier in theory by Anonymous Coward · 2004-02-23 13:42 · Score: 0

Once the method is available to spammers, they can start to work around it. I've just started getting spams that don't trigger a single SpamAssassin test. These guys are pre-testing the spams. It's easy to detect all the old spam well; once you're the accepted product, you're on the front line and life is a lot harder.

ObSpamError: I almost deleted an e-mail from a friend with the subject "Free Beer". It was only because no spammer could guess that well that I took a second look at the sender.

Re:It's much easier in theory by chollowayss · 2004-02-23 13:50 · Score: 0

if it was spam it would have been "Fr33 B33r" or "FREE B.E.E.R!" or something similar... thinking about spam makes me want to kill randomly.

--

"The next generation of interesting software will be made on a Macintosh, not an IBM PC." -Bill Gates

Re:Huh? Aren't humans 100%? by Suhas · 2004-02-23 13:43 · Score: 1

What I would like to know is, if the spam filter is more accurate than humans, then by definition, how have they detected 1 misclassification? Maybe there were two misclassifications and they detected only 1? By definition, they are worse off than the filter itself.....

Re:Huh? Aren't humans 100%? by helzerr · 2004-02-23 13:44 · Score: 3, Funny

To these people, a computer is merely an interesting string of sensations.

Best phrase I've read all week... Oh, yeah, it's only Monday! This one will probably hold me over 'till Friday, though. ;-)

More accurate than what..? by EdMcMan · 2004-02-23 13:44 · Score: 2, Insightful

If humans don't have 100% accuracy, who/what is defining what spam is?

Re:More accurate than what..? by No.+24601 · 2004-02-23 14:46 · Score: 1

If humans don't have 100% accuracy, who/what is defining what spam is?
If humans don't have nanosecond-response times, who/what is guiding the stealth bomber?
Re:More accurate than what..? by Andrew+Cady · 2004-02-23 15:27 · Score: 1

Another poster's analogy was incredibly apt. Computer spellcheckers also have higher accuracy than human ones. Think about it.
Re:More accurate than what..? by Anonymous Coward · 2004-02-23 15:55 · Score: 0

Humans don't have 100% accuracy when faced with dozens of emails at once and other stuff they need to be paying attention to (the norm). No doubt, this was checked by splitting the workload amongst many people and having them check very carefully.
Re:More accurate than what..? by Anonymous Coward · 2004-02-23 17:40 · Score: 0

Give your email to somebody else, and have him determine which emails are spam. That's what they mean. Obviously, the person whom the mail is actually intended for is the person who determines correctness.
Re:More accurate than what..? by pe1chl · 2004-02-23 22:20 · Score: 1

But spelling is an exact science defined by a pre-agreed dictionary. Something is either spelled correctly, or it isn't.

There might be such a definition for SPAM, but it is not defined by the content of the message. The reader of the message is subjectively determining if something is SPAM.
Re:More accurate than what..? by Andrew+Cady · 2004-02-24 18:35 · Score: 1

Spelling is not an exact science either; the "correct" spelling of a word is a mixture of usage and history and it is not necessarily clear what to do when the two are inconsistent with each other or within themselves.

But this is all quite beside the point. The question was how can humans be inaccurate when humans define accuracy. That question is silly. Humans define accuracy, but they are not by definition accurate.

Blame it on the spammers... by Anonymous Coward · 2004-02-23 13:44 · Score: 0

We are after all, what we eat. I feel sorry for the guy if he's read so much marketing-speak that he writes like a spammer now...

Let's get this straight people! by mabu · 2004-02-23 13:44 · Score: 4, Insightful

client/server-side filtering does NOT solve the problem!

The biggest problem with spam is the invasion of third party computers on the Internet. The ILLEGAL activity spammers perpetrate by breaking into machines, forging headers and hijacking servers.

Any filtering method does not address this most serious problem, and even if you do not see any spam in your inbox, you're still paying for the bandwidth and system resources these spammers steal.

Stop with the filtering algorithms and take some of that energy and contact your local Attorney General, DA and FBI and demand that they prosecute these people who are BREAKING THE LAW.

Re:Let's get this straight people! by BlueTrin · 2004-02-23 14:25 · Score: 1

Why was it modded as flamebait? This guy is telling the truth, even if he is using blunt words.

--
Don't you know it is now both immoral and criminal to think beyond the next quarterly report?
Re:Let's get this straight people! by Anonymous Coward · 2004-02-23 15:59 · Score: 0

client/server-side filtering does NOT solve the problem!

It does and it doesn't. Filtering it out is a form of boycott, especially when mail administrators implement it on behalf of many users. By filtering it out, they just make the percentage of people responding to them even lower. Make it low enough, and you'll put the spammers out of business.
Re:Let's get this straight people! by mabu · 2004-02-23 16:15 · Score: 2, Interesting

As an ISP that has to try to do my best to provide my clients with "spam free" e-mail, I have to pass these costs onto the clients, whether they're in the form of charges for additional bandwidth or ineffective server-side filtering systems.

When you filter e-mail at the client or server side based on content, the spammers have no idea that their efforts are truly ineffective. At least RBLs send them a message. Content-based filtering is TOTALLY, TOTALLY ineffective. Yea, it makes the spam go away for a short period, but adds the burden of having to deal with legitimate mail being blocked and you still have to waste 70+% of resources you wouldn't normally need to handle legitimate e-mail. When you're not managing systems that are constantly under attack, you might not realize what a complete fucking mess it is.

On any given day, I have at least 20-30 probes and attempts to DOS my open ports into breaking down and giving these spammers some form of access. I'm having to build new systems to handle the existing load, not because my clients need more resources, but because the spammers progressively eat up more and more system resources. E-mail IS an almost-instantaneous communication medium. BUT, because of spammers, it no longer is in many cases, especially with larger ISPs. The spammers, because the authorities won't shut them down, are screwing everything up and content-based filtering is something they LOVE because it's completely ineffective in the long run.
Re:Let's get this straight people! by noidentity · 2004-02-23 16:31 · Score: 1

"Any filtering method does not address this most serious problem, and even if you do not see any spam in your inbox, you're still paying for the bandwidth and system resources these spammers steal."

If the advertisers don't get any responses, they won't use that spammer again. If no spammer can get e-mail through, the spammers go out of business. So, effective filters widely deployed can cause spammers to choose another line of work.
Re:Let's get this straight people! by sootman · 2004-02-23 17:50 · Score: 2, Informative

Laws don't stop people from driving drunk*, and drunk drivers are in this country and even (by definition) driving out in public, in plain sight of everyone. How, exactly, would US law enforcement prosecute a $NATIONALITY1 spammer who's using a hijacked $NATIONALITY2 computer?

Laws are fine, but what would *really* work is if everyone were filtering spam, and everyone tells all their newbie friends & relatives what spam is and installs blocking software for them. If sending 1,000,000 spams no longer results in 10 sales, spam *will* stop.

* yes, laws do stop *some* people from driving while drunk, but laws have not eliminated the problem of drunk driving.

--
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
Re:Let's get this straight people! by Anonymous Coward · 2004-02-24 10:38 · Score: 0

Better filtering ~= fewer spammers

Maybe it's a pipe dream, but if filtering starts working well for everyone, there will be fewer spam attempts. Who wants to send spam when it gains them nothing?
Re:Let's get this straight people! by mabu · 2004-02-25 05:39 · Score: 1

Better filtering ~= fewer spammers

There's absolutely no evidence to support that. In all likelihood, better filtering = more spam. Filtering does little to discourage spamming; if anything it promotes more sophisticated spamming methods.
Re:Let's get this straight people! by mabu · 2004-02-25 05:43 · Score: 1

If the advertisers don't get any responses, they won't use that spammer again.

Look at most of the things spammers are promoting. It's obvious many of them are working on a "commission" basis with loan offers and affiliate things for online med companies, etc. They don't get paid in advance. If they get ONE SINGLE sale, it makes things worth it because they steal other peoples' bandwidth and do little more than press a few buttons. The crime is stealing bandwidth; trying to stop people from being stupid and purchasing penis enlargement pills would be a total waste of time.

Off-topic observation. by Anonymous Coward · 2004-02-23 13:45 · Score: 0

It always humors me when there are any stories about spammers on Slashdot. I think that the "War on Terror" is, in some ways, similar to the "War on Spam". Terrorists want to kill us; spammers want to spam us. Terrorists and spammers use unconventional warfare. People don't like being terrorized, nor do they like being spammed. Both are unwanted facets of humanity. What I don't understand is why everyone wants to fight spam using the legal system (fines, penalties, law suits, etc.), but to combat Terrorism we must "win the people over". If God has an Inbox, he should get some filtering software to send all Muslims to the Trash Can. Hahahahahahahahaha. "Are you sure you want to delete this Muslim?" "Yes." LOL OMFG ROFLMAO

Not that all Muslims are Terrorists, of course, but come on! Randomly checking old grannies in wheelchairs at the airports? WTF?! Get a Motherfucking Clue (TM) and just check out every Arab-looking guy! Also, the French are not welcome.

Re:ob by homeobocks · 2004-02-23 13:45 · Score: 0, Offtopic

In Slashdot, the moderators filter you!

--
MOUNT TAPE U1439 ON B3, NO RING

Oh my god, I'm above the average human ... by porky_pig_jr · 2004-02-23 13:45 · Score: 1

by the whole 0.16%!!!

(the first small step toward fame, I hope)

Re:Huh? Aren't humans 100%? by perlchild · 2004-02-23 13:45 · Score: 1

I agree with you, I expected Humans to rate an even 90% at best...

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-23 13:46 · Score: 0

Remember; one thing computers are good at is doing boring things repeatedly

True. In fact mine reads and replies to Slashdot posts for me.

You joke, but... by Ancient+Devices+King · 2004-02-23 13:47 · Score: 2, Interesting

I know a guy who has a Korean grad student who doesn't speak English very well. He manages to produce subject lines for the messages he sends that get him blocked by spam filters nearly all the time. Not his fault really, but it happens.

--
-"It seems like you're trying to exploit a security hole. Would you like help?"

don't forget by Anonymous Coward · 2004-02-23 13:48 · Score: 0

don't forget that sorting through over 6000 emails in one sitting is bound to be tiring for a human... there's bound to be some misclassification as the human gets bored and frustrated... the computer can easily sort through 6000 emails with the same attention to detail at the 6000th email as on the 1st... not so for the human

Re:Huh? Aren't humans 100%? by kfg · 2004-02-23 13:49 · Score: 3, Interesting

People are jackasses.

Hence we have spam in the first place.

KFG

Don't worry by sik0fewl · 2004-02-23 13:50 · Score: 4, Funny

Don't worry, I can forward you the one she sent me. Sounds like the same email.

--
I remember when legal used to mean lawful, now it means some kind of loophole. - Leo Kessler

Re:Huh? Aren't humans 100%? by queen+of+everything · 2004-02-23 13:50 · Score: 3, Funny

I work with some people who use their computer every single day. Have had an email address for years, who still buys what they read in an email. Photoshop for $50...sure! Herbal viagra...why not?

Well, she always has a big smile on her face, maybe there's something to this spam thing.

--
"Wisdom is not a product of schooling but of the life-long attempt to acquire it." -Albert Einstein

economics of spam by Anonymous Coward · 2004-02-23 13:51 · Score: 1, Insightful

Most people don't receive hundreds of pieces of junk mail every day. Spammers can make money with only a VERY small percentage of the recipients actually responding. If you send spam to a million people and only 0.01 % buy your product you still sold 100 units of your product. If it cost a tenth of a cent to send each email then you would need to make at least $10 per unit under the current economic model to have it still be profitable.
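
The arithmetic in that example can be checked directly. A small sketch using exact fractions so no rounding creeps in; the function name `break_even` is made up for illustration, and the inputs are the figures from the comment (a million recipients, a 0.01% response rate, a tenth of a cent per message).

```python
from fractions import Fraction


def break_even(recipients, response_rate, cost_per_email):
    """Return (units sold, total sending cost, profit per unit needed to break even)."""
    units_sold = recipients * response_rate
    total_cost = recipients * cost_per_email
    return units_sold, total_cost, total_cost / units_sold


# 0.01% response rate, $0.001 per message, one million recipients.
units, cost, per_unit = break_even(1_000_000, Fraction("0.0001"), Fraction("0.001"))
print(units, cost, per_unit)  # 100 1000 10
```

So the mailing costs $1,000 in total, and 100 sales have to cover it, which is where the $10 per unit comes from.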

Re:Huh? Aren't humans 100%? by smack_attack · 2004-02-23 13:52 · Score: 1

Mine is smart enough to log in first, and is patched with humor-1.22-irony.wit :-p

--

Hammer of Truth

Re:Huh? Aren't humans 100%? by DougWhite · 2004-02-23 13:52 · Score: 2, Insightful

Not to sound like a litigation whore, but ...

I wonder if it would be possible to sue these spammers for interfering with a business transaction. Granted, the amount in question here is minimal, but just the possibility that a spammer could be found liable for this might deter some of them.

If that doesn't work we should sign up every megacorp CEO on every spammer list possible, and hope s/he misses an important memo costing megacorp millions. Then megacorp could sue spammer into oblivion.

Re:Huh? Aren't humans 100%? by rixstep · 2004-02-23 13:52 · Score: 4, Funny

Lots of people don't know what popups are.

Uh, sure they do. Popups - that's like those porn storms, isn't it? Some people say it only happens with IE and Windows, but I talked to my service provider and they told me 'just pull the power plug out of the wall when that happens'.

Easily fixed.

Re:Huh? Aren't humans 100%? by Trejkaz · 2004-02-23 13:52 · Score: 4, Interesting

That actually makes humans much more accurate. We can eliminate many of the messages just by looking at the subject.

The further question is, if humans aren't as accurate as the computer, how are they measuring the accuracy at all? That is, how do they know that the 1 in 6250 messages is wrong, if a human, known to be inaccurate, was testing for accuracy?

--
Karma: It's all a bunch of tree-huggin' hippy crap!

Re:Huh? Aren't humans 100%? by stonecypher · 2004-02-23 13:52 · Score: 1

Pretty much how you'd expect: by making fewer mistakes.

This seems a little like suggesting that a sewing machine isn't more accurate than a human seamstress, because it doesn't *know* about its misstitches. Comprehension isn't the issue. I have deleted legitimate email as spam by knee-jerk, and the math above (which leads to 1 in 625 deletions as in error) seems perfectly legitimate to me.

Or are you one of those that thinks that mathematical models aren't useful if they don't have a deep love for and brotherly understanding of what they're approximating?

--
StoneCypher is Full of BS

One number not enough by blamanj · 2004-02-23 13:54 · Score: 4, Insightful

Saying an algorithm is x% accurate is not sufficient, because there are two kinds of errors: false acceptance of spam, and false rejection of non-spam. Personally, I'd settle for catching only 90% of spam if I knew that 100% of non-spam got through, rather than have a program that was 99% accurate at both.
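
To make the two kinds of error concrete, here is a small sketch that reports them separately alongside the single "accuracy" number. The mailbox of 900 spams and 100 good messages and both filters' counts are invented for illustration; they are not figures from the study.

```python
from fractions import Fraction


def error_rates(spam_total, spam_caught, good_total, good_kept):
    """Return (false acceptance rate, false rejection rate, overall accuracy)."""
    false_accept = Fraction(spam_total - spam_caught, spam_total)  # spam let through
    false_reject = Fraction(good_total - good_kept, good_total)    # good mail lost
    accuracy = Fraction(spam_caught + good_kept, spam_total + good_total)
    return false_accept, false_reject, accuracy


# Hypothetical filter A: catches 90% of spam, never loses good mail.
print(error_rates(900, 810, 100, 100))  # 1/10, 0, 91/100
# Hypothetical filter B: 99% right on both kinds of mail.
print(error_rates(900, 891, 100, 99))   # 1/100, 1/100, 99/100
```

On the single number, B wins 99% to 91%, yet B is the one that throws away a real message, which is exactly the case the one-number summary hides.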

Re:One number not enough by Anonymous Coward · 2004-02-23 14:10 · Score: 0

according to both filters' test results, it was 1 false reject in around 6000+ messages, and zero false accepts (well each one had one false accept but explained it satisfactorily)

Re:Huh? Aren't humans 100%? by rixstep · 2004-02-23 13:55 · Score: 1

Lots of people read some spam not knowing what it is. To these people, a computer is merely an interesting string of sensations.

Can you release this quote under the GPL, do you think?

Re:How can a human be wrong? by Anonymous Coward · 2004-02-23 13:55 · Score: 0

Yes but your computer has the time to read through a hundred spam emails to get to the 3 non spam emails. Do you?

How not to evaluate filters by Daniel+Quinlan · 2004-02-23 13:55 · Score: 5, Insightful

The study referenced is:

On the author's mail (where all he does is probably talk about CRM114 and probably does not subscribe to many newsletters or non-technical mailing lists).
A pre-trained filter. It can't be compared apples-to-apples with any filter that doesn't require training.
Using his own filter on his own mail! Of course it does well.

... to mention a few of the problems. The statistics and methodology behind these claims are really questionable. I think Consumer Reports and PC Magazine have both done better evaluations of spam filters (read that however you want).

Also, I wonder how many people have actually looked at CRM114 and tried to use it.

The really interesting thing about CRM114 is the windowed polynomial hashing technique used although there's some evidence that it can work just as well (if not better) on a much smaller window of only two tokens. I'm hoping someone will do a full exploration of the idea for SpamAssassin's Bayes module.
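
For anyone who hasn't seen the idea, here is a rough sketch of windowed token-combination features. It is not CRM114's actual code: the function name, the `<skip>` placeholder, and returning plain strings instead of hashes are all assumptions for illustration. Each new token is paired with every subset of the tokens just before it in the window, so a window of five gives up to 16 features per token and a window of two gives just 2.

```python
from itertools import combinations

SKIP = "<skip>"  # placeholder for a window slot that is left out


def windowed_features(tokens, window=5):
    """Combine each token with every subset of the preceding window-1 tokens."""
    features = []
    for i, newest in enumerate(tokens):
        previous = tokens[max(0, i - window + 1):i]
        slots = range(len(previous))
        for size in range(len(previous) + 1):
            for kept in combinations(slots, size):
                # Keep the chosen earlier tokens in position, skip the rest.
                phrase = [previous[j] if j in kept else SKIP for j in slots]
                features.append(" ".join(phrase + [newest]))
    return features


print(windowed_features(["click", "here", "now"], window=2))
# ['click', '<skip> here', 'click here', '<skip> now', 'here now']
```

A real filter would presumably hash each phrase into a fixed-size table and keep spam and non-spam counts per bucket; the point here is only how quickly the feature count grows with the window size.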

Re:How not to evaluate filters by ars · 2004-02-23 20:04 · Score: 1

> Using his own filter on his own mail! Of course it does well.

Well, DUH! Why would you use someone else's filter? You train the filter based on your own email, that's the whole point. What's spam to you isn't spam to someone else. This way YOU tell the filter what is and isn't spam.

> A pre-trained filter. It can't be compared apples-to-apples with any filter that doesn't require training.

This filter is designed to be trained. I have never heard of a static filter that did better than 95%. Trained filters are the only way to get any real accuracy. And besides, all good static filters _ARE_ trained. It's just that they are trained by the developer. And if that's what you want, CRM is distributed with some pre-trained data files.

> Also, I wonder how many people have actually looked at CRM114 and tried to use it.

Me. And it works as advertized. I nearly gave up email for good, until I installed this filter. Now I pretty much never see a spam in my inbox.

--
-Ariel
Re:How not to evaluate filters by jesup · 2004-02-26 11:13 · Score: 1

I know the author (college), and I can tell you he's on a number of mailing lists, including the extended-college-ilk list. Not to mention that he's a TV star... ;-) You've probably watched him (and rooted for him) on TV if you're a slashdotter/techie.

Do we buy viagra 0.16% of the time by nri · 2004-02-23 13:58 · Score: 3, Insightful

If we humans are only 99.84% accurate, then 0.16% of the time we will incorrectly think the email is real and buy viagra ? I don't think so.
I read the email and delete it. Exactly the same as the spam filters do it, only MORE accurately. I think the tests applied would have been between a human reading the header of an email and deciding whether to open it or not versus the spam filter making the decision for us. BUT the spam filter makes its decision by opening the email. Therefore to have a proper comparison I should be allowed to open the email as well before I make the decision. Therefore I am 100% accurate.

--
if :w! doesn't work, try :!cvs commit -m""

Re:Do we buy viagra 0.16% of the time by Anonymous Coward · 2004-02-23 16:03 · Score: 0

If we humans are only 99.84% accurate, then 0.16% of the time we will incorrectly think the email is real and buy viagra ?

It means that 0.16% of the time we think it's a legitimate email and read it. Or that we accidentally delete a legitimate email along with the dozens of garbage emails 0.16% of the time.
Re:Do we buy viagra 0.16% of the time by nri · 2004-02-24 09:06 · Score: 1

So what? Once I read it I know it's spam. The filter had also read it and decided that it wasn't spam. Therefore I am more accurate than the spam filter. That's my point. We are 100% accurate, and the crap about them being more accurate is rubbish.
I realise the whole idea of a spam filter is to save me from needing to read 1000's of spam emails, and for that I am grateful. I just disagree with the statement that the filter is more accurate than me - it's not.

--
if :w! doesn't work, try :!cvs commit -m""

The CRM114? by tramm · 2004-02-23 14:00 · Score: 3, Funny

I bet it allows messages from General Jack D Ripper or any email that contains the secret phrase "purity of essence", "peace on earth" or "precious bodily fluids".

--
-- http://www.swcp.com/~hudson/

Re:Huh? Aren't humans 100%? by Trejkaz · 2004-02-23 14:00 · Score: 4, Funny

Presumably they must use a superhuman who has 100.00% accuracy.

--
Karma: It's all a bunch of tree-huggin' hippy crap!

They're trying to sell you something by brucmack · 2004-02-23 14:00 · Score: 2, Insightful

The thing with spam is that it's supposed to be a way for somebody to make money... i.e. they are trying to sell you something, be it directly or indirectly. I can't think offhand of an email I have recently received that could be misconstrued as trying to sell me something. From that simple viewpoint, spam can never look exactly like regular mail, because it has a different purpose.

Re:They're trying to sell you something by 2short · 2004-02-23 14:38 · Score: 1

But look at his example. One is spam trying to drive traffic to a website. Another is an actual friend of his, telling him about a cool website he found.
If it's selling something indirectly, it could well be entirely indistinguishable from legitimate mail, at least to a machine. Only the human knows that "MyBuddy" is someone he knows and "SpamKitty" is not.
Now, if the filter program has some distributed-networking action going on and can detect that a few thousand "people" have recomended the same website to their friends at the same time, it might have a chance...
Re:They're trying to sell you something by Anonymous Coward · 2004-02-23 15:58 · Score: 0

You're forgetting the message headers, and the fact that these are statistical filters that learn from their mistakes.

Re:Huh? Aren't humans 100%? by gvc · 2004-02-23 14:01 · Score: 3, Interesting

Last week I ran a spam filter on all the email I received for the last several months. The filter came up with a dozen 'false positives' - messages that I had not flagged as spam when I manually classified them. 11 of them were clearly errors I made in my original classification. The 12th was a solicitation from the alumni association of my alma mater ....

Before I used a spam filter, I once missed a very important message whose subject line was something to the effect of "URGENT - DON't REBOOT THIS MORNING." That was a bad one to miss.

Of course humans make mistakes, and it is entirely possible for an automated or semi-automated system to be more accurate than a human alone.

Re:Huh? Aren't humans 100%? by Stunning+Tard · 2004-02-23 14:02 · Score: 2, Funny

Maybe it's the 0.16% of people who are responding to spam.

Not really by bluGill · 2004-02-23 14:03 · Score: 1

What do I want in my inbox? I get a few dozen "job opportunities" a day. I'm unemployed right now, yet I've still learned to dump the majority of those without looking at them. Sometime there might be a legitimate opening in one of them and I will dump it. Making me less than 100% accurate, because I deleted an email that I didn't want.

Filters at least get most of the spam I get. (In fact most of those opportunities are things I signed up for not realizing they were not only bogus, but also gave no [obvious] way to get off their list) Back when I got 100+ a day I went through my inbox with the big delete button. More than once I hit delete, yes I'm sure..., then looked up and watched something I think I wanted disappear. However when you have 110 new emails and 100 are spam, I don't have the patience to go through and read them all.

Great results from the MIT Spam Conference! by int2str · 2004-02-23 14:04 · Score: 1

Since the MIT Spam Conference took place, I've been wondering if new ideas would be implemented as a result. And lo and behold, not one but two innovative new approaches to Spam filtering!

This is more than I had hoped for. Thanks to all involved!

I want to avoid terms like "war" or "arms race", but it's good to know that every once in a while the "good guys" take a big step forward. Hopefully the "bad guys" won't catch up too quickly.

Cheers,
Andre

Case study in linguistics by max+born · 2004-02-23 14:04 · Score: 1

If we are to believe Noam Chomsky or Steven Pinker (not saying I necessarily do), the study of linguistics shows that language is innately human and at present no mathematical formula can interpret its meaning.

The Sparse Binary Polynomial Hash is a generalization of the Bayesian filter and as such cannot detect "spam" any more than any group of people can all have the same interpretation of Shakespeare or Milton.

As Supreme Court Justice Potter Stewart said about pornography in "Jacobellis v. Ohio" (1964) -- "I know it when I see it." Neither humans nor computers can define spam any more than they can define pornography.

Re:Case study in linguistics by acb · 2004-02-23 17:39 · Score: 2, Insightful

What I gather from Pinker's theory is that language is implemented by a dedicated module in the human brain. This module is just neurological hardware, operating entirely by physical means, and does not invoke any sort of deus ex machina; therefore, what it does is an algorithm.

The language module does invoke other parts of the brain, such as general knowledge; however, there's nothing in the process that depends on it being in a human brain. Given that cognition is a physical process, one could postulate a computer program that could achieve the same results, even if drawing on a very large database of cultural information. The suggestion that language is "innately human" sounds a bit too much like carbon chauvinism, the belief that intelligence is an exclusive property of carbon-based life.
Re:Case study in linguistics by Anonymous Coward · 2004-02-24 07:53 · Score: 0

We might be carbon chauvinists, but once the spammers get hold of our Bayesian formulas, our algorithms are useless.

99.84% * 10 = 998.4% by Anonymous Coward · 2004-02-23 14:07 · Score: 0

Considering that 5% voted for >100% in the recent poll.

Re:Huh? Aren't humans 100%? by Trejkaz · 2004-02-23 14:07 · Score: 1

To these people, a computer is merely an interesting string of sensations.

If only FuFme weren't jokeware. :-(

--
Karma: It's all a bunch of tree-huggin' hippy crap!

Chaining? by gmplague · 2004-02-23 14:07 · Score: 1

If each one is near-perfect, and they use completely different strategies, why don't we parse all messages through one first, and then through the other one, that way you'll get higher accuracy rates. Only the ones where there's a discrepancy get shown to the user.
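The chaining idea above can be sketched roughly as follows. This is only an illustration: the two keyword classifiers are hypothetical stand-ins, not CRM114 or DSPAM, and the function name `triage` is made up for the example. Messages on which both filters agree are sorted automatically; only the disagreements go to the user.

```python
from typing import Callable, Iterable, List, Tuple

# A classifier takes a message and returns True when it looks like spam.
Classifier = Callable[[str], bool]


def triage(
    messages: Iterable[str], filter_a: Classifier, filter_b: Classifier
) -> Tuple[List[str], List[str], List[str]]:
    """Split messages into (spam, ham, needs_review) using two filters."""
    spam, ham, review = [], [], []
    for msg in messages:
        a, b = filter_a(msg), filter_b(msg)
        if a and b:
            spam.append(msg)      # both filters say spam
        elif not a and not b:
            ham.append(msg)       # both filters say legitimate
        else:
            review.append(msg)    # discrepancy: show to the user
    return spam, ham, review


# Hypothetical toy filters, standing in for two independent strategies.
def keyword_filter(msg: str) -> bool:
    return "viagra" in msg.lower()


def offer_filter(msg: str) -> bool:
    return "free" in msg.lower()


inbox = ["Free VIAGRA now", "lunch tomorrow?", "free tickets for you"]
spam, ham, review = triage(inbox, keyword_filter, offer_filter)
```

Whether this actually raises accuracy depends on how correlated the two filters' mistakes are; if they tend to fail on the same messages, the review pile catches little.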

--
__________________________________________
Take comfort in your ignorance.
Grandmaster Plague

Hum by Anonymous Coward · 2004-02-23 14:08 · Score: 0

Maybe it's because those humans were braindead /. moderators

Re:Hum by Anonymous Coward · 2004-02-23 14:13 · Score: 0

Maybe it's because those humans were you

I agree by the+eric+conspiracy · 2004-02-23 14:08 · Score: 1

I find that this parallels my own experience to a great extent. I have found that using a good spam filter with a whitelist is indeed more accurate than my own ability to filter email manually.

Once I reached the conclusion that this was the case it made a heck of a lot of sense to use a spam filter.

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-23 14:10 · Score: 0

I have never deleted an email I meant to keep, therefore I must be a SuperGenius(TM W.E.Coyote)!

Or your dad is an idiot who doesn't know how to route his email.

CRM is more than just a spam filter. by k_head · 2004-02-23 14:12 · Score: 2, Informative

CRM is actually quite a fascinating product. It's like a super grep where you can match against blocks of text instead of just lines. It also has some logic operators and such. I think there is a quote on his web site that refers to it as "grep bitten by a radioactive spider" and it's true.

You can use it for a lot more than spam processing; it's a really neat all-purpose tool.

--
The best way to support the US war effort is to continue buying American products.

Better filter AND make money too ! by PHPhD2B · 2004-02-23 14:13 · Score: 2, Funny

I have developed a spam filter that is 100 percent (ONE HUNDRED PER CENT) effective at deleting Unwanted Messages.

In addition, every user will get special discounts on software, mp3s and computer parts with my partners, and two FREE MP3'S every month.

There are also special savings on 100% all-natural and effective male enhancement products. A portion of the rebates will go towards a $100000 fund needed to get 100,000,000 dollars (ONE HUNDRED MILLION DOLLARS) from Liberia into an account in Switzerland. If you provide your social security number (SSN) and your checking and savings account number you will get part of the ONE HUNDRED MILLION US DOLLARS. Only the first 100 people will qualify, so hurry up and don't miss this offer!

--
--I am Sun Tzu of the Borg. Resistance is feudal.

Thats a problem. by geekoid · 2004-02-23 14:13 · Score: 2, Interesting

If there is no universal bottom line of what Spam is, we can never manage it.

I think 'unsolicited request for money from a for-profit organization' will fit into everybody's base definition. Some people will expand on it, but we need a defined place to start.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

Re:Thats a problem. by dodobh · 2004-02-23 23:22 · Score: 1

No. Spam is Unsolicited Bulk Email. It's all about consent, not about content.
(What about an unsolicited request for a vote from a political party?)

--
I can throw myself at the ground, and miss.

Re:Huh? Aren't humans 100%? by bhanafee · 2004-02-23 14:14 · Score: 4, Insightful

No, humans aren't 100% and yes, you can test for that. Try a thought experiment: fill a bin with 50,000 red balls and 50,000 blue balls. Ask a human to sort them all. The result probably won't be 100%, but you can still check the result and figure out how accurate the human is without relying on a superhuman ability to tell the balls apart. Same thing for spam: if you start with a known training set, you can test humans to see how well the spam is identified by manual sorting.
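The thought experiment above can be sketched in a few lines. This is an illustration only: `noisy_sorter` is a hypothetical model of a fallible human who mislabels each ball independently, and the 0.16% error rate is simply borrowed from the 99.84% figure in the story. The point is that accuracy is measured against the known labels, so no superhuman judge is needed.

```python
import random


def measure_accuracy(true_labels, judged_labels):
    """Fraction of items whose judged label matches the known label."""
    if len(true_labels) != len(judged_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == j for t, j in zip(true_labels, judged_labels))
    return correct / len(true_labels)


def noisy_sorter(true_labels, error_rate, rng):
    """Hypothetical fallible sorter: flips each label with probability error_rate."""
    flip = {"red": "blue", "blue": "red"}
    return [flip[lab] if rng.random() < error_rate else lab for lab in true_labels]


rng = random.Random(0)  # fixed seed so the run is repeatable
balls = ["red"] * 50_000 + ["blue"] * 50_000   # the known "training set"
sorted_balls = noisy_sorter(balls, error_rate=0.0016, rng=rng)
accuracy = measure_accuracy(balls, sorted_balls)
print(f"measured accuracy: {accuracy:.4%}")
```

The same scoring function works for spam if "red"/"blue" are replaced by "spam"/"nonspam" labels from a corpus that was labeled carefully ahead of time.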

Human accuracy doesn't scale linearly by Kaboom13 · 2004-02-23 14:14 · Score: 5, Insightful

I'm not surprised a filter beat the human, considering the study used a sample of 5849 messages. As the sample size increases, the filter's accuracy will increase, and the human's will decrease. Furthermore, the higher the spam/real ratio, the better the filter will do in comparison to a human trying to sort at a reasonable speed. The reason is that humans tend to skim, and rarely actually read entire subjects, much less messages. Give a human 5000 messages and an hour and he will probably make some mistakes. On the other hand, in 10 messages, the human will probably be 100% correct. Most email filters rely on this already, as they tend to err on the side of caution. With the bulk of the spam taken out, it is not a burden to have the human check the iffy bits. Furthermore, the type of email can mislead humans. A business-type email sent to someone's personal email is much more likely to be mistaken for spam, and vice versa. The main disadvantage of automated filtering is that people generally have an idea of when a really important e-mail is going to come (the type for which false positives are completely unacceptable) and who it will be from, and a filter does not.

Re:Huh? Aren't humans 100%? by stonecypher · 2004-02-23 14:15 · Score: 1

(please don't reply telling me how)

Kinda makes you wonder how they could get from San Diego to Los Angeles in under three hours. Please don't reply telling me how.

(Extra joke for southern californians: wait for the first person to say "the highway!" Laugh, rinse, repeat.)

--
StoneCypher is Full of BS

Re:Huh? Aren't humans 100%? by c1ay · 2004-02-23 14:16 · Score: 1

I'd like to see how well these would currently do on Darl McBride's SCO email account in sorting out his legitimate mail from his hate mail. It would surprise me if they could maintain this accuracy...

--

Markovan by Anonymous Coward · 2004-02-23 14:18 · Score: 1, Funny

When the Markovan is rockin', don't come a-knockin'...

Ah yes. Misspelling an adjective describing the general procedure, and *then* mistaking that misspelling for the actual *name* of the algorithm. Always nice to see the intelligence of submitters at work!

trademark ...dude. by Anonymous Coward · 2004-02-23 14:19 · Score: 0

sorry dude, but i work at Dolby and well, this is probably trademark infringement. I had to send it to our trademark/IP folks. You will probably be hearing from someone soon.

Again, nothing personal, but we have to look out for this kinda stuff. Trademark is a big deal here.

At best, they may just have you take that logo down, at worst, you will have to change all of it.

Re:trademark ...dude. by Anonymous Coward · 2004-02-23 14:39 · Score: 0

i thought SCO owned you guys. i may be wrong, but it doesn't seem likely to me that dolby labs owns the trademark under the name 'dobly'. otherwise you'll have to sue spinal tap too.
Re:trademark ...dude. by Anonymous Coward · 2004-02-23 15:13 · Score: 0

not to mention the logo doesn't look much like the dolby logo at all except that there is a rectangle...did you guys copyright rectangles? also, isn't dolby related to sound and signal processing? how does that have anything to do with spam filtering? even if the guy wanted to call it dolby i don't think he'd be breaking any trademark laws, since it's a completely different market.
Re:trademark ...dude. by o'reor · 2004-02-23 20:22 · Score: 1
Dude... seriously, why don't your bosses sue J. K. Rowling for both trademark and patent infringement ?
1. Patent infringement: she imagines a device whose primary purpose is to filter one particular student out of a school (i.e. preventing Harry Potter from going to Hogwarts)
2. Trademark infringement: the name of the particular device is Dobby
--
In Soviet Russia, our new overlords are belong to all your base.

errata by kfg · 2004-02-23 14:22 · Score: 1

Yeah, I screwed up the numbers. So sue me, I'm dyslexic.

KFG

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-23 14:24 · Score: 1, Insightful

How do you know your training set is correct?

Corpi? by nohat · 2004-02-23 14:28 · Score: 1

The last time I checked, the plural of corpus was corpora (or corpuses), not corpi. CORPI?! What was he thinking!

Re:Corpi? by Anonymous Coward · 2004-02-23 14:45 · Score: 0

Don't be so pedantic.

--
There's a difference between knowing the name of something and understanding it. -- Richard Feynman

CRM114 by chrome · 2004-02-23 14:31 · Score: 1

It's pretty difficult to set up.

Like, it refers to setting a secret password somewhere, but I can't find it!

Gah!

*bashes head on wall*

Re:Huh? Aren't humans 100%? by sketerpot · 2004-02-23 14:31 · Score: 1

I don't think you need to worry about that. Your right to quote that phrase should be protected by fair use rights. Even RMS doesn't release quotes under GPL.

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-23 14:31 · Score: 5, Insightful

The post quotes "a study" which gives the 99.84% figure. In fact, the 99.84% figure is mentioned in the one paper as "the human author's measured accuracy as an antispam filter...on the first pass". This is what we who understand statistics call "nonsense". An individual human had an estimated accuracy of 99.84% when looking at one particular sample set of data, once. This is not a meaningful number, and it sure as heck ain't "a study".

Re:Huh? Aren't humans 100%? by Pieroxy · 2004-02-23 14:31 · Score: 3, Funny

I have never deleted an email I meant to keep

How could you possibly know? You deleted it!!

--
Write boring code, not shiny code!

Re:Huh? Aren't humans 100%? by SLot · 2004-02-23 14:32 · Score: 3, Funny

Then megacorp could sue spammer into oblivion.

Or more likely, megacorp fires its mail administrators for being incompetent and goes on about its business.

human == correspondence secretary by Anonymous Coward · 2004-02-23 14:39 · Score: 1, Informative

All who are harping on about human spam detection rates, the article states:

"By comparison, a human
is only about 99.84% accurate in filtering spam and nonspam, so any of these filters
is more effective than a human "correspondence secretary"."

So, they define "human" to be a secretary, not an uber geek.

Help setting this up by ModernGeek · 2004-02-23 14:41 · Score: 1, Interesting

I would love to rtfm, but I want a fairly simple answer to this, how can I do a 30 minute job of integrating this into the mozilla mail client, or does it have to be tied into the server itself? I was wondering if this was a quick, easy fix, or if it is an all weekend type of project. While I'm on the subject of mail, what is a good all in one mail bundle with webbased interface that isn't opengroupware or ms exchange for php/apache under unix?

--
Sig: I stole this sig.

Re:Help setting this up by Anonymous Coward · 2004-02-23 15:06 · Score: 2, Funny

would love to rtfm, but I want a fairly simple answer to this, how can I do a 30 minute job of integrating this into the mozilla mail client, or does it have to be tied into the server itself? I was wondering if this was a quick, easy fix, or if it is an all weekend type of project.

Most likely it will take at least as long as reading an article. So you might as well not bother.
Re:Help setting this up by attackc0de · 2004-02-23 16:40 · Score: 1

I felt this all-weekend pain when trying to set up SpamAssassin. There didn't seem to be any mention of using it *without* sendmail/qmail/postfix. Because I wanted to use my ISP's SMTP/POP3 servers, and freshmeat searches came up nada on (functioning/completed) pop3 proxies, I ended up writing my own pop3 proxy in perl out of frustration. Funny though, that's how most of my projects begin... ;)

--
For a nice date: call strftime(3C)
Re:Help setting this up by PugMajere · 2004-02-23 18:13 · Score: 3, Informative

Umm, Fetchmail + procmail on your local machine?

Not sure exactly why you need a pop3 proxy involved, just use Fetchmail to deliver locally, run things through procmail.

Set your local mailserver (sendmail/qmail/postfix/exim/whatever) to use your ISP's SMTP server as a smarthost, and it'll send everything it doesn't recognize as local off to them to handle.
Re:Help setting this up by SethJohnson · 2004-02-23 19:12 · Score: 4, Insightful

ModernGeek,

I recommend you stick with hotmail. Dabbling in stuff like SpamAssassin is going to be just too much work for someone as lazy as you sound. Apple makes a good built-in spam filter on its Mail client app. Why don't you go there?

--
$5 / month hosted VPS on linux = awesome!
Re:Help setting this up by Anonymous Coward · 2004-02-23 23:49 · Score: 0

Mozilla Mail (and Thunderbird) don't seem to support local delivery, only POP3/IMAP. I'm not sure why, maybe I'm missing something obvious here.
Re:Help setting this up by emilymildew · 2004-02-24 02:25 · Score: 1

Except when Mail.app crashes every two to three minutes because your Junk Mail filter file is corrupt and Apple doesn't recognize this as a legitimate problem despite the numerous bug reports sent by customers.

So I end up having to make my own Junk Mail rules and hope that not too many good messages sneak by.
Re:Help setting this up by Shaleh · 2004-02-24 06:47 · Score: 1

mozilla's mail client already has bayes filtering in it. Does a pretty good job for the people I know using it.
Re:Help setting this up by gwynevans · 2004-02-24 22:05 · Score: 2, Informative

Sounds like POPfile was what you were actually looking for!
Re:Help setting this up by Anonymous Coward · 2004-02-25 11:41 · Score: 0

Me thinks you should get a PeeCee since even a Mac is beyond you.

DSPAM efficacy by Anonymous Coward · 2004-02-23 14:44 · Score: 0

DSPAM does pretty well. As of version 2.8.1, if you receive at least 10% good email it'll happily classify most spam away. As of version 2.10 it should be possible to lower that to near 0 (2.10 implements whitelisting, which saves the extreme end cases where certain addresses, such as john@anydomain.com, receive on the order of 1 good email : 1000 spams).

Re:DSPAM efficacy by Anonymous Coward · 2004-02-23 17:14 · Score: 0

huh? i hang out on the dspam-dev list and haven't seen anything about whitelisting, and it's not in the docs anywhere. did your old lady smoke crack? the docs do talk about bayesian whitelisting, which is a type of statistical whitelisting performed automatically by the bayesian process...but that's been around since the first version afaik.

In plain English... by Anonymous Coward · 2004-02-23 14:51 · Score: 0

Can anyone explain how these two filters work?

The white papers are far too complicated for simple minds like mine.

Re:Huh? Aren't humans 100%? by Marvin_OScribbley · 2004-02-23 14:57 · Score: 4, Funny

I talked to my service provider and they told me 'just pull the power plug out of the wall when that happens'.

Ok, now the screen dimmed a little and I heard the hard drive spin down, but the pop ups are still a comin! Oh, and something about "battery level at 98%" or something.

--
I'm not a journalist, but I play one on slashdot

Testing... by Anonymous Coward · 2004-02-23 14:58 · Score: 0

Hrm, I just installed this, and I'm anxious to try it out.
I'll just post my email address here, and wait a few minutes...

Human Errors are different in KIND by Anonymous Coward · 2004-02-23 15:01 · Score: 0

The humans are making errors like accidentally hitting the wrong button. The filters are accidentally classifying things as not SPAM. These are not the same kinds of errors. At that level of accuracy individual humans cannot be taken as good baselines.

Re:Huh? Aren't humans 100%? by Elwood+P+Dowd · 2004-02-23 15:07 · Score: 2, Interesting

Or your dad is an idiot who doesn't know how to route his email.

But I was only contesting the great-grandparent poster, who said that humans are by definition 100% accurate.

While my dad may be an idiot, he is also human. I am correct, great-grandparent poster is incorrect, and you are off topic. As far as I can tell, I've never deleted an email I meant to keep either. But you and I aren't the only people worth discussing.

--

There are no trails. There are no trees out here.

Re:Huh? Aren't humans 100%? by ergean · 2004-02-23 15:11 · Score: 5, Funny

There goes my business idea. I wanted to start a business that puts humans in an eastern European country to sort corporate e-mail.

Now I have to think again about putting humans to decorticate sunflower seeds; it's cheaper than all those machines.

Re:Huh? Aren't humans 100%? by QuantumFTL · 2004-02-23 15:12 · Score: 4, Interesting

The further question is, if humans aren't as accurate as the computer, how are they measuring the accuracy at all? That is, how do they know that the 1 in 6250 messages is wrong, if a human, known to be inaccurate, was testing for accuracy?

I believe that humans can be 100% accurate (or thereabouts) if they read the *ENTIRE* message, however that's exactly the point - if you have to read an entire message to tell that it's spam, the spam has succeeded.

Their number probably concerns how people can tell without reading the entire message whether or not the message is spam. My brother accidentally deleted a few messages I had sent to him, however if he had read them fully he would have known they were legit.

Cheers,
Justin

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-23 15:14 · Score: 0

Ever run a car into a tree?
Me neither, but somebody has.
And I bet they thought staying
on the road was as easy as
deleting spam.

Re:Huh? Aren't humans 100%? by Andrew+Cady · 2004-02-23 15:14 · Score: 2, Informative

If every individual human has an accuracy of 99.983%, then two independent humans have an accuracy of 1 - .00017^2 or 99.99999711%. This would allow ample accuracy to judge the computer, except that it's not true[1]. A better answer is the one you suggest: humans must judge spam from subject/author alone, whereas computers get to look at the whole message. Humans reading the whole message, and possibly even following included links, responding, etc., can be assumed to have full accuracy, within epistemic bounds. Indeed, merely re-checking your work, etc. - being consciously more diligent than the average spam-sorter - should ensure your accuracy is better than average.

As for how accuracy was actually judged in this particular study, I suppose you would have to read the article for that. I haven't, myself...

[1] It assumes the probability of error is equal for every message, which is obviously not true (i.e., that error is random rather than systematic). The real accuracy of two humans in concert is surely much lower; OTOH, it is still sure to be much, much higher than the accuracy of a single human.
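The arithmetic above can be checked with a short sketch. It uses the same simplifying model as the comment (reviewers err independently, and the pair is wrong only when every reviewer is wrong), which footnote [1] already flags as unrealistic; the function name `combined_accuracy` is made up for the example.

```python
def combined_accuracy(error_rate: float, reviewers: int) -> float:
    """Accuracy of a panel that is wrong only if all reviewers are wrong.

    Assumes errors are independent and equally likely on every message,
    which real human sorting errors are not.
    """
    return 1.0 - error_rate ** reviewers


single_error = 0.00017                     # 1 - 99.983%
both_wrong = single_error ** 2             # chance both humans miss the same message
two_humans = combined_accuracy(single_error, reviewers=2)

print(f"both wrong: {both_wrong:.3e}")
print(f"two independent humans: {two_humans:.8%}")
```

If errors are systematic (both humans fooled by the same deceptive subject line), the real joint error rate sits somewhere between `single_error ** 2` and `single_error`.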

Two Spam Filters 10 Times As Accurate As Humans by localman · 2004-02-23 15:16 · Score: 1

Huh... so would three spam filters be 100 times as accurate? I never thought of running them in series before. Cool! :)

Re:Two Spam Filters 10 Times As Accurate As Humans by Anonymous Coward · 2004-02-23 15:52 · Score: 0

What the article said was that both had achieved that level of accuracy on their own - not when wired together.
Re:Two Spam Filters 10 Times As Accurate As Humans by localman · 2004-02-23 19:31 · Score: 1

I know. Sorry. I was trying to be funny. Guess I wasn't.

Re:Huh? Aren't humans 100%? by Trejkaz · 2004-02-23 15:17 · Score: 4, Insightful

But the computer reads the entire message, so it's not really a fair comparison, is it? How many more lines of information was the computer allowed to look at to create its superior result?

--
Karma: It's all a bunch of tree-huggin' hippy crap!

How about some disclosure? by sdo1 · 2004-02-23 15:19 · Score: 1

Nuclear Elephant writes "The authors of two spam filters, CRM114 and DSPAM... If you're looking for a way to rid spam from your inbox, roll on over to one of these authors' websites."

I'm glad these guys are doing great things to combat spam, but when the submitter of the article stands to benefit from posting of the article on Slashdot, then full disclosure (not stealth disclosure) is warranted. No surprise that the "donate" link is right up at the top of their page.

Jonathan, don't get me wrong. I really appreciate what you're doing here. But failure to disclose your relationship to the project you're promoting is on the level (though not the same extent) as the deception that spammers employ.

-S

--
--- What parts of "shall make no law", "shall not be infringed", and "shall not be violated" don't you understand?

Current Spam filters by Anonymous Coward · 2004-02-23 15:19 · Score: 2, Interesting

Current spam filters may be "10x" better than humans, but they may still be terrible on future spam.

Filters beating spam and spam beating filters is a continuous arms race. In the limit, optimal spam filtering is equivalent to solving NLP (natural language processing); unless you build a filter that can fully understand the text (syntax, semantics, pragmatics, world knowledge, the whole shebang), an adversary can always construct spam to defeat your filter.

Re:Current Spam filters by Anonymous Coward · 2004-02-23 16:03 · Score: 0

Well said.

The definition of spam is as arbitrary as that of pornography. One person's spam is another's answer.

It's a simple fact of linguistics.
Re:Current Spam filters by corngrower · 2004-02-23 17:58 · Score: 1

Unless you build a filter that can fully understand the text (syntax, semantics, pragmatics, world knowledge, the whole shebang), an adversary can always construct spam to defeat your filter.
An automated filter may not ever be 100% accurate, but I find even if they're only 95% accurate at recognizing spam, it's useful for me.
I'm sure some filters can place mail into categories such as "almost definitely spam", "probably spam", "might be spam", and "not spam". That would be helpful as well.
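A filter that outputs a spam probability could be mapped onto categories like these with a few cutoffs. The sketch below is purely illustrative: the thresholds 0.99, 0.90 and 0.50 are arbitrary assumptions, not values taken from CRM114, DSPAM, or any other filter.

```python
def bucket(spam_probability: float) -> str:
    """Map a filter's spam probability onto a coarse category.

    The cutoffs are hypothetical; a real deployment would tune them
    against its own false-positive tolerance.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if spam_probability >= 0.99:
        return "almost definitely spam"
    if spam_probability >= 0.90:
        return "probably spam"
    if spam_probability >= 0.50:
        return "might be spam"
    return "not spam"


for p in (0.999, 0.95, 0.6, 0.02):
    print(p, "->", bucket(p))
```

Only the middle buckets would need a human glance, which keeps the manual workload small even when the filter is not perfect.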

X-CRM114-code-prefix: OPE by otis+wildflower · 2004-02-23 15:25 · Score: 1

Feed me, Mandrake!

Re:X-CRM114-code-prefix: OPE by ajlitt · 2004-02-23 15:48 · Score: 2, Funny

Please give this man a drink of grain alcohol and rainwater.

Re:Huh? Aren't humans 100%? by Moridineas · 2004-02-23 15:38 · Score: 1

You're missing the point! Humans aren't 100% accurate on classifying JUST on the subject and From lines, but I don't see how we aren't when reading the body.

Re:Huh? Aren't humans 100%? by Fnkmaster · 2004-02-23 15:42 · Score: 4, Funny

Well, she always has a big smile on her face, maybe there's something to this spam thing.

You mean you've never noticed this before? Idiots are some of the happiest people I know.

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-23 15:45 · Score: 1, Insightful

Lots of people don't know what popups are.

Yes, we call those people "surfers who don't use Internet Explorer" (seeing as pretty much every other browser has options to kill them).

SCNR

Popfile... by Skates1616 · 2004-02-23 15:51 · Score: 0

I'll just stick with my 99.9% accuracy with Popfile, it acts as a proxy so everything happens seamlessly, and the buckets are awesome for sorting your mail...

Training Period? by rixstep · 2004-02-23 15:54 · Score: 1

How long was the training period? Does that count too? Many filters can maintain complete 100% accuracy over finite periods of time (4,000 - 10,000 messages ) once they're trained - such as...

So.... by gonz · 2004-02-23 16:03 · Score: 1

Which one is better? Does anyone have any comments on the actual article?

-Gonz

Spot the reference... by Maj.+Kong · 2004-02-23 16:09 · Score: 5, Informative

CRM114 was a piece of encryption gear in Major Kong's...err, my B-52 in the movie Dr. Strangelove . It allowed only properly coded messages to be received by the crew. When the Soviet SAM detonated near the airframe, the CRM114 was damaged and the crew could not get the recall order.

Kong: (announcing through headset intercom )

This is your attack profile: to insure that the enemy cannot monitor voice transmission or plant false transmission, the CRM114 is to be switched into all the receiver circuits. Emergency phase code prefix is to be set on the dials of the CRM. This'll block any transmission other than those preceded by code prefix. Stand by to set code prefix.

ObKubrick: In 2001: A Space Odyssey, one of the pods was marked with the designation CRM-114. And in Clockwork Orange, Alex is injected with serum 114. I suppose CRM-114 is to Kubrick as THX1138 is to Lucas.

Dobly, on the other hand, is from This is Spinal Tap , a mispronunciation of "Dolby" by David St. Hubbins's girlfriend:

Jeanine Pettibone: You don't do heavy metal in Dobly, you know.

Not to mention that it probably avoids trademark infringement (though I wouldn't put it past Dolby Labs or Thomas Dolby to raise a stink).

Maj. Kong

--

Shoot, a fella' could have a pretty good weekend in Vegas with all that stuff.

Re:Spot the reference... by danshapiro · 2004-02-23 20:58 · Score: 1

Slightly OT, but Thomas Dolby's a FOAF and a pretty cool guy. Given that he makes a decent part of his living sampling, editing, and remixing others, I suspect he'll be pretty cool about the use of Dobly.

--
This posting is provided "AS IS" with no warranties, and confers no rights.
Re:Spot the reference... by metamatic · 2004-02-24 11:19 · Score: 2, Informative

In fact, Thomas Dolby was sued for trademark violation by Dolby Labs. The court found in his favor, as he'd been known as "Thomas Dolby" as a nickname since his school days, when he used to play with tape decks all the time.

--
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak

Spam Ratio way out of date by linus_vp · 2004-02-23 16:09 · Score: 1

I totally agree. I reached that threshold today, having more than 1000 messages, with only 1 non-spam for about 4 days worth of email.
Also, I think that the article's ratio of spam to non-spam 1935/5849=33% was way out of proportion.
Specifically, for the period of November 1 to December 1, 2002, a pre-trained mailfilter.crm (as distributed as CRM114 version 2002-11-26 ) processed the authors live incoming mail stream, a total of 5849 messages (1935 spam, 3914 nonspam) , with only 4 false acceptances, 0 false rejections, and 2 messages considered "not humanly classifiable".
My messages to spam ratio is more like (100 messages) 98 spam, 2 nonspam 98/100=98% at the most. My question is, how does the author keep his spam ratio so low? If you update the study from 2002 to 2004, I'm sure those numbers will be vastly different.
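The two ratios being compared can be worked out directly from the numbers quoted above. This is only a sketch of the arithmetic: the helper name `share` is made up, and treating the 4 false acceptances as the only errors is one possible reading of the quoted figures, not necessarily how the paper computes its accuracy.

```python
def share(part: int, total: int) -> float:
    """Fraction of `total` made up by `part`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return part / total


# Figures quoted from the 2002 CRM114 test run.
paper_spam_share = share(1935, 5849)       # spam as a fraction of all mail
paper_error_rate = share(4 + 0, 5849)      # false acceptances + false rejections

# The poster's own rough estimate for 2004.
poster_spam_share = share(98, 100)

print(f"paper spam share:  {paper_spam_share:.1%}")
print(f"paper error rate:  {paper_error_rate:.3%}")
print(f"poster spam share: {poster_spam_share:.0%}")
```

A filter tuned on a mail stream that is one-third spam may behave differently on one that is 98% spam, which is the substance of the objection.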

--
My Journal.

Re:Huh? Aren't humans 100%? by dj245 · 2004-02-23 16:12 · Score: 1

I have absolutely no experience making any kind of filter, but if I make a filter that deletes every email I get, I have just eliminated 100% of spam. That's infinitely better than their filter!

However my telepathy skills are not so good so that's not such a great idea.

--
Even those who arrange and design shrubberies are under considerable economic stress at this period in history.

Re:Huh? Aren't humans 100%? by bugsmalli · 2004-02-23 16:20 · Score: 1

if you had asked me, I'd say balls to you!

Re:Huh? Aren't humans 100%? by po8 · 2004-02-23 16:27 · Score: 4, Informative

How do you know your training set is correct?

Good question! We're working on this problem, among other things, at the PSAM project. We have a project to produce high-quality benchmark corpora for spam filter testing. Watch that space for ongoing work, or e-mail us an offer to pitch in and help---we could use it!

Re:Huh? Aren't humans 100%? by Harinezumi · 2004-02-23 16:29 · Score: 5, Informative

Computers are neither lazy nor pressed for time, and therefore can afford to read and evaluate every single line of every single message. Humans generally can't be bothered to be so diligent, and while they have the ability to get a 100% rate, in most cases they devote so little attention to the task of filtering email that the success rate drops.

When these factors are considered, I think it's quite possible to write software that in the long run has a higher success rate than a human who has better things to do than filter his mail all day.

Unfair comparison by Dachannien · 2004-02-23 16:39 · Score: 1

This is an unfair comparison in the first place, unless their spam filter only looks at things like Subject, From, and Date. Their filters in all likelihood also analyze the full body of the message; if a person read the body of the message to do that analysis himself, it would make the process of determining spam/not-spam moot, wouldn't it?

Re:Unfair comparison by Anonymous Coward · 2004-02-23 22:53 · Score: 0

It would also take several hours a day.

Dolby-type noise reduction algorithm called Dobly? by omeomi · 2004-02-23 16:41 · Score: 4, Interesting

Dolby noise reduction works by filtering a spectrum into a bunch of bands, each of which is compressed (in an audio sense, not in a digital sense), and recorded to tape. On playback, they go through an expander...how does that concept translate to spam filtering? It can't be "dolby-type", that doesn't make any sense...

--
ZuluPad, the wiki notepad on crack

Brightmail by saha · 2004-02-23 16:47 · Score: 1

We just started using Brightmail on our mail servers. It gets a large quantity of the SPAM. Whatever it doesn't mark as SPAM using the header, I send in a report to their service, which apparently updates the Brightmail data on our mail servers every 10 minutes. Whatever Brightmail doesn't catch, Mail.app has done a good job of catching, and so far it looks like 99% of my SPAM has been caught, but I don't have enough stats over a long enough period of time to show that this 99% will continue.

On our main campus there will be a pilot program to use DSPAM. Although I've heard it requires more user training, Brightmail doesn't need any or very little feedback to block 80% of the SPAM.

Digital signatures and a public key infrastructure by Tracy+Reed · 2004-02-23 16:47 · Score: 2, Insightful

...are still the only real solution to the issue of trust, reputation, and accountability on the Internet. We need it for so many other things in addition to guaranteeing email legitimacy.

If every user or at least every server had a key and we all signed each other's keys, creating a web of trust, and only accepted signed and trusted mail, the spam problem would be solved. I really dislike the way SSL certificates are handed out. A central CA is a very bad idea due to the cost and browser lock-in issues etc. With GPG and web of trust, if you want to run a mail server you need to talk to a friend who is already running one and get them to sign your key. Perhaps we could even use DNS to propagate and cache the keys and sigs. If you sign a key that turns out to be a spammer's, you'd better revoke that signature fast before the person upstream from you revokes yours. Problem solved. Now if only we could get the big guys to go along with it...

Then her subject was just "wrong". by NotQuiteReal · 2004-02-23 16:48 · Score: 1

You correctly classified it as spam [or at least junk mail]. Just because it came from your sister-in-law doesn't mean you didn't want to read it.

The fact that you know you deleted a non-spam message indicates that you eventually got the message anyhow.

Ergo - you are still right.

--
This issue is a bit more complicated than you think.

big spam magnets by hhawk · 2004-02-23 16:48 · Score: 1

my addy has been on the net since about 1988 and so I get more than 500+ spam per day, maybe even close to 1000 on a bad day.

So many emails say "Hi" or "hey" or other subjects I use. It gets harder to see what's going on, esp. if you're not using a 'Nix based reader.

So I can tell you I'm not accurate in picking out the good ones, esp. the newest spam which uses realistic sounding email names. I'd be eager to try out one of these filters.

--
http://www.hawknest.com/

Re:Huh? Aren't humans 100%? by Elwood+P+Dowd · 2004-02-23 16:50 · Score: 1

No, I'm not missing the point.

My dad, and the other guy with my name, BOTH could have viewed the message body. We might be better than machines when reading the message body, but given the 200 spams my dad gets every day, he chooses not to. So it doesn't matter.

Even with an autopreview window: My dad will see 20 new spams, shift-select all of them, and delete without viewing. This is learned behavior for him.

Yes, you are correct: the artificial intelligence embodied in these anti-spam solutions is not more accurate than a human who actually reads the message body. But it is still better than humans at sorting mail.

--

There are no trails. There are no trees out here.

Re:Huh? Aren't humans 100%? by That's+Unpossible! · 2004-02-23 16:53 · Score: 1

Insightful?

Where the hell did it say these programs were designed to force you to read email?

--
Ironically, the word ironically is often used incorrectly.

Re:Huh? Aren't humans 100%? by JasonStiletto · 2004-02-23 16:57 · Score: 1

obviously, because humans can recheck the ones that they thought were spam but the computer thought weren't, or vice versa. Humans aren't 100% accurate in a single pass, but won't generally get the same thing wrong, pass after pass.

Mine is 100% accurate by Anonymous Coward · 2004-02-23 16:57 · Score: 0

Mine is 100% accurate. If it doesn't have a certain sequence of words in the topic, at the start or after the Re:, then it's spam.
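
A rough sketch of that rule, assuming the "certain sequence of words" is a fixed passphrase. The passphrase `blue heron` and the function name `is_spam` are invented here; the poster didn't say what theirs is.

```python
import re

PASSPHRASE = "blue heron"  # hypothetical; stands in for the poster's secret words

def is_spam(subject, passphrase=PASSPHRASE):
    """Treat a message as spam unless its subject starts with the passphrase.

    Any number of leading "Re:" prefixes is ignored, case-insensitively.
    """
    stripped = re.sub(r"^(\s*re:\s*)+", "", subject.strip(), flags=re.IGNORECASE)
    return not stripped.lower().startswith(passphrase.lower())

print(is_spam("Re: Blue Heron lunch on Friday?"))
print(is_spam("Cheap meds, no prescription"))
```

The catch is the same as with any whitelist: it only works for correspondents who already know the passphrase.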

Re:Huh? Aren't humans 100%? by jonfromspace · 2004-02-23 17:01 · Score: 1

My question is: was the mistake a blocked or deleted "real" message, or a spam that slipped through? In business, I'm far more worried about false positives than false negatives.

--
I am become Troll, destroyer of threads

Re:Huh? Aren't humans 100%? by manyoso · 2004-02-23 17:12 · Score: 1

Uhm, the *authors* of the spam are 100% accurate in identifying their messages as *spam*. Likewise for the authors of legitimate messages.

windows binaries by Spetiam · 2004-02-23 17:17 · Score: 1

has anyone made windows binaries of CRM114?

Not the best idea by Vainglorious+Coward · 2004-02-23 17:20 · Score: 5, Insightful

What you're planning has already been done, it's called TMDA, and it's not such a good idea. You're going to send out 800 "challenge" emails per day - have you given any thought to how many of those will go to genuine addresses that have nothing to do with the spam you receive, because they just happen to be joe-job victims? These kinds of challenge/response systems may slightly alleviate your own suffering through spam, but at a cost to all those unfortunate enough to have had their email addresses faked. And if the sheer impoliteness of such net behaviour doesn't put you off, note that you're using up more of your own bandwidth to send out such challenges.

If any of the smtp exchange or address lookup fails, just forget it, they're probably not real anyway

It would make a lot more sense to make these kinds of checks when you're receiving the email in the first place. Reject at the SMTP level - you never accept and process the spam in the first place.

--
My next sig will be ready soon, but subscribers can beat the rush

Re:Not the best idea by warrax_666 · 2004-02-23 20:50 · Score: 2, Interesting

I don't think SMTP allows for a "reject" after getting to the DATA portion of the SMTP transaction. That prevents most (effective) spam filters from working at SMTP time. If it were possible, wouldn't everybody be doing this?

Hmm... maybe it's time to update SMTP to allow for this? (Sure, bandwidth is still being consumed, but at least legitimate senders would know that their message didn't get through because of "spamminess")

--
HAND.
Re:Not the best idea by martyros · 2004-02-24 02:48 · Score: 1

That's only until *that* person gets a similar filter: it should be easy to recognize that this is a response to an e-mail they didn't send, and just delete it. Whatever system should have it clearly marked, say with [CHALLENGE] in the subject line, and some unique identifier of the message, so that it can be deleted easily by the person's own spam filters.
Sure, it will increase some traffic for awhile; but the responses don't have to be that big; really only 50-100 bytes more than the headers, if you do it right. It shouldn't be anything like doubling your bandwidth. If everyone did it, soon there'd be less incoming spam, and there's your bandwidth savings back.
I haven't thought this stuff through completely, so I'd be interested in hearing more cons against such a system, but the ones you list aren't that big a deal, I don't think.

--
TCP: Why the Internet is full of SYN.
Re:Not the best idea by Continental+Drift · 2004-02-24 02:50 · Score: 2, Interesting

I disagree. I think that a white list with challenge auto-replies, as I use, is clearly effective and adds just a little to mail traffic. I encourage others to use such a system, which would eliminate problems from having the spam reply-to being a real address. Since applying this scheme, I've gotten exactly one spam message in my inbox. That's an excellent percentage.
Re:Not the best idea by Vainglorious+Coward · 2004-02-24 03:12 · Score: 1

Simple : you reject at the "MAIL FROM:" stage if the envelope sender is "bad". That's precisely *why* rejecting like this is so much better than content-based filtering, but of course, mostly the spam has a genuine address in the envelope sender, so this only works for a small set of messages (and this is after the RBL checks).
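
A minimal sketch of an envelope-sender check at that stage. What counts as "bad" is an assumption here (a plain domain blocklist), and `mail_from_reply` is a made-up helper, not part of any real MTA. The reply codes themselves (250, 550, 553) are standard SMTP.

```python
def mail_from_reply(envelope_sender, bad_domains):
    """Return an (SMTP reply code, text) pair for a MAIL FROM command."""
    if envelope_sender == "":
        # The null sender "<>" is used for bounces; this sketch lets it through.
        return (250, "OK")
    local, sep, domain = envelope_sender.rpartition("@")
    if not sep or not local or not domain:
        return (553, "Malformed sender address")
    if domain.lower() in bad_domains:
        # Rejecting here means the message body is never transferred.
        return (550, "Sender rejected")
    return (250, "OK")

blocked = {"spam.example"}  # hypothetical blocklist
print(mail_from_reply("offers@spam.example", blocked))
print(mail_from_reply("alice@example.org", blocked))
```

As noted above, this only helps when the envelope sender itself is recognisably bad; spam with a genuine forged address passes straight through.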

--
My next sig will be ready soon, but subscribers can beat the rush
Re:Not the best idea by Vainglorious+Coward · 2004-02-24 03:26 · Score: 1

That's only until *that* person gets a similar filter: it should be easy to recognize that this is a response to an e-mail they didn't send, and just delete it.

So I have to get one of these filters to deal with the stuff that other users of this filter are throwing at me? As I said, it's not very polite and "just delete it" is the favorite mantra of spammers. I'm cautious about *any* system that automatically generates email in response to arbitrary received mail - the potential for screw-ups is high, and ultimately, if the goal is reducing unwanted email, increasing the number of emails sent is counter-productive.

--
My next sig will be ready soon, but subscribers can beat the rush
Re:Not the best idea by Vainglorious+Coward · 2004-02-24 05:04 · Score: 2, Insightful

I've gotten exactly one spam message in my inbox. That's an excellent percentage.

Excellent *for you* that is. How many unwanted emails have you sent out to joe-job victims? Here's my basic problem - after black/white list weeding, you're always left with a body of messages that you need to decide what to do with. Rather than taking on that burden yourself, you lay it off on others. That's just plain rude, and little different than the MO of a spammer - "let other people bear the costs of my own selfish actions"

--
My next sig will be ready soon, but subscribers can beat the rush
Re:Not the best idea by HeelToe · 2004-02-24 05:06 · Score: 1

I do check for such things at the SMTP level and reject there. It's the stuff that passes muster on returnability and gets through to a spam filter I want to protect against being falsely identified as spam.

My primary goal would be to alleviate the issue that comes about when I haven't whitelisted someone legitimate who normally communicates with me, or someone legitimate who wants to communicate with me for the first time. They get a nice note telling them not to expect a response unless they take some further action. This addresses only false positives.

But thanks for responding. I hadn't actually considered the bit about having your email address spoofed as the sender of spam. That's not happened to me, but I can imagine it's quite devastating. That may be one reason by itself good enough to keep me from going this route. I agree with your assessment - you can't reduce the problem of unwanted email by generating more email, but I think there are more shades of gray.

I guess the bottom line is I'm not willing to change my email address. Why should I? Why should I be bullied by these bastards into changing an online identity? If there are things I can do to lessen the impact of spam on me and those who legitimately communicate with me, I should take those steps instead. I do, however, want to stop where I see those steps as being abusive to others. You identified TMDA as abusive in this regard. Another good example of an abusive step is RBLs. They have effectively squelched the hobbyist who runs servers on their residential internet access. And ONLY because people use them. If people didn't use them, they'd be gone.

Re:Huh? Aren't humans 100%? by wmspringer · 2004-02-23 17:22 · Score: 1

Well...I suppose if you just say the computer will read so much to be this accurate and you will read so much to be so accurate, it really is a fair comparison because it assumes you both make your judgement based on how much of the email you'll read.

Besides, you could argue in the other direction and say that since the computer evaluates the email in a fraction of a second, you should also..

--
Twenties Retirement

It _can't_ know which pr0n I think is spam vs good by ron_ivi · 2004-02-23 17:34 · Score: 4, Funny

I signed up for lots of junk mail lists; some solicited, some not -- sometimes from the same organizations.

How would it know if I consider brunettes non-spam but blondes spam? I did opt-in for one of those email categories, but not the other.

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-23 17:38 · Score: 0

With an accuracy rate of 99.84%, humans will miss about 10 out of every 6250 messages. If three people run through the same set of messages, they will most likely get different messages wrong. In the cases where humans disagree, the researchers can take a closer look and determine what is or is not spam.
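
The arithmetic behind that, as a small sketch. It assumes each reviewer errs independently at the same rate, which is an idealisation; the function names are made up for illustration.

```python
def expected_misses(n_messages, accuracy):
    """Expected number of misclassified messages for one reviewer."""
    return n_messages * (1.0 - accuracy)

def expected_shared_misses(n_messages, accuracy, reviewers):
    """Expected messages that every reviewer gets wrong, if errors are independent."""
    return n_messages * (1.0 - accuracy) ** reviewers

one = expected_misses(6250, 0.9984)
all_three = expected_shared_misses(6250, 0.9984, 3)
print(f"one reviewer: about {one:.1f} misses")
print(f"missed by all three: about {all_three:.6f} messages")
```

Under the independence assumption the overlap is tiny, which is why disagreements between reviewers flag nearly all of the errors.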

Re:Huh? Aren't humans 100%? by Bronster · 2004-02-23 17:40 · Score: 1

I do it regularly, deleting 25 spam messages with a single good one embedded in it when I just woke up before I had my coffee is not a good thing ;)

So do I, which is why I have a separate mail folder into which absolutely everything gets saved, always. If I realise I've deleted something, I can always go back and get it.

Helps to have a colocated server with a pile of RAIDed disk in it for that though - once it's on the server it doesn't cost any more to keep a copy.

Re:Huh? Aren't humans 100%? by Trejkaz · 2004-02-23 17:40 · Score: 3, Interesting

I dunno. I'm running CRM114 now, and it's taking something like 1.5 seconds to identify emails. I am on a slow machine though, which used to do SpamAssassin at around 4 seconds, and inaccurately to boot. CRM114 is a big improvement, and if it trains well after the first fortnight I'll kiss TMDA goodbye.

--
Karma: It's all a bunch of tree-huggin' hippy crap!

Re:Huh? Aren't humans 100%? by fferreres · 2004-02-23 17:43 · Score: 2, Informative

Yes, but it is meaningful nonetheless. If you just think that it's very likely that after reviewing 650 messages, you may have missed one email that you thought was spam, then the "study" is right. I don't care if the number is 900 or 400 emails. Those 400 mails are making me lose a _lot_ of time, and if I value my time, I am losing a lot of productivity, and also missing an important email.

If the program can have a .99 accuracy, then it's a real time saver, and if it only makes a mistake every 2000 emails, then SURELY it will be more accurate than me. That depends, of course, on how much spam you get. I get around a 20 to 1 ratio of spam to real meat, and I get around 100 spam messages a day. I can't spend 1 hour a day cleaning spam with 99.9% accuracy, so I am forced to do a quick sweep. This thing could let me regain the time, and its low false-positive rate would mean I'd even make fewer mistakes than I do manually.

The important thing is how accurate the antispam tool is, and how accurate I am (ratio of spam to meat, and how much a miss costs me). How often other people make mistakes is not really that important. Everybody knows how much time they have, and how much spam to meat they have, and thus, it's very likely that if they don't have a LOT of time to waste, they will be making a mistake for every 200 to 600 spam messages.

--
unfinished: (adj.)

Animal Farm Spam filter by techno-vampire · 2004-02-23 17:46 · Score: 1

One filter good, two filters better.

--
Good, inexpensive web hosting

Here goes by hendrix69 · 2004-02-23 17:48 · Score: 1

There goes my business idea. I wanted to start a business that puts humans in an eastern European country to sort corporate e-mail.

A year into that project you'd get an email: "Progress is good. Spam was filtered. Send more cans!"

--
The power of Christ compiles you!

Re:Huh? Aren't humans 100%? by wmspringer · 2004-02-23 17:56 · Score: 1

I suppose if the comments on it are good, I'll have to try it eventually. At the moment my current emails are only a year old and I'm just relying on Mozilla's filters...which oddly enough have had trouble because I haven't gotten enough spam for them to train on!

--
Twenties Retirement

Do your part by Anonymous Coward · 2004-02-23 17:56 · Score: 0

Annoy spammers by Slashdotting their sites.

Re:Huh? Aren't humans 100%? by The+Notorious+ASP · 2004-02-23 18:06 · Score: 1

I for one welcome our new superhuman overl....

Eh, nevermind...

More! Faster! Longer! by Anonymous Coward · 2004-02-23 18:20 · Score: 0

And that, my friends, is how you get to the magic number--3670 posts--on Slashdot: converse with yourself. No need to create another account, just type your initials at the end of each post and respond to every Anonymous Coward that mentions anything you're talking about. You can even try going back to old threads and posting AC so that you can respond without the fear of someone finding out your dirty secret.

Share the luxury by bigberk · 2004-02-23 18:40 · Score: 5, Interesting

Having such a powerful statistical spam filter is definitely a luxury. I have no difficulty believing the accuracy values presented here. I have had experience with spamprobe, CRM114, bogofilter, spambayes, and spamassassin and all of these do an amazing job to the point where spam no longer exists (for you).

Which leads me to plug a little project called WPBL that uses exactly these types of statistical spam filters to spot spam sources in a distributed fashion. Each project member uploads hourly the IPs they see relaying spam and non-spam, where the 'decision' is made by these extremely reliable filters. This effectively converts your regular mail account into an intelligent spam-trap that feeds a central blocklist.

The more members we get, the better we can identify active spam sources around the world. This information is then used by some sites for quite large-scale blocking. Since you're doing all this filtering processing anyway, why not also share "what you learn" (the IPs that are spamming you)?

If this grabs your interest, read up on the reporting scripts or alternatively, the open WPBL data upload protocol if you want to code your own report generator. Bandwidth usage is minimal.
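
For a feel of what a member's side of this involves, here is a sketch of the tallying step only. It assumes the local filter already yields (relay IP, verdict) pairs; `tally_relays` is a made-up name, the IPs are documentation addresses, and the actual WPBL upload format is not shown in this post, so nothing below reproduces it.

```python
from collections import Counter

def tally_relays(classified):
    """Count spam and non-spam messages per relaying IP.

    classified: iterable of (relay_ip, is_spam) pairs from a local filter
    """
    spam, ham = Counter(), Counter()
    for ip, is_spam in classified:
        (spam if is_spam else ham)[ip] += 1
    return spam, ham

# One hour of hypothetical filter verdicts.
hour = [
    ("192.0.2.10", True),
    ("192.0.2.10", True),
    ("198.51.100.7", False),
    ("192.0.2.10", False),
]
spam, ham = tally_relays(hour)
print(dict(spam))
print(dict(ham))
```

Keeping the non-spam counts alongside the spam counts matters: a relay that sends both is a very different case from one that only ever sends spam.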

Newsflash! by kfg · 2004-02-23 18:43 · Score: 1

The way to get a high post count is to post a lot over an extended period of time.

Film at 11.

3671

KFG

Re:Newsflash! by Anonymous Coward · 2004-02-23 21:26 · Score: 0

Well, I know your secret KFG. You like to make them clever one word posts like. "Yes." and "Her." and "Perhaps."
Those probably add up over time.

Re:Huh? Aren't humans 100%? by bananahammock · 2004-02-23 18:48 · Score: 2, Insightful

That should explain why Dubya's always smiling even when he's trying to be serious.

Sample by Anonymous Coward · 2004-02-23 19:13 · Score: 2, Insightful

I say get a bunch of honeypots and do the test again.

A human doesn't have to determine if it's spam simply by the title.

The human should have all the advantages these filters have body / header / ip .

Cheers

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-23 19:14 · Score: 0

The greatest achievement Spam ever made was proving to the world it didn't exist...

Well by DRACO- · 2004-02-23 19:17 · Score: 3, Insightful

Well if the human was given the chance to read the body text as well like the filters do, then they would be 100% able to delete their own spam.

DRACO-

--
Consider yourself blessed if you are sneezed on by a dragon and only get wet, it could have been a fireball.

No, no, no, not quite by farquharsoncraig · 2004-02-23 19:24 · Score: 1

With 10 messages (after automatic spam detection) humans are 100% accurate.

This is not a statistically correct statement. The 99.84% probability applies no matter how small or large the sample space. Well, even that depends upon how complex your logic is, nevertheless, the central maxim of experimentation is elimination and control. This necessarily implies simplicity, or rather requires as much simplification as your boundary values and intuitive parameters allow while yet permitting a determinable, meaningful conclusion or result. No rancor directed at evilmrhenry, but the world would benefit from a regularization of statistical pedagogy in education with particular emphasis on the mathematical rigor from which the science draws all its formal reason. Granted not all courses in statistics are properly couched in the mathematics department, but statistics is from postulates to grotesque, abstract theorems a pure mathematical construct.

Re:No, no, no, not quite by Anonymous Coward · 2004-02-23 21:17 · Score: 1, Insightful

You didn't get it. If I were to sweep 5 spam messages out of 5 real ones a day for the next 1000 days I would get 10000 of them right. However, if I needed to do all that in one day, it'd drop to some 9984..
Re:No, no, no, not quite by Anonymous Coward · 2004-02-23 21:46 · Score: 0

Are you just trolling or what? The 99.84% is totally inapplicable since obviously a major cause of human error is loss of concentration during a repetitive, boring task. 10 messages allow the human to read each carefully, but 1000 force the human to skim quickly. Unless you've perfected mind control you can't control the subjects' involuntary responses.

Though I do have to agree with you that mathematical rigor in statistics is needed. Many courses are far too lightweight.

CRM114 errors by Jesus+IS+the+Devil · 2004-02-23 19:47 · Score: 1, Offtopic

I just tried installing CRM114 with no success. The documentation is confusing to me. Perhaps someone can help me out.

#1.
From the docs it says:

> In either case your .css files should be in the same directory as
> your mailfilter will "run" in (yes, this can be changed, but that's
> an advanced topic).

What does this mean? What is the path to my mailfilter? I have qmail on my
system.

#2.
I'm being told to edit mailfilterconfig.crm from the docs. However this file is not found
anywhere in the source folder nor could I find it anywhere on my system
after installing CRM114. Where do I get this file?

#3.
Currently I'm piping my emails to a support ticket script. In the
'/var/qmail/alias/.qmail-support' file I have this line:

|/usr/local/php/bin/php /myscript.php

How would I go about having CRM114 filter the mails and then still have the
mails piped to the support ticket script?

--

eTrade SUCKS

Re:CRM114 errors by Anonymous Coward · 2004-02-24 10:44 · Score: 0

How is this off topic?

Look at the rest of the messages here! OMGOMGOMG

Granted, he should be seeking support through the project's lists, but this is still on topic.

Re:How can a human be wrong? by arekq · 2004-02-23 20:01 · Score: 1

I think it's not *right* (umm... there gotta be a better word?!) to compare the accuracy of human filter and computer filter, because, in general, humans don't act as a filter.

Humans define what is spam. The filter, on the other hand, does its best to classify messages based on the information it's given...

Re:Huh? Aren't humans 100%? by Moridineas · 2004-02-23 21:06 · Score: 1

Yes, you are correct: the artificial intelligence embodied in these anti-spam solutions is not more accurate than a human who actually reads the message body. But it is still better than humans at sorting mail.

But that's not the point! You could hire someone to read your email and classify as Spam or not spam, and I doubt they would EVER mess up. Do you disagree with this?

THAT'S the point.

lies, damned lies, and... by stile · 2004-02-23 21:26 · Score: 2, Insightful

statistics.

This headline is misleading. I refuse to RTFA, because I imagine the "10 times as effective" figure comes from the article itself.

Come on, folks. The figures do, in fact, show a 10 times increase in effectiveness between humans and these filters. But what the heck does that mean? I have to question the studies. How did they come up with this 99.84% figure? Does it mean that one person will mis-classify about 16 emails in 10000 (a small number indeed)? Or did one or two outliers taint the data?

The important thing here is that we're comparing three averages. Were the conditions between the trials the same? Were the humans given time limits? Were the accounting methods accurate? Were the spam messages the same?

It's quite possible that these averages were bounded by possible error quantities (they should have been!) and that these were tossed when reporting the numbers to us. This was so that a startling result (10 times as effective as a human) could be shown in a headline. It's all about coming up with a flashy "fact".

It's very easy to make numbers say what you want them to say, so I'd be a little wary of running around to your friends "citing" this 10x improvement figure without doing some deep delving into the processes involved in arriving at the number.

Re:lies, damned lies, and... by Anonymous Coward · 2004-02-23 23:45 · Score: 0

OK, time to clear something up. I'm Bill Yerazunis. I wrote CRM114.

The 99.84% statistic is from ME. Here's how I derived it. Hope this helps.

I went on vacation.

I came back.

I dumped all my email into a big file.

I then started classifying it. Twice. Over a period of almost a week. Meanwhile, I read new email as it came in.

I took my time at this, because I wanted to avoid fatigue and "brain-fade". That's why it took me almost a week to get through it.

Now, because I classified the email twice, I could diff the results and see where I classified the same emails differently on the two passes.

In the end, my two passes differed .16% of the time. Therefore, I take my _personal_ raw error rate as .16%. (Actually, it's going to be worse than that, as there is something on the order of a .16^2 chance that I would misclassify an email twice the same way. But we'll ignore this for now.)

So, that's where 99.84% comes from.
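
One way to turn a two-pass disagreement rate into a per-pass error estimate, sketched under a strong assumption: that the two passes err independently with the same rate e, so the passes disagree with probability d = 2e(1 - e). That model and the name `per_pass_error` are illustrative, not part of the method described above.

```python
import math

def per_pass_error(disagreement_rate):
    """Solve d = 2e(1 - e) for the smaller root e, given disagreement rate d."""
    if not 0.0 <= disagreement_rate <= 0.5:
        raise ValueError("disagreement rate must be between 0 and 0.5")
    return (1.0 - math.sqrt(1.0 - 2.0 * disagreement_rate)) / 2.0

d = 0.0016  # the .16% difference between the two passes
e = per_pass_error(d)
print(f"per-pass error under independence: {e:.6%}")
```

Under pure independence the per-pass figure comes out near half the disagreement rate. Errors repeated the same way on both passes never show up in the diff at all, so the more correlated the passes are, the more the true rate exceeds this estimate; that is the caveat in the parenthetical above.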

If someone cares to run a better test, please do so, and please do publish the results! It really does matter - if filters are _consistently_ better than humans, it's perfectly reasonable to "route to /dev/null" after you've trained your filter.

-Bill Yerazunis, CRM114 author
Re:lies, damned lies, and... by ectoraige · 2004-02-24 01:19 · Score: 1

How *anybody* can be modded insightful for explicitly not Reading The Fucking Article is beyond me.

I have to question the studies
Really? You *really* want to question the studies? Then Read The Fucking Article.

How did they come up with this 99.84% figure?
Read The Fucking Article.

Were the spam messages the same?
Read The Fucking Article.

Scepticism is a useful tool when analysing information, but it should never be taken to the extreme that it prevents you from even accessing the information in the first place.

The irony here is that in taking a swipe at pop journalism, you engage in pop criticism.

Neither have a place in informed debate.

--
Vs lbh pna ernq guvf, ybt bss abj. Tb bhgfvqr. Syl n xvgr.
Re:lies, damned lies, and... by Anonymous Coward · 2004-02-24 02:03 · Score: 0

I'm really quite shocked with Yerazunis' shoddy statistical methods. He has a PhD and knows perfectly well that you cannot draw sound conclusions from these experiments. Yet he publishes them as truth anyway.

I have never taken him seriously and I don't think that any of you should either.

Re:Huh? Aren't humans 100%? by boaworm · 2004-02-23 21:38 · Score: 1

I don't think your math is entirely correct. The whole thing fails because the two persons have to determine and make a combined decision. So, if both persons make the correct guess, that a true spam is spam, it seems logical to think it is.

But if you have two persons, you actually increase the risk that someone makes an incorrect assumption about the spam. Ie, you have two chances of failure. So even if you reduce the probability that _both_ persons would incorrectly classify a mail as spam, you increase the probability that one will. So how do you deal with that situation ?

True spam, A(spam), B(!spam) -> decision ?

True !spam, A(spam), B(!spam) -> decision ?

The problem that occurs can be reduced by having an additional person, and a majority voting process in the end. The chance that a majority of voters would misclassify the mail is very low. You could also stay put with your two algorithms and have a "revote", assuming that the algorithm is not static (i.e. it will not come up with the same decision every time).

This leads us to the key issue. The algorithms are most likely static. That means

A: You must have two different algorithms in your case

B: Revote doesn't work

So I'd suggest setting up three different spam filters and a middleman voting system. Or you could just click "delete" yourself :)
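
A small sketch of such a middleman, assuming each filter returns one of two labels. `majority_vote` and `majority_error_rate` are made-up names, and the error formula assumes the three filters fail independently with the same rate e, which real filters trained on similar data may not.

```python
from collections import Counter

def majority_vote(verdicts):
    """Return the label most filters agreed on (two labels, odd number of votes)."""
    if len(verdicts) % 2 == 0:
        raise ValueError("use an odd number of filters to avoid ties")
    return Counter(verdicts).most_common(1)[0][0]

def majority_error_rate(e):
    """Chance that at least two of three independent filters are wrong."""
    return 3 * e**2 * (1 - e) + e**3

print(majority_vote(["spam", "ham", "spam"]))
print(f"{majority_error_rate(0.0016):.2e}")
```

With an odd number of voters there is always a decision, which is exactly what the two-judge cases above were missing.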

--
Probable impossibilities are to be preferred to improbable possibilities.
Aristotele

thats bad! by Anonymous Coward · 2004-02-23 22:01 · Score: 0

How am i supposed to increase my penis size now?

it deletes ALL email to achieve result, humans sav by aaron_pet · 2004-02-23 22:06 · Score: 1

I figured out how this works...

as I get TONS of spam... maybe 700 or so per legit email (joking)...

and I save a few of them...

all that the computer would have to do is delete all of them to achieve a higher accuracy rate!

of course that could also be considered a 100 percent failure rate.. but .. it's mostly right!

--
Please use [ informative / summarizing ] SUBJECT LINES
Flame me here

Re:Huh? Aren't humans 100%? by Andrew+Cady · 2004-02-23 22:13 · Score: 1

Well, the math isn't wrong, but there is an unstated assumption, that whenever the two disagree, it will be possible to check and determine who is right. What multiple judges allows is better -detection- of errors, not in itself correction of errors. A majority scheme with an odd number of judges would indeed allow correction as well.

However, you've only actually increased the probability of error in any sense if you count a disagreement as an error - but there is no more reason to count a disagreement as an error than there is to count it as a correct judgment. It is, in fact, distinct from either. A disagreement implies no conclusion, so it cannot be right or wrong.

False positives? by pjt33 · 2004-02-23 22:17 · Score: 1

Be nice if the figures given (and no, of course I haven't RTFA) specified rates for false positives and false negatives. While a human may have 0% false negatives, if you have friends who write dodgy subject lines, you may well have the odd false positive.

Don't you use Pine? by pjt33 · 2004-02-23 22:21 · Score: 1

It's "delete, delete, delete, delete, down-arrow, delete, delete, down-arrow, delete, delete, whoops!, up, undelete, expunge"

Getting rid of "subscription spim" by pjt33 · 2004-02-23 22:24 · Score: 1

I use Fire as my IM client. If someone wants to go on my whitelist, they have to use a non-IM method to communicate that fact to me. Some people might not like that, but I'm quite happy with it.

Re:Huh? Aren't humans 100%? by gujo-odori · 2004-02-23 22:26 · Score: 3, Informative

I write spam filters for a living, and I promise you that they can eliminate many of the spams just by looking at the subject too.

Of course, so can I. Now, since I write the filter based on my human judgement of what constitutes spam, which is more accurate?

please tell me you're kidding ... by pwarf · 2004-02-23 23:33 · Score: 1

Statistics may be a "pure mathematical construct," but the application of statistics to the real world requires verification that the assumptions hold and the conditions remain unchanged.

While it is technically true that the detection of spam accuracy given 10 messages would on average be very slightly less than 100%, it is very reasonable to assume the accuracy would be higher for people looking through 10 messages/day rather than 100s/day.
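
To put a number on "very slightly less than 100%", here is a sketch that takes the 99.84% per-message figure at face value and assumes each message is judged independently, which is exactly the assumption being questioned in this comment.

```python
def prob_all_correct(n_messages, per_message_accuracy=0.9984):
    """Chance of classifying every one of n messages correctly, if independent."""
    return per_message_accuracy ** n_messages

ten = prob_all_correct(10)
heavy_day = prob_all_correct(190)
print(f"10 messages: {ten:.4f}")
print(f"190 messages: {heavy_day:.4f}")
```

Even with a fixed per-message rate, a light day is far more likely to be error-free than a heavy one; fatigue would widen that gap further.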

The condition of only evaluating 10 e-mails is fundamentally different than sorting through "a total of 5849 messages (1935 spam, 3914 nonspam)" in a month or an average of 190+ a day. Extrapolating the author's result to a much lower message load is unwarranted.

Re:please tell me you're kidding ... by farquharsoncraig · 2004-02-24 07:29 · Score: 1

You are right, considering the nature of this experiment, the discreteness of the sample space would likely alter the value of the actual result. I did not consider the human factor of (in)doggedness (: which would definitely affect efficacy on a smaller set of mail seeing that one would probably apply more energy of thought in determining spam knowing the ordeal to be as trivially short as 10 messages.

Nevertheless, strictly statistically speaking the 99.84% statistic applies to each individual message regardless of the size of the sample space. A human has a 99.84% probability of correctly identifying this one RANDOM MESSAGE as spam or not according to statistics. If this number is false or inaccurate, then it is not statistics that is wrong, but the model that is too simple or misapplied.

If I flip a coin one, ten, or ten thousand times, it will not change the probability of being 50% getting either heads or tails on each individual flip, even regardless that I have gotten heads for the past 9,999 flips.

All of that being said, statistics, and mathematics in general can never touch the real world. They only can construct models that simulate the real world and make predictions on the behavior and state of those models.

So ummm.... by Kjella · 2004-02-23 23:35 · Score: 1

Once Email Spam is eliminated, then IM spam will begin...

...did email spam get eliminated sometime in the 90s, and no one told me?

SPAM goes where the people go... on Usenet, on Email, on WWW, on IRC, on IM, on P2P progs. The person that invented a spam-free service would be a very rich or very popular man, probably both.

Kjella

--
Live today, because you never know what tomorrow brings

Re:Huh? Aren't humans 100%? by R.Caley · 2004-02-23 23:50 · Score: 4, Insightful

fill a bin with 50,000 red balls and 50,000 blue balls. Ask a human to sort them all.

Not comparable. The job of a junk mail filter is to drop things I don't want to read. It is trying to match my evaluation, not to match a semi-objective criterion like red or blue.

If I read 1000 messages and say which I wish I hadn't read, then I am 100% accurate by definition.

Of course, if they are really talking about a pure spam filter -- ie one which identifies unsolicited commercial email -- then they can be more accurate than me, but at an uninteresting, perhaps even counter-productive, task:

I may get unsolicited commercial email I do want to read one day. Almost happened once (I had inadvertently signed up for it, so it was not really unsolicited, and I didn't actually buy the piece of kit they had on special offer that week, but was tempted). I also get stuff I don't want which isn't spam (notably email from virus infected machines).

The referenced study seems to be a very sloppy job from this POV. They don't define what their criterion of success is, and to the extent they put in a hand waving attempt it is clearly nonsense:

Because spam (sometimes termed "unsolicited commercial email" or "marketing messages") is neither expected nor desired[...]

`Unsolicited' does not imply `not desired'. If they don't tease those two apart, they can't get interesting results for real world applications. Eg, someone mailing my work address with a commercial proposition may well be a very welcome unsolicited commercial email.

--
_O_ .|< The named which can be named is not the true named

Re:Huh? Aren't humans 100%? by scott_davey · 2004-02-23 23:55 · Score: 1

I don't know what all the fuss is about with this spam thing. Sure it's a pain responding to all that email, but now I've got a 10 foot penis!

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-24 00:24 · Score: 0

Remember; one thing computers are good at is doing boring things repeatedly.

Does that mean computers/robots will never learn to masturbate?

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-24 00:52 · Score: 0

"We have a project to produce high-quality benchmark corpora for spam filter testing"

So... people are selling the contents of their hotmail accounts on ebay as "spamfilter test data"?

Re:Huh? Aren't humans 100%? by Net_Wakker · 2004-02-24 00:55 · Score: 1

I just woke up before I had my coffee

How do you do that? I NEVER wake up before I had my coffee...

--
a horrible place

Re:Huh? Aren't humans 100%? by Elwood+P+Dowd · 2004-02-24 01:07 · Score: 1

Yes, I do disagree with you. Because eventually, that person you hired would learn that they didn't need to read the text of some of the email, and they would start deleting messages without reading them (or without reading them carefully).

People make mistakes.

--

There are no trails. There are no trees out here.

Re:Huh? Aren't humans 100%? by No.+24601 · 2004-02-24 01:15 · Score: 1

puts humans? as opposed to what, donkeys?

Overkill by mdfst13 · 2004-02-24 01:22 · Score: 2, Interesting

We don't need to trust the *person* sending the mail. It would be sufficient to trust the machine that is doing so.

Look at http://spf.pobox.com/ which is sufficient. With SPF, you know that if you are getting SPAM saying it is from @ultraviolet.org, then it really is from @ultraviolet.org (or at least someone who ultraviolet.org trusts).

Your solution requires a certain level of technical proficiency (setting up and managing the key) of *all* participants. SPF's solution only requires technical proficiency from those who manage DNS settings and those who manage email servers (in particular the person who manages your email server).

Also, what about *stolen* keys? And who handles key checking? SSL certificates are restricted to a few root signers, but you don't want a central certificate authority. PGP/GPG work well because they only involve small numbers of people. In general, you know the person directly. Occasionally it will be a friend of a friend message. What do you do when the chain is 10 or a 100 or a 1000 keys long? How long will it take for you to find out that 978 has since revoked their signature for 977 (counting in steps from you, i.e. you are 0 and 1000 is the original signer of this chain)? Or how long will it take you to verify all 1000 keys if you try to do it real time (i.e. when you get the message)?

Re:Overkill by Tracy+Reed · 2004-02-24 07:49 · Score: 1

True but we need to trust the individuals in so many other cases that we may as well do it here too. A public key infrastructure is so useful in so many areas. Technical proficiency really isn't a problem as the MUA can automate just about everything. Stolen keys are not a problem. This is why keys are encrypted with a symmetric cipher so you have to enter your password to unlock it. If the key really is stolen just tell your MUA and it will issue the revocation key it generated when it generated the key for you in the first place. The individuals would handle key checking themselves, this is probably the most complicated part. Have your pal come over to your machine, say "yep, that's my key" and then click sign it if you know him or look at his ID if you don't. The chain is unlikely to be 100 or 1000 keys long. 6 degrees of separation etc and orkut and friendster etc have shown us this works. It won't take long at all to verify keys if you cache the results. It could take several minutes (at worst) the first time but then a fraction of a second each time after that. The PGP/GPG/crypto guys are quite smart and have thought of all of this.
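
A rough sketch of the "find a chain, then cache it" idea, for anyone who wants to see the shape of it. This is not how PGP/GPG actually stores or walks signatures; the graph, the names and the `max_hops` limit of 6 are all made up for illustration.

```python
from collections import deque

def trust_path(signed, start, target, max_hops=6):
    """Breadth-first search for the shortest signature chain start -> target.

    `signed` maps a key owner to the set of owners whose keys they signed.
    Returns the chain as a list, or None if none exists within max_hops.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) - 1 >= max_hops:
            continue  # chain already at the hop limit, do not extend it
        for nxt in sorted(signed.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

_cache = {}

def cached_trust_path(signed, start, target):
    """Remember earlier lookups so repeat mail from one sender is cheap.

    Caveat: a cached chain goes stale if someone on it revokes a signature.
    """
    key = (start, target)
    if key not in _cache:
        _cache[key] = trust_path(signed, start, target)
    return _cache[key]

# Hypothetical web of trust: me -> alice -> bob -> stranger
signatures = {"me": {"alice"}, "alice": {"bob"}, "bob": {"stranger"}}
print(cached_trust_path(signatures, "me", "stranger"))
```

The first lookup walks the graph; later ones are a dictionary hit. The stale-cache caveat in the comment is the trade-off: the faster the repeat lookups, the longer a revocation takes to be noticed.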

It's not overkill. It's just not a band-aid. It is very useful in all kinds of areas from email to verifying downloads are good and trustworthy to verifying webpages etc. We need a universal PKI.
Re:Overkill by mdfst13 · 2004-02-24 09:55 · Score: 1

I don't disagree that, in the way that it is used *now*, it is not overkill. What I'm saying is that a public key system would not be a good general system. I.e. when I get an email from a stranger (which is when it matters; other than virus outbreaks, very little spam comes from friends), I would have to find someone I trusted to verify the stranger. This is where you get long chains (I won't speculate as to length again).

Public keys work great when you want to verify someone you know (although I think that they would work less well if *everyone* used them; the technical difficulty keeps out many of the people who would get virus infected, etc.). The problem is that spam is usually from someone you do *not* know. Now, you could reject all email from those that you do not know, but the effect of that would be to eliminate some email you might want to receive (I do a lot of PHP contract work; a stranger contacting me is a *good* thing).

The more you automate the system, the more vulnerable it is to assault. If everyone used public keys, then all spam would be created by viruses and crackers. Yes, you could revoke the key, etc., but all that takes time. Especially if you cache results (my ISP can take up to a week to update DNS). Further, I have had people take two or three *days* to fix their virus infected computer (which is essentially a stolen key situation), even after I told them what the problem was. End result, a lot of people would have their keys revoked because someone down chain was misusing it (and someone up chain is not willing to wait until they correct it). Until they figure it out, this means that their *valid* email is getting rejected.

A two level verification system (with SPF as one level) is better as a general system. SPF verifies that the machine (IP) is authorized to send mail for that domain. The machine (which can be an SMTP server or just a workstation -- limiting to SMTP servers is more secure; the domain owner can set this) is then responsible for verifying the identity of the person sending the email (in whatever fashion). Now, they either need to crack an SMTP server or a DNS server to send mass emails.
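
To illustrate the first level (is this IP authorized to send for the domain?), here is a toy sketch. It is not a real SPF implementation: it does no DNS lookup, it only understands `ip4:` mechanisms and a trailing `all`, and the record and addresses below are invented examples using documentation IP ranges.

```python
import ipaddress

def spf_allows(record: str, sender_ip: str) -> bool:
    """Toy check: does an SPF-style TXT record authorize sender_ip?

    Only 'ip4:' mechanisms and a trailing 'all' are handled; every other
    mechanism is ignored. Softfail (~all) is treated as a plain "no" here,
    which is a simplification.
    """
    terms = record.split()
    if not terms or terms[0] != "v=spf1":
        return False
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        if term.startswith("ip4:"):
            if ip in ipaddress.ip_network(term[4:], strict=False):
                return True
        elif term in ("-all", "~all"):
            return False
        elif term in ("+all", "all"):
            return True
    return False

# Hypothetical record a domain owner might publish in DNS.
record = "v=spf1 ip4:192.0.2.0/24 -all"
print(spf_allows(record, "192.0.2.10"))    # inside the listed range
print(spf_allows(record, "198.51.100.7"))  # some other machine
```

The second level, the sending machine checking who its own user is, sits outside this check entirely, which is the point of splitting the job in two.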

In a public key system, cracking an individual computer can give you access to a key which would allow the spammer to send emails from a separate machine or machines until the key was cancelled (and the cache clears). After the key is cancelled, you lose valid email until a new verification path is created.

The SPF method involves no changes in MUAs (except possibly the introduction of something like SMTP auth, but common MUAs already support this); no extra key signing software; puts the burden on domain *owners* to secure their domain; is easy for ISP tech support to help people set up. Compare to public key situation: tech support says, "you need to send your key to us to get signed." A couple days later they get a letter in the mail with the person's house key. Silly? Yes. Something that would actually happen? You betcha. The majority of users require an abstractable method where someone else does the heavy lifting.

A public key interface is great when everyone involved is motivated to maintain it. However, this is not generally extensible IMO. If everyone used it, we would be continually managing stupid key problems (e.g. Jane revokes John's key because John didn't call her the next day, which shows up the same as if she revoked his key because he sent spam) caused by relying on non-technical users. This is not to say that public keys are a bad idea (they aren't); merely to say that they are not a universally scalable idea. At least not IMO.

10 times...? by holizz · 2004-02-24 01:23 · Score: 2, Funny

If humans are 99.84% accurate and these filters are ten times as accurate, wouldn't that make these filters 998.4% accurate or am I missing something?

Re:10 times...? by KnightStalker · 2004-02-24 05:08 · Score: 1

Think of that less as "Ten times as accurate" and more as "One-tenth as inaccurate" meaning humans will miscategorize ten times as many messages. 99.84% accuracy = 0.16% inaccuracy, and one-tenth of that would be 99.984% accurate.

The problem is that English is not very precise, since it's just as accurate to say that 20% is ten times higher than 2%, as it is to say that 99.984% is ten times higher than 99.84%. (See also, "How to Lie with Statistics").
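
The "one-tenth as inaccurate" reading is easy to check. A minimal sketch (the function name `scale_error` is just for illustration):

```python
def scale_error(accuracy_pct: float, factor: float) -> float:
    """Return the accuracy (in percent) after dividing the error rate by `factor`."""
    error_pct = 100.0 - accuracy_pct
    return 100.0 - error_pct / factor

human = 99.84
filter_acc = scale_error(human, 10)

print(f"human error rate: {100 - human:.3f}%")
print(f"filter accuracy:  {filter_acc:.3f}%")
# Error rate expressed as "one mistake per N messages".
print(f"one error per {round(100 / (100 - filter_acc))} messages")
```

So the factor of ten applies to the 0.16% error rate, not to the 99.84% itself.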

--
* And remember, it's spelled N-e-t-s-c-a-p-e, but it's pronounced "Mozilla."
Re:10 times...? by holizz · 2004-02-24 07:31 · Score: 1

Thank you, knowledgeable stranger. Damn statistics, damn them to /dev/null. And who modded me funny in parent's parent (I'm my own grandparent, hmm)? I was making a serious point...
Re:10 times...? by KnightStalker · 2004-02-25 03:56 · Score: 1

Ah, you probably haven't yet had your first dose of Moderator Crack. When you do, I suggest you look for bizarre trolls to mod up as "Informative", legitimate questions to mod "Funny", and anything that contradicts the groupthink to mod as "Offtopic", "Troll", or "Flamebait". This is the Slashdot Way.

--
* And remember, it's spelled N-e-t-s-c-a-p-e, but it's pronounced "Mozilla."

Re:Huh? Aren't humans 100%? by MrScience · 2004-02-24 01:40 · Score: 1

I can just imagine:

"Please take a seat here.
Thank you for volunteering for the 'Spam Deletion Turing Test'.
We would like you to delete any spam that comes into this in-box...
mail will be arriving at around 10-20 messages per minute."

5 minutes pass

10 minutes pass

"Sir? Could you tell me how long this will take?"

"Another 6 and 1/2 hours."

o-O

--

You quitting proves that the karma kap worked. The most annoying of the whores shut up. --CmdrTaco

Re:Huh? Aren't humans 100%? by Mr+Guy · 2004-02-24 01:50 · Score: 1

And he's never cared, apparently.

Can you truly miss something you never had?

--
Never confuse volume with power.

Making this work with MS products by shadowlight1 · 2004-02-24 02:05 · Score: 1

Anyone know how I could use either of these with my copy of MS Outlook or Exchange 2000 server at work?

Free viagra by TheBoostedBrain · 2004-02-24 02:11 · Score: 1

What would happen if a spam filter were applied to /. comments?

--
-- When did Ignorance Become a Point of View?

Re:Huh? Aren't humans 100%? by TheGreatGraySkwid · 2004-02-24 02:19 · Score: 1

Why the hell would you have a cyber-sex chat log mailed to you?

Much like the rule stating there are few things less funny than a "funny" IRC log you did not participate in, I can think of few things more sexless than a cyber-sex chat log received via e-mail...

--
The Humblest Mollusk on the Net

Re:Huh? Aren't humans 100%? by DjMd · 2004-02-24 02:29 · Score: 1

...however if he had read them fully he would have known they were legit.

Maybe if you didn't start off your emails by talking about cheap online drugs, work from home make $!000, Discounts for Diabetics , Can I Enhance it bigger, Esgic D1scounts, Ch.eap Microsoft Adobe Corel Autodesk OEM software for sa.le! dcunrisnjt kld ....
Then he wouldn't have to read them fully

Sorry, got carried away; I just started cutting and pasting subject lines from my spam folder... How does any of that look like real email? dcunrisnjt kld???

--
DJMD - The fourth man - Planetary

Re:Huh? Aren't humans 100%? by SillySlashdotName · 2004-02-24 03:14 · Score: 1

I thought SPAM was unwanted bulk email. So you are saying the programs are 10 times more accurate in knowing if the email is unwanted by me than I am?

Or are they saying it is 10 times more accurate in identifying bulk email - whether it is wanted or not?

Wanted bulk email != SPAM.

What I want is a filter that round-files anything that "calls home" or otherwise sends information when opened. I can slog though the rest and make my own determination as to whether it is wanted or not.

MACHINE: This email is like other email you did not want, so you don't want it either.
ME: Wait a minute, I DO want to read that.
MACHINE: No you don't. I am more accurate than you, so you don't want to read it.
ME: Yes...master...I...don't...want...to...read...it...
MACHINE: These are not the 'droids you want.
ME: These...are...not...the...'droids...we...want...
MACHINE: Move along
ME: Move...along...

--
Acts of massive stupidity are almost never covered by warranty. --me.

Re:Huh? Aren't humans 100%? by notque · 2004-02-24 03:26 · Score: 1

if you have to read an entire message to tell that it's spam, the spam has succeeded.

I thought the spam succeeded if I actually sent money to nigeria...

Are you telling me that they profit from me wasting my time? Gasp!

--
http://use.perl.org

more spam tools by Anonymous Coward · 2004-02-24 03:33 · Score: 0

Check the spam tools here

? but we define spam by jago25_98 · 2004-02-24 03:35 · Score: 1

If humanity defines what Spam is, how can a machine be better?

--
A blog I run for the wealth

Re:Huh? Aren't humans 100%? by tommck · 2004-02-24 04:09 · Score: 1

Hell I can tell 100% of the time when I have TWO blue balls. I'll bet you a million bucks that I would be 100% accurate if I had 50,000 of them! OUCH! Damned dick-teases!

--
---- It puts the lotion on its skin or else it gets the hose again. It does this whenever it's told.

ten times? I think not. by ArmorFiend · 2004-02-24 04:28 · Score: 1

the average human is only 99.84% accurate. Both filters are reporting to have reached accuracy levels between 99.983% and 99.984%

So to be ten times more accurate, the filters would have to be 998.4% accurate.

"My spam filter gives 110%"

Re:Huh? Aren't humans 100%? by Snowmit · 2004-02-24 04:43 · Score: 1

The further question is, if humans aren't as accurate as the computer, how are they measuring the accuracy at all? That is, how do they know that the 1 in 6250 messages is wrong, if a human, known to be inaccurate, was testing for accuracy?

You're joking, right? I give my students math problems all the time and they regularly give me back inaccurate results. I can still correct those results. Yet, I also make mathematical errors from time to time. Sometimes, I can even correct my own results. This is a normal part of testing.

--
I have a lot of opinions about Cyborgs and Architects

Crash from Junkyard wars is one of the developers by doublem · 2004-02-24 04:56 · Score: 1

I've met this guy at a few parties. Cool chap, lots of fun, easy going.

He told me about the "Beyond Bayesian" filtering they were putting in place a while back.

The name comes from "Dr. Strangelove". Gotta love it.

--
"Live Free or Die." Don't like it? Then keep out of the USA

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-24 05:14 · Score: 0

How can a spam filter be more accurate than humans?
>>>>>

Filter out 100% of their email and ignore the false positives?

That and the fact that ever so rarely, you cannot rule out something as spam by the subject alone, and thus must click on it (though this isn't quite fair, since the computer reads 100% of the email, and you only read those that you're not sure to be spam...)

Re:Huh? Aren't humans 100%? by Ifni · 2004-02-24 05:29 · Score: 1

People buy products from spammers, hence why there continues to be spam - it's profitable.

If people buy products from spamming vendors, it follows that people aren't 100% effective at determining spam. Yes, you could argue that if the person was interested in the product, then it wasn't spam to them, but when they get that bogus penis enlargement supplement that doesn't work, and THEN realize that the message was for a product they really didn't want (because it doesn't work as advertised), you can see then that they made a mistake that a spam filter could have saved them from. If every spam product worked as advertised, I think that very little of it would be spam since almost everybody would want these miracle products (especially since it seems that all of the penis enlargement products have the handy benefit of enlarging breasts if consumed by females).

Hence one of the reasons that a spam filter (human or otherwise) can't be 100% accurate, as it can't accurately predict if the product will work as advertised.

Of course, this is heavily dependent on your exact definition of spam. Does spam=UCE? Does spam= ANY unsolicited email, commercial or otherwise? Does spam = anything you don't feel like reading, solicited or not? Or one of myriad other possible definitions. Most spam filters define spam as either UCE (word lists) or things you don't want to read (bayesian). But what about the bad news that your brother has died? I didn't want to read that, but I should, and so I would want it to get through. So in effect, what spam filters are TRYING to do is sort out what you SHOULD read from what you SHOULDN'T read, regardless of the combination of commercial, solicited, or pleasant. And for the reason given above concerning ordering a bum product, since many people are gullible and/or unknowledgeable, there are many things they THINK they should read that they shouldn't (or vice versa if they get overzealous when deleting what they think is spam after seeing only the subject and sender). Spam preys on both of those (gullibility and lack of knowledge in a particular area) to entice people to buy products that aren't worth the asking price by a long shot. A filter might be able to do a better job of weeding these out than the average person (i.e. a person might buy a penis enlargement product or believe the Nigerian scam, but the filter would know better), but neither is ever likely to be 100% (the filter might let through something you subscribed to that offers to sell you a legitimate product from a legitimate vendor, but still turns out to be a lemon, or even fail to let through a TRUE deal). In some cases you can consider a spam filter to be like an expert system, providing expertise in email deception practices to the average computer user that doesn't even know that "From:" headers (among others) are easily forged.

Sorry for the rough edges - I'm rambling a stream of consciousness without editing due to lack of time (gotta get to work!)

--

Oh, was that my outside voice?

Re:It _can't_ know which pr0n I think is spam vs g by xmedar · 2004-02-24 06:57 · Score: 1

Well, you could develop an RDF Schema for porn spammers that would include attributes such as hair colour, nationality, age, soft, hard, images, video, etc. On the other hand, you could just go to a password site and get the same thing for free.

--
Any sufficiently advanced man is indistinguishable from God

Re:Huh? Aren't humans 100%? by Moridineas · 2004-02-24 07:10 · Score: 1

Ok, I think we'll just plain disagree here then :)

Re:It _can't_ know which pr0n I think is spam vs g by Anonymous Coward · 2004-02-24 07:31 · Score: 0

I thought this was "insightful". A program can't be "100%" accurate in identifying spam unless it knows _everything_ about the user.

Re:Huh? Aren't humans 100%? by Trejkaz · 2004-02-24 09:11 · Score: 1

If you correct all your own results, then you are 100% accurate in the end.

--
Karma: It's all a bunch of tree-huggin' hippy crap!

Re:Huh? Aren't humans 100%? by Anonymous Coward · 2004-02-24 10:30 · Score: 0

"4, Informative?"

Goddamnit! This is spam!

Obviously, the moderators are not superhuman.

Re:ob by wwest4 · 2004-02-24 10:30 · Score: 1

you almost had it. here's the procedure for future reference.

SUBJECT=`subject of the story`
ACTION=`change $SUBJECT is subject to`

SUBJECT=`strip_definite_article $SUBJECT`
ACTION=`third person singular case of $ACTION`

JOKE=`echo In soviet russia, $SUBJECT $ACTION you!`

How do you deal with the changing nature of spam? by GPS+Pilot · 2004-02-24 13:24 · Score: 1

In building these "corpora" of spam, aren't you trying to hit a moving target? Spammers are always evolving their techniques to avoid filters; as the identifying characteristics of "spam" therefore change constantly, I fear that your corpora will result in filters that do an excellent job of blocking last year's spam.

--
That that is is that that that that is not is not.

Re:How do you deal with the changing nature of spa by po8 · 2004-02-24 18:55 · Score: 1

Part of what we're trying to do is establish a methodology for semi-automatically building good benchmark corpora. So ideally, if we think the spam stream has changed substantially, we should eventually be able to mostly just stuff in more messages, push a button and get a good current benchmark corpus. At least, that's the ideal.

In any case, we believe that everyone should be running filters customized for their personal current ham and spam streams. Our corpora are not intended to be used as training data for your filter. However, they should help you estimate generic properties of that filter, such as quality of its learner.
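
As a rough idea of what "estimating properties of a filter" against a labeled corpus looks like, here is a minimal sketch. The tiny corpus, the keyword classifier and the function names are all invented for illustration and have nothing to do with the actual benchmark corpora or either filter.

```python
def evaluate(classify, corpus):
    """Score a classifier against (text, is_spam) pairs.

    `classify` is any callable taking the message text and returning
    True for spam. Returns raw counts plus overall accuracy.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for text, is_spam in corpus:
        guess = classify(text)
        if guess and is_spam:
            counts["tp"] += 1      # spam correctly caught
        elif guess and not is_spam:
            counts["fp"] += 1      # ham wrongly dropped (the costly error)
        elif is_spam:
            counts["fn"] += 1      # spam that slipped through
        else:
            counts["tn"] += 1      # ham correctly delivered
    total = sum(counts.values())
    counts["accuracy"] = (counts["tp"] + counts["tn"]) / total if total else 0.0
    return counts

def keyword_filter(text):
    """Deliberately naive stand-in for a real learner."""
    return "viagra" in text.lower()

toy_corpus = [
    ("cheap viagra now", True),
    ("meeting at noon", False),
    ("free viagra discount", True),
    ("viagra study results for the journal club", False),
]
print(evaluate(keyword_filter, toy_corpus))
```

Keeping false positives and false negatives as separate counts matters more than the single accuracy number, since losing one wanted message usually hurts more than reading one spam.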

Re:Huh? Aren't humans 100%? by jonadab · 2004-02-25 01:28 · Score: 1

> but can you identify spam before opening it 100% of the time?

Not with certitude, no. About 2% of the time I have to look at the message
body to be sure. Nevertheless, this nonsense about humans only being 99.8%
accurate is based on the *average* human, and that figure is dragged way down
by a relative few who lack any kind of discernment at all, and a somewhat
larger minority whose accuracy is less than what it ought to be because they
are careless.

--
Cut that out, or I will ship you to Norilsk in a box.

also depends on morale by hany · 2004-02-25 01:50 · Score: 1

The "right" and "wrong" in this situation also greatly depends on morale applied when judging his action.

--
hany

Re:also depends on morale by Alranor · 2004-02-26 09:33 · Score: 1

Well I think his morale would have been somewhat raised :) , he may have had some moral difficulties though.

Re:Huh? Aren't humans 100%? by Ciggy · 2004-02-25 02:34 · Score: 1

Most of my spam (4878/4891 = 99.73% so far this year) arrives at one e-addr which I no longer use. [The rest (13/4891 = 0.27%) has arrived at another non-existent e-addr (looks like a message_id) - probably used by one spammer.]

None of the spam arrives at my active e-addrs - as soon as it does, I've got a sure-fire moan at the source: each sign-up that requires an e-addr gets an individual, traceable one. (The only problem is my main, family one: if that gets spammed, it implies that it's been leaked by a friend, possibly by a worm/virus.)
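
One way the per-sign-up address trick could be automated, as a sketch only. The secret, the domain and the address layout are assumptions made up for this example; any scheme that gives each site its own address and lets you map it back would do.

```python
import hashlib
import hmac

SECRET = b"change-me"      # hypothetical private key, never shared with sites
DOMAIN = "example.org"     # hypothetical domain you control

def address_for(site: str) -> str:
    """Derive a unique, hard-to-guess address for one sign-up."""
    tag = hmac.new(SECRET, site.encode(), hashlib.sha256).hexdigest()[:8]
    return f"{site}.{tag}@{DOMAIN}"

def leaked_by(address: str, known_sites):
    """Return the site whose address received spam, or None if unknown."""
    for site in known_sites:
        if address_for(site) == address:
            return site
    return None

sites = ["bookshop", "newsletter", "forum"]
spammed = address_for("newsletter")
print(leaked_by(spammed, sites))
```

The keyed tag is there so a spammer cannot simply guess a valid address for another site from the pattern.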

--

A rose by any other name would smell as sweet;
A chrysanthemum by any other name would be easier to spell

Re:Huh? Aren't humans 100%? by hatrisc · 2004-02-25 17:15 · Score: 1

so, you're only 98% accurate? You're below average and dragging it down. Get with it.

--
I write code.

Re:Huh? Aren't humans 100%? by jonadab · 2004-02-26 10:28 · Score: 1

> so, you're only 98% accurate?

No, but I achieve my accuracy by knowing when the headers alone aren't
enough to be sure and examining the body in those cases.

--
Cut that out, or I will ship you to Norilsk in a box.

487 comments