Response to Gordon Cormack's Study of Spam Detection

How I do by mirko · 2004-06-24 03:32 · Score: 5, Interesting

I set up many aliases for my official email address and gave them to spammers, and only to spammers.
So, whenever I get a mail more than 95% similar to one I know is spam, I dump it.
Combine this with Apple's Mail.app Bayesian filter and only a few spams may be left.

--
Trolling using another account since 2005.
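
The "more than 95% similar" check in the post above could be sketched roughly as follows. This is only an illustration, not the poster's actual setup: it assumes the spam collected on the bait aliases is available as plain strings, and it swaps in Python's standard difflib ratio as the similarity measure. The names is_known_spam and SIMILARITY_THRESHOLD are made up for the example.

```python
from difflib import SequenceMatcher
from typing import Iterable

# Assumed cutoff, taken from the "more than 95% similar" figure in the post.
SIMILARITY_THRESHOLD = 0.95


def is_known_spam(message: str, known_spam: Iterable[str],
                  threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Return True if message is more than `threshold` similar to any known spam."""
    for sample in known_spam:
        # ratio() is 2*M/T: M matching characters, T total characters in both strings.
        if SequenceMatcher(None, message, sample).ratio() > threshold:
            return True
    return False


if __name__ == "__main__":
    bait_box = ["Buy cheap pills now, limited offer just for you today!"]
    print(is_known_spam("Buy cheap pills now, limited offer just for you t0day!", bait_box))
    print(is_known_spam("Hi Bob, are we still on for lunch at noon tomorrow?", bait_box))
```

Comparing every incoming mail against every stored sample gets slow as the bait collection grows, so a real setup would probably hash or index the samples first.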

Re:How I do by Anonymous Coward · 2004-06-24 03:44 · Score: 0

I reply to spamvertisers with a Hello.jpg file attachment.
Re:How I do by lukewarmfusion · 2004-06-24 03:47 · Score: 1

My mail provider has some filtering software which lets me customize the threshold that I want filtered. On my side, Thunderbird has filters. Finally, I use a wildcard email that forwards to my actual address. When I use my email somewhere that might spam me, I simply describe the potential spammer like this:

slashdotspam@mydomain.com

If I get mail from there, I know how it came in. The combination of all these keeps my total spam to probably three or four a week.
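
The "I know how it came in" step could look something like the sketch below. It assumes the naming convention shown in the example address (site name followed by "spam") and uses the placeholder domain from the post. leak_source, CATCHALL_DOMAIN and TAG_SUFFIX are hypothetical names, not part of any mail provider's API.

```python
from typing import Optional

CATCHALL_DOMAIN = "mydomain.com"  # placeholder domain from the post
TAG_SUFFIX = "spam"               # assumed convention: <site>spam@<domain>


def leak_source(recipient: str) -> Optional[str]:
    """Return the site a tagged address was given to, or None if it is not tagged."""
    local, _, domain = recipient.lower().partition("@")
    if domain != CATCHALL_DOMAIN or not local.endswith(TAG_SUFFIX):
        return None
    site = local[: -len(TAG_SUFFIX)]
    return site or None  # a bare "spam@" address carries no site name


if __name__ == "__main__":
    print(leak_source("slashdotspam@mydomain.com"))
    print(leak_source("me@mydomain.com"))
```
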
Re:How I do by Anonymous Coward · 2004-06-24 03:51 · Score: 0

I just keep a couple of Hotmail accounts.
one for family/friends/pay bills
one to sign up for stuff that I may get crap from
and one from my ISP that i only use with them.

This way, you don't have to keep changing addresses. I get about 2 pieces of spam at my personal address, and about 40 a day at the other 'crap' one.
Re:How I do by julesh · 2004-06-24 04:35 · Score: 3, Informative

Mail.app's filter isn't Bayesian. Please see previous slashdot article on how it works (I'm too lazy to find the reference right now).

Excellent review by XMichael · 2004-06-24 03:33 · Score: 5, Informative

On the original forum, I was saying something similar (except not nearly as well written!! hehe)

DSPAM, IMHO, provides far better results than this report suggests. A Bayes filter properly trained by a somewhat intelligent person provides simply amazing results. I swear I can go weeks on end without a single spam getting through, no false positives -- and between 20 and 100 SPAM in my "spam" box per day!

DSPAM using the Bayes algorithm is by far the best filtering method I've used. And I've used a lot! (From SpamAssassin to SpamProbe and all the in-betweens.) The only setback: DSPAM takes a couple of weeks to train...

Priceless Photos

--
Gamblers Forum

Re:Excellent review by XMichael · 2004-06-24 03:39 · Score: 0

I tend to get sloppy when forced to type in a box that is 2 inches wide... hehe

--
Gamblers Forum
Re:Excellent review by Anonymous Coward · 2004-06-24 03:55 · Score: 0

the link in your sig is a disgrace to humanity and the internet. die.
Re:Excellent review by mev · 2004-06-24 04:20 · Score: 2, Insightful

Unfortunately it seems like the author is too intent on slamming Cormack for his review to fit my description of an "Excellent Review". I wish he had toned this down as he could still have delivered the same technical message in a more credible fashion.

"Excellent counterattack" might be more fitting.
Re:Excellent review by shockbeton · 2004-06-24 05:19 · Score: 0, Redundant

On the original forum, I was saying something similar (except not nearly as well written!! hehe)

No, no, no. Let's be clear here. The article was not well written. It was well "architected", whatever the fuck that is.

I guess it's similar to the way in which I constructed this reply from precast concrete building components.

-----
The world's greatest stuff: www.smithtwins.com

Confirmed: Architect not a verb by Exmet+Paff+Daxx · 2004-06-24 03:34 · Score: 0, Offtopic

http://dictionary.reference.com/search?q=architect :

1. One who designs and supervises the construction of buildings or other large structures.
2. One that plans or devises: a country considered to be the chief architect of war in the Middle East.

I mean, it's not even a second meaning. It's just plain English abuse. I hope this Zdziarski guy's paper is decent, since he's pretty much tripped my spam filter right out of the gate.

--
If guns kill people, then CmdrTaco's keyboard misspells words.

Re:Confirmed: Architect not a verb by Anonymous Coward · 2004-06-24 03:45 · Score: 0

He also uses "curve" to mean "curb" as in "to curve [sic] spam". Maybe a good spam filter architect, but not a good writer. ;^)
Re:Confirmed: Architect not a verb by j_kenpo · 2004-06-24 03:53 · Score: 1, Offtopic

http://dictionary.reference.com/search?q=google

The World-Wide Web search engine that
indexes the greatest number of web pages - over two billion by
December 2001 and provides a free service that searches this
index in less than a second.

The site's name is apparently derived from "googol", but
note the difference in spelling.

The "Google" spelling is also used in "The Hitchhikers Guide
to the Galaxy" by Douglas Adams, in which one of Deep
Thought's designers asks, "And are you not," said Fook,
leaning anxiously forward, "a greater analyst than the
Googleplex Star Thinker in the Seventh Galaxy of Light and
Ingenuity which can calculate the trajectory of every single
dust particle throughout a five-week Dangrabad Beta sand
blizzard?"

Home http://www.google.com/.

(2001-12-28)

Look, that's not a verb either, but people still use it...
Re:Confirmed: Architect not a verb by WhatAmIDoingHere · 2004-06-24 04:04 · Score: 0, Offtopic

But if it becomes a verb, Google loses its hold on the name. You can't trademark a verb. Or something.

That's why back in the early 90's, Jeep put out an advertisement in most car-trader magazines saying "If it's not from GM, it isn't a Jeep!". People were calling anything that kinda looked like a Jeep a Jeep.

--
Not a Twitter sockpuppet... but I wish I was.
Re:Confirmed: Architect not a verb by magefile · 2004-06-24 04:07 · Score: 0, Offtopic

It makes sense, though. If you want to take the flexible view, language is constantly evolving, and the verbing (no pun intended) of nouns is popular, as mentioned in the jargon file. We all understood what he meant, anyway. Second-language teachers call this communicative competence.

If you want to be less forgiving, English is a Germanic language. Thus, it makes sense that as German noun-verbing (Esse [food] => essen [to eat], etc) is acceptable and even common, English noun-verbing should be acceptable.

Yeah, I know, IHBT, IHL, HAND.
Re:Confirmed: Architect not a verb by pipingguy · 2004-06-24 04:17 · Score: 0, Offtopic

If you want to take the flexible view, language is constantly evolving...

Sure, but so many terms are being invented now (IHBT, IHL, HAND, war-whatevering, etc.) that an attempt should be made to use existing words properly, don't you think?

"Architect" and "engineer" are frequently used interchangeably (improperly), no doubt this is the source of the malapropism.
Re:Confirmed: Architect not a verb by Anonymous Coward · 2004-06-24 05:52 · Score: 0

No, not a verb, but verbifying is fine.

Verbifying is fine, but pretense is gross.
Re:Confirmed: Architect not a verb by Anonymous Coward · 2004-06-24 08:06 · Score: 0

(Google reference)...Look, that's not a verb either, but people still use it...
That doesn't mean they're correct. "People" do all kinds of stupid fucked up things.
Re:Confirmed: Architect not a verb by Drooling+Iguana · 2004-06-24 09:22 · Score: 1

Jeep is a Chrysler brand, formerly owned by AMC. It was never owned by GM.

--
... I'm addicted to placebos
Re:Confirmed: Architect not a verb by Anonymous Coward · 2004-06-24 14:37 · Score: 0

Do you understand the point, though?
Re:Confirmed: Architect not a verb by Drooling+Iguana · 2004-06-24 17:02 · Score: 1

Yes, but nitpicking is fun!

--
... I'm addicted to placebos

Studies create discussion by Timesprout · 2004-06-24 03:39 · Score: 5, Insightful

I usually frown when I see many of these so-called studies offering conclusions, several of which differ radically from my own experience. The recent Java/C++ performance one was a classic example. It gets annoying when a pro-MS result is immediately decried as marketing FUD because it just can't be better, and a pro-Linux result is taken as gospel truth here on /. Usually I tend to take all results with a grain of salt or just plain ignore them and focus on the debate around them.

The benefit of these studies, though, is that, fanatical crap aside, informed people will usually take the time to interpret results or suggest corrections/improvements that actually benefit developers and improve their knowledge base more than any information provided by the actual study.

--
Do not try to read the dupe, thats impossible. Instead, only try to realize the truth
What truth?
There is no dupe

Re:Studies create discussion by dasmegabyte · 2004-06-24 04:32 · Score: 1, Insightful

But the purpose of studies is to offer insight into the best tools for a specific set of dependencies. Decry the dependencies, and you're essentially eliminating the purpose of the study.

For example, I am working on a GIS application. I looked at offerings from ArcView and MapInfo and found that while they do what I need to do out of the box, they are quite expensive and required a license for every seat of my application. So I looked to Open Source. There I found hundreds of tools, none of which did what I needed to do. I could adapt a bunch of them to accomplish my goals, but the time to do that, as well as port them all to Windows, would cost at least ten times more than either of these applications. What's worse is that I am not a GIS expert...I can load and search maps using the interface, but I couldn't write my own algorithms to do so...and to commission the work would cost more than if I did it myself.

So, for my particular use, Open Source Software is insufficient. If I just had to display a raster graph, it would be perfect. But for my use, it doesn't work.

Now, if I had released my findings as a study -- "Developer finds Open Source No Good in Specific Area" -- slashdrones would first attack the specific area, attack me for not being bright enough to know some obscure, undocumented toolset that could have solved my problem, and then proceed to talk about how great it was that the computer in their car runs Linux, thus making it the ultimate operating system.

This is defeatist bullshit. Ignoring your problems doesn't make them go away. This is like blaming McDonalds for your big, fat ass, or blaming Microsoft because you got a virus when you didn't run the patch they released to prevent it.

--
Hey freaks: now you're ju
Re:Studies create discussion by killjoe · 2004-06-24 05:59 · Score: 4, Insightful

"This is defeatist bullshit. Ignoring your problems doesn't make them go away. "

You miss an important point. This is not "our" problem, it's YOUR problem. I don't need a GIS program and neither do millions of other people. YOU need one, and too bad for you they cost tens of thousands of dollars. You have no right to complain that somebody else hasn't taken the time and effort required to give you a free equivalent.

What you need to understand is that open source is nothing but scratching an itch. This is your itch and you need to scratch it.

OPEN SOURCE ONLY WORKS IF PEOPLE CONTRIBUTE. This very simple and obvious point seems to be lost on most people. You are not supposed to sit around till somebody else does the work and gives you something for nothing. You need to contribute.

You need to start an organization and start raising money to fund an open source development effort or to accelerate an existing one. You need to get involved and contribute. BTW bitching on slashdot does not count as contributing.

"This is like blaming McDonalds for your big, fat ass, or blaming Microsoft because you got a virus when you didn't run the patch they released to prevent it."

Or blaming the open source community because they didn't give you something for free.

--
evil is as evil does
Re:Studies create discussion by Brandybuck · 2004-06-24 06:38 · Score: 1

You are not supposed to sit around till somebody else does the work and gives you something for nothing. You need to contribute.

Which is why Open Source will probably always be for developers by developers. Unless of course the non-developing users decide to contribute cash...

It's sort of like public television. You can sit around and watch it for free, or you can donate and help other people watch it for free. "Your generous donations will make this software free-beer for everyone!"

--
Don't blame me, I didn't vote for either of them!

Re:Architect is not a verb. by pete-classic · 2004-06-24 03:39 · Score: 0, Offtopic

adb, you stole my post!

Jonathan,

Invest in a thesaurus. Or you can use this one. You might have used any of the following perfectly serviceable verbs: scratch, scrawl, scribble; draft, draw, make out, write down, write up, hatch, make, generate, or construct.

Please don't recast innocent nouns as verbs.

Thank You,
Peter

Two things by marnargulus · 2004-06-24 03:40 · Score: 0, Redundant

1. Those bots that post goat.cx are very annoying.
2. The author of CRM114 admits that not everyone gets the results others do. Some people get perfect handling, while others get very poor handling. He also claims that setup might have been a problem with the testing. On one hand the tester should have set the system up correctly, but on the other hand this just shows that it isn't a fool-proof system (yet).

You don't like my software so I'll flame you by ifreakshow · 2004-06-24 03:40 · Score: 2, Insightful

This guy seems a little harsh and just a bit jealous of the success of Gordon Cormack's article. I'd like to know what makes his opinion any more valid than Gordon's.

Information on his professional career was very hard to find on the site.

This just seems like a flame because his software (DSPAM) didn't perform well in the test.

Re:You don't like my software so I'll flame you by Anonymous Coward · 2004-06-24 03:43 · Score: 0

Agreed. Just because he did a fancy write-up which he refers to as an "article" doesn't make him at all credible either. I didn't even make it to his argument before I stopped reading because it sounded like a load of bullshit.
Let's hear the facts, we don't need a class in persuasive theory.
Re:You don't like my software so I'll flame you by arkanes · 2004-06-24 03:48 · Score: 1

I have to agree that the article has a very put-out and almost bitter feel to it, which makes me less inclined to take it seriously. That said, there are perfectly valid criticisms in it. For example, not releasing the configuration data is clearly improper. Testing the accuracy of the filters against SpamAssassin is totally incorrect methodology! It looks good to apply the filter to such a huge body of email, but a smaller set would have made it much easier to validate the results. Misconfiguration of the filters is another issue, but something that should be corrected and addressed, it's not really a failure of methodology.
Re:You don't like my software so I'll flame you by Threni · 2004-06-24 03:50 · Score: 3, Insightful

> This guy seems a little harsh and just a bit jealous of the success of Gordon
> Cormack's article.

Articles aren't 'successful' - they're either useful, or they're just fun to read. Perhaps his is the latter.

From the response:
---
It turned out that Cormack was using the wrong flags, didn't understand how to train correctly, and seemed very reluctant to fully read the documentation. I don't mean to ride on Cormack, but proper testing requires a significant amount of research, and research seems to be the one thing lacking from this research paper.
---

One thing I've noticed is that more and more people seem to want an answer NOW - even if it's not the correct answer, or even if the original question asked wasn't the correct one.

> I'd like to know what makes his opinion any more valid than Gordon's.

Everyone's opinion is as valid as you - the observer - decide it to be.

But in terms of which filter is the best - what does anyone's opinion have to do with it? If you're bothered about this issue, why not read both articles, think about it, and then perform the tests yourself? Or wait for an impartial third party to perform the relevant tests. There doesn't appear to be any alternative.
Re:You don't like my software so I'll flame you by Anonymous Coward · 2004-06-24 03:51 · Score: 0

The DSPAM team has always been very bullish about the capabilities of their filter compared to SpamAssassin, frequently belittling SA in their documentation.

Out of the box SA is probably the best anti-spam filter, but DSPAM when correctly trained can certainly do just as good a job, but training it takes time, and getting every user to take the time to train can be even harder.

The DSPAM people almost seem to have some sort of personal grudge against SA...
Re:You don't like my software so I'll flame you by Otter · 2004-06-24 03:52 · Score: 5, Insightful
There are some technical objections in there (old versions of software, the fact that SpamAssassin was tested with a spam collection generated by SpamAssassin). But honestly, after wading through all the whining and sneering, I didn't have the energy to pick the points out of the overall flow.
Jonathan, next time:
- Start by summarizing your technical objections.
- Continue by detailing your technical objections.
- Leave the nasty rants to the end, or better yet, leave them out entirely.
- Stop talking about "geeks" in every paragraph.
- Please stop referring to spam filter comparisons as "science".
--
What I'm listening to now on Pandora...
Re:You don't like my software so I'll flame you by pclminion · 2004-06-24 03:56 · Score: 4, Insightful

This guy seems a little harsh and just a bit jealous of the success of Gordon Cormack's article.
Let me explain why he's irritated, as somebody who has conducted spam filter statistical tests and made publications on the topic.
Yes, it is irritating when somebody demonstrates that his method is better than yours. However, most researchers are able to accept this, and continue improving their own work.
However, what is far more irritating (by an order of magnitude at least) is when somebody "demonstrates" the inferiority of your work, and they do so in a completely scientifically bogus way.
Let me give a concrete example. Suppose you were Galileo. You have just put forth the postulate that all objects fall at the same speed regardless of mass. A "debunker" attempts to demonstrate that this isn't true by dropping an iron ball and a feather. Obviously, the feather falls much more slowly.
"Ha ha, neener, neener!" cries the debunker. Of course, Galileo knows his method is flawed. If people actually listen to this supposed debunker, Galileo might become very, very irritated indeed.
Re:You don't like my software so I'll flame you by Anonymous Coward · 2004-06-24 03:59 · Score: 0

It turned out that Cormack was using the wrong flags, didn't understand how to train correctly, and seemed very reluctant to fully read the documentation. I don't mean to ride on Cormack, but proper testing requires a significant amount of research, and research seems to be the one thing lacking from this research paper.
Matter of taste, I guess. I would point to that paragraph as an example of how *not* to make a credible technical argument.
Re:You don't like my software so I'll flame you by Banner · 2004-06-24 04:15 · Score: 1

One thing I've noticed is that more and more people seem to want an answer NOW - even if it's not the correct answer, or even if the original question asked wasn't the correct one.

I can sympathize at times with this. Setting up or training some of the SPAM software out there is a real pain, because the documentation is often written like man pages (i.e. for those that don't need it). Which frustrates a lot of people who desperately want to filter their spam, but aren't sysadmins or experienced programmers.

I think at this point what would be a far more valuable resource would be good, clear step-by-step guides on how to install and set up DSPAM or SA, written so that your non-VCR-programming parents could follow them.

After all, you might have the greatest SPAM tool in the world, but if the 'average joe' can't figure out how to use it, well then how great your tool is really doesn't matter.
Re:You don't like my software so I'll flame you by cloudmaster · 2004-06-24 04:25 · Score: 0, Redundant

"Boo Hoo, my spam filter doesn't perform well in short simulations because it takes time into consideration. This guy tested incorrectly." Whatever. I use SpamAssassin company-wide, with 0 false positives and about 5-1 messages out of several thousand getting through each day. Maybe this whiner's software's even better, but all I got from his article was that he's someone who I don't want to depend on for software support.
Re:You don't like my software so I'll flame you by Threni · 2004-06-24 04:29 · Score: 1

> Matter of taste, I guess. I would point to that paragraph as an example of how
> *not* to make a credible technical argument.

I'm not sure if it constitutes a `technical argument` - whatever that is - but the criticisms are mostly testable (did he use the wrong flags; did he demonstrate an inability to `train correctly`). It's perhaps harder to assess whether or not he read the documentation, but not impossibly so. I've not read the original paper.

I've bookmarked it as I want to learn more about spam, as I'm not sure what the problem is. There seem to be two sorts of users. One sort of user is in the majority - non-technical people who send/receive emails mostly with people they know. These people can use an opt-in only system, or have strict filters which only accept emails if they contain a made-up word. The other sort is the technical user, who is more than capable of configuring and using, say, Thunderbird's simple Bayesian-based filtering.

I guess there is perhaps a third sort of user - someone who routinely accepts emails from people they've never corresponded with before, and I'd admit that this would appear to be the hardest nut to crack.
Re:You don't like my software so I'll flame you by julesh · 2004-06-24 04:37 · Score: 3, Interesting

He made a few very good points, but the overall tone was a little too ranty.

This was the most important point, I think, and was buried 2/3rds of the way down:

The emails being 8 months old, heuristic rules were clearly updated during this time to detect spams from the past eight months. The tests perform no analysis of how well SpamAssassin would do up against emails received the next day, or the next eight months. Essentially, by the time the tests were performed, SpamAssassin had already been told (by a programmer) to watch for these spams. [...] What good is a test to detect spam filter accuracy when the filter has clearly been programmed to detect its test set?
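
The objection quoted above is about evaluating a filter on mail it has effectively already seen. One way to picture the fix is to score the filter only on messages received after its rules were frozen. The sketch below is a rough illustration under that assumption. It is not the methodology of either paper, and Message, holdout_after and accuracy are names invented for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List


@dataclass
class Message:
    received: date
    text: str
    is_spam: bool  # hand-verified label, assumed to be available


def holdout_after(messages: List[Message], rules_frozen_on: date) -> List[Message]:
    """Keep only mail that arrived after the filter's rules were last updated."""
    return [m for m in messages if m.received > rules_frozen_on]


def accuracy(classify: Callable[[str], bool], messages: List[Message]) -> float:
    """Fraction of messages whose predicted label matches the verified label."""
    if not messages:
        raise ValueError("no messages to evaluate")
    correct = sum(1 for m in messages if classify(m.text) == m.is_spam)
    return correct / len(messages)


if __name__ == "__main__":
    corpus = [
        Message(date(2004, 1, 10), "cheap viagra", True),
        Message(date(2004, 3, 1), "cheap v1agra", True),
        Message(date(2004, 3, 2), "lunch tomorrow?", False),
    ]
    # Toy stand-in for a rule written after seeing the January spam.
    rule = lambda text: "viagra" in text
    print(accuracy(rule, corpus))
    print(accuracy(rule, holdout_after(corpus, date(2004, 2, 1))))
```

In the toy data the rule looks better on the full corpus than on the later mail alone, which is the effect the quoted paragraph complains about.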
Re:You don't like my software so I'll flame you by Alanoman · 2004-06-24 05:56 · Score: 0, Offtopic

He appears to be a bible-thumper.

One of my favourite quotes from this article is "there's more evidence to support the existence of Jesus than there is Julis Caesar". !!
Re:You don't like my software so I'll flame you by Anonymous Coward · 2004-06-24 06:42 · Score: 0

Galileo's postulate is fine for scientific advancement and development of advanced gravitational theories... but it's a classic case of eliminating pesky "real world" factors and concentrating on a very limited subset.
It didn't answer the question "which will hit the ground first, the iron ball or the feather?". Instead, it provided an answer to "Is acceleration due to gravitational force dependent on the mass of the object concerned?" (or something along those lines). You'd have to be bored or paid to be interested in the scientific analysis of spam prevention methods using double-blinds and controlled environments and strictly-monitored setups that Mr Nuclear Elephant seems to be proposing.
Theoretically, spam-reduction method A might be better than B... but the reason most people want an automated spam elimination system is to save time, increase efficiency a bit and maybe reduce the amount of explicit, inappropriate crud that floods through your work email account.
The intricacies of machine learning require a more scientific approach than simply throwing mail at the filter

No they don't. I get email. I don't want to see the crap. I certainly can't be arsed with training a "corpus" of 2500 messages before this even starts to kick in. People who read the original article (I must've forwarded it to over 30 people by now) are quite likely to be interested more in "what's a good spam filter I can drop into place and sort out this email nightmare". Sure, if you're writing your own spam software, then perhaps Mr Elephant's comments are valid.
Anyway, the article itself was highly unprofessional. It wasn't proof-read. It was badly organised. Whatever point it was trying to make was lost in the thunderous noise of disrespect and petty carping, IMHO. Definitely not a good advert for his book.
The original article was (as far as I'm aware) supposed to be 8 months of reasonably typical email use... and I'd wager that a high percentage of email users don't chase after the latest version of software unless they're hit around the head with a bug or missing feature or "update now!" box. Mr Cormack appears to have committed the cardinal sin of using software more than three months out of date - I'd say that's about as typical as it gets.
And that "60% to 80% spam" statistic is fine for many people... but I run a mailserver where people have catchall accounts for their domains, and believe me, the ratio runs to 99.9% spam on some of them. I was speaking to one of our customers today who had received 3 valid emails out of 200 downloaded. That's not too unusual. My personal account gets maybe one or two junk messages each month - it's probably filtered out at my ISP, I neither know nor care.
My point is that, despite Mr Elephant's air of "I know about this stuff", he seems to be severely misinformed on real-world conditions. Perhaps that's why his filter performed so badly.
Reading this article was a waste of time... shame, 'cos the original was well worth the effort.
Tom
Re:You don't like my software so I'll flame you by ComputerSlicer23 · 2004-06-24 07:38 · Score: 2, Insightful

Please stop referring to spam filter comparisons as "science".

I believe the author of the article would have two issues with that assertion.
First off, you can have science about how fast grass grows. You have science about how many sexual partners a person has. You have science about how to manipulate people with irrational arguments. Science can be applied to anything that you apply scientific principles to. Science, in a lot of ways, is merely a matter of measuring in a controlled manner and then commenting on such measuring. The usefulness of science is when those measurements are useful and applicable to common everyday situations. Like, say, you're twice as likely to die in a car accident at 50MPH than at 40MPH.
Second, the author sounds like a mathematician, and somewhat of a scientist, and he has a mathematical interest in the filtering of SPAM. It's just as mathematical as using Markov chains to model queuing problems to measure how long you'll have to stand in line at the checkout counter. To him, it's an interesting mathematical problem, which in a lot of ways means that for him personally SPAM classification and the comparison of SPAM classification techniques IS science.
Finally, the results the author is referring to, are due to be published in a peer reviewed journal if I understood it correctly. So in a very technical sense, it is in fact being published for scientific review.
I think a lot of his issue is that you can't use the results of that paper to draw any useful conclusions for yourself if you aren't in a similar situation. As an example, I can get about 18 miles to the gallon in my F150, even though it's only rated for 13/15 city/highway. I manage that by setting the cruise control at the speed right after I switch into 5th gear, turning off the A/C, driving on predominantly flat roads, buying the highest rated fuel, and not stopping for any reason other than purchasing gas. So I could publish a paper saying that an F150 can easily get 18 miles to the gallon. However, that's incredibly useless to anyone who doesn't realize the conditions they have to drive in. His argument is that the paper doesn't represent the results anyone else would get.
Kirby
Re:You don't like my software so I'll flame you by Inthewire · 2004-06-28 05:54 · Score: 1

I'm that third type.
I run a website that is used by about 2,000 people per day.
My unobfuscated email address is at the bottom of every page.
So I get a tremendous amount of spam along with a decent amount of real mail.
I use the Spam Bayes Outlook plugin and it works as well as I need it to.
It sorts my mail into Spam, Possible Spam, and Inbox.
Anything with a score below 5% remains in the Inbox.
Anything with a score of 5% to 97% is sent to possible (usually about 20 emails per day).
Anything with a score above 97% is sent to Spam (usually about 160 emails per day).
At some point during the day I'll check out the Possibles and either delete them as Spam or return them to the Inbox.
Every couple of days I'll scan through the Spambox and delete all Spam.
I've only ever caught one misclassified "good" message, and that was what prompted me to raise my threshold to 97% (it had been at 92%).
I get a lot of very short emails with mangled syntax, and that looks spammish ('How do I join ur site?' is typical).

--

Writers imply. Readers infer.
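
The three-way sort described in the post above comes down to two cutoffs. A minimal sketch follows, assuming the score is a spam probability between 0 and 1. It mirrors the 5% and 97% figures from the post but is not the SpamBayes plugin's own code or API, and folder_for and the constant names are made up for illustration.

```python
HAM_CUTOFF = 0.05   # below this the mail stays in the Inbox (5% in the post)
SPAM_CUTOFF = 0.97  # above this the mail goes to Spam (97% in the post)


def folder_for(score: float) -> str:
    """Map a spam probability in [0, 1] to one of the three folders."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > SPAM_CUTOFF:
        return "Spam"
    if score >= HAM_CUTOFF:
        return "Possible Spam"
    return "Inbox"


if __name__ == "__main__":
    for s in (0.01, 0.05, 0.50, 0.97, 0.99):
        print(s, folder_for(s))
```

Raising SPAM_CUTOFF, as the poster did from 92% to 97%, trades a larger "Possible Spam" pile for fewer good messages landing in Spam.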

Spamassasin is good but not that good... by Shoeler · 2004-06-24 03:41 · Score: 5, Informative

For any users of spamassassin's 2.x branch (2.63 is current as of this writing), we all know how dated its signatures are right now. When the 2.6 branch was first released, I got zero spam and 100% ham for the first few weeks. Now that 3.x is being integrated as an ASF and being apache-ized, updates have been slow and 3.x is still awaiting deployment.

Point being - I was darn surprised to see SA at the top of his charts.

Now - if only mimedefang would easily use another spam-checker....

Re:Spamassasin is good but not that good... by julesh · 2004-06-24 04:40 · Score: 1

Well, of course it was. As stated in the article, he was using the latest version of SA to classify mail that was up to 8 months old. I'd expect it to be pretty close to perfect on that. It's just current stuff it ain't so hot on.
Re:Spamassasin is good but not that good... by iserlohn · 2004-06-24 05:04 · Score: 1

SA gets a bad rap because it works even when the Bayesian filter isn't activated. This leads to horrible results.

We deployed SA on our own internal MX and we have over 99% accuracy over the past 3 months. Although the Bayes filter is primitive compared to what other advanced filters are doing, with enough training and a bigger token DB, SA works very very well. Couple that with network checks (i.e., Razor2, Pyzor, DCC) and the system is comparable to the best statistical filters.

--
:. Ultimate Control Dedicated/VM Servers
Re:Spamassasin is good but not that good... by macdaddy · 2004-06-24 16:09 · Score: 1

MS is a wonderful tool, don't you think? I love it. We even bought Canit-Pro. That's a very nice tool. I highly recommend it.

Re:Is that what your mom worded by Karamchand · 2004-06-24 03:42 · Score: 1

First off: Your original posting was simply completely off topic. Where would we be if every message pointing out a grammatical mistake in a story got moderated +5? - slashdot would look like an English schoolbook.
Secondly: Your second posting is not only off topic, but also insulting and purely flaming.

Jesus by BoomerSooner · 2004-06-24 03:42 · Score: 0

I'm glad people like this work on keeping "Penis Enlargement" emails out of my inbox because I would just delete them instead of working that hard. I guess this is why I have email addresses for specific purposes (keeps the shit in the subscription addresses ~ yahoo/hotmail/etc...)

Overall an interesting read but it seemed he was a bit irritated with the other guy getting slashdot notoriety.

Re:Architect is not a verb. by pclminion · 2004-06-24 03:42 · Score: 2, Insightful

I hope you're proud of your anal retentiveness.

Haven't you ever Googled something? Haven't you ever input data into a computer? (The use of the word input as a verb is, of course, the result of verbing, and it's now considered acceptable usage.) In recent years it has become common in English to "verb" nouns. In fact, I just did it. English, like any other language, evolves over time.

To deny this fact makes you just another prescriptivist language maven, completely disconnected from reality and any sense of the advancement of human language.

Folks, don't listen to this dinosaur. He's not insightful, he's simply living in the past.

gmail ?? by rasz · 2004-06-24 03:42 · Score: 1

so the test procedure was bad ? ermmm, what about that GMAIL spammmeplease project ?

--
Go grab those torrents.

Just read it - by calebb · 2004-06-24 03:44 · Score: 2, Informative

I just read the whole article - it does repeat itself a few times, but the author provides additional evidence each time his theses were reiterated:

1. Cormack is very inexperienced in the area of statistical filtering. Agreed!!!
2. Cormack went into the testing with many presuppositions. Also Agreed!!

And in case you're not familiar with the word presupposition:
1. To believe or suppose in advance.
2. To require or involve necessarily as an antecedent condition.

Overall, this is a very good article; Check it out if you haven't already done so!

Re:Just read it - by TheAwfulTruth · 2004-06-24 04:12 · Score: 0

You said: "1. Cormack is very inexperienced in the area of statistical filtering. Agreed!!!
2. Cormack went into the testing with many presuppositions. Also Agreed!!"

To that I say:

"You mean like any other normal person who might be wanting to use such a product?"

There is some merit in having someone that knows nothing about a subject, test something. I mean it's exactly how Consumer Reports works. For the 1% 31337 of us it is worthless "information", but for the other 99%...

--
Contrary to popular belief, coding is not all free blow-jobs and beer. Those things cost MONEY!
Re:Just read it - by Henry+Stern · 2004-06-24 06:03 · Score: 3, Informative

1. Cormack is very inexperienced in the area of statistical filtering.

Disagreed. Gordon Cormack has been doing information retrieval for 20 years. He is fairly well known in the area. See his publication history at DBLP.

A far more likely conclusion about what's going on here is that Zdziarski's ego has been hurt. Both he and Dr. Yerazunis engage in some very sketchy statistics in their papers and I think that it has caught up to them.

1. Yerazunis' study of "human classification performance" is fundamentally flawed. He did a "user study" where he sat down and re-classified a few thousand of his personal e-mails and wrote down how many mistakes he made. He repeats this experiment once and calls his results "conclusive." There are several reasons why this is not a sound methodology:

a) He has only one test subject (himself). You cannot infer much about the population from a sample size of 1.

b) He has already seen the messages before. We have very good associative memory. You will also notice that he makes fewer mistakes on the second run, which indicates that a human's classification accuracy (on the same messages) increases with experience. For this very reason, it is of the utmost importance to test classification performance on unseen data. After all, the problem tends towards "duplicate detection" when you've seen the data beforehand.

c) He evaluates his own performance. When someone's own ego is on the line, you would expect that it would be very difficult to remain objective.

2. Both Yerazunis and Zdziarski make use of "chained tokens" in their software. This is referred to in other circles as an "n-gram" model. As with many nonlinear models (the complexity of an n-gram model is exponential with n), it is very easy to over-fit the n-gram model to the training data. Natural language tends to follow Zipf's law (closely related to the Pareto law, sometimes called the 80/20 rule), where the frequency of occurrence of a term is inversely proportional to its rank. The exponential complexity of the n-gram model contributes to the sparse distribution of text, leading to a database with noisy probability estimates.

3. Zdziarski uses a "noise reduction algorithm" called Dobly to smooth out probability estimates in the messages. Aside from his unsubstantiated claim of increased accuracy, I have never seen anything to suggest that it actually works as advertised.
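
To make point 2 concrete, here is a minimal Python sketch of "chained tokens" (word n-grams) and of how quickly the space of possible chains grows with n. The function names and the 10,000-word vocabulary are assumptions for illustration only; this is not how DSPAM or CRM114 actually tokenize.

```python
def chained_tokens(words, n):
    """Return every run of n adjacent words, joined into a single token."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def possible_chains(vocabulary_size, n):
    """Upper bound on the number of distinct n-word chains (grows as V**n)."""
    return vocabulary_size ** n


if __name__ == "__main__":
    words = "buy cheap pills now".split()
    for n in (1, 2, 3):
        # Hypothetical vocabulary of 10,000 words, chosen only for scale.
        print(n, chained_tokens(words, n), possible_chains(10_000, n))
```

With far more possible chains than training messages, most chains are seen once or never, which is where the sparse, noisy probability estimates come from.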

Considering these points, I was not surprised at all by the results of Dr. Cormack's study. While one may argue that his experimental configuration can use some improvement, his evaluation methods are logically and statistically sound. What I personally saw in the results of this paper was that two classifiers that use unproven technology did not perform as advertised. After all, every other Bayes-based spam filter performed acceptably well.

Lastly, I won't really touch his flawed arguments about how using domain knowledge about spam (i.e. SpamAssassin's heuristic) somehow hinders the classifier over time when you are also using a personalised classifier. You'll notice that SpamAssassin still did acceptably well when all of the rules were disabled.

Go read some more of Zdziarski's work and draw your own conclusions about his work. Pay careful attention to his use of personal attacks when comparing his filter to that of others.

Re:Architect is not a verb. by vk2 · 2004-06-24 03:45 · Score: 1

Here is a zero dollar investment - install this with this extension.

--
No Sig for you.!

Special Pleading by Lulu+of+the+Lotus-Ea · 2004-06-24 03:46 · Score: 1, Insightful

There's really very little to be said in favor of Jonathan A. Zdziarski's "defense". I guess it just amounts to him wanting to sell his product. Of course, I remember when CRM114 first came out, it was subject to some very dubious--or often simply incoherent--claims. It's pretty clear Zdziarski is in quite a bit over his head... not quite as bad as the amateurs who discover their own "breakthrough" encryption techniques, but tending in the same direction.

As near as I can tell (I skimmed, admittedly, I didn't read every word carefully), his defense amounts to "please don't test the different filters because..." Fill in what feature of the test MUST not be the same as the CRM114 users who get 99.95% accuracy. This is precisely the meaning of "special pleading" in rhetoric. Also the same argument about "if only he had tried the latest-and-greatest (even though we made our wild claims before that version came out, too)."

Cormack et al. make a reasonable best effort to test several tools; and as with any test, they make certain assumptions, and choose certain methodologies. Frankly, I find that a lot more useful than "just trust us, ours works best...but we can't quantify what 'works' means."

FWIW, I wrote an empirical study of different spam filters, way back shortly after the Paul Graham buzz:

Spam Filtering Techniques: Six approaches to eliminating unwanted e-mail.

I know my study is based on quite old tool versions by now. But AFAIK, it's one of the few that actually came at the comparisons from an unbiased viewpoint. Most figures are based on the "experiences" of the strongest proponents of a given tool (or occasionally from a strong detractor). I had/have no agenda for or against any particular tool, I was just curious.

--
Buy Text Processing in Python

Re:Special Pleading by Anonymous Coward · 2004-06-24 04:07 · Score: 1, Insightful

There's really very little to be said in favor of Jonathan A. Zdziarski's "defense". I guess it just amounts to him wanting to sell his product. Of course, I remember when CRM114 first came out, it was subject to some very dubious--or often simply incoherent--claims. It's pretty clear Zdziarski is in quite a bit over his head... not quite as bad as the amateurs who discover their own "breakthrough" encryption techniques, but tending in the same direction.

Well, his personal attacks were out of place, but his paper still has merit.

As near as I can tell (I skimmed, admittedly, I didn't read every word carefully), his defense amounts to "please don't test the different filters because..." Fill in what feature of the test MUST not be the same as the CRM114 users who get 99.95% accuracy. This is precisely the meaning of "special pleading" in rhetoric. Also the same argument about "if only he had tried the latest-and-greatest (even though we made our wild claims before that version came out, too)."

That he got results which are lower than hackers who tweak their filters is not surprising. But what is surprising is that he got results which are not characteristic of the filters, eg biased false positives in CRM114. This is something that basically nobody gets, and indicates that he may have used it wrong, eg by flooding the .css files with too many messages (as the documentation specifically tells you not to do).

Zdziarski also points out false claims ("DSPAM doesn't support train everything" when it is in fact the default, etc) which indicate that Cormack didn't RTFM. As for the "latest and greatest," he's comparing wild claims about DSPAM 3.0 to results on 2.8... certainly that's not fair.

The most damning point was the use of SpamAssassin: Cormack didn't classify the messages by hand (there were 49,000 after all), but instead used SpamAssassin to set up his test. When SpamAssassin is acting as a judge, is it surprising that it should win? Surely errors that the two versions made would tend to overlap, thus counting in favor of SA and against filters which had classified the mail correctly. This could explain CRM114's apparent bias towards false positives, if many of those were spams that SA did not detect.
Re:Special Pleading by julesh · 2004-06-24 04:44 · Score: 1

As near as I can tell (I skimmed, admittedly, I didn't read every word carefully), his defense amounts to "please don't test the different filters because..."

You must have been skimming very badly. I read it, and this kind of argument was never used at all. Basically, he pointed out flaws in the way the test was set up that biased it towards SpamAssassin. Particularly that the test was started with untrained filters, and that the version of SpamAssassin's ruleset used was more recent than the messages being classified (I'm sure you can figure out what that means).
Re:Special Pleading by calebb · 2004-06-24 16:50 · Score: 1

Sorry you added me to your foe's list :-(

I can get kinda direct in my posts sometimes - I didn't mean to cause any personal offense! (I even took the f out of RTA!)

My apologies if I've offended you...

Caleb

Eureka by paranode · 2004-06-24 03:46 · Score: 0, Troll

2. One that plans or devises: a country considered to be the chief architect of war in the Middle East

I knew it! GW Bush is The Architect.

Oh wait, I forgot how articulate the Architect is supposed to be. Hrmm.

Re:Why use "architect" - why not "write" by pclminion · 2004-06-24 03:48 · Score: 1

what's the difference between "writing" a response and err, "architecting" a response?

You're being purposefully dense.

To architect a response would imply careful consideration, artistic presentation, and stunning aesthetics. I don't necessarily agree that that's what he's done here, but obviously that is what he meant to convey with his choice of words.

And if you disagree with verbing words, you had better stop "inputting" data into a computer, or "Googling" for answers, or "bookmarking" links, or "forking" processes.

I'm happy with Outlook by Anonymous Coward · 2004-06-24 03:49 · Score: 1, Funny

I tried some of these so-called filters, but none of them performed as well as my copy of Outlook 2003. It's so easy too: you just click on the hundreds and hundreds of messages then click "organise", then send them to junk mail. Tomorrow, you'll do exactly the same thing.

Thank goodness my IT dept. decided to upgrade us all from Eudora + Spamnix. It was awful not being able to see all those \/!agra and XXXh0t gurls advertisements.

Re:I'm happy with Outlook by Anonymous Coward · 2004-06-24 07:49 · Score: 0

Outlook as a mail client is incredibly powerful. There are many plugins to choose from (some even free) that will do spam filtering. I set up an office with a Debian box running MailScanner and SpamAssassin and had the Outlook users all automatically generate a junk mail folder and filter rules for when SpamAssassin marked the messages. It worked great and it was very easy to do over a large organization. Unfortunately most open source programs lack the powers that Outlook has had for years. Oh.. And BTW, there were almost no problems with viruses, worms, or spyware on those machines. Properly administered Windows machines, even with Outlook, can run very smoothly.

Re:Architect is not a verb. by TRS80NT · 2004-06-24 03:49 · Score: 1

"Verbing weirds language." -- Calvin

--
Lorem ipsum dolor sit amet.

Re:Architect is not a verb. by corporatemutantninja · 2004-06-24 03:52 · Score: 2, Insightful

Well said. HOWEVER, I have to agree with the poster who pointed out that using "architect" as a verb in the context of writing is a little out of place. If we're going to help the language grow, let's at least do so in useful ways. "Architect a solution to an engineering problem", sure, "architect a whiny, defensive rebuttal", no. If we're going to make it a verb let's at least have it relate somewhat to the noun.

--
Actually, I was trying to be Insightful, not Funny.

Re:Architect is not a verb. by Inda · 2004-06-24 03:54 · Score: 1

Haven't you ever input data into a computer?

Why is the readability of that sentence poor?

--
This post contains benzene, nitrosamines, formaldehyde and hydrogen cyanide.

Re:Architect is not a verb. by Anonymous Coward · 2004-06-24 03:54 · Score: 0

completely disconnected from reality

I might have believed you if you had said he was partially disconnected, or somewhat disconnected, but completely disconnected makes you look rather, umm, anal retentive.

Re:Architect is not a verb. by donutz · 2004-06-24 03:57 · Score: 1

In recent years it has become common in English to "verb" nouns.

But "verb" isn't a verb, it's a noun! You can't "verb" something, or go around "verbing" things...check it out here.

It could be worse... by schon · 2004-06-24 03:57 · Score: 1

He could have said someone "tasked" him to "architect" a response. :o)

Re:It could be worse... by gr · 2004-06-24 10:48 · Score: 1

... with the intention of "growing" the economy, no doubt.

--
Do you have a /. uid shorter than five digits? No? Then piss off.

Re:Architect is not a verb. by perly-king-69 · 2004-06-24 03:57 · Score: 1

No, he's promoting the correct use of English which promotes inclusivity. We all know where we stand. By designing (or should I say architecturizing) your own rules you begin to exclude groups of people, such as those whose first language is not English. It's elitism, nothing less.

--
This sig is inoffensive.

Re:????? Did you even... by calebb · 2004-06-24 03:59 · Score: 2, Insightful

RTA?

Read the article, then post!

There's really very little to be said in favor of Jonathan A. Zdziarski's "defense"?
Now, I could start posting how ignorant that statement is, but then I'd just be rewriting Zdziarski's article. Cormack's entire test was flawed - He used SpamAssassin (95% accuracy) to create his 'ham' corpus. He used software versions that were 6+ months old. Even the email address he used for testing is incredibly unique and atypical! (He uses an address that he's had for 20+ years; One that has been posted all over the WWW numerous times. An address that has many forwarders pointing to it. How is that typical in any way??)

Ok, go read the article (don't just 'skim' it, as you mentioned), then come back and tell me why you believe he is only trying to 'sell' his product.

Please back up your claims with some evidence this time ;-)

Re:Architect is not a verb. by pclminion · 2004-06-24 03:59 · Score: 1

By designing (or should I say architecturizing) your own rules you begin to exclude groups of people, such as those whose first language is not English.

Then the Germans had better stop joining words together however they please. It creates these big, long words which are incomprehensible to non-native speakers. They seem to do it willy-nilly!

That's what you're saying. Right?

It's elitism, nothing less.

Prescriptivism is the only thing elitist here.

Re:Why use "architect" - why not "write" by Stone+Pony · 2004-06-24 04:00 · Score: 1

Perhaps he could have "crafted" a response; a well-established usage which would have conveyed exactly the sense that you've described.

What this has to do with the guy's spam filter is a mystery to me, though.

Re:Why use "architect" - why not "write" by Stone+Pony · 2004-06-24 04:03 · Score: 1

Perhaps he could have "crafted" a response; a well-established usage which would have conveyed exactly the sense that you've described.

Quite why everybody's suddenly noticed the abysmally low standard of English grammar around here, or what this has to do with spam filters, is beyond me, though.

I'm not saying we wouldn't get our hair mussed... by VAXcat · 2004-06-24 04:04 · Score: 3, Funny

I prefer using the original CRM114 discriminator and its host platform on spammers. If you're not familiar with the original CRM114 and its delivery platform, it was featured in the following movie... http://www.imdb.com/title/tt0057012/combined

--
There is no God, and Dirac is his prophet.

Re:????? Did you even... by Otter · 2004-06-24 04:06 · Score: 1

Now, I could start posting how ignorant that statement is, but then I'd just be rewriting Zdziarski's article. Cormack's entire test was flawed - He used SpamAssassin (95% accuracy) to create his 'ham' corpus. He used software versions that were 6+ months old. Even the email address he used for testing is incredibly unique and atypical! (He uses an address that he's had for 20+ years; One that has been posted all over the WWW numerous times. An address that has many forwarders pointing to it. How is that typical in any way??)

No, you just restated it the way he should have written it in the first place!!

I don't find those points quite as damning as you do, but your presentation of them is a zillion times more persuasive and less juvenile-sounding than "Many misled CS students, Ph.Ds, and professionals have jumped on the spam filtering bandwagon with the uncontrollable urge to perform misguided tests in order to grab a piece of the interest surrounding this area of technology to score credits or popularity"...and on and on and on...

--
What I'm listening to now on Pandora...

Re:Why use "architect" - why not "write" by magefile · 2004-06-24 04:11 · Score: 1

It's more established as a word, but take a look at the root of the word "crafted". Is it a noun? You tell me.

Re:Architect is not a verb. by Kphrak · 2004-06-24 04:11 · Score: 1

I wouldn't mind verbing so much if the right usage hadn't been drilled into me as a kid...but "verbing" the word "architect" is not a language advancement. It's a sloppy shortcut normally used in buzz-speak (that's why you almost never hear it in everyday English, but so often in computer- and business-related fields). It's ambiguous and makes English even more difficult to understand than it is already. The fact that enough people complained about it for this thread to occur shows that in fact, it is not "just a prescriptivist language maven" who hates this, but everyone who cringes when someone writes "effect" for "affect", or says "irregardlessly". Most verbing is the result of a current fad, and anyone who's over ten years old knows how fast most fad-generated words disappear.

To quote Bill Watterson (who, AFAIK, created the word "verbing" as "to make a noun into a verb"), "Verbing weirds language."

--

There's no sig like this sig anywhere near this sig, so this must be the sig.

I wouldn't take this critique too seriously by EsbenMoseHansen · 2004-06-24 04:12 · Score: 5, Interesting

There are several warning signs in this article.

1. The author spends a lot of time trying to discredit the study's author on such terms as impartiality and experience. While such can lend credence to a strong case, it bodes when mentioned as the very first points. Also note the beginning of the article: "Many misled CS student...".
2. The author has no statistical or published backing for his claims.
3. Most of the arguments are flawed, in my opinion. Yes, the corpus was trained on SpamAssassin, but the other filters' mistakes were, as far as I recall, examined for errors individually. Thus, any mistakes would be spotted or credit each filter equally.
4. I also always find it suspect when someone claims: "Yes, the program did not perform, but with a different configuration it might/in the latest version it might". While it could be true, such claims need backing.
5. He claims that X's email was atypical, even for geeks. I would like to state here that I have 3 email accounts, of which none lie near his "typical" spam quotient (60%): 2 with >90% spam and 1 with <1% spam.

That said, he does raise a few valid points, such as the timeline:

1. If filters expunge old data based on time, this would not work in the test. That gives SpamAssassin's static rules an edge.
2. Configurations should really have been published. I see no reason why not.

--
Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.

Re:I wouldn't take this critique too seriously by int2str · 2004-06-24 04:49 · Score: 4, Interesting

Yes, I agree with your points. The author spends way too much time discrediting the study.

I also have to say that my experience was much more along the lines of Cormack's. I've tried DSPAM for a while on my server, starting from scratch. Training on error with only new emails. On a small mail server with about 10 users of different types (geeks, businesses, moms etc).
- DSPAM took way too long to produce any kind of results
- 2500 emails before advanced features kick in is *a lot* for the average soccer mom
- DSPAM produced way too many false positives early on
- The spam filtering accuracy leveled off at about 80% (number from DSPAM's web interface)

So this is not another overzealous CS student here, but real world testing.

The DSPAM author does not address any of the real points and just rags on Cormack.

Not much of a "rebuttal" in my book.
Re:I wouldn't take this critique too seriously by jpetts · 2004-06-24 05:00 · Score: 2, Funny

While such can lend credence to a strong case, it bodes when mentioned as the very first points.

But does it bode well or ill?

--
Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender
Re:I wouldn't take this critique too seriously by Anonymous Coward · 2004-06-24 06:16 · Score: 0

- 2500 emails before advanced features kick in is *a lot* for the average soccer mom

So... how many soccer moms are out there setting up a complex server-side anti spam system solely for themselves?
Re:I wouldn't take this critique too seriously by EsbenMoseHansen · 2004-06-24 06:42 · Score: 1

It just bodes. In general. Just as Gaspode says.
To all nonplussed people here: It's referring to a novel by Terry Pratchett, the one with Holy Wood. The title eludes me right now. It's quite good, actually, featuring several talking animals and humans.

--
Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
Re:I wouldn't take this critique too seriously by Glass+of+Water · 2004-06-24 10:29 · Score: 1

I think you mean bodes ill. Bodes means something similar to predicts or foretells.
Thank you, that is all.

--
There are no trolls. There are no trees out here.
Re:I wouldn't take this critique too seriously by gurps_npc · 2004-06-24 11:31 · Score: 1

I disagree entirely with 3. You can NOT test a device's accuracy by comparing its previous output to future output, even if you also backcheck possible errors using third machines. It is just BAD science and you should be graded F- for even attempting to do it.
You ignore the change in relative accuracy.
Assume for example that SpamAssassin is in fact the best around, but it has a 10% false spam rate. Every other program is slightly worse with an 11% false spam rate, always making the same mistake that SpamAssassin does, but also making a few more. At NO point does anything the original study did catch the error. You come away thinking SpamAssassin is the best in the world, it is 99.9% accurate, and everyone else is 99.5% accurate, a big relative difference.
But the truth is SpamAssassin is only 90.9%, you are miscategorizing a ton of real mail as spam and never finding it out. Yes, the other services would be only 90.5% accurate, but now the difference is imperceptible and irrelevant.
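
A toy calculation makes the effect visible. This is a sketch with made-up numbers (1,000 messages, a judge that wrongly flags 10% of the ham, a second filter that repeats those mistakes and adds five more); they are not the figures from the study, and not exactly the percentages quoted above.

```python
def accuracy(predicted, reference):
    """Fraction of messages where the prediction agrees with the reference labels."""
    agree = sum(p == r for p, r in zip(predicted, reference))
    return agree / len(reference)


# Hypothetical corpus: 500 real ham followed by 500 real spam.
truth = ["ham"] * 500 + ["spam"] * 500

# The judge (standing in for SpamAssassin) wrongly flags 50 ham as spam.
judge = ["spam"] * 50 + ["ham"] * 450 + ["spam"] * 500

# Another filter makes the same 50 mistakes plus 5 of its own.
other = ["spam"] * 55 + ["ham"] * 445 + ["spam"] * 500

if __name__ == "__main__":
    print("graded by the judge:", accuracy(judge, judge), accuracy(other, judge))
    print("graded by the truth:", accuracy(judge, truth), accuracy(other, truth))
```

Graded against the judge's own labels, the judge scores 1.0 and the other filter 0.995, so the judge looks flawless. Graded against the real labels they score 0.95 and 0.945: the same half-point gap, but now sitting on top of a 5% error that the first measurement never sees.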

--
excitingthingstodo.blogspot.com

What is typical by Anonymous Coward · 2004-06-24 04:13 · Score: 4, Insightful

Due to X's extremely high volume of traffic and the fact that X's email addresses were available to harvest bots on the Web and in newsgroups for 20 years, it is no surprise that X has an abnormally high spam ratio, 81.6%.

I'm not happy about this, first he says that this account has an abnormally high spam ratio and then says that a normal user can have 60%. Where do we get these figures from I would like to know as my average is pushing up against 100%. I don't think that there is such a thing as an average user, some people seem to get nearly no spam and the rest of us get almost complete spam.

Reviewing today's inbox reveals around 200 emails, of which 8 were legit. You do the maths, I would be making progress if it was only 81%.

Re:What is typical by __aaevmb228 · 2004-06-24 12:34 · Score: 1

I'm in the same boat. I get between 150-200 spams a day and 3-10 real messages. A combination of blacklist queries (hits add a header) and Bogofilter has been working for me for many months now with near perfect accuracy. Lately one particular spam message has been making it through, but it's rare that I find ham in the spam box or vice versa.

To cut through the spam by NigelJohnstone · 2004-06-24 04:13 · Score: 4, Insightful

Oh boy he goes on and on, if ever you wanted to cut out the spam in an article...

His main points (at least the ones I agreed with):

1. No training period, many features only turn on after lots of real emails have been processed. Fair enough.

2. No purge window, stale emails get purged over time (e.g. 4 months), but in a test everything is shoved through at once (in minutes) and so nothing gets purged. Again fair.
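
Point 2 can be sketched in a few lines of Python. Everything here is hypothetical (the `purge` function, the token names, the 120-day window standing in for "4 months"); it illustrates the idea of an age-based purge and is not DSPAM's actual implementation.

```python
from datetime import datetime, timedelta


def purge(tokens, now, max_age=timedelta(days=120)):
    """Keep only tokens last seen within max_age of `now`."""
    return {tok: seen for tok, seen in tokens.items() if now - seen <= max_age}


# Batch test: eight months of mail replayed in minutes, stamped by the wall clock.
run_start = datetime(2004, 6, 1, 12, 0)
replayed = {"viagra": run_start, "mortgage": run_start + timedelta(minutes=1)}

# Live use: the same tokens stamped when the mail really arrived.
live = {"viagra": datetime(2003, 8, 1), "mortgage": datetime(2004, 3, 1)}

if __name__ == "__main__":
    print(purge(replayed, run_start + timedelta(minutes=5)))  # both tokens survive
    print(purge(live, datetime(2004, 4, 1)))                  # only "mortgage" survives
```

In the replayed run nothing is ever old enough to expire, so the filter carries stale data it would have dropped in normal use.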

The rest of it complains about the tester, or complains that it was less than ideal conditions & settings for the particular filter.
We call that 'the real world' here.

Sys admins are not experts in configuring filters.

Also he should realise that any new filter gets a better rating than the dominant filter. Spammers try to defeat the most popular filter of the day. So sure a new filter might perform better than an existing one *initially* simply because the spammers aren't targeting it. Until it becomes dominant and then the spammers adjust the spam to defeat the new dominant filter.

So in the real world the data set will always be unusual because the spammers make it that way.

Verbing is not a verb by horza · 2004-06-24 04:15 · Score: 1

I personally think to "architect" something 'sounds' right and it's obvious and unambiguous in what it means. The grammar nazi is right though and it is incorrect. Input *is* a transitive verb. However verbing sounds like something simply offensive and shouldn't be done in public.

The language evolves, but slowly as everyone needs to be able to keep up. This is the problem with Open Standards: creating a stable API can sometimes slow or stifle innovation.

Phillip.

--
Property for sale in Nice, France

Re:Verbing is not a verb by pclminion · 2004-06-24 04:20 · Score: 1

Input *is* a transitive verb.
How do you think those usages get placed in dictionaries? They don't fall from the sky. The noun "input" got verbed. And the use of "verb" as a verb will also eventually be accepted in the dictionary.
Acceptance in the dictionary, and acceptance as usage in language are two distinct things, however.
Re:Verbing is not a verb by cperciva · 2004-06-24 04:36 · Score: 1

The noun "input" got verbed

Nope; in fact, the verb "input" got nouned. The first known use of "input" as a verb in the context of computers was in 1946; the first known use of "input" as a noun in the same context was in 1948.

Outside of the specific case of computers, the difference is even more distinct, with the verb "ynputt" pre-dating the noun "input" by almost four hundred years.

--
Tarsnap: Online backups for the truly paranoid
Re:Verbing is not a verb by Mournblade · 2004-06-24 04:37 · Score: 1

The grammar Nazi would probably also point out that "I personally think" is redundant.

False positives. by Christopher+Thomas · 2004-06-24 04:18 · Score: 2, Informative

I swear I can go weeks on end without a single spam getting through, no false positives -- and between 20 and 100 SPAM in my "spam" box per day!

This is what I don't get - in order to be sure you have no false positives, you have to comb through all of the spam by hand, which for the most part defeats the purpose of a spam filter. If you don't do so, then you can't claim zero false positives - you can only claim that you haven't _noticed_ any false positives.

I have a whitelist at work, and it works quite well, but combing through and emptying the spam bucket is still an annoying part of each day.

However, without doing so, I'll never know if I missed that one message in (about) a thousand that's from a vendor that's not in my whitelist.

QOTD: "I don't have a solution, but I do admire the problem.".

Re:False positives. by Donny+Smith · 2004-06-24 04:44 · Score: 2, Insightful

Exactly - what's the point if you have to re-check it anyway?
That is the main reason I don't use any spam filters.

Without a filter I can check emails as they come rather than create myself a "homework" of having to check 50 messages at once...
Re:False positives. by DFossmeister · 2004-06-24 05:34 · Score: 1

For me it seems that the email comes in batches, not just one at a time. I think this is probably a result of some common spam providers doing their runs, which causes me to get 10-12 in a one minute period. So whether I choose to do it in real time, or in "batch" as homework, it is often much the same.

What I am looking for is better integration with some of the popular mail readers, such as Evolution, Mozilla-mail or Thunderbird. I know that Evolution has a pre-execution filter, but that proved to be really slow with SpamAssassin. The Bayesian filters integrated into Mozilla-mail are not very effective. They only get about 50% and that is after months of training.

It's nice that some of the anti-spam applications try to be MUA agnostic, but this isn't always the best approach because it causes a lowest common denominator issue. POP3 emulators are cool, but the training integration is a pain. The same thing goes for any other standard protocol emulation, except maybe an IMAP4 emulation layer. That way the training could be accomplished by folders.

--
No Not Again! Its whats for dinner.
Re:False positives. by Xentax · 2004-06-24 05:40 · Score: 1

I've used a Bayesian plugin before that let you set thresholds - so a certain score would be marked as "probably good" and be left in your inbox, a range would be set as "probably spam" and put in a possible junk folder, and beyond that was "definitely" spam and went in a spam or trash folder.

It defaulted to like 10/90 (I don't remember which score was more spamlike, so imagine less than 10 was almost certainly ok, and greater than 90 was almost certainly spam) - I set it much lower for awhile (50) until I was reasonably comfortable that even the worst legitimate messages were getting nowhere near the 'almost certainly spam' score.

I like that approach in general - keep a gray area for hand review, and trust that anything worse than that IS spam. And, in general, sort everything by score - then you can just watch the borders for trouble.

Now, that all assumes that a false positive doesn't TANK, that it misses the cut by a narrow margin. Is that a valid assumption? From what I can tell, if you're training against your own mail, and catching any that are borderline and retraining if necessary, yes. Of course, if you have a bunch of friends that are trying to sell you 'enhancement' or stock tips via email, things may be more complicated for you... (solution: get new friends).

Xentax

--
You shouldn't verb words.
Re:False positives. by fyonn · 2004-06-24 05:47 · Score: 1

This is what I don't get - in order to be sure you have no false positives, you have to comb through all of the spam by hand, which for the most part defeats the purpose of a spam filter. If you don't do so, then you can't claim zero false positives - you can only claim that you haven't _noticed_ any false positives.

I file spam in a spam box as I can easily scan across the contents in 10 seconds and hit delete before I go to bed, as opposed to the distraction when an email arrives and you go to check it immediately, being broken out of what you were doing, and then the annoyance of finding that it's spam.

I find that the minor chore of spending half a minute once a day is much easier than being annoyed constantly throughout the day.

incidentally, I implemented dspam last week and right now, it's performing brilliantly, certainly better than SA was which I had installed beforehand.

dave
Re:False positives. by mAineAc · 2004-06-24 06:19 · Score: 1

it seems to me that with the way that spam is nowadays we are taking the wrong approach. Why don't they make a filter assuming everything is spam, and then just filter out the good email? I realize that a white list does just this with the email address, but to take it a step further, look at what the content is and filter according to words or phrases you want to see like 'the kids are doing great.' I can't see this getting any more false positives than what we are using now :)
Re:False positives. by Anonymous Coward · 2004-06-24 06:28 · Score: 0

This is precisely the reason that I use SpamBayes (http://spambayes.sourceforge.net) on my PC. SpamBayes classifies emails into three categories: ham, unsure, and spam. You set the statistical probabilities for each category.

For example, I classify my emails with a 90% probability of being spam as Spam. I then have my email client automatically delete them without review. My ham probability is set to a 10% probability of being spam. The only items I check are the 80% in the middle (classified as unsure).
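
A rough sketch of that three-way split, for anyone who wants to see it spelled out. This is not SpamBayes code: the function name, the default cutoffs (taken from the 10%/90% figures above), and the choice to treat the boundaries as inclusive are all assumptions made for illustration.

```python
def classify(spam_probability, ham_cutoff=0.10, spam_cutoff=0.90):
    """Map a spam probability in [0, 1] to 'ham', 'unsure', or 'spam'.

    The cutoffs and the inclusive boundaries are illustrative assumptions,
    not the behaviour of any particular filter.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    if spam_probability >= spam_cutoff:
        return "spam"      # deleted without review
    if spam_probability <= ham_cutoff:
        return "ham"       # delivered to the inbox
    return "unsure"        # the middle band that gets hand review


if __name__ == "__main__":
    for score in (0.02, 0.10, 0.55, 0.90, 0.999):
        print(f"{score:.3f} -> {classify(score)}")
```

Note that the "80% in the middle" is a range of scores, not 80% of the mail. How many messages actually land in the unsure band depends on how well the filter is trained.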

It's not perfect as spammers are adapting to statistical filters. But I feel confident that there are no false positives in the spam category, so I'm only reviewing the smaller fraction of emails that it's not sure about. The majority of spam is thus deleted without my needing to review them.

I'm guessing that other Bayesian filters would probably do the same with only a spam and ham classification if you can set the spam cutoff probability. I'd be interested in hearing from others on their strategies as it relates to false positives.
Re:False positives. by psilotum · 2004-06-24 06:32 · Score: 1

Because it is much easier to visually scan 50 messages and confirm that they are all spam. Ham stands out and is easy to spot. Secondly, [select all] [delete selected] is easier than selecting or deleting individual messages.
Re:False positives. by juhaz · 2004-06-24 06:39 · Score: 1

Checking emails as they come takes more time than quickly scanning over 50 messages at the end of day.
Re:False positives. by yasth · 2004-06-24 06:40 · Score: 1

They already do in combination with finding spam, that is why you have a list of random words at the end of some spam, in the hopes that it will be a strong hit on a word used in "ham". The real problem is that humans aren't even all that good at detecting spam. (at least not against the tumult that we get nowadays where some people have a couple hundred spams a day, so even a .1% error rate is still a spam or two a week.) The statistical white list could be wonderful for dating services though :) Easy way to find people with common ground.

--
I'd do something interesting, but my server can't handle a slashdotting.
Re:False positives. by cardshark2001 · 2004-06-24 06:47 · Score: 1

The bayesian filters integrated into Mozilla-mail are not very effective. It only gets about 50% and that is after months of training
I had the foresight to save all my junk mail, about 3 years ago. I used it to train the filters when I switched to mozilla, and I hit the ground running with about an 80% rate (trained with about 5000 spam mails and about 800 real mails).
Since then, I have given my email address to any site that asks for it, because I figure the more spam, the better for my filter. This has worked out pretty good for me. Mozilla is now up to about a 95% rate, which is nothing like what was promised by Paul Graham, but still reasonably good. No significant false positives yet (just a couple of commercial emails I didn't care too much about). Plus, I see his prediction in the spam that gets through. It's almost impossible to figure out what these people are trying to sell me because they have to bury their message in a bunch of crap to get around my filters. I don't see how this can keep up for long as a sustainable business. But then again, what do I know?

--
WWJD? JWRTFA!
Re:False positives. by Donny+Smith · 2004-06-24 06:57 · Score: 1

In absolute terms it does but in relative terms it does not (the equivalent of lining up several times in short queues vs. one time in a long queue - when duration of both sides is the same).
Re:False positives. by Xentax · 2004-06-24 07:21 · Score: 2, Informative

This *is* already done - statistical filters are trained on both words that are 'spamlike' (words that show up only, or mostly, in lots of email marked by the user as spam), and words that are NOT (words that show up only, or mostly, in email marked not spam).

This is (AFAIK) done against tokens in both the mail body and the headers, which pays dividends if the delivery paths are clustered (for example, if your whole family has accounts with MyISP.com, you'll probably get good filtering provided the spam isn't originating from MyISP.com as well).
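
To make the "trained on both kinds of words" point concrete, here is a toy sketch of a token-based filter. It is a plain naive Bayes with add-one smoothing and equal priors, which is an assumption on my part and not how any specific product combines its tokens; the class and method names are made up for illustration. Header lines could be fed in as part of the text just like body words.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9$'-]+")


def tokenize(text):
    """Return the set of distinct lowercase tokens in a message."""
    return set(TOKEN_RE.findall(text.lower()))


class TokenFilter:
    """Toy naive Bayes filter (hypothetical, for illustration only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record which tokens appeared in a message marked spam or ham."""
        self.messages[label] += 1
        self.counts[label].update(tokenize(text))

    def token_rate(self, token, label):
        """Smoothed fraction of messages with this label containing token."""
        return (self.counts[label][token] + 1) / (self.messages[label] + 2)

    def spam_probability(self, text):
        """Combine per-token evidence, assuming equal spam/ham priors."""
        log_odds = 0.0
        for token in tokenize(text):
            log_odds += math.log(self.token_rate(token, "spam"))
            log_odds -= math.log(self.token_rate(token, "ham"))
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    f = TokenFilter()
    f.train("cheap pills buy now", "spam")
    f.train("buy cheap stock tips now", "spam")
    f.train("the kids are doing great", "ham")
    f.train("lunch with the kids tomorrow", "ham")
    print(f.spam_probability("buy cheap pills"))
    print(f.spam_probability("the kids are great"))
```

Tokens seen mostly in ham pull the score down just as hard as spammy tokens push it up, which is the content-based "whitelist" the parent was asking for.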

Xentax

--
You shouldn't verb words.
Re:False positives. by firewood · 2004-06-24 08:22 · Score: 1

This is what I don't get - in order to be sure you have no false positives, you have to comb through all of the spam by hand,
True. But you can check if your false positive rate is low enough by statistical sampling.

So once every few days I scan through a thousand or so items marked as spam by procmail. As long as I continue to find 0 or 1 false positives (which I add to my whitelists), I consider my filters 99.9% good. That error rate is probably better than my own human error rate for misfiling and/or just forgetting to respond to email.
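
For anyone curious how much a sample like that really tells you, here is a small sketch that computes an upper confidence bound on the true rate of legitimate mail in the spam folder. It uses the standard one-sided Clopper-Pearson (exact binomial) bound solved by bisection; the function names and the 95% confidence level are my own choices, and it assumes the sampled messages are representative of the whole folder.

```python
import math


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def upper_bound(found, sample_size, confidence=0.95):
    """One-sided exact upper bound on the true false-positive rate.

    found: false positives seen in the sample
    sample_size: number of spam-folder messages checked by hand
    """
    if found >= sample_size:
        return 1.0
    alpha = 1.0 - confidence
    lo, hi = 0.0, 1.0
    # The CDF falls as p rises, so bisect for the p where it equals alpha.
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binomial_cdf(found, sample_size, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    for found in (0, 1):
        bound = upper_bound(found, 1000)
        print(f"{found} found in 1000: true rate is below {bound:.2%} (95% confidence)")
```

With zero found in a thousand, the bound comes out at roughly 0.3%, and with one found at a bit under 0.5%. So the sample is consistent with a rate near 0.1%, but by itself it only guarantees something a few times looser.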
Re:False positives. by SpaceLifeForm · 2004-06-24 09:05 · Score: 2, Insightful

No, you can scan your spam folder in seconds, because you will recognise the subject lines. The duration is not comparable. When you have a folder for spam, any non-spam sticks out, but if you need to think looking at alternating spam and non-spam messages, you spend more time thinking.

--
You are being MICROattacked, from various angles, in a SOFT manner.
Re:False positives. by Christopher+Thomas · 2004-06-24 14:03 · Score: 1

No, you can scan your spam folder in seconds, because you will recognise the subject lines. The duration is not comparable.

I do this. It takes a large amount of time. Comparable to the time required to hit "delete" on seeing a flash of spam-text.

Time savings: zero.

The only reason I spam-filter at all is that the spam filter serves adequately as a _priority_ filter, letting me get to _most_ of the real stuff first. Doesn't save me from slogging through the rest.
Re:False positives. by misleb · 2004-06-24 17:38 · Score: 1

However, without doing so, I'll never know if I missed that one message in (about) a thousand that's from a vendor that's not in my whitelist.
I've accidentally deleted more legitimate mail while weeding through the spam in my inbox than DSPAM has caught false positives. See, filters don't get frustrated. I do. I trust DSPAM more than I trust myself. Scary, huh?
-matthew

--
"THERE IS NO JUSTICE, THERE IS ONLY ME." -Death
Re:False positives. by Cato · 2004-06-25 00:13 · Score: 1

I use SpamAssassin on my mail server, because it has Bayesian filtering, rules-based filtering (not just keywords but regexes and code to detect things such as use of Base64 encoding), and collaborative filtering (e.g. Razor).

Bayesian filters generally work OK, but every day there are a few spams that append some irrelevant text (a joke or just random unusual words) that would beat Bayesian filters if not for the rules-based filters. While they are less effective spams as you point out, there may still be people who click on the URLs and buy something, so this doesn't mean the death of spam. In fact only ISP-based filtering can really affect spam, since there are enough clueless people with no client side filtering that spam will get some responses.

SpamAssassin's integrated approach means that, with quite a few custom rules, I can get upwards of 99% accuracy and the only false positives are new commercial emails from e-commerce sites.

Re:Why use "architect" - why not "write" by Anonymous Coward · 2004-06-24 04:18 · Score: 0

What's the difference? You already said it. Architecting sounds a hell of a lot more pretentious. Almost like "crafting" a response, except it's not a verb.

Re:Architect is not a verb. by ObjetDart · 2004-06-24 04:20 · Score: 1

I agree this seems nit-picky, but the misuse of "architect" is actually only the tip of the iceberg. This article is so chock full of misused words, awkward sentence construction, and serious grammar problems I found it distracting and difficult to read. I guess this is what they mean when the liberal arts folks deplore the poor writing skills of many geeks. This guy really needs an editor. And when he mentioned that he is also writing a book, I just shuddered.

--
I read Usenet for the articles.

Main issue by TheLink · 2004-06-24 04:21 · Score: 1

Zdziarski claims Cormack mainly used Spamassassin to classify the corpus into the ham and spam groups.

If this is true then to me this is a critical flaw in Cormack's methodology.

Not saying there are, or aren't other flaws. But this to me is the main one to consider. Zdziarski should have just put this at the top of his response, instead of putting a lot of waffle about stuff that does "not appear to have been a problem with Cormack's tests".

--

Too many replies beneath your current threshold

Why not... by Vadim+Makarov · 2004-06-24 04:21 · Score: 1

postage-based email?

--
17779 eligible voters in a district, 17779 'vote' as one. This is Russia.

Re:Why not... by Kent+Recal · 2004-06-24 06:19 · Score: 1

No, thanks.

Re:Why use "architect" - why not "write" by nagora · 2004-06-24 04:22 · Score: 1

You will have noticed by now that any attempt to express what you mean clearly is regarded as fascism by the /. crowd. In this case you were 100% right: "architect" was not even close to being the right word for the job. I can't imagine the trouble this guy gets into if he tries programming: "I know I typed 'print', but I meant 'close'!"

TWW

--
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"

Confirmed: Architect IS a verb by cperciva · 2004-06-24 04:24 · Score: 4, Informative

Quoth the OED:

architect v. To design (a building). Also transf. and fig. Hence architected ppl. a., designed by an architect; architecting vbl. n. and ppl. a.

The use of "architect" as a verb isn't even recently invented: Keats wrote "This was architected thus By the great Oceanus" in 1818.

--
Tarsnap: Online backups for the truly paranoid

Re:Architect is not a verb. by Anonymous Coward · 2004-06-24 04:25 · Score: 0

Yes.

"Verb" is a noun. Until it's verbbed. "Verbbed" is a verb until it is nouned. Nouning "verbbed" makes "noun" a verb, when previously, it was a noun.

Does that clear things up for you?

dubious credentials by Anonymous Coward · 2004-06-24 04:27 · Score: 0

a professor of computer science vs. a guy who goes by the moniker "nuclear elephant" (and whose educational credentials are rather dubious).

hmm.

Re:And to that... by calebb · 2004-06-24 04:28 · Score: 3, Insightful

"You mean like any other normal person who might be wanting to use such a product?"

And to that, I would say... Someone writing an article for publication in a peer-reviewed journal should become experienced in their area of research before attempting to publish their results!

For example, I'm sure you don't have much experience with Nuclear Magnetic Resonance imaging - And you might or might not have experience with X11 forwarding. But unless you are fluent with both of those topics, I would not expect you to attempt to publish a paper in a peer-reviewed journal discussing those topics!
(Like I did, last December)

However, for the sake of presenting some evidence to back up what I'm saying here, I'll take your example of Consumer Reports.

From their site: CR has the most comprehensive auto-test program and reliability survey data of any U.S. publication; its auto experts have decades of experience in driving, testing, and reporting on cars.

...nevermind, I don't need to say anything else.

SA vs SA... SA Wins! by sharper56 · 2004-06-24 04:33 · Score: 1

Unfortunately, the most important point is buried in the article.

Cormack builds a list of spam and a list of ham using SpamAssassin. He then tests the accuracy of the products by comparing them to the SA lists. So in testing the filtering, if you don't agree with SpamAssassin, then you're wrong.

Gee, it's hard to figure out why SA won. ;-)

Re:SA vs SA... SA Wins! by NigelJohnstone · 2004-06-24 04:57 · Score: 1

"Cormack builds a list of spam and a list of ham using SpamAssassin. "

I read that bit, but Cormack's words:

"The test sequence contained 49,086 messages. Our gold standard classified 9,038
(18.4%) as ham and 40,048 (81.6%) as spam.
>>>>>>>The gold standard was derived from
X's initial judgements, amended to correct errors that were observed as the result
of disagreements between these judgements and the various runs."

From this I am left with the impression that X was the judge, not SpamAssassin, and X only fixed up a few errors.

So I dismissed his counter as a misread. Or did I miss something in Cormack's paper that you can point to?

Re:Architect is not a verb. by nagora · 2004-06-24 04:33 · Score: 1

I hope you're proud of your anal retentiveness.

If you mean being proud of knowing that "architecting" was not even close to being the right word, then I'm proud, sure.

Language does evolve over time and new words do come into usage, but how does that mean that just picking words at random and using them instead of already existing, perfectly adequate, words is not pointless, unclear, and pretentious?

To deny this fact makes you just another prescriptivist language maven, completely disconnected from reality and any sense of the advancement of human language.

Toggle bus area salty Jehovah wash ribbed.

Did you not understand that I meant "I totally agree with what you said"? How very prescriptive of you!

TWW

--
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"

Re:Architect is not a verb. by pclminion · 2004-06-24 04:38 · Score: 1

You put a meaningless jumble of words together. "Architect" in this context was anything but meaningless. If you can't figure out what was meant, that indicates a lack of brain power on your part, nothing more.

Constructing arguments by cynicalmoose · 2004-06-24 04:40 · Score: 4, Informative

As far as I understand, Cormack accepted that he was testing only on one person's corpus, and qualified his findings as such.

This is something that is featured throughout the rebuttal - an argument that runs:
a) Such and such was done incorrectly
b) Therefore the system was inaccurate
c) Therefore CRM-114 is better than stated

The ultimate point where I lost patience was where he claimed that the results were invalid because they didn't conform to accepted, real world knowledge. The study was empirical; it shows something, based on how it was set up; and what it shows is valuable. If you discarded results each time they contradicted agreed wisdom we would still think of a geocentric universe.

--
Exercise your right not to vote. thinkoutside.org

Re:Constructing arguments by bourne · 2004-06-24 05:36 · Score: 2, Insightful

The ultimate point where I lost patience was where he claimed that the results were invalid because they didn't conform to accepted, real world knowledge. The study was empirical; it shows something, based on how it was set up; and what it shows is valuable.

But without knowing how the test was set up, how can you trust the test's so-called empirical results?

In medicine, research results aren't generally trusted unless 1) the study was sound, e.g., double-blind and 2) a separate team has recreated equivalent results using the published methodology. If, as Zdziarski says, Cormack is not making his config files available, then that alone should be a reason not to blindly accept the study's results. The methodology is unknown.

I can see not publishing the mail messages - in medicine, for example, you don't want to re-use the same test subjects from the first test, so there's no point to it as well as the privacy issue - but the config files? What possible reason could there be for not making them available?

Re:architect by psykocrime · 2004-06-24 04:42 · Score: 2, Insightful

For the love of Cthulu, people, "architect" is a noun, not a verb.

Languages are dynamic, not static. If enough people begin to use 'architect' as a verb, then it is a verb. I have a strong hunch that 20 years from now, the verb form of architect will appear in Merriam-Webster...

--
// TODO: Insert Cool Sig

Anyone got Gordon's email addy? by bl8n8r · 2004-06-24 04:43 · Score: 1, Funny

I propose a little test of my own...

--
boycott slashdot February 10th - 17th check out: altSlashdot.org

definitely curious and also concerned by fantomas · 2004-06-24 04:44 · Score: 1

Like other people, I found michael's choice of word curious: the first time I have ever seen the noun architect used as a verb. The curiosity of the expression took my attention away from his main argument.

I feel the Plain English Campaign offers a useful guide "We define plain English as something that the intended audience can read, understand and act upon the first time they read it". So, perhaps you are right for the majority of people. But I had to pause a while and think about what michael meant.

I agree with you that some nouns become accepted as verbs.

POPFile OTOH by JohnGrahamCumming · 2004-06-24 04:47 · Score: 3, Informative

Actually publishes statistics from real users. If the user is willing POPFile sends back accuracy information to a central server and then a nightly cron job analyzes it and publishes information on the web for all to see.

No need to read a study, or even the author's opinion. No wild claims made, just real data.

Here it is:

http://www.usethesource.com/popfile_stats.html

Shows that POPFile has an _average accuracy_ over all users, including the training period of 95%. After it's seen 500 emails it has an accuracy of 97%. And the average POPFile user has 5 categories of classification.

John.

Re:POPFile OTOH by driptray · 2004-06-24 14:08 · Score: 1

Those popfile numbers seem low to me. I've never had less than 99% accuracy (after a couple of days training), and I've been using popfile for over a year, and am on my third corpus.
Right now I'm at 99.92% accuracy. I still get pissed off about the 0.08% though :)

Re:Architect is not a verb. by pete-classic · 2004-06-24 04:49 · Score: 1

I'm stuck with windows at work. I can't get any extensions to install. :-(

I've clicked on the install links at texturizer, but when I restart I still just have the DOM thingie all by itself in my extensions manager.

I think that Firefox 0.9 wasn't quite ready with the new extensions model.

Oh well.

-Peter

DSPAM by Big+Boss · 2004-06-24 04:49 · Score: 2, Interesting

I don't claim to have done any scientific studies on the subject, but I have tried a number of different anti-spam solutions over the past few years. In my experience, the best solution is a multi-pronged approach that takes advantage of the strong points of a few setups.

If you want to talk about the results from a single filter in my current arsenal, I would give DSPAM the highest marks. I found it to catch more spams than a trained and customized SpamAssassin with no false positives. It's also very fast, unlike SA. My current setup is as follows...

1) RBLs via Postfix. I probably block 80% of inbound spam this way. I choose my RBLs carefully to limit false positives.

2) DSPAM. I typically get better than 99% of the ones that slip through the RBLs with DSPAM.

3) A complex procmail.rc that uses some statistical rules and a few simple checks, such as "is the mail addressed to me". I also use procmail to sort my mailing list messages into IMAP boxes and it includes a simple whitelist.

4) Spamassassin. This doesn't run much anymore, but I keep it around anyway as a last resort checker. If a mail makes it through all the above, SA gets a shot at it.
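
The layering above can be sketched as a chain of stages where the first stage with an opinion wins. This is only an illustration of the ordering, not my actual Postfix, DSPAM, procmail, or SpamAssassin configuration: every name, address, IP, and threshold in it is a made-up stand-in, and the scores are assumed to be supplied on the message.

```python
# All names and values below are hypothetical stand-ins, not real configs.
BLOCKLISTED_IPS = {"192.0.2.1"}        # stand-in for RBL lookups
WHITELIST = {"friend@example.com"}
MY_ADDRESS = "me@example.com"


def rbl_stage(msg):
    """Stage 1: reject mail whose sending IP is on a blocklist."""
    return "spam" if msg.get("sender_ip") in BLOCKLISTED_IPS else None


def statistical_stage(msg):
    """Stage 2: stand-in for a trained statistical filter's verdict."""
    return "spam" if msg.get("spam_score", 0.0) >= 0.9 else None


def rules_stage(msg):
    """Stage 3: simple whitelist and 'is it addressed to me' checks."""
    if msg.get("from") in WHITELIST:
        return "ham"
    if MY_ADDRESS not in msg.get("to", []):
        return "spam"
    return None


def last_resort_stage(msg):
    """Stage 4: stand-in for a rule-based scorer that runs last."""
    return "spam" if msg.get("rule_score", 0.0) >= 5.0 else None


STAGES = [
    ("rbl", rbl_stage),
    ("statistical", statistical_stage),
    ("rules", rules_stage),
    ("last_resort", last_resort_stage),
]


def run_pipeline(msg, stages=STAGES):
    """Return (verdict, stage_name) from the first stage that decides."""
    for name, stage in stages:
        verdict = stage(msg)
        if verdict is not None:
            return verdict, name
    return "ham", "default"  # nothing objected, so deliver it


if __name__ == "__main__":
    msg = {"sender_ip": "198.51.100.7", "from": "shop@example.org",
           "to": [MY_ADDRESS], "spam_score": 0.95}
    print(run_pipeline(msg))
```

Returning the stage name along with the verdict is what lets you track which layer is doing the work.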

I tried using SA as my only post RBL filter for a couple months, but it wasn't getting the job done. I then added the procmail script, but still wasn't happy. Putting DSPAM in front of it all seems to work best for me. I now find that I only have a few spams per month make it past DSPAM (they sort into separate boxes so I can track their performance) and I haven't seen a false positive in quite some time, over a month anyway. I've only been using DSPAM for a few months.

What works for me may be crap for you. Try a few things till you find something that works for you and use that. If you're trying statistical filters, keep in mind that it takes a while to train them. I found I got better than 90% with DSPAM after a small corpus feed and about a week of training.

Obfuscated Hyperverbosity by Andy_R · 2004-06-24 04:50 · Score: 1

The author 'architected an appropriate response' . Persumably this is a lot better than simply replying?

I'd advise the author not to use the word "percept", because he doesn't know what it means.

I'd advise the author not to use the word "someodd", because dictionary.com doesn't know what it means.

As for "very unique"...

--
A pizza of radius z and thickness a has a volume of pi z z a

Re:Obfuscated Hyperverbosity by Anonymous Coward · 2004-06-24 05:33 · Score: 0

dictionary.com also does not know what Persumably means.

And I did not find "percept" in the article at all.

Re:Architect is not a verb. by the+chao+goes+mu · 2004-06-24 04:51 · Score: 1

I agree. However, my biggest complaint is that "architecting" makes a new, vague word to replace the perfectly clear word "design".

Not to mention the fact that he neither "architected" nor designed, but simply wrote....

--
Boys from the City. Not yet caught by the Whirlwind of Progress. Feed soda pop to the thirsty pigs.

Wacko by Anonymous Coward · 2004-06-24 04:55 · Score: 0

Personally I find "Nuclear Elephant"s writing ridiculous. Read this article about how terrorists are going to use data centers for their next attack.

Re:Grammar by anethema · 2004-06-24 04:57 · Score: 0, Offtopic

yes

--

It's easier to fight for one's principles than to live up to them.

Re:architect by Anonymous Coward · 2004-06-24 05:02 · Score: 0

Architect is generally heard as a noun *now*. It originated as a verb.

Re:Architect is not a verb. by nagora · 2004-06-24 05:11 · Score: 1

"Architect" in this context was anything but meaningless.

Alright, genius: did he mean "write" or "design"? And why was not using one of those an appropriate choice?

TWW

--
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"

The problem w/ Bayes by king_ramen · 2004-06-24 05:12 · Score: 3, Informative

As the author of this article states OVER and OVER, it is REALLY EASY to mess up your filters, and it is very tedious (with lots of permutations) to properly build your corpus.

For a centralized spam filtering solution, the goals are:

1. Insulate the users from spam
2. Insulate the users from "administration"
3. Do no harm (no false positives)

For these goals, I would take a "dumb" filter, set it conservatively, and hope for 80% catch rate and zero false positives.

DSpam has a complicated workflow that requires EACH AND EVERY end user to complete a feedback loop. This is WAY too much to expect from people who are barely capable of finding Google. Unless the ONLY access to the mail is web-based, with a VERY clear "This is Spam" button, Bayes is a sysadmin's nightmare.

My only gripe w/ SpamAssassin is performance. If I could get SPAMD to analyze headers in 25ms instead of 2000ms I'd never look back. As it is, DSPAM's performance has me very jealous.

--
----- Refactoring is the reason why man does not mistake himself for a god.

Re:The problem w/ Bayes by Ragica · 2004-06-24 10:04 · Score: 1

DSpam does have a feature whereby untrained accounts can refer to a "parent" trained account.... unfortunately I've not had good success trying to set this up, however it sounds excellent in theory...
If you have all your users use the global stats by default, it might be a bit dangerous... if one of those users happens to get a lot of legit email that looks like the global classified spam... but it sounds like in the real world this method should work...
And yeah... it's the speed that is the most attractive part for me as well. We have spamassassin on our small ISP, but only users who configure it themselves use it (via spamd)... but the load on the system is annoying. I don't really want to enable it for everyone by default because of this. If I could get DSpam set up and seemingly reasonably accurate using the global stats setup, I'd do it. I'm hoping to do it... eventually. Haven't got around to testing DSpam 3.0 yet.

But is it correct? by NigelJohnstone · 2004-06-24 05:12 · Score: 1

To repeat a comment I made just above. From his original test paper:

"The test sequence contained 49,086 messages. Our gold standard classified 9,038
(18.4%) as ham and 40,048 (81.6%) as spam.
The gold standard was derived from
X's initial judgements, amended to correct errors that were observed as the result
of disagreements between these judgements and the various runs."

From this I got that:

1. He had an initial set of Spam judged by person X. (e.g. 99.84% accurate).
2. That he ran it through each test filter.
3. That discrepancies were analysed by hand to get to the golden 100%.

So it's not SpamAssassin that generated the gold standard, person X did with corrections from the *runs* (i.e. a composite of all the filters as adjudicated by person X).

Re:But is it correct? by TheLink · 2004-06-24 21:36 · Score: 1

Unless I misunderstand the research PDF/paper, to judge the initial set of spam + ham as spam or ham, the person sifted through what was already sifted through by _Spam_Assassin_ into ham and spam. So this is likely to bias things towards spam assassin.

Whereas if X sifted through all the thousands of messages manually then the bias would be to X's uninfluenced standards, which arguably could be still useful - then at least we'd see which antispam solution is more suitable to X.

A more thorough but tedious approach would be the initial classification of spam/ham by a team of humans - each going through _all_ the messages without any skipping, with discrepancies being resolved by the team.
--
- Too many replies beneath your current threshold

why can't we all just get along? by Anonymous Coward · 2004-06-24 05:20 · Score: 1, Insightful

Irritation is a perfectly reasonable reaction. It is not, however, constructive to vent the irritation in response.

It is not my desire to flame the test or the tester, but...

Somehow came not long after this:

Many misled CS students, Ph.Ds, and professionals have jumped on the spam filtering bandwagon with the uncontrollable urge to perform misguided tests in order to grab a piece of the interest surrounding this area of technology.

Something I learned from girlfriend #4: validate feelings. Yes, the Nuclear Elephant was hurt. He's right to be hurt. But no, lashing out is not adult, it is not constructive.

To characterize other researchers as ignorant, wagon-jumping glory hounds with poor self-control does not encourage cooperation.

Re:why can't we all just get along? by pclminion · 2004-06-24 05:28 · Score: 1

But no, lashing out is not adult, it is not constructive.
True, but neither is ignoring his points simply because he had some attitude.
I do think he handled the stress a little poorly.
Re:why can't we all just get along? by Anonymous Coward · 2004-06-24 05:38 · Score: 0

True, but that's just the point, isn't it? I'm discouraged from reacting dispassionately because of ad hominem attacks. I'm encouraged to ignore his points.

But, yes, you're right. Someone's got to short the vicious cycle. I'll make sure I read the article as best I can. Wait, I'm not Mr. Cormack. Anyway, I'll do my part.

Re: Response to Gordon Cormack's Study of Spam by telstar · 2004-06-24 05:24 · Score: 2, Funny

He launches rockets ... He develops 3D game engines ... He analyzes spam trends ... Is there anything this Carmack guy can't do?

What'd you say?
Cormack?

Nevermind...

SpamAssassin validation telling point by gurps_npc · 2004-06-24 05:27 · Score: 1

I find the most telling point is that he used SpamAssassin to decide if the various spam detectors had made an error or were correct.

OBVIOUSLY, SpamAssassin is going to agree with SpamAssassin being the best.

What the test really did was determine how close to SpamAssassin the other spam detectors were, not how good they were at detecting spam.

--
excitingthingstodo.blogspot.com

Atypical, high volume of traffic? by dougmc · 2004-06-24 05:36 · Score: 2, Informative

This seems very atypical. The test subject does not represent typical email behavior, except among the most hardcore geeks. Even still, typical hardcore geeks will adjust this behavior in an attempt to curve spam. The typical technical user (someone who makes his living online) will have the same email address for perhaps five or more years, and the typical non-technical user (a majority of the users on the Internet, lest we forget) will change email addresses every couple of years. In either case, most sane users use one or two variants at the most.

Who is Jonathan to decide what constitutes sanity?

Maybe I'm a hardcore geek, but I do do exactly what Gordon does -- have several accounts feeding a `master' mail account, using addresses I've owned for over a decade. I also post to Usenet and mailing lists with my unobfuscated mailing address -- I want people to be able to reach me, and I refuse to let the spammers take that away from me.

And I think I'm very sane, thank you.

49,000 emails in eight months is also absurd.

I agree. That's an absurdly *small* amount. I personally receive over 1500 spams/day -- so I'd have 49,000 in under a month. Obviously the amount of spam I receive is because I set myself up as a target, but I'm hardly the only one. Even Jonathan's email address is clearly listed on his page, unobfuscated, so he's doing it too, at least to some degree.

(As a piece of anecdotal evidence, Spamassassin catches all but about 4/day of the spams I get, and false positives are extremely rare. Of course, I have spent a good deal of time tweaking SA to work best with my email, and it now works very well.)

A good test should have included independent tests with corpora from 10-15 different test subject, of all walks of life - geek, doctor, etc.

That sounds fine in theory, but in practice it's hard to do. How many people from all non-geek walks of life save *all* their email, including spam, and are willing to give it to you so you can analyze it?

And merely capturing all their email won't do it -- they need to categorize it for you, because they're the only ones who can reliably decide what's spam *for them* and what's not.

I do agree that the study had more than its share of issues, but this critique goes way over the top.

Re:Atypical, high volume of traffic? by Anonymous Coward · 2004-06-24 09:04 · Score: 0

49,000 emails in 8 months is
6,125 emails per month or about
1,400 emails per week (about 35 weeks in 8 months) or
200 emails per day (about 243 days in 8 months)
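As a quick check on the arithmetic above, here is a minimal sketch. It assumes an "average" month of 365.25 / 12 days; the exact start and end dates of the collection period are not used, so the weekly and daily figures are approximate.

```python
# Rough message-rate check for 49,000 emails over 8 months.
# Assumption: an average month is 365.25 / 12 days (about 30.4 days).
TOTAL_EMAILS = 49_000
MONTHS = 8
DAYS_PER_MONTH = 365.25 / 12

days = MONTHS * DAYS_PER_MONTH   # about 243.5 days
weeks = days / 7                 # just under 35 weeks

per_month = TOTAL_EMAILS / MONTHS
per_week = TOTAL_EMAILS / weeks
per_day = TOTAL_EMAILS / days

print(f"{days:.0f} days, {weeks:.1f} weeks")
print(f"{per_month:.0f}/month, {per_week:.0f}/week, {per_day:.0f}/day")
```

Under that assumption, eight months is closer to 35 weeks than 39, so the daily rate comes out at roughly 200 messages.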

Between work and home I average over 200 emails per day that are HAM let alone SPAM ... maybe I am atypical as well ... same basic email address for 20 years also ... and I am not a geek just in the computer/engineering world ..

Taken individually, we are all atypical. Find me a family with 2.5 people.

Crap writing by fuzzy12345 · 2004-06-24 05:40 · Score: 2, Insightful

I was turned off as soon as I hit that word "architect" being used as a verb. After our hero "architected" his response, did he assign the task of actually writing it to someone else? Nooo.

English does evolve, and good writers sometimes repurpose words to great effect. Alas, judging by the rest of the reviews here, our hero is NOT a good writer -- having built a shoddy and ramshackle outhouse, he proudly crowns himself the architect of it.

As for all those people who shout "prescriptive grammarian!", I often suspect they're just too lazy to learn to write well, and have decided that claiming that rules are passe is an effective workaround.

--

Everybody's a libertarian 'till their neighbour's becomes a crack house.

Re:Crap writing by EatAtJoes · 2004-06-24 11:27 · Score: 1

As for all those people who shout "prescriptive grammarian!", I often suspect they're just too lazy to learn to write well, and have decided that claiming that rules are passe is an effective workaround.

either that or he's too busy coding free software to give a shit about 'prescriptive grammarians' who obviously don't RTFA. this is slashdot pal -- damn, at least he spellchecked!

choosing good RBLs by Anonymous Coward · 2004-06-24 05:46 · Score: 0

I'm using lists.dsbl.org, relays.ordb.org, and sbl.spamhaus.org .

Which are you using?

Re:choosing good RBLs by Big+Boss · 2004-06-24 08:42 · Score: 1

sbl-xbl.spamhaus.org,
blackholes.easynet.nl,
relays.ordb.org,
list.dsbl.org,
ipwhois.rfc-ignorant.org,
cn.rbl.cluecentral.net,
kr.rbl.cluecentral.net

I find I get most of my hits from the spamhaus list, followed by dsbl. In scanning my logs, I have not ever found a false positive from those, nor have I been contacted by anyone complaining about my blocks (they can use yahoo or something if they want to). 90% of them appear to be broadband connected machines, probably virus infested.

Yeah, yeah. I'm blocking China and Korea. I don't know anyone there, so I put them in as an experiment. I get probably 10 blocks a day from them. Not enough to really care either way, but that's 10 fewer spams for my other filters to process.

As I said in my other post, this works for me. If it sucks for you, I don't want to hear about it. ;)
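For readers who have not set one up, an RBL check is just a DNS lookup: the connecting IPv4 address is reversed octet by octet, the list's zone is appended, and any A-record answer means "listed". The sketch below is illustrative only; the function names are made up for this example, and the zone names in the thread date from 2004 and may no longer be in service.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    # Validate the address, then reverse its octets: 192.0.2.99 -> 99.2.0.192
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers with an A record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name (or lookup failure): treat the address as not listed.
        return False
```

A mail server would call something like `is_listed(client_ip, "sbl-xbl.spamhaus.org")` at connection time and refuse the session on a hit, which is why this approach costs so little compared with scanning message bodies.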

an important consideration left out by mabu · 2004-06-24 05:56 · Score: 1

When self-proclaimed pundits do these studies, they should also factor into account the exponential increase in resources needed to accept and filter the mail's content. This results in more memory, faster machines, slower mail service and more deferred mail and reduced performance overall of everything else that might be done on that server.

Contrast this with the effectiveness of RBLs, which block spam based on the source and immediately cut off the huge resource requirement needed by these "filters".

By my analysis, at BEST, there is little more than a 1-2% difference in spam-catching ability between a well-tweaked RBL setup and a content-based system, except that the content-based system consumes tremendously more resources and further delays mail service.

It seems to me, if you have unlimited resources and you also want to employ content-based filtering for other means, that's the way to go. For everyone else on the planet who wants fast, reliable mail service without having to spend a fortune in hardware to handle traffic you shouldn't have, a well-selected set of RBLs is the superior approach.

Re:Why use "architect" - why not "write" by Anonymous Coward · 2004-06-24 06:03 · Score: 0

You're being unintentionally obtuse.

"What's the difference..." is a rhetorical question used to highlight the frivolity and pretension of the term.

BTW, to fork is a long-accepted verb.

Re:Architect is not a verb. by OneBigWord · 2004-06-24 06:04 · Score: 1

But you can 'verbalize' all you want. Heck, I did it twice before breakfast.

--
What if that mime really is trapped in a box?

for god's sake... by Anonymous Coward · 2004-06-24 06:05 · Score: 0

"architect" isn't a verb, and anyone who uses it as such should be shot (especially since there's no real designing taking place when writing a paper). What's so bad about just saying you "wrote" it?

Re:Architect is not a verb. by Daniel · 2004-06-24 06:08 · Score: 1

You can't "verb" something

Sure you can, but verbing weirds language.

Daniel

--
Hurry up and jump on the individualist bandwagon!

Re:Architect is not a verb. by perly-king-69 · 2004-06-24 06:10 · Score: 1

There's nothing unwrong about prescriptivisationism.

--

--
This sig is inoffensive.

Re:Is that what your mom worded by Anonymous Coward · 2004-06-24 06:27 · Score: 0

It was a debugger error:

No symbol "Architect" in current context. Single stepping until exit from function main, which has no line number information.

It's a decent paper, but take it with some salt... by Ayanami+Rei · 2004-06-24 06:33 · Score: 4, Interesting

...this guy seriously believes the earth is a scant 10000 years old. And he dismisses all evidence to the contrary without a thorough explanation. I can't help but wonder if he treats other people's research with the same disregard.

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

Re:Architect is not a verb. by baomike · 2004-06-24 06:33 · Score: 1

but maybe articulate would fit.
At least it starts with the same letter.

No. by Ayanami+Rei · 2004-06-24 06:36 · Score: 1

You input data. You don't input input.

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

Mmmm... by Ayanami+Rei · 2004-06-24 06:41 · Score: 1

well, it's all well and good, but you lessen the likelihood that you'll click-delete the wrong message when they're all in your inbox, not yet sorted. (Incidentally, statistical filters are great for sorting mail period.) I get a LOT of email, I'd be lost without it.
I just check the junk mail folder less often than my inbox. And I do get false positives, but it happens infrequently enough that it's not an issue.

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

Hello? by Ayanami+Rei · 2004-06-24 06:45 · Score: 1

He said he wasn't an expert. So of course he'd be forced to make that conclusion. He cannot scratch his itch because he cannot reach it.

This is the kind of response he was talking about that does no good. Rather, you should acknowledge that the area is weak and that more focus needs to be given there in the future.

(Incidentally, I'm interested in OSS in the GIS field. Any ideas/good pointers? Anyone?)

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

Re:Hello? by killjoe · 2004-06-24 08:12 · Score: 2, Insightful

"He cannot scratch his itch because he cannot reach it."

You don't have to be a developer. As I said you can start a campaign to ask for donations, you can write letters to companies asking for sponsorship, you can donate some of your own money, you can try to get like minded individuals together to solve the problem.

OPEN SOURCE DOES NOT WORK UNLESS YOU CONTRIBUTE.

" Rather, you should acknowledge that the area is weak and that more focus needs to be given there in the future."

More focus needs to be given by who? Are you saying I should grab random programmers off the street and yell at them until they write a GIS program for me?

--
evil is as evil does

Cormack and Lynam re Zdziarski's factual errors by gvc · 2004-06-24 06:48 · Score: 4, Informative

We shall not respond to Mr. Zdziarski's attacks, except to identify the most outstanding factual errors and to note that ad hominem arguments are irrelevant in assessing the validity of our work.

We encourage interested parties to read our paper and our points of fact re Zdziarski.

Thomas Lynam
Gordon Cormack
June 24, 2004

Re:Cormack and Lynam re Zdziarski's factual errors by Trogre · 2004-06-24 10:01 · Score: 1

It would be so much easier to believe you if you would just show us the code you used to perform the tests.

--
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
Re:Cormack and Lynam re Zdziarski's factual errors by EatAtJoes · 2004-06-24 11:18 · Score: 2, Insightful

While obviously Cormack and Lynam are central to this discussion, it's depressing that this is +4, Informative when instead they obviously resent any serious questioning of their work. Is there a '-1, Wussy' moderation?

"We shall not respond" -- huh? Pull the log out of your ass guys. Like it or not, he's got legitimate beefs with your study. What's more, he's got cred: dude puts SERIOUS effort into GPL'd software that helps people, so his input is relevant and valid. Get over it.

Besides, his questioning of your credibility is neither 'ad hominem' nor irrelevant. Claiming that it is betrays a decidedly unprofessional sensitivity to criticism. as he points out, it is more than legitimate to question the credentials of the tester when interpreting results -- UNLESS the test has been repeated. 'Ad hominem' attacks means irrelevant insults, whereas he's merely questioning your approach and relevant experience. don't go public with your stuff if you don't like the heat.

How about instead, you address his most damaging points:

- put all of your configuration data and any other information required to re-run the test online, immediately. there is absolutely no reason to resist this. you might want to explain why you haven't already.

- your errata is so far entirely due to his corrections. professional class would merit gratitude for his attention. try it on for size. after all this is supposed to be a *review* period yes?

- he directly questions the use of human error-checking. is he right? wrong? i don't know but it's a damn interesting question, and one your response does not address.

- finally, what's up with saying you won't respond ... and then RESPONDING, and using his work in your errata?

there are more problems here but you get the gist. you guys get paid to do this so do it right.

Whether he's right or wrong ... by Anonymous Coward · 2004-06-24 06:56 · Score: 0

.. the point of writing papers is to get review comments, and it is part of a scientific process to improve the quality of research results. In that respect, Slashdot is doing a very good job.
And so was Cormack when he put out his ideas for feedback. And so were you in formulating feedback.

Collaborative filtering? by WOV · 2004-06-24 06:59 · Score: 1

I am always confused by the omission from these tests of collaborative filters like Cloudmark's SpamNet, which I have used at work for a long time with a very high "catch" rate, no real processing time, and no false positives. Essentially, every email you get it hashes and checks with the server. If you get a spam, you right-click and report it as such. Then it pulls any messages from your inbox which enough credible people have marked before you. (A gross oversimplification, but close enough.)

I feel like at our current stage of technological development, you have to combat human-generated deception with human intervention.

(By the way, that Cloudmark tool is Outlook-only, but contains some concepts I'd like to see in other filters.)
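The hash-and-report idea described above can be sketched in a few lines. This is a toy model, not Cloudmark's actual protocol: the exact-match SHA-256 fingerprint, the `ReportServer` class, the per-reporter trust weights, and the threshold are all assumptions made for illustration (real systems use fuzzier fingerprints so that small changes to a message do not defeat the match).

```python
import hashlib
from collections import defaultdict


def fingerprint(body: str) -> str:
    """Hash a message body after collapsing whitespace and case."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ReportServer:
    """Toy stand-in for the shared server that collects spam reports."""

    def __init__(self, threshold: float = 2.0):
        self.threshold = threshold
        self.reporters = defaultdict(set)        # fingerprint -> set of reporters
        self.trust = defaultdict(lambda: 1.0)    # reporter -> hypothetical weight

    def report(self, reporter: str, fp: str) -> None:
        """Record that this user marked the message as spam."""
        self.reporters[fp].add(reporter)

    def is_spam(self, fp: str) -> bool:
        """A message is spam once enough trusted users have reported it."""
        score = sum(self.trust[r] for r in self.reporters.get(fp, ()))
        return score >= self.threshold
```

A client would compute `fingerprint(body)` for each incoming message, ask the server `is_spam(...)`, and call `report(...)` when the user flags something by hand.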

CRM114 is impossible to get installed by Anonymous Coward · 2004-06-24 07:08 · Score: 3, Insightful

I remember going through the CRM114 installation docs, and vividly remember the 20 or so steps that I had to go through, and after about 3 or 4 hours of trying to get it installed, I finally gave up. I think part of the goal of software design is to make your software so that people will be able to quickly install and use it. The author of this program lost sight of this important point. I'm not going to sit there and reverse engineer some esoteric codebase just to get it working, and I'm sure a lot of other people feel the same way. Therefore, I use SpamAssassin among other things, and it works really well and was quick and relatively painless to get working. I didn't have to go through their source code to figure out how to get it installed.

Re:It's a decent paper, but take it with some salt by Anonymous Coward · 2004-06-24 07:26 · Score: 0

I especially like these claims:

Many believe a great atmospheric shroud blocked much of the Sun prior to the great flood, which would have altered the strength and number of radioactive rays affecting earth.

(Uh, "radioactive rays" from space have nothing to do with radiogeology.)

If the earth were indeed billions (or even a million) years old, it would be spinning so fast that nothing would be able to survive on it, and it would be enveloped by the Sun's sheer mass.

I can't even wrap my mind around this one. Huh??

Re:Architect is not a verb. by Anonymous Coward · 2004-06-24 08:02 · Score: 0

He meant "write", but he used a word that means "design."

Cormack got Pwnt. by Anonymous Coward · 2004-06-24 08:07 · Score: 1, Insightful

The Article was necessary. It comes down to this glaring fact:
".... If you use a tool that is only 95% accurate to prepare a test for tools that are 99.5% accurate, then the lesser tool will appear to outperform the better tools whenever the better tools are correct. ...."

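The quoted claim can be put into numbers with a small worked example. It takes the quote's premise at face value (labels produced by the weaker tool and never corrected by a human) and further assumes a two-class problem in which the two tools make errors independently; neither assumption is confirmed by the paper itself.

```python
def measured_accuracy(tool_accuracy: float, labeler_accuracy: float) -> float:
    """Rate at which a tool agrees with labels produced by another tool.

    With two classes, the tool "matches the label" either when both are
    right or when both are wrong (assuming independent errors).
    """
    both_right = tool_accuracy * labeler_accuracy
    both_wrong = (1 - tool_accuracy) * (1 - labeler_accuracy)
    return both_right + both_wrong


# A 99.5%-accurate filter scored against labels from a 95%-accurate filter.
better_tool_score = measured_accuracy(0.995, 0.95)
print(f"better tool appears {better_tool_score:.2%} accurate")

# The labeling tool always agrees with its own labels, so it scores 100%.
```

Under these assumptions the better filter is charged with an error almost every time the labeling tool was wrong, so it appears to be around 94.5% accurate while the labeling tool appears perfect.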
the corpus was *not* classified by SA alone by jmason · 2004-06-24 08:09 · Score: 5, Informative

My $.02. disclaimer: I'm one of the SA developers.

"The Corpus was Classified by SpamAssassin, for SpamAssassin", and "The Accuracy of the Test Subject's Corpus is Questionable":

No, this is incorrect. Firstly, he states that he used user feedback to reclassify FNs and FPs (p. 4).

The misunderstanding probably comes from p. 6, where he notes that he also ran SpamAssassin 2.63 over the "gold standard" corpus once it was complete, to verify his original classifications.

However, in addition to that, he states 'all subsequent disagreements between the gold standard and later runs were also manually adjudicated, and all runs were repeated with the updated gold standard. The results presented here are based on this revised standard, in which all cases of disagreement have been vetted manually.' So in other words, the "gold standard" should be as near as possible to 100% accurate, since all the tested filters and the human classification have "had a shot" at classifying every mail, and the human has had final say on every misclassification.

In other words, if any misclassifications remain in the "gold standard" corpus, every one of the tested filters agreed on that misclassification.

IMO, that's as good as a hand-classified corpus can get.

"old versions of software were used":

It's unrealistic to expect the author to use the most up-to-date versions of filters available by the time the paper is made available to the public. That's the difference between results and a paper -- it takes time to analyze results, write it up and come to valid conclusions, once the testing results are obtained. IMO, the author can't be faulted for spending some time on that end of things.

Given that, using 6-month old release versions of the software under test seems reasonable.

SpamAssassin 2.60, when new SpamAssassin rules were last added to a released ruleset, is 9 months old (released 2003-09-22); so logically, in testing against DSPAM 2.8 (released 2003-11-26), DSPAM should therefore have had the edge. ;)

"test started with untrained filters":

IMO, that's the real world. People don't start with fully-trained filters.

In addition, the graphs on pp. 15-20 show accuracy over the course of the entire 8 month period, so "post-training" accuracy can be viewed there.

"spam in the test is as old as 14 months":

Nope, he states (p. 4) that the corpus uses mail between August 2003 and March 2004.

"it should purge old data":

SpamAssassin purges its Bayes databases automatically, based on the age of messages in the corpus. We call it "expiry".

In that test, the "SA-Standard" dataset would be using this, so stating "Cormack did not perform any purge simulation at all" is not accurate. However, that would not have increased SpamAssassin's accuracy figures, since we have generally found that while it keeps the overhead of Bayes database sizes and memory down, it marginally reduces accuracy, instead of increasing it (at the default settings).

(Also worth noting that it can deal with being run from an en-masse check over a static corpus, as it uses the timestamp information in the Received headers rather than the current system time. So even if this test was run in the course of 4 hours, it'd still be an accurate simulation of what would happen in "real world" use over the course of 8 months.)
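The point about timestamps can be illustrated with a minimal sketch of age-based token expiry. This is not SpamAssassin's actual expiry algorithm; the `expire_tokens` function, the token names, and the 14-day window are hypothetical. What it shows is the design choice described above: the cutoff is measured from the newest message's own timestamp, not from the system clock, so replaying eight months of mail in a few hours expires tokens the same way live use would.

```python
from datetime import datetime, timedelta


def expire_tokens(
    last_seen: dict[str, datetime],
    newest_message: datetime,
    max_age: timedelta,
) -> dict[str, datetime]:
    """Keep only tokens seen within max_age of the newest message."""
    # The reference point is the mail's own timestamp, not datetime.now().
    cutoff = newest_message - max_age
    return {token: seen for token, seen in last_seen.items() if seen >= cutoff}


# Hypothetical example: one recent token, one stale token.
newest = datetime(2004, 3, 31)
tokens = {
    "mortgage": newest - timedelta(days=3),
    "oldword": newest - timedelta(days=30),
}
kept = expire_tokens(tokens, newest, timedelta(days=14))
```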

And finally, what Henry said in comment 9520473.

--j.

Re:the corpus was *not* classified by SA alone by nathanh · 2004-06-24 14:44 · Score: 1

My $.02. disclaimer: I'm one of the SA developers.

SA has saved my bacon so many times I've lost count. You guys all deserve many free beers.

Re:Architect is not a verb. by Anonymous Coward · 2004-06-24 08:09 · Score: 0

Haven't you ever Googled something?

No, I haven't. I haven't used that stupid, moronic term "bling bling" either.

RBL (black lists) do not help with zombie systems by wintermute42 · 2004-06-24 08:27 · Score: 2, Insightful

I have noticed that black lists are indeed effective. Many spammers now use "bullet proof" spam hosts, so they use static domain names. However, there has been a marked rise in zombie systems sending spams. These are systems that are infected by viruses and then used as spam hosts. Since these systems come on line rapidly (when they are infected) and then drop out (when they are cleared of the virus or booted off their ISP) it seems unlikely that black lists will help.

At least in the spam stream I see, there is more than 1-2 percent of the spam flow from zombies. The best technique seems to be to use a black list first and then content filter.

On a related topic in the parent post:

In a previous post, in another discussion, I also suggested that the sophistication of spam filters like SpamAssassin, which use several algorithms to filter spam, would consume lots of system resources. Another poster wrote that these tools do not consume much in the way of processor and memory resources. This seems counter intuitive, but I don't have any contrary evidence.

Or better yet... by Anonymous Coward · 2004-06-24 08:35 · Score: 0, Informative

Just fucking try the software yourself. Quite simply, spamassassin blows, and this is the consistent opinion of the ~4000 people here who have been stuck with it for now. Testing out CRM114 and DSPAM on limited (100 each) groups of people is showing both to be an order of magnitude better than SA. I can't say which is better, but I can say for certain both are in a whole other league from SA, which lets in 1/20 or so spams, and likes to flag obnoxious HTML laden email from management types as spam, much to their disdain. Both of the statistical filters are much better, with test people seeing between 1/100 and 1/500 spams getting through, with only a handful of false positives.

Re:Or better yet... by Henry+Stern · 2004-06-24 13:51 · Score: 1

I'm sorry to hear that you're having so much trouble with SpamAssassin. I've heard some rumblings from the Faculty of Law at my university that SA makes a lot of errors on their e-mails. The disclaimers that they put in their signatures must be tripping some of the rules.

I'm trying to find a reliable method of personalising your scores without requiring you to download the 300MB corpus that we use to optimise the scores. I hope that, in the future, it will make SpamAssassin more to your liking.

P.S. Your system administrator really shouldn't be discarding those e-mails. The SpamAssassin documentation recommends that you only tag them and let the users write a filter that detects the X-Spam-Status header. You should ask them to tag the e-mails instead so that you can use DSPAM or CRM114, since they work so well for you.
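A user-side filter of the kind mentioned in the P.S. can be very small. The sketch below assumes the server only tags mail and that the X-Spam-Status value begins with "Yes" for messages it considers spam; the function name and the sample header values are illustrative, not taken from any real message.

```python
from email import message_from_string


def is_tagged_spam(raw_message: str) -> bool:
    """Return True if the message carries an X-Spam-Status of 'Yes'."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    return status.strip().lower().startswith("yes")


# Illustrative messages; the header values are made up for this example.
tagged = "X-Spam-Status: Yes, hits=7.2 required=5.0\nSubject: hello\n\nbody\n"
clean = "X-Spam-Status: No, hits=0.3 required=5.0\nSubject: hello\n\nbody\n"
untagged = "Subject: hello\n\nbody\n"
```

Anything for which `is_tagged_spam` returns True can be filed into a junk folder by the user's own tools instead of being discarded on the server, so false positives remain recoverable.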
Re:Or better yet... by Anonymous Coward · 2004-06-24 14:18 · Score: 0

I am my system administrator, and discarding what emails? What maildrop does with the email after SA has looked at it depends on the domain, and how those people want it to work. Don't feel bad about how well SA works, it's better than nothing, and in its time was a useful tool. It's just that it is no longer useful in comparison to much more accurate, and WAY faster tools. To use SA for so many people requires several machines, and I am pretty sure DSPAM will be ok with just a single machine.

We're going to end up moving to a DSPAM milter not just for the increased accuracy, and faster processing time, but also so that spam can simply be rejected during the SMTP connection, and significantly reduce the work the server has to do.
Re:Or better yet... by Kiryat+Malachi · 2004-06-24 15:41 · Score: 1

As far as I'm concerned, obnoxious HTML-laden email from management *is* spam.

--

---
Mod me down, you fucking twits. Go ahead. I dare you.
(I read with sigs off.)
Re:Or better yet... by Anonymous Coward · 2004-06-25 04:26 · Score: 0

That's nice, I live in the real world and don't care what you think of emails from the people who are paying for the fucking service. Telling people they are the problem doesn't help, it just makes them pay someone who isn't a cockface to take care of it instead of you.
Re:Or better yet... by Kiryat+Malachi · 2004-06-25 10:59 · Score: 1

I have two sets of management. My real management, who wouldn't use HTML if it bit them and send me things I need to see, and HR/corporate management, who send me HTML-laden monstrosities I could care less if I read.

In the real world, important email is only rarely HTMLized.

--

---
Mod me down, you fucking twits. Go ahead. I dare you.
(I read with sigs off.)

Re:It's a decent paper, but take it with some salt by Anonymous Coward · 2004-06-24 08:42 · Score: 0

Er, that article isn't a paper to try and prove the age of the earth; it's an article about why he is a Christian...or did you not read the part where he said it specifically wasn't about evolution.

Re:And to that... by Anonymous Coward · 2004-06-24 09:20 · Score: 0

So you wrote an article about displaying an image on a remote X server and people are supposed to be impressed?

Jesus - I'm lame, but you go way beyond that.

No, that's not it. by Anonymous Coward · 2004-06-24 09:50 · Score: 0

Posting as AC because I'm too damned lazy to make an account.

I believe he has very valid issues with Cormack's methodology, plus I can speak from personal experience that DSPAM is capable of very high accuracy rates.

DSPAM is being used on one of my domains, DeltaBravo.net, and after it finally reached its activation threshold, it is doing very, very well, much better than SpamAssassin's quoted accuracy. In the last week it's running about 99.4% accurate (over approximately 900 messages).

I think DSPAM may need a little more careful training than many filters (or at least more care in evaluating and selecting the correct options to use), but make no mistake, it's *very* good. I initially started out a bit disappointed with it, but it is now dead on the money at catching spam.

FWIW, I also run POPFile on my desktop to catch the few that still get through DSPAM. POPFile was extremely easy to train and well worth using if you don't have access to a server-based solution.

Just my opinion.

Worse, he doesn't make sense by Anonymous Coward · 2004-06-24 10:33 · Score: 0

He says "The bible is the oldest document in the world".

Aside from the fact that the collection as "the bible" is scarcely more than 1000 years old, it clearly is not the oldest document in the world.

I'm a christian, but that doesn't mean you take everything on faith. Faith is limited to things like existence of god, belief in an afterlife.

As soon as the bible makes a verifiable claim, I treat it like any other claim. God is perfect. The guys who wrote various texts in the collection known as the bible are human and prone to error, exaggeration and lies.

DSPAM. by asackett · 2004-06-24 11:04 · Score: 1

Honestly, the first time I read Cormack's paper I stopped partway through because his findings didn't jibe with my own experience. I've applied no scientific method to debunk his findings, and I don't care to -- I have other demands for my time.

I use and recommend DSPAM. Many of the accounts that are aggregated in my inbox have been exposed on the web and in Usenet for several years, so my spam load is probably about as high as anyone else's. No comparison testing analysis can change the fact that my inbox sees at most two spams per month (on a maturely trained DSPAM installation) and maybe one false positive every six weeks or so. DSPAM isn't the only tool in the box, but it's the only content filter, and it does what it's supposed to do.

If JZ got a little too personal in his rebuttal, I'll forgive him for it. I'd like to think that if I were in his shoes I'd show a bit more tact and restraint, but there's a pretty good chance that I wouldn't. I get all kinds of defensive about the work I've put my passion into, and can't really blame anyone else for doing the same.

--

Warning: This signature may offend some viewers.

Re:architect by reynhout · 2004-06-24 11:12 · Score: 2, Funny

> For the love of Cthulu, people, "architect" is a noun, not a verb.

Ya.

And for the love of Howard Phillips Lovecraft, "Cthulhu" is not spelled "Cthulu".

Duh.

Re:RBL (black lists) do not help with zombie syste by mabu · 2004-06-24 11:38 · Score: 1

Content-based filtering uses *exponentially* more resources than RBLs. RBLs just cause the mail server to close the connection; no further negotiation, no downloading of mail, no wasted port connections, no storage and memory overhead, no cpu overhead and all other resources necessary to examine the mail content.

Content-based filtering is a privacy issue as well.

The way I run my mail servers is with the utmost respect for the sanctity of our users' e-mail. We do not read their mail, even for the purpose of filtering spam. I consider this unethical personally, but not everyone thinks e-mail should be private.

A week of training! by Tom_Yardley · 2004-06-24 13:27 · Score: 1

You describe four complicated programs you configure and run as well as a week of training needed to run your anti-spam program. Assuming a very low hourly rate, that's $5,000.00 easy. Why use e-mail when for thirty seven cents you can have a man come to your office and take real mail anywhere you want him to? With real mail you can learn to throw the spam in the trash in under thirty seconds.

Re:RBL (black lists) do not help with zombie syste by wintermute42 · 2004-06-24 13:44 · Score: 1

I can see your point about privacy. It is true that once you allow something to read email it could be abused. But to balance this is the fact that, at least for me, email would be useless without a spam filter.

Privacy is not an issue in my case. I use text only email on Linux (email never touches my Windoz system for security reasons). I run a spam filter for my own email account, so it is my program that reads my email, not someone else's. I read my email on a shared Linux system run by the ISP that hosts my domain (my ISP is webquarry.com).

As far as I know, the RBL approach would not work in my case. I do discard some email on the basis of the domain name, which is far less efficient than the RBL. My spam filter keeps a log of some of the header information from the email it discards. A fair amount of spam is going through fixed domain names these days (e.g., like the infamous tekmailer).

One of the problems I had with the commonly used spam filters was that it was unclear to me how to install them in the case where I am simply piping my email to them. I was also concerned about resource usage, since I am using a shared system. So like a typical programmer I wrote my own spam filter in C++. It is probably 80 to 90 percent efficient. Enough spam still gets through that I'm going to take another look at SpamAssassin and see if I can get it to run with a "procmail" forward. It is just too time consuming to constantly hack the spam filter for the latest evil spammer trick (recently they have been sending spam to my email address from the other valid user on my domain, where I don't check content).

good point; wrong example by Anonymous Coward · 2004-06-24 13:47 · Score: 0

I agree with your statement; though 'Esse' (with capital E) is NOT a German word. 'Ich esse' is a verb in present form, first person, singular; the noun is 'Essen'.

Try hxxp://dict.leo.org for anything but serious translation work.

Don't train on old data by HermanAB · 2004-06-24 14:27 · Score: 1

Bayesian filters should not be trained on anything older than about 2 weeks and the test should be done with the mail from the next day.

Training on old stuff makes it worse. SpamProbe's author suggests purging words that have not been referred to in 2 weeks.

My SpamProbe setup handles thousands of messages per day and not ONE spam gets through in weeks and there are NO false positives in more than a year of use on hundreds of thousands of messages.

I estimate SpamProbe to be in excess of 99.5% accurate in eliminating spam and 100% accurate in accepting ham, but it depends 100% on how well you train the thing.

Garbage in, garbage out...

--
Oh well, what the hell...

Re:Architect is not a verb. by stoborrobots · 2004-06-24 15:11 · Score: 1

You can actually verb something... and you can architect a solution for any verbing problem, m'kay?

Both words with sufficient history to claim "Not Invented Here"

--
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco

Re:And to that... by calebb · 2004-06-24 16:47 · Score: 1

>> So you wrote an article about displaying an image on a remote X server and people are supposed to be impressed?

If you are able to read the paper (i.e., via a university IP address based subscription to J.Chem.Ed.), you'll see that the paper is 7 pages long and the supplemental information is 43 pages long.

It's a little more involved than you think. (i.e., the actual cryostat is connected to a Mercury-VX console computer that is capable of acquiring trillions of points per second). The Sun Ultra/10 workstation is connected to that console via a TCP connection. On the system administration side, it is incredibly complicated to remotely control an NMR spectrometer over the internet. I worked on this project for my M.S. in Chemistry and it took ~2.5 years to perfect it.

Of course, that's way off topic & I'm replying to a flaming AC... but now you know... and knowing is half the battle!!

reasons others have lower accuracy by pwarf · 2004-06-24 20:33 · Score: 1

Many users use e-mail differently than you might.

- For example, some people want the forwarded e-mail of top blonde jokes from their friends and relatives and others don't. I wouldn't mark these annoying forwarded messages as spam because I wouldn't want to risk associating friends' e-mail addresses with spam in the filter and I don't get that many e-mails like that.

- Mailing lists. My bank sends me annoying newsletters, but I may need to note a change in their user policies. Right now I just delete these based on the subject lines.

- Useful e-mails from friends citing good deals.

It just bodes... by Anonymous Coward · 2004-06-24 20:50 · Score: 0

And I think you are missing the reference to a Terry Pratchett novel :) That is why it was emphasized... as Gaspode states, it doesn't bode ill or well, but just generally bodes.

But thanks for your lesson in English ;-) I would like to be moved up a few years, though ;-)

Posting as Anonymous coward because I left my keyring at home.

Re:Why use "architect" - why not "write" by Stone+Pony · 2004-06-24 23:51 · Score: 1

According to my 1976 edition of the Concise Oxford Dictionary (I don't have ready access to the full-on OED in all its multi-volume glory), "Craft" is a noun or a verb, with its roots in Old English. No doubt the OED itself has citations going back centuries, but I don't have them. The OED's online edition is a subscription-only service, and I don't have one (and neither does anyone else who doesn't have GBP 195 + VAT to spend on a dictionary subscription).

Re:It's a decent paper, but take it with some salt by Anonymous Coward · 2004-06-25 15:07 · Score: 0

Well, take it with a huge grain of salt:

He says: I am a Born-Again Spirit-Filled Heterosexual Serious-About-God Christian (TM)

The document linked from the parent really tells a lot. He says:

Jesus Elicits a Reaction
When people swear, do you ever hear the words "Buddha Damn" come out of their mouth? Or how about "Allah Christ"? Instead, all of the profanities we hear which involve deity revolve around Jesus Christ.

Oh man! What a proof that one is.

Re:choosing good RBLs (fixed) by Anonymous Coward · 2004-06-26 18:55 · Score: 0

sbl-xbl.spamhaus.org,
Best one, keep at top.
blackholes.easynet.nl,
Has not been running in months, remove. (parts now at SORBS and NJABL)
relays.ordb.org,
Less than 1% hit rate, move down in check list.
list.dsbl.org,
Good hit rate, move to #2 on list.
ipwhois.rfc-ignorant.org
cn.rbl.cluecentral.net,
kr.rbl.cluecentral.net,
Others okay if you want.
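For anyone wondering what "order in the check list" means in practice, here is a rough sketch of how a DNS blacklist lookup generally works: reverse the IPv4 octets, append the list's zone, and do an ordinary A-record lookup. An answer means "listed", a failed lookup means "not listed". The function names `dnsbl_query_name` and `first_listing` are made up for this example, and real mail servers do this in their own config rather than with code like this.

```python
import socket

# Zones in the order suggested above (best first).
ZONES = ["sbl-xbl.spamhaus.org", "list.dsbl.org", "relays.ordb.org"]


def dnsbl_query_name(ip, zone):
    """Build the lookup name: reversed IPv4 octets followed by the zone."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    return ".".join(reversed(octets)) + "." + zone


def first_listing(ip, zones, resolve=socket.gethostbyname):
    """Return the first zone that lists ip, or None; zones are tried in order.

    `resolve` is injectable so the logic can be exercised without DNS.
    """
    for zone in zones:
        try:
            resolve(dnsbl_query_name(ip, zone))
            return zone  # any A record counts as "listed"
        except OSError:
            # Lookup failed; this sketch treats every failure
            # (including timeouts) as "not listed" and moves on.
            continue
    return None
```

Calling `first_listing(client_ip, ZONES)` stops at the first hit, which is why putting the list with the best hit rate at the top saves lookups.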

HTH

WRONG Re:RBL (black lists) do not help with zombie by Anonymous Coward · 2004-06-26 19:03 · Score: 0

Oh, but they do!

Try Spamhaus' XBL, you'll see, if you're anywhere close to me, 70-80% of all SMTP connections go bye-bye since they are spam coming from zombie systems.

I've yet to have a false positive. Impressive.

Their SBL targets the "static spamhausen"

Slashdot Mirror

229 comments