Response to Gordon Cormack's Study of Spam Detection
Nuclear Elephant writes "In light of Gordon Cormack's Study of Spam Detection recently posted on Slashdot, I felt compelled to architect an appropriate response to Cormack's technical errors in testing which ultimately explain why one of the world's most accurate spam filters (CRM114) could possibly end up at the bottom of the list, underneath SpamAssassin. I spend some time explaining what is a correct test process and keep my grievances simplified about the shortcomings of Cormack's research."
I set up many aliases for my official email address and gave all of them to spammers, and only to spammers.
So, whenever I get a mail more than 95% similar to a mail that I know is spam, I dump it.
Combine this with Apple's Mail.app Bayesian filter and there may only be a few spams left.
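A rough sketch of the "more than 95% similar" rule described above, assuming plain-text message bodies and using Python's standard `difflib` as the similarity measure. The poster does not say how similarity is actually computed, so the function names, the threshold constant, and the choice of `SequenceMatcher` are illustrative assumptions only.

```python
from difflib import SequenceMatcher
from typing import Iterable

# Assumption: "more than 95% similar" is read as a similarity ratio above 0.95.
SIMILARITY_THRESHOLD = 0.95


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two texts are identical."""
    return SequenceMatcher(None, a, b).ratio()


def looks_like_known_spam(message: str, known_spam: Iterable[str]) -> bool:
    """True if the message is more than 95% similar to any trapped spam."""
    return any(similarity(message, spam) > SIMILARITY_THRESHOLD for spam in known_spam)


# Hypothetical mail that arrived on one of the spammer-only aliases.
trapped = ["Buy cheap pills now! Visit http://example.invalid today and save big."]

print(looks_like_known_spam(trapped[0].replace("cheap", "cheep"), trapped))
print(looks_like_known_spam("Lunch at noon on Friday?", trapped))
```

Comparing every incoming mail against every trapped spam gets slow as the trap fills up, so a real setup would likely need some kind of hashing or indexing on top of this.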
Trolling using another account since 2005.
On the original forum, I was saying something similar (except not nearly as well written!! hehe)
DSPAM, IMHO, provides far better results than this report would lead you to believe. A properly trained Bayes filter, run by a somewhat intelligent person, provides simply amazing results. I swear I can go weeks on end without a single spam getting through, no false positives -- and between 20 and 100 SPAM in my "spam" box per day!
DSpam, using the Bayes algorithm, is by far the best filtering method I've used. And I've used a lot! (From SpamAssassin to SpamProbe and all the in-betweens). The only setback: DSpam takes a couple of weeks to train...
I mean, it's not even a second meaning. It's just plain English abuse. I hope this Zdziarski guy's paper is decent, since he's pretty much tripped my spam filter right out of the gate.
If guns kill people, then CmdrTaco's keyboard misspells words.
I usually frown when I see many of these so-called studies offering conclusions, several of which differ radically from my own experience. The recent Java/C++ performance one was a classic example. It gets annoying when a pro-MS result is immediately decried as marketing FUD because it just can't be better, and a pro-Linux result is taken as gospel truth here on /. Usually I tend to take all results with a grain of salt, or just plain ignore them and focus on the debate around them.
The benefit of these studies, though, is that, fanatical crap aside, informed people will usually take the time to interpret results or suggest corrections/improvements that actually benefit developers and improve their knowledge base more than any information provided by the actual study.
Do not try to read the dupe, that's impossible. Instead, only try to realize the truth
What truth?
There is no dupe
adb, you stole my post!
Jonathan,
Invest in a thesaurus. Or you can use this one. You might have used any of the following perfectly serviceable verbs: scratch, scrawl, scribble; draft, draw, make out, write down, write up, hatch, make, generate, or construct.
Please don't recast innocent nouns as verbs.
Thank You,
Peter
1. Those bots that post goat.cx are very annoying. 2. The author of CRM114 admits that not everyone gets the results others do. Some people get perfect handling, while others get very poor handling. He also claims that setup might have been a problem with the testing. On one hand the tester should have set the system up correctly, but on the other hand this just shows that it isn't a fool-proof system (yet).
This guy seems a little harsh and just a bit jealous of the success of Gordon Cormack's article. I'd like to know what makes his opinion any more valid than Gordon's.
Information on his professional career was very hard to find on the site.
This just seems like a flame because his software (DSPAM) didn't perform well in the test.
For any users of spamassassin's 2.x branch (2.63 is current as of this writing), we all know how dated its signatures are right now. When the 2.6 branch was first released, I got zero spam and 100% ham for the first few weeks. Now that 3.x is being integrated as an ASF and being apache-ized, updates have been slow and 3.x is still awaiting deployment.
Point being - I was darn surprised to see SA at the top of his charts.
Now - if only mimedefang would easily use another spam-checker....
First off: Your original posting was simply completely off topic. Where would we be if every message pointing out a grammatical mistake in a story got moderated +5? - slashdot would look like an English schoolbook.
Secondly: Your second posting is not only off topic, but also insulting and purely flaming.
I'm glad people like this work on keeping "Penis Enlargement" emails out of my inbox because I would just delete them instead of working that hard. I guess this is why I have email addresses for specific purposes (keeps the shit in the subscription addresses ~ yahoo/hotmail/etc...)
Overall an interesting read but it seemed he was a bit irritated with the other guy getting slashdot notoriety.
Haven't you ever Googled something? Haven't you ever input data into a computer? (The use of the word input as a verb is, of course, the result of verbing, and it's now considered acceptable usage.) In recent years it has become common in English to "verb" nouns. In fact, I just did it. English, like any other language, evolves over time.
To deny this fact makes you just another prescriptivist language maven, completely disconnected from reality and any sense of the advancement of human language.
Folks, don't listen to this dinosaur. He's not insightful, he's simply living in the past.
so the test procedure was bad ? ermmm, what about that GMAIL spammmeplease project ?
Go grab those torrents.
I just read the whole article - it does repeat itself a few times, but the author provides additional evidence each time his theses were reiterated:
1. Cormack is very inexperienced in the area of statistical filtering. Agreed!!!
2. Cormack went into the testing with many presuppositions. Also Agreed!!
And in case you're not familiar with the word presupposition:
1. To believe or suppose in advance.
2. To require or involve necessarily as an antecedent condition.
Overall, this is a very good article; Check it out if you haven't already done so!
Here is a zero dollar investment - install this with this extension.
No Sig for you.!
There's really very little to be said in favor of Jonathan A. Zdziarski's "defense". I guess it just amounts to him wanting to sell his product. Of course, I remember when CRM114 first came out, it was subject to some very dubious--or often simply incoherent--claims. It's pretty clear Zdziarski is in quite a bit over his head... not quite as bad as the amateurs who discover their own "breakthrough" encryption techniques, but tending in the same direction.
As near as I can tell (I skimmed, admittedly, I didn't read every word carefully), his defense amounts to "please don't test the different filters because..." Fill in what feature of the test MUST not be the same as the CRM114 users who get 99.95% accuracy. This is precisely the meaning of "special pleading" in rhetoric. Also the same argument about "if only he had tried the latest-and-greatest (even though we made our wild claims before that version came out, too)."
Cormack & alia make a reasonable best effort to test several tools; and as with any test, they make certain assumptions, and choose certain methodologies. Frankly, I find that a lot more useful than "just trust us, ours works best...but we can't quantify what 'works' means."
FWIW, I wrote an empirical study of different spam filters, way back shortly after the Paul Graham buzz:
I know my study is based on quite old tool versions by now. But AFAIK, it's one of the few that actually came at the comparisons from an unbiased viewpoint. Most figures are based on the "experiences" of the strongest proponents of a given tool (or occasionally from a strong detractor). I had/have no agenda for or against any particular tool, I was just curious.
2. One that plans or devises: a country considered to be the chief architect of war in the Middle East
I knew it! GW Bush is The Architect.
Oh wait, I forgot how articulate the Architect is supposed to be. Hrmm.
You're being purposefully dense.
To architect a response would imply careful consideration, artistic presentation, and stunning aesthetics. I don't necessarily agree that that's what he's done here, but obviously that is what he meant to convey with his choice of words.
And if you disagree with verbing words, you had better stop "inputting" data into a computer, or "Googling" for answers, or "bookmarking" links, or "forking" processes.
I tried some of these so-called filters, but none of them performed as well as my copy of Outlook 2003. It's so easy too: you just click on the hundreds and hundreds of messages then click "organise", then send them to junk mail. Tomorrow, you'll do exactly the same thing.
Thank goodness my IT dept. decided to upgrade us all from Eudora + Spamnix. It was awful not being able to see all those \/!agra and XXXh0t gurls advertisements.
Well said. HOWEVER, I have to agree with the poster who pointed out that using "architect" as a verb in the context of writing is a little out of place. If we're going to help the language grow, let's at least do so in useful ways. "Architect a solution to an engineering problem", sure, "architect a whiny, defensive rebuttal", no. If we're going to make it a verb let's at least have it relate somewhat to the noun.
Actually, I was trying to be Insightful, not Funny.
Haven't you ever input data into a computer?
Why is the readability of that sentence poor?
This post contains benzene, nitrosamines, formaldehyde and hydrogen cyanide.
completely disconnected from reality
I might have believed you if you had said he was partially disconnected, or somewhat disconnected, but completely disconnected makes you look rather, umm, anal retentive.
In recent years it has become common in English to "verb" nouns.
But "verb" isn't a verb, it's a noun! You can't "verb" something, or go around "verbing" things...check it out here.
He could have said someone "tasked" him to "architect" a response. :o)
No, he's promoting the correct use of English which promotes inclusivity. We all know where we stand. By designing (or should I say architecturizing) your own rules you begin to exclude groups of people, such as those whose first language is not English. It's elitism, nothing less.
--
This sig is inoffensive.
RTA?
;-)
Read the article, then post!
There's really very little to be said in favor of Jonathan A. Zdziarski's "defence?"
Now, I could start posting how ignorant that statement is, but then I'd just be rewriting Zdziarski's article. Cormack's entire test was flawed - He used SpamAssassin (95% accuracy) to create his 'ham' corpus. He used software versions that were 6+ months old. Even the email address he used for testing is incredibly unique and atypical! (He uses an address that he's had for 20+ years; one that has been posted all over the WWW numerous times. An address that has many forwarders pointing to it. How is that typical in any way??)
Ok, go read the article (don't just 'skim' it, as you mentioned), then come back and tell me why you believe he is only trying to 'sell' his product.
Please back up your claims with some evidence this time
Then the Germans had better stop joining words together however they please. It creates these big, long words which are incomprehensible to non-native speakers. They seem to do it willy-nilly!
That's what you're saying. Right?
It's elitism, nothing less.
Prescriptivism is the only thing elitist here.
What this has to do with the guy's spam filter is a mystery to me, though.
Quite why everybody's suddenly noticed the abysmally low standard of English grammar around here, or what this has to do with spam filters, is beyond me, though.
I prefer using the original CRM114 discriminator and its host platform on spammers. If you're not familiar with the original CRM114 and its delivery platform, it was featured in the following movie... http://www.imdb.com/title/tt0057012/combined
There is no God, and Dirac is his prophet.
No, you just restated it the way he should have written it in the first place!!
I don't find those points quite as damning as you do, but your presentation of them is a zillion times more persuasive and less juvenile-sounding than "Many misled CS students, Ph.Ds, and professionals have jumped on the spam filtering bandwagon with the uncontrollable urge to perform misguided tests in order to grab a piece of the interest surrounding this area of technology to score credits or popularity"...and on and on and on...
It's more established as a word, but take a look at the root of the word "crafted". Is it a noun? You tell me.
I wouldn't mind verbing so much if the right usage hadn't been drilled into me as a kid...but "verbing" the word "architect" is not a language advancement. It's a sloppy shortcut normally used in buzz-speak (that's why you almost never hear it in everyday English, but so often in computer- and business-related fields). It's ambiguous and makes English even more difficult to understand than it is already. The fact that enough people complained about it for this thread to occur shows that in fact, it is not "just a prescriptivist language maven" who hates this, but everyone who cringes when someone writes "effect" for "affect", or says "irregardlessly". Most verbing is the result of a current fad, and anyone who's over ten years old knows how fast most fad-generated words disappear.
To quote Bill Watterson (who, AFAIK, created the word "verbing" as "to make a noun into a verb"), "Verbing weirds things."
There's no sig like this sig anywhere near this sig, so this must be the sig.
There are several warning signs in this article.
That said, he does raise a few valid points, such as the timeline:
Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
I'm not happy about this: first he says that this account has an abnormally high spam ratio, and then he says that a normal user can have 60%. Where do we get these figures from, I would like to know, as my average is pushing up against 100%. I don't think that there is such a thing as an average user; some people seem to get nearly no spam and the rest of us get almost complete spam.
Reviewing today's inbox reveals around 200 emails, of which 8 were legit. You do the maths, I would be making progress if it was only 81%.
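Doing the maths as invited, using only the two figures in the post (about 200 emails, 8 of them legitimate); the variable names are just for illustration.

```python
# Figures from the post: roughly 200 emails today, 8 of them legitimate.
total_messages = 200
legit_messages = 8

spam_messages = total_messages - legit_messages
spam_ratio = spam_messages / total_messages

print(f"{spam_messages} of {total_messages} messages are spam: {spam_ratio:.0%}")
```

That works out to 192 spam out of 200, or 96%, which is well above the 81% the poster would count as progress.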
Oh boy he goes on and on, if ever you wanted to cut out the spam in an article...
His main points (at least the ones I agreed with):
1. No training period, many features only turn on after lots of real emails have been processed. Fair enough.
2. No purge window, stale emails get purged over time (e.g. 4 months), but in a test everything is shoved through at once (in minutes) and so nothing gets purged. Again fair.
The rest of it complains about the tester, or complains that it was less than ideal conditions & settings for the particular filter.
We call that 'the real world' here.
Sys admins are not experts in configuring filters.
Also he should realise that any new filter gets a better rating than the dominant filter. Spammers try to defeat the most popular filter of the day. So sure, a new filter might perform better than an existing one *initially*, simply because the spammers aren't targeting it yet. Until it becomes dominant, and then the spammers adjust the spam to defeat the new dominant filter.
So in the real world the data set will always be unusual because the spammers make it that way.
I personally think to "architect" something 'sounds' right and it's obvious and unambiguous in what it means. The grammar nazi is right though and it is incorrect. Input *is* a transitive verb. However verbing sounds like something simply offensive and shouldn't be done in public.
The language evolves, but slowly as everyone needs to be able to keep up. This is the problem with Open Standards: creating a stable API can sometimes slow or stifle innovation
Phillip.
I swear I can go weeks on end without a single spam getting through, no false positives -- and between 20 and 100 SPAM in my "spam" box per day!
This is what I don't get - in order to be sure you have no false positives, you have to comb through all of the spam by hand, which for the most part defeats the purpose of a spam filter. If you don't do so, then you can't claim zero false positives - you can only claim that you haven't _noticed_ any false positives.
I have a whitelist at work, and it works quite well, but combing through and emptying the spam bucket is still an annoying part of each day.
However, without doing so, I'll never know if I missed that one message in (about) a thousand that's from a vendor that's not in my whitelist.
QOTD: "I don't have a solution, but I do admire the problem.".
What's the difference? You already said it. Architecting sounds a hell of a lot more pretentious. Almost like "crafting" a response, except it's not a verb.
I agree this seems nit-picky, but the misuse of "architect" is actually only the tip of the iceberg. This article is so chock full of misused words, awkward sentence construction, and serious grammar problems that I found it distracting and difficult to read. I guess this is what they mean when the liberal arts folks deplore the poor writing skills of many geeks. This guy really needs an editor. And when he mentioned that he is also writing a book, I just shuddered.
I read Usenet for the articles.
Zdziarski claims Cormack mainly used Spamassassin to classify the corpus into the ham and spam groups.
If this is true then to me this is a critical flaw in Cormack's methodology.
Not saying there are, or aren't other flaws. But this to me is the main one to consider. Zdziarski should have just put this at the top of his response, instead of putting a lot of waffle about stuff that does "not appear to have been a problem with Cormack's tests".
postage-based email?
17779 eligible voters in a district, 17779 'vote' as one. This is Russia.
TWW
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
The use of "architect" as a verb isn't even recently invented: Keats wrote "This was architected thus By the great Oceanus" in 1818.
Yes.
"Verb" is a noun. Until it's verbbed. "Verbbed" is a verb until it is nouned. Nouning "verbbed" makes "noun" a verb, when previously, it was a noun.
Does that clear things up for you?
a professor of computer science vs. a guy who goes by the moniker "nuclear elephant" (and whose educational credentials are rather dubious).
hmm.
"You mean like any other normal person who might be wanting to use such a product?"
...nevermind, I don't need to say anything else.
And to that, I would say... Someone writing an article for publication in a peer-reviewed journal should become experienced in their area of research before attempting to publish their results!
For example, I'm sure you don't have much experience with Nuclear Magnetic Resonance imaging - And you might or might not have experience with X11 forwarding. But unless you are fluent with both of those topics, I would not expect you to attempt to publish a paper in a peer-reviewed journal discussing those topics!
(Like I did, last December)
However, for the sake of presenting some evidence to back up what I'm saying here, I'll take your example of Consumer Reports.
From their site: CR has the most comprehensive auto-test program and reliability survey data of any U.S. publication; its auto experts have decades of experience in driving, testing, and reporting on cars.
Unfortunately, the most important point is buried in the article.
;-)
Cormack builds a list of spam and a list of ham using SpamAssassin. He then tests the accuracy of the products by comparing them to the SA lists. So in testing the filters, if you don't agree with SpamAssassin, then you're wrong.
Gee, it's hard to figure out why SA won.
If you mean being proud of knowing that "architecting" was not even close to being the right word, then I'm proud, sure.
Language does evolve over time and new words do come into usage, but how does that mean that just picking words at random and using them instead of already existing, perfectly adequate, words is not pointless, unclear, and pretentious?
To deny this fact makes you just another prescriptivist language maven, completely disconnected from reality and any sense of the advancement of human language.
Toggle bus area salty Jehovah wash ribbed.
Did you not understand that I meant "I totally agree with what you said"? How very prescriptive of you!
You put a meaningless jumble of words together. "Architect" in this context was anything but meaningless. If you can't figure out what was meant, that indicates a lack of brain power on your part, nothing more.
As far as I understand, Cormack accepted that he was testing only on one person's corpus, and qualified his findings as such.
This is something that is featured throughout the rebuttal - an argument that runs:
a) Such and such was done incorrectly
b) Therefore the system was inaccurate
c) Therefore CRM-114 is better than stated
The ultimate point where I lost patience was where he claimed that the results were invalid because they didn't conform to accepted, real world knowledge. The study was empirical; it shows something, based on how it was set up; and what it shows is valuable. If you discarded results each time they contradicted agreed wisdom we would still think of a geocentric universe.
Exercise your right not to vote. thinkoutside.org
For the love of Cthulhu, people, "architect" is a noun, not a verb.
Languages are dynamic, not static. If enough people begin to use 'architect' as a verb, then it is a verb. I have a strong hunch that 20 years from now, the verb form of architect will appear in Merriam-Webster...
// TODO: Insert Cool Sig
I propose a little test of my own...
I feel the Plain English Campaign offers a useful guide "We define plain English as something that the intended audience can read, understand and act upon the first time they read it". So, perhaps you are right for the majority of people. But I had to pause a while and think about what michael meant.
I agree with you that some nouns become accepted as verbs.
POPFile actually publishes statistics from real users. If the user is willing, POPFile sends back accuracy information to a central server, and then a nightly cron job analyzes it and publishes the information on the web for all to see.
No need to read a study, or even the author's opinion. No wild claims made, just real data.
Here it is:
http://www.usethesource.com/popfile_stats.html
Shows that POPFile has an _average accuracy_ of 95% over all users, including the training period. After it's seen 500 emails it has an accuracy of 97%. And the average POPFile user has 5 categories of classification.
John.
I'm stuck with windows at work. I can't get any extensions to install. :-(
I've clicked on the install links at texturizer, but when I restart I still just have the DOM thingie all by itself in my extensions manager.
I think that Firefox 0.9 wasn't quite ready with the new extensions model.
Oh well.
-Peter
I don't claim to have done any scientific studies on the subject, but I have tried a number of different anti-spam solutions over the past few years. In my experience, the best solution is a multi-pronged approach that takes advantage of the strong points of a few setups.
If you want to talk about the results from a single filter in my current arsenal, I would give DSPAM the highest marks. I found it to catch more spams than a trained and customized SpamAssassin with no false positives. It's also very fast, unlike SA. My current setup is as follows...
1) RBLs via Postfix. I probably block 80% of inbound spam this way. I choose my RBLs carefully to limit false positives.
2) DSPAM. I typically get better than 99% of the ones that slip through the RBLs with DSPAM.
3) A complex procmail.rc that uses some statistical rules and a few simple checks, such as "is the mail addressed to me". I also use procmail to sort my mailing list messages into IMAP boxes and it includes a simple whitelist.
4) Spamassassin. This doesn't run much anymore, but I keep it around anyway as a last resort checker. If a mail makes it through all the above, SA gets a shot at it.
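The four-stage setup listed above can be pictured as a chain in which the first stage with an opinion decides the outcome. This is only a sketch under assumptions: the stage functions below are stand-ins with made-up rules and test data, not the real RBL, DSPAM, procmail, or SpamAssassin interfaces.

```python
from typing import Callable, List, Optional, Tuple

# Each stage returns "spam", "ham", or None when it has no opinion.
Stage = Callable[[dict], Optional[str]]


def rbl_stage(msg: dict) -> Optional[str]:
    # Stand-in for an RBL check; the listed address is made-up test data.
    return "spam" if msg.get("sender_ip") in {"192.0.2.66"} else None


def statistical_stage(msg: dict) -> Optional[str]:
    # Stand-in for a trained statistical filter; "score" is a hypothetical field.
    return "spam" if msg.get("score", 0.0) > 0.9 else None


def procmail_stage(msg: dict) -> Optional[str]:
    # Stand-in for the simple whitelist in the procmail rules.
    return "ham" if msg.get("from") in {"friend@example.org"} else None


def last_resort_stage(msg: dict) -> Optional[str]:
    # Stand-in for the last-resort checker; one toy keyword rule.
    return "spam" if "viagra" in msg.get("subject", "").lower() else None


PIPELINE: List[Tuple[str, Stage]] = [
    ("rbl", rbl_stage),
    ("dspam", statistical_stage),
    ("procmail", procmail_stage),
    ("spamassassin", last_resort_stage),
]


def classify(msg: dict) -> Tuple[str, str]:
    """Return (verdict, deciding_stage); mail nobody flags is delivered as ham."""
    for name, stage in PIPELINE:
        verdict = stage(msg)
        if verdict is not None:
            return verdict, name
    return "ham", "default"


print(classify({"sender_ip": "192.0.2.66"}))
print(classify({"subject": "Cheap VIAGRA"}))
```

Recording which stage made the call is what makes it possible to track each filter's performance separately.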
I tried using SA as my only post-RBL filter for a couple of months, but it wasn't getting the job done. I then added the procmail script, but still wasn't happy. Putting DSPAM in front of it all seems to work best for me. I now find that only a few spams per month make it past DSPAM (they sort into separate boxes so I can track their performance) and I haven't seen a false positive in quite some time, over a month anyway. I've only been using DSPAM for a few months.
What works for me may be crap for you. Try a few things till you find something that works for you and use that. If you're trying statistical filters, keep in mind that it takes a while to train them. I found I got better than 90% with DSPAM after a small corpus feed and about a week of training.
The author 'architected an appropriate response'. Presumably this is a lot better than simply replying?
I'd advise the author not to use the word "percept", because he doesn't know what it means.
I'd advise the author not to use the word "someodd", because dictionary.com doesn't know what it means.
As for "very unique"...
A pizza of radius z and thickness a has a volume of pi z z a
Not to mention the fact that he neither "architected" nor designed, but simply wrote....
Boys from the City. Not yet caught by the Whirlwind of Progress. Feed soda pop to the thirsty pigs.
Personally I find "Nuclear Elephant"s writing ridiculous. Read this article about how terrorists are going to use data centers for their next attack.
yes
It's easier to fight for one's principles than to live up to them.
Architect is generally heard as a noun *now*. It originated as a verb.
Alright, genius: did he mean "write" or "design"? And why was not using one of those an appropriate choice?
As the author of this article states OVER and OVER, it is REALLY EASY to mess up your filters, and it is very tedious (with lots of permutations) to properly build your corpus. For a centralized spam filtering solution, the goals are:
1. Insulate the users from spam
2. Insulate the users from "administration"
3. Do no harm (no false positives)
For these goals, I would take a "dumb" filter, set it conservatively, and hope for an 80% catch rate and zero false positives. DSpam has a complicated workflow that requires EACH AND EVERY end user to complete a feedback loop. This is WAY too much to expect from people who are barely capable of finding Google. Unless the ONLY access to the mail is web-based, with a VERY clear "This is Spam" button, Bayes is a sysadmin's nightmare.
My only gripe w/ SpamAssassin is performance. If I could get SPAMD to analyze headers in 25ms instead of 2000ms I'd never look back. As it is, DSPAM's performance has me very jealous.
----- Refactoring is the reason why man does not mistake himself for a god.
To repeat a comment I made just above. From his original test paper:
"The test sequence contained 49,086 messages. Our gold standard classified 9,038
(18.4%) as ham and 40,048 (81.6%) as spam.
The gold standard was derived from
X's initial judgements, amended to correct errors that were observed as the result
of disagreements between these judgements and the various runs."
From this I got that:
1. He had an initial set of Spam judged by person X. (e.g. 99.84% accurate).
2. That he ran it through each test filter.
3. That discrepancies were analysed by hand to get to the golden 100%.
So it's not SpamAssassin that generated the gold standard; person X did, with corrections from the *runs* (i.e. a composite of all the filters as adjudicated by person X).
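To make that reading of the paper concrete, here is a small sketch of the three steps above, followed by a check of the quoted corpus figures. The function names and the toy labels are hypothetical; only the counts 49,086, 9,038 and 40,048 come from the quoted passage.

```python
from typing import Dict, List


def disputed_messages(initial: Dict[int, str], runs: Dict[str, Dict[int, str]]) -> List[int]:
    """IDs where at least one filter run disagrees with X's initial judgement."""
    return sorted(
        mid
        for mid, label in initial.items()
        if any(run.get(mid, label) != label for run in runs.values())
    )


def amend(initial: Dict[int, str], corrections: Dict[int, str]) -> Dict[int, str]:
    """Gold standard = initial judgements with the hand-reviewed fixes applied."""
    gold = dict(initial)
    gold.update(corrections)
    return gold


# Toy data: X's first pass, two filter runs, and one correction after review.
initial = {1: "spam", 2: "ham", 3: "spam"}
runs = {
    "filter_a": {1: "spam", 2: "spam", 3: "spam"},
    "filter_b": {1: "spam", 2: "ham", 3: "ham"},
}
to_review = disputed_messages(initial, runs)  # messages a human looks at again
gold = amend(initial, {3: "ham"})             # X decides message 3 was really ham

# Sanity check on the figures quoted from the paper.
total, ham, spam = 49_086, 9_038, 40_048
print(to_review, gold)
print(f"ham {ham / total:.1%}, spam {spam / total:.1%}")
```

Only the disputed messages get a second look here, so an error that every filter shares with the initial judgement would never surface.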
Irritation is a perfectly reasonable reaction. It is not, however, constructive to vent the irritation in response.
Somehow came not long after this:
Something I learned from girlfriend #4: validate feelings. Yes, the Nuclear Elephant was hurt. He's right to be hurt. But no, lashing out is not adult, it is not constructive.
To characterize other researchers as ignorant, wagon-jumping glory hounds with poor self-control does not encourage cooperation.
He launches rockets ... He develops 3D game engines ... He analyzes spam trends ... Is there anything this Carmack guy can't do?
What'd you say?
Cormack?
Nevermind...
OBVIOUSLY, SpamAssassin is going to agree with SpamAssassin being the best.
What the test really did was determine how close to SpamAssassin the other spam detectors were, not how good they were at detecting spam.
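The circularity being claimed here can be shown with toy numbers. This illustrates the poster's argument only; whether the study's gold standard was really built this way is disputed elsewhere in the thread. All labels and names below are invented.

```python
from typing import Dict


def agreement(predictions: Dict[int, str], reference: Dict[int, str]) -> float:
    """Fraction of reference labels that the predictions match."""
    matches = sum(1 for mid, label in reference.items() if predictions.get(mid) == label)
    return matches / len(reference)


# Invented ground truth and two invented filters.
truth = {1: "spam", 2: "spam", 3: "ham", 4: "ham", 5: "spam"}
filter_a = {1: "spam", 2: "ham", 3: "ham", 4: "ham", 5: "spam"}   # misses message 2
filter_b = {1: "spam", 2: "spam", 3: "ham", 4: "ham", 5: "spam"}  # matches the truth

# Scored against the truth, filter_b is the better filter.
print(agreement(filter_a, truth), agreement(filter_b, truth))

# Scored against filter_a's own output, filter_a looks perfect and filter_b looks worse.
print(agreement(filter_a, filter_a), agreement(filter_b, filter_a))
```

If one filter's output is the reference, the score measures closeness to that filter, which is exactly the complaint.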
Maybe I'm a hardcore geek, but I do do exactly what Gordon does -- have several accounts feeding a `master' mail account, using addresses I've owned for over a decade. I also post to Usenet and mailing lists with my unobfuscated mailing address -- I want people to be able to reach me, and I refuse to let the spammers take that away from me.
And I think I'm very sane, thank you.
I agree. That's an absurdly *small* amount. I personally receive over 1500 spams/day -- so I'd have 49,000 in under a month. Obviously the amount of spam I receive is because I set myself up as a target, but I'm hardly the only one. Even Jonathan's email address is clearly listed on his page, unobfuscated, so he's doing it too, at least to some degree. (As a piece of anecdotal evidence, Spamassassin catches all but about 4/day of the spams I get, and false positives are extremely rare. Of course, I have spent a good deal of time tweaking SA to work best with my email, and it now works very well.)
That sounds fine in theory, but in practice it's hard to do. How many people from all non-geek walks of life save *all* their email, including spam, and are willing to give it to you so you can analyze it? And merely capturing all their email won't do it -- they need to categorize it for you, because they're the only ones who can reliably decide what's spam *for them* and what's not.
I do agree that the study had more than its share of issues, but this critique goes way over the top.
English does evolve, and good writers sometimes repurpose words to great effect. Alas, judging by the rest of the reviews here, our hero is NOT a good writer -- having built a shoddy and ramshackle outhouse, he proudly crowns himself the architect of it.
As for all those people who shout "prescriptive grammarian!", I often suspect they're just too lazy to learn to write well, and have decided that claiming that rules are passe is an effective workaround.
Everybody's a libertarian 'till their neighbour's becomes a crack house.
I'm using lists.dsbl.org, relays.ordb.org, and sbl.spamhaus.org .
Which are you using?
When self-proclaimed pundits do these studies, they should also factor into account the exponential increase in resources needed to accept and filter the mail's content. This results in more memory, faster machines, slower mail service and more deferred mail and reduced performance overall of everything else that might be done on that server.
Contrast this with the effectiveness of RBLs, which block spam based on the source and immediately cut off the huge resource requirement needed by these "filters".
By my analysis, at BEST, there is little more than a 1-2% difference in spam-catching ability between a well-tweaked RBL setup, and a content-based system. With the exception of the content based system consuming tremendously more resources and further delaying mail service.
It seems to me, if you have unlimited resources and you also want to employ content-based filtering for other means, that's the way to go. For everyone else on the planet who wants fast, reliable mail service without having to spend a fortune in hardware to handle traffic you shouldn't have, a well-selected set of RBLs is the superior approach.
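The resource argument above rests on how a DNS blacklist is consulted: one DNS lookup on the connecting IP address, made before any message body is transferred. As a rough sketch of that lookup (IPv4 only; the function names `dnsbl_query_name` and `is_listed` are made up for illustration, and the zone is simply one of those named in this thread):

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone.

    DNSBLs are queried by reversing the octets of the address and
    appending the zone, e.g. 192.0.2.99 -> 99.2.0.192.<zone>.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the query name resolves, i.e. the address is listed.

    An unlisted address yields NXDOMAIN, which surfaces as socket.gaierror.
    Note that this sketch also treats DNS failures as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


print(dnsbl_query_name("192.0.2.99", "sbl.spamhaus.org"))
```

A mail server would make this check at connection time and drop the session on a hit, which is why no message content ever needs to be stored or scanned for blacklisted sources.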
You're being unintentionally obtuse.
"What's the difference..." is a rhetorical question used to highlight the frivolity and pretension of the term.
BTW, to fork is a long-accepted verb.
But you can 'verbalize' all you want. Heck, I did it twice before breakfast.
What if that mime really is trapped in a box?
"architect" isn't a verb, and anyone who uses it as such should be shot (especially since there's no real designing taking place when writing a paper). What's so bad about just saying you "wrote" it?
You can't "verb" something
Sure you can, but verbing weirds language.
Daniel
Hurry up and jump on the individualist bandwagon!
There's nothing unwrong about prescriptivisationism.
--
This sig is inoffensive.
...this guy seriously believes the earth is a scant 10000 years old. And he dismisses all evidence to the contrary without a thorough explanation. I can't help but wonder if he treats other people's research with the same disregard.
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
but maybe articulate would fit.
At least it starts with the same letter.
You input data. You don't input input.
Well, it's all well and good, but you lessen the likelihood that you'll click-delete the wrong message when they're all in your inbox, not yet sorted. (Incidentally, statistical filters are great for sorting mail, period.) I get a LOT of email; I'd be lost without it.
I just check the junk mail folder less often than my inbox. And I do get false positives, but it happens infrequently enough that it's not an issue.
He said he wasn't an expert. So of course he'd be forced to make that conclusion. He cannot scratch his itch because he cannot reach it.
This is the kind of response he was talking about that does no good. Rather, you should acknowledge that the area is weak and that more focus needs to be given there in the future.
(Incidentally, I'm interested in OSS in the GIS field. Any ideas/good pointers? Anyone?)
We encourage interested parties to read our paper and our points of fact re Zdziarski.
Thomas Lynam
Gordon Cormack
June 24, 2004
.. the point of writing papers is to get review comments, and it is part of a scientific process to improve the quality of research results. In that respect, Slashdot is doing a very good job.
And so was Cormack when he put out his ideas for feedback. And so were you in formulating feedback.
I am always confused by the omission from these tests of collaborative filters like Cloudmark's SpamNet, which I have used at work for a long time with a very high "catch" rate, no real processing time, and no false positives. Essentially, every email you get it hashes and checks with the server. If you get a spam, you right-click and report it as such. Then it pulls any messages from your inbox which enough credible people have marked before you. (A gross oversimplification, but close enough.)
I feel like at our current stage of technological development, you have to combat human-generated deception with human intervention.
(By the way, that Cloudmark tool is Outlook-only, but contains some concepts I'd like to see in other filters...)
I remember going through the CRM114 installation docs, and vividly remember the 20 or so steps that I had to go through, and after about 3 or 4 hours of trying to get it installed, I finally gave up. I think part of the goal of software design is to make your software so that people will be able to quickly install and use it. The author of this program lost sight of this important point. I'm not going to sit there and reverse engineer some esoteric codebase just to get it working, and I'm sure a lot of other people feel the same way. Therefore, I use SpamAssassin among other things, and it works really well and was quick and relatively painless to get working. I didn't have to go through their source code to figure out how to get it installed.
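For anyone curious how the hash-and-report idea described above might look in code, here is a toy sketch. It is emphatically not Cloudmark's actual protocol or fingerprinting scheme: the whitespace/case normalization, the `ReportServer` class, the per-user trust weights, and the threshold are all assumptions made up for illustration.

```python
import hashlib
from collections import defaultdict


def fingerprint(body: str) -> str:
    """Hash a message body after collapsing whitespace and case."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ReportServer:
    """Toy stand-in for the central server (hypothetical design)."""

    def __init__(self, threshold: float = 2.0):
        self.threshold = threshold             # trust needed before blocking
        self.trust = defaultdict(lambda: 1.0)  # per-reporter credibility
        self.reporters = defaultdict(set)      # fingerprint -> reporting users

    def report(self, user: str, body: str) -> None:
        """Record that `user` flagged this message as spam."""
        self.reporters[fingerprint(body)].add(user)

    def is_spam(self, body: str) -> bool:
        """True once enough distinct, trusted users have reported it."""
        users = self.reporters.get(fingerprint(body), ())
        return sum(self.trust[u] for u in users) >= self.threshold


server = ReportServer()
server.report("alice", "Buy   CHEAP pills now")
server.report("bob", "buy cheap pills now")
print(server.is_spam("Buy cheap pills NOW"))
```

Storing reporters as a set means one user clicking "report" repeatedly cannot push a message over the threshold alone. A real system would need fingerprints that survive the per-recipient randomization spammers add, which an exact hash like this does not.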
(Uh, "radioactive rays" from space have nothing to do with radiogeology.)
I can't even wrap my mind around this one. Huh??
He meant "write", but he used a word that means "design."
The article was necessary. It comes down to this glaring fact:
"... If you use a tool that is only 95% accurate to prepare a test for tools that are 99.5% accurate, then the lesser tool will appear to outperform the better tools whenever the better tools are correct."
My $.02. disclaimer: I'm one of the SA developers.
"The Corpus was Classified by SpamAssassin, for SpamAssassin", and "The Accuracy of the Test Subject's Corpus is Questionable":
No, this is incorrect. Firstly, he states that he used user feedback to reclassify FNs and FPs (p. 4).
The misunderstanding probably comes from p. 6, where he notes that he also ran SpamAssassin 2.63 over the "gold standard" corpus once it was complete, to verify his original classifications.
However, in addition to that, he states 'all subsequent disagreements between the gold standard and later runs were also manually adjudicated, and all runs were repeated with the updated gold standard. The results presented here are based on this revised standard, in which all cases of disagreement have been vetted manually.' So in other words, the "gold standard" should be as near as possible to 100% accurate, since all the tested filters and the human classification have "had a shot" at classifying every mail, and the human has had final say on every misclassification.
In other words, if any misclassifications remain in the "gold standard" corpus, every one of the tested filters agreed on that misclassification.
IMO, that's as good as a hand-classified corpus can get.
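To make both sides of this argument concrete, here is a rough sketch. `apparent_accuracy` shows what the critique worries about: scoring a filter against an unvetted, imperfect reference (assuming binary labels and independent errors, which is my simplification). `needs_review` and `adjudicate` model a single pass of the vetting step described in the paper, in which a human rules on every disagreement. All function names and the sample data are hypothetical.

```python
def apparent_accuracy(true_accuracy: float, reference_accuracy: float) -> float:
    """Accuracy a filter appears to have against an imperfect reference.

    With binary labels and independent errors, the filter agrees with the
    reference when both are right or both are wrong.
    """
    both_right = true_accuracy * reference_accuracy
    both_wrong = (1 - true_accuracy) * (1 - reference_accuracy)
    return both_right + both_wrong


def needs_review(gold: dict, runs: dict) -> set:
    """Message ids where at least one filter disagrees with the gold label."""
    return {
        msg_id
        for labels in runs.values()
        for msg_id, label in labels.items()
        if gold[msg_id] != label
    }


def adjudicate(gold: dict, runs: dict, human_label) -> dict:
    """Return a revised gold standard; a human decides every disagreement."""
    revised = dict(gold)
    for msg_id in needs_review(gold, runs):
        revised[msg_id] = human_label(msg_id)
    return revised


# A 99.5% filter scored against an unvetted 95% reference looks like ~94.6%.
print(round(apparent_accuracy(0.995, 0.95), 4))

gold = {"m1": "spam", "m2": "ham", "m3": "ham"}
runs = {
    "filter_a": {"m1": "spam", "m2": "spam", "m3": "ham"},
    "filter_b": {"m1": "spam", "m2": "ham", "m3": "ham"},
}
print(needs_review(gold, runs))
```

Only `m2` gets a second look here. If `m3` were really spam, the error would survive, but only because every filter agreed with the wrong label, which is exactly the residual case described above.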
"old versions of software were used":
It's unrealistic to expect the author to use the most up-to-date versions of filters available by the time the paper is made available to the public. That's the difference between results and a paper -- it takes time to analyze results, write it up and come to valid conclusions, once the testing results are obtained. IMO, the author can't be faulted for spending some time on that end of things.
Given that, using 6-month old release versions of the software under test seems reasonable.
SpamAssassin 2.60, when new SpamAssassin rules were last added to a released ruleset, is 9 months old (released 2003-09-22); so logically, in testing against DSPAM 2.8 (released 2003-11-26), DSPAM should therefore have had the edge. ;)
"test started with untrained filters":
IMO, that's the real world. People don't start with fully-trained filters.
In addition, the graphs on pp. 15-20 show accuracy over the course of the entire 8 month period, so "post-training" accuracy can be viewed there.
"spam in the test is as old as 14 months":
Nope, he states (p. 4) that the corpus uses mail between August 2003 and March 2004.
"it should purge old data":
SpamAssassin purges its Bayes databases automatically, based on the age of messages in the corpus. We call it "expiry".
In that test, the "SA-Standard" dataset would be using this, so stating "Cormack did not perform any purge simulation at all" is not accurate. However, that would not have increased SpamAssassin's accuracy figures, since we have generally found that while it keeps the overhead of Bayes database sizes and memory down, it marginally reduces accuracy, instead of increasing it (at the default settings).
(Also worth noting that it can deal with being run from an en-masse check over a static corpus, as it uses the timestamp information in the Received headers rather than the current system time. So even if this test was run in the course of 4 hours, it'd still be an accurate simulation of what would happen in "real world" use over the course of 8 months.)
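The point about timestamps can be illustrated with a small sketch. This is not SpamAssassin's actual expiry code, which is more involved; it only shows the idea of taking "now" from the message's own Received header, so that replaying an old corpus in one sitting ages tokens the same way live use would. The function names and the `max_age` parameter are assumptions for illustration.

```python
from datetime import datetime, timedelta
from email import message_from_string
from email.utils import parsedate_to_datetime


def received_time(raw_message: str) -> datetime:
    """Timestamp from the topmost Received header (text after the last ';')."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received")
    if not received:
        raise ValueError("message has no Received header")
    # A malformed header without ';' would raise IndexError in this sketch.
    return parsedate_to_datetime(received[0].rsplit(";", 1)[1].strip())


def expire_tokens(last_seen: dict, now: datetime, max_age: timedelta) -> dict:
    """Keep only tokens seen within max_age of `now`.

    `now` is a message timestamp, not the wall clock.
    """
    return {tok: seen for tok, seen in last_seen.items() if now - seen <= max_age}


raw = (
    "Received: from a.example by b.example; Mon, 1 Mar 2004 10:00:00 +0000\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
now = received_time(raw)
tokens = {"old_word": now - timedelta(days=30), "new_word": now - timedelta(days=2)}
print(sorted(expire_tokens(tokens, now, timedelta(days=14))))
```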
And finally, what Henry said in comment 9520473.
--j.
No, I haven't. I haven't used that stupid, moronic term "bling bling" either.
I have noticed that black lists are indeed effective. Many spammers now use "bullet proof" spam hosts, so they use static domain names. However, there has been a marked rise in zombie systems sending spams. These are systems that are infected by viruses and then used as spam hosts. Since these systems come on line rapidly (when they are infected) and then drop out (when they are cleared of the virus or booted off their ISP) it seems unlikely that black lists will help.
At least in the spam stream I see, there is more than 1-2 percent of the spam flow from zombies. The best technique seems to be to use a black list first and then content filter.
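A minimal sketch of that "black list first, then content filter" ordering, with both checks passed in as plain functions so nothing here depends on any particular filter. The name `two_stage_filter`, the sample addresses, and the toy predicates are all hypothetical.

```python
from typing import Callable, Iterable, Tuple


def two_stage_filter(
    messages: Iterable[Tuple[str, str]],
    is_blacklisted: Callable[[str], bool],
    looks_like_spam: Callable[[str], bool],
) -> Tuple[list, int]:
    """Return (accepted bodies, number of content scans performed).

    Each message is a (sender_ip, body) pair.
    """
    accepted, content_scans = [], 0
    for sender_ip, body in messages:
        if is_blacklisted(sender_ip):
            continue  # rejected by source; the body is never examined
        content_scans += 1
        if not looks_like_spam(body):
            accepted.append(body)
    return accepted, content_scans


msgs = [
    ("192.0.2.1", "cheap pills"),        # listed source
    ("198.51.100.7", "cheap pills"),     # unlisted zombie, caught by content
    ("198.51.100.8", "lunch at noon?"),  # legitimate mail
]
accepted, scans = two_stage_filter(
    msgs,
    is_blacklisted=lambda ip: ip.startswith("192.0.2."),
    looks_like_spam=lambda body: "pills" in body,
)
print(accepted, scans)
```

The scan counter is the point: the expensive check runs only on mail the cheap check could not reject, while spam from sources too new to be listed still gets caught by content.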
On a related topic in the parent post:
In a previous post, in another discussion, I also suggested that the sophistication of spam filters like SpamAssassin, which use several algorithms to filter spam, would consume lots of system resources. Another poster wrote that these tools do not consume much in the way of processor and memory resources. This seems counterintuitive, but I don't have any contrary evidence.
Just fucking try the software yourself. Quite simply, SpamAssassin blows, and this is the consistent opinion of the ~4000 people here who have been stuck with it for now. Testing out CRM114 and DSPAM on limited (100 each) groups of people is showing both to be an order of magnitude better than SA. I can't say which is better, but I can say for certain both are in a whole other league from SA, which lets in 1/20 or so spams, and likes to flag obnoxious HTML-laden email from management types as spam, much to their disdain. Both of the statistical filters are much better, with test people seeing between 1/100 and 1/500 spams getting through, with only a handful of false positives.
Er, that article isn't a paper to try and prove the age of the earth; it's an article about why he is a Christian...or did you not read the part where he said it specifically wasn't about evolution.
So you wrote an article about displaying an image on a remote X server and people are supposed to be impressed?
Jesus - I'm lame, but you go way beyond that.
Posting as AC because I'm too damned lazy to make an account.
I believe he has very valid issues with Cormack's methodology, plus I can speak from personal experience that DSPAM is capable of very high accuracy rates.
DSPAM is being used on one of my domains, DeltaBravo.net, and after it finally reached its activation threshold, it is doing very, very well, much better than SpamAssassin's quoted accuracy. In the last week it's running about 99.4% accurate (over approximately 900 messages).
I think DSPAM may need a little more careful training than many filters (or at least more care in evaluating and selecting the correct options to use), but make no mistake, it's *very* good. I initially started out a bit disappointed with it, but it is now dead on the money at catching spam.
FWIW, I also run POPFile on my desktop to catch the few that still get through DSPAM. POPFile was extremely easy to train and well worth using if you don't have access to a server-based solution.
Just my opinion.
He says "The bible is the oldest document in the world".
Aside from the fact that the collection as "the bible" is scarcely more than 1000 years old, it clearly is not the oldest document in the world.
I'm a Christian, but that doesn't mean you take everything on faith. Faith is limited to things like existence of god, belief in an afterlife.
As soon as the bible makes a verifiable claim, I treat it like any other claim. God is perfect. The guys who wrote various texts in the collection known as the bible are human and prone to error, exaggeration and lies.
Honestly, the first time I read Cormack's paper I stopped partway through because his findings didn't jibe with my own experience. I've applied no scientific method to debunk his findings, and I don't care to -- I have other demands for my time.
I use and recommend DSPAM. Many of the accounts that are aggregated in my inbox have been exposed on the web and in Usenet for several years, so my spam load is probably about as high as anyone else's. No comparison testing analysis can change the fact that my inbox sees at most two spams per month (on a maturely trained DSPAM installation) and maybe one false positive every six weeks or so. DSPAM isn't the only tool in the box, but it's the only content filter, and it does what it's supposed to do.
If JZ got a little too personal in his rebuttal, I'll forgive him for it. I'd like to think that if I were in his shoes I'd show a bit more tact and restraint, but there's a pretty good chance that I wouldn't. I get all kinds of defensive about the work I've put my passion into, and can't really blame anyone else for doing the same.
Warning: This signature may offend some viewers.
> For the love of Cthulu, people, "architect" is a noun, not a verb.
Ya.
And for the love of Howard Phillips Lovecraft, "Cthulhu" is not spelled "Cthulu".
Duh.
Content-based filtering uses *exponentially* more resources than RBLs. RBLs just cause the mail server to close the connection; no further negotiation, no downloading of mail, no wasted port connections, no storage and memory overhead, no cpu overhead and all other resources necessary to examine the mail content.
Content-based filtering is a privacy issue as well.
The way I run my mail servers is with the utmost respect for the sanctity of our users' e-mail. We do not read their mail, even for the purpose of filtering spam. I consider this unethical personally, but not everyone thinks e-mail should be private.
You describe four complicated programs you configure and run as well as a week of training needed to run your anti-spam program. Assuming a very low hourly rate, that's $5,000.00 easy. Why use e-mail when for thirty seven cents you can have a man come to your office and take real mail anywhere you want him to? With real mail you can learn to throw the spam in the trash in under thirty seconds.
I can see your point about privacy. It is true that once you allow something to read email it could be abused. But to balance this is the fact that, at least for me, email would be useless without a spam filter.
Privacy is not an issue in my case. I use text only email on Linux (email never touches my Windoz system for security reasons). I run a spam filter for my own email account, so it is my program that reads my email, not someone else's. I read my email on a shared Linux system run by the ISP that hosts my domain (my ISP is webquarry.com).
As far as I know, the RBL approach would not work in my case. I do discard some email on the basis of the domain name, which is far less efficient than the RBL. My spam filter keeps a log of some of the header information from the email it discards. A fair amount of spam is going through fixed domain names these days (e.g., like the infamous tekmailer).
One of the problems I had with the commonly used spam filters was that it was unclear to me how to install them in the case where I am simply piping my email to them. I was also concerned about resource usage, since I am using a shared system. So like a typical programmer I wrote my own spam filter in C++. It is probably 80 to 90 percent efficient. Enough spam still gets through that I'm going to take another look at SpamAssassin and see if I can get it to run with a "procmail" forward. It is just too time consuming to constantly hack the spam filter for the latest evil spammer trick (recently they have been sending spam to my email address from the other valid user on my domain, where I don't check content).
I agree with your statement, though 'Esse' (with capital E) is NOT a German word. 'Ich esse' is a verb in present form, first person, singular; the noun is 'Essen'.
Try hxxp://dict.leo.org for anything but serious translation work.
Training on old stuff makes it worse. SpamProbe's author suggests purging words that have not been referred to in 2 weeks.
My SpamProbe setup handles thousands of messages per day and not ONE spam gets through in weeks and there are NO false positives in more than a year of use on hundreds of thousands of messages.
I estimate SpamProbe to be in excess of 99.5% accurate in eliminating spam and 100% accurate in accepting ham, but it depends 100% on how well you train the thing.
Garbage in, garbage out...
Oh well, what the hell...
You can actually verb something... and you can architect a solution for any verbing problem, m'kay?
Both words with sufficient history to claim "Not Invented Here"
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
>> So you wrote an article about displaying an image on a remote X server and people are supposed to be impressed?
If you are able to read the paper (i.e., via a university IP address based subscription to J.Chem.Ed.), you'll see that the paper is 7 pages long and the supplemental information is 43 pages long.
It's a little more involved than you think. (i.e., the actual cryostat is connected to a Mercury-VX console computer that is capable of acquiring trillions of points per second). The Sun Ultra/10 workstation is connected to that console via a TCP connection. On the system administration side, it is incredibly complicated to remotely control an NMR spectrometer over the internet. I worked on this project for my M.S. in Chemistry and it took ~2.5 years to perfect it.
Of course, that's way off topic & I'm replying to a flaming AC... but now you know... and knowing is half the battle!!
Many users use e-mail differently than you might.
- For example, some people want the forwarded e-mail of top blonde jokes from their friends and relatives and others don't. I wouldn't mark these annoying forwarded messages as spam because I wouldn't want to risk associating friends' e-mail addresses with spam in the filter and I don't get that many e-mails like that.
- Mailing lists. My bank sends me annoying newsletters, but I may need to note a change in their user policies. Right now I just delete these based on the subject lines.
- Useful e-mails from friends citing good deals.
And I think you are missing the reference to a Terry Pratchett novel :) That is why it was emphasized... as Gaspode states, it doesn't bode ill or well, but just generally bodes.
But thanks for your lesson in English ;-) I would like to be moved up a few years, though ;-)
Posting as Anonymous coward because I left my keyring at home.
According to my 1976 edition of the Concise Oxford Dictionary (I don't have ready access to the full-on OED in all its multi-volume glory), "Craft" is a noun or a verb, with its roots in Old English. No doubt the OED itself has citations going back centuries, but I don't have them. The OED's online edition is a subscription-only service, and I don't have one (and neither does anyone else who doesn't have GBP 195 + VAT to spend on a dictionary subscription)
Well, take it with a huge grain of salt:
He says: I am a Born-Again Spirit-Filled Heterosexual Serious-About-God Christian (TM)
The document linked from the parent really tells a lot. He says:
Oh man! What a proof that one is.
sbl-xbl.spamhaus.org,
Best one, keep at top.
blackholes.easynet.nl,
Has not been running in months, remove. (parts now at SORBS and NJABL)
relays.ordb.org,
Less than 1% hit rate, move down in check list.
list.dsbl.org,
Good hit rate, move to #2 on list.
ipwhois.rfc-ignorant.org
cn.rbl.cluecentral.net,
kr.rbl.cluecentral.net,
Others okay if you want.
HTH
Oh, but they do!
Try Spamhaus' XBL, you'll see, if you're anywhere close to me, 70-80% of all SMTP connections go bye-bye since they are spam coming from zombie systems.
I've yet to have a false positive. Impressive.
Their SBL targets the "static spamhausen"