Why Programmers Need To Learn Statistics
David Gerard writes "Zed Shaw writes an impassioned plea to programmers: Programmers Need To Learn Statistics Or I Will Kill Them All. Quoting: 'I go insane when I hear programmers talking about statistics like they know s*** when it's clearly obvious they do not. I've been studying it for years and years and still don't think I know anything. ... I have taken a bunch of math classes, studied statistics in grad school, learned the R language, and read tons of books on the subject. Despite all of this I'm not at all confident in my understanding of such a vast topic. What I can do is apply the techniques to common problems I encounter at work. My favorite problem to attack with the statistics wolverine is performance measurement and tuning. All of this leads to a curse since none of my colleagues have any clue about what they don't understand. I'll propose a measurement technique and they'll scoff at it. I try to show them how to properly graph a run chart and they're indignant. I question their metrics and they try to back it up with lame attempts at statistical reasoning. I really can't blame them since they were probably told in college that logic and reason are superior to evidence and observation.'"
Everything I needed to know about statistics I learned playing poker.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
110%.
Correlation != causation. Just repeat that and you don't need to know statistics.
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Maybe the problem is in your presentation. Even here, you tell programmers that you want to kill them for not understanding a topic that even you are unwilling to acknowledge mastery of. Then you tell us how hard the topic is to understand, even though you've spent so much time trying to learn it.
Is it any wonder that no one takes your suggestions seriously? You are practically sabotaging yourself with self-effacement.
These aren't homework problems you're tackling here. They are business problems and you need to sell yourself and your ideas if you want to get any traction. Do you have any evidence that your methods are better than the SOP thus far? Do you have any case studies that show how effective statistic analysis is in *any* of your projects?
Or are you simply taking something that seems like a data point and extrapolating it to cover a vast swath of applications?
Statisticians need to learn programming or I will kill them all.
We know as much statistics as we need to know.
Some know more, some less. Each has traded off hours vs. knowledge in many fields.
For example: Why would a programmer who's job is to automate bean counting need to know more then basic statistics? (s)he rightfully focuses his efforts on accounting.
One post calculus statistics course gives me enough grounding to know what I don't know and punt to experts when I need to.
Fucking specialists forget all the things they don't know and only look at the world through one lens.
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
Programmers Need To Learn Statistics Or I Will Kill Them All
Okay, two things: First, threatening programmers never work. Management's been trying that for years. Second -- don't you mean 'kill -9' them all, or maybe demalloc(), or cast them to void*, or one of a dozen other witty things you could do besides the mundane answer of threatening stabby bits on them because you have a case of intellectual snobbery?
#fuckbeta #iamslashdot #dicemustdie
He's just as arrogantly claiming that he's right and they're wrong. Now, he may very well in fact be right, but he's taking the same obstinate position the people he criticizes do.
It's important to know when your input is not desired. Even if you think it should be.
is not because they don't understand statistics. It is because you are a dick.
Statistics is HARD, for two reasons:
(a) Probability theory, on which all practical Statistics is based it both (i) counter-intuitive and (ii) difficult
(b) The very Mathematics on which it is based is obscure
And, worst of all, it is uniformly badly taught, even in good universities, and the Statistics for XXX are uniformly awful, blind leading the blind.
Lastly it is very hard to get a staight answer from a mathematical Statistician.
I know enough about statistics to know statistically I know I'm safe from his threats. I suspect if I were a bag of Cheetos the odds were be against me but that's not the case.
I've found that more than just about any other degree Computer Science and to a less extent Medical Degrees imbue the recipient with an unnatural ego when it comes to subjects with which they are unfamiliar. I propose we remove the word Science from CS degrees and call it what it is "Computer Programming and Troubleshooting". There are far too many CS graduates who think they are actually scientists.
I've been studying it for years and years and still don't think I know anything.
And yet you're expecting someone whose expertise is in a different field to know more about it than you?
We can't all be experts in everything. If you're the expert in the field of discussion, get used to educating your coworkers on the topic, or find another job where you're surrounded by people with the same education and expertise as you.
The average person is an expert in no more than two or three related areas. That's why people work in teams, to cover each other's blind spots.
I work for the Department of Redundancy Department.
Nothing new to see here.
you had me at #!
Statstics is WAY beyond what a programmer cares about. Logic is all that matters. Statistics->logic is the problem of the software engineer, not the programmer.
...unfortunately, they are mostly lost in the irony of statements like this:
I think women are better programmers because they have less ego and are typically more interested in the gear rather than the pissing contest.
I doubt I've seen anyone more thoroughly entrenched in a pissing contest than Zed Shaw, of the website formerly known as "Zed's So Fucking Awesome".
Don't thank God, thank a doctor!
Let's see, we have one guy complaining about how none of his programmer coworkers understand statistics, and we have X coworkers who undoubtedly disagree with him. Since we do not know him or any of his colleagues to any meaningful degree, we have to assign equal weight to each of their opinions. Statistics then tells us there is a 1/(X+1) chance of his being right, and an X/(X+1) chance of their being right. We can assume that X >= 2 based on his ranting, therefore resulting in the odds favoring them by at least 2/3, and probably much more. Therefore it is only rational to assume they are correct.
I prefer logic and reason mixed with evidence and observation.
If you just have logic and reason, then you get religion. Logically, it worked out when it was created. There is no evidence to counter it, so it must be true. Religion was created with logical reasoning. Some may say it was incorrect reasoning, but it was reasoning nonetheless.
On the other hand, if you just have observable evidence, with no logical reasoning, you can have all the data in the world, but you will have nothing to use it with. True, you can see it, but you cannot understand why it is the way it is.
Having all of one or the other is useless.
I don't like Linux. This doesn't make me a troll.
Lies, damned lies and statistics. Us programmers are too busy dealing with the first two to ever reach the third..
You probably still think I am a lunatic, but hear me out.
You don't qualify as a lunatic; just as someone who has no idea of what he's talking about. Absolutely no idea. Your post, my friend, is so full of ideas you obviously misunderstood that I won't even attempt to make a list.
And yes, I do statistics for a living.
So, since so many people don't seem to want to actually read Zed's stuff -- and I honestly don't blame you -- I'll try to summarize:
Eventually, every major science adopted an empiricist view of the world. Except Computer Science of course.
He tends to bitch a lot about computer scientists. I'm just starting a CS degree, and there is a Statistics class in the curriculum. Is he working with people with good degrees, people from a technical college with a "programming" degree, people from a diploma mill, or high school students with no degree at all?
Of course, he seems to be implying it's everyone, and doing so in a typically Zed-like way.
"All you need to do is run that test [insert power-of-ten] times and then do an average." Usually the power-of-ten is 1000...
I don't know that I've ever heard that particular statement. But it's a good point:
How do you know that 1000 is the correct number of iterations to improve the power of the experiment?
Generally because it was probably closer to a million, so I'm erring on the side of taking more, rather than fewer, measurements. But without careful consideration, I could be way off.
How are you performing the samplings?
I think this is vastly less important than how you are dealing with the data, but it is also a good point. For example, his complaint is that an average isn't enough; with detailed enough logging, he could easily go back into my data and figure out min, max, standard deviation, histograms...
How do you know that 1000 is enough to get the process into a steady state after the ramp-up period?
Not a huge deal -- the "steady state" will almost certainly be faster than the "ramp-up" period. Worst case, I'm over-optimizing.
What will you do if the 1000 tests takes 10 hours?
Either ctrl+c, or try it 10 times.
How does 1000 sequential requests help you determine the performance under load?
Very good point here. It's still a useful statistic, but you still need to measure things like 1000 simultaneous requests, not just 1000 all in sequence.
On the other hand, if your performance is acceptable with them all in sequence, you could just run it through something like Event Machine, so it's all sequential on production, too.
The most troubling problem with these single number “averages” is that there’s two common averages and that without some form of range or variance error they are useless. If you take a look at the previous graphs you can see visually why this is a problem. Two averages can be the same, but hide massive differences in behavior...
So yes, always make sure you can record enough statistics so that someone else can come along and use your data to give you something meaningful.
The moral of the story is that if you give an average without standard deviations then you’re totally missing the entire point of even trying to measure something. A major goal of measurement is to develop a succinct and accurate picture of what’s going on...
It doesn't have to be statistically accurate. It just has to be close enough.
Ah, confounding. The most difficult thing to explain to a programmer, yet the most elementary part of all scientific experimentation. It’s pretty simple: If you want to measure something, then don’t measure other shit.
This is both a very good and a very bad idea. It ties into the peeve he had before -- ramp-up time. For example:
If we want to take one single line of code and test it then we can. If we want to only verify one single query on a database then what’s stopping us?
What's stopping us is that our applications don't actually work like that.
Don't thank God, thank a doctor!
No, Logic and Reason are superior to Cubase.
It's a music joke, laugh.
From his complaints, I can tell knowledge isn't the real issue. Testing performance takes a huge amount of time. You need to simulate other programs running, multiple users and make sure the test matches what real users might do. Generally, this requires writing completely independent test programs and charting the logging from them. People just don't want to go to that kind of effort. It can take weeks just to create proper tests for complex programs like web servers.
I can vouch for this. You might think AC just spends all his time on /. but the reality is that he's a real big-shot who can afford to make ridiculous claims.
Always back up, never back down. ---- Think you're cool 'cos your uid is prime? Take mine, modulo the one digit integers
is one half mental.
of course that explains why 90% of all programs written are CRUD.
-with apologies to Yogi Berra, Theodore Sturgeon, and a 20% apology, as a matter of principle, to a guy called Pareto.
Where are we going and why are we in a handbasket?
The Zed Effect: Whether you're right or wrong people will disagree with you just to piss you off.
Before computers stats involved using parametric tests (t-tests, anova, etc) which made assumptions like "the data comes from an underlying normal distribution". BTW, in stats terms "normal" mean "Gaussian".
Now, with cheap and fast computers, we can actually compute the confidence intervals non-parametrically through permutation tests and bootstrapping without assuming anything about underlying distributions. In most cases, this non-parametric test is the "right thing to do". Most of the time, the results are the same as using a parametric test.
However, a HUGE disaster in empirical science has been the problem of multiple comparisons. With computers it is so easy to compute correlations and significance tests between every possible slice of your data set. Many "scientists" don't have good statistical knowledge and pray at the alter of "p < 0.05". They don't know about or understand the problem of multiple comparisons. So they do 20 tests, find one that comes out p0.05 and write a paper about it. They don't get that if you do 20 tests you are very very very likely to find one that come out p < 0.05.
Anyone who has access to excel or matlab can do this little experiment.
samp=50 normally distributed random numbers.
for x=1:100
test=50 normally distributed random numbers (mean=0, var=1);
sig(x)=ttest(samp,test);
end
now look at the sig vector. OMG, 5% of the tests came out significant!!!
Now you are writing a paper all about how x is linked to y. But you are essentially throwing dice and then writing a paper about why it came up '3-3'.
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765
In practice, statistics is an attempt to quantify messy, uncertain events into a figure. We can even measure the extent to which this works, roughly speaking. Your hard drive has a rough time-to-failure, based on analyses of the things that tend to go wrong in that system. Sure, any time it fails, it's not statistics that broke it; it's one of the kinds of problems captured in the statistical analysis. And sure, you could break it down further for disks and note that the controller has a different failure rate than some other component, just as a bridge has a number of possible failures. Problem is, for any of those, you could break it down further and get failure rates for subcomponents, regions, etc. So what? It's still useful to have statistical measures - the real world is complex, and statistics helps us capture things we otherwise couldn't.
Programmers (particularly but not only young programmers) might not like to acknowledge any field but their own has any depth ("Everything is simple! Just do it my way", hence Ron Paul/Ayn Rand fanboyism and all sorts of other stupidities) - I don't know if there's a lot we can do but hope they grow out of it (It took me awhile to do it, as did a number of people I knew when I was younger, but I made it out).
Basically, if your worldview doesn't wed empiricism and a reasonably flexible practical philosophy, your worldview is (if you err on the pro-logic end) too inflexible and you're going to miss out on standing on the shoulders of giants. Neither the logician nor the mystic understands the world.
For every problem, there is at least one solution that is simple, neat, and wrong.
What Spell Check? I didn't know I was writing a Spell. Is it a good or evil spell?
Damn it's evil. Now I've got to listen to da da da de - da da de bop all the time.
Mod me up/Mod me down: I wont frown as I've no crown
Yup. Also, for a guy who claims to know so much about statistics and measurement, it's weird how he judges programmers so sweepingly on the sole basis of his anecdotal experiences.
Today's weirdness is tomorrow's reason why. -- Hunter S. Thompson
So I read through his article. Yes, the whole mindless rant. The conclusion that one should REALLY draw from it is: Zed Shaw is a douche with Asperger's who clearly feels like his own personal area of expertise is underappreciated. Hey Zed, get over it.
Down with the career politician! SUPPORT TERM LIMITS
I like how the first part of his Wikipedia article says "Zed A. Shaw is a troll" with four citations.
Well, it has never been successfully tested.
not understanding a topic that even you are unwilling to acknowledge mastery of.
Personally, I think that little acknowledgment increases his credibility quite a bit. It suggests to me that he's actually spent some real time coming to grips not just with glossy overview you get in a high school or college course but with some of the devilish subtleties of actually using the stuff.
The funny thing about knowledge... the more it grows, the bigger you realize the frontier is. So, how good of a heuristic is apparent confidence?
Tweet, tweet.
"I construct two sets of n=100 random samples from the normal distribution. Now, if I just take the average (mean or median) of these two sets they seem almost the same."
So its true. The n's justifies the means.
Today's vices may be tomorrow's virtues.
please tell me whether you would like to rely on decision theory, game theory or utilitarian techniques to handle life chances of your children or their sensitive private/critical information in a database.
Read radical news here
You know, that particular citation has made me wonder in the past, but not enough to actually research it. So, I went off looking for more information and found it.
The statistic was generated from a July 1976 survey.
The sample group for this statistic was 1,200 dentists. These dentists were hand picked by the research company, probably with good reason.
They were asked, what advice would they give gum-chewing patients
1) sugared gum
2) sugarless gum
3) no gum at all.
Sugarless gum got 85% of the vote. Not terribly surprising. I'd be fairly confident that their time had been paid for, or at very least they were told "This survey is being done for Trident Sugarless Gum." That is only speculation, so hush up.
17/20 doesn't really sound very good. It just doesn't stick in your head. 4/5 is close enough, even though it reduces your answer to 80% (ahhh, a lie). Since these are marketing folks, I'm sure they pushed all kinds of values past focus groups, until "4 in 5" was accepted as most favorable.
As the link cites, they're fairly confident that the "sugared gum" answer got at least one response. There's always someone that'll take the obvious wrong answer. If you don't believe that, look at any Slashdot poll. :)
What they don't say is how many of the 1,200 samples were dropped. I'm sure there were non-responses, and they could have easily added any number of unfavorable answers in as non-responses. Of course, they couldn't have 100% in their favor, so they had to keep some.
Serious? Seriousness is well above my pay grade.
He's just as arrogantly claiming that he's right and they're wrong.
No he doesn't.
He claims that programmers need to understand statistics more. The people he is talking about are therefore not wrong - they are ignorant.
But that term is loaded with negative meaning, it's more accurate to say they are like a variable with named "statistics" with a value that has never been set. Basically, they don't know what they are missing.
It's like when programmers try to argue about how a language is bad when they've never used it. How would they know? Yet many without understanding of statistics are saying the same thing, they don't need to know any more.
I know enough to know statistics can be a valuable tool. Why would you not want another tool that could help you? The people who refuse do so are less than they could be (as a programmer).
"There is more worth loving than we have strength to love." - Brian Jay Stanley
I hear you, I do performance engineering of web based systems. The developers, the managers, the testers, the architects all have no clue. You are correct here.
However if you can not present your "theory" of how to do something in a dumbed down enough format then who cares. Because the pretty graph is pointless. It will be mis-interpreted, mis-understood, and mis-used.
All the stats theory on the planet will not get you passed the dumb manager or developer. don't loose sleep of this. There is no point. Simply find metrics in your analysis procedure that do mean something to these people. They may not be the total picture but they are something. Build a reputation for being correct by starting with simple things. You are always going to but heads with a know it all developer / architect / manager. Fine let them go off and waste money and time. They will be found out as morons in time. You do your thing and simply become the guy to ask about performance and how to do this.
Being understated and consistently showing above average results for your work is how you will rise up. Being and A-hole about it is not going to help anyone. As a matter of fact I would can your butt for being a D#ck.
I question their metrics and they try to back it up with lame attempts at statistical reasoning. I really can't blame them since they were probably told in college that logic and reason are superior to evidence and observation.
I work with a number of statisticians and I have the opposite problem. They look at the data, apply mathematical transforms to it, and come to a conclusion, whether that conclusion makes any sense or not. They make little attempt to reason that the data may flawed (which experiments often are), or does not really represent what we are trying to measure, or they are using the wrong statistic to summarize the effect. It is very frustrating.
... I ran into a professor of statistics who said that computers were going to be a passing fad in his field.
To a Lisp hacker, XML is S-expressions in drag.
Statistics are important; it is highly unlikely that anyone with an MBA will know how or why, but they want them.
In fact, it is almost a certainty that any given MBA will either lack statistical expertise or will misapply it unthinkingly in a cook-book style. The pseudo-statistics behind Six Sigma comes immediately to mind.
I had repeated theoretical discussions with the four MBA experts who "trained" us (a group of six PhDs in Physics & Engineering doing R&D) in the ways of Six Sigma. There were problems with the statistical theory they presented right from the start - and they were clearly unaccustomed to being contradicted along the lines of "that's not right/applicable in this case, and here's why". For instance, they failed to acknowledge that non-Gaussian distributions could exist, then refused to accept that procedures should be adapted to the data if it was non-Gaussian. Next, they adamantly refused to believe that the 1.5 Z shift hypothesis was supported only by a few studies, all relying on a single dataset from the 1950s for die-based manufacture, and totally irrelevant to most other processes. The Six Sigma books all say "many studies" over decades support the Z shift hypothesis, but fail to cite them, and our MBA experts could not cite any such studies either. Thirdly, they refused to accept that an additional mode of variability (not in the Six Sigma beliefs) existed in processes with feedback (such as recycle lines or controllers). In many cases, this mode guarantees non-Gaussian variability in the process output.
Their advice was that to pass the course, we should ignore our knowledge of statistics (which they acknowledged was far better than theirs) and of process variability, and just "apply the documented methods". We did, and we all passed the course. Then we ignored the Six Sigma bogus statistics bullshit and got on with our jobs using proper statistics to analyze and solve problems in variability with the products we were developing.
MBAs seem to want statistics, but the vast majority appear to lack the training in how to generate proper statistics, or how to use them competently if someone else supplies them. Most MBAs appear to think the world is described adequately using Gaussian distributions, and a few "experts" know the Weibull distribution or the t-distribution. Other distribution types (Poisson, discrete/categorical, etc.) are totally foreign, and methods of inference beyond simple unconditional analyses are also quite alien to them.
I also understand that people who are good at it are rare.
Perhaps not as rare as you might think. But those who have some aptitude in statistics know enough to keep their mouths shut when the data tells them to. MBAs on the other hand, ignorant of their own ignorance, are as verbally promiscuous as politicians...
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
I'm a physicist, I know plenty of statistics. The kinds of statistics he's talking about are not hard. If you can do algebra, you can do things like calculate the standard deviation and variance of a set of measurements.
Was this rant really necessary? I run into people in physics who don't take care of these details. I find that a simple "can you put a standard deviation on that number?" or "can you repeat the experiment?" generally gets the job done. If you want to be more scientific, just start with those questions, and see where it takes you... you could even add "please" if you wanted to be nice. I find threatening people with death and belittling their intellect while talking about trivial calculations doesn't generate useful data.
To be fair, it sounds like Zed has been working as staff at a university. This has nothing to do with statistics, but it's probably the real reason he's in such a bad mood.