Why Programmers Need To Learn Statistics

Posted by Soulskill on Saturday January 9, 2010 @11:36AM from the because-they-suck-at-poker dept.

David Gerard writes "Zed Shaw writes an impassioned plea to programmers: Programmers Need To Learn Statistics Or I Will Kill Them All. Quoting: 'I go insane when I hear programmers talking about statistics like they know s*** when it's clearly obvious they do not. I've been studying it for years and years and still don't think I know anything. ... I have taken a bunch of math classes, studied statistics in grad school, learned the R language, and read tons of books on the subject. Despite all of this I'm not at all confident in my understanding of such a vast topic. What I can do is apply the techniques to common problems I encounter at work. My favorite problem to attack with the statistics wolverine is performance measurement and tuning. All of this leads to a curse since none of my colleagues have any clue about what they don't understand. I'll propose a measurement technique and they'll scoff at it. I try to show them how to properly graph a run chart and they're indignant. I question their metrics and they try to back it up with lame attempts at statistical reasoning. I really can't blame them since they were probably told in college that logic and reason are superior to evidence and observation.'"

33 of 572 comments

Percent probability that Zed Shaw is a jerk by Anonymous Coward · 2010-01-09 11:38 · Score: 5, Funny

110%.
1. Re:Percent probability that Zed Shaw is a jerk by kandela · 2010-01-09 13:48 · Score: 4, Funny
  
  And by that you mean 110% +/- 10% (95% confidence interval) right?
  
  --
  Conservation of angular momentum makes the world go round.
correlation != causation by Hognoxious · 2010-01-09 11:40 · Score: 5, Funny

Correlation != causation. Just repeat that and you don't need to know statistics.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Your argument is dead, Zed by BadAnalogyGuy · 2010-01-09 11:42 · Score: 5, Insightful

Maybe the problem is in your presentation. Even here, you tell programmers that you want to kill them for not understanding a topic that even you are unwilling to acknowledge mastery of. Then you tell us how hard the topic is to understand, even though you've spent so much time trying to learn it.
Is it any wonder that no one takes your suggestions seriously? You are practically sabotaging yourself with self-effacement.
These aren't homework problems you're tackling here. They are business problems and you need to sell yourself and your ideas if you want to get any traction. Do you have any evidence that your methods are better than the SOP thus far? Do you have any case studies that show how effective statistical analysis is in *any* of your projects?
Or are you simply taking something that seems like a data point and extrapolating it to cover a vast swath of applications?
1. Re:Your argument is dead, Zed by Krishnoid · 2010-01-09 11:57 · Score: 4, Funny
  
  Or are you simply taking something that seems like a data point and extrapolating it to cover a vast swath of applications?
  Well yeah, that's what he was saying -- statistics!
2. Re:Your argument is dead, Zed by superdana · 2010-01-09 12:14 · Score: 4, Insightful
  
  Maybe the problem is in your presentation.
  
  Meet Zed Shaw.
3. Re:Your argument is dead, Zed by arendjr · 2010-01-09 13:24 · Score: 4, Insightful
  
  I don't know Zed Shaw yet, but I think you're right.
  The whole problem he is describing sounds like a big ego problem. He himself has a huge ego, and has problems when he runs across the programmers, who often have huge egos as well.
  Now, I think he does make a point though. The programmers he is ranting about indeed do sound like assholes, just like he himself is. In order to be a really good programmer (or a good statistics expert) you should also know when to put aside your ego.
4. Re:Your argument is dead, Zed by Hurricane78 · 2010-01-09 15:35 · Score: 5, Funny
  
  I just found a very old hard disk. Double height. MFM/RLL. And after a “strings -n 32 /dev/hdd”, I got the following old saying, carved in the bytes of the disk:
  
  Computer science
  Statistics
  Social skills
  Choose one.
  ;)
  
  --
  Any sufficiently advanced intelligence is indistinguishable from stupidity.
Or, how about... by halivar · 2010-01-09 11:43 · Score: 5, Insightful

Statisticians need to learn programming or I will kill them all.
Mathematicians just need to shutup. by HornWumpus · 2010-01-09 11:44 · Score: 4, Insightful

We know as much statistics as we need to know.
Some know more, some less. Each has traded off hours vs. knowledge in many fields.
For example: Why would a programmer whose job is to automate bean counting need to know more than basic statistics? (S)he rightfully focuses his efforts on accounting.
One post calculus statistics course gives me enough grounding to know what I don't know and punt to experts when I need to.
Fucking specialists forget all the things they don't know and only look at the world through one lens.

--
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
1. Re:Mathematicians just need to shutup. by __aasqbs9791 · 2010-01-09 12:14 · Score: 5, Insightful
  
  One post calculus statistics course gives me enough grounding to know what I don't know and punt to experts when I need to.
  That's actually his argument (though I'm pretty sure he doesn't realize it, having met him a few years ago at a conference). People need to know their limits, and the strengths (and weaknesses) of others, and defer to them when they know what they're talking about, rather than talking out of their asses. As you point out, you can't know everything, but you'll defer to others who know more when you need to. I'm pretty sure Zed would like working with you based upon that fact alone (I know I value that trait and try to express it myself). Far too many people think they aren't allowed to have any weaknesses (and we all do in some area or another) so they talk a big game, and when push comes to shove, they will actively block people who actually know more than they do about the subject at hand. Working with too many people like that has driven Zed insane (IMHO) and I know I've been close to it at a couple of work places before (and really loved the one that wasn't like that hardly at all).
2. Re:Mathematicians just need to shutup. by Toonol · 2010-01-09 12:17 · Score: 5, Insightful
  
  But statistics is one of those fields that benefits everybody; it's a bit like probability, logic, or (further afield) history. Lack of a fundamental understanding of statistics can lead you astray in a near-infinite number of ways.
  
  I have sat in business meetings hundreds of times where I've seen decisions made on completely meaningless and irrelevant data, because the people involved don't understand statistics. The same holds true in your personal life; decisions with purchasing products, investing money...
  
  Now, I'll bet that most slashdot readers have the minimum amount of knowledge of statistics to avoid the most egregious errors; but more knowledge is certainly helpful. It will help you in a myriad of ways.
Title fail. by girlintraining · 2010-01-09 11:44 · Score: 5, Funny

Programmers Need To Learn Statistics Or I Will Kill Them All
Okay, two things: First, threatening programmers never works. Management's been trying that for years. Second -- don't you mean 'kill -9' them all, or maybe demalloc(), or cast them to void*, or one of a dozen other witty things you could do besides the mundane answer of threatening stabby bits on them because you have a case of intellectual snobbery?

--
#fuckbeta #iamslashdot #dicemustdie
1. Re:Title fail. by Anonymous Coward · 2010-01-09 13:44 · Score: 5, Funny
  
  or firefox's implementation:
  
  void demalloc(void *ptr)
  {
      /* noop */
      return;
  }
Re:93% of Programmers Think You're Wrong by Anonymous Coward · 2010-01-09 11:49 · Score: 5, Interesting

The only statistics book you'll ever need
The funny thing is he's doing exactly the same by Rix · 2010-01-09 11:52 · Score: 4, Insightful

He's just as arrogantly claiming that he's right and they're wrong. Now, he may very well in fact be right, but he's taking the same obstinate position the people he criticizes do.
It's important to know when your input is not desired. Even if you think it should be.
The reason people ignore you Zed.. by Anonymous Coward · 2010-01-09 11:54 · Score: 5, Insightful

is not because they don't understand statistics. It is because you are a dick.
1. Re:The reason people ignore you Zed.. by Anonymous Coward · 2010-01-09 14:38 · Score: 4, Insightful
  
  Claiming that the author is a dick is not mutually exclusive to him having a good point. The author is right in his claims that people who don't know what they're talking about often think they do and get pissy when someone claims otherwise. But the author presents this viewpoint in a really stupid manner. It is dickish to say, essentially, "Hey idiot, you're wrong", even if the person is wrong.
  Note how your response is dickish, but probably right in claiming that the world is filled with arrogant/stubborn people.
Statistics is HARD by omb · 2010-01-09 11:54 · Score: 4, Informative

Statistics is HARD, for two reasons:

(a) Probability theory, on which all practical Statistics is based, is both (i) counter-intuitive and (ii) difficult

(b) The very Mathematics on which it is based is obscure

And, worst of all, it is uniformly badly taught, even in good universities, and the Statistics for XXX are uniformly awful, blind leading the blind.

Lastly, it is very hard to get a straight answer from a mathematical Statistician.
1. Re:Statistics is HARD by radtea · 2010-01-09 12:28 · Score: 4, Insightful
  
  Statistics is HARD, for two reasons:
  I'd argue that probability theory isn't as hard as people make it seem, but statisticians are wankers. Most of what we think of statistics was developed by people who were intimately engaged with empirical research, but modern statisticians are mathematicians, many of whom have never actually performed an experiment. They think the statistics are real, whereas experimental scientists know the truth: God made the Probability Distribution Functions. All else is the work of man.
  Furthermore, modern computing has made a lot of the conceptual apparatus of conventional statistics irrelevant, as it is designed to deal with the problem of reducing problems to something that can be computed by hand and finished off with a single table lookup. Today it's a rare case that we can't get at the PDFs directly, bypassing much of conventional statistics. But due to how badly the stats are taught, and how poorly probability theory is understood, we are still living in a world where p-values are the exception, not the norm, and when they are quoted they are frequently unrealistic because they are based on statistical assumptions that are not warranted given the non-idealities of the data.
  So I'd argue that statistics is basically a dead field populated by zombies who are dedicated to infecting as many students as possible. If we taught thermodynamics or mechanics with equally outmoded concepts they would be really hard too.
  
  --
  Blasphemy is a human right. Blasphemophobia kills.
2. Re:Statistics is HARD by thesandtiger · 2010-01-09 12:35 · Score: 5, Interesting
  
  I don't think it's hard - I just think it requires a different way of thinking than most programmers usually take to maths.
  As a programmer/developer who went into research (in social sciences, so it's really soft), I can say that in my experience stats is really closer to a programming language than it is to other maths. Here's why:
  1) You have a LOT of tools to pick from. What kind of analysis do you want to do? What kind will give you the most useful result? What kind is your data amenable to?
  2) You don't always have a clear choice as to which is the best for a given situation. Sometimes you need multiple different types of analysis to really get the full picture.
  3) Just because it's math doesn't always mean it's right. There's some crazy ass black-box magic stats stuff we use for one project of ours that, in theory, will let us figure out the demographic composition of an unknown target population. Maybe. Sometimes. If the wind is right. Or not.
  4) At the advanced levels, it's fucking insane. People who hack stuff like ultra optimized 3d engines with large quantities of assembler or whatever always wigged me out because my brain just doesn't work that way. With the really complex stats stuff it's the same way - I can plug and chug with the formulas, but I honestly have about as much comprehension of why some of the more advanced stuff works as my dog has of CPU design.
  5) If you know the basics, you know just enough to be dangerous and really piss off people who know what they're doing. Being able to run an anova or determine correlation makes some people think they actually know what's going on because, hey, it's math. But a lot of people who just do the basic stuff think their results are more meaningful than they actually are - falling prey to the whole "it's statistically significant therefore it must be IMPORTANT" fallacy (when you can certainly have things that are "statistically significant" but actually have virtually no impact on the outcome).
  6) Even when people know their shit, they disagree. A fine example of this would be the Space Shuttle failure rate - you had people saying that the shuttle would suffer a critical failure anywhere between 1 in 5 and 1 in 50,000 launches. And depending on what tools they used to do their analysis, they were correct. Same as with programming languages - depending on the problem, equally skilled programmers might pick entirely different languages to use because they think one part or another is more critical.
  Honestly, I really enjoy stats - if I had to do it all over again I would probably have spent a LOT more time working with stats than I did as a programmer in my younger years - but I won't pretend that it's totally clear what tools to use when. The author of TFA would do well to realize that even fellow statisticians would probably slap the shit out of him over some of his beliefs about how to properly go about utilizing stats toolsets.
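To put numbers on point 5, here is a minimal Python sketch of a result that is "statistically significant" yet has almost no practical impact. The figures are invented for illustration (two groups of one million response times, means of 100.0 ms and 100.1 ms, a shared standard deviation of 20 ms), and the helper `two_sample_z` is a hypothetical name, not something from TFA. It assumes a known, shared standard deviation so a plain z-test applies.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean_a, mean_b, sd, n):
    """Two-sided z-test for two groups of size n with a known, shared sd.

    Returns the z statistic, the p-value, and Cohen's d (the difference
    expressed in units of the standard deviation).
    """
    standard_error = sd * sqrt(2.0 / n)
    z = (mean_b - mean_a) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    cohens_d = (mean_b - mean_a) / sd
    return z, p_value, cohens_d

# Hypothetical numbers: 100.0 ms vs 100.1 ms, sd of 20 ms,
# one million measurements per group.
z, p_value, cohens_d = two_sample_z(100.0, 100.1, 20.0, 1_000_000)
print(f"z = {z:.2f}, p = {p_value:.5f}, Cohen's d = {cohens_d:.3f}")
```

With a sample this large the p-value lands well under 0.001, while the effect size is 0.005 standard deviations, a difference nobody would notice in practice.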
  
  --
  Since I can't tell them apart, I treat all ACs as the same person.
Re:93% of Programmers Think You're Wrong by ShakaUVM · 2010-01-09 12:01 · Score: 5, Insightful

A manga statistics book, eh?
I just realized I was a nerd. I looked at the table of contents and closed it down, then realized I hadn't even looked at the short skirt-wearing protagonist.
Sigh...
But to answer the article's point, elementary statistics are very easy. Advanced statistics are very hard. It's kind of like how people think "knowing the difference between circles and squares" is geometry and so analytical geometry must be just more of the same, right? It's quite possible the programmers think they know statistics because they know they're vaguely supposed to do a run multiple times, and maybe average the results or something.
It's also possible the author of the article is a know-it-all douchebag who tries to solve problems with overwrought solutions.
From TFA: "Zed: Fuck! Fuck! I have eyes! You do not! See!? No?! Exactly! Because you can't fucking see because you have no fucking eyes! Arrggh!"
Just throwing that theory out there.
Re:It's not just statistics by radarsat1 · 2010-01-09 12:06 · Score: 4, Insightful

I disagree that CS is just "programming and troubleshooting", but I do agree that Computer Science is a complete misnomer. It's extremely misleading, and difficult to explain to people, "I'm a computer scientist, but no I'm not actually a scientist, instead I understand how to describe formal languages in terms of strict grammar rules and transform abstract syntax trees from one representation to another."
It shouldn't be called Computer Science, it should be called Computational Mathematics, because that's what it is.
(On the other hand, there is a whole branch of CS that extends very deeply into statistics called Machine Learning, but at the core I'd say it is still more mathematics than science. There is also human-machine interaction which often goes under CS, but is actually more like psychology.. so it's not so cut and dried.)
He makes some good points... by SanityInAnarchy · 2010-01-09 12:07 · Score: 5, Insightful

...unfortunately, they are mostly lost in the irony of statements like this:

I think women are better programmers because they have less ego and are typically more interested in the gear rather than the pissing contest.
I doubt I've seen anyone more thoroughly entrenched in a pissing contest than Zed Shaw, of the website formerly known as "Zed's So Fucking Awesome".

--
Don't thank God, thank a doctor!
Summarized for people who don't want to read Zed by SanityInAnarchy · 2010-01-09 12:41 · Score: 4, Insightful

So, since so many people don't seem to want to actually read Zed's stuff -- and I honestly don't blame you -- I'll try to summarize:

Eventually, every major science adopted an empiricist view of the world. Except Computer Science of course.
He tends to bitch a lot about computer scientists. I'm just starting a CS degree, and there is a Statistics class in the curriculum. Is he working with people with good degrees, people from a technical college with a "programming" degree, people from a diploma mill, or high school students with no degree at all?
Of course, he seems to be implying it's everyone, and doing so in a typically Zed-like way.

"All you need to do is run that test [insert power-of-ten] times and then do an average." Usually the power-of-ten is 1000...
I don't know that I've ever heard that particular statement. But it's a good point:

How do you know that 1000 is the correct number of iterations to improve the power of the experiment?
Generally because it was probably closer to a million, so I'm erring on the side of taking more, rather than fewer, measurements. But without careful consideration, I could be way off.

How are you performing the samplings?
I think this is vastly less important than how you are dealing with the data, but it is also a good point. For example, his complaint is that an average isn't enough; with detailed enough logging, he could easily go back into my data and figure out min, max, standard deviation, histograms...
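For what "detailed enough logging" could look like, here is a rough Python sketch that keeps every timing sample instead of a running average, so min, max, median and standard deviation can be recovered later. The function names, the `work` stand-in, and the run and warm-up counts are all assumptions made up for this example, not anything from TFA.

```python
import statistics
import time

def collect_timings(func, runs=1000, warmup=50):
    """Call func repeatedly and return one wall-clock duration per measured run."""
    for _ in range(warmup):
        func()                      # ramp-up iterations, deliberately not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples

def summarize(samples):
    """Reduce raw samples to the figures a single average would hide."""
    return {
        "n": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
    }

def work():
    # Hypothetical stand-in for the code under test.
    sum(i * i for i in range(1000))

samples = collect_timings(work, runs=200, warmup=20)
report = summarize(samples)
for name, value in report.items():
    print(f"{name:>6}: {value:.6g}")
```

Because the raw `samples` list is kept, anyone can come back later and plot a histogram or a run chart from the same data.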

How do you know that 1000 is enough to get the process into a steady state after the ramp-up period?
Not a huge deal -- the "steady state" will almost certainly be faster than the "ramp-up" period. Worst case, I'm over-optimizing.

What will you do if the 1000 tests takes 10 hours?
Either ctrl+c, or try it 10 times.

How does 1000 sequential requests help you determine the performance under load?
Very good point here. It's still a useful statistic, but you still need to measure things like 1000 simultaneous requests, not just 1000 all in sequence.
On the other hand, if your performance is acceptable with them all in sequence, you could just run it through something like Event Machine, so it's all sequential on production, too.

The most troubling problem with these single number “averages” is that there’s two common averages and that without some form of range or variance error they are useless. If you take a look at the previous graphs you can see visually why this is a problem. Two averages can be the same, but hide massive differences in behavior...
So yes, always make sure you can record enough statistics so that someone else can come along and use your data to give you something meaningful.

The moral of the story is that if you give an average without standard deviations then you’re totally missing the entire point of even trying to measure something. A major goal of measurement is to develop a succinct and accurate picture of what’s going on...
It doesn't have to be statistically accurate. It just has to be close enough.
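A tiny Python illustration of the "same average, different behavior" point, using two made-up series of response times in milliseconds (the numbers are invented for this example):

```python
import statistics

# Two hypothetical sets of response times (ms) with identical means.
steady = [100, 101, 99, 100, 100, 102, 98, 100]
spiky = [60, 60, 60, 60, 60, 60, 60, 380]

for name, data in (("steady", steady), ("spiky", spiky)):
    print(
        f"{name:>6}: mean={statistics.mean(data):.1f} "
        f"median={statistics.median(data):.1f} "
        f"stdev={statistics.stdev(data):.1f} "
        f"max={max(data)}"
    )
```

Both series average 100 ms, but the second one hides a 380 ms spike that only the standard deviation, median and max reveal.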

Ah, confounding. The most difficult thing to explain to a programmer, yet the most elementary part of all scientific experimentation. It’s pretty simple: If you want to measure something, then don’t measure other shit.
This is both a very good and a very bad idea. It ties into the peeve he had before -- ramp-up time. For example:

If we want to take one single line of code and test it then we can. If we want to only verify one single query on a database then what’s stopping us?
What's stopping us is that our applications don't actually work like that.

--
Don't thank God, thank a doctor!
Everyone should learn statistics by jackchance · 2010-01-09 13:17 · Score: 4, Informative

Before computers, stats involved using parametric tests (t-tests, anova, etc) which made assumptions like "the data comes from an underlying normal distribution". BTW, in stats terms "normal" means "Gaussian".
Now, with cheap and fast computers, we can actually compute the confidence intervals non-parametrically through permutation tests and bootstrapping without assuming anything about underlying distributions. In most cases, this non-parametric test is the "right thing to do". Most of the time, the results are the same as using a parametric test.
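As a rough sketch of those two techniques in Python (stdlib only): a percentile bootstrap confidence interval for a mean, and a permutation test for a difference in means. The function names, the resample counts, and the synthetic data are assumptions made for this example, and the percentile interval is only the simplest of several bootstrap variants.

```python
import random
from statistics import fmean

def bootstrap_ci(data, stat=fmean, n_resamples=2000, level=0.95, rng=None):
    """Percentile bootstrap: resample with replacement, take quantiles of the statistic."""
    if rng is None:
        rng = random.Random()
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lo_index = int((1 - level) / 2 * n_resamples)
    hi_index = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_index], estimates[hi_index]

def permutation_p_value(a, b, n_permutations=2000, rng=None):
    """Two-sided permutation test for a difference in means between a and b."""
    if rng is None:
        rng = random.Random()
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)         # relabel the observations at random
        diff = abs(fmean(pooled[:len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Counting the observed labelling itself keeps the p-value above zero.
    return (hits + 1) / (n_permutations + 1)

rng = random.Random(42)
# Hypothetical skewed data (exponential, true mean 100), clearly not Gaussian.
latencies = [rng.expovariate(1 / 100) for _ in range(50)]
low, high = bootstrap_ci(latencies, rng=rng)
print(f"sample mean {fmean(latencies):.1f}, 95% bootstrap CI ({low:.1f}, {high:.1f})")

fast = [rng.gauss(0, 1) for _ in range(30)]
slow = [rng.gauss(3, 1) for _ in range(30)]
p_value = permutation_p_value(fast, slow, rng=rng)
print(f"permutation p-value: {p_value:.4f}")
```

Neither function assumes anything about the shape of the underlying distribution; the cost is the resampling loop, which grows with both the data size and the number of resamples.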
However, a HUGE disaster in empirical science has been the problem of multiple comparisons. With computers it is so easy to compute correlations and significance tests between every possible slice of your data set. Many "scientists" don't have good statistical knowledge and pray at the altar of "p < 0.05". They don't know about or understand the problem of multiple comparisons. So they do 20 tests, find one that comes out p < 0.05 and write a paper about it. They don't get that if you do 20 tests you are quite likely (around 64% if the tests are independent) to find one that comes out p < 0.05.
Anyone who has access to excel or matlab can do this little experiment.
samp = randn(50,1);               % 50 normally distributed random numbers
sig = zeros(1,100);
for x = 1:100
    test = randn(50,1);           % 50 more (mean=0, var=1)
    sig(x) = ttest2(samp,test);   % two-sample t-test; 1 means significant at the 5% level
end
Now look at the sig vector. OMG, about 5% of the tests came out significant!!!
Now you are writing a paper all about how x is linked to y. But you are essentially throwing dice and then writing a paper about why it came up '3-3'.
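For anyone without matlab, roughly the same experiment can be sketched in Python with only the standard library. Two things differ from the loop above and are assumptions of this sketch: it draws a fresh pair of samples for every test so the tests are independent, and it swaps the t-test for a z-test that treats the variance as known (it is 1 by construction). The function names are made up for the example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def z_test_p_value(a, b):
    """Two-sided z-test for equal means, assuming both samples have variance 1."""
    z = (fmean(a) - fmean(b)) / sqrt(1.0 / len(a) + 1.0 / len(b))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_tests, sample_size=50, alpha=0.05, seed=0):
    """Compare pairs of samples drawn from the SAME distribution; count 'significant' results."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        a = [rng.gauss(0, 1) for _ in range(sample_size)]
        b = [rng.gauss(0, 1) for _ in range(sample_size)]
        if z_test_p_value(a, b) < alpha:
            hits += 1
    return hits / n_tests

print(f"chance of at least one hit in 20 tests: {family_wise_error_rate(0.05, 20):.3f}")
print(f"simulated false-positive rate: {false_positive_rate(2000):.3f}")
```

The first figure is just 1 - 0.95^20, about 64%. The second should hover near 5% even though there is no real effect anywhere in the data.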

--
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765
1. Re:Everyone should learn statistics by Daniel+Dvorkin · 2010-01-09 13:57 · Score: 5, Interesting
  
  Resampling-based statistics haven't replaced parametric models, and I doubt they ever will, for one very simple reason: as the available processing power grows, so does the amount of data. In my field, bioinformatics, the size and complexity of the data sets follows a Moore's Law of its own, and I don't think bioinformatics is unique in this. "Just bootstrap it" is easy to say, and certainly there have been many times when dealing with an analytically intractable distribution when I've done just that, but if the analytical solution takes minutes and the bootstrap solution takes weeks, you have to take this into account.
  Of course, resampling isn't the only way to look at problems non-parametrically. Often a good compromise is to go with rank-based statistics, which are fast and easy to calculate -- and you may not have an analytically tractable model for the distribution of the original data, but you don't have to, since by working with ranks you can define a distribution with good analytical properties. You still need to do some reality-checking exploratory data analysis, of course, but this is an approach that generally works well in practice.
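A minimal Python sketch of the rank-based idea, using the Mann-Whitney U statistic (one common rank-based test, picked here as an example rather than named in the comment above). It relies on a normal approximation without a tie correction, which is an assumption that only holds up for reasonably sized samples; the data and function names are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0    # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """U statistic for sample a and a two-sided p-value (normal approximation)."""
    ranks = average_ranks(list(a) + list(b))
    n1, n2 = len(a), len(b)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)    # no tie correction
    z = (u - mean_u) / sd_u
    return u, 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical measurements; the 250.0 outlier would wreck a mean-based test
# but only counts as "the largest value" once converted to a rank.
group_a = [1.1, 2.3, 2.9, 3.8, 4.0, 5.2, 6.1, 7.7]
group_b = [8.4, 9.0, 9.9, 11.2, 12.5, 13.1, 250.0, 14.8]
u, p_value = mann_whitney_u(group_a, group_b)
print(f"U = {u}, p = {p_value:.4f}")
```

The work is one sort plus a sum, so it scales far better than a bootstrap over the same data.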
  
  --
  The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
Re:93% of Programmers Think You're Wrong by Daniel+Dvorkin · 2010-01-09 13:20 · Score: 5, Insightful

"Lies, damn lies and statistics" is all you need to know about statistics.
This is right up there with "'click on the big blue e' is all you need to know about the internet."
Speaking as both a statistician and a computer scientist, I've seen the statistics-vs.-CS argument play out many times before, and the lack of knowledge on both sides is really striking, but not all that surprising -- both are hard subjects which take a lot of work to master. The lack of mutual respect is both infuriating and pathetic, and there's no excuse for it.

--
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
Re:Statistical analysis of the summary by brian_tanner · 2010-01-09 13:56 · Score: 5, Informative

Wow. What class did you take that says if you don't know something you should assume equal probability?

I don't know if there is an invisible elephant in my kitchen, so I guess I should assign equal probability to both outcomes. I also don't really know how Baccarat works, I guess my odds are 50/50.

Without knowing something about him or his coworkers, you by definition cannot make any statistical statements. To make any statements, you would first need to make some observations. This is how statistics is different from logic. Statistics is grounded in data.

I don't agree with Zed, but you may have just proved his point.
Obligatory Stats Joke by frank249 · 2010-01-09 14:05 · Score: 4, Funny

"I construct two sets of n=100 random samples from the normal distribution. Now, if I just take the average (mean or median) of these two sets they seem almost the same."
So it's true. The n's justifies the means.

--
Today's vices may be tomorrow's virtues.
Re:93% of Programmers Think You're Wrong by Devout_IPUite · 2010-01-09 14:39 · Score: 4, Insightful

"It's also possible the author of the article is a know-it-all douchebag who tries to solve problems with overwrought solutions."

That was kinda what I got from this. Sure, my powers of ten runs to determine performance isn't statistically sound. Did I say it was? No. Why don't I care? Because my samples are cheap. Spiking vs non-spiking is something pretty easy to see when you glance at the data.

I mean, he said we're going to die if we don't learn statistics, but he never gave a compelling argument for it.

The best example was users, but even that was lacking. If you design a script that's as aggressive on a system as a high use user and your system supports as many 'users' as students, you're safe; if it supports fewer, then you work on qualifying the problem better.
Re:Very good (from someone who's taken BOTH)... ap by JWSmythe · 2010-01-09 15:13 · Score: 5, Informative

1.) EASILY SKEWED (as in "4/5 dentists chew trident", oh "sure, sure", especially when they're on the corporate payroll (or paid off to say so by said corporation so their "evidence & observation looks good")
and
2.) IS THE SAMPLE SET LARGE & COMPREHENSIVE ENOUGH? (most?? Most are not, period)...
You know, that particular citation has made me wonder in the past, but not enough to actually research it. So, I went off looking for more information and found it.
The statistic was generated from a July 1976 survey.
The sample group for this statistic was 1,200 dentists. These dentists were hand picked by the research company, probably with good reason.
They were asked, what advice would they give gum-chewing patients
1) sugared gum
2) sugarless gum
3) no gum at all.
Sugarless gum got 85% of the vote. Not terribly surprising. I'd be fairly confident that their time had been paid for, or at very least they were told "This survey is being done for Trident Sugarless Gum." That is only speculation, so hush up.
17/20 doesn't really sound very good. It just doesn't stick in your head. 4/5 is close enough, even though it reduces your answer to 80% (ahhh, a lie). Since these are marketing folks, I'm sure they pushed all kinds of values past focus groups, until "4 in 5" was accepted as most favorable.
As the link cites, they're fairly confident that the "sugared gum" answer got at least one response. There's always someone that'll take the obvious wrong answer. If you don't believe that, look at any Slashdot poll. :)
What they don't say is how many of the 1,200 samples were dropped. I'm sure there were non-responses, and they could have easily added any number of unfavorable answers in as non-responses. Of course, they couldn't have 100% in their favor, so they had to keep some.
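For a sense of how much of the gap between 85% and "4 in 5" could be plain sampling noise, here is a quick Python sketch of a normal-approximation confidence interval for a proportion. It assumes all 1,200 dentists answered and exactly 1,020 of them (85%) picked sugarless gum, which the survey summary does not actually confirm, and it only accounts for random sampling error, not for a hand-picked sample or dropped responses.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, level=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf((1 + level) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Assumed counts: 1,020 of 1,200 respondents, i.e. the reported 85%.
low, high = proportion_ci(1020, 1200)
print(f"85% of 1,200: 95% CI from {low:.3f} to {high:.3f}")
```

Under those assumptions the interval is about two percentage points either side of 85%, so 80% sits below it: "4 in 5" undersells the survey's own number rather than inflating it.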

--
Serious? Seriousness is well above my pay grade.
Re:93% of Programmers Think You're Wrong by Dwonis · 2010-01-09 17:21 · Score: 4, Insightful

Thing is: You can only be expert in ONE of them. Period.
Hundreds of cryptologists prove you wrong.