Why Standard Deviation Should Be Retired From Scientific Use
An anonymous reader writes "Statistician and author Nassim Taleb has a suggestion for scientific researchers: stop trying to use standard deviation in your work. He says it's misunderstood more often than not, and also not the best tool for its purpose. Taleb thinks researchers should use mean deviation instead. 'It is all due to a historical accident: in 1893, the great Karl Pearson introduced the term "standard deviation" for what had been known as "root mean square error." The confusion started then: people thought it meant mean deviation. The idea stuck: every time a newspaper has attempted to clarify the concept of market "volatility", it defined it verbally as mean deviation yet produced the numerical measure of the (higher) standard deviation. But it is not just journalists who fall for the mistake: I recall seeing official documents from the department of commerce and the Federal Reserve partaking of the conflation, even regulators in statements on market volatility. What is worse, Goldstein and I found that a high number of data scientists (many with PhDs) also get confused in real life.'"
...because people use it incorrectly in economics? Get bent. The standard deviation is a useful tool for statistical analysis of large populations.
The meaning of standard deviation is something you learn on a basic statistics course.
We don't ask biochemists to change their terms because the electron transport chain is complicated.
We don't ask cryptographers to change their terms because the difference between extra entropy and multiplicative prediction resistance is not obvious.
We should not ask statisticians to change their terms because people are too stupid to understand them.
I should use this sig to advertise my book ISBN-13 : 978-1501515132.
On the other hand, you also need to use 2-pass algorithms to compute Mean Absolute Deviation, whereas STD can be easily calculated in one pass. And you still need standard deviation as it relates directly to the second moment about the mean.
Also, annoyingly, Median Absolute Deviation competes for the MAD name and is more robust against outliers.
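For anyone who wants to see the one-pass vs. two-pass difference concretely, here is a rough sketch. It assumes Welford's running-update method for the standard deviation (one common choice, not the only one), and the function names and sample data are made up for illustration.

```python
import math

def one_pass_std(stream):
    """Population standard deviation in a single pass (Welford's method)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return math.sqrt(m2 / n) if n else float("nan")

def two_pass_mean_abs_dev(values):
    """Mean absolute deviation about the mean; the data is read twice."""
    values = list(values)  # must be kept around for the second pass
    mean = sum(values) / len(values)                          # pass 1
    return sum(abs(x - mean) for x in values) / len(values)   # pass 2

# Made-up sample data, purely for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(one_pass_std(iter(data)))       # works on an iterator it sees only once
print(two_pass_mean_abs_dev(data))    # needs the whole list
```

The point of the sketch is only that the first function never stores the data, while the second has to hold on to it (or re-read it) once the mean is known.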
Sanity is a sandbox. I prefer the swings.
The problem is that people think they understand statistics when all they know is how to enter numbers into a program to generate "statistics".
They mistake the tools-used-to-make-the-model for reality. Whether intentionally or not.
Standard Deviation is the square root of the second moment about the mean, a fundamental concept in the theory of probability distributions. Looking at moments of probability distributions gives us lots of tools that have been developed over the years, and in many cases we can apply closed-form solutions with reasonably lenient assumptions. Then we apply the square root in order to put it in the same units as the original list of observations and get some of the heuristic advantages that Taleb attributes to the mean absolute deviation.
But it is a balance, and any data set should be looked at from multiple angles, with multiple summary statistics. To say MAD is better than standard deviation is a reasonable point (with which I would disagree), but to say we should stop using standard deviation (the point made in TFA) is totally incorrect.
First!
... to within 0.5 standard deviations.
Actually, the more posts this story attracts, the more accurate your statement is, and the fewer standard deviations you are away from true first. Response times not being distributed in a Gaussian curve perhaps complicates things.
Perhaps non-mathematicians don't have a problem with this, but it rubs me the wrong way.
What makes the mean an interesting quantity is that it is the constant that best approximates the data, where the measure of goodness of the approximation is precisely the way I like it: as the sum of the squares of the differences.
I understand that not everybody is an "L2" kind of guy, like I am. "L1" people prefer to measure the distance between things as the sum of the absolute values of the differences. But in that case, what makes the mean important? The constant that minimizes the sum of absolute values of the differences is the median, not the mean.
So you either use mean and standard deviation, or you use median and mean absolute deviation. But this notion of measuring mean absolute deviation from the mean is strange.
Anyway, his proposal is preposterous: I use the standard deviation daily and I don't care if others lack the sophistication to understand what it means.
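The L2/L1 point above is easy to check numerically. A minimal sketch, assuming a brute-force search over a grid of candidate constants (the data set and the grid are invented for the demo):

```python
import statistics

def sum_sq(c, xs):
    """L2 cost: sum of squared differences from the constant c."""
    return sum((x - c) ** 2 for x in xs)

def sum_abs(c, xs):
    """L1 cost: sum of absolute differences from the constant c."""
    return sum(abs(x - c) for x in xs)

# Made-up data with one large outlier so mean and median differ a lot.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

# Candidate constants 0.0, 0.1, ..., 100.0 (a crude grid search).
grid = [i / 10 for i in range(0, 1001)]

best_l2 = min(grid, key=lambda c: sum_sq(c, data))
best_l1 = min(grid, key=lambda c: sum_abs(c, data))

print(best_l2, statistics.mean(data))    # L2 minimizer vs. the mean
print(best_l1, statistics.median(data))  # L1 minimizer vs. the median
```

On this data the squared-error minimizer lands on the mean and the absolute-error minimizer lands on the median, which is why pairing the mean with an absolute-deviation measure looks odd to an "L2" person.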
I also think averages should go away. Most people assume they are being given the median (the number in the middle) when someone reports an average. It's great for real estate agents, and people trying to advocate for tax reform, but the numbers are not what people think they are.
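To put a number on that, here is a tiny sketch with entirely made-up house prices (in thousands), where one expensive listing drags the mean far away from the "number in the middle":

```python
import statistics

# Hypothetical prices in thousands; the last one is a deliberate outlier.
prices = [150, 160, 170, 180, 2000]

mean_price = statistics.mean(prices)      # sensitive to the outlier
median_price = statistics.median(prices)  # the middle value

print(mean_price, median_price)
```

Four of the five houses cost well under the reported "average", which is the kind of gap being complained about here.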
I often change CSensiblyNamedClassThatDescribesItsFunctionWell to bTrue throughout the code for precisely this reason and no-one ever appreciates it :(
Hi, I'm a statistician.
It's not so simple to just say "ok, we're going to use the Mean Absolute Deviation from now on." The use of standard deviation is not quite the historical accident that Taleb makes it out to be--there are good reasons for using it. Because it is a one-to-one function of the second central moment (variance), it inherits a bunch of nice properties that the mean absolute deviation does not. There is not a one-to-one correspondence between variance and mean absolute deviation.
Taleb is correct that the mean absolute deviation is easier to explain to people, but this is not just a matter of changing units of measure (where there is a one-to-one correspondence) or changing function and variable names in code (where there is again a one-to-one correspondence). Standard deviation and mean absolute deviation have different theoretical properties. These differences have led most statisticians over the last hundred years to conclude that the standard deviation is a better measure of variability, even though it is harder to explain.
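A quick way to see the lack of a one-to-one correspondence: two data sets can share the same standard deviation and still have different mean absolute deviations. A small sketch, with data sets invented for the purpose and a hypothetical helper `mean_abs_dev`:

```python
import math
import statistics

def mean_abs_dev(xs):
    """Mean absolute deviation about the mean."""
    m = statistics.fmean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

# Both sets have mean 0 and population variance 1 (up to rounding).
a = [-1.0, -1.0, 1.0, 1.0]
b = [-math.sqrt(2), 0.0, 0.0, math.sqrt(2)]

print(statistics.pstdev(a), mean_abs_dev(a))
print(statistics.pstdev(b), mean_abs_dev(b))
```

So knowing one of the two measures does not pin down the other without extra assumptions about the shape of the distribution.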
I would have said "18 half gallon pottles to the quarter-barrel firkin."
Wolfram Alpha says 15.75 pottles to the firkin, but that's because of US/UK gallon conversions, I reckon.
352 nails in a chain - which was interesting to me, in that Google includes those units in its calculator.
I now know more about pottles, firkins, nails and chains than I did when I woke up. I shudder to think about what got pushed out of my old head to make way for these new minutiae.
...and besides... JUST THINK of all the rigorous Lean Management courses that will have to re-certify all of their "Six-Sigma Black Belts" to some kind of "Half-Dozen of the Other" degrees!
PANDEMONIUM!!!
Data science is a field that combines machine learning and statistics to derive meaning from data. Data scientists should be reasonably well-versed in classical stats, but the data sets they deal with are often huge, ill-defined, and not amenable to analysis using classical methods. To deal with such challenges, data science recruits a healthy combination of certain areas of comp-sci (databases, machine learning, NLP, AI), statistical methods, and, quite often, improvisation.
Strange that there are so many people on here that are unfamiliar with data science.