Weak Statistical Standards Implicated In Scientific Irreproducibility
ananyo writes "The plague of non-reproducibility in science may be mostly due to scientists' use of weak statistical tests, as shown by an innovative method developed by statistician Valen Johnson, at Texas A&M University. Johnson found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in many fields including social science — still meant that as many as 17–25% of such findings are probably false (PDF). He advocates for scientists to use more stringent P values of 0.005 or less to support their findings, and thinks that the use of the 0.05 standard might account for most of the problem of non-reproducibility in science — even more than other issues, such as biases and scientific misconduct."
I heard it more than once !!
Use Bayesian statistics.
Doubt it makes a difference; the root of this problem is systematic errors.
That is because of the central limit theorem, (http://en.wikipedia.org/wiki/Central_limit_theorem), which indicated that for a large number of independent samples, it doesn't matter what the original distribution was, and we certainly can reliably use the normal distribution. It is NOT unfounded.
That, and the fact that all of statistics is a joke. It's all based on the assumption that data is distributed in a bell curve. Sure, a bell curve does fit a lot of data, but we blindly assume it fits everything which just can't be true.
We do not assume everything fits a bell curve.
STA-101: When using a normal curve, there needs to be a good reason for it.
In many cases, that good reason is the Central Limit Theorem.
Five sigma is the standard of proof in Physics. The probability of a background fluctuation is a p-value of something like 0.0000006.
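For anyone who wants to check the five-sigma figure, here is a minimal sketch that converts a sigma level into a tail probability of the standard normal distribution. It assumes a plain Gaussian background; the helper name `normal_tail` is made up for illustration. The one-sided tail comes out near 3e-7 and the two-sided value near 6e-7, which matches the "something like" figure quoted above.

```python
import math


def normal_tail(z: float) -> float:
    """One-sided upper-tail probability P(Z > z) for a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))


one_sided = normal_tail(5.0)   # chance of an upward fluctuation of 5 sigma or more
two_sided = 2 * one_sided      # chance of a fluctuation of 5 sigma in either direction

print(f"one-sided: {one_sided:.3g}")
print(f"two-sided: {two_sided:.3g}")
```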
Such an admonishment is fine for the computational fields, where a few more permutations can net you a p-value of 0.0005 (assuming that you aren't crunching on a 4-month cluster problem). However, biological lab experiments are often very expensive and take a lot of time. Furthermore, additional tests are not always possible, since it can be damn hard to reproduce specific mutations or knockout sequences without altering the surrounding interactive factors.
So, should we go for a better p-value for the experiment and scrap any complicated endeavour, or should we allow for difficult experiments and take it with a grain of salt?
Truth is expensive.
If we were to insist on statistically meaningful results 90% of our contemporary journals would cease to exist for lack of submissions.
Statistics does not, by any means, make that assumption. If it did, the entire field of statistics would have been completed by 1810.
Mediocre (actually, sub-mediocre) practitioners of statistics make that assumption.
It is true that many estimators tend to a normal distribution as the sample size gets large, but this is not the same as assuming that the data itself comes from the normal distribution.
Personally, I've considered results with p values between 0.01 and 0.05 as merely 'suggestive': "It may be worth looking into this more closely to find out if this effect is real." Between 0.01 and 0.001 I'd take the result as tentatively true - I'll accept it until someone refutes it.
If you take p=0.04 as demonstrating a result is true, you're being foolish and statistically naive. However, unless you're a compulsive citation follower (which I'm not) you are somewhat at the mercy of other authors. If Alice says "In Bob (1998) it was shown that ..." I'll tend to accept it without realizing that Bob (1998) was a p=0.04 result.
Obligatory XKCD
Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.
http://xkcd.com/882/
No, statisticians certainly do not assume that. If everything in my field were normally distributed then my life would be a lot easier, but it's not, and we're aware that it's not.
Authors need to read this: http://www.deirdremccloskey.com/articles/stats/preface_ziliak.php
It explains quite clearly why a p value of 0.05 is a fairly arbitrary choice, as it cannot possibly be the standard for every possible study out there. Or, to put it another way, be very skeptical when one sole number (namely 0.05) is supposed to be a universal threshold to decide on the significance of all possible findings, in all possible domains of science. The context of any finding still matters for its significance.
More researchers in the biological sciences are using other more rigorous methods now than the Student's t-test and a p value of 0.05. ANOVA, ANCOVA and ranking methodologies are commonplace. Many scientific findings are based on a P value below 0.01. The problem with bad science certainly involves some bad statistics, but more often it just involves bad methodology, and poor attention to the previous literature (and thus attempting to reinvent the wheel). If your findings are robust and reproducible, then the statistics work out just fine. The good news is that science is self correcting, even if sometimes the corrections seem tardy.
A brain is a terrible thing to waste... Mind? That's debatable.
Unreliable research
Trouble at the lab
Scientists like to think of science as self-correcting. To an alarming degree, it is not
Oct 19th 2013 |From the print edition
The Economist
First, the statistics, which if perhaps off-putting are quite crucial. Scientists divide errors into two classes. A type I error is the mistake of thinking something is true when it is not (also known as a “false positive”). A type II error is thinking something is not true when in fact it is (a “false negative”). When testing a specific hypothesis, scientists run statistical checks to work out how likely it would be for data which seem to support the idea to have come about simply by chance. If the likelihood of such a false-positive conclusion is less than 5%, they deem the evidence that the hypothesis is true “statistically significant”. They are thus accepting that one result in 20 will be falsely positive—but one in 20 seems a satisfactorily low rate.
In 2005 John Ioannidis, an epidemiologist from Stanford University, caused a stir with a paper showing why, as a matter of statistical logic, the idea that only one such paper in 20 gives a false-positive result was hugely optimistic. Instead, he argued, “most published research findings are probably false.” As he told the quadrennial International Congress on Peer Review and Biomedical Publication, held this September in Chicago, the problem has not gone away.
Dr Ioannidis draws his stark conclusion on the basis that the customary approach to statistical significance ignores three things: the “statistical power” of the study (a measure of its ability to avoid type II errors, false negatives in which a real signal is missed in the noise); the unlikeliness of the hypothesis being tested; and the pervasive bias favouring the publication of claims to have found something new.
http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble
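The logic in the excerpt above can be sketched with a few lines of arithmetic. This is only an illustration under assumed numbers: the prior share of true hypotheses and the power values below are hypothetical inputs, not figures from the article, and the function name `false_positive_share` is invented here. It also ignores publication bias, the third factor the excerpt mentions, so real-world numbers could be worse.

```python
def false_positive_share(prior_true: float, power: float, alpha: float) -> float:
    """Share of 'statistically significant' results that are false positives.

    prior_true: assumed fraction of tested hypotheses that are actually true
    power:      chance a true effect is detected (1 - type II error rate)
    alpha:      significance threshold (type I error rate)
    """
    true_hits = prior_true * power
    false_hits = (1.0 - prior_true) * alpha
    return false_hits / (true_hits + false_hits)


# Hypothetical scenarios: likely vs unlikely hypotheses, strong vs weak studies.
for prior in (0.5, 0.1):
    for power in (0.8, 0.4):
        share = false_positive_share(prior, power, alpha=0.05)
        print(f"prior={prior:.1f} power={power:.1f} -> false share={share:.2f}")
```

With a 0.05 threshold, the share of false positives among significant results is 5% only by coincidence; it depends heavily on how plausible the hypotheses were to begin with and on how well powered the studies are.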
Oh, people can come up with statistics to prove anything, Kent. 14% of people know that.
Unless of course we happen to be working in a chaotic system where strange attractors mean there can be no centrality to the data.
Chaos theory is a lot younger than the central limit theorem. The situation might be similar to the way Einstein's theory of relativity has moved Newton's three laws from a position of central importance in all physics to something that works well enough in a small subset. A subset that is extremely important in our daily life, but still a subset.
Some portions of a chaotic system will be consistent with what the central limit theorem would predict. Other data sets from the same system, uh, no.
An important question I do not believe has been answered yet (I am an armchair follower of this stuff, neither expert nor student) is whether all the systems we work with where the CLT does seem to hold are merely subsets of larger systems. A related question would be whether there is any test that can be applied to a discrete data set that rules out its being a subset of a larger chaotic process.
Will
A significant problem is that many of the people who quote p values do it without understanding what a p value actually means. Getting p = 0.05 does not mean that there is only a 5% chance that the model is wrong. That is one of the fundamental misunderstandings in statistics, and I suspect that it is behind a lot of the cases of scientific irreproducibility.
Just because you are paranoid does not mean that no-one is out to get you.
You hardly need chaos theory to come up with examples where a statistical estimator is not normally-distributed.
Chaos also occurs in the dynamic evolution of a system, so it's hard to see the connection you're implying with statistics. Even one example would be great.
http://www.youtube.com/watch?v=HtMX_0jDsrw
So it gives you a very valid excuse to assume that the value distribution of some quantity occurring in nature will follow a Normal distribution when you know nothing else about it.
But there's the crux: it remains an assumption; a hypothesis, and fortunately it's usually a *testable* hypothesis. It's the responsibility of a researcher to check if it holds, and to see how problematic it is when it doesn't.
If something has a normal distribution, its square or its square root (or another power) doesn't have a Normal distribution. Take for example the diameter, surface area, and volume of berries. The diameter (goes with the radius, r), the surface area (goes with r^2), and the volume of berries (goes with r^3). They cannot all be Normally distributed at the same time, so assuming any of them is starts you out on shaky foundation.
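The berry argument is easy to check numerically. The sketch below assumes radii drawn from a normal distribution with mean 1.0 and standard deviation 0.2 (made-up values) and compares the skewness of r, r^2 and r^3. A normal distribution has skewness 0, so clearly positive skewness for the powers shows they cannot also be normal.

```python
import random


def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5


rng = random.Random(0)
radii = [rng.gauss(1.0, 0.2) for _ in range(100_000)]  # assumed berry radii
areas = [r ** 2 for r in radii]                         # proportional to surface area
volumes = [r ** 3 for r in radii]                       # proportional to volume

skew_r = skewness(radii)
skew_area = skewness(areas)
skew_volume = skewness(volumes)

print(f"skewness of r:   {skew_r:.3f}")
print(f"skewness of r^2: {skew_area:.3f}")
print(f"skewness of r^3: {skew_volume:.3f}")
```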
Okay, here's the real problem with scientific studies.
All science is data compression, and all studies are intended to compress data so that we can make future predictions. If you want to predict the trajectory of a cannonball, you don't need an almanac cross referencing cannonball weights, powder loads, and cannon angles - you can calculate the arc to any desired accuracy with a set of equations that fit on half a page. The half-page compresses the record of all prior experience with cannonball arcs, and allows us to predict future arcs.
Soft science studies typically make a set of observations which relate two measurable aspects. When plotted, the data points suggest a line or curve, and we accept the linear-regression (line or polynomial) as the best approximation for the data. The theory being that the underlying mechanism is the regression, and unrelated noise in the environment or measurement system causes random deviations of observation.
This is the wrong method. Regression is based on minimizing squared error, which was chosen by Laplace for no other reason than that it is easy to calculate. There's lots of "rationalization" explanations of why it works and why it's "just the best possible thing to do", but there's no fundamental logic that can be used to deduce least squares from fundamental assumptions.
Least squares introduces several problems:
1) Outliers will skew the values, and there is no computable way to detect or deal with outliers (source).
2) There is no computable way to determine whether the data represent a line or a curve - it's done by "eye" and justified with statistical tests.
3) The resultant function frequently looks "off" to the human eye; humans can frequently draw better-matching curves, meaning curves which better predict future data points.
4) There is no way to measure the predictive value of the results. Linear regression will always return the best line to fit the data, even when the data is random.
The right way is to show how much the observation data is compressed. If the regression function plus data (represented as offsets from the function) take fewer bits than the data alone, then you can say that the conclusions are valid. Further, you can tell how relevant the conclusions are, and rank and sort different conclusions (linear, curved) by their compression factor and choose the best one.
Scientific studies should have a threshold of "compresses data by N bits", rather than "1-in-20 of all studies are due to random chance".
Chaos also occurs in the dynamic evolution of a system, so it's hard to see the connection you're implying with statistics.
When I turn that around, it seems to say that statistics is only of value in systems that have fully matured. Which sounds like most of the time statistics have no value.
Is that correct? Or is there some other way to reverse the quotation?
Will
Well, there's really nothing to turn around... you're spouting a lot of pseudo-science here, and still nothing that you've said has even suggested why statistics wouldn't work on "immature" (whatever that means) systems. The central limit theorem can apply to dynamic systems, and even if the CLT didn't hold, that doesn't mean that statistics is impossible. There are many estimators which do not obey the CLT.
Just google "statistics of chaotic systems" or whatever. You'll find plenty of work on the subject. Admittedly, they are using "statistics" the way physicists do, but it's still the same idea: a mathematical characterization.
Basically, whenever there is a probabilistic model for something, statistics happens when you are ignorant of (certain aspects of) the model, and try to infer what you don't know from the data. Again, google "dynamic statistical models"; you'll find a lot.
Having quickly skimmed the paper, I'll give an example of the problem.
I couldn't quickly find a real data set that was easy to interpret, so I'm going to make up some data.
Chance to die before reaching this age
Age  woman  man
80   .54    .65
85   .74    .83
90   .88    .96
95   .94    .98
We have a person who is 90 years old. Taking the null hypothesis to be that this person is a man, we can reject the hypothesis that this is a man with greater than 95 percent confidence (p=0.04). However, if we do a Bayesian analysis assuming prior probabilities of 50 percent for the person being a man or a woman, we find that there is a 25 percent chance that the person is a man after all (as women are 3 times more likely to reach age 90 than men are.)
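The arithmetic behind that example, as a small sketch. The two mortality values are the made-up numbers for age 90 from the comment (.88 for women, .96 for men), and the 50/50 prior is the one stated there; the variable names are just for illustration.

```python
# Made-up values from the comment: chance to die before reaching age 90.
p_die_before_90 = {"woman": 0.88, "man": 0.96}
prior = {"woman": 0.5, "man": 0.5}

# Likelihood of the observation "reached age 90" under each hypothesis.
likelihood = {sex: 1.0 - p for sex, p in p_die_before_90.items()}

# Frequentist view: probability of a result at least this extreme if the
# null hypothesis ("this person is a man") is true.
p_value_man = likelihood["man"]

# Bayesian view: posterior probability of each hypothesis given the observation.
evidence = sum(prior[sex] * likelihood[sex] for sex in prior)
posterior = {sex: prior[sex] * likelihood[sex] / evidence for sex in prior}

print(f"p-value under 'man': {p_value_man:.2f}")
print(f"posterior P(man | reached 90): {posterior['man']:.2f}")
```

So the same data that rejects "man" at p=0.04 still leaves a one-in-four chance that the person is a man, because the alternative hypothesis explains the observation only three times better.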
(Having 11 percent signs in my post seems to have given /. indigestion so I've had to edit them out.)
Johnson found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in many fields including social science — still meant that as many as 17–25% of such findings are probably false (PDF).
Found? Was he unaware that using a threshold of 0.05 means a 20% probability that a finding is a chance result - by definition ?
More interesting, IMO, is that statistical significance doesn't tell you what the scale of an effect is. There can be a trivial difference between A and B even if the difference is statistically significant. People publish it anyway.
Sheesh, evil *and* a jerk. -- Jade
In what way ?
A surprising number of senior scientists are not aware of the problems introduced by ending an experiment based on achieving a certain significance level. By taking the significance as the criterion of the experiment, you don't actually know anything about the significance. Your highly significant result may just be a fluctuation because, had you continued, the high signal-to-noise ratio could well dissipate. Too often I've heard senior scientists advising junior scientists: You've got three sigma, publish. But, proper procedure is to design an experiment to run for a certain duration and then find out what the result is.
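A quick simulation shows how much damage "stop when it's significant" can do. This is a sketch under simplifying assumptions: the null hypothesis is true (pure noise with known standard deviation 1), the test is a two-sided z-test at the 5% level, and the peeking experimenter checks after every observation from the 10th to the 100th. All names and settings here are invented for illustration.

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% threshold for a z-test with known sd = 1


def run_trial(rng, max_n, peek_from=None):
    """Return True if the experiment declares significance under a true null.

    With peek_from=None the test is run once, at max_n (fixed design).
    Otherwise the test is run after every observation from peek_from onward
    and the experiment stops at the first significant result.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        z = total / math.sqrt(n)
        peeking = peek_from is not None and n >= peek_from
        if (peeking or n == max_n) and abs(z) > Z_CRIT:
            return True
    return False


def false_positive_rate(peek_from, trials=2000, max_n=100, seed=1):
    rng = random.Random(seed)
    hits = sum(run_trial(rng, max_n, peek_from) for _ in range(trials))
    return hits / trials


fixed_rate = false_positive_rate(peek_from=None)
peeking_rate = false_positive_rate(peek_from=10)

print(f"fixed design:      {fixed_rate:.3f}")
print(f"optional stopping: {peeking_rate:.3f}")
```

The fixed design should land near the nominal 5%, while the peeking design produces false positives several times as often, even though each individual test uses the same threshold.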
Medicine has a formal means to end a trial early if a medicine turns out to be dangerous or particularly helpful. This is an ethical consideration. But, it does make the trial results void.
I'm not certain his conclusions are correct... I could not reproduce them...
At one level, they are right that unreproducible results are usually not fraud, but are simply fluctuations that make a study look promising, leading to publication. But raising the standard of statistical significance will not really improve the situation. The most important uncertainties in most scientific studies are not random. You can't quantify them assuming a Gaussian distribution. There are all kinds of choices made in acquiring, processing, and presenting data. The incentives that scientists have are all pushing them to look for ways to obtain a high-profile result. We make our best guesses trying to be honest, but when a set of guesses leads to a promising result we publish it and trust further study to determine whether our guesses were fully justified.
There is one step that would improve the situation. We need to provide a mechanism to receive career credit for reproducing earlier results or for disproving earlier results. At the moment, you get no credit for doing this. And you will never get funding to do it. The only way to be successful is to spit out a lot of papers and have some of them turn out to be major results that others build on. The number of papers that turn out to be wrong is of no consequence. No one even notices except a couple of researchers who try to build on your result, fail, and don't publish. In their later papers they will probably carefully dance around the error so as not to incur the wrath of a reviewer. If reproducing earlier results was a priority, then we would know earlier which results were wrong and could start giving negative career credit to people who publish a lot of errors.
Ahhh!! it's 1/20, not twenty percent. Of course, it's 5%.
The bigger problem is the habit of confusing correlation with cause.
I do not fail; I succeed at finding out what does not work.
you're spouting a lot of pseudo-science here
I agree that there IS a lot of pseudo-science here, and that I have fallen into a nasty trap.
What can I say? This is not the first time an AC troll has gotten me good, and it probably will not be the last.
Now get thee back under that dark, damp, cobwebby bridge where thou belongest! Or I shall sprinkle thee with Troll-B-Gone powder and there will be nothing left around here but some grins and giggles.
Will
This is a geek website, not a "research" website so stop talking a bunch of crap about a bunch of crap. I'm providing silly examples so don't focus upon them. Most researchers suck at stats and my attempt at explaining should either help out or show that I don't know what I'm talking about. Take your pick.
"p=.05" is a stat that reflects the likelihood of rejecting a true null hypothesis. So, let's say that my hypothesis is that "all cats like dogs" and my null hypothesis is "not all cats like dogs." If I collect a whole bunch of imaginary data, run it through a program like SPSS, and the results turn out that my hypothesis is correct, then there was a 5 percent (.05) chance of getting that result even if the null hypothesis is actually true. In that particular imaginary case, I would have committed a Type I Error. This error has a minimal impact because the only bad thing that would happen is some dogs get clawed on the nose and a few cats get eaten.
Now, on a typical experiment, we also have to establish beta which is the likelihood of committing a type II error, that is, accepting a false null hypothesis. So let's say that my hypothesis is that "Sex when desired makes men happy" and my null hypothesis is "Sex only when women want it makes men happy." It's not a bad thing if #1 is accepted but the type II error will make many men unhappy.
Now, this is a give and take relationship. Every time that we make p smaller (.005, .0005, .00005, etc.) for "accuracy," then, for a fixed sample size, the risk of committing a type II error increases. A type II error when determining what games 15-year-olds like to play doesn't really matter if we are wrong but if we start talking about drugs and false positives then the increased risk of a type II error really can make things ugly.
Next, there are guidelines for determining how many participants are needed for lower p (alpha) values. Social sciences (hold back your Sheldon jokes) that do studies on students might need let's say 35 subjects/people per treatment group at p=.05 whereas with a .005 might need 200 or 300 per treatment group. I don't have a stats book in front of me but .0005 could be in the thousands. Every adjustment impacts a different item in a negative fashion. You can have your Death Star or you can have Luke Skywalker. Can't have 'em both.
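To put rough numbers on the trade-off, here is a sketch using the textbook normal approximation for comparing two group means. The effect size (0.5 standard deviations) and target power (0.8) are assumptions chosen for illustration, and the function names are invented; a real study would use a proper power analysis. Under these assumptions, tightening alpha from .05 to .005 needs roughly 70 percent more subjects per group rather than several times as many, and keeping the smaller sample drops power to about a coin flip.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def n_per_group(alpha, power=0.8, effect_size=0.5):
    """Approximate subjects per group for a two-sided, two-sample z-test."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


def power_at(n, alpha, effect_size=0.5):
    """Approximate power with n subjects per group (ignores the far tail)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(effect_size * math.sqrt(n / 2) - z_alpha)


for alpha in (0.05, 0.005, 0.0005):
    print(f"alpha={alpha}: about {n_per_group(alpha)} per group")

n_small = n_per_group(0.05)
print(f"power with {n_small} per group at alpha=0.05:  {power_at(n_small, 0.05):.2f}")
print(f"power with {n_small} per group at alpha=0.005: {power_at(n_small, 0.005):.2f}")
```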
Finally, there is the statistical concept of effect size (not to be confused with power, which is the chance of detecting an effect that is really there), that is, there are stats for measuring the impact of a treatment. Basically, how much of the variance between group A and group B can be assigned to the experimental treatment. This takes precedence in many people's minds over simply determining if we have a correct or incorrect hypothesis. Assigning p does not answer this.
Anyways, I'm going to go have another beer. Discard this article and move onto greener pastures.
The CLT is one of the most elegant and powerful results in all of mathematics, and can be used, quite appropriately, to justify normal models for all sorts of measurements. That being said, its usefulness has led to the dumbed-down idea of "the bell curve" being the appropriate model for all sorts of things where it's clearly not--I don't know how many times I've seen a normal curve superimposed on a histogram or kernel density estimation of data that are clearly non-normal. As another poster pointed out, there are simple and well-understood tests for normality, and failure to apply them when constructing a normal model is just ridiculous.
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
The Central Limit Theorem doesn't state that the samples are normally distributed, but that their mean (average) is. So the average surface area, volume, and diameter will all be approximately normally distributed for a large sample of independent berries (i.e., not from the same plants, and so forth).
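The distinction between the data and the mean of the data can be seen in a small simulation. This sketch assumes an exponential distribution as the skewed "raw" quantity and a sample size of 50, both chosen arbitrarily. The raw values stay heavily skewed no matter how many you collect, while the sample means are close to symmetric; the leftover skewness shrinks like 1/sqrt(n), so the normal approximation is only approximate at any finite n.

```python
import random


def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5


rng = random.Random(0)
SAMPLE_SIZE = 50  # observations averaged into each sample mean

# Raw observations: exponential, theoretical skewness 2.
raw = [rng.expovariate(1.0) for _ in range(50_000)]

# Means of many independent samples drawn from the same distribution.
means = [
    sum(rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(5_000)
]

skew_raw = skewness(raw)
skew_means = skewness(means)

print(f"skewness of raw data:     {skew_raw:.2f}")
print(f"skewness of sample means: {skew_means:.2f}")
```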
Climate models are currently, at best, when treated as an ensemble (if you buy that as legitimate), skirting along the p 0.05 level of significance in the validation period.
Pointing this out is considered trolling -- it probably offends some religious sensibilities.
Tightening the threshold as the article suggests would mean the model results are not "significant" (i.e., not reasonably distinguishable from natural variation -- note that I am not a "denier" and that I do accept that CO2 is a greenhouse gas etc. etc.; I am however hugely skeptical of most climate and environmental science that I have investigated).
It sounds like you have a clue about statistics. Do you know of a good forum to ask a fairly involved statistics question? I have a set of measured variables A-E which all tend to indicate the likelihood of X. The relationships are a bit complex and unknown, though, so I need help with how I should analyze the historical data in order to come up with parameters to use in the future for making "predictions" of X based on known values of A-E.
"innovative methods"??? I do not know of a single serious scientist who hasn't been lectured on the ills of weak testing (and told not to use 0.05 as some sort of magical threshold below which everything magically works).
Back when I was a wee researchling, this is literally one of the first papers I was told to read and internalise (published 20 years ago, and not even particularly breakthrough at the time).
There is absolutely no need for new evidence or further discussion of the limitations of statistical testing thresholds: anybody who cares is keenly aware of them. People who don't (particularly in some areas of social science), are just looking for a way to get their next paper out the door by any means possible.
Chaos also occurs in the dynamic evolution of a system, so it's hard to see the connection you're implying with statistics.
When I turn that around, it seems to say that statistics is only of value in systems that have fully matured.
Statistics in today's world is more a financial tool. That system of manipulation has fully matured. I promise you they know exactly what they are doing with that tool.
Actually, there is a really good reason to use least-squares regression. A model that minimizes squared error is guaranteed to minimize the variance of error, obviously.
This is the wrong place for an argument (you want room 12-A) and I don't want to get into a contest, but for illustration here is the problem with this explanation.
A rule learned from experience should minimize the error, not the variance of error.
It's a valid conclusion from the mathematics, but based on a faulty assumption.
"Global Warming"?
I always knew there was something wrong with their Pies...
http://stats.stackexchange.com/questions
\begin{rant}
Actually using statistics in the first place! I'm sick to $#@%@#$%#$TREWT#$@%$#ing death of CS papers with no statistical testing whatsoever. And don't get me started on electronic engineering.
\end{rant}
It seems like you found a lot of problems with what is taught in a stat 101 class. This is good; there are many. However, there are also solutions to these problems which you would find if you took a higher level course.
That brought a smile to my face. Thanks.
Well, it's good to know that the entire field of non-parametric statistics doesn't exist, then.
But of course, they wear white coats and are the new 'high priests' who have to be worshipped at any cost. It's not as if most scientists are concerned more with their careers and pensions than with the truth, is it...
http://stats.stackexchange.com/
If X is categorical this sounds like a case for logistic regression.
What you're talking about is the distribution of the sample means of r1, r2, r3 respectively. Those are asymptotically normally distributed, but that's not what we're talking about here.
What we were talking about is whether: r1, r2, and r3 can all be normally distributed. The reason being that people investigating the size, weight, and surface area of berries may *assume* (appealing to the Central Limit Theorem) that the quantity they're investigating can be modeled adequately through a normal distribution, and proceed to apply statistical tests based on dealing with normal distributions. For example by comparing the effect of fertilizer on berry size and weight. And it's clear that the distributions of r1, r2, and r3 cannot all be distributed normally.
So statistical tests based on the assumption that they are normally distributed will operate outside their guaranteed area of applicability, which may or may not cause them to be in error.
... but is it reproducible? :p
"I love animals! Some are cute, others are tasty, what's not to like?" - Betsy Schroeder, Jeopardy contestant
That is because of the central limit theorem, (http://en.wikipedia.org/wiki/Central_limit_theorem), which indicated that for a large number of independent samples, it doesn't matter what the original distribution was, and we certainly can reliably use the normal distribution. It is NOT unfounded.
Emphasis is mine.
Actually, you are misstating the CLT, which does not work at all for distributions without finite mean or variance (which may well be the case for real-world experiments). And even if the variable we are measuring does have finite mean and variance, the speed of convergence is only possible to quantify in certain cases. So the shape you get from samples of size 1000 may look good to you because you are impressed by a bunch of zeroes, and may even work OK near the estimated mean, but when we look at the tails, we may find that your approximation is not worth a crumpled paper napkin.
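The finite-variance caveat can be illustrated with the Cauchy distribution, which has no finite mean or variance. The sketch below compares the spread (interquartile range) of sample means for normal and Cauchy data as the sample size grows. The sample sizes and repetition count are arbitrary, and the helper names are invented. For normal data the spread of the mean shrinks like 1/sqrt(n); for Cauchy data the mean of n draws is itself standard Cauchy, so averaging does not help at all.

```python
import math
import random


def cauchy(rng):
    """Draw from a standard Cauchy distribution via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def normal(rng):
    """Draw from a standard normal distribution."""
    return rng.gauss(0.0, 1.0)


def iqr_of_sample_means(draw, n, reps, rng):
    """Interquartile range of `reps` sample means, each over `n` draws."""
    means = sorted(sum(draw(rng) for _ in range(n)) / n for _ in range(reps))
    return means[(3 * reps) // 4] - means[reps // 4]


rng = random.Random(0)
for n in (1, 20, 400):
    normal_iqr = iqr_of_sample_means(normal, n, 400, rng)
    cauchy_iqr = iqr_of_sample_means(cauchy, n, 400, rng)
    print(f"n={n:4d}  normal IQR={normal_iqr:.3f}  Cauchy IQR={cauchy_iqr:.3f}")
```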
Of course it's a proven fact that 87% of all statistics are wrong. 8-)
> do want to necessitate giving some experimental medicine to 10,000 people before assessing whether it's a good idea or not?
Yes. Before giving it to a million people, we should run statistical calculations on the first 10,000 to better assess safety and efficacy.
Oh, you meant as opposed to a trial with 200 people. But that's a false dichotomy. You run stats on the first 200 to see whether or not it's likely safe, then run stats on 10,000 to confirm it. Which is to say, you'd wait until you managed a smaller P before announcing a conclusion. In the meantime, with a P of 0.05, you'd label it as a tentative conclusion, a likely theory.
The problem I have with least squares is that I don't like the definition of the "error". If you have two things that are correlated, one isn't necessarily a function of the other that includes some variability. If you flip the X and Y axes over - plot height against weight, rather than weight against height - then the least squares regression gives a different line. If the two errors are both minimised, but different, then neither of them is the "real" error.
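The axis-flipping effect is easy to reproduce. The sketch below generates hypothetical height and weight data (the means, slope and noise level are invented) and fits ordinary least squares in both directions. If both fits described the same line, the slope of weight on height would equal the reciprocal of the slope of height on weight; instead the product of the two slopes equals the squared correlation, which is below 1 whenever the data are noisy.

```python
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)
# Hypothetical data: heights in cm, weights in kg with independent noise.
heights = [rng.gauss(170.0, 10.0) for _ in range(5_000)]
weights = [70.0 + 0.9 * (h - 170.0) + rng.gauss(0.0, 12.0) for h in heights]

slope_w_on_h = ols_slope(heights, weights)   # kg per cm, minimizing weight errors
slope_h_on_w = ols_slope(weights, heights)   # cm per kg, minimizing height errors
slope_w_on_h_implied = 1.0 / slope_h_on_w    # the second line drawn on the first axes

print(f"weight on height:             {slope_w_on_h:.2f} kg/cm")
print(f"implied by height on weight:  {slope_w_on_h_implied:.2f} kg/cm")
print(f"product of slopes (r squared): {slope_w_on_h * slope_h_on_w:.2f}")
```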
Wow - brilliant insight! Thanks for that - things like this are why I come to Slashdot.
Can I discuss some ideas with you offline? thon dot 9 dot okianwarrior at spamgourmet dot com
OLS regression/GLM-based approaches do not assume a Gaussian distribution of the dependent variable, just that the residuals/errors are Gaussian. If you don't like that for your data, use nonparametric methods, or if you know what matches the distribution, like Poisson or negative gamma, then use those. As a psychologist, my colleagues were often taught to look for normal data because they assume the errors are intrinsically random, and will follow the existing distribution. I wish they didn't.
Climate models are currently, at best, when treated as an ensemble (if you buy that as legitimate)
Is there a methodological reason to NOT treat the ensemble as legitimate? Please describe this reasoning in detail.
skirting along the p 0.05 level of significance in the validation period.
Define precisely what you mean by "skirting along". How far below 0.05 are the models results, exactly?
Pointing this out is considered trolling -- it probably offends some religious sensibilities.
I suspect you are misinterpreting the responses. Mostly when we dig down on these sorts of remarks, high level, without data or empirical basis, no repeatable observations, we find them to be the work of some deceptive, brainless, mouth breathing denialist. You might be an ok chap, but I can't really make a conclusion until I've seen the data.
It's a matter of probability. Perhaps someone who sounds like a denialist is, in spite of the evidence, a rational, coherent person who is nevertheless ignorant of the science or misled by paid shills (like Anthony Watts who is paid a salary by the Heartland Institute to LIE about climate science, or Judith Curry, who deliberately misleads by wrapping genuine science in a penumbra of sneering psuedo-scepticism).
Generally the best measure is as follows:
1. The person refuses to provide specifics but only speaks in generalities - likely a committed denialist
2. The person provides specific (albeit incorrect) facts - likely ignorant or misled by liars
Like many people I have a lot of sympathy for the genuinely misled and will try to help them if I can. The deliberate lies and deception on the part of the denier hierarchy makes my blood boil.
Tightening the threshold as the article suggests would mean the model results are not "significant" (i.e., not reasonably distinguishable from natural variation
Actually the article makes no mention of anything related to climate science. It is mainly focussed on instances where results are found not to be reproducible and using a frequentist methodology. Climate models are very reproducible and don't use a frequentist method - they make predictions, not observations.
-- note that I am not a "denier" and that I do accept that CO2 is a greenhouse gas etc. etc.; I am however hugely skeptical of most climate and environmental science that I have investigated).
And yet you refer to these supposed problems in the climate science in generalities. Why is that?