Weak Statistical Standards Implicated In Scientific Irreproducibility

Posted by Soulskill on Tuesday November 12, 2013 @11:40AM from the nobody-who-needs-to-understand-statistics-understands-statistics dept.

ananyo writes "The plague of non-reproducibility in science may be mostly due to scientists' use of weak statistical tests, as shown by an innovative method developed by statistician Valen Johnson, at Texas A&M University. Johnson found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in many fields including social science — still meant that as many as 17–25% of such findings are probably false (PDF). He advocates for scientists to use more stringent P values of 0.005 or less to support their findings, and thinks that the use of the 0.05 standard might account for most of the problem of non-reproducibility in science — even more than other issues, such as biases and scientific misconduct."
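For readers wondering where a figure like 17–25% can come from, here is a minimal sketch. It assumes, as an illustration rather than a statement of Johnson's exact method, that a P value of 0.05 corresponds to a Bayes factor of roughly 3 to 5 in favor of the alternative, and that the null and the alternative are equally likely beforehand. The function name and the 50/50 prior are hypothetical choices, not taken from the summary.

```python
def posterior_null_probability(bayes_factor, prior_null=0.5):
    """Probability that the null is true after seeing the data.

    bayes_factor: evidence for the alternative relative to the null
                  (assumed value, see the note above).
    prior_null:   assumed prior probability of the null hypothesis.
    """
    prior_odds_alt = (1 - prior_null) / prior_null
    posterior_odds_alt = bayes_factor * prior_odds_alt
    return 1 / (1 + posterior_odds_alt)


if __name__ == "__main__":
    for bf in (3, 5):
        share = posterior_null_probability(bf)
        print(f"Bayes factor {bf}: null still true in {share:.0%} of cases")
```

With equal prior odds the result reduces to 1 / (1 + Bayes factor), so the answer is only as good as the assumed Bayes factor and prior.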

11 of 182 comments

Re:Or you know.. by Anonymous Coward · 2013-11-12 11:53 · Score: 5, Informative

This would have the same problems, maybe even worse. The problem with statistics is usually that the model is wrong, and Bayesian stats offers two chances to fuck that up: in the prior, and in the generative model (=likelihood). Bayesian statistics still requires models (yes, you can do non-parametric Bayes, but you can do non-parametric frequentist stats also).
Contrary to the hype and buzzwords, Bayesian statistics is not some magical solution. It is incredibly useful when done right, of course.
Scarcely productive by fey000 · 2013-11-12 11:59 · Score: 4, Interesting

Such an admonishment is fine for the computational fields, where a few more permutations can net you a p-value of 0.0005 (assuming that you aren't crunching on a 4-month cluster problem). However, biological lab experiments are often very expensive and take a lot of time. Furthermore, additional tests are not always possible, since it can be damn hard to reproduce specific mutations or knockout sequences without altering the surrounding interacting factors.
So, should we go for a better p-value for the experiment and scrap any complicated endeavour, or should we allow for difficult experiments and take it with a grain of salt?
1. Re:Scarcely productive by hawguy · 2013-11-12 12:21 · Score: 4, Insightful
  
  If the author's assertion is true and a P value of 0.05 or less means that 17–25% of such findings are probably false, then what is the point of publishing the findings? Or at least come at the writing from a more sober perspective. Of course, any such change would need to come with a culture change in academia, away from the 'publish or perish' mindset.
  Because I'd rather use a drug found to be 75-83% effective at treating my disease than die while waiting for someone to come up with one that's 99.9% effective.
Re:Or you know.. by hde226868 · 2013-11-12 12:01 · Score: 5, Insightful

The problem with frequentist statistics as used in the article is that its "recipe" character often results in statistics being used by people who do not understand its limitations (a good example is assuming a normal distribution when there is none). The Bayesian approach does not suffer from this problem, partly because it forces you to think a little bit more about the problem you are trying to solve compared to the frequentist approach. But that's also the problem with the cited article. Just remaining in the framework and going towards more discriminating thresholds is not really a solution to the problem that people do not understand their data analysis (a p-value based on the wrong distribution remains meaningless, even if you change your threshold...). Because it is more logical in its setup, the danger of making such mistakes is smaller in Bayesian statistics. The telescoper over at http://telescoper.wordpress.com/2013/11/12/the-curse-of-p-values/ has a good discussion of these issues.
Not going to happen by Anonymous Coward · 2013-11-12 12:02 · Score: 4, Insightful

If we were to insist on statistically meaningful results 90% of our contemporary journals would cease to exist for lack of submissions.
Interpretation of the 0.05 threshold by Michael+Woodhams · 2013-11-12 12:06 · Score: 5, Insightful

Personally, I've considered results with p values between 0.01 and 0.05 as merely 'suggestive': "It may be worth looking into this more closely to find out if this effect is real." Between 0.01 and 0.001 I'd take the result as tentatively true - I'll accept it until someone refutes it.
If you take p=0.04 as demonstrating a result is true, you're being foolish and statistically naive. However, unless you're a compulsive citation follower (which I'm not) you are somewhat at the mercy of other authors. If Alice says "In Bob (1998) it was shown that ..." I'll tend to accept it without realizing that Bob (1998) was a p=0.04 result.
Obligatory XKCD

--
Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.
Re:Or you know.. by Anonymous Coward · 2013-11-12 12:20 · Score: 5, Interesting

Yes, I agree. If a p-value of 0.05 actually "means" 0.20 when evaluated, then any sane frequentist will tell you that things are fucked, since the limiting probability does not match the nominal probability (this is the definition of frequentism).
The power of Bayesian stats is largely in being able to easily represent hierarchical models, which are very powerful for modeling dependence in the data through latent variables. But it's not the Bayesianism per se that fixes things, it's the breadth of models it allows. A mediocre modeler using Bayesian statistics will still create mediocre models, and if they use a bad prior, then things will be worse than they would be for a frequentist.
Consider that if Bayesian statisticians are doing a better job than frequentists at the moment, it may be because Bayesian stats hasn't yet been drilled into the minds of the mediocre, as frequentist stats has been for decades. People doing Bayesian stats tend to be better modelers to begin with.
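The point above about the limiting probability matching the nominal one can be checked by simulation. The sketch below is only an illustration with made-up settings: it runs a two-sided z-test that assumes sigma = 1 on data where the null (mean 0) is true, first with the assumption correct and then with the real sigma set to 1.3. The function name, sample size, and the 1.3 are all hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(alpha=0.05, n=30, true_sigma=1.0,
                        trials=20000, seed=1):
    """Fraction of true-null experiments that a z-test rejects.

    The test always assumes sigma = 1; true_sigma is what the data
    really use, so true_sigma != 1 means the model is wrong.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, true_sigma) for _ in range(n)]
        z = (sum(sample) / n) * n ** 0.5  # z statistic under assumed sigma = 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    print("model correct:", false_positive_rate())
    print("model wrong:  ", false_positive_rate(true_sigma=1.3))
```

When the assumed model holds, the rejection rate should sit close to the nominal 0.05; when it does not, the nominal threshold no longer describes the real error rate, whatever value it is set to.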
Yes and no by golodh · 2013-11-12 13:02 · Score: 4, Interesting

As you say, there is the Central Limit Theorem (a whole bunch of them actually) that says that the Normal distribution is the asymptotic limit that describes unbelievably many averaging processes.
So it gives you a very valid excuse to assume that the value distribution of some quantity occurring in nature will follow a Normal distribution when you know nothing else about it.
But there's the crux: it remains an assumption; a hypothesis, and fortunately it's usually a *testable* hypothesis. It's the responsibility of a researcher to check if it holds, and to see how problematic it is when it doesn't.
If something has a Normal distribution, its square or its square root (or another power) doesn't have a Normal distribution. Take for example the diameter, surface area, and volume of berries. The diameter goes with the radius r, the surface area with r^2, and the volume with r^3. They cannot all be Normally distributed at the same time, so assuming that any one of them is starts you out on a shaky foundation.
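The berry example can be tried numerically. This is a minimal sketch with invented numbers (diameters drawn from a Normal distribution with mean 10 mm and standard deviation 2 mm, berries treated as spheres); it uses sample skewness as a crude symmetry check, which is one of several ways to test Normality and not a full test.

```python
import math
import random


def skewness(values):
    """Sample skewness: about 0 for symmetric data such as a Normal."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def simulate_berries(n=20000, mean_mm=10.0, sd_mm=2.0, seed=0):
    """Draw Normal diameters and derive sphere surface areas and volumes."""
    rng = random.Random(seed)
    diameters = [rng.gauss(mean_mm, sd_mm) for _ in range(n)]
    areas = [math.pi * d ** 2 for d in diameters]
    volumes = [math.pi * d ** 3 / 6 for d in diameters]
    return diameters, areas, volumes


if __name__ == "__main__":
    for name, data in zip(("diameter", "area", "volume"), simulate_berries()):
        print(f"{name:8s} skewness = {skewness(data):.2f}")
```

If the diameter is the Normal quantity, the area and volume come out right-skewed, and the skew grows with the power of r.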
The real issue by Okian+Warrior · 2013-11-12 13:06 · Score: 5, Interesting

Okay, here's the real problem with scientific studies.
All science is data compression, and all studies are intended to compress data so that we can make future predictions. If you want to predict the trajectory of a cannonball, you don't need an almanac cross referencing cannonball weights, powder loads, and cannon angles - you can calculate the arc to any desired accuracy with a set of equations that fit on half a page. The half-page compresses the record of all prior experience with cannonball arcs, and allows us to predict future arcs.
Soft science studies typically make a set of observations which relate two measurable aspects. When plotted, the data points suggest a line or curve, and we accept the linear-regression (line or polynomial) as the best approximation for the data. The theory being that the underlying mechanism is the regression, and unrelated noise in the environment or measurement system causes random deviations of observation.
This is the wrong method. Regression is based on minimizing squared error, which was chosen by Laplace for no other reason than that it is easy to calculate. There are lots of "rationalization" explanations of why it works and why it's "just the best possible thing to do", but there's no fundamental logic that can be used to deduce least squares from fundamental assumptions.
Least squares introduces several problems:
1) Outliers will skew the values, and there is no computable way to detect or deal with outliers (source).
2) There is no computable way to determine whether the data represent a line or a curve - it's done by "eye" and justified with statistical tests.
3) The resultant function frequently looks "off" to the human eye, humans can frequently draw better matching curves; meaning: curves which better predict future data points.
4) There is no way to measure the predictive value of the results. Linear regression will always return the best line to fit the data, even when the data is random.
The right way is to show how much the observation data is compressed. If the regression function plus data (represented as offsets from the function) take fewer bits than the data alone, then you can say that the conclusions are valid. Further, you can tell how relevant the conclusions are, and rank and sort different conclusions (linear, curved) by their compression factor and choose the best one.
Scientific studies should have a threshold of "compresses data by N bits", rather than "1-in-20 of all studies are due to random chance".
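One rough way to make the "compresses data by N bits" idea concrete is a two-part description length, in the spirit of BIC/MDL-style model selection. The sketch below is an illustration under assumptions that are not in the comment above: Gaussian residuals, a cost of (1/2)*log2(n) bits for the extra slope parameter, and hypothetical helper names. It compares "mean only" against "mean plus line".

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def bits_saved_by_line(xs, ys):
    """Approximate bits saved by coding ys as line + residuals
    instead of mean + residuals (assumed Gaussian coding cost)."""
    n = len(xs)
    slope, intercept = fit_line(xs, ys)
    my = sum(ys) / n
    rss_mean = sum((y - my) ** 2 for y in ys)
    rss_line = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    data_bits = 0.5 * n * math.log2(rss_mean / rss_line)
    model_bits = 0.5 * math.log2(n)  # assumed price of one extra parameter
    return data_bits - model_bits


def demo(seed=0, n=200):
    """Return bits saved for pure noise and for a noisy real trend."""
    rng = random.Random(seed)
    xs = [10 * i / (n - 1) for i in range(n)]
    noise_only = [rng.gauss(0, 1) for _ in xs]
    real_trend = [0.5 * x + rng.gauss(0, 1) for x in xs]
    return bits_saved_by_line(xs, noise_only), bits_saved_by_line(xs, real_trend)


if __name__ == "__main__":
    noise_gain, trend_gain = demo()
    print(f"noise: {noise_gain:.1f} bits, trend: {trend_gain:.1f} bits")
```

Least squares returns a line in both cases, but only the data with a real trend should show a large saving in bits; for pure noise the saving is typically near zero or negative once the parameter cost is charged.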
Let's get something straight you non-statisticians by j33px0r · 2013-11-12 14:50 · Score: 4, Insightful

This is a geek website, not a "research" website so stop talking a bunch of crap about a bunch of crap. I'm providing silly examples so don't focus upon them. Most researchers suck at stats and my attempt at explaining should either help out or show that I don't know what I'm talking about. Take your pick.
"p=.05" is a stat that reflects the likelihood of rejecting a true null hypothesis. So, let's say that my hypothesis is that "all cats like dogs" and my null hypothesis is "not all cats like dogs." If I collect a whole bunch of imaginary data, run it through a program like SPSS, and the results come out in favor of my hypothesis, then there is a 5 percent (.05) chance of getting such a result when the null hypothesis is actually true. If that happened, I would have committed a Type I Error. This error has a minimal impact because the only bad thing that would happen is some dogs get clawed on the nose and a few cats get eaten.
Now, on a typical experiment, we also have to establish beta which is the likelihood of committing a type II error, that is, accepting a false null hypothesis. So let's say that my hypothesis is that "Sex when desired makes men happy" and my null hypothesis is "Sex only when women want it makes men happy." It's not a bad thing if #1 is accepted but the type II error will make many men unhappy.
Now, this is a give and take relationship. Every time that we make p smaller (.005, .0005, .00005, etc.) for "accuracy," the risk of committing a type II error increases, assuming the sample size stays the same. A type II error when determining what games 15-year-olds like to play doesn't really matter, but if we start talking about drugs and false negatives then the increased risk of a type II error really can make things ugly.
Next, there are guidelines for determining how many participants are needed for lower p (alpha) values. Social sciences (hold back your Sheldon jokes) that do studies on students might need, let's say, 35 subjects/people per treatment group at p=.05, whereas .005 might need 200 or 300 per treatment group. I don't have a stats book in front of me but .0005 could be in the thousands. Every adjustment impacts a different item in a negative fashion. You can have your Death Star or you can have Luke Skywalker. Can't have 'em both.
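The give and take between alpha, type II error, and group size can be sketched with the usual normal approximation for comparing two group means. Everything specific here is an assumption for illustration: a standardized effect size of 0.5, a target power of 0.8, a two-sided test, and the function names. Actual numbers depend heavily on those choices, so they should not be expected to match the rough figures quoted above.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_group(alpha, n_per_group, effect_size=0.5):
    """Approximate power (1 - beta) of a two-sided, two-group z-test.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    return STD_NORMAL.cdf(shift - z_crit)


def n_per_group(alpha, power=0.8, effect_size=0.5):
    """Approximate subjects per group needed to reach the given power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for alpha in (0.05, 0.005, 0.0005):
        beta = 1 - power_two_group(alpha, n_per_group=64)
        print(f"alpha={alpha}: type II risk at n=64 is {beta:.2f}, "
              f"n per group for 80% power is {n_per_group(alpha)}")
```

Holding the group size fixed, a smaller alpha raises the type II risk; holding power fixed, a smaller alpha raises the required group size. How steeply it rises is governed mostly by the assumed effect size.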
Finally, there is the statistical concept of effect size (related to, but not the same as, power), that is, there are stats for measuring the impact of a treatment. Basically, how much of the variance between group A and group B can be assigned to the experimental treatment. This takes precedence in many people's minds over simply determining if we have a correct or incorrect hypothesis. Assigning p does not answer this.
Anyways, I'm going to go have another beer. Discard this article and move onto greener pastures.
Re:Or you know.. by Daniel+Dvorkin · 2013-11-12 15:27 · Score: 4, Insightful

The problem with frequentist statistics as used in the article is that its "recipe" character often results in statistics being used by people who do not understand its limitations (a good example is assuming a normal distribution when there is none). The Bayesian approach does not suffer from this problem, partly because it forces you to think a little bit more about the problem you are trying to solve compared to the frequentist approach.
If only. The number of people who think "sprinkle a little Bayes on it" is the solution to everything is frighteningly large, and growing exponentially AFAICT. There's now a Bayesian recipe counterpart to just about every non-Bayesian recipe, and the only difference between them, as a practical matter, is that the people using the former think they're doing something special and better. One might say that their prior is on the order of P(correct|Bayes) = 1, which makes it very hard to convince them otherwise ...

--
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.