Weak Statistical Standards Implicated In Scientific Irreproducibility
ananyo writes "The plague of non-reproducibility in science may be mostly due to scientists' use of weak statistical tests, as shown by an innovative method developed by statistician Valen Johnson, at Texas A&M University. Johnson found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in many fields including social science — still meant that as many as 17–25% of such findings are probably false (PDF). He advocates for scientists to use more stringent P values of 0.005 or less to support their findings, and thinks that the use of the 0.05 standard might account for most of the problem of non-reproducibility in science — even more than other issues, such as biases and scientific misconduct."
The problem with frequentist statistics as used in the article is that its "recipe" character often results in people using statistics that do not understand its limitations (a good example is assuming a normal distribution when there is none). The bayesian approach does not suffer from this problem, also because it forces you to think a little bit more about the problem you are trying to solve compared to the frequentist approach. But that's also the problem with the cited article. Just remaining in the framework and going towards more discriminating thresholds is not really a solution of the problem that people do not understand their data analysis (a p-value based on the wrong distribution remains meaningless, even if you change your threshold...). Because it is more logical in its setup, the danger of making such mistakes is smaller in bayesian statistics. The telescoper over at http://telescoper.wordpress.com/2013/11/12/the-curse-of-p-values/ has a good discussion of these issues.
If we were to insist on statistically meaningful results 90% of our contemporary journals would cease to exist for lack of submissions.
Personally, I've considered results with p values between 0.01 and 0.05 as merely 'suggestive': "It may be worth looking into this more closely to find out if this effect is real." Between 0.01 and 0.001 I'd take the result as tentatively true - I'll accept it until someone refutes it.
If you take p=0.04 as demonstrating a result is true, you're being foolish and statistically naive. However, unless you're a compulsive citation follower (which I'm not) you are somewhat at the mercy of other authors. If Alice says "In Bob (1998) it was shown that ..." I'll tend to accept it without realizing that Bob (1998) was a p=0.04 result.
Obligatory XKCD
Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.
If the author's assertion is true and that P value of 0.05 or less means that 17–25% of such findings are probably false, then what is the point of publishing the findings? Or at least come at the writting from a more sober perspective. Of course, any such change would need to come with an academia culture change from the 'publish or perish' mindset.
Because I'd rather use a drug found to be 75-83% effective at treating my disease than die while waiting for someone to come up with one that's 99.9% effective.
This is a geek website, not a "research" website so stop talking a bunch of crap about a bunch of crap. I'm providing silly examples so don't focus upon them. Most researchers suck at stats and my attempt at explaining should either help out or show that I don't know what I'm talking about. Take your pick.
"p=.05" is a stat that reflects the likelihood of rejecting a true null hypothesis. So, lets say that my hypothesis is that "all cats like dogs" and my null hypothesis is "not all cats like dogs." If I collect a whole bunch of imaginary data, run it through a program like SPSS, and the results turn out that my hypothesis is correct then I have a .05 percent chance that the software is wrong. In that particular imaginary case, I would have committed a Type I Error. This error has a minimal impact because the only bad thing that would happen is some dogs get clawed on the nose and a few cats get eaten.
Now, on a typical experiment, we also have to establish beta which is the likelihood of committing a type II error, that is, accepting a false null hypothesis. So let's say that my hypothesis is that "Sex when desired makes men happy" and my null hypothesis is "Sex only when women want it makes men happy." It's not a bad thing if #1 is accepted but the type II error will make many men unhappy.
Now, this is a give and take relationship. Every time that we make p smaller (.005, .0005, .00005, etc.) for "accuracy," then the risk of committing a type II error increases. A type II error when determining what games 15 year olds like to play doesn't really matter if we are wrong but if we start talking about drugs and false positives then the increased risk of a type II error really can make things ugly.
Next, there are guideline for determining a how many participants are needed for lower p (alpha) values. Social sciences (hold back your Sheldon jokes) that do studies on students might need lets say 35 subjects/people per treatment group at p=.05 whereas with a .005 might need 200 or 300 per treatment group. I don't have a stats book in front of me but .0005 could be in the thousands. Every adjustment impacts a different item in a negative fashion. You can have your Death Star or you can have Luke Skywalker. Can't have 'em both.
Finally, there is a statistical concept of power, that is, there are stats for measuring the impact of a treatment. Basically, how much of the variance between the group A and group B can be assigned to the experimental treatment. This takes precedence in many peoples minds over simply determining if we have a correct or incorrect hypothesis. Assigning p does not answer this.
Anyways, I'm going to go have another beer. Discard this article and move onto greener pastures.
The problem with frequentist statistics as used in the article is that its "recipe" character often results in people using statistics that do not understand its limitations (a good example is assuming a normal distribution when there is none). The bayesian approach does not suffer from this problem, also because it forces you to think a little bit more about the problem you are trying to solve compared to the frequentist approach.
If only. The number of people who think "sprinkle a little Bayes on it" is the solution to everything is frighteningly large, and growing exponentially AFAICT. There's now a Bayesian recipe counterpart to just about every non-Bayesian recipe, and the only difference between them, as a practical matter, is that the people using the former think they're doing something special and better. One might say that their prior is on the order of P(correct|Bayes) = 1, which makes it very hard to convince them otherwise ...
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.