Scientists Propose To Raise the Standards For Statistical Significance In Research Studies (sciencemag.org)
sciencehabit shares a report from Science Magazine: A megateam of reproducibility-minded scientists is renewing a controversial proposal to raise the standard for statistical significance in research studies. They want researchers to dump the long-standing use of a probability value (p-value) of less than 0.05 as the gold standard for significant results, and replace it with the much stiffer p-value threshold of 0.005. Backers of the change, which has been floated before, say it could dramatically reduce the reporting of false-positive results -- studies that claim to find an effect when there is none -- and so make more studies reproducible. And they note that researchers in some fields, including genome analysis, have already made a similar switch with beneficial results.
"If we're going to be in a world where the research community expects some strict cutoff ... it's better that that threshold be .005 than .05. That's an improvement over the status quo," says behavioral economist Daniel Benjamin of the University of Southern California in Los Angeles, first author on the new paper, which was posted 22 July as a preprint article on PsyArXiv and is slated for an upcoming issue of Nature Human Behavior. "It seemed like this was something that was doable and easy, and had worked in other fields."
I'd be surprised if it was anywhere near 56%. I'm a biologist, I don't understand P values, but I am aware that they shouldn't be the gold standard. Ideally scientists in all the different fields would use the statistics that make the most sense for their specific study, and would take the time to figure that out, and reviewers would read up on statistics and think themselves about what statistics would make the most sense for that case.
P<0.05 is used everywhere because that simply won't happen. Scientists who aren't statisticians care passionately about only their topic and it isn't statistics. If anyone tries to use something else, everyone including reviewers will demand they use what everyone else uses anyway.
There's a trade-off between sensitivity and specificity. If you increase the threshold for "significance", you reduce the power to discover a significant effect when it truly does exist.
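To put rough numbers on that trade-off, here is a minimal Python sketch of the power of a two-sided, two-sample z-test at both thresholds. The effect size (d = 0.5), the 64 subjects per group, and the helper name `z_test_power` are illustrative assumptions, not figures from the article, and a real study would more likely use a t-test.

```python
from statistics import NormalDist


def z_test_power(effect_size, n_per_group, alpha):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is a standardized mean difference (Cohen's d). Equal group
    sizes and a known unit variance are assumed to keep the sketch simple.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the statistic falls in either rejection region.
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower


for alpha in (0.05, 0.005):
    # Hypothetical study: d = 0.5 with 64 subjects per group.
    print(f"alpha={alpha}: power={z_test_power(0.5, 64, alpha):.2f}")
```

With the sample size held fixed, the stricter threshold gives the lower power, which is the trade-off described above.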
And a major part of the problem with scientific studies is that they are already underpowered. According to conventional wisdom, ideally, scientists should strive for a power of about 80% (i.e., an 80% chance of detecting an effect if it truly exists), but very few studies actually achieve power of this level. In many fields, the power is less than 50% and sometimes much less.
Underpowered studies result in two major problems:
1) Most obviously, an underpowered study results in a greater number of FALSE NEGATIVES: you fail to find a true effect. Then you either publish your incorrect result of no effect (and why should we consider published false positives to be any worse than false negatives?) or you don't publish the study at all because you couldn't reach significance. The latter exacerbates the "file-drawer effect" and also wastes research dollars because the results never get published.
2) Somewhat counterintuitively, underpowered studies are often also more likely to result in FALSE POSITIVES. This is because, when your power to detect a true effect is low, and you test a large number of effects that are likely to be null, most of the hypotheses that you say are "significantly" non-null will actually be false positives. We would say that the "false discovery rate" tends to be very high when the power is low.
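The false discovery rate argument can be checked with some simple arithmetic. The sketch below assumes, purely for illustration, that only 10% of the hypotheses being tested are true effects; that prior and the helper name `false_discovery_rate` are my assumptions, not numbers from the article.

```python
def false_discovery_rate(prior_true, power, alpha):
    """Expected share of "significant" results that are false positives.

    prior_true: fraction of tested hypotheses where the effect is real.
    power:      chance of reaching significance when the effect is real.
    alpha:      chance of reaching significance when the effect is null.
    """
    false_positives = alpha * (1 - prior_true)
    true_positives = power * prior_true
    return false_positives / (false_positives + true_positives)


for power in (0.8, 0.5, 0.2):
    # Assumed prior: 1 in 10 tested hypotheses is a true effect.
    fdr = false_discovery_rate(prior_true=0.1, power=power, alpha=0.05)
    print(f"power={power:.1f}: false discovery rate={fdr:.2f}")
```

Under these assumptions the share of false positives among significant results grows as power drops, even though alpha never changes.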
Lowering the significance threshold will do little to address these problems, and in some cases may even exacerbate them.
The key is *to move away from the binary concept of "significance" altogether*. It's obviously artificial to have an arbitrary numerical cutoff for "matters" vs. "doesn't matter", and this is not what Ronald Fisher intended when he popularized the p-value or developed the concept of "significance".
What we should be doing is measuring and reporting effect sizes along with their credible intervals, while using priors that are based on our real state of knowledge. In other words, we should be doing Bayesian statistics.
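As a rough idea of what reporting an effect size with a credible interval looks like, here is a minimal sketch using a normal prior and a normal likelihood (the textbook conjugate update). The prior, the observed estimate, its standard error, and the helper name `posterior_effect` are all made-up illustrations; a real analysis would need a prior justified by actual domain knowledge.

```python
from statistics import NormalDist


def posterior_effect(prior_mean, prior_sd, estimate, std_error, level=0.95):
    """Normal-normal conjugate update for an effect size.

    Returns the posterior mean and an equal-tailed credible interval.
    """
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / std_error ** 2
    post_var = 1 / (prior_precision + data_precision)
    # Precision-weighted average of the prior mean and the observed estimate.
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    posterior = NormalDist(post_mean, post_var ** 0.5)
    tail = (1 - level) / 2
    return post_mean, (posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail))


# Hypothetical inputs: a skeptical prior centred on zero, and a study
# that estimated an effect of 0.4 with a standard error of 0.2.
mean, (low, high) = posterior_effect(prior_mean=0.0, prior_sd=0.5,
                                     estimate=0.4, std_error=0.2)
print(f"posterior effect = {mean:.2f}, 95% credible interval = [{low:.2f}, {high:.2f}]")
```

The output is an estimate with a range of plausible values rather than a yes/no verdict, which is the point being made above.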
Nah. He just wants to eliminate 95% of the competition.
Second class citizen of the New Gilded Age
Make it Six Sigma
That would eliminate many false positives, as well as eliminating nearly all true positives. Of course, this will do nothing to reduce the number of studies that are flawed for reasons other than statistics, such as non-representative sampling (e.g., most mouse studies use only male mice), poor experiment design, shoddy data gathering, sponsorship bias, and outright fraud.
But, the cost of clinical studies would only increase by an order of magnitude, so what do we have to lose?
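For anyone who wants to sanity-check the cost of a stricter threshold, here is a back-of-the-envelope sketch. It assumes a two-sided z-test, a fixed effect size, and 80% power, in which case the required sample size is proportional to (z for the threshold + z for the power) squared. The helper name `relative_sample_size` is hypothetical, and real trial costs depend on far more than sample size.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def relative_sample_size(alpha, power=0.8):
    """Required sample size for a two-sided z-test, up to a constant factor.

    For a fixed effect size, n is proportional to
    (z_{1 - alpha/2} + z_{power}) ** 2.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_power) ** 2


# Two-sided tail area beyond six standard deviations.
six_sigma_alpha = 2 * STD_NORMAL.cdf(-6)
baseline = relative_sample_size(0.05)

for label, alpha in (("0.005", 0.005), ("six sigma", six_sigma_alpha)):
    ratio = relative_sample_size(alpha) / baseline
    print(f"{label}: alpha={alpha:.3g}, sample size x{ratio:.1f} versus alpha=0.05")
```

The multipliers only hold under the simplifying assumptions listed above, so treat them as rough guides rather than budget estimates.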
I've seen it first hand: a lot of people go through "degree factories" and come out with degrees that cover only the basics of statistics. And a little knowledge is very dangerous. The p-value is a useful measure, but it's been simplified to (p less than 0.05 = good) in biomedical circles. And if you read the other upvoted threads, or read some of the linked articles, you'll understand why this is a big problem.
There are a few tensions here that I think may be causing this: (a) publish or perish: if it looks reasonable enough, publish, because that's where your next job comes from; (b) poor statistical training, on both the authors' and the reviewers' side; (c) unwillingness to fund or publish work that reproduces previous results, which is a publisher-created publication bias; (d) the generally high cost of patient-centred biomedical research, which means you usually have low sample numbers; (e) the unwillingness in some disciplines to get formal statistical input.
What are the potential solutions? With an unrestricted pool of money you could recruit adequately (n>10000) for each study, but the money is not there, and there are some very rare diseases around. Better statistical training would be ideal, and there has been a push towards Bayesian analysis, though I would think that, as with most statistical tools, someone will eventually find a way to use it inappropriately. Self-publishing is a possible option: I've seen some horrifically bad peer-reviewed articles (& predatory journals!), but there is an ethical tension between proper peer review and publishing without review, which could just flood the literature with absolute garbage that is difficult to sort through. Maybe something like arXiv for biomedical science, although I suspect there would be a lot of resistance to it.
I don't hold out much hope for a quick solution to this, as there are a lot of vested interests, and people using the latest newfangled statistical methods they've learned. I recently reviewed a paper, with multiple authors from a big university, where I just shook my head at the amount of statistical fudging that took place: the authors had imputed about 80% of their primary predictor variable for an outcome, and then drew a conclusion based on the imputed data. I couldn't believe this was actually allowed nowadays. While this article is good, some of the authors have been banging on about it for some time without much change.