Scientists Propose To Raise the Standards For Statistical Significance In Research Studies (sciencemag.org)
sciencehabit shares a report from Science Magazine: A megateam of reproducibility-minded scientists is renewing a controversial proposal to raise the standard for statistical significance in research studies. They want researchers to dump the long-standing use of a probability value (p-value) of less than 0.05 as the gold standard for significant results, and replace it with the much stiffer p-value threshold of 0.005. Backers of the change, which has been floated before, say it could dramatically reduce the reporting of false-positive results -- studies that claim to find an effect when there is none -- and so make more studies reproducible. And they note that researchers in some fields, including genome analysis, have already made a similar switch with beneficial results.
"If we're going to be in a world where the research community expects some strict cutoff ... it's better that that threshold be .005 than .05. That's an improvement over the status quo," says behavioral economist Daniel Benjamin of the University of Southern California in Los Angeles, first author on the new paper, which was posted 22 July as a preprint article on PsyArXiv and is slated for an upcoming issue of Nature Human Behavior. "It seemed like this was something that was doable and easy, and had worked in other fields."
There's a trade-off between sensitivity and specificity. If you make the threshold for "significance" stricter, you reduce the power to discover a significant effect when it truly does exist.
And a major part of the problem with scientific studies is that they are already underpowered. According to conventional wisdom, ideally, scientists should strive for a power of about 80% (i.e., an 80% chance of detecting an effect if it truly exists), but very few studies actually achieve power of this level. In many fields, the power is less than 50% and sometimes much less.
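To put rough numbers on that trade-off, here is a minimal sketch of how power changes with the threshold. It assumes a two-sided, two-sample z-test with equal group sizes; the effect size of 0.5 standard deviations and the 64 subjects per group are illustrative choices of mine, not figures from the proposal.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(effect_size, n_per_group, alpha):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the true difference in means divided by the common
    standard deviation (an assumed, illustrative quantity).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)    # rejection cutoff
    shift = effect_size * sqrt(n_per_group / 2)   # mean of the test statistic
    # Probability the statistic lands beyond either cutoff when the effect is real.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for alpha in (0.05, 0.005):
    power = two_sample_power(effect_size=0.5, n_per_group=64, alpha=alpha)
    print(f"alpha={alpha}: power={power:.2f}")
```

With these assumed inputs the same study goes from roughly 80% power at 0.05 to roughly 50% at 0.005, unless the sample size grows to compensate.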
Underpowered studies result in two major problems:
1) Most obviously, an underpowered study results in a greater number of FALSE NEGATIVES: you fail to find a true effect. Either you publish your incorrect result of no effect (and why should we consider published false positives to be any worse than false negatives?), or you don't publish your study because you couldn't reach significance. The latter exacerbates the "file-drawer effect" and also wastes research dollars because the results are never published.
2) Somewhat counterintuitively, a larger share of the "significant" findings from underpowered studies are FALSE POSITIVES. This is because, when your power to detect a true effect is low, and if you test a large number of effects that are likely to be null, most of the hypotheses that you say are "significantly" non-null will actually be false positives. We would say that the "false discovery rate" tends to be very high when the power is low.
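The false discovery rate point can be worked through with simple arithmetic. This is only a sketch: the assumption that 1 in 10 tested effects is real is mine, chosen for illustration, and the function name is hypothetical.

```python
def false_discovery_rate(prior_true, power, alpha):
    """Share of 'significant' results that are false positives.

    prior_true: assumed fraction of tested hypotheses where the effect is real.
    """
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)

# Assumed scenario: only 1 in 10 tested effects is real.
for power in (0.8, 0.5, 0.2):
    fdr = false_discovery_rate(prior_true=0.1, power=power, alpha=0.05)
    print(f"power={power:.1f}: false discovery rate={fdr:.2f}")
```

Under that assumption, dropping the power from 0.8 to 0.2 takes the share of false "discoveries" from about a third to more than two thirds, with the threshold held at 0.05 the whole time.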
Lowering the significance threshold will do little to address these problems, and in some cases may even exacerbate them.
The key is *to move away from the binary concept of "significance" altogether*. It's obviously artificial to have an arbitrary numerical cutoff for "matters" vs. "doesn't matter", and this is not what Ronald Fisher intended when he popularized the p-value or developed the concept of "significance".
What we should be doing is measuring and reporting effect sizes along with their credible intervals, while using priors that are based on our real state of knowledge. In other words, we should be doing Bayesian statistics.
I'm not convinced this will help. There are a couple of issues here. Often, the experimental design can be changed, such as how certain variables are controlled for, to get a p-value that's below the threshold. The other problem is that the p-value is sensitive to the sample size. If you want a lower p-value, increase the sample size. In many cases, p-values aren't a good way to show whether a result is useful or not.
I'm a meteorologist and I research severe thunderstorms. Let's say that I want to test whether a particular variable is useful in discriminating between tornadic and non-tornadic supercells. One approach might be to calculate the mean of that variable for tornadic supercells and the mean for non-tornadic supercells. The null hypothesis is that the means of the two samples are the same, and I calculate a p-value. If the sample size is large enough, that is, if I've included enough supercells, I can make even very small differences in the means appear statistically significant.
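A quick sketch of that sample-size effect, assuming a plain two-sample z-test with equal group sizes. The numbers are hypothetical: the observed difference in means is fixed at 2% of a standard deviation and only the number of storms per group changes.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_p_value(mean_diff, std_dev, n_per_group):
    """Two-sided p-value for a difference in means (z-test, equal group sizes)."""
    standard_error = std_dev * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: the observed group means differ by 2% of a standard deviation.
for n in (100, 10_000, 100_000):
    p = two_sample_p_value(mean_diff=0.02, std_dev=1.0, n_per_group=n)
    print(f"n per group={n}: p={p:.3g}")
```

The difference never gets any larger or more useful, yet the p-value slides from nowhere near significant at 100 storms per group to far below 0.005 at 100,000.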
A better approach is to use that variable as a predictor and have two data sets -- a training data set and a testing data set. I then calculate a function to classify storms based on the training data set, using the variable as a predictor of whether a storm will be tornadic or not. Then I evaluate it on the testing data set, and the metric of success is how accurately the variable predicts whether a storm will be tornadic (hits, misses, and false alarms). This is better because simply increasing the sample size isn't going to manufacture a statistically significant result.
Normally, some kind of baseline is chosen, and you want to show that your method performs better than the baseline. Of course, the problem is that you have a lot of flexibility in how to choose this baseline, and reviewers still need to be careful in how they evaluate work. For example, let's say that I cite a paper saying that climatologically, 20% of supercells are tornadic. I could randomly guess whether a supercell is tornadic based on that 20% probability and use that as my baseline. If my work is useful, hopefully I outperform random guessing based on climatology.
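Here is a minimal sketch of that train/test workflow. Everything in it is an assumption for illustration: the storms are synthetic random numbers, the classifier is a single threshold on the predictor, and I score it with the critical success index (hits divided by hits plus misses plus false alarms), which is one common choice rather than the only one.

```python
import random

def make_storms(rng, n, tornadic_rate=0.2):
    """Synthetic (made-up) storms: (predictor value, is_tornadic) pairs."""
    storms = []
    for _ in range(n):
        tornadic = rng.random() < tornadic_rate
        # Assumption: the predictor runs somewhat higher in tornadic storms.
        value = rng.gauss(1.0 if tornadic else 0.0, 1.0)
        storms.append((value, tornadic))
    return storms

def contingency(storms, threshold):
    """Count hits, misses, false alarms, and correct nulls for 'value > threshold'."""
    hits = misses = false_alarms = correct_nulls = 0
    for value, tornadic in storms:
        predicted = value > threshold
        if predicted and tornadic:
            hits += 1
        elif tornadic:
            misses += 1
        elif predicted:
            false_alarms += 1
        else:
            correct_nulls += 1
    return hits, misses, false_alarms, correct_nulls

def critical_success_index(storms, threshold):
    hits, misses, false_alarms, _ = contingency(storms, threshold)
    total = hits + misses + false_alarms
    return hits / total if total else 0.0

rng = random.Random(0)
training, testing = make_storms(rng, 2000), make_storms(rng, 2000)

# Fit: pick the threshold that scores best on the training set only.
candidates = [i / 10 for i in range(-20, 41)]
best = max(candidates, key=lambda t: critical_success_index(training, t))

# Evaluate on storms the fit never saw.
hits, misses, false_alarms, correct_nulls = contingency(testing, best)
print(f"threshold={best}, hits={hits}, misses={misses}, false alarms={false_alarms}")
print(f"test CSI={critical_success_index(testing, best):.2f}")
```

Doubling the number of synthetic storms would tighten the estimate of the score but would not push it up, which is the property that makes this kind of evaluation harder to game than a p-value.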
This isn't the best way, though, because we know of several variables that are useful in predicting whether supercells will be tornadic or not. A better baseline would be to include variables that are known to be useful and then test whether the additional variable adds skill or not. It also helps to have some physical explanation why a particular variable would affect whether a supercell is tornadic or not.
There are cases where p-values are useful, but it's also very easy to abuse them. There's no substitute for vigilant reviewers who can spot misuses of statistics. There's nothing magical about a p-value of 0.05 or 0.005. I have no problem with p-values being presented, but I think a better approach would be to require that papers include more than p-values to demonstrate that a result is significant. I've described one such approach above that I use in my own research.
This will mean that big pharma will have to run an order of magnitude more studies until they can find the one study which can be published because it shows a positive correlation.
[yes, I know statistics don't really work that way]
Actually they kind of do!
A tactic that Pharma companies have pulled many times in the past is to try to keep generic drugs off the market by showing that they are not equivalent to the proprietary product. And they do this by running a couple of dozen animal studies, with the animals being given the two different products, with various physiological parameters being monitored. When one of these parameters is found to differ between the two drugs at p < 0.05 they submit the result to the FDA declaring that the two drugs are not equivalent in their effects (the parameter of course has nothing to do with the drug's actual pharmacological effect).
Now with this standard they will have to run 200 or so tests to find one that reaches p < 0.005.
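The arithmetic behind "a couple of dozen" versus "200 or so" can be sketched as follows, assuming the tests are independent and the two drugs really are equivalent. The test counts of 24 and 200 are stand-ins for the figures in the comment.

```python
def chance_of_spurious_hit(n_tests, alpha):
    """P(at least one test is 'significant') when no real difference exists,
    assuming the tests are independent."""
    return 1 - (1 - alpha) ** n_tests

for alpha in (0.05, 0.005):
    expected_tests = 1 / alpha   # average number of tests until the first chance hit
    for n_tests in (24, 200):
        p = chance_of_spurious_hit(n_tests, alpha)
        print(f"alpha={alpha}, tests={n_tests}: P(at least one hit)={p:.2f}")
    print(f"alpha={alpha}: about {expected_tests:.0f} tests per chance hit on average")
```

At 0.05, two dozen tests give roughly a 70% chance of at least one spurious difference. At 0.005 the same two dozen give about 11%, and it takes on the order of 200 tests to get back to better-than-even odds.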
If you want reproducibility, well then require reproducibility.
Fisher, who popularized p-values, said this about them:
"[We] thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result" (Fisher, 1960, p.13-14).
I'm a biologist, I don't understand P values [...]
Here's some light for that subject.
Suppose you make 20 measurements of rats in a maze and discover that 15 out of the 20 times they turn left on their first corridor junction. Is that significant?
We know that if the decisions were random we'd expect 10 out of 20, but we also know that there is variation in that number. 10 out of 20 is the single most probable outcome, but it's even *more* probable that something other than 10 out of 20 will occur.
So to see if the 15 out of 20 is significant, we can compare this outcome to random chance.
We can simulate 20 coin flips in a computer and then write down the number of heads versus tails. Then we do it again and write down the new results, and then do it again and again for a million rounds.
Tallying the results, we can then find the *probability* that 20 random coin tosses will equal 15 or more heads, and this will give us a way to compare the rat data with random chance. What percent of random tosses yield 15 or more heads?
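The coin-flip experiment just described can be sketched in a few lines. I use 100,000 rounds rather than a million to keep it quick, and add the exact binomial calculation as a cross-check; the function names are my own.

```python
import random
from math import comb

def simulated_tail_probability(n_flips, threshold, rounds, seed=0):
    """Fraction of simulated rounds with at least `threshold` heads."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(rounds):
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        if heads >= threshold:
            extreme += 1
    return extreme / rounds

def exact_tail_probability(n_flips, threshold):
    """Same quantity from the binomial distribution, with no simulation."""
    favourable = sum(comb(n_flips, k) for k in range(threshold, n_flips + 1))
    return favourable / 2 ** n_flips

print(simulated_tail_probability(20, 15, rounds=100_000))
print(exact_tail_probability(20, 15))
```

The exact answer is 21700 / 1048576, or about 2.1%, and the simulation should land close to it. Note that this counts only an excess of heads (left turns); if a strong preference for turning right would have been just as interesting, the two-sided figure is double that.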
This is the P-value in a nutshell: it's the probability that chance alone would produce a result at least as extreme as your measurements.
Note that we can never be *certain* that the results reflect a real effect, only that chance would *rarely* produce them. The significance threshold is chosen by convention depending on the outcome risks. For normal scientific studies, it's 5% (P < 0.05). If you're studying a new medicine, you might want to tighten that to 1% (P < 0.01) for safety. If you're exploring subatomic physics, and the experiments are very difficult to reproduce, you might want that to be P < .00001% to be relatively certain.
The conventional value of 5% is usually traced to Fisher, and it is often misread. He said the 5% value makes the results worthy of more study, not that the 5% value makes the results established.
Also of note, if everyone runs studies to P < 5%, then on average 1 out of 20 studies of an effect that isn't there will still come out significant by random chance, which means that a steady stream of published studies is reporting random events.
And of course, if your degree requires you to publish, or your tenure is based on your publishing history, there are ways to adjust the results to make the significance more likely.
(For example, you can record 8 different measurements of your rats. There are 8*7/2 = 28 possible pairs of measurements, so on average one or two of those pairs will correlate at the 5% significance level by chance alone. If you want to publish a paper, this is one way to do it.)
Very few recent scientific papers have ever been verified by reproduction, and many of those that were later examined were found to be unreproducible.
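A small simulation makes the pairing trick concrete. Counting each unordered pair of 8 measurements once gives 28 pairs. The sketch below assumes 30 rats and measurements that are pure, independent noise, and flags a pair as "significant" when |r| exceeds 0.361, the two-sided 5% critical value for 28 degrees of freedom. The rat count and the number of simulated studies are my own choices.

```python
import random
from itertools import combinations
from math import sqrt

N_MEASUREMENTS = 8
N_RATS = 30
# Two-sided 5% critical value of |r| for 30 animals (28 degrees of freedom).
R_CRITICAL = 0.361

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def chance_correlations(rng):
    """Count 'significant' pairs among measurements that are pure noise."""
    columns = [[rng.gauss(0, 1) for _ in range(N_RATS)] for _ in range(N_MEASUREMENTS)]
    return sum(abs(pearson_r(a, b)) > R_CRITICAL for a, b in combinations(columns, 2))

rng = random.Random(1)
n_pairs = len(list(combinations(range(N_MEASUREMENTS), 2)))
counts = [chance_correlations(rng) for _ in range(1000)]
print(f"pairs tested: {n_pairs}")
print(f"average chance 'findings' per study: {sum(counts) / len(counts):.2f}")
print(f"studies with at least one: {sum(c > 0 for c in counts) / len(counts):.0%}")
```

The average should come out close to 28 × 0.05 = 1.4 spurious correlations per study, and a clear majority of the simulated studies have at least one to write up.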
This is leading people to lose faith in the scientific method.