Scientists Propose To Raise the Standards For Statistical Significance In Research Studies (sciencemag.org)
sciencehabit shares a report from Science Magazine: A megateam of reproducibility-minded scientists is renewing a controversial proposal to raise the standard for statistical significance in research studies. They want researchers to dump the long-standing use of a probability value (p-value) of less than 0.05 as the gold standard for significant results, and replace it with the much stiffer p-value threshold of 0.005. Backers of the change, which has been floated before, say it could dramatically reduce the reporting of false-positive results -- studies that claim to find an effect when there is none -- and so make more studies reproducible. And they note that researchers in some fields, including genome analysis, have already made a similar switch with beneficial results.
"If we're going to be in a world where the research community expects some strict cutoff ... it's better that that threshold be .005 than .05. That's an improvement over the status quo," says behavioral economist Daniel Benjamin of the University of Southern California in Los Angeles, first author on the new paper, which was posted 22 July as a preprint article on PsyArXiv and is slated for an upcoming issue of Nature Human Behavior. "It seemed like this was something that was doable and easy, and had worked in other fields."
"If we're going to be in a world where the research community expects some strict cutoff ... it's better that that threshold be .005 than .05. That's an improvement over the status quo," says behavioral economist Daniel Benjamin of the University of Southern California in Los Angeles, first author on the new paper, which was posted 22 July as a preprint article on PsyArXiv and is slated for an upcoming issue of Nature Human Behavior. "It seemed like this was something that was doable and easy, and had worked in other fields."
If you want reproducibility, well then require reproducibility.
Fisher, the inventor of p-values, said this about p-values:
>>"[We] thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result" (Fisher, 1960, p.13-14).
I'm a biologist, I don't understand P values [...]
Here's some light for that subject.
Suppose you make 20 measurements of rats in a maze and discover that 15 out of the 20 times they turn left on their first corridor junction. Is that significant?
We know that if the decisions were random we'd expect 10 out of 20, but we also know that there is variation in that number. 10 out of 20 is the single most probable outcome, but it's even *more* probable that something other than 10 out of 20 will occur.
So to see if the 15 out of 20 is significant, we can compare this outcome to random chance.
We can simulate 20 coin flips in a computer and then write down the number of heads versus tails. Then we do it again and write down the new results, and then do it again and again for a million rounds.
Tallying the results, we can then find the *probability* that 20 random coin tosses will equal 15 or more heads, and this will give us a way to compare the rat data with random chance. What percent of random tosses yield 15 or more heads?
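Here is a minimal sketch of that coin-flip tally in Python, in case it helps to see it run. The function names, the seed, and the 100,000 rounds (instead of a million, just to keep it quick) are my own choices for illustration. It also computes the exact answer by counting outcomes, so the two can be compared.

```python
import random
from math import comb

def simulate_tail_probability(flips=20, threshold=15, rounds=100_000, seed=1):
    """Estimate P(heads >= threshold) for `flips` fair coin tosses by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(rounds):
        # One round: toss `flips` fair coins and count the heads.
        heads = sum(rng.random() < 0.5 for _ in range(flips))
        if heads >= threshold:
            hits += 1
    return hits / rounds

def exact_tail_probability(flips=20, threshold=15):
    """Exact answer: count the outcomes with at least `threshold` heads."""
    favourable = sum(comb(flips, k) for k in range(threshold, flips + 1))
    return favourable / 2 ** flips

simulated = simulate_tail_probability()
exact = exact_tail_probability()
print(f"simulated: {simulated:.4f}")
print(f"exact:     {exact:.4f}")
```

The exact value is 21700/1048576, or roughly 2%. That is the one-sided answer (15 or more lefts); if a strong preference for turning right would have been just as interesting, you would double it.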
This is the P-value in a nutshell: it's the probability that chance alone would produce a result at least as extreme as your measurements.
Note that we can never be *certain* that the results reflect a real effect, only that chance would rarely produce them. The significance threshold is chosen by convention depending on the outcome risks. For normal scientific studies, it's 5% (P < 0.05). If you're studying a new medicine, you might want to tighten that to 1% (P < 0.01) for safety. If you're exploring subatomic physics, and the experiments are very difficult to reproduce, you might want something like P < .00001% to be relatively certain.
The conventional value of 5% is usually attributed to Fisher, and it is often misread. He said the 5% value makes the results worthy of more study, not that the 5% value makes the results established.
Also of note, if everyone runs studies at P < 5%, then on average 1 out of 20 studies of an effect that isn't there *will* come out significant through random chance alone, which means a steady share of published studies are reporting random events.
And of course, if your degree requires you to publish, or your tenure is based on your publishing history, there are ways to adjust the results to make the significance more likely.
(For example, you can record 8 different measurements of your rats. There are 8*7/2 = 28 possible pairs of measurements, so on average one or two of those pairs will correlate at the 5% level by chance alone. If you want to publish a paper, this is one way to do it.)
Very, very few recent scientific papers have ever been verified by reproducing them, and many of those that were later examined were found to be unreproducible.
This is leading people to lose faith in the scientific method.
The problem with current research in semi-soft sciences like biology and medicine is that the scientists use the p-value wrongly.
If you suspect a glass of wine a day will lower chances of heart disease, take 1000 volunteers, roll a die for each, and tell half of them not to have that wine a day and the other half to please drink one glass of wine a day. Next you wait two years and evaluate the incidence of heart problems in the two groups. That's where a 0.05 P-value is acceptable. (In practice, telling people to suddenly stop or start drinking is not going to go well.)
Things become problematic when you suspect "something we can measure may be related to this disease" (e.g. sarcoidosis), so you take 200 patients and 200 healthy people and then measure 200 parameters in each of the 400 blood samples... Even if there is little real difference to find, about 1/20th of the parameters, or about 10, WILL seem to be different (p<0.05) between the two groups.
In the case at hand one or two measurable parameters ARE different in the patient group, and if the difference is large you will very likely find those. Of the 198 other parameters, 1/20th will be false positives, for a total of almost 12 publishable results.
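The 1/20th arithmetic is easy to check with a rough sketch. The assumption here is mine, not from the research described: when a parameter has no real effect, its p-value is treated as uniformly distributed between 0 and 1, so no actual blood data is simulated. The function names, the seed, and the number of repeated studies are made up for illustration. The second call shows what the proposed 0.005 threshold would do to the same 198 null parameters.

```python
import random

def count_false_positives(null_parameters=198, alpha=0.05, rng=random):
    """Count null parameters that look 'significant' purely by chance."""
    # Assumption: with no real effect, each p-value is uniform on [0, 1].
    return sum(rng.random() < alpha for _ in range(null_parameters))

def average_false_positives(alpha, studies=2_000, seed=7):
    """Average the false-positive count over many imaginary repeat studies."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = sum(count_false_positives(alpha=alpha, rng=rng) for _ in range(studies))
    return total / studies

avg_05 = average_false_positives(0.05)
avg_005 = average_false_positives(0.005)
print(f"average false positives at p < 0.05:  {avg_05:.2f}")
print(f"average false positives at p < 0.005: {avg_005:.2f}")
```

The expected counts are 198 * 0.05 = 9.9 and 198 * 0.005 = 0.99, so a stricter threshold shrinks the pile of spurious findings but does not empty it.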
Should you want to increase your chances of finding these publishable results, the sample size needs to be relatively small. The group of 200 patients and 200 healthy people might already be too big to get enough spurious results. Even if they don't do this consciously, the scientists will quickly be able to optimize their sample size to find publishable results.
When I was a freshman in 1985, some guy asked me to help him put his research in the computer. He had formulated 50 or so questions and predicted boys would answer differently than girls. So he went into a classroom, interviewed 30 boys and girls and put his results in the computer. Of course the computer told him there were several significant differences between boys and girls. Some of them were real (do you like to play with trains? Dolls?), some of them not (I don't remember the example).
The other example is more recent. A Dutch doctor got her PhD with (among others) the described sarcoidosis research. But my run-ins with the subject are very limited, simply because I don't move in those circles. This is way more widespread than just the few examples that I encounter personally.
Then people try to "fix" this by proposing the wrong solutions.
The research: "can we find a parameter that allows us to differentiate between the two groups" is very important as well. But you have to do your research in the right way. Take 100 patients and 100 healthy people and find the parameters that seem to make a difference. NOW you go into the second half of the research with a hypothesis: "this parameter is important" and verify your claim. Now the p=0.05 is acceptable. (a 5% chance that you're wrong, as opposed to a 95% chance you're full of shit).
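A rough sketch of that two-stage approach, under assumptions that are all mine: 200 parameters of which 2 carry a real (and deliberately large, hypothetical) difference of 1 standard deviation, unit-variance measurements, and a plain two-sided z-test. Instead of generating every blood sample, it draws the difference between the group means directly. The names `p_value` and `two_stage_study` are made up for the example.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to turn z into a p-value

def p_value(effect, n, rng):
    """Two-sided z-test p-value for one parameter, n patients vs n healthy people."""
    # Assumption: unit-variance measurements, so the difference between the two
    # group means is normal with mean `effect` and standard deviation sqrt(2 / n).
    se = (2 / n) ** 0.5
    z = rng.gauss(effect, se) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def two_stage_study(effects, n=100, alpha=0.05, seed=3):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Stage 1 (exploration): screen every parameter on the first half.
    candidates = [i for i, e in enumerate(effects) if p_value(e, n, rng) < alpha]
    # Stage 2 (verification): retest only the candidates on fresh subjects.
    confirmed = [i for i in candidates if p_value(effects[i], n, rng) < alpha]
    return candidates, confirmed

# Parameters 0 and 1 are the real ones; the other 198 have no effect at all.
effects = [1.0, 1.0] + [0.0] * 198
candidates, confirmed = two_stage_study(effects)
print("flagged in stage 1:  ", len(candidates))
print("confirmed in stage 2:", confirmed)
```

On average about 10 of the 198 null parameters get flagged in the first stage, and only 5% of those (about half a parameter per study) survive the second, while a real effect of this size passes both stages almost every time.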