Why P-values Cannot Tell You If a Hypothesis Is Correct
ananyo writes "P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. Critically, they cannot tell you the odds that a hypothesis is correct. A feature in Nature looks at why, if a result looks too good to be true, it probably is, despite an impressive-seeming P value."
http://xkcd.com/882/
Even the example of p=0.01 from the article is subject to the same problem. That's why the LHC worked for something like 6 sigma before declaring the higgs boson to be discovered. Even then, there's always the chance, however remote, that statistics fooled them.
One variant of "p-hacking" is "torturing the data", or performing the same statistical test over and over again, on slightly different data sets, until you get the result that you want. You will eventually get the result you want, regardless of the underlying reality, because there is 1 spurious result for every 20 statistical tests you perform (p=0.05).
I remember one amusing example, which involved a researcher who claimed that a positive mental outlook increases cancer survival times. He had a poorly-controlled study demonstrating that people who keep their "mood up" are more likely to survive longer if they have cancer. When other researchers designed a larger, high-quality study to examine this phenomenon, it found no effect. Mood made no difference to survival time.
Then something interesting happened. The original researcher responded by looking for subsets of the data from the large study, to find any sub-groups where his hypothesis would be confirmed. He ended up retorting that "keeping a positive mental outlook DID work, according to your own data, for 35-45 year-old east asian females (peven if the p value was 0.05.
This kind of thing crops up all the time.
Other way around, and not quite true for a properly formulated hypothesis.
Frequentist statistics involves making a statistical hypothesis, choosing a level of evidence that you find acceptable (usually alpha=0.05) and using that to accept or reject it. The statistical hypothesis is tied to your scientific hypothesis (or it should be). If the standard of evidence is met or exceeded, the results support the hypothesis. If not, they don't mean anything.
HOWEVER, if you specify your hypothesis well, you include a minimum difference that you consider meaningful. You then calculate error bars for your result and, if they show that your measured value is less than the minimum you hypothesized, that's evidence supporting the negative (not the null) hypothesis: any difference is so small as to be meaningless.
I am not a fan of everyone using Bayesian techniques. A Bayesian analysis of a single experiment that gives a normally distributed measurement (which is most of them) with a non-informative prior is generally equivalent to a frequentist analysis. Since scientists already have trouble doing simple frequentist tests correctly, they do not need to be doing needless Bayesian analyses.
As for informative priors, I don't think they should ever be used in the report of a single experiment. Report the p-value and the Bayes factor, or the equivalent information needed to calculate them. Since an informative prior is inherently subjective, the reader should be left to make up his own mind what it is. Reporting the Bayes factor makes meta-analyses, where Bayesian stats SHOULD be mandatory, easier.