Why P-values Cannot Tell You If a Hypothesis Is Correct
ananyo writes "P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. Critically, they cannot tell you the odds that a hypothesis is correct. A feature in Nature looks at why, if a result looks too good to be true, it probably is, despite an impressive-seeming P value."
It's a P value.
Don't worry, with the way beta is going you'll soon have first post on -every- post :)
http://xkcd.com/882/
Even the example of p=0.01 from the article is subject to the same problem. That's why the LHC worked for something like 5 sigma before declaring the Higgs boson to be discovered. Even then, there's always the chance, however remote, that statistics fooled them.
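For a sense of scale, here is a rough sketch of how a "sigma" level maps to a p-value. It assumes a one-sided tail of a standard normal distribution, which is the usual particle-physics convention; the function name one_sided_p is made up for illustration.

```python
import math

def one_sided_p(sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond `sigma` standard deviations.

    Assumption: one-sided test, Gaussian test statistic.
    """
    return 0.5 * math.erfc(sigma / math.sqrt(2))

for s in (2, 3, 5, 6):
    print(f"{s} sigma -> one-sided p = {one_sided_p(s):.3g}")
```

Under those assumptions, 5 sigma is a p-value of roughly 3 in 10 million, which is far stricter than p=0.01 but still not zero.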
It takes more than one study.
There is a push to have studies include Bayesian probability.
IMHO all papers should be read by statisticians just to be sure the calculations are correct.
The Dunning-Kruger effect explains most posts on
There is no shortage of misleading statistics out there. It can be a discipline fraught with peril for the uninformed, and there are lots of statistics packages out there that reduce advanced tests to a "point and shoot" level of difficulty that produces results that may not mean what the user thinks they mean. I've read some articles showing no lack of problems in the social sciences, but the problem is bigger than that.
I can't help wondering how much that plays into the oscillating recommendations that you see for various foods. Both coffee and eggs have gone through repeated cycles of, "it's bad," "no, it's good," "no, it's bad," "no, it's good." I understand that at least some of it is coming down to the aspect they choose to measure, but I can't help but wonder how much bad statistics is playing into it.
much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
The world is full of coincidental correlations waiting to be rationalized into causal relationships.
That means "outmoded and archaic", right?
I realize I have a p-value in my .sig line and have for a decade, but p-values were a mediocre way to communicate the plausibility of a claim even in 2003. They are still used simply because the scientific community--and even more so the research communities in some areas of the social sciences--are incredibly conservative and unwilling to update their standards of practice long after the rest of the world has passed them by.
Everyone who cares about epistemology has known for decades that p-values are a lousy way to communicate (im)plausibility. This is part and parcel of the Bayesian revolution. It's good that Nature is finally noticing, but it's not as if papers haven't been published in ApJ and similar journals since the '90s with curves showing the plausibility of hypotheses as positive statements.
A p-value is the probability of the data occurring given that the null hypothesis is true. In the strictest sense it says nothing about the hypothesis under test, only about the null. This is why the value cited in my .sig line is relevant: people who are innocent are not guilty. This rare case, where there is an interesting binary opposition between competing hypotheses, is the only one where p-values are modestly useful.
In the general case there are multiple competing hypotheses, and Bayesian analysis is well-suited to updating their plausibilities given some new evidence (I'm personally in favour of biased priors as well.) The result of such an analysis is the plausibility of each hypothesis given everything we know, which is the most anyone can ever reasonably hope for in our quest to know the world.
[Note on language: I distinguish between "plausibility"--which is the degree of belief we have in something--and "probability"--which I'm comfortable taking on a more-or-less frequentist basis. Many Bayesians use "probability" for both of these related but distinct concepts, which I believe is a source of a great deal of confusion, particularly around the question of subjectivity. Plausibilities are subjective, probabilities are objective.]
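To make the "multiple competing hypotheses" idea concrete, here is a minimal sketch of a Bayesian update. Everything specific in it is an assumption chosen for illustration: three hypothetical coin biases, a flat prior, and 8 heads observed in 10 flips. The helper name posterior is also made up.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of hypotheses: normalize prior * likelihood."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical competing hypotheses: the coin lands heads with one of these probabilities.
biases = [0.3, 0.5, 0.7]
# Assumed flat prior: no hypothesis is favoured before seeing data.
priors = [1 / 3, 1 / 3, 1 / 3]

# Illustrative data: 8 heads and 2 tails.
heads, tails = 8, 2
likelihoods = [b ** heads * (1 - b) ** tails for b in biases]

post = posterior(priors, likelihoods)
for b, p in zip(biases, post):
    print(f"P(bias = {b} | data) = {p:.3f}")
```

The output is a plausibility for each hypothesis rather than a single accept/reject verdict, and a different prior would give different numbers.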
Blasphemy is a human right. Blasphemophobia kills.
Any researcher worth their salt states a p-value with enough additional information to understand if the p-value is actually meaningful. Anyone who looks at a paper and makes a conclusion based solely (or largely) off a p-value without thinking about how meaningful the results are from a clinical or real-world perspective is being lazy or reckless.
I guess there are quite a few insightful XKCD strips but this one seems most apt, here: http://xkcd.com/552/
I learnt the uselessness of statistics for guidance of correctness when trying to reduce my effort required at Sudoku. I've since discovered the best way to win is not to play. Doesn't stop me trying though!
p-value is just the probability the data/observations were the result of a random process. So a great p value like 0.01 says the results were not random. They do not confirm what made them non-random (i.e. theory).
Epistemology is elementary, and often skipped by those who wish to persuade. "Figures do not lie, but liars figure."[Clemens]
That article posted earlier claiming a lot of Americans think astrology is a science? It was a telephone survey with a sample size of two thousand people. Think that P-value proves the hypothesis is correct?
From TFA:
Dance of the p-values youtu.be/ez4DgdurRPg
A few folk here have commented using incomplete or inaccurate definitions of p-values. A p-value is the probability of finding new data as extreme as, or more extreme than, the data you observed, assuming a null hypothesis is true. A couple of salient criticisms not mentioned in the article are a) why should more extreme data be lumped in with what was observed and b) what if "new" data can't sensibly be obtained.
In a less technical sense, what the article didn't get into so much is that there is a strong publication bias towards results that are significant (i.e. small p-values), to the point where you need <0.05 to even consider submitting. Some key reading: http://www.stanford.edu/~neilm/qjps.pdf. The short version is to not believe it when the news says that "recent research shows...".
Personally, I wait for evidence to accumulate before, say, changing my diet. And if you really want to get it right, dig through the literature yourself. Some of my saddest moments have come from statistics consulting where mostly people come to you looking for permission to run an inappropriate analysis, not to understand their data or fit the "right" model. They want to get published, and that's just how things are done.
Also there is a simpler analysis of the above article
MOD THE CHILD UP!
"A p-value of 0.05 means there's a 5% chance that your paper is wrong. In other words, 1 in 20 papers is bullshit."
Of course p-values don't tell you that your hypothesis is correct.
A p-value is the probability of getting a result as extreme as (or more extreme than) the one observed, assuming the null hypothesis is true. It has nothing to do with the probability of the truth of the alternative hypothesis.
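That definition can be checked by hand on a small case. The following is a sketch, assuming a one-sided exact binomial test with a fair-coin null hypothesis; the function name binomial_p_value and the 9-heads-in-10 data are illustrative only.

```python
from math import comb

def binomial_p_value(k: int, n: int, p0: float = 0.5) -> float:
    """One-sided p-value: P(X >= k) for X ~ Binomial(n, p0).

    This is the probability of a result as extreme as, or more extreme
    than, k successes in n trials, assuming the null (success rate p0).
    """
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i) for i in range(k, n + 1))

# Illustrative data: 9 heads in 10 flips, null hypothesis "the coin is fair".
p = binomial_p_value(9, 10)
print(f"p = {p:.4f}")  # counts the outcomes 9 heads and 10 heads: 11/1024
```

Note that nothing in the calculation refers to any alternative hypothesis, which is the point being made above.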
is used by people who are aware of this well-known problem.
Correlation != Causation
I do not fail; I succeed at finding out what does not work.
To rephrase a famous quote, "There are lies, damned lies, and then there are statistics - who ever heard of a statistician fudging the numbers?"
At least in bioinformatics, the correction of p-values for multiple comparisons ("q-values") has been standard practice for quite a while now.
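As a rough illustration of that kind of correction, here is a sketch of the Benjamini-Hochberg step-up procedure, which controls the false discovery rate and is closely related to q-values. It is not the specific q-value estimator any particular bioinformatics package uses; the function name and the example p-values are made up.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k with p_(k) <= k * fdr / m,
    then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * fdr / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected

# Hypothetical p-values from four tests.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.02]))
```

The procedure assumes the tests are independent or positively dependent; other dependence structures call for a more conservative variant.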
It really, really sucks.
One variant of "p-hacking" is "torturing the data", or performing the same statistical test over and over again, on slightly different data sets, until you get the result that you want. You will eventually get the result you want, regardless of the underlying reality, because when there is no real effect you still expect, on average, 1 spurious result for every 20 statistical tests you perform (p=0.05).
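The arithmetic behind this is easy to sketch. The snippet below assumes independent tests with no real effect anywhere, so each p-value is uniform on [0, 1]; the function names, trial count, and seed are arbitrary choices for illustration.

```python
import random

def family_wise_error(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one p < alpha among n independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

def simulate(n_tests: int, alpha: float = 0.05, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo check: under a true null, each p-value is uniform on [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / trials

print(f"analytic:  {family_wise_error(20):.3f}")
print(f"simulated: {simulate(20):.3f}")
```

With 20 tests at the 0.05 level, the chance of at least one "significant" result is about 64%, even though nothing is going on.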
I remember one amusing example, which involved a researcher who claimed that a positive mental outlook increases cancer survival times. He had a poorly-controlled study demonstrating that people who keep their "mood up" are more likely to survive longer if they have cancer. When other researchers designed a larger, high-quality study to examine this phenomenon, it found no effect. Mood made no difference to survival time.
Then something interesting happened. The original researcher responded by looking for subsets of the data from the large study, to find any sub-groups where his hypothesis would be confirmed. He ended up retorting that "keeping a positive mental outlook DID work, according to your own data, for 35-45 year-old East Asian females (p<[...] even if the p value was 0.05.
This kind of thing crops up all the time.
Lies ... damned lies ... and statistics. The P value only tells you if there is statistical significance in the data, not whether your hypothesis is correct or incorrect.
I think the article missed that major point. I think P values are fine when applied to the normal distribution; however, apparently some people are using the "empirical rule" to apply it to any distribution. If people configured their data collection to allow the "central limit theorem" to be used, then the data would be normally distributed and P value limits would be fine.
Nowhere does the article mention the normal distribution or the central limit theorem.
Eggs is a good example. They where 'bad' becasue they had high cholesterol.... The media s the issue. It's can report science worth a damn.
Now they are bad because they impair language skills.
Only joking, this is obviously the product of caffeine deprivation! Nothing to do with eggs at all ...
This was an interesting read on the subject published by a medical doctor who apparently had a good stat background. Dr. Ioannidis has kind of been on a crusade to look at design issues through meta-analysis of old med studies that were suspect. http://www.ncbi.nlm.nih.gov/pm.... There has been some more recent work of his in the news as of late, I believe. I love the 1001 varying non-mathematical definitions of p values in this thread; it was cute.
It's not like the comments on Slashdot are worth much: Ars has far more mature and on the mark comments. It's like only kids are commenting on Slashdot.
The article is little more than an appeal to undefined conceptions of "plausibility" that presumably would be little more than generalized weighted probability functions. However, the article seems to rely more heavily upon being reasonably sure the reader is as confused as the author with respect to the more subtle technical details.
P values reflect, given certain assumptions about how the data are sampled and the statistical distribution from which they are drawn, the probability of committing a type I statistical error (rejecting a true null hypothesis as false). Consequently, a P value of 0.05 suggests that such a result will manifest itself about 1 time in 20 by chance alone, assuming sampling independence and the nature of the probability distribution from which it is presumably drawn, usually a Gaussian one given that in most situations the sample mean of an unknown distribution will approach the true mean as the sample size tends to infinity. For most types of data, such as the agronomic data evaluated by Fisher, a P value of 0.05 was and remains a useful choice since it has been found generally useful or meaningful in the context within which it was used (to discuss the nature and existence of differences among plants and their growth rates). In other fields of study more stringent critical values are the norm given the nature and frequency of the expected outcomes of interest. Obviously, type I statistical errors are not the only type of statistical error, since it is also possible to accept a null hypothesis when it is in fact false (a type II error). However, given the law of the excluded middle, these two notions are not independent of one another.
Much of the confusion in the statistical literature, clearly on display on Slashdot (no surprise there), stems from being unable to appropriately recognize what constitutes the null hypothesis and what constitutes a reasonable framework from which to decide the nature of the underlying distribution being tested, as well as the extent to which certain assumptions, such as the independence of samples and variates, are met and how they may affect p values. In some tests it is the measure of central tendency that is tested, whereas in others it is the homogeneity of variances. In more complicated designs more care must be afforded, just as in analyses of correlation or covariance, where it is the independence or lack thereof between variables that is being tested. In situations with multiple covariates, as in their block-design analogs, interaction effects must be controlled. Likewise, in cases of multiple comparisons one must account for the family-wise or experiment-wise error, since the more tests one conducts, the greater the chance of encountering an unlikely result by chance alone. Such considerations give rise to adjustments of critical p values, as in the case of the widely known Bonferroni adjustments. Likewise, the discriminatory power of different tests for a given sample size can also be an issue requiring consideration, not to mention specificity and sensitivity, which may be more important in certain settings, such as epidemiology.
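For readers who have not seen it, the Bonferroni adjustment mentioned above amounts to dividing the critical value by the number of tests. A minimal sketch follows; the function name and the four example p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a null only if its p-value is at most alpha / (number of tests).

    This keeps the family-wise error rate at or below alpha, at the cost
    of reduced power when many tests are run.
    """
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Four hypothetical tests: the per-test threshold becomes 0.05 / 4 = 0.0125.
print(bonferroni([0.001, 0.02, 0.04, 0.30]))
```

Two of those p-values would pass an unadjusted 0.05 cutoff but fail once the number of comparisons is accounted for.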
It should also be kept in mind that most statistical procedures assume variates to be either continuous or discrete in their distribution and to arrive at an idea of the underlying topology of the space being sampled, typically a metric one, and often a Hilbert space. Should the phenomena of interest be pseudometric rather than metric in nature or living within more general topological spaces, then standard probability theory as derived from measure theory would have to be more delicately applied, if not abandoned entirely for non-parametric techniques. P-values, like everything else in science, must be appropriately interpreted within the context in which they are being discussed. However, once such considerations are appropriately made, a lower p-value will, as noted by Fisher, provide further confidence in ruling out chance as an explanation of a particular statistical outcome than a higher one. Alternatives, such as "plausibility", will need to prove themselves as viable.
The Earth is not round. It is an oblate spheroid.
Alas for the poor computer jockey.
The Earth is not round. It is an oblate spheroid.
Actually, it's a sphere defined by the EGM96 coefficients.
I'm honestly very dismayed by the idea, rapidly gaining currency, that Bayesian methods will somehow solve all these problems.
Bayesian methods have their own problems, namely what the prior actually is. In the linked article, for example, they suggest that the prior is trivially determined by any number of mechanisms. E.g., how do you interpret previous research to get the prior? Do you use that to get a prior? Do you use an objective prior, such as a reference prior? In reality, the choice of prior is left to the whims of the individual researcher, which becomes an entirely new mechanism with which researchers game the system.
Bayesians (and I consider myself neither a Bayesian nor a frequentist, or both a Bayesian and a frequentist, depending on how you look at it, for reasons that are too complicated to get into here) would respond to this by saying "well, at least your prior is made explicit." But this is not actually true in practice typically, and the same argument could be made about frequentists (who explicitly have no prior).
The article mentions the utility of meta-analysis in this context, to identify patterns of publication bias and so forth. But if everyone starts using Bayesian methods, you'll have an additional source of heterogeneity and bias to contend with in meta-analysis, namely the nature of the prior.
The dirty secret of Bayesian inference is that it is biased--in fact, results in the literature show it is impossible to have an unbiased Bayesian estimator. This is well-documented, and there's a good reason for it: Bayesian methods take advantage of the fact that you can reduce overall estimation error by introducing bias, under the assumption that the magnitude of the bias is small enough to offset the variance of the estimator (overall estimation error is a function of the squared bias plus variance, so if you can introduce a procedure that increases squared bias but decreases variance even more, you'll decrease overall estimation error). However, this all assumes the only risk is due to random error, with honest experimenters, which is clearly false.
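The squared-bias-plus-variance trade-off can be shown with a toy estimator. The sketch below is not a Bayesian procedure as such; it assumes a sample mean shrunk toward zero by a fixed factor c, which behaves like a posterior mean under a prior centred at zero. All names and numbers are illustrative.

```python
def mse_shrunk_mean(theta: float, sigma: float, n: int, c: float) -> float:
    """Mean squared error of the estimator c * (sample mean).

    theta: true mean, sigma: true standard deviation, n: sample size,
    c: shrinkage factor (c = 1 is the ordinary unbiased sample mean).
    """
    bias = (c - 1) * theta
    variance = c ** 2 * sigma ** 2 / n
    return bias ** 2 + variance

# True mean close to the shrinkage target: the biased estimator wins.
print(mse_shrunk_mean(theta=1.0, sigma=2.0, n=4, c=1.0))   # unbiased
print(mse_shrunk_mean(theta=1.0, sigma=2.0, n=4, c=0.5))   # shrunk

# True mean far from the target: the bias dominates and shrinkage loses.
print(mse_shrunk_mean(theta=10.0, sigma=2.0, n=4, c=0.5))
```

The gain depends entirely on the truth being near where the shrinkage pulls, which is the sense in which the choice of prior matters.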
I don't mean to bash Bayesian statistics--I use it and have published on Bayesian methods in multiple places. But it's not the answer to our problems. It's just another tool. The real problems are fads and publication pressures, which won't change by changing your inferential philosophy.
Francis Anscombe made this same point in 1973. He created four very different data sets with nearly identical mean, variance, correlation (r-value), and linear regressions. Visual inspection, on the other hand, would show that all of these statistics are junk (usually because the model is mismatched to the data set).
Anscombe didn't look at P-values, but the same argument holds. The P-values for these would be quite high because the underlying source of the noise is non-Gaussian [or non-Normal if you're a mathematician] yet the usual calculation of a P-value assumes Gaussian statistics.
Check out the Wikipedia article on Anscombe's quartet to see the data sets.
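For anyone who wants to check the numbers without plotting, here is a short sketch that computes the summary statistics for the four data sets. The values are the quartet as commonly tabulated; the helper names pearson_r and summarize are made up for this example.

```python
from math import sqrt
from statistics import mean, variance

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Data sets I-III share the same x values; set IV has its own.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return (mean x, sample variance x, mean y, sample variance y, correlation)."""
    return mean(xs), variance(xs), mean(ys), variance(ys), pearson_r(xs, ys)

for name, (xs, ys) in QUARTET.items():
    mx, vx, my, vy, r = summarize(xs, ys)
    print(f"{name:>3}: mean_x={mx:.2f} var_x={vx:.2f} mean_y={my:.2f} var_y={vy:.2f} r={r:.3f}")
```

The four rows come out nearly the same, which is exactly why the summary numbers alone cannot tell you which of the four shapes you are looking at.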
-JS
if the researcher is an idiot, don't blame the p value
There's strong statistical evidence that statistics are correct 95% of the time
what? p-values are in no way restricted to the normal distribution, although a lot of statistical theory does involve the normal distribution.
"They were pure niggers." – Noam Chomsky