Why P-values Cannot Tell You If a Hypothesis Is Correct

Posted by Soulskill on Wednesday February 12, 2014 @09:45AM from the i'm-a-doctor-not-an-oracle dept.

ananyo writes "P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. Critically, they cannot tell you the odds that a hypothesis is correct. A feature in Nature looks at why, if a result looks too good to be true, it probably is, despite an impressive-seeming P value."

Oblig XKCD by c++0xFF · 2014-02-12 09:55 · Score: 5, Informative

http://xkcd.com/882/
Even the example of p=0.01 from the article is subject to the same problem. That's why the LHC held out for something like 5 sigma before declaring the Higgs boson discovered. Even then, there's always the chance, however remote, that statistics fooled them.
1. Re:Oblig XKCD by xQx · 2014-02-12 11:20 · Score: 5, Informative
  
  While I agree with the article's headline/conclusion, they aren't innocent of playing games themselves:
  Take their sentence: "meeting online nudged the divorce rate from 7.67% down to 5.96%, and barely budged happiness from 5.48 to 5.64 on a 7-point scale" ... Isn't that intentionally misleading? Sure, 0.16 points doesn't sound like much... but it's on a seven point scale. If we change that to a 3 point scale it's only 0.07 points! Amazingly small! ... but wait, if I change that to a 900,000 point scale, well, then that's a whole 20,571 points difference. HUGE NUMBERS!
  But I think they missed a really important point - SPSS (one of the very popular data analysis packages) offers you a huge range of correlation tests, and you are _supposed_ to choose the one that best matches the data. Each has its own assumptions, and will only provide the correct 'p' value if the data matches those assumptions.
  For example, many of the tests require that the data follow a bell-shaped curve, and you are supposed to first test your data to ensure that it is normally distributed before using any of the correlation tests that assume normally distributed data. If you don't, you risk over-stating the correlation.
  If you have data from a Likert scale, you should treat it as ordinal (ranked) data, not numerical (i.e. the difference between "Totally Disagree" and "somewhat disagree" should not be assumed to be the same as the difference between "somewhat disagree" and "totally agree") - however, if you aren't getting to the magic p<0.05 treating it as ordinal data, you can usually get it over the line by treating it as numerical data and running a different correlation test.
  Lecturers are measured on how many papers they publish, and most peer reviewers don't know the subtle differences between these tests, so as long as they see 'SPSS said p<0.05' and they don't disagree with any of the content of your paper, yay, you get published.
  Finally, many of the tests have a minimum sample size below which they should not be used. If you only have a study of 300 people, there's a whole range of popular correlation tests that you are not supposed to use. But you do, because SPSS makes it easy, because it gets better results, because you forgot what the minimum size was and can't be arsed looking it up (if it's a real problem the reviewers will point it out).
  (Evidence to support these statements can be found in the "Survey Researcher's SPSS Cookbook" by Mark Manning and Don Munro. Obviously, it doesn't go into how you can choose an incorrect test to 'hack the p value'; to prove that, I recommend you download a copy of SPSS and take a short-term position as a lecturer's assistant.)
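To make the ordinal-versus-numerical point concrete, here is a rough sketch in plain Python (not SPSS). It computes a Pearson correlation, which treats the Likert codes as real numbers, and a Spearman correlation, which only uses their ranks. The function names and the survey data are made up for illustration; the only claim is that the two treatments can give different coefficients on the same answers.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation: treats the values as numbers on an interval scale."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def average_ranks(values):
    """Rank the values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks (ordinal treatment)."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical survey: Likert answers (1-5) against some measured outcome.
likert = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
outcome = [2.0, 2.5, 2.4, 3.1, 3.0, 3.3, 3.2, 4.0, 9.5, 12.0]

print(f"Pearson  (numerical treatment): {pearson(likert, outcome):.3f}")
print(f"Spearman (ordinal treatment):   {spearman(likert, outcome):.3f}")
```

Which coefficient is appropriate depends on the data, not on which one produces the smaller p value.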
Reminds me of the Bible Code controversy by sideslash · 2014-02-12 09:59 · Score: 4, Interesting

The world is full of coincidental correlations waiting to be rationalized into causal relationships.
Gold Standard? by radtea · 2014-02-12 10:02 · Score: 4, Interesting

That means "outmoded and archaic", right?
I realize I have a p-value in my .sig line and have for a decade, but p-values were a mediocre way to communicate the plausibility of a claim even in 2003. They are still used simply because the scientific community--and even more so the research communities in some areas of the social sciences--are incredibly conservative and unwilling to update their standards of practice long after the rest of the world has passed them by.
Everyone who cares about epistemology has known for decades that p-values are a lousy way to communicate (im)plausibility. This is part and parcel of the Bayesian revolution. It's good that Nature is finally noticing, but it's not as if papers haven't been published in ApJ and similar journals since the '90s with curves showing the plausibility of hypotheses as positive statements.
A p-value is the probability of seeing data at least as extreme as what was observed, given that the null hypothesis is true; in the strictest sense it says nothing about the hypothesis under test, only the null. This is why the value cited in my .sig line is relevant: people who are innocent are not guilty. This rare case, where there is an interesting binary opposition between competing hypotheses, is the only one where p-values are modestly useful.
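As a rough illustration of that definition, the sketch below computes an exact two-sided p-value for a coin-flip experiment using only the standard library. The coin example, the function name `binomial_p_value`, and the numbers are assumptions chosen for illustration; they are not from the Nature piece. Note what the function conditions on: the null ("the coin is fair") is taken as given, and nothing in it refers to any alternative hypothesis.

```python
from math import comb


def binomial_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Exact two-sided p-value for a binomial experiment.

    Sums the null-hypothesis probability of every outcome that is no more
    likely than the observed one, i.e. "data at least as extreme".
    """
    def pmf(k: int) -> float:
        return comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)

    observed = pmf(heads)
    # Small relative tolerance so outcomes tied with the observed one count.
    return sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))


# Hypothetical data: 16 heads in 20 flips, null hypothesis "the coin is fair".
p = binomial_p_value(heads=16, flips=20)
print(f"P(data at least this extreme | fair coin) = {p:.4f}")
```

The result is P(data | null), not P(null | data); turning one into the other needs a prior, which is where the Bayesian machinery comes in.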
In the general case there are multiple competing hypotheses, and Bayesian analysis is well-suited to updating their plausibilities given some new evidence (I'm personally in favour of biased priors as well.) The result of such an analysis is the plausibility of each hypothesis given everything we know, which is the most anyone can ever reasonably hope for in our quest to know the world.
[Note on language: I distinguish between "plausibility"--which is the degree of belief we have in something--and "probability"--which I'm comfortable taking on a more-or-less frequentist basis. Many Bayesians use "probability" for both of these related but distinct concepts, which I believe is a source of a great deal of confusion, particularly around the question of subjectivity. Plausibilities are subjective, probabilities are objective.]
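A minimal sketch of that kind of update, assuming a toy problem with three competing hypotheses about a coin's bias. The priors, the candidate biases, and the function names are invented for illustration; a real analysis would have to justify both its priors and its likelihood model.

```python
from math import comb


def coin_likelihood(bias: float, heads: int, flips: int) -> float:
    """P(data | hypothesis): binomial probability of the observed flips."""
    return comb(flips, heads) * bias**heads * (1 - bias) ** (flips - heads)


def bayes_update(priors: dict, likelihoods: dict) -> dict:
    """Bayes' rule: posterior is prior times likelihood, renormalised to sum to 1."""
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalised.values())
    return {h: weight / evidence for h, weight in unnormalised.items()}


# Hypothetical prior plausibilities, keyed by the coin bias each hypothesis claims.
priors = {0.5: 0.90, 0.7: 0.05, 0.9: 0.05}

# Hypothetical evidence: 16 heads in 20 flips.
likelihoods = {bias: coin_likelihood(bias, heads=16, flips=20) for bias in priors}
posterior = bayes_update(priors, likelihoods)

for bias in priors:
    print(f"bias={bias}: prior {priors[bias]:.2f} -> posterior {posterior[bias]:.3f}")
```

Unlike a p-value, the output is a plausibility for each hypothesis, and it is only as good as the priors and the list of hypotheses that went in.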

--
Blasphemy is a human right. Blasphemophobia kills.
Re:Misleading statistics by geekoid · 2014-02-12 10:25 · Score: 4, Insightful

Not a lot.
Eggs are a good example.
They were 'bad' because they had high cholesterol.
Science moved on, and it turns out there are different kinds of cholesterol, some 'good', some 'bad', so now eggs aren't as unhealthy as was thought.
Same with many things.
The media is the issue. It can't report science worth a damn.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
The Earth Is Round (p < 0.05) by DVega · 2014-02-12 10:30 · Score: 4, Insightful

There is a classic article by Jacob Cohen on this subject.
Also there is a simpler analysis of the above article

--
MOD THE CHILD UP!
Re:Misleading statistics by tlhIngan · 2014-02-12 10:43 · Score: 4, Insightful

Eggs are a good example.
They were 'bad' because they had high cholesterol.
Science moved on, and it turns out there are different kinds of cholesterol, some 'good', some 'bad', so now eggs aren't as unhealthy as was thought.
Fats, too. It was deemed that fats were bad for you, so instead of butter, use margarine. Better yet, skip the fats period. Bad for you.
Of course, it was also discovered that hydrogenation had a nasty habit of turning unsaturated fats into different geometric isomers - "cis" and "trans". And guess what? The "trans" form of the fat is really, really, really bad for you (yes, that's the same "trans" in trans fats). Suddenly butter wasn't such an unreasonable option compared with margarine, as margarine had to undergo hydrogenation.
Not to mention the effort to go "low fat" has had nasty side effects of its own - the overuse of sugar and salt to replace the taste that fats had, resulting in even worse health problems (obesity, heart disease) than just having the fat to begin with.
(And no, banning trans fats doesn't mean they ban "yummy stuff" - there's plenty of fats you can cook with to still get the "yummy" without all the trans fats.)
Torturing the data by floobedy · 2014-02-12 13:31 · Score: 4, Informative

One variant of "p-hacking" is "torturing the data", or performing the same statistical test over and over again, on slightly different data sets, until you get the result that you want. You will eventually get the result you want, regardless of the underlying reality, because when there is no real effect, on average 1 in every 20 statistical tests you perform will still come out "significant" at the p<0.05 threshold.
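The arithmetic behind that is easy to check. The sketch below assumes independent tests and no real effect anywhere, in which case each test has a 5% chance of a false positive. The function names and the choice of 20 tests are illustrative, and the simulation leans on the fact that a p-value is uniformly distributed when the null hypothesis is true.

```python
import random


def family_wise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive in n independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


def simulate_torture(alpha: float, n_tests: int, n_studies: int, seed: int = 0) -> float:
    """Fraction of simulated studies that find at least one 'significant' result.

    Assumption: under a true null the p-value is uniform on [0, 1], so each
    test is modelled as a single uniform draw compared against alpha.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


alpha, n_tests = 0.05, 20
print(f"Expected false positives in {n_tests} tests: {alpha * n_tests:.1f}")
print(f"P(at least one), analytic:  {family_wise_error(alpha, n_tests):.3f}")
print(f"P(at least one), simulated: {simulate_torture(alpha, n_tests, 20000):.3f}")
```

So 20 looks at pure noise give about one expected "hit", and better-than-even odds of at least one.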
I remember one amusing example, which involved a researcher who claimed that a positive mental outlook increases cancer survival times. He had a poorly-controlled study demonstrating that people who keep their "mood up" are more likely to survive longer if they have cancer. When other researchers designed a larger, high-quality study to examine this phenomenon, it found no effect. Mood made no difference to survival time.
Then something interesting happened. The original researcher responded by looking for subsets of the data from the large study, to find any sub-groups where his hypothesis would be confirmed. He ended up retorting that "keeping a positive mental outlook DID work, according to your own data, for 35-45 year-old east asian females (peven if the p value was 0.05.
This kind of thing crops up all the time.