Slashdot Mirror


Why P-values Cannot Tell You If a Hypothesis Is Correct

ananyo writes "P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. Critically, they cannot tell you the odds that a hypothesis is correct. A feature in Nature looks at why, if a result looks too good to be true, it probably is, despite an impressive-seeming P value."
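
One way to see why a P value alone cannot give the odds that a hypothesis is correct is to work through Bayes' theorem. The sketch below is only an illustration: the function name, the prior probabilities, the 80% power and the 0.05 threshold are all assumptions chosen for the example, not figures taken from the Nature feature.

```python
def prob_effect_is_real(prior: float, power: float, alpha: float) -> float:
    """P(effect is real | test came out significant), via Bayes' theorem.

    prior: assumed probability, before the experiment, that the effect is real
    power: assumed probability of a significant result when the effect is real
    alpha: significance threshold (false positive rate when there is no effect)
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Same test, same threshold, three hypothetical levels of prior plausibility.
    for prior in (0.5, 0.1, 0.01):
        print(prior, round(prob_effect_is_real(prior, power=0.8, alpha=0.05), 3))
```

With the threshold held fixed, the chance that a "significant" finding is real depends heavily on how plausible the effect was beforehand, and that is information the P value does not contain.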

23 of 124 comments

  1. Re:Ooo! First post? by Anonymous Coward · · Score: 2, Funny

    Don't worry, with the way beta is going you'll soon have first post on -every- post :)

  2. Oblig XKCD by c++0xFF · · Score: 5, Informative

    http://xkcd.com/882/

    Even the example of p=0.01 from the article is subject to the same problem. That's why the LHC worked for something like 6 sigma before declaring the Higgs boson to be discovered. Even then, there's always the chance, however remote, that statistics fooled them.
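
    For a sense of scale, a "sigma" level can be converted to a tail probability. This is a rough sketch that assumes the test statistic is normally distributed and uses the one-sided convention; the function name is made up for the example.

```python
import math


def one_sided_p_from_sigma(n_sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


if __name__ == "__main__":
    # Compare everyday thresholds with the particle-physics style ones.
    for n_sigma in (2, 3, 5, 6):
        print(n_sigma, one_sided_p_from_sigma(n_sigma))
```

    Under those assumptions, 5 sigma corresponds to a tail probability of roughly 3 in 10 million, far stricter than p=0.05 or p=0.01.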

    1. Re:Oblig XKCD by xQx · · Score: 5, Informative

      While I agree with the article's headline/conclusion, they aren't innocent of playing games themselves:

      Take their sentence: "meeting online nudged the divorce rate from 7.67% down to 5.96%, and barely budged happiness from 5.48 to 5.64 on a 7-point scale" ... Isn't that intentionally misleading? Sure, 0.16 points doesn't sound like much... but it's on a seven-point scale. If we change that to a 3-point scale it's only about 0.07 points! Amazingly small! ... but wait, if I change that to a 900,000-point scale, well, then that's a whole 20,571 points difference. HUGE NUMBERS!
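
      The rescaling arithmetic is easy to check. A minimal sketch (the helper name is invented for the example; the 5.48 and 5.64 figures are the ones quoted above):

```python
def rescale_difference(diff: float, old_points: float, new_points: float) -> float:
    """Express a difference measured on one scale in the units of another."""
    return diff * new_points / old_points


if __name__ == "__main__":
    diff = 5.64 - 5.48  # the quoted change in happiness on a 7-point scale
    print(rescale_difference(diff, 7, 3))        # same change on a 3-point scale
    print(rescale_difference(diff, 7, 900_000))  # same change on a 900,000-point scale
    print(diff / 7)                              # fraction of the scale, unit-free
```

      Whatever the scale, the change is the same fraction of the full range (a bit over 2%), which is the number that does not depend on the choice of units.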

      But I think they missed a really important point - SPSS (one of the very popular data analysis packages) offers you a huge range of correlation tests, and you are _supposed_ to choose the one that best matches the data. Each has its own assumptions, and will only provide the correct 'p' value if the data matches those assumptions.

      For example, many of the tests require that the data follow a bell-shaped curve, and you are supposed to first test your data to ensure that it is normally distributed before using any of the correlation tests that assume normally distributed data. If you don't, you risk over-stating the correlation.

      If you have data from a Likert scale, you should treat it as ordinal (ranked) data, not numerical (i.e. the difference between "totally disagree" and "somewhat disagree" should not be assumed to be the same as the difference between "somewhat disagree" and "totally agree") - however, if you aren't getting to the magic p<0.05 treating it as ordinal data, you can usually get it over the line by treating it as numerical data and running a different correlation test.

      Lecturers are measured on how many papers they publish, and most peer reviewers don't know the subtle differences between these tests, so as long as they see 'SPSS said p<0.05' and they don't disagree with any of the content of your paper, yay, you get published.

      Finally, many of the tests have a minimum sample size below which they should not be used. If you only have a study of 300 people, there's a whole range of popular correlation tests that you are not supposed to use. But you do, because SPSS makes it easy, because it gets better results, because you forgot what the minimum size was and can't be arsed looking it up (if it's a real problem the reviewers will point it out).

      (Evidence to support these statements can be found in the "Survey Researcher's SPSS Cookbook" by Mark Manning and Don Munro. Obviously, it doesn't go into how you can choose an incorrect test to 'hack the p value'; to prove that, I recommend you download a copy of SPSS and take a short-term position as a lecturer's assistant.)

    2. Re:Oblig XKCD by HiThere · · Score: 2

      A lot of times it is. Remember, you don't only need to count the comparisons that you do, but also that of everyone else studying the same problem. ONE of you is likely to find a result by pure coincidence.

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
    3. Re:Oblig XKCD by Rich0 · · Score: 2

      That's a different problem. Like this one, it's not a problem with p-values, it's a problem with people who don't know what a p-value is. The examples in the comic are NOT p-values for the experiment that was done. Properly calculated p-values do not have this problem because they are corrected for multiple comparisons.

      Agree completely, but the problem is that to an outside observer it is impossible to know how many comparisons were actually done.

      If you design an experiment to handle 20 comparisons and perform 20 comparisons you'll get meaningful data. However, that design will probably tell you that you need to collect 50x as many data points as you have money to pay for. So, instead you design an experiment that can handle one comparison, then you still perform 20 comparisons, and then you publish the one that showed something interesting.
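
      One common way to "design for 20 comparisons" is a Bonferroni correction, sketched below. This is only one of several possible corrections and is picked here for simplicity; the function names and the p-values are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison threshold that keeps the overall error rate at or below alpha."""
    return alpha / n_comparisons


def survives_correction(p_values, alpha: float = 0.05):
    """Flag which p-values stay significant once every comparison is counted."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical study: 20 comparisons, one of which "showed something interesting".
    p_values = [0.03] + [0.4] * 19
    print(bonferroni_threshold(0.05, 20))      # threshold each test must beat
    print(any(survives_correction(p_values)))  # does the 0.03 result hold up?
```

      The stricter per-test threshold is also why the honest design asks for so many more data points: detecting the same effect at a much smaller threshold needs a larger sample.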

      That was why there was a big push a bunch of years ago to have clinical trial designs (including endpoints) published before the start of trials - it gets rid of the ability to cherry-pick the hypothesis after the fact. From what I've read that experiment hasn't really been a smashing success.

    4. Re:Oblig XKCD by ceoyoyo · · Score: 2

      Then you're not a statistician. There's a reason data mining is a dirty word in science.

      Before you start working you need to have a hypothesis like "women are shorter on average than men." You then find or collect a dataset of height measurements from a sample of women and men and do a test on the means. That gives you a p-value, which is what you report. If you do it in two separate datasets, you get two p-values and you report both. You don't correct either one for multiple comparisons, and whoever is reading your paper sees that you did an experiment and then replicated it. If both showed a significant difference your evidence is stronger. If they conflicted, it is weaker. If you're actually good at stats you can combine the two with Bayes's theorem and find out quantitatively how much stronger or weaker.

      What you're describing is, yes, how a lot of poor research happens. Your hypothesis is something like "there is a difference between these two groups (I hope)", which isn't a proper hypothesis at all. Then you go fishing.

      The difference is that in the first case you're testing the same thing, that you planned to look at in advance, in one or more datasets. In the second you're testing a bunch of different things, in one or more datasets, and if you find one that works you'll then claim a discovery. In the first case, in random data, if your threshold is alpha=0.05 you won't get more than 1 in 20 positive results, and everyone will see that. In the second, you expect to find a difference in 1 in 20 experiments; the multiple positive results don't strengthen each other because they're testing different things.
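
      The contrast can be simulated. The sketch below assumes there is no real effect anywhere, so each test's p-value is uniformly distributed between 0 and 1, and assumes the tests are independent; the function name, the number of simulated studies and the seed are arbitrary choices for the example.

```python
import random


def chance_of_a_discovery(n_tests_per_study: int, alpha: float = 0.05,
                          n_studies: int = 20_000, seed: int = 0) -> float:
    """Fraction of null studies that report at least one 'significant' test."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # With no real effect, every p-value is a uniform draw on [0, 1).
        if any(rng.random() < alpha for _ in range(n_tests_per_study)):
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    print(chance_of_a_discovery(1))   # one planned test per study
    print(chance_of_a_discovery(20))  # fishing through 20 tests per study
```

      With one planned test the false "discovery" rate stays near alpha; with 20 attempts per study most studies of pure noise find something to report.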

      In either case, if you lie and say you only did one test, you're committing fraud.

  3. And this is why by geekoid · · Score: 3, Insightful

    it takes more than 1 study.
    There is a push to have studies include Bayesian Probability.

    IMHO all papers should be read by statisticians just to be sure the calculations are correct.

    --
    The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    1. Re:And this is why by ceoyoyo · · Score: 3, Informative

      Other way around, and not quite true for a properly formulated hypothesis.

      Frequentist statistics involves making a statistical hypothesis, choosing a level of evidence that you find acceptable (usually alpha=0.05) and using that to accept or reject it. The statistical hypothesis is tied to your scientific hypothesis (or it should be). If the standard of evidence is met or exceeded, the results support the hypothesis. If not, they don't mean anything.

      HOWEVER, if you specify your hypothesis well, you include a minimum difference that you consider meaningful. You then calculate error bars for your result and, if they show that your measured value is less than the minimum you hypothesized, that's evidence supporting the negative (not the null) hypothesis: any difference is so small as to be meaningless.
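
      A minimal sketch of that check, assuming a normally distributed estimate and a 95% interval (z = 1.96). The function names and the numbers are hypothetical, and this is in the spirit of an equivalence check rather than a full procedure.

```python
def confidence_interval(diff: float, std_error: float, z: float = 1.96):
    """Normal-approximation interval (the 'error bars') around a measured difference."""
    half_width = z * std_error
    return diff - half_width, diff + half_width


def smaller_than_meaningful(diff: float, std_error: float,
                            min_meaningful: float, z: float = 1.96) -> bool:
    """True if the whole interval lies inside (-min_meaningful, +min_meaningful)."""
    low, high = confidence_interval(diff, std_error, z)
    return -min_meaningful < low and high < min_meaningful


if __name__ == "__main__":
    # Hypothetical: differences smaller than 1.0 units are considered meaningless.
    print(smaller_than_meaningful(0.2, 0.1, min_meaningful=1.0))  # tight error bars
    print(smaller_than_meaningful(0.2, 0.5, min_meaningful=1.0))  # wide error bars
```

      The same measured difference gives evidence for "no meaningful effect" only when the error bars are tight enough to exclude the minimum difference in both directions.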

      I am not a fan of everyone using Bayesian techniques. A Bayesian analysis of a single experiment that gives a normally distributed measurement (which is most of them) with a non-informative prior is generally equivalent to a frequentist analysis. Since scientists already have trouble doing simple frequentist tests correctly, they do not need to be doing needless Bayesian analyses.

      As for informative priors, I don't think they should ever be used in the report of a single experiment. Report the p-value and the Bayes factor, or the equivalent information needed to calculate them. Since an informative prior is inherently subjective, the reader should be left to make up his own mind what it is. Reporting the Bayes factor makes meta-analyses, where Bayesian stats SHOULD be mandatory, easier.

    2. Re:And this is why by ColdWetDog · · Score: 2

      No, all researchers should be able to pass graduate level statistics courses.

      Yes, I realize that most of us would be back at flipping hamburgers or worse, end up going to law school. But to understand what you're doing, you really need to understand statistics.

      Einstein was basically a very good statistician.

      --
      Faster! Faster! Faster would be better!
    3. Re:And this is why by serviscope_minor · · Score: 2

      If the standard of evidence is met or exceeded, the results support the hypothesis. If not, they don't mean anything.

      No: I disagree slightly. If you get a bad P value, then it means that the data is unlikely to have come from that hypothesis. If the data is sufficiently unlikely given the hypothesis, then this is generally read as meaning that the hypothesis is unlikely. That's often applied to the null hypothesis, e.g. there is no difference between X and Y, but some numbers computed show that it's unlikely that undifferentiated data would lead to those results.

      E.g. you could compute the mean and variance of two datasets, compute the variance on the mean and find the means are very far apart given their variances. This would indicate strongly that the data have different means.
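
      That calculation might be sketched as follows. It uses a normal approximation for the p-value, which is an assumption (a t-distribution would be more careful for samples this small), and the function name and data are made up.

```python
import math
import statistics


def compare_means(a, b):
    """Return (z, two-sided p) for the difference between two sample means."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Variance of each sample mean = sample variance / sample size.
    var_mean_a = statistics.variance(a) / len(a)
    var_mean_b = statistics.variance(b) / len(b)
    z = (mean_a - mean_b) / math.sqrt(var_mean_a + var_mean_b)
    # Probability of a difference at least this large if the true means are equal.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


if __name__ == "__main__":
    group_x = [10, 11, 12, 13, 14]  # hypothetical measurements
    group_y = [1, 2, 3, 4, 5]
    print(compare_means(group_x, group_y))
```

      A large |z| (small p) says the data would be surprising if the two means were equal; it does not by itself say which alternative explanation is right.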

      The thing is a good P value gives not support, but the weaker statement that the data is not inconsistent with the hypothesis.

      You could for example claim that the data in the above example was actually drawn from some rather exotic distribution. The statistical test might indicate it's not inconsistent with your hypothesis. However it doesn't say more than that.

      The hypothesis not being inconsistent with the data is the absolute minimum standard for proposing a model.

      --
      SJW n. One who posts facts.
    4. Re:And this is why by ceoyoyo · · Score: 2

      "If you get a bad P value, then it means that the data is unlikey to have come from that hypothesis."

      Insignificant p-values don't mean anything beyond "my standard of evidence was not met." It's a common mistake. Suppose you get p=0.1. That's pretty universally considered non-significant. But what it actually means is that there's a 90% chance (assuming no prior information and no screwups) that the alternative hypothesis is true. That's a long way from "the data is unlikely to have come from that hypothesis."

      Even (especially) when the p-values get very large, you can't draw any meaningful conclusions. A p-value of 0.99 could mean that the null hypothesis is very likely, OR that you simply have too much unexplained variance and too small a sample. You have to draw out the confidence intervals and find out. Only in the former case, provided the maximum likely effect is smaller than what you consider relevant, do you have evidence against the hypothesis. In the latter case you have evidence only that you need to improve your model, measurements and/or collect more data.

      You seem to have the rest backwards. You often collect some pilot data (or use someone else's) and then propose a model, but that's not testing your model. Evidence for or against a model comes from data collected AFTER you've proposed it. You generate hypotheses and design experiments to try to show the model is incorrect, collect the data, and test it. If the data doesn't fit, you discard the model. If it does fit, it counts as evidence in its favour. That's not the minimum standard, it is the standard.

  4. Misleading statistics by cold+fjord · · Score: 3, Insightful

    There is no shortage of misleading statistics out there. It can be a discipline fraught with peril for the uninformed, and there are lots of statistics packages out there that reduce advanced tests to a "point and shoot" level of difficulty that produces results that may not mean what the user thinks they mean. I've read some articles showing no lack of problems in the social sciences, but the problem is bigger than that.

    I can't help wondering how much that plays into the oscillating recommendations that you see for various foods. Both coffee and eggs have gone through repeated cycles of, "it's bad," "no, it's good," "no, it's bad," "no, it's good." I understand that at least some of it is coming down to the aspect they choose to measure, but I can't help but wonder how much bad statistics is playing into it.

    --
    much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
    1. Re:Misleading statistics by geekoid · · Score: 4, Insightful

      Not a lot.

      Eggs are a good example.
      They were 'bad' because they had high cholesterol.
      Science moved on, and it turns out there are different kinds of cholesterol, some 'good', some 'bad', so now eggs aren't as unhealthy as was thought.

      Same with many things.

      The media is the issue. It can't report science worth a damn.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    2. Re:Misleading statistics by tlhIngan · · Score: 4, Insightful

      Eggs are a good example.
      They were 'bad' because they had high cholesterol.
      Science moved on, and it turns out there are different kinds of cholesterol, some 'good', some 'bad', so now eggs aren't as unhealthy as was thought.

      Fats, too. It was deemed that fats were bad for you, so instead of butter, use margarine. Better yet, skip the fats, period. Bad for you.

      Of course, it was also discovered that hydrogenation had a nasty habit of turning unsaturated fats into different isomeric forms - "cis" and "trans". And guess what? The "trans" form of the fat is really, really, really bad for you (yes, that's the same "trans" in trans fats). Suddenly butter wasn't such an unreasonable option anymore compared with margarine, as margarine had to undergo hydrogenation.

      Not to mention the effort to go "low fat" has had nasty side effects of its own - the overuse of sugar and salt to replace the taste that fats had, resulting in even worse health problems (obesity, heart disease) than just having the fat to begin with.

      (And no, banning trans fats doesn't mean they ban "yummy stuff" - there's plenty of fats you can cook with to still get the "yummy" without all the trans fats.)

    3. Re:Misleading statistics by cold+fjord · · Score: 2

      Not to mention the effort to go "low fat" has had nasty side effects of its own - the overuse of sugar and salt to replace the taste that fats had, resulting in even worse health problems (obesity, heart disease) than just having the fat to begin with.

      To that you can add lower bioavailability of various nutrients that are fat soluble.

      --
      much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
    4. Re:Misleading statistics by sexconker · · Score: 2

      Butter good.
      Salt good.
      Sugar good.
      Meat good.
      Flour good.
      Sedentary bad.

  5. Reminds me of the Bible Code controversy by sideslash · · Score: 4, Interesting

    The world is full of coincidental correlations waiting to be rationalized into causal relationships.

  6. Gold Standard? by radtea · · Score: 4, Interesting

    That means "outmoded and archaic", right?

    I realize I have a p-value in my .sig line and have for a decade, but p-values were a mediocre way to communicate the plausibility of a claim even in 2003. They are still used simply because the scientific community--and even more so the research communities in some areas of the social sciences--are incredibly conservative and unwilling to update their standards of practice long after the rest of the world has passed them by.

    Everyone who cares about epistemology has known for decades that p-values are a lousy way to communicate (im)plausibility. This is part and parcel of the Bayesian revolution. It's good that Nature is finally noticing, but it's not as if papers haven't been published in ApJ and similar journals since the '90s with curves showing the plausibility of hypotheses as positive statements.

    A p-value is the probability of the data occurring given the null hypothesis is true, and which in the strictest sense says nothing about the hypothesis under test, only the null. This is why the value cited in my .sig line is relevant: people who are innocent are not guilty. This rare case where there is an interesting binary opposition between competing hypotheses is the only one where p-values are modestly useful.

    In the general case there are multiple competing hypotheses, and Bayesian analysis is well-suited to updating their plausibilities given some new evidence (I'm personally in favour of biased priors as well.) The result of such an analysis is the plausibility of each hypothesis given everything we know, which is the most anyone can ever reasonably hope for in our quest to know the world.
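
    As a toy illustration of that kind of update (the three coin-bias hypotheses, the equal priors and the function name are all invented for the example):

```python
def bayes_update(priors: dict, likelihoods: dict) -> dict:
    """Posterior plausibility of each hypothesis: prior * likelihood, renormalised."""
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalised.values())
    return {h: weight / total for h, weight in unnormalised.items()}


if __name__ == "__main__":
    # Three competing hypotheses about a coin's probability of heads.
    biases = {"tails-heavy": 0.3, "fair": 0.5, "heads-heavy": 0.7}
    beliefs = {name: 1.0 / len(biases) for name in biases}  # equal priors (assumed)

    # New evidence arrives one flip at a time: heads, heads, tails.
    for flip_is_heads in (True, True, False):
        likelihoods = {name: (p if flip_is_heads else 1.0 - p)
                       for name, p in biases.items()}
        beliefs = bayes_update(beliefs, likelihoods)
    print(beliefs)
```

    Every hypothesis keeps a plausibility after each piece of evidence; nothing is accepted or rejected at a threshold.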

    [Note on language: I distinguish between "plausibility"--which is the degree of belief we have in something--and "probability"--which I'm comfortable taking on a more-or-less frequentist basis. Many Bayesians use "probability" for both of these related but distinct concepts, which I believe is a source of a great deal of confusion, particularly around the question of subjectivity. Plausibilities are subjective, probabilities are objective.]

    --
    Blasphemy is a human right. Blasphemophobia kills.
  7. Sudoku teaches all by evanh · · Score: 2

    I learnt the uselessness of statistics for guidance of correctness when trying to reduce my effort required at Sudoku. I've since discovered the best way to win is not to play. Doesn't stop me trying though!

  8. P-hacking. by khasim · · Score: 3, Insightful

    From TFA:

    Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. "P-hacking," says Simonsohn, "is trying multiple things until you get the desired result" - even unconsciously.

  9. Misconceptions by vdorie · · Score: 2

    A few folk here have commented using incomplete or inaccurate definitions of p-values. A p-value is the probability of finding new data as extreme as or more extreme than the data you observed, assuming a null hypothesis is true. A couple of salient criticisms not mentioned in the article are a) why should more extreme data be lumped in with what was observed and b) what if "new" data can't sensibly be obtained.
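
    To make the definition concrete, here is a small sketch for coin flips, where the null hypothesis is a fair coin and "more extreme" means "at least as many heads". The function name and the 8-of-10 example are illustrative assumptions, and only the one-sided tail is computed.

```python
import math


def binomial_tail_p(heads: int, flips: int, p_null: float = 0.5) -> float:
    """P(at least `heads` heads in `flips` flips) if the null hypothesis is true."""
    return sum(
        math.comb(flips, k) * p_null ** k * (1.0 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


if __name__ == "__main__":
    # Observed 8 heads in 10 flips: sum the chances of 8, 9 and 10 heads.
    print(binomial_tail_p(8, 10))
```

    The sum over 8, 9 and 10 heads is exactly the "lumping in" of more extreme outcomes that criticism (a) is about.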

    In a less technical sense, what the article didn't get into so much is that there is a strong publication bias towards results that are significant (i.e. small p-values), to the point where you need <0.05 to even consider submitting. Some key reading: http://www.stanford.edu/~neilm/qjps.pdf. The short version is to not believe it when the news says that "recent research shows...".

    Personally, I wait for evidence to accumulate before, say, changing my diet. And if you really want to get it right, dig through the literature yourself. Some of my saddest moments have come from statistics consulting where mostly people come to you looking for permission to run an inappropriate analysis, not to understand their data or fit the "right" model. They want to get published, and that's just how things are done.

  10. The Earth Is Round (p < 0.05) by DVega · · Score: 4, Insightful
    There is a classic article by Jacob Cohen on this subject.

    Also there is a simpler analysis of the above article

    --
    MOD THE CHILD UP!
  11. Torturing the data by floobedy · · Score: 4, Informative

    One variant of "p-hacking" is "torturing the data", or performing the same statistical test over and over again, on slightly different data sets, until you get the result that you want. You will eventually get the result you want, regardless of the underlying reality, because on average there is 1 spurious result for every 20 statistical tests you perform (at p=0.05).
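
    The arithmetic behind "eventually" is short. This sketch assumes the repeated tests are independent, which slightly different data sets are usually not, so treat it as an approximation; the function names are made up.

```python
def expected_spurious_results(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of false positives when nothing real is going on."""
    return n_tests * alpha


def prob_at_least_one_spurious(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent null tests comes out significant."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n_tests in (1, 5, 20, 100):
        print(n_tests,
              expected_spurious_results(n_tests),
              round(prob_at_least_one_spurious(n_tests), 3))
```

    Under that assumption, 20 tries give about a 64% chance of at least one spurious "hit", and the chance keeps climbing with every extra subset examined.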

    I remember one amusing example, which involved a researcher who claimed that a positive mental outlook increases cancer survival times. He had a poorly-controlled study demonstrating that people who keep their "mood up" are more likely to survive longer if they have cancer. When other researchers designed a larger, high-quality study to examine this phenomenon, it found no effect. Mood made no difference to survival time.

    Then something interesting happened. The original researcher responded by looking for subsets of the data from the large study, to find any sub-groups where his hypothesis would be confirmed. He ended up retorting that "keeping a positive mental outlook DID work, according to your own data, for 35-45 year-old East Asian females (p<0.05)" [...] even if the p value was 0.05.

    This kind of thing crops up all the time.