Slashdot Mirror


Why P-values Cannot Tell You If a Hypothesis Is Correct

ananyo writes "P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. Critically, they cannot tell you the odds that a hypothesis is correct. A feature in Nature looks at why, if a result looks too good to be true, it probably is, despite an impressive-seeming P value."
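
A minimal sketch of why a P value is not "the odds that a hypothesis is correct", using Bayes' theorem. The prior plausibilities, the 80% power and the 0.05 threshold are illustrative assumptions, not figures taken from the Nature feature, and `prob_effect_is_real` is a hypothetical helper name.

```python
def prob_effect_is_real(prior, power, alpha):
    """P(effect is real | test came out significant), by Bayes' theorem.

    prior: assumed probability, before the study, that the effect is real
    power: assumed probability of a significant result when the effect is real
    alpha: probability of a significant result when there is no effect
    """
    true_positives = prior * power
    false_positives = (1.0 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Same significance threshold, different assumed priors.
for prior in (0.5, 0.1, 0.01):
    print(prior, round(prob_effect_is_real(prior, power=0.8, alpha=0.05), 3))
```

Under these assumptions, a significant result for a hypothesis with a 10% prior is right only about 64% of the time, and a long shot with a 1% prior remains more likely false than true.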

124 comments

  1. TLDR by Anonymous Coward · · Score: 0

    It's a P value.

  2. Re:Ooo! First post? by Anonymous Coward · · Score: 2, Funny

    Don't worry, with the way beta is going you'll soon have first post on -every- post :)

  3. Oblig XKCD by c++0xFF · · Score: 5, Informative

    http://xkcd.com/882/

    Even the example of p=0.01 from the article is subject to the same problem. That's why the LHC worked for something like 6 sigma before declaring the Higgs boson to be discovered. Even then, there's always the chance, however remote, that statistics fooled them.
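
    A rough sketch of the arithmetic behind the comic, assuming 20 independent tests at a 0.05 threshold with no real effect anywhere. The function names are made up for illustration.

```python
import random


def familywise_error(alpha, n_tests):
    """Chance of at least one false positive among independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_familywise_error(alpha, n_tests, n_trials, seed=0):
    """Monte Carlo check of the same quantity."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Under a true null hypothesis a p-value is uniform on [0, 1).
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_trials


print(familywise_error(0.05, 20))
print(simulate_familywise_error(0.05, 20, n_trials=20_000))
```

    Both numbers should land near 0.64, so a "significant" colour somewhere in the batch is the expected outcome rather than a surprise.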

    1. Re:Oblig XKCD by Anonymous Coward · · Score: 0

      Yeah, but multiple comparisons (which is what is going on in the xkcd) is hardly the major problem.

    2. Re:Oblig XKCD by xQx · · Score: 5, Informative

      While I agree with the article's headline/conclusion - They aren't innocent of playing games themselves:

      Take their sentence: "meeting online nudged the divorce rate from 7.67% down to 5.96%, and barely budged happiness from 5.48 to 5.64 on a 7-point scale" ... Isn't that intentionally misleading? Sure, 0.16 points doesn't sound like much... but it's on a seven point scale. If we change that to a 3 point scale it's only about 0.07 points! Amazingly small! ... but wait, if I change that to a 900,000 point scale, well, then that's a whole 20,571 points difference. HUGE NUMBERS!
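
      The rescaling arithmetic can be checked with a few lines. This is only a sketch; `rescale` is a hypothetical helper, and the two happiness scores are the ones quoted from the article.

```python
def rescale(points, old_max, new_max):
    """Express a difference measured on one scale in the units of another."""
    return points * new_max / old_max


difference = 5.64 - 5.48  # happiness gap on the original 7-point scale

for scale in (3, 7, 900_000):
    print(scale, rescale(difference, 7, scale))

# The share of the scale covered by the gap does not depend on the units.
print(difference / 7)
```

      Whatever scale is chosen, the gap is the same fraction of the full range (a little over 2%), which is the number that cannot be dressed up.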

      But I think they missed a really important point - SPSS (one of the very popular data analysis packages) offers you a huge range of correlation tests, and you are _supposed_ to choose the one that best matches the data. Each has its own assumptions, and will only provide the correct 'p' value if the data matches those assumptions.

      For example, many of the tests require that the data follow a bell-shaped curve, and you are supposed to first test your data to ensure that it is normally distributed before using any of the correlation tests that assume normally distributed data. If you don't, you risk over-stating the correlation.

      If you have data from a Likert scale, you should treat it as ordinal (ranked) data, not numerical (i.e. the difference between "totally disagree" and "somewhat disagree" should not be assumed to be the same as the difference between "somewhat disagree" and "totally agree") - however, if you aren't getting to the magic p<0.05 treating it as ordinal data, you can usually get it over the line by treating it as numerical data and running a different correlation test.

      Lecturers are measured on how many papers they publish, most peer reviewers don't know the subtle differences between these tests, so as long as they see 'SPSS said p<0.05' and they don't disagree with any of the content of your paper, yay, you get published.

      Finally, many of the tests have a minimum sample size below which they should not be used. If you only have a study of 300 people, there's a whole range of popular correlation tests that you are not supposed to use. But you do, because SPSS makes it easy, because it gets better results, because you forgot what the minimum size was and can't be arsed looking it up (if it's a real problem the reviewers will point it out).

      (Evidence to support these statements can be found in the "Survey Researcher's SPSS Cookbook" by Mark Manning and Don Munro. Obviously, it doesn't go into how you can choose an incorrect test to 'hack the p value'; to prove that, I recommend you download a copy of SPSS and take a short-term position as a lecturer's assistant.)

    3. Re:Oblig XKCD by ceoyoyo · · Score: 1

      That's a different problem. Like this one, it's not a problem with p-values, it's a problem with people who don't know what a p-value is. The examples in the comic are NOT p-values for the experiment that was done. Properly calculated p-values do not have this problem because they are corrected for multiple comparisons.
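
      One simple way to do that correction is the Bonferroni adjustment, sketched below. It is named here as one common choice, not necessarily the one any given paper uses, and the raw p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


def significant_after_correction(p_values, alpha=0.05):
    """True for each test that is still significant after the adjustment."""
    return [p_adjusted < alpha for p_adjusted in bonferroni(p_values)]


# Hypothetical batch: one apparent "hit" among 20 comparisons.
raw_p_values = [0.04] + [0.5] * 19
print(any(significant_after_correction(raw_p_values)))
```

      With 20 comparisons a raw p of 0.04 becomes an adjusted 0.8, so the lone "hit" no longer counts as a finding.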

    4. Re:Oblig XKCD by HiThere · · Score: 2

      A lot of times it is. Remember, you don't only need to count the comparisons that you do, but also that of everyone else studying the same problem. ONE of you is likely to find a result by pure coincidence.

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
    5. Re:Oblig XKCD by Rich0 · · Score: 2

      That's a different problem. Like this one, it's not a problem with p-values, it's a problem with people who don't know what a p-value is. The examples in the comic are NOT p-values for the experiment that was done. Properly calculated p-values do not have this problem because they are corrected for multiple comparisons.

      Agree completely, but the problem is that to an outside observer it is impossible to know how many comparisons were actually done.

      If you design an experiment to handle 20 comparisons and perform 20 comparisons you'll get meaningful data. However, that design will probably tell you that you need to collect 50x as many data points as you have money to pay for. So, instead you design an experiment that can handle one comparison, then you still perform 20 comparisons, and then you publish the one that showed something interesting.

      That was why there was a big push a bunch of years ago to have clinical trial designs (including endpoints) published before the start of trials - it gets rid of the ability to cherry-pick the hypothesis after the fact. From what I've read that experiment hasn't really been a smashing success.

    6. Re:Oblig XKCD by stenvar · · Score: 1

      It's not a problem with being "fooled by statistics". If they applied the statistics wrong or made some other error, six sigma is no better than two sigma if there is something wrong with the underlying assumptions. (Not saying that there is anything wrong with the Higgs experiment.)

      The only real protection against this sort of thing is to have many different research groups repeat an experiment independently and analyze it many different ways.

    7. Re:Oblig XKCD by stenvar · · Score: 1

      Lecturers are measured on how many papers they publish, most peer reviewers don't know the subtle differences between these tests

      Nobody really "knows the subtle difference between the tests" because nobody really knows what the actual distribution of the data is. In addition, the same way people shop for statistical tests, they shop for experimental procedures, samples, and all other aspects of an experiment, until they get the result they want.

    8. Re:Oblig XKCD by stenvar · · Score: 1

      Properly calculated p-values do not have this problem because they are corrected for multiple comparisons.

      You can't correct for multiple comparisons because you don't know about all the experiments other people have been doing; you'd have to know that in order to do that correction properly. It's very hard even to count the number of comparisons you have been doing yourself, because it's not just the number of times you've run the test.

    9. Re:Oblig XKCD by ceoyoyo · · Score: 1

      It's also impossible to tell if the other guy made the whole thing up. Fraud is detected via replication. Generally though, it's pretty easy to detect people doing it inadvertently - they publish all their "p-values." I suppose if people actually get better at doing stats some of the inadvertent stuff will turn into harder to detect fraud.

      Clinical trials are actually required to be pre-registered with one of a few tracking agencies if they are to be accepted by the FDA and other similar agencies. There are a few problems, but it's much better than it used to be.

    10. Re:Oblig XKCD by ceoyoyo · · Score: 1

      You don't need to correct for experiments done on other datasets. If it's multiple experiments on the same dataset, whoever is in charge of that dataset should be keeping track. Even then, you only need to correct for multiple tests of the same, or similar, hypothesis. That's mostly a problem when your hypothesis is "something happened," which it should never be, but it is all too often.

    11. Re:Oblig XKCD by sFurbo · · Score: 1

      It is quite often one of the problems of medical papers. People are different, and you can always find another way to split up the groups. What if you only look at males? Aged 20-30? Who eat a moderate amount of broccoli? Furthermore, there are a lot of diseases that people can have, so if you are doing an observational study, even if eating multivitamins doesn't change the incidence of cancer in 20-30 year old males who eat a moderate amount of broccoli, what about the incidence of ear cancer? In the right ear?

    12. Re:Oblig XKCD by Anonymous Coward · · Score: 0

      Or like studies showing that taking fish oil supplements doesn't help much because many researchers are using rancid fish oil.

      Or troll articles linking high fat diets to depression in mice when it's actually a "high fat and sugar" diet. http://blogs.scientificamerican.com/scicurious-brain/2012/05/02/high-fat-diets-and-depression-a-look-in-mice/

      Too much crappy research around. Part of it is incompetence, but I think a lot of it is just so they can get $$$$. Too much crappy "journalism" around. Part of it is incompetence, but I think a lot of it is trolling just so they can get $$$$.

      It's probably too hard nowadays to do an expensive 10 year high quality study. Much easier to do five 2 year crappy studies.

    13. Re:Oblig XKCD by Sique · · Score: 1

      But let's say you are a statistician working with hundreds of datasets and mining them for some interesting data. If you are looking for p < 0.05, about every 20th attempt will yield something "significant", even if all datasets are white noise, and you don't use any dataset twice.
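
      A small simulation of that scenario, as a sketch only: each "dataset" is two groups of pure Gaussian noise compared with a z-test that assumes the variance is known to be 1. The group size, dataset count and function names are arbitrary choices for illustration.

```python
import math
import random
import statistics


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def noise_experiment_p(rng, n=30):
    """p-value for a difference in means between two groups of white noise."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    standard_error = math.sqrt(2.0 / n)  # both groups have known unit variance
    z = (statistics.fmean(a) - statistics.fmean(b)) / standard_error
    return two_sided_p(z)


def fraction_significant(n_datasets=2000, alpha=0.05, seed=1):
    """Share of fresh noise datasets that come out 'significant'."""
    rng = random.Random(seed)
    hits = sum(noise_experiment_p(rng) < alpha for _ in range(n_datasets))
    return hits / n_datasets


print(fraction_significant())
```

      The printed fraction should sit close to 0.05: roughly one "discovery" per twenty datasets, none of them real.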

      --
      .sig: Sique *sigh*
    14. Re:Oblig XKCD by petermgreen · · Score: 1

      Agree completely, but the problem is that to an outside observer it is impossible to know how many comparisons were actually done.

      And even if the researchers are being honest about the number of comparisons THEY did it is very likely that multiple people will be independently working on the same problem. If all of those people individually only publish positive results then you get the same problem.

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
    15. Re:Oblig XKCD by Anonymous Coward · · Score: 0

      'There are a few problems, but it's much better than it used to be.'
      Considering how many problems there have been that's a bit of damning with faint praise. ;)

    16. Re:Oblig XKCD by Rich0 · · Score: 1

      Clinical trials are actually required to be pre-registered with one of a few tracking agencies if they are to be accepted by the FDA and other similar agencies. There are a few problems, but it's much better than it used to be.

      My concern isn't with the trials that are pre-submitted, and then the results are submitted to the FDA. My concern is with the trials that are pre-submitted and then the results are never published.

      If you can do that, then there really is no benefit of pre-submission. Just pre-submit 100 trials, then take the 5 good ones and publish them.

      Granted, I don't know how many of those trials that don't get published actually pertain to drugs that get marketed. If a company abandons a drug entirely during R&D I'm not sure if there is any real public harm if they don't publish the gory details, as long as they don't try to turn around and use it for something else later (without then publishing the prior results).

    17. Re:Oblig XKCD by ceoyoyo · · Score: 1

      Human trials are incredibly expensive. You don't just do 100 and take the best one. Also, various people, including regulatory agencies, are wise to that. If you have a bunch of registered trials without results it's going to be looked at with suspicion.

      There have been cases of abuse, but pharma generally doesn't want to outright fake results. Developing drugs is expensive, but getting sued into oblivion is too.

    18. Re:Oblig XKCD by ceoyoyo · · Score: 2

      Then you're not a statistician. There's a reason data mining is a dirty word in science.

      Before you start working you need to have a hypothesis like "women are shorter on average than men." You then find or collect a dataset of height measurements from a sample of women and men and do a test on the means. That gives you a p-value, which is what you report. If you do it in two separate datasets, you get two p-values and you report both. You don't correct either one for multiple comparisons, and whoever is reading your paper sees that you did an experiment and then replicated it. If both showed a significant difference your evidence is stronger. If they conflicted, it is weaker. If you're actually good at stats you can combine the two with Bayes's theorem and find out quantitatively how much stronger or weaker.
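
      One way to picture the "combine the two with Bayes's theorem" step, assuming each dataset has been summarized as a Bayes factor in favour of the hypothesis. The 50% prior and the Bayes factors below are invented for illustration.

```python
def posterior_probability(prior, bayes_factors):
    """Update a prior probability with independent Bayes factors.

    Works in odds form: posterior odds = prior odds * product of Bayes factors.
    """
    odds = prior / (1.0 - prior)
    for factor in bayes_factors:
        odds *= factor
    return odds / (1.0 + odds)


# Two datasets that agree (each favours the hypothesis 4 to 1).
print(posterior_probability(0.5, [4.0, 4.0]))

# Two datasets that conflict (4 to 1 for, then 4 to 1 against).
print(posterior_probability(0.5, [4.0, 0.25]))
```

      Agreeing replications multiply the odds, while a conflicting one cancels the first and leaves you back at the prior.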

      What you're describing is, yes, how a lot of poor research happens. Your hypothesis is something like "there is a difference between these two groups (I hope)", which isn't a proper hypothesis at all. Then you go fishing.

      The difference is that in the first case you're testing the same thing, that you planned to look at in advance, in one or more datasets. In the second you're testing a bunch of different things, in one or more datasets, and if you find one that works you'll then claim a discovery. In the first case, in random data, if your threshold is alpha=0.05 you won't get more than 1 in 20 positive results, and everyone will see that. In the second, you expect to find a difference in 1 in 20 experiments; the multiple positive results don't strengthen each other because they're testing different things.

      In either case, if you lie and say you only did one test, you're committing fraud.

    19. Re:Oblig XKCD by stenvar · · Score: 1

      You don't need to correct for experiments done on other datasets.

      Really? So you're saying that if I have a hypothesis that I'm desperate to prove, I can just keep trying one data set after another until I find one for which I get a statistically significant result and that p-value doesn't need to be corrected? Think that through. In fact, I can even test completely different hypotheses each time, only one hypothesis applied to only one data set ever, and things still fail in the same way.

      There is no correction that you can apply that makes p-values any more meaningful than they would be without correction. A p-value really simply is a convenient scale on which to report research results; a single high significance result by itself means little more than no result at all. You need many independent replications of an experiment, each with high significance p-value outcomes, in order to actually demonstrate a scientific hypothesis to be true.

    20. Re:Oblig XKCD by Anonymous Coward · · Score: 0

      If you are looking for p < 0.05, about every 20th attempt will yield something "significant", even if all datasets are white noise, and you don't use any dataset twice.

      Then you calculated the p value wrong for the type of analysis you did. Of course if you do the statistics wrong, the statistics can give you the wrong result. You could argue that it is easy to make a mistake with p values if you don't know what you are doing, but not that the p value inherently is the source of that problem.

    21. Re:Oblig XKCD by Anonymous Coward · · Score: 0

      Read the post you are replying to and the one that was replying to. The original claim was you can't possibly correct for things because you can't know what experiments other people are doing, and the post you are replying to is saying you don't need to correct for what other people are doing with other datasets. That doesn't imply that you can keep searching the same data over and over again with different tests. Yes, you need to calculate things correctly based on your entire search, but not for other people's search if that is not included.

    22. Re:Oblig XKCD by Sique · · Score: 1

      How do you tell that the hypothesis was first, and the result from the dataset came later, if you read the study?

      --
      .sig: Sique *sigh*
  4. And this is why by geekoid · · Score: 3, Insightful

    it takes more than 1 study.
    There is a push to have studies include Bayesian Probability.

    IMHO all papers should be read be statisticians just to be sure the calculation are correct.

    --
    The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    1. Re:And this is why by Wootery · · Score: 1

      IMHO all papers should be read be statisticians just to be sure the calculation are correct.

      Slashdot doesn't offer a "too much confidence" rating.

      Also your 'calculation' is wrong: should be 'calculations'.

    2. Re:And this is why by serviscope_minor · · Score: 1

      I'm a big fan of Bayesian techniques. However, statistical tests still have their place if they're not misused. The trouble is that they are deeply misunderstood and terribly badly used.

      All they do is tell you if a hypothesis is probably incorrect. You can use them to refute a hypothesis, but not support one.

      --
      SJW n. One who posts facts.
    3. Re:And this is why by ceoyoyo · · Score: 3, Informative

      Other way around, and not quite true for a properly formulated hypothesis.

      Frequentist statistics involves making a statistical hypothesis, choosing a level of evidence that you find acceptable (usually alpha=0.05) and using that to accept or reject it. The statistical hypothesis is tied to your scientific hypothesis (or it should be). If the standard of evidence is met or exceeded, the results support the hypothesis. If not, they don't mean anything.

      HOWEVER, if you specify your hypothesis well, you include a minimum difference that you consider meaningful. You then calculate error bars for your result and, if they show that your measured value is less than the minimum you hypothesized, that's evidence supporting the negative (not the null) hypothesis: any difference is so small as to be meaningless.
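
      A minimal sketch of that "minimum meaningful difference" check, assuming a normal approximation (1.96 standard errors for a 95% interval). The function names, the labels and the choice of threshold are invented for illustration.

```python
import math
import statistics


def difference_interval(a, b, z=1.96):
    """Approximate 95% confidence interval for mean(a) - mean(b)."""
    difference = statistics.mean(a) - statistics.mean(b)
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return difference - z * standard_error, difference + z * standard_error


def classify(a, b, min_meaningful):
    """Compare the interval with the smallest difference considered meaningful."""
    low, high = difference_interval(a, b)
    if -min_meaningful < low and high < min_meaningful:
        return "negligible"  # whole interval lies inside the 'too small to matter' band
    if low > 0 or high < 0:
        return "difference detected"  # interval excludes zero
    return "inconclusive"  # too wide to say either way


big_a = [10.0, 10.1, 9.9] * 20
big_b = [x + 0.01 for x in big_a]
print(classify(big_a, big_b, min_meaningful=0.5))

small_a = [9.0, 11.0, 10.0]
small_b = [10.5, 8.5, 11.5]
print(classify(small_a, small_b, min_meaningful=0.5))
```

      Both comparisons fail to reach significance, but only the first, with its narrow interval, is evidence that any difference is too small to matter; the second just needs more data.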

      I am not a fan of everyone using Bayesian techniques. A Bayesian analysis of a single experiment that gives a normally distributed measurement (which is most of them) with a non-informative prior is generally equivalent to a frequentist analysis. Since scientists already have trouble doing simple frequentist tests correctly, they do not need to be doing needless Bayesian analyses.

      As for informative priors, I don't think they should ever be used in the report of a single experiment. Report the p-value and the Bayes factor, or the equivalent information needed to calculate them. Since an informative prior is inherently subjective, the reader should be left to make up his own mind what it is. Reporting the Bayes factor makes meta-analyses, where Bayesian stats SHOULD be mandatory, easier.

    4. Re:And this is why by ColdWetDog · · Score: 2

      No, all researchers should be able to pass graduate level statistics courses.

      Yes, I realize that most of us would be back at flipping hamburgers or worse, end up going to law school. But to understand what you're doing, you really need to understand statistics.

      Einstein was basically a very good statistician.

      --
      Faster! Faster! Faster would be better!
    5. Re:And this is why by Rich0 · · Score: 1

      it takes more than 1 study.

      That only works if ALL the studies actually get published. If the labs only write up the studies with "interesting" results then there really is no difference between cherry-picking 1 trial out of 100 and cherry-picking 10 trials out of 1000.

    6. Re:And this is why by serviscope_minor · · Score: 2

      If the standard of evidence is met or exceeded, the results support the hypothesis. If not, they don't mean anything.

      No: I disagree slightly. If you get a bad P value, then it means that the data is unlikely to have come from that hypothesis. If the data is sufficiently unlikely given the hypothesis, then this is generally read as meaning that the hypothesis is unlikely. That's often applied to the null hypothesis, e.g. there is no difference between X and Y, but some numbers computed show that it's unlikely that undifferentiated data would lead to those results.

      E.g. you could compute the mean and variance of two datasets, compute the variance on the mean and find the means are very far apart given their variances. This would indicate strongly that the data have different means.

      The thing is, a good P value gives not support, but the weaker statement that the data is not inconsistent with the hypothesis.

      You could for example claim that the data in the above example was actually drawn from some rather exotic distribution. The statistical test might indicate it's not inconsistent with your hypothesis. However it doesn't say more than that.

      The hypothesis not being inconsistent with the data is the absolute minimum standard for proposing a model.
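
      The mean-and-variance comparison mentioned above can be sketched as follows. It uses a normal approximation (a t-test would be the more careful choice for samples this small), and the two datasets are made up.

```python
import math
import statistics


def z_for_difference(a, b):
    """Difference in means divided by its standard error."""
    variance_of_mean_a = statistics.variance(a) / len(a)
    variance_of_mean_b = statistics.variance(b) / len(b)
    difference = statistics.mean(a) - statistics.mean(b)
    return difference / math.sqrt(variance_of_mean_a + variance_of_mean_b)


def two_sided_p(z):
    """Probability of a statistic at least this extreme if the means are equal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical measurements for groups X and Y.
x = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9]
y = [5.6, 5.4, 5.5, 5.7, 5.3, 5.5, 5.6, 5.4]

z = z_for_difference(x, y)
print(z, two_sided_p(z))
```

      Here the means are more than seven standard errors apart, so "no difference between X and Y" is hard to sustain for this toy data.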

      --
      SJW n. One who posts facts.
    7. Re:And this is why by ceoyoyo · · Score: 2

      "If you get a bad P value, then it means that the data is unlikely to have come from that hypothesis."

      Insignificant p-values don't mean anything beyond "my standard of evidence was not met." It's a common mistake. Suppose you get p=0.1. That's pretty universally considered non-significant. But what it actually means is that there's a 90% chance (assuming no prior information and no screwups) that the alternative hypothesis is true. That's a long way from "the data is unlikely to have come from that hypothesis."

      Even (especially) when the p-values get very large, you can't draw any meaningful conclusions. A p-value of 0.99 could mean that the null hypothesis is very likely, OR that you simply have too much unexplained variance and too small a sample. You have to draw out the confidence intervals and find out. Only in the former case, provided the maximum likely effect is smaller than what you consider relevant, you have evidence against the hypothesis. In the latter case you have evidence only that you need to improve your model, measurements and/or collect more data.

      You seem to have the rest backwards. You often collect some pilot data (or use someone else's) and then propose a model, but that's not testing your model. Evidence for or against a model comes from data collected AFTER you've proposed it. You generate hypotheses and design experiments to try to show the model is incorrect, collect the data, and test it. If the data doesn't fit, you discard the model. If it does fit, it counts as evidence in its favour. That's not the minimum standard, it is the standard.

    8. Re:And this is why by PingPongBoy · · Score: 1

      No, all researchers should be able to pass graduate level statistics courses.

      Yes, I realize that most of us would be back at flipping hamburgers or worse, end up going to law school. But to understand what you're doing, you really need to understand statistics.

      Einstein was basically a very good statistician.

      The null hypothesis is that the world would not be a better place if researchers had such advanced training in statistics.

      For example, there was no physical evidence favoring relativity for years after the theory was created. Qualification for research work should be intelligence not ability to do statistics.

      --
      Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.
    9. Re:And this is why by serviscope_minor · · Score: 1

      Insignificant p-values don't mean anything beyond "my standard of evidence was not met."

      Not at all. It's the probability that given the model, a statistic of that value or greater will be obtained with that many samples. If the p value is really small, that means that such a statistic would occur in 1 in N runs.

      Once the P value gets small enough, one can consider it sufficiently unlikely that the null hypothesis generated the data that you no longer believe the null hypothesis to be credible.

      If the P-value is not very small, it means you don't have enough evidence to conclude that the null hypothesis is false.
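
      That definition can be made concrete with a coin-flip sketch: the model is a fair coin, the statistic is the number of heads, and the p-value is the chance of that many heads or more. The 60-out-of-100 figure and the function name are hypothetical.

```python
from math import comb


def binomial_p_value(heads, flips, p_null=0.5):
    """One-sided p-value: P(X >= heads) when X ~ Binomial(flips, p_null)."""
    return sum(
        comb(flips, k) * p_null**k * (1.0 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


p = binomial_p_value(60, 100)
print(p)

# "1 in N runs": how often a fair coin would do at least this well.
print(1.0 / p)
```

      The result is a statement about how often the model would produce such data, not about how likely the model itself is.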

      A p-value of 0.99 could mean that the null hypothesis is very likely, OR that you simply have too much unexplained variance and too small a sample.

      No: it's not that the null hypothesis is likely, it's that the data is not inconsistent with the hypothesis. That's an important distinction because the data can be consistent with many, different hypotheses. So, it's not that the hypothesis is likely, it's that there's not enough evidence to say it's definitely not true. ...

      I know how people operate, but that doesn't affect what the maths behind the statistics and p-values means.

      --
      SJW n. One who posts facts.
  5. Misleading statistics by cold+fjord · · Score: 3, Insightful

    There is no shortage of misleading statistics out there. It can be a discipline fraught with peril for the uninformed, and there are lots of statistics packages out there that reduce advanced tests to a "point and shoot" level of difficulty that produces results that may not mean what the user thinks they mean. I've read some articles showing no lack of problems in the social sciences, but the problem is bigger than that.

    I can't help wondering how much that plays into the oscillating recommendations that you see for various foods. Both coffee and eggs have gone through repeated cycles of, "it's bad," "no, it's good," "no, it's bad," "no, it's good." I understand that at least some of it is coming down to the aspect they choose to measure, but I can't help but wonder how much bad statistics is playing into it.

    --
    much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
    1. Re:Misleading statistics by geekoid · · Score: 4, Insightful

      Not a lot.

      Eggs is a good example.
      They were 'bad' because they had high cholesterol.
      Science moved on, and it turns out there are different kinds of cholesterol, some 'good' some 'bad' so now eggs aren't as unhealthy as was thought.

      Same with many things.

      The media is the issue. It can't report science worth a damn.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    2. Re:Misleading statistics by tlhIngan · · Score: 4, Insightful

      Eggs is a good example.
      They were 'bad' because they had high cholesterol.
      Science moved on, and it turns out there are different kinds of cholesterol, some 'good' some 'bad' so now eggs aren't as unhealthy as was thought.

      Fats, too. It was deemed that fats were bad for you, so instead of butter, use margarine. Better yet, skip the fats period. Bad for you.

      Of course, it was also discovered that hydrogenation had a nasty habit of converting unsaturated fats from their natural "cis" form into a different geometric isomer, the "trans" form. And guess what? The "trans" form of the fat is really, really, really bad for you (yes, that's the same "trans" in trans fats). Suddenly butter wasn't such an unreasonable option compared with margarine, as margarine had to undergo hydrogenation.

      Not to mention the effort to go "low fat" has had nasty side effects of its own - the overuse of sugar and salt to replace the taste that fats had, resulting in even worse health problems (obesity, heart disease) than just having the fat to begin with.

      (And no, banning trans fats doesn't mean they ban "yummy stuff" - there's plenty of fats you can cook with to still get the "yummy" without all the trans fats.)

    3. Re:Misleading statistics by Anonymous Coward · · Score: 0

      Both coffee and eggs have gone through repeated cycles of, "it's bad," "no, it's good," "no, it's bad," "no, it's good."

      Once you get to the point of "it's bad" or "it's good", chances are you're not dealing with the original science but some writer's personal opinion. Journal articles on food don't say things like "It's bad" or "It's bad with p-value...", they say things more like, "X causes Y with correlation Z" or "X is a factor in Y." Most researchers are pretty aware that foods have both good and bad effects, and it is difficult to pigeon-hole each thing into a particular category, short of saying "In moderate amounts, it won't do much," for various definitions of moderate.

    4. Re:Misleading statistics by cold+fjord · · Score: 2

      Not to mention the effort to go "low fat" has had nasty side effects of its own - the overuse of sugar and salt to replace the taste that fats had, resulting in even worse health problems (obesity, heart disease) than just having the fat to begin with.

      To that you can add lower bioavailability of various nutrients that are fat soluble.

      --
      much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
    5. Re:Misleading statistics by ebno-10db · · Score: 1

      Eggs is a good example. They were 'bad' because they had high cholesterol.

      That was even worse than bad statistics. There were no statistics because there was no data. Once it was figured out that high cholesterol levels were bad, somebody just assumed that the cholesterol content of the foods you ate had a significant effect on your cholesterol levels. It doesn't. A bad guess became gospel for years. So much for scientific medicine.

    6. Re:Misleading statistics by sexconker · · Score: 2

      Butter good.
      Salt good.
      Sugar good.
      Meat good.
      Flour good.
      Sedentary bad.

    7. Re:Misleading statistics by tirerim · · Score: 1

      Mostly the problem with nutritional studies is the impossibility of doing large scale, long term, controlled trials. See http://www.nytimes.com/2014/02...

    8. Re:Misleading statistics by floobedy · · Score: 1

      Unfortunately, scientists studying nutrition face an ethical conundrum. They feel they must publish (and publicize) preliminary results because it might save lives. Suppose there's fairly good (but not extremely strong) reason to think that eggs are bad for you. Shouldn't you publicize that result? If you don't, millions of people could die needlessly. If you wait until the results are really certain (or at least more certain), then you have denied people the benefit of preliminary information.

      Bear in mind that diseases like atherosclerosis develop over decades. It would take decades (and it would be unethical besides) to assign people to different dietary groups, control everything perfectly, and see who drops dead of heart disease. Since those studies can't be done, the results we do have are frequently preliminary or merely suggestive.

      Eggs were bad for you because they contain cholesterol, and some people's arteries are clogged with exactly that substance. A few scientists made a leap--let's not consume a lot of exactly the substance which is clogging your arteries.

      Unfortunately, that was wrong.

      These days more publicizers provide tentative wording to suggest that a result is preliminary. For example, there is a campaign in California to get people to eat more nuts. There are signs paid for by the state which say "research suggests but does not yet prove that eating nuts can reduce your chance of a heart attack" and so on. At least that's a step in the right direction, IMO.

    9. Re:Misleading statistics by ceoyoyo · · Score: 1

      The biggest problem with the way most people do statistics is that they don't have adequate statistical reasoning skills. The problem is in the design of experiments and analyses, before you ever get to punching the buttons in your stats package of choice. The differences you get from punching the wrong button are really very minor compared to things that happen all the time, like drawing conclusions based on tests you didn't do (the difference of differences error is an excellent example: half of all high impact neuroscience papers that can make this mistake do so).

      The wild world of nutrition recommendations is mostly the way it is because it's all made up. The scientific evidence amounts to "get the basic amounts of macro and micro-nutrients to avoid disease", "eat a variety of foods", "eat vegetables" and a bunch of very basic, and specific things. Those basic and specific results are then wildly extrapolated, mostly by talk show hosts, celebrities, and people looking to cash in on gullible dieters.

    10. Re:Misleading statistics by Capsaicin · · Score: 1

      So much for scientific medicine.

      You don't mean that we ... gulp ... know more about medicine now than we did 10 years ago? And that we will ... teeth chatter ... might know more in another 10 than we do today?! That's it, no more Western medicine for me ---it's sweat tents all the way.

      Seriously though, we make guesses based on current knowledge ... some turn out to be bad. Folks do some empirical work, stats show the guess was wrong, we move on. Or we use invalid stats, statisticians complain, we clean up our act. We move on. That is scientific Medicine.

      --
      Better to be despised for too anxious apprehensions, than ruined by too confident a security. --Edmund Burke
    11. Re:Misleading statistics by sjames · · Score: 1

      Evidence based medicine should never make a recommendation based on a guess. A guess should lead to a study which (once repeated) should lead to a recommendation.

      Remember the low salt craze? Turns out that it is very helpful for a small subset of high blood pressure sufferers and useless to the rest of us.

      That's why a bunch of people basically ignore medical/health recommendations entirely now. Too many flip-flops and most of them suggest we only eat food that tastes almost as good as the container it comes in.

    12. Re:Misleading statistics by Capsaicin · · Score: 1

      Evidence based medicine should NEVER make a recommendation based on a guess. A guess should lead to a study which (once repeated) should lead to a recommendation.

      Don't be such an insufferably purist twonk. It's all guesses, your recommendation based on repeated studies (employing P-values no doubt) included. If it's evidence based it's guesses with some sort of evidentiary basis.

      Remember the low salt craze?

      You mean the one based on numerous repeated studies? For which see:

      • Tuomilehto J, Jousilahti P, Rastenyte D, et al. Urinary sodium excretion and cardiovascular mortality in Finland. Lancet. 2001;357(9259):848-851
      • Nagata C, Takatsuka N, Shimizu N, Shimizu H. Sodium intake and risk of death from stroke in Japanese men and women. Stroke
      • Umesawa M, Iso H, Date C, et al; JACC Study Group. Relations between dietary sodium and potassium intakes and mortality from cardiovascular disease. Am J Clin Nutr. 2008;88(1):195-202
      • He J, Ogden LG, Vupputuri S, Bazzano LA, Loria C, Whelton PK. Dietary sodium intake and subsequent risk of cardiovascular disease in overweight adults. JAMA. 1999;282(21):2027-2034

      Turns out ...

      ... rather unsurprisingly that --as with most things --there's an optimal (but as yet not established) intake of sodium and potassium salts. (Notwithstanding the therapeutic use of salt-restriction with hypertensive patients.) Fall too far on either side and you risk adverse CV effects. Golly!

      But that is beside the point. The point is that even with studies our scientific knowledge is never perfect. We would still be trepanning patients, or failing to mention that opiate use, smoking or overeating may have adverse health consequences, were we to wait for the level of certitude you require.

      Meanwhile here on Earth it turns out that the advance of Medicine of the last century means that a woman dying in childbirth is considered a rare event, while diseases that were a death sentence when I was a child are now curable. And all the while the ingrates in the peanut gallery, armed with their naive purist philosophies of science and their 2nd option bias take pot-shots at the people who do real work advancing science and who may, shock horror, on occasion be wrong.

      --
      Better to be despised for too anxious apprehensions, than ruined by too confident a security. --Edmund Burke
    13. Re:Misleading statistics by sjames · · Score: 1

      The smoking thing was based on actual research and observation, not a guess. The same is true of the cures for those once life threatening diseases. The big push to get everyone to give up salt was a wild guess extrapolation from those studies you pointed out (and it turns out, an unjustifiable extrapolation). Another of those unjustifiable extrapolations got millions to give up more or less harmless (in moderation) butter and consume very harmful transfats instead. Actual research and observation eventually corrected the recommendations, but a great many people died early first.

      Trepanation was another such wild guess. That sure worked out well!

      Meanwhile, I note that "Urinary sodium excretion and cardiovascular mortality in Finland. " points out right in the summary that "Second, these data lend further credence to the notion that, for at least half the population, sodium intake is not associated with health outcomes." So yeah, if you read that and yell OMG everyone stop using salt now! you have not only made an unjustifiable extrapolation, you failed to even read the paper.

      Yes, studies can be wrong, but that is quite different from just making wild guesses and extrapolations and acting as if it has anything to do with science.

      If actually following the scientific method is excessively purist, what do you suggest? The magic 8-ball perhaps?

    14. Re:Misleading statistics by Capsaicin · · Score: 1

      The smoking thing was based on actual research and observation, not a guess.

      That research was very controversial you know. Before we interfere with people's freedoms, shouldn't we be ironclad certain? I mean evidence based medicine should never make a recommendation based on a guess.

      If actually following the scientific method is excessively purist, what do you suggest?

      There's no such thing as "the scientific method." That went out with Francis Bacon. Actual science is simply not that purist, that's the point. In science and especially in medicine you must make guesses. Not "wild" guesses, but educated ones, guesses based ultimately on research.

      If you read that and yell OMG everyone stop using salt now ...

      Who said stop using salt now? And why on earth would anyone pay them the least bit of attention, even if their claims had been backed up by numerous studies with glowing P-values? It wouldn't seem to fit with how we know the body to function (i.e. you kinda need Na and K ...). It is a basic principle of science, is it not, that where the data disagrees with theory, question the data.

      Anyway, I thought people whose diets were high in salt were being told to lower their salt intake (not eliminate it), which advice is probably still good.

      Trepanation was another such wild guess.

      Was it? I would have thought it was evidence based, but I doubt we'll ever know.

      Yes, studies can be wrong, but that is quite different from just making wild guesses and extrapolations and acting as if it has anything to do with science.

      Obviously. However what I originally wrote, that you objected to was: "[W]e make guesses based on current knowledge ... some turn out to be bad. Folks do some [more] empirical work, stats show the guess was wrong, we move on" [emphasis added]. So that observation is not all that pertinent. And if you re-read my statement you'll see that the discussion of the pros and cons of low salt intake is simply more grist to my mill.

      --
      Better to be despised for too anxious apprehensions, than ruined by too confident a security. --Edmund Burke
    15. Re:Misleading statistics by sjames · · Score: 1

      That research was very controversial you know.

      No, it actually wasn't. You fell for the industry shill research that didn't hold up to peer review.

      I'm going to stop now since the rest of your post is so far reversed from fact that it feels like a subtle troll. Really, suggesting question the data rather than the theory was too far over the top. You gave yourself away before even getting to your suggestion that trepanation might have had valid science behind it.

    16. Re:Misleading statistics by Capsaicin · · Score: 1

      No, it actually wasn't. You fell for the industry shill research that didn't hold up to peer review.

      Irony, my good man. Irony. (Would have thought my re-quoting you made that obvious.) And ironic too, in the other sense, that you should have taken everything else that was serious (and factual) in that post as "reversed from fact."

      Really, suggesting question the data rather than the theory was too far over the top.

      Now that was deadly serious. To quote British astrophysicist Arthur S. Eddington, to the same effect, "No experiment should be believed until it has been confirmed by theory."

      Less provocatively, as science commentator Scott Johnson explained on Ars Technica, "Scientists know that every study is imperfect or incomplete in some way and are especially skeptical of results that contradict—rather than build upon—the existing science."

      You gave yourself away before even getting to your suggestion that trepanation might have had valid science behind it.

      I never suggested that trepanation might have valid science behind it. Stop making things up! Or is this misunderstanding a manifestation of your naive philosophy of science?

      You gave yourself away when you used the phrase "the scientific method" by the way. Perhaps you should stop arguing and try instead to understand?

      --
      Better to be despised for too anxious apprehensions, than ruined by too confident a security. --Edmund Burke
    17. Re:Misleading statistics by Capsaicin · · Score: 1

      Oh and btw, I didn't "suggest" to "question the data rather than the theory" either. I suggested to question the data rather than theory (note the lack of a definite article).

      --
      Better to be despised for too anxious apprehensions, than ruined by too confident a security. --Edmund Burke
  6. Reminds me of the Bible Code controversy by sideslash · · Score: 4, Interesting

    The world is full of coincidental correlations waiting to be rationalized into causality relationships.

    1. Re:Reminds me of the Bible Code controversy by ceoyoyo · · Score: 1

      There's no such thing. A correlation test always comes with a p-value that gives you an idea of how likely your observations are to be a coincidence rather than a correlation.

    2. Re:Reminds me of the Bible Code controversy by RightwingNutjob · · Score: 1

      The world is also full of people who can't do math to save their life. Win-win.

    3. Re:Reminds me of the Bible Code controversy by colinrichardday · · Score: 1

      But the standard tests for correlation only supply the correct P-value if the data satisfy certain conditions, such as normally distributed errors, no errors in the independent variable, constant standard deviation of errors, and so on.

    4. Re:Reminds me of the Bible Code controversy by ceoyoyo · · Score: 1

      Any statistical test requires that you apply it appropriately. If you don't do so the result is called a "mistake," not a "coincidence" OR a "correlation".

    5. Re:Reminds me of the Bible Code controversy by Anonymous Coward · · Score: 0

      A p-value is not a measure of how likely your observation is to be a coincidence.

      It is the probability of observing your data or other data more extreme and not observed, given the modeling assumption that there is no relationship.

      The inference is typically then, when a very small p-value is found, that either something very unusual happened (you measured extreme data), or the assumption of no relationship is a poor assumption.

      There are a number of problems with that inference. In particular, it is a conclusion about the data sample, not about the hypothesis.

      Neyman and Pearson wrote that we won't be too often wrong if in the long run we act as if we believe there is a relationship when the p-value is less than the pre-chosen threshold alpha.
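
      A minimal sketch of that definition, assuming a permutation test as the stand-in for "the modeling assumption that there is no relationship". The function name and the two small data sets are invented for illustration and come from no study. The estimate is the share of random relabelings whose group difference is at least as extreme as the one actually observed:

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate P(difference at least as extreme as observed | no relationship)."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_shuffles):
        # Under "no relationship" the group labels are arbitrary, so shuffle them.
        rng.shuffle(pooled)
        a, b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        # Small tolerance so exact ties are not lost to floating-point rounding.
        if diff >= observed - 1e-9:
            extreme += 1
    return extreme / n_shuffles

# Hypothetical measurements, made up for this example only.
treated = [5.1, 6.3, 5.8, 6.9, 6.1, 5.7]
control = [4.9, 5.2, 5.0, 5.6, 4.7, 5.3]
p = permutation_p_value(treated, control)
print(f"estimated p-value: {p:.4f}")
```

      Note that the number describes how unusual the sample is under the no-relationship model. It is not the probability that the hypothesis itself is true or false.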

    6. Re:Reminds me of the Bible Code controversy by colinrichardday · · Score: 1

      The result might be a mistake, but that doesn't mean it will be called a mistake. Also, how well do introductory statistics texts explain this?

  7. Gold Standard? by radtea · · Score: 4, Interesting

    That means "outmoded and archaic", right?

    I realize I have a p-value in my .sig line and have for a decade, but p-values were a mediocre way to communicate the plausibility of a claim even in 2003. They are still used simply because the scientific community--and even more so the research communities in some areas of the social sciences--are incredibly conservative and unwilling to update their standards of practice long after the rest of the world has passed them by.

    Everyone who cares about epistemology has known for decades that p-values are a lousy way to communicate (im)plausibility. This is part and parcel of the Bayesian revolution. It's good that Nature is finally noticing, but it's not as if papers haven't been published in ApJ and similar journals since the '90s with curves showing the plausibility of hypotheses as positive statements.

    A p-value is the probability of the data occurring given the null hypothesis is true, and which in the strictest sense says nothing about the hypothesis under test, only the null. This is why the value cited in my .sig line is relevant: people who are innocent are not guilty. This rare case where there is an interesting binary opposition between competing hypotheses is the only one where p-values are modestly useful.

    In the general case there are multiple competing hypotheses, and Bayesian analysis is well-suited to updating their plausibilities given some new evidence (I'm personally in favour of biased priors as well.) The result of such an analysis is the plausibility of each hypothesis given everything we know, which is the most anyone can ever reasonably hope for in our quest to know the world.

    [Note on language: I distinguish between "plausibility"--which is the degree of belief we have in something--and "probability"--which I'm comfortable taking on a more-or-less frequentist basis. Many Bayesians use "probability" for both of these related but distinct concepts, which I believe is a source of a great deal of confusion, particularly around the question of subjectivity. Plausibilities are subjective, probabilities are objective.]

    --
    Blasphemy is a human right. Blasphemophobia kills.
    1. Re:Gold Standard? by geekoid · · Score: 1

      But publishing studies with Bayesian probability hasn't been the norm.
      It will be, and there is a move to do so.
      It's not fast, and it shouldn't be.

      "Many Bayesians use "probability" for both of these related but distinct concepts, which I believe is a source of a great deal of confusion"
      Sing it, brother!

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    2. Re:Gold Standard? by Anonymous Coward · · Score: 0

      Thank you for a wonderful and educational post!

    3. Re:Gold Standard? by narcc · · Score: 1

      +5 Funny.

    4. Re:Gold Standard? by TapeCutter · · Score: 1

      To be fair, the reason (credible) researchers go to great lengths to control all possible influences except the one under study is because of the binary nature of P.

      Other than that nit-pick - Nice post (and your sig makes more sense now :), I offer an interesting article about stats, chaos, stock-markets, and fish in appreciation.

      --
      And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
    5. Re:Gold Standard? by ceoyoyo · · Score: 1

      It shouldn't be, and I hope it never will be. If you use a non-informative prior for your Bayesian analysis in most cases you're just doing extra work to get the same result. If you use an informative prior you're colouring your results with your preconceptions. The reader, or the author of a meta-analysis, is the one who should be doing the Bayesian analysis, using their own priors. By all means, report a Bayes factor to make the meta-analysis easier, but also report a p-value, which is a good metric for a single experiment.

    6. Re:Gold Standard? by Anonymous Coward · · Score: 0

      You need to read ET Jaynes. A sound axiomatization for "subjective plausibility" yields a probability measure.

    7. Re:Gold Standard? by Anonymous Coward · · Score: 0

      Real-world probabilities are usually defined by statistics, and thus have issues because statistics are based on "finite sampling". In other words, some scientist can determine the "probability" of 3 cars turning right at some light by sitting there for a while and gathering statistics, but there is nothing that links that "statistical probability" to the real world, so statistics and probability are just guesses.
      I would add "possibility" to your list of concepts because many people confuse "not likely" with "not possible", "possible" with "likely", etc.

    8. Re:Gold Standard? by Anonymous Coward · · Score: 0

      The people in Fukushima probably confused "not likely" with "not possible" and thought they were safe...but anything that is possible, can (and eventually will) happen.

  8. Real world implications by martinux · · Score: 1

    Any researcher worth their salt states a p-value with enough additional information to understand if the p-value is actually meaningful. Anyone who looks at a paper and makes a conclusion based solely (or largely) off a p-value without thinking about how meaningful the results are from a clinical or real-world perspective is being lazy or reckless.

    I guess there are quite a few insightful XKCD strips but this one seems most apt, here: http://xkcd.com/552/

    1. Re:Real world implications by tgv · · Score: 1

      Then there are very, very few researchers worth their salt. Even then, it has been shown that a .05 significance under ideal conditions has a chance of being a coincidence of about 1/3. If we add to that the number of errors in the assumptions, the experiments, the unpublished studies, etc., .05 means nothing. I found the work by Jim Berger et al. interesting: http://www.stat.duke.edu/~berg...

  9. Sudoku teaches all by evanh · · Score: 2

    I learnt the uselessness of statistics for guidance of correctness when trying to reduce my effort required at Sudoku. I've since discovered the best way to win is not to play. Doesn't stop me trying though!

    1. Re:Sudoku teaches all by TapeCutter · · Score: 1

      If you're trying to solve sudoku puzzles with statistical analysis, you're doing it wrong.

      --
      And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
  10. Simple -- Correlation is NOT causality by redelm · · Score: 1

    p-value is just the probability the data/observations were the result of a random process. So a great p value like 0.01 says the results were not random. They do not confirm what made them non-random (i.e. theory).

    Epistemology is elementary, and often skipped by those who wish to persuade. "Figures do not lie, but liars figure."[Clemens]

    1. Re:Simple -- Correlation is NOT causality by guacamole · · Score: 1

      You're using a quite confusing/inaccurate terminology here. The p-value is the probability of observing a statistic that's at least as large (or extreme) as what has been computed from the sample under the assumption of the "null" or default hypothesis. p=0.01 means that if the null hypothesis is correct, then the probability of observing what you just observed "or worse" is just 1%. A low p-value does not mean that under the _alternative_ hypothesis your results are necessarily "non-random". Normally, the alternative still specifies some kind of probability model. This depends entirely on what your alternative is.

  11. Just look at that astrology survey... by Anonymous Coward · · Score: 0

    That article posted earlier claiming a lot of Americans think astrology is a science? It was a telephone survey with a sample size of two thousand people. Think that P-value proves the hypothesis is correct?

  12. P-hacking. by khasim · · Score: 3, Insightful

    From TFA:

    Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. "P-hacking," says Simonsohn, "is trying multiple things until you get the desired result" - even unconsciously.

  13. Covered before by Anonymous Coward · · Score: 0

    Dance of the p-values youtu.be/ez4DgdurRPg

  14. Misconceptions by vdorie · · Score: 2

    A few folk here have commented using incomplete or inaccurate definitions of p-values. A p-value is the probability of finding new data as or more extreme as data you observed assuming a null hypothesis is true. A couple of salient criticisms not mentioned in the article are a) why should more extreme data be lumped in with what was observed and b) what if "new" data can't sensibly be obtained.

    In a less technical sense, what the article didn't get into so much is that there is a strong publication bias towards results that are significant (i.e. small p-values), to the point where you need <0.05 to even consider submitting. Some key reading: http://www.stanford.edu/~neilm/qjps.pdf. The short version is to not believe it when the news says that "recent research shows...".

    Personally, I wait for evidence to accumulate before, say, changing my diet. And if you really want to get it right, dig through the literature yourself. Some of my saddest moments have come from statistics consulting where mostly people come to you looking for permission to run an inappropriate analysis, not understand their data or fit the "right" model. They want to get published, and that's just how things are done.

    1. Re:Misconceptions by Anonymous Coward · · Score: 0

      Instead of "finding new data," I would say "hypothetically replicating the study and obtaining data", but this is minor.

      The reason for lumping in more extreme data is exactly because that's what you're interested in; you're looking for the best evidence possible to disprove the null hypothesis. What is more important, that group A has a much higher IQ than group B, or that the estimated difference of the two groups' IQs is about 8.4? If a measured difference of 8.4 is enough to convince you, then (assuming constant sigma) 12.4 should be too, and 18.7 is even better, etc., and that is what the p-value measures: the probability of obtaining evidence at least as good as what you observed, since that evidence would (presumably) be just as powerful or better. In other words, no one sets out to do an experiment hoping that the measured difference will be exactly, say, 8.4; they just want it to be positive and far enough away from 0 to be interesting.

      One is typically interested in showing an effect and, while bounding type I error, not underestimating it.

    2. Re:Misconceptions by ceoyoyo · · Score: 1

      Um, no. Your criticisms don't make sense. You're falling into a misunderstanding that is perpetuated because theoretical statisticians are so careful about how they define things, particularly when Bayesians might be looking over their shoulders.

      A p-value is the probability that accepting your statistical hypothesis (rejecting the null hypothesis) would be an error. This is equivalent to saying that the p-value is the probability that, picking a random run out of many runs of your experiment, you'd expect to get your result or something more extreme, purely by chance.

    3. Re:Misconceptions by Paradigma11 · · Score: 1

      No, you are wrong, the op is correct. Especially: "A p-value is the probability that accepting your statistical hypothesis (rejecting the null hypothesis) would be an error." is wrong and "A p-value is the probability of finding new data as or more extreme as data you observed assuming a null hypothesis is true." is correct.

    4. Re:Misconceptions by ceoyoyo · · Score: 1

      Sigh. Know why they call the threshold for significance alpha?

      http://en.wikipedia.org/wiki/T...

    5. Re:Misconceptions by Paradigma11 · · Score: 1

      Yes. "A p-value is the probability that accepting your statistical hypothesis (rejecting the null hypothesis) would be an error." = P(H0|significant) but what the p-value in frequentist statistics gives you is P(significant|H0).
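
      To put rough numbers on the difference between the two conditionals, here is a small sketch using Bayes' rule. The 80% power figure and the prior probabilities of H0 are assumptions picked for illustration, and the function name is made up:

```python
def prob_null_given_significant(prior_null, alpha, power):
    """Bayes' rule: P(H0 | significant), given P(significant | H0) = alpha
    and P(significant | H1) = power."""
    sig_and_null = prior_null * alpha
    sig_and_alt = (1.0 - prior_null) * power
    return sig_and_null / (sig_and_null + sig_and_alt)

alpha = 0.05  # P(significant | H0), the frequentist threshold
power = 0.80  # P(significant | H1), an assumed value
for prior_null in (0.5, 0.9, 0.99):
    posterior = prob_null_given_significant(prior_null, alpha, power)
    print(f"P(H0) = {prior_null:.2f} -> P(H0 | significant) = {posterior:.3f}")
```

      With P(significant|H0) fixed at 0.05, P(H0|significant) still moves from about 0.06 to about 0.86 as the prior changes, so the first number cannot stand in for the second.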

  15. The Earth Is Round (p < 0.05) by DVega · · Score: 4, Insightful
    There is a classic article by Jacob Cohen on this subject.

    Also there is a simpler analysis of the above article

    --
    MOD THE CHILD UP!
  16. To quote one of my professors... by margeman2k3 · · Score: 1

    "A p-value of 0.05 means there's a 5% chance that your paper is wrong. In other words, 1 in 20 papers is bullshit."

    1. Re:To quote one of my professors... by Anonymous Coward · · Score: 0

      Your professor is bad and he should feel bad.

    2. Re:To quote one of my professors... by Anonymous Coward · · Score: 0

      A p-value of 0.05 means there's a 5% chance that your paper is wrong. In other words, 1 in 20 papers is bullshit.

      RTFA, it's much worse than your professor thinks. Specifically in their "long-shot" scenario a p-value of 0.05 means there is an 89% chance that your paper is wrong! Even in their "toss-up" scenario, a p-value of 0.05 means there is a 29% chance that it is wrong.

      In fact the point of TFA is to address the very error your professor is promulgating.
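
      One way to arrive at figures like these is sketched here under two assumptions of mine, neither quoted anywhere in this thread: that "toss-up" means prior odds of 1:1 and "long-shot" means 19:1 against a real effect, and that the evidence is scored with the -e * p * ln(p) lower bound on the Bayes factor from Sellke, Bayarri and Berger. The function names are made up.

```python
import math

def min_bayes_factor(p):
    """Lower bound on the Bayes factor in favour of the null, valid for p < 1/e."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("bound only applies for 0 < p < 1/e")
    return -math.e * p * math.log(p)

def min_prob_null(p, prior_odds_null):
    """Smallest posterior probability of the null under that bound."""
    posterior_odds = prior_odds_null * min_bayes_factor(p)
    return posterior_odds / (1.0 + posterior_odds)

# Assumed prior odds (null : real effect) for the two scenarios.
scenarios = {"toss-up": 1.0, "long-shot": 19.0}
for name, odds in scenarios.items():
    print(f"{name}: P(no real effect | p = 0.05) >= {min_prob_null(0.05, odds):.2f}")
```

      Under those assumptions the bound comes out at roughly 0.29 for the toss-up and 0.89 for the long shot. Because it is a lower bound, the true chance of being wrong can only be higher.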

    3. Re:To quote one of my professors... by ceoyoyo · · Score: 1

      No. If someone writes a paper and claims that "This is true because p < 0.05" then they are wrong regardless. The correct conclusion is "this result supports our hypothesis because p < the threshold we set for minimal evidence." The point the article makes is correct, but it's not a problem with p-values, it's a problem with the conclusions people draw from them. His professor is absolutely correct, supposing that the papers he's talking about aren't meaningless bullshit to start with.

    4. Re:To quote one of my professors... by guacamole · · Score: 1

      That's exactly what it means. This is why for the idea to be accepted or consensus to be reached, you need a lot more than one study.

    5. Re:To quote one of my professors... by Paradigma11 · · Score: 1

      "A p-value of 0.05 means there's a 5% chance that your paper is wrong. In other words, 1 in 20 papers is bullshit."

      This is complete bullshit. If you study something where the h1 is true then there is a 0% possibility to be wrong if you report significant findings.

    6. Re:To quote one of my professors... by zachie · · Score: 1

      Wrong, it is much, much worse than that.

      Imagine a body of scholars continuously producing wrong hypotheses. They test all of them. Your teacher correctly pointed out that one in twenty will have a p-value < 0.05. But they write papers only off these! In such a scenario, 100% of the papers are wrong.

      In other words, this 5% chance of a paper being bullshit is only a lower bound.
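
      A quick simulation of that thought experiment, as a sketch only. It assumes every study draws 20 observations from a standard normal with no real effect, is tested with a two-sided z-test with known sigma = 1, and is "published" only if p < 0.05. All of those choices are mine, made for illustration:

```python
import math
import random

def two_sided_p(sample):
    """Two-sided z-test p-value for mean = 0, assuming known sigma = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2.0))

rng = random.Random(1)
n_studies = 5000
published = 0
for _ in range(n_studies):
    # The null is true in every single study: the data are pure noise.
    sample = [rng.gauss(0.0, 1.0) for _ in range(20)]
    if two_sided_p(sample) < 0.05:
        published += 1  # only "significant" results get written up

share_published = published / n_studies
print(f"studies run: {n_studies}, published: {published} ({share_published:.1%})")
print("published papers reporting a real effect that exists: 0")
```

      About one study in twenty clears the bar, and by construction every one of those papers is a false positive.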

    7. Re:To quote one of my professors... by LateArthurDent · · Score: 1

      Exactly. However, that's not a difficult problem to solve. What the Nature article fails to address is the real problem: it's not easy to publish papers that do nothing but confirm the findings of another paper.

      The article talks about how a researcher had his dreams of being published dashed once he failed to achieve a similar p-value upon attempting to reproduce his own research. This is bullshit. Journals should be selective, yes. They should be selective in terms of whether experiments have been run with proper methodology, and whether the study supports the conclusions made by the author. They shouldn't be selective based on p-value. For proper science to occur, not only should said researcher have been able to publish his first low p-value study, he should have been able to publish his second high p-value study. Other researchers at different institutions should attempt to reproduce the work and publish their positive or negative findings. Only once dozens of said studies are performed can we actually start to draw a conclusion: if only 1 in 20 experiments show a p-value below 0.05, well that doesn't actually disprove the null-hypothesis, it's evidence for it.

      Even if the 5% chance of a paper being bullshit was an upper bound, that would still be a really plausible scenario. Replicating experiments are a fundamental part of science, but journals are only interested in unique experiments yielding positive results with low p-values. Either that or negative results replicating a particularly important paper that everyone takes for granted. Ideally, while grad students are still new and learning the ropes, their research should consist of replicating others' research and publishing the result, whatever it may be. It's the perfect job to get them started before they've had the chance to do significant research of their own, and it's incredibly valuable to the community at large.

  17. Obvious fact is obvious by Anonymous Coward · · Score: 0

    Of course p-values don't tell you that your hypothesis is correct.

    A p-value is the likelihood of getting a result as extreme as (or more extreme than) the one observed, assuming the null hypothesis is true. It has nothing to do with the probability of the truth of the alternative hypothesis.

  18. 95% CI by Anonymous Coward · · Score: 0

    is used by people who are aware of this well-known problem.

    1. Re:95% CI by ceoyoyo · · Score: 1

      A confidence interval is completely equivalent to a statement of p-value, mean and type of distribution. In fact, CIs are almost always calculated from that trio. It's just another way of showing the same information.

  19. Most commonly by msobkow · · Score: 1

    Correlation != Causation

    --
    I do not fail; I succeed at finding out what does not work.
    1. Re:Most commonly by Anonymous Coward · · Score: 0

      Irrelevant. The discussion is about whether or not there's a correlation in the first place.

  20. Lies! by Anonymous Coward · · Score: 0

    To rephrase a famous quote, "There are lies, damned lies, and then there are statistics - who ever heard of a statistician fudging the numbers?"

  21. Q values by drooling-dog · · Score: 1

    At least in bioinformatics, the correction of p-values for multiple comparisons ("q-values") has been standard practice for quite a while now.
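
    For anyone who hasn't seen such a correction, here is a minimal sketch of the Benjamini-Hochberg adjustment, which is often loosely called a q-value. As I understand it, Storey's q-value proper goes a step further and also estimates the share of true nulls, so treat this as a stand-in. The raw p-values are made up for illustration:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from eight comparisons on one data set.
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"raw p = {p:.3f}   adjusted = {q:.4f}")
```

    In this made-up example five of the raw values are below 0.05 but only two of the adjusted ones are.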

    1. Re:Q values by ceoyoyo · · Score: 1

      Please tell me they don't really call them 'q-values'?

      A p-value IS corrected for multiple comparisons. If you did multiple comparisons and you didn't correct it, it ain't a p-value. A good term for those would be "the result of my fishing expedition."

    2. Re:Q values by Paradigma11 · · Score: 1

      At least in bioinformatics, the correction of p-values for multiple comparisons ("q-values") has been standard practice for quite a while now.

      But then your beta-error goes through the roof and you won't find anything. Wouldn't it be far more efficient to repeat the significant experiments?

    3. Re:Q values by tgv · · Score: 1

      But with what correction? There isn't one correction for multiple comparisons, and they all have their problems. Just go Bayesian instead.

    4. Re:Q values by ceoyoyo · · Score: 1

      Bayesian statistics still needs multiple comparison correction, and it's usually more complicated. Bayesian stats lets you quantitatively combine the results from multiple tests of the same thing, usually in different datasets, i.e. different experiments testing the same hypothesis. If you're doing that with frequentist stats you don't need multiple comparison correction either.

      If you're testing multiple different things, usually in the same dataset, you need to do multiple comparison correction, Bayesian or otherwise.
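
      For readers who have not seen one of these corrections, here is a minimal sketch of the Benjamini-Hochberg procedure, one common false-discovery-rate adjustment. It is an assumption that this is the kind of correction meant by "q-values" above; Storey's q-value is a related but different estimate. The p-values are made up:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from eight tests on the same dataset.
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
adj = benjamini_hochberg(raw)
print("raw p < 0.05:     ", sum(p < 0.05 for p in raw))
print("adjusted p < 0.05:", sum(a < 0.05 for a in adj))
```

      Fewer results survive the adjustment than pass the raw threshold, which is the loss of power complained about above.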

  22. Beta sucks by Anonymous Coward · · Score: 1

    It really, really sucks.

  23. Torturing the data by floobedy · · Score: 4, Informative

    One variant of "p-hacking" is "torturing the data", or performing the same statistical test over and over again, on slightly different data sets, until you get the result that you want. You will eventually get the result you want, regardless of the underlying reality, because on average there is 1 spurious result for every 20 statistical tests you perform at a p=0.05 threshold when no real effect exists.

    I remember one amusing example, which involved a researcher who claimed that a positive mental outlook increases cancer survival times. He had a poorly-controlled study demonstrating that people who keep their "mood up" are more likely to survive longer if they have cancer. When other researchers designed a larger, high-quality study to examine this phenomenon, it found no effect. Mood made no difference to survival time.

    Then something interesting happened. The original researcher responded by looking for subsets of the data from the large study, to find any sub-groups where his hypothesis would be confirmed. He ended up retorting that "keeping a positive mental outlook DID work, according to your own data, for 35-45 year-old east asian females (peven if the p value was 0.05.

    This kind of thing crops up all the time.

    1. Re:Torturing the data by floobedy · · Score: 1

      Whoops. The post got garbled because slashdot wrongly interpreted the less-than sign as an html tag opening, and I didn't escape it. (Which seems like a bug to me. The text "p<0.05" is obviously not the beginning of an html tag, because no html tags accepted by slashdot begin with 0.05). Anyway, the offending paragraph should say:

      Then something interesting happened. The original researcher responded by looking for subsets of the data from the large study, to find any sub-groups where his hypothesis would be confirmed. He ended up retorting that "keeping a positive mental outlook DID work, according to your own data, for 35-45 year-old east asian females (p<0.05)"... How many statistical tests did he perform to reach that conclusion, while trying to rescue his hypothesis? If he ran more than 20 tests, then you would expect one spurious positive result just from random error, even though his p value was less than 0.05.
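
      The arithmetic behind that, as a short sketch. It assumes every subgroup test is independent and that there is no real effect in any of them, which is a simplification (subgroups of one study usually overlap); the function name is made up:

```python
def prob_any_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one p < alpha across n independent null tests."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 20, 60):
    expected = n * 0.05  # expected number of spurious "hits"
    print(f"{n:>3} tests: P(at least one hit) = "
          f"{prob_any_false_positive(n):.3f}, expected hits = {expected:.2f}")
```

      With 20 tests the expected number of spurious hits is exactly 1, and the chance of seeing at least one is well over half.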

  24. Statistics by stevez67 · · Score: 1

    Lies ... damned lies ... and statistics. The P value only tells you if there is statistical significance in the data, not whether your hypothesis is correct or incorrect.

  25. Nothing wrong with P values if they are applied co by Anonymous Coward · · Score: 0

    I think the article missed that major point. I think P values are fine when applied to the normal distribution; however, apparently some people are using the "empirical rule" to apply it to any distribution. If people configured their data collection to allow the "central limit theorem" to be used then the data would be normally distributed and P value limits would be fine.

    Nowhere does the article mention the normal distribution or the central limit theorem.

  26. Grammar Narcissim by Anonymous Coward · · Score: 0

    Eggs is a good example. They where 'bad' becasue they had high cholesterol.... The media s the issue. It's can report science worth a damn.

    Now they are bad because they impair language skills.

    Only joking, this is obviously the product of caffeine deprivation! Nothing to do with eggs at all ...

  27. Re: Nothing wrong with P values if they are applie by brickh0us3 · · Score: 1

    This was an interesting read on the subject, published by a medical doctor who apparently had a good stat background. Dr. Ioannidis has kind of been on a crusade to look at design issues through meta-analysis of old med studies that were suspect. http://www.ncbi.nlm.nih.gov/pm.... There has been some more recent work of his in the news as of late, I believe. I love the 1001 varying non-mathematical definitions of p values in this thread; it was cute.

  28. I don't get the hate for the new style by Anonymous Coward · · Score: 0

    It's not like the comments on Slashdot are worth much: Ars has far more mature and on the mark comments. It's like only kids are commenting on Slashdot.

    1. Re:I don't get the hate for the new style by udippel · · Score: 1

      It's like only kids are commenting on Slashdot.

      It takes a thief to catch a thief ...

  29. So much confusion, so little understanding by Anonymous Coward · · Score: 0

    The article is little more than an appeal to undefined conceptions of "plausibility" that presumably would be little more than generalized weighted probability functions. However, the article seems to rely more heavily upon being reasonably sure the reader is as confused as the author with respect to the more subtle technical details.

    P values reflect, given certain assumptions about how the data are sampled and the statistical distribution from which they are drawn, the probability of committing a type I statistical error (rejecting a true null hypothesis as false). Consequently, a P value of 0.05 suggests that such a result will manifest itself about 1 time in 20 by chance alone, assuming sampling independence and the nature of the probability distribution from which it is presumably drawn, usually a Gaussian one given that in most situations the sample mean of an unknown distribution will approach the true mean as the sample size tends to infinity. For most types of data, such as the agronomic data evaluated by Fisher, a p value of 0.05 was and remains a useful choice since it has been found generally useful or meaningful in the context within which it was used (to discuss the nature and existence of differences among plants and their growth rates). In other fields of study more stringent critical values are the norm given the nature and frequency of the expected outcomes of interest. Obviously, type I statistical errors are not the only type of statistical error, since it is also possible to accept a null hypothesis when it is in fact false (a type II error). However, given the law of the excluded middle, these two notions are not independent of one another.

    Much of the confusion in statistical literature and clearly on display in slashdot (no surprise there) stems from being unable to appropriately recognize what constitutes the null hypothesis and what constitutes a reasonable framework from which to decide what is the nature of the underlying distribution being tested, as well as the extent to which certain assumptions, such as the independence of samples and variates, are met and how they may affect p values. In some tests it is the measure of central tendency that is tested, whereas in others it is the homogeneity of variances. In more complicated designs more care must be afforded, just as in analyses of correlation or covariance, where it is the independence or lack thereof between variables that is being tested. In situations with multiple covariates, as in their block-design analogs, interaction effects must be controlled. Likewise, in cases of multiple comparisons one must account for the family-wise or experiment-wise error, since as one conducts more tests, the greater the chance of encountering an unlikely result by chance alone. Such considerations give rise to adjustments of critical p values, as in the case of the widely known Bonferroni adjustments. Likewise, the discriminatory power of different tests for a given sample size can also be an issue requiring consideration, not to mention specificity and sensitivity, which may be more important in certain settings, such as epidemiology.
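
    The Bonferroni adjustment mentioned above is simple enough to sketch in a few lines. This is an illustration with made-up p-values and a hypothetical function name: with m tests, each one is held to a threshold of alpha / m, so that the chance of any false positive across the whole family stays at or below alpha.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Flag a test as significant only if its p-value is <= alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from five comparisons in one experiment.
p_values = [0.003, 0.012, 0.04, 0.2, 0.6]
print("uncorrected:", [p <= 0.05 for p in p_values])
print("Bonferroni: ", bonferroni_reject(p_values))
```

    Three of the five comparisons pass the uncorrected threshold, but only the first survives the adjusted one.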

    It should also be kept in mind that most statistical procedures assume variates to be either continuous or discrete in their distribution and to arrive at an idea of the underlying topology of the space being sampled, typically a metric one, and often a Hilbert space. Should the phenomena of interest be pseudometric rather than metric in nature or living within more general topological spaces, then standard probability theory as derived from measure theory would have to be more delicately applied, if not abandoned entirely for non-parametric techniques. P-values, like everything else in science, must be appropriately interpreted within the context they are being discussed. However, once such considerations are appropriately made, a lower p-value will, as noted by Fisher, provide further confidence in ruling out chance as an explanation of a particular statistical outcome than a higher one. Alternatives, such as "plausibility", will need to prove themselves as viable.

  30. Re:The Earth Is Round (p < 0.05) by Anonymous Coward · · Score: 0

    The Earth is not round. It is an oblate spheroid.

  31. True by Anonymous Coward · · Score: 0

    Alas for the poor computer jockey.

  32. Re:The Earth Is Round (p < 0.05) by Anonymous Coward · · Score: 0

    The Earth is not round. It is an oblate spheroid.

    Actually, it's a sphere defined by the EGM96 coefficients.

  33. None of this will be solved by Bayesian methods by Anonymous Coward · · Score: 0

    I'm honestly very dismayed by the idea, rapidly gaining currency, that Bayesian methods will somehow solve all these problems.

    Bayesian methods have their own problems, namely what the prior actually is. In the linked article, for example, they suggest that the prior is trivially determined by any number of mechanisms. E.g., how do you interpret previous research to get the prior? Do you use that to get a prior? Do you use an objective prior, such as a reference prior? In reality, the choice of prior is left to the whims of the individual researcher, which becomes an entirely new mechanism with which researchers game the system.

    Bayesians (and I consider myself neither a Bayesian nor a frequentist, or both a Bayesian and a frequentist, depending on how you look at it, for reasons that are too complicated to get into here) would respond to this by saying "well, at least your prior is made explicit." But this is not actually true in practice typically, and the same argument could be made about frequentists (who explicitly have no prior).

    The article mentions the utility of meta-analysis in this context, to identify patterns of publication bias and so forth. But if everyone starts using Bayesian methods, you'll have an additional source of heterogeneity and bias to contend with in meta-analysis, namely the nature of the prior.

    The dirty secret of Bayesian inference is that it is biased--in fact, results in the literature show it is impossible to have an unbiased Bayesian estimator. This is well-documented, and there's a good reason for it: Bayesian methods take advantage of the fact that you can reduce overall estimation error by introducing bias, under the assumption that the magnitude of the bias is small enough to offset the variance of the estimator (overall estimation error is a function of the squared bias plus variance, so if you can introduce a procedure that increases squared bias but decreases variance even more, you'll decrease overall estimation error). However, this all assumes the only risk is due to random error, with honest experimenters, which is clearly false.
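
    A small worked example of that trade-off, under assumptions chosen for illustration only: n normal observations with known variance, a normal prior on the mean, and the posterior mean used as the Bayesian estimate. All names and numbers are hypothetical. The error terms are computed from the closed-form bias and variance rather than simulated:

```python
def mse_of_estimators(theta, mu0, tau2, sigma2, n):
    """Compare the sample mean with a normal-prior posterior mean.

    theta: true mean; mu0, tau2: prior mean and variance;
    sigma2: known data variance; n: sample size.
    Returns (MSE of sample mean, MSE of posterior mean, bias of posterior mean).
    """
    w = n * tau2 / (n * tau2 + sigma2)   # weight placed on the data
    sample_mean_mse = sigma2 / n         # unbiased, so MSE is pure variance
    bias = (1 - w) * (mu0 - theta)       # shrinkage pulls toward the prior mean
    variance = w**2 * sigma2 / n
    return sample_mean_mse, bias**2 + variance, bias


# Prior centred at 0; the truth is either near the prior or far from it.
for theta in (0.5, 5.0):
    freq_mse, bayes_mse, bias = mse_of_estimators(
        theta, mu0=0.0, tau2=1.0, sigma2=1.0, n=10)
    print(f"theta={theta}: sample-mean MSE={freq_mse:.4f}, "
          f"posterior-mean MSE={bayes_mse:.4f}, bias={bias:.4f}")
```

    When the prior is close to the truth the biased estimate has the lower error; when the prior is far off, the bias dominates and it does worse, which is why the choice of prior matters so much.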

    I don't mean to bash Bayesian statistics--I use it and have published on Bayesian methods in multiple places. But it's not the answer to our problems. It's just another tool. The real problems are fads and publication pressures, which won't change by changing your inferential philosophy.

  34. Anscombe's Quartet by Anonymous Coward · · Score: 0

    Francis Anscombe made this same point in 1973. He created four very different data sets with identical mean, variance, correlation (r-value), and linear regressions. Visual inspection, on the other hand, would show that all of these statistics are junk (usually because the model is mismatched to the data set).

    Anscombe didn't look at P-values, but the same argument holds. The P-values for these would be quite high because the underlying source of the noise is non-Gaussian [or non-Normal if you're a mathematician] yet the usual calculation of a P-value assumes Gaussian statistics.

    Check out the Wikipedia article on Anscombe's quartet to see the data sets.

    -JS
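
    The quartet is small enough to check directly. A short sketch using the published data (the variable and function names are my own); it computes the mean and variance of y and the Pearson correlation for each of the four sets:

```python
from statistics import mean, variance

# Anscombe's quartet (1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


summary = {
    name: (mean(ys), variance(ys), pearson_r(xs, ys))
    for name, (xs, ys) in quartet.items()
}
for name, (m, v, r) in summary.items():
    print(f"set {name:>3}: mean(y)={m:.2f}  var(y)={v:.2f}  r={r:.3f}")
```

    The summary numbers agree to about two decimal places across all four sets, even though a scatter plot of each looks completely different.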

  35. Don't blame the p value by Anonymous Coward · · Score: 0

    if the researcher is an idiot, don't blame the p value

  36. all flawed by Anonymous Coward · · Score: 0

    There's strong statistical evidence that statistics are correct 95% of the time

  37. Re:Nothing wrong with P values if they are applied by retchdog · · Score: 1

    what? p-values are in no way restricted to the normal distribution, although a lot of statistical theory does involve the normal distribution.

    --
    "They were pure niggers." – Noam Chomsky