Slashdot Mirror


Is Statistical Significance Significant? (npr.org)

More than 850 scientists and statisticians told the authors of a Nature commentary that they are endorsing an idea to ban "statistical significance." Critics say that declaring a result to be statistically significant or not essentially forces complicated questions to be answered as true or false. "The world is much more uncertain than that," says Nicole Lazar, a professor of statistics at the University of Georgia. An entire issue of the journal The American Statistician is devoted to this question, with 43 articles and a 17,500-word editorial that Lazar co-authored.

"In the early 20th century, the father of statistics, R.A. Fisher, developed a test of significance," reports NPR. "It involves a variable called the p-value, that he intended to be a guide for judging results. Over the years, scientists have warped that idea beyond all recognition, creating an arbitrary threshold for the p-value, typically 0.05, and they use that to declare whether a scientific result is significant or not."

Slashdot reader apoc.famine writes: In a nutshell, what the statisticians are recommending is that we embrace uncertainty, quantify it, and discuss it, rather than set arbitrary measures for when studies are worth publishing. This way research which appears interesting but which doesn't hit that magical p == 0.05 can be published and discussed, and scientists won't feel pressured to p-hack.

29 of 184 comments

  1. I used to think so by goombah99 · · Score: 4, Funny

    Then I took a course on statistics, and the stats professor told me that 47.37% of all statisticians make up their own statistics.

    --
    Some drink at the fountain of knowledge. Others just gargle.
    1. Re:I used to think so by Aighearach · · Score: 2

      On average humans have one tit

      Your understanding of mammal bodies is substantially lacking.

  2. P-hacking by goombah99 · · Score: 3, Funny

    100% of all published incorrect results have a P value above 0.05

    --
    Some drink at the fountain of knowledge. Others just gargle.
    1. Re: P-hacking by c6gunner · · Score: 5, Insightful

      100% of all published incorrect results have a P value above 0.05

      0.05 was always intended to be the bare minimum, not a guarantee of absolute truth. If you hit 0.05, and you haven't engaged in P hacking, it indicates that there may be an effect there and that more study is warranted.

    2. Re:P-hacking by apoc.famine · · Score: 2

      Like weather predictions of 30% chance of rain at 2 pm, did it actually rain 30% of the time?

      That sort of research is done all the time. Usually it's on far more specific parts of weather models than the overall model. Weather models are ridiculously complicated, and scientists spend a lot of time on minor components of them, like better modeling of aerosols (since they form the nuclei of clouds and thus rain), the vertical humidity profile, or boundary layer dynamics. There are so many minor processes that make up weather that most of the research effort goes into things that 99.9% of the population will never even know exist. Together, all of these processes determine what the model predicts for rain or temperature at a given time.

      However, once in a while someone revisits the models as a whole, and you get something like this: http://www.inscc.utah.edu/~pu/...

      For hurricanes in particular: http://science.sciencemag.org/...
      (If you want the pop journalism coverage of that article: https://www.theatlantic.com/sc...)
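      The "30% chance of rain" question is a calibration check: group forecasts by their stated probability and compare each group with how often it actually rained. Below is a minimal Python sketch under simple assumptions; `calibration_table` and the toy forecasts and outcomes are hypothetical, and real forecast verification uses far more cases plus summary scores such as the Brier score.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Group forecasts by stated probability (rounded to 0.1) and return
    {stated_probability: (number_of_forecasts, observed_frequency)}."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    groups = defaultdict(list)
    for p, rained in zip(forecasts, outcomes):
        groups[round(p, 1)].append(1 if rained else 0)
    return {p: (len(v), sum(v) / len(v)) for p, v in sorted(groups.items())}

# Hypothetical toy data: ten "30%" forecasts and five "70%" forecasts.
forecasts = [0.3] * 10 + [0.7] * 5
outcomes = [True, False, False, True, False, False, True, False, False, False,
            True, True, False, True, True]

for stated, (n, observed) in calibration_table(forecasts, outcomes).items():
    print(f"forecast {stated:.0%}: {n} cases, rained {observed:.0%} of the time")
```

      A well-calibrated forecaster's rows line up (30% forecasts verify about 30% of the time); with only a handful of cases per row, as here, the observed frequencies are themselves very noisy.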

      --
      Velociraptor = Distiraptor / Timeraptor
    3. Re: P-hacking by WhiplashII · · Score: 4, Insightful

      Worse than that, if you only publish one out of 20 studies, you are reporting noise.
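      The arithmetic behind "one out of 20": if there is no real effect, each study still has a 5% chance of p < 0.05, so among 20 such studies you expect one false positive, and the chance that at least one of them "succeeds" is 1 - 0.95^20, roughly 64%. The Python sketch below checks this with a simulation; the one-sample z-test with known sigma = 1, the sample size, and the seed are simplifying assumptions for illustration.

```python
import random
from statistics import NormalDist

ALPHA = 0.05
N_STUDIES = 20

# Analytic view: every null study has probability ALPHA of a false positive.
expected_false_positives = N_STUDIES * ALPHA
p_at_least_one = 1 - (1 - ALPHA) ** N_STUDIES

def null_study_p_value(rng, n=30):
    """One hypothetical study with no true effect: n draws from N(0, 1),
    two-sided z-test of mean == 0 with known sigma = 1."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(2019)  # arbitrary seed, for reproducibility only
p_values = [null_study_p_value(rng) for _ in range(N_STUDIES)]
significant = [p for p in p_values if p < ALPHA]

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one p < {ALPHA}): {p_at_least_one:.2f}")
print(f"this run: {len(significant)} of {N_STUDIES} null studies were 'significant'")
```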

      --
      while (sig==sig) sig=!sig;
    4. Re: P-hacking by fropenn · · Score: 3, Insightful

      Of course 0.05 is arbitrary. But researchers have to run studies using budgets that limit the number of subjects in the study, and they are also limited by the accuracy of the test / instrument / survey. Obtaining extremely low p-values requires one or more of these:

      1. Very large sample sizes.

      2. Extremely effective intervention that produces huge differences between your groups.

      3. Extremely accurate instruments / measures.

      4. Lying.

      These things all come at a cost, which has to be balanced between doing fewer studies at higher cost or more studies at less cost.
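      Items 1 to 3 can be seen in a single formula. For a two-sample z-test, z = effect / (sd * sqrt(2 / n)), so quadrupling the sample size, doubling the effect, or halving the measurement noise each double z. The Python sketch below assumes a known, shared standard deviation (a simplification; real studies would usually use a t-test), and the numbers are hypothetical.

```python
from statistics import NormalDist

def two_sample_p_value(effect, sd, n_per_group):
    """Two-sided p-value of a two-sample z-test, assuming the observed
    difference in means equals `effect` and both groups have known
    standard deviation `sd` and `n_per_group` subjects."""
    standard_error = sd * (2 / n_per_group) ** 0.5
    z = effect / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

baseline = two_sample_p_value(effect=0.5, sd=2.0, n_per_group=50)           # z = 1.25
more_subjects = two_sample_p_value(effect=0.5, sd=2.0, n_per_group=200)     # item 1
bigger_effect = two_sample_p_value(effect=1.0, sd=2.0, n_per_group=50)      # item 2
better_instrument = two_sample_p_value(effect=0.5, sd=1.0, n_per_group=50)  # item 3

for label, p in [("baseline", baseline), ("4x subjects", more_subjects),
                 ("2x effect", bigger_effect), ("half the noise", better_instrument)]:
    print(f"{label:>15}: p = {p:.4f}")
```

      Each of the three changes moves z from 1.25 to 2.5, taking p from about 0.21 to about 0.012. Item 4 is left as an exercise for nobody.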

    5. Re: P-hacking by ShanghaiBill · · Score: 4, Insightful

      Worse than that, if you only publish one out of 20 studies, you are reporting noise.

      All publicly funded research should be published.

      Often the failed experiments are more important than the successes.

      Where would we be today if Michelson and Morley hadn't published their failure to measure the ether?

  3. All odd numbers are prime by goombah99 · · Score: 4, Interesting

    A prime number is divisible only by itself and 1
    1 is prime (by this definition)
    3 is prime
    5 is prime
    7 is prime
    11 is prime
    13 is prime
    9 is experimental error.

    The proposition that "all odd numbers are prime" has a P value above 0.05.

    --
    Some drink at the fountain of knowledge. Others just gargle.
  4. Nope. by dohzer · · Score: 4, Funny

    Nope. I'll delete it from Wikipedia later today.

  5. Obligatory XKCD cartoon by nickovs · · Score: 5, Funny
    --
    If intelligent life is too complex to evolve on its own, who designed God?
  6. Science is hard by Sarten-X · · Score: 2, Interesting

    This way research which appears interesting but which doesn't hit that magical p == 0.05 can be published and discussed

    The significance value is essentially a measurement of how good a researcher is at their job. Unfortunately, a lot of academics feel that they shouldn't be bothered by silly things like "accountability", because they've chosen the noble ivory tower of research.

    If your experiment can't hit that level of certainty, redesign your experiment. Go get more samples, run more simulations, and grow more cultures. Alternatively, go ahead and publish, but include the note that the job isn't actually finished. Use the partial result to justify asking for more funding so you can complete the work.

    • Half of your samples died unexpectedly? If you were a better researcher with better lab practices, you'd have had someone check that the equipment stayed plugged in over spring break.
    • Nobody responded to your survey? Maybe you should try something more effective than standing in a corner of the local pub for an hour asking the drunks if you can "get something good from them real quick".
    • You can't get enough reagents for your chemical process? Perhaps you should have actually budgeted for supplies, rather than host an open-bar party celebrating that you received that grant.
    • You ran out of time on the cluster computer? Next time try asking the computer science students to review your program for efficiency, rather than trying to run a direct implementation of your whiteboard notes.

    (These are all things I saw first- or secondhand during my time in academia)

    I'd be fine getting rid of the p-value, but it would have to be replaced by something else that does an equal job of filtering out the half-assed crank "research" that makes more headlines than discoveries. The only replacement I can think of that wouldn't be vulnerable to similar "hack" methods would be to require that every experiment go through an exhaustive process inspection before, during, and after the run. That's an even more painful thing to deal with than making sure your experiment can produce significant results.

    --
    You do not have a moral or legal right to do absolutely anything you want.
    1. Re:Science is hard by Anonymous Coward · · Score: 3, Interesting

      This is absolute horseshit. There is often background noise in a measurement that you CAN NOT GET RID OF. Therefore you will never get a perfect 0 p-value. In fact, you will often be unable to reduce it beyond a certain point NO MATTER HOW GOOD YOUR EXPERIMENT IS.

      What the article is arguing is that we should not be using a blunt instrument like a p-value which is often a lazy person's (like the parent poster) substitute for quality, but instead should be assessing research on its relative merit and making judgments about quality from a deeper understanding of the problems that some experiments face. Attitudes like the one the parent poster gives are why p-hacking and its associated problems exist - dilettantes like Sarten-X substitute p-values for quality, whereas actual statisticians know it cannot be used in that way.

    2. Re:Science is hard by houghi · · Score: 5, Insightful

      If your experiment can't hit that level of certainty, redesign your experiment.

      Or perhaps the thing you thought was sure, isn't at all and you just proved that your idea was wrong.

      A researcher should prove and disprove, not only prove.

      --
      Don't fight for your country, if your country does not fight for you.
    3. Re:Science is hard by werepants · · Score: 4, Interesting

      The significance value is essentially a measurement of how good a researcher is at their job.

      This is totally wrong, and reflects the exact misconception that the article is talking about. For quite a while my job was doing experiments on hardware that cost as much as $100k per sample, where test time would cost $1000/hr or more, and you needed hundreds of hours of testing to get any kind of reasonable certainty. Budgets are finite, and at some point you have to decide how good is good enough; or, even if it isn't good enough, there just isn't any money left to do better. We could only estimate effects to within a couple orders of magnitude at times. However, we put error bars on fucking everything, so we were very explicit about how much slop there was in the answers. How good a researcher is at their job is determined by how much they can get done with finite resources, and how deeply they understand the limitations of their knowledge. All researchers should be trying to get maximal knowledge per dollar (or per time, in some cases), and sometimes an experiment with large uncertainty is the appropriate approach, or the only thing that is feasible within time/funding/physics constraints.

      Sure, if you are doing something basic like surveys, it's not hard to increase your sample size. But if you are doing medical research on a new drug, costs can run into billions and you've got major ethical quandaries every step along the way. If you are developing a drug for a rare condition, there might only be a handful of test candidates in the world, and so you literally can't increase your sample size unless you wait a decade for more incidences to crop up. In that interval, depending on the specifics of the disease, people could be suffering or dying needlessly because you haven't gotten your drug approved.

      Yes, bad research is bad, and journals are replete with examples of terrible studies being published. But the p-value doesn't help that situation - it makes it worse, because it's treated as a binary marker of success. You can easily produce a great p-value by approaching science in the exact wrong way... look for significant correlations in a large, highly multivariate dataset and you are guaranteed to find some total nonsense correlations that look flawless (like the insanely tight correlation between swimming pool drowning deaths and Nicolas Cage movies... true story).

      What we actually need is more rigorous peer review and greater transparency and information sharing in science. If it becomes standard practice to make all of your raw data and calculations public, then it will become obvious very quickly when people are fudging numbers and inflating their stats.

    4. Re:Science is hard by Sarten-X · · Score: 2

      I have a fair coin that always lands on heads, just with about 50% background noise.

      The whole point of an experiment is to remove the "background noise", which is another way of saying "uncontrolled variables". If your experiment can't isolate the target variable, then you need to fix your experiment. In the extremely rare case that the experiment can't be fixed, like in cases where a small number of particles matters (including the very small number of photons hitting a telescope sensor), you still should be acknowledging your experimental problems. Own up to having a high p-value, and explain how you did absolutely everything possible with today's technology to pull signal from that background noise.
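      The standard way to ask whether a coin's heads rate is signal or noise is a binomial test. The Python sketch below is a minimal exact two-sided version (it sums the probabilities of all outcomes no more likely than the observed one); the function name and the flip counts are hypothetical. It shows that the same 60% heads rate is weak evidence of bias after 10 flips but strong evidence after 500, which is the "go get more samples" argument in numbers. SciPy users could instead call `scipy.stats.binomtest`.

```python
from math import comb

def binomial_two_sided_p(heads, flips, p=0.5):
    """Exact two-sided binomial test of H0: P(heads) == p.
    Sums the probabilities of every outcome that is no more likely than
    the observed one (small tolerance for floating-point ties)."""
    probs = [comb(flips, k) * p ** k * (1 - p) ** (flips - k)
             for k in range(flips + 1)]
    observed = probs[heads]
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))

print(binomial_two_sided_p(6, 10))     # about 0.75: consistent with a fair coin
print(binomial_two_sided_p(60, 100))   # about 0.057: same 60% rate, more flips
print(binomial_two_sided_p(300, 500))  # far below 0.05
```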

      [We] instead should be assessing research on its relative merit and making judgments about quality from a deeper understanding of the problems that some experiments face.

      I agree, but to do that, we'd need a good way to quickly educate every other scientist on that "deeper understanding", and why it's not possible to do any other experiment that does a better job of isolating the variables. Without that, it's easy to simply claim that an overly-complicated random-number generator with cherry-picked results is really an extremely-sensitive test apparatus supporting some pet theory.

      --
      You do not have a moral or legal right to do absolutely anything you want.
    5. Re: Science is hard by phantomfive · · Score: 2

      The scientists in the article are complaining that people conclude two things are the same when there is no statistical difference between the two. You can't conclude that: all you can say is "we aren't sure."

      --
      "First they came for the slanderers and i said nothing."
  7. Bio/Medical Fields by Roger+W+Moore · · Score: 4, Insightful

    Plus they are almost all from biology or medicine. Just because their fields don't seem to understand what statistically significant means does not mean that the rest of us do not. In their example, two results measure the same value, but one is within one sigma of a null result and the other is not, and they claim that people interpret this as two incompatible results!? I do not know of any physicist who would look at those data and make that assertion.
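    The situation described above can be sketched with two hypothetical studies that report the same central estimate with different precision, assuming normally distributed, independent estimates. Study A sits within one standard error of zero (p about 0.42) while study B is four standard errors away, yet the difference between the two studies is exactly zero, so the data give no reason to call them incompatible. The numbers and the helper name are illustrative assumptions, not the values from the Nature commentary.

```python
from statistics import NormalDist

def z_and_p(estimate, standard_error):
    """z-score and two-sided p-value for an estimate against a null value of 0."""
    z = estimate / standard_error
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical (estimate, standard error) pairs measuring the same quantity.
study_a = (0.20, 0.25)  # imprecise: "not significant"
study_b = (0.20, 0.05)  # precise: "significant"

for name, (est, se) in {"A": study_a, "B": study_b}.items():
    z, p = z_and_p(est, se)
    print(f"study {name}: z = {z:.2f}, p = {p:.4f}")

# The relevant question is whether the two results differ from each other.
diff = study_a[0] - study_b[0]
diff_se = (study_a[1] ** 2 + study_b[1] ** 2) ** 0.5
print("difference between studies (z, p):", z_and_p(diff, diff_se))
```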

    Their paper reads more like "I wish our colleagues understood simple statistics". Banning certain terms is not going to address the underlying problem they clearly have. The solution to ignorance is education, not censorship, as they, working in universities, really ought to know!

    1. Re:Bio/Medical Fields by omnichad · · Score: 3, Insightful

      Statistics in medicine are inherently messier. We don't clone people to do experiments and we don't intentionally kill people. You don't get clean control subjects.

    2. Re:Bio/Medical Fields by Roger+W+Moore · · Score: 2

      It's about publishing Potsy. Something you would know if you were an actual researcher.

      I am an actual researcher. Given your lack of understanding of statistics and reliance on ad hominem attacks, if you are a researcher too then you are clearly the target audience that this paper is trying to help by reducing your exposure to simple statistical concepts that you are likely to misinterpret.
      I never said that they were calling for a ban on p-values, I said that they were calling for an end to "statistical significance". To quote:

      We agree, and call for the entire concept of statistical significance to be abandoned.

      This is just stupid. You do not stop using a valuable and sensible concept simply because some people who should know better do not properly understand it. Drawing a conclusion from your analysis is a fundamental part of doing science and it is completely proper that an author of a paper should make a statement of their conclusions based on the data. When this is whether a particular hypothesis is correct or not then you have to address the binary nature of the result otherwise you have not done your job. How strong a statement you can make will, of course, depend on how good your data are. This can vary from "the data are consistent with the Standard Model but do not rule out the presence of new physics" to "The $EXOTIC_NEW_PROCESS is ruled out at the 95% confidence level".

      Reading is fundamental and you failed.

      Please do not project your own failings onto others.

  8. This won't address the underlying problem by SlaveToTheGrind · · Score: 3, Interesting

    Even without a magical "significant/insignificant" threshold, researchers will still evaluate, judge, and compare levels of significance. The pressure will just shift to come up with results that are "MORE significant" rather than "LESS significant," and thus p-hacking will continue by those that were willing to cross that line in the first place.

    The root cause is going to remain until peer reviewers force researchers to commit to how they're going to evaluate their measurements before they take those measurements. But the likely outcome would be either that a lot less research gets published at all, or that published research starts to lose some of the imprimatur it now enjoys, including that of the peer reviewers. So that's unlikely to happen.
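    Why committing to the analysis in advance matters can be shown with one common form of p-hacking, optional stopping: test after every batch of subjects and stop as soon as p < 0.05. The Python sketch below simulates studies of an effect that does not exist; the z-test with known sigma = 1, the batch size, the number of trials, and the seed are all illustrative assumptions. Testing once at a pre-committed sample size keeps the false-positive rate near 5%, while peeking ten times inflates it well above that.

```python
import random
from statistics import NormalDist, fmean

def z_test_p(sample):
    """Two-sided z-test of mean == 0, assuming known sigma = 1."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(peek, trials=2000, start=10, step=10, max_n=100,
                        alpha=0.05, seed=42):
    """Simulate studies of a null effect. peek=False: one test at max_n,
    fixed in advance. peek=True: test after every `step` observations and
    declare success as soon as p < alpha (optional stopping)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(max_n)]
        checkpoints = range(start, max_n + 1, step) if peek else [max_n]
        if any(z_test_p(sample[:n]) < alpha for n in checkpoints):
            hits += 1
    return hits / trials

print("pre-committed, single test:", false_positive_rate(peek=False))
print("peeking every 10 subjects: ", false_positive_rate(peek=True))
```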

  9. Re:All odd numbers are prime by colinwb · · Score: 3, Informative

    1 is prime by that definition, but it's mostly called a unit and defined as *not* prime to make factorising integers into primes unique (up to the order of the factors): Prime number - Primality of 1

  10. These statisticians are idealists by plague911 · · Score: 3, Interesting

    Sure, in a perfect world we would all discuss the exact probabilities. The reality is we all (even professionals in an industry) have a limited attention span. Benchmarks are useful, even imperfect benchmarks. This is just another example of some purists thinking we should move to some idealized but impractical situation.

  11. In defense of the p-value by psychic_bacon · · Score: 2

    I'm really curious about what people think about this comment and my attempt to defend p-values and statistical significance testing as a concept. I used to hate p-values like any respectable scientist, but in teaching intro college stats class (targeted to behavioral science), I've come to appreciate them, for one major reason.

    1. We have to take uncertain science and make certain decisions about the conclusions. Science gets simplified to dichotomous decisions. You either approve the drug or not. You either eat eggs or don't eat eggs. The defendant is guilty or not guilty. In each of these cases, we take scientific and other evidence and have to make a decision: do we trust these data? Confidence intervals, odds ratios, etc, help give a picture but they don't give a clear guideline about what to accept.

    2. It's really hard to understand (and teach) Bayesian and other approaches. I think that statistical significance is a decent proxy, as long as the limitations are well-understood. I am a big believer in teaching science research to people who have no desire to ever be "researchers", and in order to evaluate their studies, statistical significance is a good proxy. If you are doing an intro biology lab testing whether there are more bacteria on your hands after washing your hands versus hand sanitizer, a t-test with a p < .05 criterion is a good approach. It won't get published in JAMA, but it's good for teaching research concepts.

    3. Reviewers still want p-values. Each time I have submitted a manuscript without p-values, I get a nasty reviewer who requires p-values. Maybe I've had bad luck, but I'm guessing this is pretty common in the literature. Any time I try a statistical technique that goes beyond null hypothesis testing, there is at least one reviewer who doesn't understand the technique and gripes because there are no p-values or decision criteria. As long as this is required to publish, we need to do it.
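    For the hand-washing lab in point 2, a two-sample t-test needs only a few lines. The Python sketch below computes Welch's t statistic (which does not assume equal variances) and its Welch-Satterthwaite degrees of freedom using only the standard library; the colony counts are made-up classroom numbers, and the helper name is hypothetical. Turning t into a p-value needs the t distribution, for example a t table or `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples (unequal variances allowed)."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical bacterial colony counts from swabbed hands.
soap_and_water = [12, 15, 9, 20, 14, 11, 17, 13]
sanitizer = [8, 10, 7, 12, 9, 6, 11, 10]

t, df = welch_t(soap_and_water, sanitizer)
print(f"t = {t:.2f}, df = {df:.1f}")  # t = 3.33, df = 11.3
```

    With about 11 degrees of freedom, the two-sided 5% critical value is roughly 2.2, so these made-up data would pass a p < .05 criterion.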

    So these aren't very good defenses, but it's why I'm still teaching p-values and null hypothesis testing. Maybe we will get rid of it, but like some other comments here, it leaves the question of what the alternative would be.

  12. Re:Hail incoherentism! by MightyMartian · · Score: 2

    When I took statistics, the text made it clear that a P-value of 0.05 is *somewhat* arbitrary, in that for any individual analysis it is a useful threshold but not, by itself, an absolute indicator of significance. I think the people in this group are guilty of overstating their argument. Determining the P-value, or any other statistical measure of significance, is the *start* of a study, and then comes all the hard work of determining whether that value is pointing to something truly significant. A p-value of 0.05 certainly suggests that the finding is significant, but it is not THE definitive test.

    --
    The world's burning. Moped Jesus spotted on I50. Details at 11.
  13. Re:Quant vs Qual by PacoSuarez · · Score: 4, Insightful

    And this is why there is so little truth to be found in the humanities.

    Here's a scenario: A white nationalist kills dozens of Muslims. Someone looks at this and sees evidence that the normalization of fringe views, characteristic of the way president Trump talks, is emboldening these maniacs to act violently. Someone else looks at this and sees evidence that white middle-class uneducated men have been marginalized by our economic system and are at their wits' end, which is the same phenomenon that led to Trump being elected.

    The kind of elaborate, narrative-based analyses that you advocate don't help us decide which of the points of view above is right, and we carry on with our preconceptions, unable to learn anything.

    Narratives allow you to explain the past perfectly using models that have no predictive value. The only way to make progress when trying to understand a complex system is to come up with very simple hypotheses and try to validate them empirically. Of course this is very hard to do, but I think people in the humanities do a poor job and fool themselves into thinking they understand things they don't understand.

  14. Re: Hail incoherentism! by phantomfive · · Score: 2

    The real problem is when scientists aren't interested in finding something significant, but in getting published. In that situation, even setting the threshold at .0005 will end up with p value hacking.

    --
    "First they came for the slanderers and i said nothing."
  15. Re:All odd numbers are prime by thrich81 · · Score: 3, Informative

    Actually 1 is neither prime nor composite by some deep mathematical definitions which go beyond the integers -- they go into the structure of algebraic rings which are generalizations of the integers. If you allow 1 (a unit) to be prime then you break some properties and theorems which everyone generally accepts in the algebra of the integers. The most well known such property is that of unique factorization -- any natural number greater than 1 is factored uniquely into prime factors. If you let 1 be prime then the prime factorization of a composite number can have any number of factors of 1 in it.
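    A small illustration of the unique-factorization point: trial division (a simple, swapped-in method, not taken from any particular textbook) produces exactly one list of primes for each n > 1. The Python sketch below uses a hypothetical helper name; the comment at the end shows what would break if 1 were admitted as a prime.

```python
def prime_factors(n):
    """Prime factorization of an integer n > 1 by trial division,
    smallest factor first (1 is never included)."""
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:  # whatever remains is itself prime
        factors.append(n)
    return factors

print(prime_factors(60))  # [2, 2, 3, 5]
print(prime_factors(9))   # [3, 3], so 9 is not experimental error
# If 1 counted as prime, [1, 2, 2, 3, 5] and [1, 1, 2, 2, 3, 5] would also be
# "prime factorizations" of 60, and the factorization would no longer be unique.
```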

    The deeper definition of a prime (from my old abstract algebra book) is, "In the Euclidean ring R a nonunit p is said to be a prime element of R if whenever p = ab, where a, b are in R, then one of a or b is a unit in R."

    And there is a king which gives the definitive definition -- it is the accepted body of mathematical definitions by the world's mathematical community. There are sometimes differing definitions of a term, but those differences are usually well spelled out in any discussions. You can choose not to accept the definitions as the professionals in the field use them but then don't claim your definition is as good or useful as that of the pros.

  16. Re:Quant vs Qual by Kjella · · Score: 3, Interesting

    Narratives allow you to explain the past perfectly using models that have no predictive value. The only way to make progress when trying to understand a complex system is to come up with very simple hypotheses and try to validate them empirically. Of course this is very hard to do, but I think people in the humanities do a poor job and fool themselves into thinking they understand things they don't understand.

    A person is not a die, no matter how much you want it to be. You can ask a fairly simple question like "Would you pose for nude art?" and get a survey answer. But if you break it down there'll be a ton of factors, and the more answers you get and the more fine-grained you make your model, the more differences you'll find; plus, the answer will not remain constant in place or time, with strong group dynamics and feedback loops. And you still will not have found a meaningful answer to why, only a bunch of correlated variables. Qualitative studies do the exact opposite: they don't generalize, they ask subjects one by one to explain their reasoning and try to summarize the answers into common sentiments. That gives a much more accurate description of each person and of the group as a whole. It's just really hard to compare scores, because the answers aren't on a measurable willingness scale.

    Yes, we've vaguely identified some risk factors that are usually present in a terrorist. We've got long manifestos on why exactly that person turned into a terrorist. But everyone at risk is somewhere in between; they're not just risk factors and they're not clones of the terrorist. It's something like Heisenberg's uncertainty principle for the social sciences: the more specific knowledge you have of an individual, the less applicable it is to the group, and the more general knowledge you have of the group, the less accurate it is for the individual. They're both circling what nobody knows for sure, what exactly goes on in somebody else's head. Until we discover mind-reading technology, that's going to be an approximation at best. Even if you can sell power tools to most Americans, a dart thrown at a map could still land on an Amish community.

    --
    Live today, because you never know what tomorrow brings