Slashdot Mirror


Is Statistical Significance Significant? (npr.org)

More than 850 scientists and statisticians told the authors of a Nature commentary that they are endorsing an idea to ban "statistical significance." Critics say that declaring a result to be statistically significant or not essentially forces complicated questions to be answered as true or false. "The world is much more uncertain than that," says Nicole Lazar, a professor of statistics at the University of Georgia. An entire issue of the journal The American Statistician is devoted to this question, with 43 articles and a 17,500-word editorial that Lazar co-authored.

"In the early 20th century, the father of statistics, R.A. Fisher, developed a test of significance," reports NPR. "It involves a variable called the p-value, that he intended to be a guide for judging results. Over the years, scientists have warped that idea beyond all recognition, creating an arbitrary threshold for the p-value, typically 0.05, and they use that to declare whether a scientific result is significant or not. Slashdot reader apoc.famine writes: In a nutshell, what the statisticians are recommending is that we embrace uncertainty, quantify it, and discuss it, rather than set arbitrary measures for when studies are worth publishing. This way research which appears interesting but which doesn't hit that magical p == 0.05 can be published and discussed, and scientists won't feel pressured to p-hack.

7 of 184 comments

  1. All odd numbers are prime by goombah99 · · Score: 4, Interesting

    A prime number is divisible only by itself and 1
    1 is prime (by this definition)
    3 is prime
    5 is prime
    7 is prime
    11 is prime
    13 is prime
    9 is experimental error.

    The proposition that "all odd numbers are prime" has a P value above 0.05.

    --
    Some drink at the fountain of knowledge. Others just gargle.
  2. Science is hard by Sarten-X · · Score: 2, Interesting

    This way research which appears interesting but which doesn't hit that magical p == 0.05 can be published and discussed

    The significance value is essentially a measurement of how good a researcher is at their job. Unfortunately, a lot of academics feel that they shouldn't be bothered by silly things like "accountability", because they've chosen the noble ivory tower of research.

    If your experiment can't hit that level of certainty, redesign your experiment. Go get more samples, run more simulations, and grow more cultures. Alternatively, go ahead and publish, but include the note that the job isn't actually finished. Use the partial result to justify asking for more funding so you can complete the work.

    • Half of your samples died unexpectedly? If you were a better researcher with better lab practices, you'd have had someone check that the equipment stayed plugged in over spring break.
    • Nobody responded to your survey? Maybe you should try something more effective than standing in a corner of the local pub for an hour asking the drunks if you can "get something good from them real quick".
    • You can't get enough reagents for your chemical process? Perhaps you should have actually budgeted for supplies, rather than host an open-bar party celebrating that you received that grant.
    • You ran out of time on the cluster computer? Next time try asking the computer science students to review your program for efficiency, rather than trying to run a direct implementation of your whiteboard notes.

    (These are all things I saw first- or secondhand during my time in academia)

    I'd be fine getting rid of the p-value, but it would have to be replaced by something else that does an equal job of filtering out the half-assed crank "research" that makes more headlines than discoveries. The only replacement I can think of that wouldn't be vulnerable to similar "hack" methods would be to require that every experiment go through an exhaustive process inspection before, during, and after the run. That's an even more painful thing to deal with than making sure your experiment can produce significant results.

    --
    You do not have a moral or legal right to do absolutely anything you want.
    1. Re:Science is hard by Anonymous Coward · · Score: 3, Interesting

      This is absolute horseshit. There is often background noise in a measurement that you CAN NOT GET RID OF. Therefore you will never get a perfect 0 p-value. In fact, you will often be unable to reduce it beyond a certain point NO MATTER HOW GOOD YOUR EXPERIMENT IS.

      What the article is arguing is that we should not be using a blunt instrument like a p-value which is often a lazy person's (like the parent poster) substitute for quality, but instead should be assessing research on its relative merit and making judgments about quality from a deeper understanding of the problems that some experiments face. Attitudes like the one the parent poster gives are why p-hacking and its associated problems exist - dilettantes like Sarten-X substitute p-values for quality, whereas actual statisticians know it cannot be used in that way.

    2. Re:Science is hard by werepants · · Score: 4, Interesting

      The significance value is essentially a measurement of how good a researcher is at their job.

      This is totally wrong, and reflects the exact misconception that the article is talking about. For quite a while my job was doing experiments on hardware that cost as much as $100k per sample, where test time would cost $1000/hr or more, and you needed hundreds of hours of testing to get any kind of reasonable certainty. Budgets are finite, and at some point you have to decide how good is good enough, or even if it isn't good enough, there just isn't any money left to do better. We could only estimate effects to within a couple orders of magnitude at times. However, we put error bars on fucking everything, so we were very explicit about how much slop there was in the answers. How good a researcher is at their job is determined by how much they can get done with finite resources, and how deeply they understand the limitations of their knowledge. All researchers should be trying to get maximal knowledge per dollar (or per time, in some cases), and sometimes an experiment with large uncertainty is the appropriate approach, or the only thing that is feasible within time/funding/physics constraints.
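
      To make "error bars on everything" concrete, here is a minimal sketch of reporting an estimate together with an interval instead of a pass/fail verdict. It uses a percentile bootstrap, which is one common choice and not necessarily what was used in the work described above. The measurements are made-up numbers, and `bootstrap_ci` is a hypothetical helper name. With only five samples the interval is crude, which is rather the point: the slop is stated instead of hidden.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=5000, level=0.95, seed=0):
    """Return (mean, low, high) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return statistics.fmean(samples), means[lo_idx], means[hi_idx]


# Hypothetical measurements from five expensive test articles (invented data).
measurements = [12.1, 9.8, 15.3, 11.0, 13.7]
mean, low, high = bootstrap_ci(measurements)
print(f"estimate = {mean:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

      The output is a number plus a range, so a reader can judge for themselves whether the uncertainty is acceptable for the decision at hand.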

      Sure, if you are doing something basic like surveys, it's not hard to increase statistics. But if you are doing medical research on a new drug, costs can run into billions and you've got major ethical quandaries every step along the way. If you are developing a drug for a rare condition, there might only be a handful of test candidates in the world, and so you literally can't increase your sample size unless you wait a decade for more incidences to crop up. In that interval, depending on the specifics of the disease, people could be suffering or dying needlessly because you haven't gotten your drug approved.

      Yes, bad research is bad, and journals are replete with examples of terrible studies being published. But the p-value doesn't help that situation - it makes it worse, because it's treated as a binary marker of success. You can easily produce a great p-value by approaching science in the exact wrong way... look for significant correlations in a large, highly multivariate dataset and you are guaranteed to find some total nonsense correlations that look flawless (like the insanely tight correlation between swimming pool drowning deaths and Nicolas Cage movies... true story).
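
      The "guaranteed to find nonsense correlations" point can be sketched with a small simulation. Everything here is an illustrative assumption: the data is pure random noise, the variable and sample counts are arbitrary, and the p-value comes from the Fisher z-transform approximation rather than an exact test. `count_spurious` is a hypothetical name.

```python
import math
import random
from itertools import combinations


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for r via the Fisher z-transform (normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def count_spurious(n_vars=40, n_obs=30, alpha=0.05, seed=0):
    """Count 'significant' pairs among variables that are independent noise."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(combinations(range(n_vars), 2))
    hits = sum(
        1
        for i, j in pairs
        if approx_p_value(pearson_r(data[i], data[j]), n_obs) < alpha
    )
    return hits, len(pairs)


hits, total = count_spurious()
print(f"{hits} of {total} pairs of pure noise look 'significant' at p < 0.05")
```

      With 40 unrelated variables there are 780 pairs to test, so at a 0.05 threshold you should expect a few dozen "discoveries" even though no real relationship exists anywhere in the data.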

      What we actually need is more rigorous peer review and greater transparency and information sharing in science. If it becomes standard practice to make all of your raw data and calculations public, then it will become obvious very quickly when people are fudging numbers and inflating their stats.

  3. This won't address the underlying problem by SlaveToTheGrind · · Score: 3, Interesting

    Even without a magical "significant/insignificant" threshold, researchers will still evaluate, judge, and compare levels of significance. The pressure will just shift to come up with results that are "MORE significant" rather than "LESS significant," and thus p-hacking will continue by those that were willing to cross that line in the first place.

    The root cause is going to remain until peer reviewers force researchers to commit to how they're going to evaluate their measurements before they take those measurements. But the likely outcome would be either a lot less research would get published at all or published research would start to lose some of the imprimatur it now enjoys, including that of the peer reviewers. So that's unlikely to happen.
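
    As a rough illustration of why committing to the evaluation up front matters, the sketch below compares two analysts looking at data with no real effect. One fixes the sample size in advance and tests once; the other peeks every few observations and counts it as a win if any peek crosses the threshold. The setup is an assumption for illustration only (unit-variance Gaussian noise, a simple z-test, arbitrary trial counts), and `false_positive_rates` is a hypothetical name.

```python
import math
import random


def z_test_p(total, n):
    """Two-sided p-value for 'mean == 0', assuming known unit variance."""
    z = total / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(n_trials=2000, max_n=100, peek_every=10,
                         alpha=0.05, seed=0):
    """Return (fixed_rate, peeking_rate) when the null hypothesis is true."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_trials):
        total = 0.0
        peeked_significant = False
        for n in range(1, max_n + 1):
            total += rng.gauss(0, 1)  # no real effect: true mean is 0
            # The peeking analyst tests at every checkpoint.
            if n % peek_every == 0 and z_test_p(total, n) < alpha:
                peeked_significant = True
        # The pre-committed analyst tests exactly once, at max_n.
        if z_test_p(total, max_n) < alpha:
            fixed_hits += 1
        if peeked_significant:
            peeking_hits += 1
    return fixed_hits / n_trials, peeking_hits / n_trials


fixed_rate, peeking_rate = false_positive_rates()
print(f"pre-committed: {fixed_rate:.3f}, peeking: {peeking_rate:.3f}")
```

    The pre-committed rate should sit near the nominal 0.05, while the peeking rate comes out noticeably higher, and that gap exists whether or not anyone calls a particular threshold "significant".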

  4. These statisticians are idealists by plague911 · · Score: 3, Interesting

    Sure, in a perfect world we would all discuss the exact probabilities. The reality is we all (even professionals in an industry) have a limited attention span. Benchmarks are useful, even imperfect benchmarks. This is just another example of some purists thinking we should move to some idealized but impractical situation.

  5. Re:Quant vs Qual by Kjella · · Score: 3, Interesting

    Narratives allow you to explain the past perfectly using models that have no predictive value. The only way to make progress when trying to understand a complex system is to come up with very simple hypotheses and try to validate them empirically. Of course this is very hard to do, but I think people in the humanities do a poor job and fool themselves into thinking they understand things they don't understand.

    A person is not a die, no matter how much you want it to be. You can ask a fairly simple question like "Would you pose for nude art?" and get a survey answer. But if you break it down there'll be a ton of factors, and the more answers you get and the more fine-meshed you make your model, the more differences you'll find; plus, the answer will not remain constant in place or time, given the strong group dynamics and feedback loops. And you still will not have found a meaningful answer to why, only a bunch of correlated variables. Qualitative studies do the exact opposite: they don't generalize, they ask subjects one at a time to explain their reasoning and try to summarize the answers into common sentiments. It's a much more accurate description for each person and the group as a whole. It's just really hard to compare scores because it's not on a measurable willingness scale.

    Yes, we've vaguely identified some risk factors that are usually present in a terrorist. We've got long manifestos on why exactly that person turned into a terrorist. But everyone at risk is somewhere in between; they're not just risk factors and they're not clones of the terrorist. It's something like Heisenberg's uncertainty principle for the social sciences: the more specific knowledge you have of an individual, the less applicable it is to the group, and the more general knowledge you have of the group, the less accurate it is for the individual. They're both circling what nobody knows for sure: what exactly goes on in somebody else's head. Until we discover mind-reading technology that's going to be an approximation at best. Just because you can sell power tools to most Americans doesn't mean you can sell them to everyone: throw a dart at a map and you could hit an Amish community.

    --
    Live today, because you never know what tomorrow brings