Slashdot Mirror


Scientists Propose To Raise the Standards For Statistical Significance In Research Studies (sciencemag.org)

sciencehabit shares a report from Science Magazine: A megateam of reproducibility-minded scientists is renewing a controversial proposal to raise the standard for statistical significance in research studies. They want researchers to dump the long-standing use of a probability value (p-value) of less than 0.05 as the gold standard for significant results, and replace it with the much stiffer p-value threshold of 0.005. Backers of the change, which has been floated before, say it could dramatically reduce the reporting of false-positive results -- studies that claim to find an effect when there is none -- and so make more studies reproducible. And they note that researchers in some fields, including genome analysis, have already made a similar switch with beneficial results.

"If we're going to be in a world where the research community expects some strict cutoff ... it's better that that threshold be .005 than .05. That's an improvement over the status quo," says behavioral economist Daniel Benjamin of the University of Southern California in Los Angeles, first author on the new paper, which was posted 22 July as a preprint article on PsyArXiv and is slated for an upcoming issue of Nature Human Behavior. "It seemed like this was something that was doable and easy, and had worked in other fields."

137 comments

  1. But only 56% of scientists agree with this by Anonymous Coward · · Score: 0, Offtopic

    And if you ask me, that's not statistically significant enough.

    Props to GNAA.

    1. Re:But only 56% of scientists agree with this by interkin3tic · · Score: 4, Insightful

      I'd be surprised if it was anywhere near 56%. I'm a biologist, I don't understand P values, but I am aware that they shouldn't be the gold standard. Ideally scientists in all the different fields would use the statistics that make the most sense for their specific study, and would take the time to figure that out, and reviewers would read up on statistics and think themselves about what statistics would make the most sense for that case.

      P<0.05 is used everywhere because that simply won't happen. Scientists who aren't statisticians care passionately about only their topic and it isn't statistics. If anyone tries to use something else, everyone including reviewers will demand they use what everyone else uses anyway.

    2. Re:But only 56% of scientists agree with this by Anonymous Brave Guy · · Score: 0

      Scientists who aren't statisticians care passionately about only their topic and it isn't statistics. If anyone tries to use something else, everyone including reviewers will demand they use what everyone else uses anyway.

      And then people wonder why the credibility of published science has been called into question so much recently. :-(

      The idea that any specific value -- whether it's 0.05, 0.005 or something else -- should be regarded as a universally appropriate choice is nonsensical and betrays a fundamental lack of understanding of what the statistics being published mean.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    3. Re: But only 56% of scientists agree with this by Anonymous Coward · · Score: 1

      Get a copy of "Statistical Power Analysis for the Behavioral Sciences" by Jacob Cohen.

      He spent his career trying to understand how to reliably measure an "effect" on something. Understanding effect sizes (dimensionless) and the number of samples you need to reliably test for an effect of a particular size is what this book is all about.

    4. Re:But only 56% of scientists agree with this by interkin3tic · · Score: 2

      And then people wonder why the credibility of published science has been called into question so much recently. :-(

      No, we don't wonder, it's because of a lack of reproducibility. Well, that and political agendas. The problem is understood and agreed upon generally. Agreeing on a solution is where things go off the rails. Saying "This magic number is no good, we should use THIS MAGIC NUMBER" is what I have a problem with.

    5. Re:But only 56% of scientists agree with this by Anonymous Coward · · Score: 0

      Doesn't matter since AGW promotes reduce, reuse, and recycle. We don't need science to let us know that is right.

    6. Re: But only 56% of scientists agree with this by Anonymous Coward · · Score: 0

      This we need to ignore facts to promote that.

    7. Re: But only 56% of scientists agree with this by Anonymous Coward · · Score: 0

      This. It doesn't matter right or wrong, because we need to do that.

    8. Re: But only 56% of scientists agree with this by Anonymous Coward · · Score: 1

      This. We need to ignore facts.

    9. Re:But only 56% of scientists agree with this by Anonymous Coward · · Score: 0

      That's the ideal but it doesn't work in practice. In practice a large portion of scientists are there to publish and publish fast. They are not interested so much in the search for truth as they are for the search for publications. If you let each set of authors determine their level of significance, it would BALLOON what is already a problem.

    10. Re:But only 56% of scientists agree with this by CptJeanLuc · · Score: 1

      Yes and no.

      Yes, in principle you are right, p-values should ideally be tailored to the application. The higher the negative consequence of a false result, the lower the p-value threshold you need.

      On the other hand and more importantly no - the main problem is that most researchers in various non-mathematical fields seem to have no clue how probabilities and statistics work, so they are not able to make that judgement call. Setting a higher standard should improve quality of results overall - yes it will lead to fewer results, but on the other hand we will also see fewer bulls___ results.

      Another major advantage of raising the bar for what can be called a significant result, is it becomes harder to do p-hacking, i.e. so-called "research" where you just measure a ton of variables - so many that there is a high probability that there will be some correlation in the data set with p-value below the threshold - and publish that as a scientific "result".
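
      A rough sketch of why measuring "a ton of variables" works so well for p-hacking, and how much a stricter threshold blunts it. It assumes the tests are independent and that none of the variables has any real effect; the function name is made up for illustration.

```python
# Chance that at least one of m independent tests of true-null
# hypotheses dips below the significance threshold alpha.
# Independence between the tests is an assumption of this sketch.

def familywise_false_positive_rate(alpha: float, m: int) -> float:
    """P(at least one p-value < alpha) when all m nulls are true."""
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    for alpha in (0.05, 0.005):
        for m in (1, 20, 100):
            rate = familywise_false_positive_rate(alpha, m)
            print(f"alpha={alpha:<6} variables={m:<4} P(some 'hit')={rate:.3f}")
```

      With 20 unrelated variables the chance of at least one spurious "hit" is roughly 64% at 0.05 and a bit under 10% at 0.005, so the stricter cutoff makes this kind of fishing harder but not impossible.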

    11. Re:But only 56% of scientists agree with this by ilguido · · Score: 1

      I'm with you. They can even replace it with a threshold of 0.0000005, but if many "scientists" don't grasp statistics and don't know measure theory, it's all a wasted effort.

    12. Re:But only 56% of scientists agree with this by Quakeulf · · Score: 0
    13. Re:But only 56% of scientists agree with this by dcw3 · · Score: 1

      Glad I checked first, I was going to link to another 538 article:
      http://fivethirtyeight.com/fea...

      --
      Just another day in Paradise
    14. Re:But only 56% of scientists agree with this by Hognoxious · · Score: 1

      I'm a biologist, I don't understand P values

      Are you sure you're a biologist and not a gardener?

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    15. Re:But only 56% of scientists agree with this by Shadow of Eternity · · Score: 0

      People who didn't publicly support Hillary were ostracized. People who supported someone else were brutally beaten in the streets, some to the point of being hospitalized. How many people REALLY supported her?

      --
      A bullet may have your name on it but splash damage is addressed "To whom it may concern."
    16. Re:But only 56% of scientists agree with this by Quakeulf · · Score: 0

      I always vote blank because I know no one represents me in the state anymore.

    17. Re:But only 56% of scientists agree with this by Anonymous Brave Guy · · Score: 2

      No, we don't wonder, it's because of a lack of reproducibility.

      Well, of course, but part of that is because people unwisely equate statistical significance with a hypothesis being true or false. Suppose you run an experiment twice and you get similar raw data in both cases, but in one case your result is marginally significant at whatever level and in the other it falls just short. The situation didn't fundamentally change between the two experiments, but a regrettable number of people who don't understand that the whole statistical analysis is built around probabilities and not absolute truths would probably report that the original result was "not reproducible". Even if the original researchers do understand, whoever is writing the headline when the result is reported might not.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    18. Re:But only 56% of scientists agree with this by Anonymous Coward · · Score: 0

      Remember who the other candidate was and the fact that a third-party candidate has a 0% chance of becoming president of the USA? Supporting Hillary was the only sane choice. There are several daily reminders in the news of how bad the other option was.

    19. Re:But only 56% of scientists agree with this by Gilgaron · · Score: 1

      That's a weird narrative, they were the ones saying Trump had a good chance there coming up to the election.

    20. Re:But only 56% of scientists agree with this by Anonymous Coward · · Score: 0

      Ostracised ? Haha ! I voted THEDONALD. Wore a VOTE-TRUMP sticker on my hat. Two nibberized DemoRat snowflakes attacked me, tried to beat me up. I'm an old gent, but quick as sin. Side-stepping ... and drew my 25-cal SNS and shot them dead. Kicked them into a gutter for dogs to eat.

    21. Re:But only 56% of scientists agree with this by apoc.famine · · Score: 1

      Ideally scientists in all the different fields would use the statistics that make the most sense for their specific study

      I think astronomy is a great example of this. You can spend years trying to find similar enough observations to compare, and often it's just not close enough to draw high-confidence (low-P-value) conclusions from. It doesn't mean the observations are worthless - they may well hint at some new concept. But it takes decades to really pin things down, because the universe is a very large place, and our technology still needs time to improve. The alternative would be to never publish astronomical research, because it doesn't meet an arbitrary P value. It's not like a particle accelerator where you can just run it for another couple of weeks and get more data.

      --
      Velociraptor = Distiraptor / Timeraptor
    22. Re:But only 56% of scientists agree with this by umghhh · · Score: 0

      Is that really so? I am not USian so I do not care either way, but I cannot escape the impression that the rate at which the media report on Trump is very high: a simple fart is enough to get them hyperventilating. That is not very healthy. It also indicates that a person from outside, whether good or bad, is being fought with quite ugly means. This is one thing. The other is the actual choice. If the choice is between a person dead set on challenging the Ruskis and a newcomer with conspicuous credentials, you may be inclined not to vote at all or to vote blank, depending on what election system you have. The problem with your proposal (besides being a hindsight fallacy) is that by voting for a side you do not support, you get what you do not support, possibly including nuclear war.
      I do not have much sympathy for Trump and I do not have much sympathy for Hillary either. I also tend to believe that both of them, and the ugly choice USians had last year, are a symptom of a deeper problem with politics in the USA. It is not unique to the USA tho: there are similar problems with agency, and decisions made by parties may go against the will of the majority even if the rules are followed and everybody works in good faith, etc.

    23. Re:But only 56% of scientists agree with this by mysidia · · Score: 1

      Ideally scientists in all the different fields would use the statistics that make the most sense for their specific study

      How about: Researchers should engage a staff statistician to look at their study, develop a plan for what statistics to use, and sign off that the statistical bar of significance used is the right one for this kind of experiment and data?

      Either that or create a rubric that uses a bunch of Yes/No questions and a few numerical ranges to decide how statistical significance shall be measured and thresholded for each kind of study.

    24. Re:But only 56% of scientists agree with this by Anonymous Coward · · Score: 0

      AGW is needed to limit us. Can't overthrow the PTB riding a bicycle.

    25. Re:But only 56% of scientists agree with this by Anonymous Coward · · Score: 0

      Hillary was a terrible choice. At the time, Trump was more of an unknown. He was a dice roll - Trump didn't look promising, but no one was sure quite what he was going to do, and there was a chance he might actually do some good. There was absolutely no chance of anything good coming out of electing Hillary.

      As I say, Trump didn't deserve to win, but Hillary deserved every bit of her loss.

  2. Six-sigma! by msauve · · Score: 3, Funny

    Make it Six Sigma, which is really 4.78 sigma (or something like that, I forget the actual number), because they allow a fudge factor to accommodate the fact that 6 sigma isn't realistic.

    --
    "National Security is the chief cause of national insecurity." - Celine's First Law
    1. Re:Six-sigma! by ShanghaiBill · · Score: 4, Insightful

      Make it Six Sigma

      That would eliminate many false positives, as well as eliminating nearly all true positives. Of course, this will do nothing to reduce flawed studies caused by reasons other than statistics, such as non-representative sampling (e.g.: most mouse studies use only male mice), poor experiment design, shoddy data gathering, sponsorship bias, and outright fraud.

      But, the cost of clinical studies would only increase by an order of magnitude, so what do we have to lose?

    2. Re:Six-sigma! by Anonymous Coward · · Score: 0

      Make it Six Sigma, which is really 4.78 sigma (or something like that, I forget the actual number), because they allow a fudge factor to accommodate the fact that 6 sigma isn't realistic.

      I'm not an expert in the field, but my guess is that 5-sigma implies a lot of repetitions of the same experiment. You can do a lot of atom smashing to have 5 sigma for the Higgs Boson, but you can't really do MRI studies in 1 million people to have a 5 sigma of something "small". Not everything is black and white. You can reasonably well define humans as having between 5 feet and 7 feet (you have some outliers, but not many, so this easily has a "high sigma" value). But how do you apply the same principle to something that has, in itself, a low probability? "Gene X doubles your chances of having disease Y." But disease Y shows up in 1% of the population. So you're down to studying a million people to be able to confidently say that with 5 sigma. Suddenly, any scientific experiment costs billions of dollars and takes decades to make.

    3. Re:Six-sigma! by Hognoxious · · Score: 1

      most mouse studies use only male mice

      I bet they're white & middle class too.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    4. Re:Six-sigma! by Anonymous Coward · · Score: 0

      In physics, 5 sigma is the generally accepted standard for making any discovery claims.
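
      To put the "sigma" language and the p-value thresholds from the summary on the same scale, here is a small conversion sketch. It assumes a normally distributed test statistic, and it treats the physics 5-sigma figure as a one-sided tail and the 0.05 / 0.005 thresholds as two-sided; the helper names are made up for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def one_sided_p(sigma: float) -> float:
    """Upper-tail probability beyond `sigma` standard deviations."""
    return STD_NORMAL.cdf(-sigma)


def two_sided_sigma(p: float) -> float:
    """Number of sigmas that corresponds to a two-sided p-value."""
    return STD_NORMAL.inv_cdf(1.0 - p / 2.0)


if __name__ == "__main__":
    print(f"5 sigma, one-sided p: {one_sided_p(5.0):.3g}")
    for p in (0.05, 0.005):
        print(f"two-sided p={p}: {two_sided_sigma(p):.2f} sigma")
```

      Under these assumptions 5 sigma is a one-sided p of roughly 3 in 10 million, whereas p < 0.05 and p < 0.005 sit near 2 and 2.8 sigma.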

    5. Re:Six-sigma! by pipingguy · · Score: 1

      "most mouse studies use only male mice"

      Maybe they're not asking for directions in the maze.

  3. Re:More Climate Change Bullshit by Anonymous Coward · · Score: 0

    your a doosh

  4. Re: More Climate Change Bullshit by Anonymous Coward · · Score: 0

    Global Climate Change doesn't exclude cooling so you should get onboard.

  5. Must be tenured by Anonymous Coward · · Score: 0

    A behavioral economist wanting stricter significance?

    Career suicide.

    1. Re:Must be tenured by crunchygranola · · Score: 4, Insightful

      Nah. He just wants to eliminate 95% of the competition.

      --
      Second class citizen of the New Gilded Age
  6. Re: More Climate Change Bullshit by Anonymous Coward · · Score: 0

    That doesn't conform to the narrative so we shouldn't talk about that.

  7. Re: More Climate Change Bullshit by Anonymous Coward · · Score: 0

    Just like they told us by 1990 that you wouldn't be able to go outside without a gas mask. Their point is still valid even if wrong.

  8. This won't fix anything by Bueller_007 · · Score: 5, Insightful

    There's a trade-off between sensitivity and specificity. If you increase the threshold for "significance", you reduce the power to discover a significant effect when it truly does exist.

    And a major part of the problem with scientific studies is that they are already underpowered. According to conventional wisdom, ideally, scientists should strive for a power of about 80% (i.e., an 80% chance of detecting an effect if it truly exists), but very few studies actually achieve power of this level. In many fields, the power is less than 50% and sometimes much less.

    Underpowered studies result in two major problems:
    1) Most obviously, an underpowered study results in a greater number of FALSE NEGATIVES. You fail to find a true effect. Either you publish your incorrect result of no effect (and why should we consider published false positives to be any worse than false negatives?), or you don't publish your study at all because you couldn't reach significance. The latter exacerbates the "file-drawer effect" and also results in wasted research dollars because the results aren't published.
    2) Somewhat counterintuitively, underpowered studies are often also more likely to result in FALSE POSITIVES. This is because, when your power to detect a true effect is low, and if you test a large number of effects that are unlikely to be null, most of the hypotheses that you say are "significantly" non-null will actually be false positives. We would say that the "false discovery rate" tends to be very high when the power is low.

    Reducing the level of significance will do little to address these problems, and in some cases may even exacerbate the problem.

    The key is *to move away from the binary concept of "significance" altogether*. It's obviously artificial to have an arbitrary numerical cutoff for "matters" vs. "doesn't matter", and this is not what Ronald Fisher intended when he popularized the p-value or developed the concept of "significance".

    What we should be doing is measuring and reporting effect sizes along with their credible intervals. While using priors that are based on our real state of knowledge. In other words, we should be doing Bayesian statistics.

    1. Re:This won't fix anything by Bueller_007 · · Score: 1

      A large number of results that are likely to be null, I mean.

    2. Re: This won't fix anything by Anonymous Coward · · Score: 0

      This is spot-on.

    3. Re:This won't fix anything by Anonymous Coward · · Score: 0

      What this guy said, times 1000.

      With enough replicates, you can make any *irrelevant* effect (e.g. Cohen's d = 0.01 standard deviations) display a "significant p-value" (p < 0.005).

      The p-value tells you more about the number of replicates you obtained than about the "realness" of an effect. In the example I gave, when the effect size is *clearly* insignificant, it really doesn't matter if you get p < 0.05 or p < 0.005 or p < 0.0000005: your effect is practically *zero* for all reasonable intents and purposes.
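
      A minimal sketch of that point, using a two-sample z-test as a stand-in for whatever test a real study would run. It assumes two equal-sized groups with unit variance and an observed difference of exactly d = 0.01 with no sampling noise; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_p(observed_d: float, n_per_group: int) -> float:
    """Two-sided z-test p-value for a standardized mean difference.

    Assumes two equal-sized groups with known unit variance, and that the
    observed difference equals `observed_d` exactly (no sampling noise).
    """
    z = observed_d * sqrt(n_per_group / 2.0)
    return 2.0 * NormalDist().cdf(-abs(z))


if __name__ == "__main__":
    for n in (100, 10_000, 200_000, 1_000_000):
        print(f"n per group = {n:>9}: p = {two_sample_p(0.01, n):.3g}")
```

      With 100 per group the p-value is nowhere near significant; with a million per group it falls far below 0.005, even though the effect is the same negligible 0.01 standard deviations throughout.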

    4. Re:This won't fix anything by Immerman · · Score: 1

      I think you answered your own question in (1) - Negative results rarely get published, and so false negatives rarely propagate beyond the researchers involved. False positives are more of a problem specifically because they are far more likely to misguide others.

      Now, the fact that negative results rarely get published is a whole different problem...

      As for (2) - I guess I just don't see how it's counterintuitive - isn't the entire problem with an underpowered study the fact that the real results are likely to get lost in the random noise? In that case both false positives and false negatives would seem to be equally likely.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    5. Re:This won't fix anything by Anonymous Coward · · Score: 0

      False positives and false negatives are not equally likely. It depends on the power, the significance level, and the proportion of hypotheses tested that are actually true (which is generally unknown).

      In something like a genome-wide association study where you are testing millions of SNPs for an association with a disease, almost all of the hypotheses tested are likely to be false, so in an underpowered study the majority of "significant" discoveries are likely to be false.
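
      The dependence on power, significance level, and the share of true hypotheses can be written down directly. This is a sketch with made-up illustrative inputs, not figures from any real study; the function and parameter names are hypothetical.

```python
def false_discovery_rate(alpha: float, power: float, prior_true: float) -> float:
    """Expected share of 'significant' results that are false positives.

    alpha      -- significance threshold (false-positive rate per true null)
    power      -- chance of detecting an effect that really exists
    prior_true -- fraction of tested hypotheses with a real effect
    """
    false_pos = alpha * (1.0 - prior_true)
    true_pos = power * prior_true
    return false_pos / (false_pos + true_pos)


if __name__ == "__main__":
    # Hypothetical scenario: 1 in 10 tested hypotheses is actually true.
    for power in (0.8, 0.2):
        for alpha in (0.05, 0.005):
            fdr = false_discovery_rate(alpha, power, prior_true=0.1)
            print(f"power={power} alpha={alpha}: FDR={fdr:.2f}")
```

      In this toy scenario, dropping the power from 80% to 20% at alpha = 0.05 pushes the false discovery rate from about a third to over two thirds, which is the sense in which underpowered studies produce more false positives among their "discoveries".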

    6. Re:This won't fix anything by Compuser · · Score: 1

      Bayesian statistics is all fine and good but prior selection is an art. Most biologists have trouble with the frequentist approach (beyond what some software tool produces automagically). Do you think they will suddenly be able to specify a proper prior? I would guess that 99% of papers will use Jeffreys (uninformative) priors. And the vast majority of the remainder will be junk, violating, oh I don't know, causality? Good luck with that proposal.

    7. Re:This won't fix anything by Immerman · · Score: 1

      Actually, I don't think that example undermines the probability of false negatives/positives themselves - just their impact on the results. Rather it's demonstrating the related statistical property...whose name I forget... that the accuracy of the test alone doesn't directly correspond to the likelihood that the results are accurate, you also have to consider the actual probability within the population.

      e.g. - if a medical test for X is 99% accurate (for both positives and negatives), but only 1 in ten thousand people actually have X , then when you test 1 million people you'll expect:
      Reality:
      100 positives
      999900 negatives

      Test results:
      99 positives
      1 false negative
      989901 negatives
      9999 false positives

      In which case testing negative does indeed give you at least 99% certainty that you don't have X, but testing positive still only means you have a ~1% chance of actually having X.

      And obviously if the population probabilities were reversed, and only one in 10k *doesn't* have X, then the results would be reversed, and testing negative would only give you a ~1% certainty of not having it.
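
      The arithmetic in the example above can be checked with a few lines of Python. This is only a sketch: the function name and its parameters are made up for illustration, and it assumes the test is equally accurate for people with and without X, as the example does.

```python
def screen(population: int, cases_per_10k: int, accuracy_pct: int) -> dict:
    """Tally test outcomes for a population, using integer arithmetic."""
    have_x = population * cases_per_10k // 10_000
    no_x = population - have_x
    true_pos = have_x * accuracy_pct // 100
    false_neg = have_x - true_pos
    true_neg = no_x * accuracy_pct // 100
    false_pos = no_x - true_neg
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "true_neg": true_neg,
        "false_pos": false_pos,
        # Chance that a positive result is a real case of X.
        "p_x_given_positive": true_pos / (true_pos + false_pos),
        # Chance that a negative result really means no X.
        "p_no_x_given_negative": true_neg / (true_neg + false_neg),
    }


result = screen(population=1_000_000, cases_per_10k=1, accuracy_pct=99)
```

      For the numbers above it gives 99 true positives against 9999 false positives, so a positive result is right about 1% of the time, while a negative result is right far more than 99% of the time.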

      Obviously that does get especially bad in exploratory studies, where the odds of any given correlation are extremely low. Which is why such studies should only be considered a starting point for further investigation, and not actually indicative of anything in and of themselves. In my example, if you could achieve 99% certainty of a correlation (which would be pretty impressive in itself), you've accomplished a great deal by narrowing down the field of potential interest a hundred-fold while probably overlooking only a single real correlation. That's incredibly helpful for a shot in the dark, but still only a starting point.

      It's a property of statistics that I *really* wish got a lot more publicity. Those sorts of counterintuitive properties of statistics should be featured front-and-center in every introductory statistics course in the world, rather than endless B.S. about how to perform the calculations. Anyone who uses statistics for anything needs to be intimately familiar with the common gotchas and how to avoid them, while calculations are an implementation detail of little interest to anyone but the programmers creating the corresponding spreadsheet functions.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    8. Re:This won't fix anything by mesterha · · Score: 1

      According to conventional wisdom, ideally, scientists should strive for a power of about 80% (i.e., an 80% chance of detecting an effect if it truly exists), but very few studies actually achieve power of this level. In many fields, the power is less than 50% and sometimes much less.

      How does one control this in practice? Let's say I want to compare means. Do I need to assume the difference is greater than X to lower bound my power (modulo a bunch of assumptions)? I don't really see this in practice. Is it implied by other details of the experiment?
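
      One common textbook answer is an a priori power calculation, which does require assuming a smallest difference worth detecting. The sketch below uses the normal approximation for a two-sided comparison of two means with equal group sizes; the function name and defaults are hypothetical, and a real study would normally rely on a dedicated power-analysis tool.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for comparing two means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the assumed difference in units of standard deviations.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) / effect_size_d) ** 2)


if __name__ == "__main__":
    for alpha in (0.05, 0.005):
        print(f"alpha={alpha}: n per group = {n_per_group(0.5, alpha=alpha)}")
```

      Under these assumptions, an effect of d = 0.5 at 80% power needs about 63 per group at alpha = 0.05 and about 107 per group at alpha = 0.005.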

      Somewhat counterintuitively, underpowered studies are often also more likely to result in FALSE POSITIVES. This is because, when your power to detect a true effect is low, and if you test a large number of effects that are unlikely to be null, most of the hypotheses that you say are "significantly" non-null will actually be false positives. We would say that the "false discovery rate" tends to be very high when the power is low.

      I don't understand. Yes, if you do a large number of tests then you are likely to get more false positives. How is this connected to the tests being underpowered? Are you claiming that we will do more experiments/tests when things are underpowered? I think the details would be hard to quantify.

      What we should be doing is measuring and reporting effect sizes along with their credible intervals. While using priors that are based on our real state of knowledge. In other words, we should be doing Bayesian statistics.

      I do agree that giving things like confidence intervals would be more informative. I'm not sure about using Bayesian statistics. It does make the results more intuitive but at the cost of potentially adding uninformed/convenient/manipulated priors. I assume, if there is enough data to make the results insensitive to the priors then it's probably equivalent to a frequency based approach. (It would be nice to see a counterexample to this claim.) Personally I would be curious if bootstrapping approaches could be used. This would remove a lot of bad parametric assumptions, and we've got the computational power to compute them...
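
      For what a bootstrapping approach might look like, here is a minimal percentile-bootstrap sketch for the difference between two group means. The data are made up for illustration, the function name is hypothetical, and the percentile interval is only one of several bootstrap variants.

```python
import random
from statistics import mean


def bootstrap_mean_diff_ci(a, b, n_boot=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(mean(resampled_a) - mean(resampled_b))
    diffs.sort()
    tail = (1.0 - confidence) / 2.0
    lower = diffs[int(tail * n_boot)]
    upper = diffs[int((1.0 - tail) * n_boot) - 1]
    return lower, upper


# Hypothetical measurements, made up for illustration only.
treated = [10.0 + 0.1 * i for i in range(30)]
control = [0.1 * i for i in range(30)]
low, high = bootstrap_mean_diff_ci(treated, control)
```

      No distributional form is assumed for either group; the interval comes entirely from resampling the observed values.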

      --

      Chris Mesterharm
  9. Is this just scientists arguing over math by rsilvergun · · Score: 0

    or is this more attempts to discredit climate change science by setting an impossible bar for proof? I honestly don't know, it's the first I've heard of this. Still, it's hard to imagine it being controversial otherwise.

    --
    Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
    1. Re: Is this just scientists arguing over math by Anonymous Coward · · Score: 1

      I'm not sure what climate science has to do with this. It was published in a Psychology journal which is firmly in the realm of the social/biological sciences. Climate scientists have their own basket of statistical issues they have to deal with.

  10. Not convinced this will help by Anonymous Coward · · Score: 5, Interesting

    I'm not convinced this will help. There are a couple of issues here. Often, the experimental design can be changed, like how certain variables are controlled for, to get a p-value that's below the threshold. The other problem is that p-value is sensitive to the sample size. If you want a lower p-value, increase the sample size. In many cases, p-values aren't a good way to show whether a result is useful or not.

    I'm a meteorologist and I research severe thunderstorms. Let's say that I want to test whether a particular variable is useful in discriminating between tornadic and non-tornadic supercells. One approach might be to calculate the mean of that variable for tornadic supercells and the mean for non-tornadic supercells. The null hypothesis is that the means of the two samples are the same, and I calculate a p-value. If the sample size is large enough, that is, I've included enough supercells, I can make even very small differences in the means appear statistically significant.

    A better approach is to use that variable as a predictor and have two data sets -- a training data set and a testing data set. I then calculate a function to classify storms based on the training data set, using the variable as a predictor of whether a storm will be tornadic or not. Then I test its accuracy with the testing data set, and the metric of success is the accuracy of the variable (hits, misses, and false alarms) in predicting whether a storm will be tornadic or not. This is better because increasing the sample size alone isn't going to produce a statistically significant result.

    Normally, some kind of baseline is chosen, and you want to show that your method performs better than the baseline. Of course, the problem is that you have a lot of flexibility in how to choose this baseline, and reviewers still need to be careful in how they evaluate work. For example, let's say that I cite a paper saying that climatologically, 20% of supercells are tornadic. I could randomly guess whether a supercell is tornadic based on that 20% probability and use that as my baseline. If my work is useful, hopefully I outperform random guessing based on climatology.
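
    A toy sketch of that train-then-verify workflow, using a single cutoff on one predictor as the classifier. The storm values are invented, the function names are hypothetical, and a real study would use a proper classifier and far more cases.

```python
def fit_threshold(training):
    """Pick the cutoff on the predictor that classifies the training set best."""
    def n_correct(cutoff):
        # Predict "tornadic" when the predictor is at or above the cutoff.
        return sum((value >= cutoff) == tornadic for value, tornadic in training)

    candidates = sorted({value for value, _ in training})
    return max(candidates, key=n_correct)


def verify(cutoff, testing):
    """Score the fitted cutoff on storms that were not used for fitting."""
    counts = {"hits": 0, "misses": 0, "false_alarms": 0, "correct_negatives": 0}
    for value, tornadic in testing:
        predicted = value >= cutoff
        if predicted and tornadic:
            counts["hits"] += 1
        elif tornadic:
            counts["misses"] += 1
        elif predicted:
            counts["false_alarms"] += 1
        else:
            counts["correct_negatives"] += 1
    return counts


# Invented (predictor value, was tornadic) pairs, for illustration only.
train = [(1.0, False), (2.0, False), (3.0, False), (4.0, True), (5.0, True), (6.0, True)]
test = [(1.5, False), (3.5, True), (4.5, True), (5.5, False), (2.5, False)]

cutoff = fit_threshold(train)
scores = verify(cutoff, test)
```

    The point of the split is that the cutoff is chosen without ever seeing the testing storms, so the hits, misses, and false alarms measure how well the variable predicts rather than how large the sample is.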

    This isn't the best way, though, because we know of several variables that are useful in predicting whether supercells will be tornadic or not. A better baseline would be to include variables that are known to be useful and then test whether the additional variable adds skill or not. It also helps to have some physical explanation why a particular variable would affect whether a supercell is tornadic or not.

    There are cases where p-values are useful, but it's also very easy to abuse them. There's no substitute for vigilant reviewers who can spot misuses of statistics. There's nothing magical about a p-value of 0.05 or 0.005. I have no problem with p-values being presented, but I think a better approach would be to require that papers include more than p-values to demonstrate that a result is significant. I've described one such approach above that I use in my own research.

    1. Re:Not convinced this will help by Orgasmatron · · Score: 1

      Personally, I prefer the "successful predictions" measure of validity.

      --
      See that "Preview" button?
    2. Re:Not convinced this will help by Anonymous Coward · · Score: 0

      Personally, I prefer the "successful predictions" measure of validity.

      Virtually no scientific studies attempt to measure accuracy of predictions. There is a very strong emphasis on novelty in career science.

      New is wonderful. You become an instant expert. Testing for successful predictability is seen as a waste of grant funding as you are going over old work. Best case you prove someone else's work wrong, which probably won't pass peer review as the paper submission will be sent to experts in the field, i.e. the person that put in the work you are trying to prove wrong. Worst case you are merely repeating their work.

      Just one new finding is much more impressive on a CV than testing a dozen predictive theories for validity. So it isn't done.

    3. Re:Not convinced this will help by RespekMyAthorati · · Score: 1

      Wonder of wonders.
      A truly scientific post on slashdot.

    4. Re:Not convinced this will help by DarthVain · · Score: 1

      I couldn't agree more. The p-value is a very specific thing, and by itself isn't something that can really say something is valid or not. I've done my share of analysis for folks, and seen the results of others analysis to know that in many cases you can statistically prove just about anything you want to prove and in many cases there is an agenda that is more akin to we're trying to prove X where X is the desired outcome.

      It is all about the data and the methodology, more than the statistical math. Cherry pick the data, or manipulate it in a favorable way, or use a method that is going to give you the results you want to see, and you can take all the statistical math in the world and it won't make a difference. I've seen analysis I can clearly see is flawed, because of how they used the data, what the data actually means, how it was collected, the method of both the collection and the analysis, etc... Typically because there is a political agenda involved. I provide a large amount of data to folks for various things, but I can't really control how they use it. About the only time I would chime in is if someone were to question MY results and why they don't match THEIRS. At which point I would be more than happy to show not only how badly they did their own analysis, but how they likely willfully ignored a lot of aspects to prove whatever it is they are trying to prove. I wouldn't have my integrity blemished by someone else's shoddy work.

      That said, the real issue is the whole tenure process and the number of papers that it is compelled to generate, and not only the lack of review, but the lack of expert review. The volume just won't allow for it, and it also promotes "cheating" or at least taking shortcuts. I mean, you hear stories of fake papers that are published by reputable organizations, which are just made-up humorous things that were someone's joke, but no one reviewed them, or if someone actually did, it was only cursorily, or they just were not knowledgeable enough on the topic to really separate the chaff from the wheat. Anyway, fix the incentive to produce a million papers, and require expert review prior to publishing, and the problem is solved, p-value or not.

    5. Re:Not convinced this will help by mesterha · · Score: 1

      I can make even very small differences in the means appear statistically significant.

      Small differences in the mean can be statistically significant. And yes you need large sample sizes to show this. Not sure what you mean by appear...

      A better approach is to use that variable as a predictor and have two data sets -- a training data set and a testing data set. I then calculate a function to classify storms based on the training data set, using the variable as a predictor of whether a storm will be tornadic or not.

      I wouldn't say this is a better approach; you are trying to answer a different (and probably more interesting) question. Your conditional variable might have a high variance with a lot of overlap between tornadic or not. This will create a poor predictor. Doesn't mean that the means of the conditionals are not different.

      For showing a difference in the means, giving confidence intervals is probably a more intuitive presentation of the results. Instead of saying, with 0.05 confidence the means are different, one should give the confidence intervals that jointly hold with 0.95 confidence. This way one can easily see the size of the difference. Of course, one will lose statistical power with this technique, so it will be harder to hit the magic 0.05. Also, people will be less impressed if it's obvious the difference is small. I wonder what trade-off people would make to create a convincing looking pair of confidence intervals, and how their inevitable playing with the parameters to create a nice picture would bias the results...
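
      A minimal sketch of that kind of presentation, under assumptions that are not in the post: two independent samples of equal size, a known common standard deviation, and a normal approximation. It reports one interval for the difference in means rather than a pair of joint intervals, and all the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_of_means_summary(mean_a, mean_b, sd, n, confidence=0.95):
    """Two-sided z-test p-value and confidence interval for mean_a - mean_b.

    Assumes two independent samples of size n each, with known common sd.
    """
    std_normal = NormalDist()
    diff = mean_a - mean_b
    se = sd * sqrt(2.0 / n)  # standard error of the difference in means
    z = diff / se
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    half_width = std_normal.inv_cdf(0.5 + confidence / 2.0) * se
    return p_value, (diff - half_width, diff + half_width)


# Same tiny difference in means (0.02), two very different sample sizes.
p_small_n, ci_small_n = diff_of_means_summary(10.02, 10.00, sd=1.0, n=100)
p_large_n, ci_large_n = diff_of_means_summary(10.02, 10.00, sd=1.0, n=1_000_000)

print(f"n=100:       p={p_small_n:.3f}  CI={ci_small_n}")
print(f"n=1,000,000: p={p_large_n:.3g}  CI={ci_large_n}")
```

      With the large sample the p-value is tiny, but the interval makes it plain that the difference itself is only about 0.02.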

      --

      Chris Mesterharm
  11. When a measure becomes a target... by Anonymous Coward · · Score: 0

    "When a measure becomes a target, it ceases to be a good measure"

    This will just result in the fraudulent academics gaming their papers to achieve 0.005 instead of 0.05. p-value is a measure that tells you something, it shouldn't be a target. Unfortunately due to the way science is economically configured, I don't think it is likely to stop being a target.

  12. Increased costs for drugs by whoever57 · · Score: 1

    This will mean that big pharma will have to run an order of magnitude more studies until they can find the one study which can be published because it shows a positive correlation.

    [yes, I know statistics don't really work that way]

    --
    The real "Libtards" are the Libertarians!
    1. Re:Increased costs for drugs by crunchygranola · · Score: 5, Interesting

      This will mean that big pharma will have to run an order of magnitude more studies until they can find the one study which can be published because it shows a positive correlation.

      [yes, I know statistics don't really work that way]

      Actually they kind of do!

      A tactic that Pharma companies have pulled many times in the past is to try to keep generic drugs off the market by showing that they are not equivalent to the proprietary product. And they do this by running a couple of dozen animal studies, with the animals being given the two different products, with various physiological parameters being monitored. When one of these parameters is found to differ between the two drugs with p < 0.05 they submit the result to the FDA declaring that the two drugs are not equivalent in their effects (the parameter of course has nothing to do with the drug's actual pharmacological effect).

      Now with this standard they will have to run 200 or so tests to find one with p < 0.005.
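
      A rough sketch of that arithmetic, assuming the tests are independent and that there is no real difference between the two drugs (the function names are made up for illustration):

```python
def prob_at_least_one_hit(alpha, n_tests):
    """Chance that at least one of n_tests independent null tests gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_tests_until_hit(alpha):
    """Average number of independent null tests needed to get one p < alpha."""
    return 1.0 / alpha


for alpha, n_tests in [(0.05, 24), (0.005, 24), (0.005, 200)]:
    chance = prob_at_least_one_hit(alpha, n_tests)
    print(f"alpha={alpha}: {n_tests} tests -> {chance:.0%} chance of a spurious hit, "
          f"about {expected_tests_until_hit(alpha):.0f} tests needed on average")
```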

      --
      Second class citizen of the New Gilded Age
    2. Re:Increased costs for drugs by Anonymous Coward · · Score: 0

      it sort of works that way

    3. Re:Increased costs for drugs by Anonymous Coward · · Score: 0

      It's a double edged sword changing a "gold standard". When Big Pharma is sued because a peer reviewed study shows a statistically significant correlation between drug X and a heart attack or stroke or gynecomastia (male breasts) or ketoacidosis or toxic epidermal necrolysis etc... think how much it affects them to change it from .05 to .005

  13. Re: More Climate Change Bullshit by Anonymous Coward · · Score: 0

    Because we set better emissions standards and fixed the hole in the ozone layer.

  14. Re:Statistics by Anonymous Coward · · Score: 1

    Averages do not comprise the whole of statistics.

  15. Re: More Climate Change Bullshit by Anonymous Coward · · Score: 0

    That was then. This is now.

  16. Re:More Climate Change Bullshit by Anonymous Coward · · Score: 0

    Careful now, don't die doing something intellectually challenging like crossing the road.

  17. Publish more p values by Anonymous Coward · · Score: 0

    What's the harm in publishing a table of multiple p values, as many researchers do now, rather than just switching from publishing one p value to another? I'd prefer to see how the probabilities break and read the p values for 0.05, 0.01, and 0.005.

  18. Re: More Climate Change Bullshit by Anonymous Coward · · Score: 0

    "Global Climate Chance"? What are the chances that it's a tool for extracting money from American taxpayers? Pretty high.

  19. Re: More Climate Change Bullshit by Anonymous Coward · · Score: 0

    This. Even if they're wrong, they're more right than the Republicans.

  20. Fisher's comment by Anonymous Coward · · Score: 5, Informative

    If you want reproducibility, well then require reproducibility.
    Fisher, the inventor of p-values, said this about p-values:
    >>"[We] thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result" (Fisher, 1960, p.13-14).

    1. Re:Fisher's comment by Anonymous Coward · · Score: 0

      So because a 'method of experiment' doesn't necessarily produce a statistically significant result, the method of procedure needs to increase its threshold of significance, by design. I.e., the testing methodology needs to be even more stringent from the onset.

      It sounds like this guy wants a level of scrutiny, an unrealistic one at that, for all scientific study. Rather than an increasing refinement of methodology, as has been established for several hundred years.

      Yes, let's do that, since we got to now, being 'ham-handed' with all approaches. Can we just ignore this guy and continue doing sound science?

  21. Shining some light by Okian+Warrior · · Score: 5, Informative

    I'm a biologist, I don't understand P values [...]

    Here's some light for that subject.

    Suppose you make 20 measurements of rats in a maze and discover that 15 out of the 20 times they turn left on their first corridor junction. Is that significant?

    We know that if the decisions were random we'd expect 10 out of 20, but we also know that there is variation in that number. 10 out of 20 is the highest probability of individual outcome, but it's even *more* probable that something other than 10 out of 20 will occur.

    So to see if the 15 out of 20 is significant, we can compare this outcome to random chance.

    We can simulate 20 coin flips in a computer and then write down the number of heads versus tails. Then we do it again and write down the new results, and then do it again and again for a million rounds.

    Tallying the results, we can then find the *probability* that 20 random coin tosses will equal 15 or more heads, and this will give us a way to compare the rat data with random chance. What percent of random tosses yield 15 or more heads?
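
    The tally described above can be sketched as follows. This is only an illustration: it uses 100,000 rounds rather than a million to keep it quick, a fixed seed, and it also computes the exact binomial tail for comparison. The function names are made up.

```python
import random
from math import comb


def exact_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def simulated_tail(n, k, rounds=100_000, seed=0):
    """Fraction of simulated rounds of n fair coin flips with at least k heads."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(rounds):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        if heads >= k:
            hits += 1
    return hits / rounds


exact = exact_tail(20, 15)       # 21700 / 2**20, roughly 0.0207
approx = simulated_tail(20, 15)  # should land close to the exact value
print(f"exact: {exact:.4f}  simulated: {approx:.4f}")
```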

    This is the P-value in a nutshell: it's the probability that your measurements could be the result of chance.

    Note that we can never be *certain* that the results are significant, only that there is a *probability* that the results are significant. The probability of significance is chosen by convention depending on the outcome risks. For normal scientific studies, it's 5% (P < 0.05). If you're studying a new medicine, you might want to bump that up to 1% (P < 0.01) for safety. If you're exploring subatomic physics, and the experiments are very difficult to reproduce, you might want that to be P < .00001% to be relatively certain.

    The conventional value of 5% is often incorrectly attributed to Pearson. He said the 5% value makes the results worthy of more study, not that 5% value makes the results significant.

    Also of note, if everyone makes studies to P < 5%, then on average 1 out of 20 studies *will* be due to random chance, which means that fully 5% of all scientific studies are reporting random events.

    And of course, if your degree requires you to publish, or your tenure is based on your publishing history, there are ways to adjust the results to make the significance more likely.

    (For example, you can record 8 different measurements of your rats. There are 8*7 = 76 possible pairs of measurements, so on average about 3 of those pairs will correlate to within 5%. If you want to publish a paper, this is one way to do it.)

    Very, very few recent scientific papers have ever been verified (by reproducing), and many of those that were later examined were found to be unreproducible.

    This is leading people to lose faith in the scientific method.

    1. Re:Shining some light by Anonymous Coward · · Score: 3, Funny

      Extra credit: The numbers 8 and 7 are multiplied together twenty separate times. In nineteen out of the twenty calculations the result is 56. However, one calculation gives a result of 76. What's the p-value for this trial?

    2. Re:Shining some light by Anonymous Coward · · Score: 1

      This is the P-value in a nutshell: it's the probability that your measurements could be the result of chance.

      This is why real scientists use Bayes' theorem and provide confidence intervals.

    3. Re:Shining some light by interkin3tic · · Score: 1

      Okay, walk scientists around the world through this and you'll solve one of the major problems with science today. I was only pointing out the problem is not that we have a bad magic number, I was pointing out scientists are lazy when it comes to things outside their area of expertise.

    4. Re:Shining some light by Anonymous Coward · · Score: 0

      An understanding of statistics, a real understanding, should be a prerequisite for a career in a highly experimental field like biology, psychology and medicine. Otherwise we see the problems we are seeing.

      Alternatively, all experimental studies should be reviewed by a professional statistician and a biologist in peer review. The biologist could comment on the experimental setup, while the statistician could judge on the analysis.

    5. Re:Shining some light by Anonymous Coward · · Score: 0

      Suppose you make 20 measurements of rats in a maze and discover that 15 out of the 20 times they turn left on their first corridor junction. Is that significant?

      [...]

      Tallying the results, we can then find the *probability* that 20 random coin tosses will equal 15 or more heads, and this will give us a way to compare the rat data with random chance. What percent of random tosses yield 15 or more heads?

      If the rats had turned to the *right* 15 out of 20 times, you'd be asking the same question: is this significant? So, in your statistical test, you should be asking the question "What percent of random tosses yield 15 or more heads, *or* 15 or more tails?".
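
      As a sketch, the two-sided version of the question can be computed exactly, assuming a fair 50/50 null and independent trials (the function name is made up):

```python
from math import comb


def two_sided_p(n, k):
    """P(X >= k or X <= n - k) for n fair coin flips; assumes k > n / 2."""
    upper = sum(comb(n, i) for i in range(k, n + 1))
    lower = sum(comb(n, i) for i in range(0, n - k + 1))
    return (upper + lower) / 2**n


one_sided = sum(comb(20, i) for i in range(15, 21)) / 2**20
two_sided = two_sided_p(20, 15)
print(f"one-sided: {one_sided:.4f}  two-sided: {two_sided:.4f}")
```

      Because the fair-coin distribution is symmetric, the two-sided value is exactly double the one-sided one.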

      (I wonder, did you do this deliberately, to test if the reader understood the point you were making about multiple trials? The rest of your post is perfectly well-written and accurate.)

    6. Re:Shining some light by dcw3 · · Score: 1

      This is the P-value in a nutshell: it's the probability that your measurements could be the result of chance.

      Quoted from http://fivethirtyeight.com/fea...

      A common misconception among nonstatisticians is that p-values can tell you the probability that a result occurred by chance. This interpretation is dead wrong, but you see it again and again and again and again.

      --
      Just another day in Paradise
    7. Re:Shining some light by Shadow+of+Eternity · · Score: 1

      The difference between "How likely is this result assuming the Null Hypothesis is true" and "This result is X% likely to be random chance" is very, very slight when it comes to simply deciding whether or not a study's results merit further inquiry. So yes it's a misconception, but it's a misconception that's so microscopically close to the actual truth that it's alright for the layman or average undergrad to rely on it.

      The real problem is that we're using P-values all over the place for bullshit like the social sciences (I should know, I am one) and it gives a false air of scientificness to stuff that's about as scientifically sound as reading chicken entrails.

      --
      A bullet may have your name on it but splash damage is addressed "To whom it may concern."
    8. Re:Shining some light by Anonymous Coward · · Score: 1

      People aren't losing faith in the method, they have lost faith in those purporting to follow the method.

    9. Re:Shining some light by Anonymous Coward · · Score: 0

      I'm a biologist, I don't understand P values [...]

      Here's some light for that subject.

      As with the famous quote on quantum mechanics, anyone who claims to understand statistics really doesn't. Yes, one can fully understand the mathematics of statistical theory, but never fully whether the hypotheses apply to a "real world" problem.

    10. Re:Shining some light by Gilgaron · · Score: 1

      Usually we just hand the data to the stats department to crunch, with the example above I almost fear more for science if the people designing the experiments knew better how to game it.

    11. Re:Shining some light by Anonymous Coward · · Score: 0

      Poor choice of word. What you should have said is...

      This is leading people to lose confidence in the scientific method.

      That said, I'm more concerned with the current 'political right' in control of Government bodies, running with this for scientific funding.
      I can certainly see the inroads being used here, to hamstring entire sectors of scientific research, when setting the metrics of standard, unrealistically, and scientifically measurable, as too high.

    12. Re:Shining some light by Anonymous Coward · · Score: 0

      Hilarious - AC's right. Mod up.

    13. Re:Shining some light by pipingguy · · Score: 1

      I *thought* something was fishy with that arithmetic but then I figured maybe AC was talking about 'new math'. Mod parent up.

    14. Re:Shining some light by fropenn · · Score: 1

      Well, not exactly. It's the probability of seeing your results given your assumptions about the underlying distribution.
      In your example, you are assuming the coin flips exactly evenly on both sides (50% heads, 50% tails). If, in reality, a coin flips 51% heads and 49% tails (due to, say, the extra weight of the head), then your p-value would not be accurate.

    15. Re:Shining some light by Anonymous Coward · · Score: 0

      Have you ever heard the phrase "Pearls before swine"? If someone who claims to be a "biologist" doesn't understand some basic probability, statistics, calculus, algebra, arithmetic, logic, chemistry, physics, ... then we should just politely nod our heads and move on. Why? Because at some point the individual is responsible for their own competence, and someone claiming to be a professional (in ANY discipline) certainly should have some competence with the tools necessary to do the job. And if s/he doesn't, s/he should damn well get it PDQ. Do you really think your post is an adequate explanation? No offense, but it isn't. It's obvious to me that what is missing is the "biologist"'s willingness to expend the effort to become competent. S/he's not, your post was, at best pearls before swine (not to mention TL;DR).
      As far as P-values. They do not have one meaning. Context is important. I've used values of 0.5, 0.2, 0.1, 0.05, 0.01, but never 0.005. Why? because in that work, the difference between 0.01 and 0.005 would have been meaningless. The problem is, as I see it, two-fold. First we have people who don't understand their tools using them, they're Cargo Cultists. Second, we have lazy pseudo-analytical types who replace reasoned and educated analysis with arbitrary measures and then act and believe as if the map is the country. Some guy is quoted in the OP as claiming moving from 0.05 to 0.005 is "doable and easy". WTF!! That's such a ridiculous statement (admittedly, taken out of context). How much more work is it? How many additional samples/runs? How much more time? "easy"? LOL!
      I've got an alternative solution: instead of using the term "significance level" for P-values, we should use the term "Smurf level". So many problems will go away if we remove the implication that significance level establishes the significance of a result.

    16. Re:Shining some light by doom · · Score: 1

      An understanding of statistics, a real understanding, should be a prerequisite ...

      My understanding, gleaned from the p-value debate, and other things like it is, that the trouble is you can't expect there to be a magic formula that checks for Scientific Goodness.

      I used to want to understand things like p-values better, now that I do, I've concluded there's no substitute for just eyeballing the scatter. If you're worried you're fooling yourself, you could do it double-blind: hire someone who doesn't even know what the data is they're looking at, and ask them whether a graph looks like there's a correlation between variables x and y.

    17. Re:Shining some light by Anonymous Coward · · Score: 0

      We can simulate 20 coin flips in a computer and then write down the number of heads versus tails. Then we do it again and write down the new results, and then do it again and again for a million rounds.

      Unfortunately, understanding probability and statistics does not guarantee understanding computer science. Random number generators for that simulation are notoriously bad at being random. Like many other things, there are tradeoffs, and doing a million simulated coin tosses will have some difference from doing actual coin tosses. Of course, one could argue whether that difference is statistically significant.

    18. Re:Shining some light by MiniMike · · Score: 1

      What's the p-value for this trial?

      Pentium?

    19. Re:Shining some light by erapert · · Score: 1

      This is leading people to lose faith in the scientific method.

      This is leading people to lose faith in scientists.
      Also the constant pushing of political agendas.

      Scientists have only themselves to blame. What makes people angry is that those same scientists go around acting like they're the oracles of Truth and intelligence just because they put on a stereotypical white lab coat once or published a paper.

      Having your results reproduced several times by several other research groups should be the gold standard.
      Using your results to produce something should be the gold standard. "If you're so smart and what you've figured out is so neato then why ain't you rich?"
      As far as the public should care, the standard should be whether your study is useful and actually improves people's lives today.

      Scientists over the past few centuries have pushed the boundaries pretty far; everyone will agree to that.
      Perhaps what's needed now is to "digest" and make use of those discoveries and come up with useful things rather than some new "study" about how whale songs indicate how hungry the whale is or other such navel gazing and speculation.

    20. Re:Shining some light by elmohound · · Score: 1

      My old biometry professor, a former graduate student of R.A.Fisher, gave this pearl of wisdom at the conclusion of the course's final lecture. It was billed as the magic equation that should be heeded by all experimental biologists:

        C^1B(4).

      In other words, see a statistician before performing the experiment.

    21. Re:Shining some light by mesterha · · Score: 1

      And of course, if your degree requires you to publish, or your tenure is based on your publishing history, there are ways to adjust the results to make the significance more likely.

      Even if everyone is completely honest and knowledgeable, because of the publication bias, the significance is way off. Just keep doing experiments and eventually you will get a "significant" result. If we knew the number of experiments done, we could control for this problem, but in most fields this is impossible. Imagine a field where all the ideas are crap. If enough people work in the field, you will still get a lot of significant results. Automation and computer analysis might be partially driving this problem by allowing more "experiments" to be performed.

      --

      Chris Mesterharm
    22. Re:Shining some light by gumbi+west · · Score: 1

      [face palm] on mixing Bayes with confidence intervals.

    23. Re:Shining some light by mesterha · · Score: 1

      While I appreciate an intuitive explanation, I don't think it's as easy as you suggest, especially for someone practicing in the field as opposed to someone doing homework.

      So to see if the 15 out of 20 is significant, we can compare this outcome to random chance.

      So the null hypothesis is a 50/50 coin flip with IID Bernoulli trials. Let's say the result of the experiment is R L L R L L L L L L R R L L L R L L R L? What is the probability of this data given the null hypothesis is true? Well the probability is 2^-20. You really need to justify that the significant result is that I get 15 or more Ls. I'm not sure of a great way to pick this in general particularly as the experiment gets more complex.

      Also, you need to specify what is significant before you run the experiment. One can bound that area to have p=0.05. Of course, just because the experiment is not significant does not mean the data was generated at random. Assume I ran the experiment and I get R L R L R L R L R L R L R L R L R L R L. Any human is going to look at this and think something strange is going on. Perhaps the experiment is not IID. One probably needs to run more trials and use some of the data to formulate a model of what is generating the data to construct the significant region, but even that has issues...

      There are other problems if you have a limited amount of data. Let's say you get 17 lefts. Can you claim a p value of .0013? I don't think you can change your significance space after the experiment, otherwise it would be easy to cheat, particularly as the experiments get more complex. Not sure of the best way to resolve this. Probably need to consider sets of significance sets where each one has a different p value. This has its own problems.
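
      One way to picture fixing the region in advance is sketched below, assuming a one-sided test against a fair coin with IID trials; the function names are hypothetical. It finds the smallest count of lefts that would be declared significant at a threshold chosen before the experiment.

```python
from math import comb


def upper_tail(n, k):
    """P(X >= k) for n fair coin flips."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n


def critical_count(n, alpha):
    """Smallest k with P(X >= k) <= alpha under the fair-coin null."""
    for k in range(n + 1):
        if upper_tail(n, k) <= alpha:
            return k
    return None  # no count is extreme enough at this alpha


# Thresholds fixed before looking at any data.
for alpha in (0.05, 0.005):
    k = critical_count(20, alpha)
    print(f"alpha={alpha}: significant if lefts >= {k} (tail = {upper_tail(20, k):.4f})")

print(f"P(17 or more lefts) = {upper_tail(20, 17):.4f}")
```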

      Anyway I'm not a statistician. I'm sure these things have good answers, but I don't think it's so simple...

      --

      Chris Mesterharm
  22. Let's raise the standard for online journalism by sheramil · · Score: 1

    How many is a "mega-team"? A million, or one million, forty-eight thousand, five hundred and seventy-six?

    It appears to be seventy-two.

    1. Re:Let's raise the standard for online journalism by PPH · · Score: 1

      It appears to be seventy-two.

      Is this a statistically valid sample size for P<0.005? Have they repeated the study with different study teams to see if the results are repeatable?

      --
      Have gnu, will travel.
  23. Re: More Climate Change Bullshit by PPH · · Score: 1

    Citizen. It appears that you are using an older version of the Newspeak Dictionary. Off to Room 101 with you.

    --
    Have gnu, will travel.
  24. Re: More Climate Change Bullshit by Anonymous Coward · · Score: 0

    The cooling narrative was disproved so we now talk about Climate Change.

  25. Re: The standard is SIX SIGMA. by Anonymous Coward · · Score: 0

    Troll? So much of that going on these days I don't know whether or not to respond. Six-sigma stuff is for when you're making thousands of widgets. You rarely have sample sizes large enough in the biological or social sciences to do anything at a "six sigma" level.

  26. Re: More Climate Change Bullshit by plopez · · Score: 1

    In the 50s there were places where going outside without a gas mask was dangerous. Beijing is a place dangerous to go outside without a gas mask. They were right.

    --
    putting the 'B' in LGBTQ+
  27. Re: More Climate Change Bullshit by Anonymous Coward · · Score: 0

    But that is what we knew at the time. Now we know other things.

  28. Unlikely to change much by SlaveToTheGrind · · Score: 2

    This would just provide a new target for the p-hackers.

    1. Re: Unlikely to change much by Anonymous Coward · · Score: 0

      Pretty much. The problem isn't so much the choice of magic number as the amount of wishful thinking that goes into analysis in studies when that beautiful p=.05 is what's required to keep putting food on the table.

    2. Re: Unlikely to change much by Anonymous Coward · · Score: 0

      I also meant to add that if there was funding available for good studies with boring conclusions instead of for mediocre studies with interesting conclusions scientists wouldn't be punished for being honest with themselves and wouldn't feel the same compulsion to massage data.

  29. Re: More Climate Change Bullshit by Anonymous Coward · · Score: 0

    There were also places where going outside with or without a gas mask was unsafe (like Chicago).

  30. Re: Statistics by Anonymous Coward · · Score: 0

    One of the requirements for statistics to be valid is that the sample be random. Picking those two people in your story were not random.

  31. Show us the proof! by petes_PoV · · Score: 1

    "It seemed like this was something that was doable and easy, and had worked in other fields."

    So where are the studies that "prove" this? Oh, and they'd better have a significance of 0.005 or better.

    --
    politicians are like babies' nappies: they should both be changed regularly and for the same reasons
  32. lowering the P-value won't help. by rew · · Score: 4, Informative

    The problem with current research in semi-soft sciences like biology and medicine is that the scientists use this p-value wrong.

    If you suspect a glass of wine a day will lower chances of heart disease, take 1000 volunteers, roll a die and half of them you tell to not have that wine-a-day and the other half you tell, please drink one glass of wine a day. Next you wait two years, and evaluate the incidence of heart problems in the two groups. That's where 0.05 P-value is acceptable. (In practice, telling people to suddenly stop or start drinking is not going to go well.)

    Things become problematic when you suspect: "something we can measure may be related to this disease" (e.g. Sarcoidosis), you take 200 patients and 200 healthy people and then measure 200 parameters in each of the 400 blood samples... Provided there is little to measure, you'll find about 1/20th or about 10 parameters that DO seem to be (p=0.05) different between the two groups.

    In the case at hand one or two measurable parameters ARE, different in the patient-group. So you'll have a better than 95% chance of finding those. Of the 198 other parameters you'll find 1/20th of false positives, for a total of almost 12 publishable results.
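
    The back-of-the-envelope count above can be written out as a small sketch. The 95% power for the real parameters and the assumption that the tests are independent are taken from the numbers in this comment, not from any actual study, and the function name is made up.

```python
def expected_findings(n_params, n_real, alpha, power):
    """Expected (false positives, true positives) when screening n_params at once."""
    false_pos = (n_params - n_real) * alpha  # noise parameters passing by chance
    true_pos = n_real * power                # real differences actually detected
    return false_pos, true_pos


for alpha in (0.05, 0.005):
    false_pos, true_pos = expected_findings(200, 2, alpha, power=0.95)
    print(f"alpha={alpha}: ~{false_pos:.1f} false + ~{true_pos:.1f} real "
          f"= ~{false_pos + true_pos:.1f} 'publishable' results")
```

    Holding the power at 95% for the stricter threshold is itself an assumption; with the same sample size the power would normally drop.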

    Should you want to increase your chances of finding these publishable results, the sample size needs to be relatively small. The group of 200 patients and 200 healthy people might already be too big to get enough spurious results. Even if they don't do this consciously, the scientists will quickly be able to optimize their sample size to find publishable results.

    When I was a freshman in 1985, some guy asked me to help him put his research in the computer. He had formulated 50 or so questions and predicted boys would answer differently than girls. So he went into a classroom, interviewed 30 boys and girls and put his results in the computer. Of course the computer told him there were several significant differences between boys and girls. Some of them real (do you like to play with trains? Dolls?) some of them not (I don't remember the example).

    The other example is more recent. A Dutch Doctor got her PhD with (among others) the described sarcoidosis research. But my run-ins with the subject are very limited simply because I don't move in those circles. This is way more widespread than just the few examples that I encounter personally.

    Then people try to "fix" this by proposing the wrong solutions.

    The research: "can we find a parameter that allows us to differentiate between the two groups" is very important as well. But you have to do your research in the right way. Take 100 patients and 100 healthy people and find the parameters that seem to make a difference. NOW you go into the second half of the research with a hypothesis: "this parameter is important" and verify your claim. Now the p=0.05 is acceptable. (a 5% chance that you're wrong, as opposed to a 95% chance you're full of shit).
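
    A toy simulation of that two-stage design, as a sketch only: every parameter here is pure noise (standard normal in both groups), the test is a simple z-test with known standard deviation, and the names and sizes are made up for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def z_test_p(group_a, group_b, sd=1.0):
    """Two-sided z-test p-value for a difference in means (known sd)."""
    se = sd * sqrt(1.0 / len(group_a) + 1.0 / len(group_b))
    z = (fmean(group_a) - fmean(group_b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def two_stage_screen(n_params=200, n_per_group=100, alpha=0.05, seed=1):
    """All parameters are noise, so every 'hit' is a false positive."""
    rng = random.Random(seed)

    def draw():
        return [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]

    # Stage 1: exploratory screen over every parameter.
    candidates = [i for i in range(n_params) if z_test_p(draw(), draw()) < alpha]
    # Stage 2: fresh patients and controls, testing only the stage 1 candidates.
    confirmed = [i for i in candidates if z_test_p(draw(), draw()) < alpha]
    return candidates, confirmed


candidates, confirmed = two_stage_screen()
print(f"stage 1 hits: {len(candidates)} of 200; confirmed on fresh data: {len(confirmed)}")
```

    On average about 5% of the noise parameters pass the first stage, and only about 5% of those survive the second, which is the point of going in with a hypothesis.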

  33. What this guy said, times 1000. by neoshroom · · Score: 3, Funny

    What this guy said, times 1000.

    According to conventional wisdom, ideally, scientists should strive for a power of about 80000% (i.e., an 80000% chance of detecting an effect if it truly exists), but very few studies actually achieve power of this level. In many fields, the power is less than 50000% and sometimes much less.

    --
    Big apple, new Yorik, undig it, something's unrotting in Edenmark.
    1. Re:What this guy said, times 1000. by Rockoon · · Score: 1

      At these levels, the coherence will be responsible for many causations.

      --
      "His name was James Damore."
  34. What this guy said, times 1000. by neoshroom · · Score: 4, Funny

    What this guy said, times 1000.

    I think we should standardize around "What this guy said, times 10,000" to make sure the effect is truly significant.

    --
    Big apple, new Yorik, undig it, something's unrotting in Edenmark.
  35. Same for medical results? No? by Anonymous Coward · · Score: 0

    Whoa, who would have guessed, not for medical studies....

  36. The current problem with the degree factory by Master+Of+Ninja · · Score: 4, Insightful

    After viewing it first hand, there are a lot of people going through "degree factories", getting degrees that give them only the basics of statistical knowledge. And a little knowledge is very dangerous. The p-value is a useful measure, but it's been simplified to (p less than 0.05 = good) in biomedical circles. And if you read the other upvoted threads, or read some of the linked articles, you'll understand why this is a big problem.

    There are a few tensions here that I think may be causing this: (a) publish or perish - if it looks reasonable enough, publish because that's where your next job comes from, (b) poor statistical training - can be from both the authors' and reviewers' side, (c) unwillingness to fund or publish work that is reproducing previous results - there is a publisher-created publication bias, (d) the general high cost of patient-centred biomedical research, meaning you generally have low sample numbers, (e) the unwillingness in some disciplines to get formal statistical input.

    What are the potential solutions? If there was an unrestricted money pool you can recruit adequately (n>10000) to each study, but the money is not there, and there are some very rare diseases around. Better statistical training would be ideal, and there has been a push towards Bayesian analysis: I would think that as in most statistical tools someone will eventually find a way to inappropriately use them. Self-publish as an option - could be possible: I've seen some horrifically bad peer reviewed articles (& predatory journals!) but there is an ethical tension between publishing without review which could just flood the literature with absolute garbage which is difficult to sort through, and actual proper peer review. Maybe something like Arxiv for biomedical science, although there would be a lot of resistance to it I suspect.

    I don't hold too many hopes for a quick solution to this as there are a lot of vested interests, and people using the best newfangled statistical methods they've learned. I've even reviewed a paper recently, with multiple authors from a big university, where I just shook my head at the amount of statistical fudging that took place: the authors had imputed about 80% of their primary predictor variable for an outcome, and then came up with a conclusion based on the imputed data. I just shook my head that this was actually allowed nowadays. While this article is good, some of the authors have been banging on about it for some time without much change.

    1. Re:The current problem with the degree factory by Anonymous Coward · · Score: 0

      Is there an equivalent to Gish Galloping in the scientific field?

  37. We know why this happens. by Anonymous Coward · · Score: 0

    People who don't like the consequences or answers or can make money off what is refuted by the results.

  38. p-value in a simpler nutshell by DrYak · · Score: 1

    Another way to put it in simpler terms.

    If you replicated this same study (rats in the labyrinth), then
    If the result (see the 15 out of 20 rats) was only due to random chance, with a p-value of 0.05, such a result would only occur in 5% of attempts to replicate the experiment, i.e.: only 1 of 20 attempts to replicate the rat-in-the-lab would give results as skewed as 15 rats out of 20 by pure luck.

    For the proposed threshold p-value of 0.005, such results could happen by random chance only in 0.5% of attempts to replicate, i.e.: only 1 of 200 attempts to replicate the rat-in-the-lab would give results as skewed as 15 rats out of 20 by pure luck.
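    For a concrete number, here is a small sketch of the exact one-sided p-value for 15 rats out of 20. It assumes the null hypothesis is that each rat independently picks a side with probability 0.5, which is an assumption made for this illustration; `binomial_tail` is a made-up helper name.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of a result at least this skewed."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_one_sided = binomial_tail(15, 20)
print(round(p_one_sided, 4))  # 0.0207
print(p_one_sided < 0.05)     # True: passes the current threshold
print(p_one_sided < 0.005)    # False: would not pass the proposed one
```

    Under that assumption, 15 out of 20 counts as significant at 0.05 but not at 0.005.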

    My opinion :
    meh... 5% or 1-in-20 seems to me good enough as long as other teams try to replicate the experiment.
    (Hence the Pearson quote)

    the interesting stuff would then be the meta-studies : articles that try to review all that was published on some subject.

    If you end-up with one lab-experiment giving you left turn at p-values, followed by 19 attempts to replicate that all ended up being negative, then you could consider the first to be a fluke.

    But then, disclaimer : (Dr Bones voice) I am a doctor, Jim, not a statistician.
    I've also studied bio-informatics, but my first degree should be a huge red flag whenever I get too close to stats, so take my opinion with a grain of salt.

    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
  39. Re: Statistics by Anonymous Coward · · Score: 0

    Instead of collecting data randomly, the post-modernists picks randomly from a set of narratives.

  40. better idea by Anonymous Coward · · Score: 0

    As a computer scientist, I think you're all doing it wrong. I propose that we use Co-NP-values instead of these weak sauce P-values. Co-NP values are the set complement of NP-values. When you use NP-values, you effectively evaluate all possible P-values at once and return the first P-value that proves whatever you're trying to prove. Co-NP-values on the other hand evaluate all possible P-values and report the first one that proves your experiment is bullshit. ;)

  41. Re: The standard is SIX SIGMA. by Rockoon · · Score: 1

    You rarely have sample sizes large enough in the biological or social sciences to do anything at a "six sigma" level.

    Crying about how unfair a standard is to your chosen methods isn't much of a persuasive argument. Some might argue that the standard should be inconvenient for the lowest common denominators. Some might even argue that the standard should be inconvenient for everybody.

    --
    "His name was James Damore."
  42. P-values are undergraduate level statistics by sjbe · · Score: 1

    I'm a biologist, I don't understand P values, but I am aware that they shouldn't be the gold standard

    It worries me when I read about people doing science who don't understand basic statistics. This is undergraduate level stuff and it's not terribly difficult to wrap your brain around. Anyone smart enough to be a professional biologist should be able to handle P-values without difficulty.

    Scientists who aren't statisticians care passionately about only their topic and it isn't statistics.

    Whether you are passionate about statistics or not is irrelevant. You are advocating mathematical illiteracy because some people aren't "passionate" about math? I'm not passionate about grammar but I recognize its importance. Please tell me how you as a purported biologist plan to conduct population studies or sampling without involving and understanding statistical methods? Do you not want to understand the papers you are putting your name on? How do you know that your conclusion makes sense if you don't understand the math used? Even top journals like Nature recognize the problem of biologists not taking statistics seriously.

    There are topics in pretty much every scientific discipline that cannot be properly understood without a solid grasp of statistics. Sure if you run into a technical problem beyond your prowess at mathematics by all means go seek out the math department at your local university for help but for someone to describe themselves as a scientist without understanding something as basic as a P-value is to basically admit they are not competent at their job.

  43. policing good intentions by epine · · Score: 2

    Most of the comments I skimmed are missing the point.

    The real problem is that even scientists with the best training and the best intentions wind up committing a certain amount of p-hacking subconsciously. Just a simple data exploration to decide post hoc whether any collected data is corrupted or implausible, and you've already slithered one toe across the p-hacking line.

    When p-values gate publication, and publication gates promotion, you create a severe moral hazard where many of the scientists you end up promoting lie on the bottom half of the curve in self-policing their accidental p-hacking. The guy with the penchant to do slightly more irregular experiments, which require slightly more data cleanup, seems to get slightly more published results. Ba da boom.

    p=0.005 would put a pretty big crimp in this effect.

    Of course it doesn't solve the larger problem. But good golly, first things first.

    We also know from replication efforts that p=0.05 is allowing far too much crap to float over the gate. p=0.005 probably gets us closer to the crap level we naively assumed we'd get when we originally rallied around p=0.05.

    Probably the increased use of computers hasn't helped matters: even accidental p-hacking with pencil and paper is hard work.

    1. Re:policing good intentions by j-beda · · Score: 1

      very good points.

  44. Re:The standard is SIX SIGMA. by Anonymous Coward · · Score: 0

    You really should take a course on statistics. Six sigma has nothing to do with science and research. It's used, often incorrectly, in industrial quality control.

  45. Significant Paper by techdolphin · · Score: 1

    There is a 95 percent chance this paper is significant. Oh wait, I meant to say 99.5 percent. I'll get the p-value right eventually.

  46. Re:The standard is SIX SIGMA. by Khyber · · Score: 1

    "Six sigma has nothing to do with science and research"

    Apparently you know nothing about this, so let's educate you on this a bit.

    Six-Sigma is used at CERN. So fucking YES, it is most certainly used in science.

    Perhaps you should get a job in an actual scientific field.

    --
    Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
  47. Idiots by Anonymous Coward · · Score: 0

    This is stupid. Many research fields (planetary science is of particular concern to me) cannot get enough data for that p-value. Also, inherent in standard measurements of, say, temperature on Earth, we expect an error of 1K out of 300K from satellites. This is good enough to work with improving predictions to give at least a decent idea of tomorrow's weather. The variability of temperature is only about 20K per day in extremes, so 1K is 95%. Much less than that else. Seems stupid as well to make meteorology non-science compliant. The proposal is outright idiocy. I didn't read the paper (don't have to, I presume) to make this comment though.

  48. Ioannidis gives formula that result is correct, PPV by jameson.burt2404 · · Score: 1

    The exact probability a field's (eg, a journal's) article is true can be found in John Ioannidis's PLoS Medicine article "Why Most Published Research Findings Are False". He lets R be the ratio of true relationships to no relationships among those tested in a field. It's equivalent to a background probability (prior, though perhaps unknown). The positive predictive value (PPV), a probability, is PPV = (1 - Beta) * R / (R - Beta * R + alpha). A coarse bound for this is, when alpha = 0.05, PPV less than 20 R. This bound becomes useful when it's less than 1, eg, R less than 0.05. R is small as in cancer research when, out of 30 genes affecting a cancer amongst 30,000 genes, R = 30 / 30000 = 0.001, so PPV less than 20 R = 0.02. That is, in this genetic research, THE PROBABILITY A PUBLISHED PAPER DECLARING 0.05 SIGNIFICANCE IS CORRECT IS NOT 0.95 -- IT'S AT MOST 0.02! Some have decided that all their research is statistically significant; eg, the journal Basic and Applied Social Psychology banned the p-value. Some research fields' tests are truly meaningful 25 percent of the time -- research becomes a child's game unworthy of most research. But when the research is difficult, as when truly meaningful results occur 0.001 of the time, then the p-value becomes a "deceiver of fools" (quote from symphonic metal band Epica).
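    The formula above is easy to check numerically. The sketch below plugs in the R = 0.001 example; the power of 80% (Beta = 0.2) is an assumed value chosen for illustration, not one given in the comment, and `ppv` is a made-up function name.

```python
def ppv(R: float, alpha: float = 0.05, beta: float = 0.2) -> float:
    """Positive predictive value: PPV = (1 - beta) * R / (R - beta * R + alpha)."""
    return (1 - beta) * R / (R - beta * R + alpha)


R = 30 / 30000  # the cancer-gene example: 30 relevant genes among 30,000

print(round(ppv(R, alpha=0.05), 4))   # stays below the coarse bound 20 * R = 0.02
print(round(ppv(R, alpha=0.005), 4))  # the proposed stricter threshold
```

    With those assumed inputs the PPV at alpha = 0.05 comes out near 0.016, under the 0.02 bound, and tightening alpha to 0.005 raises it to roughly 0.14.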

  49. Would rather they... by Anonymous Coward · · Score: 0

    Trained, beat, cajoled, plead with scientists to develop a better grasp of statistics and how to properly use them.

  50. The real answer... by JoeDuncan · · Score: 1

    ... is not in changing the epsilon value of P

    The real answer is in *requiring* 2 things:

    • that significance criteria be fixed BEFORE the experiment is run
    • that researchers be required to use NON-parametric statistical measures, as Fisher originally intended
  51. Think about what 0.005 means by lfp98 · · Score: 1

    If you only allow publication of effects with p less than 0.005, that means that in order to prevent the publication of one false positive, you are discarding ~190 results that had a true difference in outcome. I'd agree that it is better to publish nothing than to publish a wrong result, but this level of certainty seems to me excessive. Maybe 0.05 is a little too high, but surely at 0.02 or 0.01 (a one-in-a-hundred chance that you are wrong), it is time to move on to the next experiment, not keep doing the same work over and over, trying to reach the magic 0.005. Saying this is "doable and easy" is ludicrous. In biological sciences, a twofold difference could be extremely important, but to get p = 0.005 significance for a twofold difference is highly unusual. You'd have to double or triple the number of replicate experiments, with each replicate often taking weeks and many thousands of dollars to perform. Genome analysis is such a special case, you are just comparing sequences, not measuring quantitative variables in finicky cells in a wet lab. This will slow science to a crawl, probably for the sake of a marginal improvement in reproducibility.

  52. I don't need no stinking p-value! by Anonymous Coward · · Score: 0

    The patent office doesn't require no p-value.