Slashdot Mirror


Weak Statistical Standards Implicated In Scientific Irreproducibility

ananyo writes "The plague of non-reproducibility in science may be mostly due to scientists' use of weak statistical tests, as shown by an innovative method developed by statistician Valen Johnson, at Texas A&M University. Johnson found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in many fields including social science — still meant that as many as 17–25% of such findings are probably false (PDF). He advocates for scientists to use more stringent P values of 0.005 or less to support their findings, and thinks that the use of the 0.05 standard might account for most of the problem of non-reproducibility in science — even more than other issues, such as biases and scientific misconduct."

182 comments

  1. 2 + 2 = 5 !! by Anonymous Coward · · Score: 0

    I heard it more than once !!

  2. Or you know.. by tanujt · · Score: 1

    Use Bayesian statistics.

    1. Re:Or you know.. by Anonymous Coward · · Score: 5, Informative

      This would have the same problems, maybe even worse. The problem with statistics is usually that the model is wrong, and Bayesian stats offers two chances to fuck that up: in the prior, and in the generative model (=likelihood). Bayesian statistics still requires models (yes, you can do non-parametric Bayes, but you can do non-parametric frequentist stats also).

      Contrary to the hype and buzzwords, Bayesian statistics is not some magical solution. It is incredibly useful when done right, of course.

    2. Re:Or you know.. by hde226868 · · Score: 5, Insightful

      The problem with frequentist statistics as used in the article is that its "recipe" character often results in people using statistics that do not understand its limitations (a good example is assuming a normal distribution when there is none). The bayesian approach does not suffer from this problem, also because it forces you to think a little bit more about the problem you are trying to solve compared to the frequentist approach. But that's also the problem with the cited article. Just remaining in the framework and going towards more discriminating thresholds is not really a solution of the problem that people do not understand their data analysis (a p-value based on the wrong distribution remains meaningless, even if you change your threshold...). Because it is more logical in its setup, the danger of making such mistakes is smaller in bayesian statistics. The telescoper over at http://telescoper.wordpress.com/2013/11/12/the-curse-of-p-values/ has a good discussion of these issues.

    3. Re:Or you know.. by Anonymous Coward · · Score: 5, Interesting

      Yes, I agree. If a p-value of 0.05 actually "means" 0.20 when evaluated, then any sane frequentist will tell you that things are fucked, since the limiting probability does not match the nominal probability (this is the definition of frequentism).

      The power of Bayesian stats is largely in being able to easily represent hierarchical models, which are very powerful for modeling dependence in the data through latent variables. But it's not the Bayesianism per se that fixes things, it's the breadth of models it allows. A mediocre modeler using Bayesian statistics will still create mediocre models, and if they use a bad prior, then things will be worse than they would be for a frequentist.

      Consider that if Bayesian statisticians are doing a better job than frequentists at the moment, it may be because Bayesian stats hasn't yet been drilled into the minds of the mediocre, as frequentist stats has been for decades. People doing Bayesian stats tend to be better modelers to begin with.

    4. Re:Or you know.. by Daniel+Dvorkin · · Score: 4, Insightful

      The problem with frequentist statistics as used in the article is that its "recipe" character often results in people using statistics that do not understand its limitations (a good example is assuming a normal distribution when there is none). The bayesian approach does not suffer from this problem, also because it forces you to think a little bit more about the problem you are trying to solve compared to the frequentist approach.

      If only. The number of people who think "sprinkle a little Bayes on it" is the solution to everything is frighteningly large, and growing exponentially AFAICT. There's now a Bayesian recipe counterpart to just about every non-Bayesian recipe, and the only difference between them, as a practical matter, is that the people using the former think they're doing something special and better. One might say that their prior is on the order of P(correct|Bayes) = 1, which makes it very hard to convince them otherwise ...

      --
      The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
    5. Re:Or you know.. by Paradigma11 · · Score: 1

      "Yes, I agree. If a p-value of 0.05 actually "means" 0.20 when evaluated, then any sane frequentist will tell you that things are fucked, since the limiting probability does not match the nominal probability (this is the definition of frequentism)."

      P(H0|significant) != P(significant|H0)
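
      Spelled out with Bayes' theorem, as a sketch: write \pi_0 = P(H_0) for the prior probability that the null is true, \alpha = P(\text{significant} \mid H_0) for the test level, and 1-\beta = P(\text{significant} \mid H_1) for the power. The symbols \pi_0 and \beta are assumptions introduced here; the thread gives no values for them.

```latex
P(H_0 \mid \text{significant})
  = \frac{\alpha\,\pi_0}{\alpha\,\pi_0 + (1-\beta)\,(1-\pi_0)}
```

      The left side depends on the prior and on the power, so in general it need not equal \alpha.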

    6. Re:Or you know.. by Anonymous Coward · · Score: 0

      if they use a bad prior, then things will be worse than they would be for a frequentist.

      Citation needed. In my anecdotal (IANAS, I am not a statistician) experience the influence of the prior diminishes quickly as you add knowledge, usually becoming irrelevant except in the most extreme cases (when your prior is very, very, very close to 0 or 1).
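
      A rough way to check this is a Beta-Binomial model, where the posterior mean has a closed form. This is only a sketch: the two priors and the 30% success rate are hypothetical values picked for illustration, and real models need not behave this nicely.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a success rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)


def prior_gap(trials, rate=0.3):
    """Gap between posterior means under a flat prior and a badly biased one."""
    successes = round(rate * trials)
    flat = posterior_mean(1, 1, successes, trials)     # uniform prior
    biased = posterior_mean(50, 5, successes, trials)  # hypothetical prior, mean ~0.91
    return abs(flat - biased)


for n in (10, 100, 1000, 100000):
    print(n, round(prior_gap(n), 4))
```

      With few observations the biased prior dominates the answer; with many, the two posteriors nearly agree. A prior of exactly 0 or 1 would never move, which is the extreme case mentioned above.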

    7. Re:Or you know.. by wickerprints · · Score: 3, Informative

      First of all, recommending that hypothesis tests be conducted with smaller tolerances for Type I error almost invariably implies a large decrease in power. There is no free lunch. There are many experimental designs for which the importance of making a positive inference (i.e., accepting the alternative hypothesis) is so great that you do need to set the alpha level very small. But if the test is to have any power, that means the data you gather must be much, much more extensive. So, to simply say "alpha = 0.05 is too large because it admits too many irreproducible claims by random chance" sort of misses the basic point. A test conducted at such a level still has at most a 1 in 20 chance of observing a test statistic that would reject the null hypothesis even if the null is true. So a p-value of 0.04, for example, would merit further investigation. That's not so much a flaw of the frequentist methods as it is a flaw in interpretation, due to the natural tendency of investigators or clinicians to want a straightforward "yes/no" answer.
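
      To put a number on the power trade-off, here is a minimal sketch using a two-sided, two-sample z-test with unit variance. The effect size of 0.5 standard deviations and the 64 subjects per group are hypothetical values chosen for illustration, not figures from the article.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_size, n_per_group, alpha):
    """Approximate power of a two-sided, two-sample z-test with unit variance."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)  # expected z statistic under H1
    return std_normal.cdf(shift - critical) + std_normal.cdf(-shift - critical)


power_at_05 = z_test_power(0.5, 64, 0.05)
power_at_005 = z_test_power(0.5, 64, 0.005)
print("alpha = 0.05 :", round(power_at_05, 3))
print("alpha = 0.005:", round(power_at_005, 3))
```

      Under these assumptions the same study drops from roughly 80% power to roughly a coin flip when alpha is tightened tenfold and the data stay the same.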

      Bayesian methods, then, don't really offer intrinsically more meaning than frequentist methods. The main difference is that Bayesian methods, by their construction, force the investigator to draw an inference that is not characterized by a "yes/no" answer--in fact, it becomes a bit of a contrivance (e.g., Bayes factors and the calculation of cumulative posterior distributions) to try to interpret Bayesian analyses in this way. Don't get me wrong, that is an appealing and advantageous characteristic, whereas more care is needed to interpret the frequentist approach. But Bayesian methods also suffer from their own problems, many of which arise from the necessity of imposing some kind of prior distribution (so, for instance, Bayes factors are not monotone in the hypothesis).

      The takeaway here is that in statistics, there is no magic bullet, no single approach that is supposed to be the "best" or "optimal" for inferential purposes. It is the role of scientists and investigators to perform the necessary follow-up analyses and meta-analyses to improve the credibility of a claim. So in a sense, the state of statistical methods in scientific research is NOT broken. It is working as intended, where people find enough evidence to stimulate further investigation, and it is through this process that previous claims are tested further. The only part that concerns me is how policymakers lacking in sufficient statistical background might put too much credence in a particular analysis--this idea that "oh, we found significance so this MUST be true"--or how the non-statistically informed public or media all too often distort the meaning of an analysis to the point of absurdity. But I argue that this is not a weakness of statistics. It is a deficiency in understanding brought about by the human desire to act upon perceived certainties in a fundamentally uncertain world.

    8. Re:Or you know.. by delt0r · · Score: 1

      You've got it all wrong. First they try the old-fashioned way. If that wasn't significant then try the Bayes way. If that wasn't significant, do something and claim it's significant anyway.

      And don't mention that you had to try 30 different tests to finally find one that gives a "significant" result.
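
      For a sense of scale, a small sketch of how often pure noise yields at least one "significant" result. It assumes the tests are independent and every null hypothesis is true, which is a simplification; the function name is made up for this example.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """P(at least one p < alpha) across independent tests when every null is true."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 30):
    print(k, "tests:", round(chance_of_false_positive(k), 3))

# A Bonferroni-style correction would test each of the 30 at alpha / 30 instead.
print("corrected:", round(chance_of_false_positive(30, alpha=0.05 / 30), 3))
```

      With 30 tries the chance of at least one spurious hit is close to four in five, which is why the number of tests attempted matters as much as the threshold.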

      --
      If information wants to be free, why does my internet connection cost so much?
    9. Re:Or you know.. by TechNeilogy · · Score: 1

      When I was taking scientific methodology courses in school, we were strongly encouraged to use the "recipe" approach. We were warned that thinking about the detailed meaning of the statistical results might lead to the introduction of bias. I applauded the intent, but even then I saw it as likely to cause problems by forcing a continuous spectrum of results into accept / reject quanta. So I agree that simply moving the thresholds cannot solve the problem. I think Bayesian statistics, while not a panacea, can offer a more nuanced and realistic approach. But the real solution, IMHO, is to make sure every researcher has a good foundation in a broad range of statistical techniques and theory.

      --
      "The wisdom of the Patriarchs was that they *knew* they were fools." --Master Foo
    10. Re:Or you know.. by Anonymous Coward · · Score: 0

      Obviously. However, if something is published, it's almost always a rejection of the null. Therefore, if the published result is false, it means that the null is true, which means that the publication is a type-I error. I stand by my claim.

    11. Re:Or you know.. by Anonymous Coward · · Score: 0

      I think part of the problem is that folks publish with p=0.05 as a cutoff because it is suggestive of guiding future research. However, since it's publish-or-perish, everyone is interested in dumping suggestive results into journals as fast as possible, whereas the followup is not as sexy and often never gets done. Instead, someone will do a similar but different study with, again, a 0.05 cutoff.

      It's naive to think that reducing the cutoff will reduce the true rate of false positives. If the researchers are fudging their results (maliciously or subconsciously), lowering the cutoff clearly won't help. Making the cutoff 0.01 won't magically make the rate of false positives 0.04.

      In short, I agree, this isn't something that statistics can fix.

    12. Re:Or you know.. by sfdrew83 · · Score: 1

      I agree with you, but I also think there is an unhealthy reliance on statistics among researchers these days. In some cases, it's the best thing you've got, so you have to use it, but in other cases, people are just being intellectually lazy. I cringe every time I turn on the news, or look at a newspaper, or read an article online with a headline like "Science Proves It: visiting Latin America reduces your risk of cancer by up to 35%". I wish I could reach through spacetime and slap that person silly. On the other hand, how much can I blame them for misrepresenting what the results actually mean when all of these papers these days make claims that aren't much better? We need to get back to finding hard evidence to support our hypothesis, instead of just throwing darts at it, and calling it good.

    13. Re:Or you know.. by Anonymous Coward · · Score: 0

      I can remember giving a progress report and being asked the p value. I replied the study wasn't over so I was not sure it was ok to calculate one. This is in grad school. I know better now, but what was taught of statistics leads to bizarre behaviors.

    14. Re:Or you know.. by Paradigma11 · · Score: 1

      Rather late :) Your claim, and my argument also, is wrong on the following count:
      It should be P(H0|significant) != 1 - P(significant|H0)
      If you have a sample without real differences you have P(H0|significant = 0) which is the statistic used in the article.
      This would result in:
      "Yes, I agree. If a p-value of 0.05 actually "means" 1 when evaluated, then any sane frequentist will tell you that things are fucked, since the limiting probability does not match the nominal probability (this is the definition of frequentism)."

  3. Doubt it by Anonymous Coward · · Score: 1

    Doubt it makes a difference; the root of this problem is systematic errors.

    1. Re:Doubt it by Anonymous Coward · · Score: 0

      That's not a very scientific response... Doubt.
      The real question is:

      What P value did he use to come to this conclusion?

  4. Re:That book about the bell curve by Derec01 · · Score: 3, Informative

    That is because of the central limit theorem (http://en.wikipedia.org/wiki/Central_limit_theorem), which indicates that for the mean of a large number of independent samples (from a distribution with finite variance), it doesn't matter what the original distribution was, and we certainly can reliably use the normal distribution. It is NOT unfounded.

  5. Re:That book about the bell curve by Anonymous Coward · · Score: 0

    That, and the fact that all of statistics is a joke. It's all based on the assumption that data is distributed in a bell curve. Sure, a bell curve does fit a lot of data, but we blindly assume it fits everything which just can't be true.

    We do not assume everything fits a bell curve.

    STA-101: When using a normal curve, there needs to be a good reason for it.

    In many cases, that good reason is the Central Limit Theorem.
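
    A quick simulation sketch of the Central Limit Theorem at work. The exponential distribution, the batch size of 50, and the seed are arbitrary choices made for illustration. Note that it is the averages that look normal, not the raw draws.

```python
import random


def average(values):
    values = list(values)
    return sum(values) / len(values)


def skewness(values):
    """Sample skewness; close to 0 for a symmetric shape such as the normal curve."""
    m = average(values)
    sd = average((v - m) ** 2 for v in values) ** 0.5
    return average(((v - m) / sd) ** 3 for v in values)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
raw = [rng.expovariate(1.0) for _ in range(20000)]
means_of_50 = [average(rng.expovariate(1.0) for _ in range(50)) for _ in range(5000)]

raw_skew = skewness(raw)
mean_skew = skewness(means_of_50)
print("skewness of raw draws:  ", round(raw_skew, 2))
print("skewness of means of 50:", round(mean_skew, 2))
```

    The raw exponential draws are strongly lopsided, while the averages of 50 are much closer to symmetric. The approximation improves with batch size and fails for distributions without finite variance.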

  6. Five Sigma or Bust by upmufa · · Score: 2

    Five sigma is the standard of proof in Physics. The probability of a background fluctuation is a p-value of something like 0.0000006.

    1. Re:Five Sigma or Bust by mysidia · · Score: 3, Interesting

      Five sigma is the standard of proof in Physics. The probability of a background fluctuation is a p-value of something like 0.0000006.

      Of proof yes... that makes sense.

      Other fields should probably use a threshold of 0.005 or 0.001.

      If they move to five sigma....... 2013 might be the last year that scientists get to keep their jobs.

      What are you supposed to do; if no research in any field is admissible, because the bar is so high no one can meet it, even with meaningful research?

    2. Re:Five Sigma or Bust by Will.Woodhull · · Score: 3, Insightful

      Agreed. P = 0.05 was good enough in my high school days, when handheld calculators were the best available tool in most situations, and you had to carry a couple of spare nine volt batteries for the thing if you expected to keep it running through an afternoon lab period.

      We have computers, sensors, and methods for handling large data sets that were impossible to do anything with back in the day before those first woodburning "minicomputers" of the 1970s. It is ridiculous that we have not tightened up our criteria for acceptance since those days.

      Hell, when I think about it, using P = 0.05 goes back to my Dad's time, when he was using a slide rule while designing engine parts for the SR-71 Blackbird. That was back in the 1950s and '60s. We should have come a long way since then. But have we?

      --
      Will
    3. Re:Five Sigma or Bust by mdsolar · · Score: 0

      Checking Bevington (1969) Table C-1: P=0.0000014868 for 5 sigma. Interestingly, the table does not go up to six sigma, suggesting that a certain set of companies don't actually exist. http://en.wikipedia.org/wiki/List_of_Six_Sigma_companies
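
      The different five sigma figures in this thread can be compared with a short sketch that uses only the standard library. It assumes a standard normal distribution; the function names are made up for this example.

```python
from math import erfc, exp, pi, sqrt


def upper_tail(z):
    """One-sided tail area P(Z > z) for a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2))


def density(z):
    """Height of the standard normal density curve at z."""
    return exp(-0.5 * z * z) / sqrt(2 * pi)


print("one-sided p at 5 sigma:", upper_tail(5))
print("two-sided p at 5 sigma:", 2 * upper_tail(5))
print("density at 5 sigma:    ", density(5))
```

      The two-sided tail area is about 0.0000006 and the one-sided area about half that. The 0.0000014868 figure matches the height of the density curve at 5 rather than a tail area, so it may have been read from a density table.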

    4. Re:Five Sigma or Bust by Anonymous Coward · · Score: 0

      Five sigma probably isn't possible in the medical field for example. What sample size would you need to use to get that?
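
      As a rough answer, here is a sketch using the textbook approximation for a two-sided, two-sample z-test at 80% power. The effect size of 0.2 standard deviations is a hypothetical choice; rare outcomes or smaller effects would need far more subjects than this.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha, power=0.8):
    """Approximate per-group sample size for a two-sided, two-sample z-test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


five_sigma_alpha = 2 * (1 - NormalDist().cdf(5))  # two-sided 5 sigma threshold
for label, alpha in (("0.05", 0.05), ("0.005", 0.005), ("5 sigma", five_sigma_alpha)):
    print(label, n_per_group(0.2, alpha))
```

      Under these assumptions, moving from 0.05 to 0.005 needs about 70% more subjects, and five sigma needs a bit over four times as many as 0.05.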

    5. Re:Five Sigma or Bust by Kjella · · Score: 1

      Hell, when I think about it, using P = 0.05 goes back to my Dad's time, when he was using a slide rule while designing engine parts for the SR-71 Blackbird. That was back in the 1950s and '60s. We should have come a long way since then. But have we?

      In engineering? Yes. Science? Well...

      --
      Live today, because you never know what tomorrow brings
    6. Re:Five Sigma or Bust by LoRdTAW · · Score: 1

      "What are you supposed to do; if no research in any field is admissable, because the bar is so high noone can meet it, even with meaningful research?"

      James Cameron could reach the bar.

    7. Re:Five Sigma or Bust by mysidia · · Score: 1

      James Cameron could reach the bar.

      Hm... James Cameron is a deep-sea explorer, and film director.... he directed Titanic.

      In what way, does that make him a researcher who could be sure of meeting five sigma in all his research; even when infeasible truly massive datasets would be required?

    8. Re:Five Sigma or Bust by Anonymous Coward · · Score: 2, Insightful

      Agreed. P = 0.05 was good enough in my high school days, when handheld calculators were the best available tool in most situations

      Um, the issue is not that it is difficult to calculate P-values less than 0.05. Obtaining a low p-value requires either a better signal to noise ratio in the effect you're attempting to observe, or more data. Improving the signal to noise ratio is done by improving experimental design, removing sources of measurement error like rater reliability, measurement noise, covariates, etc. It should be done to the extent feasible, but you can't wave a magic wand and say "computers" to fix it. Likewise, data collection is also expensive, and if you have to have an order of magnitude more subjects, it will substantially raise the cost of doing research.

      There does exist a tradeoff between research output and research quality. It may be (I think so at least) that we ought to push the bar a bit toward quality over quantity, but there is a cost. In the extreme, we might miss out on many discoveries because we could only afford the time and cost of going after a handful of sure things.

    9. Re:Five Sigma or Bust by umafuckit · · Score: 3, Insightful

      We have computers, sensors, and methods for handling large data sets that were impossible to do anything with back in the day before those first woodburning "minicomputers" of the 1970s. It is ridiculous that we have not tightened up our criteria for acceptance since those days.

      But that stuff isn't the limiting factor. The limiting factor is usually getting enough high quality data. In certain fields that's very hard because measurements are hard or expensive to make and the signal to noise is poor. So you do the best you can. This is why criteria aren't tighter now than before: because stuff at the cutting edge is often hard to do.

    10. Re:Five Sigma or Bust by Anonymous Coward · · Score: 0

      And now you see. Statistical significance is only a function of how much money you can drum up (sample size).

    11. Re:Five Sigma or Bust by martinux · · Score: 3, Interesting

      I work in this field and usually see power calculations recommending samples of non-viable size.

      I can see recruiting hundreds of subjects as being feasible in the US or a large European country but in smaller countries one simply has to state clearly in a paper's limitations that any findings must be interpreted in light of the available sample.

    12. Re:Five Sigma or Bust by Anonymous Coward · · Score: 0

      It's a South Park reference.

    13. Re:Five Sigma or Bust by Anonymous Coward · · Score: 0

      What are you supposed to do; if no research in any field is admissible, because the bar is so high no one can meet it

      I dunno, maybe you could stop calling the study of those fields "science".

    14. Re:Five Sigma or Bust by interkin3tic · · Score: 1

      Why should it be generalized to whole fields rather than based on what you are studying?

      If I'm publishing that drug X does not increase the incidence of spontaneous human combustion, there ought to be a lot of zeroes in that P value. If I'm publishing that "As expected, Protein X does job Y in endangered species Z, which is not surprising given that protein X does job Y in every other species tested, and why the hell did we even do this experiment" then maybe you don't need such a high standard.

    15. Re:Five Sigma or Bust by dcollins · · Score: 1

      As others have said, the problem in many cases is not computational power, but expense or difficulty or even ethics of getting a large data set. What about a medical trial -- do you want to necessitate giving some experimental medicine to 10,000 people before assessing whether it's a good idea or not?

      --
      We know where leadership by an anti-intellectual "strongman" who scapegoats minorities and likes boisterous rallies goes
    16. Re:Five Sigma or Bust by melikamp · · Score: 1

      do you want to necessitate giving some experimental medicine to 10,000 people before assessing whether it's a good idea or not?

      We have 40,000 patent attorneys in the US alone, so there's one great sample for experimental medicine. Not very heterogeneous, but will do in a pinch.

    17. Re:Five Sigma or Bust by pepty · · Score: 2

      do you want to necessitate giving some experimental medicine to 10,000 people before assessing whether it's a good idea or not?

      If the drug is going to be prescribed to millions of people a year: yes, probably. If not during a phase III trial then during a phase IV trial that begins as soon as the drug goes on the market. The reason being that while efficacy can be extrapolated from smaller trials safety is all about the outliers. A few excess deaths in a trial of several thousand could easily mean that the drug causes more harm than good overall, or that an identifiable patient subgroup can't tolerate the drug.

    18. Re:Five Sigma or Bust by pepty · · Score: 1

      In a lot of fields I'd think requiring inclusion of the a priori power analysis in publications would be a decent compromise between unobtainable sample sizes and overly optimistic results. That still leaves the problem of experiments that are extended to increase the sample size after the first batch of results proved inconclusive ...
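
      The extension problem can be illustrated with a small simulation sketch. Every experiment below is pure noise (the null is true), analysed with a z-test of known unit variance. The batch size of 20, the single extension, and the seed are arbitrary choices for illustration.

```python
import random
from math import sqrt


def significant(sample, critical=1.96):
    """Two-sided z-test of mean 0 with known unit variance, roughly alpha = 0.05."""
    z = sum(sample) / sqrt(len(sample))
    return abs(z) > critical


def false_positive_rate(extend_if_inconclusive, trials=20000, batch=20, seed=1):
    """Fraction of pure-noise experiments that end up 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(batch)]
        found = significant(sample)
        if not found and extend_if_inconclusive:
            sample += [rng.gauss(0, 1) for _ in range(batch)]  # second look
            found = significant(sample)
        hits += found
    return hits / trials


fixed_rate = false_positive_rate(False)
extended_rate = false_positive_rate(True)
print("fixed sample size:      ", fixed_rate)
print("extend once if not sig.:", extended_rate)
```

      Even a single unplanned extension pushes the false positive rate noticeably above the nominal 5%, and each additional look pushes it higher.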

    19. Re:Five Sigma or Bust by mysidia · · Score: 1

      If I'm publishing that drug X does not increase the incidence of spontaneous human combustion, there ought to be a lot of zeroes in that P value. If I'm publishing that "As expected, Protein X does job Y in endangered species Z, which is not surprising given that protein X does job Y in every other species tested, and why the hell did we even do this experiment" then maybe you don't need such a high standard.

      I will agree to that with one stipulation..... the ability to apply the more lax standard needs to be documented, applied for and approved before the study is undertaken, and it should be approved by a board of peers in the field who are unbiased -- not directly involved with the study, its participants, etc; to make sure someone didn't get to switch which standard would be used after they gathered the data; e.g. relaxing the standard to make it publishable.

    20. Re:Five Sigma or Bust by RespekMyAthorati · · Score: 1

      How could the research be "meaningful" if the statistics are crap?

    21. Re:Five Sigma or Bust by mysidia · · Score: 1

      How could the research be "meaningful" if the statistics are crap?

      0.049

      It is highly suggestive; and if replicated by other scientists -- essentially definitive.

      The cost of achieving p

      You may require, for example, tens of millions of people participating as subjects in your cancer cure drug study.

      Your required sample size may exceed the number of available subjects, or even exceed the number of people in the population under study.

      Are you willing to deny approval for a cancer drug that could have the potential to save numerous lives, because the drug companies don't have the budget for ---- or are incapable of recruiting 50 million test subjects, for experimental drug testing?

    22. Re:Five Sigma or Bust by Pseudonym · · Score: 1

      The example that medical statisticians usually use is head trauma. You can't ask for volunteers to be in serious traffic accidents. And if ten people come into your emergency room with head trauma, you can't not-treat five of them to act as a control group.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
    23. Re:Five Sigma or Bust by umafuckit · · Score: 1

      No you can't, but that would usually also be pointless. After all, usually what you want to know is if a new treatment is better than what's out there. So the experiment you do on your head trauma victims is new treatment vs current best practice. So your null hypothesis is not a nil. This is considered best practice for most new drugs. It's only drug companies who want to make their crappy new drugs look good who test them against placebo.

  7. Scarcely productive by fey000 · · Score: 4, Interesting

    Such an admonishment is fine for the computational fields, where a few more permutations can net you a p-value of 0.0005 (assuming that you aren't crunching on a 4-month cluster problem). However, biological laborations are often very expensive and take a lot of time. Furthermore, additional tests are not always possible, since it can be damn hard to reproduce specific mutations or knockout sequences without altering the surrounding interactive factors.

    So, should we go for a better p-value for the experiment and scrap any complicated endeavour, or should we allow for difficult experiments and take it with a grain of salt?

    1. Re:Scarcely productive by Anonymous Coward · · Score: 0

      If the author's assertion is true and that P value of 0.05 or less means that 17–25% of such findings are probably false, then what is the point of publishing the findings? Or at least come at the writing from a more sober perspective. Of course, any such change would need to come with an academia culture change from the 'publish or perish' mindset.

    2. Re:Scarcely productive by hawguy · · Score: 4, Insightful

      If the author's assertion is true and that P value of 0.05 or less means that 17–25% of such findings are probably false, then what is the point of publishing the findings? Or at least come at the writing from a more sober perspective. Of course, any such change would need to come with an academia culture change from the 'publish or perish' mindset.

      Because I'd rather use a drug found to be 75-83% effective at treating my disease than die while waiting for someone to come up with one that's 99.9% effective.

    3. Re:Scarcely productive by Anonymous Coward · · Score: 3, Informative

      This is a fallacious understanding of p-value.

      Something closer to (but still not quite) correct would be: that there is a 75-83% chance that the claimed efficacy of the drug is within the stated error bars. For example, there may be a 75-83% chance that the drug is between 15% and 45% effective at treating your disease.

      That's much worse, isn't it?

    4. Re:Scarcely productive by Anonymous Coward · · Score: 0

      It's not 75% effective. There's a 25% chance it's doing nothing at all for you or any of the other twenty million people it's been prescribed to, at enormous cost, and/or a 25% chance it has horrific undiscovered side-effects.

    5. Re:Scarcely productive by Anonymous Coward · · Score: 0

      It's not an assertion, it's basic math--0.05 is 20%. If the chance of any result being found by chance is 20%, it stands to reason that about 20% of all results were found by chance and, therefore, not to be expected to be reproducible.

      The author isn't the first to point this out. There was a really good paper published about 5 years ago which delved much deeper into our approaches to basic science, especially in biology and the social sciences.

      For example, we assume that if a hypothesis holds up that we've learned something substantive about the structure of the particular system being studied. But some systems might be riddled with a huge number of coincidental and misleading relationships, so that even if a study is reproducible it may only add noise to the field. This is why so many results are a dead end. So even by switching to a tiny P-value, you still haven't necessarily improved the productivity of the field. In fact, you may drive researchers to chase even narrower hypotheses which are more likely to be valid but nonetheless worthless.

    6. Re:Scarcely productive by Anonymous Coward · · Score: 0

      ....and with that reasoning you show us exactly why we have this problem. Science only works if you have the discipline to do it right.

    7. Re:Scarcely productive by Anonymous Coward · · Score: 0

      And more importantly, a 17-25% chance that it's completely ineffective, no better than a placebo.

    8. Re:Scarcely productive by petermgreen · · Score: 1

      If the chance of any result being found by chance is 20%, it stands to reason that about 20% of all results were found by chance and, therefore, not to be expected to be reproducible.

      Statistical significance levels only tell us about the chance of a study producing a false positive, they say nothing about the chance of a study producing a true positive.

      So if the chance of a true positive is low then the false positives could easily outnumber the true positives.
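
      A back-of-the-envelope sketch of that point. The prior probabilities and the 80% power are hypothetical inputs, and this is plain Bayes arithmetic, not the method used in the article.

```python
def false_share(prior_true, power, alpha=0.05):
    """Expected fraction of 'significant' results that are false positives."""
    false_pos = alpha * (1 - prior_true)  # true nulls that cross the threshold
    true_pos = power * prior_true         # real effects that get detected
    return false_pos / (false_pos + true_pos)


for prior_true in (0.5, 0.1, 0.01):
    print(prior_true, round(false_share(prior_true, power=0.8), 3))
```

      When only 1 in 100 tested hypotheses is real, most significant results are false even at 80% power; when half are real, the false share stays near 6%.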

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
    9. Re:Scarcely productive by hawguy · · Score: 3, Interesting

      And more importantly, a 17-25% chance that it's completely ineffective, no better than a placebo.

      My sister went through 4 different drugs before she found one that made her condition better. One made her (much) worse.

      Yet she likely wouldn't be alive today if none of those 4 drugs worked.

    10. Re:Scarcely productive by evilviper · · Score: 1

      Because I'd rather use a drug found to be 75-83% effective at treating my disease than die while waiting for someone to come up with one that's 99.9% effective.

      The problem comes when you're treating a non-life threatening ailment with a drug that turns out to:

      1) Not help at all, ever.
      2) Has other, life-threatening side-effects.

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    11. Re:Scarcely productive by colinrichardday · · Score: 1

      It's not an assertion, it's basic math--0.05 is 20%

      0.05 is 2%, not 20%

    12. Re:Scarcely productive by theqmann · · Score: 1

      0.05 is 5 percent

    13. Re:Scarcely productive by Anonymous Coward · · Score: 1

      After three decades working in a National Laboratory, and after having been involved in several fundamental discoveries, I just have to ask:
      What the hell is a "laboration"? Is it a new made-up word to go along with what frequently appears now to be just made-up science?

    14. Re:Scarcely productive by Anonymous Coward · · Score: 0

      What you said is a fine representation of the problem: most scientists that have no mathematical background (and even many that should have) don't understand what they're doing in classical hypothesis testing.

    15. Re:Scarcely productive by Anonymous Coward · · Score: 0

      It's not an assertion, it's basic math--0.05 is 20%.

      0.05 is 5% is 1/20th not 20%

      Basic math indeed.

    16. Re:Scarcely productive by Anonymous Coward · · Score: 0

      My sister went through 4 different drugs before she found one that made her condition better. One made her (much) worse. Yet she likely wouldn't be alive today if none of those 4 drugs worked.

      And some people die before they try the 4th drug, because the first 3 weren't tested properly. I understand you love your sister, but seriously, one data point is not a good way to understand statistics.

    17. Re:Scarcely productive by hawguy · · Score: 1

      My sister went through 4 different drugs before she found one that made her condition better. One made her (much) worse.

      Yet she likely wouldn't be alive today if none of those 4 drugs worked.

      And some people die before they try the 4th drug, because the first 3 weren't tested properly. I understand you love your sister, but seriously, one data point is not a good way to understand statistics.

      But if doctors weren't allowed to publish results where there's a relatively weak correlation to a positive outcome, how would her doctors have known "Hey, look, this other drug is effective for some people, let's give it a try"? They already tried the 2 drugs that were the most promising, the 3rd (the one that put her in ICU for a week) was a relative longshot, as was the 4th. Yet if the literature hadn't shown that all of those drugs were used with some moderate success, how would they have known to try them?

      Since her condition is relatively rare and apparently different people respond differently to the drugs, I don't know that any of the treatments would show a P value of .005. But that doesn't mean a 75% chance of a good outcome (even if there's a 25% chance of a bad outcome) is worse than no treatment at all, which is almost always fatal within a few years.

    18. Re:Scarcely productive by pepty · · Score: 1

      More fun: a 75-83% chance the drug caused a 15 to 45% change in a proxy, i.e., a biomarker, which is itself imperfectly correlated with your disease. Data on whether the drug actually improves lifespan/quality of life for your cohort of chronic disease patients should be back in another decade or so ...

      No wonder people hate evidence based medicine and would rather have someone just Reiki it away.

    19. Re:Scarcely productive by pepty · · Score: 1

      And that's one reason why complaints about Pharmas wasting time on "me too" drugs should be taken with a grain of salt. A first in class drug usually leaves a lot of unmet medical need.

    20. Re:Scarcely productive by Anonymous Coward · · Score: 0

      So, should we go for a better p-value for the experiment and scrap any complicated endeavour, or should we allow for difficult experiments and take it with a grain of salt?

      Yes.

  8. Economic Impact by Anonymous Coward · · Score: 3, Insightful

    Truth is expensive.

    1. Re:Economic Impact by Anonymous Coward · · Score: 1

      Truth is expensive.

      Not as expensive as ignorance.

    2. Re:Economic Impact by gl4ss · · Score: 1

      but producing shitty studies pays just as well as producing good studies, especially if the study is about something of no consequence at all to anyone.

      --
      world was created 5 seconds before this post as it is.
  9. Not going to happen by Anonymous Coward · · Score: 4, Insightful

    If we were to insist on statistically meaningful results 90% of our contemporary journals would cease to exist for lack of submissions.

    1. Re:Not going to happen by Anubis+IV · · Score: 3, Insightful

      ...and nothing of value would be lost. Seriously, have you read the papers coming from that 90% of journals and conference proceedings outside of the big ones in $field_of_study? The vast majority of them suck, have extraordinarily low standards, and are oftentimes barely readable. There's a reason why the major conferences/journals that researchers actually pay attention to routinely turn away between 80-95% of papers being submitted: it's because the vast majority of research papers are unreadable crap with marginal research value being put out to bolster someone's published paper count so that they can graduate/get a grant/attain tenure.

      If the lesser 90% of journals/conferences disappeared, I'd be happy, since it'd mean wading through less cruft to find the diamonds. I still remember doing weekly seminars with my research group in grad school, where we'd get together and have one person each week present a contemporary paper. Every time one of us tried to branch out and use a paper from a lesser-known conference (this was in CS, where the conferences tend to be more important than the journals), we ended up regretting it, since they were either full of obvious holes, incomplete (I once read a published paper that had empty Data and Results sections...just, nothing at all, yet it was published anyway), or relied on lots of hand-waving to accomplish their claimed results. You want research that's worth reading, you stick to the well-regarded conferences/journals in your field, otherwise the vast majority of your time will be wasted.

    2. Re:Not going to happen by Anonymous Coward · · Score: 0

      The p < 0.05 is the standard for statistically meaningful results, not scientifically meaningful results. You know, the standard 'correlation is not causation' sort of thing.

      One of the many difficulties we have in the sciences is the difficulty in publishing studies showing a lack of statistical effect, as is mentioned in the Nature blurb. I don't often see the results of power tests, used to avoid Type II errors, but those would be welcome additions to results sections. The main Nature article does not mention the use of power tests for some reason, and if I were a statistician I might have an inkling why that is.

    3. Re:Not going to happen by Anonymous Coward · · Score: 0

      Awesome. Then, maybe instead of judging scientists based on the volume of papers they have published, we could judge them based on the quality of their research.

    4. Re:Not going to happen by Anonymous Coward · · Score: 0

      Then we need an entire culture shift. Because if you are applying for grants and can only list one publication (which used p=0.005, and to get that much replication you could not publish all the other data that reached p=0.05, and you spent all your grant money raising at least double the amount of mice - assuming the ethics committee even approves your new larger experiments, it took you months longer to do the replicates while you had to pay all the staff and students) - well, you are in a competition with other scientists and you will lose. You will lose grants, you will lose jobs, you will not get your PhD, you will be out competed by those with poor stats and longer lists of publications. Bad stats don't keep you out of good journals, I've seen some pretty shitty stats in Nature/Science et al.

      So we have a system that will punish anyone who attempts to be more rigorous in their experiments.

    5. Re:Not going to happen by RespekMyAthorati · · Score: 1

      Yay!

  10. Re:That book about the bell curve by Anonymous Coward · · Score: 2, Informative

    Statistics does not, by any means, make that assumption. If it did, the entire field of statistics would have been completed by 1810.

    Mediocre (actually, sub-mediocre) practitioners of statistics make that assumption.

    It is true that many estimators tend to a normal distribution as the sample size gets large, but this is not the same as assuming that the data itself comes from the normal distribution.

  11. Interpretation of the 0.05 threshold by Michael+Woodhams · · Score: 5, Insightful

    Personally, I've considered results with p values between 0.01 and 0.05 as merely 'suggestive': "It may be worth looking into this more closely to find out if this effect is real." Between 0.01 and 0.001 I'd take the result as tentatively true - I'll accept it until someone refutes it.

    If you take p=0.04 as demonstrating a result is true, you're being foolish and statistically naive. However, unless you're a compulsive citation follower (which I'm not) you are somewhat at the mercy of other authors. If Alice says "In Bob (1998) it was shown that ..." I'll tend to accept it without realizing that Bob (1998) was a p=0.04 result.

    Obligatory XKCD

    --
    Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.
    1. Re:Interpretation of the 0.05 threshold by Black+Parrot · · Score: 1

      Obligatory XKCD [xkcd.com]

      FWIW, tests like the Tukey HSD ("Honestly Significant Difference") are designed to avoid that problem.

      I suspect that's how the much-discussed "Jupiter Effect" for astrology came about: Throw in a big pile of names and birth signs, turn the crank, and watch a bogus correlation pop out.
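
      A rough sketch of why piling up comparisons produces bogus hits, assuming the tests are independent and every null hypothesis is true. The Bonferroni correction shown here is a simpler stand-in for illustration, not the Tukey HSD procedure itself, and the function names are made up for this example.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests
    when every null hypothesis is actually true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold under the Bonferroni correction (a stand-in
    for illustration; Tukey's HSD works differently)."""
    return alpha / n_tests


n_tests = 20  # illustrative number of comparisons
uncorrected = familywise_error_rate(0.05, n_tests)
corrected = familywise_error_rate(bonferroni_alpha(0.05, n_tests), n_tests)

print(f"{n_tests} tests at alpha=0.05: P(at least one false positive) = {uncorrected:.3f}")
print(f"{n_tests} tests with Bonferroni: P(at least one false positive) = {corrected:.3f}")
```

      With 20 uncorrected tests the chance of at least one spurious "significant" result is well over half, which is the crank-turning effect described above.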

      --
      Sheesh, evil *and* a jerk. -- Jade
    2. Re:Interpretation of the 0.05 threshold by theqmann · · Score: 1

      doesn't a p 0.05 mean that 95% of your data samples (2 sigma) support the hypothesis? wouldn't 1 sigma be more of a "suggestive" level? 95% seems pretty good

    3. Re:Interpretation of the 0.05 threshold by Anonymous Coward · · Score: 1

      doesn't a p 0.05 mean that 95% of your data samples (2 sigma) support the hypothesis? wouldn't 1 sigma be more of a "suggestive" level? 95% seems pretty good

      The best way I've found to understand p-values is to consider the situation where you have an experiment. You're attempting to observe an effect, but in reality there is no effect to observe and all you're seeing are random fluctuations. If your criterion for declaring you've observed your effect is a p-value of 0.05, it means you will be convinced you've seen something there that isn't really there one time in twenty. Can you imagine if you had a one in twenty chance of believing a traffic light was green when it was in fact red? I think "suggestive" is an appropriate label for that level of confidence.
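
      That thought experiment can be simulated. The sketch below is only an illustration under assumed conditions: pure Gaussian noise with a known standard deviation of 1, a two-sided z-test, and arbitrary choices for the sample size, number of experiments, and seed.

```python
import math
import random


def two_sided_p_value(sample, sigma=1.0):
    """z-test of 'true mean is 0' for data with known standard deviation."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    # Two-sided tail probability of a standard normal beyond |z|.
    return math.erfc(abs(z) / math.sqrt(2.0))


def false_positive_rate(n_experiments=20000, n_per_experiment=30,
                        alpha=0.05, seed=1):
    """Run many experiments where there is no effect at all and count
    how often the p-value still falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_experiment)]
        if two_sided_p_value(sample) < alpha:
            hits += 1
    return hits / n_experiments


rate = false_positive_rate()
print(f"Fraction of no-effect experiments declared significant: {rate:.3f}")
```

      The printed fraction should land close to 0.05, i.e. roughly one no-effect experiment in twenty looks like a discovery.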

    4. Re:Interpretation of the 0.05 threshold by Anonymous Coward · · Score: 0

      Here is a good explanation. The p value and sample size index an effect size:
      http://arxiv.org/pdf/1311.0081v1.pdf

    5. Re:Interpretation of the 0.05 threshold by delt0r · · Score: 1

      p values in this context don't tell you if something is true. They tell you that the data is unlikely to be from the *null* model. It's not the same as support for the alternative.

      --
      If information wants to be free, why does my internet connection cost so much?
    6. Re:Interpretation of the 0.05 threshold by Laxori666 · · Score: 1

      A p-value threshold of 0.05 means that 1 in 20 tests of a false hypothesis will come out as a false positive. Naively, this suggests that 5% of all scientific papers reporting p = 0.05 are false. However, applying some statistics, it might even be worse than that. Textual summary of that link:

      Say 1000 hypotheses are tested, and that 10% are true - that is, 100 true hypotheses, and 900 false ones. If the false positive rate is 5%, then 45 of the 900 false hypotheses will test as true. Further, let's say there's a false negative rate of 10%. So of the 100 true hypotheses, 10 will be missed. That means after experimentation on all 1000 hypotheses, 135 positive results will be found, out of which only 90 will actually be true. Also consider that usually only positive results are published. So the 1000 experiments yield 135 papers that are published, where 45/135, or 33%, are actually false!

      This is a bit arbitrary in that it assumes 10% of all tested hypotheses are true. If this number is smaller, then this gets much worse. If the number is larger, then it gets better. But still it's quite an interesting indication.
      What humanity needs is a severe revision of the whole peer review and publication method of doing science.
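
      The arithmetic above generalizes to any assumed share of true hypotheses. A small sketch follows; the function name is made up for this example, and the extra priors in the loop are arbitrary illustrations rather than figures from the linked summary.

```python
def false_share_of_positives(prior_true, alpha, power):
    """Fraction of positive (publishable) results that are false.

    prior_true: assumed share of tested hypotheses that are really true
    alpha:      false positive rate (the significance threshold)
    power:      1 - false negative rate
    """
    true_positives = prior_true * power
    false_positives = (1.0 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


# The worked example: 10% true, 5% false positives, 10% false negatives.
share = false_share_of_positives(prior_true=0.10, alpha=0.05, power=0.90)
print(f"prior 10%: {share:.1%} of positive results are false")

# How the answer moves with the (unknowable) share of true hypotheses.
for prior in (0.01, 0.25, 0.50):
    result = false_share_of_positives(prior, alpha=0.05, power=0.90)
    print(f"prior {prior:.0%}: {result:.1%} of positive results are false")
```

      The first line reproduces the 45/135 figure; the loop shows how strongly the result depends on the assumed prior.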

  12. Obligatory XKCD by Anonymous Coward · · Score: 2, Funny

    http://xkcd.com/882/

  13. Re:That book about the bell curve by Entropius · · Score: 2

    No, statisticians certainly do not assume that. If everything in my field were normally distributed then my life would be a lot easier, but it's not, and we're aware that it's not.

  14. A universal standard for significance... by Anonymous Coward · · Score: 3, Insightful

    Authors need to read this: http://www.deirdremccloskey.com/articles/stats/preface_ziliak.php
    It explains quite clearly why a p value of 0.05 is a fairly arbitrary choice, as it cannot possibly be the standard for every possible study out there. Or, to put it another way, be very skeptical when one sole number (namely 0.05) is supposed to be a universal threshold to decide on the significance of all possible findings, in all possible domains of science. The context of any finding still matters for its significance.

  15. Student's T-test by The+Real+Dr+John · · Score: 1

    More researchers in the biological sciences are using other more rigorous methods now than the Student's t-test and a p value of 0.05. ANOVA, ANCOVA and ranking methodologies are commonplace. Many scientific findings are based on a P value below 0.01. The problem with bad science certainly involves some bad statistics, but more often it just involves bad methodology, and poor attention to the previous literature (and thus attempting to reinvent the wheel). If your findings are robust and reproducible, then the statistics work out just fine. The good news is that science is self correcting, even if sometimes the corrections seem tardy.

    --
    A brain is a terrible thing to waste... Mind? That's debatable.
    1. Re:Student's T-test by Will.Woodhull · · Score: 1

      The bad news is that it is getting harder and harder to sort the science reported in journals from the papers whose purpose is to generate or preserve revenue streams for the researchers (or the corporations for which they are agents).

      --
      Will
    2. Re:Student's T-test by The+Real+Dr+John · · Score: 1

      Science is being monetized like everything else, and in all cases this is bad news, not good. There are still many scientists who work hard to generate fundamental rather than translational data, but unfortunately, too many scientists have been diverted by money and the constant need to obtain it to keep the research going. That is the curse of capitalism. It stops being about the ends, and becomes all about the means.

      --
      A brain is a terrible thing to waste... Mind? That's debatable.
    3. Re:Student's T-test by Anonymous Coward · · Score: 0

      There is not room here to correct all of your misunderstandings. Please click around this site for a bit:

      http://stats.stackexchange.com/questions

  16. The Economist just had an article on this by Beeftopia · · Score: 2

    Unreliable research
    Trouble at the lab
    Scientists like to think of science as self-correcting. To an alarming degree, it is not
    Oct 19th 2013 |From the print edition
    The Economist

    First, the statistics, which if perhaps off-putting are quite crucial. Scientists divide errors into two classes. A type I error is the mistake of thinking something is true when it is not (also known as a “false positive”). A type II error is thinking something is not true when in fact it is (a “false negative”). When testing a specific hypothesis, scientists run statistical checks to work out how likely it would be for data which seem to support the idea to have come about simply by chance. If the likelihood of such a false-positive conclusion is less than 5%, they deem the evidence that the hypothesis is true “statistically significant”. They are thus accepting that one result in 20 will be falsely positive—but one in 20 seems a satisfactorily low rate.

    In 2005 John Ioannidis, an epidemiologist from Stanford University, caused a stir with a paper showing why, as a matter of statistical logic, the idea that only one such paper in 20 gives a false-positive result was hugely optimistic. Instead, he argued, “most published research findings are probably false.” As he told the quadrennial International Congress on Peer Review and Biomedical Publication, held this September in Chicago, the problem has not gone away.

    Dr Ioannidis draws his stark conclusion on the basis that the customary approach to statistical significance ignores three things: the “statistical power” of the study (a measure of its ability to avoid type II errors, false negatives in which a real signal is missed in the noise); the unlikeliness of the hypothesis being tested; and the pervasive bias favouring the publication of claims to have found something new.

    http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble

  17. d'oh by Iniamyen · · Score: 1

    Oh, people can come up with statistics to prove anything, Kent. 14% of people know that.

  18. Re:That book about the bell curve by Will.Woodhull · · Score: 2, Insightful

    Unless of course we happen to be working in a chaotic system where strange attractors mean there can be no centrality to the data.

    Chaos theory is a lot younger than the central limit theorem. The situation might be similar to the way Einstein's theory of relativity has moved Newton's three laws from a position of central importance in all physics to something that works well enough in a small subset. A subset that is extremely important in our daily life, but still a subset.

    Some portions of a chaotic system will be consistent with what the central limit theorem would predict. Other data sets from the same system, uh, no.

    An important question I do not believe has been answered yet (I am an armchair follower of this stuff, neither expert nor student) is whether all the systems we work with where the CLT does seem to hold are merely subsets of larger systems. A related question would be whether there is any test that can be applied to a discrete data set that rules out its being a subset of a larger chaotic process.

    --
    Will
  19. What is a p Value? by mrsquid0 · · Score: 1

    A significant problem is that many of the people who quote p values do it without understanding what a p value actually means. Getting p = 0.05 does not mean that there is only a 5% chance that the model is wrong. That is one of the fundamental misunderstandings in statistics, and I suspect that it is behind a lot of the cases of scientific irreproducibility.

    --
    Just because you are paranoid does not mean that no-one is out to get you.
  20. Re:That book about the bell curve by Anonymous Coward · · Score: 0

    You hardly need chaos theory to come up with examples where a statistical estimator is not normally-distributed.

    Chaos also occurs in the dynamic evolution of a system, so it's hard to see the connection you're implying with statistics. Even one example would be great.

  21. I defer to Feynman by xski · · Score: 2
    1. Re:I defer to Feynman by sfdrew83 · · Score: 1

      You completely missed the point of what he was saying; it isn't about working with numbers or not, but rather about how you verify your hypothesis. In real science, you aren't allowed to be wrong some of the time. If a theory ever fails, then you have to go back to the drawing board. You aren't allowed to be unsure either. It isn't good enough to say "we dropped a thousand objects in North America and measured their acceleration, and based on a p-value of 0.0005 we can infer that acceleration due to gravity is ~9.8 m/s/s". That's what he meant when he was talking about "really knowing something." The theory of gravity has changed over time, but at its heart, it tells you something fundamental about the universe. There are no studies in the Social Sciences that tell you something fundamental about humans; we are too complicated and our behaviour is too unpredictable. The Social Sciences are worthwhile, but they aren't the same as Science, and it hurts both disciplines to keep confusing them. We need to be more earnest about what we know for sure, and what we probably know.

  22. Yes and no by golodh · · Score: 4, Interesting
    As you say, there is the Central Limit Theorem (a whole bunch of them actually) that says that the Normal distribution is the asymptotic limit that describes unbelievably many averaging processes.

    So it gives you a very valid excuse to assume that the value distribution of some quantity occurring in nature will follow a Normal distribution when you know nothing else about it.

    But there's the crux: it remains an assumption; a hypothesis, and fortunately it's usually a *testable* hypothesis. It's the responsibility of a researcher to check if it holds, and to see how problematic it is when it doesn't.

    If something has a Normal distribution, its square or its square root (or another power) doesn't have a Normal distribution. Take for example the diameter, surface area, and volume of berries: the diameter goes with the radius, r, the surface area goes with r^2, and the volume goes with r^3. They cannot all be Normally distributed at the same time, so assuming that any one of them is starts you out on a shaky foundation.
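
    A quick numerical check of the berry example, as a sketch only: the radius is assumed to be Normal with a made-up mean of 5 and standard deviation of 1, constant factors like pi are dropped, and sample skewness is used as a crude symptom of non-normality (a Normal distribution has skewness 0).

```python
import random


def skewness(values):
    """Sample skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)
# Hypothetical berries: radius ~ Normal(mean=5, sd=1), arbitrary units.
radii = [rng.gauss(5.0, 1.0) for _ in range(100000)]
areas = [r ** 2 for r in radii]    # proportional to surface area
volumes = [r ** 3 for r in radii]  # proportional to volume

skew_radius = skewness(radii)
skew_area = skewness(areas)
skew_volume = skewness(volumes)

print(f"skewness of radius: {skew_radius:+.3f}")
print(f"skewness of area:   {skew_area:+.3f}")
print(f"skewness of volume: {skew_volume:+.3f}")
```

    The radius comes out with skewness near zero, while area and volume are visibly right-skewed, so they cannot be Normal if the radius is.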

    1. Re:Yes and no by colinrichardday · · Score: 1

      Maybe none of them is normally distributed, but if we take distributions of sample means of 50 berries, then those distributions might all be close to the normal distribution.
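
      A sketch of that claim, under assumed numbers: berry radii drawn from a hypothetical Normal(5, 1), volume taken as r^3, and skewness used as a rough gauge of how far a distribution is from the symmetric normal shape.

```python
import random


def skewness(values):
    """Sample skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)


def berry_volume():
    """Volume (up to a constant) of one hypothetical berry."""
    return rng.gauss(5.0, 1.0) ** 3


sample_size = 50
raw_volumes = [berry_volume() for _ in range(20000)]
mean_volumes = [
    sum(berry_volume() for _ in range(sample_size)) / sample_size
    for _ in range(20000)
]

skew_raw = skewness(raw_volumes)
skew_means = skewness(mean_volumes)

print(f"skewness of individual volumes:      {skew_raw:+.3f}")
print(f"skewness of means of {sample_size} volumes: {skew_means:+.3f}")
```

      The individual volumes are clearly skewed, while the means of 50 are much closer to symmetric, though not perfectly so at this sample size.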

    2. Re:Yes and no by umafuckit · · Score: 1

      As you say, there is the Central Limit Theorem (a whole bunch of them actually) that says that the Normal distribution is the asymptotic limit that describes unbelievably many averaging processes.

      So it gives you a very valid excuse to assume that the value distribution of some quantity occurring in nature will follow a Normal distribution when you know nothing else about it.

      If your sample distribution is non-normal and you're using tests that assume normality then you're fucked regardless of the central limit theorem. Anyway, the central limit theorem tells you that the means of repeated samples will be normally distributed, but this isn't usually what you're applying your test to. You're usually applying your test to a single sample and that may well be very non-normal (which is the point of the central limit theorem).

    3. Re:Yes and no by Anonymous Coward · · Score: 0

      Actually, for very large n, the estimators of each one are approximately normal. You can show it quite easily with a first-order Taylor expansion justified by the law of large numbers, which says that the probability mass of the estimator is concentrated enough that the linear expansion is valid. A linear function of a normal random variable is still normal.

      However this only works when the variance is asymptotically zero.

    4. Re:Yes and no by Smauler · · Score: 1

      The diameter goes with the radius, r, the surface area goes with r^2, and the volume goes with r^3. They cannot all be Normally distributed at the same time, so assuming that any one of them is starts you out on a shaky foundation.

      So what? People are only trying to show normalised distribution over one thing. No one is trying to tell you what the best berry is. You can obviously do normalised distribution over each one of those, the radius, area, and volume. I don't see the problem with that.

    5. Re:Yes and no by Anonymous Coward · · Score: 0

      However, their logarithms can all be normally distributed at the same time.
      Besides, neither the diameter, nor the area, nor the volume can ever be normally distributed to begin with, because they cannot take negative values.

    6. Re:Yes and no by efarng · · Score: 1
      Why did the original incorrect post get score 5, but the corrections only get 2?

      As was stated, the CLT states that the average of many tests will be approximately normally distributed. Imagine someone producing a positive result in a paper. Then imagine we have 99 other people replicate exactly the same experiment. Each experiment will give us an average result, for a total of 100 averages from 100 experiments. It is this set of averages that is normally distributed.

      This makes no assumption on the distribution of the sample itself. Sampling distribution does not need to be normally distributed. It only requires that the samples be independent and identically distributed.

      Now, in reality, we only perform one experiment and draw our conclusions from that. In frequentist statistics (the type that you most likely learned), we use our single experiment to infer the other 99 experiments. Here, it is important that we pick the correct statistical test, since different tests make different assumptions. The basic t-test does have normality, homoscedasticity, and independence assumptions, which usually hold because of the Law of Large Numbers. But these assumptions can be tested for and, if they are not met, the scientist/statistician will choose a different statistical test instead.

      Just to finish the review: Now that we inferred the entire universe of possible results, we assume the Null Hypothesis: Our treatment in the experiment did nothing (treatment group vs control), or the two groups are identical (blacks/whites or rich/poor). Due to sampling, the average of each group will vary every time we perform the experiment. The statistical test measures how often we will see our particular result in relation to the entire universe of possible results, again assuming that there is no treatment effect. If (assuming no effect) we rarely see our result or results more extreme/larger/further apart, then we have evidence that the treatment was the cause for the difference, and not random chance.

      To explain the paper: The authors used a different type of statistics called Bayesian Statistics to derive their results. This branch of statistics is philosophically different, though they have developed analogs of all the popular frequentist statistical tests. New results in Bayesian Statistics allow for direct comparison of the two branches of statistics, and the authors have concluded that a p=0.05 in frequentist statistics corresponds to a Bayes Factor of 3.47 in Bayesian Statistics. Bayes Factor is the ratio of the probability of seeing the data in this experiment assuming that the treatment DID have an effect vs. the probability of seeing the data assuming the treatment DID NOT have an effect. (Note: we are not looking at the entire universe of possible results; we use only our single result.) In other words, given the chances of seeing two heads in a row, we are saying that there is a treatment effect vs no treatment effect.
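
      To see what a Bayes Factor of that size buys you, here is a small sketch that turns it into a posterior probability that there is no effect. It assumes equal prior odds for "effect" and "no effect", which is an assumption of this example; the function name is made up.

```python
def posterior_prob_null(bayes_factor, prior_odds_alt=1.0):
    """Posterior probability that the treatment did nothing.

    bayes_factor:   P(data | effect) / P(data | no effect)
    prior_odds_alt: prior odds of 'effect' vs 'no effect'
                    (1.0 means both were equally likely beforehand)
    """
    posterior_odds_alt = bayes_factor * prior_odds_alt
    return 1.0 / (1.0 + posterior_odds_alt)


for bf in (3.0, 3.47, 5.0):
    print(f"Bayes factor {bf}: P(no effect | data) = {posterior_prob_null(bf):.3f}")
```

      Under that equal-odds assumption, Bayes Factors between about 3 and 5 leave a 17-25% chance that there is no effect, which matches the range quoted in the summary.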

    7. Re:Yes and no by dcollins · · Score: 2

      Argggh, you guys are all missing the point that the Central Limit Theorem is about the sampling distribution of the sample mean, i.e., the sample space for possible averages that you get as a result of your sampling process. (Or a proportion, equivalent to an average on booleans 0 or 1.) So you can always use this knowledge, for a fair sample size, to assess how likely it is that your sample mean is usefully close to the population mean.

      What are some things that definitely have an approximately normal distribution for a fair sample size? The average of anything. Yes to biological length or height. Yes to mechanical error. Yes to the average of some diameters, surface areas, or volumes of berries or anything else. All sample averages, or sums or differences of averages, or proportions (or more fundamentally any statistic based on addition), are in fact normally distributed for a fair sample size. No doubt about it.

      --
      We know where leadership by an anti-intellectual "strongman" who scapegoats minorities and likes boisterous rallies goes
    8. Re:Yes and no by dcollins · · Score: 1

      "You're usually applying your test to a single sample and that may well be very non-normal."

      No, usually you're applying your test to the average of a single sample, and since possible averages are always normally distributed (for a fair sample size), you can indeed use that to assess the likelihood that your one sample average is usefully close to the population average.

      --
      We know where leadership by an anti-intellectual "strongman" who scapegoats minorities and likes boisterous rallies goes
    9. Re:Yes and no by umafuckit · · Score: 1

      No, usually you're applying your test to the average of a single sample, and since possible averages are always normally distributed (for a fair sample size), you can indeed use that to assess the likelihood that your one sample average is usefully close to the population average.

      But only if your sample is not biased because you determine the average based upon your sample. If the sample is badly distributed or biased in some way then your estimate of the population mean will be similarly biased and you will not be able to make meaningful inferences regarding the population mean. Problems are worse with small samples.

      The same thing happens with parametric stats tests. e.g. with a t-test, the p-value it returns is only meaningful if the sample meets the assumptions of the test. If the sample distribution violates those assumptions substantially (and the test is robust to some degree) then your estimates will be biased and your p-value not trustworthy (i.e. the actual and nominal alpha levels will differ). This is something you can simulate and watch in action: http://www.ruf.rice.edu/~lane/stat_sim/robustness/ http://onlinestatbook.com/2/tests_of_means/robust_sim.html

    10. Re:Yes and no by dcollins · · Score: 1

      "But only if your sample is not biased because you determine the average based upon your sample. If the sample is badly distributed or biased in some way then your estimate of the population mean will be similarly biased and you will not be able to make meaningful inferences regarding the population mean. Problems are worse with small samples."

      You are almost terrifyingly confused; what you wrote above is nonsense. What I wrote earlier, "the likelihood that your one sample average is usefully close to the population average", is a probability question, and it's not dependent on any one sample. It is in fact the probability that your sample is biased or not (as you put it). Saying that this probability changes because of one oddball sample is incorrect.

      Look, go to your linked applet. Set the sample sizes to 25. Keep the default population shapes (no skew, i.e., normal) and click "simulate". The Type I error rate will almost certainly be 0.05 (rounded off). Now change the population shapes to "severely skewed" (the most severe available), keep the sample size at 25, and run again. The Type I error rate will again be almost exactly 0.05.

      The population shape makes no difference. For a sample of at least 25 or anything higher, the shape of the possible sample means (not shown in the applet) is normally distributed anyway, regardless of the population shape. That's the subject of the CLT anytime it's being discussed. Only for itty-bitty sample sizes (less than 25; really the subject of the applet) does the population shape make any difference.
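
      For anyone without the applet handy, a rough stand-in simulation: two groups of 25 drawn from the same population, compared with a pooled two-sample t-test. The exponential distribution is this sketch's own choice of "severely skewed" population, not necessarily what the applet uses, and the critical value is hard-coded from standard t tables.

```python
import math
import random

# Two-sided 5% critical value for 48 degrees of freedom (25 + 25 - 2),
# taken from standard t tables.
T_CRIT_DF48 = 2.0106


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))


def type_one_rate(draw, n=25, trials=10000, seed=2):
    """Share of trials that reject 'equal means' when both groups
    really come from the same population."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [draw(rng) for _ in range(n)]
        b = [draw(rng) for _ in range(n)]
        if abs(pooled_t(a, b)) > T_CRIT_DF48:
            rejections += 1
    return rejections / trials


normal_rate = type_one_rate(lambda rng: rng.gauss(0.0, 1.0))
skewed_rate = type_one_rate(lambda rng: rng.expovariate(1.0))

print(f"Type I error rate, normal population: {normal_rate:.3f}")
print(f"Type I error rate, skewed population: {skewed_rate:.3f}")
```

      Both rates should come out near 0.05. Note that with two equal-sized groups from the same skewed population, the skew largely cancels in the difference of means; a one-sample test on skewed data may be less forgiving.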

      --
      We know where leadership by an anti-intellectual "strongman" who scapegoats minorities and likes boisterous rallies goes
  23. The real issue by Okian+Warrior · · Score: 5, Interesting

    Okay, here's the real problem with scientific studies.

    All science is data compression, and all studies are intended to compress data so that we can make future predictions. If you want to predict the trajectory of a cannonball, you don't need an almanac cross referencing cannonball weights, powder loads, and cannon angles - you can calculate the arc to any desired accuracy with a set of equations that fit on half a page. The half-page compresses the record of all prior experience with cannonball arcs, and allows us to predict future arcs.

    Soft science studies typically make a set of observations which relate two measurable aspects. When plotted, the data points suggest a line or curve, and we accept the linear-regression (line or polynomial) as the best approximation for the data. The theory being that the underlying mechanism is the regression, and unrelated noise in the environment or measurement system causes random deviations of observation.

    This is the wrong method. Regression is based on minimizing squared error, which was chosen by Laplace for no other reason than that it is easy to calculate. There are lots of "rationalization" explanations of why it works and why it's "just the best possible thing to do", but there's no fundamental logic that can be used to deduce least squares from fundamental assumptions.

    Least squares introduces several problems:

    1) Outliers will skew the values, and there is no computable way to detect or deal with outliers (source).

    2) There is no computable way to determine whether the data represent a line or a curve - it's done by "eye" and justified with statistical tests.

    3) The resultant function frequently looks "off" to the human eye, humans can frequently draw better matching curves; meaning: curves which better predict future data points.

    4) There is no way to measure the predictive value of the results. Linear regression will always return the best line to fit the data, even when the data is random.

    The right way is to show how much the observation data is compressed. If the regression function plus data (represented as offsets from the function) take fewer bits than the data alone, then you can say that the conclusions are valid. Further, you can tell how relevant the conclusions are, and rank and sort different conclusions (linear, curved) by their compression factor and choose the best one.

    Scientific studies should have a threshold of "compresses data by N bits", rather than "1-in-20 of all studies are due to random chance".
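
    One rough way to put numbers on the "does the model compress the data" idea is a two-part code length: bits for the residuals plus bits for the parameters. The sketch below uses the familiar BIC-style approximation (half of log2(n) bits per parameter, Gaussian code for residuals). That choice is an assumption made for illustration, not the scheme the post has in mind, and the data are a deterministic toy set rather than real measurements.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def code_length_bits(residuals, n_params):
    """BIC-style two-part code length, up to a constant shared by all models:
    bits for the residuals (Gaussian code) plus bits for the parameters."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return 0.5 * n * math.log2(rss / n) + 0.5 * n_params * math.log2(n)


def compare_models(xs, ys):
    """Return (bits for constant-mean model, bits for straight-line model)."""
    mean_y = sum(ys) / len(ys)
    slope, intercept = fit_line(xs, ys)
    const_resid = [y - mean_y for y in ys]
    line_resid = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return code_length_bits(const_resid, 1), code_length_bits(line_resid, 2)


xs = list(range(20))
# Deterministic stand-in for noise: alternating +0.5 / -0.5.
wiggle = [0.5 if i % 2 == 0 else -0.5 for i in xs]
trend_ys = [2 * x + 1 + w for x, w in zip(xs, wiggle)]  # real linear trend
flat_ys = wiggle                                        # no trend at all

trend_const, trend_line = compare_models(xs, trend_ys)
flat_const, flat_line = compare_models(xs, flat_ys)
print(f"trend data: line saves {trend_const - trend_line:.1f} bits")
print(f"flat data:  line saves {flat_const - flat_line:.1f} bits")
```

    On the trend data the line pays for its extra parameter many times over. On the flat data the saving is negative, so the line is rejected even though least squares happily returns one.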

    1. Re:The real issue by colinrichardday · · Score: 1

      1) Outliers will skew the values, and there is no computable way to detect or deal with outliers (source [wikipedia.org])

      Do outliers skew the results? If the outliers are biased, then that may tell us something about the underlying population. If they aren't biased, then their effects cancel.

      4) There is no way to measure the predictive value of the results. Linear regression will always return the best line to fit the data, even when the data is random.

      But random data would generate statistically insignificant correlation coefficients. Also, the 95% confidence intervals used to predict values are wider for random data.

    2. Re:The real issue by Okian+Warrior · · Score: 1

      Do outliers skew the results? If the outliers are biased, then that may tell us something about the underlying population. If they aren't biased, then their effects cancel.

      There's no algorithm that will identify the outliers in this example.

      But random data would generate statistically insignificant correlation coefficients. Also, the 95% confidence intervals used to predict values are wider for random data.

      What value of correlation coefficient distinguishes pattern data from random data in this image?

    3. Re:The real issue by Anonymous Coward · · Score: 0

      Some of the things you mentioned are good reasons to use Robust Statistics

    4. Re:The real issue by stymy · · Score: 1

      Actually, there is a really good reason to use least-squares regression. A model that minimizes squared error is guaranteed to minimize the variance of error, obviously. Now assume that in a model you have taken into account all variables that have real predictive value, and are fairly independent. Then your error should be normally distributed, and randomly scattered over the range of your data, by the Central Limit Theorem. So if your data looks like that after fitting the model, your model probably has very good real predictive value. Note that this definitely may not hold for data where there is no clear causative link; I assume that the variables chosen to predict the response have clear reasons to provide predictive value. For example, if trying to predict the yield of a farm, the soil type, rainfall, sun coverage, and so forth clearly have a part in the resulting yield, but what the farmer drinks on a Sunday night might not, so it's best to exclude it from the model even if the variable has a p-value < 0.05.

    5. Re:The real issue by Anonymous Coward · · Score: 0

      1) Outliers will skew the values, and there is no computable way to detect or deal with outliers (source [wikipedia.org])

      Do outliers skew the results? If the outliers are biased, then that may tell us something about the underlying population. If they aren't biased, then their effects cancel.

      Outliers are often so extreme and rare that despite being statistically unbiased, they nevertheless severely skew statistics which aren't robust to them.

      4) There is no way to measure the predictive value of the results. Linear regression will always return the best line to fit the data, even when the data is random.

      But random data would generate statistically insignificant correlation coefficients. Also, the 95% confidence intervals used to predict values are wider for random data.

      Even random data will show significant correlation coefficients at the rate determined by the p-value threshold for significance (typically 0.05). A set of random data with a significant correlation coefficient is indistinguishable from a genuine correlation.
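
      That rate is easy to check by brute force. A minimal simulation sketch, assuming samples of 20 independent standard-normal pairs and a critical |r| of 0.444 copied from a standard table (both are choices made here for illustration):

```python
import random

# Two-sided 5% critical value of Pearson's r for n = 20 (18 degrees of
# freedom), taken from a standard table; an assumption of this sketch.
R_CRIT_N20 = 0.444


def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def false_positive_rate(n=20, trials=5000, seed=7):
    """Share of purely random x/y samples whose correlation is 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(pearson_r(xs, ys)) > R_CRIT_N20:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"share of random samples with 'significant' r: {rate:.3f}")
```

      Roughly one run in twenty comes out "significant" with nothing there to find.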

      The whole point of statistics is not to give us any certainty as to the validity of conclusions. Certainty is the one thing statistics can never provide. Rather, it's to keep people from throwing up their hands when presented with noisy data that doesn't lie exactly on a line or parabola or whatever. It gives us a knob to turn to control the tradeoff between the ability to discover new knowledge and the risk of misleading ourselves into believing things that aren't true.

    6. Re:The real issue by colinrichardday · · Score: 1

      Outliers are often so extreme and rare that despite being statistically unbiased, they nevertheless severely skew statistics which aren't robust to them.

      If outliers are unbiased, they can affect the results, but how can they skew the results? Also, if they're rare, how much effect can they have?

      A set of random data with a significant correlation coefficient is indistinguishable from a genuine correlation.

      Not on a scatterplot. It's pretty clear how close the data are to the line. Also, how probable is it that random data would have a statistically significant correlation coefficient?

    7. Re:The real issue by colinrichardday · · Score: 1

      There's no algorithm that will identify the outliers in this example [dropbox.com].

      So there's no algorithm for comparing observed values to modeled (predicted) values? The absolute value of the difference between the two can't be calculated? Hmm. . .

      What value of correlation coefficient distinguishes pattern data from random data in this image [wikimedia.org]?

      Are the data in that image random? Also, the data without the four points at the bottom would have a higher correlation coefficient.

    8. Re:The real issue by colinrichardday · · Score: 1

      Also, you may want to account for the difference between the x coordinate of the point and the average of the xs, as having an x coordinate far from the mean contributes to being farther away from the regression line.

    9. Re:The real issue by colinrichardday · · Score: 1

      Oops, I missed the second image. But the correlation coefficients are there. The sets of data that more closely approximate a line have such values close to 1 or -1. The ones that don't have values close to 0.

    10. Re:The real issue by Anonymous Coward · · Score: 0

      lolwut? Here is an algorithm to identify the outliers in that example:

      if y<1000 return TRUE;
      return FALSE;

      I guess you meant a pre-specified algorithm. Sure, just reject every point with the y-value over 3 sigmas away. Done. Sure it won't work everywhere but it's a decent place to start as long as a human reviews it.
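
      A bare-bones version of that 3-sigma rule, as a sketch only. The readings are invented for the demo, and the function flags points for a human to look at rather than deleting anything.

```python
import statistics


def three_sigma_outliers(values, threshold=3.0):
    """Return indices of values more than `threshold` sample standard
    deviations away from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > threshold * sd]


# Hypothetical data: 19 readings just above 1000 and one reading of 0.
readings = [1000 + i for i in range(19)] + [0]
flagged = three_sigma_outliers(readings)
print(flagged)
```

      One known weakness: the outliers inflate the very standard deviation they are judged against. With the sample standard deviation, no point in a set of n values can sit more than (n-1)/sqrt(n) sigmas from the mean, so with 10 or fewer points this rule can never flag anything.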

      You're being way too pessimistic. Just because we can't completely automate science, doesn't mean that statistics has nothing to contribute.

    11. Re:The real issue by Paradigma11 · · Score: 1

      Continuous data is just one possible scientific problem. Most studies are done comparing 2 groups which differ by a categorical variable. There are other forms of regression like ordinal linear regression...

    12. Re:The real issue by Anonymous Coward · · Score: 0

      1. Yes, there is no universal way to deal with outliers. However there is also no algorithm to universally produce a good visualization of a graph. There are plenty of pretty good ways to deal with outliers, that an expert can choose between and use.
      2. Once precisely formulated, these various hypotheses can in fact be tested against each other.
      3. Actually humans are terrible at predicting future data points. They are far too easily influenced by local "patterns" (actually spurious) they think they detect. In most (not all of course) cases I would trust LS over an average human, though I'd trust an expert over either.
      4. If the data are "random" (I assume you meant that y is independent of x), then the line of best fit will be flat (slope approximately 0). This is in fact the line of best fit as it is defined. If you want to test for no linear relationship, then you can test that.

      Your compression idea is being studied as algorithmic information theory and MDL (minimum description length) methods. One problem with it is that it's difficult to test hypotheses in a way humans can understand. I mean, what's the cutoff for whether there is "really" a signal? 10% compression? 20%? There's a difference between "raw" entropy, and information that actually means something about the question being asked.

      It seems like you found a lot of problems with what is taught in a stat 101 class. This is good; there are many. However, there are also solutions to these problems which you would find if you took a higher level course.

    13. Re:The real issue by Anonymous Coward · · Score: 0

      chosen by Laplace for no other reason than that it is easy to calculate

      It's used because it's the standard L2 norm in the N-dimensional vector space spanned by the samples. If your prediction vector is X and your measurement vector is Y then least squares minimizes |X-Y| in that vector space. This is not the only way to do it, but neither is it completely arbitrary.

      Outliers will skew the values, and there is no computable way to detect or deal with outliers (source [wikipedia.org]).

      In the context of least-squares fit you can perform N fits on N-1 points and calculate a Bayesian estimate for likelihood the missing point fits. But as others have said it's not clear if you want to "deal with" outliers, unless you are expecting some significant non-modelled behaviour.

      There is no computable way to determine whether the data represent a line or a curve - it's done by "eye" and justified with statistical tests.

      There are all kinds of tests to determine goodness of fit to polynomials. Simple log-linear regression can give you N in y=x^N to start you off.
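
      The exponent trick is short enough to show. A minimal sketch, assuming positive x and y and a pure power law y = c * x^N; it regresses log(y) on log(x), so strictly speaking it is a log-log fit. The data are made up.

```python
import math


def power_law_exponent(xs, ys):
    """Least-squares slope of log(y) against log(x).
    For y = c * x**N this slope is N."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx


xs = [1, 2, 3, 4, 5, 6, 7, 8]
cubic = [2 * x ** 3 for x in xs]   # hypothetical curved data
linear = [5 * x for x in xs]       # hypothetical straight-line data

cubic_exponent = power_law_exponent(xs, cubic)
linear_exponent = power_law_exponent(xs, linear)
print(round(cubic_exponent, 6), round(linear_exponent, 6))
```

      An exponent near 1 suggests a line through the origin; anything clearly different suggests a curve. Noisy data, or data with an offset term, need more care than this.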

      The resultant function frequently looks "off" to the human eye, humans can frequently draw better matching curves; meaning: curves which better predict future data points.

      Nonsense. Humans draw curves that aesthetically look better, but these results do a worse job at prediction. Optimal Bayesian estimation is called optimal for a reason. (Please correct me if I'm wrong - do you have any scientific results that back this claim up?)

      There is no way to measure the predictive value of the results. Linear regression will always return the best line to fit the data, even when the data is random.

      It will also return an error variance measurement which you can use to tell one from the other.

      The right way is to show how much the observation data is compressed. If the regression function plus data (represented as offsets from the function) take fewer bits than the data alone, then you can say that the conclusions are valid. Further, you can tell how relevant the conclusions are, and rank and sort different conclusions (linear, curved) by their compression factor and choose the best one.

      Cool, let's design such a metric.

      It clearly can't be based on floating-point binary representations. These are not scale-free. What I mean is, under one scale, you might have an error of 0.5 and you might say that needs 2 bits to store. Under another scale the same error would be 0.3333333 which takes infinite bits to store, under the same storage scheme. Even if you solve that, you'd get that a small but complex error (e.g. 1/pi^100) is worse than a huge but simple one (e.g. 2^64). That's just ridiculous.

      So we have to use a metric where larger errors give a larger metric. A monotonic function, at least. So let's choose log2(error), right? That's kind of "number of bits" - in fact if you decide on a fixed-point storage scheme and each error is a multiple of some "quantum of error" then this gives you a kind of information-theoretic integer basis where larger errors take more bits to store. So you have log2(error/errorQuantum). Oh, except error might be negative - tell you what, let's square it, and then halve the value from the log: 1/2 * log2((error*error)/(errorQuantum*errorQuantum)).

      A bit more handwaving to take into account that the reals have measure zero so we want to look at error intervals, and we've figured out p(error) under a normal distribution. Oh, how we have revolutionized statistics! :) Since the use of normal distributions is completely bog-standard I assume you'll let me stop there ;)

    14. Re:The real issue by Anonymous Coward · · Score: 0

      (To be fair, I quite like the way your information theoretic measure combines simplicity of model with severity of error within the same metric, thus combining significance testing and Occam's Razor into one big melting pot. That's fairly cool.)

    15. Re:The real issue by Anonymous Coward · · Score: 0

      If you give us the raw data for the line with 4 outliers, we'll give you an algorithm that detects them. Regardless, stating "there's no algorithm that ..." has a burden of proof, which you could at least satisfy with references (Godel? Turing? :p)

      As for the correlation being 0 in some of those non-trivial dot images, sure, but correlation is only 2nd-order so can only detect the top images (variants from circle -> ellipse -> line). Plenty of computer vision algorithms can detect the boundaries of the lower images, if that's what you want. Statistics doesn't just stop dead at the 2nd moment.

    16. Re:The real issue by dcollins · · Score: 1

      "If they aren't biased [outliers], then their effects cancel."

      Oh, god no. The book I teach out of says that if outliers exist, it's required to do the regression both with and without the outliers and compare. Frequently there will be a big difference. (Weiss, Introductory Statistics, Sec. 14.2)
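
      The with-and-without comparison is a few lines of code. A toy sketch with invented data: nine points exactly on y = x plus one wild point.

```python
def ols_slope(points):
    """Ordinary least-squares slope for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sxx


# Hypothetical data: nine points exactly on y = x, plus one outlier.
clean = [(x, x) for x in range(9)]
outlier = (9, 30)

slope_without = ols_slope(clean)
slope_with = ols_slope(clean + [outlier])
print(f"slope without outlier: {slope_without:.3f}")
print(f"slope with outlier:    {slope_with:.3f}")
```

      One point out of ten more than doubles the slope here, partly because it sits at the edge of the x range where it has the most leverage.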

      --
      We know where leadership by an anti-intellectual "strongman" who scapegoats minorities and likes boisterous rallies goes
    17. Re:The real issue by Okian+Warrior · · Score: 1

      It clearly can't be based on floating-point binary representations. These are not scale-free. What I mean is, under one scale, you might have an error of 0.5 and you might say that needs 2 bits to store. Under another scale the same error would be 0.3333333 which takes infinite bits to store, under the same storage scheme. Even if you solve that, you'd get that a small but complex error (e.g. 1/pi^100) is worse than a huge but simple one (e.g. 2^64). That's just ridiculous.

      So we have to use a metric where larger errors give a larger metric. A monotonic function, at least. So let's choose log2(error), right?

      Ah, damn! If only you weren't anonymous, I'd love to continue this conversation offline!

      That level of insight! I didn't expect anyone to be familiar enough with the foundations of statistics to make this observation.

      You were correct up to the point where you chose log2(error) as a metric. There's a function which can be deduced from first principles, and which solves all the above-mentioned problems. Can you find it?

      Back-edit the statistical methods and use the new error function instead of least squares. This leads to a solution to the fixed-width floating point problem, the regression problems mentioned above (including mixtures of gaussians mentioned in a side-post), and an algorithmic solution to the front-page article issue. (As a side-note, it's the first step towards hard-AI.)

      I'm writing up my results right now - we'll see if my solution holds up under scrutiny.

    18. Re:The real issue by Anonymous Coward · · Score: 0

      RANSAC

  24. Re:That book about the bell curve by Will.Woodhull · · Score: 1

    Chaos also occurs in the dynamic evolution of a system, so it's hard to see the connection you're implying with statistics.

    When I turn that around, it seems to say that statistics is only of value in systems that have fully matured. Which sounds like most of the time statistics have no value.

    Is that correct? Or is there some other way to reverse the quotation?

    --
    Will
  25. Re:That book about the bell curve by Anonymous Coward · · Score: 1

    Well, there's really nothing to turn around... you're spouting a lot of pseudo-science here, and still nothing that you've said has even suggested why statistics wouldn't work on "immature" (whatever that means) systems. The central limit theorem can apply to dynamic systems, and even if the CLT didn't hold, that doesn't mean that statistics is impossible. There are many estimators which do not obey the CLT.

    Just google "statistics of chaotic systems" or whatever. You'll find plenty of work on the subject. Admittedly, they are using "statistics" the way physicists do, but it's still the same idea: a mathematical characterization.

    Basically, whenever there is a probabilistic model for something, statistics happens when you are ignorant of (certain aspects of) the model, and try to infer what you don't know from the data. Again, google "dynamic statistical models"; you'll find a lot.

  26. An example by Michael+Woodhams · · Score: 2

    Having quickly skimmed the paper, I'll give an example of the problem.
    I couldn't quickly find a real data set that was easy to interpret, so I'm going to make up some data.
                  Chance to die before reaching this age
    Age woman man
    80 .54 .65
    85 .74 .83
    90 .88 .96
    95 .94 .98

    We have a person who is 90 years old. Taking the null hypothesis to be that this person is a man, we can reject the hypothesis that this is a man with greater than 95 percent confidence (p=0.04). However, if we do a Bayesian analysis assuming prior probabilities of 50 percent for the person being a man or a woman, we find that there is a 25 percent chance that the person is a man after all (as women are 3 times more likely to reach age 90 than men are.)

    (Having 11 percent signs in my post seems to have given /. indigestion so I've had to edit them out.)

    --
    Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.
    1. Re:An example by Anonymous Coward · · Score: 0

      Unfortunately, that's not quite right, but it's in the spirit of the right way of thinking at least. If you know a person is 90 years old, then that is prior information that could be used in a Bayesian estimate of the gender of your mystery person. Given that the person is 90, we could say that 12 percent of women, and 4 percent of men make it to that age. This gives the person a 25 percent chance of being a man, and a 75 percent chance of being a woman.

      If you ignore the Bayesian prior probabilities, you don't have much of an example of anything. You could measure the gender of the population and conclude that on average 50 percent (maybe not quite) of people are male, which would not be anywhere near sufficient evidence to conclude anything about a particular person.

    2. Re:An example by Anonymous Coward · · Score: 0

      This is misrepresenting frequentist statistics because you've set up the problem wrong. With this approach someone over 100 years old is neither a man nor a woman, because the probability of someone dying before then is greater than 95%.

      A frequentist still gets to use Bayes theorem in this case and winds up with the same answer, with P(M|>90) = ( P(>90|M)*P(M) ) / ( P(>90|M)*P(M) + P(>90|W)*P(W) ) using equal probabilities for man or woman (.04*.5) / (.04*.5 + .12*.5) = .25
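
      The same arithmetic as a small function, for anyone who wants to try other rows of the (made-up) table from the parent post. The 50/50 prior is the one assumed above.

```python
def posterior_man(p_die_before_man, p_die_before_woman, prior_man=0.5):
    """P(man | survived to this age) by Bayes' theorem, given each sex's
    probability of dying before that age."""
    survive_man = 1 - p_die_before_man
    survive_woman = 1 - p_die_before_woman
    numerator = survive_man * prior_man
    return numerator / (numerator + survive_woman * (1 - prior_man))


# Rows from the invented table: age -> (woman, man) chance to die before it.
table = {80: (0.54, 0.65), 85: (0.74, 0.83), 90: (0.88, 0.96), 95: (0.94, 0.98)}

posteriors = {age: posterior_man(man, woman) for age, (woman, man) in table.items()}
for age, p in posteriors.items():
    print(f"age {age}: P(man) = {p:.2f}")
```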

  27. Well, duh. by Black+Parrot · · Score: 1

    Johnson found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in many fields including social science — still meant that as many as 17–25% of such findings are probably false (PDF).

    .
    Found? Was he unaware that using a threshold of 0.05 means a 20% probability that a finding is a chance result - by definition ?

    More interesting, IMO, is that statistical significance doesn't tell you what the scale of an effect is. There can be a trivial difference between A and B even if the difference is statistically significant. People publish it anyway.

    --
    Sheesh, evil *and* a jerk. -- Jade
    1. Re:Well, duh. by ColdWetDog · · Score: 1

      There can be a trivial difference between A and B even if the difference is statistically significant. People publish it anyway.

      This is especially prevalent in medicine (especially drug advertising). If you look at medical journals, they are replete with ads touting that Drug A is 'statistically better' than Drug B. Even looking at the 'best case' data (the pretty graph in the advert) you quickly see that the lines very nearly converge. Statistically significant. Clinically insignificant.

      Lies, Damned Lies and Statistics

      --
      Faster! Faster! Faster would be better!
    2. Re:Well, duh. by Mr.+Slippery · · Score: 2
      Was he unaware that using a threshold of 0.05 means a 20% probability that a finding is a chance result - by definition ?

      A P-value of 0.05 means by definition that there is a 0.05, or 5%, or 1 in 20, probability that the result could be obtained by chance even though there's no actual relationship.

      --
      Tom Swiss | the infamous tms | my blog
      You cannot wash away blood with blood
    3. Re:Well, duh. by Paradigma11 · · Score: 1

      Johnson found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in many fields including social science — still meant that as many as 17–25% of such findings are probably false (PDF).

      . Found? Was he unaware that using a threshold of 0.05 means a 20% probability that a finding is a chance result - by definition ?

      More interesting, IMO, is that statistical significance doesn't tell you what the scale of an effect is. There can be a trivial difference between A and B even if the difference is statistically significant. People publish it anyway.

      Of course it was found. The 20% are not by definition but a function of the percentage of studies done based on correct/incorrect H1. You could have 0% if you only did studies on correct H1s.
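
      To make that dependence concrete, here is the usual back-of-envelope bookkeeping. It is plain Bayes arithmetic, not the method in Johnson's paper, and the power of 0.8 and the share of true hypotheses are invented inputs.

```python
def false_finding_share(alpha, power, share_true):
    """Fraction of 'significant' results that are false positives, given the
    significance threshold, the power of the studies, and the share of
    tested hypotheses (H1) that are actually true."""
    false_positives = alpha * (1 - share_true)
    true_positives = power * share_true
    return false_positives / (false_positives + true_positives)


# Hypothetical scenario: 20% of tested H1s are true, studies have 80% power.
at_05 = false_finding_share(alpha=0.05, power=0.8, share_true=0.2)
at_005 = false_finding_share(alpha=0.005, power=0.8, share_true=0.2)
only_true = false_finding_share(alpha=0.05, power=0.8, share_true=1.0)
print(f"alpha=0.05:  {at_05:.3f}")
print(f"alpha=0.005: {at_005:.3f}")
print(f"only true H1s tested: {only_true:.3f}")
```

      With these inputs a 0.05 threshold leaves a fifth of the findings false, a 0.005 threshold leaves a few percent, and testing only true hypotheses leaves none.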

    4. Re:Well, duh. by Paradigma11 · · Score: 1

      Wanted to reply to the parent, sorry.

    5. Re:Well, duh. by wonkey_monkey · · Score: 1

      Found? Was he unaware that using a threshold of 0.05 means a 20% probability that a finding is a chance result - by definition ?

      1 in 20 != 20%.

      If p=0.2, then you'd be right.

      --
      systemd is Roko's Basilisk.
  28. Re: Impossible! by KeensMustard · · Score: 1

    In what way ?

  29. Integrating to x sigma by mdsolar · · Score: 1

    A surprising number of senior scientists are not aware of the problems introduced by ending an experiment based on achieving a certain significance level. By taking the significance as the criterion of the experiment, you don't actually know anything about the significance. Your highly significant result may just be a fluctuation because, had you continued, the high signal-to-noise ratio could well dissipate. Too often I've heard senior scientists advising junior scientists: You've got three sigma, publish. But, proper procedure is to design an experiment to run for a certain duration and then find out what the result is.
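
    The size of the effect is easy to underestimate, so here is a small simulation sketch. Assumptions made for illustration: pure noise with known unit variance, a z-test at the two-sided 5% level, and a peek after every 10 observations for up to 20 peeks.

```python
import random

Z_CRIT = 1.96  # two-sided 5% critical value of the standard normal


def significant(total, n):
    """z-test of 'mean is zero' for n unit-variance observations whose sum is total."""
    return abs(total) / n ** 0.5 > Z_CRIT


def false_positive_rates(trials=2000, batch=10, looks=20, seed=3):
    """Simulate experiments on pure noise. Returns a pair:
    (rate when stopping at the first significant peek,
     rate when testing once at the planned end)."""
    rng = random.Random(seed)
    stopped_early = 0
    fixed_design = 0
    for _ in range(trials):
        total, n, crossed = 0.0, 0, False
        for _look in range(looks):
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            # Crossing the threshold at any peek is where a
            # "stop when significant" experimenter would have published.
            crossed = crossed or significant(total, n)
        stopped_early += crossed
        fixed_design += significant(total, n)
    return stopped_early / trials, fixed_design / trials


peeking_rate, fixed_rate = false_positive_rates()
print(f"stop at first significant peek: {peeking_rate:.3f}")
print(f"single test at planned end:     {fixed_rate:.3f}")
```

    The fixed design stays near the nominal 5%, while the stop-when-significant rule produces false positives several times as often on exactly the same data.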

    Medicine has a formal means to end a trial early if a medicine turns out to be dangerous or particularly helpful. This is an ethical consideration. But, it does make the trial results void.

    1. Re:Integrating to x sigma by pepty · · Score: 1

      A surprising number of senior scientists are not aware of the problems introduced by ending an experiment based on achieving a certain significance level.

      I think the vast majority are aware of at least some of the problems. The issue is the ones who are willing to publish their results without addressing those problems honestly in the writeup.

  30. Re:That book about the bell curve by Anonymous Coward · · Score: 0

    I'm not certain his conclusions are correct... I could not reproduce them...

  31. not the real problem by ganv · · Score: 3, Insightful

    At one level, they are right that unreproducible results are usually not fraud, but are simply fluctuations that make a study look promising leading to publication. But raising the standard of statistical significance will not really improve the situation. The most important uncertainties in most scientific studies are not random. You can't quantify them assuming a gaussian distribution. There are all kinds of choices made in acquiring, processing, and presenting data. The incentives that scientists have are all pushing them to look for ways to obtain a high profile result. We make our best guesses trying to be honest, but when a set of guesses leads to a promising result we publish it and trust further study to determine whether our guesses were fully justified. There is one step that would improve the situation. We need to provide a mechanism to receive career credit for reproducing earlier results or for disproving earlier results. At the moment, you get no credit for doing this. And you will never get funding to do it. The only way to be successful is to spit out a lot of papers and have some of them turn out to be major results that others build on. The number of papers that turn out to be wrong is of no consequence. No one even notices except a couple of researchers who try to build on your result, fail, and don't publish. In their later papers they will probably carefully dance around the error so as not to incur the wrath of a reviewer. If reproducing earlier results was a priority, then we would know earlier which results were wrong and could start giving negative career credit to people who publish a lot of errors.

    1. Re:not the real problem by Anonymous Coward · · Score: 0

      I agree with you completely. The use of statistical significance is not adequate on its own. Look at it this way. If a single study is considered to make a hypothesis accepted, then why would we need reproducibility? Fraud isn't hindered by a more stringent numerical standard. A fraudster will simply adjust the numbers to meet whatever standard is required. Reproducibility screens for fraud and also screens for sublime influence better than any statistical method.

      Now consider the opportunity cost. What will be missed if we introduce a more stringent standard? Reproducibility is already required so we shouldn't throw babies out with the bathwater by focusing on only one tool. Statistical analysis is a hammer but everything else is not a nail.

  32. oops, my bad by colinrichardday · · Score: 1

    Ahhh!! it's 1/20, not two percent. Of course, it's 5%.

  33. The bigger problem by msobkow · · Score: 1

    The bigger problem is the habit of confusing correlation with cause.

    --
    I do not fail; I succeed at finding out what does not work.
  34. Re:That book about the bell curve by Will.Woodhull · · Score: 1

    you're spouting a lot of pseudo-science here

    I agree that there IS a lot of pseudo-science here, and that I have fallen into a nasty trap.

    What can I say? This is not the first time an AC troll has gotten me good, and it probably will not be the last.

    Now get thee back under that dark, damp, cobwebby bridge where thou belongest! Or I shall sprinkle thee with Troll-B-Gone powder and there will be nothing left around here but some grins and giggles.

    --
    Will
  35. Let's get something straight you non-staticians by j33px0r · · Score: 4, Insightful

    This is a geek website, not a "research" website so stop talking a bunch of crap about a bunch of crap. I'm providing silly examples so don't focus upon them. Most researchers suck at stats and my attempt at explaining should either help out or show that I don't know what I'm talking about. Take your pick.

    "p=.05" is a stat that reflects the likelihood of rejecting a true null hypothesis. So, let's say that my hypothesis is that "all cats like dogs" and my null hypothesis is "not all cats like dogs." If I collect a whole bunch of imaginary data, run it through a program like SPSS, and the results turn out that my hypothesis is correct then I have a 5 percent chance that the software is wrong. In that particular imaginary case, I would have committed a Type I Error. This error has a minimal impact because the only bad thing that would happen is some dogs get clawed on the nose and a few cats get eaten.

    Now, on a typical experiment, we also have to establish beta which is the likelihood of committing a type II error, that is, accepting a false null hypothesis. So let's say that my hypothesis is that "Sex when desired makes men happy" and my null hypothesis is "Sex only when women want it makes men happy." It's not a bad thing if #1 is accepted but the type II error will make many men unhappy.

    Now, this is a give and take relationship. Every time that we make p smaller (.005, .0005, .00005, etc.) for "accuracy," then the risk of committing a type II error increases. A type II error when determining what games 15 year olds like to play doesn't really matter if we are wrong but if we start talking about drugs and false positives then the increased risk of a type II error really can make things ugly.

    Next, there are guidelines for determining how many participants are needed for lower p (alpha) values. Social sciences (hold back your Sheldon jokes) that do studies on students might need, let's say, 35 subjects/people per treatment group at p=.05, whereas a .005 might need 200 or 300 per treatment group. I don't have a stats book in front of me but .0005 could be in the thousands. Every adjustment impacts a different item in a negative fashion. You can have your Death Star or you can have Luke Skywalker. Can't have 'em both.
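
    The textbook rule of thumb behind those guidelines fits in one function. This is a sketch using the normal approximation for a two-sided, two-group comparison of means; the 80 percent power and the medium effect size of 0.5 are assumptions, and different choices move the numbers a lot, so don't read the output as matching the from-memory figures above.

```python
import math
from statistics import NormalDist


def n_per_group(alpha, power=0.8, effect_size=0.5):
    """Approximate participants needed per group to detect a standardized
    mean difference `effect_size` with a two-sided test at level `alpha`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


sizes = {alpha: n_per_group(alpha) for alpha in (0.05, 0.005, 0.0005)}
for alpha, n in sizes.items():
    print(f"alpha={alpha}: about {n} per group")
```

    The direction is the point: hold power and effect size fixed, tighten alpha, and the required sample grows. Hold the sample fixed instead and it is the power that drops, which is the type II trade-off described above.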

    Finally, there is a statistical concept of power, that is, there are stats for measuring the impact of a treatment. Basically, how much of the variance between the group A and group B can be assigned to the experimental treatment. This takes precedence in many people's minds over simply determining if we have a correct or incorrect hypothesis. Assigning p does not answer this.

    Anyways, I'm going to go have another beer. Discard this article and move onto greener pastures.

  36. Re:That book about the bell curve by Daniel+Dvorkin · · Score: 2

    The CLT is one of the most elegant and powerful results in all of mathematics, and can be used, quite appropriately, to justify normal models for all sorts of measurements. That being said, its usefulness has led to the dumbed-down idea of "the bell curve" being the appropriate model for all sorts of things where it's clearly not--I don't know how many times I've seen a normal curve superimposed on a histogram or kernel density estimation of data that are clearly non-normal. As another poster pointed out, there are simple and well-understood tests for normality, and failure to apply them when constructing a normal model is just ridiculous.

    --
    The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
  37. Re: No by stymy · · Score: 1

    The Central Limit Theorem doesn't state that the samples are normally distributed, but that their mean (average) is. So the average surface area, volume, and diameter will all be approximately normally distributed for a large sample of independent berries (i.e. not from the same plants, and so forth).
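
    A quick simulation sketch of that distinction, assuming (purely for illustration) that individual measurements follow a strongly skewed exponential distribution. The helper names are made up. Sample skewness is used as a crude "does it look symmetric" check: it should stay large for the raw values and shrink toward zero for means of 100.

```python
import random

rng = random.Random(42)

def skewness(values):
    """Sample skewness: about 0 for a symmetric, normal-looking shape."""
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return sum(((v - m) / sd) ** 3 for v in values) / n

def mean_of_sample(size):
    """Mean of `size` independent draws from a skewed (exponential) distribution."""
    return sum(rng.expovariate(1.0) for _ in range(size)) / size

single_draws = [mean_of_sample(1) for _ in range(5000)]    # the raw values
means_of_100 = [mean_of_sample(100) for _ in range(5000)]  # sample means

raw_skew = skewness(single_draws)
mean_skew = skewness(means_of_100)
print(f"skewness of individual values: {raw_skew:.2f}")
print(f"skewness of means of 100:      {mean_skew:.2f}")
```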

  38. Re: Impossible! by ralphbecket · · Score: 1

    Climate models are currently, at best, when treated as an ensemble (if you buy that as legitimate), skirting along the p 0.05 level of significance in the validation period.

    Pointing this out is considered trolling -- it probably offends some religious sensibilities.

    Tightening the threshold as the article suggests would mean the model results are not "significant" (i.e., not reasonably distinguishable from natural variation -- note that I am not a "denier" and that I do accept that CO2 is a greenhouse gas etc. etc.; I am however hugely skeptical of most climate and environmental science that I have investigated).

  39. you sound like you know what you're talking about by raymorris · · Score: 1

    It sounds like you have a clue about statistics. Do you know of a good forum to ask a fairly involved statistics question? I have a set of measured variables A-E which all tend to indicate the likelihood of X. The relationships are a bit complex and unknown, though, so I need help with how I should analyze the historical data in order to come up with parameters to use in the future for making "predictions" of X based on known values of A-E.

  40. And has been so for, oh, 50 years? by zedrdave · · Score: 1

    "innovative methods"??? I do not know of a single serious scientist who hasn't been lectured on the ills of weak testing (and told not to use 0.05 as some sort of magical threshold below which everything magically works).

    Back when I was a wee researchling, this is literally one of the first papers I was told to read and internalise (published 20 years ago, and not even particularly breakthrough at the time).

    There is absolutely no need for new evidence or further discussion of the limitations of statistical testing thresholds: anybody who cares is keenly aware of them. People who don't (particularly in some areas of social science) are just looking for a way to get their next paper out the door by any means possible.

  41. Re:That book about the bell curve by Anonymous Coward · · Score: 0

    Chaos also occurs in the dynamic evolution of a system, so it's hard to see the connection you're implying with statistics.

    When I turn that around, it seems to say that statistics is only of value in systems that have fully matured.

    Statistics in today's world is more a financial tool. That system of manipulation has fully matured. I promise you they know exactly what they are doing with that tool.

  42. Variance of error is not what we want by Okian+Warrior · · Score: 1

    Actually, there is a really good reason to use least-squares regression. A model that minimizes squared error is guaranteed to minimize the variance of error, obviously.

    This is the wrong place for an argument (you want room 12-A) and I don't want to get into a contest, but for illustration here is the problem with this explanation.

    A rule learned from experience should minimize the error, not the variance of error.

    It's a valid conclusion from the mathematics, but based on a faulty assumption.

    1. Re:Variance of error is not what we want by fatphil · · Score: 1

      The problem I have with least squares is that I don't like the definition of the "error". If you have two things that are correlated, one isn't necessarily a function of the other that includes some variability. If you flip the X and Y axes over - plot height against weight, rather than weight against height - then the least squares regression gives a different line. If the two errors are both minimised, but different, then neither of them is the "real" error. One's the (squared) distance to the regression line in the X direction, the other is the (squared) distance in the Y direction. Neither is the actual (squared) distance from the line, which would be in a direction perpendicular to that line. Is that not a better error to try and minimise? Also, non-squared rather than squared.

      Naively, I believe that regression lines should be preserved by all affine transformations. Least squares seems to be only preserved under scaling and translation.

      As you commented 3 posts back, the human eye plots a different line from what you get from least squares linear regression. I suspect the human plots a line about half way between the two axes-flipped regressions. And would be preserved by all affine transformations. (And probably has a gradient very close to sqrt(var(X)/var(Y)).)
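
      The different candidate lines can be compared numerically. The sketch below uses made-up height/weight data (the numbers, units and variable names are assumptions) and expresses every slope as dy/dx with Y on the vertical axis. It computes the ordinary Y-on-X fit, the X-on-Y fit, the perpendicular-distance fit (usually called orthogonal or total least squares), and the geometric mean of the two regression slopes, which works out to sqrt(var(Y)/var(X)) under this axis convention.

```python
import math
import random

rng = random.Random(0)
# Hypothetical data: x = height in cm, y = weight in kg, with a noisy linear link.
xs = [rng.gauss(170, 10) for _ in range(2000)]
ys = [0.9 * x - 80 + rng.gauss(0, 12) for x in xs]

def mean(values):
    return sum(values) / len(values)

mx, my = mean(xs), mean(ys)
sxx = mean([(x - mx) ** 2 for x in xs])                     # var(X)
syy = mean([(y - my) ** 2 for y in ys])                     # var(Y)
sxy = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])   # cov(X, Y)

# All slopes are expressed as dy/dx so they can be compared directly.
slope_y_on_x = sxy / sxx   # minimizes squared vertical distances
slope_x_on_y = syy / sxy   # minimizes squared horizontal distances
# Orthogonal fit: minimizes squared perpendicular distances to the line.
slope_orth = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
# Geometric mean of the two regression slopes, carrying the sign of the covariance.
slope_geo = math.copysign(math.sqrt(syy / sxx), sxy)

for name, slope in [("y on x", slope_y_on_x), ("geometric mean", slope_geo),
                    ("orthogonal", slope_orth), ("x on y", slope_x_on_y)]:
    print(f"{name:>15}: dy/dx = {slope:.3f}")
```

      One caveat on the perpendicular fit: its slope depends on the units of the two axes (rescaling kg to g changes which direction counts as perpendicular), whereas the geometric-mean slope simply rescales with the units.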

      However, I have no statistical background. I was a pure mathematician, and stats was just applied woo-woo to me.

      --
      Also FatPhil on SoylentNews, id 863
  43. Hah, could smell it! by Anonymous Coward · · Score: 0

    I always knew there was something wrong with their Pies...

  44. Re:you sound like you know what you're talking abo by Anonymous Coward · · Score: 0

    http://stats.stackexchange.com/questions

  45. How about by Anonymous Coward · · Score: 0

    \begin{rant}
    Actually using statistics in the first place! I'm sick to $#@%@#$%#$TREWT#$@%$#ing death of CS papers with no statistical testing whatsoever. And don't get me started on electronic engineering.
    \end{rant}

    1. Re:How about by Anonymous Coward · · Score: 0

      Hmmm ... how would you statistically test the correctness of an algorithm?

  46. Thanks by Okian+Warrior · · Score: 1

    It seems like you found a lot of problems with what is taught in a stat 101 class. This is good; there are many. However, there are also solutions to these problems which you would find if you took a higher level course.

    That brought a smile to my face. Thanks.

  47. Re:That book about the bell curve by Anonymous Coward · · Score: 0

    Well, it's good to know that the entire field of non-parametric statistics doesn't exist, then.

  48. Or widespread FRAUD... by Anonymous Coward · · Score: 0

    But of course, they wear white coats and are the new 'high priests' who have to be worshipped at any cost. It's not as if most scientists are concerned more with their careers and pensions than with the truth, is it...

  49. Re:you sound like you know what you're talking abo by Anonymous Coward · · Score: 0

    http://stats.stackexchange.com/

  50. Re:you sound like you know what you're talking abo by Anonymous Coward · · Score: 0

    If X is categorical this sounds like a case for logistic regression.
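
    For what that could look like in practice, here is a bare-bones sketch of a binary logistic regression fitted by gradient descent, using only the standard library. The data are synthetic, the "true" weights are invented, and the five columns merely stand in for the measured variables A-E, assumed here to be standardized (mean 0, variance 1). A real analysis would more likely use an established statistics package and check the fit on held-out data.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, bias, row):
    """Estimated probability that the outcome is 1 for one row of predictors."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))

def fit_logistic(rows, labels, learning_rate=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for row, label in zip(rows, labels):
            error = predict(weights, bias, row) - label
            grad_b += error
            for j, value in enumerate(row):
                grad_w[j] += error * value
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

# Synthetic "historical data": columns play the role of A-E (hypothetical).
rng = random.Random(5)
true_weights = [1.5, -1.0, 0.5, 0.0, 2.0]  # invented for the demo
rows = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(500)]
labels = [1 if rng.random() < sigmoid(sum(w * v for w, v in zip(true_weights, row))) else 0
          for row in rows]

weights, bias = fit_logistic(rows, labels)
hits = sum((predict(weights, bias, row) >= 0.5) == (label == 1)
           for row, label in zip(rows, labels))
accuracy = hits / len(rows)
print("fitted weights:", [round(w, 2) for w in weights])
print(f"training accuracy: {accuracy:.2f}")
```

    The fitted weights are the "parameters to use in the future": plug new values of A-E into predict() to get an estimated probability of X.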

  51. Yes, and? by golodh · · Score: 1

    Suppose r1 is a sample of berry radii, and r2 and r3 are samples of that radius squared and cubed.

    What you're talking about is the distribution of the sample means of r1, r2, r3 respectively. Those are asymptotically normally distributed, but that's not what we're talking about here.

    What we were talking about is whether: r1, r2, and r3 can all be normally distributed. The reason being that people investigating the size, weight, and surface area of berries may *assume* (appealing to the Central Limit Theorem) that the quantity they're investigating can be modeled adequately through a normal distribution, and proceed to apply statistical tests based on dealing with normal distributions. For example by comparing the effect of fertilizer on berry size and weight. And it's clear that the distributions of r1, r2, and r3 cannot all be distributed normally.

    So statistical tests based on the assumption that they are normally distributed will operate outside their guaranteed area of applicability, which may or may not cause them to be in error.
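
    A small numerical sketch of why r1, r2 and r3 cannot all be normal. It assumes, for the sake of argument, that the radii themselves are normal with mean 5 and standard deviation 1 (arbitrary, hypothetical units), then checks the skewness of the squared and cubed values. A normal distribution has skewness 0.

```python
import random

rng = random.Random(7)

def skewness(values):
    """Sample skewness: 0 for a perfectly symmetric distribution."""
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return sum(((v - m) / sd) ** 3 for v in values) / n

r1 = [rng.gauss(5.0, 1.0) for _ in range(20000)]  # radii, assumed normal
r2 = [r ** 2 for r in r1]                         # proportional to surface area
r3 = [r ** 3 for r in r1]                         # proportional to volume/weight

skews = {"r1": skewness(r1), "r2": skewness(r2), "r3": skewness(r3)}
for name, value in skews.items():
    print(f"skewness of {name}: {value:.2f}")
```

    If the radii are symmetric, the squared and cubed quantities come out right-skewed, so at most one of the three can be modeled as normal without error.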

    1. Re:Yes, and? by colinrichardday · · Score: 1

      What we were talking about is whether: r1, r2, and r3 can all be normally distributed. The reason being that people investigating the size, weight, and surface area of berries may *assume* (appealing to the Central Limit Theorem) that the quantity they're investigating can be modeled adequately through a normal distribution,

      But the Central Limit Theorem is a claim about the distribution of sample means as the sample size gets larger.

  52. Nice method he's developed there... by CCarrot · · Score: 1

    ... but is it reproducible? :p

    --
    "I love animals! Some are cute, others are tasty, what's not to like?" - Betsy Schroeder, Jeopardy contestant
  53. Re:That book about the bell curve by melikamp · · Score: 1

    That is because of the central limit theorem, (http://en.wikipedia.org/wiki/Central_limit_theorem), which indicated that for a large number of independent samples, it doesn't matter what the original distribution was, and we certainly can reliably use the normal distribution. It is NOT unfounded.

    Emphasis is mine.

    Actually, you are misstating the CLT, which does not work at all for distributions without finite mean or variance (which may well be the case for real-world experiments). And even if the variable we are measuring does have finite mean and variance, the speed of convergence is only possible to quantify in certain cases. So the shape you get from samples of size 1000 may look good to you because you are impressed by a bunch of zeroes, and may even work OK near the estimated mean, but when we look at the tails, we may find that your approximation is not worth a crumpled paper napkin.
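
    The finite-variance condition can be seen in a simulation sketch. The Cauchy distribution is the standard example with no finite mean or variance; the comparison with a normal distribution, the sample sizes and the helper names are choices made here for illustration. The spread of the sample means (measured by the interquartile range) should shrink with sample size for the normal case and refuse to shrink for the Cauchy case.

```python
import math
import random

rng = random.Random(3)

def cauchy():
    """Standard Cauchy draw: a distribution with no finite mean or variance."""
    return math.tan(math.pi * (rng.random() - 0.5))

def normal():
    """Standard normal draw, for comparison."""
    return rng.gauss(0.0, 1.0)

def iqr_of_sample_means(draw, sample_size, repeats=1000):
    """Interquartile range of `repeats` sample means of `sample_size` draws each."""
    means = sorted(sum(draw() for _ in range(sample_size)) / sample_size
                   for _ in range(repeats))
    return means[(3 * repeats) // 4] - means[repeats // 4]

spread = {}
for size in (1, 400):
    spread["normal", size] = iqr_of_sample_means(normal, size)
    spread["cauchy", size] = iqr_of_sample_means(cauchy, size)
    print(f"n={size:>3}: IQR of means  normal={spread['normal', size]:.3f}  "
          f"cauchy={spread['cauchy', size]:.3f}")
```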

  54. Stats by Anonymous Coward · · Score: 0

    Of course it's a proven fact that 87% of all statistics are wrong. 8-)

  55. Well yes, actually by raymorris · · Score: 1

    > do want to necessitate giving some experimental medicine to 10,000 people before assessing whether it's a good idea or not?

    Yes. Before giving it to a million people, we should run statistical calculations on the first 10,000 to better assess safety and efficacy.

    Oh, you meant as opposed to a trial with 200 people. But that's a false dichotomy. You run stats on the first 200 to see whether or not it's likely safe, then run stats on 10,000 to confirm it. Which is to say, you'd wait until you managed a smaller P before announcing a conclusion. In the meantime, with a P of 0.05, you'd label it as a tentative conclusion, a likely theory.

    1. Re:Well yes, actually by dcollins · · Score: 1

      "In the meantime, with a P of 0.05, you'd label it as a tentative conclusion, a likely theory."

      Great, so we agree: Getting a P of less than 0.05 with a sample of 200 gets you published and other machinery acting on that signal.

      --
      We know where leadership by an anti-intellectual "strongman" who scapegoats minorities and likes boisterous rallies goes
  56. That's brilliant - thanks! by Okian+Warrior · · Score: 1

    The problem I have with least squares is that I don't like the definition of the "error". If you have two things that are correlated, one isn't necessarily a function of the other that includes some variability. If you flip the X and Y axes over - plot height against weight, rather than weight against height - then the least squares regression gives a different line. If the two errors are both minimised, but different, then neither of them is the "real" error.

    Wow - brilliant insight! Thanks for that - things like this are why I come to Slashdot.

    1. Re:That's brilliant - thanks! by fatphil · · Score: 1

      I forgot to include in my post the fact that as I was reading the article and earliest posts I was fomenting in my head an idea that was based on information theory, and so was *most pleased* to see your compression example - the kind of thing that I come to slashdot for. That's, I think, the purest approach; whether it's practical or not is another matter, and - to a pure mathematician like me - irrelevant!

      I replied to your post rather than the parent, where it was perhaps more relevant, as it was you I wished to engage in dialogue, as you seemed to have the insights that I was interested in! As I said, I'm an outsider to the field, but still interested in it.

      --
      Also FatPhil on SoylentNews, id 863
  57. Hey - pure mathematician! by Okian+Warrior · · Score: 1

    Can I discuss some ideas with you offline? thon dot 9 dot okianwarrior at spamgourmet dot com

  58. Re:That book about the bell curve by Anonymous Coward · · Score: 0

    OLS regression/glm-based approaches do not assume a Gaussian distribution of the dependent variable, just that the residuals/errors are Gaussian. If you don't like that for your data, use nonparametric methods, or if you know what matches the distribution, like Poisson or negative gamma, then use those. I'm a psychologist, and my colleagues were often taught to look for normal data because they assume the errors are intrinsically random, and will follow the existing distribution. I wish they didn't.

  59. Re: Impossible! by KeensMustard · · Score: 1

    Climate models are currently, at best, when treated as an ensemble (if you buy that as legitimate)

    Is there a methodological reason to NOT treat the ensemble as legitimate? Please describe this reasoning in detail.

    skirting along the p 0.05 level of significance in the validation period.

    Define precisely what you mean by "skirting along". How far below 0.05 are the models' results, exactly?

    Pointing this out is considered trolling -- it probably offends some religious sensibilities.

    I suspect you are misinterpreting the responses. Mostly when we dig down on these sorts of remarks, high level, without data or empirical basis, no repeatable observations, we find them to be the work of some deceptive, brainless, mouth breathing denialist. You might be an ok chap, but I can't really make a conclusion until I've seen the data.

    It's a matter of probability. Perhaps someone who sounds like a denialist is, in spite of the evidence, a rational, coherent person who is nevertheless ignorant of the science or misled by paid shills (like Anthony Watts who is paid a salary by the Heartland Institute to LIE about climate science, or Judith Curry, who deliberately misleads by wrapping genuine science in a penumbra of sneering pseudo-scepticism).

    Generally the best measure is as follows:

    1. The person refuses to provide specifics but only speaks in generalities - likely a committed denialist

    2. The person provides specific (albeit incorrect) facts - likely ignorant or misled by liars

    Like many people I have a lot of sympathy for the genuinely misled and will try to help them if I can. The deliberate lies and deception on the part of the denier hierarchy makes my blood boil.

    Tightening the threshold as the article suggests would mean the model results are not "significant" (i.e., not reasonably distinguishable from natural variation

    Actually the article makes no mention of anything related to climate science. It is mainly focussed on instances where results are found not to be reproducible and using a frequentist methodology. Climate models are very reproducible and don't use a frequentist method - they make predictions, not observations.

    -- note that I am not a "denier" and that I do accept that CO2 is a greenhouse gas etc. etc.; I am however hugely skeptical of most climate and environmental science that I have investigated).

    And yet you refer to these supposed problems in the climate science in generalities. Why is that?