Cause and Effect: How a Revolutionary New Statistical Test Can Tease Them Apart
KentuckyFC writes Statisticians have long thought it impossible to tell cause and effect apart using observational data. The problem is to take two sets of measurements that are correlated, say X and Y, and to find out if X caused Y or Y caused X. That's straightforward with a controlled experiment in which one variable can be held constant to see how this influences the other. Take, for example, a correlation between wind speed and the rotation speed of a wind turbine. Observational data gives no clue about cause and effect, but an experiment that holds the wind speed constant while measuring the speed of the turbine, and vice versa, would soon give an answer. But in the last couple of years, statisticians have developed a technique that can tease apart cause and effect from the observational data alone. It is based on the idea that any set of measurements always contains noise. However, the noise in the cause variable can influence the effect but not the other way round. So the noise in the effect dataset is always more complex than the noise in the cause dataset. The new statistical test, known as the additive noise model, is designed to find this asymmetry. Now statisticians have tested the model on 88 sets of cause-and-effect data, ranging from altitude and temperature measurements at German weather stations to the correlation between rent and apartment size in student accommodation. The results suggest that the additive noise model can tease apart cause and effect correctly in up to 80 per cent of the cases (provided there are no confounding factors or selection effects). That's a useful new trick in a statistician's armoury, particularly in areas of science where controlled experiments are expensive, unethical or practically impossible.
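For readers who want to see the asymmetry in action, here is a minimal sketch of the additive-noise idea, not the method from the study itself. It assumes toy data where X really does cause Y (Y = X^3 + noise), fits each direction with a crude bin-mean regression, and swaps in a simple stand-in for a proper independence test: how unevenly the residual spread varies across bins. The function names and the score are illustrative assumptions.

```python
import random
import statistics

def residual_spread_ratio(predictor, response, bins=20):
    """Crude dependence score: regress response on predictor with bin means,
    then compare residual spread across bins (close to 1.0 means even spread)."""
    pairs = sorted(zip(predictor, response))  # order the pairs by predictor value
    size = len(pairs) // bins
    spreads = []
    for b in range(bins):
        chunk = [r for _, r in pairs[b * size:(b + 1) * size]]
        # pstdev is the spread around the bin mean, i.e. the residual spread
        spreads.append(statistics.pstdev(chunk))
    return max(spreads) / min(spreads)

def infer_direction(x, y):
    forward = residual_spread_ratio(x, y)   # model: y = f(x) + noise
    backward = residual_spread_ratio(y, x)  # model: x = g(y) + noise
    # The direction whose residuals look less dependent on the input wins.
    label = "X -> Y" if forward < backward else "Y -> X"
    return label, forward, backward

rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(2000)]
y = [xi ** 3 + rng.gauss(0, 0.2) for xi in x]  # toy ground truth: X causes Y
direction, forward, backward = infer_direction(x, y)
print(direction, round(forward, 2), round(backward, 2))
```

In the causal direction the leftover noise has roughly the same spread everywhere. In the reverse direction the spread changes with the input, which is the asymmetry the summary describes. A real analysis would use a flexible regressor and a proper independence test.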
>provided there are no confounding factors or selection effects
So that'll provide plenty of material for medical researchers, nutrition researchers, education researchers and economists to keep doing what they're doing.
I should use this sig to advertise my book ISBN-13 : 978-1501515132.
>So the noise in the effect dataset is always more complex than the noise in the cause dataset. [...] the additive noise model can tease apart cause and effect correctly in up to 80 per cent of the cases
In other words, not always.
"First they came for the slanderers and i said nothing."
Reading through the article, it wasn't clear to me how it is determined whether it worked correctly or not.
But still, an interesting statistical breakthrough, and one that allows researchers to ask interesting questions about their data.
Well, of course it can. How do you think causation is determined? First by noticing a correlation. There can't be causation without correlation.
Gawd I hate the brain-dead fools who thoughtlessly parrot, "Correlation is not causation!"
So if Z causes both X and Y, I assume that this amazing test gives garbage?
Is that a joke for the quantitatively pedantic?
Hey, we have this new technique. It's somewhere between 0% and 80% reliable.
80% of the time it confirms the scientists' expectations.
Many other attempts at detecting causality exist. There's one based on dynamical systems theory (Takens' theorem): in a multidimensional, causally linked dynamical system, the information in the high-dimensional system can be recovered from multiple lagged values of a single dimension over time.
The method works by reconstructing values of X from lagged vectors of Y(t), using the nearest-neighbor lagged vectors of Y in a training set. As the training set gets larger, the predictions get better. If they keep getting better, X probably causes Y. The idea that the noise in X(t) shows up in Y(t) but not the other way around is implicitly captured in that approach, although not in a statistically rigorous way.
Sugihara et al. Science 2012 (sorry about paywall).
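The nearest-neighbor reconstruction described above can be sketched roughly as follows. This is a toy illustration, not the published code: the coupled logistic maps, the coupling strength, the embedding dimension of 2 and all function names are assumptions chosen for the example.

```python
import math

def simulate(n, coupling=0.3, burn_in=100):
    """Two logistic maps; x drives y, y has no effect on x (assumed toy system)."""
    x, y, xs, ys = 0.4, 0.2, [], []
    for t in range(n + burn_in):
        if t >= burn_in:
            xs.append(x)
            ys.append(y)
        x, y = x * (3.8 - 3.8 * x), y * (3.5 - 3.5 * y - coupling * x)
    return xs, ys

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))

def cross_map_skill(effect, cause, library_size, dim=2, n_test=200):
    """Estimate cause(t) from lagged vectors of effect; return correlation with the truth."""
    vectors = [tuple(effect[t - k] for k in range(dim)) for t in range(dim - 1, len(effect))]
    truth = cause[dim - 1:]
    library = list(zip(vectors[:library_size], truth[:library_size]))  # training set
    predictions = []
    for v in vectors[-n_test:]:  # held-out points at the end of the series
        nearest = sorted((math.dist(v, u), c) for u, c in library)[:dim + 1]
        scale = nearest[0][0] or 1e-12  # guard against a zero distance
        weights = [math.exp(-d / scale) for d, _ in nearest]
        predictions.append(sum(w * c for w, (_, c) in zip(weights, nearest)) / sum(weights))
    return pearson(predictions, truth[-n_test:])

xs, ys = simulate(1200)
skill_small = cross_map_skill(ys, xs, library_size=25)
skill_large = cross_map_skill(ys, xs, library_size=800)
print(round(skill_small, 2), round(skill_large, 2))
```

The thing to look at is whether the skill keeps rising as the library grows. To test the opposite direction, swap the two series. Note that strong forcing can produce skill in both directions, so a single number is not a verdict.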
2) A whole bunch of people totally ignoring this study because they don't like what it means.
excitingthingstodo.blogspot.com
Don't fucking delete your fucking data you fucking dipshit.
Unless, of course, you know that some fucking data is bad, and other fucking data is good. In that case it makes sense to fucking delete the fucking bad data.
But then how are you supposed to get your research published?
sysadmins and parents of newborns get the same amount of sleep.
You torture the data until they confess.
The standard t-test for detecting an effect is already probabilistic. In science and medicine a 95% confidence value is commonly used, which means a 1-in-20 chance of detecting something that isn't there.
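A quick simulation makes the 1-in-20 figure concrete. It is a sketch under stated assumptions: a one-sample t-test on 30 normally distributed values whose true mean really is zero, with the two-sided 5% critical value for 29 degrees of freedom hard-coded from a standard t table. The names are illustrative.

```python
import math
import random
import statistics

T_CRIT_DF29 = 2.045  # two-sided 5% critical value of Student's t, 29 df (from tables)

def one_sample_t(sample, mu0=0.0):
    n = len(sample)
    return (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))

def false_positive_rate(trials=4000, n=30, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no effect: true mean is 0
        if abs(one_sample_t(sample)) > T_CRIT_DF29:
            hits += 1  # the test "detected" an effect that is not there
    return hits / trials

rate = false_positive_rate()
print(rate)
```

The rate should land near 0.05. That is the chance of a false alarm when there is truly no effect, which is not the same as the chance that any given significant result is wrong.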
Sheesh, evil *and* a jerk. -- Jade
I would like to officially confirm that, indeed, OP is angry.
sexconker, can you please point to me the place on the doll where the bad ebil statistician touched you?
We will get you some therapy sorted out. Please don't rape, torture and mutilate the dead body of an innocent person in the meantime.
So angry!
It implicitly presumes that there is some relatively direct causal relationship between the two events.
Fundamental flaw.
Yes, but now we can find out whether we read Slashdot because we are nerds, or we are nerds because we read Slashdot.
Now that they've found a way to filter out ("ignore") data that doesn't fit, maybe now they'll actually be able to conclusively prove that climate change exists!
AGW skeptic here, but I'd be very cautious about applying this technique to climatic data to try and prove anything.
This technique works best when there are a limited number of (read: two) variables and a clear cause and effect (i.e. one variable is dependent). At least that's my understanding.
Climate data is mindbogglingly complex, with a huge number of known variables with known and unknown dependencies. Even something as seemingly straightforward as the carbon cycle has a large number of feedbacks, which (again as I understand it) would only mess this approach up.
To my mind, the AGW hypothesis either succeeds or fails based on the predictions it makes and how closely those predictions line up with observed reality. Clever statistical tricks neither help nor lend credibility in either case.
The turbine example is poor. Adequate data will show causality in time between a wind gust, and a delayed turbine rotation rate. Momentum easily causes a lag between one data set and the other, and the concept of time running in one direction can easily be used to suggest causality.
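The lag argument can be checked with a simple lagged correlation. The sketch below uses made-up data: a random wind series and a turbine series that follows it three samples later. The pure delay, the 1.5 gain and the names are assumptions; a real turbine would smooth the gusts as well as delay them.

```python
import math
import random

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))

def best_lag(cause, effect, max_lag=6):
    """Return the shift of effect relative to cause (in samples) with the highest correlation."""
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:   # effect trails cause by `lag` samples
            a, b = cause[:len(cause) - lag], effect[lag:]
        else:          # effect leads cause by `-lag` samples
            a, b = cause[-lag:], effect[:len(effect) + lag]
        scores[lag] = pearson(a, b)
    return max(scores, key=scores.get), scores

DELAY = 3  # hypothetical response delay of the turbine, in samples
rng = random.Random(0)
raw_wind = [rng.gauss(10.0, 2.0) for _ in range(500 + DELAY)]
# turbine at time t responds to the wind DELAY samples earlier, plus sensor noise
turbine = [1.5 * raw_wind[t - DELAY] + rng.gauss(0, 0.2) for t in range(DELAY, len(raw_wind))]
wind = raw_wind[DELAY:]  # align both series on the same time axis
lag, scores = best_lag(wind, turbine)
print(lag)
```

A peak at a positive lag says the turbine follows the wind. This only works when the sampling is fast enough to resolve the delay.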
I am more curious about a test that would show if 2 data sets are clearly caused by a third non-measured factor.
Which direction in time does cause/effect flow? The world may never know.
Almost any level of accuracy above pure randomness can be fruitfully added to the Bayesian inference process. You can pretty harmlessly add the pure noise as well, it's just not going to be fruitful.
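As a worked illustration of that point, here is Bayes' rule applied to a test that is right 80% of the time. It is a sketch that assumes the test is equally accurate whichever hypothesis is true and that repeated verdicts are independent. The function name is made up for the example.

```python
def update(prior, accuracy, says_yes=True):
    """Posterior probability of the hypothesis after one verdict from a test
    that is correct with probability `accuracy` (assumed symmetric)."""
    like_true = accuracy if says_yes else 1 - accuracy
    like_false = 1 - accuracy if says_yes else accuracy
    return prior * like_true / (prior * like_true + (1 - prior) * like_false)

p_one = update(0.5, 0.8)      # one verdict from an 80%-accurate test
p_two = update(p_one, 0.8)    # a second, independent verdict that agrees
p_noise = update(0.5, 0.5)    # a coin-flip "test" leaves the prior where it was
print(round(p_one, 3), round(p_two, 3), p_noise)
```

An imperfect test still moves the estimate, and pure noise moves it nowhere.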
Someone had to do it.
So once we start using this on everything, 1 out of every 5 times, it will lead us to bogus conclusions with false statistical confidence....
Apparently the Trident Gum people have been using this for decades.
I looked at the article - I don't understand how this is different than a covariance matrix?
>So once we start using this on everything, 1 out of every 5 times, it will lead us to bogus conclusions with false statistical confidence....
So, a vast improvement then? ;-)
You are in a maze of twisty little passages, all alike.
Can they say:
Does A cause B? Probably not.
Does B cause A? Probably not.
So there's probably a C causing A and B.
There's a lot of probablys in that.
... light at the end of the tunnel re: Chicken v. Egg... Pretty interesting though!
This excellent blog article describes a technique developed by Judea Pearl decades ago to do exactly this. Would be interested to understand how this is different/better.
I love statistics. I hate "statisticians".
"Finally, we can discover whether increased crime causes ice cream sales to rise...or if it's the other way around."
Nonsense... increased ice cream sales come from global warming which, in turn, reduces pirates as everybody knows, therefore reducing crime, not the other way around.
You can't know your data is bad when doing experimentation. That's the point of experimentation - you control variables and observe others to test a hypothesis.
The point at which you can KNOW data is bad is the point at which you know all of the variables and all the details of the phenomenon you are observing. It's like "experimenting" with 1+1 on a calculator. When it gives you a 12 you know you've got bad data (you keyed in 11+1 or 1+11 or something), but that's only because you know the entire system and what it's supposed to do. It's not an experiment at that point, and there's no fucking point in doing it.
If you're experimenting on something then you don't know the entire system. If you don't know the entire system then you cannot know for sure whether any data is bad or not.
Even without going to that extreme, "bad data" (even obviously "bad data") is merely a failure to control variables. The methodology and the experiment as a whole are then suspect. Repeat with better control and methodology, or deal with the small amount of ugliness in the graph that the "bad data" may have contributed.
>The standard t-test for detecting an effect is already probabilistic. In science and medicine a 95% confidence value is commonly used, which means a 1-in-20 chance of detecting something that isn't there.
Unless things have been radically relaxed in the last decade, the standard in hard sciences and medicine remains a 99% confidence interval. It's the social sciences that allow for a 95% confidence interval. Having worked in all the different schools out there, I think I have some confidence in my assertion.
"[I]t is a wise man who admits the limits of his knowledge or skill, and that pretending either causes harm." --Terry Go
Some confidence? Would you give yourself a 99% confidence interval, or only a 95%?
>Would you give yourself a 99% confidence interval, or only a 95%?
The question of the different criteria used in different fields of research is itself a social science question. Using OP's own criteria, they would require only 95% confidence. Obvious ... no?
Better to be despised for too anxious apprehensions, than ruined by too confident a security. --Edmund Burke
You've got your statistics all wrong: you misrepresent significance testing, and overlook that t-tests are only suitable for a small range of problems. Plus it doesn't bear on the discussion of causality. You should have been downmodded into oblivion.
Understandable reaction to the quantity of smugness in the story?
Yeah, you can make anything significant if you use something like the entire population of the US as your sample; on the other hand, really obvious effects will not reach significance if you use the currently affordable study populations of around 40 people.
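Some back-of-the-envelope arithmetic shows how strongly sample size drives significance. This sketch assumes a two-group comparison with unit variance in each group and uses a normal approximation. The effect sizes and group sizes are made-up illustrations, not figures from the thread.

```python
import math
from statistics import NormalDist

def expected_z(effect_size, n_per_group):
    """Expected z statistic: mean difference (in SD units) over its standard error sqrt(2/n)."""
    return effect_size * math.sqrt(n_per_group / 2)

def power(effect_size, n_per_group, alpha=0.05):
    """Approximate chance of reaching significance (ignores the far tail)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_crit - expected_z(effect_size, n_per_group))

# A negligible effect (0.01 SD) with 150 million people per group
huge_z, huge_power = expected_z(0.01, 150_000_000), power(0.01, 150_000_000)
# A moderate effect (0.5 SD) with 40 people split into two groups of 20
small_z, small_power = expected_z(0.5, 20), power(0.5, 20)
print(round(huge_z, 1), round(huge_power, 3), round(small_z, 2), round(small_power, 2))
```

Under these assumptions the trivial effect is almost certain to come out significant, while the half-standard-deviation effect reaches significance well under half the time.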
Star Trek transporters are just 3d printers.
>Some confidence? Would you give yourself a 99% confidence interval, or only a 95%?
95% confidence. It depends on what you're testing, the purpose of the test, and the design of the experiment. For example, in some cases you might go with 90% simply because you're doing a pilot study--think of it as beta testing, or perhaps the alpha testing round. These are typically small and, well, simple, and you may go with a higher alpha simply because you're doing rough measurements to see if it works at all before investing the resources into doing a larger study with a lower alpha.
On the other hand, some large medical experiments may even go for a 99.5% confidence interval, both because a huge sample population makes it feasible and because it is important to be as certain as possible.
100% certainty basically translates as "Numbers were pulled from anus."
Hmmm, there is a lot more noise in the global temperature data than there is in the atmospheric concentration of CO2.
Drunk falls into an open grave, passes out. In the AM, thinks to self: "If I'm not dead, then why am I in a grave? And if I am dead, then why do I need to pee?"
I know, my sound system has terrible problems with 60 Hz Hume from Induction.
>Now that they've found a way to filter out ("ignore") data that doesn't fit, maybe now they'll actually be able to conclusively prove that climate change exists!
Oh, wait. They are already ignoring the data that doesn't fit, so I guess this won't help. Well, maybe sometime in the next 50 years they'll actually come up with a model that is accurate for more than 2-3 years in the future.
There's undoubtedly more noise in climate data than in CO2 data, so you've just reminded us about the "climate makes CO2 rise, not the other way around" argument and it is now even more clear that it is false. Good job!
I don't know about that, but I can tell you that global warming causes the ground to get harder. Proof: when I was younger, I could camp out on the ground with no air mattress; can't do that now.