Why the Cloud Cannot Obscure the Scientific Method
aproposofwhat noted Ars Technica's rebuttal to yesterday's story about "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete." The response is titled "Why the cloud cannot obscure the Scientific Method," and is a good follow-up to the discussion.
Because a datasource isn't a process?
http://arstechnica.com/news.ars/post/20080625-why-the-cloud-cannot-obscure-the-scientific-method.html
I like the fact that the web and search/aggregate engines may combine vast amounts of data in ways we cannot now imagine - it expands the field for new scientific research enormously. Replace science? No.
accept no limits but time
Latest addition to bullshit bingo cards:
CLOUD
I'd say that the models are the science. They're how you explain your data. They provide evidence that the experiments make sense, and they guide you by making predictions you can test.
Moreover, SIMPLIFIED MODELS are good science. Understanding which details can be omitted without impacting the predictive ability of your model shows you know which effects are important and which aren't.
Use the Firehose to mod down Second Life stories!
A large body of data that shows a correlation does not somehow imply causation, even if the correlation holds under some conditions (or even all conditions). The science happens when the causation is determined and then applied.
All models are wrong, but some are useful.
We still need scientific methods to develop useful models and to understand and refine the existing ones. When Newton defined his mechanics, that was the state of the art in his era, and now we have progressed to quantum mechanics, which might itself be refined tomorrow.
But mere observation of some phenomenon is not sufficient to predict its behaviour under changed conditions. A scientific model and its rigorous application are required for this. Correlations drawn from the cloud cannot substitute for it.
gopla
In general I'm right behind the rebuttal. However, John Timmer chooses a very bad real-life example as his rebuttal champion.
He asks: ...would Anderson be willing to help test a drug that was based on a poorly understood correlation pulled out of a datamine? These days, we like our drugs to have known targets and mechanisms of action and, to get there, we need standard science.
These days we may like our drugs to have these attributes, but very often they don't. There are still quite a few medicines around that clearly work and are prescribed on that basis, but for which there is only the haziest evidence as to how exactly they work.
The good thing about the scientific method, however, is that it gives us a framework to investigate these drugs' actions - even if the explanation is still currently beyond us.
Truly, the whole reason someone like Mr. Anderson could claim the end of science because of data is that he is a writer, a thinker, and in large part a businessman. Businessmen do not think about science and how to use it to come up with a method that produces a conclusion. He uses information to come up with ways to elicit a reaction in people. So to him data is more important than science, because he uses it for his purposes. That is marketing, and the "science" of marketing has almost always been that way.
/. this article is as cogent a rebuttal as one can make.
Mr. Anderson was not prescient in any way; he was just speaking from his own perspective. We must be careful about even considering his proposition as a valid reality worth pursuing, not for the sake of true scientists but from a social perspective, or it will truly be the end of science. There are some in power who are already attempting to make this happen.
That said, I almost consider responding to yesterday's article as falling for the argument. But, since it hit the
...and it should be known by now
And I can back up this rebuttal with a practical example. I am a physicist; I know sod all about blood samples, or proteins, or cancer. I get a pile of mass spec data (about a billion data points or so on some days) and through binning, background subtraction, and a string of other statistical witchcraft I produce a set of peaks labeled according to intensity and significance.
This does not make me a cancer researcher. This data has to go back to the cancer guys, and they have to pick out the biomarkers and thus develop new diagnostic tests, based on principles that I don't understand. I am master of the information but entirely blind as far as the science is concerned. Same goes for Google.
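For anyone curious what "binning, background subtraction, and peak labeling" can look like, here is a minimal sketch on synthetic data. It is not the poster's actual pipeline: the function names, the rolling-median background, the MAD-based noise estimate and the 8-sigma threshold are all assumptions chosen for illustration.

```python
import random
from statistics import median


def bin_spectrum(points, bin_width):
    """Sum the intensities of (m/z, intensity) points into fixed-width bins."""
    bins = {}
    for mz, intensity in points:
        index = int(mz // bin_width)
        bins[index] = bins.get(index, 0.0) + intensity
    return [bins.get(i, 0.0) for i in range(max(bins) + 1)]


def subtract_background(binned, half_window):
    """Subtract a rolling-median baseline (a crude, assumed background model)."""
    corrected = []
    for i, value in enumerate(binned):
        lo = max(0, i - half_window)
        hi = min(len(binned), i + half_window + 1)
        corrected.append(value - median(binned[lo:hi]))
    return corrected


def find_peaks(signal, n_sigma):
    """Return (bin, intensity, significance) for local maxima above n_sigma * noise."""
    centre = median(signal)
    # Median absolute deviation: a noise estimate that the peaks barely disturb.
    mad = median([abs(v - centre) for v in signal])
    noise = 1.4826 * mad if mad > 0 else 1.0
    peaks = []
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
        significance = signal[i] / noise
        if is_local_max and significance >= n_sigma:
            peaks.append((i, signal[i], significance))
    return peaks


def synthetic_spectrum(seed=0):
    """Made-up data: a flat noisy background plus two injected peaks."""
    rng = random.Random(seed)
    points = [(i * 0.5, 5.0 + rng.uniform(-0.5, 0.5)) for i in range(400)]
    points += [(49.5, 30.0), (50.2, 100.0), (51.5, 30.0), (120.3, 60.0)]
    return points


binned = bin_spectrum(synthetic_spectrum(), bin_width=1.0)
peaks = find_peaks(subtract_background(binned, half_window=5), n_sigma=8.0)
for index, intensity, significance in peaks:
    print(f"bin {index}: intensity {intensity:.1f}, significance {significance:.1f}")
```

Note what the output is: a list of bins with numbers attached. Nothing in it says which peak is a biomarker, which is exactly the poster's point.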
If we can put a man on the moon, why can't we shoot people for Apollo-related non-sequiturs?
When I read the original article my thought was that someone was just trying to write something to get noticed. The scientific method, IMHO, is all about a person or group of persons using a logical process to determine the validity of an idea. Observing massive amounts of data can reveal relationships that may not have been noticed in other ways, but at the end of the day the process of "I think X, I wonder if it is true", the heart of the scientific method, can no sooner become obsolete than we can stop being human. The questions of What, Why and How are so fundamental to humans as humans that nothing short of total omniscience will ever replace the logical process represented by the scientific method.
I have always viewed this debate in the context of scientist vs. engineer. That is, one who wants data to be "good and true" vs. one who wants it to be "good enough". That's not a slam on engineers (I am one), but a reflection of the balance between the two. A scientist who never applies theory sits in an empty room. An engineer who builds things without science sits in a cluttered room surrounded by useless objects.
I do find it interesting, though, that the advent of "google data" may indicate a flip in the order of the two disciplines. Historically (IMHO) science has led engineering. A theoretical breakthrough, provable by the scientific method, may take years to give birth to a practical application. Now, with enormous piles of data and the knowledge that "good enough" is often good enough, we may be creating useful objects that will take science many years to explain and model.
The biggest issue and omission in both of these pieces is that this "cloud" of data does not represent "truth" (as the scientist may seek), but rather a summation or averaging of the "perception of truth" as seen by the individual authors. The cloud, therefore, is only as useful as humans' ability to divine truth without the scientific method.
My two cents. :)
Yes, I think that prediction without explanation is fascinating, but I don't know if it's what I like about science :) Have you ever heard Leonard Smith speak? I saw him at SAMSI, but his MSRI talk is online and is roughly the same. He's a statistician who works on exactly this.
Some fancy-pants technique he has is better at predicting the future behavior of chaotic systems (like van der Pol circuits or the weather) than physical models. But he also points out that these predictions don't tell you what type of data to collect to make better predictions, and that they don't generalize. One nice "model" he has can predict the weather at Heathrow better than physical weather models (from the same inputs: wind speed, temperature, pressure, etc), but it's useless for predicting the weather in Kinshasa until the model is re-trained.
I think these types of data analysis tools will be very important in the future, but they won't replace the explanatory power of models, just as scientific computing is useful but never replaced actual experiments.
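To make the "doesn't generalize until re-trained" point concrete, here is a toy sketch. It is not Smith's technique: it is a plain least-squares line fitted to made-up data, and the two "sites", their input/output relationships and all function names are assumptions for illustration only.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of a (slope, intercept) model."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


def make_site(slope, intercept, seed, n=200):
    """Synthetic site: one input, one target, bounded noise. Not real weather data."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [slope * x + intercept + rng.uniform(-0.5, 0.5) for x in xs]
    return xs, ys


site_a = make_site(2.0, 1.0, seed=1)    # hypothetical stand-in for "Heathrow"
site_b = make_site(-1.0, 20.0, seed=2)  # hypothetical stand-in for "Kinshasa"

model_a = fit_line(*site_a)
error_at_a = mean_abs_error(model_a, *site_a)
error_at_b = mean_abs_error(model_a, *site_b)
error_after_retraining = mean_abs_error(fit_line(*site_b), *site_b)

print(f"trained on A, tested on A: {error_at_a:.2f}")
print(f"trained on A, tested on B: {error_at_b:.2f}")
print(f"re-trained on B, tested on B: {error_after_retraining:.2f}")
```

The fitted line predicts well where it was trained and badly elsewhere, and nothing in the fit tells you why the two sites differ. A physical model would carry that explanation with it; the curve fit has to be redone from scratch.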
Thank you. Sure, there's a ton of data out there, but how was it collected? What statistical methods were used to analyze the data? How did you select the data set you're analyzing? Nothing I understand about science really applies to data mining a so-called "cloud". Prediction without explanation is just observation. Observation in and of itself is not science. You might have data, but is it the right data?
I see all this petabyte stuff as interesting and even as a valuable adjunct to real science, but a basic requirement of science is reproducibility, and you can't reproduce the data collection.
I have mod points. The reign of terror begins now.