Google Begat the End of the Scientific Method?
TheSauce writes "In a fairly concise one-pager from Chris Anderson, at Wired, the editor posits that all of our current (or now previous) models for collecting data are dead. The content is compelling. It notes that we've entered the Age of the Petabyte — where one can collect immense amounts of data that are paradigm agnostic. It goes on to add a comment from the head of Google's R&D, that we need an update to George Box's maxim: 'All models are wrong, and increasingly you can succeed without them.' Have we reached a time where all of our tool-sets are now made moot by vast clouds of information and strictly applied maths?"
WTF?
English, ---, do you speak it?
I saw the article yesterday, but it was so WTFey I just moved on...definitely not Slashdot submission material (especially being a Wired article).
"When information is power, privacy is freedom" - Jah-Wren Ryel
Until cells, molecules, atoms, and subatomic particles start publishing blogs, the scientific method will remain useful.
So everything possible has been researched now and therefore no more research is necessary since it will all be on the internet? Ridiculous!
A bit OT here, but don't forget 'wisdom' after intelligence. So many people stop at intelligence.
I judt got a nre Kinesis keybiartf so please excusr ant egregiou typos.
Um, no. Claims like this demonstrate a lack of understanding of what a model is.
From the perspective of physics, the universe is just a massive amount of data--more data than any single human can comprehend at once. But thanks to the models of Newton we have a set of relatively simple equations that describe, generally, the way bodies in the universe interact. The model is not perfect, but it is useful.
Likewise, Google uses a very explicit model to describe the universe of the web: some pages are more relevant to a given search query than others, and these pages will generally be more 'popular' among other important pages. Again, the model is not perfect, but it is useful.
The fallacy is that somehow what Google is doing is a paradigm shift. It's not. It's just applying the same kind of scientific method to a type of data that hadn't existed before.
What, I think, the article is really trying to say is that Google's data is so massive and complex that we can't ascribe any explanation to the results it gives us. First of all, that is false, because the PageRank algorithm in its simplest form does give us a very explicit explanation (popular pages generally return better results). But even if it were true, Newton faced the same kind of accusations when people called his model of the universe 'Godless' and claimed, for example, that he decribed how gravity works without actually explaining "why" it works like it does. And that accusation is always with science. There are always more questions raised than answered. This is nothing new.
The article is utter nonsense. But it's such a rambling mess it's hard to know where to start picking it apart. Perhaps the best is when he presents as an example of this new "model-free" approach with a program which includes "simulations of the brain and the nervous system". Uh, hello... a simulation IS a model.
I am definitely a victim of this "Google effect". Search makes me lazy.
For example, for years I would pride myself on my well-tended Windows Start menu. I'd create base categories for my application folders like Hardware, Games, and Internet, and move applications into those folders to keep my Start menu manageable. I blogged about this procedure and included a screenshot.
Now that I'm using Vista I have little need to be so organized. I rarely have to navigate manually to an application folder thanks to the embedded search box on the Start menu. So now my Start menu is a huge clutter, but so what? I see that exercise as futile as dusting the cardboard boxes in the attic.
Searching data is a tool. You still need to have insight to formulate a theory, develop a test for the theory, and ask the data pool the right (non-leading) question. Then evaluate the data looking for both proof and disproof of the theory and be smart and ego neutral enough to let the data suggest a new theory, test and question. Don't confuse a new and useful tool that makes insight easier, with the ability of humans to have that insight.
That an incredible amount of data exists on any given topic does nothing to describe relationships, causality, precision, accuracy, distribution, correlation, or anything else. Data is information, and information must be processed in order to make it meaningful. Additionally, everything that's written, printed, published, etc, is not necessarily true, accurate, precise, etc.
If anything, the Google phenomenon demands more rigorous examination by accepted methods.
The preceding message has been brought to you by Captain Obvious and the letters O,R,L,Y.
First, not everyone has access to vast clouds of information due to expense and I don't think that's going away any time soon. So we'll still get to understand what's going on around us and not just rely on regression analysis to inform our every decision.
Second, in my experience with large sets of data, you can do all kinds of math to them to bring out interesting relationships but someone with domain expertise is going to have a much better insight into what the data is saying than someone who doesn't. It seems the peak of hubris to think that the techniques taught in every science (social, hard, or otherwise) are worth nothing compared to massive amounts of data. How do you know where to get the data from? How do you apply the data?
I don't think it's quite time to throw out "correlation != causation". In fact, I think now more than ever we need to be able to understand underlying phenomena behind the data precisely because there is so much of it. With so much data, coincidental correlation is going to happen quite often I'm sure.
And, of course, the ultimate reason we need to understand things is for, you know, when the cloud's not there.
'Every story, if continued long enough, ends in death.' --Ernest Hemingway
This is typical web 2.0 hype... more is better. Which, as anybody who has used Wikipedia knows, is utter bullshit. The scientific method can't be supplanted by a large amount of questionable data. Tons and tons of bad data is still bad data. It doesn't get any more correct just because there's more of it.
I don't respond to AC's.
A thought-provoking piece written by someone who neither understands the scientific method nor Google. Who doesn't understand the difference between a Theory and a model. Who still doesn't get correlation!=causation. Who probably has never had to actually analyze any substantial amount of data before. And who has clearly been raised on a self-important intellectual diet consisting of too much Buckminster Fuller, Kurtzweil, Frank Tipler, and Derrida. I'm sure there are some kernels of insight buried in there someplace, but I'm just not clear what they are. If his rant is indicative about the future direction of science, we're all doomed.
i\hbar\dot{\psi}=\hat{H}\psi
He's getting rather old, but he's a good mouse.
I thought this was a joke at first. One thing to think about is that the biggest data collector of them all, the Large Hadron Collider, which fits the frame given perfectly - delivering terabytes of data in huge data sets is just the opposite of the described scenario. Models are crucial to actually picking what data is actually recorded. In fact a large part of how good the LHC data will be will be in using models to select what events to capture. The way the data is captured is of course also based on long effort and knowledge from previous detectors. This isn't just randomly, or even generically selectively gathering data and then analyzing it. This is targeted data gathering based on complex scientific theories. There have been shouting matches at what to tag for collection based on what people think is important for a given theory - and these will happen again.
As our collection abilities rise exponentially, the the storage and analysis abilities are not exponentially growing, even though they are increasing at a fast rate! I would argue exactly the opposite of what this article said. We are going to be more and more dependent on our current scientific theories to even be able to choose appropriately the rich data that new sensors and techniques will let us collect. That is we are more and more dependent on our scientific theories when we get data not less. Did we even know to get methylation data when sequencing a genome. How about some other "ylation". Without background theory and experience we wouldn't even know some of that stuff was there to collect!
To avoid the same fate as the GP, let me clarify that by WTFey I specifically meant that the article was full of fluff, light on details and generally pointless...which makes me think "WTF." The closest thing to a point I could get from the article was "Nice big blobs of data can be useful, and statistical data based on said blobs could replace the results of scientific research." Mmmkay.
A sensational headline leading to a rather pointless article consisting mostly of fluff: WTF.
"When information is power, privacy is freedom" - Jah-Wren Ryel
For example, to detect stress you might traditionally measure heartbeat, skin conductivity, pupil dilation.
In the "petabyte age" you throw in the number of times the subject uses the letter 's'; how frequently they use the 'reload' button on the browser; what colour of pants they wore last tuesday; Pepsi vs. coca cola; the number of times they picked their nose in 1997 and any and every other bit of data you have on the subject.
In the "petabyte age", most of the data you sift through will show no correlation, but you have a much better chance of finding the unexpected if indeed, there is some unknown factor out there.