Google Begat the End of the Scientific Method?
TheSauce writes "In a fairly concise one-pager from Chris Anderson, at Wired, the editor posits that all of our current (or now previous) models for collecting data are dead. The content is compelling. It notes that we've entered the Age of the Petabyte — where one can collect immense amounts of data that are paradigm agnostic. It goes on to add a comment from the head of Google's R&D, that we need an update to George Box's maxim: 'All models are wrong, and increasingly you can succeed without them.' Have we reached a time where all of our tool-sets are now made moot by vast clouds of information and strictly applied maths?"
WTF?
English, ---, do you speak it?
I saw the article yesterday, but it was so WTFey I just moved on...definitely not Slashdot submission material (especially being a Wired article).
"When information is power, privacy is freedom" - Jah-Wren Ryel
They may lead from one to the other but they are not all the same thing.
Until cells, molecules, atoms, and subatomic particles start publishing blogs, the scientific method will remain useful.
So everything possible has been researched now and therefore no more research is necessary since it will all be on the internet? Ridiculous!
Um, no. Claims like this demonstrate a lack of understanding of what a model is.
From the perspective of physics, the universe is just a massive amount of data--more data than any single human can comprehend at once. But thanks to the models of Newton we have a set of relatively simple equations that describe, generally, the way bodies in the universe interact. The model is not perfect, but it is useful.
Likewise, Google uses a very explicit model to describe the universe of the web: some pages are more relevant to a given search query than others, and these pages will generally be more 'popular' among other important pages. Again, the model is not perfect, but it is useful.
The fallacy is that somehow what Google is doing is a paradigm shift. It's not. It's just applying the same kind of scientific method to a type of data that hadn't existed before.
What I think the article is really trying to say is that Google's data is so massive and complex that we can't ascribe any explanation to the results it gives us. First of all, that is false, because the PageRank algorithm in its simplest form does give us a very explicit explanation (popular pages generally return better results). But even if it were true, Newton faced the same kind of accusations when people called his model of the universe 'Godless' and claimed, for example, that he described how gravity works without actually explaining "why" it works like it does. And that accusation is always with science. There are always more questions raised than answered. This is nothing new.
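To make the "explicit model" point concrete, here is a minimal sketch of the simplified PageRank idea described above: a page scores highly when other high-scoring pages link to it. The four-page link graph, the damping factor of 0.85, and the function name `pagerank` are illustrative assumptions, not Google's actual implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform guess
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "a", "b" and "d" all link to "c"; "c" links to "d".
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["d"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
most_relevant = max(ranks, key=ranks.get)
```

Even in this toy form the result is explainable: "c" comes out on top because the most pages point at it, and "d" comes second because its one inbound link is from the top page.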
The article is utter nonsense. But it's such a rambling mess it's hard to know where to start picking it apart. Perhaps the best part is when he presents, as an example of this new "model-free" approach, a program which includes "simulations of the brain and the nervous system". Uh, hello... a simulation IS a model.
I am definitely a victim of this "Google effect". Search makes me lazy.
For example, for years I would pride myself on my well-tended Windows Start menu. I'd create base categories for my application folders like Hardware, Games, and Internet, and move applications into those folders to keep my Start menu manageable. I blogged about this procedure and included a screenshot.
Now that I'm using Vista I have little need to be so organized. I rarely have to navigate manually to an application folder thanks to the embedded search box on the Start menu. So now my Start menu is a huge clutter, but so what? I see that exercise as futile as dusting the cardboard boxes in the attic.
Searching data is a tool. You still need to have insight to formulate a theory, develop a test for the theory, and ask the data pool the right (non-leading) question. Then evaluate the data looking for both proof and disproof of the theory and be smart and ego neutral enough to let the data suggest a new theory, test and question. Don't confuse a new and useful tool that makes insight easier, with the ability of humans to have that insight.
There are still several computing problems from earlier, smaller eras that haven't been solved by the "more" paradigm. One example is realistic synthetic voice. The bandwidth is megabytes, achieved by mp3 players some years ago. However, voice is the last part of the "real world" we have to capture instead of synthesize to implement computer-generated feature movies or video games. This keeps the need for having some "flesh" actors around, at least for a few more years :-)
Then there was Slashdot's retrospective of Artificial Intelligence a few days ago. Many of the interesting advances were made in the kilobyte and megabyte eras. It seems the gigabyte and terabyte eras have barely made a dent in progress.
That an incredible amount of data exists on any given topic does nothing to describe relationships, causality, precision, accuracy, distribution, correlation, or anything else. Data is information, and information must be processed in order to make it meaningful. Additionally, everything that's written, printed, published, etc, is not necessarily true, accurate, precise, etc.
If anything, the Google phenomenon demands more rigorous examination by accepted methods.
The preceding message has been brought to you by Captain Obvious and the letters O,R,L,Y.
First, not everyone has access to vast clouds of information due to expense and I don't think that's going away any time soon. So we'll still get to understand what's going on around us and not just rely on regression analysis to inform our every decision.
Second, in my experience with large sets of data, you can do all kinds of math to them to bring out interesting relationships but someone with domain expertise is going to have a much better insight into what the data is saying than someone who doesn't. It seems the peak of hubris to think that the techniques taught in every science (social, hard, or otherwise) are worth nothing compared to massive amounts of data. How do you know where to get the data from? How do you apply the data?
I don't think it's quite time to throw out "correlation != causation". In fact, I think now more than ever we need to be able to understand underlying phenomena behind the data precisely because there is so much of it. With so much data, coincidental correlation is going to happen quite often I'm sure.
And, of course, the ultimate reason we need to understand things is for, you know, when the cloud's not there.
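The point about coincidental correlation can be checked with a small simulation. This is only a sketch under stated assumptions: every variable is pure Gaussian noise, 20 samples each, and the names `pearson` and `strongest_chance_correlation` are made up for this example. It compares the strongest correlation found by chance among 5 unrelated variables with the strongest found among 5000.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_variables, n_samples, seed=0):
    """Largest |r| between one noise series and n_variables other noise series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_variables):
        # Each candidate is independent noise: any correlation is coincidence.
        candidate = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(target, candidate)))
    return best


few = strongest_chance_correlation(n_variables=5, n_samples=20)
many = strongest_chance_correlation(n_variables=5000, n_samples=20)
print(f"best |r| among 5 noise variables:    {few:.2f}")
print(f"best |r| among 5000 noise variables: {many:.2f}")
```

Nothing here is related to anything else, yet the more variables you throw in, the more impressive the best "finding" looks. Telling that apart from a real effect is exactly where the understanding of the underlying phenomena comes in.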
'Every story, if continued long enough, ends in death.' --Ernest Hemingway
This is typical web 2.0 hype... more is better. Which, as anybody who has used Wikipedia knows, is utter bullshit. The scientific method can't be supplanted by a large amount of questionable data. Tons and tons of bad data is still bad data. It doesn't get any more correct just because there's more of it.
I don't respond to AC's.
A thought-provoking piece written by someone who understands neither the scientific method nor Google. Who doesn't understand the difference between a Theory and a model. Who still doesn't get correlation!=causation. Who probably has never had to actually analyze any substantial amount of data before. And who has clearly been raised on a self-important intellectual diet consisting of too much Buckminster Fuller, Kurzweil, Frank Tipler, and Derrida. I'm sure there are some kernels of insight buried in there someplace, but I'm just not clear what they are. If his rant is indicative of the future direction of science, we're all doomed.
i\hbar\dot{\psi}=\hat{H}\psi
I thought this was a joke at first. One thing to think about is that the biggest data collector of them all, the Large Hadron Collider, which fits the frame given perfectly (delivering terabytes of data in huge data sets), is just the opposite of the described scenario. Models are crucial to picking what data is actually recorded. In fact, a large part of how good the LHC data will be lies in using models to select what events to capture. The way the data is captured is of course also based on long effort and knowledge from previous detectors. This isn't just randomly, or even generically selectively, gathering data and then analyzing it. This is targeted data gathering based on complex scientific theories. There have been shouting matches over what to tag for collection based on what people think is important for a given theory, and these will happen again.
As our collection abilities rise exponentially, the storage and analysis abilities are not exponentially growing, even though they are increasing at a fast rate! I would argue exactly the opposite of what this article said. We are going to be more and more dependent on our current scientific theories to even be able to choose appropriately the rich data that new sensors and techniques will let us collect. That is, we are more and more dependent on our scientific theories when we get data, not less. Did we even know to get methylation data when sequencing a genome? How about some other "ylation"? Without background theory and experience we wouldn't even know some of that stuff was there to collect!
This is nonsense pure and simple.
One needs to acquire facts. Now these "facts" can come from your own research or, in the age of the internet, someone else's data, but they still need to be collected and verified.
The *only* advantage that google provides is a more efficient way of sharing and finding facts. Not even all facts, those that are popular and topical are what you'll most likely find.
Historical information, from when newspapers only used dead trees, can be very difficult to find on the internet unless someone else did the research first.
To avoid the same fate as the GP, let me clarify that by WTFey I specifically meant that the article was full of fluff, light on details and generally pointless...which makes me think "WTF." The closest thing to a point I could get from the article was "Nice big blobs of data can be useful, and statistical data based on said blobs could replace the results of scientific research." Mmmkay.
A sensational headline leading to a rather pointless article consisting mostly of fluff: WTF.
"When information is power, privacy is freedom" - Jah-Wren Ryel
For example, to detect stress you might traditionally measure heartbeat, skin conductivity, pupil dilation.
In the "petabyte age" you throw in the number of times the subject uses the letter 's'; how frequently they use the 'reload' button on the browser; what colour of pants they wore last Tuesday; Pepsi vs. Coca-Cola; the number of times they picked their nose in 1997; and any and every other bit of data you have on the subject.
In the "petabyte age", most of the data you sift through will show no correlation, but you have a much better chance of finding the unexpected if indeed, there is some unknown factor out there.
Have we reached a time where all of our tool-sets are now made moot by vast clouds of information and strictly applied maths?
No. And also no to the basic premise of the article.
Meteorologists have been doing this for decades (principal component analysis has been a crucial tool there since the 1960's, and correlation analysis has been used in some form since the 1920's if not earlier) and so have the astronomers. Oh, and the particle physicists have been sifting data in their own way on a big scale ever since World War II.
As one of many examples, if you ever have heard of an "El Nino event," that was discovered through correlation analysis and is best understood through principal component analysis. BTW, the original work predates electronic computers and was all done by hand. The vast quantities of meteorological data require statistical analysis to make any progress at all, but that certainly does not mean that you cannot use the scientific method.
So, no, this does not invalidate the scientific method. In the Internet jargon, science scales.
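For anyone who hasn't met principal component analysis, here is a rough sketch of the idea on made-up data. It assumes two hypothetical "stations" whose readings swing in opposite directions around a shared signal, and it finds the dominant pattern by power iteration on the covariance matrix. The station setup, the noise level, and the function names are all illustrative, not taken from any real meteorological dataset.

```python
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length observations."""
    n = len(rows)
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]


def leading_component(cov, iterations=200):
    """Approximate the first principal component by power iteration."""
    dim = len(cov)
    v = [1.0] + [0.0] * (dim - 1)  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Made-up data: station A reads +signal, station B reads -signal, plus noise.
rng = random.Random(1)
rows = []
for _ in range(200):
    signal = rng.gauss(0, 1)
    rows.append([signal + rng.gauss(0, 0.1), -signal + rng.gauss(0, 0.1)])

cov = covariance_matrix(rows)
component = leading_component(cov)

# Share of total variance captured by the leading component.
eigenvalue = sum(
    component[i] * cov[i][j] * component[j] for i in range(2) for j in range(2)
)
explained = eigenvalue / (cov[0][0] + cov[1][1])
```

The leading component has opposite signs at the two stations, i.e. the analysis recovers the seesaw pattern from the data alone. Deciding what that seesaw physically means is still a job for the scientific method.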
Anyone who has read any work by Lyotard, Baudrillard, or Derrida has seen this interpretation of reality coming for years. This is basically the consequence of the Post-modernist/Post-structuralist mentality.
In a sense, what the article is proposing is the "simulation" of reality in a computer system based on the available "data". This simulation, as I will argue in a moment, is merely a flawed model, since the data being related must in some sense be based on an algorithm which inherently MIMICS reality and is not a substitute for it (no matter how "accurate" the agreement). But nonetheless, the result of this, as Baudrillard observed, is not a simulation but a simulacrum of reality, which eventually will take the place of reality. The implication is that reality is not created or manufactured by the interaction of people in a "real" sense but is actually led by the operation of the simulacrum!
Nonetheless, the fact is there is no possible way to store ALL the data of the entire world (since some data is not recordable by a binary machine, and no, a "quantum" computer is not the solution that lets us say it can be); however, this fact does not mean we cannot be misled by the simulacrum and be led into a future where human interaction is what I would call inhuman, but what some who have (in some cases unknowingly) fallen for the post-modern myth would call merely an evolutionary result of human interaction.
In the future the storage of data, the usage of data, and the power of data will have a huge impact on our humanity, as the past twenty years should already be evidence of. I am not an apocalyptic fear-monger, but the proof is in the pudding. For further reading, I recommend a highly prescient book written in 1990 by a Mr. Mark Poster called The Mode of Information, which talks about some of these implications which are in the process of becoming as we speak.
...and it should be known by now
It means there's about to be an explosion in models and theoretical sciences. Always beware the End of History ;)
Don't blame me, I voted for Baltar.
And since most slashdot readers don't RTFA most comments here have proven useless in trying to figure what those kernels you mention are.
But this guy, who has read TFA (and commented on it on the Wired's site) seems to have found them.
Mit der Dummheit kämpfen Götter selbst vergebens
Fighter classes generally stop at Con, whereas casters generally go for Int or Wis. No one cares about Cha.
The Kruger Dunning explains most post on
Petaphile
1) Someone who loves their pets more than human beings or, at the extreme, someone willing to kill a human to save a lower animal's life.
2) Somebody who has sex with animals because they cannot attract any humans, or they are attracted to animals
(and the best one)
3) someone so caught up in his own egomaniacal conception of the world that he is compelled to spew vomit and blood on a stranger's clothes to show his contempt for anybody's thought but his own.
Which sounds kinda like the summary for the article, as well as some of the article.
Just finished rereading the Foundation series for the one millionth time. Anyone remember some of the signs of the decay of the first empire? The idea that these "scientists" were no longer experimenting, no longer looking for new ways to do things - just spending their time looking at old books and old experiments and trying to squeeze a "new" thought or two out of them? That a sociologist would study a society through books written about it? An archeologist would explore the ruins of a world by reading descriptions written by someone centuries before?
Anyone catching the parallels here? The "search engine" is a great tool for gathering existing data - but our current tools help us:
1. Analyze that data
2. Gather more data
Can you honestly say that those aren't important anymore? The summary seems pretty crazy to me.
You know, this may be the most pure Wired article I've read in a long time. Reminds me of the magazine's layout when it first came out. Complete bull, unreadable, unstructured, but slick.
Google used reams of data to get good at advertising and marketing, so Wired is using this ability to predict the end of SCIENCE?
Do they not realize the difference between these things? Advertising is extremely hand wavy and vague in the best of circumstances - I would argue that Google's offerings aren't really better than any other method, they're just cheaper for advertisers, and have a much larger base than normal.
I'm honestly astounded at this.
An unknowable paradigm? Interesting.
I am very small, utmostly microscopic.
You can't get good data unless you control the makeup of your data population. Even if you applied this technique to all the data in the cloud, it wouldn't mean the "end of the scientific method", it would be scientifically studying the cloud.
So no. Even if everything he wrote is all true, you still apply science to study things, just in a different way. The internet doesn't make science obsolete any more than it made economics obsolete, and saying otherwise is as much hubris now as it was then.
It seems rather stupid to me. Sure, we can correlate a whole bunch of data. And we can collect a whole bunch of data. But that's not going to give us the predictive power that scientific models give us.
Take, for example, the orbit of the earth around the sun. Suppose we collected a whole bunch of data on the orbit of the earth around the sun. Sure, we'd be able to predict what the orbit is going to be, based on past data. But it gives us no other insight. Whereas, when we use the theory of gravity (and rotational motion and conservation of angular momentum, etc.) to predict that the earth orbits the sun, and how it does so, that gives us insight.
Because we can now turn to, say, Jupiter and the sun. Even if there is no data collected on how Jupiter orbits the sun, we can use the predictive power of our theories, that we have tested on the earth-sun system, to say how Jupiter is going to orbit.
That's a simple example, but you can imagine much more complicated situations. If we simply have correlation, we may be able to say that X is going to do Y based on previous behavior, but if I ask you how something new and unexpected is going to behave, we can get no answer until we take data . . . because we don't know *why* anything happens. And that's why we're never going to replace theories with statistical analysis of data.
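As a worked version of the Earth-to-Jupiter argument, here is a tiny sketch using Kepler's third law, which follows from Newtonian gravity for a small body orbiting a much heavier one. With the period T in years and the semi-major axis a in astronomical units, bodies orbiting the sun satisfy T^2 = a^3. The function name is made up for this example, and Jupiter's distance is rounded to 5.2 AU.

```python
def orbital_period_years(semi_major_axis_au):
    """Kepler's third law for sun-orbiting bodies: T^2 = a^3 (years, AU)."""
    return semi_major_axis_au ** 1.5


# The law is "calibrated" on the earth: a = 1 AU gives T = 1 year.
earth_period = orbital_period_years(1.0)

# No Jupiter timing data is used, only its approximate distance from the sun.
jupiter_period = orbital_period_years(5.2)

print(f"earth:   {earth_period:.2f} years")
print(f"jupiter: {jupiter_period:.2f} years")
```

The prediction lands close to Jupiter's observed period of roughly 11.9 years. A pure record of the earth's past positions, however large, contains nothing that would let you make that jump.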
There's a place for both. Obviously, just statistics can be very successful (google, for example), but, at least in science, it's not sufficient.
TODAY: Feeling up.
There are two reasons you're wrong. One is entropy, and the other is one way functions.
Entropy forces causality to appear in physical systems. A boiled egg is highly correlated with a heated raw egg, but I challenge you to explain away the causation from one state to the other.
One way functions are quite similar, and probably a result of the same physical properties of matter. When a key is used to encrypt data, there is a high correlation between the original data, the key, and the encrypted data, but causation clearly flows from encrypting data with the key to the encrypted data state, and not from the encrypted state to a derived key and the original data. It's just a limitation of human (and our machines') abilities, but it nevertheless presents very strong evidence for the practical existence of causation.
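A rough illustration of that asymmetry, with one caveat up front: this sketch swaps in a cryptographic hash (SHA-256 from Python's standard `hashlib`) for the keyed encryption described above, because it shows the forward-versus-backward gap with less code. The three-letter secret and the function names are assumptions made for the example.

```python
import hashlib
import itertools
import string


def digest(text):
    """Forward direction: one cheap function call."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def brute_force_preimage(target_digest, length):
    """Backward direction: try every lowercase string of the given length.

    Returns (matching string or None, number of attempts made).
    """
    attempts = 0
    for letters in itertools.product(string.ascii_lowercase, repeat=length):
        attempts += 1
        candidate = "".join(letters)
        if digest(candidate) == target_digest:
            return candidate, attempts
    return None, attempts


secret = "egg"
forward = digest(secret)  # a single hash evaluation
recovered, attempts = brute_force_preimage(forward, length=3)
print(f"recovered {recovered!r} after {attempts} attempts")
```

Going forward takes one call; going backward takes a search that grows by a factor of 26 with every extra letter. The input and output are perfectly "correlated", but only one direction is practical to compute.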
- None can love freedom heartily, but good men; the rest love not freedom, but license. -- John Milton
I must admit, as an applied mathematician who makes models of physical things for a living, this sort of research threatens to steal my bread and butter. It may be self-centered, but I think modeling is, beside experiment, half of science.
Simplified models are so valuable to our understanding because they tell us what information we can remove, which parts of a problem are important and which parts may be ignored. They allow us to not just make predictions, but they guide future experimentalists as to what sorts of changes will impact the system and which won't.
To be fair, it's more of a cycle: experiments generate data, models are constructed to explain the data. These models make predictions (and hopefully useful simplifications) that can be checked by further experiments to validate them. At the end of the process, we've produced a clearer picture of how a system works. Enough information maybe for someone building something slightly different to not have to test the aspects covered by the model.
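The cycle above can be sketched in a few lines. Everything in this example is hypothetical: the "system" is a made-up linear law with a little measurement noise, the model is an ordinary least-squares line, and the names `run_experiment` and `fit_line` exist only for this illustration.

```python
import random


def run_experiment(xs, rng):
    """Stand-in for the lab: a hidden linear law plus measurement noise."""
    return [2.0 * x + 1.0 + rng.gauss(0, 0.05) for x in xs]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


rng = random.Random(42)

# 1. Experiments generate data.
first_xs = [float(x) for x in range(10)]
first_ys = run_experiment(first_xs, rng)

# 2. A model is constructed to explain the data.
slope, intercept = fit_line(first_xs, first_ys)

# 3. The model makes predictions for conditions not yet measured.
new_xs = [10.0, 11.0, 12.0, 13.0, 14.0]
predictions = [slope * x + intercept for x in new_xs]

# 4. Further experiments check those predictions.
new_ys = run_experiment(new_xs, rng)
worst_error = max(abs(p - y) for p, y in zip(predictions, new_ys))
```

If `worst_error` stays small, the next person building something in that range can lean on two numbers (slope and intercept) instead of rerunning every measurement, which is the "clearer picture" the model buys you.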
I view these data-mining techniques like the scientific computing techniques of the last 30 years or so, only the inverse. Sci Comp nerds wanted to do away with experiments. They thought they could numerically simulate (relatively) exact models (like Navier-Stokes for fluid motion rather than one of its more tractable, understandable simplifications) and use the generated data instead of experimental data. The trouble was that no one will believe that the crazy new phenomenon discovered by your program is real until they see it in the lab, until they construct a simplified model that has the same behavior -- i.e. the same science as before.
The new data-mining idea is the same, but for the modeling end of things. "No models, please," they say. They'll just data-mine the experimental results and "discover" whatever the model missed. Except people will want to do experiments to verify the discovery. They'll want to build models so they can know they're doing the right experiments, and so on.
At the end, I think Sci Comp and data-mining are fantastic new tools that have a lot to offer science, but I don't think either eliminates the need for old fashioned modeling.
Use the Firehose to mod down Second Life stories!
i'd never heard the term "model selection," so thanks for pointing that out. it looks like there really is some good literature to read on the subject.
the process described by the model selection sites i skimmed still doesn't address what i was getting at, though. "choosing a model from a set of potential models" is only conceivable when your set of potential models (and set of variables to potentially be modeled) is well bounded.
to put it another way, take the smartest model choosing algorithm you can find, hand it a pile of data, and say "what do you make of that, smart guy?" i'm willing to bet that the answer is going to be along the lines of "wtf?" unless there is some sort of context or metadata provided along with the data to give the algorithm a hint of what it's looking for. am i looking for covariance between scalar values among regularly organized groups? am i looking for white rabbits in the image data from a camera? is this ascii or ebcdic or 8-bit PCM data? you can argue that these questions are trivial, that no algorithm can be *that* general, but that is precisely my point: all known algorithms require significant narrowing down of the problem space by human hands before they can begin to produce useful output.
if you had an algorithm that took *truly* semantics-free data in one end and spit models of regularly occurring features out the other end, you'd be halfway to general AI.
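A small sketch of what "choosing a model from a set of potential models" looks like in practice, and of how much a human has already decided before the algorithm starts. The candidate set (a constant versus a straight line), the use of the Bayesian information criterion as the score, and the synthetic data are all assumptions picked for illustration.

```python
import math
import random


def fit_constant(xs, ys):
    """Candidate 1: y is just its mean (1 parameter)."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Candidate 2: y = slope * x + intercept (2 parameters)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def bic(predict, xs, ys, n_params):
    """Bayesian information criterion for Gaussian errors (lower is better)."""
    n = len(xs)
    rss = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    return n * math.log(rss / n) + n_params * math.log(n)


# The bounded candidate set is chosen by a person, not by the algorithm.
CANDIDATES = {
    "constant": (fit_constant, 1),
    "linear": (fit_linear, 2),
}


def select_model(xs, ys):
    scores = {
        name: bic(fit(xs, ys), xs, ys, n_params)
        for name, (fit, n_params) in CANDIDATES.items()
    }
    return min(scores, key=scores.get), scores


# Synthetic data that someone has already declared to be (x, y) scalar pairs.
rng = random.Random(7)
xs = [float(x) for x in range(30)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]
best_name, scores = select_model(xs, ys)
```

The selection step itself is mechanical. The narrowing (that the bytes are pairs of numbers, that x might explain y, that only these two shapes are worth trying) was all done by hand before `select_model` was ever called.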