Cutting Through Data Science Hype
An anonymous reader writes: Data science — or "big data" if you prefer — has evolved into a full-fledged buzzword, thanks to marketing departments around the world. John Foreman writes that part of the marketing blitz has been focused on how fast big data analysis can be. Most companies offering some kind of analytic service try to sell you on how it'll make it easy for you to quickly find and fix the problems with your business. But he points out that good, robust models need a stable set of inputs, and businesses often change far too quickly for any kind of stable prediction. He takes IBM's analytic services as an example, quoting Kevin Hillstrom: "If IBM Watson can find hidden correlations that help your business, then why can't IBM Watson stem a 3 year sales drop at IBM?" Foreman offers some simple advice: "Simple analyses don't require huge models that get blown away when the business changes. ... If your business is currently too chaotic to support a complex model, don't build one."
"we don't need no stinkin' sales", we have Ginni.
"Big Data" is like sex in high school. Nobody really knows for sure how to do it properly, but everyone thinks everyone else is doing it, so everyone says they're doing it, too.
Thanks to the War on Drugs, it's easier to buy meth than it is to buy cold medicine!
Statistical Process Control and the Western Electric rules are very applicable here. Without stability for a baseline, it's (pretty well) impossible to utilize small data, much less big data (big bad data:).
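For anyone who hasn't run into SPC, here is a rough sketch of the idea, using two of the Western Electric rules (the usual name for the control-chart run rules). The function name, the sample numbers, and the choice of only two rules are my own assumptions for illustration, not a full implementation. Note that everything hangs off a baseline that is assumed to be stable, which is exactly the point: no stable baseline, no center line, no rules.

```python
from statistics import mean, stdev

def western_electric_flags(baseline, samples):
    """Flag samples using two of the Western Electric rules.

    baseline: measurements from a period assumed to be stable.
    samples: new measurements to check against that baseline.
    """
    center = mean(baseline)
    sigma = stdev(baseline)
    flags = []
    run_side = 0    # +1 above the center line, -1 below, 0 exactly on it
    run_length = 0
    for i, x in enumerate(samples):
        # Rule 1: a single point more than 3 sigma from the center line.
        if abs(x - center) > 3 * sigma:
            flags.append((i, "rule 1"))
        # Rule 4: eight consecutive points on the same side of the center line.
        side = (x > center) - (x < center)
        if side != 0 and side == run_side:
            run_length += 1
        else:
            run_side = side
            run_length = 1 if side != 0 else 0
        if run_length >= 8:
            flags.append((i, "rule 4"))
    return flags

# Made-up numbers: a stable baseline, then a small sustained shift and one spike.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
shifted = [10.3] * 8 + [12.0]
print(western_electric_flags(baseline, shifted))
```

The small shift never trips the 3 sigma rule on its own; it only shows up as a run on one side of the center line, which is why the run rules exist.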
Great minds think alike; fools seldom differ.
This pretty much sums up the entirety of Big Data.
Data analysis can highlight the correlations that would otherwise go unnoticed, and the "big" data sets involved help to ensure that the noticed correlations are statistically significant. With a large enough sample size, the effects of time can be eliminated from the statistics, supporting analysis of even highly-dynamic models. To a statistician, this is all trivial, given a large enough data set.
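To put a number on the significance point, here is a minimal sketch using the standard t statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2). The correlation of 0.02, the sample sizes, and the 1.96 cutoff (a large-sample approximation to a two-sided 5% test) are assumptions picked for illustration.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Assumed cutoff: |t| > 1.96 approximates a two-sided 5% test for large n.
THRESHOLD = 1.96

for n in (100, 10_000, 1_000_000):
    t = correlation_t_statistic(0.02, n)
    print(n, round(t, 2), abs(t) > THRESHOLD)
```

The same weak correlation fails the test at a hundred rows and passes it easily at a million. That only says the correlation is unlikely to be noise; it says nothing about whether it means anything.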
Once correlations are discovered, interpreting them in the business context is a different matter for which computers are not well-suited. As the phrase goes, correlation is not causation. A business expert must analyse the observations and figure out what it all means. There may be a correlation indicating a causal relationship, or there may be a hidden cause not covered by the available data.
Even if a causal relationship can be identified, the management may not want to act on it. Sure, the company might make more money by changing their behavior in a particular market segment, but if that segment is dying, it may not be worth the expense to change now. That's also not a task for computers, yet.
Big Data is effectively just a tool. It does one job particularly well, and does a few other jobs well enough to be useful. It is still up to humans to determine if Big Data is the best tool for a particular situation.
You do not have a moral or legal right to do absolutely anything you want.
If you have a marketing department, you're wasting money.
If you hire a marketing firm, you're burning money.
If you hire a marketing firm and then take their advice, you're emptying your bank account into a volcano.
The dinosaurs did not die out because they were unable to adapt, any more than a person dies because they fail to "adapt" to a grenade.
It little behooves the best of us to comment on the rest of us.
Data scientists are this bubble's web masters. 'Nuff said.
I have worked with many very large data sets, or very important data sets covering large numbers of people (not that big, just complex). In both cases my first fight was with the data itself. I don't know how many databases I would get into with fields (all in one table) like phone, phone_num, number_phone, phonenum, and then usually a magical set like phone1, phone2, phone3, and phone2a.
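A minimal sketch of the first cleanup pass for the phone mess, assuming each row is a dict of strings. The column list is just the names mentioned above, and the "digits only" normalisation is an assumption that ignores extensions and country codes.

```python
import re

# Hypothetical column names, taken from the examples mentioned above.
PHONE_COLUMNS = ["phone", "phone_num", "number_phone", "phonenum",
                 "phone1", "phone2", "phone2a", "phone3"]

def collect_phones(row):
    """Return the distinct phone numbers in a row, digits only, in column order."""
    seen = []
    for column in PHONE_COLUMNS:
        # Missing or empty columns become "", then everything but digits is dropped.
        digits = re.sub(r"\D", "", row.get(column) or "")
        if digits and digits not in seen:
            seen.append(digits)
    return seen

# Made-up row: the same number in two formats, plus a second number.
row = {"phone": "(902) 555-0100", "phonenum": "902-555-0100", "phone2a": "9025550199"}
print(collect_phones(row))
```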
Or I would have lat longs for customers that put them 100 miles off the coast of Nova Scotia (not Sable Island either). Or mostly good lat longs, but if they couldn't get one then they would use the lat long of the nation's capital, resulting in 20% of the customers residing in any given nation's capital, which also then obscured the actual number of customers in the nation's capital.
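One rough way to spot the "default to the capital" trick is to look for a single exact coordinate shared by an implausible share of records. This is only a sketch: the 5% cutoff and the sample points are assumptions, and real data would also want a check against a coastline to catch the customers in the ocean.

```python
from collections import Counter

def suspicious_coordinates(points, max_share=0.05):
    """Return coordinates shared by more than max_share of all records.

    points: list of (lat, lon) tuples. max_share is an assumed cutoff.
    """
    counts = Counter(points)
    total = len(points)
    return {coord for coord, count in counts.items() if count / total > max_share}

# Made-up data: 20 records stamped with Ottawa's coordinates, 80 distinct ones.
points = [(45.4215, -75.6972)] * 20 + [
    (44.0 + i * 0.01, -63.0 - i * 0.01) for i in range(80)
]
print(suspicious_coordinates(points))
```

Flagged rows are better marked "location unknown" than deleted, since some of those customers really do live in the capital and there is no way to tell which from the coordinate alone.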
And then dates. Can nobody ever get dates right? A favourite is that round one of the system will only record the day of a transaction, but later they expand their collection to the hour and minute, and now the old dates are all at noon or something. So when you try to find the usage pattern of users there will be this massive spike at noon and a scattering of transactions in the rest of the day. Try and run that through a Bayesian analysis.
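A sketch of how the noon spike might be handled before any time-of-day analysis, assuming the timestamps are Python datetime objects. The 50% cutoff is an arbitrary assumption, and the function names are made up for this example.

```python
from collections import Counter
from datetime import datetime

def placeholder_time(timestamps, min_share=0.5):
    """Return the exact time of day shared by more than min_share of records, if any."""
    counts = Counter(ts.time() for ts in timestamps)
    if not counts:
        return None
    common, count = counts.most_common(1)[0]
    return common if count / len(timestamps) > min_share else None

def split_by_precision(timestamps):
    """Separate records with a real time of day from date-only ones."""
    filler = placeholder_time(timestamps)
    timed = [ts for ts in timestamps if ts.time() != filler]
    date_only = [ts.date() for ts in timestamps if ts.time() == filler]
    return timed, date_only

# Made-up data: a week of old records stamped at noon, two newer real ones.
stamps = [datetime(2014, 1, d, 12, 0) for d in range(1, 8)] + [
    datetime(2014, 2, 1, 9, 17),
    datetime(2014, 2, 1, 18, 42),
]
timed, date_only = split_by_precision(stamps)
print(len(timed), len(date_only))
```

The catch is that any transaction that genuinely happened at exactly noon gets thrown in with the date-only pile, so the usage pattern still has a hole where the spike used to be.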
I can go on and on. One of my recent favorites is a phone company database where many phone calls never begin, or never end.
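Even something as dumb as the following, run before any modelling, would at least count the damage. It assumes each call is a (start, end) pair with None for a missing timestamp; the labels are made up.

```python
from datetime import datetime

def classify_call(start, end):
    """Label a call record by which of its endpoints are usable."""
    if start is None and end is None:
        return "no timestamps"
    if start is None:
        return "never begins"
    if end is None:
        return "never ends"
    if end < start:
        return "ends before it begins"
    return "ok"

# Made-up records: one good call, one with no start, one with no end.
calls = [
    (datetime(2014, 3, 1, 10, 0), datetime(2014, 3, 1, 10, 5)),
    (None, datetime(2014, 3, 1, 11, 0)),
    (datetime(2014, 3, 1, 12, 0), None),
]
print([classify_call(start, end) for start, end in calls])
```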
So I think the big bucks are not in doing ML processing of their data using some ingenious Hadoop crap, but in maybe using ML to clean the data up. And by the way, if someone has a tilde (~) in their name, your OCR needs to be shot.
Watson was impressive on Jeopardy, but a TV show is a very different venue than business data analytics.
For the latter you really need a statistically sound approach in order to reach the right conclusion.
(DISCLAIMER: I do not work for Bayesia, but actually a competitor, yet any person or company that understands Bayesianism as a sound foundation for knowledge inference knows this dirty little secret about Watson)
What do you mean "dinosaurs failed to adapt", there are several of them flying around in my garden right now!
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
Find a rusty railroad spike. Shove it through your eyeball over and over again. That's what IBM products are like.
Buy a very expensive rusty railroad spike. Shove it through your eyeball over and over again. That's what IBM products are like.
There, fixed it for you.
I'm a consultant - I convert gibberish into cash-flow.
Birds heap shame upon their ancestors merely by existing. (Except maybe shrikes; their willingness to keep up a proud tradition of bloodthirsty carnivorous murder despite now being about the size of a sparrow is pretty honorable).
>> Catastrophe is a critical factor in most evolutionary history.
> Citation, please.
Wikipedia has a fairly good entry on "Catastrophism", and another on "Punctuated equilibrium". But even without large scale events such as dinosaur killer asteroids or the evolution of photosynthesis poisoning most species with much higher concentrations of volatile oxygen, there are much smaller and more frequent effects. Forest fires are a critical factor in breeding jack pine trees, floods are vital to the fertility of the ecosystem near river banks, and hurricanes spread species throughout their trail and profoundly affect the ecology and evolution of areas that are likely to endure hurricanes. And catastrophes can and do create a "founder effect", where a small number of introduced species members become a new species quite quickly in their new environment.
Do I need to find individual links for each of those?