Slashdot Mirror


Cutting Through Data Science Hype

An anonymous reader writes: Data science — or "big data" if you prefer — has evolved into a full-fledged buzzword, thanks to marketing departments around the world. John Foreman writes that part of the marketing blitz has been focused on how fast big data analysis can be. Most companies offering some kind of analytic service try to sell you on how it'll make it easy for you to quickly find and fix the problems with your business. But he points out that good, robust models need a stable set of inputs, and businesses often change far too quickly for any kind of stable prediction. He takes IBM's analytic services as an example, quoting Kevin Hillstrom: "If IBM Watson can find hidden correlations that help your business, then why can't IBM Watson stem a 3 year sales drop at IBM?" Foreman offers some simple advice: "Simple analyses don't require huge models that get blown away when the business changes. ... If your business is currently too chaotic to support a complex model, don't build one."

2 of 99 comments

  1. Re:IBM by Sarten-X · · Score: 5, Insightful

    This pretty much sums up the entirety of Big Data.

    Data analysis can highlight the correlations that would otherwise go unnoticed, and the "big" data sets involved help to ensure that the noticed correlations are statistically significant. With a large enough sample size, the effects of time can be eliminated from the statistics, supporting analysis of even highly-dynamic models. To a statistician, this is all trivial, given a large enough data set.

    Once correlations are discovered, interpreting them in the business context is a different matter for which computers are not well-suited. As the phrase goes, correlation is not causation. A business expert must analyse the observations and figure out what it all means. There may be a correlation indicating a causal relationship, or there may be a hidden cause not covered by the available data.

    Even if a causal relationship can be identified, the management may not want to act on it. Sure, the company might make more money by changing their behavior in a particular market segment, but if that segment is dying, it may not be worth the expense to change now. That's also not a task for computers, yet.

    Big Data is effectively just a tool. It does one job particularly well, and does a few other jobs well enough to be useful. It is still up to humans to determine whether Big Data is the best tool for a particular situation.

    --
    You do not have a moral or legal right to do absolutely anything you want.
  2. Good data first, then maybe big data later by EmperorOfCanada · · Score: 4, Insightful

    I have worked with many very large data sets, and with very important data sets covering large numbers of people (not that big, just complex). In both cases my first fight was with the data itself. I have lost count of the databases I got into that had fields (all in one table) like phone, phone_num, number_phone, phonenum, and then usually a magical set like phone1, phone2, phone3, and phone2a.
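
    A minimal sketch of how those scattered phone fields might be collapsed into one clean list per record. It assumes each row arrives as a plain dict; the column-matching pattern and the helper names (normalize_phone, collect_phones) are made up for illustration and do not come from any real schema.

```python
import re

# Hypothetical pattern: treat any column whose name mentions "phone" or
# "number" as a phone field. A real schema would need a reviewed list.
PHONE_COLUMN = re.compile(r"phone|number", re.IGNORECASE)


def normalize_phone(raw):
    """Reduce a phone value to its digits; return None if nothing usable remains."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", str(raw))
    return digits or None


def collect_phones(row):
    """Gather the distinct phone numbers spread across phone-like columns of one row."""
    seen = []
    for column in sorted(row):  # sorted for a deterministic order
        if PHONE_COLUMN.search(column):
            phone = normalize_phone(row[column])
            if phone and phone not in seen:
                seen.append(phone)
    return seen


# Made-up example row using column names like the ones listed above.
row = {
    "name": "A. Customer",
    "phone": "(902) 555-0101",
    "phone_num": "902-555-0101",
    "phonenum": None,
    "phone2a": "902 555 0199",
}
phones = collect_phones(row)
```

    Comparing digits only means the same number typed two different ways counts once. It does not handle extensions or country codes, which would need their own rules.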

    Or I would have lat/longs for customers that put them 100 miles off the coast of Nova Scotia (and not on Sable Island either). Or the lat/longs would be mostly good, but whenever they couldn't get one they would use the lat/long of the nation's capital, resulting in 20% of the customers apparently residing in any given nation's capital, which also obscured the actual number of customers in that capital.
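
    One rough way to spot a fallback coordinate like that is to look for a single exact point holding an implausible share of the records. This is only a sketch: the 5% threshold, the rounding precision, and the function name are assumptions, and the sample data is invented. It flags candidates for a human to check; it cannot tell real capital residents from defaulted ones.

```python
from collections import Counter


def suspicious_coordinates(points, max_share=0.05, precision=4):
    """Return {(lat, lon): share} for rounded points holding more than max_share of records."""
    if not points:
        return {}
    counts = Counter(
        (round(lat, precision), round(lon, precision)) for lat, lon in points
    )
    total = len(points)
    return {
        coord: n / total for coord, n in counts.items() if n / total > max_share
    }


# Invented data: 20 records share one fallback point (roughly Ottawa),
# and 80 records are spread over distinct locations.
fallback = (45.4215, -75.6972)
points = [fallback] * 20 + [
    (44.0 + i * 0.01, -63.0 - i * 0.01) for i in range(80)
]
flagged = suspicious_coordinates(points)
```

    Genuine geocoded addresses rarely land on exactly the same point in large numbers, so a heavy pile-up on one coordinate is worth a look before any per-region counts are trusted.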

    And then dates. Can nobody ever get dates right? A favourite is that round one of the system only records the day of a transaction; later they expand their collection to the hour and minute, but now the old dates are all at noon or something. So when you try to find the usage pattern of users there is this massive spike at noon and a scattering of transactions through the rest of the day. Try running that through a Bayesian analysis.
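
    A sketch of how that noon spike might be detected before any modelling, so the old day-only records can be treated as dates without a time. The 50% threshold and the function name are assumptions, and the timestamps are invented.

```python
from collections import Counter
from datetime import datetime


def placeholder_times(timestamps, max_share=0.5):
    """Return (hour, minute, second) values that make up more than max_share of timestamps."""
    total = len(timestamps)
    if total == 0:
        return []
    counts = Counter((t.hour, t.minute, t.second) for t in timestamps)
    return [tod for tod, n in counts.items() if n / total > max_share]


# Invented data: 60 legacy records stamped at exactly noon,
# 40 newer records with real times of day.
legacy = [datetime(2014, 1, d % 28 + 1, 12, 0, 0) for d in range(60)]
recent = [datetime(2014, 6, 1, 8 + i % 10, i, 7) for i in range(40)]
flagged = placeholder_times(legacy + recent)

# Keep only records whose time of day looks real for hourly usage analysis.
usable = [
    t for t in legacy + recent
    if (t.hour, t.minute, t.second) not in flagged
]
```

    Dropping the flagged time of day also discards the few transactions that really did happen at exactly noon, which is usually a smaller distortion than the spike itself.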

    I could go on and on. One of my recent favorites is a phone company database where many phone calls never begin, or never end.
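
    A small sketch of the sanity check that would catch those records, assuming each call has an optional start and end timestamp. The category labels and the function name are invented for illustration.

```python
from datetime import datetime


def classify_call(start, end):
    """Label a call record by what is wrong with its timestamps, or 'ok'."""
    if start is None and end is None:
        return "no_timestamps"
    if start is None:
        return "never_began"
    if end is None:
        return "never_ended"
    if end < start:
        return "ends_before_start"
    return "ok"


# Invented call records as (start, end) pairs.
calls = [
    (datetime(2014, 3, 1, 9, 0), datetime(2014, 3, 1, 9, 5)),
    (None, datetime(2014, 3, 1, 10, 0)),
    (datetime(2014, 3, 1, 11, 0), None),
    (datetime(2014, 3, 1, 12, 30), datetime(2014, 3, 1, 12, 0)),
]
labels = [classify_call(start, end) for start, end in calls]
```

    Counting the labels per day or per switch is a cheap first report that shows how much of the table can be trusted at all.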

    So I think the big bucks are not in doing ML processing of their data using some ingenious Hadoop crap, but maybe in using ML to clean the data up. And by the way, if someone has a tilde (~) in their name, your OCR needs to be shot.