How Big Data Creates False Confidence (nautil.us)
Mr D from 63 shares an article from Nautilus urging skepticism of big data: "The general idea is to find datasets so enormous that they can reveal patterns invisible to conventional inquiry... But there's a problem: It's tempting to think that with such an incredible volume of data behind them, studies relying on big data couldn't be wrong. But the bigness of the data can imbue the results with a false sense of certainty. Many of them are probably bogus -- and the reasons why should give us pause about any research that blindly trusts big data."
For example, Google's database of scanned books represents 4% of all books ever published, but in this data set, "The Lord of the Rings gets no more influence than, say, Witchcraft Persecutions in Bavaria." And the name Lanny appears to be one of the most common in early-20th century fiction -- solely because Upton Sinclair published 11 different novels about a character named Lanny Budd.
The problem seems to be skewed data and misinterpretation. (The article points to the failure of Google Flu Trends, which it turns out "was largely predicting winter".) The article's conclusion? "Rather than succumb to 'big data hubris,' the rest of us would do well to keep our skeptic hats on -- even when someone points to billions of words."
Film at 11.
We suffer more in our imagination than in reality. - Seneca
A lot of the big prediction sites have been predicting the Kansas City Royals to be average to poor the last few years. So far they have been to the World Series twice and are in first place in their division this year, all with average to slightly above-average mainstream stats. But if you look at them, they built a team using a team strategy instead of simply signing guys and looking at individual stats.
Getting folks in the Bay Area to realize that is still an unsolved problem. Maybe they have an AI team working on it.
In all seriousness, I saw this a lot when working within a monitoring team, and in consulting I've done for other orgs. Big Data is great for vast, multi-dimensional analysis of massive amounts of data, but it's not a substitute for domain knowledge about *WHAT* you're monitoring, critically thinking about what you're looking for and what types of failure modes might occur, and simple(r) heuristics for triggers.
Trend analysis is very useful as an adjunct, for example, but within a server monitoring context it's not a *substitute* for having hard limits on, say, CPU load, or HTTP response time, or memory usage.
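To make the "hard limits" point concrete, here is a minimal sketch of a threshold check that runs regardless of what any trend model says. The metric names, the limit values, and the `breached_limits` helper are all hypothetical placeholders, not recommendations or anyone's actual monitoring config.

```python
# Hypothetical hard limits; the numbers are placeholders, not recommendations.
HARD_LIMITS = {
    "cpu_load": 8.0,            # load average
    "http_response_ms": 2000,   # milliseconds
    "memory_used_pct": 90.0,    # percent of RAM in use
}


def breached_limits(sample, limits=HARD_LIMITS):
    """Return the sorted names of metrics in `sample` that exceed their limit.

    Metrics missing from `sample` are skipped rather than treated as alerts.
    """
    return sorted(
        name
        for name, limit in limits.items()
        if name in sample and sample[name] > limit
    )
```

For example, `breached_limits({"cpu_load": 9.5, "http_response_ms": 300, "memory_used_pct": 95})` flags `cpu_load` and `memory_used_pct`. Choosing the limits themselves is where the domain knowledge comes in.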
Somehow, people managed to come to conclusions and make good decisions even before we had terabytes of raw data being sifted through by statistical algorithms to come up with a result.
To place it into a broader cultural context, I see this in parallel with "data fetishisation" where nothing at all can be possibly true unless Science. And Data. Hipster praying at the altar of data.gov as some sort of left-wing (or Millennial) shibboleth for smug certainty when the basics -- the entry-level, basic 101 class of domain knowledge for the field -- are being forgotten.
I'm all for bringing in new tech and new analytic techniques, but you can't look at it as a panacea for failing to understand what's going on in your domain on a philosophical level.
Hire a Linux system administrator, systems engineer,
"Well, if I generate (by simulation) a set of 200 variables — completely random and totally unrelated to each other — with about 1,000 data points for each, then it would be near impossible not to find in it a certain number of “significant” correlations of sorts. But these correlations would be entirely spurious. And while there are techniques to control the cherry-picking (such as the Bonferroni adjustment), they don’t catch the culprits — much as regulation didn’t stop insiders from gaming the system. You can’t really police researchers, particularly when they are free agents toying with the large data available on the web.
I am not saying here that there is no information in big data. There is plenty of information. The problem — the central issue — is that the needle comes in an increasingly larger haystack."
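The simulation described in that quote is easy to try. A rough sketch, standard library only: generate independent random variables, test every pair for correlation, and count the "significant" hits with and without a Bonferroni adjustment. The function names are made up for illustration, and the p-value uses the Fisher z-transform approximation, which is an assumption of this sketch and not something the quote specifies.

```python
import math
import random


def standardize(values):
    """Shift to zero mean and scale to unit length, so the dot product of
    two standardized lists equals their Pearson correlation."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def p_value(r, n):
    """Approximate two-sided p-value for a correlation r over n points,
    via the Fisher z-transform (an approximation assumed by this sketch)."""
    r = max(-0.999999, min(0.999999, r))  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def count_spurious(n_vars, n_points, alpha=0.05, seed=0):
    """Return (pairs tested, hits at alpha, hits after Bonferroni)."""
    rng = random.Random(seed)
    # Every column is independent Gaussian noise: no real relationships exist.
    columns = [
        standardize([rng.gauss(0, 1) for _ in range(n_points)])
        for _ in range(n_vars)
    ]
    pairs = n_vars * (n_vars - 1) // 2
    raw = bonferroni = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            r = sum(a * b for a, b in zip(columns[i], columns[j]))
            p = p_value(r, n_points)
            raw += p < alpha                  # naive threshold
            bonferroni += p < alpha / pairs   # Bonferroni-adjusted threshold
    return pairs, raw, bonferroni
```

Calling `count_spurious(200, 1000)` tests 19,900 pairs. At alpha = 0.05 you should expect roughly 5% of them, on the order of a thousand, to look "significant" purely by chance. The Bonferroni column should sit at or near zero here, but it only helps if you honestly count every comparison you tried.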
Getting data is dead easy. I can get you gobs of it. I can store it fairly quickly.
Now for the hard part. What are you trying to find? Have we been collecting the right data? Is it in the right form? Do we actually have enough? Is it at the proper interval? These are where most people fail, and they just keep collecting more of the same data even though it has zero use for them, somehow magically expecting the data to organize itself.
https://xkcd.com/1138/
"data fetishisation" where nothing at all can be possibly true unless Science. And Data.
The problem is that it's all data, very little science.
Real scientists know how to scrutinize their data, and how to rule out false positives. Actual science will not only give you a statistical level of confidence, but use domain expertise to the uttermost to rule out systematic errors. A nice case study in that regard is the recent LIGO gravitational wave results.
Most of the people who like to call themselves "data scientists" these days know as much about science as "computer engineers" know about proper engineering.
Of course having more data to run statistical models against gives more confidence.
Not necessarily. Data sets of a few thousand records are generally sufficient for decent p values unless you're looking for effects that are so small that they're of limited commercial value. The trouble with a really big data set is that data quality and data volume are often inversely related.
Far more relevant than a larger data set are the answers to questions like these: Is my data set representative or does it have important biases? Is my data set stable over time? What important causal variables might be missing from my data? Am I looking at causality or common cause? Are my errors normally distributed? Are my missing data points representative of the remaining data? What does this data mean in the real world?
In almost all cases, I would far rather work with a small, high quality data set than a large set of uncertain quality.
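A back-of-the-envelope way to check the "few thousand records" point above: compute the smallest correlation that would reach significance at a given sample size. This is only a sketch. It assumes a simple two-variable Pearson correlation and the Fisher z-transform approximation, and `min_detectable_r` is a name made up for this example.

```python
import math
from statistics import NormalDist


def min_detectable_r(n, alpha=0.05):
    """Smallest |r| that reaches two-sided significance at `alpha` with n
    records, using the Fisher z-transform approximation."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return math.tanh(z_crit / math.sqrt(n - 3))
```

With 3,000 records the threshold is already about 0.036. Going to a million records only pushes it down to about 0.002, so the extra volume buys sensitivity to effects that are tiny, and does nothing for bias or data quality.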
I'm reminded somewhat of "The Bible Code": the theory that there is a bunch of stuff hidden in the Bible, visible when viewed in different ways (like when skipping characters, etc.; Google it). The reality is that the bigger the dataset, the more patterns, even false patterns, may be present in it. If I had a billion monkeys, what would they type...