How Big Data Creates False Confidence (nautil.us)
Mr D from 63 shares an article from Nautilus urging skepticism of big data: "The general idea is to find datasets so enormous that they can reveal patterns invisible to conventional inquiry... But there's a problem: It's tempting to think that with such an incredible volume of data behind them, studies relying on big data couldn't be wrong. But the bigness of the data can imbue the results with a false sense of certainty. Many of them are probably bogus -- and the reasons why should give us pause about any research that blindly trusts big data."
For example, Google's database of scanned books represents 4% of all books ever published, but in this data set, "The Lord of the Rings gets no more influence than, say, Witchcraft Persecutions in Bavaria." And the name Lanny appears to be one of the most common in early-20th-century fiction -- solely because Upton Sinclair published 11 different novels about a character named Lanny Budd.
The problem seems to be skewed data and misinterpretation. (The article points to the failure of Google Flu Trends, which it turns out "was largely predicting winter".) The article's conclusion? "Rather than succumb to 'big data hubris,' the rest of us would do well to keep our skeptic hats on -- even when someone points to billions of words."
Film at 11.
We suffer more in our imagination than in reality. - Seneca
"data fetishisation" where nothing at all can be possibly true unless Science. And Data.
The problem is that it's all data, very little science.
Real scientists know how to scrutinize their data and how to rule out false positives. Actual science will not only give you a statistical level of confidence, but also use domain expertise to the fullest to rule out systematic errors. A nice case study in that regard is the recent LIGO gravitational wave result.
Most of the people who like to call themselves "data scientists" these days know as much about science as "computer engineers" know about proper engineering.
Of course having more data to run statistical models against gives more confidence.
Not necessarily. Data sets of a few thousand records are generally sufficient for decent p-values unless you're looking for effects so small that they're of limited commercial value. The trouble with a really big data set is that data quality and data volume are often inversely related.
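To put rough numbers on that, here is a minimal sketch, assuming a two-group comparison with known unit variance and an observed difference equal to the true effect (a textbook z-test simplification, not anything from the article). The function name and the effect sizes are made up for illustration.

```python
import math

def two_sided_p(effect_size: float, n_per_group: int) -> float:
    """Approximate two-sided p-value for a two-sample z-test.

    Assumes unit variance in both groups and that the observed difference
    in means equals `effect_size` (in standard deviations).
    """
    z = effect_size * math.sqrt(n_per_group / 2)
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical effect sizes: 0.2 SD is a modest effect, 0.01 SD is tiny.
for effect in (0.2, 0.01):
    for n in (2_000, 2_000_000):
        p = two_sided_p(effect, n)
        print(f"effect={effect:<5} n per group={n:>9,} p={p:.3g}")
```

Under these assumptions a 0.2 SD effect is already overwhelmingly significant at a couple of thousand records per group, while the 0.01 SD effect only becomes detectable at millions of records. The extra volume mostly buys sensitivity to effects that may be too small to matter.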
Far more relevant than a larger data set are the answers to questions like these: Is my data set representative, or does it have important biases? Is my data set stable over time? What important causal variables might be missing from my data? Am I looking at causality or common cause? Are my errors normally distributed? Are my missing data points representative of the remaining data? What does this data mean in the real world?
In almost all cases, I would far rather work with a small, high quality data set than a large set of uncertain quality.
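A toy simulation of that trade-off, as a rough sketch only: the population, the selection probabilities, and the sample sizes below are all invented for illustration, and the selection bias is deliberately crude (values above 50 are three times as likely to be recorded).

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 200,000 values centred on 50.
population = [random.gauss(50, 10) for _ in range(200_000)]
true_mean = statistics.fmean(population)

# "Big data": a huge sample, but values above 50 are kept with
# probability 0.9 and values at or below 50 with probability 0.3.
big_biased = [
    x for x in population
    if random.random() < (0.9 if x > 50 else 0.3)
]

# "Small, high quality": 1,000 records drawn uniformly at random.
small_clean = random.sample(population, 1_000)

big_error = abs(statistics.fmean(big_biased) - true_mean)
small_error = abs(statistics.fmean(small_clean) - true_mean)

print(f"big biased sample:  n={len(big_biased):,} error={big_error:.2f}")
print(f"small clean sample: n={len(small_clean):,} error={small_error:.2f}")
```

The biased sample is more than fifty times larger, yet its estimate of the mean is further off than the small random sample's, and collecting more records with the same bias would not close the gap.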