How Big Data Creates False Confidence (nautil.us)
Mr D from 63 shares an article from Nautilus urging skepticism of big data: "The general idea is to find datasets so enormous that they can reveal patterns invisible to conventional inquiry... But there's a problem: It's tempting to think that with such an incredible volume of data behind them, studies relying on big data couldn't be wrong. But the bigness of the data can imbue the results with a false sense of certainty. Many of them are probably bogus -- and the reasons why should give us pause about any research that blindly trusts big data."
For example, Google's database of scanned books represents 4% of all books ever published, but in this data set, "The Lord of the Rings gets no more influence than, say, Witchcraft Persecutions in Bavaria." And the name Lanny appears to be one of the most common in early-20th-century fiction -- solely because Upton Sinclair published 11 different novels about a character named Lanny Budd.
The problem seems to be skewed data and misinterpretation. (The article points to the failure of Google Flu Trends, which it turns out "was largely predicting winter".) The article's conclusion? "Rather than succumb to 'big data hubris,' the rest of us would do well to keep our skeptic hats on -- even when someone points to billions of words."
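To make the Lanny effect concrete, here is a minimal sketch of how a raw name count differs from a count of how many distinct authors use a name. The corpus below is an invented toy example (the authors, titles, and counts are hypothetical and not taken from Google's data), and counting by distinct author is just one possible normalization, not something the article prescribes.

```python
from collections import Counter

# Hypothetical toy corpus: (author, title, character names mentioned).
# All values here are invented for illustration only.
corpus = [
    ("Author A", "Book 1", ["Lanny", "Lanny", "Lanny"]),
    ("Author A", "Book 2", ["Lanny", "Lanny"]),
    ("Author A", "Book 3", ["Lanny", "Lanny", "Lanny"]),
    ("Author B", "Book 4", ["John"]),
    ("Author C", "Book 5", ["John", "Mary"]),
    ("Author D", "Book 6", ["Mary"]),
]


def raw_counts(corpus):
    """Count every mention of a name, regardless of who wrote it."""
    counts = Counter()
    for _author, _title, names in corpus:
        counts.update(names)
    return counts


def author_counts(corpus):
    """Count how many distinct authors use each name at least once."""
    seen = set()
    for author, _title, names in corpus:
        for name in names:
            seen.add((author, name))
    return Counter(name for _author, name in seen)


raw = raw_counts(corpus)
by_author = author_counts(corpus)
print("raw mentions:", dict(raw))
print("distinct authors:", dict(by_author))
```

In the raw count, one prolific author makes "Lanny" look like the dominant name; counted by distinct authors, it drops to a single vote. More data of the same kind would not fix this, only a different way of counting.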
Getting data is dead easy. I can get you gobs of it. I can store it fairly quickly.
Now for the hard part. What are you trying to find? Have we been collecting the right data? Is it in the right form? Do we actually have enough? Is it sampled at the proper interval? This is where most people fail: they just keep collecting more of the same data, even though it has zero use for them, somehow expecting the data to magically organize itself.