Test Shows Big Data Text Analysis Inconsistent, Inaccurate
DillyTonto writes: The "state of the art" in big-data (text) analysis turns out to use a method of categorizing words and documents that, when tested, offered different results for the same data 20% of the time and was flat wrong another 10%, according to researchers at Northwestern. The researchers offered a more accurate method, but only as an example of how to use community detection algorithms to improve on the leading method (LDA). Meanwhile, a certain percentage of answers from all those big data installations will continue to be flat wrong until they're re-run, which will make them wrong in a different way.
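To make "different results for the same data" concrete, here is a minimal sketch of one way to score how much two runs of a topic model agree. It is not the Northwestern researchers' metric; it is the plain Rand index, and the function name `pairwise_agreement` and the two label lists are made up for illustration. It counts the fraction of document pairs that both runs treat the same way (together in one topic, or apart), so renaming the topics between runs doesn't count as a disagreement.

```python
from itertools import combinations


def pairwise_agreement(labels_a, labels_b):
    """Rand index: share of document pairs on which two runs agree
    about whether the pair belongs to the same topic."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same documents")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one document: nothing to disagree about
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical topic assignments for six documents from two runs.
run_1 = [0, 0, 1, 1, 2, 2]
# Same grouping with the topic ids shuffled, except the last document moved.
run_2 = [1, 1, 0, 0, 2, 0]

print(pairwise_agreement(run_1, run_1))
print(pairwise_agreement(run_1, run_2))
```

A score of 1.0 means the two runs grouped every pair of documents identically; anything lower means re-running the same analysis on the same data changed somebody's answer.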
This is what scares most people, or at least me, about ideas of using big data to predict criminals or otherwise mess up people's lives.
There's lies, damn lies, and statistics. Big data is just the third one repackaged: snake oil for people who (a) don't understand the business they're in (or they wouldn't need consultants telling them big data will tell them how to better run their business), (b) don't know which data is relevant, (c) don't know what questions are important, and (d) should be fired.
Big data wouldn't have prevented GM from going bankrupt. GM head idiot Wagoner didn't understand that the nature of the business had changed (point a). He also didn't understand that those big sales figures for Hummer were irrelevant, because Hummer was a product that would soon be answering the "wrong question" (point b). He failed to address the crunch others knew was coming, so he didn't ask "what happens when ..." (point c). As for point d, he was finally fired, but too late.
Big data is just a new twist of online dating. "Given enough people, we can match any two." Yeah, right.
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
The hype over big data comes from companies like Facebook or Amazon. It's a consequence of bad decisions made in the early days.
It's easy to see how this happens. Some dude says: to hell with data models, data governance or a formal approach to data warehousing; those are too "enterprisey", we are a nimble startup with the need to pivot and build MVPs quickly, let's just serialize our Java/Python/PHP objects for now. A billion dollars and 20 petabytes later the company has to rely on machine learning to sift through its digital garbage just to find out how many users it has. And if they need stuff that runs on thousands of commodity servers, like Hadoop or Cassandra, it's not because it's better, it's because IBM doesn't make a mainframe big enough to help them.
In most organizations these solutions should not even be considered. That's like considering bariatric surgery to lose 10 lbs because it helped the morbidly obese lady next door lose 250 lbs.
But it's cooler to say you work on a Spark project than on evolving an Inmon-inspired enterprise data warehouse using Netezza to crunch numbers. So let's all brush up on our graph theory and deliver unreliable answers to painstakingly formulated questions until the next fad kicks in.
lucm, indeed.