Test Shows Big Data Text Analysis Inconsistent, Inaccurate
DillyTonto writes: The "state of the art" in big-data (text) analysis turns out to use a method of categorizing words and documents that, when tested, offered different results for the same data 20% of the time and was flat wrong another 10%, according to researchers at Northwestern. The researchers offered a more accurate method, but only as an example of how to use community detection algorithms to improve on the leading method (LDA). Meanwhile, a certain percentage of answers from all those big data installations will continue to be flat wrong until they're re-run, which will make them wrong in a different way.
In other words, when it comes to big data, you're doing it wrong - and if you change how you're doing it, you're still going to be doing it wrong.
Big data fails to live up to hype - news at 11.
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
People thought you could bypass the work of actually understanding what is going on and still get useful results.
Turns out you can't.
Or put another way, if big data is so great, "Why didn't Watson see IBM's crash coming?"
The difference between "92% accurate" and "accurate enough for my task" is profound.
If you were using this kind of analytics to bill your customers, 92% would be hideously inaccurate. You'd face lawsuits on a daily basis, and you wouldn't survive a month in business. So the easy answer is, "this would be the wrong tool for billing."
But if you're advertising, you know the rates at which people bite on your message. Perhaps only 0.1% of random people are going to respond, but of people who are interested, 5.0% might bite. If you have the choice between sending the message to 10000 random people, or to 217 targeted people (only 92% of whom may be your target audience), both groups will deliver the same 10 hits. Let's say the cost per message is $10.00 per thousand views. The first wave of advertising cost you $100. The second costs you $2.17. Big Data, with all of its inaccuracies, still improves your results by a wide margin.
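The arithmetic in that comparison can be checked with a short sketch. The rates and prices are the ones quoted above; the function name `campaign` and the assumption that off-target recipients respond at the random-audience rate are illustrative additions, not part of the original comment.

```python
# Figures quoted in the comment above.
COST_PER_THOUSAND = 10.00   # dollars per 1000 messages
RANDOM_RATE = 0.001         # 0.1% of random people respond
INTERESTED_RATE = 0.05      # 5.0% of interested people respond
TARGETING_ACCURACY = 0.92   # share of the targeted list that is really on target


def campaign(recipients, share_interested):
    """Return (expected_hits, cost_in_dollars) for one mailing.

    Assumption (not in the comment): recipients who are not actually
    interested still respond at the random-audience rate.
    """
    interested = recipients * share_interested
    others = recipients - interested
    hits = interested * INTERESTED_RATE + others * RANDOM_RATE
    cost = recipients / 1000 * COST_PER_THOUSAND
    return hits, cost


# Random blast: the 0.1% rate already describes the whole audience.
random_hits, random_cost = campaign(10_000, share_interested=0.0)
# Targeted list: only 92% of it is the intended audience.
target_hits, target_cost = campaign(217, share_interested=TARGETING_ACCURACY)

print(f"random:   {random_hits:.1f} hits for ${random_cost:.2f}")
print(f"targeted: {target_hits:.1f} hits for ${target_cost:.2f}")
```

Under these assumptions both mailings come out at roughly 10 expected hits, so the imperfect targeting changes the cost per hit far more than it changes the result.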
Way too often people like this point out that perfection is impossible. They presume that "because it's not perfect, it's useless." The answer is not always to focus on becoming more accurate, but to choose the right tool for the job, and to learn how to recognize when it's good enough to be usable. At that point you learn how to cope with the inaccuracy and derive the maximum benefits possible given what you have.
John
The hype over big data comes from companies like Facebook or Amazon. It's a consequence of bad decisions made in the early days.
It's easy to see how this happens. Some dude says: to hell with data models, data governance or a formal approach to data warehousing; those are too "enterprisey", we are a nimble startup with the need to pivot and build MVPs quickly, let's just serialize our Java/Python/PHP objects for now. A billion dollars and 20 petabytes later the company has to rely on machine learning to sift through its digital garbage just to find out how many users it has. And if they need stuff that runs on thousands of commodity servers, like Hadoop or Cassandra, it's not because it's better, it's because IBM doesn't make a mainframe big enough to help them.
In most organizations these solutions should not even be considered. That's like considering bariatric surgery to lose 10 lbs because it helped the morbidly obese lady next door lose 250 lbs.
But it's cooler to say you work on a Spark project than on evolving an Inmon-inspired enterprise data warehouse using Netezza to crunch numbers. So let's all brush up on our graph theory and deliver unreliable answers to painstakingly formulated questions until the next fad kicks in.
lucm, indeed.
Just as we expect expert practitioners in medicine or civil engineering to bear liability for mistakes in their respective professions, can the notion of modeling malpractice be far behind? When will the first class-action suit be filed against a statistical model that incorrectly denies service or besmirches the credit ratings of thousands?
All models are wrong, some are useful.
If you can't be good, be good at it!
I analyzed the free-text field on hospital surveys. A simple keyword search gave me very reliable results on what the patients were complaining about -- they fell into the categories of bad food (food, cafeteria, diet, tasted, stale), dirty rooms (dirty, rat, blood, bathroom), rude staff (rude, ignore, curt), noise (noise, loud, echo, hallway), TV broken (TV, Television, "can't see"). So if the context is narrow enough, even simple searches work.
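A keyword search of the kind described can be sketched in a few lines. The categories and keywords are the ones listed above; the function name `categorize`, the lowercasing, and the word-start matching rule are assumptions made for illustration, not the commenter's actual code.

```python
import re

# Categories and keywords as listed in the comment above (lowercased).
CATEGORIES = {
    "bad food": ["food", "cafeteria", "diet", "tasted", "stale"],
    "dirty rooms": ["dirty", "rat", "blood", "bathroom"],
    "rude staff": ["rude", "ignore", "curt"],
    "noise": ["noise", "loud", "echo", "hallway"],
    "TV broken": ["tv", "television", "can't see"],
}


def categorize(comment):
    """Return every category whose keywords appear in the comment."""
    text = comment.lower()
    found = []
    for category, keywords in CATEGORIES.items():
        for keyword in keywords:
            # Assumption: match at the start of a word, so "ignore" catches
            # "ignored" but "rat" does not fire on "operating". It would
            # still fire on "rather", which a real keyword list must handle.
            if re.search(r"\b" + re.escape(keyword), text):
                found.append(category)
                break
    return found


print(categorize("The food was stale and the nurse ignored me."))
print(categorize("Can't see the TV from the bed, and the hallway is loud."))
```

The sketch only works because the survey context is narrow: each keyword has essentially one meaning in a hospital complaint, which is the point being made above.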
I agree that more broadly worded questions require more sophistication. I've looked at word combinations and so forth, though I haven't really needed to use them yet in analyzing health care data. We would not trust a computer to parse a full doctor's report, no matter how sophisticated the software; that will require manual inspection, often by multiple people to agree on a consensus interpretation.
The first hint you get is when you notice this paper was published in a physics journal, which is not a great sign. Then you start reading, and you see they declare LDA to be "state of the art". And when you read what they propose, it is a bunch of standard text techniques which actually work quite well with LDA.
So what they actually showed is that taking vanilla algorithms out of the box without even the most basic data processing under-performs compared to superior data processing attached to a simpler algorithm. Which anyone who has done any sort of text processing or any other kind of data mangling already knew.
That's a great question. Do you think 80% accuracy is good enough for medical use? If you're a doctor facing an unfamiliar situation, and your data says treatment X helped 40% of patients it was tried on, treatment Y helped 35% of them, and all other treatments (Z, W, etc.) helped no more than 30%, but you know the data might only be 80% accurate, what treatment do you choose? Are those ratios even meaningful in the presence of so many errors?
Consider the case where the patient's condition is critical, and you don't have time for additional evaluation. Is X always the best choice? What if your specialty makes you better than average at treatment Y? Maybe that 20% inaccuracy works in favor of the doctor who has the right experience.
It could be used for ill, too. What if you know you'll get paid more by the insurance company for all the extra tests required to do treatment Y? You could justify part of your decision based on the uncertainty of the data.
In the end, historical data is just one factor out of many that go into each of these decisions. Inaccurate data may lead to suboptimal decisions, so it can't be the only factor.
Great strawman, but your strawman happens to actually be a nuclear-powered, armor-plated tank... with sharks and laser beams!!! Turns out way back in the '60s, when they started to think about what problems computers could one day solve, they listed many: beat the world champion at chess, drive cars, etc. One of them was medical diagnosis. It took decades longer than thought to solve the ones they have been able to solve, with one exception: medical diagnosis. By the early '80s we had "expert systems" that were more accurate than human doctors at medical diagnosis (especially 24 hrs into a 36 hr shift). The AMA and insurance companies have basically blocked this tech for decades despite overwhelming evidence that they were killing people by doing so. Today we have started to slowly roll out this type of tech for things like drug interaction, but not yet for medical diagnosis. Ironic, huh?
"Those that start by burning books, will end by burning men."
In the latter it's PCA/SVD, and it's used to reduce the dimensionality of (i.e. compact) large numbers of variables, e.g. a linear approximation is almost as good as accounting for all the variables individually.
The problem in both text analysis and climate (or any other) models is that PCA/LDA/etc. are linear, and the data they are applied to are generally nonlinear.
The latter means that the solution space has a large (infinite?) number of suboptimal solutions.
That in turn means PCA/LDA/etc. return a linear approximation to one of those solutions, and those solutions can be very different.
So, yeah, there is a margin of error. And yeah, the reasons for that error vary. No surprise, because text understanding and the climate are hugely complex and nonlinear problems.
BUT at least maybe more people will become aware that models are pretty much flawed ... so don't base legal or public policy on them.
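The point above about many suboptimal solutions can be seen in a toy setting. This is not PCA or LDA; it is plain gradient descent on a one-dimensional function picked purely for illustration, and every name in it (`f`, `grad`, `descend`) is hypothetical. The same procedure, started from two different points, settles on two different answers, and only one of them is the best available.

```python
def f(x):
    """A toy nonconvex objective with two separate minima (illustrative only)."""
    return x**4 - 3 * x**2 + x


def grad(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def descend(x, rate=0.01, steps=5000):
    """Plain gradient descent from starting point x."""
    for _ in range(steps):
        x -= rate * grad(x)
    return x


left = descend(-2.0)   # ends in the deeper minimum, left of zero
right = descend(2.0)   # ends in the shallower minimum, right of zero

print(f"start -2.0 -> x = {left:.3f}, f(x) = {f(left):.3f}")
print(f"start +2.0 -> x = {right:.3f}, f(x) = {f(right):.3f}")
```

Neither run reports that another solution exists, which is roughly the situation described above when an analysis is re-run on the same data and returns a different result.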
"Consensus" in science is _always_ a political construct.