Test Shows Big Data Text Analysis Inconsistent, Inaccurate

Posted by samzenpus on Sunday February 1, 2015 @05:40AM from the you'll-love-these-links dept.

DillyTonto writes The "state of the art" in big-data (text) analysis turns out to use a method of categorizing words and documents that, when tested, offered different results for the same data 20% of the time and was flat wrong another 10% of the time, according to researchers at Northwestern. The researchers offered a more accurate method, but only as an example of how to use community detection algorithms to improve on the leading method (LDA). Meanwhile, a certain percentage of answers from all those big data installations will continue to be flat wrong until they're re-run, which will make them wrong in a different way.

In other words, you're doing it wrong. by BarbaraHudson · 2015-02-01 05:46 · Score: 5, Insightful

In other words, when it comes to big data, you're doing it wrong - and if you change how you're doing it, you're still going to be doing it wrong.
Big data fails to live up to hype - news at 11.

--
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
1. Re:In other words, you're doing it wrong. by drinkypoo · 2015-02-01 06:10 · Score: 4, Insightful
  
  This is what scares most people, or at least me, about ideas of using big data to predict criminals or otherwise mess up people's lives.
  It's not a problem to use big data to try to figure out where to focus. But you have to subject the results to some sanity checking, and before you actually impact someone's life, perhaps even some common sense. Shocking idea, I know, and the reason why it's still a problem.
  
  --
  "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Color me surprised by Crashmarik · 2015-02-01 05:53 · Score: 4, Insightful

People thought you could bypass doing the work to actually understand what is going on but still get useful results.
Turns out you can't.
Or put another way: if big data is so great, "Why didn't Watson see IBM's crash coming?"
Don't let perfection be the enemy of good enough by plover · 2015-02-01 06:02 · Score: 5, Insightful

The difference between "92% accurate" and "accurate enough for my task" is profound.
If you were using this kind of analytics to bill your customers, 92% would be hideously inaccurate. You'd face lawsuits on a daily basis, and you wouldn't survive a month in business. So the easy answer is, "this would be the wrong tool for billing."
But if you're advertising, you know the rates at which people bite on your message. Perhaps only 0.1% of random people are going to respond, but of people who are interested, 5.0% might bite. If you have the choice between sending the message to 10000 random people, or to 217 targeted people (only 92% of whom may be your target audience), both groups will deliver the same 10 hits. Let's say the cost per message is $10.00 per thousand views. The first wave of advertising cost you $100. The second costs you $2.17. Big Data, with all of its inaccuracies, still improves your results by a wide margin.
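The arithmetic in that paragraph can be checked with a few lines of Python. This is only a rough sketch using the numbers quoted above; the function names are made up for illustration, and it assumes the 8% of mis-targeted recipients produce no hits at all (in practice they would presumably still respond at something like the random rate, which barely changes the result).

```python
# Rough check of the advertising example, using only the figures quoted above.
# Function names are hypothetical; the mis-targeted 8% are assumed to yield no hits.

def expected_hits(recipients, response_rate, precision=1.0):
    """Expected responses when only `precision` of the recipients are real targets."""
    return recipients * precision * response_rate

def campaign_cost(recipients, cost_per_thousand):
    """Cost in dollars at a flat rate per thousand messages."""
    return recipients * cost_per_thousand / 1000

COST_PER_THOUSAND = 10.00

# Untargeted: 10000 random people responding at 0.1%.
random_hits = expected_hits(10_000, 0.001)
random_cost = campaign_cost(10_000, COST_PER_THOUSAND)

# Targeted: 217 people, 92% of whom are real targets, responding at 5.0%.
targeted_hits = expected_hits(217, 0.05, precision=0.92)
targeted_cost = campaign_cost(217, COST_PER_THOUSAND)

print(f"random:   {random_hits:.2f} hits for ${random_cost:.2f}")
print(f"targeted: {targeted_hits:.2f} hits for ${targeted_cost:.2f}")
```

Both campaigns land at roughly ten hits, and the targeted one does it at a small fraction of the cost, which is the whole point of the comparison.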
Way too often people like this point out that perfection is impossible. They presume that "because it's not perfect, it's useless." The answer is not always to focus on becoming more accurate, but to choose the right tool for the job, and to learn how to recognize when it's good enough to be usable. At that point you learn how to cope with the inaccuracy and derive the maximum benefits possible given what you have.

--
John
Re:Don't let perfection be the enemy of good enoug by Jumperalex · 2015-02-01 06:47 · Score: 3, Insightful

All models are wrong, some are useful.

--
If you can't be good, be good at it!
Bad science strikes again by iceco2 · 2015-02-01 08:11 · Score: 4, Insightful

The first hint you get is when you notice this paper was published in a physics journal, not a great sign. Then you actually start reading, and you see they declare LDA as "state of the art". And when you actually read what they propose it is a bunch of standard text techniques which actually work quite well with LDA.
So what they actually showed is that taking vanilla algorithms out of the box without even the most basic data processing under-performs compared to superior data processing attached to a simpler algorithm. Which anyone who has done any sort of text processing or any other kind of data mangling already knew.