Slashdot Mirror


Test Shows Big Data Text Analysis Inconsistent, Inaccurate

DillyTonto writes The "state of the art" in big-data (text) analysis turns out to use a method of categorizing words and documents that, when tested, offered different results for the same data 20% of the time and was flat wrong another 10%, according to researchers at Northwestern. The researchers offered a more accurate method, but only as an example of how to use community detection algorithms to improve on the leading method (LDA). Meanwhile, a certain percentage of answers from all those big data installations will continue to be flat wrong until they're re-run, which will make them wrong in a different way.

60 comments

  1. In other words, you're doing it wrong. by BarbaraHudson · · Score: 5, Insightful

    In other words, when it comes to big data, you're doing it wrong - and if you change how you're doing it, you're still going to be doing it wrong.

    Big data fails to live up to hype - news at 11.

    --
    "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
    1. Re:In other words, you're doing it wrong. by Drethon · · Score: 4, Interesting

      This is what scares most people, or at least me, about ideas of using big data to predict criminals or otherwise mess up people's lives.

    2. Re:In other words, you're doing it wrong. by drinkypoo · · Score: 4, Insightful

      This is what scares most people, or at least me, about ideas of using big data to predict criminals or otherwise mess up people's lives.

      It's not a problem to use big data to try to figure out where to focus. But you have to subject the results to some sanity checking, and before you actually impact someone's life, perhaps even some common sense. Shocking idea, I know, and the reason why it's still a problem.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    3. Re:In other words, you're doing it wrong. by jythie · · Score: 2

      Yeah, but this makes sense. The success criteria for analysis is not accuracy, but faith. As long as they can sell it to marketers, correctness is just something that needs a bit of spin.

    4. Re:In other words, you're doing it wrong. by Anonymous Coward · · Score: 0

      perhaps even some common sense

      Common sense and sanity checks from the government: that is a shocking idea. You are right, these two simple tasks that don't cost much would save billions. But if you have any level of common sense you wouldn't be working for the government in the first place. Maybe that is the impasse?

      Yep, some people just don't realize that the size of the haystack can and will affect you finding the needle. More hay actually makes it harder to find the needle.

  2. Color me surprised by Crashmarik · · Score: 4, Insightful

    People thought you could bypass the work of actually understanding what is going on and still get useful results.
    Turns out you can't.

    Or put another way, If big data is so great "Why didn't Watson see IBM's crash coming ?"

    1. Re:Color me surprised by drinkypoo · · Score: 2

      Or put another way, If big data is so great "Why didn't Watson see IBM's crash coming ?"

      You're assuming it didn't.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    2. Re:Color me surprised by BarbaraHudson · · Score: 3, Interesting

      There's lies, damn lies, and statistics. Big data is just the 3rd repackaged, snake oil for people who (a) don't understand the business they're in (or they wouldn't need consultants telling them big data will tell them how to better run their business), (b) don't know which data is relevant, (c) don't know what questions are important, and (d) should be fired.

      Big data wouldn't have prevented GM from going bankrupt. GM head idiot Wagoner didn't understand that the nature of the business had changed (point a). He also didn't understand that those big sales figures for Hummer were irrelevant, because they were a product that was soon answering the "wrong question" (point b). He failed to address the crunch others knew was coming, so he didn't ask "what happens when ..." (point c). As for point d, he was finally fired, but too late.

      Big data is just a new twist of online dating. "Given enough people, we can match any two." Yeah, right.

      --
      "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
    3. Re:Color me surprised by plover · · Score: 2

      When you're dealing with statistics, you ought to recognize that 92% accuracy is a huge improvement over a random distribution. You do not use big data to select a target for a sniper rifle, you use it to point a shotgun.

      And just like your faulty GM CEO analogy (I assume you felt the need to apply a car analogy for the benefit of the slashdot crowd) only an idiot would send someone off in the woods blindfolded and have him fire his shotgun in a random direction hoping to bring home some kind of food animal. You still have to know what you're hunting for, you still have to know how to hunt, you still have to make wise decisions. It's just a tool, not a sage.

      --
      John
    4. Re:Color me surprised by BarbaraHudson · · Score: 2

      That's a nice strawman argument you got going there.

      People who have an understanding of their business will achieve far better results than a random distribution. The CEO of Ford saw it coming several years in advance, and prepared (via tens of billions of borrowing against every company asset, including their logo) to have enough funds to weather it out, while changing their product line-up to match the new reality.

      There is no "silver bullet" to replace competence.

      --
      "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
    5. Re: Color me surprised by jo7hs2 · · Score: 1

      Exactly. There is actually an excellent book on the topic: Once Upon a Car by Bill Vlasic. In fact, GM even approached Ford for a potential merger. Ford, realizing that GM was screwed and they were, in fact, not...thought it over for the night...and then flatly rejected GM.

    6. Re:Color me surprised by plover · · Score: 1

      You seem to be belaboring this mistaken impression that analyzing Big Data somehow replaces thinking in the board room. It does not. Big Data is a tool that can help provide evidence of what people have done in the past, statistically correlated to potential causes. Big Data doesn't decide "hey, let's buy GM." People make those decisions, and they try to make them based on the information they have -- and Big Data can be a good source of that info. But people can be idiots, they can be talented, they can be anywhere on the spectrum. Do not blame the tool, or the accuracy of the tool, just because it's capable of being swung by an unqualified, incompetent idiot.

      As a friend of mine is wont to say, "A fool with a tool is still a fool."

      --
      John
    7. Re:Color me surprised by BarbaraHudson · · Score: 2

      As I pointed out, big data is being used by people who shouldn't be in the position to make decisions. You can't make right decisions if you ask the wrong questions.

      --
      "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
    8. Re:Color me surprised by TapeCutter · · Score: 2

      Coincidentally, the Slashdot 'thought of the day' below reads: "There's no sense in being precise when you don't even know what you're talking about. -- John von Neumann"

      --
      And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
    9. Re:Color me surprised by Anonymous Coward · · Score: 0

      You do not use big data to select a target for a sniper rifle, you use it to point a shotgun.

      Considering the NSA's admission that it kills based on metadata, we can't rule out that big data is literally used to select a target for a sniper rifle.

  3. Don't let perfection be the enemy of good enough by plover · · Score: 5, Insightful

    The difference between "92% accurate" and "accurate enough for my task" is profound.

    If you were using this kind of analytics to bill your customers, 92% would be hideously inaccurate. You'd face lawsuits on a daily basis, and you wouldn't survive a month in business. So the easy answer is, "this would be the wrong tool for billing."

    But if you're advertising, you know the rates at which people bite on your message. Perhaps only 0.1% of random people are going to respond, but of people who are interested, 5.0% might bite. If you have the choice between sending the message to 10000 random people, or to 217 targeted people (only 92% of whom may be your target audience), both groups will deliver the same 10 hits. Let's say the cost per message is $10.00 per thousand views. The first wave of advertising cost you $100. The second costs you $2.17. Big Data, with all of its inaccuracies, still improves your results by a wide margin.
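    The arithmetic above can be checked with a short sketch. The response rates, the 92% targeting accuracy and the $10-per-thousand cost are the figures from the comment; the function and variable names are made up for illustration, and the sketch assumes the mistargeted 8% respond at the same rate as random people.

```python
def expected_hits(recipients, target_share, target_rate, baseline_rate):
    """Expected responses when only target_share of recipients are truly interested."""
    on_target = recipients * target_share
    off_target = recipients - on_target
    # Interested people respond at target_rate, everyone else at baseline_rate.
    return on_target * target_rate + off_target * baseline_rate


def campaign_cost(recipients, cost_per_thousand=10.00):
    """Cost in dollars at a flat price per thousand messages."""
    return recipients / 1000 * cost_per_thousand


BASELINE_RATE = 0.001  # 0.1% of random people respond
TARGET_RATE = 0.05     # 5.0% of interested people respond

# 10000 random people: nobody is deliberately targeted.
random_hits = expected_hits(10_000, 0.0, TARGET_RATE, BASELINE_RATE)
# 217 targeted people, of whom 92% are really the target audience.
targeted_hits = expected_hits(217, 0.92, TARGET_RATE, BASELINE_RATE)

print(round(random_hits, 2), campaign_cost(10_000))
print(round(targeted_hits, 2), round(campaign_cost(217), 2))
```

    Both campaigns come out at roughly ten expected responses, so under these assumptions the comparison is purely about cost per response.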

    Way too often people like this point out that perfection is impossible. They presume that "because it's not perfect, it's useless." The answer is not always to focus on becoming more accurate, but to choose the right tool for the job, and to learn how to recognize when it's good enough to be usable. At that point you learn how to cope with the inaccuracy and derive the maximum benefits possible given what you have.

    --
    John
  4. Technological breakthrough by techdolphin · · Score: 1

    This is actually good news. We always wanted computers to behave more like people, and in this case they are. The same question and data often get different results, just like with people. What a great technological breakthrough.

    1. Re:Technological breakthrough by Anne+Thwacks · · Score: 1
      But...

      With Neural net, we can get more stuff wrong, faster

      WTF???

      Profit

      --
      Sent from my ASR33 using ASCII
  5. the Facebook syndrome by lucm · · Score: 3, Interesting

    The hype over big data comes from companies like Facebook or Amazon. It's a consequence of bad decisions made in the early days.

    It's easy to see how this happens. Some dude says: to hell with data models, data governance or a formal approach to data warehousing; those are too "enterprisey", we are a nimble startup with the need to pivot and build MVPs quickly, let's just serialize our java/python/php objects for now. A billion dollars and 20 petabytes later the company has to rely on machine learning to sift through their digital garbage so they could find out how many users they have. And if they need stuff that runs on thousands of commodity servers, like hadoop or cassandra, it's not because it's better, it's because IBM doesn't make a mainframe big enough to help them.

    In most organizations these solutions should not even be considered. That's like considering bariatric surgery to lose 10 lbs because it helped the morbidly obese lady next door lose 250 lbs.

    But it's cooler to say you work on a Spark project than on evolving an Inman-inspired enterprise data warehouse using Netezza to crunch numbers. So let's all brush up on our graph theory and deliver unreliable answers to painstakingly formulated questions until the next fad kicks in.

    --
    lucm, indeed.
    1. Re:the Facebook syndrome by Anonymous Coward · · Score: 0

      The problem is that Netezza is expensive, so you have companies that don't have "big data" (say they have 100T of data) hopping onto the "big data" bandwagon for tools to deal with their kind of data sizes.

    2. Re:the Facebook syndrome by Anonymous Coward · · Score: 0

      The hype over big data comes from companies like Facebook or Amazon. It's a consequence of bad decisions made in the early days.

      And yet, the reason we talk about Amazon and Facebook is because their competitors who made different decisions are doing worse than they are. Sometimes there are "good" decisions with high costs.

    3. Re:the Facebook syndrome by lucm · · Score: 1

      The hype over big data comes from companies like Facebook or Amazon. It's a consequence of bad decisions made in the early days.

      And yet, the reason we talk about Amazon and Facebook is because their competitors who made different decisions are doing worse than they are. Sometimes there are "good" decisions with high costs.

      I'm not sure what competitors you mean, but look at the numbers: Azure is profitable while AWS is a money pit.

      --
      lucm, indeed.
  6. Re:Don't let perfection be the enemy of good enoug by hax4bux · · Score: 1

    +1, great response

  7. Ivory tower academic by Alan+Shutko · · Score: 1

    "Companies that make products must show that their products work," Amaral said in the Northwestern release. "They must be certified."

    This researcher is completely out of touch with what's sold in the marketplace. No wonder he doesn't understand that flawed solutions can still be useful.

  8. Malpractice? by gregor-e · · Score: 2

    Just as we expect expert practitioners in medicine or civil engineering to bear liability for mistakes in their respective professions, can the notion of modeling malpractice be far behind? When will the first class-action suit be filed against a statistical model that incorrectly denies service or besmirches the credit ratings of thousands?

    1. Re:Malpractice? by Anonymous Coward · · Score: 0

      Right after they start paying enough for it. You can't have cheap complex software that works perfectly well.
      You want good, it can be done, but you'll have to pay for it.

  9. Re:Don't let perfection be the enemy of good enoug by Jumperalex · · Score: 3, Insightful

    All models are wrong, some are useful.

    --
    If you can't be good, be good at it!
  10. Just ask the NSA for suggestions by mrflash818 · · Score: 1

    ...They seem to be quite proud of their massive warrantless text analysis ; )

    --
    Uh, Linux geek since 1999.
  11. Narrowing the context by dorpus · · Score: 2

    I analyzed the free-text field on hospital surveys. A simple keyword search gave me very reliable results on what the patients were complaining about -- they fell into the categories of bad food (food, cafeteria, diet, tasted, stale), dirty rooms (dirty, rat, blood, bathroom), rude staff (rude, ignore, curt), noise (noise, loud, echo, hallway), TV broken (TV, Television, "can't see"). So if the context is narrow enough, even simple searches work.
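    A minimal sketch of this kind of keyword bucketing, for illustration only. The buckets and keywords are the ones listed above; the matching rule (whole words, with an optional s/d/ed/ing suffix) and the function names are assumptions, not a description of what was actually run on the survey data.

```python
import re
from collections import Counter

# Buckets and keywords as listed in the comment above (lowercased).
BUCKETS = {
    "bad food": ["food", "cafeteria", "diet", "tasted", "stale"],
    "dirty rooms": ["dirty", "rat", "blood", "bathroom"],
    "rude staff": ["rude", "ignore", "curt"],
    "noise": ["noise", "loud", "echo", "hallway"],
    "TV broken": ["tv", "television", "can't see"],
}


def categorize(response):
    """Return every bucket with at least one keyword in the response."""
    text = response.lower()
    matched = []
    for bucket, keywords in BUCKETS.items():
        for keyword in keywords:
            # Whole-word match with a few simple suffixes, so "ignore" hits
            # "ignored" but "rat" does not hit "rather".
            pattern = r"\b" + re.escape(keyword) + r"(?:s|d|ed|ing)?\b"
            if re.search(pattern, text):
                matched.append(bucket)
                break  # one hit per bucket is enough
    return matched


def tally(responses):
    """Count how many responses fall into each bucket."""
    counts = Counter()
    for response in responses:
        counts.update(categorize(response))
    return counts


print(categorize("The food was stale and the nurse was rude."))
```

    A response can land in several buckets, and anything that matches none returns an empty list, which is where manual inspection would have to start.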

    I agree that more broadly worded questions require more sophistication. I've looked at word combinations and so forth, though I haven't really needed to use them yet in analyzing health care data. We would not trust a computer to parse a full doctor's report, no matter how sophisticated the software; that will require manual inspection, often by multiple people to agree on a consensus interpretation.

    1. Re:Narrowing the context by dorpus · · Score: 1

      Yes I did. There were a few thousand responses that fit on a single spreadsheet, and after an hour spent coming up with buckets and keywords for them, I couldn't find any exceptions from the above. I'll keep an eye out for future changes, though I doubt they will change much. I know the hospitals and their problems.

  12. Re:Don't let perfection be the enemy of good enoug by Anonymous Coward · · Score: 0

    text analytics is considered accurate at around 80%

  13. Re:Don't let perfection be the enemy of good enoug by Anonymous Coward · · Score: 1

    Advertising is a scourge. How about an example using medical procedures instead?

  14. Narrowing the context by Anonymous Coward · · Score: 0

    How did you come up with your topics ("bad food", "dirty rooms", etc)? From your comment, it honestly sounds like you curated them manually. The article is describing methods that automatically discover topics, without a human-in-the-loop. Of course you'll do better if you manually inspect the data...

  15. LDA != All text analysis by SilenceBE · · Score: 1

    I was a bit surprised reading the headline, as this would also mean that text analysis using a Bayesian classifier or a linear SVM for classifying text would also be inconsistent or inaccurate. I personally have had good experiences with both.

    But after reading the article - yes, I know - it seems to focus on the LDA model, and that is only one of the techniques available for doing textual analysis or categorising documents.

  16. Re:Don't let perfection be the enemy of good enoug by houghi · · Score: 1

    If you have the choice between sending the message to 10000 random people, or to 217 targeted people

    Do not think that they will send fewer messages. They will just send more, because they see it is working and the budget stays the same.

    They can send 50 times as many spam messages. Sure, that will mean that some people who would have bought something are now fed up, which will mean fewer sales, but as long as they get at least the same result in money, all is good. They now have a margin of 50 times. And later they will find something else and send even more.

    --
    Don't fight for your country, if your country does not fight for you.
  17. Bad science strikes again by iceco2 · · Score: 4, Insightful

    The first hint you get is when you notice this paper was published in a physics journal, not a great sign. Then you actually start reading, and you see they declare LDA as "state of the art". And when you actually read what they propose it is a bunch of standard text techniques which actually work quite well with LDA.
    So what they actually showed is that taking vanilla algorithms out of the box without even the most basic data processing under-performs compared to superior data processing attached to a simpler algorithm. Which anyone which did any sort of text processing or any other kind of data managling already new.

    1. Re:Bad science strikes again by BlackPignouf · · Score: 1

      Which anyone which did any sort of text processing or any other kind of data managling already new.

      By text processing, did you mean "knew->new" , "which->who" and "managing->managling"?

  18. A-priori assumptions of structure... by Anonymous Coward · · Score: 0

    From the paper:
    "As a second test, we use a real-world corpus for which one has a good a priori understanding of the topics."

    Eh? The a priori "human" classification of topics is NOT necessarily the most cohesive. For example, I can point to at least one paper in computer science (1) (but there are others) which more readily belongs in physics, papers in chemistry which more appropriately belong in biology or physics, papers in mathematics which deal with topics in biology, etc.

    Calling for a match against the pre-defined topics seems like a wild wish for serendipity in my opinion.

    (1) Robert Atkey. From Parametricity to Conservation Laws, via Noether's Theorem. In Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 2014). 2014.

  19. But you are missing the point... by Jawnn · · Score: 1

    Let me get this straight. You're saying that big data, and the tools used to analyze it are frequently inaccurate, or just plain wrong? To which I say,
    "Yes, but big data is 'web scale', so it has to be better."
    /sarcasm

  20. Re:Don't let perfection be the enemy of good enoug by tommeke100 · · Score: 1

    I always liked the HIV test example.
    If you have a test that is 99% accurate, you would think that's a pretty good test.
    However, if only 1 in 1000 people actually are HIV positive, this means you get 10 false positives per correct positive.
    So that's not a very good number: falsely claiming people have HIV 9 times out of 10!

    Actually, even 99.9% would be bad, since that means you're wrong 1 time out of 2.
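    These numbers follow from Bayes' rule. A rough sketch, assuming "99% accurate" means the test has both 99% sensitivity and 99% specificity (the comment does not say which), with a hypothetical function name:

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a positive result is a true positive (Bayes' rule)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


PREVALENCE = 1 / 1000  # 1 in 1000 people are actually positive

for accuracy in (0.99, 0.999):
    ppv = positive_predictive_value(PREVALENCE, accuracy, accuracy)
    print(f"accuracy {accuracy:.1%}: {1 - ppv:.1%} of positives are false")
```

    At 99% roughly nine positives in ten are false, and at 99.9% about half are, which matches the figures above.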

  21. Re:Don't let perfection be the enemy of good enoug by plover · · Score: 1

    That's a great question. Do you think 80% accuracy is good enough for medical use? If you're a doctor facing an unfamiliar situation, and your data says treatment X helped 40% of patients it was tried on, treatment Y helped 35% of them, and all other treatments (Z, W, etc.) helped no more than 30%, but you know the data might only be 80% accurate, what treatment do you choose? Are those ratios even meaningful in the presence of so many errors?

    Consider the case where the patient's condition is critical, and you don't have time for additional evaluation. Is X always the best choice? What if your specialty makes you better than average at treatment Y? Maybe that 20% inaccuracy works in favor of the doctor who has the right experience.

    It could be used for ill, too. What if you know you'll get paid more by the insurance company for all the extra tests required to do treatment Y? You could justify part of your decision based on the uncertainty of the data.

    In the end, historical data is just one factor out of many that goes into each of these decisions. Inaccurate data may lead to suboptimal decisions, so it can't be the only factor.

    --
    John
  22. Re:Don't let perfection be the enemy of good enoug by plover · · Score: 1

    They could certainly send 50 times as many messages, but they'll improve their return on investment if they target all of them at people who are more susceptible to their message in the first place. Given the cost of the Big Data systems they may only be able to afford to send 10 times as many instead of 50 times, but as long as their message is 5% effective instead of 0.1%, it's still a vast improvement on ROI.

    --
    John
  23. Re:Don't let perfection be the enemy of good enoug by Anonymous Coward · · Score: 0

    I do research in this area, and have mixed feelings about the research paper in Physical Review X that's getting attention.

    First, the authors of that paper are right to scrutinize the reproducibility of LDA. I've been suspicious of it for some time, wondering if I was doing something wrong in my own work, given that other approaches involving other types of data are so different.

    It's not just LDA, though--it's a lot of Big Data and data science approaches. They tend to involve idiosyncratic, large data sets where there's hardly any possibility of replication because of the nature of the data (who could replicate Facebook, for example?). Also, compared to statistics, where you have well-established theories of inference that get scrutinized for decades, and where proofs are commonplace, in the ML literature, a lot of it is ad hoc and argued for exclusively through simulations or examples, which are full of idiosyncrasies. I'm overgeneralizing here, but it's a trend. The real problem is an underlying mindset of practicality over fundamentals, buoyed by excessive hype that plagues all of the tech sector.

    Going back to the particular case of LDA: it's pulling for something that's bothered me for some time, and that I've written about in my own research, which is that LDA is focused on lower-level features (words, ngrams) that are stochastically highly variable, and less so on higher-level features (e.g., topics or subtopics) that are more stable. This is the nature of LDA, to identify topics, but as such, it's at a relatively low level--what's higher-level and more replicable are the subtopics or topics themselves, which should be built on in a text classifier. What they're doing in the paper is building on this idea.

    Having said that, the paper has its own problems. First, contrary to what the authors claim, LDA is *not* state of the art anymore. It's a quintessential technique that's used all the time, but it's sort of dated and somewhat of a strawman. The approach they take is less comparable to LDA than some of the LDA successors, which they don't acknowledge. It's sort of unfair in this regard. Also, this approach is just as ad hoc as all the other approaches that are taken regularly, if not more so, and definitely more so than LDA. Hell, I've done research on models that conceptually are the same as what the PRX authors propose, but are mathematically integrated with LDA in a single model (e.g., it's easy to imagine the topics in LDA being modeled by higher-order topics in a second-order model).

    So, I have mixed feelings about it. And it's not even getting into the issues that you mention, which is what's an acceptable error rate? For that matter, with text classification, what's the true state? And how the hell do you simulate *that*? How do we know their test case is any less susceptible to problems than the original examples used with LDA? Where the hell are their proofs of consistency or efficiency?

  24. Re:Don't let perfection be the enemy of good enoug by Sarten-X · · Score: 1

    As someone with a bit of experience in Big Data and medical technology...

    A test that falsely indicated HIV 9 times out of 10 is absolutely wonderful, if it actually catches that one correct positive reliably. A false negative is far more dangerous, and it's the job of the doctor to try multiple tests to confirm a diagnosis. If the initial screening comes back positive, the patient can be warned off intercourse for a while or start some initial therapy while another test is tried, without significant risk to the patient's health.

    Similarly, Big Data is just a term for a particular approach to model design. Having huge amounts of data doesn't magically improve your analysis algorithm, and you still need to have a properly-skilled expert testing your algorithm to make sure that it's correct before it's used for business decisions.

    --
    You do not have a moral or legal right to do absolutely anything you want.
  25. Re:Don't let perfection be the enemy of good enoug by Anonymous Coward · · Score: 0

    Circumcision. It only goes wrong 1 out of 500 times, but it makes $$$.

    Besides, those for whom it goes wrong are just bad people anyway.

  26. Re:Don't let perfection be the enemy of good enoug by Anonymous Coward · · Score: 0

    With HIV, it's easy. Just cut off people's body parts at birth without consent from the individual and pretend that consent from the mother is good enough. Circumcision is why there are no cases of AIDS in the USA, right?

  27. Re:Don't let perfection be the enemy of good enoug by Anonymous Coward · · Score: 0

    More to the point, a model is, by definition, NOT the real thing. There will be differences between the model and reality. Does not mean that the model cannot be helpful.

  28. Re:Don't let perfection be the enemy of good enoug by ArsonSmith · · Score: 1

    What big data brings to the table is you can find that "A" is strongly correlated with "B" and have not even an inkling as to why.

    Suppose you scan all medical and personal records throughout history and find that everyone who owns a yellow Camaro (no matter what year) at some point in their life comes down with liver cancer at 45. Sure, you can make up reasons all you want, but if there is no other correlation in the data, do you just ignore it? What if it's just 90% of the Camaro owners? Even 50% or 20% would still be a significant correlation to start asking about car ownership history.

    --
    Paying taxes to buy civilization is like paying a hooker to buy love.
  29. Re:Don't let perfection be the enemy of good enoug by Sarten-X · · Score: 1

    That's almost exactly what one aspect of my project was.

    My project was, in brief, allowing medical researchers to search through patient records for patients matching particular criteria. The system could recommend related criteria, as well, based on the correlation to the already-shown results.

    Early tests were particularly useless, as the system noticed a strong correlation between being pregnant and being female. It also suggested that if you included people who had smoked within the last six months, there was a strong correlation with those who had smoked within the last year.

    Once the obvious correlations were flagged as being obvious, though, the system started making some valid observations, noting that particular variations in drug treatments had particular variations in outcome. That was enough to get a few researchers' attention, but I left the project before seeing any published papers.

    --
    You do not have a moral or legal right to do absolutely anything you want.
  30. Should be turned into a movie by anchovy_chekov · · Score: 1

    We need a light-hearted romp, something that touches on our fear of Big Data while extolling the virtues. Something with a dash of romance. If only Spencer Tracy and Katharine Hepburn were still alive, they'd be perfect.

  31. Re:Don't let perfection be the enemy of good enoug by Anonymous Coward · · Score: 0

    This is what every mathematician needs to know before they start dabbling in physics.
    It's also helpful to realize that most models come with more fine print than a telecom merger contract.

  32. Re:Don't let perfection be the enemy of good enoug by sfcat · · Score: 2

    That's a great question. Do you think 80% accuracy is good enough for medical use? If you're a doctor facing an unfamiliar situation, and your data says treatment X helped 40% of patients it was tried on, treatment Y helped 35% of them, and all other treatments (Z, W, etc.) helped no more than 30%, but you know the data might only be 80% accurate, what treatment do you choose? Are those ratios even meaningful in the presence of so many errors?

    Consider the case where the patient's condition is critical, and you don't have time for additional evaluation. Is X always the best choice? What if your specialty makes you better than average at treatment Y? Maybe that 20% inaccuracy works in favor of the doctor who has the right experience.

    It could be used for ill, too. What if you know you'll get paid more by the insurance company for all the extra tests required to do treatment Y? You could justify part of your decision based on the uncertainty of the data.

    In the end, historical data is just one factor out of many that goes into each of these decisions. Inaccurate data may lead to suboptimal decisions, so it can't be the only factor.

    Great strawman, but your strawman happens to actually be a nuclear powered, armor plated tank...with sharks and laser beams!!! Turns out way back in the 60's, when they started to think about what problems computers could one day solve, they listed many: beat the world champion at chess, drive cars, etc...one of them was medical diagnosis. It took decades longer than thought to solve the ones they have been able to solve, with one exception: medical diagnosis. By the early 80s we had "expert systems" that were more accurate than human doctors at medical diagnosis (especially 24 hrs into a 36 hr shift). The AMA and insurance companies have basically blocked this tech for decades despite overwhelming evidence that they were killing people by doing so. Today we have started to slowly roll out this type of tech for things like drug interaction but not yet for medical diagnosis. Ironic huh?

    --
    "Those that start by burning books, will end by burning men."
  33. this article is an advertisement! by Anonymous Coward · · Score: 0

    This article is spreading fear uncertainty and doubt about perfectly valid methods to push a crappy proprietary product. They don't mention what approaches were tested and the things they claim their product does differently are actually well established practices (lemmas and stemming). This is infuriating junk, but thanks to slashdot and computer world they were able to buy some advertising. This site is really going downhill.

  34. Is headline the issue? by Anonymous Coward · · Score: 0

    People working in these fields use precision and recall in their work both of which are NEVER 100%. Neither is anybody expecting them to be.
    So, what's the big deal? Dumbing down for the average joe makes these technologies look (almost absolutely) useless
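    For readers who have not met the two measures: precision is the share of the items a system flags that are actually correct, and recall is the share of the correct items that it manages to flag. A small sketch with a hypothetical helper and made-up document IDs:

```python
def precision_recall(predicted, actual):
    """Precision and recall for a set of predicted items against the true set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Documents a classifier tagged with a topic vs. documents truly on that topic.
tagged = {1, 2, 3, 4}
truly_on_topic = {2, 3, 5}
print(precision_recall(tagged, truly_on_topic))
```

    Here two of the four tagged documents are right (precision 0.5) and two of the three on-topic documents were found (recall about 0.67); pushing one number up usually costs something on the other.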

  35. Re:Don't let perfection be the enemy of good enoug by Anonymous Coward · · Score: 0

    You are wrong. The reason you are wrong is that they are not sending emails (which nobody will read), they are buying web ads.

  36. LDA Equivalent is also used in Climate Models by fygment · · Score: 2

    In the latter it's PCA/SVD, used to reduce the dimensionality of (compact) large numbers of variables, e.g. a linear approximation is almost as good as accounting for all the variables individually.
    The problem in both text analysis and climate (or any other) models is that PCA/LDA/etc. are linear, and the data they are applied to are generally nonlinear.
    The latter means that the solution space has many (infinitely many?) sub-optimal solutions.
    That in turn means PCA/LDA/etc. return a linear approximation to one of those solutions, and those solutions can be very different.

    So, yeah, there is a margin of error. And yeah, the reasons for that error vary. No surprise, because text understanding (and the climate) are hugely complex and nonlinear problems.

    BUT at least maybe more people will become aware that models are pretty much flawed ... so don't base legal or public policy on them.

    --
    "Consensus" in science is _always_ a political construct.
    1. Re:LDA Equivalent is also used in Climate Models by Anonymous Coward · · Score: 0

      The best example of SVD I have seen was using it for image compression. If you have a high resolution image you want to transmit over limited bandwidth, and you have ample computational resources, you can remove a lot of unnecessary detail from the data.

      Apply it to something with a lot of nuance, like written text, and the benefits don't really materialize. Some people sell programs that will grade college essays using this technique, and they fall for basic keyword spamming.

  37. Re:Don't let perfection be the enemy of good enoug by Anonymous Coward · · Score: 0

    My spam folder suggests that this isn't an either/or situation

  38. Which sentence should I use??? by Anonymous Coward · · Score: 0

    *I* didn't say I killed my wife.
    I *DIDN'T* say I killed my wife.
    I didn't *SAY* I killed my wife.
    I didn't say *I* killed my wife.
    I didn't say I *KILLED* my wife.
    I didn't say I killed *MY* wife.
    I didn't say I killed my *WIFE*.

    btw... has anybody seen my wife lately?

  39. so it's a good algo by rraylion · · Score: 1

    So this algo is consistent 80% of the time and is correct 90%.... this is just a spin on numbers