Slashdot Mirror


Why Is Data Mining Still A Frontier?

bbsguru writes "How much do we know that we still don't know? A story in The Register points out that little has changed since Francis Bacon proposed combining knowledge to learn new things 400 years ago, despite all the computer power we now have. Scientific (and other) data is still housed in unrelated collections, waiting for some enterprising Relational Database Programmer to unlock the keys to understanding. Is RDBMS still a Brave New Frontier, or will Google make the art obsolete once they finish indexing everything?"

17 of 223 comments

  1. Shot in the dark: by Spazntwich · · Score: 5, Insightful

    Either
    a) There's not enough money in it to make it worthwhile

    or

    b) It doesn't work.

    1. Re:Shot in the dark: by Disavian · · Score: 5, Insightful

      How about

      c) our ability to produce data far outstrips our ability and/or willingness to analyze it

    2. Re:Shot in the dark: by delete · · Score: 2, Insightful

      Or

      c) The title of this submission is inaccurate, as data mining tools are both useful and financially lucrative in a wide variety of domains today, particularly bioinformatics, image analysis and text mining.

      Of course, the title of this article is quite ambiguous and misleading: the article itself is concerned with RDBMS, rather than the statistical analysis of data.

    3. Re:Shot in the dark: by flynt · · Score: 4, Insightful

      Also, blindly "mining" data for trends can be very misleading. Hypothesis generation is usually better done some other way. There will always be trends in data we already have that are there by chance, and this is what data mining finds in many cases. Then models are fit to that data and don't validate on future samples taken, and everyone wonders why.
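A minimal sketch of the effect described above, using only simulated noise (no real data set is involved, and the feature and sample counts are arbitrary assumptions). It picks whichever of 200 random "features" correlates best with a random target, then checks that same feature on fresh samples:

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mine_pure_noise(seed, n_features=200, n_train=30, n_test=30):
    """Pick the noise feature that best 'predicts' a noise target in-sample.

    Returns (in-sample correlation, correlation on fresh data) for that feature.
    """
    rng = random.Random(seed)
    y_train = [rng.gauss(0, 1) for _ in range(n_train)]
    y_test = [rng.gauss(0, 1) for _ in range(n_test)]
    best_train, best_test = 0.0, 0.0
    for _ in range(n_features):
        x_train = [rng.gauss(0, 1) for _ in range(n_train)]
        x_test = [rng.gauss(0, 1) for _ in range(n_test)]
        r_train = pearson(x_train, y_train)
        if abs(r_train) > abs(best_train):
            # Selection uses only the training half, as a naive miner would.
            best_train, best_test = r_train, pearson(x_test, y_test)
    return best_train, best_test


if __name__ == "__main__":
    for seed in range(5):
        r_in, r_out = mine_pure_noise(seed)
        print(f"seed {seed}: in-sample r = {r_in:+.2f}, fresh-sample r = {r_out:+.2f}")
```

Nothing here is related to anything else by construction, so any strong in-sample "trend" is a product of searching many candidates, and it should tend to shrink toward zero on the fresh half.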

    4. Re:Shot in the dark: by rainman_bc · · Score: 2, Insightful

      I'm a big fan of c. As a reporting and data analyst, I see the same crap all the time.

      People design systems for what they want to put into them, without considering what they want to get back out. That usually results in crappy query performance and all that crap, for lack of due care. When designing a system, engineers need to be aware of: 1) what do we want to store and how do we want to store it, 2) how do we want to put it in there, 3) what do we want to get back out of it.

      Many people in designing systems pass over 3.

      I've seen it in my last job. I had no input in database design, and had to deal with insanely stupid queries resulting from thoughtless and careless design.
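To make point 3 concrete, here is a small sketch using Python's built-in sqlite3 module. The `sales` table, its columns, and the report query are all hypothetical, chosen only to show how a schema designed around what goes in can ignore what needs to come out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, sold_on TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, sold_on, amount) VALUES (?, ?, ?)",
    [(f"region-{i % 50}", f"2006-01-{i % 28 + 1:02d}", float(i)) for i in range(5000)],
)

# The question the reporting side actually asks (hypothetical).
REPORT_QUERY = "SELECT SUM(amount) FROM sales WHERE region = ?"


def query_plan(connection, sql, params):
    """Return SQLite's plan description for a query as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return "; ".join(row[3] for row in rows)  # column 3 holds the readable detail


plan_before = query_plan(conn, REPORT_QUERY, ("region-7",))
# Designing for the read path: index the column the report filters on.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
plan_after = query_plan(conn, REPORT_QUERY, ("region-7",))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should report a full scan of the table and the second a search through the index.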

      --
      09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
    5. Re:Shot in the dark: by AuMatar · · Score: 2, Insightful

      Occurrences of polio go up in summer.
      People eat more ice cream in summer.

      Conclusion: ice cream causes polio.

      This was actually something people believed for a brief time before the Salk vaccine. It's also a great example of the kind of facts data mining most frequently dredges up: accidents, or correlations with no real causal link between the two things.
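A hedged sketch of the same trap with simulated numbers (the "season" signal and noise levels are assumptions, and none of this is real epidemiological data). Two series that both follow the season correlate strongly with each other, and the correlation mostly vanishes once the season is controlled for with a partial correlation:

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


rng = random.Random(42)
# Simulated, unitless "how summery is it" signal; NOT real data.
season = [rng.gauss(0, 1) for _ in range(2000)]
# Both series depend on the season and on nothing else they share.
ice_cream = [s + rng.gauss(0, 0.5) for s in season]
polio = [s + rng.gauss(0, 0.5) for s in season]

raw = pearson(ice_cream, polio)
adjusted = partial_correlation(ice_cream, polio, season)
print(f"raw correlation:             {raw:.2f}")
print(f"controlling for the season:  {adjusted:.2f}")
```

The catch is that you have to already suspect the season in order to control for it, and that suspicion comes from domain knowledge rather than from the mining itself.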

      --
      I still have more fans than freaks. WTF is wrong with you people?
    6. Re:Shot in the dark: by arlow · · Score: 3, Insightful
      It does work, but it requires judgement. A lot of people seem to think that you just shove the data into a statistical test, out comes a p-value, and if it's small enough you win. Interpreting and validating the initial hit is where 90% of the real work is, and it requires the careful application of prior knowledge and subsequent experiments. I work with a guy who's probably one of the best statisticians in the world, and he often asks me, "well, does the result make sense?" His judgement was developed over decades of looking at real data. If you just shove your data into an algorithm and take the top-scoring hits, you'll probably spend most of your time chasing bogus predictions. Algorithms are good for automating specific tasks that are essentially repeatable. Data mining requires an in-depth understanding of the specific problem you're trying to solve; you usually need to tailor your statistics so that they make sense for the problem. That's why the idea of selling someone a suite of fancy data mining software is probably useless; you need to sell them the statistician too.

      probably :/
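One mechanical safeguard that complements (and does not replace) the judgement described above is correcting for the number of tests before trusting the top hits. A sketch of two standard corrections, Bonferroni and the Benjamini-Hochberg step-up procedure, applied to a made-up list of p-values:

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of hits that survive the Bonferroni correction."""
    cutoff = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p <= cutoff]


def benjamini_hochberg(p_values, q=0.05):
    """Indices of hits kept by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_keep = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            n_keep = rank  # remember the largest rank that passes its threshold
    return sorted(order[:n_keep])


# Invented p-values from ten hypothetical tests, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

naive_hits = [i for i, p in enumerate(p_values) if p <= 0.05]
print("p <= 0.05, uncorrected:", naive_hits)
print("Bonferroni:            ", bonferroni(p_values))
print("Benjamini-Hochberg:    ", benjamini_hochberg(p_values))
```

Even the hits that survive still need the "does the result make sense?" check; the correction only trims the list of things worth asking that about.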

      --

      my other lambda is a Y

    7. Re:Shot in the dark: by plover · · Score: 4, Insightful
      I have to wonder if data mining isn't the problem -- the real problem seems to be that there are few obvious problems data mining will solve.

      Consider WalM*rt. When the 2005 hurricanes were predicted, they mined their sales data for previous hurricanes. They found that in the last hurricane people stocked up on beer, pop tarts and peanut butter, so they sent trucks full of that stuff to the stores in the path of the hurricanes. They made lots of sales, and provided a valuable service to the communities. Capitalism at its finest.

      Data mining worked very well in this case. The issue was "here's an obvious problem, and a clever solution involving data mining."

      The big problem is that people expect the same golden results from non-obvious situations. "Hey, sales are down in the Wisconsin stores, let's do some data mining to figure out what they'll buy" makes no sense. Data mining worked well in the case of an obvious trigger event, but data mining by itself didn't reveal the trigger. You can't predict hurricanes based on the sales of pop tarts and beer, for example.

      But, can you ever correlate pop tart and beer sales to an external event? You might be able to go back and say "here's a strange case where pop tarts and beer sold out quickly, why did this happen?" If you can tie this to external events, you'd think you'd be better prepared to react to the same events in the future.
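A rough sketch of the hurricane-style query, with invented numbers (the products come from the story above; the unit counts, the single-store layout and the `event_lift` helper are all assumptions). It compares average sales in weeks with a known trigger event to average sales in ordinary weeks:

```python
from collections import defaultdict

# Hypothetical weekly unit sales for one store; every number is invented.
# Each record: (product, hurricane_warning_that_week, units_sold)
weekly_sales = [
    ("beer", False, 100), ("beer", False, 110), ("beer", True, 210),
    ("pop tarts", False, 40), ("pop tarts", False, 44), ("pop tarts", True, 294),
    ("peanut butter", False, 20), ("peanut butter", False, 20), ("peanut butter", True, 60),
    ("white bread", False, 80), ("white bread", False, 84), ("white bread", True, 82),
]


def event_lift(records):
    """Ratio of mean units in event weeks to mean units in ordinary weeks.

    Assumes every product has at least one week of each kind.
    """
    totals = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for product, is_event, units in records:
        bucket = totals[product][is_event]
        bucket[0] += units  # running sum of units
        bucket[1] += 1      # number of weeks seen
    lifts = {}
    for product, by_flag in totals.items():
        event_mean = by_flag[True][0] / by_flag[True][1]
        normal_mean = by_flag[False][0] / by_flag[False][1]
        lifts[product] = event_mean / normal_mean
    return lifts


lifts = event_lift(weekly_sales)
for product, lift in sorted(lifts.items(), key=lambda item: -item[1]):
    print(f"{product:14s} {lift:.1f}x")
```

Note that the event flag is an input here. The query is easy precisely because someone already knew which weeks mattered; going the other way, from odd sales back to an unknown trigger, is the hard part.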

      Maybe correlating sales to Google News is the next step? Republican scandal == lower white bread sales; French riots + Senate bickering over immigration control reform == higher 'Peeps' sales; etc. Or maybe it's always been a bad idea to equate correlation with causality.

      --
      John
    8. Re:Shot in the dark: by polv0 · · Score: 2, Insightful

      I'm a statistician and data mining consultant, and I've implemented models based on millions of records generating consulting fees in the high hundreds of thousands of dollars. I thus have a strong understanding of the data, modeling and project management aspects of data mining ventures.

      I believe there are several fundamental factors required to make a data-mining project successful:

      1) A mathematically precise definition of what it is to be modeled (the response) as in the probability of purchasing product x rather than "profit"
      2) Multiple sources of data linked together: demographic, financial, transactional, etc...
      3) A set of "variables" from 2) that have a strong, intuitive relationship to the response in 1)
      4) A reasonably sophisticated statistical algorithm designed to weed out the significant relationships between 1) and 3)
      5) An organizational culture willing and capable of institutionalizing the model into a decision making process

      In my experience, these projects end up failing because of problems with one of the above critical steps. Roughly 50% of the time 5) will hold back a project. This isn't what a consultant, statistician or DBA will tell you, because more often than not they don't stick around long enough to see it through to the bottom line. Steps 1) and 4) are basically table stakes. If you can't very precisely define your objective (and thus that which you would like to predict, the response) you aren't going anywhere. Similarly, if there is a significant flaw in your analytical method (e.g. overfitting with neural networks) then you'll produce rubbish.

      Steps 2) and 3) really determine the upper bound for how successful the project can be. Often, there is a fundamental driver of the value in question that is not accounted for in the analysis. For example, in modeling auto-insurance claim frequency, it is very difficult to obtain number of miles driven, and yet it is the most significant factor impacting the occurrence of claims. In the end, the quality of the model hinges on linking together high-quality data sources that are often transactional and deriving from them variables that would dramatically surprise you if they weren't predictive of the response in question.

      That's my 2 cents...
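A toy illustration of steps 1) and 2) above, with every identifier and record invented for the purpose. The response is defined precisely as "customer bought product x at least once", and two sources are linked on a customer id. The per-segment rate at the end is only a stand-in for the statistical algorithm of step 4), not an example of one:

```python
# Hypothetical records; every name and number below is invented for illustration.
demographics = {  # customer_id -> age band
    1: "18-34", 2: "18-34", 3: "35-54", 4: "35-54", 5: "55+", 6: "55+",
}
transactions = [  # (customer_id, product)
    (1, "x"), (1, "y"), (2, "y"), (3, "x"), (4, "x"), (5, "y"),
]


def purchase_rate_by_segment(demographics, transactions, product):
    """Share of customers in each segment who bought `product` at least once."""
    buyers = {cid for cid, prod in transactions if prod == product}
    counts = {}  # segment -> [buyers, customers]
    for cid, segment in demographics.items():
        tally = counts.setdefault(segment, [0, 0])
        tally[0] += cid in buyers  # response: 1 if the customer bought the product
        tally[1] += 1              # customers with no transactions still count
    return {segment: bought / total for segment, (bought, total) in counts.items()}


rates = purchase_rate_by_segment(demographics, transactions, "x")
for segment, rate in rates.items():
    print(f"{segment}: {rate:.0%} bought product x")
```

Driving the loop from the demographic side keeps customers who never bought anything in the denominator, which is the kind of detail a vague response like "profit" tends to hide.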

  2. Companies are doing it, but... by deanj · · Score: 3, Insightful

    There are companies and research projects that are doing this sort of thing. The trouble is, there are a LOT of people that are freaking out about it, and that's making companies less willing to 1) admit they're doing it, and 2) even think about starting to do it.

    Considering how up in arms people are about it, how long before we have people accusing others of "data profiling"?

  3. I tell you why (from a bioinformatics viewpoint) by Neil+Blender · · Score: 5, Insightful

    Programmers have no idea of context. Biologists have no idea about programming. It is very hard to mix the two. You can be the shit-hottest dba in the world but if you have no relevant (deep) biology background you are guaranteed to produce crap. Almost every piece of biological software is a POS because of this.

  4. Because it's not sexy by beacher · · Score: 4, Insightful

    From my experience - The people who are subject matter experts in their field (outside of computers) typically don't have the time to perform all of the data entry. So you have to get an ETL / Miner to do all of the work for you. ETL and data mining are *NOT* the sexiest jobs in the industry by a long shot. Auditing data makes you want to gouge your eyes out after the fourth day straight of reviewing loads.

  5. Re:I tell you why (from a bioinformatics viewpoint by Anonymous+Crowhead · · Score: 3, Insightful

    So what you need is a so-so dba who has a passionate hobby of biology to hack something together, then the real dba's can tune it and the biologists can hack it

    Well, that's pretty much how it works in academia (+/- the real dba). Problem is that this is a lab by lab (or department) solution to problems that appear in hundreds or thousands of institutions. The wheel is reinvented over and over again because either commercial/free solutions suck or don't exist. The commercial versions suck because they are built by software engineers and the free versions suck because they are built by scientists (who tend to have the mantra of "if it works, it's done").

  6. Re:Semantic Web goodness by TrappedByMyself · · Score: 2, Insightful

    Datamining would be a piece of cake if all data were kept in clear, standard XML dialects. See Visualising the Semantic Web, ed. Geroimenko and Chen (Springer Verlag, 2004). Some of the possibilities of combing through information and elucidating it, combining it and converting it described in that book are simply awesome. Too bad that the Semantic Web is a pipe dream at the moment.

    Well, XML is not really important. The problem lies in going from the infinite real world to a well defined ontology or whatever. I can make the greatest data model ever, and the first time someone tries to put a large data set into it, it just won't fit. You hit a bazillion "I have this as two fields, you have this as one" issues. You can jump a meta-level up to store all the data, but then you just lost a handle on context. The Semantic Web people have tackled the issue, but have yet to solve world hunger. Tossing a bunch of web and AI/ontology experts into a room produces great things, but they haven't gotten there yet. And the stuff they've produced is still academic level. The average high school kid isn't going to be hacking OWL into his web pages.

    As with most things, we'll get closer and closer, and better and better things will happen. We'll never find the holy grail, but some pretty cool and useful technologies will eventually emerge. It just takes time.

    --

    Help me take back Slashdot. When did 'News for Nerds' become 'FUD and Conspiracy Theories for Extremist Nutjobs'?
  7. The problem is both easier and more difficult by zappepcs · · Score: 3, Insightful

    The problem is both easier and more difficult than it first appears, or even second and third times:

    Data, whether held in databases (usually nice and tidy) or in flatfiles, or random text files spread all over hell's half acre, is simply data, not the information required to link it to other data. Even meta data about the data held in any data store is not the information required to link it to other data.

    One of the things I believe will help (possibly) is ODF (buzzword warning sounds) because it begins to help format data in a universally accepted manner. Though it is not the only way, universal access methods are required for accessible data. Second, the structure of the data must be presented in a universal manner. This second part allows query languages to support cognitive understanding of the structure, and thus (with some work) the value of data held in a storage location, wherever and whatever that location is, be it RDBMS, text files, or phone bills.

    Indexing is simply not enough. The ability to retrieve and utilize the index with the most probability of having relevant data is what is needed. We all know that any search engine can get you too many 'hits' that contain useless data. Google or anyone else is helpless until there are accepted methods for applying metadata and data structure descriptions on all data.

    When there is far more organization to data storage, there will be a great sucking sound of people actually using data from the internet in brand new ways.... until then, it's all hit and miss.

  8. It's still only an index by Anonymous Coward · · Score: 1, Insightful
    A table of contents doth not a book make.

    I don't think Google will replace good old fashioned research by humans. I think we're still light years from computers having anything even *close* to intelligence high enough to replace humans in 'connecting the dots' of data libraries.


    $0.02


    ******************
    Slow Down, Cowboy! It's been x minutes since you last successfully posted a relevant comment anyone wants to read.

  9. 42 by DesertWolf0132 · · Score: 4, Insightful

    "I checked it very thoroughly," said the computer, "and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you've never actually known what the question is."-Hitchhiker's Guide to the Galaxy"

    One must remember when undertaking to find answers in the data to first figure out the question. Otherwise the answer you find will be as useful to you as the answer 42.

    Without context you only have a neat compilation of arranged meaningless facts.

    On the small scale data mining is used daily by marketing people and the like to figure out who would be most receptive to their approach. Webmasters use it to optimize content and respond to user trends. In most large corporations data mining is used on some level.

    Data mining on the scale discussed here may be practical at some point in the future once we determine the questions we wish answers to.

    Let us hope the answer is more useful than 42.

    --
    No animals were harmed in the making of this sig.
    Well, there was that one puppy, but he is all better now.