Why Is Data Mining Still A Frontier?
bbsguru writes "How much do we know that we still don't know? A story in The Register points out that little has changed since Francis Bacon proposed combining knowledge to learn new things 400 years ago, despite all the computer power we now have. Scientific (and other) data is still housed in unrelated collections, waiting for some enterprising Relational Database Programmer to unlock the keys to understanding. Is RDBMS still a Brave New Frontier, or will Google make the art obsolete once they finish indexing everything?"
Either
a) There's not enough money in it to make it worthwhile
or
b) It doesn't work.
There are companies and research project that are doing this sort of thing. The trouble is, there are a LOT of people that are freaking out about it, and that's making companies less willing to 1) admit they're doing it, and 2) even think about starting to do it.
Considering how up and arms people are about it, how long before we have people accusing others of "data profiling"?
Programmers have no idea of context. Biologists have no idea about programming. It is very hard to mix the two. You can be the shit-hottest dba in the world but if you have no relevant (deep) biology background you are guaranteed to produce crap. Almost every piece of biological software is a POS because of this.
From my expierience - The people who are subject matter experts in their field (outside of computers) and typically don't have the time to perform all of the data entry. So you have to get an ETL / Miner to do all of the work for you. ETL and data mining are *NOT* the sexiest jobs in the industry by a long shot. Auditing data makes you want to gouge your eyes out after the fourth day straight of reviewing loads.
So what you need is a so-so dba who has a passionate hobby of biology to hack something together, then the real dba's can tune it and the biologists can hack it
Well, that's pretty much how it works in academia (+/- the real dba). Problem is that this is a lab by lab (or department) solution to problems that appear in hundreds or thousands of institutions. The wheel is reinvented over and over again because either commercial/free solutions suck or don't exist. The commercial versions suck because they are built by software engineers and the free versions suck because they are built by scientists (who tend to have the mantra of "if it works, it's done").
The blurb hit on a fundamental reason data mining is still at (or beyond) the horizon...defining relations between the various elements is hard. Available datasets are not themselves in anything like normal relational form, and so have potential internal inconsistencies. And that gets in the way before you even have the chance to try to form intelligent inferences based on relations between data sets, which of course are terribly inconsistent.
Consider the following boring but difficult task I was given: two large organizations were to merge, each with a portfolio of about 100,000 items. Each item had a short history, some descriptive information, and some data such as internal quality ratings or sector assignments. This data was available (for various reasons) as big CSV file dumps. Questions to answer were: (1) how much overlap did the portfolios have? (2) were the sector distributions similar?
These are very simple, concrete questions. But you can imagine that since the categorizations differed, and descriptors differed within the CSV files, let alone between the two, the questions were difficult to answer. It required a lot of approximate matching, governed intelligently (or so I flatter myself).
Contrast this situation with what people typically think of as data-mining: answering interesting questions, and you can appreciate that without a whole lot of intelligence, artificial or otherwise, those questions will be unanswerable.
This is a hoary chestnut. I have a masters in AI, and a PhD in machine learning (and had a lot of interest in machine discovery).
:)
The ultimate problem, is that for most datasets, there are an infinite (at least), set of relations that can be induced from the data. This doesn't even address the issue, that the choice of available data is a human task. However, going back to assuming we have all the data possible, you still need to have a specific performance task in mind.
Think of this in terms of permutations. Lets say you have variable A, B, and C. They are all binary (have values 1 or 0). Now, you are given a set of these assigments (eg A=1, B=1,C=1, A=1,B=1, C=1, and so on). Now, try to tell me what the correct partition is. Sort them in to two sets of any size. See the problem ? I didn't tell you what I wanted as characteristics of those sets - so in effect, they are all possible good partitions.
So, data-mining ultimately relies on human's deciding what they want to read from the tea-leaves of the data.
Now, give it up, and start addressing issues of efficient algorithms given that you have a specific performance task
Winton
The problem is both easier and more difficult than it first appears, or even second and third times:
Data, whether held in databases (usually nice and tidy) or in flatfiles, or random text files spread all over hell's half acre, is simply data, not the information required to link it to other data. Even meta data about the data held in any data store is not the information required to link it to other data.
One of the things I believe will help (possibly) is ODF (buzzword warning sounds) because it begins to help format data in a universally accepted manner. Though it is not the only way, universal access methods are required for accessible data. Second, the structure of the data must be presented in a universal manner. This second part allows query languages to support cognitive understanding of the structure, and thus (with some work) the value of data held in a storage location, where ever and whatever that location is, be it RDBMS, text files, or phone bills.
Indexing is simply not enough. The ability to retrieve and utilize the index with the most probability of having relevent data is what is needed. We all know that any search engine can get you too many 'hits' that contain useless data. Google or anyone else is helpless until there are accepted methods for applying metadata and data structure descriptions on all data.
When there is far more organization to data storage, there will be a great sucking sound of people actually using data from the internet in brand new ways.... until then, its all hit and miss.
Support NYCountryLawyer RIAA vs People
"I checked it very thoroughly," said the computer, "and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you've never actually known what the question is."-Hitchhiker's Guide to the Galaxy"
One must remember when undertaking to find answers in the data to first figure out the question. Otherwise the answer you find will be as useful to you as the answer 42.
Without context you only have a neat compilation of arranged meaningless facts.
On the small scale data mining is used daily by marketing people and the like to figure out who would be most receptive to their approach. Webmasters use it to optimize content and respond to user trends. In most large corporations data mining is used on some level.
Data mining on the scale discussed here may be practical at some point in the future once we determine the questions we wish answers to.
Let us hope the answer is more useful than 42.
No animals were harmed in the making of this sig.
Well, there was that one puppy, but he is all better now.