Why Is Data Mining Still A Frontier?

Shot in the dark: by Spazntwich · 2006-04-10 08:50 · Score: 5, Insightful

Either
a) There's not enough money in it to make it worthwhile

or

b) It doesn't work.

Re:Shot in the dark: by Disavian · 2006-04-10 08:58 · Score: 5, Insightful

How about

c) our ability to produce data far outstrips our ability and/or willingness to analyze it
Re:Shot in the dark: by Daniel+Dvorkin · 2006-04-10 09:03 · Score: 4, Informative

Neither of those is quite true -- a lot of entities public and private are throwing a lot of money at data mining research, reasonably expecting a big payoff, and sometimes it gets very good results indeed. The basic problem is that, as with any worthwhile CS question, doing it well is hard. It is very easy to come up with false connections between data. Sorting the wheat from the chaff in any kind of automated or even semi-automated fashion, OTOH, is an enormous challenge.

Analogies like this are always dangerous, but I'd say data mining now is about where language development was in the mid-1950's, when FORTRAN was first being developed. IOW, we have a set of tools that kind of work, most of the time, for certain applications -- but we can pretty much guarantee that they're not the best possible tools, and that we will build better ones. Consider how much work is still going on in language development half a century later, and you can see how much room there is for further development.

--
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
Re:Shot in the dark: by flynt · 2006-04-10 09:31 · Score: 4, Insightful

Also, blindly "mining" data for trends can be very misleading. Hypothesis generation is usually better done some other way. There will always be trends in data we already have that are there by chance, and this is what data mining finds in many cases. Then models are fit to that data and don't validate on future samples taken, and everyone wonders why.
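To make the point about chance trends concrete, here is a minimal sketch that "mines" pure random noise. Everything in it (the function names, 500 candidate features, 30 samples, the seed) is an illustrative assumption, not anything from a real data set. It picks the noise feature that best correlates with a noise target, then checks how that relationship holds up on freshly drawn data.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mine_noise(n_features=500, n_samples=30, seed=0):
    """Search pure noise for the feature that best 'predicts' a noise target.

    Returns (correlation on the mined sample, correlation on a fresh sample).
    All sizes are hypothetical defaults chosen for illustration.
    """
    rng = random.Random(seed)

    def draw():
        return [rng.gauss(0, 1) for _ in range(n_samples)]

    target = draw()
    features = [draw() for _ in range(n_features)]
    # "Mining": keep whichever feature correlates most strongly with the target.
    best = max(range(n_features), key=lambda i: abs(pearson(features[i], target)))
    in_sample = pearson(features[best], target)
    # "Future samples": new independent draws for the winning feature and the target.
    out_of_sample = pearson(draw(), draw())
    return in_sample, out_of_sample


if __name__ == "__main__":
    mined, fresh = mine_noise()
    print(f"mined sample: r = {mined:+.2f}")
    print(f"fresh sample: r = {fresh:+.2f}")
```

With enough candidate features, the best in-sample correlation tends to look impressive even though nothing real is there, and it typically shrinks toward zero on new data. That is the "doesn't validate on future samples" effect in miniature.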
Re:Shot in the dark: by Coryoth · 2006-04-10 10:15 · Score: 5, Informative

a lot of entities public and private are throwing a lot of money at data mining research, reasonably expecting a big payoff, and sometimes it gets very good results indeed. The basic problem is that, as with any worthwhile CS question, doing it well is hard. It is very easy to come up with false connections between data. Sorting the wheat from the chaff in any kind of automated or even semi-automated fashion, OTOH, is an enormous challenge.

I would suggest that, in practice, the real difficulty is that the problems that need to really be solved for data mining to be as effective as some people seem to wish it was are, when you actually get down to it, issues of pure mathematics. Research in pure mathematics (and pure CS, which is awfully similar really) is just hard. Pretending that this is a new and growing field is actually somewhat of a lie. It's a very, very old field which people have been working on for a very long time, to the point where the problems that remain to be solved are incredibly difficult. What is new is someone other than pure mathematicians taking much interest in these problems. Do a search for "non linear manifold learning" on Google and you'll see what I mean.

Jedidiah.

--
Craft Beer Programming T-shirts
Re:Shot in the dark: by arlow · 2006-04-10 10:32 · Score: 3, Insightful

It does work, but it requires judgement. A lot of people seem to think that you just shove the data into a statistical test, out comes a p-value, and if it's small enough you win. Interpreting and validating the initial hit is where 90% of the real work is, and it requires the careful application of prior knowledge and subsequent experiments. I work with a guy who's probably one of the best statisticians in the world, and he often asks me, "well, does the result make sense?" His judgement was developed over decades of looking at real data. If you just shove your data into an algorithm and take the top-scoring hits, you'll probably spend most of your time chasing bogus predictions. Algorithms are good for automating specific tasks that are essentially repeatable. Data mining requires an in-depth understanding of the specific problem you're trying to solve; you usually need to tailor your statistics so that they make sense for the problem. That's why the idea of selling someone a suite of fancy data mining software is probably useless; you need to sell them the statistician too.
probably :/

--
my other lambda is a Y
Re:Shot in the dark: by plover · 2006-04-10 10:33 · Score: 4, Insightful

I have to wonder if data mining isn't the problem -- the real problem seems to be that there are few obvious problems data mining will solve.
Consider WalM*rt. When the 2005 hurricanes were predicted, they mined their sales data for previous hurricanes. They found that in the last hurricane people stocked up on beer, pop tarts and peanut butter, so they sent trucks full of that stuff to the stores in the path of the hurricanes. They made lots of sales, and provided a valuable service to the communities. Capitalism at its finest.
Data mining worked very well in this case. The issue was "here's an obvious problem, and a clever solution involving data mining."
The big problem is that people expect the same golden results from non-obvious situations. "Hey, sales are down in the Wisconsin stores, let's do some data mining to figure out what they'll buy" makes no sense. Data mining worked well in the case of an obvious trigger event, but data mining by itself didn't reveal the trigger. You can't predict hurricanes based on the sales of pop tarts and beer, for example.
But, can you ever correlate pop tart and beer sales to an external event? You might be able to go back and say "here's a strange case where pop tarts and beer sold out quickly, why did this happen?" If you can tie this to external events, you'd think you'd be better prepared to react to the same events in the future.
Maybe correlating sales to Google News is the next step? Republican scandal == lower white bread sales; French riots + Senate bickering over immigration control reform == higher 'Peeps' sales; etc. Or maybe it's always been a bad idea to equate correlation with causality.

--
John
Re:Shot in the dark: by asuffield · 2006-04-10 13:25 · Score: 3, Interesting

I would suggest that, in practice, the real difficulty is that the problems that need to really be solved for data mining to be as effective as some people seem to wish it was are, when you actually get down to it, issues of pure mathematics.

That's part of the problem.

Another part is computational complexity. No, I'm not kidding. These things are often in like the second and third powers of the data set size. The data sets are often terabytes in size. We don't have computers that big, and by the time we do, we'll probably have bigger data sets. Contemporary data mining is an exercise in finding a fast enough approximation that is accurate enough to look convincing. We're not really sure how accurate they actually are - most of the time, there's no way to find out for certain. "Probably good enough" is the best you normally get. Some researchers can put a number on that 'probably' for you, eventually. Mostly they just compare the available approximations and tell you which one works the best.
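A rough back-of-envelope sketch of why second and third powers hurt. The throughput of one billion basic operations per second and the record counts are assumptions picked for illustration, not measurements of any real system.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_run(n_records, power, ops_per_second=1e9):
    """Years needed for an algorithm costing n_records ** power basic operations.

    ops_per_second is an assumed, illustrative throughput, not a benchmark.
    """
    return n_records ** power / ops_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (10**6, 10**9):
        for power in (1, 2, 3):
            print(f"n = {n:.0e}, n^{power}: {years_to_run(n, power):.3g} years")
```

Under those assumptions a quadratic algorithm over a billion records needs on the order of three decades on a single machine, and a cubic one is out of reach entirely, which is why fast approximations get used instead.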

The biggest problem is the inability to figure out intelligent things to do with it. Computers aren't smart. You can't just hand them a heap of data and say "find me the things I want to know". You have to work out what the patterns in the data are for yourself, then do pure math research to turn those patterns into a mathematical model. Then you have to come up with useful questions to ask that model. That's two major insights plus several years of work - and most researchers only have one major insight in their entire career. Just to figure out what question to ask. Data mining is then the process of repeatedly answering that question for all possible values of the parameters. And the answers you get out will only be as good as the model you invented. The current method for discovering usable patterns in data is trial and error.

I think that 'data mining' is more or less a frontier by definition. It's all the things we don't yet know about the data we currently have which would take a huge amount of effort to discover. Most unsolved problems in mathematics could probably be called 'data mining problems': if an answer exists, it can be derived from the existing body of theory. Most decisions that people make, from deciding whether to eat now or later, to deciding whether to invade a foreign nation, can also qualify. The sheer range of things it could cover means that there will probably always be vastly more unsolved problems than solved ones.

Companies are doing it, but... by deanj · 2006-04-10 08:56 · Score: 3, Insightful

There are companies and research projects that are doing this sort of thing. The trouble is, there are a LOT of people that are freaking out about it, and that's making companies less willing to 1) admit they're doing it, and 2) even think about starting to do it.

Considering how up in arms people are about it, how long before we have people accusing others of "data profiling"?

I tell you why (from a bioinformatics viewpoint) by Neil+Blender · 2006-04-10 08:57 · Score: 5, Insightful

Programmers have no idea of context. Biologists have no idea about programming. It is very hard to mix the two. You can be the shit-hottest dba in the world but if you have no relevant (deep) biology background you are guaranteed to produce crap. Almost every piece of biological software is a POS because of this.

Because it's not sexy by beacher · 2006-04-10 09:02 · Score: 4, Insightful

From my experience: the people who are subject matter experts in their field (outside of computers) typically don't have the time to perform all of the data entry. So you have to get an ETL / Miner to do all of the work for you. ETL and data mining are *NOT* the sexiest jobs in the industry by a long shot. Auditing data makes you want to gouge your eyes out after the fourth day straight of reviewing loads.

Re:Because it's not sexy by Coryoth · 2006-04-10 10:03 · Score: 4, Interesting

As someone who has done datamining, ETL, and data auditing for very large systems (every transaction on every slot machine in a large Las Vegas casino for 5 years or so) I can assure you that the problem is not lack of data or issues with data entry. The problem, simply put, is that analysis is hard. The data is sitting there, but extracting meaningful information from it is far harder than you might imagine. The first hard part is determining what constitutes meaningful information, and yes that requires subject matter experts. Given the amount of money that can be made with even the slightest improvement, getting subject matter experts to sit down and work with the data people was not the problem. The problem is that, in the end, even subject matter experts can't say what is going to be meaningful - they know what sorts of things they currently extract for themselves as meaningful, but they simply don't know what patterns or connections are lying hidden that, if they knew about them, would be exceedingly meaningful. Because the pattern is a subtle one that they never even thought to connect, they most certainly couldn't tell you to look for it. The best you can do, upon finding an interesting pattern, is say "suppose I could tell you ..." and wait for the reaction. Often enough with some of the work I did they simply didn't know how to react: the pattern was beyond their experience; it might be meaningful, it might not, even the subject matter experts couldn't tell immediately.

So how do you arrive at all those possible patterns and connections? If you think the number of different ways of slicing, considering, and analysing a given large dataset is anything but stupendously amazingly big then you're fooling yourself. Aside from millions of ways of slicing and dicing the data there are all kinds of useful ways to transform or reinterpret the data to find other connections: do Fourier transforms to look at frequency spaces, view it as a directed graph or a lattice, perform some manner of clustering or classification against [insert random property here] and reinterpret, and so on, each of which exposes whole new levels of slice and dice that can be done. If you've got subject matter experts working closely with you then you can at least make some constructive guesses as to some directions that will be profitable, and some directions that definitely will not be, but in between is a vast space where you simply cannot know. Data mining, right now, involves an awful lot of fumbling in the dark because there are simply so many ways to analyse the sort of volume of data we have collected, and the only real way to judge any analysis is to present it to a human, because our computers simply aren't good enough at seeing, understanding, and interpreting patterns to be trusted with the job. Anytime a process has to route everything through humans you know it is going to be very very slow.
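To put a number on "stupendously amazingly big", here is a small sketch that counts only the simplest kind of slice: each column is either pinned to one of its values or left unconstrained. The column names and value counts are hypothetical, and real analyses (transforms, graph views, clustering) add far more possibilities on top of this.

```python
from itertools import product


def count_slices(distinct_values_per_column):
    """Number of slices where each column is either pinned to one value or left free."""
    total = 1
    for v in distinct_values_per_column:
        total *= v + 1  # v pinned choices plus "any value"
    return total


def enumerate_slices(columns):
    """Yield every slice of a tiny schema as a dict; None means 'unconstrained'."""
    names = list(columns)
    for combo in product(*[[None, *columns[name]] for name in names]):
        yield dict(zip(names, combo))


if __name__ == "__main__":
    # Hypothetical toy schema, small enough to list every slice.
    toy = {"day": ["weekday", "weekend"], "machine": ["A", "B", "C"]}
    print(len(list(enumerate_slices(toy))), "slices in the toy schema")
    # Hypothetical wider table: 20 columns with 10 distinct values each.
    print(count_slices([10] * 20), "slices for 20 columns of 10 values")
```

Even this restricted notion of a slice grows as a product over columns, so twenty modest columns already give 11**20 candidates, far too many to show a human one at a time.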

Jedidiah.

--
Craft Beer Programming T-shirts

Re:I tell you why (from a bioinformatics viewpoint by Anonymous+Crowhead · 2006-04-10 09:09 · Score: 3, Insightful

So what you need is a so-so dba who has a passionate hobby of biology to hack something together, then the real dba's can tune it and the biologists can hack it

Well, that's pretty much how it works in academia (+/- the real dba). Problem is that this is a lab by lab (or department) solution to problems that appear in hundreds or thousands of institutions. The wheel is reinvented over and over again because either commercial/free solutions suck or don't exist. The commercial versions suck because they are built by software engineers and the free versions suck because they are built by scientists (who tend to have the mantra of "if it works, it's done").

Data mining is DIFFICULT by GlobalEcho · 2006-04-10 09:19 · Score: 4, Informative

The blurb hit on a fundamental reason data mining is still at (or beyond) the horizon...defining relations between the various elements is hard. Available datasets are not themselves in anything like normal relational form, and so have potential internal inconsistencies. And that gets in the way before you even have the chance to try to form intelligent inferences based on relations between data sets, which of course are terribly inconsistent.

Consider the following boring but difficult task I was given: two large organizations were to merge, each with a portfolio of about 100,000 items. Each item had a short history, some descriptive information, and some data such as internal quality ratings or sector assignments. This data was available (for various reasons) as big CSV file dumps. Questions to answer were: (1) how much overlap did the portfolios have? (2) were the sector distributions similar?

These are very simple, concrete questions. But you can imagine that since the categorizations differed, and descriptors differed within the CSV files, let alone between the two, the questions were difficult to answer. It required a lot of approximate matching, governed intelligently (or so I flatter myself).
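A minimal sketch of the kind of approximate matching described, using Python's standard difflib. The helper names, the sample item names, and the 0.85 threshold are all assumptions for illustration; this is not the method actually used on those portfolios, and a real job would need human review of borderline scores.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and drop punctuation so trivial formatting differences vanish."""
    kept = (ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join("".join(kept).split())


def best_matches(left, right, threshold=0.85):
    """Pair each item in `left` with its most similar item in `right`.

    Returns (left_item, right_item, score) tuples; right_item is None when
    nothing scores at or above the threshold. The threshold is an assumption
    to be tuned by inspecting real matches and mismatches.
    """
    results = []
    for a in left:
        scored = [
            (SequenceMatcher(None, normalize(a), normalize(b)).ratio(), b)
            for b in right
        ]
        score, b = max(scored, default=(0.0, None))
        results.append((a, b if score >= threshold else None, score))
    return results


if __name__ == "__main__":
    # Hypothetical item names standing in for rows from the two CSV dumps.
    ours = ["Acme Corp.", "Globex, Inc."]
    theirs = ["ACME CORP", "Initech LLC"]
    for item, match, score in best_matches(ours, theirs):
        print(f"{item!r} -> {match!r} (score {score:.2f})")
```

This compares every item on one side against every item on the other, which is fine for a sketch but quadratic in the portfolio size; at 100,000 items per side some blocking or indexing step would be needed first.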

Contrast this situation with what people typically think of as data-mining: answering interesting questions, and you can appreciate that without a whole lot of intelligence, artificial or otherwise, those questions will be unanswerable.

Nothing to do with Technology by wdavies · 2006-04-10 09:21 · Score: 3, Informative

This is a hoary chestnut. I have a masters in AI, and a PhD in machine learning (and had a lot of interest in machine discovery).

The ultimate problem is that for most datasets, there is an infinite set (at least) of relations that can be induced from the data. This doesn't even address the issue that the choice of available data is a human task. However, going back to assuming we have all the data possible, you still need to have a specific performance task in mind.

Think of this in terms of permutations. Let's say you have variables A, B, and C. They are all binary (have values 1 or 0). Now, you are given a set of these assignments (e.g. A=1, B=1, C=1; A=1, B=1, C=1; and so on). Now, try to tell me what the correct partition is. Sort them into two sets of any size. See the problem? I didn't tell you what I wanted as characteristics of those sets - so in effect, every possible partition is a potentially good one.
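A small sketch of the counting behind this, assuming each of the 8 possible settings of A, B, and C appears once and the two sets must be non-empty and unordered. The function name is made up for illustration.

```python
from itertools import product


def two_way_partitions(items):
    """Yield every split of a non-empty `items` into two non-empty, unordered sets."""
    items = list(items)
    first, rest = items[0], items[1:]
    # Pin the first item to the left set so mirror-image splits are not counted twice.
    for mask in range(2 ** len(rest)):
        left, right = [first], []
        for i, item in enumerate(rest):
            (left if mask >> i & 1 else right).append(item)
        if right:  # skip the split that leaves one side empty
            yield left, right


assignments = list(product((0, 1), repeat=3))  # all 8 settings of A, B, C
partitions = list(two_way_partitions(assignments))
print(len(assignments), "assignments,", len(partitions), "two-way partitions")
```

For n distinct items there are 2**(n-1) - 1 such splits, so just three binary variables already give 127 candidate partitions, and nothing in the data alone says which one is the "correct" one.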

So, data-mining ultimately relies on humans deciding what they want to read from the tea-leaves of the data.

Now, give it up, and start addressing issues of efficient algorithms given that you have a specific performance task :)

Winton

The problem is both easier and more difficult by zappepcs · 2006-04-10 09:44 · Score: 3, Insightful

The problem is both easier and more difficult than it first appears, or even second and third times:

Data, whether held in databases (usually nice and tidy) or in flatfiles, or random text files spread all over hell's half acre, is simply data, not the information required to link it to other data. Even meta data about the data held in any data store is not the information required to link it to other data.

One of the things I believe will help (possibly) is ODF (buzzword warning sounds) because it begins to help format data in a universally accepted manner. Though it is not the only way, universal access methods are required for accessible data. Second, the structure of the data must be presented in a universal manner. This second part allows query languages to support cognitive understanding of the structure, and thus (with some work) the value of data held in a storage location, wherever and whatever that location is, be it RDBMS, text files, or phone bills.

Indexing is simply not enough. The ability to retrieve and utilize the index with the most probability of having relevant data is what is needed. We all know that any search engine can get you too many 'hits' that contain useless data. Google or anyone else is helpless until there are accepted methods for applying metadata and data structure descriptions on all data.

When there is far more organization to data storage, there will be a great sucking sound of people actually using data from the internet in brand new ways.... until then, it's all hit and miss.

--
Support NYCountryLawyer RIAA vs People

42 by DesertWolf0132 · 2006-04-10 09:52 · Score: 4, Insightful

"I checked it very thoroughly," said the computer, "and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you've never actually known what the question is." -Hitchhiker's Guide to the Galaxy

One must remember when undertaking to find answers in the data to first figure out the question. Otherwise the answer you find will be as useful to you as the answer 42.

Without context you only have a neat compilation of arranged meaningless facts.

On the small scale data mining is used daily by marketing people and the like to figure out who would be most receptive to their approach. Webmasters use it to optimize content and respond to user trends. In most large corporations data mining is used on some level.

Data mining on the scale discussed here may be practical at some point in the future once we determine the questions we wish answers to.

Let us hope the answer is more useful than 42.

--
No animals were harmed in the making of this sig.
Well, there was that one puppy, but he is all better now.
