Why Is Data Mining Still A Frontier?
bbsguru writes "How much do we know that we still don't know? A story in The Register points out that little has changed since Francis Bacon proposed combining knowledge to learn new things 400 years ago, despite all the computer power we now have. Scientific (and other) data is still housed in unrelated collections, waiting for some enterprising Relational Database Programmer to unlock the keys to understanding. Is RDBMS still a Brave New Frontier, or will Google make the art obsolete once they finish indexing everything?"
As someone who has done datamining, ETL, and data auditing for very large systems (every transaction on every slot machine in a large Las Vegas casino for 5 years or so) I can assure you that the problem is not lack of data or issues with data entry. The problem, simply put, is that analysis is hard. The data is sitting there, but extracting meaningful information from it is far harder than you might imagine. The first hard part is determining what constitutes meaningful information, and yes that requires subject matter experts. Given the amount of money that can be made with even the slightest improvement, getting subject matter experts to sit down and work with the data people was not the problem. The problem is that, in the end, even subject matter experts can't say what is going to be meaningful - they know what sorts of things they currently extract for themselves as meaningful, but they simply don't know what patterns or connections are lying hidden that, if they knew about it, would be exceedingly meaningful. Because the pattern is a subtle one that they never even thought to connect they most certainly couldn't tell you to look for it. The best you can do is, upon finding an interesting pattern, is say "suppose I could tell you ..." and wait for the reaction. Often enough with some of the work I did they simply didn't know how to react: the pattern was beyond their experience; it might be meaningful, it might not, even the subject matter experts couldn't tell immediately.
So how do you arrive at all those possible patterns and connections? If you think the number of different ways of slicing, considering, and analysing a given large dataset is anything but stupendously amazingly big then you're fooling yourself. Aside from millions of ways of slicing and dicing the data there are all kinds of useful ways to transform or reinterpret the data to find other connections: do fourier transforms to look at frequency spaces, view it as a directed graph or a lattice, perform some manner of clustering or classification against [insert random property here] and reinterpret, and so on, each of which expose whole new levels of slice and dice that can be done. If you'ev got subject matter experts working closely with you then you can at least make some constructive guesses as to some directions that will be profitable, and some directions that definitely will not be, but in between is a vast space where you simply cannot know. Data mining, right now, involves an awful lot of fumbling in the dark because there are simply so many ways to analyse the sort of volume of data we have collected, and the only real way to judge any analysis is to present it to human because our computers simply aren't as good at seeing understanding an interpreting patterns to trust with the job. Anytime a process has to route everything through humans you know it is going to be very very slow.
Jedidiah.
Craft Beer Programming T-shirts