Data Mining Goes 3D
Roland Piquepaille writes "At Sandia National Laboratories (SNL), a data mining and visualization software suite developed in the last two years is now able to extract information from many sources of data and to return 3D images as results. In 'Sandia's intelligence lab converts business data into 3-D images,' the New Mexico Business Weekly reports that Sandia's Information Visualization Lab is able to search structured documents, such as scientific journals, or unstructured ones, such as the Web or an intranet. Since the lab was established five months ago, this software has already been used to determine the potential of several partnerships with SNL. Other firms, such as Lockheed Martin, also are starting to use the lab. Let's hope that SNL releases this software as open source. It should be fun to use. For more details and pictures, please read this overview."
is over 5 years old already
google search
people have been doing real time data mining in VRML since the vrml2.0 plugins came out back in 97
back in the day we didnt have no old school
MacSpin was a 3-d data mining tool that is over 16 years old now.
I wish this story went into more detail about the algorithms used. Saying stuff like "we take tons of data and out comes a 3D image" is great, but what does the 3D image actually represent? What are the dimensions being graphed?
If I had to guess I would guess that they are doing 3D Self Organizing Maps, or something very similar.
The principle is: create a huge feature space for the documents in question (something like word counts for each document for each word in the corpus, with appropriate fixes: drop the most and least common words, do stemming, etc.). Each document is now a point in a massive space of, say, 20,000 dimensions, which you cannot visualize directly. What you can do, however, is try to create a projection from 20,000 dimensions down to 2 or 3 dimensions in a way that best preserves distances in the 20,000-dimensional space. This automatically creates a clustering of the documents as well, and you now have something that you can visualize practically. If you start doing things like labelling clusters and subclusters by the words unique to or defining each cluster, you can start to make some sense of the visualisation.
Effectively this is just a means of doing clustering on a large document space in such a way that the final output can be visualized (instead of the sort of results you get from k-means or hierarchical clustering, which are a lot harder to visualize in a meaningful way for laymen). The benefit of being able to visualize it in that sense is that you can "see" patterns in other document attributes by adding them to the visualization (via colors, labels, etc.) and get a global overview of those attributes across the entire document space.
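To make the idea concrete, here is a minimal sketch of that pipeline in plain Python. It is only an illustration of the general approach I described, not what Sandia actually does: the toy corpus, the function names (`build_vectors`, `project`), and the thresholds are all made up, and instead of a self-organizing map it uses a simple gradient-descent form of metric multidimensional scaling to find 3D points whose pairwise distances approximate the distances in the word-count space.

```python
import math
import random
from collections import Counter

# Toy corpus, invented purely for illustration.
DOCS = [
    "the lab mines data and renders the data as images",
    "data mining software renders images from the data",
    "the brewery sells craft beer and the beer shirts",
    "craft beer shirts from the brewery",
]


def build_vectors(docs, min_df=2, max_df_ratio=0.9):
    """Build word-count vectors, dropping the rarest and most common words."""
    tokenized = [d.split() for d in docs]
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for toks in tokenized for w in set(toks))
    vocab = sorted(
        w for w, n in df.items()
        if n >= min_df and n / len(docs) <= max_df_ratio
    )
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def project(vectors, dims=3, steps=500, lr=0.05, seed=0):
    """Place each document in `dims` dimensions so that pairwise distances
    approximate those in the original space (gradient descent on the
    squared difference between low- and high-dimensional distances)."""
    rng = random.Random(seed)
    n = len(vectors)
    target = [[euclidean(vectors[i], vectors[j]) for j in range(n)]
              for i in range(n)]
    points = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in range(n)]
    for _ in range(steps):
        updated = []
        for i in range(n):
            grad = [0.0] * dims
            for j in range(n):
                if i == j:
                    continue
                d = max(euclidean(points[i], points[j]), 1e-9)
                scale = 2.0 * (d - target[i][j]) / d
                for k in range(dims):
                    grad[k] += scale * (points[i][k] - points[j][k])
            updated.append([p - lr * g for p, g in zip(points[i], grad)])
        points = updated
    return points


if __name__ == "__main__":
    vocab, vectors = build_vectors(DOCS)
    for doc, point in zip(DOCS, project(vectors)):
        print([round(c, 2) for c in point], doc)
```

With a real corpus you would add stemming, use tens of thousands of dimensions, and pick a method that scales better than this all-pairs loop, but the shape of the computation is the same: documents in, 3D coordinates out, with similar documents landing near each other.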
Just to reiterate: I do not know that this is what is being done, and they don't say a lot in the article, but I do have some experience in this field, and what I gleaned from the article would tend to imply an approach like this.
Jedidiah.
Craft Beer Programming T-shirts
Anyone interested in doing powerful 3D data visualization should make a mandatory stop at VTK (the Visualization Toolkit). It's an open source visualization toolkit written in C++, but with bindings for Java and Python as well. This is a very powerful and very impressive system, and ought to be rated as one of the great open source projects. It doesn't seem to get much attention, and I'm not sure why.
Have a look, and look at what it is actually capable of doing. If you want to do any sort of 3D visualization, it really is worth your time to learn a bit about VTK.
Jedidiah.
SGI had a product called "MineSet" which did this kind of stuff, only a long long time ago. Originally it was inspired by the 3D filemanager SGI did for Jurassic Park. Cool idea, but old hat :).
--ralpht
I do not represent myself.
Other firms, such as Lockheed Martin, also are starting to use the lab.
I don't find it surprising that Lockheed Martin is one of the firms "starting to use the lab". Lockheed Martin runs Sandia as a contractor for the Department of Energy. Lockheed has a built-in bias to show how applicable the work at Sandia is.
Wouldn't the work of a government-funded national lab be public domain if it ever were to be released?
As far as I know the Department of Energy labs, which include the Sandia labs, Lawrence Livermore, Los Alamos, are all managed by contractors. The contractor does work for the government, but frequently maintains co-ownership with the government for the work performed.
I have worked with commercial contractors that worked under similar arrangements. The customer paid the contractor for software development work, but the contractor also owned a copy, which they could sell to others. Only work that was explicitly identified as proprietary was exempted from this. Some consulting companies, like Wind River in its early days, have built a significant amount of intellectual property following this model. Once they build up a software base they have a competitive advantage in licensing it for new applications. The fact that some software can be provided "off the shelf" rather than developed provides an incentive for the prospective customer to agree to co-ownership.
The organizations that manage the national labs seem to take a similar approach. They also own much of the intellectual property they develop. Release of software into the public domain at the University of California managed labs requires a review by a UC office that is in charge of licensing.
I feel like I'm playing Civilization and my agent is reporting that another civilization has just invented something my people have had for the last hour.
Seriously, I was doing this at the Census Bureau years ago with VRML and enhanced it with those dodgy Performance Copilot (SGI) type tools. Since then products such as, oh, I don't know, Cognos and Crystal Reports (4+) have implemented 3D data set controls and reports in spades (Tivoli Business Decision Manager, anyone?).
Open source tends to lack the robust (read: overcomplicated, buggy) features of the commercial variants, but the underlying technology is still mesozoic for us terrans. And yeah, many MBA dinosaurs lack the ability to visualize data like this (compare typical business figures to an economist's throughput figures and the economist has no trouble understanding this stuff; odd how they make so little when they show off that title). Still, there are countless open-minded business people with econ backgrounds who love these kinds of tools. Not to mention the courses that have been offered for the past decade in the mindset of 3D management.
Nachos for all, but not all the nachos.
I have been tinkering with this since I came across it sometime last year. But it too is nothing new; the first release was in 1998.
http://www.opendx.org
Yeah, since Excel is really not so ahead of the game in charting, there are numerous tools for visualizing data as 3D images, tons of them.
I am in the data mining field, so I really don't see anything here as "new" tech.