Data Mining Goes 3D
Roland Piquepaille writes "At Sandia National Laboratories (SNL), a data mining and visualization software suite developed in the last two years is now able to extract information from many sources of data and to return 3D images as results. In 'Sandia's intelligence lab converts business data into 3-D images,' the New Mexico Business Weekly reports that Sandia's Information Visualization Lab is able to search structured documents, such as scientific journals, or unstructured ones, such as the Web or an intranet. Since the lab was established five months ago, this software has already been used to determine the potential of several partnerships with SNL. Other firms, such as Lockheed Martin, are also starting to use the lab. Let's hope that SNL releases this software as open source. It should be fun to use. For more details and pictures, please read this overview."
is over 5 years old already
google search
People have been doing real-time data mining in VRML since the VRML 2.0 plugins came out back in '97.
back in the day we didnt have no old school
MacSpin was a 3-d data mining tool that is over 16 years old now.
I wish this story went into more detail about the algorithms used. Saying stuff like "we take tons of data and out comes a 3D image" is great, but what does the 3D image actually represent? What are the dimensions being graphed?
If I had to guess, I would say that they are doing 3D Self-Organizing Maps, or something very similar.
The principle is: create a huge feature space for the documents in question (something like word counts for each document for each word in the corpus, with appropriate fixes: drop the most and least common words, do stemming, etc.). You can now think of the documents as points in a massive 20,000-dimensional space, which of course you cannot visualize directly. What you can do, however, is try to create a projection from 20,000 dimensions down to 2 or 3 dimensions in a way that best preserves distances in the 20,000-dimensional space. This automatically creates a clustering of the documents as well, and you now have something that you can visualize practically. If you start doing things like labelling clusters and subclusters by the words unique to, or defining, that cluster, you can start to make some sense of the visualisation.
Effectively this is just a means of doing clustering on a large document space in such a way that the final output can be visualized (instead of the sort of results you get from k-means or hierarchical clustering, which are a lot harder to visualize in a meaningful way for laymen). The benefit of being able to visualize it in that sense is that you can "see" patterns in other document attributes by adding them to the visualization (via colors, labels, etc.) and get a global overview of those attributes across the entire document space.
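To make the idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the general principle, not what Sandia actually does. It swaps in PCA (via power iteration) as a simple stand-in for the distance-preserving projection, rather than a Self-Organizing Map, and it skips stemming. The function names, the thresholds `min_df` and `max_df_ratio`, and the toy documents are all made up for the example.

```python
import math
import random
import re
from collections import Counter


def build_features(docs, min_df=2, max_df_ratio=0.9):
    """Turn raw texts into word-count vectors, dropping the rarest and
    the most common words (thresholds are illustrative assumptions)."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for toks in tokenized for w in set(toks))
    vocab = sorted(
        w for w, n in df.items()
        if n >= min_df and n / len(docs) <= max_df_ratio
    )
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0.0] * len(vocab)
        for w in toks:
            if w in index:
                row[index[w]] += 1.0
        rows.append(row)
    return vocab, rows


def project(rows, k=3, iters=200, seed=0):
    """Project each row onto the top-k principal components, found by
    power iteration with deflation. Returns one k-vector per document."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]  # centered copy
    coords = [[] for _ in range(n)]
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            scores = [sum(x * y for x, y in zip(row, v)) for row in X]
            # w = X^T (X v): one power-iteration step on the scatter matrix
            w = [sum(scores[i] * X[i][j] for i in range(n)) for j in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in the data
                v = [0.0] * d
                break
            v = [x / norm for x in w]
        scores = [sum(x * y for x, y in zip(row, v)) for row in X]
        for i in range(n):
            coords[i].append(scores[i])
            # Deflate: remove the component just found from each row
            X[i] = [x - scores[i] * vj for x, vj in zip(X[i], v)]
    return coords


# Toy corpus with two obvious topics (made up for this example)
DOCS = [
    "the laser optics beam laser lens",
    "the optics lens beam mirror laser",
    "the beam laser optics lens lens",
    "the market stock price trade market",
    "the stock trade price market profit",
    "the price market stock trade trade",
]

vocab, rows = build_features(DOCS)
coords = project(rows, k=3)
for doc, xyz in zip(DOCS, coords):
    print([round(c, 2) for c in xyz], doc)
```

On a corpus like this the first coordinate should separate the optics documents from the finance ones, which is the "clustering for free" effect described above. A real system would use a sparse matrix library and a proper projection method (SOM, MDS, or similar) instead of this toy loop.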
Just to reiterate: I do not know that this is what is being done, and they don't say a lot in the article, but I do have some experience in this field, and what I gleaned from the article would tend to imply an approach like this.
Jedidiah.
Anyone interested in doing powerful 3D data visualization should make a mandatory stop at VTK, the Visualization Toolkit. It's an open source visualization toolkit written in C++, but with bindings for Java and Python as well. This is a very powerful and very impressive system, and ought to be rated as one of the great open source projects. It doesn't seem to get much attention, and I'm not sure why.
Have a look, and look at what it is actually capable of doing. If you want to do any sort of 3D visualization, it really is worth your time to learn a bit about VTK.
Jedidiah.