Text Mining the New York Times
Roland Piquepaille writes "Text mining is a computer technique to extract useful information from unstructured text. And it's a difficult task. But now, using a relatively new method named topic modeling, computer scientists from University of California, Irvine (UCI), have analyzed 330,000 stories published by the New York Times between 2000 and 2002 in just a few hours. They were able to automatically isolate topics such as the Tour de France, prices of apartments in Brooklyn or dinosaur bones. This technique could soon be used not only by homeland security experts or librarians, but also by physicians, lawyers, real estate people, and even by yourself. Read more for additional details and a graph showing how the researchers discovered links between topics and people."
An artificial intelligence could maybe use these new methods to grok all human knowledge contained in all textual material all over the World Wide Web.
Technological Singularity, here we come!
Topic modeling is mostly viewed as an unsupervised machine learning problem (as nobody will go through thousands of articles and tag each and every word, i.e. assign a topic to it). Support vector machines, however, are very good classifiers for supervised data, e.g. digit recognition: you train your SVM on a sample of pictures tagged with their digit (pictures of 9's tagged as a 9, and so on), and the SVM should then return the correct class for a new digit.
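To make the supervised case concrete, here is a minimal sketch of a linear SVM trained by stochastic subgradient descent on the hinge loss. It is only an illustration: the 2-D points stand in for pixel features, and the function names, learning rate and regularisation strength are my own assumptions, not anything from the article.

```python
import random

def train_linear_svm(samples, labels, epochs=200, eta=0.1, lam=0.01, seed=0):
    """Fit weights w and bias b by subgradient descent on the L2-regularised hinge loss.

    labels must be +1 or -1. eta (step size) and lam (regularisation) are
    illustrative defaults, not tuned values.
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # The regulariser always shrinks the weights a little.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # The hinge loss only contributes when the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b

def predict(w, b, x):
    """Return the predicted class (+1 or -1) for feature vector x."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Toy training set: two well-separated classes, each tagged with its label.
samples = [(2, 2), (3, 2), (2, 3), (3, 3), (-2, -2), (-3, -2), (-2, -3), (-3, -3)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
w, b = train_linear_svm(samples, labels)
print(predict(w, b, (2.5, 2.5)), predict(w, b, (-2.5, -2.5)))
```

The point is just that every training example carries a label, which is exactly what you don't have for hundreds of thousands of untagged news stories.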
The problem with this new method (called LDA, latent Dirichlet allocation, introduced by Blei, Ng and Jordan in 2003) is (besides other issues) the so-called inference step, as it is analytically intractable. Blei et al. solved this by means of variational methods, i.e. simplifying the model, motivated by averaging-out phenomena. Another method (which as far as I understand was applied by Steyvers) is sampling, in this case Gibbs sampling. Variational methods usually converge faster than sampling approaches, as one needs quite a lot of samples for the whole thing to converge.
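For the curious, the sampling approach can be sketched in a few lines. What follows is a toy collapsed Gibbs sampler for LDA, assuming documents are given as lists of integer word ids; the function name, the hyperparameters alpha and beta, and the iteration count are illustrative assumptions, and this is not the UCI group's code.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word ids in range(vocab_size).
    Returns (doc_topic, topic_word) count tables after `iters` sweeps.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # times word w is in topic k
    topic_total = [0] * num_topics                               # tokens in topic k overall
    assign = []                                                  # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other assignments
                # (unnormalised; random.choices normalises the weights).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Tiny corpus over a 5-word vocabulary (word ids 0..4).
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4], [0, 1, 3, 4, 2, 2]]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=5, iters=50)
print(doc_topic)
```

Each sweep resamples every token once, which is why a large corpus needs many passes before the counts settle down.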
Phil Schrodt at the U of Kansas has been doing something similar for years using the Kansas Event Data System (KEDS) and its new update, TABARI. He started using Reuters news summaries to feed the KEDS engine back in the 1990s.
Following Schrodt's work, Doug Bond and his brother, both recently of Harvard, produced the IDEAS database using machine-based coding.
These types of data can be categorized by keywords or topic, though the engines don't try to generate links. The resulting data can also be used for statistical analysis in a certain slashdotter's dissertation research...
Right. And unsupervised learning can be useful in some areas. Does anybody know how Google News works? It seems to work reasonably well, and seems to be solving the same problem.
Also note, however, that for most purposes classification is becoming less of a big deal. Read Clay Shirky's article to understand why. Shirky talks about ontologies specifically, but the gist is the same: basically, tagging each and every word isn't as crazy an idea if the end goal is just "I want to find something related", which is the most common case.
This is interesting, but the idea has been around for more than 50 years, and has been practiced using computers (as opposed to human coders) since the 1960s. Lerner and de Sola Pool came up with the idea of using "themes" to analyze political texts at Stanford in 1954, and hundreds or even thousands of studies using automated text analysis tools have been performed since then. You can download a free text analysis tool called Yoshikoder, which will perform frequency counts of all words in a text, as well as dictionary analysis and several other functions. So why is this news now? I think the press release is really leaving out some key information. The more relevant question that should have been addressed in the original release is how the text was prepared for analysis, because most websites and online databases of news articles (LexisNexis, Factiva, etc.) don't allow batch downloads of huge amounts of news text in XML or some other format that can be easily parsed by text analysis programs.
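For anyone who hasn't seen this older style of analysis, frequency counts and dictionary analysis can be sketched in a few lines. This is not Yoshikoder's code; the function names, the tokenising regex and the two-category dictionary are made up for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercased word occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def dictionary_counts(freqs, dictionary):
    """Sum word frequencies per category of a hand-built content dictionary."""
    return {
        category: sum(freqs[word] for word in words)
        for category, words in dictionary.items()
    }

text = "The senate passed the tax bill. Tax cuts help the economy."
# Hypothetical dictionary mapping categories ("themes") to their indicator words.
themes = {
    "politics": ["senate", "bill"],
    "economy": ["tax", "cuts", "economy"],
}
freqs = word_frequencies(text)
print(freqs.most_common(2))
print(dictionary_counts(freqs, themes))
```

The difference from topic modeling is that here a human writes the dictionary up front, whereas the topic model is supposed to discover the word groupings itself.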
- basically, the model assigns a probability distribution over topics to each document, i.e., documents aren't assigned to a single topic (as in latent semantic analysis (LSA))
- topics are learned from the documents automatically, not pre-defined; this means, incidentally, that they're not automatically labeled, although a list of the top 5 words for a topic generally characterizes it pretty well
- the technique can learn which authors are likely to have written various pieces of a given document, or which cited documents are likely to have contributed most to this document; side benefit: you can also discover misattributions (e.g., authors with the same name)
For a good high-level description of what these models are doing, see Mark Steyvers' research page (MS is one of the authors); that page also has links to a number of the preceding papers. Those interested in seeing what the output of a related model looks like might like to check out the Author-Topic Browser.
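To illustrate the per-document topic distribution and the "top 5 words" labeling described above, here is a small sketch. It assumes you already have count tables from a fitted topic model; the numbers, the vocabulary and the smoothing value alpha below are invented for the example, not taken from the NYT study.

```python
def topic_distribution(doc_topic_counts, alpha=0.1):
    """Turn one document's per-topic token counts into a smoothed probability distribution."""
    total = sum(doc_topic_counts) + alpha * len(doc_topic_counts)
    return [(count + alpha) / total for count in doc_topic_counts]

def top_words(topic_word_counts, vocab, n=5):
    """Return the n most frequent words of one topic, used as an informal label."""
    ranked = sorted(range(len(vocab)), key=lambda w: topic_word_counts[w], reverse=True)
    return [vocab[w] for w in ranked[:n]]

# Hypothetical counts: a 10-token document spread over three topics.
theta = topic_distribution([8, 2, 0])
print([round(p, 3) for p in theta])

# Hypothetical counts for one topic over a four-word vocabulary.
vocab = ["stage", "rent", "tour", "bone"]
print(top_words([5, 0, 9, 1], vocab, n=2))
```

So a document is a mixture over topics rather than a member of one bucket, and a topic's "name" is whatever a human reads off its most frequent words.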
We did this 2 years ago and filed patents. We have a real-time implementation at http://wizag.com/ in the form of TopicClouds and TopicMaps. It is applied to hundreds of thousands of news items and blogs (including Slashdot). Both the nodes and the links in the TopicMaps are clickable. Once you create an account, the system creates a personalized TopicCloud for each user.