Text Mining the New York Times

Posted by ryuzaki0 on Friday July 28, 2006 @10:29PM from the good-place-to-mine dept.

Roland Piquepaille writes "Text mining is a computer technique for extracting useful information from unstructured text, and it's a difficult task. But now, using a relatively new method named topic modeling, computer scientists from the University of California, Irvine (UCI) have analyzed 330,000 stories published by the New York Times between 2000 and 2002 in just a few hours. They were able to automatically isolate topics such as the Tour de France, prices of apartments in Brooklyn, or dinosaur bones. This technique could soon be used not only by homeland security experts or librarians, but also by physicians, lawyers, real estate people, and even by yourself. Read more for additional details and a graph showing how the researchers discovered links between topics and people."

Re:Support Vector Machine? by Anonymous Coward · 2006-07-29 00:22 · Score: 4, Informative

Text modeling is mostly viewed as an unsupervised machine learning problem (nobody will go through thousands of articles and tag each and every word, i.e. assign a topic to it). Support vector machines, however, are very good classifiers for supervised data, e.g. digit recognition: you train your SVM on a sample of pictures of 9's tagged as a 9, and the SVM should then return the correct class for a new digit.
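To make the supervised side of that contrast concrete, here is a minimal sketch of a linear soft-margin SVM trained by subgradient descent on the hinge loss. It is only an illustration: the 2-D toy points stand in for digit images, and the function names, learning rate, and regularisation strength are all assumptions, not anything from the UCI work.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b by subgradient descent on the
    L2-regularised hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularisation step: shrink the weights slightly.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Point is misclassified or inside the margin:
                # move the boundary away from it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical labelled training sample (the "tagged pictures").
train_x = [(2, 2), (3, 3), (2, 3), (3, 2),
           (-2, -2), (-3, -3), (-2, -3), (-3, -2)]
train_y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(train_x, train_y)
print(predict(w, b, (2.5, 2.0)), predict(w, b, (-2.0, -2.5)))
```

The point of the sketch is that every training example needs a label up front, which is exactly what a large news archive does not have.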

The problem with this new method (called LDA, latent Dirichlet allocation, introduced by Blei, Ng and Jordan in 2003) is (besides other issues) the so-called inference step, as it is analytically intractable. Blei et al. solved this by means of variational methods, i.e. simplifying the model, motivated by averaging-out phenomena. Another method (which as far as I understand was applied by Steyvers) is sampling, in this case Gibbs sampling. Usually the variational methods are superior to sampling approaches, as one needs quite a lot of samples for the whole thing to converge.
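For readers who want to see what the sampling route looks like, the following is a rough sketch of a collapsed Gibbs sampler for LDA. It is not the researchers' code: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the tiny corpus of word ids are all assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of
    word ids. Returns (doc_topic, topic_word) count matrices."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    topics = list(range(num_topics))
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in topics
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(topics, weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Hypothetical toy corpus: word ids 0-2 are "cycling" words,
# ids 3-5 are "real estate" words.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4],
            [0, 2, 1, 2, 0], [4, 5, 3, 5, 3]]
doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2, vocab_size=6)
print(doc_topic)
print(topic_word)
```

Each sweep touches every token once, which hints at why many sweeps over a large corpus can be slow to converge.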

Earlier modes of text mining by soapbox · 2006-07-29 01:18 · Score: 4, Informative

Phil Schrodt at the U of Kansas has been doing something similar for years using The Kansas Event Data System (and its new update, TABARI). He started using Reuters news summaries to feed the KEDS engine back in the 1990s.

Following Schrodt's work, Doug Bond and his brother, both recently of Harvard, produced the IDEAS database using machine-based coding.

These types of data can be categorized by keywords or topic, though the engines don't try to generate links. The resulting data can also be used for statistical analysis in a certain slashdotter's dissertation research...

brief explanation of the method by jrtom · 2006-07-29 17:30 · Score: 4, Informative

I'm a PhD student in the research group that worked on this. My research is somewhat different (machine learning and data mining on social network data sets) but I've gone to a lot of meetings and presentations on this work, and I've used the model they're describing in my own research. Certainly people have worked on document classification before, but posters who are suggesting that this isn't new don't understand what this method accomplishes. For example:
- basically, the model assigns a probability distribution over topics to each document
  i.e., documents aren't assigned to a single topic (as in latent semantic analysis (LSA))
- topics are learned from the documents automatically, not pre-defined
  this means, incidentally, that they're not automatically labeled, although a list of the top 5 words for a topic generally characterizes it pretty well.
- the technique can learn which authors are likely to have written various pieces of a given document, or which cited documents are likely to have contributed most to this document
  side benefit: you can also discover misattributions (e.g., authors with the same name)
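
The first two points in the list above can be sketched in a few lines. This assumes a topic model has already been fitted and has produced count matrices; the vocabulary, the counts, and the function names below are made up for illustration and are not output from the actual New York Times model.

```python
def topic_mixture(doc_topic_counts, alpha=1.0):
    """Turn one document's per-topic token counts into a smoothed
    probability distribution over topics."""
    total = sum(doc_topic_counts) + alpha * len(doc_topic_counts)
    return [(c + alpha) / total for c in doc_topic_counts]


def top_words(word_counts, vocab, n=5):
    """Return the n most frequent words of one topic, used as an
    informal label since topics are not named automatically."""
    ranked = sorted(range(len(vocab)), key=lambda i: word_counts[i],
                    reverse=True)
    return [vocab[i] for i in ranked[:n]]


# Hypothetical fitted counts for a six-word vocabulary and two topics.
vocab = ["tour", "france", "cycling", "rent", "brooklyn", "apartment"]
topic_word_counts = [
    [9, 7, 5, 0, 1, 0],   # topic 0
    [0, 1, 0, 8, 6, 4],   # topic 1
]
doc_counts = [6, 2]       # one document: 6 tokens in topic 0, 2 in topic 1

print(topic_mixture(doc_counts))                  # a mix, not a single topic
print(top_words(topic_word_counts[0], vocab, 3))
print(top_words(topic_word_counts[1], vocab, 3))
```

The document comes out as a blend of both topics rather than being filed under one, and each topic is described only by its highest-count words.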
For a good high level description of what these models are doing, see Mark Steyvers' research page (MS is one of the authors); that page also has links to a number of the preceding papers. Those interested in seeing what the output of a related model looks like might like to check out the Author-Topic Browser.