Slashdot Mirror


Text-Mining Technique Intelligently Learns Topics

Posted by ryuzaki0 on Wednesday August 2, 2006 @11:20AM from the sound-of-google-knocking-on-your-door dept.

Grv writes "Researchers at the University of California, Irvine have announced a new technique they call 'topic modeling' that can be used to analyze and group massive amounts of text-based information. Unlike typical text indexing, topic modeling attempts to learn what a given section of text is about without clues being fed to it by humans. The researchers used their method to analyze and group 330,000 articles from the New York Times archive. From the article, 'The UCI team managed this by programming their software to find patterns of words which occurred together in New York Times articles published between 2000 and 2002. Once these word patterns were indexed, the software then turned them into topics and was able to construct a map of such topics over time.'"
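
To make the "patterns of words which occurred together" idea concrete, here is a minimal sketch that counts how often pairs of words appear in the same document. It is plain co-occurrence counting, not the UCI team's actual method, which the summary does not describe in detail. The function name cooccurrence_counts and the sample headlines are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for text in documents:
        # Unique lowercase words per document; sorting gives each pair
        # a single canonical order, e.g. ("bill", "budget").
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


# Hypothetical headlines, not taken from the New York Times archive.
docs = [
    "senate votes on budget bill",
    "budget bill stalls in senate",
    "team wins playoff game",
    "playoff game goes to overtime",
]

pairs = cooccurrence_counts(docs)

# Pairs seen together in at least two documents hint at a shared theme.
frequent = sorted(pair for pair, n in pairs.items() if n >= 2)
for pair in frequent:
    print(pair, pairs[pair])
```

On this toy input the politics words and the sports words end up in separate groups of frequent pairs. A real topic model would go further and infer topics statistically instead of relying on a fixed threshold.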

Re:Can it deal with the canonical problem? by NickFitz · 2006-08-02 11:47 · Score: 4, Insightful

Ah, but the point of the example is that the system must either understand or otherwise be able to derive the fact that there are animals called "fruit flies" but not animals called "time flies", that "like" can be a verb or a preposition depending on the context, and most importantly, that in the first case the relationship between subject and object is metaphorical, and in the second, factual. It's how the programs "understand that flies can be a verb or a noun and correctly parse this info out from a sentence" that makes the difference between yet another failed attempt and a meaningful breakthrough. In fact, your reply begs the question - a correct use of that phrase, for a change :-)

--
Using HTML in email is like putting sound effects on your phone calls. Just say <strong>no</strong>.

Topic modeling to the rescue by alienmole · 2006-08-02 12:15 · Score: 4, Insightful

Perhaps topic modeling could be used to analyze Slashdot to detect dupes before they're posted?