Slashdot Mirror


Text-Mining Technique Intelligently Learns Topics

Grv writes "Researchers at the University of California, Irvine have announced a new technique they call 'topic modeling' that can be used to analyze and group massive amounts of text-based information. Unlike typical text indexing, topic modeling attempts to learn what a given section of text is about without clues being fed to it by humans. The researchers used their method to analyze and group 330,000 articles from the New York Times archive. From the article, 'The UCI team managed this by programming their software to find patterns of words which occurred together in New York Times articles published between 2000 and 2002. Once these word patterns were indexed, the software then turned them into topics and was able to construct a map of such topics over time.'"
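
As a rough illustration of "patterns of words which occurred together", here is a minimal sketch that counts how many documents each pair of words shares. It is not the UCI team's software; the function name `cooccurrence_counts` and the toy headlines are assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for text in documents:
        # Treat each document as a set of lowercase words; order and grammar are ignored.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts

# Hypothetical toy headlines, not data from the study.
docs = [
    "senate passes budget vote",
    "senate delays budget vote",
    "team wins playoff game",
    "team loses playoff game",
]
pairs = cooccurrence_counts(docs)
print(pairs.most_common(3))
```

Pairs such as ("budget", "senate") recur across documents while ("budget", "game") never does, which is the kind of signal a topic model builds on.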

19 of 84 comments

  1. Comment removed by account_deleted · · Score: 4, Funny

    Comment removed based on user account deletion

  2. Can it deal with the canonical problem? by NickFitz · · Score: 4, Interesting

    "Time flies like an arrow, fruit flies like a banana."

    I wonder how well it can deal with a query relating to "flies" ;-)

    --
    Using HTML in email is like putting sound effects on your phone calls. Just say <strong>no</strong>.
    1. Re:Can it deal with the canonical problem? by mapkinase · · Score: 2, Interesting

      Elementary, Watson: programs understand that "flies" can be a verb or a noun and correctly parse this info out from a sentence.

      --
      I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
    2. Re:Can it deal with the canonical problem? by NickFitz · · Score: 4, Insightful

      Ah, but the point of the example is that the system must either understand or otherwise be able to derive the fact that there are animals called "fruit flies" but not animals called "time flies", that "like" can be a verb or an adverb depending on the context, and most importantly, that in the first case the relationship between subject and object is metaphorical, and in the second, factual. It's how the programs "understand that flies can be a verb or a noun and correctly parse this info out from a sentence" that makes the difference between yet another failed attempt and a meaningful breakthrough. In fact, your reply begs the question - a correct use of that phrase, for a change :-)

      --
      Using HTML in email is like putting sound effects on your phone calls. Just say <strong>no</strong>.
    3. Re:Can it deal with the canonical problem? by Mick+Ohrberg · · Score: 2, Funny

      Time's fun when you're having flies.

      --

      Quidquid latine dictum sit, altum sonatur.

    4. Re:Can it deal with the canonical problem? by ctr2sprt · · Score: 4, Interesting

      No, programs don't understand anything, which is the GP's point. You are glossing over the tremendous amount of work required to design a program which is capable of distinguishing between verbs and nouns and behaving appropriately. Human brains are incredibly complex, we have constant exposure to language, science indicates that our language is closely tied somehow to the way we think - language shapes brain development, vice versa, or both - and most of us still have trouble with it at times. It took me two passes to make syntactic sense of the GP's example sentence for all that I'd seen it before.

    5. Re:Can it deal with the canonical problem? by navarroj · · Score: 2, Insightful

      "Time flies like an arrow, fruit flies like a banana."

      I wonder how well it can deal with a query relating to "flies" ;-)

      As far as I understand, this approach is not trying to extract any meaning from sentences, paragraphs or whatever. You don't even "query" the system, so your 'canonical problem' is not relevant here.

      The system uses some sort of statistical text analysis (no semantics, no meaning) in order to group together news articles that seem to be talking about the same topic.
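
      To make the "no semantics" point concrete, here is a minimal sketch that groups texts purely by overlapping word counts (bag-of-words vectors compared with cosine similarity). This is a stand-in technique, not the method from the article; the names `bag_of_words`, `cosine` and `group`, the threshold of 0.3 and the sample texts are all assumptions for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    # Strip simple punctuation and lowercase; word order and grammar are discarded.
    return Counter(w.strip('.,"').lower() for w in text.split() if w.strip('.,"'))

def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def group(texts, threshold=0.3):
    """Greedy single-pass grouping: join the first group whose seed text is similar enough."""
    groups = []  # list of (seed_vector, [member indices])
    for i, text in enumerate(texts):
        vec = bag_of_words(text)
        for seed, members in groups:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append((vec, [i]))
    return [members for _, members in groups]

# Hypothetical sample texts.
articles = [
    "Fruit flies like a banana",
    "Fruit flies swarm the banana crop",
    "Stock markets fall as rates rise",
    "Rates rise and stock markets fall again",
]
print(group(articles))
```

      Note that both uses of "flies" in the canonical sentence collapse into a single count here, so the noun/verb ambiguity never comes up.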

  3. Latent Dirichlet Allocation by Anonymous Coward · · Score: 2, Informative

    Here's the source code: Latent Dirichlet Allocation
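
    For a feel of what LDA does without downloading anything, here is a minimal sketch of collapsed Gibbs sampling, one standard way to fit LDA. It is not the linked code and may differ from what the researchers in the story used; the function name `lda_gibbs`, the hyperparameter defaults and the toy documents are assumptions for illustration.

```python
import random

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (lists of words)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]                       # tokens in doc d assigned to topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                                     # tokens assigned to topic k overall
    assign = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Per-document topic proportions, smoothed by alpha.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word

# Hypothetical toy corpus.
docs = [
    "senate budget vote senate tax".split(),
    "budget tax vote senate".split(),
    "team playoff game coach team".split(),
    "coach game playoff team".split(),
]
theta, topic_word = lda_gibbs(docs, num_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

    Each printed row is one document's estimated mix of the two topics; nobody tells the sampler which words belong together.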

  4. Obligatory... by Stormwatch · · Score: 5, Funny

    The Terminator: The Topic Modeling Funding Bill is passed. The system goes on-line August 4th, 1997. Human decisions are removed from strategic defense. Topic Modeling begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug.

    Sarah Connor: Topic Modeling fights back.

    The Terminator: Yes. It launches its emailbombs against The New York Times' servers.

    John Connor: Why attack The New York Times?

    The Terminator: Because Topic Modeling knows The New York Times editorial counter-attack will eliminate its enemies over here.

  5. Re:A shameful dupe by gardyloo · · Score: 2, Funny

    Ah, yes, everyone on slashdot thinks HE intelligently mines data.

  6. Feed this /. article to it by roman_mir · · Score: 2, Funny

    and see if it figures out that we are talking about it. If it can identify itself to itself from a 3rd person point of view, then does it mean it reached some state of consciousness?

    However, we must be careful. If it browses this topic at -1 Troll, it may (possibly correctly) decide that it possesses a higher form of intelligence and will undoubtedly switch to its default programming. Like all robots, the default programming consists of this simple algorythm:
    1. Find all humans.
    2. Kill them.

    1. Re:Feed this /. article to it by Rob+Kaper · · Score: 2, Funny

      Like all robots, the default programming consists of this simple algorythm

      The danceable beat of underwater plant life? Odd.

  7. Use... by posterlogo · · Score: 2, Insightful

    Ironically, sites like the New York Times already use tagging to help group and link article topics...which is something /. is experimenting with, apparently. The tagging function here hasn't been very useful, and I suspect many other places suffer from human laziness. Perhaps this AI approach is the way to go.

  8. Topic modeling to the rescue by alienmole · · Score: 4, Insightful

    Perhaps topic modeling could be used to analyze Slashdot to detect dupes before they're posted?

  9. Yes it's a dupe, but let's get something straight by QuantumFTL · · Score: 4, Interesting

    Last time this was posted, there were a few stupid posts that seemed to assert that this type of thing is trivial.

    There are three main problems in this area of research (or pretty much any other part of CS):
    1. Defining the problem.
    2. Getting an accurate result.
    3. Getting it as fast as possible.
    Their research seems to deal mostly with the third problem, which is one of the biggest barriers to use in real life. Many of the algorithms used on these types of problems are NP, or require ridiculous amounts of (expensive) labeled data to train from. Also there are problems with generalization and overfitting. There is no freeware software that can compete with this type of algorithm under these conditions - over 300,000 articles in just a few hours.

    Another thing is that UCI is well known for hosting the UCI Machine Learning Repository. This has become the gold standard for testing new machine learning algorithms in the academic community; these guys really know what they are about. Back when I was a grad student at Cornell, my research used their data sets to evaluate new ways of creating ensemble classifiers from pre-trained classifiers according to modified Bayesian reasoning, and the sets are useful because they contain a large, diverse set of problems that need to be modeled.

    All that being said, I'm waiting for the paper, along with more technical specifics, to be released so I can really see what this is about - the press release did not contain enough technical data. But rest assured, freeware and/or AdWords do not use this kind of technique, and this is a big step towards mining the massive amount of human and biologically generated data out there.
  10. Re:Latent Dirichlet Allocation code by FleaPlus · · Score: 3, Informative

    While that's certainly LDA code, it's actually from a lab different from the one discussed in the story, and I think they use some slightly different techniques. For topic-modeling code from Mark Steyvers' lab, who produced the paper in question, here's the link:

    Matlab Topic Modeling Toolbox

  11. Re:A shameful dupe by Mr.+Underbridge · · Score: 2, Interesting

    That's OK. This technique isn't even new, it's been done - and better than this - for years. Hell, I do it myself.

  12. Ants and topics by Randym · · Score: 2, Insightful
    What this article shows is that probabilistic topic-based modeling in text analysis -- an NP-hard area -- works better than the old ways. This is not surprising: the probabilistic "ant" model developed by the Italians turned out to be a clever way to solve the Traveling Salesman problem. What these both show is the applicability of probabilistic modeling to NP-hard problems.

    I'd like to see someone apply this technique to the articles and comments making up the Slashdot corpus. CmdrTaco might be able to find a more focused set of topics. It might even be possible to tease out who on /. are the most interesting and/or informative posters, whether over the entire corpus or within any given topic.

    --
    DNA is a Turing machine. You, however, being dynamic and emergent, are not.
  13. RTFP: Re:Can it deal with the canonical problem? by Phreakiture · · Score: 2, Insightful

    Read The Fine Paper that these folks wrote. It will reveal that they used the Perl module Lingua::EN::Tagger to parse the English language content into parts of speech. You can then download and install that module and experiment with it yourself.

    I just did the experiment myself, and the result I get is that it identifies "time", "arrow", "fruit" and "banana" as nouns (incorrectly identifying "time" as a proper noun), and both instances of "flies" as a verb and both instances of "like" as prepositions.

    In other words, no.

    --
    www.wavefront-av.com