Slashdot Mirror


Text Mining the New York Times

Roland Piquepaille writes "Text mining is a computer technique to extract useful information from unstructured text. And it's a difficult task. But now, using a relatively new method named topic modeling, computer scientists from the University of California, Irvine (UCI), have analyzed 330,000 stories published by the New York Times between 2000 and 2002 in just a few hours. They were able to automatically isolate topics such as the Tour de France, prices of apartments in Brooklyn or dinosaur bones. This technique could soon be used not only by homeland security experts or librarians, but also by physicians, lawyers, real estate people, and even by yourself. Read more for additional details and a graph showing how the researchers discovered links between topics and people."

10 of 104 comments

  1. Sounds like an alternative to cross-referencing by liuyunn · · Score: 3, Interesting

    If this can be implemented into research in academia, is searching through decades of articles and abstracts finally going to be more efficient? Provided that they are electronic of course. Poor citations, inaccurate keyword tags, obscure sources...ahh reminds me of grad school.

  2. I guess it's one way to avoid registering. by Anonymous Coward · · Score: 1, Interesting

    But does it also ditch the ads?

  3. Support Vector Machine? by Uruviel · · Score: 5, Interesting

    I thought this was fairly easy to do with a Support Vector Machine (http://en.wikipedia.org/wiki/Support_Vector_Machine). Or even simple decision trees, by setting the threshold for certain words (http://en.wikipedia.org/wiki/Decision_tree).
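
    A minimal sketch of the word-threshold idea, in Python. The rule list, the labels, and the function names are made up for illustration; they do not come from the article or from any library. Note that the categories have to be chosen by hand here, whereas topic modeling is meant to discover them from the text.

```python
from collections import Counter


def word_counts(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def classify(text, rules, default="other"):
    """Return the label of the first rule whose word count meets its threshold.

    `rules` is a list of (word, threshold, label) tuples, checked in order.
    """
    counts = word_counts(text)
    for word, threshold, label in rules:
        if counts[word] >= threshold:
            return label
    return default


# Hypothetical hand-written rules: (word, minimum count, label).
RULES = [
    ("tour", 2, "cycling"),
    ("apartment", 1, "real estate"),
]
```

    With these rules, classify("an apartment in Brooklyn", RULES) returns "real estate", and a story that matches no rule falls through to "other".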

    1. Re:Support Vector Machine? by Ezubaric · · Score: 2, Interesting


      Well, even in variational inference, you have the problem of convergence. You have a huge EM algorithm and you're trying to maximize the complete likelihood of the data you have. Gibbs sampling doesn't have the same nice properties, but usually works pretty well in practice. Gibbs sampling is nice because it's usually easier to do, requires less memory (in variational methods you basically have to create a new probability model where everything is decoupled), and it's far easier to debug.
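
      To make the "easier to do" point concrete, here is a rough sketch of a collapsed Gibbs sampler, assuming an LDA-style topic model. This is not the UCI researchers' code; the function name, the hyperparameters alpha and beta, and their default values are assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for an LDA-style model over tokenized documents.

    `docs` is a list of token lists. Returns (topic_word, doc_topic, vocab),
    where the first two are count tables.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: apart from the assignments, this is all the state needed.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Take this token's current assignment out of the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalized probability of each topic for this token,
                # conditioned on every other token's assignment.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return topic_word, doc_topic, vocab
```

      The whole sampler is a few integer tables plus one loop, which is a large part of why it is easy to debug: after any sweep the counts should still add up to the number of tokens.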

      --

      ----------
      I am an expert in electricity. My father held the chair of applied electricity at the state prison.
  4. Interesting by glowworm · · Score: 5, Interesting

    I have available to me quite a large database of historical research going back to 1991: freeform copies of emails between researchers and academics on a wide variety of subjects relating to a specific 15th-century topic. Dry stuff, but a very exciting topic.

    At the moment the data is mined with wildcard text searching, which means you need to know the subject before you can participate. It's a very valuable resource, but it's also not used to its potential due to the clunky methods of interfacing with it.

    It will be quite interesting applying this technique to the dataset to see if unknown relationships become apparent or known relationships become clearer.

    Looking at the paper and samples, it seems this tool (if it does what it promises) might be able not only to work out the correlations between data but also to create visual diagrams linking people, places and events quite well. A handy tool for my dataset.

    I'm now sitting here crystal-ball gazing: what if we expanded this to a 3D map? Displaying a resulting chart and allowing a researcher to hotlink to the data underneath it would be an interesting way to navigate a complex topic, more so than a text-based wildcard or fuzzy search. Of course I won't know if this is possible until I look into the program more, and I won't be able to look into the program more until I massage the dataset again ;) but it does open up some interesting possibilities.

    Click on the Anthony Ashcam box and see the hotlinking and unfolding of data specific to him. Drill in more... then more... and eventually get to a specific fact.

    The only problem will be that I would need to pre-compute all the charts. Oh well, one day ;)

    --
    Orationem pulchram non habens, scribo ista linea in lingua Latina
  5. Re:Has anyone realized this by rgravina · · Score: 4, Interesting

    Yeah I agree :). Linguists have tried to develop new international languages to replace English (e.g. Esperanto) that have less cruft and fewer exceptions, but unfortunately very few people bother with them in practice, and keep using English :).

    Wouldn't it be cool if we all spoke a language which was expressive but at the same time had a machine-parsable grammar and had absolutely no silly exceptions or odd concepts like the masculine/feminine nouns that French and Italian have?

    I'm no expert on this, but I think linguists will tell you that we tend to modify/evolve language to suit our culture and circumstances, so any designed language (and even existing natural ones) will be modified into many different dialects as it is used by various cultures around the world.

    Still yeah, I am glad I'm a native speaker of English since it would be a pain to learn as a second language! Imagine all the special cases you'd have to memorise! Spelling, grammar exceptions that may not fit the definition you learned but native speakers use anyway etc.

  6. Homeland Aftosa by Lord Balto · · Score: 5, Interesting

    As William Burroughs suggested, the goal of the Aftosa Commission is not to rid the world of bovine aftosa. Its goal is to justify its existence and continue to enlarge its budget and its manpower until the world understands that bovine aftosa is such a critical issue that there needs to be a cabinet level Office of Bovine Aftosa with a budget only surpassed by that of the military. No one in government ever does anything that could conceivably put them out of business. This is why relying on the military and the "defense" contractors to bring peace is such a dangerous activity.

  7. Text Mining freeware already does this by saddino · · Score: 4, Interesting

    The demonstration is significant because it is one of the earliest showing that an extremely efficient, yet very complicated, technology called text mining is on the brink of becoming a tool useful to more than highly trained computer programmers and homeland security experts.

    On the brink? Q-Phrase has desktop software that does this exact type of topic modeling on huge datasets - and it runs on any Windows or OS X box. [Disclaimer: I work there] And there are a number of companies (e.g. Vivisimo/Clusty) that use these techniques as well.

    Going beyond the pure mechanics (this article speaks of research that is only groundbreaking in its speed of mining huge data sets), there are more interesting uses for topic modeling, such as its application to already loosely correlated data sets. A prime example: mining the text from the result pages that are returned from a typical Google search. One of our products, CQ web, does exactly this (and bonus: it's freeware).

    Using the example from the story: in CQ web, text mining the top 100 results from a Google search of "tour de france" takes about 20 seconds (via broadband) and produces topics such as:
    floyd landis
    lance armstrong
    yellow jersey
    time trial
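
    For readers wondering how two-word topics like these can fall out of raw page text, here is a deliberately crude sketch: count adjacent word pairs and skip pairs containing common function words. This is not how CQ web works (its method is not described here); the stop-word list and the function name are assumptions for illustration.

```python
from collections import Counter
import re

# Small hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "for", "on",
              "is", "was", "his", "at", "by", "with"}


def top_phrases(pages, n=4):
    """Return the n most frequent two-word phrases that contain no stop words."""
    counts = Counter()
    for page in pages:
        # Lowercase and keep only runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", page.lower())
        # Pair each token with its successor; pairs may span sentence breaks.
        for first, second in zip(tokens, tokens[1:]):
            if first not in STOP_WORDS and second not in STOP_WORDS:
                counts[(first, second)] += 1
    return [" ".join(pair) for pair, _ in counts.most_common(n)]
```

    Feeding it the text of the fetched result pages as a list of strings gives the most repeated word pairs, which is only a rough stand-in for a real topic model.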


    And going beyond simple topic analysis: using CQ web's "Dig In" feature (which provides relevant citations from the raw data) on floyd landis returns "Floyd landis has tested positive for high leves of testosterone during the tour de france." as the most relevant sentence from over 100 pages of unstructured text.

    So, while this is a somewhat interesting article, fact is, anyone can download software today that accomplishes much of this "groundbreaking" research and beyond.

  8. Do Try This At Home! by ejoe · · Score: 2, Interesting

    It doesn't come bundled with an analysis engine, but if you're looking to build your own corpus of material (e.g., by automating searches or harvesting large volumes of your research web pages) and you're on Mac OS X, check out the Anthracite web mining desktop toolkit... It makes it easy to build spidering and scraping systems, structure the output and feed it into a database like MySQL...all without requiring you to write a single line of code. Take that output and feed it into any number of the analysis and search systems on SourceForge or Freshmeat and you're going to get comparable results without all the fuss, although you should definitely write a press release about it! The Google API and regex support are built in, and you can even run the data through any UNIX command (e.g., grep or Perl) without leaving the program if you need even more. As for speed, the new release is going to feature a throttle because a few customers are getting overwhelmed by the URL loading throughput. Yes, by way of full disclosure, I wrote the software and that's why I'm always busy promoting it.

  9. Re:Has anyone realized this by spiffyman · · Score: 2, Interesting

    Linguists have tried to develop new international languages to replace English (e.g. Esperanto)...

    Actually, Esperanto was created by an ophthalmologist. In general, linguists don't attempt to replace languages with "better" ones. They recognize that linguistic change is natural and unavoidable. And, like other sciences, linguistics is largely occupied with observing and recording phenomena. They do not, as a rule, take a prescriptive point of view.

    ...we tend to modify/evolve language to suit our culture and circumstances, so any designed language (and even existing natural ones) will be modified into many different dialects as it is used...

    This is exactly why attempts to replace English (or any other presently used natural language) with constructed languages generally fail. Construction, and its attendant notions of maintenance and static-ness, preclude incorporation into actual use. Remember that Frege in the late 19th and early 20th centuries and Russell as late as 1919 were interested in describing an 'ideal' language, but they gave up in the end - Russell long after Frege, for various reasons. Frege did, however, manage to stabilize the symbology of formal logic, and Russell contributed a great deal to both mathematics and linguistics.

    The notion that English is somehow less grammatical than other languages is just bunk. All languages function on similar principles, and all languages are heavily governed by syntax. IANACS (I am not a computer scientist), but I've often wondered just why, exactly, the grammar of English is so hard to parse. It does contain exceptions, unlike the computer languages of which I am aware, but I don't know why those have proven insurmountable.

    --
    So you can laugh all you want to...