Slashdot Mirror


Text Mining the New York Times

Roland Piquepaille writes "Text mining is a computer technique to extract useful information from unstructured text. And it's a difficult task. But now, using a relatively new method named topic modeling, computer scientists from University of California, Irvine (UCI), have analyzed 330,000 stories published by the New York Times between 2000 and 2002 in just a few hours. They were able to automatically isolate topics such as the Tour de France, prices of apartments in Brooklyn or dinosaur bones. This technique could soon be used not only by homeland security experts or librarians, but also by physicians, lawyers, real estate people, and even by yourself. Read more for additional details and a graph showing how the researchers discovered links between topics and people."
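The summary does not say which algorithm the UCI researchers used. As a rough illustration of what "topic modeling" usually means, here is a minimal sketch of latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling. The choice of LDA, the function name `lda_gibbs`, the hyperparameter values, and the four-document toy corpus are all assumptions made for this example, not details from the study.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not the researchers' code).

    docs is a list of token lists. Returns (doc_topic, topic_word), where
    doc_topic[d][k] counts tokens of document d assigned to topic k and
    topic_word[k][w] counts how often word w is assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how much this document uses it and
                # how strongly it is associated with this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Made-up miniature corpus; the real study used about 330,000 articles.
docs = [
    "tour france cyclist stage mountain cyclist".split(),
    "cyclist tour stage france race".split(),
    "brooklyn apartment price rent broker".split(),
    "apartment rent brooklyn price price".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top_words = [w for w, c in counts.most_common(3) if c > 0]
    print(f"topic {k}: {top_words}")
```

No topic labels are supplied: the sampler only sees which words occur together, which is what separates this approach from keyword search. On a corpus this small the result depends heavily on the seed and hyperparameters, so treat the printed topics as a demonstration rather than a finding.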

5 of 104 comments

  1. Mining? by Eudial · · Score: 5, Funny

    "Home at last after another long day in the salt^H^H^H^Htext mines.

    We lost four more miners today, bless their souls. The foreman kept insisting they'd dig another tunnel between bicycling and Tour de France. They told him it was too dangerous, but no... he never listens. One of these days... They've got us working 20 hour shifts in the abyss that is the text mines, barely pay us enough to afford the rent, I'm telling you, one of these days..."

    --
    GAAH! MY PRINTER IS ON FIRE!!! PUT IT OUT! PUT IT OUT!
  2. Support Vector Machine? by Uruviel · · Score: 5, Interesting

    I thought this was fairly easy to do with a Support Vector Machine. (http://en.wikipedia.org/wiki/Support_Vector_Machine) Or even simple decision trees by setting the threshold for certain words. (http://en.wikipedia.org/wiki/Decision_tree)

  3. Interesting by glowworm · · Score: 5, Interesting

    I have available to me quite a large database of historical research going back to 1991: freeform copies of emails between researchers and academics on a wide variety of subjects related to a specific topic from the 15th century. Dry stuff, but a very exciting topic.

    At the moment the data is mined with wildcard text searching, which means you need to know the subject before you can participate. It's a very valuable resource, but it's not used to its potential because of the clunky methods of interfacing with it.

    It will be quite interesting applying this technique to the dataset to see if unknown relationships become apparent or known relationships become clearer.

    The paper and samples indicate that this tool (if it does what it promises) might be able not only to work out the correlations between data points but also to create visual diagrams linking people, places and events quite well. A handy tool for my dataset.

    I'm now sitting here crystal ball gazing: what if we were to expand this to a 3D map? Displaying a resulting chart and allowing a researcher to hotlink to the data underneath it would be an interesting way to navigate a complex topic, more so than a text-based wildcard or fuzzy search. Of course I won't know if this is possible until I look into the program more, and I won't be able to look into the program more until I massage the dataset again ;) but it does open up some interesting possibilities.

    Click on the Anthony Ashcam box and see the hotlinking and unfolding of data specific to him. Drill in more... then more... and eventually get to a specific fact.

    The only problem will be that I would need to pre-compute all the charts. Oh well, one day ;)

    --
    Orationem pulchram non habens, scribo ista linea in lingua Latina
  4. Text mining is... by SlashSquatch · · Score: 5, Funny

    ...a load of grep.

    --
    Autonomous Retard -- Is your camp safe? UnsafeCamp.com
  5. Homeland Aftosa by Lord+Balto · · Score: 5, Interesting

    As William Burroughs suggested, the goal of the Aftosa Commission is not to rid the world of bovine aftosa. Its goal is to justify its existence and continue to enlarge its budget and its manpower until the world understands that bovine aftosa is such a critical issue that there needs to be a cabinet-level Office of Bovine Aftosa with a budget only surpassed by that of the military. No one in government ever does anything that could conceivably put them out of business. This is why relying on the military and the "defense" contractors to bring peace is such a dangerous activity.
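
For contrast with topic modeling, the idea in comment 2 of "setting the threshold for certain words" can be sketched as a simple keyword-count rule. The topic names, keyword lists, threshold value, and the function name `label_topics` below are hypothetical, chosen only to illustrate the approach.

```python
import re
from collections import Counter

# Hypothetical hand-picked keywords per topic (an assumption for this sketch).
TOPIC_KEYWORDS = {
    "cycling": {"tour", "france", "cyclist", "peloton", "stage"},
    "real estate": {"apartment", "brooklyn", "rent", "price", "broker"},
}


def label_topics(text, keywords=TOPIC_KEYWORDS, threshold=2):
    """Return the topics whose keywords occur at least `threshold` times in text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        topic
        for topic, words in keywords.items()
        if sum(counts[w] for w in words) >= threshold
    )


print(label_topics("The Tour de France stage was won by a cyclist"))
```

A rule like this only works if someone decides on the topics and keywords beforehand, which is the same limitation comment 3 describes for wildcard searching. A trained SVM or decision tree would learn the thresholds from labeled examples, but it would still need those labels up front.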