Slashdot Mirror


Text Mining the New York Times

Roland Piquepaille writes "Text mining is a computer technique to extract useful information from unstructured text. And it's a difficult task. But now, using a relatively new method named topic modeling, computer scientists from University of California, Irvine (UCI), have analyzed 330,000 stories published by the New York Times between 2000 and 2002 in just a few hours. They were able to automatically isolate topics such as the Tour de France, prices of apartments in Brooklyn or dinosaur bones. This technique could soon be used not only by homeland security experts or librarians, but also by physicians, lawyers, real estate people, and even by yourself. Read more for additional details and a graph showing how the researchers discovered links between topics and people."

104 comments

  1. Homeland security by Anonymous Coward · · Score: 4, Insightful

    Every time homeland security is mentioned as benefiting from a new technology, you should get a swift kick to the nuts. Goddam, there is more than just terrorism in this world.

    1. Re:Homeland security by Anonymous Coward · · Score: 0

      HOT DAMN! Switched RACK POWER units PAY for themselves QUICK. Click, click, PING!

      Sorry, just happy to get ping in 2 minutes @4:30am.

    2. Re:Homeland security by mrogers · · Score: 2, Insightful
      But the pretty graph clearly shows that some guy called MOHAMMED is the missing link between Religion and Terrorism - without this new technology, homeland security experts might have been kept in the dark about that.

      The graph also shows links between US_Military and AL_QAEDA, and between ARIEL_SHARON and Mid_East_Conflict. If only they'd had this technology when they were trying to justify the invasion of Iraq.

      "Look, Saddam Hussein has links to Al Qaeda! You can see it on the graph!"

      "Uh, Mister Vice-President, this graph is based on press conferences in which you repeatedly mentioned Saddam Hussein and Al Qaeda in the same breath. It may not have any statistical value."

      "Shut up and bring me my war britches, dimwit, the computer never lies!"

    3. Re:Homeland security by Gli7ch · · Score: 2

      Good sir, I wish I had some mod points left for you

      Seriously, every time you mention homeland security, every time you watch a special report on terrorism on your local current affairs program - that means the terrorists are winning.

      ...You don't support terrorism now do you?

    4. Re:Homeland security by 1u3hr · · Score: 2, Insightful
      The compulsory "Homeland Security" link makes me think of the story about a drunk who was crawling about on the sidewalk under a lamppost late one night. A Police Officer came up to him and inquired, "What are you doing?"
      The drunk replied, "I'm looking for my car keys."
      The Officer looked around in the lamplight, then asked the drunk, "I don't see any car keys. Are you sure you lost them here?"
      The drunk replied, "No, I lost them over there", and pointed to an area of the sidewalk deep in shadow.
      The policeman then asked, "Well, if you lost them over there, why are you looking over here?"
      The drunk looked at him and said, "Because the light is better over here."

      Searching for terrorists by datamining from the comfort of your cubicle is about as likely to be successful.

    5. Re:Homeland security by damian+cosmas · · Score: 0

      Searching for terrorists by datamining from the comfort of your cubicle is about as likely to be successful.

      Unless you have a metric crapload of intercepted communications to sort through for information that might be useful. Especially since the NSA is listening to everything.

      Remember that the darling of the Left, John Kerry, insisted that terrorism was a law enforcement problem, not a military problem. A large part of law enforcement is digging through all available information from the comfort of your desk, rather than carpet-bombing potential suspects.

      Besides, the Op-Ed section of the NYT is a good place to start looking for terror suspects ;)

    6. Re:Homeland security by LS · · Score: 1

      Yeah, and what's up with him mentioning Homeland Security in lowercase, as if it's already the fabric of our society, like the state department or some such. creepy...

      --
      There is a fine line between being a cultivated citizen and being someone else's crop. - A. J. Patrick Liszkie
    7. Re:Homeland security by Anonymous Coward · · Score: 0

      Disclaimer: I have a Masters in information science, but it's been a few years since I've done anything with it

      Homeland security is also a convenient "real-world application" one can use in the paper. Academic papers typically have the format where you first introduce the problem, explain why it's relevant and then show what you've done with it. Typically the relevancy is pointed out with practical examples, like going through news articles or helping to fight terrorists. (Unfortunately most editors balk if you just put in "This problem was researched because we had unspent government grants for this fiscal year.")

      If you look at data mining papers published in 2000-2001, you'll see a few of those list "Internet traffic logs" or "customer shopping baskets" as the motivation. Then the papers were useful in biochemistry and genetics.

      All this saves thinking from the potential reviewers, editors and readers of the paper (not to mention the newspaper reporters who occasionally happen to write a short article about recent advances in science). Actual progress occurs when someone looks at the paper and says "That's neat, I wonder what else I could do with this?".

      Take the current topic for instance; all the examples are about going through written documents. However, there are many more things out there that can be represented with a topic model. Let's say you count the number of emails in your company from one employee to another, and then apply the model. Bingo, you get some sort of informal team map based on the communications between people (before actually doing this, please consider the appropriate privacy and data protection legislation and moral issues in your area. I am an engineer, not a lawyer.)
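
A minimal sketch of the data preparation that comment describes, assuming a log of (sender, recipient) pairs. Each sender becomes a "document" and each recipient a "word", which is the bag-of-words shape a topic model expects. The names and the `sender_documents` helper are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical message log: (sender, recipient) pairs with made-up names.
messages = [
    ("alice", "bob"), ("alice", "bob"), ("alice", "carol"),
    ("bob", "alice"), ("dave", "erin"), ("erin", "dave"), ("erin", "dave"),
]

def sender_documents(pairs):
    """Treat each sender as a 'document' whose 'words' are the recipients."""
    docs = defaultdict(Counter)
    for sender, recipient in pairs:
        docs[sender][recipient] += 1
    return docs

docs = sender_documents(messages)
for sender in sorted(docs):
    # One line per sender: how often they wrote to each recipient.
    print(sender, dict(docs[sender]))
```

The resulting counts could then be fed to any topic model in place of word counts; whether the topics that come out resemble real teams is an open question for the data at hand.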

    8. Re:Homeland security by CaptDeuce · · Score: 1
      "Uh, Mister Vice-President, this graph is based on press conferences in which you repeatedly mentioned Saddam Hussein and Al Qaeda in the same breath. It may not have any statistical value."

      "Shut up and bring me my war britches, dimwit, the computer never lies!"

      "... That's my job!"

      --
      "Where's my other sock?" - A. Einstein
    9. Re:Homeland security by lysergic.acid · · Score: 1

      I doubt the "real" terrorists would speak in regular English, either. First, different languages have different grammatical rules and idioms. Second, they wouldn't talk openly about "BOMBING THE WHITEHOUSE"; they'd probably say it more discreetly in a semi-sophisticated code. This will just be another arms race--a [tele]communications one--and civilian casualties will be the main results.

      Unless I'm wrong, of course, and terrorists write like NY Times writers.

    10. Re:Homeland security by 1u3hr · · Score: 1
      A large part of law enforcement is digging through all available information from the comfort of your desk, rather than carpet-bombing potential suspects

      Did I suggest carpet bombing as an alternative? I think legwork is the only likely method. Real terrorists don't live their lives online, you might fill up Gitmo with idiots who spouted "Jihad" on some website. Osama gave up using his satellite phone years ago, they're well aware the NSA is snooping on every form of telephone or Internet communication. My point is that of course you do have to search the information available, but that's insufficient, it's more important to go out into the real world.

    11. Re:Homeland security by a55clown · · Score: 1

      you apparently don't understand the level of abstraction that topic modeling operates within. it doesn't matter what language it's written in; rtfa - it's about the relationship of words to each other.

    12. Re:Homeland security by lysergic.acid · · Score: 1

      you apparently don't understand the difference between grammatical structures/idiomatic expressions and "vocabulary."

    13. Re:Homeland security by Fordiman · · Score: 1

      Oh, so what about that. Five words: defocused artificial large scale understanding.

      --
      110100 1101000 1101000 1100110 0 1101111 1101000 1100011 1
    14. Re:Homeland security by Anonymous Coward · · Score: 1, Informative

      We did this 2 years ago and filed patents. We have a real-time implementation at http://wizag.com/ in the form of TopicClouds and TopicMaps. It is applied to hundreds of thousands of news stories and blogs (including Slashdot). Both the nodes and the links in the TopicMaps are clickable. Once you create an account, the system creates a personalized TopicCloud for each user.

  2. Go away Roland by Anonymous Coward · · Score: 0, Funny

    Nobody likes you.

  3. Plus some other words by stimpleton · · Score: 4, Funny

    For example, the model generated a list of words that included "rider," "bike," "race," "Lance Armstrong" and "Jan Ullrich."

    From this, researchers were easily able to identify that topic as the Tour de France.


    I imagine "testosterone", "doping", and "supportive mother", would have found the Tour de France topic even faster.

    --

    In post Patriot Act America, the library books scan you.
    1. Re:Plus some other words by kfg · · Score: 1

      It's the Tour de France, silly. Ya'll left out the most important word:

      Texas!

      KFG

  4. Funny by vllbs · · Score: 1, Insightful

    A relatively new method? A difficult task? Sorry, but these are almost laughable, even for a poor Spaniard like me.

    1. Re:Funny by kfg · · Score: 2, Funny

      You'll have to forgive them, these are computer scientists. Until now they have been completely unaware that natural language has grammar, syntax and that even individual words have structure and meaning; despite the complete absence of a metatag blizzard to inform them that [color]red is a [/color].

      KFG

    2. Re:Funny by vllbs · · Score: 1

      I'd studied Computer Science ergo I suppose I'm a computer scientist too. So save your ironic comments for the less-experienced souls around you (if any)

    3. Re:Funny by Anonymous Coward · · Score: 0

      I think you'll find computer scientists know this better than anyone.

      Cham

    4. Re:Funny by tsa · · Score: 1

      I found it funny. And I'm a nerd (just like everybody else here).

      --

      -- Cheers!

    5. Re:Funny by vllbs · · Score: 1

      Shit! I'm sorry, I understand your comment the wrong way. Now I'm remembering the first time I chat in English, when I was politely redirected to recycle classes at primary school...maybe hand by hand with the f. "text mining program".

    6. Re:Funny by kfg · · Score: 1

      Please forgive if I have given offense. The jibe was directed only against those specific computer scientists who can use the phrase "data mining unstructured text" without bursting into fits of giggles.

      KFG

    7. Re:Funny by andrewman327 · · Score: 1

      I think they developed this technology trying to find a link between computer scientist and girls. Sadly they were not successful.

      --
      Information wants a fueled airplane waiting at the hangar and no one gets hurt.
  5. Mining? by Eudial · · Score: 5, Funny

    "Home atlast after another long day in the salt^H^H^H^Htext mines.

    We lost four more miners today, bless their souls. The foreman kept insisting they'd dig another tunnel between bicycling and Tour de France. They told him it was too dangerous, but no... he never listens. One of these days... They've got us working 20 hour shifts in the abyss that is the text mines, barely pay us enough to afford the rent, I'm telling you, one of these days..."

    --
    GAAH! MY PRINTER IS ON FIRE!!! PUT IT OUT! PUT IT OUT!
    1. Re:Mining? by The_Wilschon · · Score: 1
      --
      SIGSEGV caught, terminating

      wait... not that kind of sig.
  6. Sounds like an alternative to cross-referencing by liuyunn · · Score: 3, Interesting

    If this can be implemented into research in academia, is searching through decades of articles and abstracts finally going to be more efficient? Provided that they are electronic of course. Poor citations, inaccurate keyword tags, obscure sources...ahh reminds me of grad school.

    1. Re:Sounds like an alternative to cross-referencing by Ezubaric · · Score: 1


      Already done. It's pretty cool to see how the psychology topic over time turns into the AI topic:

      http://www.cs.cmu.edu/~lemur/science/

      --

      ----------
      I am an expert in electricity. My father held the chair of applied electricity at the state prision.
  7. I guess it's one way to avoid registering. by Anonymous Coward · · Score: 1, Interesting

    But does it also ditch the ads?

    1. Re:I guess it's one way to avoid registering. by AndroidCat · · Score: 1

      Spammers will probably use it to locate key places on hot topics to put link-spam and other trash. (Tour de France & ONLINE-M3DS, that could work.)

      --
      One line blog. I hear that they're called Twitters now.
  8. Support Vector Machine? by Uruviel · · Score: 5, Interesting

    I thought this was fairly easy to do with a Support Vector Machine (http://en.wikipedia.org/wiki/Support_Vector_Machine), or even simple decision trees, by setting the threshold for certain words (http://en.wikipedia.org/wiki/Decision_tree).
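
The word-threshold idea can be sketched as a one-node decision rule. This is only an illustration: the keyword set, the threshold, and the function names are hypothetical and chosen by hand, which is exactly the step such a rule needs someone to supply in advance.

```python
import re

# Hypothetical keyword list and threshold, picked by hand for illustration.
CYCLING_WORDS = {"rider", "bike", "race", "peloton"}
THRESHOLD = 2

def count_keywords(text, keywords):
    """Count how many tokens of the text appear in the keyword set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in keywords)

def is_cycling_story(text):
    """One-node decision rule: enough cycling words means a cycling story."""
    return count_keywords(text, CYCLING_WORDS) >= THRESHOLD

print(is_cycling_story("The rider won the race on a new bike"))
print(is_cycling_story("Apartment prices in Brooklyn rose again"))
```

A rule like this only recognizes topics somebody already thought to write keywords for.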

    1. Re:Support Vector Machine? by NoTheory · · Score: 1

      I don't know about that, but this looks like it does something akin to Latent Semantic Analysis

      I'm not entirely sure what the novel component of this is. I think it might be the duration of time it takes to process the bodies of text (i should RTF papers to find out i suppose). Latent Semantic Analysis is really computationally expensive.

      --
      There are lives at stake here!
    2. Re:Support Vector Machine? by Anonymous Coward · · Score: 4, Informative

      Text modeling is mostly viewed as an unsupervised machine learning problem (as nobody will go through thousands of articles and tag each and every word, i.e. assign a topic to it). However support vector machines are very good classifiers for supervised data, e.g. digits recognition (you just learn your svm for a training sample of pictures of 9's tagged as a 9, the svm should then return the correct class for a new digit).
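
To make the supervised setting concrete, here is a rough sketch of a linear soft-margin SVM trained by plain subgradient descent on the regularized hinge loss. It is not a production solver. The two features (counts of two hypothetical words), the labels, and the hyperparameters `lr` and `lam` are assumptions for illustration; the point is that every training example has to arrive already tagged.

```python
def train_linear_svm(samples, epochs=100, lr=0.1, lam=0.01):
    """Subgradient descent on lam/2 * |w|^2 + hinge loss, one sample at a time."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin: move toward the sample and apply weight decay.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Outside the margin: only the regularizer contributes.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical tagged data: (count of "rider", count of "apartment") -> label.
# +1 stands for a cycling story, -1 for a real-estate story.
training = [
    ((3, 0), 1), ((2, 0), 1), ((4, 1), 1),
    ((0, 3), -1), ((0, 2), -1), ((1, 4), -1),
]

w, b = train_linear_svm(training)
print([predict(w, b, x) for x, _ in training])
```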

      The problem with this new method (called LDA, introduced by Blei, Jordan and Ng in 2003) is (besides other issues) the so-called inference step, as it is analytically intractable. Blei et al. solved this by means of variational methods, i.e. simplifying the model motivated by averaging-out phenomena. Another method (which as far as I understand was applied by Steyvers) is sampling, in this case Gibbs sampling. Usually the variational methods are superior to sampling approaches as one needs quite a lot of samples for the whole thing to converge.
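
For readers who want to see what the sampling route looks like, here is a minimal sketch of collapsed Gibbs sampling for LDA on a toy corpus. It is an illustration under stated assumptions, not the researchers' implementation: the hyperparameters `alpha` and `beta`, the iteration count, and the four tiny documents are all made up.

```python
import random

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Toy corpus (made up): two cycling snippets and two real-estate snippets.
docs = [
    "rider bike race rider bike".split(),
    "bike race rider race".split(),
    "apartment price brooklyn price".split(),
    "brooklyn apartment price apartment".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print("topic", k, top)
```

Nobody tags the tokens here; the sampler only shuffles topic assignments until words that co-occur tend to share a topic. On a corpus this small the result depends heavily on the seed and the hyperparameters.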

    3. Re:Support Vector Machine? by Anonymous Coward · · Score: 0
      I don't know about that, but this looks like it does something akin to Latent Semantic Analysis.

      It is similar. The problem with LSI is that in text mining your data is non-negative; you either have a word there or not. The reporter doesn't include a list of words he would never have used in the context of the article, so your data is truncated from the beginning. LSI doesn't take this into account, so the results aren't as good as with a model that does.

      Latent Semantic Analysis is really computationally expensive.

      Not really; it's basically a singular value decomposition, which is somewhat tricky, but well-understood and well-implemented in most math libraries. It's nowhere near the computational complexity of some Bayesian methods ;-)
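
As a rough illustration of the decomposition at the heart of LSA, the sketch below finds only the leading singular triplet of a small term-document matrix by power iteration. A real system would call a library SVD routine and keep many more components; the matrix, the term labels, and the function name are made up for the example.

```python
import math

def top_singular_triplet(A, iters=100):
    """Power iteration on A^T A: returns (sigma, u, v) for the largest singular value."""
    rows, cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        # u = A v, then w = A^T u, then renormalize w to get the next v.
        u = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(A[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

# Hypothetical term-document counts: rows are terms, columns are documents.
terms = ["rider", "bike", "race", "apartment", "price"]
A = [
    [2.0, 1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
]
sigma, u, v = top_singular_triplet(A)
print("leading singular value:", round(sigma, 3))
for term, loading in zip(terms, u):
    print(term, round(loading, 3))
```

The term loadings in `u` play the role of one latent "concept"; note that nothing in the decomposition forces them to be non-negative.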

    4. Re:Support Vector Machine? by docl · · Score: 2, Informative

      Right. And, unsupervised learning can be useful in some areas. Does anybody know how Google news works? It seems to work reasonably well, and seems to be solving the same problem.

      Also note that for most purposes, however, classification is becoming less of a big deal. Read Clay Shirky's article to understand why. Shirky talks about ontologies specifically, but the gist is the same -- basically, tagging each and every word isn't as crazy an idea if the end goal is just "I want to find something related", which is the most common case.

    5. Re:Support Vector Machine? by Ezubaric · · Score: 2, Interesting


      Well, even in variational inference, you have the problem of convergence. You have a huge EM algorithm and you're trying to maximize the complete likelihood of the data you have. Gibbs sampling doesn't have the same nice properties, but usually works pretty well in practice. Gibbs sampling is nice because it's usually easier to do, requires less memory (in variational methods you basically have to create a new probability model where everything is decoupled), and it's far easier to debug.

      --

      ----------
      I am an expert in electricity. My father held the chair of applied electricity at the state prision.
  9. You mean clusty.com? by SirStanley · · Score: 3, Insightful

    You mean they can group data by topic? Like clusty.com does when you search?

    I just read the stub of the article... because it seemed like it does exactly what clusty does and I don't care to read anymore.

    --
    --------========+++Dont Feed The Lab Techs+++========--------
  10. in other news by tompee · · Score: 4, Funny

    Google buys the University of California computer science school

    1. Re:in other news by Connie_Lingus · · Score: 1

      Yeah, and the California legislators were so happy with the deal that they threw in the rest of the State for a 15% stake in Google Earth.

      --
      never bring a twinkie to a food fight.
    2. Re:in other news by SporkLand · · Score: 1

      Too late, the irvine company already owns it. Who do you think Donald Bren, whom the school is named after, is?

  11. Hello Newman..... by ActiveMatx · · Score: 1

    "This research work has been presented by Newman and his colleagues during the IEEE Intelligence and Security Informatics Conference" .... Hello Newman....

  12. Has anyone realized this by ThePengwin · · Score: 2

    Has anyone realised that English is one of the most screwed up, stupid languages ever created? It's just been stretched and modified in any way possible and some aspects of it are practically useless. Maybe the world would be better off inventing a better language than analysing a horrible one :P

    1. Re:Has anyone realized this by rgravina · · Score: 4, Interesting

      Yeah I agree :). Linguists have tried to develop new international languages to replace English (e.g. Esperanto) that have less cruft and exceptions, but unfortunately very few people bother with them in practice, and keep using English :).

      Wouldn't it be cool if we all spoke a language which was expressive but at the same time had a machine-parsable grammar and had absolutely no silly exceptions or odd concepts like the masculine/feminine nouns that French and Italian have?

      I'm no expert on this, but I think linguists will tell you that we tend to modify/evolve language to suit our culture and circumstances, so any designed language (and even existing natural ones) will be modified into many different dialects as it is used by various cultures around the world.

      Still yeah, I am glad I'm a native speaker of English since it would be a pain to learn as a second language! Imagine all the special cases you'd have to memorise! Spelling, grammar exceptions that may not fit the definition you learned but native speakers use anyway etc.

    2. Re:Has anyone realized this by gclef · · Score: 1

      So, when are we switching to Esperanto?

    3. Re:Has anyone realized this by AndroidCat · · Score: 1

      Or Klingon.

      --
      One line blog. I hear that they're called Twitters now.
    4. Re:Has anyone realized this by spiffyman · · Score: 2, Interesting

      Linguists have tried to develop new international languages to replace English (e.g. Esperanto)...

      Actually, Esperanto was created by an ophthalmologist. In general, linguists don't attempt to replace languages with "better" ones. They recognize that linguistic change is natural and unavoidable. And, like other sciences, linguistics is largely occupied with observing and recording phenomena. They do not, as a rule, take a prescriptive point of view.

      ...we tend to modify/evolve language to suit our culture and circumstances, so any designed language (and even existing natural ones) will be modified into many different dialects as it is used...

      This is exactly why attempts to replace English (or any other presently used natural language) with constructed languages generally fail. Construction, and its attendant notions of maintenance and static-ness, preclude incorporation into actual use. Remember that Frege in the late 19th and early 20th centuries and Russell as late as 1919 were interested in describing an 'ideal' language, but they gave up in the end - Russell long after Frege, for various reasons. Frege did, however, manage to stabilize the symbology of formal logic, and Russell contributed a great deal to both mathematics and linguistics.

      The notion that English is somehow less grammatical than other languages is just bunk. All languages function on similar principles, and all languages are heavily governed by syntax. IANACS (I am not a computer scientist), but I've often wondered just why, exactly, the grammar of English is so hard to parse. It does contain exceptions, unlike the computer languages of which I am aware, but I don't know why those have proven insurmountable.

      --
      So you can laugh all you want to...
    5. Re:Has anyone realized this by EarthlingN · · Score: 1

      Some people assume that reading and parsing are the difficult part for computers. Which is understandable. It's not that easy for us. The study of words and language is a major part of our early schooling.

      Others (like yourself) realize this shouldn't be difficult for computers. You are correct. In truth, computers have little trouble keeping track of nouns, verbs, subjects, predicates... even most of the exceptions.

      BUT, The insurmountable part is giving the computer any kind of useful understanding of those words. Even the classification of "noun" or "verb" is useless to the computer. It has no concept of action versus object. Computers don't know up from down. We take our understanding for granted. We understand the world before we can even speak, then, later, we learn, relate, and attach words to that model of reality that is already cemented into our brains.

      Computers only have the words. (Not even words, more like large numbers.) How would you explain something to a blind, deaf, brain-in-a-jar using only numbers? The is the difficult part. To truly understand human language, you must first understand humans.

    6. Re:Has anyone realized this by The_Wilschon · · Score: 1
      Russell contributed a great deal to both mathematics and linguistics.
      And Exhibit C over here, gentlemen, is the Understatement of the Century!
      --
      SIGSEGV caught, terminating

      wait... not that kind of sig.
    7. Re:Has anyone realized this by Anonymous Coward · · Score: 0

      any tagu nun

      (That's "any day now" in esperanto according to the first online english/esperanto translator Google found)

    8. Re:Has anyone realized this by mdaniel · · Score: 1

      In the same vein as Esperanto, Lojban http://www.lojban.org/ is a culturally neutral, machine-parsable (written in Lex/Yacc, see the website) artificial language.

      It was originally designed to study the Sapir-Whorf hypothesis http://en.wikipedia.org/wiki/Sapir-Whorf_hypothesis, but has since developed a rich following from computer scientists as a potential human-computer interface tool. Err, at least that's why THIS computer scientist is interested in it. ;-)

          -- /v\atthew

    9. Re:Has anyone realized this by spiffyman · · Score: 1

      Russell contributed a great deal to both mathematics and linguistics.

      Technically, I was wrong there. He actually contributed a great deal to the philosophy of language, which is not at all the same thing as linguistics (though there is overlap).

      --
      So you can laugh all you want to...
    10. Re:Has anyone realized this by im_thatoneguy · · Score: 1

      I wouldn't pick on English. Any language in use is going to be abused and crafted. That's like saying "Isn't painting stupid, we need clear symbols to represent everything in the world." The moment somebody says "Hey." instead of "Hello James, how are you this morning?" all of the work put into the precise grammar is gone. Your wonderful language would also kill off the job of most authors, poets and editors, people who in my opinion advance and improve the language to which they are patrons every day.

      How about this. Try talking using this simple but elegant structure for a week. "[Indirect Object] [verb] [Direct Object]". The indirect object is optional. Machines everywhere will thank you.

  13. Interesting by glowworm · · Score: 5, Interesting

    I have available to me quite a large database of historical research going back to 1991, being freeform copies of emails between researchers and academics on a wide variety of topics to do with a specific subject from the 15th century. Dry stuff, but a very exciting topic.

    At the moment the data is mined with wildcard text searching, which means you need to know the subject before you can participate. It's a very valuable resource, but it's also not used to its potential due to the clunky methods of interfacing with it.
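
For contrast with topic modeling, here is a small sketch of what wildcard searching over such an archive amounts to. The archive contents and the `wildcard_search` helper are hypothetical; the only library call is the standard `fnmatch.fnmatchcase`.

```python
import fnmatch

# Hypothetical archive: message id -> subject line (made-up data).
archive = {
    1: "Wool prices in Calais",
    2: "Armour repair costs",
    3: "Calais garrison wages",
}

def wildcard_search(messages, pattern):
    """Return the ids of messages whose text matches a shell-style wildcard."""
    pattern = pattern.lower()
    return sorted(
        message_id
        for message_id, text in messages.items()
        if fnmatch.fnmatchcase(text.lower(), pattern)
    )

print(wildcard_search(archive, "*calais*"))
print(wildcard_search(archive, "*siege*"))
```

The search only finds what the pattern already names, which is the limitation described above: you have to know the subject before you can ask about it.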

    It will be quite interesting applying this technique to the dataset to see if unknown relationships become apparent or known relationships become clearer.

    Looking at the paper and samples would indicate this tool (if it does what it promises) might be able not only to work out the correlations between data but also to create visual diagrams linking people, places and events quite well. A handy tool for my dataset.

    I'm now sitting here crystal ball gazing; if we were to expand this to a 3D map. Say by displaying a resulting chart and allow a researcher to hotlink to the data underneath it would be an interesting way to navigate a complex topic, more so than a text based wild or fuzzy search. Of course I won't know if this is possible until I look into the program more, and I won't be able to look into the program more until I massage teh dataset again ;) but it does open up some interesting possibilities.

    Click on the Anthony Ashcam box and see the hotlinking and unfolding of data specific to him. Drill in more... then more... and eventually get to a specific fact.

    The only problem will be that I would need to pre-compute all the charts. Oh well, one day ;)

    --
    Orationem pulchram non habens, scribo ista linea in lingua Latina
    1. Re:Interesting by Anonymous Coward · · Score: 0

      At the moment the data is mined with wildcard text searching, which means you need to know the subject before you can participate. It's a very valuable resource, but it's also not used to its potential due to the clunky methods of interfacing with it.

      You might want to download a demo of "ConceptQ Pro" from Q-Phrase.com and see what a topic analysis says about your data.

  14. Artificial intelligence implications? by Anonymous Coward · · Score: 2, Informative

    An artificial intelligence could maybe use these new methods to grok all human knowledge contained in all textual material all over the World Wide Web.

    Technological Singularity -- here we come!

    1. Re:Artificial intelligence implications? by Anonymous Coward · · Score: 0

      An artificial intelligence could maybe use these new methods to grok all human stupidity contained in all textual material all over the World Wide Web.

  15. OMG SOMEONE INVENTED TEH SEARCH ENGINE! by Anonymous+Crowhead · · Score: 1

    We're doomed! DOOOOOOOOOOMED!

  16. Discourse Analysis? by AJ_Levy · · Score: 1

    So how is this not simply automated discourse analysis?

    --
    http://amishthrasher.blogspot.com/
  17. Hahahaha by cj5 · · Score: 1

    I have to agree with the first response (swift kick in the nuts to whomever came up with that). It's called Google or Regex, whatever you want to use to strip unwanted content from a search.

  18. They're late to the game. by alcohollins · · Score: 3, Insightful

    Not revolutionary. In fact, they're late.

    Google AdSense network has done this for years to serve contextually-relevant text ads across thousands of websites. Yahoo now, too.

    1. Re:They're late to the game. by bytesex · · Score: 1

      Yeah - Google ad sense gave this very slashdot topic (text _mining_) two advertisements, both having to do with shoving coal around the globe. I'd say we can use some advancements in this area.

      --
      Religion is what happens when nature strikes and groupthink goes wrong.
  19. grep? by muftak · · Score: 2, Funny

    Wow, they figured out how to use grep!

    1. Re:grep? by Teresita · · Score: 0

      muftak wrote:
      Wow, they figured out how to use grep!

      Shhhhh! You just gave away the NSA's secret method and concept for monitoring sedition at the Gray Lady!

  20. Has anyone realized this-Rebel yell. by Anonymous Coward · · Score: 0

    "Maybe the world would be better off inventing a better language than analysing a horrible one :P"

    So how would we read your posts then?

  21. Homeland security-Think of the children. by Anonymous Coward · · Score: 0

    Agreed. We jump all over politicians and their "think of the children", and we do no better with our "homeland security". As I posted elsewhere, the advance of technology can free a public as much as it can enslave it.

  22. Unsupervised text mining is not new by Anonymous Coward · · Score: 0

    It's hard from the press release to understand what's the innovation here. Certainly unsupervised text mining techniques have been around for a long time. Latent Semantic Indexing has been around for fifteen years.

  23. Text mining is... by SlashSquatch · · Score: 5, Funny

    ...a load of grep.

    --
    Autonomous Retard -- Is your camp safe? UnsafeCamp.com
  24. Hard to do? by accurrent · · Score: 1

    How is this hard to do? It seems like this could be done with relatively simple algorithms.

  25. Earlier modes of text mining by soapbox · · Score: 4, Informative

    Phil Schrodt at the U of Kansas has been doing something similar for years using The Kansas Event Data System (and its new update, TABARI). He started using Reuters news summaries to feed the KEDS engine back in the 1990s.

    Following Schrodt's work, Doug Bond and his brother, both recently of Harvard, produced the IDEAS database using machine-based coding.

    These types of data can be categorized by keywords or topic, though the engines don't try to generate links. The resulting data can also be used for statistical analysis in a certain slashdotter's dissertation research...

  26. Maybe the punch is in the naming *.* by Seiruu · · Score: 0, Offtopic

    "Search" is a word used by "commoners".
    "Google" is a word used by "commoners that play on 'puters".
    But "text mine"? Why, that's a word meant only for science's finest.

    A rose may still smell as beautiful when it's named shit, but who names their daughters "Shit", like, ever?

  27. The new method discovered by romka1 · · Score: 1

    The new method that they figured out was
    "site:newyorktimes.com "Tour de France" "

    --
    Visit my site @ http://www.madtorrent.com
  28. Teaching Granny to Suck Eggs... by Aeomer · · Score: 1

    We were doing this in 1989 with long free-form respondent answers to marketing questions to gain information about their actual preferences. Full natural language processing. We didn't patent the technique because we thought it was obvious - and we were too dumb to know how difficult a thing we had achieved. It worked wonderfully. Ours worked in Japanese, German, and Thai, too - I bet theirs only works in English, and American English at that. Of course it took us several months to teach it the decoding matrix for each language. I always think of this as the coolest computer-related thing I ever did.

  29. Homeland Aftosa by Lord+Balto · · Score: 5, Interesting

    As William Burroughs suggested, the goal of the Aftosa Commission is not to rid the world of bovine aftosa. Its goal is to justify its existence and continue to enlarge its budget and its manpower until the world understands that bovine aftosa is such a critical issue that there needs to be a cabinet level Office of Bovine Aftosa with a budget only surpassed by that of the military. No one in government ever does anything that could conceivably put them out of business. This is why relying on the military and the "defense" contractors to bring peace is such a dangerous activity.

  30. Text Mining freeware already does this by saddino · · Score: 4, Interesting

    The demonstration is significant because it is one of the earliest showing that an extremely efficient, yet very complicated, technology called text mining is on the brink of becoming a tool useful to more than highly trained computer programmers and homeland security experts.

    On the brink? Q-Phrase has desktop software that does this exact type of topic modeling on huge datasets - and it runs on any Windows or OS X box. [Disclaimer: I work there] And there are a number of companies (e.g. Vivisimo/Clusty) that use these techniques as well.

    Going beyond the pure mechanics (this article speaks of research that is only groundbreaking in its speed of mining huge data sets), there are more interesting uses for topic modeling, such as its application to already loosely correlated data sets. A prime example: mining the text from the result pages that are returned from a typical Google search. One of our products, CQ web, does exactly this (and bonus: it's freeware):

    Using the example from the story: in CQ web, text mining the top 100 results from a Google search of "tour de france" takes about 20 seconds (via broadband) and produces topics such as:
    floyd landis
    lance armstrong
    yellow jersey
    time trial


    And going beyond simple topic analysis: using CQ web's "Dig In" feature (which provides relevant citations from the raw data) on floyd landis returns "Floyd landis has tested positive for high leves of testosterone during the tour de france." as the most relevant sentence from over 100 pages of unstructured text.

    So, while this is a somewhat interesting article, fact is, anyone can download software today that accomplishes much of this "groundbreaking" research and beyond.

    1. Re:Text Mining freeware already does this by Anonymous Coward · · Score: 0

      Looks like a nice tool. Too bad the links to download the CQ-web don't show up when the page is loaded by firefox under OSX. They do display in Camino.

  31. Roland needs a kick in the nuts anyway by Anonymous Coward · · Score: 0

    Follow that up by reaching down his throat so you can rip out his spine and strangle him with it. Then tear off his head and go bowling with his dead skull. Have a few beers and enjoy the experience. After that, shit down the bloody stump of his neck - big, nasty, stinky beer shit.

    Just on general principles.

    There's no need to use any excuse.

  32. How much did that cost? by drsquare · · Score: 1

    330,000 articles at $3 each comes to $990,000, almost a million dollars for their data mining experiment. No wonder tuition costs are so high when this is what they're spending their money on!

  33. Heck, who needs a computer by Anonymous Coward · · Score: 0

    I can text-mine the NYTimes without even accessing the text:

    It's all Bush's fault.

    Business is evil.

    Tax cuts are bad.

    Republicans are fascists.

  34. That's what Google News does by Animats · · Score: 1

    Google News does a rather good job of associating all the stories on the same topic. I'd thought this was a solved problem.

  35. Fastest Way To Fix English by Anonymous Coward · · Score: 0

    The fastest way to correct many of the problems with the American English language would be to switch to a pure phonetic spelling instead of the misbegotten methods we're now using. In other words, spell it the way it sounds. E.g., Fone instead of Phone, Duk instead of Duck, and so on. Yes, it's going to create havik, but I think it will not only ease the learning of English (remember there are only 33/35 sounds in the language) but also improve speech recognition, as you will then be pronouncing a word as it's supposed to be spelt.

  36. Self-Perpetuation First... by Anonymous Coward · · Score: 0

    As William Burroughs suggested, the goal of the Aftosa Commission is not to rid the world of bovine aftosa. Its goal is to justify its existence and continue to enlarge its budget and its manpower until the world understands that bovine aftosa is such a critical issue that there needs to be a cabinet level Office of Bovine Aftosa with a budget only surpassed by that of the military. No one in government ever does anything that could conceivably put them out of business. This is why relying on the military and the "defense" contractors to bring peace is such a dangerous activity.

    The same could be said of registered charities.

  37. Do Try This At Home! by ejoe · · Score: 2, Interesting

    It doesn't come bundled with an analysis engine, but if you're looking to build your own corpus of material (e.g., by automating searches or harvesting large volumes of your research web pages) and you're on MacOSX, check out Anthracite web mining desktop toolkit... It makes it easy to build spidering and scraping systems, structure the output and feed it into a database like MySQL...all without requiring you to write a single line of code. Take that output and feed it into any number of the analysis and search systems on SourceForge or Freshmeat and you're going to get comparable results without all the fuss, although you should definitely write a press release about it! The Google API and regex support are built-in, and you can even run the data through any UNIX command (e.g., grep or Perl) without leaving the program if you need even more. As for speed, the new release is going to feature a throttle because a few customers are getting overwhelmed by the URL loading throughput. Yes, by way of full disclosure, I wrote the software and that's why I'm always busy promoting it.

  38. Chomsky Anyone? by TheStonepedo · · Score: 1

    Edward Herman and Noam Chomsky may or may not have had a fancy computerized search system, but association of loaded keywords was a major topic in Manufacturing Consent (ISBN 0375714499), where the influence of commercial interests on the media and government was analyzed using the New York Times. The great improvement in the rate at which text can be analyzed should make for an excellent third edition.

    --
    I'll be your candy shop of infinite deliciousity if you'll be my discotheque of endless rump-shaking.
    1. Re:Chomsky Anyone? by Anonymous Coward · · Score: 0

      Chomsky is good at repeating himself, not replicating his results. I wouldn't hold my breath for a more rigorous application of this in future editions of Manufacturing Consent.

  39. english subset by spectrokid · · Score: 1

    Some people have suggested combining both: make a new version of English, with dumbed-down grammar and a reduced vocabulary. Egyptian-taxi-driver English, if you want. That, I believe, would be a good solution, as everybody could learn it, and those with time/talent could "move on" to normal English.

    --

    10 ?"Hello World" life was simple then

  40. weird by m874t232 · · Score: 1

    Out of the thousands of papers published on this subject every year, Roland Piquepaille picks this one.

  41. Why is this news? by Lam1969 · · Score: 3, Informative

    This is interesting, but the idea has been around for more than 50 years, and practiced using automated computers (as opposed to human coders) since the 1960s. Lerner and de Sola Pool came up with the idea of using "themes" to analyze political texts at Stanford in 1954, and hundreds or even thousands of studies using automated text analysis tools have been performed since then. You can download a free text analysis tool called Yoshikoder, which will perform frequency counts of all words in a text, as well as dictionary analysis, and several other functions. So why is this news now? I think the press release is really leaving out some key information. I think the more relevant question that should have been addressed in the original release is how the text was prepared for analysis, because most websites and online databases of news articles (LexisNexis, Factiva, etc.) don't allow batch downloads of huge amounts of news text in XML or some other format that can be easily parsed by text analysis programs.
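
For anyone unfamiliar with the two operations mentioned above, here is a minimal Python sketch of word frequency counts and dictionary analysis. It is not Yoshikoder's code or its API; the function names, the sample sentence, and the `themes` dictionary are all made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count how often each word occurs in the text."""
    return Counter(tokenize(text))


def dictionary_analysis(text, dictionary):
    """Sum the word counts for each category in a hand-built dictionary.

    `dictionary` maps a category name to the set of words that belong to it.
    """
    counts = word_frequencies(text)
    return {
        category: sum(counts[word] for word in words)
        for category, words in dictionary.items()
    }


# Hypothetical sample text and theme dictionary, for illustration only.
sample = "The senate passed the budget. The budget cuts taxes, and taxes fund the army."
themes = {"economy": {"budget", "taxes"}, "military": {"army", "war"}}

freqs = word_frequencies(sample)
theme_counts = dictionary_analysis(sample, themes)
print(freqs.most_common(3))
print(theme_counts)
```

Note that the categories have to be defined by hand in advance, which is the main practical difference from methods that learn topics from the text.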

  42. brief explanation of the method by jrtom · · Score: 4, Informative
    I'm a PhD student in the research group that worked on this. My research is somewhat different (machine learning and data mining on social network data sets) but I've gone to a lot of meetings and presentations on this work, and I've used the model they're describing in my own research. Certainly people have worked on document classification before, but posters who are suggesting that this isn't new don't understand what this method accomplishes. For example:
    • basically, the model assigns a probability distribution over topics to each document
      i.e., documents aren't assigned to a single topic (as in latent semantic analysis (LSA))
    • topics are learned from the documents automatically, not pre-defined
      this means, incidentally, that they're not automatically labeled, although a list of the top 5 words for a topic generally characterizes it pretty well.
    • the technique can learn which authors are likely to have written various pieces of a given document, or which cited documents are likely to have contributed most to this document
      side benefit: you can also discover misattributions (e.g., authors with the same name)
    For a good high level description of what these models are doing, see Mark Steyvers' research page (MS is one of the authors); that page also has links to a number of the preceding papers. Those interested in seeing what the output of a related model looks like might like to check out the Author-Topic Browser.
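
The first two bullets can be illustrated with a rough sketch of a collapsed Gibbs sampler for a basic LDA-style topic model in plain Python. This is a textbook-style illustration, not the research group's code: the function name `lda_gibbs`, the hyperparameter values, and the toy documents are assumptions, and a real run on 330,000 articles would need a far more efficient implementation.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit a small topic model; return per-document topic mixtures and top words."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]       # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(num_topics)]  # times word v is assigned to topic k
    topic_total = [0] * num_topics                     # total tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Weight of each topic for this token, given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Each document gets a probability distribution over topics, not one label.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    # Topics are unlabeled; the most frequent words per topic characterize them.
    top_words = [
        [vocab[v] for v in sorted(range(V), key=lambda v: -topic_word[k][v])[:5]]
        for k in range(num_topics)
    ]
    return theta, top_words


# Toy corpus, made up for illustration.
docs = [
    ["tour", "france", "cycling", "armstrong"],
    ["cycling", "tour", "jersey", "france"],
    ["brooklyn", "apartment", "rent", "prices"],
    ["rent", "apartment", "prices", "brooklyn"],
]
theta, top_words = lda_gibbs(docs, num_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
print(top_words)
```

Each row of `theta` sums to 1, which is the "probability distribution over topics" for that document, and `top_words` is the kind of top-5 word list used to characterize an otherwise unlabeled topic.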
    1. Re:brief explanation of the method by Anonymous Coward · · Score: 0

      None of the points you raised are new. We had a real-time implementation two years ago with more capabilities. Now we personalize this for each user based on the user's interest (learned from the user's attention gestures). Our system http://wizag.com/ does this for hundreds of thousands of news sources and blogs in near real-time and personalizes it for each user.

    2. Re:brief explanation of the method by jrtom · · Score: 1

      If you want to read the papers I pointed you to, and become specifically acquainted with the techniques and their advantages, then we can talk. Otherwise, in the absence of any technical papers that describe your technology (of which there seem to be none on your website), I don't see any particular reason why I should pay further attention to your anonymous claims.

    3. Re:brief explanation of the method by mithras+the+prophet · · Score: 1

      How does this differ from Andrew McCallum at UMass Amherst's work on AT (Author-Topic) and ART (Author-Recipient-Topic) models? I think he uses a generative model assuming each document has a Dirichlet distribution over topics, and uses Gibbs sampling to infer the parameters. I'll have to read the paper, obviously, but some plain explanation would be useful. cheers
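
The generative assumption described in this comment can be sketched as follows. This is only an illustration of a Dirichlet-based document generator, not code from McCallum's or Steyvers' work: the `topics` table, the value of `alpha`, and the function names are hypothetical. Inference (e.g. Gibbs sampling) runs this story in reverse to recover the hidden mixtures from observed words.

```python
import random


def sample_dirichlet(rng, alphas):
    """Draw from a Dirichlet distribution by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(rng, topics, alpha, length):
    """Generate one document: draw a topic mixture, then draw each word via a topic."""
    theta = sample_dirichlet(rng, [alpha] * len(topics))
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to the document's mixture.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a word from that topic's word distribution.
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical topic-word distributions, for illustration only.
topics = [
    {"tour": 0.4, "france": 0.3, "cycling": 0.3},
    {"brooklyn": 0.4, "apartment": 0.3, "rent": 0.3},
]
rng = random.Random(42)
theta, words = generate_document(rng, topics, alpha=0.5, length=8)
print([round(p, 2) for p in theta], words)
```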

      --
      four nine eighteen twenty-7 thirty-nine forty-7 fiftyeight sixty-nine seventy-9 eighty-8 one-hundred-and-nine one-twenty
    4. Re:brief explanation of the method by jrtom · · Score: 1

      The Author-Topic model is actually due to Steyvers et al. at UC Irvine. McCallum's contribution was the Author-Recipient-Topic model, which extended the AT model to the domain of directed communications. The AT model is actually very closely related to Steyvers' topic model. I recommend reading the summaries on his page referenced above (in my original comment).

  43. Re:Homeland security & NY Times by grossvogel · · Score: 1

    i heard on the (fox) news that NY Times writers ARE the terrorists.

  44. A solution already exists by dino213b · · Score: 1

    The Klingon Language Institute

    http://www.kli.org/

  45. Re:taxed minding what their dough is being spent o by Savantissimo · · Score: 1

    I wonder how the classifier program would cope with text like that in the parent post... probably sprain its parser, or something.

    --
    "Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry
  46. Context-sensitive adaptive parsing by Savantissimo · · Score: 1

    Context-sensitive adaptive parsing seems to be effective in parsing English even with very small (http://www.sand-stone.com/Meta-S.htm for an introduction. (The 2nd reference is on natural language parsing.)

    --
    "Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry