Slashdot Mirror


Mining Unstructured Data

jscribner writes "Data these days tends toward an unstructured form, be it text (like the web, email, or books), spoken word, or even DBs with a unique organization (and thus a discrete language). There's a new article on Unstructured Data in Think Research; it's an overview of the challenges, progress, and potential rewards in this area. I'm leaving it on your doorstep because, to me, it's a good launching point for discussion of several interesting possibilities: /. as a minable DB of ideas, email identified by interpretation rather than keywords, emotive XML, etc."

3 of 105 comments

  1. Google Made to Order by shalunov · Score: 3, Informative
    Some quotes from the press release:
    People actually vote their preferences by providing links to different documents. You may be able to determine that a page is authoritative because lots of people have found it important enough to have links to it. People explicitly create links from page one to page two, and if many people point to page two it looks like it is an important link to something.
    This Discoverylink(TM) search engine concept somehow sounds very familiar. Where could I have heard this innovative idea before? Or, as the press release asks, "Where did I read that?" Ah, yes!
  2. Nat. Language Understanding != Speech Recognition by thelenm · Score: 3, Informative

    A minor nitpick with the article: it seems to use the term "natural language understanding" as if it were mostly synonymous with "speech recognition". Actually, speech recognition is a subset of natural language understanding. NLU (or NLP, natural language processing) deals with all aspects of understanding human languages. In fact, most NLP is done with text, not speech.

    --
    Use Ctrl-C instead of ESC in Vim!
  3. creative uses by rnd() · Score: 3, Informative
    Some companies are doing creative things with this kind of technology.

    It makes you wonder how much of this is based on theoretical linguistics and formal semantics, and how much is based on good old-fashioned statistics and optimization.

    --

    Amazing magic tricks
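
The link-voting idea quoted in comment 1 (a page looks authoritative when many other pages link to it) can be sketched in a few lines of Python. This is only plain inbound-link counting over a made-up link graph. It is not PageRank and not whatever the press release's engine actually does, and the function name and the `*.example` pages are hypothetical.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count how many distinct pages link to each target page.

    `links` is an iterable of (source, target) pairs. Self-links and
    repeated links from the same source are ignored, so each page gets
    at most one "vote" per target.
    """
    unique_links = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in unique_links)


# Hypothetical toy link graph: (source page, target page).
links = [
    ("a.example", "b.example"),
    ("c.example", "b.example"),
    ("d.example", "b.example"),
    ("a.example", "c.example"),
    ("a.example", "c.example"),  # duplicate link, counted once
    ("b.example", "b.example"),  # self-link, ignored
]

counts = inbound_link_counts(links)

# Rank pages by number of inbound links, breaking ties by name.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(ranked)  # [('b.example', 3), ('c.example', 1)]
```

A real ranking would presumably also weight each vote by the importance of the page casting it; the raw count here treats every link equally.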