Slashdot Mirror


Post-Googleism At IBM With Piquant

kamesh writes "James Fallows of the New York Times reports on an interesting search technology that IBM is developing. IBM demonstrated a system called Piquant, which analyzed the semantic structure of a passage and thereby exposed 'knowledge' that wasn't explicitly there. After scanning a news article about Canadian politics, the system responded correctly to the question, 'Who is Canada's prime minister?' even though those exact words didn't appear in the article. What do you think?"

10 of 159 comments

  1. Latent Semantic Indexing by LISNews · · Score: 5, Informative

    They don't come out and say it, but it sounds like it's just a big ol' LSI system. It works really well for some types of searching, but I'm not sure such a thing would outperform Google as a general-purpose search engine.

    "Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent."

    1. Re:Latent Semantic Indexing by SpinyNorman · · Score: 5, Informative

      Actually it sounds more like CYC-lite.

      The LSI system, despite the name, knows nothing about semantics. It just ASSUMES that words that frequently occur near each other are semantically related.
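
      To make that co-occurrence assumption concrete, here is a minimal sketch of LSI on a toy corpus. It is not IBM's or NITLE's code: the five documents, the choice of k=2, and every function name are made up for illustration, and a small pure-Python power iteration on the document Gram matrix stands in for a real SVD routine.

```python
import math

def term_doc_matrix(docs):
    """Rows are terms, columns are documents; entries are raw term counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    return [[d.split().count(t) for d in docs] for t in vocab]

def top_eigenpairs(G, k, iters=500):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(G)
    G = [row[:] for row in G]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [1.0 + 0.01 * i for i in range(n)]  # fixed, arbitrary start vector
        for _ in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * G[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return pairs

def lsi_doc_vectors(docs, k):
    """Coordinates of each document in the rank-k latent space."""
    A = term_doc_matrix(docs)
    n = len(docs)
    # Gram matrix A^T A: entry (i, j) is the word overlap of documents i and j.
    G = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    pairs = top_eigenpairs(G, k)
    # Eigenvalues of A^T A are squared singular values of A.
    return [[math.sqrt(max(lam, 0.0)) * v[j] for lam, v in pairs] for j in range(n)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical toy corpus: documents 0 and 2 share no words directly,
# but both overlap with document 1.
docs = [
    "canada prime minister",
    "prime minister martin",
    "martin ottawa",
    "hockey ice",
    "ice puck",
]
vecs = lsi_doc_vectors(docs, k=2)
print(cosine(vecs[0], vecs[2]))  # politics vs. politics
print(cosine(vecs[0], vecs[3]))  # politics vs. hockey
```

      Documents 0 and 2 have no word in common, yet they end up nearly parallel in the reduced space because both overlap with document 1, while the hockey documents land on a separate axis. Nothing here knows what a prime minister is; the grouping falls out of shared vocabulary alone.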

    2. Re:Latent Semantic Indexing by ragnar · · Score: 5, Informative

      I thought the same when I read this. I've met the people at NITLE who are developing an implementation of LSI. It is impressive and they have a download of their software available via CVS. For persons interested in this area of research it is worth the while to look at what NITLE is doing.

      --
      -- Solaris Central - http://w
    3. Re:Latent Semantic Indexing by Haydn+Fenton · · Score: 5, Informative

      For other natural language processing technologies being researched and/or developed by IBM, check out their NLP Research page. They have quite a few different technologies in this field that I wasn't aware of.
      I for one, welcome our new semantic web overlords! It's really great to hear that something based on semantic technologies is finally breaking through. This could be the dawn of a new era :)
      I know this is very optimistic, but how long do you think it will be before we have something like this combined with something like Google? The amount of knowledge readily available will be mind-bogglingly huge. Imagine having a text service on your mobile: you text off a question and get an answer back immediately. All knowledge available everywhere, any time; that would be a great thing. Heck, it's even quite scary to think about.

    4. Re:Latent Semantic Indexing by Haydn+Fenton · · Score: 4, Informative

      Yep, a little digging shows that it does indeed use CYC technology, or at least it does according to this site (Google's HTML version of a PDF).

    5. Re:Latent Semantic Indexing by tootlemonde · · Score: 2, Informative

      it sounds like it's just a big ol' LSI System

      A Perl implementation of LSI can be found at Building a Vector Space Search Engine in Perl.

      However, there are at least three problems. First, it doesn't look like LSI can answer questions like "Who is the Prime Minister of Canada?"

      Second, the approach is patented by Telcordia Technologies.

      Third, there are scalability problems with LSI. The author of the Perl article writes:

      For all its advantages, LSI also presents some drawbacks. The poor scalability of the singular value decomposition (SVD) algorithm remains an obstacle to indexing very large collections. While techniques have been developed for making incremental updates to a scaled collection, these changes typically cannot exceed a certain threshold without triggering a rebuild [7,8]. These constraints make LSI ill suited to the kinds of large, rapidly changing document collections typically found on the Web.

      A further disadvantage to LSI is the difficulty in interpreting the underlying reduced term space [4]. This makes it difficult to select an optimum number of singular values to retain in the SVD for a given collection, or to allow domain expert adjustment of relevance values in the reduced space once the SVD has been calculated.

      As a result, the author is now pursuing something called Contextual Network Graphs and has written a Perl module that was updated as recently as last August.

    6. Re:Latent Semantic Indexing by otisg · · Score: 2, Informative

      Not only that, but this stuff is also patented, see: here.

      --
      Simpy
  2. Reg Free by bendelo · · Score: 5, Informative
  3. Won't work. by jameson · · Score: 5, Informative

    Disclaimer: I haven't read the article; however, I was somewhat involved in research in this field in late 2003 and early 2004.

    What the summary of the article claims IBM is developing-- a technology for getting the semantics behind an arbitrary sentence on the web-- is the Holy Grail of the discipline of Natural Language Processing (NLP) and very, very, very, _very_ far away at this point. Many people believe that we cannot ever get there (that's the point of a Holy Grail, after all), but I don't want to be quite as pessimistic (or realistic?) at this point.

    The problem here is that English (or any other natural language, for that matter) isn't SML, or Haskell, or some other language with a well-defined denotational semantics. Natural language suffers from at least three problems that make it very tough to gather anything useful from a given piece of text:

    (1) Grammar. Natural language isn't typechecked, and frequently uses incomplete sentences, which makes it hard to develop grammars (context-free, probabilistic context-free, Lambek-style/proofnet-style or whatever else people have come up with) for it.

    (2) Anaphora resolution. "I saw a dog on the street this morning. It was barking". So what's barking, the street or the dog? Grammatically, both would be possible; only with prior knowledge can we see that we're talking about the dog here.

    (3) Polysemy. What does "play" mean, taken by itself? It carries different meanings in "to play a game", "a play on words", "a terrific Shakespearean play" etc.; you might want to have a look at WordNet one of these days to get a feeling for this. Not knowing which meaning an arbitrary occurrence of "play" refers to means that you have to try lots of options when parsing, LSIing or whatever else you do (though most people simply ignore this problem in research today-- it's too hard to disambiguate words in practice).
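
    To give a feel for what disambiguating "play" involves, here is a toy sketch of gloss-overlap disambiguation (a heavily simplified take on the Lesk idea, which is not something discussed above). The three sense glosses are hand-written for this example rather than taken from WordNet, and the stopword list and function names are equally hypothetical.

```python
import re

# Hypothetical stopword list, just large enough for the examples below.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "by", "and", "or", "to", "we"}

# Hand-written glosses for three senses of "play"; a real system would
# look these up in a lexicon such as WordNet instead.
SENSES = {
    "game": "take part in a game or sport for fun",
    "theatre": "a dramatic work written for the stage and performed by actors",
    "wordplay": "a clever or witty use of words",
}

def content_words(text):
    """Lowercased words of the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def disambiguate(context, senses=SENSES):
    """Pick the sense whose gloss shares the most content words with the context."""
    ctx = content_words(context)
    return max(senses, key=lambda s: len(ctx & content_words(senses[s])))

print(disambiguate("We saw a terrific Shakespearean play performed on stage by great actors"))
print(disambiguate("The kids play a game of football for fun"))
```

    This only works because the example sentences happen to reuse words from the glosses. Reword either sentence slightly and the overlap drops to zero, which is one way to see why the problem is hard in open-domain text.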

    That's not all, of course-- try thinking of the need to deal with irony/sarcasm, metaphors, foreign words, the credibility of whichever sources you're using etc., and you'll get a pretty good feeling for why this is beyond merely being "hard". Of course, for very small problem domains (a "command language for naval vessels" was investigated in one paper I read a while ago-- those DARPA people definitely have too much money on their hands, but I digress), this can be solved, but general-purpose open-domain NLP is what you need to do a web search.

    It might happen in my lifetime, but I won't hold my breath for it.

    -- Christoph

  4. From factoids to facts by yfnET · · Score: 2, Informative

    As it happens, The Economist recently ran an article addressing some of these issues. The article also provides context and perspective that should be of interest to those participating in this discussion. For convenience, the full text is reproduced below; it is also accessible online (may require paid subscription).

    ----

    Computing

    From factoids to facts

    Aug 26th 2004 | REDMOND, WASHINGTON
    From The Economist print edition

    At last, a way of getting answers from the web

    WHAT is the next stage in the evolution of internet search engines? AltaVista demonstrated that indexing the entire world wide web was feasible. Google's success stems from its uncanny ability to sort useful web pages from dross. But the real prize will surely go to whoever can use the web to deliver a straight answer to a straight question. And Eric Brill, a researcher at Microsoft, intends that his firm will be the first to do that.

    Dr Brill's initial crack at the problem is a system called "Ask MSR" (MSR stands for Microsoft Research). This program uses information on web pages to respond to questions to which the answer is a single word or phrase--such as "When was Marilyn Monroe born?" Ask MSR starts by manipulating the question in various ways: by identifying the verb, for example, and then changing its tense or moving it into different positions in the sentence ("Marilyn was Monroe born", "Marilyn Monroe was born" and so on). The resulting phrases are then fed into a search engine, and documents containing matching strings of words are retrieved. It sounds a promiscuous strategy, but gibberish phrases produce few matches, so, as Dr Brill puts it, "being wrong is very cheap."

    Once accumulated, the pile of documents is scanned for possible answers, and these are ranked by frequency. In practice, the correct answer appears in one of the first three places around 75% of the time. That might not sound very good, but human intelligence provides a second filter, since wrong answers are often obvious. If you ask how many times Bjorn Borg won Wimbledon, for example, "1980" is not a plausible answer, but "5" is. If in doubt, clicking on an answer produces a list of links to pages which provide support for that answer.
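
    The rewrite-and-count loop described above can be sketched roughly as follows. This is an illustration under loud assumptions, not Microsoft's implementation: the snippets are hard-coded stand-ins for search results, the verb list and stopword list are invented, tense changes are skipped, and "possible answers" are simply the words that follow a matched rewrite.

```python
import re
from collections import Counter

STOPWORDS = {"in", "on", "the", "a", "of", "at"}  # hypothetical, minimal

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def rewrites(question, verbs=("was", "is", "were", "did")):
    """Drop the question word, then move the verb into every position."""
    words = tokens(question)[1:]
    verb = next(w for w in words if w in verbs)
    rest = [w for w in words if w != verb]
    return [" ".join(rest[:i] + [verb] + rest[i:]) for i in range(len(rest) + 1)]

def rank_answers(question, snippets):
    """Count the words that follow a matching rewrite; most frequent first."""
    phrases = [" " + p + " " for p in rewrites(question)]
    counts = Counter()
    for snippet in snippets:
        text = " " + " ".join(tokens(snippet)) + " "
        for phrase in phrases:
            pos = text.find(phrase)
            if pos >= 0:
                tail = text[pos + len(phrase):].split()
                counts.update(w for w in tail if w not in STOPWORDS)
                break  # one match per snippet is enough
    return counts.most_common()

question = "When was Marilyn Monroe born?"
# Made-up snippets standing in for documents returned by a search engine.
snippets = [
    "Marilyn Monroe was born in 1926 in Los Angeles.",
    "Actress Marilyn Monroe was born on June 1, 1926.",
    "Marilyn Monroe was born Norma Jeane Mortenson in 1926.",
    "Monroe died in 1962.",
]
print(rewrites(question))
print(rank_answers(question, snippets)[:3])
```

    Gibberish rewrites such as "marilyn monroe born was" match nothing and cost nothing, which is the "being wrong is very cheap" point. The year that recurs across matching snippets rises to the top, and the remaining candidates are left for the human filter.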

    Ask MSR is still a prototype, although Microsoft is trying to improve it and it may be launched commercially under the name AnswerBot. Dr Brill, meanwhile, has moved to a more difficult task. One of his most recent papers, written jointly with Radu Soricut of the University of Southern California, is entitled "Beyond the Factoid". It describes his efforts to build a system capable of providing 50-word answers to questions such as "What are the rules for qualifying for the Academy Awards?" This is harder than finding a single-word answer, but Dr Brill thinks it should be possible using something called a "noisy channel" model.

    Such models are already employed in spell-checking and speech-recognition systems. They work by modelling the transformation between what a user means (in spell-checking, the word he intended to type) and what he does (the garbled word actually typed). Just as a telephone line distorts the voice of the person at the other end of the line, this process can be thought of as being a noisy channel that transforms the user's intention into something rather different.

    By analysing many pairs of correct and mis-spelled words using statistical techniques, it is possible to predict how such transformations work in general cases. A system can then be designed to work the process backwards. Given a mis-spelled word, it can guess what that word is most likely to be a mis-spelling of.
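
    As a rough sketch of working the process backwards, the snippet below picks the intended word that maximises P(word) x P(typed | word). The word counts are invented, and the channel model is a deliberate simplification: instead of being learned from pairs of correct and mis-spelled words as described above, it assumes a fixed, hypothetical error probability per edit.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

# Invented word frequencies standing in for a language model.
WORD_COUNTS = {"spelling": 50, "selling": 40, "sapling": 5, "spewing": 2}

def correct(typed, counts=WORD_COUNTS, per_edit_error=0.05):
    """Return the word w maximising P(w) * P(typed | w).

    P(w) comes from the counts. P(typed | w) is approximated (unnormalised)
    as per_edit_error ** edit_distance, an assumption made for this sketch.
    """
    total = sum(counts.values())

    def score(word):
        prior = counts[word] / total
        channel = per_edit_error ** edit_distance(typed, word)
        return prior * channel

    return max(counts, key=score)

print(correct("speling"))
```

    Both "spelling" and "spewing" are one edit away from "speling", so the channel term alone cannot separate them; the prior does. "selling" is more common than "spewing" but two edits away, so the two factors trade off against each other.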

    Dr Brill's question-answering system does something similar. Many question-and-answer pairs exist on the web, in the form of "frequently asked questions" (FAQ) pages. Dr Brill trained his system using a million such pairs, to create a model that, given

    --
    The extreme centre is the paper's historical position. --Geoffrey Crowther