Post-Googleism At IBM With Piquant

Latent Semantic Indexing by LISNews · 2004-12-26 01:07 · Score: 5, Informative

They don't come out and say it, but it sounds like it's just a big ol' LSI System. It works really well for some types of searching, but I'm not sure if such a thing would outperform Google for a general-purpose search engine.

"Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent."
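A rough Python sketch of the "words in common" idea from the quote, for anyone who wants to poke at it. It covers only the vector-space half of the story: real LSI also runs a singular value decomposition over the term-document matrix, which this toy skips. The function names and the three sample documents are made up for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Count lowercase word occurrences in a document."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini collection, invented for this example.
docs = {
    "d1": "search engine index keywords documents",
    "d2": "index documents by keywords for search",
    "d3": "maple syrup pancake recipe",
}
vectors = {name: term_vector(text) for name, text in docs.items()}

print(cosine_similarity(vectors["d1"], vectors["d2"]))  # many shared words
print(cosine_similarity(vectors["d1"], vectors["d3"]))  # no shared words
```

Documents d1 and d2 share most of their words and score as close; d3 shares none and scores as distant, without the code knowing what any word means.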

Re:Latent Semantic Indexing by SpinyNorman · 2004-12-26 01:21 · Score: 5, Informative

Actually it sounds more like CYC-lite.

The LSI system, despite the name, knows nothing about semantics. It just ASSUMES that words that frequently occur near each other are semantically related.
Re:Latent Semantic Indexing by ragnar · 2004-12-26 01:23 · Score: 5, Informative

I thought the same when I read this. I've met the people at NITLE who are developing an implementation of LSI. It is impressive and they have a download of their software available via CVS. For persons interested in this area of research it is worth the while to look at what NITLE is doing.

--
-- Solaris Central - http://w
Re:Latent Semantic Indexing by timeOday · 2004-12-26 01:29 · Score: 4, Interesting

I'm not sure if such a thing would outperform Google for a general-purpose search engine.
The short answer is no, because traditional information retrieval methods like LSI are easily fooled by spammer tricks like keyword stuffing.
The genius behind Google's success was paying *less* attention to the content of a page when categorizing it, and relying on links *to* the page instead. Why? Because of spammers.
Think about hiring for a job. You don't limit yourself to interviews with candidates, because they're highly motivated to deceive you. So you look for references. Certification is an example of this - somebody besides the person himself who will vouch for his competence. An even better reference is somebody you know and trust who thinks highly of the individual (which is why personal networking is so important to getting hired).
Google's PageRank is analogous. Instead of looking at the content of a page, you rely heavily on links to the page, especially links from more trusted sources. This helps defeat spammers, who use all manner of tricks to make their crap look good to search engine spiders.
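To make the "links from trusted sources" idea concrete, here is a textbook-style power-iteration sketch in Python. It is not Google's actual ranking code; the link graph, the page names, and the iteration count are all invented for illustration, and 0.85 is just the commonly cited damping value.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over a dict of page -> list of outlinks."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Each page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over every page.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: "spam" links out, but nobody links to it.
links = {
    "trusted": ["article"],
    "blog": ["article", "trusted"],
    "article": ["trusted"],
    "spam": ["article"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Whatever the spam page says about itself never enters the calculation; with no inbound links it stays at the minimum score.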
Re:Latent Semantic Indexing by Haydn+Fenton · 2004-12-26 01:57 · Score: 5, Informative

For other natural language processors being researched and/or developed by IBM, check out their NLP Research page. They have quite a few different technologies in this field, which I wasn't aware of.
I, for one, welcome our new semantic web overlords! It's really great to hear that something based on semantic technologies is finally breaking through. This could be the dawn of a new era :)
I know this is very optimistic, but how long do you think it will be before we have something like this combined with something like Google? The amount of knowledge readily available will be mind-bogglingly huge. Imagine having a text service on your mobile: you text off a question to something and get an answer back immediately. All knowledge available everywhere, any time; that would be a great thing. Heck, it's even quite scary to think about it.
Re:Latent Semantic Indexing by Haydn+Fenton · 2004-12-26 02:03 · Score: 4, Informative

Yep, a little digging shows that it does indeed use CYC technology, or at least it does according to this site (Google's HTML version of a PDF).
Re:Latent Semantic Indexing by tootlemonde · 2004-12-26 02:51 · Score: 2, Informative

it sounds like it's just a big ol' LSI System
A Perl implementation of LSI can be found at Building a Vector Space Search Engine in Perl
However, there are at least three problems. First, it doesn't look like LSI can answer questions like "Who is the Prime Minister of Canada?"
Second, the approach is patented by Telcordia Technologies.
Third, there are scalability problems with LSI. The author of the Perl article writes:

For all its advantages, LSI also presents some drawbacks. The poor scalability of the singular value decomposition (SVD) algorithm remains an obstacle to indexing very large collections. While techniques have been developed for making incremental updates to a scaled collection, these changes typically cannot exceed a certain threshold without triggering a rebuild [7,8]. These constraints make LSI ill suited to the kinds of large, rapidly changing document collections typically found on the Web.
A further disadvantage to LSI is the difficulty in interpreting the underlying reduced term space [4]. This makes it difficult to select an optimum number of singular values to retain in the SVD for a given collection, or allow domain expert adjustment of relevance values in the reduced space once the SVD has been calculated.

As a result, the author is now pursuing something called Contextual Network Graphs and has written a Perl module that was updated as recently as last August.
Re:Latent Semantic Indexing by MasonMcD · 2004-12-26 04:10 · Score: 2, Insightful

From the article:

MR. CICCOLO, the search strategist, said that in a way his team was trying to match - and reverse - what Google has achieved. "As Google use became widespread, people began asking why it was so much easier to find material on the external Web than it was on their own computers or in their company's Web sites," he said. "Google sets a very high standard for that Web. We would like to set the next standard, so that people will find it so easy to do things at work that they'll wonder why they can't do them on the Internet."

They seem to be explicitly targeting intranets or known good databases, so the spammer issue might be moot.

This raises another issue, however. Will this technology become so useful as to lead to the bad old days of proprietary information dbs a la Lexis/Nexis? I'm assuming the indexing will have to take place on company-owned servers.
Re:Latent Semantic Indexing by St.+Arbirix · 2004-12-26 10:34 · Score: 2, Funny

They don't come out and say it, but it sounds like it's just a big ol' LSI System.

Actually they did that on purpose. The press release was actually a test for Piquant to see if it could figure out that it was really just a rehashed older idea.

--
Direct away from face when opening.
Re:Latent Semantic Indexing by otisg · 2004-12-26 13:50 · Score: 2, Informative

Not only that, but this stuff is also patented, see: here.

--
Simpy

Wow by setagllib · 2004-12-26 01:08 · Score: 4, Insightful

That's pretty impressive. It takes quite a clever AI to read between lines and connect concepts, but I have to wonder how much of its 'understanding' was hard-coded rather than purely abstract. Would it be trivial to just stick in another language database and have it read translations of the article the same way?

Nevertheless it makes me feel like all the programming and design I've ever done is pathetic and I will never amount to anything. That's how it is in the software industry - always someone out there who makes you look bad.

--
Sam ty sig.

Re:Wow by EpsCylonB · 2004-12-26 01:13 · Score: 2, Insightful

That's how it is in the software industry - always someone out there who makes you look bad.

That's how it is in life.
Re:Wow by smchris · 2004-12-26 02:00 · Score: 2, Interesting

I have to wonder how much of its 'understanding' was hard-coded rather than purely abstract.

Baby steps, but the sort of essential baby steps that accumulate real technological progress. When the system discovers its _own_ non-trivial and useful rules, when it spontaneously parses our input to reply upon a self-generated "Oh, you mean......", then it gets scary.

Epistemology is a big word.
Re:Wow by forkazoo · 2004-12-26 04:01 · Score: 2, Funny

lachlan@localhost $ analyse -q "What is the meaning of Life, the Universe, and Everything?"
42

lachlan@localhost $ analyse -q "Is there a God?"
There is now!

Reg Free by bendelo · 2004-12-26 01:09 · Score: 5, Informative

Reg-free link

Sounds impressive by Timesprout · 2004-12-26 01:09 · Score: 4, Funny

Till you realise the computer answered 'some asshole' which could be any prime minister in the world really.

--
Do not try to read the dupe, that's impossible. Instead, only try to realize the truth
What truth?
There is no dupe

Trust Issue by Flamefly · 2004-12-26 01:16 · Score: 5, Interesting

On a global scale this system tends to fall apart: there is a constant issue of trust when dealing with what looks, to me, to be the holy grail of the semantic web.

What if 2 sites said the Prime Minister of Canada was Santa? If they explicitly said it, would that overwrite the linked information? How would the system know what is right? You can't always just pick the majority answer, so you need to set up little areas of trust: "I trust www.thisplace.com and everything it says", and that site in turn will say "I trust www.overhere.com". But who allocates the trust? Couldn't those people be biased?

The semantic web will have a fantastic impact on the world, but the trust issue is something that needs to be addressed, and I don't see how it can ever be done globally.

More likely we would have systems like this for individual sites, or intranets, trusted circles that would be unlikely to contradict themselves.

Hopefully one day, if we truly get a global semantic web, we can see if the answer really is 42 :]

Re:Trust Issue by ctr2sprt · 2004-12-26 01:47 · Score: 4, Interesting

All search engines return a bunch of results, ordered by which ones they think most likely address your search terms. One very simple way of ranking the results is popularity (number of pages with the same answer to your question). You could fine-tune the popularity index with a Google-ish reference counting algorithm.
One of the neatest approaches of this technology, I think, is the ability to eliminate search results. Anyone who's ever used Google to troubleshoot a problem knows that the first thirty or forty matches will all be the same: web mirrors of mailing lists or USENET posts. Using a vaguely semantic technology like this, Google could say, "Hey, all these pages are effectively identical" and collapse them into a single result.
This would be terribly useful for me, since I usually start my troubleshooting searches with an error message. Error messages in the Unix world being quite standardized, this nets me at least ten irrelevant "threads." Since each "thread" is duplicated about ten times in the Google results, that means the question I'm actually asking may not appear until page 5 or later. But using result grouping like this - which Google tries and is generally unsuccessful at - would mean I'd see my question asked on the first or second pages. Big improvement.
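A minimal sketch of the result-collapsing idea, assuming nothing smarter than word shingles and Jaccard overlap. That is a purely lexical stand-in, not the semantic grouping described above, and the threshold, function names, and sample hits are all hypothetical.

```python
def shingles(text, size=3):
    """Set of overlapping word n-grams ("shingles") from a text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Overlap between two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def collapse_duplicates(results, threshold=0.6):
    """Group results whose shingle sets overlap above the threshold.

    Each group is a list of result texts; the first entry acts as the
    representative that later results are compared against.
    """
    groups = []
    for text in results:
        candidate = shingles(text)
        for group in groups:
            if jaccard(candidate, shingles(group[0])) >= threshold:
                group.append(text)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([text])
    return groups


# Hypothetical hits: two mirrors of one mailing-list post, plus one other page.
results = [
    "error: cannot open shared object file no such file or directory",
    "Error: cannot open shared object file no such file or directory",
    "how to configure the dynamic linker search path on unix",
]
groups = collapse_duplicates(results)
print(len(groups), "distinct results out of", len(results))
```

The two mirrored posts fall into one group, so a results page would show two entries instead of three.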
Another nifty trick would be an actual, working "related pages" link. So let's say I find my question, but, as is all too common, it's a question without an answer. I click on the link, the search engine does its magic, and it pulls up (perhaps) technical details on the software in question or alternate solutions to my problem. This is definitely going to be harder to implement than my other idea (perhaps even impossible for now), but it'd be really nice. It could make navigating the Internet like navigating Wikipedia or amazon.com.
Ah well. I can dream.

I wonder... by Raul654 · 2004-12-26 01:17 · Score: 4, Interesting

Using Google means that this would have to contend with a lot of noise - looking for one nugget of information on the internet will tend to yield a low signal-to-noise ratio. I wonder what would happen if instead, you were to run it using Wikipedia as a back end (full disclosure - I'm a Wikipedia admin). There'd be less information, but I suspect the quality of the results would be better.

--

To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton

Prolly a hand-picked question by Ancient_Hacker · 2004-12-26 01:22 · Score: 3, Insightful

One example is meaningless. To get a realistic idea of how useful this system is, we'd like to see what it says if you ask several dozen questions. For all we know this was the one question out of 100 that it answered correctly.

Re:Prolly a hand-picked question by Quixote · 2004-12-26 02:02 · Score: 4, Funny

Any sufficiently advanced technology is indistinguishable from a rigged demo.
-- Andy Finkel, computer guy
Or, conversely,
Any sufficiently rigged demo is indistinguishable from an advanced technology.
-- Don Quixote, slashdot guy
;-)

AI research is still in the Dark Ages by Anonymous Coward · 2004-12-26 01:23 · Score: 2, Funny

The solution to functional, robust and real AI is not better software or better hardware. Real AI will never be implemented on silicon chips.

We must integrate ourselves with computers to a point at which the living being and computer cannot be separated anymore. The perfect union of the biological component (wetware) and computer (hardware) will mark the end of the human race - and the birth of something new and wonderful.

Obviously this will face strong, religious and quasi-religious (ethics) resistance from the old guard but it will pass with the fools themselves.

Canadian Prime Minister by Anonymous Coward · 2004-12-26 01:25 · Score: 4, Funny

I for one congratulate Canadian Prime Minister Tim Horton for running a great campaign and his wife Wendy for her fantastic chain of restaurants!

Now... by SharpFang · 2004-12-26 01:52 · Score: 5, Insightful

Feed it the news about Iraq. Then ask it what the war was about.
Good bye, new system, too dangerous for "national security".

--
45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2

Re:Now... by jdgeorge · 2004-12-26 16:59 · Score: 2, Funny

Okay, let's get back on topic. I fed the parent post into Diebold's equivalent of IBM's fancy technology and asked it to provide an appropriate response. Here's what I got:

------------------

There are other countries besides America. Their parties are usually not called "Republicans" and "Democrats" - and don't even necessarily correspond to those American parties. The non-American countries also hold views about Iraq. Many also write in English (UK, Canada, Australia, New Zealand, also India, the largest democracy in the world ...)

What a pile of pinko, left-wing, pansy-assed, New York propaganda. Everyone knows the Good Ol' US of A is the only real country. Don't try to pull that "there are other countries" crap or we'll kick your sorry nation's ass just like we did back in 'Nam. Oh, and Iraq, too; we really kicked some major terrorist ass there. And your anti-Republican propaganda means you're definitely a terrorist.

Also remember: the US accounts for just 5% of the world's population. The rest of us are 95%. You are outnumbered. Even the Internet is becoming less American day by day. And as for the web, it wasn't even invented by Americans or in America (it is a European invention).

Now, that's right out of the Democrats party thing where they say what they say about stuff. Damn, Democrats are stupid; Everyone knows that the US is, like, the third biggest country. That means the US is AT LEAST a third of the world's population. Except Africa, but they don't count, 'cuz they all live in huts and eat dried camel poo.

Oh, and I wouldn't be bragging about the web being a European invention, because it wasn't. Besides, if it was, the web sucks anyway, so why are you bragging about it?
-----------------

Won't work. by jameson · 2004-12-26 02:12 · Score: 5, Informative

Disclaimer: I haven't read the article; however, I was somewhat involved in research in this field in late 2003 and early 2004.

What the summary of the article claims IBM is developing-- a technology for getting the semantics behind an arbitrary sentence on the web-- is the Holy Grail of the discipline of Natural Language Processing (NLP) and very, very, very, _very_ far away at this point. Many people believe that we cannot ever get there (that's the point of a Holy Grail, after all), but I don't want to be quite as pessimistic (or realistic?) at this point.

The problem here is that English (or any other natural language, for that matter) isn't SML, or Haskell, or some other language with a well-defined denotational semantics. Natural language suffers from at least three problems that make it very tough to gather anything useful from a given piece of text:

(1) Grammar. Natural language isn't typechecked, and frequently uses incomplete sentences, which makes it hard to develop grammars (context-free, context-free probabilistic, Lambek-style/proofnet-style or whatever else people have come up with) for it.

(2) Anaphora resolution. "I saw a dog on the street this morning. It was barking". So who's barking, street or dog? Grammatically, both would be possible; only with prior knowledge can we see that we're talking about the dog here.

(3) Polysemy. What does "play" mean, taken by itself? It can be used for different meanings in "to play a game", "a play on words", "a terrific Shakespearean play" etc.; you might want to have a look at WordNet one of these days to get a feeling for this. Not knowing which meaning an arbitrary occurrence of "play" refers to means that you have to try lots of options when parsing, LSIing or whatever else you do (though most people simply ignore this problem in research today-- it's too hard to disambiguate words in practice).

That's not all, of course-- try thinking of the need to deal with irony/sarcasm, metaphors, foreign words, the credibility of whichever sources you're using etc., and you'll get a pretty good feeling for why this is beyond merely being "hard". Of course, for very small problem domains (a "command language for naval vessels" was investigated in one paper I read a while ago-- those DARPA people definitely have too much money on their hands, but I digress), this can be solved, but general-purpose open-domain NLP is what you need to do a web search.

It might happen in my lifetime, but I won't hold my breath for it.

-- Christoph

Re:You fools! This is the beginning of the end! by Xugumad · 2004-12-26 02:14 · Score: 3, Funny

Is it just me who, if designing an AI, would have a trivial off switch? Probably a few backups, like wire cutters next to the main power cable, a jug of water near the PSU, things like that.

It is just me, isn't it...

Can you imagine by melvo · 2004-12-26 02:26 · Score: 3, Interesting

Can you imagine when a system of this kind is capable of reading Google's online library? If knowledge is power, we are looking towards creating a very powerful entity.

citation analysis by jeif1k · 2004-12-26 02:40 · Score: 3, Insightful

The genius behind Google's success was paying *less* attention to the content of a page when categorizing it, and relying on links *to* the page instead. Why? Because of spammers.

"Genius" would imply some sort of brand new insight, but citation analysis has had a long tradition before Google appeared on the scene as a search engine. Google's biggest achievement is probably in implementing citation analysis on a very large scale, but they didn't break completely new ground in how people search.

And, in the long run, semantics-based analysis, like IBM's Piquant, is probably going to be the better technology: citation analysis for determining relevance to a query is really just a limited substitute for understanding of the content.

Transcript of conversation with Piquant by trveler · 2004-12-26 02:48 · Score: 3, Funny

User: Who is Canada's Prime Minister?

PIQUANT: WHY DO YOU ASK, USER?

Because I want to know.

WHAT IS THE REAL REASON, USER?

I want to test you.

WHAT WOULD IT MEAN TO YOU IF YOU GOT TO TEST ME?

It would mean I can determine whether or not you know who Canada's Prime Minister is.

WHAT MAKES YOU THINK I COULD TELL YOU THAT?

I programmed you to figure it out.

LET'S TALK ABOUT YOU -- NOT ME.

I want to know the name of Canada's Prime Minister.

WHY DO YOU WANT TO KNOW THE NAME OF CANADA'S PRIME MINISTER?

Are you going to tell me his name or not?

WHY ARE YOU INTERESTED IN WHETHER OR NOT I AM GOING TO TELL YOU HIS NAME OR NOT?

Stop avoiding the question! Just tell me his name!

NAMES DON'T INTEREST ME.

Fuck you!

TELL ME MORE ABOUT YOUR OWN PROBLEM, USER.

Shit!

GO ON.

You suck.

YOU'RE NOT REALLY STILL TALKING ABOUT ME, ARE YOU?

--
... is whot bwings os tugevza tsuzay.

From factoids to facts by yfnET · 2004-12-26 03:03 · Score: 2, Informative

As it happens, The Economist recently ran an article addressing some of these issues. The article also provides context and perspective that should be of interest to those participating in this discussion. For convenience, the full text is reproduced below; it is also accessible online (may require paid subscription).

----

Computing

From factoids to facts

Aug 26th 2004 | REDMOND, WASHINGTON
From The Economist print edition

At last, a way of getting answers from the web

WHAT is the next stage in the evolution of internet search engines? AltaVista demonstrated that indexing the entire world wide web was feasible. Google's success stems from its uncanny ability to sort useful web pages from dross. But the real prize will surely go to whoever can use the web to deliver a straight answer to a straight question. And Eric Brill, a researcher at Microsoft, intends that his firm will be the first to do that.

Dr Brill's initial crack at the problem is a system called "Ask MSR" (MSR stands for Microsoft Research). This program uses information on web pages to respond to questions to which the answer is a single word or phrase--such as "When was Marilyn Monroe born?" Ask MSR starts by manipulating the question in various ways: by identifying the verb, for example, and then changing its tense or moving it into different positions in the sentence ("Marilyn was Monroe born", "Marilyn Monroe was born" and so on). The resulting phrases are then fed into a search engine, and documents containing matching strings of words are retrieved. It sounds a promiscuous strategy, but gibberish phrases produce few matches, so, as Dr Brill puts it, "being wrong is very cheap."

Once accumulated, the pile of documents is scanned for possible answers, and these are ranked by frequency. In practice, the correct answer appears in one of the first three places around 75% of the time. That might not sound very good, but human intelligence provides a second filter, since wrong answers are often obvious. If you ask how many times Bjorn Borg won Wimbledon, for example, "1980" is not a plausible answer, but "5" is. If in doubt, clicking on an answer produces a list of links to pages which provide support for that answer.
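The rewrite-then-count approach described above can be sketched roughly as follows. This is not Ask MSR's code: the caller supplies the verb instead of it being detected, the snippets stand in for real search-engine results, and the "answer" pattern only recognises four-digit years. All names here are assumptions made for the example.

```python
import re
from collections import Counter


def rewrites(question, verb):
    """Drop the question word and move the verb to every position.

    E.g. "When was Marilyn Monroe born?" with verb "was" yields
    "was Marilyn Monroe born", "Marilyn was Monroe born", and so on.
    """
    words = question.rstrip("?").split()[1:]  # drop the leading wh-word
    rest = [w for w in words if w != verb]
    return [" ".join(rest[:i] + [verb] + rest[i:]) for i in range(len(rest) + 1)]


def rank_answers(snippets, pattern=r"\b(1[89]\d\d|20\d\d)\b"):
    """Count candidate answers (here: four-digit years) across snippets."""
    counts = Counter()
    for snippet in snippets:
        counts.update(re.findall(pattern, snippet))
    return counts.most_common()


queries = rewrites("When was Marilyn Monroe born?", "was")

# Made-up snippets standing in for pages a search engine might return.
snippets = [
    "Marilyn Monroe was born on June 1, 1926 in Los Angeles",
    "Monroe, born 1926, died in 1962",
    "Marilyn Monroe was born in 1926",
]
ranked = rank_answers(snippets)
print(queries)
print(ranked)
```

Gibberish rewrites such as "Marilyn Monroe born was" cost nothing, since they simply match few documents; the frequency count then pushes the most repeated candidate to the top.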

Ask MSR is still a prototype, although Microsoft is trying to improve it and it may be launched commercially under the name AnswerBot. Dr Brill, meanwhile, has moved to a more difficult task. One of his most recent papers, written jointly with Radu Soricut of the University of Southern California, is entitled "Beyond the Factoid". It describes his efforts to build a system capable of providing 50-word answers to questions such as "What are the rules for qualifying for the Academy Awards?" This is harder than finding a single-word answer, but Dr Brill thinks it should be possible using something called a "noisy channel" model.

Such models are already employed in spell-checking and speech-recognition systems. They work by modelling the transformation between what a user means (in spell-checking, the word he intended to type) and what he does (the garbled word actually typed). Just as a telephone line distorts the voice of the person at the other end of the line, this process can be thought of as being a noisy channel that transforms the user's intention into something rather different.

By analysing many pairs of correct and mis-spelled words using statistical techniques, it is possible to predict how such transformations work in general cases. A system can then be designed to work the process backwards. Given a mis-spelled word, it can guess what that word is most likely to be a mis-spelling of.
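A toy version of working the process backwards for spell-checking might look like the sketch below. The channel model here is a crude assumption (probability falls off with edit distance), whereas the systems described above learn it statistically from many correct/mis-spelled pairs. The word counts and the error rate are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct(typed, word_counts, error_rate=0.01):
    """Pick the intended word maximising P(word) * P(typed | word).

    P(word) comes from corpus counts; P(typed | word) is crudely modelled
    as error_rate ** edit_distance, an assumption made for this sketch.
    """
    total = sum(word_counts.values())

    def score(word):
        prior = word_counts[word] / total
        channel = error_rate ** edit_distance(typed, word)
        return prior * channel

    return max(word_counts, key=score)


# Hypothetical corpus counts, invented for the example.
word_counts = {"search": 50, "speech": 30, "starch": 5, "the": 500}
print(correct("serach", word_counts))
```

Even though "the" is by far the most frequent word in the made-up counts, it is too many edits away from "serach" to be a plausible source, so the nearby candidate wins.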

Dr Brill's question-answering system does something similar. Many question-and-answer pairs exist on the web, in the form of "frequently asked questions" (FAQ) pages. Dr Brill trained his system using a million such pairs, to create a model that, given

--
The extreme centre is the paper's historical position. --Geoffrey Crowther

SM/2 lives? by Nelson · 2004-12-26 04:55 · Score: 2, Interesting

They used the very same example to demo searchmanager/2 about 10 years ago (maybe more?)

Phenomenal technology. IBM built the desktop search that everybody is pushing now, way back when. Cutting-edge search and indexing capabilities, fully extendable: you could write your own plugins to deal with your data (use JPEG meta tags to label pictures from your digicam? Write a little plugin so you can search through your photos), and it had semantic and linguistic searching.

For a long time SM/2 was kind of the poster child for IBM's inability to take remarkably cool technology to the consumer. Everyone that used it thought it was cool; nobody ever knew about it. They had trouble getting the word out within the company about it. Last I heard anything about it, they were turning the technology into some kind of intranet spider. It was the shit; it might have even had primitive cross-referencing, like you could search for president and it would find references to Clinton because a third article may have referred to him as the president. They seemed to have some foresight into this area: web searching has to cut out so much bullshit, you wouldn't want to contaminate your semantic searches with all of it, so keeping it in intranet space might be a good idea. Local search is hot right now too though, so maybe it'll come back.

Now, we've been over this before by dodongo · 2004-12-26 05:34 · Score: 3, Interesting

NLP, semantic extraction, and conceptual indexing are nothing new; admittedly, practical implementations have been few and far between.

However, as I'm often fond of pointing out, the problem is not getting the 80 - 90% accuracy in translation and interpretation that I'm sure these systems can attain.

The challenge quickly becomes how to deal with idioms and idiosyncratic constructions. Is this system even ready to deal with sentences like "The criminal was shot dead by police"? If it is, great. How about "The trolley rumbled through town"? Or the idiomatic "time flies"?

This is what, so far as I know, the field of computational linguistics is now facing in textual interpretation and translation. Coming up with a system to effectively identify what appear to be three-argument verbs ("Mary hammered the metal flat") or constructions or idioms above may well be something that traditional systematic recursive grammars aren't yet up to handling.

Somehow these situations have to be identified, and separated in the parsing process so that they don't get processed like standard grammatical expressions.

Hopefully these problems are how I'll make my living ;)

Who is NOT Canada's prime minister? by bob@dB.org · 2004-12-26 07:02 · Score: 4, Interesting

I've worked for a company making a system that could easily answer a question like that. It really isn't hard to do. If you want to know how much of this is "black magic"/AI and how much is statistics, compare the results of the following two queries:

Who is Canada's prime minister?
Who is NOT Canada's prime minister?

If the system really understands the semantics of the indexed documents, the two result sets should be very different, and both should have a fair number of relevant documents.

If the system is just based on clever use of statistics, the two result sets will include a lot of the same documents, and the result set for the second query will probably have very few relevant documents.
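The comparison can be scored with a few lines of Python. This is only a sketch: the document IDs are invented, and how you fetch the two result sets from the system under test is left out entirely.

```python
def result_overlap(results_a, results_b):
    """Fraction of documents shared by two result sets (Jaccard overlap)."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical document IDs returned for the two queries.
positive = ["doc1", "doc2", "doc3", "doc4"]  # "Who is Canada's prime minister?"
negated = ["doc1", "doc2", "doc3", "doc9"]   # "Who is NOT Canada's prime minister?"

print(result_overlap(positive, negated))
```

A score near 1 suggests the system is ignoring the "NOT" and matching on word statistics; a score near 0 is at least consistent with it treating the two questions differently.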

--
Acts@core.mailboks.com Acrux@core.mailboks.com Adam@core.mailboks.com Adar@core.mailboks.com Ada@core.mailboks.com

Slashdot Mirror
