Post-Googleism At IBM With Piquant
kamesh writes "James Fallows of the New York Times reports on an interesting search technology that IBM is developing. IBM demonstrated a system called Piquant, which analyzed the semantic structure of a passage and therefore exposed 'knowledge' that wasn't explicitly there. After scanning a news article about Canadian politics, the system responded correctly to the question, 'Who is Canada's prime minister?' even though those exact words didn't appear in the article. What do you think?"
They don't come out and say it, but it sounds like it's just a big ol' LSI system. It works really well for some types of searching, but I'm not sure such a thing would outperform Google as a general-purpose search engine.
"Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent."
That's pretty impressive. It takes quite a clever AI to read between lines and connect concepts, but I have to wonder how much of its 'understanding' was hard-coded rather than purely abstract. Would it be trivial to just stick in another language database and have it read translations of the article the same way?
Nevertheless it makes me feel like all the programming and design I've ever done is pathetic and I will never amount to anything. That's how it is in the software industry - always someone out there who makes you look bad.
Sam ty sig.
Reg-free link
Till you realise the computer answered 'some asshole' which could be any prime minister in the world really.
Do not try to read the dupe, that's impossible. Instead, only try to realize the truth.
What truth?
There is no dupe
What if two sites explicitly said the Prime Minister of Canada was Santa? Would that overwrite the linked information? How would the system know what is right? You can't always just pick the majority answer, so you need to set up little areas of trust: "I trust www.thisplace.com and everything it says", and that site in turn will say "I trust www.overhere.com". But who allocates the trust? Couldn't those people be biased?
The semantic web will have a fantastic impact on the world, but the trust issue is something that needs to be addressed, and I don't see how it can ever be done globally.
More likely we would have systems like this for individual sites, or intranets, trusted circles that would be unlikely to contradict themselves.
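One way to picture the "little areas of trust" idea is as a graph where trust decays with each hop away from the sites you picked yourself. This is only a sketch under made-up assumptions: the third site name, the decay factor, and the `propagate_trust` function are all hypothetical, and it does nothing about the bias problem raised above.

```python
from collections import deque

def propagate_trust(trust_edges, seeds, decay=0.5):
    """Breadth-first walk from seed sites; each hop multiplies trust by `decay`.

    A site keeps the score from the shortest path that reaches it.
    """
    scores = {site: 1.0 for site in seeds}
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        for neighbour in trust_edges.get(site, []):
            if neighbour not in scores:
                scores[neighbour] = scores[site] * decay
                queue.append(neighbour)
    return scores

trust_edges = {
    "www.thisplace.com": ["www.overhere.com"],
    "www.overhere.com": ["www.example.org"],  # hypothetical third site
}
scores = propagate_trust(trust_edges, ["www.thisplace.com"])
for site, score in scores.items():
    print(site, score)
```

Sites that nobody in the circle vouches for never get a score at all, which matches the "trusted circles" picture better than a global ranking would.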
Hopefully one day, if we truly get a global semantic web, we can see if the answer really is 42 :]
Using Google means that this would have to contend with a lot of noise: looking for one nugget of information on the internet will tend to yield a low signal-to-noise ratio. I wonder what would happen if, instead, you were to run it using Wikipedia as a back end (full disclosure: I'm a Wikipedia admin). There'd be less information, but I suspect the quality of the results would be better.
To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
One example is meaningless. To get a realistic idea of how useful this system is, we'd like to see what it says if you ask several dozen questions. For all we know this was the one question out of 100 that it answered correctly.
I for one congratulate Canadian Prime Minister Tim Horton for running a great campaign and his wife Wendy for her fantastic chain of restaurants!
Feed it the news about Iraq. Then ask it what the war was about.
Good bye, new system, too dangerous for "national security".
45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
Disclaimer: I haven't read the article; however, I was somewhat involved in research in this field in late 2003 and early 2004.
What the summary of the article claims IBM is developing-- a technology for getting the semantics behind an arbitrary sentence on the web-- is the Holy Grail of the discipline of Natural Language Processing (NLP) and very, very, very, _very_ far away at this point. Many people believe that we cannot ever get there (that's the point of a Holy Grail, after all), but I don't want to be quite as pessimistic (or realistic?) at this point.
The problem here is that English (or any other natural language, for that matter) isn't SML, or Haskell, or some other language with a well-defined denotational semantics. Natural language suffers from at least three problems that make it very tough to gather anything useful from a given piece of text:
(1) Grammar. Natural language isn't typechecked, and frequently uses incomplete sentences, which makes it hard to develop grammars (context-free, probabilistic context-free, Lambek-style/proof-net-style or whatever else people have come up with) for it.
(2) Anaphora resolution. "I saw a dog on the street this morning. It was barking". So who's barking, street or dog? Grammatically, both would be possible; only with prior knowledge can we see that we're talking about the dog here.
(3) Polysemy. What does "play" mean, taken by itself? It can be used for different meanings in "to play a game", "a play on words", "a terrific Shakespearean play" etc.; you might want to have a look at WordNet one of these days to get a feeling for this. Not knowing which meaning an arbitrary occurrence of "play" refers to means that you have to try lots of options when parsing, LSIing or whatever else you do (though most people simply ignore this problem in research today-- it's too hard to disambiguate words in practice).
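To get a feel for why point (3) hurts, here is a small sketch that counts how many sense combinations a parser would face if it tried every reading of a sentence. The sense inventory is made up for illustration (it is not taken from WordNet), and `readings` and `count_readings` are hypothetical helpers.

```python
from itertools import product

# Hypothetical sense inventory; words not listed are treated as unambiguous.
SENSES = {
    "play": ["engage in a game", "theatrical work", "perform music"],
    "bank": ["financial institution", "river edge"],
    "saw": ["past tense of see", "cutting tool"],
}

def readings(sentence):
    """Yield every combination of senses for the words in `sentence`."""
    options = [SENSES.get(word, [word]) for word in sentence.lower().split()]
    return product(*options)

def count_readings(sentence):
    """Number of sense combinations an exhaustive disambiguator must consider."""
    return sum(1 for _ in readings(sentence))

sentence = "I saw a play near the bank"
print(count_readings(sentence))  # 2 * 3 * 2 = 12 combinations
```

The count is the product of the per-word sense counts, so it grows multiplicatively with every additional ambiguous word.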
That's not all, of course-- try thinking of the need to deal with irony/sarcasm, metaphors, foreign words, the credibility of whichever sources you're using etc., and you'll get a pretty good feeling for why this is beyond merely being "hard". Of course, for very small problem domains (a "command language for naval vessels" was investigated in one paper I read a while ago-- those DARPA people definitely have too much money on their hands, but I digress), this can be solved, but general-purpose open-domain NLP is what you need to do a web search.
It might happen in my lifetime, but I won't hold my breath for it.
-- Christoph
Am I the only one who, if designing an AI, would include a trivial off switch? Probably a few backups too, like wire cutters next to the main power cable, a jug of water near the PSU, things like that.
It is just me, isn't it...
Can you imagine when a system of this kind is capable of reading Google's online library? If knowledge is power, we are looking towards creating a very powerful entity.
The genius behind Google's success was paying *less* attention to the content of a page when categorizing it, and relying on links *to* the page instead. Why? Because of spammers.
"Genius" would imply some sort of brand new insight, but citation analysis has had a long tradition before Google appeared on the scene as a search engine. Google's biggest achievement is probably in implementing citation analysis on a very large scale, but they didn't break completely new ground in how people search.
And, in the long run, semantics-based analysis, like IBM's Piquant, is probably going to be the better technology: citation analysis for determining relevance to a query is really just a limited substitute for understanding of the content.
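For readers who haven't seen link-based ranking, here is a minimal power-iteration sketch in the spirit of PageRank. It is not Google's implementation; the four-page link graph, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by incoming links using simple power iteration.

    `links` maps each page to the list of pages it links to.
    Pages with no outgoing links spread their rank evenly over all pages.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "hub"; nobody links to "spam".
links = {
    "a": ["hub"],
    "b": ["hub"],
    "hub": ["a"],
    "spam": ["hub"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

Note that nothing on the "spam" page itself can raise its score here, since the score depends only on who links to it. That is the property the parent comment credits for resisting spammers.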
User: Who is Canada's Prime Minister?
PIQUANT: WHY DO YOU ASK, USER?
Because I want to know.
WHAT IS THE REAL REASON, USER?
I want to test you.
WHAT WOULD IT MEAN TO YOU IF YOU GOT TO TEST ME?
It would mean I can determine whether or not you know who Canada's Prime Minister is.
WHAT MAKES YOU THINK I COULD TELL YOU THAT?
I programmed you to figure it out.
LET'S TALK ABOUT YOU -- NOT ME.
I want to know the name of Canada's Prime Minister.
WHY DO YOU WANT TO KNOW THE NAME OF CANADA'S PRIME MINISTER?
Are you going to tell me his name or not?
WHY ARE YOU INTERESTED IN WHETHER OR NOT I AM GOING TO TELL YOU HIS NAME OR NOT?
Stop avoiding the question! Just tell me his name!
NAMES DON'T INTEREST ME.
Fuck you!
TELL ME MORE ABOUT YOUR OWN PROBLEM, USER.
Shit!
GO ON.
You suck.
YOU'RE NOT REALLY STILL TALKING ABOUT ME, ARE YOU?
... is whot bwings os tugevza tsuzay.
NLP, semantic extraction and conceptual indexing are nothing new; admittedly, practical implementations have been few and far between.
;)
However, as I'm often fond of pointing out, the problem is not getting the 80 - 90% accuracy in translation and interpretation that I'm sure these systems can attain.
The challenge quickly becomes how to deal with idioms and idiosyncratic constructions. Is this system even ready to deal with sentences like "The criminal was shot dead by police"? If it is, great. How about "The trolley rumbled through town"? Or the idiomatic "time flies"?
This is what, so far as I know, the field of computational linguistics is now facing in textual interpretation and translation. Coming up with a system to effectively identify what appear to be three-argument verbs ("Mary hammered the metal flat") or constructions or idioms above may well be something that traditional systematic recursive grammars aren't yet up to handling.
Somehow these situations have to be identified, and separated in the parsing process so that they don't get processed like standard grammatical expressions.
Hopefully these problems are how I'll make my living.
I've worked for a company making a system that could easily answer a question like that. It really isn't hard to do. If you want to know how much of this is "black magic"/AI and how much is statistics, compare the results of the following two queries:
If the system really understands the semantics of the indexed documents, the two result sets should be very different, and both should have a fair number of relevant documents.
If the system is just based on clever use of statistics, the two result sets will include a lot of the same documents, and the result set for the second query will probably have very few relevant documents.
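The comparison described above can be made concrete by measuring how much the two result sets overlap, for instance with the Jaccard index. This is just one possible measure; the document IDs below are made up, the `jaccard` helper is hypothetical, and how much overlap counts as "a lot" is a judgment call.

```python
def jaccard(results_a, results_b):
    """Share of documents common to both result sets (0.0 = disjoint, 1.0 = identical)."""
    set_a, set_b = set(results_a), set(results_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Hypothetical document IDs returned for the two queries.
query_1_results = ["doc1", "doc2", "doc3", "doc4"]
query_2_results = ["doc3", "doc4", "doc5", "doc6"]

overlap = jaccard(query_1_results, query_2_results)
print(f"overlap: {overlap:.2f}")
```

A value near 1.0 would point towards the "clever statistics" case, a value near 0.0 towards the two queries being treated as genuinely different questions. Relevance of the returned documents still has to be judged by hand.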