Post-Googleism At IBM With Piquant

Posted by ryuzaki0 on Sunday December 26, 2004 @01:03AM from the find-the-prettiest-girl-in-new-jersey dept.

kamesh writes "James Fallows of the New York Times reports on an interesting search technology that IBM is developing. IBM demonstrated a system called Piquant, which analyzed the semantic structure of a passage and thereby exposed 'knowledge' that wasn't explicitly there. After scanning a news article about Canadian politics, the system responded correctly to the question, 'Who is Canada's prime minister?' even though those exact words didn't appear in the article. What do you think?"

Trust Issue by Flamefly · 2004-12-26 01:16 · Score: 5, Interesting

On a global scale this system tends to fall apart. There is a constant issue of trust when dealing with what looks to me to be the holy grail of the semantic web.
What if two sites said the Prime Minister of Canada was Santa, and said it explicitly? Would that overwrite the linked information? How would the system know what is right? You can't always just pick the majority answer, so you need to set up little areas of trust: "I trust www.thisplace.com and everything it says", and that site in turn will say "I trust www.overhere.com". But who allocates the trust? Couldn't those people be biased?
The semantic web will have a fantastic impact on the world, but the trust issue is something that needs to be addressed, and I don't see how it can ever be done globally.
More likely we would have systems like this for individual sites or intranets: trusted circles that would be unlikely to contradict themselves.
Hopefully one day, if we truly get a global semantic web, we can see if the answer really is 42 :]
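
A minimal sketch of the trust-chain idea above, assuming trust starts at a few personally chosen seed sites and weakens with every "I trust X" hop. The function names, the decay factor, the default weight for unknown sites, and the `.example` hosts are all hypothetical; nothing here describes how Piquant or any real system assigns trust.

```python
from collections import defaultdict


def propagate_trust(direct_trust, seeds, decay=0.5, max_hops=3):
    """Spread trust outward from seed sites along 'I trust X' edges.

    direct_trust: dict mapping a site to the list of sites it vouches for.
    seeds: dict mapping personally trusted sites to a starting score.
    Each hop multiplies the score by `decay`; a site keeps its best score.
    """
    trust = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_hops):
        next_frontier = {}
        for site, score in frontier.items():
            for endorsed in direct_trust.get(site, []):
                candidate = score * decay
                if candidate > trust.get(endorsed, 0.0):
                    trust[endorsed] = candidate
                    next_frontier[endorsed] = candidate
        frontier = next_frontier
    return trust


def weighted_answer(claims, trust, default_trust=0.05):
    """Pick the answer with the most trust behind it, not the most pages.

    claims: iterable of (site, answer) pairs.
    Sites with no trust score get the small `default_trust` weight.
    """
    totals = defaultdict(float)
    for site, answer in claims:
        totals[answer] += trust.get(site, default_trust)
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Assumed example data: one trusted chain versus two unknown sites.
trust = propagate_trust(
    {"www.thisplace.com": ["www.overhere.com"]},
    {"www.thisplace.com": 1.0},
)
claims = [
    ("www.overhere.com", "Paul Martin"),
    ("spam-a.example", "Santa"),
    ("spam-b.example", "Santa"),
]
best, totals = weighted_answer(claims, trust)
print(best)  # Paul Martin
```

The two unknown sites outnumber the trusted one but are outweighed by it. The open question from the comment remains: somebody still has to choose the seeds.
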
1. Re:Trust Issue by ctr2sprt · 2004-12-26 01:47 · Score: 4, Interesting
  
  All search engines return a bunch of results ordered by how likely they think each one is to address your search terms. One very simple way of ranking the results is popularity (the number of pages that give the same answer to your question). You could fine-tune the popularity index with a Google-ish reference counting algorithm.
  One of the neatest applications of this technology, I think, is the ability to eliminate search results. Anyone who's ever used Google to troubleshoot a problem knows that the first thirty or forty matches will all be the same: web mirrors of mailing lists or USENET posts. Using a vaguely semantic technology like this, Google could say, "Hey, all these pages are effectively identical" and collapse them into a single result.
  This would be terribly useful for me, since I usually start my troubleshooting searches with an error message. Error messages in the Unix world being quite standardized, this nets me at least ten irrelevant "threads." Since each "thread" is duplicated about ten times in the Google results, that means the question I'm actually asking may not appear until page 5 or later. But using result grouping like this - which Google tries and is generally unsuccessful at - would mean I'd see my question asked on the first or second page. Big improvement.
  Another nifty trick would be an actual, working "related pages" link. So let's say I find my question, but, as is all too common, it's a question without an answer. I click on the link, the search engine does its magic, and it pulls up (perhaps) technical details on the software in question or alternate solutions to my problem. This is definitely going to be harder to implement than my other idea (perhaps even impossible for now), but it'd be really nice. It could make navigating the Internet like navigating Wikipedia or amazon.com.
  Ah well. I can dream.
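
The result-collapsing idea can be sketched with a purely textual stand-in: word shingles compared by Jaccard similarity. This is a swapped-in technique, not the semantic analysis the story describes, and the shingle size, threshold, function names, and example URLs are assumptions for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word windows in `text`, lowercased."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def collapse_duplicates(results, threshold=0.8):
    """Greedily group ranked (url, text) results that are near-identical.

    Each result joins the first earlier group whose representative it
    resembles closely enough; otherwise it starts a new group. Rank order
    of the groups follows the first member of each group.
    """
    groups = []  # list of (representative shingle set, [urls])
    for url, text in results:
        sig = shingles(text)
        for rep_sig, urls in groups:
            if jaccard(sig, rep_sig) >= threshold:
                urls.append(url)
                break
        else:
            groups.append((sig, [url]))
    return [urls for _, urls in groups]


# Assumed example: two mirrors of one mailing-list post and one other page.
post = "segmentation fault in libfoo when starting the daemon"
results = [
    ("mirror1.example/a", post),
    ("mirror2.example/a", post),
    ("other.example/b", "how do I configure the daemon to listen on a different port"),
]
print(collapse_duplicates(results))
```

Mirrors that wrap the same post in different page chrome would score below 1.0, so the threshold would need tuning on real data.
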
I wonder... by Raul654 · 2004-12-26 01:17 · Score: 4, Interesting

Using Google means that this would have to contend with a lot of noise - looking for one nugget of information on the internet will tend to yield a low signal-to-noise ratio. I wonder what would happen if instead, you were to run it using Wikipedia as a back end (full disclosure - I'm a Wikipedia admin). There'd be less information, but I suspect the quality of the results would be better.

--

To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
Re:Latent Semantic Indexing by timeOday · 2004-12-26 01:29 · Score: 4, Interesting

I'm not sure if such a thing would outperform Google for a general purpose search engine.
The short answer is no, because traditional information retrieval methods like LSI are easily fooled by spammer tricks like keyword stuffing.
The genius behind Google's success was paying *less* attention to the content of a page when categorizing it, and relying on links *to* the page instead. Why? Because of spammers.
Think about hiring for a job. You don't limit yourself to interviews with candidates, because they're highly motivated to deceive you. So you look for references. Certification is an example of this - somebody besides the person himself who will vouch for his competence. An even better reference is somebody you know and trust who thinks highly of the individual (which is why personal networking is so important to getting hired).
Google's PageRank is analogous. Instead of looking at the content of a page, you rely heavily on links to the page, especially links from more trusted sources. This helps defeat spammers, who use all manner of tricks to make their crap look good to search engine spiders.
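
A minimal sketch of the link-based ranking described above, using the simplified textbook power-iteration form of PageRank. This is not Google's production algorithm; the damping factor, iteration count, and toy link graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by inbound links, ignoring page content entirely.

    links: dict mapping each page to the list of pages it links to.
    Pages with no outgoing links spread their rank evenly over all pages.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Assumed toy graph: "spam" can stuff any keywords it likes into its own
# text, but nobody links to it, so its rank stays low.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "spam": []}
for page, score in sorted(pagerank(links).items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Nothing in the computation reads page text, which is the point of the job-reference analogy: the score comes from who vouches for a page, and a vote from a highly ranked page counts for more.
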
I'd like to see that article. by Anonymous Coward · 2004-12-26 01:36 · Score: 1, Interesting

If the article doesn't come out and state that Paul Martin is the Prime Minister then how could anyone--including a computer--know that for sure? I think the submitter was stretching the truth a bit when he said the words "Prime Minister" don't appear in the article. Can you imagine an article about George Bush that didn't use the word President?
Re:Wow by smchris · 2004-12-26 02:00 · Score: 2, Interesting

I have to wonder how much of its 'understanding' was hard-coded rather than purely abstract.

Baby steps, but the sort of essential baby steps that accumulate real technological progress. When the system discovers its _own_ non-trivial and useful rules, when it spontaneously parses our input to reply upon a self-generated "Oh, you mean......", then it gets scary.

Epistemology is a big word.
Actually, this technology was developed at CMU by Anonymous Coward · 2004-12-26 02:26 · Score: 1, Interesting

As some of you still remember, the original technology behind this was developed at CMU in the mid 90's when Corey Kosak, Andrej Bauer and a bunch of other talented people created the first ever natural language based neural network with a measurable IQ. People could even post questions to certain personae emulated by the neural network through the web site CGI at forum2000.org. This neural network was really fun and witty, but what you probably do not know is that all the technology in fact consisted of bored postgraduate students answering your questions.

Greetings to Kosak, Bauer and all the anonymous people who tried their best to pretend they're a software based neural network.
Can you imagine by melvo · 2004-12-26 02:26 · Score: 3, Interesting

Can you imagine when a system of this kind is capable of reading Google's online library? If knowledge is power, we are looking towards creating a very powerful entity.
Google already has an unfair monopoly by Anonymous Coward · 2004-12-26 02:58 · Score: 1, Interesting

Google has an unfair advantage over potential rivals. I'm talking about their ownership of the entire Usenet archive (effectively so) in the form of google-groups. No matter how good any potential rival becomes, people will always have to turn to them for access to past Usenet archives.

Google's recent mangling of google-groups (mentioned already on /.) is proof of the power they hold by virtue of ownership of the Usenet archive, which they acquired when they bought out deja-news. Some legislation should be enacted to address this issue. Otherwise what is to stop them from one day offering pay-per-view or "premium access" to their archive? After all, Usenet is a public resource that shouldn't be at the mercy of any single corp. - no matter how large.
Re:Latent Semantic Indexing by Anonymous Coward · 2004-12-26 04:03 · Score: 1, Interesting

The short answer is no, because traditional information retrieval methods like LSI are easily fooled by spammer tricks like keyword stuffing.

That depends upon how you apply it. For instance, with a comprehensive database built up with that technology, when you search for Hilton, it might be able to respond with "the family or the hotel chain?", and return categorised results.
SM/2 lives? by Nelson · 2004-12-26 04:55 · Score: 2, Interesting

They used the very same example to demo searchmanager/2 about 10 years ago (maybe more?)

Phenomenal technology. IBM built the desktop search that everybody is pushing now, way back when. Cutting edge search and indexing capabilities, fully extendable, you could write your own plugins to deal with your data (use JPEG meta tags to label pictures from your digicam? Write a little plug-in so you can search through your photos) and it had semantic and linguistic searching.

For a long time SM/2 was kind of the poster child for IBM's inability to take remarkably cool technology to the consumer. Everyone that used it thought it was cool, nobody ever knew about it. They had trouble getting the word out within the company about it. Last I heard anything about it, they were turning the technology into some kind of intranet spider. It was the shit. It might have even had primitive cross referencing, like you could search for president and it would find references to Clinton because a third article may have referred to him as the president. They seemed to have some foresight into this area: web searching has to cut out so much bullshit, and you wouldn't want to contaminate your semantic searches with all of it, so keeping it in intranet space might be a good idea. Local search is hot right now too though, so maybe it'll come back.
Now, we've been over this before by dodongo · 2004-12-26 05:34 · Score: 3, Interesting

NLP, semantic extraction and conceptual indexing are nothing new; admittedly, practical implementations have been few and far between.

However, as I'm often fond of pointing out, the problem is not getting the 80 - 90% accuracy in translation and interpretation that I'm sure these systems can attain.

The challenge quickly becomes how to deal with idioms and idiosyncratic constructions. Is this system even ready to deal with sentences like "The criminal was shot dead by police"? If it is, great. How about "The trolley rumbled through town"? Or the idiomatic "time flies"?

This is what, so far as I know, the field of computational linguistics is now facing in textual interpretation and translation. Coming up with a system to effectively identify what appear to be three-argument verbs ("Mary hammered the metal flat") or constructions or idioms above may well be something that traditional systematic recursive grammars aren't yet up to handling.

Somehow these situations have to be identified, and separated in the parsing process so that they don't get processed like standard grammatical expressions.

Hopefully these problems are how I'll make my living ;)
Who is NOT Canada's prime minister? by bob@dB.org · 2004-12-26 07:02 · Score: 4, Interesting
I've worked for a company making a system that could easily answer a question like that. It really isn't hard to do. If you want to know how much of this is "black magic"/AI and how much is statistics, compare the results of the following two queries:
- Who is Canada's prime minister?
- Who is NOT Canada's prime minister?
If the system really understands the semantics of the indexed documents, the two result sets should be very different, and both should have a fair number of relevant documents.

If the system is just based on clever use of statistics, the two result sets will include a lot of the same documents, and the result set for the second query will probably have very few relevant documents.
--
Acts@core.mailboks.com Acrux@core.mailboks.com Adam@core.mailboks.com Adar@core.mailboks.com Ada@core.mailboks.com
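
The two-query test above can be sketched by measuring how much the result sets overlap. The toy keyword engine below stands in for the "clever use of statistics" case only; its stop-word list, scoring, and sample documents are assumptions, and it does not model any real product.

```python
import re

# Assumed stop-word list. Dropping "not" is typical of bag-of-words search
# and is exactly why the two queries end up looking identical.
STOP_WORDS = {"who", "is", "not", "s", "the", "a", "of"}


def tokens(text):
    """Lowercase alphabetic words in `text`, minus stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def keyword_search(query, documents, top_k=3):
    """Rank documents by how many query terms they share."""
    q = tokens(query)
    scored = [(len(q & tokens(body)), doc_id) for doc_id, body in documents.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:top_k]]


def result_overlap(results_a, results_b):
    """Jaccard overlap of two result lists: 1.0 means identical sets."""
    a, b = set(results_a), set(results_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


documents = {
    "d1": "Paul Martin is the prime minister of Canada",
    "d2": "The prime minister met the premier of Ontario",
    "d3": "Weather in Ottawa",
}
positive = keyword_search("Who is Canada's prime minister?", documents)
negative = keyword_search("Who is NOT Canada's prime minister?", documents)
print(result_overlap(positive, negative))  # 1.0
```

An overlap near 1.0 between the two queries is the statistical signature the comment predicts. A system that actually handled the negation should score much lower.
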