Deriving Semantic Meaning From Google Results

Posted by ryuzaki0 on Saturday January 29, 2005 @09:35AM from the can-also-use-tea-leaves-if-google-not-available dept.

prostoalex writes "New Scientist talks about Paul Vitanyi and Rudi Cilibrasi of the National Institute for Mathematics and Computer Science in Amsterdam and their work to extract the meaning of words from Google's index. The pair demonstrate an unsupervised clustering algorithm which can 'distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return', according to New Scientist."

The elephant in the living room. by Eunuch · 2005-01-29 09:36 · Score: 4, Insightful

These kinds of articles never seem to get a very basic problem--natural languages. English is full of words that trip even humans. "Right" the direction versus "right" the judgement is a good example. In wartime something as simple as that may have led to death. It's the elephant in the living room. Huge, important problem that nobody wants to talk about. There are alternatives, such as Lojban, which can be parsed like any computer program.

The article mentions English-Spanish translation. When one language is ambiguous (from a bit of Spanish I had in HS I'm guessing English is far more ambiguous), there is no hope of easy translation. And it's worse because the bigger application may be translating the many English pages (ambiguous) to Spanish.

--
Transcend Humanity. Please.
1. Re:The elephant in the living room. by freralqqvba · 2005-01-29 09:45 · Score: 4, Insightful
  
  Well obviously the technology is not perfect yet. However, none of the problems you bring up are particularly insurmountable (as long as you aren't expecting the AI to be BETTER at parsing languages than people). Yes, words are ambiguous, and yes humans can fail at parsing them, ergo computers probably will too. That's just a fact, we're not going to achieve perfection. Still, this could be a pretty major step forward (well, not that this is the first time something like this has been tried - but the base premise seems sound). By using Google, the elephant of a problem you mention can be partially mitigated. Google gives enough context around a word that ideally, when the word to be translated is also surrounded by context, its meaning among alternate meanings can be discovered without giving an overly ambiguous translation.
Re:wARTIME? by Anonymous Coward · 2005-01-29 09:42 · Score: 1, Insightful

If you tell someone to take the right path, they could mistake it as going the opposite of left and walking into a minefield, when in fact you meant going the correct direction.
Scientology by Jace of Fuse! · 2005-01-29 09:42 · Score: 2, Insightful

Is this in any way related to the way that Google was able to decide all on its own that Scientology was crap, and thus bring Operation Clambake up to the top of the search results? (Until the Scientology people got pissed, anyway.)

Google is already starting to show signs of intelligence higher than some people. :)

--

"Everything you know is wrong. (And stupid.)"

Moderation Totals: Wrong=2, Stupid=3, Total=5.
Would that be 'semantic meaning'... by exp(pi*sqrt(163)) · 2005-01-29 09:46 · Score: 2, Insightful

...as opposed to 'non-semantic meaning' or just 'semantic meaning' as in 'I don't know what semantic means but using it here will make me look intelligent'?

--
Doesn't it make you feel good to know that our freedoms are protected by politicians, lawyers and journalists.
not many will get this by 2TecTom · 2005-01-29 10:05 · Score: 2, Insightful

First off, I am not an "AI" expert nor do I claim to be, however, this is how I see it.

Since it seems that so few really understand the term "intelligence", it is really not surprising that even fewer grasp the meaning of the term "artificial intelligence", is it?

One: intelligence is not awareness.

Although we cannot prove the existence of or even seem to really define self-awareness, it seems self-evident, at least to me, that intelligence is clearly defined and can be measured.

Therefore, I believe that we will have "artificial intelligence" soon, in fact, I'd bet Google may well be the first AI or "self-intelligent" engine.

However, I suspect it will be quite a while before we are mature enough to build a self-aware engine.

Lastly, in regard to some of the other comments, it seems to me that this paper is about using the "intelligence" included in the language we use, which Google crawls. This repository is the single largest collection of semantic weighting, so algorithms could be developed that reflect this "intelligence" and therefore appear intelligent themselves, even though they are simply deterministic.

Whew ...

--
Words to men, as air to birds.
Limitations of NGD (Normalized Google Distance) by G4from128k · 2005-01-29 10:08 · Score: 4, Insightful

Although very clever, NGD (Normalized Google Distance) misses all higher-order relationships and does not even distinguish between different categories of pairwise relationships. For example, NGD might assume that "Bush" & "Iraq" had the same relationship as "Slashdot" & "Geek" because the two word pairs co-occur with similar frequencies.
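A rough sketch of the distance under discussion, for anyone who wants to try the arithmetic. The formula is the one given in the Cilibrasi and Vitanyi paper as far as I recall it; the function name ngd, the index size N, and every hit count below are made-up assumptions for illustration, not real Google numbers.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from raw hit counts.

    fx, fy: hits for each term alone; fxy: hits for pages containing both;
    n: assumed number of pages in the index.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf  # terms never co-occur (or never occur at all)
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Hypothetical counts, chosen to be identical for both pairs.
N = 8_000_000_000
pair_a = ngd(fx=5_000_000, fy=2_000_000, fxy=400_000, n=N)  # stand-in for "Bush" & "Iraq"
pair_b = ngd(fx=5_000_000, fy=2_000_000, fxy=400_000, n=N)  # stand-in for "Slashdot" & "Geek"
print(pair_a == pair_b)  # True: same counts, same distance
```

The score only ever sees three counts and an index size, so it has no way of telling what kind of relationship produced them.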

More interesting are analyses on n-Tuples (co-occurrences and orderings of n words at a time). Anyone who does ER (Entity-Relationship) diagrams for relational databases will appreciate that many relationships involve multiple entities that are decomposable into pairwise relationships.

Another limit is that Google is atrocious at estimating the number of hits. The actual number of hits is only a fraction (about 60%?) of the estimate, in my experience. This suggests that Google has a pairwise estimator built in that may be only partially empirical. If Google simply reports an estimated number of hits based on products of probabilities, then there is no information about the pair in the NGD. Obviously, these scientists have gotten useful results, but NGD may not be as good an estimate of the co-occurrence of the words as the scientists assume.
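To make that last worry concrete, here is a small check under one assumption that is purely hypothetical (nothing confirms Google works this way): the pair count is estimated as f(x) * f(y) / N, i.e. as if the two words were independent. Then log f(x,y) = log f(x) + log f(y) - log N, the NGD numerator reduces to log N - min(log f(x), log f(y)), which is exactly the denominator, and every pair comes out at distance 1. The helper independent_pair_count and all the counts are invented for the example.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from raw hit counts (n = assumed index size)."""
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

def independent_pair_count(fx, fy, n):
    """Hypothetical estimator: pair hits as a product of single-word probabilities."""
    return fx * fy / n

N = 8_000_000_000  # assumed index size, not a measured figure
for fx, fy in [(5_000_000, 2_000_000), (90_000, 1_200_000_000), (3_000, 3_000)]:
    fxy = independent_pair_count(fx, fy, N)
    # Distance is 1 (up to floating-point rounding) whatever the two words are.
    print(fx, fy, round(ngd(fx, fy, fxy, N), 6))
```

So if the reported pair counts were produced this way, the distance would say nothing about the pair; any useful signal has to come from the empirical part of the estimate.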

--
Two wrongs don't make a right, but three lefts do.
Language is more than words by Hal XP · 2005-01-29 10:36 · Score: 2, Insightful

English is full of words that trip even humans. "Right" the direction versus "right" the judgement is a good example.
"Right" isn't really a good example of a word that might "trip even humans." A human (translator) will parse not just by word but will attempt to extract a word's meaning from the surrounding phrases, sentences or even paragraphs. The syntax of the language may also come into play. In spoken language, additional "clues" can be derived from the situation in which the word is spoken, and often the extra-textual "body language" is more important, e.g. a hand pointing right or a head nodding in approval. I don't think an adult would be confused by the sentence "You're right. Let's go right." In wartime, I can imagine a responsible English-speaking commander barking references to GPS locations or using body language. It would be a mistake to think of a word in isolation from its context. After all, even in computer languages, a printf or goto by itself will chuck off a compiler error.

--
I'm a sci-fi vegan: I don't want the aliens to think we have as much right to live as the fried chickens we eat.
Pretentiously titled by Turadg · 2005-01-29 12:44 · Score: 2, Insightful

I've perused the abstract and skimmed the body of the paper. They're fine. But the title is misleading: Automatic Meaning Discovery Using Google.

Their software has discovered meaning no more than paper has when the lexicographer is done writing her dictionary. Meaning is not the grouping of symbols.

For systems that step towards encoding meaning as human brains do, consider the Neural Theory of Language.
understanding relationships is intelligence by menem · 2005-01-29 14:18 · Score: 2, Insightful

If given perfect information about the relationships between concepts, you could derive a very intelligent machine. Take a human for example.

A baby hears the word mom spoken by his mom. Gradually, the baby knows there is a relationship between that sound and a smiley face.

The child, growing up, starts to see relationships. Intense pain, which is rare, when correlated with a hot stove, has strong meaning in his mind.

Everything is learned initially through correlations. The advantage of human beings is that there are many more data points for correlation. Google's correlations are weak and don't give nearly as much information.