Deriving Semantic Meaning From Google Results
prostoalex writes "New Scientist talks about Paul Vitanyi and Rudi Cilibrasi of the National Institute for Mathematics and Computer Science in Amsterdam and their work to extract the meaning of words from Google's index. The pair demonstrates an unsupervised clustering algorithm, which can 'distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return', according to New Scientist."
These kinds of articles never seem to address a very basic problem: natural languages. English is full of words that trip up even humans. "Right" the direction versus "right" the judgement is a good example. In wartime something as simple as that may have led to death. It's the elephant in the living room: a huge, important problem that nobody wants to talk about. There are alternatives, such as Lojban, which can be parsed like any computer program.
The article mentions English-Spanish translation. When one language is ambiguous (from a bit of Spanish I had in HS I'm guessing English is far more ambiguous), there is no hope of easy translation. And it's worse because the bigger application may be translating the many English pages (ambiguous) to Spanish.
Transcend Humanity. Please.
If you tell someone to take the right path, they could mistake it as going the opposite of left and walking into a minefield, when in fact you meant going the correct direction.
Is this in any way related to the way that Google was able to decide all on its own that Scientology was crap, and thus bring Operation Clambake up to the top of the search results? (Until the Scientology people got pissed, anyway.)
:)
Google is already starting to show signs of intelligence higher than some people.
"Everything you know is wrong. (And stupid.)"
Moderation Totals: Wrong=2, Stupid=3, Total=5.
...as opposed to 'non-semantic meaning' or just 'semantic meaning' as in 'I don't know what semantic means but using it here will make me look intelligent'?
Doesn't it make you feel good to know that our freedoms are protected by politicians, lawyers and journalists.
First off, I am not an "AI" expert nor do I claim to be, however, this is how I see it.
...
Since it seems that so few really understand the term "intelligence", it is really not surprising that even fewer grasp the meaning of the term "artificial intelligence", is it?
One: intelligence is not awareness.
Although we cannot prove the existence of or even seem to really define self-awareness, it seems self-evident, at least to me, that intelligence is clearly defined and can be measured.
Therefore, I believe that we will have "artificial intelligence" soon; in fact, I'd bet Google may well be the first AI or "self intelligent" engine.
However, I suspect it will be quite a while before we are mature enough to build a self-aware engine.
Lastly, with regard to some of the other comments, it seems to me that this paper is about using the "intelligence" included in the language we use, which Google crawls. This repository is the single largest collection of semantic weighting, so algorithms could be developed that reflect this "intelligence" and therefore appear intelligent themselves, even though they are simply deterministic.
Whew
Words to men, as air to birds.
Although very clever, NGD (Normalized Google Distance) misses all higher-order relationships and does not even distinguish between different categories of pairwise relationships. For example, NGD might assume that "Bush" & "Iraq" had the same relationship as "Slashdot" & "Geek" because the two word pairs co-occur with similar frequencies.
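To make that concrete, here is a minimal sketch of NGD as the paper defines it: a function of three hit counts and the index size, and nothing else. The hit counts and the index size N below are made up for illustration, not real Google numbers.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from raw hit counts.

    fx, fy: hits for each term alone
    fxy:    hits for pages containing both terms
    n:      assumed total number of indexed pages
    """
    if fxy == 0:
        return math.inf  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

N = 8_000_000_000  # hypothetical index size

# Hypothetical counts: two unrelated word pairs with the same statistics.
bush_iraq = ngd(fx=4_000_000, fy=2_000_000, fxy=500_000, n=N)
slashdot_geek = ngd(fx=4_000_000, fy=2_000_000, fxy=500_000, n=N)

print(bush_iraq, slashdot_geek)  # identical distances
```

Since the only inputs are counts, any two pairs with matching counts get the same distance, whatever kind of relationship actually links the words.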
More interesting are analyses on n-tuples (co-occurrences and orderings of n words at a time). Anyone who does ER (Entity-Relationship) diagrams for relational databases will appreciate that many relationships involve multiple entities that are decomposable into pairwise relationships.
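A toy sketch of what n-tuple counting adds over pair counting, assuming a corpus small enough to scan directly (nothing like what you can get out of Google's hit counts). The function name and both corpora are invented for the example, and word ordering is ignored to keep it short.

```python
from collections import Counter
from itertools import combinations

def tuple_counts(docs, n):
    """Count how many documents contain each unordered n-tuple of words."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.split()))
        counts.update(combinations(words, n))
    return counts

# Two made-up corpora with identical pairwise statistics.
scattered = ["a b", "b c", "a c"]  # the three words never appear together
together = ["a b c"]               # the three words share one document

pairs_scattered = tuple_counts(scattered, 2)
pairs_together = tuple_counts(together, 2)
triples_scattered = tuple_counts(scattered, 3)
triples_together = tuple_counts(together, 3)

print(pairs_scattered == pairs_together)     # pair counts cannot tell them apart
print(triples_scattered[("a", "b", "c")],
      triples_together[("a", "b", "c")])     # triple counts can
```

Any purely pairwise measure sees these two corpora as the same; only the 3-tuple count shows that the words occur jointly in one and never in the other.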
Another limit is that Google is atrocious on its estimates of the number of hits. In my experience, the actual number of hits is only a fraction (about 60%?) of the estimate. This suggests that Google has a pairwise estimator built in that may be only partially empirical. If Google simply reports an estimated number of hits based on products of probabilities, then there is no information about the pair in the NGD. Obviously, these scientists have gotten useful results, but NGD may not be as good an estimate of the co-occurrence of the words as the scientists assume.
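A quick check of the "no information" claim, under the assumption (mine, not anything Google documents) that the pair count is estimated purely as f(x) * f(y) / N. Plugging that into the NGD formula, the numerator and denominator both reduce to log N minus the smaller single-term log count, so the distance is 1 for every pair. The counts and N are hypothetical.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts (same formula as the paper)."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

def independent_estimate(fx, fy, n):
    """Pair count if the engine only multiplied single-term probabilities."""
    return fx * fy / n

N = 8_000_000_000  # hypothetical index size

# Hypothetical single-term counts for three very different word pairs.
single_counts = [
    (4_000_000, 2_000_000),
    (90_000, 1_200_000_000),
    (7_000_000, 7_000_000),
]

values = [ngd(fx, fy, independent_estimate(fx, fy, N), N)
          for fx, fy in single_counts]
print(values)  # every pair comes out at distance 1, up to rounding
```

So to whatever extent the reported pair counts are product estimates rather than measured counts, NGD flattens toward the same value for every pair.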
Two wrongs don't make a right, but three lefts do.
I'm a sci-fi vegan: I don't want the aliens to think we have as much right to live as the fried chickens we eat.
I've perused the abstract and skimmed the body of the paper. They're fine. But the title is misleading: Automatic Meaning Discovery Using Google.
Their software has discovered meaning no more than paper has when the lexicographer is done writing her dictionary. Meaning is not the grouping of symbols.
For systems that step towards encoding meaning as human brains do, consider the Neural Theory of Language.
If given perfect information about the relationships between concepts, you could derive a very intelligent machine. Take a human, for example.
A baby hears the word mom spoken by his mom. Gradually, the baby knows there is a relationship between that sound and a smiling face.
The child, growing up, starts to see relationships. Intense pain, which is rare, when correlated with a hot stove, has strong meaning in his mind.
Everything is learned initially through correlations. The advantage of human beings is that there are many more data points for correlation. Google's correlations are weak and don't give nearly as much information.