Slashdot Mirror


Deriving Semantic Meaning From Google Results

prostoalex writes "New Scientist talks about Paul Vitanyi and Rudi Cilibrasi of the National Institute for Mathematics and Computer Science in Amsterdam and their work to extract meaning of words from Google's index. The pair demonstrates an unsupervised clustering algorithm, which 'distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return', according to New Scientist."

5 of 120 comments (clear)

  1. The elephant in the living room. by Eunuch · · Score: 4, Insightful

    These kinds of articles never seem to get a very basic problem--natural languages. English is full of words that trip even humans. "Right" the direction versus "right" the judgement is a good example. In wartime something as simple as that may have lead to death. It's the elephant in the living room. Huge, important problem that nobody wants to talk about. There are alternatives, such as lojban which can be parsed like any computer program.

    The article mentions English-Spanish translation. When one language is ambiguous (from a bit of Spanish I had in HS I'm guessing English is far more ambiguous), there is no hope of easy translation. And it's worse because the bigger application may be translating the many English pages (ambiguous) to Spanish.

    --
    Transcend Humanity. Please.
    1. Re:The elephant in the living room. by freralqqvba · · Score: 4, Insightful

      Well obviously the technology is not perfect yet. However, none of the problems you bring up are particularly insurmountable (as long as you aren't excepting the AI to be BETTER at parsing languages than people). Yes, words are ambiguous, and yes humans can fail at parsing them, ergo computers probably will too. That's just a fact, we're not going to achieve perfection. Still, this could be a pretty major step forward (well, not that this is the first time something like this has been tried - but the base premise seems sound) by using google the elephant of a problem you mention can be partialy mitigated. Google gives enough context around a word that ideally, when the word to be translated is also surrounded by context its meaning amoung alternate meanings can be discovered without giving an overly ambigous translation.

    2. Re:The elephant in the living room. by ericbg05 · · Score: 5, Interesting
      The article mentions English-Spanish translation. When one language is ambiguous (from a bit of Spanish I had in HS I'm guessing English is far more ambiguous), there is no hope of easy translation.

      Every language has "ambiguity", but ambiguity can come in different flavors (phonological, morphological, syntactic, semantic, pragmatic). Some of the chief instigators of language change can be thought of as ambiguity on these levels. So firstly, it's hard to imagine the existence of a function mapping languages to "ambiguity levels".

      The motivation for your comment about English versus Spanish probably comes from the fact that you know of more English homophones than Spanish ones. Indeed, most literate people think of their language in terms of written words, so your take on the matter is common.

      (As a slight digression, your example of right the direction versus right as in 'correct, just' is pretty interesting. We can understand the semantic similarity between the two when we notice that most humans are right-handed. Thus it is extraordinarily common, cross-linguistically and cross-culturally, for the word meaning the direction 'right' to have similar meanings as dextrous, just, well-guided and so on, whereas the word meaning the direction 'left' also has meanings such as worthless, stupid. (In fact, the word dextrous was borrowed through French from the Latin word dexter meaning 'right, dexterous' or dextra meaning 'right hand'.) So the given example is one where, historically, a word had no ambiguity, but gained ambiguity because speakers started using it differently.)

      Getting back to the main topic, more problematic about Section 7 of TFA is the implicit assertion that, at some point in the future, their techniques can be applied to create a function mapping words in a particular language to words in another language. Anybody who has studied more than one language has seen cases where this is difficult to do on the word-level. For instance, the French equivalent of English river is often given as riviere or fleuve. But riviere is only used by French speakers to mean 'river or stream that runs into another river or stream' whereas fleuve means 'river or stream that runs into the sea'. English breaks up river-like things by size: rivers are bigger than streams. So, in the strictest sense, there is no English word for fleuve, just as there's no French word for stream (unless there has been a recent borrowing I don't know about). This certainly does not imply that French people can't tell the difference between big rivers and small rivers; their lexicon just breaks things up differently.

      These little problems can be remedied lexically, as I've just done. So fleuve is denotationally equivalent to river or stream that runs into the sea, although the latter is obviously much bulkier than its French equivalent. The real problem is that there are words in some languages whose meanings are not encoded at all in other languages. English, for example, has a lexical past-progressive tense marker, was, used in the first person singular (e.g. I was running to the store). Some languages have no notion of tense. What, then, does was mean in the context of such a language?

      It's pretty well-known that Slashdotters' general policy is to tear apart every article we read, and half of those we don't. This is certainly not my intent here. Languages are complicated beasties, and everyone seems to understand that, including the writers of the article. So, we should interpret their result in Section 7 as them saying, "Well, maybe this has gotten us a baby-step closer to creating the hypothetical Perfect Natural Language Translator, but someone's gonna have to do a lot more work to see where this thing goes".

  2. Limitations of NGD (Normalized Google Distance) by G4from128k · · Score: 4, Insightful

    Although very clever, NGD (Normalized Google Distance) misses alll higher-order relationships and does not even distinguish between different categories of pairwise relationships. For example, NGD might assume that "Bush" & "Iraq" had the same relationship as "Slashdot" & "Geek" because the two word pairs co-occur with similar frequencies.

    More interesting are analyses on n-Tuples (co-occurences and orderings of n-words at a time). Anyone who does ER (Entity-Relationship) diagrams for relational databases will appreciate that many relationships involve multiple entities that are decomposable into pairwise relationships.

    Another limit is that Google is atrocious on its estimates of the number of hits. The actual number of hits is only fraction (about 60%?) of the estimated from my experience. This suggests that Google has a pairwise estimator built in that may be only partially empirical. If Google simply reports an estimated number of hits based on products of probabilities, then their is no information about the pair in the NGD. Obviously, these scientists have gotten useful results, but NGD may not be as good an estimate of the co-occurence of the words as the scientists assume.

    --
    Two wrongs don't make a right, but three lefts do.
  3. Limits to semantic derivations from Google by saddino · · Score: 4, Interesting

    My company develops a data mining program for OS X (theConcept) that uses Google (or other search engines) to provide links to data for mining.

    For example, searching on Google for "tom cruise" brings up pages upon pages of links, but -- from a cursory glance at the results -- it is impossible to learn anything about Tom Cruise unless one visits those results.

    Our software visits each of those results (for example, the first 100) and looks for the most significant keywords and phrases used over all the data. As you might expect, these typically end up being the names of people (e.g. Nicole Kidman, Penelope Cruz) or movies (e.g. Top Gun, Color of Money) that are associated with Tom Cruise. As far as our software goes, this is ample for doing keyphrase analysis.

    But the problem with deriving any additional meaning from the Internet web space is this: the biases that exist due to the very reasons for mentioning Tom Cruise (namely those things he is famous for) simply outweigh -- by a wide margin -- any other quite relevant interesting data about Tom Cruise. So, in fact, the web, in general, is an awful corpus of valid semantic data.

    If you want a rough model of popular ideas then perhaps Google and the web en masse is useful (it is for our software). But if you want any real meaning at all you come to the same conclusion that has given rise to sites like Wiki: the web, to be blunt, has a whole lot of shit in it. Coming up with a perfect (and rational) filter is quite a task.