Deriving Semantic Meaning From Google Results

Posted by ryuzaki0 on Saturday January 29, 2005 @09:35AM from the can-also-use-tea-leaves-if-google-not-available dept.

prostoalex writes "New Scientist talks about Paul Vitanyi and Rudi Cilibrasi of the National Institute for Mathematics and Computer Science in Amsterdam and their work to extract meaning of words from Google's index. The pair demonstrates an unsupervised clustering algorithm, which 'distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return', according to New Scientist."

Semantic meaning? by zorren · 2005-01-29 09:37 · Score: 2, Interesting

I thought semantic meant "meaning".
Extend this to the library of congress... by physicsphairy · 2005-01-29 09:44 · Score: 3, Interesting

While I think ideally you would endow computers with the same algorithmic usage of speech that is employed by human beings, as these researchers have shown, it is also possible to work with programs that do not 'parse' language but rather categorize it based on massive databases of language that has already been parsed by humans.
This obviously has its failings, but theoretically, you could use a sufficiently large database of common human language coupled with simple algorithms to perform operations like grammar checking.
An internet search would not be quite so useful for that, but I would really be interested in what would be possible with full digital access to the Library of Congress. I would imagine you could do things like automatically generate books based on existing material.

--
When things get complex, multiply by the complex conjugate.
Good for scholars, bad for geeks by kyndig · 2005-01-29 09:51 · Score: 2, Interesting

This is a pretty nice approach. Quoted from the news article: "The technique has managed to distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return, the researchers report in an online preprint." It shows that common terminology can be drawn out. In the end, though, this is a refined search routine for Google, IMHO. It would be good for scholar searches perhaps, or even a dynamic thesaurus. But with a query such as "does windows use linux", the derived results would be broken down into "linux", "windows" and "use", and the Google-cached pages containing these terms vary greatly in content. A search along the lines of "dutch painters favorite colors", on the other hand, would produce the desired results, like the control method used in the news article.

--
My Thoughts, Kyndig
Unsupervised but Reflective of Human Preferences by reporter · 2005-01-29 10:08 · Score: 3, Interesting

Even though I disagree with Google's hiring practices (i.e. preferring H-1Bs when many American engineers are unemployed), I must admit that Google's search algorithm is the best one -- even better than Yahoo! Search, which I use regularly for socio-political reasons.
I will give you an example. If you search news (i.e., either Google News or Yahoo! News) for stories about the recent federal action (by Washington) involving Chinese companies and Iranian weapons improved by Chinese technology, you will discover that one of the popular news articles about this topic comes from the "New York Times". Several other newspapers redistributed the Times article, written by David Sanger (spelling?).
I read that article, but I also read articles from less popular Web news sites, e.g. the "Taipei Times". The "Taipei Times" article does mention that a Taiwanese company was also implicated in the sale of weapons technology to Taiwan. Yet the "New York Times" article made no mention of this fact.
Is the "Taipei Times" telling the truth? It claims that Ecoma Enterprise Company, a Taiwanese company, was one of the culprits.
At this point, I fired up both Yahoo! Search and Google. Only on Google was I successful in locating the ORIGINAL source of the information about American penalties against the 7 Chinese companies and the 1 Taiwanese company. The information is on page 133 of the "Federal Register" (volume 70, number 1). So, I discovered that the "Taipei Times" was telling the truth.
Guess how long I took on Google to find this information? 5 minutes. I kid you not. Even though I hate Google's employment practices, I am quite impressed with their technology.
Using Yahoo! Search, I was not able to locate the desired information.
Apparently, Google has an algorithm that, although unsupervised (i.e. without the kind of human interaction that corrupts Yahoo! Search), captures the notion of what the typical person wants to find. The Google algorithm, dare I say "it", is on the verge of acquiring human sentience. THAT is, indeed, impressive.
Pray to Buddha that the middle name of the CEO is not "666" or Beelzebub. Just kidding.
Been working on similar by Arngautr · 2005-01-29 10:30 · Score: 3, Interesting

I wrote a program that gathered, analyzed and used word pair frequency data (various situational pairings). It needs more raw data, but shows a lot of promise. I opted not to use literature, as that often has archaic and purposefully awful word usage. Some of the issues involved include case (like Fall vs. fall; I chose to ignore case) and grammatical structure (it needs to integrate with a grammar checker). Coupling this with a thesaurus is my eventual goal. This leads to some obvious difficulties, though it has potential rewards. I had considered Google, and have run a few tests using it, but that solution was too simple, and not quite as powerful in the long run. Just had to share, sorry to waste your time.
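
For readers curious what gathering word pair frequency data might look like, here is a minimal sketch in Python. It is not the poster's program: the function name `word_pair_counts`, the choice of adjacent pairs only, and the sample text are all assumptions made for illustration. It does follow the one design decision the post states, which is to ignore case.

```python
from collections import Counter
import re

def word_pair_counts(text):
    """Count adjacent word pairs, ignoring case (so 'Fall' and 'fall' merge)."""
    # Lowercase first, then keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    # Pair each word with its successor; sentence boundaries are ignored here.
    return Counter(zip(words, words[1:]))

# Hypothetical sample text, chosen to show the case-folding effect.
sample = "In the Fall the leaves fall. The leaves fall in the fall."
counts = word_pair_counts(sample)

for pair, n in counts.most_common(3):
    print(pair, n)
```

Folding case keeps the table small but loses the season/verb distinction the post mentions, which is presumably where a grammar checker or part-of-speech tagger would have to come in.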
Re:wARTIME? by Fjornir · 2005-01-29 10:31 · Score: 2, Interesting

Er. You didn't get the joke, so I will explain. The lore is that "repeat" is a command to the artillery to fire again on their last target, so you never ever say "repeat" on the radio, instead you say "say again".
The lore also contains an interesting anecdote about the '92 riots in LA. Apparently a group of Marines were dispatched to assist the police. Two officers were approaching a house when someone opened up with a shotgun at them. One officer shouted "cover me" -- so the Marines proceeded to lay down covering fire on the house -- more than two hundred rounds were fired into that house.

--
I want a new world. I think this one is broken.
Re:The elephant in the living room. by ericbg05 · 2005-01-29 11:16 · Score: 5, Interesting

The article mentions English-Spanish translation. When one language is ambiguous (from a bit of Spanish I had in HS I'm guessing English is far more ambiguous), there is no hope of easy translation.
Every language has "ambiguity", but ambiguity can come in different flavors (phonological, morphological, syntactic, semantic, pragmatic). Some of the chief instigators of language change can be thought of as ambiguity on these levels. So firstly, it's hard to imagine the existence of a function mapping languages to "ambiguity levels".
The motivation for your comment about English versus Spanish probably comes from the fact that you know of more English homophones than Spanish ones. Indeed, most literate people think of their language in terms of written words, so your take on the matter is common.
(As a slight digression, your example of right the direction versus right as in 'correct, just' is pretty interesting. We can understand the semantic similarity between the two when we notice that most humans are right-handed. Thus it is extraordinarily common, cross-linguistically and cross-culturally, for the word meaning the direction 'right' to have similar meanings as dextrous, just, well-guided and so on, whereas the word meaning the direction 'left' also has meanings such as worthless, stupid. (In fact, the word dextrous was borrowed through French from the Latin word dexter meaning 'right, dexterous' or dextra meaning 'right hand'.) So the given example is one where, historically, a word had no ambiguity, but gained ambiguity because speakers started using it differently.)
Getting back to the main topic, more problematic about Section 7 of TFA is the implicit assertion that, at some point in the future, their techniques can be applied to create a function mapping words in a particular language to words in another language. Anybody who has studied more than one language has seen cases where this is difficult to do on the word-level. For instance, the French equivalent of English river is often given as riviere or fleuve. But riviere is only used by French speakers to mean 'river or stream that runs into another river or stream' whereas fleuve means 'river or stream that runs into the sea'. English breaks up river-like things by size: rivers are bigger than streams. So, in the strictest sense, there is no English word for fleuve, just as there's no French word for stream (unless there has been a recent borrowing I don't know about). This certainly does not imply that French people can't tell the difference between big rivers and small rivers; their lexicon just breaks things up differently.
These little problems can be remedied lexically, as I've just done. So fleuve is denotationally equivalent to river or stream that runs into the sea, although the latter is obviously much bulkier than its French equivalent. The real problem is that there are words in some languages whose meanings are not encoded at all in other languages. English, for example, has a lexical past-progressive tense marker, was, used in the first person singular (e.g. I was running to the store). Some languages have no notion of tense. What, then, does was mean in the context of such a language?
It's pretty well-known that Slashdotters' general policy is to tear apart every article we read, and half of those we don't. This is certainly not my intent here. Languages are complicated beasties, and everyone seems to understand that, including the writers of the article. So, we should interpret their result in Section 7 as them saying, "Well, maybe this has gotten us a baby-step closer to creating the hypothetical Perfect Natural Language Translator, but someone's gonna have to do a lot more work to see where this thing goes".
Re:The elephant in the living room. by Anonymous Coward · 2005-01-29 11:33 · Score: 1, Interesting

I'm sorry, but as a linguist/cognitive scientist, I have to call bullshit on any current approaches to AI when it comes to language. "Words" do not have "meanings." Current research regards "chunking" and "lexical phrases." Basically, humans do not typically process words' individual meanings, but rather spit out pre-formed, formulaic expressions. This is why children who cannot form a sentence on their own yet will say things like "Let's go." They don't see that as a fully grammatical utterance; they see it as a word.

Hence, teaching a computer to recognize meaning when we do not fully recognize meaning (only use) is far, far more difficult than the typical computer scientist realizes.
Re:Limitations of NGD (Normalized Google Distance) by Rudi+Cilibrasi · 2005-01-29 12:14 · Score: 2, Interesting
You are right that Google may be performing estimation and this could affect results, and I don't really know what sort of rounding they do at this time. Perhaps more will become apparent. But your other assertion about no higher order statistics is incorrect. See the earlier Clustering by Compression paper for more info. Quickly, the reason is as follows:
- I use NGD to convert arbitrarily-large lists of search-terms into feature-vectors of arbitrary dimension. The only limit to this is the max query length for Google, and this is just a detail.
- I use a Support Vector Machine with a Radial Basis Function kernel. The RBF kernel has an effectively infinite dimension and so can learn any function. SVM is a universal learner like neural nets and many other famous algorithms. So higher-order features (composed of products of several NGD) can indeed be used in learning.
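
The first step above, turning search terms into NGD feature vectors, can be sketched roughly as follows. This assumes the NGD formula as given in the Cilibrasi and Vitanyi paper, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f is a hit count and N is a normalizing constant for the index size. The helper `feature_vector`, the anchor-term layout, and every hit count below are made up for illustration; the SVM step is left out.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts.

    fx, fy: hits for each term alone; fxy: hits for both terms together;
    n: assumed total number of indexed pages (a normalizing constant).
    All counts must be positive, since log(0) is undefined.
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

def feature_vector(term, anchors, hits, pair_hits, n):
    """One NGD value per anchor term, in anchor order (hypothetical helper)."""
    return [
        ngd(hits[term], hits[a], pair_hits[frozenset((term, a))], n)
        for a in anchors
    ]

# Made-up counts for illustration only, not real Google results.
N = 10**6
hits = {"red": 1000, "blue": 1000, "seven": 1000}
pair_hits = {
    frozenset(("red", "blue")): 100,
    frozenset(("red", "seven")): 10,
}
vec = feature_vector("red", ["blue", "seven"], hits, pair_hits, N)
print(vec)  # "blue" comes out closer to "red" than "seven" does
```

Vectors like `vec` would then be the inputs to whatever learner is used; pairs with zero joint hits need special handling that this sketch does not attempt.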
The main purpose of the research is in extending generality of automatic learning. See the earlier papers in the series including Algorithmic Clustering of Music, and the earlier theoretical work. NGD is a special case of NCD. NCD is a family of functions that can be used as the basis of a universal learning system in a variety of ways. Our theory justifies this innovation and leads to a whole class of easy to write algorithms.
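
As a rough illustration of NCD, here is a sketch using the formula from the Clustering by Compression paper, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. Using zlib as the compressor is an assumption made for convenience, and the sample strings are invented; any real compressor only approximates the ideal one the theory talks about.

```python
import zlib

def compressed_len(data: bytes) -> int:
    """Length of the zlib-compressed input, standing in for C(x)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Invented sample inputs: one repetitive text and one unrelated text.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"colorless green ideas sleep furiously tonight " * 20

print("same text:     ", ncd(a, a))
print("different text:", ncd(a, b))
```

The distance of a text to itself comes out small rather than exactly zero because a real compressor adds some overhead; what matters for clustering is that similar inputs score lower than dissimilar ones.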
Thanks for your interest, it is good to see that this research is striking a chord with the Slashdot community. I hope this leads to a whole lot of more easy-to-use semi-intelligent software. Cheers!
Limits to semantic derivations from Google by saddino · 2005-01-29 12:18 · Score: 4, Interesting

My company develops a data mining program for OS X (theConcept) that uses Google (or other search engines) to provide links to data for mining.

For example, searching on Google for "tom cruise" brings up pages upon pages of links, but -- from a cursory glance at the results -- it is impossible to learn anything about Tom Cruise unless one visits those results.

Our software visits each of those results (for example, the first 100) and looks for the most significant keywords and phrases used over all the data. As you might expect, these typically end up being the names of people (e.g. Nicole Kidman, Penelope Cruz) or movies (e.g. Top Gun, Color of Money) that are associated with Tom Cruise. As far as our software goes, this is ample for doing keyphrase analysis.
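
A toy version of that kind of keyphrase pass might look like the sketch below. It is not theConcept's actual algorithm, which the post does not describe; the stopword list, the use of two-word phrases, the scoring by number of pages containing a phrase, and the three sample "pages" are all assumptions for illustration.

```python
from collections import Counter
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "with", "to", "is", "was"}

def top_keyphrases(pages, k=3):
    """Rank two-word phrases by how many pages contain them."""
    doc_freq = Counter()
    for page in pages:
        words = re.findall(r"[a-z]+", page.lower())
        # A set, so each phrase counts at most once per page.
        phrases = {
            (w1, w2)
            for w1, w2 in zip(words, words[1:])
            if w1 not in STOPWORDS and w2 not in STOPWORDS
        }
        doc_freq.update(phrases)
    return doc_freq.most_common(k)

# Stand-ins for fetched search results; real pages would need HTML stripping.
pages = [
    "Tom Cruise starred in Top Gun.",
    "Nicole Kidman married Tom Cruise.",
    "Top Gun made Tom Cruise famous.",
]
print(top_keyphrases(pages, 2))
```

Even in this toy data the query phrase itself dominates, which hints at the bias problem: whatever a name is most often mentioned alongside will crowd out everything else.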

But the problem with deriving any additional meaning from the Internet web space is this: the biases that exist due to the very reasons for mentioning Tom Cruise (namely those things he is famous for) simply outweigh -- by a wide margin -- any other quite relevant interesting data about Tom Cruise. So, in fact, the web, in general, is an awful corpus of valid semantic data.

If you want a rough model of popular ideas then perhaps Google and the web en masse is useful (it is for our software). But if you want any real meaning at all you come to the same conclusion that has given rise to sites like Wiki: the web, to be blunt, has a whole lot of shit in it. Coming up with a perfect (and rational) filter is quite a task.
What you build is a substrate. by Dylan+Thomas · 2005-01-29 13:08 · Score: 2, Interesting

You're quite correct that cowboy-loose definitions of terms make this a very difficult discussion to have. For example, when you say "self awareness," it's unlikely that you actually mean "self awareness" in the literal sense; after all, if a computer is capable of detecting when its processor is overheating (and perhaps turning on a fan in response), it is basically "self aware," though we wouldn't confuse that with intelligence.
Rather, I think by "self awareness" here you mean, possessing narrativity; that is, the ability to construct a narrative of itself in relation to the things of which it is aware. In simpler words, consciousness. Now, it is possible to be intelligent without being conscious (everyone thinks they have the smartest dog in the world, but that doesn't make the poor beast conscious). But is it possible to be conscious without being intelligent?
Consciousness is fundamentally linguistic in origin (and I'm tired of arguing that point with people who haven't done a day of cognitive studies in their lives; there's no way around it: without language, consciousness does not evolve). So, for example, in the course of human evolution, first a linguistic parsing system was evolved, humans got language, and then, once this substrate was established, consciousness evolved as an epiphenomenon which rode on top of it. This substrate proved to be a fertile breeding ground on which memetic evolution could take place, as well, and since that is broader than any one particular human component in the system, it's almost more proper to say that we are the tools memes use to propagate, and not vice versa. (This argument is fairly well established with genes; same rules apply.)
So, any artificial system which contains "consciousness" will have to first handle language. If you don't have that linguistic substrate for narrativity and memetic evolution, there is nothing for consciousness to occur in. Maybe the information is there, but it would be like me pointing to an empty spot in the room and saying, "That's a balloon full of air; I just forgot the balloon." So, let's do this in the proper order: language first, then consciousness.

--
What he wants is more important than what I want. What he wants is also more important than what you want.
Re:The elephant in the living room. by danila · 2005-01-30 04:30 · Score: 2, Interesting
Well, I don't think there is a way to translate a text into another language without
1. understanding the text
2. understanding both languages
3. understanding the socio-cultural context of both languages
But we must consider the fact that most humans can't produce a decent translation either, even if they think they understand both languages. I've been professionally translating movies (EN->RU) and I know to what extent the scripts are riddled with linguistic traps. An average professional human translator would be lucky to produce a 90% valid translation (much less a perfect one).
So if the developers of computer translation tools do not strive for perfection, but for an average human level, they might succeed rather soon, by using a combination of several approaches (including the one described in this paper). Of course, a translation program can confuse NATO with the Northern Alliance, or Thai with Tahitian but so can a human.
--
Future Wiki -- If you don't think about the future, you cannot have one.