Deriving Semantic Meaning From Google Results
prostoalex writes "New Scientist talks about Paul Vitanyi and Rudi Cilibrasi of the National Institute for Mathematics and Computer Science in Amsterdam and their work to extract meaning of words from Google's index. The pair demonstrates an unsupervised clustering algorithm, which can 'distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return', according to New Scientist."
These kinds of articles never seem to get a very basic problem: natural languages. English is full of words that trip up even humans. "Right" the direction versus "right" the judgement is a good example. In wartime something as simple as that may have led to death. It's the elephant in the living room: a huge, important problem that nobody wants to talk about. There are alternatives, such as Lojban, which can be parsed like any computer program.
The article mentions English-Spanish translation. When one language is ambiguous (from a bit of Spanish I had in HS I'm guessing English is far more ambiguous), there is no hope of easy translation. And it's worse because the bigger application may be translating the many English pages (ambiguous) to Spanish.
Transcend Humanity. Please.
Can someone explain this to me? Yes I did try to RTFA.
--sig fault--
I thought semantic meant "meaning".
what the hell are you babbling about??!?
Because God knows I'd never be able to distinguish between Dutch painters on my own.
...a few steps away from BSG 75 becoming a reality.
Yep. It's Yet Another Application For Tax Money. It's just academic hot air that should be ignored until someone actually produces something that works and can be sold.
go to google define: microsoft, http://www.google.com/search?hl=en&q=define%3A+microsoft&btnG=Google+Search
they all have it wrong until you see "NOT A STANDARDS BODY"
Is this in any way related to the way that Google was able to decide all on its own that Scientology was crap, and thus bring Operation Clambake up to the top of the search results? (Until the Scientology people got pissed, anyway.)
:)
Google is already starting to show signs of intelligence higher than some people.
"Everything you know is wrong. (And stupid.)"
Moderation Totals: Wrong=2, Stupid=3, Total=5.
This obviously has its failings, but theoretically, you could use a sufficiently large database of common human language coupled with simple algorithms to perform operations like grammar checking.
An internet search would not be quite so useful for that, but I would really be interested in what would be possible with full digital access to the Library of Congress. I would imagine you could do things like automatically generate books based on existing material.
When things get complex, multiply by the complex conjugate.
...as opposed to 'non-semantic meaning' or just 'semantic meaning' as in 'I don't know what semantic means but using it here will make me look intelligent'?
Doesn't it make you feel good to know that our freedoms are protected by politicians, lawyers and journalists?
This is basically what I was referring to in my response to "Using The Web For Linguistic Research" when I said:
followed by the explanation: and
Seastead this.
After consulting with the elephant in my living room, I have only one thing to say. semantic, also semantical, adj. 1. Of or relating to meaning, especially meaning in language. 2. Of, relating to, or according to the science of semantics.
Warning: Could be fatal if taken seriously
This is a pretty nice approach. Quoting the news article: "The technique has managed to distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return, the researchers report in an online preprint." It shows that common terminology can be drawn out. In the end, though, this is a refined search routine for Google, IMHO. It would be good for scholarly searches perhaps, or even a dynamic thesaurus. But with a query such as "does windows use linux", the derived results would be broken down into "linux", "windows" and "use", and the Google-cached pages containing these terms vary greatly in content. Searching for something along the lines of "dutch painters favorite colors", on the other hand, would produce the desired results, like the control method used in the news article.
My Thoughts, Kyndig
wow! = conveying amazement
WoW! = I can't believe I paid $50 for this crap and can't even logon!
I doubt that it will work in that way. However...
When we talk about meanings of words, we are really taking a symbol and utilizing it in the place of a meaning. For instance, when I say dog, I am talking of the pattern of dog - a furry thing that moves in certain directions and can bark, but has a wide range of sizes.
In other words, whenever we are saying a word we are taking patterns and using them. Whenever I see a certain pattern I can identify it with something, and this is where a word comes into play. The word identifies a pattern: i.e., my friend Gus. Whenever I see the pattern of Gus I know that it is Gus.
This is why the search won't work: one must take other experiences (sight, smell, sound) and combine them into a pattern, and then represent that pattern. Deriving meanings of words from words will not work unless you have the actual visual representations involved. This is where Google may work: their image searches may be a great way for AI systems to derive meanings. They can find visuals of many different words, then use logic to find out what the meaning of each word is. Using such searches, they will learn the meanings and then be able to use them in real life. E.g., if they typed in the word "dog", they would be able to find images of dogs, and then connect them using pattern recognition systems to realise what "dog" is. Of course, this isn't AI yet; one must still include other forces such as need, which drives one to action, and other stuff...
Anyway, I always loved AI and I am happy that the guys made such big progress in it... hopefully when I grow up I will also go into the industry... after the 1980's it really did go KAPUT, now it's reviving!
First off, I am not an "AI" expert nor do I claim to be; however, this is how I see it.
...
Since it seems that so few really understand the term "intelligence", it is really not surprising that even fewer grasp the meaning of the term "artificial intelligence", is it?
One: intelligence is not awareness.
Although we cannot prove the existence of or even seem to really define self-awareness, it seems self-evident, at least to me, that intelligence is clearly defined and can be measured.
Therefore, I believe that we will have "artificial intelligence" soon; in fact, I'd bet Google may well be the first AI or "self intelligent" engine.
However, I suspect it will be quite a while before we are mature enough to build a self-aware engine.
Lastly, in regards to some of the other comments, it seems to me that this paper is about using the "intelligence" included in the language we use, that Google crawls. This repository is the single largest collection of semantic weighting, therefore, algorithms could be developed that reflect this "intelligence", therefore appear themselves intelligent, even though they themselves are simply deterministic.
Whew
Words to men, as air to birds.
I will give you an example. If you search news (i.e., either Google News or Yahoo! News) for stories about the recent federal action (by Washington) involving Chinese companies and Iranian weapons improved by Chinese technology, you will discover that one of the popular news articles about this topic comes from the "New York Times". Several other newspapers redistributed the Times article, written by David Sanger (spelling?).
I read that article, but I also read articles from less popular Web news sites: e.g. "Taipei Times". The "Taipei Times" article does mention that a Taiwanese company was also implicated in the sale of weapons technology to Iran. Yet the "New York Times" article made no mention of this fact.
Is the "Taipei Times" telling the truth? It claims that Ecoma Enterprise Company, a Taiwanese company, was one of the culprits.
At this point, I fired up both Yahoo! Search and Google. Only on Google was I successful in locating the ORIGINAL source of the information about American penalties against the 7 Chinese companies and the 1 Taiwanese company. The information is on page 133 of the "Federal Register" (volume 70, number 1). So, I discovered that the "Taipei Times" was telling the truth.
Guess how long I took on Google to find this information? 5 minutes. I kid you not. Even though I hate Google's employment practices, I am quite impressed with their technology.
Using Yahoo! Search, I was not able to locate the desired information.
Apparently, Google has an algorithm that, although unsupervised (i.e. without the kind of human interaction that corrupts Yahoo! Search), captures the notion of what the typical person wants to find. The Google algorithm, dare I say "it", is on the verge of acquiring human sentience. THAT is, indeed, impressive.
Pray to Buddha that the middle name of the CEO is not "666" or Beelzebub. Just kidding.
Although very clever, NGD (Normalized Google Distance) misses all higher-order relationships and does not even distinguish between different categories of pairwise relationships. For example, NGD might assume that "Bush" & "Iraq" had the same relationship as "Slashdot" & "Geek" because the two word pairs co-occur with similar frequencies.
More interesting are analyses on n-tuples (co-occurrences and orderings of n words at a time). Anyone who does ER (Entity-Relationship) diagrams for relational databases will appreciate that many relationships involve multiple entities and are not decomposable into pairwise relationships.
Another limit is that Google is atrocious at estimating the number of hits. The actual number of hits is only a fraction (about 60%?) of the estimate, in my experience. This suggests that Google has a pairwise estimator built in that may be only partially empirical. If Google simply reports an estimated number of hits based on products of probabilities, then there is no information about the pair in the NGD. Obviously, these scientists have gotten useful results, but NGD may not be as good an estimate of the co-occurrence of the words as the scientists assume.
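The NGD being discussed can be written down in a few lines. The sketch below uses the formula as given in the Cilibrasi and Vitanyi preprint, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f counts hits and N is the number of pages indexed. The function name, the hit counts and the value of N are all made up for illustration. It also checks the worry raised above: if the pair count were just a product of probabilities, f(x, y) = f(x) f(y) / N, the numerator and denominator both reduce to log N - min(log f(x), log f(y)), so the NGD is 1 for every pair and says nothing about that particular pair.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts.

    fx, fy: hits for each term alone; fxy: hits for pages containing both;
    n: assumed number of pages in the index.
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Hypothetical numbers, not real Google counts.
N = 8_000_000_000
fx, fy = 5_000_000, 2_000_000

# Pair count as a pure product of probabilities (no empirical pair data).
independent_fxy = fx * fy / N
print(ngd(fx, fy, independent_fxy, N))  # approximately 1.0

# Terms that always appear together are at distance 0.
print(ngd(fx, fx, fx, N))
```

Real pair counts that are higher than the independence estimate pull the value below 1, which is where any useful signal would have to come from.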
Two wrongs don't make a right, but three lefts do.
Looks like it's slashdotted
Is it my imagination or is this essentially a new flavor of Kabbalah with the same strengths and weaknesses?
Just need to fit that in googlefs to get better results on queries :)
Intelligence does, however, imply the ability to perform self-directed learning. Without that, all you have is preprogrammed behavior, which is not intelligence. Given the ability to learn, an intelligent entity is likely to draw conclusions about its own existence ("I think therefore I am"), and will thus essentially be self-aware.
Of course, the builders of an artificially intelligent machine might restrict its ability to gather facts about itself - it wouldn't necessarily have the ability to "see" its "body", for example - so this may limit the scope of the AI's self-awareness, at least at first. However, that's an artificially-imposed external constraint, which says nothing about the AI's ability or potential for self-awareness.
Sometimes in a discussion somebody might say
"I don't want to get into semantics".
I always want to yell - "why worry about the meaning of things - it'll just cloud things".
'Tis maybe a karmic balancing of centuries-old morally wrong Amerikan policies wherein engineers of color have been refused employment in their own country. Those traditionally privileged are now being marginally impeded from their "given" privilege. H1-B's don't seem to be displacing too many people.
I wrote a program that gathered, analyzed and used word pair frequency data (various situational pairings). It needs more raw data, but shows a lot of promise. I opted to not use literature, as that often has archaic and purposefully awful word usage. Some of the issues involved include case, like Fall vs. fall (I chose to ignore case), and grammatical structure (it needs to integrate with a grammar checker). Coupling this with a thesaurus is my eventual goal; this leads to some obvious difficulties, though it has potential rewards. I had considered Google, and have run a few tests using it, but that solution was too simple, and not quite as powerful in the long run. Just had to share, sorry to waste your time.
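For readers wondering what gathering word pair frequency data looks like in practice, here is a minimal sketch. It is not the poster's program; the function name, the tokenizer and the sample text are assumptions. It counts adjacent word pairs and ignores case, so "Fall" and "fall" are treated as the same word, as described above.

```python
from collections import Counter
import re

def pair_counts(text, window=2):
    """Count ordered word pairs that occur within `window` words of each other.

    Case is ignored, and pairs are counted across sentence boundaries,
    which a real tool would probably want to avoid.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            counts[(word, other)] += 1
    return counts

# Made-up sample text for illustration.
counts = pair_counts("Fall leaves fall. fall leaves Fall")
for pair, n in counts.most_common():
    print(pair, n)
```

Folding case loses the season/verb distinction, which is the trade-off the poster mentions; a grammar-aware version would need part-of-speech information to recover it.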
What about multiple semantism?
How would it understand GNAA? As "Greater Nashville Auburn Association" or as "Guilford Native American Association"?
I'm a sci-fi vegan: I don't want the aliens to think we have as much right to live as the fried chickens we eat.
This is the meaning of this article for me.
Can someone explain it to me in a human language?
http://www.michel.eti.br
(whispering)
"!=" = NOT equal to
Intelligence can be seen as the ability to take a sample of some space and generalize it to predict things about the space from which the sample was drawn. The smaller the sample and the more accurate the prediction, the greater the intelligence.
Good description, and I agree, though I would alter it slightly to include speed and range of prediction, i.e.:
The smaller the sample, the larger the domain covered, and the quicker and more accurate the prediction, the greater the intelligence.
My company develops a data mining program for OS X (theConcept) that uses Google (or other search engines) to provide links to data for mining.
For example, searching on Google for "tom cruise" brings up pages upon pages of links, but -- from a cursory glance at the results -- it is impossible to learn anything about Tom Cruise unless one visits those results.
Our software visits each of those results (for example, the first 100) and looks for the most significant keywords and phrases used over all the data. As you might expect, these typically end up being the names of people (e.g. Nicole Kidman, Penelope Cruz) or movies (e.g. Top Gun, Color of Money) that are associated with Tom Cruise. As far as our software goes, this is ample for doing keyphrase analysis.
But the problem with deriving any additional meaning from the Internet web space is this: the biases that exist due to the very reasons for mentioning Tom Cruise (namely those things he is famous for) simply outweigh -- by a wide margin -- any other quite relevant interesting data about Tom Cruise. So, in fact, the web, in general, is an awful corpus of valid semantic data.
If you want a rough model of popular ideas then perhaps Google and the web en masse is useful (it is for our software). But if you want any real meaning at all you come to the same conclusion that has given rise to sites like Wiki: the web, to be blunt, has a whole lot of shit in it. Coming up with a perfect (and rational) filter is quite a task.
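The keyphrase step described above (visit each result, keep the phrases that recur across pages) can be sketched roughly as follows. This is not theConcept's actual method; the function name, the stopword list and the toy "pages" are assumptions made for illustration. Each two-word phrase is counted once per page, so a phrase scores highly only if many different pages mention it.

```python
from collections import Counter
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "with", "is", "to", "was"}

def top_phrases(pages, k=3):
    """Return the k two-word phrases that appear on the most pages."""
    page_freq = Counter()
    for page in pages:
        words = [w for w in re.findall(r"[a-z]+", page.lower())
                 if w not in STOPWORDS]
        # A set, so each phrase counts at most once per page.
        phrases = {" ".join(pair) for pair in zip(words, words[1:])}
        page_freq.update(phrases)
    return page_freq.most_common(k)

# Toy stand-ins for fetched search results.
pages = [
    "Tom Cruise starred in Top Gun",
    "Top Gun made Tom Cruise famous",
    "Nicole Kidman married Tom Cruise",
]
print(top_phrases(pages, 2))  # [('tom cruise', 3), ('top gun', 2)]
```

The bias problem raised above shows up directly in this kind of count: whatever most pages repeat wins, regardless of whether it is the most informative fact.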
They are developing an open source tool http://complearn.sourceforge.net/ that will hopefully integrate the algorithm they describe. Right now it's only supporting one of their previous algorithms. More about this in the above link and section 5 of the paper.
I've perused the abstract and skimmed the body of the paper. They're fine. But the title is misleading: Automatic Meaning Discovery Using Google.
Their software has discovered meaning no more than paper has when the lexicographer is done writing her dictionary. Meaning is not the grouping of symbols.
For systems that step towards encoding meaning as human brains do, consider the Neural Theory of Language.
You're quite correct that cowboy-loose definitions of terms make this a very difficult discussion to have. For example, when you say "self awareness," it's unlikely that you actually mean "self awareness" in the literal sense; after all, if a computer is capable of detecting when its processor is overheating (and perhaps turning on a fan in response), it is basically "self aware," though we wouldn't confuse that with intelligence.
Rather, I think by "self awareness" here you mean, possessing narrativity; that is, the ability to construct a narrative of itself in relation to the things of which it is aware. In simpler words, consciousness. Now, it is possible to be intelligent without being conscious (everyone thinks they have the smartest dog in the world, but that doesn't make the poor beast conscious). But is it possible to be conscious without being intelligent?
Consciousness is fundamentally linguistic in origin (and I'm tired of arguing that point with people who haven't done a day of cognitive studies in their lives; there's no way around it: without language, consciousness does not evolve). So, for example, in the course of human evolution, first a linguistic parsing system was evolved, humans got language, and then, once this substrate was established, consciousness evolved as an epiphenomenon which rode on top of it. This substrate proved to be a fertile breeding ground on which memetic evolution could take place, as well, and since that is broader than any one particular human component in the system, it's almost more proper to say that we are the tools memes use to propagate, and not vice versa. (This argument is fairly well established with genes; same rules apply.)
So, any artificial system which contains "consciousness" will have to first handle language. If you don't have that linguistic substrate for narrativity and memetic evolution, there is nothing for consciousness to occur in. Maybe the information is there, but it would be like me pointing to an empty spot in the room and saying, "That's a balloon full of air; I just forgot the balloon." So, let's do this in the proper order: language first, then consciousness.
What he wants is more important than what I want. What he wants is also more important than what you want.
Thank you for the reply. I'm glad your work generalizes to longer search-term lists. Like so many other /. readers, I did not take the time to read your preprint before posting.
I've often wondered if one can use simple pair-wise distance estimates to reconstruct a polytope or distorted simplex for the set of items within a multidimensional space. In theory, an N-object system, with non-zero pairwise distances, requires (N-1) dimensions. But in practice, many real systems don't fill the space -- being M-dimensional (M less than N-1) and having only negligible (perhaps noise-induced) thickness in the other dimensions.
For semantic systems, the total number of semantic dimensions may be far less than the number of semantic terms or tokens. A simple example of dimensional flattening is the existence of synonyms -- the second word does not expand the space because it does not encode a new dimension of meaning. (Synonyms would also be negatively correlated in Google searches, but that's another issue). Also, the fact that each word can be defined in terms of other words suggests that the semantic nebula does not actually fill the space.
Accomplishing this would require a true distance metric. I notice that NGD does not satisfy the triangle inequality. Perhaps some minor transform or alternative formulation of NGD would yield a true metric.
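Whether a given table of pairwise distances behaves like a true metric is easy to test by brute force. The sketch below assumes the distances are stored in a square list of lists; the function name and the example matrices are made up for illustration and are not NGD values.

```python
from itertools import permutations

def triangle_violations(dist, tol=1e-12):
    """Return every (i, j, k) where going via j is shorter than going direct,
    i.e. dist[i][k] > dist[i][j] + dist[j][k]."""
    n = len(dist)
    return [(i, j, k) for i, j, k in permutations(range(n), 3)
            if dist[i][k] > dist[i][j] + dist[j][k] + tol]

# Hypothetical distances: items 0 and 2 are "far apart" yet both close to 1.
not_a_metric = [[0, 1, 3],
                [1, 0, 1],
                [3, 1, 0]]
print(triangle_violations(not_a_metric))  # [(0, 1, 2), (2, 1, 0)]

a_metric = [[0, 1, 2],
            [1, 0, 1],
            [2, 1, 0]]
print(triangle_violations(a_metric))  # []
```

The check is cubic in the number of items, which is fine for a few hundred terms but not for a whole vocabulary.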
The reason that estimating semantic dimensionality is useful is twofold. First, it says something about the cognitive complexity of humans and human systems. Second, it provides some insight into the required cognitive sophistication of autonomous learning systems that need to interact "intelligently" with humans. How many words does a system need to truly understand to pass the Turing test?
Creating a full reconstruction, a more challenging task, would provide insight into the structure of human language and human language usage patterns. The dimensionality of clusters of words might provide insight into the complexity of subdomains of knowledge.
I wish you every success in creating better autonomous learning systems.
Two wrongs don't make a right, but three lefts do.
Just because you heard some rumor, or even heard it more than once, or possess one datapoint does not mean you know what Google's hiring practices are. There is no bias against Americans; I speak from personal experience. Having such a bias would be both illegal and stupid, and Google is law-abiding and not-stupid.
Good point. However it is difficult to value time in a single competitive metric whereas compression ratio (where the initial and compressed sizes include the size of the algorithm/knowledge of the AI) is a single number.
Perhaps the way around this is to have different prizes for different time classes, varying by an exponential. You'd have, say, 3 competitions with timeouts of one unit of time, 10 time units and 100 time units. This could make the contest run in a reasonable period of time at a reasonable cost.
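As a rough illustration of the single-number metric mentioned above, the sketch below charges the size of the decompressor against the compressed output, and lists the three exponentially spaced time limits. The use of zlib, the function name and the byte counts are stand-ins chosen for illustration; the discussion does not name a specific compressor or contest.

```python
import zlib

# Time limits for the three proposed classes, in arbitrary time units.
TIME_CLASSES = [10 ** k for k in range(3)]  # [1, 10, 100]

def total_ratio(data: bytes, decompressor_size: int) -> float:
    """Compressed size plus the program needed to undo it, over original size.

    Lower is better. Counting the decompressor stops an entrant from hiding
    the data inside the program itself.
    """
    compressed = zlib.compress(data, 9)
    return (len(compressed) + decompressor_size) / len(data)

sample = b"the quick brown fox jumps over the lazy dog " * 200
print(total_ratio(sample, decompressor_size=0))
print(total_ratio(sample, decompressor_size=2000))  # hypothetical program size
```

Each time class would then rank entries by this ratio among those that finish within its limit.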
Seastead this.
Are you sleeping, moderators?
The author of the paper took the time to answer some questions in an insightful and friendly manner, and his post is still buried at +1.
You can do better than that.
If given perfect information about the relationships between concepts, you could derive a very intelligent machine. Take a human, for example.
A baby hears the word mom spoken by his mom. Gradually, the baby knows there is a relationship between that sound and a smiling face.
The child, growing up, starts to see relationships. Intense pain, which is rare, when correlated with a hot stove, has strong meaning in his mind.
Everything is learned initially through correlations. The advantage of human beings is that there are many more data points for correlation. Google's correlations are weak and don't give nearly as much information.
"the ability to "see" its "body"" - Individual ants do not learn; they are very much like small robots. An ants' nest, on the other hand, can display a modicum of intelligence in the way that it forages and protects itself.
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
Here's the original thread from June 2004: http://forums.searchenginewatch.com/showthread.php?t=48
Here are the writings from Dr. Garcia's own web site: http://www.miislita.com/semantics/c-index-1.html - see especially parts three and four.
Intelligence is not just knowing _absolute_ semantic meaning - e.g. that a cow is a cow and grass is grass. And being able to group grass with other grasses and cows with other bovine animals.
:)
It's being able to understand the statement that cow is to grass in a similar way that baleen whales are to krill.
And then knowing something about krill from that, even if you didn't know what krill was at all.
It's not just knowing the "absolute value" of the meaning - or even that these two objects are linked or close in the same area ( which seems to be the level which most current AI are at).
It's more like kind of knowing the "vector/direction" they are linked, and being able to organize other objects that are related in similar ways in a similar vector. Thus you can learn about things by analogies and metaphors AND even create new things with those.
Would pump out more BS but I have to go for dinner
Compression is a stricter test for AI than Turing
By stricter do you mean narrower and incomplete? Do you think that taking something overly terse and compressed and explaining it simply with examples and analogies etc. is a greater intellectual achievement?
Intelligence can be seen as the ability to take a sample of some space and generalize it to predict things
It would be myopic to see it as such. The ability to communicate an idea is a closer description of what it is to be intelligent as captured by the Turing test which rightly leaves the problem domain beyond this undefined. I know many intelligent people who are incontinently verbose and cannot summarise in a techie/scientific mode but can communicate a feeling or a subtle insight by complex layered descriptions. Compression and prediction might be an optional string of intelligence but by no means the whole instrument.
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
A slug is not conscious. Nothing without language is. Recommended reading: Dr. Daniel C. Dennett, Consciousness Explained and Darwin's Dangerous Idea. Richard Dawkins, The Extended Phenotype. Julian Jaynes, The Origin of Consciousness in the Breakdown of the Bicameral Mind.
Those are all more commercial works, well within the grasp of even people who've done no work in the field. For more scholarly and technical references, check their bibliographies, especially in Dennett.
What he wants is more important than what I want. What he wants is also more important than what you want.
"Sheesh" is a word which normally means, "I'm not very good at actually saying what I mean, so I'll just make strange noises and roll my eyes at someone who won't figure it out for me." (It's also the nick of one of my favorite Internet trolls; what ever happened to the good old days when trolls actually tried to be entertaining, instead of merely annoying?)
Anyway, okay, it's loose definitions of words that are once again getting us into trouble here. That a slug is aware of its environment, as in, capable of responding to environmental stimuli, okay, I'll give you that one. I won't gift wrap it for you, but I'll give it to you.
But that's entirely different from "consciousness" in the sense that we're discussing here. After all, even a computer is capable of detecting environmental stimuli, and responding to them, but as my colleague 2TecTom is pointing out in this same thread, the mere ability to respond to environmental stimuli is not synonymous with consciousness.
Read the source material. It'll give you the weapons you need to overcome your sheeshing.
What he wants is more important than what I want. What he wants is also more important than what you want.
My feeling is that what's often missed in the various AI language research programs is the problem of speech and reference. Any linguist working in the field will tell you that speech recognition is an incredible pain in the arse, and then to layer semantic recognition on top of that is doubly painful. Though my real concern is about things like irony and sarcasm. I'm glad somebody stepped up to point out that different languages break concepts and the world up in different ways. But how exactly can you get AI to gather enough circumstantial data to understand something like sarcasm, where the connoted meaning is often different from, or the exact opposite of, the denoted?
And what about signs that are coded to mean multiple things to multiple people simultaneously? For example, you're talking with your friend about this absolutely grotesque hat someone is wearing. You say to this person, "Nice hat"; he thanks you and your friend snickers. Same sign, two simultaneous meanings.