Slashdot Mirror


Deriving Semantic Meaning From Google Results

prostoalex writes "New Scientist talks about Paul Vitanyi and Rudi Cilibrasi of the National Institute for Mathematics and Computer Science in Amsterdam and their work to extract the meaning of words from Google's index. The pair demonstrates an unsupervised clustering algorithm that can 'distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return', according to New Scientist."

120 comments

  1. The elephant in the living room. by Eunuch · · Score: 4, Insightful

    These kinds of articles never seem to get a very basic problem--natural languages. English is full of words that trip even humans. "Right" the direction versus "right" the judgement is a good example. In wartime something as simple as that may have led to death. It's the elephant in the living room. Huge, important problem that nobody wants to talk about. There are alternatives, such as lojban which can be parsed like any computer program.

    The article mentions English-Spanish translation. When one language is ambiguous (from a bit of Spanish I had in HS I'm guessing English is far more ambiguous), there is no hope of easy translation. And it's worse because the bigger application may be translating the many English pages (ambiguous) to Spanish.

    --
    Transcend Humanity. Please.
    1. Re:The elephant in the living room. by freralqqvba · · Score: 4, Insightful

      Well obviously the technology is not perfect yet. However, none of the problems you bring up are particularly insurmountable (as long as you aren't expecting the AI to be BETTER at parsing languages than people). Yes, words are ambiguous, and yes humans can fail at parsing them, ergo computers probably will too. That's just a fact, we're not going to achieve perfection. Still, this could be a pretty major step forward (well, not that this is the first time something like this has been tried, but the base premise seems sound). By using Google, the elephant of a problem you mention can be partially mitigated. Google gives enough context around a word that ideally, when the word to be translated is also surrounded by context, its meaning among alternate meanings can be discovered without giving an overly ambiguous translation.

    2. Re:The elephant in the living room. by Anonymous Coward · · Score: 0
      Don't you think that possible ambiguities are something worth knowing about? In real time? Before important decisions are ever taken?

      In other words, ambiguities may not be resolvable without human input, but if the human producing ambiguities is there when they are made/uttered/written, then it's a huge plus because they can be solved.

      Furthermore, it's not that nobody wants to hear about them, it's probably that you don't want to research them.

    3. Re:The elephant in the living room. by ericbg05 · · Score: 5, Interesting
      The article mentions English-Spanish translation. When one language is ambiguous (from a bit of Spanish I had in HS I'm guessing English is far more ambiguous), there is no hope of easy translation.

      Every language has "ambiguity", but ambiguity can come in different flavors (phonological, morphological, syntactic, semantic, pragmatic). Some of the chief instigators of language change can be thought of as ambiguity on these levels. So firstly, it's hard to imagine the existence of a function mapping languages to "ambiguity levels".

      The motivation for your comment about English versus Spanish probably comes from the fact that you know of more English homophones than Spanish ones. Indeed, most literate people think of their language in terms of written words, so your take on the matter is common.

      (As a slight digression, your example of right the direction versus right as in 'correct, just' is pretty interesting. We can understand the semantic similarity between the two when we notice that most humans are right-handed. Thus it is extraordinarily common, cross-linguistically and cross-culturally, for the word meaning the direction 'right' to have similar meanings as dextrous, just, well-guided and so on, whereas the word meaning the direction 'left' also has meanings such as worthless, stupid. (In fact, the word dextrous was borrowed through French from the Latin word dexter meaning 'right, dexterous' or dextra meaning 'right hand'.) So the given example is one where, historically, a word had no ambiguity, but gained ambiguity because speakers started using it differently.)

      Getting back to the main topic, more problematic about Section 7 of TFA is the implicit assertion that, at some point in the future, their techniques can be applied to create a function mapping words in a particular language to words in another language. Anybody who has studied more than one language has seen cases where this is difficult to do on the word-level. For instance, the French equivalent of English river is often given as riviere or fleuve. But riviere is only used by French speakers to mean 'river or stream that runs into another river or stream' whereas fleuve means 'river or stream that runs into the sea'. English breaks up river-like things by size: rivers are bigger than streams. So, in the strictest sense, there is no English word for fleuve, just as there's no French word for stream (unless there has been a recent borrowing I don't know about). This certainly does not imply that French people can't tell the difference between big rivers and small rivers; their lexicon just breaks things up differently.

      These little problems can be remedied lexically, as I've just done. So fleuve is denotationally equivalent to river or stream that runs into the sea, although the latter is obviously much bulkier than its French equivalent. The real problem is that there are words in some languages whose meanings are not encoded at all in other languages. English, for example, has a lexical past-progressive tense marker, was, used in the first person singular (e.g. I was running to the store). Some languages have no notion of tense. What, then, does was mean in the context of such a language?

      It's pretty well-known that Slashdotters' general policy is to tear apart every article we read, and half of those we don't. This is certainly not my intent here. Languages are complicated beasties, and everyone seems to understand that, including the writers of the article. So, we should interpret their result in Section 7 as them saying, "Well, maybe this has gotten us a baby-step closer to creating the hypothetical Perfect Natural Language Translator, but someone's gonna have to do a lot more work to see where this thing goes".

    4. Re:The elephant in the living room. by Anonymous Coward · · Score: 1, Interesting

      I'm sorry, but as a linguist/cognitive scientist, I have to call bullshit on any current approaches to AI when it comes to language. "Words" do not have "meanings." Current research regards "chunking" and "lexical phrases." Basically, humans do not typically process words' individual meanings, but rather spit out pre-formed, formulaic expressions. This is why children who cannot form a sentence on their own yet will say things like "Let's go." They don't see that as a fully grammatical utterance; they see it as a word.

      Hence, teaching a computer to recognize meaning when we do not fully recognize meaning (only use) is far, far more difficult than the typical computer scientist realizes.

    5. Re:The elephant in the living room. by Dr.+Zed · · Score: 1
      It always is not is necessary is completely unambiguous. The thick translation possibly is better then does not have the translation, specially when you knew the translation is the rough start and.

      The above was the following text....

      It isn't always necessary to be totally unambiguous. Even rough translation can be better then no translation, especially when you know the translation is rough to begin with.

      .... translated using AltaVista Babel Fish Translation into Chinese-Trad and then translated back into English.

      If you need precise translation, then you pay for a trusted translator. If you need something on-the-fly and better-than-nothing, then why not try to create a translator that might just be able to 'learn' to translate? It would seem a lot more flexible than some static-dictionary translator.

    6. Re:The elephant in the living room. by eraserewind · · Score: 1
      English, for example, has a lexical past-progressive tense marker, was, used in the first person singular (e.g. I was running to the store). Some languages have no notion of tense. What, then, does was mean in the context of such a language?
      Plainly it means nothing significant. In most cases the distinction it makes in English contains no useful information. We think such words are useful, because a sentence sounds "wrong" without them, or because we know that a certain type of information is being lost without it, but in most cases the same fundamental information can be communicated without them, or with a less specific alternative. "I am running to the store. A big man attacks me" is just as understandable as "I was running to the store, when a big man attacked me". The vast majority of machine translation is not going to suffer unduly if it fudges such distinctions.
    7. Re:The elephant in the living room. by Anonymous Coward · · Score: 0

      Thus it is extraordinarily common, cross-linguistically and cross-culturally, for the word meaning the direction 'right' to have similar meanings as dextrous, just, well-guided and so on, whereas the word meaning the direction 'left' also has meanings such as worthless, stupid. (In fact, the word dextrous was borrowed through French from the Latin word dexter meaning 'right, dexterous' or dextra meaning 'right hand'.)

      To complete this aside, the word for the leftward direction opposite of dexter is sinister, whose symbolic connotation has completely eclipsed the literal definition from which the symbolism was derived.

      Perhaps a medieval predecessor to Rush Limbaugh is to blame? ;-)

    8. Re:The elephant in the living room. by Anonymous Coward · · Score: 0

      I remember some years ago reading about natural language ambiguity, with the very same example of the French terms 'riviere' and 'fleuve' vs. English 'river', in an AI paper.

      Did you read the same one? Or maybe you are the author?

      My take is that translating between English and Spanish requires an artificial English and Spanish 'speaker'; without that, the interlingual cognoscible gaps cannot be obviated.

      AC posting because /. keeps blocking logins from public ADSL Inktomi cache servers.

    9. Re:The elephant in the living room. by clsc · · Score: 1

      It seems to me that there is more than one elephant in the room.

      First, you cannot assume that all pages returned by Google have an equal opportunity to be served. This is due to the facts that:
      a) For any query, Google will only show the top 1,000 most relevant pages
      b) Most relevant is determined by a number of factors, one of them being a probability distribution that assigns more weight (PR) to some pages than others

      Second, the extent to which Google itself employs some sort(s) of clustering techniques (LSI has been debated) will skew the results. You might think you get really nice success rates, but that will be because Google itself has really nice success rates.

      Third, your a priori selection of terms should take into account that pages written in separate languages or for separate target groups (e.g. kids vs. academia) will use quite different wording.

      Fourth, you cannot assume that the number you base your whole key metric on is correct. The result count on Google is (for larger volumes) itself an estimated number, and the accuracy (or not) of that estimate will bias your calculations.

      Fifth, if you try to extract meaning based on results from a system that itself tries to extract meaning, your own results will be highly colored by the assumptions of the underlying system. You will not be analyzing the basic data; instead you will be analyzing Google. As above, you might think you get really nice success rates, but that will be because Google itself has really nice success rates.

    10. Re:The elephant in the living room. by danila · · Score: 2, Interesting
      Well, I don't think there is a way to translate a text into another language without
      1. understanding the text
      2. understanding both languages
      3. understanding the socio-cultural context of both languages
      But we must consider the fact that most humans can't produce a decent translation either, even if they think they understand both languages. I've been professionally translating movies (EN->RU) and I know to what extent the scripts are riddled with linguistic traps. An average professional human translator would be lucky to produce a 90% valid translation (much less a perfect one).

      So if the developers of computer translation tools do not strive for perfection, but for an average human level, they might succeed rather soon, by using a combination of several approaches (including the one described in this paper). Of course, a translation program can confuse NATO with the Northern Alliance, or Thai with Tahitian but so can a human.

      --
      Future Wiki -- If you don't think about the future, you cannot have one.
  2. Huh? by NETHED · · Score: 0

    Can someone explain this to me? Yes I did try to RTFA.

    --
    --sig fault--
    1. Re:Huh? by spac3manspiff · · Score: 1

      Google has determined the meaning of life:

      http://www.google.com/search?hl=en&q=42&btnG=Google+Search

    2. Re:Huh? by tmortn · · Score: 1

      The AC actually isn't far off.

      In English, what they are trying to do is define a word by context. For example, go input a single word in Google and you will get all the various contexts in which it is used.

      Then by using some algo. (serious academic handwaving here) you place the meaning of the word via context as determined from Google. Thus effectively you create the potential for a program that could distinguish between there and their and do it across languages. It could also translate sayings like, the spirit is willing but the flesh is weak, dumb as a doorknob, I Rocked her World, Looney Tunes and Looney Toons.

      The impressive part, if done, would be that you would not have to program in these various means of determining context, as that could be gleaned via Google results.
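
      A minimal sketch of the "define a word by context" idea, assuming a handful of made-up snippets stand in for search-result text. The names SNIPPETS, context_counts and pick_word are hypothetical and are not taken from the paper, which works from hit counts rather than snippets:

```python
from collections import Counter

# Toy snippets standing in for search-result text; made up for illustration.
SNIPPETS = [
    "they parked their car outside",
    "the kids lost their keys again",
    "we put the box over there yesterday",
    "is anyone there at the door",
]

def context_counts(snippets, target, window=2):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for snippet in snippets:
        words = snippet.split()
        for i, word in enumerate(words):
            if word == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(w for w in words[lo:hi] if w != target)
    return counts

def pick_word(candidates, context_words, snippets):
    """Return the candidate whose observed contexts overlap most with `context_words`."""
    def score(candidate):
        counts = context_counts(snippets, candidate)
        return sum(counts[w] for w in context_words)
    return max(candidates, key=score)

print(pick_word(["their", "there"], ["lost", "keys"], SNIPPETS))
```

      Nothing about "there" or "their" is programmed in; the choice comes only from which words have been seen nearby, which is the appeal of the approach.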

      --
      I don't ask you to be me. I only ask you not expect me to be you.
    3. Re:Huh? by kyndig · · Score: 1

      My understanding is that it is a multiple query system.

      Let's say you are just learning to ride a horse, and you want to know of positions to sit on the horse. You'd search for something like:
      'good riding positions'
      A current search return for this statement would deliver you everything from: Xaviers House Of Pleasures, to Yokels Horse Taming Ranch.

      What this system does is refine your query for you, based off cached google pages, and using: page popularity and keyword algorithms.

      The search would return results like:
      "how to ride your woman"
      "positions open on the dude ranch"
      "popular riding positions for horses"

      A tree is built with all these returns, and their system filters out the "junk", and returns the nearest pages you are looking for "popular riding positions for horses"

      It's a smarter smart search engine (filter)

      --
      My Thoughts, Kyndig
    4. Re:Huh? by MoonFog · · Score: 1

      W3C's OWL standard is a "language" to mark-up information to make it more meaningful to machines. Machines can draw conclusions to what a word means by context. So even if two words are the same, they may not mean the same and the computer can draw that conclusion based on the context. It's all a part of W3C's Semantic Web initiative. There is research dedicated to query languages for these kinds of files.

  3. Semantic meaning? by zorren · · Score: 2, Interesting

    I thought semantic meant "meaning".

    1. Re:Semantic meaning? by Anonymous Coward · · Score: 0

      but does it know what "first post" means?

    2. Re:Semantic meaning? by exp(pi*sqrt(163)) · · Score: 3, Funny

      Meaning could, in principle, mean 'affective meaning' as in the emotional weight something carries. Maybe Google are also working on emotional search engines and the article poster doesn't want us getting confused with that.

      --
      Doesn't it make you feel good to know that our freedoms are protected by politicans, lawyers and journalists.
    3. Re:Semantic meaning? by Anonymous Coward · · Score: 0

      Let's google for a religion: discordianism (47700), a dutch painter: de Hooch (391000), a number: d8ff752 (24), and a color: xanthe (46900).

      I looked at the abstract, and I don't think they're just using "the number of hits." It sounds more complex, but, OMG I AM NOT A GERBIL
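
      For what it's worth, the measure Cilibrasi and Vitanyi describe (the Normalized Google Distance) does combine several counts: hits for each term alone, hits for both terms together, and the size of the index. A rough sketch under that reading; the function name and the idea of passing counts in by hand are assumptions, and no real hit counts are used here:

```python
import math

def normalized_google_distance(fx, fy, fxy, n):
    """NGD from hit counts.

    fx, fy -- hits for each term on its own
    fxy    -- hits for pages containing both terms
    n      -- number of pages indexed (must exceed fx and fy)
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term needs at least one hit")
    if n <= min(fx, fy):
        raise ValueError("index size must exceed the term counts")
    if fxy <= 0:
        return math.inf  # the terms never appear together
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

      Terms that always appear together get a distance of 0, and the distance grows as the joint count shrinks relative to the individual counts. Clustering colours or painters would then run on the pairwise distances, not on raw hit totals.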

    4. Re:Semantic meaning? by gardyloo · · Score: 1

      Maybe Google are also working on emotional search engines and the article poster doesn't want us getting confused with that.

      I don't understand your meaning, and I'm really happy about it! ;)

    5. Re:Semantic meaning? by fireboy1919 · · Score: 1

      Words can be identified as a specific part of speech, or a specific part of a sentence, or even just identified as words, for instance - all of which I would consider non-semantic meanings.

      In language theory/compilers, semantic meaning is information that cannot be obtained by a lexer (a.k.a information that cannot be gained through regular expressions a.k.a. non-regular language components).

      The part that can be recognized by a lexer is still part of the meaning, which is the reason for the name.
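
      To illustrate the lexer side of that split, here is a small sketch of a regex-driven tokenizer. The token names and the tiny grammar are made up for the example:

```python
import re

# Each token class is a regular expression; order matters for alternation.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Split text into (kind, value) pairs using only regular expressions."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("x = y + 1"))
```

      The lexer happily labels y as an IDENT, but whether y was ever defined, or what type it has, is exactly the information it cannot supply; that is left to the semantic phase.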

      --
      Mod me down and I will become more powerful than you can possibly imagine!
  4. wARTIME? by Anonymous Coward · · Score: 0
    "Right" the direction versus "right" the judgement is a good example. In wartime something as simple as that may have lead to death.

    what the hell are you babbling about??!?

    1. Re:wARTIME? by Anonymous Coward · · Score: 1, Insightful

      If you tell someone to take the right path, they could mistake it as going the opposite of left and walking into a minefield, when in fact you meant going the correct direction.

    2. Re:wARTIME? by Geoffreyerffoeg · · Score: 1

      "We attack the enemy on their left flank?"

      "Right."

      It turns out that the right flank was the dangerous one, and attacking on the left would've guaranteed a victory for the army that unfortunately spoke English.

    3. Re:wARTIME? by tmortn · · Score: 1

      If the context is not there then not even a human can make that distinction.

      Once past that the natural language problem or elephant you speak of is really just a machine that defines words in context.

      Google is actually a pretty powerful means for deriving context by some means of comparing word frequencies around various results.

      --
      I don't ask you to be me. I only ask you not expect me to be you.
    4. Re:wARTIME? by MoonFog · · Score: 3, Informative

      Well, when I was in the army, it was very strict that whatever was said over a network DIDN'T have an ambiguous meaning. That's why the army language sounds kinda weird at times, because you are not supposed to misunderstand anything.

    5. Re:wARTIME? by Anonymous Coward · · Score: 0

      I would have hoped they would use something like "Correct." or "No, Right."

    6. Re:wARTIME? by Fjornir · · Score: 1

      Like you never ask someone to "repeat" something ?

      --
      I want a new world. I think this one is broken.
    7. Re:wARTIME? by Anonymous Coward · · Score: 0

      It's one thing to not hear something well enough and ask for a repeat, and another thing to speak ambiguously so that nobody asks for a repeat and just does the wrong thing.

    8. Re:wARTIME? by Fjornir · · Score: 2, Interesting
      Er. You didn't get the joke, so I will explain. The lore is that "repeat" is a command to the artillery to fire again on their last target, so you never ever say "repeat" on the radio, instead you say "say again".

      The lore also contains an interesting anecdote about the '92 riots in LA. Apparently a group of Marines were dispatched to assist the police. Two officers were approaching a house when someone opened up with a shotgun at them. One officer shouted "cover me" -- so the Marines proceeded to lay down covering fire on the house -- more than two hundred rounds were fired into that house.

      --
      I want a new world. I think this one is broken.
    9. Re:wARTIME? by Fjornir · · Score: 1
      Another good one from the humor archives of the lore. Given the command "Secure the building"

      The NAVY would turn out the lights and lock the doors.

      The ARMY would surround the building with defensive fortifications, tanks and concertina wire.

      The MARINE CORPS would assault the building, using overlapping fields of fire from all appropriate points on the perimeter.

      The AIR FORCE would take out a three-year lease with an option to buy the building.

      --
      I want a new world. I think this one is broken.
    10. Re:wARTIME? by TapeCutter · · Score: 1

      Anyone who's driven a "radio taxi" would have to agree, but for less dramatic reasons :)

      Don't know about the US but over here in Australia they dropped taxi dispatching by voice about 15yrs ago. Now instead of arguing with the dispatcher they listen to music....

      --
      And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
    11. Re:wARTIME? by danila · · Score: 1

      That's so true. For (a somewhat related) example, yesterday I listened to an mp3 recording of radio conversations of a Moscow metro traffic controller when one of the train drivers was incapacitated by a large dose of vodka. :) The driver stopped the train at the station and left the train, only to be stopped by the police. Eventually, when another driver stopped on the opposite track at the same station, the traffic controller asked him to take over the first train and park it in one of the station tunnels, then go back to his train and continue.

      Even though he was speaking quite clearly and in good Russian, even though the station-master went to the driver to explain the situation in person, he was unable to comprehend what exactly he was asked to do. :) I imagine how bad it could get if they were on the battlefield.

      --
      Future Wiki -- If you don't think about the future, you cannot have one.
  5. Well that's good by Anonymous Coward · · Score: 0, Funny

    Because God knows I'd never be able to distinguish between dutch painters on my own

  6. just... by Anonymous Coward · · Score: 0

    ...a few steps away from BSG 75 becoming a reality.

  7. YAAFTM by Anonymous Coward · · Score: 0

    Yep. It's Yet Another Application For Tax Money. It's just academic hot air that should be ignored until someone actually produces something that works and can be sold.

  8. not really... by bird603568 · · Score: 0

    go to google define: microsoft, http://www.google.com/search?hl=en&q=define%3A+microsoft&btnG=Google+Search they all have it wrong until you see "NOT A STANDARDS BODY"

  9. Scientology by Jace+of+Fuse! · · Score: 2, Insightful

    Is this in any way related to the way that Google was able to decide all on its own that Scientology was crap, and thus bring Operation Clambake up to the top of the search results? (Until the Scientology people got pissed, anyway.)

    Google is already starting to show signs of intelligence higher than some people. :)

    --

    "Everything you know is wrong. (And stupid.)"

    Moderation Totals: Wrong=2, Stupid=3, Total=5.
    1. Re:Scientology by britneys+9th+husband · · Score: 1

      I think that's more just a reflection of the opinions (knowledge?) of the people on the internet, unless you also want to claim that Google decided "on its own" that Bush is a miserable failure.

      --
      Hear recorded Slashdot headlines on your phone! New service beta testing. Just call (248) 434-5508
    2. Re:Scientology by Anonymous Coward · · Score: 0

      What google measures is what URLs are popular. At the moment, the idea that Scientology is a dangerous scam combining the worst aspects of corporations and religion just happens to be a popular one; therefore, when you search on Google for Scientology, you first get a link providing the evidence for this. As it happens, this idea is correct.

      Another popular idea is that there is a link between Saddam Hussein and Al Qaeda. This idea is not correct. However, it is popular; therefore when you search on Google for Iraq Al Qaeda, you first get propaganda from a neoconservative mouthpiece which attempts to convince others of this false idea.

      Google can't invent or judge ideas. What it can do is categorize and flag which ideas are common.

    3. Re:Scientology by spikefruit · · Score: 1

      The same thing happened for Mormons. Google lifted up a bunch of anti-Mormonism sites to the first page. exmormon.org is only below the official LDS sites.

      --
      I'm going to become a theologist and a scientist so I can spend long hours into the night arguing with myself.
    4. Re:Scientology by xgamer04 · · Score: 1

      I'm guessing that it's due to people posting (in their .sigs and whatnot) "scientology is a cult" and linking the word scientology or the phrase to the clambake site. I don't know of a formal googlebomb, though.

      --
      When you look at the state of the world, how can you not become a radical, liberal anarchist?
  10. Extend this to the library of congress... by physicsphairy · · Score: 3, Interesting
    While I think ideally you would endow computers with the same algorithmic usage of speech that is employed by human beings, as these researchers have shown, it is also possible to work with programs that do not 'parse' language but rather categorize it based on massive databases of language that has already been parsed by humans.

    This obviously has its failings, but theoretically, you could use a sufficiently large database of common human language coupled with simple algorithms to perform operations like grammar checking.

    An internet search would not be quite so useful for that, but I would really be interested in what would be possible with full digital access to the library of congress. I would imagine you could do things like automatically generate books based on existing material.
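
    A rough sketch of the "large database plus simple algorithm" grammar check, assuming three toy sentences stand in for the massive human-written corpus. CORPUS, build_model and unseen_bigrams are hypothetical names for illustration only:

```python
from collections import Counter

# Toy stand-in for a huge corpus of human-written text.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased the dog",
]

def bigrams(sentence):
    """Return adjacent word pairs from a sentence."""
    words = sentence.lower().split()
    return list(zip(words, words[1:]))

def build_model(corpus):
    """Count how often each word pair occurs in the corpus."""
    counts = Counter()
    for sentence in corpus:
        counts.update(bigrams(sentence))
    return counts

def unseen_bigrams(sentence, model):
    """Flag word pairs that never occur in the corpus as possible errors."""
    return [pair for pair in bigrams(sentence) if model[pair] == 0]

model = build_model(CORPUS)
print(unseen_bigrams("the cat on sat the mat", model))
```

    No grammar rules are parsed anywhere; a pair is suspicious only because nobody in the corpus wrote it. With a corpus this small nearly everything is unseen, which is why the size of the database matters so much.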

  11. Would that be 'semantic meaning'... by exp(pi*sqrt(163)) · · Score: 2, Insightful

    ...as opposed to 'non-semantic meaning' or just 'semantic meaning' as in 'I don't know what semantic means but using it here will make me look intelligent'?

    --
    Doesn't it make you feel good to know that our freedoms are protected by politicans, lawyers and journalists.
    1. Re:Would that be 'semantic meaning'... by mmu_man · · Score: 1

      I thought "semantic" was actually a synonym for "meaning"... !?

  12. Compression is a stricter test for AI than Turing by Baldrson · · Score: 3, Informative
    From the linked academic abstract:
    Viewing this mapping as a data compressor, we connect to earlier work on Normalized Compression Distance.

    This is basically what I was referring to in my response to "Using The Web For Linguistic Research" when I said:

    There needs to be an annual prize for the highest compression ratio using random pages from the web as the corpus. This would probably do more for real advancement of artificial intelligence than the Turing competitions.
    followed by the explanation:
    Intelligence can be seen as the ability to take a sample of some space and generalize it to predict things about the space from which the sample was drawn. The smaller the sample and the more accurate the prediction, the greater the intelligence. This is also a short description of what a compression algorithm does.
    and
    Text Compression as a Test for Artificial Intelligence, 1999 AAAI Proceedings. Matt Mahoney shows that text prediction or compression is a stricter test for AI than the Turing test. (1 page poster, compressed Postscript).
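
    For reference, a minimal sketch of the Normalized Compression Distance mentioned in the abstract, assuming zlib as the compressor (any real compressor could be substituted; the choice here is only for illustration):

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: small when x helps compress y."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

text = b"the quick brown fox jumps over the lazy dog " * 20
other = bytes(range(256)) * 4
print(ncd(text, text), ncd(text, other))
```

    Two inputs that share structure compress better together than apart, so their distance comes out lower. The better the compressor models the data, the more meaningful the distance, which is the link to the prediction argument above.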
  13. The elephant... by eremitic · · Score: 1

    After consulting with the elephant in my living room, I have only one thing to say. semantic Pronunciation Key (s-mntk) also semantical (-t-kl) adj. 1. Of or relating to meaning, especially meaning in language. 2. Of, relating to, or according to the science of semantics.

    --
    Warning: Could be fatal if taken seriously
  14. Good for scholars, bad for geeks by kyndig · · Score: 2, Interesting

    This is a pretty nice approach. Quoted from the news article: "The technique has managed to distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return, the researchers report in an online preprint." It shows that common terminology can be drawn. In the end though, this is a refined search routine for Google IMHO. This would be good for scholar searches perhaps, or even a dynamic thesaurus. But when using terms such as: does windows use linux, the derived results would be broken down into: "linux" "windows" "use", and Google cached pages containing these terms vary so greatly in content. A search for something along the lines of "dutch painters favorite colors", on the other hand, would produce desired results like the control method used in the news article.

    --
    My Thoughts, Kyndig
  15. Yea by mao+che+minh · · Score: 1

    wow! = conveying amazement
    WoW! = I can't believe I paid $50 for this crap and can't even logon!

  16. AI by Anonymous Coward · · Score: 0

    I doubt that it will work in that way. However...

    When we talk about meanings of words, we are really taking a symbol and utilizing it in the place of a meaning. For instance, when I say dog, I am talking of the pattern of dog - a furry thing that moves in certain directions and can bark, but has a wide range of sizes.

    In other words, whenever we are saying a word we are taking patterns and using them. Whenever I see a certain pattern I can identify it with something - this is where a word comes into play. The word identifies a pattern: i.e., my friend Gus. Whenever I see the pattern of Gus I know that it is Gus.

    This is why the search won't work: one must take other experiences (sight, smell, sound) and combine them into a pattern, and then represent that pattern. Deriving meanings of words from words will not work - unless you have the actual visual representations involved. This is where Google may work: their visual image searches may be a great way for AI systems to derive meanings - they can find visuals of many different words, then using logic find out what the meaning of that word is. Using such searches, they will learn the meanings and then they will be able to use them in real life. E.g.: if they typed the word "dog" in, then they will be able to find images of dogs, and then connect them using pattern recognition systems to realise what "dog" is. Of course, this isn't AI yet - one must still include other forces such as need, which drives one to action, and other stuff...
    Anyway, I always loved AI and I am happy that the guys made such big progress in it... hopefully when I grow up I will also go into the industry... after the 1980's it really did go KAPUT, now it's reviving!

    1. Re:AI by Sotek · · Score: 1

      You, um, seem to be implying that blind people cannot use language.

      You may want to seriously reconsider that.

      Yes, when we use words, we're using a symbol to stand for the meaning. This is obvious.

      However, we can in fact use words we don't understand, and we can extract some sense of understanding from the way it's used.

      Consider the concept of, oh, "semantic". This has little-to-no basis in physical reality, and yet is perfectly understandable all around.

      An AI, even without physical interactions, could understand such concepts.

      And a physical interaction is a vague concept when it comes to an AI; George Lakoff suggests we understand things via metaphor of a very few basic concepts; the directions (up/down/left/right/forward/back) being one of them.

      As a result, we sometimes understand the concept of "more" via analogy from "up".

      Why could a computer not go the other way? It is not at all difficult for a computer to get the idea of more; why not have a vague idea of what "up" means to us from that?

      Obviously this is beyond what is done now, but it does tend to imply that there are methods other than direct physical experience to produce an understanding of the same concepts we get from such experiences. And as AI is geared towards achieving that understanding, even if it would likely not be the most appropriate "intelligence" (such a VAGUE word, sometimes) for a computer, it's still something we will likely end up with.

  17. not many will get this by 2TecTom · · Score: 2, Insightful

    First off, I am not an "AI" expert nor do I claim to be, however, this is how I see it.

    Since it seems that so few really understand the term "intelligence", it is really not surprising that even fewer grasp the meaning of the term "artificial intelligence", is it?

    One: intelligence is not awareness.

    Although we cannot prove the existence of or even seem to really define self-awareness, it seems self-evident, at least to me, that intelligence is clearly defined and can be measured.

    Therefore, I believe that we will have "artificial intelligence" soon; in fact, I'd bet Google may well be the first AI or "self intelligent" engine.

    However, I suspect it will be quite a while before we are mature enough to build a self-aware engine.

    Lastly, in regards to some of the other comments, it seems to me that this paper is about using the "intelligence" included in the language we use, that Google crawls. This repository is the single largest collection of semantic weighting, therefore, algorithms could be developed that reflect this "intelligence", therefore appear themselves intelligent, even though they themselves are simply deterministic.

    Whew ...

    --
    Words to men, as air to birds.
    1. Re:not many will get this by abulafia · · Score: 1

      The other comment hinted at the distinction, but one of the hallmarks of intelligence is desire. In philosophic language, "intentionality". Without goals and self-directed moves towards those goals, you do not have intelligence. Note that intentionality is needed, but not sufficient. In any case, Google, the machine, does not have self-directed goals.

      --
      I forget what 8 was for.
    2. Re:not many will get this by Anonymous Coward · · Score: 0

      or at least not as far as you know

  18. Unsupervised but Reflective of Human Preferences by reporter · · Score: 3, Interesting
    Even though I disagree with Google's hiring practices (i.e. preferring H-1Bs when many American engineers are unemployed), I must admit that Google's search algorithm is the best one -- even better than Yahoo! Search, which I use regularly for socio-political reasons.

    I will give you an example. If you search news (i.e., either Google News or Yahoo! News) for stories about the recent federal action (by Washington) involving Chinese companies and Iranian weapons improved by Chinese technology, you will discover that one of the popular news articles about this topic comes from the "New York Times". Several other newspapers redistributed the Times article, written by David Sanger (spelling?).

    I read that article, but I also read articles from less popular Web news sites: e.g. "Taipei Times". The "Taipei Times" article does mention that a Taiwanese company was also implicated in the sale of weapons technology to Iran. Yet the "New York Times" article made no mention of this fact.

    Is the "Taipei Times" telling the truth? It claims that Ecoma Enterprise Company, a Taiwanese company, was one of the culprits.

    At this point, I fired up both Yahoo! Search and Google. Only on Google was I successful in locating the ORIGINAL source of the information about American penalties against the 7 Chinese companies and the 1 Taiwanese company. The information is on page 133 of the "Federal Register" (volume 70, number 1). So, I discovered that the "Taipei Times" was telling the truth.

    Guess how long I took on Google to find this information? 5 minutes. I kid you not. Even though I hate Google's employment practices, I am quite impressed with their technology.

    Using Yahoo! Search, I was not able to locate the desired information.

    Apparently, Google has an algorithm that, although it is unsupervised (i.e. without the kind of human interaction that corrupts Yahoo! Search), it captures the notion of what the typical person wants to find. The Google algorithm, dare I say "it", is on the verge of acquiring human sentience. THAT is, indeed, impressive.

    Pray to Buddha that the middle name of the CEO is not "666" or Beelzebub. Just kidding.

  19. Limitations of NGD (Normalized Google Distance) by G4from128k · · Score: 4, Insightful

    Although very clever, NGD (Normalized Google Distance) misses all higher-order relationships and does not even distinguish between different categories of pairwise relationships. For example, NGD might assume that "Bush" & "Iraq" had the same relationship as "Slashdot" & "Geek" because the two word pairs co-occur with similar frequencies.

    More interesting are analyses on n-tuples (co-occurrences and orderings of n words at a time). Anyone who does ER (Entity-Relationship) diagrams for relational databases will appreciate that many relationships involve multiple entities and are not decomposable into pairwise relationships.

    Another limit is that Google is atrocious in its estimates of the number of hits. The actual number of hits is only a fraction (about 60%?) of the estimate, in my experience. This suggests that Google has a pairwise estimator built in that may be only partially empirical. If Google simply reports an estimated number of hits based on products of probabilities, then there is no information about the pair in the NGD. Obviously, these scientists have gotten useful results, but NGD may not be as good an estimate of the co-occurrence of the words as the scientists assume.
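
    The product-of-probabilities worry can be made concrete. A minimal Python sketch, assuming the NGD definition given in the paper and purely illustrative numbers (the index size N and the per-term counts below are made up, not measured): if the reported pair count were just f(x)*f(y)/N, the distance would come out the same for every pair.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts.

    fx, fy: hits for each term alone; fxy: hits for both terms;
    n: total number of indexed pages (a normalizing constant).
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

def independent_pair_count(fx, fy, n):
    """Pair count an estimator would report if it assumed independence."""
    return fx * fy / n

# Illustrative values only (assumptions, not real Google counts).
N = 8_000_000_000
singles = [(46_000_000, 12_000_000), (175_000_000, 10_400_000), (3_000_000, 900_000)]

distances = [ngd(fx, fy, independent_pair_count(fx, fy, N), N) for fx, fy in singles]
for (fx, fy), d in zip(singles, distances):
    print(fx, fy, round(d, 6))
```

    Under that assumption the numerator and denominator of NGD are algebraically identical, so every distance is 1 up to floating-point rounding. Any useful signal has to come from pair counts that deviate from the independence estimate.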

    --
    Two wrongs don't make a right, but three lefts do.
    1. Re:Limitations of NGD (Normalized Google Distance) by Rudi+Cilibrasi · · Score: 2, Interesting
      You are right that Google may be performing estimation and this could affect results, and I don't really know what sort of rounding they do at this time. Perhaps more will become apparent. But your other assertion about no higher-order statistics is incorrect. See the earlier Clustering by Compression paper for more info. Quickly, the reason is as follows:
      • I use NGD to convert arbitrarily-large lists of search-terms into feature-vectors of arbitrary dimension. The only limit to this is the max query length for Google, and this is just a detail.
      • I use a Support Vector Machine with a Radial Basis Function kernel. The RBF kernel has an effectively infinite dimension and so can learn any function. SVM is a universal learner like neural nets and many other famous algorithms. So higher-order features (composed of products of several NGD) can indeed be used in learning.
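
      The two steps above can be sketched roughly as follows. This is only an illustration in Python: the anchor terms and every NGD value in the table are hypothetical stand-ins (in the real pipeline they would be computed from page counts), and the kernel is written out by hand rather than taken from any SVM package.

```python
import math

# Hypothetical NGD values between each term and three anchor terms;
# these numbers are invented for illustration, not measured.
anchors = ["red", "three", "Rembrandt"]
ngd_table = {
    "blue":  [0.25, 0.80, 0.90],
    "green": [0.30, 0.85, 0.88],
    "seven": [0.82, 0.20, 0.91],
}

def feature_vector(term):
    """One NGD value per anchor term; more anchors means more dimensions."""
    return ngd_table[term]

def rbf_kernel(u, v, gamma=1.0):
    """Radial Basis Function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

same_kind = rbf_kernel(feature_vector("blue"), feature_vector("green"))
different_kind = rbf_kernel(feature_vector("blue"), feature_vector("seven"))
print(same_kind > different_kind)
```

      An SVM trained on such vectors only ever sees the kernel values, so terms whose NGD profiles against the anchors look alike end up close together.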

      The main purpose of the research is in extending generality of automatic learning. See the earlier papers in the series including Algorithmic Clustering of Music, and the earlier theoretical work. NGD is a special case of NCD. NCD is a family of functions that can be used as the basis of a universal learning system in a variety of ways. Our theory justifies this innovation and leads to a whole class of easy to write algorithms.

      Thanks for your interest, it is good to see that this research is striking a chord with the Slashdot community. I hope this leads to a whole lot of more easy-to-use semi-intelligent software. Cheers!

    2. Re:Limitations of NGD (Normalized Google Distance) by Anonymous Coward · · Score: 0

      It's 2am and I have only just briefly scanned through the paper in the last 15 minutes. I don't see any significant difference between NGD and current corpus-based methods. Both seem to do the same thing - counting occurrences of words and word pairs.

      Yes, using Google to do word counting gives higher confidence due to having many more counts. Many researchers have already been using Google in the past few years; just look at the past three to four years' TREC question answering papers for some samples of how Google is used.

    3. Re:Limitations of NGD (Normalized Google Distance) by Geoffreyerffoeg · · Score: 0

      "Bush" & "Iraq" had the same relationship as "Slashdot" & "Geek"

      Bush invades Iraq, geeks invade Slashdot.

      Too many geeks on Slashdot can make servers melt, too much Bush in Iraq can make buildings melt.

      Bush seeks weapons of mass instruction, geeks seek weapons of math instruction.

      Geeks spend too much of their time on Slashdot, Bush spends too much of his on Iraq.

      Iraq and Slashdot are often both irrelevant ("links to Al-Qaeda", "this isn't News for Nerds!"), but Bush and geeks don't seem to notice.

      So yeah, they do have the same relationship.

  20. Re:FREE IPODS! by Anonymous Coward · · Score: 0

    Looks like it's slashdotted

  21. Kabbalah by Feneric · · Score: 1

    Is it my imagination or is this essentially a new flavor of Kabbalah with the same strengths and weaknesses?

    1. Re:Kabbalah by Fortun+L'Escrot · · Score: 1

      I do not know much about Jewish mysticism, so if I may: how does Kabbalah relate?

    2. Re:Kabbalah by Anonymous Coward · · Score: 0

      The Kabbalah uses archetypes and their relations (flow of) to each other. Semantic meaning does play a role in reading the Kabbalah, but dealing with abstract symbolism and knowing how it applies to real life is what is important. Either way this is definitely a step in that direction.

      Google, the Oracle, heheheheh

  22. Great, can it run in the kernel space ? by mmu_man · · Score: 1

    Just need to fit that in googlefs to get better results on queries :)

  23. Intelligence vs. awareness by alienmole · · Score: 1
    One: intelligence is not awareness.

    Intelligence does, however, imply the ability to perform self-directed learning. Without that, all you have is preprogrammed behavior, which is not intelligence. Given the ability to learn, an intelligent entity is likely to draw conclusions about its own existence ("I think therefore I am"), and will thus essentially be self-aware.

    Of course, the builders of an artificially intelligent machine might restrict its ability to gather facts about itself - it wouldn't necessarily have the ability to "see" its "body", for example - so this may limit the scope of the AI's self-awareness, at least at first. However, that's an artificially-imposed external constraint, which says nothing about the AI's ability or potential for self-awareness.

  24. "I don't want to get into semantics" by hey · · Score: 1

    Sometimes in a discussion somebody might say
    "I don't want to get into semantics".
    I always want to yell - "why worry about the meaning of things - it'll just cloud things".

  25. Re:Unsupervised but Reflective of Human Preference by Anonymous Coward · · Score: 0

    Tis maybe a karmic balancing of centuries-old morally wrong Amerikan policies wherein engineers of color have been refused employment in their own country. Those traditionally privileged are now being marginally impeded from their "given" privilege. H1-B's don't seem to be displacing too many people.

  26. Been working on similar by Arngautr · · Score: 3, Interesting

    I wrote a program that gathered, analyzed and used word pair frequency data (various situational pairings). It needs more raw data, but shows a lot of promise. I opted not to use literature, as that often has archaic and purposefully awful word usage. Some of the issues involved include case (like Fall vs. fall; I chose to ignore case) and grammatical structure (it needs to integrate with a grammar checker). Coupling this with a thesaurus is my eventual goal; this leads to some obvious difficulties, though it has potential rewards. I had considered Google, and have run a few tests using it, but that solution was too simple, and not quite as powerful in the long run. Just had to share, sorry to waste your time.

    1. Re:Been working on similar by Arngautr · · Score: 1

      The other article used "hat + head" and "hat + banana" and said that because hat and head is way more common than hat and banana there is a correlation, however let's look at the numbers:

      about 175,000,000 for hat

      about 162,000,000 for head
      about 10,400,000 for banana

      about 8,900,000 for head hat
      about 517,000 for banana hat

      so 5.49% of head sites have hat in them
      and 4.97% of banana sites have hat in them

      just over .5% difference.... not a great example.
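
      The comparison depends on which side you normalize by. A quick Python sketch using only the hit counts quoted above (nothing else is assumed): dividing by the head/banana totals gives the near-identical percentages, while dividing by the hat total separates the two pairs clearly.

```python
# Hit counts as quoted in the comment above.
hits = {"hat": 175_000_000, "head": 162_000_000, "banana": 10_400_000}
pair_hits = {("head", "hat"): 8_900_000, ("banana", "hat"): 517_000}

def share(pair, given):
    """Fraction of pages containing `given` that also contain the other term."""
    return pair_hits[pair] / hits[given]

for word in ("head", "banana"):
    pair = (word, "hat")
    of_word = 100 * share(pair, word)   # % of `word` pages that mention hat
    of_hat = 100 * share(pair, "hat")   # % of hat pages that mention `word`
    print(f"{word}: {of_word:.2f}% of {word} pages, {of_hat:.3f}% of hat pages")
```

      Seen from the hat side, about 5.09% of hat pages mention head but only about 0.295% mention banana, a factor of roughly 17. The raw percentages look alike mainly because banana is a much rarer word overall.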

    2. Re:Been working on similar by physicsphairy · · Score: 1
      Just had to share, sorry to waste your time.

      I accept your apology for relating relevant information about the subject matter of the article.

      For future reference, to avoid this, it helps not to read the article. If you must read it, you can always pick out a short phrase and take it out of context. If you are absolutely at a loss on how to comment on a story without presenting useful/interesting information, generally you can get away with "FRIsT POST!!!" or one of the popular Slashdot memes.

      Don't worry, I'm sure you will soon fit in just fine. :)

    3. Re:Been working on similar by Arngautr · · Score: 1

      playing a bit more I tried the following (changed order in search):

      about 669,000 for hat banana
      about 8,870,000 for hat head

      so this indicates that hats have a stronger correlation with banana than with head by %!!

      5.475% (head)
      6.433% (banana)

      !!!

    4. Re:Been working on similar by Arngautr · · Score: 1

      Will do,

      I downloaded the research paper/proposal/whatever-it-is, but have yet to read it; does that count?

    5. Re:Been working on similar by Spy+Hunter · · Score: 1

      Why not use Wikipedia? The database is downloadable in its entirety, quite large, and contains plenty of great information about topics from advanced mathematics to pop culture; all in quite down-to-earth normal language written and refined by normal people. I think Wikipedia ought to be a tremendously great resource for computer learning research.

      --
      main(c,r){for(r=32;r;) printf(++c>31?c=!r--,"\n":c<r?" ":~c&r?" `":" #");}
    6. Re:Been working on similar by Arngautr · · Score: 1

      Actually that is the intent, I just need to parse it. I've got the curr database on my computer and it is slightly parsed. I've figured out most of the relevant structure (like what are discussions, vs articles, what separates articles, wikicode), just need to sit down and parse all the wikicode and html. Preprocessing aside I've estimated a couple CPU weeks for wikipedia ~1.7GB. One thing that was interesting about wikipedia was that I counted all words in it, and looked at the ones that only occurred a few times, these were almost all misspellings or typos, I thought about going through and correcting a lot of these but thought better of it...they serve a purpose, they flag less professionally done articles.
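
      The rare-word pass described here might look something like the following Python sketch. The tokenizer, the threshold and the sample sentence are all assumptions for illustration; the real run was over the parsed Wikipedia dump.

```python
import re
from collections import Counter

def rare_words(text, max_count=2):
    """Return words that occur at most `max_count` times in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return sorted(w for w, c in counts.items() if c <= max_count)

# Tiny made-up sample; "teh" plays the part of a typo.
sample = "the cat sat on the mat the cat saw teh mat the mat"
print(rare_words(sample, max_count=1))
```

      On a sample this small, legitimate words show up as rare too; the observation above is that on a corpus the size of Wikipedia, the low-count tail is dominated by misspellings.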

    7. Re:Been working on similar by Hognoxious · · Score: 1
      so 5.49% of head sites have hat in them and 4.97% of banana sites have hat in them just over .5% difference.... not a great example.
      It might be people who can't spell "bandana", especially if they use a speilchucker. Or it could be something to do with Carmen Miranda.
      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  27. Multiple semantism by Anonymous Coward · · Score: 0

    What about multiple semantism?

    How would it understand GNAA? As "Greater Nashville Auburn Association" or as "Guilford Native American Association"?

  28. Language is more than words by Hal+XP · · Score: 2, Insightful
    English is full of words that trip even humans. "Right" the direction versus "right" the judgement is a good example.
    "Right" isn't really a good example of a word that might "trip even humans." A human (translator) will parse not just by word but will attempt to extract a word's meaning from the surrounding phrases, sentences or even paragraphs. The syntax of the language may also come into play. In spoken language, additional "clues" can be derived from the situation in which the word is spoken, and often the extra-textual "body language" is more important, e.g. a hand pointing right or a head nodding in approval. I don't think an adult would be confused by the sentence "You're right. Let's go right." In wartime, I can imagine a responsible English-speaking commander barking references to GPS locations or using body language. It would be a mistake to think of a word in isolation from its context. After all, even in computer languages, a printf or goto by itself will chuck off a compiler error.
    --
    I'm a sci-fi vegan: I don't want the aliens to think we have as much right to live as the fried chickens we eat.
    1. Re:Language is more than words by CastrTroy · · Score: 1

      A goto in any context should always cause a compile error. In Java it's a reserved word, so you can't use it as a variable name, but it has absolutely no use in the language. Maybe it's a feature to be added in Java 6 (1.6)

      --

      Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
    2. Re:Language is more than words by Sotek · · Score: 1

      "So I go left?" "Right." Ambiguity; I've had this happen while driving moderately frequently. Body language isn't a help, since I'm, well, looking at the /road/. It's easy enough to disambiguate afterwards, but if I were driving a military vehicle in a combat situation, that could easily get one of us killed, yes.

    3. Re:Language is more than words by Hal+XP · · Score: 2, Funny

      There ought to be a military regulation forbidding the use of anything other than "Yes," "No," or "I don't know, sir" in a combat situation. Right?

      --
      I'm a sci-fi vegan: I don't want the aliens to think we have as much right to live as the fried chickens we eat.
    4. Re:Language is more than words by Sotek · · Score: 1

      Probably, although I wouldn't know.

      Of course, if the superior is the person giving directions, and not the driver, that may not help. ;)

      But... well, yes, and that brings it back to what Moonfog said.

    5. Re:Language is more than words by CableModemSniper · · Score: 1

      but there's a difference between a programming language word and a natural language word. printf or goto is more like a letter (or possibly syllable) to a parser than a whole word. That's not really true either, but context doesn't work the same way. A parser expects a finite class of objects to come after printf, and while natural languages are technically finite, it's not quite the same thing because communication doesn't have to be correct all the time to be understood.

      --
      Why not fork?
  29. The meaning for me by michelcultivo · · Score: 1

    This is the meaning of this article for me.
    Can someone explain it to me in a human language?

  30. psst! by MerryGoByeBye · · Score: 1

    (whispering)
    "!=" = NOT equal to

    1. Re:psst! by martyn+s · · Score: 1

      yes and "! =" != "!="

      lameness lameness lameness lameness filter

    2. Re:psst! by MerryGoByeBye · · Score: 0

      It was a joke. Try and keep your pants on.

      hands martyn extra Humor Tablets

    3. Re:psst! by Anonymous Coward · · Score: 0

      It wasn't funny.

      I didn't think it was a joke.

    4. Re:psst! by Anonymous Coward · · Score: 0

      Did you mean

      "!=" == NOT equal to

  31. Re:Compression is a stricter test for AI than Turi by Anonymous Coward · · Score: 0

    Intelligence can be seen as the ability to take a sample of some space and generalize it to predict things about the space from which the sample was drawn. The smaller the sample and the more accurate the prediction, the greater the intelligence.

    Good description, and I agree, though I would alter it slightly to include speed and range of prediction, i.e.:

    The smaller the sample, the larger the domain covered, and the quicker and more accurate the prediction, the greater the intelligence.

  32. Limits to semantic derivations from Google by saddino · · Score: 4, Interesting

    My company develops a data mining program for OS X (theConcept) that uses Google (or other search engines) to provide links to data for mining.

    For example, searching on Google for "tom cruise" brings up pages upon pages of links, but -- from a cursory glance at the results -- it is impossible to learn anything about Tom Cruise unless one visits those results.

    Our software visits each of those results (for example, the first 100) and looks for the most significant keywords and phrases used over all the data. As you might expect, these typically end up being the names of people (e.g. Nicole Kidman, Penelope Cruz) or movies (e.g. Top Gun, Color of Money) that are associated with Tom Cruise. As far as our software goes, this is ample for doing keyphrase analysis.

    But the problem with deriving any additional meaning from the Internet web space is this: the biases that exist due to the very reasons for mentioning Tom Cruise (namely those things he is famous for) simply outweigh -- by a wide margin -- any other quite relevant interesting data about Tom Cruise. So, in fact, the web, in general, is an awful corpus of valid semantic data.

    If you want a rough model of popular ideas then perhaps Google and the web en masse is useful (it is for our software). But if you want any real meaning at all you come to the same conclusion that has given rise to sites like Wiki: the web, to be blunt, has a whole lot of shit in it. Coming up with a perfect (and rational) filter is quite a task.

    1. Re:Limits to semantic derivations from Google by Rudi+Cilibrasi · · Score: 1
      I'm glad to see you are interested in our work. I applaud parallel and different efforts like your own system, however I think you are making at least one misleading and factually false assumption that I would like to correct. By coincidence, I have already done an unpublished experiment that involved Tom Cruise. Contrary to your assertion that it's impossible to get useful data, in fact I have already gotten the data that
      • Tom Cruise is an actor/actress more than something else
      • Tom Cruise is an actor more than an actress

      This was actually one of the first experiments that I tried and the results were about 85% accurate for this actor / non-actor classification problem. But most of my experiments cannot fit in the paper.

      I got these results from trained classification using an SVM in binary mode with training, about the same size training data as all my other WordNet experiments. If you review my paper you will find my program looks only at page counts and does not look at results yet. Therefore, your claim that we cannot gain "useful" data from pagecounts is patently false.

      One of the most astounding results of my research is in fact precisely the opposite of your assertion: namely that we can in fact derive useful data from just page counts alone. And another clear conclusion of my experiments is that the idea that the web is somehow too low quality to be useful is false. In fact there is a very active branch of learning theory that deals with boosting accuracy of imperfect heuristics using a variety of techniques such as majority-voting schemes, multiple trials, etc. But my experiments show that you can achieve good accuracy even without these techniques.

      For the skeptical, I invite you to try to replicate my Tom Cruise experiment yourself using your favorite scripting language. Just make a list of actors and a list of non-actor (but famous) people. Then choose anchor terms as I did in my paper and train an SVM. Then test it with Tom or whoever else you like as a test case, and tell us about it. If you don't script then you can do this by hand in a few hours using websearches in your browser and a calculator to calculate the NGD grid. Then just feed that in to an SVM package of your choice (or try any other learning algorithm if you like). Best regards, Rudi

  33. On the bright side... by Anonymous Coward · · Score: 2, Informative

    They are developing an open source tool http://complearn.sourceforge.net/ that will hopefully integrate the algorithm they describe. Right now it's only supporting one of their previous algorithms. More about this in the above link and section 5 of the paper.

  34. Pretentiously titled by Turadg · · Score: 2, Insightful

    I've perused the abstract and skimmed the body of the paper. They're fine. But the title is misleading: Automatic Meaning Discovery Using Google.

    Their software has discovered meaning no more than paper has when the lexicographer is done writing her dictionary. Meaning is not the grouping of symbols.

    For systems that step towards encoding meaning as human brains do, consider the Neural Theory of Language.

  35. What you build is a substrate. by Dylan+Thomas · · Score: 2, Interesting

    You're quite correct that cowboy-loose definitions of terms make this a very difficult discussion to have. For example, when you say "self awareness," it's unlikely that you actually mean "self awareness" in the literal sense; after all, if a computer is capable of detecting when its processor is overheating (and perhaps turning on a fan in response), it is basically "self aware," though we wouldn't confuse that with intelligence.

    Rather, I think by "self awareness" here you mean, possessing narrativity; that is, the ability to construct a narrative of itself in relation to the things of which it is aware. In simpler words, consciousness. Now, it is possible to be intelligent without being conscious (everyone thinks they have the smartest dog in the world, but that doesn't make the poor beast conscious). But is it possible to be conscious without being intelligent?

    Consciousness is fundamentally linguistic in origin (and I'm tired of arguing that point with people who haven't done a day of cognitive studies in their lives; there's no way around it: without language, consciousness does not evolve). So, for example, in the course of human evolution, first a linguistic parsing system was evolved, humans got language, and then, once this substrate was established, consciousness evolved as an epiphenomenon which rode on top of it. This substrate proved to be a fertile breeding ground on which memetic evolution could take place, as well, and since that is broader than any one particular human component in the system, it's almost more proper to say that we are the tools memes use to propagate, and not vice versa. (This argument is fairly well established with genes; same rules apply.)

    So, any artificial system which contains "consciousness" will have to first handle language. If you don't have that linguistic substrate for narrativity and memetic evolution, there is nothing for consciousness to occur in. Maybe the information is there, but it would be like me pointing to an empty spot in the room and saying, "That's a balloon full of air; I just forgot the balloon." So, let's do this in the proper order: language first, then consciousness.

    --
    What he wants is more important that what I want. What he wants is also more important that what you want.
    1. Re:What you build is a substrate. by Anonymous Coward · · Score: 0

      A slug is conscious, no? Yet we never hear them talking about it, do we?

    2. Re:What you build is a substrate. by 2TecTom · · Score: 1

      I'm so sorry, but I find I can't agree with your statement that a mechanical system is "capable of detecting when its processor is overheating". A system is not "aware" in the sense of being "conscious" since obviously machines are not capable, at this point, of being conscious. Indeed, you seem to be making exactly the mistake I was attempting to describe. That is, you've mistaken the intelligence of the created with the intelligence of the creator. The programmer was aware that at a certain temperature level, damage can occur. His awareness of this is what leads to a particular design, but at no time is the design itself "aware". Is this clear?

      As for the definition of "self-awareness", I think I'll stick to the conventional meaning as it is normally defined.

      The noun "self-awareness" has 1 sense in WordNet.

      1. self-awareness -- (awareness of your own individuality)

      http://www.cogsci.princeton.edu/cgi-bin/webwn?stage=1&word=self-awareness

      The noun "awareness" has 2 senses in WordNet.

      1. awareness, consciousness, cognizance, cognisance, knowingness -- (having knowledge of; "he had no awareness of his mistakes"; "his sudden consciousness of the problem he faced"; "their intelligence and general knowingness was impressive")

      2. awareness, sentience -- (state of elementary or undifferentiated consciousness; "the crash intruded on his awareness")

      http://www.cogsci.princeton.edu/cgi-bin/webwn?stage=1&word=awareness

      Please note that awareness is dependent upon:

      a) "consciousness"
      b) "sentience"

      These are clearly two qualities that cannot, at this time, be mechanically produced.

      --
      Words to men, as air to birds.
  36. Reconstructing semantic space by G4from128k · · Score: 1

    Thank you for the reply. I'm glad your work generalizes to longer search-term lists. Like so many other /. readers, I did not take the time to read your preprint before posting.

    I've often wondered if one can use simple pair-wise distance estimates to reconstruct a polytope or distorted simplex for the set of items within a multidimensional space. In theory, an N-object system, with non-zero pairwise distances, requires (N-1) dimensions. But in practice, many real systems don't fill the space -- being M-dimensional (M less than N-1) and having only negligible (perhaps noise-induced) thickness in the other dimensions.

    For semantic systems, the total number of semantic dimensions may be far less than the number of semantic terms or tokens. A simple example of dimensional flattening is the existence of synonyms -- the second word does not expand the space because it does not encode a new dimension of meaning. (Synonyms would also be negatively correlated in Google searches, but that's another issue). Also, the fact that each word can be defined in terms of other words suggests that the semantic nebula does not actually fill the space.

    Accomplishing this would require a true distance metric. I notice that NGD does not satisfy the triangle inequality. Perhaps some minor transform or alternative formulation of NGD would yield a true metric.
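
    Whether a given set of pairwise distances behaves like a true metric can at least be checked mechanically. A small Python sketch, with made-up distances standing in for real NGD values (the three items and their numbers are purely hypothetical):

```python
from itertools import permutations

def triangle_violations(dist):
    """Return triples (a, b, c) where dist(a, c) > dist(a, b) + dist(b, c).

    `dist` maps frozenset({x, y}) to a symmetric distance.
    """
    items = sorted(set().union(*dist))
    d = lambda x, y: dist[frozenset((x, y))]
    return [(a, b, c) for a, b, c in permutations(items, 3)
            if d(a, c) > d(a, b) + d(b, c) + 1e-12]

# Made-up distances for illustration: x and z are each close to y
# but far from each other, which breaks the inequality.
example = {
    frozenset(("x", "y")): 0.1,
    frozenset(("y", "z")): 0.1,
    frozenset(("x", "z")): 0.9,
}
print(triangle_violations(example))
```

    Running such a check over a real NGD grid would show how often, and by how much, the inequality fails, which is a first step toward deciding what kind of transform might repair it.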

    The reason that estimating semantic dimensionality is useful is twofold. First, it says something about the cognitive complexity of humans and human systems. Second, it provides some insight into the required cognitive sophistication of autonomous learning systems that need to interact "intelligently" with humans. How many words does a system need to truly understand to pass the Turing test?

    Creating a full reconstruction, a more challenging task, would provide insight into the structure of human language and human language usage patterns. The dimensionality of clusters of words might provide insight into the complexity of subdomains of knowledge.

    I wish you every success in creating better autonomous learning systems.

    --
    Two wrongs don't make a right, but three lefts do.
  37. Re:Unsupervised but Reflective of Human Preference by Anonymous Coward · · Score: 0

    Just because you heard some rumor, or even heard it more than once, or possess one datapoint does not mean you know what Google's hiring practices are. There is no bias against Americans; I speak from personal experience. Having such a bias would be both illegal and stupid, and Google is law-abiding and not-stupid.

  38. Prize classes by Baldrson · · Score: 1
    The smaller the sample, the larger the domain covered, and the quicker and more accurate the prediction, the greater the intelligence.

    Good point. However, it is difficult to value time in a single competitive metric, whereas compression ratio (where the initial and compressed sizes include the size of the algorithm/knowledge of the AI) is a single number.

    Perhaps the way around this is to have different prizes for different time classes, varying by an exponential. You'd have, say, 3 competitions with timeouts of one unit of time, 10 time units and 100 time units. This could make the contest run in a reasonable period of time at a reasonable cost.

  39. Please mod parent UP (he wrote TFP) by Anonymous Coward · · Score: 0

    Are you sleeping, moderators?
    The author of the paper took the time to answer some questions in an insightful and friendly manner, and his post is still buried at +1.
    You can do better than that.

  40. understanding relationships is intellegence by menem · · Score: 2, Insightful

    If given perfect information about the relationships between concepts, you could derive a very intelligent machine. Take a human for example.

    A baby hears the word mom spoken by his mom. Gradually, the baby knows there is a relationship between that sound and a smiling face.

    The child, growing up, starts to see relationships. Intense pain, which is rare, when correlated with a hot stove, has strong meaning in his mind.

    Everything is learned initially through correlations. The advantage of human beings is that there are many more data points for correlation. Google's correlations are weak and don't give nearly as much information.

    1. Re:understanding relationships is intellegence by nagora · · Score: 1
      It'll be intelligent the day it returns the value:

      Google isn't very good anymore, is it?

      TWW

      --
      "Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
    2. Re:understanding relationships is intellegence by apsheth · · Score: 1

      Here is a more extensive exposition on current work on relationships, esp. as they can be supported in the Semantic Web context: http://www.cs.uic.edu/~ifc/SWDB/keynote-as.html http://lsdis.cs.uga.edu/lib/download/SAK02-TM.pdf

  41. Ant's vs. Nest by TapeCutter · · Score: 1

    "the ability to "see" its "body"" - Individual ant's do not learn, they are very much like small robots. An ant's nest on the other hand can display a modicum of intelligence in the way that it forages and protects itself.

    --
    And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
  42. This has been suggested before by clsc · · Score: 1
    Specifically, by the moderator Orion (aka. Dr E. Garcia, Mi Islita.com) at the Search Engine Watch Forums.

    Here's the original thread from June 2004: http://forums.searchenginewatch.com/showthread.php?t=48

    Here are the writings from Dr. Garcia's own web site: http://www.miislita.com/semantics/c-index-1.html - see especially parts three and four.

    1. Re:This has been suggested before by pmbv · · Score: 1

      I checked the supplied links. The point of the Cilibrasi-Vitanyi method is not that it uses Google page counts (as the supplied links show, many different approaches have done so). The point is that a particular distance (formula) has been developed, based on an extensive mathematical theory and a sequence of research papers spanning over a decade. Experimental testing shows that it always works, in different settings, in a relatively precise way. For example, in a massive randomized experiment the mean agreement with the decade-effort WordNet database at Princeton University is 87.5% with a standard deviation of 0.11. This means that the automatic method is remarkably precise and almost equals the knowledge put in by experts with a PhD. The comment above is like saying that "internet is nothing new because the idea of connecting computers has been suggested before". That depends on what you call "new", but the devil is in the underpinning theory and the experimental validation. If the approach of Dr E. Garcia, Mi Islita.com is better in practice, so be it. But the statement I react to says nothing of the sort.
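
      For readers who have not opened the preprint, the distance being discussed can be sketched as below. The formula is the normalized Google distance as given in the Cilibrasi-Vitanyi paper. The function name, the error handling, and the page counts in the example are assumptions made up for illustration, not real Google results.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google distance from page counts.

    fx, fy: hits for each term alone; fxy: hits for pages containing both;
    n: an estimate of the total number of indexed pages.
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term needs at least one hit")
    if fxy <= 0:
        return math.inf  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Hypothetical counts: terms that always appear together vs. rarely together.
print(ngd(10_000, 10_000, 10_000, 10**8))
print(ngd(10_000, 10_000, 100, 10**8))
```

      Terms that always co-occur get distance 0, and the value grows as the joint count shrinks relative to the individual counts.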

  43. I agree. by TheLink · · Score: 1

    Intelligence is not just knowing _absolute_ semantic meaning - e.g. that a cow is a cow and grass is grass. And being able to group grass with other grasses and cows with other bovine animals.

    It's being able to understand the statement that cows are to grass as baleen whales are to krill.

    And then knowing something about krill from that, even if you didn't know what krill was at all.

    It's not just knowing the "absolute value" of the meaning - or even that these two objects are linked or close in the same area (which seems to be the level which most current AI are at).

    It's more like knowing the "vector/direction" in which they are linked, and being able to organize other objects that are related in similar ways along a similar vector. Thus you can learn about things by analogies and metaphors AND even create new things with those.
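
    A toy version of the "vector/direction" idea, purely as illustration: if words were points in some meaning space, an analogy could be completed by reusing the offset between one pair on another. The 2-D coordinates and the helper complete_analogy are invented for this sketch and are not derived from Google counts or any real data.

```python
import numpy as np

# Invented 2-D "meaning" vectors: axis 0 ~ habitat/size, axis 1 ~ eater vs. eaten.
vectors = {
    "cow": np.array([1.0, 0.0]),
    "grass": np.array([1.0, 1.0]),
    "baleen whale": np.array([5.0, 0.0]),
    "krill": np.array([5.0, 1.0]),
    "rock": np.array([3.0, 4.0]),
}

def complete_analogy(a, b, c, vectors):
    """Return the word d such that a is to b as c is to d."""
    # Apply the a -> b direction starting from c.
    target = vectors[c] + (vectors[b] - vectors[a])
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - target))

print(complete_analogy("cow", "grass", "baleen whale", vectors))
```

    With these made-up coordinates the "eats" direction carries over from cow to whale and lands on krill, which is the kind of inference the comment describes.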

    Would pump out more BS but I have to go for dinner :)

  44. Re:Compression is not AI by Zukix · · Score: 1

    Compression is a stricter test for AI than Turing

    By stricter do you mean narrower and incomplete? Do you think that taking something overly terse and compressed and explaining it simply with examples and analogies etc. is a greater intellectual achievement?

    Intelligence can be seen as the ability to take a sample of some space and generalize it to predict things

    It would be myopic to see it as such. The ability to communicate an idea is a closer description of what it is to be intelligent as captured by the Turing test which rightly leaves the problem domain beyond this undefined. I know many intelligent people who are incontinently verbose and cannot summarise in a techie/scientific mode but can communicate a feeling or a subtle insight by complex layered descriptions. Compression and prediction might be an optional string of intelligence but by no means the whole instrument.

  45. Re:Unsupervised but Reflective of Human Preference by Hognoxious · · Score: 1
    Google is law-abiding
    Bzzzzt, Wrong! Unless you define the legality of an action by whether you get off with just a slap on the wrist.
    --
    Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  46. No. by Dylan+Thomas · · Score: 2, Informative

    A slug is not conscious. Nothing without language is. Recommended reading: Dr. Daniel C. Dennett, Consciousness Explained and Darwin's Dangerous Idea. Richard Dawkins, The Extended Phenotype. Julian Jaynes, The Origin of Consciousness in the Breakdown of the Bicameral Mind.

    Those are all more commercial works, well within the grasp of even people who've done no work in the field. For more scholarly and technical references, check their bibliographies, especially in Dennett.

    --
    What he wants is more important that what I want. What he wants is also more important that what you want.
    1. Re:No. by Anonymous Coward · · Score: 0

      A slug certainly is conscious of its environment ... sheesh

  47. Down, cowboy. by Dylan+Thomas · · Score: 2

    "Sheesh" is a word which normally means, "I'm not very good at actually saying what I mean, so I'll just make strange noises and roll my eyes at someone who won't figure it out for me." (It's also the nick of one of my favorite Internet trolls; what ever happened to the good old days when trolls actually tried to be entertaining, instead of merely annoying?)

    Anyway, okay, it's loose definitions of words that are once again getting us into trouble here. That a slug is aware of its environment, as in, capable of responding to environmental stimuli, okay, I'll give you that one. I won't gift wrap it for you, but I'll give it to you.

    But that's entirely different from "consciousness" in the sense that we're discussing here. After all, even a computer is capable of detecting environmental stimuli, and responding to them, but as my colleague 2TecTom is pointing out in this same thread, the mere ability to respond to environmental stimuli is not synonymous with consciousness.

    Read the source material. It'll give you the weapons you need to overcome your sheeshing.

    --
    What he wants is more important that what I want. What he wants is also more important that what you want.
  48. another elephant by uahsenaa · · Score: 1

    My feeling is that what's often missed in the various AI language research programs is the problem of speech and reference. Any linguist working in the field will tell you that speech recognition is an incredible pain in the arse, and then to layer semantic recognition on top of that is doubly painful. Though my real concern is about things like irony and sarcasm. I'm glad somebody stepped up to point out that different languages break concepts and the world up in different ways. But how exactly can you get AI to gather enough circumstantial data to understand something like sarcasm, where the connoted meaning is often different from, or the exact opposite of, the denoted one?

    And what about signs that are coded to mean multiple things to multiple people simultaneously? For example, you're talking with your friend about this absolutely grotesque hat someone is wearing. You say to this person, "Nice hat"; he thanks you and your friend snickers. Same sign, two simultaneous meanings.