Slashdot Mirror


Automatic Translation Without Dictionaries

New submitter physicsphairy writes "Tomas Mikolov and others at Google have developed a simple means of translating between languages using a large corpus of sample texts. Rather than being defined by humans, words are characterized based on their relation to other words. For example, in any language, a word like 'cat' will have a particular relationship to words like 'small,' 'furry,' 'pet,' etc. The set of relationships of words in a language can be described as a vector space, and words from one language can be translated into words in another language by identifying the mapping between their two vector spaces. The technique works even for very dissimilar languages, and is presently being used to refine and identify mistakes in existing translation dictionaries."
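
A minimal sketch of the "mapping between two vector spaces" idea, assuming the mapping is a plain linear transform fitted from a handful of known word pairs. The 2-D vectors, the seed pairs, and the names `fit_linear_map` and `translate` are all made up for illustration; real word vectors would be learned from large corpora and have far more dimensions.

```python
import math

# Made-up 2-D "embeddings"; real ones would be learned from large corpora.
english = {"cat": (1.0, 0.2), "dog": (0.9, 0.4), "house": (0.1, 1.0)}
spanish = {"gato": (-0.2, 1.0), "perro": (-0.4, 0.9), "casa": (-1.0, 0.1)}

# Hypothetical seed pairs used only to fit the mapping.
seed_pairs = [("cat", "gato"), ("house", "casa")]


def fit_linear_map(pairs, src_vecs, dst_vecs):
    """Least-squares 2x2 matrix W minimising sum ||W x - z||^2 over the pairs."""
    a = [[0.0, 0.0], [0.0, 0.0]]  # running sum of z x^T
    b = [[0.0, 0.0], [0.0, 0.0]]  # running sum of x x^T
    for src, dst in pairs:
        x, z = src_vecs[src], dst_vecs[dst]
        for i in range(2):
            for j in range(2):
                a[i][j] += z[i] * x[j]
                b[i][j] += x[i] * x[j]
    det = b[0][0] * b[1][1] - b[0][1] * b[1][0]
    b_inv = [[b[1][1] / det, -b[0][1] / det],
             [-b[1][0] / det, b[0][0] / det]]
    # Closed-form solution: W = A B^-1
    return [[sum(a[i][k] * b_inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def translate(word, w, src_vecs, dst_vecs):
    """Map the source vector through W, then pick the nearest target word."""
    x = src_vecs[word]
    mapped = [w[i][0] * x[0] + w[i][1] * x[1] for i in range(2)]
    return max(dst_vecs, key=lambda t: cosine(mapped, dst_vecs[t]))


W = fit_linear_map(seed_pairs, english, spanish)
print(translate("dog", W, english, spanish))
```

"dog" is not among the seed pairs, so its translation comes entirely from the fitted map plus a nearest-neighbour lookup in the target space.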

115 comments

  1. My hovercraft is full of eels! by Anonymous Coward · · Score: 5, Funny

    My nipples explode with delight!

    1. Re:My hovercraft is full of eels! by Brad1138 · · Score: 1

      LOL, I was just watching my Monty Python DVDs last week and saw that episode. Very Funny.

      --
      If you could reason with religious people, there would be no religious people
    2. Re:My hovercraft is full of eels! by Alsee · · Score: 1

      This method is mÃfÆ'à © complÃfÆ'à throughly cromulent! I always use Google Translate to exÃfÆ'à © cuter all my messages through Slashdot franÃfÆ'à Ãf  ais Ãf then English. You can not tell me mÃfÆ'à  at all.

      -

      --
      - - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.
    3. Re:My hovercraft is full of eels! by kloro2006 · · Score: 1

      Ditto, ditto, ditto. And because it is able to generate an unlimited amount of close translations, the technology would it seems to me change our whole concept of dictionaries used by those who want to read foreign texts without external translations. Whenever I want to understand a word, what I want also is to learn it for its future appearances, but I can't learn a word without lots of examples. With the dictionary I have in mind, you find the word in a corpus of good literature in the language (if good lit' is what yr wanting to translate ;o) and underneath the relevant texts in, say, Latin, are very close translations of the texts in English. With such a tool, you keep finding examples of the word until you get a solid feel for it. It seems to me that such a 'dictionary' would be infinitely quicker and easier to use than conventional dictionaries. HOORAY!

    4. Re:My hovercraft is full of eels! by lissnup · · Score: 1

      You beat me to it!

  2. And what's the algorithm complexity? by d33tah · · Score: 1

    Well, that sounds quite cool, but it also makes me wonder how the algorithm tells wrong associations from good ones. These things can easily go up to n^2 complexity.

    1. Re:And what's the algorithm complexity? by d33tah · · Score: 1

      (I meant O(n^2) memory complexity.)

    2. Re:And what's the algorithm complexity? by Anonymous Coward · · Score: 0

      Nested loops are always a great idea. Puts you on the dark side of the O chart. Above linear space/time.

      You act so big but tell me this, since you're an awesome algorithm developer, can you tell me for sure you never used NESTED loops? Because once you nest a loop you are in the above linear time/space territory. Which is BAD.

    3. Re: And what's the algorithm complexity? by Anonymous Coward · · Score: 0

      You win the internets

    4. Re:And what's the algorithm complexity? by SuricouRaven · · Score: 4, Funny

      Statistical translation is always going to have issues like that, but it can perhaps reach the 'good enough' point to hold a conversation with.

      I can easily see it getting confused by formal vs informal use. If it goes on association, eventually it's going to get 'lawyer' and 'extortionist' confused.

    5. Re:And what's the algorithm complexity? by Anonymous Coward · · Score: 4, Funny

      I too get lawyer and extortionist confused.

    6. Re:And what's the algorithm complexity? by FatdogHaiku · · Score: 1

      Awl hour go rhythms spume pizza!

      --
      You have the right to remain sentient. If you give up the right to remain sentient, you will be elected to public office
    7. Re: And what's the algorithm complexity? by Anonymous Coward · · Score: 0

      I'm pretty sure Google translate has used this for a while. It used to make amusing mistakes of this sort, translating Bush into Blair for instance, or inserting "God save the Queen" into a translation of the Irish national anthem.

    8. Re:And what's the algorithm complexity? by Anonymous Coward · · Score: 0

      Not bacon? Hmph.

    9. Re:And what's the algorithm complexity? by TaoPhoenix · · Score: 1

      This is of course a subset of the big overall AI problem.

      So I think (without spending hours on the Articles!) that somewhere either in this research or the next few sets past it, is a key clue. I think the algorithm is (making up a slightly silly sounding word) "Quadratically modular". In other words, nothing says the comp can only use one algorithm to start working on its meaning. Studies like to chop things down because researchers get nervous at Emergent Complexity in old style science results. But using a brutal bit of humor, "people are not that smart all the time".

      So if we decide we are "talking about cats", then just load the "Cat Module"!

      Algorithm 1 is a simple nested tree like Animal/Mammal/Pet/Cat/.
      Colors
      Behaviors
      Names
      Owner Tips
      Pictures (!)
      Other

      Cats themselves keep the "conversational complexity" down (most of the time!) You don't *normally* describe Java Exploits in a cat-lovers discussion! So the types of topics converge not unlike basic limits in Calculus 1. With a little smart programming the comp can tell when the topic is changing. As long as it doesn't, it can shuffle between cat pictures and cat toys and cat food and so on all day. A "relatively few" "meta modules" just do error check loops and then go back to cats.

      So if you get enough expert modules built, at some stage the program begins to become fun to chat with. The unfortunate thing is that all this was ridiculed for 20 years until it became useful for Siri type programs, and then they wouldn't let us in the public have it anymore!

      --
      My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
    10. Re:And what's the algorithm complexity? by Anonymous Coward · · Score: 0

      It's really not perfect. Here's a screen capture of Google translate of 2012 Finnish presidential election results. It shows Sponge Bob getting 3rd, 5th, 6th places in the race http://static.iltalehti.fi/presidentinvaalit/spongebob_etu230112STL_pd.jpg

    11. Re:And what's the algorithm complexity? by Anonymous Coward · · Score: 0

      Google's PageRank is O(n^2), too. This worst case occurs if every page links to all the others. If you parametrize it with additional variables such as average link count, it'll probably be something like O(n*average link count).

      In this translation case, you can just keep track of some associations (x most common, for example), and it'll be O(n*x), where x is significantly less than n.
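
A rough sketch of the "keep only the x most common associations" idea, assuming associations are simple co-occurrence counts inside a small window. The function name `top_associations` and the toy sentence are illustrative only.

```python
from collections import Counter, defaultdict


def top_associations(tokens, window=2, x=3):
    """For each word, keep only its x most frequent neighbours within the window."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1
    # Pruning happens here, so only the final table is bounded by n * x;
    # a streaming version would have to prune while counting.
    return {w: [n for n, _ in c.most_common(x)] for w, c in counts.items()}


tokens = "the cat sat on the mat the cat ate".split()
associations = top_associations(tokens, window=1, x=2)
print(associations["the"])
```

The stored table holds at most x neighbours per word, which is where the O(n*x) figure comes from.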

  3. Pun + Her attitude arbitrary pleases me too. by mynamestolen · · Score: 1

    Neither the article nor the PDF contains the word "pun". We're still a little way off. But hopefully we'll get better than this attempt from Google Translate: = She turned me off with her bossy manner. But Google Translate gets the OPPOSITE meaning, saying "Her attitude arbitrary pleases me too."

    --
    work in progress
    1. Re:Pun + Her attitude arbitrary pleases me too. by mynamestolen · · Score: 2

      hmmm?? slashdot doesn't easily accommodate unicode.

      --
      work in progress
    2. Re:Pun + Her attitude arbitrary pleases me too. by Kjella · · Score: 2

      Welcome to /. where we still party like it's 1999. We'll have colonies on Mars before this site gets unicode support.

      --
      Live today, because you never know what tomorrow brings
    3. Re:Pun + Her attitude arbitrary pleases me too. by tepples · · Score: 2

      Slashdot has a fairly strict code point whitelist because there were problems in the past with trolls using directionality override characters to break Slashdot's layout and big blocks of foreign characters to make not-ASCII ASCII art.

    4. Re:Pun + Her attitude arbitrary pleases me too. by NonUniqueNickname · · Score: 1

      ... to make not-ASCII ASCII art.

      So... just art?

    5. Re:Pun + Her attitude arbitrary pleases me too. by tepples · · Score: 1

      I was referring to Shift JIS or Unicode glyph art, which extend the concept of ASCII art past the ASCII character set.

    6. Re:Pun + Her attitude arbitrary pleases me too. by Anonymous Coward · · Score: 0

      That must be so hard to filter.

  4. make that the cat wise! by Anonymous Coward · · Score: 0

    And would that really work? Make that the cat wise!

    1. Re:make that the cat wise! by Anonymous Coward · · Score: 0

      You must be Dutch, because that hit like a rod on a pig.

    2. Re: make that the cat wise! by Anonymous Coward · · Score: 2, Funny

      Yes exactly. For sayings google translate works not so good now. But perhaps with this technique it will be to plums in the future.

    3. Re: make that the cat wise! by Anne+Thwacks · · Score: 1

      Since the TV subtitles have so many errors that they are impossible for humans to understand, I can't see this working in my lifetime. I suspect the cat is a weasel, rather than wise.

      --
      Sent from my ASR33 using ASCII
  5. how would by ozduo · · Score: 1

    'tight pussy" be translated?

    --
    I got to the chocolate box before you, that's why the hard ones have teeth marks.
    1. Re:how would by Anonymous Coward · · Score: 5, Funny

      how would 'tight pussy" be translated?

      "Tight pussy" would be translated automatically, and without dictionaries. This is answered right in the headline.

    2. Re:how would by Jane+Q.+Public · · Score: 4, Funny

      "tight pussy" be translated?

      "The cat has drunk a saucer of wine."

    3. Re:how would by SuricouRaven · · Score: 2

      Depends on source corpus. If they trained it using one of the usual formal collections of publications, it would only have built up associations based on the slang-free usage and so would translate it as 'Tight cat.' If they have instead fed it a broader selection, perhaps culled from a web spider, it may pick up the other meaning.

    4. Re:how would by narcc · · Score: 1

      it may pick up the other meaning

      In a manner of speaking. The actual meaning of the words is completely irrelevant.

    5. Re:how would by smallfries · · Score: 1

      Dat cat was good to roll with, yo?

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
  6. Darmok and Jalad at Tanagra by Vanders · · Score: 4, Interesting

    Finally, the team point out that since the technique makes few assumptions about the languages themselves, it can be used on argots that are entirely unrelated.

    Once again, Star Trek is ahead of the curve.

    1. Re:Darmok and Jalad at Tanagra by Samantha+Wright · · Score: 2

      Incidentally, real life caught up—fortunately there's not much worth translating with such a low-bandwidth form of communication.

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    2. Re:Darmok and Jalad at Tanagra by epine · · Score: 1

      Once again, Star Trek is ahead of the curve.

      If you don't count noticing that the gear cogs of the antikythera could be made ever smaller and smaller by ongoing advances in Swiss craftsmen 1600 years later, then Star Trek was indeed ahead of its time in guessing that a large phone might become a small phone with batteries (the Baghdad Battery dates to roughly the same age as the antikythera) and a radio (1887) carried by some exotic flux such as neutrinos (as named by Fermi in 1933).

    3. Re:Darmok and Jalad at Tanagra by Anonymous Coward · · Score: 0

      More important is the need to grasp the idea of intra-language translation as compared with inter-language translation. That is, considering possible ways of reexplaining something in the same language prior to a translation of the type described here. Meanings could be thought of as limit points of nets of explanations to create a topologised space of concepts, and then rather than just linear transforms, one could consider those which are nice to such a topology. Not a good thing for the actual computation, but could be useful in the underlying theory.

    4. Re:Darmok and Jalad at Tanagra by Samantha+Wright · · Score: 1

      The way they describe their conceptual representation system (they give the example "king - man + woman = queen") makes it pretty clear that figurative language is completely out of the question.

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
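
The "king - man + woman = queen" arithmetic quoted above can be tried on toy data. This is only a sketch: the three axes (royalty, gender, food) and every number in `vectors` are hand-picked assumptions, not values from the paper, and `analogy` is a hypothetical helper name.

```python
import math

# Hand-made toy vectors with axes (royalty, gender, food).
vectors = {
    "king":  (0.9, 0.8, 0.0),
    "queen": (0.9, -0.8, 0.0),
    "man":   (0.1, 0.8, 0.0),
    "woman": (0.1, -0.8, 0.0),
    "apple": (0.0, 0.0, 1.0),
}


def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return dot / norm


def analogy(a, b, c, vecs):
    """Return the word whose vector is nearest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    # The three input words are excluded from the candidates.
    candidates = (w for w in vecs if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vecs[w]))


print(analogy("king", "man", "woman", vectors))
```

With vectors this tidy the answer is forced; with learned vectors the nearest neighbour is only the most likely candidate.
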
  7. Hmmm... by freshlimesoda · · Score: 1

    Makes me think about hash functions and flash storage and data interoperability..... future..

    --
    I come to Slashdot only to read sigs. One you are reading is mine.
  8. Hofstadter? Isn't this AI, not translation? by Etcetera · · Score: 5, Interesting

    Reminds me a lot of the Fluid Concepts and Creative Analogies work that Hofstadter led back in the day.

    I don't see this directly working for translation into non-lexographically swappable languages (eg, English -> Japanese) very well, because even if you have the idea space mapped out, you'd still have to build up the proper grammar, and you'll need rules for that.

    That being said.... Holy cow, you have the idea space mapped out! That's a big chunk of Natural Language Processing and an important step in AI development. ... Understanding a sentence emergently in terms of fuzzy concepts that are an internal and internally created symbol of what's "going on", not just using a dictionary and CYC-like rules to figure it out, seems like a useful building block, but maybe I'm wrong.

    Very cool stuff. Makes me want to go back and finish that CS degree after all.

    1. Re:Hofstadter? Isn't this AI, not translation? by mozumder · · Score: 0

      I'm trying to figure out what the "space" is in the first place?

      What do the axes of the graph represent?

      It would be funny if "meaning" could be quantified into meaningless numbers. That would piss off anyone that believes there's a meaning to life. haha.

    2. Re:Hofstadter? Isn't this AI, not translation? by Anonymous Coward · · Score: 0

      It would be funny if "meaning" could be quantified into meaningless numbers. That would piss off anyone that believes there's a meaning to life. haha.

      42 is NOT a meaningless number.

    3. Re:Hofstadter? Isn't this AI, not translation? by phantomfive · · Score: 4, Interesting

      I don't see this directly working for translation into non-lexographically swappable languages (eg, English -> Japanese) very well, because even if you have the idea space mapped out, you'd still have to build up the proper grammar, and you'll need rules for that.

      According to the paper, this translation technique is only for translating words and short phrases. But it seems to work well for languages as far apart as English and Vietnamese.

      --
      "First they came for the slanderers and i said nothing."
    4. Re:Hofstadter? Isn't this AI, not translation? by infinitelink · · Score: 1

      [...] Holy cow, you have the idea space mapped out! That's a big chunk of Natural Language Processing and an important step in AI development. ... Understanding a sentence emergently in terms of fuzzy concepts that are an internal and internally created symbol of what's "going on", not just using a dictionary and CYC [wikipedia.org]-like rules to figure it out, seems [...]

      Like not enough given the symbol-grounding problem.

      --
      Intelligent idiots are we. | Evil men do not understand justice.
    5. Re:Hofstadter? Isn't this AI, not translation? by narcc · · Score: 1

      Understanding a sentence emergently in terms of fuzzy concepts that are an internal and internally created symbol of what's "going on", not just using a dictionary and CYC [wikipedia.org]-like rules to figure it out, seems like a useful building block

      Yeah, that's not what's happening at all.

    6. Re:Hofstadter? Isn't this AI, not translation? by Anonymous Coward · · Score: 0

      I have done this in ... 1976. It's not a big deal if high accuracy is not paramount.

  9. Re:Sounds good, but we need a robust plug by Finallyjoined!!! · · Score: 4, Funny

    it gets full of lint

    What's it got in its pocketses?

    --
    If I had an Ass, I'd call it Fanny Bottom, then I could slap my Ass; Fanny Bottom, on the Arse.
  10. Re:Sounds good, but we need a robust plug by caseih · · Score: 2

    Agg. firefox put me on the wrong story... bye bye karma

  11. Load of bollocks by Skiron · · Score: 1

    OK, I am just having a fag. I bet that will bugger it up.

    1. Re:Load of bollocks by denzacar · · Score: 1

      Cats still dig fags?

      --
      Mit der Dummheit kämpfen Götter selbst vergebens
    2. Re:Load of bollocks by Anonymous Coward · · Score: 0

      Yep they're totally tubular.

  12. Interesting approach by Anonymous Coward · · Score: 0

    Now assign each word/phrase a certainty/confidence value, and apply your new algorithm only to words/phrases for which a literal (i.e. dictionary) translation has a low degree of confidence.

    Much appreciated,
    Multilingual speakers
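
One way to read the suggestion above, as a sketch only: consult the dictionary first and hand the word to the new method when the stored confidence is low. The confidence scores, the 0.8 threshold, and the names `translate_word` and `vector_fallback` are all hypothetical.

```python
def translate_word(word, dictionary, fallback, threshold=0.8):
    """Use the dictionary entry when it is confident enough, else the fallback."""
    entry = dictionary.get(word)
    if entry is not None and entry[1] >= threshold:
        return entry[0]
    return fallback(word)


# Toy data: word -> (translation, confidence). All values are invented.
dictionary = {"cat": ("chat", 0.95), "bank": ("banc", 0.40)}


def vector_fallback(word):
    # Stand-in for a vector-space lookup; returns the word unchanged if unknown.
    return {"bank": "banque"}.get(word, word)


print(translate_word("cat", dictionary, vector_fallback))
print(translate_word("bank", dictionary, vector_fallback))
```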

  13. Re:Sounds good, but we need a robust plug by icebike · · Score: 3, Insightful

    Firefox had nothing to do with it.
    It was PEBCAK, pure and simple.

    --
    Sig Battery depleted. Reverting to safe mode.
  14. Summary wrong (again) by icebike · · Score: 1, Flamebait

    Simply because you embed your dictionary in something you choose to call a vector doesn't make it any less of a dictionary.

    It's still a dictionary, and also a thesaurus. Come to think of it, a thesaurus is simply a meaning-vectored dictionary.
    What's old is new again.
    Mathematicians, late to the party, still trying to drink all the punch.

    --
    Sig Battery depleted. Reverting to safe mode.
    1. Re:Summary wrong (again) by Anonymous Coward · · Score: 0

      Also, crediting them with moving us past human translation overlooks the fact that machine translation has been happening for decades. Yay to the authors for a step forward; boo to the submitter for repeating the every-cool-new-thing-changes-the-world fallacy.

    2. Re:Summary wrong (again) by hey! · · Score: 4, Insightful

      Simply because you embed your dictionary in something you choose to call a vector doesn't make it any less of a dictionary.

      True, but calling a dictionary a vector space doesn't make it so. For example, how "close" are the definitions of "happiness" and "joy"? In a dictionary, the only concept of "closeness" is the lexical ordering of the words themselves, and in that sense "happiness" and "joy" are quite far apart (as far apart as words beginning with h-a are from words beginning with j-o). But in some kind of adjacency matrix which shows how often these words appear in some relation to other words, they might be quite close in vector space; "guilt" and "shame" might likewise be closer to each other than either is to "happiness", and each of the four words ("happiness", "joy", "guilt", "shame") would be closer to any other of those words than they would be to "crankshaft"; and probably closer to "crankshaft" (a noun) than they'd be to "chewy" (an adjective).

      Anyhow, if you'd read the paper, at least as far as the abstract, you'd see that this is about *generating* likely dictionary entries for unknown words using analysis of some corpus of texts.

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
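
The "closeness" described above can be sketched with raw co-occurrence counts and cosine similarity. Assumptions: each sentence is treated as one context window, the six sentences are invented, and raw counts stand in for whatever learned vectors a real system would use.

```python
import math
from collections import Counter, defaultdict

sentences = [
    "she felt happiness and joy today",
    "joy and happiness filled the room",
    "he carried guilt and shame for years",
    "shame and guilt haunted him",
    "the mechanic oiled the crankshaft",
    "the crankshaft turns the engine",
]


def cooccurrence_vectors(sents):
    """Map each word to a Counter of the words sharing a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in sents:
        words = sentence.split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


vecs = cooccurrence_vectors(sentences)
print(cosine(vecs["happiness"], vecs["joy"]))
print(cosine(vecs["happiness"], vecs["crankshaft"]))
```

On this toy corpus "happiness" lands nearer to "joy" than to "crankshaft", even though alphabetical order says nothing of the kind.
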
    3. Re:Summary wrong (again) by abies · · Score: 1

      I think that joy is quite close to chewy (through bubblegum and caramel for example). Of course, I believe some people may get more joy from playing with well oiled crankshaft, but that's a personal preference ;)

  15. 2003 called and wants its news back. by Anonymous Coward · · Score: 0

    Um... while it is awesome and works, uh, the translator has done this for about a decade. Heck, the original Babel Fish (used by Yahoo, Overture, Altavista) was the first with such concepts, pre-2000.

  16. Cat by Anonymous Coward · · Score: 0

    Cat, associated with:

    Big
    Steel
    Heavy
    Wheels
    Track
    Blade

    1. Re:Cat by blue+trane · · Score: 3, Insightful

      jazz musician

    2. Re:Cat by Anonymous Coward · · Score: 0

      bash
      pipe
      echo

      Cat, associated with:

      Big
      Steel
      Heavy
      Wheels
      Track
      Blade

    3. Re:Cat by dkleinsc · · Score: 2

      Rimmer, Lister

      --
      I am officially gone from /. Long live http://www.soylentnews.com/
  17. Dolphinese Will Now Be Understood by MacroSlopp · · Score: 4, Funny

    With this technology we should be able to understand Dolphin-talk.
    It should also allow us to detect future ape rebellions before they happen.

    1. Re:Dolphinese Will Now Be Understood by Anonymous Coward · · Score: 1

      Thanks for all the fish!

    2. Re:Dolphinese Will Now Be Understood by Vanders · · Score: 2
    3. Re:Dolphinese Will Now Be Understood by SuricouRaven · · Score: 1

      It's already been partially decoded.

      Most of the calls are individual identifiers unique to the individual. Makes a lot of sense. A dolphin pod is essentially free-floating a lot of the time in an ocean with no navigational markers and little indication of direction. They need some way to track each other to keep the group from getting split up.

    4. Re:Dolphinese Will Now Be Understood by schlachter · · Score: 1

      wrt your first sentence. i don't think this is funny at all. it's an amazing opportunity.

      --
      My God can beat up your God. Just kidding...don't take offense. I know there's no God.
  18. the spirit is willing but the flesh is weak by Anonymous Coward · · Score: 0

    Was the know-nothing reporter using/regurgitating something that was mistranslated?

    You WILL have to convert the old word to the one from the new language - THAT takes a dictionary operation.

    The problem is, that old word may translate into many different possible new words or phrases and the difficulty is WHICH new verbiage is correct.

    What's being talked of IS combining the basic translation lookup (dictionary) with some extra association information (context of adjacent) to try to pick the RIGHT translation and resulting new words.

    SO it's pretty farging stupid declaring this is 'dictionaryless'.

    The word association link info can be a thousand-fold increase in the information such a translation database would need to maintain (and is largely what such real translator efforts have been doing in the past 50 years).

    1. Re:the spirit is willing but the flesh is weak by icebike · · Score: 3, Interesting

      Yes, the pretty vectors (nothing but lists of words) still have to be assembled by humans for the most part. Maybe not EVERY association, but enough of them such that you can build relationships and associations indirectly, and achieve a round-about translation, even if you end up having to go through 2 or 3 related languages to get there.

      After a few words of context are translated you can perhaps deduce the rest. But the idea you can do so without a dictionary is ridiculous. And putting your dictionary into digital forms and calling it a vector doesn't change the fact that you still have a dictionary associating an English word with a French word and a Mandarin word.

      --
      Sig Battery depleted. Reverting to safe mode.
    2. Re:the spirit is willing but the flesh is weak by sourcerror · · Score: 1

      It pretty much sounds like a dumbed down Wordnet.

      http://en.wikipedia.org/wiki/WordNet

      http://wordnet.princeton.edu/

  19. Isn't that pretty much how Google Translate works? by Anonymous Coward · · Score: 1

    Tomas Mikolov and others at Google have developed a simple means of translating between languages using a large corpus of sample texts. Rather than being defined by humans, words are characterized based on their relation to other words.

    Like how Google Translate has noticed that Danish domain names end in "dk" and therefore translates "dk" to "com" with "uk", "gb" and "en" as some of the other suggestions?

    Sometimes "a simple means" can be too simple.

  20. Old idea, new implementation? by Theovon · · Score: 5, Interesting

    When I was in grad school, studying linguistics, computational linguistics, and automatic speech recognition, I recall it mentioned more than once the idea of using latent semantic analysis and such to do this kind of translation. So am I correct in assuming that this hasn't been done well in the past, and Google finally made it work well because they have larger corpora of translated texts?

    1. Re:Old idea, new implementation? by schlachter · · Score: 1

      yeah, it's about all these different corpuses coming online and being available to a single group, especially because in order to train, they need a one to one translation of a single doc. like a gov doc that's in both spanish and english is great fodder for the algorithm.

      --
      My God can beat up your God. Just kidding...don't take offense. I know there's no God.
    2. Re:Old idea, new implementation? by Anonymous Coward · · Score: 0

      Funny, in philosophy we call it post-modernism. Heidegger was good at it.

    3. Re:Old idea, new implementation? by Anonymous Coward · · Score: 0

      I was under the impression that LSA was patented by Bell Labs. Perhaps the patent has expired?

      It is surprising that the (Google) authors don't reference LSA. LSA was the first thing I thought of when I read the Slashdot description. Or is this another case of computer scientists reinventing the wheel without being aware of work done in other disciplines (often many years earlier)?

      I also wonder how well this works with Hungarian, which has no verb "to have", and English, which does. Or any other pair of languages coming from different language families, e.g., Indo-European (English or Czech) and Finno-Ugric (Hungarian or Finnish). There seem to be some similarity assumptions about vector spaces that are not necessarily true across language families (among other things).

    4. Re:Old idea, new implementation? by k.a.f. · · Score: 1

      When I was in grad school, studying linguistics, computational linguistics, and automatic speech recognition, I recall it mentioned more than once the idea of using latent semantic analysis and such to do this kind of translation. So am I correct in assuming that this hasn't been done well in the past, and Google finally made it work well because they have larger corpora of translated texts?

      You are utterly correct. The idea of machine translation by looking up each word in a dictionary and shuffling the results around was big in the 1950s, but hasn't been since then. It became all too clear very early that this isn't the way to produce texts that a native speaker would ever say (or even comprehend). The barrier to doing this kind of context-dependent analysis was that the hardware wasn't there for a long time, and later the huge parallel corpora that are needed to make it work were missing. (Just think how many millions of words a child hears until it learns to speak fluently and effortlessly!) Now that both are there, of course Google is among the most successful implementors.

  21. Old news by richwiss · · Score: 4, Informative

    This is old news, going back to 1975. Yawn. http://en.wikipedia.org/wiki/Vector_space_model

    1. Re:Old news by smallfries · · Score: 1

      That is something different: each document is a vector and each word is a dimension.

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
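
For contrast, here is a tiny sketch of the document-centric model described in the reply above, where each document becomes one vector and each vocabulary word is one dimension. The two documents and the name `document_vectors` are illustrative only, and plain term counts are assumed as the weights.

```python
def document_vectors(docs):
    """Return (vocabulary, one term-count vector per document)."""
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({w for words in tokenised for w in words})
    # One dimension per vocabulary word, one vector per document.
    return vocab, [[words.count(w) for w in vocab] for words in tokenised]


docs = ["the cat sat", "the dog sat on the cat"]
vocab, doc_vecs = document_vectors(docs)
print(vocab)
print(doc_vecs)
```

Here the rows describe documents, whereas the translation work under discussion assigns a vector to each word.
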
  22. StarTrek Universal Translator by Anonymous Coward · · Score: 0

    Sounds very similar to StarTrek's universal translator. You only have to say about a dozen words to map the language right?

    1. Re:StarTrek Universal Translator by SuricouRaven · · Score: 1

      Unless the plot calls for a breakdown of communication, in which case the language will be too 'complex' for the universal translator.

    2. Re:StarTrek Universal Translator by jedidiah · · Score: 1

      Damn you Darmok!

      --
      A Pirate and a Puritan look the same on a balance sheet.
  23. When and where matters by gmuslera · · Score: 1

    Meaning of words, and their translations, vary with time and location. Inferring meanings from texts from 20 years ago or another country, state or even region inside a state, even if the language is the "same", could be risky. There have been a lot of marketing problems thanks to this kind of bad translation.

  24. Like so many of these algorithms by holophrastic · · Score: 3, Interesting

    They do a great job of improving the precision of what used to be mediocre. And then, as a direct result, they not only make the errors worse, they make the errors undetectable.

    CAT: small, furry, pet.
    BIG CAT: big, furry, pet.

    Um. Both are orange. One's a tabby. One's a tiger.

    It's not good enough that your translation system has a 99% accuracy whereas the old one has a 90% accuracy. What matters is that the old one's 10% error rate sounded like an error (e.g. tiger becomes monster), whereas your new one's 1% passes the Turing test and can't be discerned by an intelligent listener (e.g. tiger becomes tabby).

    "My friend owns a monster." -- Your friend owns what? I don't think you meant a monster. -- "eh, you know, a very big dangerous jungle cat" -- oh, like a lion -- "not a lion, it has stripes" -- oh, a tiger.

    "My friend owns a tabby." -- Ok.

    1. Re:Like so many of these algorithms by flimflammer · · Score: 2

      "My friend owns a monster." -- You friend owns what? I don't think you meant a monster. -- "eh, you know, a very big dangerous jungle cat" -- oh, like a lion -- "not a lion, it has stripes" -- oh, a tiger.

      Do you frequently converse with machine translators that elaborate the meaning of their mistranslations? Would be interested in knowing which one is capable of that. See when I use them it's what-you-see-is-what-you-get and I have to pick at the original source text with a dictionary to learn monster actually means tiger. That they can nonchalantly narrow the meaning down for you in a Star Trek-esque computer conversation is leaps and bounds ahead of what I'm used to!

      Sarcasm aside for a moment, you're actually complaining that machine translators may eventually get so convincing that you might not even notice the errors anymore? Really? Sign me up for that scenario. Nothing should replace native translators anyway for precision work.

    2. Re:Like so many of these algorithms by holophrastic · · Score: 1

      That's almost my complaint. It's not that I won't notice the errors. It's that I won't notice the errors when they are spoken. I'll notice the errors when I get bitten by a tiger after reading a sign that says "beware of cat".

      It's important for miscommunication to be identified during the communication protocol.

    3. Re:Like so many of these algorithms by Anonymous Coward · · Score: 0

      Actually, I thought this was exactly how Google Translate taught itself languages. I remember how I tested it when it was new by having it translate some article on cnn.com into my native Swedish. Since the Swedish translation of "President Bush met with blah, blah, blah" was "King Bush met with..." it seemed clear to me that the algorithm had automagically formed some internal representation for "head of state" but had yet to form the "subclasses" King and President of that "superclass".

    4. Re:Like so many of these algorithms by Anonymous Coward · · Score: 0

      This looks like a place where Dr. Mueller's PSIMETRICA would be extremely useful. I believe the original work was used in relation to short term memory. Using the PSIMETRICA word dissimilarity calculation and applying it to a sentence, it could be used to predict what words are not applicable.

    5. Re:Like so many of these algorithms by Anonymous Coward · · Score: 0

      I'm not too worried. Just translate back into the source language and see if it still means the same thing. The translator might translate "tiger" into "monster", but it's unlikely to then translate "monster" back into "tiger".

      If you know 2 languages well, you can write in one language, translate to the language that you don't know and then translate that to the other language that you know. If the input and output mean the same thing, probably the translation in the middle was OK. If they don't mean the same thing, then rephrase the input until they do. If you can't get a good result like that, swap the two languages that you know and see if the translation software can do things better in that direction. You can still do the thing you'd have done if you only knew one language. Repeat for each sentence in the document that you are translating.

      If you know 3 languages, this will all work even better. You'll get 3 chances at getting a good translation instead of 2 and you'll get 3 different verifications of the translation. This is all even better if one of the 3 languages is completely unlike the other 2.
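
      The round-trip check described above can be sketched roughly as follows. The two lookup tables are hypothetical stand-ins for a real translation service (the "foreign" words are made up), so this only illustrates the shape of the test, not any actual translator.

```python
# Toy stand-ins for a real translator; every entry is a made-up assumption.
EN_TO_XX = {"tiger": "monstro", "cat": "kato"}
XX_TO_EN = {"monstro": "monster", "kato": "cat"}


def translate(word, table):
    """Look a word up in a toy table; unknown words pass through unchanged."""
    return table.get(word, word)


def round_trip_ok(word, forward, backward):
    """Return True if translating there and back recovers the original word."""
    return translate(translate(word, forward), backward) == word
```

      With these tables, "cat" survives the round trip while "tiger" comes back as "monster", which is the signal to rephrase the input and try again.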

    6. Re:Like so many of these algorithms by Anonymous Coward · · Score: 0

      I'm not too worried. Just translate back into the source language and see if it still means the same thing. The translator might translate "tiger" into "monster", but it's unlikely to then translate "monster" back into "tiger".

      Actually, I noticed that Google Translate makes exactly that kind of mistake. Take the word "nut" for example. It can mean this or this. In other languages that I speak, the two alternative meanings have totally different words. "One-to-two mappings" like that are tricky, since when machine translated back from an erroneous translation the result seems correct. I noticed this particular case in a hilarious situation when me and my GF were cooking and the recipe in Swedish didn't seem weird at all until it indeed stated that we should add the type of nuts depicted in the first picture. We got ourselves a laugh, and since the machine translation was otherwise so good I got curious enough to test whether it was made with Google. I then discovered that Google indeed translates it pretty well back and forth, but in Swedish it had that funny mistake every time. Machine translation software would thus have to work out whether the instructions are for cooking or for assembling furniture to get it right...
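
      A minimal sketch of why the round trip cannot catch this. The Swedish words "nöt" (the food) and "mutter" (the hardware) are real, but the tables and the context labels are hypothetical and only illustrate the one-to-two mapping.

```python
# Hypothetical toy tables. The forward table always picks the hardware sense.
EN_TO_SV_BAD = {"nut": "mutter"}
SV_TO_EN = {"nöt": "nut", "mutter": "nut"}

# Assumed "ground truth" per context, used only to score the toy example.
EXPECTED = {"cooking": "nöt", "furniture": "mutter"}


def round_trip(word, forward, backward):
    """Translate a word forward and then back again."""
    return backward[forward[word]]


def correct_in_context(word, forward, context):
    """Check the forward translation against the sense the context needs."""
    return forward[word] == EXPECTED[context]
```

      Here the round trip returns "nut" unchanged, yet the forward translation is wrong for a recipe, so the back-translation check passes on a bad translation.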

  25. Linearithmic by tepples · · Score: 1

    But there's a space between linear and quadratic called linearithmic, or O(n log n). Merge sort uses nested loops and lies in this space.
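
    For illustration, here is a bottom-up merge sort written with plain nested loops. The outer loop runs about log2(n) times and each pass does linear work, which is where the n log n comes from. The function name is just a label chosen for this sketch.

```python
def merge_sort(items):
    """Bottom-up merge sort: nested loops, O(n log n) comparisons."""
    data = list(items)
    n = len(data)
    width = 1
    while width < n:                            # about log2(n) passes
        merged = []
        for start in range(0, n, 2 * width):    # each pass touches every item once
            left = data[start:start + width]
            right = data[start + width:start + 2 * width]
            i = j = 0
            while i < len(left) and j < len(right):
                if left[i] <= right[j]:
                    merged.append(left[i])
                    i += 1
                else:
                    merged.append(right[j])
                    j += 1
            merged.extend(left[i:])             # at most one of these is non-empty
            merged.extend(right[j:])
        data = merged
        width *= 2
    return data
```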

  26. Darmok and Jihad at Viagra by tepples · · Score: 1

    The allusion-heavy Tamarian language has real-world analogs, such as Tropese and the tendency for users of sites closely linked to 4chan to talk in memes.

    1. Re:Darmok and Jihad at Viagra by Anonymous Coward · · Score: 0

      Za Warudo, when the Snacks is back.

  27. Still needs dictionaries by raju1kabir · · Score: 2

    Anyone who regularly uses Google Translate has seen the problems that come with this approach.

    It "translates" analogous terms in ways that make no sense. Translate "Amsterdam" from Dutch to English and it often gives you "London". Same with kilometres / miles, and other things that significantly change the meaning of the text.

    Without some hand-crafted guidance, the outcome can be much less useful than the more rough-sounding word-by-word machine translations from days of yore.

    --
    "Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS
    1. Re:Still needs dictionaries by sourcerror · · Score: 1

      On the other hand it's much better at translating idioms or expressions where the component words have a lot of different meanings.

    2. Re:Still needs dictionaries by Anonymous Coward · · Score: 0

      Google Translate isn't using "this approach" at all.

      Translate is built on parallel texts, hence the Amsterdam -> London thing. A typical source of parallel texts is corporate documents. The Dutch documents say mostly the same things as the English documents, but the Dutch headquarters is in Amsterdam, while the English one is in London, so the parallel text analysis concludes "London" is English for "Amsterdam".

      This system isn't about parallel texts, it will introduce a whole new type of errors.
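
      A deliberately crude sketch of how parallel-text statistics can produce the Amsterdam -> London effect. The three sentence pairs are invented, and the Dice co-occurrence score is a simple stand-in for real alignment models, not a description of what Google Translate actually runs.

```python
from collections import Counter
from itertools import product

# Invented "corporate" sentence pairs (Dutch, English) that say nearly the same thing.
PAIRS = [
    ("ons hoofdkantoor is in Amsterdam", "our headquarters is in London"),
    ("bezoek ons in Amsterdam", "visit us in London"),
    ("ons kantoor is in gebruik", "our office is in use"),
]


def best_alignment(pairs, source_word):
    """Pick the target word whose sentence-level co-occurrence (Dice) is highest."""
    src_count, tgt_count, cooc = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        cooc.update(product(src_words, tgt_words))

    def dice(target_word):
        shared = cooc[(source_word, target_word)]
        return 2 * shared / (src_count[source_word] + tgt_count[target_word])

    return max(tgt_count, key=dice)
```

      Because "Amsterdam" and "London" always appear in matching sentences here, the statistics happily pair them up.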

  28. Humangrunt Will Now Be Understood by Anonymous Coward · · Score: 0

    It's already been partially decoded.

    Some of the calls are individually identifiable, unique to the individual. The rest doesn't make much sense. A human pod is essentially land bound a lot of the time but spread apart with no inherent electromagnetic navigation organs and little of import to say. Still, they babble on to track each other after splitting up or silently stalk each other online.

  29. Synonyms by manu0601 · · Score: 1

    I wonder how they handle synonyms, which may be much more prevalent in one language than in another.

    If the destination language is poorer in synonyms than the source language, this is straightforward, and the automatic translation will just miss subtle points that cannot be translated without a periphrasis. In the opposite case, moving from a synonym-poor language to a synonym-rich language, the computer needs to choose the right word, and doing so requires some understanding of the context.

    The problem also extends beyond synonyms to sentence structure. Take the English sentence "We will give territories". In French it could become "Nous cèderons des territoires" (We will give some territories) or "Nous cèderons les territoires" (We will give the territories). Which should be chosen? It depends on the context, something the computer may have a hard time grasping.

    1. Re:Synonyms by Panoptes · · Score: 2

      Synonyms are only the tip of the iceberg: there are so many other problem areas. Collocations (words that 'go together'): we can say a 'tall boy', but not a 'high boy'; 'a large beer', but not 'a big beer'. Connotations (attitudes, feelings and emotions that a word acquires): compare 'a slim girl' with 'a skinny girl'. Idioms: 'hot potato' and 'red herring' cannot be translated directly into any other language. Add irony and sarcasm to the mix, class and regional usage, dialects, diglossia (for example, demotic and classical Arabic), puns and plays on words - the list goes on. Machine translation is a chimera.

    2. Re:Synonyms by manu0601 · · Score: 2

      I understand that collocations are addressed by their model: they study texts to discover that 'boy' may be preceded by 'tall' but not by 'high', and that in French, 'garçon' may be preceded by 'grand' but not 'haut'. That enables them to translate without a hitch.

      But even handling adjectives comes with traps. Adjectives in French may appear before or after a noun. You may say 'un grand garçon' or 'un garçon grand'; the meaning is the same most of the time. But there are exceptions! 'un type pauvre' is a poor guy, 'un pauvre type' is a mediocre person. Even 'grand garçon' vs 'garçon grand' may carry a subtle difference, as a father will tell his son he is 'un grand garçon' now (which means he is not a child anymore), but he will probably not tell him he is now 'un garçon grand' (which just means he is tall). I guess this can be handled by their statistical model, but at some point they will need to add some logic to handle it. I guess it falls in the idiom category.

      Puns and irony are probably the most difficult part of the game. Even human translators have a hard time with them.
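
      The collocation idea from the first paragraph can be sketched with plain bigram counts. The one-line corpus is invented, and real models use far richer statistics, so treat this as a toy under those assumptions.

```python
from collections import Counter

# Invented mini-corpus; a real system would count over millions of sentences.
CORPUS = "a tall boy met a tall girl near a high wall and a high fence".split()


def bigram_counts(tokens):
    """Count how often each (word, next_word) pair occurs."""
    return Counter(zip(tokens, tokens[1:]))


def prefer(adjectives, noun, counts):
    """Choose the adjective seen most often directly before the noun."""
    return max(adjectives, key=lambda adj: counts[(adj, noun)])
```

      Given these counts, the chooser picks 'tall' for 'boy' and 'high' for 'wall' without any notion of what either word means.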

    3. Re:Synonyms by Anonymous Coward · · Score: 0

      But that seems only relevant for grammatical correctness and flow of language.
      For actual understanding 'high boy' works just as well as 'tall boy' unless you hang around drug addicts much. 'big beer' and 'large beer' could be used interchangeably without causing confusion.
      This means that if you have an unknown language where no dictionary is available you can throw it at this translation and get a reasonably good translation. Apart from that, this method probably still needs a starting point, but perhaps it could be used to automatically expand a dictionary given a very small dictionary and a lot of sample texts.

      Would be interesting to throw some noise from SETI at it and see what happens.

  30. 20 questions? by Anonymous Coward · · Score: 0

    So is it a game of 20 questions, with each answer projecting out one or more dimensions?

  31. What I want from a translator by snadrus · · Score: 1

    1. Rough word-by-word is the beginning
    2. Sentence structure reorganization
    3. Idiom recognition.
    4. Connotation, Tone, Irony
    5. Generation / Area / Nature: How a native listener can determine details about the speaker.

    The result will always be annotated-looking with warnings for plays-on-words, and will always be longer with maximum detail extraction from the source language.
    I'm sure there's more to do after these items are done.

    --
    Science & open-source build trust from peer review. Learn systems you can trust.
  32. nice idea, but by Anonymous Coward · · Score: 0

    babelfish will be "created" through crossbreeding and genetic experimentation long before the language barrier on this planet is gone...

  33. Another way of understanding language translation by beachdog · · Score: 1

    "Vector spaces" is the heart of the Google proposal. Previous posters have disassembled the weaknesses pretty well.

    The thing a "vector spaces" analysis needs is specific vector mapping based on the sounds of speech, the rhythms of a language, the breathing of the speaker and the physical proximity of parts of the brain associated with hearing and parts of the brain associated with speech.

    Multiple languages exist because the growing infant's brain organizes the sounds it hears by passing the neural sensations through many layers of pattern forming and recognition processes. Multiple languages and the ambiguities in languages mean the language learning process within a developing child has some features that are quite consistent, like saying "ma ma". The rest of language acquisition spreads out in the physical vector space of the topology of the brain. Italian has been noted as wonderful for singing, Spanish as good at expressing emotion. Perhaps these languages follow slightly different paths in the brain.

    An idea I picked up from digital ham radio tutorials is quadrature phase demodulation. It extracts data from a carrier signal, it looks simple, it looks like you could do it with nerve cells and it associates nicely with known large scale brain electrical activity.
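
    For anyone curious how simple it looks, here is a rough sketch of quadrature (I/Q) demodulation: multiply the signal by a cosine and a negative sine at the carrier frequency, average, and read off amplitude and phase. The function names and the test signal are made up for this example, and plain averaging is a crude low-pass filter that assumes the window covers a whole number of carrier cycles.

```python
import math


def make_carrier(amplitude, phase, carrier_hz, sample_rate_hz, count):
    """Generate amplitude * cos(w*k + phase) for k = 0 .. count-1."""
    w = 2 * math.pi * carrier_hz / sample_rate_hz
    return [amplitude * math.cos(w * k + phase) for k in range(count)]


def iq_demodulate(samples, carrier_hz, sample_rate_hz):
    """Recover (amplitude, phase) by mixing with cos and -sin, then averaging."""
    n = len(samples)
    w = 2 * math.pi * carrier_hz / sample_rate_hz
    # In-phase branch averages to amplitude * cos(phase).
    i = 2 / n * sum(s * math.cos(w * k) for k, s in enumerate(samples))
    # Quadrature branch averages to amplitude * sin(phase).
    q = 2 / n * sum(-s * math.sin(w * k) for k, s in enumerate(samples))
    return math.hypot(i, q), math.atan2(q, i)
```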

    I work with severely disabled kids. Language acquisition, or finding workarounds for missing or weak parts of the language pathway, is an interesting challenge. A fellow who stone-facedly ignored my spoken words laughed at me and smiled when I began signing to him in pidgin, half made-up American Sign Language.

  34. Star Trek Universal Translator anyone? by Anonymous Coward · · Score: 1

    Looks like this could be the beginning of a Universal translation scheme. Next all we need is to add voice recognition to this and Star Trek tech comes alive once again!

  35. Computational Linguistics by Anonymous Coward · · Score: 1

    In recent times there are regularly articles in technology magazines about topics in computational linguistics (CL) that are blatantly ignorant of the current research. This is just another example.

    Dictionaries have not been used for applied machine translation for 10 years now. Statistical machine translation (SMT) and the techniques described here were not developed by Google. In fact the basic idea of SMT is over 25 years old and distributional semantics is over 50 years old. Phrase tables for SMT are nothing new and have always contained these properties of distributional semantics (DS), which are tightly connected to vector space models (VSM), so merging VSM and SMT is just the next logical step.

    If you find that topic interesting search for "statistical machine translation" "phrase tables" "distributional semantics". Have fun the next 2 years reading all the stuff.
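
    As a taste of the vector-space side, here is a toy sketch of the mapping idea from the summary: fit a linear map between two word-vector spaces from a few seed pairs, then translate a new word by nearest neighbour. Everything here is an assumption made for illustration: the 2-D vectors are invented, the "Spanish" space is simply the "English" one rotated by 90 degrees, and plain gradient descent on squared error stands in for whatever training a real system uses.

```python
import math

# Invented 2-D "embeddings"; real models use hundreds of dimensions.
EN = {"cat": (1.0, 0.2), "dog": (0.9, 0.4), "house": (0.1, 1.0), "tiger": (0.95, 0.3)}
# Toy target space: the English vectors rotated by 90 degrees.
ES = {"gato": (-0.2, 1.0), "perro": (-0.4, 0.9), "casa": (-1.0, 0.1), "tigre": (-0.3, 0.95)}
# Seed dictionary; "tiger" is deliberately left out.
SEED = [("cat", "gato"), ("dog", "perro"), ("house", "casa")]


def apply(w, x):
    """Multiply the 2x2 matrix w by the 2-vector x."""
    return (w[0][0] * x[0] + w[0][1] * x[1], w[1][0] * x[0] + w[1][1] * x[1])


def fit_linear_map(pairs, steps=500, lr=0.1):
    """Minimise sum ||W x - z||^2 over the seed pairs by batch gradient descent."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        grad = [[0.0, 0.0], [0.0, 0.0]]
        for src, tgt in pairs:
            x, z = EN[src], ES[tgt]
            y = apply(w, x)
            for r in range(2):
                for c in range(2):
                    grad[r][c] += 2 * (y[r] - z[r]) * x[c]
        for r in range(2):
            for c in range(2):
                w[r][c] -= lr * grad[r][c]
    return w


def cosine(a, b):
    """Cosine similarity of two 2-vectors."""
    return (a[0] * b[0] + a[1] * b[1]) / (math.hypot(*a) * math.hypot(*b))


def translate(word, w):
    """Map the source vector into the target space and pick the closest word."""
    mapped = apply(w, EN[word])
    return max(ES, key=lambda t: cosine(mapped, ES[t]))
```

    In this toy setup the fitted matrix ends up close to the rotation, so "tiger" lands on "tigre" even though that pair was never in the seed list. Note how close "gato" and "perro" are to the winner, which hints at the tiger-becomes-tabby kind of error discussed elsewhere in the thread.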

  36. English(Chinese(X)) ==... by Anonymous Coward · · Score: 0

    English(Chinese(Input)) became: "I strained my friend's cat loves the taste of sausage meat." Have a guess as to the original sentence.

  37. Re:Sounds good, but we need a robust plug by plover · · Score: 2

    With this story being about automated translations getting it very wrong, there was a 95% chance people would have thought you were just making a joke about Apple doing language translations!

    If you had posted a follow-up like "That's what Apple translate gets when I wrote 'Orchards of apple trees have fans to spray microscopic poison dust on all trees'", it would have been perfectly believable.

    --
    John
  38. Devil is in the details, by Antony+T+Curtis · · Score: 1

    The amusing side effect of the effectiveness of statistical machine translation is that more and more people would use machine translation instead of employing fluent humans to do the translation and that is where the fun begins: The machine-translated phrases will reenter the corpus as seed data and as the percentage of human-origin data in the corpus reduces, so does the quality of the translations as subtle errors are magnified over time.

    There should be some kind of "fingerprint" added to the machine translation which can be trivially detectable so that such work doesn't reenter the corpus. Of course, the quantity of human text will decline but that will degrade the quality at a slower pace.

    Fun to think about.

    --
    No sig. Move along - nothing to see here.
  39. Nothing new by Anonymous Coward · · Score: 0

    Stanford actually has a peer-reviewed paper with the same model, but better, and it actually shows that it works on real machine translation

    http://ai.stanford.edu/~wzou/emnlp2013_ZouSocherCerManning.pdf

    from

    https://plus.google.com/u/0/communities/107785538899595981479

  40. This may lead to some comical results I foresee by MXB2001 · · Score: 0

    Pet the small furry pussy. Hmmm that could have some very different meanings...

    --
    01/01/01