Slashdot Mirror


Automatic Translation Without Dictionaries

New submitter physicsphairy writes "Tomas Mikolov and others at Google have developed a simple means of translating between languages using a large corpus of sample texts. Rather than being defined by humans, words are characterized based on their relation to other words. For example, in any language, a word like 'cat' will have a particular relationship to words like 'small,' 'furry,' 'pet,' etc. The set of relationships of words in a language can be described as a vector space, and words from one language can be translated into words in another language by identifying the mapping between their two vector spaces. The technique works even for very dissimilar languages, and is presently being used to refine and identify mistakes in existing translation dictionaries."
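
A minimal sketch of the mapping step described in the summary, under loud assumptions: the word vectors here are random toy data (not trained on any corpus), the six word pairs are made up for illustration, and the mapping is fit by ordinary least squares on a small seed list of known pairs. It only illustrates the idea of learning a linear map between two vector spaces and translating by nearest neighbor; it is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for word vectors learned separately on each language's corpus.
src_words = ["cat", "dog", "house", "water", "tree", "bird"]
tgt_words = ["gato", "perro", "casa", "agua", "arbol", "pajaro"]
dim = 4
src_vecs = rng.normal(size=(len(src_words), dim))

# Assumption: the target space is a linear image of the source space plus tiny noise.
true_map = rng.normal(size=(dim, dim))
tgt_vecs = src_vecs @ true_map + 1e-3 * rng.normal(size=(len(src_words), dim))

# Fit W on a seed list of known pairs (the first five); "bird" is held out.
seed = slice(0, 5)
W, *_ = np.linalg.lstsq(src_vecs[seed], tgt_vecs[seed], rcond=None)

def translate(word):
    """Project a source word into the target space and return the nearest target word."""
    projected = src_vecs[src_words.index(word)] @ W
    sims = (tgt_vecs @ projected) / (
        np.linalg.norm(tgt_vecs, axis=1) * np.linalg.norm(projected)
    )
    return tgt_words[int(np.argmax(sims))]

print(translate("bird"))
```

The held-out word is never shown to the fit; if the two spaces really are related by a roughly linear map, its projection should land nearest its counterpart. Note that the seed pairs are themselves a small dictionary, which is the point several commenters raise below.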

17 of 115 comments

  1. My hovercraft is full of eels! by Anonymous Coward · · Score: 5, Funny

    My nipples explode with delight!

  2. Re:how would by Anonymous Coward · · Score: 5, Funny

    how would "tight pussy" be translated?

    "Tight pussy" would be translated automatically, and without dictionaries. This is answered right in the headline.

  3. Darmok and Jalad at Tanagra by Vanders · · Score: 4, Interesting

    Finally, the team point out that since the technique makes few assumptions about the languages themselves, it can be used on argots that are entirely unrelated.

    Once again, Star Trek is ahead of the curve.

  4. Hofstadter? Isn't this AI, not translation? by Etcetera · · Score: 5, Interesting

    Reminds me a lot of the Fluid Concepts and Creative Analogies work that Hofstadter led back in the day.

    I don't see this directly working for translation into non-lexicographically swappable languages (e.g., English -> Japanese) very well, because even if you have the idea space mapped out, you'd still have to build up the proper grammar, and you'll need rules for that.

    That being said.... Holy cow, you have the idea space mapped out! That's a big chunk of Natural Language Processing and an important step in AI development. ... Understanding a sentence emergently in terms of fuzzy concepts that are an internal and internally created symbol of what's "going on", not just using a dictionary and CYC-like rules to figure it out, seems like a useful building block, but maybe I'm wrong.

    Very cool stuff. Makes me want to go back and finish that CS degree after all.

    1. Re:Hofstadter? Isn't this AI, not translation? by phantomfive · · Score: 4, Interesting

      I don't see this directly working for translation into non-lexicographically swappable languages (e.g., English -> Japanese) very well, because even if you have the idea space mapped out, you'd still have to build up the proper grammar, and you'll need rules for that.

      According to the paper, this translation technique is only for translating words and short phrases. But it seems to work well for languages as far apart as English and Vietnamese.

      --
      "First they came for the slanderers and i said nothing."
  5. Re:Sounds good, but we need a robust plug by Finallyjoined!!! · · Score: 4, Funny

    it gets full of lint

    What's it got in its pocketses?

    --
    If I had an Ass, I'd call it Fanny Bottom, then I could slap my Ass; Fanny Bottom, on the Arse.
  6. Re:Sounds good, but we need a robust plug by icebike · · Score: 3, Insightful

    Firefox had nothing to do with it.
    It was PEBCAK, pure and simple.

    --
    Sig Battery depleted. Reverting to safe mode.
  7. Dolphinese Will Now Be Understood by MacroSlopp · · Score: 4, Funny

    With this technology we should be able to understand Dolphin-talk.
    It should also allow us to detect future ape rebellions before they happen.

  8. Re:how would by Jane+Q.+Public · · Score: 4, Funny

    "tight pussy" be translated?

    "The cat has drunk a saucer of wine."

  9. Old idea, new implementation? by Theovon · · Score: 5, Interesting

    When I was in grad school, studying linguistics, computational linguistics, and automatic speech recognition, I recall more than one mention of the idea of using latent semantic analysis and such to do this kind of translation. So am I correct in assuming that this hasn't been done well in the past, and Google finally made it work well because they have larger corpora of translated texts?

  10. Old news by richwiss · · Score: 4, Informative

    This is old news, going back to 1975. Yawn. http://en.wikipedia.org/wiki/Vector_space_model

  11. Re:the spirit is willing but the flesh is weak by icebike · · Score: 3, Interesting

    Yes, the pretty vectors (nothing but lists of words) still have to be assembled by humans for the most part. Maybe not EVERY association, but enough of them that you can build relationships and associations indirectly, and achieve a roundabout translation, even if you end up having to go through 2 or 3 related languages to get there.

    After a few words of context are translated you can, perhaps, deduce the rest. But the idea that you can do so without a dictionary is ridiculous. And putting your dictionary into digital form and calling it a vector doesn't change the fact that you still have a dictionary associating an English word with a French word and a Mandarin word.

    --
    Sig Battery depleted. Reverting to safe mode.
  12. Re:Cat by blue+trane · · Score: 3, Insightful

    jazz musician

  13. Re:Summary wrong (again) by hey! · · Score: 4, Insightful

    Simply because you embed your dictionary in something you choose to call a vector doesn't make it any less of a dictionary.

    True, but calling a dictionary a vector space doesn't make it so. For example, how "close" are the definitions of "happiness" and "joy"? In a dictionary, the only concept of "closeness" is the lexical ordering of the word itself, and in that sense "happiness" and "joy" are quite far apart (as far apart as words beginning with h-a are from words beginning with j-o in the dictionary). But in some kind of adjacency matrix that shows how often these words appear in some relation to other words, they might be quite close in vector space; "guilt" and "shame" might likewise be closer to each other than either is to "happiness", and each of the four words ("happiness", "joy", "guilt", "shame") would be closer to any other of those words than they would be to "crankshaft"; and probably closer to "crankshaft" (a noun) than they'd be to "chewy" (an adjective).

    Anyhow, if you'd read the paper, at least as far as the abstract, you'd see that this is about *generating* likely dictionary entries for unknown words using analysis of some corpus of texts.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  14. Re:And what's the algorithm complexity? by SuricouRaven · · Score: 4, Funny

    Statistical translation is always going to have issues like that, but it can perhaps reach the 'good enough' point to hold a conversation with.

    I can easily see it getting confused by formal vs informal use. If it goes on association, eventually it's going to get 'lawyer' and 'extortionist' confused.

  15. Re:And what's the algorithm complexity? by Anonymous Coward · · Score: 4, Funny

    I too get lawyer and extortionist confused.

  16. Like so many of these algorithms by holophrastic · · Score: 3, Interesting

    They do a great job of improving the precision of what used to be mediocre. And then, as a direct result, they not only make the errors worse, they make the errors undetectable.

    CAT: small, furry, pet.
    BIG CAT: big, furry, pet.

    Um. Both are orange. One's a tabby. One's a tiger.

    It's not good enough that your translation system has a 99% accuracy whereas the old one has a 90% accuracy. What matters is that the old one's 10% error rate sounded like an error (e.g. tiger becomes monster), whereas your new one's 1% passes the Turing test and can't be discerned by an intelligent listener (e.g. tiger becomes tabby).

    "My friend owns a monster." -- Your friend owns what? I don't think you meant a monster. -- "eh, you know, a very big dangerous jungle cat" -- oh, like a lion -- "not a lion, it has stripes" -- oh, a tiger.

    "My friend owns a tabby." -- Ok.