Slashdot Mirror


Multinational Machine Translation?

TheLocustNMI asks: "I'm currently employed as a systems analyst for a mid-sized consulting firm, and we have been charged with the task of finding a multilingual solution for an entire enterprise system. After much study, the question remains: Is there an effective multi-lingual machine translation system out there? Could one be built? Would it be a massive distributed knowledge system, akin to Everything2? Could it be free to the net public? Ideas?"

2 of 9 comments

  1. The problem with translation systems... by jfrisby · · Score: 2

    The problem with translation systems is grammatical ambiguity. Computers lack the massive "database" of context which we humans have for resolving ambiguities.

    How do you translate "plane"? It could be "plane" as in mathematics, or as in flying...
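
    A toy sketch of that ambiguity, assuming a tiny hand-written table of cue words (the `SENSES` table and the `disambiguate` function are made up for illustration, not taken from any real system). The sense is picked by counting cue words found in the surrounding sentence; with no cues present, the word cannot be resolved.

```python
# Toy word-sense disambiguation by context overlap.
# SENSES is a hypothetical, hand-written table, not a real lexical resource.
SENSES = {
    "plane": {
        "geometry": {"geometry", "surface", "flat", "axis", "point", "line"},
        "aircraft": {"fly", "flight", "pilot", "airport", "wing", "landed"},
    }
}


def disambiguate(word, sentence):
    """Return the sense whose cue words overlap the sentence most.

    Returns None when no cue word is present or when two senses tie,
    i.e. when the sentence alone gives too little context to decide.
    """
    tokens = {t.strip(".,!?\"'").lower() for t in sentence.split()}
    scores = {sense: len(cues & tokens) for sense, cues in SENSES[word].items()}
    best = max(scores.values())
    winners = [sense for sense, n in scores.items() if n == best]
    if best == 0 or len(winners) > 1:
        return None  # not enough context to decide
    return winners[0]


print(disambiguate("plane", "The plane landed at the airport."))
print(disambiguate("plane", "A line and a point define the plane."))
print(disambiguate("plane", "Look at the plane."))
```

    The last call has no cue words at all, which is the point: the table, however large, only helps when the surrounding text carries the context.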

    That's just one simple type of ambiguity. When you have a fluid grammar like English -- for which a BNF grammar cannot be made, regardless of how many tokens of lookahead you have -- the lack of context makes translation all but futile.

    Assembling such a database is not a theoretical impossibility. It is, however, a *practical* impossibility for the present and the foreseeable future.

    -JF

    --
    MrJoy.com -- Because coding is FUN!
  2. Re:Yes, But! by jfrisby · · Score: 2

    Using user-submitted data is a double-edged sword. On the one hand, it gives you a more realistic way to gather the necessary information; on the other, you have quality control issues. Bad data will necessarily lead to bad translations, and just because you speak a language fluently or natively does not mean that you can accurately relate the rules of the language or describe contextual/circumstantial disambiguating information.

    Such an engine would *not* be as good as current systems at the start because current systems are designed with their own weaknesses in mind. To get translation comparable to what current systems can do you'd either have to *use* a current system until enough data was provided, or you would have to start with a VERY large dataset.

    But then there is an insurmountable theoretical problem: the smaller the corpus being translated, the less accuracy you will be able to get, no matter how large your database. A single ambiguous sentence by itself often cannot be disambiguated even by a human. If it sits in the context of a paragraph or a page, then more context is available.

    But the extraction and interpretation of contextual information is a task which by itself remains almost wholly unaddressed.

    But let us for a moment assume that you have a system that uses a large enough database (I'd guess we're talking TBs here, but that's pure speculation...) and is able to gather contextual information and disambiguate sentences. So now you have a data structure that represents all the ideas you wish to convey in your text in a language-neutral fashion.

    Then you have the problem of going the other direction. Now, you need a similar database for your target language, and your databases both must contain information about language-specific idioms and customs.

    As my Japanese professor said, "You do not *translate* English to Japanese (or vice versa), you *restate* your idea."

    Even if your program could understand cultural context and understand the weight certain grammatical constructs/words/phrases are given in different contexts, how does the software "restate" your idea in a culturally meaningful manner?

    Consider the Japanese phrase: "Nodo ga kawakimashita." Translated literally, it is "My throat has become dry." But such a translation is clumsy at best. A better way of stating it is "I am thirsty."
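
    A minimal sketch of the difference, assuming two hypothetical hand-written tables (`WORD_GLOSS` and `IDIOMS` are invented for illustration): a word-for-word gloss gives the clumsy rendering, while a phrase-level lookup gives the restated one.

```python
# Hypothetical word-level glosses. The particle "ga" marks the subject and
# has no English word of its own, so it is glossed as an empty string.
WORD_GLOSS = {"nodo": "throat", "ga": "", "kawakimashita": "has become dry"}

# Hypothetical phrase-level table mapping whole expressions to a restatement.
IDIOMS = {"nodo ga kawakimashita": "I am thirsty."}


def normalize(text):
    """Lowercase the text and drop surrounding whitespace and a final period."""
    return text.strip().rstrip(".").lower()


def literal(text):
    """Gloss word by word; unknown words are kept in [brackets]."""
    glosses = [WORD_GLOSS.get(w, f"[{w}]") for w in normalize(text).split()]
    return " ".join(g for g in glosses if g)


def restate(text):
    """Prefer a whole-phrase restatement; fall back to the literal gloss."""
    return IDIOMS.get(normalize(text)) or literal(text)


print(literal("Nodo ga kawakimashita."))
print(restate("Nodo ga kawakimashita."))
```

    The phrase table only works for expressions somebody has already entered, which is exactly the data-gathering problem described above.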

    So you would need, for each language, a HUGE, extremely detailed database of grammatical, lexical, contextual, idiomatic, and cultural information that is interrelated at a higher level than just words or sentences. It would have to have culturally significant weighting of concepts, ideas, grammatical structures, and words. On top of that, you would need a very complex, detailed mapping between the databases of each pair of languages.
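
    To make the shape of such a system concrete, here is a rough sketch under heavy assumptions: the `Concept` type and the `ANALYSIS` and `GENERATION` tables are hypothetical stand-ins, each holding a single entry, for the per-language databases described above. Text is analyzed into a language-neutral concept and then generated in the target language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A language-neutral idea; the fields are illustrative only."""
    predicate: str
    experiencer: str


THIRST = Concept("THIRST", "SPEAKER")

# Hypothetical per-language tables: source text -> concept, concept -> text.
ANALYSIS = {
    "en": {"i am thirsty": THIRST},
    "ja": {"nodo ga kawakimashita": THIRST},
}
GENERATION = {
    "en": {THIRST: "I am thirsty."},
    "ja": {THIRST: "Nodo ga kawakimashita."},
}


def translate(text, source, target):
    """Analyze text into a Concept, then restate it in the target language."""
    key = text.strip().rstrip(".").lower()
    concept = ANALYSIS[source].get(key)
    if concept is None:
        raise LookupError(f"no analysis for {text!r} in {source!r}")
    restated = GENERATION[target].get(concept)
    if restated is None:
        raise LookupError(f"cannot express {concept} in {target!r}")
    return restated


print(translate("Nodo ga kawakimashita.", "ja", "en"))
print(translate("I am thirsty.", "en", "ja"))
```

    Every sentence outside the two tables raises `LookupError`, which is a fair picture of how much work the real databases would have to do.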

    Just as it is theoretically possible to move three tons of sand from New York to California using only a unicycle and a pair of tweezers, such a project is possible. However, the complexity and resource constraints make it all but impossible, even for thousands of go-get-'em open source coding wizards and hundreds of thousands of community members.

    -JF

    --
    MrJoy.com -- Because coding is FUN!