Slashdot Mirror


Romancing The Rosetta Stone

Roland Piquepaille writes "Not only does this news release from the University of Southern California have a fantastic title, it also has great content. This story is about one of their scientists, Franz Josef Och, whose software ranks very high among translation systems. "Give me enough parallel data, and you can have a translation system for any two languages in a matter of hours," said Dr. Och, paraphrasing Archimedes. His approach relies on two concepts: gathering huge amounts of data and applying statistical models to that data. It completely ignores grammar rules and dictionaries. "Och's method uses matched bilingual texts, the computer-encoded equivalents of the famous Rosetta Stone inscriptions. Or, rather, gigabytes and gigabytes of Rosetta Stones." Read my summary for more details."
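
For readers who want a feel for what "matched bilingual texts plus statistics, no grammar rules or dictionaries" can mean in practice, here is a minimal sketch. It is not Och's system, whose internals the summary does not describe. It is the textbook IBM Model 1 word-translation estimator trained with expectation-maximization, and the three-sentence German/English corpus, the function names, and the iteration count are all made up for illustration.

```python
from collections import defaultdict

# Toy parallel corpus (hypothetical): each pair is (source words, target words).
PAIRS = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]


def train_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with IBM Model 1 EM."""
    src_vocab = {sw for src, _ in pairs for sw in src}
    tgt_vocab = {tw for _, tgt in pairs for tw in tgt}

    # Start from a uniform table: no dictionary, no grammar, no prior knowledge.
    t = {(tw, sw): 1.0 / len(tgt_vocab) for sw in src_vocab for tw in tgt_vocab}

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of links per source word

        # E-step: split each target word's credit among the source words
        # of the same sentence pair, in proportion to the current table.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac

        # M-step: renormalize the expected counts into probabilities.
        for tw, sw in t:
            t[(tw, sw)] = count[(tw, sw)] / total[sw]

    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


table = train_model1(PAIRS)
for word in ["das", "Haus", "Buch", "ein"]:
    print(word, "->", best_translation(table, word))
```

The only signal the model gets is which words show up together in matched sentences: "das" co-occurs with "the" twice, so "the" is gradually explained by "das", which in turn leaves "house" to be explained by "Haus". Real systems work with phrases and vastly more data, but the principle of learning from co-occurrence in parallel text is the same.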

15 of 486 comments

  1. Re:oh oh... by Anonymous Coward · · Score: 4, Interesting

    This is exactly NOT a universal translator as it uses matched bilingual texts. You need an already translated text for his system to work.

  2. The vodka is strong but the meat is rotten by zptdooda · · Score: 5, Interesting

    That's an example from a few years back of an attempt to translate "the spirit is willing but the flesh is weak" from English to Russian and back to English using a different translator.

    Can anyone try this on the new (or some other recent) algorithm?

    BTW here's Doc Och's most recent website:

    Franz Josef Och

    --
    Esteem isn't a zero sum game
    1. Re:The vodka is strong but the meat is rotten by JJ · · Score: 4, Interesting

      This actually is a myth. That particular text and translation was cited as an anecdote in a 1964 report. I did a master's thesis on MT at the University of Chicago and my advisor (once a major figure in MT) refused to approve my thesis until I got that statement correct.

      --
      So long and thanks for all the fish . . . !!!
  3. Finally, the correct approach by tuxlove · · Score: 4, Interesting

    I believe that using a statistical approach like this is a step in the right direction. Manually building sets of rules, dictionaries, etc., is a waste of time and hard to do. And manually built systems become stale as languages evolve, unless a lot of continuing work is done.

    For me the holy grail is when I can converse with a computer meaningfully. I believe a similar approach will be required for the computer to "understand" language, and to be able to formulate a coherent and appropriate response.

  4. Integration by slusich · · Score: 3, Interesting

    Sounds like a brilliant idea. Hopefully this is something that could eventually be compacted enough to fit into consumer electronics. It would be great to be able to watch TV from every country without any language barrier!

    1. Re:Integration by ahfoo · · Score: 3, Interesting

      Not to sound arrogant, but I find actually learning another language by watching foreign TV with subtitles in the original language to be even more interesting than watching the dubbed or English subtitled version. It involves commitment to get to the point where you can understand the basics, but there are rewards to making a commitment to learn something new.
      I like the idea of translating sentence by sentence as opposed to grammatically and word for word. I'm sure this guy is right that at some point this will produce reasonably accurate translations in many cases, but multiple languages are one of our greatest treasures.
      I have read that the single most important factor in preventing senile dementia is the difference between those who continue to create novel memories throughout their lives and those who stick to what they have already learned. Learning multiple languages is a wonderful thing and once you get well into it, it is a lot of fun. It certainly increases your options for punning and rhyming and you end up with lots of aliases.

  5. Re:Could help by Abcd1234 · · Score: 4, Interesting

    I'm not sure this is really applicable to translating literary works. These kinds of translations require an understanding of the native culture of both the source and target languages, as well as the intent of the writer, in order to generate an understandable translation that the target group can appreciate. A computer translation system like this one is incapable of performing this sort of analysis.

    What this is really good for is on-the-fly translation of material where the reader simply wants to comprehend what was written (think the old babelfish engine). This has obvious applications on the web, as well as many other areas (on-the-fly server-side translation for IM systems, etc, etc).

  6. If you want a universal translator... by flicken · · Score: 4, Interesting

    ...here is a link to the Universal Networking Language (UNL). UNL is a computer markup language that allows the author of the text to specify exactly how the text should be translated (i.e. what the precise definitions of the words in the text are). Taking this specification, a machine is able to produce a readable version of the text in a variety of languages.

    It's not quite done yet, but the system does show promise. Dictionaries have already been created in Spanish, English, German, Japanese, Italian, French and several other languages.

    --
    20 mil and I will! Learn Esperanto with 20M others.
  7. Several Missing Details by Flwyd · · Score: 5, Interesting

    As press releases tend to do, this leaves much to be desired for folks who are familiar with the discipline. As I read it, it seems to imply that the main driver is phrase-matching. What does it do with phrases it hasn't seen before? The problem is solved by throwing lots of data at it -- how much data is needed for a reasonable system? How well does it generalize to text outside the domains of the training data?

    Incidentally, had my brother been a girl, he was in serious danger of being named Rosetta Stone.

    -- Trevor Stone, aka Flwyd

    --
    Ceci n'est pas une signature.
  8. Translate Pascal To C and Such by Potpatriot · · Score: 4, Interesting

    How about piping various algorithms encoded in Pascal and C into the thing and seeing what it does to convert arbitrary sources? Where can I get the source? Pawel

  9. Programming Languages? by The+Raven · · Score: 5, Interesting

    I wonder how this would fare putting two computer languages side by side? I mean... take a few thousand programs, coded using the same algorithms but different computer languages... would his language translation software translate between them? Would it be able to differentiate between languages that manually allocate memory and those that use garbage collection? How about between procedural languages like C, and more esoteric and oddly structured languages like LISP?

    An interesting challenge, eh?

    Would there be any benefit to this?

    --
    "I will trust Google to 'do no evil' until the founders no longer run it." Hello Alphabet.
  10. Not to mention.. by k98sven · · Score: 3, Interesting

    The Rosetta Stone itself did not do much for our knowledge of the Egyptian language.
    What it did do was provide insight into their method of writing.
    It was the later discovery of the relation between Coptic and Egyptian that revealed most of the actual language.

    (IIRC)

    1. Re:Not to mention.. by LenE · · Score: 3, Interesting

      For those who don't know, Coptic is Egyptian written in Greek, or at least the Greek alphabet. It would be similar to transcribing a language that uses glyphs for words by recording them with the phonemes and alphabet of another language.

      A more modern example is what happened with the Slavic Croatian language. The original speakers had a glyph-based alphabet called Glagolitic, through the Middle Ages. This would be as foreign as Egyptian hieroglyphs to people today, and could stand in nicely for an alien text in any sci-fi movie.

      Through falling under different feudal states (Venice, Austria-Hungary) the language was cast under both the Cyrillic and Roman alphabets. Today Croatian uses an accented Roman alphabet (like French), but each letter has only one pronunciation, like Russian.

      -- Len

  11. statistics is the key by gemseele · · Score: 5, Interesting

    Time for inflammatory reasoning. The statistical approach will beat out the grammar- and rule-based ones, at least for English, for one simple reason:

    English is not a language

    Or rather, it resembles one but is more not than is, IMO. It is a large collection of idiomatic expressions that changes quite rapidly (and not only in colloquial forms, just look at what the political-correctness movement has done to phraseology). You know the story... more exceptions than rules, things that are legitimate to say language-wise are considered incorrect anyways, and vice versa, etc. etc.

    That's not to say it doesn't have advantages; it's relatively easy to learn the basics of communication since it's weakly conjugated, has genderless articles, and has a fairly simple uncased sentence structure. But it is monstrous to master and I suspect most native speakers aren't true masters (not to mention the orthographical nightmare; is English the only language with spelling bee contests?)

    The reason it's the new lingua franca (or should it be lingua angla now?) is techno-socio-political as is always the case. Stop harping on Americans for being largely mono-lingual. "Why didn't the Romans learn the local languages when they controlled Europe? Because they didn't have to." If every state spoke a different language, which would be more akin to Europe, then there would be need.

  12. How's that news? by Yurka · · Score: 3, Interesting

    This has already been done some years ago in Canada, where the translation system was fed the complete text of parliamentary debates for umpteen years (required by law to be translated by humans into French, if originally in English, and vice versa). I don't know how it fares when presented with a sample of parliament-speak (I concede, this is not a fair approximation of human language), but it fails miserably on a simple rhyme. Read your Hofstadter, guys.

    --
    I can assure you, the best way to get rid of dragons is to have one of your own.