Slashdot Mirror


Romancing The Rosetta Stone

Roland Piquepaille writes "Not only does this news release from the University of Southern California have a fantastic title, it also has great content. This story is about one of their scientists, Franz Josef Och, whose software ranks very high among translation systems. "Give me enough parallel data, and you can have a translation system for any two languages in a matter of hours," said Dr. Och, paraphrasing Archimedes. His approach relies on two concepts: gathering huge amounts of data and applying statistical models to this data. It completely ignores grammar rules and dictionaries. "Och's method uses matched bilingual texts, the computer-encoded equivalents of the famous Rosetta Stone inscriptions. Or, rather, gigabytes and gigabytes of Rosetta Stones." Read my summary for more details."

33 of 486 comments

  1. Article text by Anonymous Coward · · Score: 4, Informative

    Romancing the Rosetta Stone

    'Give me enough parallel data, and you can have a translation system in hours'

    University of Southern California computer scientist Franz Josef Och echoed one of the most famous boasts in the history of engineering after his software scored highest among 23 Arabic- and Chinese-to-English translation systems, commercial and experimental, tested in recently concluded Department of Commerce trials.

    "Give me a place to stand on, and I will move the world," said the great Greek scientist Archimedes, after providing a mathematical explanation for the lever.

    "Give me enough parallel data, and you can have a translation system for any two languages in a matter of hours," said Dr. Och, a computer scientist in the USC School of Engineering's Information Sciences Institute.

    Och spoke after the 2003 Benchmark Tests for machine translation carried out in May and June of this year by the U.S. Commerce Department's National Institute of Standards and Technology.

    Och's translations proved best in the 2003 head-to-head tests against 7 Arabic systems (5 research and 2 commercial-off-the-shelf products) and 14 Chinese systems (9 research and 5 off-the-shelf). In the previous, 2002 evaluations they had proved similarly superior.

    The researcher discussed his methods at a NIST post-mortem workshop on the benchmarking held July 22-23 at Johns Hopkins University in Baltimore, Maryland.

    Och is a standout exponent of a newer method of using computers to translate one language into another that has become more successful in recent years as the ability of computers to handle large bodies of information has grown, and the volume of text and matched translations in digital form has exploded, on (for example) multilingual newspaper or government web sites.

    Och's method uses matched bilingual texts, the computer-encoded equivalents of the famous Rosetta Stone inscriptions. Or, rather, gigabytes and gigabytes of Rosetta Stones.

    "Our approach uses statistical models to find the most likely translation for a given input," Och explained.

    "It is quite different from the older, symbolic approaches to machine translation used in most existing commercial systems, which try to encode the grammar and the lexicon of a foreign language in a computer program that analyzes the grammatical structure of the foreign text, and then produces English based on hard rules," he continued.

    "Instead of telling the computer how to translate, we let it figure it out by itself. First, we feed the system with a parallel corpus, that is, a collection of texts in the foreign language and their translations into English.

    "The computer uses this information to tune the parameters of a statistical model of the translation process. During the translation of new text, the system tries to find the English sentence that is the most likely translation of the foreign input sentence, based on these statistical models."

    This method ignores, or rather rolls over, explicit grammatical rules and even traditional dictionary lists of vocabulary in favor of letting the computer itself find matchup patterns between given Chinese or Arabic (or any other language) texts and their English translations.
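
    As a rough illustration of how a computer can "find matchup patterns" from parallel text alone, here is a minimal sketch of IBM Model 1, the simplest of the word-based statistical models from the early IBM work. It is not Och's system, which the article describes only in outline; the three-sentence German-English corpus and the names train_model1 and best_translation are made up for this example.

```python
from collections import defaultdict

# Toy parallel corpus (hypothetical): (foreign tokens, English tokens).
CORPUS = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]


def train_model1(pairs, iterations=20):
    """Estimate word translation probabilities t(e | f) with EM.

    No dictionary or grammar is given; the only signal is which
    words co-occur in matched sentence pairs.
    """
    # Start with equal weight for every co-occurring word pair.
    t = {}
    for fs, es in pairs:
        for e in es:
            for f in fs:
                t[(e, f)] = 1.0

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of e aligned to f
        total = defaultdict(float)  # expected count of f aligned to anything
        # E-step: split each English word's mass over the foreign words
        # in its sentence pair, in proportion to the current t.
        for fs, es in pairs:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: renormalise the expected counts into probabilities.
        t = {(e, f): c / total[f] for (e, f), c in count.items()}
    return t


def best_translation(t, f_word):
    """Return the English word with the highest t(e | f_word)."""
    candidates = {e: p for (e, f), p in t.items() if f == f_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    table = train_model1(CORPUS)
    for word in ("das", "Haus", "Buch", "ein"):
        print(word, "->", best_translation(table, word))
```

    Even on three sentence pairs the estimates sort themselves out: "das" appears with "the" twice, which frees "Haus" and "Buch" to pair with "house" and "book". A full system adds a language model and a search over whole sentences, which this sketch leaves out.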

    Such abilities have grown, as computers have improved, by enabling them to move from using individual words as the basic unit to using groups of words -- phrases.

    Different human translators' versions of the same text will often vary considerably. Another key improvement has been the use of multiple English human translations to allow the computer to more freely and widely check its rendering by a scoring system.

    This not coincidentally allows researchers to quantitatively measure improvement in translation on a sensitive and useful scale.
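
    The article does not name the scoring system, so the following is an assumption: a minimal sketch of clipped n-gram precision against several reference translations, the core idea behind overlap metrics such as BLEU. The function names and example sentences are made up for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n=1):
    """Fraction of candidate n-grams found in any reference.

    Each n-gram is credited at most as often as it appears in the
    single reference that uses it most ("clipping"), so repeating a
    common word does not inflate the score.
    """
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, c in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[gram]) for gram, c in cand.items())
    return clipped / sum(cand.values())


if __name__ == "__main__":
    ref1 = "the cat is on the mat".split()
    ref2 = "there is a cat on the mat".split()
    output = "a cat is on the mat".split()
    # Bigram precision against one reference, then against both.
    print(modified_precision(output, [ref1], n=2))
    print(modified_precision(output, [ref1, ref2], n=2))
```

    With only the first reference, the bigram "a cat" counts as an error and the score is 0.8; adding the second reference, which also says "a cat", raises it to 1.0. That is the sense in which multiple human translations let the computer check its rendering "more freely and widely".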

    The original work along these lines dates back to the late 1980s and early 1990s and was done by Peter F. Brown and his colleagues at IBM's Watson Research Center.

    Much of the improvement and

  2. Let me know by gazuga · · Score: 5, Funny

    when it's in the form of a fish, and can fit in my ear...

    --
    "I turn away with fright and horror from the lamentable evil of functions which do not have derivatives."
  3. Re:oh oh... by Anonymous Coward · · Score: 4, Interesting

    This is exactly NOT a universal translator as it uses matched bilingual texts. You need an already translated text for his system to work.

  4. Oh god... by gerf · · Score: 4, Funny

    The uber-geeks are going to have a field day with Klingon...

    1. Re:Oh god... by Jeremi · · Score: 4, Funny
      I'm pretty sure you can have your throat slit for saying "Yay!" near a Klingon. Do be careful. ;)

      Having your throat slit is nothing compared to what Klingons do to people who put smiley-faces in their text messages...

      --
      I don't care if it's 90,000 hectares. That lake was not my doing.
    2. Re:Oh god... by daeley · · Score: 4, Funny

      Having your throat slit is nothing compared to what Klingons do to people who put smiley-faces in their text messages...

      You're telling me! My emoticons used to have noses! Now look:

      :(

      Such a tragedy.

      --
      I watched C-beams glitter in the dark near the Tannhauser gate.
  5. Re:Obsolete? by Surak · · Score: 5, Insightful

    'Almost everyone'? What *are* you talking about? You must be an American. From a recent online Harris poll, most Americans think at least half the world speaks English. This is just plain wrong. The truth of the matter is that it's more like 20%. That's it. Most people on the NET might speak English, but most people in the world? Hardly.

  6. The vodka is strong but the meat is rotten by zptdooda · · Score: 5, Interesting

    That's an example from a few years back of an attempt to translate "the spirit is willing but the flesh is weak" from English to Russian and back to English using a different translator.

    Can anyone try this on the new (or some other recent) algorithm?

    BTW here's Doc Och's most recent website:

    Franz Josef Och

    --
    Esteem isn't a zero sum game
    1. Re:The vodka is strong but the meat is rotten by rossz · · Score: 4, Insightful

      That particular phrase translated badly because they used a word-for-word translation program. You simply can't do that, especially when dealing with euphemisms. This new system is the only possible way that could properly translate text.

      My wife is a professional translator and has absolutely no respect for machine translations.

      --
      -- Will program for bandwidth
    2. Re:The vodka is strong but the meat is rotten by JJ · · Score: 4, Interesting

      This actually is a myth. That particular text and translation was taken as anecdotal in a 1964 report. I did a masters thesis on MT at the University of Chicago and my advisor (once a major figure in MT) refused to approve my thesis until I got that statement correct.

      --
      So long and thanks for all the fish . . . !!!
  7. Finally, the correct approach by tuxlove · · Score: 4, Interesting

    I believe that using a statistical approach like this is a step in the right direction. Manually building sets of rules, dictionaries, etc., is a waste of time and hard to do. And manually built systems become stale as languages evolve, unless a lot of continuing work is done.

    For me the holy grail is when I can converse with a computer meaningfully. I believe a similar approach will be required for the computer to "understand" language, and to be able to formulate a coherent and appropriate response.

  8. Re:Obsolete? by DG · · Score: 5, Funny

    A man who speaks three languages is trilingual.

    A man who speaks two languages is bilingual.

    A man who speaks one language is American.

    DG

    --
    Want to learn about race cars? Read my Book
  9. Damn Babelfish! by Zog+The+Undeniable · · Score: 5, Funny
    "Most the bay only of news of the college of southern extremity California it knows an all big contents all there is this emission annular subject, it also there is a RolandPiquepaille and it writes. The Franz taxes where his software height one lyel with lines up between the translation system quite phu the Och and this history are the summary thing their scientist. The Och "it gave the data which is parallel is sufficient in me, it spread out," inside questioning the hour 2 specialties the language which it does not do of the multi Archimedes which is the possibility which there will be a hazard translation system the doctor repulsively it talked. It approach collects the sheep which data is enormous, apply the statistical model in this data a foundation in 2 concepts which it puts. It is complete and the wool of rule lu the dictionary of grammar "the m3ethode of the Och the duplex language original and the Rosetta which agree one equivalent with computer password of noble and wise pebble epitaph adopts. Or, rather, the gigaoctets and pebble gigaoctets of the Rosetta." Detail fact compared to read the hazard my synopsis.

    English --> French --> English --> Korean --> English. Of course, it helps that the first sentence is munged anyway ;-)

    --
    When I am king, you will be first against the wall.
  10. Old Texts by holygoat · · Score: 5, Insightful

    Firstly we could consider the enormous body of work currently available in other languages.
    Having this able to be translated into English or other languages could be very valuable for scholars.

    Secondly, English is not the primary tongue for the majority of people on the planet - to suggest that because a lot of people can manage to converse in it that the ability to translate between other languages isn't valuable is foolish.

    Also note that the article specifically mentions Arabic and Chinese, which I don't think crossed your mind. China has the largest population on the planet, remember.

    Translation is far from obsolete, especially given that the majority of the Western world, and especially America, is piss poor at being bilingual.

  11. Re:DARPA by Abcd1234 · · Score: 5, Insightful

    Oh please... so many conspiracy theories. You do realize that the *internet* was originally developed by DARPA, right? My point: DARPA does a lot of work... not all of it revolves around spying on or otherwise taking away the rights of American citizens.

  12. Statistical approach looks promising by TwistedGreen · · Score: 4, Insightful

    "One of the great advantages of the statistical approach," Och explained, "is that most of the work goes into components that are language-independent. As long as you give me enough parallel data to train the system on, you can have a new system in a matter of days, if not hours."

    This statistical method is probably the best approach to computerized translation. It seems to approximate how the human mind will translate a given sentence most efficiently. Language can get awfully complex, and individual words often have, at best, an ambiguous meaning when interpreted alone. One must take into account the context of that word to specify and refine its meaning. This obviously leads to a huge number of permutations to represent a huge variety of thoughts, but the relative size of this number is diminishing as computers become more powerful.

    Therefore, instead of playing with messy grammars and sentence structures, we can simply have a catalogue of thoughts as represented by words, and correlate that catalogue with a different set of words to facilitate translation. This software would operate on a deeper level than it would if it operated with the words and symbols themselves. It would utilize a map of the deep structures of language, instead of a map of the less-meaningful words and grammars.

    I really like this method, and while it may seem like a brute-force hack applied to translation, the simple fact that languages do not contain elegant patterns must be accepted. It also appears to be a most efficient method, as the simple comparisons involved would bring the speed of translation into realtime.

  13. Re:Could help by Abcd1234 · · Score: 4, Interesting

    I'm not sure this is really applicable to translating literary works. These kinds of translations require an understanding of the native culture of both the source and target languages, as well as the intent of the writer, in order to generate an understandable translation that the target group can appreciate. A computer translation system like this one is incapable of performing these sorts of analysis.

    What this is really good for is on-the-fly translation of material where the reader simply wants to comprehend what was written (think the old babelfish engine). This has obvious applications on the web, as well as many other areas (on-the-fly server-side translation for IM systems, etc, etc).

  14. "The vodka is strong, but the meat is rotten" by quantum+bit · · Score: 5, Funny

    You know, that actually does sound like something that would be a Russian aphorism...

  15. Re:A bit of a worry for privacy by bigjocker · · Score: 4, Insightful

    This is a bit of a worry for privacy concerns, given that if I want to keep something secret from the world and private just between me and my intended recipient I have one less option.

    If you are using foreign languages or even lexically analyzable schemes to do your encryption, you deserve what you get.

    --
    Life isn't like a box of chocolates. It's more like a jar of jalapenos. What you do today, might burn your ass tomorrow.
  16. If you want a universal translator... by flicken · · Score: 4, Interesting
    ...here is a link to the Universal Networking Language (UNL). UNL is a computer markup language that allows the author of the text to specify exactly how the text should be translated (i.e. what the precise definitions of the words in the text are). Taking this specification, a machine is able to produce a readable version of the text in a variety of languages.

    It's not quite done yet, but the system does show promise. Dictionaries have already been created in Spanish, English, German, Japanese, Italian, French and several other languages.

    --
    20 mil and I will! Learn Esperanto with 20M others.
  17. Several Missing Details by Flwyd · · Score: 5, Interesting

    As press releases tend to do, this leaves much to be desired for folks who are familiar with the discipline. As I read it, it seems to imply that the main driver is phrase-matching. What does it do with phrases it hasn't seen before? The problem is solved by throwing lots of data at it -- how much data is needed for a reasonable system? How well does it generalize to text outside the domains of the training data?

    Incidentally, had my brother been a girl, he was in serious danger of being named Rosetta Stone.

    -- Trevor Stone, aka Flwyd

    --
    Ceci n'est pas une signature.
  18. Translate Pascal To C and Such by Potpatriot · · Score: 4, Interesting

    How about piping in various algorithms encoded in Pascal and C into the thing and seeing what it does to convert arbitrary sources? Where can I get the source? Pawel

  19. Re:DARPA by wwest4 · · Score: 5, Insightful

    well, not EVERY bottle of beer at the duff plant has a nose or hitler's head in it, but i'm glad the inspector is tasked to look at every single bottle.

    just because government abuse isn't guaranteed doesn't mean we shouldn't vigilantly examine the possibilities when we see them.

    it all boils down to balancing powers of government and freedom of individuals, and this country (USA) was founded upon principles intended to favor the rights of individuals. i'll go out on a limb and make a value statement - that's the way to go. power to the people, man!

  20. What about C++? by MobyDisk · · Score: 4, Funny

    So, can I train this program with a bunch of requirements documents, and a bunch of implementations, and have it learn how to code? :-) If so, I think I am obsolete. *poof*

  21. Programming Languages? by The+Raven · · Score: 5, Interesting

    I wonder how this would fare putting two computer languages side by side? I mean... take a few thousand programs, coded using the same algorithms but different computer languages... would his language translation software translate between them? Would it be able to differentiate between languages that manually allocate memory and those that use garbage collection? How about between procedural languages like C, and more esoteric and oddly structured languages like LISP?

    An interesting challenge, eh?

    Would there be any benefit to this?

    --
    "I will trust Google to 'do no evil' until the founders no longer run it." Hello Alphabet.
  22. Give me enough Slashdot entries... by Pac · · Score: 5, Funny

    ...and I will make pseudo-insightful comments based on the headline text without reading any of the source articles, until my karma is excellent?

  23. Scientific Papers by acoustiq · · Score: 4, Informative
    Being an undergrad hoping to do research in this area in the next few years, I've already read a few of Och's papers and others in the field. Some of the best that I remember are: Kevin Knight prepared an excellent (if now somewhat outdated) introduction to statistical machine translation that you can see in HTML or RTF (the formatting was corrupted when the RTF was converted to HTML - I recommend the RTF).

    --
    I romp with joy in the bookish dark
  24. Re:A poor analogy, and a poor method by Abcd1234 · · Score: 5, Insightful

    If they offered me the same money (and one of those Linux NetworX clusters) I could have a superior system in a month, although (as stated above) it would require more than one known language.

    LOL! If this problem was so friggin' easy, why are these researchers the first to demonstrate a working system using this technique (which blows away all existing systems, BTW)? Hell, if it's as easy as you say, this whole "translating text" thing must be a breeze. I wonder why so much money is spent every year on R&D in this area? Hell, why didn't they just hire you to whip up a system in a month?

    Why? Because it ain't that easy and you have no idea what you're talking about. Given these are world-class researchers, I'm sure they've considered the multiple-translation route, and subsequently rejected it for very good reasons (likely far more complex than your simplistic "it's easier" excuse). Moreover, the really hard work in this area is the statistical modelling necessary to generate a working system, something which would, I suspect, be far more complex if a multiple-translation route were taken. But, hey, that's just some number crunching, right? What's so hard about that?

  25. statistics is the key by gemseele · · Score: 5, Interesting

    Time for inflammatory reasoning. The statistical approach will beat out the grammar- and rule-based ones, at least for English, for one simple reason:

    English is not a language

    Or rather, it resembles one but is more not than is, IMO. It is a large collection of idiomatic expressions that changes quite rapidly (and not only in colloquial forms, just look at what the political-correctness movement has done to phraseology). You know the story... more exceptions than rules, things that are legitimate to say language-wise are considered incorrect anyways, and vice versa, etc. etc.

    That's not to say it doesn't have advantages; it's relatively easy to learn the basics of communication since it's weakly conjugated and has genderless articles and a fairly simple uncased sentence structure. But it is monstrous to master, and I suspect most native speakers aren't true masters (not to mention the orthographical nightmare; is English the only language with spelling bee contests?).

    The reason it's the new lingua franca (or should it be lingua angla now?) is techno-socio-political as is always the case. Stop harping on Americans for being largely mono-lingual. "Why didn't the Romans learn the local languages when they controlled Europe? Because they didn't have to." If every state spoke a different language, which would be more akin to Europe, then there would be need.

    1. Re:statistics is the key by Jeremi · · Score: 4, Insightful
      English is not a language. Or rather, it resembles one but is more not than is, IMO. It is a large collection of idiomatic expressions that changes quite rapidly

      You are actually arguing that English is not a dead language. Every language that is actually in use by large numbers of people is as you describe.

      --
      I don't care if it's 90,000 hectares. That lake was not my doing.
  26. How dare you ask by Anonymous Coward · · Score: 4, Funny
    But Can It Do Klingon?
    How dare you question the honor of this program! I should kill you where you stand!
  27. Re:A poor analogy, and a poor method by William+Tanksley · · Score: 4, Insightful

    If you double the number of known languages, you more than quarter the number of errors

    Your post is reasonable and interesting (using three-way parallelism would give better translations), but you're missing something important here.

    First, none of these languages are "known" to this interpreter program. The program reads parallel texts, and when you feed it a text without a parallel, it generates the parallel for you. In other words, it can translate either way. So you don't have two known languages and one unknown; all you have is three text corpuses. (Well, in this case you have two, but you know what I mean.)

    Second, yes; three would be FAR better than two; but two is also useful, and in more situations. You don't always have a Rosetta stone.

    They're doing well here. Yes, there's an obvious next step to take; but no, the existence of a "next step" doesn't destroy the usefulness of this step.

    -Billy

  28. Article text (in Babel-German-back-to-English) by Wraithlyn · · Score: 4, Funny

    I just had to. Besides, I think it's proving a point, or something.

    --

    Romancing of the Rosetta stone

    ' you give me sufficient parallel data, and you can have translation a system in the hours '

    University southern California of the computer scientist Franz Josef, which Och of most famous against-resounded, praises itself in the history of the technology, after its software counted the Arab strongly under 23 and Chinese English translatio systems, commercially and experimentally, examined inside in recently concluded Ministry of Trade of attempts.

    "you indicate a place to me to the location, and I shift the world,", after to to order a mathematical explanation for the lever said the large Greek scientist Archimedes place.

    "you give me sufficient parallel data, and you can have translation a system for all possible two languages in an affair of hours,", said Dr. Och, a computer scientist in the USC school of the institute for information science of the technology.

    Och spoke after the benchmark tests 2003 for the machine translation, which was accomplished in the May and June of this yearly by the National Institute of Standards and Technology United States of the trade department.

    Translations Ochs examined well into the 2003 head ton head tests against 7 Arab systems (5 research and 2 commercial away dregal products) and 14 Chinese systems (9 research and 5 from stock). In preceding 2002 evaluations had examined it similarly superior.

    The researcher discussed his methods held at a NIST Postmortemseminar over the Benchmarking July 22-23 of John Hopkins at the university in Baltimore, Maryland.

    Och is an outstanding exponent of a newer method of using the computers to touch in order to translate a language into other one, which became more successful in the last years, while the ability of the computers grew, large bodies of the information, and the volume of the text and the brought together translations in the digital form has, on (for example) multilingual newspaper or government net places of assembly explodes.

    Method Ochs uses brought together bilingual texts, the computer-coded equivalents of the famous Rosetta descriptions of stone. Or rather gigabytes and gigabyte Rosetta of stones.

    "our approximation uses statistic models, in order to find the most probable translation for a given entrance," Och avowedly

    "it is rather different to the older, symbolic approximations for the machine translation, which in most existing the commercial systems is used, which try, to code the grammar and the encyclopedia of a foreign language in a computer program the grammatical structure of the strange text analyzed, and produced then English, which on hard guidelines," it is based, continued.

    "employs, explaining from the computer, how one, we left it it out explains translated. First we draw the system it with a parallel korpus i.e. an accumulation of texts in the foreign language and their translations into English.

    "the computer uses these information, in order to co-ordinate the parameters of a statistic model translation of the process. During the translation of the new text, the system tries to find English sentence which is the most probable translation strange entrance of the sentence, be based in these statistic models."

    This method ignores or rolls over rather, finds express grammatical guidelines and even traditional dictionary lists of the vocabulary in favor of leaving the computer matchup samples between given Chinese or Arab (or any another language) texts and English translations.

    Such abilities grew, while computers improved, by making possible for them, from using the individual words as the fundamental unit on using the groups of words to move -- cliches.

    Versions of the different human translators of the same text change frequently considerably. Another key improvement was the use of repeated English human translations to permit the computer too its transmission by an ana

    --
    "Mind, as manifested by the capacity to make choices, is to some extent present in every electron." -Freeman Dyson