Romancing The Rosetta Stone


Posted by Hemos on Monday July 28, 2003 @03:48AM from the cool-story dept.

Roland Piquepaille writes "Not only does this news release from the University of Southern California have a fantastic title, it also has great content. This story is about one of their scientists, Franz Josef Och, whose software ranks very high among translation systems. "Give me enough parallel data, and you can have a translation system for any two languages in a matter of hours," said Dr. Och, paraphrasing Archimedes. His approach relies on two concepts: gathering huge amounts of data, and applying statistical models to this data. It completely ignores grammar rules and dictionaries. "Och's method uses matched bilingual texts, the computer-encoded equivalents of the famous Rosetta Stone inscriptions. Or, rather, gigabytes and gigabytes of Rosetta Stones." Read my summary for more details."

Article text by Anonymous Coward · 2003-07-28 03:51 · Score: 4, Informative

Romancing the Rosetta Stone

'Give me enough parallel data, and you can have a translation system in hours'

University of Southern California computer scientist Franz Josef Och echoed one of the most famous boasts in the history of engineering after his software scored highest among 23 Arabic- and Chinese-to-English translation systems, commercial and experimental, tested in recently concluded Department of Commerce trials.

"Give me a place to stand on, and I will move the world," said the great Greek scientist Archimedes, after providing a mathematical explanation for the lever.

"Give me enough parallel data, and you can have a translation system for any two languages in a matter of hours," said Dr. Och, a computer scientist in the USC School of Engineering's Information Sciences Institute.

Och spoke after the 2003 Benchmark Tests for machine translation carried out in May and June of this year by the U.S. Commerce Department's National Institute of Standards and Technology.

Och's translations proved best in the 2003 head-to-head tests against 7 Arabic systems (5 research and 2 commercial-off-the-shelf products) and 14 Chinese systems (9 research and 5 off-the-shelf). In the previous, 2002 evaluations they had proved similarly superior.

The researcher discussed his methods at a NIST post-mortem workshop on the benchmarking held July 22-23 at Johns Hopkins University in Baltimore, Maryland.

Och is a standout exponent of a newer method of using computers to translate one language into another that has become more successful in recent years as the ability of computers to handle large bodies of information has grown, and the volume of text and matched translations in digital form has exploded, on (for example) multilingual newspaper or government web sites.

Och's method uses matched bilingual texts, the computer-encoded equivalents of the famous Rosetta Stone inscriptions. Or, rather, gigabytes and gigabytes of Rosetta Stones.

"Our approach uses statistical models to find the most likely translation for a given input," Och explained

"It is quite different from the older, symbolic approaches to machine translation used in most existing commercial systems, which try to encode the grammar and the lexicon of a foreign language in a computer program that analyzes the grammatical structure of the foreign text, and then produces English based on hard rules," he continued.

"Instead of telling the computer how to translate, we let it figure it out by itself. First, we feed the system it with a parallel corpus, that is, a collection of texts in the foreign language and their translations into English.

"The computer uses this information to tune the parameters of a statistical model of the translation process. During the translation of new text, the system tries to find the English sentence that is the most likely translation of the foreign input sentence, based on these statistical models."

This method ignores, or rather rolls over, explicit grammatical rules and even traditional dictionary lists of vocabulary in favor of letting the computer itself find matchup patterns between given Chinese or Arabic (or any other language) texts and their English translations.
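
To make "letting the computer find matchup patterns" concrete, here is a minimal sketch of expectation-maximization over word pairs, in the spirit of the early IBM word-based models the article mentions below. It is not Och's system, which the article says works on phrases. The three-sentence German-English corpus and the names PAIRS, train_model1 and most_likely are invented for illustration.

```python
from collections import defaultdict

# Toy parallel corpus (hypothetical): (foreign sentence, English translation).
PAIRS = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]

def train_model1(pairs, iterations=20):
    """Estimate word-translation probabilities t[(e, f)] = P(e | f) with EM."""
    corpus = [(f.split(), e.split()) for f, e in pairs]
    english_vocab = {w for _, e_words in corpus for w in e_words}
    # Start with no knowledge: every English word is equally likely.
    uniform = 1.0 / len(english_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (e, f) links
        total = defaultdict(float)  # expected number of links involving f
        for f_words, e_words in corpus:
            for e in e_words:
                # E-step: split one unit of credit for e among the foreign
                # words in the same sentence pair, by current belief.
                norm = sum(t[(e, f)] for f in f_words)
                for f in f_words:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # M-step: turn expected counts back into probabilities.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

def most_likely(t, f_word):
    """Return the English word with the highest P(e | f_word)."""
    candidates = {e: p for (e, f), p in t.items() if f == f_word}
    return max(candidates, key=candidates.get)

if __name__ == "__main__":
    table = train_model1(PAIRS)
    for word in ("das", "haus", "buch", "ein"):
        print(word, "->", most_likely(table, word))
```

No dictionary or grammar rule appears anywhere in the code; the pairings emerge only from which words keep showing up in the same sentence pairs. A real system needs far more data and a much richer model.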

Such abilities have grown, as computers have improved, by enabling them to move from using individual words as the basic unit to using groups of words -- phrases.

Different human translators' versions of the same text will often vary considerably. Another key improvement has been the use of multiple English human translations to allow the computer to more freely and widely check its rendering by a scoring system.

This not coincidentally allows researchers to quantitatively measure improvement in translation on a sensitive and useful scale.
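
The article does not name the scoring system. One common way to score a machine translation against several human references is clipped n-gram precision, as in BLEU-style metrics, and the sketch below assumes that approach. The function names and example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that appear in at least one reference.

    Each n-gram is credited at most as many times as it occurs in the
    single reference that contains it most often (the "clipping").
    """
    cand_counts = ngrams(candidate.split(), n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, c in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], c)
    matched = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
    return matched / sum(cand_counts.values())

if __name__ == "__main__":
    refs = ["the cat is on the mat", "there is a cat on the mat"]
    print(clipped_precision("the cat sat on the mat", refs, n=1))
    print(clipped_precision("the cat sat on the mat", refs, n=2))
```

Having more than one reference means a candidate is not penalized for choosing wording that only one of the human translators happened to use.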

The original work along these lines dates back to the late 1980s and early 1990s and was done by Peter F. Brown and his colleagues at IBM's Watson Research Center.

Much of the improvement and
Re:Obsolete? by ShadeARG · 2003-07-28 03:58 · Score: 3, Informative

Here is Japanese Slashdot, and I'm sure there are others.
A poor analogy, and a poor method by jd · 2003-07-28 04:15 · Score: 3, Informative

The Rosetta stone encoded three languages, not two, where two were known in advance. Indeed, there have been many three-way translations of treaties found, now.

The use of three languages is critical. Grammar isn't consistent, and words have multiple meanings. By using two known languages, you can eliminate many of the errors thus introduced, because the chances of some error fitting both known languages in the same way is much smaller.

If you double the number of known languages, you more than quarter the number of errors, because although errors can occur in either or both, they're unlikely to be the same error. Once more information exists, you can re-scan the same text and fill in the blanks.
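
One rough way to read the "more than quarter" claim, offered as an illustration under strong assumptions rather than as the poster's own derivation: suppose each known language independently misleads the system on a given word with probability p, and a mistake survives only if both known languages mislead it in the same way.

```latex
\[
P(\text{error survives}) \le p \cdot p = p^{2},
\qquad
p^{2} < \frac{p}{4} \iff p < \frac{1}{4}.
\]
```

With p = 0.1, for example, the surviving error rate would be at most 0.01. In practice errors across languages are unlikely to be fully independent, so this is best treated as an upper-bound intuition.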

Me, personally - I'd require four languages, three of which were known. The number of texts required would be considerably smaller and the number of residual errors would be practically nonexistent.

They chose two languages for the obvious reason: It's simple. It's easy to find a student who knows two languages. At least, easier than finding one who knows four.

However, the price of simplicity is bad science. The volume of information they require makes their system little better than an infinite number of very smart monkeys with text editors and a grep function. That they're being paid significant money on such stuff is a joke.

If they offered me the same money (and one of those Linux NetworX clusters) I could have a superior system in a month, although (as stated above) it would require more than one known language.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:Obsolete? by red_dragon · 2003-07-28 04:25 · Score: 3, Informative

Spanish Slashdot: Barrapunto. It's been around for almost as long as Slashdot itself.

--
In Soviet Russia, Jesus asks: "What Would You Do?"
not a new technique by Anonymous Coward · 2003-07-28 04:38 · Score: 1, Informative

IBM tried this statistical technique years ago; it's not a new approach. They used the texts of Canadian parliamentary discussion, which is kept in both English and French. See here or just search Google for "IBM translation canadian parliament" or the like.
Scientific Papers by acoustiq · 2003-07-28 04:50 · Score: 4, Informative
Being an undergrad hoping to do research in this area in the next few years, I've already read a few of Och's papers and others in the field. Some of the best that I remember are:
- Improved Statistical Alignment Models (2000) - Franz Josef Och, Hermann Ney, which investigates and compares several models
- A Syntax-based Statistical Translation Model - Yamada, Knight (2001), which tries to treat sentences structurally instead of just a stream of words
- A Finite-State Approach to Machine Translation - Bangalore, Riccardi (2001), which uses a different way of looking at the problem than usual
Kevin Knight prepared an excellent (if now somewhat outdated) introduction to statistical machine translation that you can see in HTML or RTF (the formatting was corrupted when the RTF was converted to HTML - I recommend the RTF).
--
I romp with joy in the bookish dark
been done before by Fratz · 2003-07-28 05:32 · Score: 2, Informative

They've had the same technology at CMU's LTI for years now, called EBMT. This officially stands for Example-Based Machine Translation, but those of us who worked with it called it Extremely Bad Machine Translation because it took millions of example sentences before it started to not suck, and even then it required manual tweaking and the addition of primitive grammar rules.

So yeah, this method learns fast, but it generally learns to a useless level for anything other than a rough assessment of some of the phrases that were in the original text.

--
-- Fratz, human
Re:Where can I download his software? by Anonymous Coward · 2003-07-28 06:08 · Score: 3, Informative

Franz Josef Och's homepage is at:

http://www.isi.edu/~och/

There are links to 3 software packages for download.
Actually, it operates on a *shallower* level... by Jerf · 2003-07-28 08:53 · Score: 3, Informative

This software would operate on a deeper level than it would if it operated with the words and symbols themselves. It would utilize a map of the deep structures of language, instead of a map of the less-meaningful words and grammars.

Actually, as a result, it operates on a shallower level. In fact, it's almost like you wrote this comment for an article in a parallel universe where statistical translation was the norm, and somebody was just now proposing symbolic translation, so much so that it's almost spooky.

This translation technique is so shallow it doesn't even particularly care what languages it works with. In a way, it can't really be said to be "translating" in the traditional sense; it's just correlating phrases with no clue what they are.

Traditional symbolic translation is better described by what you said:

Therefore, instead of playing with messy grammars and sentence structures, we can simply have a catalogue of thoughts as represented by words, and correlate that catalogue with a different set of words to facilitate translation.

Word(/phrase) -> symbol -> word(/phrase) is traditional translation. This is word -> word translation.

It's working better because we've had little or no success creating the middle part of the symbolic translation; matching the symbology used in our head has proven impossible to date. This works better by skipping that step, which introduces horrible distortions by forcing the words to fit into an incredibly poor symbology (compared to what we're actually using).

However, in theory, traditional translation should still have a brighter future; this is a hack around our ignorance, perhaps even a good one, but eventually we will want to extract the symbols.

(Incidentally, it's also why this same technique can't be used to match words -> symbols; we don't know how to represent the symbols yet! This kind of technique could eventually potentially be hybridized with something else to attack that problem, but simple, direct application can't result in the complicated relationships between symbols that exist, and we'd want a computer to "understand" those relations before we'd say it was truly translating or understanding English.)

Anyways, just flip your comments around 180 degrees and you're pretty close.