Slashdot Mirror


Romancing The Rosetta Stone

Roland Piquepaille writes "Not only does this news release from the University of Southern California have a fantastic title, it also has great content. This story is about one of their scientists, Franz Josef Och, whose software ranks very high among translation systems. "Give me enough parallel data, and you can have a translation system for any two languages in a matter of hours," said Dr. Och, paraphrasing Archimedes. His approach relies on two concepts: gathering huge amounts of data and applying statistical models to that data. It completely ignores grammar rules and dictionaries. "Och's method uses matched bilingual texts, the computer-encoded equivalents of the famous Rosetta Stone inscriptions. Or, rather, gigabytes and gigabytes of Rosetta Stones." Read my summary for more details."

10 of 486 comments

  1. Re:Obsolete? by Surak · · Score: 5, Insightful

    'Almost everyone'? What *are* you talking about? You must be an American. From a recent online Harris poll, most Americans think at least half the world speaks English. This is just plain wrong. The truth of the matter is that it's more like 20%. That's it. Most people on the NET might speak English, but most people in the world? Hardly.

  2. Old Texts by holygoat · · Score: 5, Insightful

    Firstly we could consider the enormous body of work currently available in other languages.
    Being able to translate this work into English or other languages could be very valuable for scholars.

    Secondly, English is not the primary tongue for the majority of people on the planet - to suggest that, because a lot of people can manage to converse in it, the ability to translate between other languages isn't valuable is foolish.

    Also note that the article specifically mentions Arabic and Chinese, which I don't think crossed your mind. China has the largest population on the planet, remember.

    Translation is far from obsolete, especially given that the majority of the Western world, and especially America, is piss poor at being bilingual.

  3. Re:DARPA by Abcd1234 · · Score: 5, Insightful

    Oh please... so many conspiracy theories. You do realize that the *internet* was originally developed by DARPA, right? My point: DARPA does a lot of work... not all of it revolves around spying on or otherwise taking away the rights of American citizens.

  4. Statistical approach looks promising by TwistedGreen · · Score: 4, Insightful

    "One of the great advantages of the statistical approach," Och explained, "is that most of the work goes into components that are language-independent. As long as you give me enough parallel data to train the system on, you can have a new system in a matter of days, if not hours."

    This statistical method is probably the best approach to computerized translation. It seems to approximate how the human mind will translate a given sentence most efficiently. Language can get awfully complex, and individual words often have, at best, an ambiguous meaning when interpreted alone. One must take into account the context of that word to specify and refine its meaning. This obviously leads to a huge number of permutations to represent a huge variety of thoughts, but the relative size of this number is diminishing as computers become more powerful.

    Therefore, instead of playing with messy grammars and sentence structures, we can simply have a catalogue of thoughts as represented by words, and correlate that catalogue with a different set of words to facilitate translation. This software would operate on a deeper level than it would if it operated with the words and symbols themselves. It would utilize a map of the deep structures of language, instead of a map of the less-meaningful words and grammars.

    I really like this method, and while it may seem like a brute-force hack applied to translation, the simple fact that languages do not contain elegant patterns must be accepted. It also appears to be a most efficient method, as the simple comparisons involved would bring the speed of translation into realtime.
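
    As a rough illustration of how word correspondences can be learned from parallel text alone, with no grammar rules or dictionary, here is a minimal sketch of IBM Model 1 word alignment trained with expectation-maximization. This is a standard textbook technique and is not claimed to be Och's actual system; the three-sentence toy corpus and the function names train_model1 and best_source_word are assumptions made up for this example.

```python
def train_model1(pairs, iterations=20):
    """Estimate t[(s, w)] = P(source word s | target word w) with EM.

    pairs is a list of (source_tokens, target_tokens) sentence pairs.
    This is a toy version: no NULL word, no smoothing.
    """
    src_vocab = {s for src, _ in pairs for s in src}

    # Start from a uniform guess for every co-occurring word pair.
    t = {}
    for src, tgt in pairs:
        for s in src:
            for w in tgt:
                t[(s, w)] = 1.0 / len(src_vocab)

    for _ in range(iterations):
        count = {}  # expected number of times s aligns to w
        total = {}  # expected number of alignments involving w

        # E-step: share each source word among the target words
        # of its sentence, in proportion to the current estimates.
        for src, tgt in pairs:
            for s in src:
                norm = sum(t[(s, w)] for w in tgt)
                for w in tgt:
                    frac = t[(s, w)] / norm
                    count[(s, w)] = count.get((s, w), 0.0) + frac
                    total[w] = total.get(w, 0.0) + frac

        # M-step: renormalize the expected counts into probabilities.
        for (s, w), c in count.items():
            t[(s, w)] = c / total[w]

    return t


def best_source_word(t, target_word):
    """Return the source word with the highest P(s | target_word)."""
    candidates = {s: p for (s, w), p in t.items() if w == target_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy parallel corpus (German source, English target).
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]

table = train_model1(corpus)
for english in ["the", "house", "book", "a"]:
    print(english, "->", best_source_word(table, english))
```

    Nothing in the code knows what any word means. The pairing falls out of co-occurrence statistics alone, which is why the same code works for any language pair given enough parallel text. A real system would add phrase-level models, a language model for the output, and far more data.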

  5. Re:The vodka is strong but the meat is rotten by rossz · · Score: 4, Insightful

    That particular phrase translated badly because they used a word-for-word translation program. You simply can't do that, especially when dealing with euphemisms. An approach like this new system is the only way text could possibly be translated properly.

    My wife is a professional translator and has absolutely no respect for machine translations.

    --
    -- Will program for bandwidth

  6. Re:A bit of a worry for privacy by bigjocker · · Score: 4, Insightful

    This is a bit of a worry for privacy concerns, given that if I want to keep something secret from the world and private just between me and my intended recipient I have one less option.

    If you are using foreign languages or even lexically analyzable schemes to do your encryption, you deserve what you get.

    --
    Life isn't like a box of chocolates. It's more like a jar of jalapenos. What you do today, might burn your ass tomorrow.

  7. Re:DARPA by wwest4 · · Score: 5, Insightful

    well, not EVERY bottle of beer at the duff plant has a nose or hitler's head in it, but i'm glad the inspector is tasked to look at every single bottle.

    just because government abuse isn't guaranteed doesn't mean we shouldn't vigilantly examine the possibilities when we see them.

    it all boils down to balancing powers of government and freedom of individuals, and this country (USA) was founded upon principles intended to favor the rights of individuals. i'll go out on a limb and make a value statement - that's the way to go. power to the people, man!

  8. Re:A poor analogy, and a poor method by Abcd1234 · · Score: 5, Insightful

    If they offered me the same money (and one of those Linux NetworX clusters) I could have a superior system in a month, although (as stated above) it would require more than one known language.

    LOL! If this problem was so friggin' easy, why are these researchers the first to demonstrate a working system using this technique (which blows away all existing systems, BTW)? Hell, if it's as easy as you say, this whole "translating text" thing must be a breeze. I wonder why so much money is spent every year on R&D in this area? Hell, why didn't they just hire you to whip up a system in a month?

    Why? Because it ain't that easy and you have no idea what you're talking about. Given these are world-class researchers, I'm sure they've considered the multiple-translation route, and subsequently rejected it for very good reasons (likely far more complex than your simplistic "it's easier" excuse). Moreover, the really hard work in this area is the statistical modelling necessary to generate a working system, something which would, I suspect, be far more complex if a multiple-translation route were taken. But, hey, that's just some number crunching, right? What's so hard about that?

  9. Re:A poor analogy, and a poor method by William+Tanksley · · Score: 4, Insightful

    If you double the number of known languages, you more than quarter the number of errors

    Your post is reasonable and interesting (using three-way parallelism would give better translations), but you're missing something important here.

    First, none of these languages are "known" to this interpreter program. The program reads parallel texts, and when you feed it a text without a parallel, it generates the parallel for you. In other words, it can translate either way. So you don't have two known languages and one unknown; all you have is three text corpuses. (Well, in this case you have two, but you know what I mean.)

    Second, yes; three would be FAR better than two; but two is also useful, and in more situations. You don't always have a Rosetta stone.

    They're doing well here. Yes, there's an obvious next step to take; but no, the existence of a "next step" doesn't destroy the usefulness of this step.

    -Billy

  10. Re:statistics is the key by Jeremi · · Score: 4, Insightful

    English is not a language. Or rather, it resembles one but is more not than is, IMO. It is a large collection of idiomatic expressions that changes quite rapidly

    You are actually arguing that English is not a dead language. Every language that is actually in use by large numbers of people is as you describe.

    --
    I don't care if it's 90,000 hectares. That lake was not my doing.