Slashdot Mirror


Romancing The Rosetta Stone

Roland Piquepaille writes "Not only does this news release from the University of Southern California have a fantastic title, it also has great content. This story is about one of their scientists, Franz Josef Och, whose software ranks very high among translation systems. "Give me enough parallel data, and you can have a translation system for any two languages in a matter of hours," said Dr. Och, paraphrasing Archimedes. His approach relies on two concepts: gathering huge amounts of data, and applying statistical models to this data. It completely ignores grammar rules and dictionaries. "Och's method uses matched bilingual texts, the computer-encoded equivalents of the famous Rosetta Stone inscriptions. Or, rather, gigabytes and gigabytes of Rosetta Stones." Read my summary for more details."

34 of 486 comments (clear)

  1. Great summary by spectasaurus · · Score: 3, Insightful

    You know, it's not really a summary when you just delete half the article.

  2. Re:Obsolete? by Anonymous Coward · · Score: 0, Insightful

    Guess what asshat, the majority of the people on this planet don't speak English. Just because everyone you know does doesn't make it a majority or even a large minority.

  3. DARPA by BlackHawk-666 · · Score: 2, Insightful

    That reference to DARPA has me a little worried about the sort of uses this technology will be put to. I wonder, is the CIA trying to shore up holes in its translation abilities (particularly for Arabic, etc.) by using software? What happens when you pair this technology up with the Echelon project? Are we going to see a dramatic rise in the ability of the government to spy on nationals and particularly foreign nationals now?

    --
    All those moments will be lost in time, like tears in rain.
    1. Re:DARPA by Abcd1234 · · Score: 5, Insightful

      Oh please... so many conspiracy theories. You do realize that the *internet* was originally developed by DARPA, right? My point: DARPA does a lot of work... not all of it revolves around spying on or otherwise taking away the rights of American citizens.

    2. Re:DARPA by wwest4 · · Score: 5, Insightful

      well, not EVERY bottle of beer at the duff plant has a nose or hitler's head in it, but i'm glad the inspector is tasked to look at every single bottle.

      just because government abuse isn't guaranteed doesn't mean we shouldn't vigilantly examine the possibilities when we see them.

      it all boils down to balancing powers of government and freedom of individuals, and this country (USA) was founded upon principles intended to favor the rights of individuals. i'll go out on a limb and make a value statement - that's the way to go. power to the people, man!

  4. The Law of Eventuality by Speare · · Score: 3, Insightful

    "Give me enough" is a key element of the Law of Eventuality. Give me enough money, and I'll solve the Microsoft monopoly threat with a hostile takeover. Give me enough time and I'll clean up almost any unnatural disaster site by leveraging nature's own methods.

    Give me enough simulated neurons and enough truisms and I'll make a sentient machine.

    Eventually, with enough resources, anything is possible. Throwing more time and resources at a problem is rarely exciting science. Reducing the inconveniently large values of 'eventually' and 'enough' is the real problem.

    --
    [ .sig file not found ]
    1. Re:The Law of Eventuality by Abcd1234 · · Score: 2, Insightful

      Err... how is this interesting or insightful? It's barely related to the discussion! If what you're referring to is the large corpus of paired texts they inject into the system, you've completely missed the point.

      The cool science here is in the advancements in their statistical model and new techniques they've developed for "scoring" translations in order to improve their output. In addition, they've also demonstrated the ability to statistically translate whole phrases effectively, rather than individual words, which can also improve translation quality. The fact that you've missed all this makes me wonder if you actually *read* the press release.

  5. Re:Obsolete? by Surak · · Score: 5, Insightful

    'Almost everyone'? What *are* you talking about? You must be an American. From a recent online Harris poll, most Americans think at least half the world speaks English. This is just plain wrong. The truth of the matter is that it's more like 20%. That's it. Most people on the NET might speak English, but most people in the world? Hardly.

  6. Re:Obsolete? by Anonymous Coward · · Score: 2, Insightful

    English may be the closest thing we have to a universally-spoken language, but it certainly isn't going to become the -only- language any time soon, if ever. If all other languages disappeared, though, we would definitely need translation for all the literature we have that isn't written in English.

  7. Old Texts by holygoat · · Score: 5, Insightful

    Firstly we could consider the enormous body of work currently available in other languages.
    Having this able to be translated into English or other languages could be very valuable for scholars.

    Secondly, English is not the primary tongue for the majority of people on the planet. To suggest that, because a lot of people can manage to converse in it, the ability to translate between other languages isn't valuable is foolish.

    Also note that the article specifically mentions Arabic and Chinese, which I don't think crossed your mind. China has the largest population on the planet, remember.

    Translation is far from obsolete, especially given that the majority of the Western world, and especially America, is piss poor at being bilingual.

    1. Re:Old Texts by OmniVector · · Score: 2, Insightful

      A friend of mine, Hani, who is from Egypt told me a joke once.
      "What do you call a person that only speaks one language?" A: An American

      It's quite true when you think about it. He said that when he was growing up he had a choice between going to a French school or an English school, where the given language was taught just as much as Arabic. Americans really need to be taught French or Spanish at a MUCH younger age (say 5, right as they start kindergarten).

      --
      - tristan
  8. Re:Obsolete? by Anonymous Coward · · Score: 1, Insightful

    And if it wasn't for Spanish (and South America), Americans would think 100% of the world speaks English..

  9. Re:Obsolete? by GoofyBoy · · Score: 2, Insightful

    http://www.britishcouncil.org/english/engfaqs.htm#howmany

    Translators are needed for 3/4ths of the world. Not what I would call close to obsolete.

    --
    The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
  10. Statistical approach looks promising by TwistedGreen · · Score: 4, Insightful

    "One of the great advantages of the statistical approach," Och explained, "is that most of the work goes into components that are language-independent. As long as you give me enough parallel data to train the system on, you can have a new system in a matter of days, if not hours."

    This statistical method is probably the best approach to computerized translation. It seems to approximate how the human mind will translate a given sentence most efficiently. Language can get awfully complex, and individual words often have, at best, an ambiguous meaning when interpreted alone. One must take into account the context of that word to specify and refine its meaning. This obviously leads to a huge number of permutations to represent a huge variety of thoughts, but the relative size of this number is diminishing as computers become more powerful.

    Therefore, instead of playing with messy grammars and sentence structures, we can simply have a catalogue of thoughts as represented by words, and correlate that catalogue with a different set of words to facilitate translation. This software would operate on a deeper level than it would if it operated with the words and symbols themselves. It would utilize a map of the deep structures of language, instead of a map of the less-meaningful words and grammars.

    I really like this method, and while it may seem like a brute-force hack applied to translation, the simple fact that languages do not contain elegant patterns must be accepted. It also appears to be a most efficient method, as the simple comparisons involved would bring the speed of translation into realtime.
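A minimal sketch of how word correspondences can be learned from parallel text alone, with no grammar or dictionary. It uses IBM Model 1, a standard textbook alignment model trained by expectation-maximization; the press release does not say this is what Och's system does, so treat it as an illustration of the general idea only. The three-sentence German-English corpus and the function names `train_model1` and `best_translation` are made up for the example.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) from sentence pairs via EM."""
    # Any constant works as a starting point: the first E-step then splits
    # each target word evenly across the source words in its sentence.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (target, source) links
        total = defaultdict(float)   # expected count of links per source word
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(float, {(w, s): c / total[s] for (w, s), c in count.items()})
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {w: p for (w, s), p in t.items() if s == source_word}
    return max(candidates, key=candidates.get)


# Toy parallel corpus (hypothetical): source sentence, target sentence.
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
table = train_model1(corpus)
print(best_translation(table, "haus"))
```

Nothing in the code knows that "haus" is a noun or that "das" is an article; the pairing falls out of which words keep showing up together across sentence pairs.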

  11. Re:The vodka is strong but the meat is rotten by rossz · · Score: 4, Insightful

    That particular phrase translated badly because they used a word-for-word translation program. You simply can't do that, especially when dealing with euphemisms. This new system is the only possible way that could properly translate text.

    My wife is a professional translator and has absolutely no respect for machine translations.

    --
    -- Will program for bandwidth
  12. Copyright issues by PhilHibbs · · Score: 1, Insightful

    I wonder if the resultant translation engine could be considered a derivative work of the texts that populated it. This system is standing on the shoulders of all the translation efforts that went in to it. I think it's a great idea, but in the current IP climate, could well be shot down in flames. How much dual-language text is available in the PD or on open content licence?

  13. Re:The vodka is strong but the meat is rotten by Abcd1234 · · Score: 2, Insightful

    Heh, given this is a not-uncommon phrase in the English language, it very well may be in their English-to-target-language corpus, meaning it could end up being a straight lookup-and-translate operation. Which is, of course, one of the advantages of a system like this (you can translate common idioms without having to analyze the text itself).
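The lookup-and-translate idea can be sketched as a greedy longest-match search over a phrase table. This is an assumption about how such a lookup might work, not the researchers' actual decoder; the function name `translate` and the tiny English-to-French table are hypothetical.

```python
def translate(tokens, phrase_table, max_len=4):
    """Greedy longest-match lookup: prefer the longest known phrase at each position."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest span first so a whole idiom wins over its parts.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])
                i += n
                break
        else:
            # No entry of any length: pass the word through unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Hypothetical phrase table; a real one would be learned from parallel text.
table = {
    "kick the bucket": "casser sa pipe",
    "the": "le",
    "bucket": "seau",
}

print(translate("kick the bucket".split(), table))
print(translate("the bucket".split(), table))
```

Because the idiom is stored as one entry, it never gets broken into "kick" + "the" + "bucket", which is exactly where word-for-word systems go wrong.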

  14. Re:A bit of a worry for privacy by bigjocker · · Score: 4, Insightful

    This is a bit of a worry for privacy concerns, given that if I want to keep something secret from the world and private just between me and my intended recipient I have one less option.

    If you are using foreign languages or even lexically analyzable schemes to do your encryption, you deserve what you get.

    --
    Life isn't like a box of chocolates. It's more like a jar of jalapenos. What you do today, might burn your ass tomorrow.
  15. ignoring grammar seems strange by meshko · · Score: 2, Insightful

    I understand that this is a cool idea for building automatic translators, but is it practical? Basically what they are doing is taking a well-researched domain of languages and trying to make something new and cool in it by completely ignoring the domain knowledge. My intuition tells me that "always use as much domain knowledge as possible" is an engineering axiom.

    --
    I passed the Turing test.
  16. Re:I expect they used many Bible versions by ejdmoo · · Score: 3, Insightful

    Actually, I think that this may be an interesting way to translate the Bible (assuming you didn't use the Bible itself as a reference...that would skew the translation).

    Think about it: every translation of the Bible is always criticized for some reason. If the Bible were translated this way it could be like the Google news of Bible translations: completely independent of human bias and editing.

  17. Re:Obsolete? by JohnsonJohnson · · Score: 2, Insightful

    BTW, every "% of humanity" statistic has to consider that most humans are Chinese.

    If you want to be even remotely close to statistically significant you have to include citizens of India as well, most of whom are very different from those of Chinese descent. In fact most people will probably be Indian citizens within the next 20 years. However, citizens of India are a more heterogeneous population than that of China. Then again, Chinese of the diaspora (e.g. in Malaysia, Indonesia, the Philippines, Vancouver etc.) are also a large population but can be very different from mainland Chinese. So I guess in the end every % of humanity statistic that measures some culturally derived phenomenon has to be considered BS.

  18. Re:Could help by Anonymous Coward · · Score: 1, Insightful

    In this case I believe the statistical analysis would work. One would just use the 4 other Potter books on the market and their subsequent translations into German. I'm sure JK Rowling writes in a similar style in all the books... so one phrase that means one thing in one book should mean the same in the new one... so for books in a series maybe this should be the first crack? (And then have actual translators correcting what may be wrong?)

  19. Re:The vodka is strong but the meat is rotten by bogado · · Score: 2, Insightful

    I doubt computers will ever get near a good translator. Sure, they can make some people lose their jobs translating math theses, but a book, a play, a movie or even a conversation has to use humans. Humans are the only ones that can really understand what is going on; a human translator (a good one) knows about the culture of both countries involved in the translation. They can understand the subtext and change the words so they have the same subtext in the other language.

    A good book has many things to be learned that are not written in words.

    --
    []'s Victor Bogado da Silva Lins

    ^[:wq

  20. Re:Oh, please no... by radish · · Score: 3, Insightful

    You're right, traditional machine translation is difficult, primarily due to context. However, you're also right that the example you gave is a bad one - in english it only has one meaning (the second one you give). A HDD controller would never have an assigned gender. Of course in German for example, it would (not sure which though - neuter?).

    However you're missing what I think is the most important point. If an example is so ambiguous as to confuse an "ideal" machine, it would confuse us too. What you're really saying is "it is possible to write sentences with ambiguous meaning in most languages" - which is of course true. That doesn't however make it impossible to create a machine which is at least as good as a human at translating (and wouldn't that be good enough?). When you read something you interpret it according to a set of learned rules. Obviously there's the basic syntax and vocab, but then you add context like the other clauses in the prose, the identity of the author, the subject matter. We're a long way off getting those concepts into a machine reader, but I would be very hesitant to say we'll never get there.

    Besides, the article is about taking a different approach to the problem - one which should be quite happy with ambiguity. They're looking at essentially pattern matching, so provided your sample data sets include enough info to describe the ambiguity it should have a decent enough chance of working it out.

    --

    ---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"

  21. Re:A poor analogy, and a poor method by Abcd1234 · · Score: 5, Insightful

    If they offered me the same money (and one of those Linux NetworX clusters) I could have a superior system in a month, although (as stated above) it would require more than one known language.

    LOL! If this problem was so friggin' easy, why are these researchers the first to demonstrate a working system using this technique (which blows away all existing systems, BTW)? Hell, if it's as easy as you say, this whole "translating text" thing must be a breeze. I wonder why so much money is spent every year on R&D in this area? Hell, why didn't they just hire you to whip up a system in a month?

    Why? Because it ain't that easy and you have no idea what you're talking about. Given these are world-class researchers, I'm sure they've considered the multiple-translation route, and subsequently rejected it for very good reasons (likely far more complex than your simplistic "it's easier" excuse). Moreover, the really hard work in this area is the statistical modelling necessary to generate a working system, something which would, I suspect, be far more complex if a multiple-translation route were taken. But, hey, that's just some number crunching, right? What's so hard about that?

  22. Re:A bit of a worry for privacy by nanojath · · Score: 2, Insightful
    It's time for us all to get over the fact that technology is going to end practical privacy. It's a done deal. Cameras and microphones will get smaller and smaller. Translation, electronic selectivity (i.e. snoop anybody transferring bombmaking directions) and tapping of all forms of electronic conversation will get more and more sophisticated. I've no doubt the NSA made PGP its bitch a long time ago. If they hadn't, it would be getting fought a lot harder. Assuming you can get real privacy from something on the scale of the government is just foolish.


    I'm not, incidentally, saying just live with it. I'm saying, you can't stop the technology, you have to fight it on the level of policy and practice. Get interested in the work of privacy advocates, work for a constitutional amendment guaranteeing privacy in the same manner as freedom of expression, protest egregious violations of privacy (basically, be against John Ashcroft).

    --

    It Is the Nature of Information to Transgress Artificial Boundaries

  23. Re:A poor analogy, and a poor method by William+Tanksley · · Score: 4, Insightful

    If you double the number of known languages, you more than quarter the number of errors

    Your post is reasonable and interesting (using three-way parallelism would give better translations), but you're missing something important here.

    First, none of these languages are "known" to this interpreter program. The program reads parallel texts, and when you feed it a text without a parallel, it generates the parallel for you. In other words, it can translate either way. So you don't have two known languages and one unknown; all you have is three text corpuses. (Well, in this case you have two, but you know what I mean.)

    Second, yes; three would be FAR better than two; but two is also useful, and in more situations. You don't always have a Rosetta stone.

    They're doing well here. Yes, there's an obvious next step to take; but no, the existence of a "next step" doesn't destroy the usefulness of this step.

    -Billy

  24. Seems similar to Bayesian spam filter programs... by jetsetscoot · · Score: 2, Insightful

    ... where the more available examples of actual spam and actual non-spam the better the accuracy of the result, and where you basically let the computer work out the probability, rather than feeding it hard and fast rules up front.

    Can anyone say if the two procedures are technically related?
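For reference, the spam-filter side of that comparison can be sketched as a small naive Bayes classifier. It shows only the "let the computer work out the probability from examples" part and makes no claim about how the translation system is built. The function names `train` and `classify` and the four training messages are invented for the example.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (label, text). Returns per-label word counts and document counts."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Pick the label with the highest log-probability under naive Bayes."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Prior: how common this label is among the training messages.
        score = math.log(doc_counts[label] / total_docs)
        # Add-one smoothing so an unseen word does not zero out the score.
        denom = sum(counts.values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training messages.
examples = [
    ("spam", "buy cheap pills now"),
    ("spam", "cheap pills cheap"),
    ("ham", "meeting notes attached"),
    ("ham", "lunch meeting tomorrow"),
]
word_counts, doc_counts = train(examples)
print(classify("cheap pills", word_counts, doc_counts))
```

No rule says "pills means spam"; the label comes purely from counts over the examples, which is the shared flavor the comment is pointing at.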

  25. Re:Or a "culturally superior" American. by raehl · · Score: 2, Insightful

    For starters, we specifically target young people when asking questions where a non-native language will be required. 3-4 of the people were employees, indicating at least a passing knowledge of "What track is this train on?" in a few European languages might be a job-relevant talent. Additionally, the sneer. Attitude is attitude regardless of what country you're in.

    We don't expect people to know foreign languages. We *DO* find it amusing when people who are razzing *US* for not knowing THEIR language do not know any foreign languages.

  26. A plan for translation? by Jeremi · · Score: 3, Insightful
    Actually this system reminds me a lot of the good old Bayesian Spam detector algorithms... but instead of trying to determine what category of content an email contains, the statistical classifier is trying to determine (e.g.) what English phrase a Russian phrase most closely matches.


    Given the impressive progress made by Bayesian algorithms in spam detection, I wouldn't be surprised to see impressive results from this method either.


    So bravo for Franz Och! He's taken what appeared to be an intractable problem requiring magic AI to solve, and perhaps found a way to solve it effectively using the stupid brute force methods computers are so good at.

    --


    I don't care if it's 90,000 hectares. That lake was not my doing.
  27. Re:A poor analogy, and a poor method by Draxinusom · · Score: 2, Insightful

    RTFA. The method described in the article is a purely statistical method, NOT a semantic one; it has zero "knowledge" of grammar, syntax, or meaning. So having more than one "known" language to start with would not help in the slightest, because the advantages that you describe are only applicable to semantic methods.

    I agree though that the analogy to the Rosetta Stone is a poor one.

  28. Re:The vodka is strong but the meat is rotten by rossz · · Score: 3, Insightful

    Because they suck, of course. She uses computers to assist her. It's just a tool. Just as you can't expect a wrench to rebuild your transmission, you can't (currently) expect a computer to create a proper translation. That will change in the future (as this article shows).

    Currently, computer translations work the best in technical documents and the worst in prose (stinking turd horribly bad quality translations).

    BTW, computer translations have never been any kind of competition for work. These days, competition is from untrained college students in Central Europe. All too often a Romanian student who "knows Hungarian" bids a couple of pennies per word, far under the going rate and far too little for my wife to consider as reasonable pay. The resultant translation sucks, but that's to be expected from someone who not only isn't trained as a translator, but also doesn't have a good command of either language in question (Hungarian and English).

    Oops, I started ranting.

    --
    -- Will program for bandwidth
  29. Re:statistics is the key by Jeremi · · Score: 4, Insightful
    English is not a language. Or rather, it resembles one but is more not than is, IMO. It is a large collection of idiomatic expressions that changes quite rapidly


    You are actually arguing that English is not a dead language. Every language that is actually in use by large numbers of people is as you describe.

    --


    I don't care if it's 90,000 hectares. That lake was not my doing.
  30. A flawed approach by Oryx3 · · Score: 2, Insightful

    And where are you going to find gigabytes of parallel Klingon-English texts?

    No seriously, this is the fallacy behind any statistical approach to automated translation. The news release gives the telling comment:

    "Different human translators' versions of the same text will often vary considerably. Another key improvement has been the use of multiple English human translations to allow the computer to more freely and widely check its rendering by a scoring system. This not coincidentally allows researchers to quantitatively measure improvement in translation on a sensitive and useful scale."

    This paragraph just doesn't make any sense to me. Either it's badly explained, or the entire approach is flawed:

    • You have to start with correctly human-translated and aligned texts to begin with. How many versions of the same text are you willing to pay for?
    • Most likely, you will have some texts well translated, and some badly translated. How do you rate the relative quality of each version? How many translators does it take to revise gigabytes of text? (One to screw in the lightbulb...)
    • A large percentage of existing translations are mediocre. So you are going to get mostly bad translation out, since they don't even attempt to build any linguistic knowledge into the system. GIGO rules!

    Statistical methods just cannot deal with the subtlety of meaning to be found in natural language texts. It's a little like believing that you can always win at chess if you can just look ahead far enough. I believe that this approach is inherently limited and any apparent success is illusory. This news release hasn't changed my opinion.

    Sorry to be a party-pooper, but that's how I feel.
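One plausible reading of the "scoring system" in the quoted paragraph is clipped n-gram precision against several reference translations, the core ingredient of BLEU-style metrics. The release does not name the metric, so this is an assumption; the function name `clipped_precision` and the example sentences are illustrative only.

```python
from collections import Counter


def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams found in any reference, with clipped counts."""

    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.split())
    if not cand:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, c in ngrams(ref.split()).items():
            max_ref[gram] = max(max_ref[gram], c)
    # Clip: a candidate n-gram earns credit at most max_ref[gram] times.
    matched = sum(min(c, max_ref[gram]) for gram, c in cand.items())
    return matched / sum(cand.values())


references = ["the cat is on the mat", "there is a cat on the mat"]
print(clipped_precision("the cat is on the mat", references))
print(clipped_precision("the the the the the the the", references))
```

Having more than one human reference widens what counts as an acceptable wording, and clipping stops a degenerate output from scoring well by repeating one common word.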