Slashdot Mirror


Romancing The Rosetta Stone

Roland Piquepaille writes "Not only does this news release from the University of Southern California have a fantastic title, it also has great content. This story is about one of their scientists, Franz Josef Och, whose software ranks very high among translation systems. "Give me enough parallel data, and you can have a translation system for any two languages in a matter of hours," said Dr. Och, paraphrasing Archimedes. His approach relies on two concepts: gathering huge amounts of data, and applying statistical models to this data. It completely ignores grammar rules and dictionaries. "Och's method uses matched bilingual texts, the computer-encoded equivalents of the famous Rosetta Stone inscriptions. Or, rather, gigabytes and gigabytes of Rosetta Stones." Read my summary for more details."
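
To make the "matched bilingual texts plus statistics" idea concrete, here is a minimal sketch in Python. It is not Dr. Och's system: it assumes a toy, hand-made parallel corpus (`PAIRS`) and scores word pairs with a simple Dice co-occurrence coefficient, with no grammar rules or dictionary involved. All names (`train`, `best_match`, `translate`) are hypothetical and exist only for this illustration.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus (an assumption for illustration, not real training data).
PAIRS = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]

def train(pairs):
    """Count how often each source word, target word, and word pair occurs."""
    src_count, tgt_count, both = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        both.update(product(s_words, t_words))
    return src_count, tgt_count, both

def best_match(word, model):
    """Return the target word with the highest Dice score, or None if unseen."""
    src_count, tgt_count, both = model
    scores = {
        t: 2 * both[(word, t)] / (src_count[word] + tgt_count[t])
        for t in tgt_count
        if both[(word, t)]
    }
    return max(scores, key=scores.get) if scores else None

def translate(sentence, model):
    """Word-by-word lookup; unknown words are passed through unchanged."""
    return " ".join(best_match(w, model) or w for w in sentence.split())

model = train(PAIRS)
print(translate("the car", model))
```

A real statistical system would align phrases rather than single words and would need far more data, but the principle is the same: the pairing of "house" with "maison" falls out of counting alone.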

35 of 486 comments

  1. Re:oh oh... by Anonymous Coward · · Score: 4, Interesting

    This is exactly NOT a universal translator as it uses matched bilingual texts. You need an already translated text for his system to work.

  2. A bit of a worry for privacy by Anonymous Coward · · Score: 1, Interesting

    This is a bit of a worry for privacy concerns, given that if I want to keep something secret from the world and private just between me and my intended recipient I have one less option.

    How long until this is able to decode things like speech, too, and convert it into something recognisable in another language? Would it still hold my voice patterns and sound like me? And if it were converted back to the English I already do speak, with mistakes, could that then be used against me in a court of law?

    Scary stuff

  3. The vodka is strong but the meat is rotten by zptdooda · · Score: 5, Interesting

    That's an example from a few years back of an attempt to translate "the spirit is willing but the flesh is weak" from English to Russian and back to English using a different translator.

    Can anyone try this on the new (or some other recent) algorithm?

    BTW here's Doc Och's most recent website:

    Franz Josef Och

    --
    Esteem isn't a zero sum game
    1. Re:The vodka is strong but the meat is rotten by Alton_Brown · · Score: 1, Interesting

      With all due respect, does your wife have no respect because they currently stink compared to a human or because she'll be out of a job when they're sufficiently accurate?

      Who thought computers would grow up and play chess so well? Who thought they'd be building cars? Certain jobs will go to machines, but jobs will still be there in a re-defined state. If DARPA has an interest in the technology, it's only a matter of time before the system approaches the accuracy level of a human. After all, on the translation side language is largely a logic problem. It's on the conversational side that you actually need AI.

    2. Re:The vodka is strong but the meat is rotten by Fratz · · Score: 2, Interesting

      My wife is a professional translator and has absolutely no respect for machine translations.

      Most of them suck, but I worked on a system that was actually quite good. It was designed for technical documentation in the heavy equipment domain, and because of this limited use, we were able to constrain the input grammar and vocabulary, which made it easier to make very good translations.

      We worked with some of the best human translators around to make it as accurate and natural-sounding as possible, but we made the mistake of allowing the human translators at our customer's company to evaluate the system. They felt threatened by it and decided they didn't like it, even when they had to criticize sentences the system generated which were word-for-word what they asked us to make the system do.

      --
      -- Fratz, human
    3. Re:The vodka is strong but the meat is rotten by JJ · · Score: 4, Interesting

      This actually is a myth. That particular text and translation was taken as anecdotal in a 1964 report. I did a masters thesis on MT at the University of Chicago and my advisor (once a major figure in MT) refused to approve my thesis until I got that statement correct.

      --
      So long and thanks for all the fish . . . !!!
  4. Finally, the correct approach by tuxlove · · Score: 4, Interesting

    I believe that using a statistical approach like this is a step in the right direction. Manually building sets of rules, dictionaries, etc., is a waste of time and hard to do. And manually built systems become stale as languages evolve, unless a lot of continuing work is done.

    For me the holy grail is when I can converse with a computer meaningfully. I believe a similar approach will be required for the computer to "understand" language, and to be able to formulate a coherent and appropriate response.

  5. Related Independent article last week by Anonymous Coward · · Score: 1, Interesting

    The battle for the Rosetta Stone "Things are looking decidedly rocky at the British Museum - Egypt's leading archaeologist has demanded the return of the Rosetta Stone. But the museum argues that the removal of the four-foot slab that unlocked the mysteries of the pharaohs would be disastrous"

  6. Integration by slusich · · Score: 3, Interesting

    Sounds like a brilliant idea. Hopefully this is something that could eventually be compacted enough to fit into consumer electronics. It would be great to be able to watch TV from every country without any language barrier!

    1. Re:Integration by ahfoo · · Score: 3, Interesting

      Not to sound arrogant, but I find actually learning another language by watching foreign TV with subtitles in the original language to be even more interesting than watching the dubbed or English subtitled version. It involves commitment to get to the point where you can understand the basics, but there are rewards to making a commitment to learn something new.
      I like the idea of translating sentence by sentence as opposed to grammatically and word for word. I'm sure this guy is right that at some point this will produce reasonably accurate translations in many cases, but multiple languages are one of our greatest treasures.
      I have read that the single most important factor in preventing senile dementia is the difference between those who continue to create novel memories throughout their lives and those who stick to what they have already learned. Learning multiple languages is a wonderful thing and once you get well into it, it is a lot of fun. It certainly increases your options for punning and rhyming and you end up with lots of aliases.

  7. Dialects? by dethl · · Score: 2, Interesting

    How can this system compensate for the different dialects of all of the different languages?

    --
    "Some fight for law. Some fight for justice. What will you fight for? One day, you will see."
  8. Re:Obsolete? by lildogie · · Score: 2, Interesting

    > Americans think at least half the world speaks English.

    Better-informed Americans (a small minority of the class) would be aware that Spanish is well on the way to becoming the predominant language in the USA.

    But, IMHO, English could become the next Latin: the dead language that everybody has to learn if they're going to try and influence the world.

    BTW, every "% of humanity" statistic has to consider that most humans are Chinese.

  9. Re:DARPA by kmac06 · · Score: 1, Interesting

    Kneejerk /. response: it's a government conspiracy to take away more of our rights.

    Kneejerk /. mod response: he's right.

  10. Re:Could help by Abcd1234 · · Score: 4, Interesting

    I'm not sure this is really applicable to translating literary works. These kinds of translations require an understanding of the native culture of both the source and target languages, as well as the intent of the writer, in order to generate an understandable translation that the target group can appreciate. A computer translation system like this one is incapable of performing these sorts of analysis.

    What this is really good for is on-the-fly translation of material where the reader simply wants to comprehend what was written (think the old babelfish engine). This has obvious applications on the web, as well as many other areas (on-the-fly server-side translation for IM systems, etc, etc).

  11. You don't get it, do you? by mossr · · Score: 2, Interesting

    ***WHAT THE FUCK ARE YOU THINKING?***

    Look, seriously, even if everyone did speak English, there are still tonnes of literary works in other languages - the original texts of the Ancient Greek classics, for example. To read in the original language is often a much more rewarding experience. Besides, relying on past translations of non-English material can lead to errors. And consider how many different English translations of the Bible there are.

    Almost everyone can speak, read and write at least tolerable english

    Almost everyone can communicate using gestures, facial expressions and grunts, but is that any reason to use that as our primary communication method? I mean, to really stretch a metaphor from human languages to programming languages, we can write any computer program "tolerably" in assembler (it's Turing-complete), but that doesn't mean it's the best way to do it. If I can only speak one language "tolerably", but another exceptionally well, which one is better for conveying my ideas?

    most young people can have full fledged discussions in it

    I don't think we can rely on "d00d, u r so l33t" to teach people true literacy. Young people are increasingly using SMS and online chat and are actually losing their ability to correctly spell words or write grammatically correct sentences. The number of young adults I see who cannot distinguish correctly between there, their and they're is ABSOLUTELY TERRIBLE. Literacy is a major problem in English-speaking nations.

    Just look at Slashdot, I'm quite sure I'm not the only one who doesn't have english as primary language

    That doesn't mean you can use it well. Take a good look at Slashdot - many, many people mangle the English language. The American people are probably the biggest infringers here... :)

    It's not that farfetched idea that in the (near) future everyone uses or at least knows english well enough to make translations meaningless

    Human languages don't map to each other 1:1. Some languages have words that basically cannot be translated without a serious loss of accuracy. (I guess you could say that no human language is Turing-complete, in that it can't totally express every conceivable human thought.) Having everything translated to English is NOT a solution. Brevity, language tricks (such as puns, rhyming, etc) cannot always be substituted across languages.

    If it wasn't 2:15am in Melbourne right now, I'd try to order my thoughts and express them more clearly, but after 4 hours of Java debugging I'm off to get some sleep before uni tomorrow. Goodnight.

    --
    The PowerPC includes for this purpose two instructions called SYNC and EIEIO.
    1. Re:You don't get it, do you? by technothrasher · · Score: 2, Interesting
      I don't think we can rely on "d00d, u r so l33t" to teach people true literacy. Young people are increasingly using SMS and online chat and are actually losing their ability to correctly spell words or write grammatically correct sentences. The number of young adults I see who cannot distinguish correctly between there, their and they're is ABSOLUTELY TERRIBLE. Literacy is a major problem in English-speaking nations.



      Get off your high horse already. Unless you use English like that below, then (by your rules) your grasp of English is also "ABSOLUTELY TERRIBLE":


      Hwæt! Ær þissum dæge seofon wintra and hundeahtig, ure ealdfaederas acennodon on þissum lande niw rice, geacnod on freodome and gegiefen to þæm geþohte, þæt ealle menn beoð gelice gesceapen.

      (Hint: Language is an evolving tool for communication, not a political weapon to keep the ruling elite in power)

  12. What about copyrights? by The+Lord+of+Chaos · · Score: 1, Interesting

    The big problem I see with this scheme is how you collect the gigs of data (i.e. content) without wholesale copyright violation or licensing (big bucks). Sure, you can get lots of content whose copyright ran out from Project Gutenberg. But that's gonna be 70+ year-old stuff.

    Add the fact that the Mickey Mouse Copyright Extension act and related legislation threaten to extend copyright terms for infinity minus a day and you're never gonna have much content available that reflects CURRENT usage of the languages you're trying to translate.

  13. Ranking System by freeze128 · · Score: 2, Interesting

    Even existing translation programs could benefit from a ranking system. Wouldn't it be helpful if you could tell just how confident the translator is about a certain phrase or word? That way, you could rephrase your sentence before you foolishly ask someone to "taste" you....
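
A rough sketch of what such a confidence score could look like, in Python. It assumes (hypothetically) that the translator can list every way a phrase was rendered in its training data; the confidence is then just the relative frequency of the top choice. The function names and the 0.8 threshold are illustrative assumptions, not part of any real translation product.

```python
from collections import Counter

def rank_candidates(observed):
    """Turn a list of observed renderings into (translation, confidence) pairs,
    most frequent first. Confidence is the share of observations."""
    counts = Counter(observed)
    total = sum(counts.values())
    return [(text, n / total) for text, n in counts.most_common()]

def needs_rephrasing(observed, threshold=0.8):
    """Flag a phrase when the best rendering falls below the chosen threshold."""
    ranked = rank_candidates(observed)
    return not ranked or ranked[0][1] < threshold

# Hypothetical renderings of one ambiguous source word seen in matched texts.
observed = ["try", "try", "try", "taste"]
for text, confidence in rank_candidates(observed):
    print(f"{text}: {confidence:.2f}")
print("rephrase?", needs_rephrasing(observed))
```

With a score like this attached to each phrase, the interface could highlight the shaky parts of a sentence before it is sent.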

  14. If you want a universal translator... by flicken · · Score: 4, Interesting
    ...here is a link to the Universal Networking Language (UNL). UNL is a computer markup language that allows the author of the text to specify how exactly the text should be translated (i.e. what the precise definition of the words in the text are). Taking this specification, a machine is able to produce a readable version of the text in a variety of languages.

    It's not quite done yet, but the system does show promise. Dictionaries have already been created in Spanish, English, German, Japanese, Italian, French and several other languages.

    --
    20 mil and I will! Learn Esperanto with 20M others.
  15. Several Missing Details by Flwyd · · Score: 5, Interesting

    As press releases tend to do, this leaves much to be desired for folks who are familiar with the discipline. As I read it, it seems to imply that the main driver is phrase-matching. What does it do with phrases it hasn't seen before? The problem is solved by throwing lots of data at it -- how much data is needed for a reasonable system? How well does it generalize to text outside the domains of the training data?

    Incidentally, had my brother been a girl, he was in serious danger of being named Rosetta Stone.

    -- Trevor Stone, aka Flwyd

    --
    Ceci n'est pas une signature.
  16. Wordrank by chronos2266 · · Score: 2, Interesting

    I always thought it would be interesting if Google applied its PageRank algorithm to provide a translation service. Like poll the top 5 translation service sites for a translated sentence and then, based on what each of them returns, generate an 'average' or best possible result for that sentence.
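
One way to picture the "average of several translators" idea is a consensus vote: keep the candidate that agrees most with all the others. The Python sketch below is an assumption-laden illustration, not PageRank and not any real service's API. The candidate strings are hard-coded stand-ins for what different sites might return, and agreement is measured with a plain word-overlap (Jaccard) score.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences, from 0.0 to 1.0."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def consensus(candidates):
    """Return the candidate with the highest total similarity to the others."""
    def support(i):
        return sum(
            jaccard(candidates[i], other)
            for j, other in enumerate(candidates)
            if j != i
        )
    best = max(range(len(candidates)), key=support)
    return candidates[best]

# Hypothetical outputs from four different translation services.
candidates = [
    "the spirit is ready but the flesh is weak",
    "the spirit is willing but the flesh is weak",
    "the spirit is willing but the body is weak",
    "the vodka is strong but the meat is rotten",
]
print(consensus(candidates))
```

The outlier loses because nothing else resembles it. The obvious weakness is that if most services share the same mistake, the vote will happily confirm it.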

  17. Or a "culturally superior" Parisian Frenchman. by raehl · · Score: 1, Interesting

    When I lived in Europe, a friend and I went to Paris. We're both bilingual; myself German, him Spanish, but unfortunately neither of us knew French. We had occasion to ask which train we needed to be on to get somewhere, and asked (in French) if the person we were asking for directions knew Spanish, English or German. We went through a good ten people before we found someone willing to admit that they spoke something other than French.

    I'm sure they thought they were being all "Ha-ha, I will not let these Americans get away with not speaking French!" but our interpretation of the situation was "We're Americans, we speak two languages, what's wrong with you?"

  18. Translate Pascal To C and Such by Potpatriot · · Score: 4, Interesting

    How about piping in various algorithms encoded in Pascal and C into the thing and seeing what it does to convert arbitrary sources? Where can I get the source? Pawel

  19. Programming Languages? by The+Raven · · Score: 5, Interesting

    I wonder how this would fare putting two computer languages side by side? I mean... take a few thousand programs, coded using the same algorithms but different computer languages... would his language translation software translate between them? Would it be able to differentiate between languages that manually allocate memory and those that use garbage collection? How about between procedural languages like C, and more esoteric and oddly structured languages like LISP?

    An interesting challenge, eh?

    Would there be any benefit to this?

    --
    "I will trust Google to 'do no evil' until the founders no longer run it." Hello Alphabet.
  20. Re:oh oh... by Man+Eating+Duck · · Score: 2, Interesting


    I think what was implied was that if you already had a translation engine trained for English/Japanese, when you are training it for English/French you can use the already existing "metadata" for English/Japanese to make the process quicker (requires smaller datasets to achieve the same precision).

    I might be far out here. Excuse my crappy English, btw.

    --
    Are you a grammar Nazi? I'm trying to improve my English; please correct my errors! :)
  21. But Can It Do Klingon? by opti6600 · · Score: 2, Interesting

    Now that would be cool.

    Seriously though, this leaves only the odd tribal languages of African (and perhaps South American?) tribes that are comprised entirely of clicks and guttural sounds as not easily comprehended. Could this system's approach finally result in a Babelfish-like universality even for languages such as Chinese and Japanese? The added complexity makes it much more challenging for things like Babelfish, but if this system can do it, it's going to be a landmark discovery.

    Anybody have any further research by this guy? I'm interested! Who knows, maybe I could have gotten a better grade in French thanks to this research...

  22. Not to mention.. by k98sven · · Score: 3, Interesting

    The Rosetta Stone itself did not do much in the way of our knowledge of the Egyptian language.
    What it did do was provide insight into their method of writing.
    It was the later discovery of the relation between Coptic and Egyptian that revealed most of the actual language.

    (IIRC)

    1. Re:Not to mention.. by LenE · · Score: 3, Interesting

      For those who don't know, Coptic is Egyptian written in Greek, or at least the Greek alphabet. It would be similar to transcribing a language that uses glyphs for words by recording them with the phonemes and alphabet of another language.

      A more modern example is what happened with the Slavic Croatian language. The original speakers had a glyph-based alphabet called Glagolitic, through the Middle Ages. This would be as foreign as Egyptian hieroglyphs to people today, and could stand in nicely for an alien text in any sci-fi movie.

      Through falling under different feudal states (Venice, Austria-Hungary) the language was cast under both the Cyrillic and Roman alphabets. Today Croatian uses an accented Roman alphabet (like French), but each letter has only one pronunciation, like Russian.

      -- Len

  23. statistics is the key by gemseele · · Score: 5, Interesting

    Time for inflammatory reasoning. The statistical approach will beat out the grammar- and rule-based ones, at least for English, for one simple reason:

    English is not a language

    Or rather, it resembles one but is more not than is, IMO. It is a large collection of idiomatic expressions that changes quite rapidly (and not only in colloquial forms, just look at what the political-correctness movement has done to phraseology). You know the story... more exceptions than rules, things that are legitimate to say language-wise are considered incorrect anyways, and vice versa, etc. etc.

    That's not to say it doesn't have advantages; it's relatively easy to learn the basics of communication since it's weakly conjugated, has genderless articles, fairly simple uncased sentence structure. But, it is monstrous to master and I suspect most native speakers aren't true masters (not to mention the orthographical nightmare; is English the only language with spelling bee contests?)

    The reason it's the new lingua franca (or should it be lingua angla now?) is techno-socio-political as is always the case. Stop harping on Americans for being largely mono-lingual. "Why didn't the Romans learn the local languages when they controlled Europe? Because they didn't have to." If every state spoke a different language, which would be more akin to Europe, then there would be need.

    1. Re:statistics is the key by The+Cydonian · · Score: 2, Interesting
      English is not a language... [because it]... is a large collection of idiomatic expressions that changes quite rapidly

      Fair enough, English changes rapidly alright, but how would you define a language? A set of logical syntactic and semantic rules that haven't changed for the past few thousand years? I can think of only two languages like that, Latin and Sanskrit.

      Nope, I can't agree with your assertion; language is much more than mere (unchanging) grammar. In many multi-cultural places, it is a strong factor for socio-political identities; throughout history, communities have fought against great powers to assert their linguistic identities.

      Stop harping on Americans for being largely mono-lingual. "Why didn't the Romans learn the local languages when they controlled Europe? Because they didn't have to." If every state spoke a different language, which would be more akin to Europe, then there would be need.
      Actually, there are 329 languages spoken in the United States, many of which are spoken only in the US and nowhere else.

      Of course, like in other countries, most of these languages will probably end up as an anthropologist's museum specimens, but really, mono-lingualism of most educated Americans is not because you speak only English in the US. It's mostly because the numbers of other languages aren't quite there.

      Which brings us to a very interesting conjecture; I'm no American (nor have I visited the area in question, so I appreciate responses on this), but if I may hazard a guess, by the 2030s, learning Spanish will be essential to live in most of the south and south-western US. That is to say, I assert that the current predominance of English in the US is only a historical accident, one that will change with shifting demographics.

  24. Re:The Law of Eventuality by spuke4000 · · Score: 2, Interesting

    Maybe this is offtopic, but if you want really elegant language processing you should check this out. Basically, you look at the compressibility of a given text and can determine what language it's in, or even what author produced it. This works with as few as 20 words.

    I realize this isn't translation, but cool nonetheless. For further reading see here and here.
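
For the curious, the compressibility trick can be sketched in a few lines of Python with the standard `zlib` module. The idea, as assumed here, is to append the unknown sample to a reference text in each language and see which reference makes the sample cheapest to compress. The reference snippets below are tiny made-up placeholders; a usable version would need much larger reference corpora, and the function names are hypothetical.

```python
import zlib

def extra_bytes(reference, sample):
    """Extra compressed bytes needed to store `sample` after `reference`."""
    ref = reference.encode("utf-8")
    both = (reference + sample).encode("utf-8")
    return len(zlib.compress(both, 9)) - len(zlib.compress(ref, 9))

def guess_language(sample, references):
    """Pick the language whose reference text compresses the sample best."""
    return min(
        references,
        key=lambda lang: extra_bytes(references[lang], " " + sample),
    )

# Placeholder reference texts (assumptions for illustration only).
REFERENCES = {
    "english": "the quick brown fox jumps over the lazy dog while the rain "
               "falls on the roof of the old house near the river",
    "french": "le renard brun saute par dessus le chien paresseux pendant "
              "que la pluie tombe sur le toit de la vieille maison",
}

sample = "the dog sleeps on the roof of the house"
print(guess_language(sample, REFERENCES))
```

The compressor reuses strings it has already seen, so a sample that shares words and letter patterns with the reference costs fewer extra bytes. The same comparison against per-author reference texts is how the authorship variant would work.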

    --
    This post cannot be rebroadcast without the express written consent of Major League Baseball.
  25. Simulating persons' way of speech? by ivoras · · Score: 2, Interesting
    Given the statistical data, this could probably be used to simulate a text written by a specific person, for example Shakespeare.

    "You look nice..." --> "Shall I compare thee to a Summer's day..."

    --
    -- Sig down
  26. How's that news? by Yurka · · Score: 3, Interesting

    This has already been done some years ago in Canada, where the translation system was fed the complete text of parliamentary debates for umpteen years (required by law to be translated by humans into French, if originally in English, and vice versa). I don't know how it fares when presented with a sample of parliament-speak (I concede, this is not a fair approximation of human language), but it fails miserably on a simple rhyme. Read your Hofstadter, guys.

    --
    I can assure you, the best way to get rid of dragons is to have one of your own.
  27. Similar to natural learning? by Bodrius · · Score: 2, Interesting

    Interesting method.

    It seems to me this is more similar to natural learning of a language (usually at a young age) by exposure and immersion, as opposed to scholar learning of a language in classrooms, etcetera.

    It shouldn't be surprising that in humans, the first method also works best at acquiring fluency in multiple languages. As a matter of fact, it's the only method through which we come to understand our FIRST language, which is in almost every case the one we command the best.

    I think most people get, by consuming huge amounts of information, a feeling of "what sounds right" and "what sounds wrong" that is more effective for them at predicting the unwritten rules and exceptions, both in translations and in original sentence-creation, than memorizing a set of grammar rules which, in the end, are just codifications of the current state of the language.

    I don't think the success of the approach means the symbolic methods are pointless for this endeavor, any more than the formal study of languages and their grammars is for human translators.

    Professional writers and translators do study such rules to dramatically improve their command of the different languages, and do get much better results.

    But it seems to me they are more successful going from "statistical matching with massive real-use data" to "optimized grammar rules matching the data" than going backward, from "scholastic grammar rules" to "consumption of massive data to acquire exceptions, and correct and complement the rules".

    What would be interesting, I think, is if one can study the state of the system after it's performing well and extract/deduce grammar rules, algorithmically.

    It would be interesting to see the results of a program doing that, collecting (and correcting) the grammar using the data, and using the grammar rules when no match in the dictionaries is found to, say, apply a greater weight to the grammatically correct choice among the alternatives.

    If the results were good with this approach, one could consider decreasing the size of the database as the grammar gains stability. Use that memory for other processes, other languages, or new sample data that could not be examined before.

    --
    Freedom is the freedom to say 2+2=4, everything else follows...
  28. Re:Old Texts by lakmiseiru · · Score: 2, Interesting

    I'm forced to disagree. Although reading texts in their primary languages is certainly valuable, I severely doubt every single scholar who studies ancient Mesopotamia is fluent in reading cuneiform script! Also, asking scholars to be fluent in one or two dead languages is quite a lot (according to my sister, who's a medieval scholar and speaks Latin and Medieval French)- would you have them be fluent in every single language they encounter? That's unrealistic, as well as inefficient.

    Although it's certainly true that many scholars can read the primary languages of the periods they study, some do not. For example, if one were studying Culture A through the medium of Culture B's records of interactions with Culture A, one would not need to read primary sources from Culture A.

    It's true that many scholars do prefer to rely on personal translations of primary sources, but for many it's a simple waste of time that could be better spent. Instead of arguing that all scholars must be able to read all primary sources of the cultures they study, I would argue that they should be able to analyze the translations of others (perhaps even the translations this system produces) with regards to the culture. If 20,000 scholars all translate a primary source and their translations are all relatively accurate (errors will be corrected in time), then 19,999 of them have wasted weeks or months.

    Yes. Scholars do need translations - they help verify the scholar's own translations, provide much-needed resources, give insight into the translator's view of the culture - in short, they are a resource too valuable to put aside.

    --

    Access denied: Not enough clue for requested operation.