Slashdot Mirror


Coming Soon, The Google Translator

compuglot writes "Google gave journalists a glimpse of its next-generation machine translation system at a May 19th Google Factory Tour. "Google Blogoscoped" offers an excellent overview of the presentation. The system has been trained using United Nations documents as a corpus, some 20 billion words' worth of content. It uses existing source- and target-language translations (done by human translators at the U.N.) to find patterns, which it then uses to build rules for translating between those languages. Apparently it succeeded in translating certain phrases where the current version had failed. If anyone is capable of making a serious go of MT, it would have to be Google."

9 of 418 comments

  1. fascinating by professorhojo · · Score: 5, Informative

    since the RTFAs lacked any kind of crunchiness, i sourced some great stuff here that does a wonderful job explaining how this system works, and gives the advantages the statistical translation method has over the rules-based approach. as well as the disadvantages.

    fascinating stuff:

    "Currently, most machine translation technology, including consumer-oriented programs such as Systran's Babel Fish, have been "taught" the rules of language, such as verb tenses and when to use parts of speech. Programmers painstakingly hand-build systems based on such rules. "The computer is told, if you see this thing in Russian, replace it with this thing in English," explains Yarowsky.

    "While somewhat effective, such systems are time-consuming to build (consider how long it takes most humans to learn a language and all its rules), and resulting translations are still marred by grammatical and other errors. Those that do work fairly well usually tackle popular Western languages, such as French, German, and Spanish; there are few translation programs developed for other important tongues, such as Chinese, Turkish, or Arabic, let alone for more obscure languages like Tajik.

    "To tackle a broader range of the world's languages, and to improve on the quality of machine translation, Yarowsky and his Hopkins colleagues are developing computer programs that can be trained to figure out any language using statistical analysis, i.e., looking at the probabilities of language patterns. In what's known as automatic knowledge acquisition, the computer could "learn" Serbian well enough to translate future documents or conversation, or at the least pick out pertinent words like "bomb."

    "As Yarowsky explains: "Say you want to teach a computer how to translate Chinese: You give the computer 100,000 sentences in English and the same 100,000 sentences in Chinese and run a program that can figure out which words go to which words. If in 2,000 sentences you have the word Washington, and in about the same number of sentences you have the word Huashengdun, and they occur in the same place in the sentence, these words are likely translations.

    "It's all just observation," Yarowsky adds. "Children do the same thing, but they also do it through visual stimulation and feedback. They see a book and hear the word 'book,' and eventually they learn that it's a book. They see a bird with its wings flapping around and learn that is called a bird. It's the same with machines, only they have much better memories. Computers could remember exactly when and where they saw the words bird and book."

    "So, instead of telling a computer how to do something -- conjugate the verb 'to be' in Spanish, for example (I am = soy) -- researchers give it tens of thousands of examples and program the computer to find repeated patterns that the computer can use to conjugate new verbs. Trained this way, the program could potentially "learn" phrase structure and the rules of translation.

    "As Yarowsky notes in his 100,000-sentence example, one way to accomplish automatic knowledge acquisition is to use bilingual or parallel text. The program "reads" a document in English and then a version in a second language. Such texts used by Hopkins researchers include the Bible, which is available on the Web in more than 60 languages, the Book of Mormon (over 60 languages), and the United Nations Declaration of Human Rights (240 languages).

    "Aiding the computer is the fact that the English version of such texts can be annotated by hand or using another computer program -- essentially marked up to show, for example, that Jesus is a noun and pray is a verb. The translation program-in-training needs such information because it cannot translate future text just by substituting individual words in each language; it must also be able to analyze how sentences work. To do so, the computer program uses pattern recognition templates and other tools to understand sentences on a syntactic level. Simply put, the program is essentially given clues to know what to look for, notes Yarowsky: "It should figure out the subject, figure out the object, and other elements of sentence structure."

    1. Re:fascinating by aldoman · · Score: 2, Informative

      You input them all, and let the statistics do their magic.

      Just like your email spam filter can handle you pressing junk on stuff that isn't junk, or not junk on stuff that is, it's just all numbers and there is an inherent tolerance for small errors that will be created with this sort of system.
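
      The spam-filter comparison can be made concrete with a tiny naive Bayes classifier. This is a hedged sketch, not how any particular mail client works: the `train` and `classify` helpers, the add-one smoothing, and the made-up messages are all assumptions for the example. It shows a handful of wrongly labelled messages being outvoted by the correctly labelled ones.

```python
import math
from collections import Counter


def train(examples):
    """Count words per label from (text, label) examples."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()  # number of messages seen per label
    for text, label in examples:
        counts[label].update(text.split())
        totals[label] += 1
    return counts, totals


def classify(model, text):
    """Return the label with the higher naive Bayes log-score (add-one smoothing)."""
    counts, totals = model
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        n = sum(counts[label].values())
        score = math.log(totals[label] / sum(totals.values()))
        for word in text.split():
            score += math.log((counts[label][word] + 1) / (n + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: 20 correct labels per class...
CLEAN = [("cheap pills buy now", "spam")] * 20 + [
    ("meeting notes attached today", "ham")
] * 20
# ...plus a few messages where the user pressed the wrong button.
NOISY = CLEAN + [("cheap pills buy now", "ham")] * 2 + [
    ("meeting notes attached today", "spam")
] * 2

if __name__ == "__main__":
    model = train(NOISY)
    print(classify(model, "buy cheap pills"))
    print(classify(model, "meeting notes today"))
```

      Even with the four mislabelled messages mixed in, "buy cheap pills" still scores as spam and "meeting notes today" as ham, because the counts from the correct labels dominate.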

    2. Re:fascinating by Arjen · · Score: 2, Informative

      ...the flesh is weak" comes out as "The meat is rotten, but the wine's great".

      Seems like I have to repeat myself over and over again, since this is an urban legend. According to MACHINE TRANSLATION: An Introductory Guide:

      The 'spirit is willing' story is amusing, and it really is a pity that it is not true. However, like most MT 'howlers' it is a fabrication. In fact, for the most part, they were in circulation long before any MT system could have produced them (variants of the 'spirit is willing' example can be found in the American press as early as 1956, but sadly, there does not seem to have been an MT system in America which could translate from English into Russian until much more recently; for sound strategic reasons, work in the USA had concentrated on the translation of Russian into English, not the other way round). Of course, there are real MT howlers. Two of the nicest are the translation of French avocat ('advocate', 'lawyer' or 'barrister') as avocado, and the translation of Les soldats sont dans le café as The soldiers are in the coffee. However, they are not as easy to find as the reader might think, and they certainly do not show that MT is useless.

      BTW, since this book is no longer available in stores, its entire contents have been placed online. I recommend it to anyone who is interested in the subject of MT. It really is a nice introduction to the subject.

  2. Re:Google's translator by iantri · · Score: 5, Informative

    SystranSoft's Systran is behind almost all of the machine translation services on the Internet, including Google's.

  3. Re:Needs a *bit* more work... by Anonymous Coward · · Score: 1, Informative

    The current version of "Google translates" is based on Babelfish (a rule-based machine translation system); it isn't based on Google's research into SMT (statistical machine translation).

  4. Re:Anyone care to make a bet? by Anonymous Coward · · Score: 2, Informative

    Well, it's not like they don't have the technology...

    http://research.microsoft.com/nlp/Projects/MTproj.aspx

  5. Re:if anyone... by digidave · · Score: 3, Informative

    Yeah right. Not while they're trying to convince customers to buy their current generation of crap translators. I got sucked into an IBM conference two years ago where they tried to convince me that their Websphere translator was "near perfect" and that it was ready to be deployed on web sites wanting to offer content in multiple languages. They even went so far as to bring in supposedly unbiased happy customers who testified that the Websphere translator was as good as human translators.

    The conference was attended mostly by IBM platinum partners (development firms that specialize in IBM "solutions" and make IBM enough money to be called platinum partners), and they seemed to buy into it. Of course, platinum partners tend to believe everything IBM tells them.

    --
    The global economy is a great thing until you feel it locally.
  6. Re:if anyone... by Anonymous Coward · · Score: 1, Informative

    Actually, a lot of the original work that the current statistical MT methods are based on was developed at IBM. The resulting models are named, appropriately, IBM Models 1-5.

  7. They were one of the first in the early '90s... by msbmsb · · Score: 2, Informative

    The Mathematics of Statistical Machine Translation: Parameter Estimation by Brown, Della Pietra, et al. IBM was working on this a while ago, and other efforts have since improved on that work, through the use of Maximum Entropy, etc.
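
    For anyone curious what "parameter estimation" looks like in practice, here is a minimal sketch of EM training for the simplest of the IBM models, Model 1, as it is commonly presented in textbooks. It is not IBM's or Google's code: the `ibm_model1` function, the omission of the NULL word, and the three-sentence German/English corpus are simplifying assumptions made for illustration.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=10):
    """Estimate word translation probabilities t(f|e) with EM.

    pairs: list of (english_sentence, foreign_sentence) strings.
    Returns a dict mapping (f, e) to the estimated probability t(f|e).
    Simplification: no NULL word on the English side.
    """
    pairs = [(e.split(), f.split()) for e, f in pairs]
    f_vocab = {f for _, fs in pairs for f in fs}
    # Start from a uniform distribution over the foreign vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        # E-step: spread each foreign word over the English words in its pair.
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)


# Hypothetical toy corpus, invented for illustration only.
CORPUS = [
    ("the house", "das Haus"),
    ("the book", "das Buch"),
    ("a book", "ein Buch"),
]

if __name__ == "__main__":
    for (f, e), p in sorted(ibm_model1(CORPUS).items(), key=lambda kv: -kv[1]):
        print(f"t({f}|{e}) = {p:.3f}")
```

    Nothing tells the program that "das" means "the". Because "das" shows up with "the" in two different sentences, each EM round shifts more probability onto that pairing, which in turn frees up "Haus" for "house" and "Buch" for "book".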