Slashdot Mirror


Coming Soon, The Google Translator

compuglot writes "Google gave journalists a glimpse of its next-generation machine translation system at a May 19th Google Factory Tour. "Google Blogoscoped" offers an excellent overview of the presentation. The system has been trained using the United Nations Documents as a corpus. This corpus is some 20 billion words' worth of content. It uses existing source and target language translations (done by human translators at the U.N.) to find patterns it then uses to build rules for translating between those languages. Apparently it was successful where the current version had failed in translating certain phrases. If anyone were capable of making a serious go of MT, that would have to be Google."

13 of 418 comments

  1. fascinating by professorhojo · · Score: 5, Informative

    since the RTFAs lacked any kind of crunchiness, i sourced some great stuff here that does a wonderful job explaining how this system works, and gives the advantages the statistical translation method has over the rules-based approach. as well as the disadvantages.

    fascinating stuff:

    "Currently, most machine translation technology, including consumer-oriented programs such as Systran's Babel Fish, has been "taught" the rules of language, such as verb tenses and when to use parts of speech. Programmers painstakingly hand-build systems based on such rules. "The computer is told, if you see this thing in Russian, replace it with this thing in English," explains Yarowsky.

    "While somewhat effective, such systems are time-consuming to build (consider how long it takes most humans to learn a language and all its rules), and resulting translations are still marred by grammatical and other errors. Those that do work fairly well usually tackle popular Western languages, such as French, German, and Spanish; there are few translation programs developed for other important tongues, such as Chinese, Turkish, or Arabic, let alone for more obscure languages like Tajik.

    "To tackle a broader range of the world's languages, and to improve on the quality of machine translation, Yarowsky and his Hopkins colleagues are developing computer programs that can be trained to figure out any language using statistical analysis, i.e., looking at the probabilities of language patterns. In what's known as automatic knowledge acquisition, the computer could "learn" Serbian well enough to translate future documents or conversation, or at the least pick out pertinent words like "bomb."

    "As Yarowsky explains: "Say you want to teach a computer how to translate Chinese: You give the computer 100,000 sentences in English and the same 100,000 sentences in Chinese and run a program that can figure out which words go to which words. If in 2,000 sentences you have the word Washington, and in about the same number of sentences you have the word Huashengdun, and they occur in the same place in the sentence, these words are likely translations.

    "It's all just observation," Yarowsky adds. "Children do the same thing, but they also do it through visual stimulation and feedback. They see a book and hear the word 'book,' and eventually they learn that it's a book. They see a bird with its wings flapping around and learn that is called a bird. It's the same with machines, only they have much better memories. Computers could remember exactly when and where they saw the words bird and book."

    "So, instead of telling a computer how to do something -- conjugate the verb 'to be' in Spanish, for example (I am = soy) -- researchers give it tens of thousands of examples and program the computer to find repeated patterns that the computer can use to conjugate new verbs. Trained this way, the program could potentially "learn" phrase structure and the rules of translation.

    "As Yarowsky notes in his 100,000-sentence example, one way to accomplish automatic knowledge acquisition is to use bilingual or parallel text. The program "reads" a document in English and then a version in a second language. Such texts used by Hopkins researchers include the Bible, which is available on the Web in more than 60 languages, the Book of Mormon (over 60 languages), and the United Nations Declaration of Human Rights (240 languages).

    "Aiding the computer is the fact that the English version of such texts can be annotated by hand or using another computer program -- essentially marked up to show, for example, that Jesus is a noun and pray is a verb. The translation program-in-training needs such information because it cannot translate future text just by substituting individual words in each language; it must also be able to analyze how sentences work. To do so, the computer program uses pattern recognition templates and other tools to understand sentences on a syntactic level. Simply put, the program is essentially given clues to know what to look for, notes Yarowsky: "It should figure out the subject, figure out the object, and other elements of sentence structure."
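The Washington/Huashengdun observation quoted above can be sketched as a simple co-occurrence count over sentence-aligned text. This is only a toy illustration, not the Hopkins or Google system: the four-sentence corpus, the function names `cooccurrence_pairs` and `best_translation`, and the choice of the Dice coefficient as the score are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def cooccurrence_pairs(parallel):
    """Score (source word, target word) pairs with the Dice coefficient.

    `parallel` is a list of (source sentence, target sentence) pairs that
    are assumed to be translations of each other.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in parallel:
        # Use sets so each word counts once per sentence.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    # Dice = 2 * (sentences containing both) / (sentences with s + sentences with t)
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_translation(scores, word):
    """Return the target word with the highest score for `word`."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)


# Hypothetical toy corpus (romanized Chinese), invented for illustration.
corpus = [
    ("washington is cold", "huashengdun hen leng"),
    ("i like washington", "wo xihuan huashengdun"),
    ("beijing is cold", "beijing hen leng"),
    ("i like beijing", "wo xihuan beijing"),
]

scores = cooccurrence_pairs(corpus)
print(best_translation(scores, "washington"))
```

With so little data, "cold" scores equally well against "hen" and "leng", because the two always appear together. That tie is the reason a real system needs far more than a handful of sentences, and a proper alignment model rather than raw counts.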

    1. Re:fascinating by NoMoreNicksLeft · · Score: 5, Interesting

      Some questions:

      Why can't a dictionary be made of nouns, of verbs? Why can't we have it statistically analyze the grammar for ambiguous words?

      Does it only recognize exact matches? Especially with verb conjugation, I'd think any words 80% similar or so should be considered matches. Not all languages are as conjugation-happy as Latin or Spanish or even English, and you often lose some nuanced conjugations when translating from one to the other.

      What will be done about idioms? Translating these word for word often makes no sense at all, and for me at least (no idea what the official stance is), I'd rather they substitute in idioms with the same general meaning, but for the culture being translated to.

      Does it work on alternate character systems? Is it word-boundary dependent?

      Does it understand punctuation rules? Will this post, translated to Spanish, have the upside-down question marks where they're supposed to be?

      How many of the world's existing languages have enough text for this to even be feasible?
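The "80% similar or so" idea from the exact-match question above can be tried out with Python's standard `difflib`. This is a rough sketch only: the three-verb vocabulary and the `fuzzy_lookup` helper are made up for illustration, and character-level similarity is a stand-in for whatever matching a real translation system would use.

```python
from difflib import get_close_matches

# Hypothetical list of known dictionary forms (Spanish infinitives).
KNOWN_FORMS = ["hablar", "comer", "vivir"]


def fuzzy_lookup(word, known=KNOWN_FORMS, cutoff=0.8):
    """Return the closest known form scoring at least `cutoff`, else None."""
    matches = get_close_matches(word.lower(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(fuzzy_lookup("hablan"))    # similarity to "hablar" is above 0.8
print(fuzzy_lookup("hablamos"))  # similarity to "hablar" is below 0.8
```

Here "hablan" clears the threshold against "hablar" while "hablamos" does not, so a fixed similarity cutoff catches some conjugations and misses others. A lower cutoff would catch more of them, at the cost of matching unrelated words.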

    2. Re:fascinating by elrous0 · · Score: 5, Insightful

      or at the least pick out pertinent words like "bomb."

      Why do I have a funny feeling that this research isn't being funded by philanthropic foundations?

      -Eric

      --
      SJW: Someone who has run out of real oppression, and has to fake it.
    3. Re:fascinating by Anonymous Coward · · Score: 5, Funny

      You go all the way to Iran to get gasoline? Who are you, George W Bush?

    4. Re:fascinating by kebes · · Score: 5, Interesting

      What will be done about idioms? Translating these word for word often makes no sense at all, and for me at least (no idea what the official stance is), I'd rather they substitute in idioms with the same general meaning, but for the culture being translated to.

      I think this is precisely where statistical approaches can really shine. A purely dictionary-based conversion will translate an idiom word-for-word, which will make no sense at all. However, a statistical approach could be constructed to look for the "longest reliable match." So if the idiom "cat got your tongue" re-appears over and over, and is correlated to a different idiom in other languages (that may not use the word "cat"!), then the algorithm could tokenize "cat got your tongue" as a single entry that would map to something different in each language.
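The "longest reliable match" idea above can be sketched as a greedy lookup that always prefers the longest phrase found in a table. This is a minimal sketch under stated assumptions: the `PHRASE_TABLE` below is hand-written and hypothetical, whereas a statistical system would learn its entries (and how reliable each one is) from parallel text.

```python
# Hypothetical English-to-French phrase table, hand-written for illustration.
# Keys are tuples of words so that multi-word idioms are single entries.
PHRASE_TABLE = {
    ("cat", "got", "your", "tongue"): "tu as perdu ta langue",
    ("the",): "le",
    ("cat",): "chat",
    ("sleeps",): "dort",
}


def translate(sentence, table=PHRASE_TABLE):
    """Translate greedily, trying the longest phrase at each position first."""
    words = sentence.lower().split()
    max_len = max(len(key) for key in table)
    out, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            chunk = tuple(words[i:i + n])
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No entry of any length: pass the word through unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)


print(translate("cat got your tongue"))
print(translate("the cat sleeps"))
```

The idiom is matched as one four-word entry, so "cat" never gets translated on its own there, while the same word still maps to "chat" in an ordinary sentence.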

      How many of the world's existing languages have enough text for this to even be feasible?

      You're right... that's the killer. Translating using statistics (especially idioms) properly will require a huge database of samples. Even what's been suggested so far is not enough. If we want to translate technical documents, we need a new database. If we want to translate "free form writing" we need yet more data.

      However, there's lots of data out there (already in digital format) that could be used... we just need people to see the potential and start using these datasets (or making these datasets available). For instance, for technical stuff there are thousands of abstracts for papers and for theses that are translated into various languages (for instance, many articles published in German are then also released in English... I live in Quebec, and every thesis abstract has to be translated into French also... etc.). Many legal documents (many of which are already available to the public) are also translated for various reasons. It would also be interesting if translators all around the world uploaded documents they had translated into some database (assuming it's nothing sensitive of course!). As this database grew, it would become more and more reliable. Let's face it, there's tons of human-based translation going on, forming a massive dataset... but by and large it's just scattered and not usable.

    5. Re:fascinating by should_be_linear · · Score: 5, Funny

      but the Bible uses many outdated or non-standard phrases and sentence structures, as does most legal text I've ever seen. I'm not a linguist or a statistician, but from my uneducated viewpoint it sounds like problems might arise in the texts that are available for training the system. Anyone know how they're planning to overcome this?

      Harry Potter is the answer. It is several "normal language" books and is translated into all major languages. Also, the program would finally figure out how to translate words like "Quidditch".

      --
      839*929
  2. Integrate with GMAIL! by RubberDogBone · · Score: 5, Interesting

    Make this work with Gmail and I'd even pay money for it!

    Tired of getting email from Amazon.DE on my Gmail account and having to copy and paste it over to Babelfish.

    That would be very useful for me.

    --
    Sig for hire.
  3. if anyone... by rdc_uk · · Score: 5, Interesting

    Actually, my bet for most likely to make a real go of machine translation would be...

    IBM

    Look how far they ran with chess programs, because they felt like it...

    If they decided to go the same distance with translation...

  4. Re:Google's translator by iantri · · Score: 5, Informative

    SystranSoft's Systran is behind almost all of the machine translation services on the Internet, including Google's.

  5. oh no! by danharan · · Score: 5, Interesting

    I don't ever expect such translation to work perfectly, but taking existing phrases should lead to useful first drafts.

    This will mean one less possible career for me, and fewer Babelfish-induced laughter moments.

    As a fluently bilingual person, I often recognize expressions that were translated in Canadian government documents. "Anglicisme" is the word the French have for it.

    There's subtlety to languages we may forever lose. Take for example:

    "Je donne ma langue au chat" - "I give up" (answering a riddle) instead of the more picturesque "I give my language to the cat". Well, that should be tongue, but hey, it's just Babelfish!

    "Bullshit" won't produce "merde de taureau". That is a strange expression you anglos have, don't you realize?

    "Il pleut comme vache qui pisse" will give us "it's pouring cats and dogs" rather than "it's pouring like cows' a'pissin". The French have also never heard of cats and dogs falling from the sky.

    While an improved Babelfish may improve our mutual comprehension, please pause for a moment to consider all the linguistic hilarity we'll forever lose.

    --
    Information: "I want to be anthropomorphized"
  6. Re:Unsupported assertions by KagatoLNX · · Score: 5, Interesting

    Ummm, geeks like Google because Google employs scientists. Which mere scientists were you talking about?

    Were you talking about the PhDs at universities busy teaching classes, churning out research papers to avoid being fired (an ugly numbers game some departments play), or perhaps burning time generating volumes of grant paperwork?

    Oh, maybe you were talking about the scientists employed by the private sector. I'm sure the management teams wherever they work are willing to take the time and care that Google won't.

    You do know how many PhDs Google employs, right? Not to mention that they won't be fighting for resources there either. No backstabbing, liquidating MBAs trashing their corporate budget. No football-crazed alumni assassinating their funding proposals either.

    Also, I would remind you that "mere scientists" often come up with the needed research (there are volumes in MT alone), but rarely can afford to put in the years that it takes into a good implementation.

    Geeks love Google because it is, in many respects, where the best of business meets the best of academia.

    --
    I think Mauve has the most RAM. --PHB (Dilbert Comic)
  7. Lovely translation source... by isa-kuruption · · Score: 5, Funny

    So when you go to translate.google.com and translate something, the result will be legalese in the resulting languages.

    Spanish: "Que pasa?"
    English translation: "With regards to the current situation, how is the day progressing?"

  8. words don't really have meanings by mincognito · · Score: 5, Interesting

    Some people here seem to have a false picture of how language works. Individual words do not have meanings. Not to a human interpreter anyway. Sentences used in actual contexts have meanings (unless a single word is uttered as an elliptical sentence). The "meanings" of words, as found in dictionaries, are simply abstractions from occasions of use. The idea that individual words have meanings hasn't been current in philosophy or linguistics for about 50 years. Also, the idea of St. Augustine that children learn the meaning of words by associating sounds that they hear with particular objects that they observe is now also considered rather dubious.