Slashdot Mirror


Coming Soon, The Google Translator

compuglot writes "Google gave journalists a glimpse of its next generation machine translation system at a May 19th Google Factory Tour. "Google Blogoscoped" offers an excellent overview of the presentation. The system has been trained using the United Nations Documents as a corpus. This corpus is some 20 billion words worth of content. It uses existing source and target language translations (done by human translators at the U.N.) to find patterns it then uses to build rules for translating between those languages. Apparently it was successful where the current version had failed in translating certain phrases. If anyone were capable of making a serious go of MT, that would have to be Google."
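As a rough illustration of "finding patterns" in human-translated text, here is a minimal sketch that scores word pairs by how often they co-occur in aligned sentence pairs. It is not Google's method, which the summary does not describe in detail. The function names, the Dice-coefficient scoring, and the four-sentence English/Spanish corpus are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def learn_word_pairs(aligned_sentences):
    """Score (source, target) word pairs by Dice coefficient.

    aligned_sentences: iterable of (source_sentence, target_sentence)
    pairs that are translations of each other.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in aligned_sentences:
        # Count each word once per sentence so repeats do not inflate scores.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_translation(scores, word):
    """Return the highest-scoring target word for `word`, or None if unseen."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None


# Hypothetical toy corpus; a real system would use millions of sentence pairs.
TOY_CORPUS = [
    ("the house", "la casa"),
    ("the book", "el libro"),
    ("a house", "una casa"),
    ("a book", "un libro"),
]

if __name__ == "__main__":
    scores = learn_word_pairs(TOY_CORPUS)
    print(best_translation(scores, "house"))
    print(best_translation(scores, "book"))
```

A word-level score like this ignores word order, phrases, and idioms, which is why it should be read only as a picture of the general idea.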

9 of 418 comments

  1. Only works for translating speeches by Shotgun · · Score: 4, Insightful

    If your blog sounds like a politician giving a speech at the UN, this service will do a wonderful job. Doubtful that it will do any better than Babelfish otherwise.

    The biggest problem in artificial intelligence is that the system learns the material that it is trained to, and only that material. Computers don't generalize or extrapolate the known into the unknown worth a damn.

    --
    Aah, change is good. -- Rafiki
    Yeah, but it ain't easy. -- Simba
  2. Re:Unsupported assertions by stevejsmith · · Score: 4, Insightful

    No, it's because Google has tons of talent, money, already-archived text to work with, computers, respect in the industry, and consumer base. I can't think of a company that possesses these characteristics more so than Google.

  3. T.Q. by moviepig.com · · Score: 4, Insightful
    The system has been trained using the United Nations Documents as a corpus.

    Seems one could devise a TQ (translation quotient) measuring the effectiveness of machine (or human) translators. Take any standard reading-comprehension test, and send its text material through the translator, and back ...and then compare the scores of subjects taking the resulting test vs. those taking the original.

    (Before such translators make their way into, say, diplomatic circles, I'd sure hope there's some objective demonstration of near-infallibility...)
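The round-trip test proposed above could be sketched roughly as follows. This is only one possible reading of the idea: the names `round_trip` and `translation_quotient` are hypothetical, the translators are passed in as plain callables, and defining TQ as a ratio of mean scores is an assumption, since the comment says only to "compare the scores".

```python
from statistics import mean


def round_trip(text, forward, backward):
    """Translate text into the target language and back again.

    forward and backward are any callables mapping str -> str,
    standing in for a machine or human translator.
    """
    return backward(forward(text))


def translation_quotient(original_scores, round_trip_scores):
    """Ratio of mean test score on round-tripped text to mean score on the original.

    1.0 means readers understood the round-tripped text as well as the
    original; lower values mean comprehension was lost in translation.
    """
    baseline = mean(original_scores)
    if baseline == 0:
        raise ValueError("original scores must have a non-zero mean")
    return mean(round_trip_scores) / baseline


if __name__ == "__main__":
    # Placeholder "translators" that only change case, for illustration.
    passage = "The committee adopted the resolution."
    print(round_trip(passage, str.upper, str.lower))
    # Made-up scores for two groups of test subjects.
    print(translation_quotient([80, 90], [40, 45]))
```

Note that a round trip conflates errors made in each direction, so a low score would not say which direction is at fault.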

    --
    Seeing bad movies only encourages them. Watch responsibly
  4. Re:fascinating by elrous0 · · Score: 5, Insightful
    or at the least pick out pertinent words like "bomb."

    Why do I have a funny feeling that this research isn't being funded by philanthropic foundations?

    -Eric

    --
    SJW: Someone who has run out of real oppression, and has to fake it.
  5. Re:fascinating by MindStalker · · Score: 3, Insightful

    Well, the Bible is Hebrew, Greek and Latin. There are no outdated English phrases in the Bible. Now if you're referring to the King James translation of the Bible, obviously such would be good for teaching Google Old English but not modern English. You would need a much newer translation that doesn't use old phrases. Such do exist btw.

  6. except, no. by mattdm · · Score: 3, Insightful

    "It's all just observation," Yarowsky adds. "Children do the same thing, but they also do it through visual stimulation and feedback. They see a book and hear the word 'book,' and eventually they learn that it's a book. They see a bird with its wings flapping around and learn that is called a bird. It's the same with machines, only they have much better memories. Computers could remember exactly when and where they saw the words bird and book."

    Except, no. Humans are basically generalization machines. Babies are able to grasp very quickly that words apply to categories of things -- not just that a *specific* item is a bird or a book, but to learn "I know a bird when I see it", even without necessarily being able to provide a scientific definition. Computers can be built to emulate this ability, but learning word-to-word mappings isn't *nearly* the same as learning abstract concepts and which words apply to them.

  7. Re:fascinating by Bigman · · Score: 3, Insightful

    Don't forget that many works of fiction are translated into several languages. The only problem with that is persuading the copyright holders to permit their use in training computer translation systems. I'm not sure where you would stand with this legally (After all, IANAL!), so I suspect this is why Google has been using the UN documents. I would imagine these are effectively public domain; and if not, I would imagine the UN would see a reliable machine translation project as worth supporting. The only downside I can see is that the UN texts are unlikely to have many idioms or colloquialisms, which would limit the resulting translator's usefulness in a more general context.

    --
    *--BigMan--- Time flies like an arrow.. but personally I prefer a nice glass of wine!
  8. Re:DVD's subtitle tracks by BullfrogJones · · Score: 3, Insightful
    One serious problem I see with the 'matching source' method is that it's rare to find two sources that truly match. Movies are a great example - as a native English speaker who lived for 5 years in Spain, I can attest to the fact that the translations provided by the movie studios (used for subtitles in the theater and also for DVDs) are problematic on many levels.

    It's not enough to recognize a given word in language A is such and such word in language B, and not even enough to do the same with idiomatic phrases such as 'His bark is worse than his bite' (Mucho ruido, pocas nueces in Spanish, literally 'Lots of noise but few pecans').

    The problem is that the content itself is sometimes changed in translation. Cultural differences, pop culture references, names and places are all changed liberally when creating movie subtitles. This is something that is easy enough for a bilingual human to notice and disregard, but how is a computer to know what to keep and what to disregard when comparing the supposedly matching sources?

    Choice of source material is extremely important here, and probably explains why they are starting with UN documents, a formal, business-like body of text with presumably less room for content differences. Unfortunately, the fact that movie translations cannot easily be used means that much of what we humans find amusing about bad babelfish translation (literal translation of slang, etc...) will continue to plague us for some time to come.

  9. Re:translations of translations by grahamsz · · Score: 3, Insightful

    I was wrong about the French. However, the Spanish NVI appears to parallel the NIV, and I'd imagine they would be pretty good candidates for this sort of analysis.

    http://www.booksofthebible.com/p2390.html

    I believe it's key that in the situation of

    Ancient Lang A -> Modern Lang B -> Modern Lang C

    that B and C will be far closer than

    Ancient Lang A -> Modern Lang B
    Ancient Lang A -> Modern Lang C