Coming Soon, The Google Translator
compuglot writes "Google gave journalists a glimpse of its next generation machine translation system at a May 19th Google Factory Tour. "Google Blogoscoped" offers an excellent overview of the presentation.
The system has been trained using United Nations documents as a corpus. This corpus is some 20 billion words' worth of content. It uses existing source and target language translations (done by human translators at the U.N.) to find patterns, which it then uses to build rules for translating between those languages. Apparently it succeeded in translating certain phrases where the current version had failed.
If anyone were capable of making a serious go of MT, that would have to be Google."
Since the RTFAs lacked any kind of crunchiness, I sourced some great stuff here that does a wonderful job of explaining how this system works, and gives the advantages the statistical translation method has over the rules-based approach, as well as the disadvantages.
fascinating stuff:
"Currently, most machine translation technology, including consumer-oriented programs such as Systran's Babel Fish, has been "taught" the rules of language, such as verb tenses and when to use parts of speech. Programmers painstakingly hand-build systems based on such rules. "The computer is told, if you see this thing in Russian, replace it with this thing in English," explains Yarowsky.
"While somewhat effective, such systems are time-consuming to build (consider how long it takes most humans to learn a language and all its rules), and resulting translations are still marred by grammatical and other errors. Those that do work fairly well usually tackle popular Western languages, such as French, German, and Spanish; there are few translation programs developed for other important tongues, such as Chinese, Turkish, or Arabic, let alone for more obscure languages like Tajik.
"To tackle a broader range of the world's languages, and to improve on the quality of machine translation, Yarowsky and his Hopkins colleagues are developing computer programs that can be trained to figure out any language using statistical analysis, i.e., looking at the probabilities of language patterns. In what's known as automatic knowledge acquisition, the computer could "learn" Serbian well enough to translate future documents or conversation, or at the least pick out pertinent words like "bomb."
"As Yarowsky explains: "Say you want to teach a computer how to translate Chinese: You give the computer 100,000 sentences in English and the same 100,000 sentences in Chinese and run a program that can figure out which words go to which words. If in 2,000 sentences you have the word Washington, and in about the same number of sentences you have the word Huashengdun, and they occur in the same place in the sentence, these words are likely translations.
"It's all just observation," Yarowsky adds. "Children do the same thing, but they also do it through visual stimulation and feedback. They see a book and hear the word 'book,' and eventually they learn that it's a book. They see a bird with its wings flapping around and learn that is called a bird. It's the same with machines, only they have much better memories. Computers could remember exactly when and where they saw the words bird and book."
"So, instead of telling a computer how to do something -- conjugate the verb 'to be' in Spanish, for example (I am = soy) -- researchers give it tens of thousands of examples and program the computer to find repeated patterns that the computer can use to conjugate new verbs. Trained this way, the program could potentially "learn" phrase structure and the rules of translation.
"As Yarowsky notes in his 100,000-sentence example, one way to accomplish automatic knowledge acquisition is to use bilingual or parallel text. The program "reads" a document in English and then a version in a second language. Such texts used by Hopkins researchers include the Bible, which is available on the Web in more than 60 languages, the Book of Mormon (over 60 languages), and the United Nations Declaration of Human Rights (240 languages).
"Aiding the computer is the fact that the English version of such texts can be annotated by hand or using another computer program -- essentially marked up to show, for example, that Jesus is a noun and pray is a verb. The translation program-in-training needs such information because it cannot translate future text just by substituting individual words in each language; it must also be able to analyze how sentences work. To do so, the computer program uses pattern recognition templates and other tools to understand sentences on a syntactic level. Simply put, the program is essentially given clues to know what to look for, notes Yarowsky: "It should figure out the subject, figure out the object, and other elements of sentence structure."
Just to illustrate, here's the summary of this story, translated to German and back to English using Google's current version:
____
~ |rip/\/\aster /\/\onkey
So what powers Google's current translator? I have seen it give word-for-word the same as Babel on some occasions (but with better handling of non-ASCII characters).
# cat
Damn, my RAM is full of llamas.
"Guugle-a gefe-a a Gleempse-a ooff its mecheene-a Uebersetzoongsystems zee fullooeeng prudoocshun et zee fectury ruoote-a ooff zee A Mey 19 tu juoorneleests. Guugle-a. "Guugle-a Bluguscuped" ooffffers un ixcellent ooferfeeoo ooff zee representeshun. Zee system ves treeened veet zee neshun ducooments es kurpoos. Thees kurpoos is sumetheeng 20 beelliun vurd felooe-a ooff cuntents. It uses zee ixeesting terget lungooege-a trunsleshuns (tekes plece-a feea hoomun trunsleturs et zee U.N.) Semples feend, vheech use-a it zeen tu istebleesh gooeedelines fur trunsleteeng betveee thuse-a lungooeges. Epperent it ves sooccessffool, vhere-a zee present ferseeun hed feeeled, iff it trunsleted certeeen cleeches. Iff iferyune-a ooff furmeeng a sereeuoos vere-a cepeble-a, ooff zee M.Ue-a., thuse-a vuoold gu tu hefe-a hefeeng tu Guugle-a."
Looking forward to a www.borkle.com which returns all its results in such a format.
Don't blame Durga. I voted for Centauri.
Make this work with Gmail and I'd even pay money for it!
Tired of getting email from Amazon.DE on my Gmail account and having to copy and paste it over to Babelfish.
That would be very useful for me.
Sig for hire.
That Microsoft will announce a new revolutionary language translation service sometime in the next two weeks or so?
Weaselmancer
ridiculous.
Oh, no. It's because geeks like Google. Therefore, Google are capable of superhuman feats that mere scientists -- those with years of experience in relevant fields -- are incapable of doing.
Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
boakes.org
When questioned on the matter, Altavista's Babelfish translator gave this quote:
Google does not have anything on my amazing abilities of the translation!
Pulp Audio Weekly - Geek News and Reviews
Actually, my bet for most likely to make a real go of machine translation would be...
IBM
Look how far they ran with chess programs, because they felt like it...
If they decided to go the same distance with translation...
If your blog sounds like a politician giving a speech at the UN, this service will do a wonderful job. Doubtful that it will do any better than Babelfish otherwise.
The biggest problem in artificial intelligence is that the system learns the material that it is trained to, and only that material. Computers don't generalize or extrapolate the known into the unknown worth a damn.
Aah, change is good. -- Rafiki
Yeah, but it ain't easy. -- Simba
While Google's existing translator and Altavista's Babelfish are good, they do not help in the translation of several other languages.
That would be a really good benefit - for instance, I wanted something translated to and from Svensk (Swedish), but I really couldn't find any translation service that did.
Good translation of the more common languages would be nice, but even simple translations of a wider variety of languages would be really useful.
At last I can translate all those non-English spam emails I get! There'll be no more missed opportunities to buy Chinese Viagra, woohoo.
Since it's become "hip" to bash Google these days and support either MSN's search technology or Yahoo, I'm making a pre-emptive strike for the IT fashionistas:
"Duh!!! The best machine translator in the world already exists and there can be no improving upon it! Babblefish (thank you Altavista) has been doing this for well nigh a decade. All you Johnny-come-latelys are probably going to rave on with fanboy adoration of Google (the company that can do no wrong)!!! To top it all off, you lot apparently know nothing about Microsoft's language translation project which is slated to be deployed as part of Longhorny in 2010. Online language translation from Google will fail because Microsoft will have it built into the OS itself. Why send your document online for translation when the OS itself will not only translate it, but also correct the grammar and punctuation and generate a WMA file in one of ten thousand gorgeously rendered synthetic voices? Google has lost. Google has been trolled. Google will have a nice day".
We now return you to your regularly scheduled pos[tt]en.
-"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
There is already a tranzilator
Seems one could devise a TQ (translation quotient) measuring the effectiveness of machine (or human) translators. Take any standard reading-comprehension test, send its text material through the translator and back ...and then compare the scores of subjects taking the resulting test vs. those taking the original.
(Before such translators make their way into, say, diplomatic circles, I'd sure hope there's some objective demonstration of near-infallibility...)
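A rough sketch of how such a TQ might be computed. The comment above does not give a formula, so the ratio of mean scores used here, along with the function name and the sample numbers, is purely an assumption for illustration:

```python
from statistics import mean


def translation_quotient(original_scores, round_trip_scores):
    """Ratio of the mean score on the round-tripped test to the mean score
    on the original test (hypothetical definition: 1.0 means no loss)."""
    baseline = mean(original_scores)
    if baseline == 0:
        raise ValueError("original scores average to zero; TQ is undefined")
    return mean(round_trip_scores) / baseline


# Invented scores: one group reads the original text, another reads the
# text after it has gone through the translator and back.
tq = translation_quotient([80, 90, 100], [60, 70, 50])
print(f"TQ = {tq:.2f}")
```

A real study would also need comparable subject groups and some measure of variance, not just a ratio of averages.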
Seeing bad movies only encourages them. Watch responsibly
I don't ever expect such translation to work perfectly, but taking existing phrases should lead to useful first drafts.
This will mean one less possible career for me, and fewer Babelfish-induced laughter moments.
As a fluently bilingual person, I often recognize expressions that were translated in Canadian government documents. "Anglicisme" is the word the French have for it.
There's subtlety to languages we may forever lose. Take for example:
"Je donne ma langue au chat" - "I give up" (answering a riddle) instead of the more picturesque "I give my language to the cat". Well, that should be "tongue", but hey, it's just Babelfish!
"Bullshit" won't produce "merde de taureau". That is a strange expression you anglos have, don't you realize?
"Il pleut comme vache qui pisse" will give us "it's pouring cats and dogs" rather than "it's pouring like cows' a'pissin". The French have also never heard of cats and dogs falling from the sky.
While an improved Babelfish may improve our mutual comprehension, please pause for a moment to consider all the linguistic hilarity we'll forever lose.
Information: "I want to be anthropomorphized"
First, this is outstanding; Google, unsatisfied with traditional machine translation techniques, pioneers their own design. I'm certain their advertisers will be pleased to have their ads auto-translated to whatever language is necessary.
Second, I think we'll witness a case of having the AI ante upped once again when another traditional AI challenge is met. Wikipedia puts this best: when viewed with a moderate dose of cynicism, AI can be viewed as 'the set of computer science problems without good solutions at this point.' Once a sub-discipline results in useful work, it is carved out of artificial intelligence and given its own name.
Lurking at the bottom of the gravity well, getting old
So when you go to translate.google.com and translate something, the result will be legalese in the resulting languages.
Spanish: "Que pasa?"
English translation: "With regards to the current situation, how is the day progressing?"
DVD subtitle tracks would be another good addition to help pick up slang too (most have an English track along with a couple of others depending on the region)... all time-synced and easy to match up...
(I'm guessing that it'd fall under fair use and Google wouldn't have to struggle to get the movie studios' approval, even though such tech would benefit the studios too.)
In 'Hitchhiker's Guide to the Galaxy' (the 'trilogy' of books, not the recent movie), it's mentioned that the babelfish has effectively started many, many wars. The reasons seem to be that any being can be rude to any other being without a serious set of translations that explain exactly what the rude terms mean and how they should be regarded.
I'm highly concerned for this warmongering that Google has undertaken.
Reference Here: http://www.bbc.co.uk/cult/hitchhikers/guide/belgi
Picture this: I write a blog entry with either bad punctuation or erroneous content. Under the old system (pre-Google translation), I would receive several flames about my idiocy. With Google translations:
* People around the world will be confused and angered about my punctuation;
* Vastly larger numbers of people will complain about my erroneous content;
* Other people will step up to my defense and a massive flame war will ensue;
* Idiots everywhere (who speak other languages) will echo my idiocy by believing the erroneous content I posted;
* The signal to noise ratio of the net will rise markedly;
* I will still be unsure of whether to count on my fingers starting with my thumb or forefinger depending on which European country I'm in.
I believe this pro-war, anti-peace, conflict-ridden idea of making everyone THINK they understand each other is ripe for criticism. God made everyone else speak funny, I think it should stay that way! Only right thinking people speak my language anyway, and everyone else should just shut up and sit down!
(WARNING: above post contains carcinogenic levels of sarcasm, facetiousness, satire, irony, and adjectives. Please unplug brainstem and wipe with a clean, damp cloth before continuing.)
Unitarian Church: Freethinkers Congregate!
Wenn ist das Nunstruck git und Slotermeyer? Ja!... Beiherhund das Oder die Flipperwaldt gersput. be careful! If you translate this you may end up dead.....
"Computers don't generalize or extrapolate the known into the unknown worth a damn."
Fortunately, that's not all that Google has to go on. Google has 8 billion webpages, in many different languages, most of which are written by non-speechwriters. Not only can they analyze words based on translated context, but they can analyze words based on intra-language context, to form associations between words and meanings.
The real trick is getting down two important linguistic concepts: "Sandhi Rules" (for instance, the use of "an" before a vowel and "a" before a consonant, which are totally regular but more complicated than a word-to-word matchup), and the "degree" or "quality" of words, which indicate the type of adjective most appropriate in any given context.
For instance, "erudite", "learned", "educated", "knowledgeable", "skilled", and "cunning" could all be related words, but many of them have positive or negative associations which may only really be conveyed by understanding the meaning, irony, or sarcasm of a particular phrase.
For instance, "John has been skilled in writing beautiful code for most of his adult life" is quite different from "John has been educated in writing beautiful code for most of his adult life", or "John has been erudite...". The first one is probably right if John has had a natural inclination to doing it properly, the second if he has undergone some training (though we don't know the actual state of his ability), the third (though the word doesn't even really make sense here) if he has been arrogant about his ability, shouting RTFM! every time someone asked him a question.
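To make the "a"/"an" rule mentioned above concrete, here is a minimal sketch of why it is regular but not a word-to-word matchup: the right article depends on the word that follows. The function names are made up for illustration, and the rule is deliberately simplified to spelling; the real rule depends on the following sound, so this version gets words like "hour" and "university" wrong.

```python
import re

# Assumption: approximate "starts with a vowel sound" by the first letter.
VOWEL_LETTERS = frozenset("aeiou")


def indefinite_article(next_word):
    """Pick 'a' or 'an' from the first letter of the following word."""
    return "an" if next_word[:1].lower() in VOWEL_LETTERS else "a"


def fix_articles(text):
    """Rewrite each standalone 'a'/'an' so it agrees with the next word."""
    def repl(match):
        article, space, word = match.groups()
        correct = indefinite_article(word)
        if article[0].isupper():
            correct = correct.capitalize()
        return correct + space + word

    return re.sub(r"\b(an?)(\s+)(\w+)", repl, text, flags=re.IGNORECASE)


print(fix_articles("A egg, a apple and an banana"))
```

The choice between "skilled", "educated", and "erudite" has no comparably mechanical rule, which is the harder half of the problem described above.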
Some people here seem to have a false picture of how language works. Individual words do not have meanings. Not to a human interpreter anyway. Sentences used in actual contexts have meanings (unless a single word is uttered as an elliptical sentence). The "meanings" of words, as found in dictionaries, are simply abstractions from occasions of use. The idea that individual words have meanings hasn't been current in philosophy or linguistics for about 50 years. Also, the idea of St. Augustine that children learn the meaning of words by associating sounds that they hear with particular objects that they observe is now also considered rather dubious.
Ludwig Wittgenstein
"It's all just observation," Yarowsky adds. "Children do the same thing, but they also do it through visual stimulation and feedback. They see a book and hear the word 'book,' and eventually they learn that it's a book. They see a bird with its wings flapping around and learn that is called a bird. It's the same with machines, only they have much better memories. Computers could remember exactly when and where they saw the words bird and book."
Except, no. Humans are basically generalization machines. Babies are able to grasp very quickly that words apply to categories of things -- not just that a *specific* item is a bird or a book, but to learn "I know a bird when I see it", even without necessarily being able to provide a scientific definition. Computers can be built to emulate this ability, but learning word-to-word mappings isn't *nearly* the same as learning abstract concepts and which words apply to them.
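For reference, the kind of word-to-word mapping being argued about here (Yarowsky's Washington/Huashengdun example) can be sketched in a few lines: count which words turn up together across aligned sentence pairs. This is only an illustration under stated assumptions. The Dice score, the function names, and the toy corpus (pinyin standing in for Chinese text) are all invented, and unlike the quoted description it ignores where in the sentence a word occurs.

```python
from collections import Counter
from itertools import product


def cooccurrence_scores(pairs):
    """Dice coefficient for every (source, target) word pair that shares
    at least one aligned sentence pair."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in pairs:
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_count.update(src_words)   # sentences containing each source word
        tgt_count.update(tgt_words)   # sentences containing each target word
        joint.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


def best_translation(word, scores):
    """Target word with the highest score for `word`, or None if unseen."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None


# Invented toy corpus of aligned sentence pairs.
TOY_CORPUS = [
    ("washington is cold", "huashengdun hen leng"),
    ("i visited washington", "wo qu le huashengdun"),
    ("beijing is cold", "beijing hen leng"),
    ("i visited beijing", "wo qu le beijing"),
]

scores = cooccurrence_scores(TOY_CORPUS)
print(best_translation("washington", scores))
```

Note that this only pairs up surface strings; nothing in it forms the category-level concepts described above.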
If they use UN documents as a guide, the Google MT engine will be excellent at translating bureaucratese between languages. I'm not sure if that's a good thing!
Exactly. And the UN surely has fairly rigorous QA processes for its translations. Now try expanding the corpus with more translated copy.
In addition to feeding the system with translations that haven't been through formal QA (in many but not all cases), you also are now feeding it copy that has not had all the style deliberately squeezed out of it for easy translatability. (Which is the way they write in bi- and multilingual bureaucracies.)
If and when MT can handle that situation, I'll be impressed. But a "bureaucratese" translator seems like a much smaller challenge to me, relatively speaking.
John, the cunning linguist.
Flourescent (adj): smelling like ground wheat.
"The Mathematics of Statistical Machine Translation: Parameter Estimation" by Brown, Della Pietra, et al. IBM was on this a while ago, and other efforts have improved upon this work, through the use of Maximum Entropy, etc.
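For a flavor of what parameter estimation means in that line of work, here is a small sketch in the style of the simplest model from that family (usually called IBM Model 1): expectation-maximization over word translation probabilities t(f | e). It is a simplified illustration, not the paper's full method. It omits the NULL word and the later alignment models, and the function names and toy English/German corpus are assumptions.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(foreign word f | English word e) with EM."""
    pairs = [(e.split(), f.split()) for e, f in pairs]
    f_vocab = {f for _, fs in pairs for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform table
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        for es, fs in pairs:
            for f in fs:
                # E-step: split one unit of f across the English words in
                # this sentence, in proportion to the current t(f | e).
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


def most_likely(t, e):
    """Foreign word with the highest estimated t(f | e), or None."""
    candidates = {f: p for (f, e2), p in t.items() if e2 == e}
    return max(candidates, key=candidates.get) if candidates else None


# Small toy corpus of aligned sentence pairs.
CORPUS = [("the house", "das haus"), ("the book", "das buch"), ("a book", "ein buch")]
table = train_model1(CORPUS)
print(most_likely(table, "house"), most_likely(table, "book"))
```

No dictionary or grammar is supplied; the pairing of "house" with "haus" falls out of the fact that "das" is better explained by "the", which is the core idea behind the statistical approach.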
Clinton tours devastated Bandeh Aceh.
Of course, I knew what the writer really meant. But the Babel Fish translation into French produces exactly the meaning which I first parsed when reading that headline.
Les excursions de Clinton ont dévasté Bandeh Aceh.
If machine translation becomes more common, perhaps English writers will have to be a little more careful.
I was wrong about the French. However, the Spanish NVI appears to parallel the NIV, and I'd imagine the two would be pretty good candidates for this sort of analysis.
http://www.booksofthebible.com/p2390.html
I believe it's key that in the situation of
Ancient Lang A -> Modern Lang B -> Modern Lang C
that B and C will be far closer than
Ancient Lang A -> Modern Lang B
Ancient Lang A -> Modern Lang C
There is an arguably better solution, which is to agree on a common writing system (note that adopting a common writing system is more feasible than adopting a common language, as one need not learn any phonology). Fifty years ago, a man by the name of Charles K. Bliss developed a system that he hoped would, in the future, become universally adopted. His invention was dubbed Blissymbolics. It is currently used in the field of augmentative and assistive communication, where it gives language to those who would, due to handicap, be unable to communicate with any fluency.
The basic idea behind Blissymbolics is to use mostly indexical ideographs - that is to say, e.g., the symbol for man looks somewhat like a stick figure man. There are some pure symbols, however, though they are somewhat conventional - for instance, a heart-shaped symbol represents emotion. However, it is not limited to concrete meanings, and, though I doubt it could be proved, I believe it has the same capability for expression as any other writing system, including English writing, due to its compositionality. Couple that with the fact that it can be learned quite easily, and one might begin to see that yes, this is a better solution. I am dedicated to this ideal, so if you get a chance, check out http://www.activebliss.com/ for more information about the ideal of universal communication.
Cheers,
Matt Landau