More on Statistical Language Translation
DrLudicrous writes "The NYTimes is running an article about how statistical language translation schemes have come of age. Rather than compile an extensive list of words and their literal translations via bilingual human programmers, statistical translation works by comparing texts in both English and another language and 'learning' the other language via statistical methods applied to units called 'N-grams', e.g. if 'hombre alto' means tall man, and 'hombre grande' means big man, then hombre=man, alto=tall, and grande=big." See our previous story for more info.
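To make the 'hombre alto' deduction concrete, here is a minimal sketch in Python. It is not the method from the article: it just intersects the candidate English words for each foreign word across phrase pairs, then eliminates words that are already accounted for. The function name `align_words` and the two-pair corpus are made up for illustration; real systems presumably estimate probabilities over far larger bodies of parallel text.

```python
def align_words(phrase_pairs):
    """Guess word translations by intersecting candidates across phrase pairs."""
    candidates = {}
    for source, target in phrase_pairs:
        target_words = set(target.split())
        for word in source.split():
            if word in candidates:
                # Keep only the target words seen every time this word appears.
                candidates[word] &= target_words
            else:
                candidates[word] = set(target_words)

    # Repeatedly lock in any word with exactly one unclaimed candidate left.
    lexicon = {}
    progress = True
    while progress:
        progress = False
        for word, options in candidates.items():
            if word in lexicon:
                continue
            remaining = options - set(lexicon.values())
            if len(remaining) == 1:
                lexicon[word] = remaining.pop()
                progress = True
    return lexicon


# Toy corpus taken from the example in the summary above.
pairs = [("hombre alto", "tall man"), ("hombre grande", "big man")]
print(align_words(pairs))
```

With only these two pairs, 'hombre' is pinned down first because 'man' is the only word common to both translations; 'alto' and 'grande' then fall out by elimination. A word that never narrows to a single candidate is simply left out of the result.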
The key improvement is not just to search for phrases that appear in the sample texts. If you have an idea for what a word means and what its grammatical role is then you can plug it into other sentences and greatly extend the set of phrases you can translate. Thus an important idea is to search for phrases that match grammatically with phrases you can translate.
However, this requires a stage where the sample texts are used to extract grammatical information on the second language. Of course, it helps a lot if you are familiar with one of the two languages.
What happens when it hits a word with several meanings? For example, take this reply to a previous story: "I got pissed and installed OSX"
drunk?
angry?
urinated?
I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
I remember reading about IBM doing this research about 10 years ago. The biggest problems then were adequate processing power and storage space. Those things have greatly improved in the last 10 years (thank the spirits of Moore). I think that's why you're starting to see all this cool research with speech recognition and AI that was being done in the 80s and 90s become more and more commonplace. This trend will likely continue, and all the cool research-only stuff you remember reading about in the 80s and 90s will just be common fixtures on PCs of today.
:)
Speaking of which -- speech recognition, AI, translation learning algorithms -- sounds like we have the seeds for the Universal Translator.
My journal has hot
France = "Cheese Eating Surrender Monkey"
George Bush = "Neo-Imperialist Moron"
Tony Blair = "Lap Dog"
WMD = "No where to be found"
and of course
Dossier = Creative Story Telling
An Eye for an Eye will make the whole world blind - Gandhi
Translation-unit this algorithm perfectly works! Deutsch this was typed and translation-unit to English makes this was!
The cake is a pie
malo: I had rather be
malo: in an apple tree
malo: than a naughty boy
malo: in adversity
based on four very distinct meanings of malo, in which the word endings put the stem of the word in context, but unfortunately the same word endings are used for different things.
Not that I'm trying to rubbish the work, because I actually think that statistical methods are close to the fuzzy way that we actually try and make out foreign languages. I just wonder what the limits are.
Panurge has posted for the last time. Thanks for the positive moderations.
The article's text has "Compare two simple phrases in Arabic: 'rajl kabir' and 'rajl tawil.' If a computer knows that the first phrase means 'big man,' and the second means 'tall man,' the machine can compare the two and deduce that rajl means 'man,' while kabir and tawil mean 'big' and 'tall,' respectively". Are we going pro-homeland security and not tipping off the powers that be? Or did michael want to show his uber leet 1st quarter espanol skillz?
Spanish is easy and led me to believe that the article had relatively little weight (it is lightweight and a topical PHB read anyway). I do a lot of data mining in text streams and have found it to be fairly easy work. Getting cursors to play in ideograms/unicode and reversing the data is something I haven't tried yet and the article barely covers it. When I saw that they were covering language sets that were extremely dissimilar to English, my interest in multi-language applications was piqued again. All of my databases are unicode and I want to learn more about having truly international systems that are automated and then hand tweaked to avoid the engrish.com type mistakes. Any help here?
-B
Yoda, is that you?
Black holes are where God divided by zero
But that's an old story. Even the translation of complete sentences is fairly feasible in terms of syntactic structure.
Harder to translate are things like discourse markers ("then", "because") because they are highly ambiguous and you would have to understand the text in a way. I have tried to guess these discourse markers with a machine learning model in my thesis about rhetorical analysis with support vector machines (shameful self-promotion), and I got around 62 percent accuracy. While that's probably better than or similar to competing approaches, it's still not good enough for a reliable translation.
And that's just one example of the hurdles in the field. The need for understanding of the text has kept the field from succeeding commercially. Machine Translation these days is a good tool for translators, for example in Localization.
There are a number of problems with the model here that point very clearly to the fact that it has the same shortcomings as other machine translation models.
For example, so long as we're working with cognates or 1:1 equivalencies (tall, man, etc.) it's fine. If we go to words for which there is no 1:1 lexical item, what's it do then? Consider especially words that signify complex concepts that are culture-bound. There would be, by definition, no reason for language #2 to have such a concept, if the culture isn't similar. The other problem arises from statistical sampling. Lexical items that are used exceedingly rarely and have no 1:1 or cognate would be unlikely to make the reference database.
Another similar problem arises with novel coinages and idioms. The example of "The spirit is willing..." is rightly cited. Consider the Russian saying, "Ни пуха, ни пера," which translates as "Neither down nor feathers" but doesn't mean anything of the sort.
Real machine translation has been the golden fleece of computational linguistics for a long time. I'll believe it when I see it.
I'm sure everybody's familiar with the output and quality of the various translators available online. I myself have been very interested in creating such a utility, particularly one based on statistical language analysis. In my time in Holland, I've enjoyed learning the Dutch language, and have found online utilities to be of little help when translating documents (though I do not require this much anymore, it would have been helpful in the beginning).
Although these methods work better than literal word-for-word translation, they're still not going to be perfect without some sort of human intervention. Dutch, for instance, has a completely different sentence structure than English. The sentence "The cow is going to jump over the moon." becomes "De koe gaat over de maan springen" or, literally, "The cow goes over the moon to jump".
Don't laugh at this structure or write it off as useless. I've had discussions with people regarding the grammatical structure of a language and the society around it. Indeed, a specific example I have comes from a TV show, "Kop Spijkers", which is focused mainly on poking fun at political activity and news events. At times, they have people dressed as popular media and political figures and have comical debates.
In one show, a person acting as Peter R. de Vries (roughly the Dutch equivalent of William Shatner on America's Most Wanted) stated the following joke (JS stands for Jack Spijkerman, the host of the program):
PRdV: ...Maar ja, ik ben de niet roker van het jaar.
JS: Hoezo?
PRdV: Nou, ik rook 2 pakjes per dag... niet.
Translated into English, we would not find the humor in this exchange:
PRdV: ...Anyway, I'm the non smoker of the year.
JS: How do you figure that?
PRdV: Well, I ... don't ... smoke 2 packs per day.
Sure you can crack a smile about it, but it's much funnier when the punchline comes at a climax. And in English, it is not possible to state "Well, I smoke 2 packs per day... NOT" (without sounding like a retard who's watched too much Wayne's World).
Getting back on topic, I believe there will be major issues with any translation algorithm to come. This is, of course, to be expected; I hope, however, that more advances will soon follow.
Kind regards, Devon H. O'Dell
On the other hand, having just finished translating a letter from Finnish to German, I fear that in light of the fact that, unlike most other cultures, Germans consider unspeakably long, intertwined sentences with multiple asides quoting their dead grandmothers who used to go on and on like this all day and the mandatory Goethe or Immanuel Kant quote concerning the importance of staying on topic, of which this run-on piece of drivel gives you but a faint impression, rather stylish and intelligent, we might have to wait a while yet.
Would a program know how to break up a monster like that?
Or, seriously, I ended up rewriting most of the letter to convey its contents in a tone that hopefully won't insult the recipient because of differing cultural expectations.
Finns often consider politeness a waste of time. Now explain that to a statistical translator program: "Leave out/add in some polite blablablah"?
From the Article: Compare two simple phrases in Arabic: "rajl kabir" and "rajl tawil." If a computer knows that the first phrase means "big man," and the second means "tall man," the machine can compare the two and deduce that rajl means "man," while kabir and tawil mean "big" and "tall," respectively.
Not to be overly anal (hopefully to raise an important point), "rajl kabir" actually means "old man" not "big man." The Arabs will definitely laugh at you if you mix these up. You'd use the word "tawil" for a tall or generally large man. The word "sameen" refers to a fat or husky guy. In a different context (referring to an inanimate object), "kabir" does in fact mean big.
I wonder how good these statistical systems really are at learning the various grammatical nuances of a language like Arabic. For example, in Arabic, non-human plurals behave like feminine singulars, whereas human plurals behave like plurals.
It's really incredibly cool that these machines can learn language mechanics and definitions on their own. But as previous posters have already noted, the machine still has to know the meanings of words in order to do a good translation.
For example, to translate "big box" and "big man" into Arabic, you'd actually use different words for big, since the box is inanimate, but the man is animate.
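As a rough illustration of why the machine needs more than a word-to-word table, here is a toy sketch in Python where the translation of "big" depends on whether the noun is animate. The `LEXICON` and `ANIMACY` tables and the `translate_adjective` helper are hypothetical, and the entries simply take the glosses given in this comment at face value; they are not from a real dictionary or from the article.

```python
# Hypothetical toy lexicon: entries follow the glosses in the comment above,
# not a real Arabic dictionary.
LEXICON = {
    ("big", "animate"): "tawil",
    ("big", "inanimate"): "kabir",
}

# Hypothetical animacy labels for the two nouns in the example.
ANIMACY = {"man": "animate", "box": "inanimate"}


def translate_adjective(adjective, noun):
    """Pick a translation for the adjective based on the noun's animacy.

    Returns None when the noun or the (adjective, animacy) pair is unknown.
    """
    animacy = ANIMACY.get(noun)
    return LEXICON.get((adjective, animacy))


print(translate_adjective("big", "man"))
print(translate_adjective("big", "box"))
```

The catch is the `ANIMACY` table: somebody, or some statistical process, has to supply that knowledge about the noun before the right word for "big" can be chosen.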
Head down, go to sleep to the rhythm of the war drums...