More on Statistical Language Translation
DrLudicrous writes "The NYTimes is running an article about how statistical language translation schemes have come of age. Rather than compile an extensive list of words and their literal translations via bilingual human programmers, statistical translation works by comparing texts in both English and another language and 'learning' the other language via statistical methods applied to units called 'N-grams', e.g. if 'hombre alto' means tall man, and 'hombre grande' means big man, then hombre=man, alto=tall, and grande=big." See our previous story for more info.
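The deduction in the summary can be sketched as a simple elimination over phrase pairs. This is only a toy illustration of the 'hombre alto' example, not the statistical method the article describes; the function name `deduce_word_pairs` and the elimination strategy are assumptions made for this sketch, and a real system would estimate probabilities over a large parallel corpus instead of intersecting sets.

```python
def deduce_word_pairs(phrase_pairs):
    """Toy deduction: a source word can only map to target words that
    appear in every translated phrase containing that source word."""
    candidates = {}
    for source, target in phrase_pairs:
        target_words = set(target.split())
        for word in source.split():
            if word in candidates:
                candidates[word] &= target_words
            else:
                candidates[word] = set(target_words)

    # Lock in words with exactly one candidate, then remove those
    # translations from the remaining candidate sets, until nothing changes.
    resolved = {}
    progress = True
    while progress:
        progress = False
        for word, options in candidates.items():
            if word not in resolved and len(options) == 1:
                resolved[word] = next(iter(options))
                progress = True
        taken = set(resolved.values())
        for word, options in candidates.items():
            if word not in resolved:
                options -= taken
    return resolved


pairs = [("hombre alto", "tall man"), ("hombre grande", "big man")]
print(deduce_word_pairs(pairs))
```

With only these two phrase pairs, 'hombre' is pinned down first because it is the only word shared by both; 'alto' and 'grande' then fall out by elimination.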
The key improvement is not just to search for phrases that appear in the sample texts. If you have an idea for what a word means and what its grammatical role is then you can plug it into other sentences and greatly extend the set of phrases you can translate. Thus an important idea is to search for phrases that match grammatically with phrases you can translate.
However, this requires a stage where the sample texts are used to extract grammatical information on the second language. Of course, it helps a lot if you are familiar with one of the two languages.
What happens when it hits a word with several meanings? For example the reply to a previous story "I got pissed and installed OSX"
drunk?
angry?
urinated?
I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
Windows = "Windows" - will give translation when enough statistical info becomes available
.Net = Non-ExistenT
NT = Not Trustworthy
That's an example from a few years back of an attempt to translate "the spirit is willing but the flesh is weak" from English to Russian and back to English using a different translator.
Can anyone try this on the new (or some other recent) algorithm?
BTW here's Doc Och's most recent website:
Franz Josef Och [isi.edu]
--
Esteem isn't a zero sum game
I remember reading about IBM doing this research about 10 years ago. The biggest problems then were adequate processing power and storage space. Those things have greatly improved in the last 10 years (thank the spirits of Moore). I think that's why you're starting to see all this cool research with speech recognition and AI that was being done in the 80s and 90s become more and more commonplace. This trend will likely continue, and all the cool research-only stuff you remember reading about in the 80s and 90s will just be common fixtures on PCs of today.
:)
Speaking of which -- speech recognition, AI, translation learning algorithms -- sounds like we have the seeds for the Universal Translator.
My journal has hot
You are back? I didn't know you left.
That'd make Windows some form of negative or negating word, (ie no not- non- anti- kinda thing), making NT trustworthy and Net existent.
And again we prove that automated translation can always be fooled by a well-chosen (or badly chosen) example...
That is, following the article's example as gospel.
.. so presumably the system would create some weightings, then words are assigned meanings according to how likely that meaning is from the probabilities of the surrounding words.
(As far as I can remember such things from Uni.)
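The weighting idea above can be sketched as scoring each candidate meaning by how likely the surrounding words are under it. The counts in `SENSE_CONTEXT_COUNTS` are made up for illustration, and `pick_sense` is a hypothetical helper; this is a rough naive-Bayes-style sketch, not how any particular translation system does it.

```python
import math

# Hypothetical counts: how often each context word was seen near each sense
# of "pissed" in some imagined training text.
SENSE_CONTEXT_COUNTS = {
    "drunk": {"pub": 8, "beer": 9, "night": 5},
    "angry": {"off": 9, "boss": 6, "crashed": 4},
}


def pick_sense(context_words, counts=SENSE_CONTEXT_COUNTS):
    """Return the sense whose context counts best explain the given words."""
    vocab = {word for seen in counts.values() for word in seen}
    scores = {}
    for sense, seen in counts.items():
        total = sum(seen.values())
        # Add-one smoothing so an unseen context word does not zero the score.
        scores[sense] = sum(
            math.log((seen.get(word, 0) + 1) / (total + len(vocab)))
            for word in context_words
        )
    return max(scores, key=scores.get)


print(pick_sense(["beer", "pub"]))
print(pick_sense(["boss", "crashed"]))
```

The smoothing matters: without it, a single context word never seen with a sense would rule that sense out completely.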
France = "Cheese Eating Surrender Monkey"
George Bush = "Neo-Imperialist Moron"
Tony Blair = "Lap Dog"
WMD = "No where to be found"
and of course
Dossier = Creative Story Telling
An Eye for an Eye will make the whole world blind - Gandhi
Translation-unit this algorithm perfectly works! Deutsch this was typed and translation-unit to English makes this was!
The cake is a pie
how about trigrams (N=3)? how much memory
will they take? too much. N-grams are
ancient history.
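For a feel of the numbers behind the memory question, here is a small sketch comparing how many trigrams are possible for a vocabulary with how many actually occur in a text. The helper names and the sample sentence are assumptions for illustration; it does not settle whether the real tables are "too much".

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def possible_ngrams(vocab_size, n):
    """Upper bound on distinct n-grams for a vocabulary of the given size."""
    return vocab_size ** n


tokens = "the cat sat on the mat the cat ate".split()
observed = ngram_counts(tokens, 3)
print(len(observed), "observed trigrams out of",
      possible_ngrams(len(set(tokens)), 3), "possible")
```

A table only has to store the n-grams that were actually seen, which is far fewer than the theoretical maximum, though it still grows with the size of the training text.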
malo: I had rather be
malo: in an apple tree
malo: than a naughty boy
malo: in adversity
based on four very distinct meanings of malo, in which the word endings put the stem of the word in context, but unfortunately the same word endings are used for different things.
Not that I'm trying to rubbish the work, because I actually think that statistical methods are close to the fuzzy way that we actually try and make out foreign languages. I just wonder what the limits are.
Panurge has posted for the last time. Thanks for the positive moderations.
The article's text has "Compare two simple phrases in Arabic: "rajl kabir'' and "rajl tawil.'' If a computer knows that the first phrase means "big man," and the second means "tall man," the machine can compare the two and deduce that rajl means "man," while kabir and tawil mean "big" and "tall," respectively". Are we going pro-homeland security and not tipping off the powers that be? Or did michael want to show his uber leet 1st quarter espanol skillz?
Spanish is easy and led me to believe that the article had relatively little weight (it is lightweight and a topical PHB read anyway). I do a lot of data mining in text streams and have found it to be fairly easy work. Getting cursors to play in ideograms/unicode and reversing the data is something I haven't tried yet and the article barely covers it. When I saw that they were covering language sets that were extremely dissimilar to English, my interest in multi-language applications was piqued again. All of my databases are unicode and I want to learn more about having truly international systems that are automated and then hand tweaked to avoid the engrish.com type mistakes. Any help here?
-B
The trouble with the Star Trek "Universal Translator" is that they show it working on languages where there is no already translated work. This sort of statistical translation requires someone to sit down and hand-translate a bunch of documents to teach the machine the correlations.
The cake is a pie
As for inflected (read most) languages, learning to separate a word into its stem and inflections is the first step, even if you have a number of such possible break-ups.
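Keeping several possible break-ups per word can be sketched as below. The suffix list is a hypothetical English toy inventory chosen only to keep the example readable; a real inflected language would need a much richer list, and `candidate_splits` is a name made up for this sketch.

```python
# Hypothetical suffix inventory; the empty string means "no inflection".
SUFFIXES = ("", "s", "es", "ed", "ing", "er")


def candidate_splits(word, suffixes=SUFFIXES, min_stem=2):
    """Return every (stem, suffix) break-up that the suffix list allows."""
    splits = []
    for suffix in suffixes:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= min_stem:
            splits.append((word[:stem_length], suffix))
    return splits


print(candidate_splits("boxes"))
```

All candidates are kept rather than picking one; deciding between 'boxe' + 's' and 'box' + 'es' is left to a later stage, for example by checking which stems show up elsewhere in the texts.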
Wow, these guys are just begging for a lawsuit from you-know-who.
Yoda, is that you?
Black holes are where God divided by zero
If this is just statistics, and you can do anything in C, why not statistically relate C to machine code and look at Windows machine code to get a C source that is clean room? Or perhaps look at MSword input vs word document format?
FINALLY! After all these years of scrambled languages, we can finally get together and plan that tower of Babel!
Now, all we need is to pinpoint Kolob and we'll be set!
But that's an old story. Even the translation of complete sentences is fairly feasible in terms of syntactic structure.
Harder to translate are things like discourse markers ("then", "because") because they are highly ambiguous and you would have to understand the text in a way. I have tried to guess these discourse markers with a machine learning model in my thesis about rhetorical analysis with support vector machines (shameful self-promotion), and I got around 62 percent accuracy. While that's probably better than or similar to competing approaches, it's still not good enough for a reliable translation.
And that's just one example of the hurdles in the field. The need for understanding of the text kept the field from succeeding commercially. Machine Translation these days is a good tool for translators, for example in Localization.
Go to Babelfish, type in something, and translate it from English > German > French > English. If you're creative you'll get some of the funniest translations ever. If you use slang words, it generally loses all context in the translation.
This is why watching foreign films and listening to the French spoken and reading the English subtitles leaves so much out. A simple Tu versus Vous is not directly translatable to English because we don't have formal/familiar built in. Saying Tu to an old lady you don't know on a bus in France will get you bitched out.
There are a number of problems with the model here that point very clearly to the fact that it has the same shortcomings as other machine translation models.
For example, so long as we're working with cognates or 1:1 equivalencies (tall, man, etc.) it's fine. If we go to words for which there is no 1:1 lexical item, what's it do then? Consider especially words that signify complex concepts that are culture-bound. There would be, by definition, no reason for language #2 to have such a concept, if the culture isn't similar. The other problem arises from statistical sampling. Lexical items that are used exceedingly rarely and have no 1:1 or cognate would be unlikely to make the reference database.
Another similar problem arises with novel coinages and idioms. The example of "The spirit is willing..." is rightly cited. Consider the Russian saying, "Ni pukha, ni pera," which translates as "Neither down nor feathers" but doesn't mean anything of the sort.
Real machine translation has been the golden fleece of computational linguistics for a long time. I'll believe it when I see it.
Make sure the dupe points to a major advertiser's website.
I always said you Yanks couldn't even use your own language properly... [fx: ducks]
Ceterum censeo subscriptionem esse delendam.
I'm sure that everybody's familiar with the output and quality of the various translators available online. I myself have been very interested in creating such a utility, specifically one based on statistical language analysis. In my time in Holland, I've enjoyed learning the Dutch language, and have found online utilities to be of little help when translating documents (though I do not require this much anymore, it would have been helpful in the beginning).
...Maar ja, ik ben de niet roker van het jaar.
JS: Hoezo?
PRdV: Nou, ik rook 2 pakjes per dag... niet.
...Anyway, I'm the non smoker of the year.
JS: How do you figure that?
PRdV: Well, I ... don't ... smoke 2 packs per day.
Although these methods work better than literal word-for-word translation, they're still not going to be perfect without some sort of human intervention. Dutch, for instance, has a completely different sentence structure than does English. For instance, the sentence "The cow is going to jump over the moon." becomes "De koe gaat over de maan springen" or, literally, "The cow goes over the moon to jump".
Don't laugh at this structure or perhaps any unobvious usefulness. I've had discussions with people regarding the grammatical structure of a language and the society around it. Indeed, a specific example I have comes from a TV show "Kop Spijkers", which is a show focused mainly on poking fun at political activity and news events. At times, they have people dressed as popular media and political figures and have comical debates.
In one show, a person acting as Peter R. de Vries (roughly the Dutch equivalent of William Shatner on America's Most Wanted) stated the following joke (JS stands for Jack Spijkerman, the host of the program):
PRdV:
Translated into English, we would not find the humor in this transaction:
PRdV:
Sure you can crack a smile about it, but it's much funnier when the punchline comes at a climax. And in English, it is not possible to state "Well, I smoke 2 packs per day... NOT" (without sounding like a retard who's watched too much Wayne's World).
Getting back on topic, I believe there will be major issues with any translation algorithm to come. This is, of course, to be expected; I hope, however, that more advances will soon follow.
Kind regards, Devon H. O'Dell
...when it's able to translate stuff like:
"Shaka, when the walls fell!"
"I'm an old-fashioned type of guy. I worship the Sun and Moon as gods. And fear them."
It might save a nasty kick in the nads too.
On the other hand, having just finished translating a letter from Finnish to German, I fear that in light of the fact that, unlike most other cultures, Germans consider unspeakably long, intertwined sentences with multiple asides quoting their dead grandmothers who used to go on and on like this all day and the mandatory Goethe or Immanuel Kant quote concerning the importance of staying on topic, of which this run-on piece of drivel gives you but a faint impression, rather stylish and intelligent, we might have to wait a while yet.
Would a program know how to break up a monster like that?
Or, seriously, I ended up rewriting most of the letter to convey its contents in a tone that hopefully won't insult the recipient because of differing cultural expectations.
Finns often consider politeness a waste of time. Now explain that to a statistical translator program: "Leave out/add in some polite blablablah"?
A famous quote from one of the project leaders, Fred Jelinek if I'm not mistaken, was that for every linguist that he fired from the team, the performance of the system improved by 10%...
Now that artificial intelligence (AI) has been solved, machine translation (MT) may advance to a higher plane of equality with human translators who spend years learning the nuances and subtleties of their target human languages.
Computer science has found the Holy Grail of AI in the Concept-Fiber Theory of Mind that led directly to the free AI source code of the Mind-1.1 Tutorial AI described in the AI For You textbook of artificial intelligence and robotics.
The Association for Computing Machinery has published an article on the robot Mind.Forth AI, and a well-known AI expert has favorably reviewed the Fiber-Concept Theory of Mind.
Traditional Artificial Intelligence Textbooks are suddenly obsolete, outmoded, or desperately in need of thorough revision and updating to teach Automatic Machine Translation now that AI has been solved.
- restricted domains (subject matters)
- restricted range of grammatical constructions
- restricted genre (style)
- restricted range of cultural presuppositions
In other words, it works best for technical manuals.
One of the keys to making a statistical model work is to make wise choices about what statistics to collect, and what dependencies to include. For example, N-grams work by predicting the probability of a certain word appearing given the previous word or so; this kind of works but misses a lot because the structure of a sentence is more like a tree than a series. More complex models can capture more relevant information. On the other hand, if the model is too complex, it won't work for two reasons: because it requires too much memory/CPU, and because you can't get a reliable estimate of the probabilities without multiple examples of each situation (this problem is called data sparsity).
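A minimal sketch of the simplest case, a bigram model that predicts a word from the one before it, also shows data sparsity in action. The training sentences and the names `train_bigrams` and `bigram_prob` are assumptions for illustration, and no smoothing is applied so that the sparsity problem stays visible.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts


def bigram_prob(counts, prev, word):
    """Unsmoothed estimate of P(word | prev) from the raw counts."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return following[word] / total if total else 0.0


model = train_bigrams(["the tall man", "the big man", "the tall tree"])
print(bigram_prob(model, "the", "tall"))   # seen twice out of three
print(bigram_prob(model, "big", "tree"))   # never seen together
```

"big tree" is a perfectly good phrase, but because it never occurred in the three training sentences the model gives it probability zero. That is data sparsity: the estimate is only as good as the number of examples behind it.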
Hey now... engrams? I thought those were under the exclusive purview of the scientologists...
This idea is like the behaviorist idea that a baby is a blank slate and he just learns the language by association like Pavlov's dog. Something similar has been tried with neural networks etc.
However, this method does not work, as the silly examples elsewhere in the discussion show. You can only understand or translate if you "know" what is meant.
There is no way of figuring it out. There isn't enough information supplied in the texts themselves. You have to be born with the inherent ability to understand stuff.
You'll find a good discussion of this in Steven Pinker's "The Language Instinct", which I recommend.
Raw dictionary work is pretty much the least interesting, most mechanical part of an MT system.
Grammar (source parsing, transformation and target generation) takes a lot more work and careful thinking.
The more accurate you want your MT system to be, the more extra information you want to attach to your dictionary entries (the more the system knows about all the words, the more disambiguation using real-world knowledge it can do.) "I have a ball" vs "I have an idea" translate to some languages quite differently; you need to know that you don't (usually) physically hold "an idea" in your hand. The most common words ("is", "have") are often the worst in this respect.
(I have worked on coding an MT system.)
N-grams? N-grams? DON'T CLICK ON THE LINK!
It's a CoS trick to enslave us all!
We piss you!
just hope the system doesn't involve karma...
madonna (5 interesting)
microsoft (-3 troll)
aren't there also words (slang) that have no direct translation? what happens then?
Linux: Helping nerds look smarter since the late 90s.
Did anyone else here take Dr. Eisner's "Natural Language Processing" course at Hopkins? I've definitely had my fill of n-grams for now, thanks :)
Intercarve Networks, LLC
Go have your piss parties somewhere else, you perverts!
Step 1 - "Out of sight, out of mind."
Step 2 - Step 1 machine translated to Russian
Step 3 - Step 2 machine translated back to English
Result:
"Invisible idiot"
Mr. Spock would say: Logical!
Like most computerised translation efforts, this ignores the fact that translation always requires context. The sentence 'fruit flies like an orange' is a classic example in the English language of a sentence which can be interpreted in two different ways - sentences can easily be constructed which have completely different meanings in different contexts.
'As a punishment, he was given a longer sentence'. Obviously, we're talking prison, right? Well, what if the preceding sentence was:
'The teacher had grown weary of his poor attempts at translation'?
A statistical system, even working with the entire phrase, won't be able to figure out which meaning of the word 'sentence' is intended there.
how about:
'The box was heavy. We had to put it down'
'The dog was ill. We had to put it down'
You need semantic understanding to be able to perform translation.
I wonder if this could work on the Voynich Manuscript? (www.voynich.nu)
N-Gram is/was also the name of William Orbit's label.
From the Article: Compare two simple phrases in Arabic: "rajl kabir" and "rajl tawil." If a computer knows that the first phrase means "big man," and the second means "tall man," the machine can compare the two and deduce that rajl means "man," while kabir and tawil mean "big" and "tall," respectively.
Not to be overly anal (hopefully to raise an important point), "rajl kabir" actually means "old man" not "big man." The Arabs will definitely laugh at you if you mix these up. You'd use the word "tawil" for a tall or generally large man. The word "sameen" refers to a fat or husky guy. In a different context (referring to an inanimate object), "kabir" does in fact mean big.
I wonder how good these statistical systems really are at learning the various grammatical nuances of a language like Arabic. For example, in Arabic, non-human plurals behave like feminine singulars, whereas human plurals behave like plurals.
It's really incredibly cool that these machines can learn language mechanics and definitions on their own. But as previous posters have already noted, the machine still has to know the meanings of words in order to do a good translation.
For example, to translate "big box" and "big man" into Arabic, you'd actually use different words for big, since the box is inanimate, but the man is animate.
Head down, go to sleep to the rhythm of the war drums...
God no.
That is so wrong it pains me.
What about when one language is completely missing concepts that are present in another?
Lots of languages contain masculine and feminine words, which changes the structure of the sentence they're in (for example "une tete le poo-poo" would be correct, but "un tete le poo-poo" would not - even though "une" and "un" are basically the same word..)
Or how about asian languages, in which there is the concept of honorific titles for older (and sometimes younger) siblings - in Tagalog, for example, "ate" (pronounced "ah-tey") means "older sister".. so in English, "ate Teresa" means "big-sister Teresa" - and it's very important that this prefix is used, otherwise it would be disrespectful.
English has a similar function for uncle/aunt, but the concept for siblings doesn't exist..
Or in Welsh, the concept of "having" something is absent.. you don't "have" something, it's "with you".. instead of saying "take this pen", you would say "go with this pen"... you wouldn't say "he is rich", you'd say "he is of the money"
Or mutations - syllables of one word mutate into another depending on the context in which it's used..
Combined with your examples of differing grammatical structure, I don't see this being better than human beings for a long time to come.
For example, the English word pattern can be translated in French by any of (please excuse the lack of accents, they were stripped when I submitted): modele, exemple, type, schema, dessin, motif, maquette, patron, plan, disposition, groupement, repartition, combinaison, diagramme, gabarit, echantillon, tendance, figure, circuit (and probably others as well) depending on the context -- and not just the lexical context, but the meaning.
Previous attempts to automate translation focused on giving computers grammatical and semantic knowledge, in the hope that they could infer some meaning from this and so choose the right equivalents. Despite some success, this approach failed in general, putting machine translation (MT) firmly in the realm of AI. I believe this statistical approach is a step in the wrong direction (back to purely lexical means of analyzing texts with a view to translation). Further progress in MT will come from AI.
This doesn't detract from the ways in which computers have been useful to translators -- in the area of computer-assisted translation (translation memory, localization, terminology databases, etc.)
The other point is it's a lot harder to get a good-quality parallel corpus than you'd think (even in the Internet age -- most of the stuff on the Internet is crap anyway).
It's not the idea of using computers in translation that I think is limited, just this approach.
I also find it humorous that someone actually modded you up! Isn't that hilarious? They must be in on the joke too!
One of Beryllium Sphere's partners is a computational linguist specializing in hand-built representations of how one small domain of discourse uses words.
Her last big project was automatic translation of (you guessed it) technical manuals.
godot42a is spot on. The English originals of the technical manuals had to be written in a subset of English which restricted the range of grammatical expressions. Tech writers had to run a program to check their work for compliance.
In summary, even if you build a translation program that has "word knowledge" hand-crafted by brilliant polymaths, you still have the limitations that godot42a points out.
Actually, we do sometimes use 'fries', to distinguish them from 'chips' which are usually more than three millimetres thick and have actually been near a potato! We also use both 'cookie' and 'biscuit'; the former for larger, thicker things, often with chocolate drops, nuts, or whatever. What do you mean by 'biscuit'?
And I've no idea what 'podger' is - I've never heard it, and neither dictionaries nor Google can come up with anything more relevant than its use as a surname. Is it an obscure regional or dialect term?
On a more general point, ISTM that US English tends to like ambiguity more than British English, which is a slightly more precise tool that can distinguish between a rubber thing on a car wheel (tyre) and to become exhausted (tire); between road edging (kerb) and to prevent (curb); between verb and noun forms of practise/practice, license/licence, &c; between a measuring device (meter) and the unit of length (metre); between a movement of fluid (draught) and a rough outline (draft); between a series of instructions to a computer (program) and a list of events (programme); between a test (check) and a written instruction to a bank (cheque); &c &c. The 'pissed'/'pissed off' distinction is simply one more example of this.
The other interesting point is that in the majority of cases where usage, spelling, punctuation &c differs, it's US English that is the older variant. Oddly enough, here we seem to be more open to change, especially positive change, to the language.
Ceterum censeo subscriptionem esse delendam.
An engineer was confused when a translated spec included water goats. "Water goats"?! Hydraulic rams, actually.
And perhaps most famous of all, "out of sight, out of mind" supposedly came back as "blind idiot".
Language is a curious thing. I can't help thinking there's some deeper meaning to the fact that misapplication of it can so easily be funny to us.
Fuck the system? Nah, you might catch something.
I wrote a paper a long time ago on applying the N-gram technique with statistical methods for use in CBR.
;-)
You can find the paper here (PDF) and the presentation here.
I can imagine some distributions of this translation system that take this code - with improvements - and precook large corpuses to create translators. Anyone want to write the Mozilla and OpenOffice plug-ins for the new menu item "Edit/Translate Language"?
- David A. Wheeler (see my Secure Programming HOWTO)
I totally agree.
Furthermore, in the end language is only a carrier of meaning and meaning ultimately refers to non-linguistic objects. Therefore, you can't understand language (fully) without understanding reality (at least partially).
And while machine translation is a relatively hard job, there are examples that suggest that even automated insertion of hyphens ultimately needs extralinguistic knowledge!
Dead end research gets lots of press but the inventions that really change our lives are not reported until everybody knows about them?
What do you mean by 'biscuit'?
A biscuit is like a scone, only delicious.
(Just don't ask how to pronounce them...)
Ceterum censeo subscriptionem esse delendam.
translator who makes the source texts.
Ok, so it can parse things at the word level, but what about sentence-level understanding (semantics) and odd exceptions such as idioms that could totally screw up the statistical learning process?
After taking a linguistics class I realized language is very very complex and we are many years away from being able to create decent language translation systems (babelfish is not really decent at least in my eyes), speech synthesis, etc.
"If we can learn how to translate even Klingon into English, then most human languages are easy by comparison," [Dr. Knight] said.
That's not really the case. Klingon was created through conscious effort and hasn't evolved many (any?) warts over time. Its structure is akin to well-understood human languages.
Now take Turkish, which has concatenative grammar. Adjectives are applied by tacking suffixes on to the word, sometimes changing spelling of previous chunks. Thus, a 20-word English phrase may correspond to a single Turkish word and extremely long words may be reasonably assumed to be unique. Statistical techniques can work with Turkish, but it requires some work up front to extract tokens. \b\B+\b doesn't help much. German (and, I think, Greek) are like this to a lesser extent.
Statistical approaches are often quite effective in language processing, much to the surprise and disheartening of linguists. They're far from perfect, but often the best thing so far.
Ceci n'est pas une signature.
... we can always use Babelfish!
err...
I hereby place the above post in the public domain.
Hm. I guess a biscuit is sort of like a scone. But generally a biscuit is salty and has more butter than a scone, and less sugar. People sometimes dip them in gravy. I don't know of any British equivalent, as it seems to be more or less unique to the American South. You might as well ask for an equivalent to grits.
I hereby place the above post in the public domain.
That will be a fun one to give a translation program. (Or a speech recognition program, for that matter).
Kind of like the approach one takes to crack simple encryption schemes.
here
[for e][or ex][r exa][ exam][examp] and so on.
Using n-grams this way helps with things like mis-spellings. Mr. Metlin (parent of this) used the character definition in his paper. N-grams are widely used in Information Retrieval Research.
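The character-level definition above can be sketched in a few lines, along with one way it might help with mis-spellings. The Jaccard overlap used in `ngram_similarity` is an assumption picked for simplicity, not necessarily what the paper mentioned above uses.

```python
def char_ngrams(text, n):
    """Slide a window of n characters across the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_similarity(a, b, n=3):
    """Share of character n-grams the two strings have in common (Jaccard)."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(char_ngrams("for example", 5)[:5])
print(ngram_similarity("example", "exampel"))
print(ngram_similarity("example", "banana"))
```

The first line reproduces the windows listed above. A mis-spelled word still shares most of its n-grams with the intended word, so it scores much higher than an unrelated word does.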
I mod down all the "free iPod"-sig losers.
Does this method deal with different grammar structures? In inflected languages, like Latin and Russian, grammatical cases are signified by different endings. Also, is only one translation used or do they use many competing translations? Different translations of the same phrases can yield drastically different meanings.
often the change is pointless, negative and the result of a snobbish regard for classical and Latin languages.
e.g., colour, favourite, defence.
The Victorians had some weird ideas: they insisted Shakespeare wrote in iambic pentameter (a Greek meter impossible in English), they insisted you couldn't split an infinitive (just because you can't in ancient Greek), and even told people off for ending a sentence with a preposition - why can't I if I want to!?
The Americans rightly resisted all that nonsense.
I meant to say:
<person> pissed = <person> inebriated
liquid pissed = liquid urinated