More on Statistical Language Translation

Posted by ryuzaki0 on Thursday July 31, 2003 @12:18AM from the ma-grandmere-est-flambe dept.

DrLudicrous writes "The NYTimes is running an article about how statistical language translation schemes have come of age. Rather than compile an extensive list of words and their literal translations via bilingual human programmers, statistical translation works by comparing texts in both English and another language and 'learning' the other language via statistical methods applied to units called 'N-grams', e.g. if 'hombre alto' means tall man, and 'hombre grande' means big man, then hombre=man, alto=tall, and grande=big." See our previous story for more info.
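
The 'hombre alto' example in the summary can be worked through in a few lines. What follows is only a toy sketch: it assumes perfectly clean phrase pairs and uses plain set intersection plus elimination, whereas real systems estimate alignment probabilities from large corpora. The function name `learn_lexicon` is made up for illustration and is not from the article.

```python
def learn_lexicon(pairs):
    """Guess word translations from (source phrase, target phrase) pairs."""
    candidates = {}
    for source, target in pairs:
        target_words = set(target.split())
        for word in source.split():
            # Keep only target words seen alongside every use of this source word.
            if word in candidates:
                candidates[word] &= target_words
            else:
                candidates[word] = set(target_words)

    # Settle any word with a single candidate, then strike that candidate
    # from the other words, and repeat until nothing changes.
    lexicon = {}
    progress = True
    while progress:
        progress = False
        for word, options in candidates.items():
            if word not in lexicon and len(options) == 1:
                (choice,) = options
                lexicon[word] = choice
                for other, other_options in candidates.items():
                    if other != word:
                        other_options.discard(choice)
                progress = True
    return lexicon


pairs = [("hombre alto", "tall man"), ("hombre grande", "big man")]
lexicon = learn_lexicon(pairs)
print(lexicon)
```

'hombre' is the only word shared by both Spanish phrases and 'man' is the only word shared by both English ones, so that pairing is settled first; 'alto' and 'grande' then follow by elimination.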

No Universal Translator any time soon by ucblockhead · 2003-07-31 00:45 · Score: 2, Insightful

The trouble with the Star Trek "Universal Translator" is that they show it working on languages where there is no already translated work. This sort of statistical translation requires someone to sit down and hand-translate a bunch of documents to teach the machine the correlations.

--
The cake is a pie
Re:So statiscally... by Matthias+Wiesmann · 2003-07-31 00:49 · Score: 4, Insightful

Actually, using this technology to translate from English to English could be quite interesting. Imagine you could automatically translate legalese, or marketing speak, to plain English. Or translate an article with a given political bias towards another political bias.
If this happens, I suspect this technology will be illegal...
Re:Translator by buro9 · 2003-07-31 01:21 · Score: 2, Insightful

That wouldn't apply here as the sample data you've suggested is too little.

For statistical translations to work, you would need a substantial set of data, already translated, from which you could do the comparisons and create your database of phrases and words.

In the example you've given you would need to have pre-populated this database in advance for the statistical engine to understand how to do the translation.

What you've got to do is stop thinking that this is actually performing a translation... it's not... it's performing a cost-based replacement where the costs have been calculated from statistics gathered from a large pool of sample data.

Once you have the sample data... then you will have the translation.
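
The "cost-based replacement" described above might look roughly like this. It is a minimal sketch under stated assumptions: the aligned phrase pairs in `sample` are invented, the cost is taken to be the negative log of the relative frequency, and each phrase is replaced independently (a real engine would also weigh word order and the fluency of the output). The names `build_cost_table` and `translate` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_cost_table(aligned_phrases):
    """Turn counts of (source, target) pairs into costs: -log(relative frequency)."""
    counts = defaultdict(Counter)
    for source, target in aligned_phrases:
        counts[source][target] += 1
    table = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        table[source] = {t: -math.log(n / total) for t, n in target_counts.items()}
    return table


def translate(phrases, table):
    """Replace each phrase with its cheapest known target; pass unknown phrases through."""
    output = []
    for phrase in phrases:
        options = table.get(phrase)
        output.append(min(options, key=options.get) if options else phrase)
    return output


# Invented sample data standing in for a large pool of already translated text.
sample = (
    [("hombre", "man")] * 3 + [("hombre", "guy")]
    + [("alto", "tall")] * 2 + [("alto", "high")]
)
cost_table = build_cost_table(sample)
result = translate(["hombre", "alto", "perro"], cost_table)
print(result)
```

'perro' never appears in the sample data, so it comes out untranslated, which is the point being made: without the pre-populated statistics there is nothing to look up.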
I'll be convinced... by Rocky · 2003-07-31 01:29 · Score: 2, Insightful

...when it's able to translate stuff like:

"Shaka, when the walls fell!"

--
"I'm an old-fashioned type of guy. I worship the Sun and Moon as gods. And fear them."
Re:Of course, in British English... by shish · 2003-07-31 01:47 · Score: 2, Insightful

> I urinated is 'I pissed'

Not "I urinated", but "I got urinated" - how could it tell?

Also I sometimes say "I'm pissed" (no 'off') when I'm angry, and I'm British. Although as I just pointed out, that could mean "I'm urinated" :P

--
I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
Re:Same words, different meanings by Anonymous Coward · 2003-07-31 01:49 · Score: 2, Insightful

It gets even more complicated, particularly with the connotations attached to certain words and phrases.

For example one country's "Weapons of Mass Destruction" is another country's "Strategic Deterrent". Both phrases mean the same thing but the tone is very different. Same thing with "terrorists" and "freedom fighters". You can use either phrase to describe the same people and imply very different meanings.

It will be a long time before an automated system will be able to make an acceptable translation of these subtleties.
We won't. by godot42a · 2003-07-31 02:00 · Score: 2, Insightful
There's no chance (or risk) statistical translation can put human translators out of business for quite a long time to come. The main point is that because these programs completely lack world knowledge, they must try to "understand" the sentences on a purely structural level. This works for
- restricted domains (subject matters)
- restricted range of grammatical constructions
- restricted genre (style)
- restricted range of cultural presuppositions
In other words, it works best for technical manuals ;).
I do this stuff for a living... by elbanevretep · 2003-07-31 02:04 · Score: 2, Insightful

One of the keys to making a statistical model work is to make wise choices about what statistics to collect, and what dependencies to include. For example, N-grams work by predicting the probability of a certain word appearing given the previous word or so; this kind of works but misses a lot because the structure of a sentence is more like a tree than a series. More complex models can capture more relevant information. On the other hand, if the model is too complex, it won't work for two reasons: because it requires too much memory/cpu, and because you can't get a reliable estimate of the probabilities without multiple examples of each situation (this problem is called data sparsity).
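
A bigram model (N = 2) and the data sparsity problem can both be shown on a tiny corpus. This is a sketch, not a production model: the three training sentences are invented, and the add-one smoothing option is one common textbook remedy for sparsity that is added here for illustration rather than taken from the comment above.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count how often each word follows each previous word."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # <s> marks the sentence start
        vocab.update(words[1:])
        contexts.update(words[:-1])         # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, vocab


def prob(prev, word, contexts, bigrams, vocab, smoothing=0.0):
    """Estimate P(word | prev); smoothing > 0 reserves mass for unseen pairs."""
    numerator = bigrams[(prev, word)] + smoothing
    denominator = contexts[prev] + smoothing * len(vocab)
    return numerator / denominator if denominator else 0.0


corpus = ["the tall man", "the big man", "the tall dog"]
contexts, bigrams, vocab = train_bigrams(corpus)

seen = prob("the", "tall", contexts, bigrams, vocab)          # 2 of 3 uses of "the"
unseen = prob("big", "dog", contexts, bigrams, vocab)         # never observed
smoothed = prob("big", "dog", contexts, bigrams, vocab, 1.0)  # add-one estimate
print(seen, unseen, smoothed)
```

"big dog" is perfectly good English, but the unsmoothed model gives it probability zero simply because the corpus never contained it. Longer N-grams make this worse, since the number of possible word sequences grows much faster than the training data.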
This approach is limited by Oryx3 · 2003-07-31 02:57 · Score: 3, Insightful
Yes, that's a big problem with statistical methods. The point is that we don't just use words with specific meanings like "man" or "tall", but we also use:
- abstract words that take on different meanings in different contexts (i.e. they're polymorphic)
- we use words metaphorically (the "pissed" example above). Metaphor requires the reader to make the connection on the fly between two concepts, hence it requires intelligence. ("On the fly" is a good example. A computer can be given a list of such metaphorical expressions, but recognizing new ones is a much harder problem.)
- we use words incorrectly, or misspell them, or use imperfect grammar, but that's OK because our human reader is able to infer the meaning
- humans think it's funny sometimes to use words in the wrong context, i.e. where the metaphorical meaning is really outlandish, or there is a conflict between the idea and the way it is expressed. I think we like this because it requires intelligence to work out the meaning in these cases.
For example, the English word pattern can be translated into French by any of (please excuse the lack of accents, they were stripped when I submitted): modele, exemple, type, schema, dessin, motif, maquette, patron, plan, disposition, groupement, repartition, combinaison, diagramme, gabarit, echantillon, tendance, figure, circuit (and probably others as well) depending on the context -- and not just the lexical context, but the meaning.

Previous attempts to automate translation focused on giving computers grammatical and semantic knowledge, in the hope that they could infer some meaning from this and so choose the right equivalents. Despite some success, this approach failed in general, putting machine translation (MT) firmly in the realm of AI. I believe this statistical approach is a step in the wrong direction (back to purely lexical means of analyzing texts with a view to translation). Further progress in MT will come from AI.

This doesn't detract from the ways in which computers have been useful to translators -- in the area of computer-assisted translation (translation memory, localization, terminology databases, etc.)

The other point is it's a lot harder to get a good-quality parallel corpus than you'd think (even in the Internet age -- most of the stuff on the Internet is crap anyway).
It's not the idea of using computers in translation that I think is limited, just this approach.
Re:unfortunately doomed by plasticmillion · 2003-07-31 02:58 · Score: 4, Insightful

This is definitely true. At the same time, the results of statistical natural language processing are surprisingly good. Really this should not be so surprising, since they function in a way similar to the human brain. A neural network like the brain is designed to deduce a complex function from training data. I believe strongly that the best way to get intelligent(-seeming) behavior out of machines is to mirror this process.
Artificial neural nets are one way to do this, but statistical methods are more or less analogous and have the advantage of being highly optimizable. Personally I don't understand the details, but Very Smart Mathematicians have found ways to optimize models like Singular Value Decompositions (SVDs) so that they can be calculated orders of magnitude faster than models that cannot be represented as formally using mathematics.
The bottom line is that statistical methods are probably the way that we will end up producing brain-like behavior on computers, and the fact that there are promising results already is heartening. Yes, for truly intelligent behavior a lot of domain knowledge will also be needed, as you point out. But I don't see any reason why the extraction and mapping of this knowledge couldn't also be achieved with large training corpora and statistical methods, rather than hand-crafting.

--
Peer Pressure
Re:the real problems lie in understanding... by t · 2003-07-31 05:53 · Score: 2, Insightful

To expound on the AC and Koos Baster's comments, try asking people to define ordinary words. You'll find quite often that the more basic the word, the more difficult it becomes. The definition of all words is circular since the definition of any word is given by other words, e.g., recursive: see recursive. Somewhere there needs to be a list of words with pictures, or math, or other way of defining each word without using any previously undefined words.
Re:unfortunately doomed by capologist · 2003-07-31 06:12 · Score: 4, Insightful

It may be possible for this approach to address that issue somewhat. Statistics can be collected not only on associations of words with other words, but also on associations of groups of words or phrases with others. So if the translator has learned from documents in which the phrase "put it down" appears near the word "ill" and the word "dog," and from other documents in which the phrase is associated with the word "heavy," it can make a good guess.

Clearly, it would need to learn from a tremendous amount of input data before it could begin to approach the experience of a human, and hence make guesses of similar quality to a human translator. However, the amount of available source material is increasing so rapidly that it may be possible for a translator to get pretty darn smart this way.
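
The "put it down" idea above can be sketched as counting which words tend to appear near each sense of the phrase. Everything here is an assumption made for illustration: the sense labels `euthanize` and `set_down`, the four training snippets, and the crude scoring by summed counts. A real system would need far more data and a proper probabilistic model.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count the context words seen with each sense of an ambiguous phrase."""
    counts = defaultdict(Counter)
    for context, sense in examples:
        counts[sense].update(context.lower().split())
    return counts


def guess_sense(context, counts):
    """Pick the sense whose training contexts share the most words with this one."""
    words = context.lower().split()
    scores = {sense: sum(seen[w] for w in words) for sense, seen in counts.items()}
    return max(scores, key=scores.get)


# Invented snippets of text found near "put it down", labelled by sense.
examples = [
    ("the dog was very ill", "euthanize"),
    ("the ill dog suffered", "euthanize"),
    ("the box was heavy", "set_down"),
    ("a heavy crate hurt his arms", "set_down"),
]
counts = train(examples)

first = guess_sense("our dog is ill", counts)
second = guess_sense("this suitcase is heavy", counts)
print(first, second)
```

With only four snippets the guesses are fragile, which matches the point about needing a tremendous amount of input before the guesses approach those of a human translator.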
Re:Arabic Grammar Nazi by Anonymous Coward · 2003-07-31 14:51 · Score: 1, Insightful

But what is meaning but an insanely large net of crossreferences? I'm thinking of an article from Linguistic Anthro about the 'arbitrariness of sign.' I wish I could remember the big name dude that said it, I think it was Saussure. Knowing for instance that 'that' is a 'dog' is really just knowing that 'that' barks, runs, smells bad when wet, is friendly, kills, defends, etc etc etc. All of those words are also arbitrary when it comes to significance (at some grand level), so the same holds. The statistical approach is right on target and I'm genuinely curious to see where it will go.