More on Statistical Language Translation

Posted by ryuzaki0 on Thursday July 31, 2003 @12:18AM from the ma-grandmere-est-flambe dept.

DrLudicrous writes "The NYTimes is running an article about how statistical language translation schemes have come of age. Rather than having bilingual human programmers compile an extensive list of words and their literal translations, statistical translation works by comparing texts in both English and another language and 'learning' the other language via statistical methods applied to units called 'N-grams', e.g. if 'hombre alto' means tall man, and 'hombre grande' means big man, then hombre=man, alto=tall, and grande=big." See our previous story for more info.
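
The deduction in the summary can be sketched as a toy elimination procedure. This is only an illustration of the idea, not how the systems in the article work: real statistical MT systems estimate alignment probabilities over large corpora (for example with the IBM models trained by EM). The function name `align_by_elimination` and the two-pair corpus are assumptions made up for this sketch.

```python
# Toy parallel corpus: the two phrase pairs quoted in the summary.
pairs = [
    ("hombre alto", "tall man"),
    ("hombre grande", "big man"),
]

def align_by_elimination(pairs):
    """Guess word translations by intersecting co-occurring target words.

    Hypothetical helper for illustration only; it assumes clean,
    one-to-one word correspondences, which real text does not have.
    """
    # Step 1: for each source word, keep only the target words that
    # appear in every phrase pair containing that source word.
    candidates = {}
    for src, tgt in pairs:
        tgt_words = set(tgt.split())
        for word in src.split():
            if word in candidates:
                candidates[word] &= tgt_words
            else:
                candidates[word] = set(tgt_words)

    # Step 2: once a source word is pinned to a single target word,
    # remove that target word from everyone else's candidates.
    resolved = {}
    changed = True
    while changed:
        changed = False
        for word, cands in candidates.items():
            if word in resolved:
                continue
            remaining = cands - set(resolved.values())
            if len(remaining) == 1:
                resolved[word] = next(iter(remaining))
                changed = True
    return resolved

print(align_by_elimination(pairs))
```

With only two phrase pairs the elimination happens to be unambiguous; with real data, word order, idioms and one-to-many translations make counting and probabilities necessary.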

Re:IBM research 10 years ago by Jugalator · 2003-07-31 00:49 · Score: 5, Informative

Yes, I see IBM's project was called the "Candide Project". Here's a link with some information about it, including a link to the paper describing the prototype system they built:

http://www-2.cs.cmu.edu/~aberger/mt.html

--
Beware: In C++, your friends can see your privates!

Arabic Grammar Nazi by nat5an · 2003-07-31 02:43 · Score: 5, Informative

From the Article: Compare two simple phrases in Arabic: "rajl kabir" and "rajl tawil." If a computer knows that the first phrase means "big man," and the second means "tall man," the machine can compare the two and deduce that rajl means "man," while kabir and tawil mean "big" and "tall," respectively.

Not to be overly anal (hopefully to raise an important point), "rajl kabir" actually means "old man" not "big man." The Arabs will definitely laugh at you if you mix these up. You'd use the word "tawil" for a tall or generally large man. The word "sameen" refers to a fat or husky guy. In a different context (referring to an inanimate object), "kabir" does in fact mean big.

I wonder how good these statistical systems really are at learning the various grammatical nuances of a language like Arabic. For example, in Arabic, non-human plurals behave like feminine singulars, whereas human plurals behave like plurals.

It's really incredibly cool that these machines can learn language mechanics and definitions on their own. But as previous posters have already noted, the machine still has to know the meanings of words in order to do a good translation.

For example, to translate "big box" and "big man" into Arabic, you'd actually use different words for big, since the box is inanimate, but the man is animate.

--
Head down, go to sleep to the rhythm of the war drums...

Re:Limited value? by Jadrano · 2003-07-31 03:57 · Score: 4, Informative

Of course, you can buy dictionaries or get trained people to write them, but the amount of data needed for every lexical item would be so large that wide coverage would be very hard to achieve. For example, you have to note all collocations. Often, such preferences aren't clear-cut. For instance, 'essential' appears much more frequently in a predicative position (e.g. 'X is essential') than in an attributive one (e.g. 'the essential X'), while 'basic', which can have a very similar meaning in many contexts, appears much more often in an attributive position. Such information is necessary for good translation, but dictionaries usually don't provide it. Statistical analyses of lexical items reveal many things dictionaries don't tell you. Nowadays, a significant part of the work of trained people writing dictionaries is looking at corpora, and making this process automatic is a logical step.
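
A positional preference of this kind can be counted directly from a corpus. The sketch below is a rough illustration under loud assumptions: the six sentences are invented toy data, not corpus results, and the heuristic (previous word is a form of 'be' = predicative, previous word is a determiner = attributive) is a crude stand-in for a real tagger or parser. The names `position_counts`, `COPULAS` and `DETERMINERS` are made up for the example.

```python
import re
from collections import Counter

# Invented toy sentences, for illustration only (not real corpus data).
toy_corpus = [
    "Water is essential.",
    "Sleep is essential for health.",
    "The essential point was missed.",
    "The basic idea is simple.",
    "A basic rule applies.",
    "This is basic.",
]

COPULAS = {"is", "are", "was", "were", "be", "been"}
DETERMINERS = {"the", "a", "an"}

def position_counts(sentences, adjective):
    """Count crude predicative vs. attributive uses of one adjective."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, token in enumerate(tokens):
            if token != adjective:
                continue
            prev = tokens[i - 1] if i > 0 else ""
            if prev in COPULAS:
                # e.g. "X is essential"
                counts["predicative"] += 1
            elif prev in DETERMINERS and i + 1 < len(tokens):
                # e.g. "the essential X" (some word follows the adjective)
                counts["attributive"] += 1
    return counts

for adj in ("essential", "basic"):
    print(adj, dict(position_counts(toy_corpus, adj)))
```

On a real corpus the same counts, taken over many thousands of occurrences, are the sort of information a dictionary entry usually leaves out.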

Strictly separating raw dictionary work and grammar seems rather old-fashioned to me. Of course, it can work to some degree, but there are so many different types of collocational preferences that just providing each lexeme with a 'grammatical category' from a relatively small list and basing the grammar on these grammatical categories is hardly enough.

It is true that automatic systems' lack of world knowledge is a big problem, but the examples you provide aren't really a good demonstration of this fact. As you write, 'have' is translated differently into some languages depending on whether the object is abstract. So, given a translation system that recognizes the verb and its object and a bilingual parallel corpus, a statistical system can find out about that.
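
As a minimal sketch of how that could work, assume some earlier step has already extracted (verb, object class, observed translation) triples from an aligned corpus. Everything here is hypothetical: the triples are invented, `HAVE_CONCRETE` and `HAVE_ABSTRACT` are placeholders for two target-language verbs rather than real words, and `learn_translation_table` is a made-up name. The point is only that the choice falls out of counting.

```python
from collections import Counter, defaultdict

# Invented observations: (source verb, class of its object, translation seen
# in the aligned target sentence). One noisy example is included on purpose.
observations = [
    ("have", "concrete", "HAVE_CONCRETE"),
    ("have", "concrete", "HAVE_CONCRETE"),
    ("have", "concrete", "HAVE_CONCRETE"),
    ("have", "concrete", "HAVE_ABSTRACT"),  # noise / alignment error
    ("have", "abstract", "HAVE_ABSTRACT"),
    ("have", "abstract", "HAVE_ABSTRACT"),
]

def learn_translation_table(observations):
    """Pick the most frequent translation for each (verb, object class)."""
    counts = defaultdict(Counter)
    for verb, obj_class, translation in observations:
        counts[(verb, obj_class)][translation] += 1
    return {
        context: counter.most_common(1)[0][0]
        for context, counter in counts.items()
    }

table = learn_translation_table(observations)
print(table[("have", "concrete")])
print(table[("have", "abstract")])
```

The hard part in practice is the step assumed away here: recognizing the verb and its object reliably, and deciding which properties of the object are worth conditioning on.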

I have heard of people who write dictionaries that can be used for automatic processing; for every lexeme they need between half an hour and an hour (consulting dictionaries and corpora, checking whether the application of rules gives correct sentences). This can only work if the MT system is aimed either at a very limited domain (e.g. weather forecasts, for which there are working rule-based translation systems) or at very low quality. It could never be affordable to have trained people provide all relevant characteristics for the millions of words that would be needed for a good MT system with wide coverage.

Differentiating between concrete and abstract entities is something that seems quite natural to us, but there are many other relevant characteristics of lexical items that don't come to linguists' minds so easily; statistical analyses can be better at discovering them.