More on Statistical Language Translation
DrLudicrous writes "The NYTimes is running an article about how statistical language translation schemes have come of age. Rather than compile an extensive list of words and their literal translations via bilingual human programmers, statistical translation works by comparing texts in both English and another language and 'learning' the other language via statistical methods applied to units called 'N-grams', e.g. if 'hombre alto' means tall man, and 'hombre grande' means big man, then hombre=man, alto=tall, and grande=big." See our previous story for more info.
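To make the 'hombre alto' example concrete, here is a minimal sketch of how word correspondences could be estimated from nothing but paired sentences. It uses expectation-maximization in the style of IBM Model 1, which is an assumption on my part; the article summary does not say which method the researchers use. The function names and the two-sentence corpus are illustrative only.

```python
from collections import defaultdict

def train_word_translations(pairs, iterations=20):
    """Estimate t(english | foreign) by EM over sentence pairs (IBM Model 1 style)."""
    english_vocab = {e for _, english in pairs for e in english}
    uniform = 1.0 / len(english_vocab)
    t = defaultdict(lambda: uniform)  # t[(e, f)], starts out uniform
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for foreign, english in pairs:
            for e in english:
                # Split one unit of "credit" for e among the foreign words
                # in the same sentence, in proportion to current beliefs.
                norm = sum(t[(e, f)] for f in foreign)
                for f in foreign:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # Re-estimate the probabilities from the expected counts.
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return t

def best_translation(t, foreign_word):
    """Return the English word with the highest probability for foreign_word."""
    candidates = {e: p for (e, f), p in t.items() if f == foreign_word}
    return max(candidates, key=candidates.get)

# Toy parallel corpus taken from the example in the summary.
pairs = [
    ("hombre alto".split(), "tall man".split()),
    ("hombre grande".split(), "big man".split()),
]
table = train_word_translations(pairs)
for word in ("hombre", "alto", "grande"):
    print(word, "->", best_translation(table, word))
```

Because 'hombre' appears alongside 'man' in both pairs, it soaks up most of the credit for 'man', which pushes 'alto' toward 'tall' and 'grande' toward 'big'. With only two sentences this is a toy; a real system would need far more data.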
The trouble with the Star Trek "Universal Translator" is that they show it working on languages where there is no already translated work. This sort of statistical translation requires someone to sit down and hand-translate a bunch of documents to teach the machine the correlations.
The cake is a pie
If this happens, I suspect this technology will be illegal...
That wouldn't apply here as the sample data you've suggested is too little.
For statistical translations to work, you would need a substantial set of data, already translated, from which you could do the comparisons and create your database of phrases and words.
In the example you've given you would need to have pre-populated this database in advance for the statistical engine to understand how to do the translation.
What you've got to do is stop thinking that this is actually performing a translation... it's not... it's performing a cost-based replacement where the costs have been calculated from statistics gathered from a large pool of sample data.
Once you have the sample data... then you will have the translation.
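A rough sketch of what "cost-based replacement" could look like, assuming costs are negative log probabilities derived from phrase counts. The counts, the `UNKNOWN_COST` penalty, and the function names are all made up for illustration, and the search keeps the source word order, which real systems do not have to do.

```python
import math

# Hypothetical counts of how often each source phrase was seen
# translated as each target phrase in the sample data.
observed = {
    "hombre": {"man": 9, "husband": 1},
    "alto": {"tall": 7, "high": 3},
    "hombre alto": {"tall man": 5},
}
UNKNOWN_COST = 10.0  # assumed penalty for passing an unseen word through

def build_costs(observed):
    """Turn raw counts into costs: rarer replacements cost more."""
    costs = {}
    for source, targets in observed.items():
        total = sum(targets.values())
        costs[source] = {tgt: -math.log(n / total) for tgt, n in targets.items()}
    return costs

def translate(sentence, costs, max_len=3):
    """Pick the cheapest left-to-right sequence of phrase replacements."""
    words = sentence.split()
    # best[i] = (total cost, output phrases) covering the first i words
    best = [(0.0, [])] + [(math.inf, [])] * len(words)
    for end in range(1, len(words) + 1):
        for start in range(max(0, end - max_len), end):
            source = " ".join(words[start:end])
            options = costs.get(source)
            if options is None:
                if end - start > 1:
                    continue
                options = {source: UNKNOWN_COST}  # copy unknown single words
            for target, cost in options.items():
                candidate = best[start][0] + cost
                if candidate < best[end][0]:
                    best[end] = (candidate, best[start][1] + [target])
    return " ".join(best[-1][1]), best[-1][0]

costs = build_costs(observed)
print(translate("hombre alto", costs)[0])
```

Note that the two-word entry is what gets the English word order right here; replacing word by word in order would give "man tall". Nothing in the code "understands" either language, it only looks up the cheapest replacement.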
...when it's able to translate stuff like:
"Shaka, when the walls fell!"
"I'm an old-fashioned type of guy. I worship the Sun and Moon as gods. And fear them."
> I urinated is 'I pissed'
:P
Not "I urinated", but "I got urinated" - how could it tell?
Also I sometimes say "I'm pissed" (no 'off') when I'm angry, and I'm British. Although as I just pointed out, that could mean "I'm urinated"
I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
It gets even more complicated, particularly with the connotations attached to certain words and phrases.
For example one country's "Weapons of Mass Destruction" is another country's "Strategic Deterrent". Both phrases mean the same thing but the tone is very different. Same thing with "terrorists" and "freedom fighters". You can use either phrase to describe the same people and imply very different meanings.
It will be a long time before an automated system will be able to make an acceptable translation of these subtleties.
- restricted domains (subject matters)
- restricted range of grammatical constructions
- restricted genre (style)
- restricted range of cultural presuppositions
In other words, it works best for technical manuals.
One of the keys to making a statistical model work is to make wise choices about what statistics to collect, and what dependencies to include. For example, N-grams work by predicting the probability of a certain word appearing given the previous word or so; this kind of works but misses a lot because the structure of a sentence is more like a tree than a series. More complex models can capture more relevant information. On the other hand, if the model is too complex, it won't work for two reasons: because it requires too much memory/cpu, and because you can't get a reliable estimate of the probabilities without multiple examples of each situation (this problem is called data sparsity).
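For anyone who wants to see the N-gram idea and the sparsity problem in miniature, here is a small bigram sketch. It uses plain relative-frequency estimates with no smoothing, and the three-sentence corpus and function names are invented for the example.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count each word as a context, and each (previous word, word) pair."""
    contexts = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return contexts, bigrams

def bigram_prob(prev, word, contexts, bigrams):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

corpus = ["the tall man sat", "the big man ran", "the big dog ran"]
contexts, bigrams = train_bigrams(corpus)

print(bigram_prob("big", "man", contexts, bigrams))   # seen once out of two "big"
print(bigram_prob("tall", "dog", contexts, bigrams))  # never seen together
```

"tall dog" is perfectly good English, but the model gives it probability zero simply because the pair never showed up in the training text. That is data sparsity, and it gets worse quickly as the context grows from one previous word to two or three.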
For example, the English word pattern can be translated in French by any of (please excuse the lack of accents, they were stripped when I submitted): modele, exemple, type, schema, dessin, motif, maquette, patron, plan, disposition, groupement, repartition, combinaison, diagramme, gabarit, echantillon, tendance, figure, circuit (and probably others as well) depending on the context -- and not just the lexical context, but the meaning.
Previous attempts to automate translation focused on giving computers grammatical and semantic knowledge, in the hope that it could infer some meaning from this and so choose the right equivalents. Despite some success, this approach failed in general, putting machine translation (MT) firmly in the realm of AI. I believe this statistical approach is a step in the wrong direction (back to purely lexical means of analyzing texts with a view to translation). Further progress in MT will come from AI.
This doesn't detract from the ways in which computers have been useful to translators -- in the area of computer-assisted translation (translation memory, localization, terminology databases, etc.)
The other point is it's a lot harder to get a good-quality parallel corpus than you'd think (even in the Internet age -- most of the stuff on the Internet is crap anyway).
It's not the idea of using computers in translation that I think is limited, just this approach.
Artificial neural nets are one way to do this, but statistical methods are more or less analogous and have the advantage of being highly optimizable. Personally I don't understand the details, but Very Smart Mathematicians have found ways to optimize models like Singular Value Decompositions (SVDs) so that they can be calculated orders of magnitude faster than models that cannot be represented as formally using mathematics.
The bottom line is that statistical methods are probably the way that we will end up producing brain-like behavior on computers, and the fact that there are promising results already is heartening. Yes, for truly intelligent behavior a lot of domain knowledge will also be needed, as you point out. But I don't see any reason why the extraction and mapping of this knowledge couldn't also be achieved with large training corpora and statistical methods, rather than hand-crafting.
Peer Pressure
To expound on the AC and Koos Baster's comments, try asking people to define ordinary words. You'll find quite often that the more basic the word, the more difficult it becomes. The definition of all words is circular since the definition of any word is given by other words, e.g., recursive: see recursive. Somewhere there needs to be a list of words with pictures, or math, or other way of defining each word without using any previously undefined words.
It may be possible for this approach to address that issue somewhat. Statistics can be collected not only on associations of words with other words, but also on associations of groups of words or phrases with others. So if the translator has learned from documents in which the phrase "put it down" appears near the word "ill" and the word "dog," and from other documents in which the phrase is associated with the word "heavy," it can make a good guess.
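Here is one way the "put it down" guess could be sketched: collect the words that appear around the phrase for each meaning, then score a new sentence by how much of its context matches each meaning. The labeled sentences, the sense names, and the scoring rule are all assumptions for illustration; a real translator would learn from parallel text rather than hand-labeled senses.

```python
from collections import Counter, defaultdict

# Hypothetical training sentences, each tagged with the intended meaning.
training = [
    ("euthanize", "the dog was very ill so the vet put it down"),
    ("euthanize", "our old dog was ill and we had to put it down"),
    ("set_down", "the box was heavy so he put it down"),
    ("set_down", "this heavy bag hurts my arm let me put it down"),
]
PHRASE = {"put", "it", "down"}

def learn_contexts(examples):
    """Count the words seen around the phrase, separately for each sense."""
    contexts = defaultdict(Counter)
    for sense, sentence in examples:
        contexts[sense].update(w for w in sentence.split() if w not in PHRASE)
    return contexts

def guess_sense(sentence, contexts):
    """Pick the sense whose known context words best match the sentence."""
    words = [w for w in sentence.split() if w not in PHRASE]
    scores = {
        # Normalize by the amount of context seen for each sense.
        sense: sum(counts[w] for w in words) / sum(counts.values())
        for sense, counts in contexts.items()
    }
    return max(scores, key=scores.get)

contexts = learn_contexts(training)
print(guess_sense("the dog is ill we must put it down", contexts))
print(guess_sense("that crate is heavy put it down", contexts))
```

With four training sentences the guesses hinge on a couple of words like "ill", "dog" and "heavy", which is exactly the association described above. More data would let it pick up on weaker and less obvious cues.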
Clearly, it would need to learn from a tremendous amount of input data before it could begin to approach the experience of a human, and hence make guesses of similar quality to a human translator. However, the amount of available source material is increasing so rapidly that it may be possible for a translator to get pretty darn smart this way.
But what is meaning but an insanely large net of crossreferences? I'm thinking of an article from Linguistic Anthro about the 'arbitrariness of sign.' I wish I could remember the big name dude that said it, I think it was Saussure. Knowing for instance that 'that' is a 'dog' is really just knowing that 'that' barks, runs, smells bad when wet, is friendly, kills, defends, etc etc etc. All of those words are also arbitrary when it comes to significance (at some grand level), so the same holds. The statistical approach is right on target and I'm genuinely curious to see where it will go.