Open-Source Language Translator Opens For Beta
mind21_98 writes "A new machine translator has officially opened for public testing. GPLTrans is a translator similar to Babelfish. Pre-alpha testing has shown that it is the most accurate of the major Web-based machine translators. More information can be found here. "
It would be nice if someone were to make a CORBA translation service and add this to one or more of the Linux desktops. Then it could be used for email, documentation, IRC, coding, etc., not just for the occasional web page. It would also be good if the data at GPLTrans were snapshotted regularly and pushed around, ideally so that everyone would have their own copy.
It's common to hear the pundits opine that "open source may be good at improving 30-year-old operating systems, but the open-source model just doesn't work when it comes to large-scale applications." Various reasons are given, for example: "open source programmers only do what is fun and interesting, and applications aren't interesting". But here we see yet another large-scale application falling to the barbarian hordes.
Those pundits are wrong: there is no genre of software that the open-source model will never absorb, simply because the open-source model results in better software, for reasons that are well known. And no, there is no software application so uninteresting that no volunteer anywhere in the world will touch it. On the contrary: the more an application area remains untouched, the more interesting it becomes to open-source programmers, simply because it's virgin territory.
This is the "stamp collector" syndrome: when you already have a goodly number of stamps in your collection, adding the missing ones becomes an obsession.
Life's a bitch but somebody's gotta do it.
What most of these language translation programs need is a better understanding of context. I was surprised to find that Altavista's Babelfish utility has very poor analysis of context (possibly none at all). For example, when translating from English to French, "run" always translates to "exécute". For a sentence like you get which is reasonable, but if you translate you get which doesn't make any sense. More incredibly, "store" always translates to "mémoire". You would think that, if they were going to force every word to be interpreted in one sense, they would choose the most common meaning. But this choice leads to insanity where translates to
With knowledge of context, a more advanced system could notice situations in which it was more reasonable for "run" to have a particular meaning. In the last example, "run" is followed by a prepositional phrase indicating a direction, which would imply that the meaning involving physical movement is appropriate, and so on.
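To make the idea concrete, here is a minimal Python sketch of that kind of context check. Everything in it is my own assumption for illustration: the function name, the list of directional prepositions, and the two-sense inventory are hypothetical, and nothing here reflects how Babelfish or Systran actually work.

```python
# Hypothetical list of prepositions that suggest physical movement.
DIRECTIONAL_PREPOSITIONS = {"to", "into", "toward", "towards", "across", "through", "from"}


def choose_sense_of_run(tokens, index):
    """Pick a sense for 'run' at tokens[index] by looking at the next word.

    Returns "move" (French "courir") when a directional preposition follows,
    otherwise the default "execute" (French "exécuter").
    """
    following = tokens[index + 1].lower() if index + 1 < len(tokens) else None
    if following in DIRECTIONAL_PREPOSITIONS:
        return "move"
    return "execute"


print(choose_sense_of_run("I run to the store".split(), 1))
print(choose_sense_of_run("I run the program".split(), 1))
```

A one-word lookahead like this is obviously far too crude for real text, but it shows that even a very shallow look at context is enough to avoid forcing one sense everywhere.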
Even more revealing is the fact that the confusion of meaning happens differently for different languages. If you translate
into Spanish, you get the hilarious result: For translation software that has multiple language targets, I would have expected it to first resolve the meaning of the English sentence into an internal semantic representation before using it to emit Spanish or French. The above would be evidence that the Systran software has no such representation -- or at least that their representation is too weak to indicate the difference between "store" as in "memory" and "store" as in a warehouse.
-- ?!ng
Although the site has been slashdotted, it would be interesting to see what sort of algorithms it uses to perform the translations. Mmm, open source.
:) In addition to this, it's very difficult to write simple, lucid grammar rules that also account for the myriad exceptions found in language.
:) The parsing itself is a hefty (and not terribly exciting) task. I attempted to make a term project of a fairly basic English parser and ended up changing the project.
I would be inclined to say that if it is based on grammar rules, the project won't make much headway - machine translation has been butting its head against this brick wall for forty years. The problem with hard-and-fast grammar rules, e.g.,
S = NP VP
NP = Det (Adj)* N
VP = V (Adv)
is that they don't account for rapid linguistic change, and people have this nasty habit of twisting grammar to express themselves in new and creative ways.
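To show how brittle those three rules are, here is a small Python recognizer for exactly that grammar, run over part-of-speech tags instead of words. The tag names and function names are my own choices for illustration, and I am assuming the tagging has already been done somewhere else.

```python
def parse_np(tags, i):
    """NP = Det (Adj)* N. Return the index just past the NP, or None."""
    if i < len(tags) and tags[i] == "Det":
        i += 1
        while i < len(tags) and tags[i] == "Adj":
            i += 1
        if i < len(tags) and tags[i] == "N":
            return i + 1
    return None


def parse_vp(tags, i):
    """VP = V (Adv). Return the index just past the VP, or None."""
    if i < len(tags) and tags[i] == "V":
        i += 1
        if i < len(tags) and tags[i] == "Adv":
            i += 1
        return i
    return None


def is_sentence(tags):
    """S = NP VP, and the whole tag sequence must be consumed."""
    after_np = parse_np(tags, 0)
    if after_np is None:
        return False
    return parse_vp(tags, after_np) == len(tags)


# "the big dog barks loudly" is accepted.
print(is_sentence(["Det", "Adj", "N", "V", "Adv"]))
# "dogs bark" is rejected, because the NP rule insists on a determiner.
print(is_sentence(["N", "V"]))
```

Perfectly ordinary sentences like "dogs bark" or "the dog chased the cat" already fall outside the rules, and every patch to admit them adds more rules and more exceptions.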
I imagine GPLTrans would probably be using some sort of probability frame of phrases and words occurring together, but one can't be sure without looking at the source. I think the best way to do translation software would be to convert the text into syntax, then into a more abstract semantic form, and from the semantic form, translate back into the target language's syntax, and then into the target language's text. Of course, the trick is to figure out just exactly how to do this.
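For what it's worth, here is a toy Python sketch of the text -> syntax -> semantic form -> target text pipeline. To be clear, this is not how GPLTrans works (I haven't seen the source); the four tiny lexicons, the frame layout, and the function names are all made up for illustration, and the "syntax" stage only handles sentences of the form "the NOUN VERB".

```python
# Hypothetical lexicons mapping words to language-neutral concepts and back.
EN_NOUNS = {"dog": "DOG", "cat": "CAT"}
EN_VERBS = {"sleeps": "SLEEP", "eats": "EAT"}
FR_NOUNS = {"DOG": "le chien", "CAT": "le chat"}
FR_VERBS = {"SLEEP": "dort", "EAT": "mange"}


def analyze(text):
    """English text -> semantic frame. Only understands 'the NOUN VERB'."""
    words = text.lower().rstrip(".").split()
    if len(words) != 3 or words[0] != "the":
        raise ValueError("this sketch only handles 'the NOUN VERB'")
    _, noun, verb = words
    # Unknown words raise KeyError; a real system would need a fallback.
    return {"agent": EN_NOUNS[noun], "action": EN_VERBS[verb]}


def generate_french(frame):
    """Semantic frame -> French text (subject followed by verb)."""
    return f"{FR_NOUNS[frame['agent']]} {FR_VERBS[frame['action']]}"


def translate(text):
    """Go through the semantic frame instead of mapping word to word."""
    return generate_french(analyze(text))


print(translate("The dog sleeps."))
```

The point of the design is that the frame in the middle knows nothing about English or French, so adding a new target language means writing one more generator and leaving the analysis alone. The hard part, as I said, is building a frame that is rich enough for real sentences.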
My 2 cents/Pfennig/lire/pesos,
Y
"There is no culture in computer science, only cults." - M. Felleisen
The `spirit is willing' story is amusing, and it really is a pity that it is not true. However, like most MT `howlers' it is a fabrication. In fact, for the most part, they were in circulation long before any MT system could have produced them (variants of the `spirit is willing' example can be found in the American press as early as 1956, but sadly, there does not seem to have been an MT system in America which could translate from English into Russian until much more recently --- for sound strategic reasons, work in the USA had concentrated on the translation of Russian into English, not the other way round). Of course, there are real MT howlers. Two of the nicest are the translation of French avocat (`advocate', `lawyer' or `barrister') as avocado, and the translation of Les soldats sont dans le café as The soldiers are in the coffee. However, they are not as easy to find as the reader might think, and they certainly do not show that MT is useless.
BTW, since this book is no longer available in stores, its entire contents have been placed online. I recommend it to anyone who is interested in the subject of MT. It really is a nice introduction to the subject.