Slashdot Mirror


More on Statistical Language Translation

DrLudicrous writes "The NYTimes is running an article about how statistical language translation schemes have come of age. Rather than compile an extensive list of words and their literal translations via bilingual human programmers, statistical translation works by comparing texts in both English and another language and 'learning' the other language via statistical methods applied to units called 'N-grams', e.g. if 'hombre alto' means tall man, and 'hombre grande' means big man, then hombre=man, alto=tall, and grande=big." See our previous story for more info.
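
The deduction in the summary can be sketched in a few lines. This is only an illustration, not the method from the article: it scores word pairs with a Dice coefficient over co-occurrence counts, and the names `PAIRS` and `learn_lexicon` are made up for the example.

```python
from collections import Counter

# Toy parallel corpus taken from the summary: (Spanish phrase, English phrase).
PAIRS = [("hombre alto", "tall man"), ("hombre grande", "big man")]

def learn_lexicon(pairs):
    """Map each source word to the target word it co-occurs with most reliably."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update((s, t) for s in src_words for t in tgt_words)

    lexicon = {}
    for s in src_count:
        # Dice coefficient: high when s and t show up in the same phrase
        # pairs and rarely appear without each other.
        lexicon[s] = max(
            tgt_count,
            key=lambda t: 2 * joint[(s, t)] / (src_count[s] + tgt_count[t]),
        )
    return lexicon

lexicon = learn_lexicon(PAIRS)
print(lexicon)
```

"hombre" appears in both pairs and so does "man", so that pairing scores highest; "alto" and "grande" each line up with the one English word they never appear without.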

33 of 193 comments

  1. Not just matching phrases by marcopo · · Score: 5, Interesting

    The key improvement is not just to search for phrases that appear in the sample texts. If you have an idea of what a word means and what its grammatical role is, then you can plug it into other sentences and greatly extend the set of phrases you can translate. Thus an important idea is to search for phrases that match grammatically with phrases you can translate.
    However, this requires a stage where the sample texts are used to extract grammatical information on the second language. Of course, it helps a lot if you are familiar with one of the two languages.

  2. Same words, different meanings by shish · · Score: 5, Interesting

    What happens when it hits a word with several meanings? For example the reply to a previous story "I got pissed and installed OSX"

    drunk?
    angry?
    urinated?

    --
    I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
    1. Re:Same words, different meanings by CyberSlugGump · · Score: 5, Funny

      Reminds me of a story (see bottom of this page)

      The US Gov't was funding an early computer group to translate documents from Russian-to-English and back. The hope, obviously, was to eliminate the need for human translators. A particular sentence was fed to the computer, which translated it into Russian. The computer was then fed the Russian, and it translated it back to English.

      The original sentence was "The spirit is strong, but the flesh is weak".
      The resulting sentence? "The vodka is good, but the meat is rotten".


      The computer didn't know which of the many possible words to use when translating spirit, so it used "vodka". Likewise, it tried to put the word "strong" into context, and since strong vodka is prized in Russia, it decided that the vodka was good. Likewise, flesh got translated to meat, and weak flesh became bad meat.

    2. Re:Same words, different meanings by godot42a · · Score: 3, Interesting

      Short and simplified version: Look for words that typically co-occur and cluster them. For "pissed", you'll find Cluster 1: {pissed, toilet}, Cluster 2: {pissed, booze, get}, and probably some more. These clusters correspond to different meanings of the word. Then determine which of these clusters fits the current usage.
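
      A rough sketch of the last step (picking the cluster that fits the current usage) might look like the following. The clusters here are hand-written stand-ins; a real system would induce them from co-occurrence counts over a large corpus, and `SENSE_CLUSTERS` and `guess_sense` are names invented for the example.

```python
# Hypothetical, hand-written sense clusters for "pissed".
SENSE_CLUSTERS = {
    "urinated": {"toilet", "bathroom", "pants"},
    "drunk": {"booze", "beer", "pub", "got"},
    "angry": {"off", "furious", "boss"},
}

def guess_sense(sentence, clusters=SENSE_CLUSTERS):
    """Return the sense whose cluster shares the most words with the sentence."""
    words = set(sentence.lower().split())
    scores = {sense: len(words & context) for sense, context in clusters.items()}
    return max(scores, key=scores.get), scores

sense, scores = guess_sense("we got pissed on cheap booze at the pub")
print(sense, scores)
```

      Plain word overlap is about the crudest possible fit measure; it is only meant to show the shape of the idea.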

  3. Translator by Anonymous Coward · · Score: 3, Informative

    That's an example from a few years back of an attempt to translate "the spirit is willing but the flesh is weak" from English to Russian and back to English using a different translator.

    Can anyone try this on the new (or some other recent) algorithm?

    BTW here's Doc Och's most recent website:

    Franz Josef Och [isi.edu]

    --
    Esteem isn't a zero sum game

  4. IBM research 10 years ago by Surak · · Score: 5, Interesting

    I remember reading about IBM doing this research about 10 years ago. The biggest problems then were adequate processing power and storage space. Those things have greatly improved in the last 10 years (thank the spirits of Moore). I think that's why you're starting to see all this cool research with speech recognition and AI that was being done in the 80s and 90s become more and more commonplace. This trend will likely continue, and all the cool research-only stuff you remember reading about in the 80s and 90s will just be common fixtures on PCs of today.

    Speaking of which -- speech recognition, AI, translation learning algorithms -- sounds like we have the seeds for the Universal Translator. :)

    1. Re:IBM research 10 years ago by kmak · · Score: 3, Interesting

      I have one question though, while obviously, you can get a mapping of definitions, can you actually translate a full sentence into another full sentence?

      With exceptions in tons of languages, is this even feasible in the near future? Sure, we can understand a poorly translated sentence, but can it translate it so that we don't have to?

      --

      I'm not the devil.. just his advocate.
    2. Re:IBM research 10 years ago by Jugalator · · Score: 5, Informative

      Yes, I see IBM's project was called the "Candide Project". Here's a link with some information about it, including a link to the paper describing the prototype system they built:

      http://www-2.cs.cmu.edu/~aberger/mt.html

      --
      Beware: In C++, your friends can see your privates!
  5. So statiscally... by MosesJones · · Score: 5, Funny


    France = "Cheese Eating Surrender Monkey"

    George Bush = "Neo-Imperialist Moron"

    Tony Blair = "Lap Dog"

    WMD = "No where to be found"

    and of course

    Dossier = Creative Story Telling

    --
    An Eye for an Eye will make the whole world blind - Gandhi
    1. Re:So statiscally... by Matthias Wiesmann · · Score: 4, Insightful
      Actually, using this technology to translate from English to English could be quite interesting. Imagine you could automatically translate legalese, or marketing speak, to plain English. Or translate an article with a given political bias towards another political bias.

      If this happens, I suspect this technology will be illegal...

  6. Works it does! by ucblockhead · · Score: 5, Funny

    Translation-unit this algorithm perfectly works! Deutsch this was typed and translation-unit to English makes this was!

    --
    The cake is a pie
  7. Older languages not supported? by panurge · · Score: 5, Interesting
    Modern languages tend to have less inflected grammars than older languages. That is a benefit for statistical methods because individual words do not change significantly. But how would this work for Latin, Greek and other highly inflected languages? Anyone who knows "The Turn of the Screw" (Britten version) will remember:

    malo: I had rather be
    malo: in an apple tree
    malo: than a naughty boy
    malo: in adversity

    based on four very distinct meanings of malo, in which the word endings put the stem of the word in context, but unfortunately the same word endings are used for different things.

    Not that I'm trying to rubbish the work, because I actually think that statistical methods are close to the fuzzy way that we actually try and make out foreign languages. I just wonder what the limits are.

    --
    Panurge has posted for the last time. Thanks for the positive moderations.
  8. Why the change and Internationalization by beacher · · Score: 5, Interesting

    The article's text has "Compare two simple phrases in Arabic: "rajl kabir'' and "rajl tawil.'' If a computer knows that the first phrase means "big man," and the second means "tall man," the machine can compare the two and deduce that rajl means "man," while kabir and tawil mean "big" and "tall," respectively". Are we going pro-homeland security and not tipping off the powers that be? Or did michael want to show his uber leet 1st quarter espanol skillz?

    Spanish is easy and led me to believe that the article had relatively little weight (it is lightweight and a topical PHB read anyway). I do a lot of data mining in text streams and have found it to be fairly easy work. Getting cursors to play in ideograms/unicode and reversing the data is something I haven't tried yet and the article barely covers it. When I saw that they were covering language sets that were extremely dissimilar to English, my interest in multi-language applications was piqued again. All of my databases are unicode and I want to learn more about having truly international systems that are automated and then hand tweaked to avoid the engrish.com type mistakes. Any help here?
    -B

  9. Re:this doesn't work well by NathanE · · Score: 4, Interesting

    You are pretty much right about that, although I do not see the need to actually maintain your table in RAM. Trigrams require a HUGE corpus of training material to get good results, and even then you come up with the need to fudge your data a bit when you come across an unknown trigram. I think it's called "and one rounding" or something like that (trying to remember from class).

    Fascinating stuff for sure, but hardly new unless they have come up with some new development. I haven't read the article.
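
    The fudge for unseen trigrams half-remembered above is usually called add-one (Laplace) smoothing: pretend every possible trigram was seen once more than it actually was. A minimal sketch on a made-up seven-word corpus (the function names are invented for the example):

```python
from collections import Counter

def train_trigrams(tokens):
    """Count trigrams and their two-word contexts in a token list."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    contexts = Counter((w1, w2) for (w1, w2, _w3) in trigrams.elements())
    return trigrams, contexts, set(tokens)

def add_one_prob(w1, w2, w3, trigrams, contexts, vocab):
    """P(w3 | w1, w2) = (count(w1 w2 w3) + 1) / (count(w1 w2) + |V|)."""
    return (trigrams[(w1, w2, w3)] + 1) / (contexts[(w1, w2)] + len(vocab))

tokens = "the tall man saw the big man".split()
trigrams, contexts, vocab = train_trigrams(tokens)

seen = add_one_prob("the", "tall", "man", trigrams, contexts, vocab)
unseen = add_one_prob("the", "tall", "saw", trigrams, contexts, vocab)
print(seen, unseen)
```

    The unseen trigram gets a small non-zero probability instead of zero, and the probabilities for any given context still sum to one. Add-one is known to take too much mass away from seen events on realistic vocabularies, which is part of why so much training data is needed.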

  10. Yoda? by allanj · · Score: 5, Funny

    Yoda, is that you?

    --
    Black holes are where God divided by zero
  11. the real problems lie in understanding... by davids-world.com · · Score: 5, Interesting
    Statistics work quite well, and not just for phrases or so-called collocations such as "high and low" (vs. *"high and small"). They can help figure out the meaning of a word (bank=credit institution vs. bank=place to rest in a park). You can even learn (automatically learn) this stuff from parallel corpora where you can get a sentence-by-sentence translation, and you figure out statistically which words or phrases belong together.

    But that's an old story. Even the translation of complete sentences is fairly feasible in terms of syntactic structure.

    Harder to translate are things like discourse markers ("then", "because") because they are highly ambiguous and you would have to understand the text in a way. I have tried to guess these discourse markers with a machine learning model in my thesis about rhetorical analysis with support vector machines (shameful self-promotion), and I got around 62 percent accuracy. While that's probably better than or similar to competing approaches, it's still not good enough for a reliable translation.

    And that's just one example for the hurdles in the field. The need for understanding of the text kept the field from succeeding commercially. Machine Translation in these days is a good tool for translators, for example in Localization.

    1. Re:the real problems lie in understanding... by Knowledge Hacker · · Score: 3, Informative

      I spent a decade working in the field of knowledge-based machine translation (KBMT), in the Center for Machine Translation (now part of the Language Technologies Institute) at Carnegie Mellon. Prior to that, I worked on several natural language processing projects that were focused on knowledge-based automatic analysis of English text.

      KBMT can be done. We demonstrated that pretty definitively. It's labor-intensive. Yes, we DID create concept maps (ontologies) for the domains of human endeavor relating to the texts to be translated, and yes, we DID link words (lexical units) to those concepts, in multiple languages. And it turned out that we didn't have to make the ontologies very deep--we had to make them broad, and start by assuming the need for a one-to-one mapping of concepts to nouns, verbs, and modifiers in the domain. This we arrived at after several attempts at making them deep. Turned out the lexicon is the real key. You only have to use enough ontological structure to support the way you deal with analyzing function words (relative pronouns, conjunctions, prepositions, certain adverbs, etc.) and to capture deep generalizations for certain classes of verbs and verb-derived nouns (deverbal nouns.) The system uses a fast (real-time) English analysis parser based on the Tomita algorithm, and a rule-based target-language generator based on the KANT system.

      We created a custom-built MT system for Caterpillar, to perform automated translation of their operations and maintenance manuals from the English of Peoria, Illinois into French, Spanish and German. It took us six years (not counting all the projects that preceded it, from which we learned a great deal.) The system employs a controlled subset of English that forces Caterpillar's technical writers to favor certain constructions in their writing, and to disambiguate certain other constructions using a writer's workbench interface. (Caterpillar has a patent on this application of MT technology for technical documents.) It contains all the vocabulary that Caterpillar needs--hundreds of thousands of terms. Caterpillar updates the lexicons as needed.

      This system has been in production use at Caterpillar since 1996. It translates controlled English text at accuracies in the high 90-percents. The tech writers adapted, the translators got turned into post-processors (and I believe there was some turnover of personnel--the work had to have gotten a lot more boring), the English reads a little bit stilted but it's perfectly clear. Response from Caterpillar's customers was positive, the manuals get translated faster, and are accurate. The controlled English can actually force a little higher accuracy. Caterpillar's investment in this technology ended up saving them a bundle.

      Due to the proprietary context of our work for Caterpillar, there were very few academic publications that came out of the project.

      If you want to engage in further reading, search on KBMT-89 (an MT project funded by IBM-Japan that laid much of the foundation for workable real-time KBMT.) We published a book on it.

      You can read about the KANT technology at

      http://www.lti.cs.cmu.edu/Research/Kant/

      There are also a number of pointers to other knowledge-based projects on the lti.cs.cmu.edu site.

      For looking at the progress of KBMT in the U.S. generally (over the past couple of decades), search for publications by Jaime Carbonell, Sergei Nirenburg, Eric Nyberg, Masaru Tomita, Teruko Mitamura, Robert Frederking, Lori Levin, Kathy Baker, Ralf Brown, and a cast of dozens. Warning--this will bring you vast amounts of material.

  12. I'll believe it when I see it by domovoi · · Score: 5, Interesting

    There are a number of problems with the model here that point very clearly to the fact that it has the same shortcomings as other machine translation models.

    For example, so long as we're working with cognates or 1:1 equivalencies (tall, man, etc.) it's fine. If we go to words for which there is no 1:1 lexical item, what's it do then? Consider especially words that signify complex concepts that are culture-bound. There would be, by definition, no reason for language #2 to have such a concept, if the culture isn't similar. The other problem arises from statistical sampling. Lexical items that are used exceedingly rarely and have no 1:1 or cognate would be unlikely to make the reference database.

    Another similar problem arises with novel coinages and idioms. The example of "The spirit is willing..." is rightly cited. Consider the Russian saying, "He nyxa, He nepa," which translates as "Neither down nor feathers" but doesn't mean anything of the sort.

    Real machine translation has been the golden fleece of computational linguistics for a long time. I'll believe it when I see it.

    1. Re:I'll believe it when I see it by YU Nicks NE Way · · Score: 5, Interesting

      When I read this, I'm reminded of the SPHINX project at CMU in the mid 80's. Kai-Fu Lee was a doctoral student at CMU in computer science. His advisor set him to evaluating the performance of the (clearly inferior) statistical SR systems that IBM was touting. It was a throw-away project; his advisor just wanted some numbers to compare his rule-based system against. The linguists had clearly shown that the irregularities of human speech required deep knowledge of the phonology, syntax, and semantics of the language being spoken, but the project leader needed a benchmark to measure against.

      Lee's toy project, SPHINX, won the DARPA competition that year. The highest scoring rule-based system came in fifth. What the linguists "knew" was wrong.

      The example you gave is another example of the linguists not knowing as much about statistics as they think. The corpora used for statistical translation include examples of idiomatic usages. Idiomatic usage is highly stereotypical, so the Viterbi path through an N-gram analysis captures such highly linked phrases with high accuracy.
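
      For readers who haven't met the term, here is a toy Viterbi search over a bigram model. Everything in it is an assumption made for illustration: the candidate words, the log-probabilities in `BIGRAM_LOGP`, and the function name. It is not the system from the article.

```python
def viterbi(candidates, bigram_logp, default=-10.0):
    """Pick one word per position so the summed bigram log-probability is maximal."""
    # best[w] = (score of the best path ending in w, that path)
    best = {w: (0.0, [w]) for w in candidates[0]}
    for options in candidates[1:]:
        new_best = {}
        for w in options:
            score, path = max(
                (prev_score + bigram_logp.get((prev, w), default), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[w] = (score, path + [w])
        best = new_best
    return max(best.values())[1]

# One list of candidate translations per source word (made-up data).
CANDIDATES = [["he"], ["makes", "does"], ["strong", "powerful"], ["tea"]]

# Made-up bigram log-probabilities; "strong tea" is the stereotyped phrase.
BIGRAM_LOGP = {
    ("he", "makes"): -1.0, ("he", "does"): -1.2,
    ("makes", "strong"): -1.5, ("makes", "powerful"): -1.4,
    ("does", "strong"): -2.0, ("does", "powerful"): -2.0,
    ("strong", "tea"): -0.5, ("powerful", "tea"): -4.0,
}

best_path = viterbi(CANDIDATES, BIGRAM_LOGP)
print(" ".join(best_path))
```

      A greedy left-to-right choice would take "powerful" after "makes" because that single step scores slightly better; the Viterbi path still ends up with "strong tea" because the tightly linked pair dominates the total.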

  13. Of course, in British English... by gidds · · Score: 3, Interesting
    ...there's no ambiguity. Becoming angry is getting pissed off. I urinated is I pissed (no 'got'). So, here, your sentence could only refer to inebriation. (Though why that should be a prerequisite for installing such a cool system, I've no idea.)

    I always said you Yanks couldn't even use your own language properly... [fx: ducks]

    --

    Ceterum censeo subscriptionem esse delendam.

  14. Grammatical Differences by dhodell · · Score: 5, Interesting

    I'm sure that everybody's familiar with the output and quality of the various translators available online. I myself have been very interested in creating such a utility, and then one based on statistical language analysis. In my time in Holland, I've enjoyed learning the Dutch language, and have found online utilities to be of little help when translating documents (though I do not require this much anymore, it would have been helpful in the beginning).

    Although these methods work better than literal word-for-word translation, they're still not going to be perfect without some sort of human intervention. Dutch, for instance, has a completely different sentence structure than does English. For instance, the sentence "The cow is going to jump over the moon." becomes "De koe gaat over de maan springen" or, literally, "The cow goes over the moon to jump".

    Don't laugh at this structure or perhaps any unobvious usefulness. I've had discussions with people regarding the grammatical structure of a language and the society around it. Indeed, a specific example I have comes from a TV show "Kop Spijkers", which is a show focused mainly on poking fun at political activity and news events. At times, they have people dressed as popular media and political figures and have comical debates.

    In one show, a person acting as Peter R. de Vries (roughly the Dutch equivalent of William Shatner on America's Most Wanted) stated the following joke (JS stands for Jack Spijkerman, the host of the program):
    PRdV: ...Maar ja, ik ben de niet roker van het jaar.
    JS: Hoezo?
    PRdV: Nou, ik rook 2 pakjes per dag... niet.

    Translated into English, we would not find the humor in this transaction:
    PRdV: ...Anyway, I'm the non smoker of the year.
    JS: How do you figure that?
    PRdV: Well, I ... don't ... smoke 2 packs per day.

    Sure you can crack a smile about it, but it's much funnier when the punchline comes at a climax. And in English, it is not possible to state "Well, I smoke 2 packs per day... NOT" (without sounding like a retard who's watched too much Wayne's World).

    Getting back on topic, I believe there will be major issues with any translation algorithm to come. This is, of course, to be expected; I hope, however, that more advances will soon follow.

    --
    Kind regards, Devon H. O'Dell
  15. Do put me out of work. Please! by Frantactical Fruke · · Score: 5, Interesting

    On the other hand, having just finished translating a letter from Finnish to German, I fear that in light of the fact that, unlike most other cultures, Germans consider unspeakably long, intertwined sentences with multiple asides quoting their dead grandmothers who used to go on and on like this all day and the mandatory Goethe or Immanuel Kant quote concerning the importance of staying on topic, of which this run-on piece of drivel gives you but a faint impression, rather stylish and intelligent, we might have to wait a while yet.

    Would a program know how to break up a monster like that?

    Or, seriously, I ended up rewriting most of the letter to convey its contents in a tone that hopefully won't insult the recipient because of differing cultural expectations.

    Finns often consider politeness a waste of time. Now explain that to a statistical translator program: "Leave out/add in some polite blablablah"?

  16. Limited value? by sjasja · · Score: 3, Interesting
    Automatic dictionary generation for MT seems of limited value to me. You can purchase dictionaries easily enough, or get trained monkeys^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H linguistics students cheaply enough to do the work.

    Raw dictionary work is pretty much the least interesting, most mechanical part of an MT system.

    Grammar (source parsing, transformation and target generation) takes a lot more work and careful thinking.

    The more accurate you want your MT system to be, the more extra information you want to attach to your dictionary entries (the more the system knows about all the words, the more disambiguation using real-world knowledge it can do.) "I have a ball" vs "I have an idea" translate to some languages quite differently; you need to know that you don't (usually) physically hold "an idea" in your hand. The most common words ("is", "have") are often the worst in this respect.

    (I have worked coding an MT system.)

    1. Re:Limited value? by Jadrano · · Score: 4, Informative

      Of course, you can buy dictionaries or get trained people to write them, but the amount of data needed for every lexical item would be so large that a wide coverage would be very hard to achieve. For example, you have to note all collocations. Often, such preferences aren't clear-cut. For instance, 'essential' appears much more frequently in a predicative position (e.g. 'X is essential') than in an attributive one, while 'basic', which can have a very similar meaning in many contexts (e.g. 'the essential X'), appears much more often in an attributive position. Such information is necessary for good translation, but dictionaries usually don't provide it. Statistical analyses of lexical items reveal many things dictionaries don't tell you. Nowadays, a significant part of the work of trained people writing dictionaries is looking at corpora, and making this process automatic is a logical step.

      Strictly separating raw dictionary work and grammar seems rather old-fashioned to me. Of course, it can work to some degree, but there are so many different types of collocational preferences that just providing each lexeme with a 'grammatical category' from a relatively small list and basing the grammar on these grammatical categories is hardly enough.

      It is true that automatic systems' lack of world knowledge is a big problem, but the examples you provide aren't really a good demonstration of this fact. As you write, 'have' is translated differently into some languages depending on whether the object is abstract. So, given a translation system that recognizes the verb and its object and a bilingual parallel corpus, a statistical system can find out about that.

      I have heard of people who write dictionaries that can be used for automatic processing; for every lexeme they need between half an hour and an hour (consulting dictionaries and corpora, checking whether the application of rules gives correct sentences). This can only work if the aim of the MT system is either only a very limited domain (e.g. weather forecasts, for which there are working rule-based translation systems) or very low quality. It could never be affordable to have trained people provide all relevant characteristics for the millions of words that would be needed for a good MT system with wide coverage.

      Differentiating between concrete and abstract entities is something that seems quite natural to us, but there are many other relevant characteristics of lexical items that don't come to linguists' minds so easily; statistical analyses can be better at discovering them.

  17. unfortunately doomed by aziraphale · · Score: 4, Interesting

    Like most computerised translation efforts, this ignores the fact that translation always requires context. The sentence 'fruit flies like an orange' is a classic example in the English language of a sentence which can be interpreted in two different ways - sentences can easily be constructed which have completely different meanings in different contexts.

    'As a punishment, he was given a longer sentence'. Obviously, we're talking prison, right? Well, what if the preceding sentence was:
    'The teacher had grown weary of his poor attempts at translation'?

    A statistical system, even working with the entire phrase, won't be able to figure out which meaning of the word 'sentence' is intended there.

    how about:
    'The box was heavy. We had to put it down'
    'The dog was ill. We had to put it down'

    You need semantic understanding to be able to perform translation.

    1. Re:unfortunately doomed by plasticmillion · · Score: 4, Insightful
      This is definitely true. At the same time, the results of statistical natural language processing are surprisingly good. Really this should not be so surprising, since they function in a way similar to the human brain. A neural network like the brain is designed to deduce a complex function from training data. I believe strongly that the best way to get intelligent(-seeming) behavior out of machines is to mirror this process.

      Artificial neural nets are one way to do this, but statistical methods are more or less analogous and have the advantage of being highly optimizable. Personally I don't understand the details, but Very Smart Mathematicians have found ways to optimize models like Singular Value Decompositions (SVDs) so that they can be calculated orders of magnitude faster than models that cannot be represented as formally using mathematics.

      The bottom line is that statistical methods are probably the way that we will end up producing brain-like behavior on computers, and the fact that there are promising results already is heartening. Yes, for truly intelligent behavior a lot of domain knowledge will also be needed, as you point out. But I don't see any reason why the extraction and mapping of this knowledge couldn't also be achieved with large training corpora and statistical methods, rather than hand-crafting.

    2. Re:unfortunately doomed by capologist · · Score: 4, Insightful

      It may be possible for this approach to address that issue somewhat. Statistics can be collected not only on associations of words with other words, but also on associations of groups of words or phrases with others. So if the translator has learned from documents in which the phrase "put it down" appears near the word "ill" and the word "dog," and from other documents in which the phrase is associated with the word "heavy," it can make a good guess.

      Clearly, it would need to learn from a tremendous amount of input data before it could begin to approach the experience of a human, and hence make guesses of similar quality to a human translator. However, the amount of available source material is increasing so rapidly that it may be possible for a translator to get pretty darn smart this way.

  18. Arabic Grammar Nazi by nat5an · · Score: 5, Informative

    From the Article: Compare two simple phrases in Arabic: "rajl kabir" and "rajl tawil." If a computer knows that the first phrase means "big man," and the second means "tall man," the machine can compare the two and deduce that rajl means "man," while kabir and tawil mean "big" and "tall," respectively.

    Not to be overly anal (hopefully to raise an important point), "rajl kabir" actually means "old man" not "big man." The Arabs will definitely laugh at you if you mix these up. You'd use the word "tawil" for a tall or generally large man. The word "sameen" refers to a fat or husky guy. In a different context (referring to an inanimate object), "kabir" does in fact mean big.

    I wonder how good these statistical systems really are at learning the various grammatical nuances of a language like Arabic. For example, in Arabic, non-human plurals behave like feminine singulars, whereas human plurals behave like plurals.

    It's really incredibly cool that these machines can learn language mechanics and definitions on their own. But as previous posters have already noted, the machine still has to know the meanings of words in order to do a good translation.

    For example, to translate "big box" and "big man" into Arabic, you'd actually use different words for big, since the box is inanimate, but the man is animate.

    --
    Head down, go to sleep to the rhythm of the war drums...
    1. Re:Arabic Grammar Nazi by capologist · · Score: 3, Informative

      But as previous posters have already noted, the machine still has to know the meanings of words in order to do a good translation.

      For example, to translate "big box" and "big man" into Arabic, you'd actually use different words for big, since the box is inanimate, but the man is animate.

      I think that one of the major points of the statistical technique is to deal with precisely this sort of thing.

      It doesn't have to know the "meaning" of words like "box" or "man," it just has to have seen them in a particular context before. If it has, then it knows that "big" is usually translated one way when it appears with "man," and another way when it appears with "box." So it just follows those observed patterns, without knowing anything about "meaning."

      In principle (the article hints at this), it might even be possible in the future to make good guesses about combinations that it hasn't seen before, by inferring rules about how things are put together. For example, it encounters the phrase "strong box," which it hasn't seen before. But it has seen many other words that frequently are associated with inanimate-strong and that are also frequently associated with inanimate-big, and many words that frequently are associated with animate-strong and also frequently associated with animate-big, but few words that are frequently associated with inanimate-big and animate-strong. So it infers that inanimate-strong is somehow "parallel" to inanimate-big. Since it also finds that "box" is more typically associated with inanimate-big than with animate-big, it infers that the version of "strong" that goes with "box" is the one that is parallel to inanimate-big. Therefore, it selects inanimate-strong. Ta da!
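
      The simpler case, just following observed patterns, fits in a few lines. This is a sketch under loud assumptions: the observations are invented, the choice of "kabir" for things and "tawil" for people only follows the claims made in the parent comment, and the back-off to context-free counts is my own addition rather than anything from the article.

```python
from collections import Counter, defaultdict

# Invented observations: (English word, neighbouring noun, translation seen).
OBSERVED = [
    ("big", "box", "kabir"), ("big", "box", "kabir"), ("big", "house", "kabir"),
    ("big", "man", "tawil"), ("big", "boy", "tawil"),
]

def train(observations):
    """Count translations per (word, neighbour), plus context-free totals."""
    table = defaultdict(Counter)
    for word, neighbour, translation in observations:
        table[(word, neighbour)][translation] += 1
        table[(word, None)][translation] += 1  # back-off counts, no context
    return table

def translate(word, neighbour, table):
    """Use the translation seen most often in this context, else back off."""
    counts = table.get((word, neighbour)) or table.get((word, None))
    return counts.most_common(1)[0][0] if counts else word

table = train(OBSERVED)
print(translate("big", "man", table), translate("big", "box", table))
```

      Nothing here knows what a box or a man is. An unseen neighbour such as "table" falls back to whichever translation was most common overall, which is exactly where the cleverer inference described above would have to take over.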

  19. This approach is limited by Oryx3 · · Score: 3, Insightful
    Yes, that's a big problem with statistical methods. The point is that we don't just use words with specific meanings like "man" or "tall", but we also use:
    • abstract words that take on different meanings in different contexts (i.e. they're polymorphic)
    • we use words metaphorically (the "pissed" example above). Metaphor requires the reader to make the connection on the fly between two concepts, hence it requires intelligence. ("On the fly" is a good example. A computer can be given a list of such metaphorical expressions, but recognizing new ones is a much harder problem.)
    • we use words incorrectly, or misspell them, or use imperfect grammar, but that's OK because our human reader is able to infer the meaning
    • humans think it's funny sometimes to use words in the wrong context, i.e. where the metaphorical meaning is really outlandish, or there is a conflict between the idea and the way it is expressed. I think we like this because it requires intelligence to work out the meaning in these cases.

    For example, the English word pattern can be translated in French by any of (please excuse the lack of accents, they were stripped when I submitted): modele, exemple, type schema, dessin, motif, maquette, patron, plan, disposition, groupement, repartition, combinaison, diagramme, gabarit, echantillon, tendance, figure, circuit (and probably others as well) depending on the context -- and not just the lexical context, but the meaning.

    Previous attempts to automate translation focused on giving computers grammatical and semantic knowledge, in the hope that it could infer some meaning from this and so choose the right equivalents. Despite some success, this approach failed in general, putting machine translation (MT) firmly in the realm of AI. I believe this statistical approach is a step in the wrong direction (back to purely lexical means of analyzing texts with a view to translation). Further progress in MT will come from AI.

    This doesn't detract from the ways in which computers have been useful to translators -- in the area of computer-assisted translation (translation memory, localization, terminology databases, etc.)

    The other point is it's a lot harder to get a good-quality parallel corpus than you'd think (even in the Internet age -- most of the stuff on the Internet is crap anyway).

    It's not the idea of using computers in translation that I think is limited, just this approach.

  20. Two more classic machine mistranslations by Mawbid · · Score: 4, Interesting
    The one you mentioned is often accompanied by two more, so I'll continue the tradition. These smell like urban legend, but who cares? :-)

    An engineer was confused when a translated spec included water goats. "Water goats"?! Hydraulic rams, actually.

    And perhaps most famous of all, "out of sight, out of mind" supposedly came back as "blind idiot".

    Language is a curious thing. I can't help thinking there's some deeper meaning to the fact that misapplication of it can so easily be funny to us.

    --
    Fuck the system? Nah, you might catch something.
  21. A paper on this by metlin · · Score: 3, Informative

    A long time ago I wrote a paper on this: the application of the N-gram technique with statistical methods for use in CBR.

    You can find the paper here (PDF) and the presentation here. ;-)

  22. Language Applicability by Flwyd · · Score: 3, Interesting

    "If we can learn how to translate even Klingon into English, then most human languages are easy by comparison," [Dr. Knight] said.

    That's not really the case. Klingon was created through conscious effort and hasn't evolved many (any?) warts over time. Its structure is akin to well-understood human languages.

    Now take Turkish, which has concatenative grammar. Adjectives are applied by tacking suffixes on to the word, sometimes changing spelling of previous chunks. Thus, a 20-word English phrase may correspond to a single Turkish word and extremely long words may be reasonably assumed to be unique. Statistical techniques can work with Turkish, but it requires some work up front to extract tokens. \b\B+\b doesn't help much. German (and, I think, Greek) are like this to a lesser extent.

    Statistical approaches are often quite effective in language processing, much to the surprise and disheartening of linguists. They're far from perfect, but often the best thing so far.
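
    To make the Turkish point concrete, here is a toy comparison between a word-boundary tokenizer and a greedy suffix stripper. The suffix list is a tiny hypothetical sample, `segment` is an invented name, and real Turkish morphology (vowel harmony, consonant changes, ambiguous splits) needs far more than this.

```python
import re

# Tiny, hypothetical suffix list; a real analyser needs a full morphology.
SUFFIXES = ("ler", "lar", "iniz", "den", "dan")

def word_tokens(text):
    """Plain word-boundary tokenizer: one token per whole word."""
    return re.findall(r"\b\w+\b", text)

def segment(word, suffixes=SUFFIXES, min_stem=2):
    """Greedily peel known suffixes off the end of a word."""
    parts = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                parts.insert(0, suffix)
                word = word[: -len(suffix)]
                stripped = True
                break
    return [word] + parts

# "evlerinizden" is roughly "from your houses".
whole = word_tokens("evlerinizden")
pieces = segment("evlerinizden")
print(whole, pieces)
```

    The word-level tokenizer yields one long token that may never recur in the corpus, while the segmented form exposes a stem and suffixes that each show up often enough to gather statistics on.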

    --
    Ceci n'est pas une signature.