Slashdot Mirror


More on Statistical Language Translation

DrLudicrous writes "The NYTimes is running an article about how statistical language translation schemes have come of age. Rather than compiling an extensive list of words and their literal translations via bilingual human programmers, statistical translation works by comparing texts in both English and another language and 'learning' the other language via statistical methods applied to units called 'N-grams', e.g. if 'hombre alto' means tall man, and 'hombre grande' means big man, then hombre=man, alto=tall, and grande=big." See our previous story for more info.
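
The deduction in the summary (hombre=man, alto=tall, grande=big) can be sketched with a few rounds of expectation-maximization over the two phrase pairs. This is only a toy in the spirit of IBM Model 1 word alignment, not a description of the system in the article; the function names `train_model1` and `best_translation` are made up for illustration.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Toy Model-1-style EM: estimate t[(src_word, tgt_word)] from phrase pairs."""
    t = defaultdict(lambda: 1.0)  # uniform start; only the ratios matter
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (src, tgt) links
        total = defaultdict(float)   # expected count of links per src word
        for src, tgt in pairs:
            src_words, tgt_words = src.split(), tgt.split()
            for e in tgt_words:
                # Share each target word among the source words it appears with.
                norm = sum(t[(f, e)] for f in src_words)
                for f in src_words:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[f] += frac
        # Re-estimate the translation table from the expected counts.
        t = defaultdict(float, {fe: c / total[fe[0]] for fe, c in count.items()})
    return t

def best_translation(t, word):
    """Most probable target word for a source word, or None if unseen."""
    candidates = {e: p for (f, e), p in t.items() if f == word}
    return max(candidates, key=candidates.get) if candidates else None

pairs = [("hombre alto", "tall man"), ("hombre grande", "big man")]
table = train_model1(pairs)
for w in ("hombre", "alto", "grande"):
    print(w, "->", best_translation(table, w))
# expected on this toy data: hombre -> man, alto -> tall, grande -> big
```

"man" is the only English word that appears with "hombre" in both pairs, so it soaks up the probability mass for "hombre", which in turn frees "tall" and "big" to attach to "alto" and "grande".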

193 comments

  1. Not just matching phrases by marcopo · · Score: 5, Interesting

    The key improvement is not just to search for phrases that appear in the sample texts. If you have an idea of what a word means and what its grammatical role is, then you can plug it into other sentences and greatly extend the set of phrases you can translate. Thus an important idea is to search for phrases that match grammatically with phrases you can translate.
    However, this requires a stage where the sample texts are used to extract grammatical information on the second language. Of course, it helps a lot if you are familiar with one of the two languages.

  2. Same words, different meanings by shish · · Score: 5, Interesting

    What happens when it hits a word with several meanings? For example the reply to a previous story "I got pissed and installed OSX"

    drunk?
    angry?
    urinated?

    --
    I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
    1. Re:Same words, different meanings by Anonymous Coward · · Score: 0

      I think you mean, "I got pissed because I installed OS X."

    2. Re:Same words, different meanings by marcopo · · Score: 1
      In this case the sentence is actually ambiguous :)

      However, if this sentence appears in some context, and the sample texts are extensive enough to include the idiom "get pissed" in a similar context it may be enough to let the translator prefer one translation over the other.

      If this project got this far I would be impressed.

    3. Re:Same words, different meanings by CyberSlugGump · · Score: 5, Funny

      Reminds me of a story (see the bottom of this page):

      The US Gov't was funding an early computer group to translate documents from Russian-to-English and back. The hope, obviously, was to eliminate the need for human translators. A particular sentence was fed to the computer, which translated it into Russian. The computer was then fed the Russian, and it translated it back to English.

      The original sentence was "The spirit is strong, but the flesh is weak".
      The resulting sentence? "The vodka is good, but the meat is rotten".


      The computer didn't know which of the many possible words to use when translating spirit, so it used "vodka". Likewise, it tried to put the word "strong" into context, and since strong vodka is prized in Russia, it decided that the vodka was good. Likewise, flesh got translated to meat, and weak flesh became bad meat.

    4. Re:Same words, different meanings by Dave+Angle+is+....+M · · Score: 1

      Not that impressive ... just good selection of corpora.

    5. Re:Same words, different meanings by Anonymous Coward · · Score: 2, Funny

      you can see this effect in action (with the babelfish translator from altavista) here: http://www.tashian.com/multibabel

      example:

      Original English Text:
      I am a lame anonymous coward

      Translated to French:
      Je suis un lache anonyme boiteux

      Translated back to English:
      I am a lame anonymous coward

      Translated to German:
      Ich bin ein lahmer anonymer Feigling

      Translated back to English:
      I am a lame anonymous coward

      Translated to Italian:
      Sono un vigliacco anonimo zoppo

      Translated back to English:
      They are vigliacco an anonymous cripple

      Translated to Portuguese:
      Sao vigliacco um o aleijado anonymous

      Translated back to English:
      Anonymous is vigliacco the one cripple

      Translated to Spanish:
      Anonimo es el vigliacco el un lisiado

      Translated back to English:
      Anonymous a disabled one is vigliacco

    6. Re:Same words, different meanings by Alkonaut · · Score: 2, Interesting
      Since the meaning of "pissed" is determined by the context (nationality, for example), you would need more information than the sentence itself to make an educated guess. A little context is given by the "installed OSX", but probably not enough to decide between angry and drunk...

      Does anyone know if for example babel is context/locale sensitive in this sense:

      If I write "theatre" or some other word with British spelling, does it then understand that any other words with different meanings in en-US and en-GB English should use the meaning from en-GB? The test sentence "At the theatre getting pissed" won't work since no slang seems to work with babel.

    7. Re:Same words, different meanings by ColonelPanic · · Score: 2, Funny

      "Driving home from work with a manual transmission, wearing a dress after her shift, she had to shift her shift in order to shift."

      --
      "Skill shows through where genius wears thin." -Wittgenstein || Religion: uniting aviation and architecture.
    8. Re:Same words, different meanings by jandrese · · Score: 1

      The best part is that the second result actually sounds like a Russian Proverb.

      --

      I read the internet for the articles.
    9. Re:Same words, different meanings by godot42a · · Score: 3, Interesting

      Short and simplified version: Look out for different typically co-occurring words and cluster them. For "pissed", you'll find Cluster 1: {pissed, toilet}; Cluster 2: {pissed, booze, get}; and probably some more. These clusters correspond to different meanings of the word. Then determine which of these clusters fits the current usage.
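
A minimal sketch of the last step described above (deciding which cluster fits the current usage), assuming the clusters have already been found. The cluster contents beyond "toilet", "booze" and "get" are invented for illustration, as are the sense labels and the name `pick_sense`.

```python
# Hypothetical co-occurrence clusters for "pissed"; a real system would
# learn these from a corpus rather than have them written by hand.
SENSE_CLUSTERS = {
    "urinated": {"toilet", "bathroom"},
    "drunk": {"booze", "get", "pub"},
    "angry": {"furious", "boss", "shouted"},
}

def pick_sense(sentence, clusters=SENSE_CLUSTERS):
    """Return the sense whose cluster shares the most words with the sentence."""
    words = set(sentence.lower().split())
    # Ties go to whichever sense comes first in the dict.
    return max(clusters, key=lambda sense: len(words & clusters[sense]))

print(pick_sense("we went to the pub to get pissed on cheap booze"))  # drunk
print(pick_sense("he pissed in the toilet"))                          # urinated
```

With no overlapping context words at all, as in "I got pissed and installed OSX" where only "get" is close and does not match "got", this approach has nothing to go on, which is the ambiguity the parent post points out.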

    10. Re:Same words, different meanings by Anonymous Coward · · Score: 2, Insightful

      It gets even more complicated, particularly with the connotations attached to certain words and phrases.

      For example one country's "Weapons of Mass Destruction" is another country's "Strategic Deterrent". Both phrases mean the same thing but the tone is very different. Same thing with "terrorists" and "freedom fighters". You can use either phrase to describe the same people and imply very different meanings.

      It will be a long time before an automated system will be able to make an acceptable translation of these subtleties.

    11. Re:Same words, different meanings by Anonymous Coward · · Score: 0

      Kokoo kokoon koko kokko.
      Koko kokkoko?
      Koko Kokko.

    12. Re:Same words, different meanings by Anonymous Coward · · Score: 0

      "Word Sense Disambiguation"
      http://www.cs.jhu.edu/~yarowsky/pubs/coling92.ps

      I took Prof. Yarowsky's class, Information Retrieval and Web Agents, and we had an assignment that implemented this algorithm.

    13. Re:Same words, different meanings by Blaskowicz · · Score: 1

      thanks for the link, it's fun as hell!
      If you take the final output and reuse it in the "multibabel machine", again and again.. you'll get that

      Original English Text:
      I need you clothes, your boots, and your motorcycle.
      You forgot to say please!


      after ~ 10 runs :
      __ from D the shape, the extremity, the extremity, equipped of to the beginning of the necessity with the rayon of the parents and the relative bicycle of the movement. The east forgets this to the inner part in order to finish it D, how much it it it extremely, if the concli, to this extremity it arrests itself, in the extremity of order in order around to the opinion to it with the this this it, of this, in this extremity, than with it these years, like it, it arrests it, in how much crank that apportionable has dealt!

    14. Re:Same words, different meanings by GaryCoen · · Score: 1

      [Consider this note a sanity check for people using the web as a research tool.] The exchange noted above occurred as part of US House of Representatives testimony evaluating recent accomplishments of MT research funded by NSF (at MIT, Harvard, and elsewhere) prior to presentation of the 1966 ALPAC Report. Your source sentence is mistaken, though, along with part of your exegesis. (For the record, this research was focused on uni-directional--not bi-directional--translation of physics abstracts from Russian to English.) The original source sentence was as follows: "The spirit is willing, but the flesh is weak." Cheers.--Gary

  3. Re:Who can translate this?? by qta · · Score: 0, Offtopic

    Windows = "Windows" - will give translation when enough statistical info becomes available

    NT = Not Trustworthy
    .Net = Non-ExistenT

  4. Translator by Anonymous Coward · · Score: 3, Informative

    That's an example from a few years back of an attempt to translate "the spirit is willing but the flesh is weak" from English to Russian and back to English using a different translator.

    Can anyone try this on the new (or some other recent) algorithm?

    BTW here's Doc Och's most recent website:

    Franz Josef Och [isi.edu]

    --
    Esteem isn't a zero sum game

    1. Re:Translator by buro9 · · Score: 2, Insightful

      That wouldn't apply here as the sample data you've suggested is too little.

      For statistical translations to work, you would need a substantial set of data, already translated, from which you could do the comparisons and create your database of phrases and words.

      In the example you've given you would need to have pre-populated this database in advance for the statistical engine to understand how to do the translation.

      What you've got to do is stop thinking that this is actually performing a translation... it's not... it's performing a cost-based replacement where the costs have been calculated from statistics gathered from a large pool of sample data.

      Once you have the sample data... then you will have the translation.
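
The "cost-based replacement" idea can be sketched as follows: count how often each source word was translated each way in the sample data, turn the relative frequencies into costs (negative log probability), and replace each word with its cheapest option. The sample pairs and the names `build_phrase_table` and `translate` are assumptions for illustration; a real system would work on phrases and also weigh how well the output reads.

```python
import math
from collections import Counter, defaultdict

def build_phrase_table(aligned_pairs):
    """Map each source word to {target word: cost}, cost = -log(relative frequency)."""
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    table = {}
    for src, tgts in counts.items():
        total = sum(tgts.values())
        table[src] = {tgt: -math.log(n / total) for tgt, n in tgts.items()}
    return table

def translate(words, table):
    """Replace each known word by its cheapest target; pass unknown words through."""
    out = []
    for w in words:
        options = table.get(w)
        out.append(min(options, key=options.get) if options else w)
    return out

# Hypothetical sample data: "banco" was seen as "bank" three times, "bench" once.
samples = [("banco", "bank")] * 3 + [("banco", "bench"), ("el", "the")]
table = build_phrase_table(samples)
print(translate(["el", "banco", "cerrado"], table))  # ['the', 'bank', 'cerrado']
```

The unknown word "cerrado" passes through untouched, which illustrates the parent's point: without sample data covering a word or phrase, there is no translation to be had.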

    2. Re:Translator by Anonymous Coward · · Score: 0

      The spirit is willing but the flesh is weak

      Hey, moderators. What the crap is informative about trotting out this tired old story again, and then wondering aloud if it is still applicable??

  5. IBM research 10 years ago by Surak · · Score: 5, Interesting

    I remember reading about IBM doing this research about 10 years ago. The biggest problems then were adequate processing power and storage space. Those things have greatly improved in the last 10 years (thank the spirits of Moore). I think that's why you're starting to see all this cool research with speech recognition and AI that was being done in the 80s and 90s become more and more commonplace. This trend will likely continue, and all the cool research-only stuff you remember reading about in the 80s and 90s will just be common fixtures on PCs of today.

    Speaking of which -- speech recognition, AI, translation learning algorithms -- sounds like we have the seeds for the Universal Translator. :)

    1. Re:IBM research 10 years ago by kmak · · Score: 3, Interesting

      I have one question though, while obviously, you can get a mapping of definitions, can you actually translate a full sentence into another full sentence?

      With exceptions in tons of languages, is this even feasible in the near future? Sure, we can understand a poorly translated sentence, but can it translate it so that we don't have to?

      --

      I'm not the devil.. just his advocate.
    2. Re:IBM research 10 years ago by Jugalator · · Score: 5, Informative

      Yes, I see IBM's project was called the "Candide Project". Here's a link with some information about it, including a link to the paper describing the prototype system they built:

      http://www-2.cs.cmu.edu/~aberger/mt.html

      --
      Beware: In C++, your friends can see your privates!
    3. Re:IBM research 10 years ago by Anonymous Coward · · Score: 0

      It totally depends. It isn't a matter of "We can now translate x number of words or x number of sentences without error"... it's not the quantity. It is the context that makes all the difference. So one really long sentence might be translated perfectly whereas a short one might have glaring mistakes.

      One can use more and more powerful computers and algorithms to reduce the number of mistakes, perhaps to the point that it will get things right most of the time. But the thing that is needed for a computer to translate as accurately as a human is AI. Context and understanding of the actual meaning of the sentence, as well as some creativity, are required to translate a text like a human.

    4. Re:IBM research 10 years ago by MojoRilla · · Score: 1

      One can use more and more powerful computers and algorithms to reduce the number of mistakes, perhaps to the point that it will get things right most of the time. But the thing that is needed for a computer to translate as accurately as a human is AI. Context and understanding of the actual meaning of the sentence, as well as some creativity, are required to translate a text like a human.

      I completely disagree with this. People thought they would need artificial intelligence to beat a chess grandmaster. But IBM did it with brute force.

      People also thought we would need artificial intelligence for computers to invent things. But university researchers are doing it now with genetic algorithms.

      Translating language is probably no different.

    5. Re:IBM research 10 years ago by MrSubtle · · Score: 1

      The problem is worse than that. There are frequently cases where the information required to come up with a proper translation is simply not present in the source text. For example, in English (unless you are south of the Mason-Dixon) there's no plural second-person pronoun, but in many other languages there are such words, so when the source text says "You should vote for the Republicans." there's no way to tell whether "you" refers to one person or many. Likewise, in some languages there is social relationship content in the way words are used. You might use different words for the same thing based on whether you are talking to a dog, a child, a pretty girl, a co-worker, a boss or an emperor. If there's no indication in the source text then you aren't going to be able to tell which word to use. The only way to solve those kinds of problems is to invent a HAL 9000ish AI that actually understands what language means at an abstract level and can reason out what is happening. That of course would have to include a vast amount of contextual information, and so far nobody has been able to build anything even remotely like that. I have nothing against machine translation, but like every other technology it has its limits. --Brian

    6. Re:IBM research 10 years ago by Surak · · Score: 1

      The problem is worse than that. There are frequently cases where the information required to come up with a proper translation is simply not present in the source text. For example, in English (unless you are south of the Mason-Dixon) there's no plural second-person pronoun,

      Y'all. :-P

      but in many other languages there are such words, so when the source text says "You should vote for the Republicans."

      as in "Y'all should vote for the Republicans." (Errmm...no, you actually shouldn't just randomly vote for a candidate based on their party, but it was *your* example. ;)

      Actually, there are two grammatical variations of that. There is "Y'all" and then there is "All y'all". "Y'all" refers to the people you are immediately talking to, while "All y'all" refers to those people and the larger subset of people that they belong to. For instance:

      "Y'all should come to my party."

      means the two people I am talking to should come to my party.

      "All y'all should come to my party."

      means the two people and (perhaps) their entire family, group, etc. should come to my party.

      Of course "All y'all" can be rather ambiguous.

      Exercise for the reader and a point I was making in the above: What did I mean by party? :) Ahhh...the subtleties of the English language. :)

      If there's no indication in the source text then you aren't going to be able to tell which word to use.

      Well, that rather is the point isn't it? How would the machine translate my use of the word 'party' for that matter? I was referring to political parties above, but how does the machine know if below that, I didn't mean a kegger?

      Of course machine translations are never going to quite get it. But my point is that each new technology that is developed is a new piece of the technological puzzle to develop such technologies as a Universal Translator or HAL 9000.

  6. Re:Massive article troll coming up! by Anonymous Coward · · Score: 0

    You are back? I didn't know you left.

  7. Re:Who can translate this?? by Anonymous Coward · · Score: 0

    That'd make Windows some form of negative or negating word, (ie no not- non- anti- kinda thing), making NT trustworthy and Net existent.

    And again we prove that automated translation can always be fooled by a well-chosen (or badly chosen) example...

    That is, following the article's example as gospel.

  8. That's where the stats come in! by Anonymous Coward · · Score: 0

    .. so presumably the system would create some weightings, then words are assigned meanings according to how likely that meaning is from the probabilities of the surrounding words.

    (As far as I can remember such things from Uni.)

  9. So statiscally... by MosesJones · · Score: 5, Funny


    France = "Cheese Eating Surrender Monkey"

    George Bush = "Neo-Imperialist Moron"

    Tony Blair = "Lap Dog"

    WMD = "Nowhere to be found"

    and of course

    Dossier = Creative Story Telling

    --
    An Eye for an Eye will make the whole world blind - Gandhi
    1. Re:So statiscally... by Anonymous Coward · · Score: 0

      you missed:

      Microsoft Security = Missing In Action

    2. Re:So statiscally... by Matthias+Wiesmann · · Score: 4, Insightful
      Actually, using this technology to translate from English to English could be quite interesting. Imagine you could automatically translate legalese, or marketing speak, to plain English. Or translate an article with a given political bias towards another political bias.

      If this happens, I suspect this technology will be illegal...

    3. Re:So statiscally... by Lord_Slepnir · · Score: 2, Funny
      If this happens, I suspect this technology will be illegal...

      Not illegal, just when you try to run it in Windows it will mysteriously crash. Microsoft won't want there to be a program that will translate their EULAs into "w3 0wnz0r j00 50ul!!!!!111"

      I'm still holding out for one that will translate CS-speak into English. God, I'm sick of having to translate "3y3 g0t m4d d34gl3 l0lz!!!1"

    4. Re:So statiscally... by MosesJones · · Score: 1


      Hi George.

      --
      An Eye for an Eye will make the whole world blind - Gandhi
    5. Re:So statiscally... by Anonymous Coward · · Score: 0

      They're one step ahead of you here, no statistical translation necessary.

      Information

      Download

    6. Re:So statiscally... by Anonymous Coward · · Score: 0

      Already been done. Start here:

      http://squishyware.webhostme.com/haxor/default.asp

      and work backwards.

    7. Re:So statiscally... by sanchny · · Score: 1
      Imagine you could automatically translate legalese, or marketing speak to plain english. Or translate an article with a given political bias towards another political bias.
      I like the first two points you made; translating jargon would be extremely useful (though I'm more interested in the translation between different languages).
      But how would it translate an article from one political bias to another? If you change the political bias, you change the underlying tone and meaning of the article.
    8. Re:So statiscally... by arcanumas · · Score: 2, Funny

      What are you talking about? How can you translate legalese when there is nothing to translate? No matter what they say, you can be 100% confident that an accurate translation would be "You are f*cked"

      --
      Slashdot Sig. version 0.1alpha. Use at your own risk.
    9. Re:So statiscally... by Fratz · · Score: 1

      The technology doesn't do that, since it doesn't do advanced semantic analysis on the texts. And it's only meant to work on texts that were translated from one language to another. Feeding it one set of text in legalese and another set that explains the legalese doesn't fit the theory. Even if it did, you'd need millions of such texts before you'd begin to get anything usable.

      Machine Translation as a whole does theoretically allow what you suggest, but example-based technologies don't understand the texts they translate so they can't re-express the same concepts in different ways. Knowledge-based systems, however, could do this. They don't actually understand the texts either, but many of them do use abstract meaning representations (or Interlinguas) to encode what a sentence or phrase means. You can make any kind of text generator you want to go from that meaning back to text. This includes foreign languages, the same language as the original, or 1980s Valley Girl speech.

      --
      -- Fratz, human
    10. Re:So statiscally... by Anonymous Coward · · Score: 0

      THE ONLY WINNING MOVE IS NOT TO PLAY.

    11. Re:So statiscally... by Anonymous Coward · · Score: 0
      Oh, my bad. Thanks for saving us from the winnebagos of mass destruction! No doubt they could have deployed the centrifuge from beneath the rose bush in less than 45 minutes! They had Castor Beans! Oh the humanity! We are helpless without you, Dubya!

      The Castor Beans were hilarious. They announced they'd discovered a load of them, and the Castor Beans can be used for the manufacture of Ricin, a deadly poison.

      They left out two important facts that were not reported till later:

      1. Castor beans are also used in the production of brake fluid.

      2. The beans were found in a brake fluid plant.

      Bunch of freakin morons.

    12. Re:So statiscally... by Lord_Dweomer · · Score: 1
      Microsoft's 3000 page EULA for Windows could be whittled down to one short little phrase (you knew it was coming):

      "All Your Base Are Belong to Us!"

      --
      Buy Steampunk Clothing Online!
    13. Re:So statiscally... by Matthias+Wiesmann · · Score: 1
      But how would it translate an article from one political bias to another? If you change the political bias, you change the underlying tone and meaning of the article.
      If you have an article which contains actual information, this would, of course, be impossible. The tone, on the other hand, can be seen as a language, a way of expressing things. Saying 'Coalition forces announced collateral losses' or 'The occupying army killed innocent people' contains the same semantic information. The language is simply different.

      I find that many articles contain little or no information, but much spin. This would be a way of checking for good articles: if they can be translated to the opposite context and remain meaningful, there is no data and they are worthless.

  10. Works it does! by ucblockhead · · Score: 5, Funny

    Translation-unit this algorithm perfectly works! Deutsch this was typed and translation-unit to English makes this was!

    --
    The cake is a pie
    1. Re:Works it does! by Roofus · · Score: 1

      Sounds more like you translated from "Yoda" than German.

  11. this doesn't work well by Anonymous Coward · · Score: 0

    how about trigrams (N=3)? how much memory
    will they take? too much. N-grams are
    ancient history.

    1. Re:this doesn't work well by NathanE · · Score: 4, Interesting

      You are pretty much right about that, although I do not see the need to actually maintain your table in RAM. Trigrams require a HUGE corpus of training material to get good results, and even then you come up with the need to fudge your data a bit when you come across an unknown trigram. I think it's called "and one rounding" or something like that (trying to remember from class).

      Fascinating stuff for sure, but hardly new unless they have come up with some new development. I haven't read the article.

    2. Re:this doesn't work well by Anonymous Coward · · Score: 0

      Actually, they can be stored in RAM, because you don't have to store zero-count trigrams (or bigrams). In practice, you may not store trigrams that occur less than n times, where n is about two or three.
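
The two points in this sub-thread can be sketched together: store only the trigrams that actually occur (optionally dropping rare ones), and give unseen trigrams a small non-zero probability. The half-remembered "and one rounding" above is probably add-one (Laplace) smoothing, which is what this sketch assumes; the function names and the tiny corpus are made up for illustration.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Sparse n-gram counts: only n-grams that occur get an entry."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def prune(counts, min_count=2):
    """Drop n-grams seen fewer than min_count times to save memory."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

def add_one_prob(trigram, tri_counts, bi_counts, vocab_size):
    """P(w3 | w1, w2) with add-one smoothing; unseen trigrams get a small share."""
    w1, w2, _ = trigram
    # Counter returns 0 for missing keys without storing them.
    return (tri_counts[trigram] + 1) / (bi_counts[(w1, w2)] + vocab_size)

tokens = "the cat sat on the mat the cat sat".split()
tri = count_ngrams(tokens, 3)
bi = count_ngrams(tokens, 2)
vocab = len(set(tokens))

print(add_one_prob(("the", "cat", "sat"), tri, bi, vocab))  # seen twice: 3/7
print(add_one_prob(("the", "cat", "mat"), tri, bi, vocab))  # unseen: 1/7
print(len(tri), len(prune(tri)))                            # 6 distinct, 1 after pruning
```

A vocabulary of V words has V^3 possible trigrams, but the table only ever holds the ones observed, which is why it can fit in RAM. Add-one smoothing is the simplest option and is known to give too much mass to unseen events on large vocabularies; it is shown here only because it matches the description in the thread.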

  12. Older languages not supported? by panurge · · Score: 5, Interesting
    Modern languages tend to have less inflected grammars than older languages. That is a benefit for statistical methods because individual words do not change significantly. But how would this work for Latin, Greek and other highly inflected languages? Anyone who knows "The Turn of the Screw" (Britten version) will remember:

    malo: I had rather be
    malo: in an apple tree
    malo: than a naughty boy
    malo: in adversity

    based on four very distinct meanings of malo, in which the word endings put the stem of the word in context, but unfortunately the same word endings are used for different things.

    Not that I'm trying to rubbish the work, because I actually think that statistical methods are close to the fuzzy way that we actually try and make out foreign languages. I just wonder what the limits are.

    --
    Panurge has posted for the last time. Thanks for the positive moderations.
    1. Re:Older languages not supported? by Anonymous Coward · · Score: 2, Interesting

      There are plenty of highly inflected modern languages, e.g. Russian and a few dozen other Slavic languages and Japanese are highly inflected.

      Get this idea out of your head. There is no continuum of inflectedness upon which modern languages align to the uninflected.

    2. Re:Older languages not supported? by xyzzy · · Score: 1

      In inflected languages, the words with differently stemmed endings (or, beginnings) can just be treated as "extra" vocabulary -- so if a noun "apple" has 6 forms, you have 6 words with different parts of speech.

    3. Re:Older languages not supported? by godot42a · · Score: 2, Informative

      > Modern languages tend to have less inflected grammars than older languages.

      In general, that's not true. There is development in both directions, depending on the language family. Proto Indo European started out with many cases, and that's why there is a tendency towards fewer inflections and more particles. In languages with many particles, the development can be reversed. Cliticization is such a process. For example, in some dialects of German, personal pronouns become new verb endings: Laufen Sie! (run!) -> Laufen'S!

    4. Re:Older languages not supported? by tibbetts · · Score: 1, Interesting

      (Offtopic, but indulge me.)

      For anyone who doesn't know Latin, or for anyone who isn't familiar with inflected languages in general, here's a detailed morphological breakdown of this poem.

      malo: I had rather be

      First-person, present indicative active form of the irregular verb malle, "to prefer, wish". It takes an infinitive (most likely esse, "to be"), which is often, as here, dropped.

      malo: in an apple tree

      The locative form of malus, -i (feminine noun), "apple tree".

      malo: than a naughty boy

      Dative of comparison (as dictated by malle) of the adjective malus, -a, -um, "bad, evil". This is the masculine (or neuter) form, hence the translation "boy".

      malo: in adversity

      Ablative of the neuter noun (really a substantive adjective) malum, -i "evil".

      In short, we have a verb, a noun, an adjective, and a homonymic noun.

      (Thanks to the original poster for the poem--I've never heard this one.)

      --
      :wq
    5. Re:Older languages not supported? by Anonymous Coward · · Score: 0

      In what area do they say "Laufen'S"?

    6. Re:Older languages not supported? by plasticmillion · · Score: 1

      The same statistical methods that are applied to syntax (phrase structure) are equally effective when applied to morphology (word structure). In this case the unit of processing would be N-graphs (sets of N letters) rather than N-grams. All languages are inflected to some degree, so I think that a morphological component of this type should be part of any serious statistical translation system.
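
A minimal sketch of what "N-graphs" could look like in practice: split each word into overlapping runs of N letters, so that inflected forms of the same stem share most of their units. The boundary markers `^` and `$` and the function names are assumptions for illustration, not something the post specifies.

```python
def char_ngrams(word, n=3):
    """Overlapping letter n-grams of a word, with ^ and $ marking its edges."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def shared_ngrams(a, b, n=3):
    """Letter n-grams that two word forms have in common."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))

print(char_ngrams("malo"))               # ['^ma', 'mal', 'alo', 'lo$']
print(sorted(shared_ngrams("hablo", "hablas")))  # ['^ha', 'abl', 'hab']
```

The two Spanish verb forms share the units covering the stem "habl" and differ only in the units covering the ending, which is the kind of regularity a statistical model over letters could pick up without being told about conjugation.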

    7. Re:Older languages not supported? by Xerithane · · Score: 1, Informative

      Japanese are highly inflected.

      Japanese doesn't use inflection for any meaning at all. You can speak Japanese without using any inflection, you would just sound like a robot.

      Sometimes it's easier to understand two words that sound similar with inflection, but the way they are written or even spoken is different without any inflection.

      --
      Dacels Jewelers can't be trusted.
    8. Re:Older languages not supported? by sesquipedalian_one · · Score: 2, Informative

      Clearly you've never looked at Turkish. Or any of the Bantu languages, which make the inflectional system of Latin or Greek look like child's play. But the difference between inflectional systems in two languages is really part of a broader issue, namely that translation doesn't occur on the basis of a token-for-token replacement. One word in the source language may correspond to several in the target language, and vice-versa. This is a problem in alignment, and any MT system must deal with it, but that's a fairly well understood problem. A system of this sort certainly would not just look at words as atomic units, but would have to look at parts of words (i.e., their morphology).

    9. Re:Older languages not supported? by JJ · · Score: 1

      Japanese (and other Altaic languages, like Korean, Mongol and Turkish) are either highly inflected or not at all. It really depends on how you write them. What I want to see is a statistical system handle a language like Basque, where the passive voice substitutes for the active.

      --
      So long and thanks for all the fish . . . !!!
    10. Re:Older languages not supported? by Anonymous Coward · · Score: 0

      Inflection isn't that much of a problem. However, I suspect that you may find a lot of reanalysis issues with any software intended to parse inflections (the sort of thing that led "an ewt" in English to become "a newt"). But I suspect that given enough brute force, you'll get something better than what we have. However, I don't imagine you'll ever have human-quality literary translation without human-quality ai.

    11. Re:Older languages not supported? by tcsh(1) · · Score: 1

      Complex morphologies shouldn't be a problem if syntax and morphology are viewed by the system as one and the same. That is, word boundaries and morpheme boundaries serve a similar function. Thus a very analytic language (e.g. Mandarin) and a very polysynthetic language (e.g. many Native American languages) work pretty similarly. The difference is that one likes word boundaries and the other likes morpheme boundaries.
      This same argument could be applied to fusional languages.

      Also, give up the idea of new and old languages.

    12. Re:Older languages not supported? by Anonymous Coward · · Score: 0

      While generally true, there are a couple words that are distinct in their inflection. Hana (flower/nose), ame (rain/candy), hashi (chopstick/bridge/edge) are a few. In writing, they are differentiated by kanji; spoken they are differentiated by inflection; however, speaking them incorrectly makes you sound foolish, not incomprehensible. Chinese on the other hand, is very tonal.

    13. Re:Older languages not supported? by khanyisa · · Score: 1

      You misunderstand the meaning of inflection here ... it really is being used in a grammatical context, not a tone-of-voice context

    14. Re:Older languages not supported? by Xerithane · · Score: 1
      You misunderstand the meaning of inflection here ... it really is being used in a grammatical context, not a tone-of-voice context


      Alteration in pitch or tone of the voice.
      Grammar.
      An alteration of the form of a word by the addition of an affix, as in English dogs from dog, or by changing the form of a base, as in English spoke from speak, that indicates grammatical features such as number, person, mood, or tense.
      An affix indicating such a grammatical feature, as the -s in the English third person singular verb form speaks.
      The paradigm of a word.
      A pattern of forming paradigms, such as noun inflection or verb inflection.


      Either way, Japanese still is hardly inflected at all. English is three times more inflected than Japanese. Chinese is 10 times more inflected than Japanese, as well.
      --
      Dacels Jewelers can't be trusted.
  13. Why the change and Internationalization by beacher · · Score: 5, Interesting

    The article's text has "Compare two simple phrases in Arabic: 'rajl kabir' and 'rajl tawil.' If a computer knows that the first phrase means 'big man,' and the second means 'tall man,' the machine can compare the two and deduce that rajl means 'man,' while kabir and tawil mean 'big' and 'tall,' respectively". Are we going pro-homeland security and not tipping off the powers that be? Or did michael want to show his uber leet 1st quarter espanol skillz?

    Spanish is easy and led me to believe that the article had relatively little weight (it is lightweight and a topical PHB read anyway). I do a lot of data mining in text streams and have found it to be fairly easy work. Getting cursors to play in ideograms/unicode and reversing the data is something I haven't tried yet and the article barely covers it. When I saw that they were covering language sets that were extremely dissimilar to English, my interest in multi-language applications was piqued again. All of my databases are unicode and I want to learn more about having truly international systems that are automated and then hand tweaked to avoid the engrish.com type mistakes. Any help here?
    -B

    1. Re:Why the change and Internationalization by delstar+dotstar · · Score: 0, Funny
      All of my databases are unicode
      ...and they are belong to us. :D
    2. Re:Why the change and Internationalization by nat5an · · Score: 1

      I've been studying Arabic for a while, and it is quite dissimilar to English, and translation to and from it is much more difficult than between English and Spanish. I mean, in Spanish you can practically guess what words are since a lot of them are English cognates these days.

      For example, there's a rather complex root and form system in Arabic that often gives words multiple meanings, e.g. I take the root "k t b," which always has something to do with writing, and put it in a certain form to get "maktab" which most literally means "a place where writing is done," and can mean an office, a desk, etc. Formal Arabic is probably best described as being very metaphorical. Figuring out what words mean in a simple context like the one shown is pretty easy (although their translation is actually incorrect, "kabir" only means "big" when referring to a non-human object, it means "old" when referring to a human). In other words, the meaning of a word is highly context dependent (more so than English, I would wager), and would also require tracking whether pronouns are referring to human or non-human objects, etc.

      The statistical method is neat, but would need to be complemented with some basic knowledge about the language too, I would think. Being able to translate from a language like English to Japanese to Arabic and back to English... that would be impressive.

      --
      Head down, go to sleep to the rhythm of the war drums...
  14. No Universal Translator any time soon by ucblockhead · · Score: 2, Insightful

    The trouble with the Star Trek "Universal Translator" is that they show it working on languages where there is no already translated work. This sort of statistical translation requires someone to sit down and hand-translate a bunch of documents to teach the machine the correlations.

    --
    The cake is a pie
    1. Re:No Universal Translator any time soon by Anonymous Coward · · Score: 0

      they show it working on languages where there is no already translated work.

      You think that's the ONLY problem?

      How about the fact that nobody ever carries one around? Oh, that's right, they're built in..

      So how come when someone goes to another ship (say, when Riker served aboard the Klingon ship), he doesn't understand them?

      Did his translator run out of batteries? No, because they work perfectly fine in the next scene - maybe he swapped them when the camera was on the other guy?

      The whole "universal translator" is the biggest load of BS I've ever seen.

    2. Re:No Universal Translator any time soon by Anonymous Coward · · Score: 0

      Funny you should mention that, well maybe not considering the context of the thread. On a Star Trek rerun last night they encountered a species whose language they couldn't translate. This species had a language based on metaphors, so even though they understood the words (yeah, a stretch) they couldn't get the meaning because they didn't know the stories behind the words.

    3. Re:No Universal Translator any time soon by Jugalator · · Score: 1

      Hehe, I think you missed one thing, the thing that strikes me as the silliest -- when the universal translator is active, the language of aliens isn't even translated to English, but suddenly spoken by them. Anyone in doubt can look at the lip synch. ;-)

      --
      Beware: In C++, your friends can see your privates!
  15. Missed the idea by marcopo · · Score: 2, Interesting
    Translation (computerized or not) is about picking the correct meaning from the context. If the word appears in the given text and in a similar context in the sample texts you could pick the correct meaning.

    As for inflected (read most) languages, learning to separate a word into its stem and inflections is the first step, even if you have a number of such possible break-ups.

  16. Engrams? by Finni · · Score: 1
    Engrams?

    Wow, these guys are just begging for a lawsuit from you-know-who.

  17. Yoda? by allanj · · Score: 5, Funny

    Yoda, is that you?

    --
    Black holes are where God divided by zero
    1. Re: Yoda? by Black+Parrot · · Score: 1


      > Yoda, is that you?

      Shouldn't you ask, "Yoda, that you is?"

      --
      Sheesh, evil *and* a jerk. -- Jade
    2. Re:Yoda? by lakeland · · Score: 1

      Actually, Yoda's grammar comes from Japanese, with the words transliterated into English.

  18. Why not machine language to compiler language by Anonymous Coward · · Score: 1, Interesting

    If this is just statistics, and you can do anything in C, why not statistically relate C to machine code and look at Windows machine code to get a C source that is clean room? Or perhaps look at MSword input vs word document format?

  19. who wants to help me build a tower to heaven? by Anonymous Coward · · Score: 2, Funny

    FINALLY! After all these years of scrambled languages, we can finally get together and plan that tower of Babel!

    Now, all we need is to pinpoint Kolob and we'll be set!

    1. Re: who wants to help me build a tower to heaven? by Black+Parrot · · Score: 1


      > FINALLY! After all these years of scrambled languages, we can finally get together and plan that tower of Babel!

      Vebwe? Kootchka qwim?

      --
      Sheesh, evil *and* a jerk. -- Jade
    2. Re:who wants to help me build a tower to heaven? by Anonymous Coward · · Score: 0

      Oh yeah baby! Thanks to the word of the Lord Almighty who reigns over Heaven and Earth, I would never have guessed that by coordinating the effort of the entire world, we (as in humanity) could build the ever-elusive stairway to heaven. Count me in!

    3. Re:who wants to help me build a tower to heaven? by Anonymous Coward · · Score: 0

      I think the design is done: http://www.delmars.com/wright/flw7a.htm

  20. the real problems lie in understanding... by davids-world.com · · Score: 5, Interesting
    Statistics work quite well not just for phrases or so-called collocations such as "high and low" (vs. *"high and small"). They can help figure out the meaning of a word (bank=credit institute vs. bank=place to rest in a park). You can even learn (automatically learn) this stuff from parallel corpora where you can get a sentence-by-sentence translation, and you figure out statistically which words or phrases belong together.

    But that's an old story. Even the translation of complete sentences is fairly feasible in terms of syntactic structure.

    Harder to translate are things like discourse markers ("then", "because") because they are highly ambiguous and you would have to understand the text in a way. I have tried to guess these discourse markers with a machine learning model in my thesis about rhetorical analysis with support vector machines (shameful self-promotion), and I got around 62 percent accuracy. While that's probably better than or similar to competing approaches, it's still not good enough for a reliable translation.

    And that's just one example of the hurdles in the field. The need for understanding of the text has kept the field from succeeding commercially. Machine translation these days is a good tool for translators, for example in localization.

    1. Re:the real problems lie in understanding... by Orne · · Score: 2, Interesting

      Or bank = shoreline, as in river bank
      or bank = hardware bus, as in a bank of memory
      or banking = betting, as in I'm banking on that... :)

      These statistical language solutions are interesting, in that they can analyze sentence structures and deduce the grammar of a language; however, I would think that they fail on generating the actual definitions of words. You almost need to generate a list of "concepts", then link each concept to a word, by language. Not my field, thank goodness; I wouldn't have the patience for it.

    2. Re:the real problems lie in understanding... by davids-world.com · · Score: 1

      you're not that wrong with the concepts.

      re defining: sometimes it's not bad to define a term using several samples of its context. you can use google for that -- just enter a complicated term and you'll find out how it is used and who uses it.

      i do that quite often when i am looking for the correct usage of a word or a phrase in a foreign language...

    3. Re:the real problems lie in understanding... by Anonymous Coward · · Score: 0

      The thing with statistical language translation is that you don't ever find the definition of the word, you just use a bunch of other clues and assign likelihoods to each meaning.

      So perhaps bank (monetary) has a 55% likelihood, bank (shore) has 30%, and so on. And then when you are translating a sentence, you might have the likelihood change depending on the word(s) that come right before and right after, and so on. It always stays fuzzy, you never end up with direct definitions or mapping.
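
      A rough Python sketch of that "likelihoods that shift with the neighboring words" idea. The tagged sentences, the function names (train, sense_likelihoods) and the add-one smoothing are all assumptions made up for illustration, not anything from the article or a real MT system:

```python
from collections import Counter, defaultdict

# Toy sense-tagged sentences, invented purely for illustration.
TAGGED = [
    ("money", "deposit the check at the bank today"),
    ("money", "the bank raised its rate"),
    ("money", "she owes the bank money"),
    ("shore", "we fished from the river bank"),
    ("shore", "the bank of the river flooded"),
]

def train(tagged, target="bank"):
    """Count each sense and the words seen right before/after the target."""
    sense_counts = Counter()
    neighbor_counts = defaultdict(Counter)
    for sense, sentence in tagged:
        words = sentence.split()
        sense_counts[sense] += 1
        for i, word in enumerate(words):
            if word != target:
                continue
            for j in (i - 1, i + 1):
                if 0 <= j < len(words):
                    neighbor_counts[sense][words[j]] += 1
    return sense_counts, neighbor_counts

def sense_likelihoods(sentence, sense_counts, neighbor_counts, target="bank"):
    """Return {sense: probability} for the first occurrence of target."""
    words = sentence.split()
    i = words.index(target)
    neighbors = [words[j] for j in (i - 1, i + 1) if 0 <= j < len(words)]
    vocab = {w for c in neighbor_counts.values() for w in c}
    total = sum(sense_counts.values())
    scores = {}
    for sense, n in sense_counts.items():
        score = n / total  # prior: how common the sense is overall
        seen = sum(neighbor_counts[sense].values())
        for w in neighbors:
            # add-one smoothing so an unseen neighbor never zeroes a sense
            score *= (neighbor_counts[sense][w] + 1) / (seen + len(vocab))
        scores[sense] = score
    norm = sum(scores.values())
    return {sense: s / norm for sense, s in scores.items()}
```

      With no neighbors at all the result is just the prior (the monetary sense is the more common one in the toy data); put "river" right before "bank" and the shore sense takes the lead. Either way you only ever get a fuzzy ranking, never a definition.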

    4. Re:the real problems lie in understanding... by ornil · · Score: 2, Informative

      This is one of the oldest basically solved problems in natural language processing: word-sense disambiguation. Simply look at the words around it: if you see "river", or "park", or "memory", or "money" - you know which one to pick. That works amazingly well, and you can learn which words correspond to each sense, by starting with only a few examples belonging to each sense and then bootstrapping.

      You start with a few words that occur with each sense, you now can disambiguate a few example occurrences in the text. Each of these occurrences has words around it - add them to your list of sense indicators. Then do the whole thing again and again.
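
      The loop described above could be sketched roughly like this in Python. The seed words, the stopword list, the example sentences and the tie-handling rule are all assumptions for the sake of a small runnable example; a real system would use far more text and a more careful scoring rule:

```python
STOPWORDS = {"the", "a", "of", "at", "to", "in", "is", "was", "my", "i"}

def bootstrap(sentences, seeds, target="bank", max_rounds=10):
    """Label sentences by sense, growing the indicator lists each round."""
    indicators = {sense: set(words) for sense, words in seeds.items()}
    labels = {}
    for _ in range(max_rounds):
        changed = False
        for idx, sentence in enumerate(sentences):
            if idx in labels:
                continue
            words = set(sentence.lower().split()) - STOPWORDS - {target}
            scores = {s: len(words & ind) for s, ind in indicators.items()}
            best = max(scores, key=scores.get)
            top = scores[best]
            if top == 0 or list(scores.values()).count(top) > 1:
                continue  # no evidence yet, or a tie between senses
            labels[idx] = best
            # words already claimed by another sense are not reused
            taken = set().union(
                *(ind for s, ind in indicators.items() if s != best)
            )
            indicators[best] |= words - taken
            changed = True
        if not changed:
            break
    return labels, indicators

SENTENCES = [
    "my bank account is empty",
    "i put money in the bank account",
    "a muddy bank is hard to climb",
    "the river bank was muddy",
]
SEEDS = {"money": {"money"}, "shore": {"river"}}

labels, indicators = bootstrap(SENTENCES, SEEDS)
```

      In the first pass only the two sentences containing a seed word get labeled; "account" and "muddy" are then picked up as indicators, which is what lets the second pass label the other two.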

    5. Re:the real problems lie in understanding... by t · · Score: 2, Insightful

      To expound on the AC and Koos Baster's comments, try asking people to define ordinary words. You'll find quite often that the more basic the word, the more difficult it becomes. The definition of all words is circular since the definition of any word is given by other words, e.g., recursive: see recursive. Somewhere there needs to be a list of words with pictures, or math, or other way of defining each word without using any previously undefined words.

    6. Re:the real problems lie in understanding... by Knowledge+Hacker · · Score: 3, Informative

      I spent a decade working in the field of knowledge-based machine translation (KBMT), in the Center for Machine Translation (now part of the Language Technologies Institute) at Carnegie Mellon. Prior to that, I worked on several natural language processing projects that were focused on knowledge-based automatic analysis of English text.

      KBMT can be done. We demonstrated that pretty definitively. It's labor-intensive. Yes, we DID create concept maps (ontologies) for the domains of human endeavor relating to the texts to be translated, and yes, we DID link words (lexical units) to those concepts, in multiple languages. And it turned out that we didn't have to make the ontologies very deep--we had to make them broad, and start by assuming the need for a one-to-one mapping of concepts to nouns, verbs, and modifiers in the domain. This we arrived at after several attempts at making them deep. Turned out the lexicon is the real key. You only have to use enough ontological structure to support the way you deal with analyzing function words (relative pronouns, conjunctions, prepositions, certain adverbs, etc.) and to capture deep generalizations for certain classes of verbs and verb-derived nouns (deverbal nouns.) The system uses a fast (real-time) English analysis parser based on the Tomita algorithm, and a rule-based target-language generator based on the KANT system.

      We created a custom-built MT system for Caterpillar, to perform automated translation of their operations and maintenance manuals from the English of Peoria, Illinois into French, Spanish and German. It took us six years (not counting all the projects that preceded it, from which we learned a great deal.) The system employs a controlled subset of English that forces Caterpillar's technical writers to favor certain constructions in their writing, and to disambiguate certain other constructions using a writer's workbench interface. (Caterpillar has a patent on this application of MT technology for technical documents.) It contains all the vocabulary that Caterpillar needs--hundreds of thousands of terms. Caterpillar updates the lexicons as needed.

      This system has been in production use at Caterpillar since 1996. It translates controlled English text at accuracies in the high 90-percents. The tech writers adapted, the translators got turned into post-processors (and I believe there was some turnover of personnel--the work had to have gotten a lot more boring), the English reads a little bit stilted but it's perfectly clear. Response from Caterpillar's customers was positive, the manuals get translated faster, and are accurate. The controlled English can actually force a little higher accuracy. Caterpillar's investment in this technology ended up saving them a bundle.

      Due to the proprietary context of our work for Caterpillar, there were very few academic publications that came out of the project.

      If you want to engage in further reading, search on KBMT-89 (an MT project funded by IBM-Japan that laid much of the foundation for workable real-time KBMT.) We published a book on it.

      You can read about the KANT technology at

      http://www.lti.cs.cmu.edu/Research/Kant/

      There are also a number of pointers to other knowledge-based projects on the lti.cs.cmu.edu site.

      For looking at the progress of KBMT in the U.S. generally (over the past couple of decades), search for publications by Jaime Carbonell, Sergei Nirenburg, Eric Nyberg, Masaru Tomita, Teruko Mitamura, Robert Frederking, Lori Levin, Kathy Baker, Ralf Brown, and a cast of dozens. Warning--this will bring you vast amounts of material.

  21. We used to do this for fun at my last job by Anonymous Coward · · Score: 0

    Go to Babelfish, type in something, and translate it from English > German > French > English. If you're creative you'll get some of the funniest translations ever. If you use slang words, it generally loses all context in the translation.

    This is why watching foreign films, listening to the spoken French while reading the English subtitles, leaves so much out. A simple Tu versus Vous is not directly translatable to English because we don't have formal/familiar built in. Saying Tu to an old lady you don't know on a bus in France will get you bitched out.

    1. Re:We used to do this for fun at my last job by Anonymous Coward · · Score: 1, Funny

      Your post translated as described:

      Babelfish [ altavista.com ] with something type and translate of English > to German > French > English. If you are creative, you receive indeed some-of the merriest translations. If you can use words of jargon, all in general loses general context in the translation. Consequently, to pay free attention spoken foreign about films and hearing French and to read English subtitles so much outside. Simple to make against is not to You directly with English translateable, because we do not have formellement/familiar geeinbaut. Somebody that to make with an old lady on a bus says that, you do not know in France, receives you outside gemeckert.

  22. I'll believe it when I see it by domovoi · · Score: 5, Interesting

    There are a number of problems with the model here that point very clearly to the fact that it has the same shortcomings as other machine translation models.

    For example, so long as we're working with cognates or 1:1 equivalencies (tall, man, etc.) it's fine. If we go to words for which there is no 1:1 lexical item, what's it do then? Consider especially words that signify complex concepts that are culture-bound. There would be, by definition, no reason for language #2 to have such a concept, if the culture isn't similar. The other problem arises from statistical sampling. Lexical items that are used exceedingly rarely and have no 1:1 or cognate would be unlikely to make the reference database.

    Another similar problem arises with novel coinages and idioms. The example of "The spirit is willing..." is rightly cited. Consider the Russian saying, "He nyxa, He nepa," which translates as "Neither down nor feathers" but doesn't mean anything of the sort.

    Real machine translation has been the golden fleece of computational linguistics for a long time. I'll believe it when I see it.

    1. Re:I'll believe it when I see it by YU+Nicks+NE+Way · · Score: 5, Interesting

      When I read this, I'm reminded of the SPHINX project at CMU in the mid 80's. Kai-Fu Lee was a doctoral student at CMU in computer science. His advisor set him to evaluating the performance of the (clearly inferior) statistical SR systems that IBM was touting. It was a throw-away project; his advisor just wanted some numbers to compare his rule-based system against. The linguists had clearly shown that the irregularities of human speech required deep knowledge of the phonology, syntax, and semantics of the language being spoken, but the project leader needed a benchmark to measure against.

      Lee's toy project, SPHINX, won the DARPA competition that year. The highest scoring rule-based system came in fifth. What the linguists "knew" was wrong.

      The example you gave is another example of the linguists not knowing as much about statistics as they think. The corpora used for statistical translation include examples of idiomatic usages. Idiomatic usage is highly stereotypical, so the Viterbi path through an N-gram analysis captures such highly linked phrases with high accuracy.

    2. Re:I'll believe it when I see it by penultimatepost · · Score: 1

      That's why I believe that for this to work, only already translated samples are used. As for single to many and many to many word translations, as long as the sets of words don't change, I don't see a problem doing the match. aster all a space is a character.

    3. Re:I'll believe it when I see it by t · · Score: 1
      I assume you meant "after all a space is a character.", except in languages like Japanese, which don't use spaces. It's quite the bitch: you have to find things like the longest common subsequence, and even then you can't distinguish words from phrases. Although in this case it may give better results.

      Personally I would love to have a system for translating science papers; there are a lot of papers on wavelets written in French. Extremely well defined corpus, very few abstract phrases, no poetryesque language. There are even many papers that have already been translated (quite well) to English by their own authors.

    4. Re:I'll believe it when I see it by penultimatepost · · Score: 1

      That is what I meant, thanks for the correction (the beauty of not previewing).

  23. Important note when posting dupes by Anonymous Coward · · Score: 0

    Make sure the dupe points to a major advertiser's website.

  24. Script vs. language by pablo.cl · · Score: 1
    When I saw that they were covering language sets that were extremely dissimilar to English, my interest in multi-language applications was piqued again.
    You are confusing a language with its script. A translation from Serbian to Croatian or from Urdu to Hindi should be straightforward, since they are actually two languages and not four. Translation is about languages, not character sets.
  25. Of course, in British English... by gidds · · Score: 3, Interesting
    ...there's no ambiguity. Becoming angry is getting pissed off. I urinated is I pissed (no 'got'). So, here, your sentence could only refer to inebriation. (Though why that should be a prerequisite for installing such a cool system, I've no idea.)

    I always said you Yanks couldn't even use your own language properly... [fx: ducks]

    --

    Ceterum censeo subscriptionem esse delendam.

    1. Re:Of course, in British English... by Raffaello · · Score: 1

      One shouldn't assume that the slang used in one's locale is universal for English.

      To Wit:

      In the UK, "get pissed," means "become inebriated."
      In the USA "get pissed," does not mean "become inebriated." In fact, only people familiar with UK culture and slang know that it does mean that on the other side of the Pond.
      In the USA, "get pissed," is a commonly used shorthand for "get pissed off," as in, "I really got pissed when they told me I had to work late."

      So, yes, the original model sentence is ambiguous, but only to people who use "pissed" as a shorthand for "pissed off" who also know some UK slang.

    2. Re:Of course, in British English... by shish · · Score: 2, Insightful

      > I urinated is 'I pissed'

      Not "I urinated", but "I got urinated" - how could it tell?

      Also I sometimes say "I'm pissed" (no 'off') when I'm angry, and I'm British. Although as I just pointed out, that could mean "I'm urinated" :P

      --
      I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
    3. Re: Of course, in British English... by gidds · · Score: 1
      Not "I urinated", but "I got urinated" - how could it tell?

      To be precise (goodness knows why...), just as you'd never say "I got urinated" without a qualifier, such as "I got urinated on", the same applies to 'pissed' too. You could get pissed on, which would refer unambiguously to urination (literally or metaphorically), but if you just "got pissed", with no qualifier, it would almost certainly refer to inebriation. (Unless you were resorting to US slang -- but IME that usage is still very rare here.)

      --

      Ceterum censeo subscriptionem esse delendam.

    4. Re: Of course, in British English... by shish · · Score: 1

      You're still being too sensible - I mean it LITERALLY:

      1) get drunk (LITERALLY)
      2) go through the digestive system
      3) get pissed (LITERALLY)

      Like I say - very few people would mean it that way, but seeing as the most common use of "drunk" is the past tense of drink (ie, to drink a liquid), the computer would learn that meaning and take it literally, even when applied to a person:

      a) The lemonade got drunk.
      b) My friend got drunk.

      Grammatically speaking, what's the difference?

      --
      I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
    5. Re: Of course, in British English... by gidds · · Score: 1
      OIC...

      You're still being too sensible

      Story of my life. :)

      Oh, and ObQuote:

      "It's unpleasantly like being drunk."
      "What's so unpleasant about being drunk?"
      "You ask a glass of water."
      --

      Ceterum censeo subscriptionem esse delendam.

    6. Re: Of course, in British English... by PhilHibbs · · Score: 2, Interesting
      a) The lemonade got drunk.
      b) My friend got drunk.
      Grammatically speaking, what's the difference?
      Grammatically, there is none. However, a statistical translation system could cope with this. If it had two matched texts:

      "The liquid was pissed some time later" translated into Language X as "The liquid was urinated some time later"

      "John was pissed some time later" translated to Language X as "John was inebriated some time later"

      It would assimilate this into its linguistic map as something like:

      pissed = inebriated
      liquid pissed = liquid urinated
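
      One very simplified way to picture that map is a default translation plus context-triggered exceptions. The Python below is only a sketch under stated assumptions: the third example pair is invented so that the default has a majority, the names (learn, translate_word) are hypothetical, and a real statistical system would weigh probabilities rather than apply hard rules:

```python
from collections import Counter

# (source sentence, translation chosen for "pissed").
# The first two pairs follow the matched texts above; the third is made up.
EXAMPLES = [
    ("the liquid was pissed some time later", "urinated"),
    ("john was pissed some time later", "inebriated"),
    ("the driver was pissed again", "inebriated"),
]

def learn(examples, target="pissed"):
    """Pick the majority translation and find words that signal an exception."""
    counts = Counter(translation for _, translation in examples)
    default = counts.most_common(1)[0][0]
    # every word ever seen next to the default translation
    default_words = set()
    for sentence, translation in examples:
        if translation == default:
            default_words |= set(sentence.split())
    exceptions = {}
    for sentence, translation in examples:
        if translation != default:
            # words that only show up with the minority translation
            for word in set(sentence.split()) - default_words - {target}:
                exceptions[word] = translation
    return default, exceptions

def translate_word(sentence, default, exceptions):
    """Use an exception if one of its trigger words is present, else the default."""
    for word in sentence.split():
        if word in exceptions:
            return exceptions[word]
    return default

default, exceptions = learn(EXAMPLES)
```

      On this toy data the default comes out as "inebriated" and the only trigger learned is "liquid", which mirrors the two-line map above.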
  26. Grammatical Differences by dhodell · · Score: 5, Interesting

    I'm sure that everybody's familiar with the output and quality of the various translators available online. I myself have been very interested in creating such a utility, specifically one based on statistical language analysis. In my time in Holland, I've enjoyed learning the Dutch language, and have found online utilities to be of little help when translating documents (though I do not require this much anymore, it would have been helpful in the beginning).

    Although these methods work better than literal word-for-word translation, they're still not going to be perfect without some sort of human intervention. Dutch, for instance, has a completely different sentence structure than does English. For instance, the sentence "The cow is going to jump over the moon." becomes "De koe gaat over de maan springen" or, literally, "The cow goes over the moon to jump".

    Don't laugh at this structure or perhaps any unobvious usefulness. I've had discussions with people regarding the grammatical structure of a language and the society around it. Indeed, a specific example I have comes from a TV show "Kop Spijkers", which is a show focused mainly on poking fun at political activity and news events. At times, they have people dressed as popular media and political figures and have comical debates.

    In one show, a person acting as Peter R. de Vries (roughly the Dutch equivalent of William Shatner on America's Most Wanted) stated the following joke (JS stands for Jack Spijkerman, the host of the program):
    PRdV: ...Maar ja, ik ben de niet roker van het jaar. JS: Hoezo? PRdV: Nou, ik rook 2 pakjes per dag... niet.

    Translated into English, we would not find the humor in this transaction:
    PRdV: ...Anyway, I'm the non smoker of the year. JS: How do you figure that? PRdV: Well, I ... don't ... smoke 2 packs per day.

    Sure you can crack a smile about it, but it's much funnier when the punchline comes at a climax. And in English, it is not possible to state "Well, I smoke 2 packs per day... NOT" (without sounding like a retard who's watched too much Wayne's World).

    Getting back on topic, I believe there will be major issues with any translation algorithm to come. This is, of course, to be expected; I hope, however, that more advances will soon follow.

    --
    Kind regards, Devon H. O'Dell
    1. Re:Grammatical Differences by wimbor · · Score: 1
      Although I agree with some statements of the writer and I'm flattered he uses my native Dutch to prove his point, I don't think he chose a correct example.

      The sentence "ik rook 2 pakjes per dag... niet" is the literal translation of the Wayne's World "I smoke 2 packs per day..NOT" that he is referring to. The Dutch construction of this sentence is as wrong as the Wayne's World English one. It should indeed be "Ik rook *geen* 2 pakjes per dag" or "I do *not* smoke 2 packs per day".

      You can, however, put the punchline at the end of the sentence in Dutch if you say "Ik rook *die* 2 pakjes per dag niet.", or meaning a bit astonished and dismissive "I do *not* smoke *those* 2 packs per day."

      By the way, if the previous writer knew that difference still by heart, he must have had a very good knowledge of Dutch... so this is not intended as a negative comment :-) Goed gedaan!

    2. Re:Grammatical Differences by Anonymous Coward · · Score: 0

      It's funny that you would choose Dutch as an example of a language with a very different grammatical structure. It is probably one of the most similar to English you'll find.

    3. Re:Grammatical Differences by dhodell · · Score: 1

      Thanks for the compliment. I do believe that the actor portraying Peter R. de Vries used it in the "correct" manner as you've described. As an American who's been in Holland for a little over 1.5 years, I must say that my Dutch grammar isn't completely perfect yet :). Of course, this ehhhh (English fails me at the moment) uitzending of the program was probably 8 or 9 months ago (last season), so it isn't exactly fresh.
      Bedankt :)

      --
      Kind regards, Devon H. O'Dell
  27. I'll be convinced... by Rocky · · Score: 2, Insightful

    ...when it's able to translate stuff like:

    "Shaka, when the walls fell!"

    --
    "I'm an old-fashioned type of guy. I worship the Sun and Moon as gods. And fear them."
    1. Re:I'll be convinced... by Winterblink · · Score: 1

      Hahhahaha, I loved that episode. Check this out, through a quick bit of Googling I managed to find The Darmok Dictionary.

      --
      "I'm a leaf on the wind. Watch how I soar."
      -Hoban Washburn
  28. Re:I'm waiting... by Anonymous Coward · · Score: 0

    It might save a nasty kick in the nads too.

  29. Do put me out of work. Please! by Frantactical+Fruke · · Score: 5, Interesting

    On the other hand, having just finished translating a letter from Finnish to German, I fear that in light of the fact that, unlike most other cultures, Germans consider unspeakably long, intertwined sentences with multiple asides quoting their dead grandmothers who used to go on and on like this all day and the mandatory Goethe or Immanuel Kant quote concerning the importance of staying on topic, of which this run-on piece of drivel gives you but a faint impression, rather stylish and intelligent, we might have to wait a while yet.

    Would a program know how to break up a monster like that?

    Or, seriously, I ended up rewriting most of the letter to convey its contents in a tone that hopefully won't insult the recipient because of differing cultural expectations.

    Finns often consider politeness a waste of time. Now explain that to a statistical translator program: "Leave out/add in some polite blablablah"?

  30. Re: Good quote by tgv · · Score: 1

    A famous quote from one of the project leaders, Fred Jelinek if I'm not mistaken, was that for every linguist he fired from the team, the performance of the system improved by 10%...

  31. AI has been solved -- use AI robots instead by Mentifex · · Score: 0, Troll

    Now that artificial intelligence (AI) has been solved, machine translation (MT) may advance to a higher plane of equality with human translators who spend years learning the nuances and subtleties of their target human languages.

    Computer science has found the Holy Grail of AI in the Concept-Fiber Theory of Mind that led directly to the free AI source code of the Mind-1.1 Tutorial AI described in the AI For You textbook of artificial intelligence and robotics.

    The Association for Computing Machinery has published an article on the robot Mind.Forth AI, and a well-known AI expert has favorably reviewed the Fiber-Concept Theory of Mind.

    Traditional Artificial Intelligence Textbooks are suddenly obsolete, outmoded, or desperately in need of thorough revision and updating to teach Automatic Machine Translation now that AI has been solved.

  32. wow by Anonymous Coward · · Score: 2, Informative
    'N-grams'- e.g. if 'hombre alto' means tall man, and 'hombre grande' means big man, then hombre=man, alto=tall, and grande=big."
    Wow. You could not provide a more wrong description of what's going on here. I don't know where to start. The statistical methods are explicitly free of meaning. There's no symbol-grounding going on here. Thus the statistical method does not say that hombre = man and alto = tall. All it says is that often when "hombre" showed up in text A, "man" showed up in text B, regardless of whatever the symbols "hombre" or "man" mean. Further, an Ngram is a fixed-length string of symbols that's two or more in length: a bigram is two symbols, and a trigram is three symbols, etc. If a symbol were taken to mean a word, then "hombre" is not an Ngram. "hombre grande" is an Ngram. Anyway, if the statistics are based on Ngrams, then they're computing relationships between Ngrams, NOT between the pieces inside of them ("hombre", "grande", etc.).
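
    To make the counting concrete, here is a minimal Python sketch of that kind of co-occurrence tally. It is an illustration only, not the method from the article: the two-sentence corpus and the helper names `ngrams` and `cooccurrence` are made up, and it counts with n=1 purely to reproduce the "hombre"/"man" case (pass n=2 to count bigrams against bigrams).

```python
from collections import Counter
from itertools import product

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cooccurrence(pairs, n=1):
    """Count how often a source n-gram and a target n-gram share an aligned pair."""
    counts = Counter()
    for src, tgt in pairs:
        src_grams = set(ngrams(src.split(), n))
        tgt_grams = set(ngrams(tgt.split(), n))
        # Every source n-gram is paired with every target n-gram in the same
        # sentence pair; no meaning is attached to either symbol.
        counts.update(product(src_grams, tgt_grams))
    return counts

# Toy aligned corpus, invented for illustration.
pairs = [("hombre alto", "tall man"), ("hombre grande", "big man")]
counts = cooccurrence(pairs)
```

    With this toy corpus, `counts[(("hombre",), ("man",))]` is 2 while `counts[(("alto",), ("big",))]` is 0. That is all the statistics say: the symbols showed up together, nothing about what they mean.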
  33. We won't. by godot42a · · Score: 2, Insightful
    There's no chance (or risk) statistical translation can put human translators out of business for quite a long time to come. The main point is that because these programs completely lack word knowledge, they must try to "understand" the sentences on a purely structural level. This works for
    • restricted domains (subject matters)
    • restricted range of grammatical constructions
    • restricted genre (style)
    • restricted range of cultural presuppositions
    In other words, it works best for technical manuals ;).
  34. I do this stuff for a living... by elbanevretep · · Score: 2, Insightful

    One of the keys to making a statistical model work is to make wise choices about what statistics to collect, and what dependencies to include. For example, N-grams work by predicting the probability of a certain word appearing given the previous word or so; this kind of works but misses a lot because the structure of a sentence is more like a tree than a series. More complex models can capture more relevant information. On the other hand, if the model is too complex, it won't work for two reasons: because it requires too much memory/cpu, and because you can't get a reliable estimate of the probabilities without multiple examples of each situation (this problem is called data sparsity).
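
    As a rough illustration of both points, here is a small Python sketch of a bigram model estimated from raw counts. It assumes a made-up three-sentence corpus and hypothetical helper names (`train_bigram`, `prob`); real systems add smoothing precisely because of the sparsity problem it exposes.

```python
from collections import Counter

def train_bigram(sentences):
    """Count each word as a context and each (previous word, word) pair."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()  # <s> marks the sentence start
        contexts.update(tokens[:-1])         # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def prob(prev, word, contexts, bigrams):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

# Toy corpus, invented for illustration.
corpus = ["the man is tall", "the man is big", "the box is big"]
contexts, bigrams = train_bigram(corpus)

p_seen = prob("the", "man", contexts, bigrams)      # seen twice out of three
p_unseen = prob("box", "tall", contexts, bigrams)   # never observed
```

    Here `p_seen` is 2/3, but `p_unseen` is exactly 0.0 even though "box tall" is not impossible, just absent from the data. With longer contexts or tree-shaped models there are many more such gaps for the same amount of text.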

  35. WHoa ho ho.... by Tom7 · · Score: 1

    Hey now... engrams? I thought those were under the exclusive purview of the scientologists...

  36. This was proved impossible about fifty years ago by Anonymous Coward · · Score: 2, Interesting

    This idea is like the behaviorist idea that a baby is a blank slate and just learns the language by association, like Pavlov's dog. Something similar has been tried with neural networks etc.

    However, this method does not work, as the silly examples elsewhere in the discussion show. You can only understand or translate if you "know" what is meant.

    There is no way of figuring it out. There isn't enough information supplied in the texts themselves. You have to be born with the inherent ability to understand stuff.

    You'll find a good discussion of this in Steven Pinker's "The Language Instinct", which I recommend.

  37. Limited value? by sjasja · · Score: 3, Interesting
    Automatic dictionary generation for MT seems of limited value to me. You can purchase dictionaries easily enough, or get trained monkeys^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H linguistics students cheaply enough to do the work.

    Raw dictionary work is pretty much the least interesting, most mechanical part of an MT system.

    Grammar (source parsing, transformation and target generation) takes a lot more work and careful thinking.

    The more accurate you want your MT system to be, the more extra information you want to attach to your dictionary entries (the more the system knows about all the words, the more disambiguation using real-world knowledge it can do.) "I have a ball" vs "I have an idea" translate to some languages quite differently; you need to know that you don't (usually) physically hold "an idea" in your hand. The most common words ("is", "have") are often the worst in this respect.

    (I have worked coding an MT system.)

    1. Re:Limited value? by Jadrano · · Score: 4, Informative

      Of course, you can buy dictionaries or get trained people to write them, but the amount of data needed for every lexical item would be so large that wide coverage would be very hard to achieve. For example, you have to note all collocations. Often, such preferences aren't clear-cut. For instance, 'essential' appears much more frequently in a predicative position (e.g. 'X is essential') than in an attributive one (e.g. 'the essential X'), while 'basic', which can have a very similar meaning in many contexts, appears much more often in an attributive position. Such information is necessary for good translation, but dictionaries usually don't provide it. Statistical analyses of lexical items reveal many things dictionaries don't tell you. Nowadays, a significant part of the work of trained people writing dictionaries is looking at corpora, and making this process automatic is a logical step.

      Strictly separating raw dictionary work and grammar seems rather old-fashioned to me. Of course, it can work to some degree, but there are so many different types of collocational preferences that just providing each lexeme with a 'grammatical category' from a relatively small list and basing the grammar on these grammatical categories is hardly enough.

      It is true that automatic systems' lack of world knowledge is a big problem, but the examples you provide aren't really a good demonstration of this fact. As you write, 'have' is translated differently into some languages depending on whether the object is abstract. So, given a translation system that recognizes the verb and its object and a bilingual parallel corpus, a statistical system can find out about that.

      I have heard of people who write dictionaries that can be used for automatic processing; for every lexeme they need between half an hour and an hour (consulting dictionaries and corpora, checking whether the application of rules gives correct sentences). This can only work if the aim of the MT system is either only a very limited domain (e.g. weather forecasts, for which there are working rule-based translation systems) or very low quality. It could never be affordable to have trained people provide all relevant characteristics for the millions of words that would be needed for a good MT system with wide coverage.

      Differentiating between concrete and abstract entities is something that seems quite natural to us, but there are many other relevant characteristics of lexical items that don't come to linguists' minds so easily; statistical analyses can be better at discovering them.

  38. N-grams? by illtud · · Score: 1

    N-grams? N-grams? DON'T CLICK ON THE LINK!

    It's a CoS trick to enslave us all!

  39. IN SOVIET RUSSIA by putaro · · Score: 0

    We piss you!

    1. Re:IN SOVIET RUSSIA by Oryx3 · · Score: 1

      There's no such thing as Soviet Russia.

    2. Re:IN SOVIET RUSSIA by Anonymous Coward · · Score: 0

      In Soviet Russia there's no such thing as YOU!

  40. Karma? by not_a_george · · Score: 1

    just hope the system doesn't involve karma...

    madonna (5 interesting)
    microsoft (-3 troll)

    aren't there also words (slang) that have no direct translation? what happens then?

    --
    Linux: Helping nerds look smarter since the late 90s.
  41. Shiver... by bjtuna · · Score: 1

    Did anyone else here take Dr. Eisner's "Natural Language Processing" course at Hopkins? I've definitely had my fill of n-grams for now, thanks :)

  42. You people are sick by Anonymous Coward · · Score: 0

    Go have your piss parties somewhere else, you perverts!

  43. A true story from the seventies by Anonymous Coward · · Score: 1, Funny

    Step 1 - "Out of sight, out of mind."
    Step 2 - Step 1 machine translated to Russian
    Step 3 - Step 2 machine translated back to English

    Result:
    "Invisible idiot"

    Mr. Spock would say: Logical!

  44. unfortunately doomed by aziraphale · · Score: 4, Interesting

    Like most computerised translation efforts, this ignores the fact that translation always requires context. The sentence 'fruit flies like an orange' is a classic example in the English language of a sentence which can be interpreted in two different ways - sentences can easily be constructed which have completely different meanings in different contexts.

    'As a punishment, he was given a longer sentence'. Obviously, we're talking prison, right? Well, what if the preceding sentence was:
    'The teacher had grown weary of his poor attempts at translation'?

    A statistical system, even working with the entire phrase, won't be able to figure out which meaning of the word 'sentence' is intended there.

    how about:
    'The box was heavy. We had to put it down'
    'The dog was ill. We had to put it down'

    You need semantic understanding to be able to perform translation.

    1. Re:unfortunately doomed by plasticmillion · · Score: 4, Insightful
      This is definitely true. At the same time, the results of statistical natural language processing are surprisingly good. Really this should not be so surprising, since they function in a way similar to the human brain. A neural network like the brain is designed to deduce a complex function from training data. I believe strongly that the best way to get intelligent(-seeming) behavior out of machines is to mirror this process.

      Artificial neural nets are one way to do this, but statistical methods are more or less analogous and have the advantage of being highly optimizable. Personally I don't understand the details, but Very Smart Mathematicians have found ways to optimize models like Singular Value Decompositions (SVDs) so that they can be calculated orders of magnitude faster than models that cannot be represented as formally using mathematics.

      The bottom line is that statistical methods are probably the way that we will end up producing brain-like behavior on computers, and the fact that there are promising results already is heartening. Yes, for truly intelligent behavior a lot of domain knowledge will also be needed, as you point out. But I don't see any reason why the extraction and mapping of this knowledge couldn't also be achieved with large training corpora and statistical methods, rather than hand-crafting.

    2. Re:unfortunately doomed by capologist · · Score: 4, Insightful

      It may be possible for this approach to address that issue somewhat. Statistics can be collected not only on associations of words with other words, but also on associations of groups of words or phrases with others. So if the translator has learned from documents in which the phrase "put it down" appears near the word "ill" and the word "dog," and from other documents in which the phrase is associated with the word "heavy," it can make a good guess.

      Clearly, it would need to learn from a tremendous amount of input data before it could begin to approach the experience of a human, and hence make guesses of similar quality to a human translator. However, the amount of available source material is increasing so rapidly that it may be possible for a translator to get pretty darn smart this way.
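
      The kind of guess described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the training sentences, the sense labels "euthanize" and "set_down", and the helpers `train` and `guess` are all invented here, and scoring is a plain sum of context-word counts rather than anything a real system would use.

```python
from collections import Counter, defaultdict

def train(examples):
    """Map each sense label to counts of the words seen in its context sentences."""
    model = defaultdict(Counter)
    for text, sense in examples:
        model[sense].update(text.lower().split())
    return model

def guess(model, text):
    """Pick the sense whose training contexts share the most words with text."""
    words = text.lower().split()
    scores = {sense: sum(counts[w] for w in words) for sense, counts in model.items()}
    return max(scores, key=scores.get)

# Toy contexts for the phrase "put it down", invented for illustration.
examples = [
    ("the dog was ill", "euthanize"),
    ("the old cat was ill", "euthanize"),
    ("the box was heavy", "set_down"),
    ("the crate was heavy", "set_down"),
]
model = train(examples)

sense_for_dog = guess(model, "our dog was very ill")
sense_for_case = guess(model, "the suitcase was heavy")
```

      Even with four examples, `sense_for_dog` comes out as "euthanize" and `sense_for_case` as "set_down", although "suitcase" was never seen; the nearby word "heavy" carries the decision.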

    3. Re:unfortunately doomed by tealwarrior · · Score: 1

      The point you make doesn't address whether you should use a statistical or symbolic approach. Symbolic approaches can ignore context just as easily as statistical ones. The benefit of statistical approaches is that they typically make it much easier to determine the combined influence of a number of different factors. The bane of symbolic approaches is the hours that can be spent tweaking one rule only to find that the new tweak interacts badly with some other rule. The addition and integration of context and other more meaning-based types of information will likely be much simpler in a statistical framework. Case in point: some systems in the evaluation mentioned gave more weight to statistics gathered from similar documents to the one being translated when translating names. This has the effect of providing a sort of document-level topic bias. This type of approach could potentially allow you to prefer sentence as a grammatical unit in a school or linguistics context and sentence as punishment in a legal or penal context.

      That said, it is always possible to construct pathological or even reasonable cases that will thwart a particular approach. But most people would be happy with something that did a good job most of the time. Specifically DARPA, the evaluation's sponsor, would like a system that gave them the ability to determine which documents are worth having a human being translate. A recent article in Time says the number of Arabic speakers in the FBI has tripled since 9/11 to 208. That's still not that many people given the amount of data they monitor. Mediocre translation techniques, available today and easy to adapt to new languages, are probably the best bet for leveraging the vastly larger number of English-speaking government employees.

      --
      In theory, there is no difference between theory and practice, in practice there is.
  45. Voynich Manuscript? by aeschenkarnos · · Score: 1

    I wonder if this could work on the Voynich Manuscript? (www.voynich.nu)

    1. Re:Voynich Manuscript? by oldwarrior · · Score: 0

      only works with parallel text for translation...

      --
      If it were done when 'tis done, then t'were well it were done quickly... MacBeth
  46. William Orbit by mr.henry · · Score: 1

    N-Gram is/was also the name of William Orbit's label.

  47. Arabic Grammar Nazi by nat5an · · Score: 5, Informative

    From the Article: Compare two simple phrases in Arabic: "rajl kabir" and "rajl tawil." If a computer knows that the first phrase means "big man," and the second means "tall man," the machine can compare the two and deduce that rajl means "man," while kabir and tawil mean "big" and "tall," respectively.

    Not to be overly anal (hopefully to raise an important point), "rajl kabir" actually means "old man" not "big man." The Arabs will definitely laugh at you if you mix these up. You'd use the word "tawil" for a tall or generally large man. The word "sameen" refers to a fat or husky guy. In a different context (referring to an inanimate object), "kabir" does in fact mean big.

    I wonder how good these statistical systems really are at learning the various grammatical nuances of a language like Arabic. For example, in Arabic, non-human plurals behave like feminine singulars, whereas human plurals behave like plurals.

    It's really incredibly cool that these machines can learn language mechanics and definitions on their own. But as previous posters have already noted, the machine still has to know the meanings of words in order to do a good translation.

    For example, to translate "big box" and "big man" into Arabic, you'd actually use different words for big, since the box is inanimate, but the man is animate.

    --
    Head down, go to sleep to the rhythm of the war drums...
    1. Re:Arabic Grammar Nazi by capologist · · Score: 3, Informative

      But as previous posters have already noted, the machine still has to know the meanings of words in order to do a good translation.

      For example, to translate "big box" and "big man" into Arabic, you'd actually use different words for big, since the box is inanimate, but the man is animate.

      I think that one of the major points of the statistical technique is to deal with precisely this sort of thing.

      It doesn't have to know the "meaning" of words like "box" or "man," it just has to have seen them in a particular context before. If it has, then it knows that "big" is usually translated one way when it appears with "man," and another way when it appears with "box." So it just follows those observed patterns, without knowing anything about "meaning."
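
      A minimal Python sketch of "following observed patterns", under loud assumptions: the observations are invented, `BIG_ANIMATE` and `BIG_INANIMATE` are placeholders rather than real Arabic words, and the helpers `train` and `translate_adj` are hypothetical. For a noun it has never seen with the adjective, the sketch simply backs off to the adjective's most common translation overall.

```python
from collections import Counter, defaultdict

def train(observations):
    """observations: (adjective, noun, observed translation of the adjective)."""
    by_pair = defaultdict(Counter)   # counts conditioned on the noun
    overall = defaultdict(Counter)   # counts for the adjective alone
    for adj, noun, translation in observations:
        by_pair[(adj, noun)][translation] += 1
        overall[adj][translation] += 1
    return by_pair, overall

def translate_adj(adj, noun, by_pair, overall):
    """Most frequent translation for this pair, else for the adjective alone."""
    table = by_pair.get((adj, noun)) or overall.get(adj)
    return table.most_common(1)[0][0] if table else None

# Invented observations; the translation labels are placeholders.
observations = (
    [("big", "man", "BIG_ANIMATE")] * 3
    + [("big", "box", "BIG_INANIMATE")] * 2
)
by_pair, overall = train(observations)

with_man = translate_adj("big", "man", by_pair, overall)
with_box = translate_adj("big", "box", by_pair, overall)
with_dog = translate_adj("big", "dog", by_pair, overall)  # unseen pair, backs off
```

      `with_man` and `with_box` pick different placeholders purely from counts. `with_dog` falls back to whichever form was seen more often overall, which shows why something smarter is needed for unseen combinations.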

      In principle (the article hints at this), it might even be possible in the future to make good guesses about combinations that it hasn't seen before, by inferring rules about how things are put together. For example, it encounters the phrase "strong box," which it hasn't seen before. But it has seen many other words that frequently are associated with inanimate-strong and that are also frequently associated with inanimate-big, and many words that frequently are associated with animate-strong and also frequently associated with animate-big, but few words that are frequently associated with inanimate-big and animate-strong. So it infers that inanimate-strong is somehow "parallel" to inanimate-big. Since it also finds that "box" is more typically associated with inanimate-big than with animate-big, it infers that the version of "strong" that goes with "box" is the one that is parallel to inanimate-big. Therefore, it selects inanimate-strong. Ta da!

    2. Re:Arabic Grammar Nazi by Anonymous Coward · · Score: 1, Insightful

      But what is meaning but an insanely large net of crossreferences? I'm thinking of an article from Linguistic Anthro about the 'arbitrariness of sign.' I wish I could remember the big name dude that said it, I think it was Saussure. Knowing for instance that 'that' is a 'dog' is really just knowing that 'that' barks, runs, smells bad when wet, is friendly, kills, defends, etc etc etc. All of those words are also arbitrary when it comes to significance (at some grand level), so the same holds. The statistical approach is right on target and I'm genuinely curious to see where it will go.

  48. Oh..... by Anonymous Coward · · Score: 0

    God no.

    That is so wrong it pains me.

  49. How about missing concepts? by Anonymous Coward · · Score: 0

    What about when one language is completely missing concepts that are present in another?

    Lots of languages contain masculine and feminine words, which changes the structure of the sentence they're in (for example "une tete le poo-poo" would be correct, but "un tete le poo-poo" would not - even though "une" and "un" are basically the same word..)

    Or how about Asian languages, in which there is the concept of honorific titles for older (and sometimes younger) siblings - in Tagalog, for example, "ate" (pronounced "ah-tey") means "older sister".. so in English, "ate Teresa" means "big-sister Teresa" - and it's very important that this prefix is used, otherwise it would be disrespectful.

    English has a similar function for uncle/aunt, but the concept for siblings doesn't exist..

    Or in Welsh, the concept of "having" something is absent.. you don't "have" something, it's "with you".. instead of saying "take this pen", you would say "go with this pen"... you wouldn't say "he is rich", you'd say "he is of the money"

    Or mutations - syllables of one word mutate into another depending on the context in which it's used..

    Combined with your examples of differing grammatical structure, I don't see this being better than human beings for a long time to come.

  50. This approach is limited by Oryx3 · · Score: 3, Insightful
    Yes, that's a big problem with statistical methods. The point is that we don't just use words with specific meanings like "man" or "tall", but we also use:
    • abstract words that take on different meanings in different contexts (i.e. they're polymorphic)
    • we use words metaphorically (the "pissed" example above). Metaphor requires the reader to make the connection on the fly between two concepts, hence it requires intelligence. ("On the fly" is a good example. A computer can be given a list of such metaphorical expressions, but recognizing new ones is a much harder problem.)
    • we use words incorrectly, or misspell them, or use imperfect grammar, but that's OK because our human reader is able to infer the meaning
    • humans think it's funny sometimes to use words in the wrong context, i.e. where the metaphorical meaning is really outlandish, or there is a conflict between the idea and the way it is expressed. I think we like this because it requires intelligence to work out the meaning in these cases.

    For example, the English word pattern can be translated in French by any of (please excuse the lack of accents, they were stripped when I submitted): modele, exemple, type, schema, dessin, motif, maquette, patron, plan, disposition, groupement, repartition, combinaison, diagramme, gabarit, echantillon, tendance, figure, circuit (and probably others as well) depending on the context -- and not just the lexical context, but the meaning.

    Previous attempts to automate translation focused on giving computers grammatical and semantic knowledge, in the hope that they could infer some meaning from this and so choose the right equivalents. Despite some success, this approach failed in general, putting machine translation (MT) firmly in the realm of AI. I believe this statistical approach is a step in the wrong direction (back to purely lexical means of analyzing texts with a view to translation). Further progress in MT will come from AI.

    This doesn't detract from the ways in which computers have been useful to translators -- in the area of computer-assisted translation (translation memory, localization, terminology databases, etc.)

    The other point is it's a lot harder to get a good-quality parallel corpus than you'd think (even in the Internet age -- most of the stuff on the Internet is crap anyway).

    It's not the idea of using computers in translation that I think is limited, just this approach.

    1. Re:This approach is limited by Oryx3 · · Score: 0, Offtopic

      By the way, does anybody know how to post non-ASCII text on Slashdot?

    2. Re:This approach is limited by Jadrano · · Score: 1

      I agree that machine translation is in the realm of AI. But so-called "New AI" is not purely symbol-based, as old AI methods used to be, it is either numeric or a combination of numeric and symbolic. There is no sharp border between statistical methods and new AI methods.

  51. Mentifex, great to see you! by Anonymous Coward · · Score: 0
    Nice to see that comp.ai's longtime gibberish crackpot has finally escaped the drudgery of torturing scientists who know much more than he does and moved on to easier kill... Slashdot! You'll love this place. It consists primarily of prepubescent computer geeks who wouldn't know real AI research from crackpot quack stuff if it bit 'em in the ass. You can troll not only for the usual annoyed "go away you moron" responses from actual AI scientists ... but oohs and aahs from 14 year olds who don't realize your work is the AI equivalent of the Flat Earth Society. It's awesome.

    I also find it humorous that someone actually modded you up! Isn't that hilarious? They must be in on the joke too!

    1. Re:Mentifex, great to see you! by Anonymous Coward · · Score: 0
      I also find it humorous that someone actually modded you up! Isn't that hilarious? They must be in on the joke too!
      Yeah, that was funny. Did you know that Mentifex (otherwise known as Arthur Murray) is the reason that comp.ai went moderated? Man, now that's a talent for trolling.
  52. Knowledge-based approaches have the same limit by Beryllium+Sphere(tm) · · Score: 1

    One of Beryllium Sphere's partners is a computational linguist specializing in hand-built representations of how one small domain of discourse uses words.

    Her last big project was automatic translation of (you guessed it) technical manuals.

    godot42a is spot on. The English originals of the technical manuals had to be written in a subset of English which restricted the range of grammatical expressions. Tech writers had to run a program to check their work for compliance.

    In summary, even if you build a translation program that has "word knowledge" hand-crafted by brilliant polymaths, you still have the limitations that godot42a points out.

  53. Re: No, British English is whacked by gidds · · Score: 2
    It makes more sense when you live here :)

    Actually, we do sometimes use 'fries', to distinguish them from 'chips' which are usually more than three millimetres thick and have actually been near a potato! We also use both 'cookie' and 'biscuit'; the former for larger, thicker things, often with chocolate drops, nuts, or whatever. What do you mean by 'biscuit'?

    And I've no idea what 'podger' is - I've never heard it, and neither dictionaries nor Google can come up with anything more relevant than its use as a surname. Is it an obscure regional or dialect term?

    On a more general point, ISTM that US English tends to like ambiguity more than British English, which is a slightly more precise tool that can distinguish between a rubber thing on a car wheel (tyre) and to become exhausted (tire); between road edging (kerb) and to prevent (curb); between verb and noun forms of practise/practice, license/licence, &c; between a measuring device (meter) and the unit of length (metre); between a movement of fluid (draught) and a rough outline (draft); between a series of instructions to a computer (program) and a list of events (programme); between a test (check) and a written instruction to a bank (cheque); &c &c. The 'pissed'/'pissed off' distinction is simply one more example of this.

    The other interesting point is that in the majority of cases where usage, spelling, punctuation &c differs, it's US English that is the older variant. Oddly enough, here we seem to be more open to change, especially positive change, to the language.

    --

    Ceterum censeo subscriptionem esse delendam.

  54. Two more classic machine mistranslations by Mawbid · · Score: 4, Interesting
    The one you mentioned is often accompanied by two more, so I'll continue the tradition. These smell like urban legend, but who cares? :-)

    An engineer was confused when a translated spec included water goats. "Water goats"?! Hydraulic rams, actually.

    And perhaps most famous of all, "out of sight, out of mind" supposedly came back as "blind idiot".

    Language is a curious thing. I can't help thinking there's some deeper meaning to the fact that misapplication of it can so easily be funny to us.

    --
    Fuck the system? Nah, you might catch something.
    1. Re:Two more classic machine mistranslations by DoctorRad · · Score: 1
      And perhaps most famous of all, "out of sight, out of mind" supposedly came back as "blind idiot".

      I always knew the translation as "invisible maniac", but that must have been late 70s / early 80s. I presume machine translation has progressed rather since then...

      Matt...

  55. A paper on this by metlin · · Score: 3, Informative

    A long time ago I wrote a paper on this: the application of the N-gram technique with statistical methods for use in CBR.

    You can find the paper here (PDF) and the presentation here. ;-)

  56. EGYPT translation toolkit is GPL'ed. by dwheeler · · Score: 2, Informative
    I was curious about this statistical translation toolkit, so I downloaded it from here: http://www.clsp.jhu.edu/ws99/projects/mt/toolkit/. I then peeked into the LICENSE file, and found that it's released under the GPL. No funny weird one-off licenses, or requiring only non-commercial use, or such. So, if you're interested in statistical translation, download this system and try it out.

    I can imagine some distributions of this translation system that take this code - with improvements - and precook large corpuses to create translators. Anyone want to write the Mozilla and OpenOffice plug-ins for the new menu item "Edit/Translate Language"?

    --
    - David A. Wheeler (see my Secure Programming HOWTO)
  57. Extralinguistic knowledge by Koos+Baster · · Score: 1

    I totally agree.

    Furthermore, in the end language is only a carrier of meaning and meaning ultimately refers to non-linguistic objects. Therefore, you can't understand language (fully) without understanding reality (at least partially).

    And while machine translation is a relatively hard job, there are examples that suggest that even automated insertion of hyphens ultimately needs extralinguistic knowledge!

  58. Isn't it funny how... by Anonymous Coward · · Score: 0

    Dead end research gets lots of press but the inventions that really change our lives are not reported until everybody knows about them?

  59. Re: No, British English is whacked by schtum · · Score: 1

    What do you mean by 'biscuit'?

    A biscuit is like a scone, only delicious.

  60. Re: No, British English is whacked by gidds · · Score: 1
    Ah. In that case, we call 'em... scones :)

    (Just don't ask how to pronounce them...)

    --

    Ceterum censeo subscriptionem esse delendam.

  61. this system is only as good as the.... by BelugaParty · · Score: 1


    translator who makes the source texts.

    1. Re:this system is only as good as the.... by cyberon22 · · Score: 1

      Dead right. Unless the input text really is aligned sentence for sentence... you just end up with a bunch of garbage.

      Makes me think the explosion of interest in this type of MT has less to do with its advantages over other systems than the limitations of current dictionaries. Who wants to spend years defining 300,000 words and phrases just on the off-chance it might be useful, right?

  62. What about Semantics and Idioms? by zpiderz · · Score: 1

    Ok, so it can parse things at the word level but what about the sentence level understanding (semantics) and odd exceptions such as idioms that could totally screw up the statistical learning process.

    After taking a linguistics class I realized language is very very complex and we are many years away from being able to create decent language translation systems (babelfish is not really decent at least in my eyes), speech synthesis, etc.

    1. Re:What about Semantics and Idioms? by Anonymous Coward · · Score: 0

      As several previous posts have mentioned, MT -- as well as speech recognition and synthesis -- works well above the word level. Speech recognition works at the "phrase" level (NP, DetP, AdvP, etc.) and the sentence level (CP/IP).

      Previous attempts at statistical analysis in speech technology concentrated on the word level because we were stuck with 1MHz machines with 512k memory ("more than enough..."). Now, we have 4+GHz and 2+GB, so we can work with pairs of words, triplets of words, whole phrases, and make more complex statistical analyses. Not only "how does this word relate to that word", but "how does this group of words relate to that group of words".

      Because we can now work with groups of words, we can take into account semantics, and possibly even pragmatics (meaning in context), depending on the power of the machine, the quality of the corpus and the strength of the statistical analyses.

      It's not perfect, but as another reader mentioned, now that Moore has caught up with the Ivory Tower research, we can have lots of fun.

      Just don't expect HAL any time soon....

  63. Language Applicability by Flwyd · · Score: 3, Interesting

    "If we can learn how to translate even Klingon into English, then most human languages are easy by comparison," [Dr. Knight] said.

    That's not really the case. Klingon was created through conscious effort and hasn't evolved many (any?) warts over time. Its structure is akin to well-understood human languages.

    Now take Turkish, which has concatenative grammar. Adjectives are applied by tacking suffixes on to the word, sometimes changing spelling of previous chunks. Thus, a 20-word English phrase may correspond to a single Turkish word and extremely long words may be reasonably assumed to be unique. Statistical techniques can work with Turkish, but it requires some work up front to extract tokens. \b\B+\b doesn't help much. German (and, I think, Greek) are like this to a lesser extent.
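
    A rough sketch of the "work up front" point, assuming the Turkish word "evlerinizden" ("from your houses": ev + ler + iniz + den). A word-boundary regex such as \b\w+\b returns the whole word as one token, while even a toy suffix stripper recovers reusable pieces. The suffix list and the function split_suffixes are hypothetical and hand-picked for this one word; real morphological analysis also has to handle vowel harmony and ambiguous splits.

```python
import re

# Hypothetical, hand-picked suffixes for illustration only.
TOY_SUFFIXES = ("ler", "lar", "iniz", "den", "dan")


def split_suffixes(word, suffixes=TOY_SUFFIXES, min_stem=2):
    """Greedily peel known suffixes off the end of a word (toy approach)."""
    pieces = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in suffixes:
            # Only strip if a stem of at least min_stem characters remains.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                pieces.insert(0, suffix)
                word = word[:-len(suffix)]
                stripped = True
                break
    return [word] + pieces


# The regex sees one long, probably unique token.
whole_words = re.findall(r"\b\w+\b", "evlerinizden")
print(whole_words)

# The suffix stripper yields pieces that recur across many words.
print(split_suffixes("evlerinizden"))
```

    The pieces are what the statistics need to count, since the stem and each suffix show up in many other words even when the full word appears only once.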

    Statistical approaches are often quite effective in language processing, much to the surprise and disheartening of linguists. They're far from perfect, but often the best thing so far.

    --
    Ceci n'est pas une signature.
  64. Don't worry about that... by wirelessbuzzers · · Score: 1

    ... we can always use Babelfish!

    err...

    --
    I hereby place the above post in the public domain.
  65. Not really like a scone... by wirelessbuzzers · · Score: 1

    Hm. I guess a biscuit is sort of like a scone. But generally a biscuit is salty and has more butter than a scone, and less sugar. People sometimes dip them in gravy. I don't know of any British equivalent, as it seems to be more or less unique to the American South. You might as well ask for an equivalent to grits.

    --
    I hereby place the above post in the public domain.
  66. hard sentence by harlows_monkeys · · Score: 1
    In Burbank, California, there is a street named Pass Avenue. It goes over the freeway, via an overpass. If you were to cross that, on a certain Jewish holiday, you would pass over Pass overpass over Passover.

    That will be a fun one to give a translation program. (Or a speech recognition program, for that matter).

  67. Hmmmm by Anonymous Coward · · Score: 0

    Kind of like the approach one takes to crack simple encryption schemes.

  68. google link by Anonymous Coward · · Score: 0
  69. at least you got the n-gram definition right by _randy_64 · · Score: 2, Interesting
    The article says n-grams are "Phrases like these, called "N-grams" (with N representing the number of terms in a given phrase)". I've always used n-grams as character counts, using a sliding window over the text. For example, the 5-grams of the phrase "for example" would be

    [for e][or ex][r exa][ exam][examp] and so on.

    Using n-grams this way helps with things like misspellings. Mr. Metlin (parent of this) used the character definition in his paper. N-grams are widely used in Information Retrieval research.
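
    A small sketch of the sliding-window version, for anyone who wants to try it. The function names char_ngrams and ngram_overlap are made up here, and the overlap score (shared n-grams divided by all distinct n-grams) is just one simple way to show why a misspelling still matches partially.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters, sliding one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_overlap(a, b, n):
    """Share of distinct character n-grams that two strings have in common."""
    grams_a = set(char_ngrams(a, n))
    grams_b = set(char_ngrams(b, n))
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


# The 5-grams of the phrase used above.
print(char_ngrams("for example", 5))

# A misspelled word still shares some 2-grams with the correct spelling.
print(ngram_overlap("example", "exmaple", 2))
```

    An exact word match would score the misspelling as a complete miss, whereas the character n-grams still overlap.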

    --
    I mod down all the "free iPod"-sig losers.
  70. Grammar, translation by daevt · · Score: 1

    Does this method deal with different grammatical structures? In inflected languages, like Latin and Russian, grammatical cases are signified by different endings. Also, is only one translation used, or do they use many competing translations? Different translations of the same phrases can yield drastically different meanings.

  71. Re: No, British English is whacked by Anonymous Coward · · Score: 0

    Often the change is pointless, negative and the result of a snobbish regard for classical languages and Latin.

    e.g., colour, favourite, defence.

    the Victorians had some weird ideas, they insisted Shakespeare wrote in iambic pentameter (a Greek meter impossible in English), they insisted you couldn't split an infinitive (just because you can't in ancient Greek), even told people off for ending a sentence with a preposition - why can't I if I want to!?

    The Americans rightly resisted all that nonsense.

  72. Damn, it ate my < and > by PhilHibbs · · Score: 1

    I meant to say:
    <person> pissed = <person> inebriated
    liquid pissed = liquid urinated