Romancing The Rosetta Stone
Roland Piquepaille writes "Not only does this news release from the University of Southern California have a fantastic title, it also has great content. This story is about one of their scientists, Franz Josef Och, whose software ranks very high among translation systems. "Give me enough parallel data, and you can have a translation system for any two languages in a matter of hours," said Dr. Och, paraphrasing Archimedes. His approach relies on two concepts: gathering huge amounts of data and applying statistical models to this data. It completely ignores grammar rules and dictionaries. "Och's method uses matched bilingual texts, the computer-encoded equivalents of the famous Rosetta Stone inscriptions. Or, rather, gigabytes and gigabytes of Rosetta Stones." Read my summary for more details."
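To give a feel for what "applying statistical models" to matched bilingual text can mean, here is a minimal sketch. It is not Dr. Och's system, which the summary does not describe in detail; it is a toy word-level aligner in the style of the textbook IBM Model 1, trained with a few rounds of expectation-maximization. The three-sentence English/German corpus and the names `train_word_table` and `best_translation` are made up for illustration.

```python
from collections import defaultdict

def train_word_table(pairs, iterations=20):
    """Estimate t(target_word | source_word) from sentence pairs via EM.

    Toy IBM-Model-1-style sketch: no grammar rules, no dictionary,
    only co-occurrence statistics over the parallel sentences.
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    # Start with a uniform guess for every word pair that co-occurs.
    t = {(tw, sw): uniform for src, tgt in pairs for sw in src for tw in tgt}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of links per source word
        for src, tgt in pairs:
            for tw in tgt:
                # E-step: share each target word among the source words
                # in proportion to the current translation probabilities.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    share = t[(tw, sw)] / norm
                    count[(tw, sw)] += share
                    total[sw] += share
        # M-step: renormalize the expected counts into probabilities.
        t = {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
    return t

def best_translation(t, source_word):
    """Return the target word with the highest learned probability."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)

# Hypothetical miniature "Rosetta Stone": three aligned sentence pairs.
corpus = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
table = train_word_table(corpus)
for word in ["the", "house", "book", "a"]:
    print(word, "->", best_translation(table, word))
```

On this toy corpus the sketch pairs "house" with "Haus" and "book" with "Buch" even though no sentence contains either word alone, because "the" and "das" recur across pairs and soak up each other's probability. A real system would need vastly more data and would model word order and multi-word phrases, none of which is attempted here.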
when it's in the form of a fish, and can fit in my ear...
"I turn away with fright and horror from the lamentable evil of functions which do not have derivatives."
'Almost everyone'? What *are* you talking about? You must be an American. From a recent online Harris poll, most Americans think at least half the world speaks English. This is just plain wrong. The truth of the matter is that it's more like 20%. That's it. Most people on the NET might speak English, but most people in the world? Hardly.
My journal has hot
That's an example from a few years back of an attempt to translate "the spirit is willing but the flesh is weak" from English to Russian and back to English using a different translator.
Can anyone try this on the new (or some other recent) algorithm?
BTW here's Doc Och's most recent website:
Franz Josef Och
Esteem isn't a zero sum game
A man who speaks three languages is trilingual.
A man who speaks two languages is bilingual.
A man who speaks one language is American.
DG
Want to learn about race cars? Read my Book
English --> French --> English --> Korean --> English. Of course, it helps that the first sentence is munged anyway ;-)
When I am king, you will be first against the wall.
Firstly we could consider the enormous body of work currently available in other languages.
Being able to translate this into English or other languages could be very valuable for scholars.
Secondly, English is not the primary tongue for the majority of people on the planet - to suggest that the ability to translate between other languages isn't valuable, just because a lot of people can manage to converse in English, is foolish.
Also note that the article specifically mentions Arabic and Chinese, which I don't think crossed your mind. China has the largest population on the planet, remember.
Translation is far from obsolete, especially given that the majority of the Western world, and especially America, is piss poor at being bilingual.
Oh please... so many conspiracy theories. You do realize that the *internet* was originally developed by DARPA, right? My point: DARPA does a lot of work... not all of it revolves around spying on or otherwise taking away the rights of American citizens.
You know, that actually does sound like something that would be a Russian aphorism...
As press releases tend to do, this leaves much to be desired for folks who are familiar with the discipline. As I read it, it seems to imply that the main driver is phrase-matching. What does it do with phrases it hasn't seen before? The problem is solved by throwing lots of data at it -- how much data is needed for a reasonable system? How well does it generalize to text outside the domains of the training data?
Incidentally, had my brother been a girl, he was in serious danger of being named Rosetta Stone.
-- Trevor Stone, aka Flwyd
Ceci n'est pas une signature.
well, not EVERY bottle of beer at the duff plant has a nose or hitler's head in it, but i'm glad the inspector is tasked to look at every single bottle.
just because government abuse isn't guaranteed doesn't mean we shouldn't vigilantly examine the possibilities when we see them.
it all boils down to balancing powers of government and freedom of individuals, and this country (USA) was founded upon principles intended to favor the rights of individuals. i'll go out on a limb and make a value statement - that's the way to go. power to the people, man!
I wonder how this would fare putting two computer languages side by side? I mean... take a few thousand programs, coded using the same algorithms but different computer languages... would his language translation software translate between them? Would it be able to differentiate between languages that manually allocate memory and those that use garbage collection? How about between procedural languages like C, and more esoteric and oddly structured languages like LISP?
An interesting challenge, eh?
Would there be any benefit to this?
"I will trust Google to 'do no evil' until the founders no longer run it." Hello Alphabet.
...and I will make pseudo-insightful comments based on the headline text without reading any of the source articles, until my karma is excellent?
If they offered me the same money (and one of those Linux NetworX clusters) I could have a superior system in a month, although (as stated above) it would require more than one known language.
LOL! If this problem was so friggin' easy, why are these researchers the first to demonstrate a working system using this technique (which blows away all existing systems, BTW)? Hell, if it's as easy as you say, this whole "translating text" thing must be a breeze. I wonder why so much money is spent every year on R&D in this area? Hell, why didn't they just hire you to whip up a system in a month?
Why? Because it ain't that easy and you have no idea what you're talking about. Given these are world-class researchers, I'm sure they've considered the multiple-translation route, and subsequently rejected it for very good reasons (likely far more complex than your simplistic "it's easier" excuse). Moreover, the really hard work in this area is the statistical modelling necessary to generate a working system, something which would, I suspect, be far more complex if a multiple-translation route were taken. But, hey, that's just some number crunching, right? What's so hard about that?
Time for inflammatory reasoning. The statistical approach will beat out the grammar- and rule-based ones, at least for English, for one simple reason:
English is not a language
Or rather, it resembles one but is more not than is, IMO. It is a large collection of idiomatic expressions that changes quite rapidly (and not only in colloquial forms, just look at what the political-correctness movement has done to phraseology). You know the story... more exceptions than rules, things that are legitimate to say language-wise are considered incorrect anyways, and vice versa, etc. etc.
That's not to say it doesn't have advantages; it's relatively easy to learn the basics of communication since it's weakly conjugated, has genderless articles, and has a fairly simple uncased sentence structure. But it is monstrous to master, and I suspect most native speakers aren't true masters (not to mention the orthographical nightmare; is English the only language with spelling bee contests?)
The reason it's the new lingua franca (or should it be lingua angla now?) is techno-socio-political, as is always the case. Stop harping on Americans for being largely mono-lingual. "Why didn't the Romans learn the local languages when they controlled Europe? Because they didn't have to." If every state spoke a different language, which would be more akin to Europe, then there would be a need.