Romancing The Rosetta Stone
Roland Piquepaille writes "Not only does this news release from the University of Southern California have a fantastic title, it also has great content. The story is about one of their scientists, Franz Josef Och, whose software ranks very high among translation systems. "Give me enough parallel data, and you can have a translation system for any two languages in a matter of hours," said Dr. Och, paraphrasing Archimedes. His approach relies on two concepts: gathering huge amounts of data and applying statistical models to that data. It completely ignores grammar rules and dictionaries. "Och's method uses matched bilingual texts, the computer-encoded equivalents of the famous Rosetta Stone inscriptions. Or, rather, gigabytes and gigabytes of Rosetta Stones." Read my summary for more details."
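For readers wondering what "applying statistical models" to matched bilingual text can look like, here is a minimal sketch. It is not Och's system (the release gives no implementation details); it swaps in IBM Model 1, a classic textbook word-alignment model trained with EM, and omits the usual NULL word. The three-sentence corpus and the names `PAIRS`, `train_ibm_model1` and `most_likely` are made up for illustration.

```python
from collections import defaultdict

# Toy parallel corpus (an assumption for illustration): English -> German.
PAIRS = [
    ("the house", "das Haus"),
    ("the book", "das Buch"),
    ("a book", "ein Buch"),
]

def train_ibm_model1(sentence_pairs, iterations=10):
    """Estimate word translation probabilities t(f|e) with EM (IBM Model 1)."""
    corpus = [(src.split(), tgt.split()) for src, tgt in sentence_pairs]
    # Start with equal weight for every word pair seen in the same sentence pair.
    t = {}
    for src, tgt in corpus:
        for f in tgt:
            for e in src:
                t[(f, e)] = 1.0
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned at all
        for src, tgt in corpus:
            for f in tgt:
                # E-step: share one unit of f among the source words in this pair.
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise so that the values for each source word e sum to 1.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t

def most_likely(t, source_word):
    """Return the target word with the highest t(f|e) for the given source word."""
    options = {f: p for (f, e), p in t.items() if e == source_word}
    return max(options, key=options.get) if options else None

if __name__ == "__main__":
    table = train_ibm_model1(PAIRS)
    for word in ("the", "house", "book", "a"):
        print(word, "->", most_likely(table, word))
```

Nothing here knows any grammar or consults a dictionary; the word pairings fall out of co-occurrence counts alone, which is the general idea the summary describes.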
This is exactly NOT a universal translator as it uses matched bilingual texts. You need an already translated text for his system to work.
This is a bit of a worry for privacy concerns, given that if I want to keep something secret from the world and private just between me and my intended recipient I have one less option.
How long until this is able to decode things like speech, too, and convert it into something recognisable in another language? Would it still hold my voice patterns and sound like me? And if it were converted back to the English I already do speak, with mistakes, could that then be used against me in a court of law?
Scary stuff
That's an example from a few years back of an attempt to translate "the spirit is willing but the flesh is weak" from English to Russian and back to English using a different translator.
Can anyone try this on the new (or some other recent) algorithm?
BTW here's Doc Och's most recent website:
Franz Josef Och
Esteem isn't a zero sum game
I believe that using a statistical approach like this is a step in the right direction. Manually building sets of rules, dictionaries, etc., is hard to do and a waste of time. And manually built systems become stale as languages evolve, unless a lot of continuing work is done.
For me the holy grail is when I can converse with a computer meaningfully. I believe a similar approach will be required for the computer to "understand" language, and to be able to formulate a coherent and appropriate response.
The battle for the Rosetta Stone "Things are looking decidedly rocky at the British Museum - Egypt's leading archaeologist has demanded the return of the Rosetta Stone. But the museum argues that the removal of the four-foot slab that unlocked the mysteries of the pharaohs would be disastrous"
Sounds like a brilliant idea. Hopefully this is something that could eventually be compacted enough to fit into consumer electronics. It would be great to be able to watch TV from every country without any language barrier!
DeviantArt Page
How can this system compensate for the different dialects of all of the different languages?
"Some fight for law. Some fight for justice. What will you fight for? One day, you will see."
> Americans think at least half the world speaks English.
Better-informed Americans (a small minority of the class) would be aware that Spanish is well on the way to becoming the predominant language in the USA.
But, IMHO, English could become the next Latin: the dead language that everybody has to learn if they're going to try and influence the world.
BTW, every "% of humanity" statistic has to consider that most humans are Chinese.
Kneejerk /. response: it's a government conspiracy to take away more of our rights.
/. mod response: he's right.
Kneejerk
I'm not sure this is really applicable to translating literary works. These kinds of translations require an understanding of the native culture of both the source and target languages, as well as the intent of the writer, in order to generate an understandable translation that the target group can appreciate. A computer translation system like this one is incapable of performing these sorts of analysis.
What this is really good for is on-the-fly translation of material where the reader simply wants to comprehend what was written (think the old babelfish engine). This has obvious applications on the web, as well as many other areas (on-the-fly server-side translation for IM systems, etc, etc).
***WHAT THE FUCK ARE YOU THINKING?***
:)
Look, seriously, even if everyone did speak English, there are still tonnes of literary works in other languages - the original texts of the Ancient Greek classics, for example. To read in the original language is often a much more rewarding experience. Besides, relying on past translations of non-English material can lead to errors. And consider how many different English translations of the Bible there are.
Almost everyone can speak, read and write at least tolerable english
Almost everyone can communicate using gestures, facial expressions and grunts, but is that any reason to use that as our primary communication method? I mean, to really stretch a metaphor from human languages to programming languages, we can write any computer program "tolerably" in assembler (it's Turing-complete), but that doesn't mean it's the best way to do it. If I can only speak one language "tolerably", but another exceptionally well, which one is better for conveying my ideas?
most young people can have full fledged discussions in it
I don't think we can rely on "d00d, u r so l33t" to teach people true literacy. Young people are increasingly using SMS and online chat and are actually losing their ability to correctly spell words or write grammatically correct sentences. The number of young adults I see who cannot distinguish correctly between there, their and they're is ABSOLUTELY TERRIBLE. Literacy is a major problem in English-speaking nations.
Just look at Slashdot, I'm quite sure I'm not the only one who doesn't have english as primary language
That doesn't mean you can use it well. Take a good look at Slashdot - many, many people mangle the English language. The American people are probably the biggest offenders here...
It's not that farfetched idea that in the (near) future everyone uses or at least knows english well enough to make translations meaningless
Human languages don't map to each other 1:1. Some languages have words that basically cannot be translated without a serious loss of accuracy. (I guess you could say that no human language is Turing-complete, in that it can't totally express every conceivable human thought.) Having everything translated to English is NOT a solution. Brevity and language tricks (such as puns, rhyming, etc.) cannot always be substituted across languages.
If it wasn't 2:15am in Melbourne right now, I'd try to order my thoughts and express them more clearly, but after 4 hours of Java debugging I'm off to get some sleep before uni tomorrow. Goodnight.
The PowerPC includes for this purpose two instructions called SYNC and EIEIO.
The big problem I see with this scheme is how you collect the gigs of data (i.e. content) without wholesale copyright violation or licensing (big bucks). Sure, you can get lots of content whose copyright has run out from Project Gutenberg. But that's gonna be 70+ year-old stuff.
Add the fact that the Mickey Mouse Copyright Extension act and related legislation threaten to extend copyright terms for infinity minus a day and you're never gonna have much content available that reflects CURRENT usage of the languages you're trying to translate.
Even existing translation programs could benefit from a ranking system. Wouldn't it be helpful if you could tell just how confident the translator is about a certain phrase or word? That way, you could rephrase your sentence before you foolishly ask someone to "taste" you....
It's not quite done yet, but the system does show promise. Dictionaries have already been created in Spanish, English, German, Japanese, Italian, French and several other languages.
20 mil and I will! Learn Esperanto with 20M others.
As press releases tend to do, this leaves much to be desired for folks who are familiar with the discipline. As I read it, it seems to imply that the main driver is phrase-matching. What does it do with phrases it hasn't seen before? The problem is solved by throwing lots of data at it -- how much data is needed for a reasonable system? How well does it generalize to text outside the domains of the training data?
Incidentally, had my brother been a girl, he was in serious danger of being named Rosetta Stone.
-- Trevor Stone, aka Flwyd
Ceci n'est pas une signature.
I always thought it would be interesting if google applied its page rank algorithm to provide a translation service. Like poll the top 5 translation service sites for a translated sentence and then based on what each of them return, generate a 'average' or best possible result for that sentence.
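A rough sketch of the "poll several services and pick the best" part of that idea. This is not PageRank; it swaps in a plain agreement vote: take the candidate translation that is, on average, most similar to all the others. The function name `consensus_translation` is made up, and the candidate strings are assumed to have been fetched from the services already (no real translation API is called here).

```python
from difflib import SequenceMatcher

def consensus_translation(candidates):
    """Return the candidate that agrees most, on average, with the others.

    Similarity is difflib's ratio over lower-cased word lists, so it rewards
    shared runs of words rather than shared characters.
    """
    if not candidates:
        raise ValueError("need at least one candidate translation")
    if len(candidates) == 1:
        return candidates[0]

    tokenised = [c.lower().split() for c in candidates]

    def agreement(i):
        # Average similarity between candidate i and every other candidate.
        scores = [
            SequenceMatcher(None, tokenised[i], tokenised[j]).ratio()
            for j in range(len(tokenised))
            if j != i
        ]
        return sum(scores) / len(scores)

    best = max(range(len(candidates)), key=agreement)
    return candidates[best]

if __name__ == "__main__":
    # Hypothetical outputs from three different services for one sentence.
    results = [
        "the spirit is willing but the flesh is weak",
        "the spirit is willing but the meat is weak",
        "the vodka is good but the meat is rotten",
    ]
    print(consensus_translation(results))
```

Note the obvious weakness: if most services make the same mistake, the vote happily picks the mistake. Weighting each service by how often it has been right before would be closer in spirit to a ranking algorithm.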
When I lived in Europe, a friend and I went to Paris. We're both bilingual (German for me, Spanish for him), but unfortunately neither of us knew French. We had occasion to ask which train we needed to be on to get somewhere, and asked (in French) if the person we were asking for directions knew Spanish, English or German. We went through a good ten people before we found someone willing to admit that they spoke something other than French.
I'm sure they thought they were being all "Ha-ha, I will not let these Americans get away with not speaking French!" but our interpretation of the situation was "We're Americans, we speak two languages, what's wrong with you?"
paintball
How about piping various algorithms encoded in Pascal and C into the thing and seeing what it does to convert arbitrary sources? Where can I get the source? Pawel
I wonder how this would fare putting two computer languages side by side? I mean... take a few thousand programs, coded using the same algorithms but different computer languages... would his language translation software translate between them? Would it be able to differentiate between languages that manually allocate memory and those that use garbage collection? How about between procedural languages like C, and more esoteric and oddly structured languages like LISP?
An interesting challenge, eh?
Would there be any benefit to this?
"I will trust Google to 'do no evil' until the founders no longer run it." Hello Alphabet.
I think what was implied was that if you already had a translation engine trained for English/Japanese, when you are training it for English/French you can use the already existing "metadata" for English/Japanese to make the process quicker (requires smaller datasets to achieve the same precision).
I might be far out here. Excuse my crappy English, btw.
Are you a grammar Nazi? I'm trying to improve my English; please correct my errors!
Now that would be cool.
Seriously though, this leaves only the odd tribal languages of African (and perhaps South American?) tribes that are comprised entirely of clicks and guttural sounds as not easily comprehended. Could this system's approach finally result in a Babelfish-like universality even for languages such as Chinese and Japanese? The added complexity makes it much more challenging for things like Babelfish, but if this system can do it, it's going to be a landmark discovery.
Anybody have any further research by this guy? I'm interested! Who knows, maybe I could have gotten a better grade in French thanks to this research...
The Rosetta Stone itself did not do much in the way of our knowledge of the Egyptian language.
What it did do was provide insight into their method of writing.
It was the later discovery of the relation between Coptic and Egyptian that revealed most of the actual language.
(IIRC)
Time for inflammatory reasoning. The statistical approach will beat out the grammar- and rule-based ones, at least for English, for one simple reason:
English is not a language
Or rather, it resembles one but is more not than is, IMO. It is a large collection of idiomatic expressions that changes quite rapidly (and not only in colloquial forms, just look at what the political-correctness movement has done to phraseology). You know the story... more exceptions than rules, things that are legitimate to say language-wise are considered incorrect anyways, and vice versa, etc. etc.
That's not to say it doesn't have advantages; it's relatively easy to learn the basics of communication since it's weakly conjugated, has genderless articles, fairly simple uncased sentence structure. But, it is monstrous to master and I suspect most native speakers aren't true masters (not to mention the orthographical nightmare; is English the only language with spelling bee contests?)
The reason it's the new lingua franca (or should it be lingua angla now?) is techno-socio-political as is always the case. Stop harping on Americans for being largely mono-lingual. "Why didn't the Romans learn the local languages when they controlled Europe? Because they didn't have to." If every state spoke a different language, which would be more akin to Europe, then there would be need.
Maybe this is offtopic, but if you want really elegant language processing you should check this out. Basically, you look at the compressibility of a given text and can determine what language it's in, or even what author produced it. This works with as few as 20 words.
I realize this isn't translation, but cool nonetheless. For further reading see here and here.
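A minimal sketch of the compression trick, assuming zlib as the compressor (the linked work may use something else). For each language, append the unknown sample to a reference text and measure how many extra compressed bytes it costs; the language whose reference absorbs the sample most cheaply wins. The tiny `REFERENCES` texts and the function names are made up for illustration, and real use would want far longer reference texts.

```python
import zlib

# Toy reference texts (assumptions for illustration only).
REFERENCES = {
    "english": (
        "the spirit is willing but the flesh is weak and the people of "
        "the city went out to see what had happened in the night"
    ),
    "french": (
        "l'esprit est prompt mais la chair est faible et les gens de "
        "la ville sont sortis pour voir ce qui s'était passé dans la nuit"
    ),
}

def compressed_size(data):
    """Number of bytes zlib needs for `data` at maximum compression."""
    return len(zlib.compress(data, 9))

def guess_language(sample, references):
    """Return the label whose reference text makes `sample` cheapest to compress."""
    sample_bytes = b" " + sample.encode("utf-8")
    best_label, best_cost = None, None
    for label, text in references.items():
        ref_bytes = text.encode("utf-8")
        # Extra bytes the sample adds on top of the reference alone.
        cost = compressed_size(ref_bytes + sample_bytes) - compressed_size(ref_bytes)
        if best_cost is None or cost < best_cost:
            best_label, best_cost = label, cost
    return best_label

if __name__ == "__main__":
    print(guess_language("the people of the city went out to see", REFERENCES))
    print(guess_language("les gens de la ville sont sortis pour voir", REFERENCES))
```

The compressor never sees a word list or a grammar. It only notices that the sample repeats byte patterns it has already seen in one reference more than in the other.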
This post cannot be rebroadcast without the express written consent of Major League Baseball.
"You look nice..." --> "Shall I compare thee to a Summer's day..."
-- Sig down
This has already been done some years ago in Canada, where the translation system was fed the complete text of parliamentary debates for umpteen years (required by law to be translated by humans into French, if originally in English, and vice versa). I don't know how it fares when presented with a sample of parliament-speak (I concede, this is not a fair approximation of human language), but it fails miserably on a simple rhyme. Read your Hofstadter, guys.
I can assure you, the best way to get rid of dragons is to have one of your own.
Interesting method.
It seems to me this is more similar to natural learning of a language (usually at a young age) by exposure and immersion, as opposed to scholar learning of a language in classrooms, etcetera.
It shouldn't be surprising that in humans, the first method also works best at acquiring fluency in multiple languages. As a matter of fact, it's the only method through which we come to understand our FIRST language, which is in almost every case the one we command the best.
I think most people get, by consuming huge amounts of information, a feeling of "what sounds right" and "what sounds wrong" that is more effective for them at predicting the unwritten rules and exceptions, both in translations and in original sentence-creation, than memorizing a set of grammar rules which, in the end, are just codifications of the current state of the language.
I don't think the success of the approach means the symbolic methods are pointless for this endeavor, any more than the formal study of languages and their grammars is for human translators.
Professional writers and translators do study such rules to dramatically improve their command of the different languages, and do get much better results.
But it seems to me they are more successful going from "statistical matching with massive real-use data" to "optimized grammar rules matching the data" than going backward, from "scholastic grammar rules" to "consumption of massive data to acquire exceptions, and correct and complement the rules".
What would be interesting, I think, is if one can study the state of the system after it's performing well and extract/deduce grammar rules, algorithmically.
It would be interesting to see the results of a program doing that, collecting (and correcting) the grammar using the data, and using the grammar rules when no match in the dictionaries is found to, say, apply a greater weight to the grammatically correct choice among the alternatives.
If the results were good with this approach, one could consider decreasing the size of the database as the grammar gains stability. Use that memory for other processes, other languages, or new sample data that could not be examined before.
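To make the "greater weight to the grammatically correct choice" idea concrete, here is a toy sketch. Everything in it is an assumption: the candidate scores stand in for whatever the statistical system outputs, `looks_grammatical` is a one-rule stand-in for a grammar extracted from the data, and the weight of 2.0 is arbitrary.

```python
def rerank(candidates, is_grammatical, grammar_weight=2.0):
    """Sort (text, score) candidates, boosting those that pass the grammar check.

    The statistical score is multiplied by `grammar_weight` when
    `is_grammatical(text)` is true and left unchanged otherwise.
    """
    def weighted(item):
        text, score = item
        return score * (grammar_weight if is_grammatical(text) else 1.0)

    return sorted(candidates, key=weighted, reverse=True)

def looks_grammatical(text):
    """Hypothetical one-rule grammar: reject 'he go' (missing agreement)."""
    return "he go " not in text.lower() + " "

if __name__ == "__main__":
    # Hypothetical alternatives with made-up statistical scores.
    alternatives = [("he go home", 0.40), ("he goes home", 0.35)]
    for text, score in rerank(alternatives, looks_grammatical):
        print(text, score)
```

The grammar only nudges the ranking here; a candidate with a much higher statistical score still wins, which keeps the data in charge and the rules as a tie-breaker.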
Freedom is the freedom to say 2+2=4, everything else follows...
I'm forced to disagree. Although reading texts in their primary languages is certainly valuable, I severely doubt every single scholar who studies ancient Mesopotamia is fluent in reading cuneiform script! Also, asking scholars to be fluent in one or two dead languages is quite a lot (according to my sister, who's a medieval scholar and speaks Latin and Medieval French). Would you have them be fluent in every single language they encounter? That's unrealistic, as well as inefficient.
Although it's certainly true that many scholars can read the primary languages of the periods they study, some do not. For example, if one were studying Culture A through the medium of Culture B's records of interactions with Culture A, one would not need to read primary sources from Culture A.
It's true that many scholars do prefer to rely on personal translations of primary sources, but for many it's a simple waste of time that could be better spent. Instead of arguing that all scholars must be able to read all primary sources of the cultures they study, I would argue that they should be able to analyze the translations of others (perhaps even the translations this system produces) with regard to the culture. If 20,000 scholars all translate a primary source and their translations are all relatively accurate (errors will be corrected in time), then 19,999 of them have wasted weeks or months.
Yes. Scholars do need translations - they help verify the scholar's own translations, provide much-needed resources, give insight into the translator's view of the culture - in short, they are a resource too valuable to put aside.
Access denied: Not enough clue for requested operation.