Coming Soon, The Google Translator
compuglot writes "Google gave journalists a glimpse of its next generation machine translation system at a May 19th Google Factory Tour. "Google Blogoscoped" offers an excellent overview of the presentation.
The system has been trained using the United Nations Documents as a corpus. This corpus is some 20 billion words worth of content. It uses existing source and target language translations (done by human translators at the U.N.) to find patterns it then uses to build rules for translating between those languages. Apparently it was successful where the current version had failed in translating certain phrases.
If anyone were capable of making a serious go of MT, that would have to be Google."
Since the RTFAs lacked any kind of crunchiness, I sourced some great stuff here that does a wonderful job of explaining how this system works and lays out the advantages the statistical translation method has over the rules-based approach, as well as the disadvantages.
fascinating stuff:
"Currently, most machine translation technology, including consumer-oriented programs such as Systran's Babel Fish, have been "taught" the rules of language, such as verb tenses and when to use parts of speech. Programmers painstakingly hand-build systems based on such rules. "The computer is told, if you see this thing in Russian, replace it with this thing in English," explains Yarowsky.
"While somewhat effective, such systems are time-consuming to build (consider how long it takes most humans to learn a language and all its rules), and resulting translations are still marred by grammatical and other errors. Those that do work fairly well usually tackle popular Western languages, such as French, German, and Spanish; there are few translation programs developed for other important tongues, such as Chinese, Turkish, or Arabic, let alone for more obscure languages like Tajik.
"To tackle a broader range of the world's languages, and to improve on the quality of machine translation, Yarowsky and his Hopkins colleagues are developing computer programs that can be trained to figure out any language using statistical analysis, i.e., looking at the probabilities of language patterns. In what's known as automatic knowledge acquisition, the computer could "learn" Serbian well enough to translate future documents or conversation, or at the least pick out pertinent words like "bomb."
"As Yarowsky explains: "Say you want to teach a computer how to translate Chinese: You give the computer 100,000 sentences in English and the same 100,000 sentences in Chinese and run a program that can figure out which words go to which words. If in 2,000 sentences you have the word Washington, and in about the same number of sentences you have the word Huashengdun, and they occur in the same place in the sentence, these words are likely translations.
"It's all just observation," Yarowsky adds. "Children do the same thing, but they also do it through visual stimulation and feedback. They see a book and hear the word 'book,' and eventually they learn that it's a book. They see a bird with its wings flapping around and learn that is called a bird. It's the same with machines, only they have much better memories. Computers could remember exactly when and where they saw the words bird and book."
"So, instead of telling a computer how to do something -- conjugate the verb 'to be' in Spanish, for example (I am = soy) -- researchers give it tens of thousands of examples and program the computer to find repeated patterns that the computer can use to conjugate new verbs. Trained this way, the program could potentially "learn" phrase structure and the rules of translation.
"As Yarowsky notes in his 100,000-sentence example, one way to accomplish automatic knowledge acquisition is to use bilingual or parallel text. The program "reads" a document in English and then a version in a second language. Such texts used by Hopkins researchers include the Bible, which is available on the Web in more than 60 languages, the Book of Mormon (over 60 languages), and the United Nations Declaration of Human Rights (240 languages).
"Aiding the computer is the fact that the English version of such texts can be annotated by hand or using another computer program -- essentially marked up to show, for example, that Jesus is a noun and pray is a verb. The translation program-in-training needs such information because it cannot translate future text just by substituting individual words in each language; it must also be able to analyze how sentences work. To do so, the computer program uses pattern recognition templates and other tools to understand sentences on a syntactic level. Simply put, the program is essentially given clues to know what to look for, notes Yarowsky: "It should figure out the subject, figure out the object, and other elements of sentence structure."
Just to illustrate, here's the summary of this story, translated to German and back to English using Google's current version:
____
~ |rip/\/\aster /\/\onkey
So what powers Google's current translator? I have seen it give word-for-word the same as Babel on some occasions (but with better handling of non-ASCII characters).
# cat
Damn, my RAM is full of llamas.
"Guugle-a gefe-a a Gleempse-a ooff its mecheene-a Uebersetzoongsystems zee fullooeeng prudoocshun et zee fectury ruoote-a ooff zee A Mey 19 tu juoorneleests. Guugle-a. "Guugle-a Bluguscuped" ooffffers un ixcellent ooferfeeoo ooff zee representeshun. Zee system ves treeened veet zee neshun ducooments es kurpoos. Thees kurpoos is sumetheeng 20 beelliun vurd felooe-a ooff cuntents. It uses zee ixeesting terget lungooege-a trunsleshuns (tekes plece-a feea hoomun trunsleturs et zee U.N.) Semples feend, vheech use-a it zeen tu istebleesh gooeedelines fur trunsleteeng betveee thuse-a lungooeges. Epperent it ves sooccessffool, vhere-a zee present ferseeun hed feeeled, iff it trunsleted certeeen cleeches. Iff iferyune-a ooff furmeeng a sereeuoos vere-a cepeble-a, ooff zee M.Ue-a., thuse-a vuoold gu tu hefe-a hefeeng tu Guugle-a."
Looking forward to a www.borkle.com which returns all its results in such a format.
Don't blame Durga. I voted for Centauri.
Make this work with Gmail and I'd even pay money for it!
Tired of getting email from Amazon.DE on my Gmail account and having to copy and paste it over to Babelfish.
That would be very useful for me.
Sig for hire.
www.googledot.org and www.appledot.org
That Microsoft will announce a new revolutionary language translation service sometime in the next two weeks or so?
Weaselmancer
Ridiculous.
Oh, no. It's because geeks like Google. Therefore, Google are capable of superhuman feats that mere scientists -- those with years of experience in relevant fields -- are incapable of doing.
Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
Googlefish or babelgoogle? Maybe we should just change "internet" to google, and every site must have google involved.
googlejournal.com
Googlesoft.com
Googlenix.com
Opengoogle.org
I like muppets.
boakes.org
When questioned on the matter, Altavista's Babelfish translator gave this quote:
Google does not have anything on my amazing abilities of the translation!
Pulp Audio Weekly - Geek News and Reviews
Actually, my bet for most likely to make a real go of machine translation would be...
IBM
Look how far they ran with chess programs, because they felt like it...
If they decided to go the same distance with translation...
Bubla *Cick BAle Walkie *Hotka BaCa Sopika *luek Gack *Zoek Pael Quazic Translate that google!
If your blog sounds like a politician giving a speech at the UN, this service will do a wonderful job. Doubtful that it will do any better than Babelfish otherwise.
The biggest problem in artificial intelligence is that the system learns the material that it is trained to, and only that material. Computers don't generalize or extrapolate the known into the unknown worth a damn.
Aah, change is good. -- Rafiki
Yeah, but it ain't easy. -- Simba
While Google's existing translator and Altavista's Babelfish are good, they do not help in the translation of several other languages.
That would be a really good benefit - for instance, I wanted something translated to and from Svensk (Swedish), but I really couldn't find any translation service that did it.
Good translation of the more common languages would be nice, but even simple translations of a wider variety of languages would be really useful.
At last I can translate all those non-English spam emails I get! There'll be no more missed opportunities to buy Chinese Viagra, woohoo.
Since it's become "hip" to bash Google these days and support either MSN's search technology or Yahoo, I'm making a pre-emptive strike for the IT fashionistas:
"Duh!!! The best machine translator in the world already exists and there can be no improving upon it! Babblefish (thank you Altavista) has been doing this for well nigh a decade. All you Johnny-come-latelys are probably going to rave on with fanboy adoration of Google (the company that can do no wrong)!!! To top it all off, you lot apparently know nothing about Microsoft's language translation project which is slated to be deployed as part of Longhorny in 2010. Online language translation from Google will fail because Microsoft will have it built into the OS itself. Why send your document online for translation when the OS itself will not only translate it, but it will correct the grammar, punctuation and generate a WMA file in one of ten thousand gorgeously rendered synthetic voices. Google has lost. Google has been trolled. Google will have a nice day".
We now return you to your regularly scheduled pos[tt]en.
-"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
There is already a tranzilator
Seems one could devise a TQ (translation quotient) measuring the effectiveness of machine (or human) translators. Take any standard reading-comprehension test, send its text material through the translator and back... and then compare the scores of subjects taking the resulting test vs. those taking the original.
(Before such translators make their way into, say, diplomatic circles, I'd sure hope there's some objective demonstration of near-infallibility...)
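For what it's worth, the scoring half of that TQ idea fits in a few lines. This is only a rough sketch: the function name, the "percent of baseline" scaling, and the sample scores are all made up for illustration, and the hard part (getting real subjects to sit both versions of the test) is not something code can do.

```python
from statistics import mean

def translation_quotient(original_scores, round_trip_scores):
    """Mean score on the round-tripped test as a percentage of the
    mean score on the original test (100 = nothing lost)."""
    if not original_scores or not round_trip_scores:
        raise ValueError("both groups need at least one score")
    baseline = mean(original_scores)
    if baseline == 0:
        raise ValueError("baseline mean is zero; TQ is undefined")
    return 100 * mean(round_trip_scores) / baseline

# Invented scores for two groups of test takers.
original = [80, 90, 70]      # took the untouched test
round_trip = [60, 72, 48]    # took the test after translation there and back
print(translation_quotient(original, round_trip))  # 75.0
```

A real study would also want enough subjects per group to tell a genuine drop from noise.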
Seeing bad movies only encourages them. Watch responsibly
I don't ever expect such translation to work perfectly, but taking existing phrases should lead to useful first drafts.
This will mean one less possible career for me, and fewer Babelfish-induced laughter moments.
As a fluently bilingual person, I often recognize expressions that were translated in Canadian government documents. "Anglicisme" is the word the French have for it.
There's subtlety to languages we may forever lose. Take for example:
"Je donne ma langue au chat" - "I give up" (answering a riddle) instead of the more picturesque "I give my language to the cat". Well, that should be tongue, but hey, it's just Babelfish!
"Bullshit" won't produce "merde de taureau". That is a strange expression you anglos have, don't you realize?
"Il pleut comme vache qui pisse" will give us "it's pouring cats and dogs" rather than "it's pouring like cows' a'pissin". The French also have never heard of cats and dogs falling from the sky.
While an improved Babelfish may improve our mutual comprehension, please pause for a moment to consider all the linguistic hilarity we'll forever lose.
Information: "I want to be anthropomorphized"
I predict we'll see Google developing the Universal Translator pin.
Then the warp drive, then the teleporter, and why not everlasting youth?
Oh yeah!
Most of the time you don't know what language the text is written in. When you get alien-looking content, most of the time you don't know the best way to make sense of the shit! They should have something that detects (pattern matching etc.) the language in which the content is written!
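The pattern-matching version of this is not hard to rough out. Here is a minimal sketch that guesses the language by counting very common function words; the tiny word lists and the function name are my own picks for illustration, and real detectors usually lean on character n-gram statistics over far more languages.

```python
import re

# Hand-picked common words per language (an assumption for this sketch,
# nowhere near complete).
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that"},
    "german": {"der", "die", "und", "das", "ist", "nicht", "ein"},
    "french": {"le", "la", "et", "les", "des", "est", "une"},
    "spanish": {"el", "la", "y", "los", "de", "es", "que"},
}

def guess_language(text):
    """Return the language whose common words show up most often, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    hits = {lang: sum(w in stop for w in words)
            for lang, stop in STOPWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

print(guess_language("Der Hund ist nicht in dem Haus und die Katze ist weg"))
```

Short snippets and closely related languages will trip this up quickly, which is where the statistical approaches earn their keep.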
That should be 200 billion words according to the article
So how do you think it will handle "all your base are belong to us"? Seriously though, it will be interesting to see how well they can make it work. My experience so far with translators has been dreadful.
Madre de Dios! Es El Pollo Diablo! -- Captain Blondebeard
When are we going to see calendaring functionality with Gmail? You know it's in the works in Google labs...come on Google! ;)
First, this is outstanding; Google, unsatisfied with traditional machine translation techniques, pioneers their own design. I'm certain their advertisers will be pleased to have their ads auto-translated to whatever language is necessary.
Second, I think we'll witness a case of having the AI ante upped once again when another traditional AI challenge is met. Wikipedia puts this best; When viewed with a moderate dose of cynicism, AI can be viewed as 'the set of computer science problems without good solutions at this point.' Once a sub-discipline results in useful work, it is carved out of artificial intelligence and given its own name.
Lurking at the bottom of the gravity well, getting old
This sounds very interesting... imagine the possibilities for localization of applications - I'm sure a simple script could be created to extract strings from source, parse them through the translator and substitute them in your chosen language, this could save a LOT of time!!!
I can't wait for a Welsh version of firefox =P
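A rough sketch of what such a script might look like, assuming the source marks translatable strings gettext-style as _("..."). The translate callable here is a stand-in lookup table, not any real Google API, and the Welsh entries are placeholders rather than vetted translations.

```python
import re

# Matches gettext-style markers such as _("Open file"); assuming this
# convention is what lets the script skip strings that should not change.
MARKED_STRING = re.compile(r'_\("((?:[^"\\]|\\.)*)"\)')

def extract_strings(source):
    """Collect each marked string once, in order of first appearance."""
    seen = []
    for match in MARKED_STRING.finditer(source):
        text = match.group(1)
        if text not in seen:
            seen.append(text)
    return seen

def localize(source, translate):
    """Replace every marked string with whatever translate() returns."""
    def swap(match):
        return '_("' + translate(match.group(1)) + '")'
    return MARKED_STRING.sub(swap, source)

# Hypothetical stand-in for a machine translation service.
PLACEHOLDER_WELSH = {"Open file": "Agor ffeil", "Quit": "Gadael"}

def fake_translate(text):
    return PLACEHOLDER_WELSH.get(text, text)

sample = 'menu.add(_("Open file"))\nmenu.add(_("Quit"))\nlog("debug only")\n'
print(extract_strings(sample))
print(localize(sample, fake_translate))
```

This does not escape quotes in the translated text or handle format placeholders, and short UI strings give a translator very little context to work with, so a human pass would still be needed.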
Time is an illusion. Lunchtime doubly so. - Douglas Adams
Good luck to them, but I doubt that they are gonna make it.
OK, maybe in 10 years or so.
I'm into natural language processing myself and it seems to me that it's very difficult to build a system that works globally on all kinds of input.
They'll have to LISP it to death!
Anyway my $0.02
www.lemonodor.com A mostly Lisp weblog
A relative worked in an "internationalization" department, creating software/manuals in many languages.
In order for machine translation to be as good as human translation, you first need to determine what the sentence "means". Oftentimes you need to track previous sentences to determine the meaning of things like the word "it". Human language is not very detailed and relies on common knowledge and experience to infer meaning.
It's very hard. Some languages are easier than others for this stuff. German/French/Spanish all change the gender of the word "the" based on the noun and give clues about how it's used in a sentence. This can help a little.
For many web pages this approach may give an understandable translation, but for literary references and books (manuals etc.) machine-assisted translation is now the norm.
Even using AI, determining meaning is very difficult. Google "semantic processing" for companies trying. One is Cyc, from Cycorp, a spin-off of the MCC research consortium.
http://www.cyc.com/
So when you go to translate.google.com and translate something, the result will be legalese in the resulting languages.
Spanish: "Que pasa?"
English translation: "With regards to the current situation, how is the day progressing?"
FTA:
researchers working on this enabled the system to translate from Chinese to English without any researcher being able to speak Chinese
Hmmm.. and they know that it works because...?? ;)
"Is this just useless, or is it expensive as well?"
DVD subtitle tracks would be another good addition to help pick up slang too (most have an English track along with a couple of others depending on the region)... all time-synced and easy to match up...
(I'm guessing that it'd fall under fair use and Google wouldn't have to struggle to get the movie studios' approval, even though such tech would benefit the studios too.)
But can it translate Pig Latin, Bork Bork Bork!, and Klingon?
In 'Hitchhiker's Guide to the Galaxy' (the 'trilogy' of books, not the recent movie), it's mentioned that the babelfish has effectively started many, many wars. The reasons seem to be that any being can be rude to any other being without a serious set of translations that explain exactly what the rude terms mean and how they should be regarded.
I'm highly concerned for this warmongering that Google has undertaken.
Reference Here: http://www.bbc.co.uk/cult/hitchhikers/guide/belgi
Picture this: I write a blog entry with either bad punctuation or erroneous content. Under the old system (pre-Google translation), I would receive several flames about my idiocy. With Google translations:
* People around the world will be confused and angered about my punctuation;
* Vastly larger numbers of people will complain about my erroneous content;
* Other people will step up to my defense and a massive flame war will ensue;
* Idiots everywhere (who speak other languages) will echo my idiocy by believing the erroneous content I posted;
* The signal to noise ratio of the net will rise markedly;
* I will still be unsure of whether to count on my fingers starting with my thumb or forefinger depending on which European country I'm in.
I believe this pro-war, anti-peace, conflict-ridden idea of making everyone THINK they understand each other is ripe for criticism. God made everyone else speak funny, I think it should stay that way! Only right thinking people speak my language anyway, and everyone else should just shut up and sit down!
(WARNING: above post contains carcinogenic levels of sarcasm, facetiousness, satire, irony, and adjectives. Please unplug brainstem and wipe with a clean, damp cloth before continuing.)
Unitarian Church: Freethinkers Congregate!
That happens being, the Google has an updated technology and it goes, it will make a method it is a first in them,! It congratulates in them. To them being company percentage chance to this!!
"hey, could you pass me a paper towel? er.. I mean... DEPLOY ABSORBTION PANEL!"
- If they use UN documents as a guide, the Google MT engine will be excellent at translating bureaucratese between languages. I'm not sure if that's a good thing!
- It's obvious that the US Gov't is dumping money into Google -- I often wonder if Google is a front for some US gov't agency.
Conformity is the jailer of freedom and enemy of growth. -JFK
Oh, come on. I (still) like Google, but that's a bit silly, no?
"I love my job, but I hate talking to people like you" (Freddie Mercury)
People using a translator who don't take the time to familiarize themselves with the grammatical-lexical quirks of mechanical translation and 'take offense' should be rounded up along with all those people who are so fond of taking offense on behalf of others who might be offended. Grind up for shrimp feed.
Imagine a world in which everyone stopped to consider environment, context, and cultural POV when engaged in conversation with others.
But, evolution won't let this happen. It favors numbers, rapid breeding, and in the case of humans, the hive-nest-swarm-colony-'what have you' of group focus on simplistic solutions serving fulfilment of immediate desire.
Translate that GOOGLE!
MEMRI (memri.org) does a nice job of translating articles, essays, and even video from various media in the Middle East.
[Insert pithy quote here]
Wenn ist das Nunstruck git und Slotermeyer? Ja!... Beiherhund das Oder die Flipperwaldt gersput. be careful! If you translate this you may end up dead.....
Yes, but can it translate German or Italian opera to english and still have it rhyme? :-)
"Computers don't generalize or extrapolate the known into the unknown worth a damn."
Fortunately, that's not all that google has to go on. Google has 8 billion webpages, in many different languages, most of which are written by non-speechwriters. Not only can they analyze words based on translated context, but they can analyze words based on intra-language context, to form associations between words and meanings.
The real trick is getting down two important linguistic concepts: "Sandhi Rules" (for instance, the use of "an" before a vowel and "a" before a consonant, which are totally regular but more complicated than a word-to-word matchup), and the "degree" or "quality" of words, which indicate the type of adjective most appropriate in any given context.
For instance, "erudite", "learned", "educated", "knowledgeable", "skilled", and "cunning" could all be related words, but many of them have positive or negative associations which may only really be conveyed by understanding the meaning, irony, or sarcasm of a particular phrase.
For instance, "John has been skilled in writing beautiful code for most of his adult life" is quite different from "John has been educated in writing beautiful code for most of his adult life", or "John has been erudite...". The first one is probably right if John has had a natural inclination to doing it properly, the second if he has undergone some training (though we don't know the actual state of his ability), the third (though the word doesn't even really make sense here) if he has been arrogant about his ability, shouting RTFM! every time someone asked him a question.
Since Esperanto is mentioned so prominently, I have to wonder whether the tool will support it. There has been at least one previous attempt to use Esperanto as an intermediate language for a machine translation project. The only English translation of the article I could find is now only available in Google's cache. There is an ironic symmetry to that.
The net will not be what we demand, but what we make it. Build it well.
As the article suggests, Google could use this if they ever decide to go ahead and launch an instant messenger. Imagine being able to chat with anyone in the world while Google does the translation in real time for you. What are the implications of this?
As an example, on the one hand my family back in Peru, who don't speak English, would be able to chat with my current gf, who doesn't speak much Spanish but still likes chatting with them. On the other hand, this would reduce both parties' motivation to learn a new language (maybe good in my case ;) ).
[alk]
The Adventures of *Super Monkey Car *
...and get out a fansub?
I think it's great that Google is putting their resources behind this, and I'm sure improvements in MT will be the result. What we can't expect, though, is perfect machine translation. Computers translate on the basis of syntactic and semantic correspondence between words in the two languages. What's missing from the software is an understanding of context. I think Google's efforts should help here. Training the software should enable it to determine that a word is likely to have a particular value/meaning on the basis, for example, of certain words surrounding it (i.e. in previous and subsequent sentences). Current software seems only to translate at a per-sentence level -- thus the lack of coherence in translated paragraphs. What Google can't do, though, is account for non-textual context, for example: the character of the writer, when the text was written, for whom, the purpose of the text, etc. So much of "meaning" for us humans is also the affect or force that a text has on us -- how it makes us feel -- and writers tweak what they say to bring about those particular effects. For this reason computers would find it difficult to account for stylistic variations that affect how a human would interpret a text. So much of meaning is implicated (i.e. not literal but tied to speaker/writer intentions), and this is what easily gets lost in translation. Even humans find translating this stuff hard. As any translator will tell you, there is no such thing as a perfect translation (the translator can understand the meaning in the original language but not when trying to think it in the other). But still, I'm really excited to see what Google is going to be able to do.
Ludwig Wittgenstein
The Adventures of *Super Monkey Car [Insert Blank]
Some people here seem to have a false picture of how language works. Individual words do not have meanings. Not to a human interpreter anyway. Sentences used in actual contexts have meanings (unless a single word is uttered as an elliptical sentence). The "meanings" of words, as found in dictionaries, are simply abstractions from occasions of use. The idea that individual words have meanings hasn't been current in philosophy or linguistics for about 50 years. Also, the idea of St. Augustine that children learn the meaning of words by associating sounds that they hear with particular objects that they observe is now also considered rather dubious.
Ludwig Wittgenstein
imagine the product you have if they can do it.
You configure the web accelerator to automatically translate all the languages to yours and all the pages would be translated to YOUR language in real time.
Wouldn't that be great? And it seems easily doable with their technology if they wanted to.
"It's all just observation," Yarowsky adds. "Children do the same thing, but they also do it through visual stimulation and feedback. They see a book and hear the word 'book,' and eventually they learn that it's a book. They see a bird with its wings flapping around and learn that is called a bird. It's the same with machines, only they have much better memories. Computers could remember exactly when and where they saw the words bird and book."
Except, no. Humans are basically generalization machines. Babies are able to grasp very quickly that words apply to categories of things -- not just that a *specific* item is a bird or a book, but to learn "I know a bird when I see it", even without necessarily being able to provide a scientific definition. Computers can be built to emulate this ability, but learning word-to-word mappings isn't *nearly* the same as learning abstract concepts and which words apply to them.
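To make concrete what "learning word-to-word mappings" amounts to, here is a toy sketch of the Washington/Huashengdun idea from the quoted article: count how often a source word and a target word turn up in the same sentence pair and score each pairing with a Dice coefficient. The corpus, the function names and the choice of Dice are all my assumptions, and this is not Google's or Yarowsky's actual method. Real systems (the IBM alignment models, for one) iterate over the alignments and take word position into account.

```python
from collections import Counter
from itertools import product

def cooccurrence_scores(pairs):
    """Score every (source word, target word) pairing with the Dice
    coefficient: 2 * joint count / (source count + target count)."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in pairs:
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return {(s, t): 2 * n / (src_count[s] + tgt_count[t])
            for (s, t), n in joint.items()}

def best_translation(word, scores):
    """Pick the target word that co-occurs most tightly with `word`."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None

# Invented four-sentence corpus; pinyin stands in for the Chinese side.
toy_pairs = [
    ("washington is cold", "huashengdun hen leng"),
    ("i like washington", "wo xihuan huashengdun"),
    ("beijing is cold", "beijing hen leng"),
    ("i like beijing", "wo xihuan beijing"),
]
scores = cooccurrence_scores(toy_pairs)
print(best_translation("washington", scores))  # huashengdun
```

Note what it cannot do: on this corpus "cold" is tied between "hen" and "leng", and nothing in the counts says anything about what a Washington is, which is rather the point being made above.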
I guess some languages are harder to translate than others, and until they come up with a really good AI, they won't make it. Languages like Japanese simply lack a lot of concepts that are in English, German, French and the like. No plural or future tense for example. "Neko no mimi" could mean cat ears in general, the ears of a bunch of cats, a specific cat's ears, one specific ear of a cat and so on. Stuff like this is usually clarified by the context. But depending on the text, the context might be considered as understood and therefore not be specified in a sentence.
If Googlefish learns only on a sentence pattern basis, this will not really help any more in translating Japanese texts than current technology does. To adequately grasp the contents of a text and correctly translate it, a lot of AI work will need to be done for these languages...
Bitten Apples are still better than dirty Windows...
Of course, the issue would be for me to show that I add value to what may freely (presumably) be gotten from the web. And luckily enough, no translation software has come close to providing literature-quality work.
In my mind, Google's choice of the UN indicates a confidence that they will reach a high level of accurate technical translation. This makes great business sense, as the UN is typical of markets that will require a quick turnaround on translation, and thus will be a great proving ground.
Also, those docs are all written in an argot which is highly repetitive and quite uniform. Thus, Google has, in a way, set itself up for success.
Rick Mourneau's Lexical Semantics details his creation of a machine translation intermediary language.
Absolutely fascinating stuff if you're into that sort of thing. Though definitely a less AI-esque attempt at the problem.
---
On an unrelated topic: if the stupid captchas instituted by the idiotic editors continue for much longer, I will go out of my way and null-route all slashdot ad sources at both home and work.
Akarsz Magyar Gentoo fórumot? Akkor
Great, we create an advanced translator and then use the words of politicians to train it. Now everything we run through it still won't make sense!
Coder's Stone: The programming language quick ref for iPad
Hip?
Search on long phrases like this:
http://www.google.com/search?hl=en&q=history+chocolate+has+been+associated+with+romance+and+sharing
Doesn't find this:
http://www.cadbury.co.uk/
But finds a lot of sites that clone the text, like this:
http://search.hotbot.co.uk/results/chocolate/
http://yahooshopping.rediff.com/yahooshopping/events2004/newyear/yhxmas-1-4-0-0-0-1021752.htm
http://www.jlr.co.uk/partners.htm
It's not hip to bash Google; it's deserved criticism for launching a poor result. They're getting off lightly.
Learning from pre-translated texts is a great start...
Step two should be human corrections to machine-translated documents (learn from your mistakes - like we do), should it not?
MadCow.
I used to have a sig, but I set it free and it never came back.
Many bible translations aren't made from the original languages but from other modern language versions.
Thus, I'd expect you'd find a French translation of the NIV, which is quite a modern translation in the first place.
"The spirit is willing, but the flesh is weak"
The computer crunched and crunched, tapes spun (this was back in the 60's) and eventually it printed out:
"The wine is pretty good, but the steak is lousy"
Therein encapsulated is all the folly of every attempt at word-matching translation.
As long as we have another Google thread started, what are some stock thoughts? Jim Cramer is advocating this stock upwards of $440 on the basis of forward P/E and a 30% earnings growth. His main thought process is "Think of this stock as a $26 stock going to $35." However, he is not taking options into account. Options hedging strategies on GOOG are extremely costly at near-the-money options, and within two trading days one can become wealthy or poor trading GOOG options.
Anyway, I am a hedge fund manager, and our fund is growing increasingly bearish on this puppy as well as the whole U.S. economy, so I just wanted to pique some interest of the /. community. However, we do not have any position in GOOG at this time.
-Aj
http://theopenfund.com/
-------
artlu.net
I can see this working for languages with similar grammars (like English-German or even English-Chinese), but once you throw in languages with somewhat different grammars (like English-Japanese or English-Basque), I can't see how a statistical approach will succeed.
So long and thanks for all the fish . . . !!!
John, the cunning linguist.
Flourescent (adj): smelling like ground wheat.
Mod parent "-1 whiny bitch", please.
I modded this as redundant, but since it's now modded 5, Interesting, I'm going to post to get my mod points back.
Jesus, you suck. Since your 'redundant' mod got overpowered, you posted to get your mod points back? I think the only thing worse than doing that is admitting you did that. What a fucktard.
It would have been much more interesting to see how the new translator would handle the news blurb.
Yes, it would. Why didn't you post that then, instead of whining about how your mod didn't matter? Oh...what's that? You don't have access to the 'new translator'? No one does? Then STFU.
I mean... Something like half the web's translation services seem to be licensing Systran's aging engine, and the rest are even worse. Yes, there are ambiguities that are hard to take care of, but with computers great at managing a lot of data, you'd think they'd at least have more complete dictionaries. :-p And regarding the ambiguities -- this is where a good engine comes into play. The better it is, the more it'll be able to correctly resolve by analyzing the context.
Beware: In C++, your friends can see your privates!
Statistical MT methods are old hat and have even been used in things like automatic image annotation years ago. Parallel text correspondence learning is not novel. Short bib.
MT is a VERY difficult problem space, especially for languages that are non-Western (unlike Spanish, German, English, French).
I am a translator and interpreter and part-time computer geek. I have tried various computer translation programs and have never been satisfied with any of them. I hoped they would at least be good enough to give me a draft translation that would only need minor editing, but none have ever lived up to the hype. I have found that it is still faster to translate the old-fashioned way.
The biggest problem I have found with these programs is not necessarily the programs' fault. People tend to write the way they speak, and that is seldom grammatically correct and often filled with the jargon from their particular industry and/or slang. (A good example is this post.) The programs I have used are not designed to work this way. They seem to rigidly adhere to the rules of the particular languages they are designed to translate. The other problem I have had relates directly to the jargon. It can take weeks of work to add all or most of the translations I need in order to get a relatively accurate translation out of the programs I have used, and that is too long.
At the end of all of this, my hope is that someone can get it right so I can spend less time typing translations and more time interpreting. I enjoy the personal contact and the challenge of interpreting much more than translating alone in my office. Like I said earlier, even a program that could produce a decent draft document would be helpful. So I'll keep trying them out as they improve, but until they reach that point, I'll still do it the old-fashioned way because for me it's faster.
The Mathematics of Statistical Machine Translation: Parameter Estimation by Brown, Pietra, et al. IBM was on this a while ago, and other efforts have improved upon this work, through the use of Maximum Entropy, etc.
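For anyone wondering what "parameter estimation" means in practice here, below is a minimal sketch in the spirit of the simplest IBM-style word translation model, trained with EM on sentence pairs. It is not code from the paper: the function name `train_word_model`, the three toy German-English pairs, and the decision to leave out a NULL word are all my own simplifying assumptions.

```python
from collections import defaultdict


def train_word_model(pairs, iterations=10):
    """Estimate t(e|f), the probability that foreign word f translates to
    English word e, from sentence pairs alone, using EM.

    pairs: list of (foreign_tokens, english_tokens).
    Returns a dict keyed by (e, f).
    """
    english_vocab = {e for _, es in pairs for e in es}
    # Start with no knowledge: every English word is equally likely.
    t = defaultdict(lambda: 1.0 / len(english_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f being translated as e
        total = defaultdict(float)  # expected count of f being translated at all

        # E-step: split each English word's "credit" across the foreign
        # words in the same sentence pair, in proportion to the current t.
        for fs, es in pairs:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share

        # M-step: re-estimate t(e|f) from the expected counts.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t


if __name__ == "__main__":
    toy_pairs = [
        ("das haus".split(), "the house".split()),
        ("das buch".split(), "the book".split()),
        ("ein buch".split(), "a book".split()),
    ]
    t = train_word_model(toy_pairs)
    for e, f in [("the", "das"), ("house", "haus"), ("book", "buch")]:
        print(f"t({e}|{f}) = {t[(e, f)]:.3f}")
```

Nobody tells the program that "das" means "the". It only sees that "das" and "the" keep turning up together while their neighbours change, and the probabilities drift accordingly with each pass. The real systems add word order, phrase-level units, and vastly more data on top of this idea.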
You need quote marks (") if you're searching for a particular phrase. Add the quote marks and your webpage is ranked....#1. Learn to use Google, kthnx.
Meine Schwester ist sehr, sehr reizvoll - Nietzsche
I'd be willing to bet that if people agreed on a simple subset of their languages, machine translation would work a lot better.
autopr0n is like, down and stuff.
"John went to the bank to exchange some money."
Take for example the word "bank". When you read that word, do you activate the representation of "bank" as in "the place where you deposit currency", or "bank" as in "the place where the river's edge meets the land"?
It's both, actually. Both representations activate. So it's not so clear that representations in the brain are context dependent.
...thanks for the link there tiger. [ROWR!!!] The birds on the Cadbury splash page are hott!!
-"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
HELLO DEAR, WHITE I THAT THIS LETTER LIKES a SURPRISE, but, TO THEM, THERE not CONCERNS, all COMING IS PROPERTY. I AM Mr. HARRIS PETER, OFFICE LEADER FINANCIAL confidence $ bank PLC, which is IN MAURITIUS CLUTCHES OF EGGS some years ago, CAME a MAN, who WAS CALLED Mr. SHAW SMITH, that, Who FROM ITS COUNTRY, DETAILS FROM YOUR PART IS, TO MY COUNTRY (MAURITIUS) IN the RUBBER SECTOR.UNFORTUNATELY TO INVESTMENT, HE DIED IN a SELBSTCAbbruch. Mr. SHAW SMITH DIED, the SUM the DOLLAR 15MILLION US IN MY BANK LEAVING. I REQUEST HEREBY YOUR SUPPORT TO HELPING, the MONEY TO STATING. I BECOME THEM NEEDING, WHEN the COUSIN LATE SHAW SMITH FOR SERVING, BECAUSE at the moment, HE DOES NOT HAVE the FOLLOWING of the TRUNKS, SO THAT the MONEY ON BROUGHT can. IF THEM RECIEVE THE MONEY, THEM IS 40% TAKING, THE OVER DOLLAR 6MILLION LIKE YOUR PORTION AND IT GIVING ME THE OTHER 60%. The GOVERNMENT PLANS, the MONEY TO TAKING OVER, IF KEINS REPRESENTS ABOVE, SINCE ITS FOLLOWING OF KIN.I EXAMINES that EVERYTHING IS NOT UNDER CONTROL, SINCE I AM the ADDRESS MANAGER.SO THEM ANYTHING CREDIT, itself ABOUT.ALL TO CONCERNS, THEM DOING HAVING BEING SUPPOSED ME ANSWERS, IF THEY ARE INTERESTED THUS WE the NESSECARY -, DOCUMENTS FOR the TRANSMISSION TO PROCESSING BEGINNINGS ABILITY. OFFICE LEADER OF THE THANKS HARRIS PETER F.T.B
Clinton tours devastated Bandeh Aceh.
Of course, I knew what the writer really meant. But the Babel Fish translation into French produces exactly the meaning which I first parsed when reading that headline.
Les excursions de Clinton ont dévasté Bandeh Aceh.
If machine translation becomes more common, perhaps English writers will have to be a little more careful.
I'm a translator, and I'll be impressed by machine translations the day a computer can translate a joke so that it's also funny in the target language.
IMO, the future of written document translation lies in translation memory software, which records pairs of syntactic units as documents are translated, for future reference.
If such pairs come up in a future translation, the software can auto-replace those syntactic units within a specified tolerance, which in turn accelerates the translation process.
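To make the "within a specified tolerance" part concrete, here is a rough sketch of a translation memory with fuzzy lookup. The class name `TranslationMemory`, the 0.8 threshold, and the use of character-level similarity from Python's standard `difflib` are my own assumptions for illustration; commercial tools use their own segmentation and matching schemes.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Stores (source segment, target segment) pairs from past translations."""

    def __init__(self, threshold=0.8):
        # Minimum similarity (0.0 to 1.0) for a stored pair to be reused.
        self.threshold = threshold
        self.entries = []

    def add(self, source, target):
        self.entries.append((source, target))

    def lookup(self, segment):
        """Return (target, score) for the closest stored source segment,
        or (None, score) if nothing is similar enough."""
        best_score, best_target = 0.0, None
        for source, target in self.entries:
            score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
            if score > best_score:
                best_score, best_target = score, target
        if best_score >= self.threshold:
            return best_target, best_score
        return None, best_score


if __name__ == "__main__":
    tm = TranslationMemory(threshold=0.8)
    tm.add("Press the red button.", "Appuyez sur le bouton rouge.")
    print(tm.lookup("Press the red button!"))       # near match, reused
    print(tm.lookup("The weather is nice today."))  # no usable match
```

A fuzzy hit is only a suggestion: the translator still has to check whatever differs between the stored segment and the new one, which is why the tolerance is usually kept high.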
How will it translate W00t?
or was their choice of Arabic translation text a bit... ummm... odd? Of all the things to choose from, they chose this? Wow.
Linux with kernel panic...
MadPenguin.org
I read the headlines, notice the Google thing, I am not surprised at all - I'm like "yeap, right on cue" - but then I'm like... hey, why I'm not surprised by this thing? :-)
Such is Google, I guess...
I'll be happy when it stops 'translating' the name of my favorite French pop singer into "geostrophic."
;-)
Yes I know alizé means tradewind or some such thing, but really, there IS an extra 'e' on her name.
At present, you get stuff like:
"As for the red swimming wears being few,
The Mass phosphorus (everyone phosphorus is not) with the coffee drinking, inside the
Because photographing started, m is (the _ _) m
Even story is funny excessively, it is with * & & & & *"
Proper names typically come through as a burst of nouns, like "Showa Water Beauty Feather".
On second thought, the actual content is probably less engaging.
(-_^)
I bet you understood that word in isolation.
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
So if Google will be able to use statistical analysis to translate any language into any other language, how difficult will it be to use statistical analysis to connect actual meaning to that text? The way to train Google to understand the meaning and context of the language will be to use people with various devices attached to them (video cameras, microphones, touch/smell/heat/acceleration/pain/pleasure sensors) in order to collect statistical information on the meaning of the text.
And the first real AI will be born.
You can't handle the truth.
What is the business model on this one?
Injected ads in the translation?
Licensing the technology to corporate instant messaging and email services?
Embedding the technology into voice recognition phone systems?
Pay translation services?
http://brandonbloom.name
The language translation aspect of this system is impressive. However...
Is this a key component of the future of programming?
Give it thousands of high-level design documents. Give it the thousands of corresponding pieces of code which resulted from said documents. Do you get a system that can translate between design documents and code?
Perhaps there's going to be some pre-processing and post-processing, but I don't think this is out of the question. Think about it.
RD
Uebersetzoongsystems zee fullooeeng prudoocshun et zee fectury ruoote-a ooff zee A Mey 19 tu juoorneleests.
Looks like Dutch, which is somewhat close to English.
The problem with phonetic and dialect translations is that they rely on a non-standard way of expressing the phonetic sounds. A French speaker is going to have a different way of spelling out a Swedish accent than an English speaker.
Linguists have a precise set of symbols for describing the funny sounds that people make with their mouths. Unfortunately no one else knows this symbol set.
It might not be a bad idea to start using these standard symbols as a way to encode speech into text using computers. Maybe, just maybe, we can start a systematic and scientific way to approach machine translation from speech to text to speech in another language.
The idea of using religious texts as the basis of multiple language translation gives me pause because (no offense, but let's be real here) most religious texts were originally written by people with severe mental disorders whom we accept as messiahs and prophets simply because it is politically expedient and convenient for us to do so.
The more advanced the translations become, the greater the risk of incorporating the remnants of these mental disorders into our translation machines.
Will it let me translate Italian cooking recipes, so I can make them right? UN documents probably won't help with that. Darn, I'll have to learn Italian then.
Yes, but will it do Klingon?
/ducks
Or what about other weird languages? Darmok on the water at Tanagra! Tanagra, his arms wide! Darmok and Tanagra on the ocean!
The reliance on datasets for statistical analysis seems to be a prime opportunity for the use of the Semantic Web, where datasets could be appropriately described using an ontology, and the indexing and processing of these datasets could then be completed autonomously.
Reasonable?
You can study a book as much as you like. Copyright no more excludes a statistical analysis like this than it excludes publishing an article which points out that a book has 600 pages.
As long as the text itself is not reproduced, either explicitly or implicitly, you're fine.
They've probably been using U.N. documents because it's a nice homogenous set that's already entered into a computer, or at least is all in one place. Chasing down copies of Moby Dick in Tagalog is hardly a productive use of time. It takes too much brainpower... you can have grunts handle the processing if you have all the documents in one place to start.
That's my line!
I keep hearing that the FBI is backlogged out the wazoo when it comes to translating Arabic, Pashto, etc for terrorist messages. A news story ("60 minutes"?) on the subject stated that the FBI is also sluggish to remedy the situation due to dumbass bureaucratic game-playing.
But MT seems now to be mature enough to step in and solve the problem almost in a single stroke.
FBI, are you listening?
Perhaps Google is going to use the information made available in the Google Print Library Project and in the Google Print Publisher Program to feed this project with lots and lots of text in different languages.
This
If it were any other book, you might be able to establish a valid parallel between two different languages. However, almost every translation of the Bible is "informed by tradition." This means the translators attempted to translate the Bible in the context of what the people paying the translators believe. Almost all Bible translations are made by committees. They interpret the text through theological doctrines and dogmas that arose centuries after the Bible was written. And this "understanding" of what the Bible means can change not only from version to version, but also from culture to culture. The book is just too burdened with tradition for any two translations to parallel each other as closely as, say, two translations of Huck Finn. Andy Gaus's translation, "The Unvarnished New Testament," is the only one I've found that simply translates from the original Greek without interpretation. You would need two language versions that both attempt to suppress the translators' prejudices and beliefs to use the Bible as a corpus for translation.
User Training for Busy Programmers
For the last six years, I've been collecting data on all civil wars fought since 1816 as part of an update to the Correlates of War datasets, which have been instrumental in reshaping the scientific study of international politics. Right now, the biggest obstacle to further progress is that most of the obscure wars we're considering simply aren't described in English. The only materials on many Latin American wars (e.g. the dozen or so civil conflicts in Ecuador) are in Spanish, while information on many African revolts is only available in French. This project simply doesn't have the resources to hire full-time translators, so even basic MT would be great, for it would allow me to skim through reams of documents and online articles in order to identify the materials worth the costly time of a human translator. In addition, even a modest improvement in MT would allow me to extract data from foreign-language materials myself, since I'm generally seeking quantitative data on casualties and force levels, not a detailed description of events.
Make cheese not war 8:)
There is an arguably better solution, which is to agree on a common writing system (note that adopting a common writing system is more feasible than adopting a common language, as one need not learn any phonology). Fifty years ago, a man by the name of Charles K. Bliss developed a system he hoped would, in the future, become universally adopted. His invention was dubbed Blissymbolics. It is currently used in the field of augmentative and assistive communication, where it gives language to those who would, due to handicap, be unable to communicate with any fluency.
The basic idea behind Blissymbolics is to use mostly indexical ideographs - that is to say, e.g., the symbol for man looks somewhat like a stick figure man. There are some pure symbols, however, though they are somewhat conventional - for instance, a heart-shaped symbol represents emotion. However, it is not limited to concrete meanings, and, though I doubt it could be proved, I believe it has the same capability for expression as any other writing system, including English writing, due to its compositionality. Couple that with the fact that it can be learned quite easily, and one might begin to see that yes, this is a better solution. I am dedicated to this ideal, so if you get a chance, check out http://www.activebliss.com/ for more information about the ideal of universal communication.
Cheers,
Matt Landau
How will it handle words that have 2 or more very different meanings? Best example I can think of: the Spanish word fui = I was or I went. Fui al cine = I went to the movies. So from the context it should learn that Fui al cine = I went to the movies, NOT I was the movies.
But what happens when it's translating, say, some fantasy novel where a boy is turned into a house and it's supposed to say he was a house? How will it translate that into Spanish?
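Here is a toy sketch of how a purely statistical system might make the "fui" call: count which neighbouring words showed up with each translation in past examples, then pick the translation whose counts best fit the new context. Everything in it is made up for illustration (the function names, the six training examples, and the add-one smoothing); it is not a description of how Google's system actually does it.

```python
from collections import Counter, defaultdict


def train_context_counts(examples):
    """examples: list of (context_words, translation) pairs, where
    context_words are the words seen right after the ambiguous word."""
    counts = defaultdict(Counter)
    for context, translation in examples:
        for word in context:
            counts[translation][word] += 1
    return counts


def pick_translation(counts, context):
    """Choose the translation whose past contexts best match this one,
    using add-one smoothing so unseen words do not zero out a score."""
    vocab_size = len({w for ctr in counts.values() for w in ctr})

    def score(translation):
        ctr = counts[translation]
        total = sum(ctr.values())
        s = 1.0
        for word in context:
            s *= (ctr[word] + 1) / (total + vocab_size)
        return s

    return max(counts, key=score)


if __name__ == "__main__":
    # Hypothetical hand-made examples of what followed "fui" in parallel text.
    examples = [
        (["al", "cine"], "I went"),
        (["a", "la", "playa"], "I went"),
        (["al", "mercado"], "I went"),
        (["feliz"], "I was"),
        (["el", "primero"], "I was"),
        (["un", "niño"], "I was"),
    ]
    counts = train_context_counts(examples)
    print(pick_translation(counts, ["al", "teatro"]))
    print(pick_translation(counts, ["un", "gato"]))
```

The fantasy-novel case is exactly where this kind of counting is weakest: a context the training text never contained gives the counts nothing to go on, so the system falls back on whatever was most common elsewhere.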
There are 11 types of people, those who know unary and those who don't.
The choice of languages used to demo the new translation tool seems to point to something interesting.
In the only four slides where translations are shown, these are the original languages which are translated into English:
Slide 137 - Chinese
Slide 138 - Arabic
Slide 139 - Arabic
Slide 140 - Arabic
As accidental as these choices may be, is Google trying to sell the new translation tool to some arm of the U.S. Federal Government?
Consider that the previous two U.S. wars were fought against enemies whose holy text is considered definitive only in Arabic.
Consider that the only nation challenging the U.S. as a global superpower is China.
Why do *I* get the feeling that this won't be available "off-line".
Imagine if some corporate or government espionage entities get subpoenas or inspection "rights" to the queries and translations in the Google and other online translation engines.
Imagine if entrepreneurial but not-yet-burnt-in-the-real-world types innocently place implicit trust in these systems, only to find out the handy-dandy idea they were translating "to go international" got ripped off and disseminated by larger, faster, lawyer-backed corporations.
Imagine a world where every off-line move you made or idea you formulated on your local PC (read: Linux PC or deprecated windoze PC) got intercepted or monitored by government agencies which had the self-accorded "right" to pre-empt your "publications" or periodicals or distributions of information.
Imagine a world where more companies slide under the sheets with governments when they realize there's a profit (in money or newly-accorded exemptions or such) in providing international espionage (*cough* domestic protection) enhancement.
(Slightly lifting thin, umm, tin-foil hat...)
If you don't do much electronic writing or distribution of information and don't fear governments and don't do marginally-interesting things, then fear not, I suppose.
(Letting thin-foil hat fall back down...)
But, if you're a rogue of types, and intent on translating them for global release, the next form of "defamation" or "discrediting" could come when the government-backed translation engine fatally alters your doc with "subtle" or "nuanced" changes in diction, grammar, word choice, and the like, or just makes the "hostile" or "ingrate" document look or sound "unprofessional" by introducing improperly spelt (spelled) worlds.
(Cocking tin/thin-foil hat again)
But, then, there always are available the various traditional brick-and-mortar (and pricey) translation services in major cities to which you can drive or overnight your documents. And, likely you can SUE them for bad effort, shoddy work, and the like so long as the product rendered is not beyond contractual protections.
(SLAMMING thin foil down, AGAIN)
Now, think of some company that can't innovate, yet steals or "co-opts" or "borrows" ideas from others and runs them out of business (ie, willing to "cut off their air supply"...), and you'll likely have Google out of business because it lacks the deep $60 Bn pockets...)
(Thin foil worn out now...)
David Syes
Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"
Reminds me of:
"We have bicycles for boys with adjustable seats."
Who/what has "adjustable seats"? Boys or bicycles?
That was the topic of sentence structure, word choice, and sensible description of the subject vs the object...
It seems as pointless as Babelfish to me. Here is the Japanese I dumped in, from an email I received yesterday. This is really simple stuff. UOEZX"úi-ØjSwZÀOE±Ì½ßAOEïiî-ØA"Ñ"cA'AZODjðXZzRO©çZ nßÜB "sÌlÍ\oÄB
Here is Google's translation:
June 9th (the wood) for student experiment, workshop (the rice plant wood, Ida, the medium bridge, Miyoshi) o'clock of 9 it begins from 30 minutes. The person whose are inconvenient please requests.
Here is my translation:
In order to accommodate student experiments, the journal seminar (Inagi, Iida, Nakahashi, Miyoshi) will be moved to Thursday, June 6th at 9:30am. If this is inconvenient please let me know.
It didn't even get the abbreviated 'Thursday' right, even though it is written this way all the time. It also missed half the names, even though these are common ones.
I think google should name it after the best translator ever
You searched for a bunch of words appearing in the text on a page, expecting that page to come up straight away.
Moreover, you searched for a phrase contaminated by other search engines' search pages. Altavista has obviously had human intervention to remove these from the results; Google hasn't. I could equally find terms where Altavista was corrupted and Google wasn't - WTF does that prove?
Finally, you have to think about how Google interprets what you want from your search. You wanted a web page containing those words. To get that, Google has a mechanism whereby you can search for those words in that context. The way you entered the search terms was phrased as a request for information on "history chocolate romance sharing", as Google redacts unnecessary words from your search terms. Google is a contextual search; it searches for pages about the history of chocolate. Does the Cadbury's page contain a great deal of useful information on the history of chocolate? If so, it will appear higher. Results are ranked not by keyword, but by which webpage Google guesses is most useful for the search you want. So Cadbury's, despite possessing all your keywords, might not cut the mustard, and so is not ranked at the top.
The difference is not in the ability of google to find the page you want and rank it accordingly. The difference stems from your inadequate understanding of search engines. Google may, or may not be inferior to Altavista. I'd rather fuck a puppy than bother to find out. Your hamfisted attempt to show that it was was pointless and misleading.
And for god's sake, if you're already using Google to search, you don't want Google appearing when you type in search and click "I'm Feeling Lucky" - it's no use at all. I'm not surprised Google redacts itself from the search query - I'm surprised it's in there at all.
*shrug* Machine translation is always going to require humans to massage the translation algorithms, I suspect. Those translators who get involved early on are more likely to be in the position to be the experts here.
This sig has absolutely no significance and serves only to take up screen space and waste the time of the reader.
Somewhat off-topic in some ways, but I was amused by a story I read some years ago in a magazine, some mention made here and here, about a UN translator who, stymied by a Russian idiom which defied literal translation, drew from Shakespeare and translated it to "Something's rotten in the state of Denmark", which of course led to protests from the representative from Denmark, etc. The core of the story stays intact, but the location, date, and details of the Russian idiom vary (I remember the first time reading it, it was "something about a cow and two piles of hay" and the links I've included talk about "an orange tree, a backyard, Moscow" and "an elder-bush in the garden and an uncle in Kiev"), so there's a decent chance this is an urban legend.
Personally, I'm curious less about the idioms than I am about the MT's parsing of grammar. Not all languages use Subject-Verb-Object grammar, and the rules for adding adjectives, adverbs, suffixes, and the like vary greatly between languages and often aren't all that consistent. For instance, Russian doesn't have articles like English does, instead relying on order of words in the sentence to indicate whether one is referring to a generic instance of an object or a specific instance. The grammar section of Mark Rosenfelder's Language Construction Kit provides several examples of differing grammars in other languages. I'm currently taking ASL courses (which admittedly do not have a written form for this kind of translation) and I will freely admit that learning to express sentences in a "Timeframe-object-subject-verb-time signifier-query word" structure is kicking my ass, despite having done some studying of other languages in the past. Heck, just learning when and where to place adjectives before or after words usually takes years for most people. (That's one place where English does seem to shine. Adjectives are always in front of nouns, as best I can recall. Adverbs, on the other hand...)
Anyhow, I'll be eagerly watching the progress here, inasmuch as my scattered attention span will occasionally provoke me to check my bookmarks list...
is the quality of translation. I often see subtitles poorly done due to budget & time concerns, as well as the translator's comprehension of the source language.
If some phrases only show up once or twice in a corpus (say only once in one out of hundreds of films), and the translator didn't have the time or wits to get it right, it would be stuck for ever in the analysis.
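One common-sense guard against that problem is to refuse to trust any phrase pair the corpus has only shown once or twice. A minimal sketch, assuming the phrase pairs have already been extracted somehow; the function name `reliable_phrase_pairs` and the cutoff of 3 are arbitrary choices of mine:

```python
from collections import Counter


def reliable_phrase_pairs(observed_pairs, min_count=3):
    """Keep only (source_phrase, target_phrase) pairs seen at least
    min_count times, so one sloppy subtitle cannot set the translation."""
    counts = Counter(observed_pairs)
    return {pair: n for pair, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Hypothetical pairs pulled from subtitle files.
    observed = [("bonne nuit", "good night")] * 4 + [("bonne nuit", "good knight")]
    print(reliable_phrase_pairs(observed))
```

The trade-off is that a cutoff also throws away rare phrases that were translated correctly, which is the other half of the problem described above: a phrase that only appears once is either trusted blindly or not learned at all.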
You certainly didn't give a very technical explanation other than to assert (yes, I'm right). How exactly would one teach a neural network language?