Microsoft Announces Breakthrough In Chinese-To-English Machine Translation (techcrunch.com)
A team of Microsoft researchers announced on Wednesday they've created the first machine translation system that's capable of translating news articles from Chinese to English with the same accuracy as a person. "The company says it's tested the system repeatedly on a sample of around 2,000 sentences from various online newspapers, comparing the result to a person's translation in the process -- and even hiring outside bilingual language consultants to further verify the machine's accuracy," reports TechCrunch. From the report: The sample set, called newstest2017, was released just last fall at the research conference WMT17. Deep neural networks, a method of training A.I. systems, allowed the researchers to create more fluent and natural-sounding translations that take into account broader context than the prior approaches, called statistical machine translation. Microsoft's researchers also added their own training methods to the system to improve its accuracy -- things they equate to how people go over their own work time and again to make sure it's right.
The researchers said they used methods including dual learning for fact-checking translations; deliberation networks, to repeat translations and refine them; and new techniques like joint training, to iteratively boost English-to-Chinese and Chinese-to-English translation systems; and agreement regularization, which can generate translations by reading sentences both left-to-right and right-to-left. Zhou said the techniques used to achieve the milestone won't be limited to machine translations. The researchers caution the system has not yet been tested on real-time news stories, and there are other challenges that still lie ahead before the technology could be commercialized into Microsoft's products. You can play around with the new translation system here.
Can it translate a Chinese Reporter's "eye-roll"? 'Cause one apparently broke China's Internet
With a fellow reporter’s fawning question to a Chinese official pushing past the 30-second mark, Liang Xiangyi, of the financial news site Yicai, began scoffing to herself. Then she turned to scrutinize the questioner in disbelief.
Looking her up and down, Ms. Liang rolled her eyes with such concentrated disgust, it seemed only natural that her entire head followed her eyes backward as she looked away in revulsion.
Captured by China’s national news broadcaster, CCTV, the moment spread quickly across Chinese social media.
...
On Chinese social media, GIFs and other online riffs inspired by Ms. Liang’s epic eye roll quickly proliferated, and by evening they were being deleted by government censors. Ms. Liang’s name became the most-censored term on Weibo, the microblogging platform. On Taobao, the freewheeling online marketplace, vendors began selling T-shirts and cellphone cases bearing her image.
It must have been something you assimilated. . . .
TFS is missing the important test of accuracy: translate Chinese > English, then back to Chinese. Will any Chinese person be able to understand it? Go back and forth twice for a more serious test. If you can't get access to Microsoft's software you can easily try this test with existing software. The results can be comical if your business doesn't depend on accuracy.
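A rough sketch of that back-and-forth test, for anyone who wants to script it. Everything here is an assumption for illustration: the two "translators" are toy word-lookup stand-ins (not Microsoft's system or any real API), the uppercase tokens are made-up placeholders for the other language, and the similarity score is just character overlap from Python's difflib, which says nothing about whether a native reader would understand the result.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict

Translator = Callable[[str], str]


def make_word_translator(table: Dict[str, str]) -> Translator:
    """Build a toy word-for-word translator (hypothetical stand-in)."""
    def translate(text: str) -> str:
        # Words missing from the table pass through unchanged.
        return " ".join(table.get(word, word) for word in text.split())
    return translate


def round_trip(text: str, forward: Translator, backward: Translator,
               rounds: int = 1) -> str:
    """Translate forward then backward, `rounds` times, and return the result."""
    current = text
    for _ in range(rounds):
        current = backward(forward(current))
    return current


def similarity(a: str, b: str) -> float:
    """Character-level overlap in [0, 1]; a crude proxy, not a quality metric."""
    return SequenceMatcher(None, a, b).ratio()


# Toy lookup tables: two source words collapse onto one target token,
# so information is lost on the way out and cannot be recovered on the way back.
to_other = make_word_translator({"hydraulic": "SHUI", "ram": "YANG", "goat": "YANG"})
to_english = make_word_translator({"SHUI": "water", "YANG": "goat"})

original = "hydraulic ram"
once = round_trip(original, to_other, to_english, rounds=1)
twice = round_trip(original, to_other, to_english, rounds=2)

print(once, similarity(original, once))
print(twice, similarity(original, twice))
```

To try it against real software, swap the two toy translators for functions that call whatever service you have access to. Note that a round trip that comes back intact only shows the two directions are consistent with each other, not that the middle step was a good translation.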
...omphaloskepsis often...
I heard a story about an engineering company who used automatic translation to send documents back and forth with their international collaborators. At one point, their engineers were perplexed by the frequent mention of a "water goat" in their correspondence.
After digging through their source documents, they learned that the water goats were in fact hydraulic rams.
It's pretty obvious to an English native speaker when a translation is gibberish. A native English-only speaker can't really affirm accuracy, as you stated, but could certainly tell when something is blatantly wrong. They could also at least judge the quality of the final translation's English.
Generally speaking, most translation programs do really horribly at translating idioms, or context-sensitive but otherwise ambiguous phrases. I'd think this is a perfect application for deep learning algorithms to thrive at. Also, kudos for the article summary and headlines for not breathlessly calling this "AI", but pointing out that the techniques are used in training AI systems.
Irony: Agile development has too much inertia to be abandoned now.
I read the MS blog and skimmed the actual paper. It gives a decent overview of the system design but has basically no details on the linguistics side of things. They just hired a bunch of people to do manual translation, both for training and for testing, but the only details of the results are a single table summarizing what categories of errors occurred.
A lot of relevant information was missing. To start with, saying "Chinese language" is like saying "European language" - there isn't one unified "Chinese", but rather a variety of languages, topolects and dialects, with some level of mutual intelligibility, but it varies considerably. Not all variants use the same writing system - most use Hanzi, but there's the whole Traditional vs. Simplified issue, and some obscure varieties use entirely different systems (eg. Dungan is written using Cyrillic, despite being closer to Mandarin than many Hanzi-using topolects). And secondary writing systems abound - for teaching and for computer usage, both the Latin alphabet and Bopomofo syllabary are used, in the mainland and Taiwan, respectively.
From context, they seem to be aiming for Mandarin Chinese, the most common variety, and they only accept input in Simplified Hanzi, but they don't make that at all clear from the paper. Was the training corpus exclusively Mandarin, or did it include Cantonese or Hakka or Minnan? Was it entirely Mainstream Mandarin, or were regional dialects like Sichuanese included? The nature of the logographic writing system elides a lot of differences, but I can't see how you could completely ignore the issue. At the very least, I would expect it would be a problem for false negatives in the validation - these are issues for human translators as well. Did they dig deeper into the reported translation issues, and find any were a case of "oh, the news article was written in MSM but quoted someone using Dalian dialect" and then have to figure out whether the human or the machine was more accurate? I didn't read the paper thoroughly but I didn't see any mention at all of any of this crap.
Anyways, they may or may not have made progress on the AI front. I am even less qualified to judge that than I am the linguistics side of it. But there's so many things *not* discussed in the paper that I can't help but feel like they're overstating their results. Guess I'll have to wait for the language blogs to pick up on it.
It's not reasonable as a test set if it's chosen to be stuff that's easy to translate. I just tried some Chinese assembly instructions, and it's terrible. And I don't mean the technical stuff, I mean the introduction:
Original:
(SNIP - the chinese characters won't work here. Alas for unicode.)
Microsoft:
Structure assembly, according to a certain order, the relevance of parts to subcontract; a total of eight subcontracts, from A1 to A8, the same package of parts are related, after assembly will constitute a machine components; in order to improve efficiency, to avoid confusion, please do not mix the different packages of parts all open after mixing together!!!!
Actual:
The structure is assembled in a certain order with parts relevant to each subsystem. There are a total of eight parts packages, from A1 to A8, each package related to a particular subsystem. In order to assist assembly and avoid confusion, please do not mix parts from the different packages.
This isn't particularly technical writing. About the most complex word is subsystem, and even if you give them that as a mulligan, their translation is still almost incomprehensible. Definitely not natural English as they claim.