AI Goes Bilingual -- Without a Dictionary (sciencemag.org)
sciencehabit shares a report from Science Magazine: Automatic language translation has come a long way, thanks to neural networks -- computer algorithms that take inspiration from the human brain. But training such networks requires an enormous amount of data: millions of sentence-by-sentence translations to demonstrate how a human would do it. Now, two new papers show that neural networks can learn to translate with no parallel texts -- a surprising advance that could make documents in many languages more accessible.
The two new papers, both of which have been submitted to next year's International Conference on Learning Representations but have not been peer reviewed, focus on another method: unsupervised machine learning. To start, each constructs bilingual dictionaries without the aid of a human teacher telling them when their guesses are right. That's possible because languages have strong similarities in the ways words cluster around one another. The words for table and chair, for example, are frequently used together in all languages. So if a computer maps out these co-occurrences like a giant road atlas with words for cities, the maps for different languages will resemble each other, just with different names. A computer can then figure out the best way to overlay one atlas on another. Voila! You have a bilingual dictionary. The studies -- "Unsupervised Machine Translation Using Monolingual Corpora Only" and "Unsupervised Neural Machine Translation" -- were both submitted to the e-print archive arXiv.org.
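The "overlay one atlas on another" idea can be sketched on toy data. The example below is an illustration only and is not the method of either paper: it assumes the second "language" is an exact rotation of the first with the word order shuffled, which real embeddings never are. All names in it (sorted_similarity_profiles, induce_dictionary, procrustes) are hypothetical. It matches words by how each one relates to every other word in its own language, then solves for the rotation that overlays the two maps.

```python
import numpy as np

def sorted_similarity_profiles(emb):
    """For each word, the sorted list of its similarities to all words in the
    same language. Sorting makes the profile independent of word order, and
    dot products are unchanged by rotating the whole space."""
    sims = emb @ emb.T
    return np.sort(sims, axis=1)

def induce_dictionary(src, tgt):
    """Pair each source word with the target word whose profile is closest.
    No translation pairs are used at any point."""
    p_src = sorted_similarity_profiles(src)
    p_tgt = sorted_similarity_profiles(tgt)
    dists = ((p_src[:, None, :] - p_tgt[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def procrustes(src, tgt_matched):
    """Orthogonal matrix W minimising ||src @ W - tgt_matched|| (the overlay)."""
    u, _, vt = np.linalg.svd(src.T @ tgt_matched)
    return u @ vt

# Toy data (assumption): 50 "words" in 8 dimensions, unit length.
rng = np.random.default_rng(0)
n_words, dim = 50, 8
src = rng.normal(size=(n_words, dim))
src /= np.linalg.norm(src, axis=1, keepdims=True)

# The "other language": same map, rotated, with the word list shuffled.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
perm = rng.permutation(n_words)
tgt = (src @ rotation)[perm]

matches = induce_dictionary(src, tgt)   # matches[i] = target index for source word i
true_matches = np.argsort(perm)         # inverse of the shuffle
W = procrustes(src, tgt[matches])

print("dictionary accuracy:", (matches == true_matches).mean())
print("rotation recovered:", np.allclose(W, rotation))
```

On real corpora the two maps only roughly resemble each other, so a first guess like this would be noisy and would need iterative refinement; the toy case works cleanly only because the second space was constructed as an exact rotation of the first.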
In order to go "bilingual" ...
The headline says "bilingual". Neither paper uses that term.
it would have to be able to understand one language first.
It is not clear if this is true. Translation accuracy has greatly improved, and is continuing to improve, despite the NNs having no understanding of how the languages map to reality. They only learn how the languages map to each other.
"neural" nets will not cut it and they are really old
What does age have to do with anything? Biological neural nets have been around for 600 million years.
That would be fine. The number of times I have wanted a machine-translated story in the past... I dunno, ever: 0. The number of times I have wanted a technical paper, instructions, or tech specs translated is significant. Or even news. Storytelling, jokes and wordplay are the least interesting things to translate, because there are people who actually already do that.
These are very cool advances, but they don't solve the major problem of machine learning (ML): the need for lots of data.
While these approaches don't need bilingual corpora, they still need big monolingual corpora. Very few languages have those, and those that do usually also have bilingual corpora to one or more of the major world languages.
This does lower the barrier to entry significantly for those doing ML machine translation. But if you took the resources spent on gathering and curating corpora and instead invested them in rule-based systems, you could get much further in less time.
The assumption that the world is the same everywhere, and that languages are simply attached to it, lies at the bottom of this learning strategy. The example given, 'table and chair', demonstrates this. Most of these ideas belong to a 19th-century Eurocentric understanding of the world we live in. Modern neuroscience and other work points to the fact that the world we perceive is very much dominated by the language we use, and not the other way around.
Concrete example: For a large portion of the 19th and 20th centuries, many Greeks measured distance in cigarettes: how many cigarettes I will smoke while travelling from one place to another. There is no equivalent in English for this. Not only that, but the usage indicates a specific timespan as well as cultural differences.
"Idiom!" I hear you say. Consider cultures where there are many more tables than there are chairs - such as in Asia where most people sit on the floor or on cushions.
"But there are some universals - we can still use those!" - generally, there are no universals, or so few that they are not worth talking about. Talk to an anthropologist about it. Not even the concept of 'mother' is a universal.
It's good as long as all the languages are in the same language family, meaning that they share grammatical logic but have different vocabulary. But try translating English into a non-Indo-European language like Korean, with a fundamentally different way of expressing ideas, and it fails miserably. It's often not understandable at all.
(For instance: English requires a subject in every sentence for it to be complete, meaning that you say "John is growing up" even when it's obvious who we're talking about. In Korean, you mention who you're talking about at the beginning, and then it's implicit from context until you start talking about someone else, so you drop the subject in the following sentences. Machine learning systems so far don't understand this distinction, so when translating from Korean to English they keep inventing people in the sentences: "is growing up" might become "Dave is growing up" or "Alice is growing up", even though no Dave or Alice has been mentioned in the previous sentences. Those names only appeared a few times in the training material.)