Automatic Translation Without Dictionaries
New submitter physicsphairy writes "Tomas Mikolov and others at Google have developed a simple means of translating between languages using a large corpus of sample texts. Rather than being defined by humans, words are characterized based on their relation to other words. For example, in any language, a word like 'cat' will have a particular relationship to words like 'small,' 'furry,' 'pet,' etc. The set of relationships of words in a language can be described as a vector space, and words from one language can be translated into words in another language by identifying the mapping between their two vector spaces. The technique works even for very dissimilar languages, and is presently being used to refine and identify mistakes in existing translation dictionaries."
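A minimal sketch of the vector-space mapping described in the summary, not the authors' actual code. It assumes toy 2-D word vectors with made-up values (real systems learn vectors with hundreds of dimensions from large corpora), fits a linear map from a few known word pairs by least squares, then translates an unseen word by picking the nearest target vector. The vocabularies, the vectors, and the names `fit_mapping` and `translate` are all hypothetical.

```python
import numpy as np

# Toy English vectors (hypothetical values, not learned from any corpus).
en = {
    "cat":   np.array([1.0, 0.2]),
    "dog":   np.array([0.9, 0.4]),
    "house": np.array([0.1, 1.0]),
    "tree":  np.array([0.3, 0.8]),
}

# Assumption for the demo: the Spanish space is the English space
# rotated by 90 degrees and scaled by 2, so a linear map exists.
ROTATE = np.array([[0.0, -1.0], [1.0, 0.0]])
pairs = {"cat": "gato", "dog": "perro", "house": "casa", "tree": "arbol"}
es = {pairs[w]: 2.0 * ROTATE @ v for w, v in en.items()}

def fit_mapping(seed_pairs):
    """Least-squares fit of W such that en_vector @ W ~= es_vector."""
    X = np.stack([en[src] for src, _ in seed_pairs])
    Y = np.stack([es[dst] for _, dst in seed_pairs])
    W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def translate(word, W):
    """Map the word into the target space, return the closest target word."""
    v = en[word] @ W
    def cosine(u):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return max(es, key=lambda target: cosine(es[target]))

# Only three pairs are used as the seed "dictionary"; "tree" is held out.
W = fit_mapping([("cat", "gato"), ("dog", "perro"), ("house", "casa")])
print(translate("tree", W))
```

The held-out word is recovered here only because the toy target space is an exact linear transform of the source space; with real embeddings the mapping is approximate and the nearest neighbour can be wrong.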
My nipples explode with delight!
Well, that sounds quite cool, but it also makes me wonder how the algorithm tells wrong associations from good ones. These things can easily go up to n^2 complexity.
Neither the article nor the PDF contains the word "pun". We're still a little way off. But hopefully we'll get better than this attempt from Google Translate: = She turned me off with her bossy manner. But Google Translate gets the OPPOSITE meaning, saying "Her attitude arbitrary pleases me too."
work in progress
And would that really work? Make that the cat wise!
'tight pussy" be translated?
I got to the chocolate box before you, that's why the hard ones have teeth marks.
Once again, Star Trek is ahead of the curve.
Syllable : It's an Operating System
Makes me think about hash functions and flash storage and data interoperability..... future..
I come to Slashdot only to read sigs. One you are reading is mine.
Reminds me a lot of the Fluid Concepts and Creative Analogies work that Hofstadter led back in the day.
I don't see this directly working very well for translation into non-lexographically swappable languages (e.g., English -> Japanese), because even if you have the idea space mapped out, you'd still have to build up the proper grammar, and you'll need rules for that.
That being said.... Holy cow, you have the idea space mapped out! That's a big chunk of Natural Language Processing and an important step in AI development. ... Understanding a sentence emergently in terms of fuzzy concepts that are an internal and internally created symbol of what's "going on", not just using a dictionary and CYC-like rules to figure it out, seems like a useful building block, but maybe I'm wrong.
Very cool stuff. Makes me want to go back and finish that CS degree after all.
Hire a Linux system administrator, systems engineer,
What's it got in its pocketses?
If I had an Ass, I'd call it Fanny Bottom, then I could slap my Ass; Fanny Bottom, on the Arse.
Agg. firefox put me on the wrong story... bye bye karma
OK, I am just having a fag. I bet that will bugger it up.
Now assign each word/phrase a certainty/confidence value, and apply your new algorithm only to words/phrases for which a literal (i.e. dictionary) translation has a low degree of confidence.
Much appreciated,
Multilingual speakers
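The confidence-gated idea suggested above could look roughly like this. It is a sketch under stated assumptions: the dictionary entries, their confidence scores, the 0.8 threshold, and the stub fallback translator are all hypothetical stand-ins, not anything from the article.

```python
from typing import Callable, Dict, Tuple

# Hypothetical dictionary: source word -> (translation, confidence in [0, 1]).
DICTIONARY: Dict[str, Tuple[str, float]] = {
    "cat": ("gato", 0.95),
    "bank": ("banco", 0.40),  # ambiguous: river bank vs. financial bank
}

def translate_word(word: str,
                   fallback: Callable[[str], str],
                   threshold: float = 0.8) -> str:
    """Use the dictionary when it is confident, otherwise defer to fallback."""
    entry = DICTIONARY.get(word)
    if entry is not None and entry[1] >= threshold:
        return entry[0]
    return fallback(word)

def stub_vector_translator(word: str) -> str:
    """Stand-in for the new algorithm; returns the word itself if unknown."""
    return {"bank": "orilla"}.get(word, word)

print(translate_word("cat", stub_vector_translator))
print(translate_word("bank", stub_vector_translator))
```

The threshold is the only tuning knob in this sketch: raise it and more words go to the new algorithm, lower it and the dictionary wins more often.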
Firefox had nothing to do with it.
It was PEBCAK, pure and simple.
Sig Battery depleted. Reverting to safe mode.
Simply because you embed your dictionary in something you choose to call a vector doesn't make it any less of a dictionary.
It's still a dictionary, and also a thesaurus. Come to think of it, a thesaurus is simply a meaning-vectored dictionary.
What's old is new again.
Mathematicians, late to the party, still trying to drink all the punch.
Um... while it is awesome and works, uh, the translator has done this for over a decade or so. Heck, the original Babel Fish (used by Yahoo, Overture, AltaVista) was the first with such concepts, pre-2000.
Cat, associated with:
Big
Steel
Heavy
Wheels
Track
Blade
With this technology we should be able to understand Dolphin-talk.
It should also allow us to detect future ape rebellions before they happen.
Was the know-nothing reporter using/regurgitating something that was mistranslated?
You WILL have to convert the old word to the one from the new language - THAT takes a dictionary operation.
The problem is that the old word may translate into many different possible new words or phrases, and the difficulty is WHICH new verbiage is correct.
What's being talked about IS combining the basic translation lookup (dictionary) with some extra association information (context of adjacent words) to try to pick the RIGHT translation and resulting new words.
So it's pretty farging stupid declaring this is 'dictionaryless'.
The word association link info can be a thousand-fold increase in the information such a translation database would need to maintain (and is largely what such real translator efforts have been doing in the past 50 years).
Tomas Mikolov and others at Google have developed a simple means of translating between languages using a large corpus of sample texts. Rather than being defined by humans, words are characterized based on their relation to other words.
Like how Google Translate has noticed that Danish domain names end in "dk" and therefore translates "dk" to "com", with "uk", "gb" and "en" as some of the other suggestions?
Sometimes "a simple means" can be too simple.
When I was in grad school, studying linguistics, computational linguistics, and automatic speech recognition, I recall hearing more than once the idea of using latent semantic analysis and such to do this kind of translation. So am I correct in assuming that this hasn't been done well in the past, and Google finally made it work well because they have larger corpora of translated texts?
This is old news, going back to 1975. Yawn. http://en.wikipedia.org/wiki/Vector_space_model
Sounds very similar to StarTrek's universal translator. You only have to say about a dozen words to map the language right?
The meanings of words, and their translations, vary with time and location. Inferring meanings from texts from 20 years ago, or from another country, state, or even region inside a state, could be risky even if the language is the "same". There have been a lot of marketing problems thanks to this kind of bad translation.
They do a great job of improving the precision of what used to be mediocre. And then, as a direct result, they not only make the errors worse, they make the errors undetectable.
CAT: small, furry, pet.
BIG CAT: big, furry, pet.
Um. Both are orange. One's a tabby. One's a tiger.
It's not good enough that your translation system has 99% accuracy whereas the old one has 90% accuracy. What matters is that the old one's 10% error rate sounded like an error (e.g. tiger becomes monster), whereas your new one's 1% passes the Turing test and can't be discerned by an intelligent listener (e.g. tiger becomes tabby).
"My friend owns a monster." -- Your friend owns what? I don't think you meant a monster. -- "eh, you know, a very big dangerous jungle cat" -- oh, like a lion -- "not a lion, it has stripes" -- oh, a tiger.
"My friend owns a tabby." -- Ok.
But there's a space between linear and quadratic called linearithmic, or O(n log n). Merge sort uses nested loops and lies in this space.
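For anyone wondering where the nested loops are: the textbook merge sort is recursive, and the nested-loop form is the bottom-up variant. A minimal sketch, with function names chosen here for illustration. The outer loop doubles the run width (about log2(n) passes) and the inner loop walks the whole list once per pass, which is where n log n comes from.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def merge_sort_bottom_up(items):
    """Iterative merge sort: returns a new sorted list."""
    items = list(items)
    n = len(items)
    width = 1
    while width < n:                        # roughly log2(n) passes
        for lo in range(0, n, 2 * width):   # each pass touches all n items
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            items[lo:hi] = merge(items[lo:mid], items[mid:hi])
        width *= 2
    return items

print(merge_sort_bottom_up([5, 2, 9, 1, 5, 6]))
```

Two nested loops do not force quadratic cost here because the outer loop runs a logarithmic number of times, not n times.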
The allusion-heavy Tamarian language has real-world analogs, such as Tropese and the tendency for users of sites closely linked to 4chan to talk in memes.
Anyone who regularly uses Google Translate has seen the problems that come with this approach.
It "translates" analogous terms in ways that make no sense. Translate "Amsterdam" from Dutch to English and it often gives you "London". Same with kilometres / miles, and other things that significantly change the meaning of the text.
With some hand-crafted guidance, the outcome can be much less useful than the more rough-sounding word-by-word machine translations from days of yore.
"Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS
It's already been partially decoded.
Some of the calls are individually identifiable, unique to the individual. The rest doesn't make much sense. A human pod is essentially land bound a lot of the time but spread apart with no inherent electromagnetic navigation organs and little of import to say. Still, they babble on to track each other after splitting up or silently stalk each other online.
I wonder how they handle synonyms, which may be much more prevalent in one language than in another.
If the destination language is poorer in synonyms than the source language, this is straightforward, and the automatic translation will just miss subtle points that cannot be translated without a periphrasis. In the opposite case, moving from a synonym-poor language to a synonym-rich language, the computer needs to choose the right word, and doing so requires some understanding of the context.
And the problem exists beyond synonyms, with sentence structures. Take the English sentence "We will give territories". In French it could become "Nous cèderons des territoires" (We will give some territories) or "Nous cèderons les territoires" (We will give the territories). Which should be chosen? It depends on the context, something the computer may have a hard time grasping.
So is it a game of 20 questions, with each answer projecting out one or more dimensions?
1. Rough word-by-word is the beginning
2. Sentence structure reorganization
3. Idiom recognition.
4. Connotation, Tone, Irony
5. Generation / Area / Nature: How a native listener can determine details about the speaker.
The result will always be annotated-looking with warnings for plays-on-words, and will always be longer with maximum detail extraction from the source language.
I'm sure there's more to do after these items are done.
Science & open-source build trust from peer review. Learn systems you can trust.
babelfish will be "created" through crossbreeding and genetic experimentation long before the language barrier on this planet is gone...
"Vector spaces" is the heart of the Google proposal. Previous posters have disassembled the weaknesses pretty well.
What a "vector spaces" analysis needs is specific vector mapping based on the sounds of speech, the rhythms of a language, the breathing of the speaker, and the physical proximity of the parts of the brain associated with hearing and the parts associated with speech.
Multiple languages exist because the growing infant's brain organizes the sounds it hears by passing the neural sensations through many layers of pattern forming and recognition processes. Multiple languages and the ambiguities in languages mean the language learning process within a developing child has some features that are quite consistent, like saying "ma ma". The rest of language acquisition spreads out in the physical vector space of the topology of the brain. Italian has been noted as wonderful for singing, Spanish as good at expressing emotion. Perhaps these languages follow slightly different paths in the brain.
An idea I picked up from digital ham radio tutorials is quadrature phase demodulation. It extracts data from a carrier signal, it looks simple, it looks like you could do it with nerve cells and it associates nicely with known large scale brain electrical activity.
I work with severely disabled kids. Language acquisition, or finding workarounds for missing or weak parts of the language pathway, is an interesting challenge. A fellow who stone-facedly ignored my spoken words laughed and smiled at me when I began signing to him in pidgin, half made-up American Sign Language.
Looks like this could be the beginning of a Universal translation scheme. Next all we need is to add voice recognition to this and Star Trek tech comes alive once again!
Lately, technology magazines regularly run articles about topics in computational linguistics (CL) that are blatantly ignorant of current research. This is just another example.
Dictionaries stopped being used for applied machine translation 10 years ago. Statistical machine translation (SMT) and the techniques described here were not developed by Google. In fact, the basic idea of SMT is over 25 years old, and distributional semantics is over 50 years old. Phrase tables for SMT are nothing new and have always contained these properties of distributional semantics (DS), which are tightly connected to vector space models (VSM); merging VSM and SMT is just the next logical step.
If you find that topic interesting, search for "statistical machine translation", "phrase tables", and "distributional semantics". Have fun spending the next 2 years reading all the stuff.
English(Chinese(Input)) became: "I strained my friend's cat loves the taste of sausage meat." Have a guess as to the original sentence.
With this story being about automated translations getting it very wrong, there was a 95% chance people would have thought you were just making a joke about Apple doing language translations!
If you had posted a follow-up like "That's what Apple translate gets when I wrote 'Orchards of apple trees have fans to spray microscopic poison dust on all trees'", it would have been perfectly believable.
John
The amusing side effect of the effectiveness of statistical machine translation is that more and more people would use machine translation instead of employing fluent humans to do the translation and that is where the fun begins: The machine-translated phrases will reenter the corpus as seed data and as the percentage of human-origin data in the corpus reduces, so does the quality of the translations as subtle errors are magnified over time.
There should be some kind of "fingerprint" added to the machine translation which can be trivially detectable so that such work doesn't reenter the corpus. Of course, the quantity of human text will decline but that will degrade the quality at a slower pace.
Fun to think about.
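One naive way to picture the "fingerprint" idea, purely as a sketch: append an invisible marker to machine output and drop marked text when building a corpus. The marker made of zero-width characters and all function names here are assumptions for illustration. Anyone can strip such a marker, so this is not a robust watermark.

```python
# Hypothetical marker: zero-width space, zero-width non-joiner, zero-width space.
MT_MARK = "\u200b\u200c\u200b"

def tag_machine_output(text: str) -> str:
    """Append the invisible marker to machine-translated text."""
    return text + MT_MARK

def is_machine_translated(text: str) -> bool:
    """Trivial detection: the marker appears somewhere in the text."""
    return MT_MARK in text

def filter_corpus(documents):
    """Keep only documents that do not carry the marker."""
    return [doc for doc in documents if not is_machine_translated(doc)]

corpus = [
    "A sentence written by a person.",
    tag_machine_output("A sentence produced by a translator."),
]
print(len(filter_corpus(corpus)))
```

Detection stays cheap this way, but copy-paste through anything that normalizes whitespace would likely erase the mark, which is the weak point of any trivially detectable fingerprint.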
No sig. Move along - nothing to see here.
Stanford actually has a peer-reviewed paper with the same model, but better, and it actually shows that it works on real machine translation:
http://ai.stanford.edu/~wzou/emnlp2013_ZouSocherCerManning.pdf
from
https://plus.google.com/u/0/communities/107785538899595981479
Pet the small furry pussy. Hmmm that could have some very different meanings...
01/01/01