Automatic Translation Without Dictionaries
New submitter physicsphairy writes "Tomas Mikolov and others at Google have developed a simple means of translating between languages using a large corpus of sample texts. Rather than being defined by humans, words are characterized based on their relation to other words. For example, in any language, a word like 'cat' will have a particular relationship to words like 'small,' 'furry,' 'pet,' etc. The set of relationships of words in a language can be described as a vector space, and words from one language can be translated into words in another language by identifying the mapping between their two vector spaces. The technique works even for very dissimilar languages, and is presently being used to refine and identify mistakes in existing translation dictionaries."
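For anyone who wants to see the "mapping between two vector spaces" idea in miniature, here is a rough sketch. It is not taken from the Google work. It assumes the mapping is a plain linear map fitted by gradient descent on a few known word pairs, and every vector, the Spanish word list, and the names `EN`, `ES`, `SEED`, `fit_mapping` and `translate` are made up for illustration.

```python
import math

# Toy 2-D "embeddings", invented for this example. Real systems learn
# vectors with hundreds of dimensions from large monolingual corpora.
EN = {"cat": (1.0, 0.2), "dog": (0.9, 0.4), "house": (0.1, 1.0), "car": (-0.8, 0.5)}
# The toy Spanish space is the English one rotated by 90 degrees.
ES = {"gato": (-0.2, 1.0), "perro": (-0.4, 0.9), "casa": (-1.0, 0.1), "coche": (-0.5, -0.8)}

# A few known translation pairs used to fit the mapping ("dog" is held out).
SEED = [("cat", "gato"), ("house", "casa"), ("car", "coche")]


def apply_map(W, v):
    """Multiply the 2x2 matrix W by the 2-D vector v."""
    return (W[0][0] * v[0] + W[0][1] * v[1],
            W[1][0] * v[0] + W[1][1] * v[1])


def fit_mapping(pairs, steps=2000, lr=0.1):
    """Gradient descent on the mean squared error ||W x - z||^2 over the seed pairs."""
    W = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        grad = [[0.0, 0.0], [0.0, 0.0]]
        for src, tgt in pairs:
            x, z = EN[src], ES[tgt]
            pred = apply_map(W, x)
            for i in range(2):
                err = pred[i] - z[i]
                for j in range(2):
                    grad[i][j] += 2 * err * x[j]
        for i in range(2):
            for j in range(2):
                W[i][j] -= lr * grad[i][j] / len(pairs)
    return W


def cosine(a, b):
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(*a) * math.hypot(*b))


def translate(W, word):
    """Map the source vector into the target space and pick the nearest target word."""
    mapped = apply_map(W, EN[word])
    return max(ES, key=lambda t: cosine(mapped, ES[t]))


W = fit_mapping(SEED)
print(translate(W, "dog"))  # prints: perro
```

The point of the toy is that "dog" never appears in the seed pairs, yet it lands next to "perro" because the two spaces have the same shape. Whether real languages line up this neatly is exactly what the comments below argue about.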
My nipples explode with delight!
hmmm?? slashdot doesn't easily accommodate unicode.
work in progress
How would "tight pussy" be translated?
"Tight pussy" would be translated automatically, and without dictionaries. This is answered right in the headline.
Once again, Star Trek is ahead of the curve.
Syllable : It's an Operating System
Reminds me a lot of the Fluid Concepts and Creative Analogies work that Hofstadter led back in the day.
I don't see this directly working very well for translation between languages that aren't lexicographically swappable (e.g., English -> Japanese), because even if you have the idea space mapped out, you'd still have to build up the proper grammar, and you'll need rules for that.
That being said.... Holy cow, you have the idea space mapped out! That's a big chunk of Natural Language Processing and an important step in AI development. ... Understanding a sentence emergently in terms of fuzzy concepts that are an internal and internally created symbol of what's "going on", not just using a dictionary and CYC-like rules to figure it out, seems like a useful building block, but maybe I'm wrong.
Very cool stuff. Makes me want to go back and finish that CS degree after all.
Hire a Linux system administrator, systems engineer,
What's it got in its pocketses?
If I had an Ass, I'd call it Fanny Bottom, then I could slap my Ass; Fanny Bottom, on the Arse.
Agg. firefox put me on the wrong story... bye bye karma
Yes exactly. For sayings google translate works not so good now. But perhaps with this technique it will be to plums in the future.
Firefox had nothing to do with it.
It was PEBCAK, pure and simple.
Sig Battery depleted. Reverting to safe mode.
With this technology we should be able to understand Dolphin-talk.
It should also allow us to detect future ape rebellions before they happen.
"tight pussy" be translated?
"The cat has drunk a saucer of wine."
When I was in grad school, studying linguistics, computational linguistics, and automatic speech recognition, I recall the idea of using latent semantic analysis and such to do this kind of translation being mentioned more than once. So am I correct in assuming that this hasn't been done well in the past, and Google finally made it work well because they have larger corpora of translated texts?
This is old news, going back to 1975. Yawn. http://en.wikipedia.org/wiki/Vector_space_model
Yes, the pretty vectors (nothing but lists of words) still have to be assembled by humans for the most part. Maybe not EVERY association, but enough of them that you can build relationships and associations indirectly, and achieve a roundabout translation, even if you end up having to go through 2 or 3 related languages to get there.
After a few words of context are translated you can, perhaps, deduce the rest. But the idea that you can do so without a dictionary is ridiculous. And putting your dictionary into digital form and calling it a vector doesn't change the fact that you still have a dictionary associating an English word with a French word and a Mandarin word.
jazz musician
Simply because you embed your dictionary in something you choose to call a vector doesn't make it any less of a dictionary.
True, but calling a dictionary a vector space doesn't make it so. For example, how "close" are the definitions of "happiness" and "joy"? In a dictionary, the only concept of "closeness" is the lexical ordering of the words themselves, and in that sense "happiness" and "joy" are quite far apart (as far apart as words beginning with h-a are from words beginning with j-o). But in some kind of adjacency matrix that shows how often these words appear in some relation to other words, they might be quite close in vector space; "guilt" and "shame" might likewise be closer to each other than either is to "happiness", and each of the four words ("happiness", "joy", "guilt", "shame") would be closer to any other of those words than to "crankshaft"; and they would probably be closer to "crankshaft" (a noun) than to "chewy" (an adjective).
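To make the "closeness" point concrete, here is a small sketch that builds co-occurrence vectors from a handful of sentences and compares them with cosine similarity. The corpus is invented, the whole sentence is used as the context window, and the names `CORPUS`, `cooccurrence_vectors` and `similarity` are just for this example. Real systems use far more text and usually learn dense vectors rather than raw counts.

```python
import math
from collections import Counter, defaultdict

# Invented mini corpus; each sentence is treated as one context window.
CORPUS = [
    "she felt happiness and warmth today",
    "she felt joy and warmth today",
    "he felt guilt and regret afterwards",
    "he felt shame and regret afterwards",
    "the mechanic replaced the crankshaft and the piston",
]


def cooccurrence_vectors(sentences):
    """For each word, count the other word tokens that share a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


VECTORS = cooccurrence_vectors(CORPUS)


def similarity(a, b):
    """Cosine similarity between the co-occurrence vectors of two words."""
    va, vb = VECTORS[a], VECTORS[b]
    dot = sum(count * vb[w] for w, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


for pair in [("happiness", "joy"), ("happiness", "guilt"), ("happiness", "crankshaft")]:
    print(pair, round(similarity(*pair), 3))
```

On this toy corpus "happiness" comes out closest to "joy", further from "guilt", and furthest from "crankshaft", which is the ordering described above. Alphabetical position plays no part in it.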
Anyhow, if you'd read the paper, at least as far as the abstract, you'd see that this is about *generating* likely dictionary entries for unknown words using analysis of some corpus of texts.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Statistical translation is always going to have issues like that, but it can perhaps reach the 'good enough' point to hold a conversation with.
I can easily see it getting confused by formal vs informal use. If it goes on association, eventually it's going to get 'lawyer' and 'extortionist' confused.
Depends on source corpus. If they trained it using one of the usual formal collections of publications, it would only have built up associations based on the slang-free usage and so would translate it as 'Tight cat.' If they have instead fed it a broader selection, perhaps culled from a web spider, it may pick up the other meaning.
I too get lawyer and extortionist confused.
Welcome to /. where we still party like it's 1999. We'll have colonies on Mars before this site gets unicode support.
Live today, because you never know what tomorrow brings
They do a great job of improving the precision of what used to be mediocre. And then, as a direct result, they not only make the errors worse, they make the errors undetectable.
CAT: small, furry, pet.
BIG CAT: big, furry, pet.
Um. Both are orange. One's a tabby. One's a tiger.
It's not good enough that your translation system has a 99% accuracy whereas the old one has a 90% accuracy. What matters is that the old one's 10% error rate sounded like an error (e.g. tiger becomes monster), whereas your new one's 1% passes the Turing test and can't be discerned by an intelligent listener (e.g. tiger becomes tabby).
"My friend owns a monster." -- Your friend owns what? I don't think you meant a monster. -- "eh, you know, a very big dangerous jungle cat" -- oh, like a lion -- "not a lion, it has stripes" -- oh, a tiger.
"My friend owns a tabby." -- Ok.
Slashdot has a fairly strict code point whitelist because there were problems in the past with trolls using directionality override characters to break Slashdot's layout and big blocks of foreign characters to make not-ASCII ASCII art.
Anyone who regularly uses Google Translate has seen the problems that come with this approach.
It "translates" analogous terms in ways that make no sense. Translate "Amsterdam" from Dutch to English and it often gives you "London". Same with kilometres / miles, and other things that significantly change the meaning of the text.
With some hand-crafted guidance, the outcome can be much less useful than the more rough-sounding word-by-word machine translations from days of yore.
"Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS
Rimmer, Lister
I am officially gone from
Synonyms are only the tip of the iceberg: there are so many other problem areas. Collocations (words that 'go together'): we can say a 'tall boy', but not a 'high boy'; 'a large beer', but not 'a big beer'. Connotations (attitudes, feelings and emotions that a word acquires): compare 'a slim girl' with 'a skinny girl'. Idioms: 'hot potato' and 'red herring' cannot be translated directly into any other language. Add irony and sarcasm to the mix, class and regional usage, dialects, diglossia (for example, demotic and classical Arabic), puns and plays on words - the list goes on. Machine translation is a chimera.
I understand that collocations are addressed by their model: they study texts to discover that 'boy' may be preceded by 'tall' but not by 'high', and that in French, 'garçon' may be preceded by 'grand' but not 'haut'. That enables them to translate without a hitch.
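A crude way to picture "discovering" a collocation is plain bigram counting, sketched below. This is only an illustration, not how the model in the story works: the sentences are made up, and `bigram_counts` and `collocation_count` are names invented for the example.

```python
from collections import Counter

# Made-up mini corpus; a real model would see millions of sentences.
CORPUS = [
    "a tall boy stood near the high wall",
    "the tall boy climbed a high fence",
    "a high wall hid the tall girl",
]


def bigram_counts(sentences):
    """Count every pair of adjacent words across the corpus."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


BIGRAMS = bigram_counts(CORPUS)


def collocation_count(modifier, noun):
    """How often `modifier` appears directly before `noun`."""
    return BIGRAMS[(modifier, noun)]


print(collocation_count("tall", "boy"))  # prints: 2
print(collocation_count("high", "boy"))  # prints: 0
```

A pair that never shows up in a large corpus ('high boy') is a hint that it is not a natural collocation, while a frequent pair ('tall boy') is. Counts like these say nothing about meaning shifts that depend on word order.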
But even adjective handling may come with traps. Adjectives in French may appear before or after a noun. You may say 'un grand garçon' or 'un garçon grand'; the meaning is the same most of the time. But there are exceptions! 'un type pauvre' is a poor guy, 'un pauvre type' is a mediocre person. Even 'grand garçon' vs 'garçon grand' may carry a subtle difference, as a father will tell his son he is 'un grand garçon' now (which means he is not a child anymore), but he will probably not tell him he is now 'un garçon grand' (which just means he is tall). I guess this can be handled by their statistical model, but at some point they will need to add some logic to handle it. I guess it falls in the idiom category.
Puns and irony are probably the most difficult part of the game. Even human translators have a hard time with them.
With this story being about automated translations getting it very wrong, there was a 95% chance people would have thought you were just making a joke about Apple doing language translations!
If you had posted a follow-up like "That's what Apple translate gets when I wrote 'Orchards of apple trees have fans to spray microscopic poison dust on all trees'", it would have been perfectly believable.
John