Automatic Translation Without Dictionaries
New submitter physicsphairy writes "Tomas Mikolov and others at Google have developed a simple means of translating between languages using a large corpus of sample texts. Rather than being defined by humans, words are characterized based on their relation to other words. For example, in any language, a word like 'cat' will have a particular relationship to words like 'small,' 'furry,' 'pet,' etc. The set of relationships of words in a language can be described as a vector space, and words from one language can be translated into words in another language by identifying the mapping between their two vector spaces. The technique works even for very dissimilar languages, and is presently being used to refine and identify mistakes in existing translation dictionaries."
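A minimal sketch of the "mapping between two vector spaces" idea, assuming the mapping is a single linear matrix fitted from a handful of known word pairs. Everything concrete here is hypothetical: the 2-D vectors, the Spanish words, the seed list, and the helper names (`fit_mapping`, `translate`) are made up for illustration, and the toy target space is just the source space rotated by 90 degrees. Real word vectors would be learned from large monolingual corpora and have hundreds of dimensions.

```python
import math

# Toy "English" vectors (hypothetical values, not learned from any corpus).
EN = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "house": [0.0, 1.0], "tree": [-0.6, 0.8]}
# Toy "Spanish" vectors: the same geometry rotated by 90 degrees (an assumption).
ES = {"gato": [0.0, 1.0], "perro": [-0.6, 0.8], "casa": [-1.0, 0.0], "arbol": [-0.8, -0.6]}
# Seed pairs assumed to be known translations; "tree" is deliberately held out.
SEED = [("cat", "gato"), ("dog", "perro"), ("house", "casa")]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def fit_mapping(pairs, dim_src, dim_tgt, lr=0.1, steps=500):
    """Fit W by gradient descent on the summed squared error ||W x - z||^2."""
    W = [[0.0] * dim_src for _ in range(dim_tgt)]
    for _ in range(steps):
        grad = [[0.0] * dim_src for _ in range(dim_tgt)]
        for x, z in pairs:
            err = [p - t for p, t in zip(matvec(W, x), z)]
            for i in range(dim_tgt):
                for j in range(dim_src):
                    grad[i][j] += 2.0 * err[i] * x[j]
        for i in range(dim_tgt):
            for j in range(dim_src):
                W[i][j] -= lr * grad[i][j]
    return W


def translate(word, W, src_vecs, tgt_vecs):
    """Map a source word into the target space and return the nearest target word."""
    mapped = matvec(W, src_vecs[word])
    return max(tgt_vecs, key=lambda w: cosine(mapped, tgt_vecs[w]))


W = fit_mapping([(EN[s], ES[t]) for s, t in SEED], dim_src=2, dim_tgt=2)
print(translate("tree", W, EN, ES))
```

The held-out word is translated by projecting its vector through the fitted matrix and taking the closest target-language vector by cosine similarity. A candidate that lands far from every known word could be flagged as a suspect dictionary entry.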
My nipples explode with delight!
how would "tight pussy" be translated?
"Tight pussy" would be translated automatically, and without dictionaries. This is answered right in the headline.
Once again, Star Trek is ahead of the curve.
Reminds me a lot of the Fluid Concepts and Creative Analogies work that Hofstadter led back in the day.
I don't see this working directly for translation between non-lexicographically swappable languages (e.g., English -> Japanese), because even if you have the idea space mapped out, you'd still have to build up the proper grammar, and you'll need rules for that.
That being said.... Holy cow, you have the idea space mapped out! That's a big chunk of Natural Language Processing and an important step in AI development. ... Understanding a sentence emergently in terms of fuzzy concepts that are an internal and internally created symbol of what's "going on", not just using a dictionary and CYC-like rules to figure it out, seems like a useful building block, but maybe I'm wrong.
Very cool stuff. Makes me want to go back and finish that CS degree after all.
What's it got in its pocketses?
If I had an Ass, I'd call it Fanny Bottom, then I could slap my Ass; Fanny Bottom, on the Arse.
With this technology we should be able to understand Dolphin-talk.
It should also allow us to detect future ape rebellions before they happen.
"tight pussy" be translated?
"The cat has drunk a saucer of wine."
When I was in grad school, studying linguistics, computational linguistics, and automatic speech recognition, I recall hearing more than once about the idea of using latent semantic analysis and similar techniques to do this kind of translation. So am I correct in assuming that this hasn't been done well in the past, and Google finally made it work well because they have larger corpora of translated texts?
This is old news, going back to 1975. Yawn. http://en.wikipedia.org/wiki/Vector_space_model
Simply because you embed your dictionary in something you choose to call a vector doesn't make it any less of a dictionary.
True, but calling a dictionary a vector space doesn't make it so. For example, how "close" are the definitions of "happiness" and "joy"? In a dictionary, the only concept of "closeness" is the lexical ordering of the words themselves, and in that sense "happiness" and "joy" are quite far apart (as far apart as words beginning with h-a are from words beginning with j-o). But in some kind of adjacency matrix which shows how often these words appear in some relation to other words, they might be quite close in vector space; "guilt" and "shame" might likewise be closer to each other than either is to "happiness", and each of the four words ("happiness", "joy", "guilt", "shame") would be closer to any other of those words than to "crankshaft", though probably closer to "crankshaft" (a noun) than to "chewy" (an adjective).
Anyhow, if you'd read the paper, at least as far as the abstract, you'd see that this is about *generating* likely dictionary entries for unknown words using analysis of some corpus of texts.
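The "closeness" described above, where words are compared by how often they appear alongside other words, can be sketched with plain co-occurrence counts and cosine similarity. This is only an illustration under stated assumptions: the six-sentence corpus is invented, whole sentences serve as the context window, and the names `cooccurrence_vectors` and `cosine` are hypothetical helpers, not anything from the paper.

```python
import math
from collections import Counter, defaultdict

# Tiny invented corpus; a real system would use millions of sentences.
CORPUS = [
    "happiness and joy fill the heart",
    "joy and happiness fill the soul",
    "guilt and shame weigh on the heart",
    "shame and guilt weigh on the soul",
    "the crankshaft turns the engine",
    "the engine needs a crankshaft",
]


def cooccurrence_vectors(sentences):
    """For each word, count every other word occurrence in the same sentence."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


VECS = cooccurrence_vectors(CORPUS)
for pair in [("happiness", "joy"), ("happiness", "guilt"), ("happiness", "crankshaft")]:
    print(pair, round(cosine(VECS[pair[0]], VECS[pair[1]]), 3))
```

On this toy corpus "happiness" scores closer to "joy" than to "guilt", and closer to "guilt" than to "crankshaft", even though alphabetical order says nothing of the kind.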
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Statistical translation is always going to have issues like that, but it can perhaps reach the 'good enough' point to hold a conversation with.
I can easily see it getting confused by formal vs informal use. If it goes on association, eventually it's going to get 'lawyer' and 'extortionist' confused.
I too get lawyer and extortionist confused.