Translation Software That Learns by Reading
redcone writes "New Scientist is reporting that translation software that develops an understanding of languages by scanning through thousands of previously translated documents has been released by U.S. researchers. According to the article, 'The translated documents used to teach the translation algorithms can be electronic, on paper, or even audio files. The system is not only faster than other methods, but also better suited to tackling less common languages and the unusual vocabulary found in specialised or technical texts.'"
I wonder if we could train it to translate a EULA ;)
* Olaserov is in the process of thinking up a signature.
I remember hearing about this a couple years ago. They were using translations of Harry Potter and the Bible to teach this software to translate. It seems to work well. I wonder what it'd make of different translations of technical documentation. That'd probably be even more interesting than what it'd make out of 'quidditch'.
This could be great if it were open-sourced. It'd be nice to translate email, instant messages, websites, technical docs, and lots of other stuff we're currently using the fish for. The fish is nice but not that efficient to add to other programs, and its translations aren't usually that great.
At what price learning? At what cost wisdom? The price is a man's peace of mind, and the cost is his life.
Teach Software translating on scanning up
Not hard wares that sticks an comprehension of talks by scanning on thousands of fish translated papers has been vomited by US scientists.
Many existing translation not hard wares uses palm rules for botching words and phrases. But the new software, snarked by Kevin Knight and Daniel Marcu at the Information Sciences[...]
I'm a big tall mofo.
...bu7 (4n 17 unÐ3r$74nÐ £337?
"The newly born animals are then whisked off for a quick run through a giant baking oven." --heard on Food Network
As a caveat, we should be wary of saying the system "understands" a language.
I would say that humans able to translate between languages generally understand both languages, but claiming that a statistical, probabilistic model based on correlations understands a language might be a stretch.
Further reading: Searle's Chinese Room argument: http://en.wikipedia.org/wiki/Chinese_room
This is akin to asking, Does your tax software understand the tax code? Does Photoshop understand the principles of image manipulation?
Are these silly questions to ask?
Further reading: Dennett on intentionality (http://en.wikipedia.org/wiki/Dennett but the entry is pretty sparse).
RD
Don't remember exactly where I read this, but Google apparently has long believed that there is enough data on the internet alone to be able to intelligently translate... What these guys claim to have done is, it would seem, the missing piece of the puzzle for Google. I wouldn't be surprised if Google gets in on this.
The article (and the text of the original posting) makes it seem like translating a specialized technical text is somehow harder than translating, say, a newspaper article. As someone experienced in translating technical (science/engineering) documents, I can say that any tech document is far _easier_ to translate after an initial learning curve.
The main reason (I think) is that tech documents have specialised vocabulary and idioms, but these are much fewer than the idioms one has to master in order to understand the editorial page in a newspaper.
With a rudimentary knowledge of Russian and French, I have found it much easier to read an engineering textbook or paper in these languages than to read any nontechnical text. (This is not necessarily the case with other languages. Any document in Japanese for instance is an entirely different ballgame...)
This reminds me of Jamie Zawinski's hack DadaDodo, which used probability trees to create new texts from old texts by examining the probability that any given word follows the previous word/string of words. I always thought his program was cool, in that his description of it involved Markov Chains and William S. Burroughs.
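For anyone who hasn't played with this, here is a rough sketch of the word-level Markov chain idea described above. It is not DadaDodo's actual code; the function names `build_chain` and `generate`, the `order` parameter, and the sample sentence are all made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` words to the list of words seen right after it.

    Repeated followers are kept, so picking uniformly from a list
    samples each follower in proportion to how often it occurred.
    """
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)


def generate(chain, length=20, seed=None):
    """Walk the chain from a random starting state, up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # sorted so a fixed seed is repeatable
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:  # dead end: this state was never followed by anything
            break
        nxt = rng.choice(followers)
        out.append(nxt)
        state = state[1:] + (nxt,)  # slide the window forward by one word
    return " ".join(out)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the cat"
    print(generate(build_chain(sample), length=12, seed=1))
```

A larger `order` makes the output stick closer to the source text; `order=1` gives the most scrambled results.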
I did a presentation for an AI class a while ago and discovered that Microsoft already does this with their MSR-MT project. Apparently the Spanish entries in their Knowledge Base were translated by this as well.
Beware, Nugget is watching... See?
After a quick web search, all I was able to find was this site, which has a pretty sketchy TOS agreement.
...and fruit flies like a banana.
When an automated translator can handle that one without bursting into flames, I'll start to believe.
Fortunately I had the next best thing in High School Spanish. The trick is simply going to the #spain channel on efnet and talking nice to some people. You'd be amazed at how often my teacher would fail my fellow students because they attempted using the primitive babelfish.altavista.com to do their work for them; she could easily spot the syntax errors and misspelled English words which were never translated.
Until I see this new process in the works, however, there is nothing that will make me believe it's better than finding another human who can *understand* what you are saying and the context you are implying.
Here's a couple of suggestions for you:
r3Ð(0n3 wr173$ "N3w $(13n71$7 1$ r3p0r71n9 7h47 7r4n$£4710n $07w4r3 7h47 Ð3v3£0p$ 4n nÐ3r$74nÐ1n9 0 £4n9493$ b¥ $(4nn1n9 7hr09h 7h0$4nÐ$ 0 pr3v10$£¥ 7r4n$£473Ð Ð0(m3n7$ h4$ b33n r3£34$3Ð b¥
.$. r3$34r(h3r$. 4((0rÐ1n9 70 7h3 4r71(£3 "7h3 7r4n$£473Ð Ð0(m3n7$ $3Ð 70 734(h 7h3 7r4n$£4710n 4£90r17hm$ (4n b3 3£3(7r0n1(, 0n p4p3r, 0r 3v3n 4Ð10 1£3$. 7h3 $¥$73m 1$ n07 0n£¥ 4$73r 7h4n 07h3r m37h0Ð$, b7 4£$0 b3773r $173Ð 70 74(|{£1n9 £3$$ (0mm0n £4n9493$ 4nÐ 7h3 n$4£ v0(4b£4r¥ 0nÐ 1n $p3(14£1$3Ð 0r 73(hn1(4£ 73x7$.""
And translation #2:
REDCONE WRIETS NU SCEINTIST IS R3PORTNG TAHT TRANSLATION R TAHT D3V3LOPS AN UNDERSTANDNG OF LANGUAEGS BY SCANNG THROUGH THOUSANDS OF PREVIOUSLY TRANSLAETD DOCUMENTS HAS B3N REL3AESD BY US!!!! OMG R3S3ARCHARS!!1!1!! LOL ACORDNG 2 DA ARTICL3 TEH TRANSLAETD DOCUMENTS US3D 2 T3ACH TEH TRANSLATION ALGORITHMS CAN B 3LECTRONIC ON PAEPR OR 3V3N AUDIO FIELS!!1111 TEH SYSTEM IS NOT ONLY FASTER THAN OTH3R M3THODS BUT ALSO BT3R SUIETD 2 TAKLNG LAS COMON LANGUAEGS AND TEH UNUSUAL VOCABULARY FOUND IN SPACIALIESD OR TECHNICAL TEXTS!1!! WTF
The basic approach was developed over 10 years ago by IBM: The Mathematics of Statistical Machine Translation. Free software has also been available for a while; see http://www.fjoch.com/GIZA++.html.
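To give a flavour of what "learning from previously translated documents" can mean, here is a toy sketch of the simplest word-translation model from that line of work (usually called IBM Model 1), trained with EM. This is my own illustration, not the system from the article and not GIZA++; the function name `train_model1` and the three-sentence German/English corpus are made up.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate word translation probabilities t(f | e) with EM.

    `pairs` is a list of (foreign_words, english_words) sentence pairs.
    Returns a dict-like table keyed by (f, e).
    """
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Start with every foreign word equally likely for every English word.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        # E-step: share each foreign word among the English words
        # in the same sentence, in proportion to the current t.
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


if __name__ == "__main__":
    corpus = [
        ("das haus".split(), "the house".split()),
        ("das buch".split(), "the book".split()),
        ("ein buch".split(), "a book".split()),
    ]
    table = train_model1(corpus, iterations=20)
    for f, e in [("haus", "house"), ("das", "house"), ("buch", "book")]:
        print(f, e, round(table[(f, e)], 3))
```

Nothing tells the program that "haus" means "house". It only sees that "das" also turns up next to "the" in another sentence, so "haus" is left as the better explanation for "house". Real systems add word order, phrases and a language model on top of this.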
Skynet begins to learn at a geometric rate. It becomes self-aware at 2:14am Eastern Time....
k apr3ndist3 3sp4ni0l en IRC?
q w3n0! 3so si está 1337!