Microsoft Shows Off Adaptive, Multilingual Text to Speech System

Posted by ryuzaki0 on Monday March 12, 2012 @02:10PM from the still-no-sign-of-warp-drives dept.

MrSeb writes about a really cool project from Microsoft's speech research group. From the article: "Microsoft Research has shown off software that translates your spoken words into another language while preserving the accent, timbre, and intonation of your actual voice. In a demo of the prototype software, Rick Rashid, Microsoft's chief research officer, said a long sentence in English, and then had it translated into Spanish, Italian, and Mandarin. You can definitely hear an edge of digitized 'Microsoft Sam,' but overall it's remarkable how the three translations still sound just like Rashid. The translation requires an hour of training, but after that there's no reason why it couldn't be run in real time on a smartphone, or near-real-time with a cloud backend. Imagine this tech in a two-way setup. You speak into your smartphone, and it comes out in their language. Then, the person you're talking to speaks into your smartphone and their voice comes out in your language." The Techfest 2012 keynote has a demo of the technology around minute 13:00.

The big boss was impressed by another demo by Anonymous Coward · 2012-03-12 14:18 · Score: 5, Funny

"Programmeurs, programmeurs, programmeurs, programmeurs, programmeurs!"
First translation fail by HBI · 2012-03-12 14:24 · Score: 5, Funny

"My hovercraft is full of eels" would have been perfect.

--
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
1. Re:First translation fail by mug+funky · 2012-03-12 16:36 · Score: 3, Funny
  
  instead of bobcat, hovercraft contained eels. would not buy again.
Re:Do they sound alike? by Pseudonym+Authority · 2012-03-12 14:48 · Score: 5, Funny

I completely agree. It is total garbage and if it isn't absolutely flawless in every possible regard, then it should not even have been attempted.
Re:I see where this is headed. by poetmatt · 2012-03-12 14:51 · Score: 3, Funny

I want to hear a TTS that can turn Punjabi into Valley Girl.
Re:Given the torment that foreign language class by cptdondo · 2012-03-12 15:16 · Score: 3, Informative

Hehe....
I am bilingual in English and another language. When I go to that country, many of the tourist attractions have price lists in English, Spanish, Russian, Japanese, you name it. Then they have one in the local language. The prices on that one are half of what they are for the tourists. And they're written out in words, not numbers, so if you can't read them you're SOL.
So yup, you don't need to speak the other guy's language, if you're willing to play by his rules.
Just FAIL (pipe dream?) by theNAM666 · 2012-03-12 15:21 · Score: 3, Interesting

1) The translations aren't semantically equivalent (as pointed out by commenters above). I can already say "Ich bin ein dummer Amerikaner" in my own voice, without machine help. If the meaning isn't there, who cares?
2) The machine accent ain't that great, either.
All of this makes me think this is still somewhat of a pipe dream. The AI guys have been selling the idea of machine translation for years and years-- at least since the 50s, when it was promised to eliminate the need for trained State Department linguists. It's never emerged because it's still a hard problem. Even Google's translate, which beats the MS stuff by some yards, produces results which range from awkward phrasing to just plain inaccurate and misleading.
He's selling a great idea, but it's kind of like the Fountain of Youth. It ain't there, vaporware.
1. Re:Just FAIL (pipe dream?) by NoKaOi · 2012-03-12 16:44 · Score: 3, Insightful
  
  He's selling a great idea, but it's kind of like the Fountain of Youth. It ain't there, vaporware.
  Is he actually trying to sell a mature product, or is he just showing something cool? I'm not sure where the innovation is: whether it's in training text-to-speech to sound like your voice and preserving intonation across the translation (even though it's obviously not great at it yet), or just in putting a few existing technologies together. Either way, if you have speech recognition, a translator, and text-to-speech that sounds like your voice, then this is what you can have. Include preserving the intonation and you have something cool. So what if it's just showing off a cool application of existing technologies?
  Translators aren't great but are getting better...speech recognition isn't great but is getting better. Preserving intonation across the translation and including in text-to-speech in a voice that sounds kinda like your own can probably get better too. Put the 3 together and you get something useful. I think that's all it's trying to show, and I think as these technologies get better we could end up with something pretty cool.
  If this were something out of any other company, would the same people be criticizing it?
Re:microsoft and their credibility by Ethanol-fueled · 2012-03-12 15:28 · Score: 5, Funny

My employer is a Microsoft shop. Microsoft Windows Seven optimizes my productivity with its new context-sensitive search. Microsoft Office allows me to quickly compose documents and spreadsheets of arbitrary complexity.

It is no surprise that Excel is being used for engineering given its power and flexibility. Hell, a shop I worked for used Excel as its database.

Now let's get down to the nitty-gritty - Visual Studio is one of the most powerful IDEs on the face of the planet. You want power? You got it. You want speed? You got it. You want both? It empowers you, the ninety-pound weakling, with both, with minimal effort. I got a raise because I used Visual Studio. I got my dick sucked by my boss' hottest secretary because I wrote a patch in C# that prevented our ERP system from total meltdown.

Why be some boring open-source ODBC slob when you can be fast. Quick. Nimble. Packing.

Be potent. Be Microsoft.
Re:Given the torment that foreign language class by ChatHuant · 2012-03-12 16:08 · Score: 5, Insightful

That said, I don't regret learning Spanish, but learning it just so you can get a cheaper tourist trap is not worth it at all.
Of course it's not worth it, if all the benefit you find in knowing another language is saving a couple of bucks at some touristy place. But knowing a different language is much more than that. You now have access to new worlds of literature, movies, poetry and music first hand, without a translator as an intermediary (because, as the Italians say, "traduttore, traditore"!). You can talk to more people directly, understand their culture, expand your mind. You can read a whole set of new web sites, see different perspectives, or read news that isn't easily available otherwise. It opens lots of new possibilities for you - for example if you want to work for a global company, or if you ever feel like working in a different country for a few years. And even without any of those, the very effort of learning a different language improves your brain and slows mental aging.
I'm relatively fluent in three languages now, and can more or less read another two. I read books in all of them, and I find it really enriches my mind. I just started learning a fourth (Japanese), and am really looking forward to reading Japanese books in their original form (even though learning enough of the kanji characters will be a pain).
Re:Do they sound alike? by Phics · 2012-03-12 16:23 · Score: 5, Insightful

It's not garbage, and if they had real innovations, it would be nice. Instead, they've taken a few characteristics of a speaker, like pitch, and used those to model the computer voice in another language.
No, if you listened to the keynote, they took speech characteristics, and then broke the target voice pattern up into 5ms pieces and reconstructed the voice to match a reference translation from a different language. What they are doing is not only very interesting, but clearly has space for improvement and a variety of applications.

It's about as interesting as if someone said, "what would you look like if you were a boy?" (or girl, if you are male), and then sampled your eye color, hair length, nose shape, etc, and then morphed those into a stock photo of a boy. Yeah, it would have some characteristics of you, but it also wouldn't be what you would look like if you were a boy.
That's sort of the point. The sampled voice may not speak fluent Mandarin, but if you'd like it to, this technology will allow it to. A better analogy would be along the lines of taking a computerized sample of your body shape and texture, (skin, hair, face, etc), and then using 3D animation to reconstruct a model of you doing karate, even if you didn't actually know karate.
Eventually, as the 'resolution' improves, the bits of this that you disapprove of, (the computerized feel you are getting from the voice), will most certainly improve as well. But it's the underlying ideas and tech which are interesting here.

--
There are two types of people in the world; those who believe there are two types of people, and those who don't.
Re:Given the torment that foreign language class by phantomfive · 2012-03-12 16:40 · Score: 3, Informative

I just started learning a fourth (Japanese), and am really looking forward to reading Japanese books in their original form (even though learning enough of the kanji characters will be a pain).
Might want to check out this book, it is good. And since I'm giving completely unsolicited advice, the exposition of grammar in "Communicating with Japanese by the Total Method" is my favorite of all language textbooks I've seen.

--
"First they came for the slanderers and i said nothing."
Re:The Future of International Business by malakai · 2012-03-12 16:48 · Score: 3, Informative

I foresee someone attempting a friendly gesture by offering to share her mother's recipe for "shut up."

Context is context. Obviously, an English speaker hearing a Spanish speaker offer to share a recipe for "shut up" on a (up until this point) benign and friendly conference call is going to assume translation error. Better than that, translation software knows about these little mix-ups better than you do. On a Text To Speech, there's not much to do but suffer the mistranslation (or maybe they play an audible 'ping' when they warn about a context or idiosyncrasy error), but in a system that displays you something on a device, these things tend to be shaded a different color, and offer options as to what other possible meaning they may have meant, based on context.

One, our text translation software isn't foolproof, but people expect it to be.
No, they don't. No one even expects paid human translators to be perfect.

Two, live conversations depend upon both parties building on a shared experience. If each one has a different account of the experience, conversations break down very quickly. Ever tried to carry on a conversation with a schizophrenic?
Honestly, with a schizophrenic, chances are I have, at some point in my life, on IRC. But more to your point, I've played games where opposing sides are communicating from different languages via Google Translate. Think Russia vs US, and the only way to talk to them is via delayed Google Translate results. It's slow, it's tedious, and yet we somehow managed to have amazing rapport with people of like mind. The assholes were still assholes via Google Translate, and the people we wanted to work with we managed to communicate with. Again, you are ignoring the fact that incrementally better translation is still better than its predecessor. For now. Sure, one day we'll identify some uncanny valley with voice translation, and we'll all spend lots of time plotting how bad the translation software has to be for us to feel it's robotic.... but for now, any small step forward is better than the previous one.

Then again, this whole discussion is purely academic. Gene Roddenberry's estate will just claim prior art [memory-alpha.org] and prevent this from ever becoming a reality. Hopefully.
Yup, god forbid someone spends time and money on a problem that sci-fi writers got to magically make disappear in one sentence, and a prop. Maybe someday some brilliant young chap will figure out how to make warp drive not require 3x the mass of the universe for power, and Gene's children can make some more cash. Hopefully.

--
-Malakai
A Dragon Lives in my Garage
Re:Sounds cool....but.. by Gadget_Guy · 2012-03-12 19:57 · Score: 4, Informative

They sell Microsoft Office for operating systems other than Windows.
This concession to the antitrust authorities and Apple is something of an exception to the general rule and it was a brutal fight to make it come about.
What rubbish! The first version of Microsoft Office EVER was for the Mac in August 1989. The Windows release came out in November 1990. With whom did they have this "brutal fight" to get this released for the Mac?
Interestingly, according to Wikipedia, after the release of Word for the Mac in 1985 (2 years after Word for MS-DOS and Xenix), "Word for Mac's sales were higher than its MS-DOS counterpart for at least four years". It seems that Microsoft were rather pragmatic about selling software where it would make a buck!
Re:I see where this is headed. by msclrhd · 2012-03-12 20:49 · Score: 5, Informative

Provided that the speech recognition engine is good enough, it can distinguish between the /Q/ and /A/ sounds in lot (British English: /lQt/, General American English: /lAt/), cot, hot, etc, with /A/ also appearing in father /fA:D@/. This will mean that the speech recognition engine will record the actual phonemes spoken, rather than the phonemes it thinks are being spoken. With this, it can then build up a database that maps phonemes to the recorded audio.
When a given language is selected (strictly speaking it is a language + accent, as Liverpudlian English sounds different to Australian English and Mexican Spanish sounds different to Argentinian Spanish) it will have a set of rules that describe how to convert the text into phonemes specific to that accent (for example, "ook" is usually pronounced /Vk/ in English, but in Scouse English it can be /Vx/). These rules provide a set of phonemes required by the language+accent to speak it properly.
The phonemes are transcriptions of IPA-based phonemes (http://en.wikipedia.org/wiki/International_Phonetic_Alphabet). If you plot the phonemes available by the voice on the phoneme charts, you can fill in more phonemes that are similar (e.g. using /A/ instead of /Q/ if the voice does not support /Q/, or an untrilled /r/ if the trilled version is not supported, where a trilled /r/ can be found in Spanish).
Then, provided that the voice can handle all the phonemes in a language+accent, you can then map between the two, allowing your English speaking voice to speak German, Chinese, Afrikaans or whatever language you have data for. The eSpeak text-to-speech program does a simple version of this to make the German, Polish, Swedish, Romanian, Dutch, Hungarian, French and Afrikaans MBROLA voices speak English.
You can also use it to have a voice support different accents, provided you have the rules for producing the correct phonemes.
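The substitution step described above can be sketched in a few lines of Python. This is only an illustration, not how eSpeak or the Microsoft prototype does it: the function name `map_phonemes`, the `FALLBACKS` table, and the label `r_trilled` are all made up for the example, and the only substitutions listed are the two mentioned in the comment (/A/ for /Q/, untrilled /r/ for trilled /r/).

```python
# Hypothetical fallback table: for each phoneme a language+accent may require,
# the "nearby" phonemes that can stand in for it, in order of preference.
# Only the two substitutions named in the comment above are included.
FALLBACKS = {
    "Q": ["A"],          # use /A/ if the voice does not support /Q/
    "r_trilled": ["r"],  # placeholder label: untrilled /r/ for a trilled /r/
}


def map_phonemes(required, voice_inventory, fallbacks=FALLBACKS):
    """Map the phonemes a language+accent needs onto those a voice has.

    required:        list of phoneme symbols produced by the accent's rules
    voice_inventory: set of phoneme symbols the recorded voice supports
    Returns a new list in which unsupported phonemes are replaced by the
    first supported fallback; raises ValueError if none is available.
    """
    mapped = []
    for phoneme in required:
        if phoneme in voice_inventory:
            mapped.append(phoneme)
            continue
        # Pick the first listed substitute that the voice actually has.
        substitute = next(
            (c for c in fallbacks.get(phoneme, []) if c in voice_inventory),
            None,
        )
        if substitute is None:
            raise ValueError(f"voice cannot approximate /{phoneme}/")
        mapped.append(substitute)
    return mapped
```

For example, a General American voice whose inventory contains `l`, `A` and `t` but not `Q` would speak British "lot" as `map_phonemes(["l", "Q", "t"], {"l", "A", "t"})`, which yields `["l", "A", "t"]`. A real system would derive the fallbacks from distances on the IPA chart rather than from a hand-written table.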
Re:Sounds cool....but.. by symbolset · 2012-03-12 21:10 · Score: 3, Informative

The selective memory of you 'softie fans is amazing. There's a reason for these things. In 1986 Windows looked like this. Sales of Mac Office kept Microsoft alive in this period. Microsoft Office was moved to reinforce Windows as soon as Windows was a credible environment. Windows wasn't even a credible platform until Windows for Workgroups (Windows 3.11) was released in November 1993, some 7 years later (or 1/3 of the time to present day). Mac Office was so lagging for a long while after WfW launch that it was effectively discontinued, and Office's superior support of the Windows platform was a huge part of Windows assuming dominance over the superior Mac OS which had come to rely on Office, which now offered degraded inferior performance and features on the Mac OS. There were some other shenanigans you can read about in the above links. It was a very successful strategy you can read more about here - enough horrifying content to keep you awake for years. But if that's not enough, you might try these. Microsoft through these lessons evolved a strategy where all their products have to reinforce each other, and that became their core strategy. And then...
Apple got some traction in their TrueType font rendering patent suit against Microsoft and the Justice department was closing in on an antitrust action legendary in its scope and reach. Bill Gates blinked, and they settled, and now there's Mac Office, but you can't say that it's fully supported. The Mac versions lag the Windows versions by some years and are not fully compatible with each other in ways that can't be explained by OS platform differences. The Office platform supports Windows now, as you can see by all the sockpuppets who come out every time somebody mentions some non-Windows operating system to say "you can't get Microsoft Office for that and you never will." And then the rest of us chime in "Application virtualization solves that problem."
Eventually Microsoft discovered political advocacy and contributed in various ways to the installation of a government more supportive of their business activities. Then the enforcement of antitrust protections to limit them and protect us against their abuse of their monopoly became lax, the limits were quashed until those protections expired. But that's another long story for another day.

--
Help stamp out iliturcy.
Would have watched the video... by tenco · 2012-03-12 22:31 · Score: 3, Insightful

... if only my software could translate a bytestream of type video/x-ms-asf into a video.
In light of this experience, why should I believe that someone actually invented a unidirectional universal translator? Nice try.