Google's DeepMind Develops New Speech Synthesis AI Algorithm Called WaveNet (qz.com)

Posted by BeauHD on Friday September 9, 2016 @09:20AM from the all-strung-out dept.

Artem Tashkinov writes: Researchers behind Google's DeepMind company have been creating AI algorithms which could hardly be applied in real life aside from pure entertainment purposes -- the Go game being the most recent example. However, their most recent development, a speech synthesis AI algorithm called WaveNet, beats the two existing methods of generating human speech by a long shot -- at least 50% by Google's own estimates. The only problem with this new approach is that it's very computationally expensive. The results are even more impressive considering the fact that WaveNet can easily learn different voices and generate artificial breaths, mouth movements, intonation and other features of human speech. It can also be easily trained to generate any voice using a very small sample database. Quartz has a voice demo of Google's current method in its report, which uses recurrent neural networks, and WaveNet's method, which "uses convolutional neural networks, where previously generated data is considered when producing the next bit of information." The report adds, "Researchers also found that if they fed the algorithm classical music instead of speech, the algorithm would compose its own songs."
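The quoted description ("previously generated data is considered when producing the next bit of information") can be illustrated with a toy causal, dilated convolution. This is only a minimal sketch under stated assumptions: the function names, the fixed weights, and the plain linear prediction are all hypothetical and chosen for illustration. It leaves out learned weights, nonlinearities, and any output distribution, so it should not be read as DeepMind's actual model.

```python
def causal_dilated_conv(signal, weights, dilation):
    """Toy causal convolution: output[t] uses only signal[t], signal[t - dilation], ..."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation  # look backwards only, never ahead
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of past samples visible through a stack of dilated layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


def generate(seed, weights, dilation, n_new):
    """Autoregressive loop: each new sample is predicted from earlier samples,
    then appended to the history and used for the next prediction."""
    history = list(seed)
    for _ in range(n_new):
        nxt = 0.0
        for k, w in enumerate(weights):
            idx = len(history) - 1 - k * dilation
            if idx >= 0:
                nxt += w * history[idx]
        history.append(nxt)
    return history


if __name__ == "__main__":
    print(causal_dilated_conv([1.0, 0.0, 0.0, 0.0, 0.0], [0.5, 0.5], 2))
    print(receptive_field(2, [1, 2, 4, 8]))
    print(generate([1.0, 0.0], [0.5, 0.5], 1, 3))
```

The sample-by-sample loop in `generate` also hints at why this style of synthesis is computationally expensive: every output sample needs its own pass over the history.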

46 comments

Re:Double standard...??? by NEDHead · 2016-09-09 09:36 · Score: 2

You reference 3 different nationalities - how is that a double standard?
Re:Double standard...??? by Anonymous Coward · 2016-09-09 09:38 · Score: 0

Political Correctness is such a hot mess. It's a shame how the good intentions behind it have been strangulated with censorship and agendas.
Re:Double standard...??? by theghost · 2016-09-09 09:42 · Score: 2

You are not a frictionless sphere at rest on a perfect plane in a vacuum. Surprise!
Unless you are somewhere on the autism spectrum or being willfully obtuse, figuring out why people do things is not usually very difficult. As a fun exercise, try and figure it out for yourself using clues from history and culture.

--
The only thing necessary for the triumph of evil is that good men do nothing.
Re:Double standard...??? by thinkwaitfast · 2016-09-09 09:42 · Score: 2

Political Correctness is fascism with manners
speech synthesis vs "artificial intelligence" by sittingnut · 2016-09-09 09:45 · Score: 1

while more natural speech synthesis maybe useful to create an illusion of intelligence, speech synthesis and so called "artificial intelligence", are too different things.
even more relevant learning to mimic speech and word use of others, is just another way of using "artificial intelligence".
1. Re:speech synthesis vs "artificial intelligence" by DamnOregonian · 2016-09-09 10:10 · Score: 1
  
  I two like to put scare quotes around the so called "artificial intelligence"
2. Re:speech synthesis vs "artificial intelligence" by ShanghaiBill · 2016-09-09 10:29 · Score: 3, Insightful
  
  speech synthesis and so called "artificial intelligence", are too different things.
  Accurate speech synthesis, with appropriate pronunciation and intonation, is absolutely a subset of AI. There is no way to do it with a dumb algorithm, such as a lookup table. No one has done it without machine learning.
3. Re:speech synthesis vs "artificial intelligence" by Anonymous Coward · 2016-09-09 11:37 · Score: 0
  
  Fun fact: It can be really "annoying" to put scare quotes around all "kinds" of expressions. Draw them with your "hands" while talking to colleagues.
4. Re:speech synthesis vs "artificial intelligence" by sittingnut · 2016-09-09 13:26 · Score: 1
  
  Accurate speech synthesis, ..., is absolutely a subset of AI. There is no way to do it with a dumb algorithm, such as a lookup table. No one has done it without machine learning.
  1st, no one has done it, period. even this story does not claim 'accuracy'.
  2nd, method involved here is in fact 'a dumb algorithm'. nor is there any inherent logical reason why speech synthesis cannot be done by an algorithm. mere assertion that it cannot be done is not an acceptable reason.
  3rd, so far so called "artificial intelligence","machine learning", are dumb algorithms. sorry to burst your bubble.
  -
  btw all that does not take away from my original point, that creating so called "artificial intelligence", and speech synthesis, are two different things.
5. Re:speech synthesis vs "artificial intelligence" by Anonymous Coward · 2016-09-09 14:14 · Score: 0
  
  I two like to put scare quotes around the so called "artificial intelligence"
  Fun fact: I like to put scare quotes around grammar issues "too".
6. Re:speech synthesis vs "artificial intelligence" by raftpeople · 2016-09-09 15:11 · Score: 1
  
  "3rd, so far so called "artificial intelligence","machine learning", are dumb algorithms." - Sure, but so are the algorithms that we use in our brains for object recognition and speech synthesis etc. If we ignore consciousness for a minute, all of this stuff is just very complex function approximation, whether google does it or your neurons do it.
7. Re:speech synthesis vs "artificial intelligence" by sittingnut · 2016-09-09 19:10 · Score: 1
  
  "3rd, so far so called "artificial intelligence","machine learning", are dumb algorithms." - Sure, but so are the algorithms that we use in our brains for object recognition and speech synthesis etc. If we ignore consciousness for a minute, all of this stuff is just very complex function approximation, whether google does it or your neurons do it.
  that is a mere assumption about our brains.
  we actually do not know how our brains/neurons work for the most part. so it is a big jump, an unscientific one, to think they work the same way as google's algorithms do.
8. Re:speech synthesis vs "artificial intelligence" by Anonymous Coward · 2016-09-10 00:01 · Score: 0
  
  that is a mere assumption about our brains.
  No, it's merely an assumption of yours that it's an assumption about our brains.
9. Re:speech synthesis vs "artificial intelligence" by Anonymous Coward · 2016-09-10 08:52 · Score: 0
  
  Since there is a defined meaning for the word "assumption", it is not "merely an assumption of" GP.
Re:Double standard...??? by PopeRatzo · 2016-09-09 09:46 · Score: 4, Funny

You are not a frictionless sphere at rest on a perfect plane in a vacuum.
Not yet, but I'm working on it.

--
You are welcome on my lawn.
Re:Double standard...??? by PopeRatzo · 2016-09-09 09:50 · Score: 1

You reference 3 different nationalities - how is that a double standard?
It's three double standards, so I think it should be 3^2 standard. Though I suppose you could make a case that it should be binary, so 11 standards.

--
You are welcome on my lawn.
Facebook Features 9/11 Conspiracy Theory as 'Trend by Anonymous Coward · 2016-09-09 10:02 · Score: 0

And why isn't Slashdot taking comments on Facebook Features 9/11 Conspiracy Theory as 'Trending' ???
Because the official government conspiracy is real! OPEN YOUR EYES AMERICA.
It's good by Anonymous Coward · 2016-09-09 10:23 · Score: 0

Listening to the two tracks, I have to say I'm impressed.
Re:Double standard...??? by Anonymous Coward · 2016-09-09 10:25 · Score: 0

And the alt-right are fascists who fetishize incivility.
Pick your poison.
Amazing! by Anonymous Coward · 2016-09-09 10:28 · Score: 0

It sounds more natural than a crappy sounding text-to-speech system.
In other news, Apple out-did itself by announcing the best iPhone ever.
Re:Double standard...??? by Anonymous Coward · 2016-09-09 10:32 · Score: 0

> Please explain....?????????????/
Context.
Hardest thing for nerds to figure out is that context almost always matters more than the words themselves. If it helps, think of it as a lookup table - the word itself is just a key to look up much more information, and just as importantly -- few people have identical copies of that lookup table.
Dutchman and Englishman were never derogatory terms. Chinaman frequently was. That's why.
Error in article? by davesque · 2016-09-09 10:35 · Score: 1

Did they confuse recurrent neural networks and convolutional neural networks when discussing the old versus new method of speech synthesis?
I'm unimpressed by RandCraw · 2016-09-09 10:37 · Score: 1

Flouting /. tradition, I actually listened to DeepMind's examples of their voice. They're rather unimpressive compared to the other two voice samples they compare themselves to, and very noisy. I heard much better from IBM Watson four years ago.
Methinks Deepmind published too soon.
1. Re:I'm unimpressed by Anonymous Coward · 2016-09-09 10:50 · Score: 0
  
  I guess there is an element of personal preference to it. But from asking family members which they think is best they all preferred DeepMind over the other samples.
2. Re:I'm unimpressed by SuperKendall · 2016-09-09 13:10 · Score: 1
  
  I thought they sounded generally better, except one sample that used two numbers at the end of a sentence... in that case it seemed like it pronounced them very unnaturally. More of an incremental improvement in synthesis, and it also required the sentences all be parsed out pretty carefully going in.
  
  --
  "There is more worth loving than we have strength to love." - Brian Jay Stanley
3. Re:I'm unimpressed by SigmaTao · 2016-09-09 13:18 · Score: 2
  
  This link https://text-to-speech-demo.my... allows you to experiment with the Watson version directly for anyone who is interested.
4. Re:I'm unimpressed by Visarga · 2016-09-09 18:05 · Score: 1
  
  Nah, I still prefer Alex from Mac OS. The IBM voice is smooth but unnatural in intonation, even in the example where they marked intonation on the text. I really loved the DeepMind samples, but they come at 1 second of speech generated in 90 minutes of computation, so, no chance of having that voice on my laptop.
Re: Double standard...??? by Anonymous Coward · 2016-09-09 10:41 · Score: 0

We've come a long way from Creative Labs' Dr. Sbaitso.
Anyways, to answer your question. It is the same reason why we call the countries Neder-Lands, Eng-Land, Scot-Land, Ice-Land, Fin-Land but not Chink-Land.
Horribly bad and confusing summary by markus · 2016-09-09 10:51 · Score: 1

I'll never understand why Slashdot likes to link to poorly written and misleading summaries, when the original blog post is so much more readable and informative. I suggest everybody skip the "Quartz" article and instead read the original blog post. Thankfully, for once it was in fact included in the Slashdot summary, even if it was downplayed: https://deepmind.com/blog/wave...
1. Re: Horribly bad and confusing summary by hackwrench · 2016-09-09 11:57 · Score: 1
  
  Two possibilities. One: the same reason that Wikipedia wants secondary sources instead of primary. Less biased is supposedly more accurate. Two: the submitter submits the story from where he usually sources his news and that's what they go with. My personal experience submitting stories and looking at story submissions suggests it is usually two.
2. Re:Horribly bad and confusing summary by Anonymous Coward · 2016-09-09 13:46 · Score: 0
  
  To generate more comments.
3. Re: Horribly bad and confusing summary by Visarga · 2016-09-09 18:06 · Score: 1
  
  But this time the source article is really nice. DeepMind's blog is quite good.
While the individual words are are better... by tlambert · 2016-09-09 10:59 · Score: 1

While the individual words are better... the sentence pacing is not.
This is similar to the "singing computer" pronunciation, many years ago, in which the ACM distributed CD's with the tracks on it.
You don't get the stilted words, but unless it's intentionally paced (for example, a real human would have put a pause before "directed"), it's still going to be recognizably artificial -- but worse than that: difficult for a human expecting the pacing to understand.
Given that age related hearing loss tends to cut out vowels and not consonants, this is unlikely to be a useful implementation for care giving of older people, for example, unless there are also visible facial cues associated with it, if the pacing can not be made distinct.
1. Re: While the individual words are are better... by Anonymous Coward · 2016-09-09 11:17 · Score: 0
  
  Obvious solution: also train its pacing and emphasis on a set of actual humans reading the same text for clarity (as opposed to conversationally or for speed). With a good enough training set, the ANN should be able to pick up most of what it needs without too many problems.
Games by K.+S.+Kyosuke · 2016-09-09 12:08 · Score: 2

It will be great when games will be able to use non-pre-recorded speech for dialogs. No need to have characters express just two or three different game states with one recording each.

--
Ezekiel 23:20
Calling Adam Selene . . . by msk · 2016-09-09 12:23 · Score: 1

. . . Mycroft is on the line.
Give CBS a call by Guspaz · 2016-09-09 14:15 · Score: 2

The word is that Star Trek: Discovery may attempt to use Majel Barrett's voice for the computer, due to her having recorded a complete phonetic sample before she passed. If this really does outperform the best available TTS engines, then perhaps DeepMind would be a good fit to generate that for the show: since it's supposed to be a computer, it's not the end of the world if it doesn't sound completely human...
1. Re:Give CBS a call by Anonymous Coward · 2016-09-09 17:18 · Score: 0
  
  First they do have to rename the algorithm to Majel, or at least the next version of it. |v|:)
2. Re:Give CBS a call by Anonymous Coward · 2016-09-09 19:15 · Score: 0
  
  WORKING!
  I think all they need to do is get some woman who can do a good imitation of Majel Barrett-Roddenberry to do a monotone voice since Discovery is set between the time of Pike and Kirk's command.. go for authenticity for what was on the original series in the prime universe.. For added effect have her speak through a ring modulator or phaser to sound kind of dalek esque.. like they did in Star Trek the Motion Picture.. "Attention Launch Crew.. A travel pod is now available at Cargo Six".. etc.
3. Re:Give CBS a call by Anonymous Coward · 2016-09-09 19:18 · Score: 0
  
  First they do have to rename the algorithm to Majel, or at least the next version of it. |v|:)
  I vote for Red Queen!
Is the code for it available anywhere? by mark-t · 2016-09-09 15:03 · Score: 1

[nt]

--
File under 'M' for 'Manic ranting'
Re:Double standard...??? by Anonymous Coward · 2016-09-09 18:55 · Score: 0

Political Correctness is such a hot mess. It's a shame how the good intentions behind it have been strangulated with censorship and agendas.
2009 called and they want their "hot mess" quip back.
The real difference seems to be... by Anonymous Coward · 2016-09-10 09:09 · Score: 0

WaveNet examples seem to have been produced using higher bitrate samples. Otherwise the Parametric examples sound better to me. Either way, if the sample rates were different, the comparison isn't completely fair.
Should be canon by Anonymous Coward · 2016-09-10 11:23 · Score: 0

You're right, Star Trek mythos should include the Majel voice becoming a standard computer voice sometime between TNG and TOS, and Discovery should include a voice which sounded machine, like in TOS, probably not yet standardized, so a different voice could be used.
speech synthesis and AI by Anonymous Coward · 2016-09-10 16:11 · Score: 0

Consider a 12-rule speech synthesis system (around 1967). Given a phoneme (say the symbol p, a bilabial voiceless plosive with a noise spectrum around 4000 Hz) followed by, say, i, to produce the spoken syllable pi, where the sound 'i' has three frequencies at 220, 1900 and 2700 Hz, the rule says: give the sound p a duration of 20 milliseconds with noise around 4000 Hz; the lowest frequency of the 'i' that follows p should start around 170 Hz and reach the target frequency of 220 Hz within 30 milliseconds; and so on. Thus the frequency-energy or voltage-amplitude for each pitch period has to be computed and played back at, say, 8 bits per second, decoded into analogue waves. So the rules embedded in the AI-based algorithm tell the computational part to use a table lookup, retrieve the relevant coefficients for a digital filter that generates a wave with the embedded frequencies, synthesize the digital wave, and decode it into an analogue wave via a sound card.
The setup is complicated. If you add the fundamental frequency of 80-160 Hz for males, 140-280 Hz for females and up to 400 Hz for a child, the rules become more complicated still. Tone, gender, age and other factors produce even more complexity. Thus dialect, region, mother tongue, sex, age, abnormal vocal tract compensation, nasalization and other linguistic features need AI at the source of generation.
Thus, IBM had voice type dictation, which worked only about 98% of the time for standard American English. Blacks, Jews, northerners, Indians and Chinese got less than 80% accuracy from the system. If Google has achieved 99% accuracy for the general population, based on Bayes' theorem, predictive filtering, etc., then it is a big breakthrough. Time will tell.
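The rule described above for the syllable "pi" (lowest frequency starting near 170 Hz and reaching 220 Hz within 30 milliseconds, with the other two frequencies at 1900 and 2700 Hz) can be sketched roughly as follows. This is a simplified illustration, not the 1967 system: it sums plain sine waves instead of driving digital filters from a coefficient table, and the 8000 Hz sample rate, the linear glide, and all function names are assumptions made for the example.

```python
import math

SAMPLE_RATE = 8000  # samples per second; an assumed value, not from the comment


def f1_track(t_ms, start_hz=170.0, target_hz=220.0, glide_ms=30.0):
    """Lowest frequency of 'i': glide linearly from start to target, then hold."""
    if t_ms >= glide_ms:
        return target_hz
    return start_hz + (target_hz - start_hz) * (t_ms / glide_ms)


def synthesize_vowel(duration_ms, upper_hz=(1900.0, 2700.0)):
    """Sum one gliding sinusoid and two fixed ones as a crude stand-in for 'i'."""
    n = int(SAMPLE_RATE * duration_ms / 1000)
    phase1 = 0.0
    samples = []
    for i in range(n):
        t_ms = 1000.0 * i / SAMPLE_RATE
        # Accumulate phase so the glide stays continuous as the frequency changes.
        phase1 += 2.0 * math.pi * f1_track(t_ms) / SAMPLE_RATE
        s = math.sin(phase1)
        for f in upper_hz:
            s += math.sin(2.0 * math.pi * f * i / SAMPLE_RATE)
        samples.append(s / (1 + len(upper_hz)))  # keep amplitude within [-1, 1]
    return samples


if __name__ == "__main__":
    wave = synthesize_vowel(100)
    print(len(wave), f1_track(0), f1_track(15), f1_track(30))
```

Even this toy version shows how quickly hand-written rules multiply: each phoneme pair would need its own start frequencies, targets and timings before pitch, age or dialect are considered.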