Google's Voice-Generating AI Is Now Indistinguishable From Humans (qz.com)

Posted by BeauHD on Wednesday December 27, 2017 @01:00AM from the rise-of-the-machines dept.

An anonymous reader quotes a report from Quartz: A research paper published by Google this month -- which has not been peer reviewed -- details a text-to-speech system called Tacotron 2, which claims near-human accuracy at imitating audio of a person speaking from text. The system is Google's second official generation of the technology, which consists of two deep neural networks. The first network translates the text into a spectrogram (pdf), a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet's AI research lab DeepMind, which reads the chart and generates the corresponding audio. The Google researchers also demonstrate that Tacotron 2 can handle hard-to-pronounce words and names, as well as alter the way it enunciates based on punctuation. For instance, capitalized words are stressed, as someone would do when indicating that a specific word is an important part of a sentence. Quartz has embedded several different examples in their report that feature a sentence generated by AI along with a sentence read aloud by a human hired by Google. Can you tell which is the AI-generated sample?
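
For readers unfamiliar with the term, here is a minimal sketch of what a spectrogram is: a signal cut into short overlapping frames, with the strength of each frequency measured per frame. It uses a naive DFT from the Python standard library and a synthetic 1 kHz tone. The frame length, hop size, and sample rate are arbitrary choices for illustration, and nothing here reflects how Tacotron 2 itself computes or predicts its spectrograms.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency DFT bins of one frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectrogram(signal, frame_len=64, hop=32):
    """Return a list of time frames, each a list of frequency-bin magnitudes."""
    # A Hann window tapers the frame edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append(dft_magnitudes(frame))
    return frames

sample_rate = 8000  # Hz, assumed for this toy example
tone = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(512)]

spec = spectrogram(tone)
bin_hz = sample_rate / 64  # width of one frequency bin for 64-sample frames
# Average each bin over time and report the strongest frequency.
avg = [sum(frame[k] for frame in spec) / len(spec) for k in range(33)]
peak_hz = avg.index(max(avg)) * bin_hz
print(len(spec), "frames, strongest bin at", peak_hz, "Hz")
```

Plotting `spec` with time on one axis and frequency on the other gives the "chart" the summary describes.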

20 of 101 comments

Not so much by smallfries · 2017-12-27 01:07 · Score: 4, Informative

Despite choosing a low-quality human comparison (the audio fidelity is fine, but the timing and pronunciation are terrible), it is still quite obvious which is which. The synth version is slightly too clipped and the timing does not sound natural.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
1. Re:Not so much by jellomizer · 2017-12-27 03:09 · Score: 2
  
  I remember a claim about the Final Fantasy movie that its CGI characters were indistinguishable from real people. They only ended up hitting the Uncanny Valley very hard.
  The problem I expect in the audio is the same as with CGI: it is a bit too perfect and misses human imperfections. A computer doing a voice will do exactly what the voice is supposed to do. A narrator, while an expert at the craft, is affected by emotion. What they are reading will move them, and that response will show in their voice.
  Much like how CGI characters, even perfectly rendered ones, just don't show the details of the emotions.
  
  --
  If something is so important that you feel the need to post it on the internet... It probably isn't that important.
2. Re:Not so much by kwoff · 2017-12-27 04:34 · Score: 2
  
  The voice reminded me of the narrator for "Physics Videos by Eugene Khutoryansky". Several people have asked in that channel's comments section if it is computer-generated, but it's claimed to be a woman named Kira. AFAICT, it's a voice actor, Kira Vincent. It makes me wonder if Google had her pronounce things, and her pronunciation just happens to be somewhat synthetic-sounding :) (though I looked quickly at the research paper and didn't find a mention of "Kira" or a name for the voice).
Welcome to the wide world of.... by Zurkeyon3733 · 2017-12-27 01:25 · Score: 5, Insightful

Robocalls! :-D
Re:Baloney by rodrigoandrade · 2017-12-27 01:28 · Score: 3, Insightful

Duuuuude, it's AI!!!! Everything you can label "AI" gets a shit ton of page views.

Even my doorbell has AI in it, because it rings when it "knows" someone is at the door looking for me.
Re:Baloney by 110010001000 · 2017-12-27 01:28 · Score: 3, Insightful

Words matter, caveman. What we are calling "AI" is definitely artificial, but not intelligent. If we are going to start calling computer programs "AI" just to start another VC hype cycle, then what is the point? Microsoft Word is "AI".
Re:Baloney by Anonymous Coward · 2017-12-27 01:37 · Score: 5, Informative

Listen for the "plosives", the "p" or "b" sounds. All text-to-speech systems get them wrong, because they are generally programmed from recorded speech that is very frequency limited. There are reasons for that. Full digital sampling of sound uses analog-to-digital converters, limited by the digital sampling rate. To reduce the amount of digital storage and processing required, the designers of both recording and synthesis tools lower the sampling frequency as far as possible. They also add a low-bandwidth filter on the inputs and the outputs, to avoid sharp step functions from generating undesired artifacts on the output, and to avoid weird "beat" harmonics with the sampling frequency from confusing the recorded inputs. But the result is smearing of sharp sounds which are richer in transients, such as "t" and "p". And dear lord, does it screw up languages with "click" sounds like Zulu.
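
The smearing effect described above can be sketched in a few lines. This is a toy illustration only: it uses a moving average as a crude stand-in for a low-pass filter and an ideal step as a stand-in for a sharp transient. The filter width of 8 samples is an arbitrary assumption, and the example says nothing about whether this is actually why synthesizers mishandle plosives.

```python
def moving_average(signal, width):
    """Crude low-pass filter: each output is the mean of the last `width` inputs."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - width + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

def rise_time(signal, lo=0.1, hi=0.9):
    """Number of samples taken to climb from 10% to 90% of the final value."""
    final = signal[-1]
    first_lo = next(i for i, x in enumerate(signal) if x >= lo * final)
    first_hi = next(i for i, x in enumerate(signal) if x >= hi * final)
    return first_hi - first_lo

# An ideal step: silence, then an instantaneous jump (a very sharp transient).
step = [0.0] * 20 + [1.0] * 40

sharp = rise_time(step)                        # 0 samples: the jump is instant
smeared = rise_time(moving_average(step, 8))   # 7 samples after filtering
print("rise time before:", sharp, "after:", smeared)
```

The wider the averaging window (that is, the lower the cutoff), the longer the transient takes to rise.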
Re:Baloney by Anonymous Coward · 2017-12-27 01:44 · Score: 2, Insightful

Everyone is going to call it AI, though.
Everyone can be wrong, of course, but who loses in normal conversation? The Average Joe or a pedant?
I'm sure the technology will be referred to in the correct terms by the people who use and probably invented the correct terms. For everyone else, there's AI.
Re: What about accents? by Anonymous Coward · 2017-12-27 02:00 · Score: 2, Interesting

As speech synthesis rises in usage, my guess is evolution will eliminate harder accents like the Irish, Jamaican, Cuban, etc. It will also eventually eliminate plosive sounds, etc. The language we speak will end up leaning towards how these systems speak because they'll be more ubiquitous.
Breath by lazarus · 2017-12-27 02:02 · Score: 4, Insightful

One thing that seems to be missing from all of these is a programmatic understanding of how much air is in the lungs.
"Alexa, what is 69! (factorial)"
Listen in amazement as she rhymes off the number but then enter the uncanny valley about the time she should be taking a breath...
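To get a feel for how long that answer is, a quick sketch with Python's standard library counts the digits of 69!. Reading it as three-digit groups is just an assumption about how a voice assistant might speak it.

```python
import math

value = math.factorial(69)
digits = len(str(value))

# Spoken as groups of three digits ("... million, ... thousand, ...").
groups = math.ceil(digits / 3)
print(digits, "digits, about", groups, "three-digit groups to read aloud")
```

That is 99 digits, which is a lot to get through on one lungful of air.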

--
I am not interested in articles about life extension advancements.
Re:Baloney by mikael · 2017-12-27 02:29 · Score: 4, Funny

Same with an electric heater. The thermostat has built-in AI so that it knows to turn the heater off when it is too hot.

--
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
Re:Baloney by 110010001000 · 2017-12-27 02:58 · Score: 2

No, they don't.
Re:Baloney by Anonymous Coward · 2017-12-27 03:52 · Score: 2, Insightful

Dude, the proper definition of AI is obvious - It's whatever computers can't yet do.
Re:What about accents? by Paradise+Pete · 2017-12-27 03:53 · Score: 2

I've yet to hear a Google voice that says "kilometres" in the same way we do in Ireland.
Nobody else says anything the same way you do in Ireland.
This will be great! by burhop · 2017-12-27 03:58 · Score: 4, Funny

Hey google, read all slashdot comments to me with a sarcastic tone.
Re:Baloney by sound+vision · 2017-12-27 05:23 · Score: 3, Informative

The storage and CPU cost of recording audio are so small that they reached the point of irrelevance 15-20 years ago, for low-end consumer hardware. More like 40 years ago for professional grade equipment - around the time that CDs were introduced. Despite what a bunch of "audiophile" sites trying to push a product will tell you, it is not difficult, expensive, or taxing in any way to work with PCM audio of a sufficient bit depth and sampling rate to cover the entire range of human hearing. Or even dog hearing!

But regarding speech synthesis specifically - there is software out there, still being used by somebody I'm sure, that was designed to be run on consumer PCs back in the 90s. At that time, on those systems, there were computational limits that were relevant to sound quality. Whatever outdated software Stephen Hawking uses sounds like it renders the output at no higher than a 10 or 12 kHz sampling rate (compared to 40 - 50 kHz to cover the human hearing range). But the sampling rate is a very small part of why Hawking sounds bad. The artifacts you hear from a low sampling rate are mostly limited to high-frequency sounds being cut. (And possibly temporal smearing, depending on how you filter.) It sounds similar to turning the treble knob on your stereo all the way down.
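
The sampling-rate numbers above follow from the Nyquist limit: a sampled signal can only represent frequencies up to half its sampling rate. Below is a small sketch of that arithmetic. The helper names are made up for illustration, and the 12 kHz figure is just the comment's guess at the synthesizer's rate, not a measured value.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2

def alias_hz(tone_hz, sample_rate_hz):
    """Apparent frequency of a pure tone sampled with no anti-aliasing filter."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

print(nyquist_hz(12000))      # 6000.0: everything above 6 kHz is lost
print(nyquist_hz(44100))      # 22050.0: covers the ~20 kHz range of human hearing
print(alias_hz(8000, 12000))  # 4000: an unfiltered 8 kHz tone shows up at 4 kHz
print(alias_hz(3000, 12000))  # 3000: tones below the limit are unaffected
```

The aliasing case is why a low-pass filter is applied before sampling, and that filter is where any "temporal smearing" would come from.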

The quality problems with Hawking's synthesizer go way beyond a treble knob. Things like pacing, emphasis, minor slurring of certain sounds that are adjacent to each other, etc... problems that you take care of by making the software more intelligent, not upping the sample frequency. Which is exactly what Google is doing, and making some progress at it too. No, it doesn't sound like a human yet.
Re:Baloney by ranton · 2017-12-27 05:31 · Score: 2

Words matter, caveman. What we are calling "AI" is definitely artificial, but not intelligent. If we are going to start calling computer programs "AI" just to start another VC hype cycle, then what is the point? Microsoft Word is "AI".
People really need to start modding these types of comments as Troll and move on. AI has included basic algorithms used as a stand-in for intelligent thought since the field arguably began at the Dartmouth Summer Research Project on Artificial Intelligence over 60 years ago. At the time they were very aware of how difficult it could be to define intelligence, so they intentionally did not let that limit what was considered artificial intelligence research.
Today the researchers and field of scientific journalism both agree that machine learning and neural networks fit within the field of artificial intelligence. That is all that matters, not your personal feelings about what the field should be.

--
-- All that is necessary for the triumph of evil is that good men do nothing. -- Edmund Burke
Re: Baloney by ranton · 2017-12-27 05:32 · Score: 2

But not in science they don't! AI has a definite scientific meaning.
And since its inception in the 1950s, AI has included basic algorithms used to approximate the results of intelligent thought.

--
-- All that is necessary for the triumph of evil is that good men do nothing. -- Edmund Burke
Re:Baloney by ljw1004 · 2017-12-27 07:12 · Score: 2

Words matter, caveman. What we are calling "AI" is definitely artificial, but not intelligent. If we are going to start calling computer programs "AI" just to start another VC hype cycle, then what is the point? Microsoft Word is "AI".
There's a straightforward difference. If the logic (or business logic, or branching structure / conditionals) was authored by a human programmer then we call it a conventional program. If the logic was an emergent property of running a learning algorithm over a training set, then we call it AI.
This is a practically useful distinction for us working software engineers. (Why? The latter can't usefully be checked into source control itself; only its training data. You can't diff it. The typical bugs you get are very different between the two - the first kind of software has weird discontinuous edge cases, and the latter is generally "smooth". We engineers need different skillsets to develop and debug the two. The way we respond to requirements specs is different between the two. Each of them has its strengths at particular classes of problems - compiler-writing is dominated by the first kind; real-world sensory processing was done at first by the first kind like OpenCV up to 2010, but has been wholly eclipsed by the second kind).
No, Microsoft Word isn't "AI" under this commonly-used definition.
If you want to keep railing against it, why not (1) recognize that it's a practically useful distinction to make, (2) come up with a term you think is better?
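The distinction drawn above can be shown with a deliberately tiny example. In the first function a programmer picked the decision rule. In the second, the rule falls out of running a (very simple) learning procedure over labeled data. The spam scenario, the feature, and the data are all invented for illustration, and a one-threshold learner is far simpler than anything the comment has in mind.

```python
def handwritten_is_spam(exclamation_marks):
    # The cutoff of 3 was chosen by a human and lives in source control.
    return exclamation_marks > 3

def learn_threshold(examples):
    """Pick the cutoff that misclassifies the fewest labeled examples."""
    candidates = sorted({x for x, _ in examples})

    def errors(cut):
        return sum((x > cut) != label for x, label in examples)

    return min(candidates, key=errors)

# Hypothetical training set: (exclamation mark count, is_spam label).
training = [(0, False), (1, False), (2, False), (5, True), (6, True), (9, True)]
cut = learn_threshold(training)

def learned_is_spam(exclamation_marks):
    # The cutoff came from the data; change the data and the logic changes.
    return exclamation_marks > cut

print("learned cutoff:", cut)
print(handwritten_is_spam(3), learned_is_spam(3))
```

The two classifiers disagree on an input of 3, and the only way to see why the learned one behaves as it does is to look at the training set, not the code.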
Just a start by BradMajors · 2017-12-27 10:20 · Score: 2

In a few years, AI will progress so that it sounds more human than humans.