Google's Voice-Generating AI Is Now Indistinguishable From Humans (qz.com)
An anonymous reader quotes a report from Quartz: A research paper published by Google this month -- which has not been peer reviewed -- details a text-to-speech system called Tacotron 2, which claims near-human accuracy at imitating audio of a person speaking from text. The system is Google's second official generation of the technology, which consists of two deep neural networks. The first network translates the text into a spectrogram (pdf), a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet's AI research lab DeepMind, which reads the chart and generates the corresponding audio elements accordingly. The Google researchers also demonstrate that Tacotron 2 can handle hard-to-pronounce words and names, as well as alter the way it enunciates based on punctuation. For instance, capitalized words are stressed, as someone would do when indicating that specific word is an important part of a sentence. Quartz has embedded several different examples in their report that feature a sentence generated by AI along with a sentence read aloud from a human hired by Google. Can you tell which is the AI generated sample?
Despite choosing a low-quality human comparison (the audio fidelity is fine, but the timing and pronunciation is terrible), it is still quite obvious which is which. The synth version is slightly too clipped and the timing does not sound natural.
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Listen for the "plosives", the "p" or "b" sounds. All text-speech systems get them wrong, because they are generally programmed from recorded speech that is very frequency limited. There are reasons for that. Full digital sampling of sound uses analog-to-digital converters, limited by the digital sampling. To reduce the amount of digital storage and processing required, the designers of both recording and synthesis tools lower the sampling frequency as far as possible. They also add low bandwidth filter on the input and the outputs, to avoid sharp step functions from generating undesired artifacts on the output, and to avoid weird "beat" harmonics with the sampling frequency from confusing the recorded inputs. But the result is smearing of sharp sounds which are more rich in transients, such as "t" and "p". And dear lord, does it screw up languages with "click" sounds like Zulu.
The storage and CPU cost of recording audio are so small that they reached the point of irrelevance 15-20 years ago, for low-end consumer hardware. More like 40 years ago for professional grade equipment - around the time that CDs were introduced. Despite what a bunch of "audiophile" sites trying to push a product will tell you, it is not difficult, expensive, or taxing in any way to work with PCM audio of a sufficient bit depth and sampling rate to cover the entire range of human hearing. Or even dog hearing!
But regarding speech synthesis specifically - there is software out there, still being used by somebody I'm sure, that was designed to be run on consumer PCs back in the 90s. At that time, on those systems, there were computational limits that were relevant to sound quality. Whatever outdated software Stephen Hawking uses, sounds like it renders the output at no higher than 10 or 12 kHz sampling rate (compared to 40 - 50 kHz to cover the human hearing range.) But the sampling rate is a very small part of why Hawking sounds bad. The artifacts you hear from a low sampling rate are mostly limited to high-frequency sounds being cut. (And possibly temporal smearing, depending on how you filter.) It sounds similar to turning the treble knob on your stereo all the way down.
The quality problems with Hawking's synthesizer go way beyond a treble knob. Things like pacing, emphasis, minor slurring of certain sounds that are adjacent to each other, etc... problems that you take care of by making the software more intelligent, not upping the sample frequency. Which is exactly what Google is doing, and making some progress at it too. No, it doesn't sound like a human yet.