Google Launches More Realistic Text-To-Speech Service Powered By DeepMind's AI (theverge.com)
Google is launching a new AI voice synthesizer, named Cloud Text-to-Speech, that will be available for any developer or business that needs voice synthesis on tap, whether that's for an app, website, or virtual assistant. The Cloud Text-to-Speech service is being powered by WaveNet, software created by Google's UK-based AI subsidiary DeepMind. The Verge explains why this is significant: First, ever since Google bought DeepMind in 2014, it's been exploring ways to turn the company's AI talent into tangible products. So far, this has meant using DeepMind's algorithms to reduce electricity costs in Google's data centers by 40 percent and DeepMind's forays into health care. But, directly integrating WaveNet into its cloud service is arguably more significant, especially as Google tries to win cloud business away from Amazon and Microsoft, presenting its AI skills as its differentiating factor. Second, DeepMind's AI voice synthesis tech is some of the most advanced and realistic in the business. Most voice synthesizers (including Apple's Siri) use what's called concatenative synthesis, in which a program stores individual syllables -- sounds such as "ba," "sht," and "oo" -- and pieces them together on the fly to form words and sentences. This method has gotten pretty good over the years, but it still sounds stilted.
WaveNet, by comparison, uses machine learning to generate audio from scratch. It actually analyzes the waveforms from a huge database of human speech and re-creates them at a rate of 24,000 samples per second. The end result includes voices with subtleties like lip smacks and accents. When Google first unveiled WaveNet in 2016, it was far too computationally intensive to work outside of research environments, but it's since been slimmed down significantly, showing a clear pipeline from research to product. The Verge has embedded some samples in their report to see how WaveNet sounds.
Given Google's history of taking things away, I would not build anything that depends on this. It will probably disappear in a year.
Hopefully, the Tech Awakening we're experiencing in the US at a consumer level might trickle upwards into actual products as well.
No way in hell I'm going to rely on something I have to use a remote service for, which is no doubt collecting and storing as many bits of data as possible. I don't need a human-sounding voice *that* badly that I can't wait for someone to figure out how to get 95% of what this does running on a few cores, or perhaps spare GPU capacity.
Hire a Linux system administrator, systems engineer,
These voices are quite a far cry from the results of the original WaveNet paper. I suppose a lot of computational tradeoffs happened, but these are Siri-level, not human-level.
I'm sorry Dave. I'm afraid I can't do that.
the Terminator
Are you sure you know what you want?
Have gnu, will travel.
I'll finally figure out how to pronounce 'doge'.
Have gnu, will travel.
This will be very useful, to telemarketers.
T2V is already good enough for telemarketers. Their problem is not generating the voice, but semantic analysis of the replies.
I get an occasional spam call where I am not 100% sure if it is a human or a robot. So I try to immediately force it off script by asking something like "What color underwear are you wearing?" Sometimes the call is disconnected, sometimes it is forwarded to a human, and sometimes the robot tries to get back on script. But best of all, sometimes it is an actual human, who will sometimes hang up, sometimes give a flustered response, and sometimes say something creative like "I'm not wearing any underwear".
The Verge has embedded some samples in their report to hear how WaveNet sounds.
The NSA does speech-to-text so they can collect all voice calls and index them for text search and trigger words. This is about text-to-speech.
If you check the competitor voice generation in the article, it's also pretty good. Things have improved since Radiohead's depressing song 'Fitter Happier'.
https://www.youtube.com/watch?...
Why 'cloud' when local works well?
$ sudo aptitude install libttspico-utils
$ pico2wave -w h.wav "Hello World"
$ aplay h.wav
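The three commands above could be wrapped into a small helper, roughly like the sketch below. The function name `speak` and the default file name `speech.wav` are made up for illustration; the only tool options used are the `-w` flag and the plain `aplay` call already shown above. It assumes a POSIX shell and that `pico2wave` is installed from `libttspico-utils`.

```shell
#!/bin/sh
# Minimal sketch: render text to a WAV file locally with pico2wave, then play it.
# "speak" is a hypothetical helper name, not part of any package.

speak() {
    if [ "$#" -lt 1 ]; then
        echo "usage: speak TEXT [OUTPUT.wav]" >&2
        return 2
    fi
    text=$1
    out=${2:-speech.wav}   # assumed default output name

    # Fail early with a hint if the synthesizer is not installed.
    if ! command -v pico2wave >/dev/null 2>&1; then
        echo "pico2wave not found; try installing libttspico-utils" >&2
        return 127
    fi

    # Same invocation as in the comment above: -w names the output file.
    pico2wave -w "$out" "$text" || return 1

    # Playback is optional; skip it quietly when aplay is unavailable.
    if command -v aplay >/dev/null 2>&1; then
        aplay "$out"
    fi
}
```

After sourcing the file, `speak "Hello World" h.wav` should do the same thing as the last two commands above.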
2bits.com, Inc: Drupal, WordPress, and LAMP performance tuning.