Google's Voice-Generating AI Is Now Indistinguishable From Humans (qz.com)
An anonymous reader quotes a report from Quartz: A research paper published by Google this month -- which has not been peer reviewed -- details a text-to-speech system called Tacotron 2, which claims near-human accuracy at imitating audio of a person speaking from text. The system is Google's second official generation of the technology, which consists of two deep neural networks. The first network translates the text into a spectrogram (pdf), a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet's AI research lab DeepMind, which reads the chart and generates the corresponding audio elements accordingly. The Google researchers also demonstrate that Tacotron 2 can handle hard-to-pronounce words and names, as well as alter the way it enunciates based on punctuation. For instance, capitalized words are stressed, as someone would do when indicating that specific word is an important part of a sentence. Quartz has embedded several different examples in their report that feature a sentence generated by AI along with a sentence read aloud from a human hired by Google. Can you tell which is the AI generated sample?
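For anyone unsure what "a spectrogram" means in the summary above, here is a minimal pure-Python sketch: slice the audio into short frames and measure how strong each frequency is in every frame. This is only an illustration under my own assumptions (a plain DFT, a synthetic 1000 Hz tone, and made-up names like `dft_magnitudes` and `spectrogram`); it is not the representation or code the Google paper uses.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude of each frequency bin (up to Nyquist) for one frame, via a naive DFT."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]

def spectrogram(samples, frame_size=64, hop=32):
    """List of per-frame magnitude spectra: time runs along the outer list."""
    return [
        dft_magnitudes(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, hop)
    ]

SAMPLE_RATE = 8000  # assumed sample rate for the toy signal
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]

frames = spectrogram(tone)
# Index of the strongest bin in each frame; bin k corresponds to k * SAMPLE_RATE / frame_size Hz.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in frames]
print(len(frames), peak_bins)
```

With a 64-sample frame at 8000 Hz each bin is 125 Hz wide, so a steady 1000 Hz tone should peak in bin 8 of every frame. A speech synthesizer works the other way around: it predicts a grid like this from text, and a second model turns the grid back into audio.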
Even though they chose a low-quality human comparison (the audio fidelity is fine, but the timing and pronunciation are terrible), it is still quite obvious which is which. The synth version is slightly too clipped and the timing does not sound natural.
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Of course this is more "AI" baloney as you can clearly tell it is speech synthesis. Even if it were indistinguishable, this is NOT AI. A "neural network" is nothing like a human brain. It is a weasel term to fool laypeople into thinking it is some sort of magic. Nice try Google. Keep pushing your Google Home gadgets.
Robocalls! :-D
Just yesterday we saw a thread about someone giving Alexa the skills to ask questions. Now we see Google Home is answering them. Set one against another and watch the fun!
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
I. Think. This. Google. A. I. Sounds. A. Maze. ING.
"Can you tell which is the AI generated sample?"
So you can use me as a Turing test guinea pig? For free? My answer is "no". Or... rather "show me the money".
I'm going to guess that this is with an American accent. I've yet to hear a Google voice that says "kilometres" in the same way we do in Ireland. (It's something I find a little irritating when using Google Maps for navigation).
I'm impressed with the progress, but annoyed at how the results are oversold. First, they seem to have asked that human comparison voice to sound like a robot and she succeeded, but credit for that doesn't go to the robot. Second, they only demonstrated sentences that fit in one breath. The way humans read a paragraph or a book chapter requires us to adjust our pauses for breath and our pacing to the content being read. I expect that Google know this and are working on it, and to be fair to them, it was Slashdot and not they who came up with the "as good as humans" line. But I'm still annoyed.
One thing that seems to be missing from all of these is a programmatic understanding of how much air is in the lungs.
"Alexa, what is 69! (factorial)"
Listen in amazement as she rhymes off the number but then enter the uncanny valley about the time she should be taking a breath...
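For a sense of how long that answer is, a quick check with Python's standard library (the variable names are just mine for illustration):

```python
import math

value = math.factorial(69)
digits = str(value)

print(len(digits))      # how many decimal digits the assistant has to read out
print(f"{value:.3e}")   # the same number in scientific notation
```

That is a 99-digit number, roughly 1.711e+98, so reading every digit aloud in one go is well past what a person could manage on a single breath.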
I am not interested in articles about life extension advancements.
> claims near-human accuracy at imitating audio of a person speaking from text
If you believe this, I have a Japanese hologram teenage pop idol to sell you. No kidding, one can buy the "Vocaloid CV-01 V4x" singing synthesis software, boxed or online for about 150 USD. It comes with a clumsy manga-girl mascot design, who became a full-blown celebrity in her own right.
Have you heard Hatsune Miku perform in concert? She's the Number One Princess in the singing synthesis world exactly because she sounds so robotic and emotionless, which attracts weeaboos like a Magnet.
Why does she sound robotic? Because it hasn't been possible to refine singing synthesis for fluent pro-musician use, despite 15 years of best efforts by Yamaha Music Corp. in Japan and the Pompeu Fabra Research University lab in Spain. Thus everybody has lost interest in procedural song generation except the otaku subculture, who even want their own all-singing all-dancing Miku "waifu" in a jar called Gatebox.
Hey google, read all slashdot comments to me with a sarcastic tone.
I'll be impressed when it (or any other text-to-speech bot) can read a novel aloud even half as well as a human narrator.
This would include things like subtle voice changes for different characters, (and yet another change for narrative voice), changing the reading pace according to the mood of the scene (eg fast-paced for action, slower for deliberation or melancholy), and handling punctuation properly. (The latter isn't that hard, but the Kindle reader fails miserably at it, running chapter titles into the text because typically a chapter title has no period at the end.)
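The chapter-title problem at least looks fixable with a preprocessing pass before the text reaches the speech engine. A minimal sketch, assuming each paragraph or heading sits on its own line (hard-wrapped text would need re-joining first); `add_heading_pauses` is a made-up helper, not anything the Kindle reader actually does:

```python
TERMINAL = (".", "!", "?", ":", ";", "\u2026")
CLOSERS = "\"')]\u201d\u2019"  # quotes and brackets that may follow the final punctuation

def add_heading_pauses(lines):
    """Append a period to any non-empty line that lacks terminal punctuation."""
    fixed = []
    for line in lines:
        text = line.rstrip()
        core = text.rstrip(CLOSERS)  # look past closing quotes or brackets
        if core and not core.endswith(TERMINAL):
            text += "."
        fixed.append(text)
    return fixed

chapter = ["Chapter 3", "The Long Road Home", "", "It was raining when she left."]
print(add_heading_pauses(chapter))
```

The added period is only there to make the synthesizer pause; the displayed text would stay untouched.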
Bonus points for auto-correcting typos and inserting dramatic pauses where appropriate.
Extra bonus points for not screwing up a sentence like "Polish the silverware." and pronouncing the first word as the verb polish rather than Polish as in the language or someone from Poland.
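To show why the "Polish" case is harder than it looks, here is a deliberately naive heuristic in Python. It assumes the input is a single sentence, and the word list and the `classify_polish` function are invented for this example; a real system would more likely use part-of-speech tagging or a learned model.

```python
import re

# Words that often start the object of an imperative verb (hypothetical, incomplete list).
OBJECT_STARTERS = {"the", "a", "an", "your", "my", "this", "that", "these", "those", "it", "them"}

def classify_polish(sentence):
    """Return 'shine' or 'nationality' for the first 'polish'/'Polish' found, else None."""
    words = re.findall(r"[A-Za-z']+", sentence)
    for i, word in enumerate(words):
        if word.lower() != "polish":
            continue
        if word.islower():
            return "shine"  # all-lowercase is never the nationality
        nxt = words[i + 1].lower() if i + 1 < len(words) else ""
        if i == 0 and nxt in OBJECT_STARTERS:
            return "shine"  # capitalised only because it opens the sentence
        return "nationality"
    return None

print(classify_polish("Polish the silverware."))
print(classify_polish("Polish sausages are great."))
```

This gets the silverware sentence right, but something like "Polish that is applied too thickly will crack" fools it immediately, which is the point: capitalisation alone does not settle it.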
I do not like it. It is unsettling.
Brought to you by Carl's Junior.
When I was a kid, 35 years ago, I had a TI-99/4A home computer with a speech synthesizer (which was actually 5-year-old tech at the time). Sure, it didn't sound great, but it was totally understandable. With the Terminal Emulator II cartridge you could build from phonemes directly and thus have it say any English word, and not just words from its predefined "dictionary" of words it knew how to pronounce already. That was 35 years ago, with a consumer grade home computer running at 3 MHz, that a 10-year-old was goofing around with for fun.
The fact that we didn't reach "Indistinguishable From Humans" in TTS *years* ago is not saying much for the state of our software.
Here's an example of it speaking... https://youtu.be/0vu1GftX02Q?t...
Better known as 318230.
Seriously. What's the problem we need solved here? The Google voice in maps is fine - even the robotic one when maps cannot connect to the mother ship.
Focus on "The AI" understanding what I'm asking, please.
I would think if they were trying to showcase their technology they would have chosen someone with a less "robotic" voice to copy. I guess they just wanted someone who spoke very clearly?
If every book can be accessed by those who want to listen instead of read, that's not a trivial development at all.
...from United Airlines Flight 93 on 9/11/2001. So, that tech is declassified now.
A research paper published by Google this month -- which has not been peer reviewed -- details a text-to-speech system called Tacotron 2, which claims near-human accuracy at imitating audio of a person speaking from text.
If anyone remembers "reading groups" from primary school, there is a pretty big range in the term "human accurate reading".
Good enough for Hawking maybe.
I'd prefer a nice high class British female voice, or Paul Bettany as Jarvis.
Of course it's "not peer reviewed". Google doesn't want their stupid hype train derailed. They've been doing this shit for years. Same reason they never bothered to play their sixty thousand dollar chess machine against anything at least resembling a half decent laptop.
I think it might be more realistic to say that Google and a speaker speaking in a monotonous, robotic way are pretty much indistinguishable from one another. They both sound robotic to me. When it can imitate what people really sound like, normal people, then talk to me. Not that this isn't cool, but from the cursory bits I read and heard it seems to over-hype itself.
In a few years, AI will progress so that AI will sound more human than humans.
Quartz mangled the article; this source is better in every way:
https://research.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html
I'd like to see (ehm, hear) it do this little poem: https://www.cs.cmu.edu/~clamen/misc/humour/TheChaos.html and teach me a few things in the process.
When I make it say words like "shit" "fuck" "cunt" "penis"....
I like Australian Siri and wish Alexa would offer similar accents. $0.02
-==- Buy a Mac and leave me alone!
Why exactly is this legal in the US?
Also, why don't we have communication whitelist firewalls?
I started to, when I had a stalker in 2004.
I made an answering machine that only allowed people in my address book through.
Then I configured my e-mail client, and then server, to use the same logic.
And my Jabber instant messenger too.
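The shared logic across phone, e-mail and Jabber can be sketched in a few lines of Python. This is my own minimal illustration, not the actual setup described here: the `Whitelist` class, its method names and the sample addresses are all hypothetical, and hooking it into a real answering machine or mail server is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Whitelist:
    """One address book shared by every channel (phone, e-mail, instant messenger)."""
    allowed: set = field(default_factory=set)

    @staticmethod
    def normalize(identifier):
        # Compare case-insensitively and ignore surrounding whitespace.
        return identifier.strip().lower()

    def add(self, identifier):
        self.allowed.add(self.normalize(identifier))

    def decide(self, sender):
        """Accept only senders already in the address book; reject everyone else."""
        return "accept" if self.normalize(sender) in self.allowed else "reject"

book = Whitelist()
book.add("+1-555-0100")           # made-up phone number
book.add("Friend@example.com")    # made-up e-mail address / Jabber ID

print(book.decide("friend@example.com"))
print(book.decide("stranger@example.net"))
```

Each channel would call `decide` with the caller ID, the envelope sender or the Jabber ID and act on the result. The default-deny rule is what makes it a whitelist rather than a spam filter.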
I currently still have a mailbox, due to living in a shared apartment, but we have legally valid digital signatures in our passports, which can be used for e-mail and everything, so there really is zero reason to send information-only letters. Hence, I plan to reject all letters, and exclusively accept parcels (including letters that aren't just information) in person, as soon as I move. (Around here, if you're not home, you can tell them a time or go to their branch to get it.)
I also still have a doorbell, but have plans to disable it, add a note, and just have people call me if they are at the door, which in the last three apartment buildings was more convenient anyway. I just have to think about how to handle e.g. emergency services or cops ringing. Maybe the doorbell and intercom will be connected to a small single-board computer, which then routes it as a SIP call, but that would defeat the purpose of a whitelist. Hmm ...
Make your "answering machine" instantly take the call (before the first ring), and play your local "There is no such number".
Otherwise, they will just keep calling.