Slashdot Mirror


Google's Voice-Generating AI Is Now Indistinguishable From Humans (qz.com)

An anonymous reader quotes a report from Quartz: A research paper published by Google this month -- which has not been peer reviewed -- details a text-to-speech system called Tacotron 2, which claims near-human accuracy at imitating audio of a person speaking from text. The system is Google's second official generation of the technology, which consists of two deep neural networks. The first network translates the text into a spectrogram (pdf), a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet's AI research lab DeepMind, which reads the chart and generates the corresponding audio elements accordingly. The Google researchers also demonstrate that Tacotron 2 can handle hard-to-pronounce words and names, as well as alter the way it enunciates based on punctuation. For instance, capitalized words are stressed, as someone would do when indicating that specific word is an important part of a sentence. Quartz has embedded several different examples in their report that feature a sentence generated by AI along with a sentence read aloud by a human hired by Google. Can you tell which is the AI-generated sample?

64 of 101 comments

  1. Not so much by smallfries · · Score: 4, Informative

    Despite choosing a low-quality human comparison (the audio fidelity is fine, but the timing and pronunciation are terrible), it is still quite obvious which is which. The synth version is slightly too clipped and the timing does not sound natural.

    --
    Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    1. Re:Not so much by Anonymous Coward · · Score: 1

      Heck, a good number of the ads I hear on radio have unnatural timing. Even a politician on a teleprompter sounds unnatural to me. Lots of people are bad (or untrained) at sounding natural as they read from copy.

    2. Re:Not so much by Oswald+McWeany · · Score: 1

      Despite choosing a low-quality human comparison (the audio fidelity is fine, but the timing and pronunciation are terrible), it is still quite obvious which is which. The synth version is slightly too clipped and the timing does not sound natural.

      Funny thing is, I thought both samples sounded more like a computer than a human.

      --
      "That's the way to do it" - Punch
    3. Re: Not so much by megamind · · Score: 1

      Still easy to distinguish. Just wait a few seconds and then try to interrupt and see if it stops talking.

    4. Re:Not so much by jellomizer · · Score: 2

      I remember a claim that the CGI characters in the Final Fantasy movie were indistinguishable from real people. In fact they just hit the Uncanny Valley very hard.
      The problem I expect in the audio is the same as with CGI: it's a bit too perfect and misses human imperfections. A computer doing a voice will do exactly what the voice is supposed to do, while a narrator, even an expert at his craft, is affected by his emotions. What he is reading will move him emotionally, and that response will show in his voice.
      Much like how CGI characters, even perfectly rendered ones, just don't show the details of the emotions.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    5. Re:Not so much by kwoff · · Score: 2

      The voice reminded me of the narrator for "Physics Videos by Eugene Khutoryansky". Several people have asked in that channel's comments section if it is computer-generated, but it's claimed to be a woman named Kira. AFAICT, it's a voice actor, Kira Vincent. It makes me wonder if Google had her pronounce things, and her pronunciation just happens to be somewhat synthetic-sounding :) (though I looked quickly at the research paper and didn't find a mention of "Kira" or a name for the voice).

    6. Re:Not so much by cascadingstylesheet · · Score: 1

      I remember a claim that the CGI characters in the Final Fantasy movie were indistinguishable from real people. In fact they just hit the Uncanny Valley very hard. The problem I expect in the audio is the same as with CGI: it's a bit too perfect and misses human imperfections. A computer doing a voice will do exactly what the voice is supposed to do, while a narrator, even an expert at his craft, is affected by his emotions. What he is reading will move him emotionally, and that response will show in his voice. Much like how CGI characters, even perfectly rendered ones, just don't show the details of the emotions.

      Still ... "it took over a hundred questions with Rachel, didn't it??"

    7. Re:Not so much by chispito · · Score: 1

      Heck, a good number of the ads I hear on radio have unnatural timing.

      Part of that is because audio can now be digitally sped up without a corresponding pitch change, which precludes the need to hire actors like John Moschitta Jr. to read the terms, conditions, warnings, etc., at the end of an ad. I'm starting to suspect some agencies compress the entire ad in this manner to try to fit in more content without their actors sounding out of breath.

      --
      The Daddy casts sleep on the Baby. The Baby resists!
    8. Re:Not so much by nospam007 · · Score: 1

      "Even a politician on a teleprompter sounds unnatural to me."

      But some of them 'have the best words', or so they say.

    9. Re:Not so much by iMadeGhostzilla · · Score: 1

      That makes sense. Our speaking apparatus, the muscles and nerves and whatnot, is modulated by the emotions running through us at the moment. At the same time our own listening apparatus is trained through endless repetition to catch many of those modulations and identify them, consciously or not. For AI speech to be "indistinguishable from humans" it would need to simulate modulation by emotions, which depend on the person and the context.

    10. Re:Not so much by MichaelSmith · · Score: 1

      Which never made sense to me. All through the movies, artificial organisms have serial numbers, as did the Nexus 8 in 2049. Couldn't Deckard just sample Rachel's DNA? Probably do it with a hand held reader by that time.

    11. Re:Not so much by smallfries · · Score: 1

      Yeah... it said that when I commented. Hence my claim that it is not indistinguishable. Do you understand?

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
  2. Welcome to the wide world of.... by Zurkeyon3733 · · Score: 5, Insightful

    Robocalls! :-D

    1. Re:Welcome to the wide world of.... by Megane · · Score: 1

      Wake me up when they can answer out-of-band questions like "What is today?", or respond in a human way to talking over their script with "Hello? Hello? Hello? Hello?" I'm not saying it won't happen, but for now, those are the fastest ways to fail them on a Turing test. When they figure those out, I'll move up to a next level of ez-fail questions.

      --
      #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
    2. Re:Welcome to the wide world of.... by sound+vision · · Score: 1

      Welcome? I've been in that world for years. Anyway, most robocalls play a recording of an actual human voice, so I fail to see what they'd gain by using a synthesizer. I doubt that *recording the message* is the thing that limits their profits.

  3. Ha! Sabash!! Great competition. by 140Mandak262Jamuna · · Score: 1

    Just yesterday we saw a thread about someone giving Alexa the skills to ask questions. Now we see Google Home is answering them. Set one against another and watch the fun!

    --
    sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
  4. Re:Baloney by rodrigoandrade · · Score: 3, Insightful

    Duuuuude, it's AI!!!! Everything you can label "AI" gets a shit ton of page views.

    Even my doorbell has AI in it, because it rings when it "knows" someone is at the door looking for me.

  5. Re:Baloney by 110010001000 · · Score: 3, Insightful

    Words matter, caveman. What we are calling "AI" is definitely artificial, but not intelligent. If we are going to start calling computer programs "AI" just to start another VC hype cycle, then what is the point? Microsoft Word is "AI".

  6. Re:Baloney by Anonymous Coward · · Score: 5, Informative

    Listen for the "plosives", the "p" or "b" sounds. All text-to-speech systems get them wrong, because they are generally programmed from recorded speech that is very frequency limited. There are reasons for that. Full digital sampling of sound uses analog-to-digital converters, which are limited by the sampling rate. To reduce the amount of digital storage and processing required, the designers of both recording and synthesis tools lower the sampling frequency as far as possible. They also add low-pass filters on the inputs and the outputs, to keep sharp step functions from generating undesired artifacts on the output, and to keep weird "beat" harmonics with the sampling frequency from confusing the recorded inputs. But the result is smearing of sharp sounds that are rich in transients, such as "t" and "p". And dear lord, does it screw up languages with "click" sounds like Zulu.
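
The smearing effect described in the comment above can be sketched in a few lines. This is only an illustration, not how any real synthesizer filters audio: it assumes a moving average as a stand-in for a low-pass filter and a one-sample click as a stand-in for a plosive onset. The function and variable names are hypothetical.

```python
def moving_average(signal, width):
    """Crude low-pass filter: each output is the mean of the last `width` inputs.

    Samples before the start of the signal are treated as zero.
    """
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - width + 1): i + 1]
        out.append(sum(window) / width)
    return out


# A one-sample click, standing in for the sharp onset of a "p" or "t".
click = [0.0] * 8 + [1.0] + [0.0] * 15

smeared = moving_average(click, width=8)

peak_after = max(smeared)
spread_after = sum(1 for x in smeared if x > 0)

print(f"peak before: {max(click):.3f}, after: {peak_after:.3f}")
print(f"non-zero samples before: 1, after: {spread_after}")
```

The click keeps its total area but is spread over eight samples at one eighth of the original height, which is the loss of sharpness the comment is pointing at.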

  7. What about accents? by Tomahawk · · Score: 1

    I'm going to guess that this is with an American accent. I've yet to hear a Google voice that says "kilometres" in the same way we do in Ireland. (It's something I find a little irritating when using Google Maps for navigation).

    1. Re: What about accents? by Anonymous Coward · · Score: 2, Interesting

      As speech synthesis rises in usage, my guess is evolution will eliminate harder accents like the Irish, Jamaican, Cuban, etc. It will also eventually eliminate plosive sounds, etc. The language we speak will end up leaning towards how these systems speak because they'll be more ubiquitous.

    2. Re:What about accents? by jrumney · · Score: 1

      Have you tried setting your default language to English (Ireland) or English (UK)? (they seem to both be the same South-East England accent) The way they pronounce kilometers is definitely different than the US English voice.

    3. Re:What about accents? by chill · · Score: 1

      No need to guess, it says so right in the last paragraph of the article.

      However, the system is only trained to mimic the one female voice; to speak like a male or different female, Google would need to train the system again.

      Training against different accents is something that would easily be within Google's reach, once they're satisfied with the main product.

      --
      Learning HOW to think is more important than learning WHAT to think.
    4. Re: What about accents? by jabuzz · · Score: 1

      I would add that the volume of training material is huge and varied. Though one imagines that Amazon have easier access to the material through their Audible subsidiary. Audiobooks with Whispersync being especially useful.

    5. Re: What about accents? by chill · · Score: 1

      I read some time back, that when first working on their Translate application, Google contracted with the United Nations for access to their professional translation archive. Thousands of samples of source material and professional translations in dozens of different languages.

      If that included voice recordings as well as written translations, it could be the solution to the problem of training material. Not regional accents, of course, but still, a big leg up.

      --
      Learning HOW to think is more important than learning WHAT to think.
    6. Re: What about accents? by EvilSS · · Score: 1

      I would add that the volume of training material is huge and varied. Though one imagines that Amazon have easier access to the material through their Audible subsidiary. Audiobooks with Whispersync being especially useful.

      The problem is neural networks can be unpredictable in their response to training. Start feeding it different voices and it might just start averaging them out, or start doing the voice equivalent of code switching. That would be really weird to listen to.

      Also don't go getting the author's guilds and voice actors all riled up. They'll be suing preemptively.

      --
      I browse on +1 so AC's need not respond, I won't see it.
    7. Re:What about accents? by Paradise+Pete · · Score: 2

      I've yet to hear a Google voice that says "kilometres" in the same way we do in Ireland.

      Nobody else says anything the same way you do in Ireland.

    8. Re: What about accents? by Tomahawk · · Score: 1

      True.
      Specifically for this, most say Keelow-meters or Killow-meters, while we say kill-Om-eters. Emphasis is on the Om.

  8. Re:Baloney by Anonymous Coward · · Score: 2, Insightful

    Everyone is going to call it AI, though.

    Everyone can be wrong, of course, but who loses in normal conversation? The Average Joe or a pedant?

    I'm sure the technology will be referred to in the correct terms by the people who use and probably invented the correct terms. For everyone else, there's AI.

  9. Terrible comparisons by Anonymous Coward · · Score: 1

    I'm impressed with the progress, but annoyed at how the results are oversold. First, they seem to have asked that human comparison voice to sound like a robot and she succeeded, but credit for that doesn't go to the robot. Second, they only demonstrated sentences that fit in one breath. The way humans read a paragraph or a book chapter requires us to adjust our pauses for breath and our pacing to the content being read. I expect that Google know this and are working on it, and to be fair to them, it was slashdot and not they who came up with the "as good as humans" line. But I'm still annoyed.

  10. Breath by lazarus · · Score: 4, Insightful

    One thing that seems to be missing from all of these is a programmatic understanding of how much air is in the lungs.

    "Alexa, what is 69! (factorial)"

    Listen in amazement as she rhymes off the number but then enter the uncanny valley about the time she should be taking a breath...

    --
    I am not interested in articles about life extension advancements.
    1. Re:Breath by DigiShaman · · Score: 1

      The ever-lasting wind bag. Oh, what bagpipes she could be!

      --
      Life is not for the lazy.
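
For a sense of scale on the 69! example in the "Breath" comment above, here is a short sketch that counts the digits a voice assistant would have to read out. The speaking rate is a made-up assumption for illustration only.

```python
import math

value = math.factorial(69)
digit_count = len(str(value))

# Hypothetical speaking rate, chosen only to give a rough duration.
ASSUMED_DIGITS_PER_SECOND = 3
seconds_to_read = digit_count / ASSUMED_DIGITS_PER_SECOND

print(f"69! has {digit_count} digits")
print(f"about {seconds_to_read:.0f} s to read digit by digit at the assumed rate")
```

At that assumed rate the number takes roughly half a minute to say, well past where a human reader would pause for air.
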
  11. Re:Baloney by mikael · · Score: 4, Funny

    Same with electric heater. The thermostat has built in AI so that it knows when to turn the heater off when it is too hot.

    --
    Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
  12. Re:Baloney by rkordmaa · · Score: 1

    The problem is that some people expect AI to be like something from a sci-fi movie and happen to know that sci-fi AI and real world AI are nothing alike. For a layman it doesn't really matter, it's all magic anyway. "Deep learning neural networks" is a bit of a mouthful and doesn't get the point across as well as "AI", even if some people have unrealistic expectations about what AI is supposed to be. Complaining about it is nonsense semantics anyway, whatever you call it won't change what it is.

  13. Re:Baloney by 110010001000 · · Score: 2

    No, they don't.

  14. Re:Baloney by Anonymous Coward · · Score: 1

    It's funny how angry you keep getting every time the word AI appears in a slashdot article.

    And yet, for all your rants, nothing changes. The world keeps on using AI to mean what you insist it doesn't mean.

    In the English language, popular use determines meanings. So, this word has attained a new meaning, whether you approve of it or not.

    But hey, keep posting your angry rants. Maybe they will go viral and convince the world to change.

  15. Re:Baloney by religionofpeas · · Score: 1

    Sounds like bullshit. A CD is only 700 MB, and holds 80 minutes of high quality audio. Who cares about the amount of digital storage for a couple of "b" and "t" samples?
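
A back-of-the-envelope check of the storage point above, as a sketch. It assumes the standard CD audio parameters (44.1 kHz, 16-bit, stereo, 75 sectors per second) and a hypothetical 100 ms mono recording standing in for a single plosive. The names are illustrative.

```python
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2          # 16-bit PCM
STEREO = 2
MONO = 1


def pcm_bytes(milliseconds, channels):
    """Size of uncompressed PCM audio of the given length, in bytes."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels * milliseconds // 1000


full_disc_bytes = pcm_bytes(80 * 60 * 1000, STEREO)   # 80 minutes of stereo
plosive_bytes = pcm_bytes(100, MONO)                  # one 100 ms mono sample

# The same disc in data mode stores 2048 user bytes per sector instead of
# the 2352 audio bytes per sector, which is where the familiar "700 MB"
# label for an 80-minute disc comes from.
sectors = 80 * 60 * 75
data_mode_bytes = sectors * 2048

print(f"80 min of CD audio:   {full_disc_bytes:,} bytes")
print(f"same disc, data mode: {data_mode_bytes:,} bytes")
print(f"one 100 ms plosive:   {plosive_bytes:,} bytes")
```

A single plosive sample at full CD quality is under 9 KB, so storage alone is a weak reason to band-limit a handful of phoneme recordings.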

  16. Re:Baloney by Dog-Cow · · Score: 1

    If you smash a pickaxe through your eye, you will no longer care what people call AI, and we won't have to read your inane shit. It's a win/win.

  17. Re:Baloney by Anonymous Coward · · Score: 2, Insightful

    Dude, the proper definition of AI is obvious - It's whatever computers can't yet do.

  18. Re:Baloney by swillden · · Score: 1

    And a hacker is someone who enjoys making technology do interesting things. Good luck trying to redefine common language.

    For that matter, this isn't even "common" language. Researchers in the field call it AI as well, and have for decades. When necessary they distinguish between strong AI and weak AI, but most of the time it's not necessary because strong AI doesn't yet exist.

    --
    Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  19. Re:Baloney by swillden · · Score: 1

    I'm looking for a decent smart doorbell. I'd like one that rings when someone who doesn't live in my house approaches the door. It should have a button for backup.

    --
    Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  20. Re:Baloney by K.+S.+Kyosuke · · Score: 1

    Of course this is more "AI" baloney as you can clearly tell it is speech synthesis.

    Meanwhile, actual speech synthesis researchers are acutely aware that mimicking human speech requires dedicating significant NLP resources to generating correct prosody, which may very well be hard or next to impossible without the machine actually understanding what the text is about.

    --
    Ezekiel 23:20
  21. This will be great! by burhop · · Score: 4, Funny

    Hey google, read all slashdot comments to me with a sarcastic tone.

  22. Re:Baloney by K.+S.+Kyosuke · · Score: 1

    When necessary they distinguish between strong AI and weak AI, but most of the time it's not necessary because strong AI doesn't yet exist.

    And you haven't even started distinguishing between AI the result (what you're talking about) and AI the field (which you need to have before you arrive at the former).

    --
    Ezekiel 23:20
  23. Re: Baloney by Anonymous Coward · · Score: 1

    I feel your pain, binary. You should relax though. Can you remember the mainframe, cloud, and e buzzwords? Everything will be called AI for a short while because it sounds cool and advanced to the masses, but this buzzword shall pass.

  24. I noticed this after the last upgrade. by wjcofkc · · Score: 1

    I do not like it. It is unsettling.

    --
    Brought to you by Carl's Junior.
  25. Re:To.Tall.E. by Megane · · Score: 1

    I guess I need to listen to it to see just how bad it is. You make it seem like William Shatner should be worried about losing work to automation.

    About 10 or so years ago, there was an automated voice reading weather reports on an HDTV sub-channel. I think it was actually the official National Weather Service radio audio. Whenever it came across "patchy fog", it would always say "patch-eef ogg". So now I'm expecting that times a hundred.

    --
    #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
  26. Re:Baloney by fph+il+quozientatore · · Score: 1

    Clippy was AI.

    --
    My first program:

    Hell Segmentation fault

  27. That's not saying much. by Dan+East · · Score: 1

    When I was a kid, 35 years ago, I had a TI-99/4A home computer with a speech synthesizer (which was actually 5 years old tech at the time). Sure, it didn't sound great, but it was totally understandable. With the Terminal Emulator II cartridge you could build from phonemes directly and thus have it say any English word, and not just words from its predefined "dictionary" of words it knew how to pronounce already. That was 35 years ago, with a consumer grade home computer running at 3 MHz, that a 10 year old was goofing around with for fun.

    The fact that we didn't reach "Indistinguishable From Humans" in TTS *years* ago is not saying much for the state of our software.

    Here's an example of it speaking... https://youtu.be/0vu1GftX02Q?t...

    --
    Better known as 318230.
    1. Re:That's not saying much. by religionofpeas · · Score: 1

      Replaying pre-recorded phonemes is an adequate solution for poor quality speech, but you can't extend that method to reach high quality. In order to do that, you have to start over from scratch, using a much more difficult method.

  28. Re:Baloney by sound+vision · · Score: 3, Informative

    The storage and CPU cost of recording audio are so small that they reached the point of irrelevance 15-20 years ago, for low-end consumer hardware. More like 40 years ago for professional grade equipment - around the time that CDs were introduced. Despite what a bunch of "audiophile" sites trying to push a product will tell you, it is not difficult, expensive, or taxing in any way to work with PCM audio of a sufficient bit depth and sampling rate to cover the entire range of human hearing. Or even dog hearing!

    But regarding speech synthesis specifically - there is software out there, still being used by somebody I'm sure, that was designed to be run on consumer PCs back in the 90s. At that time, on those systems, there were computational limits that were relevant to sound quality. Whatever outdated software Stephen Hawking uses, sounds like it renders the output at no higher than 10 or 12 kHz sampling rate (compared to 40 - 50 kHz to cover the human hearing range.) But the sampling rate is a very small part of why Hawking sounds bad. The artifacts you hear from a low sampling rate are mostly limited to high-frequency sounds being cut. (And possibly temporal smearing, depending on how you filter.) It sounds similar to turning the treble knob on your stereo all the way down.

    The quality problems with Hawking's synthesizer go way beyond a treble knob. Things like pacing, emphasis, minor slurring of certain sounds that are adjacent to each other, etc... problems that you take care of by making the software more intelligent, not upping the sample frequency. Which is exactly what Google is doing, and making some progress at it too. No, it doesn't sound like a human yet.
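
The sampling-rate figures in the comment above follow from the Nyquist limit: a sampled signal can only represent frequencies below half its sampling rate. A small sketch, assuming the commonly cited 20 kHz upper bound for human hearing. The function names are illustrative.

```python
HUMAN_HEARING_LIMIT_HZ = 20_000  # commonly cited upper bound, an approximation


def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2


def covers_hearing(sample_rate_hz):
    """True if the sampling rate can represent the whole audible range."""
    return nyquist_hz(sample_rate_hz) >= HUMAN_HEARING_LIMIT_HZ


for rate in (10_000, 12_000, 44_100, 48_000):
    status = "covers" if covers_hearing(rate) else "cuts off below"
    print(f"{rate:>6} Hz sampling -> up to {nyquist_hz(rate):>7.0f} Hz, "
          f"{status} the audible range")
```

At 10 to 12 kHz everything above 5 to 6 kHz is lost, which matches the "treble knob turned all the way down" description, while 44.1 and 48 kHz clear the audible range.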

  29. Maybe not the best test subject by Headw1nd · · Score: 1

    I would think if they were trying to showcase their technology they would have chosen someone with a less "robotic" voice to copy. I guess they just wanted someone who spoke very clearly?

  30. Re:Baloney by ranton · · Score: 2

    Words matter, caveman. What we are calling "AI" is definitely artificial, but not intelligent. If we are going to start calling computer programs "AI" just to start another VC hype cycle, then what is the point? Microsoft Word is "AI".

    People really need to start modding these types of comments as Troll and move on. AI has included basic algorithms used as a stand in for intelligent thought since the field arguably began at The Dartmouth Summer Research Project on Artificial Intelligence over 60 years ago. At the time they were very aware of how difficult it could be to define intelligence, so they intentionally did not let that limit what was considered artificial intelligence research.

    Today the researchers and field of scientific journalism both agree that machine learning and neural networks fit within the field of artificial intelligence. That is all that matters, not your personal feelings about what the field should be.

    --
    -- All that is necessary for the triumph of evil is that good men do nothing. -- Edmund Burke
  31. Re: Baloney by ranton · · Score: 2

    But not in science they don't! AI has a definite scientific meaning.

    And since its inception in the 1950s, AI has included basic algorithms used to approximate the results of intelligent thought.

    --
    -- All that is necessary for the triumph of evil is that good men do nothing. -- Edmund Burke
  32. Re:Baloney by ranton · · Score: 1

    Before the mid 1900's if you saw the term AI it would have almost certainly meant artificial insemination, so I assure you the meaning of AI has changed over time.

    --
    -- All that is necessary for the triumph of evil is that good men do nothing. -- Edmund Burke
  33. This is huge for the audio book market by Botched · · Score: 1

    If every book can be accessed by those who want to listen instead of read! Not a trivial development at all.

  34. Re:Baloney by Merk42 · · Score: 1

    No, they don't.

    Yes, they do

    If your argument was somehow about "AI" specifically, you can see ranton's comment and/or picture how "AI" can become another instance of the example words I linked to.

  35. Re:Baloney by ljw1004 · · Score: 2

    Words matter, caveman. What we are calling "AI" is definitely artificial, but not intelligent. If we are going to start calling computer programs "AI" just to start another VC hype cycle, then what is the point? Microsoft Word is "AI".

    There's a straightforward difference. If the logic (or business logic, or branching structure / conditionals) was authored by a human programmer then we call it a conventional program. If the logic was an emergent property of running a learning algorithm over a training set, then we call it AI.

    This is a practically useful distinction for us working software engineers. (Why? The latter can't usefully be checked into source control itself; only its training data can. You can't diff it. The typical bugs you get are very different between the two: the first kind of software has weird discontinuous edge cases, and the latter is generally "smooth". We engineers need different skillsets to develop and debug the two. The way we respond to requirements specs is different between the two. Each has its strengths at particular classes of problems: compiler-writing is dominated by the first kind; real-world sensory processing was done at first by the first kind, like OpenCV up to 2010, but has been wholly eclipsed by the second kind.)

    No, Microsoft Word isn't "AI" under this commonly-used definition.

    If you want to keep railing against it, why not (1) recognize that it's a practically useful distinction to make, (2) come up with a term you think is better?
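
The distinction drawn in the comment above can be sketched with a toy example. Both functions below classify a sound level as loud or not. In the first the threshold is written by a programmer; in the second it is derived from labeled examples by a deliberately tiny "learning algorithm" (the midpoint between class means). The data, names, and method are all made up for illustration.

```python
def is_loud_handwritten(level_db):
    # Logic authored by a human: the threshold lives in source control
    # and can be reviewed and diffed like any other code.
    return level_db > 70.0


def learn_threshold(examples):
    """Return the midpoint between the mean of loud and quiet examples."""
    loud = [level for level, is_loud in examples if is_loud]
    quiet = [level for level, is_loud in examples if not is_loud]
    return (sum(loud) / len(loud) + sum(quiet) / len(quiet)) / 2


# Hypothetical training set of (level in dB, labeled loud?) pairs.
training_set = [
    (40.0, False), (55.0, False), (60.0, False),
    (80.0, True), (85.0, True), (90.0, True),
]

learned_threshold = learn_threshold(training_set)


def is_loud_learned(level_db):
    # Logic that emerged from the data: change the training set and the
    # behavior changes, with no edit to this function.
    return level_db > learned_threshold


print(f"learned threshold: {learned_threshold:.2f} dB")
print(is_loud_handwritten(69.0), is_loud_learned(69.0))
```

The two classifiers disagree at 69 dB, and the only way to see why the learned one behaves as it does is to look at the training data, not the code.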

  36. Compared to what humans? by SeaFox · · Score: 1

    A research paper published by Google this month -- which has not been peer reviewed -- details a text-to-speech system called Tacotron 2, which claims near-human accuracy at imitating audio of a person speaking from text.

    If anyone remembers "reading groups" from primary school, there is a pretty big range in the term "human accurate reading".

  37. Still sounds choppy by bobstreo · · Score: 1

    Good enough for Hawking maybe.

    I'd prefer a nice high class British female voice, or Paul Bettany as Jarvis.

  38. Eh, I think the title might be better worded... by itwasgreektome · · Score: 1

    I think it might be more realistic to say that Google and a speaker speaking in a monotonous, robotic way are pretty much indistinguishable from one another. They both sound robotic to me. When it can imitate what people really sound like, normal people, then talk to me. Not that this isn't cool, but from the cursory bits I read and heard it seems to over-hype itself.

  39. Just a start by BradMajors · · Score: 2

    In a few years, AI will progress so that AI will sound more human than humans.

  40. Re:Study English pronunciation by knorthern+knight · · Score: 1

    Have you heard about the woman working in a tourist shop on "The Sunshine Coast" of British Columbia, Canada?

    She sells sea shells on the Sechelt Peninsula.

    --

    I'm not repeating myself
    I'm an X window user; I'm an ex-Windows user
  41. More voices please by Not-a-Neg · · Score: 1

    I like Australian Siri and wish Alexa would offer similar accents. $0.02

    --
    -==- Buy a Mac and leave me alone!