Slashdot Mirror


Text to Speech Software Copies Any Human Voice

mindpixel writes "A New York Times report (registration required) states that AT&T Labs will start selling speech software that it says is so good at reproducing the sounds, inflections and intonations of a human voice that it can recreate voices and even bring the voices of long-dead celebrities back to life. The software, which turns printed text into synthesized speech, makes it possible for a company to use recordings of a person's voice to utter things that the person never actually said."

37 of 299 comments

  1. another thing by Mandrake · · Score: 3

    Also, building a voice from speakers who you do not control the coverage on (particularly the mention of reviving dead actors, etc) would be problematic at best. You could not get the proper coverage (nor the quality) to really do anything useful.
    --
    Geoff Harrison (http://mandrake.net)

    --
    Geoff "Mandrake" Harrison
    Some Random UI Hacker
  2. open source speech synthesis by Mandrake · · Score: 4

    AT&T's synthesis system actually contains Edinburgh University's Festival Speech Synthesis System (http://festvox.org/festival), although the synthesis technique in NextGen is not in Festival (as it's proprietary). However, there is work from Carnegie Mellon, by Kevin Lenzo and Alan Black (http://www.festvox.org), that provides all the tools (for free) that allow you to build your own voice in Festival. For simple domains the tools work really well and easily capture the quality of the original speaker; a whole general voice that can say anything is a *lot* of work, but is possible with the tools. This is what we are doing in our company, Cepstral (http://www.cepstral.com).

    Actually there is even an example of Hemos himself, doing a talking clock, at http://www.festvox.org/ldom/ldom_time.html
    --
    Geoff Harrison (http://mandrake.net)

    --
    Geoff "Mandrake" Harrison
    Some Random UI Hacker
  3. Re:Try it out! - It's not that great by ivan256 · · Score: 3

    It's interesting that their precooked demos sound great, but the speech generated in the interactive demo still sounds like a classic text-to-speech program with a few enhancements. This doesn't seem like a significant improvement over, say, what ships with MacOS by default. I'm not impressed.

  4. Voice over IP compression; useful for the deaf by ChrisDolan · · Score: 3

    With a good speech recognition package, this would be a good way to get extremely high compression for voice. Record your voice, convert to text, compress text, spit over the net, change back to *your* voice on the other end. It would require initially transmitting your voice profile. However, it would not work well with current technology because the lag during speech recognition would be quite noticeable. Also, you would have to detect inflection in the speech recognition phase and encode that in the text.
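
    A rough back-of-the-envelope sketch of the compression argument above. Every figure in it is an assumption chosen for illustration (telephone-quality 8 kHz, 8-bit audio; about 150 spoken words per minute; about 6 characters per word), and it ignores both the one-time voice profile and any inflection markup, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-the-envelope comparison of audio vs. text bitrates.
# All figures below are illustrative assumptions, not measurements.

PCM_SAMPLE_RATE_HZ = 8_000   # telephone-quality sampling rate (assumed)
PCM_BITS_PER_SAMPLE = 8      # 8-bit samples (assumed)
WORDS_PER_MINUTE = 150       # assumed conversational speaking rate
CHARS_PER_WORD = 6           # assumed average, including the trailing space
BITS_PER_CHAR = 8            # plain 8-bit text, before any text compression


def audio_bits_per_second() -> int:
    """Bitrate of the uncompressed voice signal."""
    return PCM_SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE


def text_bits_per_second() -> float:
    """Bitrate of the same speech sent as plain text."""
    chars_per_second = WORDS_PER_MINUTE * CHARS_PER_WORD / 60
    return chars_per_second * BITS_PER_CHAR


def compression_ratio() -> float:
    """How many times smaller the text stream is than the audio stream."""
    return audio_bits_per_second() / text_bits_per_second()


if __name__ == "__main__":
    print(f"audio: {audio_bits_per_second()} bit/s")
    print(f"text:  {text_bits_per_second():.0f} bit/s")
    print(f"ratio: about {compression_ratio():.0f}x")
```

    Under these assumptions the text stream is a few hundred times smaller than the audio, even before the text itself is compressed. Encoding inflection alongside the text would eat into that margin.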

    This could also be very useful for deaf telephone users. Currently, a deaf person relies on a human relay to talk to a non-TDD equipped person. With good speech-to-text and text-to-speech technology the human middle-man could be removed, saving a ton of money.

  5. Human rights? by GregWebb · · Score: 3

    I'm honestly not sure what to think here, but do I have a right to my voice?

    Let's say someone wanted to make me say something in direct contradiction to my normal views, then publish that. Now, I don't consider myself famous enough for this to be a problem ;-) but the possibilities are obvious. The technical liberal in me says that this is fine. The, erm, other part of me says that this could cause some serious problems and harm for people, so shouldn't be allowed. Which do people think here?

    The flipside for law enforcement is perhaps even more scary. What if I published a recording, generated in this way, of (for example) Gary Condit (sp?) confessing to having killed Chandra Levy (again, sp?)? For a parallel (and I never thought I'd cite Lois & Clark... Promise I'm not a fan, my sister used to watch it over meals so we all had to, I have a weird memory, honest really...) the episode where a photographer produces a pre-wedding image of them in bed which could have been taken properly but was actually faked due to a lost film.

    This has been coming for years, I know, but it's still a nasty big can of worms.

    --

    Greg

    (Inside a nuclear plant)
    Aaaarrrggh! Run! The canary has mutated!

  6. On Yahoo w/o registration here by __aadkms7016 · · Score: 3

    Read it on Yahoo without registration here.

  7. On the other hand... by Monthenor · · Score: 4

    ...it still stumbles over the relatively simple "Gonna bust a cap in this bizatch's shizass."
    ------------------------

    --
    Co-founder of GerbilMechs
  8. Re:One more step... by cr0sh · · Score: 3
    --
    Reason is the Path to God - Anon
  9. One more step... by cr0sh · · Score: 4

    Prior to this, the best sounding speech synthesis I had heard was from the Festival system, which is still pretty good - especially considering it has an open source license, something the AT&T system doesn't.

    Another good speech synthesizer, no doubt an early version of the AT&T one (possibly?), is by Lucent.

    Still, I am amazed at the quality of the AT&T system - it sounds almost perfectly natural. To the naysayers that say "No, it isn't natural" - what all of you have to realize is that this simple demo doesn't allow you to tweak all the variables that would really allow the inflections or type of voice (like whispering, etc) to really come through - it is too bad they don't give an advanced interface with a FAQ or some other form of documentation to allow this, but I imagine that if they did, it would probably take quite a while to compose even a simple sentence (I remember the hell you had to go through with an old Radio Shack speech synth for the Color Computer, specifying individual phonemes just to get proper speech to come out - it could pronounce many words, but on others it just fell flat on its face).

    Finally - something I want everyone to ponder. Take a look at this old article (it was about Square redubbing FFTM) - once it loads, search for "cr0sh" and "I dare say" - you will come across a series of comments about what I think may happen in the future - what is funny is that the comments in reply to my take on things sound like your typical naysayers. How many computers were we supposed to only need back in the 60's? How much memory would people "only" need again Mr. Gates?

    What I predict will come about - probably sooner than we can all imagine. It may not be cheap enough to do it now, at a quality that people would watch, fast enough to be done quicker than what can be done with live actors - but it is all software and hardware - this stuff will get faster and cheaper. Anybody who has been in this business long enough knows that it will happen. There might still be a need for actors, and voice artists, and such - but they probably won't have the "god" status society seems to confer on them now (with the exception, perhaps, of stage acting - which will probably enjoy a huge comeback).

    Worldcom - Generation Duh!

    --
    Reason is the Path to God - Anon
  10. Code words and access lists by wiredog · · Score: 3

    I used to be in the army.

    A general can't just call up the guard post and order the person on duty to let unknown people in. I once was on duty in a radio room and we had a Very Important Senior Officer come by to see what we were doing. He wasn't on the access list, so we wouldn't let him in, even though we recognized him. He had to go get the Colonel, who was on the list, to get in. We got attaboys from him, the Colonel, and our NCOs for that. If we'd let him in, we'd have been in deep doo doo.

  11. Re:Cool... and disturbing. by ncc74656 · · Score: 3
    Its main use is for telephony (surprise!) but I suppose it'll be turning up in new and exciting places.

    On the radio this morning, CBS ran a short blurb about this system, including hypothetical news and sports reports. It sounded pretty good, too...if you've done anything with TTS before, the speech quality of this system was considerably ahead of what's been done before. (Light years ahead of Speak & Spell, but that's almost a given at this point. Compared to more modern systems such as Festival, it still comes out ahead quite a bit.)

    The announcer posited that, one day, his job could be in danger from this kind of technology. With some broadcasters' penchants for cutting costs any way possible (somebody either here or on K5 posted a link about Clear Channel and its shenanigans a while back, but I can't find it), DJs could end up going the way of the dodo as well.

    --
    20 January 2017: the End of an Error.
  12. Re:Doubtful. by Puk · · Score: 3

    That's patently false. Speech synthesis systems are getting better and better at (or, technically, their creators are getting better at creating systems which) generate speech with very similar intonation to what a human would, based on sentence structure analysis and concatenation of recorded subword units with various intonations (there aren't as many as you might think).

    Of course, it would need a corpus of recorded and (possibly automatically) tagged speech from the person they wish to imitate, but that's not that impossible. Ever notice how the generated speech on some speech recognizing phone system (such as American Airlines) is getting better and better, with more and more human-like pronunciation and intonation? And these are the production systems -- not the research systems. I'm not saying they're perfect (and, of course, they're dealing with multiple intonations of fully recorded words, not subwords), but the problem is a far cry from "true AI", and the work on it is getting better all the time.

    Check out http://www.sls.lcs.mit.edu/sls/publications/1998/mengthesis-jonyi.pdf for some more detailed info on such research. (Other papers and theses at http://www.sls.lcs.mit.edu/sls/publications/index.html may be relevant as well.)
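
    A toy sketch of the concatenation idea described above: for each unit in the utterance, pick the recorded variant whose intonation best matches what is wanted, then string the picks together. The inventory, unit names, pitch values, and the `select_units` helper are all invented for illustration. Real systems typically also score how smoothly neighbouring units join, which this sketch ignores.

```python
# Toy concatenative synthesis: for each target unit, choose the recorded
# variant whose pitch is closest to the desired pitch.
# The inventory, unit names, and pitch values are hypothetical.
from typing import Dict, List, Tuple

# unit name -> list of (mean pitch in Hz, recording id)
Inventory = Dict[str, List[Tuple[float, str]]]

INVENTORY: Inventory = {
    "HH": [(110.0, "hh_low"), (150.0, "hh_high")],
    "AY": [(105.0, "ay_low"), (160.0, "ay_high")],
}


def select_units(targets: List[Tuple[str, float]], inventory: Inventory) -> List[str]:
    """Return one recording id per (unit, desired pitch) target."""
    chosen: List[str] = []
    for unit, wanted_pitch in targets:
        candidates = inventory[unit]  # raises KeyError if the corpus lacks the unit
        # Pick the candidate with the smallest pitch distance to the target.
        _, recording = min(candidates, key=lambda c: abs(c[0] - wanted_pitch))
        chosen.append(recording)
    return chosen


if __name__ == "__main__":
    # "Hi" with a rising pitch contour (hypothetical targets).
    print(select_units([("HH", 120.0), ("AY", 155.0)], INVENTORY))
```

    The `KeyError` case is the coverage problem in miniature: if the tagged corpus never recorded a unit, there is nothing to concatenate.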

    -Puk

    p.s. If this gets modded up, I could cap my karma on this. :P

  13. So much for voice print security systems. by TomatoMan · · Score: 3

    Disable those voice passwords on your machines, kids. Your pr0n is now exposed.

    TomatoMan

    --
    -- http://frobnosticate.com
  14. Re:Movie dubbing today... by Tiroth · · Score: 3

    I think that is a very interesting idea, but there are a lot of subtleties to consider. Languages don't share a common sound set...if you were dubbing English into German, there just isn't a sound for the glottal stop. How would you infer how the "actor model" should sound? I'm guessing this is a very nontrivial problem.

    One solution would be to get demo reels of the actors saying various sounds in the target language. The downside is that they will come across speaking the foreign language with a terrible accent...a Japanese actor might be fairly unintelligible speaking English since they are missing so many sounds (la=da=ra, no th-, etc)

    It's definitely a neat idea though.

  15. The AT&T "Rich" Voice by jaydho · · Score: 3

    If you haven't already, listen to the AT&T Customized Voice Product Demo (U.S. English, Male: "Rich"), truly amazing.

    With online news feeds coming in to the local radio station and the quality of the "Rich" custom voice, I have a feeling a lot of announcers may be going bye bye. In these samples he's way better than our local guy. Plus, since Shoutcast and such already have all the song info, think of the cool DJ announcing you could have.

    My roommate and I used the older online AT&T TTS to do our answering machine message for the dorm... It did pretty well with "This is mack daddy JD and phat daddy John's room"; that's the only message we've ever had that people would call back just to hear. With the old AT&T system you could adjust the pitch and various other settings to get it to sound good, I can't imagine what their new system will do!

    If you don't think too good, don't think too much.

    KingoftheBongo.com
  16. So? by 11thangel · · Score: 3

    So you have a computer program that takes binary (or ascii converted to binary) and makes it into a sound. Get me something that turns a sound into text with more than 90% accuracy and under 5 minutes of training routines, and I'll buy it.

    --

    I am !amused.
  17. Try it out! by Mr.+Sketch · · Score: 5

    On the AT&T Speech Labs website, they have a little demo where you can enter your own text and have it play for you using their software (30 word limit). Way Cool!!

    They also have recorded demos you can listen to, but I thought the interactive demo was pretty nifty.


    --BEGIN SIG BLOCK--
    I'd rather be trolling for goatse.cx.

  18. Re:Grrrreat by mr_gerbik · · Score: 3

    "i guess this can only mean more fraudulent accounts of his-story."

    His-story.. I hate that term. Who are you? Michael Jackson?

  19. Phone Sex With Anyone!! Call Now 1-800-ANJOLIE by Sydney+Weidman · · Score: 4

    Yes, we can give you any celebrity as your own personal plaything. All you have to do is send us the script (or enter it on our website) and we'll give you 5 minutes to remember. 5.99/minute. Long distance charges may apply.

  20. Re:Entropy-licious by Planesdragon · · Score: 4

    Expect video testimony to become useless in court cases... I mean, with a bit of photo work anyone can fake the jerky security camera footage--

    No, wait. We already have laws that cover this. I think they're called perjury...

  21. Fakes by DreamingReal · · Score: 4
    Dr. Rabiner said he was excited about the possibility of resurrecting renowned voices, like that of Harry Caray, the Chicago Cubs announcer who delivered rousing play-by-play broadcasts. "There are probably hours of recordings in archives," he said. Wouldn't it be great, he asked, if Harry Caray's voice could again be broadcasting in Wrigley Field?

    Absolutely not. And for the same reason that second-printings, plastic surgery, and fake breasts all suck - they're not the real deal.

    And as a die-hard Cubs fan since the age of 4, might I also add that the World Series drought for the last half century has taken on a sort of religious significance, not unlike the 40 years the Hebrews spent wandering in the desert. And Harry Caray was our Moses - resurrecting his voice without the man behind it is tantamount to sacrilege (not to mention unbelievably morbid!).


    -------

    --
    We want some answers and all that we get
    Some kind of shit about a terrorist threat

    - Ministry
  22. There's an evil use for this too: by AFCArchvile · · Score: 5
    I quote from U.S. Code, Title 47, Section 227, otherwise known as the Telephone Consumer Protection Act:

    "(b) (1) It shall be unlawful for any person within the United States
    (B) to initiate any telephone call to any residential telephone line using an artificial or prerecorded voice to deliver a message without the prior express consent of the called party, unless the call is initiated for emergency purposes or is exempted by rule or order by the Commission under paragraph (2)(B); ..."

    You hear that? There is to be no telemarketing use of this technology!

    --
    "Ancillary does not mean you get to rule the world." --U.S. Circuit Judge Harry Edwards, speaking to the FCC's lawyer
  23. This could be useful in games. by AFCArchvile · · Score: 5

    Just imagine how much less space some of the more involving computer games like Half-Life and Deus Ex would take up if all the dialog was synthesized with key samples from the voice actor (or, should I say, the "phoneme source"). That saved space could be used toward other things, like textures or ambient sounds. Of course, the biggest challenge would be to allocate some processing power for the synthesis. Still, it's probably in the works.

    --
    "Ancillary does not mean you get to rule the world." --U.S. Circuit Judge Harry Edwards, speaking to the FCC's lawyer
  24. Try it out! by DaneelGiskard · · Score: 3

    You can try out the "research version of Next-Generation Text-To-Speech (TTS) from AT&T Labs." here.

    I'm sure it's not the same thing as the one mentioned in the article, but I'm pretty sure the one in the article is at least based on this one.

    Try it out!

  25. Other Online Demos by DaneelGiskard · · Score: 5

    Some links to other online demos, so you can compare:

    http://www.elantts.com/indemo.htm
    http://www.cstr.ed.ac.uk/projects/festival/userin.html
    http://www.flexvoice.com/demo.html
    http://www.acuvoice.com/downloads/ttsdemo.html


    I searched for good TTS software to give voice to some of the 3d animations I did in max ... but I did not find anything satisfactory... :(

  26. Re:Job cuts in Hollywood... by KarmaBlackballed · · Score: 4

    expect the same audience as if Tom Hanks were doing the character

    And who says Tom Hanks ever has to fade away? It could be a brave new world where your future kids and mine grow up watching the same stars we have today and some from yesterday. I can imagine my grandchildren raving about that new Humphrey Bogart action film. Not so far fetched really.

    And for those that wonder about the legal aspects ... I think Tom Hanks would not mind getting paid nice royalty fees for the use of his young persona when he is retired in his 80's.


    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ~~ the real world is much simpler ~~

    --

    --- -- - -
    Give me LIBERTY, or give me a check.
  27. Movie dubbing today... by KarmaBlackballed · · Score: 5

    One neat application would be to dub foreign language films in the target language using the voice of the original actor even though they do not know the target language. They could start doing that today.

    They could start by fixing all those old Chinese and Japanese action/monster flicks dubbed by the same guy talking in false baritone and falsetto.


    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ~~ the real world is much simpler ~~

    --

    --- -- - -
    Give me LIBERTY, or give me a check.
  28. I can see it now by Auckerman · · Score: 3
    George Bush: All your Scuds are belong to us!

    Saddam Hussein: Somebody set us up the bomb!

    God help us all!

    --

    Burn Hollywood Burn
  29. I don't know about it.... by 3-State+Bit · · Score: 3
    I could understand it if they said "We can take a sample of speech, for instance, an actor reading a script in a dead celebrity's role, and then digitize it into an inflection and reproduce the same inflection in a different voice."
    But this isn't what they're saying:
    "The software [...] turns printed text into synthesized speech"
    Which begs the question "How does the software know what inflection to associate with the printed text?"
    I know that the same words can sound radically different. Take the phrase "one, two, or three" in each of the following contexts (note that none begins or ends a sentence):
    • "I can't imagine why ANYONE would want four subnets in their own house. I mean one, two, or three I can imagine, but four??"
    • "Please press one, two, or three at the tone."
    • "Okay, so it was in the early morning before 4. But can you be at all more specific? Do you have any idea whether it was at around one, two, or three AM?"
    • "Settings of four or five are considered dangerous, while settings of one, two, or three are considered to be within acceptable parameters."
    I think that if you record yourself saying the above phrases, then crop out just the highlighted phrase, you'll find a different inflection in each one. Without understanding what a sentence says, or, more precisely, what the person means who is saying the sentence, the fact that you can produce any inflection won't help you determine which one is right.

    I found Liz and Ike playing Scrabble while very drunk, and putting on all sorts of nonsensical words. I even saw "Zisis's", using a piece of rice for an apostrophe! (Zisis is a Greek convenience store near us).
    I told Liz and Ike that I thought they were crazy. "Heheh, yeah we're crazy", Ike says, "but each of us only put one word down that broke the rules in a major way."
    "Which words were those?"
    " 'Zisis's' and 'Windology' "
    Since Liz was the crazier of the two, I ventured a guess, "Liz's is Zisis's, isn't it?"
    "Nope. Liz's is 'windology'. 'Zisis's' is mine." Ike replied proudly.

    Anyway, the point of this exercise is to show that a human reader reading this can make the phrase "Liz's is Zisis's, isn't it" sound natural, but I bet any speech-synthesizing software that just follows rules will make it sound incomprehensible. That's because speech is more than reading things by set rules -- it is reading things to reflect your internal parsing of the sentence.

    Not to mention the fact that actors can read the same line in a thousand different ways to show a thousand different "interpretations" (states of the character who speaks it, or parsings of the sentence). How will this software produce them, if it only has the same text to parse?

    Either someone manually will give it an inflection, or it needs (or would need before truly being able to make good its claim) a human oral reading to "mimic", where it can use the synthesized voice to sound the same inflection in a different voice. Now that would, as the old mis-translated Coke slogan goes, "bring your dead relatives back alive."

    Mere dancing with power brooms? Ha, now celebrities will be telling you about how easy to use AOL is. So easy to use, no wonder it's number 1 -- even among the dead!

    Gee, I can hardly wait.




    (It was intended to sound like "coca cola" when its Chinese characters are pronounced).

    --
  30. John F. Headroom by corvi42 · · Score: 3

    Ask not what your country ... can do ... for you but what
    what
    what
    what
    what
    what
    you can do for for your country.

    --

    There are a thousand forms of subversion, but few can equal the convenience and immediacy of a cream pie -Noel Godin
  31. Re:Entropy-licious by Bonker · · Score: 4

    Another interesting point of interest is with the new Final Fantasy: The Spirits Within movie, actors are beginning to consider copyrighting their likenesses,

    Good for them... Better for us! Who wants dumpy Sandra Bullock, bug-eyed Steve Buscemi, or smarmy Ben Affleck when we can have perfect, artist produced, fan-boy (and fan-girl) material like Aki from FF?

    --
    The next Slashdot story will be ready soon, but subscribers can beat the rush and slashdot the links early!
  32. Cool... and disturbing. by Tin+Weasil · · Score: 3

    While this is a really great leap in TTS technologies... which is sure to make computers for the blind even more accessible than ever... the idea of being able to reproduce any voice is very scary.

    What happens when you get a sample of some General's voice and then use a synthesiser to call up the poor kid on guard duty and get him to let a bunch of terrorists enter the base?

    1. Re:Cool... and disturbing. by dachshund · · Score: 5
      Actually, this isn't a very exciting thing for the blind. For most practical uses, the visually impaired tend to prefer speed over quality. It doesn't have to sound great as long as it can read several times faster than "normal" speed. The AT&T TTS isn't really designed for this purpose.

      Its main use is for telephony (surprise!) but I suppose it'll be turning up in new and exciting places.

    2. Re:Cool... and disturbing. by Anixamander · · Score: 5

      What happens when you get a sample of some General's voice and then use a synthesiser to call up the poor kid on guard duty and get him to let a bunch of terrorists enter the base?

      Obviously if this does happen, then all their bases...aww, forget it.
      --

      --
      Do not taunt Happy Fun Ball(TM)
  33. Most excellent by baptiste · · Score: 3

    This is great news. For too long TTS has been held back by questionable voice quality. Microsoft's engine was a huge step forward, but still wasn't quite there. As the technology advances and requires less CPU power (or more CPU power is fit into a smaller space) I can imagine this will rapidly show up in places where voice prompts would be nice but aren't so critical as to justify deploying a bad-sounding technology.

  34. Doubtful. by MarkusQ · · Score: 5
    Match the intonation of any human voice, without a sample of that voice saying the phrase in the desired intonation, just from the text?

    "Yeah, right!"

    "Officer, it is clear to me that you are in fact the one who is inebriated."

    "I found it that way. Honest."

    "Now, nothing has really changed since the last contract, we just cleaned up a few details; Please sign and return ASAP."

    "But Billy got one...why can't I? Please?"

    "Would you like to move to the sofa?"

    I don't buy it for a minute. To do what they claim would require real AI(tm).

    -- MarkusQ

  35. Entropy-licious by Nihilanth · · Score: 5

    Well kids, say goodbye to phone taps, voice mail, and important business being conducted over the phone. If this technology really accomplishes what the above says, voice recordings wouldn't be able to hold up in court because..well..it would be difficult/impossible to prove that they were really recordings of the person's voice.

    Of course, I don't think this kind of technology should be "outlawed" or "restricted", that will only make it easier to be used maliciously, as with any technological advancement.

    Another interesting point of interest is with the new Final Fantasy: The Spirits Within movie, actors are beginning to consider copyrighting their likenesses, since they can be reproduced on a computer with frightening quality and clarity. Perhaps this applies to voice reproduction as well.

    This sounds like a very beneficial technology, especially for games, where a high-quality voice synth could replace volumes of digitally recorded and compressed audio files..but it opens the door for some really frightening possibilities of fraud, social engineering, and copyright side-stepping.