Slashdot Mirror


State of Speech Synthesis and Text-To-Speech?

Gnulix asks: "Are there any products, preferably open source, that produce realistic speech from arbitrary (English) text? Projects such as Festival don't sound all that much better than SAM (Software Automatic Mouth) did on a Commodore 64 back in 1982, nor do SoftVoice's or IBM's new products sound very good. I mean, we all know that Stephen Hawking is a fun-loving guy, but I bet you that he didn't choose his unrealistic, robotic voice just for the heck of it. With all the amazing advances we have seen in real-time graphics, shouldn't speech synthesis have come much, much further than what is, seemingly, available today?" Ask Slashdot last handled the Voice-To-Text issue in January of this year.

7 of 52 comments

  1. AT&T Natural Voices by Utopia · · Score: 5, Informative

    is the best text-to-speech conversion program.
    Check out http://www.naturalvoices.att.com/

  2. Hawking... by 3-State+Bit · · Score: 5, Interesting

    Actually, I heard that they offered Hawking a revamped speech synthesizer, since although his was state-of-the-art in the seventies, today we have much better. He declined, saying he and his friends had gotten used to the voice, and it was "his". In fact, whenever one hears that particular flavor of voice synthesis, it's difficult not to think of Hawking.

    He does relate, however, in A Brief History of Time, that at first people had trouble understanding "his voice", so that when he would speak or answer questions at lectures, he would have an interpreter who was more familiar with his voice repeat what he just said.

    Interesting stuff...

    1. Re:Hawking... by GuyMannDude · · Score: 4, Funny

      He declined, saying he and his friends had gotten used to the voice, and it was "his".

      Not to mention the legions of fans who follow his side-career as a gangsta rapper with due vigor! Changing his voice would give his music a very different sound!

      GMD

  3. The larger issue is NLP by RobotWisdom · · Score: 5, Interesting

    Modulating intonations is part of the larger challenge of natural-language processing (NLP, a subdiscipline of AI). We simply don't have the sort of general theory of language-production that could systematically predict how the intonations should fall, any more than we have a theory of translation that can do substantially better than Babelfish.

    Nor, to harp on my pet peeve, do we have a theory of semantics that can put XML to any important use on the average webpage. These all need a model of the human psyche, because all human language is flavored with metaphors from the realm of motives and plans, etc (the psychological realm). Psychological science isn't delivering the sorts of models that NLP-etc need, and probably won't for many decades yet. [My AI FAQ]

  4. Actually, graphics hasn't come 'that far' by stienman · · Score: 4, Insightful

    With all the amazing advances we have seen in real-time graphics, shouldn't speech synthesis have come much, much further than what is, seemingly, available today?

    We haven't had that many amazing advances in graphics. Natural speech is to advanced raytracing what current text to speech is to current graphics. We still cannot raytrace in a single system in real time at the resolution of our eyes, and we still cannot produce natural speech in a single system in real time at the resolution of our ears.

    Furthermore, we know less about the math of speech than we know about the math of light. Go visit your local university that has a good CS program, and browse the bookstore for the books used to teach speech recognition. In that book you will find that the average sound a human makes goes from production of complex, multitonal sound from the vocal cords through as many as five complex natural filters (body cavities between the vocal cords and lips) before it reaches the ears of the recipient.

    Modeling these filters for one sound is hard enough. Each letter in our alphabet, except simple vowels, changes the filters throughout the letter. Furthermore the filters for a given letter may also change depending on the previous and next letter.

    A system to create speech, therefore, must generate hundreds (perhaps thousands) of different filtered 'noises' just to reproduce the English language. Other languages can be much more complex.
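    The "complex source through a chain of filters" idea above can be sketched roughly in code. This is only a toy illustration, not how any engine mentioned in this thread works: it assumes the vocal cords can be stood in for by a plain pulse train and each cavity by a simple two-pole resonator, and the formant frequencies and bandwidths are made-up, ballpark values. All function names are hypothetical.

```python
import math

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients for one two-pole resonator (a stand-in for one cavity)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1
    theta = 2.0 * math.pi * freq_hz / sample_rate        # pole angle
    a1 = -2.0 * r * math.cos(theta)
    a2 = r * r
    gain = 1.0 + a1 + a2  # scale so the filter passes DC with gain 1
    return gain, a1, a2

def apply_resonator(signal, coeffs):
    """Run the signal through y[n] = gain*x[n] - a1*y[n-1] - a2*y[n-2]."""
    gain, a1, a2 = coeffs
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x - a1 * y1 - a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def pulse_train(pitch_hz, duration_s, sample_rate):
    """Crude stand-in for the vocal cords: one unit impulse per pitch period."""
    period = int(sample_rate / pitch_hz)
    n = int(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def synth_vowel(formants, pitch_hz=120.0, duration_s=0.2, sample_rate=16000):
    """Pass the source through each resonator in turn (a filter cascade)."""
    signal = pulse_train(pitch_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        coeffs = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
        signal = apply_resonator(signal, coeffs)
    return signal

# Assumed, illustrative (frequency, bandwidth) pairs for a steady "ah"-like vowel.
AH_FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synth_vowel(AH_FORMANTS)
```

    Note that this only covers the easy case of a single steady vowel. The hard part described above, filters that change during a sound and depend on its neighbours, would mean varying the resonator settings from sample to sample.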

    Current common technology is to simply record the hundreds of 'simple' sounds and add them together. Really good programs use hundreds of hours of speech by voice actors to get several hundred sounds.

    The ultimate goal is to mathematically recreate every part of the human vocal system from the lungs to the lips. This has obviously not occurred. The computers may well be powerful enough, but the understanding of the vocal tract is extremely limited.

    In other words, wait 5-10 years. There still isn't a killer application for text to speech, but with devices getting smaller and smaller, there will be soon enough.

    -Adam

    1. Re:Actually, graphics hasn't come 'that far' by Guspaz · · Score: 5, Funny

      I just hope they have enough recordings of Majel Barrett ;-)

  5. Re:AT&T Natural Voices by pediddle · · Score: 4, Informative

    Another extremely strong competitor to Natural Voices is Speechworks' Speechify. Take the "Speechify Challenge" -- it's still possible to tell which is a real recording and which is the computer, but it is very difficult. Some say it's the best engine available, but I guess that's a matter of personal preference.

    I don't know about Open Source TTS, but the commercial versions (AT&T, Speechworks, and others) are sitting on the threshold of truly natural speech. I work in the speech industry, so I follow progress and have seen some of the unreleased demos of upcoming versions. In the next couple years, we can expect amazing things. It won't be long before the Speechify Challenge will truly be impossible to beat.

    By the way, for those of you who don't know, the newest and best-sounding engines don't use purely synthesized sounds as older and small-footprint engines do (Festival and Stephen Hawking). The engines are built using actual recordings: a "voice actor" will sit in a studio and record dozens of hours of speech, and then, over the course of several months, the recordings are cut and spliced into individual phonemes, which are reassembled by the engine. This means that the voices actually sound like real people, and the only unrealistic part is the inflection when generating complete sentences. You can order custom voices (for several tens of thousands of dollars) and get a voice that sounds identical to that of your celebrity of choice.
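
    The cut-and-reassemble step can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the method of any commercial engine: the "recordings" are short placeholder lists of samples rather than real audio, the phoneme labels are illustrative, and the joins are smoothed with a simple linear crossfade. The names `crossfade_concat`, `synthesize`, and `inventory` are hypothetical.

```python
def crossfade_concat(units, overlap):
    """Join sample lists end to end, blending `overlap` samples at each seam."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight ramps from the old unit to the new one
            out[start + i] = (1.0 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out

def synthesize(phonemes, inventory, overlap=4):
    """Look up one recorded unit per phoneme and splice them together."""
    units = [inventory[p] for p in phonemes]
    return crossfade_concat(units, overlap)

# Placeholder "recordings": constant-valued sample lists standing in for audio.
inventory = {
    "HH": [0.1] * 10,
    "AH": [0.5] * 20,
    "L": [0.3] * 12,
    "OW": [0.4] * 18,
}
audio = synthesize(["HH", "AH", "L", "OW"], inventory)
```

    A lookup like this uses the same recording for a phoneme no matter what surrounds it, which is one reason whole-sentence inflection is the part that still sounds unrealistic.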