State of Speech Synthesis and Text-To-Speech?

Posted by Cliff on Thursday November 14, 2002 @12:33PM from the my-computer-still-doesn't-talk-to-me dept.

Gnulix asks: "Are there any products, preferably open source, that produce realistic speech from arbitrary (English) text? Projects such as Festival don't sound all that much better than SAM (Software Automatic Mouth) did on a Commodore 64 back in 1979, nor do SoftVoice's or IBM's new products sound very good. I mean, we all know that Stephen Hawking is a fun-loving guy, but I bet you that he didn't choose his unrealistic, robotic voice just for the heck of it. With all the amazing advances we have seen in real-time graphics, shouldn't speech synthesis have come much, much further than what is, seemingly, available today?" Ask Slashdot last handled the Voice-To-Text issue in January of this year.

Related ? by Tolchz · 2002-11-14 12:38 · Score: 2, Insightful

How does "voice to text" relate to "text to voice"?

Look at the older article; it's a completely different question.

Apple, and MS by GigsVT · 2002-11-14 12:58 · Score: 3, Insightful

Yeah, closed source :)

MS has had text-to-speech as an object you can embed in your program with one line of VB code (same as you can embed IE) for a while now.

Apple has had text-to-speech extensions in tons of different voices for a long time. Some of the G4s used to read dialog boxes to you by default if you didn't click on them fast enough. Pretty unnerving the first couple of times.

Several voice-activated automated attendant systems I have called for my credit card and bank are amazing these days. They have insanely accurate speech recognition and really good text-to-speech.

So I wouldn't say the field is not advancing... it is.

Of course, a Google search for "open source text to speech" without quotes yields many promising-looking hits, which I haven't evaluated. Why didn't you search there before asking Slashdot?

--
I've had enough abrasive sigs. Kittens are cute and fuzzy.

Actually, graphics hasn't come 'that far' by stienman · 2002-11-14 14:00 · Score: 4, Insightful

With all the amazing advances we have seen in real-time graphics, shouldn't speech synthesis have come much, much further than what is, seemingly, available today?

We haven't had that many amazing advances in graphics. Natural speech is to advanced raytracing what current text to speech is to current graphics. We still cannot raytrace in a single system in real time at the resolution of our eyes, and we still cannot produce natural speech in a single system in real time at the resolution of our ears.

Furthermore, we know less about the math of speech than we know about the math of light. Go visit your local university that has a good CS program, and browse the bookstore for the books used to teach speech recognition. In those books you will find that the average sound a human makes starts as a complex, multitonal sound produced by the vocal cords and then passes through as many as five complex natural filters (body cavities between the vocal cords and lips) before it reaches the ears of the recipient.

Modeling these filters for one sound is hard enough. Each letter in our alphabet, except simple vowels, changes the filters throughout the letter. Furthermore, the filters for a given letter may also change depending on the previous and next letters.

A system to create speech, therefore, must generate hundreds (perhaps thousands) of different filtered 'noises' just to reproduce the English language. Other languages can be much more complex.
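To make the "source plus filters" idea concrete, here is a minimal Python sketch of one steady vowel-like sound. It is an illustration only, not how Festival or any other product works. It assumes the vocal cords can be stood in for by a plain impulse train and each body cavity by a two-pole resonator. The sample rate, the function names, and the formant values are all assumptions made for this example.

```python
import math

SAMPLE_RATE = 16000  # Hz; an assumed rate for illustration


def glottal_source(f0, duration, sample_rate=SAMPLE_RATE):
    """Impulse train at pitch f0: a crude stand-in for the vocal cords."""
    n = int(round(duration * sample_rate))
    period = int(round(sample_rate / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, center_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole resonant filter standing in for one vocal-tract cavity."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius < 1
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2  # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants, f0=120.0, duration=0.3):
    """Pass the source through each filter in turn, then normalize the peak."""
    signal = glottal_source(f0, duration)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]


# Assumed (center frequency, bandwidth) pairs for a rough "ah"-like sound
AH_FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesize_vowel(AH_FORMANTS)
```

Even this toy only covers a sound whose filters stay fixed. Changing the filters during a letter, and from one neighboring letter to the next, would mean varying the resonator settings sample by sample, which is where the hundreds or thousands of cases come from.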

Current common technology is to simply record the hundreds of 'simple' sounds and add them together. Really good programs use hundreds of hours of speech by voice actors to get several hundred sounds.
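The record-and-join approach can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions: the "recordings" are placeholder sine tones rather than voice-actor audio, the unit names and the `UNITS` inventory are hypothetical, and the short crossfade at each join is one common way to hide seams, not necessarily what any particular product does.

```python
import math

SAMPLE_RATE = 16000  # Hz; an assumed rate for illustration


def placeholder_unit(freq_hz, duration=0.05):
    """Stand-in for a recorded snippet; a real system would load recorded audio."""
    n = int(round(duration * SAMPLE_RATE))
    return [math.sin(2.0 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical inventory of recorded units, keyed by unit name
UNITS = {
    "h": placeholder_unit(300.0),
    "e": placeholder_unit(500.0),
    "l": placeholder_unit(400.0),
    "o": placeholder_unit(450.0),
}


def concatenate(unit_names, inventory, overlap=80):
    """Join units end to end, crossfading `overlap` samples at each seam."""
    out = []
    for name in unit_names:
        unit = inventory[name]
        if not out:
            out.extend(unit)
            continue
        k = min(overlap, len(out), len(unit))
        start = len(out) - k
        for i in range(k):
            w = (i + 1) / (k + 1)  # fade-in weight for the incoming unit
            out[start + i] = (1.0 - w) * out[start + i] + w * unit[i]
        out.extend(unit[k:])
    return out


speech = concatenate(["h", "e", "l", "o"], UNITS)
```

Because the sound of a letter depends on its neighbors, a real inventory needs a separate recording for each context rather than one per letter, which is why the unit count climbs into the hundreds.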

The ultimate goal is to mathematically recreate every part of the human vocal system from the lungs to the lips. This has obviously not occurred. The computers may well be powerful enough, but the understanding of the vocal tract is extremely limited.

In other words, wait 5-10 years. There still isn't a killer application for text to speech, but with devices getting smaller and smaller, there will be soon enough.

-Adam