State of Speech Synthesis and Text-To-Speech?

Posted by Cliff on Thursday November 14, 2002 @12:33PM from the my-computer-still-doesn't-talk-to-me dept.

Gnulix asks: "Are there any products available, preferably open source, that produce realistic speech from arbitrary (English) text? Projects such as Festival don't sound all that much better than SAM (Software Automatic Mouth) did on a Commodore 64 back in 1979, nor do SoftVoice's or IBM's new products sound very good. I mean, we all know that Stephen Hawking is a fun-loving guy, but I bet you that he didn't choose his unrealistic, robotic voice just for the heck of it. With all the amazing advances we have seen in real-time graphics, shouldn't speech synthesis have come much, much further than what is, seemingly, available today?" Ask Slashdot last handled the Voice-To-Text issue in January of this year.

AT&T Natural Voices by Utopia · 2002-11-14 12:37 · Score: 5, Informative

is the best text-to-speech conversion program.
Check out http://www.naturalvoices.att.com/
AT&T Labs Research by jcbphi · 2002-11-14 12:52 · Score: 2, Informative

AT&T Labs Research has some recent work in TTS. I'm not sure how state-of-the-art it is, but it's certainly much better than the TTS referred to.
Check out the National Weather Service by tdyson · 2002-11-14 12:53 · Score: 3, Informative

The NWS's automated weather channel broadcasts use a new technology this year. The change was quite a big deal in the marine communities, where people listen to these voices every day. The new voices are pretty darn good.
The National Weather Service describes their new system.
I have an interest in this by Kafteinn · 2002-11-14 13:09 · Score: 3, Informative

And the best I have found so far is Festival with Mbrola voices (although not perfect, they are far superior to the Festival voices)

For voice control stuff I found a little program called cvoicecontrol to be quite nice.

--
Hitler's in the fridge.
AT&T has done a lot by xagon7 · 2002-11-14 13:32 · Score: 3, Informative

Just check THIS out:

http://www.naturalvoices.com/

quite a big step in the right direction in my opinion.
Re:AT&T Natural Voices by pediddle · 2002-11-14 14:35 · Score: 4, Informative

Another extremely strong competitor to Natural Voices is Speechworks' Speechify. Take the "Speechify Challenge" -- it's still possible to tell which is a real recording and which is the computer, but it is very difficult. Some say it's the best engine available, but I guess that's a matter of personal preference.

I don't know about Open Source TTS, but the commercial versions (AT&T, Speechworks, and others) are sitting on the threshold of truly natural speech. I work in the speech industry, so I follow progress and have seen some of the unreleased demos of upcoming versions. In the next couple years, we can expect amazing things. It won't be long before the Speechify Challenge will truly be impossible to beat.

By the way, for those of you who don't know, the newest and best-sounding engines don't use purely synthesized sounds as older and small-footprint engines do (Festival and Stephen Hawking's). The engines are built using actual recordings: a "voice actor" will sit in a studio and record dozens of hours of speech, and then, over the course of several months, the recordings are cut and spliced into individual phonemes, which are reassembled by the engine. This means that the voices actually sound like real people, and the only unrealistic part is the inflection when generating complete sentences. You can order custom voices (for several tens of thousands of dollars) and get a voice that sounds identical to that of your celebrity of choice.
Re:AT&T Natural Voices by pediddle · 2002-11-14 14:38 · Score: 3, Informative

One addendum: the fact that the newest engines use real recordings is exactly the reason why it will be nearly impossible for Open Source engines to approach the quality of commercial versions. The amount of work involved in extracting the raw sounds from recordings is staggering, and it requires full-time commitment from trained experts over the course of many months (not to mention the cost of hiring voice talent). There is no way to avoid the costs involved, and so Open Source alternatives cannot become available without some sort of large grant. Unfortunate.
TTS Synthesizers by irrelevant · 2002-11-14 15:33 · Score: 3, Informative

Here at work we monitor progress of and/or use the following:

DECTalk (One of the most widely used)
Eloquent (http://www.eloq.com - dead URL?) (fairly natural-sounding with dialects)
Elan (European languages)

They've all been improving over the years.
Call me picky if you will.. by Anonymous Coward · 2002-11-14 15:51 · Score: 1, Informative

..but the Commodore 64 wasn't even released in 1979. It didn't come out till late '82 if my memory serves me correctly.
State of the art in TTS by Sam Lowry · 2002-11-14 20:51 · Score: 3, Informative
There are basically two TTS technologies on the market:
- diphone-based synthesis, where the database contains one diphone (end of first sound + start of next sound) for each possible sound combination. This approach is used in Festival. Diphone-based synthesis will hardly sound better than it does in Festival, because diphones have to be modified artificially to fit every variation of pitch, duration and any other parameter that is needed to produce a given phrase.
- corpus-based synthesis, which takes a different approach: a large database of several hours of speech is recorded and manually labelled to mark the start and end of each sound. Such a database is used to extract the best and the longest sequence of diphones during production. This approach gives natural-sounding results for short sentences where intonation is not so important.
Given that the cost of developing a database for corpus synthesis may easily be 100 times higher than for dyphone synthesis, there are very few companies that make them. Two companies offer a demo on the internet: ATT and Scansoft (former L&H) and
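
To illustrate the corpus-based idea described above, here is a minimal Python sketch of picking the longest recorded runs of diphones for a target phrase. Everything in it is an assumption made for illustration: the tiny `CORPUS`, the diphone labels, and the greedy longest-match rule are hypothetical, and commercial engines presumably also score candidates on pitch, duration and how well the joins fit rather than on length alone.

```python
from typing import Dict, List, Tuple

# Hypothetical labelled corpus: each recorded utterance is a list of diphone labels.
CORPUS: Dict[str, List[str]] = {
    "utt1": ["h-e", "e-l", "l-o", "o-w"],
    "utt2": ["w-er", "er-l", "l-d"],
    "utt3": ["o-w", "w-er"],
}


def longest_run(target: List[str], start: int,
                corpus: Dict[str, List[str]]) -> Tuple[str, int, int]:
    """Return (utterance id, offset, length) of the longest corpus run
    that matches the target starting at position `start`."""
    best = ("", 0, 0)
    for utt_id, units in corpus.items():
        for offset in range(len(units)):
            length = 0
            while (start + length < len(target)
                   and offset + length < len(units)
                   and units[offset + length] == target[start + length]):
                length += 1
            if length > best[2]:
                best = (utt_id, offset, length)
    return best


def select_units(target: List[str],
                 corpus: Dict[str, List[str]]) -> List[Tuple[str, int, int]]:
    """Greedily cover the target with the longest available recorded runs.
    Fewer, longer segments mean fewer audible joins."""
    pos = 0
    chosen = []
    while pos < len(target):
        utt_id, offset, length = longest_run(target, pos, corpus)
        if length == 0:
            raise KeyError(f"diphone {target[pos]!r} is not in the corpus")
        chosen.append((utt_id, offset, length))
        pos += length
    return chosen
```

Calling `select_units(["h-e", "e-l", "l-o", "o-w", "w-er", "er-l", "l-d"], CORPUS)` covers the phrase with two segments and a single join, where a pure diphone database would need a join after every unit.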
Festival with MBrola by tigersha · 2002-11-14 22:19 · Score: 2, Informative
I can only concur with the poster above who said that Festival with MBrola is probably the best OSS bet. Actually, the MBrola voice itself has a license for "non-commercial" use, but we are a nonprofit, so...

In particular, there is one high-res female voice in MBrola that is very good. If you need any help setting it up (I can happily give you my festival config file) just mail me at netgrok @at@ yahoo . de

That said, I think text output is very underrated technology and is quite useful, if used in moderation for the right purposes. One sometimes reads overexcited hyping about reading your emails out loud in the car or at breakfast, but that ain't gonna happen with current technology.

For one, the synthesis is a bit monotonous for long texts (but then, now that I think about it, having your SO or kids read a letter out loud would probably not be any better...)

Secondly, you do not necessarily want a user interface where a computer reads out things the whole time (logs, for instance) because a) it's annoying and b) it will not work in an office with multiple people.

Where it DOES work and is trivial to implement is for things that are singular events that occur during the day, or alarms. Similar to the sort of PA announcements that you would get in a department store. They do not read aloud the whole time, do they?

In our case we use festival with Mbrola for two things:
- There is a small script that checks the main services on all our critical machines and if one goes down, the system moans. A loud voice that says "The http server on web1 is down" gets more attention than a little light.
- Our backup system moans at specific times (about three times a day) about the next tape that needs to be put in. "Please insert the correct tape into the tape drive on backup. The tape needed is Unix 1. The backup will start at 4 this afternoon" is the announcement I just heard. Sometimes I add "Please insert the tape, pleeeeease", just to hear the damn computer beg ME for a change :)
- Also, if I FORGET to insert the tape the computer starts moaning continuously about it. Nothing like a whining b.tch to convince you to get off your butt to put the tape in :)
It is also trivial to insert this into a standard sysv start/stop script at boot time so that you have some tag when critical servers are shut down for some reason.
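
For anyone who wants to try something similar, a service check along these lines could be sketched in Python as follows. This is not the actual script described above: the function names, the service tuples and the plain TCP connect test are assumptions for illustration, and `speak` defaults to `print` so that any text-to-speech command can be plugged in.

```python
import socket
from typing import Callable, Iterable, List, Tuple

# (service name, host, TCP port); these entries are hypothetical examples.
Service = Tuple[str, str, int]


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def down_announcements(services: Iterable[Service],
                       is_up: Callable[[str, int], bool] = port_is_open) -> List[str]:
    """Build one spoken sentence for every service that fails its check."""
    return [f"The {name} server on {host} is down"
            for name, host, port in services
            if not is_up(host, port)]


def check_and_announce(services: Iterable[Service],
                       speak: Callable[[str], object] = print,
                       is_up: Callable[[str, int], bool] = port_is_open) -> List[str]:
    """Check every service and pass each failure message to `speak`."""
    messages = down_announcements(services, is_up)
    for message in messages:
        speak(message)
    return messages
```

Run from cron every few minutes with a list such as `[("http", "web1", 80)]`, this stays silent while everything is up and only speaks when a check fails, which matches the "singular events" use described above.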

Costs for this setup? 2 hours of install time (installing MBrola on festival took some digging through the docs. If you need it, mail me). Writing a script that does "say x" instead of "echo x" took about 2 minutes. And putting these commands into the cron job took about 10. So for about 3 hours' worth of time and a set of very cheapo computer speakers you get a good, useful, functional system which works, and the voice is very, very good. This is pretty neat for critical or semi-critical announcement kinds of events, not continuous interaction.

Since the only command I use to activate it is "say x" from unix shellscripts, integrating it with your current setup is trivial.
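
A "say x" wrapper of this kind might look like the Python sketch below. It is a guess at the general shape, not the poster's script: it assumes the `festival` binary is on the PATH and relies on `festival --tts`, which speaks text read from standard input. The `run` parameter is only there so the command can be swapped out or tested without a sound card.

```python
import subprocess
import sys
from typing import Callable, List


def say(text: str, run: Callable = subprocess.run):
    """Speak `text` by piping it to `festival --tts` on standard input."""
    return run(["festival", "--tts"], input=text.encode("utf-8"), check=True)


def main(argv: List[str], run: Callable = subprocess.run) -> int:
    """Join the command-line words into one sentence and speak it."""
    text = " ".join(argv).strip()
    if not text:
        print("usage: say TEXT...", file=sys.stderr)
        return 2
    say(text, run=run)
    return 0
```

To use it as a command, save it as an executable file named `say` and finish the file with `sys.exit(main(sys.argv[1:]))` under the usual `if __name__ == "__main__":` guard. A cron job or init script can then call `say The backup will start at 4 this afternoon` exactly where it would otherwise call `echo`.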

Btw, the MBrola website has a demo of a german voice reading a weather and traffic report which is even better than the English one.

Oh yeah, it was fun to watch the cleaning lady almost get a heart attack when the computer greeted her...
--
The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism