Phoneme Approach For Text-to-Speech in SCIAM

Posted by Hemos on Sunday March 16, 2003 @11:59PM from the understanding-the-language dept.

jscribner writes "Scientific American is running a feature on IBM Research's Text-to-Speech technology. It discusses the current state of affairs in this field, and describes IBM's phoneme based 'Supervoices' approach. The IBM site provides a demonstration, allowing users to enter text to be rendered to speech, as well as providing several examples in other languages."

14 of 189 comments

Phonemes not phenomes by Tucan · 2003-03-17 00:04 · Score: 4, Informative

Phonemes are the building blocks of language, not phenomes.
I was expecting better... by LeoDV · 2003-03-17 00:04 · Score: 5, Informative

If memory serves me, I believe it was AT&T (?) that used to have a similar webpage with near-perfect text-to-speech, which is hardly the case with this project.

What's so special about it?
1. Re:I was expecting better... by Rubyflame · 2003-03-17 01:03 · Score: 5, Informative
  
  Used to? Still does! It's called "AT&T Natural Voices," and there's an online demo.
  
  --
  
  All it takes is nukes and nerves.
speaking of the /. effect by trelanexiph · 2003-03-17 00:06 · Score: 4, Funny

I guess IBM didn't have much to say on the matter.

IBM Text-to-Speech Research Demonstration

Input Communcations Error.

You have reached this page because of an severe input error. It appears that the client didn't connect to the server. Please inform the system administrator using the feedback mechanism on the main home page.
AT&T have been doing this for a while! by Anonymous Coward · 2003-03-17 00:13 · Score: 5, Informative

If you visit here:
http://www.naturalvoices.att.com/demos/

You'll find AT&T's version a whole lot better. The main problem with voice synthesis is smoothing of phoneme edges, where if it is done too aggressively the speech synthesis can sound too "lumpy".
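To make "smoothing of phoneme edges" concrete, here is a minimal sketch of one common way to join two recorded units: a linear crossfade over a short overlap. This is only an illustration. The function name `crossfade_join` and the choice of a linear ramp are assumptions, not a description of what AT&T or IBM actually do.

```python
def crossfade_join(left, right, overlap):
    """Join two lists of audio samples, blending the last `overlap`
    samples of `left` with the first `overlap` samples of `right`.

    Hypothetical helper for illustration; a linear ramp is assumed.
    """
    if overlap < 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must fit inside both units")
    if overlap == 0:
        # No smoothing at all: a hard cut at the unit boundary.
        return list(left) + list(right)

    head = list(left[:len(left) - overlap])
    tail = list(right[overlap:])
    blended = []
    for i in range(overlap):
        # Weight shifts gradually from the left unit to the right unit.
        w = (i + 1) / (overlap + 1)
        l_sample = left[len(left) - overlap + i]
        r_sample = right[i]
        blended.append((1.0 - w) * l_sample + w * r_sample)
    return head + blended + tail


if __name__ == "__main__":
    a = [1.0, 1.0, 1.0, 1.0]
    b = [0.0, 0.0, 0.0, 0.0]
    print(crossfade_join(a, b, 2))
```

The `overlap` parameter is the knob the comment is talking about: a longer overlap hides the seam but blurs more of each sound, while zero overlap leaves an audible jump.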

The other thing is, speech synthesis via phonemes is very basic practice indeed! I remember having a Currah Speech module for my ZX Spectrum (1982 home computer) - and the first thing you were taught about was phonemes. I'm not entirely sure what's new about this IBM product. It's basically not that much evolved from the mid-90s.
Re:PHONEME, y'all, not *phenome by WeeBull · 2003-03-17 00:15 · Score: 5, Funny

.. and often uttered in distressed tones at the end of a night out, usually by desperate males attempting to re-attach themselves to some female. PHONEME! PLEASE PHONEME! I LOVE YOU! PHONEME!
*blush* by WeeBull · 2003-03-17 00:22 · Score: 5, Funny

Uhm, ok, who else just spent 10 minutes (thoroughly) checking whether IBM filters naughty words at the text-to-speech interface? Getting the female voices to utter favourable phrases regarding one's studliness, perhaps?
Oh ... just me? *blush*
Open Source Speech Synthesis by wzrd2002 · 2003-03-17 00:23 · Score: 5, Informative

There is already a freely available open source speech synthesis application for both Linux and Windows, called Festival, created by the University of Edinburgh.
comparison to Apple's technology? by inblosam · 2003-03-17 00:26 · Score: 4, Informative

I run Mac OS X and in a lot of applications you have the option for the computer to read an entire document. For example, in TextEdit (a simple text editor by Apple) you can go to Edit, Speech, Start Speaking...in the menu and it will read everything for you. There are 10-15 different default voices to choose from, and built into the OS you can control pretty much everything by speech and get information by voice.

How does this compare? I think it is at least at the same level, if not further along! Good work Apple for being in the game, if not ahead of the game on this one.
1. Re:comparison to Apple's technology? by aseidl · 2003-03-17 00:56 · Score: 4, Interesting
  
  I'm surprised by how many people (Mac users and otherwise) haven't noticed how long MacOS has come with text to speech. It's been included since at least MacOS 7.5, maybe even 7.0 (I was using it on my trusty ol' IIci yesterday). You could use it via SimpleText or even have it speak the text of dialog boxes. The quality of the voices could be better, but they do seem better than Festival. But, I have to admit it is pretty fun to scare people who don't know about it. One of my friends told me that his mother gets scared if she doesn't click OK or Cancel in a dialog because "those voices are going to come."
And don't forget Bell Labs by rpiquepa · 2003-03-17 00:28 · Score: 4, Informative

IBM is not alone in working on text-to-speech technology or in having demos where you can type a phrase and listen to it. The Bell Labs Text-to-Speech system (TTS) has its own page featuring fun demos. "You can play with our basic interface for some of our Text-to-Speech systems: American English, German, Mandarin Chinese, Spanish, French, Italian and Canadian French." This page is pretty old (it makes references to Netscape 3!!), but the demos still run fine.
TTS is great by jjohn · 2003-03-17 00:31 · Score: 4, Interesting

Last year, I started playing with this IBM tech. I thought it would be cool to have RSS feeds read to you in the middle of streaming music. It's kind of do-it-yourself radio. Although I don't have anything to show for that idea, I did make a few songs with it, like Make the Pie Higher, Plug Nickle and Progress.
mmm. I hope the server can take a slashdotting...
The TTS interface is C++, but it comes with a program that will compile text into AU files. I wrote the following script to change those AU files into mp3s:
#!/bin/bash
# Make a text file a spoken MP3
if [ -z "$1" ] ; then
    echo "usage: $0 <input.txt>"
    exit
fi
base=`basename $1 .txt`
echo "attempting to create $base.mp3"
/home/jjohn/src/c/viavoice/cmdlinespeak/speakfile $1
writewav.pl temp.au temp.wav
lame -h temp.wav $base.mp3
rm -f temp.au temp.wav

speakfile is a slightly hacked version of the demo program IBM ships. Unfortunately, /.'s lameness filter doesn't like C++ code. :-(
It's pretty messy C++ hacking on my part, anyway. The Perl program is based on the CPAN module Audio::SoundFile. It's also hacked from a demo script that shipped with the module.
#!/usr/bin/perl
use Audio::SoundFile;
use Audio::SoundFile::Header;

my $BUFFSIZE = 16384;
my $ifile = shift || usage();
my $ofile = shift || usage();
my $buffer;
my $header;

my $reader = new Audio::SoundFile::Reader($ifile, \$header);
$header->{format} = SF_FORMAT_WAV | SF_FORMAT_PCM;
my $writer = new Audio::SoundFile::Writer($ofile, $header);

while (my $length = $reader->bread_pdl(\$buffer, $BUFFSIZE)) {
    $writer->bwrite_pdl($buffer);
}

$reader->close;
$writer->close;
exit(0);

sub usage {
    print <<EOT;
usage: $0 <infile> <outfile>
EOT
    exit(1);
}

mmm. There was indenting in code at one point. Sigh...
State of the art in TTS by Sam+Lowry · 2003-03-17 00:52 · Score: 4, Informative
There are basically two TTS technologies on the market:
- diphone-based synthesis, where the database contains one diphone (end of first sound + start of next sound) for each possible sound combination. This approach is used in Festival. Diphone-based synthesis will hardly sound better than it does in Festival, because diphones have to be modified artificially to fit every variation of pitch, duration and any other parameter that is needed to produce a given phrase.
- corpus-based synthesis takes a different approach: a large database of several hours of speech is recorded and manually labelled to mark the start and end of each sound. During synthesis, this database is searched for the best and longest sequences of diphones. This approach gives natural-sounding results for short sentences where intonation is not so important. Given that the cost of developing a database for corpus synthesis may be orders of magnitude higher than for diphone synthesis, very few companies make them. Two companies offer a demo on the internet: ATT and Scansoft (former L&H) and
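
The two approaches above can be sketched in a few lines. The code below is a toy illustration only: it works on phoneme labels rather than audio, and the names `to_diphones`, `longest_match` and `select_units` are hypothetical. It uses a simple greedy "longest run first" rule, which is an assumption made for brevity; production corpus-based systems generally search for the sequence with the lowest overall cost rather than picking greedily.

```python
def to_diphones(phonemes):
    """Split a phoneme sequence into diphones (adjacent pairs).

    '_' is used here as an assumed marker for silence at both ends.
    """
    padded = ["_"] + list(phonemes) + ["_"]
    return [(padded[i], padded[i + 1]) for i in range(len(padded) - 1)]


def longest_match(target, start, corpus):
    """Find the longest run of target[start:] that appears contiguously
    in any corpus utterance. Returns (length, utterance_index, offset)."""
    best = (0, None, None)
    for u, utt in enumerate(corpus):
        for off in range(len(utt)):
            n = 0
            while (start + n < len(target)
                   and off + n < len(utt)
                   and utt[off + n] == target[start + n]):
                n += 1
            if n > best[0]:
                best = (n, u, off)
    return best


def select_units(target_phonemes, corpus_phonemes):
    """Cover the target with the longest recorded diphone runs, greedily.

    Returns a list of (utterance_index, diphone_offset, run_length).
    """
    target = to_diphones(target_phonemes)
    corpus = [to_diphones(u) for u in corpus_phonemes]
    plan = []
    pos = 0
    while pos < len(target):
        n, u, off = longest_match(target, pos, corpus)
        if n == 0:
            raise LookupError(f"no recording covers diphone {target[pos]}")
        plan.append((u, off, n))
        pos += n
    return plan


if __name__ == "__main__":
    # Two "recorded" utterances, given only as phoneme labels.
    recordings = [["k", "ae", "t"], ["s", "ae", "d"]]
    print(select_units(["k", "ae", "d"], recordings))
```

A pure diphone system corresponds to a corpus where every run has length 1, so every join is a seam that needs signal processing. The appeal of the corpus approach is that longer runs mean fewer seams.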
I'm not actually convinced phonemes exist, y'know by Bertie · 2003-03-17 02:01 · Score: 5, Insightful

I have a master's in linguistics, specialising in speech processing and the like, and I don't really believe in phonemes.

In the beginning, there was the word. And the word was spoken. A long, long time later came writing. Most early forms of writing seem to have been pictographic. Eventually that started to be a bit too complicated for most, and somewhere along the line we switched to trying to represent the sounds of the words that we used. These writing systems had to be sort of retrofitted onto the sounds we used, and so they were never going to amount to a perfect transcription of the sounds used. Huge alphabets quickly become unwieldy, and while there is a great deal of variation between languages in terms of how they deal with these issues, in most cases sounds end up being shoehorned into one category or another - "oh, that's sort of a /t/, I'll write it down like that". You know yourselves how often words in English bear no relation to their spoken forms.

Anyway, a long time after that, people got interested in phonetics. Conditioned as we were into thinking of words as collections of letters, along came the concept of the phoneme, which, as somebody said above, is the smallest individual unit of speech which can be distinguished from other such units. Phoneticists set about mapping all the sounds of all the languages in the world to phonemes, and we got the international phonetic alphabet.

Later still, we managed to invent machines which allowed us to analyse sound spectra. Run a spoken utterance through one of these and what you'll see most certainly isn't a succession of distinct sounds. Truth is, our brain does so much work on the raw sound that our perception of the sounds is entirely different from the reality. "Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.

The point of all this is that when we started speaking yonks ago, we were making use of the vocal tract nature (God, natural selection, take yer pick, I don't want to get into an argument about it) gave us. We weren't thinking of phonemes and stuff, we were just making noises subject to the limitations of the equipment we had. The notion that this is a nice, ordered system of sounds is an artificial one imposed by us in an attempt to make sense of it all, and it amounts to an expanded version of an oversimplified system (the alphabet). Now, we all know what happens with lossy compression...

Simply drawing lines down the spectrogram in the name of making it easier to work with just throws away subtlety, so that when you use a phoneme-based TTS system you get a series of disjointed sounds with perhaps some token effort at coarticulation (i.e. the phenomenon of overlapping sounds described above), and it's always going to sound awful. The consequences for speech recognition are much worse (sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion).

In short, what you have here's an engineer's approach to art. It's like taking a painting by your favourite artist and turning it into a 256-colour bitmap, then analysing the result and trying to make new paintings in the same style.