Phoneme Approach For Text-to-Speech in SCIAM

Does the poster have something against IBM by watzinaneihm · 2003-03-17 00:04 · Score: 3, Insightful

Does the poster have something against IBM ... to link an application to a slashdot post?
Even the guys who don't read the articles might now be tempted to try clicking the "enter text" link.

--
.ACMD setaloiv siht gnidaeR

Re:Does the poster have something against IBM by borgdows · 2003-03-17 00:07 · Score: 2, Funny

It looks like IBM is not running their servers on a dead fly ;)

Phonemes not phenomes by Tucan · 2003-03-17 00:04 · Score: 4, Informative

Phonemes are the building blocks of language, not phenomes.

I was expecting better... by LeoDV · 2003-03-17 00:04 · Score: 5, Informative

If memory serves me, I believe it was AT&T (?) that used to have a similar webpage with near-perfect text-to-speech, which is hardly the case of this project.

What's so special about it?

Re:I was expecting better... by Rubyflame · 2003-03-17 01:03 · Score: 5, Informative

Used to? Still does! It's called "AT&T Natural Voices," and there's an online demo.

--

All it takes is nukes and nerves.
Re:I was expecting better... by Mandrake · 2003-03-17 03:14 · Score: 2, Interesting

We've also been doing this for quite some time. You can check out the Cepstral On-Line High Quality Synthesis Demos, as well as our High Quality Limited Domain Demos.

--
Geoff "Mandrake" Harrison
Some Random UI Hacker
Re:I was expecting better... by MrScience · 2003-03-17 04:36 · Score: 2, Interesting

This was used in Mission to Mars for the spaceship's voice. The director was looking to do some sound FX to create one from a human voice, then found AT&T's product which was a perfect fit.

I wanted the same voice for my computer-controlled house, and tracked down where they got it. Now my handheld says, "Warning. Power failure imminent." when its battery is about to die.

--
You quitting proves that the karma kap worked. The most annoying of the whores shut up. --CmdrTaco
Re:I was expecting better... by tchapin · 2003-03-17 09:11 · Score: 2, Informative

SpeechWorks also offers a high-quality network telephony concatenative TTS engine, called Speechify. We also offer a formant-based TTS engine, as well as an embedded TTS one based on Speechify. See some demos here.
We also offer quite a large range of languages. Our Canadian French voice, which was just released, is fantastic! Looks like marketing hasn't put him on the demo page yet though... :(
Todd

--
-- !todd erases a red dot! I steal music on the internet.

speaking of the /. effect by trelanexiph · 2003-03-17 00:06 · Score: 4, Funny

I guess IBM didn't have much to say on the matter.

IBM Text-to-Speech Research Demonstration

Input Communcations Error.

You have reached this page because of an severe input error. It appears that the client didn't connect to the server. Please inform the system administrator using the feedback mechanism on the main home page.

PHONEME, y'all, not *phenome by texchanchan · 2003-03-17 00:10 · Score: 3, Informative

Phoneme, a unit of sound in a word. From Dictionary.com: "The smallest phonetic unit in a language that is capable of conveying a distinction in meaning, as the m of mat and the b of bat in English. [... from Greek phōnēma, phōnēmat-, utterance, sound produced, from phōnein, to produce a sound, from phōnē, sound, voice...]"

Related to "telephone," "phonics," etc.

Re:PHONEME, y'all, not *phenome by WeeBull · 2003-03-17 00:15 · Score: 5, Funny

.. and often uttered in distressed tones at the end of a night out, usually by desperate males attempting to re-attach themselves to some female. PHONEME! PLEASE PHONEME! I LOVE YOU! PHONEME!

AT&T have been doing this for a while! by Anonymous Coward · 2003-03-17 00:13 · Score: 5, Informative

If you visit here:
http://www.naturalvoices.att.com/demos/

You'll find AT&T's version a whole lot better. The main problem with voice synthesis is smoothing of phoneme edges, where if it is done too aggressively the speech synthesis can sound too "lumpy".

The other thing is, speech synthesis via phonemes is very basic practise indeed! I remember having a Currah Speech module for my ZX Spectrum (1982 home computer) - and the first thing you were taught about was phonemes. I'm not entirely sure what's new about this IBM product. It's basically not that much evolved from the mid-90's.

Re:AT&T have been doing this for a while! by wiggys · 2003-03-17 00:22 · Score: 2, Funny

The Currah speech unit for the Spectrum was hilarious. It came with a free game which was supposed to say "The Banshee wails at you but nothing happens".
It actually sounded like "Shbansheehailsacthoowawaaaawaaaens"
I remember you could also turn it on while you were programming, so every time you pressed a key it would say "ONE ZERO PRINT QUOTE ACH EE ELL ELL O QUOTE ENTER TWO ZERO ENTER RUN ENTER". It used to drive me batty. It was one of those eighties things which you thought was "cool" at the time, but had no practical use. I think they were only ever invented so you could show your neighbours how advanced your computer is: "LOOK, IT CAN TALK TO ME!"

--
Sorry, but my karma just ran over your dogma.
Re:AT&T have been doing this for a while! by Anonymous Coward · 2003-03-17 00:38 · Score: 2, Insightful

The IBM product seems to take the recording of a long text read by a human and automatically produce the data collection that is the artificial voice. It uses speech recognition methods to align text and recording. It also stores more than just a simple collection of phonemes: where older text-to-speech solutions would modify the sample of a phoneme to reflect a certain position in a sentence, IBM's solution appears to use a phoneme sample from the same context, making the result much less monotone. This approach does however raise the question of whether "phoneme based" is still its most important characteristic. There are only 40 phonemes, not 10000 (the number of samples used by the IBM "voices").
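
The "sample from the same context" idea can be sketched roughly as below. This is only a guess at the general shape, not IBM's actual algorithm: the inventory layout, the back-off order, and the file names are all made up for illustration.

```python
def pick_sample(inventory, left, phoneme, right):
    """Prefer a sample recorded in the same context, then back off.

    `inventory` is a hypothetical dict mapping (left, phoneme, right)
    tuples to recordings; None stands for "any neighbour".
    """
    candidates = (
        (left, phoneme, right),   # exact left and right neighbours
        (left, phoneme, None),    # only the left neighbour matches
        (None, phoneme, right),   # only the right neighbour matches
        (None, phoneme, None),    # context-free fallback sample
    )
    for key in candidates:
        if key in inventory:
            return inventory[key]
    raise KeyError(phoneme)


# Made-up inventory: two recordings of the same vowel.
inventory = {
    ("k", "ae", "t"): "ae_017.wav",
    (None, "ae", None): "ae_001.wav",
}
print(pick_sample(inventory, "k", "ae", "t"))
print(pick_sample(inventory, "b", "ae", "d"))
```

With one entry per context rather than one per phoneme, it is easy to see how 40 phonemes can turn into thousands of stored samples.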
Re:AT&T have been doing this for a while! by prowley · 2003-03-17 08:10 · Score: 2, Insightful

The way to smooth out the lumps is to not use phonemes at all, but diphones. Imagine recording two phonemes uttered by a human speaker in sequence, and then slicing through the middle of each phoneme and discarding the ends. That gives you a diphone. Diphones are far superior because phonemes do not change in the middle, so there are no "lumps" at the splice. On the other hand phonemes do change depending on what phoneme is uttered next, simply because in articulating different phoneme sequences the human vocal tract must perform different gymnastics. The only downside is that a full set of diphones is much larger than a full set of phonemes - and they are all buggers to record.
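
A minimal sketch of the bookkeeping, assuming a made-up phoneme notation and a `_` marker for silence at the utterance edges (neither is taken from any real synthesizer). It shows how a phoneme string maps onto diphone units and why the inventory grows roughly with the square of the phoneme count.

```python
def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into the diphones spanning each transition.

    Each diphone runs from the middle of one phoneme to the middle of the
    next, so n phonemes padded with silence need n + 1 units.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def diphone_inventory_size(n_phonemes):
    """Upper bound on distinct diphones: every ordered pair of phonemes.

    Real languages forbid some pairs, so the true figure is lower.
    """
    return n_phonemes * n_phonemes


# "cat" written with three hypothetical phoneme symbols
print(to_diphones(["k", "ae", "t"]))
print(diphone_inventory_size(40))
```

For a 40-phoneme language the bound is 1600 pairs, which is the "much larger" set that has to be recorded.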

Here's another text-to-speech site by wiggys · 2003-03-17 00:16 · Score: 3, Funny

http://www.research.att.com/~ttsweb/cgi-bin/ttsdemo

Some of the voices sound okay I guess. Better than Stephen Hawking anyway.

--

Sorry, but my karma just ran over your dogma.

*blush* by WeeBull · 2003-03-17 00:22 · Score: 5, Funny

Uhm, ok, who else just spent 10 minutes (thoroughly) checking if IBM filter naughty words at the text-to-speech interface? Getting the female voices to utter favourable phrases regarding one's studliness, perhaps?

Oh ... just me? *blush*

Open Source Speech Synthesis by wzrd2002 · 2003-03-17 00:23 · Score: 5, Informative

There is already a freely available open source speech synthesis application for both Linux and Windows, called Festival, created by the University of Edinburgh.

Re:Open Source Speech Synthesis by WWWWolf · 2003-03-17 02:17 · Score: 3, Informative

Festival is great, especially with the OGI patches. I was completely blown away by Festival's quality compared to other open source TTS engines, and OGI stuff makes stock Festival sound pathetic. Really great stuff, regrettably still not as good as IBM's or AT&T's stuff, but they have got a TTS that I can listen to for hours without making my ears bleed.
Regrettably OGI patches are for personal/research use only, so Debian won't ship them...
Re:Open Source Speech Synthesis by Mandrake · 2003-03-17 03:10 · Score: 2, Informative

You should also check out CMU Flite, which is by one of the guys who built Festival. He also works on other, high quality synthesizers at our company, which you can get demos of at our demo site.

--
Geoff "Mandrake" Harrison
Some Random UI Hacker

comparison to Apple's technology? by inblosam · 2003-03-17 00:26 · Score: 4, Informative

I run Mac OS X and in a lot of applications you have the option for the computer to read an entire document. For example, in TextEdit (a simple text editor by Apple) you can go to Edit, Speech, Start Speaking...in the menu and it will read everything for you. There are 10-15 different default voices to choose from, and built into the OS you can control pretty much everything by speech and get information by voice.

How does this compare? I think it is at least at the same level, if not further along! Good work Apple for being in the game, if not ahead of the game on this one.

Re:comparison to Apple's technology? by aseidl · 2003-03-17 00:56 · Score: 4, Interesting

I'm surprised by how many people (Mac users and otherwise) haven't noticed how long MacOS has come with text to speech. It's been included since at least MacOS 7.5, maybe even 7.0 (I was using it on my trusty ol' IIci yesterday). You could use it via SimpleText or even have it speak the text of dialog boxes. The quality of the voices could be better, but they do seem better than Festival. But, I have to admit it is pretty fun to scare people who don't know about it. One of my friends told me that his mother gets scared if she doesn't click OK or Cancel in a dialog because "those voices are going to come."
Re:comparison to Apple's technology? by silentbozo · 2003-03-17 06:12 · Score: 2, Interesting

Apple's TTS technology is pretty old... and it shows. I've been waiting for them to release voice upgrades since the original PowerPC Macs came out, but after they axed their (basic) research section, the likelihood of that happening decreased dramatically. The IBM approach is also pretty old, but the voice quality is slightly better, probably because there are more voice samples/higher quality.

No matter how good these phoneme-based techniques are, they're limited to the original timbre of the recorded speaker - you cannot synthesize a brand new voice (with on the fly inflections that were never recorded, etc.) with that TTS method. There has been research into modeled speech synthesis, where a mathematical model of lungs, windpipe, vocal cords, and mouth/tongue/lips is manipulated in order to generate speech. Given the extreme amount of computing power available today, you'd expect more people to use that type of TTS, since it's inherently more flexible. However, the biggest problem so far is that nobody really has a good model for how all the various fleshy parts within the human speech apparatus interact. Any open source people want to tackle this problem and start implementing some of these modeled speech synthesis algorithms?

And don't forget Bell Labs by rpiquepa · 2003-03-17 00:28 · Score: 4, Informative

IBM is not alone in working on text-to-speech technology and in having demos where you can type a phrase and listen to it. The Bell Labs Text-to-Speech system (TTS) has its own page featuring fun demos. "You can play with our basic interface for some of our Text-to-Speech systems: American English, German, Mandarin Chinese, Spanish, French, Italian and Canadian French." This page is pretty old (it makes references to Netscape 3!!), but the demos still run fine.

I've always wondered why... by jkrise · 2003-03-17 00:28 · Score: 2, Interesting

Text to Speech and vice-versa takes more memory and CPU time as time goes on. Surely given market potential for these apps, their quality and availability should've been much much more.

Is MS carrying any patents on this, and acting Dog-In-The-Manger..ish? Any good low-footprint Linux-based apps for text-speech?

--
If you keep throwing chairs, one day you'll break windows....

Re:I've always wondered why... by g4dget · 2003-03-17 00:51 · Score: 2, Informative

Debian has several text-to-speech systems built-in. One of them is Festival, based on a research prototype from Edinburgh. It's a few years behind IBM and ATT, but passable. With more training data, it would get better. There are also several open source speech recognition engines of varying quality, again, mostly derived from university research (I believe Cambridge, CMU, and a few others).
Up to now, Microsoft has not really made any significant contributions to speech technology. They have bought lots of companies and hired away experts from other companies and universities. Those people are now toiling away at Microsoft research and waiting for their options to be worth something. Whether they'll make significant contributions to speech research while at Microsoft remains to be seen.

This is not a new approach. by anubi · 2003-03-17 00:29 · Score: 2, Interesting

About 30 years ago, I built a voice synthesizer for my IMSAI-8080 based on the General Instruments SC-01 Phoneme Synthesizer chip, which was available at that time from Radio Shack.

I googled for +"General Instrument" +"SC-01" and got links shown here .

I think Votrax was in bed with General Instruments, as they have another chip by the same name, that apparently does the same thing, but I do remember mine was a GI part.

It turns out all speech is nothing but sequences of utterances ( vowels and syllabic ). Just string them together and you get speech. String them together very carefully and the speech begins sounding like it came from a human instead of a machine.

I know IBM is refining this, but the concept is really old hat.

--
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]

Re:This is not a new approach. by wiggys · 2003-03-17 00:47 · Score: 2, Informative

"It turns out all speech is nothing but sequences of utterances ( vowels and syllabic ). Just string them together and you get speech. String them together very carefully and the speech begins sounding like it came from a human instead of a machine."
It's a whole lot more complicated than that. If you think phonetically about the way we talk we often merge words together rather than leave short discrete pauses between words. (For example, do you say "leaderovthepack" or "leader. ov. the. pack"? Also note the "ov" instead of "of")
Not only that, we pronounce words differently depending on the context in which they appear (if you think about the mechanics of speaking you'll realise our mouths change shape, therefore if you've just pronounced an "m" you may find it tricky to hit an immediate "l"). Also, we give away many clues about our state of mind as we speak - when we say "yours truly" we often sound humble, but when we say "Mine's better than yours" the "yours" in the latter sentence sounds more aggressive.
Probably the most important difference is emotion. A good narrator or speaker can draw you in to what he's saying because of the way he says it. Think about Kennedy delivering the line "We do these things not because they are easy..." - now feed the same line into a speech synthesizer. It's dead, isn't it? No impact, no emotion, no feeling. Personally, I find I can concentrate much more when a good narrator is reading an audio book than I can if a bad one reads it.
I found an audio book on Kazaa once where Stephen Hawking's synthesizer reads aloud A Brief History Of Time. I had to stop listening after 2 minutes because it no longer made sense - had Richard Dawkins been reading it then I'm sure I could have absorbed it 10 times better.

--
Sorry, but my karma just ran over your dogma.

TTS is great by jjohn · 2003-03-17 00:31 · Score: 4, Interesting

Last year, I started playing with this IBM tech. I thought it would be cool to have RSS feeds read to you in the middle of streaming music. It's kind of do-it-yourself radio. Although I don't have anything to show for that idea, I did make a few songs with it, like Make the Pie Higher, Plug Nickle and Progress.

mmm. I hope the server can take a slashdotting...

The TTS interface is C++, but it comes with a program that will compile text into AU files. I wrote the following script to change those AU files into mp3s:

#!/bin/bash
# Make a text file a spoken MP3
if [ -z "$1" ] ; then
    echo "usage: $0 <input.txt>"
    exit 1
fi
base=`basename "$1" .txt`
echo "attempting to create $base.mp3"
# speakfile (the hacked IBM demo) is expected to leave its output in temp.au
/home/jjohn/src/c/viavoice/cmdlinespeak/speakfile "$1"
# convert AU to WAV, then encode the WAV as MP3
writewav.pl temp.au temp.wav
lame -h temp.wav "$base.mp3"
rm -f temp.au temp.wav

speakfile is a slightly hacked version of the demo program IBM ships. Unfortunately, /.'s lameness filter doesn't like C++ code. :-(

It's pretty messy C++ hacking on my part, anyway. The Perl program is based on the CPAN module Audio::SoundFile. It's also hacked from a demo script that shipped with the module.

#!/usr/bin/perl
use Audio::SoundFile;
use Audio::SoundFile::Header;

my $BUFFSIZE = 16384;
my $ifile = shift || usage();
my $ofile = shift || usage();
my $buffer;
my $header;

my $reader = new Audio::SoundFile::Reader($ifile, \$header);
$header->{format} = SF_FORMAT_WAV | SF_FORMAT_PCM;
my $writer = new Audio::SoundFile::Writer($ofile, $header);

while (my $length = $reader->bread_pdl(\$buffer, $BUFFSIZE)) {
    $writer->bwrite_pdl($buffer);
}
$reader->close;
$writer->close;
exit(0);

sub usage {
    print <<EOT;
usage: $0 <infile> <outfile>
EOT
    exit(1);
}

mmm. There was indenting in code at one point. Sigh...

ack. no good by lingqi · 2003-03-17 00:31 · Score: 2, Funny

Unless the female voice can render the below lines with feelings, I don't think it's a mature technology.

give me! give me! oh! I am coming!! OHHHH!

Actually I did try it. the result (of the above line) was not spectacular. I am impressed with the quality in general, though. Tried "Sticking feathers up your butt does not make you a chicken," but that needs to be said with feelings as well, I suppose.

Oh yeah, this kind of technology is excellent for a computer to read out the sites to you, if, say, your eyes are tired. It should work wonders for slashdot, even.

--

My life in the land of the rising sun.

State of the art in TTS by Sam+Lowry · 2003-03-17 00:52 · Score: 4, Informative

There are basically two TTS technologies on the market:

Diphone-based synthesis, where the database contains one diphone (end of first sound + start of next sound) for each possible sound combination. This approach is used in Festival. Diphone-based synthesis will hardly sound better than it does in Festival, because diphones have to be modified artificially to fit every variation of pitch, duration and any other parameter that is needed to produce a given phrase.

Corpus-based synthesis, which takes a different approach: a large database of several hours of speech is recorded and manually labelled to mark the start and end of each sound. Such a database is used to extract the best and the longest sequence of diphones during production. This approach gives natural-sounding results for short sentences where intonation is not so important.

Given that the cost of developing a database for corpus synthesis may be orders of magnitude higher than for diphone synthesis, there are very few companies that make them. Two companies offer a demo on the internet: ATT and Scansoft (formerly L&H).
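
The "longest sequence" part of corpus-based synthesis can be illustrated with a toy greedy search. This is a sketch under simplifying assumptions, not how any of the commercial engines work: the corpus and its diphone labels are invented, and a real system would also have to score candidates on pitch, duration and how well the joins fit rather than on run length alone.

```python
def longest_run(target, start, corpus):
    """Return the length of the longest corpus run matching target[start:]."""
    best = 0
    for utterance in corpus:
        for i in range(len(utterance)):
            n = 0
            while (start + n < len(target) and i + n < len(utterance)
                   and utterance[i + n] == target[start + n]):
                n += 1
            best = max(best, n)
    return best


def select_units(target, corpus):
    """Greedily cover the target diphone list with the longest recorded runs."""
    chunks, pos = [], 0
    while pos < len(target):
        n = longest_run(target, pos, corpus)
        if n == 0:
            raise ValueError(f"no recording contains {target[pos]!r}")
        chunks.append(target[pos:pos + n])
        pos += n
    return chunks


# Toy corpus: each recorded utterance is a list of labelled diphones.
corpus = [
    ["_-h", "h-e", "e-l", "l-o", "o-_"],
    ["_-l", "l-o", "o-w", "w-_"],
]
target = ["_-h", "h-e", "e-l", "l-o", "o-w", "w-_"]
print(select_units(target, corpus))
```

Here six diphones are covered by two recorded runs, so only one splice is needed instead of five. Fewer splices is the reason longer runs tend to sound more natural.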

Old news by payndz · 2003-03-17 00:58 · Score: 3, Interesting

Text-to-speech? Come on, this has been around for donkey's years - maybe the computer voice doesn't sound like Majel Barrett yet, but it's hardly new and amazing stuff.

I want to know what's going on with speech-to-text, and will I be able to dictate rather than type a novel any time soon? (Preferably with some form of intelligent speech recognition, so it doesn't end up with passages like "She, ah... walked, no strode into the room to find, uh, er, dammit, did I say Rob left the tape on the counter or the desk? Oh, bloody hell. Hello? No, I'm not interested in double glazing. How did you get this number anyway? Bye. Where was I? Oh, crap! Computer, pause-")

--
You must think in Russian.

Hollywood applications for speech synthesis? by Sheriff+Fatman · 2003-03-17 01:20 · Score: 2, Interesting

Computer graphics have now advanced to the point where, given enough time and processing power, you can simulate almost anything with near-photographic realism. ILM, Digital Domain, Weta, et al can create completely convincing digital characters, but (leaving aside the issue of how a digital performance is based on the 'actor' - e.g. Andy Serkis' 'performance' in LOTR:TTT, or Dex in SW:AOTC) they're still entirely dependent on human voice actors to complete the performance.

OK, the point of this article is on-demand realtime speech synthesis - roughly analogous to the 3D engines used in games. It has to compromise quality and detail for the sake of speed and responsiveness. Could there be a market for 'voice rendering' - a system which can take a script (possibly with some additional mark-up to indicate emotions, emphasis, timing, etc.) and generate an audio version which approaches a reading of the same script by a competent voice actor?

As well as the obvious 'virtual thespian' Hollywood angle, I'm thinking about stuff like low-budget audio drama - people who have the time and the technology to tweak a voice script, but can't afford professional actors to do it the old-fashioned way. There could be applications for creating audio books for the visually impaired, or to make life easier for students working through Shakespeare and Chaucer - I'm still amazed every time I hear Shakespeare read out loud how much meaning can be conveyed by the nuances of the human voice, compared to dry printed prose.

Is anyone actively working on anything like this? If not, why not? Is it really that hard to fool the human ear? Or is it just a case that it's still cheaper and easier just to employ people to read things into a mic?

--
-- Open Source: It's mad, but you don't have to work here to help.

I'm not actually convinced phonemes exist, y'know by Bertie · 2003-03-17 02:01 · Score: 5, Insightful

I have a master's in linguistics, specialising in speech processing and the like, and I don't really believe in phonemes.

In the beginning, there was the word. And the word was spoken. A long, long time later came writing. Most early forms of writing seem to have been pictographic. Eventually that started to be a bit too complicated for most, and somewhere along the line we switched to trying to represent the sounds of the words that we used. These writing systems had to be sort of retrofitted onto the sounds we used, and so they were never going to amount to a perfect transcription of the sounds used. Huge alphabets quickly become unwieldy, and while there is a great deal of variation between languages in terms of how they deal with these issues, in most cases sounds end up being shoehorned into one category or another - "oh, that's sort of a /t/, I'll write it down like that". You know yourselves how often words in English bear no relation to their spoken forms.

Anyway, a long time after that, people got interested in phonetics. Conditioned as we were into thinking of words as collections of letters, along came the concept of the phoneme, which, as somebody said above, is the smallest individual unit of speech which can be distinguished from other such units. Phoneticists set about mapping all the sounds of all the languages in the world to phonemes, and we got the international phonetic alphabet.

Later still, we managed to invent machines which allowed us to analyse sound spectra. Run a spoken utterance through one of these and what you'll see most certainly isn't a succession of distinct sounds. Truth is, our brain does so much work on the raw sound that our perception of the sounds is entirely different from the reality. "Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.

The point of all this is that when we started speaking yonks ago, we were making use of the vocal tract nature (God, natural selection, take yer pick, I don't want to get into an argument about it) gave us. We weren't thinking of phonemes and stuff, we were just making noises subject to the limitations of the equipment we had. The notion that this is a nice, ordered system of sounds is an artificial one imposed by us in an attempt to make sense of it all, and it amounts to an expanded version of an oversimplified system (the alphabet). Now, we all know what happens with lossy compression...

Simply drawing lines down the spectrogram in the name of making it easier to work with just throws away subtlety, so that when you use a phoneme-based TTS system you get a series of disjointed sounds with perhaps some token effort at coarticulation (i.e. the phenomenon of overlapping sounds described above), and it's always going to sound awful. The consequences for speech recognition are much worse (sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion).

In short, what you have here's an engineer's approach to art. It's like taking a painting by your favourite artist and turning it into a 256-colour bitmap, then analysing the result and trying to make new paintings in the same style.

we've been doing this for a while by Mandrake · 2003-03-17 03:01 · Score: 2, Informative

This sort of technology has been under development for a long time, and we have demos up on our website, also: Cepstral Online Speech Synthesis Demos. In fact, we have Higher Quality Limited Domain Demos available as well.

--
Geoff "Mandrake" Harrison
Some Random UI Hacker

Re:This could be a hit... by wcb4 · 2003-03-17 03:46 · Score: 2, Interesting

I have actually used textaloudMP3 (from nextUp) to read Project Gutenberg e-texts aloud. It's not perfect, far from it, but it gets better since you can correct mispronunciations over time (my exceptions file now has about 200 entries). The program is a Windows front end to ANY installed text-to-speech engine, be it Microsoft's or L&H or AT&T. I often have it read into mp3 files, which I burn onto CDs and listen to on the way to work. I can usually get about 5-6 full books on a single CD, and it's free (well... once you spend the $50 for the software and the TTS engine and the high quality voices).
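
For anyone wondering what an exceptions file amounts to, here is a rough sketch of the general idea: respell troublesome words before the text reaches the engine. The table format, the respellings, and the function name are all hypothetical and are not TextAloud's actual file format.

```python
import re

# Hypothetical exceptions table: word -> respelling the engine says better.
EXCEPTIONS = {
    "gutenberg": "gootenberg",
    "chaucer": "chawser",
}


def apply_exceptions(text, exceptions=EXCEPTIONS):
    """Replace whole words found in the exceptions table, ignoring case."""
    def swap(match):
        word = match.group(0)
        # Words that are not in the table pass through unchanged.
        return exceptions.get(word.lower(), word)

    return re.sub(r"[A-Za-z']+", swap, text)


print(apply_exceptions("Project Gutenberg has Chaucer."))
```

Matching whole words keeps an entry for one word from mangling longer words that happen to contain it.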

--
I reject your reality ... and substitute my own.

Is it just me? by evronm · 2003-03-17 03:55 · Score: 2, Insightful

Or does anyone else not understand what the big deal about text to speech is?

I had a program for my C64 circa 1983 that did pretty good text to speech. Granted the voice was pretty robotic, but I'd think that 20 years later, this should be a cinch.

Speech to text, on the other hand...

--
Follow the adventures of the new wandering jews

Not very good TTS by DulcetTone · 2003-03-17 06:08 · Score: 2, Funny

The quality of AT&T's TTS or SpeechWorks' TTS is far more advanced. I had some fun with Speechworks' one and posted samples:

What I wish On-Star would actually say

A slightly-edited announcement calling our Bulldog to attend to a special matter

tone

--
tone

Re:I'm not actually convinced phonemes exist, y'kn by HoldmyCauls · 2003-03-17 06:43 · Score: 2, Interesting

I'm taking a Linguistics course this semester, and I've always found things like this interesting. You make several good points, but I feel that, like most doubters, you oversimplify trial as inevitable failure. You have to be careful when saying things like "Linux won't catch on," "Artificial Intelligence won't happen," or "phonemes are too hard to separate."

In fact, much of what you've said indicates the *eventual* possibility of a very conversable TTS/STT translating algorithm. (Whether or not these will be the same algorithm in reverse will be for the future to decide).

"Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.

Right there, you've laid out a *very complicated* but by no means difficult way of looking at phonemes individually. In my class, we have some 20 people who all have great difficulty in just figuring out allomorphs, which some Slashdotters might not know are phonemes either in complementary distribution, such as in the case of plural nouns:

/s/ sound at the end of /kæts/ {cats}
/z/ sound at the end of /kIdz/ {kids}
/Iz/ sound at the end of /mætʃIz/ {matches}

or in free variation, such as the Lisa/Liza name which mean the same thing, and are derived from the same root, but which have split due to geographical/cultural/other reasons.
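
The plural pattern is regular enough to write down as a rule. The sketch below is a simplification using informal ASCII sound labels of my own choosing (not IPA), and it takes the final sound as input rather than working it out from spelling.

```python
# Informal labels for final sounds; these sets are illustrative only.
SIBILANTS = {"s", "z", "sh", "zh", "ch", "j"}
VOICELESS = {"p", "t", "k", "f", "th"}


def plural_allomorph(final_sound):
    """Pick the plural ending from the last sound of the singular noun."""
    if final_sound in SIBILANTS:
        return "Iz"   # matches, buses
    if final_sound in VOICELESS:
        return "s"    # cats, books
    return "z"        # kids, dogs, and nouns ending in a vowel


for noun, last in [("cat", "t"), ("kid", "d"), ("match", "ch")]:
    print(noun, "->", "/" + plural_allomorph(last) + "/")
```

The sibilant check has to come first, since "s" and "sh" are also voiceless and would otherwise wrongly get a plain /s/.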

Now, where the average English major might not always recognize similarities and patterns, the average Slashdotter has trained him/herself to do so, and some are likely saying to themselves, "where else does this happen?" and "where is this not true?", which are useful, scientific questions.

You yourself present the answer to the problem you raise: we have to look at the surrounding phonemes in order to figure out how to make one particular sound fit the word it's in. This is *damn hard*, but not impossible. It's like the fact that stress affects a phoneme in certain languages: we just need to adapt to thinking about language in different terms than simply speaking it and spelling vague representations of it (by first realizing how vague those representations are, which is why the phoneme set is taught first in Linguistics classes).

Personally, I think the problem lies in the fact that we all want TTS/STT and we want it *now!*, and why can't the computer just say it or hear it the way we do, and all the other questions that come from a lack of understanding, both of how the machine represents everything and the garbled way in which our language is represented. Phonemes are the obvious solution: the software should only have to do STP/PTS conversion, and our language should conform to that, really, since it's the creative dialectical shifts that create a problem, but we'll end up devising a creative solution for that, too.

Now, we all know what happens with lossy compression...

Yes, we get a slightly inaccurate but highly useful jpeg of the Andes, or someone's new desktop widget set, or a very listenable 192kbps mp3 of "Hurt" covered by Johnny Cash (even sadder than the original, IMHO).

And TTS/STT will have its flaws as well, but a digital (though wide) set of sound symbols like phonemes will help us to break things down somewhat until we figure out that something *smaller* about those sounds is very functional, *and* how to represent *that* level of speech, just as we represented matter by some informal type, then by molecules, then atoms, and now we know quite a bit about how the electron, proton and neutron work, and are working on a smaller level.

To say you "don't really believe in phonemes" oversi

--
Emacs: for people who just never know when to :q!

Natural Voices Gagged: AT&T is asleep at the d by SimHacker · 2003-03-17 07:47 · Score: 2, Informative

I'm working on a project involving voice synthesis, so we've been shopping around and evaluating different systems.

We were hoping AT&T would do a better job than IBM at supporting their voice synthesizer. IBM pulled the Linux version of ViaVoice off the market without so much as a peep to their adoring fans on Slashdot, and wiped all mention of the Linux version from their web server. (Google isn't even allowed to cache it.) After IBM milked the Slashdot Linux fanboy publicity for all it was worth, they apparently didn't see any purpose in actually SUPPORTING the product -- so once their libraries stopped working against the latest GNU/Linux libraries (happy birthday RMS!), they dropped their Linux voice synthesizer product like a hot potato instead of bothering to recompile it and issue an update.

So we hoped AT&T would show more commitment to the promises they made on their web site about their flagship voice synthesizer product, but...

Has anyone actually tried buying a single user copy of Natural Voices from AT&T? YOU CAN'T ANYMORE! They used to sell the synthesizer for workstations and voices for competitive prices (in the 100s of dollars range). So we bought a few voices to evaluate, and sent some simple technical questions into the email address they provided for support, never receiving a reply.

After several weeks they never answered any of our questions, but we decided to buy some more voices to evaluate anyway. But by then, AT&T had pulled the consumer single user version of Natural Voices off of the market (and it took weeks of phone tag to find that out because they don't give out "technical" information on the phone, and they never answer their email support address).

Now if you want to buy a Natural Voice from AT&T, you have to buy the server edition for tens of thousands of dollars. Had their support not absolutely sucked, it might have been worth us paying such a high price, but no way we'd ever consider going with AT&T, after they demonstrated such horrible unresponsive service.

Actually it's a good thing we didn't go with AT&T's voice synthesizer, because we need support for voice authoring tools, and AT&T is incompetent in that regard, since they refuse to give out technical information over the phone, and never answer their email. No support whatsoever. Zilch. Nada. Forget about it.

Fortunately we found some excellent open source software that works together (and whose authors are MUCH more responsive than IBM or AT&T): the Festival Speech Synthesis System, the FestVox voice authoring tools, the small fast Flite runtime speech engine, the Edinburgh Speech Tools, the CSLU speech tools, the OGI Festival tools, and the MBROLA Multilingual Speech Project. This is state of the art research software, where IBM and AT&T got their ideas.

The quality of the commercial voices comes more from throwing lots of time and money into the production process -- the commercial software is not any more advanced than the open source research projects -- in fact the research projects inspired the commercial products!

-A speech synthesizer user who's been jerked around by AT&T and IBM, and is now happy to have no other choice but to use excellent open source software.

--
Take a look and feel free: http://www.PieMenu.com

Re:I'm not actually convinced phonemes exist, y'kn by decrocher · 2003-03-17 07:57 · Score: 2, Interesting

I think it is widely recognized that you need to take coarticulation and _meaning_ into account when converting between speech and text.

You argued in another post for models of 4+ phonemes. We don't see these because they're not a huge theoretical leap from triphones (and thus bore researchers), and because there are computational/storage/training efficiency requirements to consider. This is also why one doesn't record an exhaustive library of every possible utterance in the first place. I think once you get to 7-phones, you may be better off trying to understand the phrase from a higher level of abstraction.
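
A rough back-of-the-envelope sketch of that storage/training argument, assuming an inventory of about 40 phonemes (an illustrative figure, not a measured one). The count below is an upper bound, since many phoneme sequences never occur in a real language, but it shows how quickly the number of context-dependent units grows with the width of the window.

```python
def nphone_count(inventory_size, width):
    """Upper bound on distinct n-phones: every combination of `width` phonemes."""
    return inventory_size ** width


INVENTORY = 40  # assumed inventory size, for illustration only

for width in (1, 3, 5, 7):
    print(f"{width}-phones: up to {nphone_count(INVENTORY, width):,}")
```

Each extra phoneme of context multiplies the bound by the inventory size, so the amount of recorded or training data needed to cover the units grows the same way.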

Have we correctly identified the right compact expression of speech? I doubt it. Getting speech stuff to work involves a lot of tweaking that is theoretically ungrounded. Tweaking in a methodical and science-biased way _is_ engineering, however.

BTW, I seem to remember a prof saying that X-ray cinematography more-or-less proved the existence of vocal tract target configurations in speech, which correspond to phonemes. Not to mention that you can encode a message in IPA and have it understood by someone else. Even if they're not totally correct, phonemes may be a sufficient basis for building speech systems.
