Phoneme Approach For Text-to-Speech in SCIAM
jscribner writes "Scientific American is running a feature on IBM Research's Text-to-Speech technology. It discusses the current state of affairs in this field, and describes IBM's phoneme-based 'Supervoices' approach. The IBM site provides a demonstration, allowing users to enter text to be rendered to speech, as well as providing several examples in other languages."
Does the poster have something against IBM ... to link an application to a slashdot post?
Even the guys who don't read the articles might now be tempted to try clicking the "enter text" link.
.ACMD setaloiv siht gnidaeR
Phonemes are the building blocks of language, not phenomes.
If memory serves me, I believe it was AT&T (?) that used to have a similar webpage with near-perfect text-to-speech, which is hardly the case of this project.
What's so special about it?
I guess IBM didn't have much to say on the matter.
IBM Text-to-Speech Research Demonstration
Input Communcations Error.
You have reached this page because of an severe input error. It appears that the client didn't connect to the server. Please inform the system administrator using the feedback mechanism on the main home page.
...if they make some sort of interface between e-books and text-to-speech. Instant 'sound-book' *smiles*. No longer do the visually impaired have to wait for someone to make the soundbook for them; no longer do I need to actually read the long, boring documents people send me at work.
With the right technical document, this could cure insomnia as well...
Everything in the world is controlled by a small, evil group to which, unfortunately, no one you know belongs.
Phoneme, a unit of sound in a word. From Dictionary.com: "The smallest phonetic unit in a language that is capable of conveying a distinction in meaning, as the m of mat and the b of bat in English. [... from Greek phōnēma, phōnēmat-, utterance, sound produced, from phōnein, to produce a sound, from phōnē, sound, voice...]"
Related to "telephone," "phonics," etc.
If you visit here:
http://www.naturalvoices.att.com/demos/
You'll find AT&T's version a whole lot better. The main problem with voice synthesis is smoothing of phoneme edges, where if it is done too aggressively the speech synthesis can sound too "lumpy".
The other thing is, speech synthesis via phonemes is very basic practice indeed! I remember having a Currah Speech module for my ZX Spectrum (1982 home computer) - and the first thing you were taught about was phonemes. I'm not entirely sure what's new about this IBM product. It's basically not that much evolved from the mid-90s.
Whoa- finally something better than what we've had for years.
Try "I never promised you a rose garden." - The speaker sounds genuinely pissed off!
graspee
Some of the voices sound okay I guess. Better than Stephen Hawking anyway.
Sorry, but my karma just ran over your dogma.
Wasn't there some company about two years ago that was developing a service to use voice to build XHTML pages? I did a search on Google, but could not find it. They had a test 1-800 number that you could call, say something, and then go to a webpage that was automatically created for you. It seemed to work pretty well. Whatever happened to that?
News, Girls, Cams, Jokes, and other complete time wasters
Oh ... just me? *blush*
festival anyone?
cut'n paste:
http://www.cstr.ed.ac.uk/projects/festival/
There is already a freely available open source speech synthesis application for both Linux and Windows, called Festival, created by the University of Edinburgh.
this link:
http://festvox.org/voicedemos.html
does the same as IBM's demo page. sounds the same as well. but hey, i'm a layman in linguistic matters, so there's prolly a *huge* improvement i understand crap about
I run Mac OS X and in a lot of applications you have the option for the computer to read an entire document. For example, in TextEdit (a simple text editor by Apple) you can go to Edit, Speech, Start Speaking...in the menu and it will read everything for you. There are 10-15 different default voices to choose from, and built into the OS you can control pretty much everything by speech and get information by voice.
How does this compare? I think it is at least at the same level, if not further along! Good work Apple for being in the game, if not ahead of the game on this one.
IBM is not alone in working on text-to-speech technology and in having demos where you can type a phrase and listen to it. The Bell Labs Text-to-Speech system (TTS) has its own page featuring fun demos. "You can play with our basic interface for some of our Text-to-Speech systems: American English, German, Mandarin Chinese, Spanish, French, Italian and Canadian French." This page is pretty old (it makes references to Netscape 3!!), but the demos still run fine.
i was one minute earlier :-) but you'll prolly get the karma, because of the direct links. i am too lazy to type in a href="etcetcetc.
:-)
o wait, this will cost me karma as well! -1 offtopic
Text-to-speech, and vice versa, can use more memory and CPU time as time goes on. Surely, given the market potential for these apps, their quality and availability should be much, much better by now.
Is MS carrying any patents on this, and acting Dog-In-The-Manger..ish? Any good low-footprint Linux-based apps for text-speech?
If you keep throwing chairs, one day you'll break windows....
Of course, the intonation is roughly that kind of compromise a PR spokesman employs who is trying to sound convincing but has no clue what he is saying. That's not surprising, given that the TTS systems really do not have any understanding of the meaning of what they are saying.
I googled for +"General Instrument" +"SC-01" and got links shown here.
I think Votrax was in bed with General Instruments, as they have another chip by the same name, that apparently does the same thing, but I do remember mine was a GI part.
It turns out all speech is nothing but sequences of utterances (vowels and syllabic). Just string them together and you get speech. String them together very carefully and the speech begins sounding like it came from a human instead of a machine.
I know IBM is refining this, but the concept is really old hat.
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
mmm. I hope the server can take a slashdotting...
The TTS interface is C++, but it comes with a program that will compile text into AU files. I wrote the following script to change those AU files into mp3s:
speakfile is a slightly hacked version of the demo program IBM ships. Unfortunately, /.'s lameness filter doesn't like C++ code. :-(
It's pretty messy C++ hacking on my part, anyway. The Perl program is based on the CPAN module Audio::SoundFile. It's also hacked from a demo script that shipped with the module.
mmm. There was indenting in code at one point. Sigh...
give me! give me! oh! I am coming!! OHHHH!
Actually I did try it. the result (of the above line) was not spectacular. I am impressed with the quality in general, though. Tried "Sticking feathers up your butt does not make you a chicken," but that needs to be said with feelings as well, I suppose.
Oh yeah, this kind of technology is excellent for a computer to read out the sites to you, if, say, your eyes are tired. It should work wonders for slashdot, even.
My life in the land of the rising sun.
Yes, they should be called Freedomgnomes. Stupid latins editing our language for PH sounds.
And where's freedomdot? It's all wrong, I tells ya, it's all freedomin' wrong!
Freedom. The new Marklar.
If you could be told what you can see or read, then it follows that you could be told what to say or think - BoC
Phenomes....phonemes...why not pheromones?
With a talking PDA for my chatup lines the chicks will find me IRRESISTIBLE!!!
What's the status of the infinitely more amazing speech-to-text? Being from Belgium, and thus being scammed by Lernout&Hauspie, who promised true S2T would be reality by 2000, I'm kinda sceptical towards it by now.
Will it ever be possible? As far as I can tell, S2T is quite a bit more difficult than English->French translation, for instance, and that still has a long way to go...
When will I end this grieving ? When will my future begin ?
uttering the sequence:
"Aargh! I've been slashdotted!"
Bandwidth sponsored by danish research funding...
Any sufficiently advanced libertarian utopia is indistinguishable from government.
I find your posts, though insightful, tend to divert attention from the topic at the top of the thread. If you start a new thread, I promise to read all your posts. Just remember to retain the same title though. Thanks.
If you keep throwing chairs, one day you'll break windows....
"Natalie Portman naked and petrified, with hot grits down her sweet, sweet panties. Hrmm.... don't wake me from this dream. Everlast. Diablo, Deus est."
"Take that slashdot fucker. Swallow hole. And bend over. It is your turn to be the pillow biter. Thankyou and goodnight. Say Hi to your mum for me."
"Fuck me. Fuck you. Lets fuck like rats."
And the way the chick says "Fuck" and "Fucker" almost turns me on. Very sexy voice, though it has a Stephen Hawking twang to it.
Maybe in windows 2010 I will have a chick with such a nice voice that my girlfriend/wife will be jealous of all the hours I spend with my digital companion.
"Aargh! I've been slashdotted!"
;-)
This one is much better at saying "slashdotted". Neither of them does the "Aargh!" very well. The IBM one especially ought to be convincing, given current circumstances.
Generate more samples for yourself at http://www.naturalvoices.att.com/demos/
Any sufficiently advanced libertarian utopia is indistinguishable from government.
In the '80s, TI had a number of speech synth chips that were of amazing quality. The one used with the add-in modules for the TI-99/4A was amazing. I have not heard a better quality speech synth since then. I wonder what happened to that TI technology.
ttyl
Farrell
CAN-CON 2019 - Ottawa's only book oriented Science Fiction Convention! October 18-20, Sheraton Hotel, Ottawa, Canada h
I want to know what's going on with speech-to-text, and will I be able to dictate rather than type a novel any time soon? (Preferably with some form of intelligent speech recognition, so it doesn't end up with passages like "She, ah... walked, no strode into the room to find, uh, er, dammit, did I say Rob left the tape on the counter or the desk? Oh, bloody hell. Hello? No, I'm not interested in double glazing. How did you get this number anyway? Bye. Where was I? Oh, crap! Computer, pause-")
You must think in Russian.
I guess this is what comes of dopes who don't know their own language...
I think this has been the first time I've been able to experience some sort of off-site media before it has been slashdotted.
:)
That just makes my day!
I have always wanted a sexy robot voice which says Kernel Panic!
Note to self: get smarter troll to guard door.
hey maybe IT industry should take a note from us musicians for a change (excuse the pun)...
With sampling technology, especially multisampling where for example each note can have different sounds associated to it depending on the accent, you could achieve some really stunning results in the text to speech market.
People like EastWest have created such systems for virtual choirs...check out Voices Of The Apocalypse as this is some pretty basic but revolutionary way of using samplers...
The best text-to-speech that I know about is from Rhetorical Systems at www.rhetorical.com. The system still doesn't really understand what it is trying to say, but the quality of the speech itself seems good to me. Their technology is proprietary, so one can't be quite sure how they are doing this, but it looks to be large-database unit selection (like some of the Festival voices) done very well. (Disclaimer: before the company existed, I used to work with some of the people at Rhetorical, so I might be biased, but listen for yourselves).
You keep using that word.
I do not think it means what you think it means.
Computer graphics have now advanced to the point where, given enough time and processing power, you can simulate almost anything with near-photographic realism. ILM, Digital Domain, Weta, et al can create completely convincing digital characters, but (leaving aside the issue of how a digital performance is based on the 'actor' - e.g. Andy Serkis' 'performance' in LOTR:TTT, or Dex in SW:AOTC) they're still entirely dependent on human voice actors to complete the performance.
OK, the point of this article is on-demand realtime speech synthesis - roughly analogous to the 3D engines used in games. It has to compromise quality and detail for the sake of speed and responsiveness. Could there be a market for 'voice rendering' - a system which can take a script (possibly with some additional mark-up to indicate emotions, emphasis, timing, etc.) and generate an audio version which approaches a reading of the same script by a competent voice actor?
As well as the obvious 'virtual thespian' Hollywood angle, I'm thinking about stuff like low-budget audio drama - people who have the time and the technology to tweak a voice script, but can't afford professional actors to do it the old-fashioned way. There could be applications for creating audio books for the visually impaired, or to make life easier for students working through Shakespeare and Chaucer - I'm still amazed every time I hear Shakespeare read out loud how much meaning can be conveyed by the nuances of the human voice, compared to dry printed prose.
Is anyone actively working on anything like this? If not, why not? Is it really that hard to fool the human ear? Or is it just a case that it's still cheaper and easier just to employ people to read things into a mic?
---- Open Source: It's mad, but you don't have to work here to help.
My (former) university : mbrola
It is even free (as in beer) for personal use.
#include "coucou.h"
Apparently IBM bought the Watson project, formerly AT&T's and later Lucent's. The web page is even called webtts.watson.ibm.com. Obviously the quality of TTS has not improved much since 1996.
Can someone please tell me why this 8 y.o. project is considered news?
Tech guy 1> Hey, look at my cool new web based speech thingy that lets 1000's of users web pages talk to them!
Tech guy 2> Bah.. bet it wouldn't support 2 people
Tech guy 1> It would!
Tech guy 2> Prove it... (loud musical sound of doom follows) post it to slashdot
Tech guy 1> Ulp... (reluctantly taps away on the keyboard)
5 minutes later, strained sounds can faintly be heard from the smoking pile of rubble that used to be the server room, and the crackles from the fried piece of circuit board that used to be the shiny new voice system begin to wane, still trying to come up with 500,000 convincing renditions of "goatsec"
I have a master's in linguistics, specialising in speech processing and the like, and I don't really believe in phonemes.
In the beginning, there was the word. And the word was spoken. A long, long time later came writing. Most early forms of writing seem to have been pictographic. Eventually that started to be a bit too complicated for most, and somewhere along the line we switched to trying to represent the sounds of the words that we used. These writing systems had to be sort of retrofitted onto the sounds we used, and so they were never going to amount to a perfect transcription of the sounds used. Huge alphabets quickly become unwieldy, and while there is a great deal of variation between languages in terms of how they deal with these issues, in most cases sounds end up being shoehorned into one category or another - "oh, that's sort of a /t/, I'll write it down like that". You know yourselves how often words in English bear no relation to their spoken forms.
Anyway, a long time after that, people got interested in phonetics. Conditioned as we were into thinking of words as collections of letters, along came the concept of the phoneme, which, as somebody said above, is the smallest individual unit of speech which can be distinguished from other such units. Phoneticians set about mapping all the sounds of all the languages in the world to phonemes, and we got the International Phonetic Alphabet.
Later still, we managed to invent machines which allowed us to analyse sound spectra. Run a spoken utterance through one of these and what you'll see most certainly isn't a succession of distinct sounds. Truth is, our brain does so much work on the raw sound that our perception of the sounds is entirely different from the reality. "Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.
The point of all this is that when we started speaking yonks ago, we were making use of the vocal tract nature (God, natural selection, take yer pick, I don't want to get into an argument about it) gave us. We weren't thinking of phonemes and stuff, we were just making noises subject to the limitations of the equipment we had. The notion that this is a nice, ordered system of sounds is an artificial one imposed by us in an attempt to make sense of it all, and it amounts to an expanded version of an oversimplified system (the alphabet). Now, we all know what happens with lossy compression...
Simply drawing lines down the spectrogram in the name of making it easier to work with just throws away subtlety, so that when you use a phoneme-based TTS system you get a series of disjointed sounds with perhaps some token effort at coarticulation (i.e. the phenomenon of overlapping sounds described above), and it's always going to sound awful. The consequences for speech recognition are much worse (sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion).
In short, what you have here's an engineer's approach to art. It's like taking a painting by your favourite artist and turning it into a 256-colour bitmap, then analysing the result and trying to make new paintings in the same style.
So TTS with synthesized phonemes sounds bad, and they try to use recorded phonemes instead. Those still sound bad when the computer has to produce a phoneme combination that wasn't recorded.
... but with a lot of processing power.
So what's the next step? Is there anyone working on physical modelling of the acoustic properties of the mouth, tongue, throat, larynx, and lungs as they glide between different phonemes to produce speech sounds? This seems like the only way you're gonna get something closer to natural than this recorded-phoneme technology
Of course there's the 2nd problem of inflecting things properly, but that seems to require text recognition technology beyond what we currently have.
The following sentence is true. The preceding sentence was false.
If they can't get the voice truly smooth, they should just leave it as the Hawking voice.
My favorite test phrase is "If I were to bitch slap you while falling into a black hole, the bitch slap would last an eternity."
Ken
There is a book called MITalk (MIT Talk) that covers the effort to do this years ago using some major hardware. They were using a VAX (780?) just for one part of the processing and a few other big computers to do the rest. This led to DECtalk (aka the voice of Stephen Hawking).
It seems to me that with modern DSPs cranking along with many more calculations per second than a VAX could ever hope for, and one of the best theoretical mathematicians ever having a reliance on the technology, things should have improved substantially since the MITalk book came out, but I have yet to hear any real-world examples.
I tried some lyrics like "A winter's day In a deep and dark December; I am alone, Gazing from my window to the streets below, On a freshly fallen silent shroud of snow." and it actually started to sing!! (lyric is from Paul Simon - I Am a Rock)
This raises the bar on fake sound bites. Imagine recording thousands of phrases spoken by Mr. Burns and piecing them together with this technique to make him say "Hello, Smithers. You're quite good at turning me on".
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
Upon trying the classic "Hello Smithers, you're very good at turning me on" quote with both the male voices, I was very disappointed. This thing doesn't really sound any better than that crap piece of software that came with my 8-bit Sound Blaster back in the day.
I noticed the female voices sounded a lot better than the male voices. Nice to see those boys over at IBM got their priorities in order.
I wonder if this technology could be advanced far enough to imitate an actual person: feed in sound files of the person talking, analyse them, and then make it sound like that specific person. That would be awesome.
And for once my quote from A Brief History of Time actually applies to the article I'm posting under.
Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Don't you mean Phreedomgnomes?
including AT&T. This demo sounds much more natural over a broader range of words to my ear. Not much, but some better.
these guys have a synth that runs on a handheld and does real time dsp. check out the demos - very cool.
This sort of technology has been under development for a long time, and we have demos up on our website, also: Cepstral Online Speech Synthesis Demos. In fact, we have Higher Quality Limited Domain Demos available as well.
Geoff "Mandrake" Harrison
Some Random UI Hacker
It's funny how most synthesized voices sound like the Software Automatic Mouth (S.A.M.) software that was available for Atari 800 computers long ago.
The consequences for speech recognition are much worse (sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion).
I doubt that anything can be 100% successful given that human perception is not totally error-free. Speech recognition HMM models are based on contextual realizations of phonemes (allophonic model). This takes into account coarticulation. Same technique applies for generation: in-context phonemes are used.
The poor quality of current TTS engines comes from bad concatenation of segments (e.g. distortion at the junction). Other problems come from high-level (semantic) analysis and from inappropriate prosody (emphasis, rhythm, intonation).
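To make the junction problem concrete, here is a minimal Python sketch, not anyone's actual engine. It joins two recorded segments either by butting them together or with a short linear crossfade, and measures the largest sample-to-sample jump. The function names, the toy sample values and the overlap length are all assumptions made up for the example.

```python
def hard_join(a, b):
    """Butt two sample lists together with no smoothing at the junction."""
    return a + b


def crossfade_join(a, b, overlap):
    """Blend the last `overlap` samples of `a` into the first `overlap` of `b`."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must fit inside both segments")
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)          # weight ramps from near 0 to near 1
        mixed.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return a[:-overlap] + mixed + b[overlap:]


def largest_jump(samples):
    """Biggest difference between neighbouring samples (a crude click detector)."""
    return max(abs(y - x) for x, y in zip(samples, samples[1:]))


# Toy segments (hypothetical): one ends high, the next starts low.
seg_a = [1.0] * 8
seg_b = [-1.0] * 8
print(largest_jump(hard_join(seg_a, seg_b)))
print(largest_jump(crossfade_join(seg_a, seg_b, overlap=4)))
```

The crossfade trades a sharp click for a short blended region. Blending over too long a region smears the two sounds together, which is the smoothing trade-off raised earlier in the thread.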
All humans are mortal. Socrates is a human. Socrates is dead.
In addition to a call center, IBM has this Bangalore/India-based speak-center with thousands of males and females speaking the text you entered into a microphone...
I doubt that anything can be 100% successful given that human perception is not totally error-free.
True enough. Maybe I should have said "never be as successful as humans".
Speech recognition HMM models are based on contextual realizations of phonemes (allophonic model). This takes into account coarticulation. Same technique applies for generation: in-context phonemes are used.
Yes, but they're not contextual enough. Triphones, which to my knowledge are the norm, just aren't a wide enough catchment area. Furthermore, I would question the logic of segmenting at all, for the reasons described previously.
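For anyone who hasn't met triphones, a rough Python sketch of the idea follows: each phoneme is relabelled with its immediate left and right neighbours. The `left-phone+right` label format is one commonly used notation, and the `sil` boundary marker and the example phoneme symbols are assumptions for illustration only.

```python
def to_triphones(phones, boundary="sil"):
    """Relabel each phone with one neighbour of context on either side."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# "cats" as a hypothetical four-phone sequence
print(to_triphones(["k", "ae", "t", "s"]))
```

Each unit only ever sees one neighbour on each side, which is the "catchment area" objection: coarticulation spanning several segments falls outside the label.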
Or does anyone else not understand what the big deal about text to speech is?
I had a program for my C64 circa 1983 that did pretty good text to speech. Granted the voice was pretty robotic, but I'd think that 20 years later, this should be a cinch.
Speech to text, on the other hand...
Follow the adventures of the new wandering jews
OMFG. I was trying to do that. I exceeded my 34 word limit, so I pressed the back button and corrected it.
:-/
Little did I realise the voice is reset back to default when I do so. I just got offered a blowjob from Charles
Seems to me that text to speech would be a good problem for darwinian competitive algorithms. You can take a book on tape, feed the text as input, and have the computer have different algorithms compete by judging them against the human speaker.
Many iterations later, you probably can get a computer sounding just like a person. And since it has had a whole book to practice over, it should be pretty general.
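As a toy illustration of the competitive idea, here is a small Python sketch. It evolves a list of numbers (standing in for synthesis parameters) towards a target list (standing in for features measured from the human reading). Everything in it is an assumption made for the example: the squared-error distance, the population size, the mutation step, and the notion that a voice can be reduced to a handful of numbers at all.

```python
import random


def distance(candidate, target):
    """Squared error between a candidate and the human reference."""
    return sum((c - t) ** 2 for c, t in zip(candidate, target))


def evolve(target, population_size=20, generations=50, step=0.1, seed=0):
    """Keep the better half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [
        [rng.uniform(-1, 1) for _ in target] for _ in range(population_size)
    ]
    history = []                      # best score seen at each generation
    for _ in range(generations):
        population.sort(key=lambda c: distance(c, target))
        history.append(distance(population[0], target))
        survivors = population[: population_size // 2]
        children = [
            [gene + rng.gauss(0, step) for gene in rng.choice(survivors)]
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    population.sort(key=lambda c: distance(c, target))
    return population[0], history


best, history = evolve([0.3, -0.2, 0.7])
print(history[0], history[-1])
```

Because the survivors are carried over unchanged, the best score can never get worse from one generation to the next. The hard part in practice is the judging function: scoring "sounds like the human speaker" is far less obvious than a squared error.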
Compare the IBM to ATT demo with the following phrase:
"Has this site been /.'ed yet?"
IBM pronounces it clearly, while ATT fails. IBM obviously expected the slashdotting it so richly deserves.
I did a fair bit of research in speech synthesis a few years ago and I have to say this sounds good to me.
The system I was building was a diphone/triphone hybrid. We had a large inventory of basic segments (1500 or so) in LPC encodings, but we kept having to expand it to get different pitch contours.
One thing we did find that helped (probably the best idea of my life) was to try to capture features of the glottal excitation instead of using a simple spike excitation. Keeping a library of glottal pulses gave the voice a lot of naturalness (our goal was to generate arbitrary utterances for a particular speaker's voice).
So I have to say I totally agree with you. I think that natural voices will have to follow this sort of path - modelling the actual human vocal tract, not just a convenient mathematical model.
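The spike-versus-stored-pulse distinction can be sketched in a few lines of Python. This is only an illustration of the general source-filter idea, not the system described above: the one-coefficient filter, the pulse shape, and the pitch period are all made-up values, and a real LPC voice would use many coefficients estimated from recordings.

```python
def pulse_train(pulse, period, length):
    """Repeat `pulse` every `period` samples in a buffer of `length` samples."""
    out = [0.0] * length
    for start in range(0, length, period):
        for i, value in enumerate(pulse):
            if start + i < length:
                out[start + i] += value
    return out


def all_pole_filter(excitation, coeffs):
    """Run y[n] = x[n] + sum_k coeffs[k] * y[n-1-k], an LPC-style synthesis filter."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out


# Hypothetical values, chosen only for illustration.
COEFFS = [0.5]                              # one-pole stand-in for the vocal tract
SPIKE = [1.0]                               # simple spike excitation
STORED_PULSE = [0.2, 0.6, 1.0, 0.3, -0.4]   # stand-in for a recorded glottal pulse

spike_voice = all_pole_filter(pulse_train(SPIKE, 8, 32), COEFFS)
pulse_voice = all_pole_filter(pulse_train(STORED_PULSE, 8, 32), COEFFS)
```

The filter is identical in both cases; only the excitation changes, which is why swapping in a library of recorded pulses can alter the character of the voice without touching the segment inventory.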
You will not drink with us, but you would taste our steel? - Walter Matthau, The Pirates
Does anyone know if people think it might be possible to create artificial phonemes digitally rather than recording them? Do you know what problems this presents?
Prediction: They'll look at their server logs and find:
a) requests for female voices saying dirty things and
b) requests for male voices saying: "How are you gentlemen!! All your base are belong to us!! You have no chance to survive make your time!!"
c) "I got an error, you insensitive clod!"
if the answer isn't violence, neither is your silence / freedom of expression doesn't make it alright
What I wish On-Star would actually say
A slightly-edited announcement calling our Bulldog to attend to a special matter
tone
tone
http://gnufoo.org/macosx/
:)
cat -a is even cooler than snoop -a.
-- The world is watching America, and America is watching TV.
...from the old BBC Model B with "*SAY Whatever you like", which got it right about 95% of the time?
Not all that far, considering it's been nearly 20 years, to be honest.
I've done a bit of research on text to speech systems, and the absolutely BEST most natural text to speech I've come across is Rhetorical..
Demo here
It's got a good range of voices. My answering machine is using one of them...
I'm taking a Linguistics course this semester, and I've always found things like this interesting. You make several good points, but I feel that, like most doubters, you oversimplify trial as inevitable failure. You have to be careful when saying things like "Linux won't catch on," "Artificial Intelligence won't happen," or "phonemes are too hard to separate."
In fact, much of what you've said indicates the *eventual* possibility of a very conversable TTS/STT translating algorithm. (Whether or not these will be the same algorithm in reverse will be for the future to decide).
"Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.
Right there, you've laid out a *very complicated* but by no means difficult way of looking at phonemes individually. In my class, we have some 20 people who all have great difficulty in just figuring out allomorphs, which some Slashdotters might not know are variant forms of a morpheme, either in complementary distribution, such as in the case of plural nouns:
/s/ sound at the end of /kæts/ {cats}
/z/ sound at the end of /kIdz/ {kids}
/əz/ sound at the end of /mætʃəz/ {matches}
or in free variation, such as the Lisa/Liza name which mean the same thing, and are derived from the same root, but which have split due to geographical/cultural/other reasons.
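The plural-noun case is regular enough to write down as a rule. Here is a rough Python sketch, assuming the final sound of the noun is already known and using a simplified, hypothetical inventory of sibilant and voiceless sounds; a real system would need a full pronunciation lexicon and would have to handle irregular plurals separately.

```python
# Simplified sound classes, assumed for illustration only.
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}


def plural_allomorph(final_phone):
    """Pick the plural ending from the last sound of the singular noun."""
    if final_phone in SIBILANTS:
        return "əz"      # matches, buses
    if final_phone in VOICELESS:
        return "s"       # cats, books
    return "z"           # kids, dogs, and nouns ending in a vowel


for word, last in [("cat", "t"), ("kid", "d"), ("match", "tʃ")]:
    print(word, "->", "/" + plural_allomorph(last) + "/")
```

The sibilant check has to come first, since /s/ is itself voiceless but still takes the longer ending. That is complementary distribution in miniature: the environment alone decides which form appears.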
Now, where the average English major might not always recognize similarities and patterns, the average Slashdotter has trained him/herself to do so, and some are likely saying to themselves, "where else does this happen?" and "where is this not true?", which are useful, scientific questions.
You yourself present the answer to the problem you raise: we have to look at the surrounding phonemes in order to figure out how to make one particular sound fit the word it's in. This is *damn hard*, but not impossible. It's like the fact that stress affects a phoneme in certain languages: we just need to adapt to thinking about language in different terms than simply speaking it and spelling vague representations of it (by first realizing how vague those representations are, which is why the phoneme set is taught first in Linguistics classes).
Personally, I think the problem lies in the fact that we all want TTS/STT and we want it *now!*, and why can't the computer just say it or hear it the way we do, and all the other questions that come from a lack of understanding, both of how the machine represents everything and the garbled way in which our language is represented. Phonemes are the obvious solution: the software should only have to do STP/PTS conversion, and our language should conform to that, really, since it's the creative dialectical shifts that create a problem, but we'll end up devising a creative solution for that, too.
Now, we all know what happens with lossy compression...
Yes, we get a slightly inaccurate but highly useful jpeg of the Andes, or someone's new desktop widget set, or a very listenable 192kbps mp3 of "Hurt" covered by Johnny Cash (even sadder than the original, IMHO).
And TTS/STT will have its flaws as well, but a digital (though wide) set of sound symbols like phonemes will help us to break things down somewhat until we figure out that something *smaller* about those sounds is very functional, *and* how to represent *that* level of speech, just as we represented matter by some informal type, then by molecules, then atoms, and now we know quite a bit about how the electron, proton and neutron work, and are working on a smaller level.
To say you "don't really believe in phonemes" oversi
Emacs: for people who just never know when to
This approach does however beg the question whether "phoneme based" is still its most important characteristic. There are only 40 phonemes, not 10000 (the number of samples used by the IBM "voices").
It appears the "40" is an over-simplification. If they vary slightly based on context (surrounding phonemes), then there are technically many more than 40.
I guess you could say that there are 40 that are easily identifiable, while the context-sensitive variations are too subtle for speech researchers to isolate using their own ear. IOW, the subtleties are only "exposed" when you try to use the 40-rule alone to synthesize speech.
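A quick back-of-the-envelope calculation shows how fast the count grows once context is included. This Python sketch takes the figure of 40 phonemes from the comment above and simply counts every possible combination with a given number of neighbours; the function name is made up, and the results are upper bounds, since many combinations never occur in a real language.

```python
def context_unit_upper_bound(n_phonemes, neighbours):
    """Count every sequence of one phoneme plus `neighbours` context phonemes."""
    return n_phonemes ** (1 + neighbours)


print(context_unit_upper_bound(40, 0))   # bare phonemes
print(context_unit_upper_bound(40, 1))   # pairs, roughly the diphone case
print(context_unit_upper_bound(40, 2))   # one neighbour each side (triphones)
```

That gives 40, 1,600 and 64,000 respectively, so the 10,000 samples mentioned for the IBM voices and the 1,500 or so segments mentioned elsewhere in the thread both sit comfortably between "40 phonemes" and "every possible context".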
Hey, does this mean that S. Hawking will get a new voice?
Table-ized A.I.
We were hoping AT&T would do a better job than IBM at supporting their voice synthesizer. IBM pulled the Linux version of ViaVoice off the market without so much as a peep to their adoring fans on Slashdot, and wiped all mention of the Linux version from their web server. (Google isn't even allowed to cache it.) After IBM milked the Slashdot Linux fanboy publicity for all it was worth, they apparently didn't see any purpose in actually SUPPORTING the product -- so once their libraries stopped working against the latest GNU/Linux libraries (happy birthday RMS!), they dropped their Linux voice synthesizer product like a hot potato instead of bothering to recompile it and issue an update.
So we hoped AT&T would show more commitment to the promises they made on their web site about their flagship voice synthesizer product, but...
Has anyone actually tried buying a single user copy of Natural Voices from AT&T? YOU CAN'T ANYMORE! They used to sell the synthesizer for workstations and voices for competitive prices (in the 100s of dollars range). So we bought a few voices to evaluate, and sent some simple technical questions into the email address they provided for support, never receiving a reply.
After several weeks they never answered any of our questions, but we decided to buy some more voices to evaluate anyway. But by then, AT&T had pulled the consumer single user version of Natural Voices off of the market (and it took weeks of phone tag to find that out because they don't give out "technical" information on the phone, and they never answer their email support address).
Now if you want to buy a Natural Voice from AT&T, you have to buy the server edition for tens of thousands of dollars. Had their support not absolutely sucked, it might have been worth us paying such a high price, but no way we'd ever consider going with AT&T, after they demonstrated such horrible unresponsive service.
Actually it's a good thing we didn't go with AT&T's voice synthesizer, because we need support for voice authoring tools, and AT&T is incompetent in that regard, since they refuse to give out technical information over the phone, and never answer their email. No support whatsoever. Zilch. Nada. Forget about it.
Fortunately we found some excellent open source software that works together (and whose authors are MUCH more responsive than IBM or AT&T): the Festival Speech Synthesis System, the FestVox voice authoring tools, the small fast Flite runtime speech engine, the Edinburgh Speech Tools, the CSLU speech tools, the OGI Festival tools, and the MBROLA Multilingual Speech Project. This is state of the art research software, where IBM and AT&T got their ideas.
The quality of the commercial voices comes more from throwing lots of time and money into the production process -- the commercial software is not any more advanced than the open source research projects -- in fact the research projects inspired the commercial products!
-A speech synthesizer user who's been jerked around by AT&T and IBM, and is now happy to have no other choice but to use excellent open source software.
Take a look and feel free: http://www.PieMenu.com
I think it is widely recognized that you need to take coarticulation and _meaning_ into account when converting between speech and text.
You argued in another post for models of 4+ phonemes. The reason we don't see this is that it's not a huge theoretical leap from triphones (thus boring to researchers), and there are computational/storage/training efficiency requirements to consider. This is why one doesn't record an exhaustive library of every possible utterance in the first place. I think once you get to 7-phones, you may be better off trying to understand the phrase from a higher level of abstraction.
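A quick back-of-the-envelope sketch of the storage/training problem, assuming the rough figure of 40 phonemes quoted elsewhere in this thread and counting every possible combination (an upper bound, since many sequences never occur in a real language). The function name is just for illustration.

```python
PHONEMES = 40  # rough figure quoted in this thread, not an exact count

def nphone_inventory(n, phonemes=PHONEMES):
    """Upper bound on distinct units when each unit spans n phonemes."""
    return phonemes ** n

for n in (1, 3, 5, 7):
    print(f"{n}-phones: {nphone_inventory(n):,}")
# 1-phones: 40
# 3-phones: 64,000
# 5-phones: 102,400,000
# 7-phones: 163,840,000,000
```

Even if only a tiny fraction of those 7-phones ever show up, you would need enough recorded or training data to cover each one you keep, which is where the efficiency argument bites.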
Have we correctly identified the right compact expression of speech? I doubt it. Getting speech stuff to work involves a lot of tweaking that is theoretically ungrounded. Tweaking in a methodical and science-biased way _is_ engineering, however.
BTW, I seem to remember a prof saying that X-ray cinematography more-or-less proved the existence of vocal tract target configurations in speech, which correspond to phonemes. Not to mention that you can encode a message in IPA and have it understood by someone else. Even if they're not totally correct, phonemes may be a sufficient basis for building speech systems.
You don't know what you're talking about, and you trivialize an extremely complex problem. The main problem is "smoothing the phoneme edges", huh? It's a "very basic practice indeed", huh? Of course you're "not entirely sure," because you know nothing about modern speech synthesis, and you're wildly extrapolating from your extremely limited out-of-date experience. You have no need for a speech synthesizer yourself, because you just like to hear yourself talk.
-A Patriotic American
This is not a very coherent argument. You might as well say that you doubt the existence of musical notes, since you've diagrammed the power spectrum of middle C on a piano and various other instruments, and always you see a really complex waveform, not a simple 262 hertz sine wave. And sometimes there is even a complete absence of energy at that frequency. The situation indeed is so complex that no algorithm has ever been developed that can reliably detect the supposed period/frequency of the supposed musical note (especially in human voices).
But despite those complex observations, there is every reason to think that musical notes exist. It's just that there are difficult intertwined subjects here; the physics is difficult (if you try to accurately model the non-linearities of the resonating chambers), the math is difficult (the signal is non-stationary, so Fourier analysis is a really bad approximation of the truth), and the brain is, as always, extremely complex, so we don't understand the psycho-acoustics, either.
And yet, musical notes exist.
And so do phonemes, despite the fact that they blur together etc etc.
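The "complete absence of energy at that frequency" point is easy to reproduce in a toy setting. The sketch below builds a tone from harmonics 2, 3 and 4 of a 200 Hz note, leaves the 200 Hz component out entirely, and then shows that the waveform still repeats every 1/200 s. The sample rate, the 200 Hz value and all function names are my own assumptions, and a clean synthetic tone is nothing like a real voice, so this says nothing about how hard the general pitch detection problem is.

```python
import math

SAMPLE_RATE = 8000   # Hz, assumed for this toy example
FUNDAMENTAL = 200.0  # Hz, the "note" that is never actually emitted

def missing_fundamental_tone(n_samples):
    """Sum of harmonics 2, 3 and 4 of FUNDAMENTAL; harmonic 1 is left out."""
    return [
        sum(math.sin(2 * math.pi * k * FUNDAMENTAL * t / SAMPLE_RATE) for k in (2, 3, 4))
        for t in range(n_samples)
    ]

def energy_at(signal, hz):
    """Magnitude of one DFT bin: how strongly the signal contains `hz`."""
    re = sum(s * math.cos(2 * math.pi * hz * t / SAMPLE_RATE) for t, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * hz * t / SAMPLE_RATE) for t, s in enumerate(signal))
    return math.hypot(re, im)

def autocorrelation_pitch(signal, min_hz=80, max_hz=350):
    """Return the frequency whose lag gives the largest (unnormalized) autocorrelation."""
    min_lag = int(SAMPLE_RATE / max_hz)
    max_lag = int(SAMPLE_RATE / min_hz)
    best_lag = max(
        range(min_lag, max_lag + 1),
        key=lambda lag: sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag)),
    )
    return SAMPLE_RATE / best_lag

tone = missing_fundamental_tone(800)     # 0.1 s of audio
print(energy_at(tone, 200.0))            # essentially zero: no 200 Hz component
print(energy_at(tone, 400.0))            # large: the second harmonic is present
print(autocorrelation_pitch(tone))       # the repetition rate is still 200 Hz
```

The note is a property of the whole pattern, not of any single frequency bin, which is the same kind of claim being made for phonemes.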
On a final note, don't forget that even the word "word" has no concrete rigorous definition widely agreed upon in the linguistics community. Naively it seems simple, but when you look closely at the subject, the notion of "word" also gets really complex. So one could deny that words exist...but I don't think that would be a smart stance.
As a side note I'm not real thrilled about your history of written language. I suggest you take a refresher on ancient Egyptian. :-)
Masters in Linguistics, eh? Hopefully you are moving on to a PhD?
P.S. Yes, I have done work in computational linguistics...and shipped product! The topics are very difficult, indeed, but in the commercial world, failure to solve problems is not an option. :-)
Professional Wild-Eyed Visionary
And Oregon Graduate Institute's CSLU Toolkit extends Festival with an implementation of Sable: an XML format that lets you mark up text with arbitrary timing, pitch and volume envelopes.
And of course there's Dictionaraoke!
Main Entry: dictionaraoke Pronunciation: 'dik-sh&-"ner-A-O-ke Definition: Audio clips from online dictionaries sing the hits of yesterday and today. The fun of karaoke meets the word power of the dictionary.
-Don
Take a look and feel free: http://www.PieMenu.com
If all you have is a hammer, and you try pounding in a bolt, then driving a few non-tapping screws, and finally try to spot-weld a few bits of metal together by pounding very rapidly, you go into a cocoon and emerge with a whole field of study about how construction materials don't exist and, anyway, construction is impossible -- but only a scholar in the field can say so without being shouted down.
This is why engineers get high blood pressure when soft-science people mount the podium.
There's lots of companies out there not interested in making art. They just want to be able to get their customers through their phone call centers without falling back on "press one for... press two for..."
There's actually a section of the article discussing this. The IBM engineer was talking about what he considers the "holy grail" of TTS, and his opinion was that it is _not_ perfect reproduction of human inflections.
-----
Kvetch is Yiddish for "throw an exception" --Dr. Ron Cytron
I'm already convinced.
What are the alternatives? Can anyone point to work using words or sentences? At first cut I can imagine building a simple dictionary (spoken vocabulary) of words and trying to register the edges somehow so it doesn't sound like a ransom note. But the inflections are still guaranteed to be wrong.
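For what it's worth, the "register the edges" part of the word-dictionary idea can be sketched in a few lines. This is only a linear crossfade over plain lists of samples; the `WORD_UNITS` table holds made-up constant values standing in for real recordings, and the function name and overlap length are my own assumptions.

```python
def crossfade_concat(units, overlap):
    """Join sample lists, blending `overlap` samples linearly at each seam."""
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # weight of the incoming unit
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        out.extend(unit[overlap:])       # rest of the incoming unit, unblended
    return out

# Hypothetical "recordings": constant levels make the seam easy to inspect.
WORD_UNITS = {
    "hello": [1.0] * 6,
    "world": [0.0] * 6,
}

joined = crossfade_concat([WORD_UNITS["hello"], WORD_UNITS["world"]], overlap=2)
print(len(joined))                     # 10: two 6-sample units sharing 2 samples
print([round(s, 3) for s in joined])
# [1.0, 1.0, 1.0, 1.0, 0.667, 0.333, 0.0, 0.0, 0.0, 0.0]
```

This only hides the click at the seam. It does nothing about pitch or timing across words, so the inflection problem is untouched.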
It seems the most natural unit would at least be a sentence. A sentence is the smallest unit of song, with a beginning and an end. But how do you decompose and reconstruct a sentence as a whole? Suppose you can reduce the spoken sentence to some parameters capable of regenerating the sound nicely using a synthesis engine. Furthermore, suppose you have a huge library of spoken sentences and accompanying text. How do you train a mapping function between generic (non-training) text and sentence synthesis parameters, without doing any decomposition?
It seems like you have to do some kind of decomposition, but perhaps the "phoneme" is not correct. It might be more appropriate to look for clear features (in the time domain) to partition more realistic components, then see if the detected components generalize and synthesize.
Buddhists say we actually don't exist. And certainly, from the right perspective, we truly don't exist as independent extractable entities. Certainly we don't exist as independently as phonemes are required to in order for phonemes to sound right.
Anyway, I once had a famous Producer tell me that intonation doesn't matter, style matters; even style without intonation. This guy has produced a string of platinum-selling albums. Practically speaking, this means that notes DO NOT exist as far as the human voice is concerned, at least not in any useful or marketable way. His advice was to ignore pitch correction technology and to concentrate on style only. I would argue that in a similar way the notion of phonemes interferes with the production of listenable synthetic speech.
Anyway, for a couple of examples:
* Piano - notes exist, though they are usually out of tune
* Guitar - notes are suggested, but you can tune them by pushing the strings around with your fingers
* Flute - notes are suggested, and you can inflect and blend the hell out of them by using your lips
* Voice - notes don't really exist, just vocal style
* Phonemes - exist only in the minds of people failing to build TTS systems that a normal person would enjoy listening to
Epistemology is at the root of hard science; it is the basis of the scientific method. Engineers get high blood pressure because they like to look stuff up instead of actually thinking.
It looks like they are using glottal pulses as you say, and they are doing the female voice (Crystal) by boosting the first two harmonics and by filtering out the range past 4 kHz and replacing it with noise to give it that breathy sound that is characteristic of female voices in American culture (this varies with culture -- the "Dame Edna" effect, yeah, I know Dame Edna is a dude in drag, but different cultures have different norms on how women are supposed to talk). I think they are doing some other tricks, like varying the formant damping pitch synchronously to fake the effect of a coupled voice source and vocal tract acoustic load.
At the segmental level the synthesis is kind of clunky, but at the voice quality level it sounds remarkably good, especially Crystal. Mike (the male voice) sounds kind of buzzy at the voice level, but Crystal sounds quite female and quite natural.
I worked quite a bit on some telecom projects using RealSpeak from (now defunct) L&H which also uses a triphone concatenation technique IIRC. It doesn't sound as good as this stuff but it was a useful, shipping product.
Yes I think "phoneme" gets a bit fuzzy when you put it under the microscope, but so do other handy abstractions like "word" and "adjective".
ASR and TTS techniques in use now are pretty sophisticated and relatively successful considering they only try to simulate the bottom of the stack of our (human) language machine, i.e., to simplify, since the TTS doesn't know what the words or sentence "mean", how can it know how to get the right intonation, emphasis, etc.? Ditto for ASR; the state of the art is to build a grammar or language model of some kind by hand for each step in a dialog. Effectively the app developer must tell the recognizer exactly what words to listen for, and in what order, (and with what probability/preference).
So the clever stuff in current TTS engines isn't just how to glue the phonemes together, but how to generate the right intonation/prosody, emphasis, choice of pronunciations (the verb "read" in the past tense is pronounced differently from "read" in the present tense). These things can vary from one speaker or region to the next, just like the accents, so it's hard to find the "rules". This is something that is maddening about computational linguistics... seems like for every "rule" there is a phonebook full of fine print.
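A toy sketch of the "read" case shows how fast the fine print piles up. Everything here is hypothetical: the cue-word list, the ARPAbet-style phoneme spellings and the function name are mine, and a real engine would use proper part-of-speech tagging rather than keyword spotting.

```python
# ARPAbet-style spellings, used here only as an illustration.
PRONUNCIATIONS = {
    ("read", "present"): ["r", "iy", "d"],  # rhymes with "reed"
    ("read", "past"): ["r", "eh", "d"],     # rhymes with "red"
}

# Crude, made-up cues that suggest past tense or a past participle.
PAST_CUES = {"yesterday", "already", "had", "have", "has", "ago"}

def pronounce_read(sentence):
    """Pick a pronunciation of 'read' from surface cues in the sentence."""
    words = sentence.lower().replace(".", "").split()
    tense = "past" if PAST_CUES & set(words) else "present"
    return PRONUNCIATIONS[("read", tense)]

print(pronounce_read("I read the book yesterday."))  # ['r', 'eh', 'd']
print(pronounce_read("I will read the book."))       # ['r', 'iy', 'd']
print(pronounce_read("I read it."))                  # ['r', 'iy', 'd'] (a guess)
```

The last sentence is genuinely ambiguous without more context, and the rule just silently picks one reading. Multiply that by every homograph, speaker and region and you get the phonebook.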
's fun!
IANAL (linguist), but I think I both agree and disagree with what you're saying. I actually have a reason, even: we can't see how our brains actually process things like phonemes, or even what's really all that important about them. This is, IMO, analogous to vision, where we have neurons that fire in response to this or that shape or brightness or color in their visual field, yet they are incredibly difficult to cause to fire artificially. There are similar variations in what we hear and smell. What we think of as the building blocks of speech or language might not be the real building blocks at all. Phonemes are a good example. A good experiment here might be to train a neural network to recognize a few different phonemes in speech and see what it comes up with when certain phonemes are not present. Just an idea, but based on this I agree with a large part of your post.
You are all fartheads.
I think the parent poster makes a good point. What you need to realize is that speech research like this has been ongoing for like 50 years. It is my opinion that if we were going down the right track we would be further along than we are now. It is always a good idea to step back (sometimes way the hell back) and see if there are any other ways to attack a problem. Even examining methods that have failed miserably can be quite useful if you bother to figure out exactly why they failed so badly.
Personally I think the way the speech problem itself is formulated is flawed. Think of it this way, you hire a secretary that does not speak English. So instead you read to her from a book for a couple hours until she has figured out how to map your utterances to the words in the book. Later you ask her to read stuff back to you. Can you see how much much more difficult the problem is? If the secretary had knowledge of the language, then the problem would not be so difficult. But the current methodology is typical engineering style, break the problem down into small chunks, solve small chunks, assemble small chunks to form much larger chunk, voila! Problem solved.
It sounds like you have trouble with the "existence" of phonemes because they are not discretely separable into clean, isolated units on a spectrometer. This seems like a failure, on your part, to understand the Phonemic/Phonetic distinction. This is a common misunderstanding. Just for the record: phonemes are theoretical and in your head; phones are physical air waves smacking against your eardrums and therefore microphones. Everything you saw on the spectrometer was an electro-visual representation of a phone.
Additionally, I have seen more compelling evidence for the "existence" of phonemes, as a theoretical construct than you have shown against them. (I will leave the discussion of "being" and "existence" of a theory to the Philosophy students).
As a theoretical reference, they are very useful, much in the same way we posited subatomic particles before we could detect/segment them.
Much of what TTS and NLP engineers are trying to do is work from the theory to produce results. IMO, the results using modern, phonemic practices are much better than older spectra/recording based TTS systems. The proof is in the pudding; real work is getting done by way of a theory.
One final note, unrelated to my main point: your description of the history of writing systems and writing's relation to language is highly inaccurate. Your description of the IPA and what it's used for is equally so. I used to think my Linguistics education was lacking, so it's nice to see that there are schools still lower on the bell curve.
it may be a problem with the U.B.A.
Sigs are bad for your health.