Phoneme Approach For Text-to-Speech in SCIAM

Does the poster have something against IBM by watzinaneihm · 2003-03-17 00:04 · Score: 3, Insightful

Does the poster have something against IBM ... to link an application to a slashdot post?
Even the guys who don't read the articles might now be tempted to try clicking the "enter text" link.

--
.ACMD setaloiv siht gnidaeR

Re:Does the poster have something against IBM by borgdows · 2003-03-17 00:07 · Score: 2, Funny

It looks like IBM is not running their servers on a dead fly ;)
Re:Does the poster have something against IBM by jelle · 2003-03-17 02:38 · Score: 1

Didn't they have that scalable supercomputer on demand thing going on?

Obviously, they are testing the system under load now, and this is part of their test plan.

Tomorrow, we'll see a 'get your own freshly compiled linux ISO from IBM' here...

--
--- Hindsight is 20/20, but walking backwards is not the answer.
Re:Does the poster have something against IBM by Anonymous Coward · 2003-03-17 18:48 · Score: 0

Obviously, they are testing the system under load now, and this is part of their test plan.

Tomorrow, we'll see a 'get your own freshly compiled linux ISO from IBM' here...

Interesting, but I'm sure IBM can afford to pay for testing of their own.
Re:Does the poster have something against IBM by jelle · 2003-03-18 13:33 · Score: 1

"Interesting, but I'm sure IBM can afford to pay for testing of their own."

But.... nothing compares to the /. effect. It's the real thing(tm)!

--
--- Hindsight is 20/20, but walking backwards is not the answer.

Phonemes not phenomes by Tucan · 2003-03-17 00:04 · Score: 4, Informative

Phonemes are the building blocks of language not phenomes.

Re:Phonemes not phenomes by Anonymous Coward · 2003-03-17 00:48 · Score: 1, Interesting

Methinks another case of /.ers obtaining their scant science knowledge from bad TV and movie sci-fi (real SF comes in books!)

Anybody willing to write "The Extended Phoneme?"
Homer Simpson perhaps....
Re:Phonemes not phenomes by OpenSourced · 2003-03-17 01:11 · Score: 1, Funny

First time I saw the headline, I thought it read Pheromone approach to Text-to-speech. Now that could be an interesting concept!

--
Rome taught me patience and assiduous application to detail. Virtues which temper the boldness of great, general views.
Re:Phonemes not phenomes by Anonymous Coward · 2003-03-17 01:21 · Score: 0

Or maybe that was Phenom like phenomenon :-).
Re:Phonemes not phenomes by jscribner · 2003-03-17 02:41 · Score: 1

Doh - that's what happens when late nights get mixed with a background in biology.
Yes, Phoneme, not phenome (thank you to the ed. who corrected that).

--
JS - IBM Metaverse devteam
The opinions expressed here are mine & not necessarily representative of IBM
Re:Phonemes not phenomes by Anonymous Coward · 2003-03-17 03:29 · Score: 1, Informative

Check out FreeTTS. It's free, open source, and very good. The quality of the supplied voice (as of now) is not as good, but the engine is very good. The footprint is small. And it's pure Java. Also, it's faster than C code (Flite) if some of you want to compare speed.
Re:Phonemes not phenomes by raile · 2003-03-17 03:31 · Score: 1

I'm just glad it's not a text-to-PHEROMONE system. I'd hate to get turned on every time I try to check my flight status. Although, I guess pr0n sites could take advantage of that kind of technology...
Re:Phonemes not phenomes by famebait · 2003-03-17 07:13 · Score: 1

-and phonemes have been the dominant approach to speech synthesis basically since the beginning. There may be something else interesting about what IBM has been doing lately, but from a quick skim of the article, I can't see what is "news" about any of this.

--
sudo ergo sum
Re:Phonemes not phenomes by Art+Tatum · 2003-03-17 16:53 · Score: 1

Same here. I thought to myself, "What the hell do pheromones have to do with text-to-speech synthesis? Those guys at IBM research have finally lost their minds."
Re:Phonemes not phenomes by AdmiralBer · 2003-03-22 13:12 · Score: 1

actually her voice really turns me out...

I was expecting better... by LeoDV · 2003-03-17 00:04 · Score: 5, Informative

If memory serves me, I believe it was AT&T (?) that used to have a similar webpage with near-perfect text-to-speech, which is hardly the case of this project.

What's so special about it?

Re:I was expecting better... by Rubyflame · 2003-03-17 01:03 · Score: 5, Informative

Used to? Still does! It's called "AT&T Natural Voices," and there's an online demo.

--

All it takes is nukes and nerves.
Re:I was expecting better... by digitalgiblet · 2003-03-17 01:08 · Score: 1

To my ear the IBM demo sounds a little smoother and more natural.
I managed to get a couple of samples from the IBM site before it became totally bogged down... I then went back to the AT&T site and listened to a few. Unfortunately I didn't think to try an apples to apples comparison, I was just playing around.
They both sound funny because they are so close to sounding like real people. The result is they sound like people with mild to serious speech problems...
Re:I was expecting better... by John+Harrison · 2003-03-17 01:50 · Score: 1

I might be biased as an IBMer, but the IBM one sounds better to me. Both are certainly better than the one included with Notes Buddy, which is all the rage in IBM right now since it is so much better than our previous IM tool.

--
Lasers Controlled Games!
Re:I was expecting better... by perky · 2003-03-17 01:58 · Score: 1

I thought that the IBM one was better. The acoustic stuff seemed to be about the same, but the intonation on the IBM one was a lot nicer for the two samples I tried.

Incidentally, they don't seem to have improved a great deal from the concatenative TTS systems IBM had 4 years ago. There was one model of the UK marketing woman for ViaVoice, and for some sentences the TTS was almost indistinguishable from the real thing. The only problem with these systems is that the memory footprint is massive, so they take a bit to initialise and are only really useful on a telephony server.

--
"The new wave is not value-added; it's garbage-subtracted" - Esther Dyson, Dec 1994
Re:I was expecting better... by true_majik · 2003-03-17 02:54 · Score: 1

lets not forget http://www.bell-labs.com/project/tts/voices.html
Re:I was expecting better... by Mandrake · 2003-03-17 03:14 · Score: 2, Interesting

We've also been doing this for quite some time. you can check out the Cepstral On-Line High Quality Synthesis Demos, as well as our High Quality Limited Domain Demos.

--
Geoff "Mandrake" Harrison
Some Random UI Hacker
Re:I was expecting better... by swb · 2003-03-17 03:54 · Score: 1

Much worse than the AT&T version. The words are run together too much.

Haven't gotten the IBM one to work yet.
Re:I was expecting better... by MrScience · 2003-03-17 04:36 · Score: 2, Interesting

This was used in Mission to Mars for the spaceship's voice. The director was looking to do some sound FX to create one from a human voice, then found AT&T's product which was a perfect fit.

I wanted the same voice for my computer-controlled house, and tracked down where they got it. Now my handheld says, "Warning. Power failure imminent." when its battery is about to die.

--
You quitting proves that the karma kap worked. The most annoying of the whores shut up. --CmdrTaco
Re:I was expecting better... by K3lvin · 2003-03-17 07:17 · Score: 1

That text-to-speech thing is nothing compared to this.
Re:I was expecting better... by K3lvin · 2003-03-17 07:30 · Score: 1

Here's an interactive demo
Re:I was expecting better... by Anonymous Coward · 2003-03-17 07:55 · Score: 0

The Amiga had speech synthesis virtually as good in *1985*. The most important difference seems that the list of exceptions was a bit smaller on the Amiga, causing it to mis-pronounce words such as "guild" and "wheatfield".
Re:I was expecting better... by tchapin · 2003-03-17 09:11 · Score: 2, Informative

SpeechWorks also offers a high-quality network telephony concatenative TTS engine, called Speechify. We also offer a formant-based TTS engine, as well as an embedded TTS one based on Speechify. See some demos here.
We also offer quite a large range of languages. Our Canadian French voice, which was just released, is fantastic! Looks like marketing hasn't put him on the demo page yet though... :(
Todd

--
-- !todd erases a red dot! I steal music on the internet.
Re:I was expecting better... by tchapin · 2003-03-17 09:20 · Score: 1

Well, try running the name of this domain through any TTS engine; it's hilarious! llanfairpwllgwyngyllgogerychwyrndrobwyll-llantysiliogogogoch
Todd

--
-- !todd erases a red dot! I steal music on the internet.
Re:I was expecting better... by ungerware · 2003-03-17 10:47 · Score: 1

Ooh, a SpeechWorks plug! Okay, let's give equal time to Nuance :)

Nuance Vocalizer Demo

--

-----
Kvetch is Yiddish for "throw an exception" --Dr. Ron Cytron
Re:I was expecting better... by shut_up_man · 2003-03-17 22:48 · Score: 1

That's awesome man... really cool, really useful. Thanks.
Re:I was expecting better... by Mandrake · 2003-03-19 02:43 · Score: 1

our synthesizer runs in a very very small fraction of the footprint (memory and disk space) as the AT&T synthesizer. The AT&T synthesizer is also based on earlier work from our CTO (the AT&T synthesizer is ultimately just festival with some other code on top of it)

--
Geoff "Mandrake" Harrison
Some Random UI Hacker

speaking of the /. effect by trelanexiph · 2003-03-17 00:06 · Score: 4, Funny

I guess IBM didn't have much to say on the matter.

IBM Text-to-Speech Research Demonstration

Input Communcations Error.

You have reached this page because of an severe input error. It appears that the client didn't connect to the server. Please inform the system administrator using the feedback mechanism on the main home page.

Re:speaking of the /. effect by timmie... · 2003-03-17 00:41 · Score: 1

Hardly surprising. Down at the bottom of the page there's a note that says the application can only be used 30 times per day... and it's linked to slashdot. :)
Re:speaking of the /. effect by wjvdt · 2003-03-17 03:11 · Score: 1

I received the same error until about the seventh attempt... "If at first you don't succeed, try, try, again. Then quit. There's no use being a damn fool about it." - W.C. Fields

--
"If I were punished for every pun I shed, there would not be left a puny shed of my punnish head." - Samuel Johnson
Re:speaking of the /. effect by Tablizer · 2003-03-17 07:21 · Score: 1

Synth Voice: "Help, I have been slash dot ted"

--
Table-ized A.I.

This could be a hit... by WegianWarrior · 2003-03-17 00:08 · Score: 1, Funny

...if they make some sort of interface between e-books and text-to-speech. Instant 'sound-book' *smiles*. No longer do the visually impaired have to wait for someone to make the soundbook for them, no longer do I need to actually read the long, boring documents people send me at work.

With the right technical document, this could cure insomnia as well...

--
Everything in the world is controlled by a small, evil group to which, unfortunately, no one you know belongs.

Re:This could be a hit... by yora · 2003-03-17 00:16 · Score: 1

if they make some sort of interface between e-books and text-to-speech. Instant 'sound-book' *smiles*. No longer do the visually impaired have to wait for someone to make the soundbook for them, no longer do I need to actually read the long, boring documents people send me at work.

You should check out the Digital Talking Book specs. It is an open format, and there are readers available which allow text-to-speech and other effects. Most of the readers have been designed with a visually impaired audience in mind.

yora
Re:This could be a hit... by wcb4 · 2003-03-17 03:46 · Score: 2, Interesting

I have actually used textaloudMP3 (from nextUp) to read Project Gutenberg e-texts aloud. It's not perfect, far from it, but it gets better since you can correct mispronunciations over time (my exceptions file now has about 200 entries). The program is a Windows front end to ANY installed text-to-speech engine, be it Microsoft's or L&H or AT&T. I often have it read into mp3 files, which I burn onto CDs and listen to on the way to work. I can usually get about 5-6 full books on a single CD, and it's free (well...once you spend the $50 for the software and the TTS engine and the high quality voices).

--
I reject your reality ... and substitute my own.
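
The exceptions file mentioned in the comment above can be sketched roughly as follows. This is only an illustration of the general idea (respell troublesome words before the text reaches the engine), not how TextAloud or any particular engine stores its exceptions; the `EXCEPTIONS` entries and the `apply_exceptions` helper are made up for the example.

```python
import re

# Hypothetical exceptions table: word as written -> respelling the engine reads correctly.
EXCEPTIONS = {
    "guild": "gild",
    "Gutenberg": "Gootenberg",
}

def apply_exceptions(text, exceptions):
    """Replace whole words found in the exceptions table, ignoring case."""
    lowered = {word.lower(): respelling for word, respelling in exceptions.items()}

    def swap(match):
        word = match.group(0)
        # Fall back to the original word when there is no exception for it.
        return lowered.get(word.lower(), word)

    return re.sub(r"[A-Za-z']+", swap, text)

print(apply_exceptions("The guild met at Gutenberg.", EXCEPTIONS))
```

The table grows one entry at a time as mispronunciations turn up, which matches the "gets better over time" behaviour described above.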
Re:This could be a hit... by walt-sjc · 2003-03-17 05:40 · Score: 1

What would be REALLY funny is a TTS / voice recognition battle between different computers - maybe running an Eliza-type system. As it messes up on the recognition, things could go downhill fast... :-)

PHONEME, y'all, not *phenome by texchanchan · 2003-03-17 00:10 · Score: 3, Informative

Phoneme, a unit of sound in a word. From Dictionary.com: "The smallest phonetic unit in a language that is capable of conveying a distinction in meaning, as the m of mat and the b of bat in English. [... from Greek phōnēma, phōnēmat-, utterance, sound produced, from phōnein, to produce a sound, from phōnē, sound, voice...]"

Related to "telephone," "phonics," etc.

Re:PHONEME, y'all, not *phenome by WeeBull · 2003-03-17 00:15 · Score: 5, Funny

.. and often uttered in distressed tones at the end of a night out, usually by desperate males attempting to re-attach themselves to some female. PHONEME! PLEASE PHONEME! I LOVE YOU! PHONEME!

AT&T have been doing this for a while! by Anonymous Coward · 2003-03-17 00:13 · Score: 5, Informative

If you visit here:
http://www.naturalvoices.att.com/demos/

You'll find AT&T's version a whole lot better. The main problem with voice synthesis is smoothing of phoneme edges, where if it is done too aggressively the speech synthesis can sound too "lumpy".

The other thing is, speech synthesis via phonemes is very basic practice indeed! I remember having a Currah Speech module for my ZX Spectrum (1982 home computer) - and the first thing you were taught about was phonemes. I'm not entirely sure what's new about this IBM product. It's basically not that much evolved from the mid-90's.

Re:AT&T have been doing this for a while! by wiggys · 2003-03-17 00:22 · Score: 2, Funny

The Currah speech unit for the Spectrum was hilarious. It came with a free game which was supposed to say "The Banshee wails at you but nothing happens".
It actually sounded like "Shbansheehailsacthoowawaaaawaaaens"
I remember you could also turn it on while you were programming, so every time you pressed a key it would say "ONE ZERO PRINT QUOTE ACH EE ELL ELL O QUOTE ENTER TWO ZERO ENTER RUN ENTER". It used to drive me batty. It was one of those eighties things which you thought was "cool" at the time, but had no practical use. I think they were only ever invented so you could show your neighbours how advanced your computer is: "LOOK, IT CAN TALK TO ME!"

--
Sorry, but my karma just ran over your dogma.
Re:AT&T have been doing this for a while! by Anonymous Coward · 2003-03-17 00:38 · Score: 2, Insightful

The IBM product seems to take the recording of a long text read by a human and automatically produce the data collection that is the artificial voice. It uses speech recognition methods to align text and recording. It also stores more than just a simple collection of phonemes: where older text-to-speech solutions would modify the sample of a phoneme to reflect a certain position in a sentence, IBM's solution appears to use a phoneme sample from the same context, making the result much less monotone. This approach does however beg the question whether "phoneme based" is still its most important characteristic. There are only 40 phonemes, not 10000 (the number of samples used by the IBM "voices").
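
To make the "sample from the same context" idea above concrete, here is a toy sketch. It is not IBM's algorithm; it only assumes that each recorded phoneme sample is stored together with its left and right neighbours, and that synthesis prefers the stored sample whose neighbours match the target sentence. The function names and the scoring rule (one point per matching neighbour) are invented for the illustration.

```python
from collections import defaultdict

def build_inventory(recorded):
    """Index every recorded phoneme by label, remembering its neighbours."""
    inventory = defaultdict(list)
    padded = [None] + list(recorded) + [None]  # None marks a sentence boundary
    for i in range(1, len(padded) - 1):
        inventory[padded[i]].append(
            {"id": i - 1, "prev": padded[i - 1], "next": padded[i + 1]}
        )
    return inventory

def select_units(target, inventory):
    """For each target phoneme, pick the sample whose recorded context fits best."""
    padded = [None] + list(target) + [None]
    chosen = []
    for i in range(1, len(padded) - 1):
        candidates = inventory.get(padded[i])
        if not candidates:
            raise KeyError(f"no recorded sample for phoneme {padded[i]!r}")
        # Score: one point for a matching left neighbour, one for a matching right one.
        best = max(
            candidates,
            key=lambda c: (c["prev"] == padded[i - 1]) + (c["next"] == padded[i + 1]),
        )
        chosen.append(best["id"])
    return chosen

recording = ["k", "ae", "t", "s", "ae", "t"]
inventory = build_inventory(recording)
print(select_units(["s", "ae", "t"], inventory))
```

With many samples per phoneme, the inventory is far larger than the 40-odd phonemes themselves, which is the point the comment makes about the 10000 samples.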
Re:AT&T have been doing this for a while! by Anonymous Coward · 2003-03-17 00:47 · Score: 0

Tried the German TTS. Man it sucked. Sounds like an American on speed. :-)

"Wie geht es dir?" sounded like "w w w w"!

I am not impressed.
Re:AT&T have been doing this for a while! by Anonymous Coward · 2003-03-17 00:54 · Score: 0

Not impressed indeed. Completely unintelligible.
Re:AT&T have been doing this for a while! by 68K · 2003-03-17 01:47 · Score: 1

I copied some text from the page into the synthesizer:

"Our recorded demos showcase the broadband versions of each voice"

"BroadBAAAAND". :-)
Re:AT&T have been doing this for a while! by wcb4 · 2003-03-17 03:49 · Score: 1

probably more important that it is context-sensitive-phoneme-based, than just that it is phoneme based.

--
I reject your reality ... and substitute my own.
Re:AT&T have been doing this for a while! by paulcammish · 2003-03-17 05:14 · Score: 1

I remember having a Currah Speech module for my ZX Spectrum
In that case, you'll also likely remember the words (in authentic WOPR from Wargames tones): "The ban-shee way-uls at you and no-thing happ-uns"
Ah, those were the days...
Re:AT&T have been doing this for a while! by prowley · 2003-03-17 08:10 · Score: 2, Insightful

The way to smooth out the lumps is to not use phonemes at all, but diphones. Imagine recording two phonemes uttered by a human speaker in sequence, and then slicing through the middle of each phoneme and discarding the ends. That gives you a diphone. Diphones are far superior because phonemes do not change in the middle, so there are no "lumps" at the splice. On the other hand phonemes do change depending on what phoneme is uttered next, simply because in articulating different phoneme sequences the human vocal tract must perform different gymnastics. The only downside is that a full set of diphones is much larger than a full set of phonemes - and they are all buggers to record.
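
A minimal sketch of the bookkeeping behind diphones, assuming phonemes are plain string labels and using "_" as a made-up marker for silence at the edges of an utterance. Each diphone covers the transition from the middle of one phoneme to the middle of the next, so an utterance of n phonemes needs n + 1 units once the silences are included.

```python
def to_diphones(phonemes, silence="_"):
    """Return the diphone units spanning each phoneme-to-phoneme transition."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

print(to_diphones(["k", "ae", "t"]))  # ['_-k', 'k-ae', 'ae-t', 't-_']

# Why the inventory is "much larger": with n phonemes there can be up to n * n
# ordered pairs. The 40 here is the rough phoneme count quoted elsewhere in the thread.
n_phonemes = 40
print(n_phonemes * n_phonemes)  # 1600
```

Not every pair occurs in a given language, so the real number of diphones to record is somewhat smaller than the upper bound.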
Re:AT&T have been doing this for a while! by sipy · 2003-03-18 06:05 · Score: 1

LOTS of people have been doing this for a while. The issue isn't whose idea it was, or who tried it (hell, my company even tried it in the late 80's). The issue is who makes it WORK! And "work" in this case is - adapt to all of the nuances of the human voice, and become "pleasing" to "most" people. The words in quotes are so subjective that they leave room for the multitudes of approaches, such as IBM's, to claim they're "A-Number-One!" Da Vinci "invented" the helicopter, but it took hundreds of years (and Sikorsky) to make it "work".

cool by Graspee_Leemoor · 2003-03-17 00:14 · Score: 1, Interesting

Whoa- finally something better than what we've had for years.

Try "I never promised you a rose garden." -The speaker sounds genuinely pissed-off!

graspee

Here's another text-to-speech site by wiggys · 2003-03-17 00:16 · Score: 3, Funny

http://www.research.att.com/~ttsweb/cgi-bin/ttsdemo

Some of the voices sound okay I guess. Better than Stephen Hawking anyway.

--

Sorry, but my karma just ran over your dogma.

Re:Here's another text-to-speech site by Anonymous Coward · 2003-03-17 01:34 · Score: 0

And here's another very good one

http://actor.loquendo.it/

XHTML by Anonymous Coward · 2003-03-17 00:20 · Score: 0

Wasn't there some company about two years ago that was developing a service to use voice to build XHTML pages? I did a search on Google, but could not find it. They had a test 1-800 number that you could call, say something, and then go to a webpage that was automatically created for you. It seemed to work pretty well. Whatever happened to that?

News, Girls, Cams, Jokes, and other complete time wasters

*blush* by WeeBull · 2003-03-17 00:22 · Score: 5, Funny

Uhm, ok, who else just spent 10 minutes (thoroughly) checking if IBM filters naughty words at the text-to-speech interface? Getting the female voices to utter favourable phrases regarding one's studliness, perhaps?

Oh ... just me? *blush*

Re:*blush* by wiggys · 2003-03-17 00:25 · Score: 1

Maybe they should be used to generate the speech in those Weebl and Bob animations you link to in your profile!

--
Sorry, but my karma just ran over your dogma.
Re:*blush* by Doomrat · 2003-03-17 01:10 · Score: 1

I wish I could be cool just like you, but mother says that I'm not allowed to use those words.
Re:*blush* by Anonymous Coward · 2003-03-17 02:02 · Score: 0

No. A total of 2.2 seconds typing/waiting for .wav to download tells me that "Nice pants, wanna f*ck?" (sub your favorite vowel) is expressed quite naturally..

hmmmm... by koekepeer · 2003-03-17 00:22 · Score: 1, Informative

festival anyone?

cut'n paste:

http://www.cstr.ed.ac.uk/projects/festival/

Open Source Speech Synthesis by wzrd2002 · 2003-03-17 00:23 · Score: 5, Informative

There is already a freely available open source speech synthesis application for both Linux and Windows, called Festival, created by the University of Edinburgh.

Re:Open Source Speech Synthesis by wiggys · 2003-03-17 00:28 · Score: 1

I hope it doesn't have a strong scottish accent, they're hard enough to understand in real life...

--
Sorry, but my karma just ran over your dogma.
Re:Open Source Speech Synthesis by WWWWolf · 2003-03-17 02:17 · Score: 3, Informative

Festival is great, especially with the OGI patches. I was completely blown away by Festival's quality compared to other open source TTS engines, and the OGI stuff makes stock Festival sound pathetic. Really great stuff, regrettably still not as good as IBM's or AT&T's stuff, but they have got a TTS that I can listen to for hours without making my ears bleed.
Regrettably OGI patches are for personal/research use only, so Debian won't ship them...
Re:Open Source Speech Synthesis by Mandrake · 2003-03-17 03:10 · Score: 2, Informative

You should also check out CMU Flite, which is by one of the guys who built Festival. He also works on other, high quality synthesizers at our company, which you can get demos of at our demo site.

--
Geoff "Mandrake" Harrison
Some Random UI Hacker
Re:Open Source Speech Synthesis by anonymous+cupboard · 2003-03-17 03:14 · Score: 1

That's the problem with BSD-style licenses (under which Festival was released). You may extend and restrictively license the result. I'm still a little surprised that the OGI stuff is for non-commercial use only although it was at least partly government funded.
Unfortunately free-TTS (i.e., playing any, not just replaying canned speech) is a growing area and there will definitely be a large commercial potential and everyone seems to know this.
Re:Open Source Speech Synthesis by jandrese · 2003-03-17 03:46 · Score: 1

The only problem with Festival is that it practically requires a PhD to get it up and running correctly, and the documentation is aimed at the speech synthesis development community, not the end users. The only reason I got mine working was the FreeBSD ports system and running across a reasonably small demo script I could hack to get what I wanted.

--

I read the internet for the articles.
Re:Open Source Speech Synthesis by g4dget · 2003-03-17 14:53 · Score: 1

Doesn't seem that hard...
# apt-get install festival festvox-poslex festvox-kallpc16k
# lynx -dump -nolist http://www.slashdot.org/ | festival --tts
Re:Open Source Speech Synthesis by MisterFancypants · 2003-03-17 15:13 · Score: 1

well...some people are retarded.
Re:Open Source Speech Synthesis by jandrese · 2003-03-17 16:02 · Score: 1

Eww, you're using the default voices. What you want to do is install the OGI RES LPC pack, the OGI Lexicon, the tll voice, and write a bit of Scheme to get the thing configured. For instance, if you want it to just say whatever you give it on the command line of a script:
echo "(voice_tll_diphone) (Parameter.set 'Audio_Method 'freebsd16audio)(SayText \"$*\")" | festival --pipe

Obviously using whatever sound system you have. By default it will try to use NAS if it is installed on your system, but I've never managed to make that work.

If you can stand the default voice, it is quite a bit easier to install. I'm pretty sure there's a way to get it to do some TTS on a file with those parameters as well, but I haven't pored over the documentation enough to find it yet.

--

I read the internet for the articles.
Re:Open Source Speech Synthesis by g4dget · 2003-03-17 19:27 · Score: 1

Eww, you're using the default voices. What you want to do is install the OGI RES LPC pack, the OGI Lexicon, the tll voice, and write a bit of scheme to get the thing configured.
Someone who has figured out how to configure that should put it into Debian as a package... then ordinary users could use it.

to try it out by koekepeer · 2003-03-17 00:25 · Score: 1

this link:

http://festvox.org/voicedemos.html

does the same as IBM's demo page. sounds the same as well. but hey, i'm a layman in linguistic matters, so there's prolly a *huge* improvement i understand crap about

comparison to Apple's technology? by inblosam · 2003-03-17 00:26 · Score: 4, Informative

I run Mac OS X and in a lot of applications you have the option for the computer to read an entire document. For example, in TextEdit (a simple text editor by Apple) you can go to Edit, Speech, Start Speaking...in the menu and it will read everything for you. There are 10-15 different default voices to choose from, and built into the OS you can control pretty much everything by speech and get information by voice.

How does this compare? I think it is at least at the same level, if not further along! Good work Apple for being in the game, if not ahead of the game on this one.

Re:comparison to Apple's technology? by aseidl · 2003-03-17 00:56 · Score: 4, Interesting

I'm surprised by how many people (Mac users and otherwise) haven't noticed how long MacOS has come with text to speech. It's been included since at least MacOS 7.5, maybe even 7.0 (I was using it on my trusty ol' IIci yesterday). You could use it via SimpleText or even have it speak the text of dialog boxes. The quality of the voices could be better, but they do seem better than Festival. But, I have to admit it is pretty fun to scare people who don't know about it. One of my friends told me that his mother gets scared if she doesn't click OK or Cancel in a dialog because "those voices are going to come."
Re:comparison to Apple's technology? by inblosam · 2003-03-17 01:10 · Score: 1

I'm a new Mac user, so that was one of my questions that I failed to include in my post (how early this technology has been integrated in Mac OS). Thanks for the clarification.

Sometimes I walk away from the computer and then it talks back to me as if it misses me, and wants my attention. Talk about "artificial intelligence"...or is that personality?
Re:comparison to Apple's technology? by Anonymous Coward · 2003-03-17 01:11 · Score: 0

Every Mac is a BLACK MAN! Feeling discriminated because nobody mentioned your favorite OS? Even Amigas knew how to talk. Text to speech is not exactly new technology, but it improves - just like other technologies. Automated generation of synthetic voices is noteworthy. Bet you couldn't do that with MacOS 7.
Re:comparison to Apple's technology? by calumr · 2003-03-17 01:59 · Score: 1

Apple's TTS (in 10.2) alters the pronunciation of words based on their context not just within the sentence, but the paragraph. When it was demoed at the WWDC the difference was very clear, and they said that there weren't any other TTS systems around that could do this.
Re:comparison to Apple's technology? by Croaker · 2003-03-17 03:18 · Score: 1

"those voices are going to come."
Maybe that explains the fanatical devotion of Mac users...

"I do what the voices in my Mac tell me" sounds like a t-shirt begging to be printed up.
Re:comparison to Apple's technology? by edmo · 2003-03-17 03:27 · Score: 1

Apple has had text-to-speech and some level of voice recognition since OS 7.x, which was released in May of 1991.

For a while I had my Mac read my e-mail to me, but the voices aren't that good, so I stopped that.
The biggest use for the voices is as a response when using voice recognition; with a few handy AppleScripts, you'd be surprised what you can do.

--
Don't save your orgasms for Heaven; Heaven knows we need them here.
Re:comparison to Apple's technology? by jandrese · 2003-03-17 03:35 · Score: 1

IIRC, it wasn't standard, but you could get Macintalk for OS 6. OS 7 shipped with it as standard. The default voice is the same one Koko the Gorilla and Stephen Hawking use. IIRC the entire module was 100k in size and left ample CPU time for other projects (like animating Moose lips) on a 16 MHz 68020.

--

I read the internet for the articles.
Re:comparison to Apple's technology? by nullard · 2003-03-17 04:42 · Score: 1

The first Mac ever publicly demoed in 1984 actually spoke. It wasn't actually a 128k Mac. They added some extra RAM, but it could speak. I think 7.5 is when the speech recognition was released.

--

t'nera semordnilap
Re:comparison to Apple's technology? by silentbozo · 2003-03-17 06:12 · Score: 2, Interesting

Apple's TTS technology is pretty old... and it shows. I've been waiting for them to release voice upgrades since the original PowerPC Macs came out, but after they axed their (basic) research section, the likelihood of that happening decreased dramatically. The IBM approach is also pretty old, but the voice quality is slightly better, probably because there are more voice samples/higher quality.

No matter how good these phoneme-based techniques are, they're limited to the original timbre of the recorded speaker - you cannot synthesize a brand new voice (with on the fly inflections that were never recorded, etc.) with that TTS method. There has been research into modeled speech synthesis, where a mathematical model of lungs, windpipe, vocal cords, and mouth/tongue/lips, are manipulated in order to generate speech. Given the extreme amount of computing power today, you'd expect more people to use that type of TTS, since it's inherently more flexible. However, the biggest problem so far is nobody really has a good model for how all the various fleshy parts within the human speech apparatus interact together. Any open source people want to tackle this problem and start implementing some of these modeled synthesis speech algorithms?
Re:comparison to Apple's technology? by gerardrj · 2003-03-17 08:02 · Score: 1

The "How does this compare to Apple's TTS" is really a two part question (at least, I may have missed something).

The one you probably want answered is which sounds better. At this point the IBM voices sound better than the Apple TTS, but not by very much. Especially when you consider that Apple hasn't improved the voices in over 7 years IIRC (of course, given the option of better voices or having OS X, I'll forgo the voices). Playing several phrases from IBM's and Apple's TTS systems yields the opinion that IBM's rendered voices sound more natural but not perfect by any means. Scoring them, I'd say Apple's TTS is about a 7 and IBM's result is about an 8. The Apple versions have some strange amplitude problems at the start of some words, as though the model is adding too much emphasis. The IBM models lack the sophistication of understanding and reading punctuation (like quotes) with appropriate pauses so that the context is understood without seeing the text.

The other way you could ask the question is whose technology is better? With Apple's you can actually synthesize the entire speech process; changing the voice is a matter of changing the model (they also had sampled voices and some older synth types). I wish Apple still had their page about the technology up, but I can't locate it. IBM requires tens of thousands of samples of each phoneme to later stitch together when speech is formed.

The differences are much akin to a synthesizer versus a sampler. You can either create the sound of a piano from scratch, or you can sample it and play the tones back when required. In the former you can make the artificial piano do anything a real piano can do. In the latter, you may get a more "true" sound, but are limited in what variations you can apply.

Witness this in Apple's technology being the only TTS system I know of that comes with a Mexican accent.

I recall that Apple canned the TTS and several other research departments (or severely reduced them) during one of the dark eras though. I'm guessing that if the Apple engineers had been working on and improving the system of voice modeling over the past 7 years, IBM's voices would seem ancient and childish.

--
Article X: The powers not delegated... by the Constitution...are reserved...to the people

And don't forget Bell Labs by rpiquepa · 2003-03-17 00:28 · Score: 4, Informative

IBM is not alone in working on text-to-speech technology or in having demos where you can type a phrase and listen to it. The Bell Labs Text-to-Speech system (TTS) has its own page featuring fun demos. "You can play with our basic interface for some of our Text-to-Speech systems: American English, German, Mandarin Chinese, Spanish, French, Italian and Canadian French." This page is pretty old (it makes references to Netscape 3!!), but the demos still run fine.

hehe by koekepeer · 2003-03-17 00:28 · Score: 1

i was one minute earlier :-) but you'll prolly get the karma, because of the direct links. i am too lazy to type in a href="etcetcetc.

o wait, this will cost me karma as well! -1 offtopic :-)

I've always wondered why... by jkrise · 2003-03-17 00:28 · Score: 2, Interesting

Text to Speech and vice-versa take more memory and CPU time as time goes on. Surely, given the market potential for these apps, their quality and availability should've been much, much better.

Is MS carrying any patents on this, and acting Dog-In-The-Manger..ish? Any good low-footprint Linux-based apps for text-speech?

--
If you keep throwing chairs, one day you'll break windows....

Re:I've always wondered why... by g4dget · 2003-03-17 00:51 · Score: 2, Informative

Debian has several text-to-speech systems built-in. One of them is Festival, based on a research prototype from Edinburgh. It's a few years behind IBM and ATT, but passable. With more training data, it would get better. There are also several open source speech recognition engines of varying quality, again, mostly derived from university research (I believe Cambridge, CMU, and a few others).
Up to now, Microsoft has not really made any significant contributions to speech technology. They have bought lots of companies and hired away experts from other companies and universities. Those people are now toiling away at Microsoft research and waiting for their options to be worth something. Whether they'll make significant contributions to speech research while at Microsoft remains to be seen.
Re:I've always wondered why... by Anonymous Coward · 2003-03-17 01:22 · Score: 0

Well, Language ain't getting any simpler. Even in the non-tech research side, you can't encode a language easily. I mean, have you ever read a book by Chomsky?

As linguists discover more and more about language on the research side, speech technologists try to implement the details.... which requires lots of memory and CPU time.

MS has some research on speech recognition, and has implemented some in XP (with the default to "microphone on"... bringing to life the ghosts in the machine), but they're not a big player in the field yet.
Re:I've always wondered why... by perky · 2003-03-17 02:08 · Score: 1

They have bought lots of companies and hired away experts from other companies and universities.

This reminded me of an amusing sideline in the history of speech Reco. Cambridge University Engineering department (CUED) originally built an engine called HTK. This was then sold to a company called Entropic. Entropic were then bought by Microsoft, who have licensed HTK back to CUED, who distribute it for free. This leads to the amusing situation in which the license for a piece of Microsoft code contains the following snippet:

We strongly encourage contributions to the HTK source code base. These will in general be additional tools or library modules which will not fall under this HTK License Agreement.

--
"The new wave is not value-added; it's garbage-subtracted" - Esther Dyson, Dec 1994

incremental by g4dget · 2003-03-17 00:29 · Score: 1

These systems seem to be getting incrementally better, but it doesn't look like a big breakthrough.

Of course, the intonation is roughly that kind of compromise a PR spokesman employs who is trying to sound convincing but has no clue what he is saying. That's not surprising, given that the TTS systems really do not have any understanding of the meaning of what they are saying.

This is not a new approach. by anubi · 2003-03-17 00:29 · Score: 2, Interesting

About 30 years ago, I built a voice synthesizer for my IMSAI-8080 based on the General Instruments SC-01 Phoneme Synthesizer chip, which was available at that time from Radio Shack.

I googled for +"General Instrument" +"SC-01" and got links shown here .

I think Votrax was in bed with General Instruments, as they have another chip by the same name, that apparently does the same thing, but I do remember mine was a GI part.

It turns out all speech is nothing but sequences of utterances ( vowels and syllabic ). Just string them together and you get speech. String them together very carefully and the speech begins sounding like it came from a human instead of a machine.

I know IBM is refining this, but the concept is really old hat.

--
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]

Re:This is not a new approach. by anubi · 2003-03-17 00:34 · Score: 1

Dammit... I thought I checked that link..
The Google General Instruments SC-01 Links .
Sorry for the botched post.

--
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
Re:This is not a new approach. by wiggys · 2003-03-17 00:47 · Score: 2, Informative

"It turns out all speech is nothing but sequences of utterances ( vowels and syllabic ). Just string them together and you get speech. String them together very carefully and the speech begins sounding like it came from a human instead of a machine."
It's a whole lot more complicated than that. If you think phonetically about the way we talk we often merge words together rather than leave short discrete pauses between words. (For example, do you say "leaderovthepack" or "leader. ov. the. pack"? Also note the "ov" instead of "of")
Not only that, we pronounce words differently depending on the context in which they appear (if you think about the mechanics of speaking you'll realise our mouths change shape, therefore if you've just pronounced an "m" you may find it tricky to hit an immediate "l"). Also, we give away many clues about our state of mind as we speak - when we say "yours truly" we often sound humble, but when we say "Mine's better than yours" the "yours" in the latter sentence sounds more aggressive.
Probably the most important difference is emotion. A good narrator or speaker can draw you in to what he's saying because of the way he says it. Think about Kennedy delivering the line "We do these things not because they are easy..." - now feed the same line into a speech synthesizer. It's dead, isn't it? No impact, no emotion, no feeling. Personally, I find I can concentrate much more when a good narrator is reading an audio book than I can if a bad one reads it.
I found an audio book on Kazaa once where Stephen Hawking's synthesizer reads aloud A Brief History Of Time. I had to stop listening after 2 minutes because it no longer made sense - had Richard Dawkins been reading it then I'm sure I could have absorbed it 10 times better.

--
Sorry, but my karma just ran over your dogma.
Re:This is not a new approach. by anubi · 2003-03-17 00:52 · Score: 1

I was doing some more tracing on what I reported in the parent
Votrax made the SC-01 chip.
General Instruments made the SP0256 chip
I do not remember if the chip I had was dual marked - so I do not know if they were the same chip but under different numbers, and quite frankly I do not wanna tear into the old machine right now to verify.
And it was in the early 1980's , which was about 20 years ago. Not 30.
You can read more about it here .

--
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
Re:This is not a new approach. by foqn1bo · 2003-03-17 06:18 · Score: 1

This is not a new approach.

No, but it's a fairly sophisticated refinement of an old(ish) approach. The core ideas that make it possible have been around for a number of years, but there are a lot of constraints that make it difficult to achieve. And just for rant's sake, the qualifying use of the term 'phoneme' in the post is misleading. Phonemes are the fundamental units of vocal articulation; it would be impossible to synthesize speech without them. What sets different TTS systems apart is how they are realized.

About 30 years ago, I built a voice synthesizer for my IMSAI-8080 based on the General Instruments SC-01 Phoneme Synthesizer chip, which was available at that time from Radio Shack.

Most of the early mass produced chips were analog formant synths. You'd get the best output with hand coded formants, but automatic ones were a little more complicated. Nonetheless, popular items such as the Speak and Spell were able to generate fairly intelligible speech with individual discretely coded phonemes. A lot of effort has been more recently given to 'diphone' and more generally 'concatenative' synthesis techniques.
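For the curious, here is a minimal software sketch of the formant idea, not a model of any particular chip. It assumes a plain impulse train as the voice source and a cascade of two-pole digital resonators as the formants; the function names and the formant values for the "ah"-like vowel are illustrative assumptions, not measurements.

```python
import math


def resonator(signal, freq, bandwidth, rate):
    """Two-pole digital resonator: boosts energy around `freq` Hz."""
    c = -math.exp(-2.0 * math.pi * bandwidth / rate)
    b = 2.0 * math.exp(-math.pi * bandwidth / rate) * math.cos(2.0 * math.pi * freq / rate)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def vowel(formants, pitch=110.0, duration=0.3, rate=16000):
    """Excite a cascade of resonators with an impulse train at `pitch` Hz."""
    n = round(duration * rate)
    period = int(rate / pitch)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # scale into [-1, 1]


# Hypothetical (frequency, bandwidth) pairs in Hz for an "ah"-like vowel.
AH = [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)]
samples = vowel(AH)
```

Changing the vowel means changing three numbers, which is why hand-tuned formant tracks could sound decent while automatically generated ones were much harder to get right.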

In diphone models, speech is realized as the concatenation of (often recorded) transitions between segments. This helps with the fact that phonemes are contextually variable but gives rise to problems with smoothness and the flexibility of phonetic inventory. IBM's model is of the concatenative variety, but I don't know enough about what they're doing to say what their exact method is.
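As a rough illustration of what "transitions between segments" means in practice, this sketch turns a phoneme sequence into the diphone units a concatenative system would have to look up. The phoneme notation, the "_" silence marker, and the function names are made up for the example; real systems use their own inventories.

```python
def to_diphones(phonemes, boundary="_"):
    """List the diphone units needed for a phoneme sequence,
    padding with a silence marker at both ends."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def inventory_size(n_phonemes):
    """Upper bound on distinct diphones for n phonemes plus one
    silence symbol: every ordered pair except silence-silence."""
    return (n_phonemes + 1) ** 2 - 1


if __name__ == "__main__":
    print(to_diphones(["k", "ae", "t"]))  # ['_-k', 'k-ae', 'ae-t', 't-_']
    print(inventory_size(40))             # 1680
```

The quadratic growth of the inventory is one reason a diphone database is so much smaller than a large unit-selection corpus, and also why each stored unit has to be stretched and re-pitched to fit every context.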

What is interesting, however, is that if you look carefully on their page you might notice that the voice is trained on natural data. That's something I haven't heard much about. After trying out the female voice a few times, I have to say that the intonation model is pretty solid compared to what else is out there. Not perfect, but certainly better than the drunken voice we've all come to know and love.

It turns out all speech is nothing but sequences of utterances ( vowels and syllabic ). Just string them together and you get speech. String them together very carefully and the speech begins sounding like it came from a human instead of a machine.

I'm not going to bite your head off for that comment, but I'm in linguistics and that's pretty insulting. On that plane of logic, computer code is nothing more than sequences of on and off (one and zero). Just string them together and you get programs. String them together carefully, and the program begins looking like it does something interesting, instead of causing the machine to freeze up. And I suppose Robotics is as simple as putting parts together and wiring them with motors. Wow, I could do that! Seriously though, truly natural speech is dependent on Semantics, Syntax, prosody, and a whole host of intricately connected facets of language that people have devoted their lives to. Don't cheapen it for them.

Later.
Re:This is not a new approach. by anubi · 2003-03-17 09:41 · Score: 1

I'm not going to bite your head off for that comment, but I'm in linguistics and that's pretty insulting. On that plane of logic, computer code is nothing more than sequences of on and off (one and zero). Just string them together and you get programs. String them together carefully, and the program begins looking like it does something interesting, instead of causing the machine to freeze up. And I suppose Robotics is as simple as putting parts together and wiring them with motors. Wow, I could do that! Seriously though, truly natural speech is dependent on Semantics, Syntax, prosody, and a whole host of intricately connected facets of language that people have devoted their lives to. Don't cheapen it for them.
Sorry if I cheapened it.
Life is nothing more than sequences of (G-C)/(C-G) or (A-T)/(T-A) pairs on a sugar ladder too. But the exact placement is everything. So is code. Just ones and zeros. These are fundamental parts. But the placement and usage, whether you coded something useful or junk, is where the art is.
I tried phonetic synthesis too. It was a nightmare to make anything other than a pre-scripted phrase come out sounding anything like what I wanted. Hence I indicated if one was very careful in how he arranged it, it would come out sounding natural. Mine didn't. Never did. I knew what the machine was trying to tell me, but I would not expect the man on the street would be able to. It would have been better for me to select from a series of pre-encoded .WAV's - professionally spoken - if I were to design such a machine for public usage.
I highly respect your art when you say you are coming from the Linguistics viewpoint. I had played with this thing for years and never got it right. The problems were not in the code or hardware - the machine always did exactly what it was told to do - but due to the almost infinite amounts of human inflection we subconsciously use in our speech, the machine always sounded like, er, a machine.
My apologies for any insult. It was not intended.
My first thought was that people streaming into the field might think this whole thing was something new. There is a lot of past work on this and I wanted to point it out. Like you say, it's very complex to implement properly.

--
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
Re:This is not a new approach. by foqn1bo · 2003-03-17 10:11 · Score: 1

Apology accepted. :) Sometimes I'm a little too quick to defend my area of study against real or imagined attacks against its legitimacy.

And I think you're right. Placement is everything. Cheers.

TTS is great by jjohn · 2003-03-17 00:31 · Score: 4, Interesting

Last year, I started playing with this IBM tech. I thought it would be cool to have RSS feeds read to you in the middle of streaming music. It's kind of do-it-yourself radio. Although I don't have anything to show for that idea, I did make a few songs with it, like Make the Pie Higher, Plug Nickle and Progress.

mmm. I hope the server can take a slashdotting...

The TTS interface is C++, but it comes with a program that will compile text into AU files. I wrote the following script to change those AU files into mp3s:

#!/bin/bash
# Make a text file a spoken MP3
if [ -z "$1" ] ; then
    echo "usage: $0 <input.txt>"
    exit
fi
base=`basename $1 .txt`
echo "attempting to create $base.mp3"
/home/jjohn/src/c/viavoice/cmdlinespeak/speakfile $1
writewav.pl temp.au temp.wav
lame -h temp.wav $base.mp3
rm -f temp.au temp.wav

speakfile is a slightly hacked version of the demo program IBM ships. Unfortunately, /.'s lameness filter doesn't like C++ code. :-(

It's pretty messy C++ hacking on my part, anyway. The Perl program is based on the CPAN module Audio::SoundFile. It's also hacked from a demo script that shipped with the module.

#!/usr/bin/perl
use Audio::SoundFile;
use Audio::SoundFile::Header;

my $BUFFSIZE = 16384;
my $ifile = shift || usage();
my $ofile = shift || usage();
my $buffer;
my $header;

my $reader = new Audio::SoundFile::Reader($ifile, \$header);
$header->{format} = SF_FORMAT_WAV | SF_FORMAT_PCM;
my $writer = new Audio::SoundFile::Writer($ofile, $header);

while (my $length = $reader->bread_pdl(\$buffer, $BUFFSIZE)) {
    $writer->bwrite_pdl($buffer);
}

$reader->close;
$writer->close;
exit(0);

sub usage {
    print <<EOT;
usage: $0 <infile> <outfile>
EOT
    exit(1);
}

mmm. There was indenting in code at one point. Sigh...

ack. no good by lingqi · 2003-03-17 00:31 · Score: 2, Funny

Unless the female voice can render the below lines with feelings, I don't think it's a mature technology.

give me! give me! oh! I am coming!! OHHHH!

Actually I did try it. the result (of the above line) was not spectacular. I am impressed with the quality in general, though. Tried "Sticking feathers up your butt does not make you a chicken," but that needs to be said with feelings as well, I suppose.

Oh yeah, this kind of technology is excellent for a computer to read out the sites to you, if, say, your eyes are tired. It should work wonders for slashdot, even.

--

My life in the land of the rising sun.

Re:ack. no good by Wylfing · 2003-03-17 01:50 · Score: 1

Oh yeah, this kind of technology is excellent for a computer to read out the sites to you
I think you discovered the killer application for this technology: the voice reads erotic stories to you while you surf pr0n.

--
Our intelligent designer has never created an animal that we couldn't improve by strapping a bomb to it.

Re:As a concerned American patriot, by shepd · 2003-03-17 00:32 · Score: 1

Yes, they should be called Freedomgnomes. Stupid latins editing our language for PH sounds.

And where's freedomdot? It's all wrong, I tells ya, it's all freedomin' wrong!

Freedom. The new Marklar.

--
If you could be told what you can see or read, then it follows that you could be told what to say or think - BoC

Even Better by Anonymous Coward · 2003-03-17 00:36 · Score: 0

Phenomes....phonemes...why not pheromones?
With a talking PDA for my chatup lines the chicks will find me IRRESISTIBLE!!!

This is cool and all, but by selderrr · 2003-03-17 00:37 · Score: 1

what's the status of the infinitely more amazing speech-to-text ? Being from belgium, and thus being scammed by Lernout&Hauspie who promised true S2T to be reality by 2000, I'm kinda sceptical towards it by now.

Will it ever be possible ? As far as I can tell, S2T is quite a bit more difficult than english->french translation for instance, and that still has a long way to go...

--
When will I end this grieving ? When will my future begin ?

Listen to "US female 2" by infolib · 2003-03-17 00:40 · Score: 1, Funny

uttering the sequence:
"Aargh! I've been slashdotted!"

Bandwidth sponsored by danish research funding...

--
Any sufficiently advanced libertarian utopia is indistinguishable from government.

Re:As a concerned Slashdot reader.. by jkrise · 2003-03-17 00:46 · Score: 1

I find your posts, though insightful, tend to divert attention from the topic at the top of the thread. If you start a new thread, I promise to read all your posts. Just remember to retain the same title though. Thanks.

--
If you keep throwing chairs, one day you'll break windows....

Some text I entered by Anonymous Coward · 2003-03-17 00:51 · Score: 0, Troll

"Natalie Portman naked and petrified, with hot grits down her sweet, sweet panties. Hrmm.... don't wake me from this dream. Everlast. Diablo, Deus est."

"Take that slashdot fucker. Swallow hole. And bend over. It is your turn to be the pillow biter. Thankyou and goodnight. Say Hi to your mum for me."

"Fuck me. Fuck you. Lets fuck like rats."

And the way the chick says "Fuck" and "Fucker" almost turns me on. Very sexy voice, though it has a Stephen Hawking twang to it.

Maybe in windows 2010 I will have a chick with such a nice voice that my girlfriend/wife will be jealous of all the hours I spend with my digital companion.

State of the art in TTS by Sam+Lowry · 2003-03-17 00:52 · Score: 4, Informative

There are basically two TTS technologies on the market:

diphone-based synthesis, where the database contains one diphone (end of first sound + start of next sound) for each possible sound combination. This approach is used in Festival. Diphone-based synthesis will hardly sound better than in Festival because diphones have to be modified artificially to fit every variation of pitch, duration and any other parameter that is needed to produce a given phrase.

corpus-based synthesis, which takes a different approach: a large database of several hours of speech is recorded and manually labelled to mark the start and end of each sound. Such a database is used to extract the best and the longest sequence of diphones during production. This approach gives natural-sounding results for short sentences where intonation is not so important. Given that the cost of developing a database for corpus synthesis may be orders of magnitude higher than for diphone synthesis, there are very few companies that make them. Two companies offer a demo on the internet: ATT and Scansoft (former L&H).
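To make the "best and longest sequence" idea concrete, here is a toy sketch that greedily covers a target unit sequence with the longest contiguous chunks found in a labelled corpus. It is only an assumption about the general shape of the search: the corpus, unit labels, and function names are invented for the example, and real systems also weigh how well neighbouring chunks join rather than looking at length alone.

```python
def longest_match(target, start, corpus):
    """Return (length, recording_index, offset) of the longest run of units
    beginning at target[start] that occurs contiguously in some recording."""
    best = (0, None, None)
    for u, units in enumerate(corpus):
        for off in range(len(units)):
            k = 0
            while (start + k < len(target) and off + k < len(units)
                   and units[off + k] == target[start + k]):
                k += 1
            if k > best[0]:
                best = (k, u, off)
    return best


def select_units(target, corpus):
    """Greedily cover `target` with the longest recorded chunks available."""
    plan, i = [], 0
    while i < len(target):
        length, u, off = longest_match(target, i, corpus)
        if length == 0:
            raise ValueError(f"unit {target[i]!r} is not in the corpus")
        plan.append((u, off, length))
        i += length
    return plan


# Hypothetical labelled recordings, one list of unit labels per utterance.
corpus = [
    ["h", "e", "l", "o"],
    ["w", "er", "l", "d"],
    ["l", "o", "w"],
]
plan = select_units(["h", "e", "l", "o", "w", "er", "l", "d"], corpus)
print(plan)  # [(0, 0, 4), (1, 0, 4)]
```

Each tuple says which recording to cut from, where to start, and how many units to take. Fewer, longer chunks mean fewer joins, which is where the natural sound comes from.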

Re:State of the art in TTS by Mandrake · 2003-03-17 10:38 · Score: 1

actually, there are more types than this. For example, formant synthesis, and HMM synthesis.
Also, festival supports unit selection synthesis (which is what you're calling corpus synthesis - the corpus is just the body of text to be recorded, which is used in diphone synthesis also) as well as diphone synthesis.

--
Geoff "Mandrake" Harrison
Some Random UI Hacker

And here's the Bell Labs version: by infolib · 2003-03-17 00:57 · Score: 1

"Aargh! I've been slashdotted!"

This one is much better at saying "slashdotted". Neither of them do the "Aargh!" very well. Especially the IBM one ought to be convincing, given current circumstances ;-)

Generate more samples for yourself at http://www.naturalvoices.att.com/demos/

--
Any sufficiently advanced libertarian utopia is indistinguishable from government.

Re:And here's the Bell Labs version: by gurensan · 2003-03-17 17:16 · Score: 1

This is way better than IBMs.

--
You are all fartheads.

Better than the TI speech synth chips? by farrellj · 2003-03-17 00:57 · Score: 1

In the 80's, TI had a number of speech synth chips that were of amazing quality. The one used with the add-in modules for the TI-99/4A was amazing. I still have not heard a better quality speech synth since then. I wonder what happened to that TI technology.

ttyl
Farrell

--
CAN-CON 2019 - Ottawa's only book oriented Science Fiction Convention! October 18-20, Sheraton Hotel, Ottawa, Canada h

Old news by payndz · 2003-03-17 00:58 · Score: 3, Interesting

Text-to-speech? Come on, this has been around for donkey's years - maybe the computer voice doesn't sound like Majel Barrett yet, but it's hardly new and amazing stuff.

I want to know what's going on with speech-to-text, and will I be able to dictate rather than type a novel any time soon? (Preferably with some form of intelligent speech recognition, so it doesn't end up with passages like "She, ah... walked, no strode into the room to find, uh, er, dammit, did I say Rob left the tape on the counter or the desk? Oh, bloody hell. Hello? No, I'm not interested in double glazing. How did you get this number anyway? Bye. Where was I? Oh, crap! Computer, pause-")

--
You must think in Russian.

Bonehead: it's P-H-O-N-E-M-E by evodas · 2003-03-17 01:01 · Score: 1

I guess this is what comes of dopes who don't know their own language...

Unbelievable! by Tuffnut · 2003-03-17 01:02 · Score: 1

I think this has been the first time I've been able to experience some sort of off-site media before it has been slashdotted.

That just makes my day! :)

Finally. by termos · 2003-03-17 01:03 · Score: 1

I have always wanted a sexy robot voice which says Kernel Panic!

--
Note to self: get smarter troll to guard door.

take a note from musicians by Anonymous Coward · 2003-03-17 01:04 · Score: 1, Interesting

hey maybe IT industry should take a note from us musicians for a change (excuse the pun)...

With sampling technology, especially multisampling where for example each note can have different sounds associated to it depending on the accent, you could achieve some really stunning results in the text to speech market.

People like EastWest have created such systems for virtual choirs...check out Voices Of The Apocalypse as this is some pretty basic but revolutionary way of using samplers...

Text to speech by cbrew · 2003-03-17 01:14 · Score: 1

The best text-to-speech that I know about is from Rhetorical Systems at www.rhetorical.com. The system still doesn't really understand what it is trying to say, but the quality of the speech itself seems good to me. Their technology is proprietary, so one can't be quite sure how they are doing this, but it looks to be large database unit selection (like some of the Festival voices) done very well. (Disclaimer: before the company existed, I used to work with some of the people at Rhetorical, so I might be biased, but listen for yourselves).

Re:Text to speech by MrBandersnatch · 2003-03-17 01:43 · Score: 1

Aye, I did some research into this area recently and Rhetorical were FAR superior to any thing else I could find. I really hope that they come out with some form of "packaged product" at some point since I'd LOVE to use their technology.
Re:Text to speech by Anonymous Coward · 2003-03-17 06:15 · Score: 0

http://www.rhetorical.com/demo

Phenome ... by Johnathon+Walls · 2003-03-17 01:18 · Score: 1

You keep using that word.

I do not think it means what you think it means.

Hollywood applications for speech synthesis? by Sheriff+Fatman · 2003-03-17 01:20 · Score: 2, Interesting

Computer graphics have now advanced to the point where, given enough time and processing power, you can simulate almost anything with near-photographic realism. ILM, Digital Domain, Weta, et al can create completely convincing digital characters, but (leaving aside the issue of how a digital performance is based on the 'actor' - e.g. Andy Serkis' 'performance' in LOTR:TTT, or Dex in SW:AOTC) they're still entirely dependent on human voice actors to complete the performance.

OK, the point of this article is on-demand realtime speech synthesis - roughly analogous to the 3D engines used in games. It has to compromise quality and detail for the sake of speed and responsiveness. Could there be a market for 'voice rendering' - a system which can take a script (possibly with some additional mark-up to indicate emotions, emphasis, timing, etc.) and generate an audio version which approaches a reading of the same script by a competent voice actor?

As well as the obvious 'virtual thespian' Hollywood angle, I'm thinking about stuff like low-budget audio drama - people who have the time and the technology to tweak a voice script, but can't afford professional actors to do it the old-fashioned way. There could be applications for creating audio books for the visually impaired, or to make life easier for students working through Shakespeare and Chaucer - I'm still amazed every time I hear Shakespeare read out loud how much meaning can be conveyed by the nuances of the human voice, compared to dry printed prose.

Is anyone actively working on anything like this? If not, why not? Is it really that hard to fool the human ear? Or is it just a case that it's still cheaper and easier just to employ people to read things into a mic?

--

--
-- Open Source: It's mad, but you don't have to work here to help.

In the 'has been doing that for a while' series : by dago · 2003-03-17 01:23 · Score: 1

My (former) university : mbrola

It is even free (as in beer) for personal use.

--
#include "coucou.h"

This is AT&T's Watson from 1995! by jdoeii · 2003-03-17 01:38 · Score: 1, Insightful

Apparently IBM bought what was formerly AT&T's, later Lucent's, Watson project. The web page is even called webtts.watson.ibm.com. Obviously the quality of TTS has not improved much since 1996.

Can someone please tell me why this 8 y.o. project is considered news?

Re:This is AT&T's Watson from 1995! by perky · 2003-03-17 02:12 · Score: 1

The web page is even called webtts.watson.ibm.com. Obviously the quality of TTS has not improved much since 1996.

Assuming this isn't a troll, then you might notice that IBM operates the massive Thomas J Watson research lab. Perhaps the URL has something to do with that? Second, you might want to have a listen if you think TTS hasn't moved in 8 years.

--
"The new wave is not value-added; it's garbage-subtracted" - Esther Dyson, Dec 1994
Re:This is AT&T's Watson from 1995! by bmetz · 2003-03-17 03:11 · Score: 1

Wrong.

It's the Watson Research Lab, as in T. J. Watson, as in the CEO who started the company over 80 years ago.

--
What did you eat today? http://www.atetoday.com/

Conversation at IBM by mivok · 2003-03-17 01:42 · Score: 1, Funny

Tech guy 1> Hey, look at my cool new web based speech thingy that lets 1000's of users web pages talk to them!
Tech guy 2> Bah.. bet it wouldnt support 2 people
Tech guy 1> It would!
Tech guy 2> Prove it... (loud musical sound of doom follows) post it to slashdot
Tech guy 1> Ulp... (reluctantly taps away on the keyboard)

5 minutes later, strained sounds can faintly be heard from the smoking pile of rubble that used to be the server room, as the crackles of the fried piece of circuit board that used to be the shiny new voice system begin to wane, still trying to come up with 500,000 convincing renditions of "goatsec"

I'm not actually convinced phonemes exist, y'know by Bertie · 2003-03-17 02:01 · Score: 5, Insightful

I have a master's in linguistics, specialising in speech processing and the like, and I don't really believe in phonemes.

In the beginning, there was the word. And the word was spoken. A long, long time later came writing. Most early forms of writing seem to have been pictographic. Eventually that started to be a bit too complicated for most, and somewhere along the line we switched to trying to represent the sounds of the words that we used. These writing systems had to be sort of retrofitted onto the sounds we used, and so they were never going to amount to a perfect transcription of the sounds used. Huge alphabets quickly become unwieldy, and while there is a great deal of variation between languages in terms of how they deal with these issues, in most cases sounds end up being shoehorned into one category or another - "oh, that's sort of a /t/, I'll write it down like that". You know yourselves how often words in English bear no relation to their spoken forms.

Anyway, a long time after that, people got interested in phonetics. Conditioned as we were into thinking of words as collections of letters, along came the concept of the phoneme, which, as somebody said above, is the smallest individual unit of speech which can be distinguished from other such units. Phoneticists set about mapping all the sounds of all the languages in the world to phonemes, and we got the international phonetic alphabet.

Later still, we managed to invent machines which allowed us to analyse sound spectra. Run a spoken utterance through one of these and what you'll see most certainly isn't a succession of distinct sounds. Truth is, our brain does so much work on the raw sound that our perception of the sounds is entirely different from the reality. "Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.

The point of all this is that when we started speaking yonks ago, we were making use of the vocal tract nature (God, natural selection, take yer pick, I don't want to get into an argument about it) gave us. We weren't thinking of phonemes and stuff, we were just making noises subject to the limitations of the equipment we had. The notion that this is a nice, ordered system of sounds is an artificial one imposed by us in an attempt to make sense of it all, and it amounts to an expanded version of an oversimplified system (the alphabet). Now, we all know what happens with lossy compression...

Simply drawing lines down the spectrogram in the name of making it easier to work with just throws away subtlety, so that when you use a phoneme-based TTS system you get a series of disjointed sounds with perhaps some token effort at coarticulation (i.e. the phenomenon of overlapping sounds described above), and it's always going to sound awful. The consequences for speech recognition are much worse (sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion).

In short, what you have here is an engineer's approach to art. It's like taking a painting by your favourite artist and turning it into a 256-colour bitmap, then analysing the result and trying to make new paintings in the same style.

What about physical modelling? by PenguiN42 · 2003-03-17 02:03 · Score: 1

So TTS with synthesized phonemes sounds bad, and they try to use recorded phonemes instead. Those still sound bad when the computer has to produce a phoneme combination that wasn't recorded.

So what's the next step? Is there anyone working on physical modelling of the acoustic properties of the mouth, tongue, throat, larynx, and lungs as they glide between different phonemes to produce speech sounds? This seems like the only way you're gonna get something closer to natural than this recorded-phoneme technology ... but with a lot of processing power.

Of course there's the 2nd problem of inflecting things properly, but that seems to require text recognition technology beyond what we currently have.

--
The following sentence is true. The preceding sentence was false.

Re:What about physical modelling? by Anonymous Coward · 2003-03-17 04:22 · Score: 0

Sure, it's called formant synthesis, and has been around for ages. SoftVoice (http://www.text2speech.com/) has been developing such synths since the Commodore 64 days (S.A.M., the Software Automatic Mouth) and before.
Re:What about physical modelling? by jishcat · 2003-03-17 05:49 · Score: 0

That's not what the previous poster was talking about, I think. I remember seeing, on Beyond 2000 several years ago, a programmer who claimed he was doing this. They would model the shape of the nasal cavity and the mouth, etc. Then they would simulate the sound waves bouncing around in the pseudo-head to generate the voice. This was kind of neat, because they could generate voices of people who were dead, but had never been recorded. They claimed that one sound they played was what Abraham Lincoln's voice would have sounded like. However, it couldn't have been all that accurate, since I'm sure there have never been any X-rays or other scans of Lincoln's skull (let alone detailed measurements of his actual skull, and I don't think they're going to dig him up for this). Furthermore, the personality, regional dialect, etc. of the human voice is extremely dynamic, as evidenced by our ability to imitate people, such as celebrities.
Re:What about physical modelling? by jhhl · 2003-03-18 01:34 · Score: 1

The physical model you may have seen was probably Perry Cook's Sing program for the NeXT. It's a vocal tract modeller using waveguide synthesis. It's not great, but impressive in that it's completely modelled.
Among the interesting innovations in this program are that the nasal cavities are modelled, and that not only is a glottal impulse injected into the model, but noise is also passed into the chain of filters at various points.
Cook always claimed he'd release an API but never did. Still, the principles are pretty easy to understand and it could be done again, I'm sure, and connected to a TTS system like Festival. It is more than ten years later, so there's probably a lot that could be done. Mr. Cook has spent a lot of the intervening time modeling complicated percussion instruments.

A lot of people have mentioned SAM, the Amiga "say" command and other Klatt-based vocal synthesizers. This algorithm is very useful, made by cascading formant filters over a "glottal" impulse (switching with noise as needed), but I don't know if there were Klatt synthesizers for languages other than English... making up the tables might not be too hard. Dennis Klatt died many years ago, but his code lives on.
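
For anyone who wants to hear what "cascading formant filters over a glottal impulse" amounts to, here is a minimal sketch in Python. It is not Klatt's code: the resonator coefficients follow the usual second-order digital resonator form, the excitation is a bare impulse train with no noise source, and the formant table (AH_FORMANTS) holds rough illustrative values for an "ah"-like vowel rather than measured data.

```python
import math

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator modelling one formant."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain at 0 Hz to 1
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(pitch_hz, duration_s, sample_rate):
    """Crude stand-in for the glottal source: one unit spike per pitch period."""
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / pitch_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def synthesize_vowel(formants, pitch_hz=110.0, duration_s=0.3, sample_rate=16000):
    """Pass the source through each formant resonator in series (a cascade)."""
    signal = impulse_train(pitch_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal

# Illustrative (frequency, bandwidth) pairs in Hz; assumed values, not a real table.
AH_FORMANTS = [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)]
samples = synthesize_vowel(AH_FORMANTS)
```

Swapping the table for another set of formant values changes the vowel, which is why the per-language tables matter more than the filter code itself.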

--
-- Real Stupidity is the Artificial Intelligence of the 21st century

The True Stephen Hawking is Best by kpayson · 2003-03-17 02:08 · Score: 0

If they can't get the voice truly smooth, they should just leave it as the Hawking voice.

My favorite test phrase is "If I were to bitch slap you while falling into a black hole, the bitch slap would last an eternity."

Ken

Check a university library by thogard · 2003-03-17 02:13 · Score: 1

There is a book called MITalk (MIT Talk) that describes an effort to do this years ago using some major hardware. They were using a VAX (780?) just for one part of the processing and a few other big computers to do the rest. This led to DECtalk (aka the voice of Stephen Hawking).

It seems to me that with modern DSPs cranking along with many more calculations per second than a VAX could ever hope for, and one of the best theoretical mathematicians ever having a reliance on the technology, things should have improved substantially since the MITalk book came out, but I have yet to hear any real-world examples.

Fuck speech,..LET IT SING!!! by Ribert · 2003-03-17 02:14 · Score: 1

I tried some lyrics like "A winter's day In a deep and dark December; I am alone, Gazing from my window to the streets below, On a freshly fallen silent shroud of snow." and it actually started to sing!! (The lyric is from Paul Simon's "I Am a Rock".)

Re:Fuck speech,..LET IT SING!!! by Tablizer · 2003-03-17 07:24 · Score: 1

Now I can finally generate that "Leonard Nimoy Sings Barney's Favorites" album that I always wanted.

--
Table-ized A.I.

Counterfeit sound bites by p3d0 · 2003-03-17 02:18 · Score: 1

This raises the bar on fake sound bites. Imagine recording thousands of phrases spoken by Mr. Burns and piecing them together with this technique to make him say "Hello, Smithers. You're quite good at turning me on".

--
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....

Smithers would be disappointed by CastrTroy · 2003-03-17 02:30 · Score: 1

Upon trying the classic "Hello smithers, you're very good at turning me on" quote with both the male voices, I was very disappointed. This thing doesn't really sound any better than that crap piece of software that came with my 8-bit Sound Blaster back in the day.

I noticed the female voices sounded a lot better than the male voices. Nice to see those boys over at IBM got their priorities in order.

I wonder if this technology could be advanced far enough to imitate an actual person: feed in sound files of the person talking, analyse them, and make it sound like that specific person. That would be awesome.

And for once my quote from A Brief History of Time actually applies to the article I'm posting under.

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.

Re:As a concerned American patriot, by Greg+Hewgill · 2003-03-17 02:30 · Score: 1

Yes, they should be called Freedomgnomes.

Don't you mean Phreedomgnomes?

this *does* sound better than previous attempts by Anonymous Coward · 2003-03-17 02:50 · Score: 1, Informative

including AT&T. This demo sounds much more natural over a broader range of words to my ear. Not much, but some better.

cepstral by Anonymous Coward · 2003-03-17 02:57 · Score: 0

these guys have a synth that runs on a handheld and does real time dsp. check out the demos - very cool.

we've been doing this for a while by Mandrake · 2003-03-17 03:01 · Score: 2, Informative

This sort of technology has been under development for a long time, and we have demos up on our website, also: Cepstral Online Speech Synthesis Demos. In fact, we have Higher Quality Limited Domain Demos available as well.

--
Geoff "Mandrake" Harrison
Some Random UI Hacker

still sounds like S.A.M. from the 1970s. by Anonymous Coward · 2003-03-17 03:17 · Score: 1, Informative

It's funny how most synthesized voices sound like the Software Automatic Mouth (S.A.M.) software that was available for Atari 800 computers long ago.

Re:I'm not actually convinced phonemes exist, y'kn by gawi · 2003-03-17 03:25 · Score: 1

The consequences for speech recognition are much worse (sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion).

I doubt that anything can be 100% successful given that human perception is not totally error-free. Speech recognition HMM models are based on contextual realizations of phonemes (allophonic model). This takes into account coarticulation. Same technique applies for generation: in-context phonemes are used.

The poor quality of current TTS engines comes from bad concatenation of segments (e.g. distortion at the junction). Other problems come from high-level analysis (semantics) and from inappropriate prosody (emphasis, rhythm, intonation).
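
To make the junction problem concrete, here is a small sketch of the simplest possible smoothing: a linear crossfade over the overlap between two recorded segments. The function name and the weighting scheme are illustrative assumptions; real concatenative engines use more careful, pitch-synchronous joins, and this only shows why a hard cut clicks and a blended one does not.

```python
def crossfade_join(left, right, overlap):
    """Join two sample lists, linearly crossfading over `overlap` samples."""
    if overlap <= 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be between 1 and the shorter segment length")
    head = left[:-overlap]   # part of the left segment that is kept untouched
    tail = right[overlap:]   # part of the right segment that is kept untouched
    blended = []
    for i in range(overlap):
        # Weight of the right segment rises steadily across the overlap.
        w = (i + 1) / (overlap + 1)
        blended.append((1.0 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + blended + tail

# A segment ending at level 1.0 joined to one starting at level 0.0.
joined = crossfade_join([1.0] * 4, [0.0] * 4, 3)
```

The join removes the step at the boundary, but it cannot fix a mismatch in pitch or formant position between the two segments, which is the distortion being described.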

--
All humans are mortal. Socrates is a human. Socrates is dead.

A speak center by Anonymous Coward · 2003-03-17 03:28 · Score: 1, Funny

In addition to a call center, IBM has this Bangalore/India-based speak-center with thousands of males and females speaking the text you entered into a microphone...

Re:I'm not actually convinced phonemes exist, y'kn by Bertie · 2003-03-17 03:36 · Score: 1

I doubt that anything can be 100% successful given that human perception is not totally error-free.

True enough. Maybe I should have said "never be as successful as humans".

Speech recognition HMM models are based on contextual realizations of phonemes (allophonic model). This takes into account coarticulation. Same technique applies for generation: in-context phonemes are used.

Yes, but they're not contextual enough. Triphones, which to my knowledge are the norm, just aren't a wide enough catchment area. Furthermore, I would question the logic of segmenting at all, for the reasons described previously.
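
For readers who haven't met triphones: each phoneme is modelled together with its immediate left and right neighbour, and nothing further out. A minimal sketch of that expansion, assuming a "left-centre+right" label style and a hypothetical "sil" boundary symbol:

```python
def to_triphones(phones, boundary="sil"):
    """Label each phone with its one-phone left and right context."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

labels = to_triphones(["k", "ae", "t"])
# labels is ["sil-k+ae", "k-ae+t", "ae-t+sil"]
```

The window is one phone each way, so an effect that reaches four or six segments from its source falls outside what any single unit can see.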

Is it just me? by evronm · 2003-03-17 03:55 · Score: 2, Insightful

Or does anyone else not understand what the big deal about text to speech is?

I had a program for my C64 circa 1983 that did pretty good text to speech. Granted the voice was pretty robotic, but I'd think that 20 years later, this should be a cinch.

Speech to text, on the other hand...

--
Follow the adventures of the new wandering jews

Re:Is it just me? by Dolohov · 2003-03-20 00:52 · Score: 1

I had that program; we used to make it try to call the dog.

I think, though, that in retrospect it was not quite so good as we remember it; getting something like that to sound more natural is no small thing, nor is it to make it a smaller, faster program that makes fewer pronunciation errors. Incremental advancements are the name of the game for most technologies -- what was Apollo, after all, except a series of incremental advancements over Sputnik?

Damn you! Motherfucker! by Zone-MR · 2003-03-17 04:19 · Score: 1

OMFG. I was trying to do that. I exceeded my 34 word limit, so I pressed the back button and corrected it.

Little did I realise the voice is reset back to default when I do so. I just got offered a blowjob from Charles :-/

Good problem for competitive algorithms? by MojoRilla · 2003-03-17 05:12 · Score: 1

Seems to me that text to speech would be a good problem for Darwinian competitive algorithms. You can take a book on tape, feed the text as input, and let different algorithms compete by judging their output against the human speaker.

Many iterations later, you probably can get a computer sounding just like a person. And since it has had a whole book to practice over, it should be pretty general.

The winner is IBM by Anonymous Coward · 2003-03-17 05:19 · Score: 0

Compare the IBM to ATT demo with the following phrase:

"Has this site been /.'ed yet?"

IBM pronounces it clearly, while ATT fails. IBM obviously expected the slashdotting it so richly deserves.

Re:I'm not actually convinced phonemes exist, y'kn by hawkfish · 2003-03-17 05:46 · Score: 1

I did a fair bit of research in speech synthesis a few years ago and I have to say this sounds good to me.

The system I was building was a diphone/triphone hybrid. We had a large inventory of basic segments (1500 or so) in LPC encodings, but we kept having to expand it to get different pitch contours.

One thing we did find that helped (probably the best idea of my life) was to try to capture features of the glottal excitation instead of using a simple spike excitation. Keeping a library of glottal pulses gave the voice a lot of naturalness (our goal was to generate arbitrary utterances for a particular speaker's voice).
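
A rough sketch of the difference being described, under loud assumptions: the predictor coefficients (COEFFS) and the pulse shape (STORED_PULSE) below are made up for illustration, where a real system would take the coefficients from LPC analysis and the pulses from a measured library. The only thing that changes between the two outputs is the excitation fed to the same all-pole filter.

```python
def pulse_train(pulse, period, length):
    """Overlap-add one stored pulse shape at the start of every pitch period."""
    out = [0.0] * length
    for start in range(0, length, period):
        for k, value in enumerate(pulse):
            if start + k < length:
                out[start + k] += value
    return out

def all_pole_filter(excitation, coeffs):
    """LPC-style synthesis: y[n] = e[n] + sum over k of coeffs[k] * y[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        y = e
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                y += a * out[n - 1 - k]
        out.append(y)
    return out

SPIKE = [1.0]                                        # classic single-sample excitation
STORED_PULSE = [0.1, 0.4, 0.8, 1.0, 0.6, -0.5, -0.2]  # hypothetical glottal pulse shape
COEFFS = [1.3, -0.8]                                 # made-up, stable predictor coefficients

buzzy = all_pole_filter(pulse_train(SPIKE, 80, 400), COEFFS)
shaped = all_pole_filter(pulse_train(STORED_PULSE, 80, 400), COEFFS)
```

Keeping the filter fixed and swapping the pulse is what makes a pulse library cheap to add to an existing LPC inventory.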

So I have to say I totally agree with you. I think that natural voices will have to follow this sort of path - modelling the actual human vocal tract, not just a convenient mathematical model.

--
You will not drink with us, but you would taste our steel? - Walter Matthau, The Pirates

Creating artificial phonemes by Anonymous Coward · 2003-03-17 05:55 · Score: 0

Does anyone know if people think it might be possible to create artificial phonemes digitally rather than recording them? Do you know what problems this presents?

Re:Creating artificial phonemes by mhowell · 2003-03-18 04:20 · Score: 0

This is kind of how the formant/parametric TTS engines work; they use a simulation of the vocal tract to generate sounds on the fly. This does have some advantages, but for whatever reason (I don't know enough to say), the concatenative TTS engines sound much more "natural" right now. And there are many that are commercially available: Rhetorical Systems, Nuance, SpeechWorks (which was developed from AT&T IIRC). If you are interested in an open source TTS engine, check out Festival from the University of Edinburgh, or Flite (Festival-lite) from Carnegie-Mellon. Flite is a Festival relative that is optimized for PDAs and multi-channel servers.
Someday I think there might be a resurgence in the parametric TTS engines, but I guess the techniques need to advance. I think of them as being more "pure".

Slashdot Demographics by SomeGuyFromCA · 2003-03-17 06:03 · Score: 1

Prediction: They'll look at their server logs and find:

a) requests for female voices saying dirty things and
b) requests for male voices saying: "How are you gentlemen!! All your base are belong to us!! You have no chance to survive make your time!!"
c) "I got an error, you insensitive clod!"

--
if the answer isn't violence, neither is your silence / freedom of expression doesn't make it alright

Not very good TTS by DulcetTone · 2003-03-17 06:08 · Score: 2, Funny

The quality of AT&T's TTS or SpeechWorks' TTS is far more advanced. I had some fun with Speechworks' one and posted samples:

What I wish On-Star would actually say

A slightly-edited announcement calling our Bulldog to attend to a special matter

tone

--
tone

Don't forget the talking cat: by pHDNgell · 2003-03-17 06:14 · Score: 1

http://gnufoo.org/macosx/

cat -a is even cooler than snoop -a. :)

--
-- The world is watching America, and America is watching TV.

Okay, so how far have we come... by julesh · 2003-03-17 06:24 · Score: 1

...from the old BBC Model B with "*SAY Whatever you like", which got it right about 95% of the time?

Not all that far, considering it's been nearly 20 years, to be honest.

Rhetorical sounds a LOT better than this by Anonymous Coward · 2003-03-17 06:25 · Score: 1, Informative

I've done a bit of research on text to speech systems, and the absolutely BEST most natural text to speech I've come across is Rhetorical..

Demo here

It's got a good range of voices. My answering machine is using one of them...

Re:I'm not actually convinced phonemes exist, y'kn by HoldmyCauls · 2003-03-17 06:43 · Score: 2, Interesting

I'm taking a Linguistics course this semester, and I've always found things like this interesting. You make several good points, but I feel that, like most doubters, you oversimplify trial as inevitable failure. You have to be careful when saying things like "Linux won't catch on," "Artificial Intelligence won't happen," or "phonemes are too hard to separate."

In fact, much of what you've said indicates the *eventual* possibility of a very conversable TTS/STT translating algorithm. (Whether or not these will be the same algorithm in reverse will be for the future to decide).

"Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.

Right there, you've laid out a *very complicated* but by no means difficult way of looking at phonemes individually. In my class, we have some 20 people who all have great difficulty in just figuring out allomorphs, which some Slashdotters might not know are phonemes either in complementary distribution, such as in the case of plural nouns: /s/ sound at the end of /kæts/ {cats}, /z/ sound at the end of /kIdz/ {kids}, /əz/ sound at the end of /mætʃəz/ {matches}

or in free variation, such as the Lisa/Liza name which mean the same thing, and are derived from the same root, but which have split due to geographical/cultural/other reasons.
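
The plural case is regular enough to write down as a rule, which is the kind of pattern being pointed at. A minimal sketch, assuming the standard textbook description of English plurals and using two hand-written phoneme sets that are illustrative rather than complete:

```python
# Final sounds that force an extra vowel before the plural ending.
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
# Voiceless non-sibilant consonants, after which the ending stays voiceless.
VOICELESS = {"p", "t", "k", "f", "θ"}

def plural_allomorph(final_phoneme):
    """Pick the plural ending from the last phoneme of the noun stem."""
    if final_phoneme in SIBILANTS:
        return "əz"   # matches, buses
    if final_phoneme in VOICELESS:
        return "s"    # cats, cups
    return "z"        # kids, dogs, and stems ending in a vowel

endings = [plural_allomorph(p) for p in ("t", "d", "tʃ")]
# endings is ["s", "z", "əz"]
```

The choice depends only on the neighbouring sound, so the context lookup stays tractable.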

Now, where the average English major might not always recognize similarities and patterns, the average Slashdotter has trained him/herself to do so, and some are likely saying to themselves, "where else does this happen?" and "where is this not true?", which are useful, scientific questions.

You yourself present the answer to the problem you raise: we have to look at the surrounding phonemes in order to figure out how to make one particular sound fit the word it's in. This is *damn hard*, but not impossible. It's like the fact that stress affects a phoneme in certain languages: we just need to adapt to thinking about language in different terms than simply speaking it and spelling vague representations of it (by first realizing how vague those representations are, which is why the phoneme set is taught first in Linguistics classes).

Personally, I think the problem lies in the fact that we all want TTS/STT and we want it *now!*, and why can't the computer just say it or hear it the way we do, and all the other questions that come from a lack of understanding, both of how the machine represents everything and the garbled way in which our language is represented. Phonemes are the obvious solution: the software should only have to do STP/PTS conversion, and our language should conform to that, really, since it's the creative dialectical shifts that create a problem, but we'll end up devising a creative solution for that, too.

Now, we all know what happens with lossy compression...

Yes, we get a slightly inaccurate but highly useful jpeg of the Andes, or someone's new desktop widget set, or a very listenable 192kbps mp3 of "Hurt" covered by Johnny Cash (even sadder than the original, IMHO).

And TTS/STT will have its flaws as well, but a digital (though wide) set of sound symbols like phonemes will help us to break things down somewhat until we figure out that something *smaller* about those sounds is very functional, *and* how to represent *that* level of speech, just as we represented matter by some informal type, then by molecules, then atoms, and now we know quite a bit about how the electron, proton and neutron work, and are working on a smaller level.

To say you "don't really believe in phonemes" oversi

--
Emacs: for people who just never know when to :q!

40 phoneme's is too simplistic for synth usage by Tablizer · 2003-03-17 07:13 · Score: 1

This approach does however beg the question whether "phoneme based" is still its most important characteristic. There are only 40 phonemes, not 10000 (the number of samples used by the IBM "voices").

It appears the "40" is an over-simplification. If they vary slightly based on context (surrounding phonemes), then there are technically many more than 40.

I guess you could say that there are 40 that are easily identifiable, while the context-sensitive variations are too subtle for speech researchers to isolate using their own ear. IOW, the subtleties are only "exposed" when you try to use the 40-rule alone to synthesize speech.
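
Some back-of-the-envelope arithmetic on how fast "40" grows once context counts. This is only an upper bound under the assumption that every combination of neighbours can occur, which no real language allows:

```python
PHONEMES = 40

# One neighbour on one side (diphone-style units).
diphone_units = PHONEMES ** 2    # 1600

# One neighbour on each side (triphone-style units).
triphone_units = PHONEMES ** 3   # 64000
```

A sample inventory in the thousands therefore sits between the diphone and triphone bounds.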

Hey, does this mean that S. Hawkins will get new voice?

--
Table-ized A.I.

Re:40 phoneme's is too simplistic for synth usage by prowley · 2003-03-17 08:14 · Score: 1

No, Hawkins has already refused upgrades. After all, it is his voice now.
Re:40 phoneme's is too simplistic for synth usage by benhaha · 2003-03-17 22:09 · Score: 1

Hawking not Hawkins.

--
NO ID: BEING FREE MEANS NOT HAVING TO PROVE IT

Natural Voices Gagged: AT&T is asleep at the d by SimHacker · 2003-03-17 07:47 · Score: 2, Informative

I'm working on a project involving voice synthesis, so we've been shopping around and evaluating different systems.

We were hoping AT&T would do a better job than IBM at supporting their voice synthesizer. IBM pulled the Linux version of ViaVoice off the market without so much as a peep to their adoring fans on Slashdot, and wiped all mention of the Linux version from their web server. (Google isn't even allowed to cache it.) After IBM milked the slashdot linux fanboy publicity for all it was worth, they apparently didn't see any purpose in actually SUPPORTING the product -- so once their libraries stopped working against the latest Gnu/Linux libraries (happy birthday RMS!), they dropped their Linux voice synthesizer product like a hot potato instead of bothering to recompile it and issue an update.

So we hoped AT&T would show more commitment to the promises they made on their web site about their flagship voice synthesizer product, but...

Has anyone actually tried buying a single user copy of Natural Voices from AT&T? YOU CAN'T ANYMORE! They used to sell the synthesizer for workstations and voices for competitive prices (in the 100s of dollars range). So we bought a few voices to evaluate, and sent some simple technical questions into the email address they provided for support, never receiving a reply.

After several weeks they never answered any of our questions, but we decided to buy some more voices to evaluate anyway. But by then, AT&T had pulled the consumer single user version of Natural Voices off of the market (and it took weeks of phone tag to find that out because they don't give out "technical" information on the phone, and they never answer their email support address).

Now if you want to buy a Natural Voice from AT&T, you have to buy the server edition for tens of thousands of dollars. Had their support not absolutely sucked, it might have been worth us paying such a high price, but no way we'd ever consider going with AT&T, after they demonstrated such horrible unresponsive service.

Actually it's a good thing we didn't go with AT&T's voice synthesizer, because we need support for voice authoring tools, and AT&T is incompetent in that regard, since they refuse to give out technical information over the phone, and never answer their email. No support whatsoever. Zilch. Nada. Forget about it.

Fortunately we found some excellent open source software that works together (and whose authors are MUCH more responsive than IBM or AT&T): the Festival Speech Synthesis System, the FestVox voice authoring tools, the small fast Flite runtime speech engine, the Edinburgh Speech Tools, the CSLU speech tools, the OGI Festival tools, and the MBROLA Multilingual Speech Project. This is state of the art research software, where IBM and AT&T got their ideas.

The quality of the commercial voices comes more from throwing lots of time and money into the production process -- the commercial software is not any more advanced than the open source research projects -- in fact the research projects inspired the commercial products!

-A speech synthesizer user who's been jerked around by AT&T and IBM, and is now happy to have no other choice but to use excellent open source software.

--
Take a look and feel free: http://www.PieMenu.com

Re:I'm not actually convinced phonemes exist, y'kn by decrocher · 2003-03-17 07:57 · Score: 2, Interesting

I think it is widely recognized that you need to take coarticulation and _meaning_ into account when converting between speech and text.

You argued in another post for models of 4+ phonemes. The reason we don't see this is that it's not a huge theoretical leap from triphones (thus boring to researchers) and there are computational/storage/training efficiency requirements to consider. This is why one doesn't record an exhaustive library of every possible utterance in the first place. I think once you get to 7-phones, you may be better off trying to understand the phrase from a higher level of abstraction.

Have we correctly identified the right compact expression of speech? I doubt it. Getting speech stuff to work involves a lot of tweaking that is theoretically ungrounded. Tweaking in a methodical and science-biased way _is_ engineering, however.

BTW, I seem to remember a prof saying that X-ray cinematography more-or-less proved the existence of vocal tract target configurations in speech, which correspond to phonemes. Not to mention that you can encode a message in IPA and have it understood by someone else. Even if they're not totally correct, phonemes may be a sufficient basis for building speech systems.

You just like to hear yourself talk. by Anonymous Coward · 2003-03-17 07:59 · Score: 0

You don't know what you're talking about, and you trivialize an extremely complex problem. The main problem is "smoothing the phoneme edges", huh? It's a "very basic practice indeed", huh? Of course you're "not entirely sure," because you know nothing about modern speech synthesis, and you're wildly extrapolating from your extremely limited out-of-date experience. You have no need for a speech synthesizer yourself, because you just like to hear yourself talk.

Evil Anti-War Belgian Fries!!!! by Anonymous Coward · 2003-03-17 08:18 · Score: 0

French Fries are actually from Belgium!!!!! And French Fries are EVIL and ANTI-WAR! So Belgians are EVIL! EEEEEVIL!!!! EEEEEEEVILLLLLL!!!!! Only patriotic war-supporting foods are fit to be eaten! I am pouring all my Belgian wine down the toilet!!!!

-A Patriotic American

Re:Evil Anti-War Belgian Fries!!!! by selderrr · 2003-03-17 22:08 · Score: 1

I am pouring all my Belgian wine down the toilet

being belgian, so am I !
we don't have much of a wine culture, dumbo. We're beer drinkers. Check out www.belgianbeer.com. We practically invented the stuff.

--
When will I end this grieving ? When will my future begin ?

Phonemes don't exist? Do YOU??? by Doug+Merritt · 2003-03-17 08:23 · Score: 1

I have a master's in linguistics, specialising in speech processing and the like, and I don't really believe in phonemes....In the beginning, there was the word. And the word was spoken...

...sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion.

This is not a very coherent argument. You might as well say that you doubt the existence of musical notes, since you've diagrammed the power spectrum of middle C on a piano and various other instruments, and always you see a really complex waveform, not a simple 261.6 hertz sine wave. And sometimes there is even a complete absence of energy at that frequency. The situation indeed is so complex that no algorithm has ever been developed that can reliably detect the supposed period/frequency of the supposed musical note (especially in human voices).

But despite those complex observations, there is every reason to think that musical notes exist. It's just that there are difficult intertwined subjects here; the physics is difficult (if you try to accurately model the non-linearities of the resonating chambers), the math is difficult (the signal is non-stationary, so Fourier analysis is a really bad approximation of the truth), and the brain is, as always, extremely complex, so we don't understand the psycho-acoustics, either.

And yet, musical notes exist.

And so do phonemes, despite the fact that they blur together etc etc.
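
The "no energy at that frequency" case is easy to reproduce on a clean synthetic signal. The sketch below builds a tone from the 2nd, 3rd and 4th harmonics of 200 Hz only, then estimates the pitch by plain autocorrelation. The function name and the 80 to 400 Hz search range are assumptions for illustration, and this says nothing about how well the method survives real, non-stationary voices.

```python
import math

def estimate_pitch_hz(samples, sample_rate, min_hz=80.0, max_hz=400.0):
    """Return the pitch whose period (lag) gives the largest autocorrelation."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

SAMPLE_RATE = 8000
FUNDAMENTAL_HZ = 200.0

# Harmonics 2, 3 and 4 only: the 200 Hz component itself is absent.
tone = [
    sum(math.sin(2.0 * math.pi * h * FUNDAMENTAL_HZ * n / SAMPLE_RATE) for h in (2, 3, 4))
    for n in range(800)
]
pitch = estimate_pitch_hz(tone, SAMPLE_RATE)
```

On this signal the estimate lands on 200 Hz, the note a listener would report, even though a spectrum shows nothing there. The period is a property of the whole waveform rather than of any one spectral peak.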

On a final note, don't forget that even the word "word" has no concrete rigorous definition widely agreed upon in the linguistics community. Naively it seems simple, but when you look closely at the subject, the notion of "word" also gets really complex. So one could deny that words exist...but I don't think that would be a smart stance.

As a side note I'm not real thrilled about your history of written language. I suggest you take a refresher on ancient Egyptian. :-)

Masters in Linguistics, eh? Hopefully you are moving on to a PhD?

P.S. Yes, I have done work in computational linguistics...and shipped product! The topics are very difficult, indeed, but in the commercial world, failure to solve problems is not an option. :-)

--
Professional Wild-Eyed Visionary

Singing speech synthesizers: Dictionaraoke! by SimHacker · 2003-03-17 08:49 · Score: 1

Festival has some singing demos, using a simple XML format to mark up text with beat duration and note pitch information.

And Oregon Graduate Institute's CSLU Toolkit extends Festival with an implementation of Sable: an XML format that lets you mark up text with arbitrary timing, pitch and volume envelopes.

And of course there's Dictionaraoke!

Main Entry: dictionaraoke Pronunciation: 'dik-sh&-"ner-A-O-ke Definition: Audio clips from online dictionaries sing the hits of yesterday and today. The fun of karaoke meets the word power of the dictionary.

-Don

--
Take a look and feel free: http://www.PieMenu.com

Re:Singing speech synthesizers: Dictionaraoke! by Anonymous Coward · 2003-03-17 15:08 · Score: 0

That's clever, but it's a pity the creators don't even know it's pronounced ka-da-OH-kay.

Re:I'm not actually convinced phonemes exist, y'kn by Anonymous Coward · 2003-03-17 09:05 · Score: 0

If all you have is a hammer, and you try pounding in a bolt, then driving a few non-tapping screws, and finally try to spot-weld a few bits of metal together by pounding very rapidly, you go into a cocoon and emerge with a whole field of study about how construction materials don't exist and, anyway, construction is impossible -- but only a scholar in the field can say so without being shouted down.

This is why engineers get high blood pressure when soft-science people mount the podium.

Re:I'm not actually convinced phonemes exist, y'kn by ungerware · 2003-03-17 10:54 · Score: 1

There's lots of companies out there not interested in making art. They just want to be able to get their customers through their phone call centers without falling back on "press one for... press two for..."

There's actually a section of the article discussing this. The IBM engineer was talking about what he considers the "holy grail" of TTS, and his opinion was that it is _not_ perfect reproduction of human inflections.

--

-----
Kvetch is Yiddish for "throw an exception" --Dr. Ron Cytron

Thanks. Nice post. by Anonymous Coward · 2003-03-17 13:15 · Score: 0

I'm already convinced.

What are the alternatives? Can anyone point to work using words or sentences? At first cut I can imagine building a simple dictionary (spoken vocabulary) of words and trying to register the edges somehow so it doesn't sound like a ransom note. But the inflections are still guaranteed to be wrong.

It seems the most natural unit would at least be a sentence. A sentence is the smallest unit of song, with a beginning and an end. But how do you decompose and reconstruct a sentence as a whole? Suppose you can reduce the spoken sentence to some parameters capable of regenerating the sound nicely using a synthesis engine. Furthermore, suppose you have a huge library of spoken sentences and accompanying text. How do you train a mapping function between generic (non-training) text and sentence synthesis parameters, without doing any decomposition?

It seems like you have to do some kind of decomposition, but perhaps the "phoneme" is not correct. It might be more appropriate to look for clear features (in the time domain) to partition more realistic components, then see if the detected components generalize and synthesize.

Why so aggressive by Anonymous Coward · 2003-03-17 13:57 · Score: 0

Buddhists say we actually don't exist. And certainly, from the right perspective, we truly don't exist as independent extractable entities. Certainly we don't exist as independently as phonemes are required to in order for phonemes to sound right.

Anyway, I once had a famous Producer tell me that intonation doesn't matter, style matters; even style without intonation. This guy has produced a string of platinum-selling albums. Practically speaking, this means that notes DO NOT exist as far as the human voice is concerned, at least not in any useful or marketable way. His advice was to ignore pitch correction technology and to concentrate on style only. I would argue that in a similar way the notion of phonemes interferes with the production of listenable synthetic speech.

Anyway, for a couple of examples:

* Piano - notes exist, though they are usually out of tune

* Guitar - notes are suggested, but you can tune them by pushing the strings around with your fingers

* Flute - notes are suggested, and you can inflect and blend the hell out of them by using your lips

* Voice - notes don't really exist, just vocal style

* Phonemes - exist only in the minds of people failing to build TTS systems that a normal person would enjoy listening to

Epistemology IS hard-science by Anonymous Coward · 2003-03-17 14:09 · Score: 0

Epistemology is at the root of hard science; it is the basis of the scientific method. Engineers get high blood pressure because they like to look stuff up instead of actually thinking.

LPC vocoder by Latent+Heat · 2003-03-17 15:59 · Score: 1

I ran the AT&T synthesis through my trusty spectrum analyser and glottal pulse inverse filter analyser (http://www.medsch.wisc.edu/~milenkvc/tools.html) to see what they are up to.

It looks like they are using glottal pulses as you say, and they are doing the female voice (Crystal) by boosting the first two harmonics and by filtering out the range past 4 kHz and replacing it with noise to give it that breathy sound that is characteristic of female voices in American culture (this varies with culture -- the "Dame Edna" effect, yeah, I know Dame Edna is a dude in drag, but different cultures have different norms on how women are supposed to talk). I think they are doing some other tricks, like varying the formant damping pitch synchronously to fake the effect of a coupled voice source and vocal tract acoustic load.

At the segmental level the synthesis is kind of clunky, but at the voice quality level it sounds remarkably good, especially Crystal. Mike (the male voice) sounds kind of buzzy at the voice level, but Crystal sounds quite female and quite natural.

Re:I'm not actually convinced phonemes exist, y'kn by mhowell · 2003-03-17 16:54 · Score: 0

Great post and discussion on one of my favorite subjects. I didn't find the demos here to be an amazing leap over AT&T Natural Voices or Rhetorical Systems, which I think are similar technology. I was a little surprised to find that this is "new" in Scientific American (maybe just 'cause it's not new to me), but I would take absolutely nothing away from the folks at IBM and elsewhere who are working on it.

I worked quite a bit on some telecom projects using RealSpeak from (now defunct) L&H which also uses a triphone concatenation technique IIRC. It doesn't sound as good as this stuff but it was a useful, shipping product.
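For anyone unfamiliar with the term, a triphone is a phone taken together with its left and right neighbors, so each unit carries its own coarticulation context. A rough sketch of the bookkeeping follows; the `sil` padding symbol and the function name are my own assumptions, and this is not a claim about how RealSpeak or any other product does it internally.

```python
def to_triphones(phones, boundary="sil"):
    """Turn a phone sequence into (left, center, right) context triples."""
    # Pad both ends so the first and last phones also get a full context.
    padded = [boundary] + list(phones) + [boundary]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


# Hypothetical phone string for the word "cat".
units = to_triphones(["k", "ae", "t"])
```

A concatenative engine would then look each triple up in an inventory of recorded units and fall back to a smaller context when an exact match is missing.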

Yes I think "phoneme" gets a bit fuzzy when you put it under the microscope, but so do other handy abstractions like "word" and "adjective".

ASR and TTS techniques in use now are pretty sophisticated and relatively successful considering they only try to simulate the bottom of the stack of our (human) language machine, i.e., to simplify, since the TTS doesn't know what the words or sentence "mean", how can it know how to get the right intonation, emphasis, etc.? Ditto for ASR; the state of the art is to build a grammar or language model of some kind by hand for each step in a dialog. Effectively the app developer must tell the recognizer exactly what words to listen for, and in what order, (and with what probability/preference).
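To give a feel for what "telling the recognizer what to listen for" amounts to, here is a toy sketch of one dialog step. The phrases, the weights, and the `best_match` helper are all made up for illustration; real engines take grammars in their own formats and do the scoring internally.

```python
# Hand-written grammar for a single dialog step: allowed phrases and their
# preference weights. All values here are hypothetical.
GRAMMAR = {
    ("check", "balance"): 0.6,
    ("transfer", "funds"): 0.3,
    ("operator",): 0.1,
}


def best_match(hypotheses, grammar):
    """Pick the best in-grammar hypothesis.

    hypotheses: list of (words, acoustic_score) pairs from a recognizer.
    Returns the winning word tuple, or None if nothing is in the grammar.
    """
    best, best_score = None, 0.0
    for words, acoustic in hypotheses:
        prior = grammar.get(tuple(words))
        if prior is None:
            continue  # out-of-grammar phrases are rejected outright
        score = acoustic * prior
        if score > best_score:
            best, best_score = tuple(words), score
    return best


# "check ballots" scores best acoustically but is not in the grammar.
heard = [(["check", "ballots"], 0.9), (["check", "balance"], 0.5), (["operator"], 0.4)]
choice = best_match(heard, GRAMMAR)
```

The point of the sketch is that the acoustically strongest guess loses because the developer never listed it, which is both why these systems work and why they are brittle.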

So the clever stuff in current TTS engines isn't just how to glue the phonemes together, but how to generate the right intonation/prosody, emphasis, choice of pronunciations (the verb "read" in the past tense is pronounced differently from "read" in the present tense). These things can vary from one speaker or region to the next, just like the accents, so it's hard to find the "rules". This is something that is maddening about computational linguistics... seems like for every "rule" there is a phonebook full of fine print.
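The "read" case makes a nice toy example of a rule and its fine print. The sketch below is a deliberately naive heuristic, not how any shipping engine works: the cue list is invented, and the pronunciations are written as ARPAbet-style strings.

```python
# Invented cue words that loosely suggest a past-tense reading.
PAST_CUES = {"had", "have", "has", "was", "yesterday", "already"}


def pronounce_read(words, index):
    """Guess the pronunciation of "read" at words[index] from sentence context."""
    # Collect every other word in the sentence, lowercased and unpunctuated.
    context = {
        w.lower().strip(".,!?")
        for i, w in enumerate(words)
        if i != index
    }
    # Any cue word present: past tense ("red"). Otherwise present tense ("reed").
    return "R EH D" if context & PAST_CUES else "R IY D"


present = pronounce_read("I will read it tomorrow".split(), 2)
past = pronounce_read("I have read it".split(), 2)
fooled = pronounce_read("I have to read it".split(), 3)
```

The third call is the fine print: "have to read" is present tense, but the cue "have" trips the rule anyway, so every patch invites another exception.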

's fun!

Re:I'm not actually convinced phonemes exist, y'kn by gurensan · 2003-03-17 17:37 · Score: 1

IANAL (linguist), but I think I both agree and disagree with what you're saying. I actually have a reason, even. That reason being that we can't see how our brains actually process things like phonemes, or even what's really all that important about them. This is, IMO, analogous to vision: we have neurons that fire in response to this or that shape, brightness, or color in their visual field, but they are incredibly difficult to cause to fire artificially. There are also such variations to what we hear and smell. What we might think of as the building blocks of speech or language might not be the case at all. Phonemes are a good example. A good experiment here might be to train a neural network to recognize a few different phonemes in speech and see what it comes up with when certain phonemes are not present. Just an idea, but based on this I agree with a large part of your post.

--
You are all fartheads.

Re:I'm not actually convinced phonemes exist, y'kn by t · 2003-03-17 18:16 · Score: 1

Now, we all know what happens with lossy compression...

I understood that to mean perceptually lossy, and not just technically lossy.

I think the parent poster makes a good point. What you need to realize is that speech research like this has been ongoing for like 50 years. It is my opinion that if we were going down the right track we would be further along than we are now. It is always a good idea to step back (sometimes way the hell back) and see if there are any other ways to attack a problem. Even examining methods that have failed miserably can be quite useful if you bother to figure out exactly why they failed so badly.

Personally I think the way the speech problem itself is formulated is flawed. Think of it this way, you hire a secretary that does not speak English. So instead you read to her from a book for a couple hours until she has figured out how to map your utterances to the words in the book. Later you ask her to read stuff back to you. Can you see how much much more difficult the problem is? If the secretary had knowledge of the language, then the problem would not be so difficult. But the current methodology is typical engineering style, break the problem down into small chunks, solve small chunks, assemble small chunks to form much larger chunk, voila! Problem solved.

Re:I'm not actually convinced phonemes exist, y'kn by njpomeroy · 2003-03-17 18:56 · Score: 1

I also have an M.A. in Theoretical Linguistics, and I don't pretend to speak as an expert, just as a student.

It sounds like you have trouble with the "existence" of phonemes because they are not discretely separable into clean, isolated units on a spectrometer. This seems like a failure, on your part, to understand the Phonemic/Phonetic distinction. This is a common misunderstanding. Just for the record: Phonemes are theoretical and in your head, Phones are physical air waves smacking against your eardrums and therefore microphones. Everything you saw on the spectrometer was an electro-visual representation of a phone.

Additionally, I have seen more compelling evidence for the "existence" of phonemes, as a theoretical construct than you have shown against them. (I will leave the discussion of "being" and "existence" of a theory to the Philosophy students).
As a theoretical reference, they are very useful, much in the same way we posited subatomic particles before we could detect/segment them.

Much of what TTS and NLP engineers are trying to do is work from the theory to produce results. IMO, the results using modern, phonemic practices are much better than older spectra/recording based TTS systems. The proof is in the pudding; real work is getting done by way of a theory.

One final note, unrelated to my main point: your description of the history of writing systems and writing's relation to language is highly inaccurate. Your description of the IPA and what it's used for is equally so. I used to think my Linguistics education was lacking, so it's nice to see that there are schools still lower on the bell curve.

I think that ... by Snork+Asaurus · 2003-03-18 05:05 · Score: 1

it may be a problem with the U.B.A.

--
Sigs are bad for your health.
