Open Source Speech Recognition - With Source

Posted by timothy on Tuesday September 28, 2004 @11:18AM from the what-I-hear-you-saying-is dept.

Paul Lamere writes "This story on ZD-Net and this recent story on Slashdot describe the recent open sourcing of IBM's voice recognition software. This release, unfortunately, doesn't include any source for the actual speech recognition engine. Olaf Schmidt, a developer on the KDE Accessibility Project, is quoted as saying 'There is no speech-recognition system available for Linux, which is a big gap.' In an attempt to close this gap, we have just released Sphinx-4, a state-of-the-art, speaker-independent, continuous speech recognition system written entirely in the Java programming language. It was created by researchers and engineers from Sun, CMU, MERL, HP, MIT and UCSC. Despite (or because of) being written in the Java programming language, Sphinx-4 performs as well as similar systems written in C. Here are the release notes and some performance data."

32 of 404 comments

Aim You Sing Ate Write How by Anonymous Coward · 2004-09-28 11:19 · Score: 5, Funny

Ate lurks barry wall.
1. Re:Aim You Sing Ate Write How by BarryJacobsen · 2004-09-28 12:43 · Score: 5, Funny
  
  Ate lurks barry wall.
  
  Who ate my wall?
  
  --
  Track your TV Shows with your iPhone - FREE
But what about text to speech? by Anonymous Coward · 2004-09-28 11:26 · Score: 5, Interesting

When are we going to get GOOD text to speech, that uses modeled parameters of human vocal tracts rather than stitching together a bunch of pre-recorded phonemes?
1. Re:But what about text to speech? by QuantumG · 2004-09-28 11:30 · Score: 4, Informative
  
  By "we" I assume you mean "the open source community", and the answer is "when you get off your ass and code it". If by "we" you mean the world at large, then go and look at AT&T's Natural Voices project.
  
  --
  How we know is more important than what we know.
2. Re:But what about text to speech? by Sheetrock · 2004-09-28 11:34 · Score: 5, Funny
  
  Given that there is already a rudimentary text-to-speech package available for Linux, and now a speech-to-text package, perhaps the secret is to pipe one to the other in a closed loop until one learns how to enunciate and the other how to listen?
  
  --
  
  Try not. Do or do not, there is no try.
  -- Dr. Spock, stardate 2822-3.
3. Re:But what about text to speech? by winterlens · 2004-09-28 15:46 · Score: 4, Insightful
  
  Probably because speaking is incredibly complicated, and providing realistic speech from unmarked text is an intractable problem.
  When you write something down, you don't provide a pronunciation guide. Rather, the reader is guided by context. For instance, if I write the word "import", how do you pronounce it? If we're talking about trade deficits, you probably know that the stress is on the second syllable; but if we're discussing meaning, the stress is on the first.
  How do we expect computers that have a difficult time with context to make a pronunciation decision? This is a serious barrier to "good" text to speech (whatever "good" means).
  If you mean that you want the voice to sound more natural, even if it's pronouncing words incorrectly, you still have a lot of hard problems. For instance, the muscles in the tongue and lips move differently based on how phonemes are grouped. Coarticulation models are difficult to construct, and when you try to account for a convincing number of muscles and vibrations, the problem may quickly become intractable.
  Not only do we have to pay attention to the physics of speaking, but also the physics of hearing. The amount of signal processing involved can be pretty staggering if you're going to implement a complete system. Thierry Dutoit has a really good book on the subject called An Introduction to Text-to-Speech Synthesis. You should check it out if you want a somewhat more exhaustive answer to your question.
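The point above, that pronunciation depends on context the text does not spell out, can be sketched in a few lines. This is a toy heuristic, not how any real synthesizer works: it assumes the noun/verb distinction is the relevant cue (nouns such as "import" or "record" usually carry first-syllable stress, the verbs second-syllable stress), and the `VERB_CUES` word list is a hypothetical stand-in for a real part-of-speech tagger.

```python
# Toy illustration only: real systems use part-of-speech taggers and
# pronunciation lexicons, not a hand-written cue list like this one.

# Stress patterns for a few noun/verb heteronyms
# (capitals mark the stressed syllable).
STRESS = {
    "import": {"noun": "IM-port", "verb": "im-PORT"},
    "record": {"noun": "REC-ord", "verb": "re-CORD"},
    "permit": {"noun": "PER-mit", "verb": "per-MIT"},
}

# Hypothetical cue words: if one of these precedes the heteronym,
# the sketch guesses that it is being used as a verb.
VERB_CUES = {"to", "will", "can", "must", "we", "they", "i", "you"}


def guess_stress(words, index):
    """Guess the stress pattern of words[index] from the word before it."""
    word = words[index].lower()
    if word not in STRESS:
        return None  # not a heteronym this sketch knows about
    previous = words[index - 1].lower() if index > 0 else ""
    reading = "verb" if previous in VERB_CUES else "noun"
    return STRESS[word][reading]


def annotate(sentence):
    """Return (word, stress) pairs for every known heteronym in the sentence."""
    words = sentence.split()
    pairs = []
    for i, word in enumerate(words):
        stress = guess_stress(words, i)
        if stress is not None:
            pairs.append((word, stress))
    return pairs
```

Even this tiny case needs a lookup table plus a guess about grammar, and the guess fails as soon as the cue word is missing or misleading, which is roughly the difficulty the comment describes.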
Virtual Machine Syndrome by nihilogos · 2004-09-28 11:30 · Score: 5, Funny

Colloquially known as "pointer-envy", this condition may affect all programmers, but is especially prevalent in Java and C# developers. It is most easily recognized in a release announcement, where for no reason whatsoever the afflicted developer suddenly interjects a statement like "and it's just as fast as C", to the bewilderment of the audience.

Treat suspected cases with caution, and under no condition contradict the patient. There is no known cure.

--
:wq
1. Re:Virtual Machine Syndrome by Xeger · 2004-09-28 11:45 · Score: 5, Funny
  
  KNOWN CAUSES: Recent research results from information-theoretic psychoanalysts show that Virtual Machine Syndrome is most likely a pre-emptive defensive discourse strategy. VMS sufferers typically become symptomatic after months or years of constant haranguing at the hands of colleagues, friends and professional contacts that anything they write, regardless of its execution environment or portability requirements, could have been done "better and faster in C." Oftentimes, such criticism is levied against VMS sufferers even when the application in question is I/O-bound and spends 80% or more of its time suspended, waiting for network or disk I/O to complete.
  
  TREATMENT: Implement reliable and efficient systems using virtual machine of choice, regardless of criticisms. Apply free-market therapy judiciously, allowing adopters of Virtual Machine technology to thrive and become prosperous if warranted. VMS symptoms typically disappear when sufferer's stock options are valued at 300% of their strike price. Symptoms may also be temporarily relieved through just-in-time compilation.
  
  RELATED SYNDROMES: Ossified Self-Important Myopia (OSIM), which is the tendency to assume that one's favorite programming paradigm, language, or OS is unconditionally and unreservedly the best choice for any software project. Characterised by the inability to understand that the only way to guarantee maximum efficiency is to write everything in assembly language, with complete and perfect knowledge of all quirks of the specific target instruction set.
2. Re:Virtual Machine Syndrome by pslam · 2004-09-28 11:50 · Score: 5, Informative
  
  It is most easily recognized in a release announcement, where for no reason whatsoever the afflicted developer suddenly interjects a statement like "and it's just as fast as C", to the bewilderment of the audience.
  An especially odd statement, considering much of speech recognition can be broken down into great big vector operations, which are perfect for hand coding in C. Bet I could quadruple the speed of it in a couple of hours with some hand-coded SIMD ops in x86 assembler.
  It's funny because Java is fantastic at JIT compiling code with lots of non-local behaviour (e.g. complex UIs) because it can take into account global behaviour at runtime. But it sucks at tight, heavy computation loops. DSP is a fantastic example of something Java is going to get creamed at when pitched against non-virtual machines.
  Of course, if you have some cross-platform standard API calls for those vector DSP ops, then it's a different argument...
3. Re:Virtual Machine Syndrome by Brandybuck · 2004-09-28 12:18 · Score: 4, Funny
  
  Try my new text editor, it's written in Java!
  
  Why should I?
  
  Because it's written in Java!
  
  How is it better than what I'm currently using?
  
  It's written in Java!
  
  I'm already using vi, emacs, kate and gedit, why should I use yours as well?
  
  Because it's written in Java!
  
  Does it have a spell checker, syntax highlighting, and auto-indent?
  
  Who cares? It's written in Java!
  
  Name two benefits to your text editor?
  
  That's easy! First, it's written in Java. Second, it's uh... uh... hang on, uh... it's written in Java! Yeah, that's it, it's written in Java!
  
  --
  Don't blame me, I didn't vote for either of them!
4. Re:Virtual Machine Syndrome by IgnoramusMaximus · 2004-09-28 14:49 · Score: 5, Insightful
  
  Stuff written in Java is better than stuff written in C or C++ because there are no frapping buffer overflows in Java code
  True, instead there are a thousand "super-efficient" .jar libraries required by a "Hello World" app, which use the "Object Oriented Programming and Long Lasting Cure All and Testicular Itch Relief Paradigm(tm)" to such extremes that it takes 12 objects instantiated in 4 containers to flip a bit in a byte. Additionally, there is the substitution of code compiled "Just Too Late" for the native performance of compiled code, combined with the exceptional memory usage that entails. As if that were not enough, as a bonus we get the garbage collector, which is scientifically fine-tuned to run just when the user is expected to interact with the application in the most time-sensitive manner. As icing on the cake, we are also treated to multiple, insidiously incompatible versions of the so-called "universal" VM, resulting in one app demanding the Sun VM and another the MS VM, which makes it impossible to use both at the same time. Yes, I do speak from looong and utterly infuriating experience with Java apps.
  At one point I considered printing a sign warning that Java advocates will be shot on sight. I could probably make some serious money selling it, given the similar amount of grief my other colleagues are going through.
  Ahem, and yes, the greatest offenders in my experience are... err... frigging IBM Java apps. We actually abandoned the DB2 8.x release because no one could deal with the havoc the DB2 admin tools were causing with various other retarded banking-related Java apps.
5. Re:Virtual Machine Syndrome by dr2chase · 2004-09-28 15:55 · Score: 4, Insightful
  
  Great story, but basically wrong and misleading. You can trowel on the layers in any language, and you can write fast Java programs. The speech engine is proof of that.
  Garbage collection, in particular, is coming along nicely. Check out "Metronome" by David Bacon, of IBM. You set the knobs, it tells you how much memory you will need, and it gives you GC with real time performance. No pauses.
  Or, consider the machine that Azul is working on (good luck getting details now that they are in some sort of a quiet period). It has hardware support for read and write barriers, plus a good story for stack caches. Chances are good its GC pauses will be tiny (1-10 ms).
  I can also tell you that the market very much prefers JIT compilation. I worked on an ahead-of-time-compiling JVM, and there were a couple of others built by other companies. I don't work on that JVM any more, and the other AOT JVM companies have either failed or gone into other lines of business.
  So, great story, but not exactly correlated with reality.
  On the other hand, consider all the buggy apps that we (who sometimes administer Windows machines) have needed to patch over and over again over the years. If I am unwilling to run an application in the first place because of its poor security, does it really matter how little memory it uses, how fast it runs, or how well it gets along with the other worm-friendly apps?
6. Re:Virtual Machine Syndrome by IgnoramusMaximus · 2004-09-28 16:16 · Score: 4, Funny
  
  Check out "Metronome"
  I don't give a damn about Metronomes and speech recognition of questionable usefulness. None of those are Java apps I deal with. And the ones I deal with all suck.
  worm-friendly apps?
  Speaking of security, most Java apps are deployed in places that don't need them in the first place, as a kludge for an e-commerce site or electronic banking interface which could be done with a bit of thinking in plain HTML. Others, like IBM's for example, are mainly administrative tools which have no communication abilities outside of their narrow scope. These, if written in any other language, would not be any more prone to worms. As a matter of fact, the use of Java on some of these electronic commerce sites introduces unneeded complexity and results in code executing on customers' computers, whereby they become prone to being abused by spoofed/buggy VMs etc.
  So, great story, but not exactly correlated with reality.
  Reality? Oh dear. Listen dude, I am telling you as a user of your wonderful computer science masturbation effort otherwise known as Java: No. Nada. Niet. It ain't a go. No can do. The bank we deal with is rewriting their apps to be Java-free because of the amount of flak they are getting (and no, they are not going to that other aberration known as C# either). IBM DB2 is banned in many companies we deal with. Etc etc. We, the users, not you, Mr. Java Wanker, have the final word on this. Trust me.
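For readers wondering what the "great big vector operations" in pslam's comment above look like, here is a minimal sketch. It assumes the common case of an acoustic model that scores each feature frame against diagonal-covariance Gaussians; the function name and the plain-list representation are illustrative choices, not taken from Sphinx-4. The inner loop is the kind of tight arithmetic the thread is arguing about.

```python
import math


def log_gaussian_diag(frame, mean, variance):
    """Log-density of one feature frame under a diagonal-covariance Gaussian.

    frame, mean and variance are equal-length sequences of floats,
    one entry per feature dimension.
    """
    if not (len(frame) == len(mean) == len(variance)):
        raise ValueError("frame, mean and variance must have the same length")
    total = 0.0
    for x, m, v in zip(frame, mean, variance):
        # One subtract, multiply, divide and add per dimension: this is the
        # loop a SIMD implementation would process several lanes at a time.
        total += math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
    return -0.5 * total
```

A recognizer evaluates something like this for many Gaussians on every frame of audio, so whether a JIT can compile that loop as tightly as hand-written C or assembler is the substance of the disagreement.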
Free C++ alternative from Mississippi State Univ. by j.leidner · 2004-09-28 11:34 · Score: 4, Interesting

Another open source system, but implemented in C++ (like all industrial systems I know of), can be found here (a vision statement is here).
--
Try Nuggets , the mobile search engine. We answer your questions via SMS, across the UK.
Re:Java comment by Taladar · 2004-09-28 11:36 · Score: 4, Insightful

Java might be as fast as C in code execution, but if you want to build a library that open source applications outside the Java developer niche can use, you have to write it in C. C is still THE No. 1 language for libraries meant to be used from programs written in lots of different programming languages.

--
Linux is not Windows
Open Source - With Source! by NSash · 2004-09-28 11:37 · Score: 4, Funny

From dept-of-redundancy-department?

I'm not one to be picky about titles, but sheesh...
Translation for those who still don't get it... by CaptainPinko · 2004-09-28 11:40 · Score: 4, Informative

Title: I'm(Aim) using(You Sing) it(Ate) right(Write) now(How)
Body: It(Ate) works(lurks) very(barry) well(wall).

--
Your CPU is not doing anything else, at least do something.
Sphinx 2 by PiGuy · 2004-09-28 11:44 · Score: 5, Informative

"There is no speech-recognition system available for Linux, which is a big gap."

Um, Sphinx 2 (a predecessor of Sphinx 4) has been around for quite some time now. Like Sphinx 4, it's speaker-independent. Unlike Sphinx 4, it's a C library, and is thus easily interfaced with other languages (insert shameless plug for a simple Python interface for Sphinx 2 I wrote).
Speech recognition by CastrTroy · 2004-09-28 11:44 · Score: 4, Interesting

Speech recognition is one of the worst means of input there is for a computer. Keyboards work so much better. Even for those who don't have full use of their hands, there are many other options for user input, all of which are better than speech recognition. Worst thing ever is someone trying to use speech input in a cubicle environment.

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Obligatory Commercial Break by Noksagt · 2004-09-28 11:45 · Score: 5, Funny

Woman: [dictating into cell phone] To: Mike. I had fun last night.
Cell Phone: To: Mike. I have lip fungus.
Woman: [into cell phone, angrily] I had FUN, not lip fungus!
Cell Phone: I have fungus, not lip fungus.
Woman: I DON'T HAVE LIP FUNGUS!!!
OT Star Wars Nitpick by Anonymous Coward · 2004-09-28 11:49 · Score: 5, Informative

Hey moron, it's R2D2 that beep-booped. C3PO was fluent in over 6 million forms of communication. ;-)
Obviously by Moderation+abuser · 2004-09-28 11:59 · Score: 5, Funny

"This data was collected on a dual CPU UltraSPARC(R)-III running at 1015 MHz with 2G of memory."

Looking at the performance data, it just blazes along on that config. Not exactly what I'd call an embeddable system, though Microsoft might beg to differ.

--
Government of the people, by corporate executives, for corporate profits.
Re:The Myth Must Die by nihilogos · 2004-09-28 12:09 · Score: 4, Funny

many benchmarks have shown that a modern optimized JVM with JIT compilation is roughly equivalent with most implementations of C++, with some benchmarks being better for Java and some being better for C++.

And many studies have shown that going with Microsoft software is cheaper than going with open sourced software.

--
:wq
There's more than one kind of overhead. by argent · 2004-09-28 12:17 · Score: 5, Insightful

I could easily live with 10-15% slower, IF Java didn't have the startup overhead. I can run inetd-style fork-exec-terminate servers in C on CPUs that a cellphone would spit on, and handle hundreds of connections a second. Bringing up a JVM on the same processor would take minutes. Bringing up a JIT runtime would be out of the question.

For applications where you can create a JVM and use it as you need it, Java's great. Webservers, sure, no problem. Desktop applications, heck, the GUI overhead's getting to be the same order of magnitude (though that HAS to change, we can't afford to depend on Moore's Law much longer unless someone comes up with a clever way to cut the power consumption of processors faster than the speed increases). Browser plugins? For content, yes, but not for navigation... if it takes 10s to start up a JVM, your customer's already hit "back".
Re:Java!?! by darkonc · 2004-09-28 12:40 · Score: 4, Funny

Quick someone port this to C.
Just be glad it wasn't written in Lisp.

--
Sometimes boldness is in fashion. Sometimes only the brave will be bold.
And lo, the point was missed... by DarkMan · 2004-09-28 12:49 · Score: 4, Insightful

You are, of course, perfectly correct in everything you said.

There are a number of HCI aspects where speech recognition is not a good solution.

However, let me enumerate a number of other ones, where it's superior:

Minutes of meetings, or similar. Imagine having a verbatim record of a discussion there by the time you get back to your desk.

Someone who cannot type - e.g. no hands. Rare, granted, but still a viable use.

Someone whose hands are busy. The canonical example here is a pathologist doing an autopsy, where they dictate everything. Speech recognition saves time in transcription (and money for the audio typist).

I'd love to be able to issue voice commands to a computer, for a few, isolated cases. For example, diagnosing hardware. Bring up a doc, and be able to get the computer to flip pages, without having to remove the probes from the hardware. Relocating them is a pain, and sucks time.

Moreover, I'm certain that there are others, some of which will only be realised when it's common and cheap enough to be widely available.

It's like a mouse. It's one of the worst general-purpose input devices for a computer [0], but it excels at indicating a single element on a display. The mouse and keyboard complement each other, and there are a bunch of other, more specific input devices, such as the graphics tablet. I have no doubt that if speech recognition were as accurate and reliable as a graphics tablet, it would get a similar amount of use.

[0] Try inputting a block of prose with only a mouse. Even specialist software makes it suck only marginally less.
Rolling your own speech recognition isn't so easy by belmolis · 2004-09-28 13:02 · Score: 4, Informative

Speech recognition is not really a solved problem. For some applications it works adequately, but if you take a look at the error rates for the Sphinx system to which the post links, you'll see that the Word Error Rate for large vocabulary is over 18%. Even for 5,000 words it is 7%. For many applications that is unacceptable.

A second factor is that these statistical speech recognition systems require extensive data for their acoustic and language models. Building such a system requires recording real speech, segmenting it and creating a set of examples from which to compute the probabilities, which requires some knowledge of acoustic phonetics, and doing the computation for the model. This is time-consuming.

Speech recognition technology isn't a dark secret, but it isn't trivial to create a system with good performance either.
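The Word Error Rate figures quoted above are easier to judge with the metric spelled out. The sketch below assumes the usual definition, (substitutions + deletions + insertions) divided by the number of words in the reference transcript, computed with a word-level edit distance. Whether the Sphinx-4 performance page scores in exactly this way is an assumption, and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by the
    number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Under this definition an 18% rate means roughly one word in every five or six needs fixing, and because insertions count as errors the figure can exceed 100% for a very bad transcript.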
nifty desktop control with sphinx and festival by Danny+Rathjens · 2004-09-28 13:07 · Score: 4, Interesting

http://perlbox.sourceforge.net/
The very small vocabulary needed for desktop control makes the speech recognition much more accurate and usable.
I used to work for MacSpeech, doing UI work by notthepainter · 2004-09-28 13:33 · Score: 5, Insightful

And before that, I worked for Articulate Systems, also doing UI work.
With that said, you can probably guess I have a lot to say about Speech Recognition. (Not Voice Recognition, that's different, that would be able to distinguish Ben from Charlie for example.)
A good SR engine is, of course, essential. And I've not read the details on the two recent giveaways, but I suspect that they are only the engine.
The SR engine is just a beginning. There is a ton of UI work that needs to be done. Sit and think about spacing around punctuation marks and then think about capitalization around punctuation marks. Yeah, it is all pretty cut and dried and known, but the details really need to be sweated to get it right. This is very time-consuming.
Next you have to worry about exactly where you are editing. Is that into Microsoft Word (or Open Office), or emacs, or where? It can make a huge difference when you want to go back and correct misrecognitions. You just don't want to send N delete characters and retype it, that results in a lousy user experience. So just exactly where is the input cursor at all times? This is not an impossible problem, but one where the details must be sweated.
Next is command and control. Just how are you going to let the user grab the text of all the menus and all the text in the dialog box buttons? Again, not impossible, but more of those pesky details.
Finally, is your SR engine good enough? Maybe, maybe not. Let's just say that 98% accuracy might look good on paper, but that is one in 50 words wrong. Unless your correction mechanism is smooth, an error rate that high will greatly slow you down.
Is Open Source SR a good thing? Oh yes sir, yes! But let's not forget the details. One thing the Open Source community has been accused of, perhaps justly, perhaps unjustly, is not sweating the details.
Speech Recognition has an awful lot of details.
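To give a flavor of the punctuation and capitalization details mentioned above, here is a minimal sketch of turning recognized tokens into formatted text. The spoken punctuation names in `PUNCTUATION` and the function name are hypothetical, and the rules cover only the simplest cases (no space before a mark, capital letter after a sentence end); a shipping product would need many more.

```python
# Hypothetical spoken names for punctuation marks.
PUNCTUATION = {"comma": ",", "period": ".", "colon": ":"}
SENTENCE_END = {".", "?", "!"}


def format_dictation(tokens):
    """Join recognized tokens into text with basic spacing and capitalization."""
    text = ""
    capitalize_next = True  # the first word of the dictation starts a sentence
    for token in tokens:
        mark = PUNCTUATION.get(token.lower())
        if mark is not None:
            text += mark  # punctuation attaches to the previous word, no space
            if mark in SENTENCE_END:
                capitalize_next = True
            continue
        # Upper-case only the first letter so words like "IBM" survive intact.
        word = token[:1].upper() + token[1:] if capitalize_next else token
        if text:
            text += " "
        text += word
        capitalize_next = False
    return text
```

For example, `format_dictation(["hello", "comma", "world", "period", "next"])` yields `Hello, world. Next`. The sketch already gets things wrong (someone who wants the literal word "period", abbreviations that end in a dot, quotation marks), which is the kind of detail the comment says has to be sweated.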
Why speech recognition on Linux will kill Windows by MarsF · 2004-09-28 13:41 · Score: 5, Insightful

I was thinking about this the other day, and was wondering if this is a huge gap in the Windows user interaction model.

Think about how you input info using windows. You click on a few locations using the mouse, perhaps use some keyboard input, click some more. The output from these inputs is arbitrary: it may result in anything from a 'File/Save' dialog to a custom error dialog box. There is no linear path for inputting commands, or for mapping inputs to results.

Compare this to the command line. You enter a few distinct atomic commands, and view the results in the same medium. You then enter more commands, refining your actions. The key here is that you already have a linear model for input that produces well defined expected results, all in a common medium that is conceptually simple, visible to the user, and easily processed by machines. Extending this model to accept voice input or output is trivial.

How is one supposed to quantify basic tasks and turn them into equivalent voice commands without a baseline framework or paradigm to extend from? How do you automate, simplify, or extend existing tasks without a common input or output medium? GUIs provide no such medium or framework; that same framework is at the heart of the command line interface!

Perhaps this is why we never saw voice recognition technology take off on Windows. It's blinking impossible to script actions for an arbitrary task, let alone process the arbitrary results!

On a similar note, we may see voice recognition on Linux take off like a rocket. Anybody can add voice recognition to perform almost any command because the actions are all scriptable through the CLI already. If you can type it, you can get your computer to do it when you say 'computer, foo!'

Mars

P.S. It would be greatly appreciated if someone could please clarify my point. It's buried in there somewhere...
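The 'computer, foo!' idea above can be sketched as a lookup from recognized text to an ordinary command line. Everything here is an assumption made for illustration: the recognizer is taken to deliver plain text, and the phrase table, the wake word and the function names are hypothetical.

```python
import shlex
import subprocess

# Hypothetical phrase-to-command table; the values are ordinary shell commands.
COMMANDS = {
    "list files": "ls -l",
    "show disk usage": "df -h",
    "what time is it": "date",
}
WAKE_WORD = "computer"


def resolve(utterance):
    """Turn recognized text such as 'Computer, list files!' into an argv list.

    Returns None if the wake word is missing or the phrase is unknown.
    """
    text = utterance.lower().strip().rstrip("!.?")
    if not text.startswith(WAKE_WORD):
        return None
    phrase = text[len(WAKE_WORD):].lstrip(" ,")
    command = COMMANDS.get(phrase)
    return shlex.split(command) if command else None


def run(utterance):
    """Execute the command for an utterance and return its standard output."""
    argv = resolve(utterance)
    if argv is None:
        return None
    return subprocess.run(argv, capture_output=True, text=True).stdout
```

With a fixed table like this the recognizer only has to tell a handful of phrases apart, which also fits the observation elsewhere in the thread that a very small vocabulary makes recognition much more accurate.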
The issue is Java's footprint and integration by Ndr_Amigo · 2004-09-28 14:28 · Score: 4, Insightful

While I've been waiting for Sphinx to mature into something useful for a long time now, the move to Java makes the whole package pretty useless to me.

Java is a memory hog, and it's certainly not going to be on any device I would want speech recognition on. Heck, I don't have Java installed on any of my machines, mostly because of the absolutely ridiculous footprint on disk as well as in RAM when running.

And integrating Java applications into other applications is very difficult. Now, Java is good for certain things, but a speech recognition engine in Java sounds like the worst abuse possible :)

That and I still can't train it to recognise my slight Australian accent, unlike every other bit of SR software I've used on Win32 :P

Whether or not Sphinx-4 works, and whether or not Java is 'fast' enough to do speech recognition processing, it's of no use to me.
Re:Java!?! by leinhos · 2004-09-29 00:21 · Score: 4, Informative

Can't gcc compile java code directly to native binary code?

Does this mean that one could make a shared library out of the java code for C-programmers to use?