Open Source Speech Recognition - With Source

Posted by timothy on Tuesday September 28, 2004 @11:18AM from the what-I-hear-you-saying-is dept.

Paul Lamere writes "This story on ZD-Net and this recent story on Slashdot describe the recent open sourcing of IBM's voice recognition software. This release, unfortunately, doesn't include any source for the actual speech recognition engine. Olaf Schmidt, a developer on the KDE Accessibility Project, is quoted as saying 'There is no speech-recognition system available for Linux, which is a big gap.' In an attempt to close this gap, we have just released Sphinx-4, a state-of-the-art, speaker-independent, continuous speech recognition system written entirely in the Java programming language. It was created by researchers and engineers from Sun, CMU, MERL, HP, MIT and UCSC. Despite (or because of) being written in the Java programming language, Sphinx-4 performs as well as similar systems written in C. Here are the release notes and some performance data."

Aim You Sing Ate Write How by Anonymous Coward · 2004-09-28 11:19 · Score: 5, Funny

Ate lurks barry wall.
1. Re:Aim You Sing Ate Write How by BarryJacobsen · 2004-09-28 12:43 · Score: 5, Funny
  
  Ate lurks barry wall.
  
  Who ate my wall?
  
  --
  Track your TV Shows with your iPhone - FREE
But what about text to speech? by Anonymous Coward · 2004-09-28 11:26 · Score: 5, Interesting

When are we going to get GOOD text to speech, that uses modeled parameters of human vocal tracts rather than stitching together a bunch of pre-recorded phonemes?
1. Re:But what about text to speech? by Sheetrock · 2004-09-28 11:34 · Score: 5, Funny
  
  Given that there is already a rudimentary text-to-speech package available for Linux, and now a speech-to-text package, perhaps the secret is to pipe one to the other in a closed loop until one learns how to enunciate and the other how to listen?
  
  --
  
  Try not. Do or do not, there is no try.
  -- Dr. Spock, stardate 2822-3.
Virtual Machine Syndrome by nihilogos · 2004-09-28 11:30 · Score: 5, Funny

Colloquially known as "pointer-envy", this condition may affect all programmers, but is especially prevalent in Java and C# developers. It is most easily recognized in a release announcement, where for no reason whatsoever the afflicted developer suddenly interjects a statement like "and it's just as fast as C", to the bewilderment of the audience.

Treat suspected cases with caution, and under no condition contradict the patient. There is no known cure.

--
:wq
1. Re:Virtual Machine Syndrome by Xeger · 2004-09-28 11:45 · Score: 5, Funny
  
  KNOWN CAUSES: Recent research results from information-theoretic psychoanalysts show that Virtual Machine Syndrome is most likely a pre-emptive defensive discourse strategy. VMS sufferers typically become symptomatic after months or years of constant haranguing at the hands of colleagues, friends and professional contacts that anything they write, regardless of its execution environment or portability requirements, could have been done "better and faster in C." Oftentimes, such criticism is levied against VMS sufferers even when the application in question is I/O-bound and spends 80% or more of its time suspended, waiting for network or disk I/O to complete.
  
  TREATMENT: Implement reliable and efficient systems using virtual machine of choice, regardless of criticisms. Apply free-market therapy judiciously, allowing adopters of Virtual Machine technology to thrive and become prosperous if warranted. VMS symptoms typically disappear when sufferer's stock options are valued at 300% of their strike price. Symptoms may also be temporarily relieved through just-in-time compilation.
  
  RELATED SYNDROMES: Ossified Self-Important Myopia (OSIM), which is the tendency to assume that one's favorite programming paradigm, language, or OS is unconditionally and unreservedly the best choice for any software project. Characterised by the inability to understand that the only way to guarantee maximum efficiency is to write everything in assembly language, with complete and perfect knowledge of all quirks of the specific target instruction set.
2. Re:Virtual Machine Syndrome by pslam · 2004-09-28 11:50 · Score: 5, Informative
  
  It is most easily recognized in a release announcement, where for no reason whatsoever the afflicted developer suddenly interjects a statement like "and it's just as fast as C", to the bewilderment of the audience.
  An especially odd statement considering much of speech recognition can be broken down into great big vector operations, which are perfect for hand coding in C. Bet I could quadruple the speed of it in a couple of hours with some hand coded SIMD ops in x86 assembler.
  It's funny because Java is fantastic at JIT compiling code with lots of non-local behaviour (e.g. complex UIs) because it can take into account global behaviour at runtime. But it sucks at tight, heavy computation loops. DSP is a fantastic example of something Java is going to get creamed at when pitched against non-virtual machines.
  Of course, if you have some cross-platform standard API calls for those vector DSP ops, then it's a different argument...
3. Re:Virtual Machine Syndrome by IgnoramusMaximus · 2004-09-28 14:49 · Score: 5, Insightful
  
  Stuff written in Java is better than stuff written in C or C++ because there are no frapping buffer overflows in Java code
  True, instead there are a thousand "super-efficient" .jar libraries required by a "Hello World" app, which use the "Object Oriented Programming and Long Lasting Cure All and Testicular Itch Relief Paradigm(tm)" to such extremes that it takes 12 objects instantiated in 4 containers to flip a bit in a byte. Additionally, there is the substitution of the native performance of compiled code with code compiled "Just Too Late", combined with the exceptional memory usage that entails. If that were not enough, as a bonus, we get the garbage collector, which is scientifically fine-tuned to run just when the user is expected to interact with the application in the most time-sensitive manner. As icing on the cake we are also treated to multiple, insidiously incompatible versions of the so-called "universal" VM, resulting in one app demanding that the Sun VM is used and another that the MS VM is used, thus making it totally impossible to use both at the same time. Yes, I do speak from looong and utterly infuriating experience with Java apps.
  At one point I considered printing a sign warning of Java advocates being shot on sight. I could probably make some serious money selling it, given the similar amount of grief my other colleagues are going through.
  Ahem, and yes, the greatest offenders in my experience are... err... frigging IBM Java apps. We actually abandoned the DB2 8.x release because no one could deal with the havoc the DB2 admin tools were causing with various other retarded banking related Java apps.
Sphinx 2 by PiGuy · 2004-09-28 11:44 · Score: 5, Informative

"There is no speech-recognition system available for Linux, which is a big gap."

Um, Sphinx 2 (a predecessor of Sphinx 4) has been around for quite some time now. Like Sphinx 4, it's speaker-independent. Unlike Sphinx 4, it's a C library, and is thus easily interfaced with other languages (insert shameless plug for a simple Python interface for Sphinx 2 I wrote).
Obligatory Commercial Break by Noksagt · 2004-09-28 11:45 · Score: 5, Funny

Woman: [dictating into cell phone] To: Mike. I had fun last night.
Cell Phone: To: Mike. I have lip fungus.
Woman: [into cell phone, angrily] I had FUN, not lip fungus!
Cell Phone: I have fungus, not lip fungus.
Woman: I DON'T HAVE LIP FUNGUS!!!
OT Star Wars Nitpick by Anonymous Coward · 2004-09-28 11:49 · Score: 5, Informative

Hey moron, it's R2D2 that beep-booped. C3PO was fluent in over 6 million forms of communication. ;-)
Obviously by Moderation+abuser · 2004-09-28 11:59 · Score: 5, Funny

"This data was collected on a dual CPU UltraSPARC(R)-III running at 1015 MHz with 2G of memory."

Looking at the performance data it just blazes along on that config. Not exactly what I'd call an embeddable system, though Microsoft might beg to differ.

--
Government of the people, by corporate executives, for corporate profits.
There's more than one kind of overhead. by argent · 2004-09-28 12:17 · Score: 5, Insightful

I could easily live with 10-15% slower, IF Java didn't have the startup overhead. I can run inetd-style fork-exec-terminate servers in C on CPUs that a cellphone would spit on, and handle hundreds of connections a second. Bringing up a JVM on the same processor would take minutes. Bringing up a JIT runtime would be out of the question.

For applications where you can create a JVM and use it as you need it, Java's great. Webservers, sure, no problem. Desktop applications, heck, the GUI overhead's getting to be the same order of magnitude (though that HAS to change, we can't afford to depend on Moore's Law much longer unless someone comes up with a clever way to cut the power consumption of processors faster than the speed increases). Browser plugins? For content, yes, but not for navigation... if it takes 10s to start up a JVM your customer's already hit "back".
I used to work for MacSpeech, doing UI work by notthepainter · 2004-09-28 13:33 · Score: 5, Insightful

And before that, I worked for Articulate Systems, also doing UI work.
With that said, you can probably guess I have a lot to say about Speech Recognition. (Not Voice Recognition, that's different, that would be able to distinguish Ben from Charlie for example.)
A good SR engine is, of course, essential. And I've not read the details on the two recent giveaways, but I suspect that they are only the engine.
The SR engine is just a beginning. There is a ton of UI work that needs to be done. Sit and think about spacing around punctuation marks and then think about capitalization around punctuation marks. Yeah, it is all pretty cut and dried and known but the details really need to be sweated to get it right. This is very time consuming.
Next you have to worry about exactly where you are editing. Is that into Microsoft Word (or Open Office), or emacs, or where? It can make a huge difference when you want to go back and correct misrecognitions. You just don't want to send N delete characters and retype it, that results in a lousy user experience. So just exactly where is the input cursor at all times? This is not an impossible problem, but one where the details must be sweated.
Next is command and control. Just how are you going to let the user grab the text of all the menus and all the text in the dialog box buttons? Again, not impossible, but more of those pesky details.
Finally, is your SR engine good enough? Maybe, maybe not. Let's just say that 98% accuracy might look good on paper, but that is one in 50 words wrong. Unless your correction mechanism is smooth, an error rate that high will greatly slow you down.
Is Open Source SR a good thing? Oh yes sir, yes! But let's not forget the details. One thing the Open Source community has been accused of, perhaps justly, perhaps unjustly, is not sweating the details.
Speech Recognition has an awful lot of details.
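
To make the punctuation point above concrete, here is a minimal Python sketch of the spacing and capitalization pass. It assumes the engine hands back a flat list of lowercase words and punctuation symbols; the function name format_dictation and the two rule sets are hypothetical, and this is not how Sphinx-4 or any commercial product actually does it.

```python
# Hypothetical rule sets; a real product would need many more cases
# (quotes, brackets, abbreviations, numbers, and so on).
SENTENCE_END = {".", "?", "!"}
NO_SPACE_BEFORE = {".", ",", "?", "!", ";", ":"}


def format_dictation(tokens):
    """Join recognized tokens into text with sane spacing and capitalization."""
    pieces = []
    capitalize_next = True  # the first word of a dictation starts a sentence
    for tok in tokens:
        if tok in NO_SPACE_BEFORE:
            # Punctuation attaches directly to the previous word.
            pieces.append(tok)
            if tok in SENTENCE_END:
                capitalize_next = True
            continue
        if capitalize_next:
            # Upper-case only the first letter so the rest of the word is kept.
            tok = tok[:1].upper() + tok[1:]
            capitalize_next = False
        if pieces:
            pieces.append(" ")  # exactly one space before every later word
        pieces.append(tok)
    return "".join(pieces)


print(format_dictation(["i", "had", "fun", "last", "night", ".", "did", "you", "?"]))
```

Even this toy version leaves the hard cases untouched: it knows nothing about text already sitting in the target application, which is exactly the "where is the input cursor" problem described above.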
Why speech recognition on Linux will kill Windows by MarsF · 2004-09-28 13:41 · Score: 5, Insightful

I was thinking about this the other day, and was wondering if this is a huge gap in the Windows user interaction model.

Think about how you input info using Windows. You click on a few locations using the mouse, perhaps use some keyboard input, click some more. The output from these inputs is arbitrary: it may result in anything from a 'File/Save' dialog to a custom error dialog box. There is no linear path for inputting commands, or for mapping inputs to results.

Compare this to the command line. You enter a few distinct atomic commands, and view the results in the same medium. You then enter more commands, refining your actions. The key here is that you already have a linear model for input that produces well defined expected results, all in a common medium that is conceptually simple, visible to the user, and easily processed by machines. Extending this model to accept voice input or output is trivial.

How is one supposed to quantify basic tasks and turn them into equivalent voice commands without a baseline framework or paradigm to extend from? How do you automate, simplify, or extend existing tasks without a common input or output medium? GUIs provide no such medium or framework; that same framework is at the heart of the command line interface!

Perhaps this is why we never saw voice recognition technology take off on Windows. It's blinking impossible to script actions for an arbitrary task, let alone process the arbitrary results!

On a similar note we may see voice recognition on Linux take off like a rocket. Anybody can add voice recognition to perform almost any command because the actions are all scriptable through the CLI already. If you can type it, you can get your computer to do it when you say 'computer, foo!'
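
As a rough illustration of that "if you can type it, you can say it" idea, the Python sketch below maps an already-recognized phrase to a shell command. Everything in it is an assumption for the sake of the example: the VOICE_COMMANDS table, the phrases, and the function names are made up, and the recognizer itself is left out entirely.

```python
import shlex
import subprocess

# Hypothetical phrase-to-command table; the phrases and commands are
# illustrative only and would come from user configuration in practice.
VOICE_COMMANDS = {
    "list files": "ls -l",
    "show disk usage": "df -h",
    "what time is it": "date",
}


def command_for(phrase):
    """Return the argv list for a recognized phrase, or None if unknown."""
    key = " ".join(phrase.lower().split())  # normalize case and whitespace
    command = VOICE_COMMANDS.get(key)
    return shlex.split(command) if command else None


def run_voice_command(phrase):
    """Run the command mapped to the phrase and return its standard output."""
    argv = command_for(phrase)
    if argv is None:
        return None  # unrecognized phrase: do nothing rather than guess
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout
```

Calling run_voice_command("what time is it") would run date and hand back its text output, which could in turn be piped to a text-to-speech program. The lookup is the easy part; doing the same for an arbitrary GUI dialog is the gap being described here.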

Mars

P.S. It would be greatly appreciated if someone could please clarify my point. It's buried in there somewhere...