Open Source Speech Recognition - With Source
Paul Lamere writes "This story
on ZD-Net and this recent story
on Slashdot
describes the recent open sourcing of IBM's voice
recognition software. This release, unfortunately, doesn't include
any source for the actual speech recognition engine. Olaf Schmidt, a
developer on the KDE Accessibility Project,
is quoted as saying 'There is no speech-recognition system available
for Linux, which is a big gap.' In an attempt to close this gap, we
have just released Sphinx-4,
a state-of-the-art, speaker-independent, continuous
speech recognition system written entirely in the Java programming
language. It was created by researchers and engineers from Sun, CMU,
MERL, HP, MIT and UCSC. Despite (or because of) being written in the
Java programming language, Sphinx-4 performs as well as similar
systems written in C. Here are the release notes and
some performance data."
Ate lurks barry wall.
Quick someone port this to C.
"Open Source Speech Recognition - With Source"
"This release, unfortunately, doesn't include any source for the actual speech recognition engine."
Despite (or because of) being written in the Java programming language, Sphinx-4 performs as well as similar systems written in C.
I'm sick of these comments. Anyone who needs to know about the performance of Java knows it's very fast. Why bother commenting about it anymore?
It's like saying "... and because it was written in C, it's very fast...", as if we didn't know already.
When are we going to get GOOD text to speech, that uses modeled parameters of human vocal tracts rather than stitching together a bunch of pre-recorded phonemes?
Colloquially known as "pointer-envy", this condition may affect all programmers, but is especially prevalent in Java and C# developers. It is most easily recognized in a release announcement, where for no reason whatsoever the afflicted developer suddenly interjects a statement like "and it's just as fast as C", to the bewilderment of the audience.
Treat suspected cases with caution, and under no condition contradict the patient. There is no known cure.
:wq
Now my linux box can wreck a nice beach!
#include "humorous_pop_culture_reference.h"
From dept-of-redundancy-department?
I'm not one to be picky about titles, but sheesh...
Title: I'm(Aim) using(You Sing) it(Ate) right(Write) now(How)
Body: It(Ate) works(lurks) very(barry) well(wall).
Your CPU is not doing anything else, at least do something.
"There is no speech-recognition system available for Linux, which is a big gap."
Um, Sphinx 2 (a predecessor of Sphinx 4) has been around for quite some time now. Like Sphinx 4, it's speaker-independent. Unlike Sphinx 4, it's a C library, and is thus easily interfaced with other languages (insert shameless plug for a simple Python interface for Sphinx 2 I wrote).
Speech recognition is one of the worst means of input there is for a computer. Keyboards work so much better. Even for those who don't have full use of their hands, there are many other options for user input, all of which are better than speech recognition. Worst thing ever is someone trying to use speech input in a cubicle environment.
Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Woman: [dictating into cell phone] To: Mike. I had fun last night.
Cell Phone: To: Mike. I have lip fungus.
Woman: [into cell phone, angrily] I had FUN, not lip fungus!
Cell Phone: I have fungus, not lip fungus.
Woman: I DON'T HAVE LIP FUNGUS!!!
I've used a few packages for speech recognition but none really got me too excited. Well, Dragon NaturallySpeaking did have me read a few chapters of Dave Barry to it. I bet it didn't work because of all the laughing; I was in tears.
I must say though that speech recognition is something that the whole computer community needs to work on. Now, we can finally do that. All the "open source community" needs is source that works a little. In a year or so, I bet this works better than most options available today.
Now, I know that isn't the rule but this is the type of thing that computer/math engineers could sit down to and contribute where others can't. It seems to be the rule that the really smart ones tend to work with open source software...
Really the cool thing is that this could get people involved who otherwise wouldn't because they don't know where to start.
Get your Unix fortune now!
Hey moron, it's R2D2 that beep-booped. C3PO was fluent in over 6 million forms of communication. ;-)
Ba-dum-dum ding!
Computers are useless. They can only give you answers.
-- Pablo Picasso
"This data was collected on a dual CPU UltraSPARC(R)-III running at 1015 MHz with 2G of memory."
Looking at the performance data it just blazes along on that config. Not exactly what I'd call an embeddable system, though Microsoft might beg to differ.
Government of the people, by corporate executives, for corporate profits.
December 2003 http://www.voip-info.org/wiki-Sphinx
Guess I won't be listening to music when root anymore. In fact I am soundproofing my room to keep the noises from infiltrating my microphone and causing me to accidentally delete /home
Many benchmarks have shown that a modern optimized JVM with JIT compilation is roughly equivalent to most implementations of C++, with some benchmarks being better for Java and some being better for C++.
And many studies have shown that going with Microsoft software is cheaper than going with open sourced software.
:wq
I could easily live with 10-15% slower, IF Java didn't have the startup overhead. I can run inetd-style fork-exec-terminate servers in C on CPUs that a cellphone would spit on, and handle hundreds of connections a second. Bringing up a JVM on the same processor would take minutes. Bringing up a JIT runtime would be out of the question.
For applications where you can create a JVM and use it as you need it, Java's great. Webservers, sure, no problem. Desktop applications, heck, the GUI overhead's getting to be the same order of magnitude (though that HAS to change, we can't afford to depend on Moore's Law much longer unless someone comes up with a clever way to cut the power consumption of processors faster than the speed increases). Browser plugins? For content, yes, but not for navigation... if it takes 10s to start up a JVM your customer's already hit "back".
You are, of course, perfectly correct in everything you said.
There are a number of HCI aspects where speech recognition is not a good solution.
However, let me enumerate a number of other ones, where it's superior:
Minutes of meetings, or similar. Imagine having a verbatim record of a discussion there by the time you get back to your desk.
Someone who cannot type - e.g. no hands. Rare, granted, but still a viable use.
Someone whose hands are busy. The canonical example here is a pathologist doing an autopsy, where they dictate everything. Speech recognition saves time in transcription (and money for the audio typist).
I'd love to be able to issue voice commands to a computer, for a few, isolated cases. For example, diagnosing hardware. Bring up a doc, and be able to get the computer to flip pages, without having to remove the probes from the hardware. Relocating them is a pain, and sucks time.
Moreover, I'm certain that there are others, some of which will only be realised when it's common and cheap enough to be widely available.
It's like a mouse. It's one of the worst general purpose input devices for a computer [0], but it excels at indicating a single element on a display. The mouse and keyboard complement each other, and there are a bunch of other, more specific input devices, such as the graphics tablet. I have no doubt that if speech recognition was as accurate and reliable as a graphics tablet, it would get a similar amount of use.
[0] Try inputting a block of prose with only a mouse. Even specialist software makes it only suck marginally less.
Speech recognition is not really a solved problem. For some applications it works adequately, but if you take a look at the error rates for the Sphinx system to which the post links, you'll see that the Word Error Rate for large vocabulary is over 18%. Even for 5,000 words it is 7%. For many applications that is unacceptable.
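For anyone wondering what that number measures: Word Error Rate is conventionally the word-level edit distance (substitutions, deletions and insertions) between what was said and what was recognized, divided by the number of words actually said. A minimal Python sketch follows; the function name is made up for illustration and this is not the Sphinx scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words (classic Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the rate can exceed 100% when the recognizer outputs many more words than were spoken.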
A second factor is that these statistical speech recognition systems require extensive data for their language model. Building such a system requires recording real speech, segmenting it and creating a set of examples from which to compute the probabilities, which requires some knowledge of acoustic phonetics, and doing the computation for the model. This is time-consuming.
Speech recognition technology isn't a dark secret, but it isn't trivial to create a system with good performance either.
The very small vocabulary needed for desktop control makes the speech recognition much more accurate and usable.
With that said, you can probably guess I have a lot to say about Speech Recognition. (Not Voice Recognition, that's different, that would be able to distinguish Ben from Charlie for example.)
A good SR engine is, of course, essential. And I've not read the details on the two recent giveaways, but I suspect that they are only the engine.
The SR engine is just a beginning. There is a ton of UI work that needs to be done. Sit and think about spacing around punctuation marks and then think about capitalization around punctuation marks. Yeah, it is all pretty cut and dried and known but the details really need to be sweated to get it right. This is very time consuming.
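To give a flavor of the spacing and capitalization problem, here is a rough Python sketch. It assumes the engine hands over a flat list of lowercase tokens with punctuation as separate tokens; that token format and the function name are assumptions for illustration, not how any particular engine works.

```python
SENTENCE_END = {".", "?", "!"}
NO_SPACE_BEFORE = SENTENCE_END | {",", ";", ":"}


def format_dictation(tokens):
    """Join dictated tokens: no space before punctuation, capitalize sentence starts."""
    pieces = []
    capitalize_next = True  # the first word of the text starts a sentence
    for token in tokens:
        if token in NO_SPACE_BEFORE:
            pieces.append(token)
            if token in SENTENCE_END:
                capitalize_next = True
            continue
        word = token[:1].upper() + token[1:] if capitalize_next else token
        # Every word except the very first piece gets a leading space.
        pieces.append((" " if pieces else "") + word)
        capitalize_next = False
    return "".join(pieces)
```

Even this toy version skips plenty: quotes, parentheses, abbreviations like "e.g.", the pronoun "I", and numbers all need their own rules.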
Next you have to worry about exactly where you are editing. Is that into Microsoft Word (or Open Office), or emacs, or where? It can make a huge difference when you want to go back and correct misrecognitions. You just don't want to send N delete characters and retype it, that results in a lousy user experience. So just exactly where is the input cursor at all times? This is not an impossible problem, but one where the details must be sweated.
Next is command and control. Just how are you going to let the user grab the text of all the menus and all the text in the dialog box buttons? Again, not impossible, but more of those pesky details.
Finally, is your SR engine good enough? Maybe, maybe not. Let's just say that 98% accuracy might look good on paper, but that is one in 50 words wrong. Unless your correction mechanism is smooth, an error rate that high greatly slows you down.
Is Open Source SR a good thing? Oh yes sir, yes! But let's not forget the details. One thing the Open Source community has been accused of, perhaps justly, perhaps unjustly, is not sweating the details.
Speech Recognition has an awful lot of details.
I was thinking about this the other day, and was wondering if this is a huge gap in the Windows user interaction model.
Think about how you input info using windows. You click on a few locations using the mouse, perhaps use some keyboard input, click some more. The output from these inputs is arbitrary: it may result in anything from a 'File/Save' dialog to a custom error dialog box. There is no linear path for inputting commands, or for mapping inputs to results.
Compare this to the command line. You enter a few distinct atomic commands, and view the results in the same medium. You then enter more commands, refining your actions. The key here is that you already have a linear model for input that produces well defined expected results, all in a common medium that is conceptually simple, visible to the user, and easily processed by machines. Extending this model to accept voice input or output is trivial.
How is one supposed to quantify basic tasks and turn them into equivalent voice commands without a baseline framework or paradigm to extend from? How do you automate, simplify, or extend existing tasks without a common input or output medium? GUIs provide no such medium or framework; that same framework is at the heart of the command line interface!
Perhaps this is why we never saw voice recognition technology take off on Windows. It's blinking impossible to script actions for an arbitrary task, let alone process the arbitrary results!
On a similar note we may see voice recognition on Linux take off like a rocket. Anybody can add voice recognition to perform almost any command because the actions are all scriptable through the CLI already. If you can type it, you can get your computer to do it when you say 'computer, foo!'
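A minimal Python sketch of the 'computer, foo!' idea, assuming the recognizer has already turned the audio into a text string. The wake word, the phrase table and both function names are hypothetical; nothing here comes from Sphinx itself.

```python
import shlex
import subprocess

WAKE_WORD = "computer"

# Hypothetical phrase-to-command table, invented for illustration.
COMMANDS = {
    "list files": "ls -l",
    "show date": "date",
    "disk usage": "df -h",
}


def parse_utterance(utterance, commands=COMMANDS, wake_word=WAKE_WORD):
    """Return the argv list for a recognized utterance, or None if it is not a known command."""
    words = utterance.lower().replace(",", " ").replace("!", " ").split()
    if not words or words[0] != wake_word:
        return None  # ignore anything not addressed to the computer
    command = commands.get(" ".join(words[1:]))
    return shlex.split(command) if command else None


def run_utterance(utterance):
    """Run the mapped command and return its standard output, or None if unrecognized."""
    argv = parse_utterance(utterance)
    if argv is None:
        return None
    return subprocess.run(argv, capture_output=True, text=True, check=False).stdout
```

The table is a fixed whitelist and the command is run as an argv list rather than through a shell, so a misrecognized phrase does nothing instead of running something unexpected.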
Mars
P.S. It would be greatly appreciated if someone could please clarify my point. It's buried in there somewhere...
While I've been waiting for Sphinx to mature into something useful for a long time now, the move to Java makes the whole package pretty useless to me.
:)
:P
Java is a memory hog, and it's certainly not going to be on any device I would want speech recognition on. Heck, I don't have Java installed on any of my machines, mostly because of the absolutely ridiculous footprint on disk as well as when running in ram.
And integrating Java applications into other applications is very difficult. Now, Java is good for certain things, but a speech recognition engine in Java sounds like the worst abuse possible.
That and I still can't train it to recognise my slight Australian accent, unlike every other bit of SR software I've used on Win32.
Whether or not Sphinx-4 works, and whether or not Java is 'fast' enough to do speech recognition processing, it's of no use to me.
"Tolerance is the virtue of the man without convictions." -- G. K. Chesterton
> It's amazing that the myth of Java being slow is so persistent
Before you mod me down as a Troll, I work on a virtual machine as a hobby.
The problems with Java being slow have little to do with the "execution of code" part. The parts that take a hit are the garbage collector and the class loader. The latter causes a HUGE hit at startup. The former is responsible for those strange Swing freezes I've been seeing when I switch into a Java app.
Unicode also brings its own set of junk; for example, "Hello World" in dotgnu's JIT does 7302 hashtable inserts and 6000+ StringBuffer operations to initialize the Unicode encoder/decoder. And that is the standard way of decoding Unicode (Mono uses the same code).
Lastly, C/C++ commonly uses a lot of fields while Java brings in get/set methods for these. A method call for a get or set is a LOT more expensive than a pointer read. Design has a lot to do with why Java is slow.
The enterprise apps where Java is popular are essentially backend applications which run for long periods of time (so have all the classes looked up and loaded) with a HUGE heap (256 MB or more) where an occasional GC freeze won't destroy the entire experience (as it is often JSP/Web based interfaces).
Java *is* fast, if you don't count the slow parts.
Quidquid latine dictum sit, altum videtur
Alma.
It can read several high-level languages, build an internal representation, and then convert that to other high-level languages.
It is a great tool to help port this software to C, for example.
Unfortunately the site seems to have gone, although I have used this software in the past.
See the Google cache though: http://66.102.9.104/search?q=cache:Dbw7OX6Tco4J:www.memoire.com/guillaume-desnoix/alma/+&hl=en
blog.sam.liddicott.com
I think some people should open their eyes, otherwise the world will leave you behind while you are happily consoling each other how Java is slow and unusable. Wake up, folks!
To people who argue for hand-writing C and assembly - well, you obviously didn't try to implement any of the algorithms (like hidden Markov models or the statistical searches) used in speech recognition. It is a pain in the butt to do it even in Java, but at least you do not have the pointer mess you would have in C/C++. The engine has good performance already; I am not sure what you would gain by rewriting it, except bugs (the older Sphinx2 was for sure buggy as hell).
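For readers who have never seen one of these algorithms, here is a toy Python sketch of Viterbi decoding, the standard way to find the most likely hidden state sequence in a hidden Markov model. The two states, three observation labels and all probabilities are invented for illustration; this is not Sphinx code, and a real recognizer works in log probabilities with beam pruning over vastly larger models.

```python
# Toy model: every name and number below is made up for illustration.
STATES = ["AH", "B"]
START_P = {"AH": 0.6, "B": 0.4}
TRANS_P = {
    "AH": {"AH": 0.7, "B": 0.3},
    "B": {"AH": 0.4, "B": 0.6},
}
EMIT_P = {
    "AH": {"f1": 0.5, "f2": 0.4, "f3": 0.1},
    "B": {"f1": 0.1, "f2": 0.3, "f3": 0.6},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    if not observations:
        raise ValueError("need at least one observation")

    # best[s] = (probability, path) of the best path so far that ends in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the predecessor state that maximizes the probability of arriving in s.
            prob, path = max(
                ((best[p][0] * trans_p[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])
```

Calling `viterbi(["f1", "f2", "f3"], STATES, START_P, TRANS_P, EMIT_P)` on the toy model picks the path AH, AH, B.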
Something about the memory footprint. Java can have a large memory footprint; however, with speech recognition you will always have it. Just the acoustic models for one language can easily be in the order of several hundreds of megabytes. The memory footprint of Java is completely irrelevant here.
And before somebody compares Sphinx with speech "recognition" on your mobile phone or in your car - be aware that you are comparing a skateboard with a Concorde here. The Sphinx family of engines is intended for recognition of continuous, large-vocabulary speech and to be speaker independent. Your phone/car is small vocabulary, single words and speaker dependent - i.e. a completely different problem. You cannot think about Sphinx as something "to have on some device". It is more intended to act as a speech recognition server on a dedicated machine, e.g. for a large call center or ticket reservation system. I guess it could be used also in KDE for the KAccessibility purposes, but it is a bit heavy for that (especially with the large datasets).
So next time, before you start spouting BS about Java and applications written in it, at least check the facts. People will not see you as a complete idiot.