Open Source Speech Recognition - With Source
Paul Lamere writes "This story
on ZD-Net and this recent story
on Slashdot
describe the recent open sourcing of IBM's voice
recognition software. This release, unfortunately, doesn't include
any source for the actual speech recognition engine. Olaf Schmidt, a
developer on the KDE Accessibility Project,
is quoted as saying 'There is no speech-recognition system available
for Linux, which is a big gap.' In an attempt to close this gap, we
have just released Sphinx-4,
a state-of-the-art, speaker-independent, continuous
speech recognition system written entirely in the Java programming
language. It was created by researchers and engineers from Sun, CMU,
MERL, HP, MIT and UCSC. Despite (or because of) being written in the
Java programming language, Sphinx-4 performs as well as similar
systems written in C. Here are the release notes and
some performance data."
By "we" I assume you mean "the open source community", and the answer is "when you get off your ass and code it". If by "we" you mean the world at large, then go and look at AT&T's Natural Voices project.
How we know is more important than what we know.
Title: I'm(Aim) using(You Sing) it(Ate) right(Write) now(How)
Body: It(Ate) works(lurks) very(barry) well(wall).
Your CPU is not doing anything else, at least do something.
"There is no speech-recognition system available for Linux, which is a big gap."
Um, Sphinx 2 (a predecessor of Sphinx 4) has been around for quite some time now. Like Sphinx 4, it's speaker-independent. Unlike Sphinx 4, it's a C library, and is thus easily interfaced with other languages (insert shameless plug for a simple Python interface for Sphinx 2 I wrote).
Maybe you like the Cambridge HTK better, then ;-)
--
Try Nuggets, the mobile search engine. We answer your questions via SMS, across the UK.
Hey moron, it's R2D2 that beep-booped. C3PO was fluent in over 6 million forms of communication. ;-)
An especially odd statement considering much of speech recognition can be broken down into great big vector operations, which are perfect for hand coding in C. Bet I could quadruple the speed of it in a couple of hours with some hand-coded SIMD ops in x86 assembler.
It's funny because Java is fantastic at JIT compiling code with lots of non-local behaviour (e.g. complex UIs) because it can take into account global behaviour at runtime. But it sucks at tight, heavy computation loops. DSP is a fantastic example of something Java is going to get creamed at when pitched against non-virtual machines.
Of course, if you have some cross-platform standard API calls for those vector DSP ops, then it's a different argument...
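To make the "great big vector operations" concrete, here is a minimal Python sketch of the kind of tight loop a statistical recognizer typically runs a huge number of times per second: scoring one feature vector against a diagonal-covariance Gaussian. This is an illustration only, not code from Sphinx-4; the function name and the sample values are made up.

```python
import math

def diag_gaussian_log_likelihood(x, mean, var):
    """Log density of vector x under a Gaussian with diagonal covariance.

    Hypothetical helper for illustration; x, mean and var are equal-length
    sequences of floats, and every variance must be positive.
    """
    if not (len(x) == len(mean) == len(var)):
        raise ValueError("x, mean and var must have the same length")
    total = 0.0
    # One multiply-add chain per dimension: exactly the sort of loop
    # that SIMD instructions or a vectorised library call would speed up.
    for xi, mi, vi in zip(x, mean, var):
        diff = xi - mi
        total += math.log(2.0 * math.pi * vi) + diff * diff / vi
    return -0.5 * total

# A vector close to the mean scores higher than one far from it.
near = diag_gaussian_log_likelihood([0.1, -0.1], [0.0, 0.0], [1.0, 1.0])
far = diag_gaussian_log_likelihood([3.0, -3.0], [0.0, 0.0], [1.0, 1.0])
```

Whether a JIT or a hand-tuned C loop wins here is exactly the argument above; the sketch only shows the shape of the work.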
December 2003 http://www.voip-info.org/wiki-Sphinx
The speaker-independent feature is the best part. Not all words were recognized; about 70% were. Probably because I slur the other 30%. It works equally well with either my wife or myself issuing commands.
70% is more than I need for this particular project, but I'm sure this new release closes the gap even further.
I am billdar, and I approve this message.
"Speech recognition is one of the worst means of input there is for a computer. Keyboards work so much better."
This statement is far too general to be true. The keyboard is only faster if you know what command you're trying to enter AND how to spell it. Voice recognition, used correctly, is much more intuitive. Maybe it's not so hot for dictation, but imagine if an app you're using didn't have to have a bunch of hard-to-sift-through menus. Just say 'Italic!' or 'Bold!'
SR is much more interesting on simplified devices, though. I have a TabletPC. In Tablet mode, the kb is tucked under the screen. I have stylus buttons for copy/paste etc, but the voice recognition works better. (Although, as with your cubicle comment, my gf found that annoying.)
It's a lot easier to find problems with something than it is to find good aspects of it. I dunno why this comment was modded 'interesting'.
"Derp de derp."
Speech recognition is not really a solved problem. For some applications it works adequately, but if you take a look at the error rates for the Sphinx system to which the post links, you'll see that the Word Error Rate for large vocabulary is over 18%. Even for 5,000 words it is 7%. For many applications that is unacceptable.
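For anyone unsure what those percentages measure: Word Error Rate is conventionally the word-level edit distance (substitutions + deletions + insertions) between the recognizer output and a reference transcript, divided by the number of words in the reference. A minimal Python sketch follows; the function name is my own, and this is not the scoring tool used to produce the Sphinx-4 figures.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# One wrong word out of four gives a WER of 25%.
print(word_error_rate("it works very well", "it lurks very well"))  # 0.25
```

Note that WER can exceed 100% when the recognizer inserts many extra words, since the denominator is the reference length only.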
A second factor is that these statistical speech recognition systems require extensive data for their acoustic and language models. Building such a system requires recording real speech, segmenting it and creating a set of examples from which to compute the probabilities, which requires some knowledge of acoustic phonetics, and then doing the computation for the model. This is time-consuming.
Speech recognition technology isn't a dark secret, but it isn't trivial to create a system with good performance either.
I find that startup/shutdown for a simple Java program takes about 200ms at 1GHz with the vanilla Sun JDK 1.5 JVM, or 150ms using gcj (gcc), and an equivalent C program takes about 2ms.
The overhead of starting a JVM should be incurred only once per browsing session.
Larry
Look around at the Sphinx project and FreeTTS. A lot of really good voice stuff going on. FreeTTS is excellent, and if you've got time, you can even model your own voice :) Anyway... I forget which of the two sites I got to it from (think it was Sphinx), but it's got a whole scenario with an airport calling program. It's very, very nice and sounds great.
Regards,
Steve
That's true. It is an underestimated problem. People assume that we can recognise a word by using just the sound of it. That is simply not true. When speaking at a reasonable speed, humans do not utter words clearly. This is not a problem for us because we can guess the words by using context and semantics.
In order to have a good speech recognition system, the computer would have to actually understand the meaning of the sentences and put it in context. There are different levels of analysis necessary to do this. The system has to analyse sound, morphology, syntax, semantics and pragmatics. Each level contains ambiguity but when two levels are combined together some ambiguity is resolved and you get another piece of the puzzle. When everything is combined the ambiguous parts all converge towards the right meaning, the right syntax, the right morphology and in the case of speech to text, the right choice of word and spelling.
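A toy Python sketch of two levels resolving an ambiguity together: the homophones "right" and "write" get identical acoustic scores, and a bigram table (a very crude stand-in for context) breaks the tie. All probabilities and names here are invented for illustration; a real system would estimate them from data.

```python
import math

# Hypothetical acoustic log-probabilities. Homophones sound the same,
# so the acoustic level alone cannot tell them apart.
ACOUSTIC = {"right": math.log(0.5), "write": math.log(0.5)}

# Hypothetical bigram probabilities P(word | previous word).
BIGRAM = {
    ("turn", "right"): 0.20,
    ("turn", "write"): 0.001,
    ("to", "write"): 0.05,
    ("to", "right"): 0.002,
}

def best_word(previous, candidates, floor=1e-6):
    """Pick the candidate with the highest combined log score.

    Unseen word pairs fall back to a small floor probability so that
    math.log never receives zero.
    """
    def score(word):
        context_prob = BIGRAM.get((previous, word), floor)
        return ACOUSTIC[word] + math.log(context_prob)
    return max(candidates, key=score)

print(best_word("turn", ["right", "write"]))  # right
print(best_word("to", ["right", "write"]))    # write
```

A bigram only looks one word back, so it is one of the shortcuts rather than real understanding, but it shows how combining levels removes ambiguity that neither level resolves alone.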
Now to do all that you ideally need sophisticated knowledge representation based on cognitive science and the way we think. Although there exist tricks and shortcuts that can mimic the important parts of the cognitive system, there isn't any complete system that integrates everything well.
Anyway, if you want a summary of the field, read the textbook "Speech and Language Processing" by Daniel Jurafsky & James H. Martin.
And search on google for "computational linguistics" "word grammar" "open mind common sense" "cyc" "Ray Jackendoff"
In theory, it's all compiled down to assembly in the end anyway so it has equal chance of being just as fast. For some types of code, JIT can be faster.
Some of the advantages of byte-code are:
. branch prediction and other speculative optimisations can be done based on observing the flow at runtime rather than guessing at compile time.
. it's not necessarily tied to a specific architecture
. if code optimisation technology improves, you don't need to recompile anything; the new JIT engine can do it all for you.
. as above but for bugs in the optimiser.
I'm sure someone else will point out some of the disadvantages.
There are problems with it being in Java.
Embedded systems cannot use it without huge overhead. If it were in C, then it could really give a boost to the embedded Linux market.
Sigh, it's still a huge gap for what I do. There's no room for a Java VM in my embedded systems unless I double all my hardware costs.
On another note, it just might help fill in the other huge gap in Linux: there is no navigation software to use with map data and a GPS.
And no, kiddies, GPSdrive is NOT navigation.
Do not look at laser with remaining good eye.
Alma.
w w.memoire.com/guillaume-desnoix/alma/+&hl=en
It can read several high-level languages, build an internal representation, and then convert that to other high-level languages.
It is a great tool to help port this software to C for example.
Unfortunately the site seems to have gone, although I have used this software in the past.
See the google cache though: http://66.102.9.104/search?q=cache:Dbw7OX6Tco4J:w
blog.sam.liddicott.com
Check out gcj. One of its primary uses is targeting embedded systems. It's quite lean and mean for a Java runtime.
HTH.
Galileo: "The Earth revolves around the Sun!"
Score: -1 100% Flamebait
Can't gcc compile Java code directly to native binary code?
Does this mean that one could make a shared library out of the Java code for C programmers to use?
WayBack has it.
I've also mirrored the source Just In Case (that's an ADSL link, you'd be better off downloading it directly from WayBack).
Got time? Spend some of it coding or testing
Note that the 2002 version does not include the dialog server, which would be great to have too. MIT also has some very cool technologies in this area - SUMMIT, TINA, GENESIS, ... - which I do not believe are public. They only show little bits and pieces of PR about them, but they include natural language parsing, question answering, sentence generation, etc. It would be cool if someone on the inside could document just what things are available, what works with what, what is definitely ready for prime time, etc. There must be some people who hacked on this in the past few years and are still developing things. It would be cool if some of their experimentation was available to the open source community so people could get an idea of what things are possible. When I did my survey just about 1 year ago, Communicator was daunting, intriguing, and it looked like you could do tons of stuff if you had some secret decoder docs and a spare year to hack. Maybe now's the time to dig into it hip deep?