Slashdot Mirror


Open Source Speech Recognition - With Source

Paul Lamere writes "This story on ZD-Net and this recent story on Slashdot describe the recent open sourcing of IBM's voice recognition software. This release, unfortunately, doesn't include any source for the actual speech recognition engine. Olaf Schmidt, a developer on the KDE Accessibility Project, is quoted as saying 'There is no speech-recognition system available for Linux, which is a big gap.' In an attempt to close this gap, we have just released Sphinx-4, a state-of-the-art, speaker-independent, continuous speech recognition system written entirely in the Java programming language. It was created by researchers and engineers from Sun, CMU, MERL, HP, MIT and UCSC. Despite (or because of) being written in the Java programming language, Sphinx-4 performs as well as similar systems written in C. Here are the release notes and some performance data."

21 of 404 comments

  1. Re:But what about text to speech? by QuantumG · · Score: 4, Informative

    By "we" I assume you mean "the open source community", and the answer is "when you get off your ass and code it". If by "we" you mean the world at large, then go and look at AT&T's Natural Voices project.

    --
    How we know is more important than what we know.
  2. Translation for those who still don't get it... by CaptainPinko · · Score: 4, Informative

    Title: I'm(Aim) using(You Sing) it(Ate) right(Write) now(How)
    Body: It(Ate) works(lurks) very(barry) well(wall).

    --
    Your CPU is not doing anything else, at least do something.
  3. Sphinx 2 by PiGuy · · Score: 5, Informative

    "There is no speech-recognition system available for Linux, which is a big gap."

    Um, Sphinx 2 (a predecessor of Sphinx 4) has been around for quite some time now. Like Sphinx 4, it's speaker-independent. Unlike Sphinx 4, it's a C library, and is thus easily interfaced with other languages (insert shameless plug for a simple Python interface for Sphinx 2 I wrote).

    1. Re:Sphinx 2 by Ndr_Amigo · · Score: 2, Informative

      The speaker-independence of Sphinx 2 is debatable; I have never been able to get a single word successfully recognised :)

  4. Another alternative: HTK by j.leidner · · Score: 2, Informative
    Dear anonymous,

    Maybe you like the Cambridge HTK better, then ;-)

    --
    Try Nuggets, the mobile search engine. We answer your questions via SMS, across the UK.

  5. OT Star Wars Nitpick by Anonymous Coward · · Score: 5, Informative

    Hey moron, it's R2D2 that beep-booped. C3PO was fluent in over 6 million forms of communication. ;-)

  6. Re:Virtual Machine Syndrome by pslam · · Score: 5, Informative
    It is most easily recognized in a release announcement, where for no reason whatsoever the afflicted developer suddenly interjects a statement like "and it's just as fast as C", to the bewilderment of the audience.

    An especially odd statement considering much of speech recognition can be broken down into great big vector operations, which are perfect for hand coding in C. Bet I could quadruple the speed of it in a couple of hours with some hand coded SIMD ops in x86 assembler.

    It's funny because Java is fantastic at JIT compiling code with lots of non-local behaviour (e.g. complex UIs) because it can take into account global behaviour at runtime. But it sucks at tight, heavy computation loops. DSP is a fantastic example of something Java is going to get creamed at when pitched against non-virtual machines.

    Of course, if you have some cross-platform standard API calls for those vector DSP ops, then it's a different argument...
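
    To make "great big vector operations" concrete, here is a minimal sketch of one computation of that kind that HMM-based recognizers commonly perform: scoring a feature vector against a diagonal-covariance Gaussian. It is written in plain Python for readability, is not taken from the Sphinx-4 source, and the function name is made up for this illustration.

```python
import math


def diag_gaussian_log_likelihood(x, mean, var):
    """Log density of vector x under a Gaussian with diagonal covariance.

    x, mean and var are equal-length sequences of floats; var holds the
    per-dimension variances. Illustrative only, not Sphinx-4 code.
    """
    if not (len(x) == len(mean) == len(var)):
        raise ValueError("x, mean and var must have the same length")
    total = 0.0
    # One multiply-add style pass over the vector: exactly the kind of
    # tight loop that maps well onto SIMD instructions.
    for xi, mi, vi in zip(x, mean, var):
        diff = xi - mi
        total += math.log(2.0 * math.pi * vi) + diff * diff / vi
    return -0.5 * total
```

    A recognizer would typically evaluate a loop like this for many Gaussians on every frame of audio, which is why the speed of the inner loop matters so much.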

  7. Re:Telephony by dalabrat · · Score: 3, Informative

    December 2003 http://www.voip-info.org/wiki-Sphinx

  8. Good Success by billdar · · Score: 2, Informative
    I've been using Sphinx for about a year or so now for a Linux-based home automation project. I must say that it has worked out very well for me so far.

    The speaker-independent feature is the best part. Not all words were recognized, only about 70%. Probably because I slur the other 30%. It works equally well with either my wife or myself issuing commands.

    70% is more than I need for this particular project, but I'm sure this new release closes the gap even further.

    --
    I am billdar, and I approve this message.
  9. Re:Speech recognition by NanoGator · · Score: 2, Informative

    "Speech recognition is one of the worst means of input there is for a computer. Keyboards work so much better."

    This statement is far too general to be true. The keyboard is only faster if you know what command you're trying to enter AND how to spell it. Voice recognition, used correctly, is much more intuitive. Maybe it's not so hot for dictation, but imagine if an app you're using didn't have to have a bunch of hard-to-sift-through menus. Just say 'Italic!' or 'Bold!'

    SR is much more interesting on simplified devices, though. I have a TabletPC. In Tablet mode, the kb is tucked under the screen. I have stylus buttons for copy/paste etc, but the voice recognition works better. (Although, as with your cubicle comment, my gf found that annoying.)

    It's a lot easier to find problems with something than it is to find good aspects of it. I dunno why this comment was modded 'interesting'.

    --
    "Derp de derp."
  10. Rolling your own speech recognition isn't so easy by belmolis · · Score: 4, Informative

    Speech recognition is not really a solved problem. For some applications it works adequately, but if you take a look at the error rates for the Sphinx system to which the post links, you'll see that the Word Error Rate for large vocabulary is over 18%. Even for 5,000 words it is 7%. For many applications that is unacceptable.
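
    For readers unfamiliar with the metric: Word Error Rate is conventionally computed as the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the recognizer's output, divided by the number of reference words. The sketch below illustrates that definition; the function name and the sample sentences are made up for this example and have nothing to do with how the Sphinx-4 figures were produced.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical home-automation command and a flawed recognition of it.
    print(word_error_rate("turn on the kitchen light", "turn on kitchen lights"))
```

    Note that insertions count as errors too, so WER can exceed 100% when the output contains many extra words.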

    A second factor is that these statistical speech recognition systems require extensive data for their language model. Building such a system requires recording real speech, segmenting it and creating a set of examples from which to compute the probabilities, which requires some knowledge of acoustic phonetics, and doing the computation for the model. This is time-consuming.

    Speech recognition technology isn't a dark secret, but it isn't trivial to create a system with good performance either.

  11. Re:There's more than one kind of overhead. by LarryRiedel · · Score: 3, Informative
    I can run inetd-style fork-exec-terminate servers in C on CPUs that a cellphone would spit on, and handle hundreds of connections a second. Bringing up a JVM on the same processor would take minutes.
    [...]
    if it takes 10s to start up a JVM your customer's already hit "back".

    I find that startup/shutdown for a simple Java program takes about 200ms at 1GHz with the vanilla Sun JDK 1.5 JVM, or 150ms using gcj (gcc), and an equivalent C program takes about 2ms.
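
    Numbers like these are easy to check on your own machine. The following is a rough sketch, not the method used for the figures above: it times how long a command takes to start and exit, averaged over a few runs. The helper name is made up, and the default command is just the Python interpreter so the script runs anywhere; substitute your own Java or C program (for example a hypothetical `["java", "Hello"]` or `["./hello"]`).

```python
import subprocess
import sys
import time


def mean_startup_seconds(command, runs=5):
    """Average wall-clock time to launch `command` and wait for it to exit."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        total += time.perf_counter() - start
    return total / runs


if __name__ == "__main__":
    # Stand-in command; replace with the program you actually want to time.
    cmd = [sys.executable, "-c", "pass"]
    print(f"{mean_startup_seconds(cmd) * 1000:.1f} ms per run")
```

    The measurement includes the cost of spawning the process from Python, so it is best used to compare programs against each other rather than to quote absolute figures.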

    Browser plugins? For content, yes, but not for navigation.

    The overhead of starting a JVM should be incurred only once per browsing session.

    Larry

  12. Re:But what about text to speech? by LnxAddct · · Score: 2, Informative

    Look around at the Sphinx project and FreeTTS. A lot of really good voice stuff is going on. FreeTTS is excellent, and if you've got time, you can even model your own voice :) Anyway... I forget which of the two sites I got to it from (I think it was Sphinx), but it's got a whole scenario with an airport calling program; it's very, very nice and sounds great.
    Regards,
    Steve

  13. Re:Rolling your own speech recognition isn't so ea by starm_ · · Score: 2, Informative

    That's true. It is an underestimated problem. People assume that we can recognise a word by using just the sound of it. That is simply not true. When speaking at a reasonable speed humans do not utter words clearly. This is not a problem to us because we can guess the words by using context and semantics.

    In order to have a good speech recognition system, the computer would have to actually understand the meaning of the sentences and put it in context. There are different levels of analysis necessary to do this. The system has to analyse sound, morphology, syntax, semantics and pragmatics. Each level contains ambiguity but when two levels are combined together some ambiguity is resolved and you get another piece of the puzzle. When everything is combined the ambiguous parts all converge towards the right meaning, the right syntax, the right morphology and in the case of speech to text, the right choice of word and spelling.
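
    A toy illustration of how combining two levels resolves ambiguity: below, an acoustic score (how well the sounds match) is combined with a language score (how plausible the word sequence is). All numbers, names and candidate sentences are invented for the example; real systems derive such scores from trained models.

```python
import math

# Hypothetical scores for two candidate transcriptions of the same audio.
CANDIDATES = {
    "recognize speech": {"acoustic": 0.40, "language": 0.0100},
    "wreck a nice beach": {"acoustic": 0.45, "language": 0.0001},
}


def combined_log_score(scores, lm_weight=1.0):
    """Sum of log acoustic score and weighted log language score."""
    return math.log(scores["acoustic"]) + lm_weight * math.log(scores["language"])


def best_transcription(candidates, lm_weight=1.0):
    """Pick the candidate with the highest combined score."""
    return max(
        candidates,
        key=lambda text: combined_log_score(candidates[text], lm_weight),
    )


if __name__ == "__main__":
    # Sound alone (language weight 0) prefers the slightly better acoustic match.
    print(best_transcription(CANDIDATES, lm_weight=0.0))
    # Adding the language level tips the choice to the plausible sentence.
    print(best_transcription(CANDIDATES, lm_weight=1.0))
```

    With the language weight at zero the acoustically closer but implausible candidate wins; once word-sequence plausibility is added, the sensible one does.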

    Now to do all that you ideally need sophisticated knowledge representation based on cognitive science and the way we think. Although there exist tricks and shortcuts that can mimic the important parts of the cognitive system, there isn't any complete system that integrates everything well.

    Anyways, if you want a summary of the field, read the textbook "Speech and Language Processing" by Daniel Jurafsky & James H. Martin.

    And search on google for "computational linguistics" "word grammar" "open mind common sense" "cyc" "Ray Jackendoff"

  14. Re:Virtual Machine Syndrome by jamesh · · Score: 2, Informative

    In theory, it's all compiled down to assembly in the end anyway so it has equal chance of being just as fast. For some types of code, JIT can be faster.

    Some of the advantages of byte-code are:
    . branch prediction and other speculative optimisations can be done based on observing the flow at runtime rather than guessing at compile time.
    . it's not necessarily tied to a specific architecture
    . if code optimisation technology improves, you don't need to recompile anything. the new JIT engine can do it all for you.
    . as above but for bugs in the optimiser.

    I'm sure someone else will point out some of the disadvantages.

  15. Re:Java!?! by Lumpy · · Score: 2, Informative

    there are problems with it being in java.

    Embedded systems cannot use it without huge overhead. if it was in C then it could really give a boost to the embedded linux market.

    Sigh, it's still a huge gap for what I do. No room for a Java VM in the embedded systems until I double all my costs on the hardware.

    on another note, it just might help fill in the other huge gap in linux. There is no Navigation software to use with map data and a GPS.

    and no kiddies, GPSdrive is NOT navigation.

    --
    Do not look at laser with remaining good eye.
  16. Convert to C easily with ALMA by samjam · · Score: 3, Informative

    Alma.

    It can read several high level languages, build an internal representation and then convert that to other high level languages.

    It is a great tool to help port this software to C for example.

    Unfortunately the site seems to have gone, although I have used this software in the past.

    See the google cache though: http://66.102.9.104/search?q=cache:Dbw7OX6Tco4J:www.memoire.com/guillaume-desnoix/alma/+&hl=en

  17. Re:Java!?! by Glock27 · · Score: 2, Informative
    Embedded systems cannot use it without huge overhead. if it was in C then it could really give a boost to the embedded linux market.

    Check out gcj. One of its primary uses is targeting embedded systems. It's quite lean and mean for a Java runtime.

    HTH.

    --
    Galileo: "The Earth revolves around the Sun!"
    Score: -1 100% Flamebait
  18. Re:Java!?! by leinhos · · Score: 4, Informative

    Can't gcc compile java code directly to native binary code?

    Does this mean that one could make a shared library out of the java code for C-programmers to use?

  19. Google, schmoogle, there are better ways! by leonbrooks · · Score: 2, Informative

    WayBack has it.

    I've also mirrored the source Just In Case (that's an ADSL link, you'd be better off downloading it directly from WayBack).

    --
    Got time? Spend some of it coding or testing
  20. Part of Galaxy Communicator by mattr · · Score: 2, Informative
    Wow that is great that Sphinx-4 is open! A-And guess what, Galaxy Communicator has also snuck onto sourceforge, quietly, a year ago. A year or so ago I had written to one of the partners to try to get a copy, with no reply, but some googling found it. Most slashdotters probably don't know Galaxy but it is the same partners - CMU, MITRE, DARPA etc. It is the plug and play hub for related technologies. This stuff has been used to make voice-recognizing automated telephone information services for weather and flight info I believe. Well what I found on sourceforge is the 2002-2003 version (when the grant ran out?) and has a list of modules which could use some updating i.e. about how Sphinx-4 is available. So can we expect a new Galaxy Communicator distro? I always had trouble finding out about it because each participating institution had their own site, their own distro, some focusing on different things, etc. I remember looking at CMU and I think Colorado U., anyway.

    Note in the 2002 version that the dialog server is not included, this would be great to have too. MIT also has some very cool technologies in this area - SUMMIT, TINA, GENESIS, ... - which I do not believe are public, they just show little bits and pieces of PR about them, but include natural language parsing, question answering, sentence generation, etc. It would be cool if someone on the inside could document just what things are available, what works with what, what is definitely ready for prime time, etc. There must be some people who hacked on this in the past few years and are still developing things, it would be cool if some of their experimentation was available to the open source community so people could get an idea of what things are possible. When I did my survey just about 1 year ago, Communicator was daunting, intriguing, and it looked like you could do tons of stuff if you had some secret decoder docs and a spare year to hack. Maybe now's the time to dig into it hip deep?