Slashdot Mirror


Open Source Speech Recognition - With Source

Paul Lamere writes "This story on ZD-Net and this recent story on Slashdot describe the recent open sourcing of IBM's voice recognition software. This release, unfortunately, doesn't include any source for the actual speech recognition engine. Olaf Schmidt, a developer on the KDE Accessibility Project, is quoted as saying 'There is no speech-recognition system available for Linux, which is a big gap.' In an attempt to close this gap, we have just released Sphinx-4, a state-of-the-art, speaker-independent, continuous speech recognition system written entirely in the Java programming language. It was created by researchers and engineers from Sun, CMU, MERL, HP, MIT and UCSC. Despite (or because of) being written in the Java programming language, Sphinx-4 performs as well as similar systems written in C. Here are the release notes and some performance data."

30 of 404 comments

  1. Java comment by Aragorn992 · · Score: 3, Insightful

    Despite (or because of) being written in the Java programming language, Sphinx-4 performs as well as similar systems written in C.

    I'm sick of these comments. Anyone who needs to know about the performance of Java knows it's very fast. Why bother commenting about it anymore?

    It's like saying "... and because it was written in C, it's very fast...", as if we didn't know already.

    1. Re:Java comment by Taladar · · Score: 4, Insightful

      Java might be as fast as C in Code Execution but if you want to build a library that Open Source Applications outside the Java-Developer-niche use you have to write it in C. C is still THE No. 1 language for libraries for use in programs written in lots of different programming languages.

  2. Re:Speech Recognition is a Mature Technology. by Anonymous Coward · · Score: 2, Insightful

    Yeah i'm sure it's just that easy... you dumb f*ck

  3. Build Instructions by Anonymous Coward · · Score: 2, Insightful

    Given those build instructions, you are better off writing your own engine. This is exactly what is wrong with Linux today, and I don't see *any* solution to it. A maze of hidden dependencies and incompatibilities. No thanks.

  4. Never a huge fan... by ImaLamer · · Score: 3, Insightful

    I've used a few packages for speech recognition but none really got me too excited. Well, Dragon Naturally Speaking did have me read a few chapters of Dave Barry to it. I bet it didn't work because of all the laughing; I was in tears.

    I must say though that speech recognition is something that the whole computer community needs to work on. Now, we can finally do that. All the "open source community" needs is source that works a little. In a year or so, I bet this works better than most options available today.

    Now, I know that isn't the rule but this is the type of thing that computer/math engineers could sit down to and contribute where others can't. It seems to be the rule that the really smart ones tend to work with open source software...

    Really the cool thing is that this could get people involved who otherwise wouldn't because they don't know where to start.

    1. Re:Never a huge fan... by jwsd · · Score: 3, Insightful

      It seems to be the rule that the really smart ones tend to work with open source software...

      Or it seems to be the rule that those who work with open source software tend to think they are the really smart ones...

    2. Re:Never a huge fan... by mollymoo · · Score: 2, Insightful
      It seems to be the rule that the really smart ones tend to work with open source software...

      Or it seems to be the rule that those who work with open source software tend to think they are the really smart ones...

      Or it seems to be the rule that it's only the open-source developers we hear from directly, without being filtered through a bunch of marketroids.

      --
      Chernobyl 'not a wildlife haven' - BBC News
  5. Re:Translation for those who still don't get it... by Epsillon · · Score: 2, Insightful

    Don't you feel that the joke loses its appeal when you have to explain it? This is Slashdot. If anyone failed to get it, they probably shouldn't be here ;o)

    --
    Resistance is futile. Reactance buggers it up.
  6. Re:Speech Recognition is a Mature Technology. by Anonymous Coward · · Score: 1, Insightful

    What is the problem? Speech recognition is a mature technology, and algorithms for speech recognition are well documented in the research journals. The federal government has long since stopped funding research into speech recognition.

    And by mature you mean of course immature. Speech recognition is at the Model-T stage.

    Once you get speaker-independent recognition with the same accuracy as humans for a price that's cheaper than hiring a secretary you can claim maturity.

  7. Re:Translation for those who still don't get it... by Anonymous Coward · · Score: 3, Insightful

    I disagree. Many of us are not native English speakers. What may be obvious to you phonetically doesn't have to be obvious to the rest of us. :)

    This being said, yes, that the explanation of the joke would be modded up is a bit sad, alright. ;)

  8. There's more than one kind of overhead. by argent · · Score: 5, Insightful

    I could easily live with 10-15% slower, IF Java didn't have the startup overhead. I can run inetd-style fork-exec-terminate servers in C on CPUs that a cellphone would spit on, and handle hundreds of connections a second. Bringing up a JVM on the same processor would take minutes. Bringing up a JIT runtime would be out of the question.

    For applications where you can create a JVM and use it as you need it, Java's great. Webservers, sure, no problem. Desktop applications, heck, the GUI overhead's getting to be the same order of magnitude (though that HAS to change, we can't afford to depend on Moore's Law much longer unless someone comes up with a clever way to cut the power consumption of processors faster than the speed increases). Browser plugins? For content, yes, but not for navigation... if it takes 10s to start up a JVM your customer's already hit "back".

    1. Re:There's more than one kind of overhead. by argent · · Score: 2, Insightful

      I find that startup/shutdown for a simple Java program takes about 200ms at 1GHz with the vanilla Sun JDK 1.5 JVM, or 150ms using gcj (gcc), and an equivalent C program takes about 2ms.

      A factor of 100 difference in the overhead is a bit better than I've seen. I assume that I've never tried it on a sufficiently simple Java program, or you're talking about a dynamically linked C program. Still, a factor of 100 difference in the startup overhead is hardly a negligible consideration.

      The overhead of starting a JVM should be incurred only once per browsing session.

      I would hope that it didn't re-use the same JVM for applets from separate websites! I'm less than enthused about the Java applet security model as it is (OK, it's a couple of decimal orders of magnitude better than ActiveX, but that's hardly a rousing endorsement), and expecting it to run multiple security domains in the same JVM entirely fails to be my cup of tea.

  9. Re:Translation for those who still don't get it... by darkonc · · Score: 3, Insightful
    Don't you feel that the joke loses its appeal when you have to explain it?

    It takes at least 3 people to make for a really good joke:

    1. One person to tell it
    2. One person to get it
    3. One person to laugh at for just missing the whole point.
    #3 is a little bit less obvious when the joke telling is online.
    --
    Sometimes boldness is in fashion. Sometimes only the brave will be bold.
  10. And lo, the point was missed... by DarkMan · · Score: 4, Insightful

    You are, of course, perfectly correct in everything you said.

    There are a number of HCI aspects where speech recognition is not a good solution.

    However, let me enumerate a number of other ones, where it's superior:

    Minutes of meetings, or similar. Imagine having a verbatim record of a discussion there by the time you get back to your desk.

    Someone who cannot type - e.g. no hands. Rare, granted, but still a viable use.

    Someone whose hands are busy. The canonical example here is a pathologist doing an autopsy, where they dictate everything. Speech recognition saves time in transcription (and money for the audio typist).

    I'd love to be able to issue voice commands to a computer, for a few, isolated cases. For example, diagnosing hardware. Bring up a doc, and be able to get the computer to flip pages, without having to remove the probes from the hardware. Relocating them is a pain, and sucks time.

    Moreover, I'm certain that there are others, some of which will only be realised when it's common and cheap enough to be widely available.

    It's like a mouse. It's one of the worst general purpose input devices for a computer [0], but it excels at indicating a single element on a display. The mouse and keyboard complement each other, and there are a bunch of other, more specific input devices, such as the graphics tablet. I have no doubt that if speech recognition was as accurate and reliable as a graphics tablet, it would get a similar amount of use.

    [0] Try inputting a block of prose with only a mouse. Even specialist software makes it only suck marginally less.

  11. Re:withOUT source surely? by Anonymous Coward · · Score: 1, Insightful

    I here you?

    I HEAR you you ignorant fucker.

  12. Re:Virtual Machine Syndrome by deanj · · Score: 3, Insightful

    Heh...you could substitute "uses Linux" for "written in Java", and you'd have the same thing.

    Seriously though, Sphinx-4 is really worth looking at. That group at Sun does great work.

  13. I used to work for MacSpeech, doing UI work by notthepainter · · Score: 5, Insightful
    And before that, I worked for Articulate Systems, also doing UI work.

    With that said, you can probably guess I have a lot to say about Speech Recognition. (Not Voice Recognition, that's different, that would be able to distinguish Ben from Charlie for example.)

    A good SR engine is, of course, essential. And I've not read the details on the two recent giveaways, but I suspect that they are only the engine.

    The SR engine is just a beginning. There is a ton of UI work that needs to be done. Sit and think about spacing around punctuation marks and then think about capitalization around punctuation marks. Yeah, it is all pretty cut and dried and known but the details really need to be sweated to get it right. This is very time consuming.
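
    To make the punctuation point concrete, here is a minimal sketch of the kind of post-processing a dictation front end needs. The token format and the rules are illustrative assumptions, not taken from MacSpeech, Articulate Systems, or Sphinx-4, and a real product would have many more cases (quotes, abbreviations, numbers).

```python
# Hypothetical sketch: turn a stream of recognized tokens into display text.
# The punctuation sets and rules below are assumptions for illustration only.

SENTENCE_END = {".", "?", "!"}
NO_SPACE_BEFORE = {".", ",", "?", "!", ";", ":"}


def format_dictation(tokens):
    """Join tokens, attaching punctuation to the preceding word and
    capitalizing the first word of each sentence."""
    out = []
    capitalize_next = True  # the very first word starts a sentence
    for tok in tokens:
        if tok in NO_SPACE_BEFORE:
            out.append(tok)  # punctuation hugs the previous word
            capitalize_next = tok in SENTENCE_END
            continue
        word = tok[:1].upper() + tok[1:] if capitalize_next else tok
        if out:
            out.append(" ")  # single space between words
        out.append(word)
        capitalize_next = False
    return "".join(out)
```

    Even this toy version has to track state across tokens, which is the sort of detail that multiplies once corrections and cursor movement enter the picture.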

    Next you have to worry about exactly where you are editing. Is that into Microsoft Word (or Open Office), or emacs, or where? It can make a huge difference when you want to go back and correct misrecognitions. You just don't want to send N delete characters and retype it, that results in a lousy user experience. So just exactly where is the input cursor at all times? This is not an impossible problem, but one where the details must be sweated.

    Next is command and control. Just how are you going to let the user grab the text of all the menus and all the text in the dialog box buttons? Again, not impossible, but more of those pesky details.

    Finally, is your SR engine good enough? Maybe, maybe not. Let's just say that 98% accuracy might look good on paper, but that is one in 50 words wrong. Unless your correction mechanism is smooth, an error rate that high greatly slows you down.
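
    The arithmetic behind "one in 50" can be sketched as follows. It assumes every word is recognized independently with the same accuracy, which is a simplification, and the function names are made up for this example.

```python
def words_per_error(accuracy):
    """Average number of words dictated per misrecognized word."""
    return 1.0 / (1.0 - accuracy)


def expected_errors(word_count, accuracy):
    """Expected number of misrecognized words in a document,
    assuming independent errors at a fixed per-word accuracy."""
    return word_count * (1.0 - accuracy)
```

    With an accuracy of 0.98, `words_per_error` comes out at about 50, and a 500-word letter would carry roughly 10 mistakes to find and correct.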

    Is Open Source SR a good thing? Oh yes sir, yes! But let's not forget the details. One thing the Open Source community has been accused of, perhaps justly, perhaps unjustly, is not sweating the details.

    Speech Recognition has an awful lot of details.

  14. Why speech recognition on Linux will kill Windows by MarsF · · Score: 5, Insightful

    I was thinking about this the other day, and was wondering if this is a huge gap in the Windows user interaction model.

    Think about how you input info using windows. You click on a few locations using the mouse, perhaps use some keyboard input, click some more. The output from these inputs is arbitrary: it may result in anything from a 'File/Save' dialog to a custom error dialog box. There is no linear path for inputting commands, or for mapping inputs to results.

    Compare this to the command line. You enter a few distinct atomic commands, and view the results in the same medium. You then enter more commands, refining your actions. The key here is that you already have a linear model for input that produces well defined expected results, all in a common medium that is conceptually simple, visible to the user, and easily processed by machines. Extending this model to accept voice input or output is trivial.

    How is one supposed to quantify basic tasks and turn them into equivalent voice commands without a baseline framework or paradigm to extend from? How do you automate, simplify, or extend existing tasks without a common input or output medium? GUIs provide no such medium or framework; that same framework is at the heart of the command line interface!

    Perhaps this is why we never saw voice recognition technology take off on Windows. It's blinking impossible to script actions for an arbitrary task, let alone process the arbitrary results!

    On a similar note we may see voice recognition on Linux take off like a rocket. Anybody can add voice recognition to perform almost any command because the actions are all scriptable through the CLI already. If you can type it, you can get your computer to do it when you say 'computer, foo!'
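
    A rough sketch of the 'computer, foo!' idea, assuming the recognizer has already produced a text transcript. The phrase table, the wake word handling, and the commands are all hypothetical, and nothing here talks to an actual speech engine.

```python
import subprocess

# Hypothetical phrase-to-command table; each value is an argv list.
COMMANDS = {
    "list files": ["ls", "-l"],
    "show date": ["date"],
    "disk usage": ["df", "-h"],
}

WAKE_WORD = "computer"


def command_for(utterance):
    """Map a recognized utterance to an argv list, or None if unknown."""
    text = utterance.lower().strip()
    if text.startswith(WAKE_WORD):
        text = text[len(WAKE_WORD):].lstrip(" ,")  # drop "computer, "
    text = text.rstrip("!.?").strip()  # ignore trailing punctuation
    return COMMANDS.get(text)


def run_utterance(utterance):
    """Run the mapped command and return its stdout, or None if unmapped."""
    argv = command_for(utterance)
    if argv is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout
```

    Because the output is plain text as well, the same result could be handed to a text-to-speech program, which is the symmetry the command line gives you for free.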

    Mars

    P.S. It would be greatly appreciated if someone could please clarify my point. It's buried in there somewhere...



  15. The issue is Java's footprint and integration by Ndr_Amigo · · Score: 4, Insightful

    While I've been waiting for Sphinx to mature into something useful for a long time now, the move to Java makes the whole package pretty useless to me.

    Java is a memory hog, and it's certainly not going to be on any device I would want speech recognition on. Heck, I don't have Java installed on any of my machines, mostly because of the absolutely ridiculous footprint on disk as well as when running in ram.

    And integrating Java applications into other applications is very difficult. Now, Java is good for certain things, but a speech recognition engine in Java sounds like the worst abuse possible :)

    That and I still can't train it to recognise my slight Australian accent, unlike every other bit of SR software I've used on Win32 :P

    Whether or not Sphinx-4 works, and whether or not Java is 'fast' enough to do speech recognition processing, it's of no use to me.

  16. Re:Virtual Machine Syndrome by IgnoramusMaximus · · Score: 5, Insightful
    Stuff written in Java is better than stuff written in C or C++ because there are no frapping buffer overflows in Java code

    True, instead there are a thousand "super-efficient" .jar libraries required by a "Hello World" app, which use the "Object Oriented Programming and Long Lasting Cure All and Testicular Itch Relief Paradigm(tm)" to such extremes that it takes 12 objects instantiated in 4 containers to flip a bit in a byte. Additionally, there is the substitution of code compiled "Just Too Late" for the native performance of compiled code, combined with the exceptional memory usage that entails. As if that were not enough, as a bonus, we get the garbage collector, which is scientifically fine-tuned to run just when the user is expected to interact with the application in the most time-sensitive manner. As icing on the cake we are also treated to multiple, insidiously incompatible versions of the so-called "universal" VM, resulting in one app demanding that the Sun VM is used and another that the MS VM is used, thus making it totally impossible to use both at the same time. Yes, I do speak from looong and utterly infuriating experience with Java apps.

    At one point I considered printing a sign warning of Java advocates being shot on sight. I could probably make some serious money selling it, given the similar amount of grief my other colleagues are going through.

    Ahem, and yes, the greatest offenders in my experience are... err... frigging IBM Java apps. We actually abandoned DB2 8.x release because no one could deal with the havoc the DB2 admin tools were causing with various other retarded banking related Java apps.

  17. Re:But what about text to speech? by IgnoramusMaximus · · Score: 2, Insightful
    So to you a "community" is a bunch of people who only do things which they themselves want and never help each other? Weird

    Like any community we get all kinds. There are those who do as you say. But there are also those who care for non-programmers and try to accommodate them. It all depends. In the case of a hard-core problem like speech recognition/synthesis (which is nowhere near acceptable level of scientific understanding) you are likely to get more of the "go code it yourself" kind because this area is prone to be inhabited by people who are arrogant "know-it-alls" but also unable to do anything about it. On the other hand, sometimes the questions asked by the "non-programmers" are also arrogant and of "Gimme now! Free! Now! Or I will hold my breath!" variety of attitude, which will be dealt with accordingly.

  18. Re:But what about text to speech? by winterlens · · Score: 4, Insightful

    Probably because speaking is incredibly complicated, and providing realistic speech from unmarked text is an intractable problem.

    When you write something down, you don't provide a pronunciation guide. Rather, the reader is guided by context. For instance, if I write the word "import", how do you pronounce it? If we're talking about trade deficits, you probably know that the stress is on the second syllable; but if we're discussing meaning, the stress is on the first.

    How do we expect computers that have a difficult time with context to make a pronunciation decision? This is a serious barrier to "good" text to speech (whatever "good" means).

    If you mean that you want the voice to sound more natural, even if it's pronouncing words incorrectly, you still have a lot of hard problems. For instance, the muscles in the tongue and lips move differently based on how phonemes are grouped. Coarticulation models are difficult to construct, and when you try to account for a convincing number of muscles and vibrations, the problem may quickly become intractable.

    Not only do we have to pay attention to the physics of speaking, but also the physics of hearing. The amount of signal processing involved can be pretty staggering if you're going to implement a complete system. Thierry Dutoit has a really good book on the subject called An Introduction to Text-to-Speech Synthesis. You should check it out if you want a somewhat more exhaustive answer to your question.

  19. Re:Virtual Machine Syndrome by dr2chase · · Score: 4, Insightful
    Great story, but basically wrong and misleading. You can trowel on the layers in any language, and you can write fast Java programs. The speech engine is proof of that.

    Garbage collection, in particular, is coming along nicely. Check out "Metronome" by David Bacon, of IBM. You set the knobs, it tells you how much memory you will need, and it gives you GC with real time performance. No pauses.

    Or, consider the machine that Azul is working on (good luck getting details now that they are in some sort of a quiet period). It has hardware support for read and write barriers, plus a good story for stack caches. Chances are good its GC pauses will be tiny (1-10 ms).

    I can also tell you that the market very much prefers JIT compilation. I worked on an ahead-of-time-compiling JVM, and there were a couple of others built by other companies. I don't work on that JVM any more, and the other AOT JVM companies have either failed or gone into other lines of business.

    So, great story, but not exactly correlated with reality.

    On the other hand, consider all the buggy apps that we (who sometimes administer Windows machines) have needed to patch over and over again over the years. If I am unwilling to run an application in the first place because of its poor security, does it really matter how little memory it uses, how fast it runs, or how well it gets along with the other worm-friendly apps?

  20. Re:Virtual Machine Syndrome by Unknown+Lamer · · Score: 2, Insightful

    You could always code the easily vectorizable stuff in C with inline assembly and call it from Java using your preferred runtime's FFI.

    --

    HAL 7000, fewer features than the HAL 9000, but just as homicidal!
  21. Re:But what about text to speech? by misleb · · Score: 2, Insightful
    Exactly how is it arrogant to suggest someone make something themselves if they want it so bad? It may not be productive or helpful, but it certainly isn't "arrogant." Perhaps the non-programmer should have considered that what is better to a programmer is not necessarily better to everyone else. I mean, I don't go making my car buying decisions based on the suggestion of a truck driver. Otherwise I might end up with a Mack tractor and nothing to haul...

    Non-programmer can't code it. Arrogant-open-source-programmer continues to scratch his itch. As a result, we have 10 thousand poor ass themes and numerous barely functional programs for each task. But what would be best is at least ONE good, working, application for each task.

    Why not just be grateful for what you *can* get out of open-source software? It is free, isn't it? Quit whining.

    So to you a "community" is a bunch of people who only do things which they themselves want and never help each other? Weird.

    Never help each other? Are you kidding me? Most open source projects depend on many libraries and much code written by others. Just putting your code into the public domain is helping others. And it isn't uncommon for programmers on one project to contribute to another. Where in the world do you get the idea that open-source developers don't help each other?

    -matthew

    --
    "THERE IS NO JUSTICE, THERE IS ONLY ME." -Death
  22. Re:Virtual Machine Syndrome by IgnoramusMaximus · · Score: 2, Insightful
    but you don't see me strutting around like I'm using God's own language!

    I think there is something specially weird about these Java priests. I have seen people exhibiting unhealthy attachment to some tools or languages, but nothing even approaching the amount of zealotry Java seems to produce. Well, maybe except the Emacs/Vi war, but that at least had some comedic elements. Java worshippers, on the other hand, appear to believe they somehow discovered some secret source of mojo and the Ultimate And Final Way You Shalt Speaketh To Yer Computer. They seem completely oblivious to the fact that, like any other computer language, Java has major shortcomings. Additionally, because it was hyped and targeted at a particular class of applications where (in my and other users' experience) it performed abysmally, it has become a dirty word along with other "revolutionary" ideas like Lotus Notes, Object-Oriented Databases, etc. All promising to be magical fairies who will cure all your computing problems with a wave of a wand, and instead turning out to be beer-bellied, hygiene deficient, ugliest-ever men in drag wielding an axe.

  23. Re:But what about text to speech? by silentbozo · · Score: 2, Insightful

    You cannot achieve what you want without the reader understanding context. Since computers can't yet understand context, we can't yet build such a system.

    I disagree. If computers can't understand context, then don't use the computer to provide the context. Instead, use humans to annotate text using emotional markers, in the same way that composers and conductors add accents and other notations to sheet music. Although you can sort of do this now by explicitly indicating phonemes, a better system would allow you to "markup" plain text with emotional cues that would be interpreted by whatever speech engine is being used.

    Something like what MIDI programs do that allow you to manipulate sequences and add attacks, change instruments, etc. This would allow you to still "perform" a text-to-speech, using the computer-synthesized voice as the instrument. A more advanced system would couple a speech recognition and pitch analysis system to automatically manipulate a speech output program, so an actor could perform a line using their own voice, but be able to "puppet" the line into someone else's voice.
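
    One way such human-written cues might look, purely as a sketch: the bracket syntax and the cue names are invented for this example and do not correspond to any existing speech engine's markup. The parser only splits text into (cue, text) segments; what a synthesizer does with each cue is left open, and nested cues are not handled.

```python
import re

# Hypothetical inline markup: [cue]some words[/cue]
TAG = re.compile(r"\[(\w+)\](.*?)\[/\1\]", re.DOTALL)


def parse_cues(text):
    """Split marked-up text into (cue, text) pairs.
    Unmarked stretches get a cue of None."""
    segments = []
    pos = 0
    for match in TAG.finditer(text):
        if match.start() > pos:
            segments.append((None, text[pos:match.start()]))
        segments.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):
        segments.append((None, text[pos:]))
    return segments
```

    A speech engine could then walk the segments in order and adjust pitch, rate, or volume per cue, much as a sequencer applies per-note attributes.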

  24. I am sick of these "Java sucks" comments by janoc · · Score: 3, Insightful
    Hello, did any of you Java bashers actually try the Sphinx4 engine? I tried it and it is pretty good. Actually a lot better (faster and more accurate) than the older Sphinx2 engine, which was written in *gasp* C! Or are we bashing a project just because it is written in "slow and bloated" Java?

    I think some people should open their eyes, otherwise the world will leave you behind while you are happily consoling each other how Java is slow and unusable. Wake up, folks!

    To the people who argue for hand-writing C and assembly: well, you obviously didn't try to implement any of the algorithms (like hidden Markov models or the statistical searches) used in speech recognition. It is a pain in the butt to do it even in Java, but at least you do not have the pointer mess you would have in C/C++. The engine already has good performance; I am not sure what you would gain by rewriting it, except bugs (the older Sphinx2 was for sure buggy as hell).

    Something about the memory footprint. Java can have a large memory footprint; however, with speech recognition, you will always have it. Just the acoustic models for one language can easily be on the order of several hundred megabytes. The memory footprint of Java is completely irrelevant here.

    And before somebody compares Sphinx with speech "recognition" on your mobile phone or in your car - be aware that you are comparing a skateboard with a Concorde here. The Sphinx family of engines is intended for recognition of continuous, large vocabulary speech and to be speaker independent. Your phone/car is small vocabulary, single words and speaker dependent - i.e. a completely different problem. You cannot think about Sphinx as something "to have on some device". It is more intended to act as a speech recognition server on a dedicated machine, e.g. for a large call center or ticket reservation system. I guess it could also be used in KDE for the KAccessibility purposes, but it is a bit heavy for that (especially with the large datasets).

    So next time, before you start spouting BS about Java and applications written in it, at least check the facts. People will not see you as a complete idiot.

  25. Strangle me, I guess by HeaththeGreat · · Score: 2, Insightful

    I built a few client apps that were deployed on a few different VM versions, though most were Win32 (1.3, 1.4, 1.4.2). I deployed to Macs without a problem.

    Development was a snap; I got the whole application off the ground with relatively few problems because of the usefulness of Java's built-in API. Of course, when performance tuning I did rewrite the functionality of some of those API classes, but I'm sure you have to do that in any language.

    Yes, the MS JVM is total crap, but that's what Sun got a huge settlement for. It was put in place by Microsoft in an attempt to shut down Java with a crappy install base.

    Java is all about following standards. As long as you do that, your apps run pretty well.

    This app that I wrote required a lot of Swing specialization and user interaction, displayed custom images, etc. It wasn't a trivial application.

    So, I guess my question is, do you guys just not follow the standards? What is it that you're doing to break your apps so much?

    Someone was talking about using lightweight components with heavyweight components, which I know from experience is a real beast to get working, but other than that, what is it exactly that's breaking all the time?

    I'm talking about client-side apps here. I haven't used an applet in forever. Most applications on the web are jsp apps, so you are totally shielded from its Javaness.

  26. Not your fault, then. by HeaththeGreat · · Score: 2, Insightful

    Though I responded to your thread, it was more of a question to the general populace.

    If you didn't design the app, then it's not your fault, is it?

    The solution to your problem is to call the developer and complain. If they don't do anything about it, then your solution is to switch applications.

    The reason that Java is perceived as a bad platform compared to Windows is that Sun's engineers don't go through all of the major applications written in Java and re-engineer the platform to behave as each application expected.

    That is, if you write an app that inadvertently depends on non-standard behavior in SP1, and it becomes really popular, MS will generally make it so that SP2 behaves the same way that SP1 did for your application. This is because users will largely blame MS for the app's problems and not the developer that didn't follow standards in the first place.

    There was a story about this practice a while ago and how it might be changing with Longhorn.

    Anyways, the point is that if you don't follow the standards Sun isn't going to save you. That might be detrimental to their image, but it's not really their fault. It's users like you that don't know any better that perpetuate this misconception.