Slashdot Mirror


Improving Open Source Speech Recognition

kmaclean writes, "VoxForge collects free GPL Transcribed Speech Audio that can be used in the creation of Acoustic Models for use with Open Source Speech Recognition Engines. We are essentially creating a user-submitted repository of the 'source' speech audio for the creation of Acoustic Models to be used by Speech Recognition Engines. The Speech Audio files will then be 'compiled' into Acoustic Models for use with Open Source Speech Recognition engines such as Sphinx, HTK, CAVS and Julius." Read on for why we need free GPL speech audio.

Why free GPL Speech Audio?

Speech Recognition Engines require two types of files to recognize speech. The first is an Acoustic Model, which is created by taking a very large number of audio recordings of speech and their transcriptions (called Speech Corpus or Corpora) and 'compiling' them into statistical representations of the sounds that make up each word. The second is a Language Model or Grammar file. A Language Model is a very large file containing the probabilities of certain sequences of words. A Grammar is a much smaller file containing sets of predefined combinations of words.
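
To make the Language Model versus Grammar distinction concrete, here is a minimal sketch in Python. It is not taken from VoxForge or from any of the engines named above: the toy corpus, the bigram (word-pair) estimate, and the names `bigram_probabilities`, `toy_corpus`, `grammar` and `allowed_by_grammar` are all assumptions chosen for illustration. Real language models are trained on far more text and use smoothing.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(second word | first word) from raw counts (no smoothing)."""
    pair_counts = Counter()
    first_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for first, second in zip(words, words[1:]):
            pair_counts[(first, second)] += 1
            first_counts[first] += 1
    return {
        pair: count / first_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Hypothetical three-sentence corpus; a real one would be very large.
toy_corpus = ["open the file", "open the door", "close the file"]
language_model = bigram_probabilities(toy_corpus)

# A grammar, by contrast, simply enumerates the word combinations it accepts.
grammar = {"open the file", "open the door", "close the file", "close the door"}


def allowed_by_grammar(utterance):
    """Return True only if the utterance is one of the predefined combinations."""
    return utterance.lower() in grammar
```

The language model assigns a probability to any word pair it has seen, while the grammar gives a plain yes or no answer for a small fixed set of phrases, which is why grammar files stay so much smaller.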

Most Acoustic Models used by 'Open Source' Speech Recognition engines are 'closed source'. They do not give you access to the speech audio (the 'source') used to create the acoustic model, or if they do, there are licensing restrictions on the distribution of the 'source' (i.e. you can only use it for personal or research purposes). The reason for this is because there is no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines. Open Source projects are required to purchase Speech Corpora, which have restrictive licensing: they are not permitted to distribute the 'source' speech audio, but they are permitted to distribute the 'compiled' Acoustic Model.

Why GPL?

A GPL-style license will ensure that user contributions will always benefit the open source community, since it requires any distribution of derivative Acoustic Models to include access to the 'source' speech audio.

10 of 121 comments

  1. It's about time by jesuscyborg · · Score: 5, Informative

    Improving open source speech rec and tts will be a HUGE improvement in the grand scheme of progress as far as human-computer interaction is concerned. The main reason is because Nuance has a near monopoly in this market and they charge INSANE licensing fees to do anything with their technology. Whenever closed-source competition comes along, they just buy them out. Heck, their sales people even talk down to you on the phone because they know they're the only game in town.

    Having a viable open source alternative will ensure that everyone has access to this technology and there will be many new innovations that will just continue to make technology cooler.

    Please people, take the time out of your schedule to record prompts. It will do everyone a lot of good.

  2. two modes of speech-to-text, also by Speare · · Score: 5, Informative

    It's helpful to understand that there are two very different modes of speech recognition.

    Continuous speech to text is useful for flat-out dictation, where the speaker should be allowed to speak in a clear voice at a normal or almost normal speed, saying whatever the speaker wants to say, and the result should be a sequence of recognized words.

    Prompted speech to text is useful if your program is trying to exchange a dialogue with a human, such as a voice prompt or simply a set of useful user-voice commands. In this mode, the listening routine has a set of "expected" responses and should try only to recognize one of those responses.

    The latter form usually requires a lot less training from an individual human, and is more robust in noisy environments, since the range of recognized expression is very tightly controlled to a few possibilities. The former mode, continuous speech, is much harder to accurately recognize without personal training for each human speaker, or significant statistical work in background processing.
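
A rough way to picture the prompted mode is the sketch below. It does not show how any real engine works internally: an actual recognizer scores the audio against the expected responses, whereas this sketch only mimics the idea on text, snapping a noisy hypothesis onto the closest expected response with Python's standard `difflib`. The function name `match_expected`, the response list, and the 0.6 cutoff are assumptions for illustration.

```python
import difflib


def match_expected(hypothesis, expected, cutoff=0.6):
    """Return the closest expected response, or None if nothing is close enough."""
    matches = difflib.get_close_matches(
        hypothesis.lower(), expected, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# Hypothetical set of responses a voice prompt might be listening for.
expected_responses = ["yes", "no", "repeat", "operator"]

# A slightly garbled hypothesis still lands on one of the expected responses,
# while something unrelated is rejected instead of being forced into the set.
print(match_expected("operater", expected_responses))
print(match_expected("xylophone", expected_responses))
```

Because the search space is only a handful of responses, small errors get absorbed, which is the intuition behind this mode being more robust in noise.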

    --
    [ .sig file not found ]
  3. Here's how to help out by schwaang · · Score: 5, Informative

    Record Your Speech and Submit it to VoxForge

    Donate your speech for a GPL speech data collection so they can do better recognition.

    Includes separate instructions for Windows and Linux users. (Wonder if there will be any significant differences in the quality of the data based on OS...)

  4. Speech to text overlooked by mgkimsal2 · · Score: 2, Informative

    I wrote a bit about this (somewhat negatively) at http://fosterburgess.com/kimsal/?p=139 a few days ago. I've been looking for a solid option for having some dictation automatically transcribed to text files, and have this run under Linux. Basically, anyone looking to do this is just out of luck. It'll be years before there's anything useable for the average person. In my post, I reference another article (http://www.theinquirer.net/default.aspx?article=34072) which also talks about the state of things.

    What's frustrating is that there *was* something halfway decent - IBM's ViaVoice - but that's gone. A few of the Linux apps I see out there are layers to run on top of ViaVoice. With that option gone for Linux, those tools are useless. It's like the rug was pulled out from underneath any progress in this arena for the foreseeable future.

    I found voxforge a few days ago, and while it seems admirable, it's a small part of the larger problem which I don't see getting any better any time soon.

    1. Re:Speech to text overlooked by Wumpus · · Score: 2, Informative

      IBM's ViaVoice technology can still be licensed from Wizzard Software (http://www.wizzardsoftware.com) and they're still selling the Linux SDK.

  5. Re:GPL? by cheater512 · · Score: 2, Informative

    Actually the summary hints at this but the GPL fits rather nicely.

    There is the 'source' data which is 'compiled' into something useful.
    Sound familiar?

  6. Re:Wreck A Nice Beach... by bdwoolman · · Score: 3, Informative
    wreck a nice beach

    recognize speech

    Entered with Dragon Systems 9.

    Not trying to be snotty. Just informative. Dragon Systems has been pretty good since version 7. Eight was a real improvement. Nine is totally awesome. Almost magic. There is a user learning curve, however. One does have to dictate the punctuation for example. Nine works with Firefox very nicely.

    Wordos do happen from time to time when you 'wreck a nice beach' (sic), but then so do typos. Everything needs to be edited no matter how it was entered. It's fair to say that speech recognition has come of age with Dragon Systems nine.

    Let me also add that I am not a shill. I am not connected with Dragon Systems in any way shape or form. Just a very happy and satisfied user.

    --
    "No fear. No envy. No meanness." Liam Clancy
  7. IFA Dutch Corpus by finiteSet · · Score: 4, Informative
    Correct me if I'm wrong but GPL was made for code, not audio.
    There is more to it than the poster mentions (I don't know if the site addresses this - it is Slashdotted). You don't just need audio - speech audio is abundant - you need annotated audio. In most cases, this annotation is phonetic (or phonemic) transcription, which labels segments of the audio according to the speech sound present in that audio segment. Most state-of-the-art speech systems use a machine learning approach: the system is "trained" on training data, with the hopes that the patterns learned generalize well on new data. This training is a supervised process: it requires the answers, and the answers are found in the annotation. It is this combination of audio and annotation that is valuable, and that is hard to come by. If their system prompts you to read phrases, it could use an existing recognition system to produce a roughly aligned phonetic transcription. It would be far from perfect, but useful nonetheless.
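
To illustrate what "audio plus annotation" means as training data, here is a minimal sketch. It is not the format used by VoxForge, the IFA corpus, or any particular toolkit: the `Segment` fields, the millisecond timings for the word "cat", the 10 ms frame step, and the "sil" filler label are all assumptions. The point is only that each short slice of audio ends up paired with the "answer" a supervised learner needs.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_ms: int  # where the speech sound begins in the recording
    end_ms: int    # where it ends (exclusive)
    phone: str     # label of the speech sound in this stretch of audio


def frame_labels(segments, total_ms, frame_step_ms=10, filler="sil"):
    """Give every fixed-size frame the label of the segment that covers it."""
    labels = []
    for t in range(0, total_ms, frame_step_ms):
        label = filler  # frames outside every segment are treated as silence
        for seg in segments:
            if seg.start_ms <= t < seg.end_ms:
                label = seg.phone
                break
        labels.append(label)
    return labels


# Hypothetical phonetic annotation of a 300 ms recording of the word "cat".
annotation = [
    Segment(0, 80, "k"),
    Segment(80, 200, "ae"),
    Segment(200, 260, "t"),
]
labels = frame_labels(annotation, total_ms=300)
```

Each entry in `labels` lines up with one 10 ms frame of audio, so a frame and its label together form one training example. Without the annotation there would be no labels to train against.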

    From TFA:
    The reason for this is because there is no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines.
    What? The IFA Dutch "Open-Source" Corpus is a phonemically-annotated speech corpus released under the GNU GPL (read more - pdf). They even have an SQL interface. Did you mean English speech corpora?
    --
    If we start buying CDs then the terrorists have already won.
  8. Re:Muffin for Jew to Ski here? by jthayden · · Score: 2, Informative
    I remember messing around with voice recognition in the 90s


    The article is about speech recognition as is your post. Speech recognition is about recognizing what was said. Voice recognition is about recognizing who said it. The distinction is important since the coding and the problems associated with them are very different.

  9. Isn't this reinventing Librivox's wheel? by jhutch2000 · · Score: 4, Informative

    Librivox has THOUSANDS of hours of audio books available. Every last second is public domain.

    I'm probably missing something in regards to why this stuff can't be used...