Improving Open Source Speech Recognition


Posted by ryuzaki0 on Tuesday October 10, 2006 @08:10AM

kmaclean writes, "VoxForge collects free GPL Transcribed Speech Audio that can be used in the creation of Acoustic Models for use with Open Source Speech Recognition Engines. We are essentially creating a user-submitted repository of the 'source' speech audio for the creation of Acoustic Models to be used by Speech Recognition Engines. The Speech Audio files will then be 'compiled' into Acoustic Models for use with Open Source Speech Recognition engines such as Sphinx, HTK, CAVS and Julius." Read on for why we need free GPL speech audio.

Why free GPL Speech Audio?

Speech Recognition Engines require two types of files to recognize speech. The first is an Acoustic Model, which is created by taking a very large number of audio recordings of speech and their transcriptions (called a Speech Corpus; plural Corpora) and 'compiling' them into statistical representations of the sounds that make up each word. The second is a Language Model or Grammar file. A Language Model is a very large file containing the probabilities of certain sequences of words. A Grammar is a much smaller file containing sets of predefined combinations of words.
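To make the Language Model versus Grammar distinction concrete, here is a minimal sketch in Python. It assumes the simplest possible case: a bigram model (the probability of each word given the one before it) with no smoothing, and a grammar that is nothing more than a fixed set of allowed phrases. The function names and the three sample transcripts are invented for illustration; real engines such as Sphinx or Julius use their own file formats and much larger models.

```python
from collections import Counter, defaultdict


def train_bigram_model(transcripts):
    """Estimate P(next word | previous word) from transcribed sentences."""
    counts = defaultdict(Counter)
    for sentence in transcripts:
        # Boundary markers let the model learn how utterances start and end.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # Normalize raw counts into conditional probabilities.
    return {
        prev: {word: n / sum(following.values()) for word, n in following.items()}
        for prev, following in counts.items()
    }


def grammar_accepts(grammar, utterance):
    """A grammar here is just a fixed set of allowed phrases (an assumption)."""
    return " ".join(utterance.lower().split()) in grammar


# Hypothetical toy data, not taken from any real corpus.
transcripts = ["open the file", "open the door", "close the file"]
language_model = train_bigram_model(transcripts)
grammar = {"open the file", "close the file"}

print(language_model["the"])  # probabilities of the words seen after "the"
print(grammar_accepts(grammar, "Open the file"))
print(grammar_accepts(grammar, "open the window"))
```

The language model assigns a probability to any word sequence built from pairs it has seen, while the grammar simply accepts or rejects whole phrases, which is why it can be so much smaller.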

Most Acoustic Models used by 'Open Source' Speech Recognition engines are 'closed source'. They do not give you access to the speech audio (the 'source') used to create the Acoustic Model, or if they do, there are licensing restrictions on the distribution of the 'source' (e.g. you can only use it for personal or research purposes). The reason is that there are no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines. Open Source projects have to purchase Speech Corpora that come with restrictive licensing: they are not permitted to distribute the 'source' speech audio, but they are permitted to distribute the 'compiled' Acoustic Model.

Why GPL?

A GPL-style license will ensure that user contributions will always benefit the open source community, since it requires any distribution of derivative Acoustic Models to include access to the 'source' speech audio.

9 of 121 comments

Anythings gotta be better than by LiquidCoooled · 2006-10-10 08:19 · Score: 5, Funny

Dear Aunt, let's set so double the killer delete select all.

--
liqbase :: faster than paper
It's about time by jesuscyborg · 2006-10-10 08:21 · Score: 5, Informative

Improving open source speech rec and TTS will be a HUGE improvement in the grand scheme of progress as far as human-computer interaction is concerned. The main reason is that Nuance has a near monopoly in this market and they charge INSANE licensing fees to do anything with their technology. Whenever closed-source competition comes along, they just buy them out. Heck, their sales people even talk down to you on the phone because they know they're the only game in town.

Having a viable open source alternative will ensure that everyone has access to this technology and there will be many new innovations that will just continue to make technology cooler.

Please people, take the time out of your schedule to record prompts. It will do everyone a lot of good.
1. Re:It's about time by smilindog2000 · 2006-10-10 09:10 · Score: 4, Interesting
  
  I'll pitch in. I lost the use of my hands for three years due to a repetitive motion injury, and had to code by voice. That was 1997, nine years ago. I figured that within a couple of years, the technology would be so great I would out-code my peers. Then the web bubble came, and Dragon Systems lost their focus on helping disabled people and focused instead on letting people dictate to Word. The creators of this great technology eventually sold out and moved on. Nine years later, the best product for coding is nine years old: the original Dragon Dictate. It doesn't even use the CPU for its signal processing: it runs that on the sound card because in the early '90s the sound card had more power.
  
  We've gotta do something to get this beast moving forward.
  
  --
  Beer is proof that God loves us, and wants us to be happy.
two modes of speech-to-text, also by Speare · 2006-10-10 08:24 · Score: 5, Informative

It's helpful to understand that there are two very different modes of speech recognition.

Continuous speech to text is useful for flat-out dictation, where the speaker should be allowed to speak in a clear voice at a normal or almost normal speed, saying whatever the speaker wants to say, and the result should be a sequence of recognized words.

Prompted speech to text is useful if your program is trying to carry on a dialogue with a human, such as responding to a voice prompt or accepting a small set of voice commands. In this mode, the listening routine has a set of "expected" responses and should try only to recognize one of those responses.

The latter form usually requires a lot less training from an individual human, and is more robust in noisy environments, since the range of recognized expression is very tightly controlled to a few possibilities. The former mode, continuous speech, is much harder to accurately recognize without personal training for each human speaker, or significant statistical work in background processing.
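A rough sketch of why the prompted mode is more forgiving, in Python. This is only an illustration of the idea, not how a real engine does it: actual recognizers constrain the acoustic search itself, whereas this toy version takes a (possibly garbled) text hypothesis and snaps it to the closest expected response using the standard library's difflib. The function name, the cutoff value, and the sample responses are all assumptions.

```python
import difflib


def match_expected(hypothesis, expected, cutoff=0.6):
    """Return the expected response closest to the hypothesis, or None."""
    # Map lowercase forms back to the original spelling of each response.
    normalized = {response.lower(): response for response in expected}
    matches = difflib.get_close_matches(
        hypothesis.lower(), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[matches[0]] if matches else None


# Hypothetical prompt: the program only expects one of these answers.
expected_responses = ["yes", "no", "repeat"]

print(match_expected("yess", expected_responses))    # a slightly garbled "yes"
print(match_expected("banana", expected_responses))  # nothing close enough
```

With only a few candidates to choose from, a noisy or slightly wrong hypothesis can still land on the right answer. Continuous dictation has no such short list to fall back on.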

--
[ .sig file not found ]
Here's how to help out by schwaang · 2006-10-10 08:28 · Score: 5, Informative

Record Your Speech and Submit it to VoxForge

Donate your speech for a GPL speech data collection so they can do better recognition.

Includes separate instructions for Windows and Linux users. (Wonder if there will be any significant differences in the quality of the data based on OS...)
Data conditioning (GIGO) by StateOfTheUnion · 2006-10-10 08:30 · Score: 4, Insightful

What about data conditioning?
This project seems to be gathering a "Wild Type" sampling of submitted data. What if the data is not representative . . . for example, a bunch of people in China decide to submit English language files with the best of intentions, but the data is heavily accented (or, to be fair, a bunch of native English speakers submit heavily accented recordings of Mandarin speech)?
Without controlling the data source or making sure that the data is valid, one could become a victim of GIGO (Garbage In, Garbage Out). In all fairness, this may not be a problem if the sample size is large enough to overwhelm any outlying data, but I'm not sure that this project has sufficiently addressed this concern . . .
GPL versus public domain? by 5plicer · 2006-10-10 08:35 · Score: 5, Insightful

Why not make the files public domain? Is making them GPL really necessary?

--
The bits on the bus go on and off... on and off... on and off...
IFA Dutch Corpus by finiteSet · 2006-10-10 13:15 · Score: 4, Informative

Correct me if I'm wrong but GPL was made for code, not audio.
There is more to it than the poster mentions (I don't know if the site addresses this - it is Slashdotted). You don't just need audio - speech audio is abundant - you need annotated audio. In most cases, this annotation is phonetic (or phonemic) transcription, which labels segments of the audio according to the speech sound present in that audio segment. Most state-of-the-art speech systems use a machine learning approach: the system is "trained" on training data, with the hopes that the patterns learned generalize well on new data. This training is a supervised process: it requires the answers, and the answers are found in the annotation. It is this combination of audio and annotation that is valuable, and that is hard to come by. If their system prompts you to read phrases, it could use an existing recognition system to produce a roughly aligned phonetic transcription. It would be far from perfect, but useful nonetheless.
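To illustrate what "annotated audio" means here, a small Python sketch. It assumes the simplest representation: a list of time-aligned segments, each labeled with one speech sound. The Segment class, the helper names, the ARPAbet-style labels, and the timings are all made up for illustration and do not come from VoxForge or any particular corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording
    phone: str    # label for the speech sound in this span


def is_contiguous(segments, tolerance=1e-6):
    """True if every segment has positive length and starts where the last ended."""
    if not segments:
        return False
    if any(seg.end <= seg.start for seg in segments):
        return False
    return all(
        abs(nxt.start - prev.end) <= tolerance
        for prev, nxt in zip(segments, segments[1:])
    )


def training_pairs(segments):
    """Pair each time span with its label: the 'answers' for supervised training."""
    return [((seg.start, seg.end), seg.phone) for seg in segments]


# Hypothetical alignment of the word "hello"; timings are invented.
hello = [
    Segment(0.00, 0.08, "HH"),
    Segment(0.08, 0.15, "AH"),
    Segment(0.15, 0.24, "L"),
    Segment(0.24, 0.40, "OW"),
]

print(is_contiguous(hello))
print(training_pairs(hello))
```

The audio alone gives you only the time axis. The labels are what a supervised learner trains against, and producing them by hand is the expensive part.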

From TFA:
The reason for this is because there is no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines.
What? The IFA Dutch "Open-Source" Corpus is a phonemically-annotated speech corpus released under the GNU GPL. They even have an SQL interface. Did you mean English speech corpora?

--
If we start buying CDs then the terrorists have already won.
Isn't this reinventing Librivox's wheel? by jhutch2000 · 2006-10-11 01:01 · Score: 4, Informative

Librivox has THOUSANDS of hours of audio books available. Every last second is public domain.

I'm probably missing something with regard to why this stuff can't be used...