Improving Open Source Speech Recognition

Posted by ryuzaki0 on Tuesday October 10, 2006 @08:10AM

kmaclean writes, "VoxForge collects free GPL Transcribed Speech Audio that can be used in the creation of Acoustic Models for use with Open Source Speech Recognition Engines. We are essentially creating a user-submitted repository of the 'source' speech audio for the creation of Acoustic Models to be used by Speech Recognition Engines. The Speech Audio files will then be 'compiled' into Acoustic Models for use with Open Source Speech Recognition engines such as Sphinx, HTK, CAVS and Julius." Read on for why we need free GPL speech audio.

Why free GPL Speech Audio?

Speech Recognition Engines require two types of files to recognize speech. The first is an Acoustic Model, which is created by taking a very large number of audio recordings of speech and their transcriptions (called Speech Corpus or Corpora) and 'compiling' them into statistical representations of the sounds that make up each word. The second is a Language Model or Grammar file. A Language Model is a very large file containing the probabilities of certain sequences of words. A Grammar is a much smaller file containing sets of predefined combinations of words.
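To make the Language Model versus Grammar distinction concrete, here is a minimal sketch in Python. It is not how Sphinx, HTK or Julius store their files; the function names, the toy sentences and the toy grammar are all assumptions made up for illustration. The "Language Model" here is a plain bigram table with no smoothing, and the "Grammar" is just a fixed set of allowed phrases.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(next_word | word) from raw counts (no smoothing)."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Count every adjacent word pair in the sentence.
        for current, following in zip(words, words[1:]):
            pair_counts[current][following] += 1
    model = {}
    for current, followers in pair_counts.items():
        total = sum(followers.values())
        model[current] = {w: c / total for w, c in followers.items()}
    return model


# Hypothetical toy text; a real Language Model is built from far more text.
TOY_TEXT = [
    "open the file",
    "open the door",
    "close the file",
]

# Hypothetical toy Grammar: the only word combinations that are accepted.
TOY_GRAMMAR = {"open the file", "close the file"}


def allowed_by_grammar(phrase, grammar=TOY_GRAMMAR):
    """A Grammar accepts or rejects; it does not assign probabilities."""
    return phrase.lower() in grammar


bigram_model = train_bigram_model(TOY_TEXT)
```

In this toy setup the bigram table says "file" follows "the" two times out of three, while the grammar simply rejects "open the door" outright. That is roughly why a Grammar file can stay small and a Language Model cannot.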

Most Acoustic Models used by 'Open Source' Speech Recognition engines are 'closed source'. They do not give you access to the speech audio (the 'source') used to create the acoustic model, or if they do, there are licensing restrictions on the distribution of the 'source' (i.e. you can only use it for personal or research purposes). The reason for this is that there are no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines. Open Source projects are required to purchase Speech Corpora, which have restrictive licensing: they are not permitted to distribute the 'source' speech audio, but they are permitted to distribute the 'compiled' Acoustic Model.

Why GPL?

A GPL-style license will ensure that user contributions will always benefit the open source community, since it requires any distribution of derivative Acoustic Models to include access to the 'source' speech audio.

Just what we need... by benzzene · 2006-10-10 08:18 · Score: 2, Funny

Aren't people recognising open source speech well enough already? Perhaps we need to tone down the zealotry.
Anything's gotta be better than by LiquidCoooled · 2006-10-10 08:19 · Score: 5, Funny

Dear Aunt, let's set so double the killer delete select all.

--
liqbase :: faster than paper
GPL? by PhrostyMcByte · 2006-10-10 08:19 · Score: 2, Interesting

Wouldn't a Creative Commons license be better for this? Correct me if I'm wrong but GPL was made for code, not audio.
1. Re:GPL? by cheater512 · 2006-10-10 09:17 · Score: 2, Informative
  
  Actually the summary hints at this but the GPL fits rather nicely.
  
  There is the 'source' data which is 'compiled' into something useful.
  Sounds familiar?
2. Re:GPL? by SpokenLang · 2006-10-10 12:17 · Score: 2, Interesting
  
  The difference between using audio data to "compile" an acoustic model, and using source code to compile an executable is that when you create acoustic models from audio data, you don't modify the acoustic data, you use it "as is". So, it doesn't really make sense to require me to distribute an identical copy of the data along with my acoustic models.
  On a related note, how much data are they planning to collect and how will it be distributed? The web site says that they will make available 8kHz and 16kHz versions of the data, both with 16-bit samples. These days, decent applications are trained on hundreds and even thousands of hours of audio. So, let's say they want to collect and distribute 1000 hours of 16kHz, 16-bit audio. That's 32,000 bytes per second of audio, or about 115 megabytes per hour, or 115 gigabytes per 1000 hours! Even 500 hours (58 gigs) is a LOT of data. Are they planning to make this available via download? If they want to distribute it on DVDs, that is about 24 DVDs (for 1000 hrs), which would be a lot of work to burn and ship...
  I think it is an admirable project, but it seems like the practicalities could make this VERY difficult to complete. The LDC http://www.ldc.upenn.edu/ has been distributing speech corpora for about 15 years and it is not easy (and no, I don't work for the LDC.)
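The storage arithmetic in the comment above can be checked with a short Python sketch. The function names are made up for illustration, and the DVD capacity is an assumption (4.7 GB, single-layer, decimal gigabytes); the sample rate and bit depth are the ones quoted from the VoxForge site. Mono audio is also assumed.

```python
import math

# Assumption: single-layer DVD, 4.7 decimal gigabytes.
DVD_CAPACITY_BYTES = 4.7e9


def corpus_size_bytes(hours, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Size of uncompressed PCM audio, ignoring file headers."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return hours * 3600 * bytes_per_second


def dvds_needed(total_bytes, capacity=DVD_CAPACITY_BYTES):
    """Whole discs required, rounding up."""
    return math.ceil(total_bytes / capacity)


one_hour = corpus_size_bytes(1)          # 115,200,000 bytes, about 115 MB
thousand_hours = corpus_size_bytes(1000) # 115,200,000,000 bytes, about 115 GB
discs = dvds_needed(thousand_hours)
```

Under these assumptions 1000 hours works out to 24.5 discs' worth of data, so rounding up gives 25 DVDs, in line with the "about 24" estimate. Lossless compression would shrink this, but by how much depends on the recordings.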
It's about time by jesuscyborg · 2006-10-10 08:21 · Score: 5, Informative

Improving open source speech rec and TTS will be a HUGE step forward in the grand scheme of progress as far as human-computer interaction is concerned. The main reason is that Nuance has a near monopoly in this market and they charge INSANE licensing fees to do anything with their technology. Whenever closed-source competition comes along, they just buy them out. Heck, their sales people even talk down to you on the phone because they know they're the only game in town.

Having a viable open source alternative will ensure that everyone has access to this technology and there will be many new innovations that will just continue to make technology cooler.

Please people, take the time out of your schedule to record prompts. It will do everyone a lot of good.
1. Re:It's about time by smilindog2000 · 2006-10-10 09:10 · Score: 4, Interesting
  
  I'll pitch in. I lost the use of my hands for three years due to a repetitive motion injury, and had to code by voice. That was 1997, nine years ago. I figured that within a couple years, the technology would be so great I would out-code my peers. Then the web bubble came, and Dragon Systems lost their focus on helping disabled people and focused instead on letting people dictate to Word. The creators of this great technology eventually sold out and moved on. Nine years later, the best product for coding is nine years old: the original Dragon Dictate. It doesn't even use the CPU for its signal processing: it runs that on the sound card because in the early '90s the sound card had more power.
  
  We've gotta do something to get this beast moving forward.
  
  --
  Beer is proof that God loves us, and wants us to be happy.
two modes of speech-to-text, also by Speare · 2006-10-10 08:24 · Score: 5, Informative

It's helpful to understand that there are two very different modes of speech recognition.

Continuous speech to text is useful for flat-out dictation, where the speaker should be allowed to speak in a clear voice at a normal or almost normal speed, saying whatever the speaker wants to say, and the result should be a sequence of recognized words.

Prompted speech to text is useful if your program is trying to exchange a dialogue with a human, such as a voice prompt or simply a set of useful user-voice commands. In this mode, the listening routine has a set of "expected" responses and should try only to recognize one of those responses.

The latter form usually requires a lot less training from an individual human, and is more robust in noisy environments, since the range of recognized expression is very tightly controlled to a few possibilities. The former mode, continuous speech, is much harder to accurately recognize without personal training for each human speaker, or significant statistical work in background processing.
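A rough way to see why the prompted mode is more forgiving: if the program only has to choose among a handful of expected responses, even a sloppy guess can be snapped to the nearest one. The Python sketch below is a simplification and an assumption on my part. It matches a text hypothesis against the expected phrases with the standard library's `difflib`, whereas a real engine applies the constraint during the acoustic search itself. The function name, the phrases and the cutoff value are hypothetical.

```python
import difflib


def match_prompted_response(hypothesis, expected_responses, cutoff=0.6):
    """Return the expected response closest to the hypothesis, or None.

    `cutoff` is a similarity threshold between 0 and 1; anything below
    it is treated as "not one of the expected responses".
    """
    # Compare in lowercase, but hand back the original spelling.
    normalized = {response.lower(): response for response in expected_responses}
    matches = difflib.get_close_matches(
        hypothesis.lower(), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[matches[0]] if matches else None


# Hypothetical prompt set for a bank's phone menu.
EXPECTED = ["checking balance", "transfer funds", "operator"]

best = match_prompted_response("chequing balance", EXPECTED)
nothing = match_prompted_response("xyzzy", EXPECTED)
```

Here the slightly wrong hypothesis still lands on "checking balance", and input that resembles none of the options is rejected rather than forced into a choice. Continuous dictation has no such short list to fall back on.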

--
[ .sig file not found ]
Re:A sound affair. by markwalling · 2006-10-10 08:27 · Score: 2, Interesting

Telephone services like Tellme (18005558355) and my bank's (USAA) work fairly well. My old bank had a touch-tone system which was hard to use while driving. The error rate of my new bank's system is fairly low.

But I agree with you: the voice system in my cell phone sucks.

--
...For the beast had been reborn with its strength renewed, and the followers of Mammon cowered in horror.
Here's how to help out by schwaang · 2006-10-10 08:28 · Score: 5, Informative

Record Your Speech and Submit it to VoxForge

Donate your speech for a GPL speech data collection so they can do better recognition.

Includes separate instructions for Windows and Linux users. (Wonder if there will be any significant differences in the quality of the data based on OS...)
1. Re:Here's how to help out by lawpoop · 2006-10-10 09:16 · Score: 3, Funny
  
  I would bet that more of the people who are using Linux are on the Autistic Spectrum. A few of the 'symptoms' or qualities of such people include "Odd or monotonous prosody of speech" and "Overly formal and pedantic language".
  
  So my bet is yes, there will be a difference based on OS.
  
  --
  Computers are useless. They can only give you answers.
  -- Pablo Picasso
Data conditioning (GIGO) by StateOfTheUnion · 2006-10-10 08:30 · Score: 4, Insightful

What about data conditioning?
This project seems to be gathering a "Wild Type" sampling of submitted data. What if the data is not representative . . . for example, a bunch of people in China decide to submit English-language files with the best of intentions, but the data is heavily accented (or, to be fair, a bunch of native English speakers submit heavily accented recordings of Mandarin speech)?
Without controlling the data source or making sure that the data is valid, one could become a victim of GIGO (Garbage In, Garbage Out). In all fairness, this may not be a problem if the sample size is large enough to overwhelm any outlying data, but I'm not sure that this project has sufficiently addressed this concern . . .
1. Re:Data conditioning (GIGO) by NamShubCMX · 2006-10-10 08:55 · Score: 2, Interesting
  
  But shouldn't there be many different accents for such a program to work? I am French Canadian, and I sure hope I don't have to imitate a France accent for my voice to be faithfully recognized. Because although I have a strong accent from the point of view of French people, I don't from the point of view of Quebec people.
  
  Along the same lines, I hope I can use this tool with my heavy (ok, not so bad) English accent...
  
  I have no clue how those programs work so I might be off-base, but it seems to me that for speech recognition to work there should be AS MANY accents as possible, as long as those are identified and you can find an accent that corresponds to your own (?)
  
  --
  We've always been at war with Eurasia.
slashdot met voxforge by bfree · 2006-10-10 08:34 · Score: 2, Funny

as if a million voices cried out in terror and were suddenly silenced

--
Never underestimate the dark side of the Source
1. Re:slashdot met voxforge by SeaFox · 2006-10-10 09:27 · Score: 2, Funny
  
  as if a million voices cried out in terror and were suddenly silenced
  
  But thanks to those millions of samples, we can now transcribe "AHHHHHHHHHHHHHHHH!" very accurately.
GPL versus public domain? by 5plicer · 2006-10-10 08:35 · Score: 5, Insightful

Why not make the files public domain? Is making them GPL really necessary?

--
The bits on the bus go on and off... on and off... on and off...
Please by porkThreeWays · 2006-10-10 08:46 · Score: 2

It'd be nice if someone could give an overview of the quality and simplicity of some open source speech recognition projects. I've used Sphinx 2, 3, and 4 before with little luck. I don't know if I got marbles in my mouth or what. Either way, I'm sure there's got to be someone on Slashdot who's used a few and could give an overview to us weekend warriors.

--
If an officer ever threatens to taze you, say you have a pacemaker.
Speech to text overlooked by mgkimsal2 · 2006-10-10 09:11 · Score: 2, Informative

I wrote a bit about this (somewhat negatively) at http://fosterburgess.com/kimsal/?p=139 a few days ago. I've been looking for a solid option for having some dictation automatically transcribed to text files, and have this run under Linux. Basically, anyone looking to do this is just out of luck. It'll be years before there's anything usable for the average person. In my post, I reference another article (http://www.theinquirer.net/default.aspx?article=34072) which also talks about the state of things.

What's frustrating is that there *was* something halfway decent - IBM's ViaVoice - but that's gone. A few of the Linux apps I see out there are layers to run on top of ViaVoice. With that option gone for Linux, those tools are useless. It's like the rug was pulled out from underneath any progress in this arena for the foreseeable future.

I found VoxForge a few days ago, and while it seems admirable, it's a small part of the larger problem, which I don't see getting any better any time soon.

--
creation science book
1. Re:Speech to text overlooked by Wumpus · 2006-10-10 11:04 · Score: 2, Informative
  
  IBM's ViaVoice technology can still be licensed from Wizzard Software (http://www.wizzardsoftware.com) and they're still selling the Linux SDK.
It's about time by pestilence669 · 2006-10-10 09:12 · Score: 2, Interesting

I've been waiting for something like this for a long time... Hiring voice actors isn't always feasible. I can work on engine code no problem, but my voice isn't the prettiest. Without repositories like this, projects like Sphinx can have a considerable barrier to entry for the uninitiated. The variety in sources can only improve quality.
Re:Voice Response Systems by AngryUndead · 2006-10-10 09:14 · Score: 3, Insightful

Do you remember when you actually had to go to the office to handle things these systems are used for? Talk to a human? Do you remember a time when they just punched you in the bean bag for trying to quit?

The developers who work overtime to bring such advances should damn near be nominated for saint-hood. Or maybe you could learn to enunciate.

--
I wear the ring.
Re:Muffin for Jew to Ski here? by vertinox · 2006-10-10 09:18 · Score: 2, Interesting

Are there people out there who use voice as their main method of inputting text? For older people who type incredibly slowly, would this software be worthwhile for composing emails?

I knew a few old people who asked about it and tried it, but I think the real holy grail for voice recognition is not a replacement for typing text, but rather understanding the context of what you want it to do.

You know... "Computer go to Red Alert!" like Star Trek.

But in our case it would be...

"Computer. Go to email and tell me if Bob sent a message."
"Computer. Go to Slashdot and alert me if there is a dupe."

But that would require more AI to understand what you are telling it to do rather than just type what you are saying... Of course, plain transcription will have to reach 100% accuracy first, before we see context-driven voice recognition.

--
"I am the king of the Romans, and am superior to rules of grammar!"
-Sigismund, Holy Roman Emperor (1368-1437)
Re:Wreck A Nice Beach... by bdwoolman · 2006-10-10 09:32 · Score: 3, Informative

wreck a nice beach

recognize speech

Entered with Dragon Systems 9.
Not trying to be snotty. Just informative. Dragon Systems has been pretty good since version 7. Eight was a real improvement. Nine is totally awesome. Almost magic. There is a user learning curve, however. One does have to dictate the punctuation for example. Nine works with Firefox very nicely.
Wordos do happen from time to time when you 'wreck a nice beach' (sic), but then so do typos. Everything needs to be edited no matter how it was entered. It's fair to say that speech recognition has come of age with Dragon Systems nine.
Let me also add that I am not a shill. I am not connected with Dragon Systems in any way shape or form. Just a very happy and satisfied user.

--
"No fear. No envy. No meanness." Liam Clancy
Re:Muffin for Jew to Ski here? by AJWM · 2006-10-10 09:50 · Score: 2, Interesting

I remember messing around with voice recognition in the 90s but the CPU power wasn't there to do real time voice.

Depends on the approach. I recall circa 1980 or so a prof at Concordia U. had a speech recognizer on a VAX 11/780 (with an A/D adapter). It didn't have to be trained on the speaker, and recognized my "Mary had a little lamb, its fleece was white as snow"(*) in a mere 10 or 15 minutes.

Okay, hardly real time, but that was on a 1 MIPS machine. It was also logging all the steps it took to analyze the speech input. It ought to be real time (or better?) on current hardware.

(* a phrase of some historical significance.)

--
-- Alastair
What is your choice?..."Operator"...I'm sorry. by PRMan · 2006-10-10 10:05 · Score: 2, Insightful

What is your choice?..."Operator"...I'm sorry. Please say another option...."CUS-TO-MER SER-VICE REP-RE-SENT-A-TIVE!!!"...I'm sorry...
That's usually the gist of my conversation with those automated systems.
If I'm calling, it's not something that can be solved with an automated prompt. If it was, I would have looked it up on your website already... I'm calling specifically because there's something WRONG with my account!

--
Peter predicted that you would "deliberately forget" creation 2000 years ago...
Well no, not *that* one. by Kadin2048 · 2006-10-10 10:06 · Score: 2, Insightful

Well that particular CC license would be particularly bad (actually I don't know what it would be good for, might as well just say "All Rights Reserved" and save space), but there are others that would be fine.

Creative Commons ShareAlike is GFDL compatible, at least according to Wikimedia. Or heck, why not just use the GFDL itself?

The reason not to use the GPL on something like this is because there's not a clear separation between "source" and "binary" like there would be for a programming project; there's just the work itself, and other derivative works. Thus a whole lot of the GPL would be redundant.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Re:A sound affair. by k12linux · 2006-10-10 11:10 · Score: 3, Insightful

I would love to have quality Vox software for use in schools vs paying handsomely for proprietary stuff. The disabled children who use it would be grateful too since we wouldn't be restricted to installing only on 2% of the PCs in a school without breaking our budget.
Why not use the NIST database? by jesup · 2006-10-10 11:40 · Score: 3, Interesting

Back in roughly 1991 or 1992, I was working at Commodore/Amiga with AT&T DSP3210s (we were considering adding them to Amiga 3000s/4000s). They supported speech recognition, and due to that (somehow) I was asked to participate in a NIST program to collect speech samples over the telephone. You made calls where you were randomly connected to another participant, and talked about a given topic. I imagine they were later transcribed; the purpose of this was to create a natural (connected) speech database for speech recognition researchers and vendors.

Since it was done by NIST, I imagine the database is available. Note that it will be limited by the telephone call quality (4 kHz).
How will it be distributed? by SpokenLang · 2006-10-10 12:28 · Score: 2, Interesting

How much data are they planning to collect and how will it be distributed? The web site says that they will make available 8kHz and 16kHz versions of the data, both with 16-bit samples. These days, decent applications are trained on hundreds and even thousands of hours of audio. So, let's say they want to collect and distribute 1000 hours of 16kHz, 16-bit audio. That's 32,000 bytes per second of audio, or about 115 megabytes per hour, or 115 gigabytes per 1000 hours! Even 500 hours (58 gigs) is a LOT of data. Are they planning to make this available via download? If they want to distribute it on DVDs, that is about 24 DVDs (for 1000 hrs), which would be a lot of work to burn and ship...
I think it is an admirable project, but it seems like the practicalities could make this VERY difficult to complete. The Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/ has been distributing speech corpora for about 15 years and it is not easy (and no, I don't work for the LDC.)
IFA Dutch Corpus by finiteSet · 2006-10-10 13:15 · Score: 4, Informative

Correct me if I'm wrong but GPL was made for code, not audio.
There is more to it than the poster mentions (I don't know if the site addresses this - it is Slashdotted). You don't just need audio - speech audio is abundant - you need annotated audio. In most cases, this annotation is phonetic (or phonemic) transcription, which labels segments of the audio according to the speech sound present in that audio segment. Most state-of-the-art speech systems use a machine learning approach: the system is "trained" on training data, with the hopes that the patterns learned generalize well on new data. This training is a supervised process: it requires the answers, and the answers are found in the annotation. It is this combination of audio and annotation that is valuable, and that is hard to come by. If their system prompts you to read phrases, it could use an existing recognition system to produce a roughly aligned phonetic transcription. It would be far from perfect, but useful nonetheless.
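For anyone wondering what "annotated audio" looks like as training data, here is a minimal Python sketch. Everything in it is an assumption for illustration: the class and function names, the 10 ms frame length, the "sil" silence label and the hand-made segmentation of the word "cat" with ARPAbet-style phone symbols. It is not the format of any particular corpus. The idea is only that each stretch of audio carries a phone label, and those labels are the "answers" a supervised trainer learns from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhoneSegment:
    """One annotated stretch of audio: [start_ms, end_ms) carries `phone`."""
    start_ms: int
    end_ms: int
    phone: str


def frame_labels(segments, total_ms, frame_ms=10, silence="sil"):
    """Expand segment annotations into one label per fixed-length frame.

    Frames not covered by any segment are labelled as silence.
    """
    n_frames = total_ms // frame_ms
    labels = [silence] * n_frames
    for segment in segments:
        first = segment.start_ms // frame_ms
        last = min(segment.end_ms // frame_ms, n_frames)
        for index in range(first, last):
            labels[index] = segment.phone
    return labels


# Hypothetical hand-made annotation of someone saying "cat".
CAT_ANNOTATION = [
    PhoneSegment(0, 50, "k"),
    PhoneSegment(50, 150, "ae"),
    PhoneSegment(150, 200, "t"),
]

cat_labels = frame_labels(CAT_ANNOTATION, total_ms=250)
```

Each label would be paired with the acoustic features of the same frame. Without the segment boundaries you have plenty of sound and no targets, which is the gap the transcription (or a rough automatic alignment) fills.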

From TFA:
The reason for this is that there are no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines.
What? The IFA Dutch "Open-Source" Corpus is a phonemically-annotated speech corpus released under the GNU GPL (read more - pdf). They even have an SQL interface. Did you mean English speech corpora?

--
If we start buying CDs then the terrorists have already won.
Re:Muffin for Jew to Ski here? by jthayden · 2006-10-10 20:36 · Score: 2, Informative

I remember messing around with voice recognition in the 90s

The article is about speech recognition as is your post. Speech recognition is about recognizing what was said. Voice recognition is about recognizing who said it. The distinction is important since the coding and the problems associated with them are very different.
Isn't this reinventing Librivox's wheel? by jhutch2000 · 2006-10-11 01:01 · Score: 4, Informative

Librivox has THOUSANDS of hours of audio books available. Every last second is public domain.

I'm probably missing something in regards to why this stuff can't be used...