Improving Open Source Speech Recognition
kmaclean writes, "VoxForge collects free GPL Transcribed Speech Audio that can be used in the creation of Acoustic Models for use with Open Source Speech Recognition Engines. We are essentially creating a user-submitted repository of the 'source' speech audio for the creation of Acoustic Models to be used by Speech Recognition Engines. The Speech Audio files will then be 'compiled' into Acoustic Models for use with Open Source Speech Recognition engines such as Sphinx, HTK, CAVS and Julius." Read on for why we need free GPL speech audio.
Why free GPL Speech Audio?
Speech Recognition Engines require two types of files to recognize speech. The first is an Acoustic Model, which is created by taking a very large number of audio recordings of speech and their transcriptions (called Speech Corpus or Corpora) and 'compiling' them into statistical representations of the sounds that make up each word. The second is a Language Model or Grammar file. A Language Model is a very large file containing the probabilities of certain sequences of words. A Grammar is a much smaller file containing sets of predefined combinations of words.
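To make the Language Model versus Grammar distinction concrete, here is a minimal Python sketch. It is only an illustration: the three-sentence corpus, the helper names (bigram_probability, grammar_accepts) and the command set are all made up for this example, and real engines such as Sphinx or Julius use their own file formats and far larger data.

```python
from collections import Counter

# Toy transcriptions (hypothetical); a real Language Model is estimated
# from a very large body of text.
corpus = [
    "open the file",
    "open the window",
    "close the file",
]

# Language Model sketch: bigram probabilities P(next word | previous word).
bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[(prev, nxt)] += 1
        context_counts[prev] += 1


def bigram_probability(prev, nxt):
    """Relative frequency of `nxt` following `prev` in the toy corpus."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, nxt)] / context_counts[prev]


# Grammar sketch: a small, fixed set of predefined word combinations.
grammar = {
    (verb, "the", noun)
    for verb in ("open", "close")
    for noun in ("file", "window")
}


def grammar_accepts(sentence):
    """True only if the sentence is one of the predefined combinations."""
    return tuple(sentence.split()) in grammar
```

The Language Model assigns a probability to any word sequence it has statistics for, while the Grammar simply accepts or rejects: grammar_accepts("close the window") is true even though that sentence never appears in the toy corpus, and anything outside the predefined set is rejected.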
Most Acoustic Models used by 'Open Source' Speech Recognition engines are 'closed source'. They do not give you access to the speech audio (the 'source') used to create the acoustic model, or if they do, there are licensing restrictions on the distribution of the 'source' (i.e. you can only use it for personal or research purposes). The reason is that there are no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines. Open Source projects are required to purchase Speech Corpora that carry restrictive licensing; i.e. they are not permitted to distribute the 'source' speech audio, but they are permitted to distribute the 'compiled' Acoustic Model.
Why GPL?
A GPL-style license will ensure that user contributions will always benefit the open source community, since it requires any distribution of derivative Acoustic Models to include access to the 'source' speech audio.
Aren't people recognising open source speech well enough already? Perhaps we need to tone down the zealotry.
Dear Aunt, let's set so double the killer delete select all.
liqbase
Wouldn't a Creative Commons license be better for this? Correct me if I'm wrong but GPL was made for code, not audio.
Improving open source speech rec and tts will be a HUGE improvement in the grand scheme of progress as far as human-computer interaction is concerned. The main reason is because Nuance has a near monopoly in this market and they charge INSANE licensing fees to do anything with their technology. Whenever closed-source competition comes along, they just buy them out. Heck, their sales people even talk down to you on the phone because they know they're the only game in town.
Having a viable open source alternative will ensure that everyone has access to this technology and there will be many new innovations that will just continue to make technology cooler.
Please people, take the time out of your schedule to record prompts. It will do everyone a lot of good.
It's helpful to understand that there are two very different modes of speech recognition.
Continuous speech to text is useful for flat-out dictation, where the speaker should be allowed to speak in a clear voice at a normal or almost normal speed, saying whatever the speaker wants to say, and the result should be a sequence of recognized words.
Prompted speech to text is useful if your program is trying to exchange a dialogue with a human, such as a voice prompt or simply a set of useful user-voice commands. In this mode, the listening routine has a set of "expected" responses and should try only to recognize one of those responses.
The latter form usually requires a lot less training from an individual human, and is more robust in noisy environments, since the range of recognized expression is very tightly controlled to a few possibilities. The former mode, continuous speech, is much harder to accurately recognize without personal training for each human speaker, or significant statistical work in background processing.
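As a rough illustration of why a small set of "expected" responses is more forgiving, here is a minimal Python sketch. It is a simplification under stated assumptions: a real engine constrains the search during acoustic decoding, whereas this sketch only snaps an already-transcribed (and possibly garbled) text hypothesis to the nearest allowed phrase. The phrase list and the match_expected helper are hypothetical.

```python
import difflib

# Hypothetical set of responses a phone menu might be listening for.
EXPECTED_RESPONSES = ["check balance", "transfer funds", "operator"]


def match_expected(hypothesis, expected=EXPECTED_RESPONSES, cutoff=0.6):
    """Return the expected response closest to a noisy hypothesis, or None.

    `cutoff` is the minimum similarity ratio (0..1) required for a match.
    """
    matches = difflib.get_close_matches(
        hypothesis.lower(), expected, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```

With only a few possibilities, a slightly wrong hypothesis such as "chek balance" still lands on "check balance", while input that resembles none of the expected phrases returns None so the program can re-prompt.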
telephone services like tell me (18005558355), and my bank (USAA) work fairly well. my old bank had a touch tone system which was hard to use while driving. the error rate of my new bank's system is fairly low.
but agreeing with you, the voice system in my cell phone sucks.
...For the beast had been reborn with its strength renewed, and the followers of Mammon cowered in horror.
Record Your Speech and Submit it to VoxForge
Donate your speech for a GPL speech data collection so they can do better recognition.
Includes separate instructions for Windows and Linux users. (Wonder if there will be any significant differences in the quality of the data based on OS...)
This project seems to be gathering a "Wild Type" sampling of submitted data. What if the data is not representative . . . for example, a bunch of people in China decide to submit English language files with the best of intentions, but the data is heavily accented (or, to be fair, if a bunch of native English speakers submitted a bunch of heavily accented recordings of Mandarin speech)?
Without controlling the data source or making sure that the data is valid, one could become a victim of GIGO (Garbage In, Garbage Out). In all fairness, this may not be a problem if the sample size is large enough to overwhelm any outlying data, but I'm not sure that this project has sufficiently addressed this concern . . .
as if a million voices cried out in terror and were suddenly silenced
Never underestimate the dark side of the Source
Why not make the files public domain? Is making them GPL really necessary?
The bits on the bus go on and off... on and off... on and off...
It'd be nice if someone could give an overview of the quality and simplicity of some open source speech recognition projects. I've used Sphinx 2, 3, and 4 before with little luck. I don't know if I got marbles in my mouth or what. Either way, I'm sure there's got to be someone on Slashdot who's used a few and could give an overview to us weekend warriors.
If an officer ever threatens to taze you, say you have a pacemaker.
I wrote a bit about this (somewhat negatively) at http://fosterburgess.com/kimsal/?p=139 a few days ago. I've been looking for a solid option for having some dictation automatically transcribed to text files, and have this run under Linux. Basically, anyone looking to do this is just out of luck. It'll be years before there's anything usable for the average person. In my post, I reference another article (http://www.theinquirer.net/default.aspx?article=34072) which also talks about the state of things.
What's frustrating is that there *was* something halfway decent - IBM's ViaVoice - but that's gone. A few of the Linux apps I see out there are layers to run on top of ViaVoice. With that option gone for Linux, those tools are useless. It's like the rug was pulled out from underneath any progress in this arena for the foreseeable future.
I found VoxForge a few days ago, and while it seems admirable, it's a small part of the larger problem, which I don't see getting any better any time soon.
creation science book
I've been waiting for something like this for a long time... Hiring voice actors isn't always feasible. I can work on engine code no problem, but my voice isn't the prettiest. Without repositories like this, projects like Sphinx can have a considerable barrier to entry for the uninitiated. The variety in sources can only improve quality.
Do you remember when you actually had to go to the office to handle things these systems are used for? Talk to a human? Do you remember a time when they just punched you in the bean bag for trying to quit?
The developers who work overtime to bring such advances should damn near be nominated for saint-hood. Or maybe you could learn to enunciate.
I wear the ring.
Are there people out there who use voice as their main method of inputting text? For older people who type incredibly slowly, would this software be worthwhile for composing emails?
I knew a few old people who asked about it and tried it, but I think the real holy grail for voice recognition is not a replacement for typing text, but rather understanding the context of what you want it to do.
You know... "Computer go to Red Alert!" like Star Trek.
But in our case it would be...
"Computer. Go to email and tell me if Bob sent a message."
"Computer. Go to Slashdot and alert me if there is a dupe."
But that would require more AI to understand what you are telling it to do rather than just type what you are saying... Of course, plain transcription will have to reach 100% accuracy first before we see context-driven voice recognition.
"I am the king of the Romans, and am superior to rules of grammar!"
-Sigismund, Holy Roman Emperor (1368-1437)
recognize speech
Entered with Dragon Systems 9.
Not trying to be snotty. Just informative. Dragon Systems has been pretty good since version 7. Eight was a real improvement. Nine is totally awesome. Almost magic. There is a user learning curve, however. One does have to dictate the punctuation for example. Nine works with Firefox very nicely.
Wordos do happen from time to time when you 'wreck a nice beach' (sic), but then so do typos. Everything needs to be edited no matter how it was entered. It's fair to say that speech recognition has come of age with Dragon Systems nine.
Let me also add that I am not a shill. I am not connected with Dragon Systems in any way shape or form. Just a very happy and satisfied user.
"No fear. No envy. No meanness." Liam Clancy
I remember messing around with voice recognition in the 90s but the CPU power wasn't there to do real time voice.
Depends on the approach. I recall circa 1980 or so a prof at Concordia U. had a speech recognizer on a VAX 11/780 (with an A/D adapter). It didn't have to be trained on the speaker, and recognized my "Mary had a little lamb, its fleece was white as snow"(*) in a mere 10 or 15 minutes.
Okay, hardly real time, but that was on a 1 MIPS machine. It was also logging all the steps it took to analyze the speech input. It ought to be real time (or better?) on current hardware.
(* a phrase of some historical significance.)
-- Alastair
What is your choice?..."Operator"...I'm sorry. Please say another option...."CUS-TO-MER SER-VICE REP-RE-SENT-A-TIVE!!!"...I'm sorry...
That's usually the gist of my conversation with those automated systems.
If I'm calling, it's not something that can be solved with an automated prompt. If it was, I would have looked it up on your website already... I'm calling specifically because there's something WRONG with my account!
Peter predicted that you would "deliberately forget" creation 2000 years ago...
Well that particular CC license would be particularly bad (actually I don't know what it would be good for, might as well just say "All Rights Reserved" and save space), but there are others that would be fine.
Creative Commons ShareAlike is GFDL compatible, at least according to WikiMedia. Or heck, why not just use the GFDL itself?
The reason not to use the GPL on something like this is because there's not a clear separation between "source" and "binary" like there would be for a programming project; there's just the work itself, and other derivative works. Thus a whole lot of the GPL would be redundant.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
I would love to have quality Vox software for use in schools vs paying handsomely for proprietary stuff. The disabled children who use it would be grateful too since we wouldn't be restricted to installing only on 2% of the PCs in a school without breaking our budget.
Back in roughly 1991 or 1992, I was working at Commodore/Amiga with AT&T DSP3210's (we were considering adding them to Amiga 3000's/4000's). They supported speech recognition, and due to that (somehow) I was asked to participate in a NIST program to collect speech samples over telephone. You made calls where you were randomly connected to another participant, and talked about a given topic. I imagine they were later transcribed; the purpose of this was to create a natural (connected) speech database for speech recognition researchers and vendors.
Since it was done by NIST, I imagine the database is available. Note that it will be limited by the telephone call quality (4 kHz).
I think it is an admirable project, but it seems like the practicalities could make this VERY difficult to complete. The Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/ has been distributing speech corpora for about 15 years and it is not easy (and no, I don't work for the LDC.)
From TFA: What? The IFA Dutch "Open-Source" Corpus is a phonemically-annotated speech corpus released under the GNU GPL (read more - pdf). They even have an SQL interface. Did you mean English speech corpora?
If we start buying CDs then the terrorists have already won.
The article is about speech recognition as is your post. Speech recognition is about recognizing what was said. Voice recognition is about recognizing who said it. The distinction is important since the coding and the problems associated with them are very different.
Librivox has THOUSANDS of hours of audio books available. Every last second is public domain.
I'm probably missing something in regards to why this stuff can't be used...