Improving Open Source Speech Recognition
kmaclean writes, "VoxForge
collects free GPL Transcribed Speech Audio that can be used in the creation of Acoustic Models for use with Open Source Speech Recognition Engines. We are essentially
creating a user-submitted repository of the 'source' speech audio for
the creation of Acoustic Models to be used by Speech Recognition Engines. The Speech
Audio files will then be 'compiled' into Acoustic Models for use with
Open Source Speech Recognition engines such as Sphinx, HTK, CAVS and Julius." Read on for why we need free GPL speech audio.
Why free GPL Speech Audio?
Speech Recognition Engines require two types of files to recognize speech. The first is an Acoustic Model, which is created by taking a very large number of audio recordings of speech and their transcriptions (called Speech Corpus or Corpora) and 'compiling' them into statistical representations of the sounds that make up each word. The second is a Language Model or Grammar file. A Language Model is a very large file containing the probabilities of certain sequences of words. A Grammar is a much smaller file containing sets of predefined combinations of words.
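To make the Language Model versus Grammar distinction concrete, here is a minimal sketch in Python. It is not how Sphinx, HTK or Julius store their models; the function names, the toy corpus and the command list are all made up for illustration. The language model is a plain bigram table with no smoothing, and the grammar is just a fixed set of allowed phrases.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Estimate P(next_word | word) from raw counts (toy model, no smoothing)."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {word: count / total for word, count in counter.items()}
    return model

def sequence_probability(model, sentence):
    """Multiply the bigram probabilities along the sentence; unseen pairs give 0."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob

# A "grammar" in the sense used above: a small, predefined set of word
# combinations (hypothetical commands, chosen only for this example).
GRAMMAR = {"open file", "close file", "save file"}

def grammar_accepts(phrase):
    """True only if the phrase is one of the predefined combinations."""
    return phrase.lower().strip() in GRAMMAR

# Hypothetical three-sentence training corpus.
corpus = ["recognize speech", "recognize speech now", "wreck a nice beach"]
model = train_bigram_model(corpus)
```

A real language model is trained on far more text and smooths the counts so that unseen word pairs do not get a probability of exactly zero; the sketch only shows why the file gets large (one entry per observed word sequence) while a grammar stays small.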
Most Acoustic Models used by 'Open Source' Speech Recognition engines are 'closed source'. They do not give you access to the speech audio (the 'source') used to create the acoustic model, or if they do, there are licensing restrictions on the distribution of the 'source' (i.e. you can only use it for personal or research purposes). The reason is that there are no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines. Open Source projects are required to purchase Speech Corpora that have restrictive licensing; i.e. they are not permitted to distribute the 'source' speech audio, but they are permitted to distribute the 'compiled' Acoustic Model.
Why GPL?
A GPL-style license will ensure that user contributions will always benefit the open source community, since it requires any distribution of derivative Acoustic Models to include access to the 'source' speech audio.
"Read on for why we need free GPL speech audio."
Considering people's attitudes towards speech interfaces anyway, why do we need this again?
I don't think that's quite right. I remember messing around with voice recognition in the 90s but the CPU power wasn't there to do real time voice. That and you get complete gibberish half the time.
Are there people out there who use voice as their main method of inputting text? For older people who type incredibly slowly, would this software be worthwhile for composing emails?
Aren't people recognising open source speech well enough already? Perhaps we need to tone down the zealotry.
Dear Aunt, let's set so double the killer delete select all.
liqbase
Wouldn't a Creative Commons license be better for this? Correct me if I'm wrong but GPL was made for code, not audio.
As long as it's not "Dear aunt, let's set so the double killer delete select all" - I'm happy
Improving open source speech rec and TTS will be a HUGE improvement in the grand scheme of progress as far as human-computer interaction is concerned. The main reason is that Nuance has a near monopoly in this market and they charge INSANE licensing fees to do anything with their technology. Whenever closed-source competition comes along, they just buy them out. Heck, their sales people even talk down to you on the phone because they know they're the only game in town.
Having a viable open source alternative will ensure that everyone has access to this technology and there will be many new innovations that will just continue to make technology cooler.
Please people, take the time out of your schedule to record prompts. It will do everyone a lot of good.
It's helpful to understand that there are two very different modes of speech recognition.
Continuous speech to text is useful for flat-out dictation, where the speaker should be allowed to speak in a clear voice at a normal or almost normal speed, saying whatever the speaker wants to say, and the result should be a sequence of recognized words.
Prompted speech to text is useful if your program is trying to exchange a dialogue with a human, such as a voice prompt or simply a set of useful user-voice commands. In this mode, the listening routine has a set of "expected" responses and should try only to recognize one of those responses.
The latter form usually requires a lot less training from an individual human, and is more robust in noisy environments, since the range of recognized expression is very tightly controlled to a few possibilities. The former mode, continuous speech, is much harder to accurately recognize without personal training for each human speaker, or significant statistical work in background processing.
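As a rough illustration of why the prompted mode is more robust, here is a sketch in Python that snaps whatever the recognizer heard onto the closest expected response. It works on text, not audio: it assumes some recognizer has already produced a (possibly wrong) text hypothesis, and the function name and the 0.6 cutoff are illustrative choices, not anything taken from a real engine.

```python
import difflib

def match_expected(hypothesis, expected, cutoff=0.6):
    """Return the expected response closest to the hypothesis, or None.

    hypothesis: text a recognizer produced (assumed input, may contain errors).
    expected:   the short list of responses the prompt is listening for.
    cutoff:     minimum similarity (0..1) required to accept a match.
    """
    normalized = hypothesis.lower().strip()
    matches = difflib.get_close_matches(normalized, expected, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# The prompt only listens for these two answers.
EXPECTED = ["yes", "no"]
```

With only a handful of candidates, a slightly garbled hypothesis such as "yess" still lands on "yes", while something unrelated such as "operator" is rejected outright. Continuous dictation has no such short list to fall back on, which is part of why it is the harder problem.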
Record Your Speech and Submit it to VoxForge
Donate your speech for a GPL speech data collection so they can do better recognition.
Includes separate instructions for Windows and Linux users. (Wonder if there will be any significant differences in the quality of the data based on OS...)
Wreck A Nice Beach...Recognize Speech. Call me when it can tell them apart.
None of them can see the clouds; The polished wings don't care.
This project seems to be gathering a "Wild Type" sampling of submitted data. What if the data is not representative . . . for example, a bunch of people in China decide to submit English language files with the best of intentions, but the data is heavily accented (or, to be fair, if a bunch of native English speakers submitted a bunch of heavily accented recordings of Mandarin speech)?
Without controlling the data source or making sure that the data is valid, one could become a victim of GIGO (Garbage In, Garbage Out). In all fairness, this may not be a problem if the sample size is large enough to overwhelm any outlying data, but I'm not sure that this project has sufficiently addressed this concern . . .
as if a million voices cried out in terror and were suddenly silenced
Never underestimate the dark side of the Source
computer: "Please say yes or no."
human: "Yes."
computer: "I couldn't understand your response. Please say yes or no."
human: "Yes."
computer: "I seem to be having trouble understanding you. Let's start over. Please enter or say your 21-digit account number followed by the pound key."
human: (unprintable)
This is a typical interaction on one of these Voice-Response-Hell systems. Companies that use such systems deserve to go out of business. A special place in Hell is reserved for people who develop and sell them.
Why not make the files public domain? Is making them GPL really necessary?
The bits on the bus go on and off... on and off... on and off...
It'd be nice if someone could give an overview of the quality and simplicity of some open source speech recognition projects. I've used Sphinx 2, 3, and 4 before with little luck. I don't know if I've got marbles in my mouth or what. Either way, I'm sure there's got to be someone on Slashdot who's used a few and could give an overview to us weekend warriors.
If an officer ever threatens to taze you, say you have a pacemaker.
Well, with research like this, speech recognition will be better.
I wrote a bit about this (somewhat negatively) at http://fosterburgess.com/kimsal/?p=139 a few days ago. I've been looking for a solid option for having some dictation automatically transcribed to text files, and have this run under Linux. Basically, anyone looking to do this is just out of luck. It'll be years before there's anything useable for the average person. In my post, I reference another article (http://www.theinquirer.net/default.aspx?article=34072) which also talks about the state of things.
What's frustrating is that there *was* something halfway decent - IBM's ViaVoice - but that's gone. A few of the Linux apps I see out there are layers to run on top of ViaVoice. With that option gone for Linux, those tools are useless. It's like the rug was pulled out from underneath any progress in this arena for the foreseeable future.
I found voxforge a few days ago, and while it seems admirable, it's a small part of the larger problem which I don't see getting any better any time soon.
creation science book
I've been waiting for something like this for a long time... Hiring voice actors isn't always feasible. I can work on engine code no problem, but my voice isn't the prettiest. Without repositories like this, projects like Sphinx can have a considerable barrier to entry for the uninitiated. The variety in sources can only improve quality.
Well, it was a nice try, but too much success can kill too.
The government can't save you.
Do you realize that GIGO is an older acronym than 99% of all the acronyms you've read on Slashdot?
So your question should be: "Jesus, what was going on in this world way back when, before I was born!"
I'll have to go back when it is not slashdotted. I am not sure what they need, but if they just need a good voice reading something I'll give it a try. I have been told that I sound like the guy on Moviefone. :)
I would also love to see open source dictation software. [shameless plug](See my journal for why)[/shameless plug]
Fired up iListen from MacSpeech (who license the Philips Voice Recognition model). Spoke both phrases in normal pace and tone. Initial accuracy 75%. Take 30 seconds to correct errors. Accuracy 100%. Even before training/correction, "recognize speech" was at 100%. The training was to teach the difference between "wreck" and "rack" (although it offered "wreck" as one of the options in the correction mode).
It ain't perfect, but training is easy these days and accuracies over 95% are arrived at fairly quickly. The biggest problem for many users seems to be overtraining before they start using the program. Many courts and most captioning systems have moved over to voice transcription systems rather than old fashioned stenography.
My wife does professional transcription and she does almost all of it with Dragon NaturallySpeaking on a Windows system. I'm not seeing a lot of difference between Windows and Mac in accuracy or ease of use, though the Windows side used to have a slight advantage. That being said, open source alternatives would be a great thing. The more people working on this stuff, the faster it gets simpler, faster, and more powerful.
get back to me when they figure out how to interpret the lisp that all fag linux users have
Error: Insufficient computing power.
It must have been something you assimilated. . . .
Which of the popular open source applications have progressed much in the last 5 years?
Consider the same for the second tier applications (GOCR, voice recognition, etc).
What is needed to progress them to near commercial quality and near commercial feature set?
Does only a for profit business have the resources to raise the quality and feature set of most open source software to near commercial levels?
What is your choice?..."Operator"...I'm sorry. Please say another option...."CUS-TO-MER SER-VICE REP-RE-SENT-A-TIVE!!!"...I'm sorry...
That's usually the gist of my conversation with those automated systems.
If I'm calling, it's not something that can be solved with an automated prompt. If it was, I would have looked it up on your website already... I'm calling specifically because there's something WRONG with my account!
Peter predicted that you would "deliberately forget" creation 2000 years ago...
Well that particular CC license would be particularly bad (actually I don't know what it would be good for, might as well just say "All Rights Reserved" and save space), but there are others that would be fine.
Creative Commons ShareAlike is GFDL compatible, at least according to WikiMedia. Or heck, why not just use the GFDL itself?
The reason not to use the GPL on something like this is because there's not a clear separation between "source" and "binary" like there would be for a programming project; there's just the work itself, and other derivative works. Thus a whole lot of the GPL would be redundant.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
I wonder if a better understanding of speech recognition - and having accurate voice models - would allow us to tweak or advance artificial speech programs? While they probably won't do too much to help a computer understand the actual structure of a sentence (word recognition and pronunciation), it might allow them to produce words or sentences that flow more realistically or have more realistic peak stresses on various words/syllables.
I've seen some decent ones, and the OS ones aren't better than the common paid-for emulation I've seen, but both could use improvement.
what? you want to tickle my ass with a feather? oh, particularly nice weather.
My problem? I was perfectly gruntled, until some numbnuts came by and dissed me.
Back in roughly 1991 or 1992, I was working at Commodore/Amiga with AT&T DSP3210's (we were considering adding them to Amiga 3000's/4000's). They supported speech recognition, and due to that (somehow) I was asked to participate in a NIST program to collect speech samples over telephone. You made calls where you were randomly connected to another participant, and talked about a given topic. I imagine they were later transcribed; the purpose of this was to create a natural (connected) speech database for speech recognition researchers and vendors.
Since it was done by NIST, I imagine the database is available. Note that it will be limited by the telephone call quality (4 kHz).
"Please people, take the time out of your schedule to record prompts. It will do everyone a lot of good."
You've reached the voice-mail of Anonymous Coward. I'm not in right now, but if you leave an obscene description, curt comment, tart answer, and a flip remark, I'll get right back to you with a rude gesture. Thank you. *BEEP*
I think it is an admirable project, but it seems like the practicalities could make this VERY difficult to complete. The Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/ has been distributing speech corpora for about 15 years and it is not easy (and no, I don't work for the LDC.)
Open sores peach wreck ignition is final ready.
I hereby place the above post in the public domain.
From TFA: What? The IFA Dutch "Open-Source" Corpus is a phonemically-annotated speech corpus released under the GNU GPL (read more - pdf). They even have an SQL interface. Did you mean English speech corpora?
If we start buying CDs then the terrorists have already won.
... can you copy and paste, "Acoustic Models to be used by Speech Recognition Engines"?
Sorry, someone was excited about "Acoustic Models to be used by Speech Recognition Engines". [giggle]
Librivox has THOUSANDS of hours of audio books available. Every last second is public domain.
I'm probably missing something in regards to why this stuff can't be used...
Nothing to hear here. Move along.
-Clio
Karma: Bad (mostly from not giving a fuck)
Blog: http://clintjcl.wordpress.com
I think you will be pleasantly surprised if you try Dragon Systems. Dragon Systems is special among speech engines. It is the long-term pet project of a couple of gifted scientists who decided to solve the problem of speech recognition a generation ago. They filed hundreds of patents over a couple of decades and solved many engineering problems one at a time. IBM took a long term interest in speech recognition, but none of their products have even approached the steadfast genius of the Dragon. ViaVoice never touched Dragon in capability even on the same machines.
A few years ago Wired magazine wrote a beautiful article on the development of this sublime piece of software. David Pogue, the technology guru of the New York Times, occasionally waxes ecstatic about it. He uses it exclusively for all his writing since he suffers from carpal tunnel syndrome. Wonder no more why the guy is so prolific.
The story of Dragon Systems has an ironic ending. The couple, a married couple by the way, thought they hit the jackpot when they sold Dragon Systems to Lernout and Hauspie for a fortune in stock; this was just before the European company disintegrated in a scandal that rivaled Enron's. This transformed their hard-earned fortune into ashes. ScanSoft (now Nuance) bought the package at a fire-sale price and, bless them, they have upgraded and supported it beautifully. This latest release is astonishing.
I dictated this entire post in a very few minutes with very few corrections. If you do get this program I recommend the 'preferred' version. If I'm not mistaken DS version 9 does not require any training. But it does learn beautifully as you dictate. Also, you can suck in text that you've already produced. And even if it has weird words in it, or proper names, Dragon will do a pretty good job of reproducing them when it hears them next.
And what about homonyms like 'there' and 'their'? Dragon has some contextual algorithms that try to sort them out, but obviously it doesn't always guess correctly. Nevertheless it's easily corrected on the fly. You simply say "select their"; the offending word gets highlighted and a pop-up list of its homonyms and near misses appears, with numbers. "There" would probably be right at the top and you would simply say "choose 1". Then the wrong "their" would be replaced.
Once you get used to this baby you can go on for hours. But I ramble...
"No fear. No envy. No meanness." Liam Clancy
This is the sort of effort in which commercial participation would be a strong benefit. If this were under an MIT or BSD license, I would put this into a specific one of my commercial products for the Nintendo DS, and I'd put a whole lot of work into it. But I can't. Every time I talk about how I can't help certain projects, I get modded down as a troll, because I'm saying something a GPL fan doesn't want to hear. I'm not trolling, and I'm not posting flamebait. This is a serious problem. I can name several other Nintendo software companies who would love to participate in the development of this library, but we can't.
It's a real shame. It'd be good for the library, it'd be good for the Nintendo and it'd be good for the games. Instead, because of the choice of license, everybody loses.
StoneCypher is Full of BS