Slashdot Mirror


Improving Open Source Speech Recognition

kmaclean writes, "VoxForge collects free GPL Transcribed Speech Audio that can be used in the creation of Acoustic Models for use with Open Source Speech Recognition Engines. We are essentially creating a user-submitted repository of the 'source' speech audio for the creation of Acoustic Models to be used by Speech Recognition Engines. The Speech Audio files will then be 'compiled' into Acoustic Models for use with Open Source Speech Recognition engines such as Sphinx, HTK, CAVS and Julius." Read on for why we need free GPL speech audio.

Why free GPL Speech Audio?

Speech Recognition Engines require two types of files to recognize speech. The first is an Acoustic Model, which is created by taking a very large number of audio recordings of speech and their transcriptions (called Speech Corpus or Corpora) and 'compiling' them into statistical representations of the sounds that make up each word. The second is a Language Model or Grammar file. A Language Model is a very large file containing the probabilities of certain sequences of words. A Grammar is a much smaller file containing sets of predefined combinations of words.
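To make the Language Model versus Grammar distinction concrete, here is a minimal sketch. It is not how Sphinx, HTK, CAVS or Julius actually store these files: the function names, the toy sentences and the bigram (two-word sequence) simplification are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Estimate P(next word | previous word) from example sentences.

    Hypothetical helper: a real Language Model is trained on far more
    text and usually applies smoothing, which is omitted here.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    # Turn raw counts into relative frequencies per preceding word.
    return {
        prev: {word: n / sum(following.values()) for word, n in following.items()}
        for prev, following in counts.items()
    }

# A Grammar, by contrast, is just a small fixed set of allowed phrases.
# These phrases are made up for the example.
GRAMMAR = {"open mail", "close mail", "read message"}

def grammar_accepts(utterance):
    """Return True if the utterance is one of the predefined phrases."""
    return utterance.lower() in GRAMMAR

if __name__ == "__main__":
    model = train_bigram_model(["open mail", "open file", "close mail"])
    print(model["open"])              # probabilities of words following "open"
    print(grammar_accepts("Open Mail"))
```

The Language Model grows with the amount of training text because it keeps a probability for every word sequence it has seen, while the Grammar stays small because it only lists the combinations an application expects.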

Most Acoustic Models used by 'Open Source' Speech Recognition engines are 'closed source'. They do not give you access to the speech audio (the 'source') used to create the acoustic model, or if they do, there are licensing restrictions on the distribution of the 'source' (i.e. you can only use it for personal or research purposes). The reason is that there are no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines. Open Source projects are required to purchase Speech Corpora, which have restrictive licensing: they are not permitted to distribute the 'source' speech audio, but they are permitted to distribute the 'compiled' Acoustic Model.

Why GPL?

A GPL-style license will ensure that user contributions will always benefit the open source community, since it requires any distribution of derivative Acoustic Models to include access to the 'source' speech audio.

12 of 121 comments

  1. GPL? by PhrostyMcByte · · Score: 2, Interesting

    Wouldn't a Creative Commons license be better for this? Correct me if I'm wrong but GPL was made for code, not audio.

    1. Re:GPL? by Anonymous Coward · · Score: 1, Interesting

      At least "Creative Commons Attribution-NonCommercial-NoDerivs 2.5" probably won't do if you consider some models to be derivatives of the audio samples.

    2. Re:GPL? by SpokenLang · · Score: 2, Interesting

      The difference between using audio data to "compile" an acoustic model, and using source code to compile an executable is that when you create acoustic models from audio data, you don't modify the acoustic data, you use it "as is". So, it doesn't really make sense to require me to distribute an identical copy of the data along with my acoustic models.

      On a related note, how much data are they planning to collect and how will it be distributed? The web site says that they will make available 8kHz and 16kHz versions of the data, both with 16-bit samples. These days, decent applications are trained on hundreds and even thousands of hours of audio. So, let's say they want to collect and distribute 1000 hours of 16kHz, 16 bit audio. That's 32,000 bytes per second of audio, or about 115 megabytes per hour, or 115 gigabytes per 1000 hours! Even 500 hours (58 gigs) is a LOT of data. Are they planning to make this available via download? If they want to distribute it on DVDs, that is about 24 DVDs (for 1000 hrs), which would be a lot of work to burn and ship...
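The back-of-envelope figures above can be checked with a short sketch. It assumes uncompressed, single-channel (mono) audio and a single-layer DVD capacity of 4.7 GB (decimal); neither assumption is stated on the project site, and the function names are made up for the example.

```python
import math

SAMPLE_RATE_HZ = 16_000                 # 16 kHz sampling, as quoted above
BYTES_PER_SAMPLE = 2                    # 16-bit samples
DVD_CAPACITY_BYTES = 4_700_000_000      # assumption: single-layer DVD, 4.7 GB

def corpus_size_bytes(hours, sample_rate_hz=SAMPLE_RATE_HZ,
                      bytes_per_sample=BYTES_PER_SAMPLE):
    """Size of uncompressed mono audio for the given number of hours."""
    return hours * 3600 * sample_rate_hz * bytes_per_sample

def dvds_needed(size_bytes, capacity_bytes=DVD_CAPACITY_BYTES):
    """Whole discs required, rounding up any partial disc."""
    return math.ceil(size_bytes / capacity_bytes)

if __name__ == "__main__":
    for hours in (1, 500, 1000):
        size = corpus_size_bytes(hours)
        print(f"{hours} h: {size / 1e9:.1f} GB, {dvds_needed(size)} DVD(s)")
```

Under these assumptions one hour is 115.2 MB and 1000 hours is 115.2 GB, matching the estimate; rounding the disc count up to whole DVDs gives 25 rather than 24.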

      I think it is an admirable project, but it seems like the practicalities could make this VERY difficult to complete. The LDC http://www.ldc.upenn.edu/ has been distributing speech corpora for about 15 years and it is not easy (and no, I don't work for the LDC.)

  2. Re:A sound affair. by markwalling · · Score: 2, Interesting

    telephone services like tell me (18005558355), and my bank (USAA) work fairly well. my old bank had a touch tone system which was hard to use while driving. the error rate of my new bank's system is fairly low.

    but agreeing with you, the voice system in my cell phone sucks.

    --
    ...For the beast had been reborn with its strength renewed, and the followers of Mammon cowered in horror.

  3. Re:Muffin for Jew to Ski here? by Anonymous Coward · · Score: 1, Interesting

    One of the guys in my class last year wrote a DJ application that used a mic you could speak your commands into. It could find you music based on genre, artist, song title and lots of other stuff. The cool part about it was that it would announce the songs as well as any commands it was currently doing. He had it running on his laptop using the new speech engine in Vista. It was really really cool and worked very well. Having an open source tool to do stuff like this would be fantastic.

  4. Re:Data conditioning (GIGO) by NamShubCMX · · Score: 2, Interesting

    But shouldn't there be many different accents for such a program to work? I am French Canadian, and I sure hope I don't have to imitate a France accent for my voice to be faithfully recognized. Because although I have a strong accent from the point of view of French people, I don't from the point of view of Quebec people.

    On the same line of thought, I hope I can use this tool with my heavy (ok not so bad) English accent...

    I have no clue how those programs work so I might be off-base, but it seems to me that for speech recognition to work there should be AS MANY accents as possible, as long as those are identified and you can find an accent that corresponds to your own (?)

    --
    We've always been at war with Eurasia.

  5. Re:It's about time by smilindog2000 · · Score: 4, Interesting

    I'll pitch in. I lost the use of my hands for three years due to a repetitive motion injury, and had to code by voice. That was 1997, nine years ago. I figured that within a couple years, the technology would be so great I would out-code my peers. Then the web bubble came, and Dragon Systems lost their focus on helping disabled people and focused instead on letting people dictate to Word. The creators of this great technology eventually sold out and moved on. Nine years later, the best product for coding is nine years old: the original Dragon Dictate. It doesn't even use the CPU for its signal processing: it runs that on the sound card because in the early 90's the sound card had more power.

    We've gotta do something to get this beast moving forward.

    --
    Beer is proof that God loves us, and wants us to be happy.

  6. It's about time by pestilence669 · · Score: 2, Interesting

    I've been waiting for something like this for a long time... Hiring voice actors isn't always feasible. I can work on engine code no problem, but my voice isn't the prettiest. Without repositories like this, projects like Sphinx can have a considerable barrier to entry for the uninitiated. The variety in sources can only improve quality.

  7. Re:Muffin for Jew to Ski here? by vertinox · · Score: 2, Interesting

    Are there people out there who use voice as their main method of inputting text? For older people who type incredibly slowly would this software be worthwhile using for composing emails?

    I knew a few old people who asked about it and tried it, but I think the real holy grail for voice recognition is not a replacement for typing text, but rather understanding the context of what you want it to do.

    You know... "Computer go to Red Alert!" like Star Trek.

    But in our case it would be...

    "Computer. Go to email and tell me if Bob sent a message."
    "Computer. Go to Slashdot and alert me if there is a dupe."

    But that would require more AI to understand what you are telling it to do rather than just type what you are saying... Of course, plain transcription will have to reach 100% accuracy first before we see context-driven voice recognition.

    --
    "I am the king of the Romans, and am superior to rules of grammar!"
    -Sigismund, Holy Roman Emperor (1368-1437)

  8. Re:Muffin for Jew to Ski here? by AJWM · · Score: 2, Interesting

    I remember messing around with voice recognition in the 90s but the CPU power wasn't there to do real time voice.

    Depends on the approach. I recall circa 1980 or so a prof at Concordia U. had a speech recognizer on a VAX 11/780 (with an A/D adapter). It didn't have to be trained on the speaker, and recognized my "Mary had a little lamb, its fleece was white as snow"(*) in a mere 10 or 15 minutes.

    Okay, hardly real time, but that was on a 1 MIPS machine. It was also logging all the steps it took to analyze the speech input. It ought to be real time (or better?) on current hardware.

    (* a phrase of some historical significance.)

    --
    -- Alastair

  9. Why not use the NIST database? by jesup · · Score: 3, Interesting

    Back in roughly 1991 or 1992, I was working at Commodore/Amiga with AT&T DSP3210's (we were considering adding them to Amiga 3000's/4000's). They supported speech recognition, and due to that (somehow) I was asked to participate in a NIST program to collect speech samples over telephone. You made calls where you were randomly connected to another participant, and talked about a given topic. I imagine they were later transcribed; the purpose of this was to create a natural (connected) speech database for speech recognition researchers and vendors.

    Since it was done by NIST, I imagine the database is available. Note that it will be limited by the telephone call quality (roughly 4 kHz of audio bandwidth).

  10. How will it be distributed? by SpokenLang · · Score: 2, Interesting

    How much data are they planning to collect and how will it be distributed? The web site says that they will make available 8kHz and 16kHz versions of the data, both with 16-bit samples. These days, decent applications are trained on hundreds and even thousands of hours of audio. So, let's say they want to collect and distribute 1000 hours of 16kHz, 16 bit audio. That's 32,000 bytes per second of audio, or about 115 megabytes per hour, or 115 gigabytes per 1000 hours! Even 500 hours (58 gigs) is a LOT of data. Are they planning to make this available via download? If they want to distribute it on DVDs, that is about 24 DVDs (for 1000 hrs), which would be a lot of work to burn and ship...

    I think it is an admirable project, but it seems like the practicalities could make this VERY difficult to complete. The Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/ has been distributing speech corpora for about 15 years and it is not easy (and no, I don't work for the LDC.)