Slashdot Mirror


Improving Open Source Speech Recognition

kmaclean writes, "VoxForge collects free GPL Transcribed Speech Audio that can be used in the creation of Acoustic Models for use with Open Source Speech Recognition Engines. We are essentially creating a user-submitted repository of the 'source' speech audio for the creation of Acoustic Models to be used by Speech Recognition Engines. The Speech Audio files will then be 'compiled' into Acoustic Models for use with Open Source Speech Recognition engines such as Sphinx, HTK, CAVS and Julius." Read on for why we need free GPL speech audio.

Why free GPL Speech Audio?

Speech Recognition Engines require two types of files to recognize speech. The first is an Acoustic Model, which is created by taking a very large number of audio recordings of speech and their transcriptions (called Speech Corpus or Corpora) and 'compiling' them into statistical representations of the sounds that make up each word. The second is a Language Model or Grammar file. A Language Model is a very large file containing the probabilities of certain sequences of words. A Grammar is a much smaller file containing sets of predefined combinations of words.
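To make the Language Model versus Grammar distinction concrete, here is a minimal sketch in Python. It is not how Sphinx, HTK or Julius store their files; the toy sentences, the unsmoothed bigram estimate and the names `bigram_model` and `allowed_by_grammar` are all assumptions made for illustration.

```python
from collections import Counter


def bigram_model(sentences):
    """Estimate P(next_word | word) from raw counts, with no smoothing."""
    pair_counts = Counter()
    first_word_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for first, second in zip(words, words[1:]):
            pair_counts[(first, second)] += 1
            first_word_counts[first] += 1
    return {
        pair: count / first_word_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Toy training text, invented for this example.
toy_sentences = ["open the file", "open the door", "close the file"]
language_model = bigram_model(toy_sentences)

# A grammar, by contrast, is a small closed set of allowed word combinations.
toy_grammar = {"open file", "close file", "save file"}


def allowed_by_grammar(phrase):
    """Return True only if the phrase is one of the predefined combinations."""
    return phrase.lower() in toy_grammar
```

The language model assigns a probability to any word pair it has seen, while the grammar simply accepts or rejects whole phrases, which is why it can be so much smaller.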

Most Acoustic Models used by 'Open Source' Speech Recognition engines are 'closed source'. They do not give you access to the speech audio (the 'source') used to create the acoustic model, or if they do, there are licensing restrictions on the distribution of the 'source' (i.e. you can only use it for personal or research purposes). The reason is that there are no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines. Open Source projects are required to purchase Speech Corpora, which have restrictive licensing: they are not permitted to distribute the 'source' speech audio, but they are permitted to distribute the 'compiled' Acoustic Model.

Why GPL?

A GPL-style license will ensure that user contributions will always benefit the open source community, since it requires any distribution of derivative Acoustic Models to include access to the 'source' speech audio.

121 comments

  1. A sound affair. by Anonymous Coward · · Score: 0

    "Read on for why we need free GPL speech audio."

    Considering people's attitudes towards speech interfaces anyway. Why do we need this again?

    1. Re:A sound affair. by 2.7182 · · Score: 0, Troll

      Right. People don't use it because no one can get it to work well enough. My cell phone has it and I don't bother.

    2. Re:A sound affair. by markwalling · · Score: 2, Interesting

      telephone services like tell me (18005558355), and my bank (USAA) work fairly well. my old bank had a touch tone system which was hard to use while driving. the error rate of my new bank's system is fairly low.

      but agreeing with you, the voice system in my cell phone sucks.

      --
      ...For the beast had been reborn with its strength renewed, and the followers of Mammon cowered in horror.
    3. Re:A sound affair. by Anonymous Coward · · Score: 0

      that's because they only need to work on a very specific domain.

      It's easy to train a voice recognition system to recognize many voices from words (phonemes) taken from a small domain. It's also easy to train a voice recognition system to recognize one voice from words taken from a large domain.

      The trick is getting a recognition system to recognize many voices in a large domain. so far, that hasn't worked out so well, and still has a long way to go.

    4. Re:A sound affair. by k12linux · · Score: 3, Insightful

      I would love to have quality Vox software for use in schools vs paying handsomely for proprietary stuff. The disabled children who use it would be grateful too since we wouldn't be restricted to installing only on 2% of the PCs in a school without breaking our budget.

    5. Re:A sound affair. by Bloater · · Score: 1

      > The trick is getting a recognition system to recognize many voices in a large domain.

      Powergen in the UK had a system where, when paying a bill by credit card, it would ask for the name on the card. I have an unusual name, and it would get it fine every time. Although, being an AI graduate, I'm used to speaking in a manner that typical analysis algorithms can process well.

  2. Muffin for Jew to Ski here? by Stripsurge · · Score: 1

    I don't think that's quite right. I remember messing around with voice recognition in the 90s but the CPU power wasn't there to do real time voice. That and you get complete gibberish half the time.

    Are there people out there who use voice as their main method of inputting text? For older people who type incredibly slowly, would this software be worthwhile for composing emails?

    1. Re:Muffin for Jew to Ski here? by Anonymous Coward · · Score: 1, Interesting

      One of the guys in my class last year wrote a DJ application with a mic that you could speak your commands into. It could find you music based on genre, artist, song title and lots of other stuff. The cool part about it was that it would announce the songs as well as any commands it was currently carrying out. He had it running on his laptop using the new speech engine in Vista. It was really really cool and worked very well. Having an open source tool to do stuff like this would be fantastic.

    2. Re:Muffin for Jew to Ski here? by vertinox · · Score: 2, Interesting

      Are there people out there who use voice as their main method of inputting text? For older people who type incredibly slowly, would this software be worthwhile for composing emails?

      I knew a few old people who asked about it and tried it, but I think the real holy grail for voice recognition is not a replacement for typing text, but for rather understanding context of what you are wanting it to do.

      You know... "Computer go to Red Alert!" like Star Trek.

      But in our case it would be...

      "Computer. Go to email and tell me if Bob sent a message."
      "Computer. Go to Slashdot and alert me if there is a dupe."

      But that would require more AI to understand what you are telling it to do rather than just type what you are saying... Of course, the latter will have to happen first, with 100% accuracy, before we see context-driven voice recognition.

      --
      "I am the king of the Romans, and am superior to rules of grammar!"
      -Sigismund, Holy Roman Emperor (1368-1437)
    3. Re:Muffin for Jew to Ski here? by sik0fewl · · Score: 1
      "Computer. Go to Slashdot and alert me if there is a dupe."

      Well, it wouldn't take much [artificial] intelligence for that one.

      --
      I remember when legal used to mean lawful, now it means some kind of loophole. - Leo Kessler
    4. Re:Muffin for Jew to Ski here? by AJWM · · Score: 2, Interesting

      I remember messing around with voice recognition in the 90s but the CPU power wasn't there to do real time voice.

      Depends on the approach. I recall circa 1980 or so a prof at Concordia U. had a speech recognizer on a VAX 11/780 (with an A/D adapter). It didn't have to be trained on the speaker, and recognized my "Mary had a little lamb, its fleece was white as snow"(*) in a mere 10 or 15 minutes.

      Okay, hardly real time, but that was on a 1 MIPS machine. It was also logging all the steps it took to analyze the speech input. It ought to be real time (or better?) on current hardware.

      (* a phrase of some historical significance.)

      --
      -- Alastair
    5. Re:Muffin for Jew to Ski here? by hackstraw · · Score: 1


      how about make world?

    6. Re:Muffin for Jew to Ski here? by Laserwulf · · Score: 1

      What would be better than real-time for voice recognition software? Predicting what you're going to say?

      --
      "Make cyberlove, not cyberwar!" -Khaed(544779)
    7. Re:Muffin for Jew to Ski here? by jthayden · · Score: 2, Informative
      I remember messing around with voice recognition in the 90s


      The article is about speech recognition as is your post. Speech recognition is about recognizing what was said. Voice recognition is about recognizing who said it. The distinction is important since the coding and the problems associated with them are very different.

    8. Re:Muffin for Jew to Ski here? by Olivier+Galibert · · Score: 1

      70-80% accuracy (well, 20-30% error rate) seems enough in practice. 100% never happens, even for humans.

          OG.
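For anyone wondering how an "error rate" like that is usually counted: a common convention is word error rate (WER), the word-level edit distance between what was said and what was recognized, divided by the number of words actually said. The sketch below assumes that convention; the function name and the simple whitespace tokenization are illustrative choices, not taken from any particular toolkit.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)
```

One wrong word in a five-word sentence gives 0.2, i.e. a 20% error rate. Because insertions count as errors too, WER can exceed 100%, so "accuracy = 1 - WER" is only a rough shorthand.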

    9. Re:Muffin for Jew to Ski here? by Anonymous Coward · · Score: 0

      "Are there people out there who use voice as their main method of inputting text?"

      Absolutely. My dad is a quadriplegic and uses Dragon Dictate to do everything from moving his mouse to playing Everquest (charm soloing level 57 enchanter).

    10. Re:Muffin for Jew to Ski here? by CrazedWalrus · · Score: 1

      True, but humans then use outside context to figure out the missing/misunderstood words. If the computer could use the 70-80%, combined with a larger context than the current phrase to infer an additional 15% like humans do (pulling numbers out of my hindquarters), that'd be the key.

      I look at the subject "Muffin for Jew to Ski here?" and use both my knowledge of Slashdot and of similar-sounding words to infer what the writer is getting at. The knowledge of Slashdot is an important factor in my accuracy in deciphering the subject. Without it, I can only make assumptions about which words are incorrect (almost all of them, in this case). The key is the combination of context from outside of the text with the context of the text itself.

      My understanding is that the grammar and language rules help define the textual context, but do nothing to deal with bringing in the larger context of Slashdot memes or other seemingly unrelated topics.

      To use another example, I am learning Spanish. My wife and her mother are both native Spanish speakers. There are times where I understand one word between them in ten, if I'm lucky. However, I can usually use that one word -- a name, a place, etc -- to figure out the topic of conversation. Using my knowledge of the topic, I can assume the direction of the conversation, and use that to help fit in more words that I wouldn't otherwise have been able to figure out. Granted, there's a higher error rate than had I been missing a single word in English, but I'm also working from a much smaller known data set. Stuff like tone of voice (is she angry? excited? happy?) and body language also figure into the context.

      To be sure, true language recognition in computers is difficult because they have smaller datasets to work from. They don't see body language, generally ignore tone of voice, and have little to no knowledge of outside events from which to draw context. Without those things, high accuracy rates are possible, but they'll never match humans.

    11. Re:Muffin for Jew to Ski here? by AJWM · · Score: 1

      What would be better than real-time for voice recognition software?

      Doing speech-to-text from a speeded-up recording, or simultaneously doing multiple transcripts from different audio inputs. Or doing .wav file to text in less time than playing the .wav file takes.

      --
      -- Alastair
  3. Just what we need... by benzzene · · Score: 2, Funny

    Aren't people recognising open source speech well enough already? Perhaps we need to tone down the zealotry.

    1. Re:Just what we need... by RobertLTux · · Score: 1

      this is (open source) speech recognition not (open source speech) recognition

      --
      Any person using FTFY or editing my postings agrees to a US$50.00 charge
    2. Re:Just what we need... by TeknoHog · · Score: 1

      What?

      --
      Escher was the first MC and Giger invented the HR department.
  4. Anythings gotta be better than by LiquidCoooled · · Score: 5, Funny

    Dear Aunt, let's set so double the killer delete select all.

    --
    liqbase :: faster than paper
    1. Re:Anythings gotta be better than by Anonymous Coward · · Score: 0

      I was waiting for that one

    2. Re:Anythings gotta be better than by rts008 · · Score: 1

      Yeah, that sums it up pretty well.

      The reality of Star Trek-like voice interaction with a computer is still a ways off, decades perhaps.

      --
      Down With Slashdot BETA!!! I've been around the corner and seen the oliphant; you can only abuse me from your perspecti
    3. Re:Anythings gotta be better than by Electrum · · Score: 1

      What really happened during the speech demo.

    4. Re:Anythings gotta be better than by chochos · · Score: 1

      And this is one way of helping to make it a reality sooner...

  5. GPL? by PhrostyMcByte · · Score: 2, Interesting

    Wouldn't a Creative Commons license be better for this? Correct me if I'm wrong but GPL was made for code, not audio.

    1. Re:GPL? by Anonymous Coward · · Score: 1, Interesting

      At least "Creative Commons Attribution-NonCommercial-NoDerivs 2.5" probably won't do if you consider some models to be derivatives of the audio samples.

    2. Re:GPL? by cheater512 · · Score: 2, Informative

      Actually the summary hints at this but the GPL fits rather nicely.

      There is the 'source' data which is 'compiled' in to something useful.
      Sounds familiar?

    3. Re:GPL? by SpokenLang · · Score: 2, Interesting
      The difference between using audio data to "compile" an acoustic model, and using source code to compile an executable is that when you create acoustic models from audio data, you don't modify the acoustic data, you use it "as is". So, it doesn't really make sense to require me to distribute an identical copy of the data along with my acoustic models.

      On a related note, how much data are they planning to collect and how will it be distributed? The web site says that they will make available 8kHz and 16kHz versions of the data, both with 16-bit samples. These days, decent applications are trained on hundreds and even thousands of hours of audio. So, let's say they want to collect and distribute 1000 hours of 16kHz, 16 bit audio. That's 32,000 bytes per second of audio, or about 115 megabytes per hour, or 115 gigabytes per 1000 hours! Even 500 hours (58 gigs) is a LOT of data. Are they planning to make this available via download? If they want to distribute it on DVDs, that is about 24 DVDs (for 1000 hrs), which would be a lot of work to burn and ship...
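The arithmetic above can be checked with a few lines of Python. This is only a sketch for uncompressed mono PCM; the 4.7 GB figure for a single-layer DVD is an assumption about the discs meant, and `pcm_bytes` is a made-up helper name.

```python
import math


def pcm_bytes(hours, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return bytes_per_second * 3600 * hours


one_hour = pcm_bytes(1)            # 115,200,000 bytes, about 115 MB
thousand_hours = pcm_bytes(1000)   # 115,200,000,000 bytes, about 115 GB

# Assumption: single-layer DVD with a nominal capacity of 4.7 GB.
DVD_BYTES = 4_700_000_000
dvds_needed = math.ceil(thousand_hours / DVD_BYTES)
```

At 16 kHz and 16 bits that is 32,000 bytes per second, and the 1000-hour total divided by 4.7 GB comes to roughly 24.5 discs, so 25 once you round up to whole DVDs.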

      I think it is an admirable project, but it seems like the practicalities could make this VERY difficult to complete. The LDC http://www.ldc.upenn.edu/ has been distributing speech corpora for about 15 years and it is not easy (and no, I don't work for the LDC.)

    4. Re:GPL? by cheater512 · · Score: 1

      Actually the compiled data is just statistical data based on the audio. The audio isn't directly used.
      You can't get the original audio from the compiled version.

      What rock have you been hiding under? Don't you know what an mp3 is? ;)
      I highly doubt they are even considering distributing raw PCM data. It will be compressed in one form or another.
      1000 hours of CD quality mp3 is only roughly 60 gig (your numbers are wrong I think) and voice doesn't need CD quality.
      Anyway they don't *need* to distribute the audio to everyone. It just needs to be there if someone wants it.

    5. Re:GPL? by SpokenLang · · Score: 1
      Actually the compiled data is just statistical data based on the audio. The audio isn't directly used. You can't get the original audio from the compiled version.
      Yes, I realize that you can't recover the audio from the acoustic models. But my point was that using the GPL in this context seems wrong because it would require that I (as the builder of the "derivative" work, aka the acoustic models) make the audio available (unless I misunderstand the GPL.) So, given that I haven't changed the original audio in the process of building my acoustic model, why should I have to distribute it along with my models when the same exact audio is available from these guys? The analogy between "audio data" and "source code" doesn't quite fit because the audio is not modified.

      What rock have you been hiding under? Don't you know what an mp3 is? ;) I highly doubt they are even considering distributing raw PCM data. It will be compressed in one form or another. 1000 hours of CD quality mp3 is only roughly 60 gig (your numbers are wrong I think) and voice doesn't need CD quality.
      For speech processing purposes, mp3 is not used because it is lossy http://en.wikipedia.org/wiki/MP3. But you have a good point. There are lossless compression formats such as shorten http://www.softpedia.com/get/Multimedia/Audio/Audio-Codecs/Shorten.shtml, which is commonly used by the LDC and NIST to distribute audio data for speech processing purposes. My numbers were correct, but I was assuming no compression. So, even if they use shorten to compress the audio, it will still be a substantial amount of data to make available to those who want to use it.
    6. Re:GPL? by cheater512 · · Score: 1

      Uncompressed audio is 700mb for 80 mins. Look on a pack of blank cds. Your math is wrong. :P

      FLAC is a good candidate for a format. It's open source and lossless.
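A quick way to compare the two sets of numbers in this sub-thread is to compute the raw data rates side by side. The sketch below assumes standard CD audio parameters (44.1 kHz, 16-bit, stereo) for the "blank CD" figure and the 16 kHz, 16-bit mono format mentioned for the speech data; the helper name is made up.

```python
def bytes_per_second(sample_rate_hz, bits_per_sample, channels):
    """Raw data rate of uncompressed PCM audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


# Assumption: standard audio CD format, 44.1 kHz, 16-bit, two channels.
cd_rate = bytes_per_second(44_100, 16, 2)      # 176,400 bytes/s
# The speech format discussed in the thread: 16 kHz, 16-bit, one channel.
speech_rate = bytes_per_second(16_000, 16, 1)  # 32,000 bytes/s

eighty_minutes = 80 * 60                       # seconds
cd_80_min = cd_rate * eighty_minutes           # raw CD audio for 80 minutes
speech_80_min = speech_rate * eighty_minutes   # same duration at speech rate

ratio = cd_rate / speech_rate
```

Under those assumptions CD audio carries about 5.5 times as many bytes per second as 16 kHz mono speech, so a per-CD figure and a per-hour-of-speech figure describe different formats rather than contradicting each other.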

    7. Re:GPL? by SpokenLang · · Score: 1

      Ah, right... my numbers were WAY too low! ;-)

  6. dear aunt, by Anonymous Coward · · Score: 0

    As long as it's not "Dear aunt, let's set so the double killer delete select all" - I'm happy

  7. It's about time by jesuscyborg · · Score: 5, Informative

    Improving open source speech rec and TTS will be a HUGE improvement in the grand scheme of progress as far as human-computer interaction is concerned. The main reason is that Nuance has a near monopoly in this market and they charge INSANE licensing fees to do anything with their technology. Whenever closed-source competition comes along, they just buy them out. Heck, their sales people even talk down to you on the phone because they know they're the only game in town.

    Having a viable open source alternative will ensure that everyone has access to this technology and there will be many new innovations that will just continue to make technology cooler.

    Please people, take the time out of your schedule to record prompts. It will do everyone a lot of good.

    1. Re:It's about time by smilindog2000 · · Score: 4, Interesting

      I'll pitch in. I lost the use of my hands for three years due to a repetitive motion injury, and had to code by voice. That was 1997, nine years ago. I figured that within a couple years, the technology would be so great I would out-code my peers. Then the web bubble came, and Dragon Systems lost their focus on helping disabled people and focused instead on letting people dictate to Word. The creators of this great technology eventually sold out and moved on. Nine years later, the best product for coding is nine years old: the original Dragon Dictate. It doesn't even use the CPU for its signal processing: it runs that on the sound card because in the early '90s the sound card had more power.

      We've gotta do something to get this beast moving forward.

      --
      Beer is proof that God loves us, and wants us to be happy.
    2. Re:It's about time by jacquesm · · Score: 1

      holy crap, you ok now ?

    3. Re:It's about time by Anonymous Coward · · Score: 0

      The main reason is that Nuance has a near monopoly in this market and they charge INSANE licensing fees to do anything with their technology. Whenever closed-source competition comes along, they just buy them out.

      To be correct, ScanSoft just bought everyone (IIRC ViaVoice, Speechworks, LocusDialog, Dragon, ...) and then *bought* Nuance. Then they kept the name Nuance and (probably) mainly dropped all the previous stuff they bought.

    4. Re:It's about time by smilindog2000 · · Score: 1

      Yes, I eventually recovered, after switching to a laptop (which I actually use in my lap), and after having a child. Half of the problem with many repetitive motion injuries is stress, and having a family refocused mine.

      --
      Beer is proof that God loves us, and wants us to be happy.
    5. Re:It's about time by kent_eh · · Score: 1

      Hiring voice actors isn't always feasible.

      Nor, in this case I suppose, especially desirable.
      I would expect that you would want a wide variety of voices to train the thing.

      Listen to the general public sometime, do many of them sound like they have "professionally trained" voices?
      Probably not.

      If you train a speech rec. engine with "golden" voices, how can it be expected to figure out the average Joe/Jane on the street?

      I routinely hear from customers whose accent (or manner of enunciation) makes it nearly impossible to use our company's "convenient self-service phone system".

      My voice isn't "exactly pretty" either, but if there was a volunteer voice corpus being assembled (GPL/CC/BSD/whatever licence) I'd read for it.

      --

      ---
      "I can't complain, but sometimes still do..." Joe Walsh
    6. Re:It's about time by kent_eh · · Score: 1
      but if there was a volunteer voice corpus being assembled
      ...And of course, there is.

      (thwacks self on forehead, while chanting RTFA)

      Now where did I put that microphone....

      --

      ---
      "I can't complain, but sometimes still do..." Joe Walsh
    7. Re:It's about time by jacquesm · · Score: 1

      ok, glad to hear that.

      For the longest time I had a speech recognition system sketched out on a whiteboard in
      my office, maybe once I get done with all my current projects (ww.com, daz.com and a bunch
      of smaller stuff) I'll restart it, it's one of the things I really don't like about
      computers, the fact that our whole 'navigation' experience and knowledge seems to
      revolve around large surfaced displays. If we could somehow get rid of that I think
      computers would be *far* more useful.

      best regards, & congratulations on your recovery.

        Jacques.

  8. two modes of speech-to-text, also by Speare · · Score: 5, Informative

    It's helpful to understand that there are two very different modes of speech recognition.

    Continuous speech to text is useful for flat-out dictation, where the speaker should be allowed to speak in a clear voice at a normal or almost normal speed, saying whatever the speaker wants to say, and the result should be a sequence of recognized words.

    Prompted speech to text is useful if your program is trying to exchange a dialogue with a human, such as a voice prompt or simply a set of useful user-voice commands. In this mode, the listening routine has a set of "expected" responses and should try only to recognize one of those responses.

    The latter form usually requires a lot less training from an individual human, and is more robust in noisy environments, since the range of recognized expression is very tightly controlled to a few possibilities. The former mode, continuous speech, is much harder to accurately recognize without personal training for each human speaker, or significant statistical work in background processing.
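A rough way to picture the prompted mode: whatever the recognizer hears gets snapped to the closest entry in a short list of expected responses, or rejected. The sketch below does this on text with Python's standard `difflib`; real engines constrain the search inside the decoder rather than matching strings afterwards, so treat the function name, the cutoff value and the whole approach as illustrative assumptions.

```python
import difflib


def match_expected(heard, expected_responses, cutoff=0.6):
    """Return the expected response closest to what was heard, or None."""
    candidates = [response.lower() for response in expected_responses]
    matches = difflib.get_close_matches(
        heard.lower(), candidates, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# Usage: a yes/no prompt tolerates a slightly garbled hypothesis,
# but refuses to guess when nothing is close enough.
answer = match_expected("yess", ["yes", "no"])
rejected = match_expected("banana", ["yes", "no"])
```

With only a handful of possibilities to choose from, a noisy or imperfect hypothesis can still land on the right answer, which is the robustness described above.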

    --
    [ .sig file not found ]
  9. Here's how to help out by schwaang · · Score: 5, Informative

    Record Your Speech and Submit it to VoxForge

    Donate your speech for a GPL speech data collection so they can do better recognition.

    Includes separate instructions for Windows and Linux users. (Wonder if there will be any significant differences in the quality of the data based on OS...)

    1. Re:Here's how to help out by lawpoop · · Score: 3, Funny

      I would bet that more of the people who are using Linux are on the Autistic Spectrum. A few of the 'symptoms' or qualities of such people include "Odd or monotonous prosody of speech" and "Overly formal and pedantic language".

      So my bet is yes, there will be a difference based on OS.

      --
      Computers are useless. They can only give you answers.
      -- Pablo Picasso
  10. Wreck A Nice Beach... by onkelonkel · · Score: 0

    Wreck A Nice Beach...Recognize Speech. Call me when it can tell them apart.

    --
    None of them can see the clouds; The polished wings don't care.
    1. Re:Wreck A Nice Beach... by Anonymous Coward · · Score: 0

      i am sorry, i have no control over when you improve your pronunciation.

    2. Re:Wreck A Nice Beach... by bdwoolman · · Score: 3, Informative
      wreck a nice beach

      recognize speech

      Entered with Dragon Systems 9.

      Not trying to be snotty. Just informative. Dragon Systems has been pretty good since version 7. Eight was a real improvement. Nine is totally awesome. Almost magic. There is a user learning curve, however. One does have to dictate the punctuation for example. Nine works with Firefox very nicely.

      Wordos do happen from time to time when you 'wreck a nice beach' (sic), but then so do typos. Everything needs to be edited no matter how it was entered. It's fair to say that speech recognition has come of age with Dragon Systems nine.

      Let me also add that I am not a shill. I am not connected with Dragon Systems in any way shape or form. Just a very happy and satisfied user.

      --
      "No fear. No envy. No meanness." Liam Clancy
    3. Re:Wreck A Nice Beach... by Anonymous Coward · · Score: 0

      If all you are going to say is those two sentences, without careful enunciation or verbal clues as to which one you are actually saying, most humans will fail that test on a regular basis. Indeed, in day to day speech, many humans would say those in a way that there is no clear difference to distinguish.

      If, however, you enunciate them well, this can already be done by the best products, if I am not mistaken.

    4. Re:Wreck A Nice Beach... by onkelonkel · · Score: 1

      OK, my comment was a bit snotty. I'm somewhat entitled to be dubious, because speech recognition was a big disappointment when it first hit the market.

      I remember eagerly anticipating not having to type anymore when I bought IBM's Via Voice. This was about 10 years ago, back when "powerful computer" meant a P90 with 8 Meg Ram. After training the software for about an hour, I could, by. talking. like. William. Shatner. on. Ritalin. produce text that was maybe 60% - 80% accurate. It was definitely oversold at the time.

      With 2 orders of magnitude more computing power, it ought to be a lot better. I will have to give it another chance. Thanks.

      --
      None of them can see the clouds; The polished wings don't care.
  11. Data conditioning (GIGO) by StateOfTheUnion · · Score: 4, Insightful
    What about data conditioning?

    This project seems to be gathering a "Wild Type" sampling of submitted data. What if the data is not representative . . . for example, a bunch of people in China decide to submit English-language files with the best of intentions, but the data is heavily accented (Or to be fair, if a bunch of native English speakers submitted a bunch of heavily accented recordings of Mandarin speech)?

    Without controlling the data source or making sure that the data is valid, one could become a victim of GIGO (Garbage In, Garbage Out). In all fairness, this may not be a problem if the sample size is large enough to overwhelm any outlying data, but I'm not sure that this project has sufficiently addressed this concern . . .

    1. Re:Data conditioning (GIGO) by suv4x4 · · Score: 1

      Without controlling the data source or making sure that the data is valid, one could become a victim of GIGO (Garbage In, Garbage Out).

      They had to invent an acronym for this too, didn't they!? Jesus what is going on with this world!

      Wait... who are they?...

    2. Re:Data conditioning (GIGO) by NamShubCMX · · Score: 2, Interesting

      But shouldn't there be many different accents for such a program to work? I am French Canadian, and I sure hope I don't have to imitate a France accent for my voice to be faithfully recognized. Because although I have a strong accent from the point of view of French people, I don't from the point of view of Quebec people.

      On the same line of thought, I hope I can use this tool with my heavy (ok not so bad) English accent...

      I have no clue how those programs work so I might be off-base, but it seems to me that for speech recognition to work there should be AS MANY accents as possible, as long as those are identified and you can find an accent that corresponds to your own (?)

      --
      We've always been at war with Eurasia.
    3. Re:Data conditioning (GIGO) by Anonymous Coward · · Score: 0

      Ben oui, mautadit! C'pas un problème pan toute!

    4. Re:Data conditioning (GIGO) by davids-world.com · · Score: 1

      You don't need Chinese people to get heavily accented English. In fact, English varies a lot. If you're in Yorkshire (England) or in Ayrshire (Scotland), in Singapore, Brisbane or Nashville, you'll find extremely different accents. A good speech corpus will contain large samples of as many accents as possible, including meta-data that allows people to filter this to produce an acoustic model that is tailored to intended target users.
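The metadata filtering could look something like the sketch below. The `Recording` fields, the sample file names and `select_recordings` are all hypothetical; nothing here reflects how VoxForge or any existing corpus actually stores its metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    """Hypothetical metadata attached to one submitted recording."""
    path: str
    accent: str
    channel: str  # e.g. "phone" or "microphone"


def select_recordings(corpus, accents, channel=None):
    """Keep recordings whose accent (and optionally channel) match the target users."""
    wanted = {accent.lower() for accent in accents}
    return [
        recording
        for recording in corpus
        if recording.accent.lower() in wanted
        and (channel is None or recording.channel == channel)
    ]


# Invented sample data for illustration.
sample_corpus = [
    Recording("a.wav", "Yorkshire", "microphone"),
    Recording("b.wav", "Nashville", "phone"),
    Recording("c.wav", "yorkshire", "phone"),
]
yorkshire_phone = select_recordings(sample_corpus, ["Yorkshire"], channel="phone")
```

The selected subset would then be what gets fed into acoustic model training for that particular group of users.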

      But the same applies to recording modalities. Depending on whether you're building dialogue systems that run over the phone, or whether you want to recognize dictated input, you will have to have different recordings. The acoustic qualities of the channel differ (phone vs. high-quality microphone), and while that can be simulated and compensated for to some extent, the fact that people speak differently in spontaneous conversation vs. dictation will pretty much break most recognizers, unless you have purpose-built models.

      These are some of the reasons why it's hard to find an all-purpose corpus, and why the existing corpora are so expensive to license. (Not only does each have a limited market - they are very expensive to collect, too.) So it's great that the effort is being made - I hope it'll eventually lead to improvements in the freely available recognition engines, too.

    5. Re:Data conditioning (GIGO) by mrchaotica · · Score: 1
      What if the data is not representative . . . for example, a bunch of people in China decide to submit English-language files with the best of intentions, but the data is heavily accented

      Then it would work great for Chinese users of the software. I don't see a problem here, except that the data needs to be categorized properly.

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

    6. Re:Data conditioning (GIGO) by penix1 · · Score: 1

      So I guess what you are saying is my Mr. Microphone won't cut it? Damn! Back to the drawing board.

      B.

      --
      This is a sig. This is only a sig. Had this been an actual sig you would have been informed where to tune for more sigs.
    7. Re:Data conditioning (GIGO) by davids-world.com · · Score: 1

      hey better than nothing. or as the statistical NLP people tend to say: there's only one thing that's better than data. more data!

    8. Re:Data conditioning (GIGO) by JumperCable · · Score: 1

      "Without controlling the data source or making sure that the data is valid, one could become a victim of GIGO (Garbage In, Garbage Out). In all fairness, this may not be a problem if the sample size is large enough to overwhelm any outlying data, but I'm not sure that this project has sufficiently addressed this concern . . ." -StateOfTheUnion

      Part of the submission process is for you to classify your dialect. After your recordings are posted, people can rate your recordings & comment on them. I think this is a decent plan to address your concerns.

    9. Re:Data conditioning (GIGO) by Sam+the+Nemesis · · Score: 1
      for example, a bunch of people in China decide to submit english language files with the best of intentions, but the data is heavily accented (Or to be fair, if a bunch of native English speakers submitted a bunch of heavily accented recordings of Mandarin speech)?
      What I feel is that for English speech, this is exactly what they should do. The language is so common across the globe that all the accented variants should be included in the speech database. I remember using voice recognition software a long time back, and it did not support my accented English properly. I wished it did.
  12. slashdot met voxforge by bfree · · Score: 2, Funny

    as if a million voices cried out in terror and were suddenly silenced

    --

    Never underestimate the dark side of the Source

    1. Re:slashdot met voxforge by SeaFox · · Score: 2, Funny
      as if a million voices cried out in terror and were suddenly silenced

      But thanks to those millions of samples, we can now transcribe "AHHHHHHHHHHHHHHHH!" very accurately.
  13. Voice Response Systems by Anonymous Coward · · Score: 0

    computer: "Please say yes or no."
    human: "Yes."
    computer: "I couldn't understand your response. Please say yes or no."
    human: "Yes."
    computer: "I seem to be having trouble understanding you. Let's start over. Please enter or say your 21-digit account number followed by the pound key."
    human: (unprintable)

    This is a typical interaction on one of these Voice-Response-Hell systems. Companies that use such systems deserve to go out of business. A special place in Hell is reserved for people who develop and sell them.

    1. Re:Voice Response Systems by AngryUndead · · Score: 3, Insightful

      Do you remember when you actually had to go to the office to handle things these systems are used for? Talk to a human? Do you remember a time when they just punched you in the bean bag for trying to quit?

      The developers who work overtime to bring such advances should damn near be nominated for saint-hood. Or maybe you could learn to enunciate.

  14. GPL versus public domain? by 5plicer · · Score: 5, Insightful

    Why not make the files public domain? Is making them GPL really necessary?

    --
    The bits on the bus go on and off... on and off... on and off...
    1. Re:GPL versus public domain? by Anonymous Coward · · Score: 0

      The idea is the same as in software: if someone uses your audio in their program along with their own audio samples, they would be compelled to give their samples to their customers.

      What I'm worried about is the trolls. As if millions of trolls said "dear aunt lets set so double select the killer" and were suddenly recorded and included in the program.

    2. Re:GPL versus public domain? by RobertLTux · · Score: 1

      the difference between PD and GPL can be like this

      1 PD is like a public park with the problem that somebody could buy a certain section (say grease a few palms and..) and lock YOU out of it
      2 GPL is like a park owned by some old looney that leaves the gate open (or in some cases owned by a group of folks that HATE Each other)
      PD is free now but could be nonfree later
      GPL is free FOREVER

      (for some projects it's like getting a Jew, a Muslim, a Catholic, and several subtypes of Protestants to agree on a "winter holiday" in Jerusalem herself, and then adding an atheist and a satanist / wiccan to the mix)

      --
      Any person using FTFY or editing my postings agrees to a US$50.00 charge
    3. Re:GPL versus public domain? by Britz · · Score: 1

      Parent is a troll.

      But what the heck: BSD vs. GPL, let me just get my flameproof stuff.

    4. Re:GPL versus public domain? by Anonymous Coward · · Score: 0

      "satanist/wiccan"?

      You might as well say "christian/buddhist" or "hindu/scientologist"

    5. Re:GPL versus public domain? by Anonymous Coward · · Score: 0

      How does one place something in the public domain? For works created in modern times, SOMEBODY has copyright by default. In order to renounce that copyright and verify it as legitimately renounced, someone has to be established to HAVE copyright. Arrrgh.

    6. Re:GPL versus public domain? by Anonymous Coward · · Score: 0
      PD is free now but could be nonfree later
      No, you put something into public domain, and it stays in public domain. No one else can come along and suddenly take away your rights to it.

      You're conflating the freeness of an original work with that of derivative works.
    7. Re:GPL versus public domain? by Anonymous Coward · · Score: 0

      Can't you just explicitly state that you want your work to be public domain and that you waive your copyrights?

    8. Re:GPL versus public domain? by Anonymous Coward · · Score: 0

      >The bits on the bus go on and off... on and off... on and off...

      Your sig makes baby Jesus cry.

    9. Re:GPL versus public domain? by Em+Adespoton · · Score: 1
      Well, that's not exactly true. Think of it more like this:

      Public Domain is forever, but people are free to copy it and make the copy (plus all improvements) their own.
      GPL is forever, but the only people free to distribute it are those who provide all the original source IP plus their modifications under the GPL.

      Then, of course, there's also BSDL, where the only restriction is that, unlike the public domain, you are required to credit the original authors of any work you use.

    10. Re:GPL versus public domain? by Anonymous Coward · · Score: 0

      Umm.. I'm not sure that you understood his sig. It's obviously a reference to "the wheels on the bus go round and round", and I think he means bits on a computer's bus. I could be mistaken though...

  15. Please by porkThreeWays · · Score: 2

    It'd be nice if someone could give an overview of the quality and simplicity of some open source speech recognition projects. I've used Sphinx 2, 3, and 4 before with little luck. I don't know if I've got marbles in my mouth or what. Either way, I'm sure there's got to be someone on Slashdot who's used a few and could give an overview to us weekend warriors.

    --
    If an officer ever threatens to taze you, say you have a pacemaker.
  16. two modes of silence, also by Anonymous Coward · · Score: 0

    Well, with research like this, speech recognition will get better.

  17. Speech to text overlooked by mgkimsal2 · · Score: 2, Informative

    I wrote a bit about this (somewhat negatively) at http://fosterburgess.com/kimsal/?p=139 a few days ago. I've been looking for a solid option for having some dictation automatically transcribed to text files, and have this run under Linux. Basically, anyone looking to do this is just out of luck. It'll be years before there's anything usable for the average person. In my post, I reference another article (http://www.theinquirer.net/default.aspx?article=34072) which also talks about the state of things.

    What's frustrating is that there *was* something halfway decent - IBM's ViaVoice - but that's gone. A few of the Linux apps I see out there are layers to run on top of ViaVoice. With that option gone for Linux, those tools are useless. It's like the rug was pulled out from underneath any progress in this arena for the foreseeable future.

    I found voxforge a few days ago, and while it seems admirable, it's a small part of the larger problem which I don't see getting any better any time soon.

    1. Re:Speech to text overlooked by Wumpus · · Score: 2, Informative

      IBM's ViaVoice technology can still be licensed from Wizzard Software (http://www.wizzardsoftware.com) and they're still selling the Linux SDK.

    2. Re:Speech to text overlooked by mgkimsal2 · · Score: 1

      AFAICT it's a bit out of my price range - the cheapest price I can see is $3400.

    3. Re:Speech to text overlooked by Wumpus · · Score: 1

      Well, it is a server product, targeted at developers. Desktop speech products weren't doing too well the last time I looked.

  18. It's about time by pestilence669 · · Score: 2, Interesting

    I've been waiting for something like this for a long time... Hiring voice actors isn't always feasible. I can work on engine code no problem, but my voice isn't the prettiest. Without repositories like this, projects like Sphinx can have a considerable barrier to entry for the uninitiated. The variety in sources can only improve quality.

  19. Slashdotted by bendodge · · Score: 0

    Well, it was a nice try, but too much success can kill too.

    --
    The government can't save you.
  20. GIGO is older than you by ClosedSource · · Score: 1

    Do you realize that GIGO is an older acronym than 99% of all the acronyms you've read on Slashdot?

    So your question should be: "Jesus, what was going on in this world way back when, before I was born!"

  21. Ill help by Anon-Admin · · Score: 1

    I'll have to go back when it is not slashdotted. I am not sure what they need, but if they just need a good voice reading something I'll give it a try. I have been told that I sound like the guy on Moviefone. :)

    I would also love to see open source dictation software. [shameless plug](See my journal for why)[/shameless plug]

    1. Re:Ill help by finiteSet · · Score: 1
      I am not sure what they need, but if they just need a good voice reading something I'll give it a try. I have been told that I sound like the guy on Moviefone. :)
      Speech recognition performance on low-noise, read "proper" speech is actually impressively good. The forefront of speech recognition research is on noisy, spontaneous and conversational speech - i.e. real world speech. Any speech data is helpful, but the state-of-the-art would actually be better served by contributions of sub-optimal speech from a diverse group of speakers. Of course, I doubt this data will be manually transcribed. More likely this system uses automatic alignments, which won't fare as well on sub-optimal speech, and will ultimately produce less accurate training data.
      --
      If we start buying CDs then the terrorists have already won.
  22. But you didn't leave your number by GeorgeVW · · Score: 1

    Fired up iListen from MacSpeech (who license the Philips Voice Recognition model). Spoke both phrases in normal pace and tone. Initial accuracy 75%. Take 30 seconds to correct errors. Accuracy 100%. Even before training/correction, "recognize speech" was at 100%. The training was to teach the difference between "wreck" and "rack" (although it offered "wreck" as one of the options in the correction mode).

    It ain't perfect, but training is easy these days and accuracies over 95% are arrived at fairly quickly. The biggest problem for many users seems to be overtraining before they start using the program. Many courts and most captioning systems have moved over to voice transcription systems rather than old fashioned stenography.

    My wife does professional transcription and she does almost all of it with Dragon NaturallySpeaking on a Windows system. I'm not seeing a lot of difference between Windows and Mac in accuracy or ease of use, though the Windows side used to have a slight advantage. That being said, open source alternatives would be a great thing. The more people working on this stuff, the faster it gets simpler, faster, and more powerful.

  23. get back to me by Anonymous Coward · · Score: 0

    get back to me when they figure out how to interpret the lisp that all fag linux users have

  24. Computer. Go to Slashdot ... by fahrbot-bot · · Score: 1
    "Computer. Go to Slashdot and alert me if there is a dupe."

    Error: Insufficient computing power.

    --
    It must have been something you assimilated. . . .
    1. Re:Computer. Go to Slashdot ... by TheRaven64 · · Score: 1

      A function that always returns true shouldn't take much processing power...

      --
      I am TheRaven on Soylent News
  25. actual forward progress on end user applications by Anonymous Coward · · Score: 0

    Which of the popular open source applications have progressed much in the last 5 years?
    Consider the same for the second tier applications (GOCR, voice recognition, etc).

    What is needed to progress them to near commercial quality and near commercial feature set?

    Does only a for profit business have the resources to raise the quality and feature set of most open source software to near commercial levels?

  26. What is your choice?..."Operator"...I'm sorry. by PRMan · · Score: 2, Insightful

    What is your choice?..."Operator"...I'm sorry. Please say another option...."CUS-TO-MER SER-VICE REP-RE-SENT-A-TIVE!!!"...I'm sorry...

    That's usually the gist of my conversation with those automated systems.

    If I'm calling, it's not something that can be solved with an automated prompt. If it was, I would have looked it up on your website already... I'm calling specifically because there's something WRONG with my account!

    --
    Peter predicted that you would "deliberately forget" creation 2000 years ago...
    1. Re:What is your choice?..."Operator"...I'm sorry. by gregmac · · Score: 1

      Swear at it.

      If you do that on Bell Canada's system (well, I haven't tried in about a year, but it did then) it will drop you directly to an operator.

      --
      Speak before you think
    2. Re:What is your choice?..."Operator"...I'm sorry. by aichpvee · · Score: 1

      I'm mute, those insensitive clods!

      --
      The Farewell Tour II
  27. Well no, not *that* one. by Kadin2048 · · Score: 2, Insightful

    Well that particular CC license would be particularly bad (actually I don't know what it would be good for, might as well just say "All Rights Reserved" and save space), but there are others that would be fine.

    Creative Commons ShareAlike is GFDL compatible, at least according to WikiMedia. Or heck, why not just use the GFDL itself?

    The reason not to use the GPL on something like this is because there's not a clear separation between "source" and "binary" like there would be for a programming project; there's just the work itself, and other derivative works. Thus a whole lot of the GPL would be redundant.

    --
    "Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
  28. How about artificial speech next? by phorm · · Score: 1

    I wonder if a better understanding of speech recognition - and having accurate voice models - would allow us to tweak or advance artificial speech programs? While they probably won't do too much to help a computer understand the actual structure of a sentence (word recognition and pronunciation), it might allow them to produce words or sentences that flow more realistically or have more realistic peaks and stresses on various words/syllables.

    I've seen some decent ones, and the OS ones aren't better than the common paid-for emulation I've seen, but both could use improvement.

    1. Re:How about artificial speech next? by Anomalyst · · Score: 1

      I could have sworn that Steve Gibson wrote an article quite a while ago on using DSP to join the words in a more natural-sounding manner. Can't seem to find it with a search of
      "Steve Gibson" speech DSP words
      Did turn up this reference though "world gay escort dating free of charge", heh. A result of keyword stuffing in the linked site rather than a legitimate hit. Steve must be proud that his name is deemed such a valuable search keyword.

      --
      There is no right to feel safe thru security vaudeville at the expense of everyone's freedom, privacy and tax money.
  29. huh? by b17bmbr · · Score: 0

    what? you want to tickle my ass with a feather? oh, particularly nice weather.

    --
    My problem? I was perfectly gruntled, until some numbnuts came by and dissed me.
  30. Why not use the NIST database? by jesup · · Score: 3, Interesting

    Back in roughly 1991 or 1992, I was working at Commodore/Amiga with AT&T DSP3210's (we were considering adding them to Amiga 3000's/4000's). They supported speech recognition, and due to that (somehow) I was asked to participate in a NIST program to collect speech samples over telephone. You made calls where you were randomly connected to another participant, and talked about a given topic. I imagine they were later transcribed; the purpose of this was to create a natural (connected) speech database for speech recognition researchers and vendors.

    Since it was done by NIST, I imagine the database is available. Note that it will be limited by the telephone call quality (4 kHz).

    1. Re:Why not use the NIST database? by hbr · · Score: 1
      This is probably what you mean:

      http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62

      This kind of speech, um, yeah, is a - a world away, you know what I mean, from how most users speak to dictation software, command-and-control, etc.

      The Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/ is the main source of speech corpora that I know about. You have to pay and possibly be a member (depending on the corpus you want I think). The catalog covers all kinds of speech. Another source is ELRA http://catalog.elra.info/, but their corpora are a little pricey!

    2. Re:Why not use the NIST database? by jesup · · Score: 1

      Yes, that's got to be it. Good find - and not free it appears.

    3. Re:Why not use the NIST database? by cyberon22 · · Score: 1

      Since it is funded by NIST, I imagine that the database is not available. This is the same organization that manages to conduct "open" testing of machine translation systems without making the actual translations public.

    4. Re:Why not use the NIST database? by LifesABeach · · Score: 1

      Just a thought: instead of a keyboard, use a phone?

  31. It's about time- when prompted. by Anonymous Coward · · Score: 0

    "Please people, take the time out of your schedule to record prompts. It will do everyone a lot of good."

    You've reached the voice-mail of Anonymous Coward. I'm not in right now, but if you leave an obscene description, curt comment, tart answer, and a flip remark. I'll get right back to you with a rude gesture. Thank you. *BEEP*

  32. How will it be distributed? by SpokenLang · · Score: 2, Interesting
    How much data are they planning to collect and how will it be distributed? The web site says that they will make available 8kHz and 16kHz versions of the data, both with 16-bit samples. These days, decent applications are trained on hundreds and even thousands of hours of audio. So, let's say they want to collect and distribute 1000 hours of 16kHz, 16 bit audio. That's 32,000 bytes per second of audio, or about 115 megabytes per hour, or 115 gigabytes per 1000 hours! Even 500 hours (58 gigs) is a LOT of data. Are they planning to make this available via download? If they want to distribute it on DVDs, that is about 24 DVDs (for 1000 hrs), which would be a lot of work to burn and ship...
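
    The storage arithmetic above can be checked with a short sketch. The function names are made up for illustration, and the 4.7 GB figure assumes single-layer DVDs with decimal gigabytes; neither comes from the VoxForge site.

```python
import math

# Assumption: single-layer DVD, 4.7 GB in decimal (10^9-byte) gigabytes.
DVD_CAPACITY_BYTES = 4_700_000_000


def corpus_size_bytes(hours, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return int(hours * 3600 * bytes_per_second)


def dvds_needed(total_bytes, capacity=DVD_CAPACITY_BYTES):
    """Whole discs required to hold total_bytes."""
    return math.ceil(total_bytes / capacity)


if __name__ == "__main__":
    for hours in (1, 500, 1000):
        size = corpus_size_bytes(hours)
        print(f"{hours:>5} h: {size / 1e9:8.2f} GB, {dvds_needed(size)} DVD(s)")
```

    Under those assumptions, 1000 hours comes to 115.2 GB, which is roughly 24.5 discs' worth, so 25 discs in practice. The 8 kHz version would be half that size, and lossless compression would likely shrink both further.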

    I think it is an admirable project, but it seems like the practicalities could make this VERY difficult to complete. The Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/ has been distributing speech corpora for about 15 years and it is not easy (and no, I don't work for the LDC.)

    1. Re:How will it be distributed? by Clover_Kicker · · Score: 1

      Why not ship a hard drive with the data, and charge a nominal fee?

  33. Waist uptime by wirelessbuzzers · · Score: 1

    Open sores peach wreck ignition is final ready.

    --
    I hereby place the above post in the public domain.
  34. IFA Dutch Corpus by finiteSet · · Score: 4, Informative
    Correct me if I'm wrong but GPL was made for code, not audio.
    There is more to it than the poster mentions (I don't know if the site addresses this - it is Slashdotted). You don't just need audio - speech audio is abundant - you need annotated audio. In most cases, this annotation is phonetic (or phonemic) transcription, which labels segments of the audio according to the speech sound present in that audio segment. Most state-of-the-art speech systems use a machine learning approach: the system is "trained" on training data, with the hopes that the patterns learned generalize well on new data. This training is a supervised process: it requires the answers, and the answers are found in the annotation. It is this combination of audio and annotation that is valuable, and that is hard to come by. If their system prompts you to read phrases, it could use an existing recognition system to produce a roughly aligned phonetic transcription. It would be far from perfect, but useful nonetheless.
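
    To make "annotated audio" concrete, here is a minimal sketch of what a phonetic annotation might look like as data, and how it turns into the per-frame "answers" a supervised trainer needs. The class, the phone symbols, the timings, and the 10 ms frame shift are all illustrative assumptions, not VoxForge's or any corpus's actual format.

```python
from dataclasses import dataclass


@dataclass
class PhoneSegment:
    label: str     # phone symbol (hypothetical inventory), e.g. "k"
    start_ms: int  # segment start, in milliseconds
    end_ms: int    # segment end (exclusive), in milliseconds


def frame_labels(segments, duration_ms, frame_shift_ms=10, silence="sil"):
    """Label each fixed-rate frame with the phone covering its start time.

    Frames not covered by any segment are labeled as silence.
    """
    labels = []
    for t in range(0, duration_ms, frame_shift_ms):
        label = silence
        for seg in segments:
            if seg.start_ms <= t < seg.end_ms:
                label = seg.label
                break
        labels.append(label)
    return labels


if __name__ == "__main__":
    # Made-up alignment for a 300 ms recording of the word "cat".
    cat = [
        PhoneSegment("k", 0, 80),
        PhoneSegment("ae", 80, 200),
        PhoneSegment("t", 200, 260),
    ]
    print(frame_labels(cat, duration_ms=300))
```

    Each frame of audio features paired with its label is one training example. Without the segment boundaries there is nothing to pair the audio with, which is why raw speech alone is of limited use.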

    From TFA:
    The reason for this is because there is no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines.
    What? The IFA Dutch "Open-Source" Corpus is a phonemically-annotated speech corpus released under the GNU GPL (read more - pdf). They even have an SQL interface. Did you mean English speech corpora?
    --
    If we start buying CDs then the terrorists have already won.
    1. Re:IFA Dutch Corpus by oergiR · · Score: 1

      The text people have to read is given. I.e. the orthographic transcription is available. It is possible to bootstrap a speech recognition system from these transcriptions. It will not be particularly good, though.

      The more important problem is that current speech recognisers do not generalise well. If you train only on read speech, the performance on spontaneous speech will most likely be horrible. Transcribing spontaneous speech, however, takes enormous amounts of time. And it is not the kind of job you want to do for more than ten minutes. So I don't see how a good speech recogniser can be produced without money. I'm afraid this effort is going to lead nowhere, much though its purpose is to be applauded.

      The IFA Dutch Open Source Corpus is way too small for producing a speech recogniser. The best systems produced at the institution where I am studying are trained on about a thousand hours of speech.

    2. Re:IFA Dutch Corpus by finiteSet · · Score: 1
      Transcribing spontaneous speech, however, takes enormous amounts of time. ... So I don't see how a good speech recogniser can be produced without money.
      Aha, that's what undergraduate RAs (+lots of funding) are for. But seriously, this is really what I was getting at in my post.

      The best systems produced at the institution where I am studying are trained on about a thousand hours of speech.
      An IFA Corpus trained system won't be state-of-the-art, admittedly. The key word here is "free" - beggars can't be choosers.
      --
      If we start buying CDs then the terrorists have already won.
  35. How many times ... by multimediavt · · Score: 1

    ... can you copy and paste, "Acoustic Models to be used by Speech Recognition Engines"?

    Sorry, someone was excited about "Acoustic Models to be used by Speech Recognition Engines". [giggle]

  36. Isn't this reinventing Librivox's wheel? by jhutch2000 · · Score: 4, Informative

    Librivox has THOUSANDS of hours of audio books available. Every last second is public domain.

    I'm probably missing something in regards to why this stuff can't be used...

    1. Re:Isn't this reinventing Librivox's wheel? by beholder · · Score: 1

      Librivox's recording are continuous speech. VoxForge is looking for Command and Control phrases (short and snappy).

      I would assume the phrasing patterns would be quite different.

    2. Re:Isn't this reinventing Librivox's wheel? by jhutch2000 · · Score: 1

      Ah, ok. I was assuming they wanted continuous speech patterns. I stand corrected, thanks.

  37. BitTorrent! by ClioCJS · · Score: 1
    Just go to MiniNova.org, and do a search for "audiobook". THERE'S YOUR DATA, FOLKS.

    Nothing to hear here. Move along.

    --
    -Clio
    Karma: Bad (mostly from not giving a fuck)
    Blog: http://clintjcl.wordpress.com
  38. speech recognition by bdwoolman · · Score: 1
    Didn't think your comment was snotty at all. I was just worried mine might be perceived as such.

    I think you will be pleasantly surprised if you try Dragon Systems. Dragon Systems is special among speech engines. It is the long-term pet project of a couple of gifted scientists who decided to solve the problem of speech recognition a generation ago. They filed hundreds of patents over a couple of decades and solved many engineering problems one at a time. IBM took a long term interest in speech recognition, but none of their products have even approached the steadfast genius of the Dragon. ViaVoice never touched Dragon in capability even on the same machines.

    A few years ago Wired magazine wrote a beautiful article on the development of this sublime piece of software. David Pogue, the technology guru of the New York Times, occasionally waxes ecstatic about it. He uses it exclusively for all his writing since he suffers from carpal tunnel syndrome. Wonder no more why the guy is so prolific.

    The story of Dragon Systems has an ironic ending. The couple, a married couple by the way, thought they had hit the jackpot when they sold Dragon Systems to Lernout & Hauspie for a fortune in stock; this was just before the European company disintegrated in a scandal that rivaled Enron's. This transformed their hard-earned fortune into ashes. ScanSoft (now Nuance) bought the package at a fire-sale price and, bless them, they have upgraded and supported it beautifully. This latest release is astonishing.

    I dictated this entire post in a very few minutes with very few corrections. If you do get this program I recommend the 'preferred' version. If I'm not mistaken DS version 9 does not require any training. But it does learn beautifully as you dictate. Also, you can suck in text that you've already produced. And even if it has weird words in it, or proper names, Dragon will do a pretty good job of reproducing them when it hears them next.

    And what about homonyms like 'there' and 'their'? Dragon has some contextual algorithms that try to sort them out, but obviously it doesn't always guess correctly. Nevertheless it's easily corrected on the fly. You simply say "select their"; the offending word gets highlighted and a numbered pop-up list of its homonyms and near misses appears. "There" would probably be right at the top, and you would simply say "choose 1". Then the wrong "their" would be replaced.

    Once you get used to this baby you can go on for hours. But I ramble...

    --
    "No fear. No envy. No meanness." Liam Clancy
  39. Again with the GPL by stonecypher · · Score: 1

    This is the sort of effort in which commercial participation would be a strong benefit. If this were under an MIT or BSD license, I would put this into a specific one of my commercial products for the Nintendo DS, and I'd put a whole lot of work into it. But I can't. Every time I talk about how I can't help certain projects, I get modded down as a troll, because I'm saying something a GPL fan doesn't want to hear. I'm not trolling, and I'm not being flamebait. This is a serious problem. I can name several other Nintendo software companies who would love to participate in the development of this library, but we can't.

    It's a real shame. It'd be good for the library, it'd be good for the Nintendo and it'd be good for the games. Instead, because of the choice of license, everybody loses.

    --
    StoneCypher is Full of BS