Slashdot Mirror


Improving Open Source Speech Recognition

kmaclean writes, "VoxForge collects free GPL Transcribed Speech Audio that can be used in the creation of Acoustic Models for use with Open Source Speech Recognition Engines. We are essentially creating a user-submitted repository of the 'source' speech audio for the creation of Acoustic Models to be used by Speech Recognition Engines. The Speech Audio files will then be 'compiled' into Acoustic Models for use with Open Source Speech Recognition engines such as Sphinx, HTK, CAVS and Julius." Read on for why we need free GPL speech audio.

Why free GPL Speech Audio?

Speech Recognition Engines require two types of files to recognize speech. The first is an Acoustic Model. It is created by taking a very large number of audio recordings of speech, together with their transcriptions (collectively called a Speech Corpus, plural Corpora), and 'compiling' them into statistical representations of the sounds that make up each word. The second is a Language Model or a Grammar file. A Language Model is a very large file containing the probabilities of particular sequences of words. A Grammar is a much smaller file containing sets of predefined combinations of words.
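To make the Language Model versus Grammar distinction concrete, here is a toy sketch. It is not how Sphinx, HTK, CAVS or Julius actually store their models, and it does not cover the Acoustic Model at all. The three-sentence corpus, the word lists and the function names (`train_bigram`, `sentence_prob`, `in_grammar`) are all hypothetical.

The language-model half is a maximum-likelihood bigram model. It estimates P(word | previous word) by counting. Real language models are estimated from far larger text collections and use smoothing (for example, back-off), so unseen word pairs are not assigned zero probability. The grammar half simply enumerates the fixed set of phrases it accepts.

```python
from collections import Counter
from itertools import product

# Hypothetical toy transcriptions; a real Language Model needs vastly more text.
CORPUS = ["call home", "call the office", "call home now"]


def train_bigram(sentences):
    """Count word histories and word pairs, with <s>/</s> sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words[:-1])            # every word that is followed by another
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    return unigrams, bigrams


def sentence_prob(sentence, unigrams, bigrams):
    """Maximum-likelihood P(sentence) as a product of P(word | previous word).

    No smoothing: any unseen word pair makes the whole sentence probability 0.
    """
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:  # history never seen in training
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p


# A Grammar, by contrast, is just a small set of allowed word combinations.
VERBS = ["call", "dial"]
TARGETS = ["home", "the office"]
ALLOWED = {f"{v} {t}" for v, t in product(VERBS, TARGETS)}


def in_grammar(sentence):
    """Accept only phrases the grammar explicitly lists."""
    return sentence in ALLOWED


if __name__ == "__main__":
    uni, bi = train_bigram(CORPUS)
    print(sentence_prob("call home", uni, bi))  # about 1/3
    print(sentence_prob("home call", uni, bi))  # 0.0, word order never seen
    print(in_grammar("dial the office"), in_grammar("call the office now"))
```

The contrast is the point. The bigram model assigns a graded score to any word sequence it has statistics for. The grammar gives a yes/no answer for a short, predefined list of phrases, which is why it can be so much smaller.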

Most Acoustic Models used by 'Open Source' Speech Recognition engines are 'closed source'. They do not give you access to the speech audio (the 'source') used to create the Acoustic Model. Where they do, there are licensing restrictions on distributing the 'source' (e.g. you can only use it for personal or research purposes). This is because there are no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines. Open Source projects are therefore required to purchase Speech Corpora that come with restrictive licensing: they are not permitted to distribute the 'source' speech audio, but they are permitted to distribute the 'compiled' Acoustic Model.

Why GPL?

A GPL-style license will ensure that user contributions will always benefit the open source community, since it requires any distribution of derivative Acoustic Models to include access to the 'source' speech audio.

6 of 121 comments

  1. Data conditioning (GIGO) by StateOfTheUnion · Score: 4, Insightful
    What about data conditioning?

    This project seems to be gathering a "Wild Type" sample of submitted data. What if the data is not representative? For example, a bunch of people in China decide to submit English-language files with the best of intentions, but the recordings are heavily accented (or, to be fair, a bunch of native English speakers submit heavily accented recordings of Mandarin speech).

    Without controlling the data source or verifying that the data is valid, one could become a victim of GIGO (Garbage In, Garbage Out). In all fairness, this may not be a problem if the sample size is large enough to overwhelm any outlying data, but I'm not sure this project has sufficiently addressed the concern.

  2. GPL versus public domain? by 5plicer · Score: 5, Insightful

    Why not make the files public domain? Is making them GPL really necessary?

    --
    The bits on the bus go on and off... on and off... on and off...
  3. Re:Voice Response Systems by AngryUndead · Score: 3, Insightful

    Do you remember when you actually had to go to the office to handle things these systems are used for? Talk to a human? Do you remember a time when they just punched you in the bean bag for trying to quit?

    The developers who work overtime to bring such advances should damn near be nominated for sainthood. Or maybe you could learn to enunciate.

  4. What is your choice?..."Operator"...I'm sorry. by PRMan · Score: 2, Insightful

    What is your choice?..."Operator"...I'm sorry. Please say another option...."CUS-TO-MER SER-VICE REP-RE-SENT-A-TIVE!!!"...I'm sorry...

    That's usually the gist of my conversation with those automated systems.

    If I'm calling, it's not something that can be solved with an automated prompt. If it were, I would have looked it up on your website already... I'm calling specifically because there's something WRONG with my account!

    --
    Peter predicted that you would "deliberately forget" creation 2000 years ago...
  5. Well no, not *that* one. by Kadin2048 · Score: 2, Insightful

    Well, that particular CC license would be especially bad (actually, I don't know what it would be good for; you might as well just say "All Rights Reserved" and save space), but there are others that would be fine.

    Creative Commons ShareAlike is GFDL-compatible, at least according to Wikimedia. Or heck, why not just use the GFDL itself?

    The reason not to use the GPL on something like this is because there's not a clear separation between "source" and "binary" like there would be for a programming project; there's just the work itself, and other derivative works. Thus a whole lot of the GPL would be redundant.

    --
    "Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
  6. Re:A sound affair. by k12linux · Score: 3, Insightful

    I would love to have quality Vox software for use in schools instead of paying handsomely for proprietary stuff. The disabled children who use it would be grateful too, since we wouldn't be limited to installing it on only 2% of a school's PCs to avoid breaking our budget.