Actually, the compiled data is just statistical data derived from the audio. The audio isn't used directly.
You can't get the original audio back from the compiled version.
Yes, I realize that you can't recover the audio from the acoustic models. But my point was that using the GPL in this context seems wrong, because it would require that I (as the builder of the "derivative" work, aka the acoustic models) make the audio available (unless I misunderstand the GPL). So, given that I haven't changed the original audio in the process of building my acoustic model, why should I have to distribute it along with my models when the exact same audio is available from these guys? The analogy between "audio data" and "source code" doesn't quite fit, because the audio is not modified.
What rock have you been hiding under? Don't you know what an mp3 is? ;)
I highly doubt they are even considering distributing raw PCM data. It will be compressed in one form or another.
1000 hours of CD-quality mp3 is only roughly 60 GB (your numbers are wrong, I think), and voice doesn't need CD quality.
For speech processing purposes, mp3 is not used because it is lossy (http://en.wikipedia.org/wiki/MP3). But you have a good point. There are lossless compression formats such as shorten (http://www.softpedia.com/get/Multimedia/Audio/Audio-Codecs/Shorten.shtml), which is commonly used by the LDC and NIST to distribute audio data for speech processing purposes. My numbers were correct, but I was assuming no compression. So, even if they use shorten to compress the audio, it will still be a substantial amount of data to make available to those who want to use it.
How much data are they planning to collect and how will it be distributed? The web site says that they will make available 8kHz and 16kHz versions of the data, both with 16-bit samples. These days, decent applications are trained on hundreds and even thousands of hours of audio. So, let's say they want to collect and distribute 1000 hours of 16kHz, 16 bit audio. That's 32,000 bytes per second of audio, or about 115 megabytes per hour, or 115 gigabytes per 1000 hours! Even 500 hours (58 gigs) is a LOT of data. Are they planning to make this available via download? If they want to distribute it on DVDs, that is about 24 DVDs (for 1000 hrs), which would be a lot of work to burn and ship...
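For anyone who wants to check the arithmetic, here is a rough sketch of the calculation in Python. It assumes uncompressed mono PCM, decimal units (1 GB = 10^9 bytes), and single-layer 4.7 GB DVDs; none of those details come from the project's web site, and the function names are just made up for illustration.

```python
# Back-of-the-envelope size of an uncompressed speech corpus.
# Assumptions (not from the project's site): mono audio, decimal
# gigabytes (1 GB = 1e9 bytes), single-layer DVDs holding 4.7 GB.

DVD_CAPACITY_GB = 4.7  # assumed single-layer DVD capacity


def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels=1):
    """Raw PCM data rate in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def corpus_size_gb(hours, sample_rate_hz, bits_per_sample, channels=1):
    """Total uncompressed size in decimal gigabytes."""
    rate = pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels)
    return rate * 3600 * hours / 1e9


rate_16k = pcm_bytes_per_second(16_000, 16)
size_1000h = corpus_size_gb(1000, 16_000, 16)
size_500h = corpus_size_gb(500, 16_000, 16)
size_1000h_8k = corpus_size_gb(1000, 8_000, 16)
dvds_1000h = size_1000h / DVD_CAPACITY_GB

print(f"16 kHz, 16-bit: {rate_16k} bytes per second")
print(f"1000 hours at 16 kHz: {size_1000h:.1f} GB")
print(f"500 hours at 16 kHz: {size_500h:.1f} GB")
print(f"1000 hours at 8 kHz: {size_1000h_8k:.1f} GB")
print(f"DVDs needed for 1000 hours at 16 kHz: {dvds_1000h:.1f}")
```

Under those assumptions the 8kHz version comes out at half the size of the 16kHz one, so shipping both sample rates adds roughly 50% on top of the 16kHz figure.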
I think it is an admirable project, but it seems like the practicalities could make this VERY difficult to complete. The Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/ has been distributing speech corpora for about 15 years and it is not easy (and no, I don't work for the LDC.)
The difference between using audio data to "compile" an acoustic model, and using source code to compile an executable is that when you create acoustic models from audio data, you don't modify the acoustic data, you use it "as is". So, it doesn't really make sense to require me to distribute an identical copy of the data along with my acoustic models.
Ah, right... my numbers were WAY too low! ;-)