I find the GFDL to be unnecessarily restrictive for this. In fact, I find it objectionable. If I want to share recipes with my friends, I shouldn't have to abide by such an extensive social contract.
It is a big deal. It'll make a big difference for those of us doing speech technology; the free OSS drivers were always weak on full-duplex, and to do speech in/out, we need it. Congrats to everyone involved -- which I guess is all of us.
Like translating through Babelfish, English to German to English to German, it eventually converges on something really funny. We did it with CMU Sphinx and Festival, by saying something into it and then having the synthesizer say what the recognizer heard... to the recognizer. Hours of fun.
Here are some reasons why you want it on the device, and not on the server:
Privacy: I, for one, don't want to send my personal content through a portal provider. I don't want Microsoft getting all my mail, I don't want TellMe getting it either. And I don't want to have everything that I'm supposed to want available for me at the channel, with my usage stats, habits, and particulars sold to direct marketers or worse.
Security: The more places you ship the data around to, and the more intermediaries involved, the more possibilities there are for sniffing, bad security, leaks, and misuse. Passing things through a provider means trusting them to maintain security properly, and I for one don't trust many people enough to allow that.
Bandwidth: Alice in Wonderland can be shipped in full audio at several gigabytes, or shipped as a 100+ KB text file and synthesized on the device. Cell connections are terrible, despite what the telcos are pushing in their media campaigns -- coverage even in the Bay Area is spotty; you lose signal as you pass in and out of cells, and there are network overloads and outages. Keeping it down to small text streams and synthesizing on the device means getting one step away from the unreliable, low-bandwidth networks available today, and 3G is a long, long way off.
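To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch. Every figure in it is an assumption for illustration, not a measurement: a 3-hour reading, uncompressed CD-quality audio (44.1 kHz, 16-bit, stereo), a 150 KB text file, and a hypothetical 9600 bit/s cell data link.

```python
def pcm_bytes(duration_s, sample_rate_hz, bytes_per_sample, channels):
    """Size in bytes of uncompressed PCM audio."""
    return duration_s * sample_rate_hz * bytes_per_sample * channels


def transfer_seconds(size_bytes, link_bits_per_s):
    """Time to move size_bytes over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_s


# Assumed figures (hypothetical, chosen only for illustration).
READING_SECONDS = 3 * 3600      # assumed 3-hour reading
TEXT_BYTES = 150_000            # assumed size of the plain text
LINK_BPS = 9600                 # assumed cell data rate

audio_bytes = pcm_bytes(READING_SECONDS, 44_100, 2, 2)

print(f"audio: {audio_bytes / 1e9:.2f} GB")
print(f"text:  {TEXT_BYTES / 1e3:.0f} KB")
print(f"audio is about {audio_bytes / TEXT_BYTES:,.0f}x larger than the text")
print(f"text over the link:  {transfer_seconds(TEXT_BYTES, LINK_BPS) / 60:.1f} minutes")
print(f"audio over the link: {transfer_seconds(audio_bytes, LINK_BPS) / 86400:.1f} days")
```

Compressed audio would narrow the gap a lot, but under these assumptions the text is still smaller by orders of magnitude, which is the whole argument for synthesizing on the device.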
If you want to look over a bunch of robotics projects at CMU, here's a nice list. It's not complete, but there are a bunch of pictures of robots and links to more info.
Yes, Xavier often has his manservant with him, but it's been a while since he's had to hit the kill switch. Xavier and Amelia both come to my office and annoy me, unattended. I'm in the same corridor and my door is often open (5303).
Flo the Nursebot is an interesting new development. It uses speech recognition, synthesis, and face tracking, nominally intended as a "robotic assistant for the elderly." The 'lips' move when Flo talks, and the eyes track the face of whoever it thinks it's talking to.
Robotics still has a long way to go, but things are starting to get interesting. Robots get a lot more interesting to me when you can talk to them; sometimes I wish they would just shut up when I tell them to, though.
This is actually a pretty straightforward computer vision approach. You use two cameras, and, since you've carefully calibrated the cameras and know how far apart they are, you can compute the disparity between the two images pixel by pixel. Since the cameras are in slightly different locations (separated by a fixed baseline and angle of difference), any disparity will be the result of the different angles of view of the two cameras, and it shrinks as objects get farther away, which is what lets you recover depth.
One interesting point is that the farther apart the eyes are, the more sensitive the apparatus is. So one way to get better depth perception is to put your eyes out on stalks.
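A minimal sketch of the geometry, assuming the simplest textbook setup: two identical, rectified pinhole cameras with parallel optical axes (no angle between them), a focal length given in pixels, and a baseline in meters. The function names and numbers are made up for illustration; this is not the method of any particular robot mentioned here.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Rectified-stereo relation Z = f * B / d (depth in meters)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def disparity_from_depth(focal_px, baseline_m, depth_m):
    """Inverse relation d = f * B / Z (disparity in pixels)."""
    return focal_px * baseline_m / depth_m


def depth_error_per_pixel(focal_px, baseline_m, depth_m):
    """Approximate depth error caused by a one-pixel disparity error.

    Differentiating Z = f * B / d gives |dZ/dd| = Z**2 / (f * B).
    """
    return depth_m ** 2 / (focal_px * baseline_m)


FOCAL_PX = 700.0  # assumed focal length in pixels

for baseline in (0.1, 0.2, 0.4):  # assumed baselines in meters
    d = disparity_from_depth(FOCAL_PX, baseline, depth_m=2.0)
    err = depth_error_per_pixel(FOCAL_PX, baseline, depth_m=2.0)
    print(f"baseline {baseline:.1f} m: disparity {d:.0f} px, "
          f"about {err * 100:.1f} cm of depth per pixel of error")
```

Doubling the baseline doubles the disparity for the same object and halves the depth error per pixel, which is the "eyes on stalks" effect.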
Yep, sgi_ad.c is a stub, but I just added sphinx2-test to the CVS tree, which calls sphinx2-batch to decode an example utterance. If you'd like to get sgi_ad to work :) just check out the current CVS tree and run ./autogen.sh, then ./configure, etc., and look at sphinx2-test.
The codebase has advanced considerably since Sphinx 1, and there have been a number of breakthroughs in the field since then. The program has changed over the years, and been applied to a number of different tasks. Furthermore, much of the time it's been used in whole systems, i.e., dialogue systems and natural language interfaces. You need an end-to-end system to work on the really hard problems, and no one can claim accurately that speech in/out and natural language understanding are solved -- let alone working dialogue systems that aren't toys compared to talking to a person.
So there you go -- there was a working version of the code long long ago, and it mutated as the demands of the field did; furthermore, it has been and continues to be used in larger end-to-end systems like the Communicator. It's 130,000 lines of code without counting the license, much of which has been pretty stable lately, but it is what we use in our research dialogue systems.
Yeah. That was pretty unfortunate. The wording on the post got people going on the (interesting) patent discussion, but I think it takes away from what a good thing this is.
The OGI CSLU (Center for Spoken Language Understanding) also has an open source toolkit and language resources, but their distribution mainly runs on Win32. Good stuff; they use Festival and the group there has made some excellent contributions.
The license is actually almost verbatim Apache, based on BSD. And the only reason we wanted the "you have to mention Sphinx" condition is because there was once a (nameless) system (somewhere nameless) where someone (!) took the source and just erased the authors' names, and redistributed it. At least with this we can have an inclusion of the original by reference -- people can go and see the original.
We're also sensitive to the whole 'advertising clause' problem, so if the Apache terms turn out to be more trouble than they're worth, we could probably be talked into changing the license.
RE: what SourceForge said -- SourceForge gives you a menu of licenses, and BSD was the closest.
Actually this version does not require training. The acoustic trainer will be released later, and we're looking to put in speaker adaptation shortly.
About accuracy, it is fiddly about the mic volume and the distance from your mouth. Try playing with that a bit. Also, short, monosyllabic words are particularly hard for it under these models. Try speaking normally and continuously (you probably already were).
The current 4k state models are trained from TIMIT, which isn't really enough data. We're in the process of building more, and we're hoping to get a process set up whereby we could distribute the cycles (Sphinx at home?).
At this point, we only have one set of broadband, 4k state models with the release. Our next step is to get a couple of sets of generic models for broadband and for telephone speech, and make a system for tailoring the generic models to specific language models.
We will also be releasing the trainer, and Sphinx 3, but it's coming out in steps. Sphinx 2 is the real-time engine, and while Sphinx 3 is more accurate, it's still slower.
As far as releasing data, we will be releasing whatever we can. It's OK for us to release models derived from data from, for instance, the LDC (Linguistic Data Consortium), because their licensing terms explicitly allow it, but much of our data comes from other sources. We'll be able to put some data out, but I think we'd be better off creating a public repository of contributed data, explicitly stating that all contributed data will remain free.
CMU Sphinx has no known Intellectual Property violations. This work is the result of a lot of work at CMU and involvement in publicly funded workshops. There are certainly no copyright issues (we wrote it) and we have no reason to suspect anyone has patent issues with it.
Actually, Perl runs on more platforms than Java... Perl is more portable than Java.
> Information wants to be free.
So does my Johnson!
It'd be better if it spoke; then it could read to us.
Kevin
Of course, there are CMU Sphinx, FestVox, and Festival available under truly open source licenses. http://www.speech.cs.cmu.edu/
My RAAMs appear to be functional.
your student,
lenzo
Maybe Nomad will turn up the Loc-Nar.
Yep, that was Dante. There's still some stuff up at NASA about it. Actually the other thread on "sending a robot to Hell" reminded me of Dante, too...
Here is a paper on fast stereo vision.
That would be great. I think a little NMI work would get significant portions of Sphinx2 working with JSAPI.
You can build your own language models. Take a look at the Sphinx home page for a link to a web-based language model building tool.