Speech Recognition in Silicon
Ben Sullivan writes "NSF-funded researchers are working to develop a silicon-based approach to speech recognition. "The goal is to create a radically new and efficient silicon chip architecture that only does speech recognition, but does this 100 to 1,000 times more efficiently than a conventional computer." Good use of $1 million?"
If what they're saying really is true, and knowing how much money is invested in speech recognition research on a yearly basis, yeah, I would definitely say that this is one million dollars well invested...
- Leon Mergen
http://www.solatis.com
Good use of $1 million?
Let me think for a moment... Hell yeah! If we had low power speech processors, the possibilities would be endless. For one, we'd finally have a Star Trek(TM) interface for our homes!
"Computer, lights!"
"Computer, make coffee!"
"Computer, Earl Grey, hot!"
As silly as it may sound, such an interface would be far more efficient than mashing buttons.
In addition, blind people could be significantly helped by this. Many of them already use speech recognition and synthesis to assist in computer usage. Imagine if their computers could suddenly understand them a thousand times better? They could talk to their computers a bit more naturally, thus saving their vocal cords from undue stress.
Other applications (off the top of my head) are:
- Voice notes on embedded devices (store only text!)
- Helpful Kiosks that can give you directions
- A new use for natural language database queries (e.g. ask the computer what last quarter's net sales were)
- Voice controlled robots ("You missed a corner, vacuum cleaner")
- Data search by voice ("Find me a channel that plays Star Trek")
Any other cool ideas out there?
Javascript + Nintendo DSi = DSiCade
From Carnegie Mellon University:
Carnegie Mellon engineering researchers to create speech recognition in silicon
Team to develop new silicon chip
Carnegie Mellon University's Rob A. Rutenbar is leading a national research team to develop a new, efficient silicon chip that may revolutionize the way humans communicate and have a significant impact on America's homeland security.
Rutenbar, a professor of electrical and computer engineering at Carnegie Mellon, working jointly with researchers at the University of California at Berkeley, received a $1 million grant from the National Science Foundation to move automatic speech recognition from software into hardware.
''I can ask my cell phone to 'Call Mom,''' says Rutenbar, ''but I can't dictate a detailed email complaint to my travel agent or navigate a complicated Internet database by voice alone.''
The problem is power--or rather, the lack of it. It takes a very powerful desktop computer to recognize arbitrary speech. ''But we can't put a Pentium(TM) in my cell phone, or in a soldier's helmet, or under a rock in a desert,'' explains Rutenbar, ''the batteries wouldn't last 10 minutes.''
Thus, the goal is to create a radically new and efficient silicon chip architecture that only does speech recognition, but does this 100 to 1,000 times more efficiently than a conventional computer.
The research team is uniquely poised to deliver on this ambitious project. Carnegie Mellon researchers pioneered much of today's successful speech recognition technology. This includes the influential 'Sphinx' project, the basis for many of today's commercial speech recognizers.
''We're still not even close to having a voice interface that will let you throw away your keyboard and mouse, but this current research could help us see speech as the primary modality on cell phones and PDAs,'' said Richard Stern, a professor in electrical and computer engineering and the team's senior speech recognition expert. ''To really throw away the keyboard, we have to go to silicon.''
But enhanced conversations between people and consumer products are not the main goal. ''Homeland security applications are the big reason we were chosen for this award,'' says Rutenbar. ''Imagine if an emergency responder could query a critical online database with voice alone, without returning to a vehicle, in a noisy and dangerous environment. The possibilities are endless.''
Researchers plan to unveil speech-recognition chip architecture in two to three years.
I can just see the anonymous cowards shouting "first post" at their PCs now.
Cruise TT
My friend and I were talking about this. In countries that are more totalitarian, it could be used to root out "dangerous people".
www.geocities.com/James_Sager_PA
God spoke to me.
100 to 1000 times more efficient worth $1M? meh. maybe.
100 to 1000 times more accurate worth $1M? definitely.
Damned straight it is! In government terms, that's a pittance. In government-funded science terms, it's downright INFINITESIMAL. It isn't even couch change, it's more like the stale pretzel under the couch cushion.
But, of course, cue the armchair blogging fanatics without a formal science education, waxing poetic about the infinite power and glory of x86 hardware running clever open source software. Maybe we could do it in Perl!
Good use of $1 million?
For something that would be worth hundreds of times that in the form of a finished product, I would hope so. The only dispute might be that the researchers' efforts would be better spent on other things.
On the one hand, it is obvious how much more efficient this would make our day-to-day tasks. Being able to "jot" notes with speech instead of writing, schedule tasks in seconds, the list goes on and on...
This is certainly beneficial... but think about the impact on the economy! Imagine all the "Administrative Professionals" who could, almost instantly, be out of work. I for one would rather pay even $5,000 for a good piece of software to take all my notes than pay a secretary $28,000/year or so.
Then again, when I posed this situation at my wife's office (she's a paralegal) one of the attorneys responded, "Until they come up with software that can find my lost keys and bring me coffee, the secretary's job is secure."
Proudly supporting the Libertarian Party.
...and view the printable version.
The Army reading list
I'm curious to see if their research will improve Natural Language Queries, as opposed to just improving speech recognition. There is an important difference between having to say: SELECT name FROM users WHERE id=12345 and saying: Pull up the name of employee number 12345.
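To make the difference concrete, here is a minimal sketch of the glue between the two phrasings. It assumes the recognizer has already produced plain text; the pattern table, the column whitelist, and the function names are hypothetical, and a real natural language front end would need far more than one regular expression.

```python
import re
import sqlite3

# Hypothetical pattern table: one spoken phrasing mapped to one parametrized query.
PATTERNS = [
    (re.compile(r"pull up the (\w+) of employee number (\d+)", re.IGNORECASE),
     "SELECT {column} FROM users WHERE id = ?"),
]

# Only whitelisted column names are ever formatted into the SQL text.
ALLOWED_COLUMNS = {"name"}


def to_sql(utterance):
    """Translate a recognized sentence into (sql, params), or raise ValueError."""
    for pattern, template in PATTERNS:
        match = pattern.search(utterance)
        if match:
            column = match.group(1).lower()
            if column not in ALLOWED_COLUMNS:
                raise ValueError(f"unknown column: {column}")
            # The id is passed as a bound parameter, never pasted into the query.
            return template.format(column=column), (int(match.group(2)),)
    raise ValueError("unrecognized request")


def run_query(conn, utterance):
    """Run the translated query on an open sqlite3 connection."""
    sql, params = to_sql(utterance)
    return conn.execute(sql, params).fetchall()
```

The hard part is not this lookup but covering the thousands of ways people might phrase the same request, which is where the natural language research comes in.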
-dave
http://millionnumbers.com/ - own the number of your dreams
(I did not read the article as it is slashdotted so I am relying on the summary's statement of 1 million dollars.)
I do security
Imagine how much money could be saved if you could *perfect* speech recognition.
...
Heck, the hospital I used to work at spent over a million dollars a year on medical transcriptionists all by itself.
It is an interesting concept, but do we really need this?
We already have voice recognition; this tech will just bring it to everything. You can talk to your keys, your toaster, your watch. But will they have anything interesting to say back?
What would you do if you had 1 million dollars?
You mean besides 2 chicks at the same time...
Refer your friends, get a free iPod. This is completely false. This is not a sig.
I once did a lot of work with speech recognition software, having a former significant other who was disabled. I tested a number of programs, and found the biggest problem to be the wide variance in users' dialects. The programs all have to be trained initially to recognize a single user's voice. This means that a program trained for a Bostonian may not work for someone from Arkansas, Texas, or Louisiana. Also, the programs' effectiveness decreased over time if you did not use them regularly.
I don't know how possible it will be to make a program that can recognize all English users. Will someone who speaks Oxford English be recognized as well as a surfer from California? I doubt it.
Never look down your nose at others. Someday, someone is bound to see your boogers.
This seems like a situation where a hardware accelerated approach is pretty sensible. I'm guessing there are large amounts of signal processing involved in speech recognition. It probably helps greatly to offload some of that onto a dedicated chip, in the same way as GPUs are used on graphics cards. The only problem I can see is that there might not be much of a market for it. GPUs have an obvious market (games), but there is less demand for speech processing. Star Trek-style interfaces are nice to dream of, but for most common tasks a keyboard and mouse will probably give you a faster and more accurate interface.
gmail invite
I'd like to see some results. So far there have been quite a few attempts at speech recognition. Generally they all fall short: they don't like accents, and they often misinterpret. I know because a while back we looked at something for my grandfather; he can't keep his hand steady enough to write anymore... *shrug*
Depends. It's not as good as using it to prevent the deaths of thousands - possibly tens of thousands - of people by ensuring they have clean drinking water and shelter from the elements. But hey - you can't put a price on being able to speak to a computer rather than type when you're ordering a pizza.
From 1994 to 1998 I did marketing and technical support for IBM's VoiceType Dictation products.
Initially, doing anything beyond understanding a few words took special hardware, but after a bit of 'training', highly accurate and fast speech-to-text was quite possible with a specially developed DSP.
Then the Pentium-class CPUs came about, and a P90 could just do the whole thing without the DSP.
So now someone is developing a new dedicated piece of silicon for this. Let's see how long it takes for general purpose computers to catch up.
The issue is not that this is not useful, but that it either has to keep developing, or offer a longer lasting price/performance advantage or much better features for a long time to come.
Using specialised DSPs makes more sense to me than burning up generic CPU cycles. There have been many examples over the years of how a specialized DSP is more efficient and effective for a narrow task than a regular CPU. Look at portable MP3 players. They use tiny specialized DSPs to decode the files in a manner that is much more efficient than using a regular CPU.
We'll still need to do traditional development to interpret the data from the DSPs. We'll need to parse the output so that we can use natural commands to control devices.
"Coffee maker, brew 10 cups, strong."
"Bathroom lights, on."
Without some manner of AI to interpret them, these phrases will be useless.
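As a rough illustration of that parsing step, here is a minimal sketch that assumes the DSP hands back clean text in a "device, action, modifier" shape. The function name and the dictionary layout are made up for the example; it does no real interpretation, it only splits the phrase into parts a controller could act on.

```python
def parse_command(utterance):
    """Split 'Device, action[, modifier...]' into a small command dict."""
    # Drop trailing punctuation, then split on the commas the speaker paused at.
    parts = [p.strip() for p in utterance.strip().rstrip(".!").split(",")]
    parts = [p for p in parts if p]
    if len(parts) < 2:
        raise ValueError("expected at least 'device, action'")
    device, action, *modifiers = parts
    words = action.lower().split()
    return {
        "device": device.lower(),       # e.g. "coffee maker"
        "verb": words[0],               # e.g. "brew"
        "args": words[1:],              # e.g. ["10", "cups"]
        "modifiers": [m.lower() for m in modifiers],  # e.g. ["strong"]
    }


print(parse_command("Coffee maker, brew 10 cups, strong."))
print(parse_command("Bathroom lights, on."))
```

Anything less rigid than this fixed word order ("make the coffee strong, ten cups") already breaks the sketch, which is the point being made about needing something smarter on top.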
LK
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
From the blog: ''Homeland security applications are the big reason we were chosen for this award,'' says Rutenbar. ''Imagine if an emergency responder could query a critical online database with voice alone, without returning to a vehicle, in a noisy and dangerous environment. The possibilities are endless.''
Like some slight tweaking in order to deploy massive voiceprint-recognition silicon arrays for amazingly efficient automatic realtime conversation transcription and identity determination, attached to Echelon.
So cool... so potentially evil... head begins to hurt... tinfoil hat burning....
Although $1 million can speed things up significantly, this is a pretty ambitious undertaking.
My Master's research was on implementing machine learning in hardware, specifically support vector machines.
Now, they have much more money than I did, and probably this will be a collaboration involving many graduate students, but converting complex algorithms from software to hardware is no easy task.
It is just easier to do things in software; that's why it has evolved. The modular layers of abstraction allow a computer scientist working in machine learning or speech recognition to not have to worry about how the underlying hardware works.
Working in hardware, you come face to face with a lot of these issues. This is particularly true since you want an architecture on a chip, whereas in a conventional desktop/server system resources such as lots of RAM, hard drive space, etc. are available, and their interconnections have been built and refined over decades.
Throw in concerns about small form factor and low power consumption, and quite fast a lot of unexpected hurdles pop up.
My master's research goal was to produce a data mining/machine learning machine, or at the very least a data mining/machine learning co-processor. In retrospect, that was a very ambitious goal that would require many years of work, probably in collaboration with other graduate students.
What I ended up doing was just Support Vector Machines in digital hardware. Now granted, there is another aspect to my research that I'm not mentioning here, namely that I didn't use normal floating point mathematical architectures, but a different, innovative logarithm-based mathematical architecture. That in itself was a significant undertaking.
In any case, this sounds like a great project. I just wonder how much they can do in their (in an academic sense) very small time frame of 2-3 years, even though a lot of preliminary work has probably already been done just to apply for the grant.
In any case, it is great to see something like this, something to keep in mind in case I ever go back for a Ph.D.
Once this technology has matured and some more headway can be made in Natural Language Processing, (uncertainty for teh win) we'll be on the cusp of some really excellent improvements in human-computer interfaces. It's becoming more common to see 'intelligent' systems being built to mirror the architecture of the human nervous system. This will be a necessary step to forming a generally proficient AI system. The day a computer can readily recognize you're being sarcastic, it's time to be paranoid.
"Don't waste your time or time will waste you" -MUSE
This sounds like a great idea. Sometimes a hammer works better than a screwdriver at a certain task. Not all jobs can be performed as well by a single tool or method.
After all, the human brain has different areas for processing different types of stimuli.
In fact, some parts of our brain are so radically different they are almost considered brains of their own.
Take the cerebellum; it's often referred to as "the small brain". This controls motor coordination - and in humans allows us to do amazing things like flips, kung-fu, and cup-stacking.
And forgive me for forgetting the exact names, but the brain has layers as well, the outermost layer being the cortex (where most of the higher-level mammalian processing takes place - correct me if I'm wrong, the frontal lobe is pretty much purely cortical tissue). As you delve deeper you get into the hippocampus and medulla whatever (sorry, IANAN: I am not a neurologist), which is where emotion rules - and which, if I again remember correctly, is sometimes referred to as the "reptilian" brain.
Even the eyes themselves can almost be considered little 'brains' of their own, considering the amount of pre-processing they do (maybe a co-processor would be more accurate).
make
With the advent of hardware speech recognition, hardware speech translation is just the next evolution. Imagine being able to go to any country in the world and have just an iPod-sized device and a Bluetooth hearing aid as a translator.
-Randy
Now, disgruntled ex-employees won't return to the office to "go postal", so to speak. They'll just run up and down the hallway yelling "File! Exit! No!".
Speech recognition is a two-part process. The silicon is to speed up part one: word recognition. The first thing to do is to figure out that the person is saying:
Computer, set timer for (to|too|two) (ours|hours).
Step two changes that into:
... two hours.
based on context. That's where the AI programmers get their turn at the problem.
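A toy sketch of that second step, picking among sound-alike words by context, might look like the following. The bigram scores are invented for the example rather than taken from any real language model, and the brute-force search over every combination is only workable for a sentence this short; real recognizers use dynamic programming or beam search instead.

```python
from itertools import product

# Made-up plausibility scores for adjacent word pairs (higher is better).
# Pairs that are not listed score 0.
BIGRAM_SCORE = {
    ("for", "two"): 3,
    ("two", "hours"): 5,
    ("to", "ours"): 1,
}


def disambiguate(slots):
    """Pick one word per slot so that the summed bigram score is highest."""
    best, best_score = None, float("-inf")
    for candidate in product(*slots):
        pairs = zip(candidate, candidate[1:])
        score = sum(BIGRAM_SCORE.get(pair, 0) for pair in pairs)
        if score > best_score:
            best, best_score = candidate, score
    return list(best)


# Step one (the silicon) yields alternatives per word position.
slots = [["computer"], ["set"], ["timer"], ["for"],
         ["to", "too", "two"], ["ours", "hours"]]
print(" ".join(disambiguate(slots)))
```

With these scores the only combination that beats the rest is "for two hours", so the context picks the right spelling even though all three candidates sound identical.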
#include "humorous_pop_culture_reference.h"
You are forgetting the coded phonetic context of a word and distillations for "known dialects". Besides dialects, English is rife with words that sound the same yet mean different things, or even sound (slightly) different depending on the surrounding contextual words and whether it is a statement, question or exclamation (different intonations). Feel free to multiply that K figure by up to 1000 times.
National Security Agency: "We did, and they are hooked to the national phone system."
With voice software, you can already speak in real-time, conference style. I think Skype supports 5 people.
With speech-to-text, you could log all conversation to IRC.
Then you could have search engines that search *all conversation within the last 5 minutes, world-wide.*
Well, at least all conversation that was okay with being public.
So you could say, "Show me all conversations that are going on right now about Python," and immediately find the people talking about Python, wherever they were.
One step towards the HiveMind.
Exactly, and that's where the real problem lies. If people think it's going to be difficult to identify the same word spoken by people from different regions then they probably have not given much thought to the fact that many words with different meanings sound the same in English and also that there are phrases such as "fat-chance" and "slim-chance" that mean exactly the same thing.
We have always been at war with Eurasia!
NSF, to me, translates to "Non-Sufficient Funds" or a bounced check.
I can tell you from personal experience that this method of "funding" only works for the short term.
Jonathan
So far, analog neuromorphic VLSI has hit a dead-end in terms of real applications. Also, digital signal processing has been speeding up to the point where it can go almost as fast as a lot of the parallel analog models.
The one exception is that the work on analog retina models led to the development of the Foveon X3 technology, which is just packing R, G, and B CMOS sensors into a single vertical column on a chip. But again, the neuromorphic part of the retina model is not the X3 technology; the X3 technology is stacking CMOS sensors.
Analog neuromorphic VLSI did have one big result: the electrical engineers managed to teach the biologists a lot about signal processing, and the cross-pollination of this knowledge has led to discoveries such as ripple analysis in auditory cortex.
something just about but not completely unlike tea
----
WWJD...For a Klondike Bar?
Making quantum leaps in speech recognition has tremendous potential for the deaf and hard-of-hearing (I am the latter).
Imagine being in a meeting (almost always a problem for hearing impaired people) and having real-time subtitles.
$1 million is a TINY price considering upwards of 20% of the nation has some hearing loss and hearing aids cost on the order of $4000 a pair.
A year spent in artificial intelligence is enough to make one believe in God.
As it is, it's a tossup whether I prefer speaking with a machine or a customer service rep in India. Won't take much for a machine to surpass most of them in English speech recognition. (Alright, to be fair, there are some Indians I've gotten on the phone who have been at LEAST as good as the typical US-based rep. But that's a minority.) Anything to advance the technology.
Eye use peach recon ingition proton now. Sea how wood it works? Eye love his sea check ignition pro gram. don't ewe tank hugh should met won?
All misspellings and grammatical errors in the above post are intentional and part of my artistic expression.
I am an assistant prof at a major research institution, and $1,000,000 is not as much as you would imagine. Firstly, most universities take ~50% of grants immediately as overhead. You're down to 500K. Second, this is spread out over 4-5 years; now you're down to about 125K a year. Third, if we have grants, we profs are required to pay our own summer salaries. On average this could be 25K, so you're down to 100K/year. In science and engineering we are expected to pay our grad students if we have grants. Yearly salary with additional overhead (in the US; Canada is a bit less) comes to almost 50K/year. A post-doctoral researcher would be hard to find for less than 50K/year with overhead. So really it supports a grad student and a post-doc and maybe some equipment for four years. Compared to the resources of industry it sometimes seems kind of puny. But the freedom is worth it. Just some info, OBQT
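The back-of-the-envelope arithmetic above can be written out as a short sketch. The figures are the ones quoted in the comment, taking four years from the 4-5 year range and rounding "almost 50K" up to 50K; the function and variable names are made up for illustration.

```python
def yearly_budget(total=1_000_000, overhead_rate=0.5, years=4, summer_salary=25_000):
    """Money left per year after university overhead and the prof's summer salary."""
    after_overhead = total * (1 - overhead_rate)  # the university takes its cut up front
    per_year = after_overhead / years             # the grant is spread over its duration
    return per_year - summer_salary               # the prof pays their own summer salary


GRAD_STUDENT = 50_000  # yearly cost with overhead, as quoted
POSTDOC = 50_000       # yearly cost with overhead, as quoted

available = yearly_budget()
left_for_equipment = available - GRAD_STUDENT - POSTDOC
print(f"available per year: ${available:,.0f}")
print(f"left after one grad student and one post-doc: ${left_for_equipment:,.0f}")
```

Calling `yearly_budget(years=5)` shows the other end of the range, where one of the two salaries no longer fits.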
This should be about algorithms, not architecture. Anything they can do in silicon can and should be implemented and perfected in PC software first. I don't care if it takes a PC 10 minutes to recognize a 10-second sentence as long as it does it accurately. As soon as that happens, then by all means cut its power consumption and speed it up x1000 by doing it in silicon. If all they are doing is speeding up existing, relatively low accuracy algorithms, then their effort is of limited use.
To be honest, I doubt that putting a few clever algorithms together will ever achieve any respectable accuracy, no matter how fast those algorithms are. Sure, it might accurately recognize words from a limited vocabulary when spoken clearly and/or in simple sentences. If this is their goal, then it is quite achievable. It sounds to me though that they are aiming much higher, as in "dictating a detailed email". I think that so many things have to happen, from effective noise filtering to proper phonetic model representation to parsing to content-based correction. The latter step is especially problematic since it requires a huge knowledge database, which takes humans years to accumulate. I am not saying that these difficulties are insurmountable, but simply that their goals are too ambitious for the current state of our technology and knowledge. I'd love to be proven wrong on that account though.
"You mortals are so obtuse." -Q
Why are they talking about querying online databases for 911 calls as the national security app? It's obvious the national security app is to translate every single phone call to text and store them (indexed) in a classified database. I've attempted to believe the US wouldn't do this because it's illegal, but I can't manage to suspend disbelief. The only way to avoid this is if phone calls are encrypted and the US doesn't have the keys.