Speech Recognition in Silicon
Ben Sullivan writes "NSF-funded researchers are working to develop a silicon-based approach to speech recognition. "The goal is to create a radically new and efficient silicon chip architecture that only does speech recognition, but does this 100 to 1,000 times more efficiently than a conventional computer." Good use of $1 million?"
If what they're saying really is true, and knowing how much money is invested in speech recognition research on a yearly basis, yeah, I would definitely say that this is one million dollars well invested...
- Leon Mergen
http://www.solatis.com
Good use of $1 million?
Let me think for a moment... Hell yeah! If we had low power speech processors, the possibilities would be endless. For one, we'd finally have a Star Trek(TM) interface for our homes!
"Computer, lights!"
"Computer, make coffee!"
"Computer, Earl Grey, hot!"
As silly as it may sound, such an interface would be far more efficient than mashing buttons.
In addition, blind people could be significantly helped by this. Many of them already use speech recognition and synthesis to assist in computer usage. Imagine if their computers could suddenly understand them a thousand times better? They could talk to their computers a bit more naturally, thus saving their vocal cords from undue stress.
Other applications (off the top of my head) are:
- Voice notes on embedded devices (store only text!)
- Helpful Kiosks that can give you directions
- A new use for natural language database queries (e.g. ask the computer what last quarter's net sales were.)
- Voice controlled robots ("You missed a corner, vacuum cleaner")
- Data search by voice ("Find me a channel that plays Star Trek")
Any other cool ideas out there?
Javascript + Nintendo DSi = DSiCade
From Carnegie Mellon University:
Carnegie Mellon engineering researchers to create speech recognition in silicon
Team to develop new silicon chip
Carnegie Mellon University's Rob A. Rutenbar is leading a national research team to develop a new, efficient silicon chip that may revolutionize the way humans communicate and have a significant impact on America's homeland security.
Rutenbar, a professor of electrical and computer engineering at Carnegie Mellon, working jointly with researchers at the University of California at Berkeley received a $1 million grant from the National Science Foundation to move automatic speech recognition from software into hardware.
''I can ask my cell phone to 'Call Mom,''' says Rutenbar, ''but I can't dictate a detailed email complaint to my travel agent or navigate a complicated Internet database by voice alone.''
The problem is power--or rather, the lack of it. It takes a very powerful desktop computer to recognize arbitrary speech. ''But we can't put a Pentium(TM) in my cell phone, or in a soldier's helmet, or under a rock in a desert,'' explains Rutenbar, ''the batteries wouldn't last 10 minutes.''
Thus, the goal is to create a radically new and efficient silicon chip architecture that only does speech recognition, but does this 100 to 1,000 times more efficiently than a conventional computer.
The research team is uniquely poised to deliver on this ambitious project. Carnegie Mellon researchers pioneered much of today's successful speech recognition technology. This includes the influential 'Sphinx' project, the basis for many of today's commercial speech recognizers.
''We're still not even close to having a voice interface that will let you throw away your keyboard and mouse, but this current research could help us see speech as the primary modality on cell phones and PDAs,'' said Richard Stern, a professor in electrical and computer engineering and the team's senior speech recognition expert. ''To really throw away the keyboard, we have to go to silicon.'' But enhanced conversations between people and consumer products are not the main goal. ''Homeland security applications are the big reason we were chosen for this award,'' says Rutenbar. ''Imagine if an emergency responder could query a critical online database with voice alone, without returning to a vehicle, in a noisy and dangerous environment. The possibilities are endless.''
Researchers plan to unveil speech-recognition chip architecture in two to three years.
I can just see the anonymous cowards shouting first post at their pcs now
Cruise TT
My friend and I were talking about this. In countries that are more totalitarian, it could be used to root out "dangerous people" www.geocities.com/James_Sager_PA
God spoke to me.
100 to 1000 times more efficient worth $1M? meh. maybe.
100 to 1000 times more accurate worth $1M? definitely.
Damned straight it is! In government terms, that's a pittance. In government-funded science terms, it's downright INFINITESIMAL. It isn't even couch change, it's more like the stale pretzel under the couch cushion.
But, of course, cue the armchair blogging fanatics without a formal science education, waxing poetic about the infinite power and glory of x86 hardware running clever open source software. Maybe we could do it in perl!
Good use of $1 million?
For something that would be worth hundreds of times that in the form of a finished product, I would hope so. The only dispute might be that the researchers' efforts would be better spent on other things.
On the one hand, it is obvious how much more efficient this would make our day-to-day tasks. Being able to "jot" notes with speech instead of writing, schedule tasks in seconds, the list goes on and on...
This is certainly beneficial... but think about the impact on the economy! Imagine all the "Administrative Professionals" who could, almost instantly, be out of work. I for one would rather pay even $5,000 for a good piece of software to take all my notes than pay a secretary $28,000/year or so.
Then again, when I posed this situation at my wife's office (she's a paralegal) one of the attorneys responded, "Until they come up with software that can find my lost keys and bring me coffee, the secretary's job is secure."
Proudly supporting the Libertarian Party.
...and view the printable version.
The Army reading list
I'm curious to see if their research will improve Natural Language Queries, as opposed to just improving speech recognition. There is an important difference between having to say: SELECT name FROM users WHERE id=12345 and saying: Pull up the name of employee number 12345.
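To make that contrast concrete, here is a minimal sketch of mapping the spoken-style request onto the SQL query. The regular expression, the layout of the `users` table, and the sample row are all assumptions made up for illustration; a real natural-language front end would need far more than one hard-coded pattern.

```python
import re
import sqlite3

# Hypothetical pattern covering only the single phrasing quoted above.
REQUEST = re.compile(r"pull up the name of employee number (\d+)", re.IGNORECASE)

def to_query(utterance):
    """Map a recognized utterance to (sql, params), or None if not understood."""
    match = REQUEST.search(utterance)
    if match is None:
        return None
    # Parameterized query, so the spoken number is never pasted into the SQL.
    return "SELECT name FROM users WHERE id = ?", (int(match.group(1)),)

# Toy in-memory database standing in for the real one (made-up data).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users VALUES (12345, 'Example Employee')")

def lookup(utterance):
    """Run the query for an utterance; None if unparsed or no matching row."""
    query = to_query(utterance)
    if query is None:
        return None
    sql, params = query
    row = db.execute(sql, params).fetchone()
    return row[0] if row else None

print(lookup("Pull up the name of employee number 12345."))
```

The hard part is everything this sketch skips: the chip only gets you from sound to text, and the step from text to a query still has to cope with every other way a person might phrase the request.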
-dave
http://millionnumbers.com/ - own the number of your dreams
Oh the possibilities of handless command of Virtual Valerie:)
Speech recognition on a chip, yes.
But only "silicon" in the sense that every other silicon chip is silicon.
No magical "silicon" breakthroughs to see here, keep moving.
-kgj
Note to self: Eat up Martha.
Computer. Computer? Hello, Computer. Just use the keyboard. Keyboard. How quaint.
(I did not read the article as it is slashdotted so I am relying on the summary's statement of 1 million dollars.)
I do security
...is always underestimate your costs and run over budget later. That $1 million will turn into $1 billion before anything comes of this. Hell, it'll take over a million to get the development organization up and running.
Imagine how much money could be saved if you could *perfect* speech recognition.
...
Heck, the hospital I used to work at by itself spent over a million dollars a year on medical transcriptionists
It is an interesting concept, but do we really need this?
We already have voice recognition, this tech will just bring it to everything. You can talk to your keys, your toaster, your watch. But will they have anything interesting to say back?
What would you do if you had 1 million dollars?
You mean besides 2 chicks at the same time...
Refer your friends, get a free iPod. This is completely false. This is not a sig.
I once did a lot of work with speech recognition software, having a former significant other who was disabled. I tested a number of programs, and found the biggest problem to be the wide variances in users' dialects. The programs all have to be trained initially to recognize a single user's voice. This means that a program trained for a Bostonian may not work for someone from Arkansas, Texas, or Louisiana. Also, the programs' effectiveness decreased over time if you did not use them regularly.
I don't know how possible it will be to make a program that can recognize all English users. Will someone who speaks Oxford English be recognized as well as a surfer from California? I doubt it.
Never look down your nose at others. Someday, someone is bound to see your boogers.
This seems like a situation where a hardware accelerated approach is pretty sensible. I'm guessing there are large amounts of signal processing involved in speech recognition. With a custom chip like this it probably helps greatly to offload some of that onto a dedicated chip in the same way as GPUs are used on graphics cards. The only problem I can see is that there might not be much market for it. GPUs have an obvious market (games), but there is less demand for speech processing. Star-Trek style interfaces are nice to dream of but for most common tasks a keyboard and mouse will probably give you a faster and more accurate interface.
gmail invite
The Adama.
I see some results. So far there have been quite a few attempts at speech recognition. Generally they all fall short: they don't like accents, and they often misinterpret. I know because a while back we looked at something for my grandfather; he can't keep his hand steady enough to write anymore... *shrug*
The social, commercial and political usefulness of this technology is worth billions. Will this lead to be the end of word processing by keyboard? Dr. Evil: "Here's the plan. We get the warhead, and we hold the world ransomed for.....One MILLION DOLLARS!!" No.2: "Ahem...Well, don't you think we should maybe ask for *more* than a million dollars? A million dollars isn't exactly a lot of money these days. Virtucon alone makes over nine billion dollars a year!"
Depends. It's not as good as using it to prevent the deaths of thousands - possibly tens of thousands - of people by ensuring they have clean drinking water and shelter from the elements. But hey - you can't put a price on being able to speak to a computer rather than type when you're ordering a pizza.
From 1994 up to 1998 I did marketing and technical support for IBM's VoiceType Dictation products.
Initially, doing anything beyond understanding a few words would take special hardware, but after a bit of 'training', highly accurate and fast speech-to-text was quite possible with a specially developed DSP.
Then the Pentium-class CPUs came about, and a P90 could just do the whole thing without the DSP.
So, now someone is developing a new dedicated piece of silicon for this. Let's see how long it takes for general purpose computers to catch up.
The issue is not that this is not useful, but that it either has to keep developing, or offer a longer-lasting price/performance advantage or much better features for a long time to come.
Using specialised DSPs makes more sense to me than burning up generic CPU cycles. There have been many examples over the years of how a specialized DSP is more efficient and effective for a narrow task than a regular CPU. Look at portable MP3 players. They use tiny specialized DSPs to decode the files in a manner that is much more efficient than using a regular CPU.
We'll still need to do traditional development to interpret the data from the DSPs. We'll need to parse the output so that we can use natural commands to control devices.
"Coffee maker, brew 10 cups, strong."
"Bathroom lights, on."
Without some manner of AI to interpret them, these phrases will be useless.
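As a rough illustration of that interpretation step, here is a minimal sketch of turning the two phrases above into structured commands. The device table, the action names, and the assumption that the recognizer hands over punctuated text exactly as quoted are all hypothetical; this is plain string matching, nowhere near real language understanding.

```python
# Hypothetical device table: device name -> actions it understands.
DEVICES = {
    "coffee maker": {"brew"},
    "bathroom lights": {"on", "off"},
}

def parse_command(text):
    """Split '<device>, <action> [args], [options]' into a structured command.

    Returns None when the device or the action is unknown.
    """
    parts = [p.strip() for p in text.lower().rstrip(".!").split(",")]
    if len(parts) < 2 or parts[0] not in DEVICES:
        return None
    words = parts[1].split()
    if not words or words[0] not in DEVICES[parts[0]]:
        return None
    return {
        "device": parts[0],
        "action": words[0],
        "args": words[1:],      # e.g. ["10", "cups"]
        "options": parts[2:],   # e.g. ["strong"]
    }

print(parse_command("Coffee maker, brew 10 cups, strong."))
print(parse_command("Bathroom lights, on."))
```

Anything outside the table simply comes back as None, which is exactly the gap between a fixed grammar like this and the kind of AI the phrases would really need.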
LK
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
From the blog: ''Homeland security applications are the big reason we were chosen for this award,'' says Rutenbar. ''Imagine if an emergency responder could query a critical online database with voice alone, without returning to a vehicle, in a noisy and dangerous environment. The possibilities are endless.''
Like some slight tweaking in order to deploy massive voiceprint-recognition silicon arrays for amazingly efficient automatic realtime conversation transcription and identity determination, attached to Echelon.
So cool... so potentially evil... head begins to hurt... tinfoil hat burning....
Although $1 million can speed things up significantly, this is a pretty ambitious undertaking.
My Master's research was on implementing machine learning in hardware, specifically support vector machines.
Now, they have much more money than I did, and probably this will be a collaboration involving many graduate students, but converting complex algorithms from software to hardware is no easy task.
It is just easier to do things in software, that's why it has evolved. The modular layers of abstraction allow a Computer Scientist working in machine learning or speech recognition to not have to worry about how the underlying hardware works.
Working in hardware, you come face to face with a lot of these issues. Particularly since you want an architecture on a chip, whereas in a conventional desktop/server system resources such as lots of RAM, hard drive space, etc. are available, and their interconnections have been built and refined over decades.
Throw in concerns about small form factor and low power consumption, and quite fast a lot of unexpected hurdles pop up.
My master's research goal was to produce a data mining/machine learning machine, or at the very least a data mining/machine learning co-processor. In retrospect, that was a very ambitious goal that would require many years of work, probably in collaboration with other graduate students.
What I ended up doing was just Support Vector Machines in digital hardware. Now granted, there is another aspect to my research that I'm not mentioning here, mainly that I didn't use normal floating point mathematical architectures, but a different innovative logarithmic based mathematical architecture. That in itself was a significant undertaking.
In any case, this sounds like a great project, I just wonder how much they can do in their (in an academic sense) very small time frame of 2-3 years. Even though a lot of preliminary work has probably already been done just to apply for the grant.
In any case, it is great to see something like this, something to keep in mind in case I ever go back for a Ph.D.
There isn't much overlap, but there is some. Signal processing, the breaking down of the nuances of speech.
I figure a hardware speech processor and hardware speech synthesis (very very accurate and believable) would have a great use for mankind.
Imagine how much cheaper sex chat lines would be, for instance!
They would only need a limited vocabulary, so perhaps the OS IBM stuff would work for now?
Of course, I bet a patent will come out of this... voice technology that is very reliable and very easy will remove a whole interface. Talk back to your sat nav...
"turn left"
"I can't, it's bloody road works"
"Turn left"
"Damn you!"
"turn left, turn left, you will be assimilated"
"what did you say?"
"erm, nothing, I mean, turn left"
#hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
So.. Who owns the patents, etc, on this if they do it?
.. is better.
Bring on the silicon, yeah baby, yeah!
{oh, except %ONE thing, that is... right...}
; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
Cell phones have had voice dialling for ages (3+ years). I simply say "call home" or "dial pizza" and my phone dials the number automatically. Presumably the DSP for this is on a chip, so I don't get what's new?
Once this technology has matured and some more headway can be made in Natural Language Processing, (uncertainty for teh win) we'll be on the cusp of some really excellent improvements in human-computer interfaces. It's becoming more common to see 'intelligent' systems being built to mirror the architecture of the human nervous system. This will be a necessary step to forming a generally proficient AI system. The day a computer can readily recognize you're being sarcastic, it's time to be paranoid.
"Don't waste your time or time will waste you" -MUSE
You make an excellent point about blind users.
My dad lost his vision a few years back, and we haven't really found
anything terribly useful in the realm of speech recognition.
He's tried out the little electronic phone/address book gizmo, but it took
forever to train to his voice, a process that was a PITA to start with since
you had to _READ_ what you were supposed to say to it off the screen,
then whisper it to dad, loud enough for him to hear you, but not loud
enough that your whisper would be picked up.
So that went in the trash, and he's been using a microcassette recorder
ever since. Not really the coolest way to do things, but it gets the job done.
(It has an interface that a blind person can actually use)
This sounds like a great idea, as long as they make it useable.
This sounds like a great idea. Sometimes a hammer works better than a screwdriver at a certain task. Not all jobs can be performed as well by a single tool or method.
After all, the human brain has different areas for processing different types of stimuli.
In fact, some parts of our brain are so radically different they are almost considered brains of their own.
Like the cerebellum; it's often referred to as "the small brain". This controls motor coordination - and in humans allows us to do amazing things like flips, kung-fu, and cup-stacking.
And forgive me for forgetting the exact names, but the brain has layers as well, the outermost layer being the cortex (where most of the higher-level mammalian processing takes place - correct me if I'm wrong, the frontal lobe is pretty much purely cortical tissue). As you delve deeper you get into the hippocampus and medulla whatever (sorry, IANAN: I am not a Neurologist), which is where emotion rules - and if I again remember correctly is sometimes referred to as the "reptilian" brain.
Even the eyes themselves can almost be considered little 'brains' of their own - considering the amount of pre-processing they do (maybe a co-processor would be more accurate).
pr0n? We all know that if there's a pr0n application, then the technology will be developed & shipped 100-1000x faster. Speech recognition + pr0n... of course, the obvious control of the system by speech (first steps towards a holodeck), but also you could identify who's in that video by their voice!
I smell BS.
Good speech-to-text does not take a really fast CPU; it takes a fast CPU + good database + a fair amount of RAM. Your cell phone's CPU can handle "Call Mom" because it only needs to know MOM, DAD, SALLY, and maybe 20-30 other names. There are 40,000+ words in English. If you want to have a low-cost CPU, great, but you will need a lot of memory and permanent storage to get this to work.
With the advent of hardware speech recognition, hardware speech translation is just the next evolution. Imagine being able to go to any country in the world and have just an iPod size device and a bluetooth hearing aid as a translator.
-Randy
L33t D00d: I ownz j00
Me: No you don't, eat sniper rifle
*HEADSHOT*
Me: Dammit
*HEADSHOT*
*Double Kill*
Me: Sh!t
[toilet flushes]
*M..M..M..Monster Kill...Kill...killl
Me: F*ck
bed folds down
*L33t d00d is unstoppable*
Me: Sh!t
[toilet flushes]
*L33t d00d is godlike*
Me: gawd dammit
[house explodes]
L33t D00d: told ya, I ownz j00
L33t D00d: hey, you still there?
The decline in legibility of handwriting due to the widespread use of keyboards has been discussed on Slashdot before, but taking it a step further, what effect do you think prevalent voice recognition will have on our ability to spell?
On a side note:
"I don't have lip fungus!"
"Let it go."
If you're looking at an embedded chip to interpret information, think about something large-scale: languages.
If you had the processing power to interpret and understand language, tack that on to something like Babelfish as a translator program. Now you have something that fits on a chip that can translate between any number of languages into your own. Now you can stick a little hearing aid into your ear, and it will translate anything you hear to english, for example. This would revolutionize international communication. This would reduce the number of barriers between diplomats, making them more effective communicators. Also, it would save governments millions of dollars, euros, or any other form of currency in translator salaries, reduce miscommunication, prevent problems with misunderstanding criminals they are charging with crimes, and increase the quality of education among international/foreign exchange students.
Drawbacks: Keeping up with changing language and slang will be quite difficult to include in older models without the capability of a firmware upgrade. Chip size and speed are a factor as well.
This is, of course, assuming that the chip is smaller than the user's head.
According to this link, the average length of an English word is 6 characters. At one byte per character (two if you use Unicode), we find that a database of 40,000 words would be anywhere from (40,000*6) = 240,000 bytes, or roughly 235 KB, to roughly 470 KB in size. That's NOT much memory at all.
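For anyone who wants to check the arithmetic, here is a quick sketch. The 40,000-word and 6-character figures are the ones quoted in this thread; the function name is just made up for the example, and it only counts the bare text of the word list, nothing about how the words sound.

```python
WORDS = 40_000        # vocabulary size quoted in the thread
AVG_WORD_LENGTH = 6   # average English word length quoted above

def text_size_kib(bytes_per_char):
    """Size of the bare word list in KiB (1 KiB = 1024 bytes)."""
    return WORDS * AVG_WORD_LENGTH * bytes_per_char / 1024

print(text_size_kib(1))  # one byte per character  -> 234.375
print(text_size_kib(2))  # two bytes per character -> 468.75
```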
you are working a job as customer support. I suspect that this will be used to help replace customer support, or possibly to change somebody's accent so that they appear to be from Boston rather than from India.
I prefer the "u" in honour as it seems to be missing these days.
This is very good but English is not a good language for voice to text translations. There are far too many homonyms (sp).
We have always been at war with Eurasia!
Layering a speech-to-commands layer over the current systems is very problematic.
The Star Trek nonsense of 'computer! get me all the data on ship X'
[and why does Data talk to the computer, surely he's Wi-Fi enabled ? ] is plainly wrong.
I found using via-voice and friends physically tiring, talking all day instead of typing is quite draining.
Now sit yourself in an office with 20 or so colleagues all trying to work - talking out loud all day.
It's pretty much like touch screens - they sound great until you actually get one and you find out all that investment is pretty much a waste of time except for niche markets.
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
The other thing that this spells to me (haha I made a funny) is the specialization of computer components. Rather than having 1 main processor and a sorta second and third in the North Bridge and GPU we have dedicated processors for the different functions! One for the Graphics, Sound, Integer math, Floating point math, the possibilities are endless. This may be a futuristic idea, but the practical uses will be more general advancements in how computers are used and thought of.
KRAMER: Hewwo and welcome to Movie phone. If you know the name of the movie you'd like to see, press one.
... Agent Zero? If that's correct, press one.
... Brown-Eyed Girl? If this is correct, press one.
GEORGE: Come on. Come on.
KRAMER: Using your touch-tone keypad, please enter the first three letters of the movie title, now.
(George presses 3 keys)
KRAMER: You've selected
GEORGE: What?
KRAMER: Ah, you've selected
(George looks baffled)
KRAMER: Why don't you just tell me the name of the movie you've selected.
GEORGE: Chunnel?
KRAMER: To find the theater nearest you, please enter your five digit zip-code, now.
(George enters his zip-code)
KRAMER: Why don't you just tell me where you want to see the movie?
GEORGE: Lowes Paragon, 84th and Broadway.
KRAMER: (picks up paper) Chunnel, is playing at the Paragon 84th Street cinema in the main theater at 9:30 PM.
GEORGE: Yeah, now I gotcha! (hangs up the phone and rushes out the door)
KRAMER: It's also playing in theater number two at 9:00.
Now, disgruntled ex-employees won't return to the office to "go postal", so to speak. They'll just run up and down the hallway yelling "File! Exit! No!".
Um. Storing the text part of the word isn't the issue. It's the sound part, which requires more than 6-12 bytes per word.
You are forgetting the coded phonetic context of a word and distillations for "known dialects". Besides dialects, English is rife with words that sound the same yet mean different things, or that even sound (slightly) different depending on the surrounding contextual words and whether it is a statement, question or exclamation (different intonations). Feel free to multiply that K figure by up to 1000 times.
... maybe it'll do the same thing for speech recognition as separate processors have done for graphics, notably 3d graphics. When I was mucking around with computers as a youngster, I could only dream of the likes of quake3 & doom3. Most computers had a crappy CGA or _whew_ maybe even an EGA adapter on board. GPUs have made things not so much possible as feasible that weren't so before ... maybe a separate chip for speech processing will have the same effect. I mean, we've been talking about speech recognition for decades now, only it's been going totally nowhere.
---
"The chances of a demonic possession spreading are remote -- relax."
Phonetics. It's quite uncommon to store the complete sound of the word. Not only would it be redundant, but it would be of no practical value to the computer.
Human beings are very adept at making quick judgements on the stream of information we receive from the senses. We then follow along logical paths from those judgements, but we also quickly discern if we're headed down a wrong track and will "re-view" the evidence we've been given. His philosophy is that if you segment AI into perception and cognition, you're missing a fundamental feature of human intelligence.
Go to his page at UI, check his wiki, or better yet read his books.
jaz
Death to Argument by Slogan!! (This post twice-encrypted with ROT-13. Replies not using same will be ignored)
That might be a somewhat dangerous command to have as it would probably lead to many cooked American visitors who rented cars in Canada or Europe (or in fact almost anywhere outside the US!). In fact I can see that the headlines of tomorrow might be subtly different from those of today:
"NASA loses Engineer after spacecraft gets units wrong"
Carver Mead (at Caltech last I heard) was pioneering work to take neural processes such as vision and hearing, and model them in silicon via custom-fab VLSI circuits. This is a MUCH better approach to modelling these processes, since your neurons process the information in massively-parallel, simple-circuit networks.
:)
The traditional approach was to take a (completely) serial CPU and have it iterate over sampled data using a complex model of the naturally-occurring network.
It seems like a no-brainer to me, but I doubt that $1million will be much when all is said and done.
But I freely confess that I haven't RTFA.
I would prefer that my computer be able to differentiate between there;their;they're and eight;ate, to name a few. For simple commands ("Computer Lights On") it would be fairly useless, but what I mentioned would be a drop in the bucket compared to the basic AI needed to do a decent, completely hands-off speech-to-text solution.
This is not a good use of $1 mil. This is an attempt to throw hardware at a problem that software should solve.
Speech recognition has been stuck in its use of neural nets for far too long. It is very possible to vastly reduce the hardware requirements by making the software smarter. In speech recognition, the most logical way to do this is to RECOGNIZE that the signal is SPEECH. Speech is unique from many other signals, and there are volumes of linguistic research that show how speech is unique. Application of linguistic knowledge can make speech recognition vastly more efficient.
Neural nets are a highly inefficient way to attempt to recognize speech. I defy anyone to really be able to demonstrate in a detailed way what a neural net is doing when it attempts to recognize speech. This is a blind alley that will keep requiring more muscular hardware. Instead, take a first pass at speech data using linguistic knowledge. This will GREATLY reduce processing overhead.
...it's only $1 Million for the first chip. All the other chips cost about 35 cents. Assuming it works, of course.
Sure I'm paranoid, but am I paranoid enough?
Fair enough. We can assume that the phonetic representation is similar to Unicode (i.e. up to 65,000 unique phonemes), so that would double the storage. Then assume we need data about each phoneme. English has about 45 phonemes, which is actually above average for a language. If we assume that the computer stores about 4-8 times that many samples (different samples used as ranges for interpolation), you still don't have that many samples. A few megs at most.
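Here is a back-of-the-envelope sketch of that estimate. The 45 phonemes and the 4-8 samples per phoneme come from the post above; the size of a single stored sample is not given anywhere in this thread, so the 8 KiB figure below is purely a hypothetical placeholder to show how the total scales.

```python
PHONEMES = 45  # approximate English phoneme count quoted above

def sample_storage_mib(samples_per_phoneme, bytes_per_sample):
    """Total storage for all phoneme samples, in MiB (1 MiB = 1024 * 1024 bytes)."""
    return PHONEMES * samples_per_phoneme * bytes_per_sample / (1024 * 1024)

# Hypothetical size of one stored sample; the thread does not specify this.
BYTES_PER_SAMPLE = 8 * 1024

low = sample_storage_mib(4, BYTES_PER_SAMPLE)   # 4 samples per phoneme
high = sample_storage_mib(8, BYTES_PER_SAMPLE)  # 8 samples per phoneme
print(low, high)
```

With that placeholder the total lands between about 1.4 and 2.8 MiB; a bigger per-sample size scales the result linearly, so the "few megs" claim stands or falls on that one number.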
I don't know, let's ask the chip...
"Was it a millionaire who said 'Imagine No Posessions?'" -- Elvis Costello
Good use of a million dollars? Let me ask my computer...
"Hello, computer." - Scotty
That's already handled by speech-to-text programs. They handle this issue by making a contextual "guess" of which word to use. This is especially important as most English speakers fail to properly enunciate their words. i.e. affect and effect are pronounced slightly differently, but most people incorrectly pronounce them with a short 'a'. i.e. 'u-ffect' instead of 'a-ffect' and 'ee-ffect'.
I'd hesitate to "siliconize" an algorithm before I knew what the best algorithm was. People have been working on this problem for 50 years. They have some reasonable solutions for slow speech. But there are still clever things to be discovered. You can always test it on a supercomputer, or slowed down speech.
National Security Agency: "We did, and they are hooked to the national phone system."
With voice software, you can already speak in real-time, conference style. I think Skype supports 5 people.
With speech-to-text, you could log all conversation to IRC.
Then you could have search engines that search *all conversation within the last 5 minutes, world-wide.*
Well, at least all conversation that was okay with being public.
So you could say, "Show me all conversations that are going on right now about Python," and immediately find the people talking about Python, wherever they were.
One step towards the HiveMind.
Heck, most people hear just fine, and lots of us don't listen. Just ask any wife, mother, professor, boss...
This issue is a bit more complicated than you think.
NSF, to me, translates to "Non-Sufficient Funds" or a bounced check.
I can tell you from personal experience that this method of "funding" only works for the short term.
Jonathan
I was just trying to poke fun at all the people who post on slashdot now in slight astroturfing mode. "The company I work for make product X! It'll save the world!"
I don't see anything wrong with that. In the old days, Slashdot was all about new products and projects (and MS-bashing, Linux-loving), no yro, no politics. A lot of us still want to hear about new products and as long as the submitter correctly identifies themself as the developer, there is no problem. That's the way it's been from day 1. It's the true astroturfers who pose as just an interested third party who are doing it wrong.
If you don't like product announcements on Slashdot, ignore them and stick to the other areas. Myself, I don't like all the YRO articles, but I don't add comments to them poking fun at YRO discussions.
ascii art
something just about but not completely unlike tea
----
WWJD...For a Klondike Bar?
So is it like chess where faster translates to better?
One thing is "speech recognition" or speech-to-text. Another different beast is "language recognition" or text-to-meaning.
If you say "open the second drawer from the right and close it after ten seconds", it could be relatively easy to recognize the sound and convert it to text, to take written notes from whatever people talk about. But it's much, much harder to design an "intelligence" capable of really understanding what you *mean*. Human language is not easy, even for humans...
Making quantum leaps in speech recognition has tremendous potential for the deaf and hard-of-hearing (I am the latter).
Imagine being in a meeting (almost always a problem for hearing impaired people) and having real-time subtitles.
$1 million is a TINY price considering upwards of 20% of the nation has some hearing loss and hearing aids cost on the order of $4000 a pair.
A year spent in artificial intelligence is enough to make one believe in God.
Is this really something radically new?
Or is it just a PGA.
Gate arrays are very fast and very limited. So, prototyping one would take lotsa bux. But it wouldn't really be anything to brag about as a technical achievement.
Article seems kinda short on technical details.
And no, saying "We have a really cool algorithm that's ready to commit to silicon. So, we're going to make a PGA. Then make a billion more just like it." would NOT give away any trade secrets.
"Reality is that which, when you stop believing in it, it doesn't go away." - Philip K. Dick
How long until they can get the computer to read lips?
Eye am yousing dis tex knowledge E all ready
Don't get me wrong, I really appreciate hearing about these kinds of investments, because they will serve a very good purpose. Hell, I am one of those guys who have been dreaming about real voice-controlled computers since my first contact with Star Trek :D And of course my computer would have a female voice (I'd even know who to sample :D )
This chip that recognizes voice patterns fast... It seems like reinventing the wheel. Why? Because analogical algorithms implemented on CNN (meaning cellular neural network) chips, on real hardware, could do that (and they can do even much more, as I know some researchers who work on real projects in the field).
If somebody asked me about this (why would they :P ) I'd say invest in something that could be more beneficial in the long term, which in this case would be CNN research.
I am putting myself to the fullest possible use, which is all I can think that any conscious entity can ever hope to do.
$1 mil doesn't really go all that far in the research world. That could fund maybe 5 graduate students for 3 years, and would leave maybe just enough money for the purchasing. I'm not sure if in the end a chip will come of it, but it's definitely a worthy start.
can't sleep. clowns will eat me.
Hmm,
Speech Recognition Chip ~ $1,000,000
Household Robot ~ $2,900
Not missing a single play of the game - Priceless.
"My friend and I were talking about this. In countries that are more totalitarian, it could be used to root out "dangerous people" www.geocities.com/James_Sager_PA"
But of course. Remember however we're a geek site and hence pro-tech even if it can be used to enslave people (Just wait till that forehead chip comes out. 2 GB of ram. Whoo Hoo!). To paraphrase "It's not the tech that enslaves people. It's the people who enslave people" It's just that tech makes it soo much easier.
As it is, it's a tossup whether I prefer speaking with a machine or a customer service rep in India. Won't take much for a machine to surpass most of them in English speech recognition. (Alright, to be fair, there are some Indians I've gotten on the phone who have been at LEAST as good as the typical US-based rep. But that's a minority.) Anything to advance the technology.
Eye use peach recon ingition proton now. Sea how wood it works? Eye love his sea check ignition pro gram. don't ewe tank hugh should met won?
All misspellings and grammatical errors in the above post are intentional and part of my artistic expression.
Read closely. The target of this research is not to improve speech technology itself, but rather to make current speech recognition technology run on devices with limited capacities (e.g. cell phones and PDAs).
To improve speech recognition technology itself you do not need more computing power, you need a totally new approach. The current approach, based on probabilities of word combinations and hidden Markov models, can work fine for narrow applications, like dictating medical reports that follow a certain model, but quickly reaches its limits in a totally open environment.
And we are talking here just about the speech to text aspect. Forget anything about the computer trying to understand the meaning, apart from simple command and control.
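To make the hidden Markov model part concrete, here is a small sketch of Viterbi decoding, the standard way to find the most likely word sequence behind a series of sounds. Everything in the toy model (the four words, the sound labels "ay" and "see", and every probability) is invented for illustration and is not taken from any real recognizer.

```python
# Toy hidden Markov model: hidden states are words, observations are sounds.
# All probabilities are made up for this example.
STATES = ["I", "eye", "see", "sea"]
START_P = {"I": 0.6, "eye": 0.2, "see": 0.1, "sea": 0.1}
TRANS_P = {
    "I":   {"I": 0.1, "eye": 0.1, "see": 0.7, "sea": 0.1},
    "eye": {"I": 0.1, "eye": 0.1, "see": 0.2, "sea": 0.6},
    "see": {"I": 0.25, "eye": 0.25, "see": 0.25, "sea": 0.25},
    "sea": {"I": 0.25, "eye": 0.25, "see": 0.25, "sea": 0.25},
}
# Homophones emit the same sound, so the sound alone cannot tell them apart.
EMIT_P = {
    "I":   {"ay": 1.0, "see": 0.0},
    "eye": {"ay": 1.0, "see": 0.0},
    "see": {"ay": 0.0, "see": 1.0},
    "sea": {"ay": 0.0, "see": 1.0},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at step t, predecessor)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Start from the best final state and follow predecessors backwards.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


print(viterbi(["ay", "see"], STATES, START_P, TRANS_P, EMIT_P))
```

In this toy setup the word-combination probabilities decide between "I see" and "eye sea", which is also why the approach struggles once the vocabulary and the range of possible sentences are wide open.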
I think this research does not seem very promising in the long run.
I think that applying only more computing power to audio recognition does not solve the underlying problem of the complexity of speech recognition.
For some people it is of course a nice-to-have gizmo to command their cell phone with short sentences, but this is really far from speech-to-text dictation and natural language interfaces.
I personally think that to achieve the "bigger goals" we need to concentrate more on context- and sentence-aware solutions. Speech recognition can at its best be highly educated guessing of arbitrary human tone sequences; that's what we humans do too.
In these more desirable goals, the audio analysis which this project concentrates on is only a very small (although vital) subset of the process, and thus optimizing it with hardware seems fairly insignificant to me.
That would be a big advantage even if you hear well but just have trouble concentrating for a prolonged time. If you let your thoughts wander off for a moment, you just read the last couple of lines of the log.
For teleconferences, this would also make it easy to participate in more conferences at once, like having several IRC windows open.
With an automatic translating system, it would help even with multi-language meetings (and, given the inherent features of machine translation, lead to many funny situations - maybe the translators should be aware of ambiguities and show all the possible meanings).
I am an assistant prof at a major research institution and $1,000,000 is not as much as you would imagine. Firstly, most universities take ~50% of grants immediately as overhead. You're down to 500K. Second, this is spread out over 4 - 5 years, so now you're down to about 125K a year. Third, if we have grants, we profs are required to pay our own summer salaries. On average this could be 25K, so you're down to 100K a year. In science and engineering we are expected to pay our grad students if we have grants. Yearly salary with additional overhead (in the US; Canada is a bit less) comes to almost 50K/year. A post-doctoral researcher would be hard to find for less than 50K/year with overhead. So really it supports a grad student and a post-doc and maybe some equipment for four years. Compared to the resources of industry it sometimes seems kind of puny. But the freedom is worth it. Just some info, OBQT
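The arithmetic above can be checked with a few lines of Python. This is only a sketch using the round figures from the post and assuming the low end of the 4 - 5 year range; the function name and parameters are hypothetical.

```python
def yearly_budget(total, overhead_rate, years, summer_salary):
    """Money left per year after university overhead and the PI's summer salary."""
    after_overhead = total * (1 - overhead_rate)  # 1,000,000 -> 500,000
    per_year = after_overhead / years             # 500,000 / 4 -> 125,000
    return per_year - summer_salary               # 125,000 - 25,000 -> 100,000


remaining = yearly_budget(1_000_000, 0.5, 4, 25_000)
cost_per_researcher = 50_000  # grad student or post-doc, with overhead
positions = int(remaining // cost_per_researcher)

print(remaining, positions)
```

Spread over 5 years instead of 4, the same calculation leaves 75K a year, which covers only one of the two positions.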
What I want to know are the architectural details! What are they doing in the silicon -- a basic artificial neural network? What DSP algorithms are they implementing? Are they recognizing at the phoneme level or at the word level? If these questions have been answered and they're just not publicized, that's great. If they have not been answered, we will not have much after 2 years of research. Maybe I should just contact the prof directly...
I dunno, I'll have to ask my computer.
Of course, using such a processor, OpenCyc would also be able to use the video camera at your front door to ID you as you approach, open the door for you, and say "You have 5 new voicemail messages, one from 555-6789, from someone who sounded like your mother. Her tone was urgent. Would you like to listen to this message first?"
I haven't even got to Google integration yet, but that was mainly added as a way to get people to read this ;) OpenCyc can already do independent Google searches and collate the results.
This should be about algorithms, not architecture. Anything they can do in silicon can and should be implemented and perfected in PC software first. I don't care if it takes a PC 10 minutes to recognize a 10-second sentence as long as it does it accurately. As soon as that happens, then by all means cut its power consumption and speed it up x1000 by doing it in silicon. If all they are doing is speeding up existing, relatively low-accuracy algorithms, then their effort is of limited use.
To be honest, I doubt that putting a few clever algorithms together will ever achieve any respectable accuracy, no matter how fast those algorithms are. Sure, it might accurately recognize words from a limited vocabulary when spoken clearly and/or in simple sentences. If this is their goal, then it is quite achievable. It sounds to me though that they are aiming much higher, as in "dictating a detailed email". I think that so many things have to happen, from effective noise filtering to proper phonetic model representation to parsing to content-based correction. The latter step is especially problematic since it requires a huge knowledge database which takes humans years to accumulate. I am not saying that these difficulties are insurmountable, but simply that their goals are too ambitious for the current state of our technology and knowledge. I'd love to be proven wrong on that account though.
"You mortals are so obtuse." -Q
Why are they talking about querying online databases for 911 calls as the national security app? It's obvious the national security app is to translate every single phone call to text and store them (indexed) in a classified database. I've attempted to believe the US wouldn't do this because it's illegal, but I can't manage to suspend disbelief. The only way to avoid this is if phone calls are encrypted and the US doesn't have the keys.
Just a million? Pfft! I went down the tubes with one S.R. startup back in '92 that ate far more of some VC's money than that. Now NSF is not in it to get rich and I hope I am right in assuming that a successful chip design, if a mere $1000000 gets that far, would then be available at no fee to any foundry, or at least US foundry. OK, any foundry that wants to sell S.R. chips to the DOD:( This lines up pretty well with IBM's recent give-away of its S.R. code: it is an admission that Speech Recognition is a commodity and nobody knows how to make any money with it so govt must fund further development. BTW, automated recognition of music [as in "what is this tune I keep humming?"] has been on the drawing board at Philips over in the Netherlands for over a year. Philips isn't saying much. But it appears you have to have a pretty accurate sample to get recognition since they want to arrest your piracy based on this recognition...no S.R. software worth its $1000000 is that fussy about sound quality.
SLASHDOT: news for people who can't concentrate on work or have no life at all and got tired of yelling back at the TV.
This is "News for Nerds". It has been discovered that you are not, in fact, a bona fide nerd. This has been shown in your above post, where you quoted, as examples of speech recognition use from Star Trek, the following text:
"Computer, lights!"
"Computer, make coffee!"
"Computer, Earl Grey, hot!"
The actual text should read "Tea, Earl Grey, Hot." With no "Computer" and with the added description of "tea". To my knowledge no one in any episode has requested "Computer, make coffee!" either.
Your Slashdot UID has been suspended until such time as you demonstrate competency with Star Trek references, and some minor Warcraft or Nethack experience.
The Management
But enhanced conversations between people and consumer products are not the main goal. Just imagine all those notebooks/PDAs/Tamagotchis etc. That'd be fun. As long as my mom doesn't ask them why I was home late, I'm safe.
Hivemind harvest in progress..
"the wine was finished, and the romantic meal was finished. Nick sensually kissed Candice as they slowly went upstairs. As he slowly pushed her back on the bed she moaned, 'arouse!' and the voice activated silicone...."
This will be great for adding speech recognition support to embedded devices and low-power computers. We'll have palm tops that allow us to speak into our date book like a secretary.
Considering that, if it were successful, it would change the life of every person on the face of the earth who interfaces with any electronics, for the rest of time, I would say 1M is worth it.
Now everything I say on a long distance call will be transcribed 1000 times better and/or faster?
That's it! I'm making a tin-foil hat for my phone!
(did I just forget to post AC?? oh damn!)
"Logic merely enables one to be wrong with authority." - Dr. Who
Besides, making a workable technology cheaper is a job for the private sector. Nokia, etc. should be funding this. If it was likely to work, they would be.
While we are on the topic of speech recognition hardware, here is a shameless plug for the Perception Processor that people might find interesting.
The last time I worked with vision recognition people was in the early 90s, but the two basic approaches to the problem were
- Conventional Algorithms using conventional CPUs or DSPs, so if you did any special chips they were just for running standard calculations faster, e.g. DSPs in parallel to implement your FFTs.
- Neural Networks, which really are a lot easier to wire up in chips than to emulate in software. You can build the things out of FPGAs and burn them into ASICs once your design is done, and then of course you've got to interface them to more conventional I/O channels.
Does their call for silicon imply they're doing neural nets? Or just that they want to do dumb SIMD number-crunching? Hard to say.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
Buy one of those computers and pipe in the audio from a blaxsploitation movie.
The Jive alone would cause it to go into meltdown.
Speech Recognition is worth so much. It provides the potential for a much faster (can you really type faster than you can talk?), less damaging (carpal tunnel, anyone?) form of data input. Now, mind reading would be nice, but then the virus potential there is way too dangerous.
-Tim Louden
Who with the what now?!?
Last time I checked, those subtitles were the result of a person typing like crazy (and on the local news, with terrible accuracy) on a keyboard while watching the show, not some speech recognition system. I've even seen what looks like the person typing the subtitles hit the backspace key a few times to correct a mistake. Sometimes they even give up and skip a sentence or two.
If $1M could create an automated system that can do significantly better, then it's definitely worth it. Remember, that's a one-time investment to create a technology, not a single device. How can you not see the potential for good in that?
Next you'll tell us Alexander Graham Bell should have stuck to teaching sign language instead of monkeying around with inventing the telephone.
The blind have the moving line of braille, but this technology could be used to translate, or even help clarify what the person actually said.
If all you have is a hammer, everything looks like a nail.
>Good use of $1 million?
Well, I could name some worse...
Coder's Stone: The programming language quick ref for iPad
Instead of talking to a hairy man named Bubba with a put on voice, dirty old men will be talking to a computer. Yay! No need to pay Bubba any more!
These posts express my own personal views, not those of my employer
Big brother is listening (and understanding). Nowadays it can be a worry when universities do the research, rather than a technology and entertainment company. I would not have thought that 1 million dollars would do much of anything with regards to silicon chip research, so is that the limit of the funding, or is there more undeclared funding going on in the background? Perhaps the general public might not like the source of the additional funding. The professionally paranoid love this kind of technology.
Chaos - everything, everywhere, everywhen