Neural Net Outperforms Human in Speech Recognition
orac2 writes "Here's a press release (with a real video clip) on a neural net that can recognise speech better than humans - even in noisy environments. The network uses just 11 neurons. They did it by incorporating an aspect of biological neural networks normally ignored by artificial networks; the timing of signals between neurons. Beyond the immediate application to speech recognition the wider implications for all neural networks are obvious. " Neurons. Mmm.
Actually, it looks as though this thing doesn't understand language better than humans (if at all). All it can do is pick out the sounds and form them into words. It still does not know what the words mean.
In essence, what was created is little more than a super-hearing-aid. Certainly a good thing for the hard-of-hearing (and this one, it would seem, could significantly boost the hearing of anyone, even those with "normal" hearing to start).
And how many still believe that Echelon is not capable of recognizing words in conversations automatically?
-
No, it's just preachy BS.
It's obvious you are just tripping. The creator of the first one that says "I'm Sorry Dave" will think he is Dave, or that he has been mistaken for someone who IS Dave.
As far as "glorifying themselfs" for mimicking what they already saw with the human brain, WTF are you talking about? Of course they're proud of themselves! They did something with a computer that nobody had ever been able to do before. That right there is fucking cool! I'm proud of them too, and I go to UCLA! (the researchers in question were from USC, in case anybody missed that)
--Mizerai
There is more than just a little Big Brother possibility to this. If this technology actually works as advertised, it eliminates the last technical barrier preventing governments from monitoring all voice communications all the time. Heretofore, this was not practically possible because of the manpower required to listen to millions of voice calls; this technology will make it possible to search for key phrases in real time as well as to archive millions of calls efficiently. The fact that it is apparently both cheap and simple only makes things worse.
Almost equally disturbing is the apparent ability of the Berger-Liaw system to distinguish individual voices from background noise, which raises the specter of governments being able to use almost unimaginably faint sounds to avoid more intrusive methods of bugging, and the monitoring of conversations in crowds. Combine that with existing off-the-shelf technology for face recognition...
Let's just say that I will be very surprised if the first customers for this technology aren't in Beijing and even more surprised if they aren't quickly followed by the dolts in Washington.
And hey, if I can reconstruct what you say inside your home from the weak sound waves that drift out into the street, that might not even require a warrant...
Proud member of the Weirdo-American community.
The net has to know what it is listening for inside of the noise before it can actually pick it out.
NO, IT DOES NOT!!
It all comes down to statistics. Speech is a non-white signal. Noise is white. If you have two microphones/ears, you simply search for the linear combination of the two signals that is the most un-correlated temporally, and voilà! You have found the speech signal. This is known as blind separation.
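For anyone who wants to poke at the idea, here is a minimal sketch, not the Berger-Liaw method. It assumes a contrived two-microphone mix (a sine wave standing in for speech, plus seeded white noise) and reads "search for the linear combination" as: keep the combination with the most temporal structure, measured by lag-1 autocorrelation. The names `lag1_autocorr` and `separate`, the mixing weights and the grid search are all made up for illustration.

```python
import math
import random

def lag1_autocorr(x):
    """Normalized lag-1 autocorrelation: near 1 for smooth signals, near 0 for white noise."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    num = sum(a * b for a, b in zip(centered, centered[1:]))
    den = sum(a * a for a in centered)
    return num / den

def separate(x1, x2, steps=180):
    """Grid-search the angle whose combination cos(t)*x1 + sin(t)*x2 is most structured in time."""
    best_angle, best_score = 0.0, -2.0
    for k in range(steps):
        theta = math.pi * k / steps
        y = [math.cos(theta) * a + math.sin(theta) * b for a, b in zip(x1, x2)]
        score = lag1_autocorr(y)
        if score > best_score:
            best_angle, best_score = theta, score
    return best_angle, best_score

# Assumed toy data: a sine wave as the "speech" source, seeded Gaussian white noise.
rng = random.Random(0)
n = 1000
source = [math.sin(2 * math.pi * t / 50) for t in range(n)]
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
mic1 = [s + w for s, w in zip(source, noise)]   # assumed mixture at microphone 1
mic2 = [s - w for s, w in zip(source, noise)]   # assumed mixture at microphone 2

angle, score = separate(mic1, mic2)
print(f"best angle = {math.degrees(angle):.0f} degrees, lag-1 autocorrelation = {score:.3f}")
```

With this particular mix the noise cancels when the two microphones are weighted equally, so the search should settle on 45 degrees. Real blind separation (ICA and friends) has to cope with unknown mixtures and non-white interference, which this toy does not.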
It's difficult to evaluate this system given the sparse amount of information available. I, for one, am incredibly skeptical at this point.
a) There is no statement of the train/test procedure for the neural net. It's fairly easy to get good performance if you're training your system on the same dataset that you test on. Without this information, you cannot make a reasonable judgement.
b) If you listen to the audio samples in the video at
http://www.usc.edu/ext-relations/news_service/real/real_video.html
you can notice a significant difference in the lengths of the samples (e.g. "stop" is shorter than "yes"). A fairly unsophisticated NN can pick up on the length of a sound sample and generalize from there. I didn't hear any statement saying that in the official training and testing all sound samples were of the same length.
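To make point b) concrete, here is a rough sketch of a "classifier" that never listens to the audio at all and only looks at clip length. The durations below are invented for illustration (nothing in the press release gives them), and `classify_by_length` is a hypothetical name.

```python
# Hypothetical clip durations in milliseconds; these numbers are NOT from the USC material.
TRAINING_DURATIONS = {
    "yes": [520, 540, 510],
    "stop": [300, 310, 290],
}

# One centroid (mean duration) per word.
CENTROIDS = {word: sum(d) / len(d) for word, d in TRAINING_DURATIONS.items()}

def classify_by_length(duration_ms):
    """Pick the word whose mean training duration is closest to this clip's duration."""
    return min(CENTROIDS, key=lambda word: abs(CENTROIDS[word] - duration_ms))

print(classify_by_length(305))  # a short clip
print(classify_by_length(500))  # a long clip
```

If the test clips differ in length the same way the training clips do, something this dumb scores perfectly even in heavy noise, which is why the train/test details matter.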
It's really a mess. If someone has a journal article or other piece of reliable information on this research, a pointer would be appreciated. Until then, I'll be feeding Clever Hans.
I'm more concerned that USC is trying to patent the "system and the architectural concepts on which it is based". As a computational biologist who uses neural nets in my work, I rely on the AI community to develop the underlying algorithms. If they get a patent on the algorithm and not just their hardware, that would severely limit the use of this breakthrough in other scientific areas.
JMC
It said they'll apply for a patent - I wonder how much the patent will cover. I really hope they don't manage to get a patent covering the use of temporal information in neural networks as a whole - ordinarily, I'd assume they wouldn't, but given some recent patents, I tend to worry.
-=Best Viewed Using [INLINE]=-
Nope. They have *NOT* done any such thing. They used an 11-node neural net for recognising 4 words... which is all well and good until you reach reality (10,000 words) and continuous (rather than the discrete stuff they were doing) language processing. In that environment your Pentium III, K7 or even Alpha isn't up to the task. Note that the 20 MILLION dollar electronics in the Eurofighter are so far the only platform for recognising language-independent speech with nearly 100% success rates in real time (what? you didn't know the Eurofighter has speech recognition? now you do).
I wonder what a neural net made of bogons, morons and vogons would be like?
----------
In a real emergency, we would have all fled in terror, and you would not have been notified.
I know speech recognition seems cool and it will be very good for the disabled, but it's not a purely good thing.
Now, instead of requiring at least 2 people to invade your privacy and listen to everything you say, one supercomputer and a bunch of listening devices let The Man (tm) listen to thousands of people at once and scan the transcripts for keywords and sentences.
I get the impression that this net did not perform better "even" under noisy conditions, but "only" under noisy conditions.
Here's the original link:
http://www.usc.edu/ext-relations/news_service/releases/stories/36013.html
If I'm right about that, then this development (while still insanely cool - don't get me wrong) might not be so surprising. As I recall from college brain-and-mind psych courses, humans use a variety of factors when singling out a lone voice or conversation in a noisy environment. These include spatial orientation, visual cues, etc. My prof called it the "cocktail party effect". Rob them of these cues, and it isn't surprising that they are hobbled.
Also, computers have the mixed blessing of ignoring information patterns unless they are instructed to do otherwise. A person, listening to white noise, would subconsciously attempt to find meaning in every bleep and scratch. A computer, listening only for certain cues, can disregard the majority of the signal.
I would be interested in learning what rate of word recognition this system achieves. Current technology manages about 90%, which means one in every ten words is heard incorrectly. If they could improve that to 99.9% or even just 99%, we might actually get some speech-processors in Office desktop products.
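A quick back-of-the-envelope on why the jump from 90% to 99% matters so much. This assumes word errors are independent (real recognizers make correlated errors, so treat it as a rough guide) and uses a 20-word sentence as an arbitrary example; `sentence_ok_probability` is just an illustrative name.

```python
def sentence_ok_probability(word_accuracy, words):
    """Chance that every word in a sentence is recognized, assuming independent errors."""
    return word_accuracy ** words

for accuracy in (0.90, 0.99, 0.999):
    p = sentence_ok_probability(accuracy, 20)
    print(f"{accuracy:.1%} per word -> {p:.1%} of 20-word sentences come out error-free")
```

At 90% per word only about one sentence in eight survives untouched; at 99% it is roughly four in five, and at 99.9% about 98%.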
-konstant
Yes! We are all individuals! I'm not!
I've recently done quite a bit of research on Neural Networks, including coding and simulating them by hand... There are some (quite drastic) flaws with neural networks...
I started my research doing a classic 5 pixel by 5 pixel OCR (optical character recognition) on the domain of digits on a single layer perceptron type network (similar to what these guys were using minus the delayed firing rate)
Not surprisingly, the training algorithm converged to an answer quite quickly and I proceeded to run tests with noisy data, to test the generalization of the network.
This isn't shocking in itself until you realize that once you go above fifty percent distortion rates you are actually INVERTING the digit!
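The inversion point is easy to check by counting pixels. The sketch below uses a made-up 5x5 bitmap (not the data from my tests) and flips a fixed number of pixels instead of flipping at random, so the result is deterministic; `flip`, `hamming` and `closer_to_inverse` are illustrative names.

```python
def flip(pixels, count):
    """Return a copy with the first `count` pixels flipped (deterministic stand-in for noise)."""
    return [1 - p if i < count else p for i, p in enumerate(pixels)]

def hamming(a, b):
    """Number of positions where two bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 5x5 bitmap of the digit "1", flattened row by row.
ONE = [0, 0, 1, 0, 0,
       0, 1, 1, 0, 0,
       0, 0, 1, 0, 0,
       0, 0, 1, 0, 0,
       0, 1, 1, 1, 0]
INVERTED = [1 - p for p in ONE]

def closer_to_inverse(count):
    """True if flipping `count` of the 25 pixels leaves the image nearer the inverted digit."""
    noisy = flip(ONE, count)
    return hamming(noisy, INVERTED) < hamming(noisy, ONE)

print(closer_to_inverse(12))  # 12 of 25 flipped: still nearer the original
print(closer_to_inverse(13))  # 13 of 25 flipped: now nearer the inverse
```

Flipping k of 25 pixels puts the image k away from the original and 25 - k away from the inverse, so anything past the halfway mark is literally a noisy copy of the negative.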
I retrained the network with inverted digits as well as the normal digits and re-ran the tests on the same set of data (note: The net WILL NOT converge on normal & inverted 5x5 digits with only ten cells).. The correctness rate was only twenty percent throughout the whole domain of noise levels.
I then retrained again using TWENTY cells (9 more than this article's) and it converged quite nicely and gave me a quadratic function with an R-Squared value of .9995 or so.
People view Neural networks sometimes as a fix-all solution.. The article on /. earlier about "evolutionary computing" is the same premise as neural networks : try stuff randomly (or using calculus) until we get a decent solution.
I'm sorry kiddos, but that just doesn't cut it. A neural network can't ever outperform a Turing machine so there can't be any chance in hell it will ever outperform us in non-specialized tasks.
Of course, I'd probably be more optimistic if these guys would have released their algorithms, papers, source-code, etc so we could actually figure out HOW the HELL they can get an 11 cell network to recognize speech...
The moral of the story? Understanding speech is a hell of a lot harder than recognizing ten digits!
Yeah the number of phonemes used in most languages is in the 'few dozen' range. And you generally don't have to listen very long to hear them all at least once.
But even when you've got the phonemes, you've still got a fair amount of work cut out for you. A number of phonological processes take place. For instance, in 'in plain sight' the 'in' may be pronounced 'im'. These kinds of transformations (and more complicated ones) are happening all over the place, in every spoken language.
Linguists generally describe this kind of thing by writing context-sensitive rules to enumerate the transformations. Similar syntactic translations are context-sensitive.
The syntax of computer programming languages (er, not counting types and identifier agreement, which are special-cased) is typically not even a generic context-free language, but almost always falls in the LL(1) or LR(1) subsets, meaning that you can determine what's going on just by looking ahead one token. Otherwise you end up with N^3 parsing time, and that's for context-free languages. Parsing context-sensitive languages is way more problematic (recognition is PSPACE-complete in general).
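To show what "looking ahead one token" buys you, here is a small recursive-descent sketch for a toy arithmetic grammar (integers, +, * and parentheses). The grammar and the names `tokenize`, `Parser` and `evaluate` are my own illustration, not taken from any real compiler. Every decision is made by peeking at exactly one token, and the parse runs in linear time.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[()+*])")

def tokenize(text):
    """Split the input into number and operator tokens."""
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        """The single token of lookahead; None at end of input."""
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def expr(self):    # expr -> term ('+' term)*
        value = self.term()
        while self.peek() == "+":
            self.take("+")
            value += self.term()
        return value

    def term(self):    # term -> factor ('*' factor)*
        value = self.factor()
        while self.peek() == "*":
            self.take("*")
            value *= self.factor()
        return value

    def factor(self):  # factor -> NUMBER | '(' expr ')'
        if self.peek() == "(":
            self.take("(")
            value = self.expr()
            self.take(")")
            return value
        tok = self.take()
        if not tok.isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return int(tok)

def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value

print(evaluate("2 + 3 * (4 + 1)"))
```

Natural language gives you no such luxury: the right reading of a word can depend on material arbitrarily far away, which is where the expensive general-purpose parsers come in.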
Unless you can parse the syntax, you can't really resolve ambiguities (to/two/too, there/they're, or even things which merge because of phonology (bitter/bidder/bit her)). Note that humans don't always do so great with these issues either, so a partial solution will still be quite amazing.
But the fact still stands that turning samples into phonemes is only the first step in a very complicated process towards even something as simple as taking dictation. In fact, I'd say that syntax->semantics may be a smaller step than phonemes->syntax.
Trees can't go dancing
So do them a big favor
Pretend dancing stinks!
Pulsed Neural Networks. It's really not such a new technology. There's a good book on different topologies and algorithms titled,
"Pulsed Neural Networks". I know Amazon has a copy (that's where I got mine a few months back).
Yet another comment from the conspiracy to make it look like there is only one conspiracy
Honestly, do we have anything to fear from the technology as it is now? No, of course not. However, you have to expect plenty of fear on the part of people from /.; just look at the stories on geek profiling (the Katz stories). The Government IS out to get us, they admit it after all, and what is this caused by? Paranoia on the part of people in power. It's a dramatic irony, of sorts. But the light in the darkness and the shadow from the sun are mandatory for everything in life.
This mass paranoia against governments isn't bred because someone reads Fahrenheit 451 and says "shock!" (although it probably does happen in SMALL quantities). It's because we see it in our government today. We see corruption, and special interests, and all sorts of scary, scary things, in government TODAY. The fact that this could be used to track all of the recordings a person ever made is scary.
Is it a long way off? Sure. Can you blame them for being overprotective of their rights? No, of course not.
Nothing personal but I don't see how you can mock or make fun of anyone for holding these fears.
-[ World domination - rains.net ]-
Of course, I could be mistaken, and that drawing is really a graphical representation of the most sophisticated neural net ever made. *g*
--
Voice recognition wouldn't be of great use to me, at least at the desktop. I hate leaving prolonged voicemail messages because I can't go back and edit a previous sentence. I have to go and compose a speech if I want to sound intelligent and coherent.
Voice recognition only becomes useful to me if natural language parsing and enough cognition power are available for me to command my computer in plain english to a fair degree of abstraction.
In mobile computing, it might be a lot more useful, especially for a device, say the size of the Palm Pilot, where various factors make voice far more convenient and less difficult than other forms of input.
There are a lot of human use factors that complicate voice recognition (making the computer recognize when you want it to parse your speech and when you don't want it listening). Human interface issues often make these things less wonderful than they appear.
Not that I'm saying this isn't a wonderful development and there aren't people out there who could really use this (in specialized environments or people who have mechanical difficulties), but I don't think voice recognition is going to change the world the way some people think it will.
Use more neural nets.
Some people are saying that you can't make a really big neural net efficiently (at least without specialized hardware), but I don't see why you couldn't have hundreds of separate neural nets each reporting on whether one word was said.
A very tiny, very simple computer could handle the task of managing a few neural nets. You could make it out of a few thousand surface features on a chip, so you could pack thousands of these processors on a chip. For that matter, they probably don't need to be terribly fast, so you could make them like memory chips. Imagine a megabyte chip, but instead of 1024K dumb memory, with 1024 minimal neural processors, each with 512 bytes of RAM.
Broadcasting the incoming data is pretty simple, and I don't think the networking issues of one or two of these processors reporting every few seconds would be too severe.
Training wouldn't be all that hard, either. You need a few man-years of samples, but the training could be done in parallel. It would cost a few million dollars (unless there was a dedicated online effort, which is entirely possible), but not billions. Imagine going down to the mall and asking people if they would read a few hundred words for $20; no problem, just repeat it all over the place so it deals well with accents.
There has never been a task better suited to massive parallel processing.
Oh yeah, I suppose I have to say: hey, we can do it with a Beowulf cluster, |)00|)Z!
To all of you naysayers out there who think this system has no real-world use because it can only understand a handful of words...Do you so easily forget the lesson of the computer? You only need two states to transmit information. If we merely learn to speak in binary (On On On Off Off On) the problem is solved and we have achieved practically perfect speech recognition. Narrow minded fools!!!
Are your research materials online?
I like following the progress of projects around the world --- I was in academia myself a decade ago, in a department where colleagues who were working with NNs would discuss their processing requirements and architectures with me. The work you describe sounds interesting.
"The question of whether machines can think is no more interesting than [] whether submarines can swim" - Dijkstra
I am not a Ph.D in this field, but I do have my Master's degree in Speech Science. While I have taken a break from Speech Science for about 2 years to learn C++ enough to start working in computer speech recognition/perception/production I'm still fairly up on Speech research. That caveat out of the way, let me tell you my thoughts.
While you say there are only a few dozen phonemes in most languages, what you are missing is the fact that each phoneme is context sensitive. So if I say "See" and "Sue", the 's' sound in each morpheme is spectrally quite different. They are both the /s/ phoneme, but the one in /si/ ("See") has a spectrum much higher than (well, in speech terms, I think about ~1 kHz) /su/ ("Sue"). Phonemes are not discrete things, they are gradients or classes. So you are simplifying things far too much when you suggest that morphemes are just combinations of a few dozen phonemes.
Really, if you think about it, humans do not learn to understand words by rote memorization of the acoustic properties of each word. That would be far, far too inefficient. Think about the fact that you could still understand someone's voice, even if they inhaled helium. That skews the spectral/acoustic properties of the person's voice into a very high frequency range compared to their normal voice. Also, if you tried to listen to non-native speakers who are missing phonemes or substituting phonemes, how could you possibly understand them? What you do is you figure out the missing or corrupted phonemes from the context of the morpheme. Some research supports the addition of other, extraneous acoustic information (such as the spectral shift of /s/ in /si/ vs. /su/) as one thing that can cue a listener into what phoneme follows it. In that particular set of studies, people were able to identify the morphemes (/si/, /su/, etc.) by only hearing the initial /s/. That is, the vowel was cut off from the morpheme, yet people were able to (with something like 90% accuracy) complete the morpheme.
There is an awful lot that speech research has not yet uncovered. One of the problems that I see in the field of computer speech recognition/perception/production is the lack of solid speech research and implementing the trickier research into these projects. Training neurons to recognize individual morphemes doesn't work. It's like brute force calculation of chess; the system is too complex to tackle with such a simple model. It's just too damned inefficient.
Besides, homophones will always be a problem with speech research, until language makes an appearance. How many times do you want to have to correct "their", "there" and "they're" in a document?
---------The early bird gets the worm, but the second mouse gets the cheese.
"Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
I am very excited by the possibilities of this technology. Just imagine it: a really good speech recognition system coupled with a really good natural language analyzer coupled with a good speech generator. What do you get? The comm/computer system from Star Trek:TNG. Hell, there's probably enough "Computer Voice" samples of Majel Barrett to at least give the speech generation software a good starting place.
Who needs a Palm Pilot when you can walk down the hall hands-free as the computer briefs you on your next meeting, or lets you read and compose your mail on the way to work? My hands shake.
My only concern: the people who design this system would need to include Star Trekish terminology and attitude in the list of things the computer could do. Example:
--
"Computer, please replay voice mail message 9 starting at time index 0-mark-9-5."
[computer chimes, message plays]
"Computer, message 9 sounds garbled. Run a level-three diagnostic on message integrity."
[pause]
"Diagnostic complete. Message shows signs of type-1 file corruption."
"Damn it!"
"Error: cannot comply with that directive"
--
If we could just get that far, then I'd be happy. Actually, no that's wrong. If we could just get that far, then invent warp drive, replicators, transporters, inertial dampeners, and holodecks, *then* I'd be happy.
Ross
The article misses another interesting, albeit scary, use of this technology. If these could be made small enough and cheap enough, they could be placed in key locations across the country, forever listening in on passers-by.
Avoiding all the issues of privacy, consider the following scenario. The police want to arrest a suspect for some crime (drug trafficking, conspiracy, etc.) but have no proof and can't tap his phone lines since he encrypts all his phone conversations. Through some method, they train this speech-recognition device to the suspect's voice and either have someone with the device planted on them track the suspect or have an array of said devices placed in public areas where the suspect is known to hang out (bus terminals, bars, etc.). Sooner or later, the suspect might slip up and the authorities have enough evidence needed for an arrest.
Regarding privacy concerns, it seemed to me that this device could only track a handful of known voices ... probably requiring vast processing power to track every voice in a room. So it might be a while yet before everybody's conversations in bugged places get transcribed.
Damned cool technology, though.
ian
Inasmuch as ringworm is actually a fungus without a nervous system, I'm a little perplexed by that claim. Tapeworm, maybe? Planaria?
Ringworm (tinea) is a fungus that covers the skin, causing discomfort, itching, and leaving an unsightly rash. Microsoft has managed to reproduce this behavior in software without using neural net technology at all.
I have to wonder: one of the major bases for the success of neural networks is that they are trained, rather than programmed in the traditional sense. This works fine while you're researching and developing a singular system. But how do you mass-produce these systems? You can't just apply the same code across millions of them. Will there be classrooms filled with little computers learning how to be computers? What happens if one becomes a bully? What if one can't do math? And will there be trauma counselors on hand should one Blue Screen?
Dear Sir/Madam
I am writing to inform you that your network failed to show up for English Class today. We cannot stress enough how important regular attendance is in achieving a proper education.
Please attend to this matter as this is its fourth missed class.
Thank you,
011100110
Principal - School of Advanced Network Training
"They do not preach that their god will rouse them, a little before the Nuts work loose." Kipling, 'The Sons of Martha'
Although this article is impressive, realize that the ability to pick out words is entirely different from the ability to understand words, to use words. I would bet that a 2 year old baby still has better comprehension and understanding of ideas expressed by spoken words than this neural net does. Think of the way our language evolves, all the slight variations in tone and in gesture (sarcasm anyone?), regional dialects (it's like butta) and all the double meanings of words (cleave). Mind you this stuff is pretty neat, but we have a long way to go before we can have conversations with our computers. Even then, I would rather talk to a two year old; I'm sure they hold the secrets of NP math in their little brains, they just forget it all during their Power Rangers phase.
If you read down near the bottom of the article, however, you will find this:
"The network was configured with just 11 artificial neurons, and in a sub-stage a live goat brain. The brain was activated through a patented process involving a castle and a lightning storm.
The researchers said one day they hoped that all humanity could benefit from the power of lightning.
Then they laughed kind of ominously."
Hotnutz.com
Terminologies, dialects, genders and whatnot would (will) be user-definable, much like WinAmp skins or QuakeWorld skins. You'll have endless variations of the Star Trek Theme (including the charming and original fembot monotone from the original series), the Gangsta Theme, the Sesame Street Big Bird Theme and, of course, my personal favorite, the Wicked British Nanny Theme.
"You have 3 tasks left incompleted on your to-do list, you Naughty little boy! This calls for a vigorous spanking!"
(whipcrack) GrrrrrrOWl!
**>>BELCH
Come on! This is _the_ coolest piece of technology I have ever seen. Yes, there is the "big brother" possibility, but we shouldn't discourage a technology solely on that merit. Think of what this could do for deaf people! A pair of glasses that gives a text overlay of every (or certain) conversations in the room. Think how cool it would be to have your MP3 library hooked up to a voice recognition system (yes.. à la Trek). From what I understand, this system could still hear your requests even when you had your music blasting. Talk about simplifying computer interfaces. Forget all this GUI crap!
In addition to the other good comments posted regarding taking this announcement with a grain of salt, I must add that the new system can only recognize a few words -- with only 11 neurons, it couldn't do much else. Without further information, I would guess that training up a net to recognize more words would be quite complicated -- especially given the non-standard training algorithms that were used. It would be great to find a scientific paper written by the researchers on the issue instead of solely press-release material. -dandre
Yes, this passage about just 11 neurons connected by a mere 30 links makes me wonder what this net actually does. "Speech Recognition" could of course also mean the ability to recognize that an audio signal contains speech :-)
Of course the task of the net could also be to separate the noise signal from the speech, aka blind separation, a problem that has been solved before (for instance by independent component analysis).
If this is merely ICA with a time coded neural net, it is IMHO still pretty cool, and much more impressive than all those commercial systems that rely on dumb correlation and processing power.
Anyway, instead of just having me guessing, could someone please point to their paper?
On the other hand, this could be a great leap for neural networks in general. Realizing that the timing of synapse signals is a critical factor in neuron firing is going to shake up some things in AI. (At least, I was never familiar with neural networks that used timing cues. If I am wrong, please let me know.) Of course in a large neural network, you're going to have lots of propagation latencies as signals bounce around the net, and it makes sense that even more important than which neurons fire is when neurons fire. It actually seems to justify the complexity of neural nets because the timing data can represent a much larger data/search space than the simple fire/dormant state of each neuron.
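A rough count shows how much bigger the space gets. This assumes a deliberately crude model that is mine, not the researchers': each neuron either stays silent or fires once in one of a fixed number of time slots. The 11 neurons come from the press release; the 10 time slots are an arbitrary illustrative choice.

```python
def fire_or_dormant_states(neurons):
    """Each neuron is simply on or off."""
    return 2 ** neurons

def timing_states(neurons, time_slots):
    """Each neuron is silent or fires in exactly one of `time_slots` slots (assumed model)."""
    return (time_slots + 1) ** neurons

print(fire_or_dormant_states(11))   # 2048
print(timing_states(11, 10))        # 285311670611
```

So even with coarse timing, 11 neurons go from about two thousand distinguishable patterns to a few hundred billion.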
This could be exciting.
My Freakin Blog
While the press release doesn't say much about neural networks or whether the state of the art in speech recognition has improved, it tells us something about a disregard by USC for standards of scientific conduct: scientific publication by press release is improper.
If you have two hypotheses e.g. A and B, corresponding to 'two words' which were said, then it is easy to build systems which can recognize signals corresponding to A and those corresponding to B embedded in lots of noise. Basically you measure the likelihood ratio p(B)/p(A) using some sort of estimators that you've trained to light up with either A or B. If you gave me the data, I could do this with a number of different semi-conventional numerical techniques on a digital computer. I've seen similar things presented at conferences a few years ago---recognition of specific chaotic waveforms (specifically dolphin and whale song) embedded in lots of noise.
This is known as a "simple hypothesis test".
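Here is roughly what such a two-way test looks like, as a sketch under strong assumptions: the two "words" are known templates (a sine and a cosine, chosen only for illustration), the noise is white Gaussian with known standard deviation, and the noise power is eight times the signal power. None of this is the USC system; `log_likelihood_ratio` and `classify` are made-up names.

```python
import math
import random

N = 1000
SIGMA = 2.0  # assumed noise standard deviation (noise power 4 vs. signal power 0.5)
A = [math.sin(2 * math.pi * t / 50) for t in range(N)]  # hypothetical template for word A
B = [math.cos(2 * math.pi * t / 50) for t in range(N)]  # hypothetical template for word B

def log_likelihood_ratio(x, a=A, b=B, sigma=SIGMA):
    """log p(x|B) - log p(x|A) for known templates in white Gaussian noise."""
    err_a = sum((xi - ai) ** 2 for xi, ai in zip(x, a))
    err_b = sum((xi - bi) ** 2 for xi, bi in zip(x, b))
    return (err_a - err_b) / (2 * sigma ** 2)

def classify(x):
    """Decide B when the likelihood ratio p(B)/p(A) exceeds 1, otherwise A."""
    return "B" if log_likelihood_ratio(x) > 0 else "A"

rng = random.Random(1)
noisy_a = [ai + rng.gauss(0.0, SIGMA) for ai in A]
noisy_b = [bi + rng.gauss(0.0, SIGMA) for bi in B]
print(classify(noisy_a), classify(noisy_b))
```

Each individual sample is buried in noise, but summing the evidence over a thousand samples makes the two-way decision almost trivially reliable, which is why a small-vocabulary result by itself says little.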
The more general circumstance, however, is that the alternative is not A vs B, but A vs a huge multitude of other possibilities. This task is much more difficult, and corresponds to the actual large-vocabulary speech recognition task. Now it becomes much more difficult to set a reliable threshold which will come on only when A is actually present, and not when A is absent. There is a tradeoff of false negative and false positive errors depending on your choice of threshold.
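The tradeoff can be put in numbers with a toy model. Assume (my assumption, purely for illustration) that a detector's score is Gaussian with unit variance, centered at 0 when A is absent and at 2 when A is present; `error_rates` is a hypothetical helper.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def error_rates(threshold, separation=2.0):
    """False-positive and false-negative rates for a unit-variance Gaussian detector."""
    false_positive = 1 - normal_cdf(threshold)            # fires although A is absent
    false_negative = normal_cdf(threshold - separation)   # stays quiet although A is present
    return false_positive, false_negative

for threshold in (0.0, 1.0, 2.0):
    fp, fn = error_rates(threshold)
    print(f"threshold {threshold:.1f}: false positives {fp:.1%}, false negatives {fn:.1%}")
```

Raising the threshold trades false alarms for misses and vice versa; with thousands of competing words, even a small per-word false-alarm rate adds up quickly.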
There is no possible way that this thing can recognize 50,000 words. There are only 30 connections; there is fundamentally not enough information processing power in there.
What you would do is to have all sorts of these subunits lighting up their own 'word finder lights', and the result of *those* (i.e. the p(A) detectors) would then be inputs into higher level semantic networks of perhaps a similar type. These networks or hidden markov models or whatever are the ones that know which sorts of words follow other sorts of words, and thus let you get better recognition than the individual word finders themselves.
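A toy version of that two-level arrangement, with every number invented for illustration: the low-level detectors give a probability per word for each audio segment, and a bigram table (a stand-in for the "higher level" network or hidden Markov model) rescored the sequence. The brute-force search over all sequences is only workable for tiny examples; a real system would use something like Viterbi decoding.

```python
import math
from itertools import product

# Hypothetical detector outputs p(word | segment) for two consecutive audio segments.
DETECTOR = [
    {"their": 0.34, "there": 0.33, "they're": 0.33},
    {"going": 0.9, "house": 0.1},
]

# Hypothetical bigram probabilities p(next word | previous word).
BIGRAM = {
    ("their", "going"): 0.05, ("their", "house"): 0.95,
    ("there", "going"): 0.20, ("there", "house"): 0.80,
    ("they're", "going"): 0.90, ("they're", "house"): 0.10,
}

def best_sequence(detector=DETECTOR, bigram=BIGRAM):
    """Exhaustively pick the word sequence maximizing detector evidence times bigram prior."""
    best, best_score = None, -math.inf
    for words in product(*(segment.keys() for segment in detector)):
        score = sum(math.log(segment[w]) for segment, w in zip(detector, words))
        score += sum(math.log(bigram[pair]) for pair in zip(words, words[1:]))
        if score > best_score:
            best, best_score = words, score
    return best

acoustic_only = tuple(max(segment, key=segment.get) for segment in DETECTOR)
print("detectors alone:", acoustic_only)
print("with word-order knowledge:", best_sequence())
```

The detectors alone cannot tell the homophones apart and pick "their" by a hair; knowing which words tend to follow which tips the decision to "they're going".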
So, what is the accomplishment of this paper??
That they've apparently found an extremely efficient and well-performing low-level subunit using this time-domain information. From our own experimental observations (not on speech but on real live neurons from recently-living animals) this is very important. The fact that it is only 30 connections might mean that it is quite feasible to put 10 or 20 thousand of these subunits on a single chip, running in hardware. Given the factor-of-a-thousand speed increase of electronics over neurons, if you could time-division multiplex different recognizers (blue sky dreaming here!) you could have that many more of them during the milliseconds to seconds of audio-frequency processing time that we speak at.
If you notice, Professor Berger said that no other speaker-independent system outperformed humans, even in small test bases. Presumably that means in the small Bayesian post-hoc sorts of likelihood test regimes that I described before. And in addition, it appears that this is not a simulation but that they built it on an actual physical computer chip, another very substantial advance.
My colleagues are going to ask the authors for the actual paper. The title and press release may be overblown, but this smells like real science and a significant advance here to me.
Take home message: even small groups of good neurons can do interesting and useful things. With the right architecture, a small group of neurons can outperform conventional "neuroid networks" of hundreds or thousands of nodes linked by linear transformations of sigmoidal basis functions. We may just be beginning to crack real-AI.
We see major body functions of lower animals being regulated by say ten neurons. Real neurons are much smarter than you think. :)
If small groups of neurons can do this, it makes you appreciate what a hundred billion might be able to do.
In the United States, at least, patents can be snatched up by the military and made Top Secret.
This allows the military to wait until some bright young entrepreneur comes up with a great solution, then they swoop down and tell the poor sap he can't talk about his patent for 10-15 years, and next thing you know the military comes out with some really cool speech recognition device.
So while there are brilliant people outside of The Man's Territory, their ideas can be and are stolen, and no one can talk about it.
I can think of better ways for the world to work...
-- I can't think of anything witty to put here. Sorry.