Neural Net Outperforms Human in Speech Recognition

Posted by Hemos on Friday October 1, 1999 @05:09AM from the excellent-new-devices dept.

orac2 writes "Here's a press release (with a real video clip) on a neural net that can recognise speech better than humans - even in noisy environments. The network uses just 11 neurons. They did it by incorporating an aspect of biological neural networks normally ignored by artificial networks: the timing of signals between neurons. Beyond the immediate application to speech recognition the wider implications for all neural networks are obvious." Neurons. Mmm.

All this and a counting horse by An El Haqq · 1999-10-01 01:30 · Score: 5

It's difficult to evaluate this system given the sparse amount of information available. I, for one, am incredibly skeptical at this point.

a) There is no statement of the train/test procedure for the neural net. It's fairly easy to get good performance if you test your system on the same dataset you trained it on. Without this information, you cannot make a reasonable judgement.

b) If you listen to the audio samples in the video at
http://www.usc.edu/ext-relations/news_service/real/real_video.html

You can notice a significant difference in the times of the samples (e.g. "stop" is shorter than "yes"). A fairly unsophisticated NN can pick up on the length of a sound sample and generalize from there. I didn't hear any statement saying that in the official training and testing all sound samples were of the same length.
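To make point (b) concrete, here is a minimal sketch of how a "recognizer" could separate two words using nothing but clip length. The durations are invented for illustration and are not measured from the USC video; `fit_threshold` and `classify` are hypothetical helper names, and nothing here says the USC network actually works this way.

```python
# Hypothetical clip durations in seconds (invented, not taken from the video).
TRAIN = [("stop", 0.31), ("stop", 0.29), ("yes", 0.45), ("yes", 0.48)]

def fit_threshold(samples):
    """Return (short_label, long_label, threshold) for a two-word vocabulary.

    The threshold is the midpoint between the mean durations of the two labels.
    """
    by_label = {}
    for label, duration in samples:
        by_label.setdefault(label, []).append(duration)
    means = {label: sum(ds) / len(ds) for label, ds in by_label.items()}
    short_label, long_label = sorted(means, key=means.get)
    threshold = (means[short_label] + means[long_label]) / 2
    return short_label, long_label, threshold

def classify(duration, model):
    """Label a clip by its length alone; the audio content is never inspected."""
    short_label, long_label, threshold = model
    return short_label if duration < threshold else long_label

model = fit_threshold(TRAIN)
print(classify(0.30, model))
print(classify(0.47, model))
```

A classifier like this would score well on such clips without hearing anything at all, which is why equal-length samples (or a stated control for length) matter.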

It's really a mess. If someone has a journal article or other piece of reliable information on this research, a pointer would be appreciated. Until then, I'll be feeding Clever Hans.
1. Re:All this and a counting horse by KFury · 1999-10-01 10:54 · Score: 3
  
  All very good points.
  
  What I noticed (and makes me wish they actually had a technical paper linked to the article to appease my methodological curiosity) is that the 'random background noise' was exactly the same for each word in a given round of testing.
  
  If they were training by those samples, the entire story is bogus because the pure, unmasked original word could be extrapolated by taking one sample, inverting the wave, and adding a second sample.
  
  To put it another way, the net wouldn't be learning how to interpret the word "no" or "fire" in a crowd. It would be learning how to understand that particular sound bite of cocktail party babble and be able to distinguish in what way the original cocktail party sound was modified.
  
  This is completely useless because you'll never have a need (or the opportunity) to have two (or four) different words masked over the exact same soundwave. The background noise will always be different from sample to sample in a real world test.
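The cancellation argument can be sketched in a few lines, assuming the samples are simple sums of word plus noise and that the babble really is identical in both. The sine "words" and the random babble below are invented stand-ins, not the actual test audio. Strictly, what survives the cancellation is the difference between the two words rather than one clean word, but the noise itself drops out completely.

```python
import math
import random

N = 200  # samples per clip (arbitrary choice for this sketch)
rng = random.Random(0)

# Identical background babble for both clips (the assumption under test).
babble = [rng.uniform(-1.0, 1.0) for _ in range(N)]

def tone(freq):
    """A quiet sine wave standing in for a spoken word."""
    return [0.2 * math.sin(2 * math.pi * freq * t / N) for t in range(N)]

word_a = tone(3)  # stand-in for "no"
word_b = tone(7)  # stand-in for "fire"

# Each test sample is the word buried under the same babble.
sample_a = [w + n for w, n in zip(word_a, babble)]
sample_b = [w + n for w, n in zip(word_b, babble)]

# Invert sample_a and add sample_b: the shared babble cancels.
residual = [b - a for a, b in zip(sample_a, sample_b)]
expected = [b - a for a, b in zip(word_a, word_b)]

max_error = max(abs(r - e) for r, e in zip(residual, expected))
print(max_error)  # only floating-point rounding error remains
```

With different babble in each sample the subtraction would leave noise behind, which is the real-world case.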
  
  --
  
  Kevin Fox
Patent? by TheKodiak · 1999-10-01 00:17 · Score: 3

It said they'll apply for a patent - I wonder how much the patent will cover. I really hope they don't manage to get a patent covering the use of temporal information in neural networks as a whole - ordinarily, I'd assume they wouldn't, but given some recent patents, I tend to worry.

--
-=Best Viewed Using [INLINE]=-
Oh great, just what the world needs. by Pont · 1999-10-01 00:18 · Score: 3

I know speech recognition seems cool and it will be very good for the disabled, but it's not a purely good thing.

Now, instead of requiring at least 2 people to invade your privacy and listen to everything you say, one supercomputer and a bunch of listening devices let The Man (tm) listen to thousands of people at once and scan the transcripts for keywords and sentences.
I get the impression by konstant · 1999-10-01 00:19 · Score: 5

I get the impression that this net did not perform better "even" under noisy conditions, but "only" under noisy conditions.

Here's the original link
http://www.usc.edu/ext-relations/news_service/releases/stories/36013.html

If I'm right about that, then this development (while still insanely cool - don't get me wrong) might not be so surprising. As I recall from college brain-and-mind psych courses, humans use a variety of factors when singling out a lone voice or conversation in a noisy environment. These include spatial orientation, visual cues, etc. My prof called this the "cocktail party effect". Rob them of these cues, and it isn't surprising that they are hobbled.

Also, computers have the mixed blessing of ignoring information patterns unless they are instructed to do otherwise. A person, listening to white noise, would subconsciously attempt to find meaning in every bleep and scratch. A computer, listening only for certain cues, can disregard the majority of the signal.

I would be interested in learning what rate of word recognition this system achieves. Current technology manages about 90%, which means one in every ten words is heard incorrectly. If they could improve that to 99.9% or even just 99%, we might actually get some speech-processors in Office desktop products.

-konstant

--
-konstant
Yes! We are all individuals! I'm not!
1. Re:I get the impression by methuseleh · 1999-10-01 04:16 · Score: 3
  
  I just like the way they raised the term "hubbub" to the level of technical terminology. They even quantified it!
  
  I can see it now:
  
  "Joe, I'm reading 14% hubbub coming over this line--can you try to reduce it to 5%?"
  
  Or even make it an actual unit of measure:
  
  "Man, the rating on that party must have been 23.6 Khb." (Kilohubbubs)
  
  Of course, that's assuming it'd be a metric measure. If it gets adopted here in the U.S. of A first, the above example might be 8 11/16 hb.
  
  We need more technological terms like this :)
  
  --
  Think Green... Burn only 100% recycled dinosaurs in your car.
Neural Networks -- a farce or fact? by SamBeckett · 1999-10-01 12:25 · Score: 3
I've recently done quite a bit of research on Neural Networks, including coding and simulating them by hand... There are some (quite drastic) flaws with neural networks...
I started my research doing a classic 5 pixel by 5 pixel OCR (optical character recognition) on the domain of digits on a single layer perceptron type network (similar to what these guys were using minus the delayed firing rate)
Not surprisingly, the training algorithm converged to an answer quite quickly and I proceeded to run tests with noisy data, to test the generalization of the network.
- 100 per cent correct at zero noise
- 50 per cent correct at twenty-five per cent noise
- 10 per cent correct at fifty per cent noise
- NEARLY zero percent correct above fifty.
This isn't shocking in itself until you realize that once you go above fifty percent distortion rates you are actually INVERTING the digit!
I retrained the network with inverted digits as well as the normal digits and re-ran the tests on the same set of data (note: the net WILL NOT converge on normal & inverted 5x5 digits with only ten cells). The correctness rate was only twenty per cent throughout the whole domain of noise levels.
I then retrained again using TWENTY cells (9 more than this article's) and it converged quite nicely and gave me a quadratic function with an R-Squared value of .9995 or so.
People view Neural networks sometimes as a fix-all solution.. The article on /. earlier about "evolutionary computing" is the same premise as neural networks: try stuff randomly (or using calculus) until we get a decent solution.
I'm sorry kiddos, but that just doesn't cut it. A neural network can't ever outperform a Turing machine so there can't be any chance in hell it will ever outperform us in non-specialized tasks.
Of course, I'd probably be more optimistic if these guys would have released their algorithms, papers, source-code, etc so we could actually figure out HOW the HELL they can get an 11 cell network to recognize speech...
The moral of the story? Understanding speech is a hell of a lot harder than recognizing ten digits!
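The point about distortion above fifty per cent inverting the digit can be checked with a small sketch. It assumes the noise is applied by flipping a fixed fraction of the 25 binary pixels, which the comment does not actually state; the bitmap and the helper names `flip_pixels` and `hamming` are made up for illustration.

```python
import random

def flip_pixels(image, fraction, rng):
    """Flip round(fraction * len(image)) distinct, randomly chosen pixels."""
    n_flip = round(fraction * len(image))
    flipped = set(rng.sample(range(len(image)), n_flip))
    return [1 - p if i in flipped else p for i, p in enumerate(image)]

def hamming(a, b):
    """Number of pixel positions where two bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 5x5 bitmap of the digit "1", flattened row by row.
DIGIT = [0, 0, 1, 0, 0,
         0, 1, 1, 0, 0,
         0, 0, 1, 0, 0,
         0, 0, 1, 0, 0,
         0, 1, 1, 1, 0]
INVERTED = [1 - p for p in DIGIT]

rng = random.Random(0)
for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    noisy = flip_pixels(DIGIT, fraction, rng)
    # Columns: noise level, distance to the digit, distance to its inverse.
    print(fraction, hamming(noisy, DIGIT), hamming(noisy, INVERTED))
```

Under this noise model every flipped pixel moves the image one step away from the digit and one step toward its inverse, so past the halfway mark the noisy image is closer to the inverted digit than to the original.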
Not Going to Change the World by mfterman · 1999-10-01 03:13 · Score: 3

Voice recognition wouldn't be of great use to me, at least at the desktop. I hate leaving prolonged voicemail messages because I can't go back and edit a previous sentence. I have to go and compose a speech if I want to sound intelligent and coherent.

Voice recognition only becomes useful to me if natural language parsing and enough cognition power are available for me to command my computer in plain English to a fair degree of abstraction.

In mobile computing, it might be a lot more useful, especially for a device, say the size of the Palm Pilot, where various factors make voice far more convenient and less difficult than other forms of input.

There are a lot of human use factors that complicate voice recognition (making the computer recognize when you want it to parse your speech and when you don't want it listening). Human interface issues often make these things less wonderful than they appear.

Not that I'm saying this isn't a wonderful development and there aren't people out there who could really use this (in specialized environments or people who have mechanical difficulties), but I don't think voice recognition is going to change the world the way some people think it will.
Um, this has already happened. Was: Oh great... by orac2 · 1999-10-01 00:46 · Score: 3

The US, through NATO, already monitors telecoms traffic, where speech recognition machines are programmed to listen for buzzwords like "plutonium" or "assassinate". Suspect conversations are then recorded for later perusal. This is not conspiracy theory; the program is called Echelon, and here's a recent CNN report. And that's not even considering that military technology is usually about five to twenty years ahead of everyone else, depending on the tech. (This is also why I sometimes preface trans-atlantic calls to friends with a string of probable buzzwords, just to waste some snoop's time.)

--
"Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
Article misses biggest (and scariest) use ... by ian stevens · 1999-10-01 00:48 · Score: 3

The article misses another interesting, albeit scary, use of this technology. If these could be made small enough and cheap enough, they could be placed in key locations across the country, forever listening in on passers-by.
Avoiding all the issues of privacy, consider the following scenario. The police want to arrest a suspect for some crime (drug trafficking, conspiracy, etc.) but have no proof and can't tap his phone lines since he encrypts all his phone conversations. Through some method, they train this speech-recognition device to the suspect's voice and either have someone with the device planted on them track the suspect or have an array of said devices placed in public areas where the suspect is known to hang out (bus terminals, bars, etc.). Sooner or later, the suspect might slip up and the authorities will have the evidence needed for an arrest.
Regarding privacy concerns, it seemed to me that this device could only track a handful of known voices ... probably requiring vast processing power to track every voice in a room. So it might be a while yet before everybody's conversations in bugged places get transcribed.
Damned cool technology, though.

--
ian
The real components. by Matt2000 · 1999-10-01 00:58 · Score: 4

If you read down near the bottom of the article, however, you will find this:

"The network was configured with just 11 artificial neurons, and in a sub-stage a live goat brain. The brain was activated through a patented process involving a castle and a lightning storm.

The researchers said one day they hoped that all humanity could benefit from the power of lightning.

Then they laughed kind of ominously."

Hotnutz.com
--
- Mod Parent Up by CmdrTaco (Score: 2) 02:41 PM April
Remember, it's only a few words... by Dandre · 1999-10-01 01:08 · Score: 4

In addition to the other good comments posted regarding taking this announcement with a grain of salt, I must add that the new system can only recognize a few words -- with only 11 neurons, it couldn't do much else. Without further information, I would guess that training up a net to recognize more words would be quite complicated -- especially given the non-standard training algorithms that were used. It would be great to find a scientific paper written by the researchers on the issue instead of solely press-release material. -dandre
worthless without peer review by jetson123 · 1999-10-02 04:16 · Score: 3

The claims are worthless without descriptions of the experimental procedures, peer review, and replication. There are already many ways in which pattern recognition systems and neural networks can greatly outperform humans, even in the presence of noise; that says nothing about whether it is a practical advance or not.
While the press release doesn't say much about neural networks or whether the state of the art in speech recognition has improved, it tells us something about a disregard by USC for standards of scientific conduct: scientific publication by press release is improper.