Rest In Peas — the Death of Speech Recognition

Posted by Soulskill on Monday May 3, 2010 @09:07AM from the yale-in-ox-boom-i-crows-off dept.

An anonymous reader writes "Speech recognition accuracy flatlined years ago. It works great for small vocabularies on your cell phone, but basically, computers still can't understand language. Prospects for AI are dimmed, and we seem to need AI for computers to make progress in this area. Time to rewrite the story of the future. From the article: 'The language universe is large, Google's trillion words a mere scrawl on its surface. One estimate puts the number of possible sentences at 10^570. Through constant talking and writing, more of the possibilities of language enter into our possession. But plenty of unanticipated combinations remain, which force speech recognizers into risky guesses. Even where data are lush, picking what's most likely can be a mistake because meaning often pools in a key word or two. Recognition systems, by going with the "best" bet, are prone to interpret the meaning-rich terms as more common but similar-sounding words, draining sense from the sentence.'"

Buffalo buffalo by Anonymous Coward · 2010-05-03 09:11 · Score: 5, Insightful

Buffalo buffalo Buffalo buffalo buffalo, buffalo Buffalo buffalo.
1. Re:Buffalo buffalo by CecilPL · 2010-05-03 09:24 · Score: 5, Funny
  
  That comma is just out of place and makes the sentence hard to parse.
2. Re:Buffalo buffalo by hoggoth · 2010-05-03 09:37 · Score: 5, Informative
  
  Buffalo bison whom other Buffalo bison bully, themselves bully Buffalo bison.
  
  --
  - For the complete works of Shakespeare: cat /dev/random (may take some time)
3. Re:Buffalo buffalo by Anonymous Coward · 2010-05-03 09:38 · Score: 5, Informative
  
  For those that don't know:
  http://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo
  'Buffalo bison whom other Buffalo bison bully, themselves bully Buffalo bison'.
4. Re:Buffalo buffalo by JanneM · 2010-05-03 11:58 · Score: 5, Insightful
  
  Most people won't be able to parse the sentence, though. I know I can't. I have no idea how to interpret it as anything but a string of nouns. My guess is, even fewer would be able to parse it if spoken (the capitals and the comma are, I assume, important hints). It'd be unrealistic and unproductive to require speech systems to actually do better than most humans on the task; if many of us can't parse the sentence then why expect a computer to do so?
  Better overall benchmark: require it to have the ability of a competent but not perfect second-language user. We're long used to dealing with that level of proficiency, whether because the conversant is a foreigner, a child, or has a dialect very different from our own.
  
  --
  Trust the Computer. The Computer is your friend.
Android Speech Recognition Rules by bit+trollent · 2010-05-03 09:15 · Score: 5, Informative

I hardly type anything into my HTC Incredible. Google's voice recognition, which is enabled on every textbox, works just about perfectly.
Seriously, get an Android phone, try out the speech recognition text entry, and then tell me speech recognition is dead.
1. Re:Android Speech Recognition Rules by orangesquid · 2010-05-03 11:09 · Score: 5, Funny
  
  What Dave said: "Open the pod bay doors, HAL."
  What HAL heard: "Open the hot babe pornz, HAL."
  HAL's speech recognition and morality programming* combined to give the famous reply, "I'm sorry, Dave. I'm afraid I can't do that." HAL knew certain things would have been too titillating to an all-ages film audience in 1968.
  * Only for the film version. In the book version, it would have caused undue frustration to the reader, unable to see what Bowman was viewing. In that case, it was HAL's etiquette programming.
  
  --
  --TheOrangeSquid Is it any wonder things seem so awry? We swim in a sea of confusion and don't have to think to survive
AI by ShadowRangerRIT · 2010-05-03 09:17 · Score: 5, Insightful

Natural language processing *is* AI. And high-accuracy speech recognition requires natural language processing if we expect accuracy rates approaching those of a human. Humans hear words partially or incorrectly all the time. We fill in the gaps from context, and we correct if the course of the conversation reveals that the original interpretation is wrong. Expecting computers to do better, when half the time the problem is the speaker, not the listener, means you need them to make the same corrections a human brain makes from limited information, both on the fly and after the fact.

--
$_ = "wftedskaebjgdpjgidbsmnjgcdwatb"; tr/a-z/oh, turtleneck Phrase Jar!/; print
That's Because... by BJ_Covert_Action · 2010-05-03 09:17 · Score: 5, Funny

It only flatlined because nobody tried to write speech recognition software in perl*.

*Disclaimer: Poster is not responsible for attempts resulting in unintended AI development and/or end of the world scenarios brought on by such an irresponsible endeavor.

--
Motorcycles, Robots, Space Gossip and More!
Re:Well duh. by Chris+Burke · 2010-05-03 09:28 · Score: 5, Funny

That misheard lyric is so common that there's a book about misheard lyrics with that as the title.
I know! A surprising number of people think Hendrix was talking about kissing the sky, rather than embracing the experimental, counter-culture, and free-love nature of the '60s, simply because they don't like to think of their testosterone-filled hero sucking face with another dude. Like, get over it! "Kiss the sky" doesn't even make any sense unless you're on some kind of mind-altering substance, and there's no way Jimi would have put something like that in his body!

--

The enemies of Democracy are
Tea, Earl Grey, Hot by tokki · 2010-05-03 09:33 · Score: 5, Funny

How hard is it for a computer to understand the sentence "Tea, Earl Grey, Hot"? That takes care of 90% of the use cases right there. "Computer, initiate auto-destruct sequence" is the next 8%.
Shout-outs to two idiots by Foobar_ · 2010-05-03 09:33 · Score: 5, Insightful

This blog post is retarded. The author is correlating a drop in internet news articles about Dragon NaturallySpeaking with a flatlining of speech recognition accuracy rate.
The Slashdot editor Soulskill is retarded for both not realizing this and for not reading the anonymously-submitted blog post (hmm no way it could have been the author) before approving it for the Slashdot front page. The guy is just out for more traffic to his rather pointless tech news commentary blog.
Decline of Slashdot, internet signal-to-noise ratio, get off my lawn, etc.
Re:Badger badgers badger Badger badgers by Anonymous Coward · 2010-05-03 09:47 · Score: 5, Funny

snaaaaaaake!
Re:IBM? by N1ck0 · 2010-05-03 09:55 · Score: 5, Interesting

IBM closed many of their speech research offices 1-2 years ago and transferred most of the research/data to Nuance's Dragon NaturallySpeaking research.
Full Disclosure: I work for Nuance
Re:Mod parent up by brian_tanner · 2010-05-03 10:16 · Score: 5, Interesting

I think you're probably about 10-20 years out of date with your criticism. AI these days is *all about* statistical machine learning which is *all about* data and not about formal or expert systems at all. This is what Google and others are doing. The AI you are describing is from the late 80s and early 90s.

Neural networks are part of the story, but many of the ideas from ANNs have been improved upon when more structured settings are available. There is actually a resurgence in deep neural networks right now, though.
Re:Mod parent up by Known+Nutter · 2010-05-03 11:19 · Score: 5, Informative

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

--
Beware of the Leopard.
Speech recognition tech is broken in many ways by Theovon · 2010-05-03 12:33 · Score: 5, Informative

When I started on my Ph.D., I started out majoring in AI. One of several reasons I changed to computer architecture (CPU design, etc.) is because I just couldn't stand the broken ways that people were doing stuff. Actually computer vision stuff isn't so bad -- at least there's room for advancement. But the speech recognition state of the art is just awful. I couldn't stand the way they did much of anything in pursuit of human language understanding.
With automatic speech recognition (ASR), the first problem is the MFCCs (Mel-frequency cepstral coefficients). What they essentially do is take a Fourier transform of a Fourier transform of the data. This filters out not only amplitude but also frequency, leaving you only with the relative pattern of frequency. Think of this as analogous to taking a second derivative, where all you get is acceleration, leaving out position and velocity. You lose a LOT of information. Then once the MFCCs are computed, the top 13 (or so) dominant MFCCs are kept, plus their first and second step-wise derivatives, giving you a 39D vector. Then the top N most common ones are tallied and code-booked, mapping the rest to the nearest codes, leaving you with a relatively small number of codes (maybe a few hundred).
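To make the last two steps above concrete, here is a minimal sketch of stacking 13 coefficients with their first and second step-wise differences into a 39D vector, then mapping each vector to its nearest codebook entry. It assumes the 13 coefficients per frame have already been computed somehow; the toy frames, the two-entry codebook, and all function names are made up for illustration. Real systems typically use smoother delta estimates and learn the codebook from data.

```python
# Illustrative sketch only: the "frames" are toy numbers, not real MFCCs,
# and every name here is hypothetical.
from typing import List, Sequence

Vector = List[float]


def step_deltas(frames: Sequence[Vector]) -> List[Vector]:
    """Step-wise difference between each frame and the one before it.

    The first frame has no predecessor, so its delta is all zeros.
    """
    deltas = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([c - p for p, c in zip(prev, cur)])
    return deltas


def stack_features(frames: Sequence[Vector]) -> List[Vector]:
    """Concatenate each frame with its first and second differences."""
    d1 = step_deltas(frames)
    d2 = step_deltas(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to vec (squared Euclidean)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Replace every feature vector with the index of its nearest code."""
    return [nearest_code(v, codebook) for v in features]


# Four toy frames of 13 identical coefficients each: 0, 1, 2, 3.
frames = [[float(t)] * 13 for t in range(4)]
features = stack_features(frames)  # each vector is 13 + 13 + 13 = 39 long

# A hand-written two-entry codebook (an assumption, not learned from data).
codebook = [[0.0] * 39, [3.0] * 13 + [1.0] * 13 + [0.0] * 13]
codes = quantize(features, codebook)
print(len(features[0]), codes)
```

Whatever distinguishes two frames that land on the same code is gone after this step, which is the information loss being complained about.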
So to start with, the signal processing is half deaf, throwing away most of the information. I get why they do it, because it's speaker independent, but you completely lose some VERY valuable information, like prosodic stress, which would be very useful to help with word segmentation. Instead, they try to guess it from statistical models.
Next, they apply a hidden Markov model (HMM). Instead of inferring phones from the signal, the way they model it is as a sequence of hidden states (the phones) that cause the observations (the codes). This statistical model seems kinda backwards, although it works quite well when trained properly. To train it, you need a lot of labeled data, where people have taken lots of speech recordings and manually labeled the phonetic segments. What is usually learned is mostly a bigram model, where what you know are the a priori probabilities of each phone label (the hidden states) and the conditional probability of each phone given each possible prior phone. Given a sequence of codes, you find the most likely sequence of phones by computing the Viterbi path through the HMM.
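As a rough illustration of that decoding step, here is a minimal Viterbi sketch over a toy two-phone HMM. The phone labels, the code alphabet {0, 1}, and every probability are invented for the example rather than trained from labeled speech. The emission table (how likely each phone is to produce each code) is part of any HMM even though it is not spelled out above.

```python
# Illustrative sketch only: a toy HMM with made-up numbers.
import math
from typing import Dict, List, Sequence


def viterbi(
    obs: Sequence[int],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[int, float]],
) -> List[str]:
    """Most likely hidden-state sequence for obs.

    Works in log space to avoid underflow. Assumes every probability
    used is nonzero, since math.log(0) would raise an error.
    """
    # best[t][s]: log-probability of the best path that ends in s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back: List[Dict[str, str]] = [{}]  # back[t][s]: predecessor of s on that path

    for o in obs[1:]:
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (
                best[-1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][o])
            )
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy model: two phones and two codes, all values hypothetical.
states = ["s", "iy"]
start_p = {"s": 0.6, "iy": 0.4}
trans_p = {"s": {"s": 0.7, "iy": 0.3}, "iy": {"s": 0.4, "iy": 0.6}}
emit_p = {"s": {0: 0.9, 1: 0.1}, "iy": {0: 0.2, 1: 0.8}}

print(viterbi([0, 0, 1, 1], states, start_p, trans_p, emit_p))
# prints ['s', 's', 'iy', 'iy']
```

The decoder only ever sees the codes, so any error made during quantization carries straight through to the phone sequence.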
Honestly, I can't complain too much about the HMM. What I do complain about is the fact that the "cutting edge" is to replace the HMM with a Markov random field (just remove the arrows from the HMM) or a conditional random field (which is a Markov random field with extra inputs).
My response to using MRFs and CRFs is "big whoop", because all you're doing is replacing the statistical model, which doesn't dramatically improve recognition performance, because they haven't fixed the underlying problem with the signal processing.
Then, on top of the phone HMM, they layer ANOTHER HMM to infer words and word boundaries, based on a highly inaccurate phone sequence.
The main problem with all of this is not that the researchers are idiots. They're not. The problem is that the people with the funding are totally unwilling to fund anything really interesting or revolutionary. The existing methods "work", so the funding sources figure that we can just make incremental changes to existing technologies. Which is wrong. Unfortunately, any radically new technology would be highly experimental, with a high risk of failure, and would take a long time to develop. No one wants to fund anything that iffy. As a result, all the scientists working in this area spend their time on nothing but boring tweaks of a broken but "proven" reasonably effective technology.
So I don't blame people for the conundrum, but I see no opportunity to do anything interesting, so I just couldn't stand studying it.