Rest In Peas — the Death of Speech Recognition

Posted by Soulskill on Monday May 3, 2010 @09:07AM from the yale-in-ox-boom-i-crows-off dept.

An anonymous reader writes "Speech recognition accuracy flatlined years ago. It works great for small vocabularies on your cell phone, but basically, computers still can't understand language. Prospects for AI are dimmed, and we seem to need AI for computers to make progress in this area. Time to rewrite the story of the future. From the article: 'The language universe is large, Google's trillion words a mere scrawl on its surface. One estimate puts the number of possible sentences at 10^570. Through constant talking and writing, more of the possibilities of language enter into our possession. But plenty of unanticipated combinations remain, which force speech recognizers into risky guesses. Even where data are lush, picking what's most likely can be a mistake because meaning often pools in a key word or two. Recognition systems, by going with the "best" bet, are prone to interpret the meaning-rich terms as more common but similar-sounding words, draining sense from the sentence.'"
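
The "best bet" failure described in the quoted passage can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the probability tables, the word pair, and the function name `best_bet` are hypothetical and not taken from the article or from any real recognizer. The idea is only that when two words sound alike, a context-free frequency prior picks the more common one, while even one word of context can reverse the choice.

```python
# Hypothetical numbers for illustration only; not measured from any recognizer.
ACOUSTIC = {"peace": 0.5, "peas": 0.5}        # P(audio | word): the two sound alike
UNIGRAM = {"peace": 0.0009, "peas": 0.0001}   # context-free word frequency
BIGRAM = {                                    # toy P(word | previous word)
    ("in", "peace"): 0.02,
    ("in", "peas"): 0.0001,
    ("frozen", "peace"): 0.00001,
    ("frozen", "peas"): 0.05,
}


def best_bet(candidates, prev=None):
    """Pick the candidate with the highest acoustic score times prior."""
    def score(word):
        # Fall back to the context-free prior when no context is available.
        prior = BIGRAM.get((prev, word), UNIGRAM[word]) if prev else UNIGRAM[word]
        return ACOUSTIC[word] * prior
    return max(candidates, key=score)


print(best_bet(["peace", "peas"]))                 # peace
print(best_bet(["peace", "peas"], prev="frozen"))  # peas
```

With no context, the common word wins even if the speaker meant the rarer one; a real system would use far richer context than a single preceding word.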

Buffalo buffalo by Anonymous Coward · 2010-05-03 09:11 · Score: 5, Insightful

Buffalo buffalo Buffalo buffalo buffalo, buffalo Buffalo buffalo.
1. Re:Buffalo buffalo by liquiddark · 2010-05-03 09:30 · Score: 4, Insightful
  
  What human can parse this without an expert to tear apart the context? I don't see the point in trying to serve up a sentence that simply isn't a sentence to most speakers of the language.
2. Re:Buffalo buffalo by JanneM · 2010-05-03 11:58 · Score: 5, Insightful
  
  Most people won't be able to parse the sentence, though. I know I can't. I have no idea how to interpret it as anything but a string of nouns. My guess is, even fewer would be able to parse it if spoken (the capitals and the comma are, I assume, important hints). It'd be unrealistic and unproductive to require speech systems to actually do better than most humans on the task; if many of us can't parse the sentence then why expect a computer to do so?
  Better overall benchmark: require it to have the ability of a competent but not perfect second-language user. We're long used to dealing with that level of proficiency, whether because the conversant is a foreigner, a child, or has a dialect very different from our own.
  
  --
  Trust the Computer. The Computer is your friend.
AI by ShadowRangerRIT · 2010-05-03 09:17 · Score: 5, Insightful

Natural language processing *is* AI. And high accuracy speech recognition requires natural language processing if we expect to have accuracy rates approaching that of a human. Humans hear words partially or incorrectly all the time. We fill in the gaps from context, and we correct if the course of the conversation reveals that the original interpretation is wrong. Expecting computers to do better, when half the time the problem is the speaker, not the listener, means you need it to be able to make the same corrections from limited information on the fly, and after the fact that a human brain makes.

--
$_ = "wftedskaebjgdpjgidbsmnjgcdwatb"; tr/a-z/oh, turtleneck Phrase Jar!/; print
1. Re:AI by ShadowRangerRIT · 2010-05-03 09:24 · Score: 3, Insightful
  
  Just as an example, my father is partially deaf. No hearing in one ear, and less than a quarter of human baseline in the other. But with a hearing aid (which still doesn't get him to full functionality), he gets 95% accuracy or better in regular conversation, and it gets better as the conversation progresses. It's not because the hearing aid is fixing the underlying problem (it can't, since the problem is in the inner ear). But if he knows the general topic, and picks up on 50% of the phonemes, he can fill in the blanks and figure out the gist of the sentence, despite hearing it in bits and pieces. As the conversation progresses, his accuracy improves because he is supplying the prompts; if the responses fall into the set of "expected" responses, filling in the gaps becomes even easier. By contrast, if you change topics abruptly or go off on a tangent, you may need to repeat yourself half a dozen times. Now a computer will have better "hearing", but if it doesn't know the topic before you start, it's going to have the same problem anytime you slur a word, elide a syllable, or clear your throat mid-sentence. People expect to speak to a computer and have it understand, forgetting that people aren't usually expected to interpret a sentence in isolation, with no idea of the topic.
  
  --
  $_ = "wftedskaebjgdpjgidbsmnjgcdwatb"; tr/a-z/oh, turtleneck Phrase Jar!/; print
Number of sentences? by Logarhythmic · 2010-05-03 09:19 · Score: 2, Insightful

"One estimate puts the number of possible sentences at 10^570"
What a completely useless metric. It makes sense to examine the context and meaning of speech in order to accurately transcribe words, but the number of possible sentences doesn't seem to accurately describe the problem here...

--
"Before criticizing someone, first walk a mile in his shoes. Then, you'll be a mile away... and you'll have his shoes."
Not Dead Yet by Shidash · 2010-05-03 09:20 · Score: 2, Insightful

I doubt it is completely dead. I have yet to hear it from the researchers working on AI. I work in affective computing, so I am thinking that it is possible that the missing component could be emotion or another way to increase the understanding and ability of computers to learn. In addition, even if it is not possible to increase speech recognition capabilities in this model of computing, in another model of computing this and more would be possible. I am not believing it until I hear it from researchers who have tried most possible options for improvement.
Time flies like an arrow fruit flies like a banana by GuyFawkes · 2010-05-03 09:24 · Score: 2, Insightful

Having said that, Dragon works fairly well, provided you modulate your speech.
If you want a laugh with Dragon, turn away from the screen and talk normally, then look at what it has transcribed..

--
http://slashdot.org/~GuyFawkes/journal
Since I don't have a flying car today, all is lost by liquiddark · 2010-05-03 09:26 · Score: 4, Insightful

Futurists should really learn what the word "plateau" means. The death of any given technical progression, particularly one that deals with information processing, tends to be announced early and often, right up to the point where progress becomes meaningful again and then all of a sudden everyone saw it coming, and oh by the way where's my flying car?
is there any evidence for this analysis? by Trepidity · 2010-05-03 09:27 · Score: 3, Insightful

I see a lot of claims, but not much evidence. If we're going to use perceptions and anecdotes as evidence, my impression is that speech recognition has always been considered vaguely stalled. In 2000, people didn't think much progress had been made since 1991 besides some commercialization of stuff academia already knew how to do. In 2010, this guy doesn't think much progress has been made since 2001 besides some commercialization of stuff academia already knew how to do. Yet I think some progress has been made over the past 20 years. There just haven't been any breakthroughs, which is maybe what he's expecting, given his vague suggestion that "AI", a pretty vague concept, is our hope.
I'm also skeptical that accuracy has flatlined, though it's possible that's true in some areas. My impression is that multi-speaker recognition, use of large corpora to improve accuracy, and use of language modeling to improve accuracy, have all improved over the past 10 years. Of course, not all improvements go everywhere: the speech recognition running in real-time on a mobile ARM processor is not using every possible state-of-the-art technique. The advance there is that you can run speech recognition in real-time on a mobile ARM processor at all, and get performance that was once only possible on pretty hefty workstations.

--
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Blame startrek by onyxruby · 2010-05-03 09:28 · Score: 4, Insightful

Blame Star Trek for making it look flawless. Speech recognition is just like fusion technology, 20 years away from properly working - just like it has been for the last 20 years.
-RANT- I can't stand voice recognition systems that don't at least give you an option to press a number. Especially when they are out of tune and pick up background noises as voice. Please, please, please - always give the option to press a number instead of having to voice everything!!
Re:Mod parent up by ground.zero.612 · 2010-05-03 09:28 · Score: 0, Insightful

What about the simple fact that conversation itself is a learning process?
You learn the extent of your audience's comprehension among other things. How can a computer be programmed to recognize everything when we lack a sufficient model to base it on?
There is a point in conversation when a sensible human being will recognize they are not getting their ideas through, and simply give up and say "never mind".

--
"Be prepared, son. That's my motto. Be prepared." --Joe Hallenbeck
Shout-outs to two idiots by Foobar_ · 2010-05-03 09:33 · Score: 5, Insightful

This blog post is retarded. The author is correlating a drop in internet news articles about Dragon NaturallySpeaking with a flatlining of speech recognition accuracy rate.
The Slashdot editor Soulskill is retarded for both not realizing this and for not reading the anonymously-submitted blog post (hmm no way it could have been the author) before approving it for the Slashdot front page. The guy is just out for more traffic to his rather pointless tech news commentary blog.
Decline of Slashdot, internet signal-to-noise ratio, get off my lawn, etc.
Re:What are you talk'in about ? by bmo · 2010-05-03 09:53 · Score: 3, Insightful

People want "human quality" speech recognition.
As if we're ever going to get away from training speech recognition programs when we train listeners every day when we speak. It's just that most people don't look at it as being trained, since we're so used to doing it.
I'm sure you have more trouble understanding someone with a thick Cockney or Scottish accent if you're from the Midwest US. You'd ask that person to repeat a few times, wouldn't you?
To expect speech recognition programs to *not* use training is to expect them to exceed human intelligence. Indeed, it's to expect such programs to be psychic.
--
BMO
Watermelon Box by NReitzel · 2010-05-03 09:56 · Score: 4, Insightful

Long ago - decades, before Bill Gates was invented, a lot of research went into what would be required for actual voice recognition.
A counterexample was given, about an engineering marvel (of the time) that would recognise when someone said the word "watermelon". For a long time, people in the industry assumed that the path to voice recognition consisted of building more and better watermelon boxes.
Several authors, including Alan Turing himself, argued that actual voice recognition could never be accomplished with a large array of watermelon boxes. Current VR software divides input into a series of hyperplanes, and attempts to build a best match from the classification tree.
This is the 2010 version of the watermelon box.
Real voice recognition won't be practical until the input is parsed, matched against context, and structured much akin to diagramming a sentence in those old English (or other) classes. In short, matching against a vocabulary is trying to solve an exponential problem with a (large) polynomial engine.
It won't be until the computer actually understands what is said that VR is likely to be practical in a global sense.
As a person who has been building computer systems for 35 years, it bothers me to see a huge body of research done into subjects like these ignored, because someone thinks that none of it applies to PCs.

--
Don't take life too seriously; it isn't permanent.
Re:Mod parent up by Antiocheian · 2010-05-03 10:00 · Score: 4, Insightful

Not necessarily. Speech recognition doesn't fail when it can't figure out elaborate grammatical constructs and lexical ambiguities. Speech recognition fails because it can't figure out simple sentences in conditions humans can.
Re:Mod parent up by zegota · 2010-05-03 10:26 · Score: 2, Insightful

Interestingly enough, a computer would likely parse that sentence correctly, while nearly any human speaker (not familiar with the sentence) would think it's a nonsense phrase.
Re:Android Speech Recognition Rules by peragrin · 2010-05-03 10:29 · Score: 2, Insightful

I gave up voice dialing when I sneezed and dialed my father. I coughed and got my mother, but no matter what I did a loud fart would not call my brother but open the web browser and visit slashdot.
Okay the last one might be a lie, but the sneezing to get my father is true. Try it. Make funny sharp noises at your voice dialer and see what it dials.

--
i thought once I was found, but it was only a dream.
Re:Mod parent up by __aasqbs9791 · 2010-05-03 10:35 · Score: 1, Insightful

You are exactly right. I've often said no two people actually speak the same language. They just sound very similar sometimes.
Re:Forget speech recognition.... by Pfhorrest · 2010-05-03 10:42 · Score: 3, Insightful

The word "data" is a plural countable noun. "Datum" is the singular form thereof. Plural countable nouns take the copula "are". Singular countable nouns take the copula "is". The sentence you quoted was thus grammatically correct: a datum "is", but data "are".

Though I admit, the treatment of "data" as a mass noun (the likes of which take the copula "is" as well) is common enough that it did sound jarring to my own ear, even knowing it was technically correct.

--
-Forrest Cameranesi, Geek of all Trades
"I am Sam. Sam I am. I do not like trolls, flames, or spam."
Do other languages fare just as bad... by thewils · 2010-05-03 10:49 · Score: 4, Insightful

English, I would think, is a pretty daunting language for speech recognition, what with a substantial array of homophones, but I wonder if other languages fare better. Maybe Spanish or, say, Japanese would be better since (I'm guessing) there is a closer relation between the written script and the actual sound that it makes.

--
Once I was a four stone apology. Now I am two separate gorillas.
Philosophers, "we told you so". by cenc · 2010-05-03 11:06 · Score: 2, Insightful

I have been flamed more than a few times around here for suggesting Computer Science has not got a clue what they are doing when it comes to AI. Philosophy has been at this problem and more for the better part of the last 400+ years (more like 1,000 years) in a serious way. The stock b.s. I get from the science fiction fan boys is that somehow natural language is a problem that can just be brute forced as if you were trying to figure out the password you forgot to your email account. Good luck with that.
By the way, language "recognition" by a computer is likely the easy part of the problem for AI researchers to crack. It is still not going to yield any real AI, just better cars and toasters.

--
Living in Chile
Re:Mod parent up by jpate · 2010-05-03 13:45 · Score: 3, Insightful

When you have lots of data, you don't have to build any "expert" knowledge into a learner.
This isn't really quite so clear cut. Feature engineering, model structure, model training techniques, and so on all bias statistical learners towards different parts of the hypothesis space. Hidden Markov models (the standard in speech recognition) clearly constitute a data-driven approach, but usually they predict diphones (which capture the transitions between speech sounds) rather than phones themselves. That is, "cat" is recognized not by predicting a [k] followed by an [ae] followed by a [t], but (among other things) by a [k-ae] transition followed by an [ae-t] transition. This is a very direct way of encoding expert linguistic knowledge that speech sounds are pronounced differently in the context of other sounds. Think about where your tongue touches the top of your mouth in "keen" compared to "can."
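
As a rough illustration of the diphone idea above, here is a minimal sketch that turns a phone sequence into its transitions. The phone labels and the function name `to_diphones` are assumptions chosen for the example; this is not how any particular recognizer represents its units, and it ignores the "among other things" (such as word boundaries) mentioned above.

```python
def to_diphones(phones):
    """Return the transitions between adjacent phones, e.g. 'k-ae', 'ae-t'."""
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]


# Toy pronunciations with hypothetical, ARPAbet-like labels.
cat = ["k", "ae", "t"]
keen = ["k", "iy", "n"]
can = ["k", "ae", "n"]

print(to_diphones(cat))   # ['k-ae', 'ae-t']
print(to_diphones(keen))  # ['k-iy', 'iy-n']
print(to_diphones(can))   # ['k-ae', 'ae-n']
```

"keen" and "can" start with the same phone but different diphones, so a model built on these units keeps separate statistics for the [k] before [iy] and the [k] before [ae].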
Re:Mod parent up by arth1 · 2010-05-03 13:55 · Score: 3, Insightful

If I said "I had a hard time staying a wake", both a person and a computer would misunderstand and think I said "I had a hard time staying awake."
You give computers way too much credit.
More likely it would think you said "Dear aunt, let's set so double the killer delete select all".
My experience with telephone Voice Rejection Systems is that they get what you say wrong more often than not, especially if you have a deep voice.
Re:What are you talk'in about ? by icebraining · 2010-05-03 20:04 · Score: 2, Insightful

No, I want to use a common dataset to train all software automatically, like VoxForge. What I was saying is that people don't need training to talk to each person they meet. A generic background training works fine, and so it should for computers.

--
Dilbert RSS feed
Re:Speach recognition tech is broken in many ways by jam244 · 2010-05-04 02:25 · Score: 2, Insightful

When I started on my Ph.D., I started out majoring in AI. One of several reasons I changed to computer architecture (CPU design, etc.) is because I just couldn't stand the broken ways that people were doing stuff.
I don't get it. You left a Ph.D. program because the field was immature? Isn't the whole point of a Ph.D. program to produce something new and share it? Yeah, I get that funding might be harder to get than in a safer field like computer engineering, but it seems like you abandoned a huge opportunity. You make it sound like you had a whole slew of new, potentially great ideas, and you just dropped them because it would be "too hard".