The Uncanny Valley of Voice Recognition

Posted by Soulskill on Monday February 9, 2015 @03:01PM from the siri-do-these-pants-make-me-look-fat? dept.

An anonymous reader writes: We've often seen the term "uncanny valley" applied to the field of robotics — it's easy to get unsettled when robots act close to being human, yet fail completely in a few key ways. GitHub Engineer Zach Holman writes that we've now reached uncanny valley territory in speech recognition as well, though the results are more frustrating than they are disturbing. He says, "Part of this frustration is the user interface itself is less standardized than the desktop or mobile device UI you're used to. Even the basic terminology can feel pretty inconsistent if you're jumping back and forth between platforms.

Siri aims to be completely conversational: Do you think the freshman Congressman from California's Twelfth deserved to sit on HUAC, and how did that impact his future relationship with J. Edgar? Xbox One is basically an oral command line interface, of the form: Xbox (direct object). ...it's these inconsistencies that are frustrating as you jump back and forth between devices. And we're only going to scale this up."

3 of 83 comments

Re:I don't think that means what you think it mean by NoNeeeed · 2015-02-09 23:45 · Score: 4, Interesting

I can kind of see what he means, although I think the comparison with the uncanny valley is a bit weak.
I've taken to using Google Now's voice commands to set timers while I'm cooking, so something like "Ok Google, set a timer for 20 minutes". I don't have to touch my phone and it works brilliantly even in the noisy environments of a kitchen.
I've gotten used to talking to it in a very naturalistic way, which is where the problems occasionally crop up, and when they do they can be quite jarring.
A good example was the last time I asked it to set a timer for "an hour and a half", which Now interpreted as 1:00:30s, i.e. an hour and a half *minute*.
The jarring effect is at this edge where we feel like the speech recognition system is understanding what we say, but really it's just trying to use lots of different rules and patterns that have been coded in. If you happen to just fall outside of one of those rules it fails completely, and it can seem very arbitrary.
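To make the failure mode concrete, here is a toy sketch in Python. It is only a guess at the kind of hand-written rule that could turn "an hour and a half" into 1:00:30; it is not Google Now's actual code, and the function names (`naive_parse`, `context_aware_parse`) and the tiny number vocabulary are invented for illustration. The first parser assumes "and a half" always means half a minute; the second ties "half" to the unit that was just spoken.

```python
import re

# Hypothetical vocabulary: just enough number words for the examples below.
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3}
QUANTITY = r"(\d+|an?|one|two|three)"


def to_number(token):
    """Convert a digit string or a known number word to an int."""
    return int(token) if token.isdigit() else NUMBER_WORDS[token]


def naive_parse(phrase):
    """Rule set where 'and a half' is hard-coded as 30 seconds.

    The rule was (hypothetically) written with 'a minute and a half'
    in mind, so it silently misfires after 'an hour'.
    """
    text = phrase.lower()
    seconds = 0
    hours = re.search(r"\b" + QUANTITY + r" hours?\b", text)
    if hours:
        seconds += to_number(hours.group(1)) * 3600
    minutes = re.search(r"\b" + QUANTITY + r" minutes?\b", text)
    if minutes:
        seconds += to_number(minutes.group(1)) * 60
    if re.search(r"\band a half\b", text):
        seconds += 30  # assumption baked into the rule: half of a minute
    return seconds


def context_aware_parse(phrase):
    """Same patterns, but 'half' means half of the last unit mentioned."""
    text = phrase.lower()
    seconds = 0
    last_unit = None
    for match in re.finditer(r"\b" + QUANTITY + r" (hour|minute)s?\b", text):
        unit = 3600 if match.group(2) == "hour" else 60
        seconds += to_number(match.group(1)) * unit
        last_unit = unit
    if last_unit is not None and re.search(r"\band a half\b", text):
        seconds += last_unit // 2
    return seconds


if __name__ == "__main__":
    for phrase in ("set a timer for 20 minutes", "an hour and a half"):
        print(phrase, "->", naive_parse(phrase), "vs", context_aware_parse(phrase))
```

Both parsers agree on "20 minutes" and on "a minute and a half"; they only diverge once the phrase steps outside the case the first rule was written for, which is exactly why the failure feels arbitrary from the outside.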

--
Paul Leader

Re:I fail to see how it's any worse than other UIs by jc42 · 2015-02-10 02:39 · Score: 4, Interesting

but when I click a button the button is bloody well clicked
Looks like you don't have much experience with cheap touch screens.
Heh. You obviously haven't worked with any of the more expensive ones. I have a small collection of different portable gadgets for web testing, and that statement about buttons definitely isn't true for the various Apple tablets or phones. Thus, there's a little "x" icon whose function is to close the tab/window. I've learned to just start tapping it about twice per second, and maybe by the 3rd or 4th or 6th or 10th tap, it'll close.
Of course, the little monster might know very well that I'm tapping it, but wants to see how serious I am about it.
Of course, Apple's gadgets aren't the only ones like this. They're just one of the worst of a bad lot. And often it's a good idea to not tap too fast, because when the window finally closes, it usually gets replaced with another that'll do something totally unexpected when you tap it in that newly-exposed spot.

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.

Re:It's because they don't work... by jc42 · 2015-02-10 03:00 · Score: 3, Interesting

I speak standard BBC English, and I have often been described by people as "the easiest person to understand in the company" in many different companies.
In my experience, the recognition rate appears to be about 2%.
Not surprising; your "BBC English" and our "media English" over here in North America are basically artificial dialects developed by the broadcast industries starting back in the 1940s. They even managed to do some fairly scientific testing, assembling listeners with different native dialects, and counting their mistakes when listening to different proposed pronunciations of various words and phrases. Their intent was to develop dialects that were easily understood by most of their target audiences, and they did a reasonable job of it.
This doesn't help the computers' voice recognition software very much, though, because few customers speak these "standard" artificial dialects well. The software people aren't working on making the customers understand the computer's speech; they're trying to get the computers to understand untrained humans speaking their native dialects. This requires rather different processing than what the broadcasters were trying to do, and is a much more difficult task for us humans, too. It doesn't help that the computers are often listening to humans who aren't totally awake and sober ...

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.