Vista Speech Recognition Goes Awry

Posted by ryuzaki0 on Saturday July 29, 2006 @01:25AM from the egg-on-face dept.

An anonymous reader writes "It seems even MSNBC is willing to take a jab on those rare occasions when Microsoft products don't work. During a demo of Vista's speech recognition technology, Vista couldn't differentiate between mom and aunt, and all attempts to rectify the problem just made it worse. Wait until you see what it spat out, I think we have a new 'All your base.' Don't you just love Microsoft's live demonstrations?"

So? by Klaidas · 2006-07-29 01:39 · Score: 5, Informative

This isn't the first presentation that went wrong, is it?
Win98 gone wild: http://www.youtube.com/watch?v=Hrbx9_AY720
Media Center Edition gone wild: http://www.youtube.com/watch?v=j7EEbokKLHI
We can add this one to the list too ;)
Dear aunt by linvir · 2006-07-29 01:40 · Score: 4, Informative

For the flashless. Here's the format:
Microsoftie says this
Speech recogniser hears this

Dear mom
comma
Dear aunt,
[laughter]
Fix aunt
Let's set
Delete that
Delete that
Delete that
so
I think it's picking up a little bit of echo here
Delete... select all
double the killer delete select all
[laughter]

Final text:
Dear aunt, let's set so double the killer delete select all
1. Re:Dear aunt by James+Manning · 2006-07-29 02:11 · Score: 5, Informative
  
  For the curious, it was an audio gain issue. Details on Rob Chambers' blog:
  
  http://blogs.msdn.com/robch/archive/2006/07/29/682479.aspx
  
  --
  
  Various ramblings
Re:Is SR ever going to be good enough? by joe+155 · 2006-07-29 01:57 · Score: 4, Informative

I have used NaturallySpeaking. It can take a bit of time to train, but if you are the only one using it, you can eventually get it to the point where you can talk at a normal speed (although you have to speak clearly) and it will approach 90% accuracy; sometimes I had it higher. It still couldn't be used as an alternative to typing for extended periods, though, because you had to check everything it wrote.

One good thing it did was try to understand whole sections of speech rather than each word in isolation, which improved accuracy. Words often follow patterns, and only a few words make sense after any given word, so it was often right with phrases like "over there".
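
To illustrate the idea of using the previous word to choose between sound-alike candidates, here is a toy sketch in Python. It is not how NaturallySpeaking works internally; the function name `pick_with_context`, the acoustic scores, and the bigram counts are all made up for the example.

```python
def pick_with_context(previous_word, candidates, bigram_counts):
    """Pick the candidate word that best fits after previous_word.

    candidates: list of (word, acoustic_score) pairs, higher score = better match.
    bigram_counts: dict mapping (previous, word) to how often that pair was seen.
    Both inputs are hypothetical; a real recogniser uses far richer models.
    """
    def combined(item):
        word, acoustic_score = item
        # Add-one smoothing so an unseen word pair is unlikely, not impossible.
        seen = bigram_counts.get((previous_word, word), 0)
        return acoustic_score * (seen + 1)

    return max(candidates, key=combined)[0]


# Invented numbers: "their" sounds slightly closer, but "over there" is far more common.
counts = {("over", "there"): 50, ("over", "their"): 1}
heard = [("their", 0.55), ("there", 0.45)]
best = pick_with_context("over", heard, counts)
```

With these invented numbers, the acoustically weaker "there" wins after "over" because the word pair is so much more common.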

SR tech will eventually be as good as on Star Trek as long as people work on it. I would give it 20 years if it is seen as something that could make a lot of money, 40 if you have to wait for interested people to do it for free on their own time.

--
*''I can't believe it's not a hyperlink.''
Man, that brings back memories!!! by smchris · 2006-07-29 02:11 · Score: 1, Informative

Does Microsoft have to copy EVERYTHING??? I used OS/2 Warp for the second half of the 90s but my experience with _its_ built-in speech recognition was pretty much identical to that demo.
Re:Roald Dahl by EddWo · 2006-07-29 02:59 · Score: 4, Informative

From the book Dirty Beasts

--
"Taligent is still pure vapor. Maybe they'll be the last who jumps up on Openstep... "
On MSNBC's front page - for about 30 minutes.... by wowbagger · 2006-07-29 03:05 · Score: 5, Informative

A friend of mine called me at work (since he knew that to access MSNBC's videos requires Internet Explorer, Windows Media 9 or better, and Flash, and I have neither IE nor WMP at home) and told me about this.

I went to msnbc.com - and there it was, third on the list of videos on the main page.

I called this to the attention of two of my coworkers, and we viewed the video - total elapsed time, maybe twenty minutes.

Then I went to call it to the attention of a third coworker - and the video was no longer on the front page of MSNBC. OK, so maybe they've moved it off the front page, but it should still be on the Technology subsection, right?

Wrong.

Nor was it under Videos, nor anywhere else I could find it easily.

Perhaps this was just a normal rotation of a video. Perhaps not. But no matter what the real cause, there is the appearance that it was removed from the page because it was too embarrassing. Not good for Microsoft.

However, I will give MSNBC this - they didn't give Microsoft a free ride on this, they ribbed them pretty hard.

However, I knew that this would be appearing on other sources as a video that could be viewed outside of Windows. Actually, I am rather surprised that it took this long.

Now, as to the demonstration itself - it looks to me (a person who does signal processing and analysis for a living) like the presenter had the mike gain too high - every time he spoke he maxed out the bar graph on the display. *IF* he had the gain too high, and the audio was clipping significantly, that could make "mom" have enough of a pop to maybe sound like AUNT - especially if the software is using context to try to reduce the search-space for the words. Of course, that's why I would have a monitoring routine in the system, and if any of the samples are at 100% full scale, or if many of the samples are over 90% full scale, or the signal power is too high, I'd have my software adjust the mike gain down *and* flag an alert to the user. I'd also try to look for the mike element itself being overloaded.
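
A minimal Python sketch of the kind of monitoring routine described above. The function name `check_gain`, the report fields, and the thresholds (more than 10% of samples above 90% of full scale, mean power above a quarter of full scale squared) are illustrative assumptions, not anything taken from Vista or a real audio API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GainReport:
    clipped: int          # samples at or beyond full scale
    hot_fraction: float   # share of samples above hot_level * full_scale
    mean_power: float     # mean squared sample, relative to full scale squared
    reduce_gain: bool     # True if any of the three checks tripped


def check_gain(samples: Sequence[float],
               full_scale: float = 1.0,
               hot_level: float = 0.9,
               max_hot_fraction: float = 0.1,
               max_power: float = 0.25) -> GainReport:
    """Inspect one block of audio samples and decide whether gain is too high."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    # Check 1: any sample at 100% of full scale counts as clipping.
    clipped = sum(1 for s in samples if abs(s) >= full_scale)
    # Check 2: how many samples sit above the "hot" level (90% by default).
    hot = sum(1 for s in samples if abs(s) > hot_level * full_scale)
    # Check 3: average signal power, normalised to full scale.
    mean_power = sum(s * s for s in samples) / n / (full_scale ** 2)
    reduce_gain = (clipped > 0
                   or hot / n > max_hot_fraction
                   or mean_power > max_power)
    return GainReport(clipped, hot / n, mean_power, reduce_gain)
```

A caller would run this on each captured block and, when `reduce_gain` is true, both lower the input gain and show the user a warning. Detecting an overloaded mic element would need a separate check, since that distortion can occur below digital full scale.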

--
www.eFax.com are spammers
Audio Gain Settings Caused the Problem by ThinkFr33ly · 2006-07-29 03:26 · Score: 2, Informative

As much as many of you would like to believe that the reason this demo failed was because Microsoft code is horribly designed and implemented, and that they are completely incompetent, there just might be a slightly more realistic explanation for the demo's abject failure.

According to Rob Chambers, a developer on the Vista speech recognition team, the failures during the demo were caused by audio gain issues.

From his blog:

If you watch the video clip on MSN Video you can see in the speech user interface that the microphone "volume" is very high. It pushes up into the red frequently while Shanen is speaking to the computer. That's caused by the fact that the audio sub-system wasn't respecting the audio gain settings we've asked it to use.

This is a known bug in current builds, and has already been fixed by the audio team in their private builds in preparation for RTM.

Read the entire blog post for a more complete explanation of what happened... one that's just slightly more plausible than most of the explanations proffered by your fellow Slashdotters.
Re:Is SR ever going to be good enough? -- Yes! by oblique303 · 2006-07-29 03:32 · Score: 5, Informative

I use Dragon NaturallySpeaking every day (carpal tunnel syndrome), and version 9 has around 99% accuracy, with around 98% out-of-the-box with no training. This means 10 or so errors out of a 1,000-word dictation.

I didn't believe it either, until I actually tried it. Dragon is the first worthwhile speech recognition solution I've seen that's practical for general use (though I'd love it if they'd release a "programmers" version to complement the Medical/Legal versions). I get about 99% accuracy (a decent microphone is *very* important!)

Dragon 9 also doesn't "technically" need training, but accuracy further improves if you do bother to train it a bit. The NYT reviewer was able to get 99.6% accuracy after a short training session.

Here are a few reviews of version 9:

http://www.nytimes.com/2006/07/20/technology/20pogue.html?ex=1154318400&en=6fd795114b3f72ea&ei=5070

http://www.npr.org/templates/story/story.php?storyId=5577523
Re:removing ambient noise by Anonymous Coward · 2006-07-29 03:53 · Score: 5, Informative

For those interested, merely subtracting the two signals doesn't work. The signal at the microphone is not just the music signal (called far-end signal) plus the mic signal (near-end signal). The music signal has travelled across the room before it reaches the microphone, giving it some reverberations (echo). If you simply subtract the two signals, you will still hear the music signal quite loudly.

What is done in practice, and works extremely well, is modelling that "echo" as a filter (an FIR transversal filter, which is simply a tapped delay line with a weight on each tap). You estimate the coefficients of the filter, apply this "room filter" to the music signal, and subtract the result from the microphone signal. You then have the voice-only signal left.

This setup is called AEC, or Acoustic Echo Cancellation. It is used in every telephone and mobile phone there is and is crucial to ADSL. If an ADSL modem did not cancel out its own sent signal at its receiver, the attainable speed would be several times lower. AEC is also the reason why talking immediately when you pick up a mobile phone leaves an audible echo of your own voice: estimating the coefficients of the filter is still taking place at that point.
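
For the curious, the coefficient estimation can be sketched in a few lines of Python. This uses the normalised LMS (NLMS) update, one common adaptive-filter algorithm; the comment above does not say which algorithm any particular phone or modem uses, and the function name `nlms_echo_cancel`, the tap count, and the step size are assumptions chosen for illustration.

```python
def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Estimate the room filter with NLMS and subtract the predicted echo.

    far_end: samples sent to the loudspeaker (the music signal).
    mic:     samples captured by the microphone (echo plus any near-end voice).
    Returns (residual, weights): the mic signal with the estimated echo
    removed, and the final FIR coefficient estimates.
    """
    weights = [0.0] * taps   # current estimate of the room filter
    history = [0.0] * taps   # delay line, most recent far-end sample first
    residual = []
    for x, d in zip(far_end, mic):
        # Shift the new far-end sample into the delay line.
        history = [x] + history[:-1]
        # Predicted echo = FIR filter applied to the delay line.
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        # What is left after subtracting the prediction.
        error = d - echo_estimate
        # NLMS update: step along the delay line, scaled by its energy.
        energy = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / energy
                   for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights
```

The first samples of `residual` still contain echo because the weights start at zero, which matches the start-of-call echo described above. This sketch adapts continuously; practical cancellers usually slow or freeze adaptation while the near-end person is talking so that the voice does not disturb the estimate.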

See http://www.dspalgorithms.com/products/echo.html for a diagram of the AEC or read Haykin's Adaptive Filter Theory if you're looking for a decent book on the subject.
Probably a bad Mic. by jcr · 2006-07-29 04:34 · Score: 2, Informative

When I was last involved in adding speech control to an app, I attended a developer workshop at Apple, and found out much to my surprise that my mic wasn't any good. It sounded fine when I used it for voice recording, but for recognition the gain curve was all wrong. When I tried one of the mics that the speech team from Apple provided, the hit rate went from under 20% to well over 90%.

When Kim Silverman demos Apple's speech recognition, he uses a high-quality noise cancelling mic. It makes all the difference.

-jcr

--
The only title of honor that a tyrant can grant is "Enemy of the State."
Not the first MS demo embarrassment. by MsGeek · 2006-07-29 07:17 · Score: 3, Informative

Actually, the last MS demo flameout of this magnitude came when the beta of Windows 98 was being demoed. They wanted to show a scanner "just working" with Windows 98 and USB. W98 hit the blue screen of death when the USB scanner was plugged in.

There, I found it. The file is an old QuickTime movie. I'm going to put this up on YouTube. There, that's done. Have at it.

--
Knowledge is power. Knowledge shared is power multiplied.
Re:Awww...c'mon guys.... by Atario · 2006-07-29 21:26 · Score: 2, Informative

There were other problems that simply can't be put down to actual recognition problems; it clearly understood perfectly the pronunciation of "delete select all", yet didn't act on any of that as a command.

--
"A great democracy must be progressive or it will soon cease to be a great democracy." --Theodore Roosevelt