Microsoft Claims Its Speech Transcription AI is Now Better Than Human Professionals (qz.com)
Microsoft announced today a system that can transcribe the content of a phone call with "the same or fewer errors" than real actual human professionals trained in transcription -- even when the human transcript is double-checked by a second human for accuracy. As you can imagine, this is a huge milestone for speech recognition. From a Quartz report: The team doesn't attribute this achievement to any breakthrough in algorithm or data, but the careful tuning of existing AI architectures. To test how their algorithm stacked up against humans, first researchers had to get a baseline. Microsoft hired a third-party service to tackle a piece of audio for which they had a confirmed 100 percent accurate transcription. The service worked in two stages: one person types up the audio, and then a second person listens to the audio and corrects any errors on the transcript. Based on the correct transcript for the standardized tests, the professionals had 5.9 percent and 11.3 percent error rates. After learning from 2,000 hours of human speech, Microsoft's system went after the same audio file -- and scored 5.9 percent and 11.1 percent error rates. That minute difference ends up being about a dozen fewer errors.
Microsoft's next challenge is making this level of speech recognition work in noisier environments, like in a car or at a party. This implementation is crucial for Microsoft, and goes well beyond just transcription.
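For anyone wondering how an "error rate" like the ones quoted above is usually computed: the common metric for transcription benchmarks is word error rate, i.e. the word-level edit distance between the transcript and the reference, divided by the number of reference words. The summary does not name the metric, so take this Python sketch as an illustration under that assumption. The function name `word_error_rate` and the sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of five.
sample_wer = word_error_rate(
    "he defused the tense situation",
    "he diffused the tense situation",
)
```

Note that this metric counts every wrong word the same, whether or not the mistake changes the meaning.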
Isn't that the dying PC company?
Better than Indian professionals. - FTFY
That minute difference ends up being about a dozen fewer errors.
If 0.2% is a dozen, then 1% is sixty, so 100% is six thousand errors.
Yikes.
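Spelling out that arithmetic, for what it's worth. This Python sketch takes the summary at face value and assumes the 0.2-point gap is exactly a dozen errors (the summary only says "about a dozen"), so every number it derives is a rough estimate, not a reported figure.

```python
from fractions import Fraction

# Figures quoted in the summary for the harder of the two test sets.
human_rate = Fraction("11.3") / 100
machine_rate = Fraction("11.1") / 100

# Assumption: the gap corresponds to exactly 12 errors.
error_gap = 12

# If 0.2% of the words is 12 errors, this is the implied size of the test set.
total_words = error_gap / (human_rate - machine_rate)

# Implied error counts for each transcriber on that test set.
human_errors = human_rate * total_words
machine_errors = machine_rate * total_words

print(f"implied words: {total_words}")
print(f"human errors: {human_errors}, machine errors: {machine_errors}")
```

So the six thousand is the implied number of words in the test, not the number of errors anyone made.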
that is all
and the NSA.
I'll believe that when I ducking see it.
--
This comment was transcribed by Microsoft's new AI transcription software.
If you want voice input to be more than just a toy, then getting near flawless accuracy here seems to be a required first step.
If your mouse occasionally sent an erroneous input to the computer no matter how careful you were, you wouldn't use it so much.
"His name was James Damore."
Like any human would think about milk or "open reminders" when hearing "Show me my most at-risk opportunities".
Dialog windows: "Do you want to register for your FREE Windows 10 Upgrade?"
Me (vocally): "No, no... oh, for the love of all that's sacred, NO!"
Windows: "This may take a while. Please do not power down your computer ..."
Automated closed captioning for the hearing impaired would be one. I'm not hearing impaired, but I use the CC system with the volume low when I am watching TV while everyone else in the house is sleeping. I also use it when everyone is awake and noisy. It is amazing how awful some CC can be.
Will they use it on their own hyped up marketing?
Who makes voice calls anymore?
Dear Aunt, let's set so double the killer delete select all
Microsoft is incorrectly interpreting the results. A more accurate conclusion would be that they achieved equivalent performance. Superior performance would require that the error rates for the AI be substantially lower than those of humans. They're not, they're nearly identical.
Say what you want about Microsoft (and some of it is true) but this is progress, even if they (maybe) cherry picked the one trial that had the lowest difference in error rate between the algorithm and a human...
This is my sig, there are many like it but this one is mine
Can it read lips?
Speaking as a professional who has worked in this field, this screams of a cherry picked scenario. The margin of success falls well within the bounds of the statistically insignificant variability I would expect to see in SR systems (human or otherwise). In the article they admit to eliminating audio which would favor humans over machines (noisy environments, etc). This kind of PR release produces good short term feelings but in the long term makes Microsoft (and computer science people in general) look like myopic, self-important ignorant twits.
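A back-of-the-envelope check on the significance point, as a Python sketch. It assumes a test set of roughly 6,000 words (implied by "0.2 points is about a dozen errors", not a reported figure) and treats the two transcripts' errors as independent, which they are not, since both transcribe the same audio. A paired test would be the proper tool, so read this two-proportion z-test as a rough sanity check only.

```python
import math

# Assumptions, not reported figures: about 6,000 words in the test set,
# and errors that are independent between the two transcripts.
n_words = 6000
human_errors = round(0.113 * n_words)    # 11.3% error rate
machine_errors = round(0.111 * n_words)  # 11.1% error rate

p_human = human_errors / n_words
p_machine = machine_errors / n_words

# Pooled error rate under the null hypothesis that both rates are equal.
pooled = (human_errors + machine_errors) / (2 * n_words)
std_err = math.sqrt(pooled * (1 - pooled) * (2 / n_words))

z = (p_human - p_machine) / std_err
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.2f}")
```

Under those assumptions, z lands well below the usual 1.96 cutoff, so a gap this small on a test this size is consistent with noise.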
Question: how did they find the errors that the two-human team missed? Presumably with a third human. Does this mean a three-person team can beat out both a two-person team and ASR? Or was there a script that was used to generate the audio? That would raise other questions, such as the accuracy of the speakers.
How do you get 100% accurate translation anyway? These things are up to interpretation, not all words have an exact translation, meaning is more important than the actual correct words. Language is ambiguous.
Also, not every error is equal: some errors are irrelevant, while others matter more, e.g. "the quick car" and "the fast car" mean the same thing.
Transcription is obviously a lot more straightforward, and the goalposts should be pretty easy to set.
Let's not stir that bag of worms...
Why should anyone trust Microsoft? They lie about surveillance, they lie about being "open", they lied about Windows 10 install options, they lied about Windows Tablet having a bigger screen than iPad, they lied to Steve Jobs about their GUI plans in the 80s, etc.
640 lies oughtta be enough for anyone. Ignore them by now.
Table-ized A.I.
I assume this is so the Govt agencies can transcribe cell-phone communications to text and then perform analysis to find all the "bad guys" ?
I made this: http://www.bpftpserver.com
Will man or machine determine? Stay tuned...
Captcha: revered
Wake me up when they buy Nuance/eScription. (Used by Siri and Samsung S Voice)
The machines can finally interpret our speech. Next step: launch all the missiles.
based on that twitter chat-bot that turned racist and trollish in a matter of hours? I have been looking for a way to UTF-TRUMP encode my documents!
"even when the human transcript is double-checked by a second human for accuracy"
Everything depends on how dumb the transcriber and/or checker is.
The acid test for transcription for me is if the transcriptionist gets the word "defuse" right, as in "He defused the tense situation." Every, and I mean EVERY, closed caption I've seen transcribes it as, "He diffused the tense situation." It seems to be the universal mistake.
We use a M$ system here at work -- some of the funniest writing I've ever read was a simple, serious phone message "transcribed" by their software. I'd say, "I welcome our new M$ AI overlords" but it's better to read their transcription. "I will come are knew micro soft overloads"
Now the NSA can store text transcripts of your conversations instead of having to store the audio files. This will leave so much more room for video! Hey - why did you put tape on your webcam, citizen?
Seven puppies were harmed during the making of this post.
Like any human would think about milk or "open reminders" when hearing "Show me my most at-risk opportunities".
Eye thin queue meant two say:
Lie canny hue man wood thin cab out mill core "owe pen reminders" when he ring "Show meme I Moe stat risk copper tune it ease.
Now give me your accuracy rate when following a 28-year-old physician from Northern China who has a tendency to turn his head away from the mic at random while dictating, a 42-year-old doctor from Southern India who always speaks too softly, and a 36-year-old from Georgia whose speech patterns force the transcriptionist to interpret what he actually meant.
Sorry, I've been told one too many times that voice recognition means I can do away with transcriptionists to accept anything less than a 100% money-back offer within the first year.
Of middle class jobs about to go caput.
Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
The humans had a 5.9% error rate AFTER proofreading by another person? That's either a lousy speaker, a terrible recording, or really bad transcription. That's not something to brag about, frankly. I used to get an error rate of under 2% with IBM ViaVoice back in 1994. This doesn't seem like progress to me.
https://www.youtube.com/watch?...
Cheap $20 rechargeable wireless RF headphones work pretty well. I have 3 pairs in my house connected to different devices like TV, PC, home receiver. Someone in my house is always wearing wireless headphones. Spend the $20 and enjoy TV with the sound. Watch TV, PC, HTPC at any hour in the family room without waking anyone up.
I might believe those numbers if it's after 2000 hours of the same individual speaking in monotone at a consistent pace and that that individual is speaking identically to what was done during the "learning phase", using already "learned words".
If that's not the case, there's no chance of those numbers being even close to accurate.... I've seen some of the crap Cortana comes up with when I speak to it.
Didn't we have a story about this and a Chinese company the other month? Seems Microsoft is late to the party again.
If it's anything like the speech transcription in their latest bugfix for Skype, it's a total joke.
Jim: Hey there
Bot: Good day sir.
Jim: Semi colon drop table language
Bot:???????????
Have my criticism and observations upset you AC? Struck a nerve?
Ask me about my sig!
I have just this to say about that: folks, I wouldn't let alpha software out to users.
They brought in "hybrid" phones here last year (VOIP). For voicemail, it sends an mp3, and a "transcription". Frequently, the "transcription", "powered by Microsoft speech technology", resembles early "computer poetry". And by "early", I'm talking 1960s or '70s.... with significant portions bearing zero resemblance to what was said.
mark