Microsoft Claims Its Speech Transcription AI is Now Better Than Human Professionals (qz.com)
Microsoft announced today a system that can transcribe the content of a phone call with "the same or fewer errors" than real actual human professionals trained in transcription -- even when the human transcript is double-checked by a second human for accuracy. As you can imagine, this is a huge milestone for speech recognition. From a Quartz report: The team doesn't attribute this achievement to any breakthrough in algorithm or data, but the careful tuning of existing AI architectures. To test how their algorithm stacked up against humans, first researchers had to get a baseline. Microsoft hired a third-party service to tackle a piece of audio for which they had a confirmed 100 percent accurate transcription. The service worked in two stages: one person types up the audio, and then a second person listens to the audio and corrects any errors on the transcript. Based on the correct transcript for the standardized tests, the professionals had 5.9 percent and 11.3 percent error rates. After learning from 2,000 hours of human speech, Microsoft's system went after the same audio file -- and scored 5.9 percent and 11.1 percent error rates. That minute difference ends up being about a dozen fewer errors.
Microsoft's next challenge is making this level of speech recognition work in noisier environments, like in a car or at a party. This implementation is crucial for Microsoft, and goes well beyond just transcription.
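For anyone wondering what an "error rate" means here: the summary does not say how the transcripts were scored, but the usual metric in speech recognition is word error rate, i.e. the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. A minimal sketch under that assumption follows; the function name and the lowercase, whitespace-split normalization are illustrative choices, not anything taken from Microsoft's setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumes simple normalization: lowercase and split on whitespace.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five is a 20 percent error rate.
rate = word_error_rate("he defused the tense situation",
                       "he diffused the tense situation")
print(f"{rate:.1%}")
```

Note that this metric weights every wrong word the same, whether or not the mistake changes the meaning.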
That minute difference ends up being about a dozen fewer errors.
If 0.2% is a dozen, then 1% is sixty, so 100% is six thousand errors.
Yikes.
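To spell that arithmetic out: the 0.2-point gap is between the 11.3 percent and 11.1 percent figures in the summary, and the six thousand is really the implied size of the test in words, not a count of errors. A rough sketch, assuming "about a dozen" means exactly 12 and that both rates are measured over the same word count (the helper name is made up for illustration):

```python
from fractions import Fraction


def implied_word_count(fewer_errors, human_pct, machine_pct):
    """Estimate the transcript length from an error gap.

    Percentages are passed as strings so Fraction keeps them exact.
    """
    gap = (Fraction(human_pct) - Fraction(machine_pct)) / 100
    return fewer_errors / gap


words = implied_word_count(12, "11.3", "11.1")
human_errors = words * Fraction("11.3") / 100
machine_errors = words * Fraction("11.1") / 100

print(words)           # estimated words in the test audio
print(human_errors)    # errors at the human rate
print(machine_errors)  # errors at the machine rate
```

Under those assumptions the test works out to roughly 6,000 words, with several hundred errors on each side and a difference of twelve between them.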
I'll believe that when I ducking see it.
--
This comment was transcribed by Microsoft's new AI transcription software.
If you want voice input to be more than just a toy, then getting near flawless accuracy here seems to be a required first step.
If your mouse occasionally sent an erroneous input to the computer no matter how careful you were, you wouldn't use it so much.
"His name was James Damore."
Like any human would think about milk or "open reminders" when hearing "Show me my most at-risk opportunities".
Dialog windows: "Do you want to register for your FREE Windows 10 Upgrade?"
Me (vocally): "No, no... oh, for the love of all that's sacred, NO!"
Windows: "This may take a while. Please do not power down your computer ..."
Automated closed captioning for the hearing impaired would be one. I'm not hearing impaired, but I use the CC system with the volume low when I am watching TV while everyone else in the house is sleeping. I also use it when everyone is awake and noisy. It is amazing how awful some CC can be.
Say what you want about Microsoft (and some of it is true) but this is progress, even if they (maybe) cherry picked the one trial that had the lowest difference in error rate between the algorithm and a human...
This is my sig, there are many like it but this one is mine
How do you get 100% accurate translation anyway? These things are up to interpretation, not all words have an exact translation, meaning is more important than the actual correct words. Language is ambiguous.
Also, not every error is equal: some are irrelevant while others matter more, e.g. "the quick car" and "the fast car" mean the same thing.
Transcription is obviously a lot more straightforward, and the goalposts should be pretty easy to set.
Let's not stir that bag of worms...
Question: how did they find the errors that the two-human team missed? Presumably with a third human. Does this mean a three-person team can beat out both a two-person team and ASR? Or was there a script that was used to generate the audio? That would raise other questions, such as the accuracy of the speakers.
I had the same question. We ran into a similar problem in a school project making an AI that interpreted results from a polysomnogram. In theory we got roughly 90% accuracy, but different humans would score the same sleep study differently, which basically meant that humans got about 90% accuracy compared to each other too.
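One way to put a number on "humans compared to each other" is to measure agreement between two scorers directly. The sketch below is not what the project above did; it just shows raw percent agreement plus Cohen's kappa (a standard statistic that discounts agreement expected by chance) on made-up sleep-stage labels. The function names and the sample data are hypothetical.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two raters gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same label
    # if each labels independently at their own observed frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up sleep-stage scores from two human raters for ten epochs.
rater_a = ["W", "N1", "N2", "N2", "N3", "REM", "REM", "N2", "W", "N2"]
rater_b = ["W", "N2", "N2", "N2", "N3", "REM", "N2", "N2", "W", "N2"]

print(percent_agreement(rater_a, rater_b))
print(round(cohens_kappa(rater_a, rater_b), 3))
```

If the humans only agree with each other some fraction of the time, that agreement level is effectively the ceiling any automated scorer can be measured against.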
Why should anyone trust Microsoft? They lie about surveillance, they lie about being "open", they lied about Windows 10 install options, they lied about Windows Tablet having a bigger screen than iPad, they lied to Steve Jobs about their GUI plans in the 80s, etc.
640 lies oughtta be enough for anyone. Ignore them by now.
Table-ized A.I.
I assume this is so the Govt agencies can transcribe cell-phone communications to text and then perform analysis to find all the "bad guys"?
I made this: http://www.bpftpserver.com
Hush! As long as MS exists, I have total job security!
We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
It's less them having trouble understanding me, it's more me having trouble understanding them. If MS built a speech recognition software that can translate the output of an Indian call center, my hat is off to them!
The machines can finally interpret our speech. Next step: launch all the missiles.
Based on that Twitter chat-bot that turned racist and trollish in a matter of hours? I have been looking for a way to UTF-TRUMP encode my documents!
"even when the human transcript is double-checked by a second human for accuracy"
Everything depends on how dumb the transcriber and/or checker is.
The acid test for transcription for me is if the transcriptionist gets the word "defuse" right, as in "He defused the tense situation." Every, and I mean EVERY, closed caption I've seen transcribes it as, "He diffused the tense situation." It seems to be the universal mistake.
Now the NSA can store text transcripts of your conversations instead of having to store the audio files. This will leave so much more room for video! Hey - why did you put tape on your webcam, citizen?
Seven puppies were harmed during the making of this post.
Like any human would think about milk or "open reminders" when hearing "Show me my most at-risk opportunities".
Eye thin queue meant two say:
Lie canny hue man wood thin cab out mill core "owe pen reminders" when he ring "Show meme I Moe stat risk copper tune it ease.
Of middle-class jobs about to go kaput.
Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
The humans had a 5.9% error rate AFTER proofreading by another person? That's either a lousy speaker, a terrible recording, or really bad transcription. That's not something to brag about, frankly. I used to get an error rate of under 2% with IBM ViaVoice back in 1994. This doesn't seem like progress to me.
https://www.youtube.com/watch?...
I thought it was bad the day I had to train some foreign workers up to replace me.
At least they were human. It'd be worse having to train up an AI to take your job...
No good when you live with a nutter who thinks they cause cancer.
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Jim: Hey there
Bot: Good day sir.
Jim: Semi colon drop table language
Bot:???????????
Have my criticism and observations upset you AC? Struck a nerve?
Ask me about my sig!
I have just this to say about that: folks, I wouldn't let alpha software out to users.
They brought in "hybrid" phones here last year (VOIP). For voicemail, it sends an mp3, and a "transcription". Frequently, the "transcription", "powered by Microsoft speech technology", resembles early "computer poetry". And by "early", I'm talking 1960s or '70s.... with significant portions bearing zero resemblance to what was said.
mark