Slashdot Mirror


Microsoft Claims Its Speech Transcription AI is Now Better Than Human Professionals (qz.com)

Microsoft announced today a system that can transcribe the content of a phone call with "the same or fewer errors" than actual human professionals trained in transcription -- even when the human transcript is double-checked by a second human for accuracy. As you can imagine, this is a huge milestone for speech recognition. From a Quartz report: The team doesn't attribute this achievement to any breakthrough in algorithm or data, but to the careful tuning of existing AI architectures. To test how their algorithm stacked up against humans, first researchers had to get a baseline. Microsoft hired a third-party service to tackle a piece of audio for which they had a confirmed 100 percent accurate transcription. The service worked in two stages: one person types up the audio, and then a second person listens to the audio and corrects any errors on the transcript. Based on the correct transcript for the standardized tests, the professionals had 5.9 percent and 11.3 percent error rates. After learning from 2,000 hours of human speech, Microsoft's system went after the same audio file -- and scored 5.9 percent and 11.1 percent error rates. That minute difference ends up being about a dozen fewer errors. Microsoft's next challenge is making this level of speech recognition work in noisier environments, like in a car or at a party. This implementation is crucial for Microsoft, and goes well beyond just transcription.
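
Error rates like these are commonly reported as word error rate (WER): the word-level edit distance between a transcript and the reference, divided by the number of words in the reference. The report does not say exactly how these tests were scored, so the following is only a minimal sketch under that assumption; the function name and the sample sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "that is a fast car"
hypothesis = "that is a fat car"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Note that every substitution, insertion, and deletion counts the same here, so a harmless slip and a meaning-changing one are scored identically.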

57 of 98 comments

  1. 11.1 vs. 11.3 percent by Anonymous Coward · · Score: 1

    That minute difference ends up being about a dozen fewer errors.

    If 0.2% is a dozen, then 1% is sixty, so 100% is six thousand errors.

    Yikes.
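
The same back-of-the-envelope arithmetic can be read as an estimate of the test size rather than of the error count. This is a rough sketch that assumes both rates are errors divided by the same number of words, and that "about a dozen" means exactly 12; neither is confirmed by the summary.

```python
# Figures quoted in the summary: 11.3% (humans) vs 11.1% (Microsoft),
# with the gap described as "about a dozen fewer errors".
human_rate = 0.113
machine_rate = 0.111
fewer_errors = 12  # "about a dozen", so only approximate

# Assumption: both rates are errors divided by the same word count.
implied_words = fewer_errors / (human_rate - machine_rate)
print(f"Implied test size: about {implied_words:,.0f} words")
print(f"Human errors:   about {human_rate * implied_words:.0f}")
print(f"Machine errors: about {machine_rate * implied_words:.0f}")
```

Under those assumptions the test would be a few thousand words long, with both the humans and the machine making several hundred errors each.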

  2. Right ... by scunc · · Score: 4, Funny

    I'll believe that when I ducking see it.
    --
    This comment was transcribed by Microsoft's new AI transcription software.

    1. Re:Right ... by Bongo · · Score: 1

      I've taken to typing and saying "ducking" all the time anyway. Soon to be added as a new meaning in the dictionaries.

      Those ducks, always up to something nasty.

    2. Re:Right ... by Quirkz · · Score: 1

      I've taken to typing and saying "ducking" all the time anyway. Soon to be added as a new meaning in the dictionaries.

      Those ducks, always up to something nasty.

      I used to have an office that overlooked a river. I can't speak for all ducks, but the resident mallards ... yes, they were almost always up to those types of things.

    3. Re:Right ... by RockDoctor · · Score: 1

      Those ducks, always up to something nasty.

      Homosexual necrophiliac rape, if I recall correctly.

      Moeliker, C.W., 2001 - The first case of homosexual necrophilia in the mallard Anas platyrhynchos (Aves: Anatidae) - DEINSEA 8: 243-247 [ISSN 0932-9308]. Published 9 November 2001

      Yes, I do remember correctly, and it was indeed a Mallard doing the deed (and being done-unto, too).

      Almost unremarkable that it was a Dutch report, and was considered so remarkable that it took 6 years from event to publication.

      I'd not actually read TFP on this - though I knew of it. For future reference, the journal is "DEINSEA- ANNUAL OF THE NATURAL HISTORY MUSEUM, ROTTERDAM P.O. Box 23452, NL-3001 KL, Rotterdam, The Netherlands" and they keep the paper here.

      --
      Birds are not dinosaur descendants; birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
  3. Voice Control by Rockoon · · Score: 5, Insightful

    If you want voice input to be more than just a toy, then getting near flawless accuracy here seems to be a required first step.

    If your mouse occasionally sent an erroneous input to the computer no matter how careful you were, you wouldn't use it so much.

    --
    "His name was James Damore."
    1. Re:Voice Control by TFlan91 · · Score: 2

      Agreed; however, people down south don't move their mouse with "the typical hospitallllity of us folk 'round here" as opposed to the people up north who couldn't give a rat's ass.

      Speech is incredibly dense to parse. Where near-perfect operation is required for a mouse, voice control can hit a couple of bumps in its road before (and while) being widely adopted.

    2. Re:Voice Control by stephanruby · · Score: 2

      If your mouse occasionally sent an erroneous input to the computer no matter how careful you were, you wouldn't use it so much.

      Wrong example. Mouse usability requires constant visual feedback and almost constant human correction. That is the reason why we can't really use a mouse without looking directly at the screen.

      In any case, flawless transcription accuracy of one single human voice out of 7.5 billion voices already happens with Google Voice. The problem occurs when Google Voice is not tuned to the voices of the other 7.49999 billion people. Do you think that's what Microsoft is using in the backend this second time around?

    3. Re:Voice Control by Opportunist · · Score: 1

      This!

      We have input today that is perfect. More important, we sometimes have to do input that can break hours if not days of work if executed wrongly. Hitting the wrong key at the wrong time can at least be chalked up to human error. Saying "down" to scroll and having it interpreted as "shutdown" (along with the frustrated "NO, dammit" being interpreted as the answer to "save work (y/n)?") is more a problem of the input parser than of the human in front of the screen.

      Unless it is AT LEAST at par with other means of input, there is very little reason for anyone to switch.

      --
      We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
    4. Re:Voice Control by GameboyRMH · · Score: 1

      If your mouse occasionally sent an erroneous input to the computer no matter how careful you were, you wouldn't use it so much.

      And yet touchpads are still vastly more common on laptops than trackpoints...

      --
      "When information is power, privacy is freedom" - Jah-Wren Ryel
  4. any better than "Show me to buy milk"? by itsme1234 · · Score: 2

    Like any human would think about milk or "open reminders" when hearing "Show me my most at-risk opportunities".

    1. Re:any better than "Show me to buy milk"? by LynnwoodRooster · · Score: 1

      "Show me my most at-risk opportunities".

      Huh, you mean Xiaomi is coming out with moist asterisks? How very interesting!

      --
      Browsing at +1 - no ACs, I ignore their posts. So refreshing!
  5. Strange success criteria by DoofusOfDeath · · Score: 1, Interesting

    Dialog windows: "Do you want to register for your FREE Windows 10 Upgrade?"

    Me (vocally): "No, no... oh, for the love of all that's sacred, NO!"

    Windows: "This may take a while. Please do not power down your computer ..."

    1. Re:Strange success criteria by iggymanz · · Score: 1

      customer relations record: the customer loves windows as if it's the most sacred thing to him

  6. Now put it to good use! by cmiller173 · · Score: 4, Informative

    Automated closed captioning for the hearing impaired would be one. I'm not hearing impaired, but I use the CC system with the volume low when I am watching TV while everyone else in the house is sleeping. I also use it when everyone is awake and noisy. It is amazing how awful some CC can be.

    1. Re:Now put it to good use! by yagu · · Score: 1

      It is amazing how awful some CC can be.

      At first I thought, based on your post you'd really meant to say: "It is amazing how awesome CC can be."

      Interestingly, both are true.

    2. Re:Now put it to good use! by pipingguy · · Score: 1

      Yes, I've noticed this too. I've often wondered if some CC is done by machine or just illiterates.

    3. Re:Now put it to good use! by CODiNE · · Score: 1

      Oh yes, my body is ready.

      And please make an API for all those horrible podcast and audioblog sites out there that make me miss out on industry trends.

      And maybe... talk to Google about YouTube CC.
      *blech!*

      --
      Cwm, fjord-bank glyphs vext quiz
    4. Re:Now put it to good use! by Anne Thwacks · · Score: 1

      by machine or just illiterates

      No. This is a whole new technology: artificial stupidity. It's going to change the world, I tell you. (Mostly for the worse, I suspect!)

      --
      Sent from my ASR33 using ASCII
    5. Re:Now put it to good use! by somenickname · · Score: 1

      I'd love to see a YouTube feature that allows you to get the automatically generated transcript of a video without having to actually watch the video. For videos that are intended to be informative, having the transcript and grepping it for keywords and the context they are used in would help you determine if it's worth watching a lengthy video. It might even just outright give you the information you want without having to sit through a half-hour video.
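
Assuming a transcript were available as plain text, the grep-with-context idea could look something like this. It is only a sketch: `keyword_in_context` and the sample transcript are made up for illustration, and no real caption API is involved.

```python
import re


def keyword_in_context(transcript: str, keyword: str, width: int = 40) -> list[str]:
    """Return each case-insensitive whole-word hit with `width` characters of context."""
    snippets = []
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    for match in pattern.finditer(transcript):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(transcript))
        snippets.append(transcript[start:end].strip())
    return snippets


# Hypothetical transcript text; a real one would come from a caption file.
transcript = (
    "Today we look at speech recognition. The error rate of the system "
    "matters more than its speed. Later we compare the error rate of "
    "human transcribers."
)
for snippet in keyword_in_context(transcript, "error rate", width=20):
    print("...", snippet, "...")
```

How useful this is depends entirely on the transcript: a keyword the captioner got wrong will simply never match.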

    6. Re:Now put it to good use! by iczer1 · · Score: 1

      Caption fails (old but funny):

      Make a short skit, act it out, take the CC output and redo the skit with the new words.

      https://www.youtube.com/watch?...

    7. Re:Now put it to good use! by antdude · · Score: 1

      I wish more of those CCs were manually typed out by humans.

      --
      Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
    8. Re:Now put it to good use! by RabidReindeer · · Score: 1

      Based on Spanish-language soundtracks, there's no doubt that some CC is human-generated. On my Stargate discs, the foreign-language captions aren't even saying the same sentences as the alternate-language voices.

    9. Re:Now put it to good use! by RabidReindeer · · Score: 1

      I've begun to suspect that YouTube is often used by the lazy and illiterate to avoid actually taking the effort to type and format what should realistically have been text articles.

    10. Re:Now put it to good use! by AmiMoJo · · Score: 1

      It should really improve YouTube too. Having an accurate transcription of a video means it becomes much more searchable than if all you have is the title and summary text. The current automatic transcription on YouTube is nearly useless.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    11. Re:Now put it to good use! by EndlessNameless · · Score: 1

      It will be decades before artificial stupidity is anywhere near natural stupidity on any metric.

      Natural stupidity is surprisingly flexible and resilient---it can crop up anywhere and is almost impossible to stop.

      Artificial stupidity requires significant investment and evolutionary design before it can approach the persistence and impact we see naturally.

      --

      ---
      According to the latest ruleset, this post should be modded as Vorpal Flamebait +5.
    12. Re:Now put it to good use! by Quirkz · · Score: 1

      Yeah, came here to say that. We usually have ours on, and I can't seem to resist reading it. The frequency of errors and quirks is such that I've nearly started making a list of the worst ones. Any show from England tends to have "[indecipherable]" stuck in repeatedly, even when I would have said the language was perfectly clear.

      One of my favorites was "read my copy of At Last Shrub" which turned out to be "Atlas Shrugged".

  7. C'mon guys by diesalesmandie · · Score: 1

    Say what you want about Microsoft (and some of it is true) but this is progress, even if they (maybe) cherry picked the one trial that had the lowest difference in error rate between the algorithm and a human...

    --
    This is my sig, there are many like it but this one is mine
  8. They have a 100% accurate translation? by ewibble · · Score: 1

    How do you get 100% accurate translation anyway? These things are up to interpretation, not all words have an exact translation, meaning is more important than the actual correct words. Language is ambiguous.

    Also, not every error is equal: some errors are irrelevant, while some are more important, e.g. "the quick car" and "the fast car" mean the same thing.

    1. Re:They have a 100% accurate translation? by saider · · Score: 1

      Quick and fast are easier to discriminate than "fast" and "fat".

      Consider the following iterative algorithm...

      "That is a fast car" - is translated to
      That is a fat car *Context filter - strict vs slang - replace fat with phat*
      *Context filter - apply ghetto style - replace "That" with "Dat"*
      *Context filter - apply ghetto style - replace "is" with "be"*
      *Context filter - apply ghetto style - replace "a" with "one"*
      That be one phat car.

      --
      Remember, You are unique...just like everyone else.
    2. Re:They have a 100% accurate translation? by RabidReindeer · · Score: 1

      How do you get 100% accurate translation anyway? These things are up to interpretation, not all words have an exact translation, meaning is more important than the actual correct words. Language is ambiguous.

      Also, not every error is equal: some errors are irrelevant, while some are more important, e.g. "the quick car" and "the fast car" mean the same thing.

      Who gave you free reign to make such assertions? You need to tow the line or we'll see to it that you loose your posting privileges here!

  9. The subject is "transcription" not "translation". by JMZero · · Score: 1

    Transcription is obviously a lot more straightforward, and the goalposts should be pretty easy to set.

    --
    Let's not stir that bag of worms...
  10. Re:Cherry picked by dpidcoe · · Score: 1

    Question: how did they find the errors that the two-human team missed? Presumably with a third human. Does this mean a three-person team can beat out both a two-person team and ASR? Or was there a script that was used to generate the audio? That would raise other questions, such as the accuracy of the speakers.

    I had the same question. We ran into a similar problem in a school project making an AI that interpreted results from a polysomnogram. In theory we got over ~90% accuracy, but different humans would score the same sleep study differently, which basically meant that humans got 90% accuracy compared to each other too.
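
The point about humans only agreeing with each other about 90% of the time can be made concrete by scoring every rater against every other rater instead of against a single "truth". The labels below are invented for illustration and are not from the project described above.

```python
from itertools import combinations


def agreement(a: list[str], b: list[str]) -> float:
    """Fraction of positions where two label sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sleep-stage labels for the same ten epochs of one study.
scorers = {
    "human_1": ["W", "N1", "N2", "N2", "N3", "N3", "N2", "REM", "REM", "W"],
    "human_2": ["W", "N1", "N2", "N2", "N3", "N2", "N2", "REM", "REM", "W"],
    "model":   ["W", "N2", "N2", "N2", "N3", "N3", "N2", "REM", "REM", "W"],
}

for (name_a, labels_a), (name_b, labels_b) in combinations(scorers.items(), 2):
    print(f"{name_a} vs {name_b}: {agreement(labels_a, labels_b):.0%}")
```

If the model agrees with each human about as often as the humans agree with each other, the "accuracy" figure says as much about the reference as about the model.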

  11. Microsoft Lies. Case Closed. by Tablizer · · Score: 1

    Why should anyone trust Microsoft? They lie about surveillance, they lie about being "open", they lied about Windows 10 install options, they lied about Windows Tablet having a bigger screen than iPad, they lied to Steve Jobs about their GUI plans in the 80s, etc.

    640 lies oughtta be enough for anyone. Ignore them by now.

    1. Re:Microsoft Lies. Case Closed. by David_Hart · · Score: 1

      Why should anyone trust Microsoft? They lie about surveillance, they lie about being "open", they lied about Windows 10 install options, they lied about Windows Tablet having a bigger screen than iPad, they lied to Steve Jobs about their GUI plans in the 80s, etc.

      640 lies oughtta be enough for anyone. Ignore them by now.

      Just to begin with, they have been working on this for a while...

      https://www.youtube.com/watch?...

  12. Govt Surveillance by mcolgin · · Score: 3, Insightful

    I assume this is so the Govt agencies can transcribe cell-phone communications to text and then perform analysis to find all the "bad guys"?

    --
    I made this: http://www.bpftpserver.com
  13. Re:Microsoft? by Opportunist · · Score: 3, Funny

    Hush! As long as MS exists, I have total job security!

    --
    We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
  14. Re:Obligatory (i.e. doing the needful) by Opportunist · · Score: 1

    It's less them having trouble understanding me, it's more me having trouble understanding them. If MS built a speech recognition software that can translate the output of an Indian call center, my hat is off to them!

    --
    We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
  15. Finally by Sperbels · · Score: 1

    The machines can finally interpret our speech. Next step: launch all the missiles.

    1. Re:Finally by John.Banister · · Score: 1

      I'm sorry, the missus can't do launch today. It's laundry day. Clippy says she might have time to come round at 11:45 tomorrow. Would that work?

  16. Is it.... by downright · · Score: 1

    based on that twitter chat-bot that turned racist and trollish in a matter of hours? I have been looking for a way to UTF-TRUMP encode my documents!

  17. "Humans" by pipingguy · · Score: 1

    "even when the human transcript is double-checked by a second human for accuracy"

    Everything depends on how dumb the transcriber and/or checker is.

  18. Defused by John Jorsett · · Score: 3, Interesting

    The acid test for transcription for me is if the transcriptionist gets the word "defuse" right, as in "He defused the tense situation." Every, and I mean EVERY, closed caption I've seen transcribes it as, "He diffused the tense situation." It seems to be the universal mistake.

    1. Re:Defused by gustygolf · · Score: 1

      My test goes like this:

      Dear Aunt
      Let's set so double the killer delete select all.

      --
      "Slow Down Cowboy! It's been 58 minutes since you last successfully posted a comment" -- slashdot, driving users away.
    2. Re:Defused by ChrisMaple · · Score: 1

      FWIW, before common usage overwhelmed the original creation, "jigga" was the correct pronunciation of "giga".

      --
      Contribute to civilization: ari.aynrand.org/donate
    3. Re:Defused by well_in_theory · · Score: 1

      What gets to me more is the choice of how to pronounce the value...

      No self-respecting scientist would ever say "one point twenty one". That's "one point two one." Or is 1.201 "one point two hundred and one" and thus more?

  19. I'm sure the NSA will be happy by Dunbal · · Score: 1

    Now the NSA can store text transcripts of your conversations instead of having to store the audio files. This will leave so much more room for video! Hey - why did you put tape on your webcam, citizen?

    --
    Seven puppies were harmed during the making of this post.
  20. Say what? by Anonymous Coward · · Score: 1

    Like any human would think about milk or "open reminders" when hearing "Show me my most at-risk opportunities".

    Eye thin queue meant two say:
    Lie canny hue man wood thin cab out mill core "owe pen reminders" when he ring "Show meme I Moe stat risk copper tune it ease.

    1. Re:Say what? by Anne Thwacks · · Score: 1

      Eye thin queue meant two say: Lie canny hue man wood thin cab out mill core "owe pen reminders" when he ring "Show meme I Moe stat risk copper tune it ease.

      Hey! It looks like you have obtained illegal access to the system used to caption news broadcasts!

      --
      Sent from my ASR33 using ASCII
  21. Well, there's a whole bunch by rsilvergun · · Score: 1

    Of middle-class jobs about to go kaput.

    --
    Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
  22. What humans were these? by Anonymous Coward · · Score: 1

    The humans had a 5.9% error rate AFTER proofreading by another person? That's either a lousy speaker, a terrible recording, or really bad transcription. That's not something to brag about, frankly. I used to get an error rate of under 2% with IBM ViaVoice back in 1994. This doesn't seem like progress to me.

  23. Classic speech recognition failure by iczer1 · · Score: 1

    Dear aunt, let's set so double the killer delete select all

    https://www.youtube.com/watch?...

  24. Re: Microsoft? by vivian · · Score: 1

    I thought it was bad the day I had to train some foreign workers up to replace me.
    At least they were human. It'd be worse having to train up an AI to take your job...

  25. Re:Wireless Headphones vs CC smh by Hognoxious · · Score: 1

    No good when you live with a nutter who thinks they cause cancer.

    --
    Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  26. Speech Injection by gizmod · · Score: 1

    Jim: Hey there
    Bot: Good day sir.
    Jim: Semi colon drop table language
    Bot:???????????

  27. Re: Self-Reflection by InsertCleverUsername · · Score: 1

    Have my criticism and observations upset you AC? Struck a nerve?

    --
    Ask me about my sig!
  28. ROTFLMAO! by whitroth · · Score: 1

    I have just this to say about that: folks, I wouldn't let alpha software out to users.

    They brought in "hybrid" phones here last year (VOIP). For voicemail, it sends an mp3, and a "transcription". Frequently, the "transcription", "powered by Microsoft speech technology", resembles early "computer poetry". And by "early", I'm talking 1960s or '70s.... with significant portions bearing zero resemblance to what was said.

            mark