Slashdot Mirror


Microsoft Speech Recognition Now As Accurate As Professional Transcribers (techcrunch.com)

An anonymous reader quotes TechCrunch: Microsoft announced today that its conversational speech recognition system has reached a 5.1% error rate, its lowest so far. This surpasses the 5.9% error rate reached last year by a group of researchers from Microsoft Artificial Intelligence and Research and puts its accuracy on par with professional human transcribers who have advantages like the ability to listen to text several times. Both studies transcribed recordings from the Switchboard corpus, a collection of about 2,400 telephone conversations that have been used by researchers to test speech recognition systems since the early 1990s. The new study was performed by a group of researchers at Microsoft AI and Research with the goal of achieving the same level of accuracy as a group of human transcribers who were able to listen to what they were transcribing several times, access its conversational context and work with other transcribers.
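
The "error rate" quoted for Switchboard results is normally a word error rate (WER): the word-level edit distance between the system output and a reference transcript, divided by the number of reference words. The following is a minimal sketch of that calculation, not Microsoft's scoring tool. The function name and the lowercase-and-split normalization are assumptions made for illustration; real scoring pipelines apply more elaborate text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word inserted in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "it is difficult to recognize speech",
    "it is difficult to wreck a nice beach",
)
print(f"WER: {wer:.3f}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, which is one reason a single percentage says little about how readable the output is.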

30 of 176 comments

  1. Laughable Hype by bwanagary · · Score: 5, Interesting

    On a daily basis in my work environment Microsoft technology is used to a) record voicemail and b) generate text from the speech.  Never, ever, have I received any converted voicemail that wasn't completely unintelligible gibberish.  Seriously.  This is utter nonsense.

    1. Re:Laughable Hype by avandesande · · Score: 4, Funny

      You should start talking with people who don't speak gibberish.

      --
      love is just extroverted narcissism
    2. Re:Laughable Hype by bobstreo · · Score: 3, Insightful

      You should start talking with people who don't speak gibberish.

      Yeah, but Mumbai is on the phone with us again...

    3. Re: Laughable Hype by Anonymous Coward · · Score: 2, Insightful

      We have an up-to-date Microsoft service doing this at my work. Accuracy is a running joke and I regularly forward people their transcriptions so we all get a good laugh. This might be lab-quality recordings with limitations on language complexity used to cut down on errors. The error rate of a closed-set test isn't really a great indicator. Now, a year-long comparison against several call centers in multiple industries would be quite compelling.

    4. Re:Laughable Hype by Luthair · · Score: 2

      No, context recognition would mean the correct word but wrong meaning. Buy and My are clearly distinct words with different pronunciation.

    5. Re:Laughable Hype by Luthair · · Score: 3, Insightful

      3) How much background noise? Are these from people calling from cell phones. Or a LAN line.

      Why does it matter? If it doesn't function in a standard operating environment then it isn't doing as claimed. What would you say to a watch maker who claimed their product was unscratchable but testing consisted of rubbing it with microfibre cloth?

    6. Re:Laughable Hype by pr0fessor · · Score: 3, Insightful

      3.... I've tried various voice recognition software over the years and can say they are getting much better but if there is any background noise forget it.

      I quit trying to use Siri because when I get in the car and ask Siri for directions while my wife is with me, I get Siri saying "I couldn't find 102 why the fuck street don't you type in the address like a regular shut up person damn it."

    7. Re: Laughable Hype by Chaset · · Score: 2

      I just read that as an IP phone connected to the LAN. I have one of those at work. It is theoretically better audio quality than the analog internal phone system it replaced. So cell phone=really bad, LAN line=really good audio quality.

      --
      -- "This world is a comedy to those who think, a tragedy to those who feel."
  2. Errors are not Errors by idji · · Score: 5, Insightful

    When a human transcriptionist makes a mistake you can usually work out what they meant. When Speech-to-text (STT) makes a mistake it is often gibberish. So objectively it is "better" at transcribing, but subjectively much worse.

    1. Re:Errors are not Errors by AmiMoJo · · Score: 4, Interesting

      Not any more. One of the ways that they got the accuracy up so high is by giving the machine an understanding of English and common phrases, similar to what a human has. It's been used for input correction on smartphones for a while too, e.g. with the Google keyboard it can correct the previous word based on the next one you type if it realizes that they don't make sense together.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
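
The idea of correcting a word based on its neighbours can be illustrated with a toy bigram language model that scores candidate words in context. This is only a sketch under stated assumptions: the tiny `CORPUS`, the add-one smoothing, and the helper names are invented for illustration, and production recognizers and keyboards use far larger, typically neural, models.

```python
from collections import Counter

# Hypothetical training text; a real system would learn from vastly more data.
CORPUS = [
    "i want to buy a new phone",
    "please buy a ticket for me",
    "this is my phone",
    "my ticket is in my bag",
    "configure nat on the router",
    "the router uses nat for the lan",
    "a gnat flew into my eye",
]


def train_bigrams(sentences):
    """Count single words and adjacent word pairs, with sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def score(words, unigrams, bigrams):
    """Add-one smoothed bigram probability of a whole word sequence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + words + ["</s>"]
    probability = 1.0
    for first, second in zip(padded, padded[1:]):
        probability *= (bigrams[(first, second)] + 1) / (unigrams[first] + vocab_size)
    return probability


def pick_best(left, candidates, right, unigrams, bigrams):
    """Choose the candidate that makes the full sequence most probable."""
    return max(candidates, key=lambda w: score(left + [w] + right, unigrams, bigrams))


unigrams, bigrams = train_bigrams(CORPUS)
print(pick_best(["i", "want", "to"], ["buy", "my"], ["a", "ticket"], unigrams, bigrams))
print(pick_best(["configure"], ["nat", "gnat"], ["on", "the", "router"], unigrams, bigrams))
```

The right-hand context matters here: the candidate is scored together with the words that follow it, which is the same reason a keyboard can revise the previous word once the next one is typed.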
    2. Re:Errors are not Errors by jellomizer · · Score: 4, Informative

      Normally we have transcriptionists who are trained in a particular area to understand the context of the message. A legal transcriptionist requires different training than a medical transcriptionist.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    3. Re:Errors are not Errors by AmiMoJo · · Score: 4, Interesting

      It's more than just syntax and grammar rules. For example, Google has been mining the web for that kind of knowledge. You can see it in Google Translate sometimes. It generates suggestions for your input, and sometimes screws up like thinking "alot" is a word. It also uses colloquialisms in its output, which again it gathered from analysis of the web and which doesn't fit standard grammar or syntax rules.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    4. Re:Errors are not Errors by K.+S.+Kyosuke · · Score: 3, Insightful

      Hey, it's going to cost $700 per minute but at least there will be no errors!

      So it's about three times cheaper than the lawyer that you'd need if you get sued for a bad transcription?

      --
      Ezekiel 23:20
    5. Re:Errors are not Errors by gnick · · Score: 4, Insightful

      A legal transcriptionist requires different training than a Medical Transcriptionist.

      And sometimes even that training falls short. Does anyone remember the explosion at WIPP when the tech transcribed "an organic kitty litter" instead of "inorganic kitty litter"?
      Kitty litter explosion.

      --
      He's getting rather old, but he's a good mouse.
    6. Re:Errors are not Errors by hord · · Score: 2

      The way the machine learning databases are built, it does understand what is being said. That's why it is so effective. This happens through the connections that are built inside the neural network along with the architecture of the network itself. They are now using context-sensitive data labeling to assign specific meaning to words that are generally ambiguous based on the text around these words. The neural net can learn over time which combinations of words are likely to fall within specific categories and use this as a basis for translation later on.

      There are several teams working on this and they have been publishing papers for a while now. The press is probably just picking up on it. Google's translation service is said to have increased in accuracy due to the same general principles. Facebook is looking into it for better social data aggregation. Various academic teams are doing it for prestige and to add to the field. Lots of good work here, actually, although there is still far, far to go.

    7. Re:Errors are not Errors by hord · · Score: 3, Interesting

      I'm not a statistician but it's possible that once you can prove that the neural network can produce answers at a success rate higher than humans you would be introducing error by allowing humans to review it. I'm not saying it shouldn't be done but this is one of the weird questions that people will have to ask on a case-by-case basis as these technologies are applied to real problems.

    8. Re:Errors are not Errors by SeattleLawGuy · · Score: 2

      Hey, it's going to cost $700 per minute but at least there will be no errors!

      So it's about three times cheaper than the lawyer that you'd need if you get sued for a bad transcription?

      This will eventually bring down the costs of lawsuits by making court reporters less common, but that may take a few decades.

      Not many lawyers are $700 per minute. Even $700 per hour is rare.

      And do you know how much we have to pay to go through law school and have our senses of humor surgically removed?

      --
      Real lawyers write in C++
    9. Re:Errors are not Errors by djinn6 · · Score: 3, Interesting

      The way the machine learning databases are built, it does understand what is being said.

      I think the word "understand" has a more general meaning than what you wrote later on. For it to understand what was being said, beyond making grammatical sense of the sentence, it needs to know the abstract concepts behind the words and be able to manipulate them.

      For example:

      Jeff is a software engineer, Kate is a software engineer, and Larry is also ...

      Can you finish the sentence?

      Most humans could do it with a high degree of accuracy. Some might even find the obvious answer so boring that they try for a more creative one. However, ML is still very far from that.

      Since it does not grasp the abstract concepts, its transcription is much more likely to lose meaning than a human transcriber. When talking about network technology for example, a human will not mis-transcribe "NAT" to "gnat", while a machine will.

  3. Using it to post on slashdot by Harald+Paulsen · · Score: 4, Funny

    holyfield is these all of this was made worse by the fact that i had these birds skilled estimate uh... supplying itself what's your special prom to prevent fraud reform
    thoughtfulness julia roberts police comments entry drug connections predicting that nighttime beating

    --
    Harald
  4. Bad experiences on this front by CustomSolvers2 · · Score: 4, Interesting

    Some months ago, I did some tests with speech recognition software, and my conclusion was that it is still too unreliable. My intention was to develop an application allowing me to write moderately complex code by voice (creating files and folders, including proper indentation, recognising functions, variables and other basic elements, etc.); basically, allowing me to write/edit the main parts of a random algorithm in a certain language without touching the keyboard. I tested Microsoft's built-in functionality (and used one of Microsoft's .NET programming languages) and it wasn't even close to what a "5.9% error rate" seems to indicate (almost perfect?).

    In defence of the software, I have to say that my English accent isn't exactly excellent (some people say that it is "too thick" and other people just say "what?". LOL) and honestly I make very little effort to pronounce properly. But this is also the problem with speech recognition: it is mostly focused on a specific language/accent/intonation. I was doing my tests in an English Windows version and this was the language for the default speech recognition (and adding a different one wasn't exactly straightforward).

    I perfectly understand the complexity associated with developing a reliable enough piece of software delivering what I was expecting, but this is precisely the reason why I looked for existing solutions rather than developing everything myself (which I do pretty often). In any case, my impression is that you still cannot expect good enough reliability from (Microsoft's) speech recognition software, much less when mixing languages/accents (a particularly problematic situation: including Spanish words when talking in English). I might give all this another shot next year, though.

    --
    Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
    1. Re:Bad experiences on this front by Baron_Yam · · Score: 3, Interesting

      5.9% means it still gets more than 1 in 20 things wrong. That's a LOT when you're feeding the information into a system that requires pretty much a 0% error rate.

      Second, there's a huge difference between standard language and specialist syntax. With programming, you're likely going to want a LOT of special formatting that you can type without thinking but it's cumbersome to communicate via speech in a way that won't confuse a speech recognition engine.

      And finally - so long as they don't have a related disability - a proficient typist can already type about as fast as they can form decent code in their head. With a bit of 'mousework' for selection and cut-and-paste I don't see speech ever becoming the superior entry method unless and until we have genuine AI that understands your intent rather than your words.

      It might be nice to use speech as a macro-invoker, though.
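
To put the 5.9% figure in perspective for input that tolerates no mistakes, here is a small worked calculation. It assumes, purely for illustration, that errors hit each word independently at the quoted rate (real recognition errors tend to cluster), and both function names are hypothetical.

```python
def expected_words_per_error(word_error_rate: float) -> float:
    """Average number of words between errors at a given per-word error rate."""
    return 1.0 / word_error_rate


def error_free_probability(word_error_rate: float, n_words: int) -> float:
    """Chance that n_words in a row are all correct, assuming independent errors."""
    return (1.0 - word_error_rate) ** n_words


for rate in (0.059, 0.051):
    print(
        f"rate {rate:.1%}: one error per ~{expected_words_per_error(rate):.1f} words, "
        f"20-word passage fully correct with p = {error_free_probability(rate, 20):.2f}"
    )
```

Under that independence assumption, a 20-word passage comes out entirely correct less than a third of the time at 5.9%, which is why a rate that sounds small is still far from usable for dictating code.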

    2. Re:Bad experiences on this front by CustomSolvers2 · · Score: 2

      I was sharing my personal experience on this front, not implying that the outputs of this research have anything to do with current commercial accuracy. I personally found the high number of errors kind of surprising (I'm not too much into voice-based anything, but from what I see and read everywhere I was kind of expecting something different) and merely posted about that experience.

      --
      Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
    3. Re:Bad experiences on this front by ranton · · Score: 2

      I can't use voice recognition to send a text without 3-5 attempts. And I don't have a hard accent.

      It is very odd that you have such a low success rate with voice recognition. At least 2/3 of my voice texts can be sent without editing, and most of the errors have to do with proper names. Are you sure you don't have an accent? My wife mumbles pretty bad when talking fast (so bad I don't like talking with her on the phone most of the time) but even she has a pretty easy job using voice to text now. It was pretty bad a few years ago but it really is amazing how much better it has become.

      --
      -- All that is necessary for the triumph of evil is that good men do nothing. -- Edmund Burke
    4. Re:Bad experiences on this front by ziggystarsky · · Score: 2

      The reported error rate is for conversational English. This means that you cannot throw meaningless words at it. Modern speech recognition exploits grammatical and semantic structure. The stock recognizers can't do this for programming languages. You could train the model on a programming language, and certain constructs (like brackets, if-then-else) would see an improvement in recognition.

  5. "As Accurate As Professional Transcribers" by Anonymous Coward · · Score: 5, Funny

    "As Accurate As Professional Transcribers..."

    They left out "from Uzbekistan transcribing Navajo - underwater".

    Never trust anything Clippy say.

  6. Microsoft Speech Recognition Now As Accurate - Say by WeBMartians · · Score: 3, Interesting

    If it can recognize "It's difficult to wreck a nice beach", I'll be thoroughly 'whelmed'.

  7. In which environment? by Opportunist · · Score: 2

    In a soundproof studio built for sound recording, spoken by someone with speech training?

    Or in an environment with 30 people talking in the background, an air conditioner running, doors and drawers slamming, people laughing, feet and chairs shuffling across the floor, some photocopiers that got their last service before Bush left office whining for hours, and a person speaking into the phone while at the same time talking to coworkers, where you're expected to know which words belong to you and which ones are directed at someone else?

    Aka "open plan office".

    --
    We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
  8. On the down side by fahrbot-bot · · Score: 2

    It still showed up at the South Park "Save Films from their Directors" club for the wrong reason when it heard, "Free Hat".

    (For those that aren't South Park followers...)

    Cartman writes "Free Hat" on the advertising poster in the belief that freebies are necessary to attract people. However, the crowd mistakenly thinks the rally is to free Hat McCullough, a convicted baby killer they believe was innocent.

    Now thinking that "Free Hat" would be a great name for one of those Windows App Store pirate streaming apps ...

    --
    It must have been something you assimilated. . . .
  9. Hype, more hype, and maybe outright lies by Rick+Schumann · · Score: 2

    If you believe Microsoft without independent verification from an otherwise uninterested third-party who has no investment in the outcome, then you're a fool.

  10. 5% by MMC+Monster · · Score: 2

    One in 20 words is wrong?

    How can a human transcriptionist be that bad?

    --
    Help! I'm a slashdot refugee.