Microsoft Speech Recognition Now As Accurate As Professional Transcribers (techcrunch.com)
An anonymous reader quotes TechCrunch:
Microsoft announced today that its conversational speech recognition system has reached a 5.1% error rate, its lowest so far. This surpasses the 5.9% error rate reached last year by a group of researchers from Microsoft Artificial Intelligence and Research and puts its accuracy on par with professional human transcribers who have advantages like the ability to listen to text several times. Both studies transcribed recordings from the Switchboard corpus, a collection of about 2,400 telephone conversations that have been used by researchers to test speech recognition systems since the early 1990s. The new study was performed by a group of researchers at Microsoft AI and Research with the goal of achieving the same level of accuracy as a group of human transcribers who were able to listen to what they were transcribing several times, access its conversational context and work with other transcribers.
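For anyone wondering what a figure like "5.1% error rate" measures: speech recognition results on corpora like Switchboard are normally reported as word error rate (WER), i.e. the number of word substitutions, deletions and insertions needed to turn the system's output into the reference transcript, divided by the number of reference words. Here is a minimal sketch in Python, assuming plain whitespace tokenisation and no text normalisation (real scoring tools handle casing, punctuation and hesitations; the function name `word_error_rate` is just for illustration):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "it is hard to recognize speech"
    hyp = "it is hard to wreck a nice beach"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

Note that WER counts every wrong word equally, so it says nothing about whether the mistakes leave the sentence intelligible.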
On a daily basis in my work environment Microsoft technology is used to a) record voicemail and b) generate text from the speech. Never, ever, have I received any converted voicemail that wasn't completely unintelligible gibberish. Seriously. This is utter nonsense.
When a human transcriptionist makes a mistake you can usually work out what they meant. When Speech-to-text (STT) makes a mistake it is often gibberish. So objectively it is "better" at transcribing, but subjectively much worse.
holyfield is these all of this was made worse by the fact that i had these birds skilled estimate uh... supplying itself what's your special prom to prevent fraud reform
thoughtfulness julia roberts police comments entry drug connections predicting that nighttime beating
Harald
Some months ago, I did some tests with speech recognition software and my conclusion was that it is still too unreliable. My intention was to develop an application allowing me to write moderately complex code by voice (creating files and folders, including proper indentation, recognising functions, variables and other basic elements, etc. Basically, allowing me to write/edit the main parts of a random algorithm in a certain language without touching the keyboard). I tested Microsoft's built-in functionality (and used one of Microsoft's .NET programming languages) and it wasn't even close to what a "5.9% error rate" seems to indicate (almost perfect?).
In defence of the software, I have to say that my English accent isn't exactly excellent (some people say that it is "too thick" and other people just say "what?". LOL) and honestly I make very little effort to pronounce properly. But this is also the problem with speech recognition: it is mostly focused on a specific language/accent/intonation. I was doing my tests in an English Windows version and this was the language for the default speech recognition (and adding a different one wasn't exactly straightforward).
I do perfectly understand the complexity associated with developing a reliable enough piece of software delivering what I was expecting; but this is precisely the reason why I looked for existing solutions rather than developing everything myself (which I do pretty often). In any case, my impression is that you still cannot expect good enough reliability from (Microsoft's) speech recognition software, much less when mixing languages/accents (a particularly problematic situation: including Spanish words when talking in English). I might give all this another shot next year though.
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
"As Accurate As Professional Transcribers..."
They left out "from Uzbekistan transcribing Navajo - underwater".
Never trust anything Clippy says.
They should do tests using modern hardware. For example the speech recognition on iOS seems to be pretty good. If they can get this technology into windows 10 that would be awesome. Oh I dictated this using iOS.
The NSA would love this. Keyword scanning of 95% of what's spoken in phone conversations (given enough processing power to transcribe them all).
Better known as 318230.
Just make sure you run it on an air gapped computer if you want your conversation to remain private.
Seven puppies were harmed during the making of this post.
I worked as a professional transcriber in the legal profession, actually employed by the government. 95% accuracy would be 1 mistake in 20 words, an error almost every 2 lines. For the standard we had to type to, an error every 2 pages would be unacceptable. These transcripts are admissible evidence in court as an exception to hearsay rules and people's lives hang on the accuracy of them. The transcripts themselves are also literally the law of the land (I live in a common law jurisdiction, so my transcript is literally legally binding law and a printout of my transcript is admissible for that purpose as well). Imagine a 5% error rate in that.
Also, judges always speak "The Queen's English". How is this algorithm going to translate what they really say into proper language suitable for a judicial order? I'd also love to see how it deals with technical jargon; for example, citations that are spoken in all sorts of haphazard ways yet must be typed in a specific format.
And this doesn't even factor in the thick accents many people use that are almost unrecognizable by the best humans. How is a computer going to deal with that?
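To put numbers on the "1 mistake in 20 words" point above, here is a rough back-of-the-envelope sketch. The layout figures are assumptions, not anything from a transcription standard: 10 words per line is inferred from "an error almost every 2 lines", and 32 lines per page is the page size quoted elsewhere in this thread. The function names are hypothetical.

```python
# Assumed layout figures, for illustration only.
WORDS_PER_LINE = 10   # implied by "1 in 20 words" being "almost every 2 lines"
LINES_PER_PAGE = 32   # page size mentioned elsewhere in the thread


def words_per_error(error_rate: float) -> float:
    """Average number of words between errors at a given word error rate."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return 1.0 / error_rate


def expected_errors_per_page(
    error_rate: float,
    words_per_line: int = WORDS_PER_LINE,
    lines_per_page: int = LINES_PER_PAGE,
) -> float:
    """Expected number of wrong words on one page of transcript."""
    return error_rate * words_per_line * lines_per_page


if __name__ == "__main__":
    for rate in (0.059, 0.051, 0.05):
        print(
            f"{rate:.1%} error rate: one error every "
            f"{words_per_error(rate):.1f} words, about "
            f"{expected_errors_per_page(rate):.1f} errors per page"
        )
```

Under those assumptions a 5% error rate works out to roughly 16 wrong words on every page, against a standard where one error every 2 pages is already unacceptable.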
At work we have a cloud-based Outlook that transcribes voicemail to text. It's so comically inaccurate that we sometimes forward the results to the sender and we both get a good laugh.
Competition Good, Monopoly Bad.
If it can recognize "It's difficult to wreck a nice beach", I'll be thoroughly 'whelmed'.
That's from over 10 years ago, which in computing terms is ancient history.
"I bless every day that I continue to live, for every day is pure profit."
The lameness filter is lame.
IgPay AtinLay?
Peace is easy to achieve, just surrender. Liberty is much harder to get/keep.
We have arrived at the point where assuming that a company wants to invade your privacy is pretty much the default position.
We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
In a sound proof studio built for sound recording spoken by someone with speech training?
Or in an environment with 30 people talking in the background, an air conditioner running, doors and drawers slamming, people laughing, feet and chairs shuffling across the floor, some photocopiers that got their last service before Bush left office whining for hours and a person speaking into the phone while at the same time talking to coworkers and you're expected to know which words belong to you and which ones are directed at someone else?
Aka "open plan office".
Microsoft Speech Recognition Now As Accurate As Professional Transcribers who are deaf and whose native language is Esperanto.
Given that transcription is not a highly paid area, and that a moderate typist can transcribe pretty much as fast as you talk, there is not a chance in hell you can fire 10 transcribers and hire two.
However, this is 2017; there is no need to have your transcription service in central London, for example. Punt the audio file to somewhere else over the internet. It doesn't even need to leave the UK to be much cheaper than being in central London either.
In fact this is perfect for homeworking, to be honest, especially given the pay rates and demographic profile of most transcriptionists. That is, the job is not exactly high pay, most of them are female and a high number give up work as childcare costs are too much once they have children. Take the commute out of the equation and bingo: a pool of skilled workers ready and waiting. Bit of flexible working to do the school run and job's a good one.
The only specialist gear you need is a set of foot pedals and they cost under 100GBP for a USB set from the likes of Philips or Olympus. A full kit including software and headphones and pedals is under 200GBP.
It still showed up at the South Park "Save Films from their Directors" club for the wrong reason when it heard, "Free Hat".
(For those that aren't South Park followers...)
Cartman writes "Free Hat" on the advertising poster in the belief that freebies are necessary to attract people. However, the crowd mistakenly thinks the rally is to free Hat McCullough, a convicted baby killer they believe was innocent.
Now thinking that "Free Hat" would be a great name of one of those Windows App Store pirate streaming apps ...
It must have been something you assimilated. . . .
I was pleasantly surprised by the voice-message to email service my last employer had with Google.
They sent you the voice message in an attachment with the translation in the email. If the translation didn't make sense, you could play the audio yourself.
Only annoying thing was we still had to delete the VM off the phone manually afterward.
Will it transcribe, "Diffused the situation," or "Defused the situation"? Every single TV closed-caption I've ever seen, and I've taken special note since I first became aware of this, has gone with the former. And those presumably have been humans making that error.
If you believe Microsoft without independent verification from an otherwise uninterested third-party who has no investment in the outcome, then you're a fool.
One in 20 words is wrong?
How can a human transcriptionist be that bad?
Help! I'm a slashdot refugee.
I don't know about other kinds of transcription, but Court transcription is very highly paid and I believe so is medical transcription. For civil and family court, I get on the order of $8-10/page (at 32 lines per page) and I can type 20-25 pages per hour. Plus a hefty expediting fee if they can't wait 2-4 weeks. Plus I get paid for my time in court. I had a co-worker who had a part time job doing movie closed captioning, but that paid a lot less than our day job.
Your condescending attitude aside, this job requires making the recordings in court personally (part of the legislation that exempts our transcripts from hearsay laws -- how can I certify that this is what was really said in court if I wasn't there personally to hear it?). It is not a part time job for single moms to pick up a few extra bucks. And since courts often run late, parents have a huge problem doing this job at all -- how can you pick up your kid at 4 every day when at least once a month you're staying until 6?
I make more than a lot of the lawyers do, at least the legal aid ones. I also have logged far more courtroom hours than most lawyers twice my age and could do their job a lot better than they could if I had a law degree. I know most of the seminal cases better than them, I've typed many reported decisions which I will obviously know better than anyone who read them, and I have a large library of unreported decisions which are still legally binding.
But you go on thinking it's single moms pecking away making a few bucks an hour. Notice I'm on slashdot, I have a computer science undergraduate degree from a top tier school, but I make a hell of a lot more at this job 50 hours a week than I ever did slaving away programming 80 hours a week -- at a job that lasted 2 years before the company I worked for went bust.
Oh, and on top of that, I get a defined benefits government pension. All this and I'll retire before I'm 60 with a government pension higher than most CS people make in their peak earning years.
It's been possible to offshore to India for years now. At least one company divides the audio into 5 minute chunks and spreads it out to a large typing pool, so I can have a 5 hour audio file transcribed and returned to me in 30 minutes and at an amazingly low cost. The problem is the quality is so low the service is useless. They also don't format it
And fwiw, the software is free; the recording software is very expensive, but that's only on the equipment in the courtrooms. The software to play back their proprietary files is free. We do have a hefty annual fee to a professional standards organization though.
Human transcribers "have the advantage to be able to listen to the recording several times"? What utterly demented nonsense is that? Of course, the software, having the recording, can "listen" to it as often as it wants. There is absolutely no "advantage" here for the human transcribers.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
As if the audio sails by the program and isn't stored in memory and parsed as many times as needed.
I fuse micro sot noise recognition ball the time it words fall Leslie.
The Russians have won. They have made the world a cesspool of distrust, greed, fear and hate.
Because court transcribers are less than 0.1% of people doing transcription, that's why. No idea what it's like in the USA, but in the UK the NHS does not pay at that level for medical transcription services, and top law firms don't either. I very much doubt the court transcribers get paid that much either. I will however ask my brother (aka a real life Judge) what they get paid next week when I see him. However, a quick google suggests 60GBP per 5.5-hour sitting day, after which overtime kicks in, but rarely more than 7 hours, which seems about right to me. I tell you now, it gets that late and you're adjourned. I imagine those doing Hansard (that's Parliament's transcription service) get paid a lot more though.
So in the UK someone doing transcription is going to be earning in the region of 15k-20k GBP outside London, and more inside.
My suggestion was not to punt it to India but, instead of doing it in central London, to have it done in, say, Newcastle or Liverpool where, as property prices are not insane like they are in London, wages are lower.
This was all possible back in 2000. Between my brother and me we had it all worked out, business plans and everything, then the dotcom bubble burst. Oh, and I am not talking single mothers either. Back in 2000 my brother worked at a large UK law firm, where 99% of those doing the transcription were women, and it was a problem that once they had kids the cost of childcare made it uneconomic to return to work. We had a number of mothers lined up and eager to do the work.
Oh, and most transcription is done from a dictaphone, dude. Your court transcription is such a tiny, tiny fraction of the market that it's not really worth talking about, so get off your high horse.
The Miracle of Hiroshima -- Jesuits survived the atomic bomb thanks to the rosary
It is? And who decided *that*?
We've got it on our hybrid phones. At least half the time, the voice transcription "preview" randomly resembles Vogon poetry, or perhaps only "computer poetry" from 40 years ago. It rarely gets a name or title correct, and as for the message they're trying to leave, *maybe* 50% is close enough to guess what they meant without listening to the mp3.