Ask Slashdot: Effective, Reasonably Priced Conferencing Speech-to-Text?
First time accepted submitter DeafScribe writes "Every year during the holidays, many people in the deaf community lament the annual family gathering ritual because it means they sit around bored while watching relatives jabber. This morning, I had the best one-on-one discussion with my mother in years courtesy of her iPhone and Siri; voice recognition is definitely improving. It would've been nice if conference-level speech-to-text had been available this evening for the family dinner. So how about it? Is group speech to text good enough now, and available at reasonable cost for a family dinner scenario?"
...because if there were, it'd be put into immediate use for TV closed captioning for live programs, for live presentations with a text crawl at the podium of the speaker, and in courtrooms, replacing the stenographer.
That said, there are companies working on it, like Dragon, but they're not there yet, and when they get closer, they won't be cheap.
TL;DR If it existed we'd be using it already.
Do not look into laser with remaining eye.
Go find some YouTube videos with auto-captioning. That is the upper limit on the quality you will get with today's technology.
Good luck.
"First they came for the slanderers and i said nothing."
it exposes the raw functionality of Android's speech recognition better than anything else I have seen. "I just want something that will put on screen what is said aloud" is a feature set that is surprisingly hard to find.
The main gap I see is that this is really only practical for 1v1 conversations and group settings will require exponentially more sophistication, to identify and differentiate between different speakers.
I would love to say that loved ones should fucking learn their child/parent/friend's first language so they can converse in ASL, but that is a surprisingly hard sell for some people. ASL is my son's first language, and there are plenty of people in his life who refuse to learn to speak with him.
Hire some seasonal help. Interpreters need jobs around the holidays too.
Turns out being able to hear doesn't actually help here.
You'd need to have the table wired first with mic pickups. Or everyone could be nice and wear a mic which is then fed to a central processor and fanned out to folks' iPhones. Couple this with translation and sell to the UN!
It's a phone with a radio built in. The data gets converted to speech and, wait for it... it will even adopt the sound of your mother.
I can't believe what they'll think of next...
The problem with dinner conversations is that there are usually a number of them going on at one time. A computer has enough trouble following one voice let alone multiple voices at the same time.
"This morning, I had the best one-on-one discussion with my mother in years courtesy of her iPhone and Siri; voice recognition is definitely improving."
So, let me just ask, what was hindering this before? If one of you can hear and talk, and the other can read, then surely it's not a huge leap to one of you typing and the other reading? I know it's not a given, but it seems pretty obvious. And, from the deaf person's side, surely nothing is lost? Hell, you could have "talked with" them while chatting to friends at the same time (though that's probably rude if you don't have the focus to do so properly).
I have used text tools and translations to talk to Italian relatives when we're in a pinch and need to communicate. Mime gets you surprisingly far, and you can use keywords and dictionaries, but when the Italian for "spanner" is also the Italian for "key", it gets rather confusing rather quickly.
Speech recognition is inherently difficult and software for it, therefore, is crap. Sure, you can ask Siri to do something simple but she can't transcribe a conversation of any substance at any kind of speed or accuracy. People have been telling me that voice recognition systems could do that for over 20 years now - I'm still to find one and I don't have any speech difficulties or trouble communicating with people of varying accents. In fact, all the people I know that told me how great Dragon was usually found some alternative or quietly dropped their use of it within a year.
Don't expect speech recognition to be any good for a LONG time yet. Especially in a noisy / confusing environment. I hate parties partly because my mind tries to capture all audio and cannot discern them all at the same time (I have a pattern-recognising mind - once I'm tuned in, I can even write down strong-accented Italian as it's spoken even though I know little Italian, with few errors, but ask me to listen to two people talk at once and it hurts my head because my brain DOES try to decipher all the mess at the same time).
As such, the biggest boost to the deaf community in decades was text and instant messaging. Hell, I have no idea if half the people I "talk" to on the Internet are deaf or not.
So rather than trying to find some magical automatic tool so you don't have to do anything special to talk to your friend/relative who IS different, why not learn sign-language, use a text-based tool (even just notepad would work!) or just go to some effort to make yourself understood to them?
Daughter and her friend communicate via texting a lot. They've got speech to text turned on at one end, and on the other end, text to speech.
Waaaait a minute...
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
My impression is, commercial speech to text works really well right now -- in fact, just well enough to lull you into complacency before it really screws you up.
When we talk to each other, we speak very differently to when we're recording ourselves into a microphone or talking to a speech to text algorithm. Computers aren't at the point where they can interpret communicative intent and so they are unable to transcribe what you mean to say as opposed to the sounds that came out of your mouth. Even when people talk to their speech to text phones, we often see some very strange interpretations. Just check out the regular and frequent postings on Fail Blog.
www.Stay-in-Touch.ca includes a "Hard-of-Hearing" function. It is text or speech-to-text at your end, which comes through as text to "Grandma". She picks up the phone and speaks. Then you reply. Runs on an Android tablet for Grandma and a web page for you! I'd love you to test this.
Any reason your mother didn't just learn how to sign? It would seem to be a lot less expensive... and more thoughtful.
No, there is nothing good enough. In fact, automated systems got a good, old-fashioned drubbing at the captioning challenge at the ASSETS 2013 conference. Nothing came even close to a human steno captioner.
If you want to make family events accessible and the person in question signs, my recommendation is to hire freelance interpreters. They often charge on a sliding scale, depending on the event and means of the client, and tend to run much cheaper than anyone you can get through referral services. That is what we do at family events, and it works both ways: events by my hearing family for us, and events by us for my hearing friends and family.
If the person does not sign, human captioning is an alternative, but it probably would have to be local rather than remote. Remote does well if there are no overlapping conversations. Otherwise, it has to be local, but that is a lot harder to arrange for, and more expensive.
A note to some of the other commenters: please spare us the patronizing posts about us not missing out on any of the (presumably inane) conversations that take place at such events. *We* make the decision as to what conversation we consider important and what we do not, and nothing is worse than people presuming to speak for us, and presuming to know better about access than we do.
Are you able to do all of the following at your dinner conversation?:
1) Provide everyone with a decent close-talking directional microphone.
2) Require each person to take turns speaking, so there is very little overlap.
3) Have no pre-adolescents speaking.
4) Eliminate noticeable background noises.
5) Have no one with a strong non-native dialect speaking.
6) Require everyone to speak in full, grammatical sentences.
To the extent you say no to any of the above, you will get increasingly poor output. They are listed approximately in order of importance (1 being the most important). If you can say yes to all of those, you can probably get in the vicinity of 90% accuracy. This might be usable, depending on your ultimate purpose. If you were to additionally train acoustic and language models for all of the speakers, and then tell the software which user was speaking (i.e. switch the user on the fly during the conversation), you could probably get 95% accuracy and that would be quite usable.
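For anyone wondering what figures like 90% or 95% mean in practice: recognition accuracy is usually reported as one minus the word error rate (WER), which is the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of words in the reference. A minimal sketch of that calculation follows; the function name and the sample sentences are made up for illustration and do not come from any particular product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word dropped by the recognizer
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: ten reference words, one misrecognized.
reference = "please pass the potatoes and tell me about your job"
hypothesis = "please pass the tomatoes and tell me about your job"
wer = word_error_rate(reference, hypothesis)
print(f"accuracy: {1 - wer:.0%}")  # prints "accuracy: 90%"
```

At 90% that is roughly one wrong word in every ten, which is why the difference between 90% and 95% matters so much to someone reading along in real time.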
So, in other words ... nope.
If humans are mostly water, and beer is mostly water, then humans must be mostly beer.
Record the conversation and then play it into Dragon; it works, but you need a good quality audio feed. I've also tinkered with Julius, and although it takes a bit of setup it works in most cases, but you have to tweak it a bit more than Dragon, at least in terms of what I was dealing with.
Harrison's Postulate - "For every action there is an equal and opposite criticism"
I've been following the field for a while, so I'm aware of the barriers to success; it is, as the engineers like to say, a non-trivial problem. But I can't possibly be aware of every development, so it's really helpful to get your perspective.
I agree with the general consensus that we're a ways off from accurate machine transcription of group discussions, for the reasons discussed: several conversations can be active at once, there is interference from other background noises, context has to be comprehended, etc.
The point about late-deafened people being able to work with lower accuracy is a good one. I'm like that: I can recognize phonetic mistakes and mentally substitute the correct word because I know what was intended, but I have a lot of born-deaf friends who would be lost.
One reader took umbrage at rusotto's joke that being able to hear doesn't help here. Me, I laughed. I know what it means to be bored silly even when everything is clear, and even ASL conversations can have the same problem. Another point was about referring to deaf folks as "vulnerable" - yes, most people would resent that sort of label, even among those who understand it's not done with malice.
About communications with my mother - yes, we can converse by text, or through an ASL interpreter, and via video relay, and we've done all those things. But each of them is mediated to some degree, and working through Siri is too, but with an important difference: the other mediated techniques are more intrusive and divert focus from the person you're conversing with.
In video relay, I don't see my mother at all - I see an interpreter. Text - typing or writing - is also face to face, but it's slow. An ASL interpreter divides focus between the person I'm conversing with and the 'terp. All of these options work. The difference with Siri is, I can see her as she's speaking, focus on HER, read the text generated by Siri and match that with the facial expressions and body language.
One point made was the capacity for reading fast enough to keep up with transcription of a full table of rapid-fire conversation; I agree that would be tough.
Probably the most practical solution now is an ASL 'terp for those (like me) who know ASL. This is one area where the human capacity for a complex task trumps current tech.
I'd really like something like this to create transcripts of our town hall meetings. It is an important need for government transparency throughout the world. In some ways those might be easier to do because people don't tend to speak at once. The Google voice technology seems pretty good.
Ideally the transcript could be combined with an app that overlays the transcript on the video or audio, with accelerated playback, allowing the user to tag which speaker is speaking (if not automated), and correct any errors. That could also be part of a captcha system.
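The overlay part of this could probably be prototyped without a custom app by writing the transcript out in the SubRip (.srt) caption format, which most video players can display over a recording. Here is a rough sketch, assuming you already have timed segments with a speaker tag on each; the `Segment` type, the function names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds from the start of the recording
    end: float
    speaker: str   # tagged by hand, or by software if that is available
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp that SubRip uses."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SubRip blocks separated by blank lines."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        blocks.append(
            f"{number}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.speaker}: {seg.text}\n"
        )
    return "\n".join(blocks)


# Made-up meeting excerpt.
meeting = [
    Segment(0.0, 2.5, "Chair", "The meeting will come to order."),
    Segment(2.5, 6.0, "Resident", "I have a question about the road budget."),
]
print(to_srt(meeting))
```

Correcting errors then becomes a matter of editing a plain text file, and the corrected file still plays against the same video.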
Dragon NaturallySpeaking Home is $50 on Amazon.com
I've sat in trials in federal courts a few times. You can watch the stenographer type the transcript on a monitor in front of you, and it's much better than the TV captions.
These were pharmaceutical patent cases and FDA litigation, mostly technical stuff, chemists being cross-examined.
They had a system with a court stenographer typing into a computerized stenotype machine, and the judge and both parties watching the result on monitors.
I was last in court a few years ago, but I don't think it's changed much.
In the old days of manual stenography, stenotypists used to take dictation at 120 words per minute (which is the high end of normal speaking speed) to pass a certification test, and they could do 150 wpm in bursts. They would type abbreviations into a fanfold paper tape, then read the tapes and type a final version.
Now the computerized systems give them the final version automatically, so they don't have to reread it. Some stenotype systems were more amenable to automation than others, but now everybody uses computerized systems.
The result was pretty close to what the final transcript looked like, although I didn't examine them too carefully. The practice has always been for the stenographer to type a draft and circulate it to the parties for review. (The stenographer is also allowed to ask a witness to repeat something when the stenographer missed it. That's what they're in court for -- to produce a record.)
Court stenographers also make a lot of money ($100,000 a year or more at the top end), because they sell transcripts to the lawyers. A patent case can be worth $100 million, so what's another $100,000 for transcripts? I don't think the TV captioners make anything like that. You get what you pay for.
Not "Google Glass" as is but . . . some future version of that, would seem to be the ideal. HIGHLY directional microphone, lets you "look" at the speaker of interest to help #1 see facial expressions/body language, #2 discern that "voice" from among the background noise clutter, with interpreted output onto the display.
Something should be possible... if it's done in "Wet" electronics (Brain-Body), with enough processing power and sensitivity, discernment should be somewhat possible in "Dry" electronics (IC's and discrete components.)
I keep thinking back to my early days in Ham Radio . . . as time passed and band conditions changed, signals faded and were occluded by background noise, but my human ears "locked on and tracked" the fading morse code. Such that, a passing observer, walking in fresh, might not "hear anything" in the din of noise . . . when I was just able to stay locked on and keep the conversation going a few minutes/seconds longer. How was I staying "locked on"?
Clearly, all I need to do is break out the CSI Miami DVD's and watch a few episodes, where they pull out a clear conversation from the background of a garbled audio conversation, and just repeat their procedure . .
It'll take more than just directional audio input and decoding (recognition) software. (For those cases when you're trying to listen to ONE conversation in a room full of three or four going on) . . . "directionality" will help, but some pretty heavy duty "fast fourier transforms" (or something similar) might be needed to "track" the particular voices of interest. And laughter, non "voice" sounds like "gasps", tonal changes for surprise or interrogation . . . will make "tracking" problematic.
(Whether built by God, or 1.5 million years of evolution . . . the current design has a big head start.)
I know . . . all that's design postulation, and not the "I need it now" answer you were hoping for. Just thinking out loud.
No, I don't remember your name. But the memory mapped screen on a TRS80 from 1977 is from 15360 to 16383 if that helps.
I've not noticed anyone else mention using multiple microphones, but with two microphones spaced a few metres apart a computer can separate out two people talking over each other due to speed-of-sound differences (i.e. conversation A reaches mic 1 then mic 2, conversation B reaches mic 2 then mic 1).
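To make the timing idea concrete, here is a minimal sketch of estimating which microphone a sound reached first by cross-correlating the two recordings. Everything in it is an assumption for illustration: the "voice" is seeded random noise standing in for speech, the 16 kHz sample rate is picked arbitrarily, and `best_lag` and `delayed` are made-up helper names. It only shows the delay estimate for one talker at a time; actually pulling apart two overlapping voices takes more work (for example delay-and-sum beamforming or independent component analysis).

```python
import random


def best_lag(mic1, mic2, max_lag):
    """Return the lag (in samples) at which mic2 best matches mic1.

    Positive: the sound reached mic 1 first. Negative: mic 2 first.
    """
    def score(lag):
        # Correlate mic1[n] with mic2[n + lag] over the overlapping samples.
        return sum(mic1[n] * mic2[n + lag]
                   for n in range(len(mic1))
                   if 0 <= n + lag < len(mic2))
    return max(range(-max_lag, max_lag + 1), key=score)


def delayed(signal, samples):
    """Simulate extra travel time by prepending silence."""
    return [0.0] * samples + signal[:len(signal) - samples]


random.seed(0)
voice = [random.uniform(-1, 1) for _ in range(2000)]  # stand-in for speech

SAMPLE_RATE = 16_000     # Hz, assumed
SPEED_OF_SOUND = 343.0   # m/s in room-temperature air

# Speaker A sits nearer mic 1: the sound reaches mic 2 40 samples later.
lag_a = best_lag(voice, delayed(voice, 40), max_lag=100)
# Speaker B sits nearer mic 2: the sound reaches mic 1 25 samples later.
lag_b = best_lag(delayed(voice, 25), voice, max_lag=100)

for name, lag in (("A", lag_a), ("B", lag_b)):
    path_difference = abs(lag) / SAMPLE_RATE * SPEED_OF_SOUND
    first = "mic 1" if lag > 0 else "mic 2"
    print(f"speaker {name}: reached {first} first, "
          f"path difference about {path_difference:.2f} m")
```

The sign of the lag tells you which side of the table the talker is on, which is the cue an app could use to label or route each voice.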
I've seen two people talking separated using two microphones, I don't think you need more microphones for more people. This is a common machine learning demo (how I've seen it), but it is just signal processing really (i.e. no machine learning required).
I'm not aware of any products integrating this however, but I've never had the need to look. Is there such a thing as a bluetooth microphone for iPhone/Android? Like a headset but without the speaker. If you're a programmer maybe you could write an app that separates out different audio "streams" and then send each to existing voice recognition APIs.
... Especially when family members are yelling and stuff. Ugh!
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
However, you will probably have to pass a mic around the table. I needed to transcribe some paragraphs in a book to text on my computer and this thing worked really well.
How to use Chrome's speech-to-text
http://howto.cnet.com/8301-11310_39-20058475-285/how-to-use-chromes-speech-to-text/