Ask Slashdot: Effective, Reasonably Priced Conferencing Speech-to-Text?
First time accepted submitter DeafScribe writes "Every year during the holidays, many people in the deaf community lament the annual family gathering ritual because it means they sit around bored while watching relatives jabber. This morning, I had the best one-on-one discussion with my mother in years courtesy of her iPhone and Siri; voice recognition is definitely improving. It would've been nice if conference-level speech-to-text had been available this evening for the family dinner. So how about it? Is group speech to text good enough now, and available at reasonable cost for a family dinner scenario?"
Go find some YouTube videos with auto-captioning. That is the upper limit on the quality you will get with today's technology.
Good luck.
"First they came for the slanderers and i said nothing."
Based on what I get on my TV when I press the Mute button, they really shouldn't be...
Do not look into laser with remaining eye.
Turns out being able to hear doesn't actually help here.
I have a suggestion of a test for you, to demonstrate why it's impractical at absolute best.
Take ten or so friends to a restaurant. It can be that you're the only patrons there so it's relatively quiet, that's fine. Seat everyone along two sides of a long table, and put a person at each end. Seat yourself in the middle of one of the long sides. Now, as your party is served, attempt to pay attention to all of the conversation going on among the friends. You'll probably find that the friends break into three or four distinct conversations, with some people floating between conversations depending on what's being talked about. Now, in turn, try to focus on or participate in every distinct conversation at the table.
Even as someone with good hearing, this will be a difficult task. With as few as four people it's possible to have two distinct conversations going on in parallel, and with six people it's almost guaranteed to have at least some moments with two simultaneous conversations.
Unless a family operates their dinners with parliamentary rules for who has the floor, it would be almost impossible for software to successfully monitor and differentiate so many speakers, even if the hardware were ideally installed so that each individual speaker could be individually sampled. Fully able-bodied humans struggle with this despite years of experience in attempting to sort through the chatter. I don't see how software is going to make it work, and I also don't see how the hearing-impaired individual is going to be able to read fast enough, while eating, to keep up with that many simultaneous conversations and still enjoy the experience.
Do not look into laser with remaining eye.
More fun audio tricks:
Look at each person speaking as you're trying to listen to them, then look at someone else while still trying to follow the same person's conversation. Usually, a few moments after looking away, you'll find the other conversations are more distracting. Your brain is trying to match up what you're hearing with the mouths moving in front of you.
Also, put in one earplug and close your eyes, so you lose spatial awareness. Again, the voices will blend together much more. That's because the brain also uses spatial cues (visual placement, stereo hearing) to separate sound sources.
You do not have a moral or legal right to do absolutely anything you want.
There's no perfect solution, but something that works for 60% might already be better than nothing.
I work in the closed captioning industry, and I'd say anything less than 95% accuracy is actually WORSE than nothing. Automatic Speech Recognition (ASR) has no concept of context or situational awareness. The mistakes it makes tend to be not in the simple common words and phrases, but concentrated in the nouns, especially proper nouns: names of people, places, companies, products, etc. Even at 80% accuracy, which is quite good for the current best speaker-independent ASR systems, you're looking at 2 words out of every 10 being substituted with the wrong word, completely changing the meaning of the phrases. Imagine the chaos if (major news network)'s closed captioning reported some celebrity or politician as saying "I'm not a fan of Jews." when they actually said "I'm not a fan of juice." (Which would be 83% accurate!) Wars have been started for one misheard word out of a thousand; imagine how bad 200 out of 1000 would be.
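For anyone who wants to check the 83% figure, here is a minimal sketch of how word-level accuracy is commonly scored: count the word substitutions, insertions, and deletions needed to turn the transcript into the reference, then divide by the reference length. The function names `word_errors` and `word_accuracy` are made up for this illustration, and this is not necessarily how any particular captioning vendor scores its output.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance: the minimum number of substituted,
    inserted, or deleted words needed to turn hypothesis into reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from transcript
                            curr[j - 1] + 1,      # extra word in transcript
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """1 - (errors / reference length). Can drop below zero if the
    transcript contains many extra words."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return 1 - word_errors(reference, hypothesis) / ref_len


said = "I'm not a fan of juice"
captioned = "I'm not a fan of Jews"
print(word_errors(said, captioned))                   # one wrong word out of six
print(round(word_accuracy(said, captioned) * 100))    # percent accuracy
```

One wrong word in six scores as roughly 83% accurate, which is exactly why a raw accuracy percentage says so little about whether the meaning survived.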
Here's an article about a HUMAN transcription error that caused a pretty major ruckus. Now imagine this kind of problem being an order of magnitude worse:
http://www.people.com/people/article/0,,20693447,00.html
People who lost hearing later in life tend to do better with high error rate ASR because they know what words sound like and can figure out easy substitutions, e.g. Juice vs. Jews, Election vs. Erection, etc., but people who were born deaf or lost hearing before language acquisition cannot easily make these substitutions in their head because they don't "hear" the word sounds when they read them.
No, but to expand on your car analogy, they have to be able to meet certain minimum standards and customer requirements.
And dropping out of analogy, the hypothetical courtroom automatic stenographer would probably have it easiest, as the rules of the court dictate that only one person may speak at a time, and most courts have individual microphones for every speaking party for acoustically recording the proceedings anyway. The same cannot be said for the dinner table.
Even the most rudimentary system for sampling several participants would cost hundreds of dollars. A half-way accurate comparison would be the equipment needed to record a drum-set, with individual microphones for each drum, cymbal, and accessory, and a processor that monitors line-levels and individually records each input separately. Replace the function of recording each input and turn it into processing each input for discrete words, and only then are you even getting to the hard part, interpreting what the sounds actually are.
The low-end equipment to record drums is hundreds of dollars. High-end equipment to do the same thing costs thousands of dollars. Now tack on the cost of the processing side, and you're probably at tens of thousands of dollars. That is a lot to spend just to attempt to participate in a large group conversation, as opposed to a small-party conversation where polite participants will probably work to simplify the flow of conversation to allow the impaired individual to participate.
A friend of mine in a social club has a son with some form of developmental disability. I've heard that it's Asperger's, but I'm not entirely certain, as many of the traits commonly associated with Asperger's don't seem to manifest with him. When he's party to our conversations we modify our conversation to accommodate him. We attempt to avoid speaking over each other or over him, and we increase the amount of time that one considers a pause by a given speaker, so that we don't interrupt him while he's talking.
If we had a substantially hearing-impaired member, we would probably modify our conversations accordingly, slowing our speech enough that lips could be read, attempting to avoid talking over each other, and attempting to keep our faces oriented to where the individual could see those faces. Given the nature of our vocabulary in this social setting (a speculative fiction group) it would be highly unlikely that a speech-to-text system would correctly interpret any of the truly important words in the conversation anyway, so such a system would be useless.
Do not look into laser with remaining eye.
I've been following the field for a while, so I'm aware of the barriers to success; it is, as the engineers like to say, a non-trivial problem. But I can't possibly be aware of every development, so it's really helpful to get your perspective.
I agree with the general consensus that we're a ways off from accurate machine transcription of group discussions, for the reasons discussed: several conversations can be active at once, background noise interferes, software struggles to comprehend context, etc.
The point about late-deafened people being able to work with lower accuracy is a good one. I'm like that; I can recognize phonetic mistakes and mentally substitute the correct word because I know what was intended, but I have a lot of born-deaf friends who would be lost.
One reader took umbrage at rusotto's joke that being able to hear doesn't help here. Me, I laughed. I know what it means to be bored silly even when everything is clear, and even ASL conversations can have the same problem. Another point was about referring to deaf folks as "vulnerable" - yes, most people would resent that sort of label, even among those who understand it's not done with malice.
About communications with my mother - yes, we can converse by text, or through an ASL interpreter, or via video relay, and we've done all those things. But each of them is mediated to some degree. Working through Siri is too, but with an important difference: the other mediated techniques are more intrusive and divert focus from the person you're conversing with.
In video relay, I don't see my mother at all - I see an interpreter. Text - typing or writing - is also face to face, but it's slow. An ASL interpreter divides focus between the person I'm conversing with and the 'terp. All of these options work. The difference with Siri is, I can see her as she's speaking, focus on HER, read the text generated by Siri and match that with the facial expressions and body language.
One point made was the capacity for reading fast enough to keep up with transcription of a full table of rapid-fire conversation; I agree that would be tough.
Probably the most practical solution now is an ASL 'terp for those (like me) who know ASL. This is one area where the human capacity for a complex task trumps current tech.