Ask Slashdot: Effective, Reasonably Priced Conferencing Speech-to-Text?

Posted by samzenpus on Monday December 30, 2013 @07:03AM from the keep-talking dept.

First time accepted submitter DeafScribe writes "Every year during the holidays, many people in the deaf community lament the annual family gathering ritual because it means they sit around bored while watching relatives jabber. This morning, I had the best one-on-one discussion with my mother in years courtesy of her iPhone and Siri; voice recognition is definitely improving. It would've been nice if conference-level speech-to-text had been available this evening for the family dinner. So how about it? Is group speech to text good enough now, and available at reasonable cost for a family dinner scenario?"

81 comments

There isn't any... by TWX · 2013-12-30 07:09 · Score: 1

...because if there were, it'd be put into immediate use for TV closed captioning for live programs, for live presentations with a text crawl at the podium of the speaker, and in courtrooms, replacing the stenographer.

That said, there are companies working on it, like Dragon, but they're not there yet, and when they get closer, they won't be cheap.

TL;DR If it existed we'd be using it already.

--
Do not look into laser with remaining eye.
1. Re:There isn't any... by Anonymous Coward · 2013-12-30 07:11 · Score: 0
  
  video transcribers also quite expensive
2. Re:There isn't any... by Anonymous Coward · 2013-12-30 07:14 · Score: 0
  
  There's no perfect solution, but something that works for 60% might already be better than nothing.
  Perhaps there's a way to reuse youtube's automatic captioning in one way or another?
3. Re:There isn't any... by TWX · 2013-12-30 07:19 · Score: 4, Insightful
  
  video transcribers also quite expensive
  Based on what I get on my TV when I press the Mute button, they really shouldn't be...
  
  --
  Do not look into laser with remaining eye.
4. Re:There isn't any... by Anathem · 2013-12-30 07:27 · Score: 0
  
  Your TL;DR should be at the top of your response. (bonus for making it bold though)
5. Re:There isn't any... by TWX · 2013-12-30 07:30 · Score: 3, Interesting
  
  I have a suggestion of a test for you, to demonstrate why it's impractical at absolute best.
  
  Take ten or so friends to a restaurant. It can be that you're the only patrons there so it's relatively quiet, that's fine. Seat everyone along two sides of a long table, and put a person at each end. Seat yourself in the middle of one of the long sides. Now, as your party is served, attempt to pay attention to all of the conversation going on among the friends. You'll probably find that the friends break into three or four distinct conversations, with some people floating between conversations depending on what's being talked about. Now, in turn, try to focus on or participate in every distinct conversation at the table.
  
  Even as someone with good hearing, this will be a difficult task. With as few as four people it's possible to have two distinct conversations going on in parallel, and with six people it's almost guaranteed to have at least some moments with two simultaneous conversations.
  
  Unless a family operates their dinners with parliamentary rules for who has the floor, it would be almost impossible for software to successfully monitor and differentiate so many speakers, even if the hardware were ideally installed so that each individual speaker could be individually sampled. Fully able-bodied humans struggle with this even with years of experience in attempting to sort through the chatter. I don't see how software is going to make it work, and I also don't see how the hearing-impaired individual is going to be able to read fast enough to keep up with that many conversations simultaneously in order to really enjoy the experience, while eating.
  
  --
  Do not look into laser with remaining eye.
6. Re:There isn't any... by Whorhay · 2013-12-30 07:31 · Score: 1
  
  No kidding. I'm always amazed at the low quality of the captioning for accuracy, spelling, and typos. News broadcasts seem to be the worst, presumably because they are live, but even canned content is often poorly done.
7. Re:There isn't any... by Anonymous Coward · 2013-12-30 07:31 · Score: 0
  
  ...because if there were, it'd be put into immediate use for TV closed captioning for live programs, for live presentations with a text crawl at the podium of the speaker, and in courtrooms, replacing the stenographer.
  That said, there are companies working on it, like Dragon, but they're not there yet, and when they get closer, they won't be cheap.
  TL;DR If it existed we'd be using it already.
  The family dinner automatic transcription may have different requirements from the court stenographer automatic transcription application. It isn't a silly question because you'd expect this kind of scenario where the individual will tolerate a higher error rate to be one of the first practical applications, not the last.
  To put this into a car analogy, electric cars don't need to surpass ICE cars in every conceivable scenario to make one worth buying for a given individual.
8. Re:There isn't any... by KiloByte · 2013-12-30 07:33 · Score: 1
  
  It hasn't advanced a bit since 15 years ago when IBM advertised ViaVoice. Unless you're the person a particular speech recognition tool was tailored for or someone with almost identical speech, all you get is nonsense poetry that vaguely resembles the rhythm and rhyme of what you said.
  
  --
  The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
9. Re:There isn't any... by Sarten-X · 2013-12-30 07:39 · Score: 3, Interesting
  
  More fun audio tricks:
  Look at each person speaking as you're trying to listen to them, then look at someone else while still trying to follow the same person's conversation. Usually, a few moments after looking away, you'll find the other conversations are more distracting. Your brain is trying to match up what you're hearing with the mouths moving in front of you.
  Also, put in one earplug and close your eyes, so you lose spatial awareness. Again, the voices will blend together much more. That's because the brain also uses spatial cues (visual placement, stereo hearing) to separate sound sources.
  
  --
  You do not have a moral or legal right to do absolutely anything you want.
10. Re:There isn't any... by jettoblack · 2013-12-30 07:40 · Score: 5, Insightful
  
  There's no perfect solution, but something that works for 60% might already be better than nothing.
  I work in the closed captioning industry, and I'd say anything less than 95% accuracy is actually WORSE than nothing. Automatic Speech Recognition (ASR) has no concept of context or situational awareness. The mistakes they make tend to be not in the simple common words and phrases, but concentrated in the nouns, especially proper nouns: names of people, places, companies, products, etc. Even at 80% accuracy, which is quite good for the current best speaker independent ASR systems, you're looking at 2 words out of every 10 being substituted with the wrong word, completely changing the meaning of the phrases. Imagine the chaos if (major news network)'s closed captioning reported some celebrity or politician as saying "I'm not a fan of Jews." when they actually said "I'm not a fan of juice." (Which would be 83% accurate!) Wars have been started for one misheard word out of a thousand; imagine how bad 200 out of 1000 would be.
  Here's an article about a HUMAN transcription error that caused a pretty major ruckus. Now imagine this kind of problem being an order of magnitude worse:
  http://www.people.com/people/article/0,,20693447,00.html
  People who lost hearing later in life tend to do better with high error rate ASR because they know what words sound like and can figure out easy substitutions, e.g. Juice vs. Jews, Election vs. Erection, etc., but people who were born deaf or lost hearing before language acquisition cannot easily make these substitutions in their head because they don't "hear" the word sounds when they read them.
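The 83% figure above can be reproduced with a word-level edit distance, which is roughly how word accuracy is usually scored. This is only a minimal sketch: the function names are made up for illustration, and real scoring tools also normalize punctuation and numbers, which this does not.

```python
def word_errors(reference, hypothesis):
    """Return (edit_distance, reference_length), counted in words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words (classic Levenshtein rows).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word dropped
            insertion = cur[j - 1] + 1  # extra word in the transcript
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1], len(ref)


def word_accuracy(reference, hypothesis):
    """Fraction of reference words that survived transcription."""
    errors, length = word_errors(reference, hypothesis)
    return 1 - errors / length


accuracy = word_accuracy("I'm not a fan of juice", "I'm not a fan of Jews")
print(f"{accuracy:.0%}")  # one wrong word out of six
```

Note that the score says nothing about which word was wrong, which is the whole complaint: the metric treats "juice" and "of" as equally important.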
11. Re:There isn't any... by WolfgangPG · 2013-12-30 07:43 · Score: 1
  
  Agreed. This is why Bane (you couldn't see his mouth) was a common complaint in the most recent Batman movie and why in Transformers, etc... they add moving mouths to robots, etc...
12. Re:There isn't any... by TWX · 2013-12-30 07:52 · Score: 1
  
  I think the character most commonly displayed on mine is the semicolon. For whatever reason, the systems in use in this city seem to LOVE the semicolon. I don't even know why that's an option, it'd make more sense for them to drop punctuation entirely and solely concentrate on the 26 letters if the machine continues to default to using punctuation characters instead.
  
  --
  Do not look into laser with remaining eye.
13. Re:There isn't any... by TWX · 2013-12-30 08:06 · Score: 1
  
  Mmmhmm...
  
  There's a reason why radio personalities often sound similar: they have exceptionally good diction. Even shock-jocks and others that seem counter-culture or edgy must have excellent verbal communications skills in order to work in that industry; they have to overcome the problem of communicating when the listener cannot read their lips. Vowel sounds and other 'long' sounds are easy to make out, but the staccato consonants are what give meaning to the sounds, and lots of lazy speakers under-emphasize their consonants and yet get upset when people ask them to repeat themselves.
  
  When I was a child I was in a professional choir. We were taught exactly that, to downplay non-vowel long sounds (the 'th' sound, the 's' sound, the 'l' sound, probably some others) and to over-emphasize hard consonants like 't', 'd', 'p', 'b', 'k', etc. This training apparently stuck with me; it's extremely rare that I'm asked to repeat myself.
  
  --
  Do not look into laser with remaining eye.
14. Re:There isn't any... by TWX · 2013-12-30 08:28 · Score: 4, Informative
  
  To put this into a car analogy, electric cars don't need to surpass ICE cars in every conceivable scenario to make one worth buying for a given individual.
  No, but to expand on your car analogy, they have to be able to meet certain minimum standards and customer requirements.
  
  And dropping out of analogy, the hypothetical courtroom automatic stenographer would probably have it easiest, as the rules of the court dictate that only one person may speak at a time, and most courts have individual microphones for every speaking party for acoustically recording the proceedings anyway. The same cannot be said for the dinner table.
  
  Even the most rudimentary system for sampling several participants would cost hundreds of dollars. A half-way accurate comparison would be the equipment needed to record a drum-set, with individual microphones for each drum, cymbal, and accessory, and a processor that monitors line-levels and individually records each input separately. Replace the function of recording each input and turn it into processing each input for discrete words, and only then are you even getting to the hard part, interpreting what the sounds actually are.
  
  The low-end equipment to record drums is hundreds of dollars. High end equipment to do the same thing costs thousands of dollars. Now tack on the cost of the processing side, and you're probably at tens of thousands of dollars. Just to attempt to participate in a large group conversation as opposed to small-party conversation where polite participants will probably work to simplify the flow of conversation to allow the impaired individual to participate.
  
  A friend of mine in a social club has a son with some form of developmental disability. I've heard that it's Aspergers, but I'm not entirely certain as many of the traits commonly associated with Aspergers don't seem to manifest with him. When he's party to our conversations we modify our conversation to accommodate him. We attempt to avoid speaking over each other or over him, and we increase the amount of time that one considers a pause by a given speaker, so that we don't interrupt him while he's talking.
  
  If we had a substantially hearing-impaired member, we would probably modify our conversations accordingly, slowing our speech enough that lips could be read, attempting to avoid talking over each other, and attempting to keep our faces oriented to where the individual could see those faces. Given the nature of our vocabulary in this social setting (a speculative fiction group) it would be highly unlikely that a speech-to-text system would correctly interpret any of the truly important words in the conversation anyway, so such a system would be useless.
  
  --
  Do not look into laser with remaining eye.
15. Re:There isn't any... by Desler · 2013-12-30 08:45 · Score: 1
  
  Oh really?
  Original sentence: After the party, your mom and I swept up together.
  Translated sentence: After the party, your mom and I slept together.
  Now the translated sentence is well over 60% accurate, but do you see how the meaning has completely changed?
16. Re:There isn't any... by Anonymous Coward · 2013-12-30 09:01 · Score: 0
  
  yeah i've been looking for a working commercial solution to transcribe call center recordings.
  nothing is good enough
  even for reasonable budget.
17. Re:There isn't any... by Anonymous Coward · 2013-12-30 09:06 · Score: 0
  
  I also work in the captioning industry, and have been fielding the voice-recognition question for 20 years. It's still not ready for prime-time. And it would fail miserably at the task in a noisy room.
  The iPhone offering is quite good for being non-voice-specific. If you have the discipline to speak clearly, with integrated punctuation, it's very understandable. Having all speakers on a group text, on their own individual phones, taking turns, displayed on a large monitor, would likely do a passable job. Or, hire a court reporter, and pay them what they're really worth.
  Most (but not all) broadcast captioning is done by the lowest bidder, and it looks like it. Plus, the educational background of the captioner is often sub-standard, so they don't even know the word they don't know how to spell.
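The "everyone on their own phone, shown on one big monitor" idea above boils down to merging several per-speaker text feeds by timestamp. Here is a rough sketch of just that merging step. Everything in it is hypothetical (the `Utterance` record, the sample lines, the assumption that each phone delivers already-recognized text with a start time); it does no speech recognition at all.

```python
import heapq
from typing import NamedTuple


class Utterance(NamedTuple):
    start: float  # seconds since the start of the dinner (assumed clock)
    speaker: str
    text: str


def merge_feeds(*feeds):
    """Merge per-phone feeds, each already sorted by start time, into one timeline."""
    return list(heapq.merge(*feeds, key=lambda u: u.start))


def render(timeline):
    """Format the merged timeline as lines for a shared display."""
    return [f"[{u.start:6.1f}s] {u.speaker}: {u.text}" for u in timeline]


# Made-up sample data standing in for what each phone would send.
mom = [Utterance(0.0, "Mom", "Pass the potatoes"), Utterance(9.5, "Mom", "Thanks")]
joe = [Utterance(4.2, "Joe", "Here you go")]

for line in render(merge_feeds(mom, joe)):
    print(line)
```

The hard parts stay hard: this only works if people take turns, and each phone still has to recognize its owner's speech in a noisy room.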
18. Re:There isn't any... by EvanED · 2013-12-30 09:12 · Score: 2
  
  Counterargument: I don't know much about audio analysis, but imagine you have two people talking with different-pitched voices. I wouldn't be at all surprised if a computer, via some FFT and frequency analysis or something, would be able to do a far better job at separating them than a person could. Actually, I'd be a bit surprised if that wasn't true.
  Things become more difficult of course with similar-sounding voices, but I still suspect there's a fair bit of potential. There have been a lot of things people have done along this line.
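As a toy illustration of the frequency-analysis idea above: if two "voices" are reduced to pure tones at different pitches, a Fourier transform shows them as two separate peaks. The sketch below uses a slow, plain-Python DFT so it needs no libraries. The 120 Hz and 220 Hz tones, the sample rate, and the function names are all assumptions chosen for the example. Real voices are not pure tones and their harmonics overlap heavily, so this shows why the idea is plausible, not that it solves the problem.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 up to (not including) the Nyquist bin."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        magnitudes.append(abs(total))
    return magnitudes


def dominant_frequencies(samples, sample_rate, count=2):
    """Frequencies (Hz) of the `count` strongest bins, lowest first."""
    magnitudes = dft_magnitudes(samples)
    bin_hz = sample_rate / len(samples)
    strongest = sorted(
        range(len(magnitudes)), key=lambda k: magnitudes[k], reverse=True
    )[:count]
    return sorted(k * bin_hz for k in strongest)


SAMPLE_RATE = 2000  # Hz, assumed
N = 400             # 0.2 s of audio, so each bin is 5 Hz wide

# Two synthetic "speakers": a lower tone and a quieter, higher tone, mixed together.
mix = [
    math.sin(2 * math.pi * 120 * t / SAMPLE_RATE)
    + 0.8 * math.sin(2 * math.pi * 220 * t / SAMPLE_RATE)
    for t in range(N)
]

print(dominant_frequencies(mix, SAMPLE_RATE))  # [120.0, 220.0]
```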
19. Re:There isn't any... by skastrik · 2013-12-30 09:15 · Score: 2
  
  I must disagree. A 20% error rate rarely completely changes the meaning of a particular sentence or article.
  I can use automatic translation to read foreign web sites. I notice when the translation is weird and most probably wrong, but still I more or less understand what is being discussed. How on earth can that be worse than absolutely nothing?
20. Re:There isn't any... by Anonymous Coward · 2013-12-30 10:00 · Score: 0
  
  translating != transcribing
21. Re:There isn't any... by Solandri · 2013-12-30 11:09 · Score: 2
  
  there are companies working on it, like Dragon, but they're not there yet
  Is Dragon actually still working on it? The original authors who were the true R&D geniuses behind the technology were locked out of improving the product (and pretty much the industry) because of copyright after the botched sale of their company. The recent demonstrations of the software I've seen look like not much has improved since 2000 except computers have gotten faster.
22. Re:There isn't any... by WCVanHorne · 2013-12-30 12:24 · Score: 1
  
  When Zooey Deschanel gets shot by the cops?
23. Re:There isn't any... by Bourdain · 2013-12-30 13:46 · Score: 1
  
  Based on what I get on my TV when I press the Mute button, they really shouldn't be...
  Most of the time when you view closed captions, it is typed up, not automatically transcribed by a computer program - link
  
  Further, for live events, it is typically typed live by a stenographer which yields the inherent delay
  
  As for errors, I personally have mostly seen errors when I'm watching over the air and the reception isn't very clear (though I don't often use closed captioning so my sample size is limited)
24. Re:There isn't any... by Anonymous Coward · 2013-12-30 13:51 · Score: 0
  
  I don't even know why that's an option; it'd make more sense for them...
  FTFY
25. Re:There isn't any... by nbauman · 2013-12-30 15:32 · Score: 1
  
  That's because your pinkie is the easiest finger to make a mistake with, and the semicolon is right under your ;pinkie. I do it all the time.
  Speed is more im;portant than accuarcy.
26. Re:There isn't any... by Anonymous Coward · 2013-12-30 16:46 · Score: 0
  
  There are arrays of microphones installed in indoor arenas that can isolate the words of individuals in the audience of 10s of 1000s. One assumes the audio stream can be split to allow isolating more than one person simultaneously.
  So it isn't impossible, tho it may require a lot of DSP equipment.
27. Re:There isn't any... by spasm · 2013-12-30 18:25 · Score: 1
  
  On the other hand, conversations between the relatives at Christmas have massive amounts of context - we *know* Uncle Joe who has been sending money to the Zionist Freedom Front ever since his older sister was liberated from Auschwitz couldn't possibly have said "I'm not a fan of the Jews", whereas we don't have the personal-level context needed to decide whether some idiot celebrity or politician might or might not have said "I'm not a fan of the Jews". So for the purpose of the original poster, an 80% accurate ASR system might be good enough.
28. Re:There isn't any... by foobar+bazbot · 2013-12-30 18:44 · Score: 1
  
  There are arrays of microphones installed in indoor arenas that can isolate the words of individuals in the audience of 10s of 1000s. One assumes the audio stream can be split to allow isolating more than one person simultaneously.
  So it isn't impossible, tho it may require a lot of DSP equipment.
  This.
  All the people going on about putting a mic on each person, or all around the table, or whatever are way off. Those mics will pick up everyone else's voice as well, and speech-to-text capability goes way down with extra voices at even low volumes. Phased arrays are where it's at -- with enough mics, you can do a much better job at getting the voice you want, and at nulling out any other concurrent speakers -- and of course, as you say, you can use the same mic array to simultaneously monitor several speakers. However, there's two notes to remember:
  The physical resolution of any such system, even in the near field, is more-or-less limited by wavelength, so if the lowest frequency of interest is, say, 300 Hz, you'll have difficulty resolving much less than a meter.
  And a living room or whatever is merry hell for multipath. Establishing the correct delays and amplitudes needed to resolve each speaker (or conversely to null them out) is simple enough in the static case (just have everyone speak a sentence or so in turn for calibration at the beginning of the event -- if you're a religious family, perhaps you can integrate this with saying grace over the food...), but when somebody gets up and walks around, everyone's coefficient matrix changes. With the burden of keeping up with that in realtime, it's a bit more of a challenge than it might at first appear.
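Both notes above can be put into numbers. The sketch below computes the wavelength at 300 Hz and the per-microphone delays a simple delay-and-sum array would apply to line up sound from one seat. It assumes a speed of sound of 343 m/s, free-field propagation (no multipath, so exactly the case that a living room is not), and made-up microphone and speaker positions. The function names are illustrative only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed (air at roughly 20 C)


def wavelength(frequency_hz):
    """Wavelength in meters of a tone at the given frequency."""
    return SPEED_OF_SOUND / frequency_hz


def steering_delays(mic_positions, source):
    """Delay in seconds to add to each mic so sound from `source` lines up.

    Mics closer to the source hear it earlier, so they get the larger delay;
    the farthest mic gets zero. Summing the delayed signals then reinforces
    this source relative to sound arriving from elsewhere.
    """
    distances = [math.dist(mic, source) for mic in mic_positions]
    farthest = max(distances)
    return [(farthest - d) / SPEED_OF_SOUND for d in distances]


print(f"{wavelength(300):.2f} m")  # a bit over a meter

# Hypothetical layout: four mics along one edge of a table, speaker at (0.5, 1.0).
mics = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.5, 0.0)]
for mic, delay in zip(mics, steering_delays(mics, (0.5, 1.0))):
    print(mic, f"{delay * 1000:.2f} ms")
```

When the speaker moves, the `source` position changes and every delay has to be recomputed, which is the realtime tracking burden described above.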
29. Re:There isn't any... by flux · 2013-12-30 20:32 · Score: 1
  
  A microphone array and some DSP should have no trouble distinguishing between sound sources at distinct points in space, in addition to removing background noise.
30. Re:There isn't any... by TWX · 2013-12-30 20:45 · Score: 1
  
  How about understanding the guy with the stutter? The guy that doesn't move his jaw so his consonants are almost imperceptible? The Scot? The Cajun? The Bostonian? The Puerto Rican? The Jamaican?
  
  --
  Do not look into laser with remaining eye.
31. Re:There isn't any... by painandgreed · 2013-12-31 04:25 · Score: 1
  
  there are companies working on it, like Dragon, but they're not there yet
  Is Dragon actually still working on it? The original authors who were the true R&D geniuses behind the technology were locked out of improving the product (and pretty much the industry) because of copyright after the botched sale of their company. The recent demonstrations of the software I've seen look like not much has improved since 2000 except computers have gotten faster.
  I doubt you have actually been paying attention or use speech recognition. It has gotten so much better in the last fourteen years. Back then, it was basically worthless and now accuracy really isn't any more of an issue than it is with a human listener, even with various foreign accents speaking English.
32. Re:There isn't any... by Anonymous Coward · 2013-12-31 04:58 · Score: 0
  
  Those certainly are hard, but independent of the "many simultaneous speakers" problem you were talking about in your original post. That would be hard to deal with even if you had just one person talking into a mic.
  What flux was saying (and I also suspect is largely true) is that the "multiple simultaneous speakers" problem actually isn't much of a problem.
captions by phantomfive · 2013-12-30 07:13 · Score: 4, Insightful

Go find some youtube videos with auto-captioning. That is the upper-limit on the quality you will get with today's technology.

Good luck.

--
"First they came for the slanderers and i said nothing."
1. Re:captions by Anonymous Coward · 2013-12-30 07:37 · Score: 0
  
  Go find some youtube videos with auto-captioning. That is the upper-limit on the quality you will get with today's technology.
  Good luck.
  It may be better than nothing at least. It sounds like it's just that there's not enough of a potential customer base to drum up the money to turn today's technology into a usable piece of software for this application. Might make a good kickstarter.
2. Re:captions by snsh · 2013-12-30 08:06 · Score: 1
  
  Youtube is far from the best speech-to-text technology available. The best STT technology is probably owned by the NSA or companies that work with them. Part of the secret sauce to good STT is voice training and speaker recognition, which I don't believe youtube's STT is capable of yet. As far as Youtube is concerned, it's only one person talking throughout each video, so when you have a French dude speaking one sentence, followed by an Irish woman the next sentence, youtube may not dynamically adapt to that.
  But besides that, the first thing any STT vendors ask when you start talking to them is about the quality of the recorded sound. If your speakers are not in a low-noise environment with good microphone setups, the results will always be disappointing.
"Listnote" for android by netsavior · 2013-12-30 07:13 · Score: 1

it exposes the raw functionality of Android's speech recognition better than anything else I have seen. "I just want something that will put on screen what is said aloud" is a feature set that is surprisingly hard to find.

The main gap I see is that this is really only practical for 1v1 conversations and group settings will require exponentially more sophistication, to identify and differentiate between different speakers.

I would love to say that loved ones should fucking learn their child/parent/friend's first language so they can converse in ASL, but that is a surprisingly hard sell for some people. ASL is my son's first language, and there are plenty of people in his life who refuse to learn to speak with him.
1. Re:"Listnote" for android by goarilla · 2013-12-30 07:19 · Score: 1
  
  And then your brother marries a Chinese woman.
  Now you've got to learn Mandarin or Cantonese or shunt her out of your "loved-ones-group".
2. Re:"Listnote" for android by Anonymous Coward · 2013-12-30 07:30 · Score: 0
  
  Kind of a different situation when the person in question physically cannot learn the other language.
3. Re:"Listnote" for android by Anonymous Coward · 2013-12-30 07:31 · Score: 0
  
  A quick search for "deaf" on Google Play (store?) gives a few options. Among them:
  Deaf Helper -- speech->text
  headphone pass-through-- also translates sound to vibration
  Talk to the Deaf ($2.71) -- speech to text and text-to-speech.
  etc. Dunno why I'm retyping the list as you can just look at it yourself.
4. Re:"Listnote" for android by WolfgangPG · 2013-12-30 07:58 · Score: 1
  
  Talk for Me - Windows Phone: http://www.windowsphone.com/en-us/store/app/talk-for-me/1a9d317f-e55c-44c1-a643-e1dd4b4fafa9
5. Re:"Listnote" for android by icebike · 2013-12-30 08:00 · Score: 1
  
  ASL is my son's first language, and there are plenty of people in his life who refuse to learn to speak with him.
  If it's a hard sell, there's probably a reason for that.
  Technical solutions are the focus of this article.
  ASL to text is a drastically harder problem, but it appears to be under development.
  Text to ASL is starting to be available but probably only useful for people too young to read. (Showing them the text would be quicker if they could read). However it might serve as a teaching aid for others to learn ASL.
  The deaf seldom speak clearly enough for any speech recognition to work. Siri and Android speech recognition is haphazard enough when any random accent is involved, and becomes useless when a speech impediment issue exists.
  
  --
  Sig Battery depleted. Reverting to safe mode.
Seasonal Help by Anonymous Coward · 2013-12-30 07:16 · Score: 0

Hire some seasonal help. Interpreters need jobs around the holidays too.
Holiday rituals by russotto · 2013-12-30 07:21 · Score: 4, Funny

it means they sit around bored while watching relatives jabber
Turns out being able to hear doesn't actually help here.
1. Re:Holiday rituals by girlintraining · 2013-12-30 08:13 · Score: 0
  
  Turns out being able to hear doesn't actually help here.
  Amusing, yes, but we have a duty as a society to help our vulnerable. You may have the gift of hearing or sight today, but tomorrow anything can happen. We're not just helping them with this technology, we're helping ourselves.
  As well, some people have auditory processing disorders that make groups of people essentially unintelligible. I happen to be one of them -- I can hear and see just fine but I cannot separate individual words or conversations in a crowd. As a result, the only person I regularly go out in crowds with is a lesbian friend of mine who has learned sign language to converse with her hearing-impaired mother.
  This technology doesn't just help the deaf and/or blind; It can help those whose disabilities are less severe as well.
  
  --
  #fuckbeta #iamslashdot #dicemustdie
2. Re:Holiday rituals by rk · 2013-12-30 08:29 · Score: 2
  
  Ask most Deaf people if they think they're "vulnerable" and you'll get a pretty strong "No" (or a shake of a head with the first two fingers pinched together with the thumb).
  YMMV, of course.
3. Re:Holiday rituals by Anonymous Coward · 2013-12-30 09:36 · Score: 0
  
  It’s great that they feel all empowered and everything, but they’re still vulnerable to things they can’t hear. Smoke detectors, oncoming trains, cops yelling “Stop or I’ll shoot...” I don’t know why so many in the deaf community insist that being unable to hear is not a disability.
4. Re:Holiday rituals by Anonymous Coward · 2013-12-30 10:03 · Score: 0
  
  I don't know why so many persons think that a group of people is a "community", especially when their membership within it is not voluntary.
5. Re:Holiday rituals by Anonymous Coward · 2013-12-30 10:45 · Score: 1
  
  > Smoke detectors,
  That's why they flash. High pitched sounds don't do a good job waking up infants and old people. Or the deaf.
  > oncoming trains,
  Look both ways and feel the vibration traveling through the ground. Also a good practice where there's electric cars.
  > cops yelling “Stop or I’ll shoot...”
  Expecting people to hear is silly.
  > I don’t know why so many in the deaf community insist that being unable to hear is not a disability.
  You came closest with that last one: being deaf isn't that big a deal, sort of like being left-handed, however today's society is built around an expectation that people do hear thereby marginalizing the deaf. We know that doesn't need to be the case -- for instance Martha's Vineyard used to be equally accessible and deafness wasn't particularly remarkable.
6. Re:Holiday rituals by hey! · 2013-12-30 10:52 · Score: 2
  
  I expect that they object to being viewed as persons of intrinsically lesser capability.
  It's wrong, although not malicious, to say "deaf people are vulnerable," because vulnerability is not a permanent attribute; it's an ephemeral state that can usually be engineered out of the environment. A level crossing with only a crossbuck sign can be fitted with flashing lights.
  
  --
  Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Hardware first by Anonymous Coward · 2013-12-30 07:29 · Score: 0

You'd need to have the table wired first with mic pickups. Or everyone could be nice and wear a mic which is then fed to a central processor and fanned out to folks' iPhones. Couple this with translation and sell to the UN!
Her iPhone also has an even more amazing feature by Anonymous Coward · 2013-12-30 07:30 · Score: 0

It has a phone radio built in. The data gets converted to speech and, wait for it... it will even adopt the sound of your mother.
I can't believe what they'll think of next...
multiple conversations by jklovanc · 2013-12-30 07:39 · Score: 1

The problem with dinner conversations is that there are usually a number of them going on at one time. A computer has enough trouble following one voice let alone multiple voices at the same time.
Language by ledow · 2013-12-30 07:45 · Score: 1

"This morning, I had the best one-on-one discussion with my mother in years courtesy of her iPhone and Siri; voice recognition is definitely improving."
So, let me just ask, what was hindering this before? If one of you can hear and talk, and the other can read, then surely it's not a huge leap to one of you typing and the other reading? I know it's not a given, but it seems pretty obvious. And, from the deaf person's side, surely nothing is lost? Hell, you could have "talked with" them while chatting to friends at the same time (though that's probably rude if you don't have the focus to do so properly).
I have used text tools and translations to talk to Italian relatives when we're in a pinch and need to communicate. Mime gets you surprisingly far, and you can use keywords and dictionaries, but when the Italian for "spanner" is also the Italian for "key", it gets rather confusing rather quickly.
Speech recognition is inherently difficult and software for it, therefore, is crap. Sure, you can ask Siri to do something simple but she can't transcribe a conversation of any substance at any kind of speed or accuracy. People have been telling me that voice recognition systems could do that for over 20 years now - I'm still to find one and I don't have any speech difficulties or trouble communicating with people of varying accents. In fact, all the people I know that told me how great Dragon was usually found some alternative or quietly dropped their use of it within a year.
Don't expect speech recognition to be any good for a LONG time yet. Especially in a noisy / confusing environment. I hate parties partly because my mind tries to capture all audio and cannot discern them all at the same time (I have a pattern-recognising mind - once I'm tuned in, I can even write down strong-accented Italian as it's spoken even though I know little Italian, with few errors, but ask me to listen to two people talk at once and it hurts my head because my brain DOES try to decipher all the mess at the same time).
As such, the biggest boost to the deaf community in decades was text and instant messaging. Hell, I have no idea if half the people I "talk" to on the Internet are deaf or not.
So rather than trying to find some magical automatic tool so you don't have to do anything special to talk to your friend/relative who IS different, why not learn sign-language, use a text-based tool (even just notepad would work!) or just go to some effort to make yourself understood to them?
speech to text by roc97007 · 2013-12-30 07:50 · Score: 1

Daughter and her friend communicate via texting a lot. They've got speech to text turned on at one end, and on the other end, text to speech.
Waaaait a minute...

--
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
not yet by roc97007 · 2013-12-30 07:51 · Score: 1

My impression is, commercial speech to text works really well right now -- in fact, just well enough to lull you into complacency before it really screws you up.

--
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
1. Re:not yet by Livius · 2013-12-30 10:32 · Score: 1
  
  Humans instinctively develop a language faculty at a very young age - it's easy to underestimate what a difficult problem it is for computers and difficult to appreciate how good even young children are at it. 99% accuracy sounds good, but a person who could read at that accuracy would be called illiterate.
People don't talk to machines like they're people by matbury · 2013-12-30 07:58 · Score: 1

When we talk to each other, we speak very differently to when we're recording ourselves into a microphone or talking to a speech to text algorithm. Computers aren't at the point where they can interpret communicative intent and so they are unable to transcribe what you mean to say as opposed to the sounds that came out of your mouth. Even when people talk to their speech to text phones, we often see some very strange interpretations. Just check out the regular and frequent postings on Fail Blog.
Common problem for the Elderly.. please test this! by jbrohan · 2013-12-30 08:22 · Score: 1

www.Stay-in-Touch.ca includes a "Hard-of-Hearing" function. It is text or speech-to-text at your end, which comes through as text to "Grandma". She picks up the phone and speaks. Then you reply. Runs on an Android tablet for Grandma and a web page for you! I'd love you to test this.
Your mother... by Anonymous Coward · 2013-12-30 08:32 · Score: 0

Any reason your mother didn't just learn how to sign? It would seem to be a lot less expensive... and more thoughtful.
Freelance interpreters by Krischi · 2013-12-30 09:01 · Score: 1

No, there is nothing good enough. In fact, automated systems got a good, old-fashioned drubbing at the captioning challenge at the ASSETS 2013 conference. Nothing came even close to a human steno captioner.
If you want to make family events accessible and the person in question signs, my recommendation is to hire freelance interpreters. They often charge on a sliding scale, depending on the event and means of the client, and tend to run much cheaper than anyone you can get through referral services. That is what we do at family events, and it works both ways: events by my hearing family for us, and events by us for my hearing friends and family.
If the person does not sign, human captioning is an alternative, but it probably would have to be local rather than remote. Remote does well if there are no overlapping conversations. Otherwise, it has to be local, but that is a lot harder to arrange for, and more expensive.
A note to some of the other commenters: please spare us the patronizing posts of us not missing out on any of the (presumably inane) conversations that take place at such events. *We* make the decision as to what conversation we consider important and what we do not, and nothing is worse than people presuming to speak for us, and presuming to know better about access than we do.
1. Re:Freelance interpreters by Anonymous Coward · 2013-12-30 09:31 · Score: 0
  
  >A note to some of the other commenters: please spare use the patronizing posts of us not missing...
  
  A note to you: Please get a sense of humor.
2. Re:Freelance interpreters by Krischi · 2013-12-30 09:44 · Score: 1
  
  Well-said ... by someone who has no clue about what it is like.
Nope, unless ... by rogerz · 2013-12-30 09:38 · Score: 2

Are you able to do all of the following at your dinner conversation?:
1) Provide everyone with a decent close-talking directional microphone.
2) Require each person to take turns speaking, so there is very little overlap.
3) Have no pre-adolescents speaking.
4) Eliminate noticeable background noises.
5) Have no one with a strong non-native dialect speaking.
6) Require everyone to speak in full, grammatical sentences.
To the extent you say no to any of the above, you will get increasingly poor output. They are listed approximately in order of importance (1 being the most important). If you can say yes to all of those, you can probably get in the vicinity of 90% accuracy. This might be usable, depending on your ultimate purpose. If you were to additionally train acoustic and language models for all of the speakers, and then tell the software which user was speaking (i.e. switch the user on the fly during the conversation), you could probably get 95% accuracy and that would be quite usable.
So, in other words ... nope.
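For anyone who wants to put a number on "accuracy" for their own dinner table: a minimal sketch, assuming accuracy means one minus the word error rate (that definition is an assumption, the post doesn't spell one out). The function name and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word dropped by the recognizer
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    spoken = "please pass the mashed potatoes and the gravy to your aunt"
    heard = "please pass the mast potatoes in the gravy to your ant"
    wer = word_error_rate(spoken, heard)
    print(f"WER = {wer:.0%}, accuracy = {1 - wer:.0%}")
```

Type up a minute of what was actually said, run it against the recognizer's output, and you can see how far the result falls from the 90% and 95% figures above.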

--
If humans are mostly water, and beer is mostly water, then humans must be mostly beer.
You might try... by Virtucon · 2013-12-30 09:39 · Score: 1

Record the conversation and then play it into Dragon; it works, but you need a good-quality audio feed. I've also tinkered with Julius, and although it takes a bit of setup it works in most cases, but you have to tweak it a bit more than Dragon, at least in terms of what I was dealing with.

--
Harrison's Postulate - "For every action there is an equal and opposite criticism"
Thanks for the suggestions by DeafScribe · 2013-12-30 10:43 · Score: 3, Informative

I've been following the field for awhile, so I'm aware of the barriers to success; it is, as the engineers like to say, a non-trivial problem. But I can't possibly be aware of every development, so it's really helpful to get your perspective.
I agree with the general consensus that we're a ways off from accurate machine transcription of group discussions, for the reasons discussed; that several conversations can be active at once, interference from other background noises, comprehending context, etc.
The point about late-deafened people being able to work with lower accuracy is a good one. I'm like that, I can recognize phonetic mistakes and mentally substitute the correct word because I know what was intended, but I have a lot of born-deaf friends who would be lost.
One reader took umbrage at rusotto's joke that being able to hear doesn't help here. Me, I laughed. I know what it means to be bored silly even when everything is clear, and even ASL conversations can have the same problem. Another point about referring to deaf folks as "vulnerable" - yes, most people would resent that sort of label, even among those who understand it's not done with malice.
About communications with my mother - yes, we can converse by text, or through an ASL interpreter, and via video relay, and we've done all those things. But each of them is mediated to some degree, and working through Siri is too, but with an important difference; the other mediated techniques are more intrusive and divert focus from the person you're conversing with.
In video relay, I don't see my mother at all - I see an interpreter. Text - typing or writing - is also face to face, but it's slow. An ASL interpreter divides focus between the person I'm conversing with and the 'terp. All of these options work. The difference with Siri is, I can see her as she's speaking, focus on HER, read the text generated by Siri and match that with the facial expressions and body language.
One point made was the capacity for reading fast enough to keep up with transcription of a full table of rapid-fire conversation; I agree that would be tough.
Probably the most practical solution now is an ASL 'terp for those (like me) who know ASL. This is one area where the human capacity for a complex task trumps current tech.
Government meetings by Anonymous Coward · 2013-12-30 10:44 · Score: 0

I'd really like something like this to create transcripts of our town hall meetings. It is an important need for government transparency throughout the world. In some ways those might be easier to do because people don't tend to speak at once. The google voice technology seems pretty good.
Ideally the transcript could be combined with an app that overlays the transcript on the video or audio, with accelerated playback, allowing the user to tag which speaker is speaking (if not automated), and correct any errors. That could also be part of a captcha system.
1. Re:Government meetings by Anonymous Coward · 2013-12-30 11:35 · Score: 0
  
  Generating a readable transcript requires people to speak with embedded punctuation. Ain't gonna happen.
  Speaker ID is trivial in an individually-mic'd scenario like a government meeting. Automatic mic mixers can determine who is speaking, successfully bridge pauses in conversation, and provide moderator override if necessary. They can easily drive video switchers to cut to the correct camera, so generating a timestamped speaker ID file is easy. And "crowdsourcing" stuff like that via captcha will lead to the same public record integrity that Wikipedia exhibits -- namely none.
  "Shadow-speaking" before or after the fact can lead to reasonable accuracy, but stopping to spell non-vocab words is a major slow-down. Still, correcting transcripts wrt the recording is a great job for homeworkers, volunteers, etc.
  It's not whether "you" would like to create transcripts of your town hall meetings, it's would "your town" like to do so. If it's a US town, they likely are required to have minutes recorded for legal purposes. Most just record the meeting. Play it back as a podcast, with markers for discussion, vote, public input, etc. and you're probably good to go.
  If it's recorded and broadcast with captions anyway, it's easy to harvest the timestamped caption data as a text file.
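A rough sketch of what combining those two files might look like, assuming the mixer log is a sorted list of (seconds, speaker) changes and the caption harvest is a list of (seconds, text) lines. Both layouts, the function name, and the sample data are hypothetical; real mixers and caption encoders will have their own formats.

```python
from bisect import bisect_right


def label_captions(speaker_changes, captions):
    """Prefix each caption with whoever held the floor at that timestamp.

    speaker_changes: list of (start_seconds, name), sorted by start time.
    captions: list of (seconds, text).
    """
    starts = [start for start, _ in speaker_changes]
    lines = []
    for when, text in captions:
        # Index of the last speaker change at or before this caption.
        idx = bisect_right(starts, when) - 1
        name = speaker_changes[idx][1] if idx >= 0 else "UNKNOWN"
        minutes, seconds = divmod(int(when), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {name}: {text}")
    return lines


if __name__ == "__main__":
    changes = [(0.0, "Chair"), (12.5, "Clerk")]
    caps = [(3.0, "Call to order"), (14.0, "Roll call")]
    print("\n".join(label_captions(changes, caps)))
```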
Dragon NaturallySpeaking by Anonymous Coward · 2013-12-30 13:25 · Score: 0

Dragon NaturallySpeaking Home is $50 on Amazon.com
Court stenographers are pretty good by nbauman · 2013-12-30 15:29 · Score: 2

I've sat in trials in federal courts a few times. You can watch the stenographer type the transcript on a monitor in front of you, and it's much better than the TV captions.
These were pharmaceutical patent cases and FDA litigation, mostly technical stuff, chemists being cross-examined.
They had a system with a court stenographer typing into a computerized stenotype machine, and the judge and both parties watching the result on monitors.
I was last in court a few years ago, but I don't think it's changed much.
In the old days of manual stenography, stenotypists used to take dictation at 120 words per minute (which is the high end of normal speaking speed) to pass a certification test, and they could do 150 wpm in bursts. They would type abbreviations into a fanfold paper tape, then read the tapes and type a final version.
Now the computerized systems give them the final version automatically, so they don't have to reread it. Some stenotype systems were more amenable to automation than others, but now everybody uses computerized systems.
The result was pretty close to what the final transcript looked like, although I didn't examine them too carefully. The practice has always been for the stenographer to type a draft, and circulate it to the parties for review. (The stenographer is also allowed to ask a witness to repeat something when the stenographer missed it. That's what they're in court for -- to produce a record.)
Court stenographers also make a lot of money ($100,000 a year or more at the top end), because they sell transcripts to the lawyers. A patent case can be worth $100 million, so what's another $100,000 for transcripts? I don't think the TV captioners make anything like that. You get what you pay for.
1. Re:Court stenographers are pretty good by Whorhay · 2013-12-30 17:08 · Score: 1
  
  The thing is that in TV though you don't need a full time stenographer. You need at the most someone for broadcasts without transcripts, and even the talking heads on the news programs are often working from scripts on teleprompters. Probably the cheapest thing they could do is hire a person that is training to become a stenographer. They get real world experience for some pay and the broadcaster gets some semblance of professionalism. In most areas you are talking about four hours of live broadcast a day at the most, a third of that will actually be commercials and probably half of what is left will be canned segments that get reused for multiple time slots.
2. Re:Court stenographers are pretty good by TWX · 2013-12-30 20:48 · Score: 1
  
  To add to your argument, I expect that Good Morning America and other national-level programs could afford to pay the live professional stenographer.
  
  --
  Do not look into laser with remaining eye.
3. Re:Court stenographers are pretty good by nbauman · 2013-12-31 03:40 · Score: 1
  
  I don't like this idea of, "Hire a trainee to work cheap."
  Some of the show is scripted, but when I watch a news program, the most interesting thing is the on-air comments by the invited guests, and the back-and-forth debate. How is Newt Gingrich (or whomever) going to talk his way out of this one? That's also the most difficult thing to get down (as I know from often taking notes on panel debates).
  I think a trainee would make so many mistakes that it would cost more to go back and correct the mistakes than it would cost to do it right the first time. One of the major uses of the transcript is for future reference, like the CNN transcripts.
4. Re:Court stenographers are pretty good by Whorhay · 2013-12-31 09:20 · Score: 1
  
  You could always hire a few of them and run the results through a differential analysis program so that it uses the parts that match between the transcribers when in doubt.
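A minimal sketch of that idea for just two transcribers, using Python's standard difflib. It keeps the words both agree on and brackets the spots where they differ so a human can pick. The function name and sample lines are invented for illustration, and voting across three or more transcripts would need more than this.

```python
from difflib import SequenceMatcher


def merge_transcripts(first: str, second: str) -> str:
    """Keep words both transcribers agree on; flag the rest for review."""
    a, b = first.split(), second.split()
    merged = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(a[i1:i2])
        else:
            # Disagreement: show both versions side by side.
            left = " ".join(a[i1:i2])
            right = " ".join(b[j1:j2])
            merged.append(f"[{left}|{right}]")
    return " ".join(merged)


if __name__ == "__main__":
    print(merge_transcripts("the senator said the bill would pass",
                            "the senator says the bill would pass"))
```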
Glass by jhumkey · 2013-12-31 03:29 · Score: 1

Not "Google Glass" as is but . . . some future version of that, would seem to be the ideal. HIGHLY directional microphone, lets you "look" at the speaker of interest to help #1 see facial expressions/body language, #2 discern that "voice" from among the background noise clutter, with interpreted output onto the display.

Something should be possible... if it's done in "Wet" electronics (Brain-Body), with enough processing power and sensitivity, discernment should be somewhat possible in "Dry" electronics (IC's and discrete components.)

I keep thinking back to my early days in Ham Radio . . . as time passed and band conditions changed, signals faded and were occluded by background noise, but my human ears "locked on and tracked" the fading morse code. Such that, a passing observer, walking in fresh, might not "hear anything" in the din of noise . . . when I was just able to stay locked on and keep the conversation going a few minutes/seconds longer. How was I staying "locked on"?

Clearly, all I need to do is break out the CSI Miami DVD's and watch a few episodes, where they pull out a clear conversation from the background of a garbled audio conversation, and just repeat their procedure . . .

It'll take more than just directional audio input and decoding (recognition) software. (For those cases when you're trying to listen to ONE conversation in a room full of three or four going on) . . . "directionality" will help, but some pretty heavy duty "fast fourier transforms" (or something similar) might be needed to "track" the particular voices of interest. And laughter, non "voice" sounds like "gasps", tonal changes for surprise or interrogation . . . will make "tracking" problematic.

(Whether built by God, or 1.5 million years of evolution . . . the current design has a big head start.)

I know . . . all that's design postulation, and not the "I need it now" answer you were hoping for. Just thinking out loud.

--
No, I don't remember your name. But the memory mapped screen on a TRS80 from 1977 is from 15360 to 16383 if that helps.
Multiple Mics Can Separate Conversations by AlastairMurray · 2013-12-31 03:42 · Score: 1

I've not noticed anyone else mention using multiple microphones, but with two microphones spaced a few metres apart a computer can separate out two people talking over each other due to speed-of-sound differences (i.e. conversation A reaches mic 1 then mic 2, conversation B reaches mic 2 then mic 1).
I've seen two people talking separated using two microphones, I don't think you need more microphones for more people. This is a common machine learning demo (how I've seen it), but it is just signal processing really (i.e. no machine learning required).
I'm not aware of any products integrating this however, but I've never had the need to look. Is there such a thing as a bluetooth microphone for iPhone/Android? Like a headset but without the speaker. If you're a programmer maybe you could write an app that separates out different audio "streams" and then send each to existing voice recognition APIs.
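To make the speed-of-sound point concrete, here is a small sketch of only the first step: estimating which mic a sound reached first by cross-correlating the two recordings. It does not separate overlapping talkers; that is a further step not shown. The function names and the synthetic noise signal are made up for illustration. For scale, sound travels at roughly 343 m/s, so mics 2 m apart see at most about 6 ms of difference, or roughly 90 samples at 16 kHz.

```python
import random


def delayed(signal: list[float], samples: int) -> list[float]:
    """Return the signal shifted later by `samples`, padded with silence."""
    return [0.0] * samples + signal[:len(signal) - samples]


def best_lag(mic1: list[float], mic2: list[float], max_lag: int) -> int:
    """Lag (in samples) at which mic2 best lines up with mic1.

    Positive: the sound reached mic1 first. Negative: it reached mic2 first.
    """
    def score(lag: int) -> float:
        # Correlate mic1[n] with mic2[n + lag] over the overlapping samples.
        return sum(mic1[n] * mic2[n + lag]
                   for n in range(len(mic1))
                   if 0 <= n + lag < len(mic2))

    return max(range(-max_lag, max_lag + 1), key=score)


if __name__ == "__main__":
    rng = random.Random(0)
    voice = [rng.gauss(0.0, 1.0) for _ in range(200)]  # stand-in for speech
    # The talker sits nearer mic 1, so mic 2 hears the same thing 5 samples later.
    print("estimated lag:", best_lag(voice, delayed(voice, 5), max_lag=20))
```

With two talkers on opposite sides you would expect two correlation peaks with opposite signs, which is the "A reaches mic 1 first, B reaches mic 2 first" cue described above.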
Sometimes being deaf is a good thing... by antdude · 2013-12-31 04:19 · Score: 1

... Especially when family members are yelling and stuff. Ugh!

--
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
This works rather well by Anonymous Coward · 2013-12-31 12:51 · Score: 0

However, you will probably have to pass a mic around the table. I needed to transcribe some paragraphs in a book to text on my computer and this thing worked really well.
How to use Chrome's speech-to-text
http://howto.cnet.com/8301-11310_39-20058475-285/how-to-use-chromes-speech-to-text/