Slashdot Mirror


Ask Slashdot: Effective, Reasonably Priced Conferencing Speech-to-Text?

First-time accepted submitter DeafScribe writes "Every year during the holidays, many people in the deaf community lament the annual family gathering ritual because it means they sit around bored while watching relatives jabber. This morning, I had the best one-on-one discussion with my mother in years courtesy of her iPhone and Siri; voice recognition is definitely improving. It would've been nice if conference-level speech-to-text had been available this evening for the family dinner. So how about it? Is group speech-to-text good enough now, and available at reasonable cost for a family dinner scenario?"

5 of 81 comments

  1. captions by phantomfive · Score: 4, Insightful

    Go find some YouTube videos with auto-captioning. That is the upper limit on the quality you will get with today's technology.

    Good luck.

    --
    "First they came for the slanderers and i said nothing."
  2. Re:There isn't any... by TWX · Score: 4, Insightful

    video transcribers also quite expensive

    Based on what I get on my TV when I press the Mute button, they really shouldn't be...

    --
    Do not look into laser with remaining eye.
  3. Holiday rituals by russotto · Score: 4, Funny

    it means they sit around bored while watching relatives jabber

    Turns out being able to hear doesn't actually help here.

  4. Re:There isn't any... by jettoblack · Score: 5, Insightful

    There's no perfect solution, but something that works for 60% might already be better than nothing.

    I work in the closed captioning industry, and I'd say anything less than 95% accuracy is actually WORSE than nothing. Automatic Speech Recognition (ASR) has no concept of context or situational awareness. The mistakes these systems make tend to be not in the simple common words and phrases, but concentrated in the nouns, especially proper nouns: names of people, places, companies, products, etc. Even at 80% accuracy, which is quite good for the current best speaker-independent ASR systems, you're looking at 2 words out of every 10 being substituted with the wrong word, completely changing the meaning of the phrases. Imagine the chaos if (major news network)'s closed captioning reported some celebrity or politician as saying "I'm not a fan of Jews." when they actually said "I'm not a fan of juice." (That's 5 words out of 6 correct, which would be 83% accurate!) Wars have been started over one misheard word out of a thousand; imagine how bad 200 out of 1000 would be.
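
    To make the arithmetic above concrete, here is a rough sketch of how word-level accuracy could be scored against a reference transcript. It assumes accuracy means 1 minus (word-level edit distance / number of reference words), which is one common convention and not necessarily what any particular captioning vendor uses. The function names word_errors and word_accuracy are made up for illustration.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (edit distance in words, number of reference words).

    Uses a standard Levenshtein distance over whitespace-separated,
    lowercased words: substitutions, insertions, and deletions each cost 1.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # prev[j] = distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = cur[j - 1] + 1  # extra word in output
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1], len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy as 1 - errors / reference length (assumed definition)."""
    errors, total = word_errors(reference, hypothesis)
    if total == 0:
        raise ValueError("reference transcript is empty")
    return 1 - errors / total


if __name__ == "__main__":
    said = "I'm not a fan of juice"
    captioned = "I'm not a fan of Jews"
    print(round(word_accuracy(said, captioned), 2))  # 0.83
```

    One wrong word out of six scores 83%, even though that single substitution reverses what the sentence is taken to mean. A plain word count like this treats every word as equally important, which is exactly why the headline accuracy number understates the damage when the errors land on the nouns.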

    Here's an article about a HUMAN transcription error that caused a pretty major ruckus. Now imagine this kind of problem being an order of magnitude worse:
    http://www.people.com/people/article/0,,20693447,00.html

    People who lost hearing later in life tend to do better with high-error-rate ASR because they know what words sound like and can figure out easy substitutions, e.g. Juice vs. Jews, Election vs. Erection, etc., but people who were born deaf or lost hearing before language acquisition cannot easily make these substitutions in their heads because they don't "hear" the word sounds when they read them.

  5. Re:There isn't any... by TWX · Score: 4, Informative

    To put this into a car analogy, electric cars don't need to surpass ICE cars in every conceivable scenario to make one worth buying for a given individual.

    No, but to expand on your car analogy, they have to be able to meet certain minimum standards and customer requirements.

    And dropping out of analogy, the hypothetical courtroom automatic stenographer would probably have it easiest, as the rules of the court dictate that only one person may speak at a time, and most courts have individual microphones for every speaking party for acoustically recording the proceedings anyway. The same cannot be said for the dinner table.

    Even the most rudimentary system for sampling several participants would cost hundreds of dollars. A halfway accurate comparison would be the equipment needed to record a drum set, with individual microphones for each drum, cymbal, and accessory, and a processor that monitors line levels and records each input separately. Swap the recording of each input for processing each input into discrete words, and only then are you even getting to the hard part, interpreting what the sounds actually are.

    The low-end equipment to record drums is hundreds of dollars. High-end equipment to do the same thing costs thousands of dollars. Now tack on the cost of the processing side, and you're probably at tens of thousands of dollars. All that just to attempt to participate in a large group conversation, as opposed to a small-party conversation where polite participants will probably work to simplify the flow of conversation so that the impaired individual can participate.

    A friend of mine in a social club has a son with some form of developmental disability. I've heard that it's Asperger's, but I'm not entirely certain, as many of the traits commonly associated with Asperger's don't seem to manifest with him. When he's party to our conversations, we modify our conversation to accommodate him. We attempt to avoid speaking over each other or over him, and we increase the amount of time that one considers a pause by a given speaker, so that we don't interrupt him while he's talking.

    If we had a substantially hearing-impaired member, we would probably modify our conversations accordingly, slowing our speech enough that lips could be read, attempting to avoid talking over each other, and attempting to keep our faces oriented to where the individual could see those faces. Given the nature of our vocabulary in this social setting (a speculative fiction group) it would be highly unlikely that a speech-to-text system would correctly interpret any of the truly important words in the conversation anyway, so such a system would be useless.

    --
    Do not look into laser with remaining eye.