Slashdot Mirror


Google's DeepMind Made an AI Watch Close To 5000 Videos So That It Surpasses Humans in Lip-Reading (thetechportal.com)

A new AI tool created by Google and Oxford University researchers could significantly improve the success of lip-reading and understanding for the hearing impaired. In a recently released paper on the work, the pair explained how the Google DeepMind-powered system was able to correctly interpret more words than a trained human expert. From a report: To accomplish the task, a cohort of scientists fed thousands of hours of TV footage -- 5000 to be precise -- from the BBC to a neural network. It was made to watch six different TV shows, which aired between January 2010 and December 2015. The footage included 118,000 different sentences and some 17,500 unique words. To gauge the progress: it deciphered words with 46.8 percent accuracy, working from mouth movements alone. The under-50-percent accuracy might seem laughable to you, but let me put things in perspective. When the same set of TV shows was shown to a professional lip-reader, they were able to decipher only 12.4 percent of words without error. Thus, one can understand the great difference in capability between the AI and a human expert in that particular field.

3 of 80 comments

  1. A nice contrast to all the AI doom-mongering by Bearhouse · · Score: 4, Interesting

    My beloved grandmother went deaf after years working in a factory (in those days, especially during WW2 when she helped build tanks, HSE did not exist).
    It was really painful to see how it penalised her in daily life, family gatherings etc.
    She ended up talking all the time, and then getting paranoid about "what people were saying about her".
    So, if this can be used with some kind of (better-resolved) implementation of Google Glass to help the hard of hearing, then great!

  2. A purpose for Google Glass? by JustDisGuy · · Score: 3, Interesting

    As a person with hearing difficulty, realtime captioning of live conversation would be an awesome use of this technology.

    Add to that an app that identifies the people I'm talking to, and I'm your next customer.

    --
    "Never attribute to malice that which is adequately explained by stupidity." - Hanlon's Razor

  3. Sounds about right... by RyanFenton · · Score: 3, Interesting

    Sounds about right, for the circumstances.

    I'm working on a project right now using CMU Sphinx (because it's free/open source) to identify word starts/ends for the sake of syncing word display to audio. All the tools available for speech-to-text are going to require human editing:

    Comparison of commonly used speech-to-text tools

    ...lots of words end up word salad with any tools, even custom-trained, but the tools are nice for being able to at least have the words show up on beat once they are human-corrected.
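    For anyone curious what the "show up on beat" step might look like, here is a minimal sketch. It assumes the aligner output has already been human-corrected and reduced to (word, start, end) tuples in seconds. The timings below are made up for illustration, and `word_at` is a hypothetical helper, not part of CMU Sphinx or any other tool.

```python
from bisect import bisect_right

# Hypothetical, hand-corrected word timings in seconds: (word, start, end).
# These values are invented for illustration only.
WORDS = [
    ("hello", 0.00, 0.42),
    ("there", 0.50, 0.90),
    ("world", 1.10, 1.60),
]

# Start times kept in a separate sorted list so lookups can use binary search.
STARTS = [start for _, start, _ in WORDS]


def word_at(t):
    """Return the word being spoken at playback time t, or None during silence."""
    # Index of the last word whose start time is <= t.
    i = bisect_right(STARTS, t) - 1
    if i < 0:
        return None  # t is before the first word starts
    word, _start, end = WORDS[i]
    # Between one word's end and the next word's start nothing is displayed.
    return word if t < end else None
```

    Calling `word_at(current_playback_time)` on each display refresh would give the word to highlight. Binary search keeps the lookup cheap even for long transcripts.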

    Syncing video frames of talking without the audio has got to be even more ambiguous, with more reliance on context.

    Sounds like a good challenge for a learning system to pick up on. The 5000 hour mark seems almost analogous to what a human child might pick up raised watching TV in a language different from their family.

    Ryan Fenton