Slashdot Mirror


Researchers Build An AI That's Better At Reading Lips Than Humans (bbc.com)

An anonymous reader quotes the BBC: Scientists at Oxford say they've invented an artificial intelligence system that can lip-read better than humans. The system, which has been trained on thousands of hours of BBC News programs, has been developed in collaboration with Google's DeepMind AI division. "Watch, Attend and Spell", as the system has been called, can now watch silent speech and get about 50% of the words correct. That may not sound too impressive - but when the researchers supplied the same clips to professional lip-readers, they got only 12% of words right...
The system now recognizes 17,500 words, and one of the researchers says, "As it keeps watching TV, it will learn."

62 comments

  1. "professional" ? by Anonymous Coward · · Score: 0

    "the same clips to professional lip-readers"

    ok, who else didn't know that there are "professional" lip readers?

    1. Re:"professional" ? by toonces33 · · Score: 2

      Well, there is "Bad Lip Reading" - their videos are usually pretty funny.

    2. Re:"professional" ? by Anonymous Coward · · Score: 0

      Hey, if you get paid for it, you're a professional.

    3. Re:"professional" ? by JustAnotherOldGuy · · Score: 2

      "the same clips to professional lip-readers"

      ok, who else didn't know that there are "professional" lip readers?

      The police use them from time to time (on surveillance videos). I imagine there are other uses as well.

      --
      Just cruising through this digital world at 33 1/3 rpm...
    4. Re:"professional" ? by Anonymous Coward · · Score: 0

      He's very accurate.

      I doubt any AI system will be able to beat his accuracy in the near future.

  2. Feed it dubbed Shaw Brothers. by Anonymous Coward · · Score: 0

    Blow its goddamned stack.

  3. 17 years too late by Anonymous Coward · · Score: 5, Insightful

    I'm sorry Dave, I'm afraid I can't do that.

    1. Re:17 years too late by hcs_$reboot · · Score: 2

      16 actually (a comment to remove the Funny mod, that should be Insightful instead)

      --
      Slashdot, fix the reply notifications... You won't get away with it...
  4. Great way to get flushed down the airlock! [n/t] by mobby_6kl · · Score: 1

    N/T

  5. perfect opportunity by v1 · · Score: 2

    Seeing as there's so much closed-captioning going on, they've got an enormous volume of material to train their neural network on.

    I've done this sort of thing before, and often finding a large set of quality training material is a significant challenge.

    Getting half the words correct, then feeding that into a grammar / context engine should yield very close to 100% accuracy. That's what deaf (and hearing impaired) lip readers have to do since the stated 12% initial recognition is about right. They have to stay very focused on the speaker and make heavy use of context to work out what's being said. And that's a perfect job for a computer.

    --
    I work for the Department of Redundancy Department.
    1. Re:perfect opportunity by BarbaraHudson · · Score: 1

      The closed-captioning does speech-to-text, not lip reading. It's advanced to the point that you can dictate your SMS messages more reliably than fumbling around with an on-screen keyboard and auto-uncorrect.

      --
      "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
    2. Re:perfect opportunity by ShanghaiBill · · Score: 1

      The closed-captioning does speech-to-text, not lip reading.

      Sure, but if it did both, the error rate would go way down.

    3. Re:perfect opportunity by jordanjay29 · · Score: 1

      What are you talking about? Closed captioning for most media is manually entered and synced with the time. Speech-to-text captions (like those on YouTube) have far less accuracy, although sometimes they put real-time captioning (think televised news) to shame. But most of what you see on TV and everything on DVDs is written and checked by a human, and is not entirely reliant on STT transcription.

    4. Re:perfect opportunity by jordanjay29 · · Score: 1

      Sort of? Consider how many times dialogue is spoken off-camera, such as a voice-over or cutaway reaction shot, or when the speaker is simply not facing the camera. Your reliability in those cases is cut in half anyway without the advantage of being able to lip read.

    5. Re:perfect opportunity by JohnFen · · Score: 1

      Also consider how frequently the captions differ from the actual spoken words.

    6. Re:perfect opportunity by Anonymous Coward · · Score: 0

      Have you seen what some of those closed captioning systems come up with?

      They replace "Democrat" with "Republican" (MAD TV Ross Perot sketch).

      Sometimes they misinterpret entire phrases, like "the entire army battalion moved in another direction" becoming "the army battalion moved into another dimension".
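How far a caption drifts from the spoken words can be put into numbers with a word error rate (WER): the word-level edit distance divided by the length of the reference. The sketch below is only an illustration. The function name `word_error_rate` and the two sentences (adapted from the example above) are assumptions, not anything the researchers or captioning vendors are known to use.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word dropped from the caption
                curr[j - 1] + 1,     # extra word in the caption
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: two of the eight words are substituted.
spoken = "the entire army battalion moved in another direction"
caption = "the entire army battalion moved into another dimension"
rate = word_error_rate(spoken, caption)
print(f"WER: {rate:.2f}")
```

Two substitutions out of eight reference words give a WER of 0.25, which is already enough to change the meaning of the sentence.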

    7. Re:perfect opportunity by stephanruby · · Score: 1

      Would each closed-captioned syllable or word need to be manually synchronized with the video first? Or can the training be done without it?

      Getting half the words correct, then feeding that into a grammar / context engine should yield very close to 100% accuracy.

      But this AI is already using context to some degree. The article gives the example of "Prime Minister" for instance, where the AI knows that if the word "Prime" is read on their lips, that the word "Minister" will probably follow. Also, the AI has been trained in one context alone, which means that the context is already taken into account. For instance, if the same anchorman were to order his favorite frappuccino at Starbucks, I really don't think that the AI would do as good a job.
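The "Prime Minister" effect can be pictured with a toy bigram model that simply counts which word most often follows another. This is a rough sketch under stated assumptions: the three-sentence corpus and the helper names `train_bigrams` and `most_likely_next` are made up for illustration, and nothing here claims to be how the actual system models context.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows another in a toy corpus."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev_word, next_word in zip(words, words[1:]):
            following[prev_word][next_word] += 1
    return following


def most_likely_next(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical news-style training text.
corpus = [
    "the prime minister spoke today",
    "the prime minister met reporters",
    "a prime example of context",
]
model = train_bigrams(corpus)
guess = most_likely_next(model, "Prime")
print(guess)
```

A model trained only on news text has nothing to say about words it never saw, which is the frappuccino problem in miniature: `most_likely_next(model, "frappuccino")` returns `None`.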

      Also, they say they used thousands of hours of video, but it could be that they trained the AI on just three to four news anchors, which could make it easier on the AI. After all, I would expect a professional lip reader to do a lot better with reading the lips of his own family members, simply because he was so accustomed to their style of talking.

      And last, I always doubt the self-reporting of scientific results to the mainstream press. An AI developer/researcher has every incentive to exaggerate the success rate of his own work. Also, any professional lip reader hired probably received compensation for their work and probably signed an NDA with the researcher. So it could be very easy for the researcher to claim whatever he wanted and no one would be there to contradict his story.

      After all, we're talking about big money here for the right sleight of hand (whether it's exaggerating, lying, or doing something else completely unethical). For instance, the guy who sold his self-driving car company to Uber after having only started four months earlier sold it for 600 million dollars. Can you imagine 600 million dollars after only four months? Many people, including researchers, would be willing to lie, cheat, or even kill for a tiny fraction of that amount, and some others would even be willing to do it for free for the ego boost alone.

    8. Re:perfect opportunity by jordanjay29 · · Score: 1

      This can happen for a number of reasons, actually. Sometimes it's an actual mistake, but also possible is a rephrasing of the line to make it easier to caption or easier to understand. Since captioning is most often geared towards Deaf people, and many grew up with English as a second language, some idioms and turns of phrase can seem out of context and aren't as appropriate for captions. There are some who bristle at this attempt at hand-holding and think captions should be 100% accurate to dialogue, while others are fine with some revisions to make it easier, but either way this is a thing that can happen. Also possible is censorship or mismatched captions to the video.

    9. Re:perfect opportunity by JohnFen · · Score: 1

      Yes, I understand. But the fact that the captions and the spoken words often differ limits the effectiveness of combining captions and lip reading to reduce the error in machine translations. It doesn't matter much why the captions and the spoken words differ.

    10. Re:perfect opportunity by BarbaraHudson · · Score: 1

      Closed captioning for live events (such as news) is speech-to-text. That's easy to detect if you read the captions while listening to the words: the mistakes aren't typos, but similar-sounding words. Manual entry also introduces a delay of a few seconds, just as simultaneous translation is not really simultaneous; there's a delay of a second or so (though the translator can often anticipate what's about to be said from context, and when they goof, you get to hear it when they correct themselves).

      --
      "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
    11. Re:perfect opportunity by Anonymous Coward · · Score: 0

      if you're going to train lip reading robots using existing closed captioned television.. first make the captions on the television programs complete and accurate, every time - all the time. they aren't, currently. not by a long shot.

      source: been 'reading' television for nearly 20 years now.

    12. Re: perfect opportunity by Anonymous Coward · · Score: 0

      Some live TV is also done by hand, which is why it sometimes lags the dialogue.

    13. Re:perfect opportunity by v1 · · Score: 1

      The closed-captioning does speech-to-text, not lip reading.

      Closed Captioning is the transmission of text of what is being said along with the video and audio stream. It's up to the receiver to do text to speech.

      The benefit of CC here is that you have the "problem" (the video of the speaker) AND the "answer" (the text that they spoke) to work with, and this is precisely what you require to train a neural network. A large volume of problems and correct solutions. "When you get THIS input, you are supposed to produce THAT output". Over and over again, with as much volume and variety as possible. That sort of training is how you end up with a high-accuracy neural network.

      For comparison, go do a Google image search for "cat". A network properly trained to recognize a picture of a cat needs to see a lot of pictures of cats. But you'll notice the search is polluted with a large number of drawings of cats, more than one cat, and things that aren't cats at all. If a significant number of those get fed into the training process, they can severely affect the network's performance. This almost always means you have to comb over the training material by hand, removing anything that's either not correct or not a good example. CC on the other hand is available in a HUGE quantity, with a very high purity, making training so much easier to do and producing so much higher quality performance from the trained network.

      --
      I work for the Department of Redundancy Department.
    14. Re:perfect opportunity by BarbaraHudson · · Score: 1

      There's no need to do text-to-speech if you're already transmitting the audio stream, duh!

      --
      "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
  6. straight from the related links by Anonymous Coward · · Score: 1

    https://tech.slashdot.org/story/16/11/25/1146258/googles-deepmind-made-an-ai-watch-close-to-5000-videos-so-that-it-surpasses-humans-in-lip-reading?sdsrc=rel

    1. Re: straight from the related links by Anonymous Coward · · Score: 0

      You don't expect /. editors to actually read /., do you?

  7. But the wild card walks in by Anonymous Coward · · Score: 1

    Sees the computer AI progressing in its research, and decides to replace the movies being watched, with the complete collection of gojira monster films that were dubbed in English and hardly provided any syncing at all, circa 1960's era, followed by Chinese martial arts movies full of lines like "Yaaaaa!" " Huh?" and "Prepare to die!"

    The icing on the cake is when he throws in an Inspector Clouseau film

  8. The surveillance state by JustAnotherOldGuy · · Score: 3, Insightful

    The surveillance state is coming in its pants thinking about all the additional conversations they'll be able to monitor now.

    Time to break out the bandannas and cough-masks....soon it'll be fashionable to wear them in public!

    --
    Just cruising through this digital world at 33 1/3 rpm...
    1. Re:The surveillance state by fustakrakich · · Score: 3, Insightful

      soon it'll be fashionable to wear them in public!

      And illegal

      --
      “He’s not deformed, he’s just drunk!”
    2. Re:The surveillance state by Anonymous Coward · · Score: 0

      How about mustaches that cover lips?

  9. That cry of dismay ... by BarbaraHudson · · Score: 1

    That cry of dismay was the sound of thousands of blind gynecologists realizing they will be out of a job reading lips. :-)

    Of course the reality is grim - even more surveillance by marketers and the state - especially with TVs and webcams and (if you believe Trump) microwaves watching everything you say and do.

    --
    "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
    1. Re:That cry of dismay ... by Anonymous Coward · · Score: 0

      if you believe Trump

      You should! He never lies, and he's always right

    2. Re:That cry of dismay ... by BarbaraHudson · · Score: 1

      if you believe Trump

      You should! He never lies, and he's always right

      Except when his lips move ....

      --
      "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
  10. I for one welcome our new ... by Anonymous Coward · · Score: 0

    ... silent-movie-interpreting overlords.

    1. Re:I for one welcome our new ... by jordanjay29 · · Score: 1

      The irony of your comment is that silent movies generally used title cards for their dialogue anyway, making it equally accessible no matter if you could hear or not.

    2. Re: I for one welcome our new ... by Anonymous Coward · · Score: 0

      The irony of your comment is that the dialogue spoken by the actors but not heard is not necessarily what is on the title cards.

  11. Professional lip readers are bunk. by Khyber · · Score: 2

    Go compare this to a deaf person that reads lips. I know of literally thousands that never miss a single spoken word as long as they're looking at the speaker's mouth.

    Source: Camfrog, where there are fucktons of deaf people communicating with those with hearing. We speak after getting their attention with a hand signal, they read our lips and reply with zero issues.

    --
    Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    1. Re:Professional lip readers are bunk. by JohnFen · · Score: 1

      This is true. I once had a conversation with someone and was very surprised to later learn that the person was completely deaf. I had no clue.

  12. The George Bush AI by Anonymous Coward · · Score: 0

    Watch my lips...

  13. Based on "2001", I thought it would be better by mykepredko · · Score: 1

    Or was Frank Poole killed because HAL thought they were going to unplug the "Mammary Circus" and that was basically the only DVD the three of them could agree on watching?

  14. Re: Maybe /. needs an AI ... by Anonymous Coward · · Score: 1

    with all the AI job obsolescence going on the universal income one is pretty much relevant

  15. need good info to train the AI by frovingslosh · · Score: 1

    I'm wondering what text they are using to train the AI about what was said. I sure hope it isn't the closed captioning text on the news broadcasts. In my experience that is only about 50% accurate itself.

    --
    I'm an American. I love this country and the freedoms that we used to have.
  16. hong kong kung fu by Anonymous Coward · · Score: 0

    I will be impressed when they can lip read old Hong Kong kung fu movies!

  17. Round peg, meet round hole by yodleboy · · Score: 3, Interesting

    Why don't they offer to run this against the thousands of hours of course videos that Berkeley just pulled due to ADA? Google gets massive training material, Berkeley gets free transcripts, and the material stays online. Everyone wins...

    1. Re:Round peg, meet round hole by BarbaraHudson · · Score: 1

      Because Berkeley lied when they said that they had to provide transcripts or remove the material. Section 107 of the Copyright Act of 1976 allows for fair use for teaching materials, and this allows 3rd parties to make available all such materials in more accessible forms, and for Berkeley to use the results of such work.

      They weren't interested in doing this. It's about monetization and artificial scarcity, pure and simple. This was just a smokescreen to remove the material.

      The blind will be using TTS screen readers such as NonVisual Desktop Access (NVDA) anyway, and deaf people can still read the materials, and use STT software for converting speech to text (and to all those idiots who continue to say that speech-to-text doesn't work because it didn't when you tried it in 1995, try dictating your SMS messages - it's quicker and more accurate than trying to use an on-screen keyboard).

      --
      "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
    2. Re:Round peg, meet round hole by Anonymous Coward · · Score: 0

      Why don't they offer to run this against the thousands of hours of course videos that Berkeley just pulled due to ADA? Google gets massive training material, Berkeley gets free transcripts, and the material stays online. Everyone wins...

      What happened at Berkeley smells of bullshit. Could it be intentional? If it is, even above 100% accuracy will not be good enough.

    3. Re:Round peg, meet round hole by Anonymous Coward · · Score: 0

      What is it like living in the Barbara Hudsonverse?

      The vast majority of the videos were released under a creative commons license and are available elsewhere. https://it.slashdot.org/story/...

      Also, the "smokescreen" was the result of the DOJ (Trump's!) telling them that they had to make the materials ADA compliant

      Berkeley and Jeff Secessionins working together, to monetize freely available creative commons videos by ensuring that people mirror them! Brilliant, Babs, really!

    4. Re:Round peg, meet round hole by Barnoid · · Score: 1

      Why don't they offer to run this against the thousands of hours of course videos that Berkeley just pulled due to ADA? Google gets massive training material, Berkeley gets free transcripts, and the material stays online. Everyone wins...

      Good idea, but unfortunately it won't work in this case. Many of UCBerkeley's lecture videos only show the slides and you hear the lecturer talk. See, for example, https://www.youtube.com/watch?....

    5. Re:Round peg, meet round hole by BarbaraHudson · · Score: 1

      They could have pointed out that since these are fair use materials, there are agencies whose mandate is to make them ADA compliant. They didn't.

      --
      "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
  18. Learning through TV by JohnFen · · Score: 1

    "As it keeps watching TV, it will learn."

    When TV was first being introduced as a consumer product, one of the selling points of the idea was that people would be able to learn by watching it. If this works out as well as that, then the system will only be able to recognize when someone is uttering lines from commercials.

    1. Re:Learning through TV by jordanjay29 · · Score: 1

      That accounts for the 50% rate, that's about how many commercials are captioned.

  19. Read My Lips: No... New... Taxes by Anonymous Coward · · Score: 0

    Having the NN train on politicians might make for long-term reliability issues :S

  20. Spying Concerns by ssufficool · · Score: 1

    At least I know it won't be able to read my lips. You see, I speak American, not English.

  21. Not surprising by nitehawk214 · · Score: 1

    Humans are very difficult to read.

    --
    I'm a good cook. I'm a fantastic eater. - Steven Brust
  22. That's OK. Burkas will be required. by Anonymous Coward · · Score: 0

    That's OK. Burkas will be required.

    Sharia for the win!

  23. Try this line, Mr. AI lipreader by cellocgw · · Score: 1

    Did he just say "No new taxes," or did he say "No Newt[Gingrich] Axes" ?

    Heck you were even told, prior to that line, "read my lips," so you got no excuses.

    --
    https://app.box.com/WitthoftResume Code: https://github.com/cellocgw
  24. Duplicate, and old by udif · · Score: 1

    https://tech.slashdot.org/story/16/11/25/1146258/googles-deepmind-made-an-ai-watch-close-to-5000-videos-so-that-it-surpasses-humans-in-lip-reading

  25. Quiet Man by sfsp · · Score: 1

    Maybe we'll finally find out what John Ford told Maureen O'Hara to say to John Wayne...a secret all three took to their graves...