Slashdot Mirror


Extracting Audio From Visual Information

rtoz writes: Researchers at MIT, Microsoft, and Adobe have developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag (video) photographed from 15 feet away through soundproof glass.

4 of 142 comments

  1. Re:Not surprising by Z00L00K · Score: 5, Informative

    To follow up, look at the Electromax Laser Listening Systems.

    --
    If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
  2. Re:Not surprising by JazzHarper · Score: 4, Informative

    There is a very significant difference: this involves detecting vibrations in images of objects in a video recording rather than the objects themselves. However, not just any video will do; it requires a very high frame rate.
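
To illustrate why frame rate matters here: if each frame is treated as one sample of the object's vibration, the usual sampling limit applies, and tones above half the frame rate fold down to a lower apparent frequency. The sketch below is only that textbook arithmetic, not the researchers' method; the function names and the 2000 FPS figure are made up for the example.

```python
def nyquist_limit(fps):
    """Highest frequency (Hz) representable when each frame is one sample."""
    return fps / 2.0


def alias_frequency(signal_hz, fps):
    """Apparent frequency (Hz) of a pure tone sampled once per frame.

    Tones above the Nyquist limit fold back into the range [0, fps / 2].
    """
    return abs(signal_hz - fps * round(signal_hz / fps))


if __name__ == "__main__":
    for fps in (60, 2000):  # 2000 is an illustrative high-speed rate, not a quoted one
        print(fps, "FPS: limit", nyquist_limit(fps), "Hz;",
              "a 440 Hz tone appears at", alias_frequency(440, fps), "Hz")
```

Under these assumptions, a 440 Hz tone filmed at 60 FPS would show up as a 20 Hz wobble, while a camera running at a couple of thousand frames per second would capture it directly.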

  3. Re:Requires a very high speed camera by blincoln · Score: 5, Informative

    For some reason, the person who posted the article or the Slashdot editors linked to a bad knock-off video that removed 3/4 of the details instead of the actual researchers' video. The real video makes it clear that they can also get results from a standard DSLR 60 FPS video by taking advantage of the rolling shutter effect. There's a fidelity loss, but it's a lot better than I would have expected.

    --
    "...always new atoms but always doing the same dance, remembering what the dance was yesterday." -Richard Feynman
  4. Re:Not surprising by doublebackslash · Score: 4, Informative

    FTFA

    In other experiments, however, they used an ordinary digital camera. Because of a quirk in the design of most cameras’ sensors, the researchers were able to infer information about high-frequency vibrations even from video recorded at a standard 60 frames per second. While this audio reconstruction wasn’t as faithful as it was with the high-speed camera, it may still be good enough to identify the gender of a speaker in a room; the number of speakers; and even, given accurate enough information about the acoustic properties of speakers’ voices, their identities.

    They don't go into detail on the algorithm, but reading between the lines, it seems that they are using the spatial nature of video and the fact that not every pixel is captured at exactly the same moment (let alone each line) to ferret out higher-frequency information. I have other guesses, but they are wild speculation. Either way, VERY cool.
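
A rough way to picture the timing argument above: if a sensor reads its rows one after another instead of all at once, each row carries its own timestamp, so a single frame holds many time samples. The sketch below only computes those timestamps under a simplified model. The even row spacing and the `readout_fraction` parameter are assumptions for illustration, not details from the article, and it says nothing about how motion is actually extracted from each row.

```python
def row_sample_times(fps, rows, readout_fraction=1.0, frames=1):
    """Timestamps (seconds) at which each sensor row is read.

    Assumes rows are read top to bottom at even spacing, and that the
    readout occupies `readout_fraction` of each frame period (hypothetical).
    """
    frame_period = 1.0 / fps
    row_delay = readout_fraction * frame_period / rows
    return [f * frame_period + r * row_delay
            for f in range(frames)
            for r in range(rows)]


def effective_row_rate(fps, rows, readout_fraction=1.0):
    """Rows read per second while the sensor is actively reading out."""
    return fps * rows / readout_fraction


if __name__ == "__main__":
    # 60 FPS with 1080 rows; both numbers are illustrative.
    print(effective_row_rate(60, 1080), "row samples per second")
    print(row_sample_times(60, 4))
```

In this model, a 60 FPS, 1080-row video yields row timestamps spaced about 15 microseconds apart, which is why per-row motion could in principle carry information well above the 30 Hz limit of whole-frame sampling. If the readout takes less than the full frame period, there is a gap between frames with no samples at all.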

    --
    md5sum /boot/vmlinuz
    d41d8cd98f00b204e9800998ecf8427e /boot/vmlinuz