Extracting Audio From Visual Information
rtoz writes "Researchers at MIT, Microsoft, and Adobe have developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag (video) photographed from 15 feet away through soundproof glass."
To follow up, look at the Electromax Laser Listening Systems.
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
There is a very significant difference: this involves detecting vibrations in images of objects in a video recording rather than the objects themselves. However, not just any video will do; it requires a very high frame rate.
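To put a rough number on the frame-rate point: if each frame is treated as one sample of the vibration, the Nyquist limit says nothing above half the frame rate can be recovered directly, and higher tones fold down to a lower alias. The sketch below only illustrates that sampling rule. The function names and the 440 Hz tone are made up for the example and do not come from the article.

```python
def nyquist_limit(fps):
    """Highest frequency (Hz) representable with one sample per frame."""
    return fps / 2.0


def aliased_frequency(tone_hz, fps):
    """Apparent frequency (Hz) of a pure tone sampled once per frame."""
    # Fold the tone around the nearest integer multiple of the frame rate.
    return abs(tone_hz - fps * round(tone_hz / fps))


limit_60 = nyquist_limit(60)            # 30.0 Hz at 60 frames per second
alias_440 = aliased_frequency(440, 60)  # a 440 Hz tone shows up as 20 Hz
```

Under these assumptions, a plain 60 FPS recording read one sample per frame cannot represent anything above 30 Hz, well below the frequencies that carry speech.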
For some reason, the person who posted the article or the Slashdot editors linked to a bad knock-off video that removed 3/4 of the details instead of the actual researchers' video. The real video makes it clear that they can also get results from a standard DSLR 60 FPS video by taking advantage of the rolling shutter effect. There's a fidelity loss, but it's a lot better than I would have expected.
"...always new atoms but always doing the same dance, remembering what the dance was yesterday." -Richard Feynman
FTFA
In other experiments, however, they used an ordinary digital camera. Because of a quirk in the design of most cameras’ sensors, the researchers were able to infer information about high-frequency vibrations even from video recorded at a standard 60 frames per second. While this audio reconstruction wasn’t as faithful as it was with the high-speed camera, it may still be good enough to identify the gender of a speaker in a room; the number of speakers; and even, given accurate enough information about the acoustic properties of speakers’ voices, their identities.
They don't go into detail on the algorithm, but reading between the lines, it seems they are using the spatial nature of video and the fact that not every pixel (let alone every line) is captured at exactly the same moment to ferret out higher-frequency information. I have other guesses, but they are wild speculation. Either way, VERY cool.
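Here is a toy model of that guess, not the researchers' algorithm. It assumes an idealized rolling shutter where the rows are read out evenly across the whole frame period with no gap between frames, and where every row sees the same vibration. The row count, the 440 Hz tone, and the function names are all hypothetical. Under those assumptions, each row is its own time sample, so the effective sample rate is the frame rate times the number of rows.

```python
import math


def row_timestamps(fps, rows, n_frames):
    """Capture time of every row, assuming readout fills the whole frame period."""
    line_delay = 1.0 / (fps * rows)
    return [k / fps + r * line_delay for k in range(n_frames) for r in range(rows)]


def estimate_frequency(samples, times):
    """Estimate a pure tone's frequency by counting upward zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings / (times[-1] - times[0])


fps, rows, tone_hz = 60, 100, 440.0  # hypothetical camera and test tone
times = row_timestamps(fps, rows, n_frames=60)  # one second of video
samples = [math.sin(2 * math.pi * tone_hz * t + 0.1) for t in times]

# Using every row as a sample: effective rate is fps * rows = 6000 Hz.
per_row_estimate = estimate_frequency(samples, times)

# Using only the first row of each frame: 60 samples per second, so the tone aliases.
per_frame_estimate = estimate_frequency(samples[::rows], times[::rows])
```

In this toy setup the per-row estimate lands close to the real 440 Hz, while the per-frame estimate collapses to roughly 20 Hz. A real sensor has a pause between frames and the rows do not all see the same motion, so this only shows why line-by-line capture could carry the extra information at all.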
md5sum
d41d8cd98f00b204e9800998ecf8427e