Extracting Audio From Visual Information

Posted by samzenpus on Monday August 4, 2014 @01:38AM from the what-the-bag-says dept.

rtoz writes: Researchers at MIT, Microsoft, and Adobe have developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag (video) photographed from 15 feet away through soundproof glass.

18 of 142 comments

Not surprising by Z00L00K · 2014-08-04 01:42 · Score: 4, Insightful

Spy agencies were already measuring the vibrations of windows and other objects 40 to 50 years ago, so I wonder whether this is just something that has been rediscovered?

--
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
1. Re:Not surprising by Z00L00K · 2014-08-04 01:45 · Score: 5, Informative
  
  To follow up, look at the Electromax Laser Listening Systems.
  
  --
  If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
2. Re:Not surprising by Hamsterdan · 2014-08-04 02:00 · Score: 5, Insightful
  
  The countermeasure for laser listening was to install the windows inside a pipe *frame* and play music in the pipes. Using an object inside the building to extract audio defeats that countermeasure. This is 2014, do not expect any privacy, especially from government agencies...
  
  --
  I've got better things to do tonight than die.
3. Re:Not surprising by JazzHarper · 2014-08-04 02:02 · Score: 4, Informative
  
  There is a very significant difference: this involves detecting vibrations in images of objects in a video recording rather than the objects themselves. However, not just any video will do; it requires a very high frame rate.
4. Re:Not surprising by timeOday · 2014-08-04 02:08 · Score: 4, Insightful
  
  Well, even a normal microphone is "just" measuring the linear displacement of a membrane over time, so clearly the important distinction is how you measure it. A laser range-finder is different from a microphone, and a video camera is different from a laser range-finder.
5. Re:Not surprising by fuzzyfuzzyfungus · 2014-08-04 02:18 · Score: 4, Funny
  
  Clearly, if your work is that important, having a window office becomes a sign of extremely low status and institutional unimportance, rather than professional advancement...
  
  (At least until they discover the guy spying on the basement dwellers with sophisticated seismometers)
6. Re:Not surprising by doublebackslash · 2014-08-04 04:47 · Score: 4, Informative
  
  FTFA:
  
  "In other experiments, however, they used an ordinary digital camera. Because of a quirk in the design of most cameras’ sensors, the researchers were able to infer information about high-frequency vibrations even from video recorded at a standard 60 frames per second. While this audio reconstruction wasn’t as faithful as it was with the high-speed camera, it may still be good enough to identify the gender of a speaker in a room; the number of speakers; and even, given accurate enough information about the acoustic properties of speakers’ voices, their identities."
  
  They don't go into detail on the algorithm, but reading between the lines, it seems that they are using the spatial nature of video and the fact that not every pixel is captured at exactly the same moment (let alone each line) to ferret out higher-frequency information. I have other guesses, but they are wild speculation. Either way, VERY cool.
  
  --
  md5sum /boot/vmlinuz
  d41d8cd98f00b204e9800998ecf8427e /boot/vmlinuz
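
The guess in the last reply above, that each line of a frame is captured at a slightly different moment, can be put into numbers. The following is only a sketch of a simplified rolling-shutter timing model, not the researchers' algorithm. The function names are hypothetical, and the row count and readout fraction are assumed values, not figures from the article.

```python
# Sketch only: a simplified rolling-shutter timing model.
# Function names and all parameter values are hypothetical assumptions.

def row_sample_times(fps, rows, readout_fraction, n_frames):
    """Return the capture time (seconds) of every row in every frame.

    Assumes rows are read one after another at a constant rate and that
    reading the whole sensor takes `readout_fraction` of the frame period.
    """
    frame_period = 1.0 / fps
    row_period = frame_period * readout_fraction / rows
    return [f * frame_period + r * row_period
            for f in range(n_frames)
            for r in range(rows)]


def row_rate_hz(fps, rows, readout_fraction):
    """Sampling rate during readout if each row counts as one sample."""
    return fps * rows / readout_fraction


# Assumed example: 60 fps video, 1080 rows, readout takes half the frame period.
times = row_sample_times(fps=60, rows=1080, readout_fraction=0.5, n_frames=2)
rate = row_rate_hz(fps=60, rows=1080, readout_fraction=0.5)
print(f"{len(times)} row samples, {rate:.0f} rows/s during readout")
```

Under these assumptions the rows are sampled far faster than the 60 fps frame rate, but only while the sensor is being read out. Any idle time between frames leaves gaps in the sample stream, which is consistent with the article's remark that the reconstruction is less faithful than with a high-speed camera.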
Been there done that by Anonymous Coward · 2014-08-04 01:45 · Score: 3, Funny

Sorry but that is so 2004.
- NSA
1. Re:Been there done that by gsslay · 2014-08-04 02:15 · Score: 4, Funny
  
  Are they not doing this already in CSI? I'm sure I saw them enhance an office security video of a post-it note, reflected off a monitor screen, magnified a couple of times, and there they had it; complete dialog in stereo, with accompanying analysis of voice stress so they knew who was lying. Isn't science wonderful?
Scary by Anonymous Coward · 2014-08-04 01:48 · Score: 3, Interesting

This is cool, yet scary stuff.
I wonder how loud the original audio has to be in order to be recovered in this manner? It sounded to me like the spoken words were being shouted, and we have no way of knowing how loud the music was played. I didn't see any mention of that in the linked article.
The linked article has additional technical(ish) information that's not in the video.
Now my tin-foil hat... by Anonymous Coward · 2014-08-04 01:51 · Score: 5, Funny

...Needs a tin-foil hat!
1. Re:Now my tin-foil hat... by JackieBrown · 2014-08-04 02:07 · Score: 4, Insightful
  
  The hat is a trick!
  The reason they want you to wear foil is so that the sound can bounce off it.
2. Re:Now my tin-foil hat... by fuzzyfuzzyfungus · 2014-08-04 02:37 · Score: 4, Insightful
  
  Worse than that. If there's a metal foil involved, vibration measurement should be doable with RF as well as light. Only with next-generation reduced-radar-cross-section geometries and RF-absorbent materials can a truly secure tinfoil hat be constructed.
  
  Unfortunately, walking around with what appears to be a small F-117 attached to your head offers limited visual camouflage potential and may prove counterproductive in your attempts to avoid Their surveillance.
Requires a very high speed camera by tepples · 2014-08-04 01:51 · Score: 4, Interesting

The YouTube video captions state that this technique requires a camera capable of a few thousand frames per second. Thus this is pretty much using a camera to follow the vibrations, little different from a laser mic. What would impress me more is if they were able to pick up different frequencies from different parts of the bag with different resonant frequencies and reconstruct from standard 30 fps video using the bag as a transducer.
1. Re:Requires a very high speed camera by interiot · 2014-08-04 02:10 · Score: 4, Insightful
  
  30 Hz is far below the Nyquist rate (6800 Hz, going by POTS specs), so no, that wouldn't be possible without some fundamental changes in our understanding of information theory and physics.
2. Re:Requires a very high speed camera by tepples · 2014-08-04 03:30 · Score: 3, Insightful
  
  In theory, if you can find different targets in the frame with resonant frequencies spaced no more than 15 Hz apart, you can read a different 15 Hz band off each target.
3. Re:Requires a very high speed camera by Anonymous Coward · 2014-08-04 03:32 · Score: 3, Insightful
  
  Oh dear. You even linked to Wikipedia (although not to the Wikipedia page "Nyquist Rate"). Does it not occur to you that OP understands those things better than you do?
  To start with you need to understand what the Nyquist rate means. Sampling is like wrapping a signal around a cylinder. Just because parts are overlaid ("aliasing") doesn't mean you can't untangle the original signal. For instance, if a single audio source contains only pure harmonics, so the frequencies are known to be N, 2N, 3N, 4N, and so on, and if you have the range of possible N down to a smallish range (e.g. you know it's a voice) and you know that higher harmonics are always smaller than lower harmonics, then you can, from a massively sub-Nyquist sampling like this, extract both N *and* all the coefficients of all the harmonics. It's just like determining the dimensions of a triangle after it's wrapped around a cylinder. No, the triangle doesn't have to fit within one revolution of the cylinder, that's just the trivial case that obviously works.
  What OP is proposing is that because different parts of the physical system have different resonances, when you look at that part of the image you are seeing a strongly filtered version of the original signal - basically a single frequency. You can measure the size of this signal using an aliased sampling - there's no problem with that whatsoever, it just works, an aliased sampling has the same energy as a non-aliased sampling, the samples are just in a different order. Then if you know different image areas have different responses, you can build up an image of the signal by patchwork. It would be a bloody hard job for a crisp packet in arbitrary configuration, but if you get to design the object you're looking at you can make this as sensitive as you like, and even use really crappy cameras to do it.
  Nyquist rate isn't the be-all and end-all people think it is, it's just a limit for *perfect* reconstruction of *arbitrary* signals. The naive approach is to restrict yourself to sub-Nyquist signals and use the easy algorithms everybody knows. The fun stuff (read: the stuff you might get paid for) involves at least flirting with the Nyquist range, or even fully embracing that aliasing is happening and figuring out the consequences from first principles. Once you do this, you can do amazing things that seem impossible to Signal Processing 101 students ... the only problem then is you get SP101 students telling you you're an idiot for thinking that's possible. Oh, well.
  BTW, the sampling rate on telephony is 8000 Hz as standard. Pro-tip: if you want to sound like a signal processing expert, know common sample rates.
4. Re:Requires a very high speed camera by blincoln · 2014-08-04 03:48 · Score: 5, Informative
  
  For some reason, the person who posted the article or the Slashdot editors linked to a bad knock-off video that removed 3/4 of the details instead of the actual researchers' video. The real video makes it clear that they can also get results from a standard DSLR 60 FPS video by taking advantage of the rolling shutter effect. There's a fidelity loss, but it's a lot better than I would have expected.
  
  --
  "...always new atoms but always doing the same dance, remembering what the dance was yesterday." -Richard Feynman
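
Two points from the sub-Nyquist reply in this thread can be checked numerically for the simplest case, a single pure tone: where the tone lands after sampling at 30 Hz, and whether its amplitude is still measurable there. This is a minimal sketch under that single-tone assumption. The function names and the 440 Hz example are hypothetical, and it says nothing about how well real objects isolate one frequency.

```python
import math


def alias_frequency(f_hz, fs_hz):
    """Frequency (Hz) at which a pure tone appears after sampling at fs_hz."""
    folded = f_hz % fs_hz
    # Frequencies above fs/2 fold back down into the range [0, fs/2].
    return fs_hz - folded if folded > fs_hz / 2 else folded


def sampled_rms(f_hz, fs_hz, amplitude, n_samples):
    """RMS of a sine of the given amplitude, sampled at fs_hz."""
    samples = [amplitude * math.sin(2 * math.pi * f_hz * n / fs_hz)
               for n in range(n_samples)]
    return math.sqrt(sum(s * s for s in samples) / n_samples)


# Assumed example: a 440 Hz tone seen through a 30 fps camera.
tone_hz, camera_fps = 440.0, 30.0
print(alias_frequency(tone_hz, camera_fps))
print(sampled_rms(tone_hz, camera_fps, amplitude=2.0, n_samples=300))
```

In this example the tone shows up at a much lower frequency, yet the RMS of the samples still matches amplitude / sqrt(2), so the size of a known narrowband component survives the aliasing. The check breaks down when the tone aliases to exactly 0 Hz or to half the sample rate, where the result depends on phase.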