Slashdot Mirror


Extracting Audio From Visual Information

rtoz writes Researchers at MIT, Microsoft, and Adobe have developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag (video) photographed from 15 feet away through soundproof glass.

34 of 142 comments

  1. Not surprising by Z00L00K · · Score: 4, Insightful

    Measuring the vibrations of windows or other items was used already 40 to 50 years ago by spy agencies, so I wonder if this isn't something that has been re-discovered?

    --
    If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    1. Re:Not surprising by Z00L00K · · Score: 5, Informative

      To follow up, look at the Electromax Laser Listening Systems.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    2. Re:Not surprising by Hamsterdan · · Score: 5, Insightful

      The countermeasure for laser listening was to install the windows inside a pipe *frame* and play music in the pipes. Using an object inside the building to extract audio defeats that countermeasure. This is 2014, do not expect any privacy, especially from government agencies...

      --
      I've got better things to do tonight than die.
    3. Re:Not surprising by JazzHarper · · Score: 4, Informative

      There is a very significant difference: this involves detecting vibrations in images of objects in a video recording rather than the objects themselves. However, not just any video will do; it requires a very high frame rate.

    4. Re:Not surprising by Anonymous Coward · · Score: 2, Funny

      When high frame rate cameras are outlawed, only outlaws will have high frame rate cameras..............

    5. Re:Not surprising by timeOday · · Score: 4, Insightful

      Well, even a normal microphone is "just" measuring the linear displacement of a membrane over time, so clearly the important distinction is how you measure it. A laser range-finder is different from a microphone, and a video camera is different from a laser range-finder.

    6. Re:Not surprising by fuzzyfuzzyfungus · · Score: 4, Funny

      Clearly, if your work is that important, having a window office becomes a sign of extremely low status and institutional nonimportance, rather than professional advancement...

      (At least until they discover the guy spying on the basement dwellers with sophisticated seismometers)

    7. Re:Not surprising by Z00L00K · · Score: 2

      The method is the same, it's just a different tool involved on the way.

      It's enough to measure the image of an object, you don't need to record it first and you actually don't need a laser either, even though it may help.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    8. Re:Not surprising by Jeremy+Erwin · · Score: 2

      From Top Secret America: the rise of the surveillance state

      As important to a man's self image as the power of his car's engine or his motorcycle's rumble, SCIF size had become a symbol of status. "In DC, everyone talks SCIF, SCIF, SCIF," said Bruce Paquin, owner of a construction company that builds SCIFs for the government and private corporations. "They've got the penis envy thing going. You can't be a big boy unless you're a three letter agency and you have a big SCIF."

      (A SCIF is a room that has been certified to be impenetrable to various types of surveillance techniques.)

    9. Re:Not surprising by doublebackslash · · Score: 4, Informative

      FTFA

      In other experiments, however, they used an ordinary digital camera. Because of a quirk in the design of most cameras’ sensors, the researchers were able to infer information about high-frequency vibrations even from video recorded at a standard 60 frames per second. While this audio reconstruction wasn’t as faithful as it was with the high-speed camera, it may still be good enough to identify the gender of a speaker in a room; the number of speakers; and even, given accurate enough information about the acoustic properties of speakers’ voices, their identities.

      They don't go into detail on the algorithm but reading between the lines it seems that they are using the spatial nature of video and the fact that not every pixel is captured at exactly the same moment (let alone each line) to ferret out higher frequency information. I have other guesses, but they are wild speculation. Either way VERY cool.

      --
      md5sum /boot/vmlinuz
      d41d8cd98f00b204e9800998ecf8427e /boot/vmlinuz
    10. Re:Not surprising by Jeremy+Erwin · · Score: 2

      That's on a need to know basis, and you don't have a need to know.

  2. Been there done that by Anonymous Coward · · Score: 3, Funny

    Sorry but that is so 2004.

    - NSA

    1. Re:Been there done that by gsslay · · Score: 4, Funny

      Are they not doing this already in CSI? I'm sure I saw them enhance an office security video of a post-it note, reflected off a monitor screen, magnified a couple of times, and there they had it; complete dialog in stereo, with accompanying analysis of voice stress so they knew who was lying. Isn't science wonderful?

  3. Possible NASA method by Anonymous Coward · · Score: 2, Funny

    Could this be used by NASA to look for intelligent life on other worlds by measuring objects in the same fashion?

  4. Scary by Anonymous Coward · · Score: 3, Interesting

    This is cool, yet scary stuff.

    I wonder how loud the original audio has to be in order to be recovered in this manner? It sounded to me like the spoken words were being shouted, and we have no way of knowing how loud the music was played. I didn't see any mention of that in the linked article.

    The linked article has additional technical(ish) information that's not in the video.

  5. Now my tin-foil hat... by Anonymous Coward · · Score: 5, Funny

    ...Needs a tin-foil hat!

    1. Re:Now my tin-foil hat... by JackieBrown · · Score: 4, Insightful

      The hat is a trick!

      The reason they want you to wear foil is so that the sound can bounce off it.

    2. Re:Now my tin-foil hat... by fuzzyfuzzyfungus · · Score: 4, Insightful

      Worse than that. If there's a metal foil involved, vibration measurement should be doable with RF as well as light. Only with next-generation reduced radar cross section geometries and RF absorbent materials can a truly secure tinfoil hat be constructed.

      Unfortunately, walking around with what appears to be a small F-117 attached to your head offers limited visual camouflage potential and may prove counterproductive in your attempts to avoid Their surveillance.

  6. Requires a very high speed camera by tepples · · Score: 4, Interesting

    The YouTube video captions state that this technique requires a camera capable of a few thousand frames per second. Thus this is pretty much using a camera to follow the vibrations, little different from a laser mic. What would impress me more is if they were able to pick up different frequencies from different parts of the bag with different resonant frequencies and reconstruct from standard 30 fps video using the bag as a transducer.

    1. Re:Requires a very high speed camera by interiot · · Score: 4, Insightful

      30 Hz is far below the Nyquist rate (6800 Hz, going by POTS specs), so no, that wouldn't be possible without some fundamental changes in our understanding of information theory and physics.

    2. Re:Requires a very high speed camera by sunderland56 · · Score: 2, Insightful

      reconstruct from standard 30 fps video

      Dear sir: what you are asking is impossible.

      Sincerely yours,

      Harry Nyquist

    3. Re:Requires a very high speed camera by Uecker · · Score: 2

      No, it could work. He wants to capture different information from different parts of the bag. This is a multi-channel problem, so you can go below Nyquist. Also, you might have a model for speech that you can use to reduce the amount of required information. Finally, you do not need perfect recovery.

    4. Re:Requires a very high speed camera by silfen · · Score: 2

      It's not "impossible", and he even told you how to do it. Incidentally, your ear works the way he suggested.

    5. Re:Requires a very high speed camera by jones_supa · · Score: 2

      30 fps would allow a maximum frequency of 15 Hz.

    6. Re:Requires a very high speed camera by tepples · · Score: 3, Insightful

      In theory, if you can find different targets in the frame with resonant frequencies spaced no more than 15 Hz apart, you can read a different 15 Hz off each target.

    7. Re:Requires a very high speed camera by kristianbrigman · · Score: 2

      Well, it might be theoretically possible - but you'd need to get the bits from somewhere. Think of an ocean wave, and you want to measure the height of the water at a given point in time. But waves on water move in fairly predictable ways, so a single picture will tell you both the height of the water at the time the picture was taken, as well as a good approximation of what it was for a short time before and after the picture.

      Another possibility is if there are multiple video streams from the same event, they are probably all 30 fps, but probably didn't catch the exact same samples - overlay them and you may be able to reconstruct a higher-frequency signal.

      This doesn't make it an easier problem, or even possible - now, instead of having to capture at a frequency above the Nyquist rate, you have to capture video at a resolution that can tell the micro-topology of a potato chip bag from 15 feet away. After all, you have to extract the information from somewhere. But there are ways to get beyond Nyquist sometimes.

      Another example which feels related but I'm not sure how yet: Roland has a patent on electronic drums. They have a single sensor in the middle of the drum, yet within a quarter-wave of a hit anywhere on the drum, they can tell both that it was hit and how far away from the center it was hit, based on the shape of the wave.

    8. Re:Requires a very high speed camera by Anonymous Coward · · Score: 3, Insightful

      Oh dear. You even linked to Wikipedia (although not to the Wikipedia page "Nyquist Rate"). Does it not occur to you that OP understands those things better than you do?

      To start with you need to understand what the Nyquist rate means. Sampling is like wrapping a signal around a cylinder. Just because parts are overlaid ("aliasing") doesn't mean you can't untangle the original signal. For instance, if a single audio source contains only pure harmonics, so the frequencies are known to be N, 2N, 3N, 4N, and so on, and if you have the range of possible N down to a smallish range (e.g. you know it's a voice) and you know that higher harmonics are always smaller than lower harmonics, then you can, from a massively sub-Nyquist sampling like this, extract both N *and* all the coefficients of all the harmonics. It's just like determining the dimensions of a triangle after it's wrapped around a cylinder. No, the triangle doesn't have to fit within one revolution of the cylinder, that's just the trivial case that obviously works.

      What OP is proposing is that because different parts of the physical system have different resonances, when you look at that part of the image you are seeing a strongly filtered version of the original signal - basically a single frequency. You can measure the size of this signal using an aliased sampling - there's no problem with that whatsoever, it just works, an aliased sampling has the same energy as a non-aliased sampling, the samples are just in a different order. Then if you know different image areas have different responses, you can build up an image of the signal by patchwork. It would be a bloody hard job for a crisp packet in arbitrary configuration, but if you get to design the object you're looking at you can make this as sensitive as you like, and even use really crappy cameras to do it.
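      The energy claim above can be checked numerically. What follows is a minimal sketch, not anything taken from the researchers' work: the 440 Hz tone, its amplitude and phase, and the helper name `rms_of_sampled_tone` are all assumptions picked for illustration. It samples a pure tone at 30 samples per second, far below its Nyquist rate, and compares the RMS of those aliased samples with the tone's true RMS.

```python
import math

def rms_of_sampled_tone(freq_hz, amplitude, sample_rate_hz, n_samples, phase=0.0):
    """RMS of a pure sine tone sampled at sample_rate_hz (hypothetical helper)."""
    total = 0.0
    for n in range(n_samples):
        t = n / sample_rate_hz
        x = amplitude * math.sin(2 * math.pi * freq_hz * t + phase)
        total += x * x
    return math.sqrt(total / n_samples)

# Assumed example values: a 440 Hz tone of amplitude 2.0, watched by a
# 30 fps "camera" for 3 seconds (90 samples). At 30 fps the tone aliases
# down to 10 Hz, yet the size of the signal is still measurable.
aliased_rms = rms_of_sampled_tone(440.0, 2.0, 30.0, 90, phase=0.3)
true_rms = 2.0 / math.sqrt(2)
print(aliased_rms, true_rms)
```

      One caveat on this sketch: the match breaks down when the tone sits at an exact multiple of half the sample rate (450 Hz at 30 fps, say), because every sample then lands on the same point of the waveform and the measured RMS depends on the phase.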

      Nyquist rate isn't the be-all and end-all people think it is, it's just a limit for *perfect* reconstruction of *arbitrary* signals. The naive approach is to restrict yourself to sub-Nyquist signals and use the easy algorithms everybody knows. The fun stuff (read: the stuff you might get paid for) involves at least flirting with the Nyquist range, or even fully embracing that aliasing is happening and figuring out the consequences from first principles. Once you do this, you can do amazing things that seem impossible to Signal Processing 101 students ... the only problem then is you get SP101 students telling you you're an idiot for thinking that's possible. Oh, well.

      BTW, sampling rate on telephony is 8000Hz as standard. Pro-tip: if you want to sound like a signal processing expert, know common sample rates.

    9. Re:Requires a very high speed camera by SydShamino · · Score: 2, Informative

      No, you can pick up something higher than Nyquist, as long as you understand your sources of information and noise. It will alias down into the measurable range, and you can extract useful information from the alias. We have a system that operates up to 1 MHz using a 1.8 MHz ADC. When we know the signal is at 1 MHz, we extract the information at 800 kHz and use that.
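      The arithmetic behind that 800 kHz figure can be sketched as follows. This is an illustrative example only, not the poster's actual system: the helper names `alias_frequency` and `sampled_cosine` and the choice of 32 samples are assumptions. It folds a 1 MHz tone into the 0 to fs/2 band of a 1.8 MHz sampler, then confirms that a 1 MHz cosine and an 800 kHz cosine produce the same sample values at that rate.

```python
import math

def alias_frequency(signal_hz, sample_rate_hz):
    """Apparent frequency of a tone after sampling, folded into 0..fs/2."""
    folded = math.fmod(signal_hz, sample_rate_hz)
    if folded > sample_rate_hz / 2:
        folded = sample_rate_hz - folded
    return folded

def sampled_cosine(freq_hz, sample_rate_hz, n_samples):
    """Samples of a unit cosine at freq_hz taken at sample_rate_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

FS = 1_800_000.0                       # 1.8 MHz ADC, as in the comment
alias = alias_frequency(1_000_000.0, FS)

# The two tones are indistinguishable once sampled at FS.
original = sampled_cosine(1_000_000.0, FS, 32)
folded = sampled_cosine(alias, FS, 32)
max_diff = max(abs(a - b) for a, b in zip(original, folded))
print(alias, max_diff)
```

      Because the samples are identical, the sampler alone cannot tell which tone was present; the prior knowledge that the signal is at 1 MHz is what resolves the ambiguity.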

      What the GGP was talking about, though, was finding resonances on the bag where unique 30-Hz-wide bands of higher frequencies were being naturally modulated to baseband. If you had 100 points on the bag that each modulated a different frequency (30 Hz, 45 Hz, 90 Hz, ... 1500 Hz), you could extract the data from each sub-band separately and reconstruct the original signal. See http://en.wikipedia.org/wiki/F... and assume the source isn't one 1500 Hz conversation but instead one hundred 15 Hz conversations. And also assume that is one amazing bag of chips.

      --
      It doesn't hurt to be nice.
    10. Re:Requires a very high speed camera by blincoln · · Score: 5, Informative

      For some reason, the person who posted the article or the Slashdot editors linked to a bad knock-off video that removed 3/4 of the details instead of the actual researchers' video. The real video makes it clear that they can also get results from a standard DSLR 60 FPS video by taking advantage of the rolling shutter effect. There's a fidelity loss, but it's a lot better than I would have expected.

      --
      "...always new atoms but always doing the same dance, remembering what the dance was yesterday." -Richard Feynman
    11. Re:Requires a very high speed camera by doublebackslash · · Score: 2

      That assumes that you only are getting one sample per frame. FTFA

      In other experiments, however, they used an ordinary digital camera. Because of a quirk in the design of most cameras’ sensors, the researchers were able to infer information about high-frequency vibrations even from video recorded at a standard 60 frames per second. While this audio reconstruction wasn’t as faithful as it was with the high-speed camera, it may still be good enough to identify the gender of a speaker in a room; the number of speakers; and even, given accurate enough information about the acoustic properties of speakers’ voices, their identities.

      Remember that video has two spatial dimensions with 3 channels each (which themselves are in different spatial locations within each pixel) and that each line isn't captured at the same instant. There is a lot more information there than a single sample at a given rate. Nyquist doesn't apply to the frame rate here. Nyquist is still relevant to the problem, of course! They didn't break Nyquist, they just found a way to get more information than intuition implies.
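      The "each line isn't captured at the same instant" point can be made concrete with a rough timing model. This is a sketch under stated assumptions, not the researchers' algorithm (the article does not describe it): the function names, the 1080-row sensor, and the guess that rows are read out over half of each frame period are all hypothetical. It computes when each row of a frame is captured and the sample rate that implies within a frame.

```python
def row_sample_times(frame_index, fps, n_rows, readout_fraction):
    """Capture time in seconds of each sensor row in one frame (hypothetical model)."""
    frame_period = 1.0 / fps
    frame_start = frame_index * frame_period
    # Assumption: rows are read out one after another at a constant pace,
    # taking readout_fraction of the frame period in total.
    row_step = readout_fraction * frame_period / n_rows
    return [frame_start + r * row_step for r in range(n_rows)]

def in_frame_sample_rate(fps, n_rows, readout_fraction):
    """Rows captured per second while the sensor is reading out a frame."""
    return n_rows * fps / readout_fraction

# Assumed example: 60 fps video, 1080 rows, readout over half the frame period.
times = row_sample_times(0, 60.0, 1080, 0.5)
print(len(times), times[1] - times[0], in_frame_sample_rate(60.0, 1080, 0.5))
```

      Under these assumed numbers the rows arrive far faster than the 60 Hz frame rate, though with a gap between frames in which nothing is sampled, so the result is a burst-like stream and not a uniform one.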

      --
      md5sum /boot/vmlinuz
      d41d8cd98f00b204e9800998ecf8427e /boot/vmlinuz
  7. Resolution and sensor noise by BitZtream · · Score: 2, Informative

    The sensor and optics must have been ridiculously high quality and resolution for this to work. Sensor noise alone would almost certainly rule this out for any COTS consumer package. They certainly aren't doing it with CNN footage or old CCTV surveillance tapes.

    In which case, it's of no practical value since a laser mic would be far cheaper and more discreet.

    Cool from an academic perspective that they can use DSP now, but it's just more fun with a laser mic, same principles and theories, new less workable application.

    --
    Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
  8. Re: Yeah, only if one speaks in extremely low tone by Anonymous Coward · · Score: 2, Funny

    Because if your target is eating SunChips you'd risk hearing loss.

  9. Re:Yeah, only if one speaks in extremely low tones by silas_moeckel · · Score: 2

    Because you're emitting something when you send that IR laser to do it. This is completely passive.

    --
    No sir I dont like it.
  10. tl;dr: by CaptainStumpy · · Score: 2

    Yelling MARY HAD A LITTLE LAMB, ITS FLEECE WAS WHITE AS SNOW at a houseplant, bag of chips, and glass of water is now research.

    --
    It will be better to purchase from an owner who is a good farmer and a good builder.