Encrypted VoIP Meets Traffic Analysis

Re:Bleh by Anthony+Mouse · 2011-03-15 03:29 · Score: 5, Informative

I'm pretty sure that identifying a specific word with 50% accuracy is better than random chance. There are more than two words in the English language.

That's not good by Anonymous Coward · 2011-03-15 03:30 · Score: 1

Better stick to a constant bitrate then :)

Re:That's not good by WorBlux · 2011-03-15 14:24 · Score: 1

Exactly, or just add enough random data to the stream alongside the voice channel to make it look like a constant stream of random data.

Re:Bleh by ByOhTek · 2011-03-15 03:30 · Score: 1

People only use two phrases when they talk?

--
Self proclaimed typo king, and inventor of the bear destroying coffee table (patent not pending).

So...obvious solution then? by Anthony+Mouse · 2011-03-15 03:30 · Score: 4, Interesting

Use fixed-bitrate encoding for VoIP.

Re:So...obvious solution then? by ackthpt · 2011-03-15 03:32 · Score: 1

Use fixed-bitrate encoding for VoIP.
Better still, two cans and a length of string.

--

A feeling of having made the same mistake before: Deja Foobar
Re:So...obvious solution then? by Bengie · 2011-03-15 03:58 · Score: 2

until someone gets a warrant to string tap you. You'd think the string connecting the two cans is protected by quantum randomness from the string theory, but it is not.
Re:So...obvious solution then? by bsquizzato · 2011-03-15 04:00 · Score: 3, Interesting

Not so obvious --- now you have a much less efficient use of bandwidth to deal with.
The article describes the method used to detect phrases ...

At a high level, the success of our technique stems from exploiting the correlation between the most basic building blocks of speech—namely, phonemes—and the length of the packets that a VoIP codec outputs when presented with these phonemes. Intuitively, to search for a word or phrase, we first build a model by decomposing the target phrase into its most likely constituent phonemes, and then further decomposing those phonemes into the most likely packet lengths. Next, given a series of packet lengths that correspond to an encrypted VoIP conversation, we simply examine the output stream for a sub-sequence of packet lengths that match our model.
Essentially, you gather enough information about how a VBR codec could encode a speech phrase you are looking for, then predict where it was spoken by looking at the "data bursts" being sent in the media stream. We'll need to research a way to "scramble" this predictability that's more efficient than using fixed bitrates, which eats up un-needed bandwidth.
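A rough sketch of the matching step described in the quoted passage, in Python. This is a deliberately simplified stand-in: it checks each window of observed packet lengths against a fixed set of allowed lengths per position, whereas the quoted text talks about "most likely" lengths, which implies a probabilistic model. The function name `find_matches`, the model, and the byte values are all hypothetical.

```python
def find_matches(observed, model):
    """Return start indices where every observed packet length falls in the
    model's allowed set for that position (exact-set matching, no scoring)."""
    hits = []
    for start in range(len(observed) - len(model) + 1):
        window = observed[start:start + len(model)]
        if all(length in allowed for length, allowed in zip(window, model)):
            hits.append(start)
    return hits


# Hypothetical model: allowed packet lengths (bytes) for four consecutive frames
# of the target phrase.
model = [{38, 46}, {46, 50}, {28}, {38, 42}]

# Hypothetical packet lengths seen on the wire; the payload is encrypted,
# but the lengths are visible to anyone on the path.
observed = [20, 20, 38, 50, 28, 42, 20, 46, 46, 28, 38]

print(find_matches(observed, model))  # [2, 7]
```

The point is only that the search never touches the ciphertext itself, just the sequence of lengths.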
Re:So...obvious solution then? by Anonymous Coward · 2011-03-15 04:09 · Score: 5, Informative

OpenSSH had a similar problem, it would leak information about your login password by the timing/size of the packets:
http://www.ece.cmu.edu/~dawnsong/papers/ssh-timing.pdf
I believe their solution was to introduce random NOP packets into the stream. This approach could work here too.
Re:So...obvious solution then? by Cthefuture · 2011-03-15 04:22 · Score: 4, Interesting

Actually most people are using G.711 these days which is in fact a fixed bitrate (it's the same protocol used on your normal "hard" voice line).
But most VoIP providers do not offer SRTP or any encryption whatsoever so this whole thing is not even a question. More than likely anyone can listen in on your VoIP calls. We need to put more pressure on VoIP providers to offer encryption.

--
The ratio of people to cake is too big
Re:So...obvious solution then? by buback · 2011-03-15 04:25 · Score: 1

So I guess it's like how dentists understand their patients when they have their hands and tools in their mouths.
Re:So...obvious solution then? by modemboy · 2011-03-15 04:41 · Score: 1

I immediately thought of this exploit as well. Seems to me you would need a lot of NOP packets comparatively, the login info is just a few keystrokes. Plus login info is not time sensitive on the receiving end, delays in a voice stream might not be acceptable.
Re:So...obvious solution then? by tixxit · 2011-03-15 04:46 · Score: 1

Some encrypted systems actually specify how much data can be "leaked" out per some amount of time. The idea is that, practically, you'll always lose something, so you need to determine a limit that is acceptable. I guess that while voice/sound "data" is very complex, speech is much less so and it doesn't take much data being leaked to get the gist of what was said. Since their method is essentially looking at a sequence of numbers, the more obvious solution may be to add some padding to the packets to foil this attack (perhaps to align on certain boundaries of X-number of bytes); this would reduce the number of bits of information leaked per packet. The hard part would be to figure out how much is needed to degrade the signal-to-noise ratio enough for a good security vs. efficiency trade-off.
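To make the padding trade-off concrete, here is a small Python sketch. It rounds each packet up to a multiple of a block size and reports how many distinct sizes an observer can still tell apart, along with the bandwidth overhead. The frame lengths and block sizes are made-up illustrative values, and `padded_length` and `distinct_sizes` are hypothetical helper names.

```python
def padded_length(length, block):
    """Round a packet length up to the next multiple of `block` bytes."""
    return ((length + block - 1) // block) * block


def distinct_sizes(lengths, block):
    """Number of different on-the-wire sizes an observer could still see."""
    return len({padded_length(n, block) for n in lengths})


# Hypothetical VBR frame lengths in bytes.
lengths = [20, 28, 38, 42, 46, 50, 60, 62]

for block in (1, 16, 32, 64):
    padded = [padded_length(n, block) for n in lengths]
    overhead = sum(padded) / sum(lengths) - 1
    print(f"block={block:2d} distinct={distinct_sizes(lengths, block)} "
          f"overhead={overhead:.0%}")
```

Larger blocks leave fewer distinguishable sizes but cost more padding; at the extreme, every packet has the same size, which is fixed bitrate in all but name.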
Re:So...obvious solution then? by Jah-Wren+Ryel · 2011-03-15 05:03 · Score: 2

We'll need to research a way to "scramble" this predictability that's more efficient than using fixed bitrates, which eats up un-needed bandwidth.
Any fix is going to "waste" some amount of bandwidth.
One solution to this attack may be to semi-randomly inject "nops" to bridge phoneme breaks. So instead of being able to identify individual phonemes by bandwidth spikes, attackers will be limited to identifying entire word clusters - like filling the "space" between the phonemes in the first three words of a sentence to make it look like one really long phoneme.
But perhaps something more exotic might work, like randomly re-ordering chunks of audio so that they are transmitted somewhat out of order and then re-ordered on the receiving end. That probably won't use up much extra bandwidth but would increase latency.

--
When information is power, privacy is freedom.
Re:So...obvious solution then? by Anonymous Coward · 2011-03-15 05:09 · Score: 1

But perhaps something more exotic might work, like randomly re-ordering chunks of audio so that they are transmitted somewhat out of order and then re-ordered on the receiving end. That probably won't use up much extra bandwidth but would increase latency.
Might not even need to re-order the audio, just burst it so that multiple phonemes are all "packed" together for transmission so there are much fewer phoneme breaks visible via traffic analysis. You burn latency that way too, but it would be much simpler to implement than a randomizing algorithm.
Re:So...obvious solution then? by Anthony+Mouse · 2011-03-15 05:22 · Score: 1

We'll need to research a way to "scramble" this predictability that's more efficient than using fixed bitrates, which eats up un-needed bandwidth.
It seems like there might be some promise in improving the compression method itself using the same techniques, so that the things that currently take more bandwidth would take less and therefore become less distinguishable, but if the compression is already near-optimal then this won't work without an efficiency loss because the change would correspondingly make the things that currently take less bandwidth take more, and those things might be more common.
The only general solution is some kind of padding scheme, and the only way to completely defeat the attack is to use a compression method that compresses all inputs to output of the same size, i.e. fixed bitrate, because the degree of deviation from that is the degree to which the attack functions. The existence of efficiency-improving variation is what leaks information, because it tells the attacker a characteristic of the underlying data, namely the number of bits required to encode it. That isn't to say there is no compromise solution (like the OpenSSH method discussed below) where you sacrifice some degree of efficiency in order to make the attack sufficiently infeasible, but there is a direct inverse relationship between real-time encoding efficiency and information leakage.
Re:So...obvious solution then? by Peter+Simpson · 2011-03-15 05:32 · Score: 1

It's very clever. Seems like using a CBR encoder would defeat this method, because every packet would have the same number of samples. Being *too* efficient might save you bandwidth, but it reveals something about your speech patterns.
Re:So...obvious solution then? by dgatwood · 2011-03-15 05:38 · Score: 1

Agreed that the problem is the packing, not the data. However, grouping multiple short packets together is still leaking information. The only difference is that instead of looking at the length of packets, you have to look at the timing between packets.
I would suggest that the right solution is to modify your code so that instead of sending out packets of varying length isochronously, you instead send out packets of the same length isochronously, and adjust the average length every... say ten seconds, adjusting immediately only when you realize that your encoder is getting dangerously ahead or behind. Pad the packet with null blocks as needed to maintain the average.
With such a scheme, the only information you are leaking is a weighted average length of the packets in the last few seconds of the conversation. That should be much less useful.
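One possible reading of this scheme, sketched in Python. Every packet in a window goes out at the same size, and the size is reset to the previous window's average frame size. An emergency bump applies only when the backlog of unsent bytes exceeds a limit. The function name, the parameters, and the specific bump rule are assumptions made for illustration, not something spelled out in the comment.

```python
import math


def constant_size_schedule(frame_sizes, window, initial_size, max_backlog):
    """Return the packet sizes an observer would see on the wire.

    Each tick the encoder adds one frame to a backlog and the sender emits
    exactly `size` bytes (real data first, null padding for the rest).
    """
    size = initial_size
    backlog = 0
    recent = []
    wire_sizes = []
    for n in frame_sizes:
        backlog += n
        recent.append(n)
        # Emergency adjustment: only when the sender is dangerously behind.
        if backlog - size > max_backlog:
            size = backlog - max_backlog
        backlog -= min(backlog, size)
        wire_sizes.append(size)
        # Regular adjustment: once per window, follow the recent average.
        if len(recent) == window:
            size = math.ceil(sum(recent) / window)
            recent = []
    return wire_sizes


# Hypothetical frames: a quiet stretch followed by a louder one.
frames = [20] * 4 + [50] * 8
wire = constant_size_schedule(frames, window=4, initial_size=20, max_backlog=200)
print(wire)  # [20, 20, 20, 20, 20, 20, 20, 20, 50, 50, 50, 50]
```

In this toy run the observer sees one size change per window instead of one per frame. The backlog that builds up during the lagging window is the latency cost; a real design would presumably add some headroom to drain it.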

--
Check out my sci-fi/humor trilogy at PatriotsBooks.
Re:So...obvious solution then? by psydeshow · 2011-03-15 05:58 · Score: 1

At a high level, the success of our technique stems from exploiting the corre-lation between the most basic building blocks of speech—namely, phonemes—and the length of the packets that a VoIP codec outputs when presented with these phonemes. Intuitively, to search for a word or phrase, we first build a model by decomposing the target phrase into its most likely constituent phonemes, and then further decomposing those phonemes into the most likely packet lengths. Next, given a series of packet lengths that correspond to an encrypted VoIP conversation, we simply examine the output stream for a sub-sequence of packet lengths that match our model.
Awesome.
It's like listening to the "Mwa mwaa mwaa mwa mwa" voice that adults use in the old Peanuts television specials, and figuring out what they are saying based on the length of the "mwas" and their order in the conversation.
Re:So...obvious solution then? by Kjella · 2011-03-15 06:10 · Score: 2

Not so obvious --- now you have a much less efficient use of bandwidth to deal with.
Enough to matter? According to my cell phone bill, I had over 100MB of data traffic last month. That's about 10 hours of 24 kbps CBR encoded voice, which is the highest possible CBR setting speex has. If it's on my DSL/cable/whatever line, who cares? Even if I did that 24x7 for a month it'd be 7-8 GB and I'm pretty sure even a teenage girl with mouth diarrhea has to sleep sometimes. If that's what it takes, I don't see CBR as being a dealbreaker.
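The figures above check out as a back-of-the-envelope calculation. This Python snippet counts codec payload only, ignoring packet headers, and assumes decimal megabytes and a 30-day month.

```python
KBPS = 24                                   # CBR voice bitrate from the comment
bytes_per_second = KBPS * 1000 / 8          # 3000 bytes of payload per second

hours_in_100mb = 100e6 / bytes_per_second / 3600
month_gb = bytes_per_second * 86400 * 30 / 1e9

print(f"{hours_in_100mb:.1f} hours in 100 MB")      # 9.3 hours in 100 MB
print(f"{month_gb:.2f} GB for a 24x7 month")        # 7.78 GB for a 24x7 month
```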

--
Live today, because you never know what tomorrow brings
Re:So...obvious solution then? by bsquizzato · 2011-03-15 06:58 · Score: 1

Now take hundreds of thousands of calls like yours running through your service provider's network, being transferred to other providers' networks, etc. Or, hundreds/thousands of calls running within a large enterprise such as from branch offices to HQ. Bandwidth costs money. In situations like these, you try to conserve bandwidth any way you can.
Re:So...obvious solution then? by TuringCheck · 2011-03-15 11:19 · Score: 1

Having worked in telephony and VoIP for the last 8 years, I don't remember seeing a VBR codec in actual use - ever. At most silence detection is used, but that has unpleasant side effects too. I also find it useless to save 2-3 bytes when the UDP+RTP overhead is 40 bytes (plus at least 4 if SRTP is used).
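For scale, the 40-byte figure matches the minimum IPv4 (20), UDP (8), and RTP (12) headers added together. The Python sketch below assumes 20 ms packets at the 24 kbps rate mentioned elsewhere in the thread; both of those are illustrative choices, not something this comment states.

```python
IPV4_HEADER = 20   # bytes, minimum header without options
UDP_HEADER = 8
RTP_HEADER = 12    # fixed header, no CSRC list or extensions
header_bytes = IPV4_HEADER + UDP_HEADER + RTP_HEADER

BITRATE_BPS = 24_000   # assumed codec bitrate
FRAME_MS = 20          # assumed packetization interval
payload_bytes = BITRATE_BPS * FRAME_MS // 1000 // 8

packet_bytes = header_bytes + payload_bytes
saving = 3 / packet_bytes  # trimming 3 bytes of payload from one packet

print(header_bytes, payload_bytes, f"{saving:.0%}")  # 40 60 3%
```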
Re:So...obvious solution then? by PReDiToR · 2011-03-15 13:16 · Score: 1

You mean like trying to decipher Kenny from South Park's words?

I wonder what my kids would compare it to ...

--

Do not meddle in the affairs of geeks for they are subtle and quick to anger
Re:So...obvious solution then? by NateTech · 2011-03-15 17:40 · Score: 1

Using a VBR and then inserting NOP's sounds like... using a non-variable streaming CODEC.

--
+++OK ATH
Re:So...obvious solution then? by Eivind · 2011-03-15 19:47 · Score: 1

Not enough to matter.
VBR *does* save bandwidth for equivalent quality, but not a lot of it.
Your 100MB gives you 10 hours of 24kbps of CBR encoded voice, and at a guess, VBR would maybe give you 13-15 hours of voice in the same bandwidth.
Certainly trivial, and certainly the answer to this problem is that encrypted voice should be encoded CBR to make traffic analysis impossible.

Re:Bleh by gstoddart · 2011-03-15 03:30 · Score: 2

So on average that can't do any better than chance. Wow such great results!

I think if half the time you can identify a phrase in a supposedly encrypted stream ... that's better than 'chance'.

--
Lost at C:>. Found at C.

Stalin's Dream II by ackthpt · 2011-03-15 03:31 · Score: 2

Teh Recognisining.

"I'd like to order pizza, with pepperoni, pineapple, mushroom and an Iludium Pu-36 space modulator delivered to Hall of Justice."

--

A feeling of having made the same mistake before: Deja Foobar

Re:Stalin's Dream II by bmo · 2011-03-15 03:47 · Score: 2

http://www.youtube.com/watch?v=7A4HeawmE6A
Not knowing what an Illudium Pu-36 Explosive Space Modulator means you had a deprived childhood.
--
BMO
Re:Stalin's Dream II by AlienIntelligence · 2011-03-16 09:15 · Score: 1

http://www.youtube.com/watch?v=7A4HeawmE6A
Not knowing what an Illudium Pu-36 Explosive Space Modulator means you had a deprived childhood.
--
BMO
Hear, hear!
Marvin is the man! I mean, he's the silly thought and pseudo I use
for this nickname.
-AI

--
For me, it is far better to grasp the Universe as it really is than to persist in delusion

Re:Bleh by batquux · 2011-03-15 03:31 · Score: 4, Funny

Come on, 50% is better than most unencrypted voice recognition!

Re:Bleh by Dalzhim · 2011-03-15 03:32 · Score: 1

Especially when being wiretapped.

Re:Bleh by Lumpio- · 2011-03-15 03:32 · Score: 1

I think there's a big difference in the probabilities of a coin toss and the probability of guessing the correct phrase of who-knows-how-many alternatives.

Re:Bleh by bennomatic · 2011-03-15 03:38 · Score: 4, Interesting

This reminds me of the guy Colbert interviewed regarding the Large Hadron Collider who thought there was a 50% chance that it would destroy the universe. When questioned as to how he got those odds, he said, "Well, there's two options... either it will happen or it won't happen. 50%."

--
The CB App. What's your 20?

Re:Bleh by Anonymous Coward · 2011-03-15 03:39 · Score: 1

People only use two phrases when they talk?

The phrases that it detects are "Badda-bing" and "Badda-boom."

Re:Bleh by zill · 2011-03-15 03:39 · Score: 4, Funny

A'LA'IH

Duh! by Anonymous Coward · 2011-03-15 03:43 · Score: 2, Insightful

When you want to secure something, you must think carefully about how you might be leaking information. You can't just slap some encryption on and call it a day.

Re:Bleh by Chrisq · 2011-03-15 03:47 · Score: 5, Funny

Once they discover a method to wire trap encrypted video calls, that would open a new era in porn scene.

...

I'm pretty sure that identifying a specific word with 50% accuracy is better than random chance. There are more than two words in the English language.

Maybe he's talking about the porn film. 90% seem to be "oh" or "yes" (or so I am told)

3 years old work by slashdotmsiriv · 2011-03-15 03:59 · Score: 2

The conference version of the paper appeared in IEEE S&P 2008.

http://cs.unc.edu/~fabian/papers/oakland08.pdf

Re:Bleh by lwsimon · 2011-03-15 04:02 · Score: 2

I remember following this logic... when I was three. No shit, I have a vivid memory of trying to figure out how proportions worked - I knew that a penny tossed would give a 50/50 split, but that other problems with two states - e.g., when I threw a rock, I'd either hit the matchbox car or I wouldn't - weren't. I gave up, and figured it out later, when I was five or so.

--
Learn about Photography Basics.

No shit? by Anonymous Coward · 2011-03-15 04:11 · Score: 1

You mean when you vary a quality of your signal (in this case bitrate) based on content, people can read information about the content from those variations??? OMFG!

TFA != Wiretap by Barryke · 2011-03-15 04:13 · Score: 1

No it does not work like that (Wire tapping encrypted video calls).
It does not tap the signal, but increases your odds when guessing whether something was communicated in a specific manner.

--
Hivemind harvest in progress..

then it's shitty encryption by cellocgw · 2011-03-15 04:25 · Score: 2

The definition (somewhere in the 'net archives) of encryption quality is how distinguishable the encrypted message is from random noise. Clearly setting bitrates, or any other parameter, based on the input, is not random.

Pick a better algorithm and/or suck it up and waste a little bandwidth.

--
https://app.box.com/WitthoftResume Code: https://github.com/cellocgw

Re:then it's shitty encryption by dachshund · 2011-03-15 07:06 · Score: 1

The definition (somewhere in the 'net archives) of encryption quality is how distinguishable the encrypted message is from random noise. Clearly setting bitrates, or any other parameter, based on the input, is not random.
(A common) definition of symmetric encryption is that a message should be indistinguishable from an equal-length string of random bits. In that sense, there's nothing wrong with this encryption scheme.
What is wrong here is that encryption does not hide message length, and in many cases message lengths can leak information about the message content. The research is nice because they show a very practical way to get useful information from message length.
The research is, however, three years old --- it was originally published in ACM CCS 2008. This is just a journal submission. I love to see crypto in the news, but this really shouldn't be.
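The point about length can be shown with a toy cipher. This Python sketch uses a one-time-pad XOR purely for illustration (it is not what SRTP or any VoIP product does), and the frame contents are placeholders. The ciphertext bytes are indistinguishable from random, yet the lengths pass through unchanged.

```python
import os


def xor_encrypt(plaintext: bytes) -> bytes:
    """Toy one-time pad: XOR with a fresh random key of the same length."""
    key = os.urandom(len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, key))


# Hypothetical VBR codec frames of different sizes.
frames = [b"\x01" * 20, b"\x02" * 46, b"\x03" * 28]
ciphertexts = [xor_encrypt(f) for f in frames]

plain_lengths = [len(f) for f in frames]
cipher_lengths = [len(c) for c in ciphertexts]
print(plain_lengths, cipher_lengths)  # [20, 46, 28] [20, 46, 28]
```

Real protocols add a fixed per-packet overhead such as an authentication tag, but a constant offset does not hide the variation between packets.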

Re:Bleh by fnj · 2011-03-15 04:26 · Score: 1

Oops ... wait a minute ...

Google Voice by Arykor · 2011-03-15 04:28 · Score: 1

Google is involved in this? Perhaps encryption could help them improve the accuracy of transcription in Google Voice...

What phrases? by stillnotelf · 2011-03-15 04:33 · Score: 1

I'm hoping it's best at picking up obvious spy phrases, like "the eagle has landed", "the moon fish squicks wickedly at midnight", "long is the gap between cacti"... Somehow I think it's probably best at "hello".

Re:What phrases? by DriedClexler · 2011-03-15 07:32 · Score: 1

Somehow I think it's probably best at "hello".
I'm one step ahead of these known-plaintext attacks -- no longer do I use the same, small set of voice greetings. No no -- I prepend a nonce.
"Hello?"
"Shgr'gl'hm-v'va Hi Mom, it's Clyde ... and you're not supposed to answer the phone like that!!!"

--
Information theory is life. The rest is just the KL divergence.
Re:What phrases? by NateTech · 2011-03-15 17:43 · Score: 1

Who answers with "Hello" still? Waste of time. Look at Caller ID, "Hi XXX."
Or... "This is XXX." That one always throws the telemarketers... "Is X there?" "Didn't I just say that?"
Or my favorite, old military and any kind of "Operations" job folks... we just answer with our last name. One word, contact established, identity verified... go with your traffic.
"Goodbye" is silly too. Just hang up.

--
+++OK ATH

Re:Bleh by ciderbrew · 2011-03-15 04:34 · Score: 4, Funny

The pitch is the main thing in the art form.
A low German voice - "ooohhh yaaaaa", over and over. then you have the high pitched Japanese squeak sound - "ii, ii, ii, kimochi". Which really gets annoying these days. It took a few years; but it IS annoying.

Re:Bleh by NotQuiteReal · 2011-03-15 04:51 · Score: 2

The two phrases are "can you hear me?" and "I have a bad connection, let me call you back."

--
This issue is a bit more complicated than you think.

Variable bit rate? by s_p_oneil · 2011-03-15 04:56 · Score: 1

Did you note that they specified variable bit rate? In this case, I'll bet it had more to do with the timing and flow of the packets and bytes than with the actual content of the bytes. When there's a pause in a person's speech, there is a pause in the network traffic. Imagine someone trying to send Morse code through an encrypted voice channel. Someone watching a bandwidth graph that had a high enough frequency would know exactly what coded message you sent regardless of the compression or encryption algorithm used (as long as the compression is variable bit rate). Due to the way voice data is compressed, increases or decreases in traffic could imply certain changes in tone, pitch, volume, inflection, etc. Tracked at a very high frequency, changes in the flow of bytes could give plenty of clues as to what is being said whether the traffic is encrypted or not. In general, encryption algorithms don't change the number or flow of bytes, just the content of the bytes.
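The Morse code thought experiment is easy to sketch in Python. The observer only thresholds bytes per tick into "tone" and "silence" and measures run lengths. The traffic values, the threshold, and the timing convention (a dash and a letter gap are each three ticks) are assumptions chosen for the example, and `traffic_to_morse` is a hypothetical name.

```python
from itertools import groupby


def traffic_to_morse(bytes_per_tick, threshold, dash_ticks=3):
    """Recover dots, dashes, and letter gaps from a traffic-volume trace."""
    symbols = []
    for active, run in groupby(bytes_per_tick, key=lambda b: b > threshold):
        n = len(list(run))
        if active:
            symbols.append("-" if n >= dash_ticks else ".")
        elif n >= dash_ticks:
            symbols.append(" ")  # a long silence separates letters
    return "".join(symbols).strip()


# Hypothetical trace: about 60 bytes per tick during a tone, 5 during silence.
dot, dash, gap, letter_gap = [60], [60] * 3, [5], [5] * 3
s = dot + gap + dot + gap + dot
o = dash + gap + dash + gap + dash
traffic = s + letter_gap + o + letter_gap + s

print(traffic_to_morse(traffic, threshold=30))  # ... --- ...
```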

RTP blinding by WaffleMonster · 2011-03-15 05:08 · Score: 2

A few solutions...

Add some number of pad bytes to each packet to fill in blanks.

Tweak existing high-complexity codecs (iLBC, Speex, etc.) to maintain a persistent bitrate by dynamically scaling quality to even out the per-packet bits.

Use a fixed bitrate codec (most of these really suck from a bandwidth efficiency vs. quality perspective)

Switch variability to the time domain adding jitter to mask the signal and control latency/security tradeoff.

SRTP scares me because it was invented for a single narrow purpose. I would much prefer the use of DTLS to secure RTP streams; being very similar to TLS, it has received much more scrutiny than SRTP likely ever will.

useless, and easy countermeasures by t2t10 · 2011-03-15 05:17 · Score: 2

First of all, statements like "50% accuracy" are nearly useless; you need to know both precision and recall. And to the degree that "50% accuracy" tells you anything, it tells you that the system is pretty bad.

Finally, the countermeasure for this is the same as the countermeasure for other automated speech analysis techniques: play some singing or theater in the background.
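A quick Python illustration of why a single "accuracy" number is ambiguous for phrase spotting. The counts below are invented: 50 correct detections, 50 false alarms, and 10 missed occurrences.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision: share of reported hits that are real.
    Recall: share of real occurrences that were reported."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


precision, recall = precision_recall(true_pos=50, false_pos=50, false_neg=10)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.83
```

A detector with these counts and one that finds only half the occurrences but never raises a false alarm could both be summarized as "50%", yet they behave very differently.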

Re:useless, and easy countermeasures by uid7306m · 2011-03-15 07:15 · Score: 1

Exactly. The phrases used are fairly long, for instance: "Laugh, dance, and sing if fortune smiles upon you." In the TIMIT corpus, there are 122 of them. In the English language, there are hmm, lots of sentences of that length. There are about 1000 different syllables in English, and I count 11 syllables in that sentence. Thus, there are some fraction of 10^33 sentences of that length.
So, if you tried this on English, one of two things would happen. If you used that recognizer without any modification, then it would sit there silently until you said one of the sentences in TIMIT, like "She had your dark suit in greasy wash water all year." And, it would be a *long* wait.
Or, you could change the recognizer so it could recognize more than 122 possible sentences. In that case, the error rate would go way up.
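The estimate above is a simple upper bound: about 1000 choices for each of 11 syllables, treated as independent. In Python, using only the numbers from the comment:

```python
SYLLABLES = 1000          # rough count of distinct English syllables
SENTENCE_SYLLABLES = 11   # syllables in the example sentence
TIMIT_SENTENCES = 122     # phrases the recognizer was evaluated on

upper_bound = SYLLABLES ** SENTENCE_SYLLABLES
coverage = TIMIT_SENTENCES / upper_bound

print(upper_bound == 10 ** 33)  # True
print(f"{coverage:.2e}")        # 1.22e-31
```

Only "some fraction" of those combinations are valid sentences, as the comment says, but the gap between 122 and anything near that bound is the point.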

Re:Bleh by Magnus+Pym · 2011-03-15 05:37 · Score: 1

Well, assuming that he has no knowledge about how the thing works and has no other information, his computation of probabilities is technically correct :)

QoS by sourcerror · 2011-03-15 06:08 · Score: 1

Thus you increase latency, which is the single most important thing in a phonecall.

Nexidia by randyjparker · 2011-03-15 06:20 · Score: 1

Nexidia has been selling proprietary tech to do this for years

Average accuracy of 50%? by fishbowl · 2011-03-15 06:37 · Score: 1

On any digital signal, comparing a random source of bits should get you 50% accuracy.

--
-fb Everything not expressly forbidden is now mandatory.

Better than guessing? by KnownIssues · 2011-03-15 06:50 · Score: 1

I'm sure there's a mathematical/statistical reason why 50% accuracy is better than guessing in this case, but that would be very counter-intuitive. Same with as high as 90% under certain conditions. I could get to 90% accuracy if I could select out everything that reduced my accuracy as well. I don't doubt the full article explains better though. I'm not suggesting MIT, Google, etc scientists are stupid.

An exercise of pattern detection by c0lo · 2011-03-15 11:35 · Score: 1

Seems that I started to detect a pattern between the current TFA and this one.
Now, DHS, I know I'm not at MIT, but other cases showed I don't need to... So, just where is my grant for advanced research of the subject?

--
Questions raise, answers kill. Raise questions to stay alive.

Re:Bleh by ciderbrew · 2011-03-16 01:23 · Score: 1

This should have got at least one +funny.

Re:Bleh by Virtual_Raider · 2011-03-16 08:05 · Score: 1

You mean it doesn't amount to "fuck" and "shit"? The media and the internet have fooled me again!

--
+Raider of the lost BBS

Re:Bleh by AlienIntelligence · 2011-03-16 09:07 · Score: 1

How many words are there in the English language - many tens of thousands at least.

Many tens of thousands???

I hope English is your second language.

There are over 1 MILLION English words in common and uncommon use.
[ http://www.languagemonitor.com/no-of-words/ ]

Yes.... many, many, many tens of thousands.

-AI

FWIW, in response to TFA... I realize their research is on phrases. Which
very quickly reduces the set. Since many of those words would only exist
in very few spoken phrases.

--
For me, it is far better to grasp the Universe as it really is than to persist in delusion

Re:Bleh by AlienIntelligence · 2011-03-16 09:09 · Score: 1

but they're recognizing individual words, from a set of many thousands of potential words, half the time or better.

That's really quite impressive. And you're an idiot.

From a set of many thousands of words...

and he's the idiot?

-AI

--
For me, it is far better to grasp the Universe as it really is than to persist in delusion

Re:Video by kmoser · 2011-03-16 16:55 · Score: 1

I'd tap that.
