A New Vulnerability In RSA Cryptography
romiz writes, "Branch Prediction Analysis is a recent attack vector against RSA public-key cryptography on personal computers that relies on timing measurements to get information on the bits in the private key. However, the method is not very practical because it requires many attempts to obtain meaningful information, and the current OpenSSL implementation now includes protections against those attacks. However, German cryptographer Jean-Pierre Seifert has announced a new method called Simple Branch Prediction Analysis that is at the same time much more efficient that the previous ones, only needs a single attempt, successfully bypasses the OpenSSL protections, and should prove harder to avoid without a very large execution penalty." From the article: "The successful extraction of almost all secret key bits by our SBPA attack against an openSSL RSA implementation proves that the often recommended blinding or so called randomization techniques to protect RSA against side-channel attacks are, in the context of SBPA attacks, totally useless." Le Monde interviewed Seifert (in French, but Babelfish works well) and claims that the details of the SBPA attack are being withheld; however, a PDF of the paper is linked from the ePrint abstract.
When i see this kind of news the following question arises: so what are we supposed to do now? Throw away RSA cryptography is not the best answer i think. What do you, fellow /.ers, would do to by pass this problem?
When my Karma level reaches 0 I feel in piece with the Universe
This isn't really a flaw in RSA cryptography, but rather the fairly obvious situation that a branch predictor, shared between processes of different privilege levels, can be used as a covert channel and thus can be used to reveal state. The same is true with the cache, for example, and multithreading makes this problem many times worse by increasing the bandwidth of the channel. On architectures which don't have branch predictors, or don't share them, this is not an issue. ARM processors, for example, tend to rely on predication rather than branches (except when running Thumb), and thus don't suffer the same problem.
This class of problems is only going to grow as CPUs become less and less deterministic.
Think managed web hosting companies that put dozens of virtual hosts on a single physical server. If this really works from an unprivileged account, one malicious user could steal SSL keys from all the rest.
0 1 - just my two bits
It gets better. The attack requires that the two processes are running on the same core with hyperthreading enabled (i.e. ALU-poor CMP). The "spy" process will be sucking up 100% cpu pretty much continuously. They also simplified the multiplication routine from OpenSSL. Even if you are running such a setup on a P4 with HT turned on (even though its often useless), and you need to run secure processes along with unsecure ones (generally not a good idea anyway), patches already exist for Linux and BSDs to address this. The patches modify the scheduler to prevent processes from different users from running on the same physical core. A half-hearted attempt is made in the paper to say that these attacks to generalize to something remote, but no details are given as to how their attack would compensate for the 100,000 fold decrease in timing accuracy to pull off the attack on even a local LAN.
Essentially they took a very impractical attack with an unlikely scenario, and created a somewhat practical attack with an unlikely scenario. Avoid the problem scenario which was raised in the prior work last year, and you are still golden.
A slight improvement to your idea might be to balance the loop anyway, using D = 1 - di, etc., essentially a vectorized version of figure 4. This would slow it down by a factor of two but it would make it resistant to conventional timing attacks.
We don't see the world as it is, we see it as we are.
-- Anais Nin
From the linked article:
// Either 0 or -1 // -1 or 0 // Either 0 or R0 // At this point T0 will point to the value to be squared, R0 or R1!
// Now we move the correct values back into R0 & R1
R0 = 1; R1 = M
for i from 0 to n-1 do
if d[i] then
R1 = R0 * R1 mod N
R0 = R0 * R0 mod N
else
R0 = R0 * R1 mod N
R1 = R1 * R1 mod N
return R0
The key-dependent if statement is the key here, if we can remove all such branches, then there's no Branch Target Buffer entry that depends on it, and no timing channel attack either:
R0 = 1; R1 = M;
for (i = 0; i < n; i++) {
mask = 0 - d[i];
nmask = mask ^ -1;
T0 = R0 & mask;
T0 += R1 & nmask;
T1 = R0 * R1 mod N;
T0 = T0 * T0 mod N;
R1 = T1 & mask;
R0 = T0 & mask;
R0 += T1 & nmask;
R1 += T0 & nmask;
}
return R0;
There are at least three interesting issues here:
a) Most modern cpus have hw support for conditional operations, on x86 this is in the form of CMOVcc which is a (constant-time!) conditional move into a register, but as shown above, it really isn't needed here.
b) The perforance impact of the above branch removal can be negative!
On a P4 a branch miss costs about 20 clock cycles, and since a key-dependent branch will miss 50% of the time, the average cost is 10 cycles. My replacement code above takes around 5 cycles or less on any current cpu.
c) A final possible timing-channel attack would be due to the memory alignment of the R0 and R1 values:
By allocating them at the same address modulo the cpu page size, i.e. at 4 KB offset, the cache lines hit will be the same for both.
When I worked on the asm version of DFC, one of the AES also-rans, I removed a similar timing attack from a core 128-bit modular multiplication operation, using very similar techniques.
Terje
"almost all programming can be viewed as an exercise in caching"