Off topic, but "any sufficiently advanced incompetence is indistinguishable from malice."
I'm sure it depends on the building. Some friends and I were talking last week about how irritating it is that the "close door" buttons on the elevators in our building do nothing.
From what I've heard, it's a quarter-truth rather than an outright lie. If Alice finds Bob using friend finder, then this message can appear with Bob's name on it. Because there's no way to tell if a friend request came from friend finder, Bob doesn't know that he's "found friends using friend finder".
It was pretty good for cheap scientific computing clusters. But it's being replaced now by tower computers with 4 GPUs.
To be fair, until recently there were more people in Philadelphia than in all of Utah.
A "study" that determines that disabling Javascript will not allow you to execute Javascript.
A study that shows that many high-profile websites (which follow the previous best practices) are insecure because they don't take this into account, and proposes enhanced defense mechanisms.
I wish *I* could get paid obscene amounts of money to make "studies" like these.
If you can repeatedly find security flaws in web best practices, you're welcome to come join the lab. It pays about $15/hr, plus half your health insurance.
Disclaimer: I work with these guys.
In addition to the CPGPU or whatever they're calling it, Fusion should finally catch up to (and exceed) Intel in terms of niftilicious vector instructions. For example, it should have crypto and binary-polynomial acceleration, bit-fiddling (XOP), FMA and AVX instructions. As an implementor, I'm looking forward to having new toys to play with.
A quarter billion seems pretty reasonable for keeping other companies off their turf.
More people have been to Germany than I have.
Actually, the #1 problem here would be that it would be disgustingly expensive, an ineffective sound barrier and an inefficient energy source.
To start with, lining the highway with any kind of fancy tech would be fabulously expensive. Maybe you could install parabolic concentrators to reduce the cost, but it would still be impractical this side of Dubai.
Second, consider how loud a speaker with a few watts of power is compared to a nearby highway. Truck rumblings will probably hit the wall with something on the order of a few watts per square meter, tops.
Third, if this is 18% efficient, that's not even 1 dB down.
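To put a number on that: a rough sketch, assuming "18% efficient" means the barrier removes 18% of the incident acoustic power and the other 82% carries on as sound. The function name is only illustrative.

```python
import math

def attenuation_db(fraction_removed):
    """Change in sound power level, in dB, when a fraction of the power is removed."""
    remaining = 1.0 - fraction_removed
    return 10.0 * math.log10(remaining)

loss = attenuation_db(0.18)
print(f"{loss:.2f} dB")  # about -0.86 dB
```

For comparison, under the same assumption the barrier would have to soak up half the incident power just to reach 3 dB.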
Hydrogen isn't nearly as dangerous as people think. It doesn't have a very high energy density, and it rises as it burns. In the Hindenburg disaster, an airship with 200,000 m^3 of hydrogen caught fire while still in the air, then crashed into the ground, and almost 2/3 of the passengers and crew survived. I'm pretty sure a car colliding with this barrier would be less spectacular.
I want to see this happen, as do several of my colleagues in the security industry. Unfortunately, this sort of strategy breaks legacy applications so corporations are not going to adopt it.
Because hot fuel explodes harder inside an engine cylinder?
Pretty much, yeah. The fuel burns rapidly, and its heat increases the pressure in the cylinder. The more heat, the more pressure. Obviously, adding the heat earlier could cause some problems: it could result in less fuel and air in the cylinder (higher initial pressure), or it could damage components, or it could leak heat out, or it could cause the cylinder to fire at an inconvenient time, or whatever. But in principle, it doesn't have to be a problem.
And for your information, honey is delicious and pre-eaten.
Technically, yes, but the bees didn't digest it and burn the calories out of it, or we'd call it 'poop' instead of 'honey.'
OK, so maybe yogurt is a better example.
It shouldn't really matter that much so long as most of the heat stays in the fuel.
And for your information, honey is delicious and pre-eaten.
Intel also added an instruction called PCLMULQDQ which does polynomial multiplication over F_2. If it's fast (I can't find timing numbers, but hopefully it's something like latency 2 and throughput 1) then it will be very useful for cryptography in general...
Latency 15 cycles, throughput 10 cycles. That's a shame. Also, AESENC has throughput 2 cycles/round, only half as fast as I expected.
Hopefully, this will cause people to stop using CBC mode, but perhaps I'm too optimistic.
I don't quite get the point. Are there flaws in CBC mode, or do you think it's just better to use, e.g., OCB or GCM mode because many developers often forget to add proper MACs to their protocols?
I overstated the case. CBC mode isn't terrible, it's just not the best mode out there.
The main problem with CBC mode is that it isn't parallel. This means that most of the new, faster, timing-attack-resistant libraries can't handle it efficiently (the one I wrote is an exception, but it only works on recent Intel procs and it's ~10% slower than Kasper-Schwabe for bulk encryption). It also means that CBC is a factor of 3 slower on Westmere, maybe 1.5 on VIA chips and 1.5 on the PowerPC G4. Because CTR mode allows extra optimizations, CBC mode is an additional 15-20% slower than CTR mode in most software libraries. It also has a larger attack surface than CTR mode, and unlike CTR mode it requires the cipher's decryption direction to be implemented.
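A toy sketch of the dependency structure, in case it helps. The block "cipher" here is a made-up 4-round Feistel construction over SHA-256 that only stands in for AES (it is not secure, and none of these function names come from any real library); the point is only which blocks need which inputs.

```python
import hashlib

BLOCK = 16  # bytes per block, same as AES

def _round(key, i, half):
    # Toy Feistel round function: 8 bytes derived from SHA-256.
    return hashlib.sha256(key + bytes([i]) + half).digest()[:8]

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt_block(key, block):
    # Hypothetical stand-in for the AES forward direction. NOT secure.
    left, right = block[:8], block[8:]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right

def decrypt_block(key, block):
    # Inverse permutation: CBC needs this, CTR never calls it.
    left, right = block[:8], block[8:]
    for i in reversed(range(4)):
        left, right = _xor(right, _round(key, i, left)), left
    return left + right

def cbc_encrypt(key, iv, blocks):
    out, prev = [], iv
    for p in blocks:
        # Serial chain: block i cannot start until ciphertext i-1 exists.
        prev = encrypt_block(key, _xor(p, prev))
        out.append(prev)
    return out

def cbc_decrypt(key, iv, blocks):
    # Each output needs only known ciphertext, so these could run in parallel.
    prevs = [iv] + blocks[:-1]
    return [_xor(decrypt_block(key, c), prev) for c, prev in zip(blocks, prevs)]

def ctr_block(key, nonce, i, block):
    # Block i depends only on (key, nonce, i): any order, forward direction only.
    counter = nonce + i.to_bytes(8, "big")
    return _xor(block, encrypt_block(key, counter))

def ctr_crypt(key, nonce, blocks):
    # Encryption and decryption are the same operation in CTR mode.
    return [ctr_block(key, nonce, i, b) for i, b in enumerate(blocks)]

key, iv, nonce = b"k" * 16, b"\x00" * BLOCK, b"n" * 8
msg = [bytes([i]) * BLOCK for i in range(4)]
cbc_ct = cbc_encrypt(key, iv, msg)
ctr_ct = ctr_crypt(key, nonce, msg)
```

The loop in cbc_encrypt is the part a bitsliced or pipelined implementation can't batch; everything else works on independent blocks.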
OCB and GCM modes are really nice because of the integrated MAC. Too bad OCB is patented... if you fit the constraints, there really isn't a better mode for AES.
The fastest code that I know of for AES in CTR mode is Kasper-Schwabe. It does 8 128-bit encryptions at a time, so it also should be suitable for, say, PMAC if you doctor it. I believe that it does not handle decryption (outside of CTR mode where it's the same as encryption) or other key sizes. Modes other than CTR lose some optimization, and should be ~20% slower. It should be available on Kasper's homepage. It requires SSSE3 and reportedly achieves 6.9 cycles/byte on Nehalem for CTR mode.
My code is available here. On Nehalem, it achieves ~9.4 cycles encrypting, ~11.1 cycles decrypting in essentially any mode. It is suitable for encryption or decryption, and supports all three key sizes (longer keys are slower, of course). A newer (unreleased, experimental) version makes slight performance improvements (maybe down to 9.1 cycles encrypting on Nehalem) and implements an optimization for CTR mode that brings it down to ~7.5 cycles. Email me (mhamburg AT cs DOT stanford DOT edu) if you want to try the experimental version. However, my code fundamentally requires SSSE3, and it performs quite poorly on Conroe.
Also, Dan Bernstein (homepage) has somewhere a fast conventional (not timing-attack resistant, but not requiring any sort of SSE) implementation of AES for several processors, and I've heard Crypto++ is pretty fast too.
I believe that all of the above libraries are public-domain and patent-free.
Out of curiosity, what's your application? Can you just get a VIA Nano or Intel Westmere core and run on that?
It's a shame they didn't make a mid-level version with no graphics core, a la Core i7 860. As a crypto/security guy, I'd like to try out PCLMULQDQ, the AES instructions and maybe the IOMMU. But if I'm going to get a fancy new computer, I might as well put a decent graphics card in it, at which point their on-die graphics card is simply a waste of space, power, money and latency. And no, I'm not dropping $1k for a 6-core Gulftown.
Why put AES on-board?
They're not: they're putting extra instructions on-board which help implement AES more efficiently. They may also allow you to implement other algorithms more efficiently, though I haven't looked at them in enough detail to be sure.
The instructions perform a single round of AES (which has 10-14 rounds depending on key size), either encrypting or decrypting. Certain other algorithms such as Lex, Camellia, Fugue and Grostl use AES S-boxes in their core, and can probably benefit from these instructions. However, they will not achieve nearly as much of a speedup as AES does.
The AES instructions themselves will approximately double the speed of sequential AES computations. This is very unimpressive; VIA's AES instructions are much faster. They will also make it resistant to cache-timing attacks without losing speed, which is unimpressive because you can already do this on Penryn and Nehalem. The low speed results from the AES instructions having latency 6; if you can use a parallel mode (GCM, OCB, PMAC, or CBC-decrypt, for example) then the performance should be 10-12x the fastest current libraries. Hopefully, this will cause people to stop using CBC mode, but perhaps I'm too optimistic.
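The arithmetic behind those two ratios can be sketched as follows. Assumptions, none of them measurements: AES-128 (10 rounds), one round instruction per round, 16-byte blocks, no overhead for whitening, loads or the mode itself, and a pipelined throughput of one round per cycle. The 9.4 and 6.9 cycles/byte software figures are the ones quoted elsewhere in this thread.

```python
ROUNDS = 10        # AES-128; 12 or 14 for the longer key sizes
BLOCK_BYTES = 16
LATENCY = 6        # cycles per AES round instruction, as stated above

def cycles_per_byte(cycles_per_round):
    return ROUNDS * cycles_per_round / BLOCK_BYTES

sequential = cycles_per_byte(LATENCY)  # each round waits on the previous one
pipelined = cycles_per_byte(1)         # assumed: one round issued per cycle

print(sequential, pipelined)                # 3.75 0.625
print(9.4 / sequential, 6.9 / pipelined)    # roughly 2.5x and 11x
```

If the real throughput turns out worse than one round per cycle, the parallel-mode figure scales down in proportion.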
Intel also added an instruction called PCLMULQDQ which does polynomial multiplication over F_2. If it's fast (I can't find timing numbers, but hopefully it's something like latency 2 and throughput 1) then it will be very useful for cryptography in general, speeding up certain operations by an order of magnitude or more. This is more exciting to me than the AES stuff, because it might enable faster, simpler elliptic-curve crypto and similarly simpler message authentication codes. Unfortunately, these operations are still slow on other processors, so cryptographers will be hesitant to use them until similar instructions become standard. If the guy you're communicating with has to do 10x the work so that you can do half the work... well, I guess it's still a win if you're the server.
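For anyone who hasn't met it, polynomial multiplication over F_2 is ordinary shift-and-add multiplication with XOR in place of add, so no carries propagate. A bit-at-a-time sketch on Python integers (the hardware instruction does this for two 64-bit operands at once; the function name here is just illustrative):

```python
def clmul(a, b):
    """Carry-less multiply: product of two polynomials over F_2 packed as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # addition over F_2 is XOR
        a <<= 1
        b >>= 1
    return result

print(bin(clmul(0b11, 0b11)))    # 0b101: (x + 1)^2 = x^2 + 1
print(bin(clmul(0b101, 0b11)))   # 0b1111
```

Doing this in software costs a loop iteration or table lookup per bit or nibble, which is where the hoped-for order-of-magnitude speedup would come from.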
I thought AES was relatively fast as encryption algorithms go.
That still doesn't make it fast at an absolute level. Particularly when you're doing full-disk encryption with user account encryption on top and IPSEC on all your network connections.
AES is fast for a block cipher, but modern stream ciphers such as Salsa20/12, Rabbit, HC and SOSEMANUK are about 3-4x faster. (In other words, they are still faster than AES in a sequential mode on Westmere.) AES is still competitive, though, if you can use OCB mode to encrypt and integrity-protect the data at the same time.
The fastest previous Intel processor with cutting-edge libraries in the most favorable mode could probably encrypt or decrypt 500MB/s/core at 3-3.5GHz. This is fast enough for most purposes, but in real life with ordinary libraries you'd probably get a third of that. So this will significantly improve disk and network encryption if they use a favorable cipher mode.
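As a sanity check on the 500MB/s figure, here is the conversion from cycles/byte to MB/s. It assumes the "most favorable mode" number is the 6.9 cycles/byte CTR figure quoted elsewhere in this thread, which is my reading rather than something stated outright.

```python
def throughput_mb_per_s(clock_ghz, cycles_per_byte):
    # bytes/s = (cycles/s) / (cycles/byte); then scale to megabytes
    return clock_ghz * 1e9 / cycles_per_byte / 1e6

for ghz in (3.0, 3.5):
    print(ghz, round(throughput_mb_per_s(ghz, 6.9)))
# 3.0 435
# 3.5 507
```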
Cred: I am a cryptographer, and I wrote what is currently the fastest sequential AES library for Penryn and Nehalem processors. But the calculations above are back-of-the-envelope, so don't depend on them.
This is only the case until most all instructions spend exactly 1 cycle in an execution unit. This includes shifts, adds, subs, muls, ands, ors, xors, and so on and on. That's the state of the modern processor.
mul has latency 3 on Core i7. Of course, this assumes that by "modern" you mean i7. If we're talking about mobile processors, the latency is terrible on Atom, and probably not so hot on ARM either.
All of these instructions have 3-cycle latency on the AMD64 processors that I am familiar with (pre-Phenom): essentially, the first cycle loads the operands from the register pool into an execution unit, the second cycle is the execution, and the third cycle is retirement from the execution unit, updating the register pool. Core 2s and i7s have these down to 2-cycle latency.
Every major desktop processor built in the last 10 years (maybe more like 20?) has single-cycle latency (or less!) for most simple operations (add/sub/xor/shift), enabled by forwarding between different stages of the pipeline. I don't know if lea counts for this because it's CISC-y, but something like an add always has single-cycle latency.
One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard trying to optimize their code to get better times; I got the best time by playing with the compiler flags.
Really? Because I had a similar assignment (make Strassen's algorithm as fast as possible, in the 5-10k range) in my algorithms class a while back. I found that the key to a blazing fast program was careful memory layout: divide the matrix into tiles that fit into L1, transpose the matrix to avoid striding problems. Vectorizing the inner loops got another large factor. Compiling with -msse3 -march=native -O3 helped, but the other two were critical and took a fair amount of effort.
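The loop structure I mean looks roughly like this. It's a sketch of a plain cubic multiply, not Strassen and not my actual assignment code, and Python lists say nothing about real cache behavior; the tile size and loop order are illustrative assumptions.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def tiled_matmul(a, b, tile=2):
    """Multiply a (n x k) by b (k x m) one tile at a time, reading b transposed."""
    n, k, m = len(a), len(b), len(b[0])
    bt = transpose(b)  # rows of bt are columns of b, so the inner loop is unit-stride
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # In a compiled version, the three tiles touched below
                # are sized so that they fit in L1 together.
                for i in range(i0, min(i0 + tile, n)):
                    row = a[i]
                    for j in range(j0, min(j0 + tile, m)):
                        col = bt[j]
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += row[p] * col[p]
                        c[i][j] = acc
    return c

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(a, b))  # [[58, 64], [139, 154]]
```

The innermost loop over p is the one that vectorizes, since after the transpose both operands are read contiguously.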
GCC on x86 these days likes to emit small multiplies as one or two lea instructions. It gives you a = b + [1248]c + const. This lets you multiply by 2, 3, 5 or 9 in one cycle, along with the shifter which lets you multiply by 1, 2, 4, 8, ... in one cycle. Between these, you should be able to multiply by any constant up to 21 in 2 cycles, and add a constant to boot. Similarly, you can multiply by a smaller range of values and add another register and a constant as well.
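A few concrete decompositions, using a small model of lea. The helper names are made up for illustration, and Python integers don't wrap the way registers do, so this only shows the arithmetic.

```python
def lea(base, index, scale, const=0):
    """Model of x86 lea: base + scale*index + const, scale limited to 1, 2, 4 or 8."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("lea only supports scales 1, 2, 4 and 8")
    return base + scale * index + const

# One lea each: x + 2x, x + 4x, x + 8x
def times3(x): return lea(x, x, 2)
def times5(x): return lea(x, x, 4)
def times9(x): return lea(x, x, 8)

def times7(x):
    t = lea(x, x, 2)      # t = 3x
    return lea(x, t, 2)   # x + 2*3x = 7x

def times21_plus(x, const):
    t = lea(x, x, 4)              # t = 5x
    return lea(x, t, 4, const)    # x + 4*5x + const = 21x + const

print(times7(6), times21_plus(2, 5))  # 42 47
```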
You can do this even better on ARM, where every instruction gets a free shift or rotate. In 2 cycles you can multiply by thousands of different values.
Regardless of this, on any platform with a multiplier, the multiplier is faster for some random unknown value. For a small known value, though, add/shift ladders may be faster.
... in Soviet Russia, TV does not watch you?
I don't remember the exact details, but I'll guess based on what I can find on the Internet. I'm guessing that I remember wrong and you should s/antigen/toxin/g. Wikipedia says that some malaria vaccines target the parasite itself, and some target the toxins it produces. People receiving anti-toxic vaccines would still be infected and would still have toxins in their systems, but the toxins would be reduced by an immune response.
Suppose these toxins increase malaria's infectiousness in some way (which is a reasonable guess, because otherwise malaria wouldn't produce them... killing your host for no reason is not adaptive). Then there may be a competitive advantage for strains which produce more toxins, so that the anti-toxic immune response is less effective. This means that malaria would evolve to be more harmful and more lethal, especially to unvaccinated people.
The same probably isn't true for an anti-parasitic vaccine, especially if it can completely prevent infection.
I actually realized that when writing the post, but I figured the nitpick didn't add anything to my argument and just tossed in a "generally" to cover it. Guess I was wrong.
I'd say that you're both wrong. Preventing someone from getting polio doesn't generally make them more likely to get tuberculosis. This is the opposite case from antibiotic soaps, where killing off the colony on a surface just makes way for another, different colony. The new colony isn't particularly more or less likely to be harmful, but it is more likely to be resistant.
On the other hand, I read (too lazy to dig up the reference) a study that suggests that some (but not all) malaria vaccines may encourage evolution of a more harmful strain of the disease. If the vaccine targets an antigen that increases both pathogenicity and infectiousness, then strains which express more of that antigen may be more successful, and also more harmful to unvaccinated people.