RC4 Code Achieves 319 MB/s On AMD64 Opteron
Marc Bevand writes "This recent paper is about optimizing RC4 for AMD64 processors. A working implementation is provided. Its encryption/decryption throughput reaches 319 MB/s on a single AMD Opteron x44 processor running at 1.8 GHz. This makes it, as of today, the world's fastest RC4 symmetric cipher implementation for general purpose CPUs. As the author of this work, I would like to point out that many CPU-hungry applications have not been optimized for AMD64 yet. In other words: such speedups can be expected in other areas."
An anonymous reader adds some figures for the old implementation: "Opteron 244 1.8 GHz (32-bit) 163 MB/s; Opteron 244 1.8 GHz (64-bit) 135 MB/s."
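For readers who have not looked at the cipher itself, here is a minimal Python sketch of textbook RC4 (key scheduling followed by keystream generation). It is not the paper's AMD64 assembly, and the function name `rc4` is just picked for illustration; it only shows the byte-wise table swaps that any optimized implementation has to get through quickly.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4. Encryption and decryption are the same operation."""
    # Key-scheduling algorithm (KSA): permute 0..255 under control of the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]

    # Pseudo-random generation (PRGA): one keystream byte per input byte,
    # XORed into the data.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


if __name__ == "__main__":
    ciphertext = rc4(b"Key", b"Plaintext")
    print(ciphertext.hex())
    print(rc4(b"Key", ciphertext))  # applying it again recovers the input
```

Every output byte depends on the table state left behind by the previous one, which is why the inner loop is latency-bound and why hand-scheduling it for a particular CPU can pay off.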
I was initially disappointed with the performance of my Athlon64. CPU intensive 64bit code often seemed much slower than its (heavily optimised) 32bit counterpart.
:)
Every now & then I come across some code optimised for 64bit processors, and it just flies - as more & more stuff gets the treatment, it will be like upgrading for free
Until AMD decides to provide a compiler for its chips, optimization will always be behind Intel (who do, for Linux also).
If all a machine is doing is encrypting, A64s and Opterons are a bit overkill. The VIA C3 C5P has an encryption engine that makes top-of-the-line processors look sad. I couldn't find results for RC4, but there is a page from a review of the EPIA MII-12000 which shows AES results. First graph is EPIAs in software, second is a few Intel and AMD CPUs (software), and the MII-12000 in software (which gets creamed by the AXP 2500+ and the P4@2.4) and hardware (which totally obliterates everything).
Don't get me wrong, it's good that code is optimised, but I think that RC4 would fly faster on an IA64 than an Opteron if specifically optimised to take advantage of the CPU's features.
RC4 isn't really that relevant in real life as WEP is crap & also easily done in hardware anyway.
The 64 bit advantage will suffer the same fate as the 32bit advantage did for the 486, Pentium & especially the Pentium Pro.
486 = 32bits, faster but people still bought 386's due to cost.
Pentium = 32bits, sometimes faster but again costs meant 486's stayed popular.
Pentium Pro = 32bit, 16 bit instructions stalled it. When running pure 32bit code ran like the dogs, when running 16bit code (win 98) ran like a dog.
Problem is that you're generally better off saving your cash, buying a cheap CPU (32bit in this case) and waiting for the 2nd/3rd Generation CPU. By that time prices will be more reasonable and you will see the full advantages as programs will use the extra bits properly.
I mean come on MS still hasn't released a final AMD64 version of Winblows yet.
...to allow DRM encryption of movies to become standard :)
Anagram("United States of America") == "Dine out, taste a Mac, fries"
Who wants to optimize RC4 for the PowerPC G5 chip (64-bit implementation) and do a bake-off? Hand-coding PPC assembly doesn't sound as fun as this PHP I'm working on at the moment, so someone else will have to tackle that!
once intel really puts some muscle behind the 64 bit desktop i'm sure we'll start to see loads of new apps compiled for the platform... aside from your os and rare app... games could have a lot to benefit from the extra performance and amd's line has been very well received (and is currently embarrassing intel)... it's still nice to know you can do something super fast with your 64's
I'm holding out on the 64 bit systems until amd starts naming the chips commodore.
----
Squirrel
"I would like to point out that many CPU-hungry applications have not been optimized for AMD64 yet. In other words: such speedups can be expected in other areas."
well, maybe in some areas.
Since this is a cipher, it obviously helps a lot when you can work on 64-Bit chunks of data instead of 32-Bit.
The same speedup can probably be seen with applications that use numbers larger than 32b (or 64b for floats), since the number of operations necessary will essentially halve.
But other than that, I don't see much room for huge speedups.
That's good because it is yet another step in the direction where all information (HTTP, SMTP etc.) will travel encrypted (today only some pages are served this way, because of the processor load)
and because every time we hear good news about AMD we're happy :P
Everybody'll get TLS'ed
gtkaml.org
Does this change anything for rc5?
In a previous life I was running distributed.net's rc5 dnetc client, and naturally they had developed (or people had contributed to them) cores for almost any CPU imaginable. Improvements were relatively frequent, as in every-few-months a particular CPU would get an upgraded core, which would go through the calculations even faster.
Will the optimized AMD64 rc4 code provide any boost to those crunching rc5 on an AMD64?
I wish that every software company would put optimization first and features second. This way, we would not have to buy computers every few years. They can potentially last much longer.
See my earlier post as to why.
The Internet's nature is peer to peer - 20050301_cs_profs.pdf
That sure would help my l333+ sc0r3z over at distributed.net. RC4 is so passe....
to author:
please post the benchmarks for the C version of your algorithm along with the assembly version. It would be nice to know how much difference your tuning made.
Merlin
Wow, if RC4 is this much faster, just wait until they get to their Gold Master!
The ______ Agenda
We are seeing increased network throughputs of 15 to 20% using Opteron or Xeon x86_64 implementations, over Xeon. This is on various Linux 2.4 kernels: Red Hat, SUSE, kernel.org. No optimisations for 64 bit were added, just compile and go. Bang, 110 MB/s on reads and 105 MB/s on writes compared to 85 and 80 MB/s respectively on 32 bit. These increases in throughput haven't been analysed completely yet but it looks to be a combination of SCSI, software RAID and TCP. It was especially interesting to see this increase on the Xeon box running an i386 kernel compared to the same Xeon box running an ia32e kernel. Very cool.
64-bit operations (addition and multiplication) are much faster on a 64-bit CPU such as Athlon 64, for on a 32-bit CPU they have to be emulated in software using multiple instructions, which is slower than the "hardware-accelerated" way in 64-bit CPUs.
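To make the "emulated using multiple instructions" point concrete, here is a small Python sketch of how a 64-bit addition is built from 32-bit pieces, which is roughly what an add/add-with-carry pair does on a 32-bit x86. The names `MASK32` and `add64_with_32bit_ops` are made up for this illustration, and Python integers stand in for registers.

```python
MASK32 = 0xFFFFFFFF  # keeps a value within one 32-bit "register"


def add64_with_32bit_ops(a: int, b: int) -> int:
    """Add two unsigned 64-bit values using only 32-bit-wide steps."""
    # Split each operand into low and high 32-bit halves.
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    # Step 1: add the low halves and remember whether they overflowed.
    lo = a_lo + b_lo
    carry = lo >> 32
    lo &= MASK32

    # Step 2: add the high halves plus the carry; wrap modulo 2**64
    # the way fixed-width hardware would.
    hi = (a_hi + b_hi + carry) & MASK32
    return (hi << 32) | lo


if __name__ == "__main__":
    print(hex(add64_with_32bit_ops(0xFFFFFFFF, 1)))
```

A 64-bit CPU does the same work in a single add, so code that is dominated by wide integer arithmetic is where the halving of instruction count shows up.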
AMD really need to look at creating a multi-OS optimised compiler. Or actively support GNU / gcc so that anyone can compile binaries that are compiled specifically for the AMD-64/Athlon whatever.
Then all the coders need to do is write the code that can be optimised best. The Intel C compiler does magic on intel processors in linux etc the performance difference is clear.
I just ran a simulation of the same tests being run on a 200Mhz Texas Instruments OMAP CPU. Well, I came really really close. I should up the clock 10% and beat the Opteron... wait... ohh that's right, I implemented the RC4 in DSP code and parallelled the hell out of it, that might have something to do with it.
Keep in mind, when it comes to encryption, I would still much prefer to have a CPU simply capable of moving the data to a DSP and then DSP it as parallelled as possible. Really, a 200 MHz DSP calculating 20 steps simultaneously is 4 GHz of linear processing power. With a board like this (http://www.mangodsp.com/seagull_pci.asp) you can probably do the equivalent of 112 GHz worth of linear calculation for encryption. The real problem is getting the data to the DSP.
Oh come on, gcc's, well, just slow.
Slow to compile, slow when compiled.
thank God the internet isn't a human right.
I'm waiting for an Opteron-optimized build of 1964, a Nintendo 64 emulator, before I upgrade.
Yes we need a compiler, but for the time being AMD could just release optimized ("performance") libraries for selected application areas, just as Intel does. They are available for both Windows and Linux, and the Linux ones are GCC-compatible.
And what kind of optimizations are we talking about here? Changing floats to be 64 bits, doubles to 128 bits? Better use of SIMD instructions? More SIMD registers? Sure we can address more memory when using 64 bits of addressing space compared to 32 bits, but how would that make things faster ???
If there are not enough user-visible registers, memory operands are used instead. However, these probably fit in the L1 cache anyway, and L1 cache is very fast (IIRC it is just one clock cycle on some CPUs). Also, the CPU can see that a later load depends on a preceding store, and the load can get the result from the preceding store directly. IIRC current processors already do this. So more user-visible general-purpose registers just reduce the number of load operations for the CPU to process, and since such operations do not cause much latency anyway (for they are working on the L1 cache, not the slow main memory), the performance increase would just be due to the load/store unit being less clogged up, so that other load/store operations will not be delayed by them.
...crucial code, and assembly language monkeys are still worth having around =) .
I don't see the big deal here. I'd like to see what this algorithm would do if fully-optimized on the other processors out there, including the 64-bit G5. Maybe even better, use an algorithm that would have more practical value (wasn't RC4 cracked a while back already?) Try cracking MD5 or SHA-1 or something...
The interesting thing is that the Opteron 248 CPU is faster than just clock cycles (using timothy's code)
319*(2.2/1.8)=390 < 411
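The estimate in this comment is plain linear clock scaling, and it is easy to reproduce. The sketch below assumes throughput scales in direct proportion to clock speed (the comment's own premise); the function name `scale_by_clock` is hypothetical, and reading the 411 as a measured MB/s figure is likewise an assumption.

```python
def scale_by_clock(throughput_mb_s: float, base_ghz: float, target_ghz: float) -> float:
    """Predict throughput at another clock speed, assuming linear scaling."""
    return throughput_mb_s * target_ghz / base_ghz


if __name__ == "__main__":
    # 319 MB/s at 1.8 GHz, scaled to the 2.2 GHz clock of an Opteron 248.
    predicted = scale_by_clock(319, 1.8, 2.2)
    print(f"predicted: {predicted:.0f} MB/s")
```

Anything a faster part achieves beyond that prediction has to come from something other than clock speed alone.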
RSA, SHA-1 and SHA-256 are not something to choose instead of AES, they are more like a complement to it. AES is a symmetric cipher, RSA is a public key one, and SHA-1 and SHA-256 are hash functions.
That upgrade you are talking about would make the board better suited to do things like IPsec on hardware, but if you have a serious problem with AES (as stated in the grandparent post), you would have no alternative other than dumping the boards.
A real alternative would have been the inclusion of another symmetric cipher (like 3DES or IDEA).
PS: I know there is another reply next to this one, but I can't see it right now because slashdot is acting kind of weird right now. If this was redundant, sorry.
GPG 0x1B479C78
rc4 will be my new performance benchmark...
The multi-core chips that AMD demoed a couple of months ago will offer even better improvements than the RC4 results when software, particularly the OS, is optimized for them. Okay, for Windows that might be a while since Microsoft is still working at just getting a version of Windows out that supports x86-64. But for Linux...the possibilities are pretty big. If it is done right, even old non-optimized 32-bit apps should see a huge increase in speed.
The Opteron 148 that was in the article is a nice processor but AMD has been selling it for at least a year now, and it isn't even the fastest Opteron.
If you really need speed, you can use RC4 securely but you have to know what you are doing and be aware of these attacks so you can employ protective countermeasures. Otherwise you are better off to use a cipher like AES which is actually secure.
Like somebody else has already mentioned, this would be nothing much under IA64. And indeed, an Itanium 2 box, running at a significantly lower frequency (1.3 GHz) already beats this figure hands down, at 381 MB/s.
That throughput is fine, but far from the world's fastest for a general purpose CPU. An Itanium 2 box, running at 1.3 GHz (significantly slower than the AMD64 CPU in the article) attains 381 MB/s. On a 1.7 GHz part the throughput is 499 MB/s. I am aware that many consider Itanium a failure (Itanic, as they call it) but for some jobs it is king.
Speaking of AMD64, what AMD64 Linux distribution would you guys recommend for use on a high-traffic production server with AMD64/Opteron CPUs? I'm currently looking into Debian's AMD64 port for Sarge (I know, not released yet), Ubuntu, Red Hat Enterprise, and Fedora Core. Which one of these is the most stable/robust? I'd prefer to go with a Debian-based distribution (due to their package management system), but of course stability is more important than convenience.
Thanks.
Definitely.
http://www.intel.com/software/products/compilers/clin/overview.htm
The Pentium Pro ran 16 bit code slowly; 32 bit code ran quite well. However at the time Windows still had a lot of 16 bit code, and so did most major apps. The Pentium Pro did not run faster than the much cheaper Pentium processors that were also available at the time.
The Athlon 64 architecture currently runs many or most 32 bit applications faster than comparable Intel processors, and is competitively priced. The ability to run 64 bit code is more like a bonus. This seems more comparable to the Pentium II, which was an extremely successful CPU architecture.
IA64 is basically irrelevant because the Itanic really is identical to the Pentium Pro. It can't run 32 bit code very well and it costs a fortune.
The new AMD-64 chips use passing by register to do function calls, leading to a huge speedup. Consider, on an x86, function calls are done on the stack. You push, push, push your arguments onto the stack and then jump to the subroutine that pops them off into registers to do work. It then leaves a return code on the stack and jumps back (I believe).
With the AMD-64 chips compiled with the new 64-bit ABI (i.e. Linux running in 64-bit mode, NOT windows which is currently only 32-bit), the arguments to the next function are stored in general purpose registers. The stack is used only when you run out of registers, and you have quite a few registers to work with. This reduces pushes and pops onto the stack (which are slow operations) and leaves everything in registers where they're going to be used anyway.
The 64-bit-ness has nothing to do with the speedup of AMD-64 processors for most applications.
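The register-passing scheme described above can be modelled in a few lines. This is only a toy sketch: the register order is the one the System V AMD64 ABI (used by 64-bit Linux) defines for integer and pointer arguments, but the names `SYSV_INT_ARG_REGS` and `assign_int_args` are invented here, and floating-point arguments, structs and varargs are ignored entirely.

```python
# Integer/pointer argument registers of the System V AMD64 ABI, in order.
SYSV_INT_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]


def assign_int_args(args):
    """Map integer arguments to registers; whatever is left goes on the stack."""
    # zip stops at the shorter input, so at most six arguments get a register.
    in_registers = dict(zip(SYSV_INT_ARG_REGS, args))
    # Anything past the sixth argument is passed on the stack instead.
    on_stack = list(args[len(SYSV_INT_ARG_REGS):])
    return in_registers, on_stack


if __name__ == "__main__":
    regs, stack = assign_int_args([10, 20, 30, 40, 50, 60, 70, 80])
    print(regs)
    print(stack)
```

With the common 32-bit x86 C convention, by contrast, every argument in that call would be pushed on the stack, which is the extra memory traffic the comment is pointing at.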
Time is an illusion. Lunchtime doubly so. --Ford Prefect
Would using the Altivec engine in the G5 be faster than using the main core of the processor? I suppose it depends on how well the algorithm can be vectorized.
The benefits seen in this optimization are largely parallel with the G5. If Apple can get things optimized for 64bit computing that can take advantage of it, we will see great things. In many cases, they already have...
I just think it's great that AMD is making such strides... for being a Mac guy, I pull for them in the PC world. What can I say? I like the underdog story.
"Politicians find new names for institutions which under old names have become odious to the people."
3.4 GHz EM64T, gcc-3.4.2:
2952628 RC4_set_key's in 5.00 seconds
Doing RC4 on 1024 byte blocks for 5 seconds
784464 RC4's of 1024 byte blocks in 4.99 second
RC4 set_key per sec = 590525.60 ( 1.693uS)
RC4 bytes per sec = 160980187.58 ( 0.050uS)
(153.52 MB/sec)
2.0 GHz Opteron, gcc-3.4.2:
3388004 RC4_set_key's in 4.99 seconds
Doing RC4 on 1024 byte blocks for 5 seconds
1810795 RC4's of 1024 byte blocks in 5.00 second
RC4 set_key per sec = 678958.72 ( 1.473uS)
RC4 bytes per sec = 370850816.00 ( 0.022uS)
(353 MB/sec)
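The bytes-per-second and MB/sec figures in the two runs above follow directly from the block counts and timings. A small Python sketch of that arithmetic, assuming (as the numbers suggest) that "MB" here means 2**20 bytes; the function name `rc4_throughput` is made up for this example.

```python
def rc4_throughput(blocks: int, block_size: int, seconds: float):
    """Return (bytes per second, MB per second) for a timed benchmark run."""
    bytes_per_sec = blocks * block_size / seconds
    return bytes_per_sec, bytes_per_sec / 2**20


if __name__ == "__main__":
    runs = [
        ("3.4 GHz EM64T", 784464, 4.99),
        ("2.0 GHz Opteron", 1810795, 5.00),
    ]
    for label, blocks, seconds in runs:
        bps, mb = rc4_throughput(blocks, 1024, seconds)
        print(f"{label}: {bps:.2f} bytes/s = {mb:.2f} MB/s")
```

The same function can be fed the counts from any other run of this benchmark posted in the thread to put the results on a common scale.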
"Don't get me wrong it's good that code is optimised, but I think that RC4 would fly faster on an IA64 than an opteron if specifically optimised to take advantage of the CPU's features."
Opterons are much cheaper than IA-64, and they run 32-bit x86 stuff at full speed. They make porting applications easy because it's still x86. So whether or not the Itanium is faster/better is moot. They are way expensive and way niche.
"RC4 isn't really that relevant in real life as WEP is crap & also easily done in hardware anyway."
Yea, so might as well completely dismiss the whole thing just because you don't see value in it.. It's not the application, it's the fact that some optimizations made that much of a difference.
"The 64 bit advantage will suffer the same fate as the 32bit advantage did for the 486, Pentium & especially the Pentium Pro."
If AMD64 "suffers" the same fate as IA-32, then that's great! That means that up the road, ALL software, millions of packages, will all be on AMD64. Awesome! You didn't expect people to just switch everything everywhere immediately, did you? As long as the trend follows toward AMD64, we're in good shape.
"Problem is that you're generally better off saving your cash, buying a cheap CPU (32bit in this case) and waiting for the 2nd/3rd Generation CPU."
There's a problem in there?
"By that time prices will be more reasonable and you will see the full advantages as programs will use the extra bits properly."
You're being ridiculous. AMD64 is cheap. It's here now, and it's even in the Prescott P4's. Basically, unless you want something OLD, you're going to get AMD64 whether you like it or not, in the near future. This is a GOOD thing.
So unless you're trying to say that we should hold off on spending the HUGE AMOUNT of $100 on an Athlon 64, then you're just flamebaiting here.
- It's not the Macs I hate. It's Digg users. -
Sure, most of the apps we use today might not get a HUGE performance increase from 64-bit x86. However.
Can you imagine a 16-bit version of Office 2003? Or a media player? Or any of the other pretty heavy apps you run nowadays?
A 64-bit platform opens new doors for doing things that would require a much faster IA-32 chip to perform. Since we're not going to be seeing the huge GHz increases in clock speed for a while, it's a decent thing to focus on.
Cool! I had an A 500 as well. As usual, the Amiga line demonstrates its architectural superiority. Such a shame it had to die (let's be honest... it's dead). Finally we see one of its advanced features ported to the x86.
I also know that the MIPS architecture uses register passing. I believe it has up to eight 32-bit registers for argument passing. Quite a lot of architectures use it, it's just that x86 anachronistically held onto its stack passing system since I don't think it had enough GPRs to do it right. Finally, AMD-64 will solve the problem.
Did anybody try to run this on a 64-bit Xeon yet? This kind of algorithm naturally wants 64 bit data types. It would be interesting to compare Xeon vs. Opteron performance here.
Please note that the C3s before the C5P have half speed FPUs. If you get the C3 with the VIA PadLock engine you are also getting a full speed FPU. So the FPU performance isn't as crap as before. Also they are dropping 3DNow! support in favour of SSE support, so I'm currently waiting for some newer reviews to see if they are now suitable for software encoding in a HTPC. (I mainly want it for transcoding MPEG2 DTV signals to MPEG4 (Xvid or DivX).)
Definitely SUSE Professional 9.2
Muchas Gracias, Señor Edward Snowden !
Doing RC4_set_key for 5 seconds
2711712 RC4_set_key's in 4.39 seconds
Doing RC4 on 1024 byte blocks for 5 seconds
1451056 RC4's of 1024 byte blocks in 4.39 second
RC4 set_key per sec = 617702.05 ( 1.619uS)
RC4 bytes per sec = 338469554.44 ( 0.024uS)
That looks mighty sweet for an Athlon64 3000+. I would have expected it to be far less than the Opteron though, weird!
Ubuntu performs well and is easy to install, but it is developed towards the desktop, not the server. Server Ubuntu is in the pipeline but that is not its current position in the market. (PS: it is a nice distribution though)
:-(
PS: Do not attempt to put your home on a vfat partition, it fails to install
It really is a free market. If people refuse to pay for features because performance is poor then you will see a change. Currently it is easier to buy a new machine every 2-3 years and people expect that redundancy, I even advise it.
I think with the lack of upgrades to Windows you are starting to see this effect happening. People are simply sticking to what they have. Microsoft (as an example only) will have to consider performance gains on existing hardware as a marketable thing soon.
If you write using generic code without need for carefully crafted data types then the compiler should compile it as you described. Unfortunately I see code all the time that assumes 32 bit ints and they are a real bugger to port to 64 bit.
It really takes knowledge and skill to write portable code that makes few assumptions about hardware. Porting for OpenOffice.org 64 bit has been worked on for about 18 months. Hopefully 128bit will not be as hard. Seeing code for dates that is not Y2K compliant still being written now, I doubt this will be the case. We (royal we) programmers never learn.
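The "assumes 32 bit ints" trap is really a C problem, but the same idea can be shown with Python's standard `struct` module as a stand-in: native sizes vary by platform, explicit-width formats do not. The function `pack_record` and its two fields are hypothetical, chosen only to illustrate the point.

```python
import struct

# Native "long" depends on the platform and ABI (4 bytes on some, 8 on others).
native_long_size = struct.calcsize("l")

# With an explicit byte order ("<"), struct uses fixed standard sizes instead:
# "I" is always 4 bytes and "q" is always 8, with no padding in between.
fixed_layout = "<Iq"


def pack_record(counter: int, offset: int) -> bytes:
    """Serialise a record with explicit widths so 32- and 64-bit builds agree."""
    return struct.pack(fixed_layout, counter, offset)


def unpack_record(raw: bytes):
    """Inverse of pack_record."""
    return struct.unpack(fixed_layout, raw)


if __name__ == "__main__":
    print("native long:", native_long_size, "bytes")
    print("fixed record:", len(pack_record(7, 2**40)), "bytes")
```

Code written against the fixed layout keeps working when the word size changes; code that relied on the native size is the kind that has to be hunted down during a 64-bit port.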
GOD BLESS YOU!