Slashdot Mirror


RC4 Code Achieves 319 MB/s On AMD64 Opteron

Marc Bevand writes "This recent paper is about optimizing RC4 for AMD64 processors. A working implementation is provided. Its encryption/decryption throughput reaches 319 MB/s on a single AMD Opteron x44 processor running at 1.8 GHz. This makes it, as of today, the world's fastest RC4 symmetric cipher implementation for general purpose CPUs. As the author of this work, I would like to point out that many CPU-hungry applications have not been optimized for AMD64 yet. In other words: such speedups can be expected in other areas." An anonymous reader adds some figures for the old implementation: "Opteron 244 1.8 GHz (32-bit) 163 MB/s; Opteron 244 1.8 GHz (64-bit) 135 MB/s."
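For readers who have not seen the cipher itself, here is a minimal reference sketch of textbook RC4 in Python. It is not the optimized AMD64 implementation from the paper, which this thread does not reproduce; it only shows the byte-at-a-time state updates that such tuning has to speed up. The function name `rc4` is an assumption made for illustration.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data with textbook RC4 (the operation is its own inverse)."""
    if not key:
        raise ValueError("key must be non-empty")

    # Key-scheduling algorithm: permute the 256-byte state using the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]

    # Pseudo-random generation: one swap and one keystream byte per data byte.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)
```

Because RC4 only XORs a keystream into the data, calling `rc4` a second time with the same key recovers the plaintext.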

177 comments

  1. Optimisation is definitely the key by datajack · · Score: 5, Informative

    I was initially disappointed with the performance of my Athlon64. CPU-intensive 64-bit code often seemed much slower than its (heavily optimised) 32-bit counterpart.

    Every now & then I come across some code optimised for 64-bit processors, and it just flies - as more & more stuff gets the treatment, it will be like upgrading for free :)

    1. Re:Optimisation is definitely the key by Savage-Rabbit · · Score: 5, Funny

      it will be like upgrading for free :)

      Just don't get too excited. One of my coworkers made this same discovery a while back. Now he runs around the office wearing an "I love Opteron" T-shirt and starts shouting "Intel is history - Power PC is dead!" every time somebody mentions the words Opteron or AMD in a sentence. Worst of all, he attacks anybody who disagrees and tries to bite them. We tried to knock him out with a dart gun after he savaged a visiting IBM sales rep, but even heavy-duty veterinary tranquillisers don't seem to have any effect.

      :-D

      --
      Only to idiots, are orders laws.
      -- Henning von Tresckow
    2. Re:Optimisation is definitely the key by Talez · · Score: 1

      Power PC is dead?

      What the hell is he smoking?

      I'll take PPC over any other architecture any day of the week.

    3. Re:Optimisation is definitely the key by Anonymous Coward · · Score: 2, Funny
      Power PC is dead?

      What the hell is he smoking?

      I know. The attacking and biting I can look past, but saying that Power PC is dead is just nuts.
    4. Re:Optimisation is definitely the key by youknowmewell · · Score: 2, Funny
      Opteron or AMD
      Intel is history - Power PC is dead!
    5. Re:Optimisation is definitely the key by isolationism · · Score: 1
      I just picked up a Socket 939 Athlon64 this weekend myself (and have teased Gentoo into running just about everything I'd like it to do -- the last step is the Adaptec RAID array. I'm hoping the i2o support is 64-bit stable in 2.6.9 or I'm in deep doo-doo).

      Anyway, code optimisation seems like natural progression as processors start to evolve to a new architecture -- but it's not going to really take off until 64-bit processors start overtaking the market and x86 is a legacy project (and not still the market dominator).

      Here's a little example: I bought my processor at an "OEM" style outlet that runs its business on bulk parts instead of customer service; they said they had sold "a good 4-5 Socket 939 processors" after a couple weeks as compared to dozens of other types of processors every day. Of course, there are other 64-bit AMD processors in the lineup which I didn't ask about so that number may be better than represented in my little weak example -- but since people still buy Intel it more or less means that "half" of the market can't run 64-bit optimised code (Itanic aside) ... Doesn't it?

    6. Re:Optimisation is definitely the key by Anonymous Coward · · Score: 0

      Did you use 64-bit stun darts, though? :)

    7. Re:Optimisation is definitely the key by gadget+junkie · · Score: 1

      "they said they had sold "a good 4-5 Socket 939 processors" after a couple weeks as compared to dozens of other types of processors every day."

      Not all Athlon 64 procs are built for socket 939. The value option is socket 754.
      If there are other people like me out there, though, they will be waiting for PCI Express compatible motherboards. There's no point in buying a new rig and getting stuck with AGP five years down the line.

      --
      "If a boss demands loyalty, give him integrity. But if he demands integrity, give him loyalty." (John Boyd, 1927-1997)
    8. Re:Optimisation is definitely the key by Anonymous Coward · · Score: 0

      PCIe boards will be out soon, your upgrade path lies in the 939 line. Buy a new video card when you switch.

  2. until by iamnotacrook · · Score: 4, Insightful

    AMD decides to provide a compiler for its chip, optimization will always be behind Intel (who do, for Linux also).

    1. Re:until by isometrick · · Score: 4, Insightful

      I agree, to an extent. It's been said that Intel's compiler can outdo GCC in some performance benchmarks.

      GCC is no slouch though, and obviously Intel is performing some tricks that could also be implemented by GCC.

      I think it'd be a great move for AMD to work WITH GNU to optimize 64-bit AMD code from GCC.

      Seems like Intel is more prone to keeping secrets when it comes to processors. Maybe this is (yet another) way for AMD to give them a run for their money.

    2. Re:until by IWannaBeAnAC · · Score: 2, Informative

      Err, AMD have had developers working on optimizing GCC for quite a while now....

    3. Re:until by isometrick · · Score: 1

      Are there people on AMD's payroll that are regular contributors to GCC? If so, I have to claim ignorance on that one ...

      I'm saying that AMD could put the same scale of work into GCC that Intel puts into their compiler. I don't think that is there yet ... please let me know if I'm mistaken.

      I don't think it's a bad deal for them ... they get some free development on a free project that wouldn't directly profit them to do all in-house anyways. (AFAIK)

      Maybe they could even put in for a couple of full-time engineers on the project.

    4. Re:until by Gopal.V · · Score: 2

      ICC really sucks for compatibility ....

      For example, my code performs 5 times faster when compiled with gcc than when compiled with ICC ...

      Ok, maybe I'm a special case (I use computed GOTO). But you can't compile the kernel either :)

    5. Re:until by IWannaBeAnAC · · Score: 1
      Are there people on AMD's payroll that are regular contributors to GCC?

      Yes.

    6. Re:until by isometrick · · Score: 1

      Sorry it's not immediately obvious to me. Who are they?

    7. Re:until by IWannaBeAnAC · · Score: 1

      I don't know about the GCC contributors page, but I have met a programmer from AMD who was working on gcc, so there is at least one of them ;)

    8. Re:until by lachlan76 · · Score: 1

      Actually, you can compile the kernel with ICC, but it requires patching and a program between make and ICC. I read how to do it in a magazine somewhere, don't have it on hand right now.

    9. Re:until by RupW · · Score: 5, Informative

      Sorry it's not immediately obvious to me. Who are they?

      AFAICR AMD paid SuSE to do the original work. I think the main developers were Jan Hubicka, the current x86-64 maintainer, and Andreas Jaeger. SuSE have a few more well-known GCC contributors: look at MAINTAINERS.

    10. Re:until by quigonn · · Score: 1

      I use computed GOTO

      In C/C++? Whoa. I always thought this was a construct only used by old FORTRAN programmers...

      --
      A monkey is doing the real work for me.
    11. Re:until by maxwell+demon · · Score: 2, Funny

      Quiche Eater!
      Real programmers can write FORTRAN in every language.

      --
      The Tao of math: The numbers you can count are not the real numbers.
    12. Re:until by ceeam · · Score: 2, Interesting

      Dare I say that the fact that Intel produces a kick-ass compiler (for certain tasks anyway) has nothing to do with the fact that the same company produces CPUs. BTW - currently (AFAIK) the compiler is developed in Russia whereas chip design is done at "traditional" sites.
      PS: Oh, of course, Intel compiler won't ever support 3dnow, but that's the issue with sponsorship. I mean - AMD don't have to design the compiler themselves. They will be equally ok with sponsoring someone who knows how to do that.

    13. Re:until by Ninja+Programmer · · Score: 1
      • GCC is no slouch though [...]
      Yes it is

      • [...] and obviously Intel is performing some tricks that could also be implemented by GCC.
      It's not like the Intel compiler group is resting on its laurels. GCC has been getting its ass kicked by Intel's compiler for about 5 years now.
    14. Re:until by hpa · · Score: 1

      Jan Hubicka, as part of the SuSE team which ported Linux to x86-64.

    15. Re:until by Anonymous Coward · · Score: 0
      ICC really sucks for compatibility

      With GCC's version of C, you mean? What about with standard ANSI C?

  3. Somewhat OT, but... by bhtooefr · · Score: 4, Informative

    If all a machine is doing is encrypting, A64s and Opterons are a bit overkill. The VIA C3 C5P has an encryption engine that makes top-of-the-line processors look sad. I couldn't find results for RC4, but here is a page from a review of the EPIA MII-12000 which shows AES results. First graph is EPIAs in software, second is a few Intel and AMD CPUs (software), and the MII-12000 in software (which gets creamed by the AXP 2500+ and the P4@2.4) and hardware (which totally obliterates everything).

    1. Re:Somewhat OT, but... by MrNemesis · · Score: 4, Interesting

      AFAIK, the VIA's *only* do AES, as they're designed to make good VPN endpoints. This is cos some hefty AES subroutines are built into the hardware (with software drivers doing the rest).

      So whilst this is all very handy, if you want encryption other than AES (which, if there were ever any significant flaws found in AES' maths, is a certainty) you'd want to dump those VIA boards and get yourself either a dedicated encryption device like an Encipher box (like an expensive version of the VIA) or just a beast of a machine to do encryption entirely in software (like an Opteron).

      I personally shunt everything through DSA stunnels, so a VIA isn't much use to me.

      --
      Moderation Total: -1 Troll, +3 Goat
    2. Re:Somewhat OT, but... by mczak · · Score: 5, Informative
      AFAIK, the VIA's *only* do AES, as they're designed to make good VPN endpoints. This is cos some hefty AES subroutines are built into the hardware (with software drivers doing the rest).
      True. VIA padlock (as they call it) can currently only do AES in hardware (and it can also generate true random numbers). The next VIA chip called C7 (C5J Esther) however should be able to also do SHA-1, SHA-256 and parts of RSA in hardware (I think it should be available first half of 2005). That's of course still a limited set of encryption algorithms, but it's certainly an improvement.
    3. Re:Somewhat OT, but... by bhtooefr · · Score: 1

      Ah, I for some reason thought it did RC4...

      Thanks...

    4. Re:Somewhat OT, but... by MrNemesis · · Score: 1

      Wow, I didn't know that. Nice to see that VIA is taking the encryption market seriously, especially in the Linux arena (IIRC they opened the specs to their encryption engine, right? There's definitely support for it available in the kernel via a patch to the Crypto API). As you say, it's not full blown RSA, DSA and MD5 in hardware, but it's a start.

      Now if only they'd be as nice with the damned CLE266 graphics drivers...

      --
      Moderation Total: -1 Troll, +3 Goat
    5. Re:Somewhat OT, but... by jrexilius · · Score: 1

      I should say that from an investment standpoint, having 2-3 different algorithms in hardware would mitigate the obsolescence issue with discovered weaknesses.

      Excuse my ignorance here, but are these chips on an expansion card or can you find motherboards with them?

    6. Re:Somewhat OT, but... by evilviper · · Score: 1
      if there were ever any significant flaws found in AES' maths

      I would just like to object to hearing this all the time. Sure, it's POSSIBLE that AES will be found vulnerable, but quite unlikely. The government agency that selected and approved of AES are the same ones who approved of DES, oh so many years ago. I think that alone means it deserves the benefit of the doubt.

      Of course, it's still POSSIBLE, but hearing the same questions about it repeated so often, gives the wrong impression.
      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    7. Re:Somewhat OT, but... by mczak · · Score: 1

      You can get those via cpus as "normal" cpus, (via C3), they fit into socket 370 boards. Standalone via cpus are not very popular, but the VIA mini-itx boards which have them soldered directly to the board surely are (via calls them "Eden" cpus, but it's just the same cpu in a different package). Not all VIA C3 (or Eden) cpus have padlock, only C5P "Nehemiah" have - via did not change the "public" name for newer cpu cores, and it's possible you can still get older ones. Good for small, quiet, cheap home-grown server boxes or HTPC (not suitable for software encoding due to lackluster fpu performance though).

    8. Re:Somewhat OT, but... by swillden · · Score: 1

      I personally shunt everything through DSA stunnels

      You encrypt your data with the Digital Signature Algorithm? Good trick, that. Gotta be horribly slow, though.

      Actually, you don't do this. You use DSA to validate DH public keys, use DH to establish a shared secret and use something like RC4 or some block cipher to actually do the bulk encryption. Or maybe you use RSA instead of DSA/DH, or maybe even El Gamal, but you definitely don't use DSA for bulk encryption.

      It's actually quite likely that you are using RC4, since that's one of the preferred bulk ciphers for SSL, mainly because of its high performance and excellent security (when used properly -- not like in WEP). Other likely candidates are IDEA, 3DES, Blowfish, AES or some of the other AES contenders. But not DSA.
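The split described above (public-key math to agree on a secret, then a fast symmetric cipher for the bulk data) can be sketched in Python. This is only a toy Diffie-Hellman exchange under loud assumptions: the parameters `P = 23` and `G = 5` are far too small to be secure, the helper names are hypothetical, and the DSA signature step that would authenticate the public values is left out entirely.

```python
import hashlib

# Toy public parameters (assumption: illustration only, never use sizes like this).
P = 23
G = 5

def dh_public(private: int, p: int = P, g: int = G) -> int:
    """Public value sent to the peer: g^private mod p."""
    return pow(g, private, p)

def dh_shared(private: int, peer_public: int, p: int = P) -> int:
    """Shared secret both sides compute: peer_public^private mod p."""
    return pow(peer_public, private, p)

def derive_key(shared: int, length: int = 16) -> bytes:
    """Hash the shared secret down to a fixed-size key for the bulk cipher."""
    raw = shared.to_bytes((shared.bit_length() + 7) // 8 or 1, "big")
    return hashlib.sha256(raw).digest()[:length]

# Each side picks a private exponent and exchanges only the public value.
alice_private, bob_private = 4, 3
alice_public = dh_public(alice_private)
bob_public = dh_public(bob_private)

# Both ends arrive at the same secret without ever transmitting it.
alice_key = derive_key(dh_shared(alice_private, bob_public))
bob_key = derive_key(dh_shared(bob_private, alice_public))
```

The resulting key is what a bulk cipher such as RC4 or AES would then be keyed with; the expensive modular exponentiation happens once per session, not once per byte.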

      a dedicated encryption device like an Encipher box (like an expensive version of the VIA)

      Lots more than just an expensive version of the VIA, actually. The hardware crypto devices are *secure*, expensive versions of the VIA. I'm not sure if Encipher produces a FIPS 140-2 level 4 device, or if they're only level 3, but even that's pretty good. Other manufacturers make level 4 devices, which are very sophisticated, tamper-reactive devices.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    9. Re:Somewhat OT, but... by swillden · · Score: 2, Informative

      The government agency that selected and approved of AES are the same ones who approved of DES, oh so many years ago.

      And the same ones who were apparently surprised when flaws were found in SHA-1, which they also selected and approved. And the same ones who developed the Law Enforcement Access Field (LEAF) for Clipper, which was quickly broken by Matt Blaze.

      Thirty years ago when the NSA fixed IBM's Lucifer, which became DES, the NSA clearly had a huge amount of cryptologic knowledge that the public research community did not. Indeed, there really wasn't a public research community. That has changed. Although it's certain that the NSA has some tricks the public community does not, and it's obvious that there's nothing the public community knows that the NSA does not, it appears that the gap has closed considerably, and that it's entirely feasible for people to find flaws in NSA-approved, and even NSA-created ciphers.

      AES/Rijndael concerns lots of cryptographers because it is, by the only measure we really have available, the weakest of all of the AES candidates. All of the candidates use multiple "rounds" of computation, and all of them have been broken for some number of rounds less than the full algorithm. The difference between the number of rounds in the full algorithm and the largest number of rounds that has been broken is called the "margin of security", and Rijndael has a very thin margin of security, meaning that existing attacks only have to be extended by a little bit to break the full algorithm.

      Does this mean it's likely to be broken? No one knows, and only time will tell. Contingency plans are a good thing if you really care about security, though.

      I think that alone means it deserves the benefit of the doubt.

      You're insufficiently paranoid. Or, more likely, you just don't have anything you really need to protect. That's most people, actually.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    10. Re:Somewhat OT, but... by MrNemesis · · Score: 1

      Hehe, well done for pointing out my rather flaky knowledge of crypto and TLA's. I stand corrected, and thanks for explaining how it actually works!

      What I should have said was; everything gets thrown through SSH tunnels and I'd love to see an acceleration of whatever it is that SSH uses, as well as acceleration for creating those huge RSA/DSA keys we use all the time, which are slow to generate even on a dual Athlon 2000. And maybe better use of those RNG's that some of the VIA and AMD chipsets use.

      I have a friend who worked (tangentially) with an Encipher box of some description... he said it was just a lump of aluminium with wires out the back and 3 blue LED's on the front, which later had another piece of aluminium welded over them since someone made a PoC that you could figure out the data being encrypted by the way the LED's blinked, or so the story goes. Not 100% convinced of that one myself, but you know what those crypto-genius people are like!

      --
      Moderation Total: -1 Troll, +3 Goat
    11. Re:Somewhat OT, but... by wirelessbuzzers · · Score: 1

      I admit that I, like the GP poster, don't have anything that I really need to protect. I also read the AES papers when they first came out, and it's been said that something so simple might just be "waiting for the right hammer". And I've read the "margin of security" stuff, and the tentative attacks (Courtois / Pieprzyk) using algebraic geometry.

      None of it matters, for now. To break a strong symmetric cipher like AES just means to find a method which will deduce the key from an arbitrarily large amount of data (often with chosen plaintext or ciphertext) in a smaller number of basic operations than 2^(key size). This doesn't mean the attack is practical.

      DES is broken, but not in a practical way. You need terabytes of crypts of chosen plaintext to break it in 2^40some time (here, would you please encrypt this RAID for me?). Of course, the key is now within the range of a brute-force attack, as EFF showed.
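To put "within the range of a brute-force attack" into numbers, here is a back-of-the-envelope Python sketch comparing an exhaustive search of a 56-bit DES key space with a 128-bit AES key space. The search rate `RATE` is a hypothetical round figure chosen for illustration, not a measured number from the EFF machine or any other hardware.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

def exhaustive_search_seconds(key_bits: int, keys_per_second: float) -> float:
    """Worst-case time to try every key of the given size at a fixed rate."""
    return (2 ** key_bits) / keys_per_second

# Hypothetical search rate (assumption): 10^11 keys per second.
RATE = 1e11

des_days = exhaustive_search_seconds(56, RATE) / SECONDS_PER_DAY
aes_years = exhaustive_search_seconds(128, RATE) / SECONDS_PER_YEAR

print(f"DES, 56-bit key:  about {des_days:.1f} days")
print(f"AES, 128-bit key: about {aes_years:.2e} years")
```

At that assumed rate the DES key space falls in days, while the 128-bit space is out of reach by many orders of magnitude, which is why an attack has to be far better than brute force before it matters in practice.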

      Of course, this is in the public domain; who knows if some government or other secret organization has broken it, and has the supercomputer power to actually run an attack. But unless you're a foreign government or prime enemy of the state (read: Osama bin Laden), it doesn't matter much whether the US (or French, or Nigerian....) government has broken AES, because tinfoil hat or no, they wouldn't blow a secret like that just to send you to jail (perhaps I am insufficiently paranoid?).

      Now, it is possible that "the right hammer" will come along and there will be an attack on AES that runs on your PC. But I'm betting against it.

      --
      I hereby place the above post in the public domain.
    12. Re:Somewhat OT, but... by swillden · · Score: 1

      everything gets thrown through SSH tunnels and I'd love to see an acceleration of whatever it is that SSH uses, as well as acceleration for creating those huge RSA/DSA keys we use all the time

      Well, get an Opteron, install the tuned RC4 implementation and configure stunnel to prefer RC4 and you'll have no problems with throughput. The tuned RC4 won't speed up session startup because that's all public-key stuff. Large integer math libs could really benefit from tuning on 64-bit registers, though.

      As far as key creation... the big problem is finding the large prime numbers and that's just plain slow. Even with a good source of random numbers, you still have to test each one to see if it's sufficiently prime. 64 bit registers would help some. I doubt you'll see general-purpose hardware that optimizes primality testing, though, since key creation is something you don't typically do a lot of.
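The "test each one to see if it's sufficiently prime" loop can be sketched with a Miller-Rabin test in Python. This is a minimal sketch under assumptions: the function names, the round count of 40, and the small-prime pre-filter are illustrative choices, not what any particular SSH or SSL implementation does.

```python
import secrets

SMALL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)

def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin: False means composite, True means prime with high probability."""
    if n < 2:
        return False
    for p in SMALL_PRIMES:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2^r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2  # random base in [2, n - 2]
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # this base proves n is composite
    return True

def random_prime(bits: int):
    """Draw random odd candidates of the given size until one passes the test."""
    tries = 0
    while True:
        tries += 1
        candidate = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate, tries
```

Most candidates are rejected, and every surviving candidate costs many modular exponentiations on large integers, which is where wider registers in the big-integer library would help.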

      I have a friend who worked (tangentially) with an Encipher box of some description... he said it was just a lump of aluminium with wires out the back and 3 blue LED's on the front, which later had another piece of aluminium welded over them since someone made a PoC that you could figure out the data being encrypted by the way the LED's blinked, or so the story goes.

      It seems hard to believe a security device would be designed to show any information at all about the crypto operations, but I suppose it's possible. It is amazing how much information can be extracted by smart people who know a lot of math.

      I'm more familiar with IBM's 4758 and related devices, which are very secure. If you ever want to see what extreme but very smart and methodical paranoia looks like, take a look at the 4758 design overview.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    13. Re:Somewhat OT, but... by swillden · · Score: 1

      To break a strong symmetric cipher like AES just means to find a method which will deduce the key from an arbitrarily large amount of data (often with chosen plaintext or ciphertext) in a smaller number of basic operations than 2^(key size). This doesn't mean the attack is practical.

      Sure, there's a big difference between theoretical and practical breaks. OTOH, attacks only improve so the smart thing to do is to start looking for other alternatives when theoretical breaks are found.

      DES is broken, but not in a practical way. You need terabytes of crypts of chosen plaintext to break it in 2^40some time (here, would you please encrypt this RAID for me?).

      Cryptographers argue about whether or not this constitutes an even theoretical break, since accumulating the 2^47 bytes (140TB) of plaintext/ciphertext material required is arguably more difficult than brute force. Still, you are correct that there are many levels of "broken", and most of them are irrelevant to practical use.
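The 2^47 bytes figure is easy to check; the short Python sketch below converts it to both decimal terabytes and binary tebibytes. The helper name `bytes_to_units` is made up for this example.

```python
def bytes_to_units(n_bytes: int):
    """Return the size in decimal terabytes (10^12) and binary tebibytes (2^40)."""
    return n_bytes / 10**12, n_bytes / 2**40

tb, tib = bytes_to_units(2**47)
print(f"2^47 bytes = {2**47} bytes = {tb:.1f} TB = {tib:.0f} TiB")
```

So "140TB" matches the decimal reading; counted in binary units the same amount is 128 TiB.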

      Now, it is possible that "the right hammer" will come along and there will be an attack on AES that runs on your PC. But I'm betting against it.

      I would also bet against that. As of right now I'd even bet against *anyone* being able to break AES. And I think having an AES coprocessor built into my computer would be valuable. But it also wouldn't shock me to hear about a new attack which puts 128-bit AES within a couple orders of magnitude of being reachable by large clusters of machines with hardware AES coprocessors.

      Still, none of that was my point, really. My point was that to think AES is and will be secure just because the NSA said it is, is dangerous. It's very likely true, but the NSA isn't a group of aliens from another galaxy, and they aren't even that far ahead of the public researchers.

      Personally, I tend to place more weight on the fact that the world's best public cryptanalysts haven't been able to break it. If for no other reason than I know what their motivations are :-)

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    14. Re:Somewhat OT, but... by evilviper · · Score: 1
      the NSA clearly had a huge amount of cryptologic knowledge that the public research community did not

      Yet you think the NSA forgot much of that?

      DES has never been broken, therefore they know enough to thwart even the most advanced researchers today.

      But you're convinced, this time around, they don't know enough to do that again?

      SHA-1 and LEAF are completely different subjects, really. If you want to talk about Clipper, talk about Skipjack, which hasn't been found vulnerable yet.

      it's entirely feasible for people to find flaws in NSA-approved, and even NSA-created ciphers.

      Possible, feasible, but still highly unlikely.

      Rijndael has a very thin margin of security, meaning that existing attacks only have to be extended by a little bit to break the full algorithm.

      This is a pretty ridiculous statement. Even if you can break all but the last round, it's still every bit as secure. Even if you can break all but one round, that does not mean it's possible to extend the same or a similar method to break that last round. Skipjack is again a good example, as it has just enough rounds to be secure, and it's not likely an accident or coincidence.

      Yes, I'm sure you realized that, but you still imply that the reduced-round vulnerabilities mean something, when they don't.

      You're insufficiently paranoid.

      Not true, I have a very healthy level of paranoia. I'm not suggesting that everyone should switch over to AES immediately, nor am I doing that myself (well, I am for SSL/SSH, but nothing too important). I'm only saying that it's improper to imply that AES is vulnerable, or is going to be found vulnerable in the near future.
      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    15. Re:Somewhat OT, but... by swillden · · Score: 1

      Yet you think the NSA forgot much of that?

      Nope. I think the public cryptologists caught up (or close to it).

      DES has never been broken, therefore they know enough to thwart even the most advanced researchers today.

      And what about tomorrow?

      Even if you can break all but the last round, it's still every bit as secure. Even if you can break all but one round, that does not mean it's possible to extend the same or a similar method to break that last round. Skipjack is again a good example, as it has just enough rounds to be secure, and it's not likely an accident or coincidence.

      Secure against known attacks. And progress in attacks on block ciphers frequently starts with reduced-round variants and proceeds to increasing numbers of rounds.

      Yes, I'm sure you realized that, but you still imply that the reduced-round vulnerabilities mean something, when they don't.

      Professional cryptographers disagree with you. If you read the discussions and comments during the NIST selection process you saw lots of discussion about the relative margins of security of the various algorithms. Many of those people were mildly concerned when Rijndael, the algorithm with the thinnest margin, was selected.

      Successful attacks on reduced-round variants don't provide a way to attack the full cipher, sure, but they're a hint that perhaps the attacks can be improved or combined to create a successful attack.

      I'm only saying that it's improper to imply that AES is vulnerable, or is going to be found vulnerable in the near-future.

      I agree that it's incorrect to imply that AES is weak or will be broken, which is why I never said that. Please reread my post. I just argued against blind faith in it just because the NSA picked it. There's no magic here. Any cipher can be broken and you shouldn't assume otherwise just because of some name associated with it. That was a more reasonable thing to do 30 years ago when the NSA, arguably, knew vastly more about cryptography than anyone else. It's neither reasonable nor sensible now.

      And contingency plans are still a good idea. SSL, for example, is excellent this way. It's a protocol that can substitute any number of ciphers, and provides a mechanism for the endpoints to negotiate a cipher suite that is acceptable to both. That's good, resilient security design.
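The negotiation idea can be sketched in a few lines of Python. This is a simplification under assumptions, not the actual SSL handshake: the function name `negotiate` is hypothetical, the suite names are just illustrative strings, and the rule used here (the server picks its most preferred suite from what the client offers) is one possible policy.

```python
from typing import Optional, Sequence

def negotiate(client_offers: Sequence[str], server_prefs: Sequence[str]) -> Optional[str]:
    """Return the server's most preferred suite that the client also offers, if any."""
    offered = set(client_offers)
    for suite in server_prefs:
        if suite in offered:
            return suite
    return None  # no common suite: the handshake would fail

# Illustrative suite names (assumption); order expresses preference.
server = ["AES256-SHA", "RC4-SHA", "DES-CBC3-SHA"]
client = ["RC4-SHA", "DES-CBC3-SHA"]

chosen = negotiate(client, server)

# Contingency plan: if a cipher is broken, drop it from the server's list
# and both ends fall back to the next suite they share.
server_without_rc4 = [s for s in server if s != "RC4-SHA"]
fallback = negotiate(client, server_without_rc4)
```

The point of the design is that retiring a cipher is a configuration change at either endpoint rather than a protocol change.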

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  4. Not worth the outlay at present by cheezemonkhai · · Score: 4, Informative

    Don't get me wrong it's good that code is optimised, but I think that RC4 would fly faster on an IA64 than an opteron if specifically optimised to take advantage of the CPU's features.

    RC4 isn't really that relevant in real life as WEP is crap & also easily done in hardware anyway.

    The 64 bit advantage will suffer the same fate as the 32bit advantage did for the 486, pentium & especially the Pentium Pro.

    486 = 32bits, faster but people still bought 386's due to cost.

    Pentium = 32bits, sometimes faster but again costs meant 486's stayed popular.

    Pentium Pro = 32bit, 16 bit instructions stalled it. When running pure 32bit code ran like the dogs, when running 16bit code (win 98) ran like a dog.

    Problem is that you're generally better off saving your cash, buying a cheap CPU (32bit in this case) and waiting for the 2nd/3rd Generation CPU. By that time prices will be more reasonable and you will see the full advantages as programs will use the extra bits properly.

    I mean come on MS still hasn't released a final AMD64 version of Winblows yet.

    1. Re:Not worth the outlay at present by YetAnotherGeekGuy · · Score: 1

      I think that RC4 would fly faster on an IA64 than an opteron

      So this code should run directly on a Pentium IV with EM64T. Anybody tried it, yet? How about trying it with the Intel C compiler. Most benchmarks use the Intel compiler, even on AMD CPUs because it's so much better than GCC.

      I don't buy the argument that it's the extra registers, because there have been over 56 registers available for register renaming since the early-mid 90's.

      --

      to the Engineer, the glass is neither half full nor half empty. It's just two times too big.
    2. Re:Not worth the outlay at present by joib · · Score: 3, Informative


      486 = 32bits, faster but people still bought 386's due to cost.


      The 386 was also a 32-bit processor...

    3. Re:Not worth the outlay at present by Anonymous Coward · · Score: 1, Informative

      The 386SX was 16/32 the DX was fully 32.

      The 486 SX & DX were both fully 32 but the DX had a math Co-Processor onboard.

    4. Re:Not worth the outlay at present by anti-NAT · · Score: 1

      I don't buy the argument that it's the extra registers, because there have been over 56 registers available for register renaming since the early-mid 90's.

      I'm no expert, however, from what I understand from the bit of reading I've done and the bit of assembler I've done, it isn't the number of registers on the chip, it is the number of registers available to the user of the chip.

      For example, on the classic 32 bit X86, there are only four general purpose registers - EAX, EBX, ECX and EDX. If you want to multiply four numbers together, you can only hold three of them on the CPU, as you need one of those registers to store the result of the multiplication for each round. After three multiplies, you'll have to move the fourth number from main memory into a register, and then perform the fourth multiply.

      On the AMD64, there are 12 general purpose registers (I think), so you could store all the numbers on the CPU while the multiplication is taking place. RAM is very slow versus the CPU registers, so avoiding getting data from RAM during the calculation is worthwhile, as the CPU isn't delayed, waiting for data from RAM, part way through doing the calculation.

      Of course, for the above examples, avoiding going to RAM for the fourth number is probably not going to make a significant or measurable difference. However, imagine if that calculation was being performed 1 000 000 or 100 000 000 times - the small saving adds up when that saving occurs many, many times, making it a big saving, which can result in a significant performance increase.

      That is why more, exposed general purpose registers on a CPU are useful.
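
      The argument above can be put into a toy model. This is only a sketch under assumptions of mine: one register holds the running result, the rest hold operands, and anything that doesn't fit costs one extra load from RAM. The register counts (4 and 12) are the figures used in this post, not authoritative ones; other replies in the thread give 8 and 16. The function name is hypothetical.

```python
def extra_memory_loads(num_operands: int, num_registers: int) -> int:
    """Operands that must be fetched from RAM part way through the calculation.

    Toy assumption: one register holds the running result and the
    remaining registers each hold one operand.
    """
    operand_slots = num_registers - 1
    return max(0, num_operands - operand_slots)


# Multiply four numbers together, repeated a million times.
iterations = 1_000_000
for registers in (4, 12):
    per_run = extra_memory_loads(4, registers)
    print(registers, "registers:", per_run * iterations, "extra loads in total")
```

      It says nothing about caches or register renaming; it only shows how a one-load saving per calculation scales with the number of repetitions.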

      --
      The Internet's nature is peer to peer - 20050301_cs_profs.pdf
    5. Re:Not worth the outlay at present by SQLz · · Score: 1

      The Athlon 64 starts at $141 on NewEgg

      See for yourself

      And BTW, the 386 was 32 bit.

    6. Re:Not worth the outlay at present by DigitumDei · · Score: 4, Interesting

      I just bought a new PC, and when compared to all the available options, the AMD64 option (I got an AMD64 2800+) was best. Slightly more expensive than the equivalent XP, cheaper than the P4. And they run so cool, it's the first PC I've had in years where I don't have to worry about the temperature. When I bought an XP 2600+ last year, I spent almost half the chip's price again on cooling.

      Just because I'm running a 32bit win XP on it doesn't make it a bad purchase.

      Also, I'm one of those people who bought a 386 instead of a 486 (then later a 486 instead of a pentium 1) because of the price difference. The price difference nowadays is nowhere near comparable to what it was then.

    7. Re:Not worth the outlay at present by MrNemesis · · Score: 0

      As I understood it, EMT's are just regular 32bit chips with 64bit memory addressing. As such, they don't have the extra general purpose registers that the AMD64's do, which go a long way to speed up CPU-intensive code.

      I could be wrong though...

      --
      Moderation Total: -1 Troll, +3 Goat
    8. Re:Not worth the outlay at present by Bert64 · · Score: 4, Informative

      Actually, the majority of SSL websites are using RC4..
      If you use Mozilla and Apache, you can use 256-bit AES encryption for SSL (try loading up paypal with a mozilla based browser) but if either the server or client is microsoft-based you're stuck with the much weaker 128bit RC4...
      MS - always behind the curve, no 256bit encryption, no 64bit os

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
    9. Re:Not worth the outlay at present by Anonymous Coward · · Score: 0

      MS - always behind the curve, no 256bit encryption, no 64bit os ... all the freaking money, unfortunately. I don't see them crying.

    10. Re:Not worth the outlay at present by Eunuchswear · · Score: 1
      If you don't know what you're talking about why talk?

      --
      Watch this Heartland Institute video
    11. Re:Not worth the outlay at present by renoX · · Score: 1

      No, you're wrong: AFAIK the only difference between EMT and AMD64 is that the EMT can't use addresses above 4GB as source or destination for DMA; for the rest they are identical to AMD64 in features.

      Now it is true that I've heard that EMT is less optimised than AMD64, but I've never seen benchmarks so I don't know if it is true..

    12. Re:Not worth the outlay at present by zakezuke · · Score: 1

      The 64 bit advantage will suffer the same fate as the 32bit advantage did for the 486, pentium & especially the Pentium Pro.

      What fate would that be? In 1985 I could see buying into a 286 simply because there was really no support for 32bit protected mode let alone expanded memory. Hell, extended memory was barely supported. Even in 1990 I could see buying into a 286 if it would save you money. DOS 4.0 was a bug ridden piece of filth and there still was not a lot of support for 32bit protected mode. By 1995 it was pretty clear that you would be totally SOL in a very short period of time.

      I think you mean that Microsoft implemented 32bit support so slowly that you barely noticed anything until WindowsNT.

      Motorola, on the other hand, designed their 68000 to be a 32bit chip from the get go, and I believe it was first introduced in 1979 or so, at least according to my data book titled "break away from the past". Makes you wonder why anyone thought it was a good idea to use the 8086 for the PC.

      --
      There is no sanctuary. There is no sanctuary. SHUT UP! There is no shut up. There is no shut up.
    13. Re:Not worth the outlay at present by Anonymous Coward · · Score: 0

      Going 64-bits isn't that expensive. Going Opteron however could prove expensive.

    14. Re:Not worth the outlay at present by Fred_A · · Score: 1

      I don't quite get your point.

      64 bit systems aren't exactly new, they've been around for ages and the apps have been there as well. The fairly cheap (at the time) DEC alpha series & (way cheaper) assorted clones popularized them further.

      I currently run a 64 bit AMD cpu and my system and all my applications are 64 bit. It's quite easy to run a 64 bit system if you want/need one. You can even tweak your system so it runs 32 bit apps in case you have some old stuff lying around.

      Or you can go get a ready made system from Apple, or IBM, or Sun, or SGI, or a dozen other makers. And their apps will all be full of true 64 bit goodness.

      And if you insist on exclusively running 32 bit intel code, the amd64 still runs that quite speedily (faster than any 32 bit CPU at any rate), so there really is an incentive to switch in that case.

      Your CPU list exclusively lists the "bitness" of each chip, which is mostly irrelevant today given the performance of the compatibility layers. Optimizing code for the feature set of a specific processor is something that won't happen anyway for any significant number of applications. So it's quite unlikely that there will ever be a specifically amd64 version of much of anything beyond the use of a few of the features of the CPUs.

      --

      May contain traces of nut.
      Made from the freshest electrons.
    15. Re:Not worth the outlay at present by 10Ghz · · Score: 1

      Regular x86 has 8 GP registers, AMD64 has 16.

      --
      Lesbian Nazi Hookers Abducted by UFOs and Forced Into Weight Loss Programs - -all next week on Town Talk.
    16. Re:Not worth the outlay at present by John+Courtland · · Score: 1

      That's kinda vague. The 386 was a 32 bit processor with a 16bit data bus. It still could perform 32-bit arithmetic natively, but the bus was strangled.

      --
      Slashdot is proof that Sturgeon's Law applies to mankind.
    17. Re:Not worth the outlay at present by amjacobs · · Score: 0

      I'm not sure why you think that IA-64 would outperform AMD64. For those who don't know, IA-64 refers to Intel's VLIW instruction set that is used with the Itanium. RC4 generally is an integer type application, which the Opteron usually does better in (according to the SPEC results). In fact, the only time Itanium does well in anything is when it has 6MB of L2 cache. I have a hunch that with that much L2 cache, any processor will start to look decent.

      Also, the Opteron outperforms its 32-bit counterpart, the Athlon XP, in 32-bit applications. There is no penalty for only using 32-bit software.

    18. Re:Not worth the outlay at present by Bert64 · · Score: 1

      Of course not, they're holding back innovation so they can make as much money as possible out of their old obsolete technology before they have to start offering something new.

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
    19. Re:Not worth the outlay at present by nick-less · · Score: 2, Informative

      That's kinda vague. The 386 was a 32 bit processor with a 16bit data bus. It still could perform 32-bit arithmetic natively, but the bus was strangled.

      That's the 386SX you're talking about; the regular 386DX, which came out before the SX, was fully 32bit..

    20. Re:Not worth the outlay at present by Anonymous Coward · · Score: 0

      Apple doesn't have any 64-bit Apps. Their OS doesn't even really support 64-bit apps (yet).

    21. Re:Not worth the outlay at present by Sivar · · Score: 2
      Motorola, on the other hand, designed their 68000 to be a 32bit chip from the get go, and I believe it was first introduced in 1979 or so, at least according to my data book titled "break away from the past". Makes you wonder why anyone thought it was a good idea to use the 8086 for the PC.

      I've spoken to some of the people that made this decision for various companies (e.g. Raytheon). The general consensus was that the difference between the 68K and the x86 was "night and day", but that the Intel chips were available dirt cheap (for the time) in massive quantities, and that there were more developers that knew the architecture (at least in their specific situations). Thus, yet again, business logic overrode technical/engineering prowess, and most of us have been using the worst major CPU architecture ever designed because of it.
      (Not that I blame Intel, they didn't have any other architecture to learn their mistakes from).
      --
      Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
    22. Re:Not worth the outlay at present by Anonymous Coward · · Score: 2, Insightful

      I'm not sure why you think that IA-64 would outperform AMD64. For those who don't know, IA-64 refers to Intel's VLIW instruction set that is used with the Itanium. RC4 generally is an integer type application, which the Opteron usually does better in (according to the SPEC results).

      Itanium does really well on encryption in general. Hand-optimized code makes good use of the large register set, the modulo-scheduling of loops and powerful bit manipulation primitives.

      IIRC Itanium held the top spot in SpecSSL for a while (don't know where it stands currently, I don't think the numbers are current).

      In fact, the only time Itanium does well in anything is when it has 6MB of L2 cache

      Stop drinking the AMD coolaid.

    23. Re:Not worth the outlay at present by questionlp · · Score: 1

      MS - always behind the curve, no 256bit encryption, no 64bit os

      Microsoft does have a 64-bit OS, Windows Server 2003 for 64-bit systems. Of course, the version available right now is for the Itanium and for x86-64/IA-32e/EM64T. They also have a 64-bit version of SQL Server 2000 for the Itanium.
    24. Re:Not worth the outlay at present by questionlp · · Score: 1

      Oops... "and for x86-64..." should have been "and not for x86-64..."

    25. Re:Not worth the outlay at present by Woody77 · · Score: 1

      In college I did a bit of 68000 assembly. Intel x86 code just bends my head in comparison. 8 general purpose registers, and 8 address registers, and the ability to do math between any of them (using the address registers for ptrs), including array operations with the extended addressing modes, made the 68000 processor a dream to write for.

      Now when debugging C++ on x86, and I drop to assembly to figure out what the hell the compiler has done, I look at the code and it's the ugliest stuff I've seen. I think the 68HC11 had a better assembly language (and was about as powerful as an 8086, IIRC)...

    26. Re:Not worth the outlay at present by John+Courtland · · Score: 1

      Yeah, I guess I'm guilty of being vague too... I'm at work though, so I have an excuse :)

      --
      Slashdot is proof that Sturgeon's Law applies to mankind.
    27. Re:Not worth the outlay at present by Anonymous Coward · · Score: 0

      Is it optimized with the assembler YASM-0.4.0?

    28. Re:Not worth the outlay at present by RangerElf · · Score: 2, Informative

      Regular x86 has 8 GP registers, AMD64 has 16.

      Not if you want to actually use the stack pointer and your stack-frame base pointer; you have 4 GP regs (EAX .. EDX), two kinda-specific-purpose regs (ESI, EDI), one crippled-kinda-general-purpose-pointer reg (EBP) and one specific-purpose register (ESP).

      AND, if you want to do multiplications and divisions (the worst offenders, IMO), then two of the GP registers are already spoken for (EAX, EDX).

      So actually, the grandparent poster was right.

      -gus

    29. Re:Not worth the outlay at present by amjacobs · · Score: 1

      Stop drinking the AMD coolaid.

      Yeah, yeah. At the lab that I worked at, we had access to two Itanium workstations with 1.5MB cache. When we were benchmarking them next to our Xeon and Opteron machines, the Itaniums were not performing well. I'm not saying that they don't work well in certain applications, just not the ones that we were working with.

      Looking at spec.org, the 1.5MB Version of the Dell PowerEdge 3250 gets a SpecInt score of 824, while the 3MB version has a score of 1022. The 6MB version has score of 1099, but it is 100 MHz faster. So, going from 1.5MB to 3MB increases performance by 25%. Pretty Significant. No? Meanwhile, a Xeon 3.06 GHz with 0.5MB cache gets a score of 1067. The 1.5MB version of Itanium performs on the level of a 2.2GHz Xeon.
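
      For anyone who wants to check the percentages, here is a minimal sketch of the arithmetic using only the three scores quoted above. The dictionary keys and the function name are just labels of mine.

```python
def percent_gain(old_score: float, new_score: float) -> float:
    """Relative improvement of new_score over old_score, in percent."""
    return (new_score - old_score) / old_score * 100


# SpecInt scores as quoted in this post, keyed by cache size.
scores = {"1.5MB": 824, "3MB": 1022, "6MB": 1099}

print(round(percent_gain(scores["1.5MB"], scores["3MB"]), 1))  # 1.5MB to 3MB
print(round(percent_gain(scores["3MB"], scores["6MB"]), 1))    # 3MB to 6MB
```

      The first step works out to roughly 24%, the second to roughly 7.5%, and that second figure also includes the extra 100 MHz.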

      Anyway, I got a bit off topic. You have a point about some of the optimization techniques. Although, is it common for people to optimize in assembly for a VLIW architecture? I don't know; I've never done it myself. I thought the whole point of VLIW is that you need to have really good compilers to do optimization for you.

    30. Re:Not worth the outlay at present by Bert64 · · Score: 1

      So they support an architecture that no one is using, and have no support for one that's selling quite well..
      In fact, NT4 for the Alpha has a larger userbase than the Itanium, despite being discontinued...

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
    31. Re:Not worth the outlay at present by drinkypoo · · Score: 1

      I don't know jack about RC4 but unless it uses floating point it's going to be slower on itanic than on opteron. If you can perform RC4 efficiently using floating point mathematics then itanic will probably whip opteron.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    32. Re:Not worth the outlay at present by drinkypoo · · Score: 1

      Don't forget that many if not most x86 instructions require that source and/or destination registers be fixed, so your data is always in AX or always put in DX, et cetera, and mul and div aren't the only cases of this. Plus, you MUST use the source and destination registers for copies... So arguably, none of the registers are truly general purpose :P

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    33. Re:Not worth the outlay at present by bestguruever · · Score: 1

      You must be the guy that writes my specs. Please go home, thanks.

      --
      if you think this is bad, you should have seen my last sig
    34. Re:Not worth the outlay at present by gujo-odori · · Score: 1

      The 486SX and DX actually both had math co-processors on board. The difference is that on the SX it was disabled. Yes. Same CPU, just deliberately crippled.

  5. Finally enough horsepower... by Vo0k · · Score: 3, Funny

    ...to allow DRM encryption of movies to become standard :)

    --
    Anagram("United States of America") == "Dine out, taste a Mac, fries"
    1. Re:Finally enough horsepower... by pagal_paanda · · Score: 0

      And would also make my Duke Nukem Forever run!!

    2. Re:Finally enough horsepower... by Ravadill · · Score: 1

      Don't worry about running it yet, the 3drealms crew seem to be having trouble *compiling* it.

  6. PowerPC G5 by TiMac · · Score: 4, Interesting

    Who wants to optimize RC4 for the PowerPC G5 chip (64-bit implementation) and do a bake-off? Hand-coding PPC assembly doesn't sound as fun as this PHP I'm working on at the moment, so someone else will have to tackle that!

    1. Re:PowerPC G5 by Anonymous Coward · · Score: 0
      sif u could.

      http://n0cgi.distributed.net/speed/query.php?cputype=all&arch=2&contest=rc572

      ouch!

    2. Re:PowerPC G5 by DrStrangeLoop · · Score: 1

      this should be no problem, since even the 32bit g4 powerpc seems to be pin,- binary,- and stylecompatible to the amd64.

      for further information, look here

    3. Re:PowerPC G5 by Anonymous Coward · · Score: 1, Funny

      WHAT?! You aren't serious about PHP being more fun than handcoding PPC-assembler? Right? Right????

    4. Re:PowerPC G5 by fizze · · Score: 2, Insightful

      I don't know why everyone jumps off the horse as soon as they hear the magic word "assembly".
      Seriously.
      If you want to get 110% out of your hardware, you have to put effort in, to get effort out. Makes sense, doesn't it?

      I'm not saying people who don't like ASM are sissies, not at all. But I'm saying that assembly has its place, just like so many other programming languages.

      --
      Powerful is he who overpowers his temptations.
    5. Re:PowerPC G5 by maxwell+demon · · Score: 1

      So PHP doesn't stand for Program Handcoded in PPC-assembler?

      --
      The Tao of math: The numbers you can count are not the real numbers.
    6. Re:PowerPC G5 by TiMac · · Score: 2, Insightful
      Indeed.

      But when other projects beckon that don't require assembler work, I'm not about to jump on one that does for "fun" either ;)

    7. Re:PowerPC G5 by Anonymous Coward · · Score: 4, Informative
      What ouch? You're looking at something different; RC4 is not RC5-72...

      From distributed.net's pages, here's what it has to say on the Opterons for RC5-72 (uniprocessor)
      The Opteron 2420 achieved a score of 9,547,969.00.

      The 2GHz G5 for RC5-72 (uniprocessor) achieved a score of 15,057,412.00 (there are 2.5GHz chips available...) The best multi-cpu scores?
      A 2-way 2 GHz Opteron achieved a score of 15,145,274.67, but
      a 2-way 2.5GHz G5 smoked it with a score of 37,441,192.00.

      Apples to apples, my friend, apples to apples.

    8. Re:PowerPC G5 by FrgMstr21 · · Score: 1

      Ah yes, very true, but why don't you do the same comparison using OGR! :-)

    9. Re:PowerPC G5 by drw · · Score: 1

      I doubt that the RC5-72 core has been fully optimized for the G5 yet. For example, the 1.5GHz G4 on that same chart scores ~900,000 higher, and I doubt the G4 is that much better of a performer. I would assume (and I could be wrong) that the G5 optimizations are not fully 64 bit yet.

      Plus remember benchmarks are about as reliable/accurate as a presidential election poll.

    10. Re:PowerPC G5 by Anonymous Coward · · Score: 0

      A PPC G5 implementation would exploit Altivec (an XMM/SSE-like instruction set). And Altivec-enabled routines are all written in C; nobody does Altivec asm these days.

    11. Re:PowerPC G5 by Anonymous Coward · · Score: 0
      thanks for asking, I will...

      For OGR, then...
      The Opteron 2420 for OGR (uniprocessor) scored 19,106,244.00. (one sample)
      The G5 OGR (uniprocessor) page is a bit confusing at first.
      It shows a 2GHz G5 scoring 19,204,321.00, but the 1.8 G5 is scoring significantly higher at 24,950,000.00. Another oddity is that the 1.8 G5 samples are showing a really large standard deviation (over 25%!). Time to dig deeper to see what's going on.
      Looking at the details for those samples shows the following:

      • G5 1.8GHz 2.9008OGRp2 client score of 19,800,000.00.
      • G5 1.8GHz 2.9009OGRp2 client score of 29,900,000.00.
      • G5 1.8GHz 2.9009OGRp2 client score of 33,000,000.00.
      The 2.9009OGRp2 client appears to have some significant speedups.
      Apples to apples works both ways; I double-checked to see which client the Opteron was using. It was using 2.9008OGRp2. Perhaps the 2.9009 contains speedup for both the Opteron and the G5. However, The G5 in the benchmark here is the 1.8GHz G5. There's a 2.5GHz G5 shipping. That's over 33% faster clock rate, and it really shows in the G5 multiprocessor sample below.

      There are no 2-way 2 GHz Opteron benchmarks for OGR.
      A 2-way 2.5GHz G5 turned in an impressive 83,517,872.00.

  7. 64-bit by zxflash · · Score: 1

    once intel really puts some muscle behind the 64 bit desktop i'm sure we'll start to see loads of new apps compiled for the platform... aside from your os and rare app... games could have a lot to benefit from the extra performance and amd's line has been very well received (and is currently embarrassing intel)... it's still nice to know you can do something super fast with your 64's

    --

    All the torrents you could want.
  8. chip names by Pompatus · · Score: 3, Funny

    I'm holding out on the 64 bit systems until amd starts naming the chips commodore.

    --

    ----
    Squirrel ... It's not just for breakfast anymore
  9. well... by mx.2000 · · Score: 4, Insightful

    "I would like to point out that many CPU-hungry applications have not been optimized for AMD64 yet. In other words: such speedups can be expected in other areas."

    well, maybe in some areas.
    Since this is a cipher, it obviously helps a lot when you can work on 64-Bit chunks of data instead of 32-Bit.

    The same speedup can probably be seen with applications that use numbers larger than 32b (or 64b for floats), since the number of operations necessary will essentially halve.

    But other than that, I don't see much room for huge speedups.

    1. Re:well... by Kjella · · Score: 1

      The same speedup can probably be seen with applications that use numbers larger than 32b (or 64b for floats), since the number of operations necessary will essentially halve.

      Depends on what you're doing. An add yes, instead of ab + cd you'd have a+c,b+d (plus some overflow flags). ab * cd? a*c + a*d + b*c + b*d (with appropriate magnitudes, of course).

      Still, cryptography is ideal for going 64-bit. For most other apps the wider registers won't be significant; it is the added GP registers that help (which have nothing to do with 64 bit per se, but using them requires a recompile, so they dropped creating a new 32bit mode).
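
      To make the add/multiply point concrete, here is a minimal sketch of a 64x64-bit multiply built from 32-bit halves, in the same a/b/c/d notation as above. The function name is hypothetical, and Python's arbitrary-size integers stand in for the carry handling a real 32-bit implementation would need.

```python
MASK32 = 0xFFFFFFFF


def mul64_from_32bit_parts(x: int, y: int) -> int:
    """Multiply two 64-bit values using only 32x32-bit partial products."""
    a, b = x >> 32, x & MASK32  # x = a * 2**32 + b
    c, d = y >> 32, y & MASK32  # y = c * 2**32 + d
    # Four partial products, each shifted to its magnitude.
    return ((a * c) << 64) + ((a * d + b * c) << 32) + (b * d)


x = 0x123456789ABCDEF0
y = 0xFEDCBA9876543210
print(mul64_from_32bit_parts(x, y) == x * y)
```

      That is four multiplies plus shifts and adds where a 64-bit machine needs one multiply instruction, so the saving for multiplication is better than half.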

      Kjella

      --
      Live today, because you never know what tomorrow brings
    2. Re:well... by Anonymous Coward · · Score: 1, Interesting

      I am the author of the paper.

      You know, so *few* CPU-hungry apps are AMD64-optimized that it is almost shocking to see this unused CPU power... I *strongly* believe that such speedups (about 2 times faster) can be achieved in many areas such as video encoding, checksumming algorithms, etc. Servers and workstations will be the first ones to benefit from such optimizations.

    3. Re:well... by jrexilius · · Score: 1

      As I can't imagine many web applications that don't involve the above mentioned functions, I think there is going to be more impact for a larger audience than people realize.

      At one point I was doing metrics for a highly sensitive financial trading application we were working on and did a breakdown of the response time (we had 3 seconds to create, transmit, render, and get user reply on a trade decision and we had to hop the pacific and atlantic for our international users). The results were that we took a ~10% hit for SSL, which was a big deal when the latency over the oceans was non-negotiable.

      In that case it was hardware SSL accelerators but for my own company it helps to know that I can start to do things more cost-effectively.

    4. Re:well... by Anonymous Coward · · Score: 0

      Please could you align the top of your loop at .Lstart

      Thx

    5. Re:well... by Anonymous Coward · · Score: 0

      I will try it. Unfortunately I won't be able to do it these days because I am in the process of (1) upgrading my opteron processors and (2) moving from Paris to LA.

  10. that's good by b100dian · · Score: 2, Interesting

    That's good because it is yet another step in the direction where all information (http, smtp etc.) will travel encrypted (today only some pages are served this way, because of the processor load),
    and because every time we hear good news about AMD we're happy :P

    Everybody'll get TLS'ed

    --
    gtkaml.org
  11. Does this change anything for rc5? by Anonymous Coward · · Score: 0

    Does this change anything for rc5?

    In a previous life I was running distributed.net's rc5 dnetc client, and naturally they had developed (or people had contributed to them) cores for almost any CPU imaginable. Improvements were relatively frequent, as in every-few-months a particular CPU would get an upgraded core, which would go through the calculations even faster.

    Will the optimized AMD64 rc4 code provide any boost to those crunching rc5 on an AMD64?

    1. Re:Does this change anything for rc5? by RupW · · Score: 5, Interesting

      Will the optimized AMD64 rc4 code provide any boost to those crunching rc5 on an AMD64?

      No, they're entirely different. For a start, RC4 is a stream cipher whereas RC5 is a block cipher. They just share the same inventor, hence the names.

      AFAICR, the RC5 effort uses the register width to try and crack many keys in one go anyway - a different approach to this, which is using the register width to generate more of a single stream in one go.
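
      For reference, this is roughly what "generating a stream" means for RC4. It is a plain textbook sketch of the algorithm, not the optimised AMD64 code from the paper, and the function names are mine. The key and plaintext are the widely published "Key"/"Plaintext" test vector.

```python
def rc4_keystream(key: bytes):
    """Yield RC4 keystream bytes for the given key, one at a time."""
    # Key-scheduling: permute the 256-byte state under control of the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Output generation: keep swapping state bytes and emit one byte per step.
    i = j = 0
    while True:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        yield s[(s[i] + s[j]) % 256]


def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt (same operation) by XORing data with the keystream."""
    return bytes(b ^ k for b, k in zip(data, rc4_keystream(key)))


ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())
print(rc4(b"Key", ciphertext))
```

      Each output byte depends on the state left behind by the previous one, so the loop is inherently serial. There is no block to process, which is why it has little in common with an RC5 core.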

    2. Re:Does this change anything for rc5? by Anonymous Coward · · Score: 0

      Stream ciphers and block ciphers are not that different. It should not matter too much for generating the cipherstream. What a stream cipher does is generate blocks of cipherstream, which it will use to encrypt the plain text later on.

      I don't understand much of your last line. I do not know if optimized code can be used to crunch RC5, but the argument against it seems to be flawed. I'll have to look up the RC4/RC5 code to be able to say more on this subject.

      I would use 112 bit triple DES or 128/256 bit AES anytime over RC4, even though it seems to be stable enough.

    3. Re:Does this change anything for rc5? by RupW · · Score: 1

      Stream ciphers and block ciphers are not that different. It should not matter too much for generating the cipherstream.

      In their applications, perhaps, but they are very different in implementation. I would expect that techniques to optimise the implementation of stream cipher and block ciphers are very different, and the original question was (I thought) whether this optimised RC4 would help provide an optimised RC5.

      And my last point was that, as I understand it, the distributed.net people don't crunch RC5-72 with a vanilla implementation of RC5 anyway, but have come up with a way of performing several crunches as a long parallel operation since that's more optimal on some processors. (Notably RC5 involves full-rotation shifts of 32-bits and not all CPUs have hardware for that, and not all 64-bit CPUs will do that well either.) I don't imagine this reduces the order of the problem, it's just an optimisation for the given known plaintext and ciphertexts.

  12. Optimization First, Features Second by Space_Soldier · · Score: 4, Insightful

    I wish that every software company would put optimization first and features second. This way, we would not have to buy computers every few years. They can potentially last much longer.

    1. Re:Optimization First, Features Second by Smoo_Master · · Score: 2, Insightful

      I would tend to disagree with that. While one should weigh the performance against an overabundance of features, overzealous optimization can also result in problems. Remember that Knuth said "Premature optimization is the root of all evil."

      If someone took your idea to the extreme, you might get something like this:
      "What does it do?"
      "Nothing, but look how *fast* it does it?"

      I think the best solution is moderation in both ends.

    2. Re:Optimization First, Features Second by Zemplar · · Score: 1

      "I wish that every software company would put optimization first and features second. This way, we would not have to buy computers every few years. They can potentially last much longer."

      And why would the hardware and/or software manufacturers want that?

    3. Re:Optimization First, Features Second by ralejs · · Score: 1

      "Premature optimization is the root of all evil"
      No, you simply don't want a company that puts speed first. It is very hard to get fast code right. I'd rather have well working slow code than buggy fast code.
      Sure, it is possible to write fast correct code but very few people can. And I wouldn't want all companies to try that.

    4. Re:Optimization First, Features Second by maxwell+demon · · Score: 1
      Look at my super-fast text editor! Yes, it doesn't have many features; indeed what it does is to show you the old version, and then lets you type in the new one, so it also misses the features found usually in text editors like only changing part of the text without touching the rest at all (which would save you from typing the whole file again if you just want to change one character), but it can be proven that this is not strictly necessary, but you can do all the same with the interface "show old version, type in new one", so that's really all you need. And look, how super-optimized the editor is: Most of the time it just waits for the disk to deliver data, for the terminal to write it on your screen, for you to type new text, or for the disk to write the data. The time the editor itself runs is negligible!

      Well, actually there's currently only a prototype written as shell script; given the speed the editor already shows at this level, think about the speed the final assembler version will have!

      (Caveat: Make sure your terminal has a large enough scrollback buffer for your files!)

      Here's the prototype of that editor:
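      #!/bin/sh
      # Show the old version of the file.
      cat "$1"
      # Replace it with whatever is typed on stdin (end with Ctrl-D).
      cat - >"$1"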
      SCNR
      --
      The Tao of math: The numbers you can count are not the real numbers.
    5. Re:Optimization First, Features Second by shplorb · · Score: 2, Insightful

      Buy a games console then =]

    6. Re:Optimization First, Features Second by Anonymous Coward · · Score: 0

      Until every single operation is instantaneous then computers are too slow. Doesn't matter how much optimization, it still needs to be faster.

      I want 1 second boot times. 1 second to open anything. 1 second to compile the kernel. 1 second to load Doom3. 1 second to render a perfect photorealistic scene. etc...

      See what I mean? I could be so much more productive if the computer would just do what I wanted without me having to wait on anything. It's not that I don't pause and do some stuff slowly, but there are times (like in development) that I'm constantly waiting on the computer and it screws with my train of thought. Computers are too slow. Way too slow. No amount of current optimization is going to be fast enough.

    7. Re:Optimization First, Features Second by TheRaven64 · · Score: 1, Funny

      The First Rule of Program Optimization:
      Don't do it.

      The Second Rule of Program Optimization (for experts only!):
      Don't do it yet.

      --
      I am TheRaven on Soylent News
    8. Re:Optimization First, Features Second by OneArmedMan · · Score: 1

      The First Rule of Program Optimization:
      We Do Not Talk about Optimization

      The Second Rule of Program Optimization (for experts only!):
      We Do Not Talk about Optimization!!!

    9. Re:Optimization First, Features Second by RedWizzard · · Score: 1
      I wish that every software company would put optimization first and features second. This way, we would not have to buy computers every few years. They can potentially last much longer.
      Want to explain your logic? It seems to me it'd be a once-off win (as everyone switches to focus on optimization), and then business as usual. Think it through:
      1. You have a computer capable of running the software you have at a level of performance you're happy with.
      2. All your software gets magically optimized, obviously your computer has excess capacity now.
      3. Some new versions of your software come out. These require more resources, even when optimized. But your hardware still performs well because of the excess capacity. This is the once-off win, you don't have to upgrade when you would have before. But then ...
      4. The increased requirements of new software exceed the capability of your hardware. You have to upgrade hardware. Now the cycle will repeat at the same rate it did before optimization.
      The problem is that optimization gets you a reduction in absolute requirements but it doesn't reduce the rate at which software's requirements increase over time.
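
      A minimal sketch of that argument, assuming (these are made-up numbers, not anything from the thread) that software requirements double every year and that optimization is a one-time 2x saving. The function name and all figures are hypothetical; the only point is that the saving moves the upgrade date once and leaves the length of later cycles unchanged.

```python
import math

def years_until_upgrade(capacity, demand_now, growth_per_year):
    """Years until geometrically growing demand exceeds a fixed hardware capacity."""
    return math.log(capacity / demand_now) / math.log(growth_per_year)

# Hypothetical figures: the hardware handles 4 units of work, the software
# needs 1 unit today, and requirements double every year.
CAPACITY, DEMAND, GROWTH = 4.0, 1.0, 2.0

# Steps 1 and 4: no optimization, upgrade when demand passes capacity.
baseline = years_until_upgrade(CAPACITY, DEMAND, GROWTH)

# Steps 2 and 3: a one-time optimization halves today's demand (assumed 2x saving).
optimized = years_until_upgrade(CAPACITY, DEMAND / 2.0, GROWTH)

# After upgrading, assume the new machine again has 4x headroom over the
# demand at purchase time; the growth rate is untouched by the optimization.
next_cycle = years_until_upgrade(4.0 * CAPACITY, CAPACITY, GROWTH)

print(f"first upgrade without optimization: {baseline:.1f} years")
print(f"first upgrade with optimization:    {optimized:.1f} years")
print(f"every later cycle:                  {next_cycle:.1f} years")
```

      With these assumed numbers the first upgrade is pushed back once, and every later cycle is as long as it was before.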
  13. The extra GP registers will help by anti-NAT · · Score: 1

    See my earlier post as to why.

    --
    The Internet's nature is peer to peer - 20050301_cs_profs.pdf
    1. Re:The extra GP registers will help by lachlan76 · · Score: 1
      But, since the output is stored in one of the source operands, couldn't you do something like:
      MUL eax, ebx
      MUL eax, ecx
      MUL eax, edx
      leaving the solution stored in eax?
    2. Re:The extra GP registers will help by maxwell+demon · · Score: 1

      AFAIK MUL takes two 32-bit values and produces a 64-bit value, which is stored in edx/eax. Therefore you have to do the edx multiplication first. Other than that, this sequence should work.

      --
      The Tao of math: The numbers you can count are not the real numbers.
    3. Re:The extra GP registers will help by lachlan76 · · Score: 1
      Hang on....you're making me get out the old manual ;)

      Ok, you were right. But yeah, if you start with
      MUL eax, edx
      it should be fine unless it overflows
    4. Re:The extra GP registers will help by anti-NAT · · Score: 1

      Heh, you've hit the edge of my assembler knowledge, and I didn't think the example through that well..

      However, the point I was trying to show was that on a processor with additional GP registers, you would be able to add to your example

      mul eax, eex

      If such an "eex" register existed, instead of

      mov <mem location>, ebx
      mul eax, ebx

      In other words, the additional GP registers allow both the number of "mov" instructions, and the delays they cause, to be reduced.

      --
      The Internet's nature is peer to peer - 20050301_cs_profs.pdf
    5. Re:The extra GP registers will help by lachlan76 · · Score: 1

      This is pushing my knowledge now, but depending on how many cycles it takes, out-of-order execution could be used to load the new values from cache without a significant performance hit.

      Of course, I understand what you are trying to say.

  14. What about RC5-72 optimisation? by Anonymous Coward · · Score: 0

    That sure would help my l333+ sc0r3z over at distributed.net. RC4 is so passe....

  15. post C benchmarks by MagicMerlin · · Score: 1

    to author:
    please post the benchmarks for the C version of your algorithm along with the assembly version. It would be nice to know how much difference your tuning made.

    Merlin

    1. Re:post C benchmarks by RupW · · Score: 1

      please post the benchmarks for the C version of your algorithm along with the assembly version

      He did: 135 MB/s, near the top of the article, is for OpenSSL's C implementation of RC4 using GCC 3.4.2, -march=opteron -O3.

      Now you can probably tweak the compiler flags to improve that but it's a good point to start from.

  16. RC4 by cgenman · · Score: 1

    Wow, if RC4 is this much faster, just wait until they get to their Gold Master!

  17. Opteron and Xeon x86_64 by Anonymous Coward · · Score: 0

    We are seeing increased network throughputs of 15 to 20% using Opteron or Xeon x86_64 implementations, over Xeon. This is on various linux 2.4 kernels, redhat, suse, kernel.org. No optimisations for 64 bit were added, just compile and go. Bang, 110 MB/s on reads and 105 MB/s on writes compared to 85 and 80 MB/s respectively on 32 bit. These increases in throughput haven't been analysed completely yet but it looks to be a combination of scsi, software raid and tcp. It was especially interesting to see this increase on the Xeon box running an i386 kernel compared to the same Xeon box running an ia32e kernel. Very cool.

  18. MOD PARENT UP by r6144 · · Score: 1

    64-bit operations (addition and multiplication) are much faster on a 64-bit CPU such as Athlon 64, for on a 32-bit CPU they have to be emulated in software using multiple instructions, which is slower than the "hardware-accelerated" way in 64-bit CPUs.
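
    A minimal sketch of what "emulated in software using multiple instructions" means, with Python integers standing in for registers. The function names are hypothetical and the comments naming ADD/ADC only indicate roughly which x86 instructions a 32-bit compiler would emit; this is an illustration, not anybody's actual code.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def add64_with_32bit_ops(a, b):
    """Add two unsigned 64-bit values using only 32-bit halves."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    lo = a_lo + b_lo                      # first 32-bit add (ADD)
    carry = lo >> 32                      # carry out of the low half
    lo &= MASK32
    hi = (a_hi + b_hi + carry) & MASK32   # second add consumes the carry (ADC)
    return (hi << 32) | lo

def mul64_with_32bit_ops(a, b):
    """Low 64 bits of a*b built from 32x32 partial products."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    low_product = a_lo * b_lo                      # 32x32 -> 64-bit product
    cross = (a_lo * b_hi + a_hi * b_lo) & MASK32   # only the low 32 bits survive
    return (low_product + (cross << 32)) & MASK64

x, y = 0x123456789ABCDEF0, 0xFEDCBA9876543210
print(hex(add64_with_32bit_ops(x, y)))
print(hex(mul64_with_32bit_ops(x, y)))
```

    On a 64-bit CPU each of these is a single instruction on one register, which is where the saving comes from.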

  19. in an ideal world by Exter-C · · Score: 1

    AMD really need to look at creating a multi-OS optimised compiler. Or actively support the GNU / gcc so that anyone can compile binaries that are compiled specifically for the AMD-64/Athlon whatever.

    Then all the coders need to do is write the code that can be optimised best. The Intel C compiler does magic on Intel processors in Linux etc.; the performance difference is clear.

  20. ummm.... TI OMAP CPU by Anonymous Coward · · Score: 0

    I just ran a simulation of the same tests being run on a 200 MHz Texas Instruments OMAP CPU. Well, I came really really close. I should up the clock 10% and beat the Opteron... wait... ohh that's right, I implemented the RC4 in DSP code and parallelled the hell out of it, that might have something to do with it.

    Keep in mind, when it comes to encryption, I would still much prefer to have a CPU simply capable of moving the data to a DSP and then DSP it as parallelled as possible. Really, a 200 MHz DSP calculating 20 steps simultaneously is 4 GHz of linear processing power. With a board like this (http://www.mangodsp.com/seagull_pci.asp) you can probably do the equivalent of 112 GHz worth of linear calculation for encryption. The real problem is getting the data to the DSP.

  21. GCC is no slouch though, by oliverthered · · Score: 1

    Oh come on, gcc's, well, just slow.
    Slow to compile, slow when compiled.

    --
    thank God the internet isn't a human right.
    1. Re:GCC is no slouch though, by Anonymous Coward · · Score: 0

      You realize that GCC outperforms ICC in some of those benchmarks, right?

    2. Re:GCC is no slouch though, by oliverthered · · Score: 1

      I can outperform a 100 meters runner in falling flat on my face...
      GCC does ok in the easy tests and dies a horrible death when the going gets tough.

      --
      thank God the internet isn't a human right.
  22. C=64? Try N64 by tepples · · Score: 1

    I'm waiting for an Opteron-optimized build of 1964, a Nintendo 64 emulator, before I upgrade.

  23. Performance libraries too! by Phatmanotoo · · Score: 1

    Yes we need a compiler, but for the time being AMD could just release optimized ("performance") libraries for selected application areas, just as Intel does. They are available for both Windows and Linux, and the Linux ones are GCC-compatible.

  24. Optimizing Compiler? by WilyCoder · · Score: 1

    And what kind of optimizations are we talking about here? Changing floats to be 64 bits, doubles to 128 bits? Better use of SIMD instructions? More SIMD registers? Sure we can address more memory when using 64 bits of addressing space compared to 32 bits, but how would that make things faster?

    1. Re:Optimizing Compiler? by triso · · Score: 1
      And what kind of optimizations are we talking about here?
      Better use of the extra registers and better instruction interleaving to prevent pipeline clashes. Just the stuff that the Intel compiler does so well for its own chips to put gcc to shame.

      It is in AMD's best interest to make their own chips fly with gcc.
  25. You just save a few load uops by r6144 · · Score: 1

    If there are not enough user-visible registers, memory operands are used instead. However, these probably fit in the L1 cache anyway, and L1 cache is very fast (IIRC it is just one clock cycle on some CPUs). Also, the CPU can see that a later load depends on a preceding store, and the load can get the result from the preceding store directly. IIRC current processors already do this. So more user-visible general-purpose registers just reduce the number of load operations for the CPU to process, and since such operations do not cause much latency anyway (for they are working on the L1 cache, not the slow main memory), the performance increase would just be due to the load/store unit being less clogged up, so that other load/store operations will not be delayed by them.

  26. In other news, it is still worth optimizing... by 5n3ak3rp1mp · · Score: 1

    ...crucial code, and assembly language monkeys are still worth having around =) .

    I don't see the big deal here. I'd like to see what this algorithm would do if fully-optimized on the other processors out there, including the 64-bit G5. Maybe even better, use an algorithm that would have more practical value (wasn't RC4 cracked a while back already?) Try cracking MD5 or SHA-1 or something...

  27. RC4 Code Achieves 411 MB/s On AMD64 Opteron by ozzee · · Score: 1
    model name : AMD Opteron(tm) Processor 248
    stepping : 10
    cpu MHz : 2191.201
    cache size : 1024 KB

    rc4speed :
    Doing RC4_set_key for 5 seconds
    3429648 RC4_set_key's in 4.97 seconds
    Doing RC4 on 1024 byte blocks for 5 seconds
    1998887 RC4's of 1024 byte blocks in 4.97 second
    RC4 set_key per sec = 690070.02 ( 1.449uS)
    RC4 bytes per sec = 411843116.30 ( 0.019uS)

    The interesting thing is that the Opteron 248 CPU is faster than just clock cycles (using timothy's code)

    319*(2.2/1.8)=390 < 411

    1. Re:RC4 Code Achieves 411 MB/s On AMD64 Opteron by SnakeJG · · Score: 1
      RC4 bytes per sec = 411843116.30 ( 0.019uS)

      The interesting thing is that the Opteron 248 CPU is faster than just clock cycles (using timothy's code)

      319*(2.2/1.8)=390 < 411
      You seem to think that 1,000,000 bytes is 1 MB... are you in hard drive sales?

      But seriously, if you take the proper value of 1,048,576 bytes as 1 MB, you get about 393 MB/s for the 2.2 GHz Opteron, which is right about what you predicted.
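
      The conversion, as a quick sketch. The bytes-per-second figure is the one from the benchmark output quoted above; the function and constant names are just illustrative.

```python
BYTES_PER_MB_DECIMAL = 1_000_000   # "hard drive sales" megabyte
BYTES_PER_MB_BINARY = 1_048_576    # 2**20 bytes, the unit used in this reply

def to_mb_per_s(bytes_per_s, binary=True):
    """Convert a bytes/s rate to MB/s using either definition of a megabyte."""
    divisor = BYTES_PER_MB_BINARY if binary else BYTES_PER_MB_DECIMAL
    return bytes_per_s / divisor

measured = 411_843_116.30  # "RC4 bytes per sec" reported for the Opteron 248

print(f"decimal MB/s: {to_mb_per_s(measured, binary=False):.1f}")
print(f"binary MB/s:  {to_mb_per_s(measured, binary=True):.1f}")
```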
    2. Re:RC4 Code Achieves 411 MB/s On AMD64 Opteron by Anonymous Coward · · Score: 0

      I am not surprised to see such numbers. Some people also reported to me that the speed of this RC4 code scales linearly with the frequency. Theoretically this code should run at 461 MB/s (or 483.2e6 Bytes/s) on the FX-55 (2.6 GHz).
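
      A rough sketch of that extrapolation, assuming throughput is exactly proportional to clock frequency (a simplification based on the reports mentioned here) and taking the 319 MB/s at 1.8 GHz figure from the article as the baseline. The function name is hypothetical.

```python
BASE_MB_PER_S = 319.0   # article's figure, in units of 2**20 bytes per second
BASE_GHZ = 1.8          # clock of the Opteron used for that figure

def predicted_mb_per_s(clock_ghz):
    """Scale the 1.8 GHz result linearly with clock frequency (assumed model)."""
    return BASE_MB_PER_S * clock_ghz / BASE_GHZ

for name, ghz in [("Opteron 248", 2.2), ("FX-55", 2.6)]:
    mb = predicted_mb_per_s(ghz)
    bytes_per_s = mb * 2**20
    print(f"{name} at {ghz} GHz: {mb:.0f} MB/s ({bytes_per_s / 1e6:.1f}e6 bytes/s)")
```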

      --
      Marc Bevand

  28. apples and oranges by Ernesto+Alvarez · · Score: 1

    VIA padlock (as they call it) can currently only do AES in hardware (and it can also generate true random numbers). The next VIA chip called C7 (C5J Esther) however should be able to also do SHA-1, SHA-256 and parts of RSA in hardware (I think it should be available first half of 2005). That's of course still a limited set of encryption algorithms, but it's certainly an improvement.


    RSA, SHA-1 and SHA-256 are not something to choose instead of AES, they are more like a complement to it. AES is a symmetric cipher, while RSA is a public key one, and SHA-1 and SHA-256 are hash functions.

    That upgrade you are talking about would make the board better suited to do things like IPsec on hardware, but if you have a serious problem with AES (as stated in the grandparent post), you would have no alternative other than dumping the boards.

    A real alternative would have been the inclusion of another symmetric cipher (like 3DES or IDEA).

    PS: I know there is another reply next to this one, but I can't see it right now because slashdot is acting kind of weird right now. If this was redundant, sorry.
  29. benchmark by torrents · · Score: 1

    rc4 will be my new performance benchmark...

    --
    Get your torrents...
  30. What about multi-core? by dtjohnson · · Score: 1

    The multi-core chips that AMD demoed a couple of months ago will offer even better improvements than the RC4 results when software, particularly the OS, is optimized for them. Okay, for Windows that might be a while since Microsoft is still working at just getting a version of Windows out that supports x86-64. But for Linux...the possibilities are pretty big. If it is done right, even old non-optimized 32-bit apps should see a huge increase in speed.

    The Opteron 148 that was in the article is a nice processor but AMD has been selling it for at least a year now, and it isn't even the fastest Opteron.

  31. RC4 is not cryptographically strong by SiliconEntity · · Score: 2, Informative
    The RC4 stream cipher has a number of weaknesses. See Itsik Mantin's RC4 page; he is a crypto student who did his master's thesis on RC4. Among other weaknesses, the 2nd byte of the output is twice as likely to match the plaintext as it should be; there are weak keys; and it is possible to distinguish the output from randomness. Some of the attacks are practical and have been used to break the WEP wireless encryption algorithm, which uses RC4.

    If you really need speed, you can use RC4 securely but you have to know what you are doing and be aware of these attacks so you can employ protective countermeasures. Otherwise you are better off to use a cipher like AES which is actually secure.
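
    For anyone who wants to see the second-byte bias for themselves, here is a minimal sketch. It is plain textbook RC4 in Python (nothing to do with the optimized AMD64 code from the article), and the experiment assumes random 16-byte keys; the helper names and the trial count are my own choices. For an unbiased generator the second keystream byte would be zero about 1 time in 256; with RC4 it comes out roughly twice as often, which is why that ciphertext byte matches the plaintext too often.

```python
import random

def rc4_keystream(key: bytes, n: int) -> bytes:
    """Return the first n keystream bytes of textbook RC4 for the given key."""
    # Key-scheduling algorithm: permute the 256-byte state using the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    # Output generation: one swap and one keystream byte per step.
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) & 0xFF])
    return bytes(out)

def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt (same operation) by XORing data with the keystream."""
    keystream = rc4_keystream(key, len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))

def second_byte_zero_rate(trials: int, seed: int = 0) -> float:
    """Fraction of random 16-byte keys whose second keystream byte is zero."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        key = bytes(rng.randrange(256) for _ in range(16))
        if rc4_keystream(key, 2)[1] == 0:
            hits += 1
    return hits / trials

rate = second_byte_zero_rate(20000, seed=1)
print(f"P(second byte == 0) measured: {rate:.5f}, unbiased would be {1 / 256:.5f}")
```

    One common countermeasure is to discard the first part of the keystream before using it, but check the literature before relying on that.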

    1. Re:RC4 is not cryptographically strong by bob_jenkins · · Score: 1

      If you really need speed, and you're using a heavy duty CPU, you can do better than RC4 anyhow. RC4 manipulates 1-byte quantities. There are other stream ciphers that take about the same number of instructions but work on 4-byte or 8-byte quantities.

  32. Not that hot by Anonymous Coward · · Score: 0

    Like somebody else has already mentioned, this would be nothing much under IA64. And indeed, an Itanium 2 box, running at a significantly lower frequency (1.3 GHz) already beats this figure hands down, at 381 MB/s.

  33. Plucky AMD64 by Anonymous Coward · · Score: 0

    That throughput is fine, but far from the world's fastest for a general purpose CPU. An Itanium 2 box, running at 1.3 GHz (significantly slower than the AMD64 CPU in the article) attains 381 MB/s. On a 1.7 GHz box the throughput is 499 MB/s. I am aware that many consider Itanium a failure (Itanic, as they call it) but for some jobs it is king.

    1. Re:Plucky AMD64 by Anonymous Coward · · Score: 0

      AFAIK the fastest Itanium you can buy is 1.6 GHz. Have these 1.7 GHz results been obtained on a preproduction sample?

      Anyway a hypothetical 1.7 GHz Itanium would be priced between $3000 and $5000 (given the current Intel price list). Compare this to the FX-55 ($827) which should [1] run this RC4 code at 461 MB/s: the Itanium would be roughly 3.6 to 6 times more expensive and would provide 100*(499-461)/461 = 8% in speed increase...

      No wonder Itanium is a failure.

      [1] According to results I received from some people, the RC4 code speed increases linearly with the frequency.

      --
      Marc Bevand

  34. AMD64 Server Distribution by Anonymous Coward · · Score: 0

    Speaking of AMD64, what AMD64 Linux distribution would you guys recommend for use on a high-traffic production server with AMD64/Opteron CPUs? I'm currently looking into Debian's AMD64 port for Sarge (I know, not released yet), Ubuntu, Red Hat Enterprise, and Fedora Core. Which one of these is the most stable/robust? I'd prefer to go with a Debian-based distribution (due to their package management system), but of course stability is more important than convenience.

    Thanks.

  35. Spelling is definitely lacking by Anonymous Coward · · Score: 0

    Definitely.

  36. Intel's compiler now supports EM64T (Opteron) by Anonymous Coward · · Score: 0
  37. Pentium Pro is the worst example by GunFodder · · Score: 1

    The Pentium Pro ran 16 bit code slowly; 32 bit code ran quite well. However at the time Windows still had a lot of 16 bit code, and so did most major apps. The Pentium Pro did not run faster than the much cheaper Pentium processors that were also available at the time.

    The Athlon 64 architecture currently runs many or most 32 bit applications faster than comparable Intel processors, and is competitively priced. The ability to run 64 bit code is more like a bonus. This seems more comparable to the Pentium II, which was an extremely successful CPU architecture.

    IA64 is basically irrelevant because the Itanic really is identical to the Pentium Pro. It can't run 32 bit code very well and it costs a fortune.

    1. Re:Pentium Pro is the worst example by WuphonsReach · · Score: 1

      The Athlon 64 architecture currently runs many or most 32 bit applications faster than comparable Intel processors, and is competitively priced. The ability to run 64 bit code is more like a bonus. This seems more comparable to the Pentium II, which was an extremely successful CPU architecture.

      AMD was smart (as in business-smart) by providing a very easy upgrade path from 32 to 64 bit CPUs. I now own an Opteron CPU, and it is a very sweet chip that runs regular old 32-bit WinXP very nicely (as well as my games). Maybe someday I'll put a 64bit O/S on it, but in the meantime I'm getting my money's worth.

      Plus, the Opteron CPUs run mighty cool, even with the stock retail AMD cooler (44-48C).

      Hell of a nice chip, just wish they were a bit cheaper. I would not be surprised to see Opteron make significant inroads on the Xeon market.

      --
      Wolde you bothe eate your cake, and have your cake?
  38. register argument passing by BiggsTheCat · · Score: 1

    The new AMD-64 chips use passing by register to do function calls, leading to a huge speedup. Consider, on an x86, function calls are done on the stack. You push, push, push your arguments onto the stack and then jump to the subroutine that pops them off into registers to do work. It then leaves a return code on the stack and jumps back (I believe).

    With the AMD-64 chips compiled with the new 64-bit ABI (i.e. Linux running in 64-bit mode, NOT windows which is currently only 32-bit), the arguments to the next function are stored in general purpose registers. The stack is used only when you run out of registers, and you have quite a few registers to work with. This reduces pushes and pops onto the stack (which are slow operations) and leaves everything in registers where they're going to be used anyway.

    The 64-bit-ness has nothing to do with the speedup of AMD-64 processors for most applications.

    --

    Time is an illusion. Lunchtime doubly so. --Ford Prefect

    1. Re:register argument passing by Renegrade · · Score: 1

      Damn, talk about progress.

      I'm pretty sure (as in absolutely certain, as in being a developer for this system) that my Amiga 500 did passing-by-register with its "16-bit" 68000 CPU.

      Of course, this "16-bit" processor was really more 32-bit than say, a 386 based design, as it had eight 32-bit data registers, and eight 32-bit address registers which could address a 4GB linear address space.

      The OS itself used all pass-by-register interfaces, which used an address register plus offset jumptable calling system that was lightning fast, but you could compile your binaries to also use pass-by-register or the more traditional pass-by-stack system. Pass-by-register programs generally saw a performance increase between 10% and 800%, depending on how function-happy/OS-call-happy they were.

      It's nice to see that the rest of the industry is finally catching up. Datatypes to Codecs, register to register, .library to .so/.dll, finally.

      Strikes me that mainstream progress is really just re-implementing old ideas from non-mainstream sources.

      PS. Just a little note to those who thought your ISA didn't matter (ISA as in instruction set architecture, not as in that nasty crappy IBM bus), I say, "Ha!"

    2. Re:register argument passing by Short+Circuit · · Score: 1

      I'd have to double-check, but I could swear that the -mregparm=3 option is available for any x86 CPU when you're compiling your linux kernel. I know it's available on my Duron, which is most definitely not AMD-64.

    3. Re:register argument passing by BiggsTheCat · · Score: 1

      You're right, but x86 support for this sucks. In particular:

      1) you can only use up to three registers to do this. On average, C functions take three to four arguments, so there will be a lot of stack use anyway. The AMD-64 provides 8 additional registers, so there will be far fewer stack arguments.

      2) check out this note direct from the gcc man page:

      "Warning: if you use this switch, and num is nonzero, then you must build all modules with the same value, including any libraries. This includes the system libraries and startup modules."

      So yeah, you can use regparm on x86 but only if your whole system and all libraries were built with it (i.e. maybe if you use LFS or Gentoo you could get away with it). With the AMD-64, register passing is the documented, expected way to do it, and all AMD-64 Linux distros use it already.

      --

      Time is an illusion. Lunchtime doubly so. --Ford Prefect

  39. Re:PowerPC G5 - Altivec? by Anonymous Coward · · Score: 0

    Would using the Altivec engine in the G5 be faster than using the main core of the processor? I suppose it depends on how well the algorithm can be vectorized.

  40. As with the G5 by Frobozz0 · · Score: 1

    The benefits seen in this optimization are largely parallel with the G5. If Apple can get things optimized for 64bit computing that can take advantage of it, we will see great things. In many cases, they already have...

    I just think it's great that AMD is making such strides... for being a Mac guy, I pull for them in the PC world. What can I say? I like the underdog story.

    --
    "Politicians find new names for institutions which under old names have become odious to the people."
  41. EM64T vs Opteron by night · · Score: 1

    3.4 GHz EM64T, gcc-3.4.2:
    2952628 RC4_set_key's in 5.00 seconds
    Doing RC4 on 1024 byte blocks for 5 seconds
    784464 RC4's of 1024 byte blocks in 4.99 second
    RC4 set_key per sec = 590525.60 ( 1.693uS)
    RC4 bytes per sec = 160980187.58 ( 0.050uS)
    (153.52 MB/sec)

    2.0 GHz Opteron, gcc-3.4.2:
    3388004 RC4_set_key's in 4.99 seconds
    Doing RC4 on 1024 byte blocks for 5 seconds
    1810795 RC4's of 1024 byte blocks in 5.00 second
    RC4 set_key per sec = 678958.72 ( 1.473uS)
    RC4 bytes per sec = 370850816.00 ( 0.022uS)
    (353 MB/sec)

  42. Ohh... you're not biased... by cbreaker · · Score: 1

    "Don't get me wrong it's good that code is optimised, but I think that RC4 would fly faster on an IA64 than an opteron if specifically optimised to take advantage of the CPU's features."

    Opterons are much cheaper than IA-64, and they run 32-bit x86 stuff at full speed. They make porting applications easy because it's still x86. So whether or not the Itanium is faster/better is moot. They are way expensive and way niche.

    "RC4 isn't really that relevant in real life as WEP is crap & also easily done in hardware anyway."

    Yea, so might as well completely dismiss the whole thing just because you don't see value in it.. It's not the application, it's the fact that some optimizations made that much of a difference.

    "The 64 bit advantage will suffer the same fate as the 32bit advantage did for the 486, pentium & especially the Pentium Pro."

    If AMD64 "suffers" the same fate as IA-32, then that's great! That means that up the road, ALL software, millions of packages, will all be on AMD64. Awesome! You didn't expect people to just switch everything everywhere immediately, did you? As long as the trend follows toward AMD64, we're in good shape.

    "Problem is that you're generally better off saving your cash, buying a cheap CPU (32bit in this case) and waiting for the 2nd/3rd Generation CPU."

    There's a problem in there?

    "By that time prices will more reasonable and you will see the full advantages as programs will use the extra bits properly."

    You're being ridiculous. AMD64 is cheap. It's here now, and it's even in the Prescott P4's. Basically, unless you want something OLD, you're going to get AMD64 whether you like it or not, in the near future. This is a GOOD thing.

    So unless you're trying to say that we should hold off on spending the HUGE AMOUNT of $100 on an Athlon 64, then you're just flamebaiting here.

    --
    - It's not the Macs I hate. It's Digg users. -
  43. Maybe not with normal apps of TODAY... by cbreaker · · Score: 1

    Sure, most of the apps we use today might not get a HUGE performance increase from 64-bit x86. However.

    Can you imagine a 16-bit version of Office 2003? Or a media player? Or any of the other pretty heavy apps you run nowadays?

    A 64-bit platform opens new doors for doing things that would require a much faster IA-32 chip to perform. Since we're not going to be seeing the huge GHz increases in clock speed for a while, it's a decent thing to focus on.

    --
    - It's not the Macs I hate. It's Digg users. -
  44. Amiga 500! by BiggsTheCat · · Score: 1

    Cool! I had an A 500 as well. As usual, the Amiga line demonstrates its architectural superiority. Such a shame it had to die (let's be honest... it's dead). Finally we see one of its advanced features ported to the x86.

    I also know that the MIPS architecture uses register passing. I believe it has up to eight 32-bit registers for argument passing. Quite a lot of architectures use it, it's just that x86 anachronistically held onto its stack passing system since I don't think it had enough GPRs to do it right. Finally, AMD-64 will solve the problem.

    --

    Time is an illusion. Lunchtime doubly so. --Ford Prefect

  45. 64 bit Xeon by akuma(x86) · · Score: 1

    Did anybody try to run this on a 64-bit Xeon yet? This kind of algorithm naturally wants 64 bit data types. It would be interesting to compare Xeon vs. Opteron performance here.

  46. Re:FPU performance by Anonymous Coward · · Score: 0

    Please note that the C3s before the C5P have half speed FPUs. If you get the C3 with the VIA padlock engine you are also getting a full speed FPU. So the FPU performance isn't as crap as before. Also they are dropping 3DNow! support in favour of SSE support, so I'm currently waiting for some newer reviews to see if they are now suitable for software encoding in a HTPC. (I mainly want it for transcoding MPEG2 DTV signals to MPEG4 (Xvid or DivX).)

  47. SUSE by Taco+Cowboy · · Score: 1
    --
    Muchas Gracias, Señor Edward Snowden !
  48. And for the opteron's little brother... by quadfour · · Score: 1

    Doing RC4_set_key for 5 seconds
    2711712 RC4_set_key's in 4.39 seconds
    Doing RC4 on 1024 byte blocks for 5 seconds
    1451056 RC4's of 1024 byte blocks in 4.39 second
    RC4 set_key per sec = 617702.05 ( 1.619uS)
    RC4 bytes per sec = 338469554.44 ( 0.024uS)

    That looks mighty sweet for an Athlon64 3000+. I would have expected it to be far less than the Opteron though, weird!

    1. Re:And for the opteron's little brother... by Anonymous Coward · · Score: 0

      Nope. This is perfectly normal. Many people sent me results showing that the speed of this RC4 code depends only on the processor frequency.

      --
      Marc Bevand

  49. ubuntu not for servers - yet by oo_waratah · · Score: 1

    Ubuntu performs well and is easy to install but it is developed towards the desktop, not the server. Server Ubuntu is in the pipeline but that is not its current position in the market. (PS: it is a nice distribution though)

    PS: Do not attempt to put your home on a vfat partition, it fails to install :-(

  50. free market by oo_waratah · · Score: 1

    It really is a free market. If people refuse to pay for features because performance is poor then you will see a change. Currently it is easier to buy a new machine every 2-3 years and people expect that redundancy, I even advise it.

    I think with the lack of upgrades to Windows you are starting to see this effect happening. People are simply sticking to what they have. Microsoft (as an example only) will have to consider performance gains on existing hardware as a marketable thing soon.

  51. Compiler should do this by oo_waratah · · Score: 1

    If you write using generic code without need for carefully crafted data types then the compiler should compile it as you described. Unfortunately I see code all the time that assumes 32 bit ints and they are a real bugger to port to 64 bit.

    It really takes knowledge and skill to write portable code that makes few assumptions about hardware. Porting OpenOffice.org to 64 bit has been worked on for about 18 months. Hopefully 128bit will not be as hard. Seeing code for dates that is not Y2K compliant still being written now, I doubt this will be the case. We (royal we) programmers never learn.

  52. Re:Optimisation is def by DIAMANTUL · · Score: 0

    GJKGHGJKHG

    --
    GOD BLESS YOU!