AMD Showcases Quad-Core Barcelona CPU
Gr8Apes writes "AMD has showcased their new 65nm Barcelona quad-core CPU. It is labeled a quad-core Opteron, but according to Infoworld's Tom Yeager, is really a redefinition of x86. Each core has a new vector math processing unit (SSE128), separate integer and floating point schedulers, and new nested paging tables (to vastly improve hardware virtualization). According to AMD, the new vector math units alone should improve floating point operation by 80%. Some analysts are skeptical, waiting for benchmarks. Will AMD dethrone Intel again? Only time will tell."
Up until now, SSE and later operations were processed 64 bits at a time within the processor. SSE128 just means the new AMD chip will complete a 128-bit SSE instruction in one pass.
This was pretty much the reason why most people only bothered with MMX optimizations in their applications.
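To make the "one pass versus two" point concrete, here is a rough toy model in Python. It is only a sketch: the function name `packed_add` and its parameters are made up for illustration, and it counts passes through a datapath of a given width rather than simulating any real chip.

```python
# Toy model (not cycle-accurate): a packed add over four 32-bit lanes,
# executed on a datapath that is either 64 or 128 bits wide.
# `packed_add` is a hypothetical helper, not a real API.

def packed_add(a, b, datapath_bits=128, lane_bits=32):
    """Add two equal-length vectors lane by lane.

    Returns (result, passes), where `passes` is how many trips through
    the datapath were needed to cover all lanes.
    """
    lanes_per_pass = datapath_bits // lane_bits
    result = []
    passes = 0
    for start in range(0, len(a), lanes_per_pass):
        chunk_a = a[start:start + lanes_per_pass]
        chunk_b = b[start:start + lanes_per_pass]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        passes += 1
    return result, passes

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

narrow = packed_add(a, b, datapath_bits=64)   # two lanes per pass
wide = packed_add(a, b, datapath_bits=128)    # all four lanes at once

print(narrow)  # ([11.0, 22.0, 33.0, 44.0], 2)
print(wide)    # ([11.0, 22.0, 33.0, 44.0], 1)
```

Same answer either way; the only difference in this model is how many passes it takes to get there.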
When Intel first added SSE to the Pentium 3 chips, they did it with a 64-bit setup to save die size on the then 250nm parts. Even when they moved to newer, smaller processes, they left it that way. The Core2 was the first chip to incorporate a single-pass 128-bit SSE engine, so it loads the instruction, then executes it. With the other chips, you have to load the first part (if it's a full 128-bit instruction, or if it's multiple instructions added together), save, load, save, add, execute. This is where the Core2 kicks butt. I've been saying that Barcelona would move to that design, since it's the biggest reason Intel has been beating AMD in the benchmarks. This will re-level the playing field. There have been lots of articles about this. Google it.
"Care to publish your numbers that debunk all the other hardware sites that are typically AMD-biased anyways?"
OK. I can't give you the code but it is my own implementation of a pretty standard bioinformatics sequence comparison program which doesn't use SSE/MMX type instructions and is single threaded. On all platforms it was compiled using gcc with -O3 optimisation. I have tried adding other optimisations but it doesn't really make much difference to these numbers (no more than a couple of percent at best).
AMD Opteron 2.0 GHz (HP xw9300) - 205 million calculations per second
Intel Core 2 Duo 2.66 GHz (Mac Pro) - 146 million
Intel Core Duo 2.0 GHz (MacBook Pro) - 94 million
IBM G5 PPC 2.3 GHz (Apple Xserve) - 81 million
Motorola G4 PPC 1.42 GHz (Mac mini) - 72 million
Intel P4 2.0 GHz (Dell desktop) - 61 million
Intel PIII 1.0 GHz (Toshiba laptop) - 45 million
A few interesting things about these numbers. The Core Duo is clearly a close relative of the PIII, since the performance at 2 GHz is roughly twice that of the PIII at 1 GHz. The P4 at 2 GHz is really very poor indeed, which isn't a huge surprise as it was never very efficient. The G4 PPC puts in a reasonable result, easily beating the much higher clocked P4 (what, the Mac people were right? Shock!), although I have to say that the performance of the G5 is disappointing. The Core 2 Duo isn't a bad performer, but it has the highest clock speed of any processor in this set and is still seriously beaten by the Opteron. From these numbers, and assuming performance scales linearly with clock speed, a Core 2 Duo at 2 GHz would be about half as quick as an Opteron at the same speed.
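For anyone who wants to check the per-clock comparison, here is a quick Python sketch that normalises the figures above to millions of calculations per second per GHz. The linear scaling with clock speed is an assumption, not something the benchmark measured, and the names `results` and `per_ghz` are just illustrative.

```python
# Per-clock normalisation of the benchmark figures quoted in this post.
# Assumption: throughput scales linearly with clock speed, which is only
# a rough approximation for real processors.

results = [
    # (processor, clock in GHz, millions of calculations per second)
    ("AMD Opteron", 2.0, 205),
    ("Intel Core 2 Duo", 2.66, 146),
    ("Intel Core Duo", 2.0, 94),
    ("IBM G5 PPC", 2.3, 81),
    ("Motorola G4 PPC", 1.42, 72),
    ("Intel P4", 2.0, 61),
    ("Intel PIII", 1.0, 45),
]

def per_ghz(clock_ghz, mcalcs):
    """Millions of calculations per second, per GHz of clock."""
    return mcalcs / clock_ghz

ranked = sorted(results, key=lambda r: per_ghz(r[1], r[2]), reverse=True)
for name, clock, mcalcs in ranked:
    print(f"{name:18s} {per_ghz(clock, mcalcs):6.1f} M calcs/s per GHz")

# Scale the Core 2 Duo down to 2.0 GHz and compare with the Opteron.
core2_at_2ghz = per_ghz(2.66, 146) * 2.0
ratio = core2_at_2ghz / 205
print(f"Core 2 Duo at 2.0 GHz vs Opteron at 2.0 GHz: {ratio:.2f}")
```

On these figures the ratio comes out a little over one half, and the Core Duo and PIII land within a few percent of each other per clock.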
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"
Core2 has single-cycle throughput on most SSE instructions, not single-cycle latency. Most of these instructions still take 3-5 cycles to generate results, which is similar to the Pentium M, but now a vector of results finishes every cycle, instead of every two or four cycles.
An important consequence of this is that if your instructions are poorly scheduled by the compiler (or assembly programmer) and the processor spends too much time waiting for results of previous operations, the advantages of single-cycle throughput mostly disappear.
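A back-of-the-envelope model of the latency versus throughput distinction, in Python. This is a deliberately simplified sketch: the function names are made up, the latency of 4 cycles is just an assumed value from the 3-5 cycle range mentioned above, and real pipelines have many more constraints.

```python
# Simplified pipeline model: how many cycles does it take to finish
# n_ops vector operations, given a result latency and an issue interval?
# All names and the latency value below are illustrative assumptions.

def cycles_dependent_chain(n_ops, latency):
    """Every op needs the previous op's result, so nothing overlaps."""
    return n_ops * latency

def cycles_independent(n_ops, latency, issue_interval=1):
    """Ops are independent: a new one issues every `issue_interval`
    cycles, and the last one finishes `latency` cycles after it issues."""
    if n_ops == 0:
        return 0
    return (n_ops - 1) * issue_interval + latency

N_OPS = 100
LATENCY = 4  # assumed, within the 3-5 cycle range

well_scheduled_full_rate = cycles_independent(N_OPS, LATENCY, issue_interval=1)
well_scheduled_half_rate = cycles_independent(N_OPS, LATENCY, issue_interval=2)
badly_scheduled = cycles_dependent_chain(N_OPS, LATENCY)

print(well_scheduled_full_rate)  # 103
print(well_scheduled_half_rate)  # 202
print(badly_scheduled)           # 400
```

In this model the dependent chain costs the same number of cycles whether the unit can start a new operation every cycle or every other cycle, which is the sense in which poor scheduling throws away the throughput advantage.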