AMD Showcases Quad-Core Barcelona CPU

Re:Honestly... by mabinogi · 2007-02-09 19:43 · Score: 5, Insightful

I don't care if it's 65nm, 45nm or 10mm - that's a completely irrelevant (to me as a user and purchaser) implementation detail. I care about the results - how fast is it for my workloads? How much is it? How much power does it use?

Obsession about process size is sillier than obsession over clock speeds.

If AMD can produce a better performing chip at 65nm, then who the hell cares if Intel - or anyone else - move to a 45nm process?

--
Advanced users are users too!

Re:But SSE is already 128 bits! by Zenki · 2007-02-09 19:45 · Score: 5, Informative

SSE+ operations up until now were executed 64 bits at a time within the processor. SSE128 just means the new AMD chip will complete an SSE instruction in one pass.

This was pretty much the reason why most people only bothered with MMX optimizations in their applications.

Re:But SSE is already 128 bits! by larrystotler · 2007-02-09 19:50 · Score: 5, Informative

When Intel first added SSE to the Pentium 3 chips, they did it with a 64bit setup to save die size on the then 350nm parts. Even when they moved to the newer smaller designs, they left it that way. The Core2 was the first chip to incorporate a single issue SSE engine. Therefore, with the Core2, it loads the instruction, then executes it. With the other chips, you have to load the first part (if it's a full 128bit instruction, or if it's multiple instructions added together), save, load, save, add, execute. This is where the Core2 kicks butt. I've been saying that the Barcelona would move to that design, since it's the biggest reason Intel has been beating AMD in the benchmarks. This will re-level the playing field. There have been lots of articles about this. Google it.

Is dethroning Intel the point? by Weaselmancer · 2007-02-09 19:50 · Score: 5, Insightful

As long as AMD and Intel continue to chase each other in the x86 market, high end chips become low end in the span of six months. Just keep buying 6 months behind the press releases and you get great processors for next to nothing.

--
Weaselmancer
rediculous.

AMD64 is very fast by GreatDrok · 2007-02-09 20:17 · Score: 5, Interesting

In my own benchmarks (generic C integer and floating point scientific code) I have found that the Core Duo and Core 2 Duo aren't all that quick compared with an AMD64. Clock for clock the AMD64 Opterons we have are about 50% quicker than an equivalent Core 2 Duo for integer work. I know this doesn't agree with all the usual magazine benchmarks but they are heavily biased towards using SSE instructions where possible and it is SSE where the Core 2 Duo has been a real improvement over previous Intel designs and also bests the AMD chips. Hopefully, AMD has recognised this and the new SSE implementation will bring them back on par with Intel for these benchmarks but even today an AMD64 processor is a beast and more than a match for anything Intel produces.

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"

Re:AMD64 is very fast by GreatDrok · 2007-02-09 20:54 · Score: 5, Informative

"Care to publish your numbers that debunk all the other hardware sites that are typically AMD-biased anyways?"

OK. I can't give you the code but it is my own implementation of a pretty standard bioinformatics sequence comparison program which doesn't use SSE/MMX type instructions and is single threaded. On all platforms it was compiled using gcc with -O3 optimisation. I have tried adding other optimisations but it doesn't really make much difference to these numbers (no more than a couple of percent at best).

AMD Opteron 2.0Ghz (HP wx9300) - 205 Million calculations per second
Intel Core 2 Duo 2.66Ghz (Mac Pro) - 146 Million
Intel Core Duo 2.0 Ghz (MacBook Pro) - 94 Million
IBM G5 PPC 2.3 Ghz (Apple Xserve) - 81 Million
Motorola G4 PPC 1.42 Ghz (Mac mini) - 72 Million
Intel P4 2.0 Ghz (Dell desktop) - 61 Million
Intel PIII 1.0 Ghz (Toshiba laptop) - 45 Million

Interesting things about these numbers. The Core Duo is clearly a close relative of the PIII since the performance at 2Ghz is roughly twice that of the PIII at 1Ghz. The P4 at 2Ghz is really very poor indeed which isn't a huge surprise as it was never very efficient. The G4 PPC puts in a reasonable result easily beating the much higher clocked P4 (what, the Mac people were right? Shock!) although I have to say that the performance of the G5 is disappointing. The Core 2 Duo isn't a bad performer although it does have the highest clock speed of any processor in this set but it is seriously beaten by the Opteron. From these numbers, a Core 2 Duo at 2Ghz would be about half as quick as an Opteron at the same speed.
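For anyone who wants to check the clock-for-clock comparison, here is a rough sketch of the arithmetic. It uses only the figures listed above and assumes throughput scales linearly with clock speed, which is a simplification; the names `results`, `per_ghz` and `normalized` are made up for illustration.

```python
# Benchmark figures copied from the list above:
# (clock in GHz, million calculations per second).
results = {
    "AMD Opteron": (2.0, 205),
    "Intel Core 2 Duo": (2.66, 146),
    "Intel Core Duo": (2.0, 94),
    "IBM G5 PPC": (2.3, 81),
    "Motorola G4 PPC": (1.42, 72),
    "Intel P4": (2.0, 61),
    "Intel PIII": (1.0, 45),
}


def per_ghz(clock_ghz, mcalc_per_sec):
    """Throughput per GHz, assuming performance scales linearly with clock."""
    return mcalc_per_sec / clock_ghz


normalized = {name: per_ghz(*figures) for name, figures in results.items()}

# Print the clock-normalized ranking, fastest first.
for name, value in sorted(normalized.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {value:6.1f} M calc/s per GHz")

# Core 2 Duo scaled down to 2.0 GHz, relative to the 2.0 GHz Opteron.
core2_at_2ghz = normalized["Intel Core 2 Duo"] * 2.0
ratio = core2_at_2ghz / results["AMD Opteron"][1]
print(f"Core 2 Duo at 2.0 GHz: {core2_at_2ghz:.0f} M, {ratio:.2f}x the Opteron")
```

Under that linear-scaling assumption the Core 2 Duo lands at a bit over half the Opteron's figure, in line with the "about half as quick" estimate.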

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"

Re:AMD64 is very fast by GreatDrok · 2007-02-09 22:03 · Score: 3, Informative

"The P3 you list looks a Coppermine, I suspect a P3 Tualatin would perform much better."

Pretty sure it is a Tualatin since it is a 1Ghz PIII Mobile which I bought in early 2002 (http://www.theregister.co.uk/2001/01/31/chipzilla_readies_1ghz_mobile_piii would seem to support this).

Given that it is a Tualatin, the performance of the Core Duo at 2Ghz looks about right. From all the blurb I have read, the Core 2 Duo gets about 10% better performance clock for clock, except when it comes to SSE, where it is about twice as fast. So the figure of 146 million also looks pretty much on the mark: a 2Ghz Core 2 Duo should manage about 110 million if you scale the figure for clock speed, and that is (surprise) in the same ballpark, roughly 17% quicker than the Core Duo at 2Ghz (94 million). The basic integer performance of the Core 2 Duo is better than the Core Duo's, but it doesn't compare with the 205 million the 2.0Ghz Opteron manages.

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"

Re:AMD64 is very fast by waaka! · 2007-02-09 22:19 · Score: 3, Insightful

"OK. I can't give you the code but it is my own implementation of a pretty standard bioinformatics sequence comparison program which doesn't use SSE/MMX type instructions and is single threaded. On all platforms it was compiled using gcc with -O3 optimisation. I have tried adding other optimisations but it doesn't really make much difference to these numbers (no more than a couple of percent at best)."

When you say you've tried "adding other optimizations," are you referring only to other GCC optimization flags? If your program's algorithms have any moderate degree of parallelism and you haven't tried vectorization either by compiler (GCC and ICC can both do this) or by hand, the benchmark you've done is not unlike a race where no one is allowed to shift out of first gear. Can you go into any more specifics about how this program does sequence comparisons?

Also, the disappointing numbers from the G5 may be partially explained by the fact that its integer unit has higher latency than the other desktop processors in that list. The G5 isn't exactly known for blistering integer performance, anyway.

Re:Intel's Responds by pchan- · 2007-02-09 20:31 · Score: 4, Interesting

"Lets make a Octa-core processor!"

Oh, here's one. Though it's been out since before Intel had quad-core chips.

Re:Honestly... by mabinogi · 2007-02-09 20:38 · Score: 4, Insightful

45nm is not inherently "better" than 65nm any more than 3Ghz is inherently "better" than 1Ghz. A smaller process size is a means to an end, it's not an end in itself.

The end is the delicate balance of improving performance / watt while increasing overall performance and keeping the price down. If AMD can deliver a chip that does a better job of that at 65nm than an Intel 45nm one, then the AMD chip is not somehow "worse" than the Intel one just because it doesn't use 45nm. That's just stupid.

I'm not saying AMD can do that, but I think that criticizing them for not being ready for 45nm yet is more than premature.
AMD's actually guilty of the same flawed logic though - their criticism of Intel's 4 core processor being just 2 dual cores stuck together is just as pointless. It doesn't matter; what matters is how well the processor meets the requirements of its target market.

--
Advanced users are users too!

Re:Well.... by Creepy+Crawler · 2007-02-09 20:48 · Score: 4, Interesting

---That sounds very interesting. Would you mind providing a link to the literature that discusses that? I have some trouble figuring out the thermodynamics of this. Perpetuum mobile and such, you know....

Of course. It, at first, sounds too good, but here you go.

Rolf Landauer showed in 1961 that reversible logic operations could be performed without consuming energy or generating heat. The same could not be said for irreversible logic operations.

"Irreversibility and Heat Generation in the Computing Process" IBM Journal of Research and Development 5 (1961): 183-91, IBM PDF

___

In 1973, Charles Bennett proved that any computation could be derived from purely reversible computing.

Charles H. Bennett "Logical Reversibility of Computation" IBM Journal of Research and Development 17 (1973): 525-32, IBM PDF

___

Later on, Fredkin and Toffoli presented a review of the ideas of reversible computing. The essential idea is that you can save all the intermediate states of an algorithm on the way to the answer, and then reverse the process so that no energy is used and no heat is generated. Fredkin also indicates that if we switched from irreversible to reversible computing, we would expect to lose no more than 1% efficiency.

International Journal of Theoretical Physics 21 (1982):219-53 PDF

___

And as an unsubstantiated claim, I remember hearing that, due to heat/radiation sources, volatile memory gains errors at a rate of 1 bit per billion over a period somewhere between 1 minute and 1 day (I forget the exact time). To correct this would only require the entropy of deleting that incorrect bit. In other words, a heat reduction of 10^8 or so in magnitude. But trust the stuff above.
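To put a number on "the entropy of deleting a bit": the bound usually attached to Landauer's name is k_B * T * ln 2 of heat per erased bit. A minimal sketch of that calculation follows; the 300 K room temperature and the function name are assumptions for illustration, and nothing here models a real memory chip.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant (exact SI value)


def landauer_limit_joules(temperature_kelvin, bits_erased=1):
    """Minimum heat dissipated when erasing bits: k_B * T * ln(2) per bit."""
    return bits_erased * BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


room_temp_kelvin = 300.0  # assumed operating temperature
per_bit = landauer_limit_joules(room_temp_kelvin)
print(f"Landauer limit at {room_temp_kelvin:.0f} K: {per_bit:.3e} J per erased bit")

# Erasing a billion bits still costs only on the order of picojoules at this limit.
per_billion = landauer_limit_joules(room_temp_kelvin, bits_erased=10**9)
print(f"Erasing 1e9 bits: {per_billion:.3e} J")
```

This is a theoretical floor only; as the reply further down points out, real circuits dissipate far more than this just moving charge around.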

(Many of these ideas were taken from "The Singularity is Near" by Ray Kurzweil from page 130)

--

Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37

Re:If its true by Anonymous Coward · 2007-02-09 20:54 · Score: 5, Funny

Three quad cores for the pasty-nerds under the sky,
Seven for the WoW-nerds in their halls of stone,
Nine for Diablo Men doomed to die,
One for the Dark Nerd on his dark throne
In the Land of Silicon where the corporations lie.
One quad core to rule them all, One quad core to find them,
One quad core to bring them all and in the darkness bind them
In the Land of Silicon where the corporations lie.
He paused, and then said in a deep voice,
This is the Master-quad core, the One quad core to rule them all.

Re:Well.... by drgonzo59 · 2007-02-09 21:25 · Score: 4, Insightful

And how exactly is your reversible computing going to reduce the resistance of millions and millions of conductors to 0? You are confusing a theoretical issue relating to computer science (and very relevant to quantum computing) with a practical problem of a CPU design. Just moving information around _without_ deleting it will generate heat.

Or did you actually think that those "stupid" CPU designers for all these years, battling with heat dissipation, never thought of, oh.. simply replacing the NAND gates with reversible Fredkin and Toffoli gates and 'poof' magically all the heat issues are gone, processors will run @ hundreds of GHz, the world's electrical power consumption will go down and the geeks won't be able to boast about their huge ass sinks anymore...

Re:Honestly... by epine · 2007-02-09 21:49 · Score: 4, Insightful

If AMD can produce a better performing chip at 65nm, then who the hell cares if Intel - or anyone else - move to a 45nm process?

Feature size has dominated progress (as measured either by raw performance or performance per watt) over an unbroken 30 year period. Do you recall the very passionate debates about RISC vs CISC? Did a RISC design at one feature size ever beat a CISC design at the next shrink? I think not. Design has never mattered anywhere near as much as feature size. Not that you can't get design wrong. But then you can get a shrink wrong, too, and end up with 1% yields. AMD managed briefly to remain competitive with Intel playing a full shrink behind when Intel did that rather stupid marketron-driven face-plant into the thermal wall (against good advice from their Israel team, who later came to the rescue with Core Duo).

With the recent skyrocket of leakage current, the holy grail of feature size is somewhat tarnished, but it still dominates the performance curve. You completely missed the relationship between feature shrinks and the performance crown. If Intel has better process technology than AMD (almost always) and AMD has a better design (most of the time since the Athlon was first launched) and both companies shrink every 18 months following the Moore projection (that unbroken 30 year historical trend) and AMD always shrinks 9 months behind Intel, then the performance crown will pass back and forth exactly as often as either company announces their next product.
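The leapfrog argument can be sketched as a toy model. Every number below is an assumption picked for illustration (a 1.5x gain per shrink, a 1.2x design edge for AMD); the only thing taken from the post is the 18 month shrink cadence and the 9 month lag. The alternation holds whenever the design edge is worth less than one shrink.

```python
SHRINK_GAIN = 1.5     # assumed performance multiplier per process shrink
INTEL_DESIGN = 1.0    # assumed baseline design factor
AMD_DESIGN = 1.2      # assumed better design, worth less than one shrink
SHRINK_PERIOD = 18    # months between shrinks (from the post)
AMD_LAG = 9           # months AMD trails Intel on each shrink (from the post)


def shrinks_done(month, first_shrink_month):
    """Number of shrinks completed by `month`, given when the first one landed."""
    if month < first_shrink_month:
        return 0
    return (month - first_shrink_month) // SHRINK_PERIOD + 1


def leader_at(month):
    """Who holds the performance crown at `month` in this toy model."""
    intel = INTEL_DESIGN * SHRINK_GAIN ** shrinks_done(month, 0)
    amd = AMD_DESIGN * SHRINK_GAIN ** shrinks_done(month, AMD_LAG)
    return "Intel" if intel > amd else "AMD"


def crown_history(total_months):
    """Crown holder at each product announcement (one every AMD_LAG months)."""
    return [(m, leader_at(m)) for m in range(0, total_months + 1, AMD_LAG)]


history = crown_history(54)
for month, leader in history:
    print(f"month {month:2d}: {leader}")
```

With these made-up factors the crown changes hands at every announcement, which is the pattern the post describes.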

So I agree with you: feature size has no importance to the customer who wants performance for their dollar. Except that you can set your clock by it and project ten years into the future effective performance levels of shrinks we haven't even seen yet. Except for that part, yeah, I'm with you.

Re:But SSE is already 128 bits! by waaka! · 2007-02-09 22:00 · Score: 5, Informative

Hmm...do you mean specifically on AMD's hardware? That stopped being true for Intel starting with the Core, which has 1-cycle latency on SSE instructions.

Core2 has single-cycle throughput on most SSE instructions, not single-cycle latency. Most of these instructions still take 3-5 cycles to generate results, which is similar to the Pentium M, but now a vector of results finishes every cycle, instead of every two or four cycles.

An important consequence of this is that if your instructions are poorly scheduled by the compiler (or assembly programmer) and the processor spends too much time waiting for results of previous operations, the advantages of single-cycle throughput mostly disappear.
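A back-of-the-envelope model makes the latency versus throughput distinction concrete. This is only a sketch of an idealized pipeline, not a simulation of any real core; the 3-cycle latency and the issue intervals are assumed values in the range mentioned above, and both function names are made up.

```python
def cycles_independent(n_ops, latency, issue_interval):
    """Cycles for n independent ops (n_ops >= 1): a new op starts every
    `issue_interval` cycles, and the last one needs `latency` cycles to finish."""
    return latency + (n_ops - 1) * issue_interval


def cycles_dependent_chain(n_ops, latency):
    """Cycles when each op needs the previous result: nothing overlaps,
    so throughput no longer matters."""
    return n_ops * latency


N_OPS = 100
LATENCY = 3  # assumed cycles per SSE op

full_width = cycles_independent(N_OPS, LATENCY, issue_interval=1)  # one per cycle
half_width = cycles_independent(N_OPS, LATENCY, issue_interval=2)  # one per two cycles
chained = cycles_dependent_chain(N_OPS, LATENCY)

print(f"independent ops, 1-cycle throughput: {full_width} cycles")
print(f"independent ops, 2-cycle throughput: {half_width} cycles")
print(f"fully dependent chain:               {chained} cycles")
```

Well-scheduled code sees close to a 2x gain from the faster issue rate, while the dependent chain takes the same time either way.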

Re:But SSE is already 128 bits! by pammon · 2007-02-09 22:33 · Score: 4, Informative

Core2 has single-cycle throughput on most SSE instructions, not single-cycle latency

Well, certainly you won't be able to get a square root through in one clock cycle, but many/most of the simple integer arithmetic, bitwise, and MOV SSE instructions on the Core 2 really do have single cycle latency. source. None do on the AMD64, which supports the theory that SSE128 means more "new for us" than "new for everyone." Not to put AMD down - many of the other features sound promising (but the article is long on breathlessness and light on details, alas).

Re:Intel's Responds by Anonymous Coward · 2007-02-09 23:08 · Score: 3, Informative

8 core (two quad core chips in a single package) is already on Intel's internal roadmaps.

(this was anonymous for a reason)

Junk article, full of inaccuracies. by barracg8 · 2007-02-10 00:43 · Score: 4, Informative

Each of Barcelona's four cores incorporates a new vector math unit referred to as SSE128

SSE has always been 128bit (the 64bit simd extensions were called MMX). AMD used to funnel the instructions through a 64bit execution unit by splitting the work into two halves, the new core has a full 128bit SSE pipeline so doesn't need to split the operations. Nothing new here, just a faster internal implementation. Can this deliver an 80% improvement in benchmark performance? - quite possibly. Take a look at the Core2 FP performance numbers - it also has a full 128bit implementation of SSE.
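As a rough illustration of "splitting the work into two halves", here is a toy model of a packed 32-bit integer add (what the PADDD instruction does) carried out either in one 128-bit pass or in two 64-bit passes. The function names and the pass counter are invented for the sketch; it says nothing about how either vendor's hardware is actually built.

```python
MASK32 = 0xFFFFFFFF  # each lane is a 32-bit integer that wraps on overflow


def add_lanes(a, b):
    """Lane-wise 32-bit add with wraparound."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def packed_add_full_width(a, b):
    """128-bit-wide unit: all four lanes in a single pass."""
    return add_lanes(a, b), 1


def packed_add_split(a, b):
    """64-bit-wide unit: low two lanes first, then the high two lanes."""
    low = add_lanes(a[:2], b[:2])
    high = add_lanes(a[2:], b[2:])
    return low + high, 2


a = [1, 2, 3, 0xFFFFFFFF]
b = [10, 20, 30, 1]

full_result, full_passes = packed_add_full_width(a, b)
split_result, split_passes = packed_add_split(a, b)
print(full_result, full_passes)
print(split_result, split_passes)
```

Both paths give the same answer; the only difference in this model is the number of passes through the execution unit.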

And separating integer and floating-point schedulers also accelerates this thing called virtualization

Huh. Hardware virtualization affects how the processor handles certain instructions such as privileged operations. FP instruction execution is unaffected. Virtualized workloads will benefit no more than non-virtualized workloads. Separate issue queues are good but does it specifically benefit virtualization? - no.

Barcelona blacks out power to individual portions of the chip that are idled, from in-core execution units to on-die bus controllers. This hasn't made it into PCs before ...

Intel call this 'intelligent power capability'.
http://www.intel.com/technology/magazine/computing/core-architecture-0306.htm?iid=search&

Barcelona adds Level 3 cache, a newcomer to the x86

Xeons have featured L3 caches for years. http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors

Barcelona is genius, a genuinely new CPU that frees itself entirely of the millstone of the Pentium legacy.
Barcelona is a new CPU, not a doubling of cores and not extensions strapped on here and there.

Barcelona is an Opteron, with a doubling of cores and some extensions strapped on here and there.

I'm not meaning to detract from AMD here - the fact that they have still not had to make any radical changes to the Opteron micro-architecture is a testament to the quality of the original design. They are slightly ahead of the game on virtualization - they're going to beat Intel to nested page tables - but other than that this chip is playing catch-up. Overall this is going to be a very nice piece of kit to work with. But nothing radical and new here.

G.

Paging Tables by Doc+Ruby · 2007-02-10 03:16 · Score: 4, Informative

Nested paging tables is a per-core feature that will light the afterburners on x86 hardware virtualization. A paging table holds the map that translates virtual memory addresses to physical memory addresses, and each CPU core has only one. Virtual machines have to load and store their page tables as they get and lose their slice of the CPU. AMD solved the problem with nested paging tables. Simplified, each VM maintains its own paging table that stays fixed in place. Instead of loading and saving paging tables as your system flips from VM to VM, your system just supplies Barcelona with the ID of the virtual machine being activated. The CPU core flips page tables automatically and transparently. This is another feature that's implemented for each core.
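The simplified description above can be sketched as a toy model: each VM keeps its own table in place, and switching VMs only changes which table the core consults. The `Core` class, the VM names and the page numbers are all hypothetical, and this follows the simplified description only, not the actual hardware mechanism.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


class Core:
    """Toy CPU core that selects a per-VM page table by ID (hypothetical model)."""

    def __init__(self):
        self.tables = {}        # vm_id -> {virtual page number: physical page number}
        self.active_vm = None

    def register_vm(self, vm_id, page_table):
        """Install a VM's table once; it stays fixed in place afterwards."""
        self.tables[vm_id] = page_table

    def activate(self, vm_id):
        """Switch VMs by ID only; no table is loaded or saved."""
        if vm_id not in self.tables:
            raise KeyError(f"unknown VM: {vm_id}")
        self.active_vm = vm_id

    def translate(self, virtual_address):
        """Map a virtual address to a physical one using the active VM's table."""
        page, offset = divmod(virtual_address, PAGE_SIZE)
        return self.tables[self.active_vm][page] * PAGE_SIZE + offset


core = Core()
core.register_vm("vm_a", {1: 7})  # virtual page 1 -> physical page 7
core.register_vm("vm_b", {1: 9})  # same virtual page, different physical page

core.activate("vm_a")
addr_a = core.translate(0x1234)
core.activate("vm_b")
addr_b = core.translate(0x1234)
print(hex(addr_a), hex(addr_b))
```

The same virtual address resolves to different physical addresses depending on the active VM, and the switch itself copies nothing.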

Context-switching has long been the weakest design point for x86 in "PCs", especially servers. x86 arch is rooted in single-user, single-threaded, single-context apps. The in-core registers that CPU operations execute directly against have to be swapped out for each context switch. In *nix, that means every time a different process gets a timeslice, it's got to execute two slow copies between registers and at best cache RAM, at worst offchip RAM (over some offchip bus). If the register count is larger than the bus width (even onchip), that's another multiple on that slow cycle. That context-switch overhead can be larger than the timeslice allocated to each process's "turn" in the schedule for lower-latency / higher-response (lower "nice") processes, approaching realtime.

Unix was designed for multiusers, context-switching from the beginning. The chips it's run on coevolved with it. Linux arrived when x86 CPUs ran fast enough that context-switching was OK, but still a big waste compared with, say, MicroVAX multiple register sets. Windows architecture is rooted in the x86 architecture that DOS was designed for, though perhaps Vista has finally lost all of the old design baggage originated in the 8088/8086, but its long history of UI multitasking means it's context-switching all the time, which will gain in speed. The MacOS switch to BSD means it's got lots of power bound up in the context switches that could be released with Barcelona.

So while low-level benchmarks might show something like 80% FPU improvement, the high level (application) performance could improve quite a lot more. Recompiling apps to machine code that exploits more registers without the context-switching penalties could find multiples, especially apps with realtime multimedia that run concurrently with other apps. Intel's hyperthreading already gets past some of these bottlenecks in distributing tasks among multiple cores, but the Barcelona paging tables go even deeper, for likely extra performance (on top of Barcelona's own hyperthreading and new L3 cache).

Aside from the marketing "vapormarks" we'll surely see out of AMD (and their sockpuppets) before it's actually released "midyear", I'm looking forward to seeing how this thing really runs in multitasking apps. I'm expecting "like a greased snake across a griddle".

--
make install -not war

Slashdot Mirror
