AMD Showcases Quad-Core Barcelona CPU

Re:But SSE is already 128 bits! by Zenki · 2007-02-09 19:45 · Score: 5, Informative

Up until now, SSE+ operations were processed 64 bits at a time within the processor. SSE128 just means the new AMD chip will complete an SSE instruction in one pass.

This was pretty much the reason why most people only bothered with MMX optimizations in their applications.

Re:But SSE is already 128 bits! by larrystotler · 2007-02-09 19:50 · Score: 5, Informative

When Intel first added SSE to the Pentium 3 chips, they did it with a 64bit setup to save die size on the then 250nm parts. Even when they moved to the newer smaller designs, they left it that way. The Core2 was the first chip to incorporate a single issue SSE engine. Therefore, with the Core2, it loads the instruction, then executes it. With the other chips, you have to load the first part (if it's a full 128bit instruction, or if it's multiple instructions added together), save, load, save, add, execute. This is where the Core2 kicks butt. I've been saying that the Barcelona would move to that design, since it's the biggest reason Intel has been beating AMD in the benchmarks. This will re-level the playing field. There have been lots of articles about this. Google it.

Re:AMD64 is very fast by pjbass · 2007-02-09 20:30 · Score: 2, Informative

Care to publish your numbers that debunk all the other hardware sites that are typically AMD-biased anyways?

And pointing out that it isn't fair to compare because a Core2 duo already executes the full SSE instruction in one pass vs. the 2 clocks for a current AMD64 is the same as saying it's not fair to compare the on-die memory controller on AMD's chips vs. Intel's FSB. But people didn't seem to care when the numbers went in AMD's favor.

I'd really be interested in seeing your numbers, your programs, and what compiler options you passed when building on each platform (as well as type of memory, mobo chipset, etc., in each machine).

Re:Well.... by Khyber · 2007-02-09 20:48 · Score: 2, Informative

Heat-free? Did you forget the Second Law? Or did you just forget about pure friction itself? Moving ANYTHING is going to involve friction. Nothing moves without SOME force, and friction will happen.

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.

Re:AMD64 is very fast by GreatDrok · 2007-02-09 20:54 · Score: 5, Informative

"Care to publish your numbers that debunk all the other hardware sites that are typically AMD-biased anyways?"

OK. I can't give you the code but it is my own implementation of a pretty standard bioinformatics sequence comparison program which doesn't use SSE/MMX type instructions and is single threaded. On all platforms it was compiled using gcc with -O3 optimisation. I have tried adding other optimisations but it doesn't really make much difference to these numbers (no more than a couple of percent at best).

AMD Opteron 2.0Ghz (HP wx9300) - 205 Million calculations per second
Intel Core 2 Duo 2.66Ghz (Mac Pro) - 146 Million
Intel Core Duo 2.0 Ghz (MacBook Pro) - 94 Million
IBM G5 PPC 2.3 Ghz (Apple Xserve) - 81 Million
Motorola G4 PPC 1.42 Ghz (Mac mini) - 72 Million
Intel P4 2.0 Ghz (Dell desktop) - 61 Million
Intel PIII 1.0 Ghz (Toshiba laptop) - 45 Million

Interesting things about these numbers. The Core Duo is clearly a close relative of the PIII since the performance at 2Ghz is roughly twice that of the PIII at 1Ghz. The P4 at 2Ghz is really very poor indeed which isn't a huge surprise as it was never very efficient. The G4 PPC puts in a reasonable result easily beating the much higher clocked P4 (what, the Mac people were right? Shock!) although I have to say that the performance of the G5 is disappointing. The Core 2 Duo isn't a bad performer although it does have the highest clock speed of any processor in this set but it is seriously beaten by the Opteron. From these numbers, a Core 2 Duo at 2Ghz would be about half as quick as an Opteron at the same speed.
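A quick way to check the clock-for-clock claims is to divide each score by its clock speed. The sketch below assumes performance scales linearly with frequency, which is a simplification; the scores are the ones listed above and the variable names are only for illustration.

```python
# Scores in millions of calculations per second and clock speeds in GHz,
# copied from the list above.
results = {
    "Opteron": (205, 2.0),
    "Core 2 Duo": (146, 2.66),
    "Core Duo": (94, 2.0),
    "G5 PPC": (81, 2.3),
    "G4 PPC": (72, 1.42),
    "P4": (61, 2.0),
    "PIII": (45, 1.0),
}

# Assumption: throughput scales linearly with clock speed.
per_ghz = {name: score / ghz for name, (score, ghz) in results.items()}

# Print the chips from best to worst per-clock score.
for name, value in sorted(per_ghz.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {value:6.1f} million per GHz")

# Core 2 Duo relative to the Opteron at equal clocks.
ratio = per_ghz["Core 2 Duo"] / per_ghz["Opteron"]
print(f"Core 2 Duo / Opteron per clock: {ratio:.2f}")
```

Under that assumption the ratio comes out a little above one half, which matches the "about half as quick" estimate.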

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"

Re:But SSE is already 128 bits! by waaka! · 2007-02-09 22:00 · Score: 5, Informative

Hmm...do you mean specifically on AMD's hardware? That stopped being true for Intel starting with the Core, which has 1-cycle latency on SSE instructions.

Core2 has single-cycle throughput on most SSE instructions, not single-cycle latency. Most of these instructions still take 3-5 cycles to generate results, which is similar to the Pentium M, but now a vector of results finishes every cycle, instead of every two or four cycles.

An important consequence of this is that if your instructions are poorly scheduled by the compiler (or assembly programmer) and the processor spends too much time waiting for results of previous operations, the advantages of single-cycle throughput mostly disappear.
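The difference between latency and throughput can be put into a toy cycle count. This is a rough model of an idealized pipelined unit, not a description of any real core; the function names and the example numbers (100 operations, 4-cycle latency) are made up for illustration.

```python
def cycles_independent(n_ops: int, latency: int, issue_interval: int = 1) -> int:
    """Cycles to finish n_ops independent operations on a pipelined unit.

    A new operation can start every `issue_interval` cycles, and each one
    produces its result `latency` cycles after it starts.
    """
    if n_ops == 0:
        return 0
    return (n_ops - 1) * issue_interval + latency


def cycles_dependent(n_ops: int, latency: int) -> int:
    """Cycles when every operation needs the previous result (a serial chain)."""
    return n_ops * latency


# Well-scheduled code: halving the issue interval nearly halves the time.
print(cycles_independent(100, latency=4, issue_interval=2))  # 64-bit style unit
print(cycles_independent(100, latency=4, issue_interval=1))  # 128-bit style unit

# Poorly scheduled code: the chain is bound by latency either way.
print(cycles_dependent(100, latency=4))
```

In this model the dependent chain takes the same time regardless of the issue interval, which is the point made above about scheduling.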

Re:AMD64 is very fast by GreatDrok · 2007-02-09 22:03 · Score: 3, Informative

"The P3 you list looks a Coppermine, I suspect a P3 Tualatin would perform much better."

Pretty sure it is a Tualatin since it is a 1Ghz PIII Mobile which I bought in early 2002 (http://www.theregister.co.uk/2001/01/31/chipzilla_readies_1ghz_mobile_piii would seem to support this).

Given that it is a Tualatin, then the performance of the Core Duo at 2Ghz looks about right. From all the blurb I have read, the Core 2 Duo gets about 10% better performance clock for clock, except when it comes to SSE where it is about twice as fast. So the performance figure of 146 million also looks pretty much on the mark, as a 2Ghz Core 2 Duo should be able to manage about 110 million if you scale the figure for clock speed. That is ~17% quicker than the Core Duo at 2Ghz (94 million), a bit more than the blurb's 10%, so the basic integer performance of the Core 2 Duo is better than the Core Duo but doesn't compare with the 205 million the 2.0Ghz Opteron manages.
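The scaling in this comment is easy to redo by hand. A minimal sketch, assuming the score scales linearly with clock speed (the variable names are just for illustration):

```python
core2_score, core2_ghz = 146, 2.66  # million calculations/s at 2.66 GHz
core_duo_score = 94                 # million calculations/s at 2.0 GHz

# Assumption: linear scaling with clock speed.
core2_at_2ghz = core2_score * 2.0 / core2_ghz
advantage = core2_at_2ghz / core_duo_score - 1

print(f"Core 2 Duo scaled to 2.0 GHz: {core2_at_2ghz:.0f} million")
print(f"Advantage over Core Duo: {advantage:.0%}")
```

The scaled figure lands at about 110 million, which works out to roughly 17% over the Core Duo's 94 million, so somewhat more than a 10% clock-for-clock gain.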

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"

Re:But SSE is already 128 bits! by pammon · 2007-02-09 22:33 · Score: 4, Informative

Core2 has single-cycle throughput on most SSE instructions, not single-cycle latency

Well, certainly you won't be able to get a square root through in one clock cycle, but many/most of the simple integer arithmetic, bitwise, and MOV SSE instructions on the Core 2 really do have single cycle latency. source. None do on the AMD64, which supports the theory that SSE128 means more "new for us" than "new for everyone." Not to put AMD down - many of the other features sound promising (but the article is long on breathlessness and light on details, alas).

Re:Intel's Responds by Anonymous Coward · 2007-02-09 23:08 · Score: 3, Informative

8 core (two quad core chips in a single package) is already on Intel's internal roadmaps.

(this was anonymous for a reason)

Junk article, full of inaccuracies. by barracg8 · 2007-02-10 00:43 · Score: 4, Informative

Each of Barcelona's four cores incorporates a new vector math unit referred to as SSE128

SSE has always been 128bit (the 64bit simd extensions were called MMX). AMD used to funnel the instructions through a 64bit execution unit by splitting the work into two halves; the new core has a full 128bit SSE pipeline so doesn't need to split the operations. Nothing new here, just a faster internal implementation. Can this deliver an 80% improvement in benchmark performance? - quite possibly. Take a look at the Core2 FP performance numbers - it also has a full 128bit implementation of SSE.
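To make the "two halves" point concrete, here is a toy model of a packed add on four 32-bit integer lanes, done once in a single pass and once as two 64-bit halves. It only illustrates that the result is identical and the work is merely split; it is not how either vendor's hardware is actually built, and all names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # each lane is a 32-bit unsigned integer


def add_lanes(a, b):
    """Lane-wise 32-bit add with wraparound, like a packed integer add."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def add_128_full(a, b):
    """One pass over all four lanes (a 128-bit wide execution unit)."""
    return add_lanes(a, b)


def add_128_split(a, b):
    """Two passes over two lanes each (a 64-bit wide execution unit)."""
    low = add_lanes(a[:2], b[:2])
    high = add_lanes(a[2:], b[2:])
    return low + high


a = [1, 2, 3, 0xFFFFFFFF]
b = [10, 20, 30, 1]
print(add_128_full(a, b))
print(add_128_split(a, b))
```

Both functions return the same four lanes; the split version just needs two trips through the narrower unit.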

And separating integer and floating-point schedulers also accelerates this thing called virtualization

Huh. Hardware virtualization affects how the processor handles certain instructions such as privileged operations. FP instruction execution is unaffected. Virtualized workloads will benefit no more than non-virtualized workloads. Separate issue queues are good but do they specifically benefit virtualization? - no.

Barcelona blacks out power to individual portions of the chip that are idled, from in-core execution units to on-die bus controllers. This hasn't made it into PCs before ...

Intel call this 'intelligent power capability'.
http://www.intel.com/technology/magazine/computing/core-architecture-0306.htm?iid=search&

Barcelona adds Level 3 cache, a newcomer to the x86

Xeons have featured L3 caches for years. http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors

Barcelona is genius, a genuinely new CPU that frees itself entirely of the millstone of the Pentium legacy.
Barcelona is a new CPU, not a doubling of cores and not extensions strapped on here and there.

Barcelona is an Opteron, with a doubling of cores and some extensions strapped on here and there.

I'm not meaning to detract from AMD here - the fact that they have still not had to make any radical changes to the opteron micro-architecture is a testament to the quality of the original design. They are slightly ahead of the game on virtualization - they're going to beat Intel to nested page tables - but other than that this chip is playing catchup. Overall this is going to be a very nice piece of kit to work with. But nothing radical and new here.

G.

Re:Junk article, full of inaccuracies. by ocbwilg · 2007-02-10 02:27 · Score: 2, Informative

Xeons have featured L3 caches for years. http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors

Actually, if you go waaaay back to the Socket 7 days you could have L3 cache as well. The AMD K6 and K6-2 CPUs only had on-die L1, and the L2 cache was on the mainboard. But the K6-3 CPU had 256KB or 512KB of on-die L2 and was compatible with the same mainboards. So when you put that K6-3 in a socket 7 mainboard the mainboard's cache actually functioned as L3. Sure it wasn't on-chip, but L3 cache is definitely nothing new to x86.

Re:Junk article, full of inaccuracies. by ScriptedReplay · 2007-02-10 04:27 · Score: 2, Informative

I fully agree, the article is mainly empty of information - it took words from AMD briefings and produced a meaningless salad.

Now, as far as some claims, in detangled order:
- FPU boost: this seems to be based on several things - one is the obvious widening of SSE2 issues. Others are increasing instruction fetch from 16B/cycle to 32B/cycle, making the FPU scheduler 128bit, unaligned loads and a doubling of cache bandwidth.
- Virtualization: Nested page tables and reduced switching times for the hypervisor.
- Power: CPU and northbridge on separate power planes so they can be in different power modes (clock+voltage); apparently, voltages of different cores are independent as well, so that should give lower power consumption when not at full load (with appropriate MB support). AFAIR this is better than what Intel has, but I might be mis-remembering.
- the extra cache is long overdue and one will have to see whether their way of managing it is smart enough (things like moving data from L3 to individual caches, but sometimes keeping code shared in L3)
There. More content than TFA and shamelessly copied from a 4-month old article for the benefit of all us non-RTFA people. And there's more actually.

Paging Tables by Doc+Ruby · 2007-02-10 03:16 · Score: 4, Informative

Nested paging tables is a per-core feature that will light the afterburners on x86 hardware virtualization. A paging table holds the map that translates virtual memory addresses to physical memory addresses, and each CPU core has only one. Virtual machines have to load and store their page tables as they get and lose their slice of the CPU. AMD solved the problem with nested paging tables. Simplified, each VM maintains its own paging table that stays fixed in place. Instead of loading and saving paging tables as your system flips from VM to VM, your system just supplies Barcelona with the ID of the virtual machine being activated. The CPU core flips page tables automatically and transparently. This is another feature that's implemented for each core.
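As a rough illustration of the idea, nested paging can be pictured as two lookups in a row: the guest's own table maps guest-virtual pages to guest-physical pages, and a second, per-VM table maps guest-physical pages to host-physical pages. The sketch below is a heavily simplified model under that assumption: flat dictionaries stand in for the multi-level tables real hardware walks, there is no TLB, and every name and page number is hypothetical.

```python
PAGE_SIZE = 4096


def translate(table: dict, addr: int) -> int:
    """Map an address through a flat page table (page number -> frame number)."""
    page, offset = divmod(addr, PAGE_SIZE)
    return table[page] * PAGE_SIZE + offset


# Hypothetical per-VM tables; both stay in place while VMs are switched.
guest_tables = {   # guest virtual page -> guest physical page
    "vm1": {0: 5, 1: 7},
    "vm2": {0: 2},
}
nested_tables = {  # guest physical page -> host physical page
    "vm1": {5: 40, 7: 41},
    "vm2": {2: 90},
}


def guest_to_host(vm_id: str, guest_virtual: int) -> int:
    """Two-stage translation selected only by the ID of the active VM."""
    guest_physical = translate(guest_tables[vm_id], guest_virtual)
    return translate(nested_tables[vm_id], guest_physical)


print(guest_to_host("vm1", PAGE_SIZE + 12))
print(guest_to_host("vm2", 100))
```

Switching from one VM to another in this model only changes the `vm_id` used for the lookup; neither table has to be copied or rebuilt.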

Context-switching has long been the weakest design point for x86 in "PCs", especially servers. x86 arch is rooted in single-user, single-threaded, single-context apps. The in-core registers that CPU operations execute directly against have to be swapped out for each context switch. In *nix, that means every time a different process gets a timeslice, it's got to execute two slow copies between registers and at best cache RAM, at worst offchip RAM (over some offchip bus). If the register count is larger than the bus width (even onchip), that's another multiple on that slow cycle. That context-switch overhead can be larger than the timeslice allocated to each process's "turn" in the schedule for lower-latency / higher-response (lower "nice") processes, approaching realtime.

Unix was designed for multiple users and context-switching from the beginning. The chips it's run on coevolved with it. Linux arrived when x86 CPUs ran fast enough that context-switching was OK, but still a big waste compared with, say, MicroVAX multiple register sets. Windows architecture is rooted in the x86 architecture that DOS was designed for, though perhaps Vista has finally lost all of the old design baggage that originated in the 8088/8086. Its long history of UI multitasking means it's context-switching all the time, which will gain in speed. The MacOS switch to BSD means it's got lots of power bound up in the context switches that could be released with Barcelona.

So while low-level benchmarks might show something like 80% FPU improvement, the high level (application) performance could improve quite a lot more. Recompiling apps to machine code that exploits more registers without the context-switching penalties could find multiples, especially apps with realtime multimedia that run concurrently with other apps. Intel's hyperthreading already gets past some of these bottlenecks in distributing tasks among multiple cores, but the Barcelona paging tables go even deeper, for likely extra performance (on top of Barcelona's own hyperthreading and new L3 cache).

Aside from the marketing "vapormarks" we'll surely see out of AMD (and their sockpuppets) before it's actually released "midyear", I'm looking forward to seeing how this thing really runs in multitasking apps. I'm expecting "like a greased snake across a griddle".

--
make install -not war

Re:AMD64 is very fast by GreatDrok · 2007-02-10 05:28 · Score: 2, Informative

"This guys little application does tons of random memory reads"

If only that was the case but actually it is very linear. The application can hold the whole of its memory requirements in cache these days so it hardly has to touch main memory and it was designed to do all the inner loop code using only registers. Heck, I doubled the size of the inner loop just to avoid a single register copy because it made a significant performance increase.

The reason I like this code is that it shows how many operations you can expect a chip to achieve when it isn't having to wait on main memory. It is an extremely compute intensive application with very little I/O. If it really was about random memory reads then I would be inclined to agree with you but it isn't, it loads blocks of memory into cache and chews through them linearly.

I am pretty processor agnostic. I did really like the Alpha but that is dead now and that is a real shame. I also object to being called a 'fanboi' as I personally don't own any AMD kit since I generally prefer Macs which means regardless of what you might think all my personal machines are PPC or Intel. If anything though, my application is an example of general computing applications in that it doesn't use any SSE tricks to increase performance. It is just code that anyone could write in C and compile on whatever machine they have to hand so the performance is pretty real world. Sure, I've spent a bit of time right in the core making the thing efficient but that is where the program spends 99% of its time so not doing so would be stupid.

How's this for something to make your head spin, we were benchmarking some Java code written by someone else the other day and found that Java under Windows XP Pro on one of these Opterons was no quicker than it was on my G4 1.5Ghz PowerBook but the same app under Linux on the same Opteron was 4x quicker. The guy running the machine under Windows is a Java developer and now wants Linux installing and will use Windows via VMware in future. Also interesting, the 2.66Ghz MacPro was about 30% slower than the Opteron under Linux running the same bytecode but still faster than Windows running the same code. Not my 'little application' but still seems to follow the same trend which I thought was interesting. Apart from the Windows thing. No idea what is wrong with Java under Windows unless Sun did it deliberately which wouldn't surprise me.

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"

Slashdot Mirror
