HP Shows Off PA-8800 SMP-On-A-Chip CPU Plans
Eric^2 writes: "At last week's MicroProcessor Forum, HP's David J. C. Johnson unveiled the details of HP's latest RISC processor destined to redefine performance in Server-Class processors. Following a relatively simple strategy, the PA-8800 processor combines two PA-8700 cores on a single chip to enable symmetric multiprocessing (SMP) on a single processor. Aside from bumping the core speed up to an initial 1 GHz, enhancements include the addition of combined 35 MB L1+L2 cache. The article contains the full text. AMD, please steal an idea..."
The IBM p690 server uses POWER4 processors. Each chip has 2 POWER cores with high-speed interconnects. Even better is that each chip is connected to 3 other chips to make up 8-CPU packs.
Wow... And I thought the 8MB L2 cache on UltraSPARC IIIs was a lot, not to mention the 16MB on some IBMs. Now we're talking about 3MB just in L1 with 32MB L2 cache. This beasty should have some impressive benchmark scores (yeah, I know, benchmarks aren't everything...)
...a 1 GHz processor may not sound like much, even in this dual-core configuration, but keep in mind that this is a RISC processor. None of that Super-mega-ultra-long-50-bazillion-stage pipeline crap that Intel uses to pump up their MHz rating. The article kind of sells this point a little bit short. The RISC architecture allows this processor to do roughly twice as much work in the same amount of time - or, to put it in a more concrete scenario: imagine a pair of 2GHz Pentium 4s running in an SMP configuration.
Now that's FAST.
Did that say 35MB of L1 + L2 cache? I may be rusty, but I think I remember reading in my Processor Design for Dummies book that increasing cache size actually can slow down processor performance after a certain amount. Could someone please clarify this?
today is spelling optional day.
It wouldn't be stealing an idea. This idea has been around for a long time in academia. Maybe the poster forgot this, but the POWER4 from IBM does this, and comes with 32MB of L3 cache, plus an on-die shared L2 cache. The idea isn't new; it's known as CMP (Chip-level Multi-Processing). Really, "SMP on a chip" is merely called CMP.
:)
Also, though Sun has decided not to use the MAJC architecture for anything (they were hoping to try to get it to become a video-accelerator, but that's not even going to happen, most likely), that too was fully spec'ed out to have multiple cores on a chip...it's really nothing new.
The longstanding rumour is that AMD will be coming out with a dual-hammer processor (i.e., CMP). In academia, the idea has been used frequently as well.
The idea of using CMP isn't even that big a deal to most consumers. While it would be nice for AMD to come out with a chip that does multithreading (merely because it increases real-world throughput quite a bit, depending upon the type of multithreading), the average PC running Windows 9x/ME/XP Consumer won't be able to multithread anyway. The only reason for AMD to multithread is for the server-space, which is what they're aiming for with the hammer series...but I digress.
The most interesting parallel architecture I heard about at the MPF was Siroyan's OneDSP architecture. This is a clustered VLIW machine that can execute up to 64 instructions each cycle! See the EE Times article and their MPF paper.
Not likely. HP is Intel's partner in the development of Itanium, which is based on PA-RISC.
If AMD has a desire to cram two AMD-64 cores in one package, they'd better come up with their own solution or license IBM's...
What ? Me, worry ?
The official HP presentation on the PA-8800 is available as a PDF from http://www.cpus.hp.com/technical_references/mpf_2001.pdf.
Y.
Earlier steps in the multi-CPU direction included the 8-way DEC Alpha (killed in the merger with HP?) and a little National Semiconductor product for embedded systems with two very modest CPUs on a chip.
PA-8800 lets you create two opposite predicates in one instruction, for example the predicate a=b.
This seems to indicate that there are no separate "do this if predicate is true" and "do this if predicate is false" instructions, so for opposite predication you would have to specify two different predicates.
The processor cannot know that these two predicates are related, so this would give you quite a problem.
As has been publicly disclosed, in general in PA-8800, an instruction reading any resource (such as a predicate) must be in a later instruction group (cycle) than the instruction writing that resource. As a special case, branches are allowed to use a predicate written by another instruction in the same instruction group (as shown in the IDF slides).
So, the straightforward (but slow) PA-8800 schedule for the earlier example:
if (a < 0)
    b += a;
else
    b -= a;
c += b;
d += b;
would be:
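cmp.lt pLT, pNLT = a, 0 // pLT & pNLT are 2 complementary preds
;;
(pLT) add b = b, a // add to b [then]
(pNLT) sub b = b, a // or sub from b [else]
;;
add c = c, b // uses of b
add d = d, b
;;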
which takes 5 instructions in 3 cycles. (Note: In PA-8800 assembly, ";;" indicates the end of an instruction group, "=" separates the target operand(s) from the source(s), "//" begins a comment, and (pred) specifies the controlling predicate.)
An alternate (faster) schedule in PA-8800 is as follows:
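sub bTmp = b, a // speculatively sub from b (into temp)
add b = b, a // and add to b
cmp.lt pLT, pNLT = a, 0
;;
(pLT) add c = c, b // uses of b [then]
(pLT) add d = d, b
(pNLT) add c = c, bTmp // uses of b (temp) [else]
(pNLT) add d = d, bTmp
(pNLT) mov b = bTmp // move bTmp to b [else]
;;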
This takes 8 instructions in 2 cycles and one extra register. The final move of bTmp to b can be eliminated if b isn't live out at that point.
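As a rough check that the faster schedule above computes the same values as the straightforward one, here is a minimal C sketch of the two forms. The struct and function names (regs, update_branchy, update_speculative) are made up for illustration, and nothing here shows what a real compiler would emit; it only mirrors the data flow of the two listings.

```c
/* Values that are live after the if/else in the example. */
struct regs { long b, c, d; };

/* Straightforward form: b must be updated before c and d can use it. */
struct regs update_branchy(long a, struct regs r)
{
    if (a < 0)
        r.b += a;
    else
        r.b -= a;
    r.c += r.b;
    r.d += r.b;
    return r;
}

/* Speculative form: compute both candidate values of b up front,
 * then let the comparison decide which one the later uses see. */
struct regs update_speculative(long a, struct regs r)
{
    long b_else = r.b - a;              /* sub bTmp = b, a */
    long b_then = r.b + a;              /* add b = b, a */
    int  lt     = (a < 0);              /* cmp.lt pLT, pNLT = a, 0 */
    long b_new  = lt ? b_then : b_else; /* select by predicate */

    r.c += b_new;                       /* uses of b or bTmp */
    r.d += b_new;
    r.b  = b_new;                       /* (pNLT) mov b = bTmp */
    return r;
}
```

Calling both functions with the same inputs, for example a = -3 and {b, c, d} = {10, 1, 2}, should give identical results; the second form simply trades extra arithmetic for a shorter dependency chain.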
Following a relatively simple strategy, the PA-8800 processor combines two PA-8700 cores on a single chip to enable symmetric multiprocessing (SMP) on a single processor.
It doesn't enable SMP "on a single processor". It provides two processors on a single die. There is a distinction.
AMD, please steal an idea...
The big rumor regarding the third version of Hammer is that it'll be a dual-CPU module. Any guesses as to Hammer's clock speed on release?
299,792,458 m/s...not just a good idea, it's the law!
Galileo: "The Earth revolves around the Sun!"
Score: -1 100% Flamebait
...is that you actually can go out and buy a new mainframe using Power4. Nothing wrong with looking ahead, but if you remember, AMD said that the Athlon should have been made in an "Athlon Ultra" version sporting 8MB L2 cache. ... I still stick to the motto: "I'll believe it when I can buy it"
Thomas S. Iversen
That seems practical enough to me.
You know, when AMD first brought out the Athlon, they were supposed to be compatible with Alpha 21264 boards too.
AMD even made a couple of engineering samples in Slot B packages for testing, but that's as far as it went.
If someone could hack a Slot A/Slot B adaptor then they could hypothetically do the same thing. They might have to hack a BIOS update too, though.
I thought HP had committed itself to ditching the PA-RISC and moving to Itanic, err, Itanium.
"whatever it cost"?
Then go and buy something from Sun, IBM, Compaq -> AFAIK all three sell servers with L2 caches that large. (Maybe HP and SGI as well).
E.g. something from IBM's z900 series (mainframe - up to 32 MB L2 (per CPU?)) or pSeries 620 (workgroup/midrange server - up to 8MB L2 per CPU) or Sun Enterprise 450 (workgroup server - up to 8MB L2 cache per CPU), Sun Fire 15K (high-end server, 8MB L2 per CPU), Compaq AlphaServer GS/ES series (up to 8MB per CPU).
And if you want just a total of 8MB, an SGI Origin 300 with more than 4 CPUs should do it as well (2MB L2 per CPU).
It doesn't seem too practical to me. Most apps don't benefit greatly from SMP anyway.
They don't? What kind of server do you run? Most all pieces of production-class server software that I know of benefit from multiple processes. Look at Apache, forking off five, ten, or even more processes to handle requests. MySQL, I believe, uses threads. PostgreSQL forks off a new backend for each connection. Shoot, even your telnet, ftp, ssh, and mail daemons will fork off for each connection, allowing you to take advantage of more than one CPU.
If you're sitting at home working on a spreadsheet, you're right, SMP isn't for you - and this machine isn't targeted at you. When you're running a server that may have tens, hundreds, or thousands of SIMULTANEOUS processes fighting for CPU time, every processor counts.
And, to make things even better, even if you're only running a single, non-threaded process, having two processors still makes the machine much more "responsive", as the second CPU can handle kernel code for file IO, network code, interrupt handling, writing to logs, and a lot of other tasks. Ever seen how much CPU time even syslog can chew up?
steve
Oh, you're not stuck, you're just unable to let go of the onion rings.
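The fork-per-connection pattern described above for Apache, PostgreSQL, and the login daemons can be sketched roughly as follows. This is a minimal POSIX illustration, not how any of those servers is actually written: handle_request and serve_with_fork are hypothetical names, and the "request" is just an integer id rather than a socket.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for real request handling: 0 means success. */
int handle_request(int id)
{
    return id >= 0 ? 0 : 1;
}

/* Fork one child per request so the kernel can schedule them on any
 * available CPU; return how many children exited successfully. */
int serve_with_fork(int nrequests)
{
    int started = 0;
    int ok = 0;

    for (int i = 0; i < nrequests; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* could not create more children */
        if (pid == 0)
            _exit(handle_request(i));   /* child: do the work and exit */
        started++;                      /* parent: keep accepting work */
    }

    /* Reap every child that was started. */
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

Because each child is an independent process, a machine with two processors (or two cores on one die) can run two of them at the same moment without the server code knowing anything about SMP.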
I *thought* the cache density looked a bit high for ordinary SRAM - the article mentions something they're calling "single-transistor SRAM".
Does anyone know how on earth they're managing this? Or is this just some low-leakage variant of DRAM with added marketing spin?
...a Furbeowulf cluster of these things!
TO BUY A NEW CAR WOULD MAKE YOU SEXUALLY ATTRACTIVE.
In news today, a small chunk of Austin TX vaporized when an engineer tripped over a Thermaltake vortex containment field, causing an experimental single-chip SMP AMD processor to go critical in its 1024 pin socket...
AIUI, there are two competing methods of scaling CPUs now - Simultaneous Multi-threading (SMT), and Chip-level Multi Processing (CMP). HP is going CMP because SMT is too difficult in terms of writing the compilers. Both Compaq (with the Alpha CPU) and IBM (PowerX) are going SMT. In fact, the biggest thing Intel got out of its purchase of Alpha technology, other than the engineers themselves, is the Alpha SMT work.
well yes HP PA-RISC is nice but really it's catch up
MIPS has had a 1GHz dual core on the same die for a while
and it's 64-bit
check
http://www.electronicstimes.com/story/OEG20010612S0002
or
http://www.pmc-sierra.com/products/details/rm9000x2/index.asp
oh yeah did I mention that PA-RISC is a MIPS descendant
but shhh they made so many changes they fscked the pipeline (they might have got it working again but I don't know any more)
may the SPECINT and SPECFP fight it out
regards
john jones
p.s. I wonder what the HP layout guys think of Intel chips (-;
the average PC running Windows 9x/ME/XP Consumer won't be able to multithread anyway
Not to defend MS, or anything, but XP Consumer is still based on the NT core. It'll multithread.
Fascism starts when the efficiency of the government becomes more important than the rights of the people.
With an agenda based on scale, you don't get there by introducing a new CPU in a dead line. HP's SuperDome line is getting creamed by Sun and IBM - HP cannot afford to go back to the front lines with another enterprise offering unless SuperDome pans out a hell of a lot more than it is currently.
HP has always had impressive technology but still loses market share. HP-UX has dwindling market share and software support. The merger with Compaq will derail any plans for further proprietary architectures.
If you want to look at the gee-whiz value here, fine, but don't expect to see this in a product.
HP is going CMP because SMT is too difficult in terms of writing the compilers.
Actually, I think they're doing it because it means they don't have to design a new processor core.
As far as each thread being executed in an SMT chip is concerned, they're running on a single-thread processor. The same scheduling optimizations that benefit code in a single-thread system will benefit the code running SMT with other threads. SMT actually makes this job a bit easier, by reducing the effective latency of instructions (if neither thread's stalled, each thread will execute every other clock, making a 10-cycle-latency instruction look like a 5-cycle-latency instruction, which in turn makes each thread less _likely_ to stall; nice feedback loop here).
The only extra complexity would be in the operating system's scheduling and context switching routines, and that wouldn't be much more complicated than on a multiprocessor system.
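The "every other clock" arithmetic above can be checked with a small sketch. It assumes a deliberately simplified model that is not taken from any real SMT design: threads take strict turns issuing one instruction per clock and nothing ever stalls. The function name apparent_latency is made up for illustration.

```c
/* Number of a thread's own issue slots that pass between issuing an
 * instruction and its result being ready, when `threads` threads take
 * strict turns issuing one instruction per clock.
 * Returns -1 for invalid arguments. */
int apparent_latency(int latency_cycles, int threads)
{
    if (latency_cycles < 1 || threads < 1)
        return -1;

    int slots = 0;
    /* The thread issues at cycle 0; its next slots come every
     * `threads` cycles after that. */
    for (int cycle = threads; ; cycle += threads) {
        slots++;
        if (cycle >= latency_cycles)
            return slots;
    }
}
```

With two threads, a 10-cycle instruction costs the issuing thread 5 of its own slots, which is the 10-looks-like-5 effect described above; with one thread it costs the full 10.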