HP Shows Off PA-8800 SMP-On-A-Chip CPU Plans
Eric^2 writes: "At last week's Microprocessor Forum, HP's David J. C. Johnson unveiled the details of HP's latest RISC processor, destined to redefine performance in server-class processors. Following a relatively simple strategy, the PA-8800 processor combines two PA-8700 cores on a single chip to enable symmetric multiprocessing (SMP) on a single processor. Aside from bumping the core speed up to an initial 1 GHz, enhancements include the addition of a combined 35 MB of L1+L2 cache. The article contains the full text. AMD, please steal an idea..."
These companies tend to patent anything that will give them a competitive edge in the marketplace. "Stealing an idea" would probably get them into some legal hot water, just like stealing a TV, or your car.
The IBM p690 server uses POWER4 processors. Each chip has 2 POWER cores with high-speed interconnects. Even better is that each chip is connected to 3 other chips to make up 8-CPU packs.
Wow... And I thought the 8MB L2 cache on UltraSPARC IIIs was a lot, not to mention the 16MB on some IBMs. Now we're talking about 3MB just in L1 with 32MB L2 cache. This beasty should have some impressive benchmark scores (yeah, I know, benchmarks aren't everything...)
...a 1 GHz processor may not sound like much, even in this dual-core configuration, but keep in mind that this is a RISC processor. None of that Super-mega-ultra-long-50-bazillion-stage pipeline crap that Intel uses to pump up their MHz rating. The article kind of sells this point a little bit short. The RISC architecture allows this processor to do roughly twice as much work in the same amount of time - or, to put it in a more concrete scenario: imagine a pair of 2GHz Pentium 4s running in an SMP configuration.
Now that's FAST.
Did that say 35MB of L1 + L2 cache? I may be rusty, but I think I remember reading in my Processor Design for Dummies book that increasing cache size actually can slow down processor performance after a certain amount. Could someone please clarify this?
today is spelling optional day.
Why hasn't someone else done something like this? I would pay whatever it cost to get even an 8MB L1 & L2 Cache. Anyone want to make me one?
Um, this is my sig.
The most interesting parallel architecture I heard about at the MPF was Siroyan's OneDSP architecture. This is a clustered VLIW machine that can execute up to 64 instructions each cycle! See the EE Times article and their MPF paper.
The official HP presentation on the PA-8800 is available as a PDF from http://www.cpus.hp.com/technical_references/mpf_2001.pdf.
Earlier steps in the multi-CPU direction included the 8-way DEC Alpha (killed in the merger with HP?) and a little National Semiconductor product for embedded systems with two very modest CPUs on a chip.
Doesn't Chuck Moore's 25x already do SMP-like things, at a few billion instructions per second? Last time I checked he was using a 20-word instruction set on a stack-based computer, which IMO counts as RISC.
This is hardly new, but HP's version probably uses some fancy new lithography, and wins when it comes to clock speed.
"Look at me, I invented the stove!" -- Ben Franklin
PA-8800 lets you create two opposite predicates in one instruction, for example the predicate a=b.
This seems to indicate that there are no separate "do this if predicate is true" and "do this if predicate is false" instructions, so for opposite predication you would have to specify two different predicates.
The processor cannot know that these two predicates are related, so this would give you quite a problem.
As has been publicly disclosed, in general in PA-8800, an instruction reading any resource (such as a predicate) must be in a later instruction group (cycle) than the instruction writing that resource. As a special case, branches are allowed to use a predicate written by another instruction in the same instruction group (as shown in the IDF slides).
So, the straightforward (but slow) PA-8800 schedule for the earlier example:
if (a < 0)
    b += a;
else
    b -= a;
c += b;
d += b;
would be:
cmp.lt pLT, pNLT = a, 0 // pLT & pNLT are 2 complementary preds
;;
(pLT) add b = b, a // add to b [then]
(pNLT) sub b = b, a // or sub from b [else]
;;
add c = c, b // uses of b
add d = d, b
;;
which takes 5 instructions in 3 cycles. (Note: In PA-8800 assembly, ";;" indicates the end of an instruction group, "=" separates the target operand(s) from the source(s), "//" begins a comment, and (pred) specifies the controlling predicate.)
An alternate (faster) schedule in PA-8800 is as follows:
sub bTmp = b, a // speculatively sub from b (into temp)
add b = b, a // and add to b
cmp.lt pLT, pNLT = a, 0
;;
(pLT) add c = c, b // uses of b [then]
(pLT) add d = d, b
(pNLT) add c = c, bTmp // uses of b (temp) [else]
(pNLT) add d = d, bTmp
(pNLT) mov b = bTmp // move bTmp to b [else]
;;
This takes 8 instructions in 2 cycles and one extra register. The final move of bTmp to b can be eliminated if b isn't live out at that point.
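For anyone who wants to check that both schedules above really compute the same thing as the original if/else, here is a rough C model. It is only a sketch: the `regs` struct and the three function names are made up for illustration, plain C `if` statements stand in for predicated instructions, and the cycle comments simply follow the 3-cycle and 2-cycle counts given above.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state: the three variables from the example. */
typedef struct { int b, c, d; } regs;

/* Reference: the source-level if/else. */
regs run_branchy(int a, regs r) {
    if (a < 0)
        r.b += a;
    else
        r.b -= a;
    r.c += r.b;
    r.d += r.b;
    return r;
}

/* Model of the straightforward schedule: one compare produces two
   complementary predicates, and each guarded operation uses one of them. */
regs run_predicated(int a, regs r) {
    bool pLT = (a < 0), pNLT = !pLT;   /* cycle 1: compare */
    assert(pLT != pNLT);               /* exactly one predicate is true */
    if (pLT)  r.b = r.b + a;           /* cycle 2: guarded add */
    if (pNLT) r.b = r.b - a;           /* cycle 2: guarded sub */
    r.c += r.b;                        /* cycle 3: uses of b */
    r.d += r.b;
    return r;
}

/* Model of the faster schedule: compute both outcomes up front,
   then let the predicates pick which one the later uses see. */
regs run_speculative(int a, regs r) {
    int bTmp = r.b - a;                /* cycle 1: speculative sub into temp */
    r.b = r.b + a;                     /* cycle 1: unconditional add */
    bool pLT = (a < 0), pNLT = !pLT;   /* cycle 1: compare */
    if (pLT)  r.c += r.b;              /* cycle 2: uses of b [then] */
    if (pLT)  r.d += r.b;
    if (pNLT) r.c += bTmp;             /* cycle 2: uses of temp [else] */
    if (pNLT) r.d += bTmp;
    if (pNLT) r.b = bTmp;              /* cycle 2: move temp to b [else] */
    return r;
}
```

Feeding the same starting values through all three functions, once with a negative `a` and once with a non-negative one, should give identical `b`, `c` and `d` each time.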
Following a relatively simple strategy, the PA-8800 processor combines two PA-8700 cores on a single chip to enable symmetric multiprocessing (SMP) on a single processor.
It doesn't enable SMP "on a single processor". It provides two processors on a single die. There is a distinction.
AMD, please steal an idea...
The big rumor regarding the third version of Hammer is that it'll be a dual-CPU module. Any guesses as to Hammer's clock speed on release?
299,792,458 m/s...not just a good idea, it's the law!
Galileo: "The Earth revolves around the Sun!"
Score: -1 100% Flamebait
...is that you actually can go out and buy a new mainframe using Power4. Nothing wrong with looking ahead, but if you remember, AMD said that the Athlon should have been made in an "Athlon Ultra" version sporting 8MB of L2 cache. I still stick to the motto: "I'll believe it when I can buy it."
Thomas S. Iversen
That seems practical enough to me.
You know, when AMD first brought out the Athlon, it was supposed to be compatible with Alpha 21264 boards too.
AMD even made a couple of engineering samples in Slot B packages for testing, but that's as far as it went.
If someone could hack a Slot A/Slot B adaptor, then they could hypothetically do the same thing. They might have to hack a BIOS update too, though.
I thought HP had committed itself to ditching the PA-RISC and moving to Itanic, err, Itanium.
If you had taken even thirty seconds to read the blurb, you would see that they have indeed put two cores on one chip, which gives you two processors on a chip. Remember, "processor" != "chip".
It's the same idea as VIA combining the north and south bridges on some of their motherboard chipsets. They take the two cores and put them on a single die, with a bus (still on the same die) between them.
The idea really isn't revolutionary. Ever since microprocessors were invented, the trend has been to pack more and more onto a single chip, as it reduces cost and design complexity while increasing compatibility and (most importantly) bandwidth. While your fastest P4 front-side bus chugs along at 400 MHz, buses that are kept on the die can run at full core frequency, even in the gigahertz range. Plus, you can run a lot more of them, and since the distances covered are shorter, it's easier to avoid external RF interference. And in multi-processing computers, the connectivity between cores is vitally important.
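To put rough numbers on the bus-speed point, peak bandwidth is just bus width times transfer rate. The helper below is a back-of-the-envelope sketch, not a description of any real interconnect: the function name is made up, and the 64-bit width used in the example is an assumption chosen only so the two rates can be compared like for like.

```c
#include <assert.h>
#include <stdint.h>

/* Peak transfer rate in bytes per second for a bus that moves
   width_bits bits on each of transfers_per_sec transfers.
   Hypothetical helper; ignores protocol overhead and contention. */
uint64_t peak_bytes_per_sec(unsigned width_bits, uint64_t transfers_per_sec) {
    assert(width_bits % 8 == 0);   /* whole bytes only, to keep this simple */
    return (uint64_t)(width_bits / 8) * transfers_per_sec;
}
```

With an assumed 64-bit path, `peak_bytes_per_sec(64, 400000000ULL)` works out to 3.2 GB/s for a 400 MHz off-chip bus, while the same width clocked at a 1 GHz core frequency gives 8 GB/s, before counting the option of running several such buses side by side on the die.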
Look at a lot of motherboard chipsets these days. In one or two chips, they'll have circuitry for video, audio, modem, network, IDE, floppy, serial, USB, PCI, and memory controllers, to name just a few. One of the long-term goals that some companies have been talking about is "SOC", or "System On A Chip", where a single chip will have everything you need for a computer. At the point where the CPU has all of the other controllers inside of it, not only could performance increase dramatically, you could potentially use a motherboard for any CPU that you wanted, as all the motherboard would do is provide power to the CPU and traces from the CPU to the connectors for external components.
steve
Oh, you're not stuck, you're just unable to let go of the onion rings.
Think of it as two cores....
Reading through the article, this design seems to share a lot in common with Sun's MAJC architecture. Both allow for multiple cores on a single chip. Anyone else notice the similarities?
I guess the biggest difference would be that the HP chip is actually going to be built, while the MAJC chip seems to still just be a design.
It is interesting that a number of designs lately seem to be looking to the integration of multiple CPU cores on a single chip to increase performance in server applications.
zor_prime
"We all do no end of feeling, and we mistake it for thinking." -Mark Twain
EEtimes Story
Everyone in the high-performance CPU market (except itanic) is doing either this or multiple concurrent thread contexts to speed overall system computational throughput.
It doesn't seem too practical to me. Most apps don't benefit greatly from SMP anyway.
They don't? What kind of server do you run? Most all pieces of production-class server software that I know of benefit from multiple processes. Look at Apache, forking off five, ten, or even more processes to handle requests. MySQL, I believe, uses threads. PostgreSQL forks off a new backend for each connection. Shoot, even your telnet, ftp, ssh, and mail daemons will fork off for each connection, allowing you to take advantage of more than one CPU.
If you're sitting at home working on a spreadsheet, you're right, SMP isn't for you - and this machine isn't targeted at you. When you're running a server that may have tens, hundreds, or thousands of SIMULTANEOUS processes fighting for CPU time, every processor counts.
And, to make things even better, even if you're only running a single, non-threaded process, having two processors still makes the machine much more "responsive", as the second CPU can handle kernel code for file IO, network code, interrupt handling, writing to logs, and a lot of other tasks. Ever seen how much CPU time even syslog can chew up?
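The fork-per-connection pattern described above looks roughly like this in C on a POSIX system. This is a minimal sketch and not how Apache or any of the daemons named actually manage their workers: `handle_request` and `serve_with_forks` are made-up names, and the "request" is just a small CPU-burning loop. The point is only that each child is an independent process the kernel is free to place on whichever CPU is idle.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for real request handling (hypothetical): burn a little CPU. */
static int handle_request(int id) {
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 100000UL; i++)
        sum += i ^ (unsigned long)id;
    return 0;   /* 0 means the request was handled */
}

/* Fork one child per request, then reap them all.
   Returns how many children exited cleanly. */
int serve_with_forks(int nrequests) {
    assert(nrequests >= 0);
    int launched = 0, completed = 0;

    for (int i = 0; i < nrequests; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* could not fork; stop launching */
        if (pid == 0)
            _exit(handle_request(i));   /* child: do the work, then exit */
        launched++;                     /* parent: go on to the next request */
    }

    for (int i = 0; i < launched; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            completed++;
    }
    return completed;
}
```

Nothing in the code mentions CPUs at all, which is why this style of server picks up a second processor without being rewritten.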
steve
Oh, you're not stuck, you're just unable to let go of the onion rings.
When you consider that the PA-RISC team has been transferred to that "evil" company Intel.
Conformity is the jailer of freedom and enemy of growth. -JFK
I *thought* the cache density looked a bit high for ordinary SRAM - the article mentions something they're calling "single-transistor SRAM".
Does anyone know how on earth they're managing this? Or is this just some low-leakage variant of DRAM with added marketing spin?
...a Furbeowulf cluster of these things!
TO BUY A NEW CAR WOULD MAKE YOU SEXUALLY ATTRACTIVE.
In news today, a small chunk of Austin TX vaporized when an engineer tripped over a Thermaltake vortex containment field, causing an experimental single-chip SMP AMD processor to go critical in its 1024 pin socket...
Less crack. Go study modern OSs and stay away from SunOS and old Slackware "SMP" kernels.
What part of "gross simplification" did you not understand?
At any rate, read for yourself:
http://developer.intel.com/design/pentium/manuals/24319001.pdf
I am very small, utmostly microscopic.
AIUI, there are two competing methods of scaling CPUs now - Simultaneous Multithreading (SMT), and Chip-level Multi Processing (CMP). HP is going CMP because SMT is too difficult in terms of writing the compilers. Both Compaq (with the Alpha CPU) and IBM (PowerX) are going SMT. In fact, the biggest thing Intel got out of its purchase of Alpha technology, other than the engineers themselves, is the Alpha SMT work.
Sorry, while it may be true for Pentium series, it is not true for SMP in general.
1) It is actually possible to get better than linear improvement under certain conditions (like if something is already in a shared cache because it was fetched by the other cpu).
2) It is possible to have each cpu schedule itself based on contents of ram.
Yes, there is overhead in having two CPUs, but it varies a lot depending on OS and workload.
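The usual way to put a number on "better than linear" is to compare measured run times. The two helpers below are a small sketch with made-up names; they take timings you have measured yourself and make no attempt to predict anything.

```c
#include <assert.h>

/* Speedup of a job that took t_one seconds on one CPU
   and t_many seconds on several. */
double speedup(double t_one, double t_many) {
    assert(t_one > 0.0 && t_many > 0.0);
    return t_one / t_many;
}

/* Efficiency is speedup per CPU: 1.0 is linear scaling,
   below 1.0 means overhead is eating the gain,
   above 1.0 is the super-linear case (e.g. a cache effect). */
double efficiency(double t_one, double t_many, int ncpus) {
    assert(ncpus > 0);
    return speedup(t_one, t_many) / ncpus;
}
```

As an invented example, a job that drops from 10 seconds to 4 seconds on two CPUs has a speedup of 2.5 and an efficiency of 1.25, while one that only drops to 8 seconds has an efficiency of 0.625.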
Sounds like more kernel work. I won't be happy until I can mount file systems in my cache. Think about it. My 286 only had a 40 MB hard drive. Hello, solid state!
WARNING: there is a trojan on your
Well, yes, HP PA-RISC is nice, but really it's catch-up.
MIPS has had a 1GHz dual core on the same die for a while,
and it's 64-bit.
Check
http://www.electronicstimes.com/story/OEG20010612S0002
or
http://www.pmc-sierra.com/products/details/rm9000x2/index.asp
Oh yeah, did I mention that PA-RISC is a MIPS descendant?
But shhh, they made so many changes they fscked the pipeline (they might have got it working again, but I don't know any more).
may the SPECINT and SPECFP fight it out
regards
john jones
p.s. I wonder what the HP layout guys think of Intel chips (-;
As has been pointed out above, this is just HP playing catchup to IBM. IBM has taken a leap ahead of their competitors and now they have to play catchup.
HP's announcement is nothing compared to what IBM has in development.
HP workstations certainly seem to be very solid and nifty, and they have a lot of potential as Linux boxes. Assembly programmers will appreciate all of the registers that are available.
Clickety Click
With an agenda based on scale, you don't get there by introducing a new CPU in a dead line. HP's SuperDome line is getting creamed by Sun and IBM - HP cannot afford to go back to the front lines with another enterprise offering unless SuperDome pans out a hell of a lot more than it is currently.
HP has always had impressive technology but still loses market share. HP-UX has dwindling market share and software support. The merger with Compaq will derail any plans for further proprietary architectures.
If you want to look at the gee-whiz value here, fine, but don't expect to see this in a product.
But some programs will benefit. Have you ever run a heavily used web server? They fork off lots of processes. It will benefit greatly from SMP.
Most processor intensive programs have become multithreaded, and the rest can be if SMP becomes popular.
I see too many people asking what the use of something is if all their existing stuff wouldn't benefit from it. This is often because their stuff hasn't had any reason to adapt to this cool new thing that people are going to reject because their stuff doesn't benefit from it now. Take the Plan 9 OS for example. It does the Right Thing for a great networked internal structure, but the GUI stinks. It is not popular, partly because people don't like the UI. But if people used it, the GUI would be improved and we would have all its cool benefits.
HP is going CMP because SMT is too difficult in terms of writing the compilers.
Actually, I think they're doing it because it means they don't have to design a new processor core.
As far as each thread being executed in an SMT chip is concerned, they're running on a single-thread processor. The same scheduling optimizations that benefit code in a single-thread system will benefit the code running SMT with other threads. SMT actually makes this job a bit easier, by reducing the effective latency of instructions (if neither thread's stalled, each thread will execute every other clock, making a 10-cycle-latency instruction look like a 5-cycle-latency instruction, which in turn makes each thread less _likely_ to stall; nice feedback loop here).
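The "10 cycles looks like 5" arithmetic can be written down directly. This is a toy model that assumes the threads take strict turns, one issue slot per clock, which is the idealized case described above and not how a real SMT core arbitrates; the function name is made up.

```c
#include <assert.h>

/* Latency as seen by one thread, counted in that thread's own issue
   slots, when nthreads threads take strict turns issuing. */
unsigned apparent_latency(unsigned latency_cycles, unsigned nthreads) {
    assert(nthreads > 0);
    return (latency_cycles + nthreads - 1) / nthreads;   /* round up */
}
```

So `apparent_latency(10, 2)` is 5, matching the example, and with a single thread the function just returns the real latency.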
The only extra complexity would be in the operating system's scheduling and context switching routines, and that wouldn't be much more complicated than on a multiprocessor system.