HP Shows Off PA-8800 SMP-On-A-Chip CPU Plans
Eric^2 writes: "At last week's Microprocessor Forum, HP's David J. C. Johnson unveiled the details of HP's latest RISC processor, destined to redefine performance in server-class processors. Following a relatively simple strategy, the PA-8800 processor combines two PA-8700 cores on a single chip to enable symmetric multiprocessing (SMP) on a single processor. Aside from bumping the core speed up to an initial 1 GHz, enhancements include the addition of a combined 35 MB of L1+L2 cache. The article contains the full text. AMD, please steal an idea..."
These companies tend to patent anything that will give them a competitive edge in the marketplace. "Stealing an idea" would probably get them into some legal hot water, just like stealing a TV, or your car.
The IBM p690 server uses POWER4 processors. Each chip has 2 POWER cores with high-speed interconnects. Even better is that each chip is connected to 3 other chips to make up 8-CPU packs.
then i'd be happy.
Wow... And I thought the 8MB L2 cache on UltraSPARC IIIs was a lot, not to mention the 16MB on some IBMs. Now we're talking about 3MB just in L1 with 32MB L2 cache. This beasty should have some impressive benchmark scores (yeah, I know, benchmarks aren't everything...)
wow, that sounds a lot like IBM's release re: the Power4... except not as interesting
...a 1 GHz processor may not sound like much, even in this dual-core configuration, but keep in mind that this is a RISC processor. None of that Super-mega-ultra-long-50-bazillion-stage pipeline crap that Intel uses to pump up their MHz rating. The article kind of sells this point a little bit short. The RISC architecture allows this processor to do roughly twice as much work in the same amount of time - or, to put it in a more concrete scenario: imagine a pair of 2GHz Pentium 4's running in SMP configuration.
Now that's FAST.
Intel jammed two 486 cores on a chip and called it a "Pentium."
Gross simplification is a viable debating tactic, BTW.
I am very small, utmostly microscopic.
It doesn't seem too practical to me. Most apps don't benefit greatly from SMP anyway. Add to that the potential heat problems caused by two cores on one chip...Why not just go with a more traditional SMP approach? At least you won't have to worry too much about heat then.
Job? I don't have time to get a job! Who will sit around and bitch about being broke and unemployed then?
Did that say 35MB of L1 + L2 cache? I may be rusty, but I think I remember reading in my Processor Design for Dummies book that increasing cache size actually can slow down processor performance after a certain amount. Could someone please clarify this?
today is spelling optional day.
Why hasn't someone else done something like this? I would pay whatever it cost to get even an 8MB L1 & L2 Cache. Anyone want to make me one?
Um, this is my sig.
Wow, modded down in less than 0.3 seconds! Light speed moderation, I have a patent, and you owe me.
Just imagine a Beowulf cluBLAM!!! Thud.
The most interesting parallel architecture I heard about at the MPF was Siroyan's OneDSP architecture. This is a clustered VLIW machine that can execute up to 64 instructions each cycle! See the EE Times article and their MPF paper.
" One guy wrote that we should take all these Legos and build giant robots with which to attack Afghanastan. " -- Rob Malda, Founder of Slashdot, a "News for Nerds" website, in a NPR report on post WTC gen-X, 10/22/2001
I, for one, would like to take a moment to thank Rob for setting us "Nerds" back where we belong. Way to make us look like a bunch of childish tech-heads with no conception of the real world! As a troll, I think it's high time that you slashdotters got slapped down for the idiotic geeks that you are! (That was sarcasm, you nincompoop!)
The official HP presentation on the PA-8800 is available as a PDF from http://www.cpus.hp.com/technical_references/mpf_2001.pdf.
Y.
It has the clock frequency of a 300bps modem's dsp. That's still pretty darn cool! *shrug*
Earlier steps in the multi-CPU direction included the 8-way DEC Alpha (killed in the merger with HP?) and a little National Semiconductor product for embedded systems with two very modest CPUs on a chip.
Doesn't Chuck Moore's 25x already do SMP-like things, at a few billion instructions per second? Last time I checked he was using a 20-word instruction set on a stack-based computer, which IMO counts as RISC.
This is hardly new, but HP's version probably uses some fancy new lithography, and wins when it comes to clock speed.
"Look at me, I invented the stove!" -- Ben Franklin
1GHz? Man my Pentium 4 2GHz beats all these supposedly "fast" chips. It's HALF the speed, for christ's sake. That is sooooooooo quarter 3 2000.
Acting stupid isn't much fun when there's someone around who knows better
PA-8800 lets you create two opposite predicates in one instruction, for example the predicate a=b.
This seems to indicate that there are no separate "do this if predicate is true" and "do this if predicate is false" instructions, so for opposite predication you would have to specify two different predicates.
The processor cannot know that these two predicates are related, so this would give you quite a problem.
As has been publicly disclosed, in general in PA-8800, an instruction reading any resource (such as a predicate) must be in a later instruction group (cycle) than the instruction writing that resource. As a special case, branches are allowed to use a predicate written by another instruction in the same instruction group (as shown in the IDF slides).
So, the straightforward (but slow) PA-8800 schedule for the earlier example:
if (a < 0)
    b += a;
else
    b -= a;
c += b;
d += b;
would be:
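cmp.lt pLT, pNLT = a, 0 ;;    // pLT & pNLT are 2 complementary preds
(pLT) add b = b, a            // add to b [then]
(pNLT) sub b = b, a ;;        // or sub from b [else]
add c = c, b                  // uses of b
add d = d, b ;;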
which takes 5 instructions in 3 cycles. (Note: In PA-8800 assembly, ";;" indicates the end of an instruction group, "=" separates the target operand(s) from the source(s), "//" begins a comment, and (pred) specifies the controlling predicate.)
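The grouping rule described above can be checked mechanically. What follows is only a rough Python sketch, not anything from HP's tools: the `Ins` record, the `EXCLUSIVE` table and the `violations` function are hypothetical names made up for illustration. It assumes, as the three-cycle schedule itself implies, that two instructions guarded by complementary predicates never conflict with each other inside one group.

```python
from typing import NamedTuple


class Ins(NamedTuple):
    op: str
    writes: tuple
    reads: tuple
    pred: str = ""        # controlling predicate, "" if unpredicated
    branch: bool = False


# Assumption: pLT and pNLT come from one compare, so exactly one is true.
EXCLUSIVE = {frozenset(("pLT", "pNLT"))}


def exclusive(p, q):
    return frozenset((p, q)) in EXCLUSIVE


def violations(groups):
    """List (group index, reader op, resource) for same-group read-after-write."""
    found = []
    for gi, group in enumerate(groups):
        seen = []  # instructions placed earlier in this group
        for ins in group:
            sources = set(ins.reads) | ({ins.pred} if ins.pred else set())
            for earlier in seen:
                if exclusive(earlier.pred, ins.pred):
                    continue  # only one of the two can actually execute
                for res in sorted(sources & set(earlier.writes)):
                    # Special case from the text: a branch may use a
                    # predicate written in its own group.
                    if ins.branch and res == ins.pred:
                        continue
                    found.append((gi, ins.op, res))
            seen.append(ins)
    return found


cmp_lt = Ins("cmp.lt", ("pLT", "pNLT"), ("a",))
add_b = Ins("add b", ("b",), ("b", "a"), "pLT")
sub_b = Ins("sub b", ("b",), ("b", "a"), "pNLT")
add_c = Ins("add c", ("c",), ("c", "b"))
add_d = Ins("add d", ("d",), ("d", "b"))

# The 3-cycle grouping is clean; squeezing the compare into the same
# group as its consumers is flagged.
three_groups = [[cmp_lt], [add_b, sub_b], [add_c, add_d]]
two_groups = [[cmp_lt, add_b, sub_b], [add_c, add_d]]
print(violations(three_groups))
print(violations(two_groups))
```

The second call reports the two predicated instructions reading pLT and pNLT in the group that writes them, which is why the straightforward schedule cannot be shorter than three groups.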
An alternate (faster) schedule in PA-8800 is as follows:
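sub bTmp = b, a               // speculatively sub from b (into temp)
add b = b, a                  // and add to b
cmp.lt pLT, pNLT = a, 0 ;;
(pLT) add c = c, b            // uses of b [then]
(pLT) add d = d, b
(pNLT) add c = c, bTmp        // uses of b (temp) [else]
(pNLT) add d = d, bTmp
(pNLT) mov b = bTmp ;;        // move bTmp to b [else]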
This takes 8 instructions in 2 cycles and one extra register. The final move of bTmp to b can be eliminated if b isn't live out at that point.
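For anyone who wants to convince themselves that the two schedules really compute the same thing as the if/else source, here is a small Python model. It is a sketch only: the function names are invented for illustration, each predicated instruction is modelled as an `if` on a boolean, and instructions in one group are assumed to read the values from before that group.

```python
def reference(a, b, c, d):
    # Direct transcription of the if/else source.
    if a < 0:
        b += a
    else:
        b -= a
    c += b
    d += b
    return b, c, d


def straightforward(a, b, c, d):
    # Group 1: one compare writes two complementary predicates.
    p_lt = a < 0
    p_nlt = not p_lt
    # Group 2: exactly one of the predicated updates takes effect.
    if p_lt:
        b = b + a
    if p_nlt:
        b = b - a
    # Group 3: uses of b.
    c = c + b
    d = d + b
    return b, c, d


def speculative(a, b, c, d):
    # Group 1: all three instructions read the old value of b.
    b_tmp = b - a          # speculative "else" result
    b_then = b + a         # "then" result, written to b unconditionally
    p_lt = a < 0
    p_nlt = not p_lt
    b = b_then
    # Group 2: each use picks the copy that matches its predicate.
    if p_lt:
        c = c + b
        d = d + b
    if p_nlt:
        c = c + b_tmp
        d = d + b_tmp
        b = b_tmp          # final move, droppable if b is dead afterwards
    return b, c, d


# Compare all three over a small range of inputs.
for a in range(-3, 4):
    for b in range(-3, 4):
        args = (a, b, 10, 20)
        assert straightforward(*args) == reference(*args)
        assert speculative(*args) == reference(*args)
```

The trade is visible in the model: the speculative version does both the add and the subtract every time and needs the extra `b_tmp`, in exchange for one fewer dependent step.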
Following a relatively simple strategy, the PA-8800 processor combines two PA-8700 cores on a single chip to enable symmetric multiprocessing (SMP) on a single processor.
It doesn't enable SMP "on a single processor". It provides two processors on a single die. There is a distinction.
AMD, please steal an idea...
The big rumor regarding the third version of Hammer is that it'll be a dual-CPU module. Any guesses as to Hammer's clock speed on release?
299,792,458 m/s...not just a good idea, it's the law!
Galileo: "The Earth revolves around the Sun!"
Score: -1 100% Flamebait
...is that you actually can go out and buy a new mainframe using Power4. Nothing wrong with looking ahead, but if you remember, AMD said that the Athlon should have been made in an "Athlon Ultra" version sporting 8MB L2 cache. .... I still stick to the motto: "I'll believe it when I can buy it"
Thomas S. Iversen
This is getting ridiculous. Symmetric Multiprocessing on a single chip? That's impossible, unless you're just screwing around with semantics. I mean, think about it. You need TWO chips, at least, in order to engage in SMP, and anyone who says otherwise is putting out meaningless hype. SMP on one chip, or just vaporware?
Slashdot: Open Source, Closed Minds.
IBM unveiled its SMP-on-a-chip solution, the Power4, almost 2 weeks ago. 64-bit PowerPC. And only 2 OSes run on it.
One of them is Linux.
With no such niceties as virtual memory, large address spaces, fast additions etc etc there is not a lot of software which would run well on them.
That seems practical enough to me.
You know, when AMD first brought out the Athlon, it was supposed to be compatible with Alpha 21264 boards too.
AMD even made a couple of engineering samples in Slot B packages for testing, but that's as far as it went.
If someone could hack a Slot A/Slot B adaptor then they could hypothetically do the same thing. They might have to hack a BIOS update too, though.
compared to say a 2.2GHz P4 or an Athlon XP 1800+. Inquiring minds want to know.
http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/p690_config.html#arch
You asked for it; here's some yummy troll food fer ya:
For the last time, clock speed doesn't compare between architectures. This is a RISC processor with a short pipeline; the Pentium 4 you drool about (but don't really have) is a CISC with an extra-long, needlessly clock-boosting pipeline. Of course, you'd know this if you read the article.
If you read the specs, you'd see: "Speaking of performance, each PA-8700 RISC core delivers a SPEC performance of around 550 (for both Int and FP) at 750 MHz and the dual core PA-8800 running at 1 GHz will start out at a minimum of 900 / 1000 SPEC2000 int/fp scores, according to very conservative estimates."
A 2GHz P4 is in the 650 SPECINT/670 SPEC2000FP range, so basically each PA processor is about 10% faster.
Was it trolliscious? satisfied? oh, furry troll, why do I even bother feeding you?
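The SPEC numbers quoted above can be put on a per-clock footing with a few lines of Python. This is back-of-the-envelope only: `scaled_score` is a made-up helper, and it assumes the score scales linearly with clock speed, which is optimistic since memory does not speed up along with the core.

```python
def scaled_score(score, from_mhz, to_mhz):
    """Scale a benchmark score linearly with clock (an optimistic assumption)."""
    return score * to_mhz / from_mhz


# Figures as quoted in the thread.
pa8700_score, pa8700_mhz = 550, 750
p4_specint, p4_mhz = 650, 2000

pa_core_at_1ghz = scaled_score(pa8700_score, pa8700_mhz, 1000)
advantage = pa_core_at_1ghz / p4_specint - 1

print(f"PA core scaled to 1 GHz: {pa_core_at_1ghz:.0f}")
print(f"Advantage over the 2 GHz P4 on SPECint: {advantage:.0%}")
print(f"SPECint per MHz: PA {pa8700_score / pa8700_mhz:.2f}, P4 {p4_specint / p4_mhz:.2f}")
```

Under that assumption a single PA core at 1 GHz lands about 13% ahead of the quoted P4 integer score, in the same ballpark as the "about 10%" figure, while doing a bit more than twice the work per clock.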
I thought HP had committed itself to ditching the PA-RISC and moving to Itanic, err, Itanium.
Who says the OS needs to know there are two or 4 or 6 CPUs in a system? Threaded programming works best only on clusters. SMP scales poorly on Intel Pentium IVs because maybe that isn't a reason to use two Pentium IV CPUs... Each CPU provides more resource management. Someone should've written some supporting text on "the SMP myth" which includes why SMP is not a good and efficient solution to increase calculation performance on a given workstation. Pentium Pro CPUs provide bus mastering in SMP mode and the secondary CPU provides up to 30% of extra system performance. Of course, for every extra CPU in a Pentium Pro system, it adds 30% - (NumCPUs*5) performance due to scheduling in the software. It is always the primary CPU in a multi-CPU system that must schedule events for the secondary, tertiary, and quaternary CPU in software. On a Pentium Pro CPU, that requires somewhere around 10% of that primary CPU's processing power to schedule software for that second CPU; it's around 15%/20% for the 3rd, etc. That's why you see dual Pentium IV workstations not performing up to par with another workstation with only one Pentium IV. Its SMP doesn't scale well. The only value of multiple CPUs is for bus mastering and providing more system resource management. Intel abandoned bus mastering in SMP systems after the Pentium Pro. So, for the extra cost of using a second Pentium IV CPU, it isn't worth it. Just get a nice Pentium Pro server on eBay and you will get your money's worth for those extra CPUs, which provide bus mastering. Pentium Pro has always been a nice CPU. You can scramble an egg in 5 minutes, versus 15 for the Pentium IV.
But I'm sure you already Gnu that.
Reading through the article, this design seems to share a lot in common with Sun's MAJC architecture. Both allow for multiple cores on a single chip. Anyone else notice the similarities?
I guess the biggest difference would be that the HP chip is actually going to be built, while the MAJC chip seems to still just be a design.
It is interesting that a number of designs lately seem to be looking to the integration of multiple CPU cores on a single chip to increase performance in server applications.
zor_prime
"We all do no end of feeling, and we mistake it for thinking." -Mark Twain
Sorry, couldn't resist, there was no single grep of Beowulf.
EEtimes Story
Everyone in the high-performance CPU market (except itanic) is doing either this or multiple concurrent thread contexts to speed overall system computational throughput.
When you consider that the PA-RISC team has been transferred to that "evil" company Intel.
Conformity is the jailer of freedom and enemy of growth. -JFK
I *thought* the cache density looked a bit high for ordinary SRAM - the article mentions something they're calling "single-transistor SRAM".
Does anyone know how on earth they're managing this? Or is this just some low-leakage variant of DRAM with added marketing spin?
...a Furbeowulf cluster of these things!
TO BUY A NEW CAR WOULD MAKE YOU SEXUALLY ATTRACTIVE.
In news today, a small chunk of Austin TX vaporized when an engineer tripped over a Thermaltake vortex containment field, causing an experimental single-chip SMP AMD processor to go critical in its 1024 pin socket...
Less crack. Go study modern OSs and stay away from SunOS and old Slackware "SMP" kernels.
Imagine a Beowulf Cluster of THESE!!!
AIUI, there are two competing methods of scaling CPUs now - Simultaneous Multithreading (SMT), and Chip-level Multi Processing (CMP). HP is going CMP because SMT is too difficult in terms of writing the compilers. Both Compaq (with the Alpha CPU) and IBM (PowerX) are going SMT. In fact, the biggest thing Intel got out of its purchase of Alpha technology, other than the engineers themselves, is the Alpha SMT work.
I heard it will start at 3 GHz?
Sorry, while it may be true for Pentium series, it is not true for SMP in general.
1) It is actually possible to get better than linear improvement under certain conditions (like if something is already in a shared cache because it was fetched by the other cpu).
2) It is possible to have each cpu schedule itself based on contents of ram.
Yes, there is overhead of having two cpus, but it is very variable dependent on OS and workload.
Are you coming on to me?
It looks like artificial breathing attempts when resources don't allow for better chip designs anymore. 3dfx did it with Voodoo2, and it's such a cheap solution that I'm surprised HP even bothers with this show'n'tell. "look, we didn't have enough money to do 1 good and new solution so we slapped together 2 old ones!! all hail the new ultrafast processor!"
Sounds like more kernel work. I won't be happy until I can mount file systems in my cache. Think about it. My 286 only had a 40 MB hard drive. Hello, solid state!
WARNING: there is a trojan on your
well yes HP PA-RISC is nice but really it's catch-up
MIPS has had a 1GHz dual core on the same die for a while
and that it's 64-bit
check
http://www.electronicstimes.com/story/OEG20010612S0002
or
http://www.pmc-sierra.com/products/details/rm9000x2/index.asp
oh yeah did I mention that PA-RISC is a MIPS descendant
but shhh they made so many changes they fscked the pipeline (they might have got it working again but I don't know any more)
may the SPECINT and SPECFP fight it out
regards
john jones
p.s. I wonder what the HP layout guys think of Intel chips (-;
I'll bet you could fry eggs on it pretty well with that much silicon cranking out heat in one chip.
As has been pointed out above, this is just HP playing catchup to IBM. IBM has taken a leap ahead of their competitors and now they have to play catchup.
HP's announcement is nothing compared to what IBM has in development.
HP workstations certainly seem to be very solid and nifty and they have a lot of potential for linux boxes. Assembly programmers will appreciate all of the registers that are available.
Clickety Click
With an agenda based on scale, you don't get there by introducing a new CPU in a dead line. HP's SuperDome line is getting creamed by Sun and IBM - HP cannot afford to go back to the front lines with another enterprise offering unless SuperDome pans out a hell of a lot more than it is currently.
HP has always had impressive technology but still loses market share. HP-UX has dwindling market share and software support. The merger with Compaq will derail any plans for further proprietary architectures.
If you want to look at the gee-whiz value here, fine, but don't expect to see this in a product.
HP is going CMP because SMT is too difficult in terms of writing the compilers.
Actually, I think they're doing it because it means they don't have to design a new processor core.
As far as each thread being executed in an SMT chip is concerned, they're running on a single-thread processor. The same scheduling optimizations that benefit code in a single-thread system will benefit the code running SMT with other threads. SMT actually makes this job a bit easier, by reducing the effective latency of instructions (if neither thread's stalled, each thread will execute every other clock, making a 10-cycle-latency instruction look like a 5-cycle-latency instruction, which in turn makes each thread less _likely_ to stall; nice feedback loop here).
The only extra complexity would be in the operating system's scheduling and context switching routines, and that wouldn't be much more complicated than on a multiprocessor system.
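The "10 cycles looks like 5" arithmetic above generalises in the obvious way. A toy Python sketch, assuming strict round-robin issue among the active threads (real SMT cores issue far more flexibly, and `apparent_latency` is just an illustrative name):

```python
def apparent_latency(latency_cycles, active_threads):
    """Issue slots a thread waits for a result, assuming strict
    round-robin issue among the active threads."""
    # Ceiling division: a thread only gets every Nth clock.
    return -(-latency_cycles // active_threads)


for threads in (1, 2, 4):
    print(threads, "thread(s):", apparent_latency(10, threads), "slots")
```

With two threads the 10-cycle instruction costs the issuing thread 5 of its own slots, matching the figure above. When the other thread stalls, the first one gets every clock again and sees the full latency.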
Tricky. This time Spootnik copied and pasted from not one but two Usenet articles, neither of which has anything to do with PA-8800.
Not just article hijacking, but blatantly false article hijacking.