ArsTechnica Compares the P4 and G4e: Part II

Posted by CmdrTaco on Wednesday November 7, 2001 @02:29AM from the start-your-head-spinning dept.

Deffexor writes "It looks like Hannibal of ArsTechnica fame has put up Part 2 of his original comparison article between Intel's P4 and the Apple/Motorola G4e. In a nutshell, this second article covers the execution core, the AltiVec unit and SSE2, as well as a myriad of other interesting factoids. An interesting read, if not a little technically intense for those of us with less than a CE/EE degree. Have at it boys!"

Another view by Mik!tAAt · 2001-11-07 02:44 · Score: 5, Funny

Here's another comparison: Joy Of Tech (and the next 6 pages as well)

--
This is the place where you write something that will make you seem like a complete idiot.
technically intense.. by smaughster · 2001-11-07 02:45 · Score: 4, Funny

>An interesting read, if not a little technically intense for those of us with less than a CE/EE degree.

Tell me about it. I do have more than CE, two letters even, namely MCSE, and even I had to stop when they started throwing around the heavy stuff. I mean, A = A + B is supposed to make sense even if B isn't equal to zero.

--
I intend to live forever, so far so good.
ppc power by peachboy · 2001-11-07 02:52 · Score: 4, Insightful

i personally believe that flexibility of the assembly instructions as well as the number of instructions executed per cycle contribute greatly to the dominant speed (at any given MHz/GHz) of the ppc processor. compare any intel/amd processor to a ppc at the same clock speed, and the ppc will kick its x86 ass.

the high end ppc desktops are topping out around 900MHz, while the p4's are hitting 2GHz. there has to be another explanation besides the complaint that jobs is ignorantly sitting on his thumbs. i think he knows what he's doing.

note: i am not a mac zealot.. i don't even own a mac - only 4 x86 pc's (1 athlon, 2 p133, 1 p120). i simply can appreciate the speed of the ppc.

--
"I just want to thank my coach Eric a.k.a. Disco for shattering my reality..."
1. Re:ppc power by Christopher+Thomas · 2001-11-07 03:52 · Score: 5, Interesting
  
  the high end ppc desktops are topping out around 900MHz, while the p4's are hitting 2GHz. there has to be another explanation besides the complaint that jobs is ignorantly sitting on his thumbs.
  
  Two factors come into play here.
  
  The first is that, if I remember correctly, PPC and x86 chips use a different clocking scheme. This means that clock rates between them aren't even directly comparable (what a "clock" is depends on the clocking scheme).
  
  The second is that it's perfectly possible that the PPC architecture is limited to lower clock rates than the x86 architecture. Signal propagation through gates takes time. If one architecture expects signals to propagate through logic three gates deep per clock, and another architecture expects signals to propagate through logic five gates deep, then of *course* one will have a faster maximum clock rate than the other. They would hopefully still be doing the same amount of work per unit of real time.
  
  You should already be familiar with this from the Athlon/P4 spin war. A 0.18 micron Athlon core simply cannot be clocked as quickly as a 0.18 micron P4 core - no matter what you do. Does this make the Athlon automatically a poor performer? No, because it can do more per clock. Does this make the Athlon automatically kill the P4, because it "can do more per clock"? No, because the P4 can be clocked faster. Only real benchmarks will tell.
  
  A point against Apple is that Apple has been allergic to publishing SPECmarks for its processors for the past couple of years (the only PPC-ish benchmarks are IBM's benchmarks of the Power series of chips, which forked after the G3 IIRC). This removes a very consistent (if somewhat flawed) means of comparison.
Nice. by tcc · 2001-11-07 02:53 · Score: 5, Interesting

But with the G5 around the corner, I think THAT will be THE interesting comparison, especially since Intel plans on keeping the P4 for a while and ramping it up in speed. Adobe is saying the G5 is significantly faster than the P4, and if you go read the article, the same people do say that the P4 is faster than a G4 (except for AltiVec stuff). So if they say the G5 is faster than the P4, it probably will be :). It should be really nice to see something other than AMD that kills the P4 in raw performance.

--
--- Metamoderating abusive downgraders since my 300th post.
Maybe pointlessly detailed by Junks+Jerzey · 2001-11-07 02:53 · Score: 5, Interesting

Note: I have a B.S. in computer science, a solid understanding of hardware issues, and have been programming for 19 years.

When I read articles like this, there's so much detail that I find myself--even willingly--losing sight of the big picture. Sure, you could read a detailed write-up about Toyota's new engine, but those details don't really matter much unless you've just made a hobby of knowing about engines. Realistically, you'll have a hard time connecting those details to your driving experience. Heck, someone could put in a different engine, tell you that it's a Toyota, and you'd be saying things like "Oh, yes, this feels just like a Toyota, I can tell that the designers did blah and blah."

After the Pentium II generation of CPUs, things have gotten very, very muddled. Amazing features that are supposed to increase performance don't always do so. Sometimes they make things worse. Little compiler tweaks can make one program be twice as fast as another, given the same hardware. Chips with higher clock rates can be significantly slower than chips with 20% slower clocks. Certain applications run much faster than on previous chips, but there are others that show no increase.

It's all very chaotic and confusing, even for people in the know. I suspect that if you took a program that people claimed to need a P4 or Athlon for--something very performance sensitive--and set yourself the task of making it run faster on a PII than an Athlon, you could do it. But that doesn't matter, as everyone seems to be clamoring for newer chips.
Great Article! by Uttles · 2001-11-07 03:00 · Score: 5, Insightful

This article is extremely informative and gives you a good insight into how these processors are designed, as well as how they compare. I disagree with the poster though, you don't need a CE or EE degree to get the idea of what's going on. I'm a CE and I had classes on this sort of thing so yes I could follow all the gritty details, but I think the author did a good job of explaining things so that most people could understand. Also, I thought the author summed things up perfectly saying:

The preceding discussion should make it clear that the overall design approaches I outlined in the first article can be seen in the execution cores of each processor. The G4e continues its "wide and shallow" approach to performance, counting on instruction-level parallelism to allow it to squeeze the most performance out of code. The P4's "narrow and deep" approach, on the other hand, uses fewer execution units, eschewing ILP and betting instead on increases in clock speed to increase performance.

This is exactly the case. Unfortunately the popular masses don't understand all of this wide vs narrow stuff, so they go for the higher clock speeds. In reality, Intel is really pulling one over on us, charging more money and all we're getting is a higher clock rate, not a whole lot of performance gain. PPC has proven itself time and time again to be the better processor, but unfortunately it isn't used in very popular machines (mostly Macs), so we don't get to reap the benefits.

On a related note, this article touches on one of the many reasons why the GameCube will run circles around the Xbox. GameCube's processor is a 485 MHz PPC designed specifically for video games, while the Xbox just uses a common Pentium running at 733 MHz.

This all brings up a good question: why haven't Macintosh's or GameCube's marketers come up with a benchmark to put next to the processor speed? Maybe I missed it, but I've never seen a Macintosh commercial saying "comes with a G4 800 MHz, comparable to a P4 1.5 GHz." There might be too many legalities involved to do something like that, but it seems like they need to educate people somehow about the non-1-to-1 relationship between clock speeds of P4s and PPCs.
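
To make the clock speed point concrete, here is a rough sketch, not a benchmark. It assumes performance can be approximated as instructions per clock (IPC) times clock rate, which ignores memory, compilers, and workload. The IPC figures and function names are invented purely for illustration and are not measurements of any G4 or P4.

```python
# Back-of-the-envelope model: throughput ~= IPC * clock rate.
# The IPC values below are hypothetical placeholders, not measured figures.

def throughput_mips(ipc: float, clock_mhz: float) -> float:
    """Millions of instructions retired per second under this simple model."""
    return ipc * clock_mhz


def equivalent_clock_mhz(ipc_a: float, clock_a_mhz: float, ipc_b: float) -> float:
    """Clock rate chip B would need to match chip A's modeled throughput."""
    return throughput_mips(ipc_a, clock_a_mhz) / ipc_b


# Hypothetical "wide and shallow" chip: 1.5 IPC on average at 800 MHz.
# Hypothetical "narrow and deep" chip: 0.8 IPC on average.
wide_ipc, wide_mhz = 1.5, 800.0
deep_ipc = 0.8

needed = equivalent_clock_mhz(wide_ipc, wide_mhz, deep_ipc)
print(f"Deep-pipeline chip needs about {needed:.0f} MHz to keep pace")
```

With different assumed IPC values the answer changes completely, which is why only real benchmarks settle the question.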

--

~ now you know
1. Re:Great Article! by bmajik · 2001-11-07 05:16 · Score: 4, Informative
  
  In reality, Intel is really pulling one over on us, charging more money and all we're getting is a higher clock rate, not a whole lot of performance gain
  
  This is a debatable point. I think it is wrong to conclude intel is "pulling one over on us". It has been demonstrated that as more EU's are added, the effectiveness and utilization of EU's goes down. The quest for ILP comes to a crashing screeching halt before you even get to 4 EU's. IIRC, only one processor-scheduled CPU is designed with more than 4 EUs.
  
  The necessity of the chip to extract ILP in realtime is what leads us to these big hairy controllers and limited clock speeds. Controller shrink was what led to RISC in the first place, and now that we've had to add in superscalar "goo" there's hardly a difference between the CISC philosophy and the RISC one. Never mind that Intel chips have been re-writing CISC instructions as multi-EU uops forever.
  
  The point is, adding additional EU's has been demonstrated to be of dubious merit. Right NOW, the P4 speed improvements come from SSE2, just like the G4's speed improvements come from AltiVec. Both do essentially the same thing, although i've read more about AltiVec and it seems "cooler" :)
  
  The difference is this - When the P4 core hits 3 GHz, its retire rate will just destroy anything a G4 or Athlon will do. Intel took the pipeline length hit NOW and will reap the benefits later.
  
  They also spent the time to get their prediction units as top notch as possible, because iirc statistically there will be > 3 conditional branches in progress in those ridiculous 20 stage pipes :)
  
  So - the problem with intel's approach - a single instruction takes longer to complete, and the fill/drain penalty for mispredictions is high.
  The retire rate however, is amazing, and the clock rate ramping ability is similarly amazing.
  
  Your assertion that MOT's approach _relies_ on adding additional EU's is surely incorrect, because "everyone" knows that controller complexity is again dominating CPUs, and much of that is dedicated to extracting and managing ILP on 4 or fewer EUs (and that it just isn't there beyond 4.. i think the Power4 was supposed to have 6 EUs, and the Alpha 364 or 464 was going to have 8 ?)
  
  Intel has already "side stepped" the SuperScalar RISC EU problem with IA64 - That's what LIW does. LIW is interesting again now because of the realization that controller extracted ILP was too expensive and not good enough for the performance increases needed.
  
  --
  My opinions are my own, and do not necessarily represent those of my employer.
He misses one important difference... by Anonymous Coward · 2001-11-07 03:19 · Score: 4, Informative

about floating point instructions: on the PPC, both the classical FPU and the AltiVec unit have fused multiply-add instructions, i.e. a single machine instruction computes: RA=RB*RC+RD where RA, RB, RC and RD are arbitrary floating point registers. This takes the same time as a multiply; basically the add (which can also be a subtract) step is free.

The two operand Intel architecture does not allow the fused multiply-add, so that the latency of such an operation is the latency of a multiply plus the latency of an add (and the destination register has to be one of the operands, although the other operand can be in memory, saving you a load). There are plenty of practical algorithms which benefit greatly from the fused multiply-add, for example polynomial evaluations, matrix multiplications, etc. It is a feature pioneered by IBM in the RS6000 series and that Intel is using in Itanium.

And people who claim that you can do loop unrolling to hide the latencies should check their math: with only 8 registers, there is no way to hide the latencies of a multiply plus an add on a P4, while it is almost trivial on a G4 (32 registers and shorter latencies between accumulates). Furthermore many transcendental function evaluations are evaluated in libraries through polynomial approximations, which cannot be unrolled nor easily sped up: the number of coefficients is usually large enough to make the routine limited by the latency of the back to back floating point operations, but not large enough to take a divide and conquer approach.

While the G4 is clearly the better architecture (not having double precision AltiVec is not that important, I consider vector processing only worthwhile if you can do more than 4 elements per vector), the memory subsystem of the P4 is far superior. Hopefully the G5 will be comparable in this area (and I can't buy a desktop Power4 system :-().
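
The fused multiply-add argument above can be sketched as follows. Horner's rule evaluates a polynomial as a chain of multiply-then-add steps in which each step needs the previous result, so the chain is bound by latency rather than by the number of execution units. The latencies here are hypothetical round numbers chosen for illustration, not figures for any real G4 or P4, and the helper names are made up for this sketch.

```python
# Horner's rule: p(x) = (...((c_n * x + c_{n-1}) * x + c_{n-2}) ...) * x + c_0
# Every step depends on the one before it, so the steps cannot overlap.

def horner(coeffs, x):
    """Evaluate a polynomial; coeffs run from highest degree to lowest."""
    acc = coeffs[0]
    for c in coeffs[1:]:
        acc = acc * x + c  # one dependent multiply-add per remaining coefficient
    return acc


def chain_latency(steps, mul_latency, add_latency, fused):
    """Cycles for a chain of dependent multiply-add steps.

    Assumes, as the comment states, that a fused multiply-add
    costs the same as a multiply alone.
    """
    per_step = mul_latency if fused else mul_latency + add_latency
    return steps * per_step


coeffs = [2.0, -3.0, 0.5, 1.0]   # 2x^3 - 3x^2 + 0.5x + 1
print(horner(coeffs, 2.0))       # 16 - 12 + 1 + 1 = 6.0

# Hypothetical latencies in cycles, for illustration only.
MUL, ADD = 4, 3
steps = len(coeffs) - 1          # degree of the polynomial
print(chain_latency(steps, MUL, ADD, fused=True))    # 3 * 4 = 12
print(chain_latency(steps, MUL, ADD, fused=False))   # 3 * (4 + 3) = 21
```

Under these assumed numbers the unfused chain takes almost twice as long, and adding more execution units would not help because every step waits on the previous one.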