ArsTechnica Compares the P4 and G4e: Part II
Deffexor writes "It looks like Hannibal of ArsTechnica fame has put up Part 2 of his original comparison article between Intel's P4 and the Apple/Motorola G4e. In a nutshell, this second article covers the execution core, the AltiVec unit and SSE2, as well as a myriad of other interesting factoids. An interesting read, if a little technically intense for those of us with less than a CE/EE degree. Have at it boys!"
The two-operand Intel architecture does not allow a fused multiply-add, so the latency of such an operation is the latency of a multiply plus the latency of an add (and the destination register has to be one of the operands, although the other operand can be in memory, saving you a load). Plenty of practical algorithms benefit greatly from the fused multiply-add, for example polynomial evaluations, matrix multiplications, etc. The feature was pioneered by IBM in the RS/6000 series, and Intel is using it in Itanium.
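To make the polynomial case concrete, here is a minimal sketch of Horner's rule in C. The function name `horner` and the coefficient layout are assumptions for illustration, not anything from the article. Each loop step is exactly one multiply-add, and each step needs the result of the previous one, so the loop is bound by the latency of that operation.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) by Horner's rule.
 * Hypothetical helper, for illustration only.
 *
 * Each iteration computes acc*x + c[i], which is one multiply-add.
 * On hardware with a fused multiply-add the compiler may contract it
 * into a single instruction; C99 also offers fma() in <math.h> to
 * request the fused, singly rounded form explicitly.
 * Without it, each step costs a multiply followed by a dependent add. */
double horner(const double *c, size_t n, double x)
{
    assert(n > 0);
    double acc = c[n - 1];
    /* Walk the remaining coefficients from highest degree down to c[0]. */
    for (size_t i = n - 1; i-- > 0; )
        acc = acc * x + c[i];
    return acc;
}
```

Because `acc` feeds straight back into the next iteration, no amount of issue width helps here; only a shorter multiply-add latency does.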
And people who claim that you can do loop unrolling to hide the latencies should check their math: with only 8 registers, there is no way to hide the latencies of a multiply plus an add on a P4, while it is almost trivial on a G4 (32 registers and shorter latencies between accumulates). Furthermore, many transcendental functions are evaluated in libraries through polynomial approximations, which cannot be unrolled or easily sped up: the number of coefficients is usually large enough to make the routine limited by the latency of the back-to-back floating point operations, but not large enough to take a divide-and-conquer approach.
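For reference, this is roughly what "unrolling to hide the latency" looks like: a sketch of a dot product with four independent accumulators. The function name `dot_unrolled4` and the unroll factor of 4 are assumptions chosen for illustration. The number of accumulators you need grows with the latency of the multiply plus the add, and every accumulator ties up a register, which is where a small register file hurts.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with four independent partial sums (hypothetical helper).
 * The four chains s0..s3 do not depend on each other, so a pipelined
 * FPU can overlap them instead of waiting for one long chain.
 * Each chain needs its own register for the whole loop. */
double dot_unrolled4(const double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; ++i)
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the summation order differs from a plain loop, so results can differ in the last bits for general inputs.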
While the G4 is clearly the better architecture (not having double precision AltiVec is not that important; I consider vector processing only worthwhile if you can do more than 4 elements per vector), the memory subsystem of the P4 is far superior. Hopefully the G5 will be comparable in this area (and I can't buy a desktop Power4 system :-().
Will cost user $1000 with graphics card, audio, firewire, processor, ethernet, memory, apparently.
Not exactly. I'm sure you know clock speed is meaningless for determining how fast a CPU really is. Now higher model numbers should mean huge advancements, and they do many times. One reason all applications are not sped up is because the bottleneck is in different places for different applications. If you tried to play those same games that sped up on a newer CPU, but with an older graphics card, well it would be exactly the same. The word processor would then seem faster than games. Another thing, many games take advantage of special CPU instructions which many applications do not need.
Anyone who has ever upgraded or put together their own PC should know that performance depends on all parts, not just the CPU or memory. Maybe your applications don't speed up because you are still using that old hard drive, which is the bottleneck? Yes, it would be nice to purchase a magic chip that you throw into a computer which speeds everything up dramatically, but the reality is no computer (that I know of) works that way. I'd love it if I could install a new CPU and get 20% more bandwidth, but it just ain't gonna happen. The reason your applications do not seem faster is that they are probably already as fast as any mere mortal can detect.
Dijkstra Considered Dead
In reality, Intel is really pulling one over on us, charging more money and all we're getting is a higher clock rate, not a whole lot of performance gain.
This is a debatable point. I think it is wrong to conclude Intel is "pulling one over on us". It has been demonstrated that as more execution units (EUs) are added, the effectiveness and utilization of each EU goes down. The quest for ILP comes to a crashing, screeching halt before you even get to 4 EUs. IIRC, only one processor-scheduled CPU is designed with more than 4 EUs.
The necessity of the chip to extract ILP in realtime is what leads us to these big hairy controllers and limited clock speeds. Controller shrink was what led to RISC in the first place, and now that we've had to add in superscalar "goo" there's hardly a difference between the CISC philosophy and the RISC one. Never mind that Intel chips have been re-writing CISC instructions as multi-EU uops forever.
The point is, adding additional EUs has been demonstrated to be of dubious merit. Right NOW, the P4 speed improvements come from SSE2, just like the G4's speed improvements come from AltiVec. Both do essentially the same thing, although I've read more about AltiVec and it seems "cooler".
The difference is this - when the P4 core hits 3 GHz, its retire rate will just destroy anything a G4 or Athlon will do. Intel took the pipeline length hit NOW and will reap the benefits later.
They also spent the time to get their prediction units as top notch as possible, because IIRC statistically there will be more than 3 conditional branches in progress in those ridiculous 20-stage pipes.
So - the problem with Intel's approach - a single instruction takes longer to complete, and the fill/drain penalty for mispredictions is high.
The retire rate however, is amazing, and the clock rate ramping ability is similarly amazing.
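A back-of-the-envelope way to see the tradeoff is the usual textbook model: effective cycles per instruction equals the base CPI plus branch frequency times misprediction rate times the flush penalty. The sketch below is only that model in C; the function name `effective_cpi` and every number fed to it are assumptions, not measurements of any P4 or G4.

```c
#include <assert.h>

/* Simple pipeline model (hypothetical helper, illustrative only):
 *   base_cpi        cycles per instruction with perfect prediction
 *   branch_freq     fraction of instructions that are branches
 *   mispredict_rate fraction of those branches that are mispredicted
 *   penalty_cycles  cycles lost refilling the pipe per misprediction */
double effective_cpi(double base_cpi, double branch_freq,
                     double mispredict_rate, double penalty_cycles)
{
    assert(branch_freq >= 0.0 && branch_freq <= 1.0);
    assert(mispredict_rate >= 0.0 && mispredict_rate <= 1.0);
    return base_cpi + branch_freq * mispredict_rate * penalty_cycles;
}
```

With made-up inputs of a base CPI of 1.0, 20% branches and a 5% misprediction rate, a 20-cycle penalty adds about 0.2 cycles per instruction while a 7-cycle penalty adds about 0.07. That is why the long pipe only pays off if the predictor is very good and the clock advantage is large.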
Your assertion that Motorola's approach _relies_ on adding additional EUs is surely incorrect, because "everyone" knows that controller complexity is again dominating CPUs, and much of that is dedicated to extracting and managing ILP on 4 or fewer EUs (and that it just isn't there beyond 4... I think the Power4 was supposed to have 6 EUs, and the Alpha 21364 or 21464 was going to have 8?)
Intel has already "side stepped" the superscalar RISC EU problem with IA64 - that's what LIW does. LIW is interesting again now because of the realization that controller-extracted ILP was too expensive and not good enough for the performance increases needed.
My opinions are my own, and do not necessarily represent those of my employer.
Yes XP will run on it using VPC. The question is why.
On the G4 you can run OSX, OS9, XP and rootless XWindows all at the same time. The only problem is you have to reboot to run Linux. But then you can run the MacOS from within Linux.
Flexibility of the Mac is one of its strong suits. Check out the different GNU-Darwin, Darwin, and XonX sites. That is where the action is.
Yes I am running BSD, you still running Windows?
A good general resource for this kind of advanced computer architecture is the book Computer Architecture: A Quantitative Approach by John Hennessy and David Patterson. It's quite dense. For the latest in processor architecture, the IEEE Micro magazine is useful.
The Gamecube's CPU is a PPC 603 with its FPU split in two and very fast custom busses for connection to the rest of the console so yes, I'd say it's "custom engineered for gaming".
At least compared to that of the X-Box.
Look, I *want* to believe that the G5 makes great coffee, gives fantastic backrubs, cures cancer and runs faster than every P4. I do. I've just heard all these lines before, with the G4.
I just got back from a seminar with Motorola on the architecture of their new and upcoming chips. Some nice stuff on the horizon. They did mention some of the less than stellar performance on prior chipsets, and explained that it was due to software not taking advantage of the chip features. An operating system or user application that must be backwards compatible will not be able to utilize the chipsets to full advantage.
You don't judge high end chipsets based on mass market consumer applications.
A Government Is a Body of People, Usually Notably Ungoverned
That's a very amusing comic, much more so than UF. Any idea why it's never mentioned on SD?