ArsTechnica Compares the P4 and G4e: Part II
Deffexor writes "It looks like Hannibal of ArsTechnica fame has posted Part 2 of his original comparison article between Intel's P4 and the Apple/Motorola G4e. In a nutshell, this second article covers the execution core, the AltiVec unit and SSE2, as well as a myriad of other interesting factoids. An interesting read, if not a little technically intense for those of us with less than a CE/EE degree. Have at it boys!"
gotta love AT, they don't publish often but when they do it's fantastic stuff.
This is exactly what I've been trying to find out for some time now. I've been increasingly upset with the x86 line of chips since it seems that there is hardly any difference between 600MHz and 1.2GHz.
Here's another comparison: Joy Of Tech (and the next 6 pages as well)
This is the place where you write something that will make you seem like a complete idiot.
>An interesting read, if not a little technically intense for those of us with less than a CE/EE degree.
Tell me about it, I do have more than CE, two letters even, namely MCSE and even I had to stop when they started throwing around the heavy stuff. I mean, A = A + B is supposed to make sense even if B isn't equal to zero.
I intend to live forever, so far so good.
Big article, only had time to glance over it (and I'm not technically qualified to understand it in its full detailed glory), but as far as I can see, the dude isn't picking sides. Which is a rare treat.
:)
Although he does confirm Steve Jobs's words of wisdom: MHz aren't everything.
(I, on the other hand, am picking sides)
You can't take the sky from me...
A more interesting comparison will be to pit the P4 against the coming G5. According to the Register, Apple has begun seeding early G5s at up to 1.6GHz to key developers. Other sources are claiming limited yields in the 2.4GHz range already.
There are still bugs to be worked out before production ramps up for release early next year, and supposedly AltiVec will not be as strong on the G5 as it is in the G4. But at 2.4GHz on an already-superior FPU, who needs it?
i personally believe that flexibility of the assembly instructions as well as the number of instructions executed per cycle contribute greatly to the dominant speed (at any given MHz/GHz) of the ppc processor. compare any intel/amd processor to a ppc at the same clock speed, and the ppc will kick its x86 ass.
the high end ppc desktops are topping out around 900MHz, while the p4's are hitting 2GHz. there has to be another explanation besides the complaint that jobs is ignorantly sitting on his thumbs. i think he knows what he's doing.
note: i am not a mac zealot.. i don't even own a mac - only 4 x86 pc's (1 athlon, 2 p133, 1 p120). i simply can appreciate the speed of the ppc.
"I just want to thank my coach Eric a.k.a. Disco for shattering my reality..."
But with the G5 around the corner, I think THAT will be THE interesting comparison, especially since Intel plans on keeping the P4 for a while and ramping it up in speed. You read that Adobe says the G5 is significantly faster than the P4, and if you go read the article, the same people do say that the P4 is faster than a G4 (except for AltiVec stuff). So if they say the G5 is faster than the P4, it probably will be :) It should be really nice to see something other than AMD that kills the P4 in raw performance.
--- Metamoderating abusive downgraders since my 300th post.
Note: I have a B.S. in computer science, a solid understanding of hardware issues, and have been programming for 19 years.
When I read articles like this, there's so much detail that I find myself--even willingly--losing sight of the big picture. Sure, you could read a detailed write-up about Toyota's new engine, but those details don't really matter much unless you've just made a hobby of knowing about engines. Realistically, you'll have a hard time connecting those details to your driving experience. Heck, someone could put in a different engine, tell you that it's a Toyota, and you'd be saying things like "Oh, yes, this feels just like a Toyota, I can tell that the designers did blah and blah."
After the Pentium II generation of CPUs, things have gotten very, very muddled. Amazing features that are supposed to increase performance don't always do so. Sometimes they make things worse. Little compiler tweaks can make one program be twice as fast as another, given the same hardware. Chips with higher clock rates can be significantly slower than chips with 20% slower clocks. Certain applications run much faster than on previous chips, but there are others that show no increase.
It's all very chaotic and confusing, even for people in the know. I suspect that if you took a program that people claimed to need a P4 or Athlon for--something very performance sensitive--and set yourself the task of making it run faster on a PII than an Athlon, you could do it. But that doesn't matter, as everyone seems to be clamoring for newer chips.
Uhh, in a word, No.
It'd be easy to use this in the video card realm as well. 2 is 2, right? Voodoo2 and a GeForce2.. I won't touch the microsoft jab.
Does anyone know if an ATX board with a G4 exists? I just started developing my own little OS and, frankly, x86 assembly stinks. I hadn't touched it for 4 years and didn't remember how crap that ISA really is. The 68k series was such a nice development platform, and the PPC ISA looks quite cool as well.
This article is extremely informative and gives you a good insight into how these processors are designed, as well as how they compare. I disagree with the poster though, you don't need a CE or EE degree to get the idea of what's going on. I'm a CE and I had classes on this sort of thing so yes I could follow all the gritty details, but I think the author did a good job of explaining things so that most people could understand. Also, I thought the author summed things up perfectly saying:
The preceding discussion should make it clear that the overall design approaches I outlined in the first article can be seen in the execution cores of each processor. The G4e continues its "wide and shallow" approach to performance, counting on instruction-level parallelism to allow it to squeeze the most performance out of code. The P4's "narrow and deep" approach, on the other hand, uses fewer execution units, eschewing ILP and betting instead on increases in clock speed to increase performance.
This is exactly the case. Unfortunately the popular masses don't understand all of this wide vs narrow stuff, so they go for the higher clock speeds. In reality, Intel is really pulling one over on us, charging more money and all we're getting is a higher clock rate, not a whole lot of performance gain. PPC has proven itself time and time again to be the better processor, but unfortunately they aren't used in very popular machines (mostly Macs,) so we don't get to reap the benefits.
On a related note, this article touches on one of the many reasons why the GameCube will run circles around the Xbox. GameCube's processor is a 485 MHz PPC designed specifically for video games, while the Xbox just uses a common Pentium running at 733 MHz.
This all brings up a good question: why haven't Macintosh's or GameCube's marketers come up with a benchmark to put next to the processor speed? Maybe I missed it, but I've never seen a Macintosh commercial saying "comes with a G4 800 MHz, comparable to a P4 1.5 GHz." There might be too many legalities involved to do something like that, but it seems like they need to educate people somehow of the non 1 to 1 relationship between clock speeds of P4s and PPCs.
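To put a rough number on that non 1 to 1 relationship: if you model throughput as instructions-per-clock times clock rate (a big simplification that ignores memory, compilers and workload), you can work out how much more work per clock the slower-clocked chip has to do to keep up. A minimal sketch; the function names are made up for illustration, and the clock figures are just the ones from the hypothetical ad above, not measurements:

```python
def instructions_per_second(ipc, clock_hz):
    """Crude throughput model: average instructions per clock times clock rate."""
    return ipc * clock_hz


def required_ipc_ratio(slow_clock_hz, fast_clock_hz):
    """How many times more work per clock the slower-clocked chip must do to break even."""
    return fast_clock_hz / slow_clock_hz


# Clock figures taken from the hypothetical ad; these are not benchmark results.
ratio = required_ipc_ratio(800e6, 1.5e9)
print(f"An 800 MHz part needs {ratio:.3f}x the IPC of a 1.5 GHz part to tie")
```

Whether a given chip actually manages that ratio depends entirely on the code being run, which is probably why nobody puts a single number in an ad.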
~ now you know
Like with a Tootsie Pop, you start licking, and finally get too impatient and just bite the damn thing.
Mmm...chocolatey goodness...
Carl G. Jung
--
"With one breath, with one flow, You will know Synchronicity" -La Policia
"Intel introduced SIMD to the PC market with MMX on the PII."
No, my dear sir, Intel introduced SIMD in the first generation Pentium, labelled as Pentium MMX which started at 233 MHz IIRC.
And I've yet to see a single Sparc app using VIS except their crap ShowMeTV.
The two operand Intel architecture does not allow the fused multiply add, so that the latency of such an operation is the latency of a multiply plus the latency of an add (and the destination register has to be one of the operands, although the other operand can be in memory, saving you a load). There are plenty of practical algorithms which benefit greatly from the fused multiply-add, for example polynomial evaluations, matrix multiplications, etc, a feature pioneered by IBM in the RS/6000 series and that Intel is using in Itanium.
And people who claim that you can do loop unrolling to hide the latencies should check their math: with only 8 registers, there is no way to hide the latencies of a multiply plus an add on a P4, while it is almost trivial on a G4 (32 registers and shorter latencies between accumulates). Furthermore many transcendental function evaluations are evaluated in libraries through polynomial approximations, which cannot be unrolled nor easily sped up: the number of coefficients is usually large enough to make the routine limited by the latency of the back to back floating point operations, but not large enough to take a divide and conquer approach.
While the G4 is clearly the better architecture (not having double precision AltiVec is not that important, I consider vector processing is only worth it if you can do more than 4 elements per vector), the memory subsystem of the P4 is far superior. Hopefully the G5 will be comparable in this area (and I can't buy a desktop Power4 system :-().
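For anyone wondering why the polynomial case is latency-bound: Horner's rule is a chain of multiply-adds where each step needs the previous result, so there is nothing to overlap. A minimal Python sketch (the function name and coefficient order are my own choices for illustration; Python itself does not fuse the operations, this only shows the dependency chain):

```python
def horner(coeffs, x):
    """Evaluate a polynomial by Horner's rule.

    coeffs is ordered from the highest-degree term down to the constant.
    Each iteration is one multiply followed by one add, and it cannot start
    until the previous iteration has produced acc.
    """
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c  # one multiply-add, serially dependent on the last
    return acc


# 1*x^2 + 2*x + 3 evaluated at x = 2
print(horner([1.0, 2.0, 3.0], 2.0))
```

With a fused multiply-add the loop body is a single instruction; without one it is a multiply and then an add, and the chain costs the sum of both latencies per coefficient.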
----snip----
add A, B
mov C, A
The first command adds the two numbers, and the second command moves the result from A to C. Of course, you still have the potential problem that the original value of A was erased by the add command, so if you wanted to preserve A's value then you'd have to insert even more instructions to store A in a temporary register and then restore its value once the addition has been performed.
----snip----
Not quite. I'm sure even people who _don't_ know x86 assembly language will realise you don't need any extra instructions at all. Simply reorder them:
mov C, A
add C, B
Obviously, the example was being used to show how much nicer it would be to have three or more operands in your instructions, but it was a lousy example.
On a sidenote, we've been able to specify more than two operands with certain instructions since the 80386. Look up the syntax for the "imul" instruction.
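A toy model of the two orderings, for anyone who wants to poke at it. This is Python, not real x86: the `run` helper and the register names A, B, C are invented for illustration, and it only models two-operand `mov` and `add` where the first operand is also the destination:

```python
def run(program, regs):
    """Execute a list of (op, dst, src) tuples on a copy of regs."""
    regs = dict(regs)
    for op, dst, src in program:
        if op == "mov":
            regs[dst] = regs[src]       # dst = src
        elif op == "add":
            regs[dst] += regs[src]      # dst = dst + src (dst is overwritten)
        else:
            raise ValueError(f"unknown op: {op}")
    return regs


start = {"A": 2, "B": 3, "C": 0}

# Article's order: the sum lands in C, but A is clobbered.
article = run([("add", "A", "B"), ("mov", "C", "A")], start)

# Reordered: the sum still lands in C, and A keeps its value.
reordered = run([("mov", "C", "A"), ("add", "C", "B")], start)

print(article)
print(reordered)
```

Both versions are two instructions long; only the second leaves A intact.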
Today's weirdness is tomorrow's reason why. -- Hunter S. Thompson
This all brings up a good question: why haven't Macintosh's or GameCube's marketers come up with a benchmark to put next to the processor speed? Maybe I missed it, but I've never seen a Macintosh commercial saying "comes with a G4 800 MHz, comparable to a P4 1.5 GHz." There might be too many legalities involved to do something like that, but it seems like they need to educate people somehow of the non 1 to 1 relationship between clock speeds of P4s and PPCs.
Cyrix used to sell PR parts, PR133 might have been a 116 MHz chip, but it was as fast as a 133 MHz Pentium. So there's precedent, and it's probably legally OK, but I suspect the reason is it doesn't really matter.
What really matters is that the CPU is fast enough for what you want to do. I run OS9, OSX and Linux on my machines. My home machine is a G3/350, and it's plenty fast for running OS9 for everything but compressing MPEG1 video. It's not fast enough for running OSX. My work machine is a G4/400 and it's just fast enough for running OSX. But it's not fast enough for compressing MPEG1 video. If I had a dual-800 G4 it would be more than fast enough for OSX, but it would still be too slow for compressing MPEG1 video. My Linux machine is a Dual-800 P3. It's just fast enough for running Linux with all the crap I have running. It's still too slow for compressing MPEG1 video, though. I also use a 1.2GHz Athlon machine occasionally, and I consider that just fast enough to run Windows 2000. I assume XP is similar. But it would still take a long time to compress MPEG1 video.
So, how would you structure a comparison benchmark? SPECint? BYTEMark? Photoshop duels? I think the answer is that you don't. It doesn't matter, as long as the computer is fast enough to do what you want it to do. The semi-annual MacWorld Photoshop duels are interesting since they actually show that the computer is too slow for designers but the Windows machines aren't any better. Perhaps they need to enunciate more, but I think their current stand of "it's fast enough" is the mature one.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
The whole comparison is completely pointless. The majority of machines running Intel processors are running Windows. The majority running Motorola processors are running some version of the Mac OS. Might as well have Car and Driver running a comparison of a Jaguar S-type and a 10-ton dumptruck. (which processor is the equiv of a Jag is left as an exercise for the reader)
Steve's Computer Service, Hobbs, NM
Come on now! How can /.ers REALLY know which is the best CPU for us if there's no BEOWULF cluster benchmark comparisons between the G4 and the P4.
I want to imagine.
Last month they said the G5 was going to be released with that iPad wireless web tablet (not the iPod, the tablet photoshop job). It scales perfectly to 16GHz, costs $0.13 each, and does 543TFlops.
Are you stupid or just idiotic? Or maybe just completely humourless?
Of which, I "R" in that group.
While the G4e has fairly standard, fairly unremarkable floating-point hardware, the PPC ISA does things the way they're supposed to be done
[snip]
The P4, on the other hand, has slightly better hardware but is hobbled by the legacy x87 ISA.
I could have sworn tomshardware stated it best as "Essentially we have a P4(86) 2Ghz".
I'm paraphrasing, mind you and possibly taking it out of context, *but* instead of increasing the cache (instruction/data/registers) they combined and dropped it down to 8k of instruction and data.
Oh, and on the P4 vs AMD's XP chip, how would this analogy be changed or overhauled as it stands with the P4 vs G4e?
I'd really like to know. Or have a better "real world" analogy geared for the newbie user who usually winds up asking me, and I have to be able to explain complex things in simple terms to myself first.
Thanks.
GISboy
If it is not on fire, it is a software problem.
please would someone re-moderate this wildly incorrect +3 post
How is it more careful to use one word over another? What are the mortal dangers of using 'whomever' incorrectly that don't apply to using 'whoever' incorrectly? Should I use 'he' instead of 'him' incorrectly, too? Or maybe I should just *gasp* dare to make mistakes so that I can learn from them?
We're still thinking according to the Pentium I mindset. Stalls like that one aren't really an issue anymore thanks to the out-of-order instruction execution scheme that has been evolving since the P-II.
--
Today's weirdness is tomorrow's reason why. -- Hunter S. Thompson
Next time, I'll include those humour tags for those who do not recognize it :)
:-)
As for the math part, (yes I knew it was about computer instructions) there isn't a single problem with A = A + B with B not equal to 0. Modular arithmetic for example. But then again, I am not sure what grade math level you need to have for that.
I intend to live forever, so far so good.
What does the motherboard have to do with the case? Just pop the board out and replace it with another.
It always amazes me the way Mac-heads will use all kinds of adaptor cards to upgrade old motherboards, which of course would cripple performance.
autopr0n is like, down and stuff.
I'm a CE as well, and I absolutely loved the article. It's nice to have someone fight the technical battles for you to release minute details about their procs. Trying to get information like this is usually like pulling teeth from some companies.
What I wanted to convey, though, for people who may not deal with this stuff on a hardware level, is that it is very hard to really get a good understanding of the whole processor. These projects are IMMENSE. Trying to keep track of millions of transistors, lay them out, etc. is a nightmare. I know. So while it is good to talk about narrow vs. wide on a conceptual level, that all falls away when you start looking at transistors. More than anything these projects are all about coordination. You can have a team of engineers working on a specific part who have NO IDEA what the other bits and pieces look like unless they have to interface with them. So just keep that in mind when we're judging these companies. I just think we lose sight of the massive scale of these projects sometimes.
Are there any real benchmarks comparing the P4/Athlon/G4? (by real, I don't mean a set of 2 or 3 similar Photoshop filters). I hear a lot of people saying that the G4 is a superior performer, but I have a hard time believing an 800MHz chip is really faster than a 2GHz one.
autopr0n is like, down and stuff.
there isn't a single problem with A = A + B with B not equal to 0. Modular arithmetic for example
What? You just defined 0 in a mathematical sense. What's your example?
As my father lik@(munch munch)...
FYI, the whole thing is meant as a joke :)
:)
After all, a Pentium "4" is way less than a 286. 4 286 !!
In Soviet Russia you don't have to put up with these crappy jokes
If you are a consumer who wanted a comparison to decide which kind of computer to buy, you are right: the article was (mostly) useless.
BUT, for the audience the article is intended for - geeks, technophiles, nerds & propeller heads - it was not pointless at all. On this forum in particular there are a lot of people who use neither Windows nor MacOS but other operating systems which run happily on either processor. Even if there is no *practical* point there is always sheer geek curiosity - a lot of us find such articles entertaining.
Might as well have Car and Driver running a comparison of a Jaguar S-type and a 10-ton dumptruck.
I don't think that the difference between a P4 and a G4 is quite as wide as that - and they are being marketed by both sides as roughly equivalent products. Most techie people may know which is the "dumptruck" and which the "Jaguar" but it is still interesting to see a technical explanation of WHY and precisely HOW they are so different.
What do you mean by "too slow?!"
Do you mean in real time? If so, say it.
Saying it's "too slow to compress MPEG1 video" is bollocks. Look at the recent iDVD2 demo on a dual 800, that's MPEG 2 which is harder to compress.
It doesn't mean much now, it's built for the future.
Man, I've gotta pay more attention to that site. I remember learning about pipelines and such as a CompSci major. It's cool to actually put that part of my education to use. :)
I'll bite (bit of professional math pride flaring up).
First of all, I am interested in which number would be defined as 0, since I assumed B to not be equal to 0 (unless you are counting modulo B of course).
Furthermore, any mathematician who would accept a definition of 0 in a mathematical sense without a uniqueness clause should be fired.
And if you still do not get it, think clock times, where 1 am + 24 hours equals 1 am.
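The clock example in code form, as a minimal sketch (the helper name is made up): in arithmetic modulo 24, adding 24 leaves the value unchanged even though 24 is not the integer 0.

```python
def add_mod(a, b, modulus):
    """Return (a + b) reduced modulo modulus."""
    return (a + b) % modulus


hour = 1                       # 1 am
later = add_mod(hour, 24, 24)  # 24 hours later on a 24-hour clock
print(later == hour)
```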
I intend to live forever, so far so good.
Oh, I recognise humor.
But this was not it.
~jeff
To each his preferences.
I intend to live forever, so far so good.
If it can't run windows XP?
I have a bunch of boxes I use for dev/test purposes. My primary is a 2GHz box, and I also have an 800MHz P-III next to it, both with 512MB of RAM... My apps sure as heck compile MUUUUUCH faster on the P4 than on the P-III. When I run some of my framework apps, you should see the performance. I have a Celery 800 as well. That thing slows to a crawl when I crank up the number of clients, but the P4 continues to fly. On the P-III, the animations will slow down a bit. The P4 runs full steam through it...
nevermind this post. . .
the history of the world
Yes XP will run on it using VPC. The question is why.
On the G4 you can run OSX, OS9, XP and rootless XWindows all at the same time. The only problem is you have to reboot to run Linux. But then you can run the MacOS from within Linux.
Flexibility of the Mac is one of its strong suits. Check out the different GNU-Darwin, Darwin, and XonX sites. That is where the action is.
Yes I am running BSD, you still running Windows?
Imagine a Beowulf Cluster of THESE!!!
It would be fair to compare the K6 to the Pentium 4 if the K6 was the best chip AMD had to offer as is the case with the G4 vs P4. If Motorola had released the G5 before the P4 came out, a comparison of the G5 vs P3 would be fair because it would be the best thing each company had to offer.
man RTFM
No manual entry for RTFM.
Actually I can't get XP to install under VPC (v. 4.0.2). It goes through the first stage install, but then poops in the 2nd stage (in GUI mode) with some error regarding one of the 'assembly' files (XML linking to DLLs). I think maybe the disc image I downloaded is bad...
A good general resource for this kind of advanced computer architecture is the book Computer Architecture: A Quantitative Approach by John Hennessy and David Patterson. It's quite dense. For the latest in processor architecture, the IEEE Micro magazine is useful.
Just a quickie. This is the second time that the Power4 has come up on this thread.
Just how much do the POWER and PowerPC lines have in common? I know the PowerPC was born out of the POWER line but have they now gone down completely different roads?
And I live to intend forever. So far so good...
YHBT. HAND.
Quite a lot. Actually, from an application point of view the instruction set of the Power4 is exactly the 64-bit PPC instruction set. For a system programmer, there are a few differences in the MMU and some exception handling details, but nothing dramatic.
The two operand Intel architecture does not allow the fused multiply add, so that the latency of such an operation is the latency of a multiply plus the latency of an add (and the destination register has to be one of the operands, although the other operand can be in memory, saving you a load). There are plenty of practical algorithms which benefit greatly from the fused multiply-add, for example polynomial evaluations, matrix multiplications, etc, a feature pioneered by IBM in the RS/6000 series and that Intel is using in Itanium.
And people who claim that you can do loop unrolling to hide the latencies should check their math: with only 8 registers, there is no way to hide the latencies of a multiply plus an add on a P4, while it is almost trivial on a G4
Actually, it turns out that you can still mask the loop latency with a limited register set.
First, you can use "software pipelining" to mask quite a bit of the loop latency without having to unroll (it's a clever reshuffling of the loop instructions; for brevity, I won't describe it here). This requires one extra FP register over the straightforward implementation of an x86 dot-product loop (four instead of three, because I can no longer re-use scratch registers between steps).
Second, branch prediction will to a limited extent perform unrolling for you. While the architectural register file has only 8 registers, there are many more internal registers on the chip. Register renaming allows the processor to run several iterations of the loop in parallel without having to worry about namespace conflicts (though true dependencies remain intact). This works as long as the total number of iterations being unrolled fits within the scheduler's window (usually 8-16 instructions; I don't know how big the P4's window is).
In summary, for something as straightforward as a dot product, it's certainly possible to write x86 code that will avoid the penalty of having separate add and multiply instructions.
[You'll really be bound by the memory subsystem for both chips, but that's moot point for this discussion.]
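Since the reshuffling was skipped above, here is roughly what software pipelining does to a dot-product loop. This is a Python sketch of the loop structure only (the function names are mine, and Python obviously won't show any speedup): the multiply for element i is issued in the same iteration as the add for element i-1, so on real hardware the two could overlap.

```python
def dot_plain(xs, ys):
    """Straightforward loop: multiply, then immediately add the product."""
    total = 0.0
    for x, y in zip(xs, ys):
        prod = x * y
        total += prod          # waits on the multiply just issued
    return total


def dot_pipelined(xs, ys):
    """Software-pipelined loop: the add consumes the previous iteration's product."""
    n = min(len(xs), len(ys))
    if n == 0:
        return 0.0
    total = 0.0
    prod = xs[0] * ys[0]               # prologue: first multiply
    for i in range(1, n):
        next_prod = xs[i] * ys[i]      # multiply for iteration i ...
        total += prod                  # ... alongside the add for iteration i-1
        prod = next_prod
    total += prod                      # epilogue: last add
    return total


xs = [1.0, 2.0, 3.0]
ys = [4.0, 5.0, 6.0]
print(dot_plain(xs, ys), dot_pipelined(xs, ys))
```

The extra live value (`next_prod`) corresponds to the one extra register mentioned above: the previous product has to stay around while the next one is being computed.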
From one potential immortal to another, I'm not dead yet!
Steve M
nuff said.
t.
A google later, we have: IBM is launching its lowest-priced system, featuring eight processors, at $450,000. ... The new IBM servers have a Power4 processor that contains two processors, a system switch, a large memory and input/output technology -- a design that enables the server to conserve energy and outperform servers that have twice as many processors, IBM said.
Note that this is the lowest-priced system.
t.
A = A + B
With an MCSE, you don't know what this stands for?
x = x + y ?
x += y ?
In Z notation,
let's say you have a SoftEng or CompEng degree,
you would write this as:
A' = A + B
To indicate that A' is the new value of A.
Wow that's difficult!
And you got a job on top of that!
What's the email of your employer?
You should get fired!
infinity is not a number, but we should know that shouldn't we. i need my karma back.
That's a very amusing comic, much more so than UF. Any idea why it's never mentioned on SD?
Well any professional mathematician who talks about an alternate number system (not real) without making that clear should be fired.
A = A + B
B = 0
because by not specifying the number system you just have to assume real. The previous poster was not talking about defining a number system but the validity of an equation in some already defined system (presumably real).
I didn't think mathematicians would have the childish ego of most computer nerds. Then again you probably aren't a mathematician.
Have a nice day.
Nope...hiatus. He's starting SAMBLA.
Does it run OS X? That answer is no.
||| I still can't believe Parkay's not butter.