Double-Whammy Look At The Pentium 4
SystemLogicNet writes: "We at SystemLogic.net have just taken a technical look at the Pentium 4 architecture. In the article we go over all the basics that all the other sites cover like the double pumped ALUs, iSSE2, the longer pipeline, etc, but in addition we have some discussion about how different program structurings have an impact upon the design, and performance of the Pentium 4. One of the major areas where this comes into play is how complex data structures interact with the underlying philosophy that the Pentium 4 is built upon -- extreme bandwidth. This Pentium 4 technical background can be read over here. At the same time, we've done a rigorous analysis, including benchmark description and discussion regarding the Pentium 4's performance, and this can be read over this way."
I was shocked by this review site. Almost all of the graphs are misleading. Most magnify the area of difference between the two processors to make the margin look larger. For example, in the "Content Creation Winstone" benchmark (http://www.systemlogic.net/reviews/hardware/processors/intel/p41700/i/c7.gif), the difference is only 3.6 points, yet the scale is nearly 1/3. That's nearly 3x magnification.
Some differ by only a few percent, the lowest by about -4.5% of the P4 score, yet the distance shown on the graph would suggest a difference of nearly 60% or more.
This review site needs to get a clue about statistics and start graphing according to the real differences, not magnified margins.
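To put rough numbers on that complaint, here is a minimal Python sketch of how an axis that doesn't start at zero inflates a small gap. The scores and the axis start (95.5, 100, 92.5) are made up for illustration and are not read off the review's graphs, and `true_gap` / `apparent_gap` are hypothetical helper names.

```python
def true_gap(low, high):
    """Real difference as a fraction of the higher score."""
    return (high - low) / high


def apparent_gap(low, high, axis_start):
    """Difference in drawn bar length, as a fraction of the longer bar,
    when the axis begins at axis_start instead of zero."""
    return (high - low) / (high - axis_start)


# Hypothetical scores and axis start, NOT taken from the review.
low, high, axis_start = 95.5, 100.0, 92.5

print(f"true gap:                {true_gap(low, high):.1%}")
print(f"gap as drawn:            {apparent_gap(low, high, axis_start):.1%}")
print(f"gap on a zero-based axis: {apparent_gap(low, high, 0.0):.1%}")
```

With a zero-based axis the drawn gap and the real gap are the same number, which is the whole point.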
"I'll just chip in a bit for RedHat: I actually have that installed on my university machine." - Linus, '95
They said clock rate really might not matter. I think we all realise it does. Just look at the benchmarks: 1100 MHz versus 1700 MHz, and the 1700 wins. Imagine that. With similar architectures this should be no surprise. Even I can walk down to my local store and get a 1.33GHz Athlon now. Why couldn't they?
As for the "it costs a few hundred bucks more" argument, it makes a difference, especially when your competitor is a few hundred bucks cheaper AND making inroads into what was traditionally your turf.
"Just wait till next year's model, which will be even better." I think we all hope this is true, but it might not be. Both sides have stumbled at various points along the way.
As for the buyers, they will buy a Duron at the low end or even an Athlon. Why buy a Celeron when a Duron gets you more performance, costs the same, lets you upgrade to a faster processor, and isn't a complete dead end yet? Even the big companies are advertising them in the local newspapers.
Actually, it's not exactly "hard" stuff from an implementation point of view. Cycle times are short, so you want a prediction equation that you can evaluate quickly, in one cycle. In fact, you can get pretty good results with a simple 4-state saturating counter, strongly not taken (00) - weakly not taken (01) - weakly taken (10) - strongly taken (11), that updates when a branch is confirmed to be taken or not taken. If your BHT (branch history table) is sufficiently large, you can get decent results. Sprinkle in some voodoo magic by adding a GHR (global history register), which hashes the opcode address based on the taken/not-taken state of the last n branches, and you can get a couple of extra percentage points. I've seen upwards of 95%-97% prediction rates with such implementations, but that's in a RISC environment which also provides fairly accurate branch hints in the opcode itself (much like the Itanium does). (The compiler knows what the code should do and what the semantics of a branch are: an "if", "for", "switch" construct, etc.)
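For anyone who wants to poke at it, here is a rough software sketch of that scheme in Python. It is a toy model, not anybody's actual hardware: the class name `TwoBitPredictor`, the table sizes, and the `accuracy` helper are all made up for illustration, and XORing the address with the history (the textbook "gshare" hash) is my assumption, since the hash itself isn't specified above. It also assumes 4-byte fixed-length opcodes, hence the `pc >> 2`.

```python
class TwoBitPredictor:
    """Toy BHT of 2-bit saturating counters, indexed by PC xor GHR.

    Counter states: 0 = strongly not taken, 1 = weakly not taken,
                    2 = weakly taken,       3 = strongly taken.
    """

    def __init__(self, index_bits=10, history_bits=8):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.table = [1] * (1 << index_bits)  # start at weakly not taken
        self.ghr = 0  # global history register: last n branch outcomes

    def _index(self, pc):
        # Assumption: fixed 4-byte opcodes, so the low 2 bits carry no info.
        return ((pc >> 2) ^ self.ghr) & self.index_mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Call once the branch outcome is confirmed."""
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the outcome into the history register.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.history_mask


def accuracy(predictor, trace):
    """trace is a list of (pc, taken) pairs; returns fraction predicted right."""
    hits = 0
    for pc, taken in trace:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(trace)


if __name__ == "__main__":
    # A loop branch taken 9 times, then falling through, repeated 200 times.
    trace = [(0x4000, i % 10 != 9) for i in range(2000)]
    print("bimodal only:", accuracy(TwoBitPredictor(history_bits=0), trace))
    print("with history:", accuracy(TwoBitPredictor(history_bits=8), trace))
```

Setting `history_bits=0` turns the GHR off and leaves you with the plain counter table, so you can compare the two on your own traces.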
Where things probably get weird for Intel is that their BHT probably suffers a bit of address aliasing/underutilization because x86 opcodes are variable length. With RISC architectures (fixed length opcodes), you can chop off the last couple of address bits, since the 0, 1, 2, 3 cases don't matter. That means less address aliasing over a greater range of addresses.
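A quick Python sketch of the indexing point, under the assumption of a tiny 16-entry table indexed straight from the address bits (no history hashing). `bht_index` and `distinct_slots` are hypothetical names, and this says nothing about how Intel actually indexes its tables.

```python
def bht_index(pc, index_bits, drop_bits):
    """Table slot for a branch at address pc, ignoring the low drop_bits."""
    return (pc >> drop_bits) & ((1 << index_bits) - 1)


def distinct_slots(pcs, index_bits, drop_bits):
    """How many different table slots a set of addresses lands in."""
    return len({bht_index(pc, index_bits, drop_bits) for pc in pcs})


# 16 consecutive fixed-length (4-byte) opcodes, as on a RISC machine.
risc_pcs = [0x1000 + 4 * i for i in range(16)]

# Dropping the 2 dead bits gives every opcode its own slot.
print("drop 2 bits:", distinct_slots(risc_pcs, index_bits=4, drop_bits=2))

# Keeping them (as byte-granular, variable-length opcodes would force)
# piles the same 16 opcodes into far fewer slots.
print("drop 0 bits:", distinct_slots(risc_pcs, index_bits=4, drop_bits=0))
```

Same table, same code, but the byte-granular index covers only a quarter of the address range before entries start colliding.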
Mispredict bypass buffers are another nicety that help back out of branch mispredicts because you don't have to go running back to the I$ and wait two cycles. In fact, while you're going down the codestream for the "predicted taken" path, you can also load up the "not predicted taken" path into a line buffer from an alternate cache such as a BTB (branch target buffer: if the data is available, a TLB entry exists, etc) and bypass the 2 cycle hit on the I$ on the mispredict. Two cycles are two cycles...
Engineers have a very big bag of tricks to work from, but they do have to know when to cut the apron strings and say "out with the old, in with the new." I think the key to major ramp-ups in speed for the x86 architecture is going to be when Intel proclaims "The Great Simplification" (a la "A Canticle for Leibowitz") and deprecates a whole slew of ancient modes (e.g., 286 type stuff) such that they must be emulated through an OS trap. By that time, DOS-based OSes like W9x will be about as common as Win311 is now, so it won't even matter. About the only people who I can see complaining then are VMWare, Netraverse, Plex86, and the WineHQ Team.
In addition to that, LOOK AT THE RAM SPEED!
For christ's sake. The P4 is using PC800 RDRAM and a 400 MHz FSB (100x4). The Athlon is only running a 200 MHz FSB and PC133 SDRAM!!!
I mean, let's be realistic here, folks. The P4 has a 600 MHz clock speed, 667 MHz RAM clock speed, and 200 MHz front side bus advantage.
On Pricewatch, the P4 1.7GHz is $326, 128MB of PC800 is $44, and a P4 mobo is $115.
By comparison, a 1.33GHz Athlon is $120, 128MB of DDR is $17, and DDR boards are $94.
P4 = $485, Athlon = $231
On top of the other advantages, the P4 setup costs $254 more (more than double).
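If you want to check the math, here it is as a few lines of Python. The prices are the Pricewatch figures quoted above; the dictionary keys and variable names are just labels I picked.

```python
# Prices in USD, as quoted above from Pricewatch.
p4 = {"cpu_1.7ghz": 326, "ram_128mb_pc800": 44, "motherboard": 115}
athlon = {"cpu_1.33ghz": 120, "ram_128mb_ddr": 17, "motherboard": 94}

p4_total = sum(p4.values())
athlon_total = sum(athlon.values())
difference = p4_total - athlon_total
ratio = p4_total / athlon_total

print(f"P4 total:     ${p4_total}")
print(f"Athlon total: ${athlon_total}")
print(f"difference:   ${difference}")
print(f"P4 / Athlon:  {ratio:.2f}x")
```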
Anyone want to say the test is fair, or that the P4 is a good deal?
Me neither.
~z
sig?