P4 - The Art Of Compromise · Slashdot Mirror

P4 - The Art Of Compromise

Posted by ryuzaki0 on Thursday December 14, 2000 @01:58PM from the that-o-ring-design-is-fine-in-cold-weather dept.

Buckaroo writes: "Interesting article at EETimes on what Intel's architects originally had in mind for the P4 - 1 slow ALU, 2 fast ALUs, 2 FPUs, 16K of L1 cache, 128K of L2 cache, 1 MB of external L3 cache, etc. - It was all too big and hot, so a bunch of it got the chop." This article sheds new light on the reasons behind performance problems with the chip.

5 of 117 comments

PPC by Chris+Johnson · 2000-12-14 13:53 · Score: 4

So go PPC. 512K cache is _small_ for current PPCs, 1M cache is typical and 2M of cache is possible with the G4s. You don't _have_ to cling to x86 just because an industry is desperately trying to keep it hobbling along. It's possible to not use x86. For that matter, UltraSPARC cache can be up to _four_ megs.
Re:Clock by Detritus · 2000-12-14 13:22 · Score: 4

I've always wondered why one of the vendors hasn't put a divide-by-two circuit on the clock input pin of the microprocessor. Then they could claim that their 800 MHz chip was actually a 1.6 GHz chip.
A friend told me that in the early days of portable transistor radios, some manufacturers would intentionally add non-functional transistors to their radios, just so they could advertise them as N transistor radios, where larger values of N were "better".

--
Mea navis aericumbens anguillis abundat
Next breakthrough by max99ted · 2000-12-14 09:20 · Score: 4

Watched the A&E Biography on Steve Wozniak last night. One of his designs (Apple 1 methinks) was revolutionary for its time - a reduction from about one thousand chips on board to sixty - way ahead of what anyone else was doing.
Maybe it's me, but I can't think of a similar 'breakthrough' advance in recent years. I remember reading somewhere that computers are approaching the 'limits' of current architecture design - we can only crank out so much from today's motherboard/x86 technology. I know that optical computing is slated as the next wave, but I can't help thinking that to bring this to light there needs to be a new "Apple I" breakthrough. Am I off base here?

--

Please stop APK.. you're only hurting yourself.
The real limits, IMO. by Christopher+Thomas · 2000-12-14 11:54 · Score: 5

The real reason for the chip's inherent "performance losses" is the running-string that's slowly being pulled to its breaking point -- that is, the x86 architecture.

Actually, while inconvenient, the x86 architecture isn't as horrible a limit to performance as a lot of people seem to be assuming. The main problem is the extra latency in the decode stage, which lengthens the pipeline somewhat, but the P4's trace cache takes care of that.

The real problem with the P4 is that it has very weird optimization requirements (the whole "bundle" thing) and so needs a very smart compiler if code is to run quickly. Generally, even if compilers like this exist, they aren't used (remember the original Pentium?).

The other problem with the P4 is the long pipeline, which exacerbates stall problems.
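
A back-of-the-envelope sketch of why a longer pipeline makes stalls hurt more, using the textbook cycles-per-instruction model. Every number here (branch frequency, mispredict rate, and the flush penalties standing in for pipeline depth) is an illustrative assumption, not a measured figure for the P4 or any other chip.

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, flush_penalty):
    """Average cycles per instruction once mispredict flushes are counted."""
    return base_cpi + branch_freq * mispredict_rate * flush_penalty

# Hypothetical workload: 20% of instructions are branches, 5% of those mispredict.
# flush_penalty is the number of cycles lost per mispredict (assumed ~ pipeline depth).
shallow = effective_cpi(1.0, 0.20, 0.05, flush_penalty=10)
deep = effective_cpi(1.0, 0.20, 0.05, flush_penalty=20)

print(f"shallow pipeline CPI: {shallow:.2f}")
print(f"deep pipeline CPI:    {deep:.2f}")
```

Under these made-up numbers the deeper pipeline needs roughly a 9% higher clock just to break even on the same code.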

As for architecture in general, heat issues are what's limiting clock speeds (for x86 and non-x86 processors alike). However, the main limit people are noticing is the limit to the number of instructions you can run in parallel. As long as you're executing only one thread, you're not going to be able to squeeze more operations per clock beyond a certain point. The "performance problem" isn't with clock speed - it's with people expecting new chips to do more, clock for clock, than old chips while running serial programs. This parallelism problem affects all chips - x86 and non-x86.

This is why the major manufacturers are starting to look at SMT chips (Simultaneous Multi-Threading) seriously. Running multiple threads in parallel on one chip doesn't take much extra hardware, and makes it *much* easier to schedule concurrent instructions and to keep on running when one instruction stalls (your "Instruction-Level Parallelism" goes up in proportion to the number of threads).
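
A toy model of the argument above, not a simulation of any real core. The 4-wide issue limit and the per-thread pattern of ready instructions are hypothetical, and instructions that do not fit in a cycle are simply dropped rather than carried over, which is a deliberate simplification.

```python
def issued_per_cycle(num_threads, issue_width=4, cycles=1000):
    """Toy model: average instructions issued per cycle on one core."""
    # Hypothetical per-thread count of independent instructions ready each
    # cycle; the 0 stands for a cycle in which that thread is stalled.
    ready_pattern = [2, 1, 0, 1]
    total = 0
    for cycle in range(cycles):
        slots = issue_width
        for t in range(num_threads):
            # Offset each thread so they do not all stall on the same cycle.
            ready = ready_pattern[(cycle + t) % len(ready_pattern)]
            take = min(ready, slots)
            total += take
            slots -= take
    return total / cycles

for n in (1, 2, 4, 8):
    print(n, "thread(s):", issued_per_cycle(n))
```

In this model throughput rises in proportion to the thread count until the issue width is saturated, after which extra threads add nothing.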
The P4 is the world's fastest microprocessor. by ToLu+the+Happy+Furby · 2000-12-14 18:34 · Score: 5

It merely needs recompiled code to perform well.

On what am I basing this apparently heretical statement? On SPEC_CPU2000, the most demanding, well balanced, most respected cross-platform CPU benchmark in the world. As you can see if you peruse these lists, the P4/1500 has the highest scores of any shipped CPU in the world, both in SPECint (base and peak) and in SPECfp (base only).

Before any of you reply and think you've caught a mistake, the Alpha EV67/833 is *not* publicly available, and won't be until January, at which point it will take back leadership in SPECfp_base and SPECint_peak. Of course, the P4/1700 will probably take back the lead when it's released in March or so. Indeed, the P4 and Alpha will likely trade the top SPEC spot back and forth at least until the EV68 (EV67 moved to .18 um process and with on-die L2 cache) makes an appearance (Q2?), if not all the way until the EV7 (EV68 with integrated on-chip *8-channel* RDRAM controller) is released (Q4?).

This is why all this banal talk about the P4 being a crappy chip or (in the wake of this article) a "crippled" chip is ignorant drivel. SPEC_CPU is an exceptionally well designed, balanced, and comprehensive benchmark stressing a CPU to its limits in all sorts of ways. Why then the P4's disappointing performance on all those other benchmarks? They are all on "legacy" code--code compiled with the P6 core in mind. Because the P4 represents the first chip with a new core architecture (the horribly misnamed "NetBurst" core) from Intel in 5 years, it has a lot of pretty radical design features which don't take well to code compiled for the P6 core. While this means the P4 is a pretty useless (or at least very overpriced) solution to running today's code--and indeed, most code released for at least the next year or so--it has nothing to do with how good a *design* it has, which is ostensibly the point of this discussion. Indeed, the PPro--the first P6 core chip--posted very "disappointing" benchmarks on legacy code when *it* was released 5 years ago; many observers wrote it and the P6 core off as underperforming overdesigned wackiness from Intel. It was arguably the most successful and innovative CPU core ever. Not so incidentally, this was strongly foreshadowed by its brief theft of the SPECint95 performance crown from the top Alpha of the time...

Now to dispense with the most repeated "points" we've seen thus far.

1) "This just goes to show that x86 is a dead ISA with no headroom to grow." Not the most unexpected statement to be found on /., but let's just say that the other 99.99% of the world that enjoys backwards compatibility will make sure x86 stays alive for quite a long time to come, thank you. On a technical (rather than marketing) level, though, this is ridiculous bunk as well, as the fact that the P4 beats every released 64-bit 10-times-as-expensive RISC chip with 30-times-as-expensive platforms, on SPEC_CPU--a benchmark specifically designed to stress exactly those high-performance situations demanded of professional level workstation and server machines--demonstrates quite nicely.

Yes, x86 is a bad ISA, and yes it presents a problem to be overcome by chip engineers. But it has been overcome and will continue to be overcome--today by tacking on a decoding stage to x86 processors that turns x86 instructions to RISC-like instructions for internal operations (taken out of the critical path by the P4's trace cache), and tomorrow perhaps by dynamic recompilation software a la Transmeta, IBM's DAISY, and HP's Dynamo, techniques which are still in their infancy and *may* end up providing better-than-compiled performance even without the benefit of converting to a more optimal ISA. The other negative of the x86 ISA, namely the paucity of compiler-visible registers, is indeed a problem, although one partially alleviated by rename registers and partially by evolutionary extensions to the x86 ISA, such as SSE2, which will eventually replace much of the god-awful stack-based x87 FPU ISA.

The real question is, does the performance hit generated by sticking with x86 exceed the performance gain generated by having a much larger target market, and thus more money to spend keeping up with the latest process technology and thus getting faster clocked CPUs? The answer thus far has been a rather resounding "no"--that is, the economies of scale granted by staying x86 have meant processors which are outright faster and cost much much less.

After all, there is no doubt that were the Alpha not around 18 months behind Intel in terms of process technology, the EV67 would be much faster than the P4. On the other hand, the EV67 gets to take advantage of resources that Intel could never dream of in a mainstream chip--like a 300+mm^2 die size, extra wide memory buses, and 4-8MB L2 caches--because of the tremendous added cost. And even with all that plus what is widely acknowledged as the best CPU design team on the planet, the Alpha only manages to keep up with the P4.

Moreover, the rest of the 64-bit world--despite the same advantages as the Alpha (well, except their design team)--can barely keep up with the P3, and that's a 5 year old design. They may be available in multi-chip boxes scaling to kingdom come, but on the level of individual chips, the best that Sun, IBM, HP or MIPS has to offer is pretty lame, despite all the advantages of a RISC ISA. Of course, the same old folks will be claiming that x86 is an inherent dead end when the P4 (or whatever Intel is calling its current NetBurst core by then) scales past 4 GHz two years from now, well ahead of anyone in the RISC world. And we'll hear it again in 4 or 5 years, when Intel releases another all-new x86 core.

2) "The P4 should have left in all those features this article talks about." Uhhuh. Sure. Um...now, who would know more about this? Would that be you, having read some article on the Internet? Or would that be Intel's engineers who maybe understand the P4 core and the issues involved with these features a bit better than you, and who had the benefit of cycle-perfect simulations on dozens if not hundreds of possible P4 variants running every conceivable type of code??

If there's a feature which doesn't make it into a finished CPU, it's because of one of two reasons:
1) The designers didn't think of it;
2) The designers couldn't figure a way to implement it and make it work with the rest of their design in such a way that it raised performance/cost.

Needless to say, "The designers thought of it, implemented it (which they did in this case), and it was a good feature (i.e. improved performance/cost on a majority of code), but then made a boneheaded decision not to use it," is *not* on the list.

IMO, the features listed here are all better off gone from the current P4. The only really intriguing one--another FPU--was *not* left off for die size considerations (i.e. cost): FPUs are not very big. It was left off for performance issues. You see, while "more is better" sounds like a nice philosophy, adding an extra FPU would have meant extra decoding and routing logic in the FP section of the chip. Considering Intel actually went to the considerable trouble of implementing this feature and then decided against it, it is very likely that this extra logic was in the P4's critical path. Thus while including the extra FPU would have meant extra performance/clock, it would have meant lower overall clock speeds. Obviously Intel felt the tradeoff worked better without the extra FPU than with it.
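
The tradeoff described above comes down to throughput = work per clock x clock rate. A minimal sketch with invented numbers follows; the 10% per-clock gain and the 15% clock loss are pure assumptions for illustration, since Intel's actual simulation results are not public.

```python
def throughput(ipc, clock_ghz):
    """Billions of instructions per second = instructions/clock * clocks/second."""
    return ipc * clock_ghz

# Hypothetical: a second FPU buys 10% more FP work per clock, but the extra
# logic in the critical path costs 15% of the achievable clock speed.
one_fpu = throughput(ipc=1.00, clock_ghz=1.50)
two_fpu = throughput(ipc=1.10, clock_ghz=1.50 * 0.85)

print(f"one FPU: {one_fpu:.4f}")
print(f"two FPUs: {two_fpu:.4f}")
```

With these assumed figures the two-FPU design comes out slower overall; different assumptions could flip the result, which is exactly what a cycle-accurate simulator is for.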

If you "disagree" with their decision, please refer to the cycle-perfect simulators which Intel has and you don't, and the P4/1500's SPECfp2000 score which is a mere, oh, 68% better than the fastest P3. Also you might note that the P4 is scaling quite well with clock speed on SPECfp, that it will spend most of its life at speeds well above 2 GHz, and that it will likely sell most (at least for the next 2 years) in combination with a memory subsystem providing *less* bandwidth than the current dual-RDRAM i850 chipset--all of which point to this being a very smart decision on Intel's part. (The reasoning is this: if the P4's FPU can already keep up quite nicely with a larger memory bandwidth, then why increase FPU power/clock when most P4s will have higher clocks and lower bandwidth to keep them fed?)

As for the features I'd like to see added to the P4 when it moves to its .13 um Northwood variant next summer: one of them was on the list, i.e. a 16 KB L1 data cache. The reason it was left off was clearly not die size but clock scalability--Intel decided having a 2-cycle latency L1 was more important than having a bigger one, and I totally agree. After the move to .13, though, perhaps a 16 KB 2-cycle L1 will no longer limit clock scalability, just as the PPro's 8 KB L1s were expanded to 16 KB each with the PII. The other, a 512 KB L2, would take up much too much die space at .18 um to be feasible; it, too, may make it to Northwood, depending on Intel's target die size. Needless to say, whatever they decide, it will be a much better informed decision than I or anyone here could presume to make.
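
For anyone who wants to play with the small-and-fast versus big-and-slow L1 question, here is a sketch using the standard average-memory-access-time formula. The miss rates, the 3-cycle latency for the larger cache, and the 10-cycle miss penalty are all hypothetical placeholders, not P4 measurements.

```python
def amat_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time: hit time plus the expected cost of a miss."""
    return hit_cycles + miss_rate * miss_penalty_cycles

# Hypothetical: the smaller L1 hits in 2 cycles but misses more often;
# the larger L1 misses less but is assumed to need a 3rd cycle to hit.
small_fast = amat_cycles(hit_cycles=2, miss_rate=0.06, miss_penalty_cycles=10)
big_slow = amat_cycles(hit_cycles=3, miss_rate=0.04, miss_penalty_cycles=10)

print(f"small 2-cycle L1: {small_fast:.1f} cycles")
print(f"large 3-cycle L1: {big_slow:.1f} cycles")
```

Under these assumptions the extra cycle of hit latency costs more than the lower miss rate saves, which is the shape of the argument for keeping the L1 small.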