Slashdot Mirror


P4 - The Art Of Compromise

Buckaroo writes: "Interesting article at EETimes on what Intel's architects originally had in mind for the P4 - 1 slow ALU, 2 fast ALUs, 2 FPUs, 16K of L1 cache, 128K of L2 cache, 1 MB of external L3 cache, etc. - It was all too big and hot, so a bunch of it got the chop." This article sheds new light on the reasons behind performance problems with the chip.

35 of 117 comments

  1. Another compromise by Anonymous Coward · · Score: 2

    Also, the name of the chip was supposed to be "Pentium IV", but their studies indicated that 32% of the surveyed Americans pronounced it "Pentium Eve", and another 26% pronounced it "Pentium I've". So they decided to write it "Pentium 4" instead.

  2. Re:A simple question... by sjames · · Score: 2

    Your latency is just as bad as with 3 GB/sec, and for most things you're likely to do, that's more important.

    It's all a matter of cost. If cost isn't a problem, prefetch can be used along with cache to help minimize the latency issue. This works especially well if there is a prefetch instruction that the compiler (or even the programmer, through a pragma) can issue.
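    A back-of-the-envelope sketch of why prefetch helps with latency, in Python. The model and every number in it (latency, transfer time per cache line, compute time per iteration) are hypothetical assumptions chosen for illustration, not measurements of any real memory system. Without prefetch each iteration pays the full latency; with ideal prefetch only the first load's latency is exposed, and the loop then runs at the slower of compute and transfer.

```python
def loop_time_ns(iterations, compute_ns, latency_ns, transfer_ns, prefetch):
    """Estimate the run time of a loop that loads one cache line per iteration.

    Hypothetical model:
      - no prefetch: every iteration waits latency + transfer, then computes.
      - ideal prefetch: loads are issued far enough ahead that they overlap
        with compute, so after the first line arrives the loop advances at
        max(compute, transfer) per iteration.
    """
    if not prefetch:
        return iterations * (latency_ns + transfer_ns + compute_ns)
    first = latency_ns + transfer_ns + compute_ns
    return first + (iterations - 1) * max(compute_ns, transfer_ns)


if __name__ == "__main__":
    # All four numbers below are made up for the example.
    for flag in (False, True):
        total = loop_time_ns(1000, compute_ns=10, latency_ns=100,
                             transfer_ns=20, prefetch=flag)
        print("prefetch" if flag else "no prefetch", total, "ns")
```

    Under these assumed numbers the prefetched loop ends up limited by transfer time rather than latency, which is the situation where extra bandwidth actually pays off.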

    I imagine the issue there is related to the extra hardware complexity needed to make that happen.

  3. Not again. Okay, one more time... by Chris+Burke · · Score: 2

    Sigh. What was that .sig about the label "insightful" saying more about the moderator than the moderated?

    Look, we've been over this before, but I'll say it again. Yes, Intel's ISA sucks. No, it isn't "slow". IA32 hasn't been directly implemented in a CPU core since the Pentium MMX. The only parts of the P6 and later designs that deal with x86 are the decoders. Internally they use RISC-like micro-ops, which is convenient since most compilers only use the simpler x86 instructions. In the P4 the situation gets even better, since the trace cache holds decoded information -- the decoders aren't even in the critical path! Which is why x86 processors are able to compete successfully with RISC cores on performance.

    Or, in short, IA32 itself has nothing to do with the P4's lack of oomph. Which should be obvious, since the things it's being compared unfavorably to are other x86 processors!

    Ahem. As to the 'not much performance decrease'... Well, maybe a re-read is in order, or at least a re-think.

    The 5% was for cutting the FPU's area in half, not the whole chip! 5% is a huge effect on overall performance for a change in just one part of the architecture. For something that relied solely on FPU performance (the Photoshop and 3dsmax benchmarks AMD does so well on), it would certainly be much more than 5%.

    And that was just one number for one change. That no other specific numbers were given doesn't imply that they were 0%. I'd say it's more likely that they are larger than 5%.

    Originally, the L1 data cache was supposed to be 16K, accessed in 1 cycle. That wouldn't work, so instead of increasing the access time they cut the size in half. I guarantee you that cutting the size of the L1 in half has a big impact on performance.

    An off-chip L3 would have been nice, too. Especially when paired with high-latency RDRAM. This would have had a huge impact on performance, especially in benchmarks that are sensitive to memory latencies. Doubling the size of the L2 (but increasing access time as a result) probably doesn't mitigate this much.

    The P4 of the Intel architects' dreams would have smoked. Instead we have what we have. x86 has nothing to do with it... economics and engineering reality do.

    Lastly, Itanium is going to suck. Intel has said as much themselves. It's neat technology, but not well designed. It's the Daikatana of the chip industry -- a running joke that some people hope will come off well anyway, but who will inevitably end up disappointed.

    --

    The enemies of Democracy are
  4. Re:Registers and scheduling. by Christopher+Thomas · · Score: 2

    However, it turns out that register renaming prevents a lot of the stalling that you'd expect with so few registers (write-after-write hazards vanish). Thus, while the small number of registers does degrade performance, it doesn't degrade it catastrophically.

    [Emphasis added.]

    There are more things to fetch, decode, schedule, execute and retire. What's good about that?

    As clearly stated in my original post - nothing at all. However, I question how much of an _impact_ the bad side effects have in practice. It's non-negligible, but that leaves a lot of territory open.

    It hurts. A lot. Check out some of our papers on the subject, especially this one, which contains many references to other work.

    Done. I compliment you on your fascinating approaches to register use optimization. However, most of your works focus on how the program use and physical performance of a register file of a given size may be improved. The dependence of performance on register file size is only studied in one document ("The Need for Large Register Files in Integer Codes"), and the advantage of a relatively large register file (at least for the 64-vs-32 case) is found to be relatively modest (5%-20%).

    A factor of two speed difference makes a processor unmarketable. A 20% speed difference doesn't (witness the holy war still going on between Intel and AMD proponents).

    The effect of a small register file is undoubtedly more severe as size decreases, but I have yet to see evidence of truly earth-shattering performance impacts. Circumstantial evidence suggests that the effect is not earth-shattering (SPECmarks for high-end workstation chips fail to thoroughly trounce SPECmarks for x86 chips for comparable configurations, and the PowerPC architecture fails to blow x86 out of the water).

    Most certainly, a larger register file is nice, and causes a speed improvement - but the effect of a small register file does not seem to be as devastating in practice as you appear to be suggesting above.

  5. Registers and scheduling. by Christopher+Thomas · · Score: 2

    You raise several good points; however, it turns out that there are a few mitigating factors.

    First, the lack of registers in the x86 architecture. Having a fast cache is great, but it's not as fast as a register, and it takes extra instructions to load and store

    This is true, and greatly hampers things like loop unrolling on the x86. However, it turns out that register renaming prevents a lot of the stalling that you'd expect with so few registers (write-after-write hazards vanish). Thus, while the small number of registers does degrade performance, it doesn't degrade it catastrophically.
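    The renaming argument can be sketched in a few lines of Python. This is a toy model, not how any particular x86 core implements it: it assumes an unlimited pool of physical registers (`p0`, `p1`, ... are made-up names) and ignores freeing registers at retirement. Each write gets a fresh physical register, so two writes to `eax` no longer collide, while true read-after-write dependencies are kept.

```python
from itertools import count


def rename(instructions):
    """Map architectural registers to fresh physical registers.

    instructions: list of (dest, sources) tuples, e.g. ("eax", ("ebx", "ecx")).
    Returns the same program expressed in physical register names.
    """
    fresh = count()
    mapping = {}   # architectural name -> physical register holding its value
    renamed = []
    for dest, sources in instructions:
        new_sources = []
        for src in sources:
            if src not in mapping:          # value that was live on entry
                mapping[src] = f"p{next(fresh)}"
            new_sources.append(mapping[src])
        mapping[dest] = f"p{next(fresh)}"   # every write gets a new register
        renamed.append((mapping[dest], tuple(new_sources)))
    return renamed


def waw_hazards(instructions):
    """Count instructions whose destination was already written earlier."""
    written = set()
    hazards = 0
    for dest, _ in instructions:
        if dest in written:
            hazards += 1
        written.add(dest)
    return hazards


# Hypothetical four-instruction program that reuses eax for unrelated values.
program = [
    ("eax", ("ebx", "ecx")),
    ("edx", ("eax",)),
    ("eax", ("esi", "edi")),   # write-after-write on eax
    ("ebx", ("eax",)),
]

if __name__ == "__main__":
    print(waw_hazards(program), waw_hazards(rename(program)))
```

    The second write to `eax` lands in a different physical register than the first, so the two independent computations can proceed without waiting on each other; what renaming cannot remove is the spill traffic when there are more live values than register names.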

    Second is the relatively finer granularity of the instructions available on a RISC architecture. Although there is some merit to making decisions based on information only available at runtime, that isn't a big factor with today's technology. What a modern x86 looks like is a microcode architecture with somewhat intelligent scheduling of the instructions. In most cases a compiler could do a better job.

    Actually, since the Pentium Pro, x86 processors have been fundamentally RISC-ian. x86 instructions are decoded into "micro-ops" (Intel's term), which are essentially RISC instructions. These can be scheduled by the processor as effectively as RISC instructions.
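    As a rough illustration of that decoding step, here is a toy Python decoder. The micro-op names (`load`, `store`) and temporaries (`tmp0`, `tmp1`) are invented for the example; Intel's real micro-op format is not public, so treat this only as a sketch of the idea that an instruction with a memory operand splits into simple load/operate/store steps.

```python
def is_memory(operand):
    """Treat anything written in brackets, e.g. "[esi]", as a memory operand."""
    return operand.startswith("[")


def decode(op, dest, src):
    """Split one two-operand x86-style instruction into RISC-like micro-ops.

    Returns a list of tuples. Register-only instructions become a single
    three-operand micro-op; memory operands add explicit loads and stores.
    """
    uops = []
    value = src
    if is_memory(src):
        uops.append(("load", "tmp0", src))
        value = "tmp0"
    if is_memory(dest):
        # read-modify-write on memory: load, operate, store back
        uops.append(("load", "tmp1", dest))
        uops.append((op, "tmp1", "tmp1", value))
        uops.append(("store", dest, "tmp1"))
    else:
        uops.append((op, dest, dest, value))
    return uops


if __name__ == "__main__":
    for insn in [("add", "eax", "ebx"), ("add", "eax", "[esi]"), ("add", "[edi]", "eax")]:
        print(insn, "->", decode(*insn))
```

    Once the instruction stream is in this form, the scheduler only ever sees simple operations, which is the sense in which the core is "RISC-ian".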

    The decoding adds latency, but that's what the P4's "trace cache" is for. Arguably, a compiler with access to the underlying RISC instruction set could do better scheduling, but in practice the gain is marginal (especially since most people don't seem to use really-good compilers). I also have a sneaking suspicion that basic blocks in most code are small enough to fit inside the processor's scheduler window, which means that the compiler probably _wouldn't_ do a better job in most cases than the hardware scheduler. Higher-level transformations like loop unrolling have benefit even if done at a CISC level.

    In summary, I'm not sure there's a very big performance hit from the instruction granularity (just a silicon hit).

    I am impressed with your knowledge of the subject, though.

  6. Re:I don't know what to say by nosferatu-man · · Score: 2

    This is the best thing I've read on /. in weeks.

    (jfb)

    --
    To spur "enterprise Linux," Big Bang, the distributed two-phase commit.
  7. This is an all too familiar story by Roy+Ward · · Score: 2

    Is there any generation of processor development in which this sort of thing hasn't happened?

    There seem to be two constants in processor generations that I can see:

    (1) The new generation always takes longer, runs slower, and has more things taken out than was originally suggested,
    (2) the old generation gets ramped up to clock speeds way beyond what was originally anticipated while we wait.

    I remember when the PowerPC G4 was going to have a lot of changes including multi-core and run at most of a GHz - what we eventually got was in some ways a smartened up (mostly fp & memory improvements) G3 + AltiVec. Meanwhile IBM keeps making the G3s faster and faster.

    Then of course there is the story of how the whole x86 architecture wasn't supposed to get this far before being replaced by something with less cruft.

    1. Re:This is an all too familiar story by hattig · · Score: 2
      1) The new generation always takes longer, runs slower, and has more things taken out than was originally suggested

      Guess you haven't seen the specs for the new Alphas then. Instruction throughput per clock is still higher than the previous version, whereas with the P4 it is less than the PIII. One of these processors is going in the right direction.

      Shame that Alphas cost so flippin' much. But considering the 150 million transistors on the 21364 and the 300-400 million transistors on the 21464 this can be understood I suppose. 1.5MB of on-die cache adds to this though. I want to read more about the SMT the 21464 is using though.

      Oh, yes, the Alpha article can be found from the front page of AMDZone.

  8. P4 is not a suitable server processor ... yet by hattig · · Score: 2
    Stoopid < character - have to wait for Slashdot to allow me to post, and then not tell me that the comment has been posted already... Knew I should have Previewed the post...

    especially servers (which is currently the only thing P4s are worthwhile in right now)

    Erm, riiiight. The lack of dual and quad processor capable P4s and P4 motherboards is a major reason why the P4 will not be used in servers in any company that thinks its policies through.

    The fact that the CPU and chipsets are also unproven will mean that corporates will hold back. They also realise that for servers, the FPU performance is not an issue, nor is the presence of amazing SIMD capabilities. Multi-processor capability yes, CPUs with lots more cache than 256K yes.

    The power requirements of the P4 are also staggering - up to 75W. Compare this with the "too hot" Athlon at 40-60W, and with the <20W Palomino 1.5GHz coming out in March/April next year... AMD760MP chipset... Foster not arriving for another year... PIIIs not getting any faster... AMD have to leap at the chance to get a foothold in the multiprocessor server market in the next 6 months.

    I for one will jump at the chance to build dual 1.5GHz Palomino based servers (and home computers!) that use less power than ONE P4 at 1.5GHz, use DDR SDRAM, not RIMMs, cost less and have over a year's market life behind them so they can be seen as a proven solution.

    I am more interested in the Alpha though - 8 channel RAMBUS for 10GB/s bandwidth! Shame I can't afford them :-(

  9. Re:herm... by hattig · · Score: 2
    especially servers (which is currently the only thing P4s are worthwhile in right now)

    Erm, riiiight. The lack of dual and quad processor capable P4s and P4 motherboards is a major reason why the P4 will not be used in servers in any company that thinks its policies through.

    The fact that the CPU and chipsets are also unproven will mean that corporates will hold back. They also realise that for servers, the FPU performance is not an issue, nor is the presence of amazing SIMD capabilities. Multi-processor capability yes, CPUs with lots more cache than 256K yes.

    The power requirements of the P4 are also staggering - up to 75W. Compare this with the "too hot" Athlon at 40-60W, and with the coming out in March/April next year... AMD760MP chipset... Foster not arriving for another year... PIIIs not getting any faster... AMD have to leap at the chance to get a foothold in the multiprocessor server market in the next 6 months.

    I for one will jump at the chance to build dual 1.5GHz Palomino based servers (and home computers!) that use less power than ONE P4 at 1.5GHz, use DDR SDRAM, not RIMMs, cost less and have over a year's market life behind them so they can be seen as a proven solution.

    I am more interested in the Alpha though - 8 channel RAMBUS for 10GB/s bandwidth! Shame I can't afford them :-(

  10. This is about the Pentium 4, not AMD, you know? by Junks+Jerzey · · Score: 2

    When I saw the headline, I cringed and thought "Oh no, so many of the messages are going to include the acronym 'AMD' in the first sentence." Ugh. And it turned out to be horribly true. I can barely wade through this stuff. Had I enough moderator points, I would tag them all as either "offtopic" or "troll."

    AMD zealots, I mean this seriously: You have moved past the realm of simply enjoying a product to becoming annoying zealots, like Jehovah's Witnesses. Please, please, please, consider taking a lower key "live and let live" approach. As it is, I think many companies shy away from anything involving the term "Linux" because they know what kind of people come swarming around when they hear that word.

    This is not a troll, nor a flame. It's a gentle suggestion that the rabid, juvenile AMD advocacy is doing harm in at least my particular case. I doubt I am alone.

  11. This is not a story of failure by Junks+Jerzey · · Score: 2

    It seems that this story is being greatly misinterpreted. It is not a story of failure, it is a story of engineering.

    Of course every geek would like a processor that has 500 integer units, 200 floating point units, and a gigabyte of on-chip RAM. But cost, development time, power consumption, heat, and reliability all come into the picture. The P4 team started with lofty goals and scaled them back to meet reality. That's how any hardware or software engineering project works. How often do you hear people say "We added tons of extra features, had better performance than projected, and finished six months early"?

    A good many consumer hardware junkies don't understand that "faster, faster, more, more, more" is not a worthy goal. The goal is "good performance given real-world constraints." I know that people who would willingly pay $500 for a video card don't understand this, but this is how engineering of commodity items works. AMD has exactly the same set of constraints. It's not like AMD engineers can magically solve all of these problems. If anything, perhaps AMD is keeping their sights lower, so they don't have to scale back as much in the end.

  12. Re:The P4 is the world's fastest microprocessor. by Junks+Jerzey · · Score: 2

    It merely needs recompiled code to perform well.

    This has been said often enough for so many different processors that it has become trite. From experience, extra bits of compiler optimization rarely pay off in a big way. Quite often, it is impossible to tell the difference between minimal and full optimization settings. I suspect that contrived examples are being used for benchmarks, such as an image filter that takes 10 seconds to run and spends all its time inside of a 16 instruction loop. Sure, one tweak to the scheduler will make it run in 8 seconds instead, but how realistic is this? It isn't a win in the general case.

  13. Re:The P4 is the world's fastest microprocessor. by ToLu+the+Happy+Furby · · Score: 2

    This has been said often enough for so many different processors that it has become trite. From experience, extra bits of compiler optimization rarely pay off in a big way. Quite often, it is impossible to tell the difference between minimal and full optimization settings. I suspect that contrived examples are being used for benchmarks, such as an image filter that takes 10 seconds to run and spends all its time inside of a 16 instruction loop. Sure, one tweak to the scheduler will make it run in 8 seconds instead, but how realistic is this? It isn't a win in the general case.

    That's why I was talking about SPEC_CPU, the most comprehensive and well balanced CPU benchmark suite on the planet, and not some crappy toy benchmark. Indeed, the P4 does very well on recompiled toy benchmarks as well, but I didn't mention them because they don't tell us anything useful.

    FYI, SPEC_CPU is about as far from some "image filter that takes 10 seconds to run and spends all its time inside of a 16 instruction loop" as one can get. Indeed, it is a suite consisting of no fewer than 26 benchmarks, each designed to stress different algorithmic and data set size combinations, and each very non-trivial. It is the industry's only truly cross-platform benchmark, and it is designed and revised every few years by a committee consisting of some of the foremost experts on high-performance and scientific computing, and advised by every significant MPU vendor to assure fairness. It does not, as you imply, allow any hand-tweaking of assembly code, nor--like most benchmarks--does it come in the form of precompiled binaries which may favor one platform over another. Instead, it comes completely as source code, to be compiled by a vendor supplied compiler--which must be publicly available within a certain time frame--under very specific regulations. The "base" and "peak" categories refer to different levels of allowable customization in the compiler settings, and indeed all compiler flags used must be revealed along with the results. And rather than taking 10 seconds, a full SPEC_CPU run takes a couple of hours even on a P4 or high-end Alpha; on the reference machine (i.e. a SPEC_CPU2000 score of 100) it would take something like 12 hours!
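    For reference, the scoring can be sketched as follows, assuming the usual SPEC CPU2000 scheme: each benchmark's ratio is the reference machine's run time divided by the measured run time, times 100, and the overall score is the geometric mean of the ratios. The run times in the example are hypothetical, not published results.

```python
import math


def spec_ratio(reference_seconds, measured_seconds):
    """Per-benchmark ratio; the reference machine itself scores 100."""
    return 100.0 * reference_seconds / measured_seconds


def spec_score(ratios):
    """Overall score: geometric mean of the per-benchmark ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # (reference time, measured time) pairs in seconds -- made-up numbers.
    runs = [(1400, 350), (1800, 300), (1100, 275)]
    ratios = [spec_ratio(ref, got) for ref, got in runs]
    print(round(spec_score(ratios), 1))
```

    Because the mean is geometric, one benchmark with a spectacular ratio cannot carry the suite the way it would with an arithmetic mean, which is part of why a single tuned loop does not move the overall number much.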

    So, nice try. But trust me, the only way to beat SPEC_CPU is to build a really fast CPU. It also helps to have an amazing compiler--which Intel does with its VTune 5.0 compilers--but that allows nowhere near the potential for unfair binaries that precompiled benchmarks do. Also, being aimed at the high-performance market rather than the PC market, SPECfp2000 has been criticized by some as "unfairly" rewarding the very large memory bandwidth of the P4 compared to the P3 and SDR SDRAM Athlon. For an IMO interesting technical discussion of this issue, you might want to see this thread over at Ace's Hardware. (See if you can guess who I am. :)

  14. Re:herm... by maraist · · Score: 2

    Though I generally agree with you, the performance-per-clock point is not really representative as a test. If the PIII maxes out at 1GHz, and the Athlon is pretty close to its limit at 1.3GHz, and the P4 debuts at 1.5GHz, then the likelihood that the total performance will be higher on a P4 half a year from now is pretty good.
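    The clock-headroom argument is simple arithmetic: delivered performance is roughly work per clock times clock rate. A minimal Python sketch, where the per-clock figures are hypothetical placeholders rather than measured IPC for any of these chips:

```python
def throughput(work_per_clock, clock_ghz):
    """Relative performance: work done per clock times clocks per second."""
    return work_per_clock * clock_ghz


def break_even_clock_ghz(work_per_clock, rival_work_per_clock, rival_clock_ghz):
    """Clock a chip needs in order to match a rival with better per-clock work."""
    return rival_work_per_clock * rival_clock_ghz / work_per_clock


if __name__ == "__main__":
    # Assumed for illustration: the newer chip does 20% less work per clock.
    older = throughput(1.0, 1.3)
    newer = throughput(0.8, 1.5)
    print(older, newer, break_even_clock_ghz(0.8, 1.0, 1.3))
```

    With these assumed figures the higher-clocked chip is still slightly behind at launch but pulls ahead once it passes the break-even clock, which is the scenario being described.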

    Additionally, apps that really need the horsepower are going to be recompiled to take advantage of the new pipelines and SIMD instructions. What is remarkable is that this new CPU design can even keep up with pre-existing apps without recompilation.

    As for having to buy new boards, cases etc.. Do you really think that the majority of CPU purchases are made by people who would even think about opening up their cases? Though I don't have numbers, it is my understanding that businesses are the primary purchaser of computers; especially servers (which is currently the only thing P4s are worthwhile in right now). They don't upgrade; they get all new machines. So the fact that the power supplies are different is irrelevant. Even in light of the fact that they'll be marginally more expensive because of their newness.

    Personally I'm not satisfied with the P4. But that probably doesn't really interest Intel too much. They have the highest clock speed and will soon have some of the highest benchmarks.. And brilliant IT people will see these numbers and hurd.

    -Michael

    --
    -Michael
  15. Re:herm... by -brazil- · · Score: 2
    And brilliant IT people will see these numbers and hurd.

    Now *that* is a sight to see: "Hurd" and "brilliant" in the same sentence! :)

    --

    The illegal we do immediately. The unconstitutional takes a little longer.
    --Henry Kissinger

  16. The problem with ever smaller chips by Kwelstr · · Score: 2

    Intel has discovered a physical limit on how small a chip can be, because there are only so many small people alive at any given time who can actually work on all those tiny transistors. So, unless they resort to massive cloning of the littlest of people, we will reach a limiting factor on both the size and number of chips created.

    Having said that, I am sure our newly elected president will delegate to some smart people the task of figuring out either cloning or chips, whichever comes first.

    Also, on the same subject, in Europe they finally figured out the Royal Family's behaviour for the last couple of centuries: it's a human strain of Mad Cow Disease.

    --


    ~~~Please pass the salt, I hate unsalted MD5s :-/
  17. Re:Hmmm by BaronM · · Score: 2

    OK - you're wrong. The 80486 was the first x86 from Intel to integrate a math coprocessor. The "original" 80486 did not have a suffix, ran at 25MHz, and produced enough heat to brew a decent cup of tea. Later, the 80486SX was introduced, being an 80486 without an FPU, and the original 80486 was renamed the 80486DX. The 80386 did not have an integrated FPU, but could work with the 80387 or 80287 running at an equal or greater speed. The 80386 was introduced at 16MHz. The 80386SX, introduced later, was an 80386 with the external interface necked down to 16 bit to allow for cheaper system designs. There was an 80386 variant from Intel with an integrated FPU: the FastCad386. The FastCad chipset (yes, chipSET) replaced both an 80386 and an 80387 with an integrated chip and a 'dummy' chip for the 80387 socket, and provided a modest performance boost over the stock 80386/80387 combination. There was also an 80386SL variant, which had advanced (for the time) power management features and was intended for mobile applications. 80386 chips were produced by other manufacturers, including IBM and AMD, under license from Intel, since multiple-sourcing of CPUs was common at that time. Eventually, some of these other companies produced yet more "386" chips, including the IBM BlueLightning, and AMD's AM386-40, which was the fastest commercial 80386 ever produced to my knowledge. Enough?

  18. Re:Marketing is not engineering by istartedi · · Score: 2

    Maybe not, but cost analysis is engineering. Don't believe me? Next time your manager asks you to outline your approach to a new problem, present him with something that requires 10,000 developers and a $60 billion equipment expenditure.

    Yes boss, our next server should use the Hoover dam as a power supply, and hand-wound relays instead of transistors for the processor core. Actually, I'd kind of like to see that...

    --
    For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
  19. Re:Sure by ca1v1n · · Score: 2

    That's why AMD had their DX/5 running at 133 and 160 MHz on 33 and 40 MHz busses respectively, to compete with Pentiums while the K5 (and K6, and K7) was in development. AMD had to differentiate somehow, so they followed something like the Intel pattern, which everyone was familiar with. My 486-133 wasn't the fastest chip on the market, but the money my dad saved on it allowed him to buy more RAM than your average P-120 had, so it performed better.

  20. Re:What is the DEAL with CPU cache!? by muyThaiBxr · · Score: 2

    They did that because 256K of FULL PROCESSOR SPEED cache increases performance more than 512K running at HALF PROCESSOR SPEED.
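    One way to see the tradeoff is the textbook average-access-time formula: hit time plus miss rate times miss penalty. The sketch below is in Python, and the cycle counts and miss rates are hypothetical numbers picked for illustration, not figures for any real P III or P4 cache.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time, in CPU cycles, for one cache level."""
    return hit_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    # Assumed: the half-speed cache takes twice as long to hit, and doubling
    # the size only trims the miss rate from 4% to 3%.
    small_fast = average_access_cycles(hit_cycles=7, miss_rate=0.04,
                                       miss_penalty_cycles=150)
    big_slow = average_access_cycles(hit_cycles=14, miss_rate=0.03,
                                     miss_penalty_cycles=150)
    print(small_fast, big_slow)
```

    Under these assumptions the smaller, faster cache wins because every access pays the hit time while only a few percent pay the miss penalty. A workload whose miss rate drops sharply with the larger cache would tip the result the other way.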

  21. herm... by fjordboy · · Score: 2

    Compromise? Naww... I think I will just stick with AMD, thank you very much. I would rather not buy a new mobo, power supply, case, heatsink, memory etc. If Intel wants to sell this puppy, they should have made it compatible with the hardware that is already around. SDRAM is fine by me for now. /me avoids aiding companies that are predatory and monopolistic.

  22. Re:DO PEOPLE STILL EVEN USE INTEL CHIPS THESE DAYS by Fervent · · Score: 2

    Funny, I'm typing this on a PIII Speedstep laptop. They make excellent mobile chips for machines.

    --

    - I don't care if they globalize against free speech. All my best free thoughts are done in my head.

  23. Marketing is not engineering by ModelX · · Score: 2
    I find this article rather uninformative and mostly marketing inclined.

    The first reason is: nearly all high tech projects start with rosy goals and then reconsider when they know exactly what is essential and what is feasible. So all this crap about "we wanted to do it better" is pure marketing. If they really could do it, they would, because they need something to fight Athlon.

    The second reason is: the article does not tell anything about the compromises necessary to reach high MHz for the sake of marketing.

    And the third reason is: the article does not even hint at the possibility that P4 might have been castrated to not appear much better than Itanium/McKinley in floating point.

  24. A simple question... by vishakh · · Score: 2

    Why has the BUS speed screeched to a halt at 400 MHz when the processor itself is beyond 1.5 GHz? I don't think these grandiose plans make any sense unless something is done about the very basics.

    --

    Posting messages for the betterment of humanity..

    1. Re:A simple question... by Petrophile · · Score: 2

      You have just made the most fatal Slashdot error:

      The only thing that matters is Quake benchmarks.

      Keep telling yourself that, or we'll send you to the re-education camp.

  25. Other limits exist within the architecture by Pink+Daisy · · Score: 2
    Your points are pretty good, but there are two things that take away from your argument that you do not mention.

    First, the lack of registers in the x86 architecture. Having a fast cache is great, but it's not as fast as a register, and it takes extra instructions to load and store, unless you go to the more complicated addressing modes, with the problems that you note.

    Second is the relatively finer granularity of the instructions available on a RISC architecture. Although there is some merit to making decisions based on information only available at runtime, that isn't a big factor with today's technology. What a modern x86 looks like is a microcode architecture with somewhat intelligent scheduling of the instructions. In most cases a compiler could do a better job.

    Where microcode might really be nice is in mitigating the effects of optimizing for one single processor. You could have write-once, run-optimally-on-any-x86 code, but as we see from the real world, that's working about as well as write-once, run-anywhere is with Java.

    I completely agree with you on the importance of higher level parallelism. In most cases, the instruction level parallelism in code is low. SIMD in particular seems a waste to me, since one of the few things less likely than code that doesn't have dependencies is identical code operating on different values that doesn't have dependencies. It has its places, but not that many of them. With all the indirections in neat object code, you get a lot of cache incoherency, so you don't even take as bad a hit as you might think from running threads in parallel on account of that.

    Of course, there are alternatives. Running multiple branches so that a wrong prediction won't stall is a decent, although less efficient, use of extra execution units also.

    --

    If you are modding me down because you disagree with me, use the "Flamebait" category, not the "Troll" one.
  26. P5 - The Art of War by laserdance · · Score: 2
    In designing the P5, Intel again tries to pack every possible feature onto their processor.

    The solution? Balancing the extreme power and cooling requirements of the processor by making the computer made out of PURE DEATH!!!

    NeoNecroElectrical Engineering is the future...

  27. Not much performance decrease (according to articl by Fervent · · Score: 3
    According to the article there's not much performance decrease that can be directly tied to the design changes (they mention 5% loss for cutting the chip's area nearly in half. I'll take that).

    The real reason for the chip's inherent "performance losses" is the running-string that's slowly being pulled to its breaking point -- that is, the x86 architecture. Hopefully Itanium will change all that.

    --

    - I don't care if they globalize against free speech. All my best free thoughts are done in my head.

  28. More info by Tomcow2000 · · Score: 3

    Once again, The Register has a story with not only more info, but a much better title :)

    --

    Sleep: A completely inadequate substitute for caffeine.
  29. PPC by Chris+Johnson · · Score: 4

    So go PPC. 512K cache is _small_ for current PPCs, 1M cache is typical and 2M of cache is possible with the G4s. You don't _have_ to cling to x86 just because an industry is desperately trying to keep it hobbling along. It's possible to not use x86. For that matter, UltraSPARC cache can be up to _four_ megs.

  30. Re:Clock by Detritus · · Score: 4
    I've always wondered why one of the vendors hasn't put a divide-by-two circuit on the clock input pin of the microprocessor. Then they could claim that their 800 MHz chip was actually a 1.6 GHz chip.

    A friend told me that in the early days of portable transistor radios, some manufacturers would intentionally add non-functional transistors to their radios, just so they could advertise them as N transistor radios, where larger values of N were "better".

    --
    Mea navis aericumbens anguillis abundat
  31. Next breakthrough by max99ted · · Score: 4
    Watched the A&E Biography on Steve Wozniak last night. One of his designs (Apple 1 methinks) was revolutionary for its time - a reduction from about one thousand chips on board to sixty - way ahead of what anyone else was doing.

    Maybe it's me, but I can't think of a similar 'breakthrough' advance in recent years. I remember reading somewhere that computers are approaching the 'limits' of current architecture design - we can only crank out so much from today's motherboard/x86 technology. I know that optical computing is slated as the next wave, but I can't help thinking that to bring this to light there needs to be a new "Apple I" breakthrough. Am I off base here?

    --

    Please stop APK.. you're only hurting yourself.

  32. The real limits, IMO. by Christopher+Thomas · · Score: 5

    The real reason for the chip's inherent "performance losses" is the running-string that's slowly being pulled to its breaking point -- that is, the x86 architecture.

    Actually, while inconvenient, the x86 architecture isn't as horrible a limit to performance as a lot of people seem to be assuming. The main problem is the extra latency in the decode stage, which lengthens the pipeline somewhat, but the P4's trace cache takes care of that.

    The real problem with the P4 is that it has very weird optimization requirements (the whole "bundle" thing) and so needs a very smart compiler if code is to run quickly. Generally, even if compilers like this exist, they aren't used (remember the original Pentium?).

    The other problem with the P4 is the long pipeline, which exacerbates stall problems.
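
    A rough way to see why a longer pipeline hurts more on stalls is the usual back-of-envelope CPI model below. It is only a sketch: the function name and all the numbers (branch fraction, mispredict rate, flush penalties of 10 and 20 cycles) are illustrative assumptions, not measurements of any real chip.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once branch-mispredict flushes are counted.

    Every mispredicted branch costs roughly one pipeline refill, so the
    penalty grows with pipeline depth.
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Hypothetical workload: 20% branches, 5% of them mispredicted.
shorter_pipeline = effective_cpi(1.0, 0.20, 0.05, 10)  # assumed 10-cycle flush
longer_pipeline = effective_cpi(1.0, 0.20, 0.05, 20)   # assumed 20-cycle flush

print(f"shorter pipeline CPI: {shorter_pipeline:.2f}")
print(f"longer pipeline CPI:  {longer_pipeline:.2f}")
```

    With the same code and the same predictor accuracy, the deeper pipeline pays twice as much per mispredict, so it has to make that back through clock speed.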

    As for architecture in general, heat issues are what's limiting clock speeds (for x86 and non-x86 processors alike). However, the main limit people are noticing is the limit to the number of instructions you can run in parallel. As long as you're executing only one thread, you're not going to be able to squeeze more operations per clock beyond a certain point. The "performance problem" isn't with clock speed - it's with people expecting new chips to do more, clock for clock, than old chips while running serial programs. This parallelism problem affects all chips - x86 and non-x86.

    This is why the major manufacturers are starting to look at SMT chips (Simultaneous Multi-Threading) seriously. Running multiple threads in parallel on one chip doesn't take much extra hardware, and makes it *much* easier to schedule concurrent instructions and to keep on running when one instruction stalls (your "Instruction-Level Parallelism" goes up in proportion to the number of threads).
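
    To make the issue-slot argument concrete, here is a toy model, not a simulator of any real core. The 3-wide issue width and the per-cycle "ready instruction" counts are made up for illustration; a 0 stands for a cycle where that thread is stalled. Unissued instructions are simply dropped rather than carried over, which is a deliberate simplification.

```python
def issued_per_cycle(ready_by_thread, issue_width):
    """For each cycle, issue as many ready instructions as the core has slots.

    ready_by_thread is a list of per-thread lists; entry i of each list is
    how many independent instructions that thread has ready in cycle i.
    """
    return [min(issue_width, sum(ready)) for ready in zip(*ready_by_thread)]


ISSUE_WIDTH = 3  # hypothetical core width

# Hypothetical ready counts per cycle; 0 means the thread is stalled.
thread_a = [2, 0, 1, 3, 0, 1, 2, 0]
thread_b = [0, 2, 1, 0, 3, 1, 0, 2]

single = issued_per_cycle([thread_a], ISSUE_WIDTH)
smt = issued_per_cycle([thread_a, thread_b], ISSUE_WIDTH)

print("one thread IPC: ", sum(single) / len(single))
print("two threads IPC:", sum(smt) / len(smt))
```

    In this particular made-up trace the second thread fills slots the first one leaves idle, so throughput doubles; once the combined ready count regularly exceeds the issue width, the gain would flatten out.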

  33. The P4 is the world's fastest microprocessor. by ToLu+the+Happy+Furby · · Score: 5

    It merely needs recompiled code to perform well.

    On what am I basing this apparently heretical statement? On SPEC_CPU2000, the most demanding, well balanced, most respected cross-platform CPU benchmark in the world. As you can see if you peruse these lists, the P4/1500 has the highest scores of any shipped CPU in the world, both in SPECint (base and peak) and in SPECfp (base only).

    Before any of you reply and think you've caught a mistake, the Alpha EV67/833 is *not* publicly available, and won't be until January, at which point it will take back leadership in SPECfp_base and SPECint_peak. Of course, the P4/1700 will probably take back the lead when it's released in March or so. Indeed, the P4 and Alpha will likely trade the top SPEC spot back and forth at least until the EV68 (EV67 moved to .18 um process and with on-die L2 cache) makes an appearance (Q2?), if not all the way until the EV7 (EV68 with integrated on-chip *8-channel* RDRAM controller) is released (Q4?).

    This is why all this banal talk about the P4 being a crappy chip or (in the wake of this article) a "crippled" chip is ignorant drivel. SPEC_CPU is an exceptionally well designed, balanced, and comprehensive benchmark stressing a CPU to its limits in all sorts of ways. Why then the P4's disappointing performance on all those other benchmarks? They are all on "legacy" code--code compiled with the P6 core in mind. Because the P4 represents the first chip with a new core architecture (the horribly misnamed "NetBurst" core) from Intel in 5 years, it has a lot of pretty radical design features which don't take well to code compiled for the P6 core. While this means the P4 is a pretty useless (or at least very overpriced) solution to running today's code--and indeed, most code released for at least the next year or so--it has nothing to do with how good a *design* it has, which is ostensibly the point of this discussion. Indeed, the PPro--the first P6 core chip--posted very "disappointing" benchmarks on legacy code when *it* was released 5 years ago; many observers wrote it and the P6 core off as underperforming overdesigned wackiness from Intel. It was arguably the most successful and innovative CPU core ever. Not so incidentally, this was strongly foreshadowed by its brief theft of the SPECint95 performance crown from the top Alpha of the time...

    Now to dispense with the most repeated "points" we've seen thus far.

    1) "This just goes to show that x86 is a dead ISA with no headroom to grow." Not the most unexpected statement to be found on /., but let's just say that the other 99.99% of the world that enjoys backwards compatibility will make sure x86 stays alive for quite a long time to come, thank you. On a technical (rather than marketing) level, though, this is ridiculous bunk as well, as the fact that the P4 beats every released 64-bit 10-times-as-expensive RISC chip with 30-times-as-expensive platforms, on SPEC_CPU--a benchmark specifically designed to stress exactly those high-performance situations demanded of professional level workstation and server machines--demonstrates quite nicely.

    Yes, x86 is a bad ISA, and yes it presents a problem to be overcome by chip engineers. But it has been overcome and will continue to be overcome--today by tacking on a decoding stage to x86 processors that turns x86 instructions to RISC-like instructions for internal operations (taken out of the critical path by the P4's trace cache), and tomorrow perhaps by dynamic recompilation software ala Transmeta, IBM's DAISY, and HP's Dynamo, techniques which are still in their infancy and *may* end up providing better-than-compiled performance even without the benefit of converting to a more optimal ISA. The other negative of the x86 ISA, namely the paucity of compiler-visible registers, is indeed a problem, although one partially alleviated by rename registers and partially by evolutionary extensions to the x86 ISA, such as SSE2, which will eventually replace much of the god-awful stack-based x87 FPU ISA.

    The real question is, does the performance hit generated by sticking with x86 exceed the performance gain generated by having a much larger target market, and thus more money to spend keeping up with the latest process technology and thus getting faster clocked CPUs? The answer thus far has been a rather resounding "no"--that is, the economies of scale granted by staying x86 have meant processors which are outright faster and cost much much less.

    After all, there is no doubt that were the Alpha not around 18 months behind Intel in terms of process technology, the EV67 would be much faster than the P4. On the other hand, the EV67 gets to take advantage of resources that Intel could never dream of in a mainstream chip--like a 300+mm^2 die size, extra wide memory buses, and 4-8MB L2 caches--because of the tremendous added cost. And even with all that plus what is widely acknowledged as the best CPU design team on the planet, the Alpha only manages to keep up with the P4.

    Moreover, the rest of the 64-bit world--despite the same advantages as the Alpha (well, except their design team)--can barely keep up with the P3, and that's a 5 year old design. They may be available in multi-chip boxes scaling to kingdom come, but on the level of individual chips, the best that Sun, IBM, HP or MIPS has to offer is pretty lame, despite all the advantages of a RISC ISA. Of course, the same old folks will be claiming that x86 is an inherent dead end when the P4 (or whatever Intel is calling its current NetBurst core by then) scales past 4 GHz two years from now, well ahead of anyone in the RISC world. And we'll hear it again in 4 or 5 years, when Intel releases another all-new x86 core.

    2) "The P4 should have left in all those features this article talks about." Uhhuh. Sure. Um...now, who would know more about this? Would that be you, having read some article on the Internet? Or would that be Intel's engineers who maybe understand the P4 core and the issues involved with these features a bit better than you, and who had the benefit of cycle-perfect simulations on dozens if not hundreds of possible P4 variants running every conceivable type of code??

    If there's a feature which doesn't make it into a finished CPU, it's because of one of two reasons:
    1) The designers didn't think of it;
    2) The designers couldn't figure a way to implement it and make it work with the rest of their design in such a way that it raised performance/cost.

    Needless to say, "The designers thought of it, implemented it (which they did in this case), and it was a good feature (i.e. improved performance/cost on a majority of code), but then made a boneheaded decision not to use it," is *not* on the list.

    IMO, the features listed here are all better off gone from the current P4. The only really intriguing one--another FPU--was *not* left off for die size considerations (i.e. cost): FPUs are not very big. It was left off for performance issues. You see, while "more is better" sounds like a nice philosophy, adding an extra FPU would have meant extra decoding and routing logic in the FP section of the chip. Considering Intel actually went to the considerable trouble of implementing this feature and then decided against it, it is very likely that this extra logic was in the P4's critical path. Thus while including the extra FPU would have meant extra performance/clock, it would have meant lower overall clock speeds. Obviously Intel felt the tradeoff worked better without the extra FPU than with it.
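
    The tradeoff comes down to performance being roughly work-per-clock times clock speed. A minimal sketch follows; the 5% per-clock gain and 8% clock loss are invented purely to show the arithmetic, since nobody outside Intel has the real figures.

```python
def relative_performance(ipc_gain, clock_loss):
    """Speedup versus the baseline design.

    ipc_gain:   fractional increase in work done per clock (0.05 = +5%)
    clock_loss: fractional decrease in achievable clock speed (0.08 = -8%)
    """
    return (1.0 + ipc_gain) * (1.0 - clock_loss)


def break_even_clock_loss(ipc_gain):
    """Largest clock loss a given per-clock gain can absorb without a net slowdown."""
    return 1.0 - 1.0 / (1.0 + ipc_gain)


# Hypothetical: extra FPU buys +5% per clock but costs 8% in clock speed.
with_extra_fpu = relative_performance(0.05, 0.08)
print(f"net performance vs. baseline: {with_extra_fpu:.3f}")
print(f"break-even clock loss for +5% per clock: {break_even_clock_loss(0.05):.3%}")
```

    Under those assumed numbers the "bigger" design ends up slower overall, which is the shape of the argument above: a feature on the critical path has to earn more per clock than it costs in clock.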

    If you "disagree" with their decision, please refer to the cycle-perfect simulators which Intel has and you don't, and the P4/1500's SPECfp2000 score which is a mere, oh, 68% better than the fastest P3. Also you might note that the P4 is scaling quite well with clock speed on SPECfp, that it will spend most of its life at speeds well above 2 GHz, and that it will likely sell most (at least for the next 2 years) in combination with a memory subsystem providing *less* bandwidth than the current dual-RDRAM i850 chipset--all of which point to this being a very smart decision on Intel's part. (The reasoning is this: if the P4's FPU can already keep up quite nicely with a larger memory bandwidth, then why increase FPU power/clock when most P4s will have higher clocks and lower bandwidth to keep them fed?)

    As for the features I'd like to see added to the P4 when it moves to its .13 um Northwood variant next summer: one of them was on the list, i.e. a 16kb L1 data cache. The reason it was left off was clearly not die size but clock scalability--Intel decided having a 2-cycle latency L1 was more important than having a bigger one, and I totally agree. After the move to .13, though, perhaps a 16kb 2-cycle L1 will no longer limit clock scalability, just as the PPro's 8kb L1's were expanded to 16kb each with the PII. The other, a 512kb L2, would take up much too much die space at .18um to be feasible; it, too, may make it to Northwood, depending on Intel's target die size. Needless to say, whatever they decide, it will be a much better informed decision than I or anyone here could presume to make.
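
    For anyone wondering how a smaller L1 can win, the standard average-memory-access-time formula shows the shape of it. This is a sketch only: the miss rates and the 7-cycle L2 hit penalty are placeholder assumptions, not P4 measurements, and the 3-cycle figure for the larger cache is a hypothetical alternative.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time in cycles: hit time plus expected miss cost."""
    return hit_cycles + miss_rate * miss_penalty_cycles


L2_HIT_PENALTY = 7  # assumed extra cycles when the L1 misses and the L2 hits

# Hypothetical: smaller L1 hits in 2 cycles but misses more often;
# larger L1 misses less often but needs 3 cycles per hit.
small_fast = average_access_cycles(2, 0.06, L2_HIT_PENALTY)
big_slow = average_access_cycles(3, 0.04, L2_HIT_PENALTY)

print(f"small 2-cycle L1: {small_fast:.2f} cycles per access")
print(f"large 3-cycle L1: {big_slow:.2f} cycles per access")
```

    Every load pays the hit latency, while only the misses pay the penalty, so with a fast L2 behind it the extra hit cycle can easily cost more than the lower miss rate saves.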