Slashdot Mirror


Intel to Increase Stages in Prescott

Alizarin Erythrosin writes "Further contributing to the MHz Myth, The Register and ZDNet are reporting that the new P4 core, codenamed Prescott, will have a longer pipeline than Northwood. No official numbers have been released, but The Reg is saying an Intel spokesman said that 30 stages seems to be a reasonable estimate. As most of us know, a longer pipeline can lead to slowdowns in the form of branch mispredictions and pipeline stalls. 'And just as the PIII proved faster than the early P4s in some applications, it's likely that Northwood will similarly prove faster than Prescott, which has clearly been designed for speeds of the order of 4GHz.'"

42 of 524 comments

  1. I guess the home market rules... by ghostis · · Score: 4, Interesting

    I work at an engineering firm. The deep pipelines in the current P4 perform so poorly with general number crunching (e.g. matlab) we have almost completely switched to Athlons and are seriously considering Opteron.

    -ghostis

    --


    Computer Science is all about trying to find the right wrench to bang in the right screw. -T.Cumbo?
    1. Re:I guess the home market rules... by EulerX07 · · Score: 3, Interesting

      Matlab can hardly be beat in speed when you need to produce custom software to crunch huge matrices full of numbers. You can have a GUI designed and working, quickly put together some code that can grab data from any txt format, and run mathematical formulas on those data. Then you can do any operations you want on the matrices that are in memory and easily accessible. Want to throw your data into a chart? A few minutes of coding and you've got the perfect chart on there.

      Back in my days of internship at the Canadian Space Agency, I'd program multiple custom apps to pre-process the data before it was fed to the mainframes of a contractor for finite element analysis. Matlab is the tool to use for anybody involved in scientific projects. Yes, your code in C will run much faster, but it'll take significantly longer to get it up and running.

      If you run a lot of loops and it's really bogging the performance down, you can program just those sections of code in C and compile with matlab libraries to be able to use it in Matlab like the native commands. I did one piece of code that took a finite element file and created the 3d model in matlab. Took 20 minutes to run the code in matlab, 3.45 seconds once I had compiled the tough part of the code in C.

      In the end it's all about using the right tool, and for engineering/matlab, Matlab is excellent.

    2. Re:I guess the home market rules... by Mr.+Frilly · · Score: 2, Interesting

      Just another (single) data point to add, for the image reconstruction software I use routinely, I get these performances:

      intel pentium IV, 3.2 GHz: 5.0 minutes
      athlon XP, 1.533 GHz: 5.7 minutes
      intel pentium III 733 MHz: 8.1 minutes

      From the PIII to the PIV, a 340% increase in processor speed, I get 60% increase in performance...
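
      A quick sanity check of those percentages, sketched in Python. It uses only the clock speeds and run times listed above; the helper name is made up for illustration.

```python
# Timings from the post: (clock in MHz, run time in minutes).
runs = {
    "Pentium III": (733.0, 8.1),
    "Athlon XP": (1533.0, 5.7),
    "Pentium 4": (3200.0, 5.0),
}

def percent_increase(new, old):
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1.0) * 100.0

p3_mhz, p3_min = runs["Pentium III"]
p4_mhz, p4_min = runs["Pentium 4"]

# Clock-speed increase going from the PIII to the P4.
clock_gain = percent_increase(p4_mhz, p3_mhz)
# Throughput increase: a shorter run time means more work per minute.
speed_gain = percent_increase(p3_min, p4_min)

print(f"clock: +{clock_gain:.0f}%, performance: +{speed_gain:.0f}%")
```

      That works out to roughly a 337% clock increase for roughly a 62% speedup, in line with the rounded figures above.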

    3. Re:I guess the home market rules... by Cecil · · Score: 2, Interesting

      The deep pipelines in the P4 perform poorly, period. Even when running simple desktop apps on a Windows machine, I notice my P4-2.5GHz w/1GB RAM at work often jerks around or lags, while my Athlon 1900XP+ w/256MB RAM at home works like lightning. Obviously processor is not the whole story, but I think that under typical, multi-tasking usage, the deep pipelines are even more painful than benchmarks suggest.

      Disclaimer: I am not an EE, so I could very well be full of shit.

    4. Re:I guess the home market rules... by woodhouse · · Score: 2, Interesting

      Each to their own I suppose. I admit I don't have much experience with Matlab (I'm planning on keeping it that way). As a college project, we were told to use matlab for a computer vision task. I tried everything to optimise it, followed all the guidelines on vectorising code and not using loops, and eventually found that the only way to do it was to write the critical code in C, as you suggest (this improved the speed by a factor of 100). In the end, there was almost no advantage from having used matlab and I would have been better to just write the whole thing in C.

      What baffles me the most is that people use it for image processing, of all things. Surely if performance is important anywhere, it's here? It doesn't help that Matlab 6.5 runs on a Java back end.

    5. Re:I guess the home market rules... by buysse · · Score: 2, Interesting

      I thought that SSE and MMX both had significantly lower precision than standard IEEE floating point ops. If I'm wrong, please correct me, but if it is lower precision, it makes it useless for Real Work(tm).

      --
      -30-
    6. Re:I guess the home market rules... by tomstdenis · · Score: 5, Interesting

      It isn't just branches though. For example, a 32x32=>64 multiplication on the P4 can take up to 14 cycles [iirc] whereas on the Athlon it's 6 cycles. So for example,

      MUL EAX,EBX [DIMMMM_]
      ADD ECX,EAX [_D___IE]

      So in total it takes seven cycles.

      The same code on the P4 would take at least 15 cycles. What's worse, consider

      MUL EAX,EBX [DIMMMM_]
      ADD ECX,EBX [_DIE___]
      INC ESI [_DIE___]
      DEC EBP [__DIE__]
      ADD EBX,EDX [__D__IE]

      Again this takes seven cycles, especially since instructions 1 and 2 can start in cycle two in pipes 1/2.

      Compare that to the P4 which only has two ALU pipes [one of which is now stalled for 14 cycles for the MUL to finish].
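
      To make the cycle counts concrete, here is a toy model in Python. It is an assumption for illustration, not something taken from either vendor's manual: a dependent ADD cannot execute until the MUL result is ready, so the pair costs the multiply latency plus one cycle. The latencies are the ones quoted above.

```python
def dependent_pair_cycles(mul_latency, add_latency=1):
    """Cycles for a MUL followed by an ADD that needs its result.

    Toy model: the ADD cannot execute until the MUL has finished,
    so the two latencies simply add up.
    """
    return mul_latency + add_latency

athlon = dependent_pair_cycles(6)   # 6-cycle multiply, as quoted above
p4 = dependent_pair_cycles(14)      # up to 14 cycles, as quoted above
print(athlon, p4)
```

      Under that model the pair takes 7 cycles on the Athlon and 15 on the P4. Independent instructions, as in the second listing, can hide inside the multiply's shadow only if there are free pipes to run them.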

      Tom

      --
      Someday, I'll have a real sig.
  2. History repeats itself..... by Selecter · · Score: 5, Interesting
    I guess Intel's short-term game plan is to keep the MHz game going yet again until they can get something going on the 64-bit front worth having.

    I suspect AMD and even Apple are going to shrink Intel's bragging rights in that same time frame unless Intel gets their act together. From AMD's recent earnings report it sure seems somebody is buying Athlon 64's.

    Intel blew it when they made the decision to let 32 bits ride for another 2 to 3 years. They look like old fuddy-duddies now. It's AMD and Apple via IBM that have the cool shit.

  3. So What ? by El+Cabri · · Score: 4, Interesting

    I'm kind of tired of the perpetual whining of armchair hardware designers. So the happy few at Intel, highly paid architects with 30 years' experience in the industry and a hundred published scientific papers, decide that the next-gen chip will have more stages, and they have to be called morons? How do you know better? Hasn't Intel produced the fastest chips on the market with each and every micro-architectural generation? Long pipelines = costly branch mispredicts, whoooaah, you're so bright, why don't YOU have the job leading the Prescott team? Branches can be predicted. Long pipelines can improve throughput. Microprocessors are all about trade-offs. Let the pros do the work and go back to playing Quake.

    1. Re:So What ? by drinkypoo · · Score: 2, Interesting
      Obviously branches cannot always be predicted, and Intel has traditionally (not a long tradition, OoO is relatively new, but still) been poor at it. Witness the amazing slowness of the P4 compared to the P3, clock for clock. Some of those pipeline stages in the current P4 are already there for signal propagation; I suspect more of them in this core will be so-called "Drive" stages in which the CPU is doing nothing but waiting for signal propagation.

      Intel has the fastest chips (by a fine RCH), but AMD has consistently produced the best price:performance ratio and since the K6 faded over the horizon, AMD has got its act together WRT chipsets and compatibility, to the point where there is no longer any reason to get intel over AMD. AMD has realized that since CPUs are usually doing many things at once, it is better to be broad than deep.

      Intel is going to have to do something really spectacular soon or continue to lose market share to AMD. Personally I hope they blow it, because I'm so much happier with Athlons than with any intel CPU. AMD's only black mark is the K6, which until the K6/3 has only 24 bit FPU, and as such has many compatibility problems. Of course, if you're running linux, you'll never see them, so the faster K6s are not useless yet. (Cobalt Raq3 owners rejoice.)

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    2. Re:So What ? by PlazMan · · Score: 2, Interesting

      How about some whining from a real hardware designer?

      I used to work at Intel designing micros, and I can assure you that there are several highly-qualified and brilliant people in the microprocessor architecture and design teams. Unfortunately, Intel management directed them to trade performance for MHz about seven years ago and now they're finally paying for that foolishness. Lots of really good people have either left the company or drifted away from the project teams to the labs.

      Most of the people that I know who work or worked on the Prescott team say that it was probably the worst managed project ever at Intel. Take two (rival) divisions and tell them to work together, combine that with a design-by-committee mentality, and throw in a completely unreasonable schedule (imagine being in "crunch mode" for 2 years straight).

      Intel has succeeded in staying ahead by virtue of brute force. They have the resources to make diving save after diving save. The manufacturing and process engineers are unbelievably resourceful. The Northwood team has saved their bacon for the past two years as Prescott has missed deadline after deadline. It will be interesting to see if the behemoth can change its course and use its huge amount of engineering talent more efficiently in the future.

  4. Pipeline stalls by k4_pacific · · Score: 4, Interesting

    When the processor branches, all the partially executed instructions in the pipeline are lost.

    They could minimize this by creating two different conditional branch instructions for each condition: one for cases where the programmer expects the branch to occur most of the time, and one for where the branching rarely occurs. They could then optimize the pipeline behavior for each case. If it's a 'likely branch' instruction, it could start fetching commands from the branch. If it's an 'unlikely branch' instruction, it could prefetch the next instructions after the branch.

    This would work well in loops where every time but the last, the processor branches back to the top.
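
    A rough Python sketch of why a 'likely' hint suits loop branches. The 20-cycle flush penalty is an assumed figure for illustration, not a measured one, and the function name is hypothetical.

```python
def mispredict_cycles(outcomes, predict_taken, flush_penalty=20):
    """Cycles lost when a static hint always guesses the same direction.

    outcomes: list of booleans, True when the branch was actually taken.
    predict_taken: the programmer's hint ('likely' = True, 'unlikely' = False).
    flush_penalty: assumed cost of refilling the pipeline after a wrong guess.
    """
    wrong = sum(1 for taken in outcomes if taken != predict_taken)
    return wrong * flush_penalty

# A 100-iteration loop: the backward branch is taken 99 times, then falls through.
loop_branch = [True] * 99 + [False]

likely = mispredict_cycles(loop_branch, predict_taken=True)
unlikely = mispredict_cycles(loop_branch, predict_taken=False)
print(likely, unlikely)
```

    With the 'likely' hint only the final fall-through is guessed wrong (20 cycles lost); with the wrong hint the pipeline is flushed on 99 of the 100 iterations (1980 cycles lost).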

    --
    Unknown host pong.
    1. Re:Pipeline stalls by bmorris · · Score: 2, Interesting

      Read up on predication. http://www.geek.com/procspec/features/itanium They do some cool stuff with it in Itanium.

  5. Intel bit by their own tricks? by lambadomy · · Score: 4, Interesting

    Assume for a second that Intel's P4 design was really meant to boost GHz numbers easily (to guarantee victory in the GHz war if not the performance war). If so, is the Prescott design now the result of having to keep up with themselves? Obviously they could design a chip that is "faster" but runs at a lower clock speed than the P4s, but they've pushed the GHz number so much that now they're kind of hamstrung in their design options.

  6. Slower than Northwood? by StarCat76 · · Score: 4, Interesting

    Although the Prescott core will have a longer pipeline, it will probably end up performing a bit better clock-for-clock against Northwood. This is for a couple of reasons. Firstly, Prescott has 1 MB of on-die L2 cache. That's a good bit, and one could see how the P4 was helped by the 2 MB L3 cache in the P4 "EE". Secondly, the new P4 will have improved hyperthreading. It will also have somewhat improved branch prediction and implements PNI (Prescott New Instructions), which will require a recompile to help things out. All in all, I see the Prescott as being just as fast or faster per clock as Northwood, mostly due to the doubled L2 cache.

  7. Low-power consumption devices by johnthorensen · · Score: 4, Interesting

    So, since Prescott has approximately a 30 stage pipeline, I guess Intel has decided to continue to ignore the low-power consumption market, leaving it open to people like VIA and Transmeta. This is really disappointing to a lot of folks in the embedded markets, who would really like to see Intel ship something with significant horsepower that doesn't require a heatsink with the mass of a black hole to keep running.

    Word has it that VIA is readying a new x86 processor for their line that supposedly has P3-class FPU performance while maintaining the same levels of power consumption as its predecessors. It is expected that this processor may actually have a big win in front of it for DirecTV boxes. With the extra CPU horsepower, it should be exciting to see what nifty features come out of this, especially considering most set-top CPUs generally just act as "traffic cops" for the data moving between ASICs. If they're really making the move to this class of processor, perhaps they've got more in mind.

    --JT

    1. Re:Low-power consumption devices by Pyro226 · · Score: 2, Interesting

      > ...would really like to see Intel ship something with significant horsepower that doesn't require a heatsink with the mass of a black hole to keep running.

      Aside from the whole Earth getting sucked into oblivion thing, a black hole would make an excellent heat sink. I mean, not even light can escape its gravity - heat wouldn't stand a chance.

      --
      This message is encrypted with Quad ROT-13 to protect the author's copyright under the DMCA.
  8. Sounds Like Marketing by Anonymous Coward · · Score: 4, Interesting

    It sounds like Intel has totally given up on efficiency, and has the Marketing department doing processor requirements now... (has to clock to xGHZ!)

    I've been working with Dual Opterons for a few months now, and have been very impressed by their speed, heat dissipation, and bang for the buck.

    A large data transformation job (really doing a scrape of a mainframe report for data) on the order of 1.1GB processed much faster on an IBM E325 Dual Opteron 2.0GHz running 32-bit Windows (ack) than on my Dual 2.4GHz Xeon (w/HT) running Windows (double ack)....

    Yeah, it's not a benchmark, but it is real world performance.

    1. Re:Sounds Like Marketing by be-fan · · Score: 1, Interesting

      P4s aren't designed for efficiency, but raw performance. The long pipeline is an engineering decision. Consider, what market really pushes CPU performance? In the consumer arena, it's games and media applications. They are "streaming" (predictable branching) type applications, and the pipeline latency has a lower cost than the benefit of the higher clock-speed.

      So comparing a 2.0GHz Opteron to a 2.4GHz Xeon is not a fair comparison. You have to do a price/performance comparison. A 2.0GHz Opteron costs $700. A 2.4GHz Xeon costs $200. A 3.0GHz Xeon will be more comparable to the 2.0GHz Opteron and costs about $500. The Opteron is faster than the fastest Xeon, but on a price/performance standpoint, the Xeon is still competitive.

      --
      A deep unwavering belief is a sure sign you're missing something...
  9. Pipelines != Math Performance by TubeSteak · · Score: 3, Interesting
    My understanding was that AMD has 3 FPUs to Intel's 2. Oh, and AMD has 3 AGUs (integer units) compared to Intel's 2+2 (two of them also do other things). Anyways, most users, at the GHz speeds this proc is coming in at, will never notice the difference. For the people who care, they'll figure out what the proc can and cannot do... then use it accordingly. Unless you guys really want to run Windows, why not compare the Opteron to a Dually Mac? After all, the PowerPC is really good at number crunching.

    How come your computer takes seconds to multiply two 400 digit #s, but ages to factor them?

    --
    [Fuck Beta]
    o0t!
    1. Re:Pipelines != Math Performance by tomstdenis · · Score: 5, Interesting

      More specifically the Athlon has three ALU/IEU pipeline pairs, 1 FADD, 1 FMUL and 1 FLOAD pipeline [e.g. you can't do 3 FP muls at once].

      The decoder can send up to three instructions into the pipeline per cycle. Actually that's only for directpath instructions [e.g. simple ALU/FP]. Vector instructions stall all three decoders.

      The ALU scheduler is fairly strong but it does have several weaknesses. From the manual I can't see that it can resolve dependencies from other pipelines. For instance,

      ADD EAX,EBX [DIE__]
      ADD EBX,EAX [D_IE_]
      ADD ECX,EBX [D__IE] - critical path
      INC ESI [_DIE_]

      D == decode, I == issue, E == execute [p. 227 of the Athlon opt manual].

      So the fourth instruction will always start on the second cycle despite the fact that ALU1/2 are blocked.

      Similarly the Athlon memory ports are a bit weak. There are read/write buffers but you still can only issue two reads or one write per cycle which is annoying.

      However, the strength of the Athlon ALU over the P4 ALU is that for the most part it can keep all three pipelines busy even if they are blocked at some stage [e.g. it can decode/issue even if blocked]. It doesn't say in the documentation but I could swear the Athlon can cross-pipe things too. Cuz sometimes I can mess the order of ops [e.g. create a dependency] and it executes in the same time regardless.

      Anyways, yeah it's all about the 3 ALUs and a decent scheduler. Something the P4 does not have.

      Tom

      --
      Someday, I'll have a real sig.
    2. Re:Pipelines != Math Performance by tomstdenis · · Score: 2, Interesting

      "Vector instructions stall all three decoders.

      Yup. E.g. splitting movps -> movlps+movhps does indeed make a performance gain."

      I meant VectorPath instructions like DIV, LGDT, etc... ;-)

      They stall all three decoders. As for alignment the trick is to pack as many instructions as possible into 8-byte aligned windows. According to the manual it fetches 24-byte windows and performs one [or two I forget... PDF is so far away] of scan/early decoding.

      So the trick is to organize your code so that each 8-byte segment has as many directpath instructions in it as possible. That will minimize the decode latency [and, depending on the instructions, may minimize issue/execute latency].

      The problem though is most ALU opcodes are at least two bytes [except for things like INC/DEC] and worse yet things like

      00000000 89D8 mov eax,ebx
      00000002 8B00 mov eax,[eax]
      00000004 8B0418 mov eax,[eax+ebx]
      00000007 A100040000 mov eax,[0x400]
      0000000C 8B8000040000 mov eax,[eax+0x400]

      So really offsets/constants are horrible [the last two instructions are 5 and 6 bytes each].

      If you have to step through arrays I think the idea would be to use the middle, e.g.

      00000012 03040B add eax,[ebx+ecx]
      00000015 81C100040000 add ecx,0x400
      0000001B 03040B add eax,[ebx+ecx]
      0000001E 81C100040000 add ecx,0x400

      Which takes 18 bytes. [four windows]. Another trick is to use a register for the step size...

      00000024 BA00040000 mov edx,0x400
      00000029 03040B add eax,[ebx+ecx]
      0000002C 01D1 add ecx,edx
      0000002E 03040B add eax,[ebx+ecx]
      00000031 01D1 add ecx,edx

      [16 bytes, 3 windows, ignore stalls.... ;-)]

      Tom

      --
      Someday, I'll have a real sig.
  10. Re:Why? by phorm · · Score: 4, Interesting

    Which basically means Intel can release a CPU with a higher MHz rating for those that fall for such things.

    In reality the CPU will be somewhat faster than current ones due to the higher clock, but much less efficient.

    Why not just dump MHz as a rating altogether? Wouldn't a FLOPS-based (Floating-point Operations Per Second) rating or something similar be a better measurement? Maybe how far a simple program can compute pi in a second? We should really be looking at an operational-based measurement rather than a clock-based one.
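
    As a sketch of the "how far can it compute pi in a fixed time" idea, here is a minimal Python version. The choice of the Leibniz series, the batch size, and the 0.1-second budget are all assumptions for illustration; a real benchmark would need far more care (warm-up, repeated runs, a compiled language).

```python
import time

def leibniz_terms_in(budget_seconds):
    """Sum the Leibniz series for pi until the time budget runs out.

    Returns (terms_computed, pi_estimate). More terms in the same
    budget means a faster machine for this particular loop.
    """
    deadline = time.perf_counter() + budget_seconds
    total, terms, sign = 0.0, 0, 1.0
    while time.perf_counter() < deadline:
        # Do a batch of terms between clock reads so timing overhead stays small.
        for _ in range(1000):
            total += sign / (2 * terms + 1)
            sign = -sign
            terms += 1
    return terms, 4.0 * total

terms, estimate = leibniz_terms_in(0.1)
print(terms, estimate)
```

    The term count is the "score": it depends on the whole machine (and here, on the Python interpreter), not on the clock rate alone, which is the point being made above.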

  11. One-off number crunching... by Goonie · · Score: 3, Interesting
    In some situations, this kind of number-crunching is done with a custom program that is only run a few times. In such situations hacking something together in Matlab is quicker to get up and running than a full-blown C++ or, god forbid, FORTRAN program.

    Programmer time is much more expensive than faster machines.

    --

    Any sufficiently advanced technology is indistinguishable from a rigged demo
    --Andy Finkel (J. Klass?)
  12. Is this the right move? by Zebra_X · · Score: 4, Interesting

    Intel has shown no real interest in joining the 64-bit fray. Indeed, they don't have much choice. To release a 64/32-bit chip at this point would truly create an Itanic out of the Itanium. Microsoft would have more or less wasted its time producing low-volume products such as SQL Server 64 and XP 64 (different from XP 64-bit extended, which has yet to be released). Other consequences of such a shift in strategy would include a number of people who invested in the Itanic platform becoming the proud owners of an all but useless, but very expensive, hardware platform.

    Most real world tests point to AMD chips being faster. The Int and Floating Point Tests still belong to the P4 3.2, but the P4 is having to pass the 1st place trophy to AMD when it comes to games and office productivity.

    And then there is price. For $320 you can get $700 worth of Intel performance. Mind you this is the AMD64 running in 32-bit mode.

    It would appear that all that is really needed to justify mass market adoption is a consumer OS; that would be Windows XP 64-Bit extended, currently in beta. The only delay there is that the .NET framework is not 64-bit ready. We can probably expect its release with VS.NET Whitby, a.k.a. .NET 2.0.

    After that - we just need to see some AMD adoption in the mainstream pc builders.

  13. Do you know what you're talking about ? by vlad_petric · · Score: 4, Interesting
    Matlab is mostly loops. Loops generate branches with high predictability, and as a consequence deep pipelining won't incur much performance loss. Furthermore there's a lot of parallelism in those loops, and the out-of-order execution engine is quite good at exploiting it (i.e. hiding the long latency of FP ops by overlapping them).

    It's much more likely the size of the L2 cache is affecting you (i.e. your working set does not fit into P4's L2 cache but it does in Barton's).

    If you don't believe me, try the demo version of the Intel VTune performance analyzer on Matlab running one of your programs.

    How well your caches perform is probably the most important thing for a processor today, as the speed of the main memory is a couple of orders of magnitude under the speed of the processor. It takes a couple of hundred cycles to service an L2 miss, while a long FP operation takes at most 20 cycles.
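
    The cache argument can be sketched with the usual average-access-time formula. The 200-cycle miss cost follows the "couple of hundred cycles" figure above; the 10-cycle hit latency and both miss rates are made-up numbers for illustration only.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty):
    """Average cycles per memory access: hit time plus the expected miss cost."""
    return hit_cycles + miss_rate * miss_penalty

# Assumed numbers: 10-cycle L2 hit, 200-cycle trip to main memory.
fits_in_cache = average_access_cycles(10, 0.01, 200)  # working set fits in L2
spills_cache = average_access_cycles(10, 0.20, 200)   # working set does not fit
print(fits_in_cache, spills_cache)
```

    Under these assumptions the average access goes from about 12 cycles to about 50 once the working set spills out of L2, a far bigger swing than the 20-cycle worst case for a long FP operation.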

    --

    The Raven

  14. Re:Why? by Wanderer2 · · Score: 5, Interesting
    > Why not just dump MHZ as a rating altogether?

    Didn't AMD try to organise this and recently concede it wasn't going to happen?

    As long as any metric favours one particular manufacturer, the rest will try to replace it with a new one. The result will be more FUD and more confused users ("I've finally worked out what GHz are and you tell me I have to look at the number of flops?!?")

    </Pessimist>

    --
    I say we take-off and slashdot the site from orbit... it's the only way to be sure
  15. Re:It;'s not that it'll be slower... by philthedrill · · Score: 3, Interesting

    > It'll most likely be slower per clock cycle.

    Yes, I agree. My guess is that they're trying to achieve higher absolute performance. What surprises me is that this is still considered a P4 core, since adding pipeline stages (even 1 stage) is a very non-trivial task.

    This'll also kill the benefits of reduced power consumption of 90 nm technology (increase in area from the additional pipeline registers, increase in frequency), which is important in server design. An argument about the benefits of having a trace cache is the reduction in power consumption since you can remove some decoders (x86 decoders are horribly complex, yet having enough to feed the rest of the processor is critical for high performance). The P4 only has one x86 decoder (plus the uROM) and is able to perform well in general.

    It'll be interesting to see the power consumption numbers (average and max) as well as the die size. Also, I wonder how AMD's CPU rating system will change as a result of this.

  16. Effective pipeline by jmv · · Score: 3, Interesting

    I read somewhere that on the P4, when an instruction is already in the L1 cache, the pipeline gets shortened. That's because the L1 instruction cache stores pre-decoded instructions (micro-ops). This means that when the instruction is reached again, the decoding (and branch prediction?) steps are already done, shortening the pipeline. When the instruction is not in cache, there's already a big hit anyway. With that in mind, we'll need to see whether the extra pipeline stages in Prescott will still be there when the instruction is in the L1.

  17. Pass the Crack Pipe, Please... by Anonymous Coward · · Score: 1, Interesting

    Yeah, we all know that Q3 and Microsoft Word are the best methods of testing platform-independent CPU (note: not GPU or GPU driver) performance... You should really lay off the crack pipe. For someone who wants to know how number crunching compares on either platform, Q3 and Word (and Photoshop) aren't going to tell them squat about how something like Matlab will perform. Q3 is going to tell you more about the state of GPU tech and GPU drivers than about integer ops, and MS Word is obviously going to be better (supported) on MS Windows than on Apple anything. Photoshop is only relevant to people who work a lot with Photoshop, like desktop publishers. That PCWorld benchmark is the most worthless piece of garbage that somehow gets linked to each time people bring up performance comparisons between x86 and PPC, even though it has no bearing on the performance of processes being discussed.

  18. Dilbert Marketing by stuffedmonkey · · Score: 2, Interesting

    This is the end result of engineering driven marketing... When you relentlessly try to make the chip with the "most megahertz", you lose focus. AMD and Apple/IBM have started to pull away in quality - in terms of actual work done per clock cycle. While it's true that the average Joe or PHB might not know any better - you can only continue on so long...

  19. Completely Wrong by Anonymous Coward · · Score: 1, Interesting

    --Wow, I can't believe this got modded as 'Insightful'. 3000+ is a performance rating that is designed to show the CPU performs equivalently to a P4-3GHz.

    If you look at some actual benchmarks, you will see that the P4 3.06 is actually better in some cases than an AthlonXP 3000+ (note this is the 2.167GHz Barton in the graph)

    SpecFP
    SpecInt

    Additionally, the data shows that a 3GHz P4 is in fact MORE than 3x faster at SpecFP than a 1GHz P3. Perhaps you should inform yourself a little before posting FUD.

  20. A note about pipeline stages by Anonymous Coward · · Score: 2, Interesting

    The reasons that Intel has for increasing the # of pipeline stages seem, to me, more about marketing than actual performance.

    By increasing the # of stages (say, to do less work per stage), they're able to minimize interconnect delay (among other things), and therefore bump up the processor speed.

    It doesn't mean they'll be able to do more -- in fact, they're doing less per stage, just at a faster rate. (Whereas I suspect the Athlons are doing more per stage, and that's why we're seeing 2GHz Athlons tying or beating 3.2GHz Pentiums.)

    Marketing-wise, it'll be a win for Intel. Performance-wise (due to pipeline stalls), these changes will demand that Intel keep bumping up chip performance or else lose out to AMD. Of course, we all know which of these two criteria is more important to the bottom line.
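
    The trade-off can be put in numbers with the simple model performance = work per cycle * clock rate. This Python sketch only restates the 2GHz versus 3.2GHz comparison above; the model ignores memory, stalls and everything else, and the function name is hypothetical.

```python
def relative_ipc_needed(slow_clock_ghz, fast_clock_ghz):
    """How much more work per cycle the lower-clocked chip must do to tie.

    Uses the simple model: performance = instructions per cycle * clock rate.
    """
    return fast_clock_ghz / slow_clock_ghz

ratio = relative_ipc_needed(2.0, 3.2)
print(f"{ratio:.2f}")
```

    So a 2GHz chip that ties a 3.2GHz one must be getting about 1.6 times as much done per clock.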

  21. Re:Why? by Sivar · · Score: 4, Interesting

    More clockspeed = more sales. 95% of computer users (or is it 94%, with recent improvements in public education) believe in the MHz Myth mentioned on the front page.
    The MHz myth is the belief that the OneTrue measure of CPU performance is clockspeed. A 2GHz CPU is twice as fast as a 1GHz CPU. A 4GHz CPU is twice as fast as a 2GHz CPU.

    While it may not seem common to many of us, if you speak with a large number of average people about computer performance, you will quickly want to kill yourself. Or them. Or both.

    This isn't the fault of the general public, as Intel's marketing machine takes advantage of this common belief. Intel Pentium IV processors are some of the highest clocked processors in the world, and they benefit from everyone that thinks this somehow matters.

    --
    Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
  22. Don't forget Prescott's larger L1/L2 cache sizes by Anonymous Coward · · Score: 1, Interesting
    > It'll most likely be slower per clock cycle. What this means, is that it will take a faster clock cycle (4GHZ, for instance) to do the same amount of processing as the Northwood core.

    Prescott will have 16KB of L1 cache (Northwood has 8KB) and 1024KB of L2 cache (Northwood has 512KB). These changes will most likely increase the performance per clock cycle.

    Maybe the larger cache sizes will "make up" for the longer pipeline. I won't criticize Intel until I see benchmarks of 3.4GHz Northwood vs 3.4GHz Prescott.

  23. Re:Scientific work on optimal pipeline depth by -tji · · Score: 3, Interesting


    > What all these papers have in common is that they find that increasing the pipeline depth past 20 stages increases performance.

    Is that a typo, or am I misinterpreting the papers you linked above?

    In all but the Intel paper, it looked to me like they were saying the optimal pipeline depth was somewhere between 6 and 20 (depending on workload).

    In the introduction of the Intel paper, it says "Focusing on single stream performance". So, basically they are focusing on artificial benchmark performance.

  24. Re:Why? by GerryGilmore · · Score: 2, Interesting

    Before you run off blaming the evil Marketing demons, let me ask you this.....what readily quantifiable measure would you use instead to compare systems for the broad range of users and applications - all other things being the same? (memory, disk, etc.)

    Imperfect a measure as it may be, it's a hell of a lot easier to relate to and compare than "how many FPS of Quake3 can I get?" or "how quickly can it compile the 2.6 kernel?"

  25. Re:Holy pipelines by mikeabbott420 · · Score: 3, Interesting

    Could we explain to people the difference between megahertz and performance by comparing it to cars? Sure, the Intel xxx does yyy, but that's a 4 (IPC) cylinder that does yyy rpm vs. an 8 (IPC) that does zzz rpm but has more horsepower. Megahertz = rpm, IPS = horsepower. If the general public understood that megahertz was rpm, not horsepower, Intel's talented engineers could build great things, freed from the marketing department's focus on rpm.

    --
    This program was made possible by a grant from the Ultra-Humanite, and viewers like you.
  26. Re:Why? by Sivar · · Score: 4, Interesting

    " Out here in "Reality World", as I like to call it, it _does_ matter. You see - performance is performance, whether it comes via IPC or high clock speed."

    Yes, high clockspeed "speed demon" chips can and often do outperform high-IPC "brainiac" chips. Whether the final performance of the fastest Pentium IVs ends up being as high or even higher than the fastest competitor does not change the fact that Intel has made no effort to dispel the MHz myth--and it IS a myth--and has in fact encouraged it.
    I said nothing of final performance figures. I was stating that the marketing gimmick is that MHz is an accurate measure of speed, which it is not--even between different revisions of Intel's own Pentium IV core, let alone in comparison to their competitors.
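
    To put rough numbers on the "speed demon" vs. "brainiac" point, here is a minimal sketch of the usual throughput relation (instructions per second = IPC x clock). The two chips and their IPC and clock figures are made up for illustration; they are not measurements of any real Pentium IV or Athlon.

```python
# Hypothetical chips: the IPC and clock values below are invented for
# illustration and do not describe any real processor.
def instructions_per_second(ipc, clock_hz):
    """Throughput = instructions per cycle * cycles per second."""
    return ipc * clock_hz


# A high-clock, low-IPC design vs. a low-clock, high-IPC design.
speed_demon = instructions_per_second(ipc=1.0, clock_hz=3.0e9)
brainiac = instructions_per_second(ipc=1.5, clock_hz=2.0e9)

print(f"speed demon: {speed_demon:.2e} instr/s at 3.0 GHz")
print(f"brainiac:    {brainiac:.2e} instr/s at 2.0 GHz")
```

    With these assumed figures the two chips retire the same number of instructions per second despite a 1 GHz gap in clock speed, which is why MHz alone says little.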

    "Until the Athlon64/Opterons AMD had no answer to the P4. They just couldn't quite keep up. And you people harped on the same thing "Ooh, it's a marketing gimmick!"."

    Athlons and Pentium IVs have been leapfrogging each other for years. If you believe that 32-bit Athlons were never competitive with Pentium IVs, you are quite mistaken. I would be happy to help you research the issue.

    "You want a marketing gimmick? How about selling a 64-bit CPU to people who have like 512M of memory. There's your gimmick."

    You may not be aware of this, but it is actually an intelligent idea to fix problems before they become problems.
    --LBA-48 was introduced before more than a tiny fraction of people had hard drives larger than the 128GB limit. Is it a marketing gimmick that LBA-48 supports multi-petabyte drives? (2^48-1 512-byte sectors).

    --Serial ATA, and even ATA100, were introduced long before any hard disk drive could possibly approach a 100MB/sec sustained transfer rate. Even today's world's fastest hard drive, the Fujitsu MAS3735, cannot quite reach 80MB/sec. Did you know, however, that the same situation occurred with ATA66, ATA33, ATA16, etc.? Perhaps engineers should have waited until the performance barriers were making drive upgrades pointless before introducing faster means of communication? After all, "no hard drive could possibly even approach 33MB/sec" --1995.
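
    The LBA capacity figures above are easy to check. A minimal sketch, assuming 512-byte sectors and counting 2^n addressable sectors (one sector more or less makes no visible difference at this scale):

```python
SECTOR_BYTES = 512  # sector size quoted in the comment above


def max_capacity_bytes(address_bits, sector_bytes=SECTOR_BYTES):
    """Largest drive addressable with the given number of LBA bits."""
    return (2 ** address_bits) * sector_bytes


lba28 = max_capacity_bytes(28)  # the older 28-bit LBA scheme
lba48 = max_capacity_bytes(48)  # LBA-48

print(lba28 // 2**30, "GiB addressable with 28-bit LBA")
print(lba48 // 2**50, "PiB addressable with 48-bit LBA")
```

    The 28-bit limit works out to the 128GB barrier mentioned above, and the 48-bit limit lands in the hundred-plus petabyte range.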

    The same applies to 64-bit processors.
    The average Dell comes with what, 256MB RAM? Probably 512MB now? That is 1/8 of the "4 GB barrier" of 32-bit pointers. Actually, that barrier is either 1.5GB, 2GB, or 3GB depending on your operating system.
    Now, let's think: Have you ever seen the average amount of RAM in a system double? I seem to remember 4MB being "plenty" and 16MB being "wasteful and ridiculous". I seem to remember 32MB being the standard, and anything over 128MB was an unwise waste of money.
    Do you think that maybe, possibly, that pattern might repeat? Perhaps--since it has happened every few years for decades--the average amount of RAM in a system might increase? Applications might want more than 4GB of address space? Quake 5 may require 6GB RAM minimum (16GB recommended)?
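
    As a rough illustration of how close that barrier is, this sketch counts how many doublings it takes for a typical RAM size to reach a given limit. The starting sizes and limits are the ones mentioned above; the helper name is made up for this example.

```python
def doublings_to_reach(start_mb, limit_mb):
    """Count how many times start_mb must double to reach or pass limit_mb."""
    count = 0
    ram = start_mb
    while ram < limit_mb:
        ram *= 2
        count += 1
    return count


# 512MB machine against the full 4GB of a 32-bit pointer
print(doublings_to_reach(512, 4096))
# 512MB machine against a 2GB per-process limit imposed by the OS
print(doublings_to_reach(512, 2048))
```

    Three doublings to hit the 4GB ceiling, and only two if the operating system caps a process at 2GB.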

    In case you were not aware, the 64-bit mode of the Athlon64 provides real performance benefits, whether software cares about the extra address space or not. Many algorithms, particularly encryption, data management, HL math, high precision math, media en/decoding, and compression can make use of the larger register size.
    The fact that there are double the number of GPRs (that stands for "General Purpose Register". Ohhh, ahhh) and that the amount of data that one can fit into those GPRs has quadrupled, helps ALL software that is more than a 20-line assembly language experiment. Hell, even having 16 GPRs (twice as many as previous x86 chips), the AMD64 architecture is still considered register-starved. Look at the PowerPC, the IA64, the AXP, the UltraSPARC, and just about any other mainstream high-performance processor architecture.
    You may want to look at the reviews from reputable publications showing substantial performance gains from 64-bit Opteron software, including software that could not care less if you have >4GB of memory. Hint: Tom's Hardware is not on that list.

    Is a 10%-30% performance boost a gimmick?

    --
    Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
  27. Re:Why? by pastafazou · · Score: 2, Interesting

    The problem in killing the myth is the dominance Intel has in the processor market. The average Joe is force fed "Intel inside" everywhere he looks, and the sales people in most stores don't bother to explain the differences between different architectures (or they just don't know). Intel has capitalized on this by pushing their architecture heavily towards higher clock speeds, at the cost of many other efficiencies. It's simply MHz & GHz that everyone mentions. AMD, IBM, Apple, Sun, Motorola etc should start pushing something else that can be realistically measured. Maybe someone can do the conversion from clock speeds and GigaFlops to horsepower and Torque? Start talking in powertool talk, and a huge chunk of the population will suddenly start to understand a bit better.

  28. Lab to market lag time: 4 years by Anonymous Coward · · Score: 2, Interesting

    I knew they were up to something when this mail appeared on the linux-kernel mailing list in 2000. 4.3 GHz, indeed!

  29. Re:Size of pipeline by Hoser+McMoose · · Score: 4, Interesting

    Ironically enough, that's quite accurate for processors!

    A 6-stage pipeline with terrible branch prediction and all sorts of holes in it isn't going to do any good at all, while a 30-stage pipeline with great branch prediction (and the P4 does have great branch prediction) and few bubbles or holes (improved SMT, aka hyperthreading, is supposed to help here) will do wonders.

    Of course, the real question is not how long the total pipeline is, but the branch mispredict penalty. It should be noted that the "Northwood" P4 has a 28-stage pipeline, but only a 20-stage mispredict penalty. If the "Prescott" has a 30-stage pipeline with a 22-stage mispredict penalty, it isn't exactly a huge change.
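
    A back-of-the-envelope way to see why the mispredict penalty matters more than raw pipeline length is the textbook effective-CPI model. This is only a sketch: the 20- and 22-cycle penalties come from the figures above, while the base CPI, branch fraction, and mispredict rate are assumed values, not measurements of either chip.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Base CPI plus the average stall cycles lost to mispredicted branches."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload: 20% of instructions are branches, 5% of those mispredict.
BASE_CPI = 1.0
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.05

northwood = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 20)
prescott = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 22)

# Relative per-clock slowdown from the two extra penalty cycles
slowdown = prescott / northwood - 1
print(f"Northwood CPI: {northwood:.3f}")
print(f"Prescott CPI:  {prescott:.3f}")
print(f"Per-clock slowdown: {slowdown:.1%}")
```

    Under these assumptions the two extra penalty cycles cost under 2% per clock, which a modest clock speed increase would more than cover.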