Intel to Increase Stages in Prescott
Alizarin Erythrosin writes "Further contributing to the MHz Myth, The Register and ZDNet are reporting that the new P4 core, codenamed Prescott, will have a longer pipeline than Northwood. No official numbers have been released, but The Reg is saying an Intel spokesman said that 30 stages seems to be a reasonable estimate. As most of us know, a longer pipeline can lead to slowdowns in the form of branch mispredictions and pipeline stalls. 'And just as the PIII proved faster than the early P4s in some applications, it's likely that Northwood will similarly prove faster than Prescott, which has clearly been designed for speeds of the order of 4GHz.'"
Northwood was really unsatisfying. I found that for the money, it was too short with too few stages. While gameplay was fine, the lack of stages simply made the cost not worth it for me.
2 stars.
I have been pwned because my
I suspect AMD and even Apple are going to shrink Intel's bragging rights in that same time frame unless Intel gets their act together. From AMD's recent earnings report it sure seems somebody is buying Athlon 64s.
Intel blew it when they made the decision to let 32 bits ride for another 2 to 3 years. They look like old fuddy-duddies now. It's AMD and Apple via IBM that have the cool shit.
It'll most likely be slower per clock cycle.
What this means is that it will take a higher clock speed (4GHz, for instance) to do the same amount of processing as the Northwood core. However, lengthening the pipeline should allow Intel engineers to achieve higher clock speeds, since each stage does less work and the longest path through any one stage gets shorter.
In essence, Intel is attempting to increase the speed of their CPUs by focusing on increasing the clock speed (P4), while AMD is focusing on increasing the number of calculations per clock cycle (Hammer).
Of course, there are a lot of more complex tradeoffs that factor in (e.g. branch prediction). I highly recommend reading a computer architecture book if you're at all interested. It's really fascinating stuff.
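To put rough numbers on that tradeoff, here is a toy model in Python. Everything in it is an assumption made up for illustration (the function names, the logic delay, the latch overhead, the flush rate); it is not Intel's or AMD's data, and where the minimum lands means nothing for real chips. It only shows the shape of the argument: more stages shorten the cycle, but each pipeline flush costs more cycles.

```python
# Toy pipeline-depth model. All parameters are invented for illustration.

def cycle_time(stages, logic_delay=20.0, latch_overhead=0.5):
    """Cycle time in arbitrary units: the logic is split evenly across
    the stages, and every stage also pays a fixed latch overhead."""
    return logic_delay / stages + latch_overhead

def cycles_per_instruction(stages, base_cpi=1.0, flushes_per_instr=0.01):
    """Average CPI, assuming a flush (e.g. a mispredicted branch)
    wastes roughly one cycle per pipeline stage."""
    return base_cpi + flushes_per_instr * stages

def time_per_instruction(stages):
    """What actually matters: cycles per instruction times cycle time."""
    return cycles_per_instruction(stages) * cycle_time(stages)

if __name__ == "__main__":
    for stages in (10, 20, 30, 60, 120):
        print(f"{stages:3d} stages: clock {1.0 / cycle_time(stages):.2f}, "
              f"CPI {cycles_per_instruction(stages):.2f}, "
              f"time/instr {time_per_instruction(stages):.2f}")
```

With these made-up inputs the clock keeps rising with depth while work per clock keeps falling, and the time per instruction bottoms out somewhere in between. Change the flush rate or the latch overhead and the sweet spot moves, which is the whole argument in one knob.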
-=Lothsahn=-
Recall that GW Bush's grandfather was Prescott Bush.
Unknown host pong.
Re-read the Register article. It's not the Intel guy who said 30 stages, it's the Register who is guessing. They're assuming that since it went from 10 to 20 before it'll go from 20 to 30 now. It's not likely to end up being more than a few extra stages.
Right, Intel always has had the fastest chip, if you ignore things like Alpha, Athlon, Opteron, Power, PowerPC, and others.
And of course, Intel's motivations are entirely performance, or at least price/performance, not marketing.
The fact that every other company has chosen a different design decision and has made better chips as a result is just an illusion foisted on us by those who think their own thoughts.
I've had this sig for three days.
Let me guess - 'Alizarin Erythrosin' is Cupertinus Elvish for 'Mac User', right?
Oh. Yeah... LINUX.
Nevermind-- go back to writing the best OS there is.
Who do you get to be an expert to tell you something's not obvious? The least insightful person you can find? -J Roberts
In case anyone wants some hard facts:
A. Hartstein and Thomas R. Puzak (IBM): The Optimum Pipeline Depth for a Microprocessor, ISCA 2002.
M.S. Hrishikesh, Norman P. Jouppi, Keith I. Farkas, Doug Burger, Stephen W. Keckler, Premkishore Shivakumar (UT Austin, Compaq): The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays, ISCA 2002.
Eric Sprangle, Doug Carmean (Intel): Increasing Processor Performance by Implementing Deeper Pipelines, ISCA 2002.
A. Hartstein and Thomas R. Puzak (IBM): Optimum Power/Performance Pipeline Depth, MICRO 2003.
What all these papers have in common is that they find that increasing the pipeline depth past 20 stages increases performance.
More specifically the Athlon has three ALU/IEU pipeline pairs, 1 FADD, 1 FMUL and 1 FLOAD pipeline [e.g. you can't do 3 FP muls at once].
The decoder can send up to three instructions into the pipeline per cycle. Actually that's only for directpath instructions [e.g. simple ALU/FP]. Vector instructions stall all three decoders.
The ALU scheduler is fairly strong but it does have several weaknesses. From the manual I can't see that it can resolve dependencies from other pipelines. For instance,
ADD EAX,EBX [DIE__]
ADD EBX,EAX [D_IE_]
ADD ECX,EBX [D__IE] - critical path
INC ESI [_DIE_]
D == decode, I == issue, E == execute [p. 227 of the Athlon optimization manual].
So the fourth instruction will always start on the second cycle despite the fact that ALU1/2 are blocked.
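The D/I/E diagram above can be reproduced with a few lines of Python. This is only a sketch of the dependency bookkeeping, not the Athlon's real scheduler: it assumes single-cycle ALU ops, decode cycles supplied by hand, no limit on ALU ports, and that a dependent op executes the cycle after its producer. The names schedule and render are made up for the example, and idle cycles are drawn as underscores.

```python
# Minimal dependency-timing sketch. Assumptions: 1-cycle ALU ops,
# decode cycles given by hand, no port limits, forwarding so a
# consumer executes the cycle after its producer.

def schedule(instrs):
    """instrs: list of (text, decode_cycle, dest_reg, source_regs).
    Returns (text, decode, issue, execute) per instruction."""
    ready = {}  # register -> cycle in which its latest writer executes
    rows = []
    for text, decode, dest, sources in instrs:
        execute = decode + 2  # best case: D, then I, then E
        for reg in sources:
            if reg in ready:
                execute = max(execute, ready[reg] + 1)
        ready[dest] = execute
        rows.append((text, decode, execute - 1, execute))
    return rows

def render(decode, issue, execute, width=5):
    """Draw one row in the [D_IE_] style used in the post."""
    cells = ["_"] * width
    cells[decode - 1] = "D"
    cells[issue - 1] = "I"
    cells[execute - 1] = "E"
    return "[" + "".join(cells) + "]"

# x86 two-operand form: the destination is also a source.
EXAMPLE = [
    ("ADD EAX,EBX", 1, "EAX", ("EAX", "EBX")),
    ("ADD EBX,EAX", 1, "EBX", ("EBX", "EAX")),
    ("ADD ECX,EBX", 1, "ECX", ("ECX", "EBX")),
    ("INC ESI", 2, "ESI", ("ESI",)),
]

if __name__ == "__main__":
    for text, d, i, e in schedule(EXAMPLE):
        print(f"{text:<12}{render(d, i, e)}")
```

The three ADDs form a chain through EAX and EBX, so they execute in cycles 3, 4 and 5, while the INC has no dependencies and goes as soon as it is decoded.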
Similarly the Athlon memory ports are a bit weak. There are read/write buffers but you still can only issue two reads or one write per cycle which is annoying.
However, the strength of the Athlon ALU over the P4 ALU is that for the most part it can keep all three pipelines busy even if they are blocked at some stage [e.g. it can decode/issue even if blocked]. It doesn't say in the documentation but I could swear the Athlon can cross-pipe things too. Cuz sometimes I can mess the order of ops [e.g. create a dependency] and it executes in the same time regardless.
Anyways, yeah it's all about the 3 ALUs and a decent scheduler. Something the P4 does not have.
Tom
Someday, I'll have a real sig.
Didn't AMD try to organise this and recently concede it wasn't going to happen?
As long as any metric favours one particular manufacturer, the rest will try to replace it with a new one. The result will be more FUD and more confused users ("I've finally worked out what GHz are and you tell me I have to look at the number of flops?!?")
</Pessimist>
I say we take-off and slashdot the site from orbit... it's the only way to be sure
It isn't just branches though. For example, a 32x32=>64 multiplication on the P4 can take up to 14 cycles [iirc] whereas on the Athlon it's 6 cycles. So for example,
MUL EAX,EBX [DIMMMM]
ADD ECX,EAX [_D___IE]
So in total it takes seven cycles.
The same code on the P4 would take at least 15 cycles. Worse still, consider
MUL EAX,EBX [DIMMMM_]
ADD ECX,EBX [_DIE___]
INC ESI [_DIE___]
DEC EBP [__DIE__]
ADD EBX,EDX [__D__IE]
Again this takes seven cycles, helped by the fact that the ADD ECX,EBX and INC ESI can both start in cycle two in pipes 1/2.
Compare that to the P4 which only has two ALU pipes [one of which is now stalled for 14 cycles for the MUL to finish].
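The cycle counts above are simple enough to check in a few lines of Python. The two MUL figures are the ones quoted in this post (the P4 number is from memory, so treat it as an assumption), and the function names are made up for the example. It also works out how much faster the slower chip would have to be clocked to break even on this one snippet.

```python
# Back-of-the-envelope check of the MUL + dependent ADD example.
# Both constants come from the post; the P4 figure is "iirc".

ATHLON_MUL_DONE = 6   # MUL occupies cycles 1-6 in the diagram (D, I, MMMM)
P4_MUL_DONE = 14      # assumed worst case quoted in the post

def dependent_add_done(mul_done_cycle, add_latency=1):
    """Cycle in which an ADD that needs the MUL result finishes."""
    return mul_done_cycle + add_latency

def break_even_clock_ratio(slow_cycles, fast_cycles):
    """How much faster the slower design must be clocked to tie."""
    return slow_cycles / fast_cycles

if __name__ == "__main__":
    athlon = dependent_add_done(ATHLON_MUL_DONE)
    p4 = dependent_add_done(P4_MUL_DONE)
    ratio = break_even_clock_ratio(p4, athlon)
    print(f"Athlon: {athlon} cycles, P4: {p4} cycles, "
          f"break-even clock ratio: {ratio:.2f}x")
```

On those numbers the P4 needs a bit over twice the clock just to tie on this sequence, and that is before counting the independent ops the Athlon slips in while the MUL is running.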
Tom
Someday, I'll have a real sig.
The Quantium has the following new features:
"DIMMMM / DIE / DIE / DIE / D_IE" ... You aren't an employee of Rambus Inc. by any chance?