Intel to Increase Stages in Prescott

Posted by CowboyNeal on Thursday January 22, 2004 @01:24PM from the massive-pipelining dept.

Alizarin Erythrosin writes "Further contributing to the MHz Myth, The Register and ZDNet are reporting that the new P4 core, codenamed Prescott, will have a longer pipeline than Northwood. No official numbers have been released, but The Reg is saying an Intel spokesman said that 30 stages seems to be a reasonable estimate. As most of us know, a longer pipeline can lead to slowdowns in the form of branch mispredictions and pipeline stalls. 'And just as the PIII proved faster than the early P4s in some applications, it's likely that Northwood will similarly prove faster than Prescott, which has clearly been designed for speeds of the order of 4GHz.'"
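To put rough numbers on the misprediction point, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the branch fraction, the misprediction rate, the base CPI of 1.0, and above all the idea that a flush costs about as many cycles as the pipeline has stages (20 for Northwood, the rumored 30 for Prescott). The function name `effective_cpi` is made up for this example.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once misprediction flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Assumed workload: 1 in 5 instructions is a branch, 5% of branches mispredict.
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.05

# Assumption: the flush penalty is roughly the pipeline depth in cycles.
northwood_cpi = effective_cpi(1.0, BRANCH_FRACTION, MISPREDICT_RATE, 20)
prescott_cpi = effective_cpi(1.0, BRANCH_FRACTION, MISPREDICT_RATE, 30)

# Clock-speed ratio the deeper pipeline needs just to break even.
breakeven_clock_ratio = prescott_cpi / northwood_cpi

print(f"20 stages: CPI ~ {northwood_cpi:.2f}")
print(f"30 stages: CPI ~ {prescott_cpi:.2f}")
print(f"break-even clock ratio ~ {breakeven_clock_ratio:.3f}")
```

With these made-up inputs the deeper pipeline has to clock about 8% higher just to match the shorter one. Real workloads and real predictors will move that figure in either direction.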

4 of 524 comments

History repeats itself..... by Selecter · 2004-01-22 13:30 · Score: 5, Interesting

I guess Intel's short-term game plan is to keep the MHz game going yet again until they can get something worth having going on the 64-bit front.
I suspect AMD and even Apple are going to shrink Intel's bragging rights in that same time frame unless Intel gets its act together. From AMD's recent earnings report it sure seems somebody is buying Athlon 64s.
Intel blew it when they made the decision to let 32 bits ride for another 2 to 3 years. They look like old fuddy-duddies now. It's AMD and Apple, via IBM, that have the cool shit.

Re:Pipelines != Math Performance by tomstdenis · 2004-01-22 14:02 · Score: 5, Interesting

More specifically the Athlon has three ALU/IEU pipeline pairs, 1 FADD, 1 FMUL and 1 FLOAD pipeline [e.g. you can't do 3 FP muls at once].

The decoder can send up to three instructions into the pipeline per cycle. Actually that's only for DirectPath instructions [e.g. simple ALU/FP]. Vector instructions stall all three decoders.

The ALU scheduler is fairly strong but it does have several weaknesses. From the manual I can't see that it can resolve dependencies from other pipelines. For instance,

ADD EAX,EBX [DIE__]
ADD EBX,EAX [D_IE_]
ADD ECX,EBX [D__IE] - critical path
INC ESI     [_DIE_]

D == decode, I == issue, E == execute [p. 227 of the Athlon optimization manual].

So the fourth instruction will always start on the second cycle despite the fact that ALU1/2 are blocked.
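A toy model of that timing, for anyone who wants to poke at it. This is only a sketch under loud assumptions: three decode slots per cycle, decode/issue/execute back to back when nothing blocks, one-cycle ALU latency, and a dependent op executing the cycle after its producer. It ignores ALU port contention entirely and is not how the real K7 scheduler is implemented. The names `schedule` and `DECODE_WIDTH` are made up for the example.

```python
DECODE_WIDTH = 3  # the post says up to three instructions decode per cycle

def schedule(instrs):
    """Return (decode_cycle, execute_cycle) per instruction, cycles counted from 1.

    instrs is a list of (text, written_reg, read_regs) tuples in program order.
    """
    ready = {}   # register -> cycle in which its latest value was computed
    result = []
    for index, (_text, writes, reads) in enumerate(instrs):
        decode = index // DECODE_WIDTH + 1
        # Earliest execute is decode + 2 (D, I, E back to back); otherwise
        # wait until the cycle after the last producer executed.
        execute = max([decode + 2] + [ready[r] + 1 for r in reads if r in ready])
        ready[writes] = execute
        result.append((decode, execute))
    return result

# ADD dst,src reads both operands and writes dst.
program = [
    ("ADD EAX,EBX", "EAX", ("EAX", "EBX")),
    ("ADD EBX,EAX", "EBX", ("EBX", "EAX")),
    ("ADD ECX,EBX", "ECX", ("ECX", "EBX")),
    ("INC ESI",     "ESI", ("ESI",)),
]

timeline = schedule(program)
for (text, _w, _r), (decode, execute) in zip(program, timeline):
    print(f"{text:<12} D={decode} E={execute}")
```

Under this model the three ADDs execute in cycles 3, 4 and 5, and the INC decodes in cycle 2 simply because the three decode slots of cycle 1 are already taken.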

Similarly the Athlon memory ports are a bit weak. There are read/write buffers but you still can only issue two reads or one write per cycle which is annoying.

However, the strength of the Athlon ALU over the P4 ALU is that for the most part it can keep all three pipelines busy even if they are blocked at some stage [e.g. it can decode/issue even if blocked]. It doesn't say so in the documentation, but I could swear the Athlon can cross-pipe things too. Cuz sometimes I can mess with the order of ops [e.g. create a dependency] and it executes in the same time regardless.

Anyways, yeah it's all about the 3 ALUs and a decent scheduler. Something the P4 does not have.

Tom

--
Someday, I'll have a real sig.

Re:Why? by Wanderer2 · 2004-01-22 14:16 · Score: 5, Interesting

Why not just dump MHZ as a rating altogether?

Didn't AMD try to organise this and recently concede it wasn't going to happen?

As long as any metric favours one particular manufacturer, the rest will try to replace it with a new one. The result will be more FUD and more confused users ("I've finally worked out what GHz are and you tell me I have to look at the number of flops?!?")

</Pessimist>

--
I say we take-off and slashdot the site from orbit... it's the only way to be sure

Re:I guess the home market rules... by tomstdenis · 2004-01-22 15:20 · Score: 5, Interesting

It isn't just branches though. For example, a 32x32=>64 multiplication on the P4 can take up to 14 cycles [iirc] whereas on the Athlon it's 6 cycles. So for example,

MUL EAX,EBX [DIMMMM_]
ADD ECX,EAX [_D___IE]

So in total it takes seven cycles.

The same code on the P4 would take at least 15 cycles. What's worse, consider

MUL EAX,EBX [DIMMMM_]
ADD ECX,EBX [_DIE___]
INC ESI [_DIE___]
DEC EBP [__DIE__]
ADD EBX,EDX [__D__IE]

Again this takes seven cycles, especially since instructions 1 and 2 can both start in cycle two in pipes 1/2.

Compare that to the P4 which only has two ALU pipes [one of which is now stalled for 14 cycles for the MUL to finish].
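The cycle counts above can be checked with a quick dependency-chain calculation. This is a sketch only: it uses the latencies quoted in the post (6 cycles for the Athlon MUL, 14 "iirc" for the P4, 1 for everything else), assumes MUL writes EDX:EAX the way the real one-operand MUL does, and ignores decode width, register renaming limits and the number of ALU pipes, so the P4 figure is a lower bound at best. `critical_path` is a made-up helper name.

```python
def critical_path(instrs, latency):
    """Cycle in which the last result is ready, limited only by data dependencies."""
    done = {}      # register -> cycle its latest value becomes available
    finish = 0
    for op, writes, reads in instrs:
        # An instruction can start once all of its inputs are available.
        start = max((done.get(r, 0) for r in reads), default=0)
        end = start + latency.get(op, 1)   # ops not listed take one cycle
        for reg in writes:
            done[reg] = end
        finish = max(finish, end)
    return finish

# Assumption: MUL is treated as writing EDX:EAX, which is why the final
# ADD EBX,EDX has to wait for it.
program = [
    ("MUL", ("EAX", "EDX"), ("EAX", "EBX")),
    ("ADD", ("ECX",), ("ECX", "EBX")),
    ("INC", ("ESI",), ("ESI",)),
    ("DEC", ("EBP",), ("EBP",)),
    ("ADD", ("EBX",), ("EBX", "EDX")),
]

athlon_cycles = critical_path(program, {"MUL": 6})
p4_cycles = critical_path(program, {"MUL": 14})

print("Athlon-style latencies:", athlon_cycles, "cycles")
print("P4-style latencies:    ", p4_cycles, "cycles")
```

The three independent instructions disappear into the shadow of the multiply, so the total is just the MUL latency plus one cycle for the dependent ADD: 7 with the Athlon figure and 15 with the P4 figure.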

Tom

--
Someday, I'll have a real sig.