Intel to Increase Stages in Prescott

Posted by CowboyNeal on Thursday January 22, 2004 @01:24PM from the massive-pipelining dept.

Alizarin Erythrosin writes "Further contributing to the MHz Myth, The Register and ZDNet are reporting that the new P4 core, codenamed Prescott, will have a longer pipeline than Northwood. No official numbers have been released, but The Reg is saying an Intel spokesman said that 30 stages seems to be a reasonable estimate. As most of us know, a longer pipeline can lead to slowdowns in the form of branch mispredictions and pipeline stalls. 'And just as the PIII proved faster than the early P4s in some applications, it's likely that Northwood will similarly prove faster than Prescott, which has clearly been designed for speeds of the order of 4GHz.'"

15 of 524 comments

Bang for your buck by ObviousGuy · 2004-01-22 13:26 · Score: 5, Funny

Northwood was really unsatisfying. I found that for the money, it was too short with too few stages. While gameplay was fine, the lack of stages simply made the cost not worth it for me.

2 stars.

--
I have been pwned because my /. password was too easy to guess.
History repeats itself..... by Selecter · 2004-01-22 13:30 · Score: 5, Interesting

I guess Intel's short-term game plan is to keep the MHz game going yet again until they can get something worth having going on the 64-bit front.
I suspect AMD and even Apple are going to shrink Intel's bragging rights in that same time frame unless Intel gets its act together. From AMD's recent earnings report it sure seems somebody is buying Athlon 64s.
Intel blew it when they made the decision to let 32 bits ride for another 2 to 3 years. They look like old fuddy-duddies now. It's AMD and Apple via IBM that have the cool shit.
It's not that it'll be slower... by Lothsahn · 2004-01-22 13:32 · Score: 5, Informative

It'll most likely be slower per clock cycle.

What this means is that it will take a higher clock speed (4 GHz, for instance) to do the same amount of processing as the Northwood core. However, lengthening the pipeline should allow Intel engineers to achieve higher clock speeds, as the longest logic path within any one stage will likely be shorter (less delay per stage).

In essence, Intel is attempting to increase the speed of their CPUs by focusing on increasing the clock speed (P4), while AMD is focusing on increasing the amount of work done per clock cycle (Hammer).

Of course, there are a lot of more complex tradeoffs that factor in (e.g. branch prediction). I highly recommend reading a computer architecture book if you're at all interested. It's really fascinating stuff.
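
To put rough numbers on that tradeoff, here is a minimal Python sketch of a toy model in which every mispredicted branch costs one wasted cycle per pipeline stage. The function names and every figure in it (clock speeds, base CPI, branch fraction, mispredict rate) are made up for illustration; none of them are measurements of Northwood or Prescott.

```python
def cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate, pipeline_stages):
    """Average cycles per instruction in a toy pipeline model.

    Assumption: a mispredicted branch flushes the whole pipeline,
    wasting one cycle per stage.
    """
    return base_cpi + branch_fraction * mispredict_rate * pipeline_stages


def ns_per_instruction(cpi, clock_ghz):
    """Average wall-clock time per instruction, in nanoseconds."""
    return cpi / clock_ghz


def breakeven_clock_ghz(old_clock_ghz, old_cpi, new_cpi):
    """Clock the new core needs just to match the old core's throughput."""
    return old_clock_ghz * new_cpi / old_cpi


# Hypothetical workload: 20% branches, 5% of them mispredicted.
shallow_cpi = cycles_per_instruction(1.0, 0.20, 0.05, pipeline_stages=20)
deep_cpi = cycles_per_instruction(1.0, 0.20, 0.05, pipeline_stages=30)

print(f"20 stages: {shallow_cpi:.2f} CPI, "
      f"{ns_per_instruction(shallow_cpi, 3.2):.3f} ns/instr at 3.2 GHz")
print(f"30 stages: {deep_cpi:.2f} CPI, "
      f"{ns_per_instruction(deep_cpi, 4.0):.3f} ns/instr at 4.0 GHz")
print(f"break-even clock for 30 stages: "
      f"{breakeven_clock_ghz(3.2, shallow_cpi, deep_cpi):.2f} GHz")
```

With these made-up inputs the deeper pipeline needs more cycles per instruction (1.3 vs 1.2) yet still finishes each instruction sooner at 4 GHz, and it would need roughly 3.47 GHz just to tie. Change the mispredict rate and the picture shifts, which is the whole argument in a nutshell.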

--
-=Lothsahn=-
1. Re:It's not that it'll be slower... by edrugtrader · 2004-01-22 14:01 · Score: 5, Funny
  
  I highly recommend reading a computer architecture book if you're at all interested. It's really fascinating stuff.
  
  dude, i don't even read the articles.
  
  --
  MARIJUANA, SHROOMS, X: ONLINE?! - E
Re:Holy pipelines by k4_pacific · 2004-01-22 13:34 · Score: 5, Funny

Recall that GW Bush's grandfather was Prescott Bush.

--
Unknown host pong.
Re-read the article the reg is GUESSING 30 by uarch · 2004-01-22 13:35 · Score: 5, Informative

Re-read the Register article. It's not the Intel guy who said 30 stages, it's the Register who is guessing. They're assuming that since it went from 10 to 20 before, it'll go from 20 to 30 now. It's not likely to end up being more than a few extra stages.
Re:So What ? by addaon · 2004-01-22 13:37 · Score: 5, Insightful

Right, Intel always has had the fastest chip, if you ignore things like Alpha, Athlon, Opteron, Power, PowerPC, and others.

And of course, Intel's motivations are entirely performance, or at least price/performance, not marketing.

The fact that every other company has made a different design decision and has made better chips as a result is just an illusion foisted on us by those who think their own thoughts.

--

I've had this sig for three days.
Myth? by The+Bungi · 2004-01-22 13:38 · Score: 5, Funny

Alizarin Erythrosin writes "Further contributing to the MHz Myth ...
Let me guess - 'Alizarin Erythrosin' is Cupertinus Elvish for 'Mac User', right?
Re:So What ? by stevesliva · 2004-01-22 13:45 · Score: 5, Funny

I'm kind of tired of you armchair OS coders. So the happy few, highly paid Microsoft employees, 20 years experience in copying IBM, thousands of stock options in Redmond decide the next gen OS will have some wack FS and they have to be called morons? How do you know better? Hasn't Microsoft produced the best selling OS on the market for 15 years? Why don't YOU have the job leading the Longhorn team?
Oh. Yeah... LINUX.
Nevermind-- go back to writing the best OS there is.

--
Who do you get to be an expert to tell you something's not obvious? The least insightful person you can find? -J Roberts
Scientific work on optimal pipeline depth by Wesley+Felter · 2004-01-22 13:58 · Score: 5, Informative

In case anyone wants some hard facts:

A. Hartstein and Thomas R. Puzak (IBM): The Optimum Pipeline Depth for a Microprocessor, ISCA 2002.

M.S. Hrishikesh, Norman P. Jouppi, Keith I. Farkas, Doug Burger, Stephen W. Keckler, Premkishore Shivakumar (UT Austin, Compaq): The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays, ISCA 2002.

Eric Sprangle, Doug Carmean (Intel): Increasing Processor Performance by Implementing Deeper Pipelines, ISCA 2002.

A. Hartstein and Thomas R. Puzak (IBM): Optimum Power/Performance Pipeline Depth, MICRO 2003.

What all these papers have in common is that they find that increasing the pipeline depth past 20 stages increases performance.
Re:Pipelines != Math Performance by tomstdenis · 2004-01-22 14:02 · Score: 5, Interesting

More specifically the Athlon has three ALU/IEU pipeline pairs, 1 FADD, 1 FMUL and 1 FLOAD pipeline [e.g. you can't do 3 FP muls at once].

The decoder can send up to three instructions into the pipeline per cycle. Actually that's only for DirectPath instructions [e.g. simple ALU/FP]. VectorPath instructions stall all three decoders.

The ALU scheduler is fairly strong but it does have several weaknesses. From the manual, I can't see that it can resolve dependencies from other pipelines. For instance,

ADD EAX,EBX [DIE__]
ADD EBX,EAX [D_IE_]
ADD ECX,EBX [D__IE] - critical path
INC ESI [_DIE_]

D == decode, I == issue, E == execute [p. 227 of the Athlon optimization manual].

So the fourth instruction will always start on the second cycle despite the fact that ALU1/2 are blocked.
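
The timing in that diagram can be reproduced with a toy scheduler. This is a minimal Python sketch, assuming three instructions decode per cycle, every ALU op executes in one cycle, and a dependent op can issue in the same cycle its producer executes. Those rules are read off the diagram, not taken from the Athlon manual, and the names `schedule`, `render`, and `PROGRAM` are hypothetical.

```python
def schedule(instrs, decode_width=3):
    """Return (text, decode, issue, execute) cycles for each instruction.

    Toy rules (assumptions, not the real Athlon scheduler):
      - decode_width instructions decode per cycle, in program order
      - an op issues no earlier than the cycle after it decodes
      - an op issues no earlier than the cycle its producers execute
      - execute is always the cycle after issue
    ALU port limits are ignored.
    """
    ready = {}  # register -> cycle in which its latest writer executes
    rows = []
    for idx, (text, dest, srcs) in enumerate(instrs):
        decode = idx // decode_width + 1
        issue = max([decode + 1] + [ready[r] for r in srcs if r in ready])
        execute = issue + 1
        ready[dest] = execute
        rows.append((text, decode, issue, execute))
    return rows


def render(rows):
    """Draw one [D_IE_]-style timeline per instruction."""
    width = max(execute for _, _, _, execute in rows)
    lines = []
    for text, decode, issue, execute in rows:
        cells = ["_"] * width
        cells[decode - 1] = "D"
        cells[issue - 1] = "I"
        cells[execute - 1] = "E"
        lines.append(f"{text:<12}[{''.join(cells)}]")
    return lines


# (text, destination register, source registers)
PROGRAM = [
    ("ADD EAX,EBX", "EAX", ("EAX", "EBX")),
    ("ADD EBX,EAX", "EBX", ("EBX", "EAX")),
    ("ADD ECX,EBX", "ECX", ("ECX", "EBX")),
    ("INC ESI", "ESI", ("ESI",)),
]

for line in render(schedule(PROGRAM)):
    print(line)
```

Under these rules the third ADD is the critical path, executing in cycle five, while the independent INC decodes in cycle two and is done by cycle four.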

Similarly the Athlon memory ports are a bit weak. There are read/write buffers but you still can only issue two reads or one write per cycle which is annoying.

However, the strength of the Athlon ALU over the P4 ALU is that for the most part it can keep all three pipelines busy even if they are blocked at some stage [e.g. it can decode/issue even if blocked]. It doesn't say in the documentation but I could swear the Athlon can cross-pipe things too. Cuz sometimes I can mess the order of ops [e.g. create a dependency] and it executes in the same time regardless.

Anyways, yeah it's all about the 3 ALUs and a decent scheduler. Something the P4 does not have.

Tom

--
Someday, I'll have a real sig.
Re:Why? by Wanderer2 · 2004-01-22 14:16 · Score: 5, Interesting

Why not just dump MHZ as a rating altogether?

Didn't AMD try to organise this and recently concede it wasn't going to happen?

As long as any metric favours one particular manufacturer, the rest will try to replace it with a new one. The result will be more FUD and more confused users ("I've finally worked out what GHz are and you tell me I have to look at the number of flops?!?")

</Pessimist>

--
I say we take-off and slashdot the site from orbit... it's the only way to be sure
Re:I guess the home market rules... by tomstdenis · 2004-01-22 15:20 · Score: 5, Interesting

It isn't just branches though. For example, a 32x32=>64 multiplication on the P4 can take up to 14 cycles [iirc] whereas on the Athlon it's 6 cycles. So for example,

MUL EAX,EBX [DIMMMM]
ADD ECX,EAX [_D___IE]

So in total it takes seven cycles.

The same code on the P4 would take at least 15 cycles. What's worse, consider

MUL EAX,EBX [DIMMMM_]
ADD ECX,EBX [_DIE___]
INC ESI [_DIE___]
DEC EBP [__DIE__]
ADD EBX,EDX [__D__IE]

Again this takes seven cycles. Especially since instructions 1 and 2 can start in cycle two in pipes 1/2.

Compare that to the P4 which only has two ALU pipes [one of which is now stalled for 14 cycles for the MUL to finish].
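
For a sense of what those cycle counts mean in wall-clock terms, here is a small Python sketch. The 7 and 15 cycle figures are the ones quoted above; the 2.2 GHz Athlon clock is an assumed value picked for illustration, and the helper names are hypothetical.

```python
def elapsed_ns(cycles, clock_ghz):
    """Wall-clock time for a sequence, in nanoseconds."""
    return cycles / clock_ghz


def clock_needed_ghz(cycles, rival_cycles, rival_clock_ghz):
    """Clock needed to finish `cycles` in the time the rival takes."""
    return rival_clock_ghz * cycles / rival_cycles


ATHLON_CYCLES = 7       # from the post above
P4_CYCLES = 15          # "at least 15", from the post above
ATHLON_CLOCK_GHZ = 2.2  # assumption, for illustration only

print(f"Athlon: {elapsed_ns(ATHLON_CYCLES, ATHLON_CLOCK_GHZ):.2f} ns")
print(f"P4 clock needed to tie: "
      f"{clock_needed_ghz(P4_CYCLES, ATHLON_CYCLES, ATHLON_CLOCK_GHZ):.2f} GHz")
```

On this one multiply-bound sequence the P4 would need a bit more than twice the Athlon's clock (15/7) to break even, whatever the Athlon's actual clock is.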

Tom

--
Someday, I'll have a real sig.
More details on Intel's processor by rice_burners_suck · 2004-01-22 17:31 · Score: 5, Funny
Intel today announced its new 1024-hexabit microprocessor architecture technology. Named the Quantium, Intel's new processor core boasts powerful new technologies which will enable governments to better manage the rights (or lack thereof) of their subjects.
The Quantium has the following new features:
- Intel (r) LightSpeed (tm) technology breaks the processing pipeline into 299,792,458 discrete steps. As there is no internal clock within the processor, all operations occur at the speed of light. Hence, one "cycle" represents the absolute cosmic measure unit of time and all operations occur in one cycle. While this will not increase the processor's performance--indeed, it will pale in comparison to that of the ancient 80286 processor of old folklore--the faster internal clock speed is expected to increase Intel's sales by 0.000001% within 180 quarters.
- Intel (r) SingleAtom (tm) technology squeezes the entire processor into a single atom by modifying the universe at the M-theory level. Individual strings compose modified quarks and other subatomic structures, which combine to form a very heavy atom, one with approximately the same weight as 1 million protons. As the matter is extremely dense, the radioactive decay, combined with the gravity generated by itself causes the configuration of the subatomic particles to remain bonded at the subatomic level while realigning a nearly infinite number of times every second. This realignment constitutes the execution of instructions within the SingleAtom (tm) processor.
- 893,378,665,113 new operations have been added since the previous model, bringing the new total to over 18 googolplexes of instructions. All SCO intellectual property can be programmed in a single instruction, increasing SCO revenues. Corporations will have to pay $799 per processor instruction executed, or face serious legal action.
- RAM has been deprecated. 4 billion exabytes of internal general-use registers allow software to make more efficient data access, providing a more compelling Internet experience over a 28k modem connection.
Re:I guess the home market rules... by gjm11 · 2004-01-22 22:49 · Score: 5, Funny

"DIMMMM / DIE / DIE / DIE / D_IE" ... You aren't an employee of Rambus Inc. by any chance?