Intel Pentium 4 NetBurst Architecture Explained

Posted by ryuzaki0 on Sunday August 20, 2000 @11:46PM from the digging-into-it dept.

fr0child writes "Next week is Intel's Developer Forum (IDF) and it seems they'll be releasing quite a bit of information (aka hype) about the Pentium 4. Anandtech seems to have gotten the scoop on Intel's NetBurst Architecture, basically covering the P4's internal architecture."

Problems with longer pipelines, as in P4 by SpinyNorman · 2000-08-20 21:28 · Score: 5

1) Pipeline stalls / operand latency:

If the compiler and/or CPU is unable to reorder instructions effectively (or if a particular piece of code is not amenable to reordering), then an instruction in the pipeline may not have its operands ready when it needs them and will stall the pipeline waiting for them. With a longer pipeline it will take more clock ticks for the necessary operands to work their way through the pipeline to clear the stall. Intel have added a double clock speed arithmetic logic unit (ALU) to the P4 to try to mitigate operand latency.
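To make the stall effect concrete, here is a rough Python sketch of a single-issue, in-order pipeline. It is a toy model and not how the P4 actually schedules instructions: the function name `total_cycles`, the register names, and the `result_latency` parameter (cycles between an instruction issuing and its result being usable) are all made up for illustration.

```python
def total_cycles(program, result_latency):
    """Cycles needed to issue every instruction on a toy in-order pipeline.

    program: list of (dest, sources) tuples in program order.
    result_latency: cycles between an instruction issuing and its
    result becoming usable by a dependent instruction (assumed uniform).
    """
    ready_at = {}  # register name -> first cycle its value can be read
    cycle = 0      # earliest cycle the next instruction could issue
    for dest, sources in program:
        # Stall until every source operand is available.
        issue = max([cycle] + [ready_at.get(s, 0) for s in sources])
        ready_at[dest] = issue + result_latency
        cycle = issue + 1
    return cycle


# A dependent chain (a -> b -> c) followed by independent work...
chain_first = [("a", []), ("b", ["a"]), ("c", ["b"]),
               ("x", []), ("y", []), ("z", [])]
# ...and the same six instructions reordered to hide the latency.
interleaved = [("a", []), ("x", []), ("b", ["a"]),
               ("y", []), ("c", ["b"]), ("z", [])]

for latency in (1, 2, 4):
    print(latency,
          total_cycles(chain_first, latency),
          total_cycles(interleaved, latency))
```

In this model, with a result latency of 2 the chain-first ordering needs 8 cycles to issue and the interleaved ordering needs 6; at a latency of 4 they need 12 and 10. Reordering helps, but the longer the operand latency, the more independent work you need to find to cover it.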

2) Branch mispredict penalty:

When a modern CPU such as the P4 encounters a branch instruction, it predicts whether the branch will be taken or not (by using the execution history) in order to be able to continue processing instructions through the pipeline. When the branch is finally evaluated near the end of the pipeline it may turn out that the prediction was wrong, and that all the instructions following the branch (now in the pipeline) should not be executed. In this case the processor has to flush the pipeline and instead take the correct branch. This "pipeline flush" branch mispredict penalty is obviously higher the longer the pipeline is - a 20-stage pipeline means you are throwing away up to 20 stages' worth of in-flight instructions when a branch is mispredicted.
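The cost of those flushes can be estimated with the usual back-of-the-envelope formula: effective CPI = base CPI + (fraction of instructions that are branches) x (mispredict rate) x (flush penalty in cycles). The sketch below assumes the penalty equals the pipeline depth, and the 20% branch fraction and 5% mispredict rate are illustrative guesses, not measured P4 figures.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction once mispredict flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


# Assumed workload: 20% branches, 5% of them mispredicted, and a flush
# penalty equal to the pipeline depth (a simplification).
for stages in (10, 20):
    cpi = effective_cpi(1.0, 0.20, 0.05, flush_penalty=stages)
    print(f"{stages}-stage pipeline: CPI {cpi:.2f}, IPC {1 / cpi:.2f}")
```

Under these made-up numbers, doubling the pipeline depth from 10 to 20 stages raises the CPI from about 1.10 to about 1.20, so the deeper design has to make that back in clock rate.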

P4 was designed with a long pipeline so that each pipeline step could be very simple/quick and therefore the processor could have a very high clock rate. The downside of doing this is the above two problems, which mean that the average number of instructions executed per clock cycle (IPC - aka processor efficiency) gets reduced.

P4 at 1.4GHz may be faster than P3 at 1GHz, but because P4 will have a lower IPC than P3, it won't be as fast as a 1.4GHz P3 (if we ever see one) or 1.4GHz Athlon (which we will see).
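The arithmetic behind that comparison is just throughput = IPC x clock rate. A quick sketch, where the function names are invented for illustration and the 0.8 relative IPC for the P4 is purely hypothetical:

```python
def relative_speed(ipc, clock_ghz):
    """Instructions per nanosecond: IPC times the clock rate in GHz."""
    return ipc * clock_ghz


def break_even_ipc_ratio(slow_clock_ghz, fast_clock_ghz):
    """IPC ratio (fast-clock chip / slow-clock chip) at which the two tie."""
    return slow_clock_ghz / fast_clock_ghz


# A 1.4GHz part only needs about 71% of the IPC of a 1GHz part to match it.
print(break_even_ipc_ratio(1.0, 1.4))

# Hypothetical: P4 achieves 0.8x the IPC of a P3 normalised to IPC 1.0.
p4_at_1400 = relative_speed(0.8, 1.4)
p3_at_1000 = relative_speed(1.0, 1.0)
p3_at_1400 = relative_speed(1.0, 1.4)
print(p3_at_1000 < p4_at_1400 < p3_at_1400)
```

With that assumed 0.8 ratio, the 1.4GHz P4 lands between a 1GHz P3 and a hypothetical 1.4GHz P3, which is the situation described above.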

The one area where P4 should excel is in SSE2-optimized, floating-point-intensive applications, which is why Intel are now trying to reposition the P4 as an Internet/multimedia CPU rather than a general purpose one. The fallacy of this is that once you can decode your DivX in real-time, you don't need to go any faster!