Intel Pentium 4 NetBurst Architecture Explained
fr0child writes "Next week is Intel's Developer Forum (IDF) and it seems they'll be releasing quite a bit of information (aka hype) about the Pentium 4. Anandtech seems to have gotten the scoop on Intel's NetBurst Architecture, basically covering the P4's internal architecture."
1) Pipeline stalls / operand latency:
If the compiler and/or CPU is unable to reorder instructions effectively (or if a particular piece of code is not amenable to reordering), then an instruction in the pipeline may not have its operands ready when it needs them and will stall the pipeline waiting for them. With a longer pipeline it takes more clock ticks for the necessary operands to work their way through the pipeline and clear the stall. Intel have added a double clock speed arithmetic unit (ALU) to the P4 to try to mitigate operand latency.
2) Branch mispredict penalty:
When a modern CPU such as the P4 encounters a branch instruction, it predicts whether the branch will be taken or not (by using the execution history) in order to be able to continue processing instructions through the pipeline. When the branch is finally evaluated near the end of the pipeline it may turn out that the prediction was wrong, and that all the instructions following the branch (now in the pipeline) should not be executed. In this case the processor has to flush the pipeline and instead take the correct branch. This "pipeline flush" branch mispredict penalty is obviously higher the longer the pipeline is - with a 20 stage pipeline you are throwing away roughly 20 stages' worth of in-flight instructions every time a branch is mispredicted.
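To put rough numbers on the mispredict penalty, here is a minimal back-of-the-envelope sketch in Python. It assumes a very simple model in which every mispredicted branch costs a number of cycles equal to the pipeline depth. The branch frequency and mispredict rate are made-up illustrative values, not measurements of the P4 or any other real CPU.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, pipeline_depth):
    """Average cycles per instruction, including branch mispredict flushes.

    Assumption: each mispredict wastes `pipeline_depth` cycles (a full flush).
    """
    penalty_per_instruction = branch_fraction * mispredict_rate * pipeline_depth
    return base_cpi + penalty_per_instruction


# Hypothetical workload: 20% of instructions are branches,
# and 10% of those branches are mispredicted.
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.10

cpi_10 = effective_cpi(1.0, BRANCH_FRACTION, MISPREDICT_RATE, 10)
cpi_20 = effective_cpi(1.0, BRANCH_FRACTION, MISPREDICT_RATE, 20)

print(f"10-stage pipeline: CPI = {cpi_10:.2f}, IPC = {1 / cpi_10:.2f}")
print(f"20-stage pipeline: CPI = {cpi_20:.2f}, IPC = {1 / cpi_20:.2f}")
```

Under these assumptions, doubling the pipeline depth doubles the cycles lost to mispredicts, which is why better branch prediction matters more as pipelines get longer.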
P4 was designed with a long pipeline so that each pipeline step could be very simple/quick and therefore the processor could have a very high clock rate. The downside of doing this is the above two problems, which mean that the average number of instructions executed per clock cycle (IPC - aka processor efficiency) gets reduced.
P4 at 1.4GHz may be faster than P3 at 1GHz, but because P4 will have a lower IPC than P3, it won't be as fast as a 1.4GHz P3 (if we ever see one) or 1.4GHz Athlon (which we will see).
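The argument here boils down to performance = clock rate x IPC. A minimal Python sketch of that arithmetic follows. The IPC values are purely hypothetical placeholders chosen to illustrate the claim; they are not measured figures for any of these chips.

```python
def instructions_per_second(clock_hz, ipc):
    """Throughput = clock frequency x average instructions per clock cycle."""
    return clock_hz * ipc


# Hypothetical IPC values, for illustration only.
P3_IPC = 1.0  # assumed baseline
P4_IPC = 0.8  # assumed lower because of the longer pipeline

p3_1000 = instructions_per_second(1.0e9, P3_IPC)
p4_1400 = instructions_per_second(1.4e9, P4_IPC)
p3_1400 = instructions_per_second(1.4e9, P3_IPC)

for name, ips in [
    ("P3 @ 1.0GHz", p3_1000),
    ("P4 @ 1.4GHz", p4_1400),
    ("P3 @ 1.4GHz", p3_1400),
]:
    print(f"{name}: {ips / 1e9:.2f} billion instructions/sec")

# IPC the 1.4GHz part needs just to match the 1.0GHz part.
break_even_ipc = 1.0e9 * P3_IPC / 1.4e9
print(f"Break-even IPC at 1.4GHz: {break_even_ipc:.2f}")
```

With these made-up numbers the 1.4GHz P4 lands between the 1GHz P3 and a 1.4GHz P3. The higher clock only pays off while the IPC stays above the break-even value.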
The one area where P4 should excel is in SSE2 optimized floating point math intensive applications, which is why Intel are now trying to reposition the P4 as an Internet/multimedia CPU rather than a general purpose one. The fallacy of this is that once you can decode your DivX in real-time, you don't need to go any faster!
Could someone explain to me how having a longer pipeline speeds things up? This seems kinda counterintuitive to me. Guess it's like the pipelines in the 3D GPUs, but I don't see how that would work in a general purpose CPU.
The longer the pipeline is, the smaller each stage (of the pipeline) is. The smaller the stages are, the higher the frequency you can run them at. If you cut each of the existing stages exactly down the middle, you could run your CPU at twice the frequency, without making any other changes! (Of course, you can never cut a stage exactly in half, so you'll never reach a full 2x increase).
Why don't we make 10,000-stage pipelines, then, you might ask :). In the ideal world, a completed instruction "comes out" of the pipeline at each clock cycle, so with 2x frequency, your CPU is twice as fast. The problem is, with a huge pipeline, you increase the chance that an instruction will "stall" along the way, and you'll get fewer than 1 instruction (on average) coming out on each clock cycle (the "IPC" thing the article talks about). If you add enough stalls to your pipeline, you might actually decrease your CPU's performance.
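For anyone who wants to play with the trade-off, here is a toy model in Python. Everything in it is an assumption made for illustration: the total logic delay, the fixed per-stage overhead (the reason you never get the full 2x from splitting stages), and a stall rate where each stall costs as many cycles as the pipeline is deep. None of the constants describe a real CPU.

```python
T_LOGIC_NS = 10.0  # assumed total logic delay to process one instruction
T_LATCH_NS = 0.2   # assumed fixed overhead added by every pipeline stage
STALL_RATE = 0.05  # assumed fraction of instructions that stall or flush


def clock_ghz(depth):
    """Clock rate if the logic is split evenly across `depth` stages."""
    cycle_time_ns = T_LOGIC_NS / depth + T_LATCH_NS
    return 1.0 / cycle_time_ns


def ipc(depth):
    """Average instructions per cycle; each stall costs `depth` cycles."""
    return 1.0 / (1.0 + STALL_RATE * depth)


def throughput(depth):
    """Billions of instructions per second = clock rate x IPC."""
    return clock_ghz(depth) * ipc(depth)


for depth in (5, 10, 20, 40, 80, 160):
    print(
        f"{depth:4d} stages: {clock_ghz(depth):5.2f} GHz, "
        f"IPC {ipc(depth):.2f}, {throughput(depth):.2f} G instr/sec"
    )

# Sweep a range of depths and pick the one with the best throughput.
best_depth = max(range(1, 201), key=throughput)
print(f"Best depth in this toy model: {best_depth} stages")
```

In this model the clock rate keeps climbing as you add stages, but throughput peaks and then falls off once the stall cost outgrows the frequency gain. Where the peak lands depends entirely on the assumed constants.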
Never underestimate the bandwidth of a 747 filled with CD-ROMs.