Slashdot Mirror


Understanding Pipelining and Superscalar Execution

Zebulon Prime writes "Hannibal over at Ars has just posted a new article on processor technology. The article uses loads of analogies and diagrams to explain the basics behind pipelining and superscalar execution, and it's actually kind of funny (for a tech article). It's billed as a basic introduction to the concepts, but as a CS student and programmer I found it really helpful. I think this article is a sequel to a previous one that was linked here a while ago."

7 of 87 comments (clear)

  1. Branch Prediction by hng_rval · · Score: 5, Informative

    One thing that his excellent analogy leaves out is the concept of branch prediction.

    For those of you who didn't major in CS...

    Imagine that we finish the first stage of building our SUV (building the engine) and commence with stage 2 (putting the engine in the stasis). While we are doing that we are building another engine for SUV #2. However, what if the next customer didn't want an SUV, but instead wanted a compact car. We have to throw away our engine for SUV #2 and start over. We wasted an entire stage!!

    This analogy doesn't work so well it seems. So we'll stick with computers. If you have 5 instructions in your pipeline and one of them is a conditional branch (think, If the user hit ENTER, print a message to the screen. If they hit escape, BSOD).

    If the conditional instruction is high up in the pipeline then every instruction under it could be wasted. Obviously, if the processor could predict which path the branch would follow it would waste less instructions.

    Branch predicting algorithms are extremely interesting. The early ones were very simple with:

    Prediction: Never take the branch
    OR
    Prediction: Always take the branch

    People soon realized that most branches were in loops, so they came up with a new algorithm

    Prediction: If the last time we were here we took the branch, take it again, otherwise don't take it. Basically, repeat what we did the last time we ran this instruction.

    IIRC there are lots of branch prediction algorithms, some of which are eerily accuratae (above 90%). Unforunately, branch prediction requires cache which takes away from the cache your programs need.

    --
    Thank you Mario! But our princess is in another castle!
    1. Re:Branch Prediction by jbrandon · · Score: 5, Informative

      Unforunately, branch prediction requires cache which takes away from the cache your programs need.

      This notes that the branch prediction unit has some cache that is separate from the other cache. It also notes that the PIII BPU has the "eerily accurate" prediction success you describe.

    2. Re:Branch Prediction by ottffssent · · Score: 4, Informative

      All modern CPUs have highly accurate branch predictors. Just as caches need to catch the vast majority of memory accesses (About 98% I believe is a common number for the L1+L2 caches) for the CPU to work even remotely close to its theoretical maximum, branch predictors need to avoid pipeline flushes whenever possible. Consider the P4 with 20-odd stages, more in the FP pipeline. There could be dozens of instructions in flight by the time the branch test gets completed and says it took the wrong branch - The 20-stage pipeline needs to be emptied of invalid instructions, all of which get thrown away. A new memory location needs to be loaded, and execution needs to resume at that point. Extremely inefficient, so the occurence of this sort of thing has to be kept to an absolute minimum. Since the P4 suffers so horribly in the case of mispredicted branches, it tends to do poorly in branch-heavy code such as chess benchmarks. In order to keep such wasteful operations from happening, the CPU keeps track of thousands of previous branches in an attempt to guess correctly which way a new branch will go. There is a data structure in the CPU called the BHT, Branch History Table, that holds all this information, some times with many bits of info per branch.

      I won't take the time to dig up references to see if any CPU architecture currently does all of these tricks, but consider all the things that can be done to minimize the impact of branches:
      When a branch occurs, fetch instructions from *both* possible branch locations. Begin executing both sets of instructions in parallel, keeping the CPU's back-end busy. Flag these instructions with a "left branch" or "right branch" tag. When the branch test completes, toss out the wrong instructions and keep the good ones. Both branches will execute more slowly than a correctly-guessed branch that executes only one, but in the case of a mispredict, there is no pipeline flush, and no delay waiting for the PC to update and new instructions to flow in. Also, it's hard on the caches and RAM subsystem, since two sets of instructions rather than 1 need to be fetched.
      Build a better predictor. By analyzing the type of branch and surrounding code, the branch predictor can get eerily accurate. Way better than 90%. A K6-2 could get 90% accuracy with no sweat, and that's a pretty old chip.
      Prefetch. Grab the branch test and run it way ahead of the branch itself (when possible). If the outcome of the branch can be determined before the branch is reached (using instruction reordering trickery), there is effectively no branch at all. This can be "stacked" with other techniques as well.
      I'm sure you can come up with several more. It's an interesting problem to think about, with most techniques having a good mix of benefits and drawbacks.

  2. Re:Holy Crap! by Jucius+Maximus · · Score: 2, Informative
    Heh, I was reading slashdot the night before I had my exam on this stuff as well (which was 2 weeks ago.)

    Make sure you can identify RAW, WAW, and WAR hazards. (R == read, A == after, W == write)

  3. Re:Optimizing by sparkz · · Score: 3, Informative
    It is relevant to programmers - knowing how a CPU will pipleine your instructions can help in structuring a loop to be efficient; tied-in with this is caching strategies: how you read / write your data can have a huge impact on the program's performance.

    I've posted the link earlier, but here it is again... Alan Cox recently gave an IRC talk: here.

    --
    Author, Shell Scripting : Expert Re
  4. Re:P&H - Pipelining by cheezedawg · · Score: 4, Informative

    to a certain point, but the P4 is a bit excessive

    Actually, there is a lot of research about pipeline depths, and here is a paper that calculates the optimal pipeline for x86 to be around 50 stages. In fact, they theorize that you could see up to a 90% increase in performance in the P4 by making the pipeline even deeper. So not everybody thinks that the P4 pipeline is "a bit excessive."

    think rambus (bad idea... memory width is a good thing!)

    I'm a little confused here- until the past few months, Rambus still offered superior memory bandwidth. It wasn't until DDR333 and higher that SDRAM started to catch up. Rambus didn't lose in the market because of performance.

    Itanitium (scrap an entire architechture for one that allows you to disable instructions, so that it is gauranteed that part of the processor won't be used at that point)

    That is a pretty strange complaint about Itanium. In fact, I think that it is weird that you even think that is a problem.

    --
    "The defense of freedom requires the advance of freedom" - George W Bush
  5. Re:P&H - Pipelining by j3110 · · Score: 3, Informative

    A 50 stage pipeline wouldn't be bad if it were done differently. I still prefer the old "have one instruction after the branch that always is executed" method. Only with a longer pipeline, you would need more instructions. A good optimized compiler could handle this MUCH better than a processor could. Longer pipelines need better branch prediction. I don't really care about your research, because it's all theory. It's pretty appearant from comparing P3 to P4 that longer pipelines are only better if you can manage to crank out a factor of speed greater than the factor of increase in pipeline length. The whole "Software isn't compiled for P4 optimizations" arguement is really dumb when you think about it. If it can't run X86 faster, then it must really suck. In order to compensate for this, they have to run the P4 at much faster clock rates, and the compiled binaries have to be larger so as to place the jump points in strategic positions. Deeper pipelining will cause you to run into problems of silicon transistor switching speed much earlier than you would usually as well.

    Rambus is a bad idea. It had very much inferior latency that increased with each memory module you add. The processor waits everytime you ask for ram. That's wasted cycles. There is a reason the P4 has an internal bus that is VERY wide compared to other processors. DDR's memory speed is in no way related to the bus's inferiority. The faster you make the chips, the faster the processor gets the data. With RAMBUS, you have additional latencies for the same speed of RAM. They are all made from the same DRAM design, and with the same speed chips, a wider bus will be faster.

    I'm not upset that the Itanium changes instructions sets. I just don't think that any processor that disables part of itself will ever be optimal. I don't think lugging around old instruction sets are a good idea either. It's a waste of space that could have been used for some more full multipliers.

    Don't take my word for it... go clock a P4 against another CPU and see how well it performs in sorting with RAMBUS memory. The bulk of the P4's gains on any other CPU comes from SSE2, 400Mhz FSB, and CPU clock speed. Take those away, and it will be slower per clock cycle than any other CPU (including P3), especially if it has RAMBUS memory.

    --
    Karma Clown