Understanding Pipelining and Superscalar Execution
Zebulon Prime writes "Hannibal over at Ars has just posted a new article on processor technology. The article uses loads of analogies and diagrams to explain the basics behind pipelining and superscalar execution, and it's actually kind of funny (for a tech article). It's billed as a basic introduction to the concepts, but as a CS student and programmer I found it really helpful. I think this article is a sequel to a previous one that was linked here a while ago."
no fp
I would have got the 1st post.. But all this new technology is so sloww!
If these two articles, along with the promised third one, had come along a few months ago, I could have skipped even more architecture classes and still passed. Let's hope they keep popping these out.
Sorta sound like torture techniques, don't they? Right up the ol' MS alley.
Any sufficiently well-organized Government is indistinguishable from bullshit.
But do we really need to be notified every time a part of this story comes out?
I am not a number! I am a man! And don't you
Otherwise, it is useless. Worse, "regular" programmers (meaning anyone not writing a compiler or assembler) will probably spend countless hours poring over this document trying to "squeeze out that last little bit of performance" despite the fact that Knuth proved performance is a useless metric.
I just finished a CS course co-taught by Professor Patterson, and our primary text this semester was Patterson and Hennessy's Computer Organization and Design.
When we discussed pipelining this semester, the analogy used was the four stages of doing laundry: washing, drying, folding, and stashing. Here are the lecture notes (both PDF). The notes spend a good deal of time going over the hazards of pipelines and how to avoid them.
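The arithmetic behind the laundry analogy can be sketched in a few lines. This is only an illustration: the 30-minute stage time and the function names are assumptions, not something taken from the lecture notes.

```python
# Laundry pipeline sketch: washing, drying, folding, stashing.
# Assumption (not from the lecture notes): every stage takes the same time.

def sequential_time(loads, stages=4, stage_minutes=30):
    """Finish each load completely before starting the next one."""
    return loads * stages * stage_minutes

def pipelined_time(loads, stages=4, stage_minutes=30):
    """Start a new load as soon as the washer is free.

    The first load takes `stages` steps; each later load finishes
    one step after the load in front of it.
    """
    if loads == 0:
        return 0
    return (stages + loads - 1) * stage_minutes

if __name__ == "__main__":
    for n in (1, 4, 20):
        print(n, sequential_time(n), pipelined_time(n))
```

Note that a single load takes just as long either way. Pipelining improves throughput, not the latency of one load.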
One thing that his excellent analogy leaves out is the concept of branch prediction.
For those of you who didn't major in CS...
Imagine that we finish the first stage of building our SUV (building the engine) and commence with stage 2 (putting the engine in the chassis). While we are doing that we are building another engine for SUV #2. However, what if the next customer didn't want an SUV, but instead wanted a compact car? We have to throw away our engine for SUV #2 and start over. We wasted an entire stage!!
This analogy doesn't work so well, it seems, so we'll stick with computers. Suppose you have 5 instructions in your pipeline and one of them is a conditional branch (think: if the user hit ENTER, print a message to the screen; if they hit escape, BSOD).
If the conditional branch is far along in the pipeline, then every instruction fetched behind it could be wasted. Obviously, if the processor could predict which path the branch would follow, it would waste fewer instructions.
Branch predicting algorithms are extremely interesting. The early ones were very simple with:
Prediction: Never take the branch
OR
Prediction: Always take the branch
People soon realized that most branches were in loops, so they came up with a new algorithm:
Prediction: If the last time we were here we took the branch, take it again, otherwise don't take it. Basically, repeat what we did the last time we ran this instruction.
IIRC there are lots of branch prediction algorithms, some of which are eerily accurate (above 90%). Unfortunately, branch prediction requires on-chip storage for its history tables, which takes away from the space available for the cache your programs need.
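To make the three simple schemes above concrete, here is a rough sketch that scores each one against a made-up trace of a single loop branch. The trace (taken 9 times, then not taken, repeated 10 times) and all the function names are assumptions chosen for illustration.

```python
# Toy comparison of the simple predictors for ONE branch instruction.
# `history` is the list of that branch's past outcomes (True = taken).

def never_taken(history):
    return False

def always_taken(history):
    return True

def last_outcome(history):
    # Repeat whatever the branch did last time; guess "not taken" at first.
    return history[-1] if history else False

def accuracy(predict, outcomes):
    """Fraction of outcomes the predictor guesses correctly."""
    history = []
    correct = 0
    for actual in outcomes:
        if predict(history) == actual:
            correct += 1
        history.append(actual)
    return correct / len(outcomes)

# Hypothetical trace: a loop branch taken 9 times, then falling through,
# with the whole loop run 10 times.
loop_branch = ([True] * 9 + [False]) * 10

if __name__ == "__main__":
    for predictor in (never_taken, always_taken, last_outcome):
        print(predictor.__name__, accuracy(predictor, loop_branch))
```

On this particular trace, always-taken scores 90% and last-outcome only 80%, because repeating the last outcome misses twice per loop: once on the exit and once on re-entry. That weakness is one reason real predictors keep more than one bit of state per branch.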
Thank you Mario! But our princess is in another castle!
If you claim to be a CS student and programmer and don't know this, I really have my doubts about your qualifications.
Or maybe that's because I'm an EE student. We kick your asses up and down the block at basically everything you do, you know. And software isn't even our specialty.
I have a final exam on this stuff tomorrow morning... It would seem there is a God...!!
GNU/HURD STILL DOESN'T SUPPORT PS/2 mice. The Debian team has created an ugly hack, but you will still need to find a serial mouse if you want to actually develop on the Hurd!
Any operating system without USB support these days has already shot itself in the balls!
-Cyc
/.'s 10 Millionth
Well I'm glad somebody understands it.
processors design YOU!
You're a retard.
I think it's best that you hear it here rather than from your friends.
There are things called "books" that have all the information in this article, and more. If you want to find out about a subject, you can go to a "library" which has lots of them. (Yes, those were Doctor Evil air-quotes.)
Next up, an article on how to do long division!
See item #3.
PIPELINES UNDERSTAND YOU
Thanks for the links, people. I really need this refresher course. Pipelining and superscalar execution are stuff I haven't touched for quite a while :)
In fact, Alan Cox gave a talk on this recently: UMeet2002.
Author, Shell Scripting : Expert Re
i am sad that PIPELINES understand you :*(
...you could be arrested if you understood too much!
i am sad that processors design YOU :*( !
"Understanding why black text on a white background is easier to read than black text on a white background."
John Poindexter will have you killed if you know too much, because he will know if you know too much.
...party finds YOU!
Am I the only one who finds this stuff easier to understand when the author just explains what actually happens instead of using analogies? I thought the Hennessy & Patterson version of this was better, but then it wasn't free...
I just read all of that article, and I kinda think I understood a little of it, but now I gotta lie down for a while
Go Team Cheese!
that this article isn't a hoax as well. (-:
Thank you! After suffering many long and terrible months under an oath of involuntary celibacy, this newfound knowledge in superscalar execution is sure to win me a date with one of the many "cam girl amateurs" that have been offering me free services through email for months. Thank you for restoring my confidence. Now I must learn how to convey my thoughts without using run-on sentences.
I don't need to read some dumb article to explain it.
The scaling is really super and the pipes are lined. Big freaking deal.
-
- - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.
DEC designed the Alpha CPU to interface to the bus and RAM over multiple pipelines. The EV6 is the prime of their effort, and the EV7 is even better. Because DEC couldn't get loans to offset the high consumer cost and make Alpha more affordable, they ended up in a position to sell themselves to Compaq, and their financial team followed. Compaq, in the same tradition, sold out to Hewlett-Packard, and we all remember when HP's President declared HP would close its doors if the merger didn't complete. And now HP is *trying* to kill the Alpha at the EV7. Alpha is the fastest in the market, and HP is also canceling its own in-house PA-RISC in favor of the much slower Intel Itanium 2. The only company left with a license to produce Alpha hardware is Samsung, and they've already dumped the technical details of their Alpha products from their website. This world stinks... Star Trek pulled off the air and now Alpha. Ode to Bankers for conspiracy and marketers for pushing inferior non-Alpha hardware.
But I'm sure you already Gnu that.
They actually give CS majors degrees without knowing this? Plain stupid.
The forthcoming IBM PowerPC 970 CPU is supposed to have a very sophisticated branch prediction unit. (I'm not sure how it compares to that of the POWER4, from which the PPC 970 was derived, or how it compares to other CPUs, though.)
(Disclaimer: recalling all of this from memory based on the paper I wrote a few weeks ago on the PPC 970. Forgive me if I oversimplify or misstate something.)
The PPC 970 has three branch history tables (BHTs). Each one has 16k (2^14) entries of one bit each. One BHT follows the more or less traditional method of tracking whether or not the branch prediction from a previous execution of the instruction was successful. One BHT has its entries associated with an 11-bit vector which tracks the last 11 instructions executed by the CPU (and uses this to determine if the branch prediction was successful). The third and final BHT is used to determine which of the other two BHTs has been more successful for the corresponding instructions. For each individual branch instruction, the third BHT is consulted to see which method has had better success in the past, and that BHT is then used as the branch prediction method for this execution of the instruction.
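A three-table scheme like the one described above is often called a tournament (or combined) predictor. The sketch below is only loosely modeled on that description and is not the PPC 970's actual logic: the table indexing (branch address XOR history), the update rules, and the choice to record taken/not-taken outcomes in the history are all assumptions made for illustration.

```python
# Tournament predictor sketch: two 1-bit predictors plus a 1-bit selector.
# All indexing and update rules here are illustrative assumptions.

TABLE_SIZE = 1 << 14        # 16k one-bit entries per table
HISTORY_BITS = 11           # length of the global history vector

class TournamentPredictor:
    def __init__(self):
        self.local = [False] * TABLE_SIZE       # last outcome per branch
        self.glob = [False] * TABLE_SIZE        # outcome per (branch, history)
        self.use_global = [False] * TABLE_SIZE  # selector: which table to trust
        self.history = 0                        # recent outcomes, newest in bit 0

    def _indices(self, pc):
        # Hypothetical hashing: plain modulo for the local table,
        # address XOR history for the global one.
        return pc % TABLE_SIZE, (pc ^ self.history) % TABLE_SIZE

    def predict(self, pc):
        li, gi = self._indices(pc)
        return self.glob[gi] if self.use_global[li] else self.local[li]

    def update(self, pc, taken):
        li, gi = self._indices(pc)
        local_ok = self.local[li] == taken
        global_ok = self.glob[gi] == taken
        if local_ok != global_ok:               # only one was right: trust it
            self.use_global[li] = global_ok
        self.local[li] = taken
        self.glob[gi] = taken
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

def count_correct(predictor, pc, outcomes):
    """Run one branch through the predictor and count correct guesses."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct

if __name__ == "__main__":
    alternating = [i % 2 == 0 for i in range(200)]   # T, N, T, N, ...
    print(count_correct(TournamentPredictor(), 0x400, alternating))
```

The alternating pattern is the interesting case: a plain last-outcome table is wrong every time on it, while the history-indexed table learns it after a short warm-up, so the selector ends up preferring the history-based prediction for that branch.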
CyberDave
Here is a great link if you want to visualize how this works.
"I believe in everything in moderation. Including moderation." -Dean DeLeo, Stone Temple Pilots
Hello from Spain
If there are C compilers and assemblers for DLX everywhere, where can I buy a low-power PC with 'DLX-hyperthreading inside'?
(DLX-HT 20 million transistors versus P4 3.04 HT)
(10 W for DLX versus 85 W for P4)
JCPM
Hi Americans, from Spain
DLX+Pipeline+Superscalar+Tomasulo+VLIW+Hyperthreading+
multiple caches (L1+L2+L3)+multiple RAM buses (dual, triple, ..)+multiple RAM modules+
(many ideas from people on the Internet)+
(please: better A = A op B than C = A op B, to reduce the size of the instruction's operands so it packs well in VLIW)
=>
design of CPU and GCC
:) better than stupid P4 with MMX, SSE, SSE2, .. with its proprietary compiler.
JCPM (copyright)
this article would have been a lot more interesting if he'd trimmed 50% of the distracting BS, stories of Caesar hiring his relatives to play foosball, etc. I didn't say make it boring, i said that there are so many indirect references to things having nothing to do with pipelining that someone truly new to this material is going to have a hard time teasing it apart. just my opinion.
The article presents pipelining in terms of assembly-line production, which is a good analogy for a while, but CPUs are different from factories, and the article doesn't address the biggest differences.
1. Data dependencies. In the hypothetical SUV assembly line, constructing one SUV does not depend on having completed previous SUVs; in a computer, executing one instruction almost always depends on the result of a previous instruction, and when the previous instruction is still working its way through the pipeline, things get complicated. Bypasses and stalls are what make CPU pipelines interesting to engineer and difficult to build.
2. Control flow. Factories don't have jumps and conditional branches. What happens when the arithmetic to decide whether to take a branch isn't computed until the Execute stage, but a couple more instructions have already been fetched into the pipeline?
3. Memory access. This occupies an entire stage in most minimalist CPU tutorials I've seen; it has to go after execute but before register writeback. And that's when you get lucky enough to hit the cache and you don't have to stall the pipe for a dozen or a hundred cycles to go to L2 or "real" memory.
Bypasses, stalls, annulment, and cache effects are the sort of things that make CPU design at this level interesting, and not just a rehash of factory design.
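The data-dependency point can be made concrete with a rough cycle count. The sketch below assumes the classic textbook five-stage in-order pipeline (IF, ID, EX, MEM, WB) with a register file that can be written and read in the same cycle. The `Instr` type, the gap values, and the four-instruction example program are illustrative assumptions, and control hazards and cache misses are ignored entirely.

```python
# Count cycles for a 5-stage in-order pipeline, with and without bypassing.
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    dest: Optional[str]          # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]        # registers read
    is_load: bool = False        # loads produce their result in MEM, not EX

def total_cycles(program, forwarding):
    """Cycles until the last instruction leaves WB (data hazards only)."""
    if not program:
        return 0
    last_writer = {}             # register -> (EX cycle of writer, is_load)
    prev_ex = 2                  # so the first instruction reaches EX in cycle 3
    for ins in program:
        ex = prev_ex + 1         # in-order: at most one instruction per cycle
        for reg in ins.srcs:
            if reg in last_writer:
                w_ex, w_load = last_writer[reg]
                if forwarding:
                    gap = 2 if w_load else 1   # load-use costs one bubble
                else:
                    gap = 3                    # wait for the writer's WB
                ex = max(ex, w_ex + gap)
        if ins.dest is not None:
            last_writer[ins.dest] = (ex, ins.is_load)
        prev_ex = ex
    return prev_ex + 2           # MEM and WB follow the last EX

# Hypothetical program with back-to-back dependencies.
example = [
    Instr("r1", ("r2", "r3")),               # add r1, r2, r3
    Instr("r4", ("r1", "r5")),               # sub r4, r1, r5  (needs r1)
    Instr("r6", ("r4",), is_load=True),      # lw  r6, 0(r4)   (needs r4)
    Instr("r7", ("r6", "r1")),               # add r7, r6, r1  (load-use)
]

if __name__ == "__main__":
    print("no bypass:", total_cycles(example, forwarding=False))
    print("bypass:   ", total_cycles(example, forwarding=True))
```

With no hazards at all, four instructions would take 8 cycles in this model. The example comes out at 14 cycles without bypassing and 9 with it, which is the kind of difference that makes bypass networks worth their design cost.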
- A former TA for a computer architecture course
gorgo: *lol* :) :>
joey: what's so funny?
shh, joey is losing all sanity from lack of sleep
'yes joey, very funny'
Humor him
-- Seen on #Debian
- this post brought to you by the Automated Last Post Generator...