RISC vs. CISC in the post-RISC era

Posted by Hemos on Thursday October 21, 1999 @12:47AM from the chip-differential dept.

S. Casey writes "Ars Technica has a very cool article up that takes on the typical RISC versus CISC debate. The author argues that today's microprocessors aren't RISC or CISC, really, and covers the historical/technical reasons why these two distinctions aren't particularly useful anymore. It's pretty convincing (to me)." Essentially, the author argues that it is difficult, if not impossible, to have the normal debate because both chipsets have evolved features that used to be found in the other chipset.

9 of 119 comments

RISC=load/store + fixed I-size + software burden by Anonymous Coward · 1999-10-20 22:27 · Score: 3

The author of the article rightly notes that a basic design philosophy difference is where the burden of reducing run-time should be placed. The original RISC philosophy was to place the burden on software--this is especially true of (V)LIW processors--whether the programmer, the compiler, or the set of libraries. CISC (or more rightly "old style") design philosophy sought to place the burden on the hardware. Effectively microcoded CISC is like RISC with a fixed set of library functions in a specialized read-only high-speed cache. (The obvious limitation of this being that the library is fixed in hardware.)

Separating load/store into a specialized (memory) instruction nearly forces fixed instruction size (the real mark of RISC). Yet this also works with the general idea of modularized (and potentially superscalar) processing--of course, the same could apply to microcode. Fixed instruction size has a consequence for memory access. A CISC (variable length) instruction fetch would either have to load a maximum-instruction-size number of bytes or use multiple fetches for longer instructions. (One could reduce the performance loss of this if the first load included the opcode and later loads would contain (optional) register ids or immediates (hard-coded values). Since the register ids might not be needed until a second or third stage of the pipeline, the load delay would be invisible--the next instruction fetch would forward the values to the appropriate place in the pipeline. This would make some instructions very fast with a low memory bandwidth--the number of additional values needed might vary from 0 to 3 ("save state," e.g., might require no arguments, while increment might require one argument.))

The RISC vs. CISC debate also concentrates on general purpose processors. In a single-purpose processor (with a stable, well-defined operation set), a direct-execution CISC design would make sense because the "library" of code is so small that hardwiring it (not merely placing it in ROM for a microcode implementation, though such would also be more efficient than a RISC design, it seems) would increase speed. This, of course, assumes that the design costs are low enough to justify making a specialized design for a specialized function. (With improved design tools and efficient low-output manufacturing, this could become more common. Of course, some are looking to processors that dynamically change their wiring to allow changes of specialization without replacing the hardware.) Certain classes of Digital Signal Processors would seem to fit into this category. 2D and 3D video accelerators--which rely on a generalized processor for certain functions--might be another reasonable CISC application (well defined problem, very limited instruction set required, relatively stable algorithms).
A true post-RISC processor (for general purpose computing) would take advantage of hardware scheduling (which allows scheduling based on the current data--something even a compiler cannot do), compiler optimization (which would probably include (V)LIW, likely with some predication information), and programmer knowledge (a high level language and set of libraries that allows the programmer to share knowledge of the design with the compiler). IA-64 seems to place too little burden on the programmer (perhaps rightly based on the lack of time put into writing code well--by open-sourcers who often just want to make a working program (not considering performance, cooperative development, system integration, etc.) and by 'commercial' programmers who are often trapped by unrealistic deadlines (it is faster in the short term to implement a kludge than to design (write) or research (find another's implementation) good code)) and excessive burden on the compiler (this MIGHT be reasonable since Intel can control the compiler--that being a single unit of coding). IBM's Power-4 seems to be emphasizing programmer cleverness (multithreading, multiprocessing, tight libraries) and to a lesser extent compiler cleverness.

What I wonder is why no system seems to support vector-processor-style non-unit stride in the cache (This could perhaps be implemented by having a vector mode which took, say, one set of a 2-way set associative cache and used it for non-unit offset entries. Say every N cache lines might be associated with an address, offset, and length. This would make certain vector processor functions work very well in general purpose processors.), why a sticky bit is not provided for cache (This would allow code fragments that are known to be reused somewhat frequently, but would otherwise be evicted by other loads, to stay resident.), why memory is not segmented into non-virtual-addressed kernel, non-virtual-addressed common library, and virtual-addressed application space (This would seem to allow some of the benefits of embedded systems without losing the benefits of virtual addressing where it makes sense. The compiler would use a table of system-call locations and common-library-function locations so that at linking, the jumps would be to these absolute locations. Of course, changing the libraries or kernel would require relinking all applications, but this might be relatively quick and the frequency of having to do such could be reduced by placing libraries like libc (commonly used, very stable implementation) at the beginning of the memory space and using small functions or even empty space to pad larger functions. Placing the kernel memory space onto a processor daughter card would also reduce the card-board memory traffic. Using separate kernel caching might also reduce application cache misses after a kernel invocation. The memory access style of kernels might also be used to design that specific memory system--the kernel (and the common libraries) are not paged out to disk, so the instruction memory could be designed for very slow write and fast read. (Whether the improved performance would justify the cost of a specialized memory system is questionable.))
Paul A. Clayton (not quite AC)
Comments by bhurt · 1999-10-20 22:54 · Score: 4

First of all, FP doesn't add all that many instructions to an architecture. The Alpha has about 6 FP instructions- load fp, store fp, add, subtract, multiply, divide. The PPC (ignoring SIMD) takes this up to about a dozen or so- two of which have as their sole purpose in life making sin() etc. fast to implement (three instructions, vs. a dozen or so on the Alpha).

Second, there are two _different_ optimization problems chip designers face, generally at different times. The least common optimization is the "clean slate design"- where the chip designers don't have to support anything, and can draw the boundaries wherever they make sense to be drawn. In essence this is what the RISC designers of the eighties did. In the other optimization problem, the chip designers are handed an architecture and a set of existing programs and told "make them go faster".

Super-scalar Out of Order execution, branch prediction, more functional units, and speculative code execution are all optimizations you can apply to an existing architecture *without changing the (apparent) semantics of that architecture*- i.e. without breaking legacy code. New instructions allow "new" or recompiled code to gain a performance boost without dropping support for old code (SIMD and DSP-like instructions just happen to be all the rage these days). So of course they're applied to both legacy RISC and legacy CISC applications!

Of course these "patches" are not as effective as fundamentally rearchitecting the CPU. Of course they increase the complexity of the CPU in much greater proportion than they increase performance. This doesn't imply some "ideological impurity", however- this is the fundamental problem of supporting legacy code. This article's thesis boils down to "there are only legacy CPUs out there!". Which, for the moment, is true.

But let's consider for a moment what a rearchitected CPU for today would look like. What we'd like to do is to continue the trend RISC started- of shoving the complexity off the CPU and onto the compiler. It would be sort of accurate to claim that RISC's central idea was to shove the complexity of the translation to microcode onto the compiler.

Today's CPU complexity comes primarily from the patches applied to make the legacy code run faster- especially superscalar execution, branch prediction, and speculative execution- all of which require the CPU to deduce information out of sets of instructions. It'd be nice to have the compiler _tell_ the CPU the data ahead of time, so the CPU wouldn't have to spend precious clock time and transistor budget deducing. This, of course, implies a method for explicitly communicating this information in the instruction stream (the only channel of information between the compiler and the CPU)- older instruction sets (of all stripes) forced the CPU to deduce this information because there was no channel in the instruction stream for communicating it.

If this is beginning to sound like the Itanium, you're right. Whether this is the right way to go, only time will tell (and, on advice of time's lawyers, time has no statement to make at this point).
The Coming XISC Evo/Revolution by Effugas · 1999-10-20 23:36 · Score: 4

Before I say anything, I want to commend Hannibal on an absolutely excellent article that clarified issues I thought I understood and illuminated much of the technological history behind the technology we each use every day.

I am completely impressed.

That being said, I'd like to take a moment and theorize on the direction microprocessor design is likely to go. This is my theory; you're welcome to disagree, and in fact I eagerly await commentary from those far deeper in the industry than I. Insert Slashdot Self-Correcting Nature here.

Of all the chasms in the computer world, there are few as vast as the speed differential between general purpose processors programmed to execute a given task and hard-coded ASICs (Application Specific Integrated Circuits) designed to meet the functional needs of a given process. (OK, granted, Internet -> Local Network -> Hard Drive -> System Memory -> Processor Cache -> Processor Registers is pretty vast too, but cut me some slack here.)

Telephony is a joke without ASICs--I haven't found a voice over IP solution that operates in software well enough to even be used as a room to room intercom over a 100BaseT Lan--but it's actually reasonably lag-free with hardware encoding.

Similarly, huge banks of boxen rendering frames for movies became significantly less impressive to me when I realized how many banks of Pentium Processors it would take to match, say, a single Voodoo 2. While, in recent times 3D Rendering has gotten shots in the arm on the general purpose x86 architecture via both MMX and KNI, the order of magnitude difference in speed makes CPU rendering of realtime 3D graphics almost useless.

(Then again, Sumea is probably the single coolest thing I've done with Java, short of Mindterm.)

As I observed in the Amiga newsgroup, shove a couple of custom ASICs in a box and you can run a highly competitive multitasking OS in 512K of RAM, with unmatched graphical support to boot.

But ASICs have their limitations--while they're fast at what they do, they're extremely inflexible. You can't merely program in a new transparency algorithm, nor implement Depth of Field in an architecture that totally lacks it. The inflexibility of ASICs dooms their long term viability.

CPUs are flexible but slow; ASICs are inflexible but fast. It's a dichotomy the industry is on the verge of smashing.

I dub the coming processor design specification (which, as the article correctly noted, is all RISC/CISC really are) XISC, for eXtensible Instruction Set Computing. XISC essentially specifies that the underlying computational structures--be they microcode or raw gate arrays--ought to be dynamically reconfigurable to meet the needs of the process.

Just as the lack of a quick bilinear filter function(SIMD stuff) on older Intel chips doomed them as far as efficient 3D in relation to customized ASICs, the ability to insert such a command directly into the internal microcode of a processor has a theoretical chance of executing at extremely high speeds for a non-dedicated processor.

Transmeta, also known as the only reason many people willingly acknowledge the US Patent Office, appears to be spearheading the XISC drive. Their patents refer to technologies that automatically cache microcode translations, that provide backwards-flow in case of a broken emulate, and so on. They've often been "accused" of developing a chip that can emulate any chip--in the XISC context, a chip optimized to execute the instruction set most required by any given process.

If you accept that performance drops in the orders of magnitude are suffered when a processor lacks the appropriate design for a given set of requests, it's quite obvious that intelligent designers seeking to execute a quantum leap in system performance would try to allow processors to acquire any necessary designs to achieve much higher speeds.

Of course, most of my chip designer friends would be happy to remind me that much of the speed of ASICs comes from their hard coded nature--the literal gates correspond to whatever output is desired, no translation is necessary.

Of course, here's where FPGAs come in. Field Programmable Gate Arrays are chips whose internal gate structure can be rewritten on command, sometimes many thousands of times per second. They can't be clocked as fast as true ASICs, nor are the yields as high, but one quickly morphing chip can do the job of three or four in a digital camera. With at least one company (someone give me a name!) developing a language for programmatically defining instruction sets for an FPGA processor, the technology for XISC is obviously in development.

Ah, but all is not well. In fact, while on the topic of 3D chips, the Rendition Verite chipset had a programmable RISC core, and the chip ended up failing because it could not scale in speed like 3DFX's Voodoo could. Developers could write new 3D instructions, but didn't (in general) because it was just too hard. (Yes, Carmack did.)

That's why there's such a powerful force towards automation in this XISC evo/revolution, such as the FPGA language and Transmeta's automated Microcode translations that stay in memory so as to speed up future similar instruction requests. In an ideal world, a developer merely compiles a chunk of code that profiles as heavy usage directly into CPU microcode, or at least specifies in some way that a given routine ought to be run through the "special ops" part of the system.

Whether the world will become ideal is an open question. That we will have instruction sets that morph is almost obvious; it's just a matter of when the gap between ASICs and CPUs will finally be bridged.

Yours Truly,

Dan Kaminsky
DoxPara Research
http://www.doxpara.com
The 'post-RISC' world by Mr+Z · 1999-10-20 21:40 · Score: 3
I think most people have realized that RISC CPUs (in the sense of the early MIPS and SPARC designs) have not been made for a long time. Nowadays, the only real remaining differentiator between a "RISC" machine and a "CISC" machine is whether the instruction set is LOAD/STORE based or has memory operands on various instructions. There are several reasons for this:
- Perhaps the largest reason: Backwards compatibility. If you look at the MIPS and SPARC architectures, they both brought to the table the concept of "delay slots". For instance, a LOAD instruction on MIPS won't write its result until after the second cycle, since the LOAD is pipelined over multiple cycles. (SPARC's delayed branches are similar.) Change the pipeline depth, and you sign yourself up for a lot of hocus-pocus to continue playing this charade.
- The relative cost of operations changes pretty radically over time. For example, when RISC debuted, ideas such as including a multiplier on-chip were out of the question, since they cost too many transistors. So instead, programs implemented multiplies with shifts and adds, or in some cases, with lookup-tables. Nowadays, multipliers are relatively cheap when you consider how much the transistor budget has grown.
- Memory latency has gotten really bad relative to CPU performance. This is one of the largest drivers of out-of-order issue, actually. IBM invented this way-back-when before caches were feasible. (It escapes me on which machine they did this, but it was one of the old mainframes.) The idea is that you can cope with latency by allowing instructions to run when their data arrives -- whether it's from a slow memory or slow floating point unit. (Those of you with Hennessy & Patterson on your shelf: Go look up the Tomasulo scheduling algorithm.)
- The cost of communication and control has gone way up as pipelines have gotten deeper and transistors have gotten faster than the wires that connect them. "Complex" instructions which reduce communication, and sophisticated branch-prediction schemes which try to flatten control are attempts to address these issues.
The point is that these are difficult problems, no matter what type of architecture you have. Even Alpha spends a lot of transistors allowing such things as "hit under miss" in the cache (allowing one stream of instructions to proceed while another is stalled waiting for a cache service).
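
For readers who have not seen the software multiply mentioned in the list above, here is a minimal Python sketch of the shift-and-add approach. The function name is made up for illustration, and it only handles non-negative integers; a real early-RISC routine would have been hand-written assembly, not this.

```python
def shift_add_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers using only shifts and adds."""
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative operands only")
    product = 0
    while b:
        if b & 1:       # low bit of the multiplier is set:
            product += a  # add the (shifted) multiplicand
        a <<= 1         # multiplicand doubles each step
        b >>= 1         # move on to the next multiplier bit
    return product


print(shift_add_multiply(13, 11))  # 143
```

The loop runs once per bit of the multiplier, which is why a hardware multiplier became attractive once the transistor budget allowed it.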

Every so often, when the gap between the original "programmer's model of the architecture" and "what we can do easily in silicon" gets too wide, it becomes necessary to move to a new paradigm. With this shift comes a new programmer's model. Moving from "CISC" to "RISC" was one such paradigm shift. One could argue that we're due for another one, and that VLIW/EPIC-like schemes are the new contenders.

Some approaches, such as traditional VLIW, say "We're not going to worry so much about presenting a traditional scalar model to the programmer. We're going to expose our complexity to the compiler and let it do its best, rather than play tricks behind its back." These work by exposing the functional units of the machine, removing a lot of the control complexity. These are today's "direct execution" architectures. (See TI's C6000 DSP family for a live example of VLIW in the field.)

Intel's EPIC takes VLIW a step further. It adopts a lot of the explicitness of VLIW, but it retains a lot of the chip complexity that's required to retain compatibility across widely varying family members. It's too early to know how this will turn out, but I'm somewhat concerned that it does not reduce complexity.

All-in-all, it'll be interesting to see how this turns out. Who knows what type of architectures we'll be programming in 20 years...
--Joe
--
Program Intellivision!
Make way for PISC! by TheDullBlade · 1999-10-21 00:20 · Score: 3

As anyone can obviously see, the old paradigms have failed. Make way for the future, make way for PISC!

That's right, the wave of the future is the Pathetic Instruction Set Computer.

(in all seriousness, if this stuff turns your crank, building a computer from standard TTLs is way cool)

--
/.
Design Theory by marphod · 1999-10-20 20:53 · Score: 3

There is a lot alike in today's RISC and CISC chip designs, true, but there is still a lot that is different.

The major differences between the two designs include how they interact with memory; RISCs still take three instructions to load, alter, and store a value in memory, while a CISC does it in one instruction with a memory operand. There are differences in instruction size; where a RISC design knows that each instruction is 16/32/64 bits long, CISC allows variable length instructions and needs fancy footwork in the chip design to make read-ahead buffers work well. The most significant example of this is in the x86 chip series, which still has 8 bit instructions on the legacy registers, but with prefixes and extensions can have 100+ bit instructions, as well. The system needs, effectively, a parser before feeding the instructions to the actual CPU. And while my background on CISC design is lacking, I can only imagine the design acrobatics to do superscalar/pipelined design for instructions that can do so much.
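
The load/alter/store contrast above can be sketched with a toy machine in Python. Everything here (the opcode names, the tuple format, the `run` function) is invented for illustration and does not correspond to any real ISA.

```python
def run(program, regs, mem):
    """Execute a toy program in place; return the number of instructions run."""
    for op, *args in program:
        if op == "load":        # load rd, addr      (load/store style)
            rd, addr = args
            regs[rd] = mem[addr]
        elif op == "store":     # store rs, addr     (load/store style)
            rs, addr = args
            mem[addr] = regs[rs]
        elif op == "add":       # add rd, rs         (register to register)
            rd, rs = args
            regs[rd] += regs[rs]
        elif op == "add_mem":   # add [addr], rs     (memory operand)
            addr, rs = args
            mem[addr] += regs[rs]
        else:
            raise ValueError(f"unknown op {op!r}")
    return len(program)


# Same effect, mem[0x10] += r2, written both ways.
risc_style = [("load", "r1", 0x10), ("add", "r1", "r2"), ("store", "r1", 0x10)]
cisc_style = [("add_mem", 0x10, "r2")]

mem_a, mem_b = {0x10: 5}, {0x10: 5}
n_risc = run(risc_style, {"r1": 0, "r2": 3}, mem_a)
n_cisc = run(cisc_style, {"r1": 0, "r2": 3}, mem_b)
print(n_risc, n_cisc, mem_a[0x10], mem_b[0x10])  # 3 1 8 8
```

The instruction count differs, but the work done is the same; the toy says nothing about how many cycles either version would take on real hardware.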

While not strictly a RISC/CISC issue, there is also the use of registers. In gross generalities, RISC design is much more apt to use general purpose registers than CISC. There are definitive advantages to each design.

Yes, these are all 'under the hood' items, but they have a large effect on design; compilers that know of, for instance, the legacy registers from the 8088/8086 and use them primarily, have nice small instructions, and can get the most out of the x86 instruction preloading. This has been less and less significant with newer x86's (P IIs, P IIIs) but it is still present.
The author of this article is clueless. by anonymous+loser · 1999-10-20 21:00 · Score: 4

His main evidence is a quote wherein Ditzel is quoted as saying, "Superscalar and out-of-order execution are the biggest problem areas that have impeded performance [leaps]." Obviously the author has absolutely no knowledge of how processors work internally, or he wouldn't say that this is due to the complexity of the ISA (Instruction Set Architecture).

The complexity with superscalars is not in the ISA, but in the scheduling. At the most basic level, though, RISC instructions are used because it is (effectively) impossible to schedule CISC instructions for out-of-order execution.

The whole idea with RISC is to make instructions so basic that they can (almost) all be completed in a single processor cycle. In the article, he tries to refute this with a quote from Patterson, but the quote actually refutes the author's point, and the author is too blind to realize it. Twice in the quote Patterson refers to reducing the cycle time for each instruction, but the author says that's not Patterson's point.

Today's processors take the idea a step further, by trying to execute MORE than one instruction per cycle by providing multiple processing units (the thing that does the actual addition, subtraction, or whatever) which can execute instructions in parallel. However, instructions still need to be scheduled so that they can execute in parallel while preserving dependencies.
The hardware that accomplishes this scheduling is complex.

IMHO VLIW is the way to go. With VLIW, you do the scheduling at compile time, and remove a lot of the complexity involved with hardware scheduling. Not only do you gain the possibility of higher parallelism through an increased number of processing units (you can use the silicon previously reserved for the scheduling hardware), but you also can gain a little more since theoretically a compiler can spend more time looking for dependencies between instructions, and come up with a more optimal schedule.
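
A rough idea of what "scheduling at compile time" means, as a Python sketch. This is a simple greedy in-order packer, not any real compiler's algorithm, and it rests on loud assumptions: every instruction takes one cycle, every register is written at most once (so only read-after-write dependencies matter), and any unit can run any instruction. The names `schedule_bundles` and `program` are made up.

```python
def schedule_bundles(instrs, width=2):
    """Pack instructions into bundles of at most `width` independent ops.

    instrs: list of (name, dest_register, [source_registers]).
    Returns a list of bundles; bundle i issues in cycle i.
    """
    ready_cycle = {}  # register -> first cycle in which its value is readable
    bundles = []
    for name, dest, sources in instrs:
        # Cannot issue before every producer of a source has issued.
        cycle = max((ready_cycle.get(s, 0) for s in sources), default=0)
        # Skip bundles that already use all their slots.
        while cycle < len(bundles) and len(bundles[cycle]) >= width:
            cycle += 1
        while len(bundles) <= cycle:
            bundles.append([])
        bundles[cycle].append(name)
        ready_cycle[dest] = cycle + 1  # unit latency assumed
    return bundles


program = [
    ("i1", "r1", ["a", "b"]),
    ("i2", "r2", ["c", "d"]),
    ("i3", "r3", ["r1", "r2"]),  # needs i1 and i2
    ("i4", "r4", ["e", "f"]),    # independent of everything above
    ("i5", "r5", ["r3", "r4"]),  # needs i3 and i4
]
print(schedule_bundles(program))  # [['i1', 'i2'], ['i3', 'i4'], ['i5']]
```

Five instructions fit in three two-wide bundles; the unused slot in the last bundle would be a NOP. The hardware then issues each bundle as written and never has to discover the dependencies itself.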

anyway, that's just my 2 cents.
Or perhaps you just can't read so well by ToLu+the+Happy+Furby · 1999-10-21 04:06 · Score: 3
Now let's see. Why is it that the author of this article is so "clueless", as you say?
1. When the first RISC machines came out, superscalar execution hadn't been invented yet.
  
  Some processors (Cray for one) had been doing this years before RISC came out.
Unfortunately, the actual quotation from the article is "When the first RISC machines came out, Seymore Cray was the only one really doing superscalar execution." Whoops. Looks like ya' dropped the ball on that one.
1. I also think that the ideas behind RISC such as "move the complexity from the chip into the compiler" also apply today and that VLIW is an example of this applied to scheduling
Well, that's very forward thinking of you. Of course, I guess that makes you clueless as well, because, once again, it's exactly what the author of the article wrote: "VLIW got a bad wrap when it first came out, because compilers weren't up to the task of ferreting out dependencies and ordering instructions in packets for maximum ILP. Now however, it has become feasible, so it's time for a fresh dose of the same medicine that RISC dished out almost 20 years ago: move complexity from hardware to software."

So this leaves us with your remaining "point":
1. The fast CISC chips (PII, Athlon) do instruction conversion into RISC...so if the debate is over, its because RISC won -- big time.
Well that's a brilliant insight you've uncovered there (covered in the article here), but the point of this "clueless" article (which you obviously did not read) was that there no longer is a debate. The term RISC refers to a CPU design philosophy which was invented in 1981. That was 18 years ago. It was intended as a replacement to "CISC", the CPU design philosophy which came about in the early 70's.

That is to say, the CPUs of today, whether P6 or PA-RISC or G4, and the systems that they are in, bear almost no resemblance whatsoever to either a "CISC" chip or a RISC chip. The only similarity is that the P6's and K7's of the world are compatible with the x86 ISA, which was originally written (back in 1977 IIRC) for a "CISC" chip. Yes, this adds an extra decoding step (to break down instructions into "RISC-like" ops), and yes, theoretically that means increased die-size and complexity, which of course means lower clock speeds. Oh wait--that reminds me: which currently shipping chip has the fastest clock speed?? Oh yeah--the 700MHz Athlon. With a 750 part set to be shipped later this week. And, looks like, a 900 in time for New Year's. One of those ungainly "CISC" chips, huh.

Hmm...but let's take a look at how all those competing "RISC" chips have used their incredibly simple architectures to keep die size down. Like the PA-RISC, which is, IIRC, about 6 or 7 times the size of a PIII. Or those new G4's, with their impressive yield of 0% at 500MHz. The simplicity of today's modern "RISC" chips in action.

Now, none of this is to say that the G4 or the PA or whathaveyou isn't a great design. Just that the resemblance of today's CPU's to a true "CISC" or RISC chip is so tangential as to make the categorization meaningless. And as for your "debate"--of course RISC won big time. It came out nearly 10 years after the first chips made with the "CISC" philosophy. As was, IMO, rather compellingly and insightfully explained in the article, "CISC" chips were the best possible solution to the awful state of compilers and memory available at the time. By 1981, the state-of-the-art had advanced to the point where RISC was a better solution. Duh.

Since then, compilers have gotten better, transistor densities have increased, and RAM prices have plummeted, allowing all the advancements which the author termed "post-RISC". And, looking at the CPU's of tomorrow, we see all sorts of new techniques on the horizon: optimization based on advancements in compiler/software technology, a la IA-64, MAJC, and (apparently) Transmeta; or optimizations based on increasing transistor densities, like some of the new physical parallelization designs that appear to be a couple generations down the road for Alphas and IBM chips.

But as long as people like you insist on categorizing these chips into meaningless 20-year-old design philosophies, the tech world will be a more ignorant place.

What a disappointing comment.
Not this nonsense again by chip+guy · 1999-10-20 21:31 · Score: 3

I am sick and tired of people who cannot fathom the difference between an abstract instruction set architecture (ISA) and a chip implementation with functional units, gates, and flops. The terms RISC and CISC refer to ISAs. If you build a CISC processor using many of the same implementation tricks that are commonly used in RISC processors then fine for you. But you still have a CISC MPU. RISC and CISC have always shared many implementation details be it triple ported register files, on-chip cache or a 32 bit adder.

Look at it this way. Let's say CISC is analogous to a bungalow while RISC is a two story house. These are architectural differences. Let's say that in the early days of house building bungalows were always built out of brick with load bearing walls while two story houses were wood framed. If some joker comes along and builds a bungalow with wood frame technology it suddenly doesn't make his edifice a two story house even though it may be a big improvement on earlier bungalows.

While CISC has generally caught back up to near RISC for integer performance once MPU complexity reached about 3 to 5 million transistors, the ISA differences do matter. For example, an out-of-order x86 with a translator front end and register renaming might have 16 or 32 physical GPRs instead of the eight architected GPRs. But the compiler cannot address these physical registers, it only sees eight. This means that an x86 compiler will have to spill values to memory much more often than a RISC compiler and it will not be able to exploit performance enhancing techniques such as register assignment to local and global variables and software pipelining nearly as well as a RISC.
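
The spill argument can be made a little more concrete with a back-of-the-envelope Python sketch. It is not a register allocator: it only counts how many values are live at the same time and compares that peak against the number of registers the compiler can name. The function names and the example live ranges are assumptions made up for illustration.

```python
def max_live(intervals):
    """Peak number of simultaneously live values.

    intervals: list of (start, end) live ranges, half-open [start, end).
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal times the -1 sorts first, so a freed register can be reused.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


def excess_at_peak(intervals, architected_regs):
    """Values that cannot sit in a named register at the point of peak pressure."""
    return max(0, max_live(intervals) - architected_regs)


# Twelve values whose live ranges all overlap (hypothetical inner loop).
ranges = [(i, i + 20) for i in range(12)]
print(excess_at_peak(ranges, 8))   # 4
print(excess_at_peak(ranges, 32))  # 0
```

With eight architected registers, at least four of those values have to live in memory at the peak, no matter how many physical registers the renamer has behind the scenes; with 32 nameable registers none do.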

There is also the baggage effect with CISC architectures. For example, nearly every x86 instruction modifies flag bits as well as GPRs. This means that the flags become a central dependency choke point that requires a lot of attention to address. CISC ISAs also invariably have multiple instruction sizes. This means that a CISC CPU will typically require an extra pipe stage or two to sort out instruction boundaries regardless of how "RISC-like" the backend looks.
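
The instruction-boundary point can be illustrated with a small Python sketch. The variable-length encoding below is a toy invented for this example (the low two bits of the opcode byte give the number of operand bytes that follow); it is not x86 or any real encoding.

```python
def fixed_boundaries(code: bytes, size: int = 4):
    """With fixed-size instructions every start address is known up front."""
    return list(range(0, len(code), size))


def variable_boundaries(code: bytes):
    """Toy variable-length encoding: low two bits of the opcode byte give
    the number of operand bytes that follow it."""
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        # The next start is unknown until this opcode has been examined.
        pc += 1 + (code[pc] & 0b11)
    return starts


fixed_code = bytes(12)
var_code = bytes([0x00, 0x01, 0xAA, 0x03, 1, 2, 3, 0x02, 9, 9])
print(fixed_boundaries(fixed_code))   # [0, 4, 8]
print(variable_boundaries(var_code))  # [0, 1, 3, 7]
```

In the fixed case all the start addresses can be computed independently, so several decoders can work in parallel. In the variable case each start depends on having looked at the previous instruction, which is the serial step the extra pipe stages are there to hide.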

People who believe the Intel party line of "x86 was CISC but is now RISC" should pay attention to what Intel is doing rather than saying. They are busy spending billions to develop a new RISC-like 64 bit architecture to carry them into the future. It is true AMD will stretch x86 to 64 bits but they had to change the x86 floating point programming model to a RISC like flat register file with three address instructions to even attempt to close the distance on FP code. And AMD's future success in keeping up with IA-64 and SMT superscalar RISC implementations is far from assured.