RISC vs. CISC in the post-RISC era
S. Casey writes "Ars Technica has a very cool article up that takes on the typical RISC versus CISC debate. The author argues that today's microprocessors aren't RISC or CISC, really, and covers the historical/technical reasons why these two distinctions aren't particularly useful anymore. It's pretty convincing (to me). " Essentially, the author argues that it is difficult, if not impossible, to have the normal debate because both chipsets have evolved features that used to be found in the other chipset.
The author of the article rightly notes that a basic design philosophy difference is where the burden of reducing run-time should be placed. The original RISC philosophy was to place the burden on software--this is especially true of (V)LIW processors--whether the programmer, the compiler, or the set of libraries. CISC (or more rightly "old style") design philosophy sought to place the burden on the hardware. Effectively microcoded CISC is like RISC with a fixed set of library functions in a specialized read-only high-speed cache. (The obvious limitation of this being that the library is fixed in hardware.)
Separating load/store into a specialized (memory) instruction nearly forces fixed instruction size (the real mark of RISC). Yet this also works with the general idea of modularized (and potentially superscalar) processing--of course, the same could apply to microcode. Fixed instruction size has a consequence for memory access. A CISC (variable length) instruction fetch would either have to load a maximum-instruction-size number of bytes or use multiple fetches for longer instructions. (One could reduce the performance loss of this if the first load included the opcode and later loads would contain (optional) register ids or immediates (hard-coded values). Since the register ids might not be needed until a second or third stage of the pipeline, the load delay would be invisible--the next instruction fetch would forward the values to the appropriate place in the pipeline. This would make some instructions very fast with a low memory bandwidth--the number of additional values needed might vary from 0 to 3 ("save state," e.g., might require no arguments, while increment might require one argument.))
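A rough way to see the fetch cost described above is to count how many fetch blocks each instruction straddles. This is only a sketch: the function name, the 4-byte fetch width, and the instruction lengths are made up for illustration and do not describe any real ISA.

```python
def blocks_spanned(instr_lengths, fetch_width):
    """For each instruction, count the aligned fetch blocks it touches.

    Hypothetical model: instructions are laid out back to back starting at
    address 0, and memory is read in aligned blocks of fetch_width bytes.
    """
    spans = []
    addr = 0                                  # byte address of the current instruction
    for length in instr_lengths:
        first = addr // fetch_width           # block holding the first byte
        last = (addr + length - 1) // fetch_width   # block holding the last byte
        spans.append(last - first + 1)
        addr += length
    return spans


fixed = [4, 4, 4, 4]        # fixed-size instructions, 16 bytes total
variable = [1, 6, 2, 7]     # variable-size instructions, also 16 bytes total

print(blocks_spanned(fixed, 4))      # [1, 1, 1, 1]
print(blocks_spanned(variable, 4))   # [1, 2, 2, 2]
```

With fixed-size instructions every instruction is complete after one block; with the variable-size stream most instructions need a second block before all of their bytes are available, which is the delay the opcode-first layout suggested above tries to hide.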
The RISC vs. CISC debate also concentrates on general purpose processors. In a single-purpose processor (with a stable, well-defined operation set), a direct-execution CISC design would make sense because the "library" of code is so small that hardwiring it (not merely placing it in ROM for a microcode implementation, though such would also be more efficient than a RISC design, it seems) would increase speed. This, of course, assumes that the design costs are low enough to justify making a specialized design for a specialized function. (With improved design tools and efficient low-output manufacturing, this could become more common. Of course, some are looking to processors that dynamically change their wiring to allow changes of specialization without replacing the hardware.) Certain classes of Digital Signal Processors would seem to fit into this category. 2D and 3D video accelerators--which rely on a generalized processor for certain functions--might be another reasonable CISC application (well defined problem, very limited instruction set required, relatively stable algorithms).
A true post-RISC processor (for general purpose computing) would take advantage of hardware scheduling (which allows scheduling based on the current data--something even a compiler cannot do), compiler optimization (which would probably include (V)LIW with some predication information), and programmer knowledge (a high level language and set of libraries that allows the programmer to share knowledge of the design with the compiler). IA-64 seems to place too little burden on the programmer (perhaps rightly based on the lack of time put into writing code well--by open-sourcers who often just want to make a working program (not considering performance, cooperative development, system integration, etc.) and by 'commercial' programmers who are often trapped by unrealistic deadlines (it is faster in the short term to implement a kludge than to design (write) or research (find another's implementation) good code)) and excessive burden on the compiler (this MIGHT be reasonable since Intel can control the compiler--that being a single unit of coding). IBM's Power-4 seems to emphasize programmer cleverness (multithreading, multiprocessing, tight libraries) and to a lesser extent compiler cleverness.
What I wonder is why no system seems to support vector-processor-style non-unit stride in the cache (This could perhaps be implemented by having a vector mode which took, say, one set of a 2-way set associative cache and used it for non-unit offset entries. Say every N cache lines might be associated with an address, offset, and length. This would make certain vector processor functions work very well in general purpose processors.), why a sticky bit is not provided for cache (This would retain code fragments that are known to be reused somewhat frequently but would otherwise be removed by other loads.), why memory is not segmented into non-virtual-addressed kernel, non-virtual-addressed common library, and virtual-addressed application space (This would seem to allow some of the benefits of embedded systems without losing the benefits of virtual addressing where it makes sense. The compiler would use a table of system-call locations and common-library-function locations so that at linking, the jumps would be to these absolute locations. Of course, changing the libraries or kernel would require relinking all applications, but this might be relatively quick and the frequency of having to do such could be reduced by placing libraries like libc (commonly used, very stable implementation) at the beginning of the memory space and using small functions or even empty space to pad larger functions. Placing the kernel memory space onto a processor daughter card would also reduce the card-board memory traffic. Using separate kernel caching might also reduce application cache misses after a kernel invocation. The memory access style of kernels might also be used to design that specific memory system--the kernel (and the common libraries) are not paged out to disk, so the instruction memory could be designed for very slow write and fast read. (Whether the improved performance would justify the cost of a specialized memory system is questionable.))
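The sticky-bit idea can be tried out in a few lines. The following is a toy model only, assuming a tiny fully associative cache with LRU replacement; the class name and the policy of skipping sticky lines when picking a victim are assumptions for illustration, not a description of any shipping cache.

```python
from collections import OrderedDict


class StickyLRUCache:
    """Tiny fully associative LRU cache whose lines can be marked sticky.

    Hypothetical policy: a sticky line is never chosen as the eviction victim.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # address -> sticky flag, least recent first

    def access(self, addr, sticky=False):
        """Return True on a hit, False on a miss (the line is then loaded)."""
        if addr in self.lines:
            self.lines[addr] = self.lines[addr] or sticky
            self.lines.move_to_end(addr)          # now the most recently used
            return True
        if len(self.lines) >= self.capacity:
            # Evict the least recently used line that is not sticky.
            victim = next((a for a, s in self.lines.items() if not s), None)
            if victim is None:
                return False                      # every line is pinned: bypass
            del self.lines[victim]
        self.lines[addr] = sticky
        return False


cache = StickyLRUCache(capacity=2)
cache.access("hot_loop", sticky=True)     # pin a fragment known to be reused
for addr in ["a", "b", "c"]:              # unrelated loads stream through
    cache.access(addr)
print(cache.access("hot_loop"))           # True: the pinned line survived
```

Without the sticky flag the same access pattern evicts "hot_loop" on the second unrelated load, so the final access misses.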
Paul A. Clayton (not quite AC)
... because it's got polynomial equation solving built into its instruction set!!
In grad school, I worked on the Pixel-Planes graphics supercomputer at UNC. Among its unique features were large arrays (256 to a chip) of 1-bit processors, with special hardware for solving second-degree polynomial equations. The polygon edges, texture maps, etc. were all turned into plane equations, and the resulting values solved per-pixel with one such processor per pixel. Pretty darned nifty eight years ago...
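To give a flavor of the per-pixel approach, here is a serial sketch of the first-degree case only: each polygon edge becomes an expression A*x + B*y + C, and a pixel is inside when all three edge expressions agree in sign. The function names are made up for illustration, and the nested loop merely stands in for what the hardware described above did with one processor per pixel.

```python
def edge(p, q):
    """Coefficients (A, B, C) of the line through p and q, as A*x + B*y + C."""
    (x0, y0), (x1, y1) = p, q
    return (y0 - y1, x1 - x0, x0 * y1 - x1 * y0)


def rasterize(triangle, width, height):
    """Return rows of '#'/'.' marking the pixels covered by the triangle."""
    a, b, c = triangle
    edges = [edge(a, b), edge(b, c), edge(c, a)]
    rows = []
    for y in range(height):
        row = ""
        for x in range(width):
            # Every pixel evaluates the same three linear expressions.
            values = [A * x + B * y + C for A, B, C in edges]
            inside = all(v >= 0 for v in values) or all(v <= 0 for v in values)
            row += "#" if inside else "."
        rows.append(row)
    return rows


for line in rasterize(((0, 0), (4, 0), (0, 4)), 5, 5):
    print(line)
```

Accepting either all non-negative or all non-positive values makes the test independent of the order in which the vertices are listed.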
Ooh, a sarcasm detector. Oh, that's a real useful invention.
- RISC or not is about architecture, not implementation
- RISC is really about having an architecture whose instructions pipeline cleanly, and which responds to the demands of actual workloads.
If it's possible within an architecture to generate more than one page-fault within an instruction, then you're not on a RISC (the record seems to be a VAX 3-operand memory-to-memory indirect-indexed instruction with memory-based indexes and offsets, which can generate up to 47 (!) page-faults). If you take the point of view that a P6 is a RISC core running an x86 interpreter, then still the user-visible architecture is not RISC. It would only be RISC if you let me program the core directly with its native micro-ops. "Hannibal" still doesn't understand this distinction between architecture and implementation.
"My opinions are my own, and I've got *lots* of them!"
First of all, FP doesn't add all that many instructions to an architecture. The Alpha has about 6 FP instructions- load fp, store fp, add, subtract, multiply, divide. The PPC (ignoring SIMD) takes this up to about a dozen or so- two of which have as their sole purpose in life making sin() etc. fast to implement (three instructions, vs. a dozen or so on the Alpha).
Second, there are two _different_ optimization problems chip designers face, generally at different times. The least common optimization is the "clean slate design"- where the chip designers don't have to support anything, and can draw the boundaries wherever they make sense to be drawn. In essence this is what the RISC designers of the eighties did. In the other optimization problem, the chip designers are handed an architecture and a set of existing programs and told "make them go faster".
Super-scalar Out of Order execution, branch prediction, more functional units, and speculative code execution are all optimizations you can apply to an existing architecture *without changing the (apparent) semantics of that architecture*- i.e. without breaking legacy code. New instructions allow "new" or recompiled code to gain a performance boost without dropping support for old code (SIMD and DSP-like instructions just happen to be all the rage these days). So of course they're applied to both legacy RISC and legacy CISC applications!
Of course these "patches" are not as effective as fundamentally rearchitecting the CPU. Of course they increase the complexity of the CPU in much greater proportion than they increase performance. This doesn't imply some "ideological impurity", however- this is the fundamental problem of supporting legacy code. This article's thesis boils down to "there are only legacy CPUs out there!". Which, for the moment, is true.
But let's consider for a moment what a rearchitected CPU for today would look like. What we'd like to do is to continue the trend RISC started- of shoving the complexity off the CPU and onto the compiler. It would be sort of accurate to claim that RISC's central idea was to shove the complexity of the translation to microcode onto the compiler.
Today's CPU complexity comes primarily from the patches applied to make the legacy code run faster- especially superscalar execution, branch prediction, and speculative execution- all of which require the CPU to deduce information out of sets of instructions. It'd be nice to have the compiler _tell_ the CPU the data ahead of time, so the CPU wouldn't have to spend precious clock time and transistor budget deducing. This, of course, implies a method for explicitly communicating this information in the instruction stream (the only channel of information between the compiler and the CPU)- older instruction sets (of all stripes) forced the CPU to deduce this information because there was no channel in the instruction stream for communicating it.
If this is beginning to sound like the Itanium, you're right. Whether this is the right way to go, only time will tell (and, on advice of time's lawyers, time has no statement to make at this point).
Seriously, it's not really possible to have a genuine "HISC" environment. Either the underlying layer is basic or it isn't. That is what's really important, not what is visible from "outside".
The RISC architecture follows the idea that you have as -few- instructions as possible, thus making it faster to search for what to do.
If you were to, say, have a whole load of RISC systems wired together, in such a way that they appear to be a CISC, in terms of the instructions available, it's -still- RISC. You just have (effectively) a Beowulf of RISCs.
On the other hand, if you have a translation layer, converting CISC instructions into RISC ones, you have a CISC architecture, because all you've done is move the search (which is the crucial difference in philosophy) from one layer to another. It's still present. (The previous case didn't have a search mechanism, because it doesn't need one. You can filter to multiple destinations without needing any kind of search.)
It's that search that makes the crucial difference between whether something is RISC or CISC. If the search is complex or lengthy, it's CISC. If it's simple -and- quick, it's RISC. There are no other cases.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
(Unless you have Processor-In-Memory architecture, you -HAVE- to load your data into the processor somehow!)
Multiple addressing modes - provided your instruction search time isn't impacted, these don't change whether a processor is RISC or CISC. The whole point of RISC is reducing the overheads of processing. So, if you don't add any overheads, you're not changing the type of processor.
Variable length instructions - Depends on how it's implemented. Remember, the key is the overhead. If the maximum size of instruction can be fetched in a single transaction, it's quite irrelevant as to whether you've fetched 1, 2, 4, 8 or a million bytes. It becomes CISC if you have to do multiple fetches and parse the data.
Instructions which require multiple clock cycles - you'll find even the early ARM and Transputer chips (the very ESSENCE of RISC design!) had opcodes which took more than one clock cycle. Indeed, it's impossible to do anything much inside a single clock cycle. (Even basic operations, such as adding two numbers, or fetching a word from memory, only just fit into that, and sometimes not even then. The 8086 took 2 clock cycles to do either of these.)
The problem, I believe, is the changing definition of RISC and CISC, =NOT= what chip manufacturers are doing. I use the classic definitions that I learned when RISC architectures first appeared in Britain, where (IIRC) the idea was pioneered by companies such as ARM and Inmos.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Before I say anything, I want to commend Hannibal on an absolutely excellent article that clarified issues I thought I understood and illuminated much of the technological history behind the technology we each use every day.
I am completely impressed.
That being said, I'd like to take a moment and theorize on the direction microprocessor design is likely to go. This is my theory; you're welcome to disagree, and in fact I eagerly await commentary from those far deeper in the industry than I. Insert Slashdot Self-Correcting Nature here.
Of all the chasms in the computer world, there are few as vast as the speed differential between general purpose processors programmed to execute a given task and hard-coded ASICs (Application Specific Integrated Circuits) designed to meet the functional needs of a given process. (OK, granted, Internet -> Local Network -> Hard Drive -> System Memory -> Processor Cache -> Processor Registers is pretty vast too, but cut me some slack here.)
Telephony is a joke without ASICs--I haven't found a voice over IP solution that operates in software well enough to even be used as a room to room intercom over a 100BaseT Lan--but it's actually reasonably lag-free with hardware encoding.
Similarly, huge banks of boxen rendering frames for movies became significantly less impressive to me when I realized how many banks of Pentium Processors it would take to match, say, a single Voodoo 2. While, in recent times 3D Rendering has gotten shots in the arm on the general purpose x86 architecture via both MMX and KNI, the order of magnitude difference in speed makes CPU rendering of realtime 3D graphics almost useless.
(Then again, Sumea is probably the single coolest thing I've done with Java, short of Mindterm.)
As I observed in the Amiga newsgroup, shove a couple of custom ASICs in a box and you can run a highly competitive multitasking OS in 512K of RAM, with unmatched graphical support to boot.
But ASICs have their limitations--while they're fast at what they do, they're extremely inflexible. You can't merely program in a new transparency algorithm, nor implement Depth of Field in an architecture that totally lacks it. The inflexibility of ASICs dooms their long term viability.
CPU's are flexible but slow, ASICs are inflexible but fast. It's a dichotomy the industry is on the verge of smashing.
I dub the coming processor design specification (which, as the article correctly noted, is all RISC/CISC really are) XISC, for eXtensible Instruction Set Computing. XISC essentially specifies that the underlying computational structures--be they microcode or raw gate arrays--ought to be dynamically reconfigurable to meet the needs of the process.
Just as the lack of a quick bilinear filter function (SIMD stuff) on older Intel chips doomed them as far as efficient 3D in relation to customized ASICs, the ability to insert such a command directly into the internal microcode of a processor gives it a theoretical chance of executing at extremely high speeds for a non-dedicated processor.
Transmeta, also known as the only reason many people willingly acknowledge the US Patent Office, appears to be spearheading the XISC drive. Their patents refer to technologies that automatically cache microcode translations, that provide backwards-flow in case of a broken emulate, and so on. They've often been "accused" of developing a chip that can emulate any chip--in the XISC context, a chip optimized to execute the instruction set most required by any given process.
If you accept that performance drops in the orders of magnitude are suffered when a processor lacks the appropriate design for a given set of requests, it's quite obvious that intelligent designers seeking to execute a quantum leap in system performance would try to allow processors to acquire any necessary designs to achieve much higher speeds.
Of course, most of my chip designer friends would be happy to remind me that much of the speed of ASICs comes from their hard coded nature--the literal gates correspond to whatever output is desired, no translation is necessary.
Of course, here's where FPGAs come in. Field Programmable Gate Arrays are chips whose internal gate structure can be rewritten on command, sometimes many thousands of times per second. They can't be clocked as fast as true ASICs, nor are the yields as high, but one quickly morphing chip can do the job of three or four in a digital camera. With at least one company (someone give me a name!) developing a language for programmatically defining instruction sets for an FPGA processor, the technology for XISC is obviously in development.
Ah, but all is not fair and well. In fact, while on the topic of 3D chips, the Rendition Verite chipset had a programmable RISC core, and the chip ended up failing because it could not scale in speed like 3DFX's Voodoo could. Developers could write new 3D instructions, but didn't (in general) because it was just too hard. (Yes, Carmack did.)
That's why there's such a powerful force towards automation in this XISC evo/revolution, such as the FPGA language and Transmeta's automated Microcode translations that stay in memory so as to speed up future similar instruction requests. In an ideal world, a developer merely compiles a chunk of code that profiles as heavy usage directly into CPU microcode, or at least specifies in some way that a given routine ought to be run through the "special ops" part of the system.
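The "translations that stay in memory" idea amounts to a cache keyed on the code being translated. The sketch below is purely illustrative: the class, the toy instruction syntax, and the translation table are all invented here, and nothing in it reflects how Transmeta's hardware or software actually works.

```python
class TranslationCache:
    """Remember the native translation of each guest code block."""

    def __init__(self, translate):
        self.translate = translate   # function: tuple of guest instrs -> native ops
        self.cache = {}
        self.misses = 0              # how many times we really had to translate

    def run_block(self, block):
        key = tuple(block)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.translate(key)
        return self.cache[key]


def toy_translate(block):
    """Hypothetical translator: expand each guest instruction into native ops."""
    table = {
        "inc [m]": ["load r0, m", "add r0, 1", "store m, r0"],
        "nop": [],
    }
    ops = []
    for instr in block:
        ops.extend(table[instr])
    return ops


tc = TranslationCache(toy_translate)
for _ in range(1000):
    native = tc.run_block(["inc [m]", "nop"])
print(native)      # ['load r0, m', 'add r0, 1', 'store m, r0']
print(tc.misses)   # 1: translated once, reused 999 times
```

The payoff depends entirely on how often the same block is requested again, which is why profiling for heavy usage matters in the scheme described above.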
Whether the world will become ideal is an open question. That we will have instruction sets that morph is almost obvious; it's just a matter of when the gap between ASICs and CPUs will finally be bridged.
Yours Truly,
Dan Kaminsky
DoxPara Research
http://www.doxpara.com
There are really only three differences between RISC and CISC architectures.
1. Accumulator model vs separate destination.
2. Variable length instructions.
3. Memory addressing complexity.
Since most "CISC" compilers and chips have all but abandoned the highly complex addressing modes
(relegated to slow operation), there really isn't much difference between the architectures today
except for #1 and #2.
The main advantage of #1 is to allow the compilers better control of register renaming strategies.
The main disadvantage of #1 is that the extra operand chews up bits in the instruction word.
This indirectly increases the instruction cache bandwidth (a bad thing in today's world). In fact,
if you look at the compressed 16-bit RISC instruction sets (MIPS16, THUMB), they went to an
8 register accumulator model (hmm, sound familiar)...
However, since today's superscalar processors can execute instructions so fast, copying operands
to a temporary accumulator isn't a big deal compared to missing the instruction cache. In
today's world, #1 and #2 are really tied closely together. In some sense, the variable length
instruction decode logic can be seen as "cache efficiency" enhancing logic.
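The operand-bit cost is easy to put numbers on. The arithmetic below assumes 32 registers for a classic three-operand 32-bit encoding and 8 registers for a two-operand 16-bit encoding; those register counts are illustrative choices, and the helper name is made up.

```python
def operand_bits(num_registers, num_operands):
    """Bits needed to name num_operands registers out of num_registers."""
    bits_per_register = (num_registers - 1).bit_length()
    return num_operands * bits_per_register


three_operand = operand_bits(32, 3)   # separate destination, 32 registers
two_operand = operand_bits(8, 2)      # destination doubles as a source, 8 registers

print(three_operand, 32 - three_operand)   # 15 bits of operands, 17 left in a 32-bit word
print(two_operand, 16 - two_operand)       # 6 bits of operands, 10 left in a 16-bit word
```

Under these assumptions the separate destination alone costs 5 bits per instruction, and the compact form fits twice as many instructions into the same cache line.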
Some may argue that memory addressing modes for the arithmetic functions is slowing things down.
Although that's true in some cases, in the most common case (stack access in the data cache),
today's highly pipelined "CISC" implementation is only slightly more complex than reading from a
scoreboarded register file or a reorder/retire buffer.
So although they've mostly converged to each other, the 2 operand 1 destination model is still
useful for the next processing paradigm to come down the track - dataflow processors. You can
see some of this now in EPIC (IA-64) and the TI C6000 where operation register dependencies are
encoded in the instruction in increasingly simpler ways.
Several instruction set generations from now, registers will probably go away completely and
simply the instruction dependencies will be encoded. The internals of most super-scalar/
out-of-order processors already look like this. The register numbers are just references to data
dependencies and really are just place holders. In this context, separating the operands and the
destination still makes good sense.
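The "registers are just place holders" point can be illustrated by rewriting a short instruction sequence so that each source names the instruction that produced the value instead of a register. This is a minimal sketch with a made-up three-field instruction format, not a model of any real renaming hardware.

```python
def to_dataflow(program):
    """Rewrite (dest, src1, src2) register triples as producer references.

    Each source register is replaced by the index of the instruction that
    last wrote it, or kept as a register name if nothing in the program
    wrote it yet (a live-in value). Destination names drop out entirely.
    """
    last_writer = {}
    graph = []
    for index, (dest, src1, src2) in enumerate(program):
        deps = tuple(last_writer.get(s, s) for s in (src1, src2))
        graph.append(deps)
        last_writer[dest] = index
    return graph


program = [
    ("r1", "r2", "r3"),   # 0: r1 = r2 op r3
    ("r4", "r1", "r2"),   # 1: r4 = r1 op r2   (needs the result of 0)
    ("r1", "r5", "r6"),   # 2: r1 = r5 op r6   (reuses the name r1)
    ("r7", "r1", "r4"),   # 3: r7 = r1 op r4   (needs the results of 2 and 1)
]

print(to_dataflow(program))
# [('r2', 'r3'), (0, 'r2'), ('r5', 'r6'), (2, 1)]
```

Instruction 2 reuses the name r1 but depends on neither instruction 0 nor instruction 1, which is exactly the false dependency that disappears once only the data dependencies are recorded.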
I think most people have realized that RISC CPUs (in the sense of the early MIPS and SPARC designs) have not been made for a long time. Nowadays, the only real remaining differentiator between a "RISC" machine and a "CISC" machine is whether the instruction set is LOAD/STORE based or has memory operands on various instructions. There are several reasons for this:
Perhaps the largest reason: Backwards compatibility. If you look at the MIPS and SPARC architectures, they both brought to the table the concept of "delay slots". For instance, a LOAD instruction on MIPS won't write its result until after the second cycle, since the LOAD is pipelined over multiple cycles. (SPARC's delayed branches are similar.) Change the pipeline depth, and you sign yourself up for a lot of hocus-pocus to continue playing this charade.
The relative cost of operations changes pretty radically over time. For example, when RISC debuted, ideas such as including a multiplier on-chip were out of the question, since they cost too many transistors. So instead, programs implemented multiplies with shifts and adds, or in some cases, with lookup-tables. Nowadays, multipliers are relatively cheap when you consider how much the transistor budget has grown.
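For readers who have not seen it, the shift-and-add technique mentioned above looks roughly like this. It is a sketch for non-negative integers only, written in Python for readability rather than in the assembly such routines were actually written in.

```python
def shift_add_multiply(a, b):
    """Multiply two non-negative integers using only shifts and adds."""
    product = 0
    while b:
        if b & 1:          # lowest bit of the multiplier is set:
            product += a   # add the current shifted multiplicand
        a <<= 1            # multiplicand times two
        b >>= 1            # move on to the next multiplier bit
    return product


print(shift_add_multiply(13, 11))   # 143
```

The loop runs once per bit of the multiplier, which is why a hardware multiplier became attractive as soon as the transistors were affordable.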
Memory latency has gotten really bad relative to CPU performance. This is one of the largest drivers of out-of-order issue, actually. IBM invented this way-back-when before caches were feasible. (It escapes me on which machine they did this, but it was one of the old mainframes.) The idea is that you can cope with latency by allowing instructions to run when their data arrived -- whether it's from a slow memory or slow floating point unit. (Those of you with Hennessy & Patterson on your shelf: Go look up the Tomasulo scheduling algorithm.)
The cost of communication and control has gone way up as pipelines have gotten deeper and transistors have gotten faster than the wires that connect them. "Complex" instructions which reduce communication, and sophisticated branch-prediction schemes which try to flatten control are attempts to address these issues.
The point is that these are difficult problems, no matter what type of architecture you have. Even Alpha spends a lot of transistors allowing such things as "hit under miss" in the cache (allowing one stream of instructions to proceed while another is stalled waiting for a cache service).
Every so often, when the gap between the original "programmer's model of the architecture" and "what we can do easily in silicon" gets too wide, it becomes necessary to move to a new paradigm. With this shift comes a new programmer's model. Moving from "CISC" to "RISC" was one such paradigm shift. One could argue that we're due for another one, and that VLIW/EPIC-like schemes are the new contenders.
Some approaches, such as traditional VLIW, say "We're not going to worry so much about presenting a traditional scalar model to the programmer. We're going to expose our complexity to the compiler and let it do its best, rather than play tricks behind its back." These work by exposing the functional units of the machine, removing a lot of the control complexity. These are today's "direct execution" architectures. (See TI's C6000 DSP family for a live example of VLIW in the field.)
Intel's EPIC takes VLIW a step further. It adopts a lot of the explicitness of VLIW, but it retains a lot of the chip complexity that's required to retain compatibility across widely varying family members. It's too early to know how this will turn out, but I'm somewhat concerned that it does not reduce complexity.
All-in-all, it'll be interesting to see how this turns out. Who knows what type of architectures we'll be programming in 20 years...
--Joe--
Program Intellivision!
Obviously you didn't read the article, but...
What's RISC about an ISA with AltiVec instructions? You know, single "instructions" that take multiple cycles to process because they do multiple things?
That's the point. Not that "x86 is now RISC" but that traditional RISC is as dead as traditional CISC. If RISC is a two-story house and CISC is one-story, then the current chips are all three stories.
John Mashey of MIPS/SGI has written a good description of the differences between RISC and CISC architectures. It is posted to comp.arch on an irregular basis and is available on the web here.
Mea navis aericumbens anguillis abundat
The complexity with superscalars is not in the ISA, but in the scheduling. At the most basic level, though, RISC instructions are
used because it is (effectively) impossible to schedule CISC instructions for out-of-order execution.
Where have you been? Pentium II and K6 (and now K7) have been executing CISC instructions out of order for years!
Anyway, I argue that the complexity of superscalars is in the ISA. Most CISC architectures do not embed scheduling information in the ISA, which makes the decoding phase quite complex. RISC helps the situation dramatically because the ISA defines each instruction to be of fixed length, so fewer pipeline stages are needed to find instruction boundaries. VLIW's ISA takes responsibility for scheduling away from the processor. EPIC exposes this scheduling information, but the processor still needs to do the actual scheduling.
I am opposed to VLIW because of
1) lack of binary compatibility (there are workarounds being done at IBM Research)
2) difficult to fill in all the slots. I don't care how good the compiler is. Most system and application programming languages in use today assume sequential execution. You'll end up with NOPs and unnecessary loop unrollings, which lead to code bloat.
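The slot-filling problem in point 2 can be made concrete with a toy packer. The sketch below assumes a made-up 3-wide bundle, a deliberately naive in-order greedy policy, and register names invented for the example; a real VLIW compiler reorders instructions and does considerably better than this.

```python
def pack_bundles(program, width):
    """Greedily pack (dest, sources) instructions, in order, into bundles.

    An instruction may join the current bundle only if a slot is free and it
    neither reads nor rewrites a register written earlier in that bundle.
    Unused slots are filled with "nop". Instructions are shown by their
    destination register.
    """
    bundles = []
    current, written = [], set()
    for dest, sources in program:
        conflict = dest in written or any(s in written for s in sources)
        if conflict or len(current) == width:
            bundles.append(current + ["nop"] * (width - len(current)))
            current, written = [], set()
        current.append(dest)
        written.add(dest)
    if current:
        bundles.append(current + ["nop"] * (width - len(current)))
    return bundles


# A mostly sequential chain: each step needs the previous result.
program = [("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r0"])]
bundles = pack_bundles(program, width=3)
print(bundles)
# [['r1', 'nop', 'nop'], ['r2', 'nop', 'nop'], ['r3', 'r4', 'nop']]
print(sum(b.count("nop") for b in bundles), "of", 3 * len(bundles), "slots are NOPs")
# 5 of 9 slots are NOPs
```

Moving the independent r4 instruction up into the first bundle would not shorten the schedule here, because the three-step dependency chain still needs three bundles; that is the sequential-code limit the objection is about.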
The bottom line is about price / performance AND the purchasing power of the consumers. If FPGA technology becomes so cheap that I can get 1 billion gates per chip for only a cent, I'll write my routines in verilog rather than C.
Hasdi
... is a verifier for the interpreter to the compiler of the vm translator that emits code to the vectorising assembler optimised for the hardware scheduler :-).
... As Linus pointed out, controlling the complexity of the kernel requires understanding very clearly the minimal protocol that is needed to communicate between the different functions.
/. filtering with references to Encyclopaedia Britannica or archived news sites. Much as EBay might squeal about sites "stealing" their auction databases (what they want to do in practice), it is a way of creating large aggregated information complexes.
If you look at any high level abstract language (say Python) it goes through a number of stages, each of them designed to feed into the lower layers. The debates about the various schools can be viewed as an on-going bun-fight between the various groups as to who gets the largest slice of the $$ pie and simplified workload. In some ways, the hardware guys have a conceptually easier task, they get to include more of the surrounding chipset. The software language or API developers are forced to explore unknown territory. Witness the fumbled gropings to move beyond OpenGL to higher level 3D scene representation.
The rather interesting factor is that the OpenSource scene allows flexibility for the software and hardware to be realigned periodically. The example I'm thinking about is the GGI project and the move towards the Graphics Processing Unit as a self-contained CPU instead of an add-on video board. The next step might even be dedicated I/O/media processors combining FibreChannel, TCP/IP, SCSI, XML/Perl/Java engines, codecs, etc
The biggest problem nowadays is not actually technical (tough but doable) but legal. Witness the jockeying around System on a Chip where you have to combine multiple IPs along with the core. Hardware vendors have cross-licensing portfolios and reverse-engineer their competitors to copy the ideas anyway. Linux avoids the problem by making everything GNU and thus designers/engineers can concentrate on the job without fighting with the lawyers, as well as defining prior art (cf with universities rushing to publish the human genome before the commercial mobs fence it off). Given the fast pace of the industry, the market is a stronger judge than any legal protection (why bother protecting something trivial that will be obsolete in a few years?). Perhaps in a few decades, people might look back and consider the millstone the patent system has become.
The biggest open question IMHO now is how to get multiple internet sites to interoperate. For example, some people might wish to combine customised
CPU tricks and speed races will always make headlines but despite the appeal of multi-gigahertz chips, the information backlane will remain a mess until the telcos/cable/sat get their act together.
LL
As anyone can obviously see, the old paradigms have failed. Make way for the future, make way for PISC!
That's right, the wave of the future is the Pathetic Instruction Set Computer.
(in all seriousness, if this stuff turns your crank, building a computer from standard TTLs is way cool)
Both the Ars Technica article and the MSU paper to which it refers are very well done, providing an excellent introduction to material covered in greater depth in Hennessy and Patterson's[1] Computer Architecture: A Quantitative Approach.
My one quibble is that the AT author gives the basic performance formula in what I consider an "upside down" form - time/program rather than work/second - which I think makes it harder to understand some of the later points.
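For reference, the usual time/program form of the equation is instructions/program times cycles/instruction times time/cycle, and the "right side up" form is simply its reciprocal. The numbers below are invented purely to show the arithmetic, and the function names are illustrative.

```python
def time_per_program(instructions, cycles_per_instruction, clock_hz):
    """Seconds to run a program: instructions x CPI x seconds per cycle."""
    return instructions * cycles_per_instruction / clock_hz


def programs_per_second(instructions, cycles_per_instruction, clock_hz):
    """The same relation turned upside down into a throughput figure."""
    return clock_hz / (instructions * cycles_per_instruction)


# Hypothetical machine: 1 million instructions, CPI of 2, 500 MHz clock.
print(time_per_program(1_000_000, 2, 500_000_000))      # about 0.004 seconds
print(programs_per_second(1_000_000, 2, 500_000_000))   # 250.0 programs per second
```

In the throughput form, fewer instructions, a lower CPI, or a faster clock all push the number up, which may be why it reads more naturally.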
Like Ditzel, I pine for a return to the days when true RISC architectures existed. The principles involved haven't really changed, and I think some later "evolution" has been misguided. See my recent posts in the "Intel Athlon-Killer" article for more details if you're interested.
[1] I can never remember which letters in which name get doubled, or which one was at Berkeley and which at Stanford. Apologies for any errors on either count.
Slashdot - News for Herds. Stuff that Splatters.
So we've figured out that neither RISC nor CISC was an optimal solution and that we need something new/better. Excuse me for not being impressed by this conclusion: I've been programming for quite a few years now and if there's one thing I've learned, it's that there are no optimal solutions. Every design, every optimization is biased towards some specific situation; every optimization has a tradeoff which may have far-reaching implications. What may work well in one area might be horrible in another. It's just that in the CPU world architectures have to last for a while, so the general turnaround of fundamental designs is a bit slow, but this is hardly news and should not be surprising to you if you're more than just a casual programmer.
...ALCBSKRISC (pronounced: ALCBSKRISK)
First of all you can see the benefits by the ease of pronunciation of the name...(I mean, compared to CISC...which is next to IMPOSSIBLE to say 3 times fast)
Second, it makes sense:
A Little Complex But Still Kinda Reduced Instruction Set Computing.
I'm telling you, it'll be the future!
There is a lot alike in today's RISC and CISC chip designs, true, but there is still a lot that is different.
The major differences between the two designs include how they interact with memory; RISC's still take three instructions to load, alter, and store memory alterations, CISC load and store does it in one instruction. There are differences in instruction size; where a RISC design knows that each instruction is 16/32/64 bits long, CISC allows variable-length instructions and needs fancy footwork in the chip design to make read-ahead buffers work well. The most significant example of this is the x86 chip series, which still has 8-bit instructions on the legacy registers, but with prefixes and extensions can have 100+ bit instructions as well. The system needs, effectively, a parser before feeding the instructions to the actual CPU. And while my background on CISC design is lacking, I can only imagine the design acrobatics needed to do superscalar/pipelined design for instructions that can do so much.
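To make the "parser" point concrete, here is a rough sketch in Python. The variable-length encoding is a toy I made up (the low two bits of the first byte give the number of extra bytes); it is not the real x86 encoding, which is far messier.

```python
def fixed_boundaries(code, width=4):
    """Fixed-width ISA: every start offset is known without reading the bytes."""
    return list(range(0, len(code), width))

def variable_boundaries(code):
    """Toy variable-length ISA (hypothetical encoding): the low two bits of
    the first byte say how many extra bytes follow, so each start offset
    depends on decoding the previous instruction."""
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += 1 + (code[pc] & 0x03)
    return starts

# Four toy instructions of 1, 3, 2 and 4 bytes.
stream = bytes([0x00, 0x02, 0xAA, 0xBB, 0x01, 0xCC, 0x03, 1, 2, 3])
```

The fixed-width version can hand several instructions to parallel decoders at once; the variable-length loop is serial by nature, which is why the hardware has to work hard to hide it.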
While not strictly a RISC/CISC issue, there is also the use of registers. In gross generalities, RISC design is much more apt to use general purpose registers than CISC. There are definitive advantages to each design.
Yes, these are all 'under the hood' items, but they have a large effect on design; compilers that know of, for instance, the legacy registers from the 8088/8086 and use them primarily have nice small instructions and can get the most out of the x86 instruction preloading. This has been less and less significant with newer x86's (P IIs, P IIIs) but it is still present.
RISC does stand for Reduced Instruction Set Chip, but that doesn't mean less instructions, it means less Instruction Formats. Think of how many different instruction formats x86 has, with varying lengths of instructions, non-orthogonal instructions, etc, compared with the simplified instructions provided by RISC processors, which might have as few as 3 or 4 different instruction formats.
RISC really should have been SISC, for Simplified Instruction Set Chip, but that clashes with CISC, ho hum....
Remember, you don't get a RISC chip (Reduced Instruction Set Chip Chip)! :-)
The article was silly really, the author didn't look beyond the word 'Reduced' in RISC, thought it meant less instructions, then saw that most RISC chips have tonnes more instructions than most CISC chips, and arrived at the wrong conclusion. Hell, a simple ARM chip has a theoretical 4 billion instructions (all conditional etc) but there are much fewer general operations.
RISC:
CISC:
CISC chips now typically have a RISC core, where instructions such as ADD (contents of memory A), (contents of memory B), (resulting memory location) are broken up into micro-ops, LD A, LD B, ADD A,B,C, ST C
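A toy version of that breakup, in Python. The instruction name `ADD_MEM` and the temporaries `t0`..`t2` are invented for the sketch; real decoders and their micro-op formats are proprietary and look nothing like tuples.

```python
def crack(instr):
    """Split a toy 'CISC' instruction into load/store-style micro-ops."""
    op, *operands = instr
    if op == "ADD_MEM":  # ADD (mem A), (mem B) -> (mem C)
        a, b, c = operands
        return [("LD", "t0", a), ("LD", "t1", b),
                ("ADD", "t2", "t0", "t1"), ("ST", "t2", c)]
    return [instr]  # already simple: pass through unchanged

uops = crack(("ADD_MEM", "A", "B", "C"))
```

One memory-to-memory instruction goes in, four register-style operations come out, and a simple register ADD passes through untouched.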
Anyway, just my (small) point of view...
While that is an oversimplification, it looks an awful lot like truth, sometimes. For years, processor manufacturers have touted their new "RISC" processors even though those "reduced" instruction sets might have more instructions than the 8086.
Of course, the real hallmarks of a RISC processor are the load-store architecture, the large general-purpose register sets, and the uniform instruction size, but even those aren't sufficient to give a significant performance advantage to a computer based upon the RISC architecture. In fact, various benchmarks provide evidence that CISC vs RISC has very little effect on the performance of the computer.
So, where did the RISC vs CISC distinction come from, and why do RISC processors have a reputation of being faster, all other things being equal, than CISC processors? The answer has to do with what is now the prehistory of microcomputing. Back in the dim dawn of history (early 70's) microprocessors were for embedded systems. They were, therefore, designed to minimize part count and that meant optimizing the program space. The early embedded systems were programmed universally in assembly language and programmers typically used various tricks to use space very efficiently because space was more at a premium than time.
The complexity of those early embedded systems processors was mostly focused on reducing instruction count and instruction size as much as possible. "Bit mining," while it is still around today, was a way of life for the early microprocessor programmers, and the processor manufacturers built processors to facilitate that effort.
However, in the middle of the 70's, some people started putting these processors into general-purpose computers, and the microcomputer market became significant. That drew the attention of some processor designers who wanted to add some of the advanced performance-enhancing features, like caching and pipelining, from minicomputers and mainframes, to the micros.
The only problem was that the largest scale of integration available in the 70's was a few thousand transistors. When you've got 4000 transistors in your whole processor, you're going to need to trim unnecessary functionality away from the whole processor if you want to add an on-processor cache that's of some use to somebody. Hence the desire for a reduced instruction set.
Originally, the idea was to take those transistors that would otherwise go into complex instructions and put them into performance-enhancing features. The loss of memory efficiency was not a problem because they were intended to be put into fairly large, fairly capable, and fairly expensive computers.
Of course, now the processor designers have silicon budgets of millions of transistors, and the amount taken up by the instruction set is relatively tiny. That means that, in the 20 or so years since RISC processors were first envisioned, the instruction sets of the so-called RISC processors have gotten far more complex and the CISC processors have gotten essentially all of the performance-enhancing features of the RISC processors such that there is no real difference between them any more. Moore's law has made the distinction obsolete.
His main evidence is a quote wherein Ditzel is quoted as saying, "Superscalar and out-of-order execution are the biggest problem areas that have impeded performance [leaps]." Obviously the author has absolutely no knowledge of how processors work internally, or he wouldn't say that this is due to the complexity of the ISA (Instruction Set Architecture).
The complexity with superscalars is not in the ISA, but in the scheduling. At the most basic level, though, RISC instructions are used because it is (effectively) impossible to schedule CISC instructions for out-of-order execution.
The whole idea with RISC is to make instructions so basic that they can (almost) all be completed in a single processor cycle. In the article, he tries to refute this with a quote from Patterson, but the quote actually refutes the author's point, and the author is too blind to realize it. Twice in the quote Patterson refers to reducing the cycle time for each instruction, but the author says that's not Patterson's point.
Today's processors take the idea a step further, by trying to execute MORE than one instruction per cycle by providing multiple processing units (the thing that does the actual addition, subtraction, or whatever) which can execute instructions in parallel. However, instructions still need to be scheduled so that they can execute in parallel while preserving dependencies.
The hardware that accomplishes this scheduling is complex.
IMHO VLIW is the way to go. With VLIW, you do the scheduling at compile time, and remove a lot of the complexity involved with hardware scheduling. Not only do you gain the possibility of higher parallelism through an increased number of processing units (you can use the silicon previously reserved for the scheduling hardware), but you can also gain a little more, since theoretically a compiler can spend more time looking for dependencies between instructions and come up with a more optimal schedule.
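Roughly what compile-time scheduling means, as a Python sketch. This is plain greedy list scheduling under simplifying assumptions of mine (every instruction takes one cycle, any slot can run any instruction); the instruction names and the `deps` table are hypothetical, and real VLIW compilers are far more involved.

```python
def schedule(instrs, deps, width):
    """Greedy list scheduling: pack instructions into bundles of at most
    `width` slots, placing each one only after everything it depends on
    has been issued in an earlier bundle."""
    done, bundles = set(), []
    remaining = list(instrs)
    while remaining:
        bundle = []
        for i in remaining:
            if len(bundle) < width and deps.get(i, set()) <= done:
                bundle.append(i)
        if not bundle:
            raise ValueError("cyclic dependency")
        bundles.append(bundle)
        done.update(bundle)
        remaining = [i for i in remaining if i not in done]
    return bundles

# c needs a and b; d needs c; e is independent.
deps = {"c": {"a", "b"}, "d": {"c"}, "e": set()}
bundles = schedule(["a", "b", "c", "d", "e"], deps, width=3)
```

Five instructions fit in three wide words here, and the hardware never has to work out the dependencies itself because the compiler already did.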
anyway, that's just my 2 cents.
- When the first RISC machines came out, superscalar execution hadn't been invented yet.
Unfortunately, the actual quotation from the article is "When the first RISC machines came out, Seymore Cray was the only one really doing superscalar execution." Whoops. Looks like ya' dropped the ball on that one. Some processors (Cray for one) had been doing this years before RISC came out.
- I also think that the ideas behind RISC such as "move the complexity from the chip into the compiler" also apply today and that VLIW is an example of this applied to scheduling
Well, that's very forward thinking of you. Of course, I guess that makes you clueless as well, because, once again, it's exactly what the author of the article wrote: "VLIW got a bad wrap when it first came out, because compilers weren't up to the task of ferreting out dependencies and ordering instructions in packets for maximum ILP. Now however, it has become feasible, so it's time for a fresh dose of the same medicine that RISC dished out almost 20 years ago: move complexity from hardware to software." So this leaves us with your remaining "point":
- The fast CISC chips (PII, Athlon) do instruction conversion into RISC...so if the debate is over, its because RISC won -- big time.
Well that's a brilliant insight you've uncovered there (covered in the article here), but the point of this "clueless" article (which you obviously did not read) was that there no longer is a debate. The term RISC refers to a CPU design philosophy which was invented in 1981. That was 18 years ago. It was intended as a replacement to "CISC", the CPU design philosophy which came about in the early 70's. That is to say, the CPUs of today, whether P6 or PA-RISC or G4, and the systems that they are in, bear almost no resemblance whatsoever to either a "CISC" chip or a RISC chip. The only similarity is that the P6's and K7's of the world are compatible with the x86 ISA, which was originally written (back in 1977 IIRC) for a "CISC" chip. Yes, this adds an extra decoding step (to break down instructions into "RISC-like" ops), and yes, theoretically that means increased die-size and complexity, which of course means lower clock speeds. Oh wait--that reminds me: which currently shipping chip has the fastest clock speed?? Oh yeah--the 700MHz Athlon. With a 750 part set to be shipped later this week. And, looks like, a 900 in time for New Year's. One of those ungainly "CISC" chips, huh.
Hmm...but let's take a look at how all those competing "RISC" chips have used their incredibly simple architectures to keep die size down. Like the PA-RISC, which is, IIRC, about 6 or 7 times the size of a PIII. Or those new G4's, with their impressive yield of 0% at 500MHz. The simplicity of today's modern "RISC" chips in action.
Now, none of this is to say that the G4 or the PA or whathaveyou isn't a great design. Just that the resemblance of today's CPU's to a true "CISC" or RISC chip is so tangential as to make the categorization meaningless. And as for your "debate"--of course RISC won big time. It came out nearly 10 years after the first chips made with the "CISC" philosophy. As was, IMO, rather compellingly and insightfully explained in the article, "CISC" chips were the best possible solution to the awful state of compilers and memory available at the time. By 1981, the state-of-the-art had advanced to the point where RISC was a better solution. Duh.
Since then, compilers have gotten better, transistor densities have increased, and RAM prices have plummeted, allowing all the advancements which the author termed "post-RISC". And, looking at the CPU's of tomorrow, we see all sorts of new techniques on the horizon: optimization based on advancements in compiler/software technology, a la IA-64, MAJC, and (apparently) Transmeta; or optimizations based on increasing transistor densities, like some of the new physical parallelization designs that appear to be a couple generations down the road for Alphas and IBM chips.
But as long as people like you insist on categorizing these chips into meaningless 20-year design philosophies, the tech world will be a more ignorant place.
What a disappointing comment.
RISC's still take three instructions to load, alter, and store memory alterations, CISC load and store does it in one instruction
True.
But keep in mind that RISC is designed to ideally execute one instruction per clock cycle. CISC doesn't care. It may take one instruction for the memory op, but it still takes multiple clock cycles for the op to complete.
Otherwise there wouldn't be so much effort put into math tricks to get around division. One instruction, approximately 6-12 (or more) clock cycles to complete depending on your system. (If it's faster since I took my computer architecture class, someone gimme a link...) Division is notoriously slow.
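One of those math tricks, sketched in Python: dividing an unsigned 32-bit value by the constant 3 with a multiply and a shift instead of a divide. The constant is ceil(2**33 / 3); the function name `div3` is just mine. Compilers apply this kind of transformation automatically when the divisor is known at compile time.

```python
MAGIC = 0xAAAAAAAB  # ceil(2**33 / 3)

def div3(x):
    """Unsigned 32-bit x // 3 using one multiply and one shift."""
    assert 0 <= x < 2**32
    return (x * MAGIC) >> 33
```

On real hardware the multiply produces a 64-bit product and the shift picks off the top bits, which is typically a good deal cheaper than a divide instruction.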
"You want to kiss the sky? Better learn how to kneel." - U2
"It was like trying to herd cats..." - Robert A. Heinlein
Sig:
Barbeque is a noun. Not a verb.
I am sick and tired of people who cannot fathom the difference between an abstract instruction set architecture (ISA) and a chip implementation with functional units, gates, and flops. The terms RISC and CISC refer to ISAs. If you build a CISC processor using many of the same implementation tricks that are commonly used in RISC processors then fine for you. But you still have a CISC MPU. RISC and CISC have always shared many implementation details be it triple ported register files, on-chip cache or a 32 bit adder.
Look at it this way. Let's say CISC is analogous to a bungalow while RISC is a two story house. These are architectural differences. Let's say that in the early days of house building bungalows were always built out of brick with load bearing walls while two story houses were wood framed. If some joker comes along and builds a bungalow with wood frame technology, it doesn't suddenly make his edifice a two story house even though it may be a big improvement on earlier bungalows.
While CISC has generally caught back up to near RISC for integer performance once MPU complexity reached about 3 to 5 million transistors, the ISA differences do matter. For example, an out-of-order x86 with a translator front end and register renaming might have 16 or 32 physical GPRs instead of the eight architected GPRs. But the compiler cannot address these physical registers; it only sees eight. This means that an x86 compiler will have to spill values to memory much more often than a RISC compiler, and it will not be able to exploit performance enhancing techniques such as register assignment to local and global variables and software pipelining nearly as well as a RISC.
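A crude way to see the spill problem, in Python. The live ranges below are hypothetical and the model is deliberately simplistic (it only counts how many values are live at the worst moment, ignoring everything a real register allocator does), but it shows why 8 visible registers hurt where 32 do not.

```python
def peak_pressure(live_ranges):
    """Largest number of values live at the same time.
    Each range is (first_def, last_use), inclusive."""
    events = []
    for start, end in live_ranges:
        events.append((start, 1))
        events.append((end + 1, -1))
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

def must_spill(live_ranges, num_regs):
    """Lower bound on values that cannot stay in registers at the peak."""
    return max(0, peak_pressure(live_ranges) - num_regs)

# Hypothetical loop body: 12 values whose lifetimes all overlap.
ranges = [(i, i + 20) for i in range(12)]
```

With these ranges, an 8-register machine has to push at least four values out to memory, while a 32-register machine keeps everything in registers no matter how many physical registers the renamer has behind the scenes.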
There is also the baggage effect with CISC architectures. For example, nearly every x86 instruction modifies flag bits as well as GPRs. This means that the flags become a central dependency choke point that requires a lot of attention to address. CISC ISAs also invariably have multiple instruction sizes. This means that a CISC CPU will typically require an extra pipe stage or two to sort out instruction boundaries regardless of how "RISC-like" the backend looks.
People who believe the Intel party line of "x86 was CISC but is now RISC" should pay attention to what Intel is doing rather than saying. They are busy spending billions to develop a new RISC-like 64 bit architecture to carry them into the future. It is true AMD will stretch x86 to 64 bits, but they had to change the x86 floating point programming model to a RISC-like flat register file with three address instructions to even attempt to close the distance on FP code. And AMD's future success in keeping up with IA-64 and SMT superscalar RISC implementations is far from assured.
Of course it's possible to have a hybrid! Look at the powerPC chip. In designing that chip, they tried to follow the risc mentality, having simple instructions and a low orthogonality. They also however wanted to have powerful math instructions, and you end up having instructions like a+b*c as a single instruction... certainly risc.
CISC/RISC is more than instructions that 'do a lot'; the complexity of instructions also takes these into account:
CISC characteristics:
Register to register, register to memory, and memory to register commands.
Multiple addressing modes for memory, including specialized modes for indexing through arrays
Variable length instructions where the length often varies according to the addressing mode
Instructions which require multiple clock cycles to execute.
It is not hard to imagine a basically RISC machine that allows different memory addressing modes? gee, what do we have now, a hybrid!
What I wish the author made more of a point of is that this "debate" between camps has been moot for years. I think it can be argued that RISC in its pure sense refers only to those old early 80's CPUs, and everything since then is part of an evolution from the original principle. People may like the warm feeling of saying that their chip is part of a greater cadre of RISC(=good) parts, but wherever we are now, we've surely evolved to somewhere else from the old terms. The author also seems too willing to let his writing "make a point" and put people in their place, instead of just trying to educate... Maybe he's hung out in too many forums where folks are endlessly bickering. I read Patterson and Hennessy - wonderful book... anyone interested in the subject should pick up a copy. E
Evan - needs to hit preview before submitting
All I can see is you making claims that the author didn't make. He tries to move the debate to a useful level. If you want to talk about an architecture whose instructions pipeline cleanly, aren't you talking about performance oriented concerns? And if so, why are you unwilling to look at the aftereffects of implementation?
As Ditzel has said, the theory is useful in practice because you end up with a FINAL PRODUCT that is every bit as wieldy as CISC.
Now, you can do what many RISC supporters do and try to continually refine what RISC means, but all you are doing there is playing "essentialist" games and missing the point: there's more in common now between so-called RISC and CISC CPUs than anyone ever imagined. If that's the case, then the distinction is hardly worth what marketers and trumpeters on web sites make it out to be.