Research Shows RISC vs. CISC Doesn't Matter
fsterman writes The power advantages brought by the RISC instruction sets used in Power and ARM chips are often pitted against the X86's efficiencies of scale. It's difficult to assess how much the difference between instruction sets matters, because teasing out the theoretical efficiency of an ISA from the proficiency of a chip's design team, the technical expertise of its manufacturer, and support for architecture-specific optimizations in compilers is nearly impossible. However, new research examining the performance of a variety of ARM, MIPS, and X86 processors gives weight to Intel's conclusion: the benefits of a given ISA to the power envelope of a chip are minute.
I've read that the legacy x86 instructions were virtualized in the CPU a long time ago, and that modern Intel processors are effectively RISC cores that translate x86 instructions inside the CPU.
Back when compilers weren't crazy optimized to their target instruction set, people coding things in assembler wanted CISC, and people using compilers wanted RISC.
But nowadays almost no one still does the former, and the latter uses CISC chips a lot better.
This is now a question for comp sci history, not engineers.
The CPU ISA isn't the important aspect. Reduced power consumption mostly stems from not needing a high end CPU because the expensive tasks are handled by dedicated hardware. What counts as top of the line ARM hardware can barely touch the processing power of a desktop CPU, but it doesn't need to be faster because all the bulk processing is handled by graphics cores and DSPs. Intel has for a long time tried to stave off the barrage of special purpose hardware. The attempts to make use of ever more general purpose CPU power sometimes bordered on sad clown territory (Remember Intel's attempt to make raytracing games look like something worth pursuing? Guess why: Raytracing is notoriously difficult to implement on graphics hardware due to the almost random data accesses.)
As a greybeard, I remember when choosing Intel over Sun meant the project wasn't completed on time, and your electrical/mechanical engineering group lived in the breakroom while their jobs chugged along. Intel was a toy train compared to the power you'd get with RISC. However, I can somewhat confidently say the RISC vs. CISC battle is moot these days, because x86 has largely caught up to Power, SPARC, and others. A competent argument could be made, however, that if it weren't for AMD, most servers would probably still be running some flavour of RISC. The foolhardy nature of Sun and SGI can also be argued as a cause of their demise, but I'll not flame. Intel wouldn't have bothered to get off their duff without a poke in the ribs from AMD; they had partnerships with RISC manufacturers anyhow, and their own RISC-ish processor called Itanium. Outside of performance, though, there is another reason people stick with Power and others, just as they have in the past: lock-in.
You see, applications like Oracle Business Objects and JD Edwards come with a quid pro quo of exacting standards to which most businesses must adhere. Namely, IBM or Sun/Oracle hardware. You may only need accounting and payroll, but you'll have to clear a corner of the room for the circus to set up their hardware and make sure everything is "just so." Their hope is that their quiet mandate becomes your quiet mandate, and before you know it other systems that interact with JDE are now required to be Power-based because "that's what runs JDE." The only way out of this is to realize that any business that doesn't explicitly do payroll or metrics for profit doesn't need the kind of horsepower decreed by things like SAP.
Granted, you can build a tablet to do specific tasks (like decoding video codecs) around a really slow processor and some special-purpose DSPs. But perhaps the companies in that business aren't making enough profit to interest Intel.
This study looks seriously flawed. They just throw up their hands at doing a direct comparison of architectures when they try to use extremely complicated systems and sort of do their best to beat down and control all the factors that introduces. One of the basic principles of a scientific study is that independent variables are controlled. It's very hard to say how much the instruction set architecture matters when you can't tell what pipelining, out of order execution, branch prediction, speculative execution, caching, shadowing (of registers), and so on are doing to speed things up. An external factor that could influence the outcome is temperature. Maybe one computer was in a hotter corner of the test lab than the other, and had to spend extra power just overcoming the higher resistance that higher temperatures cause.
It might have been better to approach this from an angle of simulation. Simulate a more idealized computer system, one without so many factors to control.
They are seriously comparing some 90nm process with much better Intel 32nm and 45nm processes.
They have just taken some random cores made on random (and incomparable) manufacturing technologies, thrown a couple of benchmarks at them, and tried to declare universal results based on these.
A few facts about the benchmark setup and the cores:
1) They use an ancient version of GCC. ARM suffers from this much more than x86.
2) Bobcat is a relatively balanced core, with no bad bottlenecks. Its mfg tech is cheap, not high performance, but relatively small/new.
3) Cortex A8 and A9 are really starved by bad cache design. The newer A7 and A12 would be similar in area and power consumption but much better in performance and performance/power. They are also manufactured on old, cheap mfg processes, which hurts them. Use modern manufacturing tech and the results are much better.
4) Their Loongson is made on ANCIENT technology. With modern mfg tech it would be many times better on performance/power.
5) The Cortex A15, even though it is made on a 32nm process, uses a cheap process, not much better than Intel's 45nm process and much worse than Intel's 32nm. It's also known to be a "power hog" design. Qualcomm's Krait has a similar performance level, but with much lower power.
No relation to energy used. It's in the article: Haswell will get its work done faster and use about the same energy as the slower chips that take longer. What matters is architecture, not ISA (Atom is lower power than Haswell at the same process node).
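The energy point here is just energy = average power x time to finish. A rough sketch of that arithmetic follows; the helper name and every number in it are made up for illustration and are not measurements from the article or the paper.

```python
def energy_joules(avg_power_watts: float, runtime_seconds: float) -> float:
    """Energy for one task: average power draw times the time taken to finish."""
    return avg_power_watts * runtime_seconds


# Hypothetical figures, chosen only to show the trade-off (not from the study):
# a fast core drawing 20 W for 1 s versus a slow core drawing 2 W for 10 s.
fast_core = energy_joules(avg_power_watts=20.0, runtime_seconds=1.0)
slow_core = energy_joules(avg_power_watts=2.0, runtime_seconds=10.0)

print(fast_core, slow_core)
```

With these invented inputs the two cores use the same energy per task even though their power draw differs by 10x, which is the distinction between power and energy the comment is making.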
x86 instructions are, in fact, decoded to micro-ops, so the distinction isn't as useful in this context.
Actually it is. Modern performance tuning has a lot to do with cache misses and such. CISC can allow for more instructions per cache hit. The strategy of a hybrid type design, CISC external architecture and RISC internal architecture definitely has some advantages.
That said, the point of RISC was not solely execution speed. It was also simplicity of design. A simplicity that allowed organizations with less money and fewer resources than Intel to design very capable CPUs.
That is correct. Every time this comes up I like to spark a debate over what I perceive as the uselessness of referring to an "instruction set architecture" because that is a bullshit, meaningless term and has been ever since we started making CPUs whose external instructions are decomposed into RISC micro-ops. You could switch out the decoder, leave the internal core completely unchanged, and have a CPU which speaks a different instruction set. It is not an instruction set architecture. That's why the architectures themselves have names. For example, K5 and up can all run x86 code, but none of them actually have logic for each x86 instruction. All of them are internally RISCy. Are they x86-compatible? Obviously. Are they internally x86? No, nothing is any more.
This same myth keeps being repeated by people who don't really understand the details on how processors internally work.
You cannot just change the decoder; the instruction set affects the internals a lot:
1) Condition handling is totally different on different instruction sets. This affects the back-end a lot. X86 has a flags register, many other architectures have predicate registers, and some have predicate registers with different conditions.
2) There are totally different numbers of general purpose and floating point registers. The register renamer makes this a smaller difference, but then there is the fact that most RISCs use the same registers for both FPU and integer, while X86 has separate registers for each. This totally separates them: the internal buses between the register files and function units in the processor are done very differently.
3) Memory addressing modes are very different. X86 still does relatively complex address calculations in a single micro-operation, so it has more complex address calculation units.
4) Whether there are operations with more than 2 inputs, or more than 1 output, has quite a big impact on what kind of internal buses are needed and how many register read and write ports are needed.
5) There are a LOT of more complex instructions in the X86 ISA which are not split into micro-ops but handled via microcode. The microcode interpreter is totally missing on pure RISCs (but exists on some not-so-pure RISCs like Power/PowerPC).
6) The instruction set dictates the memory alignment rules. Architectures with stricter alignment rules can have simpler load-store units.
7) The instruction set dictates the multicore memory ordering rules. This may affect the load-store units, caches and buses.
8) Some instructions have different bitnesses in different architectures. For example, x86 has N x N -> 2N wide multiply operations, which most RISCs don't have. So x86 needs a bigger/different multiplier than most RISCs.
9) X87 FPU values are 80 bits wide (truncated to 64-bit when storing/loading). Practically all the other CPUs have a maximum of 64-bit wide FPU values (though some versions of Power have support for 128-bit FP numbers also).
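To make point 8 concrete, here is a small Python sketch of a 32 x 32 -> 64 widening multiply, modelling 32-bit registers with masks. The function names are invented for illustration. It shows what a low-half-only multiplier throws away, and how the missing high half can be rebuilt from 16-bit partial products, which is roughly the extra work a machine without a widening multiplier would have to spread over several instructions.

```python
MASK32 = 0xFFFFFFFF


def mul_wide(a, b):
    """32 x 32 -> 64 multiply, returned as (high, low) 32-bit halves."""
    product = (a & MASK32) * (b & MASK32)
    return (product >> 32) & MASK32, product & MASK32


def mul_low(a, b):
    """32 x 32 -> 32 multiply: only the low half of the product survives."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


def mul_high_from_halves(a, b):
    """Rebuild the high 32 bits from four 16 x 16 partial products."""
    a_lo, a_hi = a & 0xFFFF, (a >> 16) & 0xFFFF
    b_lo, b_hi = b & 0xFFFF, (b >> 16) & 0xFFFF
    lo_lo = a_lo * b_lo
    hi_lo = a_hi * b_lo
    lo_hi = a_lo * b_hi
    hi_hi = a_hi * b_hi
    # Sum the middle column (bits 16..31) to find the carry into bit 32.
    mid = (lo_lo >> 16) + (hi_lo & 0xFFFF) + (lo_hi & 0xFFFF)
    return (hi_hi + (hi_lo >> 16) + (lo_hi >> 16) + (mid >> 16)) & MASK32


high, low = mul_wide(0xFFFFFFFF, 0xFFFFFFFF)
print(hex(high), hex(low), hex(mul_low(0xFFFFFFFF, 0xFFFFFFFF)))
```

This says nothing about how any real multiplier is laid out in silicon; it only illustrates why the width of the result is part of what the instruction set asks of the hardware.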
That's easy: maintain compatibility with fucktons of legacy code; arguably more of which exists for x86 than every other architecture combined...
This same myth keeps being repeated by people who don't really understand the details on how processors internally work.
Actually, YOU are wrong.
You cannot just change the decoder; the instruction set affects the internals a lot:
All the reasons you list could be "fixed in software". The fact that silicon designed by Intel handles opcodes in a way that is a little better optimized toward being fed from an x86-compatible front-end is just a specific optimisation. Doing the same thing with another RISCy back-end, i.e. interpreting the same ISA fed to the front-end, will simply require each x86 instruction to be executed as a different set of micro-instructions (some that are handled as a single ALU opcode on Intel's silicon might require a few more instructions, but that's about the only difference).
You could switch the front-end and speak a completely different instruction set. It's just that if the two ISAs are radically different, the result wouldn't be as efficient as a chip designed with that ISA in mind. (You would need a much bigger and less efficient microcode, because of all the reasons you list. They won't STOP Intel from making a chip that speaks something else. Intel will simply produce a chip where the front-end is much clunkier and less efficient, wastes 3x more opcodes per instruction, and wastes a lot of time waiting for some bus to free up or copying values around, etc.)
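A toy model of the "swap the front-end" idea, as a sketch only: one tiny load/store back-end and two invented front-ends that decode different instruction formats into the same micro-ops. The instruction names (`addm`, `load`, `add`, `store`), the register names and the tuple encoding are all hypothetical and do not correspond to any real ISA or micro-op format.

```python
def run(micro_ops, regs, mem):
    """Shared back end: a tiny load/store machine that knows three micro-ops."""
    for op, dst, a, b in micro_ops:
        if op == "load":        # regs[dst] <- mem[a]
            regs[dst] = mem[a]
        elif op == "add":       # regs[dst] <- regs[a] + regs[b]
            regs[dst] = regs[a] + regs[b]
        elif op == "store":     # mem[dst] <- regs[a]
            mem[dst] = regs[a]
        else:
            raise ValueError(f"unknown micro-op: {op}")
    return mem


def decode_memory_operand(insn):
    """Front end A (CISC-like): 'addm addr, reg' means mem[addr] += reg."""
    name, addr, reg = insn
    if name != "addm":
        raise ValueError(f"unknown instruction: {name}")
    return [("load", "tmp", addr, None),
            ("add", "tmp", "tmp", reg),
            ("store", addr, "tmp", None)]


def decode_load_store(insn):
    """Front end B (RISC-like): each instruction already is one micro-op."""
    return [insn]


def execute(program, decode, regs, mem):
    """Decode a whole program with the given front end, then run it."""
    micro_ops = [uop for insn in program for uop in decode(insn)]
    return run(micro_ops, regs, mem), len(micro_ops)


program_a = [("addm", 0x10, "r1")]
program_b = [("load", "r2", 0x10, None),
             ("add", "r2", "r2", "r1"),
             ("store", 0x10, "r2", None)]

mem_a, uops_a = execute(program_a, decode_memory_operand, {"r1": 7}, {0x10: 5})
mem_b, uops_b = execute(program_b, decode_load_store, {"r1": 7}, {0x10: 5})
print(mem_a, mem_b, uops_a, uops_b)
```

Both programs leave the same value in memory using the same back-end. The disagreement in this thread is about cost, not possibility: a front-end that is a poor match for the back-end would need longer micro-op sequences than this friendly case does.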
And to go back to the parent...
Not only is this possible, but this was INDEED done.
There was an entire company called "Transmeta" whose business was centered around exactly that:
Their chip, the "Crusoe", was compatible with x86.
- But the chip was actually a VLIW chip, about as remote from a pure x86 core as possible.
- The front-end was entirely, 100% pure software.
The advantage touted by Transmeta was that, although their chip was a bit slower and less efficient, it consumed a tiny fraction of the power and was field-upgradeable (in theory, just issue a firmware upgrade to support newer instructions). Transmeta had demos of Crusoe playing back MPEG video on a few watts, whereas a Pentium 3 (the then lower-power Intel chip) would consume far more.
Sadly, it all happened in an era when pure raw performance was king, and when using a small nuclear plant to power a Pentium IV (the then high-performance flagship) and needing a small lake nearby for cooling was considered perfectly acceptable. So Crusoe didn't see that much success.
Still, Crusoe was successfully used as a test bed for a few experimental CPUs, to test their ISA before actual test beds were available. (If I remember correctly, Crusoes were used to test running x86_64 code before actual Athlon 64s were available to developers), and there were a few experimental proofs-of-concept running the PowerPC ISA.
In a modern way, this isn't that dissimilar from how Radeon handles compiled shaders, except that the front-end is now a piece of software which runs inside OpenGL on the main CPU: intermediate instructions are compiled to either VLIW or GCN opcodes, which are 2 entirely different back-ends.
(Except that, due to the highly repetitive nature of a shader, instead of decoding instructions on the fly as they come, you optimise it once into opcodes, store it in a cache, and you're good.)
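The "translate once, cache, reuse" pattern can be sketched with a memoised function. Everything here is a stand-in: `translate`, the semicolon-separated "intermediate code" and the back-end labels are invented for illustration and have nothing to do with how an actual graphics driver is written.

```python
from functools import lru_cache

translations_done = 0  # counts how often real translation work happens


@lru_cache(maxsize=None)
def translate(source: str, backend: str) -> tuple:
    """Pretend code generator: tag each intermediate op with its back end."""
    global translations_done
    translations_done += 1
    return tuple(f"{backend}:{op}" for op in source.split(";"))


shader = "mul;add;store"  # hypothetical intermediate code

# The same shader is requested many times; only the first call translates.
for _ in range(1000):
    ops = translate(shader, "vliw")

# A different back end is a different cache key, so it translates once more.
other_ops = translate(shader, "gcn")

print(translations_done, ops, other_ops)
```

The contrast with a hardware decoder is that the cost of translation is paid once per program and back-end pair, rather than every time an instruction passes through the front-end.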
Again, in a similar way, ARM can switch between 2 different instruction sets (normal and Thumb mode): 2 different sets, one back-end.
It is really surprising that neither the linked ExtremeTech article nor the Slashdot summary cites the original source. This research was presented at HPCA'13 in a paper titled "Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures", by Emily Blem et al., from the University of Wisconsin's Vertical Research group, led by Dr. Karu Sankaralingam. You can find the original conference paper on their website.
The ExtremeTech article indicates that there are new results with some additional architectures (MIPS Loongson and AMD processors were not included in the original HPCA paper), so I assume that they have published an extended journal version of this work, which is not yet listed on their website. Please add a comment if you have a link to the new work.
I do not have any relation with them, but I knew the original HPCA work.
All of which paints a bleak picture for Itanium. There is no compelling reason to keep Itanium alive other than existing contractual agreements with HP. SGI was the only other major Itanium holdout, and they basically dumped it long ago. And Itaniums are basically just glorified space heaters in terms of power usage.
Itanium was dead on arrival.
It ran existing x86 code much slower. So if you wanted to move up to 64bit (and use Itanium to get there), you had to pay a lot more for your processors, just to run your existing workload.
Okay, you say, but everyone was supposed to stop running x86 and start running Itanium binaries! Please put down the pipe and come back to reality. No company is going to repurchase all of their software to run on a new platform, just because Intel says this is the way forward.
Maybe, maybe! If all of the business software was open-source and easily ported to a different CPU architecture it might have worked. But only if you'd gain a 3x-5x improvement in wall clock performance by porting from x86 to Itanium instruction sets. (An advantage that never materialized.)
And once AMD started shipping AMD64 and Opterons that could run your existing x86 workload, on a 64bit CPU, at slightly faster speeds than your old kit for the same price - that buried any chance of Itanium ever succeeding in the market. Any forward looking IT person, when it came time to upgrade old kit, chose AMD64 - because while they might be running 32bit OS/progs today, the 64bit train was rumbling down the tracks. So picking a chip that could do both, and do both well, was the best move.
All the reasons you list could be "fixed in software".
The quotes around "software" mean that I am referring to the firmware/microcode as a piece of software designed to run on top of the actual execution units of a CPU.
No, they cannot. Or the software will be terribly slow, like a 2-10 times slowdown.
Slow: yes, indeed. But not impossible to do.
What matters are the differences in the semantics of the instructions.
X86 instructions update flags. This adds dependencies between instructions. Most RISC processors do not have flags at all.
This is semantics of instructions, and they differ between ISA's.
Yeah, I pretty well know that RISCs don't (all) have flags.
Now, again, how is that preventing the micro-code swap that dinkypoo refers to (and that was actually done on Transmeta's Crusoe)?
You'll just end up with a bigger, clunkier firmware that, for a given front-end instruction, will translate into a big bunch of back-end micro-ops.
Yup. A RISC's ALU won't update flags. But what's preventing the firmware from dispatching *SEVERAL* micro-ops? First one to do the base operation, and then additional instructions to update some register emulating the flags?
Yes, it's slower. But no, that doesn't make a micro-code based change of supported ISA impossible, only not as efficient.
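The "several micro-ops" idea can be sketched like this: a flagless 32-bit add, followed by extra steps that rebuild x86-style ZF, SF, CF and OF into an ordinary value. The function names are made up, and only a subset of the flags a real x86 ADD updates is modelled (PF and AF are left out).

```python
MASK32 = 0xFFFFFFFF


def add_no_flags(a, b):
    """What a flagless ALU provides: just the wrapped 32-bit sum."""
    return (a + b) & MASK32


def add_with_emulated_flags(a, b):
    """One base operation plus extra steps that reconstruct the flags."""
    a &= MASK32
    b &= MASK32
    result = add_no_flags(a, b)                    # step 1: the add itself
    zf = int(result == 0)                          # extra: zero flag
    sf = (result >> 31) & 1                        # extra: sign flag (top bit)
    cf = int(result < a)                           # extra: unsigned wraparound
    # extra: signed overflow, i.e. operands share a sign the result lacks
    of = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    return result, {"ZF": zf, "SF": sf, "CF": cf, "OF": of}


print(add_with_emulated_flags(0xFFFFFFFF, 1))
print(add_with_emulated_flags(0x7FFFFFFF, 1))
```

Each "extra" line stands for at least one additional operation on a flagless back-end, which is the slowdown both sides of this argument agree exists; they differ on whether it is merely slow or prohibitively slow.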
The backend, the micro-instructions in x86 CPUs, are different from the instructions in RISC CPUs. They differ in the small details I tried to explain.
Yes, and please explain how that makes it *definitely impossible* to run x86 instructions, and not merely *somewhat slower*?
Intel did this, they added an x86 decoder to their first Itanium chips. {...} But the performance was still so terrible that nobody ever used it to run x86 code, and then they created a software translator that translated x86 code into Itanium code, and that was faster, though still too slow.
Slow, but still doable and done.
Now, keep in mind that:
- Itanium is a VLIW processor. That's an entirely different beast, with an entirely different approach to optimisation, and back during Itanium development the logic was "The compiler will handle the optimising". But back then such a magical compiler didn't exist, and anyway it didn't have the necessary information at compile time (some types of optimisation require information only available at run time. Hence doable in microcode, not in the compiler).
Given the compilers available back then, VLIW sucks for almost anything except highly repetitive tasks. Thus it was a bit popular for cluster nodes running massively parallel algorithms (and at some point in time VLIW was also popular in Radeon GFX cards). But VLIW sucks for pretty much anything else.
(Remember that, for example, GCC has had auto-vectorisation and well-performing Profile-Guided Optimisation only since recently.)
So "supporting an alternate x86 instruction set on Itanium was slow" has as much to do with "supporting an instruction set on a back-end that's not tailored for the front-end is slow" as it has to do with "Itanic sucks for pretty much everything which isn't a highly optimized kernel-function in HPC".
But still, it proves that running a different ISA on a completely alien back-end is doable.
The weirdness of the back-end won't prevent it, only slow it down.
Luckily, by the time Transmeta Crusoe arrived:
- knowledge had advanced a bit on how to handle VLIW; Crusoe had a back-end better tuned to run a CISC ISA
Then by the time Radeon arrived:
- compilers had gotten even better; GPUs are used for the same (only) class of tasks at which VLIW excels.
The backend of Crusoe was designed completely with x86 in mind; all the execution units contained the small quirks in a manner which made it easy to emulate x86 with it. The backend of Crusoe contains things like {...} All these were made to make binary translation from x86 eas