Research Shows RISC vs. CISC Doesn't Matter
fsterman writes The power advantages brought by the RISC instruction sets used in Power and ARM chips are often pitted against the X86's efficiencies of scale. It's difficult to assess how much the difference between instruction sets matters, because teasing out the theoretical efficiency of an ISA from the proficiency of a chip's design team, the technical expertise of its manufacturer, and support for architecture-specific optimizations in compilers is nearly impossible. However, new research examining the performance of a variety of ARM, MIPS, and X86 processors gives weight to Intel's conclusion: the benefits of a given ISA to the power envelope of a chip are minute.
i've read the legacy x86 instructions were virtualized in the CPU a long time ago and modern intel processors are effectively RISC that translate to x86 in the CPU
Back when compilers weren't crazy optimized to their target instruction set, people coding things in assembler wanted CISC, and people using compilers wanted RISC.
But nowadays almost no one still does the former, and the latter uses CISC chips a lot better.
This is now a question for comp sci history, not engineers.
She said, boy, you need to put some CISC into your RISC or I am gone, she said.
...a million smug old school douche-nozzle Mac users cried out and suddenly fell silent.
ok, so the effect of RISC vs CISC has absolutely *no* relation to power, right? so why in god's green earth is, for example, the allwinner a20 1.2ghz processor - which is still in 40nm btw - maxing out at 2.5 watts and delivering great 1080p video, reasonable 3D graphics and so on - yet intel is having to go to 14nm and, even at 14nm they STILL can't release a processor that, if you run it in a very limited configuration, is STILL listed as 3.5 watts??
there's a quad-core rockchip 28nm SoC. maximum (actual) top power consumption: below 3.0 watts. intel's haswell tablet SoC is 20nm: it's 4.5 watts "Scenario" Design Power i.e. if you only run certain apps in certain ways it *might* keep below 4.5 watts.
i really _really_ want to know why it is that intel cannot deliver an SoC that has an absolute peak limit of 2.5 watts.
This is a discussion forum on the Internet. Definitely the former.
Do not look into laser with remaining eye.
The CPU ISA isn't the important aspect. Reduced power consumption mostly stems from not needing a high end CPU because the expensive tasks are handled by dedicated hardware. What counts as top of the line ARM hardware can barely touch the processing power of a desktop CPU, but it doesn't need to be faster because all the bulk processing is handled by graphics cores and DSPs. Intel has for a long time tried to stave off the barrage of special purpose hardware. The attempts to make use of ever more general purpose CPU power sometimes bordered on sad clown territory (Remember Intel's attempt to make raytracing games look like something worth pursuing? Guess why: Raytracing is notoriously difficult to implement on graphics hardware due to the almost random data accesses.)
You seem to be conveniently ignoring Intel's Atom and Quark lines. They're all x86 and none of them has a TDP larger than 3w.
TFA measures how much energy is used for each processor to complete certain tasks, but it ignores how complex each processor is. Unsurprisingly, the processors with more transistors do better. Manufacturing cost is also ignored.
However, in the real world, x86's flexibility and versatility (including the plethora of developer tools available) win the day.
as a greybeard I remember when choosing Intel over Sun meant the project wasn't completed on time, and your electrical/mechanical engineering group lived in the breakroom while their jobs chugged along. Intel was a toy train compared to the power you'd get with RISC. however I can somewhat confidently say the RISC CISC battle is moot these days because x86 has largely caught up to Power, SPARC, and others. a competent argument could be made however that if it weren't for AMD, most servers would probably still be running some flavour of RISC. The foolhardy nature of Sun and SGI can also be argued as a cause of their demise, but I'll not flame. Intel wouldn't have bothered to get off their duff without a poke in the ribs from AMD; they had partnerships with RISC manufacturers anyhow and their own RISC-ish processor called Itanium. outside of performance though there is another reason people stick with Power and others just as they have in the past. Lock-in.
you see, applications like Oracle Business Objects and JD Edwards come with a quid-pro-quo of exacting standards to which most businesses must adhere. Namely, IBM or Sun/Oracle hardware. You may only need accounting and payroll, but you'll have to clear a corner of the room for the circus to set up their hardware and make sure everything is "just so." Their hope is that their quiet mandate becomes your quiet mandate, and before you know it other systems that interact with JDE are now required to be Power-based because "that's what runs JDE." The only way out of this is to realize that any business that doesn't explicitly do payroll or metrics for profit doesn't need the kind of horsepower decreed by things like SAP.
Good people go to bed earlier.
Granted, you can build a tablet to do specific tasks (like decoding video codecs) around a really slow processor and some special-purpose DSPs. But perhaps the companies in that business aren't making enough profit to interest Intel.
FYI
http://en.wikipedia.org/wiki/List_of_CPU_power_dissipation_figures#Intel_Atom
Below 2.5 watts for roughly 25% of the steppings. And since there's not really a significant advantage to the SoC configuration you describe above (CPU + 3D graphics), unless you're trying to squeeze your system into a thimble, if any of those Atom CPUs don't come with 3D graphics, throw a cheap lower-power 3D graphics chip into the mix and you're still probably under 2.5 watts.
The examples you described (3D graphics and video decoding) are both handled by the GPU, which breaks the data sets down through SIMD: one algorithm applied across a whole data set. That really has little to do with this discussion of CPUs.
When comparing the CPU cores of the Allwinner A20 and a Broadwell Atom, the Broadwell Atom is more performant by a wide margin.
Well, the power consumption of various processor architectures is a *bit* more complicated than RISC vs CISC, which is the point of this story.
Well.. maybe. Or Maybe not. But Definitely not sort of.
I have not RTFA but my guess is that Intel is trying to compare apples to apples and that would mean (and again I am assuming) that at the same process (14nm, for instance) and with similar circuit design densities that the power envelope is more or less the same. I am still not convinced, but trying to figure out how they came to that conclusion given that Intel has not come that close to the power efficiencies found in modern RISC designs. I also don't get their scaling argument either as some of the largest (by number of cores) supercomputers in the world are (still) RISC based or are using GPUs that would be closer to RISC designs than CISC given their limited instruction set. Sooo, this sounds like Intel we-are-so-awesome-aren't-we BS.
If the ISA does not actually affect the performance of a modern processor, then why does the 64-bit ARM architecture outperform the 32-bit? Surely 32-bit ARM code is at least comparable to x86 in architectural elegance.
As ARM chips creep up in processing power and power consumption, there's no good reason to develop in the opposite direction. Intel wants to be where ARM is headed, not where it has been.
1080p video and reasonable 3D graphics are nice to look at, but they're terrible CPU benchmarks, because the CPU only shuffles data into and out of special purpose hardware. You can do high definition video and 3D graphics with a shitty single core ARM processor, provided the dedicated coprocessors do all the work: You've probably heard of the most famous incarnation of that concept, the Raspberry Pi. The Broadcom chip at the core of the Raspberry Pi doesn't even boot using the CPU. The graphics core starts first and loads the firmware. Only then does the CPU start. That's where the ARM world is coming from: Powerful special purpose hardware together with a "microcontroller" of a CPU. Intel is coming from the opposite end of the spectrum, where general purpose processing is king, and special purpose hardware is seen as an optional add-on.
The benefit to CISC instructions is the ability to get the processor to do more work with fewer instructions.
While it may seem like a compiler can negate that, for a non-trivial number of programs you can get much higher code density using CISC instructions, which in turn frees up memory cycles for data instead of code. One of the key arguments I heard back in the late 90s/early 00s however was that hybridized chips were where it was at. If you go and look at previously 'CISC' and 'RISC' chips and look at the instruction sets, the majority of RISC chips have added a number of CISC-like instructions for certain types of operations (notably floating point and byte array/string handling stuff) while CISC chips have basically just turned into a translation layer over a slightly more complicated RISC core that is only really optimized for say the top 50-100 operations, and everything else is 2+x slower than the old CISC implementations (instructions that used to take say 2 cycles due to dedicated circuitry may now take 4+ due to reuse of micro-ops).
Wait, what? These chips aren't at all comparable in performance. For instance, consider a 7-zip lzma benchmark from http://www.7-cpu.com/:
A20: 2 cores: 1 ghz: 880 mips compressing, 1560 mips decompressing
Haswell: 8 cores: 3.4 ghz: 20500 mips compressing, 21000 mips decompressing
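To put those two lines side by side, here is a rough Python sketch that divides each posted MIPS rating by cores x clock. The normalisation is my own assumption, not something 7-cpu.com reports, and the core counts and clocks are taken at face value from the figures above; it ignores memory, cache and threading effects entirely.

```python
# Figures exactly as posted above; nothing here is re-measured.
chips = {
    "A20": {"cores": 2, "ghz": 1.0, "compress": 880, "decompress": 1560},
    "Haswell": {"cores": 8, "ghz": 3.4, "compress": 20500, "decompress": 21000},
}

def mips_per_core_ghz(chip, key):
    """Divide a MIPS rating by cores x clock (a crude, assumed normalisation)."""
    return chip[key] / (chip["cores"] * chip["ghz"])

for name, chip in chips.items():
    print(name,
          round(mips_per_core_ghz(chip, "compress")),
          round(mips_per_core_ghz(chip, "decompress")))
```

The raw totals are what the comparison above rests on; the per-core, per-GHz view only shows how much of the gap comes from core count and clock speed.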
This study looks seriously flawed. They just throw up their hands at doing a direct comparison of architectures when they try to use extremely complicated systems and sort of do their best to beat down and control all the factors that introduces. One of the basic principles of a scientific study is that independent variables are controlled. It's very hard to say how much the instruction set architecture matters when you can't tell what pipelining, out of order execution, branch prediction, speculative execution, caching, shadowing (of registers), and so on are doing to speed things up. An external factor that could influence the outcome is temperature. Maybe one computer was in a hotter corner of the test lab than the other, and had to spend extra power just overcoming the higher resistance that higher temperatures cause.
It might have been better to approach this from an angle of simulation. Simulate a more idealized computer system, one without so many factors to control.
Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
Comment removed based on user account deletion
RISC architecture is going to change everything.
They are seriously comparing some 90nm process with much better intel 32nm and 45 nm processes.
They have just taken some random cores made on random (and incomparable) manufacturing technologies, thrown a couple of benchmarks at them, and tried to declare universal results based on these.
A few facts about the benchmark setup and the cores:
1) They use an ancient version of GCC. ARM suffers from this much more than x86.
2) Bobcat is a relatively balanced core with no bad bottlenecks. Its mfg tech is cheap, not high performance, but relatively small/new.
3) Cortex A8 and A9 are really starved by bad cache design. The newer A7 and A12 would be similar in area and power consumption but much better in performance and performance/power. They are also manufactured on old, cheap mfg processes, which hurts them. Use modern manufacturing tech and the results are much better.
4) Their Loongson is made on ANCIENT technology. With modern mfg tech it would be many times better on performance/power.
5) The Cortex A15, even though made on a 32nm process, uses a cheap process, not much better than Intel's 45nm process and much worse than Intel's 32nm. It's also known to be a "power hog" design. Qualcomm's Krait has a similar performance level, but with much lower power.
No relation to energy used. It's in the article: Haswell will get its work done faster and use about the same energy as the slower chips that take longer. What matters is architecture, not ISA (Atom is lower power than Haswell at the same process node).
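The power-versus-energy distinction is just energy = average power x run time. A tiny Python sketch with made-up numbers (they are hypothetical, not taken from the paper) shows how a hungrier but faster chip can land on the same energy per task:

```python
def energy_joules(avg_power_watts, runtime_seconds):
    """Energy for one task: average power multiplied by how long the task runs."""
    return avg_power_watts * runtime_seconds

# Hypothetical numbers chosen only to illustrate the point, not measurements.
fast_chip = energy_joules(avg_power_watts=20.0, runtime_seconds=5.0)
slow_chip = energy_joules(avg_power_watts=2.5, runtime_seconds=40.0)
print(fast_chip, slow_chip)
```

With these assumed figures both chips spend 100 joules on the task, even though one draws eight times the power while it runs.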
20 years ago, RISC vs CISC absolutely mattered. The x86 decoding was a major bottleneck and transistor budget overhead.
As the years have gone by, the x86 decode overhead has been dwarfed by the overhead of other units like functional units, reorder buffers, branch prediction, caches etc. The years have been kind to x86, making the x86 overhead appear like noise in performance. Just an extra stage in an already long pipeline.
All of which paints a bleak picture for Itanium. There is no compelling reason to keep Itanium alive other than existing contractual agreements with HP. SGI was the only other major Itanium holdout, and they basically dumped it long ago. And Itaniums are basically just glorified space heaters in terms of power usage.
x86 instructions are, in fact, decoded to micro opcodes, so the distinction isn't as useful in this context.
Actually it is. Modern performance tuning has a lot to do with cache misses and such. CISC can allow for more instructions per cache hit. The strategy of a hybrid type design, CISC external architecture and RISC internal architecture definitely has some advantages.
That said, the point of RISC was not solely execution speed. It was also simplicity of design. A simplicity that allowed organizations with less money and fewer resources than Intel to design very capable CPUs.
Because years and years ago it was obvious that what manufacturers were calling RISC wasn't really that. It was typically some middle ground between REAL RISC and something else. Back in the day x86 had about 374 instructions and a Sun or analogous IBM chip had about 150-175 instructions. But according to the actual science, a RISC chip should only have 30-45 instructions. So for the sake of flexibility manufacturers split the difference and built chips that were neither fish nor fowl. If someone were to actually build a real high power chip that ran only 40 instructions, I wonder if the benchmarks would come out differently. Or maybe they wouldn't, because the benchmarks themselves make some attempt to model the complexity of real world scenarios. And if that's the case then the REAL RISC chips would be stumbling, trying to execute most things in software rather than in the instruction set.
That is correct. Every time this comes up I like to spark a debate over what I perceive as the uselessness of referring to an "instruction set architecture" because that is a bullshit, meaningless term and has been ever since we started making CPUs whose external instructions are decomposed into RISC micro-ops. You could switch out the decoder, leave the internal core completely unchanged, and have a CPU which speaks a different instruction set. It is not an instruction set architecture. That's why the architectures themselves have names. For example, K5 and up can all run x86 code, but none of them actually have logic for each x86 instruction. All of them are internally RISCy. Are they x86-compatible? Obviously. Are they internally x86? No, nothing is any more.
This same myth keeps being repeated by people who don't really understand the details on how processors internally work.
You cannot just change the decoder, the instruction set affect the internals a lot:
1) Condition handling is totally different on different instruction sets. This affects the backend a lot. X86 has a flags register, many other architectures have predicate registers, some have predicate registers with different conditions.
2) There are totally different numbers of general purpose and floating point registers. The register renamer makes this a smaller difference, but then there is the fact that most RISCs use the same registers for both FPU and integer, while X86 has separate registers for each. And this totally separates them: the internal buses between the register files and function units in the processor are done very differently.
3) Memory addressing modes are very different. X86 still does relatively complex address calculations in a single micro-operation, so it has more complex address calculation units.
4) Whether there are operations with more than 2 inputs, or more than 1 output, has quite a big impact on what kind of internal buses are needed and how many register read and write ports are needed.
5) There are a LOT of more complex instructions in the X86 ISA which are not split into micro-ops but handled via microcode. The microcode interpreter is totally missing on pure RISCs (but exists on some not-so-pure RISCs like Power/PowerPC).
6) The instruction set dictates the memory alignment rules. Architectures with stricter alignment rules can have simpler load-store units.
7) The instruction set dictates the multicore memory ordering rules. This may affect the load-store units, caches and buses.
8) Some instructions have different bitnesses in different architectures. For example x86 has N x N -> 2N wide multiply operations, which most RISCs don't have. So x86 needs a bigger/different multiplier than most RISCs.
9) X87 FPU values are 80 bits wide (truncated to 64 bits when storing/loading). Practically all other CPUs have a maximum of 64-bit wide FPU values (though some versions of Power also have support for 128-bit FP numbers).
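To make point 8 a bit more concrete, here is a hedged Python sketch of what a widening N x N -> 2N multiply costs when the hardware only offers narrower products. It builds a 32 x 32 -> 64 unsigned multiply out of four 16 x 16 -> 32 partial products. The function name and the choice of widths are mine, purely for illustration; real cores and real ISAs differ in how they split this work.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def widening_mul32(a, b):
    """Return (high, low) 32-bit halves of the 64-bit product of two
    unsigned 32-bit values, using only 16x16->32 partial products."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16

    ll = a_lo * b_lo   # contributes to bits 0..31
    lh = a_lo * b_hi   # contributes to bits 16..47
    hl = a_hi * b_lo   # contributes to bits 16..47
    hh = a_hi * b_hi   # contributes to bits 32..63

    # Sum the middle column (bits 16..31) and keep its carry for the high half.
    mid = (ll >> 16) + (lh & MASK16) + (hl & MASK16)
    low = ((mid & MASK16) << 16) | (ll & MASK16)
    high = (hh + (lh >> 16) + (hl >> 16) + (mid >> 16)) & MASK32
    return high, low

high, low = widening_mul32(0xFFFFFFFF, 0xFFFFFFFF)
print(hex(high), hex(low))
```

A single wide multiplier does all of this in one operation; without it, the same result takes several multiplies plus the adds and shifts needed to propagate carries.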
Whaddya mean? The A20 beats the pants off of a Pentium 2....
As mentioned over many years of slashdot posts, x86 as a hardware instruction no longer truly exists and represents a fraction of the overall die space. The real bread and butter of CPU architecture and trade secrets rests in the microcode that is unique in every generation or edition of a processor. Today all intel processors are practically RISC.
If you look at the graph "raw average energy normalised" you see that the ARM A9 core has the lowest energy score -> that clearly shows ARM being the most efficient and hence the conclusion is completely wrong.
Still the test is very interesting. I would like to see it updated with latest CPUs
This same myth keeps being repeated by people who don't really understand the details on how processors internally work.
Actually, YOU are wrong.
You cannot just change the decoder, the instruction set affect the internals a lot:
All the reason you list could all be "fixed in software". The fact that silicon designed by Intel handles opcodes in a way that is a little better optimized toward being fed from an x86-compatible frontend is just a specific optimisation. Simply doing the same stuff with another RISCy back-end, i.e. interpreting the same ISA fed to the front-end, will simply require each x86 instruction to be executed as a different set of micro-instructions. (Some that are handled as a single ALU opcode on Intel's silicon might require a few more instructions, but that's about the only difference.)
You could switch the frontend and speak a completely different instruction set. Simply if the two ISA are radically different, the result wouldn't be as efficient as a chip designed with that ISA in mind. (You would need a much bigger and less efficient microcode, because of all the reasons you list. They won't STOP intel from making a chip that speaks something else. Intel will simply produce a chip where the front-end is much more clunky, inefficient, waste 3x more opcode per instruction, and waste much time waiting that some bus gets free or copying values around, etc.).
And to go back to the parent...
Not only is this possible, but this was INDEED done.
There was an entire company called "Transmeta" whose business was centered around exactly that:
Their chip, the "Crusoe" was compatible with x86.
- But their chip was actually a VLIW chip. Absolutely as remote from a pure x86 core as possible.
- The frontend was entirely 100% pure software.
The advantage touted by Transmeta was that, although their chip was a bit slower and less efficient, it consumed a tiny fraction of the power and was field-upgradeable (in theory just issue a firmware upgrade to support newer instruction.) Transmeta had demos of Crusoe playing back MPEG video on a few watts, whereas Pentium 3 (the then lower-power Intel chip) would consume way much more.
Sadly, it all happened in an era where pure raw performance was king, and where using a small nuclear plant to power a Pentium IV (the then high performance flagship) and needing a small lake nearby for cooling was considered perfectly acceptable. So Crusoe didn't see that much success.
Still, Crusoe was successfully used as a test bed for a few experimental CPUs, to test their ISA before actual test-beds were available. (If I remember correctly, Crusoes were used to test running x86_64 code before actual Athlon 64s were available for developers), and there were a few experimental proofs-of-concept running the PowerPC ISA.
In a modern way, this isn't that dissimilar from how Radeon handles compiled shaders, except that the front-end is now a piece of software which runs inside OpenGL on the main CPU: intermediate instructions are compiled to either VLIW or GCN opcodes, which are 2 entirely different back-ends.
(Except that, due to the highly repetitive nature of a shader, instead of decoding instructions on the fly as they come, you optimise it once into opcodes, store it in a cache and you're good).
Again, in a similar way ARM can switch between 2 different types of instruction set (normal and thumb mode): 2 different sets, one back-end.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
It is really surprising that neither the linked ExtremeTech article nor the slashdot summary cites the original source. This research was presented at HPCA'13 in a paper titled "Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures", by Emily Blem et al, from the University of Wisconsin's Vertical Research group, led by Dr. Karu Sankaralingam. You can find the original conference paper on their website.
The ExtremeTech article indicates that there are new results with some additional architectures (MIPS Loongson and AMD processors were not included in the original HPCA paper), so I assume that they have published an extended journal version of this work, which is not yet listed on their website. Please add a comment if you have a link to the new work.
I do not have any relation with them, but I knew the original HPCA work.
This is all brought to you by labs run by Intel. Why are we even talking about this? There's nothing here but marketing.
A computer instructor in 1992 told my class that computers won't ever need more than 4GB (the 32-bit memory limit). IIRC, 8MB was the norm back then.
These days the instruction set matters less than the underlying chip architecture and customization. Here, ARM has an advantage in that their business model allows for a higher degree of customization. While some companies can work with Intel or AMD on their designs, for the most part, ARM allows them to change the design as much as they need, depending on the licensing.
Well, there's spam egg sausage and spam, that's not got much spam in it.
Intel funded research to show their processors are not also rans in the tablet market so manufacturers could feel comfortable using Intel chips in tablets?
All the reason you list could all be "fixed in software".
The quotes around the "software" mean that I am referring to the firmware/microcode as a piece of software designed to run on top of the actual execution units of a CPU.
No, they cannot. OR the software will be terribly slow, like a 2-10 times slowdown.
Slow: yes, indeed. But not impossible to do.
What matters are the differences in the semantics of the instructions.
X86 instructions update flags. This adds dependencies between instructions. Most RISC processors do not have flags at all.
This is semantics of instructions, and they differ between ISA's.
Yeah, I pretty well know that RISCs don't (all) have flags.
Now, again, how is that preventing the micro-code swap that dinkypoo refers to (and that was actually done on transmeta's crusoe)?
You'll just end with a bigger clunkier firmware that for a given front-end instruction from the same ISA, will translate into a big bunch of back-end micro-ops.
Yup. A RISC's ALU won't update flags. But what's preventing the firmware from dispatching *SEVERAL* micro-ops? First to do the base operation, and then additional instructions to update some register emulating flags?
Yes, it's slower. But no, that doesn't make a micro-code based change of supported ISA impossible, only not as efficient.
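A minimal Python sketch of that idea, assuming a flagless 32-bit backend: one base add, then extra steps that rebuild four of the x86 status flags (ZF, SF, CF, OF) into an ordinary value. The function name and the dict standing in for an emulated flags register are hypothetical, and real microcode would look nothing like Python; the point is only the count of extra operations.

```python
MASK32 = 0xFFFFFFFF

def add_with_emulated_flags(a, b):
    """32-bit add on a flagless backend, followed by the extra steps
    needed to reconstruct x86-style ZF, SF, CF and OF."""
    result = (a + b) & MASK32                      # base operation: plain add
    zf = int(result == 0)                          # extra step: zero flag
    sf = (result >> 31) & 1                        # extra step: sign flag
    cf = int(result < a)                           # extra step: unsigned carry out
    # Signed overflow: inputs share a sign bit, result's sign bit differs.
    of = ((~(a ^ b) & (a ^ result)) >> 31) & 1     # extra step: overflow flag
    return result, {"ZF": zf, "SF": sf, "CF": cf, "OF": of}

print(add_with_emulated_flags(0x7FFFFFFF, 1))
print(add_with_emulated_flags(0xFFFFFFFF, 1))
```

One front-end instruction turns into five back-end steps here, which is the "slower but not impossible" trade-off being argued.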
The backend, the micro-instructions in x86 CPUs are different than the instructions in RISC CPUs. They differ in the small details I tried to explain.
Yes, and please explain how that makes it *definitely impossible* to run x86 instructions? And not merely *somewhat slower*?
Intel did this, they added x86 decoder to their first itanium chips. {...} But the performance was still so terrible that nobody ever used it to run x86 code, and then they created a software translator that translated x86 code into itanium code, and that was faster, though still too slow.
Slow, but still doable and done.
Now, keep in mind that:
- Itanium is a VLIW processor. That's an entirely different beast, with an entirely different approach to optimisation, and back during Itanium development the logic was "The compiler will handle the optimising". But back then such a magical compiler didn't exist, and anyway it wouldn't have had the necessary information at compile time (some types of optimisation require information only available at run time. Hence doable in microcode, not in the compiler).
Given the compilers available back then, VLIW sucks for almost anything except highly repeated task. Thus it was a bit popular for cluster nodes running massively parallel algorithms (and at some point in time VLIW were also popular in Radeon GFX cards). But VLIW sucks for pretty much anything else.
(Remember that, for example, GCC has had auto-vectorisation and well performing Profile-Guided-Optimisation only since recently).
So "supporting an alternate x86 instruction on Itanium was slow" has as much to do with "supporting an instruction set on a back-end that's not tailored for the front-end is slow" as it has to do with "Itanic sucks for pretty much everything which isn't a highly optimized kernel-function in HPC".
But still it proves that running a different ISA on a completely alien back-end is doable.
The weirdness of the back-end won't prevent it, only slow it down.
Luckily, by the time Transmeta Crusoe arrived:
- knowledge had advanced a bit on how to handle VLIW ; Crusoe had a back-end better tuned to run a CISC ISA
Then by the time Radeon arrived:
- compilers had gotten even better ; GPU are used for the same (only) class of task at which VLIW excels.
The backend of Crusoe was designed completely with x86 in mind, all the execution units contained the small quirks in a manner which made it easy to emulate x86 with it. The backend of Crusoe contains things like {...} All these were made to make binary translation from x86 eas
And actually so is RISC to a degree on POWER processors.
Back in the 80's going RISC was a big deal. It simplified decode logic (which was a more appreciable portion of the circuit area), reduced the number of cycles and logic area necessary to execute an instruction, and was more amenable (by design) to pipelining. But this was back in the days when CISC processors actually directly executed their ISAs.
Today, CISC processors come with translation front-ends that convert their external ISA into a RISC-like internal representation. It's on-line dynamic binary translation. Now, instructions are broken down into simpler steps that are more amenable to pipelining and out-of-order scheduling. CISC processors don't execute CISC ISAs and therefore don't suffer from their drawbacks.
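As a rough illustration of "broken down into simpler steps", here is a toy Python decoder that splits a memory-destination add into load / add / store micro-ops and passes register-only instructions through unchanged. The tuple format, the `tmp0` scratch register and the micro-op names are all hypothetical; actual micro-op encodings are vendor-internal and far more involved.

```python
def decode(instruction):
    """Translate one (op, dst, src) instruction into a list of micro-ops.

    A destination written as "[reg]" is treated as a memory operand and is
    split into load / operate / store; anything else passes straight through.
    """
    op, dst, src = instruction
    if dst.startswith("[") and dst.endswith("]"):
        addr = dst[1:-1]
        return [
            ("load", "tmp0", addr),    # fetch the memory operand into a scratch register
            (op, "tmp0", src),         # do the arithmetic register-to-register
            ("store", addr, "tmp0"),   # write the result back to memory
        ]
    return [(op, dst, src)]

print(decode(("add", "[rbx]", "rax")))
print(decode(("add", "rcx", "rax")))
```

Each micro-op in the output touches either memory or the ALU, never both, which is what makes them easy to pipeline and schedule out of order.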
It has occurred to me that this could be taken to its logical extreme. ISAs could be made entirely abstract and optimized to be used that way, along with optimizing them for reasonably efficient translation. You get the benefits of microops and the benefits of a CISC ISA (more compact code). Abstract ISAs make it easier to extend functionality in a backward-compatible way too. And unlike x86, we can shed some of the deadweight and also go to all 3-operand instructions, which have some benefits. Decoupling the ISA from the execution engine, we could get even more performance and energy efficiency than Intel does.
With a processor like Haswell, the logic area dedicated to translation is very small, which is why it doesn't matter much. On the other hand, with something like Atom, it occupies a more substantial portion of the total, making the translation (basically, elaborate decode logic) a burden on die area and therefore power consumption.
So it's not really appropriate to say it doesn't matter. It MOSTLY doesn't matter, because most of the drawbacks of CISC have been overcome. The fact that we're using an out-dated CISC ISA for x86, however, has drawbacks of having to support rare and excessively complex instructions, a plethora of addressing modes, and only having two operands per instruction.
Especially "IPL" (What my Ma to this very day uses for "BootStrapping" in fact, & she was a computer operator all thru the "dim days of yore" on IBM stuff, using PL1/2 as languages to code for it, punchcard systems & all too, for 22++ yrs. as a civil servant in that role (funniest part is, @ the DMV the other day when I was with her driving her there so she could take care of renewals etc., their systems went "down" & she was like "How long does it take them to do the SIMPLE THING & do an IPL?", lol)...
* :)
(You're "dating yourself" I *think*... OR, you're a mainframe/midrange man, right?)
APK
P.S.=> I am leaning towards You're an "oldster" though (& not a damn thing wrong with that @ all - you're more respectable in my book for that)... apk
Unless you actually make use of 64-bit arithmetic, 32-bit CPUs will always be more power efficient than 64-bit. The ARMv8 has many other improvements over ARMv7 than just 32 more bits.
The reason why x86 still stacks up so well against ARM is that modern x86 CPUs employ pretty sophisticated run-time hardware optimization techniques, while ARM has only fairly basic ones. ARM CPUs need much more sophisticated compiler optimizations than x86 CPUs do, and even current optimizing compilers aren't good enough to catch up with what Intel CPUs do at runtime, especially since x86 backend optimization is usually more mature than ARM backend optimization. Granted, the more sophisticated x86 run-time hardware optimizations cost both power and die area.
A computer instructor in 1992 told my class that computers won't ever need more than 4GB (the 32-bit memory limit). IIRC, 8MB was the norm back then.
...and he was right..at least from his point of view back in 1992.
Like the first guy that posted said: "i've read the legacy x86 instructions were virtualized in the CPU a long time ago and modern intel processors are effectively RISC that translate to x86 in the CPU"
Intel CPU's haven't used CISC in a long time, like 486 era long time. They translate the old CISC instructions into something more like RISC. So... What is Intel's point with this?
Back in the day, RISC actually made a huge impact and was able to show its might against CISC CPUs in a lot of areas, but that time has long since passed. Like, almost 20 years passed. Pure RISC CPUs still save real estate, but that doesn't matter as much as it did back in the 80's and 90's when CPU die area was at a premium.
i've read the legacy x86 instructions were virtualized in the CPU a long time ago and modern intel processors are effectively RISC that translate to x86 in the CPU
Actually, the biggest change in CPUs was not so much Intel adopting RISC techniques in post-Pentium CPUs, but rather multiprocessing, and therefore the Core platform taking hold.
Remember, one of the things that RISC did better was the multiprocessing support for those who needed it. There were Pentium based multiprocessing systems too from companies like Sequent, but those at the time ran Unix, so the competition was really b/w the likes of Sequent, vs the Suns, HPs, SGIs, and so on. All low volume, and Intel enhancing multiprocessing capabilities of its CPUs would do nothing for its PC platform.
What changed that was Microsoft deciding to merge the win32 code bases and offer XP as the merged OS for both desktops and servers, which opened a window of opportunity for Intel and AMD. Since NT, in addition to supporting RISC CPUs like Alpha or MIPS, also supported SMP, Intel could take advantage of that fact and throw more cores at a platform, and Windows (now NT-based) would be capable of handling it. That couldn't have worked w/ Windows 95-ME, but once NT took over the desktop, it could.
Once this happened, the RISC vs CISC game was over. RISC previously had a performance advantage running its own native software over Pentiums running Wintel software. The struggle to beat Intel in running Wintel software was lost first by MIPS, and then by Alpha. Once Intel could throw more cores at the problem w/o costing more than a SPARC or a Power, it was over. Intel, being several generations ahead of Cypress, Ross, Fujitsu and even IBM, could easily toss in 4-8 cores and still be cheaper than a SPARC CPU, not to mention the off-the-shelf motherboards and other peripheral logic. Once that happened, it became more cost effective to use Xeons to run Linux or FBSD than it was to run Solaris or AIX or even HP-UX.
Even in the case of the Itanium, discussed later in this thread, the initial Itaniums were just meant to be uniprocessor CPUs w/ several instructions concatenated together. Today, even Itaniums are multi-core - which solves the compatibility issue b/w generations, but then again throws into question why the Itanium would be needed in the first place, if one can just toss N number of, say, Atoms, and solve the problem.
Intel's process and manufacturing advantages helped, no doubt, but the big difference was multiprocessing becoming mainstream on the desktop due to the NT architecture replacing the Windows 95 architecture in Microsoft's desktop OSes.
Here is your answer: the A20 is freakishly slow compared to anything Intel would put their name on.
Granted, you can build a tablet to do specific tasks (like decoding video codecs) around a really slow processor and some special-purpose DSPs. But perhaps the companies in that business aren't making enough profit to interest Intel.
interestingly that assumption - that allwinner is not making enough profit - is completely wrong. allwinner is now one of _the_ dominant tablet SoC manufacturers in the world. their first revision (the A10, which was a Cortex A8) actually caused a major recession in the electronics industry when it first came out, as it was only $7.50 compared to the nearest competitor at around $11 to $12. everyone *not* using the A10 at the time was left holding worthless components; contracts for supply were reneged on; the change was so quick that many factories and design houses simply went out of business.
the volumes that allwinner are shipping are simply enormous, and, along with rockchip, their nearest competitor, the tablet market is completely and utterly overwhelmingly dominated by processors of the type that you describe as "built to do specific tasks".
those "specific tasks" include "running the android OS at a pace that's good enough for the overwhelming majority of end-users".
in short, intel has a long *long* way to go before they can even remotely consider that they have a processor that can be taken seriously in this very large market, both in terms of price and also in terms of performance.
what is particularly interesting about the comment that you make is that it would seem that intel really does, just as you do, believe that "a really slow processor and some special-purpose DSPs" simply is... not enough. and, contrary to that belief, it can be quite clearly seen by the total dominance of allwinner and rockchip that "a really slow processor and some special-purpose DSPs" really *is* enough.
one of the reasons for that is because if you look at the market you find that you need:
* audio and video CODEC processing. this can be handled by a special-purpose DSP. some of these are now handling 3D and 4096-pixel-wide screens.
* 3D graphics. these are handled by licensing a whole range of hard macros (special-purpose DSPs) that come with proprietary libraries implementing OpenGL ES 2.0. they're good enough, and some of them are getting _really_ good.
* an (as you put it) "really slow processor" - although if you look at allwinner's latest processor the A80 it can hardly be called "slow", it's an 8 core monster - which covers the running of the general OS.
overall these processors are graded according to price: $5 will get you something dreadful but "good enough", $20 will get you something that's complete overkill for a tablet.
and you know what? the $7 1.2ghz dual-core ARM Cortex A7 Allwinner A20 is, when it's put with 2gb of RAM, actually extremely quick. i tested out 1gb of RAM running debian GNU/Linux: i fired up xrdp and i had *five* rdesktop sessions running OpenOffice and Firefox on it, onto my laptop. it didn't fall over, and it wasn't dreadfully slow.
so i think you, just like intel, are completely and entirely missing the point. and in intel's case, that means entirely missing out on a *huge* market segment.
You seem to be conveniently ignoring Intel's Atom and Quark lines. They're all x86 and none of them has a TDP larger than 3w.
i'm not. intel's quark line - the one i saw announced on here last year - tops out at 400mhz. it has... nothing in the way of interfaces that can be taken seriously. it doesn't even have RGB/TTL video out. however if you are right about the latest intel atom being 3w, then now i am interested! so i am very grateful for you pointing this out, i will go check.