Posted by
CowboyNeal
on from the more-power-now dept.
NickSD writes "ChipGeek has an interesting article on increasing x86
CPU performance without having to redesign or throw out the x86 instruction set. Check it out at
geek.com."
Another Hideous Hack for IA32
by
seanellis
·
· Score: 5, Informative
The scheme as proposed would work, but nothing will change the fact that it's another hideous hack to get around the non-orthogonal addressing modes in the original Intel 80x86 architecture.
Even the little microcontroller chips that I can buy for $2 have 32 general purpose registers (Atmel AVRs, for anyone who cares).
Worse, this scheme would not benefit existing code - it still requires code changes to work.
Finally, on the gripping hand, the Pentium III and 4 have a very similar register renaming scheme going on automatically in the hardware. The 8 "logical" registers are already mapped dynamically into a much larger physical register file. (From ExtremeTech: http://www.extremetech.com/article2/0,3973,471327,00.asp.)
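For anyone who hasn't seen renaming spelled out, here is a rough toy sketch in Python of the idea described above. It is not how any real Pentium does it: the physical register count (32), the `RenameTable` class and the `rename` method are all made up for illustration, and registers are never reclaimed (a real design would free them as instructions retire).

```python
# Toy model of hardware register renaming (illustrative only).
LOGICAL_REGS = ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]


class RenameTable:
    def __init__(self, num_physical=32):
        # Assumed physical register count; real parts differ.
        self.free = list(range(num_physical))
        # Each logical register starts out mapped to its own physical register.
        self.mapping = {name: self.free.pop(0) for name in LOGICAL_REGS}

    def rename(self, dest, sources):
        """Rename one instruction.

        Sources are looked up in the current map first, then the destination
        gets a fresh physical register, so a later write to the same logical
        register does not have to wait for earlier readers.
        """
        src_phys = [self.mapping[s] for s in sources]
        dest_phys = self.free.pop(0)
        self.mapping[dest] = dest_phys
        return dest_phys, src_phys


table = RenameTable()
# Two independent computations that both reuse eax as a scratch register.
first = table.rename("eax", ["ebx"])   # eax = f(ebx)
second = table.rename("eax", ["ecx"])  # eax = g(ecx)
print(first, second)
```

The two writes to `eax` land in different physical registers, so the second computation does not have to wait for the first, and the code never had to name more than the 8 logical registers.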
Both the Intel Pentium III and IV and the AMD K6-2 and K7 (Athlon) are essentially RISC processors in the core. There's an outer layer that translates from the x86 ISA to their internal microarchitecture. The exceptions are a few outdated instructions that are virtually never used, which are implemented in microcode (and are thus slow as hell by comparison).
There is no way to directly access the core ISA, nor do I know of it being documented anywhere. Intel planned to move the industry off the x86 ISA to Itanium, but so far that's utterly failed and with the Intergraph lawsuit it may be dead in the water now.
AMD's x86-64 still uses the x86 ISA, but extends it. Additionally if you talk to the chip in 64 bit mode then 8 (I think) additional GP registers are available in silicon - not just register renaming, which occurs already in every major CPU on the market today. The additional registers (all 64-bit wide) pretty much eliminate the need for an architecture move, at least as it relates to registers. Intel hasn't yet adopted x86-64 though (although they can since AMD must license to them because of IP agreements).
Still, what's funny is this desire for a performance increase... the x86 chips are the fastest CPUs on the market for integer performance and in the top 5 for floating point - although Alpha still reigns supreme for FP I believe. But compare the price of an x86 chip to pretty much anyone else and you start wondering exactly what the performance issue is.
The performance problems are not with the CPU anymore. The bus and memory interfaces are slow. They've been getting faster over the years, but closed vendor boxes like Sun, HP, IBM, etc. will always do better because they don't have to deal with getting a half dozen different major OEMs on board, along with countless peripheral manufacturers. Nor do they have to concern themselves overly with backwards compatibility.
Re:The Problems of Obsolete design
by
Zathrus
·
· Score: 5, Informative
As others mentioned, MCA (MicroChannel Architecture) was IBM's abysmal attempt at recapturing the PC market. It died a horrible death, and deserved it. Frankly, the technology sucked only slightly less than the ISA/EISA bus it wanted to replace.
Anyone else remember the horrors of all those damn control files on floppies?
There are a lot of architectural nightmares in the PC design... and while some of them are at the CPU level (like the 6 GP registers), most of them are at the bus level. Who the hell puts the keyboard on the 2nd most important interrupt (IRQ1)? The entire bus is still borked, although PCI has mostly hidden that now. But the system and memory buses are the sole reason that IBM, HP, Sun, etc. have higher performance ratings than x86 -- the P4 and Athlon processors are faster in virtually every case on a CPU to CPU basis.
The bus and memory architecture is also why x86 does so incredibly badly in multi-CPU boxes. It's just not designed for it, the contention issues are hideous, and while you may only get 1.9x the performance going to a 2 CPU Sun box, you'll only get 1.7x on x86. It gets worse as you scale (note - those numbers are for reference only, I don't recall the exact relationships for dual CPU x86 boxes anymore, but the RISC systems handle it better due to bus design).
Really there's nothing wrong with the x86 processors except to the CompE/EE/CS student. I was there once and couldn't stand it. Real life has shown that it isn't that bad, and recent times have shown that it's actually really damn good. Except for the buses. They suck. And while things like PCI-X and 3GIO are on the horizon, I don't see them seriously changing the core issues without causing massive compatibility problems.
Re:Cache is the key
by
Anonymous Coward
·
· Score: 5, Informative
Cache is a huge Intel problem. 20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 only has 32K.
They may throw big chunks of L2 at the problem, but it seems to me that so little L1 means more time moving data and less time processing...
When programmers try to be architects...
by
Chris+Burke
·
· Score: 5, Informative
Yes, he basically invented register renaming, but put it under explicit programmer control. It's a programmer's solution to what hardware has already done, and as was inevitable he doesn't see that he will do more harm than good.
Here's why his idea sucks:
1) Register renaming is dependent on the RMC. You can't issue any instructions if there is a POPRMC in the machine until the POPRMC finishes execution. He calls it "a few cycles", but it's much worse than that. You've prevented any new instructions from entering the window until the stack access is done, preventing any work that -could- have been done in parallel from even being seen. Function call/return overhead is a big deal, and he just kicked it up a notch.
2) His whole problem #3 -- that you can't explicitly access the upper 16 bits of a 32-bit GPR. All I can say is -- thank God! Being a programmer, he probably doesn't realize that being able to address sub-registers is actually a big problem with x86. The whole sub-register-addressing problem causes all kinds of extra dependencies and merge operations. And he wants to make it worse? I think he should be slapped for this idea. x86-64 had the right idea -- you cannot access -just- the upper 32 bits of a GPR, and when you execute a 32-bit instruction that writes a GPR, the upper 32-bits are not preserved. Which is how the RISCy folks have been doing it all along, but hey.
3) This idea requires an extra clock cycle in the front-end, to do the translation from architected to the expanded architected register space, prior to being able to do the architected->physical register translation.
4) Because you still can't address more than 8 registers at a time, you'll be using lots of MOVRMC instructions in order to make the registers you need visible. Ignoring how horrible this would make it for people writing assembly ("Okay, so now EAX means GPR 13?") or compilers, this is going to result in a lot of code bloat.
5) Because of 1) and 4), modern superscalar decoders are going to be shot. If you fetch a MOVRMC, followed by POP EAX and POP EBX, you can't decode the second two until -after- you've decoded the MOVRMC and written its values into the map.
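To make point 2) a bit more concrete, here is a small Python model of why sub-register writes create extra dependencies. It is only a sketch: the function names are invented for this example and registers are modeled as plain 64-bit integers. The 32-bit case follows the x86-64 rule mentioned above (the upper half is not preserved; it is zeroed), while the 16-bit and 8-bit cases have to merge with the old value.

```python
MASK64 = (1 << 64) - 1


def write_low32(old, value):
    """32-bit write in 64-bit mode: upper 32 bits are cleared.

    The result does not depend on `old` at all, so there is no merge.
    """
    return value & 0xFFFFFFFF


def write_low16(old, value):
    """16-bit write (e.g. AX): the upper bits are kept, so `old` is an input."""
    return (old & ~0xFFFF & MASK64) | (value & 0xFFFF)


def write_high8(old, value):
    """Write to bits 8..15 (e.g. AH): also a merge with the old value."""
    return (old & ~0xFF00 & MASK64) | ((value & 0xFF) << 8)


old = 0x1122334455667788
print(hex(write_low32(old, 0xDEADBEEF)))  # 0xdeadbeef
print(hex(write_low16(old, 0xBEEF)))      # 0x112233445566beef
print(hex(write_high8(old, 0xAB)))        # 0x112233445566ab88
```

Only `write_low32` ignores `old`. The other two need the previous register contents before they can produce a result, and that is the extra dependency and merge work being complained about.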
Now all this is so that you can save on loads/stores to the stack. Which is great, but at least when those loads and stores are executing, independent instructions can still go. Every RMC-related stall is exactly that -- no following instruction can make progress.
Not that increasing the number of registers in x86 isn't a good idea -- it's just his implementation that sucks. With him being an x86 programmer, I'm surprised he didn't think of the most obvious solution -- just add a prefix byte to extend the size of the register identifiers in the ModR/M and SIB bytes. You get access to ALL GPRs at once (rather than an 8-register window), no extra stalls are required, and your code size only goes up by one byte for instructions that use the extra registers.
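A minimal sketch of how such a prefix could widen the register fields, in Python. The prefix layout here is an assumption chosen for illustration: it borrows the bit positions of the x86-64 REX prefix (0100WRXB, where R extends ModR/M.reg and B extends ModR/M.rm), and `split_prefix` and `decode_registers` are hypothetical helper names, not anything from the article. SIB handling is left out.

```python
def split_prefix(prefix):
    """Return (ext_reg, ext_rm) extension bits from the optional prefix byte.

    Assumed layout: high nibble 0100, bit 2 extends ModR/M.reg,
    bit 0 extends ModR/M.rm (same positions as REX.R and REX.B).
    """
    if prefix is None:
        return 0, 0
    if prefix & 0xF0 != 0x40:
        raise ValueError("not a register-extension prefix")
    return (prefix >> 2) & 1, prefix & 1


def decode_registers(modrm, prefix=None):
    """Decode mod/reg/rm, widening reg and rm to 4 bits if a prefix is present."""
    ext_reg, ext_rm = split_prefix(prefix)
    mod = (modrm >> 6) & 0b11
    reg = (ext_reg << 3) | ((modrm >> 3) & 0b111)
    rm = (ext_rm << 3) | (modrm & 0b111)
    return mod, reg, rm


print(decode_registers(0xC8))        # (3, 1, 0): plain 3-bit register numbers
print(decode_registers(0xC8, 0x45))  # (3, 9, 8): same ModR/M, extended by the prefix
```

The extension bits travel with the instruction itself, so the decoder needs no state from earlier instructions. That is why this approach avoids the RMC-style stalls.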
I can't help but commend him on his idea being well thought out. To the best of his knowledge, he tried to address all issues. But that's the problem -- he's a programmer, not a computer architect.
Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.
Although I would like to take this opportunity to point out that AMD's x86-64 (Opteron) architecture increases the number of GP and XMM (used for SSE instructions) registers up to 16 each.
-- "Evil company X is threatening to restrict our rights! Let's all get together to stop--OOOH! SHINEY!!!" -- AC
Re:modular chips
by
Zathrus
·
· Score: 5, Informative
Anytime you modularize you have to design interfaces. Interfaces are inherently slow - there's a physical disconnect which simply can't have as good of an electrical connection, they're bulky (consider that while a Pentium IV chip package is 35 mm on a side (1225 mm^2), the actual chip is only 131 mm^2 - the size is needed primarily for all the pinouts from the chip), and they're noisy.
Consider that while you can buy a P4 that runs at 2.8 GHz internally (and the fast ALUs run at 5.6 GHz, although they're only 16-bits wide), the memory bus is a lackluster 133 MHz (which you get an effective 533 MHz from because it's quad pumped - you read 4 values every clock instead of just 1). The I/O bus also runs at 133 MHz. These are the only two external buses the CPU deals with.
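The arithmetic behind those numbers, as a quick Python sketch. Two things here are assumptions rather than figures from the post: the nominal "133 MHz" is taken as 400/3 MHz, and the data bus is assumed to be 64 bits (8 bytes) wide, which is only used for the peak-bandwidth line.

```python
base_clock_mhz = 400 / 3   # nominal "133 MHz" bus clock (assumed exact value)
transfers_per_clock = 4    # quad pumped: 4 transfers per bus clock
bus_width_bytes = 8        # ASSUMPTION: 64-bit data bus, not stated in the post
core_clock_mhz = 2800      # the 2.8 GHz P4 mentioned above

# Effective transfer rate in millions of transfers per second.
effective_mt_s = base_clock_mhz * transfers_per_clock

# Theoretical peak bandwidth under the assumed bus width.
peak_gb_s = effective_mt_s * 1e6 * bus_width_bytes / 1e9

# How many core cycles go by during a single bus clock.
core_cycles_per_bus_clock = core_clock_mhz / base_clock_mhz

print(f"effective rate: {effective_mt_s:.0f} MT/s")
print(f"peak bandwidth: {peak_gb_s:.2f} GB/s")
print(f"core cycles per bus clock: {core_cycles_per_bus_clock:.0f}")
```

Roughly 21 core cycles pass for every bus clock. That gap is why the bus, not the core, is the bottleneck.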
If you were to try and segment the CPU similarly you'd quickly hit limitations. You simply can't run a multi-GHz electrical signal over a physical disconnect, at least not with current technology.
All of that said, if you look at how CPU cores are laid out the cache is distinctly segmented from the ALU, the ALU is segmented from the FPU, and so forth. It makes chip design easier since if you want to make a change to one part of the chip you minimize effects on other parts. It also helps for signal routing and noise prevention.
Also you can do more or less what you're asking - just not at high speeds. Modern chips are often preliminarily tested using gate arrays that can be reprogrammed quickly and easily... but instead of running at 3 GHz this test chip runs at 2 MHz. Maybe.
Oh... a final bit... back in the days of the 386 and 486 the 2nd level cache was actually on the motherboard, and different MB vendors would put different amounts of cache. Some even had it socketed or solderable so you could add more if you wanted! But by the time the P2 came out clock speeds were too high for this. The connection latency and distance were simply too high. So we wound up with the slot processors, where a CPU slot card had the CPU core and 1-4 second level caches on it. Pretty soon both Intel and AMD integrated the 2nd level cache onto the CPU itself (which wasn't previously possible because it would have made the chips far too big), which further improved speed. The next generation of CPUs are requiring 3rd level cache on the motherboards. How long before that gets integrated onto the CPU?
Hell yeah!
I myself am an old x86 Assembly hacker.
When I started looking at the ARM chips I wondered why we ever used x86's etc.
RISC / CISC is really a misnomer.
RISC has plenty of instructions, and it's meant to be superscalar.
It starts with Register Gymnastics. Basically with RISC, there's no more of it. Every register is general. It can be data, or it can be an address. All the basic math functions can operate on any register.
With Intel x86, everything has its place.
Extend it further out. There's something called "Conditional Instructions". Properly utilized, these make for an ultra efficient code cache. The processor is able to dump the code cache instructions ahead of time, which also means not as much unnecessary "pipeline preparation" to perform an instruction.
Then there's THUMB, which compresses instructions so that they take up less physical space in a 64- or 128-bit world. There are lots of wasted bits in an (.exe) compiled for a 386.
Last I checked, 32-bit ARM THUMB processors are dirt freakin' cheap; they're manufactured by a consortium of a multitude of vendors, as opposed to AMD and INTC.
The Internet is slowly wearing down the x86 as more and more processing moves back onto the server, where big-iron-style RISC can churn through everything.
The article should really just be called:
"An Academic Exercise in Register Gymnastics"
Damn Right!
Register renaming already does what's being proposed here, but transparently. In fact, most of the instruction reordering done by a good optimizing compiler (and later by the out-of-order dispatching unit) aims to increase parallelism in register usage.
Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.
P4 processors have 128 registers available for register renaming. Using all of them is not so easy, so Hyperthreading (still only on Xeon) tries to bring two different processes into the instruction mix, keeping their renaming maps separate, so the dispatching unit has more noncolliding instructions ready for execution. This won't make one CPU as fast as 2, but it does keep that insanely deep pipeline from getting filled with bubbles (or would that be 'empty of instructions'?)
-Kz-
Cache is a huge Intel problem. 20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 only has 32K.
AMD has 128K L1 since the original Athlon, and had 24K in the K5.
The Transmeta 3200 and the Motorola G4 both have 96K, the UltraSparc-III has 100K, Alpha had 128K when it died, and HP's PA-8500 has a whopping 1.5MB.
They may throw big chunks of L2 at the problem, but it seems to me that so little L1 means more time moving data and less time processing...