"By forcing a lookup table and its associated logic into the mix, you potentially are significantly reducing a processor's speed and/or scalability."
The added logic would primarily exist in the decode phase. Provided the decoders could be pumped with enough data to overcome the increase in code size such a model could potentially introduce, it would not be a problem. The internal logic units would have to be modified to deal with that kind of reference.
I posted a reply to the ChipGeek blurb on this subject (www.chipgeek.com) where I describe the type of engine required to execute this RM/RMC model. I visualize it like a round waterfall viewed from above. In the pool area leading up to the waterfall, all of the required processing takes place to prepare the data to be sent to the logic units. Data is pulled from the correct location in register space (a very simple process). It is resized to the appropriate operand size during the pull. It is tagged with an indicator that will instruct a rapid-process retirement unit to write the contents back to register space (following execution).
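The pull, resize and tag steps could be modelled roughly like this. It is only a sketch: `pull_operand`, `retire`, `register_map` and the `Operand` record are names invented for illustration, the 56-slot layout assumes register space is counted in 32-bit slots, and nothing here reflects an actual RM/RMC encoding.

```python
from dataclasses import dataclass

NUM_SLOTS = 56  # assumption: register space counted in 32-bit slots


@dataclass
class Operand:
    value: int      # data already resized to the operand width
    dest_slot: int  # tag telling the retirement unit where to write back
    width: int      # operand width in bits


def pull_operand(space, register_map, logical_reg, width):
    """Fetch a logical register through the map, resize it, and tag it."""
    slot = register_map[logical_reg]          # locate the data in register space
    value = space[slot] & ((1 << width) - 1)  # resize during the pull
    return Operand(value=value, dest_slot=slot, width=width)


def retire(space, operand, result):
    """Write the result back to the tagged slot, keeping the untouched upper bits."""
    mask = (1 << operand.width) - 1
    kept = space[operand.dest_slot] & ~mask & 0xFFFFFFFF
    space[operand.dest_slot] = kept | (result & mask)


space = [0] * NUM_SLOTS
register_map = list(range(8))  # default: the 8 logical registers use slots 0..7
register_map[0] = 20           # hypothetical remap of logical register 0 to slot 20
space[20] = 0x12345678

op = pull_operand(space, register_map, 0, 16)  # 16-bit read through the map
retire(space, op, op.value + 1)                # e.g. an increment, written back
```

The execution unit in this model never sees the map. It only receives a value and a tag, which is why the extra logic sits in front of the logic units rather than inside them.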
One thing many people seem to be doing is confusing internal register renaming with what I'm proposing. While it is arguable that what I've essentially done is introduce programmer-assigned register renaming, there is a distinct component to that renaming that most people seem to overlook completely (I've seen a few responders that nailed it). That is the fact that I, as the assembly programmer, or the compiler, would be able to determine which registers propagate in which locations throughout the program. We have access to knowledge that a statistical runtime execution model does not. The x86 architecture provides almost no methods of conveying known-at-compile-time information to the processor (except through the overall code design following required rules dictated by the processor architecture), so the processor has to rely on statistical algorithms to arrive at an appropriate register renaming.
My proposal would allow that decision to be made by the programmer. After all, Intel's current modus operandi with IA-64 seems to be "let the compiler or assembly programmer dictate everything". They are no longer interested in employing all of the OOO execution models that the P6 core has provided. That's why Itanium performs so poorly on x86 code. It has a P5 engine which doesn't employ any of those hardware speedups. The same code executed in x86 mode on an Itanium, then recompiled in IA-64 mode, will run much faster after the recompile. Why? Because rather than executing the instructions one after another, the compiler has positioned the code in a manner which conveys as much parallelism as possible. The compiler made those decisions, not the CPU, and the performance benefits are there (see Itanium 2 numbers on a recent Ace's Hardware article: http://www.aceshardware.com/#60000436).
What I propose would require a modest redesign of the hardware. It would require a minor extension to the instruction set. I can visualize about 40 different ways to implement the broad strokes I painted with my feature (I didn't specifically name or assign opcode sequences, there are 3 unused bits in RMC which could be utilized to help in some way, etc.). There are several ways of arriving at the same final result in hardware. In my opinion it's up to people to explore the possibilities rather than criticize the idea. Personally, I like what AMD did with the x86-64 and the REX override prefixes. In 64-bit long mode they threw out redundant one-byte opcode instructions that were duplicated with other multi-byte opcode sequences and utilized them as a series of overrides which provide additional information regarding each instruction, and did so with a single byte.
If that method were employed then the code size increase would be minimal. The only design points left to hit are how to redesign the core so the registers are in a central-access location rather than remote locations of the chip. I'm not saying it wouldn't be difficult. But, it would only have to be designed once and all software written from that point forward would have the potential of benefiting from it.
"Everyone writing performance assembly code uses SSE/MMX. Critical path code is hardly ever written in legacy x86."
Not all code is suited for SSE or MMX. If the FPU is used at all then MMX is pretty much out the window, leaving only SSE/SSE2. And, while there are some speedups XMM register use would provide, there are still a large number of programming situations which would not benefit at all. Also, critical path code that's not multimedia based would be wise not to use SSE/MMX. Why? It takes significantly longer to execute the prolog/epilog FXSAVE/FXRSTOR than it does to execute a PUSHAD/POPAD.
"The only way to give truly "revolutionizing" performance is to do high level optimizations."
Since it is unlikely that general software engineers (those that write in tools, for example) are capable of those kinds of speedups on a consistent basis (due to task constraints, time constraints and what have you), it seems only logical to make the hardware running the application as fast as possible. After all, a lot of people spend a lot of time working with different compiler switches to find the optimum solution for whatever code they've written.
It seems to me that the purpose should be to make the machine as flexible for the programmer as possible, as fast as possible, and allow the rest of us to do the best we can writing software under deadlines.
"making heavy use of this instruction set level register mapping would DOUBLE the instructions for a program!!"
I was wondering when someone was going to mention this. x86-64 implemented a similar capability with their REX prefix overrides in 64-bit long mode, and their increase was 10%. Of course, they threw out an existing set of duplicated instruction encodings to accomplish that.
Under my proposal, one of the 3 unused bits could be utilized to alter the base instruction set so that something similar to AMD's x86-64 implementation could be used with RMC. It would allow single-byte (or two-byte) opcode overrides to be employed which convey alternate register usage. My current design has the SUPRMC instruction as a 3 byte opcode (without changing any existing opcode sequence definitions). However, if the hardware designers saw a need to do something more like what AMD did, then it would probably work.
One other consideration that could be handled very elegantly by a compiler or a skilled assembly programmer would be the ability to utilize different registers for various functions. When it is known, for example, that a main portion of program code calls a specific function on a regular basis, then that function could be assigned specific registers. It would only require a MOVRMC instruction at the start and a MOVRMC instruction at the end, but the speedups provided (because the function wouldn't need to save/restore GP registers) would be very measurable. There are a lot of potentials here. It would take a while to work them all out. But, giving the x86 an additional 48 registers is something it desperately needs. Re-writing compilers and assemblers to accommodate whatever hardware model the engineers come up with seems to be a small task for the potential gains.
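To make the per-function register idea concrete, here is a toy model of it. Everything in it is an assumption made for illustration: the `RegisterFile` class, the `movrmc` method standing in for the proposed MOVRMC instruction, and the choice of slots 8..15 for the function's private registers.

```python
class RegisterFile:
    """Toy register space with a programmer-visible map (hypothetical model)."""

    def __init__(self, num_slots=56):
        self.slots = [0] * num_slots
        self.map = list(range(8))  # default: logical registers 0..7 use slots 0..7

    def movrmc(self, new_map):
        """Stand-in for MOVRMC: install a new map and return the previous one."""
        old, self.map = self.map, list(new_map)
        return old

    def read(self, reg):
        return self.slots[self.map[reg]]

    def write(self, reg, value):
        self.slots[self.map[reg]] = value & 0xFFFFFFFF


def hot_function(rf):
    caller_map = rf.movrmc(range(8, 16))  # MOVRMC at the start
    for reg in range(8):
        rf.write(reg, 0xDEAD0000 + reg)   # freely clobber all 8 logical registers
    result = rf.read(0)
    rf.movrmc(caller_map)                 # MOVRMC at the end
    return result


rf = RegisterFile()
for reg in range(8):
    rf.write(reg, reg + 1)                # caller's live values

result = hot_function(rf)
caller_regs = [rf.read(reg) for reg in range(8)]
```

The caller's eight values survive the call without a single push or pop, which is where the saving would come from. How a return value gets handed back (a shared slot, for instance) is left open in this sketch.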
Thank you for responding. :) - Rick C. Hodgin, geek.com
"Actually, this is just one of many potential downfalls."
I was referring to use of the POPRMC instruction in code. I wouldn't recommend it unless there are other reasons why there might be a delay before actual code is executed, such as the last thing done before a RETF.
"He forgot interrupts, mode switching....and IO would all require that the proposed RM/RMC register be loaded from the stack. The net effect would be that if his scheme is implemented, existing programs would run slower, not faster."
I didn't forget those aspects of coding. There are two distinct possibilities here which entirely resolve that dilemma, both handled in hardware. 1) Interrupts are handled in a special way: during interrupt processing all RM/RMC values are ignored and the default 8 GP registers are used, or 2) Interrupts automatically push RM/RMC on the stack when signaled, and automatically pop them back off when IRETD is issued. These non-problems are resolvable.
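Option 2 can be sketched in a few lines. The `Cpu` class and its method names are made up for the example, and having the handler start out on the default map is my own assumption, since the text above only specifies the automatic push and pop.

```python
DEFAULT_MAP = tuple(range(8))  # the default 8 GP registers


class Cpu:
    """Toy model of automatic RM/RMC save and restore around interrupts."""

    def __init__(self):
        self.rm = DEFAULT_MAP
        self.stack = []

    def interrupt(self):
        self.stack.append(self.rm)  # hardware pushes RM/RMC when signaled
        self.rm = DEFAULT_MAP       # assumption: handler sees the default map

    def iretd(self):
        self.rm = self.stack.pop()  # hardware pops RM/RMC on IRETD


cpu = Cpu()
app_map = (8, 9, 10, 11, 12, 13, 14, 15)  # hypothetical application mapping
cpu.rm = app_map

cpu.interrupt()
map_in_handler = cpu.rm
cpu.interrupt()                            # nested interrupt
cpu.iretd()
cpu.iretd()
map_after_return = cpu.rm
```

An existing handler that knows nothing about RM/RMC runs unchanged in this model, and the interrupted code gets its mapping back on return.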
Next, mode switching. Mode switching would make no difference. Again, the hardware state could simply persist through the mode switch as it is presently set up (meaning that SC will either count down and reset RM/RMC to default values/popped values when it hits zero, or it will be populated with 1111b and persist forever, until changed with MOVRMC again).
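The SC behaviour described here could be modelled like this. It is a sketch under assumptions: SC is treated as a 4-bit field that drops by one per executed instruction, and `MapState`, `movrmc` and `step` are illustrative names rather than part of any spec. Only the "reset to default" case is modelled, not the "popped values" case.

```python
SC_PERSIST = 0b1111            # per the text: 1111b means the map persists
DEFAULT_MAP = tuple(range(8))


class MapState:
    """Toy model of RM/RMC with its SC countdown field."""

    def __init__(self):
        self.rm = DEFAULT_MAP
        self.sc = 0            # 0 is treated here as "no countdown active"

    def movrmc(self, new_map, sc):
        self.rm = tuple(new_map)
        self.sc = sc & 0b1111  # assumption: SC is a 4-bit field

    def step(self):
        """Called once per executed instruction (assumption)."""
        if self.sc in (0, SC_PERSIST):
            return             # nothing to count down
        self.sc -= 1
        if self.sc == 0:
            self.rm = DEFAULT_MAP  # countdown expired: back to the default map


custom = (8, 9, 10, 11, 12, 13, 14, 15)  # hypothetical mapping

timed = MapState()
timed.movrmc(custom, 3)
for _ in range(3):
    timed.step()

sticky = MapState()
sticky.movrmc(custom, SC_PERSIST)
for _ in range(100):
    sticky.step()
```

A mode switch does not appear anywhere in the model. Whatever SC and the map held before the switch, they hold afterwards, which is the point being made above.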
I've been told by probably 10 people so far that the P4 engine was designed with a 2 cycle latency L1 data cache, the purpose of which is to hide a lot of the latency required by not having a large GP register set. While this is, indeed, a great thing... it never approaches the speed of register to register transfers. If code could be written to utilize up to 56 GP registers instead of 8 (8 GP + 16 MMX + 32 XMM) then a great deal of those 2-cycle latency hits would be removed, thereby speeding up code fairly significantly.
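For anyone wondering where 56 comes from: the count appears to be in 32-bit slots rather than architectural registers. A quick check, using the register files of a 32-bit x86 with SSE (8 GP registers of 32 bits, 8 MMX registers of 64 bits, 8 XMM registers of 128 bits), with the slot-counting convention being my reading of the numbers:

```python
SLOT_BITS = 32  # assumption: each mapped register is one 32-bit slot

# (number of registers, width in bits) on 32-bit x86 with SSE
register_files = {"GP": (8, 32), "MMX": (8, 64), "XMM": (8, 128)}

slots = {name: count * width // SLOT_BITS
         for name, (count, width) in register_files.items()}
total_slots = sum(slots.values())
additional_slots = total_slots - slots["GP"]
```

That gives 8 + 16 + 32 = 56 slots in total, or 48 on top of the existing 8 GP registers.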
I've had a couple people that I respect contact me in email about this concept. They've asked me to write an emulator which demonstrates this process. I will be doing that in the coming weeks/months. I'm sure this topic will be dead by the time I get it completed, but it might help stir it up again. We'll see what it really does when the numbers are published. Take care!
"Personally, I am not too fond of this approach. First of all, operating systems will need to be written to support this paging."
This is an incorrect assumption. Existing operating systems would run entirely unaffected. RM/RMC support would be implemented in hardware. The data would be stored in the TSS during a task switch and the existing mechanisms used for storing MMX/FPU and SSE/SSE2 register space (either doing it explicitly with FXSAVE or deferring it by later trapping a fault when an attempt to read/write is encountered) would still be used.
Nothing would need to be changed to that end.
"Secondly, running a single MMX and/or SSE enabled application (which would use most if not all of the mapped registers), would cause all the other applications on the system to suddenly lose any benefit that paging would provide."
Absolutely not. Each task has its own TSS right now. Each task context saves everything and context restores everything before/following a task switch. All systems would run as they do today. In fact, no additional operating system support would be required (since the necessary saving/restoring of RM/RMC in the TSS would be handled entirely by the processor). It would be an invisible add-on that only software utilizing it would see.