Posted by
CowboyNeal
on from the more-power-now dept.
NickSD writes "ChipGeek has an interesting article on increasing x86
CPU performance without having to redesign or throw out the x86 instruction set. Check it out at
geek.com."
Shouldn't we improve bus speed, data access speeds, etc etc first? After all, the bottleneck is not the processor anymore...
Re:Why?
by
Anonymous Coward
·
· Score: 3, Insightful
You don't get it...
What Intel is currently doing is putting a turbo on an old and obsolete architecture.
With more GP registers, you could do the same job more easily and with better performance (and the code would be easier to read if you write ASM). As it is now, you need too many memory accesses for simple operations. With more registers, you would need less clock speed.
Shouldn't we improve bus speed, data access speeds, etc etc first? After all, the bottleneck is not the processor anymore...
No.
Because the whole point is preventing memory access. High bandwidth busses are very expensive. If you have a lot of registers, you can avoid memory accesses, making instructions run at full speed.
The best way to reduce the impact of a bottleneck is not making the bottleneck wider. It is making sure your data doesn't need to travel through the bottleneck.
After that, it doesn't hurt to make the bus bandwidth bigger.
Because if you had read the article you'd realize that this is essentially a zero cost, backwards compatible method of dramatically increasing program execution speed several orders of magnitude -- so the question is really, "Why not?"
It does not matter how fast your CPU is if it spends a significant amount of its time waiting for main memory access. All that happens is that it's doing more NOPs/sec, which isn't terribly useful. That's why industrial-grade systems have fancy buses like the GigaPlane.
With more registers, you would need less clock speed.
Have you ever looked at the function entry and exit code for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win.
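To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes one store at entry and one load at exit per saved register, and the 4-byte fixed-width encoding mentioned above. The function name is made up for illustration, and ISAs with load/store-multiple instructions would come out cheaper.

```python
INSTRUCTION_BYTES = 4  # fixed-width encoding, as stated in the post


def save_restore_overhead(saved_registers):
    """Return (instructions, code bytes) spent saving and restoring registers
    around one function, assuming one store at entry and one load at exit
    per saved register (an assumption, not a measurement)."""
    instructions = 2 * saved_registers
    return instructions, instructions * INSTRUCTION_BYTES


for regs in (10, 20):
    instructions, code_bytes = save_restore_overhead(regs)
    print(f"{regs} saved registers: {instructions} instructions, {code_bytes} bytes")
```

Under those assumptions, the 20-40 instructions quoted above correspond to saving 10-20 registers, or 80-160 bytes of code per function.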
Cache is the key
by
Anonymous Coward
·
· Score: 3, Insightful
I've got three words for you: cache, cache and cache.
Why do you think Pentium Pro was such a huge success that it's still being used in CPU intensive operations? Why do you think Sun Sparc and Digital/Samsung Alpha CPUs trash modern Pentium 4s and Athlons at 500 MHz? Yup. Loads and loads of cache.
Re:Cache is the key
by
Sivar
·
· Score: 5, Insightful
I've got three words for you: cache, cache and cache.
Why do you think Pentium Pro was such a huge success that it's still being used in CPU intensive operations? Why do you think Sun Sparc and Digital/Samsung Alpha CPUs trash modern Pentium 4s and Athlons at 500 MHz? Yup. Loads and loads of cache.
No. First, Alphas and SPARCs do not trash modern x86 CPUs: the Pentium IV 2.8GHz and Athlon XP 2800+ are the fastest CPUs in the world for integer math, and the Itanium 2 is the fastest in the world for floating point math. Cache memory is only useful until it is large enough to contain the working set of the primary application being run. A larger cache can improve performance further, but once the cache can contain the working set, the gain is in the single digit percents. The working set of the vast, vast majority of applications is under 512K, and most are under 256K. You'll find that increasing the speed of a small cache is generally more important than increasing the size of the cache. Case in point: when the Pentium 3 and Athlon went from a large (512K) cache to a small (256K) but faster one, performance went up, for the Athlon by about 10% and for the Pentium 3...I don't recall, but around 10%. Some desktop apps, like SETI@Home, have a large working set (more than 512K) and DO benefit from large caches, but nothing larger than 1MB would improve performance here either.
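A toy simulation makes the working-set point concrete. This is only a sketch: it assumes a fully associative LRU cache measured in cache lines and a program that loops over its working set in order, which is the worst case for LRU and so exaggerates the cliff below the working-set size. The function name and sizes are invented for illustration.

```python
from collections import OrderedDict


def hit_rate(cache_lines, working_set_lines, passes=10):
    """Fraction of hits when looping `passes` times over a working set,
    using a fully associative LRU cache holding `cache_lines` lines."""
    cache = OrderedDict()
    hits = accesses = 0
    for _ in range(passes):
        for line in range(working_set_lines):
            accesses += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)        # mark as most recently used
            else:
                cache[line] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict least recently used
    return hits / accesses


for size in (128, 256, 512, 1024):
    print(size, hit_rate(size, working_set_lines=256))
```

Once the cache holds all 256 lines, the only misses left are the first-pass cold misses, so doubling or quadrupling the cache changes nothing in this model.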
Most server CPUs, like Alphas and SPARCs, have fairly large caches for the following reasons:
1) Databases love large caches. They are one of the few applications that can take advantage of a large cache, because they can store lookup tables of arbitrary size in cache. Server CPUs are often used for databases because Joe x86 CPU is just fine for webservers, FTP servers, desktop systems, etc. and is generally faster at them than server CPUs.
2) Most server class CPUs are fully 64-bit and do NOT support register splitting. On the SPARC64, for example, if you want to store an integer containing the number "42", that integer will take up a full 64 bits even though the register can store numbers up to 18,446,744,073,709,551,615. This larger size increases the cache size needed to store the working set of programs, because all integers (and many other data primitives) require a full 64 bits or more. With x86 CPUs, which support register splitting and have only 32-bit registers, that number could be stored in a mere eight bits: the square root of the number of bits the SPARC requires.
3) Big servers with multiple CPUs are often expected to run multiple apps, all of which are CPU intensive. If the cache can store the working set for all of them, speed is slightly improved.
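Taking point 2 at face value (every small integer occupying a full 64 bits), the effect on the working set is simple arithmetic. The sketch below uses Python's array module only as a convenient way to get fixed-width storage. The element count is hypothetical, and 512K is just the cache size used earlier in this post.

```python
from array import array

COUNT = 100_000           # hypothetical number of small integers in the working set
CACHE_BYTES = 512 * 1024  # the 512K cache size discussed above


def footprint(typecode):
    """Bytes needed to hold COUNT copies of the value 42 at a given width."""
    values = array(typecode, [42] * COUNT)
    return len(values) * values.itemsize


narrow = footprint("b")  # signed 8-bit elements
wide = footprint("q")    # signed 64-bit elements

print("8-bit :", narrow, "bytes, fits in cache:", narrow <= CACHE_BYTES)
print("64-bit:", wide, "bytes, fits in cache:", wide <= CACHE_BYTES)
```

The same data is eight times larger at 64 bits per element, which here is the difference between fitting in a 512K cache and spilling out of it.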
That said, who in their right mind would use an incredibly slow Pentium Pro for a CPU intensive calculation? A Pentium Pro at the highest available speed, 200MHz, with 2MB cache may be able to outperform a Celeron 266, but not by much and only for very specific cache-hungry software. Show me a person that thinks a Pentium Pro with even 200GB of cache can outperform ANY Athlon system and I will show you a person that hasn't a clue what they are talking about.
Look at the performance difference between the Pentium IV with 256K and with 512K (a doubling) of cache. You will have to do some research to find an application that gets even a 10% performance boost.
FYI: If you are interested in competent, intelligent, technical reviews of hardware, you might like www.aceshardware.com
-- Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
Re:Cache is the key
by
Anonymous Coward
·
· Score: 1, Insightful
Caches are good for the general case, but for some problems having a cache can easily degrade performance too. You have to be cache-aware and plan your algorithm's memory access strategy.
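One classic way an access pattern defeats a cache is a stride that maps every access to the same slot. The sketch below models a direct-mapped cache with made-up parameters (256 lines of 32 bytes, so 8 KB). It is an illustration of the idea, not a model of any particular CPU.

```python
def direct_mapped_hits(addresses, num_lines=256, line_bytes=32):
    """Count hits in a direct-mapped cache, where each memory block can live
    in exactly one slot (block number modulo the number of lines)."""
    slots = [None] * num_lines
    hits = 0
    for addr in addresses:
        block = addr // line_bytes
        index = block % num_lines
        if slots[index] == block:
            hits += 1
        else:
            slots[index] = block  # miss: replace whatever was in the slot
    return hits


CACHE_BYTES = 256 * 32  # total size of the hypothetical cache

# Two passes over 64 four-byte words laid out contiguously.
sequential = [i * 4 for i in range(64)] * 2
# Two passes over 64 words spaced exactly one cache size apart.
strided = [i * CACHE_BYTES for i in range(64)] * 2

print("sequential hits:", direct_mapped_hits(sequential), "of", len(sequential))
print("strided hits:   ", direct_mapped_hits(strided), "of", len(strided))
```

Both patterns touch 64 words twice, but the strided one keeps evicting its own data from a single slot, so it never hits. Changing the data layout or the stride is the kind of planning the post is talking about.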
Re:Cache is the key
by
mmol_6453
·
· Score: 5, Insightful
20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 [geek.com] only has 32K.
Just for people who don't know, Intel reduced the amount of cache when they moved from the P3 to the P4. And hardware junkies know the performance hit that caused.
A seemingly unrelated sidenote: Intel wants to move to their IA-64 system, and, since it's not backwards-compatible, they're going to have to force a grass-roots popular movement to pull it off.
Perhaps they crippled the P4 to make the IA-64 processors look even faster to the general public?
In any case, I think the quality of the P4 is a sign that Intel wants to make its move soon. (Though losing $150 million, not to mention the context in which they lost it, may set back their schedule, giving AMD's 64-bit system a chance to catch on.)
-- What's this Submit thingy do?
Re:Cache is the key
by
alaeth
·
· Score: 2, Insightful
One of the reasons they reduced the size of the L1 cache is that it takes up a huge amount of physical die area. If you're trying to reduce the number of bad chips in a batch, the easiest way is to reduce the size of the chip itself.
I remember the "next big thing" during the early and middle 90s was RISC. So will the next big thing be McISC (More Complex Instruction Set Chips)?
I wonder whether the core of a McISC will be RISC, or CISC that in turn has a RISC core.
--
try to make ends meet, you're a slave to money, then you die
I don't think anyone would disagree with that, but that's not the issue. What he's saying is, given that we've got to stick with x86 for historical and commercial reasons, this would be a relatively quick and easy way to allow the compilers to produce *much* groovier code.
It's a cute idea having a "stackspace" for your GPRs, but you could just move to an architecture with more GPRs and not have to design a brand new chip (I hate verilog).
Now if I could only get my compiler to stop moving items from gpr to gpr with a RLINM that has a rotate of 0 and an AND mask of all 0xFFFFFFFF's!
-- In the future, I would want to not be isolated from my friends in the Space Station.
If we're going to stick to the x86 we still do not want to add complexity. I also tried to point out how easy it would be to move to a new architecture.
Since you must add complexity, I do not think that it would be "quick and easy". It takes huge resources in both time and equipment to verify the timing of a new chip, so these kinds of changes (fundamental changes to the way registers are accessed) are expensive and hard, since you also need to implement many new hardware solutions and verify the functionality (not only the timing!)
Re:The Problems of Obsolete design
by
gpinzone
·
· Score: 3, Insightful
Microchannel was the bus you are thinking about. It actually was very good, but wasn't backward compatible with ISA. EISA was the "rest of the industry's" response to provide a 32-bit bus that was backwards compatible. It wasn't a very good implementation since it was still locked at 8MHz.
Re:Switching Architectures
by
killmenow
·
· Score: 5, Insightful
As most code today is written in higher level languages (C/C++, Java, etc.) all it takes is a recompile and perhaps some patching...
But a lot of the code running today wasn't "written today" if you know what I mean. The problem is, in order to recompile you first need: a) the original source, and b) someone capable of patching, etc.
A lot of internal apps are in use for which the source code is lost. And a lot of code in use today (sadly) was not written in languages as portable as C, C++, and Java. A lot of apps in use today were written in Clipper and COBOL and a bunch of other languages that may not have decent compilers for other platforms. So recompiling it isn't an option. A complete re-write is necessary.
Even for situations in which application source *does* exist, and suitable compilers exist on other architectures, it is more often than not poorly documented...and the original author(s) is/are nowhere to be found. So in order to patch/fix the source to run on the new architecture, you not only need someone well versed in both the old and the new architectures, but someone who can read through what is often spaghetti code, understand it and make appropriate changes.
In a lot of these cases it's easier to stick with the current architecture. And that, to some degree, is why the x86 architecture has gotten as complex as it is.
RISCs have several advantages, but the biggest (IMHO) is the simplicity: no awkward rules (non-GP registers), no special case instructions, easy to pipeline, easy to understand and easy to optimize code for (since the instruction set is smaller).
Not entirely true. RISC instruction sets can be quite huge too. And the whole idea of RISC is to take the complexity out of the hardware and put it into the compiler instead. It is easier to optimize for x86 than RISC.
More than 3 answers !FREE!
by
purrpurrpussy
·
· Score: 4, Insightful
You are VERY confused.
1 - Zero Cost. 2 - Backwards Compatible. 3 - Orders of magnitude.
1 - You have to buy new chips - this will improve the speed of "computing" but it will not increase the speed of THIS computer I have right HERE.
2 - No old code has RM/RMC instructions in it and will NOT run any faster than it already does in a "standard" x86 mode. Yes it is backwards compat. but by the same token so is MMX, EMMX, 3DNOW!, SSE, SSEII, AA64 etc....
3 - Anyone who can sell me a program to "suddenly" make all my code go 10x or 100x faster is guaranteed to give me a good chuckle!!!!!!!
As for the article... well you've hugely increased the number of bits it takes to address a register and swapping the RM register is going to cause all sorts of new dependency chains inside the chip.
Personally.... I'd go for a stack machine. Easily the most efficient compute engine.
Now - if we could get back to point number 1 and point number 3. If YOU can make MY computer go 10 or 100 times FASTER with SOFTWARE I promise I WILL give YOU some MONEY....;-)
-- "None of this shit works" -W.Shatner
So this is a "register pointer"?
by
zerofoo
·
· Score: 3, Insightful
Great, in a time where we are removing god awful pointers from high level programming languages, we're putting them in the hardware.....uuugh.
Anyone ever write something with intensive pointer arithmetic in C++? It's enough to drive you mad.
Can you imagine peer code review: "No, that's not the instruction.....that's a pointer to the instruction."
Oh boy!
-ted
Re:So this is a "register pointer"?
by
Anonymous Coward
·
· Score: 1, Insightful
For God's sake, learn some assembly. Learn what's going on under the hood. Pointers are used at the lowest levels to address memory, and always have been. Your compiled code will use pointers, even if you don't have any actual pointers in your C++ code. *Gasp!*
Intel isn't interested in performance
by
zaqattack911
·
· Score: 3, Insightful
I hate to say it, but lately it's becoming more and more obvious that Intel is no longer really interested in performance. They'll squeeze a bit more out of an ancient architecture and add a few buzzwords like "SSE2", so they can slap on a hefty price-tag.
Look at the Pentium 4 design! Intel would much rather use a dated CPU with a nice pretty GHz rating than keep the same MHz and improve the architecture design.
Do you really think investors give a shit about registers?
--Marketing 101
More trouble than it's worth...
by
gillbates
·
· Score: 4, Insightful
The only potential downfall I see in this design is the possible pipeline stall seen when RM/RMC have to be populated from stack data. When that happens, no assembly instructions can be decoded until the POPRMC instruction completes and RM/RMC are loaded with the values from the stack.
Actually, this is just one of many potential downfalls. He forgot that interrupts, mode switching (going from protected to real mode, as some OS's still do), and I/O would all require that the proposed RM/RMC register be loaded from the stack. The net effect would be that if his scheme is implemented, existing programs would run slower, not faster. Furthermore, placing the RM/RMC register on the stack is an impossibility without breaking backward compatibility; many assembly language coders depend on a set number of bytes being added to the stack when they perform a call or interrupt.
Why not just add 24 GP registers to the existing processor? Honestly, it would be a lot simpler, and would not complicate the whole x86 mess, nor break backward compatibility.
I don't mean to flame, but this guy is way off base. The biggest problem with the x86 instruction set is lack of registers, and the second biggest problem is that its complexity is rapidly becoming unmanageable. Not even Intel recommends using assembly anymore - their recommendation is to write in C and let their compiler perform the optimizations. Adding more instructions like this would further diminish the viability of coding in assembly.
A far better solution would be to simply keep the existing instruction set intact, and add more GP registers. IBM got it right the first time - their mainframe processors have 16 general purpose registers which can be used for any operation - addressing, indexing, integer, and floating point calculations. If anything, Intel should stop adding instructions and start adding registers.
-- The society for a thought-free internet welcomes you.
Re:RISC
by
Anonymous Coward
·
· Score: 1, Insightful
With 256 GP registers, you'd better hope your compiler's code optimizer is very smart about allocating, using, and saving registers on the stack. Pushing 256 registers to the stack on every function call is no longer a performance penalty... it's a performance bitchslap.
For hand-coded assembly though, there isn't much problem (You do already watch your registers and push only the ones you need to use, right?).
too much of a good thing = pie wagon
by
epine
·
· Score: 3, Insightful
If there was any sense to this comment, the x86 would have proved such a disaster it was abandoned ten years ago. Many people think it should have been, that its continued existence is some bizarre aberration of rational forces.
In actual fact, the ugliness of the duckling was less of an impediment than advertised.
There are several consequences of large, flat register sets. First of all, if your register set greatly exceeds the number of in-flight instructions, you have a lot of extra transistors in your register set sitting there, on average, doing nothing. Well, not nothing. They are sitting there adding extra capacitance and leakage to your register file, increasing path length, cycle times, power dissipation, and routing complexity.
Second effect: large register sets increase average instruction length. Larger average instruction lengths translate into a larger L1 instruction cache to achieve the same hit ratio. PPC requires a 40% larger I-cache to achieve the same effectiveness as the x86 I-cache.
Third effect: context switches take longer. If you want to actually use all those registers, your process has to save and restore them on every context switch.
Finally, there is the register set mirage. Modern implementations of x86 have approximately 40 general purpose registers. Only you can't see most of them. Six of these can be named to the instruction set at any given time. The others are in-flight copies of values previously named with the same names. This all happens transparently within an OOO processor model.
If x86 only had six GP registers in practice, it really would have died ten years ago. What it actually has is six GP registers you can name at any one time, which means only six GP registers you have to load and store on context switches, etc.
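For anyone who hasn't seen renaming before, here is a deliberately tiny sketch of the idea. It assumes the six nameable registers are eax, ebx, ecx, edx, esi and edi (my choice of names) and uses the rough figure of 40 physical registers from above. The RenameTable class is invented for illustration and never recycles physical registers, which real hardware does once older values retire.

```python
ARCH_REGS = ["eax", "ebx", "ecx", "edx", "esi", "edi"]  # assumed nameable set
PHYSICAL_COUNT = 40  # the approximate figure quoted in the post


class RenameTable:
    """Maps each architectural register name to its newest physical copy."""

    def __init__(self):
        self.free = list(range(PHYSICAL_COUNT))
        self.mapping = {name: self.free.pop(0) for name in ARCH_REGS}

    def write(self, name):
        """A new value for `name` gets a fresh physical register, so older
        in-flight readers of the previous value are not disturbed."""
        physical = self.free.pop(0)
        self.mapping[name] = physical
        return physical

    def read(self, name):
        """Readers always see the most recent physical copy of `name`."""
        return self.mapping[name]


table = RenameTable()
first = table.write("eax")   # first new value written to eax
second = table.write("eax")  # second value; the first still sits in its own slot
print(first, second, table.read("eax"))
```

Two back-to-back writes to eax land in different physical registers, which is why the code only ever names six registers while the core juggles many more copies behind the scenes.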
What did die ten years ago was the notion that convenience to the human assembly language programmer was worth a hill of beans. Good architectures are convenient to the silicon and the compiler.
Other aspects of x86 have proved more serious than the shortage of namable GP registers. Too many instructions change the flag register, affecting too many patterns of flag bits. That's hell for an OOO processor to patch back together. The floating-point stack was an abomination. Lack of a three-operand instruction format is another significant liability.
On the other hand, the ill-reputed RMW (read/modify/write) instruction mode is 90% of the reason the Athlon performs as well as it does. You get two memory transactions for the price of one address generation, translation, and cache contention analysis. It amounts to having the entire L1 cache available as a register set extension every other clock cycle (leaving half of your L1 cache cycles for other forms of work).
Having someone comment on the x86 is an excellent litmus test of the capacity for someone to dig deeper than their shallow preconceptions of elegance. If it were anything other than the despised x86, its ability to scale from 4.77MHz to 10GHz would have been considered a marvel of engineering soundness. Sometimes ugliness has lessons to teach us. Who among us is prepared to listen?