Revolutionizing x86 CPU Performance

Um, how is this anything new? by Andy+Dodd · 2002-10-11 00:56 · Score: 4, Informative

Linux kernel source - memcpy() anyone?

(On MMX machines, the wider 64-bit MMX registers are used for memcpy() rather than the 32-bit standard integer registers)

This has been in the kernel for a few years now and anything that uses memcpy() benefits from it. Move along now.
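
A rough way to see why the wider registers help a copy loop is to count the load/store pairs needed to move a buffer. This is only a counting sketch, not the kernel's code: `moves_needed` is a made-up helper, and the 4-byte and 8-byte widths are simply the 32-bit and 64-bit registers mentioned above.

```python
def moves_needed(n_bytes: int, reg_bytes: int) -> int:
    """Count load/store pairs to copy n_bytes with registers reg_bytes wide.

    Hypothetical helper: full-width moves for the bulk of the buffer,
    then one byte at a time for whatever tail is left over.
    """
    full, tail = divmod(n_bytes, reg_bytes)
    return full + tail


page = 4096
int_moves = moves_needed(page, 4)  # 32-bit integer registers
mmx_moves = moves_needed(page, 8)  # 64-bit MMX registers
```

Halving the number of moves does not by itself halve the copy time, since memory bandwidth still sets a floor.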

--
retrorocket.o not found, launch anyway?

Another Hideous Hack for IA32 by seanellis · 2002-10-11 00:57 · Score: 5, Informative

The scheme as proposed would work, but nothing will change the fact that it's another hideous hack to get around the non-orthogonal addressing modes in the original Intel 80x86 architecture.

Even the little microcontroller chips that I can buy for $2 have 32 general purpose registers (Atmel AVRs, for anyone who cares).

Worse, this scheme would not benefit existing code - it still requires code changes to work.

Finally, on the gripping hand, the Pentium III and 4 have a very similar register renaming scheme going on automatically in the hardware. The 8 "logical" registers are already mapped dynamically into a much larger physical register file. (From ExtremeTech: http://www.extremetech.com/article2/0,3973,471327,00.asp.)

--
Sean Ellis
Follow OfQuack's antics on Twitter.

More registers are not enough. by gpinzone · 2002-10-11 01:12 · Score: 4, Informative

The whole gist of the article has to do with the x86's lack of general purpose registers. While this is true, you're not going to solve all of the x86 shortcomings simply by figuring out a way to add more of them. There are MANY things wrong with the x86 design; GP registers are just one of them. There's an entire section in the famous Patterson book that goes into all of the issues in much more detail than I care to state here.

Besides, there are already more efficient (albeit complex) solutions to extend registers that make much more sense in the current world of pipelined processors. Register renaming is one such example.

Re:Why? by hatchet · 2002-10-11 01:14 · Score: 2, Informative

You do not understand how a computer really works. If you have more registers, more instructions for manipulating those registers and finally more cache, you don't need high bus speeds. The processor won't need to get much data from memory anyway, because it will have 99% of what it needs already in registers & internal cache.
We must not forget that most operations a processor performs are data movements and not calculations.

All three x86 problems described by the article's author are fixed by the IA-64 architecture, but not by AMD's x86-64.

Re:RISC by Milican · 2002-10-11 01:23 · Score: 3, Informative

RTFA or, nicely put, read the article. By adding the instructions he reduces the complexity of shifts and the multiple ordered instructions it takes to do one thing, and increases the visibility of all the registers. There are added instructions, but the benefit is reduced complexity in assembly code due to greater direct accessibility of all the registers.

John

--
Campaign for Liberty

Re:RISC by Zathrus · 2002-10-11 01:47 · Score: 5, Informative

Both the Intel Pentium III and IV and the AMD K6-2 and K7 (Athlon) are essentially RISC processors in the core. There's an outer layer that translates from the x86 ISA to their internal microarchitecture, except for a few outdated instructions that are virtually never used, which are implemented in microcode (and thus slow as hell comparatively).

There is no way to directly access the core ISA, nor do I know of it being documented anywhere. Intel planned to move the industry off the x86 ISA to Itanium, but so far that's utterly failed and with the Intergraph lawsuit it may be dead in the water now.

AMD's x86-64 still uses the x86 ISA, but extends it. Additionally if you talk to the chip in 64 bit mode then 8 (I think) additional GP registers are available in silicon - not just register renaming, which occurs already in every major CPU on the market today. The additional registers (all 64-bit wide) pretty much eliminate the need for an architecture move, at least as it relates to registers. Intel hasn't yet adopted x86-64 though (although they can since AMD must license to them because of IP agreements).

Still, what's funny is this desire for a performance increase... the x86 chips are the fastest CPUs on the market for integer performance and in the top 5 for floating point - although Alpha still reigns supreme for FP I believe. But compare the price of an x86 chip to pretty much anyone else and you start wondering exactly what the performance issue is.

The performance problems are not with the CPU anymore. The bus and memory interfaces are slow. They've been getting faster over the years, but closed vendor boxes like Sun, HP, IBM, etc. will always do better because they don't have to deal with getting a half dozen different major OEMs on board, along with countless peripheral manufacturers. Nor do they have to concern themselves overly with backwards compatibility.

Re:RISC by snatchitup · 2002-10-11 01:47 · Score: 5, Informative

Hell yeah!

I myself am an old x86 Assembly hacker.

When I started looking at the ARM chips I wondered why we ever used x86's etc.

RISC / CISC is really a misnomer.

RISC has plenty of instructions, and it's meant to be superscalar.

It starts with Register Gymnastics. Basically with RISC, there's no more of it. Every register is general. It can be data, or it can be an address. All the basic math functions can operate on any register.

With Intel x86, everything has its place.

Extend it further out. There's something called "Conditional Instructions". Properly utilized, these make for an ultra efficient code cache. The processor is able to dump the code cache instructions ahead of time. Which also means, not as much unnecessary "pipeline preparation" to perform an instruction.

Then there's THUMB, which compresses instructions so that they take up less physical space in a 64, 128 bit world. There are lots of wasted bits in an (.exe) compiled for a 386.

Last I checked, 32bit ARM THUMB processors are dirt freakin' cheap; they're manufactured by a consortium of a multitude of vendors as opposed to AMD and INTC.

The Internet is slowly wearing down the x86 as more and more processing is moving back on the server where big iron style RISC can churn through everything.

The article should really just be called:

"An Academic Exercise in Register Gymnastics"

Re:The Problems of Obsolete design by Zathrus · 2002-10-11 01:57 · Score: 5, Informative

As others mentioned, MCA (MicroChannel Architecture) was IBM's abysmal attempt at recapturing the PC market. It died a horrible death, and deserved it. Frankly, the technology sucked only slightly less than the ISA/EISA bus it wanted to replace.

Anyone else remember the horrors of all those damn control files on floppies?

There are a lot of architectural nightmares in the PC design... and while some of them are at the CPU level (like the 6 GP registers), most of them are at the bus level. Who the hell puts the keyboard on the 2nd most important interrupt (IRQ1)? The entire bus is still borked, although PCI has mostly hidden that now. But the system and memory buses are the sole reason that IBM, HP, Sun, etc. have higher performance ratings than x86 -- the P4 and Athlon processors are faster in virtually every case on a CPU to CPU basis.

The bus and memory architecture is also why x86 does so incredibly badly in multi-CPU boxes. It's just not designed for it, the contention issues are hideous, and while you may get only 1.9x the performance going to a 2 CPU Sun box, you'll get only 1.7x on x86. It gets worse as you scale (note - those numbers are for reference only, I don't recall the exact relationships for dual CPU x86 boxes anymore, but the RISC systems handle it better due to bus design).

Really there's nothing wrong with the x86 processors except to the CompE/EE/CS student. I was there once and couldn't stand it. Real life has shown that it isn't that bad, and recent times have shown that it's actually really damn good. Except for the buses. They suck. And while things like PCI-X and 3GIO are on the horizon, I don't see them seriously changing the core issues without causing massive compatibility problems.

It won't be enough in the future by Anonymous Coward · 2002-10-11 02:13 · Score: 1, Informative

It would be a lot better to use a new chip with new instructions (well, a new PC architecture would be even better).
The problem, as I see it, is that nobody wants to face the risk inherent in such a big step forward.
But ... what about a new instruction that would switch the micro to a new RISC instruction set, taking care of course of context switching between applications and which instruction set each one uses?
It would keep backward compatibility with old applications and, as the OS gives enough time to every application, it wouldn't affect instruction caching, jump prediction and so on too much. There could even be a mode in which only applications written in the new instruction set are allowed, so the system administrator could choose whether to keep backward compatibility.
Of course, it's only an idea, and of course it would have a lot of flaws, but at first view it seems very feasible to me.

Why should one do that? by mick29 · 2002-10-11 02:27 · Score: 4, Informative

I do not like the changes proposed although x86 is awfully flawed (not enough GP registers, terribly overloaded instruction set {anyone ever used BCD commands? -- Yes, I hear the loud "We do" from the COBOL corner.}, you name it... ).
But this change would:

Make an internal interface explicitly controlled by the programmer/compiler, loading an enormous amount of work on the compiler creators. (Just have a look at IA64 - is there any good compiler out there already? I haven't had a look for a while.)

Destroy (or at least reduce the efficiency of) the internal register renaming unit, thus slowing down the out-of-order execution core and such (the entire core, actually...) Sorry, but this man may have been busy programming x86 assembly his entire life (and for this he deserves my respect), but he is not up to date on how a modern x86 CPU works in its heart. When I heard the lectures in my university about how this stuff works, I gave up learning assembly -- one just doesn't need it anymore with the compilers around.
Reading the books by Hennesy/Patterson (don't know if I spelled them correctly) may help a lot.

Re:Why? by Kz · 2002-10-11 02:33 · Score: 5, Informative

Damn Right!

Register renaming already does what's being proposed here, but transparently. In fact, most of the instruction reordering done by a good optimizing compiler (and later by the out-of-order dispatching unit) aims to increase parallelism in register usage.

Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.

P4 processors have 128 registers available for register renaming. Using all of them is not so easy, so Hyperthreading (still only on Xeon) tries to bring two different processes into the instruction mix, keeping their renaming maps separate, so the dispatching unit has more noncolliding instructions ready for execution. This won't make one CPU as fast as 2, but it does keep that insanely deep pipeline from getting filled with bubbles (or would that be 'empty of instructions' ?)

--
-Kz-

Re:Cache is the key by Anonymous Coward · 2002-10-11 02:41 · Score: 5, Informative

Cache is a huge Intel problem. 20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 only has 32K.

AMD has 128K L1 since the original Athlon, and had 24K in the K5.

The Transmeta 3200 and the Motorola G4 both have 96K, the UltraSparc-III has 100K, Alpha had 128K when it died, and HP's PA-8500 has a whopping 1.5MB.

They may throw big chunks of L2 at the problem, but it seems to me that so little L1 means more time moving data and less time processing...

When programmers try to be architects... by Chris+Burke · 2002-10-11 02:46 · Score: 5, Informative

Yes, he basically invented register renaming, but put it under explicit programmer control. It's a programmer's solution to what hardware has already done, and as was inevitable he doesn't see that he will do more harm than good.

Here's why his idea sucks:

1) Register renaming becomes dependent on the RMC. You can't issue any instructions if there is a POPRMC in the machine until the POPRMC finishes execution. He calls it "a few cycles", but it's much worse than that. You've prevented any new instructions from entering the window until the stack access is done, preventing any work that -could- have been done in parallel from even being seen. Function call/return overhead is a big deal, and he just kicked it up a notch.

2) His whole problem #3 -- that you can't explicitly access the upper 16 bits of a 32-bit GPR. All I can say is -- thank God! Being a programmer, he probably doesn't realize that being able to address sub-registers is actually a big problem with x86. The whole sub-register-addressing problem causes all kinds of extra dependencies and merge operations. And he wants to make it worse? I think he should be slapped for this idea. x86-64 had the right idea -- you cannot access -just- the upper 32 bits of a GPR, and when you execute a 32-bit instruction that writes a GPR, the upper 32-bits are not preserved. Which is how the RISCy folks have been doing it all along, but hey.

3) This idea requires an extra clock cycle in the front-end, to do the translation from architected to the expanded architected register space, prior to being able to do the architected->physical register translation.

4) Because you still can't address more than 8 registers at a time, you'll be using lots of MOVRMC instructions in order to make the registers you need visible. Ignore how horrible this would make it for people writing assembly ("Okay, so now EAX means GPR 13?") or compilers, this is going to result in a lot of code bloat.

5) Because of 1) and 4), modern superscalar decoders are going to be shot. If you fetch a MOVRMC, followed by POP EAX and POP EBX, you can't decode the second two until -after- you've decoded the MOVRMC and written its values into the map.
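
The sub-register point in 2) can be made concrete with a toy model of the x86-64 write rules: 8- and 16-bit writes merge with the old register contents, while a 32-bit write zeroes the upper half. This is a sketch only; `write_subreg` is a hypothetical helper, and the 8-bit case covers just the low byte (AL-style), not AH-style high-byte registers.

```python
MASK64 = (1 << 64) - 1


def write_subreg(old: int, value: int, width: int) -> int:
    """Return the 64-bit register contents after a write of `width` bits.

    Hypothetical model: narrow writes need `old` (a merge, hence an extra
    dependency); 32- and 64-bit writes do not look at `old` at all.
    """
    if width in (8, 16):
        mask = (1 << width) - 1
        return (old & ~mask & MASK64) | (value & mask)  # merge with old bits
    if width == 32:
        return value & 0xFFFFFFFF  # upper 32 bits are zeroed
    if width == 64:
        return value & MASK64
    raise ValueError("unsupported width")


old = 0x1122334455667788
after16 = write_subreg(old, 0xABCD, 16)
after32 = write_subreg(old, 0xABCD, 32)
```

The 16-bit result still carries the old upper bits, which is exactly the dependency that hardware has to track; the 32-bit result does not.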

Now all this is so that you can save on loads/stores to the stack. Which is great, but at least when those loads and stores are executing, independent instructions can still go. Every RMC-related stall is exactly that -- no following instruction can make progress.

Not that increasing the number of registers in x86 isn't a good idea -- it's just his implementation that sucks. With him being an x86 programmer, I'm surprised he didn't think of the most obvious solution -- just add a prefix byte to extend the size of the register identifiers in the ModR/M and SIB bytes. You get access to ALL GPRs at once (rather than a 8-register window), no extra stalls are required, and your code size only goes up by one byte for instructions that use the extra registers.
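
The prefix-byte idea can be sketched as a small decoder. x86-64's REX prefix works along these lines, and the bit positions below follow its 0100WRXB layout, but the function name and the returned dict are illustrative only, and SIB handling is left out entirely.

```python
from typing import Optional


def decode_regs(modrm: int, prefix: Optional[int] = None) -> dict:
    """Decode the ModR/M register fields, widened by an optional prefix.

    Sketch only: the prefix supplies one extra high bit for each 3-bit
    field, so 8 register numbers become 16. SIB bytes are ignored.
    """
    reg = (modrm >> 3) & 0b111
    rm = modrm & 0b111
    if prefix is not None:
        if prefix & 0xF0 != 0x40:
            raise ValueError("not a REX-style prefix")
        reg |= ((prefix >> 2) & 1) << 3  # R bit extends ModR/M.reg
        rm |= (prefix & 1) << 3          # B bit extends ModR/M.rm
    return {"reg": reg, "rm": rm}


plain = decode_regs(0xC8)           # mod=11 reg=001 rm=000
extended = decode_regs(0xC8, 0x45)  # same ModR/M with R=1 and B=1
```

The same ModR/M byte names registers 1 and 0 without the prefix and registers 9 and 8 with it, at a cost of one extra byte and no remapping state.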

I can't help but commend him on his idea being well thought out. To the best of his knowledge, he tried to address all issues. But that's the problem -- he's a programmer, not a computer architect.

--

The enemies of Democracy are

Re:Why? by mgblst · 2002-10-11 02:53 · Score: 4, Informative

Intel is constantly adding new instructions and registers to the CPU. The whole point of the article is that it could just as easily do so in a way that greatly increases execution speed of ALL programs, not just a few!!!

Re:Why? by Chris+Burke · 2002-10-11 03:26 · Score: 3, Informative

Register renaming already does what's being proposed here, but transparently.

Well, not exactly. Renaming takes care of the case where two things write, say, EAX, by allowing both to go with different physical registers. I.E. you don't have to stall because you only have one architected "EAX" register.

However, your program is still limited to only 8 visible values at any instant. So when you need a 9th -thing- to keep around, you have to spill some registers onto the stack. Register renaming doesn't solve this problem.

His idea would, but it's still a stupid idea. :P
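
The distinction can be shown with a toy rename table. This is a sketch under stated assumptions: the `Renamer` class is made up, the 128-entry physical file is the P4 figure quoted elsewhere in this thread, and old physical registers are never freed here, which real hardware would do at retirement.

```python
class Renamer:
    """Toy rename table: architected name -> current physical register."""

    def __init__(self, arch_regs, n_physical):
        self.free = list(range(n_physical))
        # Each architected register starts out mapped to one physical one.
        self.table = {name: self.free.pop(0) for name in arch_regs}

    def write(self, name):
        """A write takes a fresh physical register and returns its number."""
        phys = self.free.pop(0)
        self.table[name] = phys
        return phys

    def read(self, name):
        return self.table[name]


ARCH = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]
r = Renamer(ARCH, n_physical=128)
first = r.write("EAX")   # two back-to-back writes to EAX...
second = r.write("EAX")  # ...land in different physical registers
visible = len(r.table)   # but the program can still name only 8 values
```

The two writes no longer collide, yet `visible` stays at 8, so a ninth live value still has to be spilled to the stack.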

--

The enemies of Democracy are

Register Windows by 1000StonedMonkeys · 2002-10-11 03:37 · Score: 2, Informative

This already exists on SPARC. It's called register windows. It makes writing compilers/assembly a real bitch. Chipgeek needs to do his homework.

As several posters have already mentioned, Intel gets around the lack of registers problem by using register renaming. There are actually 128 general purpose registers in the P4. Which ones you're writing to is controlled by the processor.

Re:Why? by MajroMax · 2002-10-11 03:59 · Score: 5, Informative

Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.

Although I would like to take this opportunity to point out that AMD's X86-64 (Opteron) architecture increases the number of GP and XMM (used for SSE instructions) registers to 16 each.

--
"Evil company X is threatening to restrict our rights! Let's all get together to stop--OOOH! SHINEY!!!" -- AC

Re:modular chips by Zathrus · 2002-10-11 04:08 · Score: 5, Informative

Anytime you modularize you have to design interfaces. Interfaces are inherently slow - there's a physical disconnect which simply can't have as good of an electrical connection, they're bulky (consider that while a Pentium IV chip package is 35 mm on a side (1225 mm^2), the actual chip is only 131 mm^2 - the size is needed primarily for all the pinouts from the chip), and they're noisy.

Consider that while you can buy a P4 that runs at 2.8 GHz internally (and the fast ALUs run at 5.6 GHz, although they're only 16-bits wide), the memory bus is a lackluster 133 MHz (which you get an effective 533 MHz from because it's quad pumped - you read 4 values every clock instead of just 1). The I/O bus also runs at 133 MHz. These are the only two external buses the CPU deals with.
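
The bus arithmetic above works out as follows. This is a back-of-the-envelope sketch: the function names are made up, "133 MHz" is taken as the nominal 133.33 MHz, and the 8-byte (64-bit) data bus width is an assumption that the post does not state.

```python
def effective_rate_mhz(base_mhz: float, transfers_per_clock: int) -> float:
    """Effective transfer rate of a multi-pumped bus."""
    return base_mhz * transfers_per_clock


def peak_bandwidth_mb_s(base_mhz: float, transfers_per_clock: int,
                        bus_bytes: int) -> float:
    """Peak bandwidth in MB/s: transfers per second times bytes each."""
    return effective_rate_mhz(base_mhz, transfers_per_clock) * bus_bytes


base = 400 / 3                         # nominal "133 MHz" bus clock
eff = effective_rate_mhz(base, 4)      # quad pumped
bw = peak_bandwidth_mb_s(base, 4, 8)   # 8-byte bus width is an assumption
core_to_bus = 2800 / base              # core clocks per bus clock at 2.8 GHz
```

Under these assumptions the core runs about 21 clocks for every bus clock, which is the gap the post is pointing at.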

If you were to try and segment the CPU similarly you'd quickly hit limitations. You simply can't run a multi-GHz electrical signal over a physical disconnect, at least not with current technology.

All of that said, if you look at how CPU cores are laid out the cache is distinctly segmented from the ALU, the ALU is segmented from the FPU, and so forth. It makes chip design easier since if you want to make a change to one part of the chip you minimize effects on other parts. It also helps for signal routing and noise prevention.

Also you can do more or less what you're asking - just not at high speeds. Modern chips are often preliminarily tested using gate arrays that can be reprogrammed quickly and easily... but instead of running at 3 GHz this test chip runs at 2 MHz. Maybe.

Oh... a final bit... back in the days of the 386 and 486 the 2nd level cache was actually on the motherboard, and different MB vendors would put different amounts of cache. Some even had it socketed or solderable so you could add more if you wanted! But by the time the P2 came out clock speeds were too high for this. The connection latency and distance were simply too high. So we wound up with the slot processors, where a CPU slot card had the CPU core and 1-4 second level caches on it. Pretty soon both Intel and AMD integrated the 2nd level cache onto the CPU itself (which wasn't previously possible because it would have made the chips far too big), which further improved speed. The next generation of CPUs are requiring 3rd level cache on the motherboards. How long before that gets integrated onto the CPU?

Re:RISC by bored · 2002-10-11 04:08 · Score: 2, Informative

Extend it further out. There's something called "Conditional Instructions". Properly utilized, these make for an ultra efficient code cache. The processor is able to dump the code cache instructions ahead of time. Which also means, not as much unnecessary "pipeline preparation" to perform an instruction.

The ARM is a cute little arch, the only problem is that EVERY instruction is conditional. At first this seems like it might make for some really nice optimizations. But it's a lot harder than you think (the instruction cache cannot just dump instructions, because it has to know what the current state of the processor is; this means that all instructions which affect the condition codes have to retire before decisions about which instructions can be executed are made). When I started thinking about how I would design an out-of-order superscalar version it started to give me a real headache. Eventually I realized about the only way (I could think of, maybe there is a better way) would be to have some kind of in-order conditional retire stage near the end of the pipeline. This would allow the processor to run at decent speeds as long as the code was very careful to rarely use the conditional execution, since it would effectively serialize the instruction stream. The 'Always' execute instructions could retire out of order as long as there was enough distance between them and the condition changing/condition dependent instructions.

All this is to say, the ARM is a nice arch for low speed, low power devices. Really high speed versions might be pretty hard to get right. Intel's XScale is like this; everything considered, its IPC is pretty bad.
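
For readers who haven't seen conditional execution, here is a toy Python model of the well-known predicated GCD loop. The assembly in the docstring is the usual textbook example, not something from this thread, and `run_gcd` is a made-up name. The point it illustrates is the dependency described above: both SUB instructions have to wait for the flags that CMP produces.

```python
def run_gcd(a: int, b: int) -> int:
    """Toy model of an ARM-style predicated GCD loop (positive inputs only):

        loop: CMP   r0, r1
              SUBGT r0, r0, r1
              SUBLT r1, r1, r0
              BNE   loop
    """
    r0, r1 = a, b
    while True:
        # CMP sets the flags once; every following instruction consults them.
        gt, lt, ne = r0 > r1, r0 < r1, r0 != r1
        if gt:        # SUBGT executes only when the flags say "greater than"
            r0 -= r1
        if lt:        # SUBLT executes only when the flags say "less than"
            r1 -= r0
        if not ne:    # BNE falls through once the operands are equal
            return r0
```

There are no branches inside the loop body, which is the code-density win, but nothing after the CMP can be resolved until its flags are known.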

Real data using x86 emulator by Nynaeve · 2002-10-11 04:36 · Score: 2, Informative

What happened to backing up flames and claims with real data? The author of this article would be well advised to implement his ideas using an x86 emulator and at least do some preliminary testing. Processor-level features such as out-of-order execution and register-renaming may not be handled by an emulator, but it would be an informative investigation nonetheless.

Re:An intelligent comment on the subject by RickHGeek · 2002-10-11 05:47 · Score: 2, Informative

"Personally, I am not too fond of this approach. First of all, operating systems will need to be written to support this paging."

This is an incorrect assumption. Existing operating systems would run entirely unaffected. RM/RMC support would be implemented in hardware. The data would be stored in the TSS during a task switch and the existing mechanisms used for storing MMX/FPU and SSE/SSE2 register space (either doing it explicitly with FXSAVE or deferring it by later trapping a fault when an attempt to read/write is encountered) would still be used.

Nothing would need to be changed to that end.

"Secondly, running a single MMX and/or SSE enabled application (which would use most if not all of the mapped registers), would cause all the other applications on the system to suddenly lose any benefit that paging would provide."

Absolutely not. Each task has its own TSS right now. Each task context saves everything and context restores everything before/following a task switch. All systems would run as they do today. In fact, no additional operating system support would be required (since the necessary saving/restoring of RM/RMC in the TSS would be handled entirely by the processor). It would be an invisible add-on that only software utilizing it would see.

- Rick C. Hodgin, geek.com

If only by Dollyknot · 2002-10-11 06:05 · Score: 2, Informative

Many years ago I learned assembler, first on a Z80 then on a 6502. When I learned the power of zero page addressing, yes I thought, way to go. I left my computing hobby behind to become an international truck driver, and for about 10 years this is what I did. Seven years ago, events occurred that led me to take up my old hobby. I tried to learn the 86 and gave up after a while, thinking what was the point in banging my head against such a mess.

What they should have done is kept the 6502 architecture and scaled it up. The architecture of the 6502 was wonderful. Sixteen bit address bus, eight bit data bus, same as the Z80. The clever bit with the 6502 was zero page addressing, which basically gave 256 registers as well as the three GP registers. The idea being the CPU could access the bottom page in memory with just eight bits in the address field, and zero page could be used as index registers. I can't rightly remember all the operations that could be performed on zero page as opposed to the A, X and Y registers, but I remember it leading to good tight code. The same architecture in a 32 bit address space, ah the dreams.
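
The saving from zero page addressing shows up directly in the instruction encoding. A minimal sketch, using the 6502's LDA opcodes for the two addressing modes; `encode_lda` itself is a made-up helper, not part of any assembler.

```python
LDA_ZEROPAGE = 0xA5  # LDA $nn   (one address byte)
LDA_ABSOLUTE = 0xAD  # LDA $nnnn (two address bytes, low byte first)


def encode_lda(addr: int) -> bytes:
    """Encode a 6502 LDA, picking the zero page form when the address fits."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("address out of range for a 16-bit bus")
    if addr <= 0xFF:
        return bytes([LDA_ZEROPAGE, addr])
    return bytes([LDA_ABSOLUTE, addr & 0xFF, addr >> 8])


zp = encode_lda(0x0042)  # two bytes
ab = encode_lda(0x1234)  # three bytes
```

A load from the bottom page is one byte shorter than the same load from anywhere else, which is where the "256 registers" feeling comes from.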

--
It's called an elephant's trunk whereas it is in fact, an elephant's nose, a nose by any other name would smell as sweet

Re:Cache is the key by Mr+Z · 2002-10-11 08:35 · Score: 2, Informative

It wasn't die area so much as clock rate. At smaller and smaller geometries, the transit time for a bit starts going up at some point due to transmission line effects. RC delay goes up since R goes up (your wire got smaller) and C goes up (you got closer to the other wires).

--Joe

--
Program Intellivision!

Re:Cache is the key by orz · 2002-10-11 09:01 · Score: 4, Informative

Intel's processors are not crippled by small L1 cache. Yes, the P3 and P4 L1 caches are WAY smaller than the Athlon L1 cache, but Intel doesn't NEED a large L1 cache, because their L2 cache is extremely fast. Intel tends to have small, extremely fast L1 caches, and makes up for the higher miss rate with fast L2 caches. For instance, the P3 L1 cache has a miss rate roughly twice as high as the Athlon's L1 cache, but the P3's L1 miss penalty is roughly 8 cycles (assuming an L2 hit...), less than half the Athlon's L1 miss penalty of 20+ cycles on an L2 hit. Also, the P4's L1 cache, which is even smaller than the P3's, allows them to decrease the L1 hit latency AND run at a substantially higher clock speed than AMD's larger cache.

For a graphical depiction of the difference between Intel and AMD cache performances, try this link:
http://www.tech-report.com/reviews/2002q1/northwood-vs-2000/index.x?pg=3
It was the first thing that came up in a Google search for linpack and "cache size".
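
The trade-off can be put into numbers with the usual average-access-time formula (hit time plus miss rate times miss penalty). Only the 8 and 20 cycle penalties and the "roughly twice" miss-rate ratio come from the post; the 3-cycle hit time and the 4% / 2% miss rates are placeholder assumptions, and every L1 miss is assumed to hit in L2.

```python
def avg_l1_access(hit_cycles: float, miss_rate: float,
                  miss_penalty: float) -> float:
    """Average access time in cycles: hit time + miss rate * miss penalty."""
    return hit_cycles + miss_rate * miss_penalty


# Placeholder hit time and miss rates; only their 2:1 ratio is from the post.
p3 = avg_l1_access(hit_cycles=3, miss_rate=0.04, miss_penalty=8)
athlon = avg_l1_access(hit_cycles=3, miss_rate=0.02, miss_penalty=20)
```

With these made-up rates the smaller cache comes out slightly ahead, because doubling the miss rate costs less than the 2.5x difference in miss penalty. Different miss rates or L2 miss behavior would shift the result.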

Re:An intelligent comment on the subject by RickHGeek · 2002-10-11 15:56 · Score: 2, Informative

By forcing a lookup table and its associated logic into the mix, you potentially are significantly reducing a processor's speed and/or scalability.

The added logic would primarily exist in the decode phase. Provided the decoders could be pumped with enough data to overcome the increase in code size such a model could potentially introduce, it would not be a problem. The internal logic units would have to be modified to deal with that kind of reference.

I posted a reply to the ChipGeek blurb on this subject (www.chipgeek.com) where I describe the type of engine required to execute this RM/RMC model. I visualize it like a round waterfall viewed from above. In the pool area leading up to the waterfall, all of the required processing takes place to prepare the data to be sent to the logic units. Data is pulled from the correct location in register space (a very simple process). It is resized to the appropriate operand during the pull. It is tagged with an indicator that will instruct a rapid-process retirement unit to write the contents back to register space (following execution).

One thing that many people seem to be confusing is the concept of internal register renaming with what I'm doing. While it is arguable that what I've essentially done is introduce programmer-assigned register renaming, there is a distinct component to that renaming that most people seem to overlook completely (I've seen a few responders that nailed it). That is the fact that I, as the assembly programmer, or the compiler would be able to determine which registers propagate in which locations throughout the program. We have access to knowledge that a statistical runtime execution model does not. The x86 architecture provides almost no methods of conveying known-at-compile-time information to the processor (except through the overall code design following required rules dictated by the processor architecture), so it has to use statistical algorithms to rely on appropriate register renaming.

My proposal would allow that decision to be made by the programmer. After all, Intel's current modus operandi with IA-64 seems to be "let the compiler or assembly programmer dictate everything". They are no longer interested in employing all of the OOO execution models that the P6 core has provided. That's why Itanium performs so poorly on x86 code. It has a P5 engine which doesn't employ any of those hardware speedups. The same code executed in x86 mode on an Itanium, then recompiled in IA-64 mode will run much faster after the recompile. Why? Because rather than executing the instructions one after another, the compiler has positioned the code in a manner which conveys as much parallelism as possible. The compiler made those decisions, not the CPU, and the performance benefits are there (see Itanium 2 numbers on a recent Ace's Hardware article: http://www.aceshardware.com/#60000436).

What I propose would require a modest redesign of the hardware. It would require a minor extension to the instruction set. I can visualize about 40 different ways to implement the broad strokes I painted with my feature (I didn't specifically name or assign opcode sequences, there are 3 unused bits in RMC which could be utilized to help in some way, etc.). There are several ways of arriving at the same final result in hardware. In my opinion it's up to people to explore the possibilities rather than criticize the idea. Personally, I like what AMD did with the x86-64 and the REX override prefixes. In 64-bit long mode they threw out redundant one-byte opcode instructions that were duplicated with other multi-byte opcode sequences and utilized them as a series of overrides which provide additional information regarding each instruction, and did so with a single byte.

If that method were employed then the code size increase would be minimal. The only design points left to hit are how to redesign the core so the registers are in a central-access location rather than remote locations of the chip. I'm not saying it wouldn't be difficult. But, it would only have to be designed once and all software written from that point forward would have the potential of benefiting from it.

- Rick C. Hodgin, geek.com

Slashdot Mirror
