Posted by
CowboyNeal
on from the more-power-now dept.
NickSD writes "ChipGeek has an interesting article on increasing x86
CPU performance without having to redesign or throw out the x86 instruction set. Check it out at
geek.com."
296 comments
Why?
by
Anonymous Coward
·
· Score: 4, Insightful
Shouldn't we improve bus speed, data access speeds, etc etc first? After all, the bottleneck is not the processor anymore...
Because if you had read the article you'd realize that this is essentially a zero cost, backwards compatible method of dramatically increasing program execution speed several orders of magnitude -- so the question is really, "Why not?"
Re:Why?
by
Anonymous Coward
·
· Score: 3, Insightful
You don't get it...
What Intel is currently doing is putting a turbo on an old and obsolete architecture.
By having more GP registers, you could do the same job more easily and with better performance (and the code would be easier to read if you write ASM). As it is now, you need too many memory accesses for simple operations. With more registers, you would need less clock speed.
And what if we do both? That would be better also.
I think this type of thinking is needed in the current climate of technology. We need to look at core components and processes and find ways of enhancing and optimising them.
"640K should be enough for anybody" legacy designs will come back and haunt you sometimes.
-- I was thinking of the immortal words of Socrates, who said: "I drank what?" - Chris Knight (Val Kilmer)- Real Genius
You do not understand how a computer really works. If you have more registers, more instructions for manipulating those registers, and finally more cache, you don't need high bus speeds. The processor won't need to get much data from memory anyway, because it will have 99% of what it needs already in registers & internal cache.
We must not forget that most operations a processor does are data movements and not calculations.
All three x86 problems described by the article's author are fixed by the IA-64 architecture, but not by AMD's x86-64.
Shouldn't we improve bus speed, data access speeds, etc etc first? After all, the bottleneck is not the processor anymore...
No.
Because the whole point is preventing memory access. High bandwidth busses are very expensive. If you have a lot of registers, you can avoid memory accesses, making instructions run at full speed.
The best way to reduce the impact of a bottleneck is not making the bottleneck wider. It is making sure your data doesn't need to travel through the bottleneck.
After that, it doesn't hurt to make the bus bandwidth bigger.
dramatically increasing program execution speed several orders of magnitude
Where did you read this?
Also, even with the hardware bottlenecks the Anonymous Coward mentions?
You ask yourself "why not?"... I can only ask myself "how?" :-) Sure, I see how it's supposed to work in theory with zero bottlenecks, but how it works in practice is a completely different thing.
-- Beware: In C++, your friends can see your privates!
I read the article. From what I can see this guy writes lots of assembly, but knows very little about how processors are designed. The huge gains you all see have already been made by register renaming and caches. There might be some gain left by giving the compiler direct control over these, but at the cost of much complexity in the register renaming hardware.
The P4 has a very deep pipeline. Looking for register conflicts is hard enough without adding another layer of redirection.
The fact that the article never mentions register renaming shows the author never did any research into this topic before writing.
Because if you had read the article you'd realize that this is essentially a zero cost, backwards compatible method of dramatically increasing program execution speed several orders of magnitude -- so the question is really, "Why not?"
It does not matter how fast your CPU is if it spends a significant amount of its time waiting for main memory access. All that happens is that it's doing more NOPs/sec, which isn't terribly useful. That's why industrial-grade systems have fancy buses like the GigaPlane.
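The point about a CPU idling on memory can be put in rough numbers with the usual textbook stall model: effective cycles per instruction = base CPI + (memory references per instruction × miss rate × miss penalty). A minimal sketch follows; every number in it is hypothetical and chosen only for illustration, not measured on any real machine.

```python
def effective_cpi(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty_cycles):
    """Average cycles per instruction once memory stall cycles are included."""
    return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty_cycles


# Hypothetical workload: 0.3 memory references per instruction, 2% of them
# miss all caches, and each miss costs 200 cycles of waiting on main memory.
cpi = effective_cpi(base_cpi=1.0, mem_refs_per_instr=0.3,
                    miss_rate=0.02, miss_penalty_cycles=200)
stall_fraction = (cpi - 1.0) / cpi

print(f"effective CPI: {cpi:.2f}")
print(f"fraction of cycles spent stalled: {stall_fraction:.0%}")
```

Under these made-up inputs more than half the cycles are stalls, so doubling the clock would mostly double the rate of waiting.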
Re:Why?
by
Anonymous Coward
·
· Score: 0
"640K should be enough for anybody" legacy designs will come back and haunt you sometimes.
That was supposedly Bill Gates that said that. And he didn't really say it, anyway.
The bus and memory speeds are currently the limiting factor in executing code. What he is saying would allow for the extra registers on the CPU to be used as a small fast storage space. The reason this is so good is that if you didn't use the registers on board, you would have to push the data onto the stack. This means that you have to push data over the bus and into memory, then to get it back you have to pop the data back from memory onto the bus. Two things about that: that is 2 transfers over the bus that are wasted, and main memory is slower than snot.
So with what Rick has said we alleviate some bus traffic AND we have faster access to the data that we need. There are ways around this with special software libraries, but then you have to use those compilers/libraries. This would be an addition that all compilers could easily implement and use.
In short, no, we should not concentrate on bus speed or memory speed first.
That is why I did not attribute the quote to anyone. Since it is either computer legend, a misquote or a denied quote I left as a phrase not made by me commonly understood by people, but owned by no one.
-- I was thinking of the immortal words of Socrates, who said: "I drank what?" - Chris Knight (Val Kilmer)- Real Genius
This may boil down to the generic do-it-in-hardware vs. do-it-in-software debate. Do we reorder the instructions in hardware (a la Pentium and Athlon), or make the compiler do it (a la Itanium)? Do we make the hardware predict branches or have the compiler drop hints? Register renaming as done by modern RISC-core x86 implementations likely addresses many of the issues that he proposes an extension and a smart compiler (or assembler) would solve.
Now, a 386, that would benefit from his technique.
However, if we're going to revise that architecture, I say we add MMX and call it a 486. Then, we can add SSE and call it a Pentium.. And then,...
With more registers, you would need less clock speed.
Have you ever looked at the function entry and exit for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win.
Register renaming already does what's being proposed here, but transparently. In fact, most of the instruction reordering done by a good optimizing compiler (and later by the out-of-order dispatching unit) aims to increase parallelism on register usage.
Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.
P4 processors have 128 registers available for register renaming. Using all of them is not so easy, so Hyperthreading (still only on Xeon) tries to bring two different processes into the instruction mix, keeping their renaming maps separate, so the dispatching unit has more noncolliding instructions ready for execution. This won't make one CPU as fast as 2, but it does keep that insanely deep pipeline from getting filled with bubbles (or would that be 'empty of instructions'?)
And you don't understand how modern processors really work. First, they have several levels of cache memory exactly because you don't want to go out to main memory too often. Second, they have many more registers than the assembly instructions can see: register renaming, speculative execution and all those tricks reorganize instructions so that the CPU core doesn't really have to move data back and forth between memory and registers so often.
IANACPUD (I'm not a CPU designer) so I'm not going to try to describe this stuff further, but articles abound. Here are some:
Into the K7, Part One and Into the K7, Part Two
Intel is constantly adding new commands and registers to the CPU, this is the whole point of the article, so it can easily do it to greatly increase execution speed of ALL programs, not just a few!!!
You do not understand how a computer really works. If you have more registers, more instructions for manipulating those registers, and finally more cache, you don't need high bus speeds. The processor won't need to get much data from memory anyway, because it will have 99% of what it needs already in registers & internal cache.
Unfortunately this is not true. The working set of most programs is either small enough to already fit in the caches of most processors, or large enough that you can throw as much cache as you want at it without making a dent.
Regarding registers, the only thing that registers affect is L1 accesses. Think of the register file as being a level-zero cache. Having a larger GP register file does not change the working set of the program; it only reduces the number of stalls waiting for L1 data. The load on the system memory bus will be the same.
In summary, if the memory bus is a bottleneck with current processors, it'll be a bottleneck for this proposed processor too.
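The "register file as a level-zero cache" argument can be illustrated with a toy model. The sketch below is only that: it assumes both the register file and L1 behave like small LRU caches of whole variables (real register allocation is done by the compiler, not by LRU), it ignores write-backs, and the trace, sizes and names are invented.

```python
from collections import OrderedDict


class LRU:
    """Tiny least-recently-used store used for both 'registers' and 'L1'."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, insert key and evict the LRU entry."""
        if key in self.slots:
            self.slots.move_to_end(key)
            return True
        self.slots[key] = None
        if len(self.slots) > self.capacity:
            self.slots.popitem(last=False)
        return False


def simulate(trace, num_regs, l1_entries=64):
    """Return (L1 accesses, memory-bus accesses) for a trace of variable names."""
    regs, l1 = LRU(num_regs), LRU(l1_entries)
    l1_accesses = bus_accesses = 0
    for var in trace:
        if regs.access(var):
            continue              # value already sits in a register
        l1_accesses += 1          # register miss: go to L1
        if not l1.access(var):
            bus_accesses += 1     # L1 miss: go out over the memory bus
    return l1_accesses, bus_accesses


# A loop that cycles through 12 variables, 100 times (hypothetical workload).
trace = [f"v{i}" for i in range(12)] * 100
for num_regs in (8, 16):
    l1_hits, bus = simulate(trace, num_regs)
    print(f"{num_regs} registers: {l1_hits} L1 accesses, {bus} bus accesses")
```

In this model going from 8 to 16 registers removes almost all L1 traffic, while the number of trips over the memory bus stays the same, which is the claim being made above.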
Re:Why?
by
Anonymous Coward
·
· Score: 0
A good optimizer will count the registers needed for a subroutine. It'll push only those it's going to use to the stack, and leave the others alone.
If your subroutine is going to need all the registers on the chip, then that subroutine is probably complex enough that accessing main memory during processing is going to outweigh function call overhead.
Have you ever looked at the function entry and exit for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win.
Ridiculous. You're saying that architectures with lots of regs are inferior because they make you save lots of registers at certain times, but reg-starved architectures make you save them all the time, all over the place, in any code that feels the slightest register pressure.
At best, the problem you describe indicates those architectures use too many callee-save registers in their calling conventions. Having more caller-save registers is a pure win from this perspective.
-- Patrick Doyle I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
Ridiculous. You're saying that architectures with lots of regs are inferior because they make you save lots of registers at certain times, but reg-starved architectures make you save them all the time, all over the place, in any code that feels the slightest register pressure.
No. What I'm saying is that leaving things in memory, rather than pulling them into registers and potentially having to spill registers as a result, can often be more efficient. It's been shown that passing parameters in registers can be a bad thing sometimes, because you often immediately have to move those registers to non-transient registers, so there's no win.
Register renaming already does what's being proposed here, but transparently.
Well, not exactly. Renaming takes care of the case where two things write, say, EAX, by allowing both to go with different physical registers. I.E. you don't have to stall because you only have one architected "EAX" register.
However, your program is still limited to only 8 visible values at any instant. So when you need a 9th -thing- to keep around, you have to spill some registers onto the stack. Register renaming doesn't solve this problem.
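A toy renamer makes the distinction concrete. This is a sketch under simplifying assumptions, not how any particular CPU implements it: physical registers are never freed, and the 128-entry pool is just the figure quoted elsewhere in this thread.

```python
ARCH_REGS = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]


class Renamer:
    """Maps the 8 architectural names onto a larger pool of physical registers."""

    def __init__(self, num_physical=128):
        self.free = list(range(num_physical))
        # Each architectural register starts out bound to one physical register.
        self.map = {name: self.free.pop(0) for name in ARCH_REGS}

    def rename(self, dst, srcs):
        """Rename one instruction: read current mappings, then bind dst to a fresh register."""
        phys_srcs = [self.map[s] for s in srcs]
        self.map[dst] = self.free.pop(0)
        return self.map[dst], phys_srcs


r = Renamer()
first = r.rename("EAX", ["EBX"])    # e.g. MOV EAX, EBX
second = r.rename("EAX", ["ECX"])   # e.g. MOV EAX, ECX, no need to wait for the first

print("first write to EAX goes to physical register", first[0])
print("second write to EAX goes to physical register", second[0])
print("names visible to the program:", len(r.map))
```

The two writes to EAX land in different physical registers, so neither stalls on the other, yet the program can still only name 8 values at once. A 9th live value has nowhere to go but the stack.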
What he is saying would allow for the extra registers on the CPU to be used as a small fast storage space. The reason this is so good is that if you didn't use the registers on board, you would have to push the data onto the stack.
Err, no. You currently have to push data into the L1 cache. Anyone with even a small modicum of intelligence can realize that your local variables are going to be in the cache for any looping access. Reading and writing to the L1 cache takes 1 cycle. Cache memory accesses are effectively register accesses.
What I'm saying is that leaving things in memory, rather than pulling them into registers and potentially having to spill registers as a result, can often be more efficient.
Ok, then leave the values in memory. Extra regs don't prevent you from doing that.
It's been shown that passing parameters in registers can be a bad thing sometimes, because you often immediately have to move those registers to non-transient registers, so there's no win.
Interesting. I had never heard that. But regardless, it's still the calling convention's fault, not the extra registers.
-- Patrick Doyle I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
Yes, but the point is that in x86 a huge portion of memory accesses are simply loading and storing values on the stack that could have been stored in registers. Granted, these usually end up in cache hits, but that's also more cache pressure (causing other accesses to miss).
It's a legitimate problem, and solving it would increase performance. If you do it right, anyway.
And of course, a high performance CPU does matter. Granted, a GHz CPU won't help if you're using ferrous-core memory, but with modern memory (DDR, RDRAM) the CPU -can- still be the bottleneck.
Of course if you had read the article and had a clue, you'd realize that there is no way in hell that this idea will increase program execution speed "several orders of magnitude". Or one order. I'd guess tripling the number of registers would give you 30%, tops. That's about 0.11 orders of magnitude (log10 of 1.3), and that's not accounting for all the ways in which this guy's idea would slow down the processor.
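To put a percentage gain on the orders-of-magnitude scale, take log10 of the speedup factor. A quick check, assuming "orders of magnitude" means powers of ten and taking the 30% guess at face value:

```python
import math


def orders_of_magnitude(speedup_factor):
    """Express a speedup factor (1.3 means 30% faster) in powers of ten."""
    return math.log10(speedup_factor)


print(f"30% faster: {orders_of_magnitude(1.30):.2f} orders of magnitude")
print(f"10x faster: {orders_of_magnitude(10.0):.2f} orders of magnitude")
print(f"1000x faster: {orders_of_magnitude(1000.0):.2f} orders of magnitude")
```

Either way, a 30% gain is nowhere near even one order of magnitude, let alone several.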
Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.
Although I would like to take this opportunity to point out that AMD's x86-64 (Opteron) architecture increases the number of GP and XMM (used for SSE instructions) registers to 16 each.
-- "Evil company X is threatening to restrict our rights! Let's all get together to stop--OOOH! SHINEY!!!" -- AC
What I'm saying is that leaving things in memory, rather than pulling them into registers and potentially having to spill registers as a result, can often be more efficient.
That's the whole point of having sufficient registers, isn't it? By having more, you can keep yourself from having to spill registers.
Basically, you have two options -- a) you have to load/store the value from/to memory each time you use it, because you don't have enough registers and b) you keep it in a register, and maybe have to load/store it or another register later due to register pressure.
I have a hard time believing that a) could be more efficient. I have a hard time believing that you could come up with a contrived code sequence that makes a) better without being blatantly unoptimized.
It's been shown that passing parameters in registers can be a bad thing sometimes, because you often immediately have to move those registers to non-transient registers, so there's no win.
Moving a register to another is usually a one, maybe zero cycle operation. You may have to store it, if it's in a caller-save register and you make another call. But again, I have a hard time seeing how passing an argument on the stack -- requiring at least one store and one load, possibly more (because you made another call and had to save the register) -- can be better than passing it in a register, where you require at minimum 0 stores/loads, possibly more because of subsequent calls. I can see how they would be -equal-, but how would it possibly be -worse-? And when an architectural feature has the potential for good gain, and never has negative gain, the only question left is if the area/circuit delay is worth it.
Not that these RISC architectures are perfect. Personally, I don't think there should be any callee-save registers. Let the compiler's register allocator decide what registers need to be saved prior to a call, instead of having to save a swath of them because the caller might have wanted them saved.:P
Completely right; I should've said 'for existing x86 software'. Both x86-64 and this proposal are extensions to this (old) architecture and won't benefit old programs.
On the other hand, register renaming does work its magic transparently to existing software.
It would really help the author's cause if he provided some examples, w/ source code so people can see for themselves what kinds of speed increase they can get. Better yet, provide an example of an app he rewrote using his theories to demonstrate the speed increase. Until some real-world examples are available it's just all talk; or maybe he's expecting others to prove his theory for him.
Personally, I don't think there should be any callee-save registers. Let the compiler's register allocator decide what registers need to be saved prior to a call, instead of having to save a swath of them because the caller might have wanted them saved.
It's not that simple, or everyone would be doing this. I could just as easily say this:
Personally, I don't think there should be any caller-save registers. Let the compiler's register allocator decide what registers need to be saved prior to a function body, instead of having to save a swath of them because the callee might clobber them.
I haven't given this a lot of thought myself, but it seems most platforms have come to the conclusion that a mix of caller- and callee-saved registers is best. (Except that oddball SPARC of course.)
-- Patrick Doyle I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
Have you ever looked at the function entry and exit for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers.
It's been a decade since I've seen MIPS assembly, but I do know PPC:
stmw Store Multiple Word.... "n consecutive words starting at EA are stored from the GPRs rS through 31. For example, if rS=30, 2 words are stored." That's one instruction to store as many GPRs as you wish.
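The quoted semantics are easy to model. The sketch below follows only the sentence quoted above (store GPRs rS through 31 as consecutive words starting at EA); the dict-based memory, the register contents and the function name are illustrative assumptions, and alignment and exception behaviour are ignored.

```python
def stmw(gprs, rS, ea, memory):
    """Store GPRs rS..31 as consecutive 32-bit words starting at byte address ea.

    gprs   -- list of 32 integer register values
    memory -- dict mapping byte address -> 32-bit word (toy memory model)
    Returns the number of words stored.
    """
    n = 32 - rS
    for i in range(n):
        memory[ea + 4 * i] = gprs[rS + i] & 0xFFFFFFFF
    return n


gprs = list(range(100, 132))      # hypothetical contents: r0=100 ... r31=131
mem = {}
stored = stmw(gprs, 30, 0x1000, mem)

print("words stored:", stored)
print("memory:", {hex(addr): value for addr, value in mem.items()})
```

Note that the run always ends at r31, so a function saves "as many as you wish" by picking where the run starts.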
Example code would not help his case. You can't tell how fast code runs just by looking at it. Today's processors are very complex. Adding these instructions would slow the execution of all instructions, but you can't see that from only looking at code. If he showed an example of how it could be added to existing register renaming systems without adding overhead or causing pipeline stalls that would help his case. From what I can tell he doesn't even know how processors are designed.
A lot of people here don't seem to realize that computer architecture has been heavily studied. Don't you think if an idea this easy would work, Intel would have figured it out already? They aren't stupid.
Let me make one more try to rephrase my point. His idea is good, but it's already been done. You just don't see it because the processor instead of the compiler does it. There's usually some gain by moving work to the compiler, but it's not revolutionary (or usually worth it).
Where I graduated, they teach this stuff to undergrads.
-- 'SBEMAIL!' is better than a goat!!
Re:Why?
by
Anonymous Coward
·
· Score: 0
" Granted, a GHz CPU won't help if you're using ferrous-core memory,"
Well, shit. THAT'S my problem, then. At least now I can stop renting that warehouse and I can probably get about 2 grand for the iron scrap...
Re:Why?
by
Anonymous Coward
·
· Score: 0
Uhm, your joke falls flat. MMX came in on the Pentium.
The fix there would be to shadow the few words at top-of-stack in the rename registers. Tell compiler writers that, say, the 16 32-bit words at the top of stack are eligible for hardware register shadowing, and are as cheap as registers. In fact, give them names and a hacked assembler, so that you can read and write things that look like register names. You don't even add any instructions. Voila!
I'd say that's MUCH cheaper than this "register windows on steroids and acid" technique.
I think the author is very confused. If the applications have to be recompiled anyway, there is really no need to have an instruction set similar to x86.
Here is my proposal, just take existing P4 or Athlon and expose its internal RISC-style OOO engine and instruction set. For example, we can use 32 64-bit GP registers, the first 8 registers map to 8 traditional x86 32-bit registers. The first 16 can map to x86-64 registers.
The 64-bit regs can hold integers, addresses, floats, doubles. We add ALU, AGU, load, store, shifter, multiplier, divider around this single register set. (this way we can share mul/div between int and flt.) The 64-bit instruction set can be very similar to Alpha (or Itanium if you really like it).
Under 32-bit mode, all extra registers will be used as renaming registers and run just like Athlon. Under 64-bit mode, it runs like Alpha with OOO super-scalar pipeline. Some special instructions and call gates can be used to switch mode quickly.
Comparing this to x86-64, my approach requires a compiler that is similar to an Alpha one, whereas x86-64 needs a modified version of an x86 one. Since gcc has code generators for many architectures, I don't think it will be a big issue.
Not a bad idea, actually. You'd still have to do the store, in order to maintain correctness, but you could bypass doing the load. Few problems, though. The stack is not accessed solely through push/pop (in fact for local variables it is rarely the case that they are), and you can add/subtract from ebp at will. So you can't completely throw this off on the rename mechanism -- you have to be able to connect those shadow registers with the addresses they are shadowing at execution time. Keeping track of which variables can be shadowed is a problem, for the same reason. That could stall rename on execution...
I suspect by the time you eat the penalty solving these problems would incur, you might be as well off just spilling into your cache.:)
You'd have to tie it to ESP, not EBP, since GCC and other compilers will (with appropriate flags) use EBP as a general-purpose register. (Consider GCC's -fomit-frame-pointer.) And I'm perfectly aware that accessing the stack frame happens with instrs other than push or pop. Indeed, I'd assumed that this "top of stack looks like rename regs" idea applied only to memory references of the form [ESP + constant_offset] as either src or dst to another instruction. (And for simplicity, limit it to 4-byte-aligned offsets and 4-byte wide accesses.)
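The eligibility rule described here (ESP base, constant offset, 4-byte aligned, 4-byte wide) can be sketched as a small check. This is only an illustration of the rule as stated in the thread; the 16-word window comes from the "say, the 16 32-bit words" suggestion earlier, and the function and constant names are made up.

```python
SHADOW_WORDS = 16   # size of the shadowed top-of-stack window, in 32-bit words


def shadow_slot(base_reg, offset, width_bytes):
    """Return the shadow-register index for an operand [base_reg + offset],
    or None if the access has to stay an ordinary memory access."""
    if base_reg != "ESP":
        return None                       # only ESP-relative references qualify
    if width_bytes != 4 or offset % 4 != 0:
        return None                       # only aligned 32-bit accesses qualify
    if not (0 <= offset < 4 * SHADOW_WORDS):
        return None                       # outside the shadowed window
    return offset // 4


print(shadow_slot("ESP", 8, 4))    # eligible: third word from top of stack
print(shadow_slot("EBP", 8, 4))    # not eligible: wrong base register
print(shadow_slot("ESP", 6, 4))    # not eligible: misaligned
```

Anything the check rejects simply goes to memory as it does today, which is what makes the scheme safe to apply conservatively.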
The shadowing would have to work like a write-through cache, and you *do* run into some sharing issues in a multiprocessor setup. In order to make refs to the top-of-stack eligible for rename aliases, you would need the following conditions:
The cache line holding top-of-stack is in exclusive state, not shared or invalid. (I think the first 'pushes' would make this line become 'exclusive' fairly quickly, since the caches are write-allocate.)
No push/pop instructions in progress.
All push/pop instructions and non-32-bit alignment/width accesses deferred until shadow writethrus occur. (Honor memory dependences relative to the stack-relative-accesses-turned-rename-registers.)
Any change to ESP invalidates the rename registers.
You still have some issues if you generate ESP-relative addresses into other registers. (For example, generating a pointer to a local value on the stack.)
EBP-relative accesses could often overlap ESP-relative accesses if a program uses EBP and ESP for accessing the allocated stack frame.
We already need hardware for resolving these memory dependences. Since accesses via these alternate paths are essentially *forced* to go to memory, it's not a big deal. We just need to remember to make them dependent on the writethrus that our top-of-stack shadow provides.
If you think about it, compilers nowadays tend to migrate their stack frame allocation to the top of the function and the stack frame release to the bottom of the function. All spills are converted to ESP or EBP relative addressing, not push/pop. This allows arbitrary access to spills. Thus, the current compilation model already matches well to this rename idea.
I could blather on with more ideas (there's one particularly neat one that I'd like to share), but I think I'd be violating my company's IP to talk about it here. *sigh*
BTW, the original article's content (the software-controlled register-rename-on-steroids-and-acid idea) seems to me pretty typical of a programmer's perspective of the hardware that ignores hardware realities.
It essentially ignores the fact that the effect of one instruction on later instructions might be on a pipeline stage other than the execute stages, so there's a pipeline bubble that develops between two such instructions. Register names are resolved for dependence generation many pipeline stages ahead of the execute stage, so you have a gigantic barrier generating effect between anything that changes the naming and the stuff that uses the names. Basically, all name changes will happen in the execute stages, but anything that relies on the naming will be stalled in the earlier dependence tracking stages.
(For those who want a concrete example of "result of instruction A affects a different pipeline stage of instruction B", read up on AGI stalls -- Address Generation Interlock stalls. These occurred on 486s and Pentium I's. On these machines, instructions generated memory addresses for accesses one pipeline stage before the instruction itself executed. So, if you issued, say, "MOV EAX, value" followed by "OP reg, [EAX + offset]", you'd take an AGI stall, because the EAX value would get updated about the same time the second op needed to use it for address generation. Later CPUs hide it better by scheduling out-of-order. This page gives a reasonable explanation of AGI stalls. Google turns up a lot of useful links. This concept is easy to explain w/ a pipeline diagram, but alas, Slashdot would probably eat such a diagram.)
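In place of a pipeline diagram, here is a toy counter for the effect. It assumes a strictly in-order machine where an instruction stalls one cycle if the instruction immediately before it wrote a register that it needs for address generation; the one-cycle penalty and the tuple encoding are assumptions of this sketch, not a description of any specific chip.

```python
def count_agi_stalls(program):
    """Count address-generation stalls in an in-order instruction sequence.

    program -- list of (dest_reg, address_regs) tuples in issue order, where
               address_regs are the registers used to form a memory address.
    """
    stalls = 0
    prev_dest = None
    for dest, address_regs in program:
        if prev_dest is not None and prev_dest in address_regs:
            stalls += 1            # address needed before the previous result is ready
        prev_dest = dest
    return stalls


# MOV EAX, value ; OP EBX, [EAX + offset]
back_to_back = [("EAX", []), ("EBX", ["EAX"])]
# Same pair with an unrelated instruction scheduled in between.
separated = [("EAX", []), ("ECX", []), ("EBX", ["EAX"])]

print("back to back:", count_agi_stalls(back_to_back), "stall(s)")
print("separated:   ", count_agi_stalls(separated), "stall(s)")
```

The second sequence shows the classic fix of the era: slide an independent instruction between the write and the address use.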
One nice thing about the 'convert ESP-relative accesses to rename-register accesses' idea is that if ever you don't know if it's safe to use the rename-reg aliases, you can always leave these as memory accesses, and it "just works." So, you can eliminate that dependence-ambiguity stall. Just issue the instruction as-is, rather than retargeting it to read/write a rename reg.
One comment, other than that I foolishly mixed up EBP and ESP. Hey, they're both just registers to me, I don't care what the software is doing with 'em.;)
Basically, all name changes will happen in the execute stages, but anything that relies on the naming will be stalled in the earlier dependence tracking stages.
The idea would maintain all of its value if you were restricted to use only immediates with the re-mapping instruction. It'd be a static, compile time thing, but you could "execute" it in the re-map stage (presumably, just before renaming to physical registers), which wouldn't cause stalls (though I shudder to think of the logic for doing that in a 3 or 4 wide machine). But it's still not a very good idea. :)
And what about conditional branches nearby?
You don't know until the instruction commits what the register names will be. Imagine code which simply conditionally branches around a remap instruction. How do you handle that sanely?
I personally think the remap idea is insane. You're essentially adding register-file mode bits, and mode bits are just ugly in too many ways to describe. Just add more architectural registers already! The Pentium 4 has 128 rename registers anyway, so it seems like adding more 'architectural registers' is more an opcode formality than anything else.
It's not that simple, or everyone would be doing this.
If no engineer ever did something that was sub-optimal, there wouldn't be anything left for other engineers to do.:)
I could just as easily say this: Personally, I don't think there should be any caller-save registers. Let the compiler's register allocator decide what registers need to be saved prior to a function body, instead of having to save a swath of them because the callee might clobber them.
Ah, but you see, those statements are not equal in magnitude.
For callee-save, the compiler knows what registers the function uses over the course of the function, and can save just those.
For caller-save, the compiler knows what registers it is using at the time of the call, and can save only those.
Registers in use at the time of the call is always going to be less than or equal to the number of registers used over the entire function. Thus caller-save wins.
Of course, as you can surmise, that's only true if you assume all functions use the same number of registers. You could conceive of a call graph where functions which use many registers frequently call functions that use few, and then callee-save wins. The question then becomes what does the typical call graph look like? What is the optimal combination of caller/callee registers for each call graph? More directly, how does each benchmark of interest perform when compiled with various combinations of caller/callee?
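The counting argument can be written down directly. The sketch below is a deliberately crude model: register assignments are fixed in advance (a real allocator would move values around to dodge saves), a 16-register machine is assumed, and all names are hypothetical.

```python
ALL_REGS = frozenset(range(16))   # hypothetical 16-register machine


def saves_per_call(live_at_call, used_by_callee, callee_saved):
    """Count save/restore pairs for one call under a given convention.

    live_at_call   -- registers the caller needs preserved across the call
    used_by_callee -- registers the callee writes during its body
    callee_saved   -- registers the convention makes the callee responsible for
    """
    caller_saved = ALL_REGS - callee_saved
    caller_side = len(live_at_call & caller_saved)     # caller saves what it has live
    callee_side = len(used_by_callee & callee_saved)   # callee saves what it clobbers
    return caller_side + callee_side


ALL_CALLER_SAVE = frozenset()
ALL_CALLEE_SAVE = ALL_REGS

# Light caller (2 live values) calling a heavy callee (uses 6 registers).
light_caller = ({0, 1}, {0, 1, 2, 3, 4, 5})
# Heavy caller (6 live values) calling a light callee (uses 2 registers).
heavy_caller = ({0, 1, 2, 3, 4, 5}, {0, 1})

for name, (live, used) in (("light->heavy", light_caller), ("heavy->light", heavy_caller)):
    print(name,
          "caller-save:", saves_per_call(live, used, ALL_CALLER_SAVE),
          "callee-save:", saves_per_call(live, used, ALL_CALLEE_SAVE))
```

Which convention wins flips with the shape of the call, which is why the answer depends on what typical call graphs look like.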
Since I'm willing to bet that study has not been performed, I feel okay challenging the conventional wisdom.:)
Registers in use at the time of the call is always going to be less than or equal to the number of registers used over the entire function. Thus caller-save wins.
Hey, that's a good point. I never thought of that.
-- Patrick Doyle I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
And what about conditional branches nearby? You don't know until the instruction commits what the register names will be. Imagine code which simply conditionally branches around a remap instruction. How do you handle that sanely?
The same way you handle normal register renaming in the face of branches. If you mispredict the branch, you have to flush the subsequent instructions and re-fetch. You'd have to either checkpoint or repair the RMC map, just like you do the renaming tables.
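A minimal sketch of the checkpoint approach, assuming a single outstanding branch at a time and a map that is just a dictionary of names. Real hardware would use a handful of snapshot copies or walk back through a history buffer, and the register name R17 is invented for the example.

```python
class RemapTable:
    """Toy name-to-register map with checkpoints for speculative execution."""

    def __init__(self, names):
        self.map = {name: name for name in names}   # identity mapping to start
        self.checkpoints = []

    def checkpoint(self):
        """Called when a branch is predicted: snapshot the current map."""
        self.checkpoints.append(dict(self.map))

    def remap(self, name, target):
        """Speculatively point an architectural name at a different register."""
        self.map[name] = target

    def resolve(self, mispredicted):
        """Called when the branch resolves: drop the snapshot, or roll back to it."""
        saved = self.checkpoints.pop()
        if mispredicted:
            self.map = saved


table = RemapTable(["EAX", "EBX"])
table.checkpoint()                  # branch predicted not-taken
table.remap("EAX", "R17")           # remap instruction on the predicted path
table.resolve(mispredicted=True)    # prediction was wrong: undo the remap

print(table.map)
```

On a correct prediction the snapshot is simply discarded and the speculative remap becomes the real one.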
I personally think the remap idea is insane.
I agree.:)
The Pentium 4 has 128 rename registers [arstechnica.com] anyway, so it seems like adding more 'architectural registers' is more an opcode formality than anything else.
Not at all! Physical registers are no replacement for architectural ones. The number of physical registers is essentially the size of your window. They don't stop you from having to store values to memory because you don't have enough architectural registers.
But yes, the better solution is to just add architectural registers.:)
The Pentium 4 has 128 rename registers anyway, so it seems like adding more 'architectural registers' is more an opcode formality than anything else.
Chris replied:
Not at all! Physical registers are no replacement for architectural ones. The number of physical registers is essentially the size of your window. They don't stop you from having to store values to memory because you don't have enough architectural registers.
I think you may have misunderstood me.
What I was saying is that you could add architectural registers without necessarily adding any physical registers. It seems like the additional hardware to support new architectural registers should be minimal, and largely limited to decoding the new opcodes that specify them. The Pentium IV's rename register file is 128 registers. Instead of having only 8 GPR names that get mapped onto that set, why not have 16 or 32?
By the way, is it just me, or does anyone else think that Hammer's 16 register extension is shooting behind the duck? Other high-end RISCs have 32 to 64 registers. The machine I program has 64 and could make use of more in some cases.
Perhaps because x86 is fundamentally a memory-operand instruction set, it can get by with fewer registers more easily? RISC-like instruction sets, with their load/store architecture, do end up needing a few more registers for values that are loaded and used immediately.
Intel is constantly adding new commands and registers to the CPU. That is the whole point of the article: it can just as easily add ones that greatly increase the execution speed of ALL programs, not just a few!!!
got upmodded. These upmods validated my belief that Slashdot mods are based on the User ID and the Comment ID - UID is more significant, as there is a pool that is automodded to 5, but CID kicks in on a semirandom scale. According to my reverse-engineered tables this particular comment was due for upmodding. There can be no other reason, as the parent comment was fucking stupid.
I think you may have misunderstood me. What I was saying is that you could add architectural registers without necessarily adding any physical registers.
Ah, I see now.
By the way, is it just me, or does anyone else think that Hammer's 16 register extension is shooting behind the duck? Other high-end RISCs have 32 to 64 registers. The machine I program has 64 and could make use of more in some cases. Perhaps because x86 is fundamentally a memory-operand instruction set, it can get by with fewer registers more easily? RISC-like instruction sets, with their load/store architecture, do end up needing a few more registers for values that are loaded and used immediately.
That could be. I've also heard that 16 is the number needed by database/server type apps. I've also heard that most of the time 32 reg RISC machines feel about zero register pressure, meaning they have more than enough.
But really, I'm pretty sure the decision was driven solely by wanting to minimize the impact on instruction size. The REX prefix only has room to allow for 16 registers. Whether that works out or not remains to be seen.:)
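The 16-register limit falls out of the encoding arithmetic: the ModRM byte has 3-bit register fields, and the REX prefix contributes one extra bit per field, giving 4 bits and therefore 16 names. A small sketch of that bit manipulation (the function name is made up, and it ignores SIB bytes and everything else a real decoder handles):

```python
def extended_regs(rex, modrm):
    """Return (reg, rm) register numbers from a REX prefix and ModRM byte.

    REX layout is 0100WRXB; ModRM layout is mod(2) reg(3) rm(3).
    Simplified sketch: assumes a register-direct operand with no SIB byte.
    """
    if rex & 0xF0 != 0x40:
        raise ValueError("not a REX prefix")
    rex_r = (rex >> 2) & 1          # extends ModRM.reg to 4 bits
    rex_b = rex & 1                 # extends ModRM.rm to 4 bits
    reg = (modrm >> 3) & 0b111
    rm = modrm & 0b111
    return (rex_r << 3) | reg, (rex_b << 3) | rm


# 48 89 C3 encodes mov rbx, rax: reg = 0 (rax), rm = 3 (rbx)
low_regs = extended_regs(0x48, 0xC3)
# 4D 89 C8 encodes mov r8, r9: reg = 9 (r9), rm = 8 (r8)
high_regs = extended_regs(0x4D, 0xC8)
```

With only one spare bit per field in the prefix, going to 32 registers would have needed a different (and larger) encoding.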
That's pretty sweet how he makes the x86 processor faster by adding commands for divx! This guy knows how to improve Intel architecture for the masses!
Other new commands:
LIE Launch IE
LMW Launch MS Word
LME Launch MS Excel
LMO Launch MS Outlook
LMOV Launch MS Outlook Virus
LCNR Launch Clippy for No Reason
DPRN Display Pr0n
SPOP Show IE Popup
SPU Spam User
SHDR Send Hard Drive Contents to Redmond
RBT Reboot
SBS Show Blue Screen
-- Lately democracy seems to be based on the skybox, the Happy Meal box, the X-box, and the idiot box.
Real people used stuff like jmp $fce2 for the first, but the latter was a little bit more complex because of the blue part: lda #$06 ; sta $d020 ; sta $d021 ; hlt (of course, hlt is an undocumented opcode, and since C64 boots in less than a second from ROM, it hardly is as frustrating as the bluescreen in Windows).
LPS - Launch Photoshop
DGB - Do Gaussian Blur
ES - Encode Sorenson
DS - Decode Sorenson
CSAWEF - Create Switch Ad With Ellen Feiss
And my personal favorite:
BICPUWPBIGBASE - Beat Intel CPU With Proprietary Benchmark Involving Gaussian Blurs And Sorenson Encoding
Re:DivX! Sweet!
by
Anonymous Coward
·
· Score: 0
There needs to be a command to delete C:\Windows\*. We're tired of having to type out hcp://system/DFS/uplddrvinfo.htm?file://c:\windows\* to make a URL that deletes c:\windows with no prompts. Just make a DWN (Delete Windows Now) command, right there in assembly, to make life easier for everyone.
Sounds like an interesting idea...
by
AceMarkE
·
· Score: 2, Interesting
It sounds like a pretty decent idea to me. Granted, I'm no assembly expert (I'm just now in my Microprocessors class, which is based on the Z80), but I don't see how having more registers could be a bad thing. Anything that keeps operations there inside the CPU rather than going out to memory would pretty much have to be faster. I especially like the fact that he's implemented it such that no current code would be affected. THAT is a key point right there.
Admittedly, even if Intel and AMD decided to implement this, it'd still be a while, and then we'd have to get compilers that compile for those extra instructions, and there's our entire now-legacy code base that doesn't make use of them, and don't forget those ever-present patent issues...
But yeah. Cool idea, well thought out. Petition for Intel, anyone?
Mark Erikson
Re:Sounds like an interesting idea...
by
Anonymous Coward
·
· Score: 2, Funny
This is a classic Slashdot comment.
"Hey I'm currently in my 2nd year at college, but what the heck I think I'm qualified to comment here"
"I think Intel need to employ this guy, I mean they must have overlooked this"
"Cool - I wonder if I could think of something like this"
Don't worry - you will.
Re:Sounds like an interesting idea...
by
Anonynnous+Coward
·
· Score: 1
This is the classic Slashdot comment:
"Hey, I'm an insecure prick and need to pick on someone."
"I should probably do it anonymously--I might need a job from him someday."
Don't worry, you will.
Re:Sounds like an interesting idea...
by
Anonymous Coward
·
· Score: 0
Offtopic? Funniest comment all week more like.
Cache is the key
by
Anonymous Coward
·
· Score: 3, Insightful
I've got three words for you: cache, cache and cache.
Why do you think Pentium Pro was such a huge success that it's still being used in CPU intensive operations? Why do you think Sun Sparc and Digital/Samsung Alpha CPUs trash modern Pentium 4s and Athlons at 500 MHz? Yup. Loads and loads of cache.
Re:Cache is the key
by
Anonymous Coward
·
· Score: 5, Informative
Cache is a huge Intel problem. 20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 only has 32K.
They may throw big chunks of L2 at the problem, but it seems to me that so little L1 means more time moving data and less time processing...
Re:Cache is the key
by
Sivar
·
· Score: 5, Insightful
I've got three words for you: cache, cache and cache.
Why do you think Pentium Pro was such a huge success that it's still being used in CPU intensive operations? Why do you think Sun Sparc and Digital/Samsung Alpha CPUs trash modern Pentium 4s and Athlons at 500 MHz? Yup. Loads and loads of cache.
No. First, Alphas and SPARCs do not trash modern x86 CPUs: the Pentium IV 2.8GHz and Athlon XP 2800+ are the fastest CPUs in the world for integer math and the Itanium 2 is the fastest in the world for floating point math.
Extra cache only helps until the cache is large enough to contain the working set of the primary application being run. A larger cache can improve performance further, but once the cache can contain the working set, the gain is in the single digit percents. The working set of the vast, vast majority of applications is under 512K, and most are under 256K. You'll find that increasing the speed of a small cache is generally more important than increasing the size of the cache.
Case in point: When the Pentium 3 and Athlon went from a large (512K) cache to a small (256K), faster cache, performance went up, for the Athlon by about 10% and for the Pentium 3...I don't recall, but around 10%. Some desktop apps, like SETI@Home, have a large working set (more than 512K) and DO benefit from large caches, but nothing larger than 1MB would improve performance here either.
Most server CPUs, like Alphas and SPARCs, have fairly large caches for the following reasons:
1) Databases love large caches. They are one of the few applications that can take advantage of a large cache, because they can store lookup tables of arbitrary size in cache. Server CPUs are often used for databases because Joe x86 CPU is just fine for webservers, FTP servers, desktop systems, etc. and is generally faster at them than server CPUs.
2) Most server class CPUs are fully 64-bit and do NOT support register splitting. On the SPARC64, for example, if you want to store an integer containing the number "42", that integer will take up a full 64 bits regardless of the fact that the register can store numbers up to 18,446,744,073,709,551,615. This larger size increases the cache size needed to store the working set of programs, because all integers (and many other data primitives) require a full 64 bits or more. With x86 CPUs, which support register splitting and have only 32-bit registers, that number could be stored in a mere eight bits. The square root of the number of bits the SPARC requires.
3) Big servers with multiple CPUs are often expected to run multiple apps, all of which are CPU intensive. If the cache can store the working set for all of them, speed is slightly improved.
That said, who in their right mind would use an incredibly slow Pentium Pro for a CPU intensive calculation? A Pentium Pro at the highest available speed, 200MHz, with 2MB cache may be able to outperform a Celeron 266, but not by much and only for very specific cache-hungry software. Show me a person that thinks a Pentium Pro with even 200GB of cache can outperform ANY Athlon system and I will show you a person that hasn't a clue what they are talking about.
Look at the performance difference between the Pentium IV with 256K and with 512K (a doubling) of cache. You will have to do some research to find an application that gets even a 10% performance boost.
FYI If you are interested in competent, intelligent, technical reviews of hardware, you might like www.aceshardware.com
-- Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
(not that the Itanium is an x86 CPU, but it has a far smaller cache than most Alphas, SPARCs, and PA-RISC chips)
-- Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
Re:Cache is the key
by
Anonymous Coward
·
· Score: 1, Insightful
Caches are good for the general case, but for some problems you can easily get degraded performance from a cache. You have to be cache aware and plan your memory access strategy for the algorithm.
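A rough way to see what "cache aware" means: count misses for the same matrix walked in two different orders. The sketch below models a tiny direct-mapped cache; the sizes (64-byte lines, 128 lines, a 256x256 matrix of 4-byte elements) are arbitrary assumptions chosen to make the effect obvious, not figures for any real CPU.

```python
# Hypothetical direct-mapped cache: 128 lines of 64 bytes (8 KB total).
LINE_BYTES = 64
NUM_LINES = 128
ELEM_BYTES = 4
N = 256  # matrix is N x N, stored row-major


def count_misses(addresses):
    """Count misses for a stream of byte addresses."""
    tags = [None] * NUM_LINES
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES      # which memory line is touched
        index = line % NUM_LINES       # which cache slot it maps to
        if tags[index] != line:        # slot holds a different line: miss
            tags[index] = line
            misses += 1
    return misses


# Same elements, two traversal orders.
row_major = ((i * N + j) * ELEM_BYTES for i in range(N) for j in range(N))
col_major = ((i * N + j) * ELEM_BYTES for j in range(N) for i in range(N))

row_misses = count_misses(row_major)   # one miss per 16-element line
col_misses = count_misses(col_major)   # stride evicts lines before reuse
```

In this model the row-order walk misses once per cache line, while the column-order walk misses on every access, because the 1 KB stride keeps landing on the same few slots and evicts each line before its neighbours are used.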
Re:Cache is the key
by
mmol_6453
·
· Score: 5, Insightful
20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 [geek.com] only has 32K.
Just for people who don't know, Intel reduced the amount of cache when they moved from the P3 to the P4. And hardware junkies know the performance hit that caused.
A seemingly unrelated sidenote: Intel wants to move to their IA-64 system, and, since it's not backwards-compatible, they're going to have to force a grass-roots popular movement to pull it off.
Perhaps they crippled the P4 to make the IA-64 processors look even faster to the general public?
In any case, I think the quality of the P4 is a sign that Intel wants to make its move soon. (Though losing $150 million, not to mention the context in which they lost it, may set back their schedule, giving AMD's 64-bit system a chance to catch on.)
'Look at the performance difference between the Pentium IV with 256K and with 512K (a doubling) of cache'
ya but see they are talking about level1, not level2 cache.
Re:Cache is the key
by
alaeth
·
· Score: 2, Insightful
One of the reasons they reduced the size of L1 cache is because it takes up a huge amount of physical die size. If you're trying to reduce the number of bad chips in a batch, the easiest way is to reduce the size of the chip itself.
More chips that pass = more profit
It all boils down to money in the end.
-- Sig goes here.
Re:Cache is the key
by
Anonymous Coward
·
· Score: 1, Interesting
They made the cache smaller because the trace cache is more complex and because the numbers bear out that it's more effective per entry than traditional caches. It would still be nicer if it was larger.
Also, the P4 is not the end of that architecture, only the beginning. They're thinking ahead. As process improvements kick in, soon that cache will be HUGE.
Re:Cache is the key
by
Mr+Z
·
· Score: 2, Informative
It wasn't die area so much as clock rate. At smaller and smaller geometries, the transit time for a bit starts going up at some point due to transmission line effects. RC delay goes up since R goes up (your wire got smaller) and C goes up (you got closer to the other wires).
Counting only first-level cache is very misleading. For example, Itanium 2 has 3 levels of on-chip cache.
The first level (L0) is 16k data + 16k instruction.
The second level (L1) is 256k unified.
The third level (L2) is 3MB unified.
At any given time, you have several instruction bundles being fetched from the L0 instruction cache, and several data operands being read or read+invalidated (for writes) from the L0 data cache. These represent simultaneous processing on multiple instructions (due to 'explicit parallelism' in the architecture), and also speculative fetching and execution, even past a control transfer.
If an L0 miss on any of these cycles were to automatically cause a pipeline stall, then I'd agree with your assertion that the first level cache is all-important. But it doesn't work that way! The exact mix of L0 hits & misses only determines which subset of the outstanding instructions make forward progress first -- while waiting for the other L0 miss cycles to complete from L1 or L2.
In fact, what really matters is that a high enough percentage of the outstanding cycles hit L0 to ensure that useful work can continue during the L1 or L2 latency. Beyond that percentage, a higher L0 hit rate is not required!
So, L0 misses aren't such a big deal. Now, L2 misses are a huge deal!
Comparing only first level cache sizes does an injustice to modern CPUs like Itanium 2 that can exploit parallelism and speculative execution.
(By the way, this wasn't a paid advertisement for Itanium 2. It just happens that it's the architecture I'm familiar with.)
Re:Cache is the key
by
orz
·
· Score: 4, Informative
Intel's processors are not crippled by a small L1 cache. Yes, the P3 and P4 L1 caches are WAY smaller than the Athlon L1 cache, but Intel doesn't NEED a large L1 cache, because their L2 cache is extremely fast. Intel tends to have small, extremely fast L1 caches, and makes up for the higher miss rate with fast L2 caches as well. For instance, the P3 L1 cache has a miss rate roughly twice as high as the Athlon's L1 cache, but the P3's L1 miss penalty is roughly 8 cycles (assuming an L2 hit...), less than half the Athlon's L1 miss penalty of 20+ cycles on an L2 hit. Also, the P4's L1 cache, which is even smaller than the P3's, allows them to decrease the L1 hit latency AND run at a substantially higher clock speed than AMD's larger cache.
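The trade-off can be put into rough numbers: the average cost added per access by L1 misses is miss rate times miss penalty. The penalties (8 and 20 cycles) and the 2x miss-rate ratio come from the post above; the absolute miss rate of 2% is a made-up placeholder, and the comparison does not depend on it.

```python
def l1_miss_cost(miss_rate, penalty_cycles):
    """Average extra cycles per access caused by L1 misses that hit in L2."""
    return miss_rate * penalty_cycles


athlon_miss_rate = 0.02                 # hypothetical placeholder value
p3_miss_rate = 2 * athlon_miss_rate     # "roughly twice as high"

p3_cost = l1_miss_cost(p3_miss_rate, 8)           # ~8-cycle penalty
athlon_cost = l1_miss_cost(athlon_miss_rate, 20)  # 20+ cycle penalty

ratio = p3_cost / athlon_cost  # 2 * 8 / 20, independent of the placeholder
```

Under those assumptions the P3 pays about 80% of what the Athlon pays per access for L1 misses, despite missing twice as often.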
For a graphical depiction of the difference between Intel and AMD cache performances, try this link: http://www.tech-report.com/reviews/2002q1/northwood-vs-2000/index.x?pg=3 It was the first thing that came up in a google search for linpack and "cache size".
Very good post. But you're wrong on one tiny little thing...
This larger size increases the cache size needed to store the working set of programs, because all integers (and many other data primitives) require a full 64 bits or more.
That's not true. These architectures don't support byte -register- access, but they do support byte -memory- access. So you can still take your 64-bit register containing "42", do a byte store, and use only 1 byte of a cache line. When you load it, it will either zero or sign extend the single byte to the full 64-bits.
However, storing 64-bit pointers does increase the size of cache needed, and this is something that x86-64 will suffer from as well.
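The byte-store point is easy to demonstrate away from any particular CPU. The sketch below uses Python's struct module to stand in for memory: a value held in a 64-bit register can be written as a single byte, and reading it back widens it by either zero or sign extension. The helper names are invented for illustration.

```python
import struct

value = 42
as_byte = struct.pack("<b", value)   # byte store: 1 byte in memory
as_quad = struct.pack("<q", value)   # full 64-bit store: 8 bytes in memory


def zero_extend(one_byte):
    """Load a byte as an unsigned value (upper register bits become 0)."""
    return struct.unpack("<B", one_byte)[0]


def sign_extend(one_byte):
    """Load a byte as a signed value (upper bits copy the sign bit)."""
    return struct.unpack("<b", one_byte)[0]


stored = b"\xd6"                     # bit pattern 1101 0110
unsigned_view = zero_extend(stored)  # 214
signed_view = sign_extend(stored)    # -42
# What the sign-extended value looks like in a 64-bit register:
reg_bits = signed_view & 0xFFFFFFFFFFFFFFFF
```

So the memory footprint of small integers is a choice made by the program's data layout, not something forced by the register width; pointers are the case where 64 bits really are needed.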
In other words what he's trying to say is that fetching data items from cache allows your processor to achieve its maximum speed for most of the time, without it having to access slower memory.
Now if your working set fits into cache, then main-memory access is not your bottleneck. So now the speed depends on how fast your cpu can execute instructions and how fast it can fetch data items from cache.
Of course there are other problems which can limit your speed: pipeline stalls. Though a person should consult a book on cpu architecture for more info.
Re:Cache is the key
by
Anonymous Coward
·
· Score: 0
No. First, Alphas and SPARCS do not trash modern x86 CPUs, the Pentium IV 2.8GHz and Athlon XP 2800+ are the fastest CPUs in the world for integer math and the Itanium 2 is the fastest in the world for floating point math.
post again when you get out of high school, kekethxbye~~
add core funcs libc/stdc++ to the CPU
by
cheekyboy
·
· Score: 0
or at least add the lowest level often used stuff. It would make a lot of stuff faster.
MMX/3d stuff for CPUs are lame, we have 3d cards for that.
and SMT to the cpu as default.
add a FPGA matrix of 4096x4096 transistors or something on the side of the cpu for custom UBER fast routines
-- Liberty freedom are no1, not dicks in suits.
Re:add core funcs libc/stdc++ to the CPU
by
Toraz+Chryx
·
· Score: 2
>"MMX/3d stuff for CPUs are lame, we have 3d cards for that."
good luck doing scientific calculations on a Geforce:)
OTOH
>"add a FPGA matrix of 4096x4096 transistors or >something on the side of the cpu for custom UBER fast routines"
^^^^ that idea has me intrigued, anyone who actually knows more about FPGA's than me (which isn't difficult) want to go into pluses/negs with that concept?
Re:add core funcs libc/stdc++ to the CPU
by
ndecker
·
· Score: 1
>>"add a FPGA matrix of 4096x4096 transistors or >something on the side of the cpu for custom UBER fast routines"
> that idea has me intrigued, anyone who actually knows more about FPGA's than me (which isn't difficult) want to go into pluses/negs with that concept?
I think a context switch would be painfully slow!
Re:add core funcs libc/stdc++ to the CPU
by
Toraz+Chryx
·
· Score: 2
How slow?
and if it were for something like a 20 hour 3d render, would it matter if the initial setup took a while?
Re:add core funcs libc/stdc++ to the CPU
by
HFXPro
·
· Score: 2, Interesting
good luck doing scientific calculations on a Geforce
Wha? Someone didn't tell you 3D accelerators do lots of math that amounts to very intensive scientific calculation, even if their implementation doesn't give the most accurate results. In fact, much of the math they use is used by physicists, engineers, and mathematicians every day. Unfortunately, getting the information out in a way that lets you permanently store it, or know what the exact results are, could be quite difficult, other than seeing it as graphics on your screen.
BTW, I do think that CPUs still need to have great ability to execute computationally expensive instructions. There is enough math in the form of collision detection and game physics, among other things, to still need lots of processing power on the CPU.
-- Reserved Word.
Re:add core funcs libc/stdc++ to the CPU
by
Toraz+Chryx
·
· Score: 2
I'm fully aware that 3d accelerators do lots of maths..
but without high precision and some way of getting the data OFF the videocard, then it's utterly useless for scientific purposes.
Re:add core funcs libc/stdc++ to the CPU
by
ndecker
·
· Score: 1
> and if it were for something like a 20 hour 3d render, would it matter if the initial setup took a while?
Well, the context switch happens every time another process needs to use this facility. If the use of this FPGA was limited to just one process, it is IMHO better to have it as a custom extension board for the special application. Including it into the cpu would be of no use, because no standard application could expect to be able to use it. ( imagine xmms refuses to start, because your browser is using the fpga to do some ssl... )
Re:add core funcs libc/stdc++ to the CPU
by
Toraz+Chryx
·
· Score: 2
Well, surely you could prioritise it to prevent less performance critical applications getting their hands on it when something important needs it?
Ok, he realizes that the x86 architecture is flawed. One of the most limiting problems is the lack of general purpose registers (GPR), so he adds more complexity to an already over-complex solution to solve this problem. All I have to say to this is: when will you see that the solution is as simple as switching architecture!
As most code today is written in higher level languages (C/C++, Java, etc.) all it takes is a recompile and perhaps some patching and adaptations to small peculiarities. The Linux kernel is a proof of this concept, a highly complex piece of code portable to several platforms with a huge part of the code fully portable and shareable. This means that it is not hard to change architecture!
If the main competition and its money would move from the x86 to a RISC architecture (why not Alpha, MIPS, SPARC or PPC) I'm sure that the gap in performance per penny would go away pretty soon. RISCs have several advantages, but the biggest (IMHO) is the simplicity: no awkward rules (non-GP registers), no special case instructions, easy to pipeline, easy to understand and easy to optimize code for (since the instruction set is smaller).
And to return to the original article. Please do not introduce more complexity. What we need is simple, beautiful designs, those are the ones that one can make go *really* fast.
I don't think anyone would disagree with that, but that's not the issue. What he's saying is, given that we've got to stick with x86 for historical and commercial reasons, this would be a relatively quick and easy way to allow the compilers to produce *much* groovier code.
umm, an intel cpu pretty much beats the pants off anything else on the market. On the downside, it's pretty tough to stuff 134 p4's in a server the way you can with a sparc or a powerpc.
It would definitely be nice to get rid of the legacy cruft and move to a different architecture, however I doubt that this will happen until Intel and AMD start hitting major stumbling blocks. The inertia just seems too great. From what I hear (sorry I don't have a source, but I think I heard it in my Computer Architecture class), the cores of the current x86 chips are essentially RISC, and have a translation layer wrapped around them (converting x86 instructions into the internal RISC instructions).
-- I've got a mind like a steel trap - it's got an animal's foot stuck in it.
If we're going to stick to the x86 we still do not want to add complexity. I also tried to point out how easy it would be to move to a new architecture.
As you must add complexity I do not think that it would be "quick and easy". It takes huge resources in both time and equipment to verify the timing of a new chip, so these kinds of changes (fundamental changes to the way registers are accessed) are expensive and hard, since you also need to implement many new hardware solutions and verify the functionality (not only the timing!)
Re:RISC
by
Anonymous Coward
·
· Score: 0
If the main competition and its money would move from the x86 to a RISC architecture (why not Alpha, MIPS, SPARC or PPC) I'm sure that the gap in performance per penny would go away pretty soon.
You are right that the modern x86 implementations are RISCs with a translation layer around them (except Crusoe which is a VLIW with software translation - much cooler 8P ). Now just imagine if we could get direct access to those highly optimized RISC cores instead of having to code in x86 machine code.
RTFA or nicely put...read the article. By adding the instructions he reduced the complexity of shifts, the multiple ordered instructions it takes to do one thing, and increases the visibility of all the registers. There are added instructions, but the benefit is reduced complexity in assembly instructions due to greater direct accessibility of all the registers.
Switching architectures is not that trivial. You seem to think that every company has the source code available for every piece of software they run. That isn't true. You seem to think that programs can easily be ported between platforms if written in C/C++ - also untrue. You think that the bug fixes for compiling between platforms are "small peculiarities" -- well, they may be small, but that doesn't make them easy. In fact, it makes it fucking hard because the differences are so buried in libraries, case-specific, and undocumented that it's a nightmare to find them. Yes, I've done this kind of thing. It's godawful.
Changing architecture is difficult. This is not a closed vendor market - anyone can put together an x86 box and you have at least 3 different CPU vendors to choose from, 3 - 5 motherboard chipsets, and a virtually infinite variety of other hardware. If Dell computer suddenly decides to move to a PPC architecture what's going to happen? They're going to lose all their customers and fast. Because the very limited benefits of a different architecture do not make up for the costs of going to one.
Yes, I said limited benefits. Yeah, when I was in college taking CompE, EE, and CS courses on CPU and system design I also found the x86 ISA to be the most demonic thing this side of Hell. Well, I'm older and wiser now and while x86 isn't perfect, it's not that bad either. Its price/performance ratio is utterly insane and getting better yearly. Contrary to the RISC architecture doom and gloomers, x86 didn't die under its own backwards compatibility problems. It's actually grown far more than anyone expected and is now eating those same manufacturers for lunch.
You know, back in the early 90s when RISC was first starting to make noise the jibe was that Intel's x86 architecture was so bad because it couldn't ramp up in clock speeds. Intel was sitting at 66 MHz for their fastest chip while MIPS, Sparc, etc. were doing 300 MHz. Of course, now Intel has the MHz crown, with demonstrations exceeding 4 GHz, and the RISC crowd is saying that MHz isn't everything and they do more work/cycle than Intel (which is true, but the point remains).
All that said, go look at the SPEC CInt2000 and FP2000 results. Would you care to state what system has the highest integer performance? And whose chip has the highest floating point?
Oh, and let's not forget that I can buy roughly 50 server-class x86 systems for the price of one mid-level Sun/IBM/HP/etc. server.
Note - server performance isn't all about CPU, but since the OP wanted to make that argument, I just thought I'd point out how wrong he is. There is still quite a bit of need for high end servers with improved bus and memory architectures, but don't even try to argue that the CPU is more powerful. It isn't.
RISCs have several advantages, but the biggest (IMHO) is the simplicity: no awkward rules (non-GP registers), no special case instructions, easy to pipeline, easy to understand and easy to optimize code for (since the instruction set is smaller).
Not entirely true. RISC instruction sets can be quite huge too. And the whole idea of RISC is to take the complexity out of the hardware and put it into the compiler instead. It is easier to optimize for x86 than RISC.
I was talking about absolute performance of a single intel chip versus anything else on the market. Not performance per penny. Perhaps you should have read my post more closely, nowhere did I mention cost.
Both Intel Pentium III and IV and the AMD K6-2, and K7 (Athlon) are essentially RISC processors in the core. There's an outer layer that essentially translates from the x86 ISA to their internal micro architecture. Excepting for a few outdated commands that are virtually never used, which are implemented in microcode (and thus slow as hell comparatively).
There is no way to directly access the core ISA, nor do I know of it being documented anywhere. Intel planned to move the industry off the x86 ISA to Itanium, but so far that's utterly failed and with the Intergraph lawsuit it may be dead in the water now.
AMD's x86-64 still uses the x86 ISA, but extends it. Additionally if you talk to the chip in 64 bit mode then 8 (I think) additional GP registers are available in silicon - not just register renaming, which occurs already in every major CPU on the market today. The additional registers (all 64-bit wide) pretty much eliminate the need for an architecture move, at least as it relates to registers. Intel hasn't yet adopted x86-64 though (although they can since AMD must license to them because of IP agreements).
Still, what's funny is this desire for a performance increase... the x86 chips are the fastest CPUs on the market for integer performance and in the top 5 for floating point - although Alpha still reigns supreme for FP I believe. But compare the price of an x86 chip to pretty much anyone else and you start wondering exactly what the performance issue is.
The performance problems are not with the CPU anymore. The bus and memory interfaces are slow. They've been getting faster over the years, but closed vendor boxes like Sun, HP, IBM, etc. will always do better because they don't have to deal with getting a half dozen different major OEMs on board, along with countless peripheral manufacturers. Nor do they have to concern themselves overly with backwards compatibility.
When I started looking at the ARM chips I wondered why we ever used x86's etc.
RISC / CISC is really a misnomer.
RISC has plenty of instructions, and it's meant to be superscalar.
It starts with Register Gymnastics. Basically with RISC, there's no more of it. Every register is general. It can be data, or it can be an address. All the basic math functions can operate on any register.
With Intel x86, everything has its place.
Extend it further out. There's something called "Conditional Instructions". Properly utilized, these make for an ultra efficient code cache. The processor is able to dump the code cache instructions ahead of time. Which also means, not as much unnecessary "pipeline preparation" to perform an instruction.
Then there's THUMB which compresses instructions so that they take up less physical space in a 64, 128 bit world. There's lots of wasted bits in an (.exe) compiled for a 386
Last I checked, 32bit ARM THUMB processors are dirt freakin' cheap; they're manufactured by a consortium of a multitude of vendors, as opposed to just AMD and INTC.
The Internet is slowly wearing down the x86 as more and more processing is moving back on the server where big iron style RISC can churn through everything.
So? Just because you did leave cost out of your post doesn't mean you're not wrong. Intel probably spends several times as much on improving x86 as all RISC chip developers on their chips combined. If that money was redistributed, absolute performance of the RISC chips would also go up.
--
Lars T.
To the guy who modded me down from perfect to terrible Karma - Apple haters still suck
You only have to stick with crap 80ish code because the software is closed source and the company that built it has vanished in thin air...
If it was open source... all you'd need to do is recompile on a new platform and possibly make some tweaks for it to work there... BUT you could do it... if needed.
Cheers...
Re:RISC
by
Anonymous Coward
·
· Score: 0
Funny thing though: Who's the largest ARM manufacturer - Intel!:-)
Probably. But I bet their margins are nowhere near what they are for Pentiums.
Re:RISC
by
Anonymous Coward
·
· Score: 0
A friend of mine worked QA at intel. During testing phases, there is an instruction to switch to "core" mode and treat the program stream as RISC code instead of x86 asm.
He mentioned once that some chips were released with an "easter egg" where you could activate core mode. I don't know the details, though, and I don't think the core mode is documented enough to make use of it.
Re:RISC
by
Anonymous Coward
·
· Score: 0
The Internet is slowly wearing down the x86 as more and more processing is moving back on the server where big iron style RISC can churn through everything.
ROFLOL. The Internet has *increased* the number of x86 processors in use, both desktop and server.
You're happy with 8 extra GP registers?! The PPC has 32 and the Crusoe (IIRC) has 64. Think ahead people. Your statement could be restated as "16 registers should be enough for anyone". I want to see an ISA with 256 GP registers, like some VLIW ISAs have.
Disclaimer: I know nothing about the x86 architecture. I'm more a PPC guy. For the x86 I'm relying on what others have posted (dangerous, I know).
Why does x86 waste space? The instructions are variable-length, which, as I understand it, would result in minimum executable size.... A fixed-length instruction architecture, on the other hand, WOULD seem to waste space/bandwidth for instructions that could be shorter.
Absolutely. But I think that makes the point that I was hinting at for me: RISC chips are not inherently faster than a VLIW/RISC hybrid like the current P4s. After all, if RISC is such a big win you shouldn't have to spend as much money on it to extract the performance of a crufty design like a P4.
AMD probably doesn't have a larger budget than IBM/Motorola for PowerPC, and it beats the pants off them too.
RISC probably has some pretty significant advantages when developing low power chips, or chips that play well enough with others to support massive scalability, but I just don't see it for single chip performance.
Extend it further out. There's something called "Conditional Instructions". Properly utilized, these make for an ultra efficient code cache. The processor is able to dump the code cache instructions ahead of time. Which also means, not as much unnecessary "pipeline preparation" to perform an instruction.
The ARM is a cute little arch, the only problem is that EVERY instruction is conditional. At first this seems like it might make for some really nice optimizations. But it's a lot harder than you think (the instruction cache cannot just dump instructions, because it has to know what the current state of the processor is, this means that all instructions which affect the condition codes have to retire before decisions about which instructions can be executed are made). When I started thinking about how I would design an out-of-order (OoO) superscalar version it started to give me a real headache. Eventually I realized about the only way (I could think of, maybe there is a better way) would be to have some kind of in-order conditional retire stage near the end of the pipeline. This would allow the processor to run at decent speeds as long as the code was very careful to rarely use the conditional execution, since it would effectively serialize the instruction stream. The 'Always' execute instructions could retire out of order as long as there was enough distance between them and the condition changing/condition dependent instructions.
All this is to say, the ARM is a nice arch for low speed, low power devices. Really high speed versions might be pretty hard to get right. Intel's XScale is like this; everything considered, its IPC is pretty bad.
Re:RISC
by
Anonymous Coward
·
· Score: 1, Insightful
With 256 GP registers, you'd better hope your compiler's code optimizer is very smart about allocating, using, and saving registers on the stack. Pushing 256 registers to the stack on every function call is no longer a performance penalty... it's a performance bitchslap.
For hand-coded assembly though, there isn't much problem (You do already watch your registers and push only the ones you need to use, right?).
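To put rough numbers on the point above, here is a minimal sketch that counts the stack traffic for one function call under two policies: save every register, or save only the ones the function actually touches. The function name and the figure of 4 registers used per function are assumptions made up for illustration, not measurements.

```python
def save_restore_ops(total_regs, used_regs, save_all):
    """Return the number of stack memory operations for one call.

    Each saved register costs one push on entry and one pop on exit.
    """
    saved = total_regs if save_all else min(used_regs, total_regs)
    return 2 * saved


# Hypothetical workload: a small function that only touches 4 registers.
for total in (8, 32, 256):
    naive = save_restore_ops(total, used_regs=4, save_all=True)
    careful = save_restore_ops(total, used_regs=4, save_all=False)
    print(f"{total:3d} registers: save-all = {naive:3d} ops, save-used = {careful} ops")
```

The save-used cost stays flat no matter how big the register file gets, which is why the allocator (or the assembly programmer) matters more as the register count grows.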
Re:RISC
by
Anonymous Coward
·
· Score: 0
It starts with Register Gymnastics. Basically with RISC, there's no more of it. Every register is general. It can be data, or it can be an address. All the basic math functions can operate on any register.
With Intel x86, everything has its place.
Eh? With the 8 general purpose registers, EAX EBX ECX EDX ESI EDI EBP (maybe there's only 7, am I forgetting one?), you can do just about anything equally. IIRC, the only things that are specific to certain registers are multiplication, string operations, and some of the fancier indexing.
x86 is nothing in the way of "everything has its place"
Let's see...
Every register is general.
All the general purpose ones are. (segment registers, all the descriptor ones, and mmx/fp ones aren't, for various reasons) (most of them are the wrong size to be addresses anyway)
It can be data, or it can be an address.
Check.
All the basic math functions can operate on any register.
No, RISC isn't inherently faster than CISC (and no, the P4 isn't a VLIW/RISC hybrid, it's a CISC processor with micro-code).
And both Intel and AMD spend much more on (x86-) processor development than IBM and Motorola and Sun and all others on their chips.
And no, x86 is not much faster. Not even at SPEC, which does not tell the whole picture.
As for AMD being faster, they basically had a stroke of luck with the Athlon design. Before that AMD wasn't known for their speedy processors (cheap yes). And if it hadn't been for the Athlon, Intel's x86 also wouldn't be that far (or not so actually) ahead, the Itanium II would be the contender to the big RISCs, and the fastest Pentium 4 would be at 2 GHz (if that much) and would cost $1000.
--
Lars T.
To the guy who modded me down from perfect to terrible Karma - Apple haters still suck
I'm just curious, as I have heard this claimed for both types of processors. It seems like a processor with more instructions would be more optimizable because the compiler has more ways to describe what it wants the CPU to do; and the CPU has a better understanding what it needs to do, which allows it to optimize even further.
-- Please consider making an automatic monthly recurring donation to the EFF
Yep, this guy is reinventing the wheel, and has no clue how processors are designed. He only knows about assembly language, not the impact of his proposed changes on the micro-architecture [which is what determines the frequency and overall performance of your processor].
One more mode to artificially extend the register space will make the decoding of instructions more complex, the bypassing of data harder, and the dependencies computations for multiple instruction issue (a critical part of the processor) slower...
And you don't even gain much at all out of this! As you pointed out, any RISC architecture is better than that... but no-one has the fabs that Intel has, or the resources to put so many engineers to optimize the chip to such performance levels.
But don't get it wrong. If the Pentiums are the best processors around, it's in spite of the x86 instruction set.
Alain.
Re:RISC
by
Anonymous Coward
·
· Score: 0
There are only 8 extra registers because hardly anyone will ever use them. Any additional registers will break x86-32 compatibility and thus will be disabled on boot by most OSes.
See http://www.xbitlabs.com/cpu/hammer-preview/ -- The common real-world ISA of Hammer is just as stinky as x86 is.
I think you're confusing more optimizable and easier to optimize.
That's the deal with RISC and EPIC and all that crap, it has a really high theoretical maximum performance, if you can tweak everything right. If your compiler is not way smarter than, say, me, that potential is unrealized.
--
Trees can't go dancing
So do them a big favor
Pretend dancing stinks!
If you want lots of general purpose registers, take a look at Knuth's MMIX system. Unfortunately, it's not in silicon, but it's there, and it /could/ be done, if someone wanted to...
If it were open source... all you would need to do is recompile on the new platform and possibly make some tweaks for it to work... BUT you could do it... if needed.
People who make this statement have generally never had to port a nontrivial piece of software to a nontrivially different architecture.
Re:RISC
by
Anonymous Coward
·
· Score: 0
Now just imagine if we could get direct access to those highly optimized RISC cores instead of having to code in x86 machine code.
What makes you think you'd be any better off? The translation-to-RISC step isn't free, but it contributes to slower latency, not lower overall throughput.
As an EE Professor once said to me in a class in Microprocessor Organization, "If you need that many registers, you better clean up your code." Also, have you LOOKED at a classical RISC architecture, for instance, MIPS (from Patterson and Hennessy)? They say 32, but it's not really 32, it's more like 8 permanent and 8 temporary, and all the rest are normally reserved for other stuff. In truth, more registers just allows the compiler to be sloppy.
You have to remember that compilers (much like hardware synthesis code) are driven by a rule set, generally not half as complex as what you or I would place on deciding how to rearrange code to fit an algorithm. A prime example would be a correlation between how an autorouter routes a PCB board vs how a compiler compiles a high level program. The more layers you add (or registers in this case) the BETTER job it does.
My point is that, given that there just isn't going to be a radical change in architecture, making the x86 architecture more powerful will inevitably require added complexity because of the way it was initially designed and the complex way that it has grown.
Personally, I'd love to see the x86 arch die a long, slow, horrible death, somewhere a long way away from my development environment, but given the phenomenal market momentum, I just can't see that happening, so anything that can make the same source code more powerful has to be a good thing.
I'm not underestimating the development effort required by a change such as this. However, most of that effort will be expended by the chip designers working on the next revision anyway. This is a valid option for them to look at, given that they are going to be undergoing expensive and hard processes no matter what changes they make.
I also think you might be overestimating the difficulty of implementing this. True, it's a fundamental change to the way that the registers are accessed, but it's not a particularly complex one. It's a software-mediated lookup table into a generic register bank. Not that hard, relative to implementing, say, MMX or SSE.
The true "quick and easy" part comes when you look at the sum total of effort over all of the developers that this change would touch. For the vast majority of coders, there would be *absolutely no* code change needed to take advantage of this, other than a recompile. There's very little market force opposing it, other than compiler writers complaining...
And in order to take advantage of this, what does the average developer (who doesn't deal in assembly day-to-day) do? They use their funky new compiler. That was the conclusion that I came to after reading the fabled article, anyway...
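For anyone who wants to see what "a software-mediated lookup table into a generic register bank" might look like, here is a minimal sketch. It is only a toy model under stated assumptions: the class name, the bank size of 32, and the `remap` method are all hypothetical, and it says nothing about how the article would actually encode the mapping in instructions.

```python
class MappedRegisterFile:
    """Toy model: 8 logical x86 register names indexing into a larger bank."""

    LOGICAL = ("eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp")

    def __init__(self, bank_size=32):
        self.bank = [0] * bank_size
        # Identity mapping by default, so unaware code sees the usual 8 registers.
        self.map = {name: i for i, name in enumerate(self.LOGICAL)}

    def remap(self, name, physical_index):
        """Point a logical name at a different slot in the bank (hypothetical op)."""
        if not 0 <= physical_index < len(self.bank):
            raise IndexError("physical register index out of range")
        self.map[name] = physical_index

    def read(self, name):
        return self.bank[self.map[name]]

    def write(self, name, value):
        self.bank[self.map[name]] = value & 0xFFFFFFFF  # keep values 32 bits wide


regs = MappedRegisterFile()
regs.write("eax", 1)       # lands in bank slot 0
regs.remap("eax", 20)      # "eax" now names slot 20
regs.write("eax", 2)       # slot 0 is untouched
regs.remap("eax", 0)       # switch back
print(regs.read("eax"))    # the original value is still there
```

The point of the sketch is that old code which never remaps anything behaves exactly as before, while a new compiler can park values in the extra slots instead of spilling them to memory.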
If a RISC is a proper RISC there are probably no more than one or two ways to do an operation. Since all similar instructions take an equal amount of time there is no need to optimize. That is what I would call really easy to optimize, i.e. no need to!
Um, how is this anything new?
by
Andy+Dodd
·
· Score: 4, Informative
Linux kernel source - memcpy() anyone?
(On MMX machines, the wider 64-bit MMX registers are used for memcpy() rather than the 32-bit standard integer registers)
This has been in the kernel for a few years now and anything that uses memcpy() benefits from it. Move along now.
-- retrorocket.o not found, launch anyway?
Re:Um, how is this anything new?
by
Anonymous Coward
·
· Score: 1, Interesting
Yes, but can you do anything other than store/retrieve to the data? You would have to put your data in a different register to operate on it (non-mmx operate). This seems to be the big hangup.
Re:Um, how is this anything new?
by
RegularFry
·
· Score: 1
This is much more generalised, implemented in hardware, taken advantage of by an appropriate compiler, and ignored by an inappropriate one.
-- Reality is the ultimate Rorschach.
Another Hideous Hack for IA32
by
seanellis
·
· Score: 5, Informative
The scheme as proposed would work, but nothing will change the fact that it's another hideous hack to get around the non-orthogonal addressing modes in the original Intel 80x86 architecture.
Even the little microcontroller chips that I can buy for $2 have 32 general purpose registers (Atmel AVRs, for anyone who cares).
Worse, this scheme would not benefit existing code - it still requires code changes to work.
Finally, on the gripping hand, the Pentium III and 4 have a very similar register renaming scheme going on automatically in the hardware. The 8 "logical" registers are already mapped dynamically into a much larger physical register file. (From ExtremeTech: http://www.extremetech.com/article2/0,3973,471327,00.asp.)
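A minimal sketch of the renaming idea, for the curious. This is a toy, not how any particular Pentium does it: the class and method names are made up, and it never returns physical registers to the free list (real hardware recycles them when instructions retire).

```python
from collections import deque


class Renamer:
    """Toy rename stage: logical register names -> physical register numbers."""

    def __init__(self, logical_names, num_physical):
        if num_physical < len(logical_names):
            raise ValueError("need at least one physical register per logical one")
        self.free = deque(range(num_physical))
        # Every logical register starts out backed by some physical register.
        self.table = {name: self.free.popleft() for name in logical_names}

    def rename(self, dst, srcs):
        """Rename one instruction 'dst = op(srcs)'; return (phys_dst, phys_srcs)."""
        # Sources are looked up first, so they see the mappings of older writes.
        phys_srcs = tuple(self.table[s] for s in srcs)
        if not self.free:
            raise RuntimeError("no free physical register (real hardware would stall)")
        # Every write gets a fresh physical register.
        phys_dst = self.free.popleft()
        self.table[dst] = phys_dst
        return phys_dst, phys_srcs


renamer = Renamer(("eax", "ebx"), num_physical=8)
first = renamer.rename("eax", [])               # eax = load ...
second = renamer.rename("ebx", ["eax", "ebx"])  # ebx = eax + ebx
third = renamer.rename("eax", [])               # eax = load ... (unrelated value)
print(first, second, third)
```

The first and third instructions both write "eax" but end up in different physical registers, so the third one does not have to wait for the second to finish reading the old value. That is how 8 architectural names can keep a much larger physical file busy.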
Re:Another Hideous Hack for IA32
by
PetiePooo
·
· Score: 1
And this time, as a link to the article on register mapping...
Honestly. It's easy to add links in /. Please figure out how to do it when you have an appropriate reference.
Re:Another Hideous Hack for IA32
by
Anonymous Coward
·
· Score: 0
non-orthogonal addressing modes in the original Intel 80x86 architecture
"Orthogonal" seems to be the word of the month on Slashdot (I would use the Slashdot search engine to demonstrate, however it's a garbage search that is demonstrably useless), and I'm curious what definition you are using here: Please explain what you mean by "non-orthogonal". Thanks.
Re:Another Hideous Hack for IA32
by
seanellis
·
· Score: 1
Orthogonality in this case means that the registers are not distinguished from each other - any register can be used for any purpose. So, for a binary operation, if you know that
ADD rA, rB
adds rB to rA and stores the result in rA again, you would expect the multiply instruction to be something like this:
MUL rA, rB
If instead, it's something like this
MUL rB
...where one operand and the destination is hard coded, then you have to relearn the syntax for this new instruction. Alternatively, the syntax may be the same, but there may be new restrictions such as rB must not be the same as rA, or it can't be ESP, or some such.
It gets worse with things like
MOVSB
where *all* of the operands are implicit and you have to remember that DS goes with ESI and ES with EDI. (I hope I got that right - I would have to look it up, which is the point I'm trying to make.)
An orthogonal instruction set means that MOV instructions are comparatively rare - you don't have to ensure that certain operands are in the special purpose registers, you just use them wherever you are.
(To the other poster on this thread - sorry about the link. Just in a hurry this morning, I guess.)
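As a rough illustration of why a non-orthogonal set needs more MOVs, here is a small sketch that "lowers" the same multiply for two imaginary instruction sets: one where MUL takes any two registers, and one where an operand and the destination are hard-wired to a single register. The register names (rA, rB, r0) and both functions are hypothetical, and the model ignores that the fixed register's old contents get clobbered; it is not an exact description of x86.

```python
def lower_mul_orthogonal(dst, src):
    """dst = dst * src when MUL accepts any two registers."""
    return [f"MUL {dst}, {src}"]


def lower_mul_fixed(dst, src, acc="r0"):
    """dst = dst * src when MUL implicitly computes acc = acc * operand."""
    if dst == acc:
        return [f"MUL {src}"]          # operand is already in the right place
    if src == acc:
        # Multiplication commutes, so multiply by dst and move the result over.
        return [f"MUL {dst}", f"MOV {dst}, {acc}"]
    return [
        f"MOV {acc}, {dst}",           # bring one operand into the fixed register
        f"MUL {src}",                  # implicit: acc = acc * src
        f"MOV {dst}, {acc}",           # copy the result back where we wanted it
    ]


print(lower_mul_orthogonal("rA", "rB"))
print(lower_mul_fixed("rA", "rB"))
```

In the fixed version two of the three instructions are pure shuffling, which is the sense in which an orthogonal set makes MOVs comparatively rare.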
Re:Another Hideous Hack for IA32
by
jelle
·
· Score: 2
"The scheme as proposed would work"
I'm not so sure about that. He found a way to address more registers with minimal changes to the instruction set. That is only part of the problem. He doesn't analyze what is needed in the actual hardware, adding the registers and the read and write muxes to actually implement this functionality in gates. With register renaming that all these processors use, it's not so easily said how big this impact will be. Anyways, usually more registers, especially general purpose ones means more silicon area (routing and muxes around the register bits) plus increased critical path. Translation: a bigger chip with a lower clock speed.
The fact that microcontrollers have more gp registers doesn't mean anything, because they don't have to run at 2.8GHz, and often even need multiple clock cycles per instruction, so there is a lot of room to work with. At the current speeds of X86 CPUs, the hardware constraints cannot be compared with those of a microcontroller.
-- --- Hindsight is 20/20, but walking backwards is not the answer.
Re:Another Hideous Hack for IA32
by
jelle
·
· Score: 2
Replying to myself...
Actually thinking more about it, I think his proposal won't gain much performance anyway. Lack of gp registers results in using the stack as an overflow for local data. Because of the rate of accesses on the stack, I'm pretty sure the local variable part of the stack always resides in L1 cache. That means stack pushes and pops are pretty fast already, because they go to and from the cache. Maybe it would help a little to have a special register bank that mirrors the 'top of the stack' so that it can be read and written at register speed instead of L1 cache speed. That would be a 'solution' that doesn't require any changes to the instruction set, no recompiling, etc, and probably gives the same performance gain.
-- --- Hindsight is 20/20, but walking backwards is not the answer.
Mmmm, Assembler...
by
guidemaker
·
· Score: 5, Funny
I'm reminded of the days I used to code for the old Acorn Archimedes (don't look for it now, it's not there any more) and our apps were usually way faster than the competition's.
When asked why, we were tempted to tell them that we used the undocumented 'unleash' instruction to unleash the raw power of the ARM processor.
Re:Mmmm, Assembler...
by
Anonymous Coward
·
· Score: 0
Hi,
Can you drop an email at adrian dot cpcng dot de ?
I want to ask you something about the Acorn Archimedes OS.
Thanks! Adrian
The Problems of Obsolete design
by
Alien54
·
· Score: 5, Interesting
This is what I call the big problem. That design is utterly abominable. We live in a world where it's nothing to have 1 gigabyte of RAM in a computer. We have 80 GB hard drive platters now, allowing even greater-sized drives. And yet at the heart of every single one of your x86 computers out there, a mere 6 GP registers are doing nearly all of the processing. It's amazing. And it's something I've personally wrestled with every day of my assembly programming career.
This sort of reminds me of what happened with IRQs. Ultimately Intel "solved this" via the PCI bus, but performance has occasionally been problematic. Of course, that problem goes back to the original IBM design for the original IBM PC. Intel is also very aware, I imagine, of what happened when IBM tried a total redesign with the EISA bus, etc. It got rejected, I think, primarily because it was proprietary. In any case, enough companies have been nailed on backward compatibility issues that Intel may be nervous about making a total break.
The upside is being able to run old software on new hardware. You don't want to break too many things.
-- "It is a greater offense to steal men's labor, than their clothes"
Re:The Problems of Obsolete design
by
SN74S181
·
· Score: 1
IBM tried a total redesign with the Microchannel BUS. EISA was an 'extension' of the old ISA bus, and not proprietary.
Just a little pedantry this morning....
Re:The Problems of Obsolete design
by
gpinzone
·
· Score: 3, Insightful
Microchannel was the bus you are thinking about. It actually was very good, but wasn't backward compatible with ISA. EISA was the "rest of the industry's" response to provide a 32-bit bus that was backwards compatible. It wasn't a very good implementation since it was still locked at 8MHz.
Re:The Problems of Obsolete design
by
swfranklin
·
· Score: 1
Intel is also very aware, I imagine, of what happened when IBM tried a total redesign with the EISA bus
Valid point, but to pick a nit... EISA was a joint venture between several companies (Compaq, Tandy, etc., but not IBM). IBM's redesign was Micro Channel (MCA).
Re:The Problems of Obsolete design
by
Zathrus
·
· Score: 5, Informative
As others mentioned, MCA (MicroChannel Architecture) was IBM's abysmal attempt at recapturing the PC market. It died a horrible death, and deserved it. Frankly, the technology sucked only slightly less than the ISA/EISA bus it wanted to replace.
Anyone else remember the horrors of all those damn control files on floppies?
There are a lot of architectural nightmares in the PC design... and while some of them are at the CPU level (like the 6 GP registers), most of them are at the bus level. Who the hell puts the keyboard on the 2nd most important interrupt (IRQ1)? The entire bus is still borked, although PCI has mostly hidden that now. But the system and memory buses are the sole reason that IBM, HP, Sun, etc. have higher performance ratings than x86 -- the P4 and Athlon processors are faster in virtually every case on a CPU to CPU basis.
The bus and memory architecture is also why x86 does so incredibly bad in multi-CPU boxes. It's just not designed for it, the contention issues are hideous, and while you may only get 1.9x the performance going to a 2 CPU Sun box, you'll only get 1.7x on x86. It gets worse as you scale (note - those numbers are for reference only, I don't recall the exact relationships for dual CPU x86 boxes anymore, but the RISC systems handle it better due to bus design).
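To show what "it gets worse as you scale" could look like, here is a back-of-the-envelope sketch that takes the two reference figures above (1.9x and 1.7x on 2 CPUs) and extrapolates them with Amdahl's law. Using Amdahl's law here is purely an assumption for illustration: bus contention does not really behave like a fixed serial fraction, and the 1.9/1.7 figures were themselves given as reference numbers only.

```python
def parallel_efficiency(speedup, cpus):
    """Fraction of the ideal linear speedup actually achieved."""
    return speedup / cpus


def amdahl_parallel_fraction(speedup, cpus):
    """Solve S = 1 / ((1 - p) + p / n) for the parallel fraction p."""
    return (1 - 1 / speedup) / (1 - 1 / cpus)


def amdahl_speedup(p, cpus):
    """Speedup predicted by Amdahl's law for parallel fraction p on n CPUs."""
    return 1 / ((1 - p) + p / cpus)


for label, two_cpu_speedup in (("RISC box", 1.9), ("x86 box", 1.7)):
    p = amdahl_parallel_fraction(two_cpu_speedup, 2)
    eff = parallel_efficiency(two_cpu_speedup, 2)
    line = ", ".join(f"{n} CPUs: {amdahl_speedup(p, n):.2f}x" for n in (2, 4, 8))
    print(f"{label}: efficiency at 2 CPUs = {eff:.0%}; {line}")
```

Under this crude model a modest gap at 2 CPUs turns into a much larger one at 8, which is at least consistent with the claim in the post.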
Really there's nothing wrong with the x86 processors except to the CompE/EE/CS student. I was there once and couldn't stand it. Real life has shown that it isn't that bad, and recent times have shown that it's actually really damn good. Except for the buses. They suck. And while things like PCI-X and 3GIO are on the horizon, I don't see them seriously changing the core issues without causing massive compatibility problems.
Re:The Problems of Obsolete design
by
Alien54
·
· Score: 1
you're right. My Bad.
caffeine deficiency syndrome in full bloom
-- "It is a greater offense to steal men's labor, than their clothes"
Re:The Problems of Obsolete design
by
operagost
·
· Score: 2
Who the hell puts the keyboard on the 2nd most important interrupt (IRQ1)?
Someone who is designing for a platform designed for a single user, running a single program.
Re:The Problems of Obsolete design
by
Anonymous Coward
·
· Score: 0
Who the hell puts the keyboard on the 2nd most important interrupt (IRQ1)?
Someone who cares about the user! Much of how fast a system "feels" is related to how quickly it responds to a keypress. After all, the computer serves the user, not the other way around. Also, there's the old UNIX concept of servicing the slowest device first.
Re:The Problems of Obsolete design
by
ergo98
·
· Score: 1
The bus and memory architecture is also why x86 does so incredibly bad in multi-CPU boxes. It's just not designed for it, the contention issues are hideous, and while you may only get 1.9x the performance going to a 2 CPU Sun box, you'll only get 1.7x on x86. It gets worse as you scale (note - those numbers are for reference only, I don't recall the exact relationships for dual CPU x86 boxes anymore, but the RISC systems handle it better due to bus design).
However couple that with the fact that Intel machines are generally so space and price effective that instead of the classic SMP "Put lots of CPUs together", one can instead "put lots of systems together in a cluster", each machine which has full I/O bandwidth without the need to do arbitration on its own bus.
I remember the "next big thing" during the early and middle 90s was RISC - So will the next big thing be McISC (More Complex Instruction Set Chips)?
I wonder if the core of a McISC will be RISC, or CISC that has a RISC core.
--
try to make ends meet, you're a slave to money, then you die
Surely that should be Burger with a side order of McISCs - well a McISC is a chip.
--
try to make ends meet, you're a slave to money, then you die
good idea, but is it "minimal effort"?
by
pitc
·
· Score: 1
not having to move data stored in registers to memory may help performance, but I think that this would be more of a convenience factor than anything else.
The "minimal effort" claim deserves some scrutiny as well. Sure, Intel could do this, but wouldn't it mean that support for the new registers/instructions would have to be written into the compilers? Then everything recompiled on the new compilers?... that would be one huge "apt-get update"... hope there aren't any bugs...
-- aoeu
Does anyone else have flashbacks to
by
wiredog
·
· Score: 4, Interesting
segment:offset addressing? He's doing it with registers, but it seems the same sort of thing. One register is for segment, the other is the offset?
Well, not quite, but it has the same flavor.
After working in x86 assembly, I really appreciated high level and minimally complex languages like C.
Technical point of view
by
Lomby
·
· Score: 4, Interesting
The guy does not realize that what he proposed is not at all simple to implement in silicon.
These two additional mapping registers would complicate the pipeline hazard detection in an exponential way.
Another point is that I don't think that by doubling/tripling the number of registers available you will get a ten fold performance increase: a small increase could be expected, but not much.
Another problem is the SpecialCount counter: this would complicate the compilers too much. It would also make the instruction reordering almost impossible.
The guy does not realize that what he proposed is not at all simple to implement in silicon.
These two additional mapping registers would complicate the pipeline hazard detection in an exponential way.
It shouldn't. You'd just have to flag any modification of the map register as a hazard, and move the rest of the hazard detection after the mapping stage. It mostly just adds latency, not complexity.
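The two rules above can be sketched as a toy hazard check between two neighbouring instructions. Everything here is a simplifying assumption: the name MAP for the mapping register, the dict layout for an instruction, and the idea that both instructions see the same map.

```python
MAP_REG = "MAP"  # hypothetical name for the mapping register


def to_physical(regs, reg_map):
    """Translate logical register names to physical indices (the mapping stage)."""
    return {reg_map[r] for r in regs if r != MAP_REG}


def has_hazard(first, second, reg_map):
    """Return True if 'second' must wait for 'first'.

    Each instruction is a dict: {"reads": [...], "writes": [...]} of logical names.
    """
    # Rule 1: any modification of the map register is flagged as a hazard.
    if MAP_REG in first["writes"] or MAP_REG in second["writes"]:
        return True
    # Rule 2: ordinary hazard detection, done AFTER the mapping stage.
    w1, r1 = to_physical(first["writes"], reg_map), to_physical(first["reads"], reg_map)
    w2, r2 = to_physical(second["writes"], reg_map), to_physical(second["reads"], reg_map)
    return bool(w1 & r2) or bool(w1 & w2) or bool(r1 & w2)  # RAW, WAW, WAR


reg_map = {"eax": 0, "ebx": 1, "ecx": 2, "edx": 3, "esi": 12, "edi": 13}
i1 = {"reads": ["ebx"], "writes": ["eax"]}
i2 = {"reads": ["eax"], "writes": ["ecx"]}   # reads what i1 writes
i3 = {"reads": ["edx"], "writes": ["esi"]}   # touches nothing i1 uses
i_map = {"reads": [], "writes": [MAP_REG]}   # changes the mapping itself

print(has_hazard(i1, i2, reg_map), has_hazard(i1, i3, reg_map), has_hazard(i_map, i3, reg_map))
```

The comparison logic is the same set intersection a pipeline would do without the map; the only additions are one lookup per operand and one blanket rule for map writes, which fits the "latency, not complexity" argument.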
I suspect this would be a rather expensive chip
by
shimmin
·
· Score: 5, Interesting
While the base idea is interesting (add instructions that support using the multimedia registers as GP registers), I suspect that actually implementing the functionality of the GP registers in the multimedia ones could result in a prohibitively expensive CPU.
Anyone who's ever tried to use the MMX or XMM registers for non-multimedia applications knows what I'm talking about. The instruction sets for them are nicely tweaked to let you do "sloppy" parallel operations on large blocks of data, and not really suited for general computing. You can't move data into them the way you would like to. You can't perform the operations you would like to. You can't extract data from them the way you would like to. They were meant to be good at one thing, and they are.
I once tried to use the multimedia registers to speed up my implementation of a cryptographic hash function whose evaluation required more intermediate data than could nicely fit in GP registers, and had enough parallelism that I thought it might benefit from the multimedia instructions. No such luck. The effort involved in packing and unpacking the multimedia registers undid any gains in actually performing the computation faster -- and the computation itself wasn't that much faster. I was using an Athlon at the time, and AMD has so optimized the function of the GP registers and ALU that most common GP operations execute in a single clock if they don't have to access memory, while all the multimedia instructions (including the multiple move instructions to load the registers) require at least 3 clocks apiece.
Now this leads me to suspect that the multimedia registers have limited functionality and slow response for a single reason: economics. The lack of instructions useful for non-multimedia applications could be explained via history, but what chip manufacturer wouldn't want to boast of the superior speed of their multimedia instructions? And yet they remain slower than the GP part of the chip.
So I conclude that merely making a faster MMX/XMM processor is prohibitively expensive in today's market. And this proposal would definitely require that, even if actually adding the additional wiring to support the GP instructions for these registers was feasible. Because what would be the point of using these registers for GP instructions if they executed them slower than the same instructions actually executed on GP registers?
Re:I suspect this would be a rather expensive chip
by
Anonymous Coward
·
· Score: 0
So, basically, you're saying that AMD did a shit job of replicating the MMX instructions, and have their own scheme of optimized registers which work better.
'You are in a room filled with twisty passages...'
More registers are not enough.
by
gpinzone
·
· Score: 4, Informative
The whole gist of the article has to do with the x86's lack of general purpose registers. While this is true, you're not going to solve all of the x86 shortcomings simply by figuring out a way to add more of them. There are MANY things wrong with the x86 design; GP registers are just one of them. There's an entire section in the famous Patterson book that goes into all of the issues in much more detail than I care to state here.
Besides, there are already more efficient (albeit complex) solutions to extend registers that make much more sense in the current world of pipelined processors. Register renaming is one such example.
Revolutionizing??
by
Jugalator
·
· Score: 3, Interesting
It's interesting to hear "revolutionizing performance" in the same topic as instruction level fiddling. The only way to give truly "revolutionizing" performance is to do high level optimizations.
When you have your highly optimized C++ code or whatever, *then* you can get down to low-level and start polishing whatever routine/loop you have that's the bottleneck. The compilers of today also usually do a better job than humans at optimizing performance at this level and ordering the instructions in an optimized way. Especially if you consider the development time you'd need if doing it by hand. It's a myth that assembly code is generally faster if manually written -- many modern compilers are real optimizing beasts. :-)
Anyway, I think one should always keep in mind that C++ code will only gain the greatest benefit from well optimized C++ code, not from new assembly level instructions, regardless if they unlock SSE registers for more general purpose or whatever. Oh, yeah, more registers won't revolutionize performance either. If they did, Intel and AMD would already have fixed that problem. I'm sure they're more than capable of doing it... More registers increase the amount of logic quite dramatically and I'm pretty sure it doesn't give good enough performance gains for the increased die cost, compared to increasing L2 cache size, improving branch prediction, etc.
-- Beware: In C++, your friends can see your privates!
The only way to give truly "revolutionizing" performance is to do high level optimizations.
Since it is unlikely that general software engineers (those that write in tools, for example) are capable of those kinds of speedups on a consistent basis (due to task restraints, time constraints and what have you), it seems only logical to make the hardware running the application as fast as possible. After all, a lot of people spend a lot of time working with different compiler switches to find the optimum solution for whatever code they've written.
It seems to me that the purpose should be to make the machine as flexible for the programmer as possible, as fast as possible, and allow the rest of us to do the best we can writing software under deadlines.
It's a cute idea having a "stackspace" for your GPRs, but you could just move to an architecture with more GPRs and not have to design a brand new chip (I hate verilog).
Now if I could only get my compiler to stop moving items from gpr to gpr with a RLINM that has a rotate of 0 and an AND mask of all 0xFFFFFFFF's!
-- In the future, I would want to not be isolated from my friends in the Space Station.
Question about register aliasing
by
Tikiman
·
· Score: 2
From what I gathered in the article, it seems like he is proposing a scheme by which normally unused registers (MMX, etc) can be used as general purpose registers. To do this, he considers an aliasing system. My question is, why can't a x86 programmer today just use those MMX registers for more general purposes? I'm sure there's a good reason, I just can't figure it out from the article - thanks
Re:Question about register aliasing
by
MarcOiL
·
· Score: 1
Because there aren't instructions for it.
Let's put an example (completely imaginary, as I don't recall the x86 ISA exactly):
There is an instruction for adding two general purpose registers, and for adding two MMX registers, but not for adding a general to a MMX.
So to be able to use all existing registers as if they were GP you would need to include instructions for every possible combination. That would make all registers GP, wouldn't it? And it would be very difficult to implement in silicon.
-- If I have posted far, it is because I replied to giants.
Re:Switching Architectures
by
killmenow
·
· Score: 5, Insightful
As most code today is written in higher level languages (C/C++, Java, etc.) all it takes is a recompile and perhaps some patching...
But a lot of the code running today wasn't "written today" if you know what I mean. The problem is, in order to recompile you first need: a) the original source, and b) someone capable of patching, etc.
A lot of internal apps are in use for which the source code is lost. And a lot of code in use today (sadly) was not written in languages as portable as C, C++, and Java. A lot of apps in use today were written in Clipper and COBOL and a bunch of other languages that may not have decent compilers for other platforms. So recompiling it isn't an option. A complete re-write is necessary.
Even for situations in which application source *does* exist, and suitable compilers exist on other architectures, it is more often than not poorly documented...and the original author(s) is/are nowhere to be found. So in order to patch/fix the source to run on the new architecture, you not only need someone well versed in both the old and the new architectures, but someone who can read through what is often spaghetti code, understand it and make appropriate changes.
In a lot of these cases it's easier to stick with the current architecture. And that, to some degree, is why the x86 architecture has gotten as complex as it is.
Dammit
by
Anonymous Coward
·
· Score: 0
Dammit, I was just about to propose the same idea.
It's all in the cycles
by
Anonymous Coward
·
· Score: 0
The x86 arch simply sucks because it was not designed to be improved upon. It spends the majority of its time DECODING instructions, and then the rest of the time PIPELINING them. The more stages in the pipeline, the more CLOCK CYCLES it takes to execute the instruction. SHORTER and LESS pipelines make a better CPU. Not LONGER and more COMPLEX pipelines.
It takes something like 2 cycles on a Motorola CPU to do a MOVE instruction. It probably takes 10x more cycles to do the same on a Pentium.
less pipelines, wider internal busses, less complex instructions..we don't need more registers, we need faster, wider registers.
Re:It's all in the cycles
by
Anonymous Coward
·
· Score: 0
Goddamn, learn the difference between FEWER and LESS already you high school dropout!
Your post is more annoying to anybody that knows the difference than 50 grammar-Nazi posts would be to you.
Not gonna happen
by
Anonymous Coward
·
· Score: 1, Interesting
Intel's policy: "If it doesn't increase the clockspeed, it doesn't go in the chip." Performance is not an issue, only clockspeed.
It's the Chipset That Wouldn't Die!
by
Rayonic
·
· Score: 2
And we all love it for the same reason we love mutant superhuman zombies. :o)
> I want to form a company that makes a cpu that translates x86 instructions on the fly to RISC instructions that operate in parallel.
> I'll call my company transmeta!
Modern x86 processors, since about PPro and K6, are already RISC in their internals. One key difference to Transmeta's products is that the Crusoe does the translation in software. Therefore the hardware is simpler, and the translation engine can be easily upgraded.
-- Escher was the first MC and Giger invented the HR department.
making heavy use of this instruction set level register mapping would DOUBLE the instructions for a program!!
on chip, you could increase cache size and bandwidth to the prefetch, but yeah, that's more transistors (better spent elsewhere!).
but off chip, programs will take up TWICE AS MUCH code space with his remap instructions!! ridiculous!!
Re:INSTRUCTION BANDWIDTH
by
RickHGeek
·
· Score: 1
"making heavy use of this instruction set level register mapping would DOUBLE the instructions for a program!!"
I was wondering when someone was going to mention this. x86-64 implemented a similar capability with their REX prefix overrides in 64-bit long mode, and their increase was 10%. Of course, they threw out an existing set of duplicated instruction encodings to accomplish that.
Under my proposal, one of the 3 unused bits could be utilized to alter the base instruction set so that something similar to AMD's x86-64 implementation could be used with RMC. It would allow single-byte (or two-byte) opcode overrides to be employed which convey alternate register usage. My current design has the SUPRMC instruction as a 3 byte opcode (without changing any existing opcode sequence definitions). However, if the hardware designers saw a need to do something more like what AMD did, then it would probably work.
One other consideration that could be handled very elegantly by a compiler or a skilled assembly programmer would be the ability to utilize different registers for various functions. When it is known, for example, that a main portion of program code calls a specific function on a regular basis, then that function could be assigned specific registers. It would only require a MOVRMC instruction at the start and a MOVRMC instruction at the end, but the speedups provided (because the function wouldn't need to save/restore GP registers) would be very measurable. There is a lot of potential here. It would take a while to work it all out. But giving the x86 an additional 48 registers is something it desperately needs. Re-writing compilers and assemblers to accommodate whatever hardware model the engineers come up with seems to be a small task for the potential gains.
Thank you for responding. :) - Rick C. Hodgin, geek.com
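To make the "function gets its own registers" idea concrete, here is a rough toy model, not the article's actual design. The register counts (8 architectural names, 8 + 48 physical slots), the `remap()` helper standing in for a MOVRMC-style instruction, and the choice of physical slots 8 and 9 are all assumptions made up for illustration.

```python
# Toy model of register remapping. All names and sizes are hypothetical:
# 8 architectural names, 8 + 48 physical registers, and remap() standing
# in for a MOVRMC-style instruction.
ARCH_NAMES = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
NUM_PHYSICAL = 8 + 48


class RemappedRegisterFile:
    def __init__(self):
        self.physical = [0] * NUM_PHYSICAL
        self.reset_map()

    def reset_map(self):
        # Default map: each architectural name points at its own slot.
        self.reg_map = {name: i for i, name in enumerate(ARCH_NAMES)}

    def remap(self, name, physical_index):
        """Point an architectural name at a different physical register."""
        if not 0 <= physical_index < NUM_PHYSICAL:
            raise ValueError("no such physical register")
        self.reg_map[name] = physical_index

    def write(self, name, value):
        self.physical[self.reg_map[name]] = value & 0xFFFFFFFF

    def read(self, name):
        return self.physical[self.reg_map[name]]


def call_with_private_registers(regs, func):
    """Run func with eax/ecx remapped, so the caller's values survive
    without being pushed to and popped from memory."""
    regs.remap("eax", 8)
    regs.remap("ecx", 9)
    try:
        return func(regs)
    finally:
        regs.reset_map()


def clobbering_function(r):
    # Freely overwrites "eax" and "ecx"; it only touches its private slots.
    r.write("eax", 111)
    r.write("ecx", 222)
    return r.read("eax") + r.read("ecx")


regs = RemappedRegisterFile()
regs.write("eax", 1)
regs.write("ecx", 2)
result = call_with_private_registers(regs, clobbering_function)
print(result, regs.read("eax"), regs.read("ecx"))  # 333 1 2
```

The caller's eax and ecx come back untouched even though the callee overwrote both names, which is the save/restore traffic the remap is supposed to eliminate. The model says nothing about the decode-stage cost or code-size growth raised elsewhere in this thread.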
Don't go for it already. Intel has to pay $150M for patent infringement in their latest Itanium and Itanium 2 processors.
More than 3 answers !FREE!
by
purrpurrpussy
·
· Score: 4, Insightful
You are VERY confused.
1 - Zero Cost. 2 - Backwards Compatible. 3 - Orders of magnitude.
1 - You have to buy new chips - this will improve the speed of "computing" but it will not increase the speed of THIS computer I have right HERE.
2 - No old code has RM/RMC instructions in it and will NOT run any faster than it already does in a "standard" x86 mode. Yes it is backwards compat. but by the same token so is MMX, EMMX, 3DNOW!, SSE, SSEII, AA64 etc....
3 - Anyone who can sell me a program to "suddenly" make all my code go 10x or 100x faster is guaranteed to give me a good chuckle!!!!!!!
As for the article... well you've hugely increased the number of bits it takes to address a register and swapping the RM register is going to cause all sorts of new dependency chains inside the chip.
Personally.... I'd go for a stack machine. Easily the most efficient compute engine.
Now - if we could get back to point number 1 and point number 3. If YOU can make MY computer go 10 or 100 times FASTER with SOFTWARE I promise I WILL give YOU some MONEY.... ;-)
-- "None of this shit works" -W.Shatner
Re:More than 3 answers !FREE!
by
jafuser
·
· Score: 1
Personally.... I'd go for a stack machine. Easily the most efficient compute engine.
Don't forget to also do it in trinary (base-3) for that little extra kick =)
Has anything new happened on the base-3 computing front in the past year?
-- Please consider making an automatic monthly recurring donation to the EFF
Re:More than 3 answers !FREE!
by
purrpurrpussy
·
· Score: 1
Hang on a second mate!!!!
There is a world of difference between a stack machine and a ternary bit machine (is that a tit?!).
A stack machine can easily be built with digital (proven & cheap) technology. They are highly efficient. If you've ever watched the flow of data through algorithms and the inter-dependencies of that data the stack machine suddenly looks very tempting.
They also have simple address spaces. The 386 has numerous address spaces to cope with - registers, memory address space, IO address space. A stack machine (usually) only has a single address space - that of external memory.
You can code to the metal with something that looks and acts like a high level language.
A very immediate and usable computer system I'd say!
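For anyone who hasn't met one, a minimal sketch of what a stack machine does. The opcode names and tuple format here are invented for illustration; real stack machines differ in detail.

```python
# Minimal stack-machine interpreter. The opcode names ("push", "add",
# "mul", "dup") and the tuple encoding are made up for this sketch.
def run(program):
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "dup":
            stack.append(stack[-1])
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack


# (2 + 3) * 4 with no register names anywhere in the program.
program = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
print(run(program))  # [20]
```

Note that no instruction names a register: operands are implicit on the stack, which is why the encoding stays small and there is no register address space to manage.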
-- "None of this shit works" -W.Shatner
Re:More than 3 answers !FREE!
by
jafuser
·
· Score: 1
I'm not entirely familiar with stack machines, but they sound very interesting (I may look some info up over the weekend). But I wonder if more performance could be gained by having multiple stacks?
-- Please consider making an automatic monthly recurring donation to the EFF
Re:More than 3 answers !FREE!
by
purrpurrpussy
·
· Score: 1
CRAM: advances in microprocessor arch
by
RichMan
·
· Score: 3, Interesting
CRAM: search google for "CRAM computational RAM"
http://www.ee.ualberta.ca/~elliott/cram/ is your ultimate parallel compute machine. It turns your entire memory (all the CRAM anyways) into a register set. The concept is that, rather than bringing the data to the CPU for the computation, the CPU is brought to the memory.
Small computational units AND/OR/Adder are included on the bit access lines for all the memory cells.
Re:CRAM: advances in microprocessor arch
by
dunedan
·
· Score: 2
I've heard, unfortunately, that CRAM is going to be expensive and hot as all get out, dissipating something LIKE 25W. I think current SDRAM dissipates 1W.
So now I can have memory that costs 10 times as much and requires a heat sink and fan.
Don't get me wrong, I think it's just about the coolest stuff I've heard of in a long time, but I don't think it'll show up in my desktop anytime soon. I'll see how it does in things like the google search appliance and routers first.
Wow... Maybe I am more L33T than I thought I was?
by
Jack+William+Bell
·
· Score: 2
I actually understood that. And I haven't done assembly language programming since the old 8086. (Segment registers, *shudder*...)
What about Interrupt Handlers?
by
PetiePooo
·
· Score: 2, Interesting
I found the article intriguing, but during the entire verbose, self-important sounding read, I was wondering how ISRs would be handled. For example, if the RMC were set to revert to the default mapping in three ops, and an ISR interrupted after the first op, would it revert to the default mapping in the middle of the ISR?
Fortunately, that issue is addressed in his Message Parlor. The full text of his response to BritGeek follows:
Presently the registers are saved automatically by the processor in something called a Task State Segment (TSS) during a task switch. There are currently unused portions of TSS which could be utilized and (sic) for RM and RMC during a task switch.
The PUSHRMC and POPRMC instructions are available for explicit saves/restores of the RM and RMC registers in general code. I don't recommend it, however. The decoders would be physically stalled until the RM/RMC registers are re-populated. It would be better to use explicit MOVRMCs in general code.
- Rick C. Hodgin, geek.com
He may be onto something afterall...
Re:What about Interrupt Handlers?
by
Anonymous Coward
·
· Score: 0
He may be onto something afterall...
Yeah, and chances are it's an illegal something he's on!
Intel already does this.
by
Anonymous Coward
·
· Score: 0
Just they disable it at the FAB until it is more fully tested and ready in the next generation of chips. Sorry, you can't turn this feature on, but you are paying for it.
x86 Emulator?
by
Shadow2097
·
· Score: 2, Interesting
From the sounds of the article, he wants to make register mappings more logical than virtual. My knowledge of assembly level programming is pretty basic, but I do agree that adding more GP registers would probably increase performance measurably.
His second proposal, the RegisterMap field, strikes me as the incredibly complex part of this idea. He sounds like he's suggesting an idea that will turn x86 architecture into a simplified emulator by allowing you to logically map any register address to any physical address you choose. While there are probably some benefits to this, it sounds like the complexity of programming an already exceptionally complex chipset could go through the roof!
I read somewhere in a previous article (last year sometime, can't find a link) that the way most compilers treated x86 was already done with so many pseudo instructions as to basically be an emulator. Now this was before I had any knowledge of assembly level programming, so maybe someone with more knowledge could clarify this?
...And the best part is that I believe this is something that could be implemented in hardware in a manner which could be resolved and entirely applied during the instruction decode phase, thereby never passing the added assembly instructions any further down the instruction pipeline, and thereby not increasing the number of clock cycles required to process any instruction. I can provide technical details on how that would work to anyone interested. Please e-mail me if you are....
If this is really accomplishable without wasting *any* extra CPU time (that waste would apply to *all* instructions the CPU goes through!) this is indeed a good stunt that could add a substantial ooomph to x86 performance with the code we have today. Thank god, 'cuz my Athlon is too hot already and I'm kinda sceptical about watercooling. :-) Then again, that's a big "if".
-- We suffer more in our imagination than in reality. - Seneca
Great...
by
Elias+Israel
·
· Score: 2, Interesting
A segment architecture for memory wasn't nasty enough, now we want to have a segment register for the registers?
Thanks, no.
So this is a "register pointer"?
by
zerofoo
·
· Score: 3, Insightful
Great, in a time where we are removing god awful pointers from high level programming languages, we're putting them in the hardware.....uuugh.
Anyone ever write something with intensive pointer arithmetic in C++? It's enough to drive you mad.
Can you imagine peer code review: "No, that's not the instruction.....that's a pointer to the instruction."
Oh boy!
-ted
Re:So this is a "register pointer"?
by
entrigant
·
· Score: 1
Anyone ever write something with intensive pointer arithmetic in C++?
I do it for fun, and the hardware uses pointers exclusively. It always has. In order to locate a portion of memory you have to have information stored somewhere that gives the location of that memory.. aka a pointer. Oh and.. please let me know what languages these are that are removing pointers so that I may refrain from using such a limiting language.
Re:So this is a "register pointer"?
by
zerofoo
·
· Score: 2
Right, you are talking "memory pointers" bits of data that point to, or keep track of locations in memory (or a stack...like a stack pointer).
These new "pointers" for lack of a better term reference actual instructions....not data locations....it just adds another layer of complexity.
As far as newer languages...I was talking about Java. Most Java guys avoid using pointers.
-ted
Re:So this is a "register pointer"?
by
Anonymous Coward
·
· Score: 1, Insightful
For God's sake, learn some assembly. Learn what's going on under the hood. Pointers are used at the lowest levels to address memory, and always have been. Your compiled code will use pointers, even if you don't have any actual pointers in your C++ code. *Gasp!*
Re:So this is a "register pointer"?
by
mrm677
·
· Score: 2
There is a saying in computer science that any CS problem can be solved by adding another level of indirection.
There are "pointers" all over architectures. In fact, directory-based cache coherence protocols simply use an array of pointers to actual nodes.
Re:So this is a "register pointer"?
by
zerofoo
·
· Score: 2
You are absolutely correct.
But the point of a HLL (high level language) is to abstract those details to make software programming easier.
Of course, that doesn't apply if you program in assembler, so you are stuck with all the low level details.
I was just making the point that the software industry is trying to reduce pointer complexity in HLL's but hardware designers haven't tackled that yet.
-ted
Re:So this is a "register pointer"?
by
khuber
·
· Score: 1
Oh and.. please let me know what languages these are that are removing pointers so that I may refrain from using such a limiting language.
Pointers that can point to arbitrary memory are completely unnecessary (except for certain hardware I/O), dangerous for stability and correctness, and the cause of a lot of security problems. C and C++ let you arbitrarily cast pointers which is dangerous. To me casting is a symptom of a design flaw (Java references included).
Also, it's easier for a compiler to optimize bounded array accesses so there's no performance benefit using pointers.
-Kevin
Re:So this is a "register pointer"?
by
Cuthalion
·
· Score: 1
Most Java guys avoid using pointers.
Yes, they use references instead, which are identical, except:
They're called references instead of pointers
They can't point into the middle of arrays (no pointer arithmetic)
They are garbage collected.
--
Trees can't go dancing
So do them a big favor
Pretend dancing stinks!
Re:So this is a "register pointer"?
by
entrigant
·
· Score: 1
For me it's about control. It's the same reason I prefer to drive a stick instead of an automatic. The same reason I prefer to run linux over windows. The more control I have the more comfortable I feel. A manual transmission may be unnecessary and be a danger for stalls in the middle of intersections. That doesn't prevent me from using it.
I agree pointers seem to be the hardest thing for a lot of people to fully understand in programming. I've seen some crazy things done to pointers simply because people don't seem to understand them... to me they make perfect sense. I used to program primarily in assembly and in that EVERYTHING was a pointer.
Don't get me wrong, I don't go and define every variable I use as a pointer. I've just never made a program with a limitation based on a desire to not use pointers. The one example that comes to mind is DOOM's limit for segments and other map objects. I never would have thought of creating anything BUT a linked list for something like that. An array of structs just seems anti-intuitive.
Re:So this is a "register pointer"?
by
Inthewire
·
· Score: 1
Yeah, well, I drink a whole hell of a lot, so I prefer automatic transmissions. Sit down, strap in, turn the key and go. No fiddling about with the gear selector for me! I like losing control - it reminds me of my place in society.
So what he's suggesting is yet *another* band-aid on an already patched together architecture. This is no different than tacking 32-bit mode on top of a segmented 16-bit architecture, or the bizarre MMX/fp register sharing nonsense.
I think the main problem is that, in the current architecture, all data has to be brought into a small set of registers in the CPU to be processed.
I think it would be better if data were processed in the memory where it already sits, with the CPU only sending an instruction to the memory. This would be a faster approach to data processing. The CPU would mainly act as the master, and all memory would be the slave.
Re:Intelligent Memory
by
Anonymous Coward
·
· Score: 0
Whatever else you might say about crack-smoking, clearly it does lead to some, er, 'novel' ideas.
Cool
by
Anonymous Coward
·
· Score: 0
WOW! that was a cool article.
Can i go now ?
It won't be enough in the future
by
Anonymous Coward
·
· Score: 1, Informative
It would be a lot better to use a new chip with new instructions (well, a new PC architecture would be even better). The problem, as I see it, is that nobody wants to face the risk inherent in such a big step forward. But... what about a new instruction that would switch the micro to a new RISC instruction set, of course taking care of context switching between applications and which instruction set they use? It would keep backward compatibility with old applications and, as the OS gives enough time to every application, it wouldn't hurt instruction caching, jump prediction and so on too much. It could even have a mode in which only applications written in the new instruction set are allowed, so the system administrator could choose whether or not to keep backward compatibility. Of course, it's only an idea, and of course it would have a lot of flaws, but at first glance it seems very feasible to me.
Why should one do that?
by
mick29
·
· Score: 4, Informative
I do not like the changes proposed although x86 is awfully flawed (not enough GP registers, terribly overloaded instruction set {anyone ever used BCD commands? -- Yes, I hear the loud "We do" from the COBOL corner.}, you name it... ).
But this change would:
Make an internal interface explicitly controlled by the programmer/compiler, loading an enormous amount of work on the compiler creators. (Just have a look at IA64 - is there any good compiler out there already? I haven't had a look for a while.)
Destroy (or at least reduce the efficiency of) the internal register renaming unit, thus slowing down the out-of-order execution core and such (the entire core, actually...)
Sorry, but this man may have been busy programming x86 assembly his entire life (and for this he deserves my respect), but he is not up to date on how a modern x86 cpu works in its heart. When I heard the lectures in my university about how this stuff works, I gave up learning assembly -- one just doesn't need it anymore with the compilers around.
Reading the books by Hennessy and Patterson may help a lot.
What intellectual creation in this world doesn't have a fistful of lousy ties in the bottom drawer of the dresser? The existence of the BCD instruction, which is probably trapped in microcode by all modern implementations, is evidence of what exactly? If you squeezed out the vast majority of all the dubious instructions which remain in the formal x86 instruction set, I have serious doubts you would gain 5% on any significant metric (thermal loss, die size, clock frequency, etc.) The practical core of the x86 instruction set was firmly established by the 486. The majority of useful integer instructions on a 486 take exactly one clock cycle. In many programs 99% of all generated instructions come from this core group.
Complaining about crufty ties in the bottom drawer is a serious misdirection of mental resources.
The existence of the BCD instruction, which is probably trapped in microcode by all modern implementations, is evidence of what exactly?
It is the evidence of a terribly overloaded instruction set, as I said before. Nobody really uses it anymore, as you said. What I wanted to say was "IMHO, it's a bad idea to add yet more complexity."
Intel isn't interested in performance
by
zaqattack911
·
· Score: 3, Insightful
I hate to say it, but lately it's becoming more and more obvious that Intel is no longer really interested in performance. They'll squeeze a bit more out of an ancient architecture and add a few buzzwords like "SSE2", so they can slap on a hefty price-tag.
Look at the Pentium 4 design! Intel would much rather use a dated CPU with a nice pretty GHz rating than keep the same MHz and improve the architecture design.
Do you really think investors give a shit about registers?
--Marketing 101
Re:Intel isn't interested in performance
by
mrm677
·
· Score: 2
If they aren't interested in performance, then why do they achieve pretty damn good SPEC numbers??
Re:Intel isn't interested in performance
by
MajroMax
·
· Score: 2
I hate to say it, but lately it's becoming more and more obvious that Intel is no longer really interested in performance. They'll squeeze a bit more out of an ancient architecture and add a few buzzwords like "SSE2", so they can slap on a hefty price-tag.
Bah. SSE2 may be a marketing-ism (especially with the 'We make the Internet go Faster ' slogans), but the underlying technology is relatively neat.
Back in Ye Older Days, processors had a physical limit of one set of effective operands per instruction -- SISD, Single Instruction, Single Data. One could add two numbers together to get a third, but adding n sets of two numbers together would take n instructions.
Then came MMX (on the x86 -- other architectures have equivalents) -- this extended the x86 architecture by basically co-opting the low 64 bits of the FPU registers for SIMD, Single Instruction Multiple Data, instructions, on 8 bytes, 4 shorts, or 2 ints at the same time. A single PADDB instruction can now add 2 sets of 8 bytes at once, for example.
This was a Good Thing, but there is one obvious limitation -- it doesn't work for floats. Thus begat SSE, which adds 128-bit XMM registers to the processor to deal with SIMD floats in much the same way that MMX deals with ints. SSE also adds non-blocking writes to memory and other cache-control bits, but those aren't particularly important in this paragraph.
SSE2 came about when it was decreed that SSE would be extended to handle all datatypes. With SSE2, introduced in the P4, the XMM registers can handle basically all interesting datatypes (with the exception of BCD, which really should die). I'm not so sure about you, but I think that performing operations on 16 bytes at a time _may_ be a performance boost, no?
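To show what the SISD vs. SIMD distinction means in practice, here is a small software model of the 64-bit PADDB semantics (eight independent byte lanes, each wrapping around on overflow). It only models the result; it is plain Python, not actual SIMD, and the helper names are made up for this sketch.

```python
def scalar_add_bytes(a: bytes, b: bytes) -> bytes:
    """SISD style: one add per element, n adds for n elements."""
    out = bytearray(len(a))
    for i in range(len(a)):
        out[i] = (a[i] + b[i]) & 0xFF
    return bytes(out)


def paddb(a: bytes, b: bytes) -> bytes:
    """Model of MMX PADDB: all 8 byte lanes added in one 'instruction',
    each lane wrapping around independently (no carry between lanes)."""
    if len(a) != 8 or len(b) != 8:
        raise ValueError("PADDB on MMX registers operates on exactly 8 bytes")
    return bytes((x + y) & 0xFF for x, y in zip(a, b))


a = bytes([1, 2, 3, 4, 250, 251, 252, 253])
b = bytes([10] * 8)
result = paddb(a, b)
print(list(result))  # [11, 12, 13, 14, 4, 5, 6, 7]
```

The last four lanes wrap past 255 without disturbing their neighbours, which is the difference from simply adding two 64-bit integers. With SSE2 the same operation works on 16 lanes in an XMM register.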
In short, x86 has its architectural problems, but for the time being it's far more efficient to keep improving what we have rather than start a completely new architecture. In fact, that's what Intel tried with the Itanium, and we all know how successful that venture's been.
-- "Evil company X is threatening to restrict our rights! Let's all get together to stop--OOOH! SHINEY!!!" -- AC
Re:Intel isn't interested in performance
by
Anonymous Coward
·
· Score: 0
with the exception of BCD, which really should die
While we're at it, why don't we get rid of everything else that you don't properly understand. BCD is important for many things that are typically invisible to somebody that only works with software.
More trouble than its worth...
by
gillbates
·
· Score: 4, Insightful
The only potential downfall I see in this design is the possible pipeline stall seen when RM/RMC have to be populated from stack data. When that happens, no assembly instructions can be decoded until the POPRMC instruction completes and RM/RMC are loaded with the values from the stack.
Actually, this is just one of many potential downfalls. He forgot interrupts, mode switching (going from protected to real mode, as some OS's still do), and IO would all require that the proposed RM/RMC register be loaded from the stack. The net effect would be that if his scheme is implemented, existing programs would run slower, not faster. Furthermore, placing the RM/RMC register on the stack is an impossibility without breaking backward compatibility; many assembly language coders depend on a set number of bytes being added to the stack when they perform a call or interrupt.
Why not just add 24 GP registers to the existing processor? Honestly, it would be a lot simpler, and would not complicate the whole x86 mess, nor break backward compatibility.
I don't mean to flame, but this guy is way off base. The biggest problem with the x86 instruction set is lack of registers, and the second biggest problem is that its complexity is rapidly becoming unmanageable. Not even Intel recommends using assembly anymore - their recommendation is to write in C and let their compiler perform the optimizations. Adding more instructions like this would further diminish the viability of coding in assembly.
A far better solution would be to simply keep the existing instruction set intact, and add more GP registers. IBM got it right the first time - their mainframe processors have 16 general purpose registers which can be used for any operation - addressing, indexing, integer, and floating point calculations. If anything, Intel should stop adding instructions and start adding registers.
-- The society for a thought-free internet welcomes you.
Re:More trouble than its worth...
by
Oculus+Habent
·
· Score: 2
IBM got it right the first time - their mainframe processors have 16 general purpose registers which can be used for any operation
I'm no architecture expert, so I'll ask...
What complexities and performance problems would be introduced if you were to up the number of registers? Let's say you wanted a processor with 32 registers...
-- That what was all this school was for... to teach us how to solve our own problems. -- janeowit
Re:More trouble than its worth...
by
Forkenhoppen
·
· Score: 1
... Furthermore, placing the RM/RMC register on the stack is an impossibility without breaking backward compatibility; many assembly language coders depend on a set number of bytes being added to the stack when they perform a call or interrupt.
Why not just add 24 GP registers to the existing processor?
Aren't these two observations in conflict with each other? Last time I checked, PUSHA/POPA still needed to be called in such situations. Unless you're talking about implementing yet another PUSHA/POPA extension instruction to go along with the new registers..
So yes, your solution would be possible, but you would need to add this instruction with it too, as a bare minimum.
The reason that he's proposing the remap, as opposed to just adding registers, is so he can keep on only using certain registers for certain tasks. Adding new registers means having to rewrite the old instructions so that they can use the new registers. Which means adding new opcodes to the list to support these different versions of the basic opcodes. Which I would see as being an absolutely insanely large modification to the instruction set.
Btw, both your and his proposals suffer from the problem of opcode re-ordering taking up a lot more of the silicon. Currently, the chip only has to know whether two sources and one destination are going to be defined whenever choosing where to put the instruction in the pipeline. (iirc) If you now have the ability to target other sources/destinations, suddenly all those optimizations you made when you put into silicon that these registers are always source, and this one's always destination, are thrown out the window. (And you want this use-for-any-opcode ability for all 24 of the new registers you propose as well?)
Now keeping these things in mind, what's the point in starting on something like this today, when they have another chip ready to come out in a few months? It just doesn't make any sense, from a business point of view.
One last thing; consider the amount of wasted space in x86 with just the current instructions alone. Every time they've added something new, the number of bytes for one operation's gone up. You want them to add more? Okay, assume it's feasible, and that people don't mind having 8-byte opcodes. What will AMD and Intel say to the prospect of emulating these opcodes in their next gen processors? Assume, for the moment, that the 64 bit processors aren't already taped out/near being taped out; how much of a pain in the ass would retooling the new processors to emulate either of these proposals be?
The real, and only solution, is that these companies get their acts together, quit issuing refreshes of old hardware, and finally give us their next gen chips to play with. Proposing anything else is just pointless. (Unless, of course, the new CPUs completely flop..)
Btw, ianae (I am not an engineer), just a lowly comp sci student, so take this all with a grain of salt. : )
Re:More trouble than its worth...
by
gillbates
·
· Score: 3, Interesting
Oops. Forgot about PUSHA/POPA. Kind of strange, too, because I use these a lot.
Also, about the opcode problem - adding registers doesn't necessarily mean adding opcodes. For example, IBM mainframes have one opcode for a load register instruction, and the registers are specified in the instruction. Were IBM to double the number of registers, the opcode would not have to change (granted, the instruction would get longer because they only allocated enough space in the source and destination fields for specifying one of 16 registers.) The problem is with the way x86 opcodes work - they aren't as universal, that is, the opcode's first byte is a function of both the operation and the register used. So expansion would be pretty difficult, unless they expanded the instruction set to include two byte opcodes (which they've already done, iirc), and use general purpose opcodes for common operations such as loading and storing.
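A small sketch of the encoding difference described above, using two real encodings as examples: the one-byte x86 PUSH r32 form (opcode 0x50 plus the register number) and the System/360 RR-format LR (opcode 0x18 followed by a byte holding two 4-bit register fields). The helper function names are made up; many other x86 instructions put their registers in a separate ModRM byte, so this shows only the short forms the comment is talking about.

```python
# Standard x86 32-bit register numbering.
X86_REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def encode_x86_push(reg: str) -> bytes:
    """x86 PUSH r32: the 3-bit register number is folded into the opcode
    byte itself (0x50..0x57), so there is no room for a ninth register."""
    return bytes([0x50 + X86_REGS.index(reg)])


def encode_s360_lr(r1: int, r2: int) -> bytes:
    """System/360 LR (RR format): the opcode byte is always 0x18 and the
    two registers live in their own byte as 4-bit fields."""
    if not (0 <= r1 < 16 and 0 <= r2 < 16):
        raise ValueError("RR format has 4-bit register fields (0-15)")
    return bytes([0x18, (r1 << 4) | r2])


print(encode_x86_push("ebx").hex())   # 53
print(encode_s360_lr(3, 12).hex())    # 183c
```

In the first case the opcode changes with the register; in the second the opcode is constant and only the operand byte changes, so widening the register fields lengthens the instruction without touching the opcode map.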
It's unfortunate, but true.
The real, and only solution, is that these companies get their acts together, quit issuing refreshes of old hardware, and finally give us their next gen chips to play with. Proposing anything else is just pointless. (Unless, of course, the new CPUs completely flop..)
Couldn't agree with you more. What I would really like to see is an x86 processor that could handle IBM mainframe instructions. The IBM mainframe instruction set makes a lot more sense than Intel's instruction set - unlike Intel, IBM realized that someday they might be doing 64 bit and 128 bit computing, and designed the instruction set to be expandable. Also, they don't have a lot of "garbage" instructions - no MMX, no SSE, no SIMD junk to clutter up a good design. To be honest, benchmarks that I've run on real-world software indicate that today's x86 processors complete 4 instructions for every 5 clock cycles. Which indicates that branch prediction and deep pipelines aren't the performance enhancers that Intel and AMD seem to believe them to be. While they might work well in theory, real world performance speaks otherwise. Given this, I don't see any practical reason for keeping a kludgy instruction set around, because the complexity of the instruction set has been a great hindrance to the actual, rather than the theoretical, optimization of x86 processors.
-- The society for a thought-free internet welcomes you.
Re:More trouble than its worth...
by
gillbates
·
· Score: 2
Generally speaking, adding registers uses up more silicon on the die, as the microcode must now work with a larger number of registers. The real problem comes with register renaming and out of order execution - which take up a considerable amount of microcode logic. As the number of registers increases, I imagine (though I am not a computer engineer) that the amount of silicon used for optimization grows exponentially.
Translation: It's probably easier to optimize a processor with a smaller number of registers than one with many registers. However, the optimization that has been done to the x86 processors has yielded paltry results. Aside from pipelining (which has had the largest effect), most of the optimizations (register renaming, speculative execution, branch prediction) have had very little real world performance impact.
However, the biggest problem that modern processors face is in keeping the cache full. Since the memory bus works at about 1/5 the speed of the processor, any gains given by optimizing the processor core are lost by the relatively large amount of time that the processor spends waiting on the memory controller. Thus, if we had more registers, we could use them for variables, rather than main memory, and reduce the number of main memory accesses, allowing our processor to complete more instructions in any given amount of time. The reason why the mainframe processors work so well is that they have 16 general purpose registers, which can be used for anything - as opposed to PC's, where only one of the general purpose registers can be used for arithmetic, only three of which can be used for addressing, and only one for IO. Given these restrictions, it's very difficult to write a program in x86 assembly that uses registers for anything more than the most temporary of variables. Even though mainframe processors run at 1/3 the speed of PC's, they get about as much done because they don't have the main memory latencies that PC's do, and, they can use registers, rather than memory, for the most commonly used variables. It isn't very difficult for a mainframe programmer to write useful programs in assembly that use main memory for nothing more than file buffers, where to do the same thing in x86 assembly is next to impossible.
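As a back-of-the-envelope illustration of that argument, the sketch below uses a deliberately crude model: every instruction costs one cycle, there is no cache at all, and each instruction that touches memory stalls for 5 extra cycles (mirroring the "1/5 the speed" figure above). The memory fractions of 0.4 and 0.1 are invented for the example, not measured.

```python
def total_cycles(instructions: int, memory_fraction: float, stall_cycles: int = 5) -> float:
    """Cycles needed if every memory-touching instruction stalls for stall_cycles.

    Assumptions (not from any real CPU): one cycle per instruction otherwise,
    no cache, and a fixed stall per memory access.
    """
    return instructions * (1 + memory_fraction * stall_cycles)

# Hypothetical workloads: few registers means more variables live in memory.
few_registers = total_cycles(1_000_000, memory_fraction=0.4)
many_registers = total_cycles(1_000_000, memory_fraction=0.1)

print(few_registers, many_registers)
print(few_registers / many_registers)   # how much slower the register-starved version is
```

Under these assumptions the register-starved version takes roughly twice as long; a real cache hierarchy would shrink that gap considerably.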
-- The society for a thought-free internet welcomes you.
Re:More trouble than its worth...
by
RickHGeek
·
· Score: 2, Interesting
"Actually, this is just one of many potential downfalls."
I was referring to use of the POPRMC instruction in code. I wouldn't recommend it unless there are other reasons why there might be a delay before actual code is executed, such as the last thing done before a RETF.
"He forgot interrupts, mode switching....and IO would all require that the proposed RM/RMC register be loaded from the stack. The net effect would be that if his scheme is implemented, existing programs would run slower, not faster."
I didn't forget those aspects of coding. There are two distinct possibilities here which entirely resolve that dilemma, both handled in hardware. 1) Interrupts are handled in a special way: during interrupt processing all RM/RMC values are ignored and only the default 8 GP registers are used, or 2) Interrupts automatically push RM/RMC on the stack when signaled, and automatically pop them back off when IRETD is issued. These non-problems are resolvable.
Next, mode switching. Mode switching would make no difference. Again, the hardware state could simply persist through the mode switch as it is presently set up, meaning that SC will either count down and reset RM/RMC to default values/popped values when it hits zero, or it will be populated with 1111b and persist forever (until changed with MOVRMC again).
I've been told by probably 10 people so far that the P4 engine was designed with a 2 cycle latency L1 data cache, the purpose of which is to hide a lot of the latency required by not having a large GP register set. While this is, indeed, a great thing... it never approaches the speed of register to register transfers. If code could be written to utilize up to 56 GP registers instead of 8 (8 GP + 16 MMX + 32 XMMX) then a great deal of those 2-cycle latency hits would be removed, thereby speeding up code fairly significantly.
I've had a couple people that I respect contact me in email about this concept. They've asked me to write an emulator which demonstrates this process. I will be doing that in the coming weeks/months. I'm sure this topic will be dead by the time I get it completed, but it might help stir it up again. We'll see what it really does when the numbers are published. Take care!
- Rick C. Hodgin, geek.com
Re:More trouble than its worth...
by
PetiePooo
·
· Score: 1
More GP registers is fine, but without a method to access them, they're useless. The r/m operand within current instructions only has room for the existing 8 registers. Expanding that would break all x86 code. Read that as "you have a new and incompatible architecture."
Your heart's in the right place, tho'.
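For anyone who hasn't stared at the encoding, here is a small sketch of why the limit is exactly 8. It assumes the standard ModR/M layout (2-bit mod, 3-bit reg, 3-bit r/m); the function name `decode_modrm` is just for illustration.

```python
def decode_modrm(byte: int) -> dict:
    """Split a ModR/M byte into its mod (2-bit), reg (3-bit) and r/m (3-bit) fields."""
    return {
        "mod": (byte >> 6) & 0b11,
        "reg": (byte >> 3) & 0b111,
        "rm": byte & 0b111,
    }

# 0xD8 is the ModR/M byte of the two-byte instruction "01 D8" (ADD EAX, EBX).
fields = decode_modrm(0xD8)
print(fields)   # {'mod': 3, 'reg': 3, 'rm': 0}
```

With only three bits in each of the reg and r/m fields, there is no value left over to name a ninth register.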
Re:More trouble than its worth...
by
epine
·
· Score: 2
There are no end of tight loops out there where the x86 averages nearly three u-ops per clock cycle, the theoretic limit for Pentium III / Athlon cores. (I don't know the Pentium IV very well, it's too irregular and undocumented to bother studying.) The Athlon's u-ops are somewhat more powerful than the Pentium III u-ops which accounts for its superior peak performance. Counting instructions on x86 is pretty dumb. The internal u-ops are much closer to the conventional notion of a RISC instruction. Think of x86 as an ARM processor permanently stuck in Thumb decoding mode, supposing that Thumb has instructions which corresponded to one to four regular instructions (which are called u-ops in the x86 world). P6 u-ops are slightly less powerful than conventional RISC instructions (two u-ops are required for a single memory load). Athlon u-ops are roughly equal to conventional RISC instructions. They lack the three operand mode, but make up for it by handling read/modify/write as a unitary form. The P6 core rarely executes less than two u-ops per clock unless stalled by branch misprediction or memory latency. Of course, it's possible to write bad code or bad compilers. However, I would state categorically that execution rates less than two u-ops per clock have nothing to do with limitations of the x86 instruction set design or the P6/Athlon core implementations. Deep OOO architectures excel at squashing resource conflicts and pipeline bubbles.
When programmers try to be architects...
by
Chris+Burke
·
· Score: 5, Informative
Yes, he basically invented register renaming, but put it under explicit programmer control. It's a programmer's solution to what hardware has already done, and as was inevitable he doesn't see that he will do more harm than good.
Here's why his idea sucks:
1) Register renaming dependent on the RMC. You can't issue any instructions if there is a POPRMC in the machine until the POPRMC finishes execution. He calls it "a few cycles", but it's much worse than that. You've prevented any new instructions from entering the window until the stack access is done, preventing any work that -could- have been done in parallel from even being seen. Function call/return overhead is a big deal, and he just kicked it up a notch.
2) His whole problem #3 -- that you can't explicitly access the upper 16 bits of a 32-bit GPR. All I can say is -- thank God! Being a programmer, he probably doesn't realize that being able to address sub-registers is actually a big problem with x86. The whole sub-register-addressing problem causes all kinds of extra dependencies and merge operations. And he wants to make it worse? I think he should be slapped for this idea. x86-64 had the right idea -- you cannot access -just- the upper 32 bits of a GPR, and when you execute a 32-bit instruction that writes a GPR, the upper 32-bits are not preserved. Which is how the RISCy folks have been doing it all along, but hey.
3) This idea requires an extra clock cycle in the front-end, to do the translation from architected to the expanded architected register space, prior to being able to do the architected->physical register translation.
4) Because you still can't address more than 8 registers at a time, you'll be using lots of MOVRMC instructions in order to make the registers you need visible. Ignore how horrible this would make it for people writing assembly ("Okay, so now EAX means GPR 13?") or compilers, this is going to result in a lot of code bloat.
5) Because of 1) and 4), modern superscalar decoders are going to be shot. If you fetch a MOVRMC, followed by POP EAX and POP EBX, you can't decode the second two until -after- you've decoded the MOVRMC and written its values into the map.
Now all this is so that you can save on loads/stores to the stack. Which is great, but at least when those loads and stores are executing, independent instructions can still go. Every RMC-related stall is exactly that -- no following instruction can make progress.
Not that increasing the number of registers in x86 isn't a good idea -- it's just his implementation that sucks. With him being an x86 programmer, I'm surprised he didn't think of the most obvious solution -- just add a prefix byte to extend the size of the register identifiers in the ModR/M and SIB bytes. You get access to ALL GPRs at once (rather than a 8-register window), no extra stalls are required, and your code size only goes up by one byte for instructions that use the extra registers.
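A rough sketch of the prefix-byte idea follows. The bit layout is an assumption modeled on the REX prefix that x86-64 uses (bit 2 widens the ModR/M reg field, bit 0 widens the r/m field); the function `decode_regs` is hypothetical and ignores the SIB byte entirely.

```python
def decode_regs(modrm: int, prefix: int = 0) -> tuple:
    """Return (reg, rm) register numbers, widened by an optional prefix byte.

    Assumed prefix layout (modeled on x86-64's REX): bit 2 supplies a fourth
    bit for the reg field, bit 0 supplies a fourth bit for the r/m field.
    """
    reg = (modrm >> 3) & 0b111
    rm = modrm & 0b111
    reg |= ((prefix >> 2) & 1) << 3
    rm |= (prefix & 1) << 3
    return reg, rm

print(decode_regs(0xD8))         # (3, 0)  -> the classic 8-register encoding
print(decode_regs(0xD8, 0x45))   # (11, 8) -> same ModR/M byte, registers 8-15 reachable
```

Code that sticks to the original 8 registers needs no prefix and stays byte-for-byte identical, which is the whole appeal.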
I can't help but commend him on his idea being well-thought out. To the best of his knowledge, he tried to address all issues. But that's the problem -- he's a programmer, not a computer architect.
This proposal requires everyone to switch to new chips and new software. The new chips happen to run old software. That sounds like AMD's 64-bit chips to me. When you are doing an incompatible change you might as well get decent benefits out of it, instead of more complexity.
Besides, segmented registers. I am having severe troubles finding an example of a worse idea actually proposed.
It seems that this would require a recompile to have any benefit.
Soooo, if you are going to recompile anyway, why not target a processor with 128 64 bit GP registers, or whatever IA-64 has, instead of piling yet more cruft on top of x86?
I'm not even convinced that it would be easier to modify existing i386 compilers to take advantage of this "advancement" than to get equivalent performance out of an immature IA-64 compiler, with more room for improvement.
-Peter
modular chips
by
Anonymous Coward
·
· Score: 0
--please excuse dumb question here but I need to ask it of an architect so I pick you on this thread. Why can't cpu's be built modularly? In conjunction with bus speed restraints which I realise are very important, would it be possible or warranted to make chips dis-assembleable so that the various functions represented may be upgraded or customized as determined by usage without replacing the entire chip? The layers of cache and etc as well. Similar in theory to the various ways a stock mobo may be customized by what is used in the slots I guess I am asking.
Thanks this is simplistic to go along with my simplistic understanding of how they work. I know it's probably lamer as well but it's really the question I want to ask.
Re:modular chips
by
Zathrus
·
· Score: 5, Informative
Anytime you modularize you have to design interfaces. Interfaces are inherently slow - there's a physical disconnect which simply can't have as good of an electrical connection, they're bulky (consider that while a Pentium IV chip package is 35 mm on a side (1225 mm^2), the actual chip is only 131 mm^2 - the size is needed primarily for all the pinouts from the chip), and they're noisy.
Consider that while you can buy a P4 that runs at 2.8 GHz internally (and the fast ALUs run at 5.6 GHz, although they're only 16-bits wide), the memory bus is a lackluster 133 MHz (which you get an effective 533 MHz from because it's quad pumped - you read 4 values every clock instead of just 1). The I/O bus also runs at 133 MHz. These are the only two external buses the CPU deals with.
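The arithmetic behind those figures is easy to check. The sketch below uses only the numbers quoted above, plus one assumption: that the nominal "133 MHz" bus clock is really 133.33 MHz, which is what makes the quad-pumped rate come out at 533 rather than 532.

```python
# Package versus die area, using the figures quoted above.
package_side_mm = 35
package_area = package_side_mm ** 2          # mm^2
die_area = 131                               # mm^2
print(package_area, round(package_area / die_area, 1))   # 1225 9.4

# Quad-pumped bus: four transfers per bus clock.
bus_clock_mhz = 133.33                       # assumed exact value of the nominal 133 MHz
transfers_per_clock = 4
effective_mhz = bus_clock_mhz * transfers_per_clock
print(round(effective_mhz))                  # 533

# How many core clocks pass per bus clock on a 2.8 GHz part.
core_clock_mhz = 2800
print(round(core_clock_mhz / bus_clock_mhz)) # 21
```

So the package is over nine times the area of the die, and the core ticks about 21 times for every tick of the external bus.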
If you were to try and segment the CPU similarly you'd quickly hit limitations. You simply can't run a multi-GHz electrical signal over a physical disconnect, at least not with current technology.
All of that said, if you look at how CPU cores are laid out the cache is distinctly segmented from the ALU, the ALU is segmented from the FPU, and so forth. It makes chip design easier since if you want to make a change to one part of the chip you minimize effects on other parts. It also helps for signal routing and noise prevention.
Also you can do more or less what you're asking - just not at high speeds. Modern chips are often preliminarily tested using gate arrays that can be reprogrammed quickly and easily... but instead of running at 3 GHz this test chip runs at 2 MHz. Maybe.
Oh... a final bit... back in the days of the 386 and 486 the 2nd level cache was actually on the motherboard, and different MB vendors would put different amounts of cache. Some even had it socketed or solderable so you could add more if you wanted! But by the time the P2 came out clock speeds were too high for this. The connection latency and distance were simply too high. So we wound up with the slot processors, where a CPU slot card had the CPU core and 1-4 second level caches on it. Pretty soon both Intel and AMD integrated the 2nd level cache onto the CPU itself (which wasn't previously possible because it would have made the chips far too big), which further improved speed. The next generation of CPUs are requiring 3rd level cache on the motherboards. How long before that gets integrated onto the CPU?
Now that that is out of the way...would it be possible to implement this idea using microcode and therefore would it be possible to patch existing cpus that support downloading new microcode? (e.g. PIII?)
-- Lump lingered last in line for brains, and the ones she got were sorta rotten and insane.
LOL! This is what happens when a software guy tries to wear a hardware guy's hat! As if an array of pointers is "revolutionary".
He doesn't even address his own concern -- speeding up legacy x86 code. Everyone writing performance assembly code uses SSE/MMX. Critical path code is hardly ever written in legacy x86. In fact, most compilers are smart enough to do the conversion for you (MSVC, ipp) even without intrinsics.
What does he suggest? Offer extra instructions! Hello. Does this guy actually write any code ever? It doesn't sound like it.
Re:software vs. hardware
by
RickHGeek
·
· Score: 1
Everyone writing performance assembly code uses SSE/MMX. Critical path code is hardly ever written in legacy x86.
Not all code is suited for SSE or MMX. If the FPU is used at all then MMX is pretty much out the window, leaving only SSE/SSE2. And, while there are some speedups XMMX register use would provide, there are still a large number of programming situations which would not benefit at all. Also, critical path code that's not multimedia based would be wise not to use SSE/MMX. Why? It takes significantly longer to execute the prolog/epilog FXSAVE/FXRSTOR than it does to execute a PUSHAD/POPAD.
- Rick C. Hodgin, geek.com
Re:software vs. hardware
by
Sebastopol
·
· Score: 2
... large number of programming situations which would not benefit at all... Also, critical path code that's not multimedia based would be wise not to use SSE/MMX.... significantly longer to execute the prolog/epilog...
My point is -- What FPU code is there that isn't mission critical and couldn't benefit from conversion to SSE? And if it's not critical path, then why did the author suggest overhauling the architecture for a performance boost on legacy code?
If it needs a recompile, what's the point?
by
Christopher+Thomas
·
· Score: 3, Interesting
The part that confuses me is that, since code would need to be recompiled to make use of this, you might as well just compile for x86-64 and make use of a larger flat register space. While the idea is interesting, there doesn't seem to be any advantage to using it (and a few disadvantages, pointed out by other posters).
The cost of research, development and testing of such a feature is very high due to the increased complexity of the solution, as many posters above have already mentioned.
The solution is to move to a clean RISC design and recompile all existing binary code to the RISC environment.
It would be far cheaper to make a re-compiler (from x86 to RISC for example) than to introduce such a complexity to x86 chips with very little benefit.
Oh yes, the cost and complexity of recompiling all existing binary code for the x86 has no complexity at all. Rather than having two companies with thousands of highly trained design engineers work out the kinks they are paid to master, let's get the whole world involved in a massive change-over to honour a false god which hasn't yet produced compelling practical evidence of its innate superiority.
The reason why this proposal won't be taken seriously is because it does expose extra complexity to the world at large (need for new compilers, optimization modes, validation, etc.) Complexities that can be handled behind the scenes are tackled aggressively no matter how great the complexity.
But if we are going to recompile the entire existing x86 code base, why don't we add a simple extension to the compiler to eliminate all buffer overflows made possible by sloppy programming? Surely that can't greatly complicate this marvellous proposal. In the next iteration of recompile world, how about we design a compiler that identifies and corrects bad software design and program architecture? No, let's just settle for making all of the x86 binaries 40% larger for no real benefit.
That sounds interesting. I really don't know much about chip design, but I wonder how efficient a CPU with several stacked registers could be, if the code was designed to work with that.
To prevent stack overflows, a logic system could move the highest parts of the stack into cache (which gets moved to memory).
I imagine registers are scarce because each added register increases other logic component complexity by an exponential amount, but if there were several stacks backing each register, you can't access the middle of the stack, so there would be no extra logic required.
Anyway, I know absolutely nothing about this stuff, so I'm probably making quite an ass of myself, but I see there are quite a few knowledgeable people here on this topic, so I wonder if anyone could comment on how practical this is? =)
-- Please consider making an automatic monthly recurring donation to the EFF
Re:reg stack?
by
Anonymous Coward
·
· Score: 0
You never heard of the load and store multiple opcodes on powerpc (lmw/stmw), have you?
The problem comes in when the CPU does a context switch to a different thread/process or handles an interrupt during the called function. The context switch is responsible for saving all the registers currently in use onto the thread's stack (in main memory). So this extra CPU-internal stack of registers suddenly needs to be saved too... and you've just doubled the size of the stack push.
The other problem is more serious: most programs go many, many function calls deep. If each function pushes its own registers onto a register stack, the stack will be proportional to the execution depth. You've seen Java error messages - those are easily 10-20 calls deep. OSes can go much, much deeper... and I'm not even getting into recursion, which can go thousands or more levels. Trying to keep all these registers on a special register stack would mean you have a 5K or so stack register (and the architects are already fighting over a 20K L1 cache!), AND as mentioned above, this special stack register has to be flushed to main memory every context switch.
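To put rough numbers on that, here is a minimal sketch. It assumes each call saves all 8 x86 general purpose registers at 4 bytes apiece; both defaults are assumptions chosen for illustration, and the function name is made up.

```python
def register_stack_bytes(depth: int, regs_per_call: int = 8, reg_bytes: int = 4) -> int:
    """On-chip storage needed if every active call keeps its registers on a register stack."""
    return depth * regs_per_call * reg_bytes

for depth in (20, 160, 1000):
    print(depth, register_stack_bytes(depth))
# 20 640
# 160 5120
# 1000 32000
```

Under these assumptions a 5K register stack is used up at 160 calls deep, and a recursion only a thousand levels deep already needs more than the 20K L1 cache mentioned above.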
Although, I wouldn't be surprised if this idea were actually implemented on much older (i.e. pre-386) type processors that were pretty much single-threaded anyway. It's not a bad idea, it's just that it's too out-of-date for a modern OS.
This already exists on SPARC. It's called register windows. It makes writing compilers/assembly a real bitch. Chipgeek needs to do his homework.
As several posters have already mentioned, Intel gets around the lack of registers problem by using register renaming. There are actually 128 general purpose registers in the P4. Which ones you're writing to is controlled by the processor.
Re:Wow... Maybe I am more L33T than I thought I wa
by
operagost
·
· Score: 2
I haven't done any since the 6502.
You whippersnappers have it easy! Eight whole GP registers? The 6502 had three: A, X, and Y - and we LIKED it! It was a big improvement, why just a few years back I had to use the capacitance of my own body parts for registers. And that bloody hurt, what with the CPU drawing 35Kw and all! You kids are pansies!
X86 sucks anyway. I can't wait until it dies. Ancient pile of crap architecture.
Re:RISC (NO NO NO NO NO)!!
by
Anonymous Coward
·
· Score: 0
Switching to a RISC architecture is not the answer. I think this guy is wrong, but you are also.
Except for tight inner loop programming, the biggest problem with modern algorithmic programming is not maximizing code to prevent pipeline bubbles, nor is it making the instruction set absurdly simple to make life easy for compiler writers and hardware designers, but falling out of the instruction cache!
We need MORE complicated instructions, not fewer as RISC advocates would have it. I agree that lots of GPRs is nice, but spending more than 4k of your 8k instruction cache on adds/loads/stores when doing something with a buffered data stream is crazy foolish. Auto-incrementing arithmetic indexing instructions would help this greatly.
Use the parts of RISC that are good, but throw away the parts that aren't. We spend tons of time talking about cache (and memory) bandwidth being the bottleneck, then solve it by making us push MORE (albeit simpler) instructions through it? I don't think so.
Which is made by Intel. Note I said "whose" and didn't specifically target x86.
Not that the P4 2.8GHz is very far behind the I2.
Re:Umm...
by
Anonymous Coward
·
· Score: 0
So just so long as Intel constructs a chip then the x86 has legs?
The next two behind I2 are Alpha and Power4. Alpha has effectively been jogging in place for the last couple of years and Power was lost in the fog mid-life in its development.
RISC had an advantage in the early years of being smaller and simpler so they could move to the bleeding edge processes earlier. Intel took a hammer to moving complicated designs to the bleeding edge processes just as early if not earlier (it became a dollars game as opposed to a design game). With fab costs paralleling Moore's law in an upward trend, most of the RISC vendors lose because they can't play. [ IBM could play for a long time if they don't do anything stupid. ]
They also lose because they are designing chips for different kinds of working sets and I/O systems than Intel is doing for the x86.
There were three critical factors missing from the RISC/CISC analysis. One, Intel took the good research going on and incorporated it into their designs just as quickly (if not quicker). Two, competition breeds a fierce competitor. Most RISC designs didn't have multiple competitors inside the same arch. AMD/NexGen/etc. lit a fire under Intel as much as, if not more than, their RISC competitors. IA-64 is the RISC killer, not IA-32. I expect IA-64 to increasingly separate itself from IA-32 as it matures (in floating point benchmarks). Finally, with humongous transistor budgets you can throw thousands of transistors at dynamically mutating screwy x86 opcodes into something you can run fast and still have plenty of transistors left over to make things go fast. [ I'm not sure the minimalist RISC mindset knows what to do with 10 million transistors. :-) ]
An intelligent comment on the subject
by
Cerlyn
·
· Score: 4, Interesting
I can speak with some authority on this subject since I am presently taking a course on code optimization. What it looks like Mr. Hodgin is trying to do is work around the issue where people do not compile programs with processor specific optimizations. He seems to be proposing doing so by allowing "paging" per se of registers amongst themselves, although in a bit of an odd fashion.
Personally, I am not too fond of this approach. First of all, operating systems will need to be written to support this paging. Secondly, running a single MMX and/or SSE enabled application (which would use most if not all of the mapped registers), would cause all the other applications on the system to suddenly lose any benefit that paging would provide.
The approach I would take (which may or may not be better) would be to change the software. Compilers like gcc 3.2 already know how to generate code with MMX and SSE instructions. Patches are available for Linux 2.4 that add in gcc 3.2's new targets (-march=athlon-xp, etc.) to the Linux kernel configuration system. Libraries for *any* operating system compiled towards a processor or family of processors likely would fare better than generics.
And yes, gcc 3.2 can do register mapping in a similar fashion (to ensure that all registers are used) on its own. If you read gcc's manual page, you will note that this makes debugging harder though. Gcc even has an *experimental* mode where it will use the x87 and SSE floating point registers simultaneously.
Mr. Hodgin's approach might be a bit better for inter-process paging by a task scheduler for low numbers of tasks. But as a beginner in this field, I'm not sure what else it would be good for.
Please pardon the omissions; I am not presently using a gcc 3.2 machine:)
Re:An intelligent comment on the subject
by
RickHGeek
·
· Score: 2, Informative
"Personally, I am not too fond of this approach. First of all, operating systems will need to be written to support this paging."
This is an incorrect assumption. Existing operating systems would run entirely unaffected. RM/RMC support would be implemented in hardware. The data would be stored in the TSS during a task switch and the existing mechanisms used for storing MMX/FPU and SSE/SSE2 register space (either doing it explicitly with FXSAVE or deferring it by later trapping a fault when an attempt to read/write is encountered) would still be used.
Nothing would need to be changed to that end.
"Secondly, running a single MMX and/or SSE enabled application (which would use most if not all of the mapped registers), would cause all the other applications on the system to suddenly lose any benefit that paging would provide."
Absolutely not. Each task has its own TSS right now. Each task context saves everything and context restores everything before/following a task switch. All systems would run as they do today. In fact, no additional operating system support would be required (since the necessary saving/restoring of RM/RMC in the TSS would be handled entirely by the processor). It would be an invisible add-on that only software utilizing it would see.
- Rick C. Hodgin, geek.com
Re:An intelligent comment on the subject
by
Cerlyn
·
· Score: 3, Interesting
I thought of a context switch (or possibly a function call) too. Correct me if I am wrong, but what you are trying to do is to create a bunch of registers (my understanding being they will just be the existing x86+MMX+SSE unnamed), and "map" them via another register that certain software knows how to access, correct? That way, when an application knows about these, it can "squirrel" data away in "hidden" registers for fast access later?
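That reading of the proposal can be sketched as a toy model. Everything here is an assumption for illustration: the class `MappedRegisterFile`, the `remap` method standing in for a MOVRMC-style instruction, and the flat file of 56 equal-width registers (the real MMX and SSE registers are wider than the GP ones).

```python
class MappedRegisterFile:
    """Toy model: 8 architectural names indirected into a larger physical file."""

    NAMES = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

    def __init__(self, physical_count: int = 56):
        self.physical = [0] * physical_count
        # Default map: each name points at the physical register of the same index.
        self.mapping = list(range(len(self.NAMES)))

    def remap(self, name: str, physical_index: int) -> None:
        """Hypothetical stand-in for a MOVRMC-style remapping instruction."""
        if not 0 <= physical_index < len(self.physical):
            raise IndexError("no such physical register")
        self.mapping[self.NAMES.index(name)] = physical_index

    def write(self, name: str, value: int) -> None:
        self.physical[self.mapping[self.NAMES.index(name)]] = value

    def read(self, name: str) -> int:
        return self.physical[self.mapping[self.NAMES.index(name)]]


regs = MappedRegisterFile()
regs.write("eax", 111)     # lands in physical register 0
regs.remap("eax", 13)      # "so now EAX means GPR 13"
regs.write("eax", 222)     # squirreled away in physical register 13
regs.remap("eax", 0)       # map EAX back to its default slot
print(regs.read("eax"))    # 111
print(regs.physical[13])   # 222
```

Note that every read and write goes through the `mapping` lookup first, which is the extra step the hardware would have to perform.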
The primary problem I have with this "switching" of registers is that registers are supposed to be the fastest, most reliable memory components in a computer. By forcing a lookup table and its associated logic into the mix, you potentially are significantly reducing a processor's speed and/or scalability. Furthermore, the amount of data that can be hidden away inside of a processor is limited. While hiding registers is nice, perhaps it would be better to have the ability to "latch" a row of data so it won't be cleared out of the L1 cache (no processor can do this at the moment?). I would think that this would be much easier to implement without speed degradation, as it would only require a few additional gates used during lookup/overwriting of the L1 cache (which ideally, for this case, is at least set-associative (i.e. any memory "block" can map to at least two locations in the cache)).
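The "latch a row" idea can be sketched with a toy 2-way set-associative cache that carries one lock bit per line. This is an illustration under assumed parameters (4 sets, LRU replacement, block numbers instead of real addresses), not a description of any actual L1 design.

```python
class TwoWayCache:
    """Toy 2-way set-associative cache with per-line lock bits (illustrative only)."""

    def __init__(self, num_sets: int = 4):
        self.num_sets = num_sets
        # Each set holds up to two [tag, locked] entries, least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, block: int, lock: bool = False) -> bool:
        """Touch a memory block; return True on a hit, False on a miss."""
        ways = self.sets[block % self.num_sets]
        tag = block // self.num_sets
        for entry in ways:
            if entry[0] == tag:
                entry[1] = entry[1] or lock
                ways.remove(entry)
                ways.append(entry)      # mark as most recently used
                return True
        if len(ways) == 2:
            # Evict the least recently used line that is not locked.
            victims = [e for e in ways if not e[1]]
            if not victims:
                return False            # both ways locked: bypass the cache
            ways.remove(victims[0])
        ways.append([tag, lock])
        return False


cache = TwoWayCache()
cache.access(0, lock=True)   # install block 0 and latch it
cache.access(4)              # blocks 4, 8 and 12 all compete for the same set
cache.access(8)
cache.access(12)
print(cache.access(0))       # True  -> the latched line survived
print(cache.access(4))       # False -> the unlocked line was evicted
```

The only extra work compared with a plain LRU cache is the lock-bit check when choosing a victim, which matches the "few additional gates" intuition above.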
Secondly, your proposal (as I understand it) would require all the registers to share the same area on a chip. Nowadays, the MMU, Arithmetic/Logic unit, etc., each have their own area on the chip. Shared/swapped registers would have to be in the center of the chip, with longer lines to each partial unit (yielding delays and capacitance). I believe you proposed doing this by subunits though; this would reduce delays somewhat, but you are still requiring some centralization, and adding a significant delay in.
My personal position on this still kind of stands; if a program's compiler knows how to make use of the MMX & SSE functions of a computer, it should be set up to do so. That way, after an initial context switch for the entire program, the program (being correctly configured for a processor) flies.
A compiler with register renaming functionality ("gcc3.2 -frename-registers", for example), can help do this for apps where the programmer does not know assembler. And if your "minimum requirements" mention a Pentium II 500, don't compile for a 486!
In short, I fail to see how your proposal will speed up most applications significantly. Context-switches are always expensive, but the ability to change contexts in 10 clocks versus 30 really isn't significant when your backside bus is less than 50% of the processor's speed.
Obviously, being a minor player, I have my views, and I have to respect yours (especially since I only had about 5-10 minutes to read your piece), but personally, I really do not see why program accessible context switching inside a processor is needed.
Re:An intelligent comment on the subject
by
RickHGeek
·
· Score: 2, Informative
By forcing a lookup table and its associated logic into the mix, you potentially are significantly reducing a processor's speed and/or scalability.
The added logic would primarily exist in the decode phase. Provided the decoders could be pumped with enough data to overcome the increase in code size such a model could potentially introduce, it would not be a problem. The internal logic units would have to be modified to deal with that kind of reference.
I posted a reply to the ChipGeek blurb on this subject (www.chipgeek.com) where I describe the type of engine required to execute this RM/RMC model. I visualize it like a round waterfall viewed from above. In the pool area leading up to the waterfall, all of the required processing takes place to prepare the data to be sent to the logic units. Data is pulled from the correct location in register space (a very simple process). It is resized to the appropriate operand during the pull. It is tagged with an indicator that will instruct a rapid-process retirement unit to write the contents back to register space (following execution).
One thing that many people seem to be confusing is the concept of internal register renaming with what I'm doing. While it is arguable that what I've essentially done is introduce programmer-assigned register renaming, there is a distinct component to that renaming that most people seem to overlook completely (I've seen a few responders that nailed it). That is the fact that I, as the assembly programmer, or the compiler would be able to determine which registers propagate in which locations throughout the program. We have access to knowledge that a statistical runtime execution model does not. The x86 architecture provides almost no methods of conveying known-at-compile-time information to the processor (except through the overall code design following required rules dictated by the processor architecture), so it has to use statistical algorithms to rely on appropriate register renaming.
My proposal would allow that decision to be made by the programmer. After all, Intel's current modus operandi with IA-64 seems to be "let the compiler or assembly programmer dictate everything". They are no longer interested in employing all of the OOO execution models that the P6 core has provided. That's why Itanium performs so poorly on x86 code. It has a P5 engine which doesn't employ any of those hardware speedups. The same code executed in x86 mode on an Itanium, then recompiled in IA-64 mode will run much faster after the recompile. Why? Because rather than executing the instructions one after another, the compiler has positioned the code in a manner which conveys as much parallelism as possible. The compiler made those decisions, not the CPU, and the performance benefits are there (see Itanium 2 numbers on a recent Ace's Hardware article: http://www.aceshardware.com/#60000436).
What I propose would require a modest redesign of the hardware. It would require a minor extension to the instruction set. I can visualize about 40 different ways to implement the broad-strokes I painted with my feature (I didn't specifically name or assign opcode sequences, there are 3 unused bits in RMC which could be utilized to help in some way, etc.). There are several ways of arriving at the same final result in hardware. In my opinion it's up to people to explore the possibilities rather than criticize the idea. Personally, I like what AMD did with the x86-64 and the REX override prefixes. In 64-bit long mode they threw out redundant one-byte opcode instructions that were duplicated with other multi-byte opcode sequences and utilized them as a series of overrides which provide additional information regarding each instruction, and did so with a single byte.
If that method were employed then the code size increase would be minimal. The only design points left to hit are how to redesign the core so the registers are in a central-access location rather than remote locations of the chip. I'm not saying it wouldn't be difficult. But, it would only have to be designed once and all software written from that point forward would have the potential of benefiting from it.
- Rick C. Hodgin, geek.com
Real data using x86 emulator
by
Nynaeve
·
· Score: 2, Informative
What happened to backing up flames and claims with real data? The author of this article would be well advised to implement his ideas using an x86 emulator and at least do some preliminary testing. Processor-level features such as out-of-order execution and register-renaming may not be handled by an emulator, but it would be an informative investigation nonetheless.
For the tuned assembly loops I have written (multimedia or otherwise), I have gotten the same loop timing from L1 cache as from the registers. Essentially L1 is a big bunch of GP registers already.
Re:Switching Architectures
by
Flamerule
·
· Score: 1
A lot of internal apps are in use for which the source code is lost.
Lowly college CS student me asks... how is it possible to run a program internally without source? If the source is "lost", that implies it was written internally... so what happens if it crashes? Just restart it? There's no vendor to go to for help... sounds messed up to me.
What this looks like to me is that a new "MM-enabled" chip would be able to run existing x86 code fine, but run "MM-optimized" code much faster. One of the main problems of x86, tying certain operations to certain registers, can only be worked around with a re-compile into this "MM-optimized" code.
If you're going to redesign the chip... then re-compile the code... why not just DROP X86?!?!?
If you're going to redesign the chip... then re-compile the code... why not just DROP X86?!?!?
Because recompiling code on modified compiler/chip is still easier than re-working all existing x86 code for a different architecture. How long would it take you to recompile "Office"? Probably a significant amount of time unless you happen to have access to a supercomputer. Now, how long would it take you to port "Office" to a Power4? Yeah, that's what I figured too. Better to recompile...
Jim Smith and Guri Sohi have a pretty good overview of how superscalar processors work. You can pull a cached version off of citeseer at:
http://citeseer.nj.nec.com/35243.html
If you want to get a better feel for the complexity (at the transistor-and-wire level), you could try:
http://citeseer.nj.nec.com/palacharla98complexityeffective.html
This paper is pretty technical, but you don't really need to understand all of the equations to get the gist of it. They're also a few years old now, but still relevant. If you understand how the circuits are organized, and that complex circuits and long wires are expensive (in terms of slowing down the clock cycle), then you can get a decent feel for how complex the proposed register virtualization might be to implement in hardware.
Other posters have commented on how clock speed is not the bottleneck, but it's actually the caches, buses, memory bandwidth, memory latency, etc. This really depends on the application and what you're doing. BUT, just because X isn't the bottleneck, it doesn't mean you have free license to tinker (i.e. slow down) X by a huge amount. Adding a slight bit of extra delay to X can easily make X into the new bottleneck. Removing bottlenecks is really difficult because as soon as you remove one, five other things are now the bottleneck, and you can't get any (or much) further improvement until you remove *all* of them. (The team only goes as fast as the slowest member, a chain is only as strong as its weakest link, yadda yadda yadda...)
And then there's a myriad of other issues such as design complexity, added complexity in test and verification of the chip (more complexity = more time = slower time-to-market), and although slower clock speeds don't necessarily mean less performance, clock speed still has great marketing value. From a technical perspective, I think AMD's model-number scheme makes more sense since it removes some of the impact of the clock-speed = performance misconception, but from a marketing perspective, I think Intel's got it right in making chips with insane clock speeds (3GHz = 333 *pico*second clock cycle, and that's not even counting the ALUs that run at twice the nominal frequency - that just *sounds* impressive, which is the type of stuff that helps to sell these things to the less clueful consumer or manager).
On an unrelated note, the "Opteron" has to be one of the better chip names so far. It just sounds like it could be the name of some bad-ass decepticon. ("AMD announces its new flagship processor Megatron.")
This proposal is a kludge on a kludge on a bad idea.
We need to dump the x86, and go to an architecture with many GP registers (say 2K or so) and a flat address space. The Alpha or the PowerPC are closer to the ideal. This proposal is just another ugly hack which tries to get around the fundamentally stupid limitations of the x86 architecture, and makes it still more confusing and harder to use. Just when I thought that segmented memory was the ultimate in futile stupidity, we get the Registermap and Registermapcontrol registers. Just say no to Intel!
The x86 is the ugly mess that it is because it has tried to maintain backwards compatibility with the 8080. Each step of the way, that backwards compatibility could be justified, sort of. But when you must justify, you're wrong, and Intel's CPUs are a prime example of crap outselling technically superior product.
I've programmed 8080's, and I can tell you it left a lot to be desired, in comparison to its contemporaries. Everyone says that assembly is hard, but they're talking about Intel assembly when they say that. Vax assembly was a breeze, in comparison. The 6502 wasn't bad, compared to the Intel and Zilog chips.
Since PPC, Motorola chips have had plenty of registers. I'm not sure on specifics, I haven't written assembly since the 68k Mac days, but I think there are on the order of 32 GP registers and 32 FP registers.
This is a significant advantage, as minor operations that need to loop and keep track of a few things need not touch RAM at all and this keeps things extremely fast, as anyone tech savvy should be able to tell you.
I'm not quite sure about the other recommendations ChipGeek makes, plus I only really skimmed the article, but an increase in GP and FP registers on the x86 platform is nothing short of a Good Thing, and the lack of them is one of the reasons I have always shied away from the platform. At bare minimum, this should be a high priority (if not first priority) when deciding the future of x86 design. Remember, it is true that what keeps PPC equal to (well, lately PPC has fallen behind a bit:-\ ) x86 in terms of real world performance is that PPC need not hit RAM as much.
Providing more registers will create a remarkable boost in speed for any programs that take advantage of it, and I'm sure if it was going to happen all the major x86 OSes would jump at the opportunity. After all, a faster OS makes for more speed overall.
So, let's conclude:
Speed.
Convenience.
(Profit!)
-- CAn'T CompreHend SARcaSm?
The register map idea is similar to TMS9900 regs
by
Hansele
·
· Score: 1
I remember the venerable TMS9900 (TI's 16 bit CPU family from the late 1970's) implemented register mapping.
Basically you'd load a 16 bit value into a "workspace register". This was a pointer to a 32 byte block of memory which the CPU would treat as 16 16-bit GP registers.
This made context switching VERY fast, especially if the memory area were part of the 256 bytes of onboard memory (available on some members of the 99xx family such as the TMS9995). Fast context switching was pretty important in the days of 3MHz processors:)
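For anyone who hasn't seen the workspace pointer idea, here is a rough Python sketch of it. This is a toy model, not TMS9900 code: the class name, the 256-byte RAM size and the method names are made up for illustration. The only parts taken from the description above are that the 16 registers are 16-bit words living in a 32-byte block of memory, and that one pointer selects which block is the current register set.

```python
class WorkspaceCPU:
    """Toy model of a CPU whose 16 registers live in RAM (hypothetical names)."""

    def __init__(self, ram_size=256):
        self.ram = bytearray(ram_size)
        self.wp = 0  # workspace pointer: address of R0 in RAM

    def _addr(self, reg):
        # Each register is one 16-bit word, so R(n) sits at WP + 2*n.
        if not 0 <= reg < 16:
            raise ValueError("register must be R0..R15")
        return self.wp + 2 * reg

    def read(self, reg):
        a = self._addr(reg)
        return int.from_bytes(self.ram[a:a + 2], "big")

    def write(self, reg, value):
        a = self._addr(reg)
        self.ram[a:a + 2] = (value & 0xFFFF).to_bytes(2, "big")

    def switch_context(self, new_wp):
        """Swap the whole register set by changing one pointer."""
        old, self.wp = self.wp, new_wp
        return old


cpu = WorkspaceCPU()
cpu.write(3, 0x1234)           # R3 in the first workspace
old = cpu.switch_context(32)   # second workspace starts 32 bytes further on
cpu.write(3, 0xBEEF)           # same register name, different memory
cpu.switch_context(old)
print(hex(cpu.read(3)))        # the first workspace's R3 is untouched
```

Nothing gets copied on the switch, which is the whole point. A real context switch also has to save the program counter and status somewhere, which this sketch leaves out.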
These CPUs had one of the easiest to program assembler languages I've ever seen, right up there with the DEC VAX assembler (they're very similar languages really).
Guess I'm showing my age:)
Re:Another Hideous Hack for IA32-Virtual naming.
by
Anonymous Coward
·
· Score: 0
Actually WHY are we naming registers explicitly in our assembly? It seems that a lot more flexibility would be had by letting the hardware worry about all this. Compilers & high-level languages already hide the fact that there's shared-resource going on in the background. Tying operations to the idea of an explicit location is inflexible. Adding registers only takes us from say r1-r5 to r1-r10, and is still inflexible because of the explicit naming. Better if one could say things like MOV 'orange' to 'tangerine', or ADD 'PI' to '1/3', without worrying explicitly about what goes where. This gives the low-level programmer the same benefits that his higher-level cousins enjoy. This gives the chip-designer more capabilities when it comes to making changes behind the scenes without breaking things up front. Of course constrained-resource is still going to be in effect, however how many different variables are in a manageable segment of code?
Re:Intelligent Memory-complexity.
by
Anonymous Coward
·
· Score: 0
Turning the processor inside-out would help, but the problem is as much embodied within a very large code base as it is within the processor. Let's say you have code that only understands that there are 5 registers r1-r5. This fact is explicitly encoded in the code. You'd have to change a lot of code. Code that, as someone pointed out, many people don't have the source to. We however could move SIMD to memory more easily.
Many years ago I learned assembler, first on a Z80, then on a 6502. When I learned the power of zero page addressing, I thought: yes, way to go. I left my computing hobby behind to become an international truck driver, and that is what I did for about 10 years. Seven years ago, events occurred that led me to take up my old hobby again. I tried to learn the x86 and gave up after a while, thinking what was the point in banging my head against such a mess.
What they should have done is kept the 6502 architecture and scaled it up. The architecture of the 6502 was wonderful. Sixteen bit address bus, eight bit data bus, same as the Z80. The clever bit with the 6502 was zero page addressing, which basically gave 256 registers as well as the three GP registers. The idea being the CPU could access the bottom page in memory with just eight bits in the address field, and zero page could be used as index registers. I can't rightly remember all the operations that could be performed on zero page as opposed to the A, X and Y registers, but I remember it leading to good tight code. The same architecture in a 32 bit address space, ah the dreams
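To show what the eight-bit address field buys, here is a small Python sketch that picks the shorter encoding for a 6502 load. The function name is made up; the two opcodes are the 6502's LDA zero page (0xA5) and LDA absolute (0xAD), and the sketch ignores every other addressing mode.

```python
LDA_ZERO_PAGE = 0xA5  # LDA with a one-byte zero page address
LDA_ABSOLUTE = 0xAD   # LDA with a two-byte absolute address


def encode_lda(address):
    """Return the shortest encoding of 'load accumulator from address'."""
    if not 0 <= address <= 0xFFFF:
        raise ValueError("6502 addresses are 16 bits")
    if address <= 0xFF:
        # Bottom page of memory: the address fits in a single byte.
        return bytes([LDA_ZERO_PAGE, address])
    # Anywhere else: low byte first, then high byte.
    return bytes([LDA_ABSOLUTE, address & 0xFF, address >> 8])


print(encode_lda(0x0042).hex())  # two bytes
print(encode_lda(0x1234).hex())  # three bytes
```

One byte saved per access adds up fast in a tight loop, and on the real chip the zero page form is also a cycle quicker.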
-- It's called an elephant's trunk whereas it is in fact, an elephant's nose, a nose by any other name would smell as sweet
I was there. The 6502 was hell on wheels. Scaling up a processor design which doesn't have a single GP register long enough to hold a memory address? Drugs man. I will say, however, that I quite liked the 6809. It was kind of fun to program the 6502, but when you look at code generation issues it was a complete disaster. The whole point is that the design of the 6502 can't scale up. There was a sweet spot for writing moderately complex video games by hand, but compilers aren't interested in sweet spots. Well, I knew a compiler that was. If you put too many parens in an expression, it ran out of temporary registers because it was storing temporary values within a fixed resource that looked an awful lot like zero page on the 6502.
Did You Understand or Even Read The Article?
by
Milican
·
· Score: 2
Did you see the part about adding the extra registers that would allow you to access all the other registers without jumping through hoops? How the hell is a compiler going to do that? I'll tell you. It's not, because the compiler would have to jump through the hoops. With the RegisterMapControl (RMC) you would be able to access all registers without using multiple shifts and without having to go through specific sequences of assembly code to get at the contents of certain registers. This is a *hardware* issue not a software issue. If you had read and understood the article you would know this, because this guy at ChipGeek is talking about assembler, which is what any compiler outputs. In addition, MMX is only for multimedia instructions (duh) and the article specifically talks about speeding up general purpose applications. READ, READ, READ... If you don't understand then don't post. This goes for moderators too. This should not have been modded up to +5 insightful because it isn't, and it's completely off base from what the article was talking about.
Re:Cache size vs latency is the key.
by
bored
·
· Score: 1
Oh, someone forgot to tell you that cache size is only 1/2 of the equation. The other is cache latency. A basic equation from comp-arch is that the sum of each cache/memory level's latency*hitrate gives the effective latency of the memory system.
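Here is that equation as a few lines of Python, in case it helps. The function name and the numbers in the example are made up for illustration, not measurements of any real chip. It assumes each level is given as (total latency in cycles when that level supplies the data, hit rate among the accesses that reach that level), and that the last level (main memory) always hits.

```python
def effective_latency(levels):
    """Average access latency over a list of (latency, local_hit_rate) levels."""
    total = 0.0
    reach = 1.0  # fraction of all accesses that get as far as this level
    for latency, hit_rate in levels:
        if not 0.0 <= hit_rate <= 1.0:
            raise ValueError("hit rate must be between 0 and 1")
        total += reach * hit_rate * latency
        reach *= 1.0 - hit_rate  # the misses fall through to the next level
    if reach > 1e-9:
        raise ValueError("the last level must have a hit rate of 1.0")
    return total


# Illustrative numbers only: L1, L2, then main memory.
example = [(2, 0.90), (10, 0.80), (100, 1.00)]
print(f"{effective_latency(example):.2f} cycles")
```

With these made-up figures the average comes out at about 4.6 cycles, and nearly half of that is the 2% of accesses that go all the way out to memory.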
This is great, and a first order analysis concludes that you want a really big, really fast cache. The only problem is that the bigger the cache, the longer (in real time) it takes to fetch a row of data. The second part of the problem is that the processor has a fixed number of cycles (directly computable from the fetch->execute stage count) of latency it can tolerate. As the cycle times of the processor get shorter (higher GHz), the amount of real time it can tolerate gets smaller. This means that if your processor can only tolerate a 2 cycle load-to-use time, then you should try really hard to get an L1 that can satisfy this. So, in this example, 2 times the cycle time is the maximum fetch time the cache should have. Then it's easy: you build the biggest cache that you can afford which can also satisfy this time requirement. Then, because the L1 probably doesn't have a particularly good hit rate, you repeat the process for an L2 with a slightly higher hit rate and a slightly slower latency. Continue adding cache layers until you either run out of money or you get an acceptable memory latency.
A second order analysis also includes looking at real life application hit rates and patterns, then taking that data and using it to help make your decisions about cache size and latency. The Intel engineers aren't idiots (nor are the ones at AMD, IBM etc.); there is a real reason the cache is the size it is. If it didn't cost anything to double the L1 in either performance or money then you can be sure that any given processor would have 2x as much L1. Simple 0 order analysis (my cache/clockrate/dick is bigger) rarely tells even 10% of the story.
But a lot of the code running today wasn't "written today" if you know what I mean. The problem is, in order to recompile you first need: a) the original source, and...
Why? No really, why do you need the original source to compile something? Seems to me that "assembly language" and "byte code" are languages just like p-code or Fortran.
You are correct, I was going to point this out myself. Every 'RISC' arch I've used (even the ARM) has some "save a group of registers" instruction. But...!!!! There is a cost. Most of these instructions take many cycles to execute, so they just as well could be 10+ instructions. Latency is latency. The real savings is in code density. The real cost is in processor complexity. Besides the fact that this instruction and a couple others usually add an extra decode stage which breaks the 'Complex' instruction into more fundamental simple instructions, there is an exception handling issue. What happens if the instruction causes an exception (page fault, or an interrupt fires) half way through execution of the instruction? In particular, what about load multiple when the instruction overwrites the register which contains the source address (different PPC/POWERs handle this differently)? Ach ich, ugly problem, all to save a few instructions back when memory was expensive.
too much of a good thing = pie wagon
by
epine
·
· Score: 3, Insightful
If there was any sense to this comment, the x86 would have proved such a disaster it was abandoned ten years ago. Many people think it should have been, that its continued existence is some bizarre aberration of rational forces.
In actual fact, the ugliness of the duckling was less of an impediment than advertised.
There are several consequences of large, flat register sets. First of all, if your register set greatly exceeds the number of in flight instructions, you have a lot of extra transistors in your register set sitting there, on average, doing nothing. Well, not nothing. They are sitting there adding extra capacitance and leakage to your register file, increasing path length, cycle times, power dissipation, and routing complexity.
Second effect: large register sets increase average instruction length. Larger average instruction lengths translate into a larger L1 instruction cache to achieve the same hit ratio. PPC requires a 40% larger I-cache to achieve the same effectiveness as the x86 I-cache.
Third effect: context switches take longer. If you want to actually use all those registers, your process has to save and restore them on every context switch.
Finally, there is the register set mirage. Modern implementations of x86 have approximately 40 general purpose registers. Only you can't see most of them. Six of these can be named to the instruction set at any given time. The others are in-flight copies of values previously named with the same names. This all happens transparently within an OOO processor model.
If x86 only had six GP registers in practice, it really would have died ten years ago. What it actually has is six GP registers you can name at any one time, which means only six GP registers you have to load and store on context switches, etc.
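For anyone who hasn't seen renaming spelled out, here is a toy Python sketch of the bookkeeping. The class and method names are made up, and no real core is organised exactly like this. It uses the six-name, roughly-40-register figures from above, and it leaves out the part where physical registers are handed back once an instruction retires.

```python
class RenameTable:
    """Toy rename table: few architectural names, many physical registers."""

    def __init__(self, arch_names, num_physical):
        if num_physical < len(arch_names):
            raise ValueError("need at least one physical register per name")
        # Each architectural name starts out mapped to its own physical register.
        self.map = {name: i for i, name in enumerate(arch_names)}
        self.free = list(range(len(arch_names), num_physical))

    def rename(self, dest, sources):
        """Rename one instruction; returns (physical_dest, physical_sources)."""
        # Sources read whichever physical register currently holds each name.
        phys_sources = [self.map[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register; a real core would stall")
        # Every write gets a fresh physical register, so an older in-flight
        # value under the same name is not overwritten.
        phys_dest = self.free.pop(0)
        self.map[dest] = phys_dest
        return phys_dest, phys_sources


rt = RenameTable(["eax", "ebx", "ecx", "edx", "esi", "edi"], 40)
print(rt.rename("eax", ["ebx"]))  # eax = f(ebx)
print(rt.rename("ecx", ["eax"]))  # ecx = f(eax): reads the copy just written
print(rt.rename("eax", ["edx"]))  # eax = f(edx): a new copy, independent of the first
```

The first and third instructions both write "eax" but land in different physical registers, so the third does not have to wait for the second to finish reading the old value.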
What did die ten years ago was the notion that convenience to the human assembly language programmer was worth a hill of beans. Good architectures are convenient to the silicon and the compiler.
Other aspects of x86 have proved more serious than the shortage of namable GP registers. Too many instructions change the flag register, affecting too many patterns of flag bits. That's hell for an OOO processor to patch back together. The floating point stack was an abomination. Lack of a three operand instruction format is another significant liability.
On the other hand, the ill reputed RMW (read/modify/write) instruction mode is 90% of the reason the Athlon performs as well as it does. You get two memory transactions for the price of one address generation, translation, and cache contention analysis. It amounts to having the entire L1 cache available as a register set extension every other clock cycle (leaving half of your L1 cache cycles for other forms of work).
Having someone comment on the x86 is an excellent litmus test of the capacity for someone to dig deeper than their shallow preconceptions of elegance. If it were anything other than the despised x86, its ability to scale from 4.77MHz to 10GHz would have been considered a marvel of engineering soundness. Sometimes ugliness has lessons to teach us. Who among us is prepared to listen?
I guess most people don't comprehend the "red queen" nature of processor scaling. You have to run as fast as you can to stay right where you are.
Increasing clock speed is not a linear gain. Let's imagine we scale the 66MHz 486 to several GHz without making any significant changes to the core. How fast would it run? It would be stalled three clock cycles out of every four, or worse. It wouldn't run at 10% of the speed of a modern core at the same clock speed. That factor of ten constitutes a long series of "paltry gains" paying the price for maintaining linearity while clock frequency takes all the credit. And I'm not even being fair to the paltry gains, because IPC has indeed increased greatly while latency hazards have scaled by several orders of magnitude.
Stack machines are efficient in terms of economy of opcodes and economy of specification. They're inherently serial beasts, though, unless you want to work extra hard and "registerize" the stack.
Registerizing the stack is basically register renaming that has to take into account that every instruction might rename the entire register set.
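A toy Python sketch of what "registerizing" means, under made-up assumptions: the two-field instruction format and the function name are invented for illustration, and real hardware does this on the fly rather than as a translation pass. Every value pushed gets its own virtual register, which makes the hidden parallelism visible.

```python
def registerize(program):
    """Translate stack ops into three-address ops on virtual registers.

    program is a list of tuples: ("push", name) or (binary_op,).
    Returns a list of (dest, op, operands) tuples.
    """
    stack = []     # virtual registers currently sitting on the stack
    out = []
    next_reg = 0
    for op, *args in program:
        dest = f"r{next_reg}"
        next_reg += 1
        if op == "push":
            out.append((dest, "load", (args[0],)))
        else:
            # A binary op consumes the top two stack slots.
            right = stack.pop()
            left = stack.pop()
            out.append((dest, op, (left, right)))
        stack.append(dest)
    return out


# (a + b) * (c + d) written for a stack machine
stack_code = [("push", "a"), ("push", "b"), ("add",),
              ("push", "c"), ("push", "d"), ("add",), ("mul",)]
for line in registerize(stack_code):
    print(line)
```

In the stack version the two adds look strictly sequential because they both go through the top of the stack. In the registerized version they touch different registers, so they could issue side by side, and only the final multiply has to wait for both.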
For a non-performance critical embedded system with tight power constraints, it might be a good match. For top-speed computational performance, you just don't get the parallelism out of a stack machine. At least, I can't see how without jumping through a lot of hoops.
Of course, I might be biased. The CPU I program lets you issue 8 instructions per cycle and has 64 32-bit registers. It can read about 30 registers and write 18 registers every cycle. I just can't imagine trying to write the highly parallel code I write on a stack machine!
The result is the ability to allow any existing assembly instruction to pull-data-from and write-data-to any alternate register. And this without having to modify or extend the x86 instruction set in any way shape or form.
I propose the introduction of two new 32-bit hardware registers and four new assembly instructions. Simple, isn't it?:)
-- I browse at +5 Flamebait- moderation for all or moderation for none.
Re:Revolutionize this...
by
Anonymous Coward
·
· Score: 0
And a lot of code in use today (sadly) was not written in languages as portable as C
Hehehehehehehehehehehehe. I think someone's never had to port C code:-)
I disagree with the assessment that there is something wrong with the IA-32 opcode map. True, it's complex, it doesn't provide a lot of register flexibility; but compilers and internal register renaming make up for a lot of that.
What is truly brilliant about the IA-32 instruction set is that it compresses very nicely. Try to write a useful function in 64 bytes on any RISC architecture, and you'll see why.
Although it wasn't designed for this at the time, this has a very positive effect on performance - if we can squeeze more instructions into a smaller space, we have a smaller i-cache footprint, which definitely speeds things up, considering the memory bus bandwidth is the limiting factor, not the CPU.
I understand his lack of appreciation for all the stack references, but I don't think this is the proper solution. The d-cache already catches stack reads - if there were a way to map a page as non-cache writeback, and the OS mapped the stack pages appropriately, flushing with a writeback only before a context switch, I think you'd see memory bandwidth increase significantly. True, this may break a number of things, but those problems can be worked around. This would help a large class of stack-intensive applications - and many applications and servers written for performance are already stack intensive because of the d-cache read benefit and easy allocation of buffer pools (malloc() is usually expensive).
Course I don't have any of my architecture books on me right now, but I wouldn't be too terribly surprised if there is already a way to do this.
I think...
by
Anonymous Coward
·
· Score: 0
640Kb of cache would be enough for everyone...
Re:Stack machines? Ack! Ptooey!
by
purrpurrpussy
·
· Score: 1
All coding is serial... you can never avoid that. You are taking the result of an operation and passing it back into that operations inputs.
Registers just help this process saving memory reads. Although you still have to read that data from somewhere.
I've programmed the 80 series TI DSP chips and it blows; basically I've never coded in such a difficult way. The trick with stack machines is to make them simple and fast and load lots of them onto a die - they each have their own addressable memory space rather than cache and of course each has a stack.... wire together on a bus and go parallel with threads rather than ILP. Much like the transputer.
Anyway.... I'm going to get drunk;-)
-- "None of this shit works" -W.Shatner
Re:Switching Architectures
by
Inthewire
·
· Score: 1
Are you kidding? The last place I worked for had nothing in the way of version control or source tracking. Code, compile, put into production. They are still running stuff I coded and there is no source for it. Why? Because I *deleted* the source after each version. They never signed a work-product agreement with me, so I decided that they had no right to any code I produced. They were welcome to the .exe files, though. As you may have surmised I was not a popular fellow. But that didn't bother me - my greatest desire was for the place to catch fire and burn with most employees trapped inside. It never happened, but I escaped and mellowed.
> I get the following error messages at bootup, could anyone tell me
> what they mean?
> fcntl_setlk() called by process 51 (lpd) with broken flock() emulation
They mean that you have not read the documentation when upgrading the kernel.
-- seen on c.o.l.misc
- this post brought to you by the Automated Last Post Generator...
Shouldn't we improve bus speed, data access speeds, etc etc first? After all, the bottleneck is not the processor anymore...
That's pretty sweet how he makes the x86 processor faster by adding commands for divx! This guy knows how to improve Intel architecture for the masses!
It sounds like a pretty decent idea to me. Granted, I'm no assembly expert (I'm just now in my Microprocessors class, which is based on the Z80), but I don't see how having more registers could be a bad thing. Anything that keeps operations there inside the CPU rather than going out to memory would pretty much have to be faster. I especially like the fact that he's implemented it such that no current code would be affected. THAT is a key point right there.
Admittedly, even if Intel and AMD decided to implement this, it'd still be a while, and then we'd have to get compilers that compile for those extra instructions, and there's our entire now-legacy code base that doesn't make use of them, and don't forget those ever-present patent issues...
But yeah. Cool idea, well thought out. Petition for Intel, anyone?
Mark Erikson
Why do you think Pentium Pro was such a huge success that it's still being used in CPU intensive operations? Why do you think Sun Sparc and Digital/Samsung Alpha CPUs trash modern Pentium 4s and Athlons at 500 MHz? Yup. Loads and loads of cache.
or at least add the lowest level often used stuff. It would make a lot of stuff faster.
MMX/3d stuff for CPUs are lame, we have 3d cards for that.
and SMT to the cpu as default.
add a FPGA matrix of 4096x4096 transistors or something on the side of the cpu for custom UBER fast routines
Liberty freedom are no1, not dicks in suits.
Ok, he realizes that the x86 architecture is flawed. One of the most limiting problems is the lack of general purpose registers (GPR), so he adds more complexity to an already over-complex solution to solve this problem. All I have to say to this is: when will you see that the solution is as simple as switching architecture!
As most code today is written in higher level languages (C/C++, Java, etc.) all it takes is a recompile and perhaps some patching and adaptations to small peculiarities. The Linux kernel is a proof of this concept, a highly complex piece of code portable to several platforms with a huge part of the code fully portable and shareable. This means that it is not hard to change architecture!
If the main competition and its money would move from the x86 to a RISC architecture (why not Alpha, MIPS, SPARC or PPC) I'm sure that the gap in performance per penny would go away pretty soon. RISCs have several advantages, but the biggest (IMHO) is the simplicity: no awkward rules (non-GP registers), no special case instructions, easy to pipeline, easy to understand and easy to optimize code for (since the instruction set is smaller).
And to return to the original article. Please do not introduce more complexity. What we need is simple, beautiful designs, those are the ones that one can make go *really* fast.
Linux kernel source - memcpy() anyone?
(On MMX machines, the wider 64-bit MMX registers are used for memcpy() rather than the 32-bit standard integer registers)
This has been in the kernel for a few years now and anything that uses memcpy() benefits from it. Move along now.
retrorocket.o not found, launch anyway?
The scheme as proposed would work, but nothing will change the fact that it's another hideous hack to get around the non-orthogonal addressing modes in the original Intel 80x86 architecture.
Even the little microcontroller chips that I can buy for $2 have 32 general purpose registers (Atmel AVRs, for anyone who cares).
Worse, this scheme would not benefit existing code - it still requires code changes to work.
Finally, on the gripping hand, the Pentium III and 4 have a very similar register renaming scheme going on automatically in the hardware. The 8 "logical" registers are already mapped dynamically into a much larger physical register file. (From ExtremeTech: http://www.extremetech.com/article2/0,3973,471327,00.asp .)
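For anyone who wants to see the renaming idea in miniature, here is a toy sketch. It is an assumption-laden model, not how any Pentium actually does it: the class name RenameTable, the 32-entry physical file and the simple free list are all invented for illustration, and freeing old physical registers at retirement is left out.

```python
class RenameTable:
    """Toy register renamer: 8 logical names over a larger physical file."""

    ARCH = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

    def __init__(self, physical_count=32):
        # Initially logical register i lives in physical register i.
        self.mapping = {name: i for i, name in enumerate(self.ARCH)}
        # Everything else is free to hold new values.
        self.free = list(range(len(self.ARCH), physical_count))

    def read(self, name):
        """Return the physical register currently holding `name`."""
        return self.mapping[name]

    def write(self, name):
        """Give a new value of `name` a fresh physical register."""
        phys = self.free.pop(0)
        self.mapping[name] = phys
        return phys


rt = RenameTable()
first = rt.write("eax")    # e.g. mov eax, [a]
used = rt.read("eax")      # e.g. add ebx, eax  (reads the first value)
second = rt.write("eax")   # e.g. mov eax, [b]  (independent of the first)
```

Because the two writes to eax land in different physical registers, the second load does not have to wait for the instruction that still needs the first value.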
Sean Ellis
Follow OfQuack's antics on Twitter.
I'm reminded of the days I used to code for the old Acorn Archimedes (don't look for it now, it's not there any more) and our apps were usually way faster than the competition's.
When asked why, we were tempted to tell them that we used the undocumented 'unleash' instruction to unleash the raw power of the ARM processor.
This sort of reminds me of what happened with IRQs. Ultimately Intel "solved this" via the PCI bus, but performance has occasionally been problematic. Of course, that problem goes back to the original IBM design for the original IBM PC. Intel is also very aware, I imagine, of what happened when IBM tried a total redesign with the Micro Channel (MCA) bus, etc. It got rejected, I think, primarily because it was proprietary. In any case, enough companies have been nailed on backward compatibility issues that Intel may be nervous about making a total break.
The upside is being able to run old software on new hardware. You don't want to break too many things.
"It is a greater offense to steal men's labor, than their clothes"
To increase performance!
I remember the "next big thing" during the early and middle 90s was RISC - So will the next big thing be MCISC (More Complex Instruction Set Chips)?
I wonder if the core of an MCISC will be RISC, or CISC that in turn has a RISC core.
try to make ends meet, you're a slave to money, then you die
Not having to move data stored in registers to memory may help performance, but I think that this would be more of a convenience factor than anything else.
The "minimal effort" claim deserves some scrutiny as well. Sure, Intel could do this, but wouldn't it mean that support for the new registers/instructions would have to be written into the compilers? Then everything recompiled on the new compilers? ... that would be one huge "apt-get update" ... hope there aren't any bugs...
aoeu
Well, not quite, but it has the same flavor.
After working in x86 assembly, I really appreciated high level and minimally complex languages like C.
Best Slashdot Co
The guy does not realize that what he proposed is not at all simple to implement in silicon.
These two additional mapping registers would complicate the pipeline hazard detection in an exponential way.
Another point is that I don't think that by doubling/tripling the number of registers available you will get a tenfold performance increase: a small increase could be expected, but not much.
Another problem is the SpecialCount counter: this would complicate the compilers too much. It would also make the instruction reordering almost impossible.
Anyone who's ever tried to use the MMX or XMM registers for non-multimedia applications knows what I'm talking about. The instruction sets for them are nicely tweaked to let you do "sloppy" parallel operations on large blocks of data, and not really suited for general computing. You can't move data into them the way you would like to. You can't perform the operations you would like to. You can't extract data from them the way you would like to. They were meant to be good at one thing, and they are.
I once tried to use the multimedia registers to speed up my implementation of a cryptographic hash function whose evaluation required more intermediate data than could nicely fit in GP registers, and had enough parallelism that I thought it might benefit from the multimedia instructions. No such luck. The effort involved in packing and unpacking the multimedia registers undid any gains in actually performing the computation faster -- and the computation itself wasn't that much faster. I was using an Athlon at the time, and AMD has so optimized the function of the GP registers and ALU that most common GP operations execute in a single clock if they don't have to access memory, while all the multimedia instructions (including the multiple move instructions to load the registers) require at least 3 clocks apiece.
Now this leads me to suspect that the multimedia registers have limited functionality and slow response for a single reason: economics. The lack of instructions useful for non-multimedia applications could be explained via history, but what chip manufacturer wouldn't want to boast of the superior speed of their multimedia instructions? And yet they remain slower than the GP part of the chip.
So I conclude that merely making a faster MMX/XMM processor is prohibitively expensive in today's market. And this proposal would definitely require that, even if actually adding the additional wiring to support the GP instructions for these registers was feasible. Because what would be the point of using these registers for GP instructions if they executed them slower than the same instructions actually executed on GP registers?
The whole gist of the article has to do with the x86's lack of general purpose registers. While this is true, you're not going to solve all of the x86 shortcomings simply by figuring out a way to add more of them. There are MANY things wrong with the x86 design; GP registers are just one of them. There's an entire section in the famous Patterson book that goes into all of the issues in much more detail than I care to state here.
Besides, there are already more efficient (albeit complex) solutions to extend registers that make much more sense in the current world of pipelined processors. Register renaming is one such example.
Isn't a cheap IBM PowerPC supposed to rock?
Even Linux on a 200 MHz PPC Mac runs decently.
Liberty freedom are no1, not dicks in suits.
It's interesting to hear "revolutionizing performance" in the same topic as instruction level fiddling. The only way to give truly "revolutionizing" performance is to do high level optimizations.
:-)
When you have your highly optimized C++ code or whatever, *then* you can get down to low-level and start polishing whatever routine/loop you have that's the bottleneck. The compilers of today also usually do a better job than humans at optimizing performance at this level and ordering the instructions in an optimized way. Especially if you consider the development time costs you'd need if doing it by hand. It's a myth that assembly code is generally faster if manually written -- many modern compilers are real optimizing beasts.
Anyway, I think one should always keep in mind that C++ code will only gain the greatest benefit from well optimized C++ code, not from new assembly level instructions, regardless if they unlock SSE registers for more general purpose or whatever. Oh, yeah, more registers won't revolutionize performance either. If they did, Intel and AMD would already have fixed that problem. I'm sure they're more than capable of doing it... More registers increase the amount of logic quite dramatically and I'm pretty sure it doesn't give good enough performance gains for the increased die cost, compared to increasing L2 cache size, improving branch prediction, etc.
Beware: In C++, your friends can see your privates!
It's a cute idea having a "stackspace" for your GPRs, but you could just move to an architecture with more GPRs and not have to design a brand new chip (I hate verilog).
Now if I could only get my compiler to stop moving items from gpr to gpr with a RLINM that has a rotate of 0 and an AND mask of all 0xFFFFFFFF's!
In the future, I would want to not be isolated from my friends in the Space Station.
From what I gathered in the article, it seems like he is proposing a scheme by which normally unused registers (MMX, etc) can be used as general purpose registers. To do this, he considers an aliasing system. My question is, why can't a x86 programmer today just use those MMX registers for more general purposes? I'm sure there's a good reason, I just can't figure it out from the article - thanks
The problem is, in order to recompile you first need: a) the original source, and b) someone capable of patching, etc.
A lot of internal apps are in use for which the source code is lost. And a lot of code in use today (sadly) was not written in languages as portable as C, C++, and Java. A lot of apps in use today were written in Clipper and COBOL and a bunch of other languages that may not have decent compilers for other platforms. So recompiling it isn't an option. A complete re-write is necessary.
Even for situations in which application source *does* exist, and suitable compilers exist on other architectures, it is more often than not poorly documented...and the original author(s) is/are nowhere to be found. So in order to patch/fix the source to run on the new architecture, you not only need someone well versed in both the old and the new architectures, but someone who can read through what is often spaghetti code, understand it and make appropriate changes.
In a lot of these cases it's easier to stick with the current architecture. And that, to some degree, is why the x86 architecture has gotten as complex as it is.
Dammit, I was just about to propose the same idea.
The x86 arch simply sucks because it was not designed to be improved upon. It spends the majority of its time DECODING instructions, and then the rest of the time PIPELINING them. The more stages in the pipeline, the more CLOCK CYCLES it takes to execute the instruction. SHORTER and FEWER pipelines make a better CPU. Not LONGER and more COMPLEX pipelines.
It takes something like 2 cycles on a motorola cpu to do a MOVE instruction..probably takes 10x more cycles to do the same on a pentium..
less pipelines, wider internal busses, less complex instructions..we don't need more registers, we need faster, wider registers.
Intel's policy: "If it doesn't increase the clockspeed, it doesn't go in the chip." Performance is not an issue, only clockspeed.
And we all love it for the same reason we love mutant superhuman zombies. :o)
[PowerPoint] is a tool for capitalist presentation
I want to form a company that makes a cpu that translates x86 instructions on the fly to RISC instructions that operate in parallel.
I'll call my company transmeta!
Or in the words of that new dell commercial
"Sure we'll call it 1-800 they already do that!".
Tom
Someday, I'll have a real sig.
on chip, you could increase cache size and bandwidth to the prefetch, but yeah, that's more transistors (better spent elsewhere!).
but off chip, programs will take up TWICE AS MUCH code space with his remap instructions!! ridiculous!!
Don't go for it already. Intel has to pay $150M for patent infringement in their latest Itanium and Itanium 2 processors.
You are VERY confused.
;-)
1 - Zero Cost. 2 - Backwards Compatible. 3 - Orders of magnitude.
1 - You have to buy new chips - this will improve the speed of "computing" but it will not increase the speed of THIS computer I have right HERE.
2 - No old code has RM/RMC instructions in it and will NOT run any faster than it already does in a "standard" x86 mode. Yes it is backwards compat. but by the same token so is MMX, EMMX, 3DNOW!, SSE, SSEII, AA64 etc....
3 - Anyone who can sell me a program to "suddenly" make all my code go 10x or 100x faster is guaranteed to give me a good chuckle!!!!!!!
As for the article... well you've hugely increased the number of bits it takes to address a register and swapping the RM register is going to cause all sorts of new dependency chains inside the chip.
Personally.... I'd go for a stack machine. Easily the most efficient compute engine.
Now - if we could get back to point number 1 and point number 3. If YOU can make MY computer go 10 or 100 times FASTER with SOFTWARE I promise I WILL give YOU some MONEY....
"None of this shit works" -W.Shatner
CRAM: search google for "CRAM computational RAM"
http://www.ee.ualberta.ca/~elliott/cram/
is your ultimate parallel compute machine. It turns your entire memory (all the CRAM anyways) into a register set. It is based on the concept that, rather than bringing the data to the CPU for the computation, the CPU is brought to the memory.
Small computational units AND/OR/Adder are included on the bit access lines for all the memory cells.
I actually understood that. And I haven't done assembly language programming since the old 8086. (Segment registers, *shudder*...)
Jack William Bell
- -
Are you an SF Fan? Are you a Tru-Fan?
Fortunately, that issue is addressed in his Message Parlor. The full text of his response to BritGeek follows:
He may be onto something after all...
Just they disable it at the FAB until it is more fully tested and ready in the next generation of chips. Sorry, you can't turn this feature on, but you are paying for it.
His second proposal, the RegisterMap field, strikes me as the incredibly complex part of this idea. He sounds like he's suggesting an idea that will turn x86 architecture into a simplified emulator by allowing you to logically map any register address to any physical address you choose. While there are probably some benefits to this, it sounds like the complexity of programming an already exceptionally complex chipset could go through the roof!
I read somewhere in a previous article (last year sometime, can't find a link) that the way most compilers treated x86 was already done with so many pseudo instructions as to basically be an emulator. Now this was before I had any knowledge of assembly level programming, so maybe someone with more knowledge could clarify this?
-Shadow
...And the best part is that I believe this is something that could be implemented in hardware in a manner which could be resolved and entirely applied during the instruction decode phase, thereby never passing the added assembly instructions any further down the instruction pipeline, and thereby not increasing the number of clock cycles required to process any instruction. I can provide technical details on how that would work to anyone interested. Please e-mail me if you are....
:-)
If this is really accomplishable without wasting *any* extra cpu time (that waste would apply to *all* instructions the CPU goes through!) this is indeed a good stunt that could work out to add a substantial ooomph to x86 performance with the code we have today.
Thank god, 'cuz' my Athlon is too hot already and I'm kinda sceptical about watercooling.
Then again, that's a big "if".
We suffer more in our imagination than in reality. - Seneca
A segment architecture for memory wasn't nasty enough, now we want to have a segment register for the registers?
Thanks, no.
Great, in a time where we are removing god awful pointers from high level programming languages, we're putting them in the hardware.....uuugh.
Anyone ever write something with intensive pointer arithmetic in C++? It's enough to drive you mad.
Can you imagine peer code review: "No, that's not the instruction.....that's a pointer to the instruction."
Oh boy!
-ted
So what he's suggesting is yet *another* band-aid on an already patched together architecture. This is no different than tacking 32-bit mode on top of a segmented 16-bit architecture, or the bizarre MMX/fp register sharing nonsense.
uh-huh...It's all about the Pentiums baby!
Whatcha wanna do
Wanna be hackers, code crackers, slackers?
Wasting time with all the chat room yackers?
9 to 5 chillin' at Hewlett Packard. WHAT!
(My apologies to Weird Al Yankovic)
I think the main problem is that, in the current architecture, all data has to be brought to a small set of registers in the CPU to be processed. I think it would be better if data were processed in the memory where it is, so that all the CPU has to do is send an instruction to the memory. This would be a faster approach to data processing. The CPU would mainly act as a master and all memory as a slave.
WOW! that was a cool article.
Can i go now ?
It would be a lot better to use a new chip with new instructions (well, a new PC architecture would be even better). ... what about a new instruction that would switch the micro to use a new RISC instruction set, of course taking care of context switching between applications and what instruction set they use.
The problem, as I see it, is that nobody wants to face the risk inherent in such a big step forward.
But
It would keep backward compatibility with old applications and, as the OS gives enough time to every application, it wouldn't affect instruction caching, branch prediction and so on too much. It could even have a mode in which only applications written in the new instruction set are allowed, so the system administrator could choose whether or not to keep backward compatibility.
Of course, it's only an idea, and of course it would have a lot of flaws and so, but, at first view, it seems very feasible to me.
But this change would:
Make an internal interface explicitly controlled by the programmer/compiler, loading an enormous amount of work on the compiler creators. (Just have a look at IA64 - is there any good compiler out there already? I haven't had a look for a while.)
Destroy (or at least reduce the efficiency of) the internal register renaming unit, thus slowing down the out-of-order execution core and such (the entire core, actually...) Sorry, but this man may have been busy programming x86 assembly his entire life (and for this he deserves my respect), but he is not up to date on how a modern x86 cpu works in its heart. When I heard the lectures in my university about how this stuff works, I gave up learning assembly -- one just doesn't need it anymore with the compilers around.
Reading the books by Hennesy/Patterson (don't know if I spelled them correctly) may help a lot.
I hate to say it, but lately it's becoming more and more obvious that Intel is no longer really interested in performance. They'll squeeze a bit more out of an ancient architecture and add a few buzzwords like "SSE2", so they can slap on a hefty price-tag.
Look at the Pentium 4 design! Intel would much rather use a dated cpu with a nice pretty GHz rating than keep the same MHz and improve the architecture design.
Do you really think investors give a shit about registers?
--Marketing 101
Actually, this is just one of many potential downfalls. He forgot interrupts, mode switching (going from protected to real mode, as some OS's still do), and IO would all require that the proposed RM/RMC register be loaded from the stack. The net effect would be that if his scheme is implemented, existing programs would run slower, not faster. Furthermore, placing the RM/RMC register on the stack is an impossibility without breaking backward compatibility; many assembly language coders depend on a set number of bytes being added to the stack when they perform a call or interrupt.
Why not just add 24 GP registers to the existing processor? Honestly, it would be a lot simpler, and would not complicate the whole x86 mess, nor break backward compatibility.
I don't mean to flame, but this guy is way off base. The biggest problem with the x86 instruction set is lack of registers, and the second biggest problem is that its complexity is rapidly becoming unmanageable. Not even Intel recommends using assembly anymore - their recommendation is to write in C and let their compiler perform the optimizations. Adding more instructions like this would further diminish the viability of coding in assembly.
A far better solution would be to simply keep the existing instruction set intact, and add more GP registers. IBM got it right the first time - their mainframe processors have 16 general purpose registers which can be used for any operation - addressing, indexing, integer, and floating point calculations. If anything, Intel should stop adding instructions and start adding registers.
The society for a thought-free internet welcomes you.
Yes, he basically invented register renaming, but put it under explicit programmer control. It's a programmer's solution to what hardware has already done, and as was inevitable he doesn't see that he will do more harm than good.
Here's why his idea sucks:
1) Register renaming dependent on the RMC. You can't issue any instructions if there is a POPRMC in the machine until the POPRMC finishes execution. He calls it "a few cycles", but it's much worse than that. You've prevented any new instructions from entering the window until the stack access is done, preventing any work that -could- have been done in parallel from even being seen. Function call/return overhead is a big deal, and he just kicked it up a notch.
2) His whole problem #3 -- that you can't explicitly access the upper 16 bits of a 32-bit GPR. All I can say is -- thank God! Being a programmer, he probably doesn't realize that being able to address sub-registers is actually a big problem with x86. The whole sub-register-addressing problem causes all kinds of extra dependencies and merge operations. And he wants to make it worse? I think he should be slapped for this idea. x86-64 had the right idea -- you cannot access -just- the upper 32 bits of a GPR, and when you execute a 32-bit instruction that writes a GPR, the upper 32-bits are not preserved. Which is how the RISCy folks have been doing it all along, but hey.
3) This idea requires an extra clock cycle in the front-end, to do the translation from architected to the expanded architected register space, prior to being able to do the architected->physical register translation.
4) Because you still can't address more than 8 registers at a time, you'll be using lots of MOVRMC instructions in order to make the registers you need visible. Ignore how horrible this would make it for people writing assembly ("Okay, so now EAX means GPR 13?") or compilers, this is going to result in a lot of code bloat.
5) Because of 1) and 4), modern superscalar decoders are going to be shot. If you fetch a MOVRMC, followed by POP EAX and POP EBX, you can't decode the second two until -after- you've decoded the MOVRMC and written its values into the map.
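Point 2) in the list above can be made concrete with a small model. This is only a sketch with made-up helper names (write8_low, write32_x86_64); it models a 64-bit register as a plain integer and shows why a partial write needs the old register contents (a merge, hence an extra dependency) while an x86-64 style 32-bit write does not.

```python
MASK64 = (1 << 64) - 1


def write8_low(old, value):
    """Write only the low 8 bits (like writing AL).

    The other 56 bits are preserved, so the result depends on `old`:
    the hardware has to merge the new byte into the previous value.
    """
    return (old & ~0xFF & MASK64) | (value & 0xFF)


def write32_x86_64(old, value):
    """32-bit write in 64-bit mode: the upper 32 bits are zeroed.

    `old` is deliberately unused: the new register value does not
    depend on the previous contents, so no merge is needed.
    """
    return value & 0xFFFFFFFF


before = 0x1122334455667788
merged = write8_low(before, 0xAB)             # keeps the upper 56 bits
fresh = write32_x86_64(before, 0xDEADBEEF)    # ignores the old value
```

The first function is the case that creates extra dependencies and merge operations; the second is the behaviour x86-64 chose for 32-bit writes.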
Now all this is so that you can save on loads/stores to the stack. Which is great, but at least when those loads and stores are executing, independent instructions can still go. Every RMC-related stall is exactly that -- no following instruction can make progress.
Not that increasing the number of registers in x86 isn't a good idea -- it's just his implementation that sucks. With him being an x86 programmer, I'm surprised he didn't think of the most obvious solution -- just add a prefix byte to extend the size of the register identifiers in the ModR/M and SIB bytes. You get access to ALL GPRs at once (rather than a 8-register window), no extra stalls are required, and your code size only goes up by one byte for instructions that use the extra registers.
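The prefix-byte idea in the paragraph above can be sketched as a decoder. Everything here is illustrative: decode_regs is a hypothetical helper, the prefix layout is assumed to follow the REX prefix that x86-64 uses (0100WRXB), and SIB bytes and memory addressing forms are ignored, so it only covers the register-to-register case.

```python
def decode_regs(prefix, modrm):
    """Return (reg, rm) register numbers in the range 0..15.

    prefix: assumed extension byte laid out like x86-64's REX
            (0100WRXB), or None when the instruction has no prefix.
    modrm:  ModR/M byte, laid out as mod(2 bits) reg(3 bits) rm(3 bits).
    """
    reg = (modrm >> 3) & 0b111
    rm = modrm & 0b111
    if prefix is not None:
        r_bit = (prefix >> 2) & 1   # fourth bit for the reg field
        b_bit = prefix & 1          # fourth bit for the rm field
        reg |= r_bit << 3
        rm |= b_bit << 3
    return reg, rm


legacy = decode_regs(None, 0xC8)     # old code: registers 0..7 only
extended = decode_regs(0x45, 0xC8)   # one extra byte reaches 8..15
```

Old code without the prefix decodes exactly as before, and code that wants the extra registers pays one byte per instruction, with no map to update before the next instruction can be decoded.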
I can't help but commend him on his idea being well-thought out. To the best of his knowledge, he tried to address all issues. But that's the problem -- he's a programmer, not a computer architect.
The enemies of Democracy are
Besides, segmented registers. I am having severe troubles finding an example of a worse idea actually proposed.
Finally! A year of moderation! Ready for 2019?
1. Make x86 several orders of magnitude faster than using black magic.
2. !?!?
3. Profit!!
polishing a turd
offtopic as offtopic can be!
It seems that this would require a recompile to have any benefit.
Soooo, if you are going to recompile anyway, why not target a processor with 128 64 bit GP registers, or whatever IA-64 has, instead of piling yet more cruft on top of x86?
I'm not even convinced that it would be easier to modify existing i386 compilers to take advantage of this "advancement" than to get equivalent performance out of an immature IA-64 compiler, with more room for improvement.
-Peter
--please excuse the dumb question here, but I need to ask it of an architect so I pick you on this thread. Why can't CPUs be built modularly? Allowing for bus speed constraints, which I realise are very important, would it be possible or warranted to make chips that can be disassembled, so that the various functions may be upgraded or customized according to usage without replacing the entire chip? The layers of cache etc. as well. I guess I am asking for something similar in theory to the various ways a stock mobo may be customized by what is used in the slots.
Thanks. This is simplistic, to go along with my simplistic understanding of how they work. I know it's probably lame as well, but it's really the question I want to ask.
I know NOTHING about microcode.
Now that that is out of the way...would it be possible to implement this idea using microcode and therefore would it be possible to patch existing cpus that support downloading new microcode? (e.g. PIII?)
Lump lingered last in line for brains, and the ones she got were sorta rotten and insane.
LOL! This is what happens when a software guy tries to wear a hardware guy's hat! As if an array of pointers is "revolutionary".
He doesn't even address his own concern -- speeding up legacy x86 code. Everyone writing performance assembly code uses SSE/MMX. Critical path code is hardly ever written in legacy x86. In fact, most compilers are smart enough to do the conversion for you (MSVC, ipp) even without intrinsics.
What does he suggest? Offer extra instructions! Hello. Does this guy actually write any code ever? It doesn't sound like it.
The part that confuses me is that, since code would need to be recompiled to make use of this, you might as well just compile for x86-64 and make use of a larger flat register space. While the idea is interesting, there doesn't seem to be any advantage to using it (and a few disadvantages, pointed out by other posters).
The cost of research, development and testing of such a feature is very high due to the increased complexity of the solution, as many posters above have already mentioned.
The solution is to move to a clean RISC design and recompile all existing binary code to the RISC environment.
It would be far cheaper to make a re-compiler (from x86 to RISC for example) than to introduce such a complexity to x86 chips with very little benefit.
There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers.
Is it not possible to make a stack for the registers, and then have a single push and pop instruction save/restore all the registers?
FRA: STFU GTFO
This already exists on SPARC. It's called register windows. It makes writing compilers/assembly a real bitch. Chipgeek needs to do his homework.
As several posters have already mentioned, Intel gets around the lack of registers problem by using register renaming. There are actually 128 general purpose registers in the P4. Which ones you're writing to is controlled by the processor.
You whippersnappers have it easy! Eight whole GP registers? The 6502 had three: A, X, and Y - and we LIKED it! It was a big improvement, why just a few years back I had to use the capacitance of my own body parts for registers. And that bloody hurt, what with the CPU drawing 35Kw and all! You kids are pansies!
Gamingmuseum.com: Give your 3D accelerator a rest.
X86 sucks anyway. I can't wait until it dies. Ancient pile of crap architecture.
Switching to a RISC architecture is not the answer. I think this guy is wrong, but you are also.
Except for tight inner loop programming, the biggest problem with modern algorithmic programming is not maximizing code to prevent pipeline bubbles, nor is it making the instruction set absurdly simple to make life easy for compiler writers and hardware designers, but falling out of the instruction cache!
We need MORE complicated instructions, not less complicated ones like RISC advocates want. I agree that lots of GPRs is nice, but spending more than 4k of your 8k instruction cache on adds/loads/stores when doing something with a buffered data stream is crazy foolish. Auto-incrementing arithmetic indexing instructions would help this greatly.
Use the parts of RISC that are good, but throw away the parts that aren't. We spend tons of time talking about cache (and memory) bandwidth being the bottleneck, then solve it by making us push MORE (albeit simpler) instructions through it? I don't think so.
And whose chip has the highest floating point?
Umm. It appears to be the "hp workstation zx6000 (1000 MHz, Itanium 2)", which isn't an x86 machine.
Laws do not persuade just because they threaten. --Seneca
I can speak on some authority on this subject since I am presently taking a course on code optimization. What it looks like Mr. Hodgin is trying to do is work around the issue where people do not compile programs with processor specific optimizations. He seems to be proposing doing so by allowing "paging" per se of registers amongst themselves, although in a bit of an odd fashion.
Personally, I am not too fond of this approach. First of all, operating systems will need to be written to support this paging. Secondly, running a single MMX and/or SSE enabled application (which would use most if not all of the mapped registers), would cause all the other applications on the system to suddenly lose any benefit that paging would provide.
The approach I would take (which may or may not be better) would be to change the software. Compilers like gcc 3.2 already know how to generate code with MMX and SSE instructions. Patches are available for Linux 2.4 that add in gcc 3.2's new targets (-march=athlon-xp, etc.) to the Linux kernel configuration system. Libraries for *any* operating system compiled towards a processor or family of processors likely would fare better than generics.
And yes, gcc 3.2 can do register mapping in a similar fashion (to ensure that all registers are used) on its own. If you read gcc's manual page, you will note that this makes debugging harder though. Gcc even has an *experimental* mode where it will use the x87 and SSE floating point registers simultaneously.
Mr. Hodgin's approach might be a bit better for inter-process paging by a task scheduler for low numbers of tasks. But as a beginner in this field, I'm not sure what else it would be good for.
Please pardon the omissions; I am not presently using a gcc 3.2 machine :)
What happened to backing up flames and claims with real data? The author of this article would be well advised to implement his ideas using an x86 emulator and at least do some preliminary testing. Processor-level features such as out-of-order execution and register-renaming may not be handled by an emulator, but it would be an informative investigation nonetheless.
For the tuned assembly loops I have written (multimedia or otherwise), I have gotten the same loop timing from L1 cache as from the registers. Essentially L1 is a big bunch of GP registers already.
What this looks like to me is that a new "MM-enabled" chip would be able to run existing x86 code fine, but run "MM-optimized" code much faster. One of the main problems of x86, tying certain operations to certain registers, can only be worked around with a re-compile into this "MM-optimized" code.
If you're going to redesign the chip... then re-compile the code... why not just DROP X86?!?!?
-ZOD-
For the more technically inclined:
Jim Smith and Guri Sohi have a pretty good overview of how superscalar processors work. You can pull a cached version off of citeseer at:
http://citeseer.nj.nec.com/35243.html
If you want to get a better feel for the complexity (at the transistor-and-wire level), you could try:
http://citeseer.nj.nec.com/palacharla98complexityeffective.html
This paper is pretty technical, but you don't really need to understand all of the equations to get the gist of it. They're also a few years old now, but still relevant. If you understand how the circuits are organized, and that complex circuits and long wires are expensive (in terms of slowing down the clock cycle), then you can get a decent feel for how complex the proposed register virtualization might be to implement in hardware.
Other posters have commented on how clock speed is not the bottleneck, but it's actually the caches, buses, memory bandwidth, memory latency, etc. This really depends on the application and what you're doing. BUT, just because X isn't the bottleneck, it doesn't mean you have free license to tinker (i.e. slow down) X by a huge amount. Adding a slight bit of extra delay to X can easily make X into the new bottleneck. Removing bottlenecks is really difficult because as soon as you remove one, five other things are now the bottleneck, and you can't get any (or much) further improvement until you remove *all* of them. (The team only goes as fast as the slowest member, a chain is only as strong as its weakest link, yadda yadda yadda...)
And then there's a myriad of other issues such as design complexity and added complexity in test and verification of the chip (more complexity = more time = slower time-to-market). And although slower clock speeds don't necessarily mean less performance, clock speed still has great marketing value. From a technical perspective, I think AMD's model-number scheme makes more sense since it removes some of the impact of the clock-speed = performance misconception, but from a marketing perspective, I think Intel's got it right in making chips with insane clock speeds (3GHz = 333 *pico*second clock cycle, and that's not even counting the ALUs that run at twice the nominal frequency - that just *sounds* impressive, which is the type of stuff that helps to sell these things to the less clueful consumer or manager).
On an unrelated note, the "Opteron" has to be one of the better chip names so far. It just sounds like it could be the name of some bad-ass decepticon. ("AMD announces its new flagship processor Megatron.")
We need to dump the x86, and go to an architecture with many GP registers (say 2K or so) and a flat address space. The Alpha or the PowerPC are closer to the ideal. This proposal is just another ugly hack which tries to get around the fundamentally stupid limitations of the x86 architecture, and makes it still more confusing and harder to use. Just when I thought that segmented memory was the ultimate in futile stupidity, we get the Registermap and Registermapcontrol registers. Just say no to Intel!
The x86 is the ugly mess that it is because it has tried to maintain backwards compatibility with the 8080. Each step of the way, that backwards compatibility could be justified, sort of. But when you must justify, you're wrong, and Intel's CPUs are a prime example of crap outselling technically superior product.
I've programmed 8080's, and I can tell you it left a lot to be desired, in comparison to its contemporaries. Everyone says that assembly is hard, but they're talking about Intel assembly when they say that. Vax assembly was a breeze, in comparison. The 6502 wasn't bad, compared to the Intel and Zilog chips.
See what I've been reading.
This is a significant advantage, as minor operations that need to loop and keep track of a few things need not touch RAM at all and this keeps things extremely fast, as anyone tech savvy should be able to tell you.
I'm not quite sure about the other recommendations ChipGeek makes, plus I only really skimmed the article, but an increase in GP and FP registers on the x86 platform is nothing short of a Good Thing, and the lack of them is one of the reasons I have always shied away from the platform. At bare minimum, this should be a high priority (if not first priority) when deciding the future of x86 design. Remember, it is true that what keeps PPC equal to (well, lately PPC has fallen behind a bit
Providing more registers will create a remarkable boost in speed for any programs that take advantage of it, and I'm sure if it was going to happen all the major x86 OSes would jump at the opportunity. After all, a faster OS makes for more speed overall.
So, let's conclude:
CAn'T CompreHend SARcaSm?
I remember the venerable TMS9900 (TI's 16 bit CPU family from the late 1970's) implemented register mapping. Basically you'd load a 16 bit value into a "workspace register". This was a pointer to a 32 byte block of memory which the CPU would treat as 16 16-bit GP registers. This made context switching VERY fast, especially if the memory area were part of the 256 bytes of onboard memory (available on some members of the 99xx family such as the TMS9995). Fast context switching was pretty important in the days of 3MHz processors :)
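For anyone who hasn't seen the workspace trick, here is a rough Python sketch of the idea as described above: the "registers" are just 16 words of ordinary memory, and a context switch is a single pointer change. The class and method names are made up for illustration, and this models only the register mapping, not the real TMS9900 instruction set or what BLWP does with the old context.

```python
class WorkspaceCPU:
    """Toy model: 16 16-bit registers living in RAM at the workspace pointer."""

    def __init__(self, mem_bytes=65536):
        self.mem = bytearray(mem_bytes)  # plain byte-addressed memory
        self.wp = 0                      # workspace pointer (hypothetical name)

    def _reg_addr(self, n):
        if not 0 <= n < 16:
            raise ValueError("register number must be 0..15")
        return self.wp + 2 * n           # each register is one 16-bit word

    def read_reg(self, n):
        a = self._reg_addr(n)
        return (self.mem[a] << 8) | self.mem[a + 1]  # big-endian word

    def write_reg(self, n, value):
        a = self._reg_addr(n)
        value &= 0xFFFF
        self.mem[a] = value >> 8
        self.mem[a + 1] = value & 0xFF

    def switch_context(self, new_wp):
        """Swap in a whole new register set by changing one pointer."""
        old_wp = self.wp
        self.wp = new_wp
        return old_wp


cpu = WorkspaceCPU()
cpu.switch_context(0x8300)       # task A's 32-byte workspace (address is arbitrary)
cpu.write_reg(3, 0x1234)
cpu.switch_context(0x8320)       # task B gets its own 16 registers
cpu.write_reg(3, 0xBEEF)
cpu.switch_context(0x8300)       # back to task A, nothing was saved or restored
task_a_r3 = cpu.read_reg(3)
```

The point is that no register contents are copied on a switch, which is why it was so fast; the price is that every register access is a memory access.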
These CPU's had one of the easiest to program assembler languages I've ever seen, right up there with the DEC VAX assembler (they're very similar languages really).
Guess I'm showing my age :)
Actually WHY are we naming registers explicitly in our assembly? It seems that a lot more flexibility would be had by letting the hardware worry about all this. Compilers & high-level languages already hide the fact that there's a shared resource being managed in the background. Tying operations to the idea of an explicit location is inflexible. Adding registers only takes us from, say, r1-r5 to r1-r10, and is still inflexible because of the explicit naming. Better if one could, say, have things like MOV 'orange' to 'tangerine', or ADD 'PI' to '1/3', without worrying explicitly about what goes where. This gives the low-level programmer the same benefits that his higher-level cousins enjoy. This gives the chip designer more capabilities when it comes to making changes behind the scenes without breaking things up front. Of course the constrained resource is still going to be in effect; however, how many different variables are in a manageable segment of code?
While turning the processor inside-out would help, the problem is as much embodied within a very large code base as it is within the processor. Let's say you have code that only understands that there are 5 registers, r1-r5. This fact is explicitly encoded in the code. You'd have to change a lot of code. Code that, as someone pointed out, many people don't have the source to. We however could move SIMD to memory more easily.
Many years ago I learned assembler, first on a Z80, then on a 6502. When I learned the power of zero page addressing, yes, I thought, way to go. I left behind my computing hobby to become an international truck driver; for about 10 years this is what I did. Seven years ago, events occurred leading me to take up my old hobby. I tried to learn the 86, and gave up after a while, thinking what was the point in banging my head against such a mess.
What they should have done is kept the 6502 architecture and scaled it up. The architecture of the 6502 was wonderful. Sixteen bit address bus, eight bit data bus, same as the Z80; the clever bit with the 6502 was zero page addressing, which basically gave 256 registers as well as the three GP registers. The idea being the CPU could access the bottom page in memory with just eight bits in the address field, and zero page could be used as index registers. I can't rightly remember all the operations that could be performed on zero page as opposed to the A, X and Y registers, but I remember it leading to good tight code. The same architecture in a 32 bit address space, ah the dreams.
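To show why zero page felt like extra registers, here is a small Python sketch that encodes a 6502 LDA (load accumulator) for a given address. The opcodes 0xA5 (zero page) and 0xAD (absolute) are the standard ones; the function name and the "pick the shortest form" policy are just my illustration of what an assembler does.

```python
LDA_ZERO_PAGE = 0xA5  # LDA with a one-byte zero page address
LDA_ABSOLUTE = 0xAD   # LDA with a two-byte absolute address


def encode_lda(address):
    """Return the machine-code bytes for LDA from the given address."""
    if not 0 <= address <= 0xFFFF:
        raise ValueError("the 6502 address space is 16 bits")
    if address <= 0xFF:
        # Bottom page of memory: the address fits in eight bits.
        return bytes([LDA_ZERO_PAGE, address])
    # Anywhere else: low byte first, then high byte (little-endian).
    return bytes([LDA_ABSOLUTE, address & 0xFF, address >> 8])


short_form = encode_lda(0x0010)  # two bytes
long_form = encode_lda(0x1234)   # three bytes
```

The zero page form is one byte shorter and also takes one cycle less on the real chip, which is where the tight code came from.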
It's called an elephant's trunk whereas it is in fact, an elephant's nose, a nose by any other name would smell as sweet
Did you see the part about adding the extra registers that would allow you to access all the other registers without jumping through hoops? How the hell is a compiler going to do that? I'll tell you. It's not, because the compiler would have to jump through the hoops. With the RegisterMapControl (RMC) you would be able to access all registers without using multiple shifts and without having to go through specific sequences of assembly code to get at the contents of certain registers. This is a *hardware* issue, not a software issue. If you had read and understood the article you would know this, because this guy at ChipGeek is talking about assembler, which is what any compiler outputs. In addition, MMX is only for multimedia instructions (duh) and the article specifically talks about speeding up general purpose applications. READ, READ, READ... If you don't understand then don't post. This goes for moderators too. This should not have been modded up to +5 insightful because it isn't, and it's completely off base from what the article was talking about.
JOhn
Campaign for Liberty
Oh, someone forgot to tell you that cache size is only 1/2 of the equation. The other is cache latency. A basic equation from comp-arch is that the sum of each cache/memory level's latency*hitrate gives the effective latency of the memory system.
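A quick Python sketch of that equation, for the curious. I'm assuming each level is given as (total latency in cycles to get data from that level, hit rate among the accesses that reach it), with main memory as a final level that always hits. The function name and all the numbers are made up for illustration, not measurements of any real chip.

```python
def effective_latency(levels):
    """Sum of latency * (fraction of all accesses served) over every level.

    levels: list of (latency_cycles, local_hit_rate); the last entry
    should have a hit rate of 1.0 (main memory always "hits").
    """
    remaining = 1.0  # fraction of accesses not yet satisfied
    total = 0.0
    for latency_cycles, local_hit_rate in levels:
        served = remaining * local_hit_rate
        total += served * latency_cycles
        remaining -= served
    return total


# Hypothetical hierarchy: L1 = 2 cycles, L2 = 10 cycles, memory = 200 cycles.
hierarchy = [(2, 0.90), (10, 0.80), (200, 1.0)]
avg_cycles = effective_latency(hierarchy)

# Same hierarchy with a slower L1: 3 cycles instead of 2.
slower_l1 = [(3, 0.90), (10, 0.80), (200, 1.0)]
avg_cycles_slower = effective_latency(slower_l1)
```

With these made-up numbers the 2% of accesses that go all the way to memory account for more than half of the average, and one extra cycle of L1 latency costs 0.9 cycles on average because almost every access pays it.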
This is great, and a first order analysis concludes that you want a really big, really fast cache. The only problem is that the bigger the cache, the longer (in real time) it takes to fetch a row of data. The second part of the problem is that the processor has a fixed number of cycles (directly computable from the fetch->execute stage count) of latency it can tolerate. As the cycle times of the processor get faster (higher GHz), the amount of real time it can tolerate gets smaller. This means that if your processor can only tolerate a 2 cycle load-to-use time, then you should try really hard to get an L1 that can satisfy this. So, in this example, 2 times the cycle time is the maximum fetch time the cache should have. Then it's easy: you build the biggest cache that you can afford which can also satisfy this time requirement. Then, because the L1 probably doesn't have a particularly good hit rate, you repeat the process for an L2 with a slightly higher hit rate and a slightly slower latency. Continue adding cache layers until you either run out of money or you get an acceptable memory latency.
A second order analysis also includes looking at real life application hit rates and patterns, then taking that data and using it to help make your decisions about cache size and latency. The Intel engineers aren't idiots (nor are the ones at AMD, IBM, etc.); there is a real reason the cache is the size it is. If it didn't cost anything to double the L1, in either performance or money, then you can be sure that any given processor would have 2x as much L1. Simple 0 order analysis (my cache/clockrate/dick is bigger) rarely tells even 10% of the story.
Why?
No really, why do you need the original source to compile something?
Seems to me that "assembly language" and "byte code" are languages just like p-code or Fortran.
-- this is not a
You are correct, I was going to point this out myself. Every 'RISC' arch I've used (even the ARM) has some 'save a group of registers' instruction. But...!!!! There is a cost. Most of these instructions take many cycles to execute, so they might just as well be 10+ instructions. Latency is latency. The real savings is in code density. The real cost is in processor complexity. Besides the fact that this instruction and a couple of others usually add an extra decode stage which breaks the 'complex' instruction into more fundamental simple instructions, there is an exception handling issue. What happens if the instruction causes an exception (a page fault, or an interrupt fires) halfway through execution of the instruction? In particular, for load multiple, what happens when the instruction overwrites the register which contains the source address (different PPC/POWERs handle this differently)? Ach ich, ugly problem, all to save a few instructions back when memory was expensive.
If there was any sense to this comment, the x86 would have proved such a disaster it was abandoned ten years ago. Many people think it should have been, that its continued existence is some bizarre aberration of rational forces.
In actual fact, the ugliness of the duckling was less of an impediment than advertised.
There are several consequences of large, flat register sets. First of all, if your register set greatly exceeds the number of in-flight instructions, you have a lot of extra transistors in your register set sitting there, on average, doing nothing. Well, not nothing. They are sitting there adding extra capacitance and leakage to your register file, increasing path length, cycle times, power dissipation, and routing complexity.
Second effect: large register sets increase average instruction length. Larger average instruction lengths translate into a larger L1 instruction cache to achieve the same hit ratio. PPC requires a 40% larger I-cache to achieve the same effectiveness as the x86 I-cache.
Third effect: context switches take longer. If you want to actually use all those registers, your process has to save and restore them on every context switch.
Finally, there is the register set mirage. Modern implementations of x86 have approximately 40 general purpose registers. Only you can't see most of them. Six of these can be named to the instruction set at any given time. The others are in-flight copies of values previously named with the same names. This all happens transparently within an OOO processor model.
If x86 only had six GP registers in practice, it really would have died ten years ago. What it actually has is six GP registers you can name at any one time, which means only six GP registers you have to load and store on context switches, etc.
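A toy Python sketch of the renaming idea, to make the "mirage" concrete. The six names and the figure of 40 physical registers come from the description above; the class, the free-list policy, and the fact that nothing is ever retired or freed are my simplifications, not how any real core is built.

```python
class RenameTable:
    """Maps a few architectural register names onto a larger physical file."""

    def __init__(self, arch_regs, num_physical):
        self.free = list(range(num_physical))  # unused physical registers
        self.mapping = {name: self.free.pop(0) for name in arch_regs}

    def rename(self, dest, sources):
        """Rename one instruction 'dest = op(sources)'.

        Sources read whatever physical register currently holds the name;
        the destination gets a fresh physical register, so an older
        in-flight value with the same name is not overwritten.
        """
        physical_sources = [self.mapping[s] for s in sources]
        physical_dest = self.free.pop(0)  # real hardware frees these at retirement
        self.mapping[dest] = physical_dest
        return physical_dest, physical_sources


table = RenameTable(["eax", "ebx", "ecx", "edx", "esi", "edi"], num_physical=40)

first = table.rename("eax", ["ebx"])   # eax = f(ebx)
second = table.rename("eax", ["ecx"])  # eax = g(ecx), independent of the first
third = table.rename("edx", ["eax"])   # edx = h(eax), reads the newest eax
```

The two writes to eax land in different physical registers, so the hardware can run them in parallel even though the program only ever said "eax".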
What did die ten years ago was the notion that convenience to the human assembly language programmer was worth a hill of beans. Good architectures are convenient to the silicon and the compiler.
Other aspects of x86 have proved more serious than the shortage of namable GP registers. Too many instructions change the flag register, affecting too many patterns of flag bits. That's hell for an OOO processor to patch back together. The floating point stack was an abomination. Lack of a three operand instruction format is another significant liability.
On the other hand, the ill reputed RMW (read/modify/write) instruction mode is 90% of the reason the Athlon performs as well as it does. You get two memory transactions for the price of one address generation, translation, and cache contention analysis. It amounts to having the entire L1 cache available as a register set extension every other clock cycle (leaving half of your L1 cache cycles for other forms of work).
Having someone comment on the x86 is an excellent litmus test of the capacity for someone to dig deeper than their shallow preconceptions of elegance. If it were anything other than the despised x86, its ability to scale from 4.77MHz to 10GHz would have been considered a marvel of engineering soundness. Sometimes ugliness has lessons to teach us. Who among us is prepared to listen?
I guess most people don't comprehend the "red queen" nature of processor scaling. You have to run as fast as you can to stay right where you are.
Increasing clock speed is not a linear gain. Let's imagine we scale the 66MHz 486 to several GHz without making any significant changes to the core. How fast would it run? It would be stalled three clock cycles out of every four, or worse. It wouldn't run at 10% of the speed of a modern core at the same clock speed. That order of magnitude constitutes a long series of "paltry gains" paying the price for maintaining linearity while clock frequency takes all the credit. And I'm not even being fair to the paltry gains, because IPC has indeed increased greatly while latency hazards have scaled by several orders of magnitude.
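Some back-of-the-envelope Python for the "stalled three cycles out of four" claim. I'm assuming a fixed 70ns memory latency, 0.02 cache misses per instruction, and a base CPI of 1 with no stalls; all three numbers and the function name are hypothetical, picked only to show the shape of the effect.

```python
def stall_fraction(clock_hz, mem_latency_s, misses_per_instr, base_cpi=1.0):
    """Fraction of cycles spent waiting on memory for an unchanged core."""
    miss_penalty_cycles = mem_latency_s * clock_hz    # same nanoseconds, more cycles
    stall_cpi = misses_per_instr * miss_penalty_cycles
    return stall_cpi / (base_cpi + stall_cpi)


MEM_LATENCY = 70e-9   # assumed memory latency in seconds
MISS_RATE = 0.02      # assumed misses per instruction

at_66mhz = stall_fraction(66e6, MEM_LATENCY, MISS_RATE)  # about 0.08
at_3ghz = stall_fraction(3e9, MEM_LATENCY, MISS_RATE)    # about 0.81
```

Same core, same memory, same program: at 66MHz a miss costs under 5 cycles, at 3GHz it costs 210, and the core goes from mostly working to mostly waiting.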
Stack machines are efficient in terms of economy of opcodes and economy of specification. They're inherently serial beasts, though, unless you want to work extra hard and "registerize" the stack.
Registerizing the stack is basically register renaming that has to take into account that every instruction might rename the entire register set.
For a non-performance critical embedded system with tight power constraints, it might be a good match. For top-speed computational performance, you just don't get the parallelism out of a stack machine. At least, I can't see how without jumping through a lot of hoops.
Of course, I might be biased. The CPU I program lets you issue 8 instructions per cycle and has 64 32-bit registers. It can read about 30 registers and write 18 registers every cycle. I just can't imagine trying to write the highly parallel code I write on a stack machine!
--JoeProgram Intellivision!
The result is the ability to allow any existing assembly instruction to pull-data-from and write-data-to any alternate register. And this without having to modify or extend the x86 instruction set in any way shape or form.
:)
I propose the introduction of two new 32-bit hardware registers and four new assembly instructions. Simple, isn't it?
I browse at +5 Flamebait- moderation for all or moderation for none.
No, thank YOU for your excellent first post.
And a lot of code in use today (sadly) was not written in languages as portable as C
Hehehehehehehehehehehehe. I think someone's never had to port C code:-)
I said:
Actually, it would have to be in either exclusive or modified states. If it couldn't be in the Modified state, then how would you use these regs?
--JoeProgram Intellivision!
I disagree with the assessment that there is something wrong with the IA-32 opcode map. True, it's complex, it doesn't provide a lot of register flexibility; but compilers and internal register renaming make up for a lot of that.
What is truly brilliant about the IA-32 instruction set is that it compresses very nicely. Try to write a useful function in 64 bytes on any RISC architecture, and you'll see why.
Although it wasn't designed for this at the time, this has a very positive effect on performance - if we can squeeze more instructions into a smaller space, we have a smaller i-cache footprint, which definitely speeds things up, considering the memory bus bandwidth is the limiting factor, not the CPU.
I understand his lack of appreciation for all the stack references, but I don't think this is the proper solution. The d-cache already catches stack reads - if there were a way to map a page as non-cache writeback, and the OS mapped the stack pages appropriately, flushing with a writeback only before a context switch, I think you'd see memory bandwidth increase significantly. True, this may break a number of things, but those problems can be worked around. This would help a large class of stack-intensive applications - and many applications and servers written for performance are already stack intensive because of the d-cache read benefit and easy allocation of buffer pools (malloc() is usually expensive).
Course I don't have any of my architecture books on me right now, but I wouldn't be too terribly surprised if there is already a way to do this.
640Kb of cache would be enough for everyone...
All coding is serial... you can never avoid that. You are taking the result of an operation and passing it back into that operation's inputs.
;-)
Registers just help this process saving memory reads. Although you still have to read that data from somewhere.
I've programmed the 80 series TI DSP chips and it blows; basically I've never coded in such a difficult way. The trick with stack machines is to make them simple and fast and load lots of them onto a die - they each have their own addressable memory space rather than cache, and of course each has a stack.... wire them together on a bus and go parallel with threads rather than ILP. Much like the transputer.
Anyway.... I'm going to get drunk
"None of this shit works" -W.Shatner
Are you kidding? The last place I worked for had nothing in the way of version control or source tracking. Code, compile, put into production.
They are still running stuff I coded and there is no source for it. Why? Because I *deleted* the source after each version. They never signed a work-product agreement with me, so I decided that they had no right to any code I produced. They were welcome to the .exe files, though.
As you may have surmised I was not a popular fellow. But that didn't bother me - my greatest desire was for the place to catch fire and burn with most employees trapped inside. It never happened, but I escaped and mellowed
Writers imply. Readers infer.
> I get the following error messages at bootup, could anyone tell me
> what they mean?
> fcntl_setlk() called by process 51 (lpd) with broken flock() emulation
They mean that you have not read the documentation when upgrading the
kernel.
-- seen on c.o.l.misc
- this post brought to you by the Automated Last Post Generator...