Slashdot Mirror


Revolutionizing x86 CPU Performance

NickSD writes "ChipGeek has an interesting article on increasing x86 CPU performance without having to redesign or throw out the x86 instruction set. Check it out at geek.com."

144 of 296 comments

  1. Why? by Anonymous Coward · · Score: 4, Insightful

    Shouldn't we improve bus speed, data access speeds, etc etc first? After all, the bottleneck is not the processor anymore...

    1. Re:Why? by io333 · · Score: 3, Interesting

      Because if you had read the article you'd realize that this is essentially a zero cost, backwards compatible method of dramatically increasing program execution speed several orders of magnitude -- so the question is really, "Why not?"

    2. Re:Why? by Anonymous Coward · · Score: 3, Insightful

      You don't get it...

      What Intel is currently doing is putting a turbo on an old and obsolete architecture.

      By having more GP registers, you could do the same job more easily and with better performance (and the code would be easier to read if you write ASM).
      As it is now, you need too many memory accesses for simple operations.
      With more registers, you would need less clock speed.

      It's not all about MHz's.

    3. Re:Why? by hatchet · · Score: 2, Informative

      You do not understand how a computer really works. If you have more registers, more instructions for manipulating those registers, and finally more cache, you don't need high bus speeds. The processor won't need to get much data from memory anyway, because it will have 99% of what it needs already in registers & internal cache.
      We must not forget that most operations a processor does are data movements and not calculations.

      All three x86 problems described by the article's author are fixed by the IA-64 architecture, but not by AMD's x86-64.

    4. Re:Why? by OttoM · · Score: 5, Insightful
      Shouldn't we improve bus speed, data access speeds, etc etc first? After all, the bottleneck is not the processor anymore...

      No. Because the whole point is preventing memory access. High bandwidth busses are very expensive. If you have a lot of registers, you can avoid memory accesses, making instructions run at full speed.

      The best way to reduce the impact of a bottleneck is not making the bottleneck wider. It is making sure your data doesn't need to travel through the bottleneck.

      After that, it doesn't hurt to make the bus bandwidth bigger.

    5. Re:Why? by Jugalator · · Score: 2

      dramatically increasing program execution speed several orders of magnitude

      Where did you read this?

      Also, even with the hardware bottlenecks the Anonymous Coward mentions?

      You ask yourself "why not?"... I can only ask myself "how?" :-) Sure, I see how it's supposed to work in theory with zero bottlenecks, but how it works in practice is a completely different thing.

      --
      Beware: In C++, your friends can see your privates!
    6. Re:Why? by DustMagnet · · Score: 5, Interesting
      I read the article. From what I can see this guy writes lots of assembly, but knows very little about how processors are designed. The huge gains you all see have already been made by register renaming and caches. There might be some gain left by giving the compiler direct control over these, but at the cost of much complexity in the register renaming hardware. The P4 has a very deep pipeline. Looking for register conflicts is hard enough without adding another layer of redirection.

      The fact that the article never mentions register renaming shows the author never did any research into this topic before writing.

      --
      'SBEMAIL!' is better than a goat!!
    7. Re:Why? by sql*kitten · · Score: 5, Insightful

      Because if you had read the article you'd realize that this is essentially a zero cost, backwards compatible method of dramatically increasing program execution speed several orders of magnitude -- so the question is really, "Why not?"

      It does not matter how fast your CPU is if it spends a significant amount of its time waiting for main memory access. All that happens is that it's doing more NOPs/sec, which isn't terribly useful. That's why industrial-grade systems have fancy buses like the GigaPlane.

    8. Re:Why? by PetiePooo · · Score: 3, Interesting

      This may boil down to the generic do-it-in-hardware vs. do-it-in-software debate. Do we reorder the instructions in hardware (a la Pentium and Athlon), or make the compiler do it (a la Itanium)? Do we make the hardware predict branches or have the compiler drop hints? Register renaming, as done by modern RISC-core x86 implementations, likely addresses many of the issues that he proposes an extension and a smart compiler (or assembler) would solve. Now, a 386, that would benefit from his technique.

      However, if we're going to revise that architecture, I say we add MMX and call it a 486. Then, we can add SSE and call it a Pentium.. And then, ...

      Oh, wait. nevermind.

    9. Re:Why? by Junks+Jerzey · · Score: 5, Insightful

      With more registers, you would need less clock speed.

      Have you ever looked at the function entry and exit for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win.

    10. Re:Why? by Kz · · Score: 5, Informative

      Damn Right!

      Register renaming already does what's being proposed here, but transparently. In fact, most of the instruction reordering done by a good optimizing compiler (and later by the out-of-order dispatching unit) aims to increase parallelism on register usage.

      Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.

      P4 processors have 128 registers available for register renaming. Using all of them is not so easy, so Hyperthreading (still only on Xeon) tries to bring two different processes into the instruction mix, keeping their renaming maps separate, so the dispatching unit has more noncolliding instructions ready for execution. This won't make one CPU as fast as 2, but it does keep that insanely deep pipeline from getting filled with bubbles (or would that be 'empty of instructions'?)

      --
      -Kz-
    11. Re:Why? by fstanchina · · Score: 3, Interesting

      And you don't understand how modern processors really work. First, they have several levels of cache memory exactly because you don't want to go out to main memory too often. Second, they have many more registers than the assembly instructions can see: register renaming, speculative execution and all those tricks reorganize instructions so that the CPU core doesn't really have to move data back and forth between memory and registers so often.

      IANACPUD (I'm not a CPU designer) so I'm not going to try to describe this stuff further, but articles abound. Here are some: Into the K7, Part One and Into the K7, Part Two

    12. Re:Why? by mgblst · · Score: 4, Informative

      Intel is constantly adding new commands and registers to the CPU. That is the whole point of the article: it could just as easily add them in a way that greatly increases the execution speed of ALL programs, not just a few!!!

    13. Re:Why? by Christopher+Thomas · · Score: 2

      You do not understand how a computer really works. If you have more registers, more instructions for manipulating those registers, and finally more cache, you don't need high bus speeds. The processor won't need to get much data from memory anyway, because it will have 99% of what it needs already in registers & internal cache.

      Unfortunately this is not true. The working set of most programs is either small enough to already fit in the caches of most processors, or large enough that you can throw as much cache as you want at it without making a dent.

      Regarding registers, the only thing that registers affect is L1 accesses. Think of the register file as being a level-zero cache. Having a larger GP register file does not change the working set of the program; it only reduces the number of stalls waiting for L1 data. The load on the system memory bus will be the same.

      In summary, if the memory bus is a bottleneck with current processors, it'll be a bottleneck for this proposed processor too.

    14. Re:Why? by p3d0 · · Score: 5, Interesting
      Have you ever looked at the function entry and exit for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win.
      Ridiculous. You're saying that architectures with lots of regs are inferior because they make you save lots of registers at certain times, but reg-starved architectures make you save them all the time, all over the place, in any code that feels the slightest register pressure.

      At best, the problem you describe indicates those architectures use too many callee-save registers in their calling conventions. Having more caller-save registers is a pure win from this perspective.

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    15. Re:Why? by Junks+Jerzey · · Score: 2

      Ridiculous. You're saying that architectures with lots of regs are inferior because they make you save lots of registers at certain times, but reg-starved architectures make you save them all the time, all over the place, in any code that feels the slightest register pressure.

      No. What I'm saying is that leaving things in memory, rather than pulling them into registers and potentially having to spill registers as a result, can often be more efficient. It's been shown that passing parameters in registers can be a bad thing sometimes, because you often immediately have to move those registers to non-transient registers, so there's no win.

    16. Re:Why? by Chris+Burke · · Score: 3, Informative

      Register renaming already does what's being proposed here, but transparently.

      Well, not exactly. Renaming takes care of the case where two things write, say, EAX, by allowing both to go with different physical registers. I.E. you don't have to stall because you only have one architected "EAX" register.

      However, your program is still limited to only 8 visible values at any instant. So when you need a 9th -thing- to keep around, you have to spill some registers onto the stack. Register renaming doesn't solve this problem.

      His idea would, but it's still a stupid idea. :P
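
      A rough sketch of that distinction, assuming a toy renamer (the `Renamer` class and its free list are hypothetical, not a model of any real core). Each write to an architectural register gets a fresh physical register, so two writes to EAX stop colliding, yet the program can still only name eight registers at once:

```python
# Toy register renamer: a hypothetical illustration, not a real CPU model.
ARCH_REGS = ["EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "EBP", "ESP"]


class Renamer:
    def __init__(self, num_physical=128):
        # Pool of physical registers not currently assigned to a name.
        self.free = list(range(num_physical))
        # Every architectural register starts mapped to some physical one.
        self.table = {name: self.free.pop(0) for name in ARCH_REGS}

    def write(self, arch_reg):
        """A new write gets a fresh physical register."""
        phys = self.free.pop(0)
        self.table[arch_reg] = phys
        return phys

    def read(self, arch_reg):
        """A read uses whichever physical register currently holds the name."""
        return self.table[arch_reg]


renamer = Renamer()
first = renamer.write("EAX")
second = renamer.write("EAX")

print(first != second)     # the two EAX writes use different physical registers
print(len(renamer.table))  # the program still sees only 8 register names
```

      The toy never returns registers to the free list, which a real design would do at retirement; it only shows why a large physical file does not give the compiler a ninth name to work with.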

      --

      The enemies of Democracy are
    17. Re:Why? by Chris+Burke · · Score: 2

      Yes, but the point is that in x86 a huge portion of memory accesses are simply loading and storing values on the stack that could have been stored in registers. Granted, these usually end up in cache hits, but that's also more cache pressure (causing other accesses to miss).

      It's a legitimate problem, and solving it would increase performance. If you do it right, anyway.

      And of course, a high performance CPU does matter. Granted, a GHz CPU won't help if you're using ferrous-core memory, but with modern memory (DDR, RDRAM) the CPU -can- still be the bottleneck.

      --

      The enemies of Democracy are
    18. Re:Why? by Chris+Burke · · Score: 2

      Of course if you had read the article and had a clue, you'd realize that there is no way in hell that this idea will increase program execution speed "several orders of magnitude". Or one order. I'd guess tripling the number of registers would give you 30%, tops. That's about 0.11 orders of magnitude, and that's not accounting for all the ways in which this guy's idea would slow down the processor.
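
      For anyone checking the arithmetic: the number of orders of magnitude in a speedup is just its logarithm. A quick sketch, treating the 30% guess as a 1.30x speedup (the variable names are only for illustration):

```python
import math

speedup = 1.30  # the 30% guess, expressed as a ratio

decimal_orders = math.log10(speedup)  # ordinary (base-10) orders of magnitude
binary_orders = math.log2(speedup)    # "binary" (base-2) orders of magnitude

print(round(decimal_orders, 2))  # 0.11
print(round(binary_orders, 2))   # 0.38
```

      Either way it is well short of a single order of magnitude, let alone several.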

      --

      The enemies of Democracy are
    19. Re:Why? by MajroMax · · Score: 5, Informative
      Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.

      Although I would like to take this opportunity to point out that AMD's x86-64 (Opteron) architecture increases the number of GP and XMM (used for SSE instructions) registers to 16 each.

      --
      "Evil company X is threatening to restrict our rights! Let's all get together to stop--OOOH! SHINEY!!!" -- AC
    20. Re:Why? by Chris+Burke · · Score: 2

      What I'm saying is that leaving things in memory, rather than pulling them into registers and potentially having to spill registers as a result, can often be more efficient.

      That's the whole point of having sufficient registers, isn't it? By having more, you can keep yourself from having to spill registers.

      Basically, you have two options -- a) you have to load/store the value from/to memory each time you use it, because you don't have enough registers and b) you keep it in a register, and maybe have to load/store it or another register later due to register pressure.

      I have a hard time believing that a) could be more efficient. I have a hard time believing that you could come up with a contrived code sequence that makes a) better without being blatantly unoptimized.

      It's been shown that passing parameters in registers can be a bad thing sometimes, because you often immediately have to move those registers to non-transient registers, so there's no win.

      Moving a register to another is usually a one, maybe zero cycle operation. You may have to store it, if it's in a caller-save register and you make another call. But again, I have a hard time seeing how passing an argument on the stack -- requiring at least one store and one load, possibly requiring more stores/loads (because you made another call and had to save the register) -- is better than passing in a register -- where you require at minimum 0 stores/loads, possibly more because of subsequent calls. I can see how they would be -equal-, but how would it possibly be -worse-? And when an architectural feature has the potential for good gain, and never has negative gain, the only question left is if the area/circuit delay is worth it.

      Not that these RISC architectures are perfect. Personally, I don't think there should be any callee-save registers. Let the compiler's register allocator decide what registers need to be saved prior to a call, instead of having to save a swath of them because the caller might have wanted them saved. :P

      --

      The enemies of Democracy are
    21. Re:Why? by Chris+Burke · · Score: 2

      Okay, but 30% is still only about 0.38 binary orders of magnitude. :)

      --

      The enemies of Democracy are
    22. Re:Why? by p3d0 · · Score: 2
      Personally, I don't think there should be any callee-save registers. Let the compiler's register allocator decide what registers need to be saved prior to a call, instead of having to save a swath of them because the caller might have wanted them saved.
      It's not that simple, or everyone would be doing this. I could just as easily say this:
      Personally, I don't think there should be any caller-save registers. Let the compiler's register allocator decide what registers need to be saved prior to a function body, instead of having to save a swath of them because the callee might clobber them.
      I haven't given this a lot of thought myself, but it seems most platforms have come to the conclusion that a mix of caller- and callee-saved registers is best. (Except that oddball SPARC of course.)
      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    23. Re:Why? by Chris+Burke · · Score: 2

      Not a bad idea, actually. You'd still have to do the store, in order to maintain correctness, but you could bypass doing the load. Few problems, though. The stack is not accessed solely through push/pop (in fact for local variables it is rarely the case that they are), and you can add/subtract from ebp at will. So you can't completely throw this off on the rename mechanism -- you have to be able to connect those shadow registers with the addresses they are shadowing at execution time. Keeping track of which variables can be shadowed is a problem, for the same reason. That could stall rename on execution...

      I suspect by the time you eat the penalty solving these problems would incur, you might be as well off just spilling into your cache. :)

      --

      The enemies of Democracy are
    24. Re:Why? by Chris+Burke · · Score: 2

      One comment, other than that I foolishly mixed up EBP and ESP. Hey, they're both just registers to me, I don't care what the software is doing with 'em. ;)

      Basically, all name changes will happen in the execute stages, but anything that relies on the naming will be stalled in the earlier dependence tracking stages.

      The idea would maintain all of its value if you were restricted to use only immediates with the re-mapping instruction. It'd be a static, compile time thing, but you could "execute" it in the re-map stage (presumably, just before renaming to physical registers), which wouldn't cause stalls (though I shudder to think of the logic for doing that in a 3 or 4 wide machine). But it's still not a very good idea. :)

      --

      The enemies of Democracy are
    25. Re:Why? by Chris+Burke · · Score: 2

      It's not that simple, or everyone would be doing this.

      If no engineer ever did something that was sub-optimal, there wouldn't be anything left for other engineers to do. :)

      I could just as easily say this:
      Personally, I don't think there should be any caller-save registers. Let the compiler's register allocator decide what registers need to be saved prior to a function body, instead of having to save a swath of them because the callee might clobber them.


      Ah, but you see, those statements are not equal in magnitude.

      For callee-save, the compiler knows what registers the function uses over the course of the function, and can save just those.

      For caller-save, the compiler knows what registers it is using at the time of the call, and can save only those.

      Registers in use at the time of the call is always going to be less than or equal to the number of registers used over the entire function. Thus caller-save wins.

      Of course, as you can surmise, that's only true if you assume all functions use the same number of registers. You could conceive of a call graph where functions which use many registers frequently call functions that use few, and then callee-save wins. The question then becomes what does the typical call graph look like? What is the optimal combination of caller/callee registers for each call graph? More directly, how does each benchmark of interest perform when compiled with various combinations of caller/callee?

      Since I'm willing to bet that study has not been performed, I feel okay challenging the conventional wisdom. :)
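
      The trade-off can be put as a back-of-the-envelope count. This is only a sketch under simplifying assumptions: the register sets are made up, and each convention is assumed to spill exactly the registers named in the argument above (live-across-call for caller-save, written-in-body for callee-save):

```python
def caller_save_spills(live_across_call):
    """Caller-save: the caller spills every register still live across the call."""
    return len(live_across_call)


def callee_save_spills(written_by_callee):
    """Callee-save: the callee spills every register it writes in its body."""
    return len(written_by_callee)


# Hypothetical case 1: a small caller invoking a register-hungry callee.
small_live = {"r1", "r2"}
hungry_callee = {"r1", "r2", "r3", "r4", "r5", "r6"}
case1 = (caller_save_spills(small_live), callee_save_spills(hungry_callee))

# Hypothetical case 2: a register-hungry caller invoking a tiny leaf function.
many_live = {"r1", "r2", "r3", "r4", "r5", "r6"}
tiny_leaf = {"r1"}
case2 = (caller_save_spills(many_live), callee_save_spills(tiny_leaf))

print(case1)  # (2, 6): caller-save is cheaper here
print(case2)  # (6, 1): callee-save is cheaper here
```

      Which case dominates depends on the call graph of the code being compiled, which is exactly the open question.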

      --

      The enemies of Democracy are
    26. Re:Why? by Chris+Burke · · Score: 2

      And what about conditional branches nearby? You don't know until the instruction commits what the register names will be. Imagine code which simply conditionally branches around a remap instruction. How do you handle that sanely?

      The same way you handle normal register renaming in the face of branches. If you mispredict the branch, you have to flush the subsequent instructions and re-fetch. You'd have to either checkpoint or repair the RMC map, just like you do the renaming tables.
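
      A minimal sketch of the checkpoint approach, assuming a hypothetical `RenameMap` class that only recovers from the most recent branch (real hardware tracks many branches in flight):

```python
class RenameMap:
    """Hypothetical rename/remap table with checkpoint and restore."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)
        self.saved = None

    def checkpoint(self):
        # Snapshot the table at the moment a branch is predicted.
        self.saved = dict(self.mapping)

    def remap(self, name, target):
        # Speculative update made by an instruction after the branch.
        self.mapping[name] = target

    def mispredict(self):
        # Throw away the speculative updates by restoring the snapshot.
        self.mapping = dict(self.saved)


table = RenameMap({"EAX": 0, "EBX": 1})
table.checkpoint()       # branch predicted here
table.remap("EAX", 9)    # speculative remap on the predicted path
table.mispredict()       # prediction was wrong, roll back

print(table.mapping)     # {'EAX': 0, 'EBX': 1}
```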

      I personally think the remap idea is insane.

      I agree. :)

      The Pentium 4 has 128 rename registers [arstechnica.com] anyway, so it seems like adding more 'architectural registers' is more an opcode formality than anything else.

      Not at all! Physical registers are no replacement for architectural ones. The number of physical registers is essentially the size of your window. They don't stop you from having to store values to memory because you don't have enough architectural registers.

      But yes, the better solution is to just add architectural registers. :)

      --

      The enemies of Democracy are
    27. Re:Why? by Chris+Burke · · Score: 2

      I think you may have misunderstood me. What I was saying is that you could add architectural registers without necessarily adding any physical registers.

      Ah, I see now.

      By the way, is it just me, or does anyone else think that Hammer's 16 register extension is shooting behind the duck? Other high-end RISCs have 32 to 64 registers. The machine I program has 64 and could make use of more in some cases. Perhaps because x86 is fundamentally a memory-operand instruction set, it can get by with fewer registers more easily? RISC-like instruction sets, with their load/store architecture, do end up needing a few more registers for values that are loaded and used immediately.

      That could be. I've also heard that 16 is the number needed by database/server type apps. I've also heard that most of the time 32 reg RISC machines feel about zero register pressure, meaning they have more than enough.

      But really, I'm pretty sure the decision was driven solely by wanting to minimize the impact on instruction size. The REX prefix only has room to allow for 16 registers. Whether that works out or not remains to be seen. :)
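
      To make the 16-register limit concrete: in the x86-64 encoding a register field in the ModRM byte is 3 bits wide, and the REX prefix contributes one extra bit for it. A minimal sketch of that bit arithmetic (the helper name `full_register_index` is made up for illustration):

```python
def full_register_index(rex_bit, modrm_field):
    """Combine one REX extension bit with a 3-bit ModRM register field."""
    if rex_bit not in (0, 1) or not 0 <= modrm_field <= 7:
        raise ValueError("rex_bit must be 0 or 1, modrm_field must be 0..7")
    return (rex_bit << 3) | modrm_field


# Enumerate every register number reachable with 1 + 3 bits.
reachable = sorted(
    {full_register_index(bit, field) for bit in (0, 1) for field in range(8)}
)

print(reachable[0], reachable[-1])  # 0 15
print(len(reachable))               # 16
```

      Going to 32 registers would need a second extension bit per field, which a one-byte prefix with this layout has no room for.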

      --

      The enemies of Democracy are
  2. DivX! Sweet! by von+Prufer · · Score: 4, Funny

    That's pretty sweet how he makes the x86 processor faster by adding commands for divx! This guy knows how to improve Intel architecture for the masses!

    1. Re:DivX! Sweet! by pokeyburro · · Score: 5, Funny

      Other new commands:

      LIE Launch IE
      LMW Launch MS Word
      LME Launch MS Excel
      LMO Launch MS Outlook
      LMOV Launch MS Outlook Virus
      LCNR Launch Clippy for No Reason
      DPRN Display Pr0n
      SPOP Show IE Popup
      SPU Spam User
      SHDR Send Hard Drive Contents to Redmond
      RBT Reboot
      SBS Show Blue Screen

      --
      Lately democracy seems to be based on the skybox, the Happy Meal box, the X-box, and the idiot box.
    2. Re:DivX! Sweet! by WWWWolf · · Score: 3, Funny
      RBT Reboot
      SBS Show Blue Screen

      Argh, get this CISC rubbish out of my sight!

      Real people used stuff like jmp $fce2 for the first, but the latter was a little bit more complex because of the blue part: lda #$06 ; sta $d020 ; sta $d021 ; hlt (of course, hlt is an undocumented opcode, and since C64 boots in less than a second from ROM, it hardly is as frustrating as the bluescreen in Windows).

      =)

    3. Re:DivX! Sweet! by BasharTeg · · Score: 5, Funny

      That's rather like the PPC instruction set!

      LPS - Launch Photoshop
      DGB - Do Gaussian Blur
      ES - Encode Sorenson
      DS - Decode Sorenson
      CSAWEF - Create Switch Ad With Ellen Feiss

      And my personal favorite:

      BICPUWPBIGBASE - Beat Intel CPU With Proprietary Benchmark Involving Gaussian Blurs And Sorenson Encoding

  3. Sounds like an interesting idea... by AceMarkE · · Score: 2, Interesting

    It sounds like a pretty decent idea to me. Granted, I'm no assembly expert (I'm just now in my Microprocessors class, which is based on the Z80), but I don't see how having more registers could be a bad thing. Anything that keeps operations there inside the CPU rather than going out to memory would pretty much have to be faster. I especially like the fact that he's implemented it such that no current code would be affected. THAT is a key point right there.

    Admittedly, even if Intel and AMD decided to implement this, it'd still be a while, and then we'd have to get compilers that compile for those extra instructions, and there's our entire now-legacy code base that doesn't make use of them, and don't forget those ever-present patent issues...

    But yeah. Cool idea, well thought out. Petition for Intel, anyone?

    Mark Erikson

    1. Re:Sounds like an interesting idea... by Anonymous Coward · · Score: 2, Funny

      This is a classic Slashdot comment.

      "Hey I'm currently in my 2nd year at college, but what the heck I think I'm qualified to comment here"

      "I think Intel need to employ this guy, I mean they must have overlooked this"

      "Cool - I wonder if I could think of something like this"

      Don't worry - you will.

  4. Cache is the key by Anonymous Coward · · Score: 3, Insightful
    I've got three words for you: cache, cache and cache.

    Why do you think Pentium Pro was such a huge success that it's still being used in CPU intensive operations? Why do you think Sun Sparc and Digital/Samsung Alpha CPUs trash modern Pentium 4s and Athlons at 500 MHz? Yup. Loads and loads of cache.

    1. Re:Cache is the key by Anonymous Coward · · Score: 5, Informative

      Cache is a huge Intel problem. 20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 only has 32K.

      AMD has 128K L1 since the original Athlon, and had 24K in the K5.

      The Transmeta 3200 and the Motorola G4 both have 96K, the UltraSparc-III has 100K, Alpha had 128K when it died, and HP's PA-8500 has a whopping 1.5MB.

      They may throw big chunks of L2 at the problem, but it seems to me that so little L1 means more time moving data and less time processing...

    2. Re:Cache is the key by Sivar · · Score: 5, Insightful

      I've got three words for you: cache, cache and cache.

      Why do you think Pentium Pro was such a huge success that it's still being used in CPU intensive operations? Why do you think Sun Sparc and Digital/Samsung Alpha CPUs trash modern Pentium 4s and Athlons at 500 MHz? Yup. Loads and loads of cache.

      No. First, Alphas and SPARCS do not trash modern x86 CPUs, the Pentium IV 2.8GHz and Athlon XP 2800+ are the fastest CPUs in the world for integer math and the Itanium 2 is the fastest in the world for floating point math.
      Cache memory is only useful until it is large enough to contain the working set of the primary application being run. Larger cache can improve performance further, but after the cache can contain the working set, the gain is in the single digit percents. The working set of the vast, vast majority of applications is under 512K, and most are under 256K. You'll find that increasing the speed of a small cache is generally more important than increasing the size of the cache.
      Case in point: When the Pentium 3 and Athlon went from a large (512K) to a small (256K) faster cache, performance went up, for the Athlon by about 10% and for the Pentium 3...I don't recall, but around 10%.
      Some desktop apps, like SETI@Home, have a large working set (more than 512K) and DO benefit from large caches, but nothing larger than 1MB would improve performance here either.
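
      The working-set argument can be illustrated with a toy simulation. This is a sketch under stated assumptions: a fully associative LRU cache, uniformly random accesses within the working set, and sizes counted in cache lines. The function name and parameters are invented for the example, and real access patterns are far less uniform:

```python
import random
from collections import OrderedDict


def hit_rate(cache_lines, working_set_lines, accesses=20000, seed=0):
    """Hit rate of a fully associative LRU cache under uniform random access."""
    rng = random.Random(seed)
    cache = OrderedDict()
    hits = 0
    for _ in range(accesses):
        line = rng.randrange(working_set_lines)
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / accesses


# A 256-line working set against caches of growing size.
for size in (64, 128, 256, 512, 1024):
    print(size, round(hit_rate(size, 256), 3))
```

      Under these assumptions the hit rate climbs until the cache holds the whole working set and then stops improving, which is the flattening described above.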

      Most server CPUS, like Alphas and SPARCS, have fairly large caches for the following reasons:

      1) Databases love large caches. They are one of the few applications that can take advantage of a large cache, because they can store lookup tables of arbitrary size in cache. Server CPUs are often used for databases because Joe x86 CPU is just fine for webservers, FTP servers, desktop systems, etc. and is generally faster at them than server CPUs.

      2) Most server class CPUs are fully 64-bit and do NOT support register splitting. On the SPARC64, for example, if you want to store an integer containing the number "42", that integer will take up a full 64 bits regardless of the fact that the register can store numbers up to 18,446,744,073,709,551,615. This larger size increases the cache size needed to store the working set of programs, because all integers (and many other data primitives) require a full 64 bits or more. With x86 CPUs, which support register splitting and have only 32-bit registers, that number could be stored in a mere eight bits. The square root of the number of bits the SPARC requires.

      3) Big servers with multiple CPUS are often expected to run multiple apps, all of which are CPU intensive. If the cache can store the working set for all of them, speed is slightly improved.

      That said, who in their right mind would use an incredibly slow Pentium Pro for a CPU intensive calculation? A Pentium Pro at the highest available speed, 200MHz, with 2MB cache may be able to outperform a Celeron 266, but not by much and only for very specific cache-hungry software. Show me a person that thinks a Pentium Pro with even 200GB of cache can outperform ANY Athlon system and I will show you a person that hasn't a clue what they are talking about.

      Look at the performance difference between the Pentium IV with 256K and with 512K (a doubling) of cache. You will have to do some research to find an application that gets even a 10% performance boost.

      FYI
      If you are interested in competent, intelligent, technical reviews of hardware, you might like
      www.aceshardware.com

      --
      Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
    3. Re:Cache is the key by Sivar · · Score: 2

      (not that the Itanium is an x86 CPU, but it has a far smaller cache than most Alphas, SPARCS, and PA-RISC chips)

      --
      Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
    4. Re:Cache is the key by mmol_6453 · · Score: 5, Insightful

      20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 [geek.com] only has 32K.

      Just for people who don't know, Intel reduced the amount of cache when they moved from the P3 to the P4. And hardware junkies know the performance hit that caused.

      A seemingly unrelated sidenote: Intel wants to move to their IA-64 system, and, since it's not backwards-compatible, they're going to have to force a grass-roots popular movement to pull it off.

      Perhaps they crippled the P4 to make the IA-64 processors look even faster to the general public?

      In any case, I think the quality of the P4 is a sign that Intel wants to make its move soon. (Though losing $150 million, not to mention the context in which they lost it, may set back their schedule, giving AMD's 64-bit system a chance to catch on.)

      --
      What's this Submit thingy do?
    5. Re:Cache is the key by alaeth · · Score: 2, Insightful

      One of the reasons they reduced the size of L1 cache is because it takes up a huge amount of physical die size. If you're trying to reduce the number of bad chips in a batch, the easiest way is to reduce the size of the chip itself.

      More chips that pass = more profit

      It all boils down to money in the end.

      --
      Sig goes here.
    6. Re:Cache is the key by Mr+Z · · Score: 2, Informative

      It wasn't die area so much as clock rate. At smaller and smaller geometries, the transit time for a bit starts going up at some point due to transmission line effects. RC delay goes up since R goes up (your wire got smaller) and C goes up (you got closer to the other wires).

      --Joe
    7. Re:Cache is the key by orz · · Score: 4, Informative

      Intel's processors are not crippled by small L1 cache. Yes, the P3 and P4 L1 caches are WAY smaller than the Athlon L1 cache, but Intel doesn't NEED a large L1 cache, because their L2 cache is extremely fast. Intel tends to have small, extremely fast L1 caches, and makes up for the higher miss rate with fast L2 caches as well. For instance, the P3 L1 cache has a miss rate roughly twice as high as the Athlon's L1 cache, but the P3's L1 miss penalty is roughly 8 cycles (assuming an L2 hit...), less than half the Athlon's L1 miss penalty of 20+ cycles on an L2 hit. Also, the P4's L1 cache, which is even smaller than the P3's, allows them to decrease the L1 hit latency AND run at a substantially higher clock speed than AMD's larger cache.
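
      The miss-rate versus miss-penalty trade-off is easy to put in numbers. A minimal sketch, assuming a purely hypothetical 2% L1 miss rate for the Athlon (the comment gives only the ratio and the penalties, not the absolute rates) and counting only L1 misses that hit in L2:

```python
def l1_miss_cost(miss_rate, miss_penalty_cycles):
    """Average cycles lost per access to L1 misses that hit in L2."""
    return miss_rate * miss_penalty_cycles


athlon_miss_rate = 0.02  # hypothetical absolute value, chosen for illustration

# Figures from the comment: P3 misses about twice as often, but pays about
# 8 cycles per miss; the Athlon pays 20 or more.
p3_cost = l1_miss_cost(2 * athlon_miss_rate, 8)
athlon_cost = l1_miss_cost(athlon_miss_rate, 20)

print(p3_cost < athlon_cost)  # True
```

      Because both costs scale with the same base rate, the comparison comes out the same for any assumed Athlon miss rate: 2 x 8 = 16 is less than 20.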

      For a graphical depiction of the difference between Intel and AMD cache performances, try this link:
      http://www.tech-report.com/reviews/2002q1/northwood-vs-2000/index.x?pg=3
      It was the first thing that came up in a Google search for linpack and "cache size".
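
      The trade-off can be put into numbers with the usual average-access-time formula (hit time + miss rate x miss penalty). A rough sketch follows: the 8 and 20 cycle penalties and the 2:1 miss-rate ratio are the ones quoted above, while the absolute miss rates (4% and 2%) and the 1-cycle hit time are placeholders assumed for illustration, not measurements.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average L1 access time in cycles, assuming every L1 miss hits in L2."""
    return hit_cycles + miss_rate * miss_penalty_cycles

# Assumed placeholder values: same 1-cycle hit time for both designs,
# and a small-cache miss rate twice that of the large cache.
small_fast_l1 = average_access_cycles(hit_cycles=1, miss_rate=0.04, miss_penalty_cycles=8)
large_slow_l2 = average_access_cycles(hit_cycles=1, miss_rate=0.02, miss_penalty_cycles=20)

print(f"small L1 + fast L2:   {small_fast_l1:.2f} cycles per access")
print(f"large L1 + slower L2: {large_slow_l2:.2f} cycles per access")
```

      With these assumptions the small-cache design still comes out slightly ahead per access, before counting any clock speed advantage.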

    8. Re:Cache is the key by Chris+Burke · · Score: 2

      Very good post. But you're wrong on one tiny little thing...

      This larger size increases the cache size needed to store the working set of programs, because all integers (and many other data primitives) require a full 64 bits or more.

      That's not true. These architectures don't support byte -register- access, but they do support byte -memory- access. So you can still take your 64-bit register containing "42", do a byte store, and use only 1 byte of a cache line. When you load it, it will either zero or sign extend the single byte to the full 64-bits.

      However, storing 64-bit pointers does increase the size of cache needed, and this is something that x86-64 will suffer from as well.
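
      A small sketch of the byte store / extending load behaviour described above. It only models the arithmetic, not any particular instruction set, and the function names are made up for the example.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF

def store_byte(register_value):
    """Byte store: only the low 8 bits of the register reach memory."""
    return (register_value & 0xFF).to_bytes(1, "little")

def load_byte_zero_extended(cell):
    """Byte load that fills the upper 56 bits with zeros."""
    return int.from_bytes(cell, "little", signed=False)

def load_byte_sign_extended(cell):
    """Byte load that copies bit 7 into the upper 56 bits."""
    return int.from_bytes(cell, "little", signed=True)

cell = store_byte(42)                      # occupies a single byte
negative_cell = store_byte(-2)             # stored as 0xFE

print(len(cell), load_byte_zero_extended(cell))
print(load_byte_zero_extended(negative_cell))
print(hex(load_byte_sign_extended(negative_cell) & MASK_64))
```

      The stored value takes one byte either way; the choice between zero and sign extension only matters when the byte is loaded back into a full-width register.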

      --

      The enemies of Democracy are
    9. Re:Cache is the key by Sivar · · Score: 2

      Interesting. Thanks for the correction and information. :)

      --
      Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
  5. RISC by e8johan · · Score: 5, Interesting

    Ok, he realizes that the x86 architecture is flawed. One of the most limiting problems is the lack of general purpose registers (GPR), so he adds more complexity to an already over-complex solution to solve this problem. All I have to say to this is: when will you see that the solution is as simple as switching architecture!

    As most code today is written in higher level languages (C/C++, Java, etc.) all it takes is a recompile and perhaps some patching and adaptations to small peculiarities. The Linux kernel is a proof of this concept, a highly complex piece of code portable to several platforms with a huge part of the code fully portable and shareable. This means that it is not hard to change architecture!

    If the main competition and its money would move from the x86 to a RISC architecture (why not Alpha, MIPS, SPARC or PPC) I'm sure that the gap in performance per penny would go away pretty soon. RISCs have several advantages, but the biggest (IMHO) is the simplicity: no awkward rules (non-GP registers), no special case instructions, easy to pipeline, easy to understand and easy to optimize code for (since the instruction set is smaller).

    And to return to the original article. Please do not introduce more complexity. What we need is simple, beautiful designs, those are the ones that one can make go *really* fast.

    1. Re:RISC by RegularFry · · Score: 2, Insightful

      I don't think anyone would disagree with that, but that's not the issue. What he's saying is, given that we've got to stick with x86 for historical and commercial reasons, this would be a relatively quick and easy way to allow the compilers to produce *much* groovier code.

      --
      Reality is the ultimate Rorschach.
    2. Re:RISC by benwb · · Score: 2

      umm, an intel cpu pretty much beats the pants off anything else on the market. On the downside, it's pretty tough to stuff 134 p4's in a server the way you can with a sparc or a powerpc.

    3. Re:RISC by shadow303 · · Score: 2, Interesting

      It would definitely be nice to get rid of the legacy cruft and move to a different architecture, however I doubt that this will happen until Intel and AMD start hitting major stumbling blocks. The inertia just seems too great. From what I hear (sorry I don't have a source, but I think I heard it in my Computer Architecture class), the cores of the current x86 chips are essentially RISC, and have a translation layer wrapped around them (converting x86 instructions into the internal RISC instructions).

      --
      I've got a mind like a steel trap - it's got an animal's foot stuck in it.
    4. Re:RISC by e8johan · · Score: 3, Insightful

      If we're going to stick to the x86 we still do not want to add complexity. I also tried to point out how easy it would be to move to a new architecture.
      Since you must add complexity, I do not think that it would be "quick and easy". It takes huge resources in both time and equipment to verify the timing of a new chip, so these kinds of changes (fundamental changes to the way registers are accessed) are expensive and hard, since you also need to implement many new hardware solutions and verify the functionality (not only the timing!).

    5. Re:RISC by e8johan · · Score: 2

      You are right that the modern x86 implementations are RISCs with a translation layer around them (except Crusoe which is a VLIW with software translation - much cooler 8P ). Now just imagine if we could get direct access to those highly optimized RISC cores instead of having to code in x86 machine code.

    6. Re:RISC by Milican · · Score: 3, Informative

      RTFA or nicely put...read the article. By adding the instructions he reduced the complexity of shifts, the multiple ordered instructions it takes to do one thing, and increases the visibility of all the registers. There are added instructions, but the benefit is reduced complexity in assembly instructions due to greater direct accessibility of all the registers.

      JOhn

    7. Re:RISC by Zathrus · · Score: 5, Interesting

      Ok, when you get to the Real World, let us know.

      Switching architectures is not that trivial. You seem to think that every company has the source code available for every piece of software they run. That isn't true. You seem to think that programs can easily be compiled between platforms if written in C/C++ - also untrue. You think that the bug fixes for compiling between platforms are "small peculiarities" -- well, they may be small, but that doesn't make them easy. In fact, it makes it fucking hard because the differences are so buried in libraries, case-specific, and undocumented that it's a nightmare to find them. Yes, I've done this kind of thing. It's godawful.

      Changing architecture is difficult. This is not a closed vendor market - anyone can put together an x86 box and you have at least 3 different CPU vendors to choose from, 3 - 5 motherboard chipsets, and a virtually infinite variety of other hardware. If Dell computer suddenly decides to move to a PPC architecture what's going to happen? They're going to lose all their customers and fast. Because the very limited benefits of a different architecture do not make up for the costs of going to one.

      Yes, I said limited benefits. Yeah, when I was in college taking CompE, EE, and CS courses on CPU and system design I also found the x86 ISA to be the most demonic thing this side of Hell. Well, I'm older and wiser now and while x86 isn't perfect, it's not that bad either. Its price/performance ratio is utterly insane and getting better yearly. Contrary to the RISC architecture doom and gloomers, x86 didn't die under its own backwards compatibility problems. It's actually grown far more than anyone expected and is now eating those same manufacturers for lunch.

      You know, back in the early 90s when RISC was first starting to make noise the jibe was that Intel's x86 architecture was so bad because it couldn't ramp up in clock speeds. Intel was sitting at 66 MHz for their fastest chip while MIPS, Sparc, etc. were doing 300 MHz. Of course, now Intel has the MHz crown, with demonstrations exceeding 4 GHz, and the RISC crowd is saying that MHz isn't everything and they do more work/cycle than Intel (which is true, but the point remains).

      All that said, go look at the SPEC CInt2000 and FP2000 results. Would you care to state what system has the highest integer performance? And whose chip has the highest floating point?

      Oh, and let's not forget that I can buy roughly 50 server-class x86 systems for the price of one mid-level Sun/IBM/HP/etc. server.

      Note - server performance isn't all about CPU, but since the OP wanted to make that argument, I just thought I'd point out how wrong he is. There is still quite a bit of need for high end servers with improved bus and memory architectures, but don't even try to argue that the CPU is more powerful. It isn't.

    8. Re:RISC by earthman · · Score: 2, Insightful
      RISCs have several advantages, but the biggest (IMHO) is the simplicity: no awkward rules (non-GP registers), no special case instructions, easy to pipeline, easy to understand and easy to optimize code for (since the instruction set is smaller).
      Not entirely true. RISC instruction sets can be quite huge too. And the whole idea of RISC is to take the complexity out of the hardware and put it into the compiler instead. It is easier to optimize for x86 than RISC.
    9. Re:RISC by benwb · · Score: 2

      I was talking about absolute performance of a single intel chip versus anything else on the market. Not performance per penny. Perhaps you should have read my post more closely, nowhere did I mention cost.

    10. Re:RISC by Zathrus · · Score: 5, Informative

      Both the Intel Pentium III and IV and the AMD K6-2 and K7 (Athlon) are essentially RISC processors at the core. There's an outer layer that translates from the x86 ISA to their internal microarchitecture, except for a few outdated instructions that are virtually never used, which are implemented in microcode (and thus slow as hell comparatively).

      There is no way to directly access the core ISA, nor do I know of it being documented anywhere. Intel planned to move the industry off the x86 ISA to Itanium, but so far that's utterly failed and with the Intergraph lawsuit it may be dead in the water now.

      AMD's x86-64 still uses the x86 ISA, but extends it. Additionally if you talk to the chip in 64 bit mode then 8 (I think) additional GP registers are available in silicon - not just register renaming, which occurs already in every major CPU on the market today. The additional registers (all 64-bit wide) pretty much eliminate the need for an architecture move, at least as it relates to registers. Intel hasn't yet adopted x86-64 though (although they can since AMD must license to them because of IP agreements).

      Still, what's funny is this desire for a performance increase... the x86 chips are the fastest CPUs on the market for integer performance and in the top 5 for floating point - although Alpha still reigns supreme for FP I believe. But compare the price of an x86 chip to pretty much anyone else and you start wondering exactly what the performance issue is.

      The performance problems are not with the CPU anymore. The bus and memory interfaces are slow. They've been getting faster over the years, but closed vendor boxes like Sun, HP, IBM, etc. will always do better because they don't have to deal with getting a half dozen different major OEMs on board, along with countless peripheral manufacturers. Nor do they have to concern themselves overly with backwards compatibility.

    11. Re:RISC by snatchitup · · Score: 5, Informative

      Hell yeah!

      I myself am an old x86 Assembly hacker.

      When I started looking at the ARM chips I wondered why we ever used x86's etc.

      RISC / CISC is really a misnomer.

      RISC has plenty of instructions, and it's meant to be superscalar.

      It starts with Register Gymnastics. Basically with RISC, there's no more of it. Every register is general. It can be data, or it can be an address. All the basic math functions can operate on any register.

      With Intel x86, everything has its place.

      Extend it further out. There's something called "Conditional Instructions". Properly utilized, these make for an ultra efficient code cache. The processor is able to dump the code cache instructions ahead of time. Which also means, not as much unnecessary "pipeline preparation" to perform an instruction.

      Then there's THUMB which compresses instructions so that they take up less physical space in a 64, 128 bit world. There's lots of wasted bits in an (.exe) compiled for a 386.

      Last I checked, 32bit ARM THUMB processors are dirt freaken cheap, they're manufactured by a consortium of a multitude of vendors as opposed to AMD and INTC.

      The Internet is slowly wearing down the x86 as more and more processing is moving back on the server where big iron style RISC can churn through everything.

      The article should really just be called:

      "An Academic Exercise in Register Gymnastics"

    12. Re:RISC by Lars+T. · · Score: 2

      So? Just because you did leave cost out of your post doesn't mean you're not wrong. Intel probably spends several times as much on improving x86 as all RISC chip developers on their chips combined. If that money was redistributed, absolute performance of the RISC chips would also go up.

      --

      Lars T.

      To the guy who modded me down from perfect to terrible Karma - Apple haters still suck

    13. Re:RISC by ZigMonty · · Score: 2
      You're happy with 8 extra GP registers?! The PPC has 32 and the Crusoe (IIRC) has 64. Think ahead people. Your statement could be restated as "16 registers should be enough for anyone". I want to see an ISA with 256 GP registers, like some VLIW ISAs have.

      Disclaimer: I know nothing about the x86 architecture. I'm more a PPC guy. For the x86 I'm relying on what others have posted (dangerous, I know).

    14. Re:RISC by CTho9305 · · Score: 2

      Why does x86 waste space? The instructions are variable-length, which, as I understand it, would result in minimum executable size.... A fixed-length instruction architecture, on the other hand, WOULD seem to waste space/bandwidth for instructions that could be shorter.
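
      A quick back-of-the-envelope sketch of that argument. The byte sequences are standard 32-bit x86 encodings; the 4-byte fixed width stands in for a classic RISC, and the one-for-one instruction count is an assumption (a real RISC compiler might need more or fewer instructions for the same job).

```python
# Four common 32-bit x86 instructions and their machine-code encodings.
X86_SNIPPET = [
    ("push eax",     bytes([0x50])),
    ("mov eax, 1",   bytes([0xB8, 0x01, 0x00, 0x00, 0x00])),
    ("add eax, ebx", bytes([0x01, 0xD8])),
    ("ret",          bytes([0xC3])),
]

# Assumption: a fixed-width ISA spends 4 bytes on every instruction.
FIXED_WIDTH_BYTES = 4

x86_size = sum(len(encoding) for _, encoding in X86_SNIPPET)
fixed_size = FIXED_WIDTH_BYTES * len(X86_SNIPPET)

for text, encoding in X86_SNIPPET:
    print(f"{text:<14} {len(encoding)} byte(s)")
print(f"variable-length total: {x86_size} bytes, fixed-width total: {fixed_size} bytes")
```

      The variable-length stream is denser here, at the cost of a decoder that has to work out where each instruction ends.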

    15. Re:RISC by benwb · · Score: 2

      Absolutely. But I think that makes the point that I was hinting at for me: RISC chips are not inherently faster than a VLIW/RISC hybrid like the current p4's. After all, if risc is such a big win you shouldn't have to spend as much money on it to extract the performance of a crufty design like a p4.

      AMD probably doesn't have a larger budget than ibm/motorola for powerpc, and it beats the pants off them too.

      RISC probably has some pretty significant advantages when developing low power chips, or chips that play well enough with others to support massive scalability, but I just don't see it for single chip performance.

    16. Re:RISC by bored · · Score: 2, Informative
      Extend it further out. There's something called "Conditional Instructions". Properly utilized, these make for an ultra efficient code cache. The processor is able to dump the code cache instructions ahead of time. Which also means, not as much unnecessary "pipeline preparation" to perform an instruction.

      The ARM is a cute little arch, the only problem is that EVERY instruction is conditional. At first this seems like it might make for some really nice optimizations. But it's a lot harder than you think (the instruction cache cannot just dump instructions, because it has to know what the current state of the processor is, this means that all instructions which affect the condition codes have to retire before decisions about which instructions can be executed are made). When I started thinking about how I would design an out-of-order (OO) superscalar version it started to give me a real headache. Eventually I realized about the only way (I could think of, maybe there is a better way) would be to have some kind of in-order conditional retire stage near the end of the pipeline. This would allow the processor to run at decent speeds as long as the code was very careful to rarely use the conditional execution, since it would effectively serialize the instruction stream. The 'Always' execute instructions could retire out of order as long as there was enough distance between them and the condition changing/condition dependent instructions.

      All this is to say, the ARM is a nice arch for low speed, low power devices. Really high speed versions might be pretty hard to get right. Intel's Xscale is like this, everything considered, its IPC is pretty bad.

    17. Re:RISC by Lars+T. · · Score: 3, Interesting
      No, RISC isn't inherently faster than CISC (and no, the P4 isn't a VLIW/RISC hybrid, it's a CISC processor with micro-code).

      And both Intel and AMD spend much more on (x86-) processor development than IBM and Motorola and Sun and all others on their chips.

      And no, x86 is not much faster. Not even at SPEC, which does not tell the whole picture.

      As for AMD being faster, they basically had a stroke of luck with the Athlon design. Before that AMD wasn't known for their speedy processors (cheap yes). And if it hadn't been for the Athlon, Intel's x86 also wouldn't be that far (or not so, actually) ahead, the Itanium II would be the contender to the big RISCs, and the fastest Pentium 4 would be at 2 GHz (if that much) and would cost $1000.

      --

      Lars T.

      To the guy who modded me down from perfect to terrible Karma - Apple haters still suck

    18. Re:RISC by jafuser · · Score: 2
      It is easier to optimize for x86 than RISC.

      I'm just curious, as I have heard this claimed for both types of processors. It seems like a processor with more instructions would be more optimizable because the compiler has more ways to describe what it wants the CPU to do; and the CPU has a better understanding what it needs to do, which allows it to optimize even further.

      --
      Please consider making an automatic monthly recurring donation to the EFF
    19. Re:RISC by benwb · · Score: 2

      So basically we agree. :)

    20. Re:RISC by himi · · Score: 3, Interesting

      If you want lots of general purpose registers, take a look at Knuth's MMIX system. Unfortunately, it's not in silicon, but it's there, and it /could/ be done, if someone wanted to . . .

      himi

      --

      My very own DeCSS mirror.
    21. Re:RISC by e8johan · · Score: 2

      It is easier to optimize for x86 than RISC.

      If a RISC is a proper RISC there are probably no more than one or two ways to do an operation. Since all similar instructions take an equal amount of time there is no need to optimize. That is what I would call really easy to optimize, i.e. no need to!

  6. Um, how is this anything new? by Andy+Dodd · · Score: 4, Informative

    Linux kernel source - memcpy() anyone?

    (On MMX machines, the wider 64-bit MMX registers are used for memcpy() rather than the 32-bit standard integer registers)

    This has been in the kernel for a few years now and anything that uses memcpy() benefits from it. Move along now.
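
    A toy illustration of why the wider registers help. This is not the kernel's code, just a sketch that counts how many register-sized moves a copy takes at a given register width; `chunked_copy` is a made-up name.

```python
def chunked_copy(src, register_bytes):
    """Copy src in register-sized chunks; return the copy and the number of moves."""
    dst = bytearray(len(src))
    moves = 0
    pos = 0
    # Bulk of the copy: one move per full register-sized chunk.
    while len(src) - pos >= register_bytes:
        dst[pos:pos + register_bytes] = src[pos:pos + register_bytes]
        pos += register_bytes
        moves += 1
    # Tail: leftover bytes are copied one at a time.
    while pos < len(src):
        dst[pos] = src[pos]
        pos += 1
        moves += 1
    return bytes(dst), moves

page = bytes(range(256)) * 16              # a 4096-byte buffer

copy32, moves32 = chunked_copy(page, 4)    # 32-bit integer registers
copy64, moves64 = chunked_copy(page, 8)    # 64-bit MMX registers

print(f"32-bit moves: {moves32}, 64-bit moves: {moves64}")
```

    Halving the number of moves is the best case; real gains depend on memory bandwidth and alignment.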

    --
    retrorocket.o not found, launch anyway?
  7. Another Hideous Hack for IA32 by seanellis · · Score: 5, Informative

    The scheme as proposed would work, but nothing will change the fact that it's another hideous hack to get around the non-orthogonal addressing modes in the original Intel 80x86 architecture.

    Even the little microcontroller chips that I can buy for $2 have 32 general purpose registers (Atmel AVRs, for anyone who cares).

    Worse, this scheme would not benefit existing code - it still requires code changes to work.

    Finally, on the gripping hand, the Pentium III and 4 have a very similar register renaming scheme going on automatically in the hardware. The 8 "logical" registers are already mapped dynamically into a much larger physical register file. (From ExtremeTech: http://www.extremetech.com/article2/0,3973,471327,00.asp)

    1. Re:Another Hideous Hack for IA32 by jelle · · Score: 2

      "The scheme as proposed would work"

      I'm not so sure about that. He found a way to address more registers with minimal changes to the instruction set. That is only part of the problem. He doesn't analyze what is needed in the actual hardware, adding the registers and the read and write muxes to actually implement this functionality in gates. With register renaming that all these processors use, it's not so easily said how big this impact will be. Anyways, usually more registers, especially general purpose ones means more silicon area (routing and muxes around the register bits) plus increased critical path. Translation: a bigger chip with a lower clock speed.

      The fact that microcontrollers have more gp registers doesn't mean anything, because they don't have to run at 2.8GHz, and often even need multiple clock cycles per instruction, so there is a lot of room to work with. At the current speeds of X86 CPUs, the hardware constraints cannot be compared with those of a microcontroller.

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
    2. Re:Another Hideous Hack for IA32 by jelle · · Score: 2

      Replying to myself...

      Actually thinking more about it, I think his proposal won't gain much performance anyway. Lack of gp registers results in using the stack as an overflow for local data. Because of the rate of accesses on the stack, I'm pretty sure the local variable part of the stack always resides in L1 cache. That means stack pushes and pops are pretty fast already, because they go to+from the cache. Maybe it would help a little to have a special register bank that mirrors the 'top of the stack' so that it can be read and written at register speed instead of L1 cache speed. That would be a 'solution' that doesn't require any changes to the instruction set, no recompiling, etc, and probably gives the same performance gain.

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
  8. Mmmm, Assembler... by guidemaker · · Score: 5, Funny

    I'm reminded of the days I used to code for the old Acorn Archimedes (don't look for it now, it's not there any more) and our apps were usually way faster than the competition's.

    When asked why, we were tempted to tell them that we used the undocumented 'unleash' instruction to unleash the raw power of the ARM processor.

  9. The Problems of Obsolete design by Alien54 · · Score: 5, Interesting
    This is what I call the big problem. That design is utterly abominable. We live in a world where it's nothing to have 1 gigabyte of RAM in a computer. We have 80 GB hard drive platters now, allowing even greater-sized drives. And yet at the heart of every single one of your x86 computers out there, a mere 6 GP registers are doing nearly all of the processing. It's amazing. And it's something I've personally wrestled with every day of my assembly programming career.

    This sort of reminds me of what happened with IRQs. Ultimately Intel "solved this" via the PCI bus, but performance has occasionally been problematic. Of course, that problem goes back to the original IBM design for the original IBM PC. Intel is also very aware, I imagine, of what happened when IBM tried a total redesign with the EISA bus, etc. It got rejected, I think, primarily because it was proprietary. In any case, enough companies have been nailed on backward compatibility issues that Intel may be nervous about making a total break.

    The upside is being able to run old software on new hardware. You don't want to break too many things.

    --
    "It is a greater offense to steal men's labor, than their clothes"
    1. Re:The Problems of Obsolete design by gpinzone · · Score: 3, Insightful

      Microchannel was the bus you are thinking about. It actually was very good, but wasn't backward compatible with ISA. EISA was the "rest of the industry's" response to provide a 32-bit bus that was backwards compatible. It wasn't a very good implementation since it was still locked at 8MHz.

    2. Re:The Problems of Obsolete design by Zathrus · · Score: 5, Informative

      As others mentioned, MCA (MicroChannel Architecture) was IBM's abysmal attempt at recapturing the PC market. It died a horrible death, and deserved it. Frankly, the technology sucked only slightly less than the ISA/EISA bus it wanted to replace.

      Anyone else remember the horrors of all those damn control files on floppies?

      There are a lot of architectural nightmares in the PC design... and while some of them are at the CPU level (like the 6 GP registers), most of them are at the bus level. Who the hell puts the keyboard on the 2nd most important interrupt (IRQ1)? The entire bus is still borked, although PCI has mostly hidden that now. But the system and memory buses are the sole reason that IBM, HP, Sun, etc. have higher performance ratings than x86 -- the P4 and Athlon processors are faster in virtually every case on a CPU to CPU basis.

      The bus and memory architecture is also why x86 does so incredibly bad in multi-CPU boxes. It's just not designed for it, the contention issues are hideous, and while you may only get 1.9x the performance going to a 2 CPU Sun box, you'll only get 1.7x on x86. It gets worse as you scale (note - those numbers are for reference only, I don't recall the exact relationships for dual CPU x86 boxes anymore, but the RISC systems handle it better due to bus design).

      Really there's nothing wrong with the x86 processors except to the CompE/EE/CS student. I was there once and couldn't stand it. Real life has shown that it isn't that bad, and recent times have shown that it's actually really damn good. Except for the buses. They suck. And while things like PCI-X and 3GIO are on the horizon, I don't see them seriously changing the core issues without causing massive compatibility problems.

    3. Re:The Problems of Obsolete design by operagost · · Score: 2
      Who the hell puts the keyboard on the 2nd most important interrupt (IRQ1)?
      Someone who is designing for a platform designed for a single user, running a single program.
      --

      Gamingmuseum.com: Give your 3D accelerator a rest.
  10. Full Circle ... by tubs · · Score: 3, Insightful

    I remember the "next big thing" during the early and middle 90s was RISC - So will the next big thing be MCISC (More Complex Instruction Set Chips)?

    I wonder if the core of a MCISC will be RISC, or CISC that in turn has a RISC core.

    --

    try to make ends meet, you're a slave to money, then you die

  11. Does anyone else have flashbacks to by wiredog · · Score: 4, Interesting
    segment:offset addressing? He's doing it with registers, but it seems the same sort of thing. One register is for segment, the other is the offset?

    Well, not quite, but it has the same flavor.
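
    For anyone who missed those days, a small reminder of how 8086 real-mode segment:offset addressing worked: the 16-bit segment is shifted left 4 bits and added to the 16-bit offset, wrapping at 20 bits. The sketch below is only the classic address calculation, not the register mapping scheme from the article.

```python
def real_mode_linear_address(segment, offset):
    """8086 real mode: linear = segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF

# Different segment:offset pairs can name the same byte.
print(hex(real_mode_linear_address(0x1234, 0x0005)))
print(hex(real_mode_linear_address(0x1000, 0x2345)))

# The classic text-mode video buffer at B800:0000.
print(hex(real_mode_linear_address(0xB800, 0x0000)))
```

    The aliasing (many pairs, one address) is a big part of what made it painful to work with by hand.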

    After working in x86 assembly, I really appreciated high level and minimally complex languages like C.

  12. Technical point of view by Lomby · · Score: 4, Interesting

    The guy does not realize that what he proposed is not at all simple to implement in silico.

    These two additional mapping registers would complicate the pipeline hazard detection in an exponential way.

    Another point is that I don't think that by doubling/tripling the number of registers available you will get a tenfold performance increase: a small increase could be expected, but not much.

    Another problem is the SpecialCount counter: this would complicate the compilers too much. It would also make the instruction reordering almost impossible.

    1. Re:Technical point of view by Christopher+Thomas · · Score: 2

      The guy does not realize that what he proposed is not at all simple to implement in silico.

      These two additional mapping registers would complicate the pipeline hazard detection in an exponential way.


      It shouldn't. You'd just have to flag any modification of the map register as a hazard, and move the rest of the hazard detection after the mapping stage. It mostly just adds latency, not complexity.

  13. I suspect this would be a rather expensive chip by shimmin · · Score: 5, Interesting
    While the base idea is interesting (add instructions that support using the multimedia registers as GP registers), I suspect that actually implementing the functionality of the GP registers in the multimedia ones could result in a prohibitively expensive CPU.

    Anyone who's ever tried to use the MMX or XMM registers for non-multimedia applications knows what I'm talking about. The instruction sets for them are nicely tweaked to let you do "sloppy" parallel operations on large blocks of data, and not really suited for general computing. You can't move data into them the way you would like to. You can't perform the operations you would like to. You can't extract data from them the way you would like to. They were meant to be good at one thing, and they are.

    I once tried to use the multimedia registers to speed up my implementation of a cryptographic hash function whose evaluation required more intermediate data than could nicely fit in GP registers, and had enough parallelism that I thought it might benefit from the multimedia instructions. No such luck. The effort involved in packing and unpacking the multimedia registers undid any gains in actually performing the computation faster -- and the computation itself wasn't that much faster. I was using an Athlon at the time, and AMD has so optimized the function of the GP registers and ALU that most common GP operations execute in a single clock if they don't have to access memory, while all the multimedia instructions (including the multiple move instructions to load the registers) require at least 3 clocks apiece.

    Now this leads me to suspect that the multimedia registers have limited functionality and slow response for a single reason: economics. The lack of instructions useful for non-multimedia applications could be explained via history, but what chip manufacturer wouldn't want to boast of the superior speed of their multimedia instructions? And yet they remain slower than the GP part of the chip.

    So I conclude that merely making a faster MMX/XMM processor is prohibitively expensive in today's market. And this proposal would definitely require that, even if actually adding the additional wiring to support the GP instructions for these registers was feasible. Because what would be the point of using these registers for GP instructions if they executed them slower than the same instructions actually executed on GP registers?

  14. Re:add core funcs libc/stdc++ to the CPU by Toraz+Chryx · · Score: 2

    >"MMX/3d stuff for CPUs are lame, we have 3d cards for that."

    good luck doing scientific calculations on a Geforce :)

    OTOH

    >"add a FPGA matrix of 4096x4096 transistors or something on the side of the cpu for custom UBER fast routines"

    ^^^^ that idea has me intrigued, anyone who actually knows more about FPGA's than me (which isn't difficult) want to go into pluses/negs with that concept?

  15. More registers are not enough. by gpinzone · · Score: 4, Informative

    The whole gist of the article has to do with the x86's lack of general purpose registers. While this is true, you're not going to solve all of the x86 shortcomings simply by figuring out a way to add more of them. There are MANY things wrong with the x86 design; GP registers are just one of them. There's an entire section in the famous Patterson book that goes into all of the issues in much more detail than I care to state here.

    Besides, there are already more efficient (albeit complex) solutions to extend registers that make much more sense in the current world of pipelined processors. Register renaming is one such example.

  16. Revolutionizing?? by Jugalator · · Score: 3, Interesting

    It's interesting to hear "revolutionizing performance" in the same topic as instruction level fiddling. The only way to give truly "revolutionizing" performance is to do high level optimizations.

    When you have your highly optimized C++ code or whatever, *then* you can get down to low-level and start polishing whatever routine/loop you have that's the bottleneck. Today's compilers also usually do a better job than humans at optimizing performance at this level and ordering the instructions in an optimized way. Especially if you consider the development time you'd need if doing it by hand. It's a myth that assembly code is generally faster if manually written -- many modern compilers are real optimizing beasts. :-)

    Anyway, I think one should always keep in mind that C++ code will only gain the greatest benefit from well optimized C++ code, not from new assembly level instructions, regardless if they unlock SSE registers for more general purpose or whatever. Oh, yeah, more registers won't revolutionize performance either. If they did, Intel and AMD would already have fixed that problem. I'm sure they're more than capable of doing it... More registers increase the amount of logic quite dramatically and I'm pretty sure it doesn't give good enough performance gains for the increased die cost, compared to increasing L2 cache size, improving branch prediction, etc.

    --
    Beware: In C++, your friends can see your privates!
  17. Amen, brother by mekkab · · Score: 3, Insightful

    It's a cute idea having a "stackspace" for your GPRs, but you could just move to an architecture with more GPRs and not have to design a brand new chip (I hate verilog).

    Now if I could only get my compiler to stop moving items from gpr to gpr with a RLINM that has a rotate of 0 and an AND mask of all 0xFFFFFFFF's!

    --
    In the future, I would want to not be isolated from my friends in the Space Station.
  18. Question about register aliasing by Tikiman · · Score: 2

    From what I gathered in the article, it seems like he is proposing a scheme by which normally unused registers (MMX, etc) can be used as general purpose registers. To do this, he considers an aliasing system. My question is, why can't a x86 programmer today just use those MMX registers for more general purposes? I'm sure there's a good reason, I just can't figure it out from the article - thanks

  19. Re:Switching Architectures by killmenow · · Score: 5, Insightful
    As most code today is written in higher level languages (C/C++, Java, etc.) all it takes is a recompile and perhaps some patching...
    But a lot of the code running today wasn't "written today" if you know what I mean.
    The problem is, in order to recompile you first need: a) the original source, and b) someone capable of patching, etc.

    A lot of internal apps are in use for which the source code is lost. And a lot of code in use today (sadly) was not written in languages as portable as C, C++, and Java. A lot of apps in use today were written in Clipper and COBOL and a bunch of other languages that may not have decent compilers for other platforms. So recompiling it isn't an option. A complete re-write is necessary.

    Even for situations in which application source *does* exist, and suitable compilers exist on other architectures, it is more often than not poorly documented...and the original author(s) is/are nowhere to be found. So in order to patch/fix the source to run on the new architecture, you not only need someone well versed in both the old and the new architectures, but someone who can read through what is often spaghetti code, understand it and make appropriate changes.

    In a lot of these cases it's easier to stick with the current architecture. And that, to some degree, is why the x86 architecture has gotten as complex as it is.
  20. It's the Chipset That Wouldn't Die! by Rayonic · · Score: 2

    And we all love it for the same reason we love mutant superhuman zombies. :o)

  21. Cool idea by tomstdenis · · Score: 2, Funny

    I want to form a company that makes a cpu that translates x86 instructions on the fly to RISC instructions that operate in parallel.

    I'll call my company transmeta!

    Or in the words of that new dell commercial

    "Sure we'll call it 1-800 they already do that!".

    Tom

    --
    Someday, I'll have a real sig.
    1. Re:Cool idea by TeknoHog · · Score: 2
      > I want to form a company that makes a cpu that translates x86 instructions on the fly to RISC instructions that operate in parallel.

      > I'll call my company transmeta!

      Modern x86 processors, since about PPro and K6, are already RISC in their internals. One key difference to Transmeta's products is that the Crusoe does the translation in software. Therefore the hardware is simpler, and the translation engine can be easily upgraded.
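
      To make the "RISC in their internals" point concrete, here is a toy sketch of how a decoder might crack one memory-operand x86 instruction into simpler load/op/store steps. The micro-op names, the temp register, and the tuple format are invented for illustration and do not correspond to any real core's internal encoding.

```python
# Toy "cracker": split an x86-style instruction with a memory operand
# into RISC-like micro-ops. The micro-op names and the temp register "t0"
# are hypothetical, not any vendor's actual internal format.

def crack(op, dest, src):
    """dest/src are register names, or '[addr]' strings for memory operands."""
    def is_mem(x):
        return x.startswith("[")
    uops = []
    if is_mem(dest):
        uops.append(("load", "t0", dest))        # t0 <- memory
        uops.append((op, "t0", src))             # t0 <- t0 op src
        uops.append(("store", dest, "t0"))       # memory <- t0
    elif is_mem(src):
        uops.append(("load", "t0", src))
        uops.append((op, dest, "t0"))
    else:
        uops.append((op, dest, src))             # register-register: one micro-op
    return uops

rmw = crack("add", "[ebx+4]", "eax")   # add [ebx+4], eax
reg = crack("add", "ecx", "eax")       # add ecx, eax
for u in rmw:
    print(u)
```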

      --
      Escher was the first MC and Giger invented the HR department.
  22. More than 3 answers !FREE! by purrpurrpussy · · Score: 4, Insightful

    You are VERY confused.

    1 - Zero Cost. 2 - Backwards Compatible. 3 - Orders of magnitude.

    1 - You have to buy new chips - this will improve the speed of "computing" but it will not increase the speed of THIS computer I have right HERE.

    2 - No old code has RM/RMC instructions in it and will NOT run any faster than it already does in a "standard" x86 mode. Yes it is backwards compat. but by the same token so is MMX, EMMX, 3DNOW!, SSE, SSEII, AA64 etc....

    3 - Anyone who can sell me a program to "suddenly" make all my code go 10x or 100x faster is guaranteed to give me a good chuckle!!!!!!!

    As for the article... well you've hugely increased the number of bits it takes to address a register and swapping the RM register is going to cause all sorts of new dependency chains inside the chip.

    Personally.... I'd go for a stack machine. Easily the most efficient compute engine.

    Now - if we could get back to point number 1 and point number 3. If YOU can make MY computer go 10 or 100 times FAST with SOFTWARE I promise I WILL give YOU some MONEY.... ;-)

    --
    "None of this shit works" -W.Shatner
  23. CRAM: advances in microprocessor arch by RichMan · · Score: 3, Interesting

    CRAM: search google for "CRAM computational RAM"

    http://www.ee.ualberta.ca/~elliott/cram/
    is your ultimate parallel compute machine. It turns your entire memory (all the CRAM anyways) into a register set. It is based on the concept of rather than bringing the data to the CPU for the computation the CPU is brought to the memory.

    Small computational units AND/OR/Adder are included on the bit access lines for all the memory cells.
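
    A loose sketch of the "bring the CPU to the memory" idea, in Python. It assumes one tiny processing element per memory row and a single broadcast operation; the class and method names are invented, and the real hardware described on the page above is not modelled at the bit level here.

```python
# Toy model of computational RAM: every row of memory has its own small
# processing element, and one broadcast instruction makes all rows compute
# at once. Class and method names are hypothetical.

class ToyCRAM:
    def __init__(self, rows):
        # Each row holds two operands and a result field.
        self.rows = [{"a": a, "b": b, "out": 0} for a, b in rows]

    def broadcast(self, op):
        """Apply the same operation in every row (conceptually in parallel)."""
        ops = {
            "and": lambda a, b: a & b,
            "or":  lambda a, b: a | b,
            "add": lambda a, b: (a + b) & 0xFF,   # 8-bit adder, wraps around
        }
        fn = ops[op]
        for row in self.rows:          # a loop here; simultaneous in hardware
            row["out"] = fn(row["a"], row["b"])
        return [row["out"] for row in self.rows]

mem = ToyCRAM([(1, 2), (200, 100), (0x0F, 0xF0)])
sums = mem.broadcast("add")
ands = mem.broadcast("and")
print(sums, ands)
```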

    1. Re:CRAM: advances in microprocessor arch by dunedan · · Score: 2

      I've heard, unfortunately, that CRAM is going to be expensive and hot as all get out, dissipating something LIKE 25W. I think current SDRAM dissipates 1W.

      So now I can have memory that costs 10 times as much and require a heat sink and fan

      Don't get me wrong, I think it's just about the coolest stuff I've heard of in a long time, but I don't think it'll show up in my desktop anytime soon. I'll see how it does in things like the google search appliance and routers first.

  24. Wow... Maybe I am more L33T than I thought I was? by Jack William Bell · · Score: 2

    I actually understood that. And I haven't done assembly language programming since the old 8086. (Segment registers, *shudder*...)

    Jack William Bell

    --
    - -
    Are you an SF Fan? Are you a Tru-Fan?
  25. What about Interrupt Handlers? by PetiePooo · · Score: 2, Interesting
    I found the article intriguing, but during the entire verbose, self-important sounding read, I was wondering how ISRs would be handled. For example, if the RMC were set to revert to the default mapping in three ops, and an ISR interrupted after the first op, would it revert to the default mapping in the middle of the ISR?

    Fortunately, that issue is addressed in his Message Parlor. The full text of his response to BritGeek follows:

    Presently the registers are saved automatically by the processor in something called a Task State Segment (TSS) during a task switch. There are currently unused portions of TSS which could be utilized and (sic) for RM and RMC during a task switch.

    The PUSHRMC and POPRMC instructions are available for explicit saves/restores of the RM and RMC registers in general code. I don't recommend it, however. The decoders would be physically stalled until the RM/RMC registers are re-populated. It would be better to use explicit MOVRMCs in general code.

    - Rick C. Hodgin, geek.com
    He may be onto something after all...
  26. x86 Emulator? by Shadow2097 · · Score: 2, Interesting
    From the sounds of the article, he wants to make register mappings more logical than virtual. My knowledge of assembly level programming is pretty basic, but I do agree that adding more GP registers would probably increase performance measurably.

    His second proposal, the RegisterMap field, strikes me as the incredibly complex part of this idea. He sounds like he's suggesting an idea that will turn x86 architecture into a simplified emulator by allowing you to logically map any register address to any physical address you choose. While there are probably some benefits to this, it sounds like the complexity of programming an already exceptionally complex chipset could go through the roof!

    I read somewhere in a previous article (last year sometime, can't find a link) that the way most compilers treated x86 was already done with so many pseudo instructions as to basically be an emulator. Now this was before I had any knowledge of assembly level programming, so maybe someone with more knowledge could clarify this?

    -Shadow

  27. The tricky part: by Qbertino · · Score: 2

    ...And the best part is that I believe this is something that could be implemented in hardware in a manner which could be resolved and entirely applied during the instruction decode phase, thereby never passing the added assembly instructions any further down the instruction pipeline, and thereby not increasing the number of clock cycles required to process any instruction. I can provide technical details on how that would work to anyone interested. Please e-mail me if you are....

    If this is really accomplishable without wasting *any* extra cpu time (that waste would apply to *all* instructions the CPU goes through!) this is indeed a good stunt that could work out to add a substantial ooomph to x86 performance with the code we have today.
    Thank god, 'cuz' my Athlon is too hot already and I'm kinda sceptical about watercooling. :-)
    Then again, that's a big "if".

    --
    We suffer more in our imagination than in reality. - Seneca
  28. Re:add core funcs libc/stdc++ to the CPU by Toraz Chryx · · Score: 2

    How slow?

    and if it were for something like a 20 hour 3d render, would it matter if the initial setup took a while?

  29. Re:add core funcs libc/stdc++ to the CPU by HFXPro · · Score: 2, Interesting

    > good luck doing scientific calculations on a Geforce

    Wha? Someone didn't tell you 3D accelerators do lots of math that requires very intensive scientific calculations, even if their implementation doesn't give the most accurate results. In fact, much of the math they use is used by physicists, engineers, and mathematicians every day. Unfortunately, getting the information out in a way that lets you permanently store it, or know what the exact results are, could be quite difficult, other than seeing it as graphics on your screen. BTW, I do think that CPUs still need to have great ability to do computationally expensive instructions. There is enough math in the form of collision detection and game physics, among other things, to still need lots of processing power on the CPU.

    --
    Reserved Word.
  30. Great... by Elias Israel · · Score: 2, Interesting

    A segment architecture for memory wasn't nasty enough, now we want to have a segment register for the registers?

    Thanks, no.

  31. So this is a "register pointer"? by zerofoo · · Score: 3, Insightful

    Great, in a time where we are removing god awful pointers from high level programming languages, we're putting them in the hardware.....uuugh.

    Anyone ever write something with intensive pointer arithmetic in C++? It's enough to drive you mad.

    Can you imagine peer code review: "No, that's not the instruction.....that's a pointer to the instruction."

    Oh boy!

    -ted

    1. Re:So this is a "register pointer"? by zerofoo · · Score: 2

      Right, you are talking "memory pointers" bits of data that point to, or keep track of locations in memory (or a stack...like a stack pointer).

      These new "pointers" for lack of a better term reference actual instructions....not data locations....it just adds another layer of complexity.

      As far as newer languages...I was talking about Java. Most Java guys avoid using pointers.

      -ted

    2. Re:So this is a "register pointer"? by mrm677 · · Score: 2

      There is a saying in computer science that any CS problem can be solved by adding another level of indirection.

      There are "pointers" all over architectures. In fact, directory-based cache coherence protocols simply use an array of pointers to actual nodes.

    3. Re:So this is a "register pointer"? by zerofoo · · Score: 2

      You are absolutely correct.

      But the point of a HLL (high level language) is to abstract those details to make software programming easier.

      Of course, that doesn't apply if you program in assembler, so you are stuck with all the low level details.

      I was just making the point that the software industry is trying to reduce pointer complexity in HLL's but hardware designers haven't tackled that yet.

      -ted

  32. Re:add core funcs libc/stdc++ to the CPU by Toraz Chryx · · Score: 2

    I'm fully aware that 3d accelerators do lots of maths..

    but without high precision and some way of getting the data OFF the videocard, then it's utterly useless for scientific purposes.

  33. Another band-aid by Junks Jerzey · · Score: 2

    So what he's suggesting is yet *another* band-aid on an already patched together architecture. This is no different than tacking 32-bit mode on top of a segmented 16-bit architecture, or the bizarre MMX/fp register sharing nonsense.

    1. Re:Another band-aid by mrm677 · · Score: 2

      This is no different than tacking 32-bit mode on top of a segmented 16-bit architecture, or the bizarre MMX/fp register sharing nonsense.

      Hmmm...that's funny that AMD is doing something so similar with Hammer

  34. Why should one do that? by mick29 · · Score: 4, Informative
    I do not like the changes proposed although x86 is awfully flawed (not enough GP registers, terribly overloaded instruction set {anyone ever used BCD commands? -- Yes, I hear the loud "We do" from the COBOL corner.}, you name it... ).
    But this change would:

    Make an internal interface explicitly controlled by the programmer/compiler, loading an enormous amount of work on the compiler creators. (Just have a look at IA64 - is there any good compiler out there already? I haven't had a look for a while.)

    Destroy (or at least reduce the efficiency of) the internal register renaming unit, thus slowing down the out-of-order execution core and such (the entire core, actually...) Sorry, but this man may have been busy programming x86 assembly his entire life (and for this he deserves my respect), but he is not up to date on how a modern x86 cpu works in its heart. When I heard the lectures in my university about how this stuff works, I gave up learning assembly -- one just doesn't need it anymore with the compilers around.
    Reading the books by Hennessy/Patterson may help a lot.

    1. Re:Why should one do that? by epine · · Score: 2


      What intellectual creation in this world doesn't have a fistful of lousy ties in the bottom drawer of the dresser? The existence of the BCD instruction, which is probably trapped in microcode by all modern implementations, is evidence of what exactly? If you squeezed out the vast majority of all the dubious instructions which remain in the formal x86 instruction set, I have serious doubts you would gain 5% on any significant metric (thermal loss, die size, clock frequency, etc.) The practical core of the x86 instruction set was firmly established by the 486. The majority of useful integer instructions on a 486 take exactly one clock cycle. In many programs 99% of all generated instructions come from this core group.

      Complaining about crufty ties in the bottom drawer is a serious misdirection of mental resources.

  35. Intel isn't interested in performance by zaqattack911 · · Score: 3, Insightful

    I hate to say it, but lately it's becoming more and more obvious that Intel is no longer really interested in performance. They'll squeeze a bit more out of an ancient architecture and add a few buzzwords like "SSE2", so they can slap on a hefty price-tag.

    Look at the pentium4 design! Intel would much rather use a dated cpu, with a nice pretty GHZ rating than keep the same MHZ and improve the architecture design.

    Do you really think investors give a shit about registers?

    --Marketing 101

    1. Re:Intel isn't interested in performance by mrm677 · · Score: 2

      If they aren't interested in performance, then why do they achieve pretty damn good SPEC numbers??

    2. Re:Intel isn't interested in performance by MajroMax · · Score: 2
      I hate to say it, but lately it's becoming more and more obvious that Intel is no longer really interested in performance. They'll squeeze a bit more out of an ancient architecture and add a few buzzwords like "SSE2", so they can slap on a hefty price-tag.

      Bah. SSE2 may be a marketing-ism (especially with the 'We make the Internet go Faster ' slogans), but the underlying technology is relatively neat.

      Back in Ye Older Days, processors had a physical limit of one set of effective operands per instruction -- SISD, Single Instruction, Single Data. One could add two numbers together to get a third, but adding n sets of two numbers together would take n instructions.

      Then came MMX (on the x86 -- other architectures have equivalents) -- this extended the x86 architecture by basically co-opting the FPU registers (64 bits of each) for SIMD, Single Instruction Multiple Data, instructions, on 8 bytes, 4 shorts, or 2 ints at the same time. A single PADDB instruction can now add 2 sets of 8 bytes at once, for example.
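
      A quick sketch of what a packed byte add does, written in Python rather than assembly. The function name `paddb` is borrowed from the instruction, but this is only a model of its wraparound behaviour on a 64-bit value, not a binding to real MMX.

```python
# Model of PADDB on a 64-bit MMX-style register: eight independent byte
# additions, each wrapping modulo 256, with no carry between lanes.

def paddb(x, y):
    result = 0
    for lane in range(8):
        shift = lane * 8
        a = (x >> shift) & 0xFF
        b = (y >> shift) & 0xFF
        result |= ((a + b) & 0xFF) << shift   # carry out of a lane is dropped
    return result

x = 0x01020304050607FF
y = 0x0101010101010101
packed = paddb(x, y)
plain = (x + y) & 0xFFFFFFFFFFFFFFFF   # ordinary 64-bit add, carries propagate
print(hex(packed), hex(plain))
```

      The two results differ only in the second-lowest byte: the ordinary add lets the overflow from 0xFF + 0x01 carry into its neighbour, while the packed add keeps each byte lane separate.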

      This was a Good Thing, but there is one obvious limitation -- it doesn't work for floats. Thus begat SSE, which adds 128-bit XMM registers to the processor to deal with SIMD floats in much the same way that MMX deals with ints. SSE also adds non-blocking writes to memory and other cache-control bits, but those aren't particularly important in this paragraph.

      SSE2 came about when it was decreed that SSE would be extended to handle all datatypes. With SSE2, introduced in the P4, the XMM registers can handle basically all interesting datatypes (with the exception of BCD, which really should die). I'm not so sure about you, but I think that performing operations on 16 bytes at a time _may_ be a performance boost, no?

      In short, x86 has its architectural problems, but for the time being it's far more efficient to keep improving what we have rather than start a completely new architecture. In fact, that's what Intel tried with the Itanium, and we all know how successful that venture's been.

      --
      "Evil company X is threatening to restrict our rights! Let's all get together to stop--OOOH! SHINEY!!!" -- AC
  36. More trouble than its worth... by gillbates · · Score: 4, Insightful
    The only potential downfall I see in this design is the possible pipeline stall seen when RM/RMC have to be populated from stack data. When that happens, no assembly instructions can be decoded until the POPRMC instruction completes and RM/RMC are loaded with the values from the stack.

    Actually, this is just one of many potential downfalls. He forgot interrupts, mode switching (going from protected to real mode, as some OS's still do), and IO would all require that the proposed RM/RMC register be loaded from the stack. The net effect would be that if his scheme is implemented, existing programs would run slower, not faster. Furthermore, placing the RM/RMC register on the stack is an impossibility without breaking backward compatibility; many assembly language coders depend on a set number of bytes being added to the stack when they perform a call or interrupt.

    Why not just add 24 GP registers to the existing processor? Honestly, it would be a lot simpler, and would not complicate the whole x86 mess, nor break backward compatibility.

    I don't mean to flame, but this guy is way off base. The biggest problem with the x86 instruction set is lack of registers, and the second biggest problem is that its complexity is rapidly becoming unmanageable. Not even Intel recommends using assembly anymore - their recommendation is to write in C and let their compiler perform the optimizations. Adding more instructions like this would further diminish the viability of coding in assembly.

    A far better solution would be to simply keep the existing instruction set intact, and add more GP registers. IBM got it right the first time - their mainframe processors have 16 general purpose registers which can be used for any operation - addressing, indexing, integer, and floating point calculations. If anything, Intel should stop adding instructions and start adding registers.

    --
    The society for a thought-free internet welcomes you.
    1. Re:More trouble than its worth... by Oculus Habent · · Score: 2
      IBM got it right the first time - their mainframe processors have 16 general purpose registers which can be used for any operation

      I'm no architecture expert, so I'll ask...

      What complexities and performance problems would be introduced if you were to up the number of registers? Let's say you wanted a processor with 32 registers...

      --
      That what was all this school was for... to teach us how to solve our own problems. -- janeowit
    2. Re:More trouble than its worth... by gillbates · · Score: 3, Interesting

      Oops. Forgot about PUSHA/POPA. Kind of strange, too, because I use these a lot.

      Also, about the opcode problem - adding registers doesn't necessarily mean adding opcodes. For example, IBM mainframes have one opcode for a load register instruction, and the registers are specified in the instruction. Were IBM to double the number of registers, the opcode would not have to change (granted, the instruction would get longer because they only allocated enough space in the source and destination fields for specifying one of 16 registers.) The problem is with the way x86 opcodes work - they aren't as universal, that is, the opcode's first byte is a function of both the operation and the register used. So expansion would be pretty difficult, unless they expanded the instruction set to include two byte opcodes (which they've already done, iirc), and use general purpose opcodes for common operations such as loading and storing.
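
      To show what "the opcode's first byte is a function of both the operation and the register" looks like, here is a small Python sketch of three one-byte x86 encodings where the low three bits of the opcode pick the register. The helper name is made up. With only three bits there, eight registers is all the byte can name.

```python
# In 32-bit x86, several one-byte opcodes fold the register number into
# the low 3 bits of the opcode byte itself (PUSH = 0x50+r, POP = 0x58+r,
# MOV r32,imm32 = 0xB8+r). decode_short_form is a hypothetical helper.

REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

BASES = {0x50: "push", 0x58: "pop", 0xB8: "mov-imm32"}

def decode_short_form(opcode):
    base = opcode & 0xF8          # upper 5 bits: the operation
    reg = opcode & 0x07           # lower 3 bits: the register
    if base not in BASES:
        return None
    return BASES[base], REGS[reg]

print(decode_short_form(0x53))    # push ebx
print(decode_short_form(0xB9))    # mov ecx, imm32
```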

      It's unfortunate, but true.

      The real, and only solution, is that these companies get their acts together, quit issuing refreshes of old hardware, and finally give us their next gen chips to play with. Proposing anything else is just pointless. (Unless, of course, the new CPUs completely flop..)

      Couldn't agree with you more. What I would really like to see is an x86 processor that could handle IBM mainframe instructions. The IBM mainframe instruction set makes a lot more sense than Intel's instruction set - unlike Intel, IBM realized that someday they might be doing 64 bit and 128 bit computing, and designed the instruction set to be expandable. Also, they don't have a lot of "garbage" instructions - no MMX, no SSE, no SIMD junk to clutter up a good design. To be honest, benchmarks that I've run on real-world software indicate that today's x86 processors complete 4 instructions for every 5 clock cycles. Which indicates that branch prediction and deep pipelines aren't the performance enhancers that Intel and AMD seem to believe them to be. While they might work well in theory, real world performance speaks otherwise. Given this, I don't see any practical reason for keeping a kludgy instruction set around, because the complexity of the instruction set has been a great hindrance to the actual, rather than the theoretical, optimization of x86 processors.

      --
      The society for a thought-free internet welcomes you.
    3. Re:More trouble than its worth... by gillbates · · Score: 2

      Generally speaking, adding registers uses up more silicon on the die, as the microcode must now work with a larger number of registers. The real problem comes with register renaming and out of order execution - which take up a considerable amount of microcode logic. As the number of registers increases, I imagine (though I am not a computer engineer) that the amount of silicon used for optimization grows exponentially.

      Translation: It's probably easier to optimize a processor with a smaller number of registers than one with many registers. However, the optimization that has been done to the x86 processors has yielded paltry results. Aside from pipelining (which has had the largest effect), most of the optimizations (register renaming, speculative execution, branch prediction) have had very little real world performance impact.

      However, the biggest problem that modern processors face is in keeping the cache full. Since the memory bus works at about 1/5 the speed of the processor, any gains given by optimizing the processor core are lost by the relatively large amount of time that the processor spends waiting on the memory controller. Thus, if we had more registers, we could use them for variables, rather than main memory, and reduce the number of main memory accesses, allowing our processor to complete more instructions in any given amount of time. The reason why the mainframe processors work so well is that they have 16 general purpose registers, which can be used for anything - as opposed to PC's, where only one of the general purpose registers can be used for arithmetic, only three of which can be used for addressing, and only one for IO. Given these restrictions, it's very difficult to write a program in x86 assembly that uses registers for anything more than the most temporary of variables. Even though mainframe processors run at 1/3 the speed of PC's, they get about as much done because they don't have the main memory latencies that PC's do, and, they can use registers, rather than memory, for the most commonly used variables. It isn't very difficult for a mainframe programmer to write useful programs in assembly that use main memory for nothing more than file buffers, where to do the same thing in x86 assembly is next to impossible.

      --
      The society for a thought-free internet welcomes you.
    4. Re:More trouble than its worth... by RickHGeek · · Score: 2, Interesting

      "Actually, this is just one of many potential downfalls."

      I was referring to use of the POPRMC instruction in code. I wouldn't recommend it unless there are other reasons why there might be a delay before actual code is executed, such as the last thing done before a RETF.

      "He forgot interrupts, mode switching....and IO would all require that the proposed RM/RMC register be loaded from the stack. The net effect would be that if his scheme is implemented, existing programs would run slower, not faster."

      I didn't forget those aspects of coding. There are two distinct possibilities here which entirely resolve that dilemma, both handled in hardware. 1) Interrupts are handled in a special way: during interrupt processing all RM/RMC values are ignored and the default 8 GP registers are used, or 2) Interrupts automatically push RM/RMC on the stack when signaled, and automatically pop them back off when IRETD is issued. These non-problems are resolvable.

      Next, mode switching. Mode switching would make no difference. Again, the hardware state could simply persist as it is presently set up through the mode switch, meaning that SC will either count down and reset RM/RMC to default values/popped values when it hits zero, or it will be populated with 1111b and persist forever (until changed with MOVRMC again).
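
      Since option 2 and the SC countdown are easier to follow as a model than as prose, here is a rough Python sketch. Everything in it (the class, the field names, the map layout, treating SC as "instructions left before reverting") is a guess at the behaviour described above, for illustration only, and not the actual proposal's encoding.

```python
# Toy model of a remappable register window. RM maps the 8 architectural
# register slots onto a larger file; SC counts instructions until the map
# reverts to default, with 0b1111 meaning "never revert". Interrupt entry
# saves the mapping and IRETD restores it. All names are assumptions.

DEFAULT_RM = list(range(8))
PERSIST = 0b1111

class RemapCPU:
    def __init__(self):
        self.rm = DEFAULT_RM[:]
        self.sc = PERSIST
        self.stack = []

    def movrmc(self, new_map, count):
        self.rm, self.sc = new_map[:], count

    def step(self):
        """Account for one ordinary instruction executed under the map."""
        if self.sc != PERSIST:
            self.sc -= 1
            if self.sc == 0:
                self.rm, self.sc = DEFAULT_RM[:], PERSIST

    def interrupt(self):
        self.stack.append((self.rm, self.sc))      # automatic save
        self.rm, self.sc = DEFAULT_RM[:], PERSIST  # ISR sees default registers

    def iretd(self):
        self.rm, self.sc = self.stack.pop()        # automatic restore

cpu = RemapCPU()
cpu.movrmc([8, 9, 10, 11, 4, 5, 6, 7], 3)  # remap four slots for 3 instructions
cpu.step()                                  # 2 left
cpu.interrupt()
in_isr = cpu.rm[:]
cpu.step()                                  # ISR work does not touch the count
cpu.step()
cpu.iretd()
after = (cpu.rm[:], cpu.sc)
cpu.step()
cpu.step()                                  # countdown reaches zero, map reverts
print(in_isr, after, cpu.rm)
```

      In this model an interrupt arriving after the first of three remapped instructions leaves two still owed when the handler returns, so the map does not revert in the middle of the ISR.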

      I've been told by probably 10 people so far that the P4 engine was designed with a 2 cycle latency L1 data cache, the purpose of which is to hide a lot of the latency required by not having a large GP register set. While this is, indeed, a great thing ... it never approaches the speed of register to register transfers. If code could be written to utilize up to 56 GP registers instead of 8 (8 GP + 16 MMX + 32 XMMX) then a great deal of those 2-cycle latency hits would be removed, thereby speeding up code fairly significantly.

      I've had a couple people that I respect contact me in email about this concept. They've asked me to write an emulator which demonstrates this process. I will be doing that in the coming weeks/months. I'm sure this topic will be dead by the time I get it completed, but it might help stir it up again. We'll see what it really does when the numbers are published. Take care!

      - Rick C. Hodgin, geek.com

    5. Re:More trouble than its worth... by epine · · Score: 2

      There are no end of tight loops out there where the x86 averages nearly three u-ops per clock cycle, the theoretic limit for Pentium III / Athlon cores. (I don't know the Pentium IV very well, it's too irregular and undocumented to bother studying.) The Athlon's u-ops are somewhat more powerful than the Pentium III u-ops which accounts for its superior peak performance. Counting instructions on x86 is pretty dumb. The internal u-ops are much closer to the conventional notion of a RISC instruction. Think of x86 as an ARM processor permanently stuck in Thumb decoding mode, supposing that Thumb has instructions which corresponded to one to four regular instructions (which are called u-ops in the x86 world). P6 u-ops are slightly less powerful than conventional RISC instructions (two u-ops are required for a single memory load). Athlon u-ops are roughly equal to conventional RISC instructions. They lack the three operand mode, but make up for it by handling read/modify/write as a unitary form. The P6 core rarely executes less than two u-ops per clock unless stalled by branch misprediction or memory latency. Of course, it's possible to write bad code or bad compilers. However, I would state categorically that execution rates less than two u-ops per clock have nothing to do with limitations of the x86 instruction set design or the P6/Athlon core implementations. Deep OOO architectures excel at squashing resource conflicts and pipeline bubbles.

  37. When programmers try to be architects... by Chris Burke · · Score: 5, Informative

    Yes, he basically invented register renaming, but put it under explicit programmer control. It's a programmer's solution to what hardware has already done, and as was inevitable he doesn't see that he will do more harm than good.

    Here's why his idea sucks:

    1) Register renaming dependent on the RMC. You can't issue any instructions if there is a POPRMC in the machine until the POPRMC finishes execution. He calls it "a few cycles", but it's much worse than that. You've prevented any new instructions from entering the window until the stack access is done, preventing any work that -could- have been done in parallel from even being seen. Function call/return overhead is a big deal, and he just kicked it up a notch.

    2) His whole problem #3 -- that you can't explicitly access the upper 16 bits of a 32-bit GPR. All I can say is -- thank God! Being a programmer, he probably doesn't realize that being able to address sub-registers is actually a big problem with x86. The whole sub-register-addressing problem causes all kinds of extra dependencies and merge operations. And he wants to make it worse? I think he should be slapped for this idea. x86-64 had the right idea -- you cannot access -just- the upper 32 bits of a GPR, and when you execute a 32-bit instruction that writes a GPR, the upper 32-bits are not preserved. Which is how the RISCy folks have been doing it all along, but hey.

    3) This idea requires an extra clock cycle in the front-end, to do the translation from architected to the expanded architected register space, prior to being able to do the architected->physical register translation.

    4) Because you still can't address more than 8 registers at a time, you'll be using lots of MOVRMC instructions in order to make the registers you need visible. Ignore how horrible this would make it for people writing assembly ("Okay, so now EAX means GPR 13?") or compilers, this is going to result in a lot of code bloat.

    5) Because of 1) and 4), modern superscalar decoders are going to be shot. If you fetch a MOVRMC, followed by POP EAX and POP EBX, you can't decode the second two until -after- you've decoded the MOVRMC and written its values into the map.
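
    For what it's worth, point 2) can be shown with a toy model. This is only a sketch in Python, and the function names are made up for illustration: a legacy 16-bit write has to be merged with the old upper bits, so it depends on the previous value of the register, while an x86-64 style 32-bit write zero-extends and depends on nothing.

```python
MASK64 = (1 << 64) - 1

def write16(old_value, new_bits):
    """Legacy 16-bit write (e.g. to AX): the upper bits of the old value survive,
    so the result depends on old_value (a merge operation)."""
    return (old_value & ~0xFFFF & MASK64) | (new_bits & 0xFFFF)

def write32_x86_64(old_value, new_bits):
    """x86-64 32-bit write (e.g. to EAX): upper 32 bits are zeroed, old_value is ignored."""
    return new_bits & 0xFFFFFFFF

old = 0x1122334455667788
print(hex(write16(old, 0xABCD)))          # old upper bits are still there
print(hex(write32_x86_64(old, 0xABCD)))   # old contents are gone entirely
```

    The second form never reads the old value, which is exactly why it creates no extra dependency for the out-of-order core to track.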

    Now all this is so that you can save on loads/stores to the stack. Which is great, but at least when those loads and stores are executing, independent instructions can still go. Every RMC-related stall is exactly that -- no following instruction can make progress.

    Not that increasing the number of registers in x86 isn't a good idea -- it's just his implementation that sucks. With him being an x86 programmer, I'm surprised he didn't think of the most obvious solution -- just add a prefix byte to extend the size of the register identifiers in the ModR/M and SIB bytes. You get access to ALL GPRs at once (rather than a 8-register window), no extra stalls are required, and your code size only goes up by one byte for instructions that use the extra registers.
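
    A rough sketch of the prefix idea, modelled on how the x86-64 REX prefix does it (bit pattern 0100WRXB, where R supplies a fourth bit for the ModR/M reg field and B a fourth bit for the rm field). The helper name is made up, and this toy decoder assumes 64-bit mode, a one-byte opcode, and no SIB byte.

```python
def decode_regs(code):
    """Return (reg, rm) register numbers from: [optional REX prefix] opcode ModR/M.

    Toy decoder only: one-byte opcode, register-direct operands, no SIB handling.
    """
    rex = 0
    i = 0
    if code[0] & 0xF0 == 0x40:      # REX prefix: 0100WRXB (64-bit mode only)
        rex = code[0]
        i = 1
    modrm = code[i + 1]             # byte after the opcode
    reg = (modrm >> 3) & 7          # 3-bit reg field
    rm = modrm & 7                  # 3-bit rm field
    reg |= ((rex >> 2) & 1) << 3    # REX.R supplies bit 3 of reg
    rm |= (rex & 1) << 3            # REX.B supplies bit 3 of rm
    return reg, rm

# 89 D8     mov eax, ebx    -> no prefix, registers 0..7 only
# 45 89 D8  mov r8d, r11d   -> one extra byte, registers 8..15 reachable
print(decode_regs(bytes([0x89, 0xD8])))
print(decode_regs(bytes([0x45, 0x89, 0xD8])))
```

    All sixteen registers are nameable at once, and the decoder needs no state beyond the instruction bytes themselves.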

    I can't help but commend him on his idea being well-thought out. To the best of his knowledge, he tried to address all issues. But that's the problem -- he's a programmer, not a computer architect.

    --

    The enemies of Democracy are
  38. Why not just go 64 bit? by amorsen · · Score: 2
    This proposal requires everyone to switch to new chips and new software. The new chips happen to run old software. That sounds like AMD's 64-bit chips to me. When you are doing an incompatible change you might as well get decent benefits out of it, instead of more complexity.

    Besides, segmented registers. I am having severe troubles finding an example of a worse idea actually proposed.

    --
    Finally! A year of moderation! Ready for 2019?
  39. Who does this help? by pete-classic · · Score: 2

    It seems that this would require a recompile to have any benefit.

    Soooo, if you are going to recompile anyway, why not target a processor with 128 64 bit GP registers, or whatever IA-64 has, instead of piling yet more cruft on top of x86?

    I'm not even convinced that it would be easier to modify existing i386 compilers to take advantage of this "advancement" than to get equivalent performance out of an immature IA-64 compiler, with more room for improvement.

    -Peter

  40. software vs. hardware by Sebastopol · · Score: 2


    LOL! This is what happens when a software guy tries to wear a hardware guy's hat! As if an array of pointers is "revolutionary".

    He doesn't even address his own concern -- speeding up legacy x86 code. Everyone writing performance assembly code uses SSE/MMX. Critical path code is hardly ever written in legacy x86. In fact, most compilers are smart enough to do the conversion for you (MSVC, ipp) even without intrinsics.

    What does he suggest? Offer extra instructions! Hello. Does this guy actually write any code ever? It doesn't sound like it.

    --
    https://www.accountkiller.com/removal-requested
    1. Re:software vs. hardware by Sebastopol · · Score: 2

      ... large number of programming situations which would not benefit at all ... Also, critical path code that's not multimedia based would be wise not to use SSE/MMX. ... significantly longer to execute the prolog/epilog ...

      My point is -- What FPU code is there that isn't mission critical and couldn't benefit from conversion to SSE? And if it's not critical path, then why did the author suggest overhauling the architecture for a performance boost on legacy code?

      --
      https://www.accountkiller.com/removal-requested
  41. If it needs a recompile, what's the point? by Christopher+Thomas · · Score: 3, Interesting

    The part that confuses me is that, since code would need to be recompiled to make use of this, you might as well just compile for x86-64 and make use of a larger flat register space. While the idea is interesting, there doesn't seem to be any advantage to using it (and a few disadvantages, pointed out by other posters).

  42. Register Windows by 1000StonedMonkeys · · Score: 2, Informative

    This already exists on SPARC. It's called register windows. It makes writing compilers/assembly a real bitch. Chipgeek needs to do his homework.

    As several posters have already mentioned, Intel gets around the lack of registers problem by using register renaming. There are actually 128 general purpose registers in the P4. Which ones you're writing to is controlled by the processor.

  43. Re:Wow... Maybe I am more L33T than I thought I wa by operagost · · Score: 2
    I haven't done any since the 6502.

    You whippersnappers have it easy! Eight whole GP registers? The 6502 had three: A, X, and Y - and we LIKED it! It was a big improvement, why just a few years back I had to use the capacitance of my own body parts for registers. And that bloody hurt, what with the CPU drawing 35 kW and all! You kids are pansies!

    --

    Gamingmuseum.com: Give your 3D accelerator a rest.
  44. Re:reg stack? by jafuser · · Score: 2
    That sounds interesting. I really don't know much about chip design, but I wonder how efficient a CPU with several stacked registers could be, if the code was designed to work with that.

    To prevent stack overflows, a logic system could move the highest parts of the stack into cache (which gets moved to memory).

    I imagine registers are scarce because each added register increases other logic component complexity by an exponential amount, but if there were several stacks backing each register, you can't access the middle of the stack, so there would be no extra logic required.

    Anyway, I know absolutely nothing about this stuff, so I'm probably making quite an ass of myself, but I see there are quite a few knowledgeable people here on this topic, so I wonder if anyone could comment on how practical this is? =)

    --
    Please consider making an automatic monthly recurring donation to the EFF
  45. Umm... by Grendel+Drago · · Score: 2

    And whose chip has the highest floating point?

    Umm. It appears to be the "hp workstation zx6000 (1000 MHz, Itanium 2)", which isn't an x86 machine.

    --
    Laws do not persuade just because they threaten. --Seneca
    1. Re:Umm... by Zathrus · · Score: 2

      Which is made by Intel. Note I said "whose" and didn't specifically target x86.

      Not that the P4 2.8GHz is very far behind the I2.

  46. Re:modular chips by Zathrus · · Score: 5, Informative

    Anytime you modularize you have to design interfaces. Interfaces are inherently slow - there's a physical disconnect which simply can't have as good of an electrical connection, they're bulky (consider that while a Pentium IV chip package is 35 mm on a side (1225 mm^2), the actual chip is only 131 mm^2 - the size is needed primarily for all the pinouts from the chip), and they're noisy.

    Consider that while you can buy a P4 that runs at 2.8 GHz internally (and the fast ALUs run at 5.6 GHz, although they're only 16-bits wide), the memory bus is a lackluster 133 MHz (which you get an effective 533 MHz from because it's quad pumped - you read 4 values every clock instead of just 1). The I/O bus also runs at 133 MHz. These are the only two external buses the CPU deals with.
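
    The arithmetic above, as a quick back-of-the-envelope check. The 8-byte (64-bit) data bus width is my assumption, it isn't stated above; everything else is just the numbers already quoted.

```python
package_side_mm = 35
die_area_mm2 = 131
package_area_mm2 = package_side_mm ** 2           # 1225 mm^2
die_fraction = die_area_mm2 / package_area_mm2    # roughly a tenth of the package

bus_clock_mhz = 400 / 3          # the "133 MHz" bus is 133.33 MHz
transfers_per_clock = 4          # quad pumped
effective_mts = bus_clock_mhz * transfers_per_clock    # the "effective 533 MHz"

bus_width_bytes = 8              # ASSUMPTION: 64-bit data bus
peak_mb_per_s = effective_mts * bus_width_bytes

core_clocks_per_bus_clock = 2800 / bus_clock_mhz       # core cycles per bus cycle

print(package_area_mm2, round(die_fraction, 3))
print(round(effective_mts), round(peak_mb_per_s), round(core_clocks_per_bus_clock))
```

    The last figure is the one that matters for the modular-chip question: the core gets through about 21 clocks for every single clock on the external bus.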

    If you were to try and segment the CPU similarly you'd quickly hit limitations. You simply can't run a multi-GHz electrical signal over a physical disconnect, at least not with current technology.

    All of that said, if you look at how CPU cores are laid out the cache is distinctly segmented from the ALU, the ALU is segmented from the FPU, and so forth. It makes chip design easier since if you want to make a change to one part of the chip you minimize effects on other parts. It also helps for signal routing and noise prevention.

    Also you can do more or less what you're asking - just not at high speeds. Modern chips are often preliminarily tested using gate arrays that can be reprogrammed quickly and easily... but instead of running at 3 GHz this test chip runs at 2 MHz. Maybe.

    Oh... a final bit... back in the days of the 386 and 486 the 2nd level cache was actually on the motherboard, and different MB vendors would put different amounts of cache. Some even had it socketed or solderable so you could add more if you wanted! But by the time the P2 came out clock speeds were too high for this. The connection latency and distance were simply too high. So we wound up with the slot processors, where a CPU slot card had the CPU core and 1-4 second level caches on it. Pretty soon both Intel and AMD integrated the 2nd level cache onto the CPU itself (which wasn't previously possible because it would have made the chips far too big), which further improved speed. The next generation of CPUs are requiring 3rd level cache on the motherboards. How long before that gets integrated onto the CPU?

  47. An intelligent comment on the subject by Cerlyn · · Score: 4, Interesting

    I can speak on some authority on this subject since I am presently taking a course on code optimization. What it looks like Mr. Hodgin is trying to do is work around the issue where people do not compile programs with processor specific optimizations. He seems to be proposing doing so by allowing "paging" per se of registers amongst themselves, although in a bit of an odd fashion.

    Personally, I am not too fond of this approach. First of all, operating systems will need to be written to support this paging. Secondly, running a single MMX and/or SSE enabled application (which would use most if not all of the mapped registers), would cause all the other applications on the system to suddenly lose any benefit that paging would provide.

    The approach I would take (which may or may not be better) would be to change the software. Compilers like gcc 3.2 already know how to generate code with MMX and SSE instructions. Patches are available for Linux 2.4 that add in gcc 3.2's new targets (-march=athlon-xp, etc.) to the Linux kernel configuration system. Libraries for *any* operating system compiled towards a processor or family of processors likely would fare better than generics.

    And yes, gcc 3.2 can do register mapping in a similar fashion (to ensure that all registers) on its own. If you read gcc's manual page, you will note that this makes debugging harder though. Gcc even has an *experimental* mode where it will use the x87 and SSE floating point registers simultaneously.

    Mr. Hodgin's approach might be a bit better for inter-process paging by a task scheduler for low numbers of tasks. But as a beginner in this field, I'm not sure what else it would be good for.

    Please pardon the omissions; I am not presently using a gcc 3.2 machine :)

    1. Re:An intelligent comment on the subject by RickHGeek · · Score: 2, Informative

      "Personally, I am not too fond of this approach. First of all, operating systems will need to be written to support this paging."

      This is an incorrect assumption. Existing operating systems would run entirely unaffected. RM/RMC support would be implemented in hardware. The data would be stored in the TSS during a task switch and the existing mechanisms used for storing MMX/FPU and SSE/SSE2 register space (either doing it explicitly with FXSAVE or deferring it by later trapping a fault when an attempt to read/write is encountered) would still be used.

      Nothing would need to be changed to that end.

      "Secondly, running a single MMX and/or SSE enabled application (which would use most if not all of the mapped registers), would cause all the other applications on the system to suddenly lose any benefit that paging would provide."

      Absolutely not. Each task has its own TSS right now. Each task context saves everything and context restores everything before/following a task switch. All systems would run as they do today. In fact, no additional operating system support would be required (since the necessary saving/restoring of RM/RMC in the TSS would be handled entirely by the processor). It would be an invisible add-on that only software utilizing it would see.

      - Rick C. Hodgin, geek.com

    2. Re:An intelligent comment on the subject by Cerlyn · · Score: 3, Interesting

      I thought of a context switch (or possibly a function call) too. Correct me if I am wrong, but what you are trying to do is to create a bunch of registers (my understanding being they will just be the existing x86+MMX+SSE unnamed), and "map" them via another register that certain software knows how to access, correct? That way, when an application knows about these, it can "squirrel" data away in "hidden" registers for fast access later?

      The primary problem I have with this "switching" of registers is that registers are supposed to be the fastest, most reliable memory components in a computer. By forcing a lookup table and its associated logic into the mix, you potentially are significantly reducing a processor's speed and/or scalability. Furthermore, the amount of data that can be hidden away inside of a processor is limited. While hiding registers is nice, perhaps it would be better to have the ability to "latch" a row of data so it won't be cleared out of the L1 cache (no processor can do this at the moment?). I would think that this would be much easier to implement without speed degradation, as it would only require a few additional gates used during lookup/overwriting of the L1 cache (which ideally, for this case, is at least semi-associative (i.e. any memory "block" can map to at least two locations in the cache)).

      Secondly, your proposal (as I understand it) would require all the registers to share the same area on a chip. Nowadays, the MMU, Arithmetic/Logic unit, etc., each have their own area on the chip. Shared/swapped registers would have to be in the center of the chip, with longer lines to each partial unit (yielding delays and capacitance). I believe you proposed doing this by subunits though; this would reduce delays somewhat, but you are still requiring some centralization, and adding a significant delay in.

      My personal position on this still kind of stands; if a program's compiler knows how to make use of the MMX & SSE functions of a computer, it should be set up to do so. That way, after an initial context switch for the entire program, the program (being correctly configured for a processor) flies. A compiler with register renaming functionality ("gcc3.2 -frename-registers", for example), can help do this for apps where the programmer does not know assembler. And if your "minimum requirements" mention a Pentium II 500, don't compile for a 486!

      In short, I fail to see how your proposal will speed up most applications significantly. Context-switches are always expensive, but the ability to change contexts in 10 clocks versus 30 really isn't significant when your backside bus is less than 50% of the processor's speed.

      Obviously, being a minor player, I have my views, and I have to respect yours (especially since I only had about 5-10 minutes to read your piece), but personally, I really do not see why program accessible context switching inside a processor is needed.

    3. Re:An intelligent comment on the subject by RickHGeek · · Score: 2, Informative

      By forcing a lookup table and its associated logic into the mix, you potentially are significantly reducing a processor's speed and/or scalability.

      The added logic would primarily exist in the decode phase. Provided the decoders could be pumped with enough data to overcome the increase in code size such a model could potentially introduce, it would not be a problem. The internal logic units would have to be modified to deal with that kind of reference.

      I posted a reply to the ChipGeek blurb on this subject (www.chipgeek.com) where I describe the type of engine required to execute this RM/RMC model. I visualize it like a round waterfall viewed from above. In the pool area leading up to the waterfall, all of the required processing taking place to prepare the data to be sent to the logic units. Data is pulled from the correct location in register space (a very simple process). It is resized to the appropriate operand during the pull. It is tagged with an indicator that will instruct a rapid-process retirement unit to write the contents back to register space (following execution).

      One thing that many people seem to be confusing is the concept of internal register renaming with what I'm doing. While it is arguable that what I've essentially done is introduce programmer-assigned register renaming, there is a distinct component to that renaming that most people seem to overlook completely (I've seen a few responders that nailed it). That is the fact that I, as the assembly programmer, or the compiler would be able to determine which registers propagate in which locations throughout the program. We have access to knowledge that a statistical runtime execution model does not. The x86 architecture provides almost no methods of conveying known-at-compile-time information to the processor (except through the overall code design following required rules dictated by the processor architecture), so it has to use statistical algorithms to rely on appropriate register renaming.

      My proposal would allow that decision to be made by the programmer. After all, Intel's current modus operandi with IA-64 seems to be "let the compiler or assembly programmer dictate everything". They are no longer interested in employing all of the OOO execution models that the P6 core has provided. That's why Itanium performs so poorly on x86 code. It has a P5 engine which doesn't employ any of those hardware speedups. The same code executed in x86 mode on an Itanium, then recompiled in IA-64 mode will run much faster after the recompile. Why? Because rather than executing the instructions one after another, the compiler has positioned the code in a manner which conveys as much parallelism as possible. The compiler made those decisions, not the CPU, and the performance benefits are there (see Itanium 2 numbers on a recent Ace's Hardware article: http://www.aceshardware.com/#60000436).

      What I propose would require a modest redesign of the hardware. It would require a minor extension to the instruction set. I can visualize about 40 different ways to implement the broad-strokes I painted with my feature (I didn't specifically name or assign opcode sequences, there are 3 unused bits in RMC which could be utilized to help in some way, etc.). There are several ways of arriving at the same final result in hardware. In my opinion it's up to people to explore the possibilities rather than criticize the idea. Personally, I like what AMD did with the x86-64 and the REX override prefixes. In 64-bit long mode they threw out redundant one-byte opcode instructions that were duplicated with other multi-byte opcode sequences and utilized them as a series of overrides which provide additional information regarding each instruction, and did so with a single byte.

      If that method were employed then the code size increase would be minimal. The only design points left to hit are how to redesign the core so the registers are in a central-access location rather than remote locations of the chip. I'm not saying it wouldn't be difficult. But, it would only have to be designed once and all software written from that point forward would have the potential of benefiting from it.

      - Rick C. Hodgin, geek.com

  48. Real data using x86 emulator by Nynaeve · · Score: 2, Informative

    What happened to backing up flames and claims with real data? The author of this article would be well advised to implement his ideas using an x86 emulator and at least do some preliminary testing. Processor-level features such as out-of-order execution and register-renaming may not be handled by an emulator, but it would be an informative investigation nonetheless.

  49. Why I'm a Motorola fan by Tokerat · · Score: 2
    Since PPC, Motorola chips have had plenty of registers. I'm not sure on specifics, I haven't written assembly since the 68k Mac days, but I think there is on the order of 32GP registers and 32 FP registers.

    This is a significant advantage, as minor operations that need to loop and keep track of a few things need not touch RAM at all and this keeps things extremely fast, as anyone tech savvy should be able to tell you.

    I'm not quite sure about the other recommendations ChipGeek makes, plus I only really skimmed the article, but an increase in GP and FP registers on the x86 platform is nothing short of a Good Thing, and the lack of them is one of the reasons I have always shied away from the platform. At bare minimum, this should be a high priority (if not first priority) when deciding the future of x86 design. Remember, it is true that what keeps PPC equal to (well, lately PPC has fallen behind a bit :-\ ) x86 in terms of real world performance is that PPC need not hit RAM as much.

    Providing more registers will create a remarkable boost in speed for any programs that take advantage of it, and I'm sure if it was going to happen all the major x86 OSes would jump at the opportunity. After all, a faster OS makes for more speed overall.

    So, let's conclude:
    1. Speed.
    2. Convenience.
    3. (Profit!)
    --
    CAn'T CompreHend SARcaSm?
  50. If only by Dollyknot · · Score: 2, Informative

    Many years ago I learned assembler, first on a Z80 then on a 6502. When I learned the power of zero page addressing, yes I thought, way to go. I left behind my computing hobby to become an international truck driver, and for about 10 years this is what I did. Seven years ago, events occurred leading me to take up my old hobby. I tried to learn the x86 and gave up after a while, thinking what was the point in banging my head against such a mess.

    What they should have done is kept the 6502 architecture and scaled it up. The architecture of the 6502 was wonderful. Sixteen bit address bus, eight bit data bus, same as the Z80. The clever bit with the 6502 was zero page addressing, which basically gave 256 registers as well as the three GP registers. The idea being the CPU could access the bottom page in memory with just eight bits in the address field. Zero page could be used as index registers. I can't rightly remember all the operations that could be performed on zero page as opposed to the A, X and Y registers, but I remember it leading to good tight code. The same architecture in a 32 bit address space, ah the dreams
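
    To make the zero page trick concrete, here is a small sketch in Python (the assembler helper is made up for illustration). The two opcodes are the 6502's LDA in zero-page and absolute form: the zero-page form saves a byte and a cycle because only eight address bits have to be fetched.

```python
LDA_ZEROPAGE = 0xA5   # LDA zp  : 2 bytes, 3 cycles
LDA_ABSOLUTE = 0xAD   # LDA abs : 3 bytes, 4 cycles

def assemble_lda(address):
    """Encode LDA from a 16-bit address, picking the short zero-page form when possible."""
    if not 0 <= address <= 0xFFFF:
        raise ValueError("6502 addresses are 16 bits")
    if address < 0x100:
        return bytes([LDA_ZEROPAGE, address])
    # Absolute form: operand is stored low byte first (little-endian).
    return bytes([LDA_ABSOLUTE, address & 0xFF, address >> 8])

print(assemble_lda(0x0042).hex())   # zero page: two bytes
print(assemble_lda(0x1234).hex())   # anywhere else: three bytes
```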

    --
    It's called an elephant's trunk whereas it is in fact, an elephant's nose, a nose by any other name would smell as sweet
    1. Re:If only by epine · · Score: 2

      I was there. The 6502 was hell on wheels. Scaling up a processor design which doesn't have a single GP register long enough to hold a memory address? Drugs man. I will say, however, that I quite liked the 6809. It was kind of fun to program the 6502, but when you look at code generation issues it was a complete disaster. The whole point is that the design of the 6502 can't scale up. There was a sweet spot for writing moderately complex video games by hand, but compilers aren't interested in sweet spots. Well, I knew a compiler that was. If you put too many parens in an expression, it ran out of temporary registers because it was storing temporary values within a fixed resource that looked an awful lot like zero page on the 6502.

  51. Did You Understand or Even Read The Article? by Milican · · Score: 2

    Did you see the part about adding the extra registers that would allow you to access all the other registers without jumping through hoops? How the hell is a compiler going to do that? I'll tell you. It's not, because the compiler would have to jump through the hoops. With the RegisterMapControl (RMC) you would be able to access all registers without using multiple shifts and without having to go through specific sequences of assembly code to get at the contents of certain registers. This is a *hardware* issue not a software issue. If you had read and understood the article you would know this, because this guy at ChipGeek is talking about assembler, which is what any compiler outputs. In addition, MMX is only for multimedia instructions (duh) and the article specifically talks about speeding up general purpose applications. READ, READ, READ... If you don't understand then don't post. This goes for moderators too. This should not have been modded up to +5 insightful because it isn't and it's completely off base from what the article was talking about.

    JOhn

  52. Re:Switching Architectures by AnotherBlackHat · · Score: 2
    But a lot of the code running today wasn't "written today" if you know what I mean.
    The problem is, in order to recompile you first need: a) the original source, and ...


    Why?
    No really, why do you need the original source to compile something?
    Seems to me that "assembly language" and "byte code" are languages just like p-code or Fortran.

    -- this is not a .sig
  53. too much of a good thing = pie wagon by epine · · Score: 3, Insightful

    If there was any sense to this comment, the x86 would have proved such a disaster it was abandoned ten years ago. Many people think it should have been, that its continued existence is some bizarre aberration of rational forces.

    In actual fact, the ugliness of the duckling was less of an impediment than advertised.

    There are several consequences of large, flat register sets. First of all, if your register set greatly exceeds the number of in flight instructions, you have a lot of extra transistors in your register set sitting there, on average, doing nothing. Well, not nothing. They are sitting there adding extra capacitance and leakage to your register file, increasing path length, cycle times, power dissipation, and routing complexity.

    Second effect: large register sets increase average instruction length. Larger average instruction lengths translate into a larger L1 instruction cache to achieve the same hit ratio. PPC requires a 40% larger I-cache to achieve the same effectiveness as the x86 I-cache.

    Third effect: context switches take longer. If you want to actually use all those registers, your process has to save and restore them on every context switch.

    Finally, there is the register set mirage. Modern implementations of x86 have approximately 40 general purpose registers. Only you can't see most of them. Six of these can be named to the instruction set at any given time. The others are in-flight copies of values previously named with the same names. This all happens transparently within an OOO processor model.

    If x86 only had six GP registers in practice, it really would have died ten years ago. What it actually has is six GP registers you can name at any one time, which means only six GP registers you have to load and store on context switches, etc.
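
    A toy sketch of that mirage, in Python. The class and its numbers are illustrative only (six nameable registers, 40 physical ones, as above), and a real renamer also recycles physical registers when instructions retire, which is not modelled here.

```python
class RenameTable:
    """Toy register renamer: every write to an architectural register gets a fresh
    physical register, so a later write never waits on readers of the old copy."""

    def __init__(self, arch_regs, num_physical):
        self.map = {name: i for i, name in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, sources):
        srcs = [self.map[s] for s in sources]   # sources read the current mappings
        phys = self.free.pop(0)                 # fresh physical register for the result
        self.map[dest] = phys
        return phys, srcs

rt = RenameTable(["eax", "ebx", "ecx", "edx", "esi", "edi"], num_physical=40)
first = rt.rename("eax", ["ebx"])    # eax <- f(ebx)
second = rt.rename("ecx", ["eax"])   # ecx <- f(eax): reads the first copy of eax
third = rt.rename("eax", ["edx"])    # eax reused: gets a new copy, `second` is unaffected
print(first, second, third)
```

    The program only ever says "eax", but the hardware is holding two different eax values in flight at once.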

    What did die ten years ago was the notion that convenience to the human assembly language programmer was worth a hill of beans. Good architectures are convenient to the silicon and the compiler.

    Other aspects of x86 have proved more serious than the shortage of namable GP registers. Too many instructions change the flag register, affecting too many patterns of flag bits. That's hell for an OOO processor to patch back together. The floating-point stack was an abomination. Lack of a three operand instruction format is another significant liability.

    On the other hand, the ill reputed RMW (read/modify/write) instruction mode is 90% of the reason the Athlon performs as well as it does. You get two memory transactions for the price of one address generation, translation, and cache contention analysis. It amounts to having the entire L1 cache available as a register set extension every other clock cycle (leaving half of your L1 cache cycles for other forms of work).

    Having someone comment on the x86 is an excellent litmus test of the capacity for someone to dig deeper than their shallow preconceptions of elegance. If it were anything other than the despised x86, its ability to scale from 4.77MHz to 10GHz would have been considered a marvel of engineering soundness. Sometimes ugliness has lessons to teach us. Who among us is prepared to listen?

  54. Re:recompilers by epine · · Score: 2

    Oh yes, the cost and complexity of recompiling all existing binary code for the x86 has no complexity at all. Rather than having two companies with thousands of highly trained design engineers work out the kinks they are paid to master, let's get the whole world involved in a massive change-over to honour a false god which hasn't yet produced compelling practical evidence of its innate superiority.

    The reason why this proposal won't be taken seriously is because it does expose extra complexity to the world at large (need for new compilers, optimization modes, validation, etc.) Complexities that can be handled behind the scenes are tackled aggressively no matter how great the complexity.

    But if we are going to recompile the entire existing x86 code base, why don't we add a simple extension to the compiler to eliminate all buffer overflows made possible by sloppy programming? Surely that can't greatly complicate this marvellous proposal. In the next iteration of recompile world, how about we design a compiler that identifies and corrects bad software design and program architecture? No, let's just settle for making all of the x86 binaries 40% larger for no real benefit.

  55. paltry results by epine · · Score: 2


    I guess most people don't comprehend the "red queen" nature of processor scaling. You have to run as fast as you can to stay right where you are.

    Increasing clock speed is not a linear gain. Let's imagine we scale the 66MHz 486 to several GHz without making any significant changes to the core. How fast would it run? It would be stalled three clock cycles out of every four, or worse. It wouldn't run 10% of the speed of a modern core at the same clock speed. That order of magnitude constitutes a long series of "paltry gains" paying the price for maintaining linearity while clock frequency takes all the credit. And I'm not even being fair to the paltry gains, because IPC has indeed increased greatly while latency hazards have scaled by several orders of magnitude.
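
    A crude way to see the stall problem, sketched in Python. Every parameter here is a hypothetical round number picked for illustration (one cycle per instruction when nothing stalls, two memory stalls per hundred instructions, 100 ns to memory); the only point is that a latency fixed in nanoseconds costs more and more cycles as the clock goes up.

```python
def cpi(clock_mhz, base_cpi=1.0, misses_per_instr=0.02, mem_latency_ns=100.0):
    """Cycles per instruction when memory latency in nanoseconds stays fixed.
    All default values are hypothetical, chosen only for illustration."""
    miss_penalty_cycles = mem_latency_ns * clock_mhz / 1000.0
    return base_cpi + misses_per_instr * miss_penalty_cycles

def stall_fraction(clock_mhz, base_cpi=1.0):
    """Share of all cycles spent waiting on memory."""
    total = cpi(clock_mhz, base_cpi=base_cpi)
    return (total - base_cpi) / total

slow = cpi(66)      # 1 + 0.02 * 6.6
fast = cpi(3000)    # 1 + 0.02 * 300
speedup = (3000 / fast) / (66 / slow)   # instructions per second, 3 GHz vs 66 MHz

print(round(slow, 3), round(fast, 3))
print(round(stall_fraction(3000), 2), round(speedup, 1))
```

    With these made-up inputs the unchanged core spends well over three quarters of its cycles stalled at 3 GHz and gets nowhere near a 45x speedup from a 45x clock.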

  56. Re:add core funcs libc/stdc++ to the CPU by Toraz+Chryx · · Score: 2

    Well, surely you could prioritise it to prevent less performance critical applications getting their hands on it when something important needs it?

  57. IA-32 is not flawed by DotComVictim · · Score: 2

    I disagree with the assessment that there is something wrong with the IA-32 opcode map. True, it's complex, it doesn't provide a lot of register flexibility; but compilers and internal register renaming make up for a lot of that.

    What is truly brilliant about the IA-32 instruction set is that it compresses very nicely. Try to write a useful function in 64 bytes on any RISC architecture, and you'll see why.

    Although it wasn't designed for this at the time, this has a very positive effect on performance - if we can squeeze more instructions into a smaller space, we have a smaller i-cache footprint, which definitely speeds things up, considering the memory bus bandwidth is the limiting factor, not the CPU.

    I understand his lack of appreciation for all the stack references, but I don't think this is the proper solution. The d-cache already catches stack reads - if there were a way to map a page as non-cache writeback, and the OS mapped the stack pages appropriately, flushing with a writeback only before a context switch, I think you'd see memory bandwidth increase significantly. True, this may break a number of things, but those problems can be worked around. This would help a large class of stack-intensive applications - and many applications and servers written for performance are already stack intensive because of the d-cache read benefit and easy allocation of buffer pools (malloc() is usually expensive).

    Course I don't have any of my architecture books on me right now, but I wouldn't be too terribly surprised if there is already a way to do this.