Slashdot Mirror


Revolutionizing x86 CPU Performance

NickSD writes "ChipGeek has an interesting article on increasing x86 CPU performance without having to redesign or throw out the x86 instruction set. Check it out at geek.com."

29 of 296 comments

  1. Sounds like an interesting idea... by AceMarkE · · Score: 2, Interesting

    It sounds like a pretty decent idea to me. Granted, I'm no assembly expert (I'm just now in my Microprocessors class, which is based on the Z80), but I don't see how having more registers could be a bad thing. Anything that keeps operations there inside the CPU rather than going out to memory would pretty much have to be faster. I especially like the fact that he's implemented it such that no current code would be affected. THAT is a key point right there.

    Admittedly, even if Intel and AMD decided to implement this, it'd still be a while, and then we'd have to get compilers that compile for those extra instructions, and there's our entire now-legacy code base that doesn't make use of them, and don't forget those ever-present patent issues...

    But yeah. Cool idea, well thought out. Petition for Intel, anyone?

    Mark Erikson

  2. Re:Why? by io333 · · Score: 3, Interesting

    Because if you had read the article you'd realize that this is essentially a zero-cost, backwards-compatible method of dramatically increasing program execution speed by several orders of magnitude -- so the question is really, "Why not?"

  3. RISC by e8johan · · Score: 5, Interesting

    Ok, he realizes that the x86 architecture is flawed. One of the most limiting problems is the lack of general purpose registers (GPRs), so he adds more complexity to an already over-complex solution to solve this problem. All I have to say to this is: when will you see that the solution is as simple as switching architectures?

    As most code today is written in higher level languages (C/C++, Java, etc.), all it takes is a recompile and perhaps some patching and adaptation to small peculiarities. The Linux kernel is proof of this concept: a highly complex piece of code, portable to several platforms, with a huge part of the code fully portable and shareable. This means that it is not hard to change architecture!

    If the main competition and its money would move from the x86 to a RISC architecture (why not Alpha, MIPS, SPARC or PPC), I'm sure that the gap in performance per penny would go away pretty soon. RISCs have several advantages, but the biggest (IMHO) is the simplicity: no awkward rules (non-GP registers), no special case instructions, easy to pipeline, easy to understand and easy to optimize code for (since the instruction set is smaller).

    And to return to the original article: please do not introduce more complexity. What we need are simple, beautiful designs; those are the ones that one can make go *really* fast.

    1. Re:RISC by shadow303 · · Score: 2, Interesting

      It would definitely be nice to get rid of the legacy cruft and move to a different architecture; however, I doubt that this will happen until Intel and AMD start hitting major stumbling blocks. The inertia just seems too great. From what I hear (sorry I don't have a source, but I think I heard it in my Computer Architecture class), the cores of the current x86 chips are essentially RISC, with a translation layer wrapped around them that converts x86 instructions into the internal RISC instructions.
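
      A rough way to picture that translation layer is a decoder that splits memory-operand instructions into simpler load/operate/store steps. The sketch below is a toy in Python: the function name, the `tmp0` temporary and the micro-op spellings are all hypothetical and do not reflect any vendor's real internal format.

```python
from typing import List

def decode_to_micro_ops(instruction: str) -> List[str]:
    """Split an x86-style two-operand instruction into RISC-like steps (toy model)."""
    mnemonic, operands = instruction.split(maxsplit=1)
    dst, src = [part.strip() for part in operands.split(",")]
    if dst.startswith("["):
        # Read-modify-write on memory becomes load, operate, store.
        return [f"load tmp0, {dst}", f"{mnemonic} tmp0, {src}", f"store {dst}, tmp0"]
    if src.startswith("["):
        # A memory source becomes a load followed by a register-register op.
        return [f"load tmp0, {src}", f"{mnemonic} {dst}, tmp0"]
    # Register-register forms already look RISC-like and pass through unchanged.
    return [f"{mnemonic} {dst}, {src}"]

for micro_op in decode_to_micro_ops("add [ebx], eax"):
    print(micro_op)
```

      The point is only that one "complex" instruction can be executed as several simple ones internally, which is the idea the comment describes.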

      --
      I've got a mind like a steel trap - it's got an animal's foot stuck in it.
    2. Re:RISC by Zathrus · · Score: 5, Interesting

      Ok, when you get to the Real World, let us know.

      Switching architectures is not that trivial. You seem to think that every company has the source code available for every piece of software they run. That isn't true. You seem to think that programs can easily be compiled between platforms if written in C/C++ - also untrue. You think that the bug fixes for compiling between platforms are "small peculiarities" -- well, they may be small, but that doesn't make them easy. In fact, it makes it fucking hard because the differences are so buried in libraries, case-specific, and undocumented that it's a nightmare to find them. Yes, I've done this kind of thing. It's godawful.

      Changing architecture is difficult. This is not a closed vendor market - anyone can put together an x86 box and you have at least 3 different CPU vendors to choose from, 3 - 5 motherboard chipsets, and a virtually infinite variety of other hardware. If Dell Computer suddenly decides to move to a PPC architecture, what's going to happen? They're going to lose all their customers, and fast, because the very limited benefits of a different architecture do not make up for the costs of going to one.

      Yes, I said limited benefits. Yeah, when I was in college taking CompE, EE, and CS courses on CPU and system design I also found the x86 ISA to be the most demonic thing this side of Hell. Well, I'm older and wiser now and while x86 isn't perfect, it's not that bad either. Its price/performance ratio is utterly insane and getting better yearly. Contrary to the RISC architecture doom and gloomers, x86 didn't die under its own backwards compatibility problems. It's actually grown far more than anyone expected and is now eating those same manufacturers for lunch.

      You know, back in the early 90s when RISC was first starting to make noise the jibe was that Intel's x86 architecture was so bad because it couldn't ramp up in clock speeds. Intel was sitting at 66 MHz for their fastest chip while MIPS, Sparc, etc. were doing 300 MHz. Of course, now Intel has the MHz crown, with demonstrations exceeding 4 GHz, and the RISC crowd is saying that MHz isn't everything and they do more work/cycle than Intel (which is true, but the point remains).

      All that said, go look at the SPEC CInt2000 and FP2000 results. Would you care to state what system has the highest integer performance? And whose chip has the highest floating point?

      Oh, and let's not forget that I can buy roughly 50 server-class x86 systems for the price of one mid-level Sun/IBM/HP/etc. server.

      Note - server performance isn't all about CPU, but since the OP wanted to make that argument, I just thought I'd point out how wrong he is. There is still quite a bit of need for high end servers with improved bus and memory architectures, but don't even try to argue that the CPU is more powerful. It isn't.

    3. Re:RISC by Lars+T. · · Score: 3, Interesting
      No, RISC isn't inherently faster than CISC (and no, the P4 isn't a VLIW/RISC hybrid, it's a CISC processor with micro-code).

      And both Intel and AMD spend much more on (x86-) processor development than IBM and Motorola and Sun and all others on their chips.

      And no, x86 is not much faster. Not even at SPEC, which does not tell the whole picture.

      As for AMD being faster, they basically had a stroke of luck with the Athlon design. Before that, AMD wasn't known for their speedy processors (cheap, yes). And if it hadn't been for the Athlon, Intel's x86 also wouldn't be that far (or not so, actually) ahead, the Itanium II would be the contender to the big RISCs, and the fastest Pentium 4 would be at 2 GHz (if that much) and would cost $1000.

      --

      Lars T.

      To the guy who modded me down from perfect to terrible Karma - Apple haters still suck

    4. Re:RISC by himi · · Score: 3, Interesting

      If you want lots of general purpose registers, take a look at Knuth's MMIX system. Unfortunately, it's not in silicon, but it's there, and it /could/ be done, if someone wanted to . . .

      himi

      --

      My very own DeCSS mirror.
  4. The Problems of Obsolete design by Alien54 · · Score: 5, Interesting
    This is what I call the big problem. That design is utterly abominable. We live in a world where it's nothing to have 1 gigabyte of RAM in a computer. We have 80 GB hard drive platters now, allowing even greater-sized drives. And yet at the heart of every single one of your x86 computers out there, a mere 6 GP registers are doing nearly all of the processing. It's amazing. And it's something I've personally wrestled with every day of my assembly programming career.

    This sort of reminds me of what happened with IRQs. Ultimately Intel "solved this" via the PCI bus, but performance has occasionally been problematic. Of course, that problem goes back to the original IBM design for the original IBM PC. Intel is also very aware, I imagine, of what happened when IBM tried a total redesign with the Micro Channel (MCA) bus, etc. It got rejected, I think, primarily because it was proprietary. In any case, enough companies have been nailed on backward compatibility issues that Intel may be nervous about making a total break.

    The upside is being able to run old software on new hardware. You don't want to break too many things.

    --
    "It is a greater offense to steal men's labor, than their clothes"
  5. Does anyone else have flashbacks to by wiredog · · Score: 4, Interesting
    segment:offset addressing? He's doing it with registers, but it seems the same sort of thing. One register is for segment, the other is the offset?

    Well, not quite, but it has the same flavor.
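
    For anyone who missed it, the classic 8086 real-mode scheme being remembered here combines a 16-bit segment and a 16-bit offset into a 20-bit address. A minimal sketch in Python (the function name is only illustrative):

```python
def real_mode_address(segment: int, offset: int) -> int:
    """8086 real-mode translation: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF

# Two different segment:offset pairs can name the same physical byte.
print(hex(real_mode_address(0xB800, 0x0000)))
print(hex(real_mode_address(0xB000, 0x8000)))
```

    Both calls resolve to the same physical address, which is part of what made the scheme awkward to program against.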

    After working in x86 assembly, I really appreciated high level and minimally complex languages like C.

  6. Technical point of view by Lomby · · Score: 4, Interesting

    The guy does not realize that what he proposed is not at all simple to implement in silicon.

    These two additional mapping registers would complicate the pipeline hazard detection in an exponential way.

    Another point is that I don't think that by doubling or tripling the number of registers available you will get a tenfold performance increase: a small increase could be expected, but not much.

    Another problem is the SpecialCount counter: this would complicate the compilers too much. It would also make the instruction reordering almost impossible.

  7. I suspect this would be a rather expensive chip by shimmin · · Score: 5, Interesting
    While the base idea is interesting (add instructions that support using the multimedia registers as GP registers), I suspect that actually implementing the functionality of the GP registers in the multimedia ones could result in a prohibitively expensive CPU.

    Anyone who's ever tried to use the MMX or XMMX registers for non-multimedia applications knows what I'm talking about. The instruction sets for them are nicely tweaked to let you do "sloppy" parallel operations on large blocks of data, and not really suited for general computing. You can't move data into them the way you would like to. You can't perform the operations you would like to. You can't extract data from them the way you would like to. They were meant to be good at one thing, and they are.

    I once tried to use the multimedia registers to speed up my implementation of a cryptographic hash function whose evaluation required more intermediate data than could nicely fit in GP registers, and had enough parallelism that I thought it might benefit from the multimedia instructions. No such luck. The effort involved in packing and unpacking the multimedia registers undid any gains in actually performing the computation faster -- and the computation itself wasn't that much faster. I was using an Athlon at the time, and AMD has so optimized the function of the GP registers and ALU that most common GP operations execute in a single clock if they don't have to access memory, while all the multimedia instructions (including the multiple move instructions to load the registers) require at least 3 clocks apiece.
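
    The trade-off described above can be put into a back-of-the-envelope cycle count. The sketch below is a deliberately crude serial model in Python: it uses the comment's figures (1 clock per GP operation, 3 clocks per multimedia instruction including moves), but the lane count and the number of pack/unpack moves are made-up inputs, and it ignores pipelining and instruction overlap entirely.

```python
import math

def scalar_cycles(num_ops: int, gp_latency: int = 1) -> int:
    """Cycles if every operation runs on GP registers, one after another."""
    return num_ops * gp_latency

def simd_cycles(num_ops: int, lanes: int, pack_moves: int,
                unpack_moves: int, mm_latency: int = 3) -> int:
    """Cycles for the multimedia path: pack, vector ops, unpack, all at mm_latency."""
    vector_ops = math.ceil(num_ops / lanes)
    return (pack_moves + vector_ops + unpack_moves) * mm_latency

# Hypothetical workload: 16 operations, 2 lanes, 4 moves in and 4 moves out.
print(scalar_cycles(16), simd_cycles(16, lanes=2, pack_moves=4, unpack_moves=4))
```

    Under these assumed inputs the multimedia path costs more cycles than the plain GP path, which matches the experience reported above; different inputs can of course tip the balance the other way.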

    Now this leads me to suspect that the multimedia registers have limited functionality and slow response for a single reason: economics. The lack of instructions useful for non-multimedia applications could be explained via history, but what chip manufacturer wouldn't want to boast of the superior speed of their multimedia instructions? And yet they remain slower than the GP part of the chip.

    So I conclude that merely making a faster MMX/XMMX processor is prohibitively expensive in today's market. And this proposal would definitely require that, even if actually adding the additional wiring to support the GP instructions for these registers was feasible. Because what would be the point of using these registers for GP instructions if they executed them slower than the same instructions actually executed on GP registers?

  8. Revolutionizing?? by Jugalator · · Score: 3, Interesting

    It's interesting to hear "revolutionizing performance" in the same topic as instruction level fiddling. The only way to give truly "revolutionizing" performance is to do high level optimizations.

    When you have your highly optimized C++ code or whatever, *then* you can get down to the low level and start polishing whatever routine/loop you have that's the bottleneck. Today's compilers also usually do a better job than humans at optimizing performance at this level and ordering the instructions in an optimized way, especially if you consider the development time you'd need if doing it by hand. It's a myth that assembly code is generally faster if manually written -- many modern compilers are real optimizing beasts. :-)

    Anyway, I think one should always keep in mind that C++ code will only gain the greatest benefit from well optimized C++ code, not from new assembly level instructions, regardless of whether they unlock SSE registers for more general purpose use or whatever. Oh, yeah, more registers won't revolutionize performance either. If they did, Intel and AMD would already have fixed that problem. I'm sure they're more than capable of doing it... More registers increase the amount of logic quite dramatically and I'm pretty sure it doesn't give good enough performance gains for the increased die cost, compared to increasing L2 cache size, improving branch prediction, etc.
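
    To illustrate what a "high level optimization" means here, compare two ways of answering the same question. This is a generic textbook example written in Python for brevity (not C++, and not taken from the article); the function names are just illustrative.

```python
from typing import Hashable, Sequence

def has_duplicate_quadratic(items: Sequence[Hashable]) -> bool:
    """Compare every pair: roughly n*n/2 comparisons."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_linear(items: Sequence[Hashable]) -> bool:
    """Remember what has been seen: roughly n set lookups."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False

data = list(range(2000)) + [0]
print(has_duplicate_quadratic(data), has_duplicate_linear(data))
```

    Both functions give the same answer, but the second does far less work as the input grows, and no amount of extra registers closes that kind of gap for the first.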

    --
    Beware: In C++, your friends can see your privates!
  9. Not gonna happen by Anonymous Coward · · Score: 1, Interesting

    Intel's policy: "If it doesn't increase the clockspeed, it doesn't go in the chip." Performance is not an issue, only clockspeed.

  10. Re:Why? by DustMagnet · · Score: 5, Interesting
    I read the article. From what I can see this guy writes lots of assembly, but knows very little about how processors are designed. The huge gains you all see have already been made by register renaming and caches. There might be some gain left by giving the compiler direct control over these, but at the cost of much complexity in the register renaming hardware. The P4 has a very deep pipeline. Looking for register conflicts is hard enough without adding another layer of redirection.

    The fact that the article never mentions register renaming shows the author never did any research into this topic before writing.
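
    For readers who have not met register renaming, here is a minimal sketch of the idea in Python. It is a toy: it assumes an unlimited supply of physical registers (named `p0`, `p1`, ... purely for illustration), models each instruction as a destination plus a list of sources, and leaves out everything a real core needs, such as a free list and retirement.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, List[str]]  # (destination register, source registers)

def rename(program: List[Instr]) -> List[Instr]:
    """Give every write a fresh physical register so only true dependencies remain."""
    mapping: Dict[str, str] = {}
    next_phys = 0
    renamed: List[Instr] = []
    for dst, srcs in program:
        # Sources read whichever physical register currently holds the value;
        # registers never written in this snippet keep their architectural name.
        new_srcs = [mapping.get(src, src) for src in srcs]
        # Each write gets a new physical register, removing WAR and WAW hazards.
        phys = f"p{next_phys}"
        next_phys += 1
        mapping[dst] = phys
        renamed.append((phys, new_srcs))
    return renamed

program = [("eax", ["ebx"]), ("ecx", ["eax"]), ("eax", ["edx"]), ("esi", ["eax"])]
for instr in rename(program):
    print(instr)
```

    After renaming, the second write to `eax` no longer has to wait for the earlier reader of `eax`, so the two halves of the snippet can proceed independently even though the program only names a handful of registers.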

    --
    'SBEMAIL!' is better than a goat!!
  11. Re:Um, how is this anything new? by Anonymous Coward · · Score: 1, Interesting

    Yes, but can you do anything other than store/retrieve to the data? You would have to put your data in a different register to operate on it (non-mmx operate). This seems to be the big hangup.

  12. CRAM: advances in microprocessor arch by RichMan · · Score: 3, Interesting

    CRAM: search google for "CRAM computational RAM"

    http://www.ee.ualberta.ca/~elliott/cram/
    is your ultimate parallel compute machine. It turns your entire memory (all the CRAM anyways) into a register set. It is based on the concept that, rather than bringing the data to the CPU for the computation, the CPU is brought to the memory.

    Small computational units AND/OR/Adder are included on the bit access lines for all the memory cells.
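
    One way to picture many tiny adders working next to the memory is bit-serial addition: every row handles one bit position per step, and all rows advance together. The Python sketch below only illustrates that general idea; it is not taken from the CRAM project, and the function name and the row-per-element layout are assumptions made for the example.

```python
from typing import List

def bit_serial_add(a: List[int], b: List[int], width: int) -> List[int]:
    """Add a[i] + b[i] for every row, one bit position per step, modulo 2**width."""
    carry = [0] * len(a)
    total = [0] * len(a)
    for bit in range(width):           # one step per bit position
        for row in range(len(a)):      # conceptually, all rows work at once
            x = (a[row] >> bit) & 1
            y = (b[row] >> bit) & 1
            total[row] |= (x ^ y ^ carry[row]) << bit
            carry[row] = (x & y) | (carry[row] & (x ^ y))
    return total

print(bit_serial_add([3, 200, 255], [5, 100, 1], width=8))
```

    The inner loop is sequential here only because this is ordinary software; in the hardware being described, each row would have its own small unit, so the time taken depends on the bit width rather than on how many rows there are.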

  13. What about Interrupt Handlers? by PetiePooo · · Score: 2, Interesting
    I found the article intriguing, but during the entire verbose, self-important sounding read, I was wondering how ISRs would be handled. For example, if the RMC were set to revert to the default mapping in three ops, and an ISR interrupted after the first op, would it revert to the default mapping in the middle of the ISR?

    Fortunately, that issue is addressed in his Message Parlor. The full text of his response to BritGeek follows:

    Presently the registers are saved automatically by the processor in something called a Task State Segment (TSS) during a task switch. There are currently unused portions of TSS which could be utilized and (sic) for RM and RMC during a task switch.

    The PUSHRMC and POPRMC instructions are available for explicit saves/restores of the RM and RMC registers in general code. I don't recommend it, however. The decoders would be physically stalled until the RM/RMC registers are re-populated. It would be better to use explicit MOVRMCs in general code.

    - Rick C. Hodgin, geek.com
    He may be onto something after all...
  14. x86 Emulator? by Shadow2097 · · Score: 2, Interesting
    From the sounds of the article, he wants to make register mappings more logical than virtual. My knowledge of assembly level programming is pretty basic, but I do agree that adding more GP registers would probably increase performance measurably.

    His second proposal, the RegisterMap field, strikes me as the incredibly complex part of this idea. He sounds like he's suggesting an idea that will turn the x86 architecture into a simplified emulator by allowing you to logically map any register address to any physical address you choose. While there are probably some benefits to this, it sounds like the complexity of programming an already exceptionally complex chip could go through the roof!

    I read somewhere in a previous article (last year sometime, can't find a link) that the way most compilers treated x86 was already done with so many pseudo instructions as to basically be an emulator. Now this was before I had any knowledge of assembly level programming, so maybe someone with more knowledge could clarify this?

    -Shadow

  15. Re:add core funcs libc/stdc++ to the CPU by HFXPro · · Score: 2, Interesting

    "good luck doing scientific calculations on a Geforce"

    Wha? Someone didn't tell you that 3D accelerators do lots of math that requires very intensive scientific calculations, even if their implementation doesn't give the most accurate results. In fact, much of the math they use is used by physicists, engineers, and mathematicians every day. Unfortunately, getting the information out in a way that lets you permanently store it, or know what the exact results are, could be quite difficult, other than seeing it as graphics on your screen. BTW, I do think that CPUs still need to have great ability to do computationally expensive instructions. There is enough math in the form of collision detection and game physics, among other things, to still need lots of processing power on the CPU.

    --
    Reserved Word.
  16. Great... by Elias+Israel · · Score: 2, Interesting

    A segment architecture for memory wasn't nasty enough, now we want to have a segment register for the registers?

    Thanks, no.

  17. Re:Why? by PetiePooo · · Score: 3, Interesting

    This may boil down to the generic do-it-in-hardware vs. do-it-in-software debate. Do we reorder the instructions in hardware (à la Pentium and Athlon), or make the compiler do it (à la Itanium)? Do we make the hardware predict branches or have the compiler drop hints? Register renaming as done by modern RISC-core x86 implementations likely addresses many of the issues that his proposed extension and a smart compiler (or assembler) would solve. Now, a 386, that would benefit from his technique.

    However, if we're going to revise that architecture, I say we add MMX and call it a 486. Then, we can add SSE and call it a Pentium.. And then, ...

    Oh, wait. nevermind.

  18. Re:Why? by fstanchina · · Score: 3, Interesting

    And you don't understand how modern processors really work. First, they have several levels of cache memory exactly because you don't want to go out to main memory too often. Second, they have many more registers than the assembly instructions can see: register renaming, speculative execution and all those tricks reorganize instructions so that the CPU core doesn't really have to move data back and forth between memory and registers so often.

    IANACPUD (I'm not a CPU designer) so I'm not going to try to describe this stuff further, but articles abound. Here are some: Into the K7, Part One and Into the K7, Part Two

  19. Re:Why? by p3d0 · · Score: 5, Interesting
    "Have you ever looked at the function entry and exit for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win."
    Ridiculous. You're saying that architectures with lots of regs are inferior because they make you save lots of registers at certain times, but reg-starved architectures make you save them all the time, all over the place, in any code that feels the slightest register pressure.

    At best, the problem you describe indicates those architectures use too many callee-save registers in their calling conventions. Having more caller-save registers is a pure win from this perspective.

    --
    Patrick Doyle
    I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
  20. If it needs a recompile, what's the point? by Christopher+Thomas · · Score: 3, Interesting

    The part that confuses me is that, since code would need to be recompiled to make use of this, you might as well just compile for x86-64 and make use of a larger flat register space. While the idea is interesting, there doesn't seem to be any advantage to using it (and a few disadvantages, pointed out by other posters).

  21. An intelligent comment on the subject by Cerlyn · · Score: 4, Interesting

    I can speak with some authority on this subject since I am presently taking a course on code optimization. What it looks like Mr. Hodgin is trying to do is work around the issue where people do not compile programs with processor-specific optimizations. He seems to be proposing doing so by allowing "paging" per se of registers amongst themselves, although in a bit of an odd fashion.

    Personally, I am not too fond of this approach. First of all, operating systems will need to be written to support this paging. Secondly, running a single MMX and/or SSE enabled application (which would use most if not all of the mapped registers) would cause all the other applications on the system to suddenly lose any benefit that paging would provide.

    The approach I would take (which may or may not be better) would be to change the software. Compilers like gcc 3.2 already know how to generate code with MMX and SSE instructions. Patches are available for Linux 2.4 that add gcc 3.2's new targets (-march=athlon-xp, etc.) to the Linux kernel configuration system. Libraries for *any* operating system compiled towards a processor or family of processors likely would fare better than generics.

    And yes, gcc 3.2 can do register mapping in a similar fashion (to ensure that all registers are used) on its own. If you read gcc's manual page, you will note that this makes debugging harder though. Gcc even has an *experimental* mode where it will use the x87 and SSE floating point registers simultaneously.

    Mr. Hodgin's approach might be a bit better for inter-process paging by a task scheduler for low numbers of tasks. But as a beginner in this field, I'm not sure what else it would be good for.

    Please pardon the omissions; I am not presently using a gcc 3.2 machine :)

    1. Re:An intelligent comment on the subject by Cerlyn · · Score: 3, Interesting

      I thought of a context switch (or possibly a function call) too. Correct me if I am wrong, but what you are trying to do is to create a bunch of registers (my understanding being they will just be the existing x86+MMX+SSE unnamed), and "map" them via another register that certain software knows how to access, correct? That way, when an application knows about these, it can "squirrel" data away in "hidden" registers for fast access later?

      The primary problem I have with this "switching" of registers is that registers are supposed to be the fastest, most reliable memory components in a computer. By forcing a lookup table and its associated logic into the mix, you potentially are significantly reducing a processor's speed and/or scalability. Furthermore, the amount of data that can be hidden away inside of a processor is limited. While hiding registers is nice, perhaps it would be better to have the ability to "latch" a row of data so it won't be cleared out of the L1 cache (no processor can do this at the moment?). I would think that this would be much easier to implement without speed degradation, as it would only require a few additional gates used during lookup/overwriting of the L1 cache (which ideally, for this case, is at least semi-associative (i.e. any memory "block" can map to at least two locations in the cache)).
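
      The "any block can map to at least two locations" property is what a two-way set-associative cache provides. A minimal sketch in Python follows; the class name, the sizes (4 sets, 16-byte lines) and the LRU replacement policy are assumptions chosen for the example, not a description of any particular processor's L1.

```python
from collections import OrderedDict

class TwoWayCache:
    """Toy set-associative cache that tracks only which lines are present."""

    def __init__(self, num_sets: int = 4, line_bytes: int = 16, ways: int = 2):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address: int) -> bool:
        """Return True on a hit, False on a miss (the line is then filled)."""
        block = address // self.line_bytes
        index = block % self.num_sets      # which set the block must live in
        tag = block // self.num_sets       # identifies the block within the set
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)         # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)      # evict the least recently used line
        lines[tag] = True
        return False

cache = TwoWayCache()
# 0x000 and 0x040 fall in the same set but can both stay resident.
print([cache.access(addr) for addr in (0x000, 0x040, 0x000, 0x080, 0x040)])
```

      The "latch" idea in the paragraph above would amount to exempting a chosen line from the eviction step; that is not modeled here.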

      Secondly, your proposal (as I understand it) would require all the registers to share the same area on a chip. Nowadays, the MMU, Arithmetic/Logic Unit, etc., each have their own area on the chip. Shared/swapped registers would have to be in the center of the chip, with longer lines to each partial unit (yielding delays and capacitance). I believe you proposed doing this by subunits though; this would reduce delays somewhat, but you are still requiring some centralization, and adding a significant delay.

      My personal position on this still kind of stands; if a program's compiler knows how to make use of the MMX & SSE functions of a computer, it should be set up to do so. That way, after an initial context switch for the entire program, the program (being correctly configured for a processor) flies. A compiler with register renaming functionality ("gcc3.2 -frename-registers", for example), can help do this for apps where the programmer does not know assembler. And if your "minimum requirements" mention a Pentium II 500, don't compile for a 486!

      In short, I fail to see how your proposal will speed up most applications significantly. Context switches are always expensive, but the ability to change contexts in 10 clocks versus 30 really isn't significant when your backside bus runs at less than 50% of the processor's speed.

      Obviously, being a minor player, I have my views, and I have to respect yours (especially since I only had about 5-10 minutes to read your piece), but personally, I really do not see why program-accessible context switching inside a processor is needed.

  22. Re:More trouble than its worth... by gillbates · · Score: 3, Interesting

    Oops. Forgot about PUSHA/POPA. Kind of strange, too, because I use these a lot.

    Also, about the opcode problem - adding registers doesn't necessarily mean adding opcodes. For example, IBM mainframes have one opcode for a load register instruction, and the registers are specified in the instruction. Were IBM to double the number of registers, the opcode would not have to change (granted, the instruction would get longer because they only allocated enough space in the source and destination fields for specifying one of 16 registers.) The problem is with the way x86 opcodes work - they aren't as universal, that is, the opcode's first byte is a function of both the operation and the register used. So expansion would be pretty difficult, unless they expanded the instruction set to include two byte opcodes (which they've already done, iirc), and use general purpose opcodes for common operations such as loading and storing.
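
    The difference between the two encoding styles can be shown with two real one-register-operand and two-register-operand encodings. The Python sketch below builds the bytes for x86 `PUSH r32` (opcode 0x50 plus the register number, so the register is folded into the opcode byte) and for System/360 `LR` (opcode 0x18 followed by a byte holding two 4-bit register fields). The helper names are only illustrative.

```python
X86_REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def x86_push(reg: str) -> bytes:
    """x86 PUSH r32: a single byte, 0x50 plus the register number (0-7)."""
    return bytes([0x50 + X86_REGS.index(reg)])

def s360_lr(r1: int, r2: int) -> bytes:
    """System/360 LR (load register): opcode 0x18, then two 4-bit register fields."""
    if not (0 <= r1 < 16 and 0 <= r2 < 16):
        raise ValueError("register fields are 4 bits wide")
    return bytes([0x18, (r1 << 4) | r2])

print(x86_push("ebx").hex(), s360_lr(3, 5).hex())
```

    In the x86 form there are only eight opcode values reserved for the eight registers, so a ninth register has nowhere to go without a new encoding; in the mainframe form the opcode stays the same and only the operand fields would need to widen.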

    It's unfortunate, but true.

    The real, and only solution, is that these companies get their acts together, quit issuing refreshes of old hardware, and finally give us their next gen chips to play with. Proposing anything else is just pointless. (Unless, of course, the new CPUs completely flop..)

    Couldn't agree with you more. What I would really like to see is an x86 processor that could handle IBM mainframe instructions. The IBM mainframe instruction set makes a lot more sense than Intel's instruction set - unlike Intel, IBM realized that someday they might be doing 64 bit and 128 bit computing, and designed the instruction set to be expandable. Also, they don't have a lot of "garbage" instructions - no MMX, no SSE, no SIMD junk to clutter up a good design. To be honest, benchmarks that I've run on real-world software indicate that today's x86 processors complete 4 instructions for every 5 clock cycles. Which indicates that branch prediction and deep pipelines aren't the performance enhancers that Intel and AMD seem to believe them to be. While they might work well in theory, real world performance speaks otherwise. Given this, I don't see any practical reason for keeping a kludgy instruction set around, because the complexity of the instruction set has been a great hindrance to the actual, rather than the theoretical, optimization of x86 processors.

    --
    The society for a thought-free internet welcomes you.
  23. Re:More trouble than its worth... by RickHGeek · · Score: 2, Interesting

    "Actually, this is just one of many potential downfalls."

    I was referring to use of the POPRMC instruction in code. I wouldn't recommend it unless there are other reasons why there might be a delay before actual code is executed, such as the last thing done before a RETF.

    "He forgot interrupts, mode switching....and IO would all require that the proposed RM/RMC register be loaded from the stack. The net effect would be that if his scheme is implemented, existing programs would run slower, not faster."

    I didn't forget those aspects of coding. There are two distinct possibilities here which entirely resolve that dilemma, both handled in hardware. 1) Interrupts are handled in a special way: during interrupt processing all RM/RMC values are ignored and the default 8 GP registers are used, or 2) Interrupts automatically push RM/RMC on the stack when signaled, and automatically pop them back off when IRETD is issued. These non-problems are resolvable.

    Next, mode switching. Mode switching would make no difference. Again, the hardware state could simply persist as it is presently set up through the mode switch, meaning that SC will either count down and reset RM/RMC to the default/popped values when it hits zero, or, if it was populated with 1111b, persist forever (until changed with MOVRMC again).
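
    The countdown behaviour and interrupt option 2 can be modeled in a few lines. The Python sketch below is a toy built only from what is said in this comment: the class, its method names, the encoding of the "default" map as 0 and the one-tick-per-instruction rule are all assumptions made for illustration, not the proposal's actual specification.

```python
PERSIST = 0b1111   # per the comment: this count value never expires
DEFAULT_MAP = 0    # hypothetical encoding of "the default 8 GP registers"

class RegisterMapState:
    """Toy model of an RM value with a countdown, saved across interrupts."""

    def __init__(self) -> None:
        self.rm = DEFAULT_MAP
        self.count = 0
        self.saved = []  # stand-in for the stack used on interrupt entry

    def movrmc(self, rm: int, count: int) -> None:
        if not 1 <= count <= PERSIST:
            raise ValueError("count must be between 1 and 1111b")
        self.rm, self.count = rm, count

    def step(self) -> None:
        """Retire one instruction; an expiring count restores the default map."""
        if self.count == PERSIST or self.count == 0:
            return
        self.count -= 1
        if self.count == 0:
            self.rm = DEFAULT_MAP

    def interrupt_enter(self) -> None:
        self.saved.append((self.rm, self.count))   # option 2: push RM/RMC
        self.rm, self.count = DEFAULT_MAP, 0

    def interrupt_return(self) -> None:
        self.rm, self.count = self.saved.pop()     # popped again on return

state = RegisterMapState()
state.movrmc(rm=5, count=3)
state.step()                 # one of three mapped instructions has run
state.interrupt_enter()
for _ in range(10):
    state.step()             # the handler runs with the default map
state.interrupt_return()
print(state.rm, state.count)
```

    In this model the handler never sees the custom mapping, and the interrupted code resumes with its remaining count intact, which is the outcome option 2 is meant to guarantee.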

    I've been told by probably 10 people so far that the P4 engine was designed with a 2 cycle latency L1 data cache, the purpose of which is to hide a lot of the latency incurred by not having a large GP register set. While this is, indeed, a great thing ... it never approaches the speed of register to register transfers. If code could be written to utilize up to 56 GP registers instead of 8 (8 GP + 16 MMX + 32 XMMX) then a great deal of those 2-cycle latency hits would be removed, thereby speeding up code fairly significantly.
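
    The figure of 56 is consistent with counting every register file in 32-bit slots. The Python sketch below reproduces that arithmetic under the assumption (not stated in the comment) that there are eight 64-bit MMX registers and eight 128-bit SSE registers; the helper for avoided cycles is a crude upper bound that ignores any overlap the out-of-order core already achieves.

```python
GP_SLOTS = 8                   # eight 32-bit general purpose registers
MMX_SLOTS = 8 * 64 // 32       # assumption: eight 64-bit registers as 32-bit slots
XMM_SLOTS = 8 * 128 // 32      # assumption: eight 128-bit registers as 32-bit slots

def total_slots() -> int:
    """Number of 32-bit register slots if all three files were usable as GP."""
    return GP_SLOTS + MMX_SLOTS + XMM_SLOTS

def l1_cycles_avoided(accesses_kept_in_registers: int, l1_latency: int = 2) -> int:
    """Upper bound on cycles saved if those L1 hits became register accesses."""
    return accesses_kept_in_registers * l1_latency

print(total_slots(), l1_cycles_avoided(1000))
```

    How much of that upper bound shows up as real speedup is exactly what the other posters are disputing, so the second number should be read as a ceiling, not a prediction.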

    I've had a couple people that I respect contact me in email about this concept. They've asked me to write an emulator which demonstrates this process. I will be doing that in the coming weeks/months. I'm sure this topic will be dead by the time I get it completed, but it might help stir it up again. We'll see what it really does when the numbers are published. Take care!

    - Rick C. Hodgin, geek.com

  24. Re:Cache is the key by Anonymous Coward · · Score: 1, Interesting

    They made the cache smaller because the trace cache is more complex and because the numbers bear out that it's more effective per entry than traditional caches. It would still be nicer if it was larger.

    Also, the P4 is not the end of that architecture, only the beginning. They're thinking ahead. As process improvements kick in, soon that cache will be HUGE.