Posted by
CowboyNeal
on from the more-power-now dept.
NickSD writes "ChipGeek has an interesting article on increasing x86
CPU performance without having to redesign or throw out the x86 instruction set. Check it out at
geek.com."
Shouldn't we improve bus speed, data access speeds, etc etc first? After all, the bottleneck is not the processor anymore...
No.
Because the whole point is preventing memory access. High bandwidth busses are very expensive. If you have a lot of registers, you can avoid memory accesses, making instructions run at full speed.
The best way to reduce the impact of a bottleneck is not making the bottleneck wider. It is making sure your data doesn't need to travel through the bottleneck.
After that, it doesn't hurt to make the bus bandwidth bigger.
I read the article. From what I can see this guy writes lots of assembly, but knows very little about how processors are designed. The huge gains you all see have already been made by register renaming and caches. There might be some gain left by giving the compiler direct control over these, but at the cost of much complexity in the register renaming hardware.
The P4 has a very deep pipeline. Looking for register conflicts is hard enough without adding another layer of redirection.
The fact that the article never mentions register renaming shows the author never did any research into this topic before writing.
Because if you had read the article you'd realize that this is essentially a zero cost, backwards compatible method of dramatically increasing program execution speed several orders of magnitude -- so the question is really, "Why not?"
It does not matter how fast your CPU is if it spends a significant amount of its time waiting for main memory access. All that happens is that it's doing more NOPs/sec, which isn't terribly useful. That's why industrial-grade systems have fancy buses like the GigaPlane.
With more registers, you would need less clock speed.
Have you ever looked at the function entry and exit for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win.
Register renaming already does what's being proposed here, but transparently. In fact, most of the instruction reordering done by a good optimizing compiler (and later by the out-of-order dispatching unit) aims to increase parallelism on register usage.
Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.
P4 processors have 128 registers available for register renaming. Using all of them is not so easy, so Hyperthreading (still only on Xeon) tries to bring two different processes into the instruction mix, keeping their renaming maps separate, so the dispatching unit has more noncolliding instructions ready for execution. This won't make one CPU as fast as 2, but it does keep that insanely deep pipeline from getting filled with bubbles (or would that be 'empty of instructions'?)
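To make the renaming idea concrete, here is a minimal sketch in Python. It is a toy model and not how any real P4 or Athlon is wired: the function name `rename`, the tuple instruction format, and the physical register names `p0`, `p1`, ... are all made up for illustration. The point it shows is that every write to an architectural register gets a fresh physical register, so two unrelated uses of `eax` stop colliding.

```python
from itertools import count

def rename(instructions, arch_regs=("eax", "ebx", "ecx", "edx")):
    """Toy register renamer (hypothetical model, not real hardware).

    Each instruction is (dest, src1, src2) using architectural names.
    Returns the same instructions rewritten to physical register names.
    """
    fresh = (f"p{i}" for i in count())
    # Every architectural register starts out mapped to its own physical one.
    table = {reg: next(fresh) for reg in arch_regs}
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read whichever physical register currently holds the value.
        s1, s2 = table[src1], table[src2]
        # The destination gets a brand-new physical register, which removes
        # write-after-read and write-after-write conflicts on `dest`.
        table[dest] = next(fresh)
        renamed.append((table[dest], s1, s2))
    return renamed

program = [
    ("eax", "ebx", "ecx"),  # eax = ebx op ecx
    ("edx", "eax", "eax"),  # edx = eax op eax (true dependency on the first)
    ("eax", "ebx", "ebx"),  # eax reused for something unrelated
]
renamed_program = rename(program)
for before, after in zip(program, renamed_program):
    print(before, "->", after)
```

In this sketch the first and third instructions both write `eax`, but after renaming they write `p4` and `p6`, so the third no longer has to wait for the second to finish reading the old `eax`.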
Intel is constantly adding new commands and registers to the CPU. That is the whole point of the article: Intel could just as easily add some that greatly increase the execution speed of ALL programs, not just a few!!!
Have you ever looked at the function entry and exit for processors like MIPS or PowerPC? There can easily be 20-40 instructions (at 4 bytes per instruction) to save and restore registers. Sometimes fewer registers is a win.
Ridiculous. You're saying that architectures with lots of regs are inferior because they make you save lots of registers at certain times, but reg-starved architectures make you save them all the time, all over the place, in any code that feels the slightest register pressure.
At best, the problem you describe indicates those architectures use too many callee-save registers in their calling conventions. Having more caller-save registers is a pure win from this perspective.
-- Patrick Doyle I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
Of course RISC processors are so much nicer to work with because of their large, flat register files (at least 16 or 32 registers, all of them equally usable), but that's not possible with existing x86 architecture.
Although I would like to take this opportunity to point out that AMD's x86-64 (Opteron) architecture increases the number of GP and XMM (used for SSE instructions) registers up to 16 each.
-- "Evil company X is threatening to restrict our rights! Let's all get together to stop--OOOH! SHINEY!!!" -- AC
That's pretty sweet how he makes the x86 processor faster by adding commands for divx! This guy knows how to improve Intel architecture for the masses!
Other new commands:
LIE   Launch IE
LMW   Launch MS Word
LME   Launch MS Excel
LMO   Launch MS Outlook
LMOV  Launch MS Outlook Virus
LCNR  Launch Clippy for No Reason
DPRN  Display Pr0n
SPOP  Show IE Popup
SPU   Spam User
SHDR  Send Hard Drive Contents to Redmond
RBT   Reboot
SBS   Show Blue Screen
-- Lately democracy seems to be based on the skybox, the Happy Meal box, the X-box, and the idiot box.
Ok, he realizes that the x86 architecture is flawed. One of the most limiting problems is the lack of general purpose registers (GPR), so he adds more complexity to an already over-complex solution to solve this problem. All I have to say to this is: when will you see that the solution is as simple as switching architecture!
As most code today is written in higher level languages (C/C++, Java, etc.) all it takes is a recompile and perhaps some patching and adaptations to small peculiarities. The Linux kernel is a proof of this concept, a highly complex piece of code portable to several platforms with a huge part of the code fully portable and shareable. This means that it is not hard to change architecture!
If the main competition and its money would move from the x86 to a RISC architecture (why not Alpha, MIPS, SPARC or PPC) I'm sure that the gap in performance per penny would go away pretty soon. RISCs have several advantages, but the biggest (IMHO) is the simplicity: no awkward rules (non-GP registers), no special case instructions, easy to pipeline, easy to understand and easy to optimize code for (since the instruction set is smaller).
And to return to the original article. Please do not introduce more complexity. What we need is simple, beautiful designs, those are the ones that one can make go *really* fast.
Switching architectures is not that trivial. You seem to think that every company has the source code available for every piece of software they run. That isn't true. You seem to think that programs can easily be compiled between platforms if written in C/C++ - also untrue. You think that the bug fixes for compiling between platforms are "small peculiarities" -- well, they may be small, but that doesn't make them easy. In fact, it makes it fucking hard because the differences are so buried in libraries, case-specific, and undocumented that it's a nightmare to find them. Yes, I've done this kind of thing. It's godawful.
Changing architecture is difficult. This is not a closed vendor market - anyone can put together an x86 box and you have at least 3 different CPU vendors to choose from, 3 - 5 motherboard chipsets, and a virtually infinite variety of other hardware. If Dell computer suddenly decides to move to a PPC architecture what's going to happen? They're going to lose all their customers and fast. Because the very limited benefits of a different architecture do not make up for the costs of going to one.
Yes, I said limited benefits. Yeah, when I was in college taking CompE, EE, and CS courses on CPU and system design I also found the x86 ISA to be the most demonic thing this side of Hell. Well, I'm older and wiser now and while x86 isn't perfect, it's not that bad either. Its price/performance ratio is utterly insane and getting better yearly. Contrary to the RISC architecture doom and gloomers, x86 didn't die under its own backwards compatibility problems. It's actually grown far more than anyone expected and is now eating those same manufacturers for lunch.
You know, back in the early 90s when RISC was first starting to make noise the jibe was that Intel's x86 architecture was so bad because it couldn't ramp up in clock speeds. Intel was sitting at 66 MHz for their fastest chip while MIPS, Sparc, etc. were doing 300 MHz. Of course, now Intel has the MHz crown, with demonstrations exceeding 4 GHz, and the RISC crowd is saying that MHz isn't everything and they do more work/cycle than Intel (which is true, but the point remains).
All that said, go look at the SPEC CInt2000 and FP2000 results. Would you care to state what system has the highest integer performance? And whose chip has the highest floating point?
Oh, and let's not forget that I can buy roughly 50 server-class x86 systems for the price of one mid-level Sun/IBM/HP/etc. server.
Note - server performance isn't all about CPU, but since the OP wanted to make that argument, I just thought I'd point out how wrong he is. There is still quite a bit of need for high end servers with improved bus and memory architectures, but don't even try to argue that the CPU is more powerful. It isn't.
Both the Intel Pentium III and IV and the AMD K6-2 and K7 (Athlon) are essentially RISC processors in the core. There's an outer layer that translates from the x86 ISA to their internal microarchitecture, except for a few outdated instructions that are virtually never used, which are implemented in microcode (and thus slow as hell comparatively).
There is no way to directly access the core ISA, nor do I know of it being documented anywhere. Intel planned to move the industry off the x86 ISA to Itanium, but so far that's utterly failed and with the Intergraph lawsuit it may be dead in the water now.
AMD's x86-64 still uses the x86 ISA, but extends it. Additionally if you talk to the chip in 64 bit mode then 8 (I think) additional GP registers are available in silicon - not just register renaming, which occurs already in every major CPU on the market today. The additional registers (all 64-bit wide) pretty much eliminate the need for an architecture move, at least as it relates to registers. Intel hasn't yet adopted x86-64 though (although they can since AMD must license to them because of IP agreements).
Still, what's funny is this desire for a performance increase... the x86 chips are the fastest CPUs on the market for integer performance and in the top 5 for floating point - although Alpha still reigns supreme for FP I believe. But compare the price of an x86 chip to pretty much anyone else and you start wondering exactly what the performance issue is.
The performance problems are not with the CPU anymore. The bus and memory interfaces are slow. They've been getting faster over the years, but closed vendor boxes like Sun, HP, IBM, etc. will always do better because they don't have to deal with getting a half dozen different major OEMs on board, along with countless peripheral manufacturers. Nor do they have to concern themselves overly with backwards compatibility.
When I started looking at the ARM chips I wondered why we ever used x86's etc.
RISC / CISC is really a misnomer.
RISC has plenty of instructions, and it's meant to be superscalar.
It starts with Register Gymnastics. Basically with RISC, there's no more of it. Every register is general. It can be data, or it can be an address. All the basic math functions can operate on any register.
With Intel x86, everything has its place.
Extend it further out. There's something called "Conditional Instructions". Properly utilized, these make for an ultra efficient code cache. The processor is able to dump the code cache instructions ahead of time. Which also means, not as much unnecessary "pipeline preparation" to perform an instruction.
Then there's THUMB which compresses instructions so that they take up less physical space in a 64, 128 bit world. There's lots of wasted bits in an (.exe) compiled for a 386.
Last I checked, 32-bit ARM THUMB processors are dirt freaking cheap, and they're manufactured by a whole multitude of vendors as opposed to just AMD and INTC.
The Internet is slowly wearing down the x86 as more and more processing is moving back on the server where big iron style RISC can churn through everything.
The article should really just be called:
"An Academic Exercise in Register Gymnastics"
Um, how is this anything new?
by
Andy Dodd
·
· Score: 4, Informative
Linux kernel source - memcpy() anyone?
(On MMX machines, the wider 64-bit MMX registers are used for memcpy() rather than the 32-bit standard integer registers)
This has been in the kernel for a few years now and anything that uses memcpy() benefits from it. Move along now.
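A rough way to see why the wider registers help is simply to count moves. The sketch below is only back-of-the-envelope arithmetic, not the kernel's actual memcpy(): the helper name `moves_needed` and the 4096-byte example size are assumptions for illustration, and real throughput also depends on alignment and memory bandwidth.

```python
def moves_needed(num_bytes, register_bytes):
    """Number of load/store pairs needed to copy num_bytes when each
    register holds register_bytes.

    Ceiling division: a partial trailing chunk still costs one move.
    """
    return -(-num_bytes // register_bytes)

block = 4096  # bytes; an arbitrary example size
via_32bit = moves_needed(block, 4)  # 32-bit integer registers
via_64bit = moves_needed(block, 8)  # 64-bit MMX registers
print(via_32bit, via_64bit)
```

Under these assumptions the 64-bit path issues half as many load/store pairs for the same block.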
-- retrorocket.o not found, launch anyway?
Another Hideous Hack for IA32
by
seanellis
·
· Score: 5, Informative
The scheme as proposed would work, but nothing will change the fact that it's another hideous hack to get around the non-orthogonal addressing modes in the original Intel 80x86 architecture.
Even the little microcontroller chips that I can buy for $2 have 32 general purpose registers (Atmel AVRs, for anyone who cares).
Worse, this scheme would not benefit existing code - it still requires code changes to work.
Finally, on the gripping hand, the Pentium III and 4 have a very similar register renaming scheme going on automatically in the hardware. The 8 "logical" registers are already mapped dynamically into a much larger physical register file. (From ExtremeTech: http://www.extremetech.com/article2/0,3973,471327,00.asp.)
Mmmm, Assembler...
by
guidemaker
·
· Score: 5, Funny
I'm reminded of the days I used to code for the old Acorn Archimedes (don't look for it now, it's not there any more) and our apps were usually way faster than the competition's.
When asked why, we were tempted to tell them that we used the undocumented 'unleash' instruction to unleash the raw power of the ARM processor.
The Problems of Obsolete design
by
Alien54
·
· Score: 5, Interesting
This is what I call the big problem. That design is utterly abominable. We live in a world where it's nothing to have 1 gigabyte of RAM in a computer. We have 80 GB hard drive platters now, allowing even greater-sized drives. And yet at the heart of every single one of your x86 computers out there, a mere 6 GP registers are doing nearly all of the processing. It's amazing. And it's something I've personally wrestled with every day of my assembly programming career.
This sort of reminds me of what happened with IRQs. Ultimately Intel "solved this" via the PCI bus, but performance has occasionally been problematic. Of course, that problem goes back to the original IBM design for the original IBM PC. Intel is also very aware, I imagine, of what happened when IBM tried a total redesign with the EISA bus, etc. It got rejected, I think, primarily because it was proprietary. In any case, enough companies have been nailed on backward compatibility issues that Intel may be nervous about making a total break.
The upside is being able to run old software on new hardware. You don't want to break too many things.
-- "It is a greater offense to steal men's labor, than their clothes"
Re:The Problems of Obsolete design
by
Zathrus
·
· Score: 5, Informative
As others mentioned, MCA (MicroChannel Architecture) was IBM's abysmal attempt at recapturing the PC market. It died a horrible death, and deserved it. Frankly, the technology sucked only slightly less than the ISA/EISA bus it wanted to replace.
Anyone else remember the horrors of all those damn control files on floppies?
There are a lot of architectural nightmares in the PC design... and while some of them are at the CPU level (like the 6 GP registers), most of them are at the bus level. Who the hell puts the keyboard on the 2nd most important interrupt (IRQ1)? The entire bus is still borked, although PCI has mostly hidden that now. But the system and memory buses are the sole reason that IBM, HP, Sun, etc. have higher performance ratings than x86 -- the P4 and Athlon processors are faster in virtually every case on a CPU to CPU basis.
The bus and memory architecture is also why x86 does so incredibly badly in multi-CPU boxes. It's just not designed for it, the contention issues are hideous, and while you may only get 1.9x the performance going to a 2 CPU Sun box, you'll only get 1.7x on x86. It gets worse as you scale (note - those numbers are for reference only, I don't recall the exact relationships for dual CPU x86 boxes anymore, but the RISC systems handle it better due to bus design).
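To illustrate the "gets worse as you scale" remark, here is a small sketch. It rests on a loud assumption that is not in the post: that every doubling of the CPU count multiplies throughput by the same factor (1.9 or 1.7, the poster's reference-only figures). Real machines do not scale this neatly, and the function name `speedup` is made up for the example.

```python
def speedup(cpus, per_doubling):
    """Hypothetical model: each doubling of CPUs multiplies throughput
    by per_doubling. Assumes cpus is a power of two."""
    doublings = cpus.bit_length() - 1
    return per_doubling ** doublings

# Compare the two reference-only factors from the post at 2, 4 and 8 CPUs.
table = {cpus: (speedup(cpus, 1.9), speedup(cpus, 1.7)) for cpus in (2, 4, 8)}
for cpus, (good_bus, poor_bus) in table.items():
    print(cpus, round(good_bus, 2), round(poor_bus, 2))
```

Under this model the gap widens from 1.9x versus 1.7x at two CPUs to roughly 6.9x versus 4.9x at eight.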
Really there's nothing wrong with the x86 processors except to the CompE/EE/CS student. I was there once and couldn't stand it. Real life has shown that it isn't that bad, and recent times have shown that it's actually really damn good. Except for the buses. They suck. And while things like PCI-X and 3GIO are on the horizon, I don't see them seriously changing the core issues without causing massive compatibility problems.
Does anyone else have flashbacks to
by
wiredog
·
· Score: 4, Interesting
segment:offset addressing? He's doing it with registers, but it seems the same sort of thing. One register is for segment, the other is the offset?
Well, not quite, but it has the same flavor.
After working in x86 assembly, I really appreciated high level and minimally complex languages like C.
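For anyone who never had the pleasure, here is a short sketch of real-mode segment:offset arithmetic on the 8086: the 16-bit segment is shifted left four bits and added to the 16-bit offset. The function name `linear_address` and the sample values are just for illustration.

```python
def linear_address(segment, offset):
    """Real-mode 8086 address: 16-bit segment shifted left 4 bits,
    plus a 16-bit offset."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)

# Two different segment:offset pairs can name the same byte.
a = linear_address(0xB800, 0x0000)
b = linear_address(0xB000, 0x8000)
print(hex(a), hex(b), a == b)
```

Both pairs land on 0xb8000, which is part of what made the scheme so much fun to debug.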
Technical point of view
by
Lomby
·
· Score: 4, Interesting
The guy does not realize that what he proposed is not at all simple to implement in silico.
These two additional mapping registers would complicate the pipeline hazard detection in an exponential way.
Another point is that I don't think that by doubling/tripling the number of registers available you will get a tenfold performance increase: a small increase could be expected, but not much.
Another problem is the SpecialCount counter: this would complicate the compilers too much. It would also make the instruction reordering almost impossible.
I suspect this would be a rather expensive chip
by
shimmin
·
· Score: 5, Interesting
While the base idea is interesting (add instructions that support using the multimedia registers as GP registers), I suspect that actually implementing the functionality of the GP registers in the multimedia ones could result in a prohibitively expensive CPU.
Anyone who's ever tried to use the MMX or XMM registers for non-multimedia applications knows what I'm talking about. The instruction sets for them are nicely tweaked to let you do "sloppy" parallel operations on large blocks of data, and not really suited for general computing. You can't move data into them the way you would like to. You can't perform the operations you would like to. You can't extract data from them the way you would like to. They were meant to be good at one thing, and they are.
I once tried to use the multimedia registers to speed up my implementation of a cryptographic hash function whose evaluation required more intermediate data than could nicely fit in GP registers, and had enough parallelism that I thought it might benefit from the multimedia instructions. No such luck. The effort involved in packing and unpacking the multimedia registers undid any gains in actually performing the computation faster -- and the computation itself wasn't that much faster. I was using an Athlon at the time, and AMD has so optimized the function of the GP registers and ALU that most common GP operations execute in a single clock if they don't have to access memory, while all the multimedia instructions (including the multiple move instructions to load the registers) require at least 3 clocks apiece.
Now this leads me to suspect that the multimedia registers have limited functionality and slow response for a single reason: economics. The lack of instructions useful for non-multimedia applications could be explained via history, but what chip manufacturer wouldn't want to boast of the superior speed of their multimedia instructions? And yet they remain slower than the GP part of the chip.
So I conclude that merely making a faster MMX/XMM processor is prohibitively expensive in today's market. And this proposal would definitely require that, even if actually adding the additional wiring to support the GP instructions for these registers was feasible. Because what would be the point of using these registers for GP instructions if they executed them slower than the same instructions actually executed on GP registers?
More registers are not enough.
by
gpinzone
·
· Score: 4, Informative
The whole gist of the article has to do with the x86's lack of general purpose registers. While this is true, you're not going to solve all of the x86 shortcomings simply by figuring out a way to add more of them. There are MANY things wrong with the x86 design; GP registers are just one of them.
There's an entire section in the famous Patterson book that goes into all of the issues in much more detail than I care to state here.
Besides, there are already more efficient (albeit complex) solutions to extend registers that make much more sense in the current world of pipelined processors. Register renaming is one such example.
Re:Switching Architectures
by
killmenow
·
· Score: 5, Insightful
As most code today is written in higher level languages (C/C++, Java, etc.) all it takes is a recompile and perhaps some patching...
But a lot of the code running today wasn't "written today" if you know what I mean. The problem is, in order to recompile you first need: a) the original source, and b) someone capable of patching, etc.
A lot of internal apps are in use for which the source code is lost. And a lot of code in use today (sadly) was not written in languages as portable as C, C++, and Java. A lot of apps in use today were written in Clipper and COBOL and a bunch of other languages that may not have decent compilers for other platforms. So recompiling it isn't an option. A complete re-write is necessary.
Even for situations in which application source *does* exist, and suitable compilers exist on other architectures, it is more often than not poorly documented...and the original author(s) is/are nowhere to be found. So in order to patch/fix the source to run on the new architecture, you not only need someone well versed in both the old and the new architectures, but someone who can read through what is often spaghetti code, understand it and make appropriate changes.
In a lot of these cases it's easier to stick with the current architecture. And that, to some degree, is why the x86 architecture has gotten as complex as it is.
More than 3 answers !FREE!
by
purrpurrpussy
·
· Score: 4, Insightful
You are VERY confused.
1 - Zero Cost. 2 - Backwards Compatible. 3 - Orders of magnitude.
1 - You have to buy new chips - this will improve the speed of "computing" but it will not increase the speed of THIS computer I have right HERE.
2 - No old code has RM/RMC instructions in it and will NOT run any faster than it already does in a "standard" x86 mode. Yes it is backwards compat. but by the same token so is MMX, EMMX, 3DNOW!, SSE, SSEII, AA64 etc....
3 - Anyone who can sell me a program to "suddenly" make all my code go 10x or 100x faster is guaranteed to give me a good chuckle!!!!!!!
As for the article... well you've hugely increased the number of bits it takes to address a register and swapping the RM register is going to cause all sorts of new dependency chains inside the chip.
Personally.... I'd go for a stack machine. Easily the most efficient compute engine.
Now - if we could get back to point number 1 and point number 3. If YOU can make MY computer go 10 or 100 times FASTER with SOFTWARE I promise I WILL give YOU some MONEY....;-)
-- "None of this shit works" -W.Shatner
Why should one do that?
by
mick29
·
· Score: 4, Informative
I do not like the changes proposed although x86 is awfully flawed (not enough GP registers, terribly overloaded instruction set {anyone ever used BCD commands? -- Yes, I hear the loud "We do" from the COBOL corner.}, you name it... ).
But this change would:
Make an internal interface explicitly controlled by the programmer/compiler, loading an enormous amount of work on the compiler creators. (Just have a look at IA64 - is there any good compiler out there already? I haven't had a look for a while.)
Destroy (or at least reduce the efficiency of) the internal register renaming unit, thus slowing down the out-of-order execution core and such (the entire core, actually...)
Sorry, but this man may have been busy programming x86 assembly his entire life (and for this he deserves my respect), but he is not up to date on how a modern x86 cpu works in its heart. When I heard the lectures in my university about how this stuff works, I gave up learning assembly -- one just doesn't need it anymore with the compilers around.
Reading the books by Hennesy/Patterson (don't know if I spelled them correctly) may help a lot.
More trouble than it's worth...
by
gillbates
·
· Score: 4, Insightful
The only potential downfall I see in this design is the possible pipeline stall seen when RM/RMC have to be populated from stack data. When that happens, no assembly instructions can be decoded until the POPRMC instruction completes and RM/RMC are loaded with the values from the stack.
Actually, this is just one of many potential downfalls. He forgot that interrupts, mode switching (going from protected to real mode, as some OS's still do), and IO would all require that the proposed RM/RMC register be loaded from the stack. The net effect would be that if his scheme is implemented, existing programs would run slower, not faster. Furthermore, placing the RM/RMC register on the stack is an impossibility without breaking backward compatibility; many assembly language coders depend on a set number of bytes being added to the stack when they perform a call or interrupt.
Why not just add 24 GP registers to the existing processor? Honestly, it would be a lot simpler, and would not complicate the whole x86 mess, nor break backward compatibility.
I don't mean to flame, but this guy is way off base. The biggest problem with the x86 instruction set is lack of registers, and the second biggest problem is that its complexity is rapidly becoming unmanageable. Not even Intel recommends using assembly anymore - their recommendation is to write in C and let their compiler perform the optimizations. Adding more instructions like this would further diminish the viability of coding in assembly.
A far better solution would be to simply keep the existing instruction set intact, and add more GP registers. IBM got it right the first time - their mainframe processors have 16 general purpose registers which can be used for any operation - addressing, indexing, integer, and floating point calculations. If anything, Intel should stop adding instructions and start adding registers.
-- The society for a thought-free internet welcomes you.
Re:Cache is the key
by
Anonymous Coward
·
· Score: 5, Informative
Cache is a huge Intel problem. 20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 only has 32K.
They may throw big chunks of L2 at the problem, but it seems to me that so little L1 means more time moving data and less time processing...
When programmers try to be architects...
by
Chris Burke
·
· Score: 5, Informative
Yes, he basically invented register renaming, but put it under explicit programmer control. It's a programmer's solution to what hardware has already done, and as was inevitable he doesn't see that he will do more harm than good.
Here's why his idea sucks:
1) Register renaming dependent on the RMC. You can't issue any instructions if there is a POPRMC in the machine until the POPRMC finishes execution. He calls it "a few cycles", but it's much worse than that. You've prevented any new instructions from entering the window until the stack access is done, preventing any work that -could- have been done in parallel from even being seen. Function call/return overhead is a big deal, and he just kicked it up a notch.
2) His whole problem #3 -- that you can't explicitly access the upper 16 bits of a 32-bit GPR. All I can say is -- thank God! Being a programmer, he probably doesn't realize that being able to address sub-registers is actually a big problem with x86. The whole sub-register-addressing problem causes all kinds of extra dependencies and merge operations. And he wants to make it worse? I think he should be slapped for this idea. x86-64 had the right idea -- you cannot access -just- the upper 32 bits of a GPR, and when you execute a 32-bit instruction that writes a GPR, the upper 32-bits are not preserved. Which is how the RISCy folks have been doing it all along, but hey.
3) This idea requires an extra clock cycle in the front-end, to do the translation from architected to the expanded architected register space, prior to being able to do the architected->physical register translation.
4) Because you still can't address more than 8 registers at a time, you'll be using lots of MOVRMC instructions in order to make the registers you need visible. Ignore how horrible this would make it for people writing assembly ("Okay, so now EAX means GPR 13?") or compilers, this is going to result in a lot of code bloat.
5) Because of 1) and 4), modern superscalar decoders are going to be shot. If you fetch a MOVRMC, followed by POP EAX and POP EBX, you can't decode the second two until -after- you've decoded the MOVRMC and written its values into the map.
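Point 2) is easier to see with the two write behaviors side by side. The sketch below models a 64-bit register as a Python integer; the function names `write16` and `write32_x86_64` are invented for illustration. A 16-bit write keeps the upper bits, so the hardware needs the old value (a merge), while a 32-bit write in x86-64's 64-bit mode zeroes the upper half and needs nothing from the old contents.

```python
MASK64 = (1 << 64) - 1

def write16(old_value, new16):
    """16-bit write (e.g. to AX): upper bits are preserved, so the result
    depends on the register's old contents (a merge)."""
    return (old_value & ~0xFFFF & MASK64) | (new16 & 0xFFFF)

def write32_x86_64(old_value, new32):
    """32-bit write in 64-bit mode (e.g. to EAX): the upper 32 bits are
    zeroed, so old_value is not needed at all."""
    return new32 & 0xFFFFFFFF

old = 0x1122334455667788
merged = write16(old, 0xBEEF)
zero_extended = write32_x86_64(old, 0xDEADBEEF)
print(hex(merged), hex(zero_extended))
```

The second function never reads `old_value`, which is exactly the dependency the zero-extending rule removes.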
Now all this is so that you can save on loads/stores to the stack. Which is great, but at least when those loads and stores are executing, independent instructions can still go. Every RMC-related stall is exactly that -- no following instruction can make progress.
Not that increasing the number of registers in x86 isn't a good idea -- it's just his implementation that sucks. With him being an x86 programmer, I'm surprised he didn't think of the most obvious solution -- just add a prefix byte to extend the size of the register identifiers in the ModR/M and SIB bytes. You get access to ALL GPRs at once (rather than a 8-register window), no extra stalls are required, and your code size only goes up by one byte for instructions that use the extra registers.
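x86-64's REX prefix is one real scheme along these lines, so here is a minimal decoding sketch based on it. The function name `decode_modrm` is made up, and the sketch is deliberately simplified: it ignores the SIB byte, where REX.X and REX.B extend the index and base fields instead.

```python
def decode_modrm(rex, modrm):
    """Decode a ModR/M byte, widening reg and rm with REX.R and REX.B.

    rex is the REX prefix byte (0x40-0x4F) or None if no prefix is present.
    Returns (mod, reg, rm) with reg and rm in the range 0-15.
    Simplification: SIB-byte handling is left out.
    """
    mod = (modrm >> 6) & 0b11
    reg = (modrm >> 3) & 0b111
    rm = modrm & 0b111
    if rex is not None:
        if (rex & 0xF0) != 0x40:
            raise ValueError("not a REX prefix")
        reg |= ((rex >> 2) & 1) << 3  # REX.R becomes bit 3 of reg
        rm |= (rex & 1) << 3          # REX.B becomes bit 3 of rm
    return mod, reg, rm

# 89 C8 encodes mov eax, ecx; 4D 89 C8 encodes mov r8, r9.
legacy = decode_modrm(None, 0xC8)
extended = decode_modrm(0x4D, 0xC8)
print(legacy, extended)
```

The same ModR/M byte names registers 1 and 0 without the prefix and registers 9 and 8 with it, at a cost of one extra byte and no remapping state to stall on.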
I can't help but commend him on his idea being well-thought out. To the best of his knowledge, he tried to address all issues. But that's the problem -- he's a programmer, not a computer architect.
Re:Cache is the key
by
Sivar
·
· Score: 5, Insightful
I've got three words for you: cache, cache and cache.
Why do you think Pentium Pro was such a huge success that it's still being used in CPU intensive operations? Why do you think Sun Sparc and Digital/Samsung Alpha CPUs trash modern Pentium 4s and Athlons at 500 MHz? Yup. Loads and loads of cache.
No. First, Alphas and SPARCs do not trash modern x86 CPUs: the Pentium IV 2.8GHz and Athlon XP 2800+ are the fastest CPUs in the world for integer math and the Itanium 2 is the fastest in the world for floating point math.
Extra cache only helps much until it is large enough to contain the working set of the primary application being run. Larger cache can improve performance further, but once the cache can contain the working set, the gain is in the single-digit percents. The working set of the vast, vast majority of applications is under 512K, and most are under 256K. You'll find that increasing the speed of a small cache is generally more important than increasing the size of the cache.
Case in point: When the Pentium 3 and Athlon went from a large (512K) to a small (256K) faster cache, performance went up, for the Athlon by about 10% and for the Pentium 3...I don't recall, but around 10%. Some desktop apps, like SETI@Home, have a large working set (more than 512K) and DO benefit from large caches, but nothing larger than 1MB would improve performance here either.
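The working-set argument can be sketched with a toy simulation. Everything here is an assumption for illustration: a fully associative LRU cache measured in lines, a program that just loops over its working set, and the name `hit_rate`. Real caches are set-associative and real access patterns are messier, which is where the leftover single-digit gains come from.

```python
from collections import OrderedDict

def hit_rate(cache_lines, working_set_lines, passes=10):
    """Hit rate of a fully associative LRU cache when a program loops
    over working_set_lines distinct lines, passes times."""
    cache = OrderedDict()
    hits = accesses = 0
    for _ in range(passes):
        for line in range(working_set_lines):
            accesses += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)  # mark as most recently used
            else:
                cache[line] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict least recently used
    return hits / accesses

too_small = hit_rate(cache_lines=128, working_set_lines=200)
fits = hit_rate(cache_lines=256, working_set_lines=200)
doubled = hit_rate(cache_lines=512, working_set_lines=200)
print(too_small, fits, doubled)
```

In this model the cache that is too small thrashes completely, the one that fits hits on every pass after the first, and doubling it again changes nothing.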
Most server CPUs, like Alphas and SPARCs, have fairly large caches for the following reasons:
1) Databases love large caches. They are one of the few applications that can take advantage of a large cache, because they can store lookup tables of arbitrary size in cache. Server CPUs are often used for databases because Joe x86 CPU is just fine for webservers, FTP servers, desktop systems, etc. and is generally faster at them than server CPUs.
2) Most server class CPUs are fully 64-bit and do NOT support register splitting. On the SPARC64, for example, if you want to store an integer containing the number "42", that integer will take up a full 64 bits even though the register can store numbers up to 18,446,744,073,709,551,615. This larger size increases the cache size needed to store the working set of programs, because all integers (and many other data primitives) require a full 64 bits or more. With x86 CPUs, which support register splitting and have only 32-bit registers, that number could be stored in a mere eight bits: the square root of the number of bits the SPARC requires.
3) Big servers with multiple CPUs are often expected to run multiple apps, all of which are CPU intensive. If the cache can store the working set for all of them, speed is slightly improved.
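To put numbers on point 2, here is a small sketch of how much memory the same value occupies at different widths. It only illustrates the storage arithmetic; whether a given compiler and CPU really store every integer in 64 bits is the comment's claim, not something this code shows. The count of 100,000 integers is an arbitrary example.

```python
import struct

value = 42

as_u8 = struct.pack("<B", value)   # unsigned 8-bit
as_u32 = struct.pack("<I", value)  # unsigned 32-bit
as_u64 = struct.pack("<Q", value)  # unsigned 64-bit

max_u64 = 2**64 - 1  # largest value a 64-bit register can hold

count = 100_000  # hypothetical number of small integers in a working set
footprint_8 = count * len(as_u8)    # bytes if each fits in 8 bits
footprint_64 = count * len(as_u64)  # bytes if each takes a full 64 bits

print(len(as_u8), len(as_u32), len(as_u64), footprint_8, footprint_64)
```

Under that assumption the 64-bit layout needs eight times the cache to hold the same data, which is the effect the comment describes.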
That said, who in their right mind would use an incredibly slow Pentium Pro for a CPU intensive calculation? A Pentium Pro at the highest available speed, 200MHz, with 2MB cache may be able to outperform a Celeron 266, but not by much and only for very specific cache-hungry software. Show me a person that thinks a Pentium Pro with even 200GB of cache can outperform ANY Athlon system and I will show you a person that hasn't a clue what they are talking about.
Look at the performance difference between the Pentium IV with 256K and with 512K (a doubling) of cache. You will have to do some research to find an application that gets even a 10% performance boost.
FYI: if you are interested in competent, intelligent, technical reviews of hardware, you might like www.aceshardware.com
-- Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
Re:modular chips
by Zathrus · Score: 5, Informative
Anytime you modularize you have to design interfaces. Interfaces are inherently slow (there's a physical disconnect, which simply can't make as good an electrical connection), they're bulky (a Pentium IV chip package is 35 mm on a side, or 1225 mm^2, while the actual chip is only 131 mm^2; the extra size is needed primarily for all the pinouts from the chip), and they're noisy.
Consider that while you can buy a P4 that runs at 2.8 GHz internally (and the fast ALUs run at 5.6 GHz, although they're only 16 bits wide), the memory bus is a lackluster 133 MHz (which you get an effective 533 MHz from because it's quad pumped - you read 4 values every clock instead of just 1). The I/O bus also runs at 133 MHz. These are the only two external buses the CPU deals with.
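The bus numbers above work out as follows. This is back-of-the-envelope arithmetic, not a measurement; the 64-bit data bus width is an assumption that the comment does not state, and the variable names are just for illustration.

```python
BASE_CLOCK_HZ = 400e6 / 3   # the "133 MHz" bus clock (133.33 MHz)
TRANSFERS_PER_CLOCK = 4     # quad pumped: 4 transfers per bus clock
BUS_WIDTH_BYTES = 8         # assumption: 64-bit data bus
CORE_CLOCK_HZ = 2.8e9       # 2.8 GHz core clock from the comment

# Effective transfer rate: the "533 MHz" figure
effective_hz = BASE_CLOCK_HZ * TRANSFERS_PER_CLOCK

# Peak theoretical bandwidth in bytes per second
peak_bytes_per_s = effective_hz * BUS_WIDTH_BYTES

# How many core cycles pass during a single bus clock
core_cycles_per_bus_clock = CORE_CLOCK_HZ / BASE_CLOCK_HZ

print(effective_hz / 1e6, peak_bytes_per_s / 1e9, core_cycles_per_bus_clock)
```

So the core ticks about 21 times for every bus clock, which is the gap the comment is pointing at.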
If you were to try and segment the CPU similarly you'd quickly hit limitations. You simply can't run a multi-GHz electrical signal over a physical disconnect, at least not with current technology.
All of that said, if you look at how CPU cores are laid out the cache is distinctly segmented from the ALU, the ALU is segmented from the FPU, and so forth. It makes chip design easier since if you want to make a change to one part of the chip you minimize effects on other parts. It also helps for signal routing and noise prevention.
Also you can do more or less what you're asking - just not at high speeds. Modern chips are often preliminarily tested using gate arrays that can be reprogrammed quickly and easily... but instead of running at 3 GHz this test chip runs at 2 MHz. Maybe.
Oh... a final bit... back in the days of the 386 and 486 the 2nd level cache was actually on the motherboard, and different MB vendors would put different amounts of cache. Some even had it socketed or solderable so you could add more if you wanted! But by the time the P2 came out clock speeds were too high for this. The connection latency and distance were simply too high. So we wound up with the slot processors, where a CPU slot card had the CPU core and 1-4 second level caches on it. Pretty soon both Intel and AMD integrated the 2nd level cache onto the CPU itself (which wasn't previously possible because it would have made the chips far too big), which further improved speed. The next generation of CPUs are requiring 3rd level cache on the motherboards. How long before that gets integrated onto the CPU?
An intelligent comment on the subject
by Cerlyn · Score: 4, Interesting
I can speak with some authority on this subject since I am presently taking a course on code optimization. What it looks like Mr. Hogdin is trying to do is work around the issue where people do not compile programs with processor specific optimizations. He seems to be proposing to do so by allowing "paging", so to speak, of registers amongst themselves, although in a bit of an odd fashion.
Personally, I am not too fond of this approach. First of all, operating systems will need to be written to support this paging. Secondly, running a single MMX and/or SSE enabled application (which would use most if not all of the mapped registers) would cause all the other applications on the system to suddenly lose any benefit that paging would provide.
The approach I would take (which may or may not be better) would be to change the software. Compilers like gcc 3.2 already know how to generate code with MMX and SSE instructions. Patches are available for Linux 2.4 that add gcc 3.2's new targets (-march=athlon-xp, etc.) to the Linux kernel configuration system. Libraries for *any* operating system compiled for a specific processor or family of processors would likely fare better than generic builds.
And yes, gcc 3.2 can do register mapping in a similar fashion (to ensure that all registers are used) on its own. If you read gcc's manual page, you will note that this makes debugging harder though. Gcc even has an *experimental* mode where it will use the x87 and SSE floating point registers simultaneously.
Mr. Hogdin's approach might be a bit better for inter-process paging by a task scheduler for low numbers of tasks. But as a beginner in this field, I'm not sure what else it would be good for.
Please pardon the omissions; I am not presently using a gcc 3.2 machine :)
Re:Cache is the key
by mmol_6453 · Score: 5, Insightful
20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 [geek.com] only has 32K.
Just for people who don't know, Intel reduced the amount of cache when they moved from the P3 to the P4. And hardware junkies know the performance hit that caused.
A seemingly unrelated sidenote: Intel wants to move to their IA-64 system, and, since it's not backwards-compatible, they're going to have to force a grass-roots popular movement to pull it off.
Perhaps they crippled the P4 to make the IA-64 processors look even faster to the general public?
In any case, I think the quality of the P4 is a sign that Intel wants to make its move soon. (Though losing $150 million, not to mention the context in which they lost it, may set back their schedule, giving AMD's 64-bit system a chance to catch on.)
-- What's this Submit thingy do?
Re:Cache is the key
by orz · Score: 4, Informative
Intel's processors are not crippled by a small L1 cache. Yes, the P3 and P4 L1 caches are WAY smaller than the Athlon's L1 cache, but Intel doesn't NEED a large L1 cache, because its L2 cache is extremely fast. Intel tends to have small, extremely fast L1 caches and to make up for the higher miss rate with fast L2 caches. For instance, the P3's L1 cache has a miss rate roughly twice as high as the Athlon's, but the P3's L1 miss penalty is roughly 8 cycles (assuming an L2 hit...), less than half the Athlon's L1 miss penalty of 20+ cycles on an L2 hit. Also, the P4's L1 cache, which is even smaller than the P3's, allows Intel to decrease the L1 hit latency AND run at a substantially higher clock speed than AMD's larger cache allows.
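The trade-off above is just miss rate times miss penalty. A rough sketch, using the penalties from the comment (8 and 20 cycles) and made-up miss rates of 4% and 2% chosen only to respect the "roughly twice as high" ratio; the function name is hypothetical and L2 misses are ignored entirely:

```python
def l1_miss_cost(miss_rate, miss_penalty_cycles):
    """Average extra cycles per memory access caused by L1 misses,
    assuming every L1 miss hits in L2."""
    return miss_rate * miss_penalty_cycles

# Illustrative miss rates only; the 2:1 ratio is the part taken from the comment.
p3_cost = l1_miss_cost(miss_rate=0.04, miss_penalty_cycles=8)
athlon_cost = l1_miss_cost(miss_rate=0.02, miss_penalty_cycles=20)

# The smaller cache stays ahead as long as its miss rate is less than
# 20 / 8 = 2.5 times the larger cache's miss rate.
break_even_ratio = 20 / 8

print(p3_cost, athlon_cost, break_even_ratio)
```

With these numbers the smaller, faster-to-refill cache costs fewer cycles per access despite missing twice as often.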
For a graphical depiction of the difference between Intel and AMD cache performance, try this link: http://www.tech-report.com/reviews/2002q1/northwood-vs-2000/index.x?pg=3 It was the first thing that came up in a Google search for linpack and "cache size".
Shouldn't we improve bus speed, data access speeds, etc etc first? After all, the bottleneck is not the processor anymore...
That's pretty sweet how he makes the x86 processor faster by adding commands for divx! This guy knows how to improve Intel architecture for the masses!
Ok, he realizes that the x86 architecture is flawed. One of the most limiting problems is the lack of general purpose registers (GPRs), so he adds more complexity to an already over-complex solution to solve this problem. All I have to say to this is: when will you see that the solution is as simple as switching architecture!
As most code today is written in higher level languages (C/C++, Java, etc.), all it takes is a recompile and perhaps some patching and adaptations to small peculiarities. The Linux kernel is proof of this concept: a highly complex piece of code that is portable to several platforms, with a huge part of the code fully portable and shareable. This means that it is not hard to change architecture!
If the main competition and its money would move from the x86 to a RISC architecture (why not Alpha, MIPS, SPARC or PPC), I'm sure that the gap in performance per penny would go away pretty soon. RISCs have several advantages, but the biggest (IMHO) is the simplicity: no awkward rules (non-GP registers), no special case instructions, easy to pipeline, easy to understand and easy to optimize code for (since the instruction set is smaller).
And to return to the original article. Please do not introduce more complexity. What we need is simple, beautiful designs, those are the ones that one can make go *really* fast.
Linux kernel source - memcpy() anyone?
(On MMX machines, the wider 64-bit MMX registers are used for memcpy() rather than the 32-bit standard integer registers)
This has been in the kernel for a few years now and anything that uses memcpy() benefits from it. Move along now.
retrorocket.o not found, launch anyway?
The scheme as proposed would work, but nothing will change the fact that it's another hideous hack to get around the non-orthogonal addressing modes in the original Intel 80x86 architecture.
Even the little microcontroller chips that I can buy for $2 have 32 general purpose registers (Atmel AVRs, for anyone who cares).
Worse, this scheme would not benefit existing code - it still requires code changes to work.
Finally, on the gripping hand, the Pentium III and 4 have a very similar register renaming scheme going on automatically in the hardware. The 8 "logical" registers are already mapped dynamically into a much larger physical register file. (From ExtremeTech: http://www.extremetech.com/article2/0,3973,471327,00.asp.)
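For anyone who hasn't seen register renaming, here is a minimal sketch of the idea. It is not how any particular Pentium implements it: the class name, the physical register count of 40, and the simple "never free a register" policy are all assumptions made to keep the example short.

```python
class RenameTable:
    """Toy rename table mapping 8 logical x86 registers onto a larger
    physical register file. Physical registers are never freed here;
    real hardware reclaims them when instructions retire."""

    def __init__(self, logical_names, physical_count):
        # Initially logical register i lives in physical register i.
        self.mapping = {name: i for i, name in enumerate(logical_names)}
        self.free = list(range(len(logical_names), physical_count))

    def rename(self, dest, sources):
        """Rename one instruction: sources read the current mapping,
        the destination gets a fresh physical register."""
        phys_sources = [self.mapping[s] for s in sources]
        phys_dest = self.free.pop(0)
        self.mapping[dest] = phys_dest
        return phys_dest, phys_sources

LOGICAL = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
table = RenameTable(LOGICAL, physical_count=40)  # 40 is a made-up size

d1, s1 = table.rename("eax", ["ebx"])  # mov eax, ebx
d2, s2 = table.rename("ecx", ["eax"])  # mov ecx, eax (reads the new eax)
d3, s3 = table.rename("eax", ["edx"])  # mov eax, edx (eax reused)
print(d1, s1, d2, s2, d3, s3)
```

The third instruction writes "eax" again but lands in a different physical register, so it does not have to wait for the second instruction to finish reading the old value. That is the effect the hardware already gets without any new instructions.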
Sean Ellis
Follow OfQuack's antics on Twitter.
I'm reminded of the days I used to code for the old Acorn Archimedes (don't look for it now, it's not there any more) and our apps were usually way faster than the competition's.
When asked why, we were tempted to tell them that we used the undocumented 'unleash' instruction to unleash the raw power of the ARM processor.
This sort of reminds me of what happened with IRQs. Ultimately Intel "solved this" via the PCI bus, but performance has occasionally been problematic. Of course, that problem goes back to the original IBM design for the original IBM PC. Intel is also very aware, I imagine, of what happened when IBM tried a total redesign with the Micro Channel (MCA) bus. It got rejected, I think, primarily because it was proprietary. In any case, enough companies have been nailed on backward compatibility issues that Intel may be nervous about making a total break.
The upside is being able to run old software on new hardware. You don't want to break too many things.
"It is a greater offense to steal men's labor, than their clothes"
Well, not quite, but it has the same flavor.
After working in x86 assembly, I really appreciated high level and minimally complex languages like C.
Best Slashdot Co
The guy does not realize that what he proposed is not at all simple to implement in silico.
These two additional mapping registers would complicate the pipeline hazard detection in an exponential way.
Another point is that I don't think that by doubling/tripling the number of registers available you will get a tenfold performance increase: a small increase could be expected, but not much.
Another problem is the SpecialCount counter: this would complicate the compilers too much. It would also make the instruction reordering almost impossible.
Anyone who's ever tried to use the MMX or XMM registers for non-multimedia applications knows what I'm talking about. The instruction sets for them are nicely tweaked to let you do "sloppy" parallel operations on large blocks of data, and not really suited for general computing. You can't move data into them the way you would like to. You can't perform the operations you would like to. You can't extract data from them the way you would like to. They were meant to be good at one thing, and they are.
I once tried to use the multimedia registers to speed up my implementation of a cryptographic hash function whose evaluation required more intermediate data than could nicely fit in GP registers, and had enough parallelism that I thought it might benefit from the multimedia instructions. No such luck. The effort involved in packing and unpacking the multimedia registers undid any gains in actually performing the computation faster -- and the computation itself wasn't that much faster. I was using an Athlon at the time, and AMD has so optimized the function of the GP registers and ALU that most common GP operations execute in a single clock if they don't have to access memory, while all the multimedia instructions (including the multiple move instructions to load the registers) require at least 3 clocks apiece.
Now this leads me to suspect that the multimedia registers have limited functionality and slow response for a single reason: economics. The lack of instructions useful for non-multimedia applications could be explained via history, but what chip manufacturer wouldn't want to boast of the superior speed of their multimedia instructions? And yet they remain slower than the GP part of the chip.
So I conclude that merely making a faster MMX/XMM unit is prohibitively expensive in today's market. And this proposal would definitely require that, even if actually adding the additional wiring to support the GP instructions for these registers was feasible. Because what would be the point of using these registers for GP instructions if they executed them slower than the same instructions actually executed on GP registers?
The whole gist of the article has to do with the x86's lack of general purpose registers. While this is true, you're not going to solve all of the x86 shortcomings simply by figuring out a way to add more of them. There are MANY things wrong with the x86 design; GP registers are just one of them. There's an entire section in the famous Patterson book that goes into all of the issues in much more detail than I care to state here.
Besides, there are already more efficient (albeit complex) solutions to extend registers that make much more sense in the current world of pipelined processors. Register renaming is one such example.
The problem is, in order to recompile you first need: a) the original source, and b) someone capable of patching, etc.
A lot of internal apps are in use for which the source code is lost. And a lot of code in use today (sadly) was not written in languages as portable as C, C++, and Java. A lot of apps in use today were written in Clipper and COBOL and a bunch of other languages that may not have decent compilers for other platforms. So recompiling it isn't an option. A complete re-write is necessary.
Even for situations in which application source *does* exist, and suitable compilers exist on other architectures, it is more often than not poorly documented...and the original author(s) is/are nowhere to be found. So in order to patch/fix the source to run on the new architecture, you not only need someone well versed in both the old and the new architectures, but someone who can read through what is often spaghetti code, understand it and make appropriate changes.
In a lot of these cases it's easier to stick with the current architecture. And that, to some degree, is why the x86 architecture has gotten as complex as it is.
You are VERY confused.
;-)
1 - Zero Cost. 2 - Backwards Compatible. 3 - Orders of magnitude.
1 - You have to buy new chips - this will improve the speed of "computing" but it will not increase the speed of THIS computer I have right HERE.
2 - No old code has RM/RMC instructions in it and will NOT run any faster than it already does in a "standard" x86 mode. Yes it is backwards compat. but by the same token so is MMX, EMMX, 3DNOW!, SSE, SSEII, AA64 etc....
3 - Anyone who can sell me a program to "suddenly" make all my code go 10x or 100x faster is guaranteed to give me a good chuckle!!!!!!!
As for the article... well you've hugely increased the number of bits it takes to address a register and swapping the RM register is going to cause all sorts of new dependency chains inside the chip.
Personally.... I'd go for a stack machine. Easily the most efficient compute engine.
Now - if we could get back to point number 1 and point number 3. If YOU can make MY computer go 10 or 100 times FASTER with SOFTWARE I promise I WILL give YOU some MONEY....
"None of this shit works" -W.Shatner
But this change would:
Make an internal interface explicitly controlled by the programmer/compiler, loading an enormous amount of work on the compiler creators. (Just have a look at IA64 - is there any good compiler out there already? I haven't had a look for a while.)
Destroy (or at least reduce the efficiency of) the internal register renaming unit, thus slowing down the out-of-order execution core and such (the entire core, actually...) Sorry, but this man may have been busy programming x86 assembly his entire life (and for this he deserves my respect), but he is not up to date on how a modern x86 CPU works in its heart. When I heard the lectures in my university about how this stuff works, I gave up learning assembly -- one just doesn't need it anymore with the compilers around.
Reading the books by Hennesy/Patterson (don't know if I spelled them correctly) may help a lot.
Actually, this is just one of many potential downfalls. He forgot interrupts, mode switching (going from protected to real mode, as some OS's still do), and IO would all require that the proposed RM/RMC register be loaded from the stack. The net effect would be that if his scheme is implemented, existing programs would run slower, not faster. Furthermore, placing the RM/RMC register on the stack is an impossibility without breaking backward compatibility; many assembly language coders depend on a set number of bytes being added to the stack when they perform a call or interrupt.
Why not just add 24 GP registers to the existing processor? Honestly, it would be a lot simpler, and would not complicate the whole x86 mess, nor break backward compatibility.
I don't mean to flame, but this guy is way off base. The biggest problem with the x86 instruction set is lack of registers, and the second biggest problem is that its complexity is rapidly becoming unmanageable. Not even Intel recommends using assembly anymore - their recommendation is to write in C and let their compiler perform the optimizations. Adding more instructions like this would further diminish the viability of coding in assembly.
A far better solution would be to simply keep the existing instruction set intact, and add more GP registers. IBM got it right the first time - their mainframe processors have 16 general purpose registers which can be used for any operation - addressing, indexing, integer, and floating point calculations. If anything, Intel should stop adding instructions and start adding registers.
The society for a thought-free internet welcomes you.
Cache is a huge Intel problem. 20K L1 for P4, down from 32K since the Pentium MMX. Even the Itanium2 only has 32K.
AMD has 128K L1 since the original Athlon, and had 24K in the K5.
The Transmeta 3200 and the Motorola G4 both have 96K, the UltraSparc-III has 100K, Alpha had 128K when it died, and HP's PA-8500 has a whopping 1.5MB.
They may throw big chunks of L2 at the problem, but it seems to me that so little L1 means more time moving data and less time processing...
Yes, he basically invented register renaming, but put it under explicit programmer control. It's a programmer's solution to what hardware has already done, and as was inevitable he doesn't see that he will do more harm than good.
Here's why his idea sucks:
1) Register renaming is dependent on the RMC. You can't issue any instructions if there is a POPRMC in the machine until the POPRMC finishes execution. He calls it "a few cycles", but it's much worse than that. You've prevented any new instructions from entering the window until the stack access is done, preventing any work that -could- have been done in parallel from even being seen. Function call/return overhead is a big deal, and he just kicked it up a notch.
2) His whole problem #3 -- that you can't explicitly access the upper 16 bits of a 32-bit GPR. All I can say is -- thank God! Being a programmer, he probably doesn't realize that being able to address sub-registers is actually a big problem with x86. The whole sub-register-addressing problem causes all kinds of extra dependencies and merge operations. And he wants to make it worse? I think he should be slapped for this idea. x86-64 had the right idea -- you cannot access -just- the upper 32 bits of a GPR, and when you execute a 32-bit instruction that writes a GPR, the upper 32-bits are not preserved. Which is how the RISCy folks have been doing it all along, but hey.
3) This idea requires an extra clock cycle in the front-end, to do the translation from architected to the expanded architected register space, prior to being able to do the architected->physical register translation.
4) Because you still can't address more than 8 registers at a time, you'll be using lots of MOVRMC instructions in order to make the registers you need visible. Ignoring how horrible this would make it for people writing assembly ("Okay, so now EAX means GPR 13?") or compilers, this is going to result in a lot of code bloat.
5) Because of 1) and 4), modern superscalar decoders are going to be shot. If you fetch a MOVRMC, followed by POP EAX and POP EBX, you can't decode the second two until -after- you've decoded the MOVRMC and written its values into the map.
Now all this is so that you can save on loads/stores to the stack. Which is great, but at least when those loads and stores are executing, independent instructions can still go. Every RMC-related stall is exactly that -- no following instruction can make progress.
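Point 2 above is easier to see with the register modelled as a plain integer. This sketch only shows the data flow: a partial write has to merge with the old register contents, while an x86-64 style 32-bit write does not need the old value at all. The function names are made up for illustration.

```python
MASK64 = (1 << 64) - 1

def write_low8(old_reg, value):
    """Partial write (like writing AL): the result depends on the old
    register contents, so the hardware needs a merge and an extra
    dependency on whatever last wrote the register."""
    return (old_reg & ~0xFF & MASK64) | (value & 0xFF)

def write_low32_x86_64(old_reg, value):
    """32-bit write in x86-64 64-bit mode: the upper 32 bits are zeroed,
    so the old contents are irrelevant and no merge is needed."""
    return value & 0xFFFFFFFF

old = 0x1122334455667788
merged = write_low8(old, 0xAB)
zero_extended = write_low32_x86_64(old, 0xDEADBEEF)
print(hex(merged), hex(zero_extended))
```

The first function reads `old_reg`; the second ignores it. That difference is the extra dependency the comment is complaining about.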
Not that increasing the number of registers in x86 isn't a good idea -- it's just his implementation that sucks. With him being an x86 programmer, I'm surprised he didn't think of the most obvious solution -- just add a prefix byte to extend the size of the register identifiers in the ModR/M and SIB bytes. You get access to ALL GPRs at once (rather than an 8-register window), no extra stalls are required, and your code size only goes up by one byte for instructions that use the extra registers.
I can't help but commend him on his idea being well thought out. To the best of his knowledge, he tried to address all issues. But that's the problem -- he's a programmer, not a computer architect.
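The prefix-byte idea can be sketched as a tiny decoder. x86-64's REX prefix takes essentially this approach, and the bit layout below follows REX (0100WRXB), but this is a simplification: the function name is made up, and the SIB byte, displacements, and the W bit are ignored.

```python
def decode_modrm(modrm, prefix=None):
    """Split a ModR/M byte into (mod, reg, rm). If a REX-style prefix
    byte is given, its R and B bits supply a 4th bit for reg and rm,
    turning 3-bit register numbers (0-7) into 4-bit ones (0-15)."""
    mod = (modrm >> 6) & 0b11
    reg = (modrm >> 3) & 0b111
    rm = modrm & 0b111
    if prefix is not None:
        if prefix & 0xF0 != 0x40:
            raise ValueError("not a REX-style prefix")
        reg |= ((prefix >> 2) & 1) << 3  # R bit extends ModR/M.reg
        rm |= (prefix & 1) << 3          # B bit extends ModR/M.rm
    return mod, reg, rm

plain = decode_modrm(0xC8)                 # registers 1 and 0
extended = decode_modrm(0xC8, prefix=0x45)  # same byte, registers 9 and 8
print(plain, extended)
```

The same ModR/M byte reaches registers 8 through 15 when the prefix is present, at a cost of one extra byte and no mapping state to stall on.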
The enemies of Democracy are