Domain: agner.org
Stories and comments across the archive that link to agner.org.
Comments · 55
-
Re:But SSE is already 128 bits!
Core2 has single-cycle throughput on most SSE instructions, not single-cycle latency
Well, certainly you won't be able to get a square root through in one clock cycle, but many/most of the simple integer arithmetic, bitwise, and MOV SSE instructions on the Core 2 really do have single cycle latency. source. None do on the AMD64, which supports the theory that SSE128 means more "new for us" than "new for everyone." Not to put AMD down - many of the other features sound promising (but the article is long on breathlessness and light on details, alas).
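The latency/throughput distinction in this exchange can be made concrete with a little arithmetic. The sketch below is a toy model, not a measurement: the function names and the cycle counts are assumptions chosen for illustration, not figures for the Core 2 or any AMD part.

```python
def cycles_dependent(n, latency):
    """Cycles for n instructions where each needs the previous result.

    Nothing can overlap, so the chain costs n * latency cycles.
    """
    return n * latency


def cycles_independent(n, latency, per_cycle=1):
    """Cycles for n instructions with no dependencies between them.

    `per_cycle` instructions start every cycle (the throughput), and the
    last group started still needs `latency` cycles to finish.
    """
    if n == 0:
        return 0
    issue_cycles = -(-n // per_cycle)  # ceiling division
    return (issue_cycles - 1) + latency


# Hypothetical numbers: 100 instructions, 2-cycle latency, 1 issued per cycle.
chain = cycles_dependent(100, latency=2)
parallel = cycles_independent(100, latency=2)
```

With these made-up numbers the dependent chain takes 200 cycles and the independent stream takes 101, which is why single-cycle throughput alone says little about code that is one long dependency chain.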
-
Re:x86 aint what it used to be
Nowadays, the x86 ISA is just an API...god knows how the core actually executes instructions and in what order, which makes it very hard to optimise code beyond a certain point.
Actually it's still pretty well documented which instructions are executed in which order, even on the Pentium 4. Agner Fog has a very nice document that I tend to reference when writing x86 ASM routines (usually to be linked in with a higher level language because the HLL emits poorly optimized code (or simply is unable to take advantage of certain processor features)). Give it a look--
http://www.agner.org/assem/
-
Re:Why?
Have a look at the start of chapter 16 of Agner Fog's P5/P6 Optimisation Tome for an example of how register renaming works in practice.
-
A more serious answer ...
Since it's optimization you are concerned with, I have a few choices you will be interested in:
1. The Zen of Code Optimization by Michael Abrash.
2. Agner Fog's Assembly Resources
3. The Athlon Optimization Guide
4. Intel's IA32 Optimization Guide
5. The Aggregate Magic Algorithms
These sources will give you everything you need to know about code optimization for x86.
-
Re:Time to kick a friend's ass
Actually, that's pretty much what the Pentium Pro (and so the p2, p3, celeron, celeron2 and the p4) does - only there it's done using "virtual registers", which means that the register "eax" can map to a completely different physical register if the instruction scheduler needs it to.
For example, you could write your code like this:
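mov ebx, Pointer          ; ebx = pointer value
movzx ecx, word ptr [ebx] ; ecx = 16-bit value at [ebx], zero-extended
mov eax, [ebx+ecx]        ; load from pointer + offset
mov Pointer2, eax         ; store the result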
(now I'm pretty sure that's not the best way to do it - it's just an example, ok?)
Now, if you have another multi-instruction operation after this and it's going to use any of the registers used above, the CPU will see in the decoding phase that "a-ha! eax has received a value that doesn't depend on itself (i.e. a completely new value)" and will assign a different physical register to "eax" until it's overwritten again. (This is also the reason why xor reg,reg is not the preferred way of clearing a register on the ppro and up.) Same for ebx and ecx and the other regs.
By the time the CPU is finished decoding these instructions (this would take 1 and 1/3 cycles for ppro through p3 and 1 cycle for the p4 (due to the 4-1-1-1 decoders)), the reorder buffer (which receives the decoded instructions, also called micro-ops or uops) will have been filled up with previously decoded instructions and will be able to put as many uops into the execution "ports" as possible (3 per cycle in ppro through p3, not sure about the p4). This, of course, assumes that the code is organised so that the decoders can feed the reorder buffer with more than 3 micro-ops per decoding cycle, so that there's something to reorder. But this will, for the most part, take care of that data-dependency problem.
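The renaming step described above can be sketched as a toy model. This is only an illustration under simplifying assumptions: the `rename` function, the `p0`, `p1`, ... physical register names, and the four-register file are all hypothetical, and a real P6-family core also recycles physical registers and tracks flags and partial registers, which this ignores.

```python
from itertools import count

ARCH_REGS = ("eax", "ebx", "ecx", "edx")  # small hypothetical register file


def rename(instructions):
    """Map architectural registers to fresh physical registers.

    `instructions` is a list of (dest, sources) tuples naming architectural
    registers. Returns a list of (phys_dest, phys_sources) tuples.
    """
    fresh = count()
    # Every architectural register starts out mapped to some physical one.
    table = {reg: f"p{next(fresh)}" for reg in ARCH_REGS}
    renamed = []
    for dest, sources in instructions:
        # Sources are read through the mapping in force *before* the write.
        phys_sources = tuple(table[s] for s in sources)
        # Each write to a register allocates a brand-new physical register.
        table[dest] = f"p{next(fresh)}"
        renamed.append((table[dest], phys_sources))
    return renamed


program = [
    ("ebx", ()),              # load the pointer into ebx
    ("ecx", ("ebx",)),        # load an offset through ebx
    ("eax", ("ebx", "ecx")),  # load eax from pointer + offset
    ("eax", ()),              # later code writes a completely new value to eax
]
renamed = rename(program)
```

The last two entries both write "eax" but end up in different physical registers, so in this model the later write does not have to wait for the earlier load to finish.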
Personally, I prefer explicit register setting (à la PowerPC, 32 int regs + 32 fp regs) so that the CPU won't have to schedule instructions for me...
(all this information, except for the p4 decoder uop-max series, comes from the excellent pentopt.txt file.)