Cliff Click's Crash Course In Modern Hardware
Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code due to all of the crazy kung-fu and ninjitsu it does to your code while it's running. This talk is an excellent drill-down into the internals of the x86 chip, and it's a great way to get an understanding of what really goes on down at the hardware and why certain types of applications run so much faster than other types of applications. Dr. Cliff really knows his stuff!"
I wanted to take ASM in college. I was the only student who showed up for the class and the class was canceled. Since most of the programming classes was Java-centric, no one wanted to get their hands dirty under the hood.
That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.
Yeah and the chip makers release software optimization guides regarding how to avoid such stalls or take advantage of other features, and it's really hard to do that at the C level, and it can be hard for the compiler to know that a certain situation calls for one of these optimizations.
However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.
Agreed, it's basically something you're going to do for the most performance critical part, like the kernel of an HPC algorithm for example.
The enemies of Democracy are
Using shift to multiply is often a great idea on most CPUs. On the other hand, just about every compiler will do that for you (even with optimization turned off I bet), so there's no reason to explicitly use shift in code (unless you're doing bit manipulation, or multiplying by 2^n where n is more convenient to use than 2^n). However, a much more important thing is to correctly specify signed/unsigned where needed. Signed arithmetic can make certain optimizations harder and in general it's harder to think about. One of my gripes about C is defaulting to signed for integer types, when most integers out there are only ever used to hold positive values.
Dealing with alignment is not that much of an assembler issue, if you are using C. Address arithmetic gets the job done. If you even want your globals aligned (and not just heap-allocated stuff) you *might* need some ASM, but just for the declarations of stuff that would be "extern struct whatever stuff" in C (and in a pinch, you write a bit of C code to suck in the headers defining "stuff", figure out the sizes, and emit the appropriate declarations in asm).
Writing memmove/memcpy in assembler is a mixed bag. If you write it in C, you can preserve a some tiny fraction of your sanity dealing with all the different alignment combinations before you get to full-word loads and stores. HOWEVER, on the x86, all bets are off, the only way to tell for sure what is fastest, is to write it, and benchmark it.
Note that even with GCC, the choices aren't just autovectorisation and assembly. GCC provides (portable) vector types, and if you declare your variables as these then it just has to try to use SSE / AltiVec / Whatever instructions for the operations, and it can easily because your variables are aligned. Primitive operations (i.e. the ones you get on scalars in C) are defined on vectors and so you can do 2^n of them in parallel and GCC will emit the relevant instructions depending on your target CPU. Going a step further, there are intrinsic functions that are specific to a particular vector ISA and can be used with these. Then you get to tell GCC exactly which instruction to use, but it still does all of the register allocation for you.
I am TheRaven on Soylent News
Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun.
Coding assembly on RISC architectures is dead boring because all the instructions do what you expect them to and can be used on any general purpose register.
In the good old days, when x86 was 8086 there were no general purpose registers. The BX register could be used for indexing, but AX, CX and DX couldn't. CX could be used for counts (bit shifts, loops, string moves), but AX, BX, and DX couldn't. SI and DI were index registers that you could add to BX when dereferncing or could be used with CX for string moves. AX and DX could be used in a pair for a 32 bit value. If you wanted to multiply, you needed to use AX. If you wanted to divide, you needed to divide DX:AX by a 16 bit value and your result would end up in AX and the remainder in DX. Compared to the Z80 assembly language, we thought this was easy.
Being able to use %r2 for the same stuff you use %r1 for is just boring.
Support SETI@home
The smaller instructions are still worth it, not so much because of main RAM size constraints but because of cache size constraints. Staying in L1 is great if you can swing it; falling out of L2 blows your performance out of the water.
Most recently, just iterating over an array and doing a simple op on each entry became about 2x faster on my machine by going from an array of ints to an array of unsigned chars (all the entries are guaranteed in unsigned char range). Reason was, the array of ints was just about the total size of my L2... and the new array is 1/4 the size, which means there's space for other things too (like the code).
I watched about half of his presentation. I was amused because on a lot of the slides he says something like "except on really low end embedded CPUs." I spend a lot of my time programming (frequently in assembly) for these exact very low end CPUs. I haven't had to do much with 8-bit cores, fortunately, but I've been doing a lot of programming on a 16-bit microcontroller lately (EMC eSL).
I suspect the way I'm programming these chips is a lot like how you would have programmed a desktop CPU in about 1980, except that I get to run all the tools on a computer with a clock speed 100x the chip I'm programming (and at least 1000x the performance). I am constantly amazed by how little we pay for these devices: ~10 Mips, 32k RAM, 128k Program memory, 1MB data memory and they're $1.
But they do have a 3-stage pipeline, so I guess some of what Dr. Cliff says still applies.
We're wanted men. I have the death sentence in 12 systems!