Cliff Click's Crash Course In Modern Hardware
Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code due to all of the crazy kung-fu and ninjitsu it does to your code while it's running. This talk is an excellent drill-down into the internals of the x86 chip, and it's a great way to get an understanding of what really goes on down at the hardware and why certain types of applications run so much faster than other types of applications. Dr. Cliff really knows his stuff!"
I can't say I've WTFV like I usually RTFA before you get to see it... but I can tell you this: The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.
Probably due to your x86 processor doing all sorts of monkeying with the code.
Spaghetti code can be hard to digest.
That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.
However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.
The ringing of the division bell has begun... -PF
Sometimes it's just plain FUN FUN FUN to code in asm. You're right that most programmers will never have a need for it at all (with some exceptions, such as those messing with operating systems or embedded systems), although knowing some ASM can help a lot with debugging. I suppose one could (read: should) learn a little ASM to have a better idea of what the hardware is doing, this will allow you to optimize your code a little, or (more importantly) write it in such a way that makes it easier for the compiler to optimize.
That's the main reason why I want to shoot people who write "clever" code on the first pass. Always make the rough draft of a program clean and readable. If (and only if!) you need to optimize it, use a profiler to see what actually needs work. If you do things like manually unroll loops where the body is only executed 23 times during the program's whole lifetime, or use shift to multiply because you read somewhere that it's fast, then don't be surprised when your coworkers revoke your oxygen bit.
Dewey, what part of this looks like authorities should be involved?
I wanted to take ASM in college. I was the only student who showed up for the class and the class was canceled. Since most of the programming classes was Java-centric, no one wanted to get their hands dirty under the hood.
Someone has to write those tools.
Yeah, but they can be written in a HLL, too. You don't have to write a program in highly-tuned assembler to make it emit highly-tuned assembler.
Dewey, what part of this looks like authorities should be involved?
Yah sorry about that. :)
Part of the problem is that compilers have to support a variety of instruction sets, and if the majority of the customers are using an 8 year old revision of an instruction set, even if the newest revision offers Super Awesome Cool features that make code run a lot faster, well you end up with a chicken and egg problem where it makes sense for the compiler team to focus on the old architecture since that is what everyone is using, and no one wants to move to the new architecture since the compiler doesn't take full advantage of it.
Need help treating your acne? Come here!
That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.
Yeah and the chip makers release software optimization guides regarding how to avoid such stalls or take advantage of other features, and it's really hard to do that at the C level, and it can be hard for the compiler to know that a certain situation calls for one of these optimizations.
However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.
Agreed, it's basically something you're going to do for the most performance critical part, like the kernel of an HPC algorithm for example.
The enemies of Democracy are
Also either start with the assembly the compiler generates, or at the very least make sure to bench your own against what it makes. The Intel Compiler in particular is extremely good at what it does. As such, it is worth your while to see what its solution to your problem is, and then see if you can improve, rather than assuming you are smarter and can do everything better on your own.
Of course all that is predicated on using a profiler first to find out where the actual problem is. Abrash accurately pointed out years ago that programmers suck at that. They'll spend hours making a nice optimized function that ends up making no noticeable difference in execution time.
Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun. x86-64 manages to increase the "funness" value somewhat, but I still wouldn't quite qualify it as "fun".
On the other hand, it's very true that knowing some ASM can help you write code that the compiler will translate into better assembly code, without going through all of the trouble yourself.
I think it depends on what kind of code you're trying to write. If a person desires to write applications then you are right, they might as well write it in a high level language and let the compiler do the work. On the other hand if the person is interested in vulnerability research or security work, then learning ASM might as well be considered a requisite. An understanding of low level programming and code execution provides a programmer with a solid foundation. It gives the potential insights into what might be going wrong when their code isn't compiling or executing the way they want it to. It also gives them the tools to make their code better, as opposed to simply shrugging and saying, "I sure hope they fix this damn compiler..."
Spaghetti code can be hard to digest.
Sounds to me like someone is using stale Copypasta.
That's a real shame! But my impression is that for a long time now, college-level assembly instruction has consisted almost entirely of indoctrinating the students to believe that assembly language programming is difficult and unpleasant and must be avoided at all costs. Which couldn't be more wrong -- it's AWESOME!
Even on the x86 with all its flaws, being able to have that kind of control makes everything more fun. The fact that your code runs like a bat out of hell (unless you're a BAD assembly programmer, which a lot of people are but they don't realize it so they bad-mouth the language) is just icing on the cake. You should definitely teach yourself assembly, if you can find the time.
Dealing with alignment is not that much of an assembler issue, if you are using C. Address arithmetic gets the job done. If you even want your globals aligned (and not just heap-allocated stuff) you *might* need some ASM, but just for the declarations of stuff that would be "extern struct whatever stuff" in C (and in a pinch, you write a bit of C code to suck in the headers defining "stuff", figure out the sizes, and emit the appropriate declarations in asm).
Writing memmove/memcpy in assembler is a mixed bag. If you write it in C, you can preserve a some tiny fraction of your sanity dealing with all the different alignment combinations before you get to full-word loads and stores. HOWEVER, on the x86, all bets are off, the only way to tell for sure what is fastest, is to write it, and benchmark it.
It also depends on the compiler. GCC, for example, sucks at auto-vectorization, so it's easy to get 30% or more on loopy scientific code just by using SSE instructions properly.
In contrast, PGI or ICC is much harder to beat using assembly.
That's *generally* true. It's not *always* true.
There are a lot of purely compute-bound applications (think simulations of various sorts, etc) for which the algorithmic optimizations have already been done- but it's still worth going for the last few percent of performance from "instruction fiddling". As another poster said: if your app runs for weeks at a time, 1% improvement becomes significant in terms of time saved- and throwing more hardware at the problem isn't always feasible.
The ringing of the division bell has begun... -PF
But, its certainly better to code in a high level language first, test, tweak the algorithm as much as you can, PROFILE and THEN start breaking out your assembler. No point optimising 99% of your code in super fast asm if it only spends 1% of the cpu time in it. Even if you make all that code 10x as fast, you've only saved 0.9% cpu time. :)
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Features like out of order execution, caches, and branch prediction/speculation are commonplace on many architectures, including the next generation ARM Cortex A9 and many POWER, SPARC, and other RISC architectures. Even in-order designs like Atom, Coretex A8, or POWER6 have branch prediction and multi-level caches.
The most important thing for performance is to understand the memory hierarchy. Out-of-order execution lets you get away with a lot of stupid things, since many of the pipeline stalls you would otherwise create can be re-ordered around. In contrast, the memory subsystem can do relatively little for you if your working set is too large and you don't access memory in an efficient pattern.
Or you could get NASM, which is open source :)
People learn a trick way back when, or hear about the trick years later, and assume it is still valid. Not the case. Architectures change a lot and what used to be the best way might not be anymore.
Michael Abrash, one of the all time greats of optimization, talks about this in relation to some of the old tricks he used to use. One was to use XOR to clear a register on x86. XORing a register with itself gives 0, of course, and turned out to be faster than writing an immediate value of zero in to the register. Reason is that loading a value was slower than the XOR op, and the old CPUs had no special clear logic, zero was just another number.
Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.
However, you'll still hear people say it is a great trick because they haven't updated their knowledge.
Just write good clean code that works properly first. The only time you optimize is after it has been profiled to see if there are troublesome spots. The way CPUs run and how compilers are designed, there is very little need to do optimization. Unless you have taken some serious courses of how the current CPU’s work, you efforts will mostly result in bad code that gains you nothing in respect in speed. Your time is better spent on writing CORRECT code.
The compilers are very intelligent in proper loop unrolling, rearranging branches, and moving instruction code around to keep the CPU pipeline full. They will also look for unnecessary/redundant instruction within a loop and move them to a better spot.
One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times, I got the best time by playing with the compiler flags.
One of the biggest drawbacks of a language like C (and even more C++, and even more Java), is that they don't give you a whole lot of control of how stuff is arranged in memory
I'd say this is more of a C/C++ problem than a Java problem. Or, rather, they are different problems. The problem with C and C++ is that they do give the programmer a whole lot of control about how things are arranged in memory. They don't, on the other hand, give the compiler a lot of freedom to rearrange things.
Java, on the other hand, uses the Smalltalk memory model and so the compiler (and/or JVM) is free to rearrange things in memory as much as it wants to (whether it does, of course, is a matter for the compiler writer). For example, a Java compiler that notices that you are doing the same operation on three instance variables is free to put them next to each other aligned on a 128-bit boundary with some padding at the end so that you can easily use vector instructions on them, even if they were originally declared in different classes. A C compiler can not do this with structure fields.
If you really care about alignment in C, you are free to use valloc() to align on a page boundary and then subdivide the memory yourself. Most of the time, however, it's not worth the effort.
I am TheRaven on Soylent News
Note that even with GCC, the choices aren't just autovectorisation and assembly. GCC provides (portable) vector types, and if you declare your variables as these then it just has to try to use SSE / AltiVec / Whatever instructions for the operations, and it can easily because your variables are aligned. Primitive operations (i.e. the ones you get on scalars in C) are defined on vectors and so you can do 2^n of them in parallel and GCC will emit the relevant instructions depending on your target CPU. Going a step further, there are intrinsic functions that are specific to a particular vector ISA and can be used with these. Then you get to tell GCC exactly which instruction to use, but it still does all of the register allocation for you.
I am TheRaven on Soylent News
The calling convention is complicated, but it's nowhere near as different as IA32 calling conventions between platforms. Linux and FreeBSD, for example, use different rules for when to return a structure on the stack and when to return it in registers on IA32, but they use exactly the same conventions (the SysV ABI) on x86-64.
I am TheRaven on Soylent News
Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun.
Coding assembly on RISC architectures is dead boring because all the instructions do what you expect them to and can be used on any general purpose register.
In the good old days, when x86 was 8086 there were no general purpose registers. The BX register could be used for indexing, but AX, CX and DX couldn't. CX could be used for counts (bit shifts, loops, string moves), but AX, BX, and DX couldn't. SI and DI were index registers that you could add to BX when dereferncing or could be used with CX for string moves. AX and DX could be used in a pair for a 32 bit value. If you wanted to multiply, you needed to use AX. If you wanted to divide, you needed to divide DX:AX by a 16 bit value and your result would end up in AX and the remainder in DX. Compared to the Z80 assembly language, we thought this was easy.
Being able to use %r2 for the same stuff you use %r1 for is just boring.
Support SETI@home
In C/C++ shift is not the same as multiply/divide by 2. Multiplication and division operators have a different precedence level than shift operators. Not only is there the possibility of poor optimization but such a substitution may lead to a computational error. For example mul/div has a higher precedence than add/sub, but shift has a lower precedence:
printf(" 3 * 2 + 1 = %d\n", 3 * 2 + 1);
printf(" 3 << 1 + 1 = %d\n", 3 << 1 + 1);
printf("(3 << 1) + 1 = %d\n", (3 << 1) + 1);
3 * 2 + 1 = 7
3 << 1 + 1 = 12
(3 << 1) + 1 = 7
--
Perpenso Calc for iPhone and iPod touch, scientific and bill/tip calculator, fractions, complex numbers, RPN
I hear they have nice 0xC0FFEE
GAAH! MY PRINTER IS ON FIRE!!! PUT IT OUT! PUT IT OUT!
It all depends on your problem domain. As a high energy physicist, I write plenty of code that me, a postdoc, and maybe a couple other grad students will ever see, and probably I'm the only one that will actually ever use it. I'm designing a small cluster that will get built here in a month or few, and some of my code will take up about 2 months of solid run time on it, then never see the light of day again. If I can spend 2 days getting a 5% performance improvement, even at the expense of locking the code to this cluster, it's a net win for us.
In short, I have no "customers", I know exactly what hardware my code will be running on, and it won't ever change (until they ditch the cluster in 4-5 years and make a new one, but I'll be long gone), and I don't even have to worry about maintaining the code years in the future.
All the same, I'll probably still write the code as cleanly as possible and run it through an optimizer, and leave it at that.
SIGSEGV caught, terminating
wait... not that kind of sig.
I watched about half of his presentation. I was amused because on a lot of the slides he says something like "except on really low end embedded CPUs." I spend a lot of my time programming (frequently in assembly) for these exact very low end CPUs. I haven't had to do much with 8-bit cores, fortunately, but I've been doing a lot of programming on a 16-bit microcontroller lately (EMC eSL).
I suspect the way I'm programming these chips is a lot like how you would have programmed a desktop CPU in about 1980, except that I get to run all the tools on a computer with a clock speed 100x the chip I'm programming (and at least 1000x the performance). I am constantly amazed by how little we pay for these devices: ~10 Mips, 32k RAM, 128k Program memory, 1MB data memory and they're $1.
But they do have a 3-stage pipeline, so I guess some of what Dr. Cliff says still applies.
We're wanted men. I have the death sentence in 12 systems!
intel compilers have options to optimize to more than one target, and its runtime engine uses code that was made for X cpu. Sure your binary is larger, but everyone is happy.
Liberty freedom are no1, not dicks in suits.
This will give you a 64-bit vector type, so you can fit one in an MMX register, or two in an SSE or AltiVec register. You can then create these and do simple operations on them. For example, if you wanted to add two together, you could do this:
In this case, the add is constant so it will be evaluated at compile time, but in the case where a and b have unknown values GCC will emit either four scalar add operations or one 64-bit vector add.
You can also pass them as arguments to vector intrinsics, which are listed in the manual under target-specific builtins. These correspond directly to a single underlying vector instruction, so if you look in the assembly language reference for the target CPU then you will find a detailed explanation of what each one does.
Rather than declare vector types directly, it's often a good idea to declare unions of vector and array types. This lets you use the same value as both an array and a vector.
I wrote a longer explanation a while ago.
I am TheRaven on Soylent News
In general modern compilers are good enough that you are much more likely to get better performance by spending the time finding a better algorithm then you are hand optimizing the code. Obviously for things like H.264 where the algorithm is already set this is not true, but that's a very small fraction of the code out there.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
That was a fabulous presentation, and one that I'll likely hold onto a copy of, since it describes the issue of SMP memory ordering with a great example. I'll have to write "presenter notes" for those slides, since I can't get the video to come up, but that's OK. I understand what's going on there.
One thing I thought was notably absent was any discussion of data prefetch. With all of the emphasis on how performance is dominated by cache misses, you'd think he'd give at least a nod to both automatic hardware and compiler directed software prefetch. After all, he mentions CMT, which is a more exotic way to hide memory latency, IMHO.
On a different note: In the example on slides 23 - 30, he shows an example where speculation allowed two cache misses to pipeline, bringing the cost-per-miss down to about half. Dunno if he highlighted the synergy here in the talk, because it wasn't highlighted in the presentation. It is useful to note, though, how overlapping cache misses reduces their cost. There can be even more synergy here than is otherwise obvious: In HPCA-14, there was a fascinating paper (slides) about how incorrect speculation can still speed up programs due to misses on the incorrectly-speculated path still bringing in relevant cache lines.
Program Intellivision!