Cliff Click's Crash Course In Modern Hardware

Posted by timothy on Thursday January 14, 2010 @11:19AM from the first-there-were-the-dinosaurs dept.

Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code due to all of the crazy kung-fu and ninjitsu it does to your code while it's running. This talk is an excellent drill-down into the internals of the x86 chip, and it's a great way to get an understanding of what really goes on down at the hardware and why certain types of applications run so much faster than other types of applications. Dr. Cliff really knows his stuff!"

Fast forward... by LostCluster · 2010-01-14 11:20 · Score: 5, Informative

I can't say I've WTFV like I usually RTFA before you get to see it... but I can tell you this: The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.
1. Re:Fast forward... by Brian Gordon · 2010-01-14 13:06 · Score: 5, Informative
  
  A little JavaScript-fu reveals that the video player points to a file (at http://flv.thruhere.net/presentations/09-sep-JVMperformance.flv) on some poor guy's machine through a dynamic DNS service! I hope somebody grabbed a copy before he (or Slashdot) took his server down.
2. Re:Fast forward... by Brian Gordon · 2010-01-14 15:47 · Score: 5, Informative
  
  You've done it! Interested slashdotters can download the video file at this link:
  http://67.202.36.223/presentations/09-sep-JVMperformance.flv.
  Good detective work, partner!
Re:Code in high-level by Sycraft-fu · 2010-01-14 11:50 · Score: 4, Informative

Also either start with the assembly the compiler generates, or at the very least make sure to bench your own against what it makes. The Intel Compiler in particular is extremely good at what it does. As such, it is worth your while to see what its solution to your problem is, and then see if you can improve, rather than assuming you are smarter and can do everything better on your own.
Of course all that is predicated on using a profiler first to find out where the actual problem is. Abrash accurately pointed out years ago that programmers suck at that. They'll spend hours making a nice optimized function that ends up making no noticeable difference in execution time.
Re:Premature optimization is evil... and stupid by marcansoft · 2010-01-14 13:00 · Score: 4, Informative

Which CPUs are those?
Those with a barrel shifter.

The fastest way to multiply today on AMD/Intel is to use the multiply instructions.
Then someone needs to beat the GCC developers with a cluestick.
$ cat test.c
int main(int argc, char **argv)
{
        return 4*(unsigned int)argc;
}
$ gcc -march=core2 test.c -o test
$ objdump -d test
...
00000000004004ec <main>:
  4004ec:   55                      push   %rbp
  4004ed:   48 89 e5                mov    %rsp,%rbp
  4004f0:   89 7d fc                mov    %edi,-0x4(%rbp)
  4004f3:   48 89 75 f0             mov    %rsi,-0x10(%rbp)
  4004f7:   8b 45 fc                mov    -0x4(%rbp),%eax
  4004fa:   c1 e0 02                shl    $0x2,%eax
  4004fd:   c9                      leaveq
  4004fe:   c3                      retq
  4004ff:   90                      nop

yeah... it seems like only assembly language programmers know this.
I program in assembly language, but not for x86. I usually program in ARM, which always has a barrel shifter. I guarantee shifts are faster than multiplies there.
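
As a small illustration of the equivalence being argued over here: for unsigned operands, multiplying by 4 and shifting left by 2 give the same value in C, so a compiler is free to emit either form. The sketch below only demonstrates that equivalence. The function names (times4_mul, times4_shift, check_times4) are made up for this example, and which instruction a given compiler actually emits depends on its version, target, and flags.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Multiply by 4 the obvious way; the compiler decides how to implement it. */
unsigned int times4_mul(unsigned int x)
{
    return 4u * x;
}

/* The same computation spelled as an explicit left shift by 2 bits. */
unsigned int times4_shift(unsigned int x)
{
    return x << 2;
}

/* Check that both forms agree, including on values that wrap around. */
void check_times4(void)
{
    const unsigned int samples[] = { 0u, 1u, 3u, 1000u, UINT_MAX / 4u, UINT_MAX };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(times4_mul(samples[i]) == times4_shift(samples[i]));
}
```

Compiling this with gcc -O2 -S and reading the assembly for the two functions is a quick way to see what your own compiler chooses for each spelling.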
Re:Code in high-level by TheRaven64 · 2010-01-14 13:44 · Score: 4, Informative

One of the biggest drawbacks of a language like C (and even more C++, and even more Java), is that they don't give you a whole lot of control of how stuff is arranged in memory
I'd say this is more of a C/C++ problem than a Java problem. Or, rather, they are different problems. The problem with C and C++ is that they do give the programmer a whole lot of control over how things are arranged in memory. They don't, on the other hand, give the compiler a lot of freedom to rearrange things.
Java, on the other hand, uses the Smalltalk memory model and so the compiler (and/or JVM) is free to rearrange things in memory as much as it wants to (whether it does, of course, is a matter for the compiler writer). For example, a Java compiler that notices that you are doing the same operation on three instance variables is free to put them next to each other aligned on a 128-bit boundary with some padding at the end so that you can easily use vector instructions on them, even if they were originally declared in different classes. A C compiler cannot do this with structure fields.
If you really care about alignment in C, you are free to use valloc() to align on a page boundary and then subdivide the memory yourself. Most of the time, however, it's not worth the effort.
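
For a narrower alignment than a whole page, a minimal sketch might look like the following. It assumes a C11 library and uses aligned_alloc() rather than valloc(), which is a substitution made for this example. The helper name alloc_floats_aligned16 is hypothetical, and the 16-byte figure is chosen to match the 128-bit boundary mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for `count` floats on a 16-byte
 * (128-bit) boundary. Returns NULL on failure; release with free(). */
float *alloc_floats_aligned16(size_t count)
{
    size_t bytes = count * sizeof(float);

    /* C11 aligned_alloc expects the size to be a multiple of the
     * alignment, so round up to the next multiple of 16. */
    size_t rounded = (bytes + 15u) & ~(size_t)15u;
    if (rounded == 0)
        rounded = 16;

    float *p = aligned_alloc(16, rounded);

    /* Sanity check: a successful allocation must be 16-byte aligned. */
    assert(p == NULL || (uintptr_t)p % 16u == 0);
    return p;
}
```

The returned block can then be subdivided by hand, as described above, with every 16-byte offset into it staying aligned.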

--
I am TheRaven on Soylent News
Re:Code in high-level by TheRaven64 · 2010-01-15 00:52 · Score: 4, Informative

The GCC manual tells you everything you need to know. First you declare a vector type, so if you want four shorts representing an RGBA colour value, you declare a type like this:

typedef short colour_t __attribute__ ((vector_size (4 * sizeof(short))));

This will give you a 64-bit vector type, so you can fit one in an MMX register, or two in an SSE or AltiVec register. You can then create these and do simple operations on them. For example, if you wanted to add two together, you could do this:

colour_t a = {1,2,3,4};
colour_t b = {1,2,3,4};
colour_t c = a + b;

In this case, the add is constant so it will be evaluated at compile time, but in the case where a and b have unknown values GCC will emit either four scalar add operations or one 64-bit vector add.
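
To make the unknown-values case concrete, here is a minimal sketch that assumes GCC (or a compiler that implements the same vector extension, such as Clang). The function names add_colours and colour_channel are invented for this example. Because the operands arrive as arguments, the addition cannot be folded at compile time.

```c
#include <assert.h>

/* Four shorts packed into one 64-bit vector, as declared above. */
typedef short colour_t __attribute__ ((vector_size (4 * sizeof(short))));

/* The operands are function arguments, so their values are not known
 * at compile time and the add has to happen at run time. */
colour_t add_colours(colour_t a, colour_t b)
{
    return a + b;   /* element-wise add of all four channels */
}

/* Read one channel; vectors can be subscripted like arrays. */
short colour_channel(colour_t c, int i)
{
    assert(i >= 0 && i < 4);
    return c[i];
}
```

Whether the compiler emits a single vector add or four scalar adds for add_colours depends on the target and the flags in use.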
You can also pass them as arguments to vector intrinsics, which are listed in the manual under target-specific builtins. These correspond directly to a single underlying vector instruction, so if you look in the assembly language reference for the target CPU then you will find a detailed explanation of what each one does.
Rather than declare vector types directly, it's often a good idea to declare unions of vector and array types. This lets you use the same value as both an array and a vector.
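
A sketch of that union approach, again assuming GCC-style vector extensions: the names colour_u and sum_of_added_channels are hypothetical and exist only for this example.

```c
#include <assert.h>

typedef short colour_t __attribute__ ((vector_size (4 * sizeof(short))));

/* One value, two views: a vector for arithmetic, an array for
 * per-channel access. */
typedef union {
    colour_t vec;
    short    channel[4];
} colour_u;

/* Add two colours through the vector view, then total the four
 * channels of the result through the array view. */
int sum_of_added_channels(colour_u a, colour_u b)
{
    colour_u sum;

    /* Both views must cover exactly the same bytes. */
    assert(sizeof sum.vec == sizeof sum.channel);

    sum.vec = a.vec + b.vec;

    int total = 0;
    for (int i = 0; i < 4; i++)
        total += sum.channel[i];
    return total;
}
```

Reading a union member other than the one last written is type punning, which GCC documents as supported, so the array view sees the lanes the vector add just produced.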
I wrote a longer explanation a while ago.

--
I am TheRaven on Soylent News
Re:Premature optimization is evil... and stupid by Rockoon · 2010-01-15 01:02 · Score: 4, Informative

GCC is a big offender, that's true.

This is one of the reasons that GCC sucks compared to ICC and VC++.

Let me give you the facts as they are today. In isolation, both the shift instructions and the multiply instructions have the same latency and throughput, and are also performed on the same execution units.

If this was the entire story, then they would be equal. But it's not the entire story.

The shift instructions only modify some of the flags in the flags register. Essentially, the shift instructions must do a read/modify/write on the flags. The multiplication instructions, however, alter the entire flags register, so only perform a write.

"But Rockoon.. they are the same latency anyways, right?" .. yes, in isolation. But that read/modify/write cycle on the flags register prevents a hell of a lot of out-of-order execution.

Essentially, one of the inputs to the shift instruction is the flags register, so all prior operations that modify the flags register must be completed first, and no instruction following the shift that also partially modifies the flags register can be completed until that shift is completed.

In some code, it won't make any discernible difference, but in other code it will make a big difference.

As far as that GCC compiler output.. that code is horrible, and not just because it's AT&T syntax.

There are two alternatives for multiplying by 4 that should be in competition here, and neither uses a shift.

One is a straight multiplication (MASM syntax, CDECL):

main:
mov edx, [esp + 4] ; 32-bit version, so +4 skips the return address
imul eax, edx, 4
ret

The other is leveraging the LEA instruction (MASM syntax, CDECL):

main:
mov eax, [esp + 4] ; 32-bit version, so +4 skips the return address
lea eax, [eax * 4]
ret

The alternative LEA version is, in isolation, slower on some processors (P4..), but it has the advantage that it uses different execution units on those very same processors, so it might pair better with other stuff in the pipeline, and it doesn't touch the flags register at all.

GCC is great at folding constants and such, and even calculates constant loops at compile time.. but it's a big-time fail at code generation. GCC is one of the compilers that one optimization expert struggled with because he was trying to turn a series of shifts and adds into a single far more efficient multiplication.. the compiler converted it back into a series of shifts and adds on him. Fucking fail.
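
For readers who have not seen the transformation in question, here is a small sketch of a shift-and-add sequence next to the single multiplication it stands for. Multiplying by 10 is just an illustrative constant, not one taken from the anecdote, and the function names are made up. The code shows only that the two spellings compute the same value for unsigned operands; it says nothing about which is faster on a given chip.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Multiply by 10 with a single multiplication. */
unsigned int times10_mul(unsigned int x)
{
    return x * 10u;
}

/* The same product as shifts and an add: 10*x = 8*x + 2*x. */
unsigned int times10_shift_add(unsigned int x)
{
    return (x << 3) + (x << 1);
}

/* Check that both spellings agree, including on values that wrap. */
void check_times10(void)
{
    const unsigned int samples[] = { 0u, 1u, 7u, 12345u, UINT_MAX / 10u, UINT_MAX };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(times10_mul(samples[i]) == times10_shift_add(samples[i]));
}
```

Because the two forms are interchangeable, an optimizer may rewrite either one into the other, which is how hand-written source can come back out in a different shape than it went in.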

--
"His name was James Damore."