Cliff Click's Crash Course In Modern Hardware
Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code due to all of the crazy kung-fu and ninjitsu it does to your code while it's running. This talk is an excellent drill-down into the internals of the x86 chip, and it's a great way to get an understanding of what really goes on down at the hardware and why certain types of applications run so much faster than other types of applications. Dr. Cliff really knows his stuff!"
I can't say I've WTFV like I usually RTFA before you get to see it... but I can tell you this: The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.
on the website? I'm not sure what I'm looking at...
It doesn't make sense to code in ASM anymore.
With computing expanding towards more and more parallelism, I can clearly see that one should learn to code in the most abstract way and let the tools do the optimisation...
That's the main reason why I want to shoot people who write "clever" code on the first pass. Always make the rough draft of a program clean and readable. If (and only if!) you need to optimize it, use a profiler to see what actually needs work. If you do things like manually unroll loops where the body is only executed 23 times during the program's whole lifetime, or use shift to multiply because you read somewhere that it's fast, then don't be surprised when your coworkers revoke your oxygen bit.
Dewey, what part of this looks like authorities should be involved?
/.'d, to say the least. Wow.
Great lecture so far, 2 minute pauses every 20 seconds make it kind of hard to listen to though!
Features like out-of-order execution, caches, and branch prediction/speculation are commonplace on many architectures, including the next generation ARM Cortex A9 and many POWER, SPARC, and other RISC architectures. Even in-order designs like Atom, Cortex A8, or POWER6 have branch prediction and multi-level caches.
The most important thing for performance is to understand the memory hierarchy. Out-of-order execution lets you get away with a lot of stupid things, since many of the pipeline stalls you would otherwise create can be re-ordered around. In contrast, the memory subsystem can do relatively little for you if your working set is too large and you don't access memory in an efficient pattern.
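To make the point about access patterns concrete, here is a minimal sketch in C. The function names are made up for illustration. Both functions compute the same sum over a row-major matrix; only the order in which memory is touched differs. On a matrix much larger than the cache, the column-first version would be expected to run noticeably slower, though the actual gap depends on the machine and should be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols matrix stored in row-major order, walking memory
   sequentially: consecutive iterations touch adjacent addresses. */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Same sum, but walking down the columns: each step jumps cols elements
   ahead, so on a large matrix nearly every access lands on a new cache line. */
long sum_column_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}
```

Timing both on a matrix of a few thousand rows and columns is a quick way to see how much the memory hierarchy matters on a given box.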
I wish they'd all just use HTML5 or put it on YouTube so I can use youtube-dl or something. Otherwise it either doesn't work at all (my amd64 Linux boxes) or is slow and jerky (my Mac OSX box). It's really frustrating.
People learn a trick way back when, or hear about the trick years later, and assume it is still valid. Not the case. Architectures change a lot and what used to be the best way might not be anymore.
Michael Abrash, one of the all time greats of optimization, talks about this in relation to some of the old tricks he used to use. One was to use XOR to clear a register on x86. XORing a register with itself gives 0, of course, and turned out to be faster than writing an immediate value of zero into the register. The reason is that loading a value was slower than the XOR op, and the old CPUs had no special clear logic; zero was just another number.
Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.
However, you'll still hear people say it is a great trick because they haven't updated their knowledge.
You have to admit it's pretty nice to have the presentation slides automatically display and advance below the video as you watch..
This just in...Apparently Bruce Lee and Lee Van Cleef are alive and well and working for Intel, which likely accounts for all the "crazy kung-fu and ninjitsu" going on there...
Just write good clean code that works properly first. The only time you optimize is after it has been profiled to see if there are troublesome spots. Given the way CPUs run and how compilers are designed, there is very little need to do optimization. Unless you have taken some serious courses on how current CPUs work, your efforts will mostly result in bad code that gains you nothing with respect to speed. Your time is better spent on writing CORRECT code.
The compilers are very intelligent about proper loop unrolling, rearranging branches, and moving instructions around to keep the CPU pipeline full. They will also look for unnecessary/redundant instructions within a loop and move them to a better spot.
One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times, I got the best time by playing with the compiler flags.
In C/C++ shift is not the same as multiply/divide by 2. Multiplication and division operators have a different precedence level than shift operators. Not only is there the possibility of poor optimization but such a substitution may lead to a computational error. For example mul/div has a higher precedence than add/sub, but shift has a lower precedence:
printf(" 3 * 2 + 1 = %d\n", 3 * 2 + 1);
printf(" 3 << 1 + 1 = %d\n", 3 << 1 + 1);
printf("(3 << 1) + 1 = %d\n", (3 << 1) + 1);
3 * 2 + 1 = 7
3 << 1 + 1 = 12
(3 << 1) + 1 = 7
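The three cases above can be wrapped into small functions to make the parse explicit. This is only a sketch; the function names are invented for illustration, and the middle one spells out the parentheses the compiler effectively inserts when it sees `x << 1 + 1`.

```c
#include <assert.h>

/* Multiplication binds tighter than addition: (x * 2) + 1. */
int times_two_plus_one(int x)
{
    return x * 2 + 1;
}

/* What "x << 1 + 1" actually means: addition binds tighter than the
   shift, so the expression is read as x << (1 + 1), i.e. x * 4. */
int shift_as_parsed(int x)
{
    assert(x >= 0);          /* left-shifting a negative int is undefined in C */
    return x << (1 + 1);
}

/* The shift-based version only matches the multiply when parenthesized. */
int shift_parenthesized(int x)
{
    assert(x >= 0);
    return (x << 1) + 1;
}
```

With x = 3 these return 7, 12 and 7 respectively, matching the output shown above.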
Let's start the pissing contest:
I have a 6-digit slashdot ID. Beat that you newbs!
First against the wall when the revolution comes
My first programming was putting the little white plastic straws on a Digi-Comp 1.
(And it really was in a snowstorm -- I got it as a Christmas present.)
I watched about half of his presentation. I was amused because on a lot of the slides he says something like "except on really low end embedded CPUs." I spend a lot of my time programming (frequently in assembly) for these exact very low end CPUs. I haven't had to do much with 8-bit cores, fortunately, but I've been doing a lot of programming on a 16-bit microcontroller lately (EMC eSL).
I suspect the way I'm programming these chips is a lot like how you would have programmed a desktop CPU in about 1980, except that I get to run all the tools on a computer with a clock speed 100x the chip I'm programming (and at least 1000x the performance). I am constantly amazed by how little we pay for these devices: ~10 Mips, 32k RAM, 128k Program memory, 1MB data memory and they're $1.
But they do have a 3-stage pipeline, so I guess some of what Dr. Cliff says still applies.
We're wanted men. I have the death sentence in 12 systems!
What is this, I don't even...
Seriously, I write optimized DSP code for x86 and non-x86 architectures, and I see absolutely nothing relevant or meaningful in the above comment.
"x86-friendly code"? Most of the code anyone ever sees is not "friendly" to any architecture -- for example, it uses way, way too much memory to be efficient at cache use, so the only "performance" the user sees is the speed of his RAM. At best someone manages to fit some code into cache sizes (that vary more within the x86 architecture than between architectures), or adapts it to various kinds of parallelism (that are usually portable in general but have to be adapted to particular implementations). The rest is a job for compilers -- and x86 is an architecture with a very long history of compiler development, so it may be better supported than some others.
"Query a processor for its capabilities"? Why would you want to do that? What do you think a compiler optimizes for, if not the "capabilities" of a target CPU -- even if there is no generalized way to represent those "capabilities"?
"Easily present applications with a VM view of the processor to rein in power hungry apps"? If there were a VM (that is, a VIRTUAL machine) that could be easily converted into any CPU architecture, and such a representation also reflected the performance of the target architecture, it would be the target of the last compiler ever written. The whole "problem" is that progress in CPU technology involves fundamentally different ways the CPU treats code (pipelines, cores, non-SIMD parallelism), memory (caches, SIMD) and I/O (bus architectures) that allow code to be optimized for those architectures -- sometimes purely by optimized compilation, sometimes by the developer consciously adapting the code for particular CPU features.
Contrary to the popular belief, there indeed is no God.
Try FLAT Assembler.
A free assembler that does real wonders!
Download from: http://flatassembler.net/download.php
Forum: http://board.flatassembler.net/index.php
Muchas Gracias, Señor Edward Snowden !
I do code in HLL, but I do not give up the right to code in ASM.
In fact, coding in ASM is super fun !
A couple of years ago I coded in MASM, but now I use FASM (Flat Assembler) instead.
It is available from http://www.flatassembler.net/.
Enjoy ! :D
Coreplayer PowerPC, for OS X, does play 720p H.264 video fine on a 1.42 GHz G4. Even more shocking, its benchmark function actually shows 70-80 fps. Why? AltiVec is used along with very clever OpenGL and possibly ASM.
Of course, some idiot will pop up and say "powerpc is dead"... Well, in the case of an Intel Core 2 Duo, the CPU load is at sub 3-5% levels, leaving free cycles for all the amazing filters one can run. It is not just PowerPC/AltiVec that is wasted; SSE is always wasted too. I really wonder what kind of computing we would do if these guys coding x86 only and relying on automatic optimization actually knew/used SSE instructions.
This talk was great! But, I'd love to have seen some of the other ones Cliff offered (particularly the GC one). A quick search of google video/youtube turns up only his lock free hash table talk, which is great, but I've seen it already.
Anyone have links to more of this guy?
-c
"If you are an idealist it doesn't matter what you do or what goes on around you, because it isn't real anyway."-R.P.W.
" As such, it is worth your while to see what its solution to your problem is, and then see if you can improve, rather than assuming you are smarter and can do everything better on your own. Of course all that is predicated on using a profiler first to find out where the actual problem is." - by Sycraft-fu (314770) on Thursday January 14, @06:50PM (#30772966)
Exactly, & I tend to use a very "old-school/primitive" method of "hand-rolled profiling" (in using hi-res multimedia timers registered with the OS to do so, in order to find the 'slow spots' in my methods/subroutines/procedures/functions), &, it works (to @ least spot those areas, just as you noted).
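For anyone who wants to try the hand-rolled timing approach, here is a rough sketch in C. It swaps in the standard `clock()` function rather than the OS multimedia timers mentioned above, so it measures CPU time at whatever resolution the platform's `clock()` provides. `busy_sum` is a made-up stand-in for the routine under test, and `time_call` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical workload standing in for the routine being measured. */
unsigned long busy_sum(unsigned long n)
{
    volatile unsigned long total = 0;   /* volatile keeps the loop from being optimized away */
    for (unsigned long i = 0; i < n; i++)
        total += i;
    return total;
}

/* Run fn(arg) once and return the CPU seconds it took.
   If result is non-NULL, the function's return value is stored there. */
double time_call(unsigned long (*fn)(unsigned long), unsigned long arg,
                 unsigned long *result)
{
    assert(fn != NULL);
    clock_t start = clock();
    unsigned long r = fn(arg);
    clock_t end = clock();
    if (result != NULL)
        *result = r;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Wrapping each suspect procedure in a call like `time_call(busy_sum, 100000000UL, NULL)` and logging the result is crude, but it is usually enough to show which routine dominates the runtime.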
HOWEVER: I only took my program, noted below earlier here today also, down from roughly a 4.5 hour runtime, down to a 4 hour runtime using programmatic optimizations of varying kinds! Not a bad gain, especially by optimizing (but that took a lot of time to determine mind you, & that matters in the workplace of course).
The program's one for my personal use here though, &, it's for:
----
A.) Sorting data alphabetically in datasets/records in HOSTS files
B.) Removing duplicated entries
C.) Pings+ resolved DOMAIN/HOST Names to IP Addresses of fav. sites I go to to add into the HOSTS file as to their correct IP-to-DOMAIN/HOST name resolution (faster doing it from a local HOSTS file than calling out to a potentially compromised & SLOWER external DNS server by far)
D.) Changing the preceding blocking address used on domain/host names in a HOSTS file from the larger & slower 127.0.0.1 or 0.0.0.0 blocking addresses to the smaller/faster 0 one (& vice versa IF NEED BE too)
----
Anyhow/anyways, now that the background of what my app does is covered?
Well - I did so via BOTH compiler level switchwork (Borland Delphi 7.1) & also hand-level ones done in Assembly language!
(Plus, using sorted lists & a better algorithm for sorting than QuickSort variants on small/medium/large lists (small = insertion sort variation, medium & large = quick sort variation (I changed it on the fly depending on the size of the data being sent into the lists I used (dynamically resizing types of course))).
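The size-dependent switch between sorts can be sketched roughly as follows. This is C rather than Delphi and is not the poster's code: the cutoff of 16 is an assumed value that would need tuning, and `hybrid_sort` is a hypothetical name. Small ranges go to insertion sort, larger ones are partitioned quicksort-style.

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_CUTOFF 16   /* assumed threshold; tune by measurement */

/* Insertion sort: cheap on short ranges despite being O(n^2). */
static void insertion_sort(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

/* Sort a[lo..hi] inclusive: insertion sort below the cutoff,
   otherwise partition around the middle element and recurse. */
static void sort_range(int *a, long lo, long hi)
{
    if (hi - lo + 1 <= SMALL_CUTOFF) {
        insertion_sort(a + lo, (size_t)(hi - lo + 1));
        return;
    }
    int pivot = a[lo + (hi - lo) / 2];
    long i = lo, j = hi;
    while (i <= j) {
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j) {
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            i++;
            j--;
        }
    }
    if (lo < j) sort_range(a, lo, j);
    if (i < hi) sort_range(a, i, hi);
}

void hybrid_sort(int *a, size_t n)
{
    assert(a != NULL || n == 0);
    if (n > 1)
        sort_range(a, 0, (long)n - 1);
}
```

Whether this beats a plain library sort on a given data set is something only a profiler or timer can answer.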
I also chose Borland Delphi because it was shown, as far back as 1997 in fact & in a COMPETING TRADE JOURNAL in computing called "Visual Basic Programmer's Journal" Sept./Oct. 1997 issue entitled "INSIDE THE VB5 COMPILER" where Delphi SOUNDLY "KNOCKED THE CHOCOLATE" out of BOTH MSVC++ &/or VB5 by DOUBLE (or, better) in BOTH Math & String processing related tasks work...
That all "said & aside", well...?
I only got SO MUCH out of using programmatic optimizations by hand, a 1/2 hour decrease in work time, going from 4.5 hr. runtimes down to 4 hrs. time for all tasks A-D above completing.
I further used x86 Assembly code, inlined via the asm directive mostly here, or using shifts vs. multiplies etc. et al & more, such as FOR loops vs. WHILE loops etc. too & better algorithms after profiling showed me where I was "slowing up" the most, via hi-res multimedia timers timing all my procedures)
Lastly, I also used compiler switches work (& also, removing ones I did not need for safety once the code proved safe & accurate enough, via using Try-Catch-Except/Finally errtrap methods in Delphi, doing my OWN exception handling & err trapping vs. the built-in "structured exception handlers" in the compiler itself only)...
Sure, 1/2 hr. less of 4.5 hours, down to 4 hours only, is a decent increase... but? Compared to what you get from BETTER HARDWARE??
It pales by comparison!
In fact, I noted this very example in another thread here today ->
Forrester Says Tech Downturn Is "Unofficially Over":
http://news.slashdot.org/comments.pl?sid=1508482&cid=30776266
Where others there ar
That was a fabulous presentation, and one that I'll likely hold onto a copy of, since it describes the issue of SMP memory ordering with a great example. I'll have to write "presenter notes" for those slides, since I can't get the video to come up, but that's OK. I understand what's going on there.
One thing I thought was notably absent was any discussion of data prefetch. With all of the emphasis on how performance is dominated by cache misses, you'd think he'd give at least a nod to both automatic hardware and compiler directed software prefetch. After all, he mentions CMT, which is a more exotic way to hide memory latency, IMHO.
On a different note: In the example on slides 23 - 30, he shows an example where speculation allowed two cache misses to pipeline, bringing the cost-per-miss down to about half. Dunno if he highlighted the synergy here in the talk, because it wasn't highlighted in the presentation. It is useful to note, though, how overlapping cache misses reduces their cost. There can be even more synergy here than is otherwise obvious: In HPCA-14, there was a fascinating paper (slides) about how incorrect speculation can still speed up programs due to misses on the incorrectly-speculated path still bringing in relevant cache lines.
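As a rough illustration of compiler-directed software prefetch, here is a sketch in C. It assumes GCC or Clang, where `__builtin_prefetch` is available; on other compilers the macro compiles to nothing. The function name `gather_sum` and the prefetch distance of 8 are assumptions chosen for the example, not anything from the talk.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
/* Read prefetch with low temporal locality. */
#define PREFETCH(addr) __builtin_prefetch((addr), 0, 1)
#else
#define PREFETCH(addr) ((void)0)   /* no-op on compilers without the builtin */
#endif

#define PREFETCH_DISTANCE 8   /* assumed look-ahead; tune by measurement */

/* Sum data[index[i]] for i in [0, n). The indirect access pattern is hard
   for a hardware prefetcher to predict, so we request the line a few
   iterations early, letting several misses be in flight at once. */
long gather_sum(const int *data, const size_t *index, size_t n)
{
    assert(data != NULL && index != NULL);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH(&data[index[i + PREFETCH_DISTANCE]]);
        total += data[index[i]];
    }
    return total;
}
```

The prefetch is only a hint and does not change the result; whether it helps at all depends on the data size and the CPU, so it is worth timing with and without it.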
Program Intellivision!
The main take-away from this talk is that the modern software engineer needs to pay more attention to memory access and data dependency.
For some reason, the Slashdot luddites have come out in force to declare that it was actually about how inaccessible modern architectures are and how it's more proof that you should never use anything but a high level language. Nonsense.
I see this happen every time the subject of low level architecture comes up. There's a (sadly) large proportion of engineers who vehemently refuse to learn anything below the highest levels of programming. This turns into a silly justification backed by the evidence of how complex modern architectures are.
Some variants of this luddite behavior emerge as 'premature optimization is the root of all evil'. Yes, it's a good quote, but it's not referring to what you're referring to. There's nothing wrong with knowing in advance where the bottlenecks in a system will likely be. That's called experience. It's called knowing the characteristics of your platform. Those who stubbornly design systems without thought to performance are doomed to produce code which is inefficient, slow, and worst of all - incapable of being optimized without a re-write. Premature optimization may be bad, but preemptive optimization is a good quality to have.
That's the second take-away, in my opinion, from the talk: Engineers are all going to have to learn how to optimize code for the architecture, because your free ride on the MHz and CPI slope has ended. Here's a clue: if you're someone who knows how it all works, can preemptively optimize their designs to better fit their system, and can use their knowledge to debug issues, you are a far more valued engineer than the others. Bear this in mind the next time you find that a million outsourced engineers can do exactly the same job as you.
As a former Z80 black-belt ninja, I'd say it was easier - except that Z80 didn't have mul/div instructions... but, who needs these? You could write a routine for this (granted: dead slow) but I've written pretty significant programs without ever needing a single "full" mul/div, just hacking around it with a couple shifts, adds or bit ops.
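For anyone who never had to do it, the shift-and-add approach looks roughly like this. It is a sketch in C rather than Z80 assembly, and the function names are made up. The first function is the general (slow) routine; the second is the kind of constant-multiply hack that usually replaced it.

```c
#include <assert.h>
#include <stdint.h>

/* General 8x8 -> 16 bit multiply using only shifts and adds:
   for every set bit k in b, add (a << k) to the result. */
uint16_t mul8_shift_add(uint8_t a, uint8_t b)
{
    const uint8_t b_original = b;
    uint16_t result = 0;
    uint16_t addend = a;
    while (b != 0) {
        if (b & 1u)
            result = (uint16_t)(result + addend);
        addend = (uint16_t)(addend << 1);
        b >>= 1;
    }
    /* Debug check against the hardware multiply we are pretending not to have. */
    assert(result == (uint16_t)((unsigned)a * (unsigned)b_original));
    return result;
}

/* Multiply by the constant 10 as x*8 + x*2, two shifts and one add. */
uint16_t times_ten(uint8_t x)
{
    return (uint16_t)((x << 3) + (x << 1));
}
```

For example, `mul8_shift_add(13, 11)` gives 143 and `times_ten(25)` gives 250.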
or you needlessly wrote some hideous O(n!) search which is NP complete, then no amount of profiling and instruction tuning is ever going to help you.
In this situation the value of the profiling tools is not for instruction tuning, but to help you notice the existence of the bad search function so you can replace it with something else.
In a large program there can be lurking n-squaredness which may not be obvious from looking at any one section of the code. For example there could be an innocent function which loops over n objects, and you may not realize that it is being called from a function twelve levels up the stack which is also looping over the same n objects.
Sometimes it's enough to just stop in the debugger a few times to realize what is slower than it should be and why. In other cases, browsing the output of a good call graph profiler can help inspire the fix faster.
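A small sketch of the lurking n-squaredness described above, in C with made-up function names. The helper `seen_before` looks harmless in isolation, but calling it once per element makes the whole check quadratic. The second version sorts a copy first, which is the sort of replacement a profiler tends to prompt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Innocent-looking helper: linear scan of a[0..i-1] for the value a[i]. */
static int seen_before(const int *a, size_t i)
{
    for (size_t k = 0; k < i; k++)
        if (a[k] == a[i])
            return 1;
    return 0;
}

/* Quadratic overall: the linear helper is called once per element. */
int has_duplicates_slow(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (seen_before(a, i))
            return 1;
    return 0;
}

static int compare_ints(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Sort a copy, then compare neighbours: O(n log n) with a typical qsort. */
int has_duplicates_sorted(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    if (n < 2)
        return 0;
    int *copy = malloc(n * sizeof *copy);
    if (copy == NULL)
        return has_duplicates_slow(a, n);   /* fall back if allocation fails */
    memcpy(copy, a, n * sizeof *copy);
    qsort(copy, n, sizeof *copy, compare_ints);
    int found = 0;
    for (size_t i = 1; i < n && !found; i++)
        found = (copy[i] == copy[i - 1]);
    free(copy);
    return found;
}
```

In a real program the two loops are rarely this close together, which is exactly why the call graph view helps.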