Cliff Click's Crash Course In Modern Hardware
Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code due to all of the crazy kung-fu and ninjitsu it does to your code while it's running. This talk is an excellent drill-down into the internals of the x86 chip, and it's a great way to get an understanding of what really goes on down at the hardware and why certain types of applications run so much faster than other types of applications. Dr. Cliff really knows his stuff!"
I can't say I've WTFV like I usually RTFA before you get to see it... but I can tell you this: The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.
on the website? I'm not sure what I'm looking at...
It doesn't make sense to code in ASM anymore.
With computing expanding towards more and more parallelism, I can clearly see that one should learn to start coding in the most abstract of ways and let the tools do the optimisation for him...
That's the main reason why I want to shoot people who write "clever" code on the first pass. Always make the rough draft of a program clean and readable. If (and only if!) you need to optimize it, use a profiler to see what actually needs work. If you do things like manually unroll loops where the body is only executed 23 times during the program's whole lifetime, or use shift to multiply because you read somewhere that it's fast, then don't be surprised when your coworkers revoke your oxygen bit.
Dewey, what part of this looks like authorities should be involved?
/.'d, to say the least. Wow.
Great lecture so far, 2 minute pauses every 20 seconds make it kind of hard to listen to though!
Now that no one knows what they're doing, who's to keep them from merging. How long is it before several machines of x86 chips become self-aware? The end is nigh comrades!
Alternatively, maybe they'll become Data.
What the fuck are you talking about. Why the hell do you need to write a compiler in assembler? Do you have any idea how a compiler works? Your last sentence suggests not.
I can't even watch this. Anyone got a transcript so that I can skip the video BS and just read it? I can read a lot faster than he can talk, and I wouldn't have to wait 30 minutes for the video to load (slow connection) ...
Features like out of order execution, caches, and branch prediction/speculation are commonplace on many architectures, including the next generation ARM Cortex A9 and many POWER, SPARC, and other RISC architectures. Even in-order designs like Atom, Cortex A8, or POWER6 have branch prediction and multi-level caches.
The most important thing for performance is to understand the memory hierarchy. Out-of-order execution lets you get away with a lot of stupid things, since many of the pipeline stalls you would otherwise create can be re-ordered around. In contrast, the memory subsystem can do relatively little for you if your working set is too large and you don't access memory in an efficient pattern.
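A minimal C sketch of what "accessing memory in an efficient pattern" can mean. The function names and the flat row-by-row layout are assumptions made for illustration only. Both functions compute the same sum over the same data; the only difference is the order in which memory is touched.

```c
#include <stddef.h>

/* Sums an n x n matrix stored row by row in one flat array,
   visiting elements in the same order they sit in memory. */
long sum_row_major(const int *m, size_t n) {
    long total = 0;
    for (size_t row = 0; row < n; row++)
        for (size_t col = 0; col < n; col++)
            total += m[row * n + col];   /* stride: one int */
    return total;
}

/* Same sum, same data, but walks down each column first,
   so consecutive accesses are n ints apart. */
long sum_column_major(const int *m, size_t n) {
    long total = 0;
    for (size_t col = 0; col < n; col++)
        for (size_t row = 0; row < n; row++)
            total += m[row * n + col];   /* stride: n ints */
    return total;
}
```

On cached hardware the second version tends to be slower once the matrix no longer fits in cache, because each access lands on a different cache line. How much slower depends on the machine, so it is worth measuring rather than assuming.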
I wish they'd all just use HTML5 or put it on YouTube so I can use youtube-dl or something. Otherwise it either doesn't work at all (my amd64 Linux boxes) or is slow and jerky (my Mac OSX box). It's really frustrating.
People learn a trick way back when, or hear about the trick years later, and assume it is still valid. Not the case. Architectures change a lot and what used to be the best way might not be anymore.
Michael Abrash, one of the all time greats of optimization, talks about this in relation to some of the old tricks he used to use. One was to use XOR to clear a register on x86. XORing a register with itself gives 0, of course, and turned out to be faster than writing an immediate value of zero in to the register. Reason is that loading a value was slower than the XOR op, and the old CPUs had no special clear logic, zero was just another number.
Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.
However, you'll still hear people say it is a great trick because they haven't updated their knowledge.
This just in...Apparently Bruce Lee and Lee Van Cleef are alive and well and working for Intel, which likely accounts for all the "crazy kung-fu and ninjitsu" going on there...
A Lock-Free Hash Table
Just write good clean code that works properly first. The only time you optimize is after it has been profiled to see if there are troublesome spots. Given the way CPUs run and how compilers are designed, there is very little need to do optimization. Unless you have taken some serious courses on how current CPUs work, your efforts will mostly result in bad code that gains you nothing in terms of speed. Your time is better spent on writing CORRECT code.
The compilers are very intelligent about proper loop unrolling, rearranging branches, and moving instructions around to keep the CPU pipeline full. They will also look for unnecessary/redundant instructions within a loop and move them to a better spot.
One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard at optimizing their code to get better times; I got the best time by playing with the compiler flags.
In C/C++ shift is not the same as multiply/divide by 2. Multiplication and division operators have a different precedence level than shift operators. Not only is there the possibility of poor optimization but such a substitution may lead to a computational error. For example mul/div has a higher precedence than add/sub, but shift has a lower precedence:
printf(" 3 * 2 + 1 = %d\n", 3 * 2 + 1);
printf(" 3 << 1 + 1 = %d\n", 3 << 1 + 1);
printf("(3 << 1) + 1 = %d\n", (3 << 1) + 1);
3 * 2 + 1 = 7
3 << 1 + 1 = 12
(3 << 1) + 1 = 7
Let's start the pissing contest:
I have a 6-digit slashdot ID. Beat that you newbs!
First against the wall when the revolution comes
My first programming was putting the little white plastic straws on a Digi-Comp 1.
(And it really was in a snowstorm -- I got it as a Christmas present.)
I am guessing that the site was slashdotted because the video never ran. Yet another example of some imbecile who designs their own video player and either can't run the material correctly or can't handle the load. I see this over and over, someone - or some site - decides to run their own video player and it's either inoperative or runs badly. I wrote about this on my blog in October 2008 how so many places try - and fail - to properly run video.
You know, running video correctly isn't rocket science; YouTube does it fine under loads that would slashdot Slashdot. But do these stupidos use YouTube to serve their video? Noooo, they'd prefer to use some incompetent who can't provide it properly, probably because they're under the impression they'd lose ad revenue or something, I guess. But I see this all the time. The New York Times provides video for some of their stories, but their video doesn't work: it stalls, and there is no way to cache it, so if it fails you can neither get it to run smoothly nor go back and run it again without downloading the entire video all over again after it's already been served. I guess they never thought about people having problems.
If these were streamed video like a live event, that would be one thing. But they do the exact same thing YouTube does, they feed stored video to a player written using Adobe Flash. So there's no excuse for their failures except pure incompetence and/or stupidity.
The lessons of history teach us - if they teach us anything - that nobody learns the lessons that history teaches us.
Nobody cares about performance, or what exactly the code does, since java takes care of all those "pesky details".
Virtual machines use a very simple instruction set, hopefully optimized for the processor, hopefully optimized by the OS, which is hopefully optimized for the processor.
Want performance? Don't code in java.
Want performance? Do some PERFORMANCE ANALYSIS.
I know it's hard, since it requires actual MATH, something that simple programmers are not taught, and really don't care about, but it's worth it.
Some up front optimization can save you MONTHS of recoding effort.
Rule #1 in Performance Optimization: DON'T OPTIMIZE, PLAN!
Sure, there is tons and tons of x86-friendly code out there, but do you really want it running naked on power-sensitive devices such as smart phones? It is trivial these days to query a processor for its capabilities, and applications optimized for the desktop and server environments are going to run flat out, partying on every flavor of SSEx available. For x86 to be more than just an also-ran in the mobile world, systems need to be able to easily present applications with a VM view of the processor to rein in power-hungry apps, IMHO.
I watched about half of his presentation. I was amused because on a lot of the slides he says something like "except on really low end embedded CPUs." I spend a lot of my time programming (frequently in assembly) for these exact very low end CPUs. I haven't had to do much with 8-bit cores, fortunately, but I've been doing a lot of programming on a 16-bit microcontroller lately (EMC eSL).
I suspect the way I'm programming these chips is a lot like how you would have programmed a desktop CPU in about 1980, except that I get to run all the tools on a computer with a clock speed 100x the chip I'm programming (and at least 1000x the performance). I am constantly amazed by how little we pay for these devices: ~10 Mips, 32k RAM, 128k Program memory, 1MB data memory and they're $1.
But they do have a 3-stage pipeline, so I guess some of what Dr. Cliff says still applies.
We're wanted men. I have the death sentence in 12 systems!
Try FLAT Assembler.
A free assembler that does real wonders!
Download from: http://flatassembler.net/download.php
Forum: http://board.flatassembler.net/index.php
Muchas Gracias, Señor Edward Snowden !
I do code in HLL, but I do not give up the right to code in ASM.
In fact, coding in ASM is super fun !
A couple of years ago I coded in MASM, but now I use FASM (Flat Assembler) instead.
It is available from http://www.flatassembler.net/.
Enjoy ! :D
Coreplayer for PowerPC, on OS X, plays 720p H.264 video fine on a 1.42 GHz G4. More shocking still, its benchmark function actually shows 70-80 fps. Why? Altivec is used along with very clever OpenGL and possibly ASM.
Of course, some idiot will pop up and say "powerpc is dead"... Well, in the case of an Intel Core 2 Duo, the CPU load is at sub 3-5% levels, leaving free cycles for all the amazing filters one can run. It is not just PowerPC/Altivec that is wasted; SSE is always wasted too. I really wonder what kind of computing we would do if these guys coding x86 only and relying on automatic optimization actually knew/used SSE instructions.
This talk was great! But, I'd love to have seen some of the other ones Cliff offered (particularly the GC one). A quick search of google video/youtube turns up only his lock free hash table talk, which is great, but I've seen it already.
Anyone have links to more of this guy?
-c
"If you are an idealist it doesn't matter what you do or what goes on around you, because it isn't real anyway."-R.P.W.
" As such, it is worth your while to see what its solution to your problem is, and then see if you can improve, rather than assuming you are smarter and can do everything better on your own. Of course all that is predicated on using a profiler first to find out where the actual problem is." - by Sycraft-fu (314770) on Thursday January 14, @06:50PM (#30772966)
Exactly, & I tend to use a very "old-school/primitive" method of "hand-rolled profiling" (in using hi-res multimedia timers registered with the OS to do so, in order to find the 'slow spots' in my methods/subroutines/procedures/functions), &, it works (to @ least spot those areas, just as you noted).
HOWEVER: I only took my program, noted below earlier here today also, down from roughly a 4.5 hour runtime, down to a 4 hour runtime using programmatic optimizations of varying kinds! Not a bad gain, especially by optimizing (but that took a lot of time to determine mind you, & that matters in the workplace of course).
The program's one for my personal use here though, &, it's for:
----
A.) Sorting data alphabetically in datasets/records in HOSTS files
B.) Removing duplicated entries
C.) Pings+ resolved DOMAIN/HOST Names to IP Addresses of fav. sites I go to to add into the HOSTS file as to their correct IP-to-DOMAIN/HOST name resolution (faster doing it from a local HOSTS file than calling out to a potentially compromised & SLOWER external DNS server by far)
D.) Changing the preceding blocking address used on domain/host names in a HOSTS file from the larger & slower 127.0.0.1 or 0.0.0.0 blocking addresses to the smaller/faster 0 one (& vice versa IF NEED BE too)
----
Anyhow/anyways, now that the background of what my app does is covered?
Well - I did so via BOTH compiler level switchwork (Borland Delphi 7.1) & also hand-level ones done in Assembly language!
(Plus, using sorted lists & a better algorithm for sorting than QuickSort variants alone on small/medium/large lists: small = insertion sort variation, medium & large = quick sort variation. I changed it on the fly depending on the size of the data being sent into the lists I used, dynamically resizing types of course.)
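For readers who want to see the "insertion sort for small lists, quicksort for bigger ones" idea in code, here is a rough C sketch. It is not the poster's Delphi code; the names `hybrid_sort` and `SMALL_CUTOFF` and the cutoff value of 16 are assumptions picked purely for illustration.

```c
#include <stddef.h>

#define SMALL_CUTOFF 16  /* hypothetical threshold; tune by measuring */

/* Simple insertion sort, cheap for short arrays. */
static void insertion_sort(int *a, size_t n) {
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

/* Picks the algorithm from the input size: insertion sort for small
   inputs, a quicksort partition step plus recursion otherwise. */
void hybrid_sort(int *a, size_t n) {
    if (n <= SMALL_CUTOFF) {
        insertion_sort(a, n);
        return;
    }
    int pivot = a[n / 2];
    long i = 0, j = (long)n - 1;   /* signed so j may step below zero */
    while (i <= j) {
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j) {
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
            i++;
            j--;
        }
    }
    hybrid_sort(a, (size_t)(j + 1));        /* left part:  a[0..j]   */
    hybrid_sort(a + i, n - (size_t)i);      /* right part: a[i..n-1] */
}
```

Whether the cutoff helps, and where it should sit, is something a profiler or timer has to answer for the actual data.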
I also chose Borland Delphi because it was shown, as far back as 1997 in fact & in a COMPETING TRADE JOURNAL in computing called "Visual Basic Programmer's Journal" Sept./Oct. 1997 issue entitled "INSIDE THE VB5 COMPILER" where Delphi SOUNDLY "KNOCKED THE CHOCOLATE" out of BOTH MSVC++ &/or VB5 by DOUBLE (or, better) in BOTH Math & String processing related tasks work...
That all "said & aside", well...?
I only got SO MUCH out of using programmatic optimizations by hand, a 1/2 hour decrease in work time, going from 4.5 hr. runtimes down to 4 hrs. time for all tasks A-D above completing.
I further used x86 Assembly code, inlined via the asm directive mostly here, or using shifts vs. multiplies etc. et al & more, such as FOR loops vs. WHILE loops etc. too & better algorithms after profiling showed me where I was "slowing up" the most (via hi-res multimedia timers timing all my procedures).
Lastly, I also used compiler switches work (& also, removing ones I did not need for safety once the code proved safe & accurate enough, via using Try-Catch-Except/Finally errtrap methods in Delphi, doing my OWN exception handling & err trapping vs. the built-in "structured exception handlers" in the compiler itself only)...
Sure, 1/2 hr. less of 4.5 hours, down to 4 hours only, is a decent increase... but? Compared to what you get from BETTER HARDWARE??
It pales by comparison!
In fact, I noted this very example in another thread here today ->
Forrester Says Tech Downturn Is "Unofficially Over":
http://news.slashdot.org/comments.pl?sid=1508482&cid=30776266
Where others there ar
I'm sorry, where exactly did you find 'C++ methods' in Quake 3 code?
That was a fabulous presentation, and one that I'll likely hold onto a copy of, since it describes the issue of SMP memory ordering with a great example. I'll have to write "presenter notes" for those slides, since I can't get the video to come up, but that's OK. I understand what's going on there.
One thing I thought was notably absent was any discussion of data prefetch. With all of the emphasis on how performance is dominated by cache misses, you'd think he'd give at least a nod to both automatic hardware and compiler directed software prefetch. After all, he mentions CMT, which is a more exotic way to hide memory latency, IMHO.
On a different note: In the example on slides 23 - 30, he shows an example where speculation allowed two cache misses to pipeline, bringing the cost-per-miss down to about half. Dunno if he highlighted the synergy here in the talk, because it wasn't highlighted in the presentation. It is useful to note, though, how overlapping cache misses reduces their cost. There can be even more synergy here than is otherwise obvious: In HPCA-14, there was a fascinating paper (slides) about how incorrect speculation can still speed up programs due to misses on the incorrectly-speculated path still bringing in relevant cache lines.
Program Intellivision!
It took me hours to download this video, trying several different mirrors. So, here's a bittorrent version of the video.
The main take-away from this talk is that the modern software engineer needs to pay more attention to memory access and data dependency.
For some reason, the Slashdot luddites have come out in force to declare that it was actually about how inaccessible modern architectures are and how it's more proof that you should never use anything but a high level language. Nonsense.
I see this happen every time the subject of low level architecture comes up. There's a (sadly) large proportion of engineers who vehemently refuse to learn anything below the highest levels of programming. This turns into a silly justification backed by the evidence of how complex modern architectures are.
Some variants of this luddite behavior emerge as 'premature optimization is the root of all evil'. Yes, it's a good quote, but it's not referring to what you're referring to. There's nothing wrong with knowing in advance where the bottlenecks in a system will likely be. That's called experience. It's called knowing the characteristics of your platform. Those who stubbornly design systems without thought to performance are doomed to produce code which is inefficient, slow, and worst of all - incapable of being optimized without a re-write. Premature optimization may be bad, but preemptive optimization is a good quality to have.
That's the second take-away, in my opinion, from the talk: Engineers are all going to have to learn how to optimize code for the architecture, because your free ride on the MHz and CPI slope has ended. Here's a clue: if you're someone who knows how it all works, can preemptively optimize their designs to better fit their system, and can use their knowledge to debug issues, you are a far more valued engineer than the others. Bear this in mind the next time you find that a million outsourced engineers can do exactly the same job as you.
Alternately, instead of using TIMERS (especially high-resolution multimedia timers registered w/ the OS)? You can use prebuilt functions/procedures/methods/subroutines that use methods like JAVA's "getTime()" method @ the START of your procedures/methods/subroutines/functions instead... getting the current time in MILLISECONDS, first.
Then, @ the END of your procedure/function/method/subroutine being timed? Get the time again, using an analog to JAVA's getTime() method, & subtract the start time (what you did @ the BEGINNING of what's being timed) from the END TIME (what you obtain @ the END of what you're timing) & voila:
You have just EASILY "profiled" the runtime of your method/subroutine/function/procedure being timed... with relative ease!
APK
P.S.=> No timers required either... Just thought I'd add that, as an 'addendum' to my original methods noted, as this omits having to use TIMERS, period... & additionally? There are even more "finer grained" methods that use MICROSECONDS (finer grained than getTime(), & nanoseconds too, afaik)... some "Food 4 Thought" for you all to "drink in, & digest" here... apk
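The "read the time at the start, read it again at the end, subtract" approach described above might look like this in C. This is only a sketch: `now_ms`, `busy_work` and `time_busy_work` are made-up names, and the C11 `timespec_get` call stands in for Java's `getTime()`.

```c
#include <time.h>

/* Hypothetical helper: current wall-clock time in milliseconds,
   read with the C11 timespec_get function. */
static double now_ms(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}

/* Hypothetical workload standing in for the routine being measured:
   adds up the integers 0 .. n-1. */
static unsigned long busy_work(unsigned long n) {
    volatile unsigned long acc = 0;   /* volatile keeps the loop from being optimized away */
    for (unsigned long i = 0; i < n; i++)
        acc += i;
    return acc;
}

/* Takes a timestamp before and after one call to busy_work,
   stores the workload's result, and returns the elapsed milliseconds. */
double time_busy_work(unsigned long n, unsigned long *result) {
    double start = now_ms();
    *result = busy_work(n);
    double end = now_ms();
    return end - start;
}
```

Two caveats: wall-clock time can be adjusted while the program runs, and a single short measurement is noisy, so it usually pays to time many iterations and to treat the numbers as a rough guide to where a real profiler should look.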
As a former Z80 black-belt ninja, I'd say it was easier - except that Z80 didn't have mul/div instructions... but, who needs these? You could write a routine for this (granted: dead slow) but I've written pretty significant programs without ever needing a single "full" mul/div, just hacking around it with a couple shifts, adds or bit ops.
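For anyone who never had to do it, the shift-and-add trick can be sketched in C roughly like this. It illustrates the general technique, not any particular Z80 routine, and the function names are made up for the example.

```c
#include <stdint.h>

/* Multiplies two unsigned 16-bit values using only shifts, adds and
   bit tests, as a routine for a CPU without a multiply instruction might. */
uint32_t mul_shift_add(uint16_t a, uint16_t b) {
    uint32_t product = 0;
    uint32_t addend = a;          /* a shifted left by the current bit position */
    while (b != 0) {
        if (b & 1u)               /* lowest remaining bit of b is set */
            product += addend;
        addend <<= 1;
        b >>= 1;
    }
    return product;
}

/* Multiplying by a known constant needs no loop at all:
   10x = 8x + 2x, i.e. two shifts and one add. */
uint32_t times_ten(uint32_t x) {
    return (x << 3) + (x << 1);
}
```

The general routine loops once per significant bit of `b`, which is why it is slow compared with a hardware multiply, while the constant case shows the kind of "couple of shifts and adds" shortcut that covers most everyday needs.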
or you needlessly wrote some hideous O(n!) search which is NP complete, then no amount of profiling and instruction tuning is ever going to help you.
In this situation the value of the profiling tools is not for instruction tuning, but to help you notice the existence of the bad search function so you can replace it with something else.
In a large program there can be lurking n-squaredness which may not be obvious from looking at any one section of the code. For example there could be an innocent function which loops over n objects, and you may not realize that it is being called from a function twelve levels up the stack which is also looping over the same n objects.
Sometimes it's enough to just stop in the debugger a few times to realize what is slower than it should be and why. In other cases, browsing the output of a good call graph profiler can help inspire the fix faster.
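The "innocent function called from a loop higher up" pattern can be sketched in C roughly as follows. All names here are hypothetical, and the sorted variant is just one possible replacement, not the only fix.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Innocent-looking helper: one linear scan counting items equal to key. */
size_t count_matches(const int *items, size_t n, int key) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (items[i] == key)
            count++;
    return count;
}

/* A caller that also loops over the same n items and calls the helper
   for each one: n * n comparisons in total. Returns how many items
   have at least one equal twin. */
size_t count_duplicates_quadratic(const int *items, size_t n) {
    size_t dups = 0;
    for (size_t i = 0; i < n; i++)
        if (count_matches(items, n, items[i]) > 1)
            dups++;
    return dups;
}

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Possible replacement: sort a copy once, then make a single pass over
   runs of equal values. Returns the same count as the quadratic version. */
size_t count_duplicates_sorted(const int *items, size_t n) {
    if (n == 0)
        return 0;
    int *copy = malloc(n * sizeof *copy);
    if (copy == NULL)
        return 0;   /* sketch only: allocation failure is treated as "no data" */
    memcpy(copy, items, n * sizeof *copy);
    qsort(copy, n, sizeof *copy, cmp_int);
    size_t dups = 0;
    for (size_t i = 0; i < n; ) {
        size_t j = i;
        while (j < n && copy[j] == copy[i])
            j++;
        if (j - i > 1)
            dups += j - i;   /* every member of a run of equals counts */
        i = j;
    }
    free(copy);
    return dups;
}
```

In a real program the two loops would sit in different files several calls apart, which is exactly why a call graph profiler or a few debugger stops tend to find the problem faster than reading the code does.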