Slashdot Mirror


Cliff Click's Crash Course In Modern Hardware

Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code due to all of the crazy kung-fu and ninjitsu it does to your code while it's running. This talk is an excellent drill-down into the internals of the x86 chip, and it's a great way to get an understanding of what really goes on down at the hardware and why certain types of applications run so much faster than other types of applications. Dr. Cliff really knows his stuff!"

61 of 249 comments (clear)

  1. Fast forward... by LostCluster · · Score: 5, Informative

    I can't say I've WTFV like I usually RTFA before you get to see it... but I can tell you this: The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.

    1. Re:Fast forward... by Jah-Wren+Ryel · · Score: 5, Funny

      The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.

      That's just the branch predictor pre-loading the cache for each possible conditional result.

      --
      When information is power, privacy is freedom.
    2. Re:Fast forward... by Brian+Gordon · · Score: 5, Informative

      A little javascript-fu reveals that the video player points to a file (at http://flv.thruhere.net/presentations/09-sep-JVMperformance.flv) on some poor guy's machine through a dynamic DNS service! I hope somebody grabbed a copy before he (or slashdot) took his server down.

    3. Re:Fast forward... by pyrrhonist · · Score: 3, Informative

      some poor guy's machine through a dynamic DNS service!

      Some poor guy? It's on an Amazon EC2 server!

      $ host flv.thruhere.net
      flv.thruhere.net has address 67.202.36.223
      $ host 67.202.36.223
      223.36.202.67.in-addr.arpa domain name pointer ec2-67-202-36-223.compute-1.amazonaws.com.

      --
      Show me on the doll where his noodly appendage touched you.
    4. Re:Fast forward... by Brian+Gordon · · Score: 5, Informative

      You've done it! Interested slashdotters can download the video file at this link:

      http://67.202.36.223/presentations/09-sep-JVMperformance.flv.

      Good detective work, partner!

    5. Re:Fast forward... by iammani · · Score: 3, Informative

      Mirror available at http://www.mediafire.com/?j21t2ynnnzn

      And please stop hitting the server liked in GP's post. The poor server hardly sustains 30 KB/s.

    6. Re:Fast forward... by iammani · · Score: 3, Informative

      Or type http://www.infoq.com/resource/presentations/click-crash-course-modern-hardware/en/slides/1.swf

      And change the the number at the end to change slides

      PS: I am no good in javascript, but the above, in FF 3.5/Linux, just displayed a page saying "true"

  2. Re:Could someone give me a crash course by Lunix+Nutcase · · Score: 5, Funny

    Probably due to your x86 processor doing all sorts of monkeying with the code.

  3. Re:Could someone give me a crash course by __aaclcg7560 · · Score: 2, Funny

    Spaghetti code can be hard to digest.

  4. Re:Code in high-level by caerwyn · · Score: 5, Insightful

    That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.

    However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.

    --
    The ringing of the division bell has begun... -PF
  5. Re:Code in high-level by Thiez · · Score: 2, Insightful

    Sometimes it's just plain FUN FUN FUN to code in asm. You're right that most programmers will never have a need for it at all (with some exceptions, such as those messing with operating systems or embedded systems), although knowing some ASM can help a lot with debugging. I suppose one could (read: should) learn a little ASM to have a better idea of what the hardware is doing, this will allow you to optimize your code a little, or (more importantly) write it in such a way that makes it easier for the compiler to optimize.

  6. Premature optimization is evil... and stupid by Just+Some+Guy · · Score: 2, Insightful

    That's the main reason why I want to shoot people who write "clever" code on the first pass. Always make the rough draft of a program clean and readable. If (and only if!) you need to optimize it, use a profiler to see what actually needs work. If you do things like manually unroll loops where the body is only executed 23 times during the program's whole lifetime, or use shift to multiply because you read somewhere that it's fast, then don't be surprised when your coworkers revoke your oxygen bit.

    --
    Dewey, what part of this looks like authorities should be involved?
    1. Re:Premature optimization is evil... and stupid by RightSaidFred99 · · Score: 4, Funny

      And messy and embarrassing. Oh, wait...

    2. Re:Premature optimization is evil... and stupid by marcansoft · · Score: 3, Interesting

      Using shift to multiply is often a great idea on most CPUs. On the other hand, just about every compiler will do that for you (even with optimization turned off I bet), so there's no reason to explicitly use shift in code (unless you're doing bit manipulation, or multiplying by 2^n where n is more convenient to use than 2^n). However, a much more important thing is to correctly specify signed/unsigned where needed. Signed arithmetic can make certain optimizations harder and in general it's harder to think about. One of my gripes about C is defaulting to signed for integer types, when most integers out there are only ever used to hold positive values.

    3. Re:Premature optimization is evil... and stupid by Rockoon · · Score: 3, Informative

      Using shift to multiply is often a great idea on most CPUs.

      Which CPU's are those? The fastest way to multiply today on AMD/Intel is to use the multiply instructions.

      Didn't know that? yeah... it seems like only assembly language programs know this.

      --
      "His name was James Damore."
    4. Re:Premature optimization is evil... and stupid by marcansoft · · Score: 4, Informative

      Which CPU's are those?

      Those with a barrel shifter.

      The fastest way to multiply today on AMD/Intel is to use the multiply instructions.

      Then someone needs to beat the GCC developers with a cluestick.
      $ cat test.c
      int main(int argc, char **argv) {
                      return 4*(unsigned int)argc;
      }
      $ gcc -march=core2 test.c -o test
      $ objdump -d test ...
      00000000004004ec <main>:
          4004ec: 55 push %rbp
          4004ed: 48 89 e5 mov %rsp,%rbp
          4004f0: 89 7d fc mov %edi,-0x4(%rbp)
          4004f3: 48 89 75 f0 mov %rsi,-0x10(%rbp)
          4004f7: 8b 45 fc mov -0x4(%rbp),%eax
          4004fa: c1 e0 02 shl $0x2,%eax
          4004fd: c9 leaveq
          4004fe: c3 retq
          4004ff: 90 nop

      yeah... it seems like only assembly language programs know this.

      I program in assembly language, but not for x86. I usually program in ARM, which always has a barrel shifter. I guarantee shifts are faster than multiplies there.

    5. Re:Premature optimization is evil... and stupid by AuMatar · · Score: 2, Insightful

      It depends on where they spend their hardware, and what you're multiplying by. You can make a multiplier faster than shifting, it just requires a lot of hardware to do so. If you're multiplying by a constant power of 2, shifting will always be as fast or faster. If you're multiplying by a non power of 2 constant, shifting and adding may be faster, and probably is if there's fairly few 1s in the binary representation. But if they have a good multiplier then mult may be faster than shift/add for a random unknown multiply.

      Also IIRC the p4 got rid of the barrel shifter on Intel. Or maybe it was the gen after that. THey may have re-added it though, it seems fairly stupid not to have one.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    6. Re:Premature optimization is evil... and stupid by tomtefar · · Score: 3, Informative

      I have the following sticker on top of my display: "Make it work before you make it fast!" Saved me many hours of work.

    7. Re:Premature optimization is evil... and stupid by Anonymous Coward · · Score: 2, Interesting

      I think that the premature optimization claims are way overdone. In the cases where performance does not matter, then sure, make the code as readable as possible and just accept the performance.

      However, sometimes it is known from the beginning of a project that performance is critical and that achieving that performance will be a challenge. In such cases, I think that it makes sense to design for performance. That rarely means using shifts to multiply -- it may, however, mean that you design your data structures so that you can pass the data directly into some FFT functions without packing/unpacking the data to some other format that the rest of the functions were written to expect. It may also mean that your design scale to many cores and that inner loops be heavily optimized and vectorized. Of course, all of that code should be performance tested during development against the simpler versions.

      Profiling after the fact sounds like a good idea, but what if the code has no real "hotspot"? What if you find out that you need to redesign the entire software framework to support zero-copy processing of the data? Also, profiling tools in general are really not that good. Running oprofile on a large-scale application with dozens of threads and data source dependencies on other processes can be less than enlightening. gprof is entirely useless for non-trivial applications. cachegrind is sometimes helpful, but most people working on performance optimization seem to simply build their own timers based on the rdtsc instruction and manually time sections of the code.

      I work on software for processing medical device data and performance is often critical. You probably want an image display to update very quickly when it is providing feedback to the doctor guiding a catheter toward your heart, for example. We had one project where the team decided to start over with a clean framework without concern for performance -- they would profile and optimize once everything was working. They followed the advice of many a software engineer: their framework was very nice, replete with design patterns and applications of generic programming, and entirely unscalable beyond a single processor core. There were no performance tests done during development, and of course the timeline was such that there would only be minimal time for optimization once the functionality was complete. The software that it was replacing was ugly, but also scaled nicely to many cores. The software shipped on a system with two quad-core processors, just as it had before.

      Let's just say that customers were unimpressed with the new software framework.

    8. Re:Premature optimization is evil... and stupid by TheRaven64 · · Score: 2, Informative

      I actually did a benchmark of this a few months ago. For a single shift, there wasn't much in it (on a Core 2); both were decoded into the same micro-ops. For more than one shift and add, the multiply was faster because the micro-op fusion engine wasn't clever enough to reassemble the multiply (and even if it were, you're still burning i-cache for no reason). GCC used to emit shift-and-add sequences for all constant multiplies until someone benchmarked it on an Athlon (which had two multiply units and one shift unit) and found that it was much faster to just emit a multiply.

      --
      I am TheRaven on Soylent News
    9. Re:Premature optimization is evil... and stupid by smash · · Score: 2, Interesting

      by not thinking about performance you can make it expensive or impossible to improve things later without a substantial rewrite.

      "Not thinking about performance" is different from writing in high level first.

      Get the algorithm right first, THEN optimise hot spots.

      Starting out with ASM makes it a lot more time consuming/difficult to get many different algorithms written, debugged and tested. The time you spend doing that is time better spent testing/developing a better algorithm. Only once you get the algorithm correct should you break out the assembler for the hotspots WITHIN that algorithm.

      If you're writing such shitty code that its "impossible to optimize later" then I don't think starting out in ASM will help you. You'll just have slightly faster shitty code.

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    10. Re:Premature optimization is evil... and stupid by kc8apf · · Score: 5, Insightful

      Having spent 4 years being one of the primary developers of Apple's main performance analysis tools (CHUD, not Instruments) and having helped developers from nearly every field imaginable tune their applications for performance, I can honestly say that regardless of your performance criteria, you shouldn't be doing anything special for optimization when you first write a program. Some thought should be given to the architecture and overall data flow of the program and how that design might have some high-level performance limits, but certainly no code should be written using explicit vector operations and all loops should be written for clarity. Scalability by partitioning the work is one of those items that can generally be incorporated into the program's architecture if the program lends itself to it, but most other performance-related changes depend on specific usage cases. Trying to guess those while writing the application logic relies solely on intuition which is usually wrong.

      After you've written and debugged the application, profiling and tracing is the prime way for finding _where_ to do optimization. Your experiences have been tainted by the poor quality of tools known by the larger OSS community, but many good tools are free (as in beer) for many OSes (Shark for OS X as an example) while others cost a bit (VTune for Linux or Windows). Even large, complex multi-threaded programs can be profiled and tuned with decent profilers. I know for a fact that Shark is used to tune large applications such as Photoshop, Final Cut Pro, Mathematica, and basically every application, daemon, and framework included in OS X.

      What do you do if there really isn't much of a hotspot? Quake 3 was an example where the time was spread out over many C++ methods so no one hotspot really showed up. Using features available in the better profiling tools, the collected samples could be attributed up the stack to the actual algorithms instead of things like simple accessors. Once you do that, the problems become much more obvious.

      What do you do after the application has been written and a major performance problem is found that would require an architectural change? Well, you change the architecture. The reason for not doing it during the initial design is that predicting performance issues is near impossible even for those of us who have spent years doing it as a full time job. Sure, you have to throw away some code or revisit the design to fix the performance issues, but that's a normal part of software design. You try an approach, find out why it won't work, and use that knowledge to come up with a new approach.

      That largest failing I see from my experiences have been the lack of understanding by management and engineers that performance is a very iterative part of software design and that it happens late in the game. Frequently, schedules get set without consideration for the amount of time required to do performance analysis, let alone optimization. Then you have all the engineers who either try to optimize everything they encounter and end up wasting lots of time, or they do the initial implementation and never do any profiling.

      Ultimately, if you try to build performance into a design very early, you end up with a big, messy, unmaintainable code base that isn't actually all that fast. If you build the design cleanly and then optimize the sections that actually need it, you have a most maintainable code base that meets the requirements. Be the latter.

      --
      kc8apf
    11. Re:Premature optimization is evil... and stupid by Anonymous Coward · · Score: 2, Insightful

      I totally agree about the evils of premature opimisation, however I also think correct choice of algorithm and data structures is vital, and does not necessarily need to be left to the end. Often the correct choice of algorithm and data structure not only results in faster code, but more readable and maintainable code too. Of these the most often I see over looked is data-structure. For me learning functional programming really helped with this.

    12. Re:Premature optimization is evil... and stupid by epine · · Score: 3, Insightful

      That's the main reason why I want to shoot people who write "clever" code on the first pass.

      Over the years, I've grown to hate this meme. Not because it isn't right, but because it stops ten floors below the penthouse of human potential.

      First of all, it's an incredible instance of cultural drift. In the mid 1980s, when this meme was halfway current, I worked on adding support for Asian characters to an Asian-made PC. On the "make it right" pass it took 15s to update the screen after pressing the page down key, and this from assembly language. Slower than YouTube over 300 baud. It was doing a lot of pixel swizzling it shouldn't have been, because the fonts were supplied in a format better suited to printing. This was an order of magnitude below an invitation to whiffle-ball training camp. This was Lance Armstrong during his chemotherapy years towing a baby trailer. Today you get 60fps with a 100 thousand or a 100 million polygons, I've sort of lost track.

      Let's not shunt performance onto the side track of irrelevancy. While there's no good excuse, ever, for writing faulty code, an enlightened balance between starting out with an approach you can live with, and exploiting necessary cleverness *within your ability* goes a long way.

      How about we update Knuth's arthritic maxim? Don't tweak what you don't grok. If you grok, use your judgement. Exploit your human potential. Live a little.

      The books I've been reading lately about the evolution of skills in the work place suggest that painstaking reductive work processes are on their way to India. Job security in home world is greatly enhanced if you can navigate multiple agendas in tandem, exploiting more of that judgement thing.

      One of the reasons Carmack became so successful is that he didn't waste his effort looking for excuses to deprive his co-workers of their oxygen bits. Instead he conducted shrewd excursions along the edge of the envelope in pursuit of the sweet spot between cleverness too oppressive to live with, and no performance at all.

      In my day of deprecating my elders, I always knew where the pea was hidden under the mattress. These days, there are so many squishy mattresses stacked one upon the other, I have to plan my work day with a step ladder. Which I think is what this unwatchable cult-encoded video is on about: the ankle level view most of us never see any more.

      Here's another thing. I've you're going to be clever about how you code something, also be clever about how you do it. In other words, be equally clever all levels of the solution process simultaneously: algorithm selection, implementation, commenting, software engineering, documentation, and unit test. Knuth got away with TeX, barely, for precisely this reason. Because of his cleverness, the extension to handle Asian languages was far from elegant. Because of his cleverness (in making everything else run extremely well), people actually wanted to extend TeX to handle Asian languages. So who's to say he was wrong? Despite his cleverness, he managed to keep his booboo score in single or low double digits. His bug tracking database fit nicely on an index card.

      In the modern era, people quote the old "make it right before you make it faster" as the cure for the halitosis of ineptitude: you're feeble and irritating, so practice your social graces. Don't make me come over there and choke off your oxygen bit. It's a long ways from saying "you have a lot of human potential, and not much experience, so let me help you confront the challenges in a meaningful way". These sayings leak a lot of sentiment about social engagement.

      Every so often I have to pull up a chair beside a junior resource and go "Dude, you're jousting at windmills here, let's roll that change back and try again. I know you can do better." Five minutes of war stories about how to shoot yourself in the foot six ways from Sunday is usually enough to rebalance the flywheel of self preservation.

    13. Re:Premature optimization is evil... and stupid by Rockoon · · Score: 4, Informative

      GCC is a big offender, thats true.

      This is one of the reasons that GCC sucks compared to ICC and VC++.

      Let me give you the facts as they are today. In isolation, both the shift instructions and the multiply instructions have the same latency and throughput, and are also performed on the same execution units.

      If this was the entire story, then they would be equal. Buts its not the entire story.

      The shift instructions only modify some of the flags in the flags register. Essentially, the shift instructions must do a read/modify/write on the flags. The multiplication instructions, however, alter the entire flags register, so only perform a write.

      "But Rockoon.. they are the same latency anyways, right?" .. yes, in isolation. But that read/modify/write cycle on the flags register prevents a hell of a lot of out-of-order execution.

      Essentially, one of the inputs to the shift instruction is the flags register so all prior operations that modify the flags register must be completed first, and no instruction following the shift that also partially modify the flags register can be completed until that shift is completed.

      In some code, it wont make any discernible difference, but in other code it will make a big difference.

      As far as that GCC compiler output.. thats code is horrible, and not just because its AT&T syntax.

      There are two alternatives here for multiplying by 4 that should be in competition here, and neither uses a shift.

      One is a straight multiplication (MASM syntax, CDECL):

      main:
      mov edx, [esp + 4] ; 32-bit version, so +4 skips the return address
      imul eax, edx, 4
      ret

      The other is leveraging the LEA instruction (MASM syntax, CDECL):

      main:
      mov eax, [esp + 4] ; 32-bit version, so +4 skips the return address
      lea eax, [eax * 4]
      ret

      The alternative LEA version on some processors (P4..), in isolation, is slower .. but it has the advantage that it uses different execution units on those very same processors, so might pair better with other stuff in the pipeline, and it doesnt touch the flags register at all.

      GCC is great at folding constants and such, even calculates constant loops at compile time.. but its big-time-fail at code generation. GCC is one of the processors that one optimization expert struggled with because he was trying to turn a series of shifts and adds into a single far more efficient multiplication.. the compiler converted it back into a series of shifts and adds on him. Fucking fail.

      --
      "His name was James Damore."
    14. Re:Premature optimization is evil... and stupid by chelberg · · Score: 2, Informative

      Two books to look at Cormen et. al Intro to Algorithms, and Bentley's Programming Pearls. The second is more practical, the former is used in many CS Algorithms courses.

  7. Re:Code in high-level by __aaclcg7560 · · Score: 3, Interesting

    I wanted to take ASM in college. I was the only student who showed up for the class and the class was canceled. Since most of the programming classes was Java-centric, no one wanted to get their hands dirty under the hood.

  8. Re:Code in high-level by Just+Some+Guy · · Score: 2, Insightful

    Someone has to write those tools.

    Yeah, but they can be written in a HLL, too. You don't have to write a program in highly-tuned assembler to make it emit highly-tuned assembler.

    --
    Dewey, what part of this looks like authorities should be involved?
  9. Re:Code in high-level by Com2Kid · · Score: 2, Interesting

    Also, the compiler doesn't always take advantage of instructions that it could use.

    Yah sorry about that. :)

    Part of the problem is that compilers have to support a variety of instruction sets, and if the majority of the customers are using an 8 year old revision of an instruction set, even if the newest revision offers Super Awesome Cool features that make code run a lot faster, well you end up with a chicken and egg problem where it makes sense for the compiler team to focus on the old architecture since that is what everyone is using, and no one wants to move to the new architecture since the compiler doesn't take full advantage of it.

  10. Re:Code in high-level by Chris+Burke · · Score: 3, Interesting

    That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.

    Yeah and the chip makers release software optimization guides regarding how to avoid such stalls or take advantage of other features, and it's really hard to do that at the C level, and it can be hard for the compiler to know that a certain situation calls for one of these optimizations.

    However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.

    Agreed, it's basically something you're going to do for the most performance critical part, like the kernel of an HPC algorithm for example.

    --

    The enemies of Democracy are
  11. Re:Code in high-level by Sycraft-fu · · Score: 4, Informative

    Also either start with the assembly the compiler generates, or at the very least make sure to bench your own against what it makes. The Intel Compiler in particular is extremely good at what it does. As such, it is worth your while to see what its solution to your problem is, and then see if you can improve, rather than assuming you are smarter and can do everything better on your own.

    Of course all that is predicated on using a profiler first to find out where the actual problem is. Abrash accurately pointed out years ago that programmers suck at that. They'll spend hours making a nice optimized function that ends up making no noticeable difference in execution time.

  12. Re:Code in high-level by marcansoft · · Score: 3, Informative

    Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun. x86-64 manages to increase the "funness" value somewhat, but I still wouldn't quite qualify it as "fun".

    On the other hand, it's very true that knowing some ASM can help you write code that the compiler will translate into better assembly code, without going through all of the trouble yourself.

  13. Re:Code in high-level by dave562 · · Score: 2, Interesting

    I think it depends on what kind of code you're trying to write. If a person desires to write applications then you are right, they might as well write it in a high level language and let the compiler do the work. On the other hand if the person is interested in vulnerability research or security work, then learning ASM might as well be considered a requisite. An understanding of low level programming and code execution provides a programmer with a solid foundation. It gives the potential insights into what might be going wrong when their code isn't compiling or executing the way they want it to. It also gives them the tools to make their code better, as opposed to simply shrugging and saying, "I sure hope they fix this damn compiler..."

  14. Re:Could someone give me a crash course by Icegryphon · · Score: 5, Funny

    Spaghetti code can be hard to digest.

    Sounds to me like someone is using stale Copypasta.

  15. Re:Code in high-level by KC1P · · Score: 2, Interesting

    That's a real shame! But my impression is that for a long time now, college-level assembly instruction has consisted almost entirely of indoctrinating the students to believe that assembly language programming is difficult and unpleasant and must be avoided at all costs. Which couldn't be more wrong -- it's AWESOME!

    Even on the x86 with all its flaws, being able to have that kind of control makes everything more fun. The fact that your code runs like a bat out of hell (unless you're a BAD assembly programmer, which a lot of people are but they don't realize it so they bad-mouth the language) is just icing on the cake. You should definitely teach yourself assembly, if you can find the time.

  16. Re:Code in high-level by dr2chase · · Score: 3, Interesting

    Dealing with alignment is not that much of an assembler issue, if you are using C. Address arithmetic gets the job done. If you even want your globals aligned (and not just heap-allocated stuff) you *might* need some ASM, but just for the declarations of stuff that would be "extern struct whatever stuff" in C (and in a pinch, you write a bit of C code to suck in the headers defining "stuff", figure out the sizes, and emit the appropriate declarations in asm).

    Writing memmove/memcpy in assembler is a mixed bag. If you write it in C, you can preserve a some tiny fraction of your sanity dealing with all the different alignment combinations before you get to full-word loads and stores. HOWEVER, on the x86, all bets are off, the only way to tell for sure what is fastest, is to write it, and benchmark it.

  17. Re:Code in high-level by RzUpAnmsCwrds · · Score: 2, Informative

    It also depends on the compiler. GCC, for example, sucks at auto-vectorization, so it's easy to get 30% or more on loopy scientific code just by using SSE instructions properly.

    In contrast, PGI or ICC is much harder to beat using assembly.

  18. Re:Code in high-level by caerwyn · · Score: 2, Insightful

    That's *generally* true. It's not *always* true.

    There are a lot of purely compute-bound applications (think simulations of various sorts, etc) for which the algorithmic optimizations have already been done- but it's still worth going for the last few percent of performance from "instruction fiddling". As another poster said: if your app runs for weeks at a time, 1% improvement becomes significant in terms of time saved- and throwing more hardware at the problem isn't always feasible.

    --
    The ringing of the division bell has begun... -PF
  19. Re:Code in high-level by smash · · Score: 2, Insightful
    Not quite.

    But, its certainly better to code in a high level language first, test, tweak the algorithm as much as you can, PROFILE and THEN start breaking out your assembler. No point optimising 99% of your code in super fast asm if it only spends 1% of the cpu time in it. Even if you make all that code 10x as fast, you've only saved 0.9% cpu time. :)

    --
    I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
  20. It's not just x86 by RzUpAnmsCwrds · · Score: 3, Informative

    Features like out of order execution, caches, and branch prediction/speculation are commonplace on many architectures, including the next generation ARM Cortex A9 and many POWER, SPARC, and other RISC architectures. Even in-order designs like Atom, Coretex A8, or POWER6 have branch prediction and multi-level caches.

    The most important thing for performance is to understand the memory hierarchy. Out-of-order execution lets you get away with a lot of stupid things, since many of the pipeline stalls you would otherwise create can be re-ordered around. In contrast, the memory subsystem can do relatively little for you if your working set is too large and you don't access memory in an efficient pattern.

  21. Re:Code in high-level by Anonymous Coward · · Score: 2, Informative

    Or you could get NASM, which is open source :)

  22. It's just outdated knowledge by Sycraft-fu · · Score: 2, Informative

    People learn a trick way back when, or hear about the trick years later, and assume it is still valid. Not the case. Architectures change a lot and what used to be the best way might not be anymore.

    Michael Abrash, one of the all time greats of optimization, talks about this in relation to some of the old tricks he used to use. One was to use XOR to clear a register on x86. XORing a register with itself gives 0, of course, and turned out to be faster than writing an immediate value of zero in to the register. Reason is that loading a value was slower than the XOR op, and the old CPUs had no special clear logic, zero was just another number.

    Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

    However, you'll still hear people say it is a great trick because they haven't updated their knowledge.

    1. Re:It's just outdated knowledge by Cassini2 · · Score: 3, Informative

      I'm definitely no expert on x86, but my impression was that precisely because of this trick that everyone does, modern CPUs still do xor reg,reg at least as fast as moving 0.

      You are correct. XOR reg,reg was such a common instruction on the x86, that essentially it became the special case CLR instruction. Essentially, if you see a CLR instruction on an x86 assembly printout, it is the XOR instruction in disguise. The x86 has no CLR instruction.

      Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

      Essentially, all current "simple" CPU instructions execute with the same speed. However, the XOR instruction is still faster than the MOV instruction because of instruction bandwidth and cache effects. Most code today is limited by cache and bandwidth limits, like the need to load instructions into the instruction decode pipeline immediately after a jump instruction. The MOV reg, 0 immediate move instruction is a two-byte instruction, and the XOR reg, reg instruction is a one-byte instruction. As such, in real code, the XOR instruction is usually slightly faster, because it results in smaller code.

      Additionally, all of the modern x86 CPU implementations special case the XOR reg,reg instruction into a MOV reg, 0 immediate move instruction inside the instruction decode stage anyway. As such, no significant functional difference exists. The only case where a move instruction is quicker is when the condition codes are propagating a side-effect via the condition code registers. Thus, in theory:
      ADD AL, AH
      MOV CL, 0
      JC somewhere

      should execute quicker with a MOV instruction as opposed to a XOR instruction. However, in practice, this piece of code:
      XOR CL, 0
      ADD AL, AH
      JC somewhere

      executes with exactly the same speed, because the out-of-order execution units inside the x86 automatically optimize the code and make it equivalent. As such, you are best with the "short small" code, which means that the XOR reg, reg instruction is still the fastest way to do a register clear.

    2. Re:It's just outdated knowledge by SpinyNorman · · Score: 2, Informative

      Actually the reason us old fogies normally used XOR A, A rather than LD A, 0 wasn't because it was faster but rather because it was smaller - 1 byte rather than two bytes (instruction + immediate operand). On the old memory constrained 8-bitters, these assembly "tricks" were all about saving a byte here, another byte there...

    3. Re:It's just outdated knowledge by BZ · · Score: 3, Interesting

      The smaller instructions are still worth it, not so much because of main RAM size constraints but because of cache size constraints. Staying in L1 is great if you can swing it; falling out of L2 blows your performance out of the water.

      Most recently, just iterating over an array and doing a simple op on each entry became about 2x faster on my machine by going from an array of ints to an array of unsigned chars (all the entries are guaranteed in unsigned char range). Reason was, the array of ints was just about the total size of my L2... and the new array is 1/4 the size, which means there's space for other things too (like the code).

  23. rule of the code by Bork · · Score: 3, Informative

    Just write good clean code that works properly first. The only time you optimize is after it has been profiled to see if there are troublesome spots. The way CPUs run and how compilers are designed, there is very little need to do optimization. Unless you have taken some serious courses of how the current CPU’s work, you efforts will mostly result in bad code that gains you nothing in respect in speed. Your time is better spent on writing CORRECT code.

    The compilers are very intelligent in proper loop unrolling, rearranging branches, and moving instruction code around to keep the CPU pipeline full. They will also look for unnecessary/redundant instruction within a loop and move them to a better spot.

    One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times, I got the best time by playing with the compiler flags.

    1. Re:rule of the code by XMunkki · · Score: 2, Informative

      I agree that many low-level programming methods aren't that necessary anyhow, but there is one big point where the compiler cannot help much, and that is data layout. Big hits come from all levels of cache misses, and it's good for the programmer to be aware of this and benchmark the memory access patterns and try to make them good (predictable, linear, clumping frequently used data, etc). Also on some hardwares, the Load-Hit-Stores are something to be aware as well. A reasonable thing to do, when optimizing something, is to fiddle with the code a bit and see what generates the best assembly. This usually is a good compromise (you still stay at a higher level and got portable code with some gained performance on at least one platform). Now still compilers aren't a magic wand everywhere, especially when going to deeply embededded or specialized hardware. One example is SPU programming. Since SPUs read&write everything from/to 16-aligned addresses, current GCC compiler lots of "align ptr, load, rotate, calculate, rotate, combine, store" sequences. If you want good SPU performance, going into ASM is indeed viable something. Though most of the times, staying at intrinsic functions gives you an adequate compromise. But since SPUs are basically fast DSPs, many of the tasks that are ran by them are in nature quite repetitive with short amount of work per item and millions of items (like doing vertex transforms, simulating some post processing effect, mixing audio etc). But a good programmer always benchmarks first, checks the compiler output etc before hitting the deck with raw assembly.

    2. Re:rule of the code by SorcererX · · Score: 2, Informative

      You got the fastest time simply by playing with the compiler flags? We had a similar problem where we had to do a matrix multiplication on symmetric matrices for C = AB^T+BA^T (rank2k update with alpha=1.0, beta=0.0) and there was nothing the compiler could do for us to get even remotely near good scores. Doing the simplest implementation we got about 5 FLOPS/cycle on an 8 core system, optimizing just with SSE etc, I got it up to about 13 FLOPS/cycle, and by splitting up the matrix in tiny parts to avoid cache trashing etc I was able to get it up to 47 FLOPS/cycle. For comparison Intel's MKL library managed about 85 FLOPS/cycle on the same hardware. I believe the best in my class was about 50 FLOPS/cycle, and it took an insane amount of fiddling for any of us to get above 25-30 FLOPS/cycle or so. That said, most things done on a computer is rarely that limited by memory access, and then the compiler does an awesome job :)

      --
      Any sufficiently advanced technology is indistinguishable from magic.
    3. Re:rule of the code by wirelessbuzzers · · Score: 2, Informative

      One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times, I got the best time by playing with the compiler flags.

      Really? Because I had a similar assignment (make Strassen's algorithm as fast as possible, in the 5-10k range) in my algorithms class a while back. I found that the key to a blazing fast program was careful memory layout: divide the matrix into tiles that fit into L1, transpose the matrix to avoid striding problems. Vectorizing the inner loops got another large factor. Compiling with -msse3 -march=native -O3 helped, but the other two were critical and took a fair amount of effort.

      --
      I hereby place the above post in the public domain.
  24. Re:Code in high-level by TheRaven64 · · Score: 4, Informative

    One of the biggest drawbacks of a language like C (and even more C++, and even more Java), is that they don't give you a whole lot of control of how stuff is arranged in memory

    I'd say this is more of a C/C++ problem than a Java problem. Or, rather, they are different problems. The problem with C and C++ is that they do give the programmer a whole lot of control about how things are arranged in memory. They don't, on the other hand, give the compiler a lot of freedom to rearrange things.

    Java, on the other hand, uses the Smalltalk memory model and so the compiler (and/or JVM) is free to rearrange things in memory as much as it wants to (whether it does, of course, is a matter for the compiler writer). For example, a Java compiler that notices that you are doing the same operation on three instance variables is free to put them next to each other aligned on a 128-bit boundary with some padding at the end so that you can easily use vector instructions on them, even if they were originally declared in different classes. A C compiler can not do this with structure fields.

    If you really care about alignment in C, you are free to use valloc() to align on a page boundary and then subdivide the memory yourself. Most of the time, however, it's not worth the effort.

    --
    I am TheRaven on Soylent News
  25. Re:Code in high-level by TheRaven64 · · Score: 3, Interesting

    Note that even with GCC, the choices aren't just autovectorisation and assembly. GCC provides (portable) vector types, and if you declare your variables as these then it just has to try to use SSE / AltiVec / Whatever instructions for the operations, and it can easily because your variables are aligned. Primitive operations (i.e. the ones you get on scalars in C) are defined on vectors and so you can do 2^n of them in parallel and GCC will emit the relevant instructions depending on your target CPU. Going a step further, there are intrinsic functions that are specific to a particular vector ISA and can be used with these. Then you get to tell GCC exactly which instruction to use, but it still does all of the register allocation for you.

    --
    I am TheRaven on Soylent News
  26. Re:Code in high-level by TheRaven64 · · Score: 3, Informative

    The calling convention is complicated, but it's nowhere near as different as IA32 calling conventions between platforms. Linux and FreeBSD, for example, use different rules for when to return a structure on the stack and when to return it in registers on IA32, but they use exactly the same conventions (the SysV ABI) on x86-64.

    --
    I am TheRaven on Soylent News
  27. Re:Code in high-level by SETIGuy · · Score: 3, Interesting

    Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun.

    Coding assembly on RISC architectures is dead boring because all the instructions do what you expect them to and can be used on any general purpose register.

    In the good old days, when x86 was 8086 there were no general purpose registers. The BX register could be used for indexing, but AX, CX and DX couldn't. CX could be used for counts (bit shifts, loops, string moves), but AX, BX, and DX couldn't. SI and DI were index registers that you could add to BX when dereferncing or could be used with CX for string moves. AX and DX could be used in a pair for a 32 bit value. If you wanted to multiply, you needed to use AX. If you wanted to divide, you needed to divide DX:AX by a 16 bit value and your result would end up in AX and the remainder in DX. Compared to the Z80 assembly language, we thought this was easy.

    Being able to use %r2 for the same stuff you use %r1 for is just boring.

  28. In C/C++ shift is not the same as multiply/divide by perpenso · · Score: 2, Interesting

    Using shift to multiply is often a great idea on most CPUs.

    In C/C++ shift is not the same as multiply/divide by 2. Multiplication and division operators have a different precedence level than shift operators. Not only is there the possibility of poor optimization but such a substitution may lead to a computational error. For example mul/div has a higher precedence than add/sub, but shift has a lower precedence:

    printf(" 3 * 2 + 1 = %d\n", 3 * 2 + 1);
    printf(" 3 << 1 + 1 = %d\n", 3 << 1 + 1);
    printf("(3 << 1) + 1 = %d\n", (3 << 1) + 1);

    3 * 2 + 1 = 7
    3 << 1 + 1 = 12
    (3 << 1) + 1 = 7

    --
    Perpenso Calc for iPhone and iPod touch, scientific and bill/tip calculator, fractions, complex numbers, RPN

  29. Re:Could someone give me a crash course by Eudial · · Score: 2, Informative

    I hear they have nice 0xC0FFEE

    --
    GAAH! MY PRINTER IS ON FIRE!!! PUT IT OUT! PUT IT OUT!
  30. Re:Code in high-level by The_Wilschon · · Score: 2, Insightful

    It all depends on your problem domain. As a high energy physicist, I write plenty of code that me, a postdoc, and maybe a couple other grad students will ever see, and probably I'm the only one that will actually ever use it. I'm designing a small cluster that will get built here in a month or few, and some of my code will take up about 2 months of solid run time on it, then never see the light of day again. If I can spend 2 days getting a 5% performance improvement, even at the expense of locking the code to this cluster, it's a net win for us.

    In short, I have no "customers", I know exactly what hardware my code will be running on, and it won't ever change (until they ditch the cluster in 4-5 years and make a new one, but I'll be long gone), and I don't even have to worry about maintaining the code years in the future.

    All the same, I'll probably still write the code as cleanly as possible and run it through an optimizer, and leave it at that.

    --
    SIGSEGV caught, terminating

    wait... not that kind of sig.
  31. ...except for the uControllers I use. by podom · · Score: 3, Interesting

    I watched about half of his presentation. I was amused because on a lot of the slides he says something like "except on really low end embedded CPUs." I spend a lot of my time programming (frequently in assembly) for these exact very low end CPUs. I haven't had to do much with 8-bit cores, fortunately, but I've been doing a lot of programming on a 16-bit microcontroller lately (EMC eSL).

    I suspect the way I'm programming these chips is a lot like how you would have programmed a desktop CPU in about 1980, except that I get to run all the tools on a computer with a clock speed 100x the chip I'm programming (and at least 1000x the performance). I am constantly amazed by how little we pay for these devices: ~10 Mips, 32k RAM, 128k Program memory, 1MB data memory and they're $1.

    But they do have a 3-stage pipeline, so I guess some of what Dr. Cliff says still applies.

    --
    We're wanted men. I have the death sentence in 12 systems!
  32. Re:Code in high-level by cheekyboy · · Score: 2, Informative

    intel compilers have options to optimize to more than one target, and its runtime engine uses code that was made for X cpu. Sure your binary is larger, but everyone is happy.

    --
    Liberty freedom are no1, not dicks in suits.
  33. Re:Code in high-level by TheRaven64 · · Score: 4, Informative
    The GCC manual tells you everything you need to know. First you declare a vector type, so if you want four shorts representing an RGBA colour value , you declare a type like this:

    typedef short colour_t __attribute__ ((vector_size (4 * sizeof(short))));

    This will give you a 64-bit vector type, so you can fit one in an MMX register, or two in an SSE or AltiVec register. You can then create these and do simple operations on them. For example, if you wanted to add two together, you could do this:

    colour_t a = {1,2,3,4};
    colour_t b = {1,2,3,4};
    colour_t c = a + b;

    In this case, the add is constant so it will be evaluated at compile time, but in the case where a and b have unknown values GCC will emit either four scalar add operations or one 64-bit vector add.

    You can also pass them as arguments to vector intrinsics, which are listed in the manual under target-specific builtins. These correspond directly to a single underlying vector instruction, so if you look in the assembly language reference for the target CPU then you will find a detailed explanation of what each one does.

    Rather than declare vector types directly, it's often a good idea to declare unions of vector and array types. This lets you use the same value as both an array and a vector.

    I wrote a longer explanation a while ago.

    --
    I am TheRaven on Soylent News
  34. Re:Code in high-level by afidel · · Score: 2, Insightful

    In general modern compilers are good enough that you are much more likely to get better performance by spending the time finding a better algorithm then you are hand optimizing the code. Obviously for things like H.264 where the algorithm is already set this is not true, but that's a very small fraction of the code out there.

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  35. What about prefetching? by Mr+Z · · Score: 2, Interesting

    That was a fabulous presentation, and one that I'll likely hold onto a copy of, since it describes the issue of SMP memory ordering with a great example. I'll have to write "presenter notes" for those slides, since I can't get the video to come up, but that's OK. I understand what's going on there.

    One thing I thought was notably absent was any discussion of data prefetch. With all of the emphasis on how performance is dominated by cache misses, you'd think he'd give at least a nod to both automatic hardware and compiler directed software prefetch. After all, he mentions CMT, which is a more exotic way to hide memory latency, IMHO.

    On a different note: In the example on slides 23 - 30, he shows an example where speculation allowed two cache misses to pipeline, bringing the cost-per-miss down to about half. Dunno if he highlighted the synergy here in the talk, because it wasn't highlighted in the presentation. It is useful to note, though, how overlapping cache misses reduces their cost. There can be even more synergy here than is otherwise obvious: In HPCA-14, there was a fascinating paper (slides) about how incorrect speculation can still speed up programs due to misses on the incorrectly-speculated path still bringing in relevant cache lines.