Slashdot Mirror


Cliff Click's Crash Course In Modern Hardware

Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code due to all of the crazy kung-fu and ninjitsu it does to your code while it's running. This talk is an excellent drill-down into the internals of the x86 chip, and it's a great way to get an understanding of what really goes on down at the hardware and why certain types of applications run so much faster than other types of applications. Dr. Cliff really knows his stuff!"

249 comments

  1. Fast forward... by LostCluster · · Score: 5, Informative

    I can't say I've WTFV like I usually RTFA before you get to see it... but I can tell you this: The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.

    1. Re:Fast forward... by Jah-Wren+Ryel · · Score: 5, Funny

      The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.

      That's just the branch predictor pre-loading the cache for each possible conditional result.

      --
      When information is power, privacy is freedom.
    2. Re:Fast forward... by OverlordQ · · Score: 0, Flamebait

      You mean you actually got it to play instead of stare at a play button? Can we please kill Flash already.

      --
      Your hair look like poop, Bob! - Wanker.
    3. Re:Fast forward... by Mitchell314 · · Score: 1

      Huh, I got to play it nice for a while. Then it kept on stopping. Now it won't play at all. Good show, up as far as I could see (~15 minutes).

      And Flash really really needs to die for the greater good. And for us Linux users.

      --
      I read TFA and all I got was this lousy cookie
    4. Re:Fast forward... by Gazzonyx · · Score: 1

      You're lucky that you didn't get it to play; mine played to six minutes and then just stopped and won't play or let me skip past that.

      --

      If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.

    5. Re:Fast forward... by Brian+Gordon · · Score: 5, Informative

      A little javascript-fu reveals that the video player points to a file (at http://flv.thruhere.net/presentations/09-sep-JVMperformance.flv) on some poor guy's machine through a dynamic DNS service! I hope somebody grabbed a copy before he (or slashdot) took his server down.

    6. Re:Fast forward... by Ginger+Unicorn · · Score: 1

      That is proper hard-fucking-core geek wit. Bravo.

      --
      (1.21 gigawatts) / (88 miles per hour) = 30 757 874 newtons
    7. Re:Fast forward... by Anonymous Coward · · Score: 0

      jsclassref. Base64. That's lame.

    8. Re:Fast forward... by Anonymous Coward · · Score: 0

      You really need to watch it on the web site however, because the slides are not in the video.

    9. Re:Fast forward... by Anonymous Coward · · Score: 0

      It keeps stalling for me around the 17 minute mark :(

    10. Re:Fast forward... by Anonymous Coward · · Score: 0

      Er, can you point us to "the website" you speak of?

    11. Re:Fast forward... by pyrrhonist · · Score: 3, Informative

      some poor guy's machine through a dynamic DNS service!

      Some poor guy? It's on an Amazon EC2 server!

      $ host flv.thruhere.net
      flv.thruhere.net has address 67.202.36.223
      $ host 67.202.36.223
      223.36.202.67.in-addr.arpa domain name pointer ec2-67-202-36-223.compute-1.amazonaws.com.

      --
      Show me on the doll where his noodly appendage touched you.
    12. Re:Fast forward... by Brian+Gordon · · Score: 5, Informative

      You've done it! Interested slashdotters can download the video file at this link:

      http://67.202.36.223/presentations/09-sep-JVMperformance.flv.

      Good detective work, partner!

    13. Re:Fast forward... by Brian+Gordon · · Score: 1

      Oh, and to advance the slides on the page, use this ugly hack:

      so = new SWFObject(getPathForSlide(0), "slides", "100%", "100%", "7");
      so.write("slideArea");

      Do this in the javascript console of your browser or with the "javascript:" pseudo-protocol in the address bar. Change the 0 to 1 or 2 or 3 or whatever slide you want.

    14. Re:Fast forward... by Anonymous Coward · · Score: 0

      Informative my ass, how is it any different than the link the GGP has posted (the one in +5 informative post above)?

    15. Re:Fast forward... by iammani · · Score: 3, Informative

      Mirror available at http://www.mediafire.com/?j21t2ynnnzn

      And please stop hitting the server linked in GP's post. The poor server hardly sustains 30 KB/s.

    16. Re:Fast forward... by iammani · · Score: 3, Informative

      Or type http://www.infoq.com/resource/presentations/click-crash-course-modern-hardware/en/slides/1.swf

      And change the number at the end to change slides

      PS: I am no good in javascript, but the above, in FF 3.5/Linux, just displayed a page saying "true"

    17. Re:Fast forward... by pyrrhonist · · Score: 1

      The slides can also be found here on Azul Systems' website (i.e. the company Cliff works at).

      --
      Show me on the doll where his noodly appendage touched you.
    18. Re:Fast forward... by Anonymous Coward · · Score: 0

      Mediafire sucks huge donkey dick.

    19. Re:Fast forward... by Anonymous Coward · · Score: 0

      ERROR
      The requested URL could not be retrieved

      While trying to retrieve the URL: http://67.202.36.223/presentations/09-sep-JVMperformance.flv

      The following error was encountered:

              * Connection to 67.202.36.223 Failed

      The system returned:

              (146) Connection refused

      The remote host or network may be down. Please try the request again.

      Great video, however I have seen it before.

    20. Re:Fast forward... by RMH101 · · Score: 1

      Loving your work there, dude

    21. Re:Fast forward... by drseuk · · Score: 1

      Or kill Flash Gordon to get rid of both Adobe and Intel simultaneously ...

    22. Re:Fast forward... by Anonymous Coward · · Score: 0

      Great, thanks! ...and whoever GP is, he might have thought about that beforehand :-)

    23. Re:Fast forward... by Anonymous Coward · · Score: 0

      Works only up to Nr. 8....

    24. Re:Fast forward... by Brian+Gordon · · Score: 1

      You have to make sure it returns void or the browser will display the return value. Wrap the javascript in the void() function like

      javascript:void(paula = "Brillant");

    25. Re:Fast forward... by Brian+Gordon · · Score: 1

      9 and 10 don't exist, just go on to 11

    26. Re:Fast forward... by Anonymous Coward · · Score: 0

      For a while after the story was posted, the domain name didn't resolve, so you had to type the IP address directly

    27. Re:Fast forward... by iammani · · Score: 1

      Ah thanks! that seems to work!! And here is a PDF link to the slides, if anybody is still following this story - http://www.azulsystems.com/events/javaone_2009/session/2009_J1_HardwareCrashCourse.pdf

    28. Re:Fast forward... by JAlexoi · · Score: 1

      That genius should have used AWS CDN

  2. Could someone give me a crash course by Anonymous Coward · · Score: 1, Funny

    on the website? I'm not sure what I'm looking at...

    1. Re:Could someone give me a crash course by Lunix+Nutcase · · Score: 5, Funny

      Probably due to your x86 processor doing all sorts of monkeying with the code.

    2. Re:Could someone give me a crash course by __aaclcg7560 · · Score: 2, Funny

      Spaghetti code can be hard to digest.

    3. Re:Could someone give me a crash course by Icegryphon · · Score: 5, Funny

      Spaghetti code can be hard to digest.

      Sounds to me like someone is using stale Copypasta.

    4. Re:Could someone give me a crash course by funwithBSD · · Score: 1

      They made the meatballs out of DEADBEEF.

      --
      Never answer an anonymous letter. - Yogi Berra
    5. Re:Could someone give me a crash course by TeknoHog · · Score: 1

      Incidentally, my most reliable Flash player is found on a Nokia N800, running Linux on ARM. Fortunately there are ways to download the video file in many cases.

      --
      Escher was the first MC and Giger invented the HR department.
    6. Re:Could someone give me a crash course by networkBoy · · Score: 1

      gotten at the 0xCAFE 0F DEAD BEEF

      --
      whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
    7. Re:Could someone give me a crash course by Eudial · · Score: 2, Informative

      I hear they have nice 0xC0FFEE

      --
      GAAH! MY PRINTER IS ON FIRE!!! PUT IT OUT! PUT IT OUT!
  3. Code in high-level by elh_inny · · Score: 1, Insightful

    It doesn't make sense to code in ASM anymore.
    With computing expanding towards more and more parallelism, I can clearly see that one should learn to start coding in the most abstract way and let the tools do the optimisation for him...

    1. Re:Code in high-level by caerwyn · · Score: 5, Insightful

      That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.

      However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.

      --
      The ringing of the division bell has begun... -PF
    2. Re:Code in high-level by Thiez · · Score: 2, Insightful

      Sometimes it's just plain FUN FUN FUN to code in asm. You're right that most programmers will never have a need for it at all (with some exceptions, such as those messing with operating systems or embedded systems), although knowing some ASM can help a lot with debugging. I suppose one could (read: should) learn a little ASM to have a better idea of what the hardware is doing, this will allow you to optimize your code a little, or (more importantly) write it in such a way that makes it easier for the compiler to optimize.

    3. Re:Code in high-level by __aaclcg7560 · · Score: 3, Interesting

      I wanted to take ASM in college. I was the only student who showed up for the class and the class was canceled. Since most of the programming classes were Java-centric, no one wanted to get their hands dirty under the hood.

    4. Re:Code in high-level by Anonymous Coward · · Score: 1, Insightful

      Someone has to write those tools.

    5. Re:Code in high-level by Just+Some+Guy · · Score: 2, Insightful

      Someone has to write those tools.

      Yeah, but they can be written in a HLL, too. You don't have to write a program in highly-tuned assembler to make it emit highly-tuned assembler.

      --
      Dewey, what part of this looks like authorities should be involved?
    6. Re:Code in high-level by Com2Kid · · Score: 2, Interesting

      Also, the compiler doesn't always take advantage of instructions that it could use.

      Yah sorry about that. :)

      Part of the problem is that compilers have to support a variety of instruction sets, and if the majority of the customers are using an 8 year old revision of an instruction set, even if the newest revision offers Super Awesome Cool features that make code run a lot faster, well you end up with a chicken and egg problem where it makes sense for the compiler team to focus on the old architecture since that is what everyone is using, and no one wants to move to the new architecture since the compiler doesn't take full advantage of it.

    7. Re:Code in high-level by Chris+Burke · · Score: 3, Interesting

      That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.

      Yeah and the chip makers release software optimization guides regarding how to avoid such stalls or take advantage of other features, and it's really hard to do that at the C level, and it can be hard for the compiler to know that a certain situation calls for one of these optimizations.

      However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.

      Agreed, it's basically something you're going to do for the most performance critical part, like the kernel of an HPC algorithm for example.

      --

      The enemies of Democracy are
    8. Re:Code in high-level by Sycraft-fu · · Score: 4, Informative

      Also either start with the assembly the compiler generates, or at the very least make sure to bench your own against what it makes. The Intel Compiler in particular is extremely good at what it does. As such, it is worth your while to see what its solution to your problem is, and then see if you can improve, rather than assuming you are smarter and can do everything better on your own.

      Of course all that is predicated on using a profiler first to find out where the actual problem is. Abrash accurately pointed out years ago that programmers suck at that. They'll spend hours making a nice optimized function that ends up making no noticeable difference in execution time.

    9. Re:Code in high-level by Anonymous Coward · · Score: 0

      > no one wanted to get their hands dirty under the hood.

      That's not at all how I remember college.

    10. Re:Code in high-level by marcansoft · · Score: 3, Informative

      Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun. x86-64 manages to increase the "funness" value somewhat, but I still wouldn't quite qualify it as "fun".

      On the other hand, it's very true that knowing some ASM can help you write code that the compiler will translate into better assembly code, without going through all of the trouble yourself.

    11. Re:Code in high-level by DarkOx · · Score: 1

      You certainly need to know a lot about assembler and CPU architecture if you are going to write code that emits highly tuned assembler. Actually you probably do have to write those tools in assembler for all intents and purposes. To really oversimplify: Compilers are pretty much syntax checkers and search tree engines. They take your code and replace it with a matching assembly listing or set of listings, substituting whichever registers happen to be free etc etc.

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    12. Re:Code in high-level by dave562 · · Score: 2, Interesting

      I think it depends on what kind of code you're trying to write. If a person desires to write applications then you are right, they might as well write it in a high level language and let the compiler do the work. On the other hand if the person is interested in vulnerability research or security work, then learning ASM might as well be considered a requisite. An understanding of low level programming and code execution provides a programmer with a solid foundation. It gives them potential insights into what might be going wrong when their code isn't compiling or executing the way they want it to. It also gives them the tools to make their code better, as opposed to simply shrugging and saying, "I sure hope they fix this damn compiler..."

    13. Re:Code in high-level by phantomfive · · Score: 1

      One of the biggest drawbacks of a language like C (and even more C++, and even more Java), is that they don't give you a whole lot of control of how stuff is arranged in memory. One of the biggest processor slowdowns, especially if you are dealing with a lot of data, is cache misses. If you can align your data in memory on cache-line boundaries, then you can make huge performance gains. Since C doesn't give you much control over this, if you really want to optimize it you have to go to assembly.

      Also, some glibc function calls (like memmove or memcpy, I believe) have been optimized in assembly, which is kind of nice. As always, use a profiler to make sure you're actually speeding things up.

      --
      Qxe4
    14. Re:Code in high-level by KC1P · · Score: 2, Interesting

      That's a real shame! But my impression is that for a long time now, college-level assembly instruction has consisted almost entirely of indoctrinating the students to believe that assembly language programming is difficult and unpleasant and must be avoided at all costs. Which couldn't be more wrong -- it's AWESOME!

      Even on the x86 with all its flaws, being able to have that kind of control makes everything more fun. The fact that your code runs like a bat out of hell (unless you're a BAD assembly programmer, which a lot of people are but they don't realize it so they bad-mouth the language) is just icing on the cake. You should definitely teach yourself assembly, if you can find the time.

    15. Re:Code in high-level by Anonymous Coward · · Score: 1, Insightful

      There is an old saying that performance improvement comes from better algorithms and not instruction fiddling. Simply put if your performance is not adequate using ordinary compiler code then you have serious issues with your software or hardware design.
      Note that code fiddling couples the software closely to the specific CPU which is not a good idea unless you can control both indefinitely.

    16. Re:Code in high-level by dr2chase · · Score: 3, Interesting

      Dealing with alignment is not that much of an assembler issue, if you are using C. Address arithmetic gets the job done. If you even want your globals aligned (and not just heap-allocated stuff) you *might* need some ASM, but just for the declarations of stuff that would be "extern struct whatever stuff" in C (and in a pinch, you write a bit of C code to suck in the headers defining "stuff", figure out the sizes, and emit the appropriate declarations in asm).

      Writing memmove/memcpy in assembler is a mixed bag. If you write it in C, you can preserve some tiny fraction of your sanity dealing with all the different alignment combinations before you get to full-word loads and stores. HOWEVER, on the x86, all bets are off, the only way to tell for sure what is fastest, is to write it, and benchmark it.
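
The address arithmetic mentioned above for alignment can be sketched in plain C. This is a minimal illustration, not anyone's production allocator: the 64-byte cache-line size and the helper names align_up and alloc_line_aligned are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed cache-line size; 64 bytes is common on x86 but not guaranteed. */
#define CACHE_LINE 64u

/* Round p up to the next multiple of align (align must be a power of two). */
void *align_up(void *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    assert(align != 0 && (align & (align - 1)) == 0);
    return (void *)((addr + (align - 1)) & ~(uintptr_t)(align - 1));
}

/*
 * Hypothetical helper: allocate n bytes starting on a cache-line boundary.
 * The pointer that must later be passed to free() is stored in *raw.
 */
void *alloc_line_aligned(size_t n, void **raw)
{
    *raw = malloc(n + CACHE_LINE - 1);   /* slack so we can round up */
    return *raw ? align_up(*raw, CACHE_LINE) : NULL;
}
```

The caller keeps the raw pointer only so it can be freed later; all reads and writes go through the aligned pointer.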

    17. Re:Code in high-level by dbIII · · Score: 1

      Also there is code that is used a lot for a long time.
      For example in geophysics there is a process of arranging data called "Pre Stack Time Migration" which can keep a small cluster busy for a week with relatively small datasets. In cases like that tiny improvements save hours. Only one percent of improvement saves more than an hour in a week.

    18. Re:Code in high-level by oldhack · · Score: 1

      Yeah, probably makes sense only for DSPs and microcontrollers. But then isn't 68k used as microcontrollers now?

      We used to say there were two many layers of shit. Now it's truly "turtles all the way down."

      --
      Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
    19. Re:Code in high-level by RzUpAnmsCwrds · · Score: 2, Informative

      It also depends on the compiler. GCC, for example, sucks at auto-vectorization, so it's easy to get 30% or more on loopy scientific code just by using SSE instructions properly.

      In contrast, PGI or ICC is much harder to beat using assembly.

    20. Re:Code in high-level by s73v3r · · Score: 1

      Wow, that sucks. My college ASM class was AWESOME! Granted, it was probably only there to give us a feeling for what was going on under the hood, not to actually learn x86 assembly, but it was taught by a guy who not only was very knowledgeable about the subject, but was also really enthusiastic (even for being upwards of 70!).

    21. Re:Code in high-level by oldhack · · Score: 1

      Is there a study of why we sometimes sub same-sounding words when typing in stream-of-conscious style? Might be something there...

      --
      Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
    22. Re:Code in high-level by Dunbal · · Score: 1

      I think you can legally get MASM (Microsoft Macro Assembler) somewhere on the internet for free. A good place to start would be Microsoft. Then you can do what real coders do, and teach yourself!

      And to think I paid several hundred dollars for that, back in the day.

      --
      Seven puppies were harmed during the making of this post.
    23. Re:Code in high-level by caerwyn · · Score: 2, Insightful

      That's *generally* true. It's not *always* true.

      There are a lot of purely compute-bound applications (think simulations of various sorts, etc) for which the algorithmic optimizations have already been done- but it's still worth going for the last few percent of performance from "instruction fiddling". As another poster said: if your app runs for weeks at a time, 1% improvement becomes significant in terms of time saved- and throwing more hardware at the problem isn't always feasible.

      --
      The ringing of the division bell has begun... -PF
    24. Re:Code in high-level by smash · · Score: 2, Insightful
      Not quite.

      But, it's certainly better to code in a high level language first, test, tweak the algorithm as much as you can, PROFILE and THEN start breaking out your assembler. No point optimising 99% of your code in super fast asm if it only spends 1% of the cpu time in it. Even if you make all that code 10x as fast, you've only saved 0.9% cpu time. :)
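
The 0.9% figure follows from simple arithmetic: a portion that takes a fraction f of the run time, made s times faster, saves f * (1 - 1/s) of the total. A small sketch of that calculation, with the function name time_saved chosen only for illustration:

```c
#include <assert.h>

/*
 * Fraction of total run time saved when a portion of the program that
 * accounts for `fraction` of the run time is made `speedup` times faster.
 */
double time_saved(double fraction, double speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0 && speedup > 0.0);
    return fraction * (1.0 - 1.0 / speedup);
}
```

With these definitions, time_saved(0.01, 10.0) is 0.009, the 0.9% from the comment, while the same 10x speedup applied to a hot spot taking 99% of the run time would save about 89%.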

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    25. Re:Code in high-level by WilyCoder · · Score: 1

      I've heard that the first C compiler was written in C.

    26. Re:Code in high-level by Anonymous Coward · · Score: 2, Informative

      Or you could get NASM, which is open source :)

    27. Re:Code in high-level by __aaclcg7560 · · Score: 1

      Sweet! The last time that I looked at ASM, I had to run a DOS box under Windows XP that didn't work out too well.

    28. Re:Code in high-level by Anonymous Coward · · Score: 0

      amd64 is horrible, the calling convention is ridiculously complicated and different on many operating systems

    29. Re:Code in high-level by frank_adrian314159 · · Score: 1

      In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations.

      And that will work until the next rev of the board's chip, which your hardware vendor will change when he wants to and not notify you about. You'll know about it when the customer complaints roll in about poor performance or during your next rev of the firmware when your performance stats go to hell. And, if you're trying to do this for COTS hardware, forget it - you won't even know which chips you'll be running on. The bottom line? Unless price (and cost to your company) is of no concern, write the code as cleanly as possible and run it through an optimizer.

      --
      That is all.
    30. Re:Code in high-level by Anonymous Coward · · Score: 0

      I wanted to take ASM in college. I was the only student who showed up for the class and the class was canceled. Since most of the programming classes were Java-centric, no one wanted to get their hands dirty under the hood.

      I did an EE and we had to learn ASM for some embedded courses (6811, PIC). Learning it for "larger" processors is certainly possible, but you could always get a hobby kit and learn it. We also had some courses in VHDL to design simple CPUs and VGA emulators (e.g., had to program an FPGA to display certain patterns on a CRT).

      I'm guessing you were a CS, and they didn't really go down into hardware as much as in comp. eng. or EE.

    31. Re:Code in high-level by __aaclcg7560 · · Score: 1

      Uh, no. C was written in B. B was written in A. A was written in leftover naughty bits. :P

    32. Re:Code in high-level by Just+Some+Guy · · Score: 1

      You certainly need to know a lot about assembler and CPU architecture if you are going to write code that emits highly tuned assembler. Actually you probably do have to write those tools in assembler for all intents and purposes.

      That's news to GCC:

      $ cd /usr/src/contrib/gcc
      $ find . -name '*.[ch]' | wc -l
      869
      $ find . -name '*.[ch]' | xargs cat | wc -l
      895866
      $ find . -name '*.asm' | wc -l
      34
      $ find . -name '*.asm' | xargs cat | wc -l
      6520

      Translation: In GCC 4.2.1 as shipped with FreeBSD 8-STABLE, there are 869 .c and .h files with a total of 900KLOC, and 34 .asm files with 6KLOC. It seems that GCC itself isn't written with very much assembler.

      --
      Dewey, what part of this looks like authorities should be involved?
    33. Re:Code in high-level by Kjella · · Score: 1

      I wanted to take ASM in college. I was the only student who showed up for the class and the class was canceled. Since most of the programming classes were Java-centric, no one wanted to get their hands dirty under the hood.

      I'm probably going to need an asbestos suit for this post, but to be honest I don't think assembler is a good programming language for humans. My impression is that they absolutely don't want to pollute the instruction set with instructions unless there's a performance benefit to doing so. But what it means in practice is that anyone I've seen writing advanced assembly relies on lots and lots of macros to do essential things, because the combination of instructions is useful but there's no language construct. For example, in general you JMP everywhere which is the low-level equivalent of GOTO and you use that to create the equivalent of FOR and WHILE etc. which is neat to have seen once but gets quite tedious to do over and over.

      Most of the real world issues I run into aren't of the type "yeah with an assembler optimization here we could squeeze another 2% out of it". It's stuff like "wtf why are you putting that inside the loop?" or "why are you doing this processing one by one when a batch update would do this 1000x faster?" If you got a clue on what's happening in C, if you know when memory is allocated/deallocated and that the basic operations you do make sense, you'll write better code than 90% of the developers out there anyway.

      --
      Live today, because you never know what tomorrow brings
    34. Re:Code in high-level by __aaclcg7560 · · Score: 1

      I was learning computer programming at the local community college while working as a lead video game tester. Two-thirds of my classes were Java-centric. When C++ became available again after the college got the money for a renewed Microsoft site license, I took the remaining classes in that language. Ironically, the instructor didn't like the new version of Microsoft Visual Studio and we switched to Linux.

    35. Re:Code in high-level by mfnickster · · Score: 1

      Will NASM let you write structured assembly, like MASM?

      I picked up a used copy of Inner Loops by Rick Booth, and it intrigued me enough to consider tracking down an old version of MASM.

      --
      "Slow down, Cowboy! It has been 3 years, 7 months and 26 days since you last successfully posted a comment."
    36. Re:Code in high-level by TheRaven64 · · Score: 4, Informative

      One of the biggest drawbacks of a language like C (and even more C++, and even more Java), is that they don't give you a whole lot of control of how stuff is arranged in memory

      I'd say this is more of a C/C++ problem than a Java problem. Or, rather, they are different problems. The problem with C and C++ is that they do give the programmer a whole lot of control about how things are arranged in memory. They don't, on the other hand, give the compiler a lot of freedom to rearrange things.

      Java, on the other hand, uses the Smalltalk memory model and so the compiler (and/or JVM) is free to rearrange things in memory as much as it wants to (whether it does, of course, is a matter for the compiler writer). For example, a Java compiler that notices that you are doing the same operation on three instance variables is free to put them next to each other aligned on a 128-bit boundary with some padding at the end so that you can easily use vector instructions on them, even if they were originally declared in different classes. A C compiler can not do this with structure fields.
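
The C side of this contrast can be illustrated with offsetof: the compiler keeps structure members in declaration order, so any reordering to cut padding has to be done by hand. The two structs and the function name below are invented for the example, and the exact sizes depend on the ABI.

```c
#include <assert.h>
#include <stddef.h>

/* Fields in the order a programmer might naturally declare them. */
struct as_declared {
    char   flag;
    double value;
    char   tag;
};

/* The same fields reordered by hand, largest first, to reduce padding. */
struct reordered {
    double value;
    char   flag;
    char   tag;
};

/* Bytes saved by the manual reordering (0 if the ABI pads both alike). */
size_t bytes_saved_by_reordering(void)
{
    /* C keeps members in declaration order, so these offsets must increase. */
    assert(offsetof(struct as_declared, flag) < offsetof(struct as_declared, value));
    assert(offsetof(struct as_declared, value) < offsetof(struct as_declared, tag));
    return sizeof(struct as_declared) - sizeof(struct reordered);
}
```

On a typical x86-64 ABI the first struct occupies 24 bytes and the second 16, but that is a property of the platform rather than of the language.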

      If you really care about alignment in C, you are free to use valloc() to align on a page boundary and then subdivide the memory yourself. Most of the time, however, it's not worth the effort.

      --
      I am TheRaven on Soylent News
    37. Re:Code in high-level by TheRaven64 · · Score: 3, Interesting

      Note that even with GCC, the choices aren't just autovectorisation and assembly. GCC provides (portable) vector types, and if you declare your variables as these then it just has to try to use SSE / AltiVec / Whatever instructions for the operations, and it can easily because your variables are aligned. Primitive operations (i.e. the ones you get on scalars in C) are defined on vectors and so you can do 2^n of them in parallel and GCC will emit the relevant instructions depending on your target CPU. Going a step further, there are intrinsic functions that are specific to a particular vector ISA and can be used with these. Then you get to tell GCC exactly which instruction to use, but it still does all of the register allocation for you.
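
A minimal sketch of the vector types described here, using the vector_size attribute that GCC (and Clang) provide as an extension. The type name v4sf and the functions mul_add and mul_add4 are illustrative names, and which instructions get emitted depends on the target and compiler flags.

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang extension: a 16-byte vector holding four floats. */
typedef float v4sf __attribute__((vector_size(16)));

/* Element-wise a*b + c on four floats at once; the compiler picks the instructions. */
v4sf mul_add(v4sf a, v4sf b, v4sf c)
{
    return a * b + c;
}

/* Convenience wrapper for plain arrays of four floats. */
void mul_add4(const float *a, const float *b, const float *c, float *out)
{
    v4sf va, vb, vc, vr;
    assert(a && b && c && out);
    memcpy(&va, a, sizeof va);   /* memcpy avoids assuming the arrays are aligned */
    memcpy(&vb, b, sizeof vb);
    memcpy(&vc, c, sizeof vc);
    vr = mul_add(va, vb, vc);
    memcpy(out, &vr, sizeof vr);
}
```

The ordinary * and + operators work on the vector type directly, which is the point made above: no intrinsics or assembly are needed for the primitive operations.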

      --
      I am TheRaven on Soylent News
    38. Re:Code in high-level by TheRaven64 · · Score: 3, Informative

      The calling convention is complicated, but it's nowhere near as different as IA32 calling conventions between platforms. Linux and FreeBSD, for example, use different rules for when to return a structure on the stack and when to return it in registers on IA32, but they use exactly the same conventions (the SysV ABI) on x86-64.

      --
      I am TheRaven on Soylent News
    39. Re:Code in high-level by AdamHaun · · Score: 1

      It's not a great language (family) for general use, but it is a good way to learn something about how CPUs work, what a function call actually is, etc.

      --
      Visit the
    40. Re:Code in high-level by SETIGuy · · Score: 1

      In non-trivial single threaded application code on a modern processor, the CPU core is spending about 95% of its time waiting on memory transfers. To fix that problem, it can make sense to prefetch and reorder memory accesses. Chances are you know better than your compiler how to do that. It also makes sense to start more threads on a processor with multiple hardware threads so you can do things while waiting for memory.
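
As a rough sketch of manual prefetching, the loop below walks an index table and asks for a later element while summing the current one. __builtin_prefetch is a GCC/Clang builtin and only a hint; the lookahead distance of 16 and the function name sum_indirect are assumptions for the example, and whether this helps at all has to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tuning knob: how many elements ahead to prefetch. */
#define PREFETCH_AHEAD 16

/* Sum values reached through an index table, i.e. data[index[i]]. */
long sum_indirect(const long *data, const size_t *index, size_t n)
{
    long total = 0;
    assert(data != NULL && index != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Hint only: read access (0), low temporal locality (1). */
            __builtin_prefetch(&data[index[i + PREFETCH_AHEAD]], 0, 1);
        }
        total += data[index[i]];
    }
    return total;
}
```

The indirect access pattern is the case where the programmer knows the next address well before the hardware can guess it.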

      Most programmers won't even bother to do that, because the processor is fast enough to do what they want without the optimization. Only in heavy duty numerical code and in games does optimization by hand get done. Where you really need top performance regardless of the platform, coders will write multiple versions of a core routine and time them to find what's best on the machine being used.
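
The "write several versions and time them" approach might look roughly like this. Everything here is a simplified assumption: the two summing routines, the sum_fn type and pick_fastest are invented for the example, and a single clock() measurement is far noisier than what a real self-tuning library would use.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef long (*sum_fn)(const int *, size_t);

/* Candidate 1: the obvious loop. */
long sum_simple(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Candidate 2: four independent accumulators, unrolled by hand. */
long sum_unrolled(const int *v, size_t n)
{
    long a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += v[i];
        b += v[i + 1];
        c += v[i + 2];
        d += v[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        a += v[i];
    return a + b + c + d;
}

/* Time each candidate once on the caller's data; return the index of the fastest. */
size_t pick_fastest(const sum_fn *candidates, size_t count, const int *v, size_t n)
{
    size_t best = 0;
    clock_t best_time = 0;
    assert(count > 0);
    for (size_t i = 0; i < count; i++) {
        clock_t start = clock();
        volatile long sink = candidates[i](v, n);  /* volatile keeps the call alive */
        clock_t elapsed = clock() - start;
        (void)sink;
        if (i == 0 || elapsed < best_time) {
            best = i;
            best_time = elapsed;
        }
    }
    return best;
}
```

The program would run pick_fastest once at startup and then call the chosen routine through the function pointer for the rest of the run.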

    41. Re:Code in high-level by Anonymous Coward · · Score: 0

      Actually, a for or while construct is trivially easy in asm, almost easier than in c.

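      for construct: for(i=amount;i--;)   (assumes amount > 0)
      mov ecx,amount
      for_top: ...inner loop...        ; "loop" is itself an x86 mnemonic, so use another label
      dec ecx
      jnz for_top

      while construct: while(amount!=0)
      jmp while_test                   ; test before the first pass, as while does
      while_top: ...inner loop...
      while_test: cmp amount,0
      jnz while_top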

    42. Re:Code in high-level by toastar · · Score: 1

      Pfft... GCC,

      When I was a kid I had to learn to program using Machine Code, Uphill, both ways!

    43. Re:Code in high-level by wisty · · Score: 1

      I heard a rumor that there's some fundamental geophysical program that's been around for decades. It doesn't accumulate the results in an array, because memory was too expensive when fortran 66 was the hot new thing.

      It has a write-to-disk instruction in an inner loop. But it works, and nobody wants to touch it.

      A little micro-optimization there would grant a 1000x speedup.

    44. Re:Code in high-level by Anonymous Coward · · Score: 0

      That's a shame, because Java is actually a great language to learn the principles of assembly. It's very easy to disassemble compiled class files to bytecode, and thus easy to map the stack-based instructions to the Java source.

    45. Re:Code in high-level by Just+Some+Guy · · Score: 1

      My first "real" programming was using a machine language monitor on a C64, so I feel your pain.

      --
      Dewey, what part of this looks like authorities should be involved?
    46. Re:Code in high-level by SETIGuy · · Score: 3, Interesting

      Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun.

      Coding assembly on RISC architectures is dead boring because all the instructions do what you expect them to and can be used on any general purpose register.

      In the good old days, when x86 was 8086 there were no general purpose registers. The BX register could be used for indexing, but AX, CX and DX couldn't. CX could be used for counts (bit shifts, loops, string moves), but AX, BX, and DX couldn't. SI and DI were index registers that you could add to BX when dereferencing or could be used with CX for string moves. AX and DX could be used in a pair for a 32 bit value. If you wanted to multiply, you needed to use AX. If you wanted to divide, you needed to divide DX:AX by a 16 bit value and your result would end up in AX and the remainder in DX. Compared to the Z80 assembly language, we thought this was easy.

      Being able to use %r2 for the same stuff you use %r1 for is just boring.

    47. Re:Code in high-level by keeboo · · Score: 1

      It also depends on the compiler. GCC, for example, sucks at auto-vectorization, so it's easy to get 30% or more on loopy scientific code just by using SSE instructions properly.

      In contrast, PGI or ICC is much harder to beat using assembly.

      ICC does great work with auto-vectorization.
      Yet (perhaps it's no longer true), ~3 years ago I had problems with ICC generating wrong code in certain situations. I went back to GCC.

    48. Re:Code in high-level by keeboo · · Score: 1

      Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun. x86-64 manages to increase the "funness" value somewhat, but I still wouldn't quite qualify it as "fun".

      No need to go that far.
      68k ASM is pure heaven compared to x86.

    49. Re:Code in high-level by mpgalvin · · Score: 1

      How will they learn compilers? or Driver design?

      "Gentlemen, I have met the code-monkeys and it is us."

    50. Re:Code in high-level by fuzzyfuzzyfungus · · Score: 1

      Presumably, if faced with such a program and unwilling to alter it, wouldn't a ramdisk be the logical course of action?

      Takes about 30 seconds to set up in most any modern OS, all but the cheapest and nastiest contemporary systems have enough RAM that you can safely carve out something larger than any HDD of the fortran66 era, and(while not as fast as using RAM properly) should run like a bat out of hell compared to any actual disk....

    51. Re:Code in high-level by ChrisMaple · · Score: 1

      Not many compilers are aware of the video extensions (SSE, etc.), nor are they able to turn even simple loops into code using those parallel extensions. Speedups of 2X, 3X, or more are possible in certain cases.

      --
      Contribute to civilization: ari.aynrand.org/donate
    52. Re:Code in high-level by KC1P · · Score: 1

      Or you could get WASM (part of the Open Watcom package at www.openwatcom.org) which is open-source AND uses something approaching standard syntax.

      NASM unfortunately falls into the common trap of figuring that, since MASM-style syntax has a lot wrong with it, the syntax should be changed. But as with all such projects, the syntax is changed to fit someone's particular taste, and now you'll write source code which isn't compatible with anything. And IMHO NASM's syntax is no improvement over MASM anyway. ALL aftermarket assemblers do this -- yes I understand why it's hard to resist but it rarely solves anything. MASM syntax is a mess but it's very expressive and gets the job done. And anyone who thinks that numbered macro arguments are better than named ones needs to have their head examined!

    53. Re:Code in high-level by KC1P · · Score: 1

      I use tons of macros in my assembly code but the main reason I do is because present-day assemblers provide a GREAT macro language -- MUCH better than what C has. So while macros can also be used to package purely rote operations, they're also great for code which contains lots of assembly-time checks and will do the right thing if constants change in the headers, or whatever.

      If you think JMPing is tedious then that just means you're not an assembly programmer, which is fine. The classic problem is for HLL programmers to think in their native HLL and then try to translate HLL operations into assembly code. So it makes sense for HLL programmers to chafe at all the JMPs and want to wrap things up to look like REPEAT-UNTIL, or DO, or whatever. Assembly programmers are used to thinking in tiny steps so it makes perfect sense -- ask a question and then branch based on the answer. That's how the computer works so that's what I have to tell it. Since there's no stigma on JMPing (it's the only way to do anything anyway) we embrace it and learn how to do it well (it's not always spaghetti code just because it won't lie flat on the page).

      I definitely agree that having a clear grasp of what you're doing and when beats any amount of low-level tweaking. But you can have both! And all those 2% speedups really start to add up if you do them all the time. But of course the real motivation is the same as for anyone else -- C programmers use C because they like C. I use assembly because I like assembly. I have all kinds of justifications (just like a C programmer does) and some of them are right (as are some for C) but mostly I just love it, and almost anything else makes my skin crawl.

    54. Re:Code in high-level by Anonymous Coward · · Score: 0

      that's why i majored in computer engineering. not because i wanted to engineer computers. but because compsci focuses on higher level aspects of software. and now i'm unemployable.

    55. Re:Code in high-level by The_Wilschon · · Score: 2, Insightful

      It all depends on your problem domain. As a high energy physicist, I write plenty of code that only I, a postdoc, and maybe a couple of other grad students will ever see, and probably I'm the only one that will actually ever use it. I'm designing a small cluster that will get built here in a month or few, and some of my code will take up about 2 months of solid run time on it, then never see the light of day again. If I can spend 2 days getting a 5% performance improvement, even at the expense of locking the code to this cluster, it's a net win for us.

      In short, I have no "customers", I know exactly what hardware my code will be running on, and it won't ever change (until they ditch the cluster in 4-5 years and make a new one, but I'll be long gone), and I don't even have to worry about maintaining the code years in the future.

      All the same, I'll probably still write the code as cleanly as possible and run it through an optimizer, and leave it at that.

      --
      SIGSEGV caught, terminating

      wait... not that kind of sig.
    56. Re:Code in high-level by Anonymous Coward · · Score: 1, Informative

      One of the biggest *advantages* of C/C++ as a systems language is that it gives you lots of control on how you arrange memory. You can write your allocators. You get a guaranteed layout of your structs and using the extensions the C/C++ compilers implement you *can* force alignments. You go a bit outside of the standard by using the "extensions", but you can encapsulate its use by using the preprocessor and porting is less a hassle than if you write your stuff in straight assembly.

      What the compiler can't do is rearrange the data automagically so that it runs faster, so it is the programmer who has to think about that... just as in assembly (but in a more comfortable way).
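
      A rough sketch of hiding an alignment extension behind the preprocessor, as described above. The macro name ALIGNED, the particle struct and the helper function are all made up for illustration; the two spellings are the GCC/Clang attribute and the MSVC declspec.

```c
#include <stddef.h>
#include <stdint.h>

/* One macro hides the compiler-specific spelling (ALIGNED is a
   hypothetical name chosen for this sketch). */
#if defined(_MSC_VER)
#  define ALIGNED(n) __declspec(align(n))
#else
#  define ALIGNED(n) __attribute__((aligned(n)))   /* GCC, Clang */
#endif

/* Members are laid out in declaration order, so the layout is
   under the programmer's control. */
struct particle {
    float pos[4];
    float vel[4];
};

/* 16-byte aligned storage, e.g. for aligned SSE loads. */
static ALIGNED(16) struct particle particles[64];

/* Returns 1 when the array really starts on a 16-byte boundary. */
int particles_are_aligned(void)
{
    return ((uintptr_t)particles % 16) == 0;
}
```

      Porting to another compiler then means touching the macro definition and nothing else.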

    57. Re:Code in high-level by Anonymous Coward · · Score: 0

      Or even better... use a C compiler that supports inline assembly (I'd say most, if not all, C compilers).
      Use and abuse the generated asm output of your C compiler (prefer a compiler that lets you do that with optimization turned on).

      This also allows you to learn how the compiler generates code. If you are fluent in assembly you will be able to spot where the C compiler doesn't optimize as it should (the generated code is not what you expect). In many cases it is due to some language rule, and by providing some extra information to the compiler you can get better generated code (restrict et al.). In other cases it takes rearranging your C code so that the compiler generates what you want. Throw in intrinsics and you can get really close to asm performance without the tedious parts of asm.
      I personally find it more fun to trick a C compiler into generating the asm I want than to code in asm. The code is also more portable and easier to understand for outsiders (and for me after a month).
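
      A small sketch of the "extra information" point, using C99 restrict. The function name scale_add is made up. Without restrict the compiler has to assume that a store to dst might change src, which constrains reordering and vectorization; with it, the caller promises the two arrays don't overlap.

```c
#include <stddef.h>

/* dst[i] += k * src[i] for i in [0, n).
   restrict is a promise from the caller that dst and src do not
   overlap; breaking that promise is undefined behaviour. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

      Comparing the generated asm with and without the two restrict qualifiers (for example with gcc -O2 -S) is a quick way to see what the compiler does with the promise.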

    58. Re:Code in high-level by gandhi_2 · · Score: 1

      And who will write these tools for us? Code generators?

    59. Re:Code in high-level by Surt · · Score: 1

      Which part of highly tuned assembler did you miss when picking GCC ....

      --
      "Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
    60. Re:Code in high-level by Anonymous Coward · · Score: 0

      Don't worry, help is always around the corner.

      Google "Flat Assembly" to get started.

    61. Re:Code in high-level by stinerman · · Score: 1

      Odd. Assembler was a required course at my college for CS/CEG students. Of course, they taught m68k assembler because it was a lot easier, but that was our class.

    62. Re:Code in high-level by __aaclcg7560 · · Score: 1

      Sorry. The moronic Microsoft grammar check doesn't work with Slashdot. :P

    63. Re:Code in high-level by NoNickNameForMe · · Score: 1

      Hi, I agree with your sentiments (I've programmed in 8086 and 68HC11), though currently I'm trying to pick up ARM assembly language.

      I find that 'gcc -Os' beats my handcrafted assembly due to the fact that the compiler can make a lot of 'short cut' optimizations based on what it knows regarding memory address locations, etc. (via PC-relative indirect addressing) that would be difficult to take advantage of in Assembly without making it a non-maintainable hack.

      I'm still trying to figure out how to beat gcc, but I don't think it'll be easy, especially not via micro-optimization where I take a C algorithm and reimplement it as given in Assembly.

    64. Re:Code in high-level by pydev · · Score: 1

      Well, you can produce code that runs fast on your particular processor, but that doesn't mean it runs fast on other processors, even if they have the same instruction set.

    65. Re:Code in high-level by vtcodger · · Score: 1

      There are, or used to be, a number of free assemblers that could generate X86 code. Maybe not all the instructions, but more than enough for "Hello World" and other simple exercises. The problem is -- as others have mentioned -- that the x86 instruction set has all the beauty and elegance of a third world slum.

      As an alternative for learning, I'd suggest using an emulator and programming for some sane instruction set. Maybe the MC6809 which had a nice, clean, easily comprehensible instruction set. (I'm sure that there are other equally good choices). If one still has an interest in assembly language programming after that, then by all means tackle x86. You'll probably be appalled.

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
    66. Re:Code in high-level by dbIII · · Score: 1

      I've never heard of it and doubt that such a thing would be used in a commercial or even research environment now.
      If things are not running at 100% on all CPUs we go looking for bottlenecks since some of these things do run for weeks.
      Also since a lot of stuff is standardised or based on published techniques you can use one tool to do one job and others before and after it. If one software company has slow stuff it's likely that there's another company just down the road in Texas with something better and probably all the good staff the previous company used to have (precisely the case with the PSTM software we use). I'm on the other side of the world but most of the software, even the queuing system originally from NASA, comes out of Texas.
      There's also seismic un*x (Colorado School of Mines) which is a collection of open source tools. Since it has no GUI it mostly has an R&D role or in some places with custom front ends. Unfortunately people think a crap python script to display a GUI on top of the open source software is a priceless bit of corporate IP so most places that would consider it are scared off by the prospect of writing their own GUI for the point and click generation.

    67. Re:Code in high-level by dunkelfalke · · Score: 1

      there is always debug.exe ;-)

      --
      "It's such a fine line between stupid and clever" -- David St. Hubbins, Spinal Tap
    68. Re:Code in high-level by steveha · · Score: 1

      GCC provides (portable) vector types, and if you declare your variables as these then it just has to try to use SSE / AltiVec / Whatever instructions for the operations

      I very much want to know more about this.

      Is there a book or web site that you recommend where I can learn more about this? The GCC manual doesn't have much about it.

      steveha

      --
      lf(1): it's like ls(1) but sorts filenames by extension, tersely
    69. Re:Code in high-level by cheekyboy · · Score: 2, Informative

      Intel compilers have options to optimize for more than one target, and the runtime dispatch code picks the version that was built for the CPU it is running on. Sure your binary is larger, but everyone is happy.

      --
      Liberty freedom are no1, not dicks in suits.
    70. Re:Code in high-level by hughk · · Score: 1

      Ah but you look at the machines that influenced the 68K - the PDP-11 and then later the VAX. Beautiful instruction sets - CISC - but so nice!

      Digital also came out with their RISC design, the Alpha - also a beautiful architecture which Intel and HP promptly proceeded to nuke in favour of the Itanium, which is one of the few things that will make you pine for x86.

      --
      See my journal, I write things there
    71. Re:Code in high-level by ubersoldat2k7 · · Score: 1

      Which software is it you're talking about? Maybe some volunteers here would be willing to help them with this GUI stuff.

    72. Re:Code in high-level by ubersoldat2k7 · · Score: 1

      He's THE guy who was going to take the ASM course. I'm sure he wasn't getting his hands dirty "under the hood".

    73. Re:Code in high-level by Lisandro · · Score: 1

      Well, it depends. I've been a x86 hacker for a large part of my adult life and i agree - the architecture is a mess. It's got to the point where you can't even really predict how hand-written assembler code will perform, as there're a gazillion different hardware architectures running the same instruction set. In this sense, i've found Click's video really interesting. In the good old days of the 386-486, things were a lot more predictable and, yes, fun.

      I have to wholeheartedly agree with you about ARM. Lately, i've been digging into ARM (in order to reverse-engineer a device's bootloader) and i'm stoked. A clean, well thought out architecture with a minimal instruction set that's very rich. I love being able to add conditionals to pretty much every instruction available and having a barrel shifter available per instruction.

    74. Re:Code in high-level by drinkypoo · · Score: 1

      Isn't this all still true of x86 today? AMD64 being a slightly different beast.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    75. Re:Code in high-level by drinkypoo · · Score: 1

      Alpha was getting its ass kicked by x86 in the form of AMD chips when it was canned. iTanic was a good idea, poorly executed: classic intel.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    76. Re:Code in high-level by dbIII · · Score: 1

      Seismic un*x (Colorado School of Mines).

    77. Re:Code in high-level by erenare · · Score: 1

      As an alternative for learning, I'd suggest using an emulator and programming for some sane instruction set. Maybe the MC6809 which had a nice, clean, easily comprehensible instruction set. (I'm sure that there are other equally good choices). If one still has an interest in assembly language programming after that, then by all means tackle x86. You'll probably be appalled.

      *sigh*

      I just shed a tear, I had forgotten the beauty of coding 6809 assembler back in the old CoCo/Dragon32 days. From personal experience, it is indeed a great platform to learn a decent/simple assembler.

      (on a side note, wasn't the 6809 the only one of the then popular processors to implement integer multiplication in hardware? ah! the beauty of "MUL")

      If interested, I would recommend downloading either Paul Burgin's "T3" Dragon32/CoCo emulator for the PC or Xroar, (first hit for "Xroar" in google, multi platform) and have a go at it. Of course it requires a "fast machine" (something like a 386DX 33 will be enough for T3 ;o) and the original Dragon ROMs. (getting hold of the ALLDREAM Editor/Assembler would also be a great bonus if you are pursuing this route)

      (please, insert mandatory reference to "people getting off my lawn" at your discretion)

    78. Re:Code in high-level by TheRaven64 · · Score: 4, Informative

      The GCC manual tells you everything you need to know. First you declare a vector type, so if you want four shorts representing an RGBA colour value, you declare a type like this:

      typedef short colour_t __attribute__ ((vector_size (4 * sizeof(short))));

      This will give you a 64-bit vector type, so you can fit one in an MMX register, or two in an SSE or AltiVec register. You can then create these and do simple operations on them. For example, if you wanted to add two together, you could do this:

      colour_t a = {1,2,3,4};
      colour_t b = {1,2,3,4};
      colour_t c = a + b;

      In this case, the add is constant so it will be evaluated at compile time, but in the case where a and b have unknown values GCC will emit either four scalar add operations or one 64-bit vector add.

      You can also pass them as arguments to vector intrinsics, which are listed in the manual under target-specific builtins. These correspond directly to a single underlying vector instruction, so if you look in the assembly language reference for the target CPU then you will find a detailed explanation of what each one does.

      Rather than declare vector types directly, it's often a good idea to declare unions of vector and array types. This lets you use the same value as both an array and a vector.
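
      A minimal sketch of that union trick, assuming GCC or Clang (vector_size is a compiler extension). The names colour_u and colour_add are made up for the example:

```c
/* Same vector type as above: four shorts, 64 bits in total. */
typedef short colour_t __attribute__ ((vector_size (4 * sizeof(short))));

/* Hypothetical union: the same 8 bytes viewed as one vector or as
   four individually addressable channels. */
typedef union {
    colour_t v;
    short    s[4];
} colour_u;

/* Add two colours channel by channel. The + on the vector member
   becomes one vector add where the target has one, or four scalar
   adds otherwise. */
void colour_add(colour_u *out, const colour_u *a, const colour_u *b)
{
    out->v = a->v + b->v;
}
```

      The array member is handy for filling in or printing single channels, while the arithmetic goes through the vector member.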

      I wrote a longer explanation a while ago.

      --
      I am TheRaven on Soylent News
    79. Re:Code in high-level by Anonymous Coward · · Score: 0

      The Intel Compiler in particular is extremely good at what it does

      Except on AMD!

    80. Re:Code in high-level by cmarkn · · Score: 1

      But it would take a big hit on your speed before it would be worth bothering with. If you're running an analysis that takes 100 days to run, you're not going to be much bothered when it stretches out an extra day.

      The real concern is for really short applications with hard deadlines, such as a missile interceptor. When you have only a couple of seconds to track, aim and make your shot, 20 milliseconds can mean a lot.

      --
      People should not fear their government. Governments should fear their people.
    81. Re:Code in high-level by mindstrm · · Score: 1

      To your first statement: My understanding, from my system programmer friends is it's generally the opposite - it's extremely difficult to second-guess their good compilers (speaking intel here) - and even though the compiler spits out weird things, out of order, that don't make sense, after profiling, the compiler beats them out 90% of the time.

    82. Re:Code in high-level by TheRaven64 · · Score: 1

      Not exactly true. Alpha was canned in terms of development when Intel and HP started working on Itanic. At the time, nothing touched Alpha in terms of performance. By the time Itanic was released, Alpha was still doing okay in big SMP systems (AMD licensed the Alpha interconnect, but their implementation didn't scale as well), and clock-for-clock was holding its own, but was clocked at about half the speed of competing chips. By the time they officially killed Alpha, it had been almost a decade since the last new design had been implemented, and it was still reasonably fast, although not anywhere near the front of the pack. It was only two or three years ago that the last Alpha machine in the top 10 supercomputers got pushed off the list.

      --
      I am TheRaven on Soylent News
    83. Re:Code in high-level by afidel · · Score: 2, Insightful

      In general modern compilers are good enough that you are much more likely to get better performance by spending the time finding a better algorithm than you are hand optimizing the code. Obviously for things like H.264 where the algorithm is already set this is not true, but that's a very small fraction of the code out there.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    84. Re:Code in high-level by Xest · · Score: 1

      Normal programmers using existing compilers like GCC usually...

      Even if you're building for a brand new architecture, it's still quicker to write a compiler for the new architecture on an existing machine using an existing architecture with existing tools and languages than it is to build one from scratch in ASM.

      The situations in which you'd need to build from ASM and couldn't do this are pretty much non-existent. If the architecture in question is still binary based it's almost certainly just easier to do things this way.

    85. Re:Code in high-level by Lucid+3ntr0py · · Score: 1

      Wow. Have you tried not being an ass?

    86. Re:Code in high-level by Com2Kid · · Score: 1

      x86 isn't the only game in town. :)

      Also, larger binaries can be problematic on platforms that don't ship on hundred gigabyte+ hds.

    87. Re:Code in high-level by hughk · · Score: 1

      Alpha AXP was canned because it was a competitor to Itanium in which HP and Intel had too much investment. Remember Digital weren't just building chips, they were building the entire computers, ones that were running as mainframes at the centre of banks and stock exchanges. Compaq didn't have much money for continued development but it happened. Compaq had problems and they were acquired by HP. They had OpenVMS, some big systems, and a regular income stream. HP was committed towards Itanium by their agreement with Intel so they started a programme to replace Alpha with Itanium - a painful process as there was a lot of suckage to overcome.

      --
      See my journal, I write things there
    88. Re:Code in high-level by ignavus · · Score: 1

      Of course all that is predicated on using a profiler first to find out where the actual problem is. Abrash accurately pointed out years ago that programmers suck at that. They'll spend hours making a nice optimized function that ends up making no noticeable difference in execution time.

      Many programs spend most of their time in an idle loop. Optimising the idle loop rarely causes a "noticeable difference in execution time". They just idle faster!

      --
      I am anarch of all I survey.
    89. Re:Code in high-level by DeadCatX2 · · Score: 1

      Iit (sic) doesn't make sense to code in ASM anymore.

      Good thing they weren't talking about coding in ASM. But, if one wishes to talk about what's going on under the hood of a processor, and ASM is what's actually running under the hood, then one will likely see ASM.

      --
      :(){ :|:& };:
    90. Re:Code in high-level by SETIGuy · · Score: 1

      Only in 16 bit instructions or some instructions that are rarely used (i.e. bcd arithmetic). ECX is still the loop counter, but you can index with any register. In 32bit mode EAX-EDX registers have been pretty generic since the 80386. On the other hand the lower half of ESI and EDI can't be accessed as two 8-bit registers. The only major exception is expanding multiply (32x32=64bit) and reducing divide (64/32=32bit), in that you can't specify individual registers for each portion of the result or dividend. You still need to pair EAX and EDX.
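
      For what it's worth, here is how the expanding multiply looks from C. This is only a sketch (the function name mul32x32 is made up); a 32-bit x86 compiler would typically turn the widening multiply into a single MUL with the high half landing in EDX and the low half in EAX, but that is up to the compiler.

```c
#include <stdint.h>

/* 32x32 -> 64 bit multiply, returned as two 32-bit halves. */
void mul32x32(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    /* Widen one operand first so the multiply is done in 64 bits. */
    uint64_t wide = (uint64_t)a * b;
    *hi = (uint32_t)(wide >> 32);
    *lo = (uint32_t)wide;
}
```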

    91. Re:Code in high-level by SETIGuy · · Score: 1

      The problem is that the C and C++ languages are very strict about how out-of-order you are allowed to get. And for very good reason. Unless your compiler supports C99 and the restrict keyword any reads or writes through a pointer need to be completed by the next sequence point. That prevents most loops from accessing memory in an order that is not specified by the program.

      Here's a matrix transpose that works on an XxY matrix of float.

      void transpose(int X, int Y, float *in, float *out) {
          // stupidest possible algorithm
          // assume in and out can't overlap
          int i,j;
          for (j=0;j<Y;j++) {
              for (i=0;i<X;i++) {
                  out[i*Y+j]=in[j*X+i];
              }
          }
      }

      So a compiler would like to optimize this. The easiest way is to transpose sub-blocks of NxM, where N is the cache line size, and M is a number of cache lines determined by the cache size, associativity, and the line fill time for a prefetch. But the compiler doesn't know what the cache line size is, or whether in and out overlap.

      Regardless of what in and out are, the compiler has to reproduce the behavior that the one-float-at-a-time C code has. So a fully optimized version would be: check whether in and out overlap; if they don't, determine the alignment of in and out and the number of cache lines that can be prefetched based upon memory timings. Then, in the loop, prefetch appropriately while performing the sub-transposes in whatever order maximizes performance, using the best available instruction set. If the arrays do overlap, fall back to the slow version.

      Call me when you find a compiler that does that, and produces a binary that is within 20% of optimal on every recent processor (last 3 years) from AMD and Intel.
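
      For reference, a bare-bones sketch of the sub-block idea in plain C. It assumes in and out don't overlap (hence restrict), uses square tiles instead of NxM, and takes the tile edge as a parameter (block is a made-up knob); it does no prefetching, alignment checks or SIMD, so it is nowhere near the fully optimized version described above.

```c
/* Transpose a Y-row by X-column matrix `in` into the X-row by Y-column
   matrix `out`, one block x block tile at a time so that both the reads
   and the writes stay within a few cache lines.
   Assumes block >= 1 and that in and out do not overlap. */
void transpose_blocked(int X, int Y, const float *restrict in,
                       float *restrict out, int block)
{
    for (int jj = 0; jj < Y; jj += block) {
        for (int ii = 0; ii < X; ii += block) {
            /* Clip the tile at the matrix edges. */
            int jmax = (jj + block < Y) ? jj + block : Y;
            int imax = (ii + block < X) ? ii + block : X;
            for (int j = jj; j < jmax; j++)
                for (int i = ii; i < imax; i++)
                    out[i * Y + j] = in[j * X + i];
        }
    }
}
```

      Picking block is exactly the part that depends on the cache line size and cache geometry, which is why it ends up being tuned per machine.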

    92. Re:Code in high-level by SETIGuy · · Score: 1

      And in case it wasn't apparent, a programmer writes multiple transpose functions both in C and assembly (for SIMD instructions) and determines at run time which version works best on the processor at hand. If the compiler wants to do that, I'd be happy to let it.

    93. Re:Code in high-level by Anonymous Coward · · Score: 0

      Amen to the unemployable computer engineers. Not sure about your situation but Australia sucks for computer engineering jobs.

  4. Premature optimization is evil... and stupid by Just+Some+Guy · · Score: 2, Insightful

    That's the main reason why I want to shoot people who write "clever" code on the first pass. Always make the rough draft of a program clean and readable. If (and only if!) you need to optimize it, use a profiler to see what actually needs work. If you do things like manually unroll loops where the body is only executed 23 times during the program's whole lifetime, or use shift to multiply because you read somewhere that it's fast, then don't be surprised when your coworkers revoke your oxygen bit.

    --
    Dewey, what part of this looks like authorities should be involved?
    1. Re:Premature optimization is evil... and stupid by RightSaidFred99 · · Score: 4, Funny

      And messy and embarrassing. Oh, wait...

    2. Re:Premature optimization is evil... and stupid by Monkeedude1212 · · Score: 1

      If (and only if!)

      Compiler Error: Numerous Syntax Errors.
      Line 1, 4; Object Expected
      Line 1, 15; '(' Expected
      Line 1, 16; Condition Expected
      Line 1, 17; 'Then' Expected

    3. Re:Premature optimization is evil... and stupid by Just+Some+Guy · · Score: 1

      That was Lisp. You should parse it as If(only && !if).

      --
      Dewey, what part of this looks like authorities should be involved?
    4. Re:Premature optimization is evil... and stupid by EvanED · · Score: 1

      Always make the rough draft of a program clean and readable.

      Not only that, but if the optimized version is much less readable than the initial version, consider keeping and maintaining *both* versions. You can run tests to compare the output of each version, replace the fast, not-obviously-incorrect version with the slow, obviously-not-incorrect version if you hit a bug and see if it's still there, etc.

      (MS did or does this with Excel; at least until recently, and perhaps still, the recomputation engine for the spreadsheet was hand-tuned assembly. However, for testing and development reasons, they also had a much slower, high-level-language version.)
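
      A toy sketch of the keep-both-versions approach, using bit counting as a stand-in for the real routine. All three function names are made up; the point is only the shape: a slow reference, a faster variant, and a cross-check that runs them on the same inputs.

```c
#include <stdint.h>

/* Slow, obviously-not-incorrect version: look at every bit. */
unsigned popcount_reference(uint32_t x)
{
    unsigned count = 0;
    for (int bit = 0; bit < 32; bit++)
        count += (x >> bit) & 1u;
    return count;
}

/* Faster, less obvious version: x &= x - 1 clears the lowest set bit,
   so the loop runs once per set bit. */
unsigned popcount_fast(uint32_t x)
{
    unsigned count = 0;
    while (x) {
        x &= x - 1;
        count++;
    }
    return count;
}

/* Cross-check over a deterministic set of inputs.
   Returns 1 if the two versions agree on all of them. */
int popcount_versions_agree(void)
{
    uint32_t x = 1;
    for (int i = 0; i < 100000; i++) {
        if (popcount_reference(x) != popcount_fast(x))
            return 0;
        x = x * 1664525u + 1013904223u;   /* simple LCG to vary the bits */
    }
    return popcount_reference(0) == popcount_fast(0);
}
```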

    5. Re:Premature optimization is evil... and stupid by marcansoft · · Score: 3, Interesting

      Using shift to multiply is often a great idea on most CPUs. On the other hand, just about every compiler will do that for you (even with optimization turned off I bet), so there's no reason to explicitly use shift in code (unless you're doing bit manipulation, or multiplying by 2^n where n is more convenient to use than 2^n). However, a much more important thing is to correctly specify signed/unsigned where needed. Signed arithmetic can make certain optimizations harder and in general it's harder to think about. One of my gripes about C is defaulting to signed for integer types, when most integers out there are only ever used to hold positive values.
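
      A small sketch of why signedness matters, using division by 4. The function names are made up. Unsigned division by a power of two is exactly a right shift; signed division truncates toward zero, so a plain shift gives the wrong answer for negative values and needs a fix-up first. The last function assumes >> on a negative int is an arithmetic shift, which is implementation-defined in C (GCC documents it that way).

```c
/* Unsigned: x / 4 is exactly x >> 2. */
unsigned div4_unsigned(unsigned x)
{
    return x / 4;
}

/* Signed: C division truncates toward zero, e.g. -7 / 4 == -1. */
int div4_signed(int x)
{
    return x / 4;
}

/* The same result done by hand with a shift. An arithmetic shift
   rounds toward minus infinity, so negative inputs get a bias of
   (divisor - 1) added first.
   Assumption: >> on a negative int is an arithmetic shift. */
int div4_signed_by_shift(int x)
{
    int biased = (x < 0) ? x + 3 : x;
    return biased >> 2;
}
```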

    6. Re:Premature optimization is evil... and stupid by Anonymous Coward · · Score: 0

      I do that a great deal actually, usually not at the entire application level but certainly like your Excel example. Usually if I find a certain process is slow I will move that function aside or maybe an entire class and try and optimize. I always have the clean code version to test against or go back to simply running with if the deadline to finish something creeps up; worst case some maintenance programmer can try and produce an optimized version of an algorithm (s)he can at least read and understand when someone decides the app is too slow.

    7. Re:Premature optimization is evil... and stupid by Rockoon · · Score: 3, Informative

      Using shift to multiply is often a great idea on most CPUs.

      Which CPUs are those? The fastest way to multiply today on AMD/Intel is to use the multiply instructions.

      Didn't know that? Yeah... it seems like only assembly language programmers know this.

      --
      "His name was James Damore."
    8. Re:Premature optimization is evil... and stupid by marcansoft · · Score: 4, Informative

      Which CPUs are those?

      Those with a barrel shifter.

      The fastest way to multiply today on AMD/Intel is to use the multiply instructions.

      Then someone needs to beat the GCC developers with a cluestick.
      $ cat test.c
      int main(int argc, char **argv) {
          return 4*(unsigned int)argc;
      }
      $ gcc -march=core2 test.c -o test
      $ objdump -d test ...
      00000000004004ec <main>:
          4004ec: 55 push %rbp
          4004ed: 48 89 e5 mov %rsp,%rbp
          4004f0: 89 7d fc mov %edi,-0x4(%rbp)
          4004f3: 48 89 75 f0 mov %rsi,-0x10(%rbp)
          4004f7: 8b 45 fc mov -0x4(%rbp),%eax
          4004fa: c1 e0 02 shl $0x2,%eax
          4004fd: c9 leaveq
          4004fe: c3 retq
          4004ff: 90 nop

      yeah... it seems like only assembly language programmers know this.

      I program in assembly language, but not for x86. I usually program in ARM, which always has a barrel shifter. I guarantee shifts are faster than multiplies there.

    9. Re:Premature optimization is evil... and stupid by AuMatar · · Score: 2, Insightful

      It depends on where they spend their hardware, and what you're multiplying by. You can make a multiplier faster than shifting, it just requires a lot of hardware to do so. If you're multiplying by a constant power of 2, shifting will always be as fast or faster. If you're multiplying by a non power of 2 constant, shifting and adding may be faster, and probably is if there's fairly few 1s in the binary representation. But if they have a good multiplier then mult may be faster than shift/add for a random unknown multiply.

      Also IIRC the P4 got rid of the barrel shifter on Intel. Or maybe it was the gen after that. They may have re-added it though, it seems fairly stupid not to have one.
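
      For anyone who hasn't seen the shift-and-add trick spelled out, a rough sketch (function names are invented for the example). A constant like 10 has only two 1 bits, so it costs two shifts and one add; the general routine shows why the cost grows with the number of 1s in the multiplier.

```c
#include <assert.h>
#include <stdint.h>

/* x * 10, using 10 = binary 1010 = 8 + 2: two shifts and one add. */
uint32_t times_ten_shift_add(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* General shift-and-add: one shift and one add per 1 bit in the
   multiplier, so sparse constants are cheap and dense ones are not. */
uint32_t multiply_shift_add(uint32_t x, uint32_t multiplier)
{
    uint32_t result = 0;
    for (unsigned bit = 0; bit < 32; bit++) {
        if ((multiplier >> bit) & 1u) {
            result += x << bit;
        }
    }
    return result;
}

/* Illustrative checks against the ordinary multiply operator. */
void check_shift_add(void)
{
    assert(times_ten_shift_add(7u) == 70u);
    assert(multiply_shift_add(7u, 10u) == 70u);
    assert(multiply_shift_add(12345u, 4321u) == 12345u * 4321u);
}
```

      Whether the unrolled form actually beats a hardware multiply depends on the chip, which is the whole point above.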

      --
      I still have more fans than freaks. WTF is wrong with you people?
    10. Re:Premature optimization is evil... and stupid by Just+Some+Guy · · Score: 1

      so there's no reason to explicitly use shift in code (unless you're doing bit manipulation

      Well, right. The general advice is to always write what you actually want the compiler to do and not how to do it, unless you have specific proof that the compiler's not optimizing it well.

      --
      Dewey, what part of this looks like authorities should be involved?
    11. Re:Premature optimization is evil... and stupid by AuMatar · · Score: 1

      The opposite problem also exists though- by not thinking about performance you can make it expensive or impossible to improve things later without a substantial rewrite. Saying optimize at the end is just as stupid and just as costly. Learning when to care about what level is part of the art of programming. (Although on your specific examples I'll agree with you- especially since I would expect anything but a really old compiler to do mult->shift conversions for you, so you may as well use the more maintainable and readable multiply.)

      --
      I still have more fans than freaks. WTF is wrong with you people?
    12. Re:Premature optimization is evil... and stupid by marcansoft · · Score: 1

      I was talking of multiplying by a power of two constant, of course. You're quite correct in saying that shift+add combinations may or may not be faster than multiplying by more complex constants, depending on the particular implementation. Usually, two shifts and one add is a fairly safe bet for simpler CPUs, but it can actually slow things down on modern superscalar CPUs where it creates undesirable dependencies in the pipeline.

    13. Re:Premature optimization is evil... and stupid by Just+Some+Guy · · Score: 1

      Saying optimize at the end is just as stupid and just as costly.

      There is an enormous difference between optimization and choosing appropriate algorithms. If you write a program well, it's almost always easy to optimize it later. If you write it poorly, it'll almost always be impossible to optimize at any point of its development. For example, I'd rather sort a big array with an unoptimized (but correct) quicksort than with an extremely clever (but insane) bogosort.

      --
      Dewey, what part of this looks like authorities should be involved?
    14. Re:Premature optimization is evil... and stupid by tomtefar · · Score: 3, Informative

      I have the following sticker on top of my display: "Make it work before you make it fast!" Saved me many hours of work.

    15. Re:Premature optimization is evil... and stupid by Anonymous Coward · · Score: 2, Interesting

      I think that the premature optimization claims are way overdone. In the cases where performance does not matter, then sure, make the code as readable as possible and just accept the performance.

      However, sometimes it is known from the beginning of a project that performance is critical and that achieving that performance will be a challenge. In such cases, I think that it makes sense to design for performance. That rarely means using shifts to multiply -- it may, however, mean that you design your data structures so that you can pass the data directly into some FFT functions without packing/unpacking the data to some other format that the rest of the functions were written to expect. It may also mean that your design scales to many cores and that inner loops are heavily optimized and vectorized. Of course, all of that code should be performance tested during development against the simpler versions.

      Profiling after the fact sounds like a good idea, but what if the code has no real "hotspot"? What if you find out that you need to redesign the entire software framework to support zero-copy processing of the data? Also, profiling tools in general are really not that good. Running oprofile on a large-scale application with dozens of threads and data source dependencies on other processes can be less than enlightening. gprof is entirely useless for non-trivial applications. cachegrind is sometimes helpful, but most people working on performance optimization seem to simply build their own timers based on the rdtsc instruction and manually time sections of the code.
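
      As a rough illustration of the build-your-own-timer approach: the sketch below uses the standard C clock() call as a portable stand-in, not rdtsc itself (which is x86-specific and has its own pitfalls). The section_timer struct and all function names are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: accumulates CPU time spent in one code section. */
struct section_timer {
    clock_t started_at;
    clock_t total;
    unsigned long calls;
};

void timer_begin(struct section_timer *t)
{
    t->started_at = clock();
}

void timer_end(struct section_timer *t)
{
    t->total += clock() - t->started_at;
    t->calls++;
}

/* Total time spent in the section so far, in seconds. */
double timer_seconds(const struct section_timer *t)
{
    return (double)t->total / CLOCKS_PER_SEC;
}

/* Stand-in workload so the sketch has something to time. */
unsigned long busy_sum(unsigned long n)
{
    volatile unsigned long sum = 0; /* volatile keeps the loop from being folded away */
    for (unsigned long i = 1; i <= n; i++) {
        sum += i;
    }
    return sum;
}

/* Illustrative use: wrap the section of interest in begin/end calls. */
void check_timer(void)
{
    struct section_timer t = {0, 0, 0};
    timer_begin(&t);
    unsigned long s = busy_sum(1000ul);
    timer_end(&t);
    assert(s == 500500ul);
    assert(t.calls == 1ul);
    assert(timer_seconds(&t) >= 0.0);
}
```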

      I work on software for processing medical device data and performance is often critical. You probably want an image display to update very quickly when it is providing feedback to the doctor guiding a catheter toward your heart, for example. We had one project where the team decided to start over with a clean framework without concern for performance -- they would profile and optimize once everything was working. They followed the advice of many a software engineer: their framework was very nice, replete with design patterns and applications of generic programming, and entirely unscalable beyond a single processor core. There were no performance tests done during development, and of course the timeline was such that there would only be minimal time for optimization once the functionality was complete. The software that it was replacing was ugly, but also scaled nicely to many cores. The software shipped on a system with two quad-core processors, just as it had before.

      Let's just say that customers were unimpressed with the new software framework.

    16. Re:Premature optimization is evil... and stupid by Anonymous Coward · · Score: 0

      A profiler is only one way of determining what's important. Knowledge and experience is another way. If you've worked in a problem domain for a long time, you know the fundamentals of what's fast enough and what's not acceptable. Utilizing a large store of knowledge to "prematurely" optimize is not necessarily a bad thing. If you know for certain that the clearest, easiest-to-understand way of doing something just isn't going to be fast enough, that's perfectly valid. Just balance that against the clarity of the resulting solution.

      These sorts of things generally fall into the realm of algorithmic design, though, not micro-optimizations like substituting shifts for multiplications. The compiler figures all that shit out for you, anyway.

    17. Re:Premature optimization is evil... and stupid by Just+Some+Guy · · Score: 1

      Interesting anecdote that has nothing to do with optimization and everything to do with bad design. Optimization is great for making your program run n% faster. Design is great for making your program run in O(log n) time instead of O(n^2) time. The important part is to come up with a good design, implement it, and address the specific problem areas. I can't think of a single justification for doing it any other way.

      --
      Dewey, what part of this looks like authorities should be involved?
    18. Re:Premature optimization is evil... and stupid by TheRaven64 · · Score: 2, Informative

      I actually did a benchmark of this a few months ago. For a single shift, there wasn't much in it (on a Core 2); both were decoded into the same micro-ops. For more than one shift and add, the multiply was faster because the micro-op fusion engine wasn't clever enough to reassemble the multiply (and even if it were, you're still burning i-cache for no reason). GCC used to emit shift-and-add sequences for all constant multiplies until someone benchmarked it on an Athlon (which had two multiply units and one shift unit) and found that it was much faster to just emit a multiply.

      --
      I am TheRaven on Soylent News
    19. Re:Premature optimization is evil... and stupid by Estanislao+Martínez · · Score: 1

      I think that the premature optimization claims are way overdone. In the cases where performance does not matter, then sure, make the code as readable as possible and just accept the performance. However, sometimes it is known from the beginning of a project that performance is critical and that achieving that performance will be a challenge. In such cases, I think that it makes sense to design for performance.

      Well, and then there's another approach, where you first write a fully-functional and readable implementation of the solution without regard to performance until you get it right, then rewrite the really critical parts from scratch to be a lot faster.

      I've been involved with projects that went like this. Typically, the first stage is necessary because the task is very exploratory--e.g., write a fairly generic computation engine that processes user-defined formulas. The first pass is slow, but it serves to prove that you're on the right track, and then you rewrite it to be fast (typically by changing it from an interpreter-like design to a compiler-like one).

      I work on software for processing medical device data and performance is often critical. You probably want an image display to update very quickly when it is providing feedback to the doctor guiding a catheter toward your heart, for example.

      Well, yeah, real-time is a requirement that must be built into the design.

      We had one project where the team decided to start over with a clean framework without concern for performance -- they would profile and optimize once everything was working. They followed the advice of many a software engineer: their framework was very nice, replete with design patterns and applications of generic programming, and entirely unscalable beyond a single processor core.

      Partly this is an indication of how all the faddish evangelism about "frameworks" and "design" is often nonsense, and in this case, damned by the stated goals. Their claim that the framework was "generic" is contradicted by the fact that it doesn't scale beyond a single core. Basically, "generic" is supposed to mean that as few assumptions as possible are built in, yet the framework slipped in a big one-core assumption.

      There were no performance tests done during development, and of course the timeline was such that there would only be minimal time for optimization once the functionality was complete.

      And that of course is as big of an error as any of the other ones.

    20. Re:Premature optimization is evil... and stupid by smash · · Score: 1

      Does that code change if you use the arch flags for GCC to generate AMD64 or at least i686 code?

      Not taking the piss... I have no idea - I just noticed you didn't use any architecture-specific flags, so it's no doubt defaulted to dumb but compatible code?

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    21. Re:Premature optimization is evil... and stupid by smash · · Score: 1

      UH... delete that comment, i didn't see -march=core2. Sorry....

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    22. Re:Premature optimization is evil... and stupid by smash · · Score: 2, Interesting

      by not thinking about performance you can make it expensive or impossible to improve things later without a substantial rewrite.

      "Not thinking about performance" is different from writing in high level first.

      Get the algorithm right first, THEN optimise hot spots.

      Starting out with ASM makes it a lot more time consuming/difficult to get many different algorithms written, debugged and tested. The time you spend doing that is time better spent testing/developing a better algorithm. Only once you get the algorithm correct should you break out the assembler for the hotspots WITHIN that algorithm.

      If you're writing such shitty code that it's "impossible to optimize later" then I don't think starting out in ASM will help you. You'll just have slightly faster shitty code.

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    23. Re:Premature optimization is evil... and stupid by smash · · Score: 1

      On the contrary, i'd be more concerned that the medical software is CORRECT. You can throw more hardware at the problem to make it faster. You can't throw more hardware at the problem to correct bugs.

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    24. Re:Premature optimization is evil... and stupid by Anonymous Coward · · Score: 0

      But you can't ignore efficiency :)

    25. Re:Premature optimization is evil... and stupid by Anonymous Coward · · Score: 0

      The real point is that it doesn't matter if you use a shift or a multiply, the compiler can usually figure out the fastest way to order instructions for the current technology. (And better yet, as technology changes and improves, new compilers can automatically adapt your old program to the new shiny. What an idea.)

      There's no need to "optimize" at such a low level; focus on things compilers aren't good at, like coming up with really great algorithms.

    26. Re:Premature optimization is evil... and stupid by kc8apf · · Score: 5, Insightful

      Having spent 4 years being one of the primary developers of Apple's main performance analysis tools (CHUD, not Instruments) and having helped developers from nearly every field imaginable tune their applications for performance, I can honestly say that regardless of your performance criteria, you shouldn't be doing anything special for optimization when you first write a program. Some thought should be given to the architecture and overall data flow of the program and how that design might have some high-level performance limits, but certainly no code should be written using explicit vector operations and all loops should be written for clarity. Scalability by partitioning the work is one of those items that can generally be incorporated into the program's architecture if the program lends itself to it, but most other performance-related changes depend on specific usage cases. Trying to guess those while writing the application logic relies solely on intuition which is usually wrong.

      After you've written and debugged the application, profiling and tracing is the prime way for finding _where_ to do optimization. Your experiences have been tainted by the poor quality of tools known by the larger OSS community, but many good tools are free (as in beer) for many OSes (Shark for OS X as an example) while others cost a bit (VTune for Linux or Windows). Even large, complex multi-threaded programs can be profiled and tuned with decent profilers. I know for a fact that Shark is used to tune large applications such as Photoshop, Final Cut Pro, Mathematica, and basically every application, daemon, and framework included in OS X.

      What do you do if there really isn't much of a hotspot? Quake 3 was an example where the time was spread out over many C++ methods so no one hotspot really showed up. Using features available in the better profiling tools, the collected samples could be attributed up the stack to the actual algorithms instead of things like simple accessors. Once you do that, the problems become much more obvious.

      What do you do after the application has been written and a major performance problem is found that would require an architectural change? Well, you change the architecture. The reason for not doing it during the initial design is that predicting performance issues is near impossible even for those of us who have spent years doing it as a full time job. Sure, you have to throw away some code or revisit the design to fix the performance issues, but that's a normal part of software design. You try an approach, find out why it won't work, and use that knowledge to come up with a new approach.

      The largest failing I have seen in my experience has been the lack of understanding by management and engineers that performance is a very iterative part of software design and that it happens late in the game. Frequently, schedules get set without consideration for the amount of time required to do performance analysis, let alone optimization. Then you have all the engineers who either try to optimize everything they encounter and end up wasting lots of time, or do the initial implementation and never do any profiling.

      Ultimately, if you try to build performance into a design very early, you end up with a big, messy, unmaintainable code base that isn't actually all that fast. If you build the design cleanly and then optimize the sections that actually need it, you have a more maintainable code base that meets the requirements. Be the latter.

      --
      kc8apf
    27. Re:Premature optimization is evil... and stupid by wirelessbuzzers · · Score: 1

      GCC on x86 these days likes to emit small multiplies as one or two lea instructions. It gives you a = b + [1248]c + const. This lets you multiply by 2,3,5 or 9 in one cycle, along with the shifter which lets you multiply by 1,2,4,8,... in one cycle. Between these, you should be able to multiply by any constant up to 21 in 2 cycles, and add a constant to boot. Similarly, you can multiply by a smaller range of values and add another register and a constant as well.

      You can do this even better on ARM, where every instruction gets a free shift or rotate. In 2 cycles you can multiply by thousands of different values.

      Regardless of this, on any platform with a multiplier, the multiplier is faster for some random unknown value. For a small known value, though, add/shift ladders may be faster.
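
      A sketch of how those lea-style multiplies compose, written in C rather than assembly. lea_like is a made-up helper that only models the shape of the instruction (base + scale*index + constant, scale limited to 1, 2, 4 or 8); whether a given compiler emits lea for any of this is up to the compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Models one lea: base + scale*index + displacement, scale in {1,2,4,8}. */
uint32_t lea_like(uint32_t base, uint32_t index, unsigned scale, uint32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + scale * index + disp;
}

/* One step: x*5 = x + 4*x. */
uint32_t times_five(uint32_t x)
{
    return lea_like(x, x, 4, 0);
}

/* One step with the constant thrown in: x*9 + 7 = x + 8*x + 7. */
uint32_t times_nine_plus_seven(uint32_t x)
{
    return lea_like(x, x, 8, 7);
}

/* Two steps: x*15 = (x*5)*3, i.e. t + 2*t with t = x + 4*x. */
uint32_t times_fifteen(uint32_t x)
{
    uint32_t t = lea_like(x, x, 4, 0);
    return lea_like(t, t, 2, 0);
}

/* Illustrative checks of the three compositions. */
void check_lea_examples(void)
{
    assert(times_five(6u) == 30u);
    assert(times_nine_plus_seven(3u) == 34u);
    assert(times_fifteen(6u) == 90u);
}
```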

      --
      I hereby place the above post in the public domain.
    28. Re:Premature optimization is evil... and stupid by Anonymous Coward · · Score: 2, Insightful

      I totally agree about the evils of premature optimisation, however I also think correct choice of algorithm and data structures is vital, and does not necessarily need to be left to the end. Often the correct choice of algorithm and data structure not only results in faster code, but more readable and maintainable code too. Of these, the one I most often see overlooked is the data structure. For me learning functional programming really helped with this.

    29. Re:Premature optimization is evil... and stupid by igb · · Score: 1

      Which is all true, but I think misses one important point: you do need to consider the complexity order of your algorithms. I've seen several applications which do ludicrous O(n^2) or worse operations, and by the time they get to see a dataset large enough to provoke serious problems no-one dares touch the basic methods, so all that's left is local code optimisation or throwing more hardware at the problem. And if the problem is that you used an O(n^2) sort rather than an O(n log n) sort, or you computed a cross-product of two large tables (O(n^2) at least, for both space and time), or you needlessly wrote some hideous O(n!) search which is NP complete, then no amount of profiling and instruction tuning is ever going to help you. I think that at the outset of a design process, it's important to consider the complexity order of any algorithm used which has the potential to process large amounts of data, as otherwise you can see some startlingly bad performance problems as you scale.
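
      A small illustration of the kind of difference I mean, with invented function names: two ways to ask whether an array contains a duplicate. The first compares every pair and is O(n^2); the second sorts a copy and scans neighbours, which is typically O(n log n) (the C standard does not actually promise that for qsort, so treat it as the usual case rather than a guarantee).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* O(n^2): compare every pair of elements. */
bool has_duplicate_quadratic(const int *values, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (values[i] == values[j]) {
                return true;
            }
        }
    }
    return false;
}

/* Three-way comparison for qsort, written to avoid overflow. */
static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort a copy, then any duplicates must be adjacent.
   Falls back to the quadratic version if the copy cannot be allocated. */
bool has_duplicate_sorted(const int *values, size_t n)
{
    if (n < 2) {
        return false;
    }
    int *copy = malloc(n * sizeof *copy);
    if (copy == NULL) {
        return has_duplicate_quadratic(values, n);
    }
    memcpy(copy, values, n * sizeof *copy);
    qsort(copy, n, sizeof *copy, compare_ints);

    bool found = false;
    for (size_t i = 1; i < n && !found; i++) {
        found = (copy[i - 1] == copy[i]);
    }
    free(copy);
    return found;
}

/* Illustrative checks: both versions must give the same answers. */
void check_duplicates(void)
{
    const int with_dup[] = {3, 1, 4, 1, 5};
    const int no_dup[] = {3, 1, 4, 5, 9};
    assert(has_duplicate_quadratic(with_dup, 5) && has_duplicate_sorted(with_dup, 5));
    assert(!has_duplicate_quadratic(no_dup, 5) && !has_duplicate_sorted(no_dup, 5));
}
```

      On five elements nobody cares; on five million the first version is the one no amount of instruction tuning will rescue.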

    30. Re:Premature optimization is evil... and stupid by epine · · Score: 3, Insightful

      That's the main reason why I want to shoot people who write "clever" code on the first pass.

      Over the years, I've grown to hate this meme. Not because it isn't right, but because it stops ten floors below the penthouse of human potential.

      First of all, it's an incredible instance of cultural drift. In the mid 1980s, when this meme was halfway current, I worked on adding support for Asian characters to an Asian-made PC. On the "make it right" pass it took 15s to update the screen after pressing the page down key, and this from assembly language. Slower than YouTube over 300 baud. It was doing a lot of pixel swizzling it shouldn't have been, because the fonts were supplied in a format better suited to printing. This was an order of magnitude below an invitation to whiffle-ball training camp. This was Lance Armstrong during his chemotherapy years towing a baby trailer. Today you get 60fps with a 100 thousand or a 100 million polygons, I've sort of lost track.

      Let's not shunt performance onto the side track of irrelevancy. While there's no good excuse, ever, for writing faulty code, an enlightened balance between starting out with an approach you can live with, and exploiting necessary cleverness *within your ability* goes a long way.

      How about we update Knuth's arthritic maxim? Don't tweak what you don't grok. If you grok, use your judgement. Exploit your human potential. Live a little.

      The books I've been reading lately about the evolution of skills in the work place suggest that painstaking reductive work processes are on their way to India. Job security in home world is greatly enhanced if you can navigate multiple agendas in tandem, exploiting more of that judgement thing.

      One of the reasons Carmack became so successful is that he didn't waste his effort looking for excuses to deprive his co-workers of their oxygen bits. Instead he conducted shrewd excursions along the edge of the envelope in pursuit of the sweet spot between cleverness too oppressive to live with, and no performance at all.

      In my day of deprecating my elders, I always knew where the pea was hidden under the mattress. These days, there are so many squishy mattresses stacked one upon the other, I have to plan my work day with a step ladder. Which I think is what this unwatchable cult-encoded video is on about: the ankle level view most of us never see any more.

      Here's another thing. If you're going to be clever about how you code something, also be clever about how you do it. In other words, be equally clever at all levels of the solution process simultaneously: algorithm selection, implementation, commenting, software engineering, documentation, and unit test. Knuth got away with TeX, barely, for precisely this reason. Because of his cleverness, the extension to handle Asian languages was far from elegant. Because of his cleverness (in making everything else run extremely well), people actually wanted to extend TeX to handle Asian languages. So who's to say he was wrong? Despite his cleverness, he managed to keep his booboo score in single or low double digits. His bug tracking database fit nicely on an index card.

      In the modern era, people quote the old "make it right before you make it faster" as the cure for the halitosis of ineptitude: you're feeble and irritating, so practice your social graces. Don't make me come over there and choke off your oxygen bit. It's a long ways from saying "you have a lot of human potential, and not much experience, so let me help you confront the challenges in a meaningful way". These sayings leak a lot of sentiment about social engagement.

      Every so often I have to pull up a chair beside a junior resource and go "Dude, you're jousting at windmills here, let's roll that change back and try again. I know you can do better." Five minutes of war stories about how to shoot yourself in the foot six ways from Sunday is usually enough to rebalance the flywheel of self preservation.

    31. Re:Premature optimization is evil... and stupid by epine · · Score: 1

      Addition to previous post. I didn't make an idea quite as clear as I meant to here on the un-wiki.

      On the "make it right" pass it took 15s to update the screen after pressing the page down key, and this from assembly language.

      What I was trying to express here is that in this era, performance was a hardcore narcotic, it was pure heroin to a code junkie trying to get a 5MHz processor to update half a million green pixels faster than Google now returns a results page indexing the entire internet. Knuth's mantra, more honoured in the breach, was a secret handshake of the local AA chapter, where everyone got together to show off our needle tracks. These days, performance as an addictive drug hardly rivals a can of Jolt cola.

    32. Re:Premature optimization is evil... and stupid by Rockoon · · Score: 4, Informative

      GCC is a big offender, that's true.

      This is one of the reasons that GCC sucks compared to ICC and VC++.

      Let me give you the facts as they are today. In isolation, both the shift instructions and the multiply instructions have the same latency and throughput, and are also performed on the same execution units.

      If this were the entire story, then they would be equal. But it's not the entire story.

      The shift instructions only modify some of the flags in the flags register. Essentially, the shift instructions must do a read/modify/write on the flags. The multiplication instructions, however, alter the entire flags register, so only perform a write.

      "But Rockoon.. they are the same latency anyways, right?" .. yes, in isolation. But that read/modify/write cycle on the flags register prevents a hell of a lot of out-of-order execution.

      Essentially, one of the inputs to the shift instruction is the flags register, so all prior operations that modify the flags register must be completed first, and no instruction following the shift that also partially modifies the flags register can be completed until that shift is completed.

      In some code, it won't make any discernible difference, but in other code it will make a big difference.

      As far as that GCC compiler output.. that code is horrible, and not just because it's AT&T syntax.

      There are two alternatives here for multiplying by 4 that should be in competition here, and neither uses a shift.

      One is a straight multiplication (MASM syntax, CDECL):

      main:
      mov edx, [esp + 4] ; 32-bit version, so +4 skips the return address
      imul eax, edx, 4
      ret

      The other is leveraging the LEA instruction (MASM syntax, CDECL):

      main:
      mov eax, [esp + 4] ; 32-bit version, so +4 skips the return address
      lea eax, [eax * 4]
      ret

      The alternative LEA version on some processors (P4..), in isolation, is slower .. but it has the advantage that it uses different execution units on those very same processors, so might pair better with other stuff in the pipeline, and it doesn't touch the flags register at all.

      GCC is great at folding constants and such, even calculates constant loops at compile time.. but it's a big-time fail at code generation. GCC is one of the compilers that one optimization expert struggled with because he was trying to turn a series of shifts and adds into a single far more efficient multiplication.. the compiler converted it back into a series of shifts and adds on him. Fucking fail.

      --
      "His name was James Damore."
    33. Re:Premature optimization is evil... and stupid by Rockoon · · Score: 1

      For a small known value, though, add/shift ladders may be faster.

      This was only the case until nearly all instructions came to spend exactly 1 cycle in an execution unit. This includes shifts, adds, subs, muls, ands, ors, xors, and so on and on. That's the state of the modern processor.

      Nearly all instructions take exactly 1 cycle to actually "execute" so there is no benefit to preferring one method over the other on an "in theory" basis. Yes, "in theory" a processor can rotate a bunch of bits very quickly and it's obvious that a multiplication is much more complicated. "In practice" there is so much silicon dedicated to multiplication that it completes in the same minimum time as the shift.

      All of these instructions are 3 cycle latency on the AMD64 processors that I am familiar with (pre-phenom) and essentially the first cycle is loading the operands from the register pool into an execution unit, the second cycle is the execution, and the third cycle is retirement from the execution unit, updating the register pool. Core2's and i7's have these down to 2 cycle latency.

      Stop living in the past.

      --
      "His name was James Damore."
    34. Re:Premature optimization is evil... and stupid by iivel · · Score: 1

      Big O notation and analysis of the complexity order of algorithms is something I saw in a few classes in college a long time ago, and not something I ever paid a lot of attention to (not much of my career was actually spent developing). I would, at this point, like to delve back into the subject some ... do you have any recommended reading that discusses how to do this analysis (preferably using real world examples)?

    35. Re:Premature optimization is evil... and stupid by Theovon · · Score: 1

      Up to ARM7, the multiplier was single-cycle. It was pipelined in later designs. Either way, due to the architecture, multiply and barrel shift are the same speed on ARM.

    36. Re:Premature optimization is evil... and stupid by igb · · Score: 1

      Sorry: I have the battered notes from a lecture course twenty-five years ago...

    37. Re:Premature optimization is evil... and stupid by marcansoft · · Score: 1

      Nope. On an ARM9xxEJ-S (that's based on the ARMv5TEJ architecture), MUL takes 2 cycles plus an extra penalty cycle if the next instruction depends on the result. Data ops (which all have a free shift) take one cycle. This means you can use two ADDs (or a MOV and an ADD) to multiply by a constant with two bits set, plus an optional free LSB set, and it will only take two cycles, which is as fast as a MUL, or faster if you have a dependency on the result in the next instruction.

      Cortex-A8 (that's ARMv7) (which is used e.g. on the iPhone 3GS) does away with the dependency penalty as far as I can tell, but MUL still takes two cycles as opposed to one for ADD/MOV (with free shift).
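
      A rough C sketch of that shift-and-add trick, for illustration only. The function names are made up here, and the ARM mnemonics in the comments are what a compiler might plausibly emit, not verified output.

```c
#include <stdint.h>

/* Multiply by 10 (binary 1010, two bits set) with two shift-and-add steps.
   Each step could map to one ARM data op with a free shift, e.g.
   MOV r, x, LSL #3 followed by ADD r, r, x, LSL #1 (an assumption). */
uint32_t mul10_shift_add(uint32_t x)
{
    uint32_t r = x << 3;   /* x * 8 */
    r += x << 1;           /* + x * 2 */
    return r;
}

/* Multiply by 11 (binary 1011): two bits set plus the "free" LSB,
   since the first ADD can take the unshifted x as its other operand. */
uint32_t mul11_shift_add(uint32_t x)
{
    uint32_t r = x + (x << 3);  /* x * 9 */
    r += x << 1;                /* + x * 2 */
    return r;
}
```

      Because the arithmetic is unsigned, overflow wraps the same way a plain multiply would, so the two forms stay interchangeable.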

    38. Re:Premature optimization is evil... and stupid by chelberg · · Score: 1

      That's why they invented the size_t type!

      I use this type to ref all arrays, etc. whenever possible.

    39. Re:Premature optimization is evil... and stupid by chelberg · · Score: 2, Informative

      Two books to look at: Cormen et al., Introduction to Algorithms, and Bentley's Programming Pearls. The latter is more practical; the former is used in many CS algorithms courses.

    40. Re:Premature optimization is evil... and stupid by yuhong · · Score: 1

      Yea, the INC/DEC instructions have the same problem.

    41. Re:Premature optimization is evil... and stupid by DeadCatX2 · · Score: 1

      Essentially, the shift instructions must do a read/modify/write on the flags

      Your posts contain some awesome information, but I must say that I would be quite surprised if processors do read-modify-writes on the flag register. That just feels...dangerous, especially if two instructions want to touch different bits in the flags register.

      I vaguely remember, back when I was an undergrad designing a simple MIPS processor for my computer architecture class, that we ran a copy of the ALU's output to a gigantic NOR gate, which then hit an SR latch attached to the "zero" bit of the flags register. The "negative" bit was just the MSB run through a clock-enabled D flip flop, and so on.
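
      In software terms, that simple in-order scheme amounts to something like the sketch below. The struct and function names are hypothetical, and it models only the two flags described above, derived purely from the ALU result with no read of the old flag values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-flag model of the simple design described above. */
struct alu_flags {
    bool zero;      /* result was all zero bits (the big NOR gate) */
    bool negative;  /* copy of the result's most significant bit */
};

/* Both flags depend only on the new result, so nothing is read back
   from the previous flag state. */
struct alu_flags flags_from_result(uint32_t result)
{
    struct alu_flags f;
    f.zero = (result == 0);
    f.negative = ((result >> 31) & 1u) != 0;
    return f;
}
```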

      Now, you do prefix with "essentially", which implies to me that you're white-lying for brevity. Would you care to expand on this?

      --
      :(){ :|:& };:
    42. Re:Premature optimization is evil... and stupid by ploxiln · · Score: 1

      Your undergrad simple MIPS processor was an in-order design. Part of the point of TFA is that modern processors work totally differently - they have a register renaming unit, and tagged microcoded instructions taking multiple paths of various lengths through the pipeline, and an instruction retirement unit, oh and the branch predictor and the pipeline flush and rollback stuff... in fact real modern processors are much more complicated than even that. So there is no single physical flags register in there, but there are versioned copies of it floating around...

      Anyone feel free to correct me if you have more specific knowledge of how the flags are actually handled these days.

    43. Re:Premature optimization is evil... and stupid by wirelessbuzzers · · Score: 1

      This is only the case until most all instructions spend exactly 1 cycle in an execution unit. This includes shifts, adds, subs, muls, ands, ors, xors, and so on and on. Thats the state of the modern processor.

      mul has latency 3 on Core i7. Of course, this assumes that by "modern" you mean i7. If we're talking about mobile processors, the latency is terrible on Atom, and probably not so hot on ARM either.

      All of these instructions are 3 cycle latency on the AMD64 processors that I am familiar with (pre-phenom) and essentially the first cycle is loading the operands from the register pool into an execution unit, the second cycle is the execution, and the third cycle is retirement from the execution unit, updating the register pool. Core2's and i7's have these down to 2 cycle latency.

      Every major desktop processor built in the last 10 years (maybe more like 20?) has single-cycle latency (or less!) for most simple operations (add/sub/xor/shift), enabled by forwarding between different stages of the pipeline. I don't know if lea counts for this because it's CISC-y, but something like an add always has single-cycle latency.

      --
      I hereby place the above post in the public domain.
  5. Well that is /.'d by Com2Kid · · Score: 1

    /.'d, to say the least. Wow.

    Great lecture so far, 2 minute pauses every 20 seconds make it kind of hard to listen to though!

    1. Re:Well that is /.'d by Jorl17 · · Score: 0

      And I was here banging the computers to get them to work faster! Damn /.! Next time, tell me before you eat up another server!

      --
      Have you heard about SoylentNews?
    2. Re:Well that is /.'d by MaskedSlacker · · Score: 1

      Try waiting for it to fully buffer?

    3. Re:Well that is /.'d by Anonymous Coward · · Score: 0

      For me the link is borked. Coral cache anyone? The link is dead. It's dead, Jim! We need Miracle Max. It's not completely dead, it's only mostly dead. .....Oh wait! I'm mixing my Star Trek and Princess Bride metaphors. My bad.

  6. Skynet... by Anonymous Coward · · Score: 0

    Now that no one knows what they're doing, who's to keep them from merging. How long is it before several machines of x86 chips become self-aware? The end is nigh comrades!

    Alternatively, maybe they'll become Data.

  7. What?! by Anonymous Coward · · Score: 0

    What the fuck are you talking about. Why the hell do you need to write a compiler in assembler? Do you have any idea how a compiler works? Your last sentence suggests not.

  8. Video is a waste of time... by Anonymous Coward · · Score: 0

    I can't even watch this. Anyone got a transcript so that I can skip the video BS and just read it? I can read a lot faster than he can talk, and I wouldn't have to wait 30 minutes for the video to load (slow connection) ...

    1. Re:Video is a waste of time... by Brian+Gordon · · Score: 1

      You have to admit it's pretty nice to have the presentation slides automatically display and advance below the video as you watch..

  9. It's not just x86 by RzUpAnmsCwrds · · Score: 3, Informative

    Features like out of order execution, caches, and branch prediction/speculation are commonplace on many architectures, including the next generation ARM Cortex A9 and many POWER, SPARC, and other RISC architectures. Even in-order designs like Atom, Coretex A8, or POWER6 have branch prediction and multi-level caches.

    The most important thing for performance is to understand the memory hierarchy. Out-of-order execution lets you get away with a lot of stupid things, since many of the pipeline stalls you would otherwise create can be re-ordered around. In contrast, the memory subsystem can do relatively little for you if your working set is too large and you don't access memory in an efficient pattern.
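
    As a minimal illustration of "efficient pattern", assuming a row-major n x n matrix stored in a flat array (the function names are invented for this sketch): both functions below compute the same sum, but the first walks memory sequentially while the second jumps n elements between accesses, which tends to miss cache once the matrix outgrows it.

```c
#include <stddef.h>

/* Row-major walk: consecutive accesses touch adjacent addresses. */
long sum_row_major(const int *m, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            total += m[i * n + j];
    return total;
}

/* Column-major walk over the same data: each access strides
   n * sizeof(int) bytes away from the previous one. */
long sum_col_major(const int *m, size_t n)
{
    long total = 0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            total += m[i * n + j];
    return total;
}
```

    How large the gap is depends on the matrix size and the cache hierarchy, so it is worth timing both on the target machine rather than assuming a figure.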

    1. Re:It's not just x86 by pla · · Score: 1

      Out-of-order execution lets you get away with a lot of stupid things, since many of the pipeline stalls you would otherwise create can be re-ordered around.

      ...Of course, since the biggest bottlenecks in code usually occur in essentially serial sections of code, you can't just reorder them and hope for the best. In which case, the coder who understands where the target CPU will stall can still blow away even the best of compilers.

      TFA using examples such as shift-vs-multiply sound like your grandfather complaining that you don't double-clutch on downshifting into first gear. "True" in the sense that yes, it once had meaning and no longer does - But totally wrong in the sense that people who think about their code at that level have moved beyond such trivialities and onto actual modern ones such as how to feed N pipelines so as to minimize stalls, or what degenerate conditions flog the latest branch prediction techniques (or more usefully, as a classic example, how to write your code so as to minimize branching).


      In contrast, the memory subsystem can do relatively little for you if your working set is too large and you don't access memory in an efficient pattern.

      Now explain to a Java programmer how to align your data and code on machine-page boundaries. How to ensure that each read/write sequence takes advantage of read/write combining. How to properly interleave memory access with number crunching to minimize waiting around for a 1000x slower subsystem to wake up and respond.

      Mostly this article sounds like exactly the reasons I don't like Java for every task, and why the vast majority of Java apps feel like molasses in January despite every benchmark telling you that in theory they run just as fast as unmanaged code - Because although you can do the above, you have to work against the language rather than with it. When merely assigning a value to a basic machine-supported data type (32 bit integer, as the simple example) involves an implicit function call (and the whole stack-frame preservation that entails), the little details such as memory row size (which, incidentally, can trivially double the CPU time sitting around doing nothing if not explicitly considered) vanish in the haze of inefficiency.


      I would also point out that some of us really do still work with CPUs that have memory capacities measured in mere hundreds or thousands of bytes and an instruction set that doesn't even include a general-purpose integer multiply instruction. But I suppose we can't fault TFA's author for overlooking the single dominant use of CPUs in our world today, the ultra-low-horsepower embedded market, now can we?

    2. Re:It's not just x86 by RzUpAnmsCwrds · · Score: 1

      ...Of course, since the biggest bottlenecks in code usually occur in essentially serial sections of code, you can't just reorder them and hope for the best.

      Well, that's highly dependent on your code. If you're writing something like matrix manipulation or most image processing routines, none of your code is particularly serial. You can even get away with things like WAW hazards because of register renaming.

      If you've already done all of the high-level optimization that you can, maybe it's time to start looking at a VTune or CodeAnalyst and figuring out where your branches are being mispredicted and where you're seeing stalls. But the reality is, those optimizations get you the last 20%, which doesn't mean shit if your algorithm is inefficient or accesses memory inefficiently.

      TFA using examples such as shift-vs-multiply sound like your grandfather complaining that you don't double-clutch on downshifting into first gear. "True" in the sense that yes, it once had meaning and no longer does - But totally wrong in the sense that people who think about their code at that level have moved beyond such trivialities and onto actual modern ones such as how to feed N pipelines so as to minimize stalls, or what degenerate conditions flog the latest branch prediction techniques (or more usefully, as a classic example, how to write your code so as to minimize branching)

      I also hate the 'instruction weenie' optimizations. It doesn't matter (for example) if an integer multiply instruction has two-cycle latency and a shift instruction has one-cycle latency. Something like 1 in 5 instructions is a memory access, and another 1 in 5 is a branch. Both of those are potentially far more problematic than an extra cycle that's probably going to be scheduled around anyway.

      Mostly this article sounds like exactly the reasons I don't like Java for every task, and why the vast majority of Java apps feel like molasses in January despite every benchmark telling you that in theory they run just as fast as unmanaged code - Because although you can do the above, you have to work against the language rather than with it.

      Pretty much no benchmark shows Java (or .NET) running as fast as unmanaged code with a decent compiler; even the best JIT runtimes usually come out in the 50-70% range.

      Also, Java doesn't feel slow because of execution performance. It feels slow because it has crappy UI libraries that are slow. There are many, many GTK+/Python apps that are perfectly fine despite the fact that Python is abysmally slow compared to even Java.

      When merely assigning a value to a basic machine-supported data type (32 bit integer, as the simple example) involves an implicit function call (and the whole stack-frame preservation that entails)

      I'm not sure where you're getting this, but assigning to an int (not an Integer) in Java does not involve a function call in any mainstream JRE I'm familiar with; indeed, it performs very similarly to assignment in C.

      The big fault of Java (and also .NET) is that the JIT doesn't have very much time to optimize. Compared with unoptimized C/C++ compilers, Java is considerably faster. It's only once you add the substantial benefits of optimization (loop unrolling, constant propagation, function inlining, instruction scheduling, and a whole host of other optimizations) that the JIT starts to look pretty crappy.

      Simple cases, like assignment to an int, are well-optimized by the JIT.

    3. Re:It's not just x86 by Anonymous Coward · · Score: 0

      Goretex A8?

      OMG wearable processors are here!!

  10. I hate flash video by Omnifarious · · Score: 1

    I wish they'd all just use HTML5 or put it on YouTube so I can use youtube-dl or something. Otherwise it either doesn't work at all (my amd64 Linux boxes) or is slow and jerky (my Mac OSX box). It's really frustrating.

    1. Re:I hate flash video by Anonymous Coward · · Score: 0

      You hate Flash video and you list YouTube as an alternative? This Flash player reads an FLV file just like YouTube does. You can use Firebug to pull the FLV URL and play it however you like (VLC, mplayer, etc.)

    2. Re:I hate flash video by Korin43 · · Score: 1

      You do realize that there's a native 64 bit version of flash for Linux now right?

    3. Re:I hate flash video by Omnifarious · · Score: 1

      With all the security holes in flash and Adobe's inability to fix them in a timely fashion I would prefer to use Adobe's flash on a box I care less about than my Linux boxes.

    4. Re:I hate flash video by Quantumstate · · Score: 1

      I use it and I find that it is buggy on my machine. I don't know about other people, but mine randomly stops working from time to time and needs a browser restart.

  11. It's just outdated knowledge by Sycraft-fu · · Score: 2, Informative

    People learn a trick way back when, or hear about the trick years later, and assume it is still valid. Not the case. Architectures change a lot and what used to be the best way might not be anymore.

    Michael Abrash, one of the all time greats of optimization, talks about this in relation to some of the old tricks he used to use. One was to use XOR to clear a register on x86. XORing a register with itself gives 0, of course, and turned out to be faster than writing an immediate value of zero in to the register. Reason is that loading a value was slower than the XOR op, and the old CPUs had no special clear logic, zero was just another number.

    Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

    However, you'll still hear people say it is a great trick because they haven't updated their knowledge.

    1. Re:It's just outdated knowledge by marcansoft · · Score: 1

      Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

      I'm definitely no expert on x86, but my impression was that precisely because of this trick that everyone does, modern CPUs still do xor reg,reg at least as fast as moving 0. Because they want existing code to run as fast as possible, and in x86 compatibility-is-king land, that means optimizing for the common-if-weird cases, not the sane cases.

    2. Re:It's just outdated knowledge by Cassini2 · · Score: 3, Informative

      I'm definitely no expert on x86, but my impression was that precisely because of this trick that everyone does, modern CPUs still do xor reg,reg at least as fast as moving 0.

      You are correct. XOR reg,reg was such a common instruction on the x86, that essentially it became the special case CLR instruction. Essentially, if you see a CLR instruction on an x86 assembly printout, it is the XOR instruction in disguise. The x86 has no CLR instruction.

      Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

      Essentially, all current "simple" CPU instructions execute with the same speed. However, the XOR instruction is still faster than the MOV instruction because of instruction bandwidth and cache effects. Most code today is limited by cache and bandwidth limits, like the need to load instructions into the instruction decode pipeline immediately after a jump instruction. With 32-bit registers, the MOV reg, 0 immediate move is a five-byte instruction (opcode plus a 32-bit immediate), while XOR reg, reg is a two-byte instruction. As such, in real code, the XOR instruction is usually slightly faster, because it results in smaller code.

      Additionally, all of the modern x86 CPU implementations special case the XOR reg,reg instruction into a MOV reg, 0 immediate move instruction inside the instruction decode stage anyway. As such, no significant functional difference exists. The only case where a move instruction is quicker is when the condition codes are propagating a side-effect via the condition code registers. Thus, in theory:
      ADD AL, AH
      MOV CL, 0
      JC somewhere

      should execute quicker with a MOV instruction as opposed to a XOR instruction. However, in practice, this piece of code:
      XOR CL, CL
      ADD AL, AH
      JC somewhere

      executes with exactly the same speed, because the out-of-order execution units inside the x86 automatically optimize the code and make it equivalent. As such, you are best with the "short small" code, which means that the XOR reg, reg instruction is still the fastest way to do a register clear.

    3. Re:It's just outdated knowledge by SpinyNorman · · Score: 2, Informative

      Actually the reason us old fogies normally used XOR A, A rather than LD A, 0 wasn't because it was faster but rather because it was smaller - 1 byte rather than two bytes (instruction + immediate operand). On the old memory constrained 8-bitters, these assembly "tricks" were all about saving a byte here, another byte there...

    4. Re:It's just outdated knowledge by BZ · · Score: 3, Interesting

      The smaller instructions are still worth it, not so much because of main RAM size constraints but because of cache size constraints. Staying in L1 is great if you can swing it; falling out of L2 blows your performance out of the water.

      Most recently, just iterating over an array and doing a simple op on each entry became about 2x faster on my machine by going from an array of ints to an array of unsigned chars (all the entries are guaranteed in unsigned char range). Reason was, the array of ints was just about the total size of my L2... and the new array is 1/4 the size, which means there's space for other things too (like the code).
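
      A small sketch of the arithmetic behind that change. The helper names are hypothetical and the element counts are left to the caller; the point is only that the same entries take a quarter of the space when stored as unsigned char instead of 32-bit int.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes occupied by n entries stored as 32-bit ints vs. unsigned chars. */
size_t bytes_as_int32(size_t n) { return n * sizeof(int32_t); }
size_t bytes_as_uchar(size_t n) { return n * sizeof(unsigned char); }

/* The same kind of simple per-entry pass, over the narrow array.
   The accumulator stays wide so the sum itself cannot overflow a byte. */
unsigned long sum_entries(const unsigned char *a, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}
```

      This only works when every value is known to fit in 0..255, as stated above; otherwise the narrowing silently changes the data.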

    5. Re:It's just outdated knowledge by Rockoon · · Score: 1

      Smaller is better in many cases, but not all. For example, using the small 'loop' opcode is a bigtime performance fail on the average x86 desktop of today (still highly biased towards Athlon64 and Pentium4) and that was even true back in the days of the 80386

      Back in the 8088 days, ENTER / LEAVE were preferred for stack frames, and then somewhere around the 80386 it became better to PUSH EBP .. MOV EBP, ESP / POP EBP .. and now again ENTER and LEAVE are preferred (this time, for their size).

      But the killer fun fact of this post is that assembly language programmers don't often use stack frames. Stack frames are for debuggers, so that the debugger can figure out what's going on during a break and how the machine code relates to the high-level source code. Assembly language programmers don't need to relate to high-level source code .. the machine code is the code .. assembly language programmers calculate where parameters are on the stack directly relative to the current stack pointer, rather than create a stack frame. This saves, on average, several cycles per function call.

      --
      "His name was James Damore."
    6. Re:It's just outdated knowledge by Anonymous Coward · · Score: 0

      Not to mention having ebp free means another register... goddamnit I hated going from 68k to x86 ><

    7. Re:It's just outdated knowledge by Hurricane78 · · Score: 1

      In conclusion: XOR reg,reg and MOV reg,0 have set their Facebook relationship status to “it’s complicated”? ^^

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
  12. Kung-Fu and Ninjitsu...They're not dead! by geekmux · · Score: 1

    This just in...Apparently Bruce Lee and Lee Van Cleef are alive and well and working for Intel, which likely accounts for all the "crazy kung-fu and ninjitsu" going on there...

  13. Another fascinating Click talk by Anonymous Coward · · Score: 0
  14. rule of the code by Bork · · Score: 3, Informative

    Just write good clean code that works properly first. The only time you optimize is after it has been profiled to see if there are troublesome spots. Given the way CPUs run and how compilers are designed, there is very little need to do optimization. Unless you have taken some serious courses on how current CPUs work, your efforts will mostly result in bad code that gains you nothing with respect to speed. Your time is better spent on writing CORRECT code.

    The compilers are very intelligent about proper loop unrolling, rearranging branches, and moving instruction code around to keep the CPU pipeline full. They will also look for unnecessary/redundant instructions within a loop and move them to a better spot.

    One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times, I got the best time by playing with the compiler flags.

    1. Re:rule of the code by XMunkki · · Score: 2, Informative

      I agree that many low-level programming methods aren't that necessary anyhow, but there is one big point where the compiler cannot help much, and that is data layout. Big hits come from all levels of cache misses, and it's good for the programmer to be aware of this, benchmark the memory access patterns, and try to make them good (predictable, linear, clumping frequently used data, etc.). On some hardware, load-hit-stores are something to be aware of as well. A reasonable thing to do when optimizing something is to fiddle with the code a bit and see what generates the best assembly. This is usually a good compromise (you still stay at a higher level and get portable code with some gained performance on at least one platform).

      Compilers still aren't a magic wand everywhere, especially when going to deeply embedded or specialized hardware. One example is SPU programming. Since SPUs read and write everything from/to 16-byte-aligned addresses, the current GCC compiler generates lots of "align ptr, load, rotate, calculate, rotate, combine, store" sequences. If you want good SPU performance, going into ASM is indeed a viable option, though most of the time staying with intrinsic functions gives you an adequate compromise. Since SPUs are basically fast DSPs, many of the tasks run on them are quite repetitive in nature, with a short amount of work per item and millions of items (vertex transforms, simulating some post-processing effect, mixing audio, etc.). But a good programmer always benchmarks first and checks the compiler output before hitting the deck with raw assembly.

    2. Re:rule of the code by SorcererX · · Score: 2, Informative

      You got the fastest time simply by playing with the compiler flags? We had a similar problem where we had to do a matrix multiplication on symmetric matrices for C = AB^T+BA^T (rank-2k update with alpha=1.0, beta=0.0), and there was nothing the compiler could do for us to get even remotely near good scores. Doing the simplest implementation, we got about 5 FLOPS/cycle on an 8-core system; optimizing just with SSE etc., I got it up to about 13 FLOPS/cycle; and by splitting up the matrix into tiny parts to avoid cache thrashing etc., I was able to get it up to 47 FLOPS/cycle. For comparison, Intel's MKL library managed about 85 FLOPS/cycle on the same hardware. I believe the best in my class was about 50 FLOPS/cycle, and it took an insane amount of fiddling for any of us to get above 25-30 FLOPS/cycle or so. That said, most things done on a computer are rarely that limited by memory access, and then the compiler does an awesome job :)

      --
      Any sufficiently advanced technology is indistinguishable from magic.
    3. Re:rule of the code by wirelessbuzzers · · Score: 2, Informative

      One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard in trying to optimize their code to get better times, I got the best time by playing with the compiler flags.

      Really? Because I had a similar assignment (make Strassen's algorithm as fast as possible, in the 5-10k range) in my algorithms class a while back. I found that the key to a blazing fast program was careful memory layout: divide the matrix into tiles that fit into L1, transpose the matrix to avoid striding problems. Vectorizing the inner loops got another large factor. Compiling with -msse3 -march=native -O3 helped, but the other two were critical and took a fair amount of effort.
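
      For readers who haven't seen tiling, here is a minimal sketch of the idea applied to a plain (non-Strassen) multiply. The tile edge of 32 and the function name are assumptions made for this example, and there is no vectorization or transposition here; it only shows the loop structure that keeps a small block of each matrix hot in cache.

```c
#include <stddef.h>

#define TILE 32  /* assumed tile edge; tune for the target's L1 size */

/* C += A * B for n x n row-major matrices, visited one tile at a time.
   The caller is expected to zero C first if a plain product is wanted. */
void matmul_tiled(size_t n, const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                /* Clamp tile bounds so n need not be a multiple of TILE. */
                size_t i_end = (ii + TILE < n) ? ii + TILE : n;
                size_t k_end = (kk + TILE < n) ? kk + TILE : n;
                size_t j_end = (jj + TILE < n) ? jj + TILE : n;

                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        /* Innermost loop walks rows of B and C linearly. */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

      With 8-byte doubles, three 32 x 32 tiles come to 24 KB, which is why 32 is a common starting guess for a 32 KB L1 data cache; the right value should be measured.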

      --
      I hereby place the above post in the public domain.
    4. Re:rule of the code by SorcererX · · Score: 1

      indeed, we did the same thing, except we fit the matrix into L1 tiles, L2 tiles and CPU-cache tiles (two cpus with 4 cores each), did vectorizing of inner loops, and also unrolled the inner loops (with #pragma unroll). In addition, since we were working with symmetrical matrices, and only needed to calculate half the matrix (along the diagonal), I had to fiddle some with the scheduling for the OpenMP pragma to get the best possible performance out of it.

      --
      Any sufficiently advanced technology is indistinguishable from magic.
    5. Re:rule of the code by Anonymous Coward · · Score: 1, Interesting

      Thanks for rolling out the received wisdom. Like most received wisdom, it is of course wrong, but applies in enough cases and for enough people to be useful.

      What you must bear in mind though is that something like Amdahl's law applies to hotspot optimization. Even if you make the hotspot take zero time, the speed of your code is still limited by the performance of the non-hotspots. This leads to a phenomenon I name "uniformly slow code".
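
      To put numbers on that, here is the usual Amdahl-style formula as a tiny C helper. The function name and the example fractions are illustrative assumptions, not measurements from any real program.

```c
/* Overall speedup when a fraction `hot` (0..1) of the original run time
   is made `factor` times faster and the rest is left untouched. */
double amdahl_speedup(double hot, double factor)
{
    return 1.0 / ((1.0 - hot) + hot / factor);
}
```

      For example, if the hotspot is 60% of the run time, the overall speedup can never reach 2.5x no matter how large `factor` gets, because the remaining 40% still has to run.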

      If your mission is actually to make fast code you need to start from scratch, and create a perfect algorithm, given the hardware constraints. But this is too hard, and too much work in most cases to be worthwhile, hence the hotspot advice.

      As for "your time is better off writing correct code" ... well, if you can't write correct code, go become an actor or something. This is a pre-requisite, not a target.

      You have a lot of confidence in the ability of compilers to work magic. The average compiler is actually pretty bad at all this stuff, and you can forget it on any slightly novel architecture (e.g. PPC vs x86). Knowing how to do it yourself is still useful - or essential, depending on what you're trying to do. Every compiler I've used recently has bugs in it regarding floating-point re-ordering anyway (and I mean *every* compiler - gcc 4 is even worse than MSVC at this) so don't lean on this too heavily. Remember what you said about CORRECT code? It's a lot easier to tweak a flag than it is to robustly test that the algorithm is still working properly.

    6. Re:rule of the code by Jeppe+Salvesen · · Score: 1

      Write good, clean, well-designed and efficient code. The compiler will not make your mergesort into a quicksort.

      --

      Stop the brainwash

  15. In C/C++ shift is not the same as multiply/divide by perpenso · · Score: 2, Interesting

    Using shift to multiply is often a great idea on most CPUs.

    In C/C++ shift is not the same as multiply/divide by 2. Multiplication and division operators have a different precedence level than shift operators. Not only is there the possibility of poor optimization but such a substitution may lead to a computational error. For example mul/div has a higher precedence than add/sub, but shift has a lower precedence:

    printf(" 3 * 2 + 1 = %d\n", 3 * 2 + 1);
    printf(" 3 << 1 + 1 = %d\n", 3 << 1 + 1);
    printf("(3 << 1) + 1 = %d\n", (3 << 1) + 1);

    3 * 2 + 1 = 7
    3 << 1 + 1 = 12
    (3 << 1) + 1 = 7

    --
    Perpenso Calc for iPhone and iPod touch, scientific and bill/tip calculator, fractions, complex numbers, RPN

  16. Lots of low Slashdot IDs commenting on this... by slagheap · · Score: 1

    Let's start the pissing contest:

    I have a 6-digit slashdot ID. Beat that you newbs!

    --
    First against the wall when the revolution comes
    1. Re:Lots of low Slashdot IDs commenting on this... by the+brown+guy · · Score: 1

      7 digits sucka and I WIN

      --
      Orbis terrarum est non altus satis
    2. Re:Lots of low Slashdot IDs commenting on this... by iammani · · Score: 1

      Meh, mine is a 0-digit UID. I win, game over!

    3. Re:Lots of low Slashdot IDs commenting on this... by iammani · · Score: 1

      Er, scratch that, it was supposed to be Anonymous!

    4. Re:Lots of low Slashdot IDs commenting on this... by chill · · Score: 1

      Who you talking to, noob?

      --
      Learning HOW to think is more important than learning WHAT to think.
    5. Re:Lots of low Slashdot IDs commenting on this... by Just+Some+Guy · · Score: 1

      Yo mama, son.

      --
      Dewey, what part of this looks like authorities should be involved?
    6. Re:Lots of low Slashdot IDs commenting on this... by daveb1 · · Score: 1

      well are you compensating for something then ? :P

  17. Pfft. by dtmos · · Score: 1

    My first programming was putting the little white plastic straws on a Digi-Comp 1.

    (And it really was in a snowstorm -- I got it as a Christmas present.)

  18. Another Imbecile incompetently running video by rfc1394 · · Score: 0, Offtopic

    I am guessing that the site was slashdotted because the video never ran. Yet another example of some imbecile who designs their own video player and either can't run the material correctly or can't handle the load. I see this over and over, someone - or some site - decides to run their own video player and it's either inoperative or runs badly. I wrote about this on my blog in October 2008 how so many places try - and fail - to properly run video.

    You know, running video correctly isn't rocket science; YouTube does it fine under loads that would slashdot Slashdot. But do these stupidos use YouTube to serve their video? Noooo, they'd prefer to use some incompetent who can't provide it properly, probably because they're under the impression they'd lose ad revenue or something, I guess. But I see this all the time. The New York Times provides video for some of their stories, but their video doesn't work and stalls, and there is no way to cache the video so that if it fails you can either get it to run smoothly or go back and run it again without having to download the entire video all over again after it's already been served. I guess they never thought about people having problems.

    If these were streamed video like a live event, that would be one thing. But they do the exact same thing YouTube does, they feed stored video to a player written using Adobe Flash. So there's no excuse for their failures except pure incompetence and/or stupidity.

    --
    The lessons of history teach us - if they teach us anything - that nobody learns the lessons that history teaches us.
  19. Virtualization by Anonymous Coward · · Score: 0

    Nobody cares about performance, or what exactly the code does, since java takes care of all those "pesky details".

    Virtual machines use a very simple instruction set, hopefully optimized for the processor, hopefully optimized by the OS, which is hopefully optimized for the processor.

    Want performance? Don't code in java.

    Want performance? Do some PERFORMANCE ANALYSIS.

    I know it's hard, since it requires actual MATH, something that simple programmers are not taught, and really don't care about, but it's worth it.

    Some up front optimization can save you MONTHS of recoding effort.

    Rule #1 in Performance Optimization: DON'T OPTIMIZE, PLAN!

  20. This is why x86 everywhere is a bad idea by Anonymous Coward · · Score: 0

    Sure, there are tons and tons of x86-friendly code out there, but you really don't want it running naked on power-sensitive devices such as smartphones. It is trivial these days to query a processor for its capabilities, and applications optimized for the desktop and server environments are going to run flat out, partying on every flavor of SSEx available. For x86 to be more than just an also-ran in the mobile world, systems need to be able to easily present applications with a VM view of the processor to rein in power hungry apps, IMHO.

    1. Re:This is why x86 everywhere is a bad idea by Alex+Belits · · Score: 1

      What is this, I don't even...

      Seriously, I write optimized DSP code for x86 and non-x86 architectures, and I see absolutely nothing relevant or meaningful in the above comment.

      "x86-friendly code"? Most of the code anyone ever sees is not "friendly" to any architecture -- for example, it uses way, way too much memory to be efficient at cache use, so the only "performance" the user sees is the speed of his RAM. At best someone manages to fit some code into cache sizes (that vary more within the x86 architecture than between architectures), or adapts it to various kinds of parallelism (that are usually portable in general but have to be adapted to particular implementations). The rest is a job for compilers -- and x86 is an architecture with a very long history of compiler development, so it may be better supported than some others.

      "Query a processor for its capabilities"? Why would you want to do that? What do you think a compiler optimizes for, if not the "capabilities" of a target CPU -- even if there is no generalized way to represent those "capabilities"?

      "Easily present applications with a VM view of the processor to rein in power hungry apps"? If there were a VM (that is, VIRTUAL machine) that could be easily converted into any CPU architecture, and such a representation also captured the performance of the target architecture, it would be the target of the last compiler ever written. The whole "problem" is that progress in CPU technology involves fundamentally different ways the CPU treats code (pipelines, cores, non-SIMD parallelism), memory (caches, SIMD) and I/O (bus architectures), which allow code to be optimized for those architectures -- sometimes purely by optimized compilation, sometimes by the developer consciously adapting the code for particular CPU features.

      --
      Contrary to the popular belief, there indeed is no God.
  21. ...except for the uControllers I use. by podom · · Score: 3, Interesting

    I watched about half of his presentation. I was amused because on a lot of the slides he says something like "except on really low end embedded CPUs." I spend a lot of my time programming (frequently in assembly) for these exact very low end CPUs. I haven't had to do much with 8-bit cores, fortunately, but I've been doing a lot of programming on a 16-bit microcontroller lately (EMC eSL).

    I suspect the way I'm programming these chips is a lot like how you would have programmed a desktop CPU in about 1980, except that I get to run all the tools on a computer with a clock speed 100x the chip I'm programming (and at least 1000x the performance). I am constantly amazed by how little we pay for these devices: ~10 Mips, 32k RAM, 128k Program memory, 1MB data memory and they're $1.

    But they do have a 3-stage pipeline, so I guess some of what Dr. Cliff says still applies.

    --
    We're wanted men. I have the death sentence in 12 systems!
    1. Re:...except for the uControllers I use. by walshy007 · · Score: 1

      And funnily enough, most of the cost of that would be the packaging. I'm a big fan of DIP chips because they are so easy to solder etc., yet they cost almost double what surface mount chips do. *sigh*

    2. Re:...except for the uControllers I use. by TheRaven64 · · Score: 1

      Even that niche probably won't be around for much longer. You can get 32-bit ARM and SPARC processors for not much more than you're paying. Even a high-end ARM SoC costs around $40, the low end ones are well under $10. When you can get a 50MHz 32-bit ARM core for 50, there won't be much incentive to keep using the 16-bit chips. Out of curiosity, I just had a look at what kind of price these chips fetch now. You can get an LPC2101 for around $1.50. This is a 32-bit ARM7 core (supporting 16-bit Thumb code if code density is important to you) running at 70MHz. Less RAM than your microcontrollers, but it has controllers for plugging in external memory. $5 gets you 128KB of flash and 32KB of RAM, so the prices already aren't far off what you're paying for a core that is much slower and harder to program.

      --
      I am TheRaven on Soylent News
    3. Re:...except for the uControllers I use. by DeadCatX2 · · Score: 1

      Slower, yes. Harder to program...sometimes. But much easier to actually connect the physical goods together.

      You can take an 8-pin DIP-sized PIC, hook it up to a regulator, a battery, a button, an LED, and make it blink. It doesn't have fancy power requirements. It doesn't even need an external oscillator.

      There will always be a niche for the smallest, cheapest, simplest way to turn bits on and off.

      --
      :(){ :|:& };:
    4. Re:...except for the uControllers I use. by TheRaven64 · · Score: 1

      The ARM chips I mentioned have analogue and digital I/O pins designed for exactly this kind of application, although I'm not sure if they need some external clock source.

      --
      I am TheRaven on Soylent News
    5. Re:...except for the uControllers I use. by Mr+Z · · Score: 1

      I actually got the impression he was including a lot of the 32-bit microcontrollers out there too. If you look at a lot of the embedded ARMs that show up in various SoCs, his "low end CPU" comments apply to those, too. Simple pipeline, in-order execution, etc.

    6. Re:...except for the uControllers I use. by DeadCatX2 · · Score: 1

      I'm not saying they don't have GPIO. But you need to roll your own PCB if you want to use those ARM chips. My point was that you can't just jam the thing into a breadboard and go and, even if you could, the several dozens of MHz that you don't need will destroy your battery-powered application.

      --
      :(){ :|:& };:
  22. Try FLAT Assembler by Taco+Cowboy · · Score: 1

    Try FLAT Assembler.

    A free assembler that does real wonders!

    Download from: http://flatassembler.net/download.php

    Forum: http://board.flatassembler.net/index.php

    --
    Muchas Gracias, Señor Edward Snowden !
  23. An Awesome Assembler by Taco+Cowboy · · Score: 1

    Here's an awesome assembler that does wonders --- Flat Assembler.

    Download from: http://flatassembler.net/download.php

    Forum: http://board.flatassembler.net/index.php

    --
    Muchas Gracias, Señor Edward Snowden !
  24. Code in ASM is FUN by Taco+Cowboy · · Score: 1

    I do code in HLL, but I do not give up the right to code in ASM.

    In fact, coding in ASM is super fun !

    A couple of years ago I coded in MASM, but now I use FASM (Flat Assembler) instead.

    It is available from http://www.flatassembler.net/.

    Enjoy ! :D

    --
    Muchas Gracias, Señor Edward Snowden !
    1. Re:Code in ASM is FUN by walshy007 · · Score: 1

      what's wrong with GNU AS?

    2. Re:Code in ASM is FUN by cheesybagel · · Score: 1

      Some people cannot wrap their head around AT&T assembler syntax.

  25. Real life example by Ilgaz · · Score: 1

    Coreplayer for PowerPC, on OS X, plays 720p H.264 video fine on a 1.42 GHz G4. More shocking still, its benchmark function actually shows 70-80 fps. Why? Altivec is used, along with very clever OpenGL and possibly ASM.

    Of course, some idiot will pop up and say "powerpc is dead"... Well, on an Intel Core 2 Duo, the CPU load is below 3-5%, leaving free cycles for all the amazing filters one can run. It is not just PowerPC/Altivec that gets wasted; SSE is always wasted too. I really wonder what kind of computing we would do if these guys coding x86 only and relying on automatic optimization actually knew/used SSE instructions.

  26. More of his talks? by cmason · · Score: 1

    This talk was great! But, I'd love to have seen some of the other ones Cliff offered (particularly the GC one). A quick search of google video/youtube turns up only his lock free hash table talk, which is great, but I've seen it already.

    Anyone have links to more of this guy?

    -c

    --
    "If you are an idealist it doesn't matter what you do or what goes on around you, because it isn't real anyway."-R.P.W.
  27. Done it, but hardware makes a HUGE diff. too by Anonymous Coward · · Score: 1, Interesting

    " As such, it is worth your while to see what its solution to your problem is, and then see if you can improve, rather than assuming you are smarter and can do everything better on your own. Of course all that is predicated on using a profiler first to find out where the actual problem is." - by Sycraft-fu (314770) on Thursday January 14, @06:50PM (#30772966)

    Exactly, & I tend to use a very "old-school/primitive" method of "hand-rolled profiling" (in using hi-res multimedia timers registered with the OS to do so, in order to find the 'slow spots' in my methods/subroutines/procedures/functions), &, it works (to @ least spot those areas, just as you noted).

    HOWEVER: I only took my program, noted below earlier here today also, down from roughly a 4.5 hour runtime, down to a 4 hour runtime using programmatic optimizations of varying kinds! Not a bad gain, especially by optimizing (but that took a lot of time to determine mind you, & that matters in the workplace of course).

    The program's one for my personal use here though, &, it's for:

    ----

    A.) Sorting data alphabetically in datasets/records in HOSTS files

    B.) Removing duplicated entries

    C.) Pinging + resolving DOMAIN/HOST names to IP Addresses of fav. sites I go to, to add into the HOSTS file with their correct IP-to-DOMAIN/HOST name resolution (doing it from a local HOSTS file is faster by far than calling out to a potentially compromised & SLOWER external DNS server)

    D.) Changing the preceding blocking address used on domain/host names in a HOSTS file from the larger & slower 127.0.0.1 or 0.0.0.0 blocking addresses to the smaller/faster 0 one (& vice versa IF NEED BE too)

    ----
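    Steps A, B & D above can be sketched roughly like this. This is only a minimal illustration in Python, not the Delphi program described here: the function name `normalize_hosts`, the `BLOCK_ADDRESSES` set and the choice to leave `localhost` untouched are all assumptions made for the example, and step C (pinging/resolving names) is left out.

```python
# Hypothetical sketch of steps A, B and D; not the original Delphi code.
BLOCK_ADDRESSES = {"127.0.0.1", "0.0.0.0"}  # assumed blocking addresses


def normalize_hosts(lines, block_to="0"):
    """Rewrite blocking addresses, drop duplicates, sort by host name."""
    entries = set()  # a set removes duplicated entries (step B)
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # skip malformed lines
        address, host = parts[0], parts[1].lower()
        # Step D: swap the blocking address, but keep localhost as it is
        # (an assumption of this sketch).
        if address in BLOCK_ADDRESSES and host != "localhost":
            address = block_to
        entries.add((host, address))
    # Step A: sort alphabetically by host name.
    return [f"{address} {host}" for host, address in sorted(entries)]
```

    Calling `normalize_hosts(open_file_lines)` on the lines of a HOSTS file returns the cleaned lines; passing `block_to="0.0.0.0"` would go the other way.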

    Anyhow/anyways, now that the background of what my app does is covered?

    Well - I did so via BOTH compiler level switchwork (Borland Delphi 7.1) & also hand-level ones done in Assembly language!

    (Plus, using sorted lists & a better algorithm for sorting than a single QuickSort variant on small/medium/large lists: small = insertion sort variation, medium & large = quick sort variation. I changed it on the fly depending on the size of the data being sent into the lists I used (dynamically resizing types, of course).)
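    The size-based switch between sorts can be sketched as below. This is a hedged Python illustration rather than the Delphi code: the cutoff `SMALL_LIST_LIMIT` and the function names are assumptions, and the right cutoff would have to be found by measuring.

```python
SMALL_LIST_LIMIT = 16  # assumed cutoff; tune it by measurement


def insertion_sort(items):
    """Insertion sort: cheap for short lists."""
    result = list(items)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # Shift larger elements one slot right to make room for key.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result


def adaptive_sort(items):
    """Quicksort that hands small partitions to insertion sort."""
    items = list(items)
    if len(items) <= SMALL_LIST_LIMIT:
        return insertion_sort(items)
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return adaptive_sort(smaller) + equal + adaptive_sort(larger)
```

    The three-way split keeps runs of equal keys out of the recursion, which matters for data with many repeated entries.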

    I also chose Borland Delphi because it was shown, as far back as 1997 in fact & in a COMPETING TRADE JOURNAL in computing called "Visual Basic Programmer's Journal" Sept./Oct. 1997 issue entitled "INSIDE THE VB5 COMPILER" where Delphi SOUNDLY "KNOCKED THE CHOCOLATE" out of BOTH MSVC++ &/or VB5 by DOUBLE (or, better) in BOTH Math & String processing related tasks work...

    That all "said & aside", well...?

    I only got SO MUCH out of using programmatic optimizations by hand, a 1/2 hour decrease in work time, going from 4.5 hr. runtimes down to 4 hrs. time for all tasks A-D above completing.

    I further used x86 Assembly code (inlined via the asm directive, mostly), shifts vs. multiplies, FOR loops vs. WHILE loops & more, plus better algorithms after profiling showed me where I was "slowing up" the most (via hi-res multimedia timers timing all my procedures).

    Lastly, I also used compiler switches (& also removed ones I did not need for safety once the code proved safe & accurate enough, using Try-Catch-Except/Finally errtrap methods in Delphi, doing my OWN exception handling & err trapping vs. relying only on the built-in "structured exception handlers" in the compiler itself)...

    Sure, 1/2 hr. less of 4.5 hours, down to 4 hours only, is a decent increase... but? Compared to what you get from BETTER HARDWARE??

    It pales by comparison!

    In fact, I noted this very example in another thread here today ->

    Forrester Says Tech Downturn Is "Unofficially Over":

    http://news.slashdot.org/comments.pl?sid=1508482&cid=30776266

    Where others there ar

  28. Re:In C/C++ shift is not the same as multiply/divi by physburn · · Score: 1
    Ugh, I would have guessed that shift would be top precedence, like exponentiation. Glad you posted that, it will stop some future bugs from me. I would make the excuse that I program in Java more than C, but precedence ordering is the same in Java.
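    A quick way to see the trap, sketched in Python only because it is easy to try interactively: there, as in C and Java, the shift operators bind more loosely than `+`. The variable names are just for illustration.

```python
x = 3

a = x << 1 + 1    # parsed as x << (1 + 1), i.e. x * 4
b = (x << 1) + 1  # what "x * 2 + 1" was probably meant to become
c = x * 2 + 1     # the multiply version needs no parentheses

print(a, b, c)
```

    So replacing a multiply with a shift inside a larger expression is only safe if the shift gets its own parentheses.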

    ---

    C++ Programming Feed @ Feed Distiller

  29. C++ in Quake 3? by Anonymous Coward · · Score: 0

    I'm sorry, where exactly did you find 'C++ methods' in Quake 3 code?

  30. What about prefetching? by Mr+Z · · Score: 2, Interesting

    That was a fabulous presentation, and one that I'll likely hold onto a copy of, since it describes the issue of SMP memory ordering with a great example. I'll have to write "presenter notes" for those slides, since I can't get the video to come up, but that's OK. I understand what's going on there.

    One thing I thought was notably absent was any discussion of data prefetch. With all of the emphasis on how performance is dominated by cache misses, you'd think he'd give at least a nod to both automatic hardware and compiler directed software prefetch. After all, he mentions CMT, which is a more exotic way to hide memory latency, IMHO.

    On a different note: In the example on slides 23 - 30, he shows an example where speculation allowed two cache misses to pipeline, bringing the cost-per-miss down to about half. Dunno if he highlighted the synergy here in the talk, because it wasn't highlighted in the presentation. It is useful to note, though, how overlapping cache misses reduces their cost. There can be even more synergy here than is otherwise obvious: In HPCA-14, there was a fascinating paper (slides) about how incorrect speculation can still speed up programs due to misses on the incorrectly-speculated path still bringing in relevant cache lines.

  31. Torrent for video by Anonymous Coward · · Score: 0

    It took me hours to download this video, trying several different mirrors. So, here's a bittorrent version of the video.

  32. Torrent for video by Anonymous Coward · · Score: 0

    It took me hours to download this video, trying several different mirrors. So, here's a bittorrent version of the video

  33. A lot of people have watched a different talk? by pslam · · Score: 1

    The main take-away from this talk is that the modern software engineer needs to pay more attention to memory access and data dependency.

    For some reason, the Slashdot luddites have come out in force to declare that it was actually about how inaccessible modern architectures are and how it's more proof that you should never use anything but a high level language. Nonsense.

    I see this happen every time the subject of low level architecture comes up. There's a (sadly) large proportion of engineers who vehemently refuse to learn anything below the highest levels of programming. This turns into a silly justification backed by the evidence of how complex modern architectures are.

    Some variants of this luddite behavior emerge as 'premature optimization is the root of all evil'. Yes, it's a good quote, but it's not referring to what you're referring to. There's nothing wrong with knowing in advance where the bottlenecks in a system will likely be. That's called experience. It's called knowing the characteristics of your platform. Those who stubbornly design systems without thought to performance are doomed to produce code which is inefficient, slow, and worst of all - incapable of being optimized without a re-write. Premature optimization may be bad, but preemptive optimization is a good quality to have.

    That's the second take-away, in my opinion, from the talk: Engineers are all going to have to learn how to optimize code for the architecture, because your free ride on the MHz and CPI slope has ended. Here's a clue: if you're someone who knows how it all works, can preemptively optimize their designs to better fit their system, and can use their knowledge to debug issues, you are a far more valued engineer than the others. Bear this in mind the next time you find that a million outsourced engineers can do exactly the same job as you.

  34. SMALL "Addendum", on profiling methods... apk by Anonymous Coward · · Score: 0

    Alternately, instead of using TIMERS (especially high-resolution multimedia timers registered w/ the OS), you can use prebuilt functions/procedures/methods/subroutines like JAVA's "getTime()" method @ the START of your procedures/methods/subroutines/functions instead... getting the current time in MILLISECONDS, first.

    Then, @ the END of your procedure/function/method/subroutine being timed? Get the time again, using an analog to JAVA's getTime() method, & subtract the start time (what you did @ the BEGINNING of what's being timed) from the END TIME (what you obtain @ the END of what you're timing) & voila:

    You have just EASILY "profiled" the runtime of your method/subroutine/function/procedure being timed... with relative ease!
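    The start/stop subtraction looks roughly like this. It is a minimal sketch in Python rather than Java, and the helper name `timed` is made up for the example; `time.perf_counter()` is the standard-library clock intended for measuring short durations.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()               # time @ the START
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start     # END time minus START time
    return result, elapsed


total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds * 1000:.3f} ms")
```

    A single run like this is noisy, so repeating the call a few times and looking at the minimum or median gives a steadier number.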

    APK

    P.S.=> No timers required either... Just thought I'd add that, as an 'addendum' to my original methods noted, as this omits having to use TIMERS, period... & additionally? There are even "finer grained" methods than getTime() that use MICROSECONDS (& nanoseconds, afaik, too)... some "Food 4 Thought" for you all to "drink in, & digest" here... apk

  35. x86 vs. Z80 by Osvaldo+Doederlein · · Score: 1

    As a former Z80 black-belt ninja, I'd say it was easier - except that Z80 didn't have mul/div instructions... but, who needs these? You could write a routine for this (granted: dead slow) but I've written pretty significant programs without ever needing a single "full" mul/div, just hacking around it with a couple shifts, adds or bit ops.
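    For anyone who never had to do it, the two tricks look roughly like this. It is a sketch in Python rather than Z80 assembly, and the names `mul8` and `times10` are invented for the example: the first is the slow general-purpose routine, the second is the kind of shift-and-add shortcut used for a known constant.

```python
def mul8(a, b):
    """Multiply two unsigned 8-bit values with only shifts, adds and bit tests."""
    a &= 0xFF
    b &= 0xFF
    product = 0
    while b:
        if b & 1:          # lowest bit of b set: add the shifted a
            product += a
        a <<= 1            # a * 2
        b >>= 1            # move on to the next bit of b
    return product & 0xFFFF  # result fits in 16 bits


def times10(x):
    """x * 10 as (x * 8) + (x * 2), no multiply needed."""
    return (x << 3) + (x << 1)
```

    `mul8` loops once per bit of `b`, which is why the general routine is dead slow compared with a couple of fixed shifts.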

  36. how profiling tools fit in by aap · · Score: 1

    or you needlessly wrote some hideous O(n!) search which is NP complete, then no amount of profiling and instruction tuning is ever going to help you.

    In this situation the value of the profiling tools is not for instruction tuning, but to help you notice the existence of the bad search function so you can replace it with something else.

    In a large program there can be lurking n-squaredness which may not be obvious from looking at any one section of the code. For example there could be an innocent function which loops over n objects, and you may not realize that it is being called from a function twelve levels up the stack which is also looping over the same n objects.

    Sometimes it's enough to just stop in the debugger a few times to realize what is slower than it should be and why. In other cases, browsing the output of a good call graph profiler can help inspire the fix faster.
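    A toy version of that lurking n-squaredness, sketched in Python with made-up function names and a hand-rolled call counter standing in for a real profiler:

```python
calls = {"compare": 0}  # crude stand-in for a profiler's call counts


def count_matches(target, objects):
    """Innocent-looking helper: loops over all n objects."""
    total = 0
    for obj in objects:
        calls["compare"] += 1
        if obj == target:
            total += 1
    return total


def has_duplicates_slow(objects):
    """Caller that also loops over the same n objects: n * n compares."""
    return any(count_matches(obj, objects) > 1 for obj in objects)


def has_duplicates_fast(objects):
    """Replacement once the hot spot is noticed: one pass via a set."""
    return len(set(objects)) != len(objects)


data = list(range(100))
has_duplicates_slow(data)
print(calls["compare"])  # grows with the square of len(data)
```

    Neither function looks bad on its own; it is the count on the inner helper that gives the game away, which is the kind of thing a call graph profiler shows.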