The K computer is able to do work more efficiently than GPUs because it uses a very power-efficient core, the SPARC64 VIIIfx. If you peel the onion, it seems like the real reason for the energy efficiency is special-purpose units and the HPC-ACE instructions. I did a quick investigation of what this core has (and what it doesn't) that makes it so energy efficient. It may be an interesting read for some of you, so I'm leaving a link here: http://bit.ly/kTvvDE
Agreed. In fact, that style of functional partitioning is the most power-efficient approach. Btw, Intel's latest chip, Sandy Bridge, is an example of that too. It has a GPU, which is good at one type of work, and a CPU, which is good at another type of code (AMD Fusion is the same way).
I do feel that functional partitioning makes programming even worse though. Now not only do you need to find parallelism, you also have to decide which task runs best where. Automating this process is a long way off, so until then it's a tough one. You do see it happening for some obvious cases though, e.g., GPUs.
I believe there are programs that need performance and are not graphics or movies. Databases are one. Excel functions like Goal Seek are another. How about web browsers? There is work going on to parallelize HTML rendering.
On a side note, I do want to point out that graphics seems so regular because it was made that way by design, which is one way of finding parallelism. Gfx was constrained in a way that made it parallel, e.g., it was mandated that triangles were the only building block and that only certain blending modes were allowed. Not surprisingly, the trend is now changing. Ten years ago, a GPU was fixed-function hardware, but today a major chunk of it is programmable, and new standards like DX11 are pushing that even further. It is clear that if you want better graphics, the code becomes irregular, which isn't as easy to parallelize but still needs performance. Movie animation falls in that category of graphics where parallelism isn't that easy to find...
I have a small issue with message passing. Doesn't it raise the barrier to entry for the average programmer? Many claim that the IBM Cell was harder to code for than the Xbox because of this.
Thanks for reading my article. Awesome point! Actually, I really wanted to point out in my article that Google's MapReduce is just what you need for the histogram kernels (not sure if you saw that post of mine: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html). There are many problems that fit, but I would say that there are many that don't. The travelling salesman problem, scheduling problems, and branch-and-bound in general don't fall in that category. Databases are another one.
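To make the histogram point concrete, here is a rough sketch of the map-reduce shape in Python. It is not Google's MapReduce and not the code from the linked post; `map_chunk`, `merge`, and `parallel_histogram` are names I made up for illustration. Each map task builds a private histogram, so there is no shared state to race on, and the reduce step just adds the partial results. In CPython the threads mostly show the structure rather than a real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk):
    """Map step: build a private histogram for one chunk (no shared state)."""
    return Counter(chunk)


def merge(left, right):
    """Reduce step: combine two partial histograms into a new one."""
    return left + right


def parallel_histogram(values, num_chunks=4):
    """Histogram `values` by mapping over chunks and reducing the partials."""
    # Split the input into roughly equal chunks, one per map task.
    size = max(1, (len(values) + num_chunks - 1) // num_chunks)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Fold the per-chunk histograms together, starting from an empty one.
    return reduce(merge, partials, Counter())
```

Something like `parallel_histogram([1, 2, 2, 3, 3, 3])` gives a count of 1 for the value 1, 2 for 2, and 3 for 3. Branch-and-bound does not split this cleanly because the tasks need to share the current best bound.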
I kind of agree and disagree with you guys at the same time. Parallel programming is becoming inevitable if you want your code to run faster, because the "free lunch" of Intel making it faster (by burning insane power) is over. Code that is at acceptable performance will stay single-threaded, but code needing performance will have to be multi-threaded. Adobe multi-threaded their code, and even image processing stuff like ImageMagick did it. Graphics has been doing it for a while. It's just that the motivation is very clear, and I feel that competition will take programmers to this unfriendly land even if they don't like it.
I touched on cache line alignment and memory a bit. I can/will talk about that if it's a topic of interest.
JVMs, in my experience, cache-align almost all data structures. I have profiled some Java code using Pin. Perhaps that information can help you...
Billy, I understand that parallel programming is difficult. However, it was inevitable for Intel, AMD, or IBM to go multicore. Why? Because as you make a single core bigger, it provides less performance per watt, i.e., a 2x faster core generally burns 4x the power. That rate is not sustainable. Proof: look at the size of the heat sink on your processor. It couldn't get any bigger. Thus, Intel had to stop making the single core faster, and there were three options: increase cache (which is of no use beyond a certain point), do nothing, or go multicore. They chose the last one as the other two were not viable.
This makes software's life harder, but this is the world we live in. My personal position is that hardware should understand software's challenges and help with them, which is the point of my blog, Future Chips.
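The "2x faster core burns 4x the power" rule of thumb above can be turned into a quick back-of-the-envelope calculation. This is only a sketch: the square-law power model is just that rule of thumb written as a formula, the perfect parallel scaling is an optimistic assumption, and the function names are made up for illustration.

```python
def core_power(speedup):
    """Relative power of one core that is `speedup` times faster.

    Assumes power grows with the square of single-core performance,
    i.e. the "2x faster burns 4x power" rule of thumb.
    """
    return speedup ** 2


def perf_per_watt(speedup):
    """Performance delivered per unit of power for one such core."""
    return speedup / core_power(speedup)


def multicore_throughput(power_budget, speedup=1.0):
    """Ideal throughput from filling a power budget with identical cores.

    Assumes the workload parallelizes perfectly, which real code rarely does.
    """
    cores = power_budget // core_power(speedup)
    return cores * speedup
```

With a power budget of 4 baseline cores, `multicore_throughput(4, speedup=2.0)` buys one fast core for a throughput of 2.0, while `multicore_throughput(4, speedup=1.0)` buys four slow cores for 4.0. That gap only materializes if software can actually use all four cores.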
All hardware guys (myself included) claim that they have been doing concurrent work for a long time. It takes writing a parallel program to understand the challenges. Communication latency (a huge issue), non-determinism, caches, the need to deal with legacy code, and the need to make it robust are what make it a much, much harder problem.
My goal is to make hardware guys see these challenges, so I'd say you hit a pet peeve. (I am a hardware guy who has learned software over the years.)
...very simple article...was expecting something more technical!
Anyway, cheers for the effort to write it!
P.S. Looks like he discovered an open secret :-P
In my defense, I wasn't targeting parallel programming experts :-) The theme of my post was to familiarize people who don't know parallel programming with the challenges.
The basic problem with parallel programming is that, in most widely used languages, all data is by default shared by all threads. C, C++, and Python all work that way. The usual bug is race conditions.
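A small Python sketch of what default sharing looks like in practice. The names (`counter`, `unsafe_add`, `safe_add`, `run`) are made up for illustration. Nothing in the language marks `counter` as shared; every thread can simply reach it. Whether the unsafe version actually loses updates on a given run depends on the interpreter and timing, so treat it as something that may go wrong, not something that will every time.

```python
import threading

counter = 0                       # module-level, so every thread sees it by default
counter_lock = threading.Lock()


def unsafe_add(n):
    """Increment the shared counter with no synchronization."""
    global counter
    for _ in range(n):
        counter += 1              # read-modify-write; threads may interleave here


def safe_add(n):
    """Same loop, but each increment happens while holding the lock."""
    global counter
    for _ in range(n):
        with counter_lock:
            counter += 1


def run(worker, threads=4, n=10_000):
    """Reset the counter, run `worker` on several threads, return the total."""
    global counter
    counter = 0
    pool = [threading.Thread(target=worker, args=(n,)) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter
```

`run(safe_add)` always comes back as 40000. `run(unsafe_add)` can come back lower when two threads read the same old value and both write back old value plus one.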
There have been many languages for parallel programming which don't have default sharing, but they've never taken over outside some narrow niches, partly because most of them weren't that useful outside their niche.
The other classic problem is that in most shared-data languages with locks, the language doesn't know what the lock is protecting. So you can still code race conditions by accident.
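One way to picture the "lock doesn't know what it protects" problem is to look at the opposite design, where the lock and the data live in one object. The `Guarded` class below is a hypothetical wrapper written for this sketch, not a standard library type. Python cannot stop a reference from escaping the `with` block, so this is a convention rather than a guarantee; Rust's `Mutex<T>` is an example of a language-level version of the same idea.

```python
import threading
from contextlib import contextmanager


class Guarded:
    """Hypothetical wrapper that keeps a value and its lock together."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    @contextmanager
    def locked(self):
        """Yield the protected value only while the lock is held."""
        with self._lock:
            yield self._value


def deposit(accounts, name, amount):
    """Add `amount` to one account; the table is only reachable under the lock."""
    with accounts.locked() as table:
        table[name] = table.get(name, 0) + amount
```

With a bare `threading.Lock()` next to a bare dict, nothing connects the two, and forgetting the lock in one code path is silent. Here the only documented way to get at the dict is through `locked()`.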
Hey, my concern is that there is lots of code where I do want to assume sharing (since it's easier to think that way for some problems). MPI is kind of like what you are describing, and programming in MPI is not much fun either.
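For contrast with the shared-data style, here is roughly what the no-sharing style looks like, using Python's `queue.Queue` as a stand-in for message passing. This is not MPI, and `worker` and `sum_by_messages` are names invented for the sketch. The worker owns its running total outright, and the only way to affect it is to send a message.

```python
import queue
import threading


def worker(inbox, outbox):
    """Owns its running total; the only way in or out is a message."""
    total = 0
    while True:
        msg = inbox.get()
        if msg is None:           # sentinel: no more work is coming
            break
        total += msg
    outbox.put(total)


def sum_by_messages(values):
    """Send each value to the worker as a message and collect its reply."""
    inbox, outbox = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(inbox, outbox))
    t.start()
    for v in values:
        inbox.put(v)
    inbox.put(None)               # tell the worker to finish
    t.join()
    return outbox.get()
```

There is no lock and no race on `total`, but even this toy needs a protocol (a sentinel and a reply queue), which is the kind of overhead that makes the style feel heavier for problems that are naturally thought of as shared.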
Not sure what you mean. Are you thinking of RPC?
I completely agree with Xyrus. Parallel programming is a big thing, and I feel that every programmer is going to get exposed to it eventually.
I agree with that, but they should only be used after you know what's under the hood. If we "teach" parallel programming that way, then programmers will never understand the underlying challenges. It will all become magic to them, like Java is. See my post on this very topic: http://www.futurechips.org/thoughts-for-researchers/csee-professors-graduates-understand-computers.html