What Makes Parallel Programming Difficult?
An anonymous reader writes "Intel's Aater Suleman writes about why parallel programming is difficult. ... I was unaware ... that a major challenge in multi-threaded programming lies in optimizing parallel programs not just getting them to run. His analysis is insightful and the case study is very enlightening if you are unfamiliar with parallel code debugging. "
Motherboards rarely have parallel ports these days!
Good tutorial for someone who wants to jump into some parallel programming, but it's mostly Operating Systems 101 (or 601).
Honestly though, if you have not optimized your algorithm or code to for parallelism and you want to do it now, you might probably be better off writing the whole thing from scratch, and the tutorial explains why very nicely/.
synchronization with data separation, mutex's and avoiding deadlocks and race conditions has been solved since almost the beginning of parallelism
And yet people constantly get these details wrong in practice.
It's an extra layer of complexity and it introduces extra chances to make mistakes, even around areas where a programmer could know better. There's not much way around that. If people only made coding mistakes around difficult problems software would be dramatically more bulletproof than it actually is.
That you have to do everything all at once. How would you tell 50 kids to sort 50 toy cars? How would you tell 50 footballers to line up by height all at once? How would you have 50 editors edit the 50 pages of a screenplay all at once so that it makes sense from a continuity perspective? All these problems are very easy, but become very hard when you have to do it all at once...
I bet there are hundreds of simple tutorials on concurrent processing on the web, and these two pieces bring nothing new/interesting.
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
I think the language Chapel being developed by Cray is taking concrete steps to make Parallel Programming easier. It is very similar in syntax to C++ and even supports some higher level constructs.
This looks like an advertisement for Chapel but I have no relation to Cray. Having taken a graduate parallel programming course, I cannot agree more with the statement that "Parallel Programming is difficult". I struggled a lot with pthreads and MPI before doing the final assignment in Chapel which was a pleasant surprise. The difference between serial and parallel code was FIVE characters only - and it gave a near linear speedup on three different machines.
Using locks and the like make it very easy to do multithreaded and parallel programs.
The big problem comes when you need multiple locks because you find your program is waiting more on locks than anything else which is gumming up the whole works, and that can easily lead to deadlocks and other fun stuff.
Another way is to consider lockless algorithms, which don't have such blocking mechanisms. However, then you get into issues where atomicity isn't quite so atomic thanks to memory queues and re-ordering done in the modern CPU, and thus have to start adding memory barriers before doing your atomic exchanges.
Raymond Chen (of Microsoft) did a nice write up of the lockfree ways to do things and what Windows provides to accomplish them.
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/05/10149783.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/06/10150261.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/06/10150262.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/07/10150728.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/08/10151159.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/08/10151258.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/12/10152296.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/13/10152929.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/14/10153633.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/15/10154245.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/19/10155452.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/20/10156014.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/21/10156539.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/22/10156894.aspx
The dirty secret of parallel programming is that it's *NOT* so widely needed. I think a lot of academics got funding to study automatic parallelization or other parallel techniques, and they latch on to multicore as a justification for it, but it's not.
There is only one GOOD reasons to use multithreading -- because your work is compute-bound. This typically happens on large-data applications like audio/video processing (for which you just call out to libraries that someone else has written), or else on your own large-data problems that have embarrassingly trivial parallelization: e.g.
var results = from c in Customers.AsParallel() where c.OrderStatus="unfilfilled" select new {Name=c.Name, Cost=c.Cost};
Here, using ParallelLINQ, it's as simple as just sticking in "AsParallel()". The commonest sort of large-data problems don't have any complicated scheduling.
There are also BAD reasons why people have used multithreading, particularly to deal with long-latency operations like network requests. But this is a BAD reason, and you shouldn't use multithreading for it. There are better alternatives, as shown by the Async feature in F#/VB/C# which I worked on, which was also copied into Javascript with Google's traceur compiler). e.g.
Task task1 = (new WebClient()).DownloadStringTaskAsync("http://a.com");
Task task2 = (new WebClient()).DownloadStringTaskASync("http://b.com");
Task winner = await Task.WhenAny(task1,task2);
string result = await winner;
Here it kicks off two tasks in parallel. But they are cooperatively multitasked on the same main thread at the "await" points. Therefore there is *NO* issue about race conditions; *NO* need to use semaphores/mutexes/condition-variables. The potential for unwanted interleaving is dramatically reduced.
So in the end, who are the people who still need to develop multithreaded algorithms? There are very few. I think they're just the people who write high-performance multithreaded libraries.
While you're correct from a temporarily practical measure, I disagree in theory. OS theory 20 or more years ago was about one very simple concept.. Keeping all resources utilized. Instead of buying 20 cheap, slow full systems (at a meager $6k each), you can buy 1 $50k machine and time-share it. All your disk-IO will be maximized, all your CPUs will be maximized, network etc. Any given person is running slower, but you're saving money overall.
If I have a single 8 core machine but it's attached to a netapp disk-array of 100 platters over a network, then the latency means that the round trip of a single-threaded program is almost guaranteed to leave platters idle. If, instead I split a problem up into multiple threads / processes (or use async-IO concepts), then each thread can schedule IO and immediately react to IO-completion, thereby turning around and requesting the next random disk block. While async-IO removes the advantage of multiple CPUs, it's MASSIVELY error-prone programming compared to blocking parallel threads/processes.
A given configuration will have it's own practical maximum and over-saturation point. And for most disk/network sub-systems, 8 cores TODAY is sufficient. But with appropriate NUMA supported motherboards and cache coherence isolation, it's possible that a thousand-thread application-suite could leverage more than 8 cores efficiently. But I've regularly over-committed 8 core machine farms with 3 to 5 thousand threads and never had responsiveness issues (each thread group (client application) were predominantly IO bound). Here, higher numbers of CPUs allows fewer CPU transfers during rare periods of competing hot CPU sections. If I have 6 hot threads on 4 cores, the CPU context switches leach a measureable amount of user-time. But by going hyper-threading (e.g. doubling the number of context registers), we can reduce the overhead slightly.
Now for HPC, where you have a single problem you're trying to solve quickly/cheaply - I'll admit it's hard to scale up. Cache contention KILLS performance - bringing critical region execution to near DRAM speeds. And unless you have MOESI, even non-contentious shared memory regions run at BUS speeds. You really need copy-on-write and message passing. Of course, not every problem is efficient with copy-on-write algorithms (i.e. sorting), so YMMV. But this, too was an advocation for over-committing.. Meaning while YOUR problem doesn't divide. You can take the hardware farm and run two separate problems on it. It'll run somewhat slower, but you get nearly double your money's worth in the hardware - lowering costs, and thus reducing the barrier to entry to TRY and solve hard problems with compute farms.
amazon EC anyone?
-Michael
I took a course in parallel programming. We worked with MPI in C if anyone is curious or interested. The hardest part was completely changing the way you thought about programming. It was like the first time you looked at prolog. Or the first time you tried a functional language like scheme. If you've only ever done procedural, it can be quite a big deal to switch. Also, it's quite a bit harder to debug multithreaded code, as the exact order of the instructions is different everytime you run it. Tracking down a deadlock and other problems that only exist in parallel programs can be quite difficult. Anybody who thinks thread syncronization is a solved problem has never written a complex multithreaded application, or is a programming genius. Most likely the former.
Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Importantly, comparisons to factories are generally unwise because manufacturing is 'embarrassingly parallel' -- it doesn't matter if you have to make the whole iPhone in one single step that takes two straight hours and you can't divide it into any constituent part. You can still manufacture as many iPhones a day as you like because you just set up however many manufacturing units which each make one iPhone every two hours, until you have the desired output capacity.
The 'hard' problems to parallelize are the ones that work like building a house: You can't lay the foundation until you grade the terrain. You can't put in the walls until you lay the foundation. You can't install the electrical until there are walls. And so on. There is no known way to convert an unimproved lot and a truck full of concrete and drywall into into a house in less than 5 minutes, no matter how many construction workers you have available.