What Makes Parallel Programming Difficult?
An anonymous reader writes "Intel's Aater Suleman writes about why parallel programming is difficult. ... I was unaware ... that a major challenge in multi-threaded programming lies in optimizing parallel programs not just getting them to run. His analysis is insightful and the case study is very enlightening if you are unfamiliar with parallel code debugging. "
Motherboards rarely have parallel ports these days!
synchronization with data separation, mutex's and avoiding deadlocks and race conditions has been solved since almost the beginning of parallelism
And yet people constantly get these details wrong in practice.
It's an extra layer of complexity and it introduces extra chances to make mistakes, even around areas where a programmer could know better. There's not much way around that. If people only made coding mistakes around difficult problems software would be dramatically more bulletproof than it actually is.
Using locks and the like make it very easy to do multithreaded and parallel programs.
The big problem comes when you need multiple locks because you find your program is waiting more on locks than anything else which is gumming up the whole works, and that can easily lead to deadlocks and other fun stuff.
Another way is to consider lockless algorithms, which don't have such blocking mechanisms. However, then you get into issues where atomicity isn't quite so atomic thanks to memory queues and re-ordering done in the modern CPU, and thus have to start adding memory barriers before doing your atomic exchanges.
Raymond Chen (of Microsoft) did a nice write up of the lockfree ways to do things and what Windows provides to accomplish them.
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/05/10149783.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/06/10150261.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/06/10150262.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/07/10150728.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/08/10151159.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/08/10151258.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/12/10152296.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/13/10152929.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/14/10153633.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/15/10154245.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/19/10155452.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/20/10156014.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/21/10156539.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/22/10156894.aspx
While you're correct from a temporarily practical measure, I disagree in theory. OS theory 20 or more years ago was about one very simple concept.. Keeping all resources utilized. Instead of buying 20 cheap, slow full systems (at a meager $6k each), you can buy 1 $50k machine and time-share it. All your disk-IO will be maximized, all your CPUs will be maximized, network etc. Any given person is running slower, but you're saving money overall.
If I have a single 8 core machine but it's attached to a netapp disk-array of 100 platters over a network, then the latency means that the round trip of a single-threaded program is almost guaranteed to leave platters idle. If, instead I split a problem up into multiple threads / processes (or use async-IO concepts), then each thread can schedule IO and immediately react to IO-completion, thereby turning around and requesting the next random disk block. While async-IO removes the advantage of multiple CPUs, it's MASSIVELY error-prone programming compared to blocking parallel threads/processes.
A given configuration will have it's own practical maximum and over-saturation point. And for most disk/network sub-systems, 8 cores TODAY is sufficient. But with appropriate NUMA supported motherboards and cache coherence isolation, it's possible that a thousand-thread application-suite could leverage more than 8 cores efficiently. But I've regularly over-committed 8 core machine farms with 3 to 5 thousand threads and never had responsiveness issues (each thread group (client application) were predominantly IO bound). Here, higher numbers of CPUs allows fewer CPU transfers during rare periods of competing hot CPU sections. If I have 6 hot threads on 4 cores, the CPU context switches leach a measureable amount of user-time. But by going hyper-threading (e.g. doubling the number of context registers), we can reduce the overhead slightly.
Now for HPC, where you have a single problem you're trying to solve quickly/cheaply - I'll admit it's hard to scale up. Cache contention KILLS performance - bringing critical region execution to near DRAM speeds. And unless you have MOESI, even non-contentious shared memory regions run at BUS speeds. You really need copy-on-write and message passing. Of course, not every problem is efficient with copy-on-write algorithms (i.e. sorting), so YMMV. But this, too was an advocation for over-committing.. Meaning while YOUR problem doesn't divide. You can take the hardware farm and run two separate problems on it. It'll run somewhat slower, but you get nearly double your money's worth in the hardware - lowering costs, and thus reducing the barrier to entry to TRY and solve hard problems with compute farms.
amazon EC anyone?
-Michael