What Makes Parallel Programming Difficult?
An anonymous reader writes "Intel's Aater Suleman writes about why parallel programming is difficult. ... I was unaware ... that a major challenge in multi-threaded programming lies in optimizing parallel programs not just getting them to run. His analysis is insightful and the case study is very enlightening if you are unfamiliar with parallel code debugging. "
Motherboards rarely have parallel ports these days!
"You're just not thinking fourth-dimensionally!"
"Right, right, I have a real problem with that."
Use Ctrl-C instead of ESC in Vim!
There is a class of problems, the P complete problems, which are (probably) inherently hard to multithread in any advantageous way, and this class includes some pretty important real-world problems:
http://en.wikipedia.org/wiki/P-complete
Palm trees and 8
PLC code for automated mechanical systems runs into all of the same timing issues etc. Obviously the PLC code is much smaller but this is mostly due to the instructions acting on smaller data sets, but can still be just as complicated,.
I am kinda curious how anyone even tangentially involved in programming could not be aware that the problem with writing parallel programming was doing it for a gain in efficiency. Making a thread or process is generally just a couple lines of code, synchronization with data separation, mutex's and avoiding deadlocks and race conditions has been solved since almost the beginning of parallelism.
Good tutorial for someone who wants to jump into some parallel programming, but it's mostly Operating Systems 101 (or 601).
Honestly though, if you have not optimized your algorithm or code to for parallelism and you want to do it now, you might probably be better off writing the whole thing from scratch, and the tutorial explains why very nicely/.
Not all tasks are suited for parallel programming. Some are and can be highly optimized, but if you have something that simply can't be reduced further then parallel programming will not solve your problems.
This hasn't been news for 30+ years. Parallel programming is hard for a multitude of well-known technical reasons, this is nothing new. Another major reason is that the human brain is no good at parallel programming; again, nothing new. I say let's either try to solve the problems or move on.
In my opinion, many problems with software development, are just as applicable in other domains of our life, and parallel programming is definitely one of them. We equally well have problems managing large teams of people working in parallel. These are problems of logistics, management and also (and no I am not joking) - cooking. And we're as bad handling these as we now handle software development. It may be right, however, to start solving this with the computers - no need to throw away rotten food and/or burned-out employees under delusional management.
What could help is a trusted set of formal theories. That would be a start. Whether it will be a mathematician, a computer programmer, or a kitchen chef that writes this set, is not important. Stripped bare of the bathwater - abstracted and proven - the baby will be what we need.
That you have to do everything all at once. How would you tell 50 kids to sort 50 toy cars? How would you tell 50 footballers to line up by height all at once? How would you have 50 editors edit the 50 pages of a screenplay all at once so that it makes sense from a continuity perspective? All these problems are very easy, but become very hard when you have to do it all at once...
I bet there are hundreds of simple tutorials on concurrent processing on the web, and these two pieces bring nothing new/interesting.
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
Well, we just learned about this in a graduate comp arch course and yeah, it can get hairy. Especially if using a processor consistency model as opposed to the sequential consistency model. The easiest fix is to throw up some fence instructions around the interdependent code sections to force sequential consistency, and then lay some flags as a means of time signaling between the processors. This eliminates the randomness that the author discussed in one of the examples.
I think the language Chapel being developed by Cray is taking concrete steps to make Parallel Programming easier. It is very similar in syntax to C++ and even supports some higher level constructs.
This looks like an advertisement for Chapel but I have no relation to Cray. Having taken a graduate parallel programming course, I cannot agree more with the statement that "Parallel Programming is difficult". I struggled a lot with pthreads and MPI before doing the final assignment in Chapel which was a pleasant surprise. The difference between serial and parallel code was FIVE characters only - and it gave a near linear speedup on three different machines.
20 years ago when I was working with transputers we use Occam. It was a very pure parallel programming language and it wasn't too difficult. However, writing parallel code meant starting from scratch (getting rid of the dusty decks of old algorithms as my professor described it). However, this never really happened and we've ended up with primitive parallelisation nailed on to sequential code. The are many parallel architectures, SIMD, MIMD, distributed memory, shared memory and combinations of them all and none of the languages currently available really suit. You have basic multithreading and MPI bolted on to old sequential languages and developers trying to take algorithms which absolutely must proceed in order and trying to find sections they can perform in parallel. This isn't going to get you good performance or scalability. In Occam, we wrote procedures which were independent of each other and communicated over channels and all would be running at once. You wired them together and poured you data in at one end and results came out at the other. Due to the very fine grained parallelism it was extremely scalable - I hade code running on 80+ transputers at over 90% efficiency in 1990 and it would run happily on many more if I had them, and yet it also ran perfectly on one since serialisation of parallel programs is easy whereas parallelising serial code is very hard and inefficient. I find it sad that the current state of parallel programming is still so far behind where we were 20 years ago due to all this legacy stuff. Even when people start out with new algorithms, they still start with serial processes and then try and parallelise rather than considering the parallelism from the outset and designing the algorithm so it doesn't require everything in memory and doesn't produce different results when number of CPUs changes.
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"
I'll take a stab and reply to you that "8 processors was enough for anyone", in the sense that multiplexing 8 programs is just insane. Better to just run 8 prorams each on their own core, and use some progs that can use 4 cores at a time. That leaves 4 free.
(Overly simpistic) I agree, but 1028 cores is not the answer. We need the next generation in raw core power to move computing forward. 8 killer cores will beat 1024 mediocre cores.
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
"The news of the problems of parallelization is still news after 30 years"
- paraphrase of Christian saying
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
Using locks and the like make it very easy to do multithreaded and parallel programs.
The big problem comes when you need multiple locks because you find your program is waiting more on locks than anything else which is gumming up the whole works, and that can easily lead to deadlocks and other fun stuff.
Another way is to consider lockless algorithms, which don't have such blocking mechanisms. However, then you get into issues where atomicity isn't quite so atomic thanks to memory queues and re-ordering done in the modern CPU, and thus have to start adding memory barriers before doing your atomic exchanges.
Raymond Chen (of Microsoft) did a nice write up of the lockfree ways to do things and what Windows provides to accomplish them.
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/05/10149783.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/06/10150261.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/06/10150262.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/07/10150728.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/08/10151159.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/08/10151258.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/12/10152296.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/13/10152929.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/14/10153633.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/15/10154245.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/19/10155452.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/20/10156014.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/21/10156539.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2011/04/22/10156894.aspx
Parallel programming is hard. Bummer. I'd rather be interested in an article that talks about *monitoring and debugging* parallel programs (I currently struggle to monitor parallel algorithms implemented in Java). Anybody?
that most of today's popular programming languages do not accommodate higher-level forms of expression required for easy parallelism. Declarative languages have a slight edge at being able to express where sequential dependencies are.
keep dreaming http://www.flowlang.net/
One more reason why functional programming matters. Many programs become trivial to parallelize when you avoid mutation and side-effects outside of limited, carefully-controlled contexts.
It's truly a joy when you can parallelize your code by changing a single character (from "map" to "pmap"). There's usually a little more to it than that, but you almost never have to deal with explicit locks or synchronization.
What makes parallel programming hard is poor languages. Languages that allow state changes and don't keep them isolated. Isolate changes of state, all changes of state and be careful about what kinds of combinators to use. Google map-reduce works whenever
a) You can organize your data into an array
b) You can do computations on array element in isolation
c) You can do computations on the entire array via. associate operations pairwise
And most programs do meet those properties but they slip in all sorts of subtle state changes so out of order execution isn't possible. The key to parallelism is better language selection.
The dirty secret of parallel programming is that it's *NOT* so widely needed. I think a lot of academics got funding to study automatic parallelization or other parallel techniques, and they latch on to multicore as a justification for it, but it's not.
There is only one GOOD reasons to use multithreading -- because your work is compute-bound. This typically happens on large-data applications like audio/video processing (for which you just call out to libraries that someone else has written), or else on your own large-data problems that have embarrassingly trivial parallelization: e.g.
var results = from c in Customers.AsParallel() where c.OrderStatus="unfilfilled" select new {Name=c.Name, Cost=c.Cost};
Here, using ParallelLINQ, it's as simple as just sticking in "AsParallel()". The commonest sort of large-data problems don't have any complicated scheduling.
There are also BAD reasons why people have used multithreading, particularly to deal with long-latency operations like network requests. But this is a BAD reason, and you shouldn't use multithreading for it. There are better alternatives, as shown by the Async feature in F#/VB/C# which I worked on, which was also copied into Javascript with Google's traceur compiler). e.g.
Task task1 = (new WebClient()).DownloadStringTaskAsync("http://a.com");
Task task2 = (new WebClient()).DownloadStringTaskASync("http://b.com");
Task winner = await Task.WhenAny(task1,task2);
string result = await winner;
Here it kicks off two tasks in parallel. But they are cooperatively multitasked on the same main thread at the "await" points. Therefore there is *NO* issue about race conditions; *NO* need to use semaphores/mutexes/condition-variables. The potential for unwanted interleaving is dramatically reduced.
So in the end, who are the people who still need to develop multithreaded algorithms? There are very few. I think they're just the people who write high-performance multithreaded libraries.
...very simple article...was expecting something more technical!
:-P
Anyway, cheers for the effort to write it!
Ps. Looks like he discovered an open secret
The basic problem with parallel programming is that, in most widely used languages, all data is by default shared by all threads. C, C++, and Python all work that way. The usual bug is race conditions.
There have been many languages for parallel programming which don't have default sharing, but they've never taken over outside some narrow niches. Partly because most of them weren't that useful outside their niche.
The other classic problem is that in most shared-data languages with locks, the language doesn't know what the lock is protecting. So you can still code race conditions by accident.
This is probably why many programmers experience such difficulty programming (efficiently) PLCs.... Lots of very large, very complicated programs are just "the same code running endlessly". The devil is in the inputs.
Erlang (and other functional, single-static-assignment languages) are perfect for parallel programming.
Okay, then the only problem is getting something useful out of Erlang.
Back in 1985 the Japanese government announced a "fifth generation" computing project, with software to be developed in Prolog. So I went and learned Prolog, an intriguing and amusing language. Only problem is, it was totally useless for any actual application, as the Japanese found out.
Sorry, but in order to believe any of the promises of one of those non-vonNeumann languages, I have to see a practical working application first.
The Scala community has tried to move the problem into a more practical realm by adding things like parallel collections, DSL's to abstract out the problem for specific applications and the Akka Project for simpler concurrency.
Most of the parallel programming discussion I've seen is very complicated and not likely to appeal to those who have to do practical day-to-day business projects. By pushing the abstractions up a level, I think the Scala folks have made parallel programming more accessible for the average developer.
Ref: Parallel Collections video.
Signatures are a waste of bandwi (buffering...)
All hardware guys (myself included) claim that they have been doing concurrent work since long. It takes writing a parallel program to understand the challenges. The communication latency (a huge issue), non-determinism, caches, need to deal with legacy code, and the need to make it robust that makes it a much much harder problem. My goal is to make hardware guys see these challenges so I can say you hit a pet-peeve. (I am a hardware guy who has learned software over the years).
The problem is moving optimizations to sofware which is the most retarded thing I ever heard. Most problems are not parrallizeable (if such a word exists) and is not needed if Intel made decent processors rather than the Itanium disaster. VLIW is heavily dependable on this. Parallel Programming as Intel wants it is to have the developers make up for the shortcummings by having programmers invent tricks while they put more cache and other things on their chips. We have the corei series of processors now at least but still. Why should we learn how to do parallel programming because they lost billions in R&D and have these silly api's and a market to sell these tools.
Advancements have been made to make that argument obsolete as putting more cache while making compilers harder to write is not making huge improvements in performance. It is still old fashioned branch predictions. So yes it is usefull in some circumstances, but in business all I care about is how fast I can get SQL out to a client/server app or to an intranet app
I admit I have not wrote code in 5 years so I am not the best source before the professional programmers nail me here. This is just from what I see
http://saveie6.com/
No, the devil's in the steering comittee.
Science advances one funeral at a time- Max Planck
Mutexes and semaphores are a bit more than bits and integers. You have to be able to modify them atomically. If you don't design your PLC correctly, you end up with two paths attempting to modify the same bit at the same time. Without atomicity, you can't know which won.
Government is not reason; it is not eloquent; it is force. Like fire, it is a dangerous servant and a fearful master.
Parallelism is very easy, provided that you don't do it yourself.
Use a pure functional programming language like Haskell that can be automatically parallelized.
Or use a programming language that uses the Active Object Pattern (or the Actor model).
Or do as I do: use C++ to implement message passing, then have one thread per object. Objects don't have public members, they simply communicate by exchanging messages.
In all the above cases, the trick is to avoid state. In Haskell, state is avoided by design; in the Actor Model/Active Object pattern, state is avoided due to isolation: effectively, each thread is like a separate process.
I have a small issue with the message passing. Doesn't it make the barrier of entry higher for the average programmers? Many claim that IBM Cell was harder to code for than the XBOX because of this.
While you're correct from a temporarily practical measure, I disagree in theory. OS theory 20 or more years ago was about one very simple concept.. Keeping all resources utilized. Instead of buying 20 cheap, slow full systems (at a meager $6k each), you can buy 1 $50k machine and time-share it. All your disk-IO will be maximized, all your CPUs will be maximized, network etc. Any given person is running slower, but you're saving money overall.
If I have a single 8 core machine but it's attached to a netapp disk-array of 100 platters over a network, then the latency means that the round trip of a single-threaded program is almost guaranteed to leave platters idle. If, instead I split a problem up into multiple threads / processes (or use async-IO concepts), then each thread can schedule IO and immediately react to IO-completion, thereby turning around and requesting the next random disk block. While async-IO removes the advantage of multiple CPUs, it's MASSIVELY error-prone programming compared to blocking parallel threads/processes.
A given configuration will have it's own practical maximum and over-saturation point. And for most disk/network sub-systems, 8 cores TODAY is sufficient. But with appropriate NUMA supported motherboards and cache coherence isolation, it's possible that a thousand-thread application-suite could leverage more than 8 cores efficiently. But I've regularly over-committed 8 core machine farms with 3 to 5 thousand threads and never had responsiveness issues (each thread group (client application) were predominantly IO bound). Here, higher numbers of CPUs allows fewer CPU transfers during rare periods of competing hot CPU sections. If I have 6 hot threads on 4 cores, the CPU context switches leach a measureable amount of user-time. But by going hyper-threading (e.g. doubling the number of context registers), we can reduce the overhead slightly.
Now for HPC, where you have a single problem you're trying to solve quickly/cheaply - I'll admit it's hard to scale up. Cache contention KILLS performance - bringing critical region execution to near DRAM speeds. And unless you have MOESI, even non-contentious shared memory regions run at BUS speeds. You really need copy-on-write and message passing. Of course, not every problem is efficient with copy-on-write algorithms (i.e. sorting), so YMMV. But this, too was an advocation for over-committing.. Meaning while YOUR problem doesn't divide. You can take the hardware farm and run two separate problems on it. It'll run somewhat slower, but you get nearly double your money's worth in the hardware - lowering costs, and thus reducing the barrier to entry to TRY and solve hard problems with compute farms.
amazon EC anyone?
-Michael
What makes parallel programming hard is computer languages.
Most languages today are actually not designed for parallelism or concurrency simply because most computers for a very long time had only one core. This is why we have threading and locks everywhere. Threads have huge overhead from hundreds of kilobytes to megabytes. That may seem like nothing but ideally for parallelism and concurrency to work you need to be able to create thousands of processes at nearly no cost (hundreds of bytes each). And locks, don't even get me started with that!
Shared mutable state is also a major problem it makes parallelism very hard, again current languages make heavy use of it (Singletons).
Anyway, this problem has been solved ages ago just look into the Actor Model and Erlang to get started that should pretty much cover it.
Yeah, you can't write a for loop in C and expect magically parallelization.
Sure you can, if you're not using pointers or are using restrict-qualified pointers and you have a good enough compiler.
A bit is a mutex if your processor has an atomic test and set instruction. Which these days they all do. A semaphore is an integer protected by a mutex. He oversimplified a bit, but if you know what you're doing it is that simple.
I still have more fans than freaks. WTF is wrong with you people?
In the end, what makes it harder is that the things you do on software are usually way more complex than the things you do on hardware. If not for that, that argument would make perfect sense as hardware also have issues with communication latency, non-determinism, and localized data (not just caches), to an even higher degree than software. And you also must make it robust.
Rethinking email
The barrier for learning to program purely functionally is high. But its a 1x barrier vs. complexity for ever. It makes the system simple.
I agree with him, avoid state.
For certain types of problems, the Linda coordination primitives and shared tuple-space make parallel programming much easier. I used the original C-Linda many, many years ago, and IBM's TSpaces for Java more recently. If you're trying to do little bitty actions on lots of data with tight coordination, the overhead is pretty bad. Looking into PyLinda is on my list of things to do...
Makefiles specify dependencies and recipes quite tidily. A simple implementation of make is provided in the "AWK book" as an example program- http://www.cs.bell-labs.com/cm/cs/awkbook/
If that's the case, there is no need for mutex or semaphore constructs because those are only needed for parallelism. Parent was stating that he's used both in PLCs. If that's the case, why? Either he has no clue, or you're wrong. I don't really care either way, just addressing the fact that you can't just implement a mutex as a flag or a semaphore as an integer as stated by parent.
Government is not reason; it is not eloquent; it is force. Like fire, it is a dangerous servant and a fearful master.
Funny, just went through three different manufacturer's PLC instruction sets and none of them listed test and set or any other kind of discernible atomic instruction. Do you have any references you could point me to? I'd be interested to know for sure.
Government is not reason; it is not eloquent; it is force. Like fire, it is a dangerous servant and a fearful master.
I've got a crazy question to throw out there. Which do you think is easier? Design a motherboard capable of disguisesing a 4 core chip as a 1 core chip as far as the software sees it and then splits off the workload automatically across all cores
or
make every programmer in the entire world working on any multithreaded application jump through a ridiculous amount of hoops and headcahes to get their software to run properly.
Now they're both incredibly difficult but I still think it's sort of one sided. Seriously, why has nobody attempted to develop something like this yet? Making life difficult for thousands of programmers compared to making some weird CPU virtualization technique that manages to properly manage threads and not blue screen the OS due to memory sharing violations and then making a ton of money on that exact technology...well I think that sort of spells it out.
Intel, the world's largest semiconductor chip maker, which was established in 1968, has 41 years of product innovation and market leadership in history. In 1971, Intel introduced the world's first microprocessor. Brought about by the microprocessor computer and the Internet revolution has changed the world. http://www.cheaptoryburchshop.com/ http://www.sexytoyslove.com/ http://www.edhardygo.com/
Erlang (and other functional, single-static-assignment languages) are perfect for parallel programming.
In that case you can restate the original question as "how do I implement an Erlang compiler and runtime?" If, indeed, Erlang embodies the most efficient way to solve all parallell programming problems. Which of course is absurd. Erlang is useless for implementing servers, kernels, runtimes, networking stacks, file systems, device drivers, VM systems, database servers, or anything else that actually makes a computer tick. It's simply an interface to an already-implemented computing system. It's good to excellent for some problem dmains, but not all domains, and certainly not the computing (system) domain itself. It's one solution to a particular computing domain problem. How well it performs depends on how it's implemented, which brings us back to the original question: how to learn to design and write MT-hot code on SMP hardware.
Parallel Programming is one of the old Arts which the new "developers" don't know jack about.
Never buy Sony CDs - they will open up your computer to anyone..
I Which of course is absurd. Erlang is useless for implementing servers, kernels, runtimes, networking stacks, file systems, device drivers, VM systems, database servers, or anything else that actually makes a computer tick.
When I first read this I assumed the poster was being sarcastic, but reading again they actually believe it. Erlang useless for implementing servers? Erlang is used to implement loads of servers. You look at the back end of the a lot the the top 100 company web services and you will find them using Erlang to implement their server functionality. Erlang is used to provide database systems too.
The idea that Erlang is some sort of toy academic language that is not used for anything practical is a joke. Erlang came out of a real industrial requirement to allow telecom switches to support millions of processes while remaining up 24/7. From that it has grown to be used in all sort of real world applications. Some could even suggest that its purity has been compromised in order to meet the real world requirements.
The problem with parallel programming is not that its difficult but the tools that are used to implement it. I have used two languages that have made parallel programming a doddle. The first is Occam and the second is Erlang. They both implement parrallel programming the same way, and that is avoid shared memory and use message passing. Once you do that things get far simpler.
Choose your allies carefully, it is highly unlikely you will be held accountable for the actions of your enemies
Occam + transputer boards was a brilliant idea. Didn't work out so well in practice but that wasn't due to Occam. It's about the most elegant language I've ever seen.
However, writing an Occam compiler would run into the same sort of issues as the article describes, I fear. Unless CPU-cores take some of the transputer lessons to heart.
Therefore, by the (faulty) logic you're using, you're just a cow with a keyboard - osu-neko (2604)
Not really, at least in languages where messages are indistinguishable from normal function calls.
I have often wondered why we can't treat CPU cores like we treat memory. Always assume there will be enough* and if there isn't then have things like the time sharing algorithms to make up the slack like we have virtual memory algorithms. As opposed to the moment where the entire mindset is one or at best a few processes. It's as if we're back in the says of a few KB of memory but we've somehow managed to implement GUIs and abstracted many layed software because we've got a really good at running the virtual memory...
*ok I'd be worried at a programmer who assumed there would always be enough memory to solve any problem and implement any algorithm no matter how sloppy, but you get my drift.
"The weirdest thing about a mind, is that every answer that you find, is the basis of a brand new cliche" -
Not sure what you mean. Are you thinking RPC?
Context switch overhead.
I know tobacco is bad for you, so I smoke weed with crack.