Is Parallel Programming Just Too Hard?
pcause writes "There has been a lot of talk recently about the need for programmers to shift paradigms and begin building more parallel applications and systems. The need to do this and the hardware and systems to support it have been around for a while, but we haven't seen a lot of progress. The article says that gaming systems have made progress, but MMOGs are typically years late and I'll bet part of the problem is trying to be more parallel/distributed. Since this discussion has been going on for over three decades with little progress in terms of widespread change, one has to ask: is parallel programming just too difficult for most programmers? Are the tools inadequate or perhaps is it that it is very difficult to think about parallel systems? Maybe it is a fundamental human limit. Will we really see progress in the next 10 years that matches the progress of the silicon?"
I can't speak for the rest of the world, or even the programming community. That disclaimer spoken, however, I can say that parallel programming is indeed hard. The trivial examples, like simply running many processes in parallel that are doing the same thing (as in, for example, Monte Carlo sampling) are easy, but the more difficult examples of parallelized mathematical algorithms I've seen, such as those in linear algebra are difficult to conceptualize, let alone program. Trying to manage multiple threads and process communication in an efficient way when actually implementing it adds an additional level of complexity.
I think the biggest reason why it is difficult is that people tend to process information in a linear fashion. I break large projects into a series of chronologically ordered steps and complete one at a time. Sometimes if I am working on multiple projects, I will multitask and do them in parallel, but that is really an example of trivial parallelization.
Ironically, the best parallel programmers may be those good managers, who have to break exceptionally large projects into parallel units for their employees to simultaneously complete. Unfortunately, trying to explain any sort of technical algorithm to my managers usually exacts a look of panic and confusion.
-Ryan
AUWYHSTOT (Acronyms are Useless When You Have to Spell Them Out Too)
Aside from my usual lament that people already call themselves programmers when they can fire up Visual Studio, parallelizing your tasks opens quite a few cans of worms. Many things can't be done simultanously, many side effects can occur if you don't take care and generally, programmers don't really enjoy multithreaded applications, for exactly those reasons.
And often enough, it's far from necessary. Unless you're actually dealing with an application that does a lot of "work", calculate or display, preferable simultanously (games would be one of the few applications that come to my mind), most of the time, your application is waiting. Either for input from the user or for data from a slow source, like a network or even the internet. The average text processor or database client is usually not in the situation that it needs more than the processing power of one core. Modern machines are by magnitudes faster than anything you usually need.
Generally, we'll have to deal with this issue sooner or later, especially if our systems become more and more overburdened with "features" while the advance of processing speed will not keep up with it. I don't see the overwhelming need for parallel processing within a single application for most programs, though.
We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
The problem with parallel programming is we don't have the right set of primitives. Right now the primitives are threads, mutexes, semaphores, shared memory and queues. This is the machine language of concurrency - it's too primitive to effective write lots of code by anyone who isn't a genius.
What we need is more advanced primitives. Here are my 2 or 3 top likely suspects:
- Concurrent Sequential Programs - CSP. This is the programming model behind Erlang - one of the most successful concurrent programming languages available. Writing large, concurrent, robust apps is as simple as 'hello world' in Erlang. There is a whole new way of thinking that is pretty much mind bending. However, it is that new methodology that is key to the concurrency and robustness of the end applications. Be warned, it's functional!
- Highly optimizing functional languages (HOFL) - These are in the proto-phase, and there isn't much available, but I think this will be the key to extremely high performance parallel apps. Erlang is nice, but not high performance computing, but HOFLs won't be as safe as Erlang. You get one or the other. The basic concept is most computation in high performance systems is bound up in various loops. A loop is a 'noop' from a semantic point of view. To get efficient highly parallel systems Cray uses loop annotations and special compilers to get more information about loops. In a functional language (such as Haskel) you would use map/fold functions or list comprehensions. Both of which convey more semantic meaning to the compiler. The compiler can auto-parallelize a functional-map where each individual map-computation is not dependent on any other.
- Map-reduce - the paper is elegant and really cool. It seems like this is a half way model between C++ and HOFLs that might tide people over.
In the end, the problem is the abstractions. People will consider threads and mutexes as dangerous and unnecessary as we consider manual memory allocation today.
It is not difficult to justify parallel programming. Ten years ago, it was difficult to justify because most computers had a single processor. Today, dual-core systems are increasingly common, and 8-core PC's are not unheard of. And software developers are already complaining because it's "too hard" to write parallel programs.
Since Intel is already developing processors with around 80 cores, I think that multi-core (i.e. multi-processor) processors are only going to become more common. If software developers intend to write software that can take advantage of current and future processors, they're going to have to deal with parallel programming.
I think that what's most likely to happen is we'll see the emergence of a new programming model, which allows us to specify an algorithm in a form resembling a Hasse diagram, where each point represent a step and each edge represents a dependency, so that a compiler can recognize what can and cannot be done in parallel and set up multiple threads of execution (or some similar construct) according to that.
Parallel programming doesn't have to be quite as painful as it currently is. The catch is that you have to face the fact that you can't go on thinking with a sequential paradigm and have some tool, library, or methodology magically make everything work. And now, I'm not talking about functional programming. Functional programming is great, and has a lot going for it, but solving concurrent programming issues is not one of those things. Functional programming deals with concurrency issues by simply avoiding them. For problems that have no state and can be coded purely functionally this is fine, but for a large number of problems you end up either tainting the purity of your functions, or wrapping things up in monads which end up having the same concurrency issues all over again. It does have the benefit that you can isolate the state, and code that doesn't need it is fine, but it doesn't solve the issue of concurrent programming.
No, the different sorts of paradigms I'm talking about no shared state, message passing concurrency models ala CSP and pi Calculus and the Actor Model. That sort of approach in terms of how to think about the problem shows up in languages like Erlang, and Oz which handle concurrency well. The aim here is to make message passing and threads lightweight and integrated right into the language. You think in terms actors passing data, and the language supports you in thinking this way. Personally I'm rather fond of SCOOP for Eiffel which elegantly integrates this idea into OO paradigms (an object making a method call is, ostensibly, passing a message after all). That's still research work though (only available as a preprocessor and library, with promises of eventually integrating it into the compiler). At least it makes thinking about concurrency easier, while still staying somewhat close more traditional paradigms (it's well worth having a look at if you've never heard of it).
The reality, however, is that these new languages which provide the newer and better paradigms for thinking and reasoning about concurrent code, just aren't going to get developer uptake. Programmers are too conservative and too wedded to their C, C++, and Java to step off and think as differently as the solution really requires. No, what I expect we'll get is kluginess retrofitted on to existing languages in a slipshod way that sort of work, in as much as it is an improvement over previous concurrent programming in that language, but doesn't really make the leap required to make the problem truly significantly easier.
Craft Beer Programming T-shirts
I've worked with parallel software for years - there are lots of ways to do it, lots of good programming tools around even a couple of decades back (my stuff ranged from custom message passing in C to using "Connection-Machine Fortran"; now it's java threads) but the fundamental problem was stated long ago by Gene Amdahl - if half the things you need to do are simply not parallelizable, then it doesn't matter how much you parallelize everything else, you'll never go more than twice as fast as using a single thread.
Now there's been lots of work on eliminating those single-threaded bits in our algorithms, but every new software problem needs to be analyzed anew. It's just another example of the no-silver-bullet problem of software engineering...
Energy: time to change the picture.
I think that what's most likely to happen is we'll see the emergence of a new programming model, which allows us to specify an algorithm in a form resembling a Hasse diagram, where each point represent a step and each edge represents a dependency, so that a compiler can recognize what can and cannot be done in parallel and set up multiple threads of execution (or some similar construct) according to that.
Z -H-10.html#%25_sec_1.1.5. More parallelism can be drawn out if the interpreter "compiles" as yet unused functions while evaluating others. See the following section.
This is more-or-less how functional programming works. You write your program using an XML-like tree syntax. The compiler utilizes the tree to figure out dependencies. See http://mitpress.mit.edu/sicp/full-text/book/book-
After all, I am strangely colored.
Even though it is implemented in Java, you can use just about anything with it, using the Hadoop streaming functionality.
It is still difficult to justify if you can more easily write more efficient single-threaded apps. What consumer-level apps out there really need more processing power than a single core of a modern CPU can provide? I already understand the enterprise need. In fact, multi-threaded solutions for enterprise and scientific apps are already prevalent, that market having had SMP for a long time.
Yes, but that's an easy sort of parallelism. Heck, I wrote a fractal generator that did the generation in a separate thread in 11th grade after writing my first Win32 program 4 or 5 months previous. It was also my first experience with threads. I'm not even sure I really knew what they were before that. This isn't *really* paralleling the application in the sense TFA means.
Closer is this: After some more work and a rewrite (for other reasons), I had "Fracked" running n threads, each rendering 1/n of the display. Data parallelism == easy parallelism.
But a lot of problems don't fit these models, and need a LOT of thought put into how to parallelize them. It's likely that some problems in P are not efficiently parallelizable.
Our cognitive system does many things at the same time, yes. That doesn't answer the question that's being posed here: whether explicit, conscious reasoning about parallel processing is hard for people.
Are you adequate?
Actually, it's more like pipelined. The fact that your eyes already moved to the next letter, just says that the old one is still going through the pipeline. Yeah, there'll be some bayesian prediction and pre-fetching involved, but it's nowhere near consciously doing things in parallel.
Try reading two different texts side by side, at the same time, and it won't work that neatly parallel any more.
Heck, there were some recent articles about why most Powerpoint presentations are a disaster: in a nutshell, because your brain isn't that parallel, or doesn't have the bandwidth for it. If you try to read _and_ hear someone saying something (slightly) different at the same time, you just get overloaded and do neither well. The result is those time-wasting meetings where everyone goes fuzzy-brained and forgets everything as soon as the presentation flipped to the next chart.
To get back to the pipeline idea, the brain seems to be quite the pipelined design. Starting from say, the eyes, you just don't have the bandwidth to consciously process the raw stream of pixels. There are several stages of buffering, filtering out the irrelevant bits (e.g., if you focus on the blonde in the car, you won't even notice the pink gorilla jumping up and down in the background), "tokenizing" it, matching and cross-referencing it, etc, and your conscious levels work on the pre-processed executive summary.
We already know, for example, that the shortest term buffer can store about 8 seconds worth of raw data in transit. And that after about 8 seconds it will discard that data, whether it's been used or not. (Try closing your eyes while moving around a room, and for about 8 seconds you're still good. After that, you no longer know where you are and what the room looks like.)
There's a lot of stuff done in parallel at each stage, yes, but the overall process is really just a serial pipeline.
At any rate, yeah, your eyes may already be up to 8 seconds ahead of what your brain currently processes. It doesn't mean you're that much of a lean, mean, parallel-processing machine, it just means that some data is buffered in transit.
Even time-slicing won't really work that well, because of that (potential) latency and the finite buffers. If you want to suddenly focus on another bit of the picture, or switch context to think of something else, you'll basically lose some data in the process. Your pipeline still has the old data in it, and it's going right to the bit bucket. That or both streams get thrashed because there's simply not enough processing power and bandwidth for both to go through the pipeline at the same time.
Again, you only need to look at the fuzzy-brain effect of bad Powerpoint presentations to see just that in practice. Forced to try to process two streams at the same time (speech and text), people just make a hash of both.
A polar bear is a cartesian bear after a coordinate transform.
I've seen it over and over in the industry, there is a distinct lack of parallel programming skill and knowledge, and even senior developers with years under their belt struggle with the fundamentals.
Parallel programming does add a layer of complexity and its inherent lack of general solutions does make abstracting its complexity away difficult, but I suspect that the biggest issue is the lack of training of the work force. It isn't something you can pick up easily without a steep learning curve with many hard lessons in it, definitely not something that can be incorporated as a new thing to be learnt on the fly with deadlines looming.
Another aspect is that its fundamental to the design, parallelism can and often will dictate the design and if the software architects are not actively designing for it or are not experienced enough to ensure that it remains a viable future option, future attempts to parallelise can be difficult at best.
Ultimately the key issues are:
* Lack of skill and training in the work force (including the understanding that there are no general solutions)
* Lack of mature development tool to easy the development process (debugging tools especially)
GPLv2: I want my rights, I want my phone call! DRM: What use is a phone call, if you are unable to speak?
I was using a potential answer to this in 1990. I was working for a small company in the Provo/Orem area, called Computer System Architects, which was selling Transputer hardware. For those who haven't heard of Transputers, they were small, 16- or 32-bit processors, with a small amount of built-in RAM (not a cache; this was actually in the memory map and you could do small tasks on a Transputer without any external RAM), 2-4 high-speed serial channels (easily implemented with 4 wires) and a stack-based architecture. Adding megabytes of external RAM was easy, and it was embarrassingly easy to connect up networks of these things, even on one board (in a single ISA slot), and build cluster. An external card cage, in those days, could hold 20 slots, which would hold up to 80 Transputers, using our products.
I did some Assembly and some C, but the kicker language for this chip was called Occam II. Among other things, it used the indentation in the code to determine block structure. A quick example:
PAR
step A
step B
step C
SEQ
step D
step E
In this example, steps A, B and C would all be executed in parallel with another task which ran step D then step E. If you had one Transputer in your machine, it would multi-task. If you had multiple CPU's available, it would spread the task across the CPU's.
It also has a basic construct called a Channel. These were very easy to set up and use. These were how the different tasks communicated with each other.
It was not difficult to spawn thousands of tasks, each one doing a relatively small part of an overall task, with full communication and synchronization. Again, if you had multiple CPU's available, it would spread the tasks across them. A board with multiple Transputers was usually doing ray-tracing or rendering Mandelbrot fractals as a demo anytime we went to a trade or tech show. They could knock it down to one processor, and things got done relatively quickly. Then, they'd kick in 4 or 16 CPU's and blow people's minds.
This was in 1990. A 386DX-33 was high-end, back then. The Transputer didn't run DOS or Windows, so it didn't survive in the market of the time. That was a shame; I benchmarked a variety of them, then ran identical benchmarks on various other machines as technology marched on. A T805 running 30 MHz (the top end Transputer I ever got to play with) blasted through mixed integer/floating-point calculations about as fast a 486DX2-66 (which didn't come on the market for another couple years). There was an occasion where I had 16 of those T805's sitting my machine. You'd need a Pentium II to be able to match that occasion. It was well over a decade later that the P-II became available.
Cool tech, but the programming tools were what allowed you to really use the parallelization. It was typical to achieve over 95% linear speedup (i.e. 20 CPU's gave real-world 19x performance); sometimes we went over 99%. Most Intel SMP machines are lucky if they give 80% linear speedup (4 CPU's = 3.2x total performance).
... by the Dew of Mountains the thoughts acquire speed, the hands acquire shakes, the shakes become a warning