Slashdot Mirror


What Makes Parallel Programming Difficult?

An anonymous reader writes "Intel's Aater Suleman writes about why parallel programming is difficult. ... I was unaware ... that a major challenge in multi-threaded programming lies in optimizing parallel programs not just getting them to run. His analysis is insightful and the case study is very enlightening if you are unfamiliar with parallel code debugging. "

18 of 196 comments (clear)

  1. Easy! by Anonymous Coward · · Score: 4, Funny

    Motherboards rarely have parallel ports these days!

    1. Re:Easy! by sgbett · · Score: 5, Funny

      is that things sometimes happen in the wrong order.

      --
      Invaders must die
    2. Re:Easy! by sgbett · · Score: 5, Funny

      The problem with parallel programming

      --
      Invaders must die
    3. Re:Easy! by BadPirate · · Score: 2

      Yes it is happen when that terrible's.

      --
      - Holy crap, I've got MOD points! Who thought that was a good idea.
    4. Re:Easy! by Acius · · Score: 2

      - The nazi grammar parallel

      No combination above gramatically words possible of the is correct. You 's drop the should.

      --
      Acius the unfamous
    5. Re:Easy! by Jane+Q.+Public · · Score: 2

      I have to disagree on this. Synching multi-threaded programs is often VERY far from trivial, and takes much more than "basic skills". TFA made that very point; often there are hidden dependencies that may not be obvious even to the developer who originally wrote the program. If it only took "basic" skills then we would have a lot more multi-threaded programs these days.

      I ran into this problem myself recently: I wrote a program that interfaced with websites via http, and the problem was that while it was actively engaged in its main loop, it ignored any input from the UI (so the STOP button would not work, for example). I actually had to put the main processing in a separate thread so that a user could click "STOP" and abort the process (which could take a long time).

      Funny thing was: in order for the whole thing to work, the second thread had to alter a variable that was available by the first thread... so it had to break the 'rules' to work.

      It would be nice if Cocoa had the equivalent of the old VB "GetEvents" method, which polled the queue for any pending interface events. That would have solved that problem without involving multiple threads.

  2. Moral of the story... by kvvbassboy · · Score: 2
    learn2efficientparallelalgorithms.

    Good tutorial for someone who wants to jump into some parallel programming, but it's mostly Operating Systems 101 (or 601).

    Honestly though, if you have not optimized your algorithm or code to for parallelism and you want to do it now, you might probably be better off writing the whole thing from scratch, and the tutorial explains why very nicely/.

  3. Re:unaware? WTF? by Mongoose+Disciple · · Score: 5, Insightful

    synchronization with data separation, mutex's and avoiding deadlocks and race conditions has been solved since almost the beginning of parallelism

    And yet people constantly get these details wrong in practice.

    It's an extra layer of complexity and it introduces extra chances to make mistakes, even around areas where a programmer could know better. There's not much way around that. If people only made coding mistakes around difficult problems software would be dramatically more bulletproof than it actually is.

  4. "What Makes Parallel Programming Difficult?" by Anonymous Coward · · Score: 2, Insightful

    That you have to do everything all at once. How would you tell 50 kids to sort 50 toy cars? How would you tell 50 footballers to line up by height all at once? How would you have 50 editors edit the 50 pages of a screenplay all at once so that it makes sense from a continuity perspective? All these problems are very easy, but become very hard when you have to do it all at once...

  5. Lame by oldhack · · Score: 3, Interesting

    I bet there are hundreds of simple tutorials on concurrent processing on the web, and these two pieces bring nothing new/interesting.

    --
    Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
  6. Chapel by Warlord88 · · Score: 2

    I think the language Chapel being developed by Cray is taking concrete steps to make Parallel Programming easier. It is very similar in syntax to C++ and even supports some higher level constructs.

    This looks like an advertisement for Chapel but I have no relation to Cray. Having taken a graduate parallel programming course, I cannot agree more with the statement that "Parallel Programming is difficult". I struggled a lot with pthreads and MPI before doing the final assignment in Chapel which was a pleasant surprise. The difference between serial and parallel code was FIVE characters only - and it gave a near linear speedup on three different machines.

  7. Efficiency is hard by tlhIngan · · Score: 4, Informative

    Using locks and the like make it very easy to do multithreaded and parallel programs.

    The big problem comes when you need multiple locks because you find your program is waiting more on locks than anything else which is gumming up the whole works, and that can easily lead to deadlocks and other fun stuff.

    Another way is to consider lockless algorithms, which don't have such blocking mechanisms. However, then you get into issues where atomicity isn't quite so atomic thanks to memory queues and re-ordering done in the modern CPU, and thus have to start adding memory barriers before doing your atomic exchanges.

    Raymond Chen (of Microsoft) did a nice write up of the lockfree ways to do things and what Windows provides to accomplish them.

    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/05/10149783.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/06/10150261.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/06/10150262.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/07/10150728.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/08/10151159.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/08/10151258.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/12/10152296.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/13/10152929.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/14/10153633.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/15/10154245.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/19/10155452.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/20/10156014.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/21/10156539.aspx
    http://blogs.msdn.com/b/oldnewthing/archive/2011/04/22/10156894.aspx

    1. Re:Efficiency is hard by Prune · · Score: 2

      Worse, typical synchronization primitives such as in pthreads and Windows are optimized not for speed but for error handling etc. It is easy to beat their speed with custom implementations, sometimes with dramatic speed increases (see for example numerous articles at http://locklessinc.com/articles/ [locklessinc.com] and I've implemented some of the mutexes and ticket locks and my Windows/Linux software has become faster as a result). In addition, using lock-free algorithms whenever possible can provide a further performance enhancement (various algorithms and optimizations at www.1024cores.net), but along with implementing those goes an assumption of understanding of memory consistency semantics (acquire, release, consume, weak ordering, strong consistency, etc.) in order to minimize implied or explicit memory barriers, as well as thread local storage tricks in order to minimize data dependencies, the latter becoming all the more important as the number of cores increases and memory necessarily becomes less uniform in access times. Add to this the need to use cache-aware and/or cache-oblivious algorithms, and it is very clear that optimization in a modern hardware environment is anything but simple, and moreover cannot be anything but simple!

      --
      "Politicians and diapers must be changed often, and for the same reason."
  8. Parallel programming *NOT* widely needed by ljw1004 · · Score: 3, Interesting

    The dirty secret of parallel programming is that it's *NOT* so widely needed. I think a lot of academics got funding to study automatic parallelization or other parallel techniques, and they latch on to multicore as a justification for it, but it's not.

    There is only one GOOD reasons to use multithreading -- because your work is compute-bound. This typically happens on large-data applications like audio/video processing (for which you just call out to libraries that someone else has written), or else on your own large-data problems that have embarrassingly trivial parallelization: e.g.

    var results = from c in Customers.AsParallel() where c.OrderStatus="unfilfilled" select new {Name=c.Name, Cost=c.Cost};

    Here, using ParallelLINQ, it's as simple as just sticking in "AsParallel()". The commonest sort of large-data problems don't have any complicated scheduling.

    There are also BAD reasons why people have used multithreading, particularly to deal with long-latency operations like network requests. But this is a BAD reason, and you shouldn't use multithreading for it. There are better alternatives, as shown by the Async feature in F#/VB/C# which I worked on, which was also copied into Javascript with Google's traceur compiler). e.g.

    Task task1 = (new WebClient()).DownloadStringTaskAsync("http://a.com");
    Task task2 = (new WebClient()).DownloadStringTaskASync("http://b.com");
    Task winner = await Task.WhenAny(task1,task2);
    string result = await winner;

    Here it kicks off two tasks in parallel. But they are cooperatively multitasked on the same main thread at the "await" points. Therefore there is *NO* issue about race conditions; *NO* need to use semaphores/mutexes/condition-variables. The potential for unwanted interleaving is dramatically reduced.

    So in the end, who are the people who still need to develop multithreaded algorithms? There are very few. I think they're just the people who write high-performance multithreaded libraries.

    1. Re:Parallel programming *NOT* widely needed by Xyrus · · Score: 2

      The dirty secret of parallel programming is that it's *NOT* so widely needed. I think a lot of academics got funding to study automatic parallelization or other parallel techniques, and they latch on to multicore as a justification for it, but it's not.

      It is widely needed, but perhaps not by you.

      There is only one GOOD reasons to use multithreading -- because your work is compute-bound.

      Ok, now you're confusing the issue. Are you talking about parallel programming or multi-threaded programming? Parallel programming is larger in scope the simple multi-threaded programming.

      So in the end, who are the people who still need to develop multithreaded algorithms? There are very few. I think they're just the people who write high-performance multithreaded libraries.

      No, there are quite a few and many of them make quite a lot of money to do so. Do you think the programmers at ILM, 3DS, Pixar, and NASA are just sitting around doing nothing? Does your MT algorithm library also know how to optimize for the GPU as well? Are you sure that a one size fits all approach will work for every application out there?

      To be clear, parallel programming also encompasses programming platforms like super-computers, which I assure you does not have a simple MT algorithm library to magically optimize applications. A whole new level of complexity is added with the introduction of GPUs. It gets even more complicated if you deal with non-symmetric distributions or dynamic regridding.

      The world of parallel programming is quite large with a lot of big companies and brainpower invested to maximize efficiency and utilization. I think you're greatly underestimating how important and vast the field is.

      --
      ~X~
  9. Re:multiple processors by maraist · · Score: 4, Interesting

    While you're correct from a temporarily practical measure, I disagree in theory. OS theory 20 or more years ago was about one very simple concept.. Keeping all resources utilized. Instead of buying 20 cheap, slow full systems (at a meager $6k each), you can buy 1 $50k machine and time-share it. All your disk-IO will be maximized, all your CPUs will be maximized, network etc. Any given person is running slower, but you're saving money overall.

    If I have a single 8 core machine but it's attached to a netapp disk-array of 100 platters over a network, then the latency means that the round trip of a single-threaded program is almost guaranteed to leave platters idle. If, instead I split a problem up into multiple threads / processes (or use async-IO concepts), then each thread can schedule IO and immediately react to IO-completion, thereby turning around and requesting the next random disk block. While async-IO removes the advantage of multiple CPUs, it's MASSIVELY error-prone programming compared to blocking parallel threads/processes.

    A given configuration will have it's own practical maximum and over-saturation point. And for most disk/network sub-systems, 8 cores TODAY is sufficient. But with appropriate NUMA supported motherboards and cache coherence isolation, it's possible that a thousand-thread application-suite could leverage more than 8 cores efficiently. But I've regularly over-committed 8 core machine farms with 3 to 5 thousand threads and never had responsiveness issues (each thread group (client application) were predominantly IO bound). Here, higher numbers of CPUs allows fewer CPU transfers during rare periods of competing hot CPU sections. If I have 6 hot threads on 4 cores, the CPU context switches leach a measureable amount of user-time. But by going hyper-threading (e.g. doubling the number of context registers), we can reduce the overhead slightly.

    Now for HPC, where you have a single problem you're trying to solve quickly/cheaply - I'll admit it's hard to scale up. Cache contention KILLS performance - bringing critical region execution to near DRAM speeds. And unless you have MOESI, even non-contentious shared memory regions run at BUS speeds. You really need copy-on-write and message passing. Of course, not every problem is efficient with copy-on-write algorithms (i.e. sorting), so YMMV. But this, too was an advocation for over-committing.. Meaning while YOUR problem doesn't divide. You can take the hardware farm and run two separate problems on it. It'll run somewhat slower, but you get nearly double your money's worth in the hardware - lowering costs, and thus reducing the barrier to entry to TRY and solve hard problems with compute farms.
    amazon EC anyone?

    --
    -Michael
  10. Re:unaware? WTF? by CastrTroy · · Score: 2

    I took a course in parallel programming. We worked with MPI in C if anyone is curious or interested. The hardest part was completely changing the way you thought about programming. It was like the first time you looked at prolog. Or the first time you tried a functional language like scheme. If you've only ever done procedural, it can be quite a big deal to switch. Also, it's quite a bit harder to debug multithreaded code, as the exact order of the instructions is different everytime you run it. Tracking down a deadlock and other problems that only exist in parallel programs can be quite difficult. Anybody who thinks thread syncronization is a solved problem has never written a complex multithreaded application, or is a programming genius. Most likely the former.

    --

    Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
  11. Re:I think they made a movie about this... by Anthony+Mouse · · Score: 2

    Importantly, comparisons to factories are generally unwise because manufacturing is 'embarrassingly parallel' -- it doesn't matter if you have to make the whole iPhone in one single step that takes two straight hours and you can't divide it into any constituent part. You can still manufacture as many iPhones a day as you like because you just set up however many manufacturing units which each make one iPhone every two hours, until you have the desired output capacity.

    The 'hard' problems to parallelize are the ones that work like building a house: You can't lay the foundation until you grade the terrain. You can't put in the walls until you lay the foundation. You can't install the electrical until there are walls. And so on. There is no known way to convert an unimproved lot and a truck full of concrete and drywall into into a house in less than 5 minutes, no matter how many construction workers you have available.