Slashdot Mirror


Auto-Parallelizing Compiler From Codeplay

Max Romantschuk writes "Parallelization of code can be a very tricky thing. We've all heard of the challenges with Cell, and with dual and quad core processors this is becoming an ever more important issue to deal with. The Inquirer writes about a new auto-parallelizing compiler called Sieve from Codeplay: 'What Sieve is is a C++ compiler that will take a section of code and parallelize it for you with a minimum hassle. All you really need to do is take the code you want to run across multiple CPUs and put beginning and end tags on the parts you want to run in parallel.' There is more info on Sieve available on Codeplay's site."

34 of 147 comments (clear)

  1. Reentrant? by Psychotria · · Score: 5, Interesting

    Forgive me if I'm wrong (I've not coded parallel things before), but if the code is re-entrant, does this go a long way towards running the code in parallel? Obviously there are other factors involved here, like addressing memory, but this is thought of in re-entrant programming. I'm not sure what the difference is... please enlighten me :-)

    1. Re:Reentrant? by Anonymous Coward · · Score: 5, Informative

      Reentrancy is a factor, because it's a class of dependencies, but there are many other dependencies.

      Consider a for loop: for (int i=0; i100; i++)doSomething(i);

      Can this be parallelized? Perhaps the author meant it like it's written there: First doSomething(0), then doSomething(1), then ... Or maybe he doesn't care about the order and doSomething just needs to run once for each i in 0..99. The art of automatic parallelization is to find overspecifications like the ordered loop where order isn't really necessary. If nothing in doSomething depends on the outcome of doSomething with a different i, they can be run in parallel and in any order. Suppose each doSomething involves a lengthy calculation and an output at the end. Then they can't simply run in parallel, because the output is a dependency: As written, the output from doSomething(0) comes before doSomething(1) and so on. But the compiler could still run the lengthy calculation in parallel and synchronize only the fast output at the end. The more of these opportunities for parallelism the compiler can find, the better it is.

    2. Re:Reentrant? by 644bd346996 · · Score: 4, Interesting

      In the case of the for loop, that is really a symptom of the fact that c-style languages don't have syntax for saying "do this to each of these". So one must manually iterate over the elements. Java does have the for-each syntax, but it is just an abbreviation of the "for i from 0 to x" loop.

      Practically all for loops written are independent of order, so they could be trivially implemented using MapReduce. That one change would parallelize a lot of code, with no tricky compiler optimizations.

    3. Re:Reentrant? by jd · · Score: 5, Informative
      Simple version: Parallel code need not be re-entrant, but all re-entrant code is parallel.

      More complex version: There are four ways to run a program. These are "Single Instruction, Single Data" (ie: a single-threaded program), "Single Instruction, Multi Data" (SETI@Home would be an example of this), "Multi Instruction, Single Data" (a good way to program genetic algorithms) and "Multi Instruction, Multi Data" (traditional, hard-core parallelism).

      SIMD would need to be re-entrant to be parallel, otherwise you can't be running the same instructions. (Duh. :) SIMD is fashionable, but is limited to those cases where you are operating on the data in parallel. If you want to experiment with dynamic methods (herustics, genetic algorithms, self-learning networks) or where you want to apply multiple algorithms to the same data (eg: data-mining, using a range of specialist algorithms), then you're going to be running a vast number of completely different routines that may have no components in common. If so, you wouldn't care if they were re-entrant or not.

      In practice, you're likely to use a blend of SIMD, MISD and MIMD in any "real-world" program. People who write "pure" code of one type or another usually end up with something that is ugly, hard to maintain and feels wrong for the problem. On the other hand, it usually requires the fewest messaging and other communication libraries, as you're only doing one type of communication. You can also optimize the hell out of the network, which is very likely to saturate with many problems.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    4. Re:Reentrant? by prencher · · Score: 3, Informative
  2. FPP by DigitAl56K · · Score: 5, Funny

    Frtprallps
    is arle ot

    1. Re:FPP by MarkRose · · Score: 2, Informative

      "first parallel post"

      --
      Be relentless!
  3. This is Awesome by baldass_newbie · · Score: 5, Funny

    I loved 'Clocks'. Oh wait, Codeplay...not Coldplay.
    Nevermind.

    Oh look. A duck.

    --
    The opposite of progress is congress
  4. openMP by Anonymous Coward · · Score: 2, Informative

    and what the difference between this and openMP ?

    1. Re:openMP by ioshhdflwuegfh · · Score: 2, Informative

      and what the difference between this and openMP ? On the page 7 of The Codeplay Sieve C++ Parallel Programming System, 2006 you'll find section that describes "advantages" of codeplay over openmp, but nothing terribly exciting. Codeplay allows you indeed to better automatize parallelization but is at the same time also limited to a narrower set of optimizations compared to openmp.
  5. Interesting, but.. by DigitAl56K · · Score: 5, Insightful

    The compiler will put out code for x86, Ageia PhysX and Cell/PS3. There were three tests talked about today, CRC, Julia Ray Tracing and Matrix Multiply. All were run on 8 cores (2S Xeon 5300 CPUs) and showed 739, 789 and 660% speedups respectively.

    That's great - but do the algorithms involved here naturally lend themselves to the parallelization techniques the compiler uses? Are there algorithms that are very poor choices for parallelization? For example, can you effectively parallelize a sort? Wouldn't each thread have to avoid exchanging data elements any other thread was working on, and therefore cause massive synchronization issues? A solution might be to divide the data set by the number of threads and then after each set was sorted merge them in order - but that requires more code tweaking than the summary implies. So I wonder how different this is from Open/MT?

    1. Re:Interesting, but.. by Anonymous Coward · · Score: 5, Interesting

      or example, can you effectively parallelize a sort? Wouldn't each thread have to avoid exchanging data elements any other thread was working on, and therefore cause massive synchronization issues?

      Yes you can, take a look a Merge sort (or quick sort, same idea). You split up the large data set into smaller ones, sort those and recombine. That's perfect for parallization -- you just need a mechanism for passing out the orginal elements and then recombining them.

      So if you had to sort 1B elements maybe you get 100 computers and give them each 1/100th of the data set. THat's manageable for one computer to sort easily. THen just develop a service that hands you the next element from each machine, and you pull off the lowest one.

    2. Re:Interesting, but.. by DigitAl56K · · Score: 3, Interesting

      If you read my post, this is exactly what I suggested. The actual point was that it requires more than simply putting "beginning and end tags" on the code, e.g. it is not automatic.

      I would also ask this of CodePlay: If your compiler is automatic, why do we need to add beginning and end tags? :)

    3. Re:Interesting, but.. by SSCGWLB · · Score: 2, Interesting

      You have a good point; both matrix multiply and ray tracing are embarrassingly parallel problems. They lend themselves to this type of optimization.

      Consider a two NxN matrices, A and B, multiplied together to make a matrix C. Each element of C (Cij), is the sum of Ai[0..N] and Bj[0..N]. This is an almost trivial parallelization problem, commonly one of the first coding exercise learned in a parallel processing class.

      IMHO, this is interesting but has a long way to go before its useful for anything but a narrow set of problem.

  6. snake oil by oohshiny · · Score: 4, Insightful

    I think anybody who is claiming to get decent automatic parallelization out of C/C++ is selling snake oil. Even if a strict reading of the C/C++ standard ends up letting you do something useful, in my experience, real C/C++ programmers make so many assumptions that you can't parallelize their programs without breaking them.

    1. Re:snake oil by mastershake_phd · · Score: 2, Insightful

      "All you really need to do is take the code you want to run across multiple CPUs and put beginning and end tags on the parts you want to run in parallel" The compiler isn't going to know if you're doing something stupid or not. In other words: use at your own risk. The old adage of "garbage in, garbage out" still applies.

      But how are you supposed to know exactly how something is going to run under this? Even with a good understanding of what your trying to do and (hopefully) what exactly the compiler is doing you still might get some weird results under certain situations. It might work buts its still going to take lots of trial and error, or at least a lot of verification.

    2. Re:snake oil by ariels · · Score: 2, Informative
      TFA specifically mentions that you need to mark up your code with sieves:
      1. A sieve is defined as a block of code
        contained within a sieve {} marker and
        any functions that are marked with sieve.
      2. Inside a sieve, all side-effects are delayed
        until the end of the sieve.
      3. Side effects are defined as modifications
        of data that are declared outside the
        sieve
      The compiler can use this information to decide what parts of the code can safely be parallelized. Adding the "sieve" keyword can change the semantics of the code, adding it correctly is your responsibility.
      Not sure I find the particular concept appealing for programming -- just trying to straighten out the claim of the article.
      --
      2 dashes and a space, or just 2 dashes?
  7. Prefer OpenMP by drerwk · · Score: 5, Informative
    I have some small amount of experience with OpenMP http://openmp.org/ , which allows one to modify C++ or Fortran code using pragmas to direct the compiler regarding parallelization of the code. And the Codeplay white paper made this sound much like it implements one of the dozen or so OpenMP patterns. I am fairly skeptical that Codeplay has any advantage over OpenMP, but the white paper lists some purported advantages. I will not copy them here and take the fun out of reading them for yourself. I will list OpenMP advantages.
    1: OpenMP is supported by Sun, Intel, IBM, $MS(?) etc, and implemented in gcc 4.2.
    2: OpenMP has been used successfully for about 10 years now, and is on a 2.5 release of the SPEC.
    3. It is Open - the white paper for Codeplay mentions it being protected by patents. (boo hiss)
    4. Did I mention that it is supported in gcc 4.2 which I built it on my Powerbook last week and it is very cool?

    So maybe Codeplay is a nice system. Maybe they even have users and can offer support. But if you are looking to make your C++ code run multi-threaded with the least amount of effort I've seen ( It is still effort! ) take a look at OpenMP. In my simple tests it was pretty easy to make use of OpenMP, and I am looking forward to trying it on a rather more complicated application.

    1. Re:Prefer OpenMP by PhrostyMcByte · · Score: 4, Informative

      Don't forget the other end of the development spectrum - Visual C++ 2005 has builtin OpenMP support too.

    2. Re:Prefer OpenMP by jd · · Score: 4, Interesting
      Personally, I would agree with you. I have to say I am not fond of OpenMP - I grew up on Occam, and these days Occam-Pi blows anything done in C out of the water. (You can write threads which can auto-migrate over a cluster, for example. Even OpenMOSIX won't work at a finer granularity than entire processes, and most compile-time parallelism is wholly static after the initial execution.)

      On the other hand, OpenMP is a far more solid, robust, established, reputable, reliable solution than Codeplay. The patent in Codeplay is also bothersome - there aren't many ways to produce an auto-parallelizing compiler and they've mostly been done. This means the patent either violates prior art (most likely), or is such "black magic" that no other compiler-writing company could expect to reproduce the results and would be buying the technology anyway. It also means they can't ship to Europe, because Europe doesn't allow software patents and has a reputation of reverse-engineering such code (think "ARC-4") or just pirating/using it anyway (think: pretty good privacy version 2, international version, which had patented code in it)

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  8. Yup by PhrostyMcByte · · Score: 3, Interesting

    For the majority of apps, OpenMP is enough. That is what this looks like - a proprietary OpenMP. It might make it easier than creating and managing your own threads but calling it "auto" parallelizing when you need to mark what to execute in parallel is a bit of a stretch.

    For apps that need more, it is probably a big enough requirement that someone knowledgable is already on the coding team. Which isn't to say that a compiler/lang/lib lowering the "experience required" bar wouldn't be welcomed, just that I wish these people would work on solving some new problems instead of re-tackling old ones.

    The main purpose of these extensions seems to be finding a way to restrict the noob developer enough that they won't be able to abuse threading like some apps love to do. That is a very good thing in my book! (Think Freenet, where 200-600 threads is normal.)

  9. How long has Sun Studio had "-xautopar"? by Anonymous Coward · · Score: 2, Informative

    Yep, it's in there.

    And it works, too.

  10. Been done... by TheRealMindChild · · Score: 3, Interesting

    I have my 'Mips Pro Auto Parallellizing Option 7.2.1' cd sitting right next to my Irix 6.5 machine... and I know it's YEARS old

    --

    "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
    1. Re:Been done... by adrianmonk · · Score: 5, Interesting

      I have my 'Mips Pro Auto Parallellizing Option 7.2.1' cd sitting right next to my Irix 6.5 machine... and I know it's YEARS old

      Oh, are we having a contest for who can name the earliest auto-parallelizing C compiler? If so, I nominate the vc compiler on the Convex computers. The Convex C-1 was released in 1985 and I believe had a vectorizing compiler from the start, which would make sense since it had a single, big-ass vector processor (one instruction, crap loads of operands -- can't remember how many, but it was something like 64 separate values being added to another 64 separate values in one single instruction).

      I personally remember watching somebody compile something with it. It was really neat to watch -- required no special pragmas or anything, just plain old regular C code, and it would produce an annotated copy of your file telling you which lines were fully vectorized, partly vectorized, etc. You could, of course, tweak the code to make it easier for the compiler to vectorize it, but even when you did, it was still plain old C code.

  11. Re:Hey! Let's reinvent OpenMP! by grub · · Score: 2, Interesting

    Our SGI compilers at work come with an -apo (automatic parallization optimization) command line option. That one option cost us a pretty penny. It's nice to see other people getting in on the action.

    Snippet from the manpage, highlighting is mine:

    -apo, -apokeep, -apolist
    For -n32 and -64, it invokes the Auto-Parallelizing Option
    (APO), which automatically converts sequential code into
    parallel code by inserting parallel directives where it is
    safe and beneficial to do so. Specifying -apo also sets the
    -mp option. Both -apokeep and -apolist produce a listing
    file, file.list. Specifying -apokeep retains file.anl and
    file.m, which can be used by the parallel analyzer, ProDev
    ProMP (see the EXAMPLES section). When the -IPA option is
    specified with -apokeep, the default settings for IPA
    suboptions are used with the exception of -IPA:inline, which
    is set to OFF.
    APO is invoked only if you are licensed for it. For licensing
    information, see your sales representative.

    For more information on APO, its directives, and command-line
    options, see MIPSpro C and C++ Pragmas.

    When specifying the -o32 option on the cc command line, -apo
    invokes the IRIS Power C analyzer (PCA). See the -pca option
    description.
    --
    Trolling is a art,
  12. SmartVariables is a good alternative to MPI / PVM by Anonymous Coward · · Score: 2, Insightful

    Let's see if I can teach any old dogs some new trix.

    Here is a quote from the SmartVariables white-paper:

    "The GPL open-source SmartVariables technology works well as a replacement for both MPI and PVM based systems, simplifying such applications. Systems built with SmartVariables don't need to worry about explicit message passing. New tasks can be invoked by using Web-Service modules. Programs always work directly with named-data, in parallel. Tasks are easily sub-divided and farmed out to additional web-services, as needed - without worry of breaking the natural parallelism. If two or more tasks ever access data of the same name and location, then that data is automatically shared between them - without need for additional parallel programming constructs. Instead of using configuration files with lists of available machines, a shared SmartVariables List object (with a commonly accepted name, like "machines@localhost") could easily hold the available host names, which can then be used for dynamic task allocation. The end-result is that SmartVariables-based parallel systems need only reference and work with distributed data, and don't need to manage it. Automatic sharing means there is no need to worry about explicit connection, infrastructure, or message-passing code. Instead, applications only need agree on the names used for their data. Names and object locations are easily managed by using a SmartVariables based Directory-Service as an additional layer of object indirection."

    The rest of this paper is here: http://www.smartvariables.com/doc/DistributedProgr amming.pdf

    A single code-base works on Apple / Linux / Windows.
    Complete code and docs at http://smartvariables.com/

  13. Sounds like multi-threading AND NOT Parallelizing by mrnick · · Score: 3, Insightful

    I read the article, the information at the company's web site and even white papers written on the compiler. And although I did see one reference to "Multiple computers across a network (e.g. a "grid")" there was no other mention of it.

    When I think of Parallelizing software, after getting over my humors mind thinking of a virus that paralyzes users, what comes to mind is clustering. When I think of clustering the train of thought directs me to Beowulf and MPI or it's predecessor PVM. Though I can find no information that supports the concept of clustering in any manner.

    Again I did see a reference to: "Multiple computers across a network (e.g. a "grid")" but according to Wikipedia grid computing is defined "A grid uses the resources of many separate computers connected by a network (usually the Internet) to solve large-scale computation problems. Most use idle time on many thousands of computers throughout the world."

    Well, that sounds like the distributed SETI project and the like, which would seem even more ambitious than a compiler that would help write MPI code for Beowulf clusters.

    From all the examples this looks like a god compiler for writing code that will run more efficiently on multi-core and multi-processor systems but would not help you in writing parallel code for clustering.

    Though, this brings up a concept that many people forget. Even people that I would consider to be rather intelligent on the subject of clustering often forget this. And that is that if you have an 8 computer cluster with each node running on a system with dual-core Intel CPU installed that if you write parallel code for it using MPI you are benefiting from 8 cores in parallel. Many people that write parallel code forget about multi-threading. To benefit from all 16 cores in a cluster I just described the code would have to be written multi-threaded and parallel. One of the main professors involved in a clustering project at my university stated to me that in their test environment they were using 8 dell systems with dual-core Intel CPU so in total they had the power of 16 cores. Since he has his Ph. D. and all I didn't feel the need to correct him and explain that unless his code was both parallel and multi-threaded he was only getting the benefit of 8 cores. I knew he was not multi-threading because they were not even writing the code in MPI rather they were using Python and batching processes to the cluster. From my knowledge Python cannot write multi-threaded applications. Even if it can I know they were not (from looking at their code).

    Sometimes it's the simplest things that confuse the brightest of us....

    Nick Powers

    --

    Encryption: I may not agree with what you say, but I will defend your right to encrypt it...
  14. Sad... by jd · · Score: 2, Funny

    I mean, it was only running on two threads AND showed clear signs of excess barrier operations at the end of every character. From here on out, I expect first parallel posts to run over at least four threads and not be sequentially-coherent. The world is moving towards async! Don't let first posts suffer with past limitations!

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  15. Is this better than OpenMP? by Anonymous Coward · · Score: 2, Informative

    So, I fail to see what's new about this. As has been mentioned before, OpenMP auto-parallelizes for SMP systems quite well, as long as you know what you're doing. Like anything done in parallel, if you don't figure out where your data and algorithm dependencies are you'll hose your program. If Sieve does some sort of dependency analysis, that would be interesting, but I doubt it would catch all problems. In fact, I imagine it's provably impossible to auto-parallelize in the general case -- it will likely be proven equivalent to the halting problem eventually.

    What would be new is when someone substantially improves on MPI. Auto-parallelizing a FOR loop is amusing, doing the same for a complex algorithm moving data around in a cluster, well, that's a different sort of difficult.

    Anyway, no matter how many libraries and tools come out to ease the pain, parallel programming is frigging hard. In fact, the more automagic the compiler, the harder it will be to debug when the inevitable race condition sneaks through. Combine this with lowering the bar for parallel programming and letting more idiots in and we can look forward to some truly horrific code. If you make it so any idiot can code, any idiot will!

  16. Re:single process uses 1 core unless multi-threade by julesh · · Score: 2, Insightful

    The operating system on a multiple-core machine can split up the processes but one process can only run on one core unless it has been written in a multi-threaded fashion.

    In parallel processing general each machine is running one part of a program, thus one program, and unless that program is multi-threaded as well as parallel then it can only use one core per node on a cluster.

    Though, someone who writes multi-threaded parallel applications should be held in high esteem! I don't know any such coders.


    Have you considered that if you run two copies of the process on each node, it will use both cores?

  17. OpenMP can support clusters by mi · · Score: 4, Informative

    Intel's compiler (icc), available for Linux, Windows, and FreeBSD extends OpenMP to clusters.

    You can build your OpenMP code and it will run on clusters automatically. Intel's additional pragmas allow you to control, which things you want parallelized over multiple machines vs. multiple CPUs (the former being fairly expensive to setup and keep in sync).

    I've also seen messages on gcc's mailing list, that talk about extending gcc's OpenMP implementation (moved from GOMP to mainstream in gcc-4.2) to clusters the same way.

    Nothing in OpenMP prevents a particular implementation from offering multi-machine parallelization. Intel's is just the first compiler to get there...

    The beauty of it all is that OpenMP is just compiler pragmas — you can always build the same code with them off (or with a non-supporting compiler), and it will still run serially.

    --
    In Soviet Washington the swamp drains you.
    1. Re:OpenMP can support clusters by mi · · Score: 2, Insightful

      Won't that require some runtime support, like mpirun in MPI (that takes care of rsh/ssh-ing to each node and starting the processes)?

      Well, yes, of course. You also need the actual hardware too :-)

      This is beyond the scope of the discussion, really — all clusters require a fair amount of work to setup and maintain. But we are talking about coding for them here...

      --
      In Soviet Washington the swamp drains you.
  18. Nothing new by UtilityFog · · Score: 2, Informative

    Cilk has been around for years, indeed it won the ICFP 1998 programming contest.

  19. Re:batching code to a cluster is just silly by Mr.+Hankey · · Score: 2, Insightful

    I do work in a HPC environment, and we have a number of clusters available which are utilized through batch processing as well as software interfaces such as OpenMPI.

    In defense of the batch system, assuming we definine parallel as "at the same time", and not "inter-node communication", one should be fine with the multiple processes approach. There are a number of commercial and open source schedulers which will do the work for you, including node weighting for systems with different numbers of processors and/or processor speeds. As long as the nodes are not doing anything else, the OS schedulers do the right thing.

    Given that it's simpler to write, debug and verify single threaded code, pushing multiple instances of an application to a cluster's nodes can reduce development time. Not requiring direct communication between the nodes also allows execution the same application on multiple clusters as seen in grids, indeed even multiple architectures. It also lowers network overhead, unless you introduce InfiniBand or Myranet interfaces (which are costly.)

    --
    GPL: Free as in will