Slashdot Mirror


Why Does Current Clustering Require Recoding?

AugstWest asks: "I've been doing some research into what the available clustering options are for pooling CPU resources, and it looks like most of the solutions I've found require that programs be re-written to take advantage of the cluster. Since there are virtualization apps like Bochs and VMWare, where the applications just make use of a virtual CPU as if it was a real CPU, why aren't there clustering solutions that do this as well?"

12 of 75 comments

  1. latency? by Johnny Mnemonic · Score: 5, Insightful

    why aren't there clustering solutions that do this as well?

    Because it's a lot faster to address a local CPU than it is to send that info down the wire to a remote CPU? And because of that latency, it's a lot easier to keep 2 or more local CPUs in sync than it is to keep 2 or more remote CPUs in sync?

    You need to recode to work around the severe latency of communicating over a network cable, so you design your apps to minimize messaging between CPUs. Some apps can do this well: they don't need results from other CPUs to complete their own work.

    Other applications require CPUs to work in tandem, and making each CPU wait while results are served out over GigE would suck some serious ass, even if it's technically possible.
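A back-of-the-envelope model makes the latency point concrete. This is a minimal sketch, not a benchmark: it assumes the compute divides perfectly across CPUs and that each CPU blocks for one full latency period per message, and the latency figures in the demo are illustrative assumptions rather than measurements.

```python
def parallel_time(compute_s: float, n_cpus: int, messages_per_cpu: int, latency_s: float) -> float:
    """Estimated wall-clock seconds for one job split across n_cpus.

    Assumes the work divides evenly and that every message a CPU waits on
    costs one full latency period (no overlap of communication and compute).
    """
    return compute_s / n_cpus + messages_per_cpu * latency_s


def speedup(compute_s: float, n_cpus: int, messages_per_cpu: int, latency_s: float) -> float:
    """Serial time divided by estimated parallel time."""
    return compute_s / parallel_time(compute_s, n_cpus, messages_per_cpu, latency_s)


if __name__ == "__main__":
    # Illustrative latencies only (assumed, not measured): ~100 ns between
    # CPUs on one board, ~50 us per message over a commodity network.
    for label, latency in [("local", 100e-9), ("network", 50e-6)]:
        for msgs in (1_000, 1_000_000):
            s = speedup(10.0, 8, msgs, latency)
            print(f"{label:8s} {msgs:>9,} msgs/CPU -> speedup {s:5.2f}x")
```

With these assumed numbers, a million messages per CPU barely dents the 8-CPU speedup when the CPUs are local, but over the network the same job runs slower than it would on a single CPU.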

    --
    $tar -xvf .sig.tar
    1. Re:latency? by Frumious Wombat · Score: 3, Informative

      Don't forget disk access issues as well. You now have file locking, non-local disk access, and race conditions to contend with.

      An example from my work: we tend to write scratch files of several hundred megabytes to several gigabytes, then continually perform read/write operations on them during a calculation. If the disk isn't local to the process, you end up flooding the network and bringing everything to a screeching halt.

      In a Mosix/Condor-type environment, you then have to sort out, given this disk limitation, which processes can be migrated to other CPUs or can share their CPU with a second job because they don't fully utilize it, and which need exclusive access to the CPU and near-exclusive access to the disk to keep the calculation from bogging down.

      Then, as the parent mentioned, you have the CPU-to-CPU communication issues, the network overhead, and memory access patterns, all of which are hard. In theory, had you written your code correctly in the first place, this would only be moderately annoying. But since most people's applications are single-threaded, most programming is taught in serial mode, and the tools for MPar work are still expensive and exotic, you get a situation where it's easy to run a compute farm (massive numbers of single-processor jobs) but hard to run a parallel cluster (one job aggregating resources).
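The compute-farm case is easy precisely because the jobs never talk to each other. In the sketch below, `run_job` is a hypothetical stand-in for one independent single-processor calculation, and a thread pool on one machine stands in for a farm scheduler such as Condor, which this code does not use or model. (In CPython, threads will not actually speed up CPU-bound work; the point here is the structure.)

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(seed: int, steps: int = 1_000) -> tuple[int, int]:
    """Hypothetical stand-in for one independent calculation.

    It reads only its own inputs and returns only its own result, so it needs
    no locks, no shared scratch file, and no messages from other jobs.
    """
    x = seed
    for _ in range(steps):
        x = (x * 1103515245 + 12345) % 2**31  # cheap deterministic busywork
    return seed, x


def run_farm(seeds: list[int], workers: int = 4) -> dict[int, int]:
    """Compute farm: hand out whole jobs, collect whole answers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_job, seeds))


if __name__ == "__main__":
    results = run_farm(list(range(16)))
    print(len(results), "jobs finished")
```

The hard case, one job aggregating resources, is exactly what this pattern cannot express: as soon as `run_job` needed another job's intermediate state, it would need messages and synchronization.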

      --
      the more accurate the calculations became, the more the concepts tended to vanish into thin air. R. S. Mulliken
  2. cooking lessons by xutopia · · Score: 3, Insightful
    Imagine telling a group of 10 cooks to make a huge roast. You wouldn't cut the roast into 10 pieces, have each cook one piece separately, and then glue the pieces back together; gluing the 10 pieces back would be very difficult. Instead, it makes more sense to give each cook a separate task: a few could cut vegetables while another makes a sauce and another a salad.

    As it stands today, an OS cannot easily share tasks. But some tasks are more easily shareable than others. I imagine within a century we'll be able to share tasks more easily, and I think the CELL chip is meant to ease this transition, but I could be wrong.
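In code, the cooks analogy corresponds to task parallelism: distinct, independent steps run concurrently and are combined only at the end. This is a minimal sketch; the kitchen functions are hypothetical placeholders for real work, and worker threads play the cooks.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical independent tasks: none needs another's result.
def roast_meat() -> str:
    return "roast"


def chop_vegetables() -> str:
    return "vegetables"


def make_sauce() -> str:
    return "sauce"


def make_salad() -> str:
    return "salad"


def cook_dinner() -> list[str]:
    """Give each 'cook' (worker thread) a different task, then plate together."""
    tasks = [roast_meat, chop_vegetables, make_sauce, make_salad]
    with ThreadPoolExecutor(max_workers=len(tasks)) as kitchen:
        futures = [kitchen.submit(task) for task in tasks]
        # The only synchronization point is this final join, in task order.
        return [f.result() for f in futures]


if __name__ == "__main__":
    print(cook_dinner())
```

Cutting the roast into ten pieces would instead be data parallelism, which pays off only when the pieces can be recombined cheaply.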

    1. Re:cooking lessons by swmccracken · · Score: 3, Insightful

      See, is this high-availability clustering or performance clustering? The asker doesn't say, and it's a rather important distinction.

      If it's HA, you'd get 10 cooks each to make a roast. Sure, you'd end up cooking extra meat, but that doesn't matter: the goal here is to guarantee that a roast will be cooked no matter what. (I can imagine two copies of Bochs running on separate physical machines but linked to run in absolute lock-step. Performance might be impaired, but reliability will be there.)

      If it's performance, then you're right, you can't magically glue two computers together and get twice the performance.
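The lock-step idea can be sketched as state-machine replication: replicas start from the same state and apply the same inputs in the same order, so a deterministic program ends up in the same state on every copy, and any survivor can carry on. This is a toy, single-process sketch; the `Replica` class and `run_lockstep` are hypothetical, and they ignore the networking, input-ordering protocol, and failure detection a real HA setup needs.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical deterministic state machine: a running balance."""
    name: str
    balance: int = 0
    alive: bool = True

    def apply(self, delta: int) -> None:
        self.balance += delta


def run_lockstep(replicas: list[Replica], inputs: list[int]) -> int:
    """Feed every input to every live replica, in the same order."""
    for delta in inputs:
        live = [r for r in replicas if r.alive]
        if not live:
            raise RuntimeError("no live replicas left")
        for r in live:
            r.apply(delta)
        # Lock-step check: every surviving copy must agree after each step.
        if len({r.balance for r in live}) != 1:
            raise RuntimeError("replicas diverged")
    return next(r.balance for r in replicas if r.alive)


if __name__ == "__main__":
    a, b = Replica("a"), Replica("b")
    print(run_lockstep([a, b], [5, -2, 10]))  # both copies reach 13
    a.alive = False                           # one machine fails...
    print(run_lockstep([a, b], [1]))          # ...and the survivor carries on
```

Every replica does all of the work, which is why this buys reliability rather than speed.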

  3. part of the issue by sfcat · · Score: 3, Informative
    When making an application distributed, you must figure out how to replicate the memory the application uses to other machines and make sure that this replication and synchronization work is transparent to the logic of the application. But this replication and synchronization is far, far, far too expensive (computationally) if done naively. So either special system calls (which is what the recoding requires) or a redesign of how work is parcelled out to worker threads is necessary.

    This is in addition to handling resources such as database connections and other shared resources across the distributed cluster. I'm not exactly sure what your specific needs are, but when you separate threads across different physical memory spaces, it creates significant problems to overcome. If you just want to virtualize the application (one machine, many virtual machines, one physical memory), then the recoding should be trivial; in fact, I agree that in this isolated case no recoding should be necessary. But most of the time, clustering entails spanning multiple physical memories, and thus the application needs to be designed to handle these difficulties.
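To see why naive replication is so expensive, count messages. This sketch assumes a hypothetical cluster in which every write to "shared" memory is pushed to all other nodes, and compares that with an application redesigned to work on private data and send only a final result to one coordinator. Both function names are hypothetical, and the code counts messages rather than sending any.

```python
def naive_replication_messages(writes_per_node: int, n_nodes: int) -> int:
    """Every write on every node is broadcast to the other n_nodes - 1 nodes."""
    return writes_per_node * n_nodes * (n_nodes - 1)


def redesigned_messages(n_nodes: int) -> int:
    """Each node works on private memory and sends one result to a coordinator."""
    return n_nodes - 1  # the coordinator keeps its own result locally


if __name__ == "__main__":
    for nodes in (2, 8, 32):
        naive = naive_replication_messages(1_000_000, nodes)
        print(f"{nodes:3d} nodes: naive {naive:,} messages vs redesigned {redesigned_messages(nodes)}")
```

Under these assumptions the naive count grows with the number of writes and roughly with the square of the node count, which is why the recoding goes into saying explicitly what actually needs to be shared.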

    --
    "Those that start by burning books, will end by burning men."
  4. Mosix by NitsujTPU · · Score: 4, Insightful

    You might want to try Mosix.

    http://www.mosix.org/

  5. TANSTAAFL by Julian Morrison · Score: 4, Insightful

    Clustering exposes complications regarding: shared data, latency, concurrency, transactions, central control, security, failovers, and so forth. It's hard because it's hard.

  6. Compilers by Marillion · · Score: 3, Interesting
    Most compilers/interpreters support languages designed for single-threaded execution: Fortran, COBOL, C, C++, Ruby, Perl, PHP, Java, ... Sure, all of these have API calls to make use of multiple threads, but the language itself isn't multi-threaded.

    In my shameless search for a site to cite, I found this http://www-unix.mcs.anl.gov/dbpp/ which covers lots of problems that have to be solved.

    I'd love to see a language (or language extension) cleanly define a way to let me attach attributes to a code block that could affect how and where it gets executed. The runtime library could then distribute that block as the environment best allows.
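Without language support, the closest approximation is usually a library-level annotation. Everything in this sketch is hypothetical: the `block` decorator, its `independent` attribute, and the toy `run_blocks` runtime are invented for illustration and do not correspond to any real library. The point is only that the runtime, not the caller, decides where a marked block runs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Block = Callable[[], object]


def block(independent: bool) -> Callable[[Block], Block]:
    """Hypothetical attribute: may this block run anywhere, alongside others?"""
    def mark(fn: Block) -> Block:
        fn.independent = independent  # stored on the function for the runtime to read
        return fn
    return mark


def run_blocks(blocks: list[Block]) -> list[object]:
    """Toy runtime: hand independent blocks to a pool, run the rest in order here.

    Unmarked blocks are treated as serial, the safe default.
    """
    results: list[object] = [None] * len(blocks)
    with ThreadPoolExecutor() as pool:
        pending = {
            i: pool.submit(b)
            for i, b in enumerate(blocks)
            if getattr(b, "independent", False)
        }
        for i, b in enumerate(blocks):
            if i not in pending:
                results[i] = b()  # serial blocks run one after another, in order
        for i, fut in pending.items():
            results[i] = fut.result()
    return results


@block(independent=True)
def render_thumbnails() -> str:
    return "thumbnails"  # hypothetical work with no shared state


@block(independent=False)
def update_index() -> str:
    return "index"  # hypothetical work that must stay in sequence


if __name__ == "__main__":
    print(run_blocks([render_thumbnails, update_index]))
```

A real runtime could use the same attribute to ship a block to another machine instead of a local thread; this toy only chooses between a pool and the calling thread.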

    --
    This is a boring sig
    1. Re:Compilers by GileadGreene · · Score: 3, Informative
      I'd love to see a language (or language extension) cleanly define a way to let me attach attributes to a code block that could affect how and where it gets executed.

      The venerable occam programming language requires that each block of code be specifically identified as being executable either in parallel or sequentially. Since PAR and SEQ constructs can be nested, it is easy to build up quite complex concurrent structures that can easily be distributed. Since the semantics of occam processes are derived from Hoare's CSP process algebra, the compositional nature of occam's parallelism is theoretically sound, and it avoids many of the problems associated with the thread-based concurrency model that most people are familiar with.
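The flavor of nested PAR and SEQ can be imitated, loosely, with threads and blocking channels. This is a rough Python analogy, not occam semantics: `seq`, `par`, and the three pipeline processes are hypothetical helpers, `queue.Queue(maxsize=1)` only approximates occam's unbuffered rendezvous channels, and the `None` end-of-stream marker is an assumption of this sketch.

```python
import queue
import threading
from typing import Callable

Process = Callable[[], None]


def seq(*procs: Process) -> Process:
    """SEQ: run the component processes one after another."""
    def run() -> None:
        for p in procs:
            p()
    return run


def par(*procs: Process) -> Process:
    """PAR: run the components concurrently; finish when all of them finish."""
    def run() -> None:
        threads = [threading.Thread(target=p) for p in procs]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return run


def producer(out: queue.Queue, items: list[int]) -> Process:
    def run() -> None:
        for x in items:
            out.put(x)
        out.put(None)  # end-of-stream marker (assumption of this sketch)
    return run


def squarer(inp: queue.Queue, out: queue.Queue) -> Process:
    def run() -> None:
        while (x := inp.get()) is not None:
            out.put(x * x)
        out.put(None)
    return run


def collector(inp: queue.Queue, sink: list[int]) -> Process:
    def run() -> None:
        while (x := inp.get()) is not None:
            sink.append(x)
    return run


def demo() -> list[int]:
    a: queue.Queue = queue.Queue(maxsize=1)
    b: queue.Queue = queue.Queue(maxsize=1)
    results: list[int] = []
    # Nested composition: a SEQ whose first step is a three-process PAR pipeline.
    main = seq(
        par(producer(a, [1, 2, 3]), squarer(a, b), collector(b, results)),
        lambda: results.append(-1),  # runs only after the whole PAR has finished
    )
    main()
    return results


if __name__ == "__main__":
    print(demo())
```

The processes share nothing except the channels they are given, which is the property that makes this style of composition easy to distribute.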

  7. This is a basic systems question. by stienman · · Score: 4, Informative

    This is a basic systems question:

    [Why must] programs be re-written to take advantage of the cluster.

    The simple answer is that programs, in general, are written as single threaded applications with shared state (memory). A cluster is the opposite of that - multiple parallel CPUs without shared state (or at least requiring one to be explicit about shared state, as opposed to simply declaring a variable).

    Usually a program's algorithm has to be completely redesigned to take advantage of the cluster while mitigating these problems. At minimum, the program must be parallelized. If you don't change the program to deal successfully with shared-memory latency, then the cluster will be, at best, about as powerful as a single fast computer running the program.

    The reason you are asking this question is that you don't realize that a cluster is fundamentally different from a single (or dual or quad) CPU machine. The architecture is completely different. You can't expect to treat it like any old computer.
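A small illustration of the difference: the serial version below relies on one shared accumulator, while the cluster-style version gives each worker its own slice of the data and private state, and the only communication is each worker sending back one partial result. Threads and a `queue.Queue` stand in for separate nodes and a network link; that substitution and both function names are assumptions for illustration.

```python
import queue
import threading


def serial_sum_of_squares(data: list[int]) -> int:
    """Single-threaded, shared-state style: one accumulator, updated in place."""
    total = 0
    for x in data:
        total += x * x
    return total


def partitioned_sum_of_squares(data: list[int], workers: int = 4) -> int:
    """Cluster style: no shared accumulator; workers report partials as messages."""
    results: queue.Queue = queue.Queue()  # stands in for the network
    chunk = (len(data) + workers - 1) // workers  # ceiling division

    def worker(part: list[int]) -> None:
        partial = sum(x * x for x in part)  # private, node-local state
        results.put(partial)                # exactly one message per worker

    threads = [
        threading.Thread(target=worker, args=(data[i * chunk:(i + 1) * chunk],))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(workers))
```

Even this easy case needed restructuring: the loop had to be split, each piece given private state, and a combining step added. Algorithms whose steps depend on each other's intermediate results need much more than this.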

    -Adam

  8. Holy Grail by owlstead · · Score: 3, Insightful

    This will be a bit difficult to explain fully. The other posts have already touched lightly on the problems involved (especially latency). But you are talking about the holy grail of parallel computing here: seeing one system while it is running all over the place. My best advice is to get a good book on parallel systems and get educated. This is something like asking a doctor why there are still diseases.

  9. Your language doesn't support it by photon317 · · Score: 3, Informative


    The only way you'll have source code that compiles and runs unmodified and efficiently on architectures of widely varying parallelism is for the language itself to know about parallelism, and to make it the compiler's (and even the runtime linker's and kernel's) job to parallelize your code for you. An inherently parallel language would have ways for you to specify in your source code what can and cannot be executed in parallel, and what code absolutely depends on the serial execution of some previous code. Even then, we're really only talking about the SMP case. When you start involving network latencies and bandwidth restrictions, the decisions on when and how to parallelize become more challenging for the compiler/runtime, possibly requiring more intelligence on its part, more meta-information in your source code, or both.

    Until you write code in a language like that, you can never expect to write code in a single-threaded mindset and then have it just magically take advantage of a parallel environment.
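One way to picture what such a compiler or runtime could do with dependency information: if each task declares what it depends on, every task whose dependencies are satisfied can run in parallel, wave by wave. This is a minimal sketch assuming a hand-written, hypothetical task table; a parallel language would derive or check this information itself. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter
from typing import Callable

Task = Callable[[dict[str, object]], object]


def run_with_dependencies(
    tasks: dict[str, Task], deps: dict[str, set[str]]
) -> tuple[dict[str, object], list[list[str]]]:
    """Run each task once all of its declared dependencies have finished.

    Tasks that become ready together run concurrently as one 'wave'.
    Returns the results and the waves, to show what ran in parallel.
    """
    sorter = TopologicalSorter(deps)
    for name in tasks:
        sorter.add(name)  # include tasks that declare no dependencies
    sorter.prepare()
    results: dict[str, object] = {}
    waves: list[list[str]] = []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            waves.append(ready)
            futures = {n: pool.submit(tasks[n], dict(results)) for n in ready}
            for n, fut in futures.items():
                results[n] = fut.result()
                sorter.done(n)
    return results, waves


# Hypothetical task table: each task sees the results of earlier waves.
TASKS: dict[str, Task] = {
    "load": lambda r: [3, 1, 2],
    "config": lambda r: {"scale": 10},
    "sort": lambda r: sorted(r["load"]),
    "scale": lambda r: [x * r["config"]["scale"] for x in r["sort"]],
}
DEPS = {"sort": {"load"}, "scale": {"sort", "config"}}

if __name__ == "__main__":
    results, waves = run_with_dependencies(TASKS, DEPS)
    print(waves)
    print(results["scale"])
```

Here `load` and `config` run side by side because neither depends on anything, while `sort` and `scale` must wait. As the comment notes, this settles only what *may* run in parallel on shared-memory hardware, not whether it is worth shipping across a network.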

    --
    11*43+456^2