Why Does Current Clustering Require Recoding?
AugstWest asks: "I've been doing some research into what the available clustering options are for pooling CPU resources, and it looks like most of the solutions I've found require that programs be re-written to take advantage of the cluster. Since there are virtualization apps like Bochs and VMWare, where the applications just make use of a virtual CPU as if it was a real CPU, why aren't there clustering solutions that do this as well?"
This is in addition to the handling of resources such as database connections and other shared resources across the distributed cluster. I'm not exactly sure what your specific needs are, but when you separate threads across different physical memory spaces, it creates significant problems to overcome. If you just want to virtualize the application (so one machine, many virtual machines, one physical memory), then the recoding should be trivial. And I agree, in this isolated case, no recoding should be necessary. But most of the time, clustering entails spanning multiple physical memories, and thus the application needs to be designed to handle these difficulties.
"Those that start by burning books, will end by burning men."
This is a basic systems question:
[Why must] programs be re-written to take advantage of the cluster.
The simple answer is that programs, in general, are written as single-threaded applications with shared state (memory). A cluster is the opposite of that: multiple parallel CPUs without shared state (or at least requiring one to be explicit about shared state, as opposed to simply declaring a variable).
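To make the contrast concrete, here is a minimal sketch in Python. The function names are made up for illustration, and threads inside one process merely stand in for cluster nodes (on a real cluster the chunks and partial results would travel over the network). The point is only that the second version has to say explicitly what data each worker gets and how the results come back.

```python
from concurrent.futures import ThreadPoolExecutor

def serial_sum_of_squares(values):
    # Typical single-threaded style: one loop updating one shared variable.
    total = 0
    for v in values:
        total += v * v
    return total

def partial_sum_of_squares(chunk):
    # A "node" sees only the chunk it was handed and returns a single number.
    return sum(v * v for v in chunk)

def cluster_style_sum_of_squares(values, nodes=4):
    # Explicitly partition the data; nothing is shared implicitly.
    chunks = [values[i::nodes] for i in range(nodes)]
    # Threads are a stand-in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    # Explicitly combine the partial results.
    return sum(partials)
```

Both functions return the same answer, but only the second has a shape that could be spread across machines, and getting there meant restructuring the loop rather than just recompiling it.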
Usually a program's algorithm has to be completely re-designed in order to take advantage of the cluster, while mitigating the problems. At minimum the program must be parallelized. If you don't change the program to successfully deal with shared memory latency, then the cluster ends up only about as powerful as a single fast computer running the program.
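A rough back-of-the-envelope model shows why latency can eat the whole benefit. This is a toy sketch, not a real performance model: it assumes the compute divides perfectly across nodes and that every remote memory access adds a fixed delay on the critical path. The function name and all the numbers are made up for illustration.

```python
def estimated_speedup(compute_s, nodes, remote_accesses, latency_s):
    """Toy model: perfectly divisible compute plus a fixed cost per remote access."""
    serial_time = compute_s
    # Assumption: remote accesses are not overlapped with computation.
    cluster_time = compute_s / nodes + remote_accesses * latency_s
    return serial_time / cluster_time

# A 100-second job on 16 nodes, with 100 microseconds per remote access:
chatty = estimated_speedup(100, 16, 1_000_000, 100e-6)   # a million remote accesses
quiet = estimated_speedup(100, 16, 1_000, 100e-6)        # a thousand remote accesses
```

Under these assumptions the chatty version comes out slightly slower than a single machine, while the version redesigned to touch remote memory rarely gets close to the full 16x.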
The reason you are asking this question is that you don't realize that a cluster is fundamentally different than a single (or dual or quad) CPU. The architecture is completely different. You can't expect to treat it like any old computer.
-Adam
Don't forget disk access issues as well. You now have file locking, non-local disk access, and race conditions to contend with.
An example from my work: we tend to write scratch files of several hundred megabytes to several gigabytes, and then perform read/write operations on them continually during a calculation. If the disk isn't local to the process, then you end up flooding the network, and bringing everything to a screeching halt.
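To put rough numbers on that, here is a small sketch. Every figure in it is an assumption picked for illustration (a 2 GB scratch file, 10 full read-and-rewrite passes, a 50 MB/s local disk, a 12.5 MB/s network link shared by 8 jobs), not a measurement from any real system.

```python
def io_seconds(file_bytes, passes, bandwidth_bytes_per_s, jobs_sharing=1):
    # Each pass reads and rewrites the whole scratch file once.
    traffic = 2 * file_bytes * passes
    # Jobs sharing one link or disk split its bandwidth evenly (an assumption).
    return traffic / (bandwidth_bytes_per_s / jobs_sharing)

scratch = 2 * 10**9  # hypothetical 2 GB scratch file

local_disk = io_seconds(scratch, 10, 50 * 10**6)                          # 800.0 seconds
shared_net = io_seconds(scratch, 10, 12_500_000, jobs_sharing=8)          # 25600.0 seconds
```

With a local disk per node, adding jobs adds disks. With one shared link, adding jobs divides the same bandwidth further, which is how a few hours of I/O turns into a stalled cluster.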
In a Mosix-ish/Condor-type environment, you then have to sort out which processes, given this disk limitation, can be migrated to other CPUs, or use little enough CPU that a second job can start alongside them, and which processes must have exclusive access to the CPU, and near-exclusive access to the disk, in order to prevent the calc from bogging down.
Then, as the parent mentioned, you have the CPU-CPU communication issues, the network overhead, and memory access patterns, all of which are hard. In theory, had you written your code correctly in the first place, this would only be moderately annoying. But since most people's applications are single-threaded, most programming is taught in serial mode, and the tools for MPar work are still expensive and exotic, you get a situation where it's easy to run a compute farm (massive numbers of single-processor jobs), but hard to run a parallel cluster (one job aggregating resources).
the more accurate the calculations became, the more the concepts tended to vanish into thin air. R. S. Mulliken
The only way you'll have source code that compiles and runs unmodified, and efficiently, on architectures of widely varying parallelism is for the language itself to know about parallelism, and make it the compiler's (and even the runtime linker's and kernel's) job to parallelize your code for you. An inherently parallel language would have ways for you to specify in your source code what can and cannot be executed in parallel, and what code absolutely depends on the serial execution of some previous code. Even then, we're really only talking about the SMP case. When you start involving network latencies and bandwidth restrictions, the decisions on when and how to parallelize become more challenging for the compiler/runtime, possibly requiring more intelligence on its part, more meta-information in your source code, or both.
Until you write code in a language like that, you can never expect to write code in a single-threaded mindset and then have it just magically take advantage of a parallel environment.
11*43+456^2
The venerable occam programming language requires that each block of code be specifically identified as being executable either in parallel or sequentially. Because PAR and SEQ constructs can be nested, it is easy to build up quite complex concurrent structures that can easily be distributed. And because the semantics of occam processes are derived from Hoare's CSP process algebra, the compositional nature of occam's parallelism is theoretically sound, and avoids many of the problems associated with the thread-based concurrency model that most people are familiar with.
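For anyone who hasn't seen occam, the nesting idea can be imitated in Python. This is only a sketch of the shape, not occam itself: `seq` and `par` are helper names invented here, Python threads stand in for occam processes, and a `queue.Queue` of size 1 stands in for a channel (real occam channels are unbuffered rendezvous, so this is an approximation).

```python
import queue
import threading

def seq(*procs):
    """Return a process that runs procs one after another (like SEQ)."""
    def run():
        for p in procs:
            p()
    return run

def par(*procs):
    """Return a process that runs procs concurrently and waits for all (like PAR)."""
    def run():
        threads = [threading.Thread(target=p) for p in procs]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return run

def demo():
    chan = queue.Queue(maxsize=1)  # stand-in for an occam channel
    results = []

    def producer():
        for i in range(5):
            chan.put(i)
        chan.put(None)  # end-of-stream marker (a convention of this sketch)

    def squarer():
        while True:
            item = chan.get()
            if item is None:
                break
            results.append(item * item)

    def report():
        results.append("done")

    # Equivalent in spirit to: SEQ( PAR(producer, squarer), report )
    seq(par(producer, squarer), report)()
    return results
```

Calling `demo()` gives `[0, 1, 4, 9, 16, 'done']`. The two processes inside `par` share nothing except the channel, which is the property that makes this style of program straightforward to spread over separate machines.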