Why Does Current Clustering Require Recoding?
AugstWest asks: "I've been doing some research into what the available clustering options are for pooling CPU resources, and it looks like most of the solutions I've found require that programs be re-written to take advantage of the cluster. Since there are virtualization apps like Bochs and VMWare, where the applications just make use of a virtual CPU as if it was a real CPU, why aren't there clustering solutions that do this as well?"
why aren't there clustering solutions that do this as well?
Because it's a lot faster to address a local CPU than it is to send that info down the wire to a remote CPU? And because of that latency, it's a lot easier to keep 2 or more local CPUs in sync than it is to keep 2 or more remote CPUs in sync?
You need to recode because you want to work around the latency, which is severe, of working via a network cable--so you design your apps to minimize messaging between CPUs. Some apps can do this well--they don't need results from other CPUs to complete their own information.
Other applications require CPUs to work in tandem, and for each CPU to have to wait while the results are served out over GigE would suck some serious ass, even if it might be technically possible.
--
$tar -xvf
``Since there are virtualization apps like Bochs and VMWare, where the applications just make use of a virtual CPU as if it was a real CPU, why aren't there clustering solutions that do this as well?''
Because it's virtualization, and thus hurts performance?
Please correct me if I got my facts wrong.
It's hard to take arbitrary code and decide which parts can be run on opposite ends of a network cable.
Sure, you could make a clustering application that would run arbitrary x86 code on separate machines, but it would be many orders of magnitude slower than just running the code on one big Xeon.
Hell, it's hard enough to take a single thread and spread work across multiple execution units in the CPU for out-of-order execution, and too hard to do it across multiple CPUs in a single box. Why would it be possible across a network cable? Have I completely misunderstood the question?
There are no trails. There are no trees out here.
As it stands today, an OS cannot easily share tasks. But there exists some tasks which are more easily shareable than others. I imagine within a century we'll be able to share tasks more easily and I think the CELL chip is meant to ease this transition but I could be wrong.
This is in addition to the handling of resources such as database connections and other shared resources across the distributed cluster. I'm not exactly sure what your specific needs are but when you separate threads across different physical memory spaces, it creates significant problems to overcome. If you just want to virtualize the application (so one machine, many virtual machines, one physical memory), then the recoding should be trivial. And I agree, in this isolated case, no recoding should be necessary. But most of the time, clustering entails spaning multiple physical memories, and thus the application needs to be designed to handle these difficulties.
"Those that start by burning books, will end by burning men."
You might want to try Mosix.
http://www.mosix.org/
Clustering exposes complications regarding: shared data, latency, concurrency, transactions, central control, security, failovers, and so forth. It's hard because it's hard.
How is this magical cpu virtualizer going to know what it can split up and send to different computers? Like another poster mentioned, latency is the big issue. If your cpu virtualizer arbitrarily sends instructions over the network to other nodes, but the original program still expects them to be executed at local cpu speed then things are going to get fucked up fast. I wouldn't be surprised if the final result is actually slower than just running the job on one box.
Basically, what's wrong with this idea is the clustering software has no way of knowing what it can chunk up and spit out to other nodes unless the programmer of the software in question tells it. Some multithreaded programs can be run on clusters without a rewrite, but there is already clustering software for that application. What the OP is suggesting is similar to rerouting highway traffic by arbitrarily plucking cars off the highway and putting them on random side streets. They all may get there eventually and, at first, it may seem like they are moving faster, but in the end it just takes everyone a lot longer to reach their destination. Now, if the drivers themselves planned alternate routes to help alleviate congestion on the highways, then there's a good chance everyone would get to their destinations faster.
In my shameless search for a site to cite, I found this http://www-unix.mcs.anl.gov/dbpp/ which covers lots of problems that have to be solved.
I'd love to see a language (or language extension) cleanly define a way to let me define a code block attributes which could affect how and where it gets executed. The runtime library could then distribute that block as the environment best allows.
This is a boring sig
First off, it's not entirely clear what you want to do with it. If you want load balancing then that's one problem. If you want parallel batch processing (such as rendering farms or compiling) then that's another problem. And for the really juicy stuff, ie running a normal application distributed on multiple computers then that is a third, and very different problem.
But all of them require that you add something to the original program which distributes the work (load balancing/render farms). If you want your original program to run in parallel then that is a much harder problem to solve. Basically you'll have to remake it into something like the above.
The last problem would basically require the computer to extract threads out of your code. This is pretty much impossible to do automatically though.
If your problem is so parallelizable that bandwidth isn't a limitation, then you don't need any special clustering software, you just need nfs and ssh: I do all my compiling in a flash with a short script and "make -j 16 CXX=sshcxx".
If your problem isn't that parallelizable and yet you need a whole cluster of computers to run it, odds are you need more efficiency than distributed shared memory can give you. You can access memory on your own node with orders of magnitude more bandwidth and less latency than on other nodes, and if your application doesn't take that into consideration it can run orders of magnitude slower.
Of course, that doesn't apply to every problem, and there are people trying to create exactly the cluster-as-computer architecture you'd like to see for ease of application programming. Check out OpenMosix and MigShm for one example - I haven't used the latter DSM patch myself but I know that for non-shared-memory programs, Mosix has had working process migration code for years.
Actually, I'd recommend openMosix. Granted Mosix is the original and is open source now as well, but it still seems like openMosix is more actively developed.
Free will is just an illusion
This is a basic systems question:
[Why must] programs be re-written to take advantage of the cluster.
The simple answer is that programs, in general, are written as single threaded applications with shared state (memory). A cluster is the opposite of that - multiple parallel CPUs without shared state (or at least requiring one to be explicit about shared state, as opposed to simply declaring a variable).
Usually a program algorithm has to be completely re-designed in order to take advantage of the cluster, while mitigating the problems. At minimum the program must be parallelized. If you don't change the program to succesfully deal with shared memory latency then the cluster becomes nearly as powerful as a single fast computer running the program.
The reason you are asking this question is that you don't realize that a cluster is fundamentally different than a single (or dual or quad) CPU. The architecture is completely different. You can't expect to treat it like any old computer.
-Adam
The virtual machines you mention all run on a single existing system. You want a virtual machine that runs on multiple systems. That goes way beyond what the existing VMs do. They just implement the hardware instructions of a single system in software running on a single system. Taking that implementation and spreading it out among multiple systems means anticipating every clustering problem the code might raise, and solving it in advance.
Nobody knows how to do that. If they did, they'd implement it as the back end of compiler rather than waste the overhead of using a VM.
(They say that there are no stupid questions. Not true. But there are lame stupid questions, and interesting stupid questions. My vocation is answering interesting stupid questions, which is why I'm grateful for this one!)
This will be a bit difficult to explain fully. The other posts have already lightly touched the problems involved (especially latency). But you are talking about the holy grail of parallel computing here; seeing one system while it is running all over the place. My best advice for you is to get a good book on parallel systems and get educated. This is something like asking a doctor why there are still diseases.
(And, for your particular example, mosix has a number of schedulers & you can schedule manually. You can trivially send one postscript file to each node. Of course you can do this "braindead" clustering with a script, but it isn't as robust, easy, or flexible.)Somewhat agree for single apps, especially edge cases that you point out. But if a large number of CPU-intensive processes, (open)Mosix is a good, fairly painless way to divide the load.
cache consistency. When I modify a page that is in my processor cache, now I have to put the word out to the whole network -- and I can't really commit that page until I know for sure that other threads in the cluster did not modify the same page (and, in the case someone did, I must decide how do I merge their modifications and mine, notify them of the merging, etc, etc...) What was a quick (important for performance) operation becomes a dog-slow operation, and maybe puts the whole motif for using a cluster in jeopardy...
HTH,
It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048
The only way you'll have source code that compiles and runs unmodified on architectures of widely varying parallelism efficiently is for the language itself to know about parallelism, and make it the compiler's (and even runtime-linker and kernel's) job to parallelize your code for you. An inherently parallel language would have ways for you to specify in your source code what can and cannot be executed in parallel, and what code absolutely depends on the serial execution of some previous code. Even then, we're really only talking about the SMP case. When you start involving network latencies and bandwidth restrictions, the decisions on when and how to parallelize become more challenging for the compiler/runtime, possibly requiring either more intelligence on its part and/or more meta-information in your source code.
Until you write code in a language like that, you can never expect to write code in a single-threaded mindset and then have it just magically take advantage of a parallel environment.
11*43+456^2
Beowulf clusters typically are designed for specific purposes and software is written to take advantage of the design. You can't have two computers add 2+2 any faster than you can have one computer do it. You can however, have two computers adding 2+2 and 0+1+1 at the same time to get two answers in half the time it would take one computer to do it.
I'm certainly no expert, but I have researched this a bit since I work in a department with a LOT of extra boxes laying around. They're slow individually but together add up to a good bit of processing power and memory. We want to put them to use but the question is "what use?"
That question boils down to programs designed to use multiple threads versus splitting processes. If your needs involve running things that require lots of processes, then openMosix is a good bet, but if you're simply wanting to make your favorite software run faster, the answer might be to rebuild it to take advantage of a Beowulf cluster with more threads rather than trying to divy up the processes. Fortunately, there are compile tools out there to make it a little easier and specifically openMosix has some compile tools to make programs more multi-process friendly.
Despite all the tools though, some programs just don't divide well without significant recoding. If you're faced with that type of problem, its time to call in the coding gurus because openMosix can't help you. Others, like apache and mysql were practically written to be shared.
OpenMosix may be the answer or not, it all depends on the question, which in this case isn't completely clear because the objective and software desired aren't discussed.
As to the why clustering works this way, there are far more technical and probably much more accurate answers but in simple terms, you can't make two computers do one thing faster than one computer can do it unless you can divide the job. Some jobs divide easily, some don't.
B) Eliminate all the stupid users. This is frowned upon by society.