Why Does Current Clustering Require Recoding?
AugstWest asks: "I've been doing some research into what the available clustering options are for pooling CPU resources, and it looks like most of the solutions I've found require that programs be re-written to take advantage of the cluster. Since there are virtualization apps like Bochs and VMWare, where the applications just make use of a virtual CPU as if it was a real CPU, why aren't there clustering solutions that do this as well?"
This is a really good question and I have often wondered it myself.
I am glad someone has finally posed the question to the community at large.
Imagine the renewal of interest in clustering if this were to become reality...
-- If I were a fish, I'd be wet
why aren't there clustering solutions that do this as well?
Because it's a lot faster to address a local CPU than it is to send that info down the wire to a remote CPU? And because of that latency, it's a lot easier to keep 2 or more local CPUs in sync than it is to keep 2 or more remote CPUs in sync?
You need to recode because you want to work around the latency, which is severe, of working via a network cable--so you design your apps to minimize messaging between CPUs. Some apps can do this well--they don't need results from other CPUs to complete their own information.
Other applications require CPUs to work in tandem, and for each CPU to have to wait while the results are served out over GigE would suck some serious ass, even if it might be technically possible.
--
$tar -xvf
``Since there are virtualization apps like Bochs and VMWare, where the applications just make use of a virtual CPU as if it was a real CPU, why aren't there clustering solutions that do this as well?''
Because it's virtualization, and thus hurts performance?
Please correct me if I got my facts wrong.
It's hard to take arbitrary code and decide which parts can be run on opposite ends of a network cable.
Sure, you could make a clustering application that would run arbitrary x86 code on separate machines, but it would be many orders of magnitude slower than just running the code on one big Xeon.
Hell, it's hard enough to take a single thread and spread work across multiple execution units in the CPU for out-of-order execution, and too hard to do it across multiple CPUs in a single box. Why would it be possible across a network cable? Have I completely misunderstood the question?
There are no trails. There are no trees out here.
As it stands today, an OS cannot easily share tasks. But there exists some tasks which are more easily shareable than others. I imagine within a century we'll be able to share tasks more easily and I think the CELL chip is meant to ease this transition but I could be wrong.
This is in addition to the handling of resources such as database connections and other shared resources across the distributed cluster. I'm not exactly sure what your specific needs are but when you separate threads across different physical memory spaces, it creates significant problems to overcome. If you just want to virtualize the application (so one machine, many virtual machines, one physical memory), then the recoding should be trivial. And I agree, in this isolated case, no recoding should be necessary. But most of the time, clustering entails spaning multiple physical memories, and thus the application needs to be designed to handle these difficulties.
"Those that start by burning books, will end by burning men."
You might want to try Mosix.
http://www.mosix.org/
Clustering exposes complications regarding: shared data, latency, concurrency, transactions, central control, security, failovers, and so forth. It's hard because it's hard.
How is this magical cpu virtualizer going to know what it can split up and send to different computers? Like another poster mentioned, latency is the big issue. If your cpu virtualizer arbitrarily sends instructions over the network to other nodes, but the original program still expects them to be executed at local cpu speed then things are going to get fucked up fast. I wouldn't be surprised if the final result is actually slower than just running the job on one box.
Basically, what's wrong with this idea is the clustering software has no way of knowing what it can chunk up and spit out to other nodes unless the programmer of the software in question tells it. Some multithreaded programs can be run on clusters without a rewrite, but there is already clustering software for that application. What the OP is suggesting is similar to rerouting highway traffic by arbitrarily plucking cars off the highway and putting them on random side streets. They all may get there eventually and, at first, it may seem like they are moving faster, but in the end it just takes everyone a lot longer to reach their destination. Now, if the drivers themselves planned alternate routes to help alleviate congestion on the highways, then there's a good chance everyone would get to their destinations faster.
openMosix is a GPLed fork of MOSIX & is undergoing rapid development. No need to recode apps. Apps will work like they do in an SMP machine. So, if they are already faster on a dual-processor machine, they won't need to use special libraries or threading methods to work over several workstations.
More processors don't always increase speed, you have to be able to split up the problem in chunks and then work on them at the same time. The algorithms that simulate a processor aren't easily run in parallel, basically. Or require too much communication overhead.
"It's too bad that stupidity isn't painful." - Anton LaVey
In my shameless search for a site to cite, I found this http://www-unix.mcs.anl.gov/dbpp/ which covers lots of problems that have to be solved.
I'd love to see a language (or language extension) cleanly define a way to let me define a code block attributes which could affect how and where it gets executed. The runtime library could then distribute that block as the environment best allows.
This is a boring sig
First off, it's not entirely clear what you want to do with it. If you want load balancing then that's one problem. If you want parallel batch processing (such as rendering farms or compiling) then that's another problem. And for the really juicy stuff, ie running a normal application distributed on multiple computers then that is a third, and very different problem.
But all of them require that you add something to the original program which distributes the work (load balancing/render farms). If you want your original program to run in parallel then that is a much harder problem to solve. Basically you'll have to remake it into something like the above.
The last problem would basically require the computer to extract threads out of your code. This is pretty much impossible to do automatically though.
Oh, that sounds like a tough one to me. Ok, ok, it's actually not that tough - but it DOES require combining a number of kernel patches, there's no one-stop-shop (at the moment) for this. It also requires that network connections be IPv6, as there's bugger all mobility support out there for IPv4 for Linux as best as I can tell.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
If your problem is so parallelizable that bandwidth isn't a limitation, then you don't need any special clustering software, you just need nfs and ssh: I do all my compiling in a flash with a short script and "make -j 16 CXX=sshcxx".
If your problem isn't that parallelizable and yet you need a whole cluster of computers to run it, odds are you need more efficiency than distributed shared memory can give you. You can access memory on your own node with orders of magnitude more bandwidth and less latency than on other nodes, and if your application doesn't take that into consideration it can run orders of magnitude slower.
Of course, that doesn't apply to every problem, and there are people trying to create exactly the cluster-as-computer architecture you'd like to see for ease of application programming. Check out OpenMosix and MigShm for one example - I haven't used the latter DSM patch myself but I know that for non-shared-memory programs, Mosix has had working process migration code for years.
Actually, I'd recommend openMosix. Granted Mosix is the original and is open source now as well, but it still seems like openMosix is more actively developed.
Free will is just an illusion
If one woman can have one baby in nine months, can nine women have one baby in one month?
Clustering grew out of a poor-mans solution to distributed computing. There have been plenty of distributed operating systems over the decades, but the status-quo weights heavy. Rather than revolutionary OSes we use evolutionary OSes.
For example, a common solution for distributed computing is to use Linux (a 1970'ish design that isn't a distributed OS at all) and cluster it hence forcing application writers to re-write to manually divide computation along latency/bandwidth lines.
There is no technological reason why a compiler couldn't do a detailed whole-app dataflow analysis of a program and compute a good (or perhaps optimal) distribution for computations automatically.
It simply isn't done that way, because applications of distributed computing are either inherently easy to divide up (like serving web pages, raytracing, database access etc.), or are scientific computations that are written by practitioners with little CS understanding of distributed computing.
Similarly, no company has any interest in building truley 'supercomputing' hardware any more - they're more interested in re-packaging off-the-shelf parts because the market is too small.
Hence, todays 'supercomputers' are just clustes in disguise. So, if you local supercomputer uses Intel PC CPUs connected into a cluster, why not just run Linux on it and force app writers to divide up the computations by using MPI or something? Right?
It may not be easy to program nor use the compute resources optimally, but keep everyone in a job so who cares.
So, in summary, the answer to your questions is: because people don't like change, like job stability, maintaining the status-quo and are lazy.
Disclosure: I work for the organization that runs the largest 'supercomputer' in the US.
You can only use solutions that exist for the technology you're using. Likewise, though, you're not limited by constraints on technologies you're not using.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
This is a basic systems question:
[Why must] programs be re-written to take advantage of the cluster.
The simple answer is that programs, in general, are written as single threaded applications with shared state (memory). A cluster is the opposite of that - multiple parallel CPUs without shared state (or at least requiring one to be explicit about shared state, as opposed to simply declaring a variable).
Usually a program algorithm has to be completely re-designed in order to take advantage of the cluster, while mitigating the problems. At minimum the program must be parallelized. If you don't change the program to succesfully deal with shared memory latency then the cluster becomes nearly as powerful as a single fast computer running the program.
The reason you are asking this question is that you don't realize that a cluster is fundamentally different than a single (or dual or quad) CPU. The architecture is completely different. You can't expect to treat it like any old computer.
-Adam
I have been hearing about TOE http://en.wikipedia.org/wiki/TCP_Offload_Engine and RDMA http://en.wikipedia.org/wiki/Remote_Direct_Memory_ Access for over a year now both of these would help with clustering of remote servers and there CPU's
VirtualIron is a company/product that runs a para-virtualized Linux instance (more similar to Xen than Bochs or the desktop VMWare) which spans multiple physical machines.
http://www.virtualiron.com/
The virtual machines you mention all run on a single existing system. You want a virtual machine that runs on multiple systems. That goes way beyond what the existing VMs do. They just implement the hardware instructions of a single system in software running on a single system. Taking that implementation and spreading it out among multiple systems means anticipating every clustering problem the code might raise, and solving it in advance.
Nobody knows how to do that. If they did, they'd implement it as the back end of compiler rather than waste the overhead of using a VM.
(They say that there are no stupid questions. Not true. But there are lame stupid questions, and interesting stupid questions. My vocation is answering interesting stupid questions, which is why I'm grateful for this one!)
pick an algorithm (say matrix multiply). write the fastest possible serial implementation you can (hint: you can do better than O(n^3)). then implement a parallel matrix multiply using MPI. now make the parallel one run as fast as possible.
i can guarantee that the serial algorithm is about a day's worth of effort to implement; however, the parallel one will require at least a week. as you start working through the parallel implementation, you'll quickly discover that all the things that are true in a shared memory multiprocessor are no longer true in a distributed memory multiprocessor. you'll quickly appreciate the communication costs and overheads imposed by the cluster. you'll also appreciate just how much of a bitch attempting to debug parallel programs is (where did stdout go? depends on your MPI implementation... whee!). topology also becomes a major factor: parallel sorts which are efficient on toroidal grid topologies are no longer efficient on hypercubes, and vice versa.
you managed to communicate what i was attempting to in a far more succinct way than i managed to.
This will be a bit difficult to explain fully. The other posts have already lightly touched the problems involved (especially latency). But you are talking about the holy grail of parallel computing here; seeing one system while it is running all over the place. My best advice for you is to get a good book on parallel systems and get educated. This is something like asking a doctor why there are still diseases.
cache consistency. When I modify a page that is in my processor cache, now I have to put the word out to the whole network -- and I can't really commit that page until I know for sure that other threads in the cluster did not modify the same page (and, in the case someone did, I must decide how do I merge their modifications and mine, notify them of the merging, etc, etc...) What was a quick (important for performance) operation becomes a dog-slow operation, and maybe puts the whole motif for using a cluster in jeopardy...
HTH,
It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048
The only way you'll have source code that compiles and runs unmodified on architectures of widely varying parallelism efficiently is for the language itself to know about parallelism, and make it the compiler's (and even runtime-linker and kernel's) job to parallelize your code for you. An inherently parallel language would have ways for you to specify in your source code what can and cannot be executed in parallel, and what code absolutely depends on the serial execution of some previous code. Even then, we're really only talking about the SMP case. When you start involving network latencies and bandwidth restrictions, the decisions on when and how to parallelize become more challenging for the compiler/runtime, possibly requiring either more intelligence on its part and/or more meta-information in your source code.
Until you write code in a language like that, you can never expect to write code in a single-threaded mindset and then have it just magically take advantage of a parallel environment.
11*43+456^2
Clustering is a many-cpu, one-problem situation. Many problems are not "do this thing 1000 times", but "perform these 1000 steps in order", so it requires a lot of work to make the simultaneous availability of many CPUs an advantage. The goal is to increase the speed of a CPU-intensive task.
Virtualization is a many-problem, one-cpu situation. Various software tricks make each of several programs think they have an entire system to themselves. In reality it all runs inside a virtualizer/emulator. Speed is sacrificed, but there are other advantages in management, flexible allocation, etc.. The goal is to make better use of a few CPUs by a larger number of programs.
VMware doesn't virtualize the CPU. Whatever native CPU you're running is what the Guest OS sees.
... reminds me of Cray's famous statement that though one woman can have a child in 9 month, nine women would not be able to have one in 1 month. ;-)
Yes, all the answers above were quite sufficient to explain why you have to re-code your app if you want for it to run _faster_ when you add _more_ nodes. And it is so easy to make it run slower -- I bet the original poster would benefit from trying to re-code some a sequential program to a parallel one at least once, then ask himself "How the heck can I teach a compiler to deal with this mess???" (not to mention OS/virtualization hardware which has to deal with binary and it is so much harder to extract dependency information from that on the fly!*).
Paul B.
* Yes, there was that result of (as far as I remember!) HP virtualization group when some SPEC apps ran FASTER on a virtual machine with just-in-time compiler than natively compiled code -- had to do with virtualizer knowing which branch your program will take next and preparting for that. You'd have to google for the whole story, it was quite educational!
Listing 4: Implementation of replication sort
1 par (element=0; element<SIZE; element++) {
2 seq {
3 par (element2=0; element2<SIZE-1; element2++) {
4 ifselect(element>element2) {
5 if(uList[element] > uList[element2])
6 comp[element][element2] = 1;
7 } else ifselect (element<=element2) {
8 if(uList[element] >= uList[element2+1])
9 comp[element][element2] = 1;
10 }
11 }
13 position[element] = SUM_OF_DIGITS(comp[element]);
14 sList[position[element]]=uList[element];
15 }
16 }
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
You should check out cilk. Two of the people behind the project used to work for Thinking Machines, the company that made one of the best supercomputers of its day. Cilk adds a few key words to C and it requires much less effort than most other parallel programming models. Unfortunately, the distributed version of it was only a prototype and isn't included in the latest release. If nothing else, you should read the papers these guys have written.
They also had a couple of graduate students that made Jilk (Cilk for Java). From the sound of their papers it isn't ready for production use yet, but it's something to keep in mind if you prefer writing code in Java.
"For the Snark was a Boojum, you see." -From the Hunting of the Snark: An Agony in Eight Fits, by Lewis Carroll
http://sourceforge.net/projects/xgridagent-java/
Indeed, I've been using Qemu a lot lately. While it's great for my needs, it emulates a lower-speed P2 CPU (on my P4 machine) and requires that any device hooks also be understood and passed-through/translated by the virtual machine.
In the end, you'll get better performance and compatability out of coding for a cluster, rather than having the overhead and redirection of the virtualization process.
So surprised to see such dump questions make it to the front page of Slashdot... well not really.
Wouldn't being able to partition any normal program into a program that executes (efficiently) on multiple CPU's basically require that someone solves the halting problem?
I bring this up because because it seems like, in order to partition the tasks efficiently, you'd basically need to be able to predict what the program was trying to do in advance.... and if you could predict what a program was going to do in advance of actually running it, it would seem like you have just solved the halting problem.
Any CS folks care to weigh in? This isn't really my field, but I'm quite interested to know one way or the other.
Life is too short to proofread.
Beowulf clusters typically are designed for specific purposes and software is written to take advantage of the design. You can't have two computers add 2+2 any faster than you can have one computer do it. You can however, have two computers adding 2+2 and 0+1+1 at the same time to get two answers in half the time it would take one computer to do it.
I'm certainly no expert, but I have researched this a bit since I work in a department with a LOT of extra boxes laying around. They're slow individually but together add up to a good bit of processing power and memory. We want to put them to use but the question is "what use?"
That question boils down to programs designed to use multiple threads versus splitting processes. If your needs involve running things that require lots of processes, then openMosix is a good bet, but if you're simply wanting to make your favorite software run faster, the answer might be to rebuild it to take advantage of a Beowulf cluster with more threads rather than trying to divy up the processes. Fortunately, there are compile tools out there to make it a little easier and specifically openMosix has some compile tools to make programs more multi-process friendly.
Despite all the tools though, some programs just don't divide well without significant recoding. If you're faced with that type of problem, its time to call in the coding gurus because openMosix can't help you. Others, like apache and mysql were practically written to be shared.
OpenMosix may be the answer or not, it all depends on the question, which in this case isn't completely clear because the objective and software desired aren't discussed.
As to the why clustering works this way, there are far more technical and probably much more accurate answers but in simple terms, you can't make two computers do one thing faster than one computer can do it unless you can divide the job. Some jobs divide easily, some don't.
B) Eliminate all the stupid users. This is frowned upon by society.
"You can't have two computers add 2+2 any faster than you can have one computer do it. You can however, have two computers adding 2+2 and 0+1+1 at the same time to get two answers in half the time it would take one computer to do it."
Maybe not for 2+2, but you could for large numbers which are not atomic to add. If it takes linear time to do a task on one processor, on the Connection Machine it could basically end up being lg n time.
Cf. here