Why Does Current Clustering Require Recoding?

latency? by Johnny+Mnemonic · 2005-09-13 09:13 · Score: 5, Insightful

why aren't there clustering solutions that do this as well?

Because it's a lot faster to address a local CPU than it is to send that info down the wire to a remote CPU? And because of that latency, it's a lot easier to keep 2 or more local CPUs in sync than it is to keep 2 or more remote CPUs in sync?

You need to recode because you want to work around the latency, which is severe, of working via a network cable--so you design your apps to minimize messaging between CPUs. Some apps can do this well--they don't need results from other CPUs to complete their own information.

Other applications require CPUs to work in tandem, and for each CPU to have to wait while the results are served out over GigE would suck some serious ass, even if it might be technically possible.
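
To put rough numbers on the latency argument, here is a minimal sketch of a toy cost model. Every figure in it is an assumption picked for illustration (a 60-second job, 100 nanoseconds for a sync between local CPUs, 100 microseconds for a sync over Ethernet), and `run_time` is a made-up helper, not part of any clustering tool.

```python
# Toy cost model: all numbers are illustrative assumptions, not measurements.
LOCAL_SYNC_S = 100e-9      # assumed cost of one sync between CPUs in the same box
NETWORK_SYNC_S = 100e-6    # assumed cost of one sync over Ethernet


def run_time(compute_s, workers, syncs, sync_cost_s):
    """Wall time if compute splits evenly and every sync stalls all workers."""
    return compute_s / workers + syncs * sync_cost_s


def report(label, syncs):
    one_cpu = run_time(60.0, 1, 0, 0.0)
    smp = run_time(60.0, 4, syncs, LOCAL_SYNC_S)
    cluster = run_time(60.0, 4, syncs, NETWORK_SYNC_S)
    print(f"{label}: 1 CPU {one_cpu:.1f}s, 4-way SMP {smp:.1f}s, 4-node cluster {cluster:.1f}s")


report("chatty job (1M syncs)", 1_000_000)
report("quiet job (100 syncs)", 100)
```

With these assumed numbers the chatty job runs slower on four networked nodes than on a single CPU, while the quiet job gets nearly the full 4x either way.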

--
$tar -xvf .sig.tar

Re:latency? by Frumious+Wombat · 2005-09-13 11:15 · Score: 3, Informative

Don't forget disk access issues as well. You now have file locking, non-local disk access, and race conditions to contend with.

Example from my work is that we tend to write several hundred meg to several gig scratch files, and then perform RW operations on them continually during a calculation. If the disk isn't local to the process, then you end up flooding the network, and bringing everything to a screeching halt.
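
As a back-of-envelope check on the scratch-file problem, the sketch below computes a lower bound on the time spent just moving data when the scratch disk sits across gigabit Ethernet. The 2 GB file size and the pass counts are hypothetical; the only hard number is the 1 Gbit/s wire speed, and real throughput will be lower once protocol overhead and contention are counted.

```python
# Back-of-envelope only. The scratch size and pass counts are hypothetical.
GIGE_BYTES_PER_S = 1e9 / 8   # 1 Gbit/s wire speed = 125 MB/s, before protocol overhead


def network_seconds(scratch_bytes, passes):
    """Lower bound on time spent just moving the scratch file over the wire."""
    return scratch_bytes * passes / GIGE_BYTES_PER_S


two_gig = 2e9
print(f"one pass over 2 GB: at least {network_seconds(two_gig, 1):.0f} s")
print(f"50 passes over 2 GB: at least {network_seconds(two_gig, 50) / 60:.1f} min")
```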

In a Mosixish/Condor-type environment, you then have to work out which processes, given this disk limitation, can be migrated to other CPUs, or can let a second job start alongside them because they are underutilizing the CPU, and which must have exclusive access to the CPU and near-exclusive access to the disk to keep the calc from bogging down.

Then, as the parent mentioned, you have the CPU-CPU communication issues, the network overhead, and memory access patterns, all of which are hard. In theory, had you written your code correctly in the first place, this would only be moderately annoying. But since most people's applications are single-threaded, most programming is taught in serial mode, and the tools for MPar work are still expensive and exotic, you get a situation where it's easy to run a compute farm (massive numbers of single-processor jobs) but hard to run a parallel cluster (one job aggregating resources).

--
the more accurate the calculations became, the more the concepts tended to vanish into thin air. R. S. Mulliken

Performance by RAMMS+EIN · 2005-09-13 09:18 · Score: 2, Insightful

``Since there are virtualization apps like Bochs and VMWare, where the applications just make use of a virtual CPU as if it was a real CPU, why aren't there clustering solutions that do this as well?''

Because it's virtualization, and thus hurts performance?

--
Please correct me if I got my facts wrong.

Because it's hard. by Elwood+P+Dowd · 2005-09-13 09:18 · Score: 2, Insightful

It's hard to take arbitrary code and decide which parts can be run on opposite ends of a network cable.

Sure, you could make a clustering application that would run arbitrary x86 code on separate machines, but it would be many orders of magnitude slower than just running the code on one big Xeon.

Hell, it's hard enough to take a single thread and spread work across multiple execution units in the CPU for out-of-order execution, and too hard to do it across multiple CPUs in a single box. Why would it be possible across a network cable? Have I completely misunderstood the question?

--

There are no trails. There are no trees out here.

cooking lessons by xutopia · 2005-09-13 09:20 · Score: 3, Insightful

Imagine telling a group of 10 cooks to make a huge roast. You wouldn't cut the roast into 10 pieces, have each of them cook a piece separately, and then glue the pieces back together. It would be highly difficult to glue the 10 pieces back together. Instead it would make more sense to ask each cook to do a separate task. A few could cut vegetables while another makes a sauce and another a salad.

As it stands today, an OS cannot easily share tasks. But there exist some tasks which are more easily shareable than others. I imagine within a century we'll be able to share tasks more easily, and I think the CELL chip is meant to ease this transition, but I could be wrong.

Re:cooking lessons by swmccracken · 2005-09-13 11:07 · Score: 3, Insightful

See, is this High Availability clustering or performance clustering? The asker doesn't state, and it's a rather important distinction.

If it's HA, you'd get 10 cooks each to make a roast. Sure, you'd end up cooking extra meat, but that doesn't matter - the goal here is to guarantee that a roast will be cooked no matter what. (I can imagine two copies of Bochs running on separate physical machines but linked to run in absolute lock-step. Performance might be impaired, but reliability will be there.)

If it's performance, then you're right, you can't magically glue two computers together and get twice the performance.

part of the issue by sfcat · 2005-09-13 09:21 · Score: 3, Informative

When making an application distributed, you must figure out how to replicate the memory the application uses to other machines and make sure that this replication and synchronization work is transparent to the logic of the application. But this replication and synchronization is far far far too expensive (computationally) if done naively. So either special system calls (which is what the recoding requires) or a redesign of how work is parcelled out to worker threads is necessary.

This is in addition to the handling of resources such as database connections and other shared resources across the distributed cluster. I'm not exactly sure what your specific needs are, but when you separate threads across different physical memory spaces, it creates significant problems to overcome. If you just want to virtualize the application (so one machine, many virtual machines, one physical memory), then the recoding should be trivial. And I agree, in this isolated case, no recoding should be necessary. But most of the time, clustering entails spanning multiple physical memories, and thus the application needs to be designed to handle these difficulties.

--
"Those that start by burning books, will end by burning men."

Mosix by NitsujTPU · 2005-09-13 09:22 · Score: 4, Insightful

You might want to try Mosix.

http://www.mosix.org/

TANSTAAFL by Julian+Morrison · 2005-09-13 09:23 · Score: 4, Insightful

Clustering exposes complications regarding: shared data, latency, concurrency, transactions, central control, security, failovers, and so forth. It's hard because it's hard.

duh by jpmkm · 2005-09-13 09:23 · Score: 2, Informative

How is this magical cpu virtualizer going to know what it can split up and send to different computers? Like another poster mentioned, latency is the big issue. If your cpu virtualizer arbitrarily sends instructions over the network to other nodes, but the original program still expects them to be executed at local cpu speed then things are going to get fucked up fast. I wouldn't be surprised if the final result is actually slower than just running the job on one box.

Basically, what's wrong with this idea is the clustering software has no way of knowing what it can chunk up and spit out to other nodes unless the programmer of the software in question tells it. Some multithreaded programs can be run on clusters without a rewrite, but there is already clustering software for that application. What the OP is suggesting is similar to rerouting highway traffic by arbitrarily plucking cars off the highway and putting them on random side streets. They all may get there eventually and, at first, it may seem like they are moving faster, but in the end it just takes everyone a lot longer to reach their destination. Now, if the drivers themselves planned alternate routes to help alleviate congestion on the highways, then there's a good chance everyone would get to their destinations faster.

Compilers by Marillion · 2005-09-13 09:30 · Score: 3, Interesting

Most compilers/interpreters support languages designed for single thread execution. Fortran, COBOL, C, C++, Ruby, Perl, PHP, Java, ... Sure all these have API calls to make use of multiple threads, but the language itself isn't multi-threaded.

In my shameless search for a site to cite, I found this http://www-unix.mcs.anl.gov/dbpp/ which covers lots of problems that have to be solved.

I'd love to see a language (or language extension) cleanly define a way to let me define code block attributes which could affect how and where the block gets executed. The runtime library could then distribute that block as the environment best allows.

--
This is a boring sig

Re:Compilers by The+boojum · 2005-09-13 10:20 · Score: 2, Informative

I'd love to see a language (or language extension) cleanly define a way to let me define code block attributes which could affect how and where the block gets executed. The runtime library could then distribute that block as the environment best allows.

Have a look at OpenMP. Granted, it's more for shared-memory systems than clusters, but it works similarly to what you describe.

Re:Compilers by GileadGreene · 2005-09-13 14:41 · Score: 3, Informative

I'd love to see a language (or language extension) cleanly define a way to let me define code block attributes which could affect how and where the block gets executed.

The venerable occam programming language requires that each block of code be specifically identified as being executable either in parallel or sequentially. Since PAR and SEQ constructs can be nested, it is easy to build up quite complex concurrent structures that can easily be distributed. Since the semantics of occam processes are derived from Hoare's CSP process algebra, the compositional nature of occam's parallelism is theoretically sound, and avoids many of the problems associated with the thread-based concurrency model that most people are familiar with.

Re:Compilers by Marillion · 2005-09-13 17:11 · Score: 2, Interesting

No. Java is a perfect example of API-based threading, and in Java it's easy to do. Still, a class has to implement Runnable and the programmer has to create a Thread and start it.

The synchronized keyword was closer to where I was going. Suppose Java had a thread modifier keyword for looping operators. You could then:

    public void renderImage(Image images[]) {
        thread for (int i = 0; i < images.length; i++) {
            render(images[i]);
        }
    }

Each iteration of the looping block launches as a different task running in parallel, and the loop exits once all tasks are complete. Or an easier RPC, like:

    public double getSalary(EmpID id) {
        String idString = id.getID();
        double salary;
        remote("hr-system") {
            salary = HR.getSalary(idString);
        }
        return salary;
    }

I certainly recognize that there are some very hard problems that would need to be solved in such a scenario. Thread synchronization, mutexes, and semaphores would need to be looked at. A clean way to integrate directory services and other ways of defining environmental resources is not trivial either, and is critical for the success of such a language.
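
For comparison, this is roughly what the proposed `thread for` loop has to be spelled out as through a library API today. The sketch is in Python rather than Java, and `render` is a hypothetical stand-in for real rendering work.

```python
from concurrent.futures import ThreadPoolExecutor


def render(image):
    # Hypothetical stand-in for real rendering work.
    return image.upper()


def render_images(images):
    # Submit one task per iteration, then wait for all of them: the behaviour
    # the proposed `thread for` loop would give, spelled out through an API.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(render, image) for image in images]
        return [future.result() for future in futures]


print(render_images(["a.png", "b.png"]))
```

The pool, the futures, and the explicit wait are all bookkeeping that a language-level construct would hide.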

--
This is a boring sig

What type of cluster do you want? by Hast · 2005-09-13 09:30 · Score: 2, Informative

First off, it's not entirely clear what you want to do with it. If you want load balancing, then that's one problem. If you want parallel batch processing (such as rendering farms or compiling), then that's another problem. And the really juicy stuff, i.e. running a normal application distributed on multiple computers, is a third, and very different, problem.

But all of them require that you add something to the original program which distributes the work (load balancing/render farms). If you want your original program to run in parallel then that is a much harder problem to solve. Basically you'll have to remake it into something like the above.

The last problem would basically require the computer to extract threads out of your code. This is pretty much impossible to do automatically though.

Because bandwidth is scarce. by roystgnr · 2005-09-13 09:35 · Score: 2, Interesting

If your problem is so parallelizable that bandwidth isn't a limitation, then you don't need any special clustering software, you just need nfs and ssh: I do all my compiling in a flash with a short script and "make -j 16 CXX=sshcxx".

If your problem isn't that parallelizable and yet you need a whole cluster of computers to run it, odds are you need more efficiency than distributed shared memory can give you. You can access memory on your own node with orders of magnitude more bandwidth and less latency than on other nodes, and if your application doesn't take that into consideration it can run orders of magnitude slower.

Of course, that doesn't apply to every problem, and there are people trying to create exactly the cluster-as-computer architecture you'd like to see for ease of application programming. Check out OpenMosix and MigShm for one example - I haven't used the latter DSM patch myself but I know that for non-shared-memory programs, Mosix has had working process migration code for years.

openMosix by Codename_V · 2005-09-13 09:42 · Score: 2, Informative

Actually, I'd recommend openMosix. Granted Mosix is the original and is open source now as well, but it still seems like openMosix is more actively developed.

--
Free will is just an illusion

This is a basic systems question. by stienman · 2005-09-13 10:20 · Score: 4, Informative

This is a basic systems question:

[Why must] programs be re-written to take advantage of the cluster.

The simple answer is that programs, in general, are written as single threaded applications with shared state (memory). A cluster is the opposite of that - multiple parallel CPUs without shared state (or at least requiring one to be explicit about shared state, as opposed to simply declaring a variable).

Usually a program algorithm has to be completely re-designed in order to take advantage of the cluster, while mitigating the problems. At minimum the program must be parallelized. If you don't change the program to successfully deal with shared memory latency, then the cluster ends up only about as powerful as a single fast computer running the program.
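
One way to see why an unparallelized program gains so little is Amdahl's law, which is not mentioned in the comment above but captures the same point: the speedup is capped by the fraction of the run that cannot be split. A minimal sketch, with the parallel fractions chosen purely for illustration:

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Amdahl's law: best-case speedup when only part of the run can be split."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


for fraction in (0.0, 0.5, 0.95):
    print(f"{fraction:.0%} parallel on 16 nodes: {amdahl_speedup(fraction, 16):.2f}x")
```

This is a best case: it ignores communication cost entirely, so a real cluster does worse.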

The reason you are asking this question is that you don't realize that a cluster is fundamentally different than a single (or dual or quad) CPU. The architecture is completely different. You can't expect to treat it like any old computer.

-Adam

My try by fm6 · 2005-09-13 10:48 · Score: 2, Interesting

Lots of good answers, but none that quite satisfy me. Here's mine:

The virtual machines you mention all run on a single existing system. You want a virtual machine that runs on multiple systems. That goes way beyond what the existing VMs do. They just implement the hardware instructions of a single system in software running on a single system. Taking that implementation and spreading it out among multiple systems means anticipating every clustering problem the code might raise, and solving it in advance.

Nobody knows how to do that. If they did, they'd implement it as the back end of a compiler rather than waste the overhead of using a VM.

(They say that there are no stupid questions. Not true. But there are lame stupid questions, and interesting stupid questions. My vocation is answering interesting stupid questions, which is why I'm grateful for this one!)

MOSIX License by Noksagt · 2005-09-13 11:18 · Score: 2, Informative

Actually, I'd recommend openMosix.

Agreed.

Granted Mosix is the original and is open source now as well,

Not by OSI/DFSG/FSF standards. The license is still very restrictive. I think the kernel patches might be under GPL, but certainly not the user tools.

it still seems like openMosix is more actively developed.

This is certainly true. Most talent jumped ship & openMosix does have a higher number of active developers (and is somewhat backed by AMD (though I think AMD can and should give more developers to the project)).

Holy Grail by owlstead · 2005-09-13 11:18 · Score: 3, Insightful

This will be a bit difficult to explain fully. The other posts have already lightly touched the problems involved (especially latency). But you are talking about the holy grail of parallel computing here; seeing one system while it is running all over the place. My best advice for you is to get a good book on parallel systems and get educated. This is something like asking a doctor why there are still diseases.

Re:openMosix by Noksagt · 2005-09-13 11:27 · Score: 2, Informative

I agree that you may have to make some of these kinds of design changes for a single application's processes to benefit. But you'd really have to make those if you use other clustering solutions too. With Mosix, you don't have to make the implementation-specific kind of changes, though.

(And, for your particular example, mosix has a number of schedulers & you can schedule manually. You can trivially send one postscript file to each node. Of course you can do this "braindead" clustering with a script, but it isn't as robust, easy, or flexible.)

Mosix sounds good because you don't have to "do" anything special but most apps won't benefit from it.

Somewhat agree for single apps, especially the edge cases that you point out. But if you have a large number of CPU-intensive processes, (open)Mosix is a good, fairly painless way to divide the load.

No, the hard part is ... by hummassa · 2005-09-13 11:56 · Score: 2, Interesting

cache consistency. When I modify a page that is in my processor cache, now I have to put the word out to the whole network -- and I can't really commit that page until I know for sure that other threads in the cluster did not modify the same page (and, in case someone did, I must decide how to merge their modifications and mine, notify them of the merge, etc., etc.). What was a quick (important for performance) operation becomes a dog-slow operation, and maybe puts the whole motive for using a cluster in jeopardy...
HTH,
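
A toy illustration of why this gets expensive: the sketch below counts the network messages needed to keep one page coherent under a simple write-invalidate scheme. The `SharedPage` class and the access pattern are assumptions made for the example; no real DSM system is this simple.

```python
class SharedPage:
    """Toy write-invalidate bookkeeping for one page cached on several nodes."""

    def __init__(self):
        self.holders = set()   # nodes that currently hold a valid copy
        self.messages = 0      # network messages spent keeping copies coherent

    def read(self, node):
        if node not in self.holders:
            self.messages += 1          # fetch a fresh copy over the network
            self.holders.add(node)

    def write(self, node):
        self.read(node)                 # need a valid copy before modifying it
        others = self.holders - {node}
        self.messages += len(others)    # tell every other holder to drop its copy
        self.holders = {node}


page = SharedPage()
for node in range(4):
    page.read(node)                     # four nodes pull the page
for step in range(100):
    page.write(step % 4)                # nodes take turns writing
print(page.messages)
```

When the nodes take turns writing, nearly every write costs a fetch plus an invalidation, where a single machine would just have touched its own cache.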

--
It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048

Your language doesn't support it by photon317 · 2005-09-13 12:37 · Score: 3, Informative

The only way you'll have source code that compiles and runs unmodified on architectures of widely varying parallelism efficiently is for the language itself to know about parallelism, and make it the compiler's (and even runtime-linker and kernel's) job to parallelize your code for you. An inherently parallel language would have ways for you to specify in your source code what can and cannot be executed in parallel, and what code absolutely depends on the serial execution of some previous code. Even then, we're really only talking about the SMP case. When you start involving network latencies and bandwidth restrictions, the decisions on when and how to parallelize become more challenging for the compiler/runtime, possibly requiring either more intelligence on its part and/or more meta-information in your source code.

Until you write code in a language like that, you can never expect to write code in a single-threaded mindset and then have it just magically take advantage of a parallel environment.
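
As a rough illustration of what telling the language "what depends on what" buys you, the sketch below takes an explicit dependency map and groups the steps into waves that could run side by side. The step names are hypothetical, and Python's standard `graphlib` stands in for what a parallel-aware compiler or runtime would do.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each step maps to the steps it must wait for.
DEPENDS_ON = {
    "load": set(),
    "filter_a": {"load"},
    "filter_b": {"load"},
    "merge": {"filter_a", "filter_b"},
    "report": {"merge"},
}


def parallel_waves(depends_on):
    """Group steps into waves; steps in the same wave can run side by side."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(parallel_waves(DEPENDS_ON))
```

Only the two filter steps share a wave here; everything else is forced into serial order by the declared dependencies, and deciding whether a wave is worth shipping over a network is a separate question.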

--
11*43+456^2

Re:Mosix - a great answer, but not for everything. by ancientt · 2005-09-14 16:11 · Score: 2, Insightful

I don't think it's quite as simple as a right answer. Sure, openMosix rocks, but it's only one kind of answer, not the final one. OpenMosix spreads the processes around but can't split a single process up to make it complete faster. It can send processes to the most likely CPU, but that still doesn't address the question of speeding up the time that the process will take to complete.

Beowulf clusters typically are designed for specific purposes and software is written to take advantage of the design. You can't have two computers add 2+2 any faster than you can have one computer do it. You can however, have two computers adding 2+2 and 0+1+1 at the same time to get two answers in half the time it would take one computer to do it.
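
The 2+2 example can be sketched as a scheduling problem, assuming each job is indivisible and takes a known number of seconds (`makespan` is a hypothetical helper). A second machine halves the time for two jobs and does nothing for one:

```python
import heapq


def makespan(job_seconds, machines):
    """Finish time when whole jobs go, longest first, to whichever machine frees up first."""
    free_at = [0.0] * machines
    heapq.heapify(free_at)
    for seconds in sorted(job_seconds, reverse=True):
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + seconds)
    return max(free_at)


print(makespan([10.0], 1), makespan([10.0], 2))              # one job: no gain
print(makespan([10.0, 10.0], 1), makespan([10.0, 10.0], 2))  # two jobs: halved
```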

I'm certainly no expert, but I have researched this a bit since I work in a department with a LOT of extra boxes lying around. They're slow individually but together add up to a good bit of processing power and memory. We want to put them to use, but the question is "what use?"

That question boils down to programs designed to use multiple threads versus splitting processes. If your needs involve running things that require lots of processes, then openMosix is a good bet, but if you're simply wanting to make your favorite software run faster, the answer might be to rebuild it to take advantage of a Beowulf cluster with more threads rather than trying to divvy up the processes. Fortunately, there are compile tools out there to make it a little easier, and specifically openMosix has some compile tools to make programs more multi-process friendly.

Despite all the tools though, some programs just don't divide well without significant recoding. If you're faced with that type of problem, it's time to call in the coding gurus because openMosix can't help you. Others, like Apache and MySQL, were practically written to be shared.

OpenMosix may be the answer or not, it all depends on the question, which in this case isn't completely clear because the objective and software desired aren't discussed.

As to why clustering works this way, there are far more technical and probably much more accurate answers, but in simple terms, you can't make two computers do one thing faster than one computer can do it unless you can divide the job. Some jobs divide easily, some don't.

--
B) Eliminate all the stupid users. This is frowned upon by society.

Slashdot Mirror
