Why Does Current Clustering Require Recoding?

Clustering pitfalls by sycotic · 2005-09-13 09:10 · Score: 0

This is a really good question and I have often wondered it myself.

I am glad someone has finally posed the question to the community at large.

Imagine the renewal of interest in clustering if this were to become reality...

--
-- If I were a fish, I'd be wet

latency? by Johnny+Mnemonic · 2005-09-13 09:13 · Score: 5, Insightful

why aren't there clustering solutions that do this as well?

Because it's a lot faster to address a local CPU than it is to send that info down the wire to a remote CPU? And because of that latency, it's a lot easier to keep 2 or more local CPUs in sync than it is to keep 2 or more remote CPUs in sync?

You need to recode because you want to work around the latency, which is severe, of working via a network cable--so you design your apps to minimize messaging between CPUs. Some apps can do this well--they don't need results from other CPUs to complete their own information.

Other applications require CPUs to work in tandem, and for each CPU to have to wait while the results are served out over GigE would suck some serious ass, even if it might be technically possible.
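
A rough back-of-envelope sketch of that trade-off, in Python. The two latency constants are assumed, order-of-magnitude guesses (not measurements), and `run_time` is a made-up helper for illustration only.

```python
# Toy cost model: total time = compute time + time spent waiting on messages.
# Both latency constants are assumed, order-of-magnitude values.
LOCAL_LATENCY_S = 100e-9    # assumed cost of talking to a CPU in the same box
NETWORK_LATENCY_S = 100e-6  # assumed cost of one round trip over the wire


def run_time(work_s, cpus, messages_per_cpu, latency_s):
    """Estimated wall-clock time for perfectly divisible work.

    Each CPU computes for work_s / cpus seconds and then stalls once
    for every message it has to exchange with the other CPUs.
    """
    return work_s / cpus + messages_per_cpu * latency_s


if __name__ == "__main__":
    work = 10.0  # seconds of work on a single CPU
    for messages in (0, 1_000, 1_000_000):
        local = run_time(work, 4, messages, LOCAL_LATENCY_S)
        remote = run_time(work, 4, messages, NETWORK_LATENCY_S)
        print(f"{messages:>9} msgs/CPU: local {local:.2f}s, remote {remote:.2f}s")
```

With these assumed numbers, a job that needs no messaging finishes in a quarter of the time on four CPUs either way, while a chatty job (a million messages per CPU) spends roughly 100 seconds just waiting on the network and ends up far slower than running on one CPU.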

--
$tar -xvf .sig.tar

Re:latency? by Frumious+Wombat · 2005-09-13 11:15 · Score: 3, Informative

Don't forget disk access issues as well. You now have file locking, non-local disk-access, and race state issues to contend with.

Example from my work is that we tend to write several hundred meg to several gig scratch files, and then perform RW operations on them continually during a calculation. If the disk isn't local to the process, then you end up flooding the network, and bringing everything to a screeching halt.

In a Mosixish/Condor type environment, you then have to deal with which processes, because of this disk limitation, can be migrated to other CPUs, or can allow a second job to start on their own because of insufficient utilization, from those which have to have exclusive access to the CPU, and near-exclusive access to the disk, in order to prevent the calc from bogging down.

Then, as the parent mentioned, you have the CPU-CPU communication issues, the network overhead, and memory access patterns, all of which are hard. In theory, had you written your code correctly in the first place, this would only be moderately annoying, but since most people's applications are single-threaded, most programming is taught in serial mode, and the tools for MPar work are still expensive and exotic, then you get a situation where it's easy to run a compute farm (massive numbers of single-processor jobs), but hard to run a parallel cluster (one job aggregating resources)

--
the more accurate the calculations became, the more the concepts tended to vanish into thin air. R. S. Mulliken

Performance by RAMMS+EIN · 2005-09-13 09:18 · Score: 2, Insightful

``Since there are virtualization apps like Bochs and VMWare, where the applications just make use of a virtual CPU as if it was a real CPU, why aren't there clustering solutions that do this as well?''

Because it's virtualization, and thus hurts performance?

--
Please correct me if I got my facts wrong.

Because it's hard. by Elwood+P+Dowd · 2005-09-13 09:18 · Score: 2, Insightful

It's hard to take arbitrary code and decide which parts can be run on opposite ends of a network cable.

Sure, you could make a clustering application that would run arbitrary x86 code on separate machines, but it would be many orders of magnitude slower than just running the code on one big Xeon.

Hell, it's hard enough to take a single thread and spread work across multiple execution units in the CPU for out-of-order execution, and too hard to do it across multiple CPUs in a single box. Why would it be possible across a network cable? Have I completely misunderstood the question?

--

There are no trails. There are no trees out here.

cooking lessons by xutopia · 2005-09-13 09:20 · Score: 3, Insightful

Imagine telling a group of 10 cooks to make a huge roast. You wouldn't cut the roast into 10 pieces, have each cook prepare one piece separately, and then glue the pieces back together. It would be highly difficult to glue back the 10 pieces. Instead it would make more sense to ask each cook to do a separate task. A few could cut vegetables while another would make a sauce, another a salad.

As it stands today, an OS cannot easily share tasks. But there exist some tasks which are more easily shareable than others. I imagine within a century we'll be able to share tasks more easily, and I think the CELL chip is meant to ease this transition, but I could be wrong.

Re:cooking lessons by fm6 · 2005-09-13 10:32 · Score: 1

It would be highly difficult to glue back the 10 pieces.
Gluing a roast back together is easy. It's eating it afterwards that's hard!
Re:cooking lessons by swmccracken · 2005-09-13 11:07 · Score: 3, Insightful

See, is this High Availability clustering or performance clustering? The asker doesn't state, and it's a rather important distinction.

If it's HA, you'd get 10 cooks each to make a roast. Sure, you'd end up cooking extra meat, but that doesn't matter - the goal here is to guarantee that a roast will be cooked no matter what. (I can imagine two copies of Bochs running on separate physical machines but linked to run in absolute lock-step. Performance might be impaired, but reliability will be there.)

If it's performance, then you're right, you can't magically glue two computers together and get twice the performance.
Re:cooking lessons by Anonymous Coward · 2005-09-15 01:42 · Score: 0

The computing model, which one programs against or for, is the problem. We have gone for von Neumann and ...now complain it does not work for other models.
With our major programming langs being imperative and inflexible, all this patching won't be optimal and will be a PITA.

It would be ideal to have a *pure* functional model and an intelligent VM to run on a single or a million machines as you wish. If one can hint the VM at compile, coding and run-times about the accepted policies for some code, it could do wonders.
(Does anything exist like this now?)

All I want to say is, Software is complex, not because the real world is complex, as some morons say. We made it complex with the poor major decisions of the past (and possibly the present).

part of the issue by sfcat · 2005-09-13 09:21 · Score: 3, Informative

When making an application distributed, you must figure out how to replicate the memory the application uses to other machines and make sure that this replication and synchronization work is transparent to the logic of the application. But this replication and synchronization is far far far too expensive (computationally) if done naively. So either special system calls (which is what the recoding requires) or a redesign of how work is parcelled out to worker threads is necessary.

This is in addition to the handling of resources such as database connections and other shared resources across the distributed cluster. I'm not exactly sure what your specific needs are but when you separate threads across different physical memory spaces, it creates significant problems to overcome. If you just want to virtualize the application (so one machine, many virtual machines, one physical memory), then the recoding should be trivial. And I agree, in this isolated case, no recoding should be necessary. But most of the time, clustering entails spanning multiple physical memories, and thus the application needs to be designed to handle these difficulties.

--
"Those that start by burning books, will end by burning men."

Mosix by NitsujTPU · 2005-09-13 09:22 · Score: 4, Insightful

You might want to try Mosix.

http://www.mosix.org/

Re:Mosix by NitsujTPU · 2005-09-13 11:34 · Score: 0

It's funny. This is the system that the guy wants. This is what he's looking for. It's not a "why not" answer, it's a system that does what he wants.

Not a single moderator gave me a point for this. Not to whine, but what kind of a clue bat does it take to get the CORRECT answers modded up at Slashdot?

TANSTAAFL by Julian+Morrison · 2005-09-13 09:23 · Score: 4, Insightful

Clustering exposes complications regarding: shared data, latency, concurrency, transactions, central control, security, failovers, and so forth. It's hard because it's hard.

Re:TANSTAAFL by GigsVT · 2005-09-13 10:03 · Score: 1

It's hard because it's hard

Ask Slashdots like this one, and succinct replies like this one make me wish there was an option to immediately archive a story with just one comment if enough people vote for it. :)

--
I've had enough abrasive sigs. Kittens are cute and fuzzy.
Re:TANSTAAFL by sumirati · 2005-09-20 20:00 · Score: 1

Clustering exposes complications regarding: shared data, latency, concurrency, transactions, central control, security, failovers, and so forth.

I'm very happy that my clusters don't require forth.

duh by jpmkm · 2005-09-13 09:23 · Score: 2, Informative

How is this magical cpu virtualizer going to know what it can split up and send to different computers? Like another poster mentioned, latency is the big issue. If your cpu virtualizer arbitrarily sends instructions over the network to other nodes, but the original program still expects them to be executed at local cpu speed then things are going to get fucked up fast. I wouldn't be surprised if the final result is actually slower than just running the job on one box.

Basically, what's wrong with this idea is the clustering software has no way of knowing what it can chunk up and spit out to other nodes unless the programmer of the software in question tells it. Some multithreaded programs can be run on clusters without a rewrite, but there is already clustering software for that application. What the OP is suggesting is similar to rerouting highway traffic by arbitrarily plucking cars off the highway and putting them on random side streets. They all may get there eventually and, at first, it may seem like they are moving faster, but in the end it just takes everyone a lot longer to reach their destination. Now, if the drivers themselves planned alternate routes to help alleviate congestion on the highways, then there's a good chance everyone would get to their destinations faster.

Re:duh by Everleet · 2005-09-13 10:24 · Score: 1

No, it's more like chopping up a bus, leaving one passenger with each piece, and telling them all to drive to the same destination. Unless the problem actually is parallelizable, in which case you'd tell them to drive to different destinations.

Then again, it's even more like chopping up the highway into short, disconnected, side-by-side pieces, giving the group some number of go-karts (depending on the problem), and telling each person to drive down a different piece than they started on.

--
It's tragic. Laugh.

openMosix by Noksagt · 2005-09-13 09:25 · Score: 1

openMosix is a GPLed fork of MOSIX & is undergoing rapid development. No need to recode apps. Apps will work like they do in an SMP machine. So, if they are already faster on a dual-processor machine, they won't need to use special libraries or threading methods to work over several workstations.

Re:openMosix by samjam · 2005-09-13 09:52 · Score: 1

This is true;

but unless your app is heavily CPU bound it will probably stick to its home node.

If the app does much I/O, like, say, processing batches of PostScript files, it will probably stay on its home node, unless you manage to get global block devices working to convince Mosix that the I/O is as good from any node.

Mosix sounds good because you don't have to "do" anything special but most apps won't benefit from it.

--
blog.sam.liddicott.com
Re:openMosix by Noksagt · 2005-09-13 11:27 · Score: 2, Informative

I agree that you may have to make some of these kinds of design changes for the processes of a single application to benefit. But you'd really have to make those if you use other clustering solutions too. With Mosix, you don't have to make the kind of implementation-specific changes, though.

(And, for your particular example, mosix has a number of schedulers & you can schedule manually. You can trivially send one postscript file to each node. Of course you can do this "braindead" clustering with a script, but it isn't as robust, easy, or flexible.)
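
For reference, a minimal sketch of the "braindead" script approach, in Python. The node names are hypothetical, `ps2pdf` is only an example of a per-file job, and the sketch assumes every node sees the same files (say over NFS). It only builds the command lines; it does not run them.

```python
def assign_round_robin(files, nodes):
    """Deal files out like cards: file i goes to node i % len(nodes)."""
    if not nodes:
        raise ValueError("need at least one node")
    plan = {node: [] for node in nodes}
    for i, name in enumerate(files):
        plan[nodes[i % len(nodes)]].append(name)
    return plan


def ssh_commands(plan, program="ps2pdf"):
    """Build (but do not run) one ssh command line per file."""
    return [["ssh", node, program, name]
            for node, names in plan.items()
            for name in names]


if __name__ == "__main__":
    # Hypothetical hosts and files, for illustration only.
    plan = assign_round_robin(["a.ps", "b.ps", "c.ps"], ["node1", "node2"])
    for command in ssh_commands(plan):
        print(" ".join(command))
```

The split is fixed up front, so it cannot react to a node that is slow, busy, or down, which is the kind of robustness a scheduler gives you and a script like this does not.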
Mosix sounds good because you don't have to "do" anything special but most apps won't benefit from it.
Somewhat agree for single apps, especially the edge cases that you point out. But if you have a large number of CPU-intensive processes, (open)Mosix is a good, fairly painless way to divide the load.
Re:openMosix by AugstWest · 2005-09-13 12:45 · Score: 1

Well, the reason I'm asking is that I'm starting a job in a Math department at a University, where I'll have too much free time and lots of underused Linux machines around. They do a lot of very cpu-intensive work, and I was thinking it'd be a fun learning project to put something together.
Re:openMosix by NitsujTPU · 2005-09-14 02:34 · Score: 1

Mosix is a quick solution. It's not going to get you the most bang for your buck if you're trying to develop some amazing supercomputer, but it will certainly get you the biggest bang for your buck if you're interested in something you could put together in an afternoon and pump some good performance out of.

Wrong problem. by Mr2cents · 2005-09-13 09:27 · Score: 1

More processors don't always increase speed, you have to be able to split up the problem in chunks and then work on them at the same time. The algorithms that simulate a processor aren't easily run in parallel, basically. Or require too much communication overhead.
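
The usual way to put a number on this is Amdahl's law (not named above, but it is the standard formula for this argument). A small Python sketch; the fractions used in the demo are made-up inputs, not measurements of any real program.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law for a fixed-size problem.

    parallel_fraction is the share of the run time that can be split
    across processors; the rest stays serial no matter what.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.99):
        for cpus in (2, 16, 1024):
            print(f"{fraction:.0%} parallel on {cpus:>4} CPUs: "
                  f"{amdahl_speedup(fraction, cpus):.2f}x")
```

Note that the formula ignores communication overhead entirely, so on a real cluster it is an upper bound: a program that is 90% parallel can never get past 10x, and the network only makes it worse.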

--
"It's too bad that stupidity isn't painful." - Anton LaVey

Compilers by Marillion · 2005-09-13 09:30 · Score: 3, Interesting

Most compilers/interpreters support languages designed for single thread execution. Fortran, COBOL, C, C++, Ruby, Perl, PHP, Java, ... Sure all these have API calls to make use of multiple threads, but the language itself isn't multi-threaded.

In my shameless search for a site to cite, I found this http://www-unix.mcs.anl.gov/dbpp/ which covers lots of problems that have to be solved.

I'd love to see a language (or language extension) cleanly define a way to let me define a code block attributes which could affect how and where it gets executed. The runtime library could then distribute that block as the environment best allows.

--
This is a boring sig

Re:Compilers by The+boojum · 2005-09-13 10:20 · Score: 2, Informative

I'd love to see a language (or language extension) cleanly define a way to let me define a code block attributes which could affect how and where it gets executed. The runtime library could then distribute that block as the environment best allows.
Have a look at OpenMP. Granted, it's more for shared-memory systems than clusters, but it works similarly to what you describe.
Re:Compilers by blackcoot · 2005-09-13 11:18 · Score: 1

a nit to pick: java has threading built in as a primitive. this is something java takes advantage of when using javaspaces, jini, and jxta. that said, you still suffer the issues imposed by network topologies in a distributed memory multiprocessor.
Re:Compilers by MarkLewis · 2005-09-13 12:28 · Score: 1

Well, this is what J2EE is supposed to accomplish with EJB. Of course in previous versions EJB implementations have suffered from terrible performance and terrible code complexity. Supposedly both have gotten better, but I haven't dared to look at them again yet.
Re:Compilers by GileadGreene · 2005-09-13 14:41 · Score: 3, Informative

I'd love to see a language (or language extension) cleanly define a way to let me define a code block attributes which could affect how and where it gets executed.
The venerable occam programming language requires that each block of code be specifically identified as being executable either in parallel or sequentially. Since PAR and SEQ constructs can be nested, it is easy to build up quite complex concurrent structures that can easily be distributed. Since the semantics of occam processes are derived from Hoare's CSP process algebra, the compositional nature of occam's parallelism is theoretically sound, and avoids many of the problems associated with the thread-based concurrency model that most people are familiar with.
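
To give a feel for how PAR and SEQ nest, here is a loose imitation in Python. This is not occam: `seq` and `par` are made-up helper names, threads stand in for occam processes, and there are no channels or CSP guarantees here.

```python
import threading


def seq(*procs):
    """Combine procedures so they run one after another, in order."""
    def run():
        for proc in procs:
            proc()
    return run


def par(*procs):
    """Combine procedures so they run concurrently; finish when all have."""
    def run():
        threads = [threading.Thread(target=proc) for proc in procs]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    return run


if __name__ == "__main__":
    log = []
    program = seq(
        lambda: log.append("start"),
        par(lambda: log.append("x"), lambda: log.append("y")),
        lambda: log.append("end"),
    )
    program()
    print(log)  # "x" and "y" may appear in either order
```

Because `seq` and `par` both return an ordinary procedure, they can be nested to any depth, which is the compositional property described above.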
Re:Compilers by Marillion · 2005-09-13 17:11 · Score: 2, Interesting

No. Java is a perfect example of API based threading and in Java it's easy to do. Still a class has to implement Runnable and the programmer has to create a Thread and start it.
The synchronize keyword was closer to where I was going. Suppose Java had a thread modifier keyword for looping operators. You could then:
    public void renderImage(Image images[]) {
        thread for (int i = 0; i < images.length; i++) {
            render(images[i]);
        }
    }
each iteration of the looping block launches as a different task running in parallel and the loop exits once all tasks are complete. Or an easier RPC like
    public double getSalary(EmpID id) {
        String idString = id.getID();
        double salary;
        remote("hr-system") {
            salary = HR.getSalary(idString);
        }
        return salary;
    }
I certainly recognize that there are some very hard problems that would need to be solved in such a scenario. Thread synchronization, mutexes, semaphores would need to be looked at. A clean way to integrate directory services and other ways of defining environmental resources is not trivial either and critical for the success of such a language.
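
For comparison, the closest thing you can write today with a library rather than a keyword, sketched in Python. `render` is a stand-in that just upper-cases a file name; nothing here is a real image API.

```python
from concurrent.futures import ThreadPoolExecutor


def render(image):
    """Stand-in for real rendering work; here it just upper-cases a name."""
    return image.upper()


def render_images(images, max_workers=4):
    """Run render() on every image as a separate task and wait for all.

    Leaving the `with` block waits for every task, which mirrors the
    "loop exits once all tasks are complete" behaviour of `thread for`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, whatever order tasks finish in.
        return list(pool.map(render, images))


if __name__ == "__main__":
    print(render_images(["a.png", "b.png", "c.png"]))
```

This is still API-based threading, so it does not answer the wish for language-level support; it only shows the semantics the hypothetical `thread for` would need.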

--
This is a boring sig
Re:Compilers by Anonymous Coward · 2005-09-13 18:43 · Score: 0

Take a functional language. Most of the time, order of execution does not matter (except for I/O operations and the like) and the compiler could spread execution across several threads. Are there any compilers out there doing that?
Re:Compilers by stoborrobots · 2005-09-14 01:58 · Score: 1

I remember seeing something very similar to what you describe... A little poking around brings me to this page about implementing image-processing in hardware, (originally seen on robots.net).

They talk about OpenMP, (as The Boojum mentioned) and they use it in a way analogous to what you're describing there... an example: (Damnit... slashcode fuxors up the indenting...)

Listing 4: Implementation of replication sort
    par (element=0; element<SIZE; element++) {
      seq {
        par (element2=0; element2<SIZE-1; element2++) {
          ifselect(element>element2) {
            if(uList[element] > uList[element2])
              comp[element][element2] = 1;
          } else ifselect (element<=element2) {
            if(uList[element] >= uList[element2+1])
              comp[element][element2] = 1;
          }
        }
        position[element] = SUM_OF_DIGITS(comp[element]);
        sList[position[element]]=uList[element];
      }
    }
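
My reading of that listing, written as plain sequential Python so the logic is easier to follow. The function name is made up; the comparisons follow the listing (strict `>` against earlier elements, `>=` against later ones), and the loops marked `par` are the ones the hardware version runs in parallel.

```python
def replication_sort(u_list):
    """Sort by computing every element's final position independently.

    An element's position is the number of other elements that must
    come before it, so no iteration depends on any other iteration.
    """
    size = len(u_list)
    s_list = [None] * size
    for element in range(size):            # outer `par` in the listing
        position = 0
        for other in range(size):          # inner `par` in the listing
            if other < element and u_list[element] > u_list[other]:
                position += 1
            elif other > element and u_list[element] >= u_list[other]:
                position += 1
        # Ties are broken by index, so every element gets a unique slot.
        s_list[position] = u_list[element]
    return s_list


if __name__ == "__main__":
    print(replication_sort([3, 1, 2]))
```

It does O(n^2) comparisons in total, which is a bad deal on one CPU but fine when every comparison has its own piece of hardware.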

--
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
Re:Compilers by Anonymous Coward · 2005-09-14 03:43 · Score: 0

Fialta at http://www.fialta.org/
Re:Compilers by Shewmaker · 2005-09-14 14:50 · Score: 1

I've never tried it, but there is a project that claims to have an OpenMP implementation that runs over distributed shared memory.

Omni/SCASH: Cluster-enabled Omni OpenMP on a software distributed shared memory system SCASH

--
"For the Snark was a Boojum, you see." -From the Hunting of the Snark: An Agony in Eight Fits, by Lewis Carroll
Re:Compilers by hackstraw · 2005-09-15 02:20 · Score: 1

Take a look at OpenMP.

It takes a commercial compiler, but it's straightforward, an open specification, can be used "automagically", and it's portable across machines and languages.

It does not work on a clustered system, but only one that has local processors and memory.

What type of cluster do you want? by Hast · 2005-09-13 09:30 · Score: 2, Informative

First off, it's not entirely clear what you want to do with it. If you want load balancing then that's one problem. If you want parallel batch processing (such as rendering farms or compiling) then that's another problem. And for the really juicy stuff, ie running a normal application distributed on multiple computers then that is a third, and very different problem.

But all of them require that you add something to the original program which distributes the work (load balancing/render farms). If you want your original program to run in parallel then that is a much harder problem to solve. Basically you'll have to remake it into something like the above.

The last problem would basically require the computer to extract threads out of your code. This is pretty much impossible to do automatically though.

Re:What type of cluster do you want? by Reverend528 · 2005-09-13 11:24 · Score: 1

Server applications can be load-levelled across a cluster without modification if the load-balancing is done on the process level. Apache has no problems load-balancing on an OpenSSI cluster, since the individual processes don't depend on each other.

--
Badass Resumes
Re:What type of cluster do you want? by Hast · 2005-09-14 01:09 · Score: 1

Yes but you still need an application doing the actual load-balancing. That may be built in or not; but it still has to exist.

And I imagine that if you have a complex site with dynamic content from a database you'll still need to optimise your design to get as much use as possible from the load balancing.

What you want... by jd · 2005-09-13 09:32 · Score: 1

Is a clustering system that openly moves applications around, and then allows shared memory to be distributed, along with devices and allows applications to move around yet retain their network connections.

Oh, that sounds like a tough one to me. Ok, ok, it's actually not that tough - but it DOES require combining a number of kernel patches, there's no one-stop-shop (at the moment) for this. It also requires that network connections be IPv6, as there's bugger all mobility support out there for IPv4 for Linux as best as I can tell.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Re:What you want... by aminorex · 2005-09-13 12:17 · Score: 1

There's Rocks and Racks.

--
-I like my women like I like my tea: green-

Because bandwidth is scarce. by roystgnr · 2005-09-13 09:35 · Score: 2, Interesting

If your problem is so parallelizable that bandwidth isn't a limitation, then you don't need any special clustering software, you just need nfs and ssh: I do all my compiling in a flash with a short script and "make -j 16 CXX=sshcxx".

If your problem isn't that parallelizable and yet you need a whole cluster of computers to run it, odds are you need more efficiency than distributed shared memory can give you. You can access memory on your own node with orders of magnitude more bandwidth and less latency than on other nodes, and if your application doesn't take that into consideration it can run orders of magnitude slower.

Of course, that doesn't apply to every problem, and there are people trying to create exactly the cluster-as-computer architecture you'd like to see for ease of application programming. Check out OpenMosix and MigShm for one example - I haven't used the latter DSM patch myself but I know that for non-shared-memory programs, Mosix has had working process migration code for years.

Re:Because bandwidth is scarce. by versus · 2005-09-22 14:56 · Score: 1

I do all my compiling in a flash with a short script and "make -j 16 CXX=sshcxx".
How about distcc? No need for NFS.

--
Brain is my second favorite organ.

openMosix by Codename_V · 2005-09-13 09:42 · Score: 2, Informative

Actually, I'd recommend openMosix. Granted Mosix is the original and is open source now as well, but it still seems like openMosix is more actively developed.

--
Free will is just an illusion

Not all problems are equally divisible by Anonymous Coward · 2005-09-13 10:04 · Score: 0

If one woman can have one baby in nine months, can nine women have one baby in one month?

Because clustering is a hack by Anonymous Coward · 2005-09-13 10:05 · Score: 0

Clustering grew out of a poor man's solution to distributed computing. There have been plenty of distributed operating systems over the decades, but the status quo weighs heavily. Rather than revolutionary OSes we use evolutionary OSes.
For example, a common solution for distributed computing is to use Linux (a 1970'ish design that isn't a distributed OS at all) and cluster it hence forcing application writers to re-write to manually divide computation along latency/bandwidth lines.
There is no technological reason why a compiler couldn't do a detailed whole-app dataflow analysis of a program and compute a good (or perhaps optimal) distribution for computations automatically.
It simply isn't done that way, because applications of distributed computing are either inherently easy to divide up (like serving web pages, raytracing, database access etc.), or are scientific computations that are written by practitioners with little CS understanding of distributed computing.
Similarly, no company has any interest in building truly 'supercomputing' hardware any more - they're more interested in re-packaging off-the-shelf parts because the market is too small.
Hence, today's 'supercomputers' are just clusters in disguise. So, if your local supercomputer uses Intel PC CPUs connected into a cluster, why not just run Linux on it and force app writers to divide up the computations by using MPI or something? Right?
It may not be easy to program nor use the compute resources optimally, but it keeps everyone in a job so who cares.
So, in summary, the answer to your questions is: because people don't like change, like job stability, maintaining the status-quo and are lazy.
Disclosure: I work for the organization that runs the largest 'supercomputer' in the US.

Infiniband by jd · 2005-09-13 10:15 · Score: 1

This is where infiniband comes into play, as it has built-in support for distributed direct memory access and caters to sufficient bandwidth to support it. The problem is the questioner doesn't state what sort of interconnect they're using - and that matters in a cluster. Ethernet, SCI, Infiniband, etc, all support different types of solution.

You can only use solutions that exist for the technology you're using. Likewise, though, you're not limited by constraints on technologies you're not using.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

This is a basic systems question. by stienman · 2005-09-13 10:20 · Score: 4, Informative

This is a basic systems question:

[Why must] programs be re-written to take advantage of the cluster.

The simple answer is that programs, in general, are written as single threaded applications with shared state (memory). A cluster is the opposite of that - multiple parallel CPUs without shared state (or at least requiring one to be explicit about shared state, as opposed to simply declaring a variable).

Usually a program algorithm has to be completely re-designed in order to take advantage of the cluster, while mitigating the problems. At minimum the program must be parallelized. If you don't change the program to successfully deal with shared memory latency, then the cluster becomes, at best, about as powerful as a single fast computer running the program.
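
A small sketch of the difference, in Python. Both functions add up the same numbers; the function names are made up, and threads with private queues are only a stand-in for cluster nodes (a real cluster would use something like MPI over a network).

```python
import queue
import threading


def sum_shared_state(numbers, workers=4):
    """Single-box style: every worker updates the same variable."""
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        for n in chunk:
            with lock:          # shared state needs locking even locally
                total += n

    threads = [threading.Thread(target=work, args=(numbers[i::workers],))
               for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


def sum_message_passing(numbers, workers=4):
    """Cluster style: a worker sees only what arrives in its inbox."""
    results = queue.Queue()

    def node(inbox):
        chunk = inbox.get()         # explicit receive of the input data
        results.put(sum(chunk))     # explicit send of one partial result

    threads = []
    for i in range(workers):
        inbox = queue.Queue()
        inbox.put(list(numbers[i::workers]))   # explicit send of the data
        thread = threading.Thread(target=node, args=(inbox,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return sum(results.get() for _ in range(workers))


if __name__ == "__main__":
    data = list(range(1000))
    print(sum_shared_state(data), sum_message_passing(data))
```

The first version touches the shared variable once per number, which is cheap in one box and ruinous over a network. The second was restructured so each worker communicates exactly twice, and that restructuring is the "re-design" in question.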

The reason you are asking this question is that you don't realize that a cluster is fundamentally different than a single (or dual or quad) CPU. The architecture is completely different. You can't expect to treat it like any old computer.

-Adam

What about the mythical TOE and RDMA by mkop · 2005-09-13 10:21 · Score: 1

I have been hearing about TOE http://en.wikipedia.org/wiki/TCP_Offload_Engine and RDMA http://en.wikipedia.org/wiki/Remote_Direct_Memory_Access for over a year now both of these would help with clustering of remote servers and there CPU's

Re:What about the mythical TOE and RDMA by cant_get_a_good_nick · 2005-09-13 12:33 · Score: 1

TOE would help free up some cycles on an individual machine, but I'm guessing the question doesn't ask that. It's a little unclear, but I'm guessing it's "If I can already hit a virtual machine, why can't I code to a virtual machine whose real machine happens to be the cluster?"

Generally it looks like he's asking for a VM to be an abstraction layer over a cluster. The problem is abstractions are simplifications, and you can't just simplify away the real problems of a cluster. There are some solutions that can always help simplify (OpenMPI sounds closest with what little research I've done) but clusters are hard, and simplification layers on top of it can't delete all complexity.
Re:What about the mythical TOE and RDMA by Anonymous Coward · 2005-09-15 17:19 · Score: 0

and there CPU's

You mean their CPU's. Jackass.

VirtualIron is probably what you're looking for by hansendc · 2005-09-13 10:26 · Score: 1

VirtualIron is a company/product that runs a para-virtualized Linux instance (more similar to Xen than Bochs or the desktop VMWare) which spans multiple physical machines.

http://www.virtualiron.com/

My try by fm6 · 2005-09-13 10:48 · Score: 2, Interesting

Lots of good answers, but none that quite satisfy me. Here's mine:

The virtual machines you mention all run on a single existing system. You want a virtual machine that runs on multiple systems. That goes way beyond what the existing VMs do. They just implement the hardware instructions of a single system in software running on a single system. Taking that implementation and spreading it out among multiple systems means anticipating every clustering problem the code might raise, and solving it in advance.

Nobody knows how to do that. If they did, they'd implement it as the back end of a compiler rather than waste the overhead of using a VM.

(They say that there are no stupid questions. Not true. But there are lame stupid questions, and interesting stupid questions. My vocation is answering interesting stupid questions, which is why I'm grateful for this one!)

why don't you do a little experiment? by blackcoot · 2005-09-13 11:09 · Score: 1

pick an algorithm (say matrix multiply). write the fastest possible serial implementation you can (hint: you can do better than O(n^3)). then implement a parallel matrix multiply using MPI. now make the parallel one run as fast as possible.

i can guarantee that the serial algorithm is about a day's worth of effort to implement; however, the parallel one will require at least a week. as you start working through the parallel implementation, you'll quickly discover that all the things that are true in a shared memory multiprocessor are no longer true in a distributed memory multiprocessor. you'll quickly appreciate the communication costs and overheads imposed by the cluster. you'll also appreciate just how much of a bitch attempting to debug parallel programs is (where did stdout go? depends on your MPI implementation... whee!). topology also becomes a major factor: parallel sorts which are efficient on toroidal grid topologies are no longer efficient on hypercubes, and vice versa.
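
a minimal sketch of the two versions, in python rather than MPI. assumptions: matrices are lists of lists, the serial version is the plain textbook O(n^3) one (not the faster algorithm hinted at above), and threads stand in for cluster nodes, so this shows the decomposition and not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_serial(a, b):
    """Textbook O(n^3) multiply of two lists-of-lists matrices."""
    cols_b = list(zip(*b))  # the columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]


def matmul_row_blocks(a, b, workers=2):
    """Split the rows of `a` into blocks and multiply each block separately.

    Every worker needs only its own rows of `a` but ALL of `b`; on a
    real cluster that copy of `b` would have to be shipped to each node.
    """
    if not a:
        return []
    block = -(-len(a) // workers)   # ceiling division
    blocks = [a[i:i + block] for i in range(0, len(a), block)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda rows: matmul_serial(rows, b), blocks)
        # Stitch the partial results back together in row order.
        return [row for part in partial for row in part]


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6]]
    b = [[7, 8], [9, 10]]
    print(matmul_serial(a, b))
    print(matmul_row_blocks(a, b))
```

even in this toy the parallel version has to decide how to split the data, what each worker needs a copy of, and how to reassemble the answer. none of those questions exist in the serial code.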

mod parent up by blackcoot · 2005-09-13 11:13 · Score: 1

you managed to communicate what i was attempting to in a far more succinct way than i managed to.

Re:mod parent up by AugstWest · 2005-09-13 12:36 · Score: 1

I would give it points, but I guess I can't, since I asked the question :] Thanks for a great answer.
Re:mod parent up by blackcoot · 2005-09-13 12:57 · Score: 1

intro. to parallel computation is known as one of the nightmare master's level classes (the sort of class that most folks only took because they were either a] utter masochists, b] far too clever, or c] didn't have any other choices) at my alma mater. i kinda fell into all three of those categories and ended up doing pretty well. the question you asked is one that my prof posed to us our first night of class and one that came back to haunt us almost every class thereafter (and certainly on the final)...

anyways, it's a good question to ask --- shows you're at least thinking about things. ultimately, clusters will have such low latencies that they'll move in the direction you're talking about --- distributed-but-sufficiently-low-latency-that-we'll-pretend-that-it's-a-single-ram. until then, we're still stuck with MPI and friends...

MOSIX License by Noksagt · 2005-09-13 11:18 · Score: 2, Informative

Actually, I'd recommend openMosix.

Agreed.

Granted Mosix is the original and is open source now as well,

Not by OSI/DFSG/FSF standards. The license is still very restrictive. I think the kernel patches might be under GPL, but certainly not the user tools.

it still seems like openMosix is more actively developed.

This is certainly true. Most talent jumped ship & openMosix does have a higher number of active developers (and is somewhat backed by AMD (though I think AMD can and should give more developers to the project)).

Holy Grail by owlstead · 2005-09-13 11:18 · Score: 3, Insightful

This will be a bit difficult to explain fully. The other posts have already touched lightly on the problems involved (especially latency). But you are talking about the holy grail of parallel computing here: seeing one system while it is running all over the place. My best advice for you is to get a good book on parallel systems and get educated. This is something like asking a doctor why there are still diseases.

No, the hard part is ... by hummassa · 2005-09-13 11:56 · Score: 2, Interesting

cache consistency. When I modify a page that is in my processor cache, I now have to put the word out to the whole network -- and I can't really commit that page until I know for sure that other threads in the cluster did not modify the same page (and, if someone did, I must decide how to merge their modifications with mine, notify them of the merge, etc, etc...) What was a quick (important for performance) operation becomes a dog-slow operation, and maybe puts the whole motive for using a cluster in jeopardy...

HTH,
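To put a number on the "put the word out" cost, here is a toy model. It is an assumption-heavy sketch: it uses a simple write-invalidate scheme (other holders drop their copy) instead of the merge step described above, the class and method names are hypothetical, and it only counts messages without sending anything.

```python
class ToyPageDirectory:
    """Tracks which nodes hold a copy of each page (write-invalidate)."""

    def __init__(self):
        self.holders = {}   # page id -> set of node ids with a valid copy
        self.messages = 0   # invalidations plus acknowledgements

    def read(self, node, page):
        # a reader just joins the set of holders (fetch cost not counted)
        self.holders.setdefault(page, set()).add(node)

    def write(self, node, page):
        # every other holder must be told to drop its copy and must
        # acknowledge before the write can be committed
        others = self.holders.get(page, set()) - {node}
        self.messages += 2 * len(others)
        self.holders[page] = {node}
        return len(others)
```

With four nodes holding the same page, one write costs six messages in this model (three invalidations, three acknowledgements), each paying full network latency. On a single box the equivalent bookkeeping is done by the cache hardware.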

--
It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048

Your language doesn't support it by photon317 · 2005-09-13 12:37 · Score: 3, Informative

The only way you'll have source code that compiles and runs unmodified on architectures of widely varying parallelism efficiently is for the language itself to know about parallelism, and make it the compiler's (and even runtime-linker and kernel's) job to parallelize your code for you. An inherently parallel language would have ways for you to specify in your source code what can and cannot be executed in parallel, and what code absolutely depends on the serial execution of some previous code. Even then, we're really only talking about the SMP case. When you start involving network latencies and bandwidth restrictions, the decisions on when and how to parallelize become more challenging for the compiler/runtime, possibly requiring either more intelligence on its part and/or more meta-information in your source code.

Until you write code in a language like that, you can never expect to write code in a single-threaded mindset and then have it just magically take advantage of a parallel environment.
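As a small illustration of what "telling the system what depends on what" buys you, here is a sketch in Python. It is not a parallel language, just a hypothetical helper (the name `parallel_waves` and the task names are invented) that takes declared dependencies and works out which tasks could safely run at the same time.

```python
def parallel_waves(depends_on):
    """Group tasks into waves; tasks within one wave do not depend on each other.

    `depends_on` maps each task name to the set of tasks that must
    finish first. Raises ValueError if no progress can be made.
    """
    remaining = {task: set(deps) for task, deps in depends_on.items()}
    done = set()
    waves = []
    while remaining:
        # a task is ready once everything it depends on has finished
        ready = sorted(t for t, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown task")
        waves.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return waves
```

Given `{"load_a": set(), "load_b": set(), "multiply": {"load_a", "load_b"}, "show": {"multiply"}}`, the two loads land in the first wave and the rest run one after another. The scheduler can only do this because the dependencies were written down; it cannot guess them from ordinary serial code.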

--
11*43+456^2

Re:Your language doesn't support it by Anonymous Coward · 2005-09-13 18:53 · Score: 0

...and what you need is a concurrent programming language. While many of those listed are more or less research languages, I know that at least Erlang and Occam have been used in commercial applications. Then again, something completely far out like Mozart/Oz could bring a revolution in programming...

With the current shift to multi-core processors as the way to bring more performance, I am waiting for the day when processors have thousands or even millions of cores. At that stage programming with threads in C/C++, Java or Python would be like programming in assembly -- you can either spend months debugging race conditions in the aforementioned languages, or take the step to advanced concurrent languages and get full advantage of all the available parallelism for free.

completely different issues by __aazofn1209 · 2005-09-13 13:00 · Score: 1

Clustering is a many-cpu, one-problem situation. Many problems are not "do this thing 1000 times", but "perform these 1000 steps in order", so it requires a lot of work to make the simultaneous availability of many CPUs an advantage. The goal is to increase the speed of a CPU-intensive task.

Virtualization is a many-problem, one-cpu situation. Various software tricks make each of several programs think they have an entire system to themselves. In reality it all runs inside a virtualizer/emulator. Speed is sacrificed, but there are other advantages in management, flexible allocation, etc.. The goal is to make better use of a few CPUs by a larger number of programs.
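The difference between "do this thing 1000 times" and "perform these 1000 steps in order" can be sketched in a few lines of Python. The function names and the arithmetic are made up for illustration, and Python threads are used only to show the structure, not to claim a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def independent(values, workers=4):
    """'Do this thing N times': each call needs nothing from the others."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))


def ordered(values):
    """'Perform these N steps in order': step i needs the result of step i-1."""
    total = 0
    history = []
    for v in values:
        total = total * 2 + v   # depends on the previous total
        history.append(total)
    return history
```

The first function can be handed to a pool of workers as-is. The second, written this way, cannot: every iteration waits on the one before it, so extra CPUs sit idle unless someone restructures the computation.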

VMware doesn't virtualize the CPU by Torrenc · 2005-09-13 13:13 · Score: 1

VMware doesn't virtualize the CPU. Whatever native CPU you're running is what the Guest OS sees.

Re:VMware doesn't virtualize the CPU by jbridge21 · 2005-09-14 06:57 · Score: 1

Dude, that's what "virtualization" means in the context of recent computing... "emulation" is where you emulate the entire thing a la Bochs. With "virtualization" you just trap system code and hardware access and redirect them.

Re:VMware doesn't virtualize the CPU by Anonymous Coward · 2005-09-14 13:47 · Score: 0

Nope. The standard terminology is that "emulation" is the trapping of events and redirecting them (VMware/Xen), "simulation" is the software mimicry of the entire hardware structure (Bochs). "Virtualization" is a more general term for any such trick of presenting the illusion of a CPU that doesn't really exist -- that is, both emulation and simulation are types of virtualization.

Your example... by PaulBu · 2005-09-13 16:17 · Score: 1

... reminds me of Cray's famous statement that though one woman can have a child in 9 months, nine women would not be able to have one in 1 month. ;-)

Yes, all the answers above were quite sufficient to explain why you have to re-code your app if you want it to run _faster_ when you add _more_ nodes. And it is so easy to make it run slower -- I bet the original poster would benefit from trying to re-code a sequential program into a parallel one at least once, and then asking himself "How the heck can I teach a compiler to deal with this mess???" (not to mention the OS/virtualization hardware, which has to deal with binaries, and it is so much harder to extract dependency information from those on the fly!*).

Paul B.

* Yes, there was that result from (as far as I remember!) an HP virtualization group where some SPEC apps ran FASTER on a virtual machine with a just-in-time compiler than as natively compiled code -- it had to do with the virtualizer knowing which branch your program will take next and preparing for that. You'd have to google for the whole story, it was quite educational!

Re:Your example... by Tower · 2005-09-15 01:23 · Score: 1

Here's the Ars Technica article on the HP Dynamo tech.

--
"It's tough to be bilingual when you get hit in the head."

Re:Your example... by jgrahn · 2005-09-15 08:19 · Score: 1

... reminds me of Cray's famous statement that though one woman can have a child in 9 months, nine women would not be able to have one in 1 month.

I've never seen that one attributed to anyone but Fred Brooks. (But it's still funny.)

Listing 4: Implementation of replication sort by stoborrobots · 2005-09-14 02:00 · Score: 1

Listing 4: Implementation of replication sort

    par (element=0; element<SIZE; element++) {
        seq {
            par (element2=0; element2<SIZE-1; element2++) {
                ifselect(element>element2) {
                    if(uList[element] > uList[element2])
                        comp[element][element2] = 1;
                } else ifselect (element<=element2) {
                    if(uList[element] >= uList[element2+1])
                        comp[element][element2] = 1;
                }
            }
            position[element] = SUM_OF_DIGITS(comp[element]);
            sList[position[element]]=uList[element];
        }
    }
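For anyone who doesn't read the `par`/`seq` notation, here is my reading of the listing above as ordinary sequential Python. It is an interpretation, not the original author's code: the function name is made up, and the two nested `par` loops become plain `for` loops, so none of the parallelism survives. Each element counts how many other elements belong before it, and that count is its slot in the sorted output.

```python
def replication_sort(u_list):
    """Sequential rendering of the rank-counting sort in the listing."""
    size = len(u_list)
    s_list = [None] * size
    for element in range(size):            # outer `par` in the listing
        position = 0
        for element2 in range(size - 1):   # inner `par` in the listing
            if element > element2:
                # compare against an earlier element: strict >
                if u_list[element] > u_list[element2]:
                    position += 1
            else:
                # compare against a later element (index shifted by one): >=
                if u_list[element] >= u_list[element2 + 1]:
                    position += 1
        s_list[position] = u_list[element]
    return s_list
```

Mixing `>` for earlier elements with `>=` for later ones is what gives duplicates distinct positions. In the parallel version every comparison is independent, which is why the listing can run them all at once given enough hardware.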

--
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco

cilk by Shewmaker · 2005-09-14 03:30 · Score: 1

You should check out Cilk. Two of the people behind the project used to work for Thinking Machines, the company that made one of the best supercomputers of its day. Cilk adds a few keywords to C and it requires much less effort than most other parallel programming models. Unfortunately, the distributed version of it was only a prototype and isn't included in the latest release. If nothing else, you should read the papers these guys have written.

They also had a couple of graduate students that made Jilk (Cilk for Java). From the sound of their papers it isn't ready for production use yet, but it's something to keep in mind if you prefer writing code in Java.

--
"For the Snark was a Boojum, you see." -From the Hunting of the Snark: An Agony in Eight Fits, by Lewis Carroll

Xgrid agent for Java by Anonymous Coward · 2005-09-14 04:23 · Score: 0

http://sourceforge.net/projects/xgridagent-java/

Qemu Virtualization by phorm · 2005-09-14 06:44 · Score: 1

Indeed, I've been using Qemu a lot lately. While it's great for my needs, it emulates a lower-speed P2 CPU (on my P4 machine) and requires that any device hooks also be understood and passed-through/translated by the virtual machine.

In the end, you'll get better performance and compatibility out of coding for a cluster, rather than having the overhead and redirection of the virtualization process.

I'm Surprised by Anonymous Coward · 2005-09-14 07:13 · Score: 0

So surprised to see such dumb questions make it to the front page of Slashdot... well not really.

The halting problem by theLOUDroom · 2005-09-14 14:21 · Score: 1

Wouldn't being able to partition any normal program into a program that executes (efficiently) on multiple CPUs basically require that someone solves the halting problem?

I bring this up because it seems like, in order to partition the tasks efficiently, you'd basically need to be able to predict what the program was trying to do in advance.... and if you could predict what a program was going to do in advance of actually running it, it would seem like you have just solved the halting problem.

Any CS folks care to weigh in? This isn't really my field, but I'm quite interested to know one way or the other.

--
Life is too short to proofread.

Re:Mosix - a great answer, but not for everything. by ancientt · 2005-09-14 16:11 · Score: 2, Insightful

I don't think it's quite as simple as a right answer. Sure, openMosix rocks, but it's only one kind of answer, not the final one. OpenMosix spreads the processes around but can't split a single process up to make it complete faster. It can send processes to the most likely CPU but that still doesn't address the question of speeding up the time that the process will take to complete.

Beowulf clusters typically are designed for specific purposes and software is written to take advantage of the design. You can't have two computers add 2+2 any faster than you can have one computer do it. You can however, have two computers adding 2+2 and 0+1+1 at the same time to get two answers in half the time it would take one computer to do it.

I'm certainly no expert, but I have researched this a bit since I work in a department with a LOT of extra boxes lying around. They're slow individually but together add up to a good bit of processing power and memory. We want to put them to use but the question is "what use?"

That question boils down to programs designed to use multiple threads versus splitting processes. If your needs involve running things that require lots of processes, then openMosix is a good bet, but if you're simply wanting to make your favorite software run faster, the answer might be to rebuild it to take advantage of a Beowulf cluster with more threads rather than trying to divvy up the processes. Fortunately, there are compile tools out there to make it a little easier and specifically openMosix has some compile tools to make programs more multi-process friendly.

Despite all the tools though, some programs just don't divide well without significant recoding. If you're faced with that type of problem, it's time to call in the coding gurus because openMosix can't help you. Others, like Apache and MySQL, were practically written to be shared.

OpenMosix may be the answer or not, it all depends on the question, which in this case isn't completely clear because the objective and software desired aren't discussed.

As to why clustering works this way, there are far more technical and probably much more accurate answers, but in simple terms, you can't make two computers do one thing faster than one computer can do it unless you can divide the job. Some jobs divide easily, some don't.

--
B) Eliminate all the stupid users. This is frowned upon by society.

Re:Mosix - a great answer, but not for everything. by bentini · 2005-09-15 11:06 · Score: 1

"You can't have two computers add 2+2 any faster than you can have one computer do it. You can however, have two computers adding 2+2 and 0+1+1 at the same time to get two answers in half the time it would take one computer to do it."

Maybe not for 2+2, but you could for large numbers which are not atomic to add. If it takes linear time to do a task on one processor, on the Connection Machine it could basically end up being lg n time.

Cf. here
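One way to picture the lg n claim is a pairwise (tree) sum. The sketch below is only a simulation with a made-up function name: it runs the rounds one after another on a single CPU and just counts them, on the assumption that a machine with enough processors could do all the additions within a round at once.

```python
def tree_sum(values):
    """Sum by pairwise combining; returns (total, rounds).

    Every addition within one round is independent of the others.
    Here the rounds are simply executed one after another.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd element carries over unchanged
        level = nxt
        rounds += 1
    return (level[0] if level else 0), rounds
```

Summing 8 numbers takes 7 additions either way, but they fit into 3 rounds here instead of 7 sequential steps. The catch, as the rest of the thread points out, is that on a cluster every round also pays for moving the partial sums between nodes.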
