Mosix 1.0 Released

← Back to Stories (view on slashdot.org)

Posted by michael on Saturday May 5, 2001 @04:08AM from the insert-Beowulf-joke-here dept.

Mosix is a scalable clustering system for Linux, released under the GPL. Version 1.0 for the 2.4 kernels is now available.

1 of 77 comments (clear)

Min score:

Reason:

Sort:

A quick primer on types of paralell systems. by landley · 2001-05-05 05:20 · Score: 5

There are traditionally three different types of paralell processing systems: SMP, NUMA, and networked clusters like Beowulf. In reality, these form a continuous range, with SMP at one end, Beowulf at the other, and NUMA in the middle.
SMP is Symmetrical Multi-Processing, or one computer with multiple processors just like multiple hard drives, multiple serial ports, or multiple banks of RAM. In an SMP setup, each processor has equal access to the other system resources, and although they may need locking to avoid stomping on each other's activities, it's no more expensive for processor #2 to access a certain resource (such as an area of main memory) than it is for processor #5 to do so. Thus there's no real reason to shuffle processes around to be "closer" to some other resource.
The other end of the spectrum is message passing networked clustering, like beowulf, where isolated systems (each with its associated set of resources) accept complete tasklets, work on them more or less alone, and output the results. Accessing resources from the rest of the cluster is very expensive, and you try not to do it more than absolutely necessary (once per transaction). A message comes in with all the info a node needs to do its work, and the node sends a message back out with the result and to announce it's ready for the next mouthful.
NUMA is in between, and it stands for Non-Uniform Memory Architecture. You have a bunch of similar processors, like in SMP, but some resources are "close" to each processor and some are far away.
Remember, clusters own resources outright, this is my node's memory. On SMP all processors access a pool of shared resources (like main memory) at the same speed (hence symmetrically). On NUMA, processor #53 -CAN- access memory over by processor #1736, but it'll take much longer than if it accesses memory near itself. It'll block, it'll have wait states. (Just like accessing a page swapped to the hard drive vs accessing one in memory.)
The thing is, as systems on either end become more complex they move towards NUMA. Think mondo SMP systems with dozens of processors, each of which has megabytes of L1 cache. You want to keep stuff "in cache" rather than accessing main memory, and sometimes you wan't to access something that's currently in some other processor's cache. Cache line pollution and such. That's a NUMA type of problem.
From the other end, once you start connecting beowulf clusters together with really high speed interconnects (like gigbit ethernet or myrinet, and often speed here is more a question of latency than bandwidth,) and start teaching them how to pretend to be one big shared memory image by page faulting through the network, you're approcaching NUMA from the other end. Stuff's in my machine's memory locally right now, and swapping it in from some other guy's memory (and swapping out some of my stuff to make room for it) is something I only want to do when absolutely necessary, because it slows me down.
MOSIX is taking beowulf clusters in the direction of NUMA. This is a good thing, it makes them more flexible and capable, but it opens up a whole can of worms to optimize it properly. (Not a new can of course, the kernel hackers are already dealing with a rather significant portion of NUMA's issues just trying to get 32 processor alphas to work smoothly.) If the interconnects between clusters were perfect, we could just treat it as one big SMP machine. Then again if our hard drives were as fast as our ram we wouldn't try so hard to minimize swapping, would we? You could still just treat MOSIX as SMP instead of NUMA if you don't want to optimize your performance. And for many things that's a fine solution, just distributing it cross the cluster gives you all the performance you need, and adding nodes is more cost effective than rewriting your app for greater speed in the new environment.
But performance hits of thrashing all your pages through the network can be just as bad as thrashing them in and out of the swap partition. And performance is the only reason we're using clusters in the first place, isn't it?
And NUMA optimization just makes maintaining locality of reference, streamlined locking, and minimizing contention for commonly accessed resources even MORE important. It's the same kind of thing you'd do on a normal SMP machine anyway, it just has more of an impact, because there's more inefficiency to optimize away.
Rob