Maintaining Large Linux Clusters

Posted by michael on Saturday June 14, 2003 @09:01AM from the sysadmin's-delight dept.

pompousjerk writes "A paper landed on arXiv.org on Friday titled Installing, Running and Maintaining Large Linux Clusters at CERN [PDF]. The paper discusses the management of the 1000+ Linux nodes, upgrading from Red Hat 6.1 to 7.3, securely installing over the network, and more. They're doing this in preparation for Large Hadron Collider-class computation."

Re:"But why?" asked Little Johnny. by hak+hak · 2003-06-14 09:30 · Score: 4, Informative

Because of the computations required to analyze the enormous amount of data a particle collider outputs. The scattered particles go through all sorts of detectors, which measure their energy and direction and send the readings to the cluster, which has to search for particles significantly smaller than a needle in a haystack of measurements.

(Disclaimer: IANAPP (Particle Physicist))
Single system image by Tester · 2003-06-14 09:56 · Score: 5, Informative

Where I work, we are developing a clustering system using single system images, where the entire OS is stored on a server and is NFS-mounted by each node. Our current tests show that we can easily run 100 nodes on 100 Mbit Ethernet from a single server. And the coolest thing is that the nodes mount the / of the server, so for "small clusters" (under 100 nodes), we have to do a software upgrade only once and all nodes and the server are upgraded. Btw, this whole thing can be done using an almost unmodified Gentoo Linux distribution.
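
As a rough back-of-the-envelope sketch of what "100 nodes on 100 Mbit Ethernet from a single server" means in the worst case, the snippet below splits the server's link evenly across nodes that are all reading at the same moment. The even split and the function name `per_node_share_mbit` are assumptions for illustration only; real NFS traffic is bursty and most nodes are idle on the network most of the time.

```python
def per_node_share_mbit(link_mbit: float, nodes: int) -> float:
    """Even split of the server's link when every node reads at once (assumed worst case)."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return link_mbit / nodes


# Figures from the comment above: 100 nodes behind one 100 Mbit/s server link.
share = per_node_share_mbit(100, 100)
# 1 Mbit/s is 125 kB/s (1,000,000 bits / 8 / 1000).
print(f"{share:.1f} Mbit/s per node, about {share * 125:.0f} kB/s")
```

So the setup only holds up if the nodes rarely hit the server at the same time, which is plausible once binaries and libraries are sitting in each node's page cache.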

I'm hoping to convince my boss to let us publish detailed docs. He thinks that if we do, everyone will be able to use it and he will lose sales (we are in the hardware business). Details at our homepage, and about an older version (but with more details) at the place where we used to work.
Related project: Loading disk images for clusters by angio · 2003-06-14 09:57 · Score: 4, Informative

This reminds me of a paper that was just presented at USENIX:
Fast, Scalable Disk Imaging with Frisbee. Fun talk.

Pretty cool tricks - they use multicast and filesystem-specific compression techniques to load images in parallel onto a subset of the disks in the cluster. Very very very fast. (I use the disk imaging part of their software to load images on my test machines at MIT, and I'm quite impressed.)
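
To see roughly why multicast matters here, this is a minimal idealized comparison, not Frisbee's actual protocol: with unicast the server sends one full copy of the image per node, while with multicast it sends one copy that every node receives. The image size, node count and link speed are made-up example values, and packet loss, retransmits and compression are ignored.

```python
def unicast_seconds(image_mb: float, nodes: int, link_mbit: float) -> float:
    """Server sends a separate full copy of the image to each node over its one link."""
    return nodes * image_mb * 8 / link_mbit


def multicast_seconds(image_mb: float, link_mbit: float) -> float:
    """Server sends the image once; all nodes listen (idealized, no loss or retransmits)."""
    return image_mb * 8 / link_mbit


# Hypothetical example: 1000 MB image, 50 nodes, 100 Mbit/s server link.
print(unicast_seconds(1000, 50, 100))   # 4000.0
print(multicast_seconds(1000, 100))     # 80.0
```

In this idealized model the multicast time does not depend on the number of nodes at all, which is the property that makes it attractive for loading many disks at once.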

Anyway, just a bit of related cool stuff.
Red Hat 7.3 by Spoticus · 2003-06-14 10:05 · Score: 2, Informative

RH 7.3 reaches its end of life in December of this year. One can only assume (and hope) that they have the in-house people to support it, or it's going to cost them beaucoup $$ for continued RHN support.
1. Re:Red Hat 7.3 by vondo · 2003-06-14 10:22 · Score: 2, Informative
  
  I'm sure they are firewalled/NATed off, so why would they need (or even want) to upgrade that often?
Another approach... by Junta · 2003-06-14 11:39 · Score: 2, Informative

If you want to scale more, and your nodes have tons of RAM, you could likely stuff the whole OS into a ramdisk and then use the local disk for scratch space. Once booted, the network impact of NFS goes away.

Of course, you could use System Installer Suite (http://www.sisuite.org/), which is *similar* to the rsync method mentioned by the other poster, but you get to skip the Red Hat install step in favor of SIS's tools.

--
XML is like violence. If it doesn't solve the problem, use more.
linuxbios, anyone? by nafrikhi · 2003-06-14 23:56 · Score: 2, Informative

Has anyone tried LinuxBIOS (http://www.linuxbios.org/) to replace the standard BIOS? It results in a diskless, faster boot. It is used in this cluster architecture: http://www.clustermatic.org/