Maintaining Large Linux Clusters
pompousjerk writes "A paper landed on arXiv.org on Friday titled Installing, Running and Maintaining Large Linux Clusters at CERN [PDF]. The paper discusses the management of the 1000+ Linux nodes, upgrading from Red Hat 6.1 to 7.3, securely installing over the network, and more. They're doing this in preparation for Large Hadron Collider-class computation."
"Can you imagine a beowulf cluster of these?" .... RTFA.
..... ditto.
"But does it run Linux?"
This site is not running on a cluster with their configuration; it's been slashdotted already....
The lunatic is in my head
1. INTRODUCTION
The LHC era is getting closer, and with it the challenge
of installing, running and maintaining thousands of
computers in the CERN Computer Centre.
In preparation, we have streamlined our facilities by
decommissioning most of the RISC hardware, and by
merging the dedicated and slightly different experiment
Linux clusters into two general purpose ones (one
interactive, one batch), as reported at the last CHEP[2].
Considerable progress has been made since then in the
automation and management of clusters. The EU DataGrid
Project (EDG), and in particular the WP4 subtask[3], has
entered its third and final year and we can already benefit
from the software for farm management being delivered
by them. See [4] for further details. In addition, the LHC
Computing Grid project (LCG)[5] has been launched at
CERN to build a practical Grid to address the computing
needs of the LHC experiments, and to build up the
combined LHC Tier 0/Tier 1 center at CERN.
In preparing for the LHC, we are already managing
more than 1000 Linux nodes of diverse hardware types,
the differences arising due to the iterative acquisition
cycles. In dealing with this high number of nodes, and
especially when upgrading from one release version of
Linux to another, we have reached the limits of our old
tools for installation and maintenance. Development of
these tools started more than ten years ago with an initial
focus on unifying the environment presented to both users
and administrators across small scale RISC workstation
clusters from different vendors, each of which used a
different flavour of Unix[6]. These tools have now been
replaced by new tools, taken either from Linux itself, like
the installation tool Kickstart from RedHat Linux or the
RPM package format, or rewritten from the perspective of
the EDG and LCG, to address large scale farms using just
one operating system: Linux.
This paper will describe these tools in more detail, and
their contribution to the progress in improving the
installation and manageability of our clusters. In addition,
we will describe improvements in the batch sharing and
scheduling we have made through configuration of our
batch scheduler, LSF from Platform Computing[7].
2. CURRENT STATE
In May last year, the Linux Support Team at CERN
certified RedHat Linux 7.3. This certification involved the
porting of experiment, commercial and administration
software to the new version and verifying their correct
operation. After the certification, we set up test clusters for
interactive and batch computing with this new OS. This
certification process took considerable time,
both for the users and the experiments to prepare for
migration, which had to fit into their data challenges, and
for us to provide a fully tailored RedHat 7.3 environment
as the default in January this year. We took advantage of
this extended migration period to completely rewrite our
installation tools. As mentioned earlier, we have taken this
opportunity to migrate, wherever possible, to the use of
standard Linux tools, like the kickstart installation
mechanism from RedHat and the package manager RPM,
together with its package format, and to the tools that
were, and still are, being developed by the EDG project, in
particular by the WP4 subtask.
The EDG/WP4 tools for managing computing fabrics
can be divided into four parts: Installation, Configuration,
Monitoring, and Fault Tolerance. In adopting
these ideas and tools, we first had to review our whole
infrastructure with this division in mind.
2.1. Installation
The installation procedure is divided into two main
parts. The basic installation is done with the kickstart
mechanism from RedHat. This mechanism allows
specification of the main parameters like the partition table
CHEP03, La Jolla California, March 24
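The excerpt breaks off just as it reaches the kickstart parameters, so here is a rough sketch of what "specifying the partition table" could look like if the kickstart fragment were generated per hardware type. The Partition class, the render_partitioning helper and the sizes below are all made up for illustration and are not taken from the paper; only the clearpart and part directives (with --fstype, --size and --grow) are standard kickstart syntax.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    """One entry of a hypothetical per-node-type partition layout."""
    mountpoint: str                 # e.g. "/", "/boot", or "swap"
    size_mb: int                    # minimum size in megabytes
    fstype: Optional[str] = None    # e.g. "ext3"; omitted for swap
    grow: bool = False              # let the partition fill remaining space


def render_partitioning(partitions: List[Partition]) -> str:
    """Render the partitioning section of a kickstart file as text."""
    # Wipe any existing partition table before defining the new layout.
    lines = ["clearpart --all"]
    for p in partitions:
        fields = ["part", p.mountpoint]
        if p.fstype:
            fields += ["--fstype", p.fstype]
        fields += ["--size", str(p.size_mb)]
        if p.grow:
            fields.append("--grow")
        lines.append(" ".join(fields))
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative layout only; real sizes would depend on the hardware type.
    batch_node_layout = [
        Partition("/boot", 128, "ext3"),
        Partition("swap", 1024),
        Partition("/", 4096, "ext3", grow=True),
    ]
    print(render_partitioning(batch_node_layout))
```

With 1000+ nodes bought in several acquisition rounds, keeping one small layout description per hardware type and generating the kickstart text from it is one plausible way to avoid hand-editing files, though the paper may well do it differently.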
Imagine a beowulf cluster of those!
The global economy is a great thing until you feel it locally.