Maintaining Large Linux Clusters
pompousjerk writes "A paper landed on arXiv.org on Friday titled Installing, Running and Maintaining Large Linux Clusters at CERN [PDF]. The paper discusses the management of the 1000+ Linux nodes, upgrading from Red Hat 6.1 to 7.3, securely installing over the network, and more. They're doing this in preparation for Large Hadron Collider-class computation."
My book on maintaining a cluster of 0-1 nodes will be out next month.
"try doing it with a windows cluster"
john
are you a Weapon of Male Destruction? then you need one of our sassy t-shirts
All I Want For Christmas Is My Constitutional Rights
I'm not interested till it can do Ogg Vorbis!!
G4 Hackintosh
"Why on earth would someone need a 1000+ node cluster?"
:-)
Look at Google?
Why on earth would someone need a 1000+ node cluster?
Maybe for a Large Hadron Collider-class computation.
Whenever the offence inspires less horror than the punishment, the rigour of penal law is obliged to give way...
Damn. Back when I was on a high-energy experiment located in the middle of nowhere in Japan (subject of at least two Slashdot articles), our Japanese colleagues used to lease gaggles of Sun workstations at a yearly maintenance cost that exceeded the retail value of the machines themselves!!
A few of us Linux fans used to grumble that we'd be better off buying dozens of cheap Linux boxes, but we weren't making the buying decisions. It seemed to us that the higher-ups didn't think cheap boxes with a free OS could compete on a performance basis with the Suns.
As for me? I just installed CERNlib on my laptop and laughed as it blew the Suns away on a price/performance (+portability) basis.
Just because you don't need it, or can't envision needing it, doesn't mean nobody else needs that kind of power.
Bob
fsck -u
Maybe if you wanted to watch Finding Nemo rendered in realtime instead of frame by frame? I don't know.
So yeah, I basically designed my own system for a professor in the Political Science Dept at my university, Washington University in St. Louis, that boots completely over the network and is completely diskless for every node. That was about a year before Knoppix ever started doing that. I did it with openMosix and it's fully LAM/MPI functional. Bruce of the openMosix list was on me for quite a while to get the docs done, but some really not cool domestic issues came up and I never got them done. If anyone is really interested, send an email to drtdiggers_DONT_SPAM_ME_BASTARDS_@_SUCKYMICROSOFT_ hotmail.com and let me know, and I'll finish them up.
(Disclaimer: IANAPP (Particle Physicist))
Hey, *somebody* has to back up the Internet from time to time!
:)
:)
...or to find Cheney.
Either that, or all the pr0n encoding.
Best...Tivo...*ever*!
Host this thing at an Internap location, and you're the Ultimate LPB.
Searching for "First Posters" for the Homeland Security people to "visit."
SETI client!
"Every room has every movie ever made in any language." Who do you think hosts *that*?
ILM, seeing the second LOTR movie, decides an 'upgrade' is in order for the SW:EP3 render farm.
It takes this much computing power to find WMD in Iraq.
MS compiling Longhorn builds.
Calculating the question for the answer 42.
BitTorrent!
Installing, Running and Maintaining Large Linux Clusters at CERN
Vladimir Bahyl, Benjamin Chardi, Jan van Eldik, Ulrich Fuchs, Thorsten Kleinwort, Martin Murth, Tim Smith
CERN, European Laboratory for Particle Physics, Geneva, Switzerland
Having built up Linux clusters to more than 1000 nodes over the past five years, we already have practical experience confronting some of the LHC scale computing challenges: scalability, automation, hardware diversity, security, and rolling OS upgrades. This paper describes the tools and processes we have implemented, working in close collaboration with the EDG project [1], especially with the WP4 subtask, to improve the manageability of our clusters, in particular in the areas of system installation, configuration, and monitoring.
In addition to the purely technical issues, providing shared interactive and batch services which can adapt to meet the diverse and changing requirements of our users is a significant challenge. We describe the developments and tuning that we have introduced on our LSF based systems to maximise both responsiveness to users and overall system utilisation.
Finally, this paper will describe the problems we are facing in enlarging our heterogeneous Linux clusters, the progress we have made in dealing with the current issues and the steps we are taking to 'gridify' the clusters.
1. INTRODUCTION
The LHC era is getting closer, and with it the challenge of installing, running and maintaining thousands of computers in the CERN Computer Centre. In preparation, we have streamlined our facilities by decommissioning most of the RISC hardware, and by merging the dedicated and slightly different experiment Linux clusters into two general purpose ones (one interactive, one batch), as reported at the last CHEP [2].
Quite some progress has been made since then in the automation and management of clusters. The EU DataGrid Project (EDG), and in particular the WP4 subtask [3], has entered its third and final year and we can already benefit from the software for farm management being delivered by them. See [4] for further details. In addition, the LHC Computing Grid project (LCG) [5] has been launched at CERN to build a practical Grid to address the computing needs of the LHC experiments, and to build up the combined LHC Tier 0/Tier 1 center at CERN.
In preparing for the LHC, we are already managing more than 1000 Linux nodes of diverse hardware types, the differences arising due to the iterative acquisition cycles. In dealing with this high number of nodes, and especially when upgrading from one release version of Linux to another, we have reached the limits of our old tools for installation and maintenance. Development of these tools started more than ten years ago with an initial focus on unifying the environment presented to both users and administrators across small scale RISC workstation clusters from different vendors, each of which used a different flavour of Unix [6]. These tools have now been replaced by new tools, taken either from Linux itself, like the installation tool Kickstart from RedHat Linux or the RPM package format, or rewritten using the perspective of the EDG and LCG, to address large scale farms using just one operating system: Linux.
This paper will describe these tools in more detail and their contribution to the progress in improving the installation and manageability of our clusters. In addition, we will describe improvements in the batch sharing and scheduling we have made through configuration of our batch scheduler, LSF from Platform Computing [7].
2. CURRENT STATE
In May last year, the Linux support Team at CERN certified RedHat Linux 7. This certification involved the porting of experiment, commercial and administration software to the new version and verifying their correct operation. After the certification, we set up test clusters for interactive and batch computing with this new OS. This certification process took quite some consid
Bush is on fire and it's not good for my lungs.
Shouldn't that be a Large Hard-on Collider-class computation?
I write better stuff than that... and I am a middle schooler, going into high school. Work on, I dunno, a plot?
Do you ever find yourself humming the MacGyver theme song? Then you, my friend, are a true nerd.
They must be trying to play Doom 3.
The global economy is a great thing until you feel it locally.
Just a little too late for the SETI@home project. Kind of a shame, really. If only we had those computers sooner...
[[ Just because you don't need it, or can't envision needing it, doesn't mean nobody else needs that kind of power. ]]
This is exactly why he asked the question, genius. He wanted to know who would need this type of computational power and if it was more cost/performance effective than just buying a supercomputer.
These typical Slashdot dumbshits need to get off their "I'm smarter than you" pedestals and realize that they're no better than anyone else. It would also be nice bonus if they learned how to read.
The rule of Marc: Whenever commenting on someone else's stupidity, you will always indirectly comment on your own.
Damn you, grammer!
I've been looking at ClusterKnoppix, mentioned recently on Slashdot. It has built-in openMosix and also supports thin clients via a terminal service. Just pop it in, and instant cluster. In case you missed the article:
ClusterKnoppix
Where I work, we are developing a clustering system using single system images, where the whole OS is stored on a server and is NFS-mounted by each node. Our current tests show that we can easily run 100 nodes on 100 Mbit Ethernet from a single server. And the coolest thing is that the nodes mount the / of the server, so for "small clusters" (under 100 nodes), we have to do a software upgrade only once and all nodes and the server are upgraded. By the way, this whole thing can be done using an almost unmodified Gentoo Linux distribution.
I'm hoping to convince my boss to let us publish detailed docs. He thinks that if we do, everyone will be able to use it and he will lose sales (we are in the hardware business). Details at our homepage and about an older version (but with more details) at the place where we used to work.
So that they can survive a slashdotting? ;)
Well, I recently interviewed at NVIDIA, and they have a 3,000+ machine cluster just for emulating the new graphics/IO chips they're working on. They don't manufacture anything, and the turnaround time to manufacture a prototype for testing would be too long, so all they do is simulate the actual chips and then send the data off for fabrication once they're done. On a cluster of 3,000 machines, some jobs take all weekend, from what I understand.
Imagine if they just used one machine.
This reminds me of a paper that was just presented at USENIX:
Fast, Scalable Disk Imaging with Frisbee. Fun talk.
Pretty cool tricks - they use multicast and filesystem-specific compression techniques to load disk images in parallel onto a subset of the disks in the cluster. Very very very fast. (I use the disk imaging part of their software to load images on my test machines at MIT, and I'm quite impressed).
Anyway, just a bit of related cool stuff.
Have they not taught you the meaning of the word "satire" in middle school yet? That is unfortunate, as I was taught what "satire" was at the tender age of five. Allow me to initiate your uninitiated minds:
Main Entry: sat·ire
Pronunciation: 'sa-"tIr
Function: noun
Etymology: Middle French or Latin; Middle French, from Latin satura, satira, perhaps from (lanx) satura dish of mixed ingredients, from feminine of satur well-fed; akin to Latin satis enough -- more at SAD
Date: 1501
1 : a literary work holding up human vices and follies to ridicule or scorn
2 : trenchant wit, irony, or sarcasm used to expose and discredit vice or folly
Pretty much everything that has to do with solving evolution equations for complex systems. Even weather forecasts require way more computing power than NASA's 96 node cluster can provide. Rocket science is not at all "rocket science".
"You can't allow somebody to commit the crime before you detain them." [Condoleezza Rice]
RH 7.3 reaches its end of life in December of this year. One can only assume (and hope) that they have the in-house people to support it, or it's going to cost them beaucoup $$ for continued RHN support.
Large Linux clusters maintain YOU!
Why on earth would someone need a 1000+ node cluster?
The Atlas Project at CERN, when it comes online, is supposed to produce a petabyte of data every year. I doubt one 1000 node cluster would be enough to process that data quickly.
First, as another poster pointed out, these detectors produce a LOT of data. I'm on an experiment slated to take data at about the same time as the LHC experiments, with similar rate requirements.
We plan to use a 2500 node cluster (of year 2007 CPUs) to filter our data in real time. The input rate into this cluster will be about 10 GB/s, output rate about 200 MB/s.
But, each interaction is analyzed (usually) by just one computer. There are so many interactions, though, that you need massive clusters, but not much communication between nodes of the cluster.
That's just for the data filter. You need even larger amounts of computing to analyze what comes out in that 200 MB/s and to simulate what happens in the experiment. Much larger amounts.
Our experiment will ultimately require clusters this size at the laboratory and at something like a dozen other institutions.
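For a rough feel of what those figures mean per machine, here is a minimal back-of-the-envelope sketch in Python. It only uses the numbers quoted above (2500 nodes, 10 GB/s in, 200 MB/s out); the decimal units and the even spread of load across nodes are assumptions, not something the poster stated.

```python
# Figures quoted in the comment above; decimal units (1 GB = 1000 MB)
# and a perfectly even load across nodes are assumptions.
NODES = 2500
INPUT_MB_PER_S = 10_000   # 10 GB/s into the filter cluster
OUTPUT_MB_PER_S = 200     # 200 MB/s out of it


def per_node_input(nodes: int, input_mb_per_s: float) -> float:
    """Average input rate each node has to absorb, in MB/s."""
    return input_mb_per_s / nodes


def reduction_factor(input_mb_per_s: float, output_mb_per_s: float) -> float:
    """How strongly the filter has to cut the data volume."""
    return input_mb_per_s / output_mb_per_s


if __name__ == "__main__":
    print(per_node_input(NODES, INPUT_MB_PER_S), "MB/s per node")
    print(reduction_factor(INPUT_MB_PER_S, OUTPUT_MB_PER_S), "x reduction")
```

Under those assumptions each node sees about 4 MB/s and the filter keeps roughly 1 part in 50 of the data. That fits the point that the work is limited by CPU per interaction rather than by communication between nodes.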
Gotta love Slashdot... the only place where such a disclaimer isn't taken for granted.
kind of misread the title... oops
What the hell are they studying???
So, to all those who are in the know out there... when they have what they want, how many nodes and individual machines could they maintain? What are the constraints? What about data back-ups? Is ephemeral data recorded on a few machines in separate nodes to make sure that one getting knocked out doesn't zap something for good?
Who in their right mind would have a cluster this size, for this sort of work, on any network where "securely installing over the network" is an issue? I mean, I'd want this as far off of a public network as possible, unless I really want to explain to whoever authorized my grant why my experimental data indicates that:
e = mc^31337
my sig's at the bottom of the page.
You might be able to analyze your data on a 0-node cluster if the Tevatron doesn't start working better soon...
-a mildly disgruntled CDF postdoc
Good explanation. The main reason LHC needs so much processing power is that the higher the energy, the more scattered particles ("jets") you have, and they all arrive instantly at the detectors, and LHC will have higher energies than its predecessors. The "size" of particles is meaningless, but the interesting events where a new particle can be detected are very rare. I can't remember the numbers anymore and would have to look them up. They are also working on custom hardware which will do some calculations before sending the data to the cluster.
Well, whatever that was supposed to be... I got it, but it lacked effectiveness. Sorry, but when you post a rebuttal, make sure that it's partially coherent, at least!
Is it that hard for you to admit that the US is falling behind the curve?
Apparently you didn't, because you didn't grasp any of my points whatsoever. A dense mind will never pick up on any cues...... no matter what..... and you, sir, appear to be quite dense.
CERN is only half in France, the other half being in Switzerland (not even in the European Union). But, being American, it must be hard for you to understand geography beyond your own backyard; my deepest regrets :-/
Well, Frenchie La Frencherson, last time I checked (right now as a matter of fact), Switzerland was located smack in the damn middle of Europe and the EU. How dumb do you think us Americans are?
...run Windows?
Analogies don't equal equalities, they are merely somewhat analogous.
You actually checked? hehe...
Pretty dumb, since you cannot grasp the fact that geographical location has nothing to do with membership in a political organisation. To make it less abstract, and hopefully easier for you to understand, think of how West Berlin was not part of the Soviet bloc, in spite of the fact that it was located smack in the middle of East Germany. But maybe I'm asking too much of you.
How dumb do you think us Americans are?
Two more answers like that and you'll make me believe that the US is inhabited by amoebas that can type on a keyboard.
Running rpm --rebuilddb must be a real drag.
Obvious troll. And not funny. CERN is in Switzerland.
IAAL
If you want to scale more, and your nodes have tons of RAM, you could likely stuff the whole OS into a ramdisk and then use the local disk for scratch space. Once booted, the network impact of NFS goes away.
Of course, you could use System Installer Suite (http://www.sisuite.org/), which is *similar* to the rsync method mentioned by the other poster, but you get to skip the Red Hat install step in favor of SIS's tools.
XML is like violence. If it doesn't solve the problem, use more.
NASA has the machine at number 18 in the Top500 list; it's got 384 nodes (1392 CPUs, 4 CPUs per node).
Funny coincidence, I was just reading an article about the planned CERN Large Hadron Collider which will be ready in 2007; it'll put out 1250 Mbyte/sec.
This is stored to tape though (about 50 StorageTek 9940B drives at 30 Mbyte/sec each, in parallel), not realtime.
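The tape figures can be sanity-checked with a few lines of Python. This is only a sketch using the numbers quoted above (1250 Mbyte/sec from the detector side, roughly 50 drives at 30 Mbyte/sec each); it assumes every drive streams at its full rate all the time, which real tape drives rarely manage.

```python
import math

# Numbers quoted in the comment above. Sustained full-speed streaming
# on every drive is an optimistic assumption.
DATA_MB_PER_S = 1250
DRIVE_MB_PER_S = 30
DRIVES = 50


def aggregate_rate(drives: int, drive_mb_per_s: float) -> float:
    """Combined write rate of all drives running in parallel, in MB/s."""
    return drives * drive_mb_per_s


def min_drives(data_mb_per_s: float, drive_mb_per_s: float) -> int:
    """Smallest number of drives that can keep up with the data rate."""
    return math.ceil(data_mb_per_s / drive_mb_per_s)


if __name__ == "__main__":
    print(aggregate_rate(DRIVES, DRIVE_MB_PER_S), "MB/s aggregate")
    print(min_drives(DATA_MB_PER_S, DRIVE_MB_PER_S), "drives needed at minimum")
```

So 50 drives give 1500 Mbyte/sec against the 1250 Mbyte/sec needed, and 42 drives would be the bare minimum, which leaves about 8 drives of headroom for maintenance, rewinds, and failures.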
I picked up on it, buts being a snob doesn't make you classy. Der!
I LOVE LINUX!!!!!!!!!!!!
I'm surprised that nobody has mentioned SystemImager. If you haven't looked at it for maintaining large numbers of Linux boxes, scamper off and take a look now. It is worth your time.
Now, that being said, I recently had the opportunity to evaluate using a number of OpenBSD boxes, but I couldn't find a utility for maintaining a bunch of the boxes in the same manner as SystemImager (i.e. Incrementally update servers from a golden master via rsync).
So, has anyone found anything that does what SystemImager does, but that is cross-platform? Do any SystemImager developers out there want to comment on the potential difficulty in supporting other-than-Linux operating systems in SystemImager?
SystemImager is one of the most useful tools I've ever seen; however, I believe that it would be an enterprise "killer app" if it could do MacOS X, *BSD, Windows etc.
-Peter
. Penguins Surely Ca
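For anyone who has not seen the "golden master via rsync" idea, here is a minimal sketch in Python of what such an incremental update could look like. It is not SystemImager's actual mechanism or command line. The image path, the node names, and the exclude list are hypothetical, and it only builds the rsync commands (with `--dry-run` on by default) rather than running them.

```python
import shlex

# Hypothetical values for illustration only; none come from SystemImager.
GOLDEN_IMAGE = "/var/lib/images/golden/"    # trailing slash: copy the contents
EXCLUDES = ("/proc/*", "/tmp/*")            # paths that must stay node-local


def rsync_command(node, image_dir=GOLDEN_IMAGE, excludes=EXCLUDES, dry_run=True):
    """Build an rsync command that makes `node` match the golden image.

    -a        preserve permissions, ownership, times, and symlinks
    --delete  remove files on the node that are not in the image
    --dry-run only report what would change
    """
    cmd = ["rsync", "-a", "--delete"]
    if dry_run:
        cmd.append("--dry-run")
    for pattern in excludes:
        cmd.append(f"--exclude={pattern}")
    cmd += [image_dir, f"root@{node}:/"]
    return cmd


if __name__ == "__main__":
    for node in ("node001", "node002"):      # hypothetical host names
        print(" ".join(shlex.quote(arg) for arg in rsync_command(node)))
```

Because rsync only transfers files that differ, repeated runs against an already up-to-date node move very little data, which is what makes this style of maintenance attractive for many identical machines.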
It's understandable why a person would have to check. It's akin to not being able to provide exact coordinates for a specific planet/asteroid orbiting a planet a million light years away: nobody cares, and by nobody, I mean those of us who matter, namely Americans.
Anyone else read it as "Large Hardon Collider"? I blew coffee threw my nose. Damn disexlia...
I need a cluster to do the rendering of my Blender animation experiments. My 100 frame movie at 640x480 takes several minutes to finish on a single 1.3 GHz box, especially when you like environment maps at high res (for mirrored surfaces).
Next Q: Why would anyone want to make 3D animations in Blender? A. Because I want to!
try { do() || do_not(); } catch (JediException err) { yoda(err); }
I am a high energy physicist.
You will need this much computing power if you are trying to filter and analyze on the order of a petabyte of data yearly. Some collisions at the LHC will produce 1000s of particles, a large fraction of which will be detected in multiple detectors as they fly away from the collision point (nucleus on nucleus collisions). Thousands of these collisions will happen every second. The information in the various detectors then must be collected back so that all the signals a particular particle made can be associated with each other. Then many graduate students must write code to search through all these particles for exciting physics. A lot of computing power is essential for exploiting the potential of the collider and detector.
Actually no. Part of CERN is in France, part is in Switzerland.
One application that benefits from adding nodes (with almost linear scaling in performance) is Monte Carlo radiation transport. For example, in medical physics people try to calculate the dose distribution in a human body for various configurations of treatment accelerators. Monte Carlo simulation software "generates" random initial particles (with appropriate probabilities for the given accelerator) and then tracks each particle as it propagates and interacts with surrounding tissue. Interactions are randomly generated (hence: Monte Carlo), but again the randomness is biased according to the appropriate physics. Each such "history" can be independently generated by a different node, thus making parallelization trivial.
In my lab I have assembled a 24-node cluster and it takes about 4-8 hr to calculate dose distributions for most cases. With a 1000-node cluster it would be possible to do this sort of calculation routinely in clinics during treatment planning and actual treatment. This would mean that cancer patients have improved survival odds due to more precise targeting of the tumors.
Cheers,
Beowulf's root
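To make the "independent histories" point concrete, here is a toy sketch in Python. It is not a real dose calculation: the physics is reduced to a single exponentially distributed interaction depth, and the mean free path, bin sizes, and function names are all made up for illustration. What it does show is the structure: each batch has its own random generator, batches never talk to each other, and the final result is just the sum of the per-batch tallies.

```python
import random

# Toy parameters, invented for illustration; not real physics data.
MEAN_FREE_PATH_CM = 5.0
BIN_WIDTH_CM = 1.0
N_BINS = 20                      # the last bin also collects anything deeper


def run_batch(seed, n_histories):
    """Simulate n_histories independent particles; return a depth tally."""
    rng = random.Random(seed)    # private generator, so batches are independent
    tally = [0] * N_BINS
    for _ in range(n_histories):
        depth = rng.expovariate(1.0 / MEAN_FREE_PATH_CM)
        tally[min(int(depth / BIN_WIDTH_CM), N_BINS - 1)] += 1
    return tally


def merge(tallies):
    """Combine per-batch tallies by summing each depth bin."""
    return [sum(column) for column in zip(*tallies)]


def simulate(total_histories, n_batches):
    """Split the histories into batches, run them, and merge the results.

    Each (seed, size) pair is an independent job. On a cluster every job
    could go to a different node; here they simply run one after another.
    """
    per_batch = total_histories // n_batches
    sizes = [per_batch] * n_batches
    sizes[-1] += total_histories - per_batch * n_batches
    return merge(map(run_batch, range(n_batches), sizes))


def ideal_hours(hours_on_small, small_nodes, big_nodes):
    """Run time on a bigger cluster if scaling were perfectly linear."""
    return hours_on_small * small_nodes / big_nodes


if __name__ == "__main__":
    print(simulate(100_000, 24))
    print(ideal_hours(4, 24, 1000), "to", ideal_hours(8, 24, 1000), "hours")
```

With perfectly linear scaling, the 4-8 hr quoted for 24 nodes would drop to roughly 6 to 12 minutes on 1000 nodes. Real jobs lose some of that to start-up and result collection, so treat it as an upper bound on the speed-up.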
How applicable is this to FreeBSD? Now that Linux is under this legal cloud of doom, I'm switching all my clusters over to FreeBSD.
Imagine a beow... oh wait...
From my experience, NFS is by far the worst choice in networked filesystems.
Since all the boxen are Linux, I strongly suggest Samba or NCP.
Think of it this way: In the old days we had
- NFS to share files with other old Unices (like SCO, Slowlaris, HPUX, AIX).
- SMB to share files with windowz
- NCP to share files with Novell network fs.
IMHO, NCP is the best. SMB is pretty good too.
Has anyone tried LinuxBIOS (http://www.linuxbios.org/) to replace the standard BIOS? It results in a diskless, faster boot, and is used in this cluster architecture: http://www.clustermatic.org/
i have a 1'000'000 node cluster! ...
...
it crawls around, eats flies and likes light
and sometimes it replicates in my
fruit-loops!
it can accurately predict ( >95% ) the weather
two days ahead!
too bad it doesn't have any interface
that is compatible with me
it's what we scientists call a "passive-cluster"!
There's always an anti-NFS troll out there just waiting to spout "the truth". Get over it. NFS works great, particularly when all you are running is Linux.
Hmm... Not quite there yet. The collection of command line tools could probably be rolled into something that automates system management the way SystemImager does. But even then, radmind rather unintelligently seems to recopy entire files.
Also, how is partitioning taken care of?
No, I'm still looking for something like SystemImager that handles multiple Operating Systems. Perhaps extending SystemImager to support others will be the easiest way.
As a side note, Frisbee, which was mentioned in a previous thread, is the killer app for LAN-based system imaging. Wow!
-Peter
. Penguins Surely Ca
Sorry, not a big SystemImager expert. I see that it just uses rsync, hence your comment about recopying entire files. I'd point out that for binary files, rsync tends to copy the entire file anyway, on a version change. radmind's nice in this case because it can tell that a file needs to be updated with no network traffic.
:w
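The "no network traffic" point can be illustrated with a small Python sketch: if a node already holds a manifest of expected checksums, it can decide locally which files are stale before contacting any server. This is only loosely in the spirit of what is described above. The manifest format (a plain dict of relative path to SHA-1 hex digest) and the function names are made up here and are not radmind's actual transcript format or API.

```python
import hashlib
from pathlib import Path


def sha1_of(path):
    """SHA-1 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha1()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_needing_update(root, manifest):
    """Return the relative paths whose local copy is missing or differs.

    `manifest` is a hypothetical dict {relative_path: expected_sha1}.
    Everything here is local work; only the returned files would then
    have to be fetched over the network.
    """
    stale = []
    for rel_path, expected in sorted(manifest.items()):
        target = Path(root) / rel_path
        if not target.is_file() or sha1_of(target) != expected:
            stale.append(rel_path)
    return stale
```

A node would call `files_needing_update("/", manifest)` and fetch only what comes back. The trade-off is that the node has to read and hash every listed file, so the saving is in network traffic, not in local disk I/O.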
how is partitioning taken care of
Depends on the system. For Mac OS X, we pretty much need to use Apple's tools. For Solaris, we use Jumpstart. Kickstart on Linux. Partitioning is very OS specific. radmind is very portable.
"I picked up on it, buts being a snob doesn't make you classy. Der!"
Hmmmmm