Slashdot Mirror


Maintaining Large Linux Clusters

pompousjerk writes "A paper landed on arXiv.org on Friday titled Installing, Running and Maintaining Large Linux Clusters at CERN [PDF]. The paper discusses the management of the 1000+ Linux nodes, upgrading from Red Hat 6.1 to 7.3, securely installing over the network, and more. They're doing this in preparation for Large Hadron Collider-class computation."

11 of 134 comments

  1. Lucky bastards by Professor D · · Score: 5, Interesting
    #include "back-in-my-day-rant"

    Damn. Back when I was on a high-energy experiment located in the middle-of-nowhere in Japan (subject of at least two Slashdot articles), our Japanese colleagues used to lease gaggles of Sun workstations at a yearly maintenance cost that exceeded the retail value of the machines themselves!!

    A few of us Linux fans used to grumble that we'd be better off buying dozens of cheap Linux boxes, but we weren't making the buying decisions. It seemed to us that the higher-ups didn't think cheap boxes with a free OS could compete on a performance basis with the Suns.

    As for me? I just installed CERNlib on my laptop and laughed as it blew the Suns away on a price/performance(+portability) basis.

  2. Re:"But why?" asked Little Johnny. by Bob Wehadababyitsabo · · Score: 5, Interesting
    Where I work, there is a 500 node Linux cluster for cladistic tree generation, which takes a lot of brute force and specialized tools to make happen. It is arguably much more complex than launching a rocket.

    Just because you don't need it, or can't envision needing it, doesn't mean nobody else needs that kind of power.

    Bob

    --
    fsck -u
  3. Autoassimilating Diskless Linux Clusters by Anonymous Coward · · Score: 3, Interesting

    So yeah, I basically designed my own system for a professor in the Political Science Dept at my university, Washington University in St. Louis, that boots entirely over the network and is completely diskless on every node. About a year before Knoppix ever started doing that. Did it with openMosix and it's fully LAM/MPI functional. Bruce of the openMosix list was on me for quite a while to get the docs done, but some really not cool domestic issues came up and I never got them done. If anyone is really interested, send an email to drtdiggers_DONT_SPAM_ME_BASTARDS_@_SUCKYMICROSOFT_ hotmail.com and let me know, I'll finish them up.

    1. Re:Autoassimilating Diskless Linux Clusters by Daengbo · · Score: 2, Interesting

      I don't claim to know more about your situation than you do, but several distros, including k12ltsp.org, support openMosix straight from the install, and work with either PXE (which you couldn't have used) or Etherboot. I'm not trying to change your mind. I'm just pointing out that there are a lot of folks who prefer and even swear by diskless clusters.

  4. ClusterKnoppix - OpenMosix by Anonymous Coward · · Score: 5, Interesting

    I've been looking at ClusterKnoppix, mentioned recently on Slashdot. It has built-in openMosix and also supports thin clients via a terminal service. Just pop it in, and instant cluster. In case you missed the article:

    ClusterKnoppix

  5. why such a huge cluster? by Anonymous Coward · · Score: 5, Interesting

    well, i recently interviewed at nvidia, and they have a 3,000+ node cluster just for emulating the new graphics/io chips they're working on... they don't manufacture anything, and the turnaround time to manufacture a prototype for testing would be too long... so all they do is simulate the actual chips and then send the data off for fabrication once they're done. on a cluster of 3,000 machines, some jobs take all weekend, from what i understand.

    imagine if they just used one machine.

  6. Re:"But why?" asked Little Johnny. by vondo · · Score: 5, Interesting
    Disclaimer: IAAPP (I am a particle physicist).

    First, as another poster pointed out, these detectors produce a LOT of data. I'm on an experiment slated to take data at about the same time as the LHC experiments, with similar rate requirements.

    We plan to use a 2500 node cluster (of year 2007 CPUs) to filter our data in real time. The input rate into this cluster will be about 10 GB/s, output rate about 200 MB/s.
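A quick back-of-the-envelope check of those figures, as a minimal sketch. It assumes decimal units (1 GB = 1000 MB) and an even spread of input across nodes, neither of which the post states:

```python
# Rough arithmetic for the filter farm quoted above.
# Assumptions (not from the post): 1 GB = 1000 MB, input spread evenly over nodes.
NODES = 2500
INPUT_MB_PER_S = 10 * 1000   # 10 GB/s into the cluster
OUTPUT_MB_PER_S = 200        # 200 MB/s out of the cluster

# Factor by which the filter shrinks the data stream
reduction_factor = INPUT_MB_PER_S / OUTPUT_MB_PER_S
# Average input each node has to ingest
input_per_node = INPUT_MB_PER_S / NODES

print(f"reduction factor: {reduction_factor:.0f}x")
print(f"input per node:   {input_per_node:.0f} MB/s")
```

So the filter keeps roughly one part in fifty, and each node only has to keep up with a few MB/s of input on average.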

    But, each interaction is analyzed (usually) by just one computer. There are so many interactions, though, that you need massive clusters, but not much communication between nodes of the cluster.
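The "one interaction, one computer" pattern can be sketched in a few lines. This is only an illustration: `passes_filter`, the threshold, and the toy event data are all made up, and a thread pool stands in for the cluster nodes:

```python
from concurrent.futures import ThreadPoolExecutor

def passes_filter(event):
    # Hypothetical selection: keep an event if its summed energy exceeds a threshold.
    # Each call looks at one event only, so workers never need to talk to each other.
    return sum(event) > 10.0

# Toy events: each one is a list of made-up energy deposits
events = [[1.0, 2.0], [8.0, 5.0], [0.5], [6.0, 6.0]]

# Farm the events out to independent workers; map() returns results in input order
with ThreadPoolExecutor(max_workers=4) as pool:
    decisions = list(pool.map(passes_filter, events))

kept = [event for event, keep in zip(events, decisions) if keep]
print(kept)
```

Because no event depends on any other, adding nodes scales the throughput without adding inter-node traffic, which is why a fast interconnect is not the priority here.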

    That's just for the data filter. You need even larger amounts of computing to analyze what comes out in that 200 MB/s and to simulate what happens in the experiment. Much larger amounts.

    Our experiment will ultimately require clusters this size at the laboratory and at something like a dozen other institutions.

  7. "securely installing over the network" by ameoba · · Score: 4, Interesting

    Who in their right mind would have a cluster this size, for this sort of work, on any network where "securely installing over the network" is an issue? I mean, I'd want this as far off of a public network as possible, unless I really want to explain to whoever authorized my grant why my experimental data indicates that:

    e = mc^31337

    --
    my sig's at the bottom of the page.
    1. Re:"securely installing over the network" by samhalliday · · Score: 3, Interesting

      if you read the paper (which, OK, is not as bad as not reading the article), you would realise that this is not a project which is being performed only at CERN; when LHC (and others, eg ALICE) become active in a few years, the data is going to be piped to literally hundreds of participating institutions (this is the current list for one of the smaller experiments) for data analysis. so, no, this is not enough processing power, and yes they need it to be publicly available. i also know people who are (or were?) working on the security implementations. believe me, at CERN, they think it through; it's run by lots of really smart people who know what they are at, not politicians. the distributed processing that comes out of these projects will hopefully pave the way forward for the next generation of the internet (the grid).

  8. SystemImager-like update mechanism for non-Linux? by pschmied · · Score: 5, Interesting

    I'm surprised that nobody has mentioned SystemImager. If you haven't looked at it for maintaining large numbers of Linux boxes, scamper off and take a look now. It is worth your time.

    Now, that being said, I recently had the opportunity to evaluate using a number of OpenBSD boxes, but I couldn't find a utility for maintaining a bunch of the boxes in the same manner as SystemImager (i.e. Incrementally update servers from a golden master via rsync).
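For readers who have not seen the golden-master approach, here is a minimal sketch of the idea using plain rsync flags (`-a`, `--delete`, `--exclude`, `--dry-run`). It is not SystemImager's own interface; the host name, image path, and exclude list are hypothetical, and the code only builds the command line rather than running it:

```python
import shlex

def rsync_pull_command(master, image_path, excludes, dry_run=True):
    """Build an rsync command that pulls a golden image onto the local root.

    Only files that differ from the master are transferred, which is what
    makes the update incremental.
    """
    cmd = ["rsync", "-a", "--delete"]   # archive mode; remove files absent from the master
    if dry_run:
        cmd.append("--dry-run")         # report what would change without touching anything
    for pattern in excludes:
        cmd.append(f"--exclude={pattern}")  # host-specific paths to leave alone
    cmd.append(f"{master}:{image_path.rstrip('/')}/")  # trailing slash: copy the contents
    cmd.append("/")
    return cmd

# Hypothetical master host and image location
cmd = rsync_pull_command(
    "golden.example.org",
    "/images/base",
    ["/etc/hostname", "/proc/*", "/sys/*"],
)
print(shlex.join(cmd))
```

The dry run is the default on purpose: a `--delete` sync against `/` is destructive, so the safe path is to inspect the report before dropping the flag.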

    So, has anyone found anything that does what SystemImager does, but that is cross-platform? Do any SystemImager developers out there want to comment on the potential difficulty in supporting other-than-Linux operating systems in SystemImager?

    SystemImager is one of the most useful tools I've ever seen; however, I believe that it would be an enterprise "killer app" if it could do MacOS X, *BSD, Windows, etc.

    -Peter

  9. Re:"But why?" asked Little Johnny. by Anonymous Coward · · Score: 1, Interesting

    I am a high energy physicist.
    You will need this much computing power if you are trying to filter and analyze on the order of a petabyte of data yearly. Some collisions at the LHC will produce 1000s of particles, a large fraction of which will be detected in multiple detectors as they fly away from the collision point (nucleus on nucleus collisions). Thousands of these collisions will happen every second. The information in the various detectors then must be collected back so that all the signals a particular particle made can be associated with each other. Then many graduate students must write code to search through all these particles for exciting physics. A lot of computing power is essential for exploiting the potential of the collider and detector.
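To put "a petabyte yearly" in perspective, here is a rough conversion to a sustained rate. It is a sketch that assumes a decimal petabyte (10^15 bytes) and data taken evenly over a 365-day year, which a real accelerator schedule would not match:

```python
# Average data rate implied by one petabyte per year.
# Assumptions (not from the post): 1 PB = 10**15 bytes, uniform running all year.
BYTES_PER_PETABYTE = 10**15
SECONDS_PER_YEAR = 365 * 24 * 3600

average_mb_per_s = BYTES_PER_PETABYTE / SECONDS_PER_YEAR / 10**6

print(f"average rate: {average_mb_per_s:.1f} MB/s")
```

That works out to a few tens of MB/s sustained around the clock, before any reprocessing or simulation is counted.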