Slashdot Mirror


Linux Supercomputer Wins Weather Bid

Greg Lindahl writes "The Forecast Systems Laboratory, a divison of NOAA, selected HPTi, a Linux cluster integrator, to provide a $15 million supercomputing system during the next 5 years. The computational core of this system is a cluster of Compaq Alphas running Linux, using Myrinet interconnect. Check outwww.hpti.com for information on the company. "

8 of 115 comments (clear)

  1. Why Alpha's? Screaming FP performance, that's why by Troy+Baer · · Score: 3

    I've installed Linux once on an Alpha box and the BIOS is truely impressive, much better than PCs. But what are some of the other reasons? Wider data/cpu buses? Larger memory configurations?

    The big thing about the Alpha for people like NOAA (who run big custom number-crunching apps written in FORTRAN) is its stellar FP performance. A 500MHz 21264 Alpha peaks at 1 GFLOPS and can sustain 25-40% of that, because of the memory bandwidth available. A Pentium III Xeon at the same clock rate peaks at 500MFLOPS and can sustain 20-30% of that.

    That doesn't fly for everybody, though. Where I work, we have a huge hodgepodge of message-passed, shared-memory, and vector scientific codes, plus needs for some canned applications that aren't available on the Alpha. We picked quad Xeons for our cluster and bought the Portland Group's compiler suite to try to get some extra performance out of the Intel chips.

    --Troy
    --
    "My life's work has been to prompt others... and be forgotten." --Cyrano de Bergerac
  2. Re:Why Alpha's??? by Panaflex · · Score: 3

    Oops!! Sorry! 10 times faster is a wrong. True specs are: (from www.spec.org)

    (UP2000 21264 667MHz -Alpha Processor Inc)
    53.7 SPECfp95
    32.1 SPECint95

    The P3 is

    (SE440BX2 MB/550MHz P3 -intel)
    15.1 SPECfp95
    22.3 SPECint95

    --
    I said no... but I missed and it came out yes.
  3. Go to this link by aheitner · · Score: 3

    General Processor Info.

    Compare the SPECfp scores of high-end Intel and Alpha offerings. Take a look at a 600MHz PIII Xeon and a 667MHz Alpha 21264.

    The reason to choose Alpha should be obvious.

  4. Re: Solving PDEs by coyote-san · · Score: 4

    I worked at FSL for several years, although on a different project. I knew people working on the weather models, and I took a class on parallel processing from the CU professor who shared the old Paragon supercomputer with NOAA. I even had an account on the Paragon briefly (for that class) after leaving NOAA.

    NOAA needs to solve partial differential equations (PDEs). A *lot* of PDEs. My class spent a lot of time on solving numerical methods, and my entire undergraduate class in the early 80's was covered in the first lecture of my graduate class a few years ago. My Palm Pilot, running multigrid analysis, could beat the pants off a Cray-XMP running the best known algorithm from 15 years ago.

    AI programs may not scale well, but the type of work done at NOAA *does*. Furthermore the hot topic a few years ago was applying some ideas from chaos theory to weather forecasts - take a dozen systems, insert just a little bit of noise into the initial data (essentially, instrument noise in your observations), then let them all run. If all models show the same weather phenonema, you can be pretty sure that it will occur. If the models show wildly different results (e.g., Hurricane Floyd slams into Key West in one run, but NYC in the other) you know that you can't make any firm predictions. As an educated layman's guess, I expect that the reason the hurricane forecasts are so much better than just a few years ago is precisely this type of variational analysis.

    --
    For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
  5. Re:Why Alpha's? Screaming FP performance, that's w by Troy+Baer · · Score: 3

    If the G4 can sustain >1gflops, then why not build a cluster of G4s running LinuxPPC?

    I'm not convinced the G4 can sustain 1 GFLOP/s in any kind of real calculation -- it simply doesn't have enough memory bandwidth. The G4 uses the standard PC100 memory bus, AFAIK. That's 64 bits wide running at 100MHz = 800MB/s peak. So without help from the caches, the absolute best you can do is on *any* PC100 based system is 200 MFLOP/s using 32-bit FP or 100 MFLOP/s using 64-bit FP. In practice you can only sustain about 300-350 MB/s out of the PC100 memory bus, so things get even worse. The caches will help quite a bit (maybe a factor of 2-4), but I have trouble imagining the G4 being able to sustain over 500 MFLOP/s even on something small like Linpack 100x100 because of the limited bandwidth and latency of the PC100 bus. Other processors that have similar peak FP ratings have much higher memory bandwidths; we've benchmarked an Alpha 21264 (1 GFLOP/s peak, ~400 MFLOP/s sustained) at about 1 GB/s memory bandwidth (that's measured, not peak), and a Cray T90 CPU (1.8 GLOP/s peak, ~700 MFLOP/s sustained) at 11-13 GB/s (again, measured not peak).

    There's also the question of compilers. You have to have a compiler that recognizes vectorizable loops and generates the appropriate machine code to use the vector unit. Unless Motorola's feeling *really* magnanimous, I don't see that kind of technology making it into gcc (and g77, more importantly for scientific codes) any time soon. Otherwise, you're at the mercy of a commercial Fortran compiler vendor like Portland Group or Absoft. PGI hasn't shown any interest in PowerPC to this point, and Absoft currently does PPC compilers only for MacOS 8, not OSX or LinuxPPC.

    I'd love to be proven wrong on this, but based on my experience I don't see how you could do it.

    --Troy
    --
    "My life's work has been to prompt others... and be forgotten." --Cyrano de Bergerac
  6. The right tool for the right task ... by LL · · Score: 3

    Buying the hardware is only 15-30% of the total cost. Also, in a production environment, you should not be fixated by the CPU. The question should be, within the capital budget, what is the best combination of resources that maximises the effectiveness of achieving your mission.

    To give you some real-world experience, a group I'm working with is looking at continential-scale simulation at a 5km resolution with the aim of going down to 100m. Now despite what most people think, the bottleneck (in this example) is in fact the I/O, with estimated total requirements of 30 TBytes. Doing the sums show that to keep up with the CPU (say hypothetically 1 run/24 hours), you would need average throughput of 350 MByte/sec. Hardware that supports both this volume and capacity is NOT cheap. We would joke that we paid x million for the I/O and SGI would throw in the Cray for free :-).

    Now as for how an Alpha cluster could be used, it would fit very nicely into the dedicated batch box category. It has a very high CPU rate and some decent compiler optimisation. As such it would augment whatever existing environment exists, reducing the workload of the more expensive machines for development which generally have better tools (just you try debugging a multi-gigabyte core dump). The biggest problem nowadays is not the algorithms, but managing the data traffic to the CPUs and this is where Linux clusters are weak with relatively slow interconnects, unbalanced memory hierarchies, and cheaper but higher latency memory. You have to accept the disadvantages and shift jobs which are not suited for this architecture off. A bit of smarts goes a long way in stretching the budget.

    LL

  7. Re:will it live up to expectations? by quade]CnM[ · · Score: 3

    This is not true of Masively Parallel Systems such as boewulf. The problem with Linux and scalibility is more of a hardware problem then a software problem. While you aren't going to put Linux on a Sun E10k anytime soon, it was never ment to be on such a large SMP machine. The Intel SMP artecture is flawed in design. All processors share the same buss. Therefore if one processor can sustain 300M/sec of transfer, and you have 4 processors. That 800M/sec buss is going to slow down. now your processors are only 2/3 as efficient as they are in a single system. But you are probably going to be slower then this because most RAM has a sustained transfer rate of only 150-200M/sec. so you only use 1/2 of the processor.


    To fix this, you use 2 processor busses, and 2 memory busses. you fill these up, and you get 4 processor busses and 4 memory busses. now you need to connect these buss segments. You have several options. First, connect them within the same machine. This is what NUMA is. the other route is to put each bus in a seperate machine, each machine running a copy of the kernel localy, and connect each box together with a fast network. This is what boewulf is.

    To give you an example. think of a highway system. If you have a lot of traffic switching lanes(busses) constantently, then it would be best to build one big 20 lane highway(NUMA). but if all the traffic basicaly keeps in its own lane, without much need to switch lanes(Inter Process Comunication) then it may be more economical to build 10 2lane highways(boewulf).


    Infact ins't a cray T3E more of a boewulf type cluster of closly nit machines then a NUMA. I think each node on a T3E runs a local copy of the micro-kernel.

  8. Re:will it live up to expectations? by Chalst · · Score: 3

    When you talk of linux's problems with mulitple processors, I think that you are referring to its limited SMP capacity.

    SMP (Symmetric Multi- Processing) is fundamentally different to clustering, as all of the processors in an SMP configuration share the same memory bus, whilst in a cluster the machine architectures are distinct, and we use a high-speed network to exploit parallelism.

    See the Linux Parallel Processing HOWTO for more information.