Slashdot Mirror


How to get 1.5 TeraFlops from Linux

Oak Ridge National Lab has purchased from SGI an Altix 3000 (flash movie). This article claims that: SGI Altix 3000 is recognized as the first Linux cluster that scales up to 64 processors within each node and the first cluster ever to allow global shared memory access across nodes. There is more here, here, and here.

27 of 280 comments

  1. Better than Beowulf for normal use... by TWX · · Score: 5, Informative

    You're better off using mosix. It'll allow for more normal (i.e., not beowulf-specific) applications to thread across computers. I'd imagine that an open-mosix setup (like the ones using the knoppix boot CDs tailored to it) could probably make for a fairly powerful computing cluster very easily.

    --
    Do not look into laser with remaining eye.
    1. Re:Better than Beowulf for normal use... by yuvtob · · Score: 3, Informative

      While you are probably right that for most cases mosix will do just fine (I used it for a ~50 PC cluster at nights for DSP calcs), these machines are for super-computer calculations that require a lot of memory. Even if you could run a 2GB process on mosix, it would be slowed down by the network, and these beasts can run 100GB processes over a 2GB/s interconnect!

    2. Re:Better than Beowulf for normal use... by ERJ · · Score: 5, Informative

      Mosix is nice, because it treats the cluster like a single, large, multi-cpu box by simply allocating threads to different boxes. The nice thing about this is that any multi-threaded program can take advantage (as stated in the parent post).

      However, this also can cause problems. Most threaded programs are written assuming that all the threads have high speed (i.e. system bus / cpu cache) access to shared information. When we introduce the latency incurred by a network, this can cause programs to run a lot slower than they would if they simply had all the threads on a single box. Obviously, it all depends on how the program was written, and what it does.

      If you are writing a program specifically for a cluster, I would suggest instead looking at something like LAM-MPI. This allows for a much more controlled approach to be taken. It is more work (you have to decide how the work will be split) but it allows for much better control of where and what is being done and how to optimize it.

    3. Re:Better than Beowulf for normal use... by battjt · · Score: 2, Informative

      Threads can't be migrated. Only processes can be migrated.

      http://howto.ipng.be/openMosixWiki/index.php/Applications%20using%20pthreads

      You have to write your application as a bunch of processes to take advantage of a mosix cluster.

      Joe

      --
      Joe Batt Solid Design
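
To illustrate the "bunch of processes" point above, here is a minimal sketch in Python that splits a job across separate OS processes instead of threads. The names `WORKER_SOURCE`, `split_range` and `parallel_sum_of_squares` are made up for this example, and it has not been run on a mosix cluster. Whether the kernel would actually migrate any of these workers depends on the cluster setup and on how long they run.

```python
import subprocess
import sys

# Hypothetical worker program: sums the squares of the integers in [start, stop).
WORKER_SOURCE = """
import sys
start, stop = int(sys.argv[1]), int(sys.argv[2])
print(sum(i * i for i in range(start, stop)))
"""


def split_range(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) chunks."""
    step, extra = divmod(n, parts)
    chunks, start = [], 0
    for k in range(parts):
        # The first `extra` chunks take one additional element each.
        stop = start + step + (1 if k < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def parallel_sum_of_squares(n, workers=4):
    """Start one independent process per chunk, then add up the partial sums."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE, str(a), str(b)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for a, b in split_range(n, workers)
    ]
    total = 0
    for p in procs:
        out, _ = p.communicate()  # wait for the worker and read its result
        if p.returncode != 0:
            raise RuntimeError("worker process failed")
        total += int(out)
    return total
```

Each worker is a full process with its own address space, so nothing here relies on shared memory between the pieces of work.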
  2. Re:Beowulf cluster jokes... by SkArcher · · Score: 2, Informative

    here and here are probably good places to look.

    --

    An infinite number of monkeys will eventually come up with the complete works of /.
  3. Re:Beowulf cluster jokes... by gladbach · · Score: 5, Informative

    just download clusterknoppix and knock yourself out. ; )

    http://bofh.be/clusterknoppix/

    --
    "Computer games don't affect kids; I mean if Pac-Man affected us as kids, we'd all be running around in darkened rooms,
  4. Re:Beowulf cluster jokes... by The_ForeignEye · · Score: 5, Informative

    Back in my days of parallel programming (read: 1998) on Beowulf clusters I used Fortran and C. The trick to make your program "parallel" is to use special programming libraries that will spawn instances of your program across the cluster and let them communicate between each other. The libraries I used were PVM and MPI.

    At that time they were working on a Java implementation, but I don't know what happened with that.

  5. Yanking from my journal entry of 6/30/03 by anzha · · Score: 4, Informative

    HPC Wire had an article that I referenced in my journal on 6/30.

    It's an interesting machine. I'd love to get one to play with. I'm sure our benchmarkers will have some even more interesting comments once they're done. Expect teething problems, folks. Systems of this size and complexity take time to break in.

    --
    Do you know why the road less traveled by is littered with the bones of the unwary?
  6. Oops (RTFA) by Anonymous Coward · · Score: 5, Informative

    The machine has 256 processors for 1.5 teraflops, not 64.

    1. Re:Oops (RTFA) by Anonymous Coward · · Score: 1, Informative

      The 64 is how many CPUs it can use in a single node...RTFA =)
      It has 4 nodes...4 x 64 = 256
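
The arithmetic in the two posts above can be checked in a few lines. This sketch uses only the figures quoted in this thread (4 nodes, 64 CPUs per node, 1.5 teraflops) and assumes the 1.5 figure is an aggregate for the whole machine. The thread does not say whether it is a peak or a measured number.

```python
NODES = 4              # from the reply above
CPUS_PER_NODE = 64     # per-node limit quoted in the article summary
TOTAL_FLOPS = 1.5e12   # 1.5 teraflops, assumed to cover the whole machine

total_cpus = NODES * CPUS_PER_NODE
per_cpu_gflops = TOTAL_FLOPS / total_cpus / 1e9

print(total_cpus, round(per_cpu_gflops, 2))  # prints: 256 5.86
```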

  7. Re:Hey, at least it's not running IRIX by Chicane-UK · · Score: 4, Informative

    Um..

    I always liked Irix, and everyone I ever talked to who used Irix liked it. The GUI is about 500x more usable than the horrors of OpenWindows or CDE on Solaris.. bleugh.

    --
    "Hey! Unless this is a nude love-in, get the hell off my property!!"
  8. Re:Beowulf cluster jokes... by 3141 · · Score: 2, Informative

    Another poster mentioned MOSIX, but openMosix is probably a better bet. It's released under the GPL, and is a combination of kernel-patch and user-space tools. Once you get these installed on each node, and connected via ethernet (all with networking set up of course... IP addresses etc) you should have yourself a cluster.

  9. Re:64 processors = 1.5 Cells by AmishSlayer · · Score: 2, Informative

    What I find amazing is that the Cell is supposed to run up to a TeraFlop when it reaches production. That compared to a 64 processor Linux cluster.

    That's 64 processors per node.

  10. Setting one up now by jimshep · · Score: 5, Informative

    We just got ours installed yesterday. I'm still installing software and am starting benchmarks. It's only the deskside version (12 cpus, 24GB RAM, 1TB disk), but still more powerful than the 4-cpu SGI Origins that we have been using.

    It is the first one that the regional SGI reps had actually installed, but since it is almost exactly the same as the MIPS-based Origin 3000 servers (with the exception of the obviously different Itanium 2 cpus and supporting chipsets), they ran into almost no problems getting it online. I have also been surprised as to how many commercial codes have already been ported to the platform.

    The main reasons we purchased this machine are the ease in parallelizing code and the floating point performance of the Itanium 2 cpus. We're computational materials engineers, and the less time we have to spend optimizing codes so that the nodes of a cluster are always kept busy and minimizing I/O bottlenecks, the more time we have to concentrate on the theoretical issues.

    It runs RedHat 7.2 with some tweaks by SGI called SGI ProPack. The Propack modifications come on separate CDs, with the proprietary software on separate CDs from the open source software. So far, from the command line, everything works just like my PC. It's kind of strange running Linux on a >$100K machine, but it sure beats dealing with the annoying differences between IRIX and Linux. Now to see if it performs as well as we expect...

  11. Re:kernel sources? by Anonymous Coward · · Score: 2, Informative

    I'm 100% sure it's very much a hardware thing. SGI has a long history of building very large hardware shared memory machines (e.g., the Origin line - O2000 and O3000) based on proprietary MIPS processors. They still make those machines, but market pressures forced them to also develop and sell Intel-based shared memory machines. I'll be curious to see how much of SGI's extensive work on IRIX to let it scale to 1000's of processors efficiently will bubble out to their Linux systems.

  12. Re:Beowulf cluster jokes... by RussianBeard · · Score: 2, Informative

    Take a look at OSCAR. We built a nine node cluster out of IBM e-servers using it. It was really quite straightforward.

    As far as languages go, you'll need an MPI library (like MPICH, or LAM/MPI, which is also a runtime environment), but the actual code used is usually C, C++, or Fortran. BTW, OSCAR comes with MPICH and LAM/MPI.

  13. Or, Try Quantix, which comes with some apps by coyote1 · · Score: 2, Informative

    or, try Quantix, which is derived from cluster knoppix. A self-booting ISO with data analysis software, based on Knoppix. This is geared more for scientific apps; it doesn't come with open office, etc, which cluster knoppix does.

    --
    Eat Lamb, 1 million coyotes can't be wrong
  14. Mosix... by wowbagger · · Score: 2, Informative

    The thing about Mosix is the costs of process migration.

    First, you have to understand process migration. In a mosix cluster, a running process can be moved, lock stock and barrel, from one CPU to another. All that is left behind is a "stub" process that forwards all file I/O across the network to the new location. So, if the program was a 3D raytracer that had the source description file and the output file open, after migration all file accesses to those files would be forwarded over the network to the stub (since you cannot guarantee that the remote machine can access those files in the same way.)

    Now, this is great for programs that do little file I/O but lots of computing (for example the ray tracer I just described.)

    However, the process must be set up on the local node first, then migrated. If the process has a 3 G core image (is taking up 3G of memory), then 3G of stuff has to be shoved across the wire, while the program is frozen. Thus, migrating a process is expensive.

    Now, if you have a bunch of long-running compute bound processes this is a net win (for example, rendering a movie might benefit). But something like building the Linux kernel won't benefit, since what you have is a bunch of short running, high I/O jobs.

    We have a Mosix cluster at work. I tried using it as a compile farm, and the results were disappointing. Not surprising - I was NOT using it for what it was designed for.

    However, if we can ever get the FPGA synthesis tools running natively under Linux, the hardware types are going to be quite happy....
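
The migration cost described above can be put into rough numbers. This is a back-of-the-envelope sketch rather than anything taken from Mosix itself: the function names are hypothetical, "3G" is taken as 3e9 bytes, and the link is assumed to be Gigabit Ethernet at a nominal 125 MB/sec. It ignores latency, protocol overhead, and the cost of forwarding file I/O back to the stub afterwards.

```python
def migration_seconds(image_bytes, link_bytes_per_sec):
    """Time the process stays frozen while its memory image crosses the wire."""
    return image_bytes / link_bytes_per_sec


def worth_migrating(image_bytes, link_bytes_per_sec,
                    remaining_local_sec, remaining_remote_sec):
    """True if transfer time plus the remote run still beats staying on the local node."""
    transfer = migration_seconds(image_bytes, link_bytes_per_sec)
    return transfer + remaining_remote_sec < remaining_local_sec


# A 3e9-byte image over an assumed 125e6 bytes/sec link.
print(migration_seconds(3e9, 125e6))  # prints: 24.0
```

With these assumptions a long render that would finish an hour sooner on an idle node easily absorbs a 24 second freeze, while a compile step that only runs for a few seconds never does.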

  15. Re:lites by Leebert · · Score: 4, Informative

    the billion dollar machine

    What the hell kind of Origin 3800 do YOU have? ISTR ours (512-proc) was roughly $10M.

  16. Re:Beowulf cluster jokes... by oudzeeman · · Score: 2, Informative

    This SGI isn't a beowulf cluster. Traditionally, "beowulf cluster" refers to clusters that use COTS hardware, don't have global shared memory, etc. Lots of people in the cluster community won't even call clusters of workstations beowulf clusters if they have some high speed network like Myrinet. We just call ours a Linux cluster, a cluster, a distributed memory supercomputer...

    You can program your beowulf cluster in C or Fortran using a free MPI (Message Passing Interface) implementation called MPICH. I have even seen a scaled-down version of MPI for Python (which requires MPICH to use). So start learning MPI. MPI-1 has 129 functions, but you can write most programs using a small subset of these calls. If you don't want to pay much money I suggest using C, because g77 sucks and there are no free Fortran 90 compilers. We use the Portland Group Fortran and C compilers as well as the Intel Fortran Compiler. I think we are going to switch completely to Intel Fortran and C.

    Why do you want to use a beowulf cluster if you have no clue about them or parallel programming in general? Just because they are 'cool'? A beowulf cluster is very useful for modeling or data mining, but unless you are running models that take days/weeks/months on your workstation you won't need the processing power of a cluster.

    Right now we have someone running a model on 76 processors that takes about 9 hours to finish a 1 year cycle in the model. They want to run the model for a total of 50 years. This is a model of the Pacific Ocean where they introduce carbon into the ocean, and then they see what effect that has on temperature change, etc. After they get their 50 year results for the Pacific they want to do a global simulation. This is the real use of beowulf clusters. They aren't for load balancing web servers, playing Quake, or any of the other things people post about every time there is an article about supercomputers/beowulf clusters.

    The speed up you will get really depends on your application. The more communication is necessary, the smaller the speed up will be. If you have a 5 node cluster, with 2 processors per node, the theoretical maximum speed-up is 10, but you will never achieve that because of parallel overhead (MPI calls, communication time, etc). If you want more information on parallel programming and cluster computing, send me a private message telling me what you hope to do with your cluster.
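
The speed-up point above can be made concrete with a toy model. The sketch assumes the work divides evenly across processors and that all parallel overhead (MPI calls, communication) can be lumped into one fixed cost. Real codes are messier, and the names `speedup` and `total_run_days` are made up for this example.

```python
def speedup(serial_sec, procs, overhead_sec=0.0):
    """Speed-up over one processor, given a fixed parallel overhead per run."""
    parallel_sec = serial_sec / procs + overhead_sec
    return serial_sec / parallel_sec


def total_run_days(hours_per_model_year, model_years):
    """Wall-clock days needed for a multi-year model run."""
    return hours_per_model_year * model_years / 24


# 5 nodes x 2 processors: the ideal case, then with some assumed overhead.
print(speedup(100.0, 10))        # prints: 10.0
print(speedup(100.0, 10, 2.5))   # prints: 8.0

# The ocean model above: 9 hours per model year, 50 model years.
print(total_run_days(9, 50))     # prints: 18.75
```

Even a small fixed overhead pulls the result well below the theoretical 10, and the gap widens as more processors shrink the compute term.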

  17. Re:lites by green+pizza · · Score: 3, Informative

    SGI Origin 3800 cluster

    Just to nitpick... most Origins are not clusters but rather one large single machine. It is possible to partition the machine in firmware and have each partition talk to others over the existing (and now unused) numalink interconnects... but it's much faster (even for plain MPI code) to just run the beast as one large single machine.

  18. Re:distributed shared memory by green+pizza · · Score: 2, Informative

    You can find a list here. For most computations and most hardware, you are probably still better off with MPI or PVM rather than shared memory.

    Note also that there are several high speed interconnects for Linux clusters available from many different vendors, including InfiniBand, Gigabit Ethernet, FireWire, and Myrinet.


    SGI systems (Origin and Altix) have massive interconnects that hold together the single-system architecture. They're fast for shmem-type shared memory apps, but also for MPI. In fact, SGI keeps tweaking their MPI implementation with every release of IRIX and the Linux ProPack, even though MPI is not the "best" way to run apps on their systems.

    The interconnects in most Origins and Altix systems are 3.2 gigaBYTE per second with extremely low latency. I don't know about InfiniBand, but I do know that GigE is only 125 MB/sec with really high latency... FireWire 800 is 100 MB/sec with better latency... and I think the best version of Myrinet is 500 MB/sec (4 gigabit) with about 5x the latency of SGI's 'numalink'.

    The smaller Altix systems (and supposedly, future Altix and Origin systems this fall) can be double cabled or can run at a higher speed... for 6.4 gbyte/sec per interconnect.

    Also, the Altix can handle up to 64 processors per single machine / single node (or 128 with a very beta set of patches). The cluster in the article is actually four Altix systems, each with 64 processors. The Origin 3800/3900 can handle 512 processors per node (or 1024 with a special "XXL" IRIX kernel).

    Great stuff for I/O intensive tasks, but massive overkill for 3d rendering or calculating pi.
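
The bandwidth figures quoted above follow from dividing the nominal bit rate by 8. This short sketch redoes the conversion and compares each link with the 3.2 gigabyte/sec numalink figure. It uses nominal line rates only, so protocol overhead and latency (which the post stresses) are not modelled, and the helper names are hypothetical.

```python
def mbit_to_mbyte_per_sec(mbit_per_sec):
    """Convert a nominal line rate in megabits/sec to megabytes/sec."""
    return mbit_per_sec / 8


# Nominal rates taken from the post above.
LINKS_MB_PER_SEC = {
    "Gigabit Ethernet": mbit_to_mbyte_per_sec(1000),   # 125.0
    "FireWire 800": mbit_to_mbyte_per_sec(800),        # 100.0
    "Myrinet (4 gigabit)": mbit_to_mbyte_per_sec(4000),  # 500.0
    "numalink": 3200.0,                                # quoted directly in bytes
}


def numalink_advantage(link_name):
    """How many times more raw bandwidth numalink has than the named link."""
    return LINKS_MB_PER_SEC["numalink"] / LINKS_MB_PER_SEC[link_name]


for name in ("Gigabit Ethernet", "FireWire 800", "Myrinet (4 gigabit)"):
    print(name, numalink_advantage(name))
```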

  19. Yep. Here. by Anonymous Coward · · Score: 1, Informative
  20. Re:lites by Anonymous Coward · · Score: 3, Informative

    The machine has 1024 procs

    There are two 1,024-processor Origin 3000's in the world. One is in Eagan, Minnesota. The other is at NASA. The NASA machine is called chapman. It has 256 GB of RAM. Not terabytes.

    How do I know this? Because I'm sitting here looking at lomax right now.

    You're a... whaddya call it. Liar.

  21. How to get 2+ TeraFlops from Linux by tobiashm · · Score: 2, Informative

    This does not seem to have been mentioned before:
    Niflheim at Danish University of Technology

  22. Re:lites by CoolVibe · · Score: 2, Informative

    Oh, I found a little page on the sara website where it is clarified (can't get onto the intranet anymore, else I'd have mirrored some better specs). Anyway, more about TERAS here.