Slashdot Mirror


SGI And /Massive/ Linux Machine

Thanks to some of the folks from SGI for sending us some information about their latest project. Pretty interesting project -- the largest configuration has 10 PCI busses (busi?) with 24 scsi controllers and 10 disks. And wait'll you see the rest of the stats.

Hi all,

Just thought I would send out a note outlining the state of the mips64 port. Ralf, Ulf and I have been actively working for the past few months to bring up Linux on the SGI ccNUMA machines.

The executive summary: we have achieved multiuser boot on o200 and o2000s. The largest configuration is a 32p, 16-node machine (only approx 4G worth of memory was populated over the 16 nodes; the system can take 4G * 16 nodes worth of memory). This machine has 10 PCI busses, with 24 scsi controllers and 10 disks. (Sample output is at OSS SGI.)

If you are interested in the system architecture and details of the port, read on. The o2000s use the R10000 series of MIPS processors. Each machine is made up of modules; each module has 4 node boards, with a maximum of 2 cpus and 4G of memory on each node, plus IO boards and routers. In a module, the two alternate node boards are each connected to a XBOW. Each XBOW may in turn be connected on the other side to a number of PCI busses, which is what the IO boards connect to. Apart from this, there are routers in the system that provide connection paths between all memory and all cpus, to create a true CC-NUMA architecture.
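The capacity figures quoted in this note follow from simple arithmetic. Here is a minimal Python sketch, assuming only the limits stated above (4 node boards per module, at most 2 cpus and 4G of memory per node). The function and constant names are illustrative and are not taken from the port.

```python
# Limits quoted in the note above; the names are illustrative only.
NODES_PER_MODULE = 4
MAX_CPUS_PER_NODE = 2
MAX_MEM_GB_PER_NODE = 4


def max_config(nodes):
    """Return (modules, cpus, memory_gb) for a fully populated system."""
    # Round up: a partly filled module is still a module.
    modules = -(-nodes // NODES_PER_MODULE)
    return modules, nodes * MAX_CPUS_PER_NODE, nodes * MAX_MEM_GB_PER_NODE


modules, cpus, mem_gb = max_config(16)
print(f"{modules} modules, {cpus} cpus, up to {mem_gb}G of memory")
```

For the 16-node machine this gives 4 modules, 32 cpus and a 64G memory ceiling, which matches the "32p" and "4G * 16 node" figures.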

On the software side, we are still struggling with compiler and binutils issues. The kernel itself is 64 bits, created by cross compiling on an ia32 box. We have not attempted 64 bit user program compilation or execution. The root disk is currently very close to the MIPS/Indy root disks. The architecture specific code uses the CONFIG_DISCONTIGMEM code to support memory on all nodes. The architecture specific NUMA features currently are: 1. replicate the kernel text on all nodes, so that no one node becomes a memory hot spot (unfortunately, the kernel data has to reside on only one node). 2. replicate low level exception handler code on all nodes. The architecture code also turns on CONFIG_NUMA to take advantage of node-local page allocations. (A CONFIG_NUMA patch that I have been submitting to Linus was put into the kernel in test6-pre1). For more information on NUMA and ongoing work, refer to this document.
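To illustrate what "node-local page allocation" means, here is a toy Python sketch. It is not the kernel's allocator: the fallback rule (try the nearest node by index when the local node is empty) and every name in it are assumptions made for the example.

```python
def alloc_page(free_pages, cpu_node):
    """Take one page, preferring the node the requesting cpu sits on.

    free_pages: list of free page counts, one entry per node (mutated).
    Returns the node the page came from.
    """
    # Assumption: "distance" is just the difference in node index.
    order = sorted(range(len(free_pages)), key=lambda n: (abs(n - cpu_node), n))
    for node in order:
        if free_pages[node] > 0:
            free_pages[node] -= 1
            return node
    raise MemoryError("no free pages on any node")


free = [1, 0, 2]
print(alloc_page(free, 0))  # local node has a page
print(alloc_page(free, 1))  # local node is empty, so fall back to a neighbour
```

The point of the local-first order is that a cpu touches memory on its own node whenever possible and only pays the remote-access penalty when it has to.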

The purpose of doing this port is to boot Linux on the bigger systems that we have, in order to do cpu/memory scalability studies. This also lets us do NUMA performance work in the future. Another advantage is being able to leverage this work on the upcoming SGI CC-NUMA Itanium boxes, which will be an SGI supported product. Initial results from scalability studies using mips64 are documented at the OSS SGI site.

Kanoj

13 of 72 comments

  1. Re:Perhaps by Chuck Chunder · · Score: 3

    If you look at the linked info you will see that:
    a) There are in fact 14 scsi devices attached. (13 drives and a cdrom).
    b) Even so only 4 of the 24 scsi hosts are actually used (So 20 scsi hosts are being 'wasted', not 10).

    Your initial question ('There isn't anything special about 10 drives, so why have 24 scsi buses?') was backwards. They are developing on a big-arse piece of machinery here. The point isn't making efficient use of 14 scsi devices, it's showing that Linux can run and access 24 scsi buses. Your question should probably have been 'If they want to really show that you can use 24 scsi hosts, shouldn't they have a shitload more drives?' Quite possibly, for a proper demonstration, but for a dev box scattering a few drives over a few hosts is probably satisfactory.

    --
    Boffoonery - downloadable Comedy Benefit for Bletchley Park
  2. Someone... by enneff · · Score: 5
    ...give this guy a fat ip pipe and a gnutella node! This machine has 10 PCI busses, with 24 scsi controllers and 10 disks. !!!!!


    nf

  3. I see why they ditched Cray by GrEp · · Score: 3

    You can see why they ditched Cray Supercomputers. They noticed that businesses want cheap processing power, and they don't care how they get it. If you want economical, "Beowulf" clusters are the way to go nowadays.

    I am surprised it has taken them this long to get some deals like this out the door.

    --

    bash-2.04$
    bash-2.04$yes "Don't you hate dialup connections?"| write USERNAME
    1. Re:I see why they ditched Cray by tolldog · · Score: 3

      NUMA != Beowulf

      NUMA is not even close to how Beowulf works. NUMA allows the procs to actually work together with shared memory, instead of the near-shared memory that Beowulf provides.

      Trust me, the Cray link is much more efficient than a fiber connection between Beowulf boxes would be, even if you went all out and did some sort of cube configuration.

      If Beowulf was better, people wouldn't be shelling out the money for the SGI boxes when they need the horsepower, they would have some "wulf" farm working on the problem.

      --
      -I just work here... how am I supposed to know?
    2. Re:I see why they ditched Cray by stripes · · Score: 4
      You can see why they ditched Cray Supercomputers.

      Um, their Cray division did a lot of the work for the O2000! In fact at release the >64CPU configs were only available from Cray. Oh, and the frame to frame comm channel? It's named the "CrayLink".

      Nice machines though, even if a bit long in the tooth (the O2000 is fourish years old, the O3000's should be announced anytime now, go look at comp.arch)

      If you want economical, "Beowulf" clusters are the way to go nowadays.

      Sure, if you need very little communication between machines Beowulf is great, and the O2000's expensive comms (the xbow and craylink) are wasted. If you need a lot of comm, but not a lot-lot of comm, an O2000 is great. If you need a lot-lot of comm maybe you are out of luck until the O3000, HP SuperDome, or IBM Power5 show up.

      Quick MP break down:

      • NORMA - NO Remote Memory Access - Beowulf is a NORMA; to get at memory on other systems you need to make OS calls, or at least use really expensive mmap'ed (NFS) files (i.e. non-local memory costs 1000x more than local memory to access).
      • NUMA - Non-Uniform Memory Access - remote memory costs maybe 10x more than local memory access, and caching has to be handled specially (i.e. each system has to know when to flush its cache "magically"). Very few examples; some IBM research machines do this.
      • ccNUMA - Cache Coherent Non-Uniform Memory Access. Remote memory access costs maybe 10x local, but the caches work. Most large multiprocessor machines work this way, for example the O2000 and the Sun E10000. The E10000 has much less than a 10x penalty for remote memory, but local memory costs more than the O2000's local accesses. On both, the OS can move pages from CPU board to CPU board depending on access. The E10000 comes closest to giving the impression that it is a UMA machine (and the O2000 isn't bad at it).
      • UMA - Uniform Memory Access. There is no remote memory, or no penalty for accessing it. A typical multiprocessor PC works this way. Typically easy to build for small numbers of CPUs, increasingly impossible for larger numbers (or useless - you could make a UMA for 1024 CPUs by having extra shitty memory access times, but there is no known way to make one with good access times, even if you had a literally unlimited budget!).

        A NORMA (or better) is great for raytracing, crypto cracking, and the like. A UMA is great for N-body simulation (with large N). I wouldn't want to track the flow of air molecules over a wing with a Beowulf, but I wouldn't want to pay for a ccNUMA if I was "just" running PR-Renderman.
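The rough penalties in the breakdown above can be turned into a back-of-the-envelope cost model. This Python sketch assumes the ballpark figures from the list (1000x for NORMA, 10x for NUMA and ccNUMA, no penalty for UMA) and a made-up fraction of accesses that go to remote memory; none of it is measured data.

```python
# Rough remote-access penalties from the breakdown above (local access = 1).
REMOTE_PENALTY = {"NORMA": 1000, "NUMA": 10, "ccNUMA": 10, "UMA": 1}


def average_cost(kind, remote_fraction):
    """Mean cost per access when remote_fraction of accesses go off-node."""
    return (1 - remote_fraction) + remote_fraction * REMOTE_PENALTY[kind]


# Assumed workload: 10% of memory accesses are remote.
for kind in REMOTE_PENALTY:
    print(f"{kind:7s} {average_cost(kind, 0.1):6.1f}")
```

With 10% remote accesses the NORMA case averages about 100x the cost of a local access, while ccNUMA stays under 2x. That is why a job with almost no remote traffic is happy on a Beowulf and a communication-heavy one is not.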

  4. sgi and linux by DarkClown · · Score: 3

    SGI really does seem to be going after Linux. I recently took an RHCE test in Dallas and out of 13 people in the class, 11 were from SGI. Kind of a trip.

  5. Re:Why? by stuce · · Score: 4

    SGI has already committed to producing huge
    NUMA servers based off the Itanium processors.
    Porting IRIX to this new architecture will be
    a huge undertaking as it has been tied to the
    MIPS architecture forever. Linux on the other
    hand ports quite easily. SGI is doing research
    as to what it would take to get Linux to
    run well on massive boxes like these.

    If linux can cut the mustard there will be
    no need to port IRIX and that will save SGI
    one huge headache.

  6. Re:Why? by Fluffy the Cat · · Score: 3

    For SGI, the incentive is pretty obvious. At the moment they produce machines with massive numbers of processors (we've a 256 node SGI here) and need an operating system to run on them. IRIX is massively better than Linux for this sort of thing at the moment, but using IRIX means that they have to deal with everything else associated with programming an OS rather than just the bits they're good at. By improving Linux sufficiently so that it has the same sort of level of performance as IRIX on massively parallel machines they can drop IRIX development and let someone else deal with most Linux bugs, saving themselves rather a lot of time and money in the process.

  7. A perfect machine for render land and other uses by tolldog · · Score: 5

    This is a great machine for rendering or any other application that is both CPU and memory bound.

    Some jobs do not parallelize well, such as individual frame rendering. With 24 boxes, the 5+ minute overhead of loading the scene file, plus the memory spent on loading the textures and the geometry, would be paid on each machine, costing you 24x the overhead of doing it on one machine. Trying to do this with a "quasi" shared memory system would kill the network, though it would remove that hideous overhead.
    Doing this on a NUMA box fixes all of those problems. The memory is shared. The procs all look like one machine. The system runs smooth and well.
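The 24x figure is plain multiplication. Here is a small Python sketch, assuming the 5 minute load time quoted above and treating it as exactly 5 minutes; the function name is made up for the example.

```python
def load_overhead_minutes(copies, load_minutes=5.0):
    """Machine-minutes spent loading the scene when `copies` separate loads are needed."""
    return copies * load_minutes


cluster = load_overhead_minutes(24)  # one load per box
shared = load_overhead_minutes(1)    # one load into shared memory
print(cluster, shared, cluster / shared)
```

The same multiplier applies to the memory holding textures and geometry: 24 private copies on the cluster against one shared copy on the NUMA box.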

    This is why SGI is still in the large graphics server environment. People want individual frames done fast.

    The benefit of this being a Linux box and not Irix....
    I, a huge linux vs. irix advocate, struggle to see why this would be good. Most of the apps that I would use are built for Irix first and then Linux (like Maya's renderer). I can see where others might have custom apps to use this, but the code would probably port to Irix just as easily as it would to Linux on the MIPS.

    It is a step in the right direction, IA64 NUMA boxes running linux. The ultimate in render farm machines.

    --
    -I just work here... how am I supposed to know?
  8. Discover... by pointwood · · Score: 4

    > Discovered 32 cpus on 16 nodes

    Why does my kernel not discover something like that? ;-)

  9. Re:Why? by NumberCruncher · · Score: 4
    The commercial benefit is several-fold:

    • Large memory/IO capable systems running a standard OS, with standard tools, and a well known ABI/API
    • Highly scalable and reconfigurable modular computing. If you need more power, add more C-Bricks. If you need more IO, add more P or X bricks.
    • Large application base: Linux has captured the mindshare of developers. Applications are being ported at a furious rate. It is becoming the dominant platform for software development (over Solaris and other similar Unices).

    There are many other reasons as well, but frankly this type of machine is what many people have been waiting for. The total cost of ownership of all those Sun machines is far larger than of this machine. The performance of this machine is significantly ahead of your typical Sun machine.

    One of the nicest features of this machine is that you can reconfigure it with a reboot (no recabling) to come up as a single large machine, as multiple medium machines, or as many single machines. You can configure the computer to your needs, not shoehorn the problem to fit within a Solaris box's limitations. And unlike on other OSes, the partitioning actually works here.

  10. Finally by joshv · · Score: 3

    I am glad to see some work being done on Linux to add real support for truly massively parallel systems. It has always been said that Linux does not scale well past a few processors (perhaps 4 at most) because modifying Linux to support systems with larger processor counts would hurt performance on low end hardware. Additionally, one can assume that the kernel developers in general don't have access to such massively parallel architectures.

    This little project hopefully will prove that it can be done, and one might hope its results will be applicable to less exotic multiprocessor hardware (say an 8 or 16 way x86 server).

    -josh

  11. Re:Little to do with Cray (Re:I see why they ditch by tolldog · · Score: 3

    I don't completely agree with that.

    Beowulf is not good for rendering. Each job can have up to 500-700 megs of memory being used. Share this over a 100bT or Fibre or some other network protocol. It won't work.

    We use other approaches for rendering. We spread the shot over a machine, not the frame. We eat the overhead of starting the renderer and reading the file. If possible, for those users who need one frame done fast, we threw it on our 4 proc O2000. That machine was taken from me, so now they just have to wait 4x longer.

    Beowulf has its uses. Production rendering is not really one of them.

    --
    -I just work here... how am I supposed to know?