Slashdot Mirror


Japan's Newest Linux Supercluster: 13TB RAM

green pizza writes "Following its sale of a 10240 processor cluster to NASA, Silicon Graphics Inc has announced that it's supplying a 2048 processor Altix 3700 Bx2 to the Japan Atomic Energy Research Institute. Aside from running Linux on Itanium2 processors, the beast also features 13 TB of RAM!"

14 of 163 comments

  1. Re:so ? by szo · · Score: 3, Informative

    It's 13TB, not 3TB. According to the article: "over 13 terabytes of memory - the world's largest memory capacity"

    Szo

    --
    Red Leader Standing By!
  2. Nuclear research by Big+Nothing · · Score: 5, Informative

    The puter will be used for nuclear research (bushspeak: nucjular reesatch) by the Japan Atomic Energy Research Institute. More info about the organisation, their projects, etc. can be found at: http://www.jaeri.go.jp/english/index.cgi.

    --
    SIG: TAKE OFF EVERY 'CAPTAIN'!!
  3. Re:undecided by hype7 · · Score: 2, Informative
    But isn't Itanium kinda evil (as opposed to slashdot darlings PPC/Power and Opteron)?
    While Linux is super cool? So should I like it?


    I know you're trying to be humorous, but it raises an interesting question: is this thing faster than the Big Mac?

    -- james
  4. Re:bottleneck by amorsen · · Score: 5, Informative

    The whole point of Altix is that it's a single system image, not a cluster. Every processor can access all 13TB. That doesn't mean communication is free, of course, but it's vastly faster than your favourite Beowulf cluster.

    --
    Finally! A year of moderation! Ready for 2019?
  5. Re:undecided by networkBoy · · Score: 2, Informative

    serious response to funny comment:
    The deal is that Itanium2s are (relatively) better processors when everything is compiled for them. The hitch is that in terms of price for performance, Itaniums are near the bottom of the pile (highest performance != best value).
    Finally, in this situation (price be damned), there is not any reason to worry about value, just performance. Thus Itanium wins.
    -nB

    --
    whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
  6. Luckily by bmajik · · Score: 5, Informative

    SGI has been working through this in hardware for over 10 years.

    The distributed shared memory concept of the Altix (first seen on Origin 200 / Origin 2000 in the commercial space, and previously based on the Stanford DASH/FLASH projects) uses a hardware based memory router.

    Each PE has local RAM and local CPUs and a "MAGIC" chip that routes cache invalidations, memory block "ownership", etc. messages to other PEs as necessary. Unlike SMP designs, cache coherency doesn't destroy the whole shebang because it's not a shared bus, it's a hierarchical directory system. I.e. PE0 knows it only needs to contact PE3, PE6, and PE13 to invalidate a cache block. Turns out that that's much more efficient than broadcasting a message to PE0-PE63 saying "invalidate this block!"
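
    A rough way to see the difference in message counts is the toy model below. It is only a sketch: the Directory class and its read/write methods are invented for illustration and say nothing about how the real hardware stores its directory. The 64 PEs and the sharers PE0, PE3, PE6 and PE13 are taken from the example above.

```python
# Toy model: invalidation messages with a sharer directory vs. a broadcast.
# Directory, read and write are hypothetical names for this sketch only.

class Directory:
    def __init__(self, num_pes):
        self.num_pes = num_pes
        self.sharers = {}  # block id -> set of PEs holding a cached copy

    def read(self, pe, block):
        # A read adds the PE to the block's sharer set.
        self.sharers.setdefault(block, set()).add(pe)

    def write(self, pe, block):
        # A write invalidates every other sharer; only those PEs get a message.
        targets = self.sharers.get(block, set()) - {pe}
        self.sharers[block] = {pe}  # the writer now holds the only copy
        return sorted(targets)


def broadcast_targets(num_pes, writer):
    # Shared-bus style: every other PE sees the invalidation.
    return [p for p in range(num_pes) if p != writer]


d = Directory(num_pes=64)
for pe in (0, 3, 6, 13):
    d.read(pe, block=42)

directed = d.write(0, block=42)
broadcast = broadcast_targets(64, writer=0)
print(len(directed), "directed messages vs", len(broadcast), "broadcast messages")
```

    With only a handful of sharers per block, the directed case sends a few messages where the broadcast case sends one to every other PE, and that gap widens as the PE count goes up.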

    Now, as far as _all_ processors sharing the full 13TB - I am not sure.

    The memory density / system image equation is sort of a tradeoff, as more PEs require more router hops in the topology. More router hops increase latency. SGI has sold 256p and 512p single-image systems, and may have gone up to 1024p or 2048p per system.

    To be perfectly honest, the system-system latency is different than the intra-system latency, but nothing like it would be on an x86-with-ethernet shared nothing cluster.

    SGI's big installations are cool as they have advantages of both SMP and MPP designs: each autonomous machine gives you single-image benefits but with really high proc counts, and then you link a bunch of those together to get this outrageously sized machine.

    --
    My opinions are my own, and do not necessarily represent those of my employer.
    1. Re:Luckily by jon3k · · Score: 2, Informative

      http://www.sgi.com/products/servers/altix/

      "Scaling to 256 Itanium 2 processors in a single node, Altix 3700 leverages the powerful SGI® NUMAflex(TM) global shared-memory architecture to derive maximum application performance from new high-density CPU bricks." So I'm guessing its still 256 CPU's per node.

    2. Re:Luckily by flaming-opus · · Score: 3, Informative

      At NASA, SGI has been experimenting with a 2048 proc single system image. Since the Japan system has yet to be deployed, it will likely be a single system.

      The SGI magic memory controller incorporates the numalink (originally called cray-link) router they leveraged from the T3E work. This router uses wormhole routing, which starts forwarding a packet as soon as the address bytes are read. This means that the added latency of going through several routers is often much less than packaging up the packet in the first place. On the hardware side of things it's not the number of router-hops that limits the scalability of the system. Rather, the larger the memory, the coarser the directory blocks. With 13TB of memory you are probably invalidating dozens or hundreds of pages at a time. SUCK.
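
      The latency argument for wormhole routing can be sketched with a simplified model. Everything numeric here is made up for illustration (packet size, header size, link speed); none of it is a numalink figure, and the functions ignore per-router switching delay and contention.

```python
# Simplified latency model: store-and-forward vs. wormhole routing.
# All sizes and the link speed are hypothetical values for this sketch.

def store_and_forward(hops, packet_bytes, bytes_per_ns):
    # Each router waits for the entire packet before forwarding it.
    return hops * (packet_bytes / bytes_per_ns)


def wormhole(hops, packet_bytes, header_bytes, bytes_per_ns):
    # Each router forwards as soon as the header has arrived; the body
    # streams behind it, so the full packet time is paid only once.
    return hops * (header_bytes / bytes_per_ns) + packet_bytes / bytes_per_ns


PACKET, HEADER, LINK = 128, 8, 2.0  # bytes, bytes, bytes per nanosecond

for hops in (1, 2, 4, 6):
    saf = store_and_forward(hops, PACKET, LINK)
    wh = wormhole(hops, PACKET, HEADER, LINK)
    print(f"{hops} hops: store-and-forward {saf:.0f} ns, wormhole {wh:.0f} ns")
```

      In this model each extra hop costs only the header time under wormhole routing, which is why the hop count matters less than one might expect.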

      The cache coherency of SGI's cc-numa machines makes them incredibly easy to program. However, there is a big overhead. Since most supercomputing software is written with MPI, rather than with posix-threads, you don't really benefit from it anyway. I think you can disable the hardware coherency on a per-process basis, which would greatly speed up MPI software.

    3. Re:Luckily by joib · · Score: 2, Informative


      At NASA, SGI has been experimenting with a 2048 proc single system image.


      The Columbia system still consists of 20 512-cpu systems, so I would assume this consists of 4 such 512-cpu systems.


      The cache coherency of SGI's cc-numa machines makes them incredibly easy to program. However, there is a big overhead.


      Well yes, the basic problem is that OpenMP/pthreads assumes a flat memory, whereas a NUMA box is anything but flat. So the kernel better be real smart about how to map the memory onto the hardware to minimize remote memory accesses. And the programmer should of course avoid accessing memory from all over the place, although it's technically possible.
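
      A toy model of why placement matters is sketched below. It assumes a first-touch style policy, where a page lives on the node that first touches it; that policy, the 4 nodes, the 16 pages and the function names are all assumptions made for the sketch, not details from this thread.

```python
# Toy NUMA model: fraction of accesses that go to a remote node.
# place_first_touch and remote_fraction are hypothetical helpers.

def place_first_touch(touch_order):
    # touch_order: list of (page, node); the first node to touch a page owns it.
    home = {}
    for page, node in touch_order:
        home.setdefault(page, node)
    return home


def remote_fraction(home, accesses):
    # accesses: list of (page, node); remote if the node is not the page's home.
    remote = sum(1 for page, node in accesses if home[page] != node)
    return remote / len(accesses)


NODES, PAGES = 4, 16
# Each node works on its own contiguous slice of pages.
work = [(page, page * NODES // PAGES) for page in range(PAGES)]

# Case 1: one thread on node 0 initialises everything, then the workers run.
serial_init = place_first_touch([(page, 0) for page in range(PAGES)])
# Case 2: each worker initialises the pages it will later use.
parallel_init = place_first_touch(work)

print("serial init:  ", remote_fraction(serial_init, work))
print("parallel init:", remote_fraction(parallel_init, work))
```

      The code is identical from a threading point of view in both cases; only where the data was first touched changes how much of the traffic crosses the interconnect.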


      Since most supercomputing software is written with MPI, rather than with posix-threads, you don't really benefit from it anyway. I think you can disable the hardware coherency on a per-process basis, which would greatly speed up MPI software.


      I'm pretty sure the SGI MPI implementation is pretty well optimized, either by using shared memory or by sending the messages directly over the numalink.

  7. Not the largest memory capacity by Anonymous Coward · · Score: 5, Informative

    Sorry to spoil the excitement for everybody, but Columbia actually far exceeds the Japanese system's memory capacity at 20 TByte. See this description for details of Columbia's config.

  8. Re:Honest curiosity by Wesley+Felter · · Score: 4, Informative

    Most clusters run the vendor Unix. IBMs run AIX or Linux, SGIs run IRIX or Linux, Alphas run Tru64, x86 clusters run Linux. The ultra-high-end custom machines run obscure custom Unix ports. Microsoft is trying to break into the HPC market, but so far only Cornell and Rice are buying.

  9. Re:itanic processor shipments - giving them away f by chill · · Score: 2, Informative

    Reading through both links, I fail to see where it mentions that SGI & Intel *gave* the system to NASA for free.

    The SGI press release http://www.sgi.com/company_info/newsroom/press_releases/2004/october/columbia.html mentions NASA having to put together a business case and justification for Congress, and that normally means asking for funds.

    Even if they did just give it away for the press (and I doubt it), when dealing with the gov't, the support contracts are separate. No one but SGI could properly support the system, so I'm willing to bet they got a fat support contract out of it.

    -Charles

    --
    Learning HOW to think is more important than learning WHAT to think.
  10. Re:undecided by RalphBNumbers · · Score: 3, Informative
    is this thing faster than the Big Mac?

    And the answer is: it depends on what you're doing with it.
    This thing is significantly more tightly coupled than VT's cluster, and uses shared memory as opposed to clustering, so for a lot of tightly coupled problems it will be *far* more efficient.

    As for raw processing power, the Itanium2 has the same theoretical peak floating point performance as a PPC970 at the same clock. In reality the Itanium is likely to come closer to achieving its peak than the PPC970 due to its massive cache (9MB compared to the 970's 512KB). However, the Itaniums in an Altix3000 are only running at 1.6GHz according to SGI's page, while the 970s in VT's cluster are now at 2.3GHz. So the BigMac would have some advantage on loosely coupled problems that it can fit in its smaller cache and memory.

    So while the BigMac might beat this system at Linpack, the benchmark used to determine the top500, in the domain this system is to be used for (3D modeling of nuclear blasts) its tighter coupling and greater RAM will make it much faster.
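
    The per-processor arithmetic behind that comparison can be sketched roughly as follows. The clocks, the 2048 processors and the 13 TB come from the story and the comment above. The figure of 4 floating-point operations per cycle is an assumption (two fused multiply-add units per processor); the comment only says the two chips match at equal clock. Treating 13 TB as 13 * 1024 GB is also an assumption.

```python
# Back-of-the-envelope peak numbers; flops_per_cycle=4 is an assumed value.

def peak_gflops(clock_ghz, flops_per_cycle=4):
    # Theoretical peak = clock rate times operations retired per cycle.
    return clock_ghz * flops_per_cycle


itanium2 = peak_gflops(1.6)  # Altix clock quoted above
ppc970 = peak_gflops(2.3)    # VT cluster clock quoted above

system_peak_tflops = 2048 * itanium2 / 1000  # 2048 processors, from the story
ram_per_cpu_gb = 13 * 1024 / 2048            # 13 TB shared by 2048 processors

print(f"Itanium2 per-CPU peak: {itanium2:.1f} GFLOPS")
print(f"PPC970 per-CPU peak:   {ppc970:.1f} GFLOPS")
print(f"Altix system peak:     {system_peak_tflops:.1f} TFLOPS")
print(f"RAM per processor:     {ram_per_cpu_gb:.1f} GB")
```

    These are theoretical peaks only; they say nothing about how close either machine gets on a real workload, which is the whole point about cache size and coupling.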
    --
    "The worst tyrannies were the ones where a governance required its own logic on every embedded node." - Vernor Vinge
  11. Re:undecided by drw · · Score: 2, Informative

    The Itanium2 is a fast processor, especially when it comes to optimized floating-point calculations. Yes, it is expensive, and so the price/performance ratio is not as good as that of common desktop processors, mostly for two reasons:

    1. Large die area (mostly due to huge amounts of on-die cache) - chip price is directly related to how many cores fit on a silicon wafer.
    2. The Itanium2 is a low volume product, so R&D and verification costs are a higher percentage of chip costs.

    The biggest problem with the Itanium2 is not its performance, but the inability of Intel to lower its cost. This causes it to be relegated to niche markets like HPC where performance is everything.