Slashdot Mirror


SGI to Scale Linux Across 1024 CPUs

im333mfg writes "ComputerWorld has an article up about an upcoming SGI Machine, being built for the National Center for Supercomputing Applications, "that will run a single Linux operating system image across 1,024 Intel Corp. Itanium 2 processors and 3TB of shared memory.""

28 of 360 comments

  1. Whoa! by rylin · · Score: 5, Funny

    Sweet, now we'll be able to run Doom3 at highest detail in *SOFTWARE*-rendering mode!

  2. Ok by CableModemSniper · · Score: 5, Funny

    But does it run--crap. I mean what about a Beowulf--doh!
    Damn you SGI!

    --
    Why not fork?
    1. Re:Ok by jc42 · · Score: 4, Funny

      Hey, any reason we couldn't build, say, 1024 of these things, and make a beowulf cluster of them?

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
  3. In other news... by b1t+r0t · · Score: 4, Funny

    Intel's sales figures for Itanic^Hum CPUs more than doubled as a result.

    --
    "Open source is good." - Steve Jobs
    "Open source is evil." - Microsoft
    1. Re:In other news... by levram2 · · Score: 5, Informative

      The limit for Windows Server 2003, Datacenter edition for 64 bit Itaniums is actually 64 processors and 512 GB RAM. http://www.microsoft.com/windowsserver2003/64bit/ipf/datacenter.mspx

    2. Re:In other news... by caluml · · Score: 5, Funny

      We don't care about your actual facts for Windows - here at Slashdot we have FUD, rumour, and downright persistence. I think you will find if you read up on it more closely that 2003 Datacentre can only support up to 2 CPUs, and 256Mb maximum.
      Please stop letting facts get in the way of a good MS bashing session.

      Minister for Dis-Information.

  4. Re:really fast? by jhunsake · · Score: 4, Funny

    so does this mean KDE and Openoffice will finally run at decent speed?

    No, you're going to need quantum computing for that.

  5. The big question is... by mangu · · Score: 4, Funny

    ...how easy it is to install printer and sound drivers?

    1. Re:The big question is... by carlmenezes · · Score: 4, Funny

      Well on Windows you'd get a message saying...

      "Windows has detected 1024 new sound cards and is installing them..."

      and then the inevitable..

      "Windows needs to restart your computer. Click OK to restart"

      and then on system restart ...

      1024 sound control apps in the system tray! =)

      --
      Find a job you like and you will never work a day in your life.
  6. In other news... by k4_pacific · · Score: 4, Funny

    Microsoft made a statement today reminding everyone that Windows Server 2003 can handle as many as 32 processors, at the same time even.

    When shown the report about Linux running on 1024 processors, Gates purportedly responded, "32 processors ought to be enough for anybody."

    --
    Unknown host pong.
  7. Re:What happened to RISC? by CableModemSniper · · Score: 4, Funny

    They decided it was too RISCy maybe?

    --
    Why not fork?
  8. Re:Solaris by justins · · Score: 4, Informative

    Solaris is not a leader in supercomputing, never has been.

    http://top500.org/list/2004/06/

    There's no "stronghold" for Sun to lose.

    --
    Now before I get modded down, I be to remind whoever might read this that what I am saying is FACT. - bogaboga
  9. Sun != scientific computing by vlad_petric · · Score: 4, Informative
    Sun processors execute server workloads (database, app server) very well, but that's pretty much it. The emphasis with such workloads is on the memory system. Boatloads of caches do the job. It's also more effective to have tons of processors that are very simple, than just a couple of them that are complex and powerful.

    Scientific computing means data crunching (floating point). Complex, powerful processors are needed. The "stupider, but more" tradeoff doesn't work anymore. Sun processors have fallen behind in this respect.

    --

    The Raven

  10. Re:Solaris by mrm677 · · Score: 4, Interesting

    It seems that if they pull this off one of the strongholds of Solaris (namely massively parallel computing) will have been conquered by Linux. I wonder how Sun are feeling at the moment?

    Solaris scales to hundreds of processors out-of-the-box. Until the vanilla Linux kernel accepts these changes and scales, Solaris still has a big edge in this area.

    Lame analogy: many people have demonstrated that they can hack their Honda Civic to outperform a Corvette, however I can walk into a dealership and purchase the latter which performs quite well without mods.

  11. Sun does more than that by puppetluva · · Score: 4, Insightful

    Sun hardware has additional, wonderful resiliency features like - allowing cpu's to "fail-over" to other cpus in case of failure. The same holds true for memory, network interfaces, etc. Solaris is aware of these hardware features and can "map out" the bad memory and cpus on the fly (or allow swap-in replacements). The engineers can then replace the broken cpus/memory/interfaces WITHOUT BRINGING THE MACHINE DOWN. This lends itself to an environment that can enjoy nearly 100% uptime. Finally, since Sun has been doing the "lots of cpus" thing for many years, their process management and scalability tends to be much better.

    I don't work for Sun, I'm just an SA that deals with both Solaris and Linux boxes. You don't pick sun for just "lots of cpus", you pick it for a very scalable OS and amazing hardware that allows for a very, very solid datacenter. If downtime costs a lot (ie. you lose a lot of money for being down), you should have Sun and/or IBM zseries hardware. Unfortunately those features cost a lot and most times you can use Linux clustering instead for a fraction of the cost and a high percentage of the availability.

  12. Re:What happened to RISC? by Epistax · · Score: 4, Interesting

    RISC and CISC offer no final advantage over the other, so the one that dominated is the one that was here first.

    Quick examples: RISC uses less power because it has less logic? No, it needs to run at a higher frequency to maintain the same speed as a slower CISC.
    RISC is easier to program? Depends on the person. A compiler can take good advantage of large, hardware-optimized instructions.
    RISC easier to develop/manage? I'll say yes for RISC on this one. There's simply less logic on the chip, so fewer logic errors are possible. There's plenty more cache which can break, but broken parts can be fused off.
    RISC is physically smaller? No. RISC needs a higher clock frequency because many more instructions need to be executed. The result of this is that a much larger instruction cache is needed on chip.

    I don't remember every comparison but it pretty much comes out that neither is better than the other. That being said RISC is better than x86. Everything is better than x86. However CISC vs RISC is much harder to judge. Having done x86, 68k, and MIPS I must say that RISC is a pleasure.

  13. Re:Solaris by kasperd · · Score: 5, Interesting

    Until the vanilla Linux kernel accepts these changes and scales, Solaris still has a big edge in this area.

    I wouldn't be surprised to see these changes in the 2.8 kernel. "And what will people do until then?" I hear some people ask. I can tell you that right now very few people actually need to scale to 1024 CPUs. And that will probably also be true by the time Linux 2.8.0 is released. AFAIK Linux 2.6 does scale well to 128 CPUs, but I don't have hardware to test it, and neither do any of my friends. So I'd say there is no need for a rush to get this in mainstream; the few people that need this can patch their kernels. My guess is that in the time from now until 2.8.0 is released, we will see fewer than 1000 such machines worldwide.

    --

    Do you care about the security of your wireless mouse?
  14. The solution! by Sidicas · · Score: 5, Funny

    "will run a single Linux operating system image across 1,024 Intel Corp. Itanium 2 processors..."
    "The National Center for Supercomputing Applications will use it for research"


    1. Make a system that generates more heat than a supernova.
    2. Research a solution to global warming.
    3. Profit!

  15. In other Headlines by ShadowRage · · Score: 4, Funny

    SCO gained $715,776

  16. Another thing Sun does well.... by passthecrackpipe · · Score: 4, Insightful
    Cache reduction - ehh cash reduction. One of the prime reasons Sun is losing serious levels of installed base to Linux is not because Linux is better, it is because Sun is bloody expensive - outrageously so. And while most customers had to endure the annual fleecing with gritted teeth - due to lack of alternatives - Sun is now being pummeled out of datacenter after datacenter.

    I have replaced Sun hardware/software combos in the core datacenter for many of our customers, and I can tell you that yes - Sun brings some amazing features to the table - most of which are there to serve old technology. Linux on simple CPUs delivers amazing price/performance: depending on the job, we see an average of 3x to 4x performance increase for 25% of the cost. That means that if I were to spend the same, lifecycle-wise, on a Linux cluster as I would on a big Sun box like the 10k or 15k, I'd end up with 12x to 16x the performance of the Sun solution.
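
    The arithmetic behind that 12x to 16x figure can be sketched in a few lines. This only restates the ratios quoted above; the function name and the budget normalized to one Sun box are made up for the example, and it assumes performance adds up linearly across machines, which (as noted further down) is not true for every workload.

```python
def performance_at_equal_spend(speedup, cost_fraction, budget=1.0):
    """Relative performance when the whole budget goes to the cheaper option.

    speedup: performance of one cheap system relative to the baseline (e.g. 3.0)
    cost_fraction: cost of one cheap system relative to the baseline (e.g. 0.25)
    budget: total spend in units of one baseline system (hypothetical normalization)
    """
    systems_bought = budget / cost_fraction
    # Assumption: the job scales linearly over the systems bought.
    return systems_bought * speedup


low = performance_at_equal_spend(3.0, 0.25)
high = performance_at_equal_spend(4.0, 0.25)
print(low, high)  # 12.0 16.0
```

    In other words, the 12x to 16x is just (1 / 0.25) multiplied by the 3x to 4x speedup.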

    The same functionality in terms of cpu and ram (and other hardware) failure is available on the Linux cluster, albeit in less graceful form - the magic spell to invoke goes like this:
    shutdown -h now
    If I have 300 machines crunching my data, I can afford to lose a couple, and can afford to have a few hot standbys.

    Of course, the massively parallel architecture does not work for all applications, and in those cases you would look to use either OpenMOSIX or of course the (relatively expensive) SGI box mentioned in this article.
    --
    People who think they know everything are a great annoyance to those of us who do.
  17. Re:Similar software available? by dwgranth · · Score: 5, Informative

    Well, SGI uses/hacks NUMA, spinlocks, etc. to make this happen in a more efficient manner. We recently had an SGI rep come and explain their 512-CPU architecture at our LUG meeting... and he basically said that SGI has their own implementation of all of the clustering/CPU stacking techs... which they will eventually feed back into the community.. all good stuff.. understandably they will wait for a year or so so they can get their money's worth before they release their changes.

  18. Re:Advantages...? by myg · · Score: 4, Informative
    Because a machine like that isn't about running Apache or serving files.

    The purpose of that computer is to solve complex scientific problems such as weather simulations, high-energy particle simulations, protein folding, etc. Many of these simulations involve iterated systems of equations that can take decades to solve on the fastest CPUs we have today.

    The only way to get meaningful results in a meaningful amount of time is to break the problem apart into smaller problems and solve them in parallel.

    Some projects, such as Folding@Home and Find-A-Drug go the distributed computing route -- use many disconnected systems to solve the problem.

    The downside to that approach is that not all problems can be easily broken apart -- and some classes of problems can exist without tight coupling but they lose efficiency. The impressive thing about this particular supercomputer is that it has a single, unified memory image.

    This is very useful for some classes of simulation problems when the entire simulation must be present for each iteration.

  19. The real test by Bruha · · Score: 4, Funny

    Fire up apache and then post a link to it here on slashdot. We love a challenge.

  20. Let me clue you in on a few things by justins · · Score: 4, Informative
    You don't pick sun for just "lots of cpus", you pick it for a very scalable OS and amazing hardware that allows for a very, very solid datacenter.

    The UNIX made by SGI (the company making the machine referenced in the article) is more scalable than Solaris. Remember, IRIX was the first OS to scale a single Unix OS image across 512 CPUs. And now they've eclipsed that, with Linux.

    Sun hardware has additional, wonderful resiliency features like - allowing cpu's to "fail-over" to other cpus in case of failure.

    None of that is unique to Sun.

    Finally, since Sun has been doing the "lots of cpus" thing for many years, their process management and scalability tends to be much better.

    Better than what? And says who? They've never decisively convinced the market that they're better at this than HP, SGI, IBM or Compaq.

    If downtime costs a lot (ie. you lose a lot of money for being down), you should have Sun and/or IBM zseries hardware. Unfortunately those features cost a lot and most times you can use Linux clustering instead for a fraction of the cost and a high percentage of the availability.

    In addition to ignoring the other good Unix architectures out there in a dumb way with this comparison, you're also totally missing the point of the article. Linux supercomputing isn't just about cheap clusters anymore. Expensive UNIX machines on one side and cheap Linux clusters on the other is a false dichotomy.
    --
    Now before I get modded down, I be to remind whoever might read this that what I am saying is FACT. - bogaboga
  21. Re:It became obsolete by Johan+Veenstra · · Score: 4, Informative

    Actually RISC is a bad name for what it stands for; it should have been SISC (Simplified Instruction Set Computer), since the key difference between the two is the complexity of the instructions and not the quantity.

    A CISC instruction could do things like: take the value in register BP, add 4, get the value from the memory at the address you just computed, add the value in the register AX, and put the result back at the same memory location. Execution would take several clock-ticks.

    To do the same in RISC, you would need several instructions (add 4, get from memory, add ax, store to memory). The execution of the individual instructions would take one tick each, so the sequence would take several. But on average RISC was a bit faster.

    CISC was invented at a time when memory was small; the CISC way let you store larger programs in the same amount of memory.

    RISC was invented when memory-size was not limited anymore, and looked to displace CISC in the long run.

    CISC was still around when the memory bandwidth became a limiting factor. And since fewer instructions needed to be fetched from memory, more bandwidth was left for other data traffic. RISC lost some of its speed advantage.

    Modern CISC processors get CISC instructions from memory, chop them up into smaller instructions, and execute those smaller instructions really fast. So in fact they can be seen as RISC processors posing as CISC processors, i.e. the best of both worlds.

    So CISC is a way of compressing RISC instructions, so they take up less memory/bandwidth.
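
    To make the "chop them up" idea concrete, here is a toy sketch in Python. It is not how any real processor is built: the instruction names (ADD_MEM, ADDI, LOAD, ADD, STORE), the scratch registers t0/t1 and the dictionary-based memory are all invented for illustration. It takes the example from above (BP plus 4, load, add AX, store back) as one complex instruction and decodes it into four simple steps.

```python
def decode(instr):
    """Split one complex instruction into simple micro-operations (toy model)."""
    if instr[0] == "ADD_MEM":  # ("ADD_MEM", base_reg, offset, src_reg)
        _, base, offset, src = instr
        return [
            ("ADDI", "t0", base, offset),  # t0 = base + offset
            ("LOAD", "t1", "t0"),          # t1 = mem[t0]
            ("ADD", "t1", "t1", src),      # t1 = t1 + src
            ("STORE", "t1", "t0"),         # mem[t0] = t1
        ]
    return [instr]  # already a simple instruction


def execute(micro_op, regs, mem):
    """Run a single simple operation against registers and memory."""
    op = micro_op[0]
    if op == "ADDI":
        _, dst, src, imm = micro_op
        regs[dst] = regs[src] + imm
    elif op == "LOAD":
        _, dst, addr = micro_op
        regs[dst] = mem[regs[addr]]
    elif op == "ADD":
        _, dst, a, b = micro_op
        regs[dst] = regs[a] + regs[b]
    elif op == "STORE":
        _, src, addr = micro_op
        mem[regs[addr]] = regs[src]
    else:
        raise ValueError(f"unknown op {op}")


def run(program, regs, mem):
    """Decode and execute a program; return how many simple ops were executed."""
    executed = 0
    for instr in program:
        for micro_op in decode(instr):
            execute(micro_op, regs, mem)
            executed += 1
    return executed


regs = {"bp": 100, "ax": 7, "t0": 0, "t1": 0}
mem = {104: 35}
fetched = [("ADD_MEM", "bp", 4, "ax")]  # one instruction fetched from memory
executed = run(fetched, regs, mem)
print(len(fetched), executed, mem[104])  # 1 4 42
```

    One instruction crosses the memory bus, four simple operations run inside the core. That is the "compression" point in a nutshell.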

  22. 1024 cpus and 3 TB memory by Anonymous Coward · · Score: 4, Funny

    That's almost enough to run Emacs!

  23. Re:from MPI to multithreaded ? by Sangui5 · · Score: 4, Informative

    Does this mean that the applications running on the "old" clusters, presumably using some flavor of MPI to communicate between nodes, will have to be ported somehow to become multithreaded applications ?

    NCSA still has plenty of "old" style clusters around. Two of the more aging clusters, Platinum and Titan, are being retired, to make room for newer systems like Cobalt. Indeed, the official notice was made just recently--they're going down tomorrow. However, as the retirement notice points out, we still have Tungsten, Copper, and Mercury (TeraGrid). Indeed, Tungsten is number 5 on the Top 500, so it should provide more than enough cycles for any message-passing jobs people require.

    So, anyone has any insights as to why/how this matters for the programmers ?

    What it means is that programming big jobs is easier. You no longer need to learn MPI, or figure out how to structure your job so that individual nodes are relatively loosely-coupled. Also, jobs that have more tightly-coupled parallelism are now possible. The older clusters used high-speed interconnects like Myrinet or Infiniband (NCSA doesn't own any Infiniband AFAIK, but we're looking at it for the next cluster supercomputer). Although they provided really good latency and bandwidth, they aren't as high-performing as shared memory. Also, Myrinet's ability to scale to huge numbers of nodes isn't all that great--Tungsten may have 1280 compute nodes, but a job that uses all 1280 nodes isn't practical. Indeed, until recently the Myrinet didn't work at all, even after partitioning the cluster into smaller subclusters.
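
    A rough feel for the difference in programming model, sketched in Python. This is not MPI and not what runs on our machines: threads stand in for CPUs, queue.Queue stands in for the interconnect, and the function names are invented. Python threads will not give a real speedup here either; the point is only how much partitioning and copying the message-passing version has to spell out.

```python
import queue
import threading


def sum_shared(data, workers):
    """Shared-memory style: every thread reads the same list in place."""
    partial = [0] * workers

    def work(wid):
        lo = wid * len(data) // workers
        hi = (wid + 1) * len(data) // workers
        # No copy: index straight into the shared data; write only our own slot.
        partial[wid] = sum(data[i] for i in range(lo, hi))

    threads = [threading.Thread(target=work, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def sum_message_passing(data, workers):
    """Message-passing style: a worker only sees what is sent to it."""
    results = queue.Queue()

    def work(inbox):
        chunk = inbox.get()      # receive a private copy of our chunk
        results.put(sum(chunk))  # send the partial sum back

    threads = []
    for wid in range(workers):
        lo = wid * len(data) // workers
        hi = (wid + 1) * len(data) // workers
        inbox = queue.Queue()
        inbox.put(list(data[lo:hi]))  # explicit partition and copy
        t = threading.Thread(target=work, args=(inbox,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(workers))


data = list(range(1000))
print(sum_shared(data, 4), sum_message_passing(data, 4))  # 499500 499500
```

    Both give the same answer; the second one just makes you decide up front who owns which piece of the data.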

    This new shared-memory machine will be more powerful, more convenient, and easier to maintain than the cluster-style supercomputers. Hopefully it will allow better scheduling algorithms than on the clusters too--an appalling number of cycles get thrown away because cluster scheduling is non-preemptive.

    I'd also like to point out some errors in the Computerworld article. NCSA is *currently* storing 940 TB in near-line storage (Legato DiskXtender running on an obscenely big tape library), and growing at 2TB a week. The DiskXtender is licenced for up to 2 petabytes--we're coming close to half of that now. The article therefore vastly understates our storage capacity. On the other hand, I'd like to know where we're hiding all those teraflops of compute--35 TFLOPS after getting 6 TFLOPS from Cobalt sounds more than just a little high. That number smells of the most optimistic peak performance values of all currently connected compute nodes. I.e. - how many single-precision operations could the nodes do if they didn't have to communicate, everything was in L1 cache, we managed to schedule something on all of them, and they were all actually functioning. Realistically, I'd guess that we can clear maybe a quarter of that figure, given machines being down, jobs being non-ideal, etc. etc. etc.
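
    For anyone who wants to check the back-of-the-envelope numbers above, a small sketch. The inputs are the figures quoted in this comment; treating 2 petabytes as 2000 TB, assuming the 2 TB/week growth stays constant, and the function names are assumptions made for the example.

```python
def fraction_used(stored_tb, licensed_tb):
    """Share of the licensed capacity already in use."""
    return stored_tb / licensed_tb


def weeks_until_full(stored_tb, licensed_tb, growth_tb_per_week):
    """Weeks left before the licence limit, assuming constant growth."""
    return (licensed_tb - stored_tb) / growth_tb_per_week


def realistic_tflops(peak_tflops, efficiency):
    """Scale a peak figure by a guessed real-world efficiency."""
    return peak_tflops * efficiency


LICENSED_TB = 2000  # assumption: 2 PB counted in decimal terabytes

print(fraction_used(940, LICENSED_TB))        # 0.47
print(weeks_until_full(940, LICENSED_TB, 2))  # 530.0
print(realistic_tflops(35, 0.25))             # 8.75
```

    So "close to half" is about 47% of the licence, and a quarter of the quoted 35 TFLOPS would be under 9 TFLOPS.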

    As a disclaimer, I do work at NCSA, but in Security Research, not High-Performance Computing.

  24. Re:Scalability of applications by xtp · · Score: 5, Informative

    SGI has had 512 and 1024-cpu MIPS-based systems in operation for more than 5 years. Much work was done on the Irix systems to initialize large parallel computations and provide libraries and compiler support for these configurations. One technique is to provide message-passing libraries that use shared memory. A better technique is to morph (slightly) parallel mesh apps so that each computational mesh node exposes the array elements to be shared with neighbors. No message-passing needed - you push data after a big iteration and then use the (really fast) sync primitives to launch into the next iteration. With shared-nothing clusters (i.e. Beowulf) a computation (and its memory) must be partitioned among the compute nodes. The improvement over a "classical" cluster can be startling especially with computations that are more communications-bound than compute-bound. This means there is no value for replacing a render farm with a big system. But there are big compute problems, e.g. finite element, for which the shared-nothing cluster is often inadequate.
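
    A minimal sketch of the "expose the array, iterate, then sync" pattern, in Python. This is not SGI's library code: threads stand in for CPUs, threading.Barrier stands in for the fast sync primitives, and the 1-D averaging stencil and function names are chosen only for illustration. Each worker reads its neighbors' boundary cells straight out of the shared array, so no messages are exchanged.

```python
import threading


def relax_serial(values, iterations):
    """Reference version: each interior cell becomes the mean of its neighbors."""
    cur = list(values)
    for _ in range(iterations):
        nxt = list(cur)
        for i in range(1, len(cur) - 1):
            nxt[i] = 0.5 * (cur[i - 1] + cur[i + 1])
        cur = nxt
    return cur


def relax_shared(values, iterations, workers):
    """Same computation, with workers sharing two arrays and a barrier."""
    n = len(values)
    bufs = [list(values), list(values)]  # shared by all workers, swapped each step
    barrier = threading.Barrier(workers)

    def work(wid):
        # Each worker owns a contiguous block of interior cells.
        lo = 1 + wid * (n - 2) // workers
        hi = 1 + (wid + 1) * (n - 2) // workers
        for step in range(iterations):
            src, dst = bufs[step % 2], bufs[(step + 1) % 2]
            for i in range(lo, hi):
                # src[lo - 1] and src[hi] belong to neighbors: read them directly.
                dst[i] = 0.5 * (src[i - 1] + src[i + 1])
            barrier.wait()  # everyone finishes the step before the arrays swap

    threads = [threading.Thread(target=work, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bufs[iterations % 2]


start = [0.0] * 9 + [100.0]
print(relax_shared(start, 50, 4) == relax_serial(start, 50))  # True
```

    On a shared-nothing cluster the two reads marked above would each be a message to another node on every iteration.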

    With a single memory image system the computation can easily repartition dynamically as the computation proceeds. It's very costly (never say impossible!) to do this on a cluster because you have to physically move memory segments from one machine to another. On the NUMA system you just change a pointer. The hardware is good enough that you don't really have to worry about memory latency.
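
    The "just change a pointer" point can be illustrated with a toy Python sketch. A bytearray stands in for the single memory image and memoryview slices stand in for per-worker partitions; the function names and sizes are invented. Repartitioning the shared buffer only changes the bounds, while the cluster-style version has to copy the bytes.

```python
def make_partitions(view, bounds):
    """Zero-copy partitions: each one is a window onto the shared buffer."""
    return [view[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


def copy_partitions(buf, bounds):
    """Cluster-style stand-in: each node gets its own private copy."""
    return [bytearray(buf[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]


shared = bytearray(range(16))  # stands in for the single memory image
view = memoryview(shared)

parts = make_partitions(view, [0, 8, 16])
parts = make_partitions(view, [0, 12, 16])  # repartition: only the bounds change
parts[0][11] = 200                          # write lands in the shared buffer

copies = copy_partitions(shared, [0, 12, 16])
copies[0][0] = 77                           # write stays in the private copy

print(len(parts[0]), len(parts[1]), shared[11], shared[0])  # 12 4 200 0
```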

    And let's not forget io. Folks seem to forget that you can dump any interesting section of the computation to/from the file system with a single io command. On these systems the io bandwidth is limited only by the number of parallel disk channels - a system like the one mentioned in the article can probably sustain a large number of GBytes/sec to the file system.

    Let's not forget page size. The only way you can traverse a few TB of memory without TLB-faulting to death is to have multi-MByte-size pages (because TLB size is limited). SGI allowed a process to map regions of main memory with different page sizes (up to 64 MB I think) at least 10 years ago in order to support large image data base and compute apps.
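
    A quick worked example of why page size matters at this scale. The 128-entry TLB and the 16 KB small page are hypothetical numbers picked for illustration, not the specs of any particular chip; the 64 MB page and the 3 TB of memory come from the discussion above.

```python
KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40


def tlb_reach(entries, page_size):
    """Bytes of memory the TLB can map at once without a miss."""
    return entries * page_size


def pages_needed(memory_bytes, page_size):
    """Number of pages required to cover the memory (ceiling division)."""
    return -(-memory_bytes // page_size)


ENTRIES = 128  # hypothetical TLB size

small_reach = tlb_reach(ENTRIES, 16 * KB)
large_reach = tlb_reach(ENTRIES, 64 * MB)
print(small_reach // MB, "MB reach with 16 KB pages")  # 2 MB
print(large_reach // GB, "GB reach with 64 MB pages")  # 8 GB
print(pages_needed(3 * TB, 16 * KB))                   # 201326592
print(pages_needed(3 * TB, 64 * MB))                   # 49152
```

    With small pages the TLB covers a vanishing fraction of 3 TB, so a sweep through memory misses constantly; large pages cut the number of translations by a factor of 4096.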

    When I used to work at SGI (5 years ago) the memory bandwidth at one cpu node was about 800 MBytes/s. My understanding is that the Altix compute nodes now deliver 12 GBytes/s at each memory controller. Although I haven't had a chance to test drive one of these new systems, it sounds like they have gradually been porting well-seasoned Irix algorithms to Linux. It is unlikely that a commodity computer really needs all of this stuff, but I'm looking at a 4-cpu Opteron that could really use many of the memory management improvements.
