Slashdot Mirror


Supercomputers' Growing Resilience Problems

angry tapir writes "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more, with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D. student at North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."
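To put rough numbers on the scaling claim in the summary, here is a minimal back-of-the-envelope sketch. It assumes node failures are independent and occur at a constant rate, and it uses a hypothetical 5-year mean time between failures (MTBF) per node; neither assumption comes from Fiala's talk, and the function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 365.25 * 24  # 8766 hours


def system_mtbf_hours(node_mtbf_hours, n_nodes):
    """Expected time until the first failure anywhere in the system,
    assuming independent nodes with constant failure rates."""
    return node_mtbf_hours / n_nodes


def prob_failure_within(hours, node_mtbf_hours, n_nodes):
    """Probability that at least one node fails within `hours`."""
    return 1.0 - math.exp(-hours * n_nodes / node_mtbf_hours)


# Assumption: each node fails on average once every 5 years.
node_mtbf = 5 * HOURS_PER_YEAR
minutes = system_mtbf_hours(node_mtbf, 100_000) * 60
print(f"system MTBF: about {minutes:.0f} minutes")
print(f"P(failure within 1 hour): {prob_failure_within(1.0, node_mtbf, 100_000):.2f}")
```

Under those assumptions a 100,000-node machine would see a failure somewhere roughly every half hour, which is why a single failure halting the whole job stops being tolerable at that scale.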

3 of 112 comments

  1. Hardly A New Problem by MightyMartian · · Score: 5, Informative

    Strikes me as a return to the olden days of vacuum tubes and early transistor computers, where component failure was frequent and brought everything to a halt while the bad component was hunted down.

    In the long run if you're running tens of thousands of nodes, then you need to be able to work around failures.

    --
    The world's burning. Moped Jesus spotted on I50. Details at 11.
  2. Re:Componentry? by Anonymous Coward · · Score: 5, Funny

    Componentry is embiggened, leading to less cromulent supercomputers?

  3. "and they halt operations when they do so" by brandor · · Score: 5, Informative
    This is only true in certain types of supercomputers. The only one we have that will do this is an SGI UV-1000. It surfaces groups of blades as a single OS image. If one goes down, the kernel doesn't like it.

    The rest of our supercomputers are clusters and are built so that node deaths don't affect the cluster at large. Someone may need to resubmit a job, that's all. If they are competent, they will have used checkpointing and won't even lose all their progress.

    Sensationalist titles are sensationalist, I guess.
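The checkpoint-and-resubmit pattern brandor describes can be sketched in a few lines. This is only an illustration under stated assumptions, not how any particular cluster or scheduler does it: the "job" is a toy running sum, the checkpoint is a JSON file, and `run_job`, `fail_at`, and the file layout are hypothetical names invented for the example.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temp file and rename it into place, so a crash
    mid-write never leaves a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_job(total_steps, path, checkpoint_every=5, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every few steps.
    fail_at simulates a node death at that step (hypothetical knob)."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated node failure at step {step}")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the first submission dies at step 7, the resubmitted job picks up from the checkpoint written after step 4 and redoes only steps 5 and 6, rather than starting over from zero.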