Supercomputers' Growing Resilience Problems
angry tapir writes "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more — with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D student at the North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."
Strikes me as a return to the olden days of vacuum tubes and early transistor computers, where component failure was frequent and brought everything to halt while the bad component was hunted down.
In the long run if you're running tens of thousands of nodes, then you need to be able to work around failures.
The world's burning. Moped Jesus spotted on I50. Details at 11.
Componentry is embiggened, leading to less cromulent supercomputers?
The rest of our supercomputers are clusters and are built so that node deaths don't effect the cluster at large. Someone may need to resubmit a job, that's all. If they are competent, they won't even lose all their progress by using check-pointing.
Sensationalist titles are sensationalist I guess.