Supercomputers' Growing Resilience Problems

Posted by samzenpus on Wednesday November 21, 2012 @12:01PM from the a-thousand-potential-cuts dept.

angry tapir writes "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more, with each node built from multiple components: memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D. student at North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."
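As a rough illustration of the statistical point in the summary, here is a minimal sketch of how per-node reliability scales to a whole machine. It assumes node failures are independent and exponentially distributed, and that any single node failure halts the job. The five-year per-node MTBF is a made-up figure for illustration, not a number from the talk.

```python
import math

def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """Mean time between failures for the whole machine.

    Assumes independent, exponentially distributed node failures and
    that one node failure interrupts the whole job.
    """
    return node_mtbf_hours / nodes

def prob_no_failure(node_mtbf_hours: float, nodes: int, run_hours: float) -> float:
    """Probability that no node fails during a run of run_hours."""
    return math.exp(-nodes * run_hours / node_mtbf_hours)

NODE_MTBF_HOURS = 5 * 365 * 24  # hypothetical: one failure per node every 5 years
NODES = 100_000                 # node count mentioned in the summary

print(f"system MTBF: {system_mtbf_hours(NODE_MTBF_HOURS, NODES) * 60:.1f} minutes")
print(f"chance of a failure-free hour: {prob_no_failure(NODE_MTBF_HOURS, NODES, 1.0):.1%}")
```

Under those assumptions the machine as a whole sees a failure roughly every half hour, even though each node on its own looks very reliable.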

7 of 112 comments

Hardly A New Problem by MightyMartian · 2012-11-21 12:04 · Score: 5, Informative

Strikes me as a return to the olden days of vacuum tubes and early transistor computers, where component failure was frequent and brought everything to a halt while the bad component was hunted down.
In the long run, if you're running tens of thousands of nodes, you need to be able to work around failures.

--
The world's burning. Moped Jesus spotted on I50. Details at 11.
Re:Componentry? by Anonymous Coward · 2012-11-21 12:29 · Score: 5, Funny

Componentry is embiggened, leading to less cromulent supercomputers?
"and they halt operations when they do so" by brandor · 2012-11-21 12:29 · Score: 5, Informative

This is only true in certain types of supercomputers. The only one we have that will do this is an SGI UV-1000. It surfaces groups of blades as a single OS image. If one goes down, the kernel doesn't like it.
The rest of our supercomputers are clusters and are built so that node deaths don't affect the cluster at large. Someone may need to resubmit a job, that's all. If they are competent, they won't even lose all their progress, because they'll be using checkpointing.
Sensationalist titles are sensationalist I guess.
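For anyone unfamiliar with the checkpoint-and-resubmit pattern mentioned above, here is a minimal single-process sketch. The function names, the state layout and the `crash_at` parameter are all hypothetical and exist only for illustration. Real HPC jobs typically checkpoint to a shared or parallel filesystem rather than a local file.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint,
    so a crash mid-write never leaves a half-written file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path, "rb") as f:
        return pickle.load(f)

def run_job(path, n_steps, checkpoint_every=100, crash_at=None):
    """Resume from the last checkpoint and run to completion.

    crash_at simulates a node death at that step (illustration only).
    """
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated node death")
        state["total"] += step        # stand-in for real work
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the job dies at step 250, resubmitting it picks up from the checkpoint written at step 200, so only 50 steps of work are redone instead of all 250.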
Re:ummm, no. by fintler · 2012-11-21 13:00 · Score: 3

Google is having the same problems that this article describes -- they haven't fixed it either.
If your problem domain can always be broken down into map-reduce, you can easily solve it with a Hadoop-like environment to get fault tolerance. If your application falls outside of map-reduce (as the applications this article is referring to do), you need to start duplicating state (very expensive on systems of this scale) to recover from failures.
Re:Hardly A New Problem...and thus has been fixed by poetmatt · 2012-11-21 13:04 · Score: 3, Insightful

The reality of hegemonous computing is that failure is of almost no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly come crashing down. That's literally the purpose of basic cluster technology from probably 10 years ago.
How do they act like this is a new, or magic issue? It doesn't exist if HPC people know what they're doing. Hell, usually they keep a known quantity of extra hardware out of use so that they can switch something on if things fail as necessary.
Not Really New by Jah-Wren Ryel · 2012-11-21 15:07 · Score: 3, Insightful

The joke in the industry is that supercomputing is a synonym for unreliable computing. Stuff like checkpoint-restart was basically invented on super-computers because it was so easy to lose a week's worth of computations to some random bug. When you have one-off systems or even 100-off systems you just don't get the same kind of field testing that you get with regular off-the-shelf systems that sell in the millions.
Now that most "super-computers" are mostly just clusters of off-the-shelf systems we get a different root cause but the results are the same. The problem now seems to be that because the system is so distributed so is the state of the system - with a thousand nodes you've got a thousand sets of processes and ram to checkpoint and you can't do the checkpoints local to each node because if the node dies, you can't retrieve the state of that node.
On the other hand, I am not convinced that the overhead of checkpointing to a neighboring node once every few hours is really all that big of a problem. Interconnects are not RAM speed, but with gigabit+ speeds you should be able to dump the entire process state from one node to another in a couple of minutes. Back-of-the-napkin calculations say you could dump 32GB of RAM across a gigabit ethernet link in 10 minutes with more than 50% margin for overhead. Doing that once every few hours does not seem like a terrible waste of time.
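The napkin math above can be checked with a few lines. This sketch assumes decimal gigabytes (10^9 bytes), a link that sustains its full nominal 1 Gbit/s, and no protocol overhead. The function name is just for illustration.

```python
def transfer_seconds(size_gb: float, link_gbit_per_s: float) -> float:
    """Ideal time to move size_gb (decimal GB) over a link of the given speed."""
    bits = size_gb * 8e9
    return bits / (link_gbit_per_s * 1e9)

ideal = transfer_seconds(32, 1.0)   # 32 GB over gigabit ethernet
budget = 10 * 60                    # the 10-minute window, in seconds
margin = 1 - ideal / budget         # fraction of the window left for overhead

print(f"ideal transfer: {ideal:.0f} s ({ideal / 60:.1f} min)")
print(f"margin within 10 minutes: {margin:.0%}")
```

The ideal transfer comes out to 256 seconds, a bit over four minutes, which leaves about 57% of the 10-minute window for overhead.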

--
When information is power, privacy is freedom.
Re:Hardly A New Problem...and thus has been fixed by markhahn · 2012-11-21 15:37 · Score: 3, Informative

"hegemonous", wow.
I think you're confusing high-availability clustering with high-performance clustering. in HPC, there are some efforts at making single jobs fault-tolerant, but it's definitely not widespread. checkpointing is the standard, and it works reasonably, though it is an IO-intensive way to mitigate failure.
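To put a rough number on how expensive checkpointing is, one common first-order estimate is Young's approximation: checkpoint about every sqrt(2 * C * M) seconds, where C is the time to write one checkpoint and M is the system's mean time between failures. The sketch below uses it with made-up inputs: a checkpoint cost of 256 seconds (roughly 32 GB over an ideal gigabit link) and a system MTBF of 24 hours. Neither figure is a measurement, and the formula is only reasonable when C is much smaller than M.

```python
import math

def young_interval(checkpoint_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the best checkpoint interval."""
    return math.sqrt(2 * checkpoint_s * mtbf_s)

def overhead_fraction(checkpoint_s: float, mtbf_s: float, interval_s: float) -> float:
    """First-order estimate of time lost: time spent writing checkpoints,
    plus the work expected to be redone after a failure (half an interval
    on average)."""
    return checkpoint_s / interval_s + interval_s / (2 * mtbf_s)

CHECKPOINT_S = 256        # hypothetical cost of one checkpoint, in seconds
MTBF_S = 24 * 3600        # hypothetical system MTBF: 24 hours

tau = young_interval(CHECKPOINT_S, MTBF_S)
print(f"checkpoint every {tau / 3600:.2f} hours")
print(f"estimated overhead: {overhead_fraction(CHECKPOINT_S, MTBF_S, tau):.1%}")
```

With those inputs the interval is a little under two hours and the estimated overhead is under 8%. Shrink the MTBF to an hour or less and the overhead climbs quickly, which is the scaling worry the story is about.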