Supercomputers' Growing Resilience Problems
angry tapir writes "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more — with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D student at the North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."
Strikes me as a return to the olden days of vacuum tubes and early transistor computers, where component failure was frequent and brought everything to halt while the bad component was hunted down.
In the long run if you're running tens of thousands of nodes, then you need to be able to work around failures.
The world's burning. Moped Jesus spotted on I50. Details at 11.
Ah-ha-ha! You wish failing kit would just up and die!
But really, you can't be sure of that. So things like ECC become minimum requirements, so you might at least have an inkling something isn't working quite right. Because otherwise, your calculations may be off and you won't know about it.
And yeah, propagated error can, and on this sort of scale eventually will, suddenly strike completely out of left field and hit like a tacnuke. At which point we'll all be wondering, what else went wrong and we didn't catch it in time?
Is an increasing amount of componentry the same thing as an increasing number of components?
The rest of our supercomputers are clusters and are built so that node deaths don't effect the cluster at large. Someone may need to resubmit a job, that's all. If they are competent, they won't even lose all their progress by using check-pointing.
Sensationalist titles are sensationalist I guess.
On that scale, distributed parallelism is key, where the system takes into account downed nodes and removes them from duty until it can return to service, or can easily add a replacement node to handle the stream. That's why Google and Facebook don't go down when a node fails.
"Beware of he who would deny you access to information, for in his heart, he dreams himself your master."
Google is having the same problems that this article describes -- they haven't fixed it either.
If your problem domain can always be broken down into map-reduce, you can easily solve it with a hadoop-like environment to get fault tolerance. If your application falls outside of map-reduce (the applications this article is referring to), you need to start duplicating state (very expensive on systems of this scale) to recover from failures.
The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down. That's literally the purpose of basic cluster technology from probably 10 years ago.
How do they act like this is a new, or magic issue? It doesn't exist if HPC people know what they're doing. Hell, usually they keep a known quantity of extra hardware out of use so that they can switch something on if things fail as necessary.
you start the job over.
You make sure that a single job's run time x the number of nodes is not so large that the chance of that job running to completion is not unreasonable.
On the previous ones I worked on the 60% job failure rate was around 100 nodes for 5 days, that comes down to the chance of a single node failing on a given day is .999 (you lose 1 out of 1000 nodes each day from something). The math is rather simple...0.999^500=60%. And in general you don't put dual power supplies, you don't mirror the disks...rerunning the jobs that failed is cheaper than increasing the node price to add things that only marginally improve things and also increase physical size.
If you have a single process bigger than that you need to setup a checkpointing system.
If you can split big jobs into lots of smaller pieces that can be pretty quickly put together at the end you do so.
On the previous one I was on they used both tricks depending on the exact nature of what was being processed.
For the most part it is not a complicated problem unless you expect unreasonably low failure rates and don't deal with reality.
The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down.
You might think so, but I've seen a configuration with an interconnect fabric that was extremely sensitive to the fallback of individual links to the next lower link speed cause all sorts of havoc cluster wide.
The joke in the industry is that supercomputing is a synonym for unreliable computing. Stuff like checkpoint-restart was basically invented on super-computers because it was so easy to lose a week's worth of computations to some random bug. When you have one-off systems or even 100-off systems you just don't get the same kind of field testing that you get regular off-the-shelf systems that sell in the millions.
Now that most "super-computers" are mostly just clusters of off-the-shelf systems we get a different root cause but the results are the same. The problem now seems to be that because the system is so distributed so is the state of the system - with a thousand nodes you've got a thousand sets of processes and ram to checkpoint and you can't do the checkpoints local to each node because if the node dies, you can't retrieve the state of that node.
On the other hand, I am not convinced that the overhead of checkpointing to a neighboring-node once every few of hours is really all that big of a problem. Interconnects are not RAM speed, but with gigabit+ speeds you should be able to dump the entire process state from one node to another in a couple of minutes. Back-of-the-napkin calculations say you could dump 32GB of ram across a gigabit ethernet link in 10 minutes with more than 50% margin for overhead. Doing that once every few hours does not seem like a terrible waste of time.
When information is power, privacy is freedom.
And for reference, what you describe in your first paragraph is EXACTLY a MapReduce problem. First 100 nodes Map, second hundred nodes Reduce the results. Rinse, repeat.
No it's not. The problem with your description is the "rinse, repeat" part. He's not talking about repeating with new input data. He's talking about a serialized workload where, for example, the output of the first 100 jobs is the input for the next 100 jobs, which then creates output that is the input for the next 100 jobs. It's not a case of repeating, its a case of serialization where if you have not done state check-point and things crater you have to start from the begining to get back where you were. No "standard of coding" can fix that.
When information is power, privacy is freedom.
"hegemonous", wow.
I think you're confusing high-availability clustering with high-performance clustering. in HPC, there are some efforts at making single jobs fault-tolerant, but it's definitely not widespread. checkpointing is the standard, and it works reasonably, though is an IO-intensive way to mitigate failure.
I am sorry to say it, but this Phd student has no clue. Dealing with a node failure is not a problem with proper, modern supercomputing programming practices as well a OS/system software. There is an amazing programming technique called "checkpointing", developed a while ago. This allows you to periodically to "checkpoint" your application, essentially saving off the system call stack, the memory, register values, etc., etc., to a file. The application is also coded to check to see if that file exists, and if it does, to load all those values back into memory, registers, call stack, and then continue running from that point. So in the event of a hardware failure, the application/thread is simply restarted on another node in the cluster. That is application level checkpointing, there is also OS level checkpointing, which essentially does the same thing, but at the OS level irregardless of the processes running on the system, allowing for anything running on the entire machine to be checkpointed and restarted from that spot.
Then there is the idea of a master dispatcher, which essentially breaks down the application into small chunks of tasks, and then sends those tasks to be calculated/performed on a node in the cluster. If it does not get a corresponding return value from the system it sent the task within a certain ammount of time, it re-sends to another node (and marking the other node as bad and not sending future tasks to it until that value is cleared).
Both of these methods fix the issue of having possible nodes which die on you during computation.
We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"
...doesn't understand the first thing about supercomputers, or even HPC. Currently virtually every HPC application uses MPI. And MPI doesn't take well to failing nodes. The supercomputer as a whole might still work, but the job will inevitably crash and needs to be restarted. HPC apps are usually tightly coupled. That sets them apart from loosely coupled codes such as a giant website (e.g. Google and friends)
Fault tolerance is a huge problem in the community and we don't have the answers yet. Some say that fault tolerance within the MPI layer (e.g. here) will be sufficient. I personally very much doubt that. My bet is on higher-level frameworks, e.g. HPX, which can "abstract away" the location of a task from the node where its actually being executed.
Computer simulation made easy -- LibGeoDecomp
Just make checkpointing cost zero time. How, have each node, really be a dual node ( like most things in nature are ). So one half is always computing, and the other half is check pointing. Just like video games use of double buffering to do smooth fps. If checkpointing uses less cpu than comps, then swap the cores functionality every N runs to give each core a temperature rest, to cool down, to increase life.
Sure each node is bigger, but it could perhaps scale better, the overhead curve should be way better, and flat.
Hire some big phds to design mitigation mechanisms to increase MTBF by better environmental management, I dunno, give nodes a rest for 30 mins every 5 days. I dunno, more research needs to be done, and that takes tonne of time, (unless you can simulate it on a super computer) doh.
Just a thought.
Liberty freedom are no1, not dicks in suits.