Supercomputers' Growing Resilience Problems
angry tapir writes "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more — with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D student at the North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."
Strikes me as a return to the olden days of vacuum tubes and early transistor computers, where component failure was frequent and brought everything to halt while the bad component was hunted down.
In the long run if you're running tens of thousands of nodes, then you need to be able to work around failures.
The world's burning. Moped Jesus spotted on I50. Details at 11.
I would think the same concurrency one must apply to process within such a system could be applied in today's IaaS infrastructures. Perhaps some problems don't scale to the cloud. But then, I would think many of these could be run on several super-ish computers with a mix of the techniques for cloud and "local" concurrency. Is this really a problem for anyone but the sales staff offering supercomputers to large institutions?
Ah-ha-ha! You wish failing kit would just up and die!
But really, you can't be sure of that. So things like ECC become minimum requirements, so you might at least have an inkling something isn't working quite right. Because otherwise, your calculations may be off and you won't know about it.
And yeah, propagated error can, and on this sort of scale eventually will, suddenly strike completely out of left field and hit like a tacnuke. At which point we'll all be wondering, what else went wrong and we didn't catch it in time?
Is an increasing amount of componentry the same thing as an increasing number of components?
Even my small business has component redundancy. It might slow things down, but surely not grind them to a halt. With all the billions spent and people WAY smarter than me working on those things, I really doubt its as bad as TFA makes it out to be.
This just says to me that they need to buy three of every component and run a voting system and throw a fail if one is way off the mark.
The rest of our supercomputers are clusters and are built so that node deaths don't effect the cluster at large. Someone may need to resubmit a job, that's all. If they are competent, they won't even lose all their progress by using check-pointing.
Sensationalist titles are sensationalist I guess.
Ever heard of Google? They're doing the massive supercomputer thing just fine and have completely figured out the fault tolerance side of things.
On that scale, distributed parallelism is key, where the system takes into account downed nodes and removes them from duty until it can return to service, or can easily add a replacement node to handle the stream. That's why Google and Facebook don't go down when a node fails.
"Beware of he who would deny you access to information, for in his heart, he dreams himself your master."
Back in the 60's, ENIAC (the supercomputer of the day) ran on thousands of vacuum tubes, MTBF was
about 5.6 hours, then an IT person would have to figure out which tube to replace.
See http://ftp.arl.army.mil/~mike/comphist/hist.html
How do you give the work to another node when the failed node contains the only copy of its state (like in an MPI job)? Duplicating the state on multiple nodes is way too expensive.
EP is trivial to deal with.. The problem is with supercomputing jobs that aren't EP, and rely on computations from nodes 1-100 to feed to the next computational step in node 101-200, etc.
The real answer is some form of fault tolerant computing.. think EDAC for memory (e.g. forward error correction) but on a grosser scale. Or, even, designing your algorithm for the ability to "go back N", as in some ARQ communication protocols.
The theory on this is fairly well known (viz Byzantine Generals problem)
The problem is that big supercomputing software is hard enough to write already without throwing in non-deterministic behavior of the processing. It's not like the compiler does it for you, and even if you have libraries that use multiprocessing (matrix math), that's usually a small part of the problem.
Haven't these problems already been solved by large-scale cloud providers? Sure, hurricanes take out datacenters for diesel, but Google runs racks with dead nodes until it hit's a percentage where it makes sense to 'roll the truck' so-to-speak and get a tech onsite to repair the racks.
The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down. That's literally the purpose of basic cluster technology from probably 10 years ago.
How do they act like this is a new, or magic issue? It doesn't exist if HPC people know what they're doing. Hell, usually they keep a known quantity of extra hardware out of use so that they can switch something on if things fail as necessary.
you start the job over.
You make sure that a single job's run time x the number of nodes is not so large that the chance of that job running to completion is not unreasonable.
On the previous ones I worked on the 60% job failure rate was around 100 nodes for 5 days, that comes down to the chance of a single node failing on a given day is .999 (you lose 1 out of 1000 nodes each day from something). The math is rather simple...0.999^500=60%. And in general you don't put dual power supplies, you don't mirror the disks...rerunning the jobs that failed is cheaper than increasing the node price to add things that only marginally improve things and also increase physical size.
If you have a single process bigger than that you need to setup a checkpointing system.
If you can split big jobs into lots of smaller pieces that can be pretty quickly put together at the end you do so.
On the previous one I was on they used both tricks depending on the exact nature of what was being processed.
For the most part it is not a complicated problem unless you expect unreasonably low failure rates and don't deal with reality.
This is the same method used to handle distributed computing with untrusted nodes. Simply hand off the same problem to multiple nodes and recompute if differences arise.
The real solution is going to involve hardware as well. The nodes themselves will become supernodes with built in redundancy.
Come on now. We've known how to run V8 engines on 7 or even 6 cylinders for years now. Certainly this technology must be in the public domain by now.
I have a demonstration unit I would be happy to part with for a mere $100K.
Have gnu, will travel.
Distributed fault tolerance. Diifficult to scale large and keep supercomputer performance. But still room for improvement.
You are being MICROattacked, from various angles, in a SOFT manner.
In most modern problems simply scaling the code to that level of parallelism is almost always an unsolved problem. There is still a lot of work to be done in this arena. Sparse linear algebra for example has serious scaling issues beyond a certain number of nodes, irregular memory access and communication patterns really bungle up parallelism.
Could they try turning it off then on again?
. .
The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down.
You might think so, but I've seen a configuration with an interconnect fabric that was extremely sensitive to the fallback of individual links to the next lower link speed cause all sorts of havoc cluster wide.
The joke in the industry is that supercomputing is a synonym for unreliable computing. Stuff like checkpoint-restart was basically invented on super-computers because it was so easy to lose a week's worth of computations to some random bug. When you have one-off systems or even 100-off systems you just don't get the same kind of field testing that you get regular off-the-shelf systems that sell in the millions.
Now that most "super-computers" are mostly just clusters of off-the-shelf systems we get a different root cause but the results are the same. The problem now seems to be that because the system is so distributed so is the state of the system - with a thousand nodes you've got a thousand sets of processes and ram to checkpoint and you can't do the checkpoints local to each node because if the node dies, you can't retrieve the state of that node.
On the other hand, I am not convinced that the overhead of checkpointing to a neighboring-node once every few of hours is really all that big of a problem. Interconnects are not RAM speed, but with gigabit+ speeds you should be able to dump the entire process state from one node to another in a couple of minutes. Back-of-the-napkin calculations say you could dump 32GB of ram across a gigabit ethernet link in 10 minutes with more than 50% margin for overhead. Doing that once every few hours does not seem like a terrible waste of time.
When information is power, privacy is freedom.
A machine like Kei in Kobe does live rerouting and migration of processes in the event of node or network failure. You don't even need to restart the affected job or anything. Once the number of nodes and cores exceed a certain level you really have to assume a small fraction are always offline for whatever reason, so you need to design for that.
Trust the Computer. The Computer is your friend.
"hegemonous", wow.
I think you're confusing high-availability clustering with high-performance clustering. in HPC, there are some efforts at making single jobs fault-tolerant, but it's definitely not widespread. checkpointing is the standard, and it works reasonably, though is an IO-intensive way to mitigate failure.
The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability.
Yeah, no surprise there. Historically, kings have never cared about what happens to the peons.
When information is power, privacy is freedom.
DARPA's report ( http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/exascale_final_report_100208.pdf ) has a lot of interesting information for those who want to read more on exascale computing. I may be a bit biased being a grad student in HPC too, but the linked article didn't impress me.
So, exactly the same problem everybody with a large data centre has had for years. If it was competently designed, few, if any, of the failures will lead to the whole computer failing.
Reminds me of ENIAC. I went to the Moore School at U of Penn and ENIAC parts were in the basement (may still be there, for all I know). The story was that since it was all vacuum tubes, at the beginning they were lucky to get it to run for more than 10 minutes before a tube blew.
That being said, I can't believe that supercomputers don't have some kind of supervisory program to detect bad nodes and schedule around them.
I don't read your sig. Why are you reading mine?
I don't think you understand the nature of the problem.
The more nodes N you use for a computation and the longer the computation runs, the greater the odds that a node will fail before the computation can complete. Small programs ignore the problem and usually work just fine. Should a node crash, the job is re-run from scratch.
More sophisticated programs write out a checkpoint periodically so that if (when) a node fails, it is replaced and the job restarts from the checkpoint. However, that is not without cost. Computation isn't happening when state is being written out. However, so far, it's been manageable.
The issue in TFA is that this will not scale. As you add nodes, the MTBF gets much smaller (and so checkpointing has to become more frequent). Eventually you would reach a point where adding nodes actually slows the computation down because of the necessary overhead of checkpointing. Add enough nodes and eventually it does nothing but checkpoint over and over. That's a classic scalability problem and it has no obvious solutions that don't involve a magic wand.
I am sorry to say it, but this Phd student has no clue. Dealing with a node failure is not a problem with proper, modern supercomputing programming practices as well a OS/system software. There is an amazing programming technique called "checkpointing", developed a while ago. This allows you to periodically to "checkpoint" your application, essentially saving off the system call stack, the memory, register values, etc., etc., to a file. The application is also coded to check to see if that file exists, and if it does, to load all those values back into memory, registers, call stack, and then continue running from that point. So in the event of a hardware failure, the application/thread is simply restarted on another node in the cluster. That is application level checkpointing, there is also OS level checkpointing, which essentially does the same thing, but at the OS level irregardless of the processes running on the system, allowing for anything running on the entire machine to be checkpointed and restarted from that spot.
Then there is the idea of a master dispatcher, which essentially breaks down the application into small chunks of tasks, and then sends those tasks to be calculated/performed on a node in the cluster. If it does not get a corresponding return value from the system it sent the task within a certain ammount of time, it re-sends to another node (and marking the other node as bad and not sending future tasks to it until that value is cleared).
Both of these methods fix the issue of having possible nodes which die on you during computation.
We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"
The point of the article was that as the number of nodes involved in the calculation increases, the frequency at which at least one of them fails increases too (provided that the individual node failure rate is kept constant). Since you want at least one checkpoint between each typical failure, you would therefore have to checkpoint more and more often as the number of nodes is increased. Hence, the overhead involved with checkpointing goes up as the number of nodes involved increases, and with 100 times more nodes than most clusters use now, this overhead grows to overwhelm the amount of resources used for the actual calculation.
This problem applies to all global modes of handling failures, i.e. when the whole system stops due to the failure of one or a few nodes.
Checkpoint/restart is actually a rather poor workaround as the ration of IO bandwidth to compute performance is shrinking with every new generation of supercomputers. Soon enough we'll spend more time writing checkpoints than doing actual computations.
I personally believe that we'll see some sort of redundancy on the node level in the mid-term future (i.e. the road to exascale), which will sadly require source code-level adaptation.
Computer simulation made easy -- LibGeoDecomp
With many simulations, say something like climate modelling or fluid dynamics, the output from one node (a particular region of space) is used as input to the adjacent neighbors. So if that node fails, it would propagate a delay. It seems that you would want some kid of "buddy" system where if one node were to keel over, the neighbors could take up the slack.
...doesn't understand the first thing about supercomputers, or even HPC. Currently virtually every HPC application uses MPI. And MPI doesn't take well to failing nodes. The supercomputer as a whole might still work, but the job will inevitably crash and needs to be restarted. HPC apps are usually tightly coupled. That sets them apart from loosely coupled codes such as a giant website (e.g. Google and friends)
Fault tolerance is a huge problem in the community and we don't have the answers yet. Some say that fault tolerance within the MPI layer (e.g. here) will be sufficient. I personally very much doubt that. My bet is on higher-level frameworks, e.g. HPX, which can "abstract away" the location of a task from the node where its actually being executed.
Computer simulation made easy -- LibGeoDecomp
Just make checkpointing cost zero time. How, have each node, really be a dual node ( like most things in nature are ). So one half is always computing, and the other half is check pointing. Just like video games use of double buffering to do smooth fps. If checkpointing uses less cpu than comps, then swap the cores functionality every N runs to give each core a temperature rest, to cool down, to increase life.
Sure each node is bigger, but it could perhaps scale better, the overhead curve should be way better, and flat.
Hire some big phds to design mitigation mechanisms to increase MTBF by better environmental management, I dunno, give nodes a rest for 30 mins every 5 days. I dunno, more research needs to be done, and that takes tonne of time, (unless you can simulate it on a super computer) doh.
Just a thought.
Liberty freedom are no1, not dicks in suits.
Well, it seems obvious that you need to distribute your checkpoints. Instead of saving the global state, you must save smaller subsets of it.
Now, writting a program that does that is no easy task. That's why it is a problem. But don't think it can't be solved, because it can.
Rethinking email
There has been over 30 years research and utilization of ACID and High Availability techniques; ...
WAL
2 Phase Commit
Hot Standby
Replication
in database technology and products. Surely its time the HPC community started looking at and using some of these techniques rather re-inventing the wheel.
It may seem preposterous, but its perfectly feasible to run CUDA and OpenCL code as Database Stored Procedures on GPU's. Stored Procs implement a
natural Map-Reduce paradigm, Database technology isn't the complete solution but its a good starting point.
Is this something that can be partially avoided by using Itanium processors instead of X86? Or has all of the reliability stuff been included in the recent Xenon chips?
I understand that there's little the processor can do if the mobo dies (unless it notices an increase in failed computations right before maybe it could send a signal) but is there any advantage to the type of processor used in some cases?
Just curious...
Cheers
The guy in the article that says that running the redundant copies on the same nodes would reduce i/o traffic: I'd love to speak to him. There are two options I see:
1) Assuming that there is common source data that both instances need to churn on: so the data isn't redundant so what exactly are you proving by getting the same result? Diddo with CPU, integer unit etc same hardware is not a redundant solution.
2) No shared data but you are generating data on each node: so they still have to chat with eachother to synchronize and such. So now you have 1/3 of a CPU node and 1/3 of I/O. But wait that's not all you also are trashing the hell out of your cache, if any of the stuff is on disk you are pretty much guaranteeing that at least one of the instances on the node has to wait for the drive latency since one or the other is going to get the disk first etc.
Either way you still have the problem of a node blowing up and all the simultaneous copies of the sim dying together: In short you can't call something a redundant system without actually having redundancy.
I am already assuming that each node saves it's own state to a neighboring node. That speeds up the checkpoints and improves the scalability but doesn't fundamentally address the problem.
Is this something that can be partially avoided by using Itanium processors instead of X86? Or has all of the reliability stuff been included in the recent Xenon chips?
No, not related to the processor, just a specific inter-node interconnect fabric implementation. The moral of the story is: just be aware that buying stuff that has just become available and trying to deploy it at scales well beyond what others (including the vendor) have done before leads to an effectively experimental configuration, and is not one that you should expect to behave like a production environment should for a period of time until the kinks are worked out. Of course, this plays hell with the overly optimistic schedules management thought should be achievable.
That does address the problem. Once you each node saves its state to enough other nodes, you don't need to make a checkpoint anymore.
But then, you must write code that can work with just partial checkpoints, what is quite hard.
Rethinking email
Saving state IS checkpointing, you know. And you cannot continue computation with missing state other than by re-computing it (which would be considered part of loading the checkpoint). The checkpoint images must also be coherent. Otherwise, you have lost state.
.
In the end, it just means that the maximum useful size of a cluster is limited by the reliability of the nodes.
Ultimately the problem may be sidestepped by going to non-stop style nodes (expensive as hell) but then your scalability is limited by how fast you can swap out the broken parts (alternatively, your run-time is limited by how many hot spares you can stuff in at the beginning). Now, the question comes down to how fast is the non-stop hardware (there has always been a speed penalty for that approach).
That's the restriction that must be removed. If your program can work with non-coherent subsets of the checkpoint, you won't have this scalability problem. And it is something possible, for any computation.
Rethinking email
I'll be needing to see proof of that one.
If you nodes communicate only through message passing (and you can compute anything by communicating only through message passing), when restoring a node if you restore any state after receiving or transmitting the last message, it can't make any difference.
But then, that algorithm is not viable. It creates a perfectly descentralized checkpoint, without needing any coherence, but it needs too many checkpoints. But between enforcing coherence only of a single node, and enforcing coherence of the entire computer there is a huge space with different trade-offs between the number of checkpoints and the cost of each one.
Alternatively, you can enforce transactions on each message. But that probably isn't universal due to dead-locks (I have no hard proff either way).
Rethinking email
You can't checkpoint jobs at this scale. It will take longer to checkpoint a job then to compute an answer. This is further compounded when the job takes several months to run. A 1000 node cluster is very tiny compared to the scale they're talking about.
I agree that it can always be decomposed to message passing, even shared memory is just a particular way of passing messages.
However, the state of a node necessarily includes the state of it's peers. If node A is expecting a reply from B while B is expecting a request from A, the computation fails every time.
The only way to manage that without barriers is for the checkpoint to happen upon receipt of EACH message before it is ACKed (which means state is recovered in-part by re-transmission of unACKed messages). But note, you're really only evading the problem and cooking the books since the peers are waiting while you're checkpointing. That is, they are not allowed to advance their state until yours is checkpointed at a step compatible with their current and next state. With a non-stop like scheme, that at least happens very quickly. Of course, that devotes 2/3 of the hardware to the checkpointing process. This is still a coherent checkpointing scheme.
I'm really not so sure there is a reasonable middle ground between that and the big global checkpoint (that is performed in parallel, of course). Sorta consistent = inconsistent.
We'll do just fine, this is just a doom and gloom article. Sure things fail, but will they fail before their useful life? If anything it's getting better with more cores per cpu, work spread across many nodes, etc. Just someone looking for some research dollars to analyze a problem that doesn't exist.
The only reason we sill use the same basic technology we have been using for over a decade now is because by going to the next step the government, and many corporations would have to admit to covering up Free Energy and many of Nikola Tesla's inventions. They have to figure out a way to slowly integrate advanced technology into the current system. Just like how we slowly integrate military technology into the consumer market.
I was at the talk @ SC. The guy's team has written an MPI library called red which does some basic check-summing and voting between nodes. big deal. Their presentation had jack to do with hardware failure mitigation. It was sort of interesting and yet underwhelming. Not bad on the overheard though.
this article summary is useless. as usual from samzenpus. sam should learn to read, and learn some computing.
You are right. I was overlooking that problem with the algorithm (one must have the checkpoint before it acks the message, otherwise the node can't restore to "any state after receiving the message").
But now that I've tought about it, transactions are way more interesting than I was assuming.
Rethinking email
We have addressed issues lie this using various methods over the years. Super computing is just a current area where doing pre-emptive issue resolutions has come in play.
More obvious ways of HELPING (not single item 'resolving') this issue include:
-- Parallelism ... similar to RAID for data storage but in various related areas ... yep, make it out of 'better stuff'. If you have something 'good' make it better, or 'harder' (like radiation hardening or EMP hardening),
-- High Availability and ruggedization
-- Multi-pathing of computation and data - we did this back in the '70s& '80s when reducing single points of failure, setting up multiple paths for data in computers, building in diagnostic facilities both in-band and out-of-band to be preemptive on detecting failure options before it causes issues.
-- Harden facilities and environmentals - buildings, power, HVAC, ventilation, humidity, water, earthquake and storm vulnerabilities can often be handled but not easily or cheaply after a new facility has been built.
-- Redundancy of MPP, and Vertical MIPS of few but larger CPUs, RISC processing - all have been used to make processing faster, cheaper, more reliable over the years in various implementations.
These are just a few techniques off the top of my head I can remember I have been associated with to make computing more 'bullet proof' (yes, we did put kevlar wallboard up too! It protected from 'postal visitors' mainly). So with a bit more work many issues can be resolved by using these and other techniques.
Yes, reliability is an issue, but it always has been, and even if addressed now, will be the next time a 'new' design comes out unless it is re-addressed every step of the way.
... "When you pry the source from my cold dead hands."
This sounds like someone pimping their thesis by proxy. Since when is this news? This problem was solved long ago. I recall Sequoia computer being a big player in the 80's and 90's in this market. Solution? Redundancy and monitoring of all critical systems.