Supercomputers' Growing Resilience Problems

Posted by samzenpus on Wednesday November 21, 2012 @12:01PM from the a-thousand-potential-cuts dept.

angry tapir writes "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more, with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D. student at North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."

Hardly A New Problem by MightyMartian · 2012-11-21 12:04 · Score: 5, Informative

Strikes me as a return to the olden days of vacuum tubes and early transistor computers, where component failure was frequent and brought everything to a halt while the bad component was hunted down.
In the long run if you're running tens of thousands of nodes, then you need to be able to work around failures.

--
The world's burning. Moped Jesus spotted on I50. Details at 11.
1. Re:Hardly A New Problem by CastrTroy · 2012-11-21 13:07 · Score: 2
  
  I was thinking that if you had 100,000 nodes, a certain percentage of them could be dedicated to failover when one of the nodes goes down. The data would exist in at least 2 nodes, kind of like RAID but for entire computers. If one node goes down, another node can take its place. The cluster would only have to pause a short time while the new node got the proper contents into memory along with the instructions it needed to run. You'd need to do some kind of coding so a server could pick up as close as possible to where the other one died, but at least it wouldn't require an actual human to walk in and replace a part for the whole cluster to continue.
  
  --
  
  Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
2. Re:Hardly A New Problem by Anonymous Coward · 2012-11-21 18:12 · Score: 2, Insightful
  
  I think you're missing the fact that when a node dies in the middle of a 1024-core job that has been running for 12h you normally lose all the MPI processes and, unless the job has been checkpointed, everything that has been computed so far.
  
  It's not just about hunting and replacing the dead nodes, it's about the jobs' resilience to failure.
"they halt operations when they [fail]" by Let's+All+Be+Chinese · 2012-11-21 12:15 · Score: 2

Ah-ha-ha! You wish failing kit would just up and die!
But really, you can't be sure of that. So things like ECC become minimum requirements, so you might at least have an inkling something isn't working quite right. Because otherwise, your calculations may be off and you won't know about it.
And yeah, propagated error can, and on this sort of scale eventually will, suddenly strike completely out of left field and hit like a tacnuke. At which point we'll all be wondering, what else went wrong and we didn't catch it in time?
Componentry? by Beardydog · 2012-11-21 12:19 · Score: 2

Is an increasing amount of componentry the same thing as an increasing number of components?
1. Re:Componentry? by Anonymous Coward · 2012-11-21 12:29 · Score: 5, Funny
  
  Componentry is embiggened, leading to less cromulent supercomputers?
2. Re:Componentry? by blade8086 · 2012-11-21 14:19 · Score: 2
  
  No.
  an increasing amount of componentry has increased componentry with respect to an increasing number of components.
  In other words,
  componentry > components
  increasing amount > increasing number.
  Therefore
  componentry + increasing amount >> components + increasing number
  however
  componentry + increasing number =? increasing amount + components
  unfortunately, to precisely determine the complexity of the componentry,
  more components (or an increasing number of componentry) with respect to the original summary, are required.
"and they halt operations when they do so" by brandor · 2012-11-21 12:29 · Score: 5, Informative

This is only true in certain types of supercomputers. The only one we have that will do this is an SGI UV-1000. It surfaces groups of blades as a single OS image. If one goes down, the kernel doesn't like it.
The rest of our supercomputers are clusters and are built so that node deaths don't affect the cluster at large. Someone may need to resubmit a job, that's all. If they are competent, they won't even lose all their progress, because they'll be using check-pointing.
Sensationalist titles are sensationalist I guess.
1. Re:"and they halt operations when they do so" by Tirian · 2012-11-21 13:32 · Score: 2
  
  Many supercomputers that utilize specialized hardware just can't take component failure. For example, on a Cray XT5, if a single system interconnect link (SeaStar) goes dead the entire system will come to a screeching halt, because with SeaStar all the interconnect routes are calculated at boot and cannot update during operation. In any tightly coupled system these failures are a real challenge, not just because the entire system may crash, but also because users may submit jobs requesting 50,000 cores when only 49,900 cores are available.
  Checkpoints are necessary, but in large-scale situations they are often difficult. You usually have a walltime allocation for your job and you certainly don't want to use 20% of it writing checkpoint files to Lustre (or whatever high-performance filesystem you are utilizing). Perhaps frequent checkpointing works on smaller systems/jobs, but for a capability job on a large system you are talking about a significant block of non-computational cycles being burned.
2. Re:"and they halt operations when they do so" by Anonymous Coward · 2012-11-21 16:20 · Score: 2, Informative
  
  Pretty much all MPI-based codes are vulnerable to single node failure. Shouldn't be that way but it is. Checkpoint-restart doesn't work when the time to write out the state is greater than MTBF. The fear is that's the path we're on, and will reach that point within a few years.
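To put rough numbers on the "time to write out the state is greater than MTBF" worry, here is a minimal sketch. It assumes node failures are independent and that any single node failure kills the job, so the machine-wide MTBF is the per-node MTBF divided by the node count. The function names, the 5-year per-node MTBF and the 30-minute checkpoint time are all made up for illustration.

```python
def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """MTBF of the whole machine when any single node failure counts as a failure.

    Assumes independent node failures, so failure rates simply add up.
    """
    return node_mtbf_hours / nodes


def checkpoint_is_viable(checkpoint_write_hours: float,
                         node_mtbf_hours: float,
                         nodes: int) -> bool:
    """True while one checkpoint can be written faster than the machine fails."""
    return checkpoint_write_hours < system_mtbf_hours(node_mtbf_hours, nodes)


# Hypothetical figures: each node fails once every 5 years on average,
# and writing one full checkpoint takes 30 minutes.
node_mtbf = 5 * 365 * 24
for n in (1_000, 100_000, 1_000_000):
    print(n, round(system_mtbf_hours(node_mtbf, n), 3),
          checkpoint_is_viable(0.5, node_mtbf, n))
```

Under these assumed figures the checkpoint comfortably fits at 1,000 nodes and no longer fits at 100,000, which is the trend the comment is warning about.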
3. Re:"and they halt operations when they do so" by pereric · 2012-11-21 21:47 · Score: 2
  
  Can you really predict if it will halt?
On that scale by shokk · 2012-11-21 12:31 · Score: 2

On that scale, distributed parallelism is key, where the system takes into account downed nodes and removes them from duty until they can return to service, or can easily add a replacement node to handle the stream. That's why Google and Facebook don't go down when a node fails.

--
"Beware of he who would deny you access to information, for in his heart, he dreams himself your master."
Re:ummm, no. by fintler · 2012-11-21 13:00 · Score: 3

Google is having the same problems that this article describes -- they haven't fixed it either.
If your problem domain can always be broken down into map-reduce, you can easily solve it with a Hadoop-like environment to get fault tolerance. If your application falls outside of map-reduce (the applications this article is referring to), you need to start duplicating state (very expensive on systems of this scale) to recover from failures.
Re:Hardly A New Problem...and thus has been fixed by poetmatt · 2012-11-21 13:04 · Score: 3, Insightful

The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down. That's literally the purpose of basic cluster technology from probably 10 years ago.
How do they act like this is a new, or magic issue? It doesn't exist if HPC people know what they're doing. Hell, usually they keep a known quantity of extra hardware out of use so that they can switch something on if things fail as necessary.
Re:Old problem by Anonymous Coward · 2012-11-21 13:18 · Score: 2, Informative

you start the job over.
You make sure that a single job's run time x the number of nodes is not so large that the chance of that job running to completion becomes unreasonably low.
On the previous ones I worked on, the 60% job completion rate was around 100 nodes for 5 days. That comes down to the chance of a single node surviving a given day being .999 (you lose 1 out of 1000 nodes each day from something). The math is rather simple...0.999^500=60%, so about 40% of such jobs fail. And in general you don't put in dual power supplies and you don't mirror the disks...rerunning the jobs that failed is cheaper than increasing the node price to add things that only marginally improve things and also increase physical size.
If you have a single process bigger than that you need to set up a checkpointing system.
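The 0.999^500 arithmetic can be checked with a few lines. This is only a sketch: it assumes node failures are independent and that each node gets through a given day with probability 0.999, and the function name is invented for illustration.

```python
def job_completion_probability(nodes: int, days: float,
                               daily_node_survival: float = 0.999) -> float:
    """Chance that no node fails during the job, assuming independent failures."""
    node_days = nodes * days
    return daily_node_survival ** node_days


p = job_completion_probability(nodes=100, days=5)
print(f"chance the job runs to completion: {p:.3f}")   # about 0.606
print(f"chance at least one node fails:    {1 - p:.3f}")
```

Doubling either the node count or the run time squares the completion probability, which is why the node-days product is the number to watch.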
If you can split big jobs into lots of smaller pieces that can be pretty quickly put together at the end you do so.
On the previous one I was on they used both tricks depending on the exact nature of what was being processed.
For the most part it is not a complicated problem unless you expect unreasonably low failure rates and don't deal with reality.
Re:Hardly A New Problem...and thus has been fixed by cruff · 2012-11-21 14:34 · Score: 2

The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down.
You might think so, but I've seen a configuration whose interconnect fabric was extremely sensitive to individual links falling back to the next lower link speed, and that caused all sorts of havoc cluster-wide.
Not Really New by Jah-Wren+Ryel · 2012-11-21 15:07 · Score: 3, Insightful

The joke in the industry is that supercomputing is a synonym for unreliable computing. Stuff like checkpoint-restart was basically invented on super-computers because it was so easy to lose a week's worth of computations to some random bug. When you have one-off systems or even 100-off systems you just don't get the same kind of field testing that you get with regular off-the-shelf systems that sell in the millions.
Now that most "super-computers" are mostly just clusters of off-the-shelf systems we get a different root cause but the results are the same. The problem now seems to be that because the system is so distributed so is the state of the system - with a thousand nodes you've got a thousand sets of processes and ram to checkpoint and you can't do the checkpoints local to each node because if the node dies, you can't retrieve the state of that node.
On the other hand, I am not convinced that the overhead of checkpointing to a neighboring node once every few hours is really all that big of a problem. Interconnects are not RAM speed, but with gigabit+ speeds you should be able to dump the entire process state from one node to another in a couple of minutes. Back-of-the-napkin calculations say you could dump 32GB of RAM across a gigabit ethernet link in 10 minutes with more than 50% margin for overhead. Doing that once every few hours does not seem like a terrible waste of time.
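The napkin math works out as follows. This sketch assumes decimal units (1 GB = 10^9 bytes, 1 Gbit/s = 10^9 bits per second) and a link that sustains its full line rate before overhead; the function name is hypothetical.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal time to push size_bytes over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


ram_bytes = 32 * 10**9      # 32 GB of node memory
link_bps = 1 * 10**9        # gigabit ethernet
budget_seconds = 10 * 60    # the 10-minute window from the comment

ideal = transfer_seconds(ram_bytes, link_bps)
margin = 1 - ideal / budget_seconds
print(f"ideal transfer: {ideal:.0f} s, slack left in 10 min: {margin:.0%}")
# prints: ideal transfer: 256 s, slack left in 10 min: 57%
```

So the raw transfer takes a bit over four minutes, leaving roughly 57% of the 10-minute window for overhead.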

--
When information is power, privacy is freedom.
Re:Google/FB/etc are Embarassingly Parallel by Jah-Wren+Ryel · 2012-11-21 15:18 · Score: 2

And for reference, what you describe in your first paragraph is EXACTLY a MapReduce problem. First 100 nodes Map, second hundred nodes Reduce the results. Rinse, repeat.
No it's not. The problem with your description is the "rinse, repeat" part. He's not talking about repeating with new input data. He's talking about a serialized workload where, for example, the output of the first 100 jobs is the input for the next 100 jobs, which then creates output that is the input for the next 100 jobs. It's not a case of repeating, it's a case of serialization where if you have not done state check-point and things crater you have to start from the beginning to get back where you were. No "standard of coding" can fix that.

--
When information is power, privacy is freedom.
Re:Hardly A New Problem...and thus has been fixed by markhahn · 2012-11-21 15:37 · Score: 3, Informative

"hegemonous", wow.
I think you're confusing high-availability clustering with high-performance clustering. In HPC, there are some efforts at making single jobs fault-tolerant, but it's definitely not widespread. Checkpointing is the standard, and it works reasonably, though it is an I/O-intensive way to mitigate failure.
Phd with no clue... by Fallen+Kell · 2012-11-21 20:14 · Score: 2

I am sorry to say it, but this Ph.D. student has no clue. Dealing with a node failure is not a problem with proper, modern supercomputing programming practices as well as OS/system software. There is an amazing programming technique called "checkpointing", developed a while ago. This allows you to periodically "checkpoint" your application, essentially saving off the system call stack, the memory, register values, etc., to a file. The application is also coded to check to see if that file exists, and if it does, to load all those values back into memory, registers, and call stack, and then continue running from that point. So in the event of a hardware failure, the application/thread is simply restarted on another node in the cluster. That is application-level checkpointing. There is also OS-level checkpointing, which essentially does the same thing, but at the OS level regardless of the processes running on the system, allowing anything running on the entire machine to be checkpointed and restarted from that spot.
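A toy version of the application-level pattern described above might look like this. It is a sketch under loud assumptions: the "state" is just a loop counter and a running sum that the application chooses to save (not a call stack or registers), the node failure is simulated with an exception, and the names `run_with_checkpoints` and `crash_at` are invented for the example.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, checkpoint_path, every=100, crash_at=None):
    """Sum the integers 0..total_steps-1, saving progress every `every` steps."""
    # On startup, check whether a checkpoint file exists and resume from it.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            # Write to a temporary file and rename it into place, so a crash
            # in the middle of a write never leaves a half-written checkpoint.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, checkpoint_path)
    return state["total"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run_with_checkpoints(1000, path, crash_at=550)   # dies after the step-500 checkpoint
except RuntimeError:
    pass
result = run_with_checkpoints(1000, path)            # restarts from step 500, not step 0
print(result == sum(range(1000)))
```

On a real cluster the checkpoint file would have to live on storage that survives the node, which is exactly where the I/O cost discussed elsewhere in this thread comes from.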

Then there is the idea of a master dispatcher, which essentially breaks down the application into small chunks of tasks, and then sends those tasks to be calculated/performed on a node in the cluster. If it does not get a corresponding return value from the node it sent the task to within a certain amount of time, it re-sends the task to another node (marking the original node as bad and not sending future tasks to it until that flag is cleared).
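The dispatcher idea can be sketched in a single process. Everything here is an assumption made for illustration: nodes are stand-in Python callables, a missed deadline is modelled by raising a `NodeTimeout` exception instead of using a real timer, and tasks are handed out round-robin.

```python
from collections import deque


class NodeTimeout(Exception):
    """Raised by a simulated node that does not answer in time."""


def dispatch(tasks, nodes):
    """Run each task on a healthy node; on timeout, mark the node bad and retry.

    `nodes` maps a node name to a callable standing in for remote execution.
    Returns the results keyed by task, plus the set of nodes marked bad.
    """
    healthy = deque(nodes)          # node names still considered good
    bad = set()
    results = {}
    pending = deque(tasks)
    while pending:
        if not healthy:
            raise RuntimeError("no healthy nodes left")
        task = pending.popleft()
        name = healthy[0]
        healthy.rotate(-1)          # round-robin over the healthy nodes
        try:
            results[task] = nodes[name](task)
        except NodeTimeout:
            bad.add(name)
            healthy.remove(name)    # no future tasks go to this node
            pending.appendleft(task)  # re-send the same task elsewhere
    return results, bad


def good_node(task):
    return task * task


def dead_node(task):
    raise NodeTimeout()


results, bad = dispatch(range(5), {"n0": good_node, "n1": dead_node, "n2": good_node})
print(results, bad)
```

This only works when tasks are independent of each other, so a lost task can be recomputed anywhere without touching the rest.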

Both of these methods fix the issue of having possible nodes which die on you during computation.

--
We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"
Whoever did mod the parent up.... by gentryx · 2012-11-21 22:40 · Score: 2

...doesn't understand the first thing about supercomputers, or even HPC. Currently virtually every HPC application uses MPI. And MPI doesn't take well to failing nodes. The supercomputer as a whole might still work, but the job will inevitably crash and needs to be restarted. HPC apps are usually tightly coupled. That sets them apart from loosely coupled codes such as a giant website (e.g. Google and friends).
Fault tolerance is a huge problem in the community and we don't have the answers yet. Some say that fault tolerance within the MPI layer (e.g. here) will be sufficient. I personally very much doubt that. My bet is on higher-level frameworks, e.g. HPX, which can "abstract away" the location of a task from the node where it's actually being executed.

--
Computer simulation made easy -- LibGeoDecomp
easy solution by cheekyboy · 2012-11-22 00:20 · Score: 2

Just make checkpointing cost zero time. How? Have each node really be a dual node (like most things in nature are), so one half is always computing and the other half is checkpointing, just like video games' use of double buffering for smooth fps. If checkpointing uses less CPU than computation, then swap the cores' roles every N runs to give each core a rest to cool down and increase its life.
Sure, each node is bigger, but it could perhaps scale better; the overhead curve should be way better, and flat.
Hire some big PhDs to design mitigation mechanisms to increase MTBF through better environmental management; I dunno, give nodes a rest for 30 mins every 5 days. More research needs to be done, and that takes a tonne of time (unless you can simulate it on a supercomputer), doh.
Just a thought.

--
Liberty freedom are no1, not dicks in suits.