Slashdot Mirror


Supercomputers' Growing Resilience Problems

angry tapir writes "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more — with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D student at the North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."

112 comments

  1. Hardly A New Problem by MightyMartian · · Score: 5, Informative

    Strikes me as a return to the olden days of vacuum tubes and early transistor computers, where component failure was frequent and brought everything to halt while the bad component was hunted down.

    In the long run if you're running tens of thousands of nodes, then you need to be able to work around failures.

    --
    The world's burning. Moped Jesus spotted on I50. Details at 11.
    1. Re:Hardly A New Problem by Anonymous Coward · · Score: 1

      Furthermore, there's also the race for higher density, if you put 48 cores on a single box or a single chip, expect that one core failing will cause issues that need to be addressed from somewhere else, and that replacing the "system" will probably require 47 good cores to be quickly replaced.

      As always, if you condense everything, you can expect that one piece in the whole system that breaks, causes failures in the rest of the system.

    2. Re:Hardly A New Problem by Tontoman · · Score: 1

      Here is an opportunity to leverage the power of the HPC to detect and correct and workaround any integrity issues in the system. Not a new problem, just the opportunity for productive application of research.

    3. Re:Hardly A New Problem by CastrTroy · · Score: 2

      I was thinking that if you had 100,000 nodes, that a certain percentage of them could be dedicated for fail over when one of the nodes goes down. The data, would exist in at least 2 nodes, kid of like RAID but for entire computers. If one node goes down, another node can take it's place. The cluster would only have to pause a short time while the new node got the proper contents into memory along with the instructions it needed to run. You'd need to do some kind of coding so a server could pick up as close as possible to where the other one died, but at least it wouldn't require an actual human to walk in and replace a part for the whole cluster to continue.

      --

      Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
    4. Re:Hardly A New Problem by guruevi · · Score: 1

      Problem had already been solved by a number of HPC schedulers.

      Whoever this person is doesn't know shot about running a cluster but the problem has long been solved somewhere in the mid-80s

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    5. Re:Hardly A New Problem by hairyfeet · · Score: 1

      But unlike in those days, where you had to basically go down row after row hunting for a bad solder joint or dead tube, this problem seems to be...well frankly easily fixed with just a little thought. After all we already have diagnostic programs for all the basic components, RAM, CPU, HDD, etc so i don't see why those programs couldn't be cooked into a ROM and give us a simple pass/fail indicator light on the front of the node. if you wanted you could have a pass/fail light for each of the major components but frankly just a simple pass/fail for the node would be all that it would take, then you just yank and replace the node and worry about diagnosing what went wrong with the bad node later...or hell even never, just have the unit sent off and refurbed where they can test all the components and then sell off the refurbed unit or put it back in storage for when another node dies.

      Frankly with the ability to make little embedded ARM chips that take less than a watt and can run simple programs off a ROM like a diagnostics I don't see why this would be a problem, it just takes a little thought in the design beforehand.

      --
      ACs don't waste your time replying, your posts are never seen by me.
    6. Re:Hardly A New Problem by loose+electron · · Score: 1

      Most MPP machines that I am familiar with have a system where the status and functionality of all nodes is checked as part of a supervisory routine and mapped out of the system. Bad Node? It goes on the list for somebody to go out and hot swap out the device. Processing load gets swapped to another machine.

      Once the new device is in place that same routine brings that now functioning processor back into the system.

      That sort of thing has existed for at least 10 years and probably longer.

      --
      www.effectiveelectrons.com "chips that work" Analog, RF, Mixed Signal
    7. Re:Hardly A New Problem by Anonymous Coward · · Score: 2, Insightful

      I think you're missing the fact that when a node dies in the middle of a 1024-core job that has been running for 12h you normally lose all the MPI processes and, unless the job has been checkpointed, everything that has been computed so far.

      It's not just about hunting and replacing the dead nodes, it's about the jobs' resilience to failure.

    8. Re:Hardly A New Problem by hairyfeet · · Score: 1

      Well frankly you shouldn't be writing programs that need every damned core of a 1024 node PC and using an OS that can't shift the load on a fail, should you?

      hell most of these units are running Linux and if it is one thing i give Linux props for its being able to be customized for big ass jobs like this. Have a 10%-20% leeway when it comes to nodes and have the OS simply move the load from node 146 to 187(B) for backup and call it a day. again while this isn't 2+2 frankly its not this insurmountable problem like they are making it out to be, this IS doable with a little thought beforehand.

      --
      ACs don't waste your time replying, your posts are never seen by me.
    9. Re:Hardly A New Problem by Anonymous Coward · · Score: 0

      Original AC here. You haven't done much supercomputing, have you. You seem to fail to grasp that there are dependencies between the MPI tasks running.

      The job is a 1024-core job and it's not using all cores like you'd think -- rather, it is using a fraction of a, say, 50 Kcore machine that it just got allocated from the batch system. Even if there are nodes to spare from the backfill, there was data ("state") on the node that you've just lost due to, say, a mobo failure. You can't just "move the load", because you'd have to "undo" the entire job to the previous checkpoint. With big jobs it's not feasible to checkpoint, because there are hundreds of GB of address space to be saved to disk. Hence, your only option is to abort the job. Or wait for the new MPI standard which promises resilience.

    10. Re:Hardly A New Problem by cheekyboy · · Score: 1

      ok

      if motherboard #325 dies and kills 8 nodes at once, so if that requires a 12 hour job to re-run why not just have always several dozen spare idle nodes waiting to do that 12hr job in 1hr to catch up.

      Or are you saying out of 1 million nodes, 1 fails every 1/2 hour, so you need 24 spare to catchup, which is feasable.

      Or are you saying failure rate is 1 every minute? which would require 720 spare nodes every hour , which is a lot of new racks being installed daily.

      If it scales linearly, then its ok, and wont drag to a stand still, but if it doesnt, then it will eventually scale down until it stops falling and reaching equilibrium..... and if hardware improves in time, it will start scaling automatically due to less failures.

      --
      Liberty freedom are no1, not dicks in suits.
    11. Re:Hardly A New Problem by Anonymous Coward · · Score: 1

      if motherboard #325 dies and kills 8 nodes at once, so if that requires a 12 hour job to re-run why not just have always several dozen spare idle nodes waiting to do that 12hr job in 1hr to catch up.

      Let me rephrase, because I cannot seem to get my point across.

      The typical distributed-memory machine has several tens of thousands cores, bundled on, say, 4-core CPUs, bundled on, say, 2-CPU mobos. Each of these PCs is a node. Let us assume your optimistic view that you have spare nodes available.

      Now, I have a job that takes 12h to complete using 1024 cores, so 128 nodes. The job runs as 1024 processes, 8 per node, which communicate over the interconnect through a software layer known as MPI. Memory is local to each process and unless it's an embarassingly parallel problem, the processes need to communicate by sending messages. Say it's 10th hour of runtime when one of the nodes dies. You've just lost the state (RAM contents) of 8 processes out of 1024. No amount of spare nodes is going to bring it back. Sure, you can replace the node hardware by using one of the spare nodes, but not the state -- it's gone. It cannot be recreated from the state of the other nodes. It cannot be recreated by using 80 cores and re-doing "just this bit" 10 times faster, because the bits are interdependent -- you need to run the WHOLE thing again. There is no way I can catch up within 1 hour, well, not unless I have 10240 free cores lying around and a perfectly scaling problem and can re-run the WHOLE thing.

      The usual mechanism for dealing with this is checkpointing -- a glorified way of dumping the state of all processes to disk. But with a 1024-core job this might mean dumping some 1 TB, perhaps 2 TB of data to disk every once in a while.

    12. Re:Hardly A New Problem by hairyfeet · · Score: 1

      I think what the guy is trying to say is his programs are written VERY badly, and instead of saving state after each step of the calculations its an "all or nothing" kind of deal he has going where ANY failure at all throws his badly programming ass back to the start.

      Which if that is the case i gotta say...he deserves to be sent back to the beginning every time any damned thing goes wrong, because he is a dumbass. if he can split the damned load to 1024 CPUs he can damned well save the fucking state because those same messages that HAVE TO BE SENT to keep the 1024 cores on the same page can ALSO be used to send state data.

      so i think we can see the problem cheekyboy, he's doing it back asswards and wrong.

      --
      ACs don't waste your time replying, your posts are never seen by me.
    13. Re:Hardly A New Problem by EETech1 · · Score: 1

      So you would dedicate say 1 or 2 CPUs per node or a node per rack for sniffing all of the intermediate data off of the local highest speed interconnect as it's sent between nodes, and sending it out to a fifo queue on the (or a separate) network to store the intermediate results in case of a failure?

      If you had the last couple chunks of data that every node sent to every other node it would make restarting from a failed node much easier as you would just have to reload the data in, recompute it, and compare the previous outputs and continue on with the job on a different node.

      It would take additional resources per rack for the spare nodes and data recording nodes, as well as possibly a second network and local storage in the rack to save the data, but at the exascale level, this might make sense.

      If you had 8 compute CPUs per node, and 1 more of them that had 16X the RAM and was a storage / supervisory CPU for the whole node, and a whole node per rack doing the same thing would it be able to keep up and would it cover most failures?

      Is this something that could be built right in to future interconnect chips so they can store the last couple of transactions and reroute and dump the data to a different node if something fails?

      It might be expensive, but so is the power required to run these huge computers and everything else about them.

      Cheers

    14. Re:Hardly A New Problem by hairyfeet · · Score: 1

      Exactly. I mean think about it, you ever PRICE a 1024 node rack? Even going AMD that is a hell of a chunk of change, and having to start a calculation aaaaalllll the way back at the beginning if a node fails as the AC was talking about is simply unacceptable. when you look at the power bill alone for a 12 hour calc on a 1024 node going full bore frankly the cost of throwing an 8-16 core per rack just to store current state is frankly cheap compared to the alternative.

      And again when you look at the cost I'd say this is definitely doable, put a RAID 5 of enterprise class HDDs with a couple of 256Gb SSDs for a fast store, and make sure that node has plenty of RAM so you would have a nice fast cache with plenty of buffering...yeah I'd say its doable. you could probably cook up some specialized ARM chips right into the interconnect for future designs but even what we have now with fibre channel i would say it could be done, and again look at how much money you lose by having a 1024 node go full bore for 10 hours like AC said only to have to start over at hour 11 because a node failed.

      Remember when you are talking cost on this scale you have to figure in cost of operation and having to double that cost because a node fails is simply unacceptable, not when you could slap a 16 core with 64Gb of RAM and some SSDs feeding into HDDs to have current state backed up. that way if a node fails just like RAID that data could be rebuilt on a fresh node and the calc could continue instead of starting over so...yeah I still say with some planning it should be doable.

      Oh and Happy Turkey day to you and yours, I'm getting ready to go do the whole roast beast with the fam and in the spirit of the holiday my oldest is bringing a local gal who doesn't have any family, and I hope you and yours look around and make sure there isn't anybody sitting alone with nobody on the holiday. I brought the guest last year but this year the new owner of the apt building is having a HUGE blow out in the back of the building, with a bunch of smokers cooking turkey and ham, so nobody in this building is gonna be alone this Thanksgiving which i thought was DAMN nice and told him so. Happy Holidays!

      --
      ACs don't waste your time replying, your posts are never seen by me.
    15. Re:Hardly A New Problem by Bill+Barth · · Score: 1

      Suppose that your job is computing along using about 3/4ths of the memory on node 146 (and every other node in your job) when that node's 4th DIMM dies and the whole node hangs or powers off. Where should the data that was in the memory of node 146 come from in order to be migrated to node 187?

      There are a couple of usual options: 1) a checkpoint file written to disk earlier, 2) a hot spare node that held the same data but wasn't fully participating in the process.

      Option 2 basically means that you throw away half of your compute resources in case there is a failure. Basically no computations being done today for scientific research are valuable enough to warrant this approach. Some version of Option 1 (periodic checkpointing and restarting) is always more cost effective. These systems, in the US at least, are generally between 2 and 10 times over requested by the science community. Taking half away to seamlessly prevent an occasional job death just isn't worth the lost opportunity to more fully utilize the resources.

      Option 1 implies taking some time away from your job to do the checkpointing. The vast majority of the time, some sort of OS-level automated checkpointing would be overkill as well. The author of the code knows better when is a good time to checkpoint and when it's a bad idea. I.e. you might consider checkpointing at a phase of the calculation when the data volume required to restore the state is at a minimum even if that means losing some part of a future calculation. Generally the calculation is cheap to redo since checkpoints of large volumes of data are expensive.

      In addition, OS-level checkpointing is a hard problem. E.g. if there are messages in flight on the network, do you try to log them and be able to restart them, or do you only checkpoint when the network is quiet? If the network is never quiet on all the nodes in the job, do you throw in needless synchronization that could ruin the parallel efficiency of the job in order to find a place to do your automated checkpoint? If you decide to log instead, where do you write the log data in order to avoid catastrophic failure of each node, and what's the cost of doing it?

      If these were just a bunch of VMs running a LAMP stack, this wouldn't be a hard problem. That's basically solved already. Migrating tasks for HPC jobs is truly a hard problem with tradeoffs to be considered.

      --
      Yes...I am a rocket scientist.
    16. Re:Hardly A New Problem by Anonymous Coward · · Score: 0

      Man, you are clueless.

  2. A Cloudy Future by Anonymous Coward · · Score: 0

    I would think the same concurrency one must apply to process within such a system could be applied in today's IaaS infrastructures. Perhaps some problems don't scale to the cloud. But then, I would think many of these could be run on several super-ish computers with a mix of the techniques for cloud and "local" concurrency. Is this really a problem for anyone but the sales staff offering supercomputers to large institutions?

  3. "they halt operations when they [fail]" by Let's+All+Be+Chinese · · Score: 2

    Ah-ha-ha! You wish failing kit would just up and die!

    But really, you can't be sure of that. So things like ECC become minimum requirements, so you might at least have an inkling something isn't working quite right. Because otherwise, your calculations may be off and you won't know about it.

    And yeah, propagated error can, and on this sort of scale eventually will, suddenly strike completely out of left field and hit like a tacnuke. At which point we'll all be wondering, what else went wrong and we didn't catch it in time?

    1. Re:"they halt operations when they [fail]" by Anonymous Coward · · Score: 1

      global warming simulations come to mind.

  4. Componentry? by Beardydog · · Score: 2

    Is an increasing amount of componentry the same thing as an increasing number of components?

    1. Re:Componentry? by Anonymous Coward · · Score: 5, Funny

      Componentry is embiggened, leading to less cromulent supercomputers?

    2. Re:Componentry? by blade8086 · · Score: 2

      No.
      an increasing amount of componentry has increased componentry with respect to an increasing number of components.

      In other words,

      componentry > components
      increasing amount > increasing number.

      Therefore

      componenty + increasing amount >> components + increasing number
      however
      componentry + increasing number =? increasing amount + components

      unfortunately, to precisely determine the complexity of the componentry,
      more components (or an increasing number of componentry) with resepct to the original summary, are required.

    3. Re:Componentry? by Inda · · Score: 1

      Dictionary definition: A bullshit term to feed to a client in order to stall them and placate their demands for results.

      Other dictionaries may show different results.

      --
      This post contains benzene, nitrosamines, formaldehyde and hydrogen cyanide.
    4. Re:Componentry? by Impy+the+Impiuos+Imp · · Score: 1

      Haw haw haw!

      Back in the vacuum tube days, engineers wrote estimates on the maximium size of computers given the half-life of a vacuum tube. 50,000 tubes later, it's running for minutes only between breakdowns.

      --
      (-1: Post disagrees with my already-settled worldview) is not a valid mod option.
    5. Re:Componentry? by rubycodez · · Score: 1

      only true for "noble componentry"

    6. Re:Componentry? by Anonymous Coward · · Score: 0

      That reminds me of one of my pet peeves: functionalities.

      I remember, I think it was towards the end of the 1980's, that suddenly my collegues started to talk about functionalities for things we used to call the functions of a system. This may be different in English, but in Dutch the word 'functionaliteit' didn't have a plural as far as I was aware. According to the leading Dutch dictionary it meant (and still means) conformance with the function (I hope that's a reasonable translation), and it mentions the synonym 'doelmatigheid', which means effectiveness. So what exactly are effectivenesses? I couldn't find an answer in regular dictionaries and IT dictionaries didn't mention the word at all at the time (this was before the Internet became ubiquitous). So I started asking my collegues what they meant when they talked about functionalities, and what distinguished those from functions. Not a single one was able to answer that question, at best they made a clumsy guess. But they happily kept using the fancy sounding word. I don't do it often anymore, but occasionally I still ask the question, and after more than 20 years I haven't got a single satisfying answer. I never use the word myself, no-one ever seems to have noticed that I don't, and I have no trouble expressing myself. There doesn't seem to be a reason for the word to exist.

      Aftes I got connected to the Internet of course I looked for definitions there, I did it fairly recently, even. The definitions in Dutch are inconsistent, sources don't agree on the meaning. I found a definition in English somewhere that might make sense: functionalities are functions as the user sees them. Ah, but before we had functionalities we just called those 'user functions' if the context didn't already make clear what was meant. That's a term a user actually understands without needing an explanation, and if that is what it means we did things the wrong way around by creating jargon not for our own specialized meanings of a word but for the common meaning. And if so many people can't explain to me how functionalities are different from functions, how do they explain it to the users?

      The correct question to ask probably isn't what componentry is but what componentries are.

  5. Is this really a problem these days? by Anonymous Coward · · Score: 1

    Even my small business has component redundancy. It might slow things down, but surely not grind them to a halt. With all the billions spent and people WAY smarter than me working on those things, I really doubt its as bad as TFA makes it out to be.

  6. Whilst obviously tripling cost... by Anonymous Coward · · Score: 1

    This just says to me that they need to buy three of every component and run a voting system and throw a fail if one is way off the mark.

    1. Re:Whilst obviously tripling cost... by dissy · · Score: 1

      They do. The problem is that is a lot of waste, which does not scale well.

      With 1000 nodes, triple redundancy means only ~333 nodes are producing results.
      In a couple years, we will be up to 10000 nodes, meaning over 6000 nodes are not producing results.
      In a few more years we will be up to 100,000 nodes, meaning 60,000 nodes are not producing results.

      Those 60000 nodes are using a lot of resources (power, cooling, not to mention cost) and the issue is they need to develop and implement better methods to do this.

    2. Re:Whilst obviously tripling cost... by __aaltlg1547 · · Score: 1

      67% overhead in a computing system is considered unacceptable in most applications.
      How you respond to a failure is a big deal when you get to systems so large that they're statistically likely to have component failures frequently. It's often unacceptable to just throw out the result and start over. The malfunctioning system needs to be taken offline dynamically and the still-working systems have to compute around it without stopping the process. That's a tricky problem.

  7. "and they halt operations when they do so" by brandor · · Score: 5, Informative
    This is only true in certain types of supercomputers. The only one we have that will do this is an SGI UV-1000. It surfaces groups of blades as a single OS image. If one goes down, the kernel doesn't like it.

    The rest of our supercomputers are clusters and are built so that node deaths don't effect the cluster at large. Someone may need to resubmit a job, that's all. If they are competent, they won't even lose all their progress by using check-pointing.

    Sensationalist titles are sensationalist I guess.

    1. Re:"and they halt operations when they do so" by Meeni · · Score: 1

      Checkpoints won't scale to future generations. But what is amusing is to see some random ph.D student being cited here instead of the people who actually came to that conclusion some time ago already :)

    2. Re:"and they halt operations when they do so" by fintler · · Score: 1

      Checkpoints will probably stick around for quite some time, but the model will need to change. Rather than serializing everything all the way down to a parallel filesystem, the data could potentially be checkpointed to a burst buffer (assuming a per-node design) or a nearby node (experimental SCR design). Of course, it's correct that even this won't scale to larger systems.

      I think we'll probably have problems with getting data out to the nodes of the cluster before we start running into problems with checkpointing. The typical NFS home directory isn't going to scale. We'll need to switch over to something like udsl projections or another IO forwarding layer in the near future.

    3. Re:"and they halt operations when they do so" by Anthony · · Score: 1

      This is my experience as well. I have supported a shared memory system and now a distributed memory cluster, the resilience of the latter means job resubmission of jobs related to the failing node is the standard response. A failed blade (Altix brick in my case) meant the entire numalink connected system would go down. Component-level resilience and predictive diagnostics help. Job suspension and resumption and/or migration is also useful to work around predicted or degraded component failure.

      --
      Slashdot: Where nerds gather to pool their ignorance
    4. Re:"and they halt operations when they do so" by Tirian · · Score: 2

      Many supercomputers that utilize specialized hardware just can't take component failure. For example, on a Cray XT5, if a single system interconnect link (SeaStar) goes dead the entire system will come to a screeching halt because with SeaStar all the interconnect routes are calculated at boot and can not update during operation. In any tightly coupled system these failures are a real challenge, not just because the entire system may crash, but if users submit jobs requesting 50,000 cores but only 49,900 cores are available.

      Checkpoints are necessary, but in large-scale situations they are often difficult. You usually have a walltime allocation for your job and you certainly don't want to use 20% of it writing checkpoint files to Lustre (or whatever high-performance filesystem you are utilizing). Perhaps frequent checkpointing works on smaller systems/jobs, but for a capability job on a large system you are talking about a significant block of non-computational cycles being burned.

    5. Re:"and they halt operations when they do so" by postbigbang · · Score: 1

      The problem is also with the classical Von Neumann model of the state machine. You can have many nodes that do work, then sync-up at different points as dependencies on a JCL- like program. When you have common nodes running at CPU clock, then the amount of buffer and cache that gets dirty is small, and the sync-time is the largest common denominator amongst the calculating nodes. When you bust that model by having an error in one or more nodes, then the sync can't happen until the last node is caught up. Other non-dependent functions can continue, but at some point, it grinds to a halt waiting either the initial node to complete, or for a replacement to complete, but completion nonetheless.

      When you run asynchronous jobs, the cycle time becomes less dependent on node failure, but more dependent on a competent coder's error handler that knows how to react, and spawns whatever's necessary to bring about an appropriate reaction to a node failure whilst keeping the rest of the jobs humming along.

      I think ZFS was a great start to having large data sets operated on by concurrent objective code sets, but tightly coupled systems are a recipe for disaster because of their state-machine fragility.

      --
      ---- Teach Peace. It's Cheaper Than War.
    6. Re:"and they halt operations when they do so" by Anonymous Coward · · Score: 2, Informative

      Pretty much all MPI-based codes are vulnerable to single node failure. Shouldn't be that way but it is. Checkpoint-restart doesn't work when the time to write out the state is greater than MTBF. The fear is that's the path we're on, and will reach that point within a few years.

    7. Re:"and they halt operations when they do so" by Anonymous Coward · · Score: 1

      This is one of the primary differences between XT (Seastar) and XE/XK (Gemini) systems. A Gemini based blade can go down and traffic will be routed around it. Blades can even be "hotswapped" and added back into the fabric as long as the replacement is identically configured.

    8. Re:"and they halt operations when they do so" by Anonymous Coward · · Score: 0

      A fault tolerant single image computer should be able to withstand a loss of a compute node. UV-1000 is obviously not meant to be a fault tolerant computer based on the product description.

    9. Re:"and they halt operations when they do so" by pereric · · Score: 2

      Can you really predict if it will halt?

    10. Re:"and they halt operations when they do so" by cheekyboy · · Score: 1

      after the few 1000 fails you will get similar curves for time of fail after new.

      --
      Liberty freedom are no1, not dicks in suits.
  8. ummm, no. by Anonymous Coward · · Score: 0

    Ever heard of Google? They're doing the massive supercomputer thing just fine and have completely figured out the fault tolerance side of things.

    1. Re:ummm, no. by fintler · · Score: 3

      Google is having the same problems that this article describes -- they haven't fixed it either.

      If your problem domain can always be broken down into map-reduce, you can easily solve it with a hadoop-like environment to get fault tolerance. If your application falls outside of map-reduce (the applications this article is referring to), you need to start duplicating state (very expensive on systems of this scale) to recover from failures.

  9. On that scale by shokk · · Score: 2

    On that scale, distributed parallelism is key, where the system takes into account downed nodes and removes them from duty until it can return to service, or can easily add a replacement node to handle the stream. That's why Google and Facebook don't go down when a node fails.

    --
    "Beware of he who would deny you access to information, for in his heart, he dreams himself your master."
    1. Re:On that scale by Anonymous Coward · · Score: 0

      Oh MAN! Image if you had a Boewolf cluster of those...
      wait... thats what it is... or.
      a Beowolf cluster of Beowolf clusters...
      I digress.

      Google and Facebook are data center centric search engines,
      and they are database clusters, which is Disk bound,
      where as a SuperComputer is compute bound,

      The reason that Database clusters dont go down, ( aside from the Madona factorr..), is that they have a serious amount of redundancy and location redundancy too. Supercomputers have notorious single points of failures. What comes to mind was the ATT WGS6300 that was used for the gateway to the Cray at Ames. The plug for the gateway was just stuck in the wall. So, when someone kicked the plug out, the machine was not very useful:

      IBM solved the reliability problem long ago with Project Stretch, which was a first incredibly unreliable. So they designed check pointing into the system, so that if anything went wrong, and a lot did. They could fix the problem, and back up to the instruction that was executing when the failure occurred, and continue from there.

      In summary, Some supercomputers have single points of failure, others have multiple points of failure, and some have mega redundancy, like distributed computing and database clusters.

  10. This problem is not new by Anonymous Coward · · Score: 0

    Back in the 60's, ENIAC (the supercomputer of the day) ran on thousands of vacuum tubes, MTBF was
    about 5.6 hours, then an IT person would have to figure out which tube to replace.
    See http://ftp.arl.army.mil/~mike/comphist/hist.html

  11. Re:Old problem by fintler · · Score: 1

    How do you give the work to another node when the failed node contains the only copy of its state (like in an MPI job)? Duplicating the state on multiple nodes is way too expensive.

  12. Google/FB/etc are Embarassingly Parallel by Anonymous Coward · · Score: 1

    EP is trivial to deal with.. The problem is with supercomputing jobs that aren't EP, and rely on computations from nodes 1-100 to feed to the next computational step in node 101-200, etc.

    The real answer is some form of fault tolerant computing.. think EDAC for memory (e.g. forward error correction) but on a grosser scale. Or, even, designing your algorithm for the ability to "go back N", as in some ARQ communication protocols.

    The theory on this is fairly well known (viz Byzantine Generals problem)

    The problem is that big supercomputing software is hard enough to write already without throwing in non-deterministic behavior of the processing. It's not like the compiler does it for you, and even if you have libraries that use multiprocessing (matrix math), that's usually a small part of the problem.

    1. Re:Google/FB/etc are Embarassingly Parallel by BitZtream · · Score: 1

      When you have multiple nodes, you aren't any different than Google.

      Google uses Map Reduce but it isn't the only way things get done.

      You have standards of coding to deal with the issues. MapReduce is only one of those ways of dealing with the issue.

      And for reference, what you describe in your first paragraph is EXACTLY a MapReduce problem. First 100 nodes Map, second hundred nodes Reduce the results. Rinse, repeat.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    2. Re:Google/FB/etc are Embarassingly Parallel by Jah-Wren+Ryel · · Score: 2

      And for reference, what you describe in your first paragraph is EXACTLY a MapReduce problem. First 100 nodes Map, second hundred nodes Reduce the results. Rinse, repeat.

      No it's not. The problem with your description is the "rinse, repeat" part. He's not talking about repeating with new input data. He's talking about a serialized workload where, for example, the output of the first 100 jobs is the input for the next 100 jobs, which then creates output that is the input for the next 100 jobs. It's not a case of repeating, its a case of serialization where if you have not done state check-point and things crater you have to start from the begining to get back where you were. No "standard of coding" can fix that.

      --
      When information is power, privacy is freedom.
    3. Re:Google/FB/etc are Embarassingly Parallel by Anonymous Coward · · Score: 1

      Uh, no..

      This kind of problem isn't one where you are rendering frames (each one in parallel) or where your work quanta is large. Think of doing a finite element code on a 3d grid. The value at each cell at time t=n+1 depend on the value of all the adjacent cells at time t=n. The calculation of cell x goes bonk, and you have to stop the whole shebang.

    4. Re:Google/FB/etc are Embarassingly Parallel by sirambrose · · Score: 1

      Most problems can't be solved by a single map reduce. Map reduce tasks are normally written as a series of map reduce jobs; each map runs on the output of the previous reduce.

      Since map reduce jobs write their output to disk at every step, it can be thought of as a form of check pointing. The difference between map reduce and mpi check pointing is that mpi needs to restart the whole job at the checkpoint, but map reduce frameworks can rerun just the work assigned to failed nodes. In the map reduce model, the checkpoint interval stays constant when adding additional nodes. With mpi, the checkpoint interval decreases as nodes are added because adding nodes increases the chance of at least one node failing in a given time interval and forcing the entire job to restart at the checkpoint.

      Not all mpi jobs can easily be rewritten as map reduce jobs, but map reduce does address the problem discussed in the article.

  13. Really? by Pandaemonium · · Score: 1

    Haven't these problems already been solved by large-scale cloud providers? Sure, hurricanes take out datacenters for diesel, but Google runs racks with dead nodes until it hit's a percentage where it makes sense to 'roll the truck' so-to-speak and get a tech onsite to repair the racks.

    1. Re:Really? by Anonymous Coward · · Score: 1

      Pretty much also solved in all but the smallest HPC context.

      They mention that overhead for fault revocery can take up 65% of the work. That's why so few large systems bother anymore. They run jobs and if a job dies, oh well the scheduler starts it over. Even if a job breaks twice in a row near the end, you still break even, and that's a relatively rare occurrence (job failing twice is not unheard of, but both times right at the end of a job is highly unlikely.

      This is the same approach in 'cloud' applications. They have more thorughput then they need and hard failures manifest simply as transient drop in throughput. Yes if you need predictive runtime, running the same job multiple times concurrently may be warranted (again, running the exact same job three times concurrently is *still* no worse than checkpointing overhead, and much simpler), The speaker is hawking their own MPI specifically toward this goal, but I'm not sure if such a thing is even needed based on what I've seen.

  14. Re:Hardly A New Problem...and thus has been fixed by poetmatt · · Score: 3, Insightful

    The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down. That's literally the purpose of basic cluster technology from probably 10 years ago.

    How do they act like this is a new, or magic issue? It doesn't exist if HPC people know what they're doing. Hell, usually they keep a known quantity of extra hardware out of use so that they can switch something on if things fail as necessary.

  15. Re:Old problem by Anonymous Coward · · Score: 2, Informative

    you start the job over.

    You make sure that a single job's run time x the number of nodes is not so large that the chance of that job running to completion is not unreasonable.

    On the previous ones I worked on the 60% job failure rate was around 100 nodes for 5 days, that comes down to the chance of a single node failing on a given day is .999 (you lose 1 out of 1000 nodes each day from something). The math is rather simple...0.999^500=60%. And in general you don't put dual power supplies, you don't mirror the disks...rerunning the jobs that failed is cheaper than increasing the node price to add things that only marginally improve things and also increase physical size.

    If you have a single process bigger than that you need to setup a checkpointing system.

    If you can split big jobs into lots of smaller pieces that can be pretty quickly put together at the end you do so.

    On the previous one I was on they used both tricks depending on the exact nature of what was being processed.

    For the most part it is not a complicated problem unless you expect unreasonably low failure rates and don't deal with reality.

  16. Their methoid is nothing new. by Jartan · · Score: 1

    This is the same method used to handle distributed computing with untrusted nodes. Simply hand off the same problem to multiple nodes and recompute if differences arise.

    The real solution is going to involve hardware as well. The nodes themselves will become supernodes with built in redundancy.

    1. Re:Their methoid is nothing new. by Yvan+Fournier · · Score: 1

      Actually, things are much more complex, and as some other poster mentioned, these issues are the continuing subject of research, and are expected by the supercomping community since quite a few years (simply projecting current statistics, the time required to checkpoint a full-machine job is would at some point become bigger thant the MTBF...)
      The PhD student mentioned seems to be just one of many working on this subject. Different research teams have different approaches, some trying to hide as much possible in the runtimes and hardware, others experimenting with more robust algorithms in applications.

      Tradeoffs on HPC clusters are not the same as on "business" type computers (high-throughput vs. high availability). For tightly coupled computations, a lot of data is flying around the network, and networks on these machines are fast, high throughput, and especially low-latency networks, with specific hardware, and in quite a few cases, partial offload of message management, using DMA writes or other techniques which might make checkpointing message queues a tad complex. The new MPI-3 standard has only minimal support for error handling, simply because this field is not mature/consensual enough, in the sense that not everyone agrees on the best solutions yet, and these may depend on the problem being solved and its expected running time. Avoiding too much additional application complexity and major performance hits is not trivial.

      In addition, up to now, when medium to large computations are batch jobs that may run a few hours to a few days on several thousand cores, re-running one a computation that failed due to hardware failures once in a while (usually much less than one in 10 times is much more cost-effective than duplicating everything, in addition to being faster. These applications do not usually require real-time results, and even for many time-constrained applications (such as tomorrow's weather), running almost twice as many simulations (or running them twice as fast in the case of ideal speedup) might often be more effective. This logic only breaks with very large computations.

      Also, regarding similarity to the cloud, when 1 node goes down on most clusters, the computation running on it will usually crash, but when a new computation is started by a decent resource manager/queuing system, that node will not be used, so everything does not need to be replaced immediately (that issue is at least solved). So most jobs running on 100th or 1/10th of a 100000 to 1 million node cluster will not be too much affected by random failures, but a job running on the full machine will be much more fragile.

      So, as machines get bigger and these issues become statistically more of an issue, an increasing portion of the HPC hardware and software effort needs to be devoted to these, but the urgency is not quite the same as if your bank had forgotten to use high-availability features for its customer's account data, and the dradeoffs reflect that.

  17. Oblig Bad Car Analogy by PPH · · Score: 1

    Come on now. We've known how to run V8 engines on 7 or even 6 cylinders for years now. Certainly this technology must be in the public domain by now.

    I have a demonstration unit I would be happy to part with for a mere $100K.

    --
    Have gnu, will travel.
    1. Re:Oblig Bad Car Analogy by afgam28 · · Score: 1

      Demonstration units can be had for much less :P

      For example a Chrysler 300C costs $36,000, and has has a Hemi V8 with Chrysler's Multi-Displacement System

  18. Tandem by SpaceLifeForm · · Score: 1

    Distributed fault tolerance. Diifficult to scale large and keep supercomputer performance. But still room for improvement.

    --
    You are being MICROattacked, from various angles, in a SOFT manner.
  19. In addition... by Anonymous Coward · · Score: 0

    In most modern problems simply scaling the code to that level of parallelism is almost always an unsolved problem. There is still a lot of work to be done in this arena. Sparse linear algebra for example has serious scaling issues beyond a certain number of nodes, irregular memory access and communication patterns really bungle up parallelism.

  20. Possible Solution by Master+Moose · · Score: 1

    Could they try turning it off then on again?

    --
    . . .gone when the morning comes
  21. Re:Hardly A New Problem...and thus has been fixed by cruff · · Score: 2

    The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability. Everything doesn't just instantly crash down.

    You might think so, but I've seen a configuration with an interconnect fabric that was extremely sensitive to the fallback of individual links to the next lower link speed cause all sorts of havoc cluster wide.

  22. Not Really New by Jah-Wren+Ryel · · Score: 3, Insightful

    The joke in the industry is that supercomputing is a synonym for unreliable computing. Stuff like checkpoint-restart was basically invented on super-computers because it was so easy to lose a week's worth of computations to some random bug. When you have one-off systems or even 100-off systems you just don't get the same kind of field testing that you get regular off-the-shelf systems that sell in the millions.

    Now that most "super-computers" are mostly just clusters of off-the-shelf systems we get a different root cause but the results are the same. The problem now seems to be that because the system is so distributed so is the state of the system - with a thousand nodes you've got a thousand sets of processes and ram to checkpoint and you can't do the checkpoints local to each node because if the node dies, you can't retrieve the state of that node.

    On the other hand, I am not convinced that the overhead of checkpointing to a neighboring-node once every few of hours is really all that big of a problem. Interconnects are not RAM speed, but with gigabit+ speeds you should be able to dump the entire process state from one node to another in a couple of minutes. Back-of-the-napkin calculations say you could dump 32GB of ram across a gigabit ethernet link in 10 minutes with more than 50% margin for overhead. Doing that once every few hours does not seem like a terrible waste of time.

    --
    When information is power, privacy is freedom.
    1. Re:Not Really New by markhahn · · Score: 1

      yes, checkpointing is a reasonable answer. no large machines use Gb, though.

    2. Re:Not Really New by Anonymous Coward · · Score: 0

      Actually with modern fdr ib infrastructure it would take you about 10 seconds or less.

    3. Re:Not Really New by Jah-Wren+Ryel · · Score: 1

      Yeah, I was just using a conservative interconnect that most people here could relate too.

      --
      When information is power, privacy is freedom.
  23. Kei by JanneM · · Score: 1

    A machine like Kei in Kobe does live rerouting and migration of processes in the event of node or network failure. You don't even need to restart the affected job or anything. Once the number of nodes and cores exceed a certain level you really have to assume a small fraction are always offline for whatever reason, so you need to design for that.

    --
    Trust the Computer. The Computer is your friend.
  24. Re:Hardly A New Problem...and thus has been fixed by markhahn · · Score: 3, Informative

    "hegemonous", wow.

    I think you're confusing high-availability clustering with high-performance clustering. in HPC, there are some efforts at making single jobs fault-tolerant, but it's definitely not widespread. checkpointing is the standard, and it works reasonably, though is an IO-intensive way to mitigate failure.

  25. Re:Hardly A New Problem...and thus has been fixed by Jah-Wren+Ryel · · Score: 1

    The reality of hegemonous computing is that failure is almost of no concern. If you have 1/1000 nodes fail, you lose 1/1000th of your capability.

    Yeah, no surprise there. Historically, kings have never cared about what happens to the peons.

    --
    When information is power, privacy is freedom.
  26. DARPA's Exascale Report by Anonymous Coward · · Score: 1

    DARPA's report ( http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/exascale_final_report_100208.pdf ) has a lot of interesting information for those who want to read more on exascale computing. I may be a bit biased being a grad student in HPC too, but the linked article didn't impress me.

  27. Old news by Anonymous Coward · · Score: 0

    So, exactly the same problem everybody with a large data centre has had for years. If it was competently designed, few, if any, of the failures will lead to the whole computer failing.

  28. ENIAC by mspohr · · Score: 1

    Reminds me of ENIAC. I went to the Moore School at U of Penn and ENIAC parts were in the basement (may still be there, for all I know). The story was that since it was all vacuum tubes, at the beginning they were lucky to get it to run for more than 10 minutes before a tube blew.

    That being said, I can't believe that supercomputers don't have some kind of supervisory program to detect bad nodes and schedule around them.

    --
    I don't read your sig. Why are you reading mine?
  29. Re:Hardly A New Problem...and thus has been fixed by sjames · · Score: 1

    I don't think you understand the nature of the problem.

    The more nodes N you use for a computation and the longer the computation runs, the greater the odds that a node will fail before the computation can complete. Small programs ignore the problem and usually work just fine. Should a node crash, the job is re-run from scratch.

    More sophisticated programs write out a checkpoint periodically so that if (when) a node fails, it is replaced and the job restarts from the checkpoint. However, that is not without cost. Computation isn't happening when state is being written out. However, so far, it's been manageable.

    The issue in TFA is that this will not scale. As you add nodes, the MTBF gets much smaller (and so checkpointing has to become more frequent). Eventually you would reach a point where adding nodes actually slows the computation down because of the necessary overhead of checkpointing. Add enough nodes and eventually it does nothing but checkpoint over and over. That's a classic scalability problem and it has no obvious solutions that don't involve a magic wand.

  30. Phd with no clue... by Fallen+Kell · · Score: 2

    I am sorry to say it, but this Phd student has no clue. Dealing with a node failure is not a problem with proper, modern supercomputing programming practices as well a OS/system software. There is an amazing programming technique called "checkpointing", developed a while ago. This allows you to periodically to "checkpoint" your application, essentially saving off the system call stack, the memory, register values, etc., etc., to a file. The application is also coded to check to see if that file exists, and if it does, to load all those values back into memory, registers, call stack, and then continue running from that point. So in the event of a hardware failure, the application/thread is simply restarted on another node in the cluster. That is application level checkpointing, there is also OS level checkpointing, which essentially does the same thing, but at the OS level irregardless of the processes running on the system, allowing for anything running on the entire machine to be checkpointed and restarted from that spot.

    Then there is the idea of a master dispatcher, which essentially breaks down the application into small chunks of tasks, and then sends those tasks to be calculated/performed on a node in the cluster. If it does not get a corresponding return value from the system it sent the task within a certain ammount of time, it re-sends to another node (and marking the other node as bad and not sending future tasks to it until that value is cleared).

    Both of these methods fix the issue of having possible nodes which die on you during computation.

    --
    We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"
    1. Re:Phd with no clue... by Anonymous Coward · · Score: 1

      There is an amazing programming technique called "checkpointing", developed a while ago.

      No shit, Sherlock. Now tell me what happens when a 2048-core job needs to write its 4TB of RAM to disk every hour.

    2. Re:Phd with no clue... by Anonymous Coward · · Score: 0

      And a 2048-core job is peanuts compared to what the article was talking about, which have millions of nodes, each with several cores. The amount of ram and disk available per core is not expected to increase much, but since the aggregate failure rate wil be proportionally higher as you add more cores, these systems will fail more than a thousand times more often than your tiny 2048-core example, and hence need to checkpoint tousands of times more often. So that would be a checkpoint every second or so. This is the real killer, and why checkpointing does not scale to exaflops performance.

    3. Re:Phd with no clue... by __aaltlg1547 · · Score: 1

      No, that doesn't solve the problem in general. It only solves the problem in the case where getting a result in a timely fashion is not critical.

  31. Checkpointing overhead by Anonymous Coward · · Score: 1

    The point of the article was that as the number of nodes involved in the calculation increases, the frequency at which at least one of them fails increases too (provided that the individual node failure rate is kept constant). Since you want at least one checkpoint between each typical failure, you would therefore have to checkpoint more and more often as the number of nodes is increased. Hence, the overhead involved with checkpointing goes up as the number of nodes involved increases, and with 100 times more nodes than most clusters use now, this overhead grows to overwhelm the amount of resources used for the actual calculation.

    This problem applies to all global modes of handling failures, i.e. when the whole system stops due to the failure of one or a few nodes.

    1. Re:Checkpointing overhead by Jah-Wren+Ryel · · Score: 1

      The point of the article was that as the number of nodes involved in the calculation increases, the frequency at which at least one of them fails increases too (provided that the individual node failure rate is kept constant). Since you want at least one checkpoint between each typical failure, you would therefore have to checkpoint more and more often as the number of nodes is increased. Hence, the overhead involved with checkpointing goes up as the number of nodes involved increases, and with 100 times more nodes than most clusters use now, this overhead grows to overwhelm the amount of resources used for the actual calculation.

      Thanks for spelling that out. I don't know why I missed that reading the article. Your post deserves to be modded +5 informative.

      --
      When information is power, privacy is freedom.
  32. Re:Hardly A New Problem...and thus has been fixed by gentryx · · Score: 1

    Checkpoint/restart is actually a rather poor workaround as the ration of IO bandwidth to compute performance is shrinking with every new generation of supercomputers. Soon enough we'll spend more time writing checkpoints than doing actual computations.

    I personally believe that we'll see some sort of redundancy on the node level in the mid-term future (i.e. the road to exascale), which will sadly require source code-level adaptation.

    --
    Computer simulation made easy -- LibGeoDecomp
  33. Re:Hardly A New Problem...and thus has been fixed by Anonymous Coward · · Score: 0

    With many simulations, say something like climate modelling or fluid dynamics, the output from one node (a particular region of space) is used as input to the adjacent neighbors. So if that node fails, it would propagate a delay. It seems that you would want some kid of "buddy" system where if one node were to keel over, the neighbors could take up the slack.

  34. Whoever did mod the parent up.... by gentryx · · Score: 2

    ...doesn't understand the first thing about supercomputers, or even HPC. Currently virtually every HPC application uses MPI. And MPI doesn't take well to failing nodes. The supercomputer as a whole might still work, but the job will inevitably crash and needs to be restarted. HPC apps are usually tightly coupled. That sets them apart from loosely coupled codes such as a giant website (e.g. Google and friends)

    Fault tolerance is a huge problem in the community and we don't have the answers yet. Some say that fault tolerance within the MPI layer (e.g. here) will be sufficient. I personally very much doubt that. My bet is on higher-level frameworks, e.g. HPX, which can "abstract away" the location of a task from the node where its actually being executed.

    --
    Computer simulation made easy -- LibGeoDecomp
  35. easy solution by cheekyboy · · Score: 2

    Just make checkpointing cost zero time. How, have each node, really be a dual node ( like most things in nature are ). So one half is always computing, and the other half is check pointing. Just like video games use of double buffering to do smooth fps. If checkpointing uses less cpu than comps, then swap the cores functionality every N runs to give each core a temperature rest, to cool down, to increase life.

    Sure each node is bigger, but it could perhaps scale better, the overhead curve should be way better, and flat.

    Hire some big phds to design mitigation mechanisms to increase MTBF by better environmental management, I dunno, give nodes a rest for 30 mins every 5 days. I dunno, more research needs to be done, and that takes tonne of time, (unless you can simulate it on a super computer) doh.

    Just a thought.

    --
    Liberty freedom are no1, not dicks in suits.
    1. Re:easy solution by sjames · · Score: 1

      That doesn't actually make checkpointing take zero time, it allocates it 50% of the time. meanwhile, in that scheme the cluster must be small enough that most of the time, there isn't a failure in a checkpoint interval.

  36. Re:Hardly A New Problem...and thus has been fixed by marcosdumay · · Score: 1

    Well, it seems obvious that you need to distribute your checkpoints. Instead of saving the global state, you must save smaller subsets of it.

    Now, writting a program that does that is no easy task. That's why it is a problem. But don't think it can't be solved, because it can.

  37. Try Leveraging Database Technology by Anonymous Coward · · Score: 0

    There has been over 30 years research and utilization of ACID and High Availability techniques;
          WAL
        2 Phase Commit
        Hot Standby
        Replication ...

    in database technology and products. Surely its time the HPC community started looking at and using some of these techniques rather re-inventing the wheel.

    It may seem preposterous, but its perfectly feasible to run CUDA and OpenCL code as Database Stored Procedures on GPU's. Stored Procs implement a
    natural Map-Reduce paradigm, Database technology isn't the complete solution but its a good starting point.

  38. Re:Hardly A New Problem...and thus has been fixed by EETech1 · · Score: 1

    Is this something that can be partially avoided by using Itanium processors instead of X86? Or has all of the reliability stuff been included in the recent Xenon chips?

    I understand that there's little the processor can do if the mobo dies (unless it notices an increase in failed computations right before maybe it could send a signal) but is there any advantage to the type of processor used in some cases?

    Just curious...

    Cheers

  39. redunancy solution by ILongForDarkness · · Score: 1

    The guy in the article that says that running the redundant copies on the same nodes would reduce i/o traffic: I'd love to speak to him. There are two options I see:

    1) Assuming that there is common source data that both instances need to churn on: so the data isn't redundant so what exactly are you proving by getting the same result? Diddo with CPU, integer unit etc same hardware is not a redundant solution.

    2) No shared data but you are generating data on each node: so they still have to chat with eachother to synchronize and such. So now you have 1/3 of a CPU node and 1/3 of I/O. But wait that's not all you also are trashing the hell out of your cache, if any of the stuff is on disk you are pretty much guaranteeing that at least one of the instances on the node has to wait for the drive latency since one or the other is going to get the disk first etc.

    Either way you still have the problem of a node blowing up and all the simultaneous copies of the sim dying together: In short you can't call something a redundant system without actually having redundancy.

  40. Re:Hardly A New Problem...and thus has been fixed by sjames · · Score: 1

    I am already assuming that each node saves it's own state to a neighboring node. That speeds up the checkpoints and improves the scalability but doesn't fundamentally address the problem.

  41. Re:Hardly A New Problem...and thus has been fixed by cruff · · Score: 1

    Is this something that can be partially avoided by using Itanium processors instead of X86? Or has all of the reliability stuff been included in the recent Xenon chips?

    No, not related to the processor, just a specific inter-node interconnect fabric implementation. The moral of the story is: just be aware that buying stuff that has just become available and trying to deploy it at scales well beyond what others (including the vendor) have done before leads to an effectively experimental configuration, and is not one that you should expect to behave like a production environment should for a period of time until the kinks are worked out. Of course, this plays hell with the overly optimistic schedules management thought should be achievable.

  42. Re:Hardly A New Problem...and thus has been fixed by marcosdumay · · Score: 1

    That does address the problem. Once you each node saves its state to enough other nodes, you don't need to make a checkpoint anymore.

    But then, you must write code that can work with just partial checkpoints, what is quite hard.

  43. Re:Hardly A New Problem...and thus has been fixed by sjames · · Score: 1

    Saving state IS checkpointing, you know. And you cannot continue computation with missing state other than by re-computing it (which would be considered part of loading the checkpoint). The checkpoint images must also be coherent. Otherwise, you have lost state.

    .

    In the end, it just means that the maximum useful size of a cluster is limited by the reliability of the nodes.

    Ultimately the problem may be sidestepped by going to non-stop style nodes (expensive as hell) but then your scalability is limited by how fast you can swap out the broken parts (alternatively, your run-time is limited by how many hot spares you can stuff in at the beginning). Now, the question comes down to how fast is the non-stop hardware (there has always been a speed penalty for that approach).

  44. Re:Hardly A New Problem...and thus has been fixed by marcosdumay · · Score: 1

    The checkpoint images must also be coherent. Otherwise, you have lost state.

    That's the restriction that must be removed. If your program can work with non-coherent subsets of the checkpoint, you won't have this scalability problem. And it is something possible, for any computation.

  45. Re:Hardly A New Problem...and thus has been fixed by sjames · · Score: 1

    I'll be needing to see proof of that one.

  46. Re:Hardly A New Problem...and thus has been fixed by marcosdumay · · Score: 1

    If you nodes communicate only through message passing (and you can compute anything by communicating only through message passing), when restoring a node if you restore any state after receiving or transmitting the last message, it can't make any difference.

    But then, that algorithm is not viable. It creates a perfectly descentralized checkpoint, without needing any coherence, but it needs too many checkpoints. But between enforcing coherence only of a single node, and enforcing coherence of the entire computer there is a huge space with different trade-offs between the number of checkpoints and the cost of each one.

    Alternatively, you can enforce transactions on each message. But that probably isn't universal due to dead-locks (I have no hard proff either way).

  47. Re:Old problem by fintler · · Score: 1

    You can't checkpoint jobs at this scale. It will take longer to checkpoint a job then to compute an answer. This is further compounded when the job takes several months to run. A 1000 node cluster is very tiny compared to the scale they're talking about.

  48. Re:Hardly A New Problem...and thus has been fixed by sjames · · Score: 1

    I agree that it can always be decomposed to message passing, even shared memory is just a particular way of passing messages.

    However, the state of a node necessarily includes the state of it's peers. If node A is expecting a reply from B while B is expecting a request from A, the computation fails every time.

    The only way to manage that without barriers is for the checkpoint to happen upon receipt of EACH message before it is ACKed (which means state is recovered in-part by re-transmission of unACKed messages). But note, you're really only evading the problem and cooking the books since the peers are waiting while you're checkpointing. That is, they are not allowed to advance their state until yours is checkpointed at a step compatible with their current and next state. With a non-stop like scheme, that at least happens very quickly. Of course, that devotes 2/3 of the hardware to the checkpointing process. This is still a coherent checkpointing scheme.

    I'm really not so sure there is a reasonable middle ground between that and the big global checkpoint (that is performed in parallel, of course). Sorta consistent = inconsistent.

  49. wishy washy by Anonymous Coward · · Score: 0

    We'll do just fine, this is just a doom and gloom article. Sure things fail, but will they fail before their useful life? If anything it's getting better with more cores per cpu, work spread across many nodes, etc. Just someone looking for some research dollars to analyze a problem that doesn't exist.

  50. conspiracy by Anonymous Coward · · Score: 0

    The only reason we sill use the same basic technology we have been using for over a decade now is because by going to the next step the government, and many corporations would have to admit to covering up Free Energy and many of Nikola Tesla's inventions. They have to figure out a way to slowly integrate advanced technology into the current system. Just like how we slowly integrate military technology into the consumer market.

    1. Re:conspiracy by Anonymous Coward · · Score: 0

      Hail Telsa, Fu*k the government and corrupt corporations

  51. Meh. by Anonymous Coward · · Score: 0

    I was at the talk @ SC. The guy's team has written an MPI library called red which does some basic check-summing and voting between nodes. big deal. Their presentation had jack to do with hardware failure mitigation. It was sort of interesting and yet underwhelming. Not bad on the overheard though.

        this article summary is useless. as usual from samzenpus. sam should learn to read, and learn some computing.

  52. Re:Hardly A New Problem...and thus has been fixed by marcosdumay · · Score: 1

    You are right. I was overlooking that problem with the algorithm (one must have the checkpoint before it acks the message, otherwise the node can't restore to "any state after receiving the message").

    But now that I've tought about it, transactions are way more interesting than I was assuming.

  53. Same song, next verse... by servant · · Score: 1

    We have addressed issues lie this using various methods over the years. Super computing is just a current area where doing pre-emptive issue resolutions has come in play.

    More obvious ways of HELPING (not single item 'resolving') this issue include:

    -- Parallelism ... similar to RAID for data storage but in various related areas
    -- High Availability and ruggedization ... yep, make it out of 'better stuff'. If you have something 'good' make it better, or 'harder' (like radiation hardening or EMP hardening),
    -- Multi-pathing of computation and data - we did this back in the '70s& '80s when reducing single points of failure, setting up multiple paths for data in computers, building in diagnostic facilities both in-band and out-of-band to be preemptive on detecting failure options before it causes issues.
    -- Harden facilities and environmentals - buildings, power, HVAC, ventilation, humidity, water, earthquake and storm vulnerabilities can often be handled but not easily or cheaply after a new facility has been built.
    -- Redundancy of MPP, and Vertical MIPS of few but larger CPUs, RISC processing - all have been used to make processing faster, cheaper, more reliable over the years in various implementations.

    These are just a few techniques off the top of my head I can remember I have been associated with to make computing more 'bullet proof' (yes, we did put kevlar wallboard up too! It protected from 'postal visitors' mainly). So with a bit more work many issues can be resolved by using these and other techniques.

    Yes, reliability is an issue, but it always has been, and even if addressed now, will be the next time a 'new' design comes out unless it is re-addressed every step of the way.

    --
    ... "When you pry the source from my cold dead hands."
  54. Nothing to see here, move along by Anonymous Coward · · Score: 0

    This sounds like someone pimping their thesis by proxy. Since when is this news? This problem was solved long ago. I recall Sequoia computer being a big player in the 80's and 90's in this market. Solution? Redundancy and monitoring of all critical systems.