Slashdot Mirror


VMware Causes Second Outage While Recovering From First

jbrodkin writes "VMware's new Cloud Foundry service was online for just two weeks when it suffered its first outage, caused by a power failure. Things got really interesting the next day, when a VMware employee accidentally caused a second, more serious outage while a VMware team was writing up a plan of action to recover from future power loss incidents. An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry.' Clearly, human error is still a major factor in cloud networks."

5 of 215 comments

  1. Re:This is very bad design by drosboro · Score: 5, Interesting

    I didn't get the sense from reading the linked analysis that it was necessarily a single key-press. It reads like this:

    "This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

    My sense is that "touched the keyboard" doesn't literally mean "touched a single key on the keyboard", but actually means "ignored the hands-off-the-keyboard part of the exercise, and executed some commands".

    But who knows, I could be wrong... I'm sure hoping I'm not!

  2. Re:'An inadvertent press of a key on a keyboard' by verbatim · Score: 5, Funny

    This pretty much describes my entire career.

    --
    Price, Quality, Time. Pick none. What, you thought you had a choice?
  3. Don't let it happen again by stumblingblock · Score: 5, Funny

    They just have to remove that key from the keyboard. You know, the one that massively crashes the entire system. Poor judgement to have that key there.

  4. Re:Slashdot summary non sensationalist by Icegryphon · Score: 5, Funny

    Don't go knocking my typewriter
    It's Electric, and has a wonderful BNC connector
    for network access. IBM, you did good.

  5. Re:VMware shows its PR colors. by ToasterMonkey · Score: 5, Insightful

    VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.

    "Transparency is bad" +4 Insightful

    What the... ?