VMware Causes Second Outage While Recovering From First

Posted by Soulskill on Monday May 2, 2011 @11:55AM from the third-time's-a-charm dept.

jbrodkin writes "VMware's new Cloud Foundry service was online for just two weeks when it suffered its first outage, caused by a power failure. Things got really interesting the next day, when a VMware employee accidentally caused a second, more serious outage while a VMware team was writing up a plan of action to recover from future power loss incidents. An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry.' Clearly, human error is still a major factor in cloud networks."

12 of 215 comments

This is very bad design by FunkyRider · 2011-05-02 11:58 · Score: 4, Interesting

[[An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry]] Really? Pressing a single key and bam! All gone? Is that the best they can do?

--
just wonder why there are so many anonymous cowards in this world....
1. Re:This is very bad design by drosboro · 2011-05-02 12:02 · Score: 5, Interesting
  
  I didn't get the sense from reading the linked analysis that it was necessarily a single key-press. It reads like this:
  
  This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry.
  My sense is that "touched the keyboard" doesn't literally mean "touched a single key on the keyboard", but actually means "ignored the hands-off-the-keyboard part of the exercise, and executed some commands".
  But who knows, I could be wrong... I'm sure hoping I'm not!
2. Re:This is very bad design by nurb432 · 2011-05-02 12:45 · Score: 4, Insightful
  
  I am sure that is what happened. I don't know of any single keystroke that would take down an entire data center. ( aside from that big red button on the wall over there.. )
  
  --
  ---- Booth was a patriot ----
Re:'An inadvertent press of a key on a keyboard' by verbatim · 2011-05-02 12:03 · Score: 5, Funny

This pretty much describes my entire career.

--
Price, Quality, Time. Pick none. What, you thought you had a choice?
Engineering Errors by Bruha · 2011-05-02 12:11 · Score: 4, Interesting

You cannot really stop stupid people. However, many companies cripple their networks through so-called "security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors (serial access) to get back into these devices. Now you're losing customers on top of losing money, especially when it comes to compute clouds where you're literally billing by the hour. Even more so for long distance providers, cellular companies, and VoIP communications providers.
I am curious how the press of one key managed to wipe out the cloud, the load balancers, and the routers at the same time. Either they're using some program to manage their switching network, which is the only thing that could take it all out at once, or the idiot had the command queued up.
More likely some idiot introduced a Cisco switch into their VTP domain, it had a higher revision number, and it overwrote their entire LAN environment. That is simply fixed by requiring a VTP password, so that you can really nail an idiot who does it, or by biting the admin bullet and running VTP transparent mode.
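To illustrate the VTP scenario described above, here is a toy Python model (not Cisco code) of how a switch in server or client mode accepts an advertisement that carries a higher configuration revision number. The `Switch` class and the `receive` and `plug_in` functions are hypothetical names made up for this sketch, and real VTP has message types and digests that are ignored here.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Toy stand-in for a switch taking part in a VTP-style domain."""
    name: str
    domain: str
    mode: str = "server"      # "server", "client", or "transparent"
    password: str = ""        # empty string means no password configured
    revision: int = 0         # configuration revision number
    vlans: dict = field(default_factory=dict)

    def receive(self, adv_domain, adv_password, adv_revision, adv_vlans):
        """Apply an advertisement; return True if the VLAN database was overwritten."""
        if self.mode == "transparent":
            return False      # transparent switches keep their local database
        if adv_domain != self.domain or adv_password != self.password:
            return False      # wrong domain or password: advertisement ignored
        if adv_revision <= self.revision:
            return False      # only a higher revision number wins
        self.vlans = dict(adv_vlans)
        self.revision = adv_revision
        return True


def plug_in(new_switch, network):
    """Return the names of switches whose VLAN database the newcomer overwrote."""
    return [
        sw.name
        for sw in network
        if sw.receive(new_switch.domain, new_switch.password,
                      new_switch.revision, new_switch.vlans)
    ]


# A lab switch with a stale database but a higher revision joins the domain.
core = Switch("core", "prod", revision=12, vlans={10: "web", 20: "db"})
guarded = Switch("guarded", "prod", password="s3cret", revision=12, vlans={10: "web"})
isolated = Switch("isolated", "prod", mode="transparent", vlans={10: "web"})
rogue = Switch("lab", "prod", revision=40, vlans={1: "default"})

overwritten = plug_in(rogue, [core, guarded, isolated])
```

In this model only `core` loses its VLANs; the password-protected and transparent switches ignore the newcomer, which is the point of the two mitigations mentioned above.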
There's no one command that's going to bring it all down, it's going to be a series of actions that result from a lack of proper network management, and lack of proper tested redundancy. Redundancy does not exist in the same physical facility, redundancy exists in a separate facility nowhere associated with anything that runs the backed up facility. Pull the plug on data center A, your customers should not notice a thing is amiss. If you can do that, then you have proper redundancy.
I believe the other problem is that we're working on a 30+ year old protocol stack, and it's starting to show its limitations. TCP/IP is great, but there need to be some better upper layer changes that allow client replication to work as well. So if the app loses its connection to server A, it seamlessly uses server B without so much as a hiccup. Something like keyed content where you can accept replies from two different sources, but the app can use the data as it comes in from each, much like BitTorrent, but on a real time level. It requires twice the resources to handle an app, but if redundancy is king this type of system would be king and prevent some of the large outages we have seen in the past.
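A minimal Python sketch of the simplest version of that idea, assuming the client knows about both servers up front: try server A and fall back to server B when the connection fails. This is plain sequential failover, not the BitTorrent-style parallel scheme described above, and `fetch_with_failover`, `server_a`, and `server_b` are hypothetical names standing in for real network calls.

```python
from typing import Callable, Sequence


def fetch_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Send the request to each replica in order and return the first reply."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next replica
    raise ConnectionError("all replicas failed") from last_error


def server_a(request: str) -> str:
    # Stand-in for the data center that just lost its network.
    raise ConnectionError("server A unreachable")


def server_b(request: str) -> str:
    # Stand-in for a replica in a separate facility.
    return "B:" + request


reply = fetch_with_failover([server_a, server_b], "GET /status")
```

The caller never sees that server A was down; it only sees an error if every replica fails.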
VMware shows its PR colors. by shuz · 2011-05-02 12:18 · Score: 4, Insightful

VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X. This also outlines a major issue with "cloud" technologies: they are only as redundant and stable as the individuals managing them. There is also always the opportunity for a single point of failure in any system; you just need to go up the support tree high enough. For most companies this is the data center itself, as offsite DR can get expensive quickly. For VMware it can be the Virtual Center, a misconfigured vRouter or even a vSwitch. Finally, putting all your eggs into one basket can increase efficiency and save money. It can also raise your risk profile. An engineer may have caused this outage but I would find it hard to believe that replacing the engineer would make the "risk" go away.

--
There is or can be built a machine that can simulate any physical object. -Church-Turing principle
1. Re:VMware shows its PR colors. by ToasterMonkey · 2011-05-02 14:36 · Score: 5, Insightful
  
  VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.
  "Transparency is bad" +4 Insightful
  What the... ?
2. Re:VMware shows its PR colors. by rsborg · 2011-05-02 15:49 · Score: 4, Informative
  
  VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.
  "Transparency is bad" +4 Insightful
  What the... ?
  You know, I'd prefer my vendor/partner (i.e., VMware) doesn't throw their employees under the bus when bad stuff happens. If this happened at Apple or Google the group (leadership taking responsibility) would announce they messed up... not "one of the peons pushed a magic button".
  Transparency is only useful as a way to diagnose and improve. This "explanation" from VMware hides all explanation (...touched the keyboard. This resulted in a full outage of the network infrastructure...) while torching a single employee.
  
  --
  Make sure everyone's vote counts: Verified Voting
Don't let it happen again by stumblingblock · 2011-05-02 12:35 · Score: 5, Funny

They just have to remove that key from the keyboard. You know, the one that massively crashes the entire system. Poor judgement to have that key there.
Re:Slashdot summary non sensationalist by Icegryphon · 2011-05-02 12:47 · Score: 5, Funny

Don't go knocking my typewriter. It's electric, and has a wonderful BNC connector for network access. IBM, you did good.
Re:UR DOING IT WRONG! by Jeremi · 2011-05-02 14:07 · Score: 4, Funny

I would like more elaboration on what "touched the keyboard" means.
It was an extreme case of static discharge. The engineer is lucky to be alive -- when doing cloud computing, thunderstorms are a huge hazard.

--

I don't care if it's 90,000 hectares. That lake was not my doing.
Anybody see the irony in the first outage? by HockeyPuck · 2011-05-02 18:33 · Score: 4, Interesting

Ok.. so VMware is owned by EMC, a dominant storage player. They lost a power supply in a cabinet. So? EMC arrays have had multiple power feeds for years (decades). Even the low end CLARiiON has 2x power supplies. And anybody that racks up equipment knows to connect each rack's powerstrip/PDU to a separate feed, so that if you lose one PDU, the cabinet still has 100% power, just with no redundancy.
I also find it odd that they'd have an application configuration where losing access to ONE LUN on ONE array would cripple the entire application. Umm... this is bad application design if you ask me. All it would take would be for the host to mirror the LUN to another disk array. That way the array could blow up and you'd be fine, and being VMware (a part of EMC) disk is cheap, unlike the brutal prices the rest of us pay.
Either that or the power failure caused a loss of a single path from host to disk and they forgot to configure PowerPath on the server... or verify that VMware's native multipathing was working correctly...
Irony. A storage company having a storage problem.