VMware Causes Second Outage While Recovering From First

Posted by Soulskill on Monday May 2, 2011 @11:55AM from the third-time's-a-charm dept.

jbrodkin writes "VMware's new Cloud Foundry service was online for just two weeks when it suffered its first outage, caused by a power failure. Things got really interesting the next day, when a VMware employee accidentally caused a second, more serious outage while a VMware team was writing up a plan of action to recover from future power loss incidents. An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry.' Clearly, human error is still a major factor in cloud networks."

This is very bad design by FunkyRider · 2011-05-02 11:58 · Score: 4, Interesting

[[An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry']] Really? Pressing a single key and bam! All gone? Is that the best they can do?

--
just wonder why there are so many anonymous cowards in this world....
1. Re:This is very bad design by drosboro · 2011-05-02 12:02 · Score: 5, Interesting
  
  I didn't get the sense from reading the linked analysis that it was necessarily a single key-press. It reads like this:
  
  "This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."
  My sense is that "touched the keyboard" doesn't literally mean "touched a single key on the keyboard", but actually means "ignored the hands-off-the-keyboard part of the exercise, and executed some commands".
  But who knows, I could be wrong... I'm sure hoping I'm not!
2. Re:This is very bad design by nurb432 · 2011-05-02 12:45 · Score: 4, Insightful
  
  I am sure that is what happened. I don't know of any single keystroke that would take down an entire data center. ( aside from that big red button on the wall over there.. )
  
  --
  ---- Booth was a patriot ----
3. Re:This is very bad design by Daniel_Staal · 2011-05-02 13:13 · Score: 2
  
  'Enter' should do it, in most cases...
  (Assuming, of course, that the (in)correct command has been typed at the command line already.)
  
  --
  'Sensible' is a curse word.
4. Re:This is very bad design by X0563511 · 2011-05-02 13:19 · Score: 3, Informative
  
  ... which is why you should always use the shift key to wake a display, and never enter. Unless it's a serial link, in which case you have to hit enter and pray the guy before you isn't a sadist.
  
  --
  For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
5. Re:This is very bad design by DigitalJanitor · 2011-05-02 13:47 · Score: 3, Funny
  
  Sounds like they could benefit from a virtual environment to test things out in.
6. Re:This is very bad design by c6gunner · 2011-05-02 14:08 · Score: 2
  
  If I saw a Madagascar button in a datacenter with a sign on it that said "DO NOT PRESS THIS! It will SHUT DOWN EVERYTHING!", I would probably remove that key from the keyboard.
  Screw that. I'd remove the sign. And replace it with one that says "FREE MOUNTAIN-DEW!".
7. Re:This is very bad design by Archangel+Michael · 2011-05-02 14:34 · Score: 2
  
  When an unlocked and unmanned workstation is found in our Dept, the SOP is to place a RICKROLL somewhere in the system. Bonus points for being creative. I have one that is still waiting to go off, because the guy never reboots his computer. He'll never know who did it, or when.
  
  --
  Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
8. Re:This is very bad design by ruiner13 · 2011-05-02 15:17 · Score: 2
  
  Or in this case, they need a virtual virtual environment.
  
  --
  today is spelling optional day.
9. Re:This is very bad design by md65536 · 2011-05-02 15:26 · Score: 2
  
  They didn't even say that a key was pressed. Perhaps someone accidentally brushed a hand against the keyboard. Perhaps the "very bad design" of the data center involves the electrical wiring.
  Seriously, this does indicate bad design, and it does NOT inspire confidence. If cloud services go down and the official explanation that is given is "Someone accidentally touched some equipment, and everything go boom," then I don't want to rely on this cloud service. That's not good enough.
  They could try explaining what particular nasty touches actually caused this. By trivializing the cause, they hide the problem, but they also suggest that really really simple problems can make go boom.
  Maybe if someone was poking their finger into the cloud's positronic brain, I could see "Unfortunately, someone touched it" as an acceptable answer. But a keyboard is equipment specifically made for touching. Are cloud data centers so fragile that things meant for fingers can't be touched?
10. Re:This is very bad design by Stupendoussteve · 2011-05-02 17:55 · Score: 2
  
  Everyone knows GUIs hunt down suspects, you just have to write them in Visual Basic. Duh!
11. Re:This is very bad design by nedlohs · 2011-05-02 18:13 · Score: 2
  
  It seems remarkably unlikely that there would be an executable named "history -c; passwd -l root; rm -rf /"; in fact I suspect that trailing / makes it impossible on unix-like systems.
  nohup sh -c "..." &
  on the other hand...
12. Re:This is very bad design by ais523 · 2011-05-03 00:59 · Score: 2
  
  Notably, Excel uses it, for its intended function (making the arrow keys scroll rather than moving the cursor). And Linux, when the kernel's busy handling the screen itself (say during the boot process), uses Scroll Lock to temporarily pause quickly scrolling output to the screen so that you can see what it says. Apparently KVM switches often use a double-tap of Scroll Lock in order to send signals to the switch itself rather than the computers connected to it (on the basis that quickly turning Scroll Lock on and off again is generally not meaningful to anything else), although I don't know that one from personal experience.
  
  --
  (1)DOCOMEFROM!2~.2'~#1WHILE:1<-"'?.1$.2'~'"':1/.1$.2'~#0"$#65535'"$"'"'&.1$.2'~'#0$#65535'"$#0'~#32767$#1"
Game Over by ae1294 · 2011-05-02 12:00 · Score: 3, Insightful

The cloud is a lie. Would the next marketing buzzword please come on down!
1. Re:Game Over by Samantha+Wright · 2011-05-02 12:30 · Score: 2
  
  Completely disagree. The solution is clear: eliminate all potential sources of human error.
  
  --
  Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
2. Re:Game Over by Anonymous Coward · 2011-05-02 12:53 · Score: 2, Funny
  
  Has anyone mentioned Skynet yet?
Re:'An inadvertent press of a key on a keyboard' by verbatim · 2011-05-02 12:03 · Score: 5, Funny

This pretty much describes my entire career.

--
Price, Quality, Time. Pick none. What, you thought you had a choice?
Slashdot summary non sensationalist by rsborg · 2011-05-02 12:03 · Score: 3, Interesting

Amazingly the Cloudfoundry blog itself had a much more dramatic telling:

"... At 8am this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon. This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed.
Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."
(emphasis mine).
I'd hate to be that ops guy.

--
Make sure everyone's vote counts: Verified Voting
1. Re:Slashdot summary non sensationalist by fuzzyfuzzyfungus · 2011-05-02 12:20 · Score: 2
  
  "And that is the story we tell the new hires. If they ask why the employee health plan covers cyanide..."
2. Re:Slashdot summary non sensationalist by Icegryphon · 2011-05-02 12:34 · Score: 2
  
  Keyboards, how do they work? This does not bode well for VMware. As much as I love their production, I did chuckle at this major failure.
3. Re:Slashdot summary non sensationalist by Icegryphon · 2011-05-02 12:47 · Score: 5, Funny
  
  Don't go knocking my typewriter. It's electric, and has a wonderful BNC connector for network access. IBM, you did good.
UR DOING IT WRONG! by celest · 2011-05-02 12:05 · Score: 2

You would think someone as big as VMware would have figured out, by now, that if "An inadvertent press of a key on a keyboard" can lead to "a full outage of the network infrastructure [including] all load balancers, routers, and firewalls [resulting] in a complete external loss of connectivity to [their Cloud service]" that they are DOING IT WRONG!
In other news, VMware announces they're releasing a new voting machine: http://xkcd.com/463/
1. Re:UR DOING IT WRONG! by Jeremi · 2011-05-02 14:07 · Score: 4, Funny
  
  I would like more elaboration on what "touched the keyboard" means.
  It was an extreme case of static discharge. The engineer is lucky to be alive -- when doing cloud computing, thunderstorms are a huge hazard.
  
  --
  
  I don't care if it's 90,000 hectares. That lake was not my doing.
2. Re:UR DOING IT WRONG! by larry+bagina · 2011-05-02 14:43 · Score: 3, Informative
  
  Remember how your uncle used to touch you in your naughty place? It was like that.
  
  --
  Do you even lift?
  These aren't the 'roids you're looking for.
Not the RED button!!! by geekmux · 2011-05-02 12:08 · Score: 2

"...An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry."
OK, seriously, who the hell has that much shit tied to a single key on a keyboard?
I've heard of macros for the lazy, but damn...
Engineering Errors by Bruha · 2011-05-02 12:11 · Score: 4, Interesting

You can not really stop stupid people. However many companies cripple their networks through so called "Security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors "Serial Access" to get back into these devices. Now you're losing customers on top of losing money, especially when it comes to compute clouds where you're literally billing by the hour. Even more so for long distance providers, cellular companies, and VOIP communications providers.
I am curious how the press of one key managed to wipe out the cloud, the load balancers, and the routers at the same time. Either they're using some program to manage their switching network which is the only key thing that could take it all out, or the idiot had the command queued up.
More likely some idiot introduced a Cisco switch into their VTP domain, and it had a higher revision number queued up and overwrote their entire LAN environment. That is simply fixed by requiring a VTP password, so that you can really nail an idiot who does it, and by biting the admin bullet and running VTP transparent mode.
There's no one command that's going to bring it all down, it's going to be a series of actions that result from a lack of proper network management, and lack of proper tested redundancy. Redundancy does not exist in the same physical facility, redundancy exists in a separate facility nowhere associated with anything that runs the backed up facility. Pull the plug on data center A, your customers should not notice a thing is amiss. If you can do that, then you have proper redundancy.
I believe the other problem is that we're working on a 30+ year old protocol stack, and it's starting to show its limitations. TCP/IP is great, but there need to be some better upper-layer changes that allow client replication to work as well. So if the app loses its connection to server A, it seamlessly uses server B without so much as a hiccup. Something like keyed content where you can accept replies from two different sources, but the app can use the data as it comes in from each, much like BitTorrent, but on a real-time level. It requires twice the resources to handle an app, but if redundancy is king this type of system would be king and prevent some of the large outages we have seen in the past.
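A rough sketch of the "accept replies from two different sources" idea, done at the application layer rather than in the protocol stack. Everything named here is hypothetical: `server_a` and `server_b` are stand-in functions rather than real network calls, and `fetch_first` is an illustrative helper, not part of any existing library. The client sends the same request to every replica and uses whichever successful reply arrives first, so a dead server A is simply ignored.

```python
import concurrent.futures
import time


def fetch_first(replicas, request, timeout=2.0):
    """Send the same request to every replica and return the first good reply."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica, request) for replica in replicas]
    try:
        # as_completed yields futures in the order they finish, not submit order.
        for future in concurrent.futures.as_completed(futures, timeout=timeout):
            try:
                return future.result()
            except Exception:
                continue  # this replica failed; wait for the next one
        raise ConnectionError("all replicas failed")
    finally:
        # Do not block on slower replicas once we have an answer.
        pool.shutdown(wait=False)


# Hypothetical replicas: A is down, B answers after a short delay.
def server_a(request):
    raise ConnectionError("server A is down")


def server_b(request):
    time.sleep(0.05)
    return f"B handled {request}"


if __name__ == "__main__":
    print(fetch_first([server_a, server_b], "GET /status"))
```

As the comment says, this costs roughly twice the resources per request, and it only works cleanly for requests that are safe to run on both servers at once.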
VMware shows its PR colors. by shuz · 2011-05-02 12:18 · Score: 4, Insightful

VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual, but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change, or simply that an unscheduled change was made to our infrastructure that caused X. This also outlines a major issue with "cloud" technologies: they are only as redundant and stable as the individuals managing them. There is also always the opportunity for a single point of failure in any system; you just need to go up the support tree high enough. For most companies this is the data center itself, as offsite DR can get expensive quickly. For VMware it can be the Virtual Center, a misconfigured vRouter or even a vSwitch. Finally, putting all your eggs into one basket can increase efficiency and save money, but it can also raise your risk profile. An engineer may have caused this outage, but I would find it hard to believe that replacing the engineer would make the "risk" go away.

--
There is or can be built a machine that can simulate any physical object. -Church-Turing principle
1. Re:VMware shows its PR colors. by HFShadow · 2011-05-02 13:33 · Score: 2
  
  Agreed. They seem to treat it as some magical instance where touching the keyboard breaks things, as though this was written by someone's grandmother.
  How did one engineer touching a keyboard when he shouldn't, take everything down? I don't think I could do this at work unless I was really trying hard. This is a really shitty response, especially compared to the writeup that amazon put out.
2. Re:VMware shows its PR colors. by ToasterMonkey · 2011-05-02 14:36 · Score: 5, Insightful
  
  VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.
  "Transparency is bad" +4 Insightful
  What the... ?
3. Re:VMware shows its PR colors. by drooling-dog · 2011-05-02 14:41 · Score: 3, Informative
  
  To me it sounds like someone (non-technical) high up in the chain wanted to focus blame on an inadvertent act by one of the engineers. Inadvertent, of course, so no one needs to get fired and file a lawsuit, and an engineer so that no one in upper management appears culpable. The downside is that they dramatically underscore the fragility of their cloud, thereby undermining its acceptance in the market. Not a good tradeoff, if that's the case.
4. Re:VMware shows its PR colors. by rsborg · 2011-05-02 15:49 · Score: 4, Informative
  
  VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.
  "Transparency is bad" +4 Insightful
  What the... ?
  You know, I'd prefer my vendor/partner (ie, VMWare) doesn't throw their employees under the bus when bad stuff happens. If this happened at Apple or Google the group (leadership taking responsibility) would announce they messed up... not "one of the peons pushed a magic button".
  Transparency is only useful as a way to diagnose and improve. This "explanation" from VMware hides all explanation (...touched the keyboard. This resulted in a full outage of the network infrastructure...) while torching a single employee.
  
  --
  Make sure everyone's vote counts: Verified Voting
Since I'm being an awful person today... by fuzzyfuzzyfungus · 2011-05-02 12:19 · Score: 2

I, for one, would like to suggest that the Cloud Foundry is really foundering...
Don't let it happen again by stumblingblock · 2011-05-02 12:35 · Score: 5, Funny

They just have to remove that key from the keyboard. You know, the one that massively crashes the entire system. Poor judgement to have that key there.
Re:A cloud in need, is a cloud indeed by VortexCortex · 2011-05-02 17:31 · Score: 3, Funny

No, no, it is indeed a cloud: Thin, wispy and ephemeral.
Not to mention The Cloud is dangerous!
One time, "The Cloud" corrupted a few files on my server, toasted my dev machine's hard drive (couldn't even re-install!) made several monitors explode, and split the tree outside my home-office completely in two; Flying chunks of bark shattered my windows... to say nothing of the horror that became of the decorative landscape lighting that foolishly linked the outside to my main electrical system, may it rest in pieces.
The ironic thing is that I had a lightning rod installed; I thought I was safe from The Cloud, but The Cloud decided that my, now deceased, 200ft pine tree was a better target of opportunity.
The Cloud is a scary concept -- Super charged flying electrical batteries, always looming overhead, unpredictably destroying their targets with tremendous power, and surgical precision. Hell, the terror of witnessing such an event has permanently emotionally scarred my dog -- She has a prescription for Valium now because she hyperventilates and continuously shakes for hours at the mere sound of distant thunder...
My psyche is not unscathed either: I have to take a tranquilizer whenever I hear the words: "To The Cloud"
Anybody see the irony in the first outage? by HockeyPuck · 2011-05-02 18:33 · Score: 4, Interesting

Ok.. so VMware is owned by EMC, a dominant storage player. They lost a power supply in a cabinet. So? EMC arrays have had multiple power feeds for years (decades). Even the low-end CLARiiON has 2x power supplies. And anybody that racks up equipment knows to connect each rack's power strip/PDU to a separate feed, so that if you lose one PDU, the cabinet still has 100% power, just with no redundancy.
I also find it odd that they'd have an application configuration where losing access to ONE LUN on ONE array would cripple the entire application. Umm... this is bad application design if you ask me. All it would take would be for the host to mirror the LUN to another disk array. That way the array could blow up and you'd be fine, and for VMware (a part of EMC) disk is cheap, unlike the brutal prices the rest of us pay.
Either that or the power failure caused a loss of a single path from host to disk and they forgot to configure PowerPath on the server... or verify that VMware's native multipathing was working correctly...
Irony. A storage company having a storage problem.