VMware Causes Second Outage While Recovering From First
jbrodkin writes "VMware's new Cloud Foundry service was online for just two weeks when it suffered its first outage, caused by a power failure. Things got really interesting the next day, when a VMware employee accidentally caused a second, more serious outage while a VMware team was writing up a plan of action to recover from future power loss incidents. An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry.' Clearly, human error is still a major factor in cloud networks."
[[An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry]] Really? Pressing a single key and bam! All gone? Is that the best they can do?
just wonder why there are so many anonymous cowards in this world....
Any programming error can be traced back to one or two of those.
The cloud is a lie. Would the next marketing buzzword please come on down!
Amazingly the Cloud Foundry blog itself had a much more dramatic telling:
"... At 8am this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon. This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed.
Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."
(emphasis mine).
I'd hate to be that ops guy.
Make sure everyone's vote counts: Verified Voting
Just like "paper only" is a metaphor for the electronic document version, which is what was happening. In this case it means the engineer engaged in active management of the network instead of brainstorming ideas with the group. Presumably he intended to just investigate.
Help stamp out iliturcy.
You would think someone as big as VMware would have figured out, by now, that if "An inadvertent press of a key on a keyboard" can lead to "a full outage of the network infrastructure [including] all load balancers, routers, and firewalls [resulting] in a complete external loss of connectivity to [their Cloud service]" that they are DOING IT WRONG!
In other news, VMware announces they're releasing a new voting machine: http://xkcd.com/463/
All the VMware employees have their heads in the clouds!
"...An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry."
OK, seriously, who the hell has that much shit tied to a single key on a keyboard?
I've heard of macros for the lazy, but damn...
...forgot to press Ctrl+Alt.
You cannot really stop stupid people. However, many companies cripple their networks through so-called "security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors (serial access) to get back into these devices. Now you're losing customers on top of losing money, especially when it comes to compute clouds where you're literally billing by the hour. Even more so for long distance providers, cellular companies, and VoIP communications providers.
I am curious how the press of one key managed to wipe out the cloud, the load balancers, and the routers at the same time. Either they're using some program to manage their switching network which is the only key thing that could take it all out, or the idiot had the command queued up.
More likely some idiot introduced a Cisco switch into their VTP domain, it had a higher revision number, and it overwrote their entire LAN environment. That is simply fixed: first, require a VTP password, so that you can really nail an idiot who does it; second, bite the admin bullet and run VTP transparent mode.
There's no one command that's going to bring it all down, it's going to be a series of actions that result from a lack of proper network management, and lack of proper tested redundancy. Redundancy does not exist in the same physical facility, redundancy exists in a separate facility nowhere associated with anything that runs the backed up facility. Pull the plug on data center A, your customers should not notice a thing is amiss. If you can do that, then you have proper redundancy.
I believe the other problem is that we're working on a 30+ year old protocol stack, and it's starting to show its limitations. TCP/IP is great, but there need to be some better upper layer changes that allow client replication to work as well. So if the app loses its connection to server A, it seamlessly uses server B without so much as a hiccup. Something like keyed content where you can accept replies from two different sources, but the app can use the data as it comes in from each, much like BitTorrent, but on a real time level. It requires twice the resources to handle an app, but if redundancy is king this type of system would be king and prevent some of the large outages we have seen in the past.
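A rough Python sketch of that idea, for what it's worth: the client sends the same request to two replicas at once and uses whichever good reply arrives first, so losing server A costs nothing but the duplicated work. Everything here (`fetch_first`, `server_a`, `server_b`) is a made-up stand-in rather than a real protocol or library, and the "servers" are just local functions.

```python
import concurrent.futures
import time


def fetch_first(replicas, request, timeout=2.0):
    """Send the request to every replica at once; return the first good reply."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica, request) for replica in replicas]
    try:
        for future in concurrent.futures.as_completed(futures, timeout=timeout):
            try:
                return future.result()
            except Exception:
                continue  # this replica failed; wait for the next one
        raise ConnectionError("all replicas failed")
    finally:
        pool.shutdown(wait=False)  # do not block on slower replicas


# Hypothetical replicas: A is down, B answers after a short delay.
def server_a(request):
    raise ConnectionError("server A is unreachable")


def server_b(request):
    time.sleep(0.05)
    return f"B:{request}"


if __name__ == "__main__":
    print(fetch_first([server_a, server_b], "ping"))
```

The price is the one already mentioned: every request is served twice, so you pay double the resources for the redundancy.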
VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual, but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change, or simply that an unscheduled change was made to our infrastructure that caused X. This also outlines a major issue with "cloud" technologies. They are only as redundant and stable as the individuals managing them. Also, there is always the opportunity for a single point of failure in any system; you just need to go up the support tree high enough. For most companies this is the data center itself, as offsite DR can get expensive quickly. For VMware it can be the Virtual Center, a misconfigured vRouter or even a vSwitch. Finally, putting all your eggs into one basket can increase efficiency and save money. It can also raise your risk profile. An engineer may have caused this outage, but I would find it hard to believe that replacing the engineer would make the "risk" go away.
There is or can be built a machine that can simulate any physical object. -Church-Turing principle
Next.
I used to fear clowns...but I'm discovering that chimps are far, far, worse.
I, for one, would like to suggest that the Cloud Foundry is really foundering...
And that is why we need skynet.
They just have to remove that key from the keyboard. You know, the one that massively crashes the entire system. Poor judgement to have that key there.
When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that. The Cloud is great for sharing photos or game saves, but I don't see a future where we all do our computing "in the cloud".
be an issue. The problem is how poorly is the infrastructure designed and implemented to allow one moron one key stroke to cause such havoc? Apparently it is very weak and susceptible.
What were Christa McAuliffe's last words?
"What's this button for ..."
And, by the way, that was a really perfect and fully credible explanation, kind Sirs. Yes, indeed! Totally, perfectly, unassailably perfect. It makes perfect sense. Happens all the time. (Ohboy!) But then, this is the age of credulity, after all.
Keyboards, how do they work?
This does not bode well for VMware.
As much as I do so love their production,
I did chuckle at this major failure.
No, you need to change that first line if you're going to post in rhyme:
Ah, keyboards: how do they function?
This bodes not so well for VMware.
As much as I do love their production,
I chuckled a bit at this major failure.
I'd work on that slant rhyme a bit, but then, what do I know? I'm an anonymous coward.
I don't think we have entered a period where the internet is available everywhere and all the time, and without the internet the cloud is nothing.
If I think I can trust a cloud to support my data.
Never let a cartoon super villain design your network infrastructure.
I was working for the world's largest SMS & MMS hosted provider powering up a few extra servers for provisioning when the entire server room went dark. The Engineering Manager had ordered a 100 Amp circuit breaker but had never replaced the 60 Amp breaker because he kept forgetting to schedule it. When the lights went out it took 3 hours from midnight 'till 3am to get everything back up and running. The 100 Amp breaker was sitting inches from where it was supposed to go - right there on top of the circuit breaker box.
Three months later the same thing happened again - with the "redundant" server row.
You didn't hear this from me.
Proceed to bang head on table.
I am only considering VMware products again if they fire the idiot who wrote the blog post and cane him in the public square. Come on VMware, we are hoping for some retribution here.
Now, for the technical part, I'm only considering cloudy products again if they replace keyboards and human engineers with unicorns fluent in Lisp who can rainbow-activate and maintain the flockolent interfuzzys to the cervically index, to protect my data. I'm just not using any ol' cloud. No sir.
The problem with such "security" is that the easier you make it for your admins to connect ... the easier you make it for the bad guys to connect.
The answer is to run training exercises for the various scenarios so that everyone knows what to do and where to go in such situations.
The problem with that is that people are lazy. Security is not difficult. But NOT doing it will always be easier (and yield immediate rewards) in the short term.
Sounds good. But the system also has to be designed to take advantage of the technology that is available today. Too often the systems are based around the single machine running a single application with full administrative rights model. And the technological advances have just made it possible to fool the app into thinking it is on one machine while it runs on multiple machines (badly).
nice option
What, did he hit the giant red blinking "Fuck Everything Sideways" button? Seems like that might be a design flaw they should look into.
How did one engineer touching a keyboard when he shouldn't, take everything down?
He touched the keyboard in its Special Place.
Not to worry though, they called in Chris Hanson to help with network ops in the future, we'll not be seeing a repeat.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
I can't see why it is so hard to realize that if you end up tying everything into one big structure and put everything in it, then regardless of how much redundancy you designed, it will eventually flop grandly.
If not downtime, it will be security. If not that, it's something else. The idea is, you are creating one HUGE environment which contains everything. It's inevitable that some issue eventually affects all the participants in that environment, those being the clients.
Let's admit it: huge monolithic clouds are a bad idea. There should be a size limit for clouds, and beyond it customers should be placed in another, discrete cloud unit.
Read radical news here
^^^
No, no, it is indeed a cloud: Thin, wispy and ephemeral.
No problem. SkyNet will remedy that.
Have gnu, will travel.
In their defense Cloud Foundry is still in early beta.
Is the power grid run by some old pc terminal where hitting Esc can crash the full system?
Maybe they had a Trojan horse (or other malware) in their cloud system and they just unplugged the internet cable (to limit the risk and the loss of customers' trust) ;)
Recipes for USA bankrupt - http://tinypaste.com/0d66f dd = dollar deluge (printed in the infinity)
OK... so VMware is owned by EMC, a dominant storage player. They lost a power supply in a cabinet. So? EMC arrays have had multiple power feeds for years (decades). Even the low end CLARiiON has 2x power supplies. And anybody that racks up equipment knows to connect each rack's power strip/PDU to a separate feed, so that if you lose one PDU, the cabinet still has 100% power, just with no redundancy.
I also find it odd that they'd have an application configuration where, if access was lost to ONE LUN on ONE array, it would cripple the entire application. Umm... this is bad application design if you ask me. All it would take would be for the host to mirror the LUN to another disk array. That way the array could blow up and you'd be fine, and since this is VMware (a part of EMC), disk is cheap, unlike the brutal prices the rest of us pay.
Either that or the power failure caused a loss of a single path from host to disk and they forgot to configure PowerPath on the server... or verify that VMware's native multipathing was working correctly...
Irony. A storage company having a storage problem.
It sounds like an earth leakage fault to me. So maybe VMware should not blame one poor sod for faulty wiring in their data centre.
Personally, I think it is a believable explanation, if only because I did something similar recently (though not the same extent of damage).
I clicked the right-mouse button instead of left-mouse button by mistake during maintenance work, which resulted in the pasting of the contents of my clipboard into the config of one of our edge routers. Unfortunately the clipboard contents happened to contain a config of a different edge router, which resulted in duplicate IPs on the network, routing getting all confused, and the only way of recovering was physical access and hard reboot since it even knocked the management networks offline.
I remember from almost 20 years ago (DOS / floppy era) overhearing a couple of kids in my school yard. Apparently one of them had promised the other a floppy with a game and he had not delivered. The excuse was "you know, I had it ready and everything, but I hit on the "delete" key by accident and I lost it - sorry". The other party agreed it was an unfortunate accident and did not make a fuss. I was in disbelief of the idiocy of the exchange I had just heard - and I was just 13 years old.
VMware's explanation reminded me of that incident. Unless "touching the keyboard" means logging on to a secure system and entering a few bad commands.
Violence is the last refuge of the incompetent. Polar Scope Align for iOS
141 comments and no one mentions the old Sun equipment that had the !@#^ power button on the keyboard! Must be the young crowd posting.
Been there, done that. Reached over, bumped the keyboard and the SPARCstation went "blink!" and off.
I've been to a couple lab environments where the upper-right key on every keyboard had been physically removed because this was such a stupid design.
Learning HOW to think is more important than learning WHAT to think.
They are trying to make a full analysis public. That is a good thing. They could have gone with the same old line, "there is a problem and we fixed it", like they are trying to do with the PSN network (barely any detail, very fuzzy predictions of when it is expected to come up again).
Cloud based hosting is relatively complex by its very nature. This will always violate the "KISS" design principle. The EC2 downtime has also shown this. A lot of customers thought they bought a 24/7 99.999% solution, but they forgot they only bought the tools for that solution.
As for "touched the keyboard", I agree: one of the engineers took the script literally. When someone gave instructions to "do X", he launched the nuclear missiles instead of just giving a paper confirmation that X was down. ;)
PS: "not real missiles... just a figure of speech"
If you don't know if the other circuit is on another phase or not and you have a power supply fault that can be a truly shocking suggestion that can destroy the equipment you intended to save since you may be dealing with 480V now instead of 240V. If you DO know they are on the same phase it is a good idea - but in some circumstances it can be a very very stupid idea to randomly plug the power into random sockets.
Plus just when you think you have it all covered with redundant power supplies sometimes the entire power supply unit dies at the back end instead of just one of the modules. It's annoying and expensive when that happens but not as bad as a completely dead server.
Some years ago, at an entirely different employer, we deployed a rack of servers with a KVM switch. The vendor technician warned that the KVM keyboard had a Windows-induced power button, which, if pressed, would really power off all servers connected to that KVM switch. Needless to say, we plucked that button out and glued a piece of plastic over the hole...
"Clearly, human error is still a major factor in cloud networks."
That is a huge leap. You cannot take one incident and use it as a broad brush with which to paint all of the players in cloud computing.
This should read: "Clearly, human error was a major player in these two specific incidents at VMware."
Can Slashdot mods PLEASE dispense with the sensationalism?
You know... there is a fix for that.
Are you kidding me? Maybe the blame should fall on the designer of a system that would make it so simple to accidentally take down your entire system with a single key stroke.
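For illustration, a minimal Python sketch of the kind of guard such a design could include: the operator has to type the target's name in full before anything destructive runs, so a stray key press does nothing. The function name `confirm_destructive` and the device name `edge-router-1` are hypothetical; nothing here reflects how VMware's tooling actually works.

```python
def confirm_destructive(action, target, read_line=input):
    """Return True only if the operator types the exact target name.

    read_line is injectable so the prompt can be scripted or tested;
    by default it reads from the terminal.
    """
    typed = read_line(f"About to {action} '{target}'. Type the name to confirm: ")
    return typed.strip() == target


def shutdown_device(target, read_line=input):
    """Hypothetical destructive operation guarded by the confirmation."""
    if not confirm_destructive("shut down", target, read_line):
        return "aborted"
    return f"{target} shut down"


if __name__ == "__main__":
    print(shutdown_device("edge-router-1"))
```

An accidental Enter, or a single character, fails the comparison and the operation is aborted.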
If anyone was adversely affected by this, you are a fucking moron.
VMware has been very clear that the service is still in alpha.
Shit on the internet goes down, shit on your network goes down, why shouldn't shit on the cloud go down?
No "are you sure" pop up?
"If any question why we died, Tell them because our fathers lied."
I think it's hilarious how everyone here comments like a know-it-all here about what the problem was when they've never seen more than the press release.
Regardless of your level of proficiency at networking, anyone who knows anything about computers knows there's a lot of computers and you have to look at each scenario on its own, not make some useless speculation about what the real cause is.
Additionally, is it really that hard for people to read the press release (including the author)? It's clear that it wasn't a single key press (like a big red button syndrome), but the fact that the system was currently set up as non-interactive while an employee ignored this warning.
Never underestimate the chaos you can create by copying a bunch of junk with your mouse and then accidentally pasting it into a root level shell.
Mmmm... it does not make any sense to me... How come a single key can take out all load balancers, routers, and firewalls? I mean, they should have some kind of HA in place.
Indeed it sounds to me like a very bad design, but who knows, perhaps this guy turned off the only UPS available...
Press the *ANY* key to continue, or any other key to quit.