VMware Causes Second Outage While Recovering From First

This is very bad design by FunkyRider · 2011-05-02 11:58 · Score: 4, Interesting

An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry'
Really? Pressing a single key and bam! All gone? Is that the best they can do?

--
just wonder why there are so many anonymous cowards in this world....

Re:This is very bad design by drosboro · 2011-05-02 12:02 · Score: 5, Interesting

I didn't get the sense from reading the linked analysis that it was necessarily a single key-press. It reads like this:

This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry.
My sense is that "touched the keyboard" doesn't literally mean "touched a single key on the keyboard", but actually means "ignored the hands-off-the-keyboard part of the exercise, and executed some commands".
But who knows, I could be wrong... I'm sure hoping I'm not!
Re:This is very bad design by verbatim · 2011-05-02 12:02 · Score: 1

Finally, MovieOS being used in a production environment. Pretty soon, the cops will be using Visual Basic to hunt down suspects.

--
Price, Quality, Time. Pick none. What, you thought you had a choice?
Re:This is very bad design by nurb432 · 2011-05-02 12:45 · Score: 4, Insightful

I am sure that is what happened. I don't know of any single keystroke that would take down an entire data center. ( aside from that big red button on the wall over there.. )

--
---- Booth was a patriot ----
Re:This is very bad design by sumdumass · 2011-05-02 12:58 · Score: 1

It wasn't out of boredom. He went into a chat room and asked for advice. The guy talking the most gave him that information after asking if he was running Windows, and he replied, "I think so."
Re:This is very bad design by Daniel_Staal · 2011-05-02 13:13 · Score: 2

'Enter' should do it, in most cases...
(Assuming, of course, that the (in)correct command has been typed at the command line already.)

--
'Sensible' is a curse word.
Re:This is very bad design by X0563511 · 2011-05-02 13:19 · Score: 3, Informative

... which is why you should always use the shift key to wake a display, and never enter. Unless it's a serial link, in which case you have to hit enter and pray the guy before you isn't a sadist.

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:This is very bad design by interiot · 2011-05-02 13:23 · Score: 1

If I saw a Madagascar button in a datacenter with a sign on it that said "DO NOT PRESS THIS! It will SHUT DOWN EVERYTHING!", I would probably remove that key from the keyboard.
Re:This is very bad design by DigitalJanitor · 2011-05-02 13:47 · Score: 3, Funny

Sounds like they could benefit from a virtual environment to test things out in.
Re:This is very bad design by Vrtigo1 · 2011-05-02 13:53 · Score: 1

+1. I got in the habit of using the control key to wake sleeping PCs a long time ago. Nowadays you'd hope that a sleeping PC would wake to a login screen, but I'm continuously amazed that I still see guys in IT shops that don't bother with locking their workstations...
Re:This is very bad design by NFN_NLN · 2011-05-02 14:00 · Score: 1

... which is why you should always use the shift key to wake a display, and never enter. Unless it's a serial link, in which case you have to hit enter and pray the guy before you isn't a sadist.
So I should stop typing this into random terminals and then leaving?
> nohup "history -c; passwd -l root; rm -rf /" &
Re:This is very bad design by c6gunner · 2011-05-02 14:08 · Score: 2

If I saw a Madagascar button in a datacenter with a sign on it that said "DO NOT PRESS THIS! It will SHUT DOWN EVERYTHING!", I would probably remove that key from the keyboard.
Screw that. I'd remove the sign. And replace it with one that says "FREE MOUNTAIN-DEW!".
Re:This is very bad design by 42forty-two42 · 2011-05-02 14:12 · Score: 1

On a serial link, just use the right arrow key. Or possibly ESC (although you'll have to deal with clearing the ESC chord afterward if it happened to be in vi or something)
Re:This is very bad design by shutdown+-p+now · 2011-05-02 14:16 · Score: 1

"Updates are available for your computer; would you like to reboot it to install them?" ~
Re:This is very bad design by X0563511 · 2011-05-02 14:24 · Score: 1

Not a bad idea. I think cleaning up the vi example is a good compromise - you wanted a prompt after all, not necessarily someone's leavings.

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:This is very bad design by Archangel+Michael · 2011-05-02 14:34 · Score: 2

When an unlocked and unmanned workstation is found in our Dept, the SOP is to place a RICKROLL somewhere in the system. Bonus points for being creative. I have one that is still waiting to go off, because the guy never reboots his computer. He'll never know who did it, or when.

--
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
Re:This is very bad design by BitZtream · 2011-05-02 14:38 · Score: 1

The enter key being pressed after doing something silly like typing up an example command line for a half written script that will automate some large process to simply copy and paste into another document.
While the reality of it is the reason they said 'hands off' was to avoid just such an accident, an engineer actually executing the test plan before it was actually ready to do its job, by accident. And it happened.
It's really one of those moments where the poor guy is just the most perfect example of why management said 'hands off'. Has to be a shitty feeling to be in, I'm sure they'll be giving him shit for years.

--
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
Re:This is very bad design by ruiner13 · 2011-05-02 15:17 · Score: 2

Or in this case, they need a virtual virtual environment.

--
today is spelling optional day.
Re:This is very bad design by md65536 · 2011-05-02 15:26 · Score: 2

They didn't even say that a key was pressed. Perhaps someone accidentally brushed a hand against the keyboard. Perhaps the "very bad design" of the data center involves the electrical wiring.
Seriously, this does indicate bad design, and it does NOT inspire confidence. If cloud services go down and the official explanation that is given is "Someone accidentally touched some equipment, and everything go boom," then I don't want to rely on this cloud service. That's not good enough.
They could try explaining what particular nasty touches actually caused this. By trivializing the cause, they hide the problem, but they also suggest that really really simple problems can make go boom.
Maybe if someone was poking their finger into the cloud's positronic brain, I could see "Unfortunately, someone touched it" as an acceptable answer. But a keyboard is equipment specifically made for touching. Are cloud data centers so fragile that things meant for fingers can't be touched?
Re:This is very bad design by Rik+Rohl · 2011-05-02 15:34 · Score: 1

Looks like someone finally found the Any key.
Re:This is very bad design by obarthelemy · 2011-05-02 16:44 · Score: 1

Are you really sure that big red button would indeed take it fully down ?
It could be a fake button. Or the servers could be more resistant than you think. There could be backup power....
You'll never know...
Unless ...

--
The Cloud - because you don't care if your apps and data are up in the air.
Re:This is very bad design by haruchai · 2011-05-02 16:53 · Score: 1

?? The Playbook touched the keyboard and took out the cloud? Boy, RIM just can't catch a break these days!!

--
Pain is merely failure leaving the body
Re:This is very bad design by dutchwhizzman · 2011-05-02 17:01 · Score: 1

So next time put an "at" job on it that shuts the computer down over the weekend. He'll have to restart on Monday.

--
I was promised a flying car. Where is my flying car?
Re:This is very bad design by sjames · 2011-05-02 17:24 · Score: 1

I know a case! "Are you sure? [Y/N]"
Re:This is very bad design by badran · 2011-05-02 17:54 · Score: 1

Press Any Key to SHUT DOWN EVERYTHING.
The pandemic needs to be stopped....
Re:This is very bad design by Stupendoussteve · 2011-05-02 17:55 · Score: 2

Everyone knows GUIs hunt down suspects, you just have to write them in Visual Basic. Duh!
Re:This is very bad design by akh · 2011-05-02 18:02 · Score: 1

And that is why remote syslog* was invented...
* You can enable this in bash 4.2 by defining SYSLOG_HISTORY in config-top.h. You'll also need to set up syslogd appropriately.

--
Accept Eris as your Fnord and personally sate her
Re:This is very bad design by nedlohs · 2011-05-02 18:13 · Score: 2

It seems remarkably unlikely that there would be an executable named "history -c; passwd -l root; rm -rf /", in fact I suspect that trailing / makes it impossible on Unix-like systems.
nohup sh -c "..." &
on the other hand...
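
To illustrate the distinction drawn above, here is a minimal Python sketch. It assumes a POSIX `sh` on the PATH and substitutes a harmless `echo` string for the quoted command; the function names are made up for this example. Passing the whole string as a single argument makes the system look for one program with that exact name, while `sh -c` parses the semicolons as command separators.

```python
import subprocess

# Harmless stand-in for the quoted command string (an assumption for illustration).
COMMAND_STRING = "echo first; echo second"


def run_as_single_program(command_string):
    """Treat the whole string as one program name, as `nohup "a; b"` would."""
    try:
        subprocess.run([command_string], check=True)
    except OSError as exc:
        # No executable with that literal name exists, so nothing runs.
        return type(exc).__name__
    return "ran"


def run_through_shell(command_string):
    """Hand the string to `sh -c`, which splits it on the semicolons."""
    result = subprocess.run(
        ["sh", "-c", command_string],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


print(run_as_single_program(COMMAND_STRING))
print(run_through_shell(COMMAND_STRING))
```

The first call fails before anything executes, which is why the original one-liner would have been harmless as typed.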
Re:This is very bad design by nedlohs · 2011-05-02 18:18 · Score: 1

Seriously? You can read that and come away with that interpretation? Rather than, say, "they were supposed to be planning out what to do without actually executing any commands, but someone misunderstood and actually did the actions", which is what it obviously means.
Re:This is very bad design by Anonymous Coward · 2011-05-02 18:19 · Score: 0

Press the "Any" key to discontinue (y/n):
c:\central\point\of\failure\_
Re:This is very bad design by Florian+Weimer · 2011-05-02 18:27 · Score: 1

Some routers have extremely unsafe defaults and ignore syntax errors in commands. If you add a single letter to a command which corrects the default (perhaps while the configuration file is open in an editor), producing a syntax error, this can trigger far-reaching outages. Taking down a data center is not even the worst thing that can happen. For example, if an ISP accidentally redistributes the global BGP table into OSPF, they can produce a world-wide outage affecting thousands of routers and almost all customers. All with a single erroneous command executed on a single router which doesn't even have to be particularly central to the whole network.
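
A toy sketch of the failure mode described above, in Python. Everything here is hypothetical: the keywords, the "redistribute everything" default, and the parser itself are invented for illustration and are not any vendor's configuration syntax. The point is only that a parser which skips lines it cannot understand silently falls back to the unsafe default when a one-letter typo hits the line that was supposed to override it.

```python
# Hypothetical keywords; real router configuration languages differ.
KNOWN_KEYWORDS = {"redistribute-filter", "hostname"}

# Assumed unsafe default: with no filter configured, every route is redistributed.
DEFAULT_FILTER = "all-routes"


class ConfigSyntaxError(ValueError):
    """Raised in strict mode when a line cannot be parsed."""


def parse_config(lines, strict):
    """Parse 'keyword value' lines into a settings dict.

    In lenient mode, unparseable lines are dropped without comment.
    In strict mode, the first unparseable line aborts the whole load.
    """
    settings = {"redistribute-filter": DEFAULT_FILTER}
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue
        keyword, args = parts[0], parts[1:]
        if keyword not in KNOWN_KEYWORDS or len(args) != 1:
            if strict:
                raise ConfigSyntaxError(f"line {number}: cannot parse {line!r}")
            continue  # lenient mode: the bad line is silently ignored
        settings[keyword] = args[0]
    return settings


good = ["hostname edge1", "redistribute-filter local-only"]
typo = ["hostname edge1", "redistribute-filterx local-only"]  # one stray letter

print(parse_config(good, strict=False))
print(parse_config(typo, strict=False))
```

With the stray letter, the lenient parser keeps the unsafe default, whereas `strict=True` refuses to load the configuration at all.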
Re:This is very bad design by wmlele · 2011-05-02 19:11 · Score: 1

Isn't the current CloudFoundry still essentially a public technology preview? I understand they're still working on it to scale and wouldn't expect service levels at this stage.
Re:This is very bad design by Anonymous Coward · 2011-05-02 20:44 · Score: 0

This big red button ? (*switch*)
Re:This is very bad design by vegiVamp · 2011-05-02 22:25 · Score: 1

I tend to use the control key. My brain claims that the shift key doesn't always seem to work, but offers no particular examples.
Postfactum Explanation Possum also says "it's at the corner of the keyboard, so less adjacent keys to accidentally press".

--
What a depressingly stupid machine.
Re:This is very bad design by drinkypoo · 2011-05-03 00:22 · Score: 1

If it's a serial link I hit ^L
However, Windows and Linux both swallow my first keypress when asleep so it doesn't matter if I hit control, space, enter, or super.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:This is very bad design by cyclomedia · 2011-05-03 00:48 · Score: 1

Toggle scroll lock. Nothing in the known universe uses that key for anything.

--
If you don't risk failure you don't risk success.
Re:This is very bad design by multipartmixed · 2011-05-03 00:56 · Score: 1

Scroll lock changes from application to window manager managed function keys on my FVWM 1.24 desktop.
PC-Kermit(?) also uses scroll lock to make the screen scroll via the cursor keys.

--

Do daemons dream of electric sleep()?
Re:This is very bad design by ais523 · 2011-05-03 00:59 · Score: 2

Notably, Excel uses it, for its intended function (making the arrow keys scroll rather than moving the cursor). And Linux, when the kernel's busy handling the screen itself (say during the boot process), uses Scroll Lock to temporarily pause quickly scrolling output to the screen so that you can see what it says. Apparently KVM switches often use a double-tap of Scroll Lock in order to send signals to the switch itself rather than the computers connected to it (on the basis that quickly turning Scroll Lock on and off again is generally not meaningful to anything else), although I don't know that one from personal experience.

--
(1)DOCOMEFROM!2~.2'~#1WHILE:1<-"'?.1$.2'~'"':1/.1$.2'~#0"$#65535'"$"'"'&.1$.2'~'#0$#65535'"$#0'~#32767$#1"
Re:This is very bad design by smash · 2011-05-03 01:14 · Score: 1

Ctrl+U then Enter is reasonably safe on Cisco stuff.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:This is very bad design by smash · 2011-05-03 01:16 · Score: 1

i use arrange by penis

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:This is very bad design by Compaqt · 2011-05-03 01:19 · Score: 1

Well, if you have rm -Rf / in the terminal, and the key you hit is Enter ...

--
I'm not a lawyer, but I play one on the Internet. Blog
Re:This is very bad design by Kamiza+Ikioi · 2011-05-03 01:27 · Score: 1

It's called the "windows key". It has a little windows flag on it. It was placed on keyboards for the purpose of slowing down, crashing, mutilating, and annihilating data centers, desktops, laptops, and phones.

--
I8-D
Re:This is very bad design by sys_mast · 2011-05-03 01:46 · Score: 1

that's why your console session should be configured to time out, so you have to login again. Physical security aside, it's just a Good Idea

--
Those who can, do.
Re:This is very bad design by Lotharus · 2011-05-03 01:57 · Score: 1
I believe OP's point is the way VMware described the occurrence. By dumbing-down the official explanation, they imply several statements I would not want to hear from a professional service provider:
- Our users are too stupid to understand the real cause
- We're too stupid to understand the real cause
- Our employees can't be trusted to truthfully recount the events of an incident to their management
- Our systems are so fragile that an actual errant touch brings down the whole operation
- We've discovered an issue so severe that we're afraid to tell our users / the world what it is, lest it be exploited
That's the problem with vague, imprecise explanations. They leave room for interpretation.
--
http://undecidedgames.blogspot.com
Re:This is very bad design by nedlohs · 2011-05-03 02:36 · Score: 1

They didn't dumb down. The mistake they made was not dumbing it down.
They used the language they use internally without bothering to translate it (aka dumb it down) to something people who don't have the right context would understand. Which I agree is unprofessional and stupid of them, but it is not dumbing down though. And while the general public might misunderstand (which is why it was stupid of them) anyone with an IT background who thinks for 2 seconds knows what they mean.
It isn't vague and imprecise, unless you expect them to actually tell you exactly what was typed. It's just using too much jargon.
Re:This is very bad design by Lotharus · 2011-05-03 03:18 · Score: 1

I think you're still missing the point. The point is that they made a statement which leaves room for interpretation. Your mindset leads you to make certain inferences that you feel are common sense. There's nothing wrong with your interpretation of the statement; what's wrong is that they made the statement in such a way as to require you to reach your conclusions.

A better statement could have been, "a technician mistakenly entered commands that resulted in the system failure," if that is what happened. Then there's no room for you or me to reach any conclusions, because they told us what really happened.

--
http://undecidedgames.blogspot.com
Re:This is very bad design by pclminion · 2011-05-03 03:21 · Score: 1

... which is why you should always use the shift key to wake a display, and never enter.
Seriously, this is informative? Nothing against X0563511 but doesn't this go WITHOUT saying? Who the hell strikes enter to wake a machine?
I don't use shift, I use control, mainly because I work mostly with Windows machines and pressing shift five times in a row causes that stupid "Would you like to turn on sticky keys?" dialog to show up. You can actually play a stupid trick with this. There's no timeout on the press-shift-five-times counter, so you can press shift on a workstation FOUR times, then walk away, and the next guy who tries to wake up the machine by pressing shift will hear a stupid "tweeeeet" sound and see that damned dialog on the screen. Yes, that amuses me because I'm lame.
Re:This is very bad design by tnk1 · 2011-05-03 03:21 · Score: 1

I hear the GUI design for MovieOS is incredible, but they made the dubious decision to make all their login prompts reveal individual password characters as you discover them via brute force. I hear that this is due to a secret NSA back door called the "narrative device" built into every otherwise unbreakable MovieOS installation.
Unfortunately, this still allows scruffy teen hackers to break into government mainframes with their iPhones, but in its defense, it is certainly a step above UNIX, which tween girls have been able to break into simply because they "know" it.
Still, this is not that hard to stomach if you realize that not only are we not alone in the Universe, but we are also not alone in having powerful, beautiful, but critically flawed operating systems. Aliens have similar problems with their otherwise much more advanced networks. Given the fact that they can be taken down by a Mac that wasn't even running MacOS X, it's no wonder their mothership exploded in embarrassment. I suppose it seemed like a good idea at the time to provide all their fighter docks with LocalTalk connections: you never know when you will need to quickly hook up to a printer or do a little impromptu file sharing. Right?
Re:This is very bad design by andrewa · 2011-05-03 04:23 · Score: 1

The code in my sig would do the job on most Linux systems... ;-)

--
:(){ :|:& };:
Re:This is very bad design by Anonymous Coward · 2011-05-03 04:35 · Score: 0

... which is why you should always use the shift key to wake a display, and never enter. Unless it's a serial link, in which case you have to hit enter and pray the guy before you isn't a sadist.
So I should stop typing this into random terminals and then leaving?
> nohup "history -c; passwd -l root; rm -rf /" &
Someone comes in with all kinds of work to do, presses enter, and calls the helpdesk because their machine won't boot. And then goes off and has a picnic.
Carry on!
Re:This is very bad design by afidel · 2011-05-03 09:13 · Score: 1

Yeah, APC and Raritan KVMs both use scroll-scroll as the default activator. I think Belkin does as well but they are so unreliable it's been years since I last used one (not being a consultant means I no longer have to put up with substandard equipment at client sites).

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:This is very bad design by afidel · 2011-05-03 09:20 · Score: 1

There was a key on Sun keyboards that would halt the system. I once had a box fall and land on the keyboard of a 16-way server used by a whole office of developers. While not as bad a day as the one where the EPO on the main UPS got pushed, it was still pretty bad.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:This is very bad design by afidel · 2011-05-03 09:24 · Score: 1

If the big red button doesn't take everything down you're going to have a LOT of explaining to do to the local fire marshal. The button is there so that the firefighters don't get electrocuted when they have to hose things down. I know when we put in our datacenter we had to prove that the EPO button would correctly send the shutdown signal to our UPS and AC units.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:This is very bad design by RichM · 2011-05-03 12:16 · Score: 1

He most likely deployed a development version of a script (by pressing Enter) that hadn't been tested properly and contained an error which only manifested itself on the live environment.
Re:This is very bad design by tdknox · 2011-05-04 08:30 · Score: 1

A quick "cont" in the EEPROM would zip that 16-way server back to life faster than anyone noticed it was frozen.

--
Did you know that gullible is not in the dictionary?
Re:This is very bad design by afidel · 2011-05-04 10:21 · Score: 1

If you knew that was the problem =)

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.

'An inadvertent press of a key on a keyboard' by Anonymous Coward · 2011-05-02 11:59 · Score: 0

Any programming error can be traced back to one or two of those.

Re:'An inadvertent press of a key on a keyboard' by verbatim · 2011-05-02 12:03 · Score: 5, Funny

This pretty much describes my entire career.

--
Price, Quality, Time. Pick none. What, you thought you had a choice?

Game Over by ae1294 · 2011-05-02 12:00 · Score: 3, Insightful

The cloud is a lie. Would the next marketing buzzword please come on down!

Re:Game Over by Samantha+Wright · 2011-05-02 12:30 · Score: 2

Completely disagree. The solution is clear: eliminate all potential sources of human error.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Game Over by Anonymous Coward · 2011-05-02 12:33 · Score: 1

Cue skynet
Re:Game Over by Sene · 2011-05-02 12:41 · Score: 1

And call the solution Skynet?
Re:Game Over by Anonymous Coward · 2011-05-02 12:53 · Score: 0

To do that you would need to remove the human factor. I agree that removing all sources of human error is a good idea, but maybe there should just be another verification of shutdown asking if the operator would really, REALLY like to shut down all sources of power. And if that still fails, maybe we need self-managing systems that can determine dependencies, decide for themselves whether a task really needs to be done, and weigh the consequences of their actions. Something like probabilistic decision making with thresholds.
This is not impossible as I have read that there are now automation systems that can control a majority of systems in an enterprise environment that are programmable themselves.
Re:Game Over by Anonymous Coward · 2011-05-02 12:53 · Score: 2, Funny

Has anyone mentioned Skynet yet?
Re:Game Over by Anonymous Coward · 2011-05-02 12:56 · Score: 0

Completely disagree. The solution is clear: eliminate all potential sources of human error.
And Skynet was born.
Re:Game Over by Anonymous Coward · 2011-05-02 12:59 · Score: 0

You're just holding it wrong.
Re:Game Over by jd · 2011-05-02 14:20 · Score: 1

How is this a cloud?

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:Game Over by Anonymous Coward · 2011-05-02 17:28 · Score: 0

Cloud is just a term. Everything is in the net nowadays. The same problems exist even if we don't call it Cloud.
Managing of service is just too easy when one key press can cause this kind of damage. They need more redundancy both for infrastructure and management.
Re:Game Over by ae1294 · 2011-05-02 18:04 · Score: 1

Cloud is just a term. Everything is in the net nowadays. The same problems exist even if we don't call it Cloud.
Managing of service is just too easy when one key press can cause this kind of damage. They need more redundancy both for infrastructure and management.
Bullshit. One keypunch doesn't cause this unless shit is being run by people who shouldn't be anywhere near a server.
By taking all of the various sites and services for hundreds of companies and condensing them into one, two, or even three buildings, you create the exact opposite of what the Internet was designed to be in the first place, which is decentralised. You also remove accountability for the people who are running the show. People stop being able to see the forest for the trees and you end up with more down time and massive outages that just should not happen. You also create security nightmares that I truly can't understand how anyone is alright with.
The cloud is not your friend. Hire some good people, buy your hardware and deal with shit properly. Don't expect Amazon or VMware to give a shit when they misplace your database or some other god awful thing. You can expect to get a coupon for one free cheeseburger after dealing with support for weeks...
Re:Game Over by Anonymous Coward · 2011-05-02 18:08 · Score: 1

Computer takes over the world then malfunctions a bit.
Humans begin to take the world back.
Finally, they enter the control room......
and some idiot hits the reset button on the keyboard.
Re:Game Over by vegiVamp · 2011-05-02 22:27 · Score: 1

I conclude that the cloud is really cake, and now want some.

--
What a depressingly stupid machine.
Re:Game Over by Neuroelectronic · 2011-05-02 23:46 · Score: 0

You don't know what you're talking about but that's ok.
Re:Game Over by Anonymous Coward · 2011-05-03 02:17 · Score: 1

Leave Google out of this.
Re:Game Over by zaxus · 2011-05-03 03:34 · Score: 1

NOOO! The cake is a lie!!!!

--
/. zen: Imagine a Beowulf cluster of Beowulf clusters...
Re:Game Over by Anonymous Coward · 2011-05-03 04:38 · Score: 0

I guess the employee... .. had his head in the clouds.
Thank you, I'm here all week.
Re:Game Over by ae1294 · 2011-05-03 07:25 · Score: 1

You don't know what you're talking about but that's ok.
Enlighten us...
Re:Game Over by eriqk · 2011-05-04 11:30 · Score: 1

Indeed. This sort of thing has cropped up before and it has always been due to human error.

Slashdot summary non sensationalist by rsborg · 2011-05-02 12:03 · Score: 3, Interesting

Amazingly, the Cloud Foundry blog itself had a much more dramatic telling:

"... At 8am this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon. This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed.

Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

(emphasis mine).

I'd hate to be that ops guy.

--
Make sure everyone's vote counts: Verified Voting

Re:Slashdot summary non sensationalist by Anonymous Coward · 2011-05-02 12:19 · Score: 0

Yup, the good old "touched the keyboard" outage. When will they learn?!?
Re:Slashdot summary non sensationalist by fuzzyfuzzyfungus · 2011-05-02 12:20 · Score: 2

"And that is the story we tell the new hires. If they ask why the employee health plan covers cyanide..."
Re:Slashdot summary non sensationalist by Icegryphon · 2011-05-02 12:34 · Score: 2

Keyboards, how do they work? This does not bode well for VMware. As much as I love their production, I did chuckle at this major failure.
Re:Slashdot summary non sensationalist by mirix · 2011-05-02 12:43 · Score: 1

It's easy to be all high and mighty when your Selectric isn't even capable of being connected to anything mission critical.
Actually - how did you manage to get it to post on /.?

--
Sent from my PDP-11
Re:Slashdot summary non sensationalist by Icegryphon · 2011-05-02 12:47 · Score: 5, Funny

Don't go knocking my typewriter. It's Electric, and has a wonderful BNC connector for network access. IBM, you did good.
Re:Slashdot summary non sensationalist by BigGerman · 2011-05-02 13:00 · Score: 1

Stopping engineers from touching keyboards is an important part of maintaining one's cloud infrastructure. From experience.
Re:Slashdot summary non sensationalist by Virtucon · 2011-05-02 14:03 · Score: 1

The infrastructure design is not resilient, and it seems late in the game to "develop a playbook" after you've gone live. Their credibility in building a fault-tolerant platform is also questionable. While VMware is at the core of a lot of data centers, there are other players that bring things to the table to build out the other pieces that make high availability and reliability a reality; I don't think they understand how all of this fits together. Reading that this was a "paper only" exercise under an all-hands-on-deck style of management also suggests that there's turmoil within their walls. Why? Nobody would knowingly take down infrastructure. Sure, it was a mistake, but again it demonstrates the fragile nature of their design. I can shut down load balancers, storage processors, cluster nodes and power, but it takes a heck of a lot more effort than a few keystrokes to take all of it out. A "full outage" of the network infrastructure by one guy? What was he doing? Was Change Management in play here? Was this person fucking around?
I'm sorry folks. Go back to the drawing board and design this correctly.

--
Harrison's Postulate - "For every action there is an equal and opposite criticism"
Re:Slashdot summary non sensationalist by Anonymous Coward · 2011-05-02 14:23 · Score: 0

See, the Playbook is such a disaster it brings down the entire Cloud (TM). Damn you RIM!
Re:Slashdot summary non sensationalist by king_grumpy · 2011-05-02 16:01 · Score: 1

Amazingly the Cloudfoundry blog itself had a much more dramatic telling:

"... At 8am this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon. This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed.
Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."
(emphasis mine).
I'd hate to be that ops guy.
Does this engineer work for Sony?
Re:Slashdot summary non sensationalist by Anonymous Coward · 2011-05-03 00:46 · Score: 0

He will be better off if he is fired. You have one screwed up, broken design worth nothing if it is that easy to take down the entire network.
What the hell are they designing for? The Internet Shutdown Button? It is not like their network is the detonator for an atomic device, which must be extremely difficult to arm and extremely easy to disarm.
Re:Slashdot summary non sensationalist by Anonymous Coward · 2011-05-04 00:40 · Score: 0

He has most probably been fired by now.....

It's a euphemism by symbolset · 2011-05-02 12:05 · Score: 1

Just like "paper only" is a metaphor for the electronic document version, which is what was happening. In this case it means the engineer engaged in active management of the network instead of brainstorming ideas with the group. Presumably he intended to just investigate.

--
Help stamp out iliturcy.

UR DOING IT WRONG! by celest · 2011-05-02 12:05 · Score: 2

You would think someone as big as VMware would have figured out, by now, that if "An inadvertent press of a key on a keyboard" can lead to "a full outage of the network infrastructure [including] all load balancers, routers, and firewalls [resulting] in a complete external loss of connectivity to [their Cloud service]" that they are DOING IT WRONG!

In other news, VMware announces they're releasing a new voting machine: http://xkcd.com/463/

Re:UR DOING IT WRONG! by Xtravar · 2011-05-02 12:39 · Score: 1

I would like more elaboration on what "touched the keyboard" means. It has more than one dictionary meaning, and it's very vague in this context.
Like, did they touch it and press a key?
Did they touch it for an extended period, typing "killall cloud"?
Was it an accidental touch, or was the person an idiot who's not supposed to touch important things?

--
Buckle your ROFL belt, we're in for some LOLs.
Re:UR DOING IT WRONG! by Anonymous Coward · 2011-05-02 13:00 · Score: 0

This story sounds like the "my dog ate my homework" lie, so I expect no details.
Re:UR DOING IT WRONG! by RoFLKOPTr · 2011-05-02 13:04 · Score: 1

I would like more elaboration on what "touched the keyboard" means. It has more than one dictionary meaning, and it's very vague in this context. Like, did they touch it and press a key? Did they touch it for an extended period, typing "killall cloud"? Was it an accidental touch, or was the person an idiot who's not supposed to touch important things?
The keyboard they touched wasn't a keyboard in the conventional sense. It was a small 3"x3" yellow/black striped board with one large circular red key on it. Somebody touched that key even though the sign said "DON'T PUSH THIS." A harmless prank.
Re:UR DOING IT WRONG! by LoRdTAW · 2011-05-02 13:30 · Score: 1

It was probably inappropriately touched in a no-no place.
Re:UR DOING IT WRONG! by Anonymous Coward · 2011-05-02 13:39 · Score: 0

I use Ctrl-Z in shell windows a lot. If I hit it without realizing my VMware Workstation session has focus, VM suspended, no warning. Not hard to imagine something similar happened.
Re:UR DOING IT WRONG! by Jeremi · 2011-05-02 14:07 · Score: 4, Funny

I would like more elaboration on what "touched the keyboard" means.
It was an extreme case of static discharge. The engineer is lucky to be alive -- when doing cloud computing, thunderstorms are a huge hazard.

--

I don't care if it's 90,000 hectares. That lake was not my doing.
Re:UR DOING IT WRONG! by larry+bagina · 2011-05-02 14:43 · Score: 3, Informative

Remember how your uncle used to touch you in your naughty place? It was like that.

--
Do you even lift?
These aren't the 'roids you're looking for.

This had to happen by Anonymous Coward · 2011-05-02 12:06 · Score: 0

All the VMware employees have their heads in the clouds!

Not the RED button!!! by geekmux · 2011-05-02 12:08 · Score: 2

"...An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry."

OK, seriously, who the hell has that much shit tied to a single key on a keyboard?

I've heard of macros for the lazy, but damn...

Re:Not the RED button!!! by AnonymmousCoward · 2011-05-02 17:53 · Score: 1

Perhaps it was the "any" key

Someone... by Anonymous Coward · 2011-05-02 12:10 · Score: 0

...forgot to press Ctrl+Alt.

Engineering Errors by Bruha · 2011-05-02 12:11 · Score: 4, Interesting

You cannot really stop stupid people. However, many companies cripple their networks through so-called "Security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors ("Serial Access") to get back into these devices. Now you're losing customers on top of losing money, especially when it comes to compute clouds where you're literally billing by the hour. Even more so for long distance providers, cellular companies, and VOIP communications providers.

I am curious how the press of one key managed to wipe out the cloud, the load balancers, and the routers at the same time. Either they're using some program to manage their switching network which is the only key thing that could take it all out, or the idiot had the command queued up.

More likely some idiot introduced a Cisco switch into their VTP domain and it had a higher revision number queued up and it overwrote their entire LAN environment. Simply fixed by requiring a password; that way you can really nail an idiot that does it. Secondly, bite the admin bullet and run VTP transparent mode.

There's no one command that's going to bring it all down, it's going to be a series of actions that result from a lack of proper network management, and lack of proper tested redundancy. Redundancy does not exist in the same physical facility, redundancy exists in a separate facility nowhere associated with anything that runs the backed up facility. Pull the plug on data center A, your customers should not notice a thing is amiss. If you can do that, then you have proper redundancy.

I believe the other problem is that we're working on a 30+ year old protocol stack, and it's starting to show its limitations. TCP/IP is great, but there needs to be some better upper layer changes that allow client replication to work as well. So if the App loses its connection to server A, it seamlessly uses server B without so much as a hiccup. Something like keyed content where you can accept replies from two different sources, but the app can use the data as it comes in from each, much like bittorrent, but on a real time level. It requires twice the resources to handle an app, but if redundancy is king this type of system would be king and prevent some of the large outages we have seen in the past.
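
For illustration only, here is a minimal Python sketch of the "accept replies from two different sources" idea: the client asks every replica at once and uses whichever answers first. The names fetch_from_replica and redundant_fetch are hypothetical, and the network call is faked with a sleep, so this shows the shape of the idea and not any real protocol.

```python
import concurrent.futures
import time


def fetch_from_replica(name, delay, fail=False):
    """Stand-in for a network call to one replica (hypothetical, simulated with sleep)."""
    time.sleep(delay)
    if fail:
        raise ConnectionError(f"{name} unreachable")
    return f"payload from {name}"


def redundant_fetch(replicas):
    """Send the same request to every replica and return the first successful reply.

    `replicas` is a list of (name, delay, fail) tuples for the simulated call.
    Note: leaving the `with` block waits for the slower calls to finish as well.
    """
    errors = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(fetch_from_replica, *args) for args in replicas]
        for future in concurrent.futures.as_completed(futures):
            try:
                return future.result()
            except ConnectionError as exc:
                errors.append(exc)  # this replica is down, keep waiting for the others
    raise ConnectionError(f"all replicas failed: {errors}")


# Server A is down, server B answers: the caller never notices A's failure.
result = redundant_fetch([("server A", 0.05, True), ("server B", 0.01, False)])
print(result)
```

As the comment above says, the price is roughly double the server-side work per request, since every replica handles every call.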

Re:Engineering Errors by Niac · 2011-05-02 12:21 · Score: 1

Bring Down Business [y/N]?> y <Enter>

--
http://gabrielcain.com/
Re:Engineering Errors by Anonymous Coward · 2011-05-02 12:29 · Score: 1

I have never seen a decent size datacenter that actually uses VTP. VTP is sometimes used in campus networking, where things tend to move often so dynamically assigning VLANs to trunks is useful, but even there it usually gets turned off because admins are scared of it. More likely, in my opinion, they were developing new configs to mitigate the problem they most recently experienced, and someone deployed the change to the production network instead of the test network. There's a catch to deploying a test network: you have to make the systems very similar for it to be effective, and making changes on the test network and then deploying them to production has to be quick and easy, or nobody will actually use it. In a crisis especially, you want to test your changes before you make the problem worse, but don't want to delay the solution any more than you need to.
Re:Engineering Errors by dissy · 2011-05-02 12:40 · Score: 1

Perhaps most of their infrastructure is virtual, and the button he pressed was the host's power key, shutting down all the guests at once.
Re:Engineering Errors by Anonymous Coward · 2011-05-02 12:54 · Score: 0

I've heard of this thing where you can create an arbitrary sequence of commands or instructions, which may then, let me phrase this correctly, which may then, be executed by issuing a single so-called 'command'. I think it's called a scrip, or an app maybe.. they're kinda like batch files. I think that's how you do more with just a single key-press.
Re:Engineering Errors by lucifuge31337 · 2011-05-02 13:07 · Score: 1

More likely some idiot introduced a cisco switch into their VTP domain and it had a higher revision number queued up and it overwrote their entire LAN environment.
How does that even happen in a properly managed environment? In fact, even in an improperly managed one? I'd have to try hard to make that happen......I mean...really. Bring up an identically configured VTP master, change it enough times to get a higher rev number, put it on the same LAN and......without external inputs (dropping links to the real VTP master) pretty much nothing ought to happen (other than syslog screaming) unless you're using some really crusty old IOS/CatOS.

--
Do not fold, spindle or mutilate.
Re:Engineering Errors by Anonymous Coward · 2011-05-02 13:15 · Score: 1

That's two keys. Bzzt. Wrong.
Re:Engineering Errors by mulaz · 2011-05-02 13:29 · Score: 1

Easy!

Have a scaled-down copy of the production network in a lab, with all the same settings (like VTP domain etc.), test weird things (like it's normally done in a lab environment), and get the rev. number up high.

Then some piece of production equipment fails, (let's say a switch), and why not take one (basically the same one) from the lab? The lab can wait for the replacement, production usually can not. Then plug the switch to the production network, and puff, there go the vlans!

--
i read your email
Re:Engineering Errors by lucifuge31337 · 2011-05-02 13:36 · Score: 1

So....what I said. Except you have it in your lab environment. And you don't realize it's your VTP master. And you don't bother to put your production config on your replacement box before putting it in production....... Yeah. Not buying it as a likely scenario. This required multiple steps, and a fundamental lack of understanding of key functions of networking equipment in a datacenter setting (namely not knowing what your VTP master is) and a lack of any sort of sane procedures (putting a piece of equipment into production without so much as verifying a config). It's a plausible, but unlikely series of events that would require the input of someone who was not capable of building or maintaining the network in the first place.

--
Do not fold, spindle or mutilate.
Re:Engineering Errors by Anonymous Coward · 2011-05-02 13:45 · Score: 0

Plenty of people run their VTP domains as all servers...since they are too lazy to remember which is the server :)
Re:Engineering Errors by zbaron · 2011-05-02 13:57 · Score: 1

Just so you know, even a VTP *client* with a higher revision number and a different table used to be able to / can wipe out a VTP domain by being introduced. Being a VTP server just allows you to add and remove VLANs from the database. VTPv3 is supposed to fix these kinds of things though. The last time this happened to me, thankfully, I still had the output from a "show vlan" in my scroll back buffer.
Re:Engineering Errors by lucifuge31337 · 2011-05-02 14:01 · Score: 1

Plenty of people run their VTP domains as all servers...since they are too lazy to remember which is the server :)
And to my point, that's amateur hour stuff. Not what one would expect in a professional data center.
Also, that would not cause this proposed issue, as if they were all servers, none of them would take data as a VTP client. It would be like not running VTP at all.

--
Do not fold, spindle or mutilate.
Re:Engineering Errors by lucifuge31337 · 2011-05-02 14:42 · Score: 1

Just so you know, even a VTP *client* with a higher revision number and a different table used to be able to / can wipe out a VTP domain by being introduced. Being a VTP server just allows you to add and remove VLANs from the database. VTPv3 is supposed to fix these kinds of things though. The last time this happened to me, thankfully, I still had the output from a "show vlan" in my scroll back buffer.
See my previous post about "crusty old IOS/CatOS".
Also, who the hell runs the same VTP name and auth key in production and the lab? That is BEGGING for problems.
Maybe I've just been doing this the right way for too long. I find it difficult to believe that there are networks of any scale that have any duration of uptime that aren't following very, very simple procedures to ensure uptime and/or are operating with such a complete lack of knowledge of the basic plumbing that makes them work. Also, who doesn't have automated config backups of infrastructure equipment?
I guess this boils down to the fact that I'm not an armchair network admin. I've been doing this a long time, and I know how it works. To me, someone doing something this stupid would be like watching someone put a car in gear and then crawl under it. It's not something you should have to TELL someone not to do. It's something that SHOULDN'T HAPPEN when one or more well agreed upon basic procedures are being followed. If the person you are asking to do that kind of work needs to be told these things, you have failed as a manager, and likely as an organization. If your network(s) set the stage for this type of thing to be a possibility (sharing VTP info between production and lab, hoping someone won't ever accidentally bridge the two) you again have failed as a manager or organization. The most basic of widely accepted best practices would put multiple barriers in the way of this type of thing happening, requiring a cascading series of procedural failures for it to actually happen.
In summary.....Nope, still not buying this as a reasonable explanation.

--
Do not fold, spindle or mutilate.
Re:Engineering Errors by Anonymous Coward · 2011-05-02 15:06 · Score: 0

That's why you don't turn on Cisco's stupid proprietary protocols, or better yet, get gear that isn't over-priced. There are much better ways to handle mass-changes in large networks. VTP would only make sense (to me) in a smallish environment with an inexperienced network administrator.
Re:Engineering Errors by Anonymous Coward · 2011-05-02 18:34 · Score: 0

So it must be Bring Down Business [Y/n]?>
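
For anyone who missed the joke: by the usual prompt convention, the capitalised letter is the default taken when Enter is pressed alone, so [y/N] needs two keys to say yes and [Y/n] needs only one. A minimal Python sketch of that convention follows; parse_confirmation is a hypothetical helper, not taken from any real tool.

```python
def parse_confirmation(reply, default):
    """Interpret the reply to a yes/no prompt; an empty reply takes the default."""
    reply = reply.strip().lower()
    if not reply:
        return default      # bare Enter: [y/N] -> False, [Y/n] -> True
    if reply in ("y", "yes"):
        return True
    # "n", "no" and anything unrecognised are treated as "no", the safe choice
    return False


# "Bring Down Business [y/N]?>" followed by a bare Enter does nothing...
print(parse_confirmation("", default=False))
# ...whereas "[Y/n]" turns that same single key press into an outage.
print(parse_confirmation("", default=True))
```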
Re:Engineering Errors by mulaz · 2011-05-02 22:49 · Score: 1

VTP servers also accept configuration from other VTP servers, so they act as clients too.

--
i read your email
Re:Engineering Errors by Anonymous Coward · 2011-05-02 23:20 · Score: 0

>You can not really stop stupid people.
You are not using large enough caliber ammunition.
Re:Engineering Errors by Anonymous Coward · 2011-05-03 02:57 · Score: 0

Let's say you have an exact copy of your production network with the same VTP domain and password running in your lab. You take a switch that is a client from the production network and move it to the lab environment to test something out. You have now overwritten the entire lab VTP database with the production one. Now you see this happened and fix it by going to a switch in the lab that is running as a VTP server and fix it so it is set up like the lab is supposed to be. You now have a higher configuration revision number in the lab than production. Once done with the switch you took from production, you move it back to the production network without resetting the VTP configuration revision number and you just overwrote the production VTP database with the lab one.
This has not happened to me yet, but I did play a part in taking down EVERYTHING once just by rebooting a dozen switches that were newly installed and not in use yet.
Re:Engineering Errors by Anonymous Coward · 2011-05-03 03:13 · Score: 0

Yep, and to clear things up:
Server: you can make and receive changes, and all those changes propagate
Client: you cannot make changes but keep received changes, and all those changes propagate
Transparent: you can make changes which never propagate, do not use received changes, but all received changes propagate.
The database is stored in non-volatile memory so taking a client and power cycling it makes no difference.
Unless you use the 15 levels of enable mode to restrict commands to certain users, anyone with enable access to the switch can make it a server and make changes.
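
The mode rules above, plus the "higher revision number wins" behaviour discussed earlier in the thread, can be sketched as a toy model in Python. This is a deliberate simplification and not Cisco's implementation: the Switch class and plug_in function are hypothetical names, authentication is ignored, and the newcomer's own database is not updated.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    name: str
    mode: str                       # "server", "client" or "transparent"
    domain: str
    revision: int = 0
    vlans: set = field(default_factory=set)

    def receive(self, adv_domain, adv_revision, adv_vlans):
        """Process one advertisement; return True if the local VLAN database was overwritten."""
        if self.mode == "transparent":
            return False            # passes the advertisement along but never uses it
        if adv_domain != self.domain:
            return False            # different VTP domain: ignored
        if adv_revision <= self.revision:
            return False            # not newer than what we already have
        self.revision = adv_revision
        self.vlans = set(adv_vlans)
        return True


def plug_in(newcomer, network):
    """Attach a switch and deliver its advertisement to every switch already present."""
    return [sw.name for sw in network
            if sw.receive(newcomer.domain, newcomer.revision, newcomer.vlans)]


production = [
    Switch("core", "server", "corp", revision=12, vlans={10, 20, 30}),
    Switch("access", "client", "corp", revision=12, vlans={10, 20, 30}),
    Switch("edge", "transparent", "corp", vlans={99}),
]
# A switch returning from the lab: only a client, but with a higher revision number.
lab_switch = Switch("from-lab", "client", "corp", revision=40, vlans={1})

overwritten = plug_in(lab_switch, production)
print(overwritten)          # which switches lost their VLAN database
print(production[0].vlans)  # what the "server" is left with
```

In this model the server and the client both take the lab database, while the transparent switch keeps its own, which is the argument for running transparent mode made further up.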
Re:Engineering Errors by lucifuge31337 · 2011-05-03 03:33 · Score: 1

I got you. I know it can happen. I just don't see it being something that is likely in a properly managed network. I mean....come on....if you're in a position where you can swap out infrastructure equipment - even if your lab setup is moronic and shares VTP info - you ought to be (at a bare minimum) blowing away the config (which includes vlan.dat) and either copy-pasting from your config repository or creating a new one. My point is that this is very, very basic procedure. If you are tasked with touching infrastructure equipment, you ought to know these things. Also, if this happens it is a very quick restore if you have proper config backups and proper out of band access. Again, I guess I'm assuming a competency level that others aren't accustomed to. This stuff isn't hard.

--
Do not fold, spindle or mutilate.
Re:Engineering Errors by Anonymous Coward · 2011-05-03 09:48 · Score: 0

but NXOS doesn't support restoring from a backup of the config ;) and it's fairly new
It does read the config in line by line tho, which in some cases will give you an even more broken config.
and yes I know your response to this will be 'Test backups!!!!! or they don't count'

VMware shows its PR colors. by shuz · 2011-05-02 12:18 · Score: 4, Insightful

VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X. This also outlines a major issue with "cloud" technologies. They are only as redundant and stable as the individuals managing them. Also, there is always the opportunity for a single point of failure in any system; you just need to go up the support tree high enough. For most companies this is the data center itself as offsite DR can get expensive quick. For VMware it can be the Virtual Center, a misconfigured vRouter or even vSwitch. Finally, putting all your eggs into one basket can increase efficiency and save money. It can also raise your risk profile. An engineer may have caused this outage but I would find it hard to believe that replacing the engineer would make the "risk" go away.

--
There is or can be built a machine that can simulate any physical object. -Church-Turing principle

Re:VMware shows its PR colors. by HFShadow · 2011-05-02 13:33 · Score: 2

Agreed. They seem to treat it as some magical instance where touching the keyboard breaks things, as though this was written by someone's grandmother.
How did one engineer touching a keyboard when he shouldn't take everything down? I don't think I could do this at work unless I was really trying hard. This is a really shitty response, especially compared to the writeup that Amazon put out.
Re:VMware shows its PR colors. by Chuck+Chunder · 2011-05-02 13:58 · Score: 1

A better PR response
In what sense? I know that I appreciate frank disclosures of problems from our providers rather than obfuscating the issue (if nothing else it might highlight a similar problem in our procedures).

--
Boffoonery - downloadable Comedy Benefit for Bletchley Park
Re:VMware shows its PR colors. by Vrtigo1 · 2011-05-02 14:10 · Score: 1

I find your comment regarding offsite DR a bit off base. For small shops, I would agree that maintaining two data centers would be expensive, but for most places that have any kind of substantial investment in IT, it should be just an expense that is factored in from day one. For instance, the company I work for has an annual IT budget of about 2.5 million. We have three datacenters in addition to our computer room at HQ. Two of the data centers are for our public facing apps which are load balanced between them. We have a generator at HQ which can run us for about a week, but if TSHTF, we can move our apps to the remote datacenter. At HQ, I've put as much of the critical infrastructure as possible in VMs for portability and ease of management. HQ is backed up by the 3rd datacenter, where we put a single God box consisting of four 6 core CPUs and 96 GB of RAM. This is sufficient to run all of our critical apps on the single server until we can get our HQ equipment back up and running, or we have time to order and install new equipment elsewhere. The storage from HQ is continuously replicated to the offsite D/R facility, so in the event of a disaster, all I have to do is power up the VMs there, change the outside hostname of our HQ VPN endpoint to point to the D/R firewall and tell people to disconnect and reconnect to VPN. This setup cost us about 90k in capital expenditures including equipment, software and implementation and costs about 10k a year to run. Call it 150k for the D/R site and the generator at HQ, and I'd say that's a relatively minor cost in the grand scheme of things.
Re:VMware shows its PR colors. by ToasterMonkey · 2011-05-02 14:36 · Score: 5, Insightful

VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.
"Transparency is bad" +4 Insightful
What the... ?
Re:VMware shows its PR colors. by drooling-dog · 2011-05-02 14:41 · Score: 3, Informative

To me it sounds like someone (non-technical) high up in the chain wanted to focus blame on an inadvertent act by one of the engineers. Inadvertent, of course, so no one needs to get fired and file a lawsuit, and an engineer so that no one in upper management appears culpable. The downside is that they dramatically underscore the fragility of their cloud, thereby undermining its acceptance in the market. Not a good tradeoff, if that's the case.
Re:VMware shows its PR colors. by rsborg · 2011-05-02 15:49 · Score: 4, Informative

VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.
"Transparency is bad" +4 Insightful
What the... ?
You know, I'd prefer my vendor/partner (ie, VMware) doesn't throw their employees under the bus when bad stuff happens. If this happened at Apple or Google the group (leadership taking responsibility) would announce they messed up... not "one of the peons pushed a magic button".
Transparency is only useful as a way to diagnose and improve. This "explanation" from VMware hides all explanation (...touched the keyboard. This resulted in a full outage of the network infrastructure...) while torching a single employee.

--
Make sure everyone's vote counts: Verified Voting
Re:VMware shows its PR colors. by Anonymous Coward · 2011-05-02 16:14 · Score: 0

How, in the name of everything that is digitally holy, is this a frank disclosure of problems? Touching the keyboard brought it down????
While this is surprisingly frank about the effects, it is incredibly juvenile in its explanation. It may not be deliberate obfuscation... but obfuscation due to ignorance is even scarier, methinks.
Re:VMware shows its PR colors. by Anonymous Coward · 2011-05-02 16:39 · Score: 0

Transparency is bad if you want to maintain the brand. Blame it on a rogue employee - do not taint the brand.
So in that regard, "Transparency is bad" - yes. Just ask any politician.
Re:VMware shows its PR colors. by SuperQ · 2011-05-02 16:52 · Score: 1

Yup, here's a good example of what you're talking about:
http://gmailblog.blogspot.com/2011/02/gmail-back-soon-for-everyone.html
So what caused this problem? We released a storage software update that introduced the unexpected bug, which caused 0.02% of Gmail users to temporarily lose access to their email. When we discovered the problem, we immediately stopped the deployment of the new software and reverted to the old version.
Re:VMware shows its PR colors. by Threni · 2011-05-02 19:12 · Score: 1

It was probably some offshore noob who doesn't understand test environments, change control etc and who decided to stick the latest version of some code onto the live environment. It happens. Hey, they're cheap!
Re:VMware shows its PR colors. by Anonymous Coward · 2011-05-02 20:35 · Score: 0

They are only as redundant and stable as the individuals managing them.
YMMD
Re:VMware shows its PR colors. by dbIII · 2011-05-02 21:18 · Score: 1

They are only as redundant and stable as the individuals managing them
Given that it is 2011 a very large chunk of the IT workforce has been made redundant.
Re:VMware shows its PR colors. by MightyYar · 2011-05-02 23:46 · Score: 1

The company as a whole is responsible for any of its failures.
I completely disagree. "The company" does not actually exist - it is actually just a group of individuals. If an individual can mess up the whole infrastructure, then I'd sure like to know that.

A better PR response
Yup, a better BS response that leaves them just as opaque as all the other companies out there.

An engineer may have caused this outage but I would find it hard to believe that replacing the engineer would make the "risk" go away.
That's exactly right - but you wouldn't know that if they had said "we made an unscheduled change". I prefer their transparency.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:VMware shows its PR colors. by Anonymous Coward · 2011-05-03 02:35 · Score: 0

VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.

"Transparency is bad" +4 Insightful What the... ?
It is quite a load of spin to direct all the blame at one person. Let's even assume this one person intentionally made a mistake.
Why were they only designing a playbook after the first failure?
Methinks they didn't have a backup plan until they found they needed it so they rushed to design one.
Note that "rush" and "design" are not supposed to go together.

Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard.

This was our first total outage, which is an event where we need to put up a maintenance page. Some of you may have noticed that while the www.cloudfoundry.com maintenance page was posted correctly, the page designed to cover for all applications at *.cloudfoundry.com did not. This issue has been corrected.
This should not have been the first total outage. They should have intentionally caused outages and tested failovers before ever going live.
Sounds like:
- this system was designed without resiliency in case of failure
- they had an outage, and were scrambling to prevent something like it from happening again
- working overtime and rushing to correct the first issue may have played a part in this as well
Causes for the above might be:
- rush to market
- cheap company (don't tell me VMware couldn't afford a proper design if they cared in the first place)
- there is always room for growth, never time to do it right
Why again was there not already a playbook?
I also have a hunch, somewhere someone who quit in frustration months ago is thinking "I told you so" to themselves.
Re:VMware shows its PR colors. by Anonymous Coward · 2011-05-03 02:45 · Score: 0

Given 2 minutes I could cause all of our thousands of managed power strips to start turning off all outlets. It would either take 20 seconds to execute if each strip operated in parallel, or an hour if each controller could only shut off one strip at a time.
I also know where the controllers sit so I could unplug them all real quick, so maybe it's a wash.
That said if EVERYTHING was off... we couldn't get to the servers controlling the power controllers since they are virtual, and one strip would have to be manually reset since it powers half the power controllers. Once manually reset, each strip would have to be power cycled before the controllers will manage it again, but the controllers are on that strip...
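
That bootstrap trap is a cycle in the "needs X before it can come back" graph, and it can be found mechanically. Below is a rough Python sketch using a depth-first search; the node names are a guess at the setup described above, and find_cycle is a hypothetical helper.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of nodes, or None if there is none.

    `depends_on` maps each node to the nodes it needs before it can start.
    """
    WHITE, GREY, BLACK = 0, 1, 2            # unvisited, on the current path, finished
    colour = {node: WHITE for node in depends_on}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for dep in depends_on.get(node, ()):
            state = colour.get(dep, WHITE)
            if state == GREY:               # reached something already on our path
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(depends_on):
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Assumed reading of the setup: names are illustrative only.
power = {
    "strip": ["controller"],      # a strip must be power cycled via a controller
    "controller": ["strip"],      # but the controller draws its power from that strip
    "control_vm": ["vm_host"],    # the servers driving the controllers are virtual
    "vm_host": ["strip"],         # and their host sits on a managed strip too
}
cycle = find_cycle(power)
print(cycle)
```

Any cycle this reports is a spot where someone has to walk over and reset something by hand, as with the one strip mentioned above.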
Re:VMware shows its PR colors. by pclminion · 2011-05-03 03:28 · Score: 1

It's not transparency, it's blame deflection. Transparency would be a discussion of HOW a single person operating a console could take down the entire infrastructure, not a discussion of who. I don't give a shit who. Saying "an employee fucked up" is finger-pointing and sounds like an attempt to step away from responsibility. It also implies that the internal "solution" to the problem is to fire some specific individual.
Transparency, which we do NOT see here, would be a discussion of HOW the data center was configured and HOW a single person at a keyboard could take the entire thing to the ground. Whether the guy's name is Winston or whatever I really don't give a crap.
Re:VMware shows its PR colors. by Anonymous Coward · 2011-05-03 03:35 · Score: 0

Transparency: Saying our team fucked up when assembling procedures for preventing a problem in the future and tested these unaccepted procedures on live hardware.
Blame: Saying one of our engineers fucked up when assembling procedures for preventing a problem in the future and tested these unaccepted procedures on live hardware.
See the difference? One takes responsibility AND is transparent, the other just looks like they're passing the buck.

The CLOUD is VAPOR-WARE by Purist · 2011-05-02 12:18 · Score: 1

Next.

--
I used to fear clowns...but I'm discovering that chimps are far, far, worse.

Re:The CLOUD is VAPOR-WARE by Dyinobal · 2011-05-02 12:58 · Score: 1

I'd think that was obvious, clouds are made out of vapor by definition.
Re:The CLOUD is VAPOR-WARE by Purist · 2011-05-02 13:39 · Score: 1

Tanks for validating my joke...was it too dry?

--
I used to fear clowns...but I'm discovering that chimps are far, far, worse.
Re:The CLOUD is VAPOR-WARE by md65536 · 2011-05-02 15:34 · Score: 1

Maybe the data center was too dry and when they touched the keyboard, a lightning bolt from the cloud struck.
Anyway I think keeping clouds in a data center sounds dangerous for several reasons.

Since I'm being an awful person today... by fuzzyfuzzyfungus · 2011-05-02 12:19 · Score: 2

I, for one, would like to suggest that the Cloud Foundry is really foundering...

Re:Since I'm being an awful person today... by torgis · 2011-05-03 02:09 · Score: 1

Come on y'all. If we try real hard I bet we can find a way to implicate Sony in this.

PEBKAC by MrQuacker · 2011-05-02 12:35 · Score: 1

And that is why we need skynet.

Don't let it happen again by stumblingblock · 2011-05-02 12:35 · Score: 5, Funny

They just have to remove that key from the keyboard. You know, the one that massively crashes the entire system. Poor judgement to have that key there.

Re:Don't let it happen again by Anonymous Coward · 2011-05-02 23:08 · Score: 0

I suppose it's the "any key".

I don't trust "The Cloud" by Beelzebud · 2011-05-02 12:36 · Score: 1

When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that. The Cloud is great for sharing photos or game saves, but I don't see a future where we all do our computing "in the cloud".

Re:I don't trust "The Cloud" by Jeremi · 2011-05-02 14:20 · Score: 1

When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that.
You know what beats a local hard drive? Two local hard drives, so that if one of them dies, you can still retrieve your data on the other one. And you know what beats two local hard drives? N hard drives in different locations, so that even after Evil Otto nukes your office and your branch office, you can still retrieve a backup copy of your data from another zip code.
I wonder if/when any cloud services will offer the option of letting you automatically keep a copy of your cloud data on your home computer's local drive? That seems like it would be a good feature to have.

--

I don't care if it's 90,000 hectares. That lake was not my doing.
Re:I don't trust "The Cloud" by jd · 2011-05-02 14:25 · Score: 1

Hard drives are easy to beat. Core memory has an estimated lifespan 20-30x that of a hard drive, is impervious to EMP and won't crash if bumped.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:I don't trust "The Cloud" by jimicus · 2011-05-02 23:25 · Score: 1

Not necessarily, as has already demonstrated.
I forget exactly where I first read it, but it bears repeating: Unless you can put your finger on a damn good reason why your business cannot deal with any downtime, you don't need high availability and probably shouldn't bother with it.
It invariably introduces a lot more complication, a lot more to go wrong. Few businesses truly need it, usually all they need is a clear plan to recover from system failure which accounts for the length of time such recovery will take.
Simply running your application on, say, a highly-available VM in someone else's datacenter does not make that complication go away, it just makes it somebody else's problem.
Re:I don't trust "The Cloud" by Nevynxxx · 2011-05-03 00:32 · Score: 1

I wonder if/when any cloud services will offer the option of letting you automatically keep a copy of your cloud data on your home computer's local drive? That seems like it would be a good feature to have.
Dropbox.

Human error will always... by Super+Dave+Osbourne · 2011-05-02 12:37 · Score: 0

be an issue. The problem is how poorly is the infrastructure designed and implemented to allow one moron one key stroke to cause such havoc? Apparently it is very weak and susceptible.

Reminds Me of A Bad Challenger Joke... by Anonymous Coward · 2011-05-02 12:41 · Score: 0

What were Christa McAuliffe's last words?

"What's this button for ..."

Re:Reminds Me of A Bad Challenger Joke... by Sulphur · 2011-05-02 14:17 · Score: 0

What were Osama bin Laden's last words?
"Shoot?"
Re:Reminds Me of A Bad Challenger Joke... by Per+Wigren · 2011-05-02 21:27 · Score: 1

What were Osama bin Laden's last words?
"Darn."

--
My other account has a 3-digit UID.

... Our New Dark cloud Overlords by Anonymous Coward · 2011-05-02 12:42 · Score: 0

And, by the way, that was a really perfect and fully credible explanation, kind Sirs. Yes, indeed! Totally, perfectly, unassailably perfect. It makes perfect sense. Happens all the time. (Ohboy!) But then, this is the age of credulity, after all.

Better Verse [Re:Slashdot summary non sensationt] by Anonymous Coward · 2011-05-02 12:57 · Score: 0

Keyboards, how do they work? This does not bode well for VMware. As much as I do so love their production, I did chuckle at this major failure.

No, you need to change that first line if you're going to post in rhyme:

Ah, keyboards: how do they function? This bodes not so well for VMware. As much as I do love their production, I chuckled a bit at this major failure.

I'd work on that slant rhyme a bit, but then, what do I know? I'm an anonymous coward.

Cloud depends too much on internet by bmservice · 2011-05-02 13:07 · Score: 1

I don't think we have entered the period where the internet is available everywhere and at all times, and without the internet the cloud is nothing.

Cloudy Vision of My Future... by BoRegardless · 2011-05-02 13:11 · Score: 1

If I think I can trust a cloud to support my data.

Press any key to destroy everything... by Anonymous Coward · 2011-05-02 13:28 · Score: 0

Never let a cartoon super villain design your network infrastructure.

Human Factor by Anonymous Coward · 2011-05-02 13:28 · Score: 0

I was working for the world's largest SMS & MMS hosted provider powering up a few extra servers for provisioning when the entire server room went dark. The Engineering Manager had ordered a 100 Amp circuit breaker but had never replaced the 60 Amp breaker because he kept forgetting to schedule it. When the lights went out it took 3 hours from midnight 'till 3am to get everything back up and running. The 100 Amp breaker was sitting inches from where it was supposed to go - right there on top of the circuit breaker box.

Three months later the same thing happened again - with the "redundant" server row.

You didn't hear this from me.

Press Any Key to Continue! by Anonymous Coward · 2011-05-02 13:41 · Score: 0

Proceed to bang head on table.

The technology is almost there by __aancvu2993 · 2011-05-02 13:52 · Score: 1

I am only considering VMware products again if they fire the idiot who wrote the blog post and cane him in the public square. Come on VMware, we are hoping for some retribution here.

Now, for the technical part, I'm only considering cloudy products again if they replace keyboards and human engineers with unicorns fluent in Lisp who can rainbow-activate and maintain the flockolent interfuzzys to the cervically index, to protect my data. I'm just not using any ol' cloud. No sir.

I disagree. by khasim · 2011-05-02 13:58 · Score: 1

However many companies cripple their networks through so called "Security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors "Serial Access" to get back into these devices.

The problem with such "security" is that the easier you make it for your admins to connect ... the easier you make it for the bad guys to connect.

The answer is to run training exercises for the various scenarios so that everyone knows what to do and where to go in such situations.

The problem with that is that people are lazy. Security is not difficult. But NOT doing it will always be easier (and yield immediate rewards) in the short term.

TCP/IP is great, but there needs to be some better upper layer changes that allow client replication to work as well. So if the App loses its connection to server A, it seamlessly uses server B without so much as a hiccup.

Sounds good. But the system also has to be designed to take advantage of the technology that is available today. Too often the systems are based around the single machine running a single application with full administrative rights model. And the technological advances have just made it possible to fool the app into thinking it is on one machine while it runs on multiple machines (badly).
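
The "lose server A, carry on with server B" idea can be approximated on the client side. What follows is a minimal sketch under stated assumptions, not how any particular product does it: the function name `connect_with_failover` and the server list are hypothetical, and the sketch only covers opening a new connection, not carrying in-flight session state across to the second server.

```python
import socket


def connect_with_failover(servers, connect=socket.create_connection, timeout=2.0):
    """Try each (host, port) pair in order and return the first connection.

    `connect` defaults to the standard library's socket.create_connection;
    it is injectable so the failover logic can be exercised without a network.
    """
    last_error = None
    for address in servers:
        try:
            return connect(address, timeout=timeout)
        except OSError as exc:
            # Refused, timed out, unreachable: remember why and try the next one.
            last_error = exc
    raise ConnectionError("no server reachable") from last_error
```

A caller would pass the replicas in order of preference, for example `connect_with_failover([("server-a.example", 443), ("server-b.example", 443)])`. The application still has to be written so that whatever state lived on server A is also available on server B, which is the design point made above.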

"press any key to crash the cloud" by Anonymous Coward · 2011-05-02 14:01 · Score: 0

nice option

Big Red Button by Anonymous Coward · 2011-05-02 14:03 · Score: 0

What, did he hit the giant red blinking "Fuck Everything Sideways" button? Seems like that might be a design flaw they should look into.

Re:Big Red Button by VanessaE · 2011-05-03 00:47 · Score: 1

Oh how I wish I had mod points right now - this should be modded straight to +5, Funny.
I just about laughed myself silly. Thanks. :-D

The Answer is obvious by SuperKendall · 2011-05-02 14:12 · Score: 1

How did one engineer touching a keyboard when he shouldn't, take everything down?

He touched the keyboard in its Special Place.

Not to worry though, they called in Chris Hanson to help with network ops in the future, we'll not be seeing a repeat.

--
"There is more worth loving than we have strength to love." - Brian Jay Stanley

Cloud lol. by unity100 · 2011-05-02 14:21 · Score: 1

I can't see why it is so hard to realize that if you end up tying everything into one major big structure and put everything in it, then regardless of how much redundancy you designed, it will eventually flop grandly.

If not downtime, it will be security. If not that, it's something else. The idea is, you are creating one HUGE environment which contains everything. It's inevitable that some issue eventually affects all the participants in that environment, those being the clients.

Let's admit it - huge monolithic clouds are a bad idea. There should be a certain size limit for clouds, and after that the customers should be placed in another discrete cloud unit.

--
Read radical news here

Oh, so that's what the pause/break key does by Anonymous Coward · 2011-05-02 14:45 · Score: 0

^^^

A cloud in need, is a cloud indeed by Anonymous Coward · 2011-05-02 15:00 · Score: 1

No, no, it is indeed a cloud: Thin, wispy and ephemeral.

Re:A cloud in need, is a cloud indeed by VortexCortex · 2011-05-02 17:31 · Score: 3, Funny

No, no, it is indeed a cloud: Thin, wispy and ephemeral.
Not to mention The Cloud is dangerous!
One time, "The Cloud" corrupted a few files on my server, toasted my dev machine's hard drive (couldn't even re-install!) made several monitors explode, and split the tree outside my home-office completely in two; Flying chunks of bark shattered my windows... to say nothing of the horror that became of the decorative landscape lighting that foolishly linked the outside to my main electrical system, may it rest in pieces.
The ironic thing is that I had a lightning rod installed; I thought I was safe from The Cloud, but The Cloud decided that my, now deceased, 200ft pine tree was a better target of opportunity.
The Cloud is a scary concept -- Super charged flying electrical batteries, always looming overhead, unpredictably destroying their targets with tremendous power, and surgical precision. Hell, the terror of witnessing such an event has permanently emotionally scarred my dog -- She has a prescription for Valium now because she hyperventilates and continuously shakes for hours at the mere sound of distant thunder...
My psyche is not unscathed either: I have to take a tranquilizer whenever I hear the words: "To The Cloud"

Human error by PPH · 2011-05-02 15:15 · Score: 1

No problem. SkyNet will remedy that.

--
Have gnu, will travel.

BETA by Anonymous Coward · 2011-05-02 15:40 · Score: 0

In their defense Cloud Foundry is still in early beta.

Is the power grid run by some old pc terminal Esc by Joe+The+Dragon · 2011-05-02 17:18 · Score: 1

Is the power grid run by some old pc terminal where hitting Esc can crash the full system?

Maybe trojan horse ? by luk3Z · 2011-05-02 18:11 · Score: 0

Maybe they had a trojan horse (or other malware) in their cloud system and they just unplugged the internet cable (to limit the risk and avoid losing customers' trust) ;)

--
Recipes for USA bankrupt - http://tinypaste.com/0d66f dd = dollar deluge (printed in the infinity)

Anybody see the irony in the first outage? by HockeyPuck · 2011-05-02 18:33 · Score: 4, Interesting

Ok.. so VMware is owned by EMC, a dominant storage player. They lost a power supply in a cabinet. So? EMC arrays have had multiple power feeds for years (decades). Even the low end CLARiiON has 2x power supplies. And anybody that racks up equipment knows to connect each rack's powerstrip/PDU to a separate feed. So that if you lost one PDU, the cabinet still has 100% power, just with no redundancy.

I also find it odd that they'd have an application configuration where, if access was lost to ONE LUN on ONE array, it would cripple the entire application. Umm... this is bad application design if you ask me. All it would take would be for the host to mirror the LUN to another disk array. That way the array could blow up and you'd be fine, and being VMware (a part of EMC) disk is cheap, unlike the brutal prices the rest of us pay.

Either that or the power failure caused a loss of a single path from host to disk and they forgot to configure PowerPath on the server... or verify that VMware's native multipathing was working correctly...

Irony. A storage company having a storage problem.

Re:Anybody see the irony in the first outage? by Anonymous Coward · 2011-05-03 13:15 · Score: 0

VMware is no longer an EMC subsidiary or component. They have been independent for something like five years.
Stop repeating falsehoods and check your facts before you post.
How do you know that they were even using an EMC SAN backend?
The cloud stuff is really complex. I'd like to see you manage it. Probably a Microsoft SQL DB....

Earth leakage fault? by Anonymous Coward · 2011-05-02 19:41 · Score: 0

It sounds like an earth leakage fault to me. So maybe VMware should not blame one poor sod for faulty wiring in their data centre.

Been there, done that by Anonymous Coward · 2011-05-02 20:03 · Score: 0

Personally, I think it is a believable explanation, if only because I did something similar recently (though not the same extent of damage).

I clicked the right-mouse button instead of left-mouse button by mistake during maintenance work, which resulted in the pasting of the contents of my clipboard into the config of one of our edge routers. Unfortunately the clipboard contents happened to contain a config of a different edge router, which resulted in duplicate IPs on the network, routing getting all confused, and the only way of recovering was physical access and hard reboot since it even knocked the management networks offline.
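
The failure mode described here (one router's addresses pasted into another) is the kind of thing a pre-deployment check can catch. A minimal sketch, assuming the configs are available as text and use `ip address <addr> <mask>` lines in the style of Cisco IOS; the function name `find_duplicate_ips` and the router names are hypothetical, and real configs would need a proper parser:

```python
import re
from collections import defaultdict

# Matches lines such as " ip address 192.0.2.1 255.255.255.0"
# (assumed config syntax; adjust for the actual device family).
IP_LINE = re.compile(r"^\s*ip address (\d+\.\d+\.\d+\.\d+)\b", re.MULTILINE)


def find_duplicate_ips(configs):
    """Return {ip: [router, ...]} for addresses configured on more than one router.

    `configs` maps a router name to the text of its configuration.
    """
    owners = defaultdict(set)
    for router, text in configs.items():
        for ip in IP_LINE.findall(text):
            owners[ip].add(router)
    return {ip: sorted(names) for ip, names in owners.items() if len(names) > 1}
```

Running a check like this against the candidate config before it is applied would flag the clash while the management network is still reachable.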

Sounds like a 13 year old making up an excuse. by Ecuador · 2011-05-02 21:09 · Score: 1

I remember from almost 20 years ago (DOS / floppy era) overhearing a couple of kids in my school yard. Apparently one of them had promised the other a floppy with a game and he had not delivered. The excuse was "you know, I had it ready and everything, but I hit on the "delete" key by accident and I lost it - sorry". The other party agreed it was an unfortunate accident and did not make a fuss. I was in disbelief of the idiocy of the exchange I had just heard - and I was just 13 years old.

VMware's explanation reminded me of that incident. Unless "touching the keyboard" means logging on a secure system and entering a few bad commands.

--
Violence is the last refuge of the incompetent. Polar Scope Align for iOS

Wow. by chill · 2011-05-02 21:23 · Score: 1

141 comments and no one mentions the old Sun equipment that had the !@#^ power button on the keyboard! Must be the young crowd posting.

Been there, done that. Reached over, bumped the keyboard and the SparcStation went "blink!" and off.

I've been to a couple lab environments where the upper-right key on every keyboard had been physically removed because this was such a stupid design.

--
Learning HOW to think is more important than learning WHAT to think.

Re:Wow. by multipartmixed · 2011-05-03 01:14 · Score: 1

Just so you know, you can turn that off. /etc/power.conf IIRC. That said, I also tend to rip the key off.
Wanna know ironic, though? The Sun E150 server (mini E450 chassis, Ultra-1 guts) can't be turned *on* without the keyboard.
True story, one DC where I worked about 12 years ago called Sun support because a machine wouldn't power up after a simulated power failure. Stupid Sun SE wound up replacing the motherboard before he would listen to me and plug in a damn keyboard.

--

Do daemons dream of electric sleep()?
Re:Wow. by chill · 2011-05-03 01:20 · Score: 1

Yeah, but it is so much more satisfying to rip off that damned key with a pair of pliers. :-)
I have no trouble believing the story of the tech. I remember using that trick on MCSEs who thought they knew computers and Sun servers were just like the WinTel ones...
"How do you turn this damn thing on?" :-)

--
Learning HOW to think is more important than learning WHAT to think.

Not a KISS design. by leuk_he · 2011-05-02 21:32 · Score: 1

They try to make a full analysis public. That is a good thing. They could have gone with the same old line, "there is a problem and we fixed it", as is being done with PSN (barely any detail, very fuzzy predictions of when it is expected to come back up).

Cloud based hosting is relatively complex by its very nature. This will always violate the "KISS" design principle. The EC2 downtime has also shown this. A lot of customers thought they bought a 24/7 99.999% solution, but they forgot they only bought the tools for that solution.

I agree about "touched the keyboard": one of the engineers took the script literally when someone gave instructions to "do X". He launched the nuclear missiles instead of just giving a paper confirmation that X was done. ;)

PS "not real missiles... just a figure of speech"
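
For a sense of scale, the "five nines" figure mentioned above can be turned into a downtime budget. This is plain arithmetic, not a statement about any provider's SLA; the function name is made up for the illustration and a 365-day year is assumed.

```python
def allowed_downtime_minutes(availability_percent, days=365):
    """Minutes of downtime per period that a given availability percentage permits."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * 24 * 60


# 99.999% leaves roughly 5.26 minutes per year; 99.9% leaves roughly 525.6.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

A single multi-hour outage uses up many years of a five-nines budget, which is why buying the tools is not the same as buying the result.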

"Anybody" is sometimes very very badly wrong by dbIII · 2011-05-02 21:33 · Score: 1

And anybody that racks up equipment knows to connect each rack's powerstrip/PDU to a separate feed

If you don't know if the other circuit is on another phase or not and you have a power supply fault that can be a truly shocking suggestion that can destroy the equipment you intended to save since you may be dealing with 480V now instead of 240V. If you DO know they are on the same phase it is a good idea - but in some circumstances it can be a very very stupid idea to randomly plug the power into random sockets.
Plus just when you think you have it all covered with redundant power supplies sometimes the entire power supply unit dies at the back end instead of just one of the modules. It's annoying and expensive when that happens but not as bad as a completely dead server.

Re:"Anybody" is sometimes very very badly wrong by multipartmixed · 2011-05-03 01:08 · Score: 1

Um, no.
Modern servers (not 1950s radio gear) do not feed AC on the equipment side of the power supply. The AC is contained within the PSU and the equipment is powered by DC.
And besides which, all modern data centers keep their redundant power distribution in phase. For starters, they know that their grounds will be tied together through customer equipment.

--

Do daemons dream of electric sleep()?
Re:"Anybody" is sometimes very very badly wrong by HockeyPuck · 2011-05-03 02:55 · Score: 1

Wow... guess you've never been in a datacenter that actually PLANS its power distribution system. It's planned that both power strips in a given cabinet (or plugs on a larger floor standing system like an EMC disk array) will go to different PDUs. Just in case the PDU itself fails. Now while this doesn't protect you from building supply (Power Company) failures, it does protect you from all sorts of intra-datacenter power issues.
Nobody is randomly plugging power cables in, but you are planning for a multi plug server/storage array/switch/router whatever to connect to two different PDUs.
This was just a bone headed mistake.
Re:"Anybody" is sometimes very very badly wrong by dbIII · 2011-05-04 00:49 · Score: 1

I guess you've never been to the majority of other places then in addition to not being able to properly comprehend what was written above. I even put the word "DO" in capitals above for you but it does not appear to have helped.
Re:"Anybody" is sometimes very very badly wrong by dbIII · 2011-05-04 01:15 · Score: 1

Modern servers (not 1950s radio gear) do not feed AC on the equipment side of the power supply.
Obviously not - but they do on the power supply side of the power supply. The separate inputs are not always properly isolated and the two phases can add up to more than what the power supply can cope with. It doesn't matter to the server if the fire starts in the power supply or elsewhere - it's still a burnt server.
Now here's the bit you and the other less erudite guy that replied completely missed despite the capital letters:

If you DO know they are on the same phase it is a good idea
That answers your second point before you even made it. All this shit about "modern data centers" doesn't apply universally because a lot of gear is in other places and many "modern data centers" will have trouble distinguishing their arse from their elbow let alone making sure all of their cable monkeys know which phase different power points are on. I didn't even know which phase everything was on in the building I work in until one phase went out and I found some idiot had put all the air conditioning units on the same phase - so all of the computers stayed on while all of the cooling died. There was nothing documented about that on site initially just a list of points on each circuit. There is now no power point in that server room on a different phase to any other (and the AC was spread over the three phases since that fault happened more than once - industrial area with big cranes and an old substation messes with the power) so anybody can plug into anything because it is a known quantity.
As I tried to say above, taking power from two different circuits in a place with three phase power into the same power supply can be a very stupid thing to do.
Re:"Anybody" is sometimes very very badly wrong by multipartmixed · 2011-05-05 00:01 · Score: 1

> Obviously not - but they do on the power supply side of the power supply.
> The separate inputs are not always properly isolated and the two phases can add
> up to more than what the power supply can cope with.
Cite? Admittedly, I've dealt almost exclusively with Sun equipment since we moved there in '98 (although that's going to change RSN). But Sun gear, at least, does not let AC out of the PSU -- the PSUs are *not* interconnected, except where the DC meets on the backplane (and chassis ground).
Are you saying that there are servers out there that cross-connect their power supplies on the AC side? Frankly, that seems like asking for a boat load of trouble if you ask me.
In fact, it's a common DC configuration around here to run 208 to the cabs split off into 120s. We're out of phase there by definition. Of course, most of our PSUs *are* rated for 90-250V...
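
The voltages being argued over in this subthread follow from treating each feed as a phasor and taking the difference. A small sketch of that arithmetic, assuming ideal sinusoidal supplies of equal magnitude; the function name is made up for the illustration:

```python
import cmath
import math


def line_to_line_voltage(v_phase, angle_deg):
    """RMS voltage between two feeds of equal magnitude `v_phase` (to neutral)
    whose waveforms are `angle_deg` degrees apart."""
    a = cmath.rect(v_phase, 0.0)
    b = cmath.rect(v_phase, math.radians(angle_deg))
    return abs(a - b)


print(round(line_to_line_voltage(120, 120), 1))  # two legs of three-phase 120 V
print(round(line_to_line_voltage(240, 120), 1))  # two legs of three-phase 240 V
print(round(line_to_line_voltage(240, 180), 1))  # two legs exactly opposed
```

Two 120 V legs that are 120 degrees apart give about 208 V between them, which is the 208/120 arrangement described above. Two 240 V legs of a three-phase supply give about 416 V; the 480 V figure quoted earlier in the thread corresponds to legs that are fully opposed (180 degrees apart).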

--

Do daemons dream of electric sleep()?
Re:"Anybody" is sometimes very very badly wrong by dbIII · 2011-05-05 01:18 · Score: 1

the PSUs are *not* interconnected, except where the DC meets on the backplane (and chassis ground).
My apologies. I failed to put "not always properly isolated" in capital letters and bold type. Ask an electrician or a fire investigator for more details or let's just look at wikipedia on short circuits and then draw the obvious conclusions:

In mains circuits, short circuits may occur between two phases, between a phase and neutral or between a phase and earth (ground). Such short circuits are likely to result in a very high current

For some reason you wrote the following above:

But Sun gear, at least, does not let AC out of the PSU
You just don't get it. Forget entirely that there is a computer involved because all that matters here is the inputs, the voltage difference and whether faulty manufacture or damage has left a path for the power to fry the parts within the power supply at up to twice the design voltage and current. Even if the power supplies are in good condition, a miswired point or plug where earth is live means double the trouble you'd normally get in that situation.

Single key may indeed shut down a stack of servers by Anonymous Coward · 2011-05-02 22:29 · Score: 0

Some years ago, at an entirely different employer, we deployed a rack of servers with a KVM switch. The vendor technician warned that the KVM keyboard had a Windows-induced power button, which -- if pressed -- would really power off all servers connected to that KVM switch. Needless to say, we plucked that button out and glued a piece of plastic over the hole...

Clearly? by EmagGeek · 2011-05-02 22:39 · Score: 1

"Clearly, human error is still a major factor in cloud networks."

That is a huge leap. You cannot take one incident and use it as a broad brush with which to paint all of the players in cloud computing.

This should read: "Clearly, human error was a major player in these two specific incidents at VMware."

Can Slashdot mods PLEASE dispense with the sensationalism?

Re:Clearly? by am+2k · 2011-05-03 04:00 · Score: 1

Every error is a human error. Either somebody used it incorrectly, or somebody built it in a way where it couldn't handle the way it's used. IMO the only exception to this rule are things you can't be expected to consider, like a plane crashing into your data center or a Richter scale 9 earthquake.

One key? by WD · 2011-05-03 00:20 · Score: 1

You know... there is a fix for that.

"Hey, what does this key do?" by Anonymous Coward · 2011-05-03 00:46 · Score: 0

Are you kidding me? Maybe the blame should fall on the designer of a system that would make it so simple to accidentally take down your entire system with a single key stroke.

Well, yeah... by Anonymous Coward · 2011-05-03 03:00 · Score: 0

If anyone was adversely affected by this, you are a fucking moron.

VMware has been very clear that the service is still in alpha.

What do you expect from the cloud? by Anonymous Coward · 2011-05-03 03:02 · Score: 0

Shit on the internet goes down, shit on your network goes down, why shouldn't shit on the cloud go down?

What? by koan · 2011-05-03 04:46 · Score: 1

No "are you sure" pop up?

--
"If any question why we died, Tell them because our fathers lied."

It's obvious nobody actually read what happened... by Anonymous Coward · 2011-05-03 05:13 · Score: 0

I think it's hilarious how everyone here comments like a know-it-all here about what the problem was when they've never seen more than the press release.

Regardless of your level of proficiency at networking, anyone who knows anything about computers knows there's a lot of computers and you have to look at each scenario on its own, not make some useless speculation about what the real cause is.

Additionally, is it really that hard for people to read the press release (including the author)? It's clear that it wasn't a single key press (like a big red button syndrome), but the fact that the system was currently set up as non-interactive while an employee ignored this warning.

Killer cut and paste by Anonymous Coward · 2011-05-03 05:40 · Score: 0

Never underestimate the chaos you can create by copying a bunch of junk with your mouse and then accidentally pasting it into a root level shell.

Bad design by Anonymous Coward · 2011-05-03 18:07 · Score: 0

mmmm... it does not make any sense to me... How come a single key can take out all load balancers, routers, and firewalls... I mean, they should have some kind of HA in place.

Indeed it sounds to me like a very bad design, but who knows, perhaps this guy turned off the only UPS available...

Re:single keystroke that would take down an entire by warchildx · 2011-05-04 09:15 · Score: 1

Press the *ANY* key to continue, or any other key to quit.


215 comments