Slashdot Mirror


VMware Causes Second Outage While Recovering From First

jbrodkin writes "VMware's new Cloud Foundry service was online for just two weeks when it suffered its first outage, caused by a power failure. Things got really interesting the next day, when a VMware employee accidentally caused a second, more serious outage while a VMware team was writing up a plan of action to recover from future power loss incidents. An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry.' Clearly, human error is still a major factor in cloud networks."

152 of 215 comments

  1. This is very bad design by FunkyRider · · Score: 4, Interesting

    [[An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry]] Really? Pressing a single key and bam! All gone? Is that the best they can do?

    --
    just wonder why there are so many anonymous cowards in this world....
    1. Re:This is very bad design by drosboro · · Score: 5, Interesting

      I didn't get the sense from reading the linked analysis that it was necessarily a single key-press. It reads like this:

      This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry.

      My sense is that "touched the keyboard" doesn't literally mean "touched a single key on the keyboard", but actually means "ignored the hands-off-the-keyboard part of the exercise, and executed some commands".

      But who knows, I could be wrong... I'm sure hoping I'm not!

    2. Re:This is very bad design by verbatim · · Score: 1

      Finally, MovieOS being used in a production environment. Pretty soon, the cops will be using Visual Basic to hunt down suspects.

      --
      Price, Quality, Time. Pick none. What, you thought you had a choice?
    3. Re:This is very bad design by nurb432 · · Score: 4, Insightful

      I am sure that is what happened. I don't know of any single keystroke that would take down an entire data center. ( aside from that big red button on the wall over there.. )

      --
      ---- Booth was a patriot ----
    4. Re:This is very bad design by sumdumass · · Score: 1

      It wasn't out of boredom. He went into a chat room and asked for advice. The guy talking the most gave him that information after asking if he was running windows and he replied I think so.

    5. Re:This is very bad design by Daniel_Staal · · Score: 2

      'Enter' should do it, in most cases...

      (Assuming, of course, that the (in)correct command has been typed at the command line already.)

      --
      'Sensible' is a curse word.
    6. Re:This is very bad design by X0563511 · · Score: 3, Informative

      ... which is why you should always use the shift key to wake a display, and never enter. Unless it's a serial link, in which case you have to hit enter and pray the guy before you isn't a sadist.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    7. Re:This is very bad design by interiot · · Score: 1

      If I saw a Madagascar button in a datacenter with a sign on it that said "DO NOT PRESS THIS! It will SHUT DOWN EVERYTHING!", I would probably remove that key from the keyboard.

    8. Re:This is very bad design by DigitalJanitor · · Score: 3, Funny

      Sounds like they could benefit from a virtual environment to test things out in.

    9. Re:This is very bad design by Vrtigo1 · · Score: 1

      +1. I got in the habit of using the control key to wake sleeping PCs a long time ago. Nowadays you'd hope that a sleeping PC would wake to a login screen, but I'm continuously amazed that I still see guys in IT shops that don't bother with locking their workstations...

    10. Re:This is very bad design by NFN_NLN · · Score: 1

      ... which is why you should always use the shift key to wake a display, and never enter. Unless it's a serial link, in which case you have to hit enter and pray the guy before you isn't a sadist.

      So I should stop typing this into random terminals and then leaving?

      > nohup "history -c; passwd -l root; rm -rf /" &

    11. Re:This is very bad design by c6gunner · · Score: 2

      If I saw a Madagascar button in a datacenter with a sign on it that said "DO NOT PRESS THIS! It will SHUT DOWN EVERYTHING!", I would probably remove that key from the keyboard.

      Screw that. I'd remove the sign. And replace it with one that says "FREE MOUNTAIN-DEW!".

    12. Re:This is very bad design by 42forty-two42 · · Score: 1

      On a serial link, just use the right arrow key. Or possibly ESC (although you'll have to deal with clearing the ESC chord afterward if it happened to be in vi or something)

    13. Re:This is very bad design by shutdown+-p+now · · Score: 1

      "Updates are available for your computer; would you like to reboot it to install them?" ~

    14. Re:This is very bad design by X0563511 · · Score: 1

      Not a bad idea. I think cleaning up the vi example is a good compromise - you wanted a prompt after all, not necessarily someone's leavings.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    15. Re:This is very bad design by Archangel+Michael · · Score: 2

      When an unlocked and unmanned workstation is found in our Dept, the SOP is to place a RICKROLL somewhere in the system. Bonus points for being creative. I have one that is still waiting to go off, because the guy never reboots his computer. He'll never know who did it, or when.

      --
      Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
    16. Re:This is very bad design by BitZtream · · Score: 1

      The enter key being pressed after doing something silly like typing up an example command line for a half written script that will automate some large process to simply copy and paste into another document.

      While the reality of it is the reason they said 'hands off' was to avoid just such an accident, an engineer actually executing the test plan before it was actually ready to do its job, by accident. And it happened.

      It's really one of those moments where the poor guy is just the most perfect example of why management said 'hands off'. Has to be a shitty feeling to be in, I'm sure they'll be giving him shit for years.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    17. Re:This is very bad design by ruiner13 · · Score: 2

      Or in this case, they need a virtual virtual environment.

      --

      today is spelling optional day.

    18. Re:This is very bad design by md65536 · · Score: 2

      They didn't even say that a key was pressed. Perhaps someone accidentally brushed a hand against the keyboard. Perhaps the "very bad design" of the data center involves the electrical wiring.

      Seriously, this does indicate bad design, and it does NOT inspire confidence. If cloud services go down and the official explanation that is given is "Someone accidentally touched some equipment, and everything go boom," then I don't want to rely on this cloud service. That's not good enough.

      They could try explaining what particular nasty touches actually caused this. By trivializing the cause, they hide the problem, but they also suggest that really really simple problems can make go boom.

      Maybe if someone was poking their finger into the cloud's positronic brain, I could see "Unfortunately, someone touched it" as an acceptable answer. But a keyboard is equipment specifically made for touching. Are cloud data centers so fragile that things meant for fingers can't be touched?

    19. Re:This is very bad design by Rik+Rohl · · Score: 1

      Looks like someone finally found the Any key.

    20. Re:This is very bad design by obarthelemy · · Score: 1

      Are you really sure that big red button would indeed take it fully down ?

      It could be a fake button. Or the servers could be more resistant than you think. There could be backup power....

      You'll never know...

      Unless ...

      --
      The Cloud - because you don't care if your apps and data are up in the air.
    21. Re:This is very bad design by haruchai · · Score: 1

      ?? The Playbook touched the keyboard and took out the cloud? Boy, RIM just can't catch a break these days!!

      --
      Pain is merely failure leaving the body
    22. Re:This is very bad design by dutchwhizzman · · Score: 1

      So next time put an "at job" on it that shuts the computer down over the weekend. He'll have to restart on Monday.

      --
      I was promised a flying car. Where is my flying car?
    23. Re:This is very bad design by sjames · · Score: 1

      I know a case! "Are you sure? [Y/N]"

    24. Re:This is very bad design by badran · · Score: 1

      Press Any Key to SHUT DOWN EVERYTHING.

      The pandemic needs to be stopped....

    25. Re:This is very bad design by Stupendoussteve · · Score: 2

      Everyone knows GUIs hunt down suspects, you just have to write them in Visual Basic. Duh!

    26. Re:This is very bad design by akh · · Score: 1

      And that is why remote syslog* was invented...

      * You can enable this in bash 4.2 by defining SYSLOG_HISTORY in config-top.h. You'll also need to set up syslogd appropriately.

      --
      Accept Eris as your Fnord and personally sate her
    27. Re:This is very bad design by nedlohs · · Score: 2

      It seems remarkably unlikely that there would be an executable named "history -c; passwd -l root; rm -rf /", in fact I suspect that trailing / makes it impossible on unix-like systems.

      nohup sh -c "..." &

      on the other hand...
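
A minimal sketch of the distinction drawn above, using Python's `shlex` to tokenise the two command lines roughly the way a POSIX shell would. The harmless `echo` payload is a stand-in chosen for illustration, the trailing `&` is left out, and nothing is executed:

```python
import shlex

# Harmless stand-in for the quoted command string in the thread (an assumption
# for illustration; the string is only tokenised, never run).
payload = "echo one; echo two"

# Form 1: nohup "<payload>"
# The quotes make the whole string ONE argument, so nohup would look for a
# program literally named "echo one; echo two".
quoted_form = shlex.split(f'nohup "{payload}"')

# Form 2: nohup sh -c "<payload>"
# nohup runs sh, and sh is what parses the payload as a list of commands.
sh_form = shlex.split(f'nohup sh -c "{payload}"')

print(quoted_form)
print(sh_form)
```

In the first form the semicolons never reach a shell parser, which is why the quoted version fails with a "command not found" style error instead of running anything.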

    28. Re:This is very bad design by nedlohs · · Score: 1

      Seriously? You can read that and come away with that interpretation? Rather than, say, "they were supposed to be planning out what to do without actually executing any commands, but someone misunderstood and actually did the actions", which is what it obviously means?

    29. Re:This is very bad design by Florian+Weimer · · Score: 1

      Some routers have extremely unsafe defaults and ignore syntax errors in commands. If you add a single letter to a command which corrects the default (perhaps while the configuration file is open in an editor), producing a syntax error, this can trigger far-reaching outages. Taking down a data center is not even the worst thing that can happen. For example, if an ISP accidentally redistributes the global BGP table into OSPF, they can produce a world-wide outage affecting thousands of routers and almost all customers. All with a single erroneous command executed on a single router which doesn't even have to be particularly central to the whole network.
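
A rough sketch of the failure mode described above, contrasting a loader that silently skips lines it cannot parse with one that rejects the whole change. The command vocabulary and the config lines are hypothetical, loosely modelled on router CLI syntax and not taken from any real device's grammar:

```python
# Hypothetical command vocabulary; a real router grammar is far richer.
KNOWN_COMMANDS = {"no", "redistribute", "router", "network"}


def apply_lenient(lines):
    """Keep the lines that parse and silently drop the rest (the unsafe behaviour)."""
    return [ln for ln in lines if ln.split() and ln.split()[0] in KNOWN_COMMANDS]


def apply_strict(lines):
    """Reject the whole change if any non-blank line fails to parse."""
    for number, ln in enumerate(lines, start=1):
        words = ln.split()
        if words and words[0] not in KNOWN_COMMANDS:
            raise ValueError(f"line {number}: unknown command {words[0]!r}")
    return [ln for ln in lines if ln.split()]


# One stray letter ("nno") turns the line that corrects the default
# into a syntax error.
config = ["router ospf 1", "nno redistribute bgp 65000"]

# Lenient: the corrective line vanishes, so the unsafe default stays active.
print(apply_lenient(config))

# Strict: nothing is applied and the operator sees the typo.
try:
    apply_strict(config)
except ValueError as err:
    print("rejected:", err)
```

The design point is that failing closed turns a one-letter typo into a visible error instead of a quiet fallback to the default behaviour.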

    30. Re:This is very bad design by wmlele · · Score: 1

      Isn't the current CloudFoundry still essentially a public technology preview? I understand they're still working on it to scale and wouldn't expect service levels at this stage.

    31. Re:This is very bad design by vegiVamp · · Score: 1

      I tend to use the control key. My brain claims that the shift key doesn't always seem to work, but offers no particular examples.

      Postfactum Explanation Possum also says "it's at the corner of the keyboard, so less adjacent keys to accidentally press".

      --
      What a depressingly stupid machine.
    32. Re:This is very bad design by drinkypoo · · Score: 1

      If it's a serial link I hit ^L

      However, Windows and Linux both swallow my first keypress when asleep so it doesn't matter if I hit control, space, enter, or super.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    33. Re:This is very bad design by cyclomedia · · Score: 1

      Toggle scroll lock. Nothing in the known universe uses that key for anything.

      --
      If you don't risk failure you don't risk success.
    34. Re:This is very bad design by multipartmixed · · Score: 1

      Scroll lock changes from application to window manager managed function keys on my FVWM 1.24 desktop.

      PC-Kermit(?) also uses scroll lock to make the screen scroll via the cursor keys.

      --

      Do daemons dream of electric sleep()?
    35. Re:This is very bad design by ais523 · · Score: 2

      Notably, Excel uses it, for its intended function (making the arrow keys scroll rather than moving the cursor). And Linux, when the kernel's busy handling the screen itself (say during the boot process), uses Scroll Lock to temporarily pause quickly scrolling output to the screen so that you can see what it says. Apparently KVM switches often use a double-tap of Scroll Lock in order to send signals to the switch itself rather than the computers connected to it (on the basis that quickly turning Scroll Lock on and off again is generally not meaningful to anything else), although I don't know that one from personal experience.

      --
      (1)DOCOMEFROM!2~.2'~#1WHILE:1<-"'?.1$.2'~'"':1/.1$.2'~#0"$#65535'"$"'"'&.1$.2'~'#0$#65535'"$#0'~#32767$#1"
    36. Re:This is very bad design by smash · · Score: 1

      ctrl+U then enter is reasonably safe on cisco stuff.

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    37. Re:This is very bad design by smash · · Score: 1

      i use arrange by penis

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    38. Re:This is very bad design by Compaqt · · Score: 1

      Well, if you have rm -Rf / in the terminal, and the key you hit is Enter ...

      --
      I'm not a lawyer, but I play one on the Internet. Blog
    39. Re:This is very bad design by Kamiza+Ikioi · · Score: 1

      It's called the "windows key". It has a little windows flag on it. It was placed on keyboards for the purpose of slowing down, crashing, mutilating, and annihilating data centers, desktops, laptops, and phones.

      --
      I8-D
    40. Re:This is very bad design by sys_mast · · Score: 1

      that's why your console session should be configured to time out, so you have to login again. Physical security aside, it's just a Good Idea

      --
      Those who can, do.
    41. Re:This is very bad design by Lotharus · · Score: 1
      I believe OP's point is the way VMWare described the occurrence. By dumbing-down the official explanation, they imply several statements I would not want to hear from a professional service provider:
      • Our users are too stupid to understand the real cause
      • We're too stupid to understand the real cause
      • Our employees can't be trusted to truthfully recount the events of an incident to their management
      • Our systems are so fragile that an actual errant touch brings down the whole operation
      • We've discovered an issue so severe that we're afraid to tell our users / the world what it is, lest it be exploited

      That's the problem with vague, imprecise explanations. They leave room for interpretation.

    42. Re:This is very bad design by nedlohs · · Score: 1

      They didn't dumb down. The mistake they made was not dumbing it down.

      They used the language they use internally without bothering to translate it (aka dumb it down) to something people who don't have the right context would understand. Which I agree is unprofessional and stupid of them, but it is not dumbing down though. And while the general public might misunderstand (which is why it was stupid of them) anyone with an IT background who thinks for 2 seconds knows what they mean.

      It isn't vague and imprecise, unless you expect them to actually tell you exactly what was typed. It's just using too much jargon.

    43. Re:This is very bad design by Lotharus · · Score: 1

      I think you're still missing the point. The point is that they made a statement which leaves room for interpretation. Your mindset leads you to make certain inferences that you feel are common sense. There's nothing wrong with your interpretation of the statement; what's wrong is that they made the statement in such a way as to require you to reach your conclusions.

      A better statement could have been, "a technician mistakenly entered commands that resulted in the system failure," if that is what happened. Then there's no room for you or me to reach any conclusions, because they told us what really happened.

    44. Re:This is very bad design by pclminion · · Score: 1

      ... which is why you should always use the shift key to wake a display, and never enter.

      Seriously, this is informative? Nothing against X0563511 but doesn't this go WITHOUT saying? Who the hell strikes enter to wake a machine?

      I don't use shift, I use control, mainly because I work mostly with Windows machines and pressing shift five times in a row causes that stupid "Would you like to turn on sticky keys?" dialog to show up. You can actually play a stupid trick with this. There's no timeout on the press-shift-five-times counter, so you can press shift on a workstation FOUR times, then walk away, and the next guy who tries to wake up the machine by pressing shift will hear a stupid "tweeeeet" sound and see that damned dialog on the screen. Yes, that amuses me because I'm lame.

    45. Re:This is very bad design by tnk1 · · Score: 1

      I hear the GUI design for MovieOS is incredible, but they made the dubious decision to make all their login prompts reveal individual password characters as you discover them via brute force. I hear that this is due to a secret NSA back door called the "narrative device" built into every otherwise unbreakable MovieOS installation.

      Unfortunately, this still allows scruffy teen hackers to break into government mainframes with their iPhones, but in its defense, it is certainly a step above UNIX which tween girls have been able to break into simply because they "know" it.

      Still, this is not that hard to stomach if you realize that not only are we not alone in the Universe, but we are also not alone in having powerful, beautiful, but critically flawed operating systems. Aliens have similar problems with their otherwise much more advanced networks. Given the fact that they can be taken down by a Mac that wasn't even running MacOS X, it's no wonder their mothership exploded in embarrassment. I suppose it seemed like a good idea at the time to provide all their fighter docks with LocalTalk connections: you never know when you will need to quickly hook up to a printer or do a little impromptu file sharing. Right?

    46. Re:This is very bad design by andrewa · · Score: 1

      The code in my sig would do the job on most Linux systems... ;-)

      --
      :(){ :|:& };:
    47. Re:This is very bad design by afidel · · Score: 1

      Yeah, APC and Raritan KVM's both use scroll-scroll as the default activator. I think Belkin does as well but they are so unreliable it's been years since I last used one (not being a consultant means I no longer have to put up with substandard equipment at client sites).

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    48. Re:This is very bad design by afidel · · Score: 1

      There was a key on Sun keyboards that would halt the system. I once had a box fall and land on the keyboard of a 16-way server used by a whole office of developers. While not as bad a day as the one where the EPO on the main UPS got pushed, it was still pretty bad.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    49. Re:This is very bad design by afidel · · Score: 1

      If the big red button doesn't take everything down you're going to have a LOT of explaining to do to the local fire marshal. The button is there so that the firefighters don't get electrocuted when they have to hose things down. I know when we put in our datacenter we had to prove that the EPO button would correctly send the shutdown signal to our UPS and AC units.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    50. Re:This is very bad design by RichM · · Score: 1

      He most likely deployed a development version of a script (by pressing Enter) that hadn't been tested properly and contained an error which only manifested itself on the live environment.

    51. Re:This is very bad design by tdknox · · Score: 1

      A quick "cont" in the EEPROM would zip that 16-way server back to life faster than anyone noticed it was frozen.

      --
      Did you know that gullible is not in the dictionary?
    52. Re:This is very bad design by afidel · · Score: 1

      If you knew that was the problem =)

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  2. Game Over by ae1294 · · Score: 3, Insightful

    The cloud is a lie. Would the next marketing buzzword please come on down!

    1. Re:Game Over by Samantha+Wright · · Score: 2

      Completely disagree. The solution is clear: eliminate all potential sources of human error.

      --
      Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
    2. Re:Game Over by Anonymous Coward · · Score: 1

      Cue skynet

    3. Re:Game Over by Sene · · Score: 1

      And call the solution Skynet?

    4. Re:Game Over by Anonymous Coward · · Score: 2, Funny

      Has anyone mentioned Skynet yet?

    5. Re:Game Over by jd · · Score: 1

      How is this a cloud?

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    6. Re:Game Over by ae1294 · · Score: 1

      Cloud is just a term. Everything is in the net nowadays. The same problems exist even if we don't call it Cloud.
      Managing a service is just too easy when one key press can cause this kind of damage. They need more redundancy both for infrastructure and management.

      Bullshit. One keypunch doesn't cause this unless shit is being run by people who shouldn't be anywhere near a server.

      By taking all of the various sites and services for hundreds of companies and condensing them into one, two, or even three buildings you create the exact opposite of what the Internet was designed to be in the first place, which is decentralised. You also remove accountability for the people who are running the show. People stop being able to see the forest for the trees and you end up with more downtime and massive outages that just should not happen. You also create security nightmares that I truly can't understand how anyone is alright with.

      The cloud is not your friend. Hire some good people, buy your hardware and deal with shit properly. Don't expect Amazon or VMware to give a shit when they misplace your database or some other god awful thing. You can expect to get a coupon for one free cheeseburger after dealing with support for weeks...

    7. Re:Game Over by Anonymous Coward · · Score: 1

      Computer takes over the world then malfunctions a bit.

      Humans begin to take the world back.

      Finally, they enter the control room......

      and some idiot hits the reset button on the keyboard.

    8. Re:Game Over by vegiVamp · · Score: 1

      I conclude that the cloud is really cake, and now want some.

      --
      What a depressingly stupid machine.
    9. Re:Game Over by Anonymous Coward · · Score: 1

      Leave Google out of this.

    10. Re:Game Over by zaxus · · Score: 1

      NOOO! The cake is a lie!!!!

      --
      /. zen: Imagine a Beowulf cluster of Beowulf clusters...
    11. Re:Game Over by ae1294 · · Score: 1

      You don't know what you're talking about but that's ok.

      Enlighten us...

    12. Re:Game Over by eriqk · · Score: 1

      Indeed. This sort of thing has cropped up before and it has always been due to human error.

  3. Re:'An inadvertent press of a key on a keyboard' by verbatim · · Score: 5, Funny

    This pretty much describes my entire career.

    --
    Price, Quality, Time. Pick none. What, you thought you had a choice?
  4. Slashdot summary non sensationalist by rsborg · · Score: 3, Interesting

    Amazingly the Cloudfoundry blog itself had a much more dramatic telling:

    "... At 8am this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon. This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed.

    Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

    (emphasis mine).

    I'd hate to be that ops guy.

    --
    Make sure everyone's vote counts: Verified Voting
    1. Re:Slashdot summary non sensationalist by fuzzyfuzzyfungus · · Score: 2

      "And that is the story we tell the new hires. If they ask why the employee health plan covers cyanide..."

    2. Re:Slashdot summary non sensationalist by Icegryphon · · Score: 2


      Keyboards, how do they work?
      This does not bode well for VMware.
      As much as I love their production,
      I did chuckle at this major failure.

    3. Re:Slashdot summary non sensationalist by mirix · · Score: 1

      It's easy to be all high and mighty when your Selectric isn't even capable of being connected to anything mission critical.

      Actually - how did you manage to get it to post on /.?

      --
      Sent from my PDP-11
    4. Re:Slashdot summary non sensationalist by Icegryphon · · Score: 5, Funny

      Don't go knocking my typewriter
      It's Electric, and has wonderful BNC connector
      for network access. IBM, you did good.

    5. Re:Slashdot summary non sensationalist by BigGerman · · Score: 1

      Stopping engineers from touching keyboards is an important part of maintaining one's cloud infrastructure. From experience.

    6. Re:Slashdot summary non sensationalist by Virtucon · · Score: 1

      The infrastructure design is not resilient, and it seems late in the game to "develop a playbook" after you've gone live. Their credibility in building a fault-tolerant platform is also questionable. While VMware is at the core of a lot of data centers, there are other players that bring things to the table to build out the other pieces that make high availability and reliability a reality; I don't think they understand how all of this fits together. Reading that this was a "paper only", all-hands-on-deck style of management also suggests that there's turmoil within their walls. Why? Nobody would knowingly take down infrastructure. Sure, it was a mistake, but again it demonstrates the fragile nature of their design. I can shut down load balancers, storage processors, cluster nodes and power, but it takes a heck of a lot more effort than a few keystrokes to take all of it out. A "full outage" of the network infrastructure by one guy? What was he doing? Was Change Management in play here? Was this person fucking around?

      I'm sorry folks. Go back to the drawing board and design this correctly.

      --
      Harrison's Postulate - "For every action there is an equal and opposite criticism"
    7. Re:Slashdot summary non sensationalist by king_grumpy · · Score: 1

      Amazingly the Cloudfoundry blog itself had a much more dramatic telling:

      "... At 8am this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon. This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed.

      Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

      (emphasis mine).

      I'd hate to be that ops guy.

      Does this engineer work for Sony?

  5. It's a euphemism by symbolset · · Score: 1

    Just like "paper only" is a metaphor for the electronic document version, which is what was happening. In this case it means the engineer engaged in active management of the network instead of brainstorming ideas with the group. Presumably he intended to just investigate.

    --
    Help stamp out iliturcy.
  6. UR DOING IT WRONG! by celest · · Score: 2

    You would think someone as big as VMware would have figured out, by now, that if "An inadvertent press of a key on a keyboard" can lead to "a full outage of the network infrastructure [including] all load balancers, routers, and firewalls [resulting] in a complete external loss of connectivity to [their Cloud service]" that they are DOING IT WRONG!

    In other news, VMware announces they're releasing a new voting machine: http://xkcd.com/463/

    1. Re:UR DOING IT WRONG! by Xtravar · · Score: 1

      I would like more elaboration on what "touched the keyboard" means. It has more than one dictionary meaning, and it's very vague in this context.
      Like, did they touch it and press a key?
      Did they touch it for an extended period, typing "killall cloud"?
      Was it an accidental touch, or was the person an idiot who's not supposed to touch important things?

      --
      Buckle your ROFL belt, we're in for some LOLs.
    2. Re:UR DOING IT WRONG! by RoFLKOPTr · · Score: 1

      I would like more elaboration on what "touched the keyboard" means. It has more than one dictionary meaning, and it's very vague in this context. Like, did they touch it and press a key? Did they touch it for an extended period, typing "killall cloud"? Was it an accidental touch, or was the person an idiot who's not supposed to touch important things?

      The keyboard they touched wasn't a keyboard in the conventional sense. It was a small 3"x3" yellow/black striped board with one large circular red key on it. Somebody touched that key even though the sign said "DON'T PUSH THIS." A harmless prank.

    3. Re:UR DOING IT WRONG! by LoRdTAW · · Score: 1

      It was probably inappropriately touched in a no-no place.

    4. Re:UR DOING IT WRONG! by Jeremi · · Score: 4, Funny

      I would like more elaboration on what "touched the keyboard" means.

      It was an extreme case of static discharge. The engineer is lucky to be alive -- when doing cloud computing, thunderstorms are a huge hazard.

      --


      I don't care if it's 90,000 hectares. That lake was not my doing.
    5. Re:UR DOING IT WRONG! by larry+bagina · · Score: 3, Informative

      Remember how your uncle used to touch you in your naughty place? It was like that.

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

  7. Not the RED button!!! by geekmux · · Score: 2

    "...An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry."

    OK, seriously, who the hell has that much shit tied to a single key on a keyboard?

    I've heard of macros for the lazy, but damn...

    1. Re:Not the RED button!!! by AnonymmousCoward · · Score: 1

      Perhaps it was the "any" key

  8. Engineering Errors by Bruha · · Score: 4, Interesting

    You cannot really stop stupid people. However, many companies cripple their networks through so-called "Security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors ("Serial Access") to get back into these devices. Now you're losing customers on top of losing money, especially when it comes to compute clouds where you're literally billing by the hour. Even more so for long distance providers, cellular companies, and VoIP communications providers.

    I am curious how the press of one key managed to wipe out the cloud, the load balancers, and the routers at the same time. Either they're using some program to manage their switching network which is the only key thing that could take it all out, or the idiot had the command queued up.

    More likely some idiot introduced a Cisco switch into their VTP domain and it had a higher revision number queued up and it overwrote their entire LAN environment. That is simply fixed by requiring a VTP password, so that you can really nail the idiot who does it; better still, bite the admin bullet and run VTP transparent mode.

    There's no one command that's going to bring it all down, it's going to be a series of actions that result from a lack of proper network management, and lack of proper tested redundancy. Redundancy does not exist in the same physical facility, redundancy exists in a separate facility nowhere associated with anything that runs the backed up facility. Pull the plug on data center A, your customers should not notice a thing is amiss. If you can do that, then you have proper redundancy.

    I believe the other problem is that we're working on a 30+ year old protocol stack, and it's starting to show its limitations. TCP/IP is great, but there needs to be some better upper layer changes that allow client replication to work as well. So if the App loses its connection to server A, it seamlessly uses server B without so much as a hiccup. Something like keyed content where you can accept replies from two different sources, but the app can use the data as it comes in from each, much like BitTorrent, but on a real time level. It requires twice the resources to handle an app, but if redundancy is king this type of system would be king and prevent some of the large outages we have seen in the past.
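
    A minimal sketch of the "lose server A, carry on with server B" idea, done on the client side. It is only an illustration of the failover pattern, not anything Cloud Foundry does; the names `call_with_failover`, `fake_request` and the server labels "A" and "B" are made up for the example.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Iterable[str], request: Callable[[str], T]) -> T:
    """Try each replica in order and return the first successful reply.

    `replicas` and `request` are hypothetical: a list of server names and a
    function that performs one request against a named server.
    """
    errors = []
    for server in replicas:
        try:
            return request(server)
        except ConnectionError as exc:
            # Remember the failure and move on to the next replica.
            errors.append((server, exc))
    raise AllReplicasFailed(errors)


def fake_request(server: str) -> str:
    # Stand-in for a real network call: server "A" is down, "B" answers.
    if server == "A":
        raise ConnectionError("server A unreachable")
    return f"reply from {server}"


if __name__ == "__main__":
    print(call_with_failover(["A", "B"], fake_request))
```

    This tries replicas one after another; the "accept replies from two sources at once" variant would send the request to both in parallel and use whichever answers first, at the cost of doing the work twice.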

    1. Re:Engineering Errors by Niac · · Score: 1

      Bring Down Business [y/N]?> y <Enter>

      --
      http://gabrielcain.com/
    2. Re:Engineering Errors by Anonymous Coward · · Score: 1

      I have never seen a decent-size datacenter that actually uses VTP. VTP is sometimes used in campus networking, where things tend to move often so dynamically assigning VLANs to trunks is useful, but even there it usually gets turned off because admins are scared of it. More likely, in my opinion, they were developing new configs to mitigate the problem they most recently experienced, and someone deployed the change to the production network instead of the test network. There's a catch to deploying a test network: you have to make the systems very similar for it to be effective, and you have to make it quick and easy to change the test network and then deploy the same changes to production, or it won't actually be used. In a crisis especially, you want to test your changes before you make the problem worse, but you don't want to delay the solution any more than you need to.
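
      One cheap guard against the "deployed to production instead of test" mistake is to make protected targets require retyping the environment name, so a stray "y" or Enter does nothing. A rough sketch, assuming a hypothetical deploy tool; `confirm_target` and the environment names are invented for illustration.

```python
def confirm_target(target: str, typed: str, protected: tuple = ("production",)) -> bool:
    """Return True only if it is safe to proceed against `target`.

    For protected environments the operator must retype the environment
    name exactly; a bare "y" or a stray Enter is not enough.
    """
    if target not in protected:
        # Test and lab targets need no extra confirmation.
        return True
    return typed.strip() == target


if __name__ == "__main__":
    target = "production"
    typed = input(f"Type the environment name to deploy to {target}: ")
    if confirm_target(target, typed):
        print("deploying...")
    else:
        print("aborted: confirmation did not match")
```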

    3. Re:Engineering Errors by dissy · · Score: 1

      Perhaps most of their infrastructure is virtual, and the button he pressed was the host's power key, shutting down all the guests at once.

    4. Re:Engineering Errors by lucifuge31337 · · Score: 1

      More likely some idiot introduced a Cisco switch into their VTP domain and it had a higher revision number queued up and it overwrote their entire LAN environment.

      How does that even happen in a properly managed environment? In fact, even in an improperly managed one? I'd have to try hard to make that happen......I mean...really. Bring up an identically configured VTP master, change it enough times to get a higher rev number, put it on the same LAN and......without external inputs (dropping links to the real VTP master) pretty much nothing ought to happen (other than syslog screaming) unless you're using some really crusty old IOS/CatOS.

      --
      Do not fold, spindle or mutilate.
    5. Re:Engineering Errors by Anonymous Coward · · Score: 1

      That's two keys. Bzzt. Wrong.

    6. Re:Engineering Errors by mulaz · · Score: 1

      Easy!

      Have a scaled-down copy of the production network in a lab, with all the same settings (like VTP domain etc.), test weird things (like it's normally done in a lab environment), and get the rev. number up high.

      Then some piece of production equipment fails, (let's say a switch), and why not take one (basically the same one) from the lab? The lab can wait for the replacement, production usually can not. Then plug the switch to the production network, and puff, there go the vlans!

      --
      i read your email
    7. Re:Engineering Errors by lucifuge31337 · · Score: 1

      So....what I said. Except you have it in your lab environment. And you don't realize it's your VTP master. And you don't bother to put your production config on your replacement box before putting it in production....... Yeah. Not buying it as a likely scenario. This required multiple steps, and a fundamental lack of understanding of key functions of networking equipment in a datacenter setting (namely not knowing what your VTP master is) and a lack of any sort of sane procedures (putting a piece of equipment into production without so much as verifying a config). It's a plausible, but unlikely series of events that would require the input of someone who was not capable of building or maintaining the network in the first place.

      --
      Do not fold, spindle or mutilate.
    8. Re:Engineering Errors by zbaron · · Score: 1

      Just so you know, even a VTP *client* with a higher revision number and a different table used to be able to / can wipe out a VTP domain by being introduced. Being a VTP server just allows you to add and remove VLANs from the database. VTPv3 is supposed to fix these kinds of things though. The last time this happened to me, thankfully, I still had the output from a "show vlan" in my scroll back buffer.
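
      For anyone who hasn't been bitten by this, here is a toy model of the revision-number behaviour being argued about in this subthread. It is a deliberately simplified sketch, not a model of any particular IOS/CatOS release: the `Switch` class, its fields and the sample values are all made up for illustration, and VTP passwords and VTPv3 are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    name: str
    domain: str
    revision: int
    vlans: set = field(default_factory=set)
    mode: str = "server"  # "server", "client", or "transparent"

    def receive(self, other: "Switch") -> bool:
        """Apply another switch's advertisement; return True if overwritten."""
        if self.mode == "transparent" or other.mode == "transparent":
            # Transparent switches neither sync nor advertise their own database.
            return False
        if other.domain != self.domain:
            return False
        if other.revision <= self.revision:
            return False
        # Higher revision wins, regardless of who is "server" or "client".
        self.vlans = set(other.vlans)
        self.revision = other.revision
        return True


if __name__ == "__main__":
    prod = Switch("core", "dc1", revision=12, vlans={10, 20, 30})
    lab = Switch("lab-spare", "dc1", revision=57, vlans={99}, mode="client")
    print(prod.receive(lab), sorted(prod.vlans))
```

      In this model the production VLAN table is replaced by the lab one simply because the lab box carries the higher revision in the same domain; a different domain name or transparent mode stops it.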

    9. Re:Engineering Errors by lucifuge31337 · · Score: 1

      Plenty of people run their VTP domains as all servers...since they are too lazy to remember which is the server :)

      And to my point, that's amateur hour stuff. Not what one would expect in a professional data center.

      Also, that would not cause this proposed issue, as if they were all servers, none of them would take data as a VTP client. It would be like not running VTP at all.

      --
      Do not fold, spindle or mutilate.
    10. Re:Engineering Errors by lucifuge31337 · · Score: 1

      Just so you know, even a VTP *client* with a higher revision number and a different table used to be able to / can wipe out a VTP domain by being introduced. Being a VTP server just allows you to add and remove VLANs from the database. VTPv3 is supposed to fix these kinds of things though. The last time this happened to me, thankfully, I still had the output from a "show vlan" in my scroll back buffer.

      See my previous post about "crusty old IOS/CatOS".

      Also, who the hell runs the same VTP name and auth key in production and the lab? That is BEGGING for problems.

      Maybe I've just been doing this the right way for too long. I find it difficult to believe that there are networks of any scale that have any duration of uptime that aren't following very, very simple procedures to ensure uptime and/or are operating with such a complete lack of knowledge of the basic plumbing that makes them work. Also, who doesn't have automated config backups of infrastructure equipment?

      I guess this boils down to the fact that I'm not an armchair network admin. I've been doing this a long time, and I know how it works. To me, someone doing something this stupid would be like watching someone put a car in gear and then crawl under it. It's not something you should have to TELL someone not to do. It's something that SHOULDN'T HAPPEN when one or more well agreed upon basic procedures are being followed. If the person you are asking to do that kind of work needs to be told these things, you have failed as a manager, and likely as an organization. If your network(s) set the stage for this type of thing to be a possibility (sharing VTP info between production and lab, hoping someone won't ever accidentally bridge the two) you again have failed as a manager or organization. The most basic of widely accepted best practices would put multiple barriers in front of this type of thing happening, requiring a cascading series of procedural failures for it to actually happen.

      In summary.....Nope, still not buying this as a reasonable explanation.

      --
      Do not fold, spindle or mutilate.
    11. Re:Engineering Errors by mulaz · · Score: 1

      VTP servers also accept configuration from other VTP servers, so they act as clients too.

      --
      i read your email
    12. Re:Engineering Errors by lucifuge31337 · · Score: 1

      I got you. I know it can happen. I just don't see it being something that is likely in a properly managed network. I mean....come on....if you're in a position where you can swap out infrastructure equipment - even if your lab setup is moronic and shares VTP info - you ought to be (at a bare minimum) blowing away the config (which includes vlan.dat) and either copy-pasting from your config repository or creating a new one. My point is that this is very, very basic procedure. If you are tasked with touching infrastructure equipment, you ought to know these things. Also, if this happens it is a very quick restore if you have proper config backups and proper out of band access. Again, I guess I'm assuming a competency level that others aren't accustomed to. This stuff isn't hard.

      --
      Do not fold, spindle or mutilate.
  9. VMware shows its PR colors. by shuz · · Score: 4, Insightful

    VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X. This also outlines a major issue with "cloud" technologies. They are only as redundant and stable as the individuals managing them. Also, there is always the opportunity for a single point of failure in any system; you just need to go up the support tree high enough. For most companies this is the data center itself as offsite DR can get expensive quick. For VMware it can be the Virtual Center, a misconfigured vRouter or even vSwitch. Finally, putting all your eggs into one basket can increase efficiency and save money. It can also raise your risk profile. An engineer may have caused this outage but I would find it hard to believe that replacing the engineer would make the "risk" go away.

    --
    There is or can be built a machine that can simulate any physical object. -Church-Turing principle
    1. Re:VMware shows its PR colors. by HFShadow · · Score: 2

      Agreed. They seem to treat it as some magical instance where touching the keyboard breaks things, as though this was written by someone's grandmother.

      How did one engineer touching a keyboard when he shouldn't, take everything down? I don't think I could do this at work unless I was really trying hard. This is a really shitty response, especially compared to the writeup that amazon put out.

    2. Re:VMware shows its PR colors. by Chuck+Chunder · · Score: 1

      A better PR response

      In what sense? I know that I appreciate frank disclosures of problems from our providers rather than obfuscating the issue (if nothing else it might highlight a similar problem in our procedures).

      --
      Boffoonery - downloadable Comedy Benefit for Bletchley Park
    3. Re:VMware shows its PR colors. by Vrtigo1 · · Score: 1

      I find your comment regarding offsite DR a bit off base. For small shops, I would agree that maintaining two data centers would be expensive, but for most places that have any kind of substantial investment in IT, it should be just an expense that is factored in from day one. For instance, the company I work for has an annual IT budget of about 2.5 million. We have three datacenters in addition to our computer room at HQ. Two of the data centers are for our public facing apps which are load balanced between them. We have a generator at HQ which can run us for about a week, but if TSHTF, we can move our apps to the remote datacenter. At HQ, I've put as much of the critical infrastructure as possible in VMs for portability and ease of management. HQ is backed up by the 3rd datacenter, where we put a single God box consisting of four 6 core CPUs and 96 GB of RAM. This is sufficient to run all of our critical apps on the single server until we can get our HQ equipment back up and running, or we have time to order and install new equipment elsewhere. The storage from HQ is continuously replicated to the offsite D/R facility, so in the event of a disaster, all I have to do is power up the VMs there, change the outside hostname of our HQ VPN endpoint to point to the D/R firewall and tell people to disconnect and reconnect to VPN. This setup cost us about 90k in capital expenditures including equipment, software and implementation and costs about 10k a year to run. Call it 150k for the D/R site and the generator at HQ, and I'd say that's a relatively minor cost in the grand scheme of things.

    4. Re:VMware shows its PR colors. by ToasterMonkey · · Score: 5, Insightful

      VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.

      "Transparency is bad" +4 Insightful

      What the... ?

    5. Re:VMware shows its PR colors. by drooling-dog · · Score: 3, Informative

      To me it sounds like someone (non-technical) high up in the chain wanted to focus blame on an inadvertent act by one of the engineers. Inadvertent, of course, so no one needs to get fired and file a lawsuit, and an engineer so that no one in upper management appears culpable. The downside is that they dramatically underscore the fragility of their cloud, thereby undermining its acceptance in the market. Not a good tradeoff, if that's the case.

    6. Re:VMware shows its PR colors. by rsborg · · Score: 4, Informative

      VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.

      "Transparency is bad" +4 Insightful

      What the... ?

      You know, I'd prefer my vendor/partner (ie, VMWare) doesn't throw their employees under the bus when bad stuff happens. If this happened at Apple or Google the group (leadership taking responsibility) would announce they messed up... not "one of the peons pushed a magic button".

      Transparency is only useful as a way to diagnose and improve. This "explanation" from VMWare hides all explanation (...touched the keyboard. This resulted in a full outage of the network infrastructure...) while torching a single employee.

      --
      Make sure everyone's vote counts: Verified Voting
    7. Re:VMware shows its PR colors. by SuperQ · · Score: 1

      Yup, here's a good example of what you're talking about:

      http://gmailblog.blogspot.com/2011/02/gmail-back-soon-for-everyone.html

      So what caused this problem? We released a storage software update that introduced the unexpected bug, which caused 0.02% of Gmail users to temporarily lose access to their email. When we discovered the problem, we immediately stopped the deployment of the new software and reverted to the old version.

    8. Re:VMware shows its PR colors. by Threni · · Score: 1

      It was probably some offshore noob who doesn't understand test environments, change control etc and who decided to stick the latest version of some code onto the live environment. It happens. Hey, they're cheap!

    9. Re:VMware shows its PR colors. by dbIII · · Score: 1

      They are only as redundant and stable as the individuals managing them

      Given that it is 2011 a very large chunk of the IT workforce has been made redundant.

    10. Re:VMware shows its PR colors. by MightyYar · · Score: 1

      The company as a whole is responsible for any of its failures.

      I completely disagree. "The company" does not actually exist - it is actually just a group of individuals. If an individual can mess up the whole infrastructure, then I'd sure like to know that.

      A better PR response

      Yup, a better BS response that leaves them just as opaque as all the other companies out there.

      An engineer may have caused this outage but I would find it hard to believe that replacing the engineer would make the "risk" go away.

      That's exactly right - but you wouldn't know that if they had said "we made an unscheduled change". I prefer their transparency.

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    11. Re:VMware shows its PR colors. by pclminion · · Score: 1

      It's not transparency, it's blame deflection. Transparency would be a discussion of HOW a single person operating a console could take down the entire infrastructure, not a discussion of who. I don't give a shit who. Saying "an employee fucked up" is finger-pointing and sounds like an attempt to step away from responsibility. It also implies that the internal "solution" to the problem is to fire some specific individual.

      Transparency, which we do NOT see here, would be a discussion of HOW the data center was configured and HOW a single person at a keyboard could take the entire thing to the ground. Whether the guy's name is Winston or whatever I really don't give a crap.

  10. The CLOUD is VAPOR-WARE by Purist · · Score: 1

    Next.

    --
    I used to fear clowns...but I'm discovering that chimps are far, far, worse.
    1. Re:The CLOUD is VAPOR-WARE by Dyinobal · · Score: 1

      I'd think that was obvious, clouds are made out of vapor by definition.

    2. Re:The CLOUD is VAPOR-WARE by Purist · · Score: 1

      Thanks for validating my joke...was it too dry?

      --
      I used to fear clowns...but I'm discovering that chimps are far, far, worse.
    3. Re:The CLOUD is VAPOR-WARE by md65536 · · Score: 1

      Maybe the data center was too dry and when they touched the keyboard, a lightning bolt from the cloud struck.

      Anyway I think keeping clouds in a data center sounds dangerous for several reasons.

  11. Since I'm being an awful person today... by fuzzyfuzzyfungus · · Score: 2

    I, for one, would like to suggest that the Cloud Foundry is really foundering...

    1. Re:Since I'm being an awful person today... by torgis · · Score: 1

      Come on y'all. If we try real hard I bet we can find a way to implicate Sony in this.

  12. PEBKAC by MrQuacker · · Score: 1

    And that is why we need skynet.

  13. Don't let it happen again by stumblingblock · · Score: 5, Funny

    They just have to remove that key from the keyboard. You know, the one that massively crashes the entire system. Poor judgement to have that key there.

  14. I don't trust "The Cloud" by Beelzebud · · Score: 1

    When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that. The Cloud is great for sharing photos or game saves, but I don't see a future where we all do our computing "in the cloud".

    1. Re:I don't trust "The Cloud" by Jeremi · · Score: 1

      When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that.

      You know what beats a local hard drive? Two local hard drives, so that if one of them dies, you can still retrieve your data on the other one. And you know what beats two local hard drives? N hard drives in different locations, so that even after Evil Otto nukes your office and your branch office, you can still retrieve a backup copy of your data from another zip code.

      I wonder if/when any cloud services will offer the option of letting you automatically keep a copy of your cloud data on your home computer's local drive? That seems like it would be a good feature to have.
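
      The "N drives in different locations" idea boils down to copying the data to several places and checking that every copy is intact. A rough sketch with the standard library only; `replicate` and the destination directories are hypothetical stand-ins for real offsite storage.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def replicate(source: Path, destinations: list) -> list:
    """Copy `source` into each destination directory and verify each copy."""
    expected = sha256_of(source)
    copies = []
    for directory in destinations:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(source, directory / source.name))
        # A backup you never verified is only a hope, so compare checksums.
        if sha256_of(copy) != expected:
            raise IOError(f"checksum mismatch for {copy}")
        copies.append(copy)
    return copies
```

      In practice the destinations would be mounts or remote stores in different buildings or zip codes rather than sibling directories on one disk.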

      --


      I don't care if it's 90,000 hectares. That lake was not my doing.
    2. Re:I don't trust "The Cloud" by jd · · Score: 1

      Hard drives are easy to beat. Core memory has an estimated lifespan 20-30x that of a hard drive, is impervious to EMP and won't crash if bumped.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    3. Re:I don't trust "The Cloud" by jimicus · · Score: 1

      Not necessarily, as has already been demonstrated.

      I forget exactly where I first read it, but it bears repeating: Unless you can put your finger on a damn good reason why your business cannot deal with any downtime, you don't need high availability and probably shouldn't bother with it.

      It invariably introduces a lot more complication, a lot more to go wrong. Few businesses truly need it, usually all they need is a clear plan to recover from system failure which accounts for the length of time such recovery will take.

      Simply running your application on, say, a highly-available VM in someone else's datacenter does not make that complication go away, it just makes it somebody else's problem.

    4. Re:I don't trust "The Cloud" by Nevynxxx · · Score: 1

      I wonder if/when any cloud services will offer the option of letting you automatically keep a copy of your cloud data on your home computer's local drive? That seems like it would be a good feature to have.

      Dropbox.

  15. Cloud depends too much on internet by bmservice · · Score: 1

    I don't think we have entered the period where the internet is available everywhere and all the time, but without the internet the cloud is nothing.

  16. Cloudy Vision of My Future... by BoRegardless · · Score: 1

    If I think I can trust a cloud to support my data.

  17. The technology is almost there by __aancvu2993 · · Score: 1

    I am only considering VMware products again if they fire the idiot who wrote the blog post and cane him in the public square. Come on VMware, we are hoping for some retribution here.

    Now, for the technical part, I'm only considering cloudy products again if they replace keyboards and human engineers with unicorns fluent in Lisp who can rainbow-activate and maintain the flockolent interfuzzys to the cervically index, to protect my data. I'm just not using any ol' cloud. No sir.

  18. I disagree. by khasim · · Score: 1

    However, many companies cripple their networks through so-called "Security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors ("Serial Access") to get back into these devices.

    The problem with such "security" is that the easier you make it for your admins to connect ... the easier you make it for the bad guys to connect.

    The answer is to run training exercises for the various scenarios so that everyone knows what to do and where to go in such situations.

    The problem with that is that people are lazy. Security is not difficult. But NOT doing it will always be easier (and yield immediate rewards) in the short term.

    TCP/IP is great, but there needs to be some better upper layer changes that allow client replication to work as well. So if the App loses its connection to server A, it seamlessly uses server B without so much as a hiccup.

    Sounds good. But the system also has to be designed to take advantage of the technology that is available today. Too often the systems are based around the single machine running a single application with full administrative rights model. And the technological advances have just made it possible to fool the app into thinking it is on one machine while it runs on multiple machines (badly).

  19. The Answer is obvious by SuperKendall · · Score: 1

    How did one engineer touching a keyboard when he shouldn't, take everything down?

    He touched the keyboard in its Special Place.

    Not to worry though, they called in Chris Hanson to help with network ops in the future, we'll not be seeing a repeat.

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
  20. Cloud lol. by unity100 · · Score: 1

    I can't see why it is so hard to realize that if you end up tying everything into one major big structure and put everything in it, then regardless of how much redundancy you designed, it will eventually flop grandly.

    if not downtime, it will be security. if not, its something else. the idea is, you are creating one HUGE environment which contains everything. its inevitable that some issue affects all the participants in that environment eventually. those being the clients.

    lets admit it - huge monolithic clouds, are a bad idea. there should be a certain size limit for clouds' sizes, and after that the customers should be placed to another discrete cloud unit.

  21. A cloud in need, is a cloud indeed by Anonymous Coward · · Score: 1

    No, no, it is indeed a cloud: Thin, wispy and ephemeral.

    1. Re:A cloud in need, is a cloud indeed by VortexCortex · · Score: 3, Funny

      No, no, it is indeed a cloud: Thin, wispy and ephemeral.

      Not to mention The Cloud is dangerous!

      One time, "The Cloud" corrupted a few files on my server, toasted my dev machine's hard drive (couldn't even re-install!) made several monitors explode, and split the tree outside my home-office completely in two; Flying chunks of bark shattered my windows... to say nothing of the horror that became of the decorative landscape lighting that foolishly linked the outside to my main electrical system, may it rest in pieces.

      The ironic thing is that I had a lightning rod installed; I thought I was safe from The Cloud, but The Cloud decided that my, now deceased, 200ft pine tree was a better target of opportunity.

      The Cloud is a scary concept -- Super charged flying electrical batteries, always looming overhead, unpredictably destroying their targets with tremendous power, and surgical precision. Hell, the terror of witnessing such an event has permanently emotionally scarred my dog -- She has a prescription for Valium now because she hyperventilates and continuously shakes for hours at the mere sound of distant thunder...

      My psyche is not unscathed either: I have to take a tranquilizer whenever I hear the words: "To The Cloud"

  22. Human error by PPH · · Score: 1

    No problem. SkyNet will remedy that.

    --
    Have gnu, will travel.
  23. Is the power grid run by some old pc terminal Esc by Joe+The+Dragon · · Score: 1

    Is the power grid run by some old pc terminal where hitting Esc can crash the full system?

  24. Anybody see the irony in the first outage? by HockeyPuck · · Score: 4, Interesting

    Ok.. so VMware is owned by EMC, a dominant storage player. They lost a power supply in a cabinet. So? EMC arrays have had multiple power feeds for years (decades). Even the low-end CLARiiON has 2x power supplies. And anybody that racks up equipment knows to connect each rack's powerstrip/PDU to a separate feed, so that if you lose one PDU, the cabinet still has 100% of its power, just with no redundancy.

    I also find it odd that they'd have an application configuration where, if access was lost to ONE LUN on ONE array, it would cripple the entire application. Umm... this is bad application design if you ask me. All it would take would be for the host to mirror the LUN to another disk array. That way the array could blow up and you'd be fine, and being VMware (a part of EMC), disk is cheap, unlike the brutal prices the rest of us pay.

    Either that or the power failure caused a loss of a single path from host to disk and they forgot to configure PowerPath on the server... or verify that VMware's native multipathing was working correctly...

    Irony. A storage company having a storage problem.

  25. Sounds like a 13 year old making up an excuse. by Ecuador · · Score: 1

    I remember from almost 20 years ago (DOS / floppy era) overhearing a couple of kids in my school yard. Apparently one of them had promised the other a floppy with a game and he had not delivered. The excuse was "you know, I had it ready and everything, but I hit on the "delete" key by accident and I lost it - sorry". The other party agreed it was an unfortunate accident and did not make a fuss. I was in disbelief of the idiocy of the exchange I had just heard - and I was just 13 years old.

    VMware's explanation reminded me of that incident. Unless "touching the keyboard" means logging on a secure system and entering a few bad commands.

    --
    Violence is the last refuge of the incompetent. Polar Scope Align for iOS
  26. Wow. by chill · · Score: 1

    141 comments and no one mentions the old Sun equipment that had the !@#^ power button on the keyboard! Must be the young crowd posting.

    Been there, done that. Reached over, bumped the keyboard and the SparcStation went "blink!" and off.

    I've been to a couple lab environments where the upper-right key on every keyboard had been physically removed because this was such a stupid design.

    --
    Learning HOW to think is more important than learning WHAT to think.
    1. Re:Wow. by multipartmixed · · Score: 1

      Just so you know, you can turn that off. /etc/power.conf IIRC. That said, I also tend to rip the key off.

      Wanna know ironic, though? The Sun E150 server (mini E450 chassis, Ultra-1 guts) can't be turned *on* without the keyboard.

      True story, one DC where I worked about 12 years ago called Sun support because a machine wouldn't power up after a simulated power failure. Stupid Sun SE wound up replacing the motherboard before he would listen to me and plug in a damn keyboard.

      --

      Do daemons dream of electric sleep()?
    2. Re:Wow. by chill · · Score: 1

      Yeah, but it is so much more satisfying to rip off that damned key with a pair of pliers. :-)

      I have no trouble believing the story of the tech. I remember using that trick on MCSEs who thought they knew computers and Sun servers were just like the WinTel ones...

      "How do you turn this damn thing on?" :-)

      --
      Learning HOW to think is more important than learning WHAT to think.
  27. Re:Reminds Me of A Bad Challenger Joke... by Per+Wigren · · Score: 1

    What were Osama bin Laden's last words?

    "Darn."

    --
    My other account has a 3-digit UID.
  28. Not a KISS design. by leuk_he · · Score: 1

    They try to make a full analysis public. That is a good thing. They could have gone with the same old level "there is a problem and we fixed it", like they try to do with the PSN network. (barely any detail, very fuzzy predictions of when it is expected to come up again)

    Cloud based hosting is relatively complex by its very nature. This will always violate the "KISS" design principle. ECC downtime has also shown this. A lot of customers thought they bought a 24/7 99,999% solution, but they forgot they only bought the tools for that solution.

    I agree about "touched the keyboard": one of the engineers took the script literally when someone gave instructions to "do X", and he launched the nuclear missiles instead of just giving a paper confirmation that X was down. ;)

    PS "not real missiles... just a figure of speech"

  29. "Anybody" is sometimes very very badly wrong by dbIII · · Score: 1

    And anybody that racks up equipment knows to connect each rack's powerstrip/PDU to a separate feed

    If you don't know if the other circuit is on another phase or not and you have a power supply fault that can be a truly shocking suggestion that can destroy the equipment you intended to save since you may be dealing with 480V now instead of 240V. If you DO know they are on the same phase it is a good idea - but in some circumstances it can be a very very stupid idea to randomly plug the power into random sockets.
    Plus just when you think you have it all covered with redundant power supplies sometimes the entire power supply unit dies at the back end instead of just one of the modules. It's annoying and expensive when that happens but not as bad as a completely dead server.

    1. Re:"Anybody" is sometimes very very badly wrong by multipartmixed · · Score: 1

      Um, no.

      Modern servers (not 1950s radio gear) do not feed AC on the equipment side of the power supply. The AC is contained within the PSU and the equipment is powered by DC.

      And besides which, all modern data centers keep their redundant power distribution in phase. For starters, they know that their grounds will be tied together through customer equipment.

      --

      Do daemons dream of electric sleep()?
    2. Re:"Anybody" is sometimes very very badly wrong by HockeyPuck · · Score: 1

      Wow... guess you've never been in a datacenter that actually PLANS its power distribution system. It's planned that both power strips in a given cabinet (or plugs on a larger floor standing system like an EMC disk array) will go to different PDUs, just in case the PDU itself fails. Now while this doesn't protect you from building supply (Power Company) failures, it does protect you from all sorts of intra-datacenter power issues.

      Nobody is randomly plugging power cables in, but you are planning for a multi plug server/storage array/switch/router whatever to connect to two different PDUs.

      This was just a bone headed mistake.

    3. Re:"Anybody" is sometimes very very badly wrong by dbIII · · Score: 1

      I guess you've never been to the majority of other places then in addition to not being able to properly comprehend what was written above. I even put the word "DO" in capitals above for you but it does not appear to have helped.

    4. Re:"Anybody" is sometimes very very badly wrong by dbIII · · Score: 1

      Modern servers (not 1950s radio gear) do not feed AC on the equipment side of the power supply.

      Obviously not - but they do on the power supply side of the power supply. The separate inputs are not always properly isolated and the two phases can add up to more than what the power supply can cope with. It doesn't matter to the server if the fire starts in the power supply or elsewhere - it's still a burnt server.
      Now here's the bit you and the other less erudite guy that replied completely missed despite the capital letters:

      If you DO know they are on the same phase it is a good idea

      That answers your second point before you even made it. All this shit about "modern data centers" doesn't apply universally because a lot of gear is in other places and many "modern data centers" will have trouble distinguishing their arse from their elbow let alone making sure all of their cable monkeys know which phase different power points are on. I didn't even know which phase everything was on in the building I work in until one phase went out and I found some idiot had put all the air conditioning units on the same phase - so all of the computers stayed on while all of the cooling died. There was nothing documented about that on site initially just a list of points on each circuit. There is now no power point in that server room on a different phase to any other (and the AC was spread over the three phases since that fault happened more than once - industrial area with big cranes and an old substation messes with the power) so anybody can plug into anything because it is a known quantity.
      As I tried to say above, taking power from two different circuits in a place with three phase power into the same power supply can be a very stupid thing to do.

    5. Re:"Anybody" is sometimes very very badly wrong by multipartmixed · · Score: 1

      > Obviously not - but they do on the power supply side of the power supply.
      > The separate inputs are not always properly isolated and the two phases can add
      > up to more than what the power supply can cope with.

      Cite? Admittedly, I've dealt almost exclusively with Sun equipment since we moved there in '98 (although that's going to change RSN). But Sun gear, at least, does not let AC out of the PSU -- the PSUs are *not* interconnected, except where the DC meets on the backplane (and chassis ground).

      Are you saying that there are servers out there that cross-connect their power supplies on the AC side? Frankly, that seems like asking for a boat load of trouble if you ask me.

      In fact, it's a common DC configuration around here to run 208 to the cabs split off into 120s. We're out of phase there by definition. Of course, most of our PSUs *are* rated for 90-250V...

      --

      Do daemons dream of electric sleep()?
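      The voltage figures being argued over in this subthread (208 from 120 V legs, 480 V instead of 240 V) can be checked with a little arithmetic. The following is only a rough sketch, assuming two sinusoidal legs of equal RMS magnitude measured against a common neutral; the function name `voltage_between_legs` is made up for illustration and does not come from anyone's post.

```python
import math

def voltage_between_legs(v_leg_to_neutral: float, phase_angle_deg: float) -> float:
    """RMS voltage between two AC legs of equal magnitude.

    Assumes ideal sinusoids that differ only by phase_angle_deg.
    The difference of two equal phasors has magnitude 2 * V * sin(angle / 2).
    """
    return 2.0 * v_leg_to_neutral * math.sin(math.radians(phase_angle_deg) / 2.0)

# Two 120 V legs of a three-phase feed (120 degrees apart): roughly 208 V.
print(round(voltage_between_legs(120, 120)))

# Two 240 V legs of a three-phase feed: roughly 416 V.
print(round(voltage_between_legs(240, 120)))

# Two 240 V legs in direct opposition (180 degrees): the full 480 V.
print(round(voltage_between_legs(240, 180)))

# Two legs on the same phase: no voltage between them.
print(round(voltage_between_legs(240, 0)))
```

      Under these assumptions the doubling to 480 V only happens when the legs are 180 degrees apart; legs 120 degrees apart differ by a factor of the square root of three, which is where the 208 figure comes from.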
    6. Re:"Anybody" is sometimes very very badly wrong by dbIII · · Score: 1

      the PSUs are *not* interconnected, except where the DC meets on the backplane (and chassis ground).

      My apologies. I failed to put "not always properly isolated" in capital letters and bold type. Ask an electrician or a fire investigator for more details or let's just look at wikipedia on short circuits and then draw the obvious conclusions:

      In mains circuits, short circuits may occur between two phases, between a phase and neutral or between a phase and earth (ground). Such short circuits are likely to result in a very high current

      For some reason you wrote the following above:

      But Sun gear, at least, does not let AC out of the PSU

      You just don't get it. Forget entirely that there is a computer involved because all that matters here is the inputs, the voltage difference and if faulty manufacture or damage has left a path for the power to fry the parts within the power supply at up to twice the design voltage and current. Even if the power supplies are in good condition, a miswired point or plug where earth is live means double the trouble you'd normally get in that situation.

  30. Clearly? by EmagGeek · · Score: 1

    "Clearly, human error is still a major factor in cloud networks."

    That is a huge leap. You cannot take one incident and use it as a broad brush with which to paint all of the players in cloud computing.

    This should read: "Clearly, human error was a major player in these two specific incidents at VMWare."

    Can Slashdot mods PLEASE dispense with the sensationalism?

    1. Re:Clearly? by am+2k · · Score: 1

      Every error is a human error. Either somebody used it incorrectly, or somebody built it in a way where it couldn't handle the way it's used. IMO the only exceptions to this rule are things you can't be expected to consider, like a plane crashing into your data center or a Richter scale 9 earthquake.

  31. One key? by WD · · Score: 1

    You know... there is a fix for that.

  32. Re:Big Red Button by VanessaE · · Score: 1

    Oh how I wish I had mod points right now - this should be modded straight to +5, Funny.

    I just about laughed myself silly. Thanks. :-D

  33. What? by koan · · Score: 1

    No "are you sure" pop up?

    --
    "If any question why we died, Tell them because our fathers lied."
  34. Re:single keystroke that would take down an entire by warchildx · · Score: 1

    Press the *ANY* key to continue, or any other key to quit.