British Airways IT Outage Caused By Contractor Who Accidentally Switched off Power (independent.ie)

Posted by msmash on Friday June 2, 2017 @02:00AM from the getting-to-the-bottom-of-things dept.

An anonymous reader shares a report: A contractor doing maintenance work at a British Airways data centre inadvertently switched off the power supply, knocking out the airline's computer systems and leaving 75,000 people stranded last weekend, according to reports. A BA source told The Times the power supply unit that sparked the IT failure was working perfectly but was accidentally shut down by a worker.

Did they try... by 110010001000 · 2017-06-02 02:03 · Score: 5, Funny

...turning it on again?
1. Re: Did they try... by Anonymous Coward · 2017-06-02 02:16 · Score: 5, Funny
  
  text book example of a "career changing event"
2. Re:Did they try... by Zocalo · 2017-06-02 02:28 · Score: 5, Interesting
  
  Apparently that was what led to the major outage turning into a prolonged major outage. It seems that the sequence of events is now that the contractor turned off the power (presumably killing one or more phases), obviously leading to a large scale hardware shutdown. Someone (the same contractor, most likely) tried to restore power, as you would, only to find that the power surge of all that hardware switching back on at the same time, which often means that they are close to maximum power draw, overloaded the system and caused physical damage to the hardware.
  
  While there's a lot of mocking of BA going on at the moment, that's actually a pretty easy situation to get into if you've expanded a DC - or even a regular equipment room - over several years without proper power management, and BA is far from the first company to be caught out. So, if you are responsible for some IT equipment rooms, here are two things to consider: what's the combined total power draw of all the equipment in each room on power on (don't forget to include any UPS units topping up their batteries!), and what's the maximum power load that can be supplied to each room? If you can't answer both of those, or at least be certain that the latter exceeds the former in each case, then you've potentially got exactly the same situation as BA.
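
A rough illustration of that two-question check, as a minimal Python sketch. The room names, the wattages and the `Room` structure are all invented for the example; real figures would have to come from nameplate data or measurement.

```python
from dataclasses import dataclass, field


@dataclass
class Room:
    """One equipment room; every figure here is a hypothetical example."""
    name: str
    supply_limit_w: float                                  # max load the feed can supply
    power_on_draw_w: list = field(default_factory=list)    # per-device draw at power-on


def power_on_headroom(room: Room) -> float:
    """Supply limit minus combined power-on draw; negative means overload."""
    return room.supply_limit_w - sum(room.power_on_draw_w)


rooms = [
    # The last entry in each list stands in for a UPS recharging its batteries.
    Room("room-a", 20_000, [450.0] * 30 + [1_500.0]),
    Room("room-b", 10_000, [450.0] * 24 + [1_500.0]),
]

# Rooms where everything powering on at once would exceed the supply.
at_risk = [r.name for r in rooms if power_on_headroom(r) < 0]
print(at_risk)
```

Any room that shows up in `at_risk` is one where a simultaneous power-on could exceed what the supply can deliver.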
  
  None of which excuses BA from not having the ability to successfully failover between redundant DCs in the event of a catastrophic outage at one facility, of course.
  
  --
  UNIX? They're not even circumcised! Savages!
3. Re:Did they try... by sycodon · 2017-06-02 02:34 · Score: 5, Insightful
  
  Bullshit.
  Even a brand new IT graduate knows computers should be plugged into UPS devices that protect against this.
  Handling power outages is about as basic of an IT task as they come. Basic Lock Out practices that prevent power from accidentally being turned off are also Server Maintenance 101.
  For this to actually have been the cause means their IT organization was run by rank amateurs.
  
  --
  When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
4. Re:Did they try... by DickBreath · 2017-06-02 02:37 · Score: 2
  
  In Open Compute I believe the power supplies wait a random amount of time before applying power to the rails to fire up the load. What an idea. You auto-magically spread out the time of the start up load over a short time.
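
To see why a random start delay helps, here is a small Python simulation. It is only a sketch: the unit count, inrush duration and maximum delay are made-up numbers, not anything taken from an Open Compute specification.

```python
import random


def peak_concurrent_starts(start_times, inrush_s):
    """Largest number of units whose inrush windows overlap at any instant."""
    events = []
    for t in start_times:
        events.append((t, 1))              # inrush begins
        events.append((t + inrush_s, -1))  # inrush ends
    # Tuples sort by time, then delta, so an end at time t precedes a start at t.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


rng = random.Random(42)   # fixed seed so the sketch is repeatable
N_UNITS = 200             # hypothetical number of power supplies
INRUSH_S = 0.5            # hypothetical inrush duration per unit, in seconds
MAX_DELAY_S = 30.0        # hypothetical upper bound on the random delay

simultaneous = [0.0] * N_UNITS
staggered = [rng.uniform(0.0, MAX_DELAY_S) for _ in range(N_UNITS)]

peak_simultaneous = peak_concurrent_starts(simultaneous, INRUSH_S)
peak_staggered = peak_concurrent_starts(staggered, INRUSH_S)
print(peak_simultaneous, peak_staggered)
```

With no delay every unit is in inrush at once. With the delay, only the units whose windows happen to overlap contribute to the peak.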
  
  --
  
  I'll see your senator, and I'll raise you two judges.
5. Re:Did they try... by Archangel+Michael · 2017-06-02 02:42 · Score: 5, Interesting
  
  There are two ways to engineer power for a datacenter. 1) You can engineer for maximum efficiency/lowest cost or 2) you can engineer for redundancy/max safety. Penny Pinchers always choose the former, and IT guys usually want the latter.
  Here is the real equation: Cost * likelihood of catastrophic event. If you think $100,000 * a .0000001 chance of catastrophe, you err on the side of savings. On the other hand, if you think $25 * 100.00 chance of catastrophe, you err on the side of cost.
  My guess is that they didn't account for business losses when plugging in that (obviously oversimplified) formula. This is why you leave penny pinching idiots out of the decision making, because when all you see is cost, and don't properly evaluate the catastrophic losses in event of disaster, then you're just an idiot that nobody should listen to.
  I get that there are budgets and such, but here is the one question I (IT guy) ask the "business" decision makers: If you lost everything, how much would it cost you? Most people undervalue the data inside the databases and documents, because they have no way of quantifying how much all that data is worth.
  Data is the biggest unaccounted-for asset of a business.
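
The cost-times-likelihood formula can be written out in a few lines of Python. This is a sketch with invented figures; the point is only how much the answer moves once business losses are counted in the impact.

```python
def expected_loss(impact: float, probability: float) -> float:
    """Expected loss = impact of the event times its likelihood."""
    return impact * probability


def worth_mitigating(mitigation_cost: float, impact: float, probability: float) -> bool:
    """True when the mitigation costs less than the loss it is expected to avoid."""
    return mitigation_cost < expected_loss(impact, probability)


# Hypothetical figures: hardware-only impact vs. impact including business losses.
hardware_only = 100_000
with_business_losses = 100_000_000
probability = 0.01          # assumed chance of the event over the planning period
mitigation_cost = 50_000    # assumed cost of the extra redundancy

print(worth_mitigating(mitigation_cost, hardware_only, probability))
print(worth_mitigating(mitigation_cost, with_business_losses, probability))
```

The same mitigation looks like waste when only hardware is counted and looks cheap once lost business is included.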
  
  --
  Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
6. Re: Did they try... by __aaclcg7560 · 2017-06-02 02:57 · Score: 4, Interesting
  
  text book example of a "career changing event"
  Not necessarily. I've been in a few of these "career changing events" over the years. If I make a mistake, I step forward, take responsibility and fix the problem (if I can). Managers are less likely to punish someone who comes forward immediately. In other cases where blame must be assigned, I've already documented my actions and sometimes the actions of those around me. If my CYA is stronger than everyone else's, I'm not going to get blamed for something that I didn't do.
7. Re:Did they try... by swb · 2017-06-02 03:06 · Score: 5, Insightful
  
  I think they also suffer from what I call "efficiency savings hoarding".
  If you have a process that requires 10 labor inputs to achieve and you buy a machine that reduces it to 5 labor inputs, your ongoing savings isn't really 5 labor inputs. You have to spend some of that labor savings in keeping the machine maintained and operational and investing in its replacement when it reaches end of life.
  When I started working for a company in 1993, they had some 40 secretarial positions whose workload was about half spent doing correspondence and scheduling meetings. In 2001, thanks to a widely deployed email/calendaring system, they had cut about 30 of those positions because internal meetings could be automatically planned via email and the bulk of internal correspondence shifted from paper memos to email.
  Yet when it came time to expand/replace the email system due to growth it was seen as a "cost". I actually got the project approved by arguing that the cost of the replacement was actually being paid for by the savings realized from fewer administrative staff -- they still had ample savings (the project was less than 1 administrative FTE). But the efficiency gain from the project wasn't free on an ongoing basis.
  Too many businesses gain efficiencies and savings from automation, but assume these are permanent gains whose maintenance incurs no costs.
  I have an existing client with a large, internally developed kind of ERP system that supports a couple of thousand remote workers. The system is aging out (software versions, resources, performance issues all identified by their own internal developer) and of course the owner is balking at investing in it without realizing that the "free money" from reduced in-office staff needed to process faxes, etc, needs to be applied to maintaining the system to keep achieving the savings.
8. Re:Did they try... by PPH · 2017-06-02 03:17 · Score: 2
  
  The overarching problem is quantifying the value of the functions being performed. You can compare the costs of a clerical staff versus that of an e-mail/calendaring system. But it's difficult to figure out in a business setting what these functions are worth.
  My boss would wail and cry over the inability to peruse the individual schedules of all of his minions for the purpose of calling yet another self-aggrandizing staff meeting. And he would assign a very high value to this function. But back in the 'old days', we just didn't have as many. And we got more work done.
  
  --
  Have gnu, will travel.
9. Re:Did they try... by jcr · 2017-06-02 03:24 · Score: 3, Insightful
  
  For this to actually have been the cause means their IT organization was run by rank amateurs.
  Given the duration of the outage, I'd say that's a fair conclusion.
  -jcr
  
  --
  The only title of honor that a tyrant can grant is "Enemy of the State."
10. Re:Did they try... by I'm+New+Around+Here · 2017-06-02 03:27 · Score: 5, Funny
  
  > Remember, kids, don't plug your "servers" into a $5 power strip and hope for the best.
  Yes, buy a Monster power strip for $50. It has gold plated vacuum tubes for efficient power control. ;^)
  
  --
  If you think I voted for Trump because of this post, you're wrong. I voted for Dr. Jill Stein of the Green Party. Again.
11. Re:Did they try... by Drakonblayde · 2017-06-02 03:32 · Score: 5, Informative
  
  Not entirely true...
  So it depends on what kind of UPS you're employing. If it's the really big ones, you know, the ones the size of generators, then you don't plug stuff directly into them. They tend to be centralized and distribute power to PDU's that are in the racks themselves. The servers plug into the PDU's in the racks, and those PDU's have on/off switches. My fat ass has bumped the power switch on PDU's more than once trying to squeeze into tight spaces between racks. UPS's aren't employed to protect against human error, they're designed to protect against loss of main power.
  If your data center is small enough, you can get away with UPS's mounted in the rack and plug your servers directly into them, but when you're talking about scale, that's just not feasible or cost effective.
  Somehow I doubt British Airways' data center is of the 'couple cabinets in a colo' variety, and they've probably got the big UPS setup.
  Most likely the fault lies with whoever architected the data center. I'll bet either there's very little room between the racks, or the PDU's are mounted in a way they can be accidentally bumped (probably either mid rack or at the bottom). I personally have taken to mounting PDU's at the top of the rack on the backside just to minimize any potential human contact with them.
12. Re: Did they try... by LS1+Brains · 2017-06-02 03:33 · Score: 5, Insightful
  
  I've been in a few of these "career changing event" over the years. If I make a mistake, I step forward, take responsibility and fix the problem (if I can).
  
  As an IT Manager/Director, THANK YOU. Everyone screws up at some point, it's what you do after that really matters.
13. Re:Did they try... by fuzzywig · 2017-06-02 04:10 · Score: 2
  
  Many BIOS(/EFI) have an option to delay the harddrive spin up, so they don't all demand spin-up power from the PSU at the same time.
14. Re: Did they try... by __aaclcg7560 · 2017-06-02 04:18 · Score: 4, Funny
  
  Almost forgot... 3) Kiss my shiny metal ass!
15. Re: Did they try... by __aaclcg7560 · 2017-06-02 04:58 · Score: 2
  
  You're a middle-aged man counting Slashdot karma points?
  Nope. I'm a middle-aged man who wrote a Python script to scrape my comment history. I'll run the script and post the stats when I get home.
16. Re: Did they try... by jellomizer · 2017-06-02 05:00 · Score: 5, Insightful
  
  You mean for the Executive who didn't approve of the hot offsite fail over solution ?
  You know the stuff that normal large organizations have to make sure their business can be operational.
  
  --
  If something is so important that you feel the need to post it on the internet... It probably isn't that important.
17. Re:Did they try... by Dunbal · 2017-06-02 05:12 · Score: 2
  
  Many electrical companies bill industrial customers based on PEAK power consumption, so it's in your interest to spread the load as widely as possible.
  
  --
  Seven puppies were harmed during the making of this post.
18. Re:Did they try... by zeugma-amp · 2017-06-02 05:22 · Score: 5, Interesting
  
  Many moons ago I was working in a datacenter and we had a crew in the hallway hanging wallpaper. In order to do so, they had to remove the box that normally covered the Emergency Power Cutoff switch (which was actually a Big Red Button) that would instantly drop power to the whole room. I'm sure you can guess where this is going...
  One of the paper hangers bumped into the BRB, and *poof*, there went the power to the room. In the data center, we were in the middle of a shift change. My coworkers and I were standing around discussing handoff, and whatnot. Suddenly, we heard a huge Boom as a crapload of switches tripped all around us. Then we heard the drives and fans spinning down.
  As I said, this was a ways back. In the data center we had 12 HP-3000/70 minicomputers, a couple of VAX 11/780s, and a water-cooled IBM 3090 mainframe that were our main systems in the room. The disk drives on the HPs were disk packs of 16" platters sitting in drives the size of a small washtub. They produced a lot of noise. Each HP3K had about 8 or 9 of these things daisy-chained behind the system itself.
  The room was loud. All the time. Well, when the power dropped, all those drives started to spin down. We were all just kind of standing around looking at each other, not knowing what had happened. You could hear the pitch of all those drives winding down, becoming a lower and lower note, until finally - silence.
  Simon and Garfunkel had a song called "The Sound of Silence" many years even further back into the dim reaches of time from when all this was taking place. This was the first instance in my life when I really understood what silence actually sounded like. It was eerie. You never heard silence in the computer room. You have UPS, generators and all kinds of other things to make sure you never actually heard silence.
  So, there we were, standing around with our mouths hanging open, and listening to the eerie silence. The moment broke, and we quickly determined what had happened. Rather than just cut the power back on, we went through and powered off all the drives and such so we could slowly bring everything back up in an orderly fashion.
  One thing that I learned that day was that HP-3000 minicomputers contain a battery designed to allow the things to ride through such catastrophes. Out of the twelve HPs, once we had powered back on all of the drives, nine of them just started executing their next instruction and continued on as if nothing whatsoever had happened. Three needed to be coldstarted, which wasn't a really big deal. Within 30 minutes or so of power being brutally disconnected we had all of them running smoothly, or at least on the way up.
  The two DECs weren't quite so resilient, but after checking their dirty disks, they came back up as well.
  An IBM 3090 does not like to have its power just cut off. It really doesn't. We ended up having issues and it took about 24 hours to return to normal operational status.
  The entire event was kind of cool to run through. Gave me a new respect for HP engineering. For many of our users, all they experienced was that their terminal froze for about 20 minutes, then continued on where it had stopped.
  I don't know if the paper hanger lost his job, but we lost several thousand user hours of time while they were sitting staring at their frozen terminals.
  It was certainly an interesting experience, and I'll never forget the Sound of Silence in the Computer Room.
  
  --
  This is an ex-parrot!
19. Re:Did they try... by ghoul · 2017-06-02 05:29 · Score: 4, Insightful
  
  Managers get paid to take the blame and the stress while workers get paid to do the work.
  
  --
  **Life is too short to be serious**
20. Re: Did they try... by lactose99 · 2017-06-02 05:59 · Score: 2
  
  Resume line-item:
  - Single-handedly tested entire DR operation of British Airways
  
  --
  Fully licensed blockchain psychiatrist
21. Re: Did they try... by lactose99 · 2017-06-02 06:02 · Score: 3, Insightful
  
  "we didn't budget for that"
  "well does your budget include a multi-day downtime when the primary site goes offline?"
  "now how could the primary site possibly go offline?"
  Unfortunately I run into this far more than I should in this industry.
  
  --
  Fully licensed blockchain psychiatrist
22. Re:Did they try... by lactose99 · 2017-06-02 06:04 · Score: 3, Informative
  
  Even a brand new IT graduate knows computers should be plugged into UPS devices that protect against this.
  And you'd be surprised how many shops think this knowledge is someone else's problem and subsequently don't add it to any server installation docs, then look for a scapegoat when systems go tits-up like this.
  
  --
  Fully licensed blockchain psychiatrist
23. Re: Did they try... by sabri · 2017-06-02 08:17 · Score: 4, Interesting
  
  I pay people to not screw up so if you do I'm terminating you and finding someone competent.
  Which would be stupid. What do you think the chances are that this guy will repeat this mistake?
  
  Here is a story a friend of mine once told me. He was working on an AS migration of a major telco, when he made a big boo-boo causing a huge outage for hundreds of thousands of subscribers, making headline news. The next morning he got called into his boss's office, expecting to be fired. He was not. The reason why?
  
  His boss argued that this mistake made him more valuable, since he would not be making that mistake ever ever again.
  
  --
  I'm not a complete idiot... Some parts are missing.
24. Re: Did they try... by Razed+By+TV · 2017-06-02 11:40 · Score: 2
  
  His boss argued that this mistake made him more valuable, since he would not be making that mistake ever ever again.
  I believe there is wisdom in this, but there is a prerequisite.
  The person must have the capacity to learn.
  
  I currently have the pleasure of working with someone who must repeat the same mistakes before he learns from them.
  He breaks off, on average, one screw a month.
  It's always the same.
  *WHIRR*
  *SNAP*
  "Oh crap!"
25. Re:Did they try... by sysrammer · 2017-06-02 15:39 · Score: 2
  
  Hello darkness my old friend...
  
  --
  His ignorance covered the whole earth like a blanket, and there was hardly a hole in it anywhere. - Mark Twain
Bright side by SpaghettiPattern · 2017-06-02 02:10 · Score: 2

Floor got cleaned cheaply and everyone got home early. Long live outsourcing!
Of course I didn't RTFA! With respect to outsourcing there's no difference between strategic and daily tasks like cleaning and strategic planning. Both need to be done short and long term. I can understand outsourcing occasional tasks but daily and strategic stuff will always be needed. Outsourcing of those tasks is a sign of utterly bad management.

--

I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
Re:Am I in the Matrix? by Anonymous Coward · 2017-06-02 02:11 · Score: 2, Insightful

The new article has more details.
Re:That still doesn't explain why by bill_mcgonigle · 2017-06-02 02:12 · Score: 2

they didn't just switch over to their DR site.
You forgot the mic drop.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
N+1 guess not by silas_moeckel · 2017-06-02 02:13 · Score: 3, Insightful

So it was all running in a single DC with a single power bus? There's plenty of room at real datacenters; they need to stop running out of a closet somewhere.

--
No sir I dont like it.
Re: LOL by haemish · 2017-06-02 02:13 · Score: 5, Insightful

Right. It's not the poor guy that turned off the power supply. It's the shit-for-brains managers who wouldn't let the engineers put in redundant power supplies and hired cheap labour that had no clue how to architect for fault tolerance.
Yeah, yeah... blame the contractor... by __aaclcg7560 · 2017-06-02 02:14 · Score: 5, Insightful

This is human error because a contractor accidentally turned off a power supply that caused a world-wide outage? It should be operational error for allowing such a single-point of failure to exist.
What the heck does this switch do? by JoeyRox · 2017-06-02 02:14 · Score: 4, Funny

Not sure, Bob - just flip it so that we can go get some lunch. I'm starving.
1. Re:What the heck does this switch do? by LordWabbit2 · 2017-06-02 03:21 · Score: 2
  
  Heh, you joke, but we had a server in our server room no one was using any more, it was under powered (ie. old) we had all gotten our stuff off of it and thought we might as well shut it down. So we did. Got a call a couple days later from across the country, "WTF happened to our XYZ?". So we switched it on again. No one knew wtf they were doing on/with the server, and our manager didn't even try to find out, he just said "Well leave it on then". It's probably still sitting there quietly doing whatever the fuck it was doing before.
  
  --
  There are three kinds of falsehood: the first is a 'fib,' the second is a downright lie, and the third is statistics.
Stephen Stucker unavailable for comment by RogueWarrior65 · 2017-06-02 02:15 · Score: 2

"Just kidding!"
not the contractor's fault by ooloorie · 2017-06-02 02:20 · Score: 4, Insightful

When your business depends on your IT infrastructure like that, turning off the power to a single machine or data center shouldn't bring down your operation; that's just stupid and bad design. Good enterprise software provides resilience, automatic failover, and geographically distributed operations. Companies need to use that.
And they should actually have tests every few months where they do shut down parts of their infrastructure randomly.
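
A paper version of such a drill can be sketched in Python. The site names, capacities and demand below are hypothetical, and a real test would actually power something down instead of just doing the arithmetic.

```python
import random

# Hypothetical topology: site name -> request capacity that site can serve.
SITES = {"dc-east": 600, "dc-west": 600, "dc-north": 400}
PEAK_DEMAND = 900   # hypothetical peak load the service must carry


def survives_loss_of(site: str, sites: dict, demand: int) -> bool:
    """Can the remaining sites carry the demand if `site` goes dark?"""
    remaining = sum(cap for name, cap in sites.items() if name != site)
    return remaining >= demand


def run_drill(sites: dict, demand: int, rng: random.Random) -> tuple:
    """Pick one site at random, 'shut it down', and report the outcome."""
    victim = rng.choice(sorted(sites))
    return victim, survives_loss_of(victim, sites, demand)


victim, ok = run_drill(SITES, PEAK_DEMAND, random.Random())
print(f"shut down {victim}: {'service survived' if ok else 'OUTAGE'}")
```

If any single choice of `victim` produces an outage on paper, the real drill is going to be a bad day.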
Mental picture from the movie 'Airplane' by DirkDaring · 2017-06-02 02:26 · Score: 2

of Johnny unplugging the extension cord from the wall and the lights on the runway going out. "Just kidding!"
https://datacenteroverlords.files.wordpress.com/2017/01/airplane.jpg
Re:How is this a thing by Archangel+Michael · 2017-06-02 02:44 · Score: 3, Funny

Worker: The sign says "Do not use"
Manager: I don't care what it says, flip the switch
Worker: That's a really stupid idea
Manager: Do it, or you're fired
Worker:
Manager: Well, now you really screwed things up, you're fired!

--
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
How does one DR test in a 24/7 business? by zerofoo · 2017-06-02 02:45 · Score: 4, Interesting

I've worked in banking and real estate businesses where we had the luxury of being able to DR failover test things like redundant databases, WAN connections, power supplies...etc - knowing that if something failed we had time to put it back together - before the business and customers would notice the outage.
How does one actually fail-over test things in production in a 24/7 business - especially one that spans time zones all across the world?
Are lab simulations simply enough? I've never seen a lab environment that could truly replicate a production environment.
1. Re:How does one DR test in a 24/7 business? by silas_moeckel · 2017-06-02 02:53 · Score: 4, Insightful
  
  You do it in production because none of it should cause a massive failure. They bought a DR site and failed to test it. Working at some big shops the DR site was prod every other quarter.
  
  --
  No sir I dont like it.
2. Re:How does one DR test in a 24/7 business? by will_die · 2017-06-02 03:12 · Score: 2
  
  We actually test ours by failing over portions each month and making sure everything works.
  For a smaller place I worked which had a limited DR (not everything failed over), parts were tested on a monthly basis, and everything was tested yearly with a planned failure that was also to ensure the users had training.
  Some DR stuff also now is really nice in that when you tell it to self-test it creates a separate network so you can test the installation at the COOP site.
3. Re:How does one DR test in a 24/7 business? by jader3rd · 2017-06-02 03:34 · Score: 2
  
  How does one actually fail-over test things in production in a 24/7 business
  You eliminate any distinction between maintenance operations and DR. The redundant systems should behave the same during upgrade/patching of one of the nodes, a disk dying on one of the nodes, a node hosting active client connections has its NIC die, having a rack die, having the WAN cut, having the entire datacenter lose power, etc.
  If the underlying redundancy system doesn't significantly differentiate discretionary failover operations from DR failover situations, you can run a 24/7 system.
  See Exchange Database Availability Groups as an example.
4. Re:How does one DR test in a 24/7 business? by Bob+the+Super+Hamste · 2017-06-02 05:40 · Score: 2
  
  Easily. Regularly switching to the backup site should be done as part of the day to day business operations. For example at my job I work with a company that will switch daily between the main and backup system. It doesn't hurt that the main and backup are running in a hot standby configuration and the backup can take over at a moment's notice. They also have 2 additional systems for further levels of redundancy. One is a system that they do a system restore to each day (the previous backup of the main system) that is sitting warm and the other is a cold system where they do a weekly restore from a previous backup of the main system. As the switch-over, as well as the recovery, is done daily as part of regular operations it isn't an issue and everyone there knows what to do. This is for a piece of critical infrastructure which is why there is that level of redundancy, as well as many others to ensure a 99.999% up time of the system but it shows that it is possible to have the requisite up time with a properly designed system and processes.
  
  If you are worried about testing switching to a backup site on a 24/7 system you should also be worried about hardware failures and patches to that same system as those also require outages that you say can't happen as you obviously don't have a system with the required levels of redundancy and are lacking in recovery ability.
  
  --
  Time to offend someone
Re: LOL by thegarbz · 2017-06-02 03:05 · Score: 5, Insightful

who wouldn't let the engineers put in redundant power supplies
That's an interesting assumption. Have you seen anything even remotely indicating that the data centre didn't have redundant power? No amount of redundancy has ever withstood some numbnuts pushing a button. But I'm interested to see your knowledge of the detailed design of this datacentre.
Hell we had an outage on a 6kV dual fed sub the other day thanks to someone in another substation working on a wrong circuit. He was testing intertrips to a completely different substation, applying some power to an intertrip signal, realising he hit the wrong circuit (A), he immediately moved to the one he was supposed to do (B), both in the wrong cubicle successfully knocking out both redundant feeds to a 6kV sub and taking down a portion of the chemical plant in the process.
Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.
Keep your core competencies in house by raymorris · 2017-06-02 03:29 · Score: 2

What definitely needs to be done in-house is whatever your company is supposed to be good at. Ford designs and assembles cars - they shouldn't outsource the design and assembly of cars because that's what they DO - if they stop making cars, they are no longer doing anything and have no reason to exist. Ford is not in the business of making cleaning products, so they probably shouldn't make the cleaning products they use. They should outsource that, buying cleaning products from SC Johnson or someone. Ford is not in the business of cleaning carpets, so that's also a candidate for outsourcing.
Once you have a list of items that can be outsourced because they aren't your "core competencies", the "make or buy" decision becomes mostly a matter of arithmetic. For the same budget cost, will you get it done better by hiring people to do it, or by hiring a company to do it? Equivalently, for the same level of quality, does it cost less to pay in-house people to do it or to pay an outside source? Probably, you'll find that it's better to get an operating system from an outside source, not make your own.
While there is no hard and fast rule, a rule of thumb is to consider the company next door. If you could easily buy the same product or service from the same vendor that the company next door uses, and it would serve your purpose, you should probably do so. General purpose things like office supplies, office cleaning, and payroll services should be purchased, not manufactured in house, because there is no competitive advantage to be gained from having better office supplies than the other company.
No HA? by elistan · 2017-06-02 03:33 · Score: 2

Business critical systems should operate in an active/active high-availability scenario in at least two separate locations. That way the loss of any one node has zero effect except perhaps a transaction retry and reduced performance.

Systems of the next lower level of criticality should have real-time replication to a separate location, so that if a node fails the recovery time is simply what it takes to boot the replacement node.

At further lower levels of criticality you start getting into things like virtualization clusters to mitigate hardware failures supported by point-in-time backups to mitigate data failures. The IT department's Minecraft server can just be a spare desktop machine sitting on an admin's desk.

(There are additional considerations for all levels of criticality too, of course, like SAN volume snapshots, and backups too of course.)
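
One way to make those tiers checkable is to write them down as data. The following Python sketch is only an illustration: the tier names, the `System` fields and the inventory are invented here, not taken from any real tool.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Criticality tiers, loosely following the levels described above."""
    ACTIVE_ACTIVE = 1   # business critical: live in two or more locations
    REPLICATED = 2      # real-time replication to a separate location
    CLUSTERED = 3       # virtualization cluster plus point-in-time backups
    BEST_EFFORT = 4     # e.g. a spare desktop on an admin's desk


@dataclass
class System:
    name: str
    tier: Tier
    active_sites: int = 1
    replica_sites: int = 0
    has_backups: bool = False


def meets_tier(s: System) -> bool:
    """Check a system's deployment against the minimum its tier calls for."""
    if s.tier is Tier.ACTIVE_ACTIVE:
        return s.active_sites >= 2
    if s.tier is Tier.REPLICATED:
        return s.active_sites + s.replica_sites >= 2
    if s.tier is Tier.CLUSTERED:
        return s.has_backups
    return True   # best effort: no minimum


# Hypothetical inventory, purely for illustration.
inventory = [
    System("booking", Tier.ACTIVE_ACTIVE, active_sites=1),
    System("reporting", Tier.REPLICATED, active_sites=1, replica_sites=1),
    System("minecraft", Tier.BEST_EFFORT),
]
gaps = [s.name for s in inventory if not meets_tier(s)]
print(gaps)
```

Anything listed in `gaps` is a system whose deployment falls short of the tier it was assigned.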
Re: LOL by chispito · 2017-06-02 03:50 · Score: 3, Insightful

Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.
It isn't armchair engineering. The CEO should accept full responsibility because that's what it means to be at the top of the reporting chain when such a devastating preventable outage occurs. If he was misled by his direct reports, then he should fire them and take full responsibility for not firing them sooner. Maybe he resigns maybe he doesn't--the point is that he must own the failure, whatever the logical conclusion.

--
The Daddy casts sleep on the Baby. The Baby resists!
Re:How is this a thing by thsths · 2017-06-02 04:16 · Score: 2

> In some cases you actually do need emergency global 'off' switches that are never meant to be used in normal operation.
Yes, if you run a simple experiment, and there is the possibility for harm, a single red button is a good idea.
But if shutting down the server room costs $100 000 000, then a single red button is not a good idea. Instead, you have two parallel power distribution systems, with some physical separation, and there are two off switches. Of course there should be a sign that explains how to use the switch, and I guess that is where this story eventually leads.
Oops by TheDarkener · 2017-06-02 04:16 · Score: 2

(Hopefully) an honest, albeit very consequential mistake. I've done the same thing when I was working on the backside of a server cabinet - the PDU was right there by my shoulder and I swiped it on accident. No UPS in the cabinet (a mistake not of my own but the ones who built it out). Fortunately everything came back on. Good thing to have BIOS settings to 'stay off' after a power failure (so you can turn them back on individually and not overdraw power). I feel bad for the guy who did this, it was probably his last day working there.
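
Bringing machines back individually is really a batching problem. A minimal Python sketch, assuming made-up server names and wattages, and assuming the available headroom stays constant between batches (in reality the machines already running eat into it):

```python
def plan_power_on(inrush_w, headroom_w):
    """Greedily group servers into batches whose combined inrush fits the headroom.

    inrush_w: dict of server name -> power-on draw in watts (hypothetical figures).
    Batches are meant to be powered on one after another, each after the
    previous one has settled.
    """
    batches, current, used = [], [], 0.0
    # Largest draws first; ties keep their original order.
    for name, draw in sorted(inrush_w.items(), key=lambda kv: -kv[1]):
        if draw > headroom_w:
            raise ValueError(f"{name} alone exceeds the available headroom")
        if used + draw > headroom_w:
            batches.append(current)
            current, used = [], 0.0
        current.append(name)
        used += draw
    if current:
        batches.append(current)
    return batches


servers = {"db1": 900, "db2": 900, "web1": 400, "web2": 400, "web3": 400}
batches = plan_power_on(servers, headroom_w=1500)
print(batches)
```

The greedy grouping is not optimal, but it keeps every batch under the limit, which is the part that matters when the alternative is tripping the breaker again.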

--
It is pitch black. You are likely to be eaten by a grue.
And we should believe this? by whitroth · 2017-06-02 04:32 · Score: 2

In our secure rooms, we have an EPO button. It's LARGE, red, and inside a cover that you have to lift to hit.
And this contractor turned off the *entire* power for an *entire* datacenter? Yep, yep, not our fault, not your fault, it's gotta be the fault of that guy over there pushin' a broom!
1. Re:And we should believe this? by markana · 2017-06-02 06:07 · Score: 3, Informative
  
  We had an entire data center shut down this way. Facilities *insisted* that the BRB (Big Red Button) not have any sort of shroud or cover over it. Just in case someone couldn't figure out how to get to the button in a dire emergency.
  So one day, they've got a clueless photographer taking pictures of the racks. He was backing up to get the perfect framing and... well, you can guess the rest.
  Now, the button has a shroud that you have to reach into to hit it, and non-essential personnel are banned from the rooms. Total cost of the outage (even with the geo-redundant systems kicking in) was over $1M.
  Just another day in the life of IT.
Sounds like a system and database design fault by WillAffleckUW · 2017-06-02 04:56 · Score: 2

Have they never heard of multiple servers with the ability to handle server down events for one machine?

--
-- Tigger warning: This post may contain tiggers! --
And _more_ lies! by gweihir · 2017-06-02 05:00 · Score: 2

Sure, that may have been the proverbial last drop. But the actual root-cause is that their systems were not able to cope with outages that must be expected. And the responsibility for that is straight with top management. Their utterly dishonest smoke-screen is just more proof that they should be removed immediately for gross incompetence.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Been there ... by CaptainDork · 2017-06-02 05:06 · Score: 3, Interesting

... I was walking behind the server rack and unknowingly brushed up against the power cord to the Novell 3.1 server.
Later, when my boss asked me for an outage report, I told her, "I wish you hadn't asked that."
I made damned sure that plug was tied to the server after that.

--
It little behooves the best of us to comment on the rest of us.
Re: LOL by Dunbal · 2017-06-02 05:20 · Score: 2

The CEO should accept full responsibility
Hah, the CEO is probably trying to figure out how to give himself more stock options now that they're cheaper. These greedy fuckers can never think past their multi-million payouts.

--
Seven puppies were harmed during the making of this post.