British Airways IT Outage Caused By Contractor Who Accidentally Switched off Power (independent.ie)

Posted by msmash on Friday June 2, 2017 @02:00AM from the getting-to-the-bottom-of-things dept.

An anonymous reader shares a report: A contractor doing maintenance work at a British Airways data centre inadvertently switched off the power supply, knocking out the airline's computer systems and leaving 75,000 people stranded last weekend, according to reports. A BA source told The Times the power supply unit that sparked the IT failure was working perfectly but was accidentally shut down by a worker.

Did they try... by 110010001000 · 2017-06-02 02:03 · Score: 5, Funny

...turning it on again?
1. Re: Did they try... by Anonymous Coward · 2017-06-02 02:16 · Score: 5, Funny
  
  textbook example of a "career changing event"
2. Re:Did they try... by Zocalo · 2017-06-02 02:28 · Score: 5, Interesting
  
  Apparently that was what led to the major outage turning into a prolonged major outage. It seems that the sequence of events is now that the contractor turned off the power (presumably killing one or more phases), obviously leading to a large scale hardware shutdown. Someone (the same contractor, most likely) tried to restore power, as you would, only to find that the power surge of all that hardware switching back on at the same time, which often means that they are close to maximum power draw, overloaded the system and caused physical damage to the hardware.
  
  While there's a lot of mocking of BA going on at the moment, that's actually a pretty easy situation to get into if you've expanded a DC - or even a regular equipment room - over several years without proper power management, and BA is far from the first company to be caught out. So, if you are responsible for some IT equipment rooms, here are two things to consider: what's the combined total power draw of all the equipment in each room on power on (don't forget to include any UPS units topping up their batteries!), and what's the maximum power load that can be supplied to each room? If you can't answer both of those, or at least be certain that the latter exceeds the former in each case, then you've potentially got exactly the same situation as BA.
  
  None of which excuses BA from not having the ability to successfully failover between redundant DCs in the event of a catastrophic outage at one facility, of course.
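
The two questions above boil down to a simple budget check, which might be sketched roughly as follows. This is only an illustration: the `Device` type, the wattage figures, and the 10 kW supply are all made-up assumptions, and `staged_power_on` is one simple greedy way to avoid switching everything on at once, not a description of what BA or any vendor actually does.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    steady_watts: float   # draw once the device is up and running
    startup_watts: float  # peak draw while powering on (hypothetical figures)


def startup_headroom(devices, supply_watts):
    """Supply capacity minus the combined power-on draw; negative means overload."""
    return supply_watts - sum(d.startup_watts for d in devices)


def staged_power_on(devices, supply_watts):
    """Greedily group devices into batches so that the steady draw of earlier
    batches plus the startup draw of the current batch stays within the supply."""
    batches = []
    batch = []
    running = 0.0  # steady draw of devices started in earlier batches
    for d in devices:
        batch_startup = sum(x.startup_watts for x in batch)
        if batch and running + batch_startup + d.startup_watts > supply_watts:
            # Close the current batch and count it as running from now on.
            batches.append(batch)
            running += sum(x.steady_watts for x in batch)
            batch = []
        if running + d.startup_watts > supply_watts:
            raise ValueError(f"{d.name} cannot be started within the supply limit")
        batch.append(d)
    if batch:
        batches.append(batch)
    return batches


# Hypothetical equipment room with a 10 kW supply.
room = [
    Device("ups", 500, 1500),       # startup figure includes battery top-up
    Device("rack-a", 3000, 5000),
    Device("rack-b", 3000, 5000),
    Device("storage", 2000, 3000),
]

print(startup_headroom(room, 10_000))  # negative: everything at once overloads
for step, batch in enumerate(staged_power_on(room, 10_000), start=1):
    print(step, [d.name for d in batch])
```

With these invented numbers the room runs comfortably once everything is up, but powering it all on together exceeds the supply, which is the trap described above.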
  
  --
  UNIX? They're not even circumcised! Savages!
3. Re:Did they try... by sycodon · 2017-06-02 02:34 · Score: 5, Insightful
  
  Bullshit.
  Even a brand new IT graduate knows computers should be plugged into UPS devices that protect against this.
  Handling power outages is about as basic of an IT task as they come. Basic Lock Out practices that prevent power from accidentally being turned off are also Server Maintenance 101.
  For this to actually have been the cause means their IT organization was run by rank amateurs.
  
  --
  When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
4. Re:Did they try... by Archangel Michael · 2017-06-02 02:42 · Score: 5, Interesting
  
  There are two ways to engineer power for a datacenter. 1) You can engineer for maximum efficiency/lowest cost or 2) you can engineer for redundancy/max safety. Penny Pinchers always choose the former, and IT guys usually want the latter.
  Here is the real equation: Cost * likelihood of a catastrophic event. If you think 100,000 * a .0000001 chance of catastrophe, you err on the side of savings. On the other hand, if you think $25 * 100.00 chance of catastrophe, you err on the side of cost.
  My guess is that they didn't account for business losses when plugging in that (obviously oversimplified) formula. This is why you leave penny pinching idiots out of the decision making, because when all you see is cost, and don't properly evaluate the catastrophic losses in event of disaster, then you're just an idiot that nobody should listen to.
  I get that there are budgets and such, but here is my one question I (IT guy) ask the "business" decision makers: If you lost everything, how much would it cost you? Most people undervalue the data inside the databases and documents, because they have no way of quantifying how much all that data is worth.
  Data is the biggest unaccounted-for asset of a business.
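
The "cost * likelihood" formula above can be written out as a rough expected-loss comparison. The sketch below is deliberately as oversimplified as the formula itself, and every number in it is a made-up assumption chosen only to show how leaving out business losses can flip the decision; the function names are hypothetical.

```python
def expected_loss(outage_cost, annual_probability):
    """Cost of the event multiplied by its likelihood in a given year."""
    return outage_cost * annual_probability


def worth_mitigating(outage_cost, annual_probability, mitigation_cost_per_year,
                     residual_probability=0.0):
    """True if the expected loss avoided per year exceeds what the mitigation costs."""
    avoided = (expected_loss(outage_cost, annual_probability)
               - expected_loss(outage_cost, residual_probability))
    return avoided > mitigation_cost_per_year


# Hypothetical figures: a 2% yearly chance of losing the data centre,
# and redundancy that costs 250,000 a year.
hardware_only = 500_000              # counting only the damaged equipment
with_business_losses = 100_000_000   # counting stranded customers, refunds, etc.

print(worth_mitigating(hardware_only, 0.02, 250_000))         # redundancy looks too expensive
print(worth_mitigating(with_business_losses, 0.02, 250_000))  # redundancy looks cheap
```

The only thing that changes between the two calls is which losses are counted, which is the point being made about leaving business losses out of the formula.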
  
  --
  Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
5. Re: Did they try... by __aaclcg7560 · 2017-06-02 02:57 · Score: 4, Interesting
  
  textbook example of a "career changing event"
  Not necessarily. I've been in a few of these "career changing events" over the years. If I make a mistake, I step forward, take responsibility and fix the problem (if I can). Managers are less likely to punish someone who comes forward immediately. In other cases where blame must be assigned, I've already documented my actions and sometimes the actions of those around me. If my CYA is stronger than everyone else's, I'm not going to get blamed for something that I didn't do.
6. Re:Did they try... by swb · 2017-06-02 03:06 · Score: 5, Insightful
  
  I think they also suffer from what I call "efficiency savings hoarding".
  If you have a process that requires 10 labor inputs to achieve and you buy a machine that reduces it to 5 labor inputs, your ongoing savings isn't really 5 labor inputs. You have to spend some of that labor savings in keeping the machine maintained and operational and investing in its replacement when it reaches end of life.
  When I started working for a company in 1993, they had some 40 secretarial positions whose workload was about half spent doing correspondence and scheduling meetings. In 2001, thanks to a widely deployed email/calendaring system, they had cut about 30 of those positions because internal meetings could be automatically planned via email and the bulk of internal correspondence shifted from paper memos to email.
  Yet when it came time to expand/replace the email system due to growth it was seen as a "cost". I actually got the project approved by arguing that the cost of the replacement was actually being paid for by the savings realized from fewer administrative staff -- they still had ample savings (the project was less than 1 administrative FTE). But the efficiency gain from the project wasn't free on an ongoing basis.
  Too many businesses gain efficiencies and savings from automation, but assume these are permanent gains whose maintenance incurs no costs.
  I have an existing client with a large, internally developed kind of ERP system that supports a couple of thousand remote workers. The system is aging out (software versions, resources, performance issues all identified by their own internal developer) and of course the owner is balking at investing in it without realizing that the "free money" from reduced in-office staff needed to process faxes, etc, needs to be applied to maintaining the system to keep achieving the savings.
7. Re:Did they try... by jcr · 2017-06-02 03:24 · Score: 3, Insightful
  
  For this to actually have been the cause means their IT organization was run by rank amateurs.
  Given the duration of the outage, I'd say that's a fair conclusion.
  -jcr
  
  --
  The only title of honor that a tyrant can grant is "Enemy of the State."
8. Re:Did they try... by I'm New Around Here · 2017-06-02 03:27 · Score: 5, Funny
  
  Remember, kids, don't plug your "servers" into a $5 power strip and hope for the best.
  Yes, buy a Monster power strip for $50. It has gold plated vacuum tubes for efficient power control. ;^)
  
  --
  If you think I voted for Trump because of this post, you're wrong. I voted for Dr. Jill Stein of the Green Party. Again.
9. Re:Did they try... by Drakonblayde · 2017-06-02 03:32 · Score: 5, Informative
  
  Not entirely true...
  So it depends on what kind of UPS you're employing. If it's the really big ones, you know, the ones the size of generators, then you don't plug stuff directly into them. They tend to be centralized and distribute power to PDU's that are in the racks themselves. The servers plug into the PDU's in the racks, and those PDU's have on/off switches. My fat ass has bumped the power switch on PDU's more than once trying to squeeze into tight spaces between racks. UPS's aren't employed to protect against human error, they're designed to protect against loss of main power.
  If your data center is small enough, you can get away with UPS's mounted in the rack and plug your servers directly into them, but when you're talking about scale, that's just not feasible or cost effective.
  Somehow I doubt British Airways' data center is of the 'couple cabinets in a colo' variety, and they've probably got the big UPS setup.
  Most likely the fault lies with whoever architected the data center. I'll bet either there's very little room between the racks, or the PDU's are mounted in a way they can be accidentally bumped (probably either mid rack or at the bottom). I personally have taken to mounting PDU's at the top of the rack on the backside just to minimize any potential human contact with them.
10. Re: Did they try... by LS1 Brains · 2017-06-02 03:33 · Score: 5, Insightful
  
  I've been in a few of these "career changing events" over the years. If I make a mistake, I step forward, take responsibility and fix the problem (if I can).
  
  As an IT Manager/Director, THANK YOU. Everyone screws up at some point, it's what you do after that really matters.
11. Re: Did they try... by __aaclcg7560 · 2017-06-02 04:18 · Score: 4, Funny
  
  Almost forgot... 3) Kiss my shiny metal ass!
12. Re: Did they try... by jellomizer · 2017-06-02 05:00 · Score: 5, Insightful
  
  You mean for the Executive who didn't approve of the hot offsite failover solution?
  You know the stuff that normal large organizations have to make sure their business can be operational.
  
  --
  If something is so important that you feel the need to post it on the internet... It probably isn't that important.
13. Re:Did they try... by zeugma-amp · 2017-06-02 05:22 · Score: 5, Interesting
  
  Many moons ago I was working in a datacenter and we had a crew in the hallway hanging wallpaper. In order to do so, they had to remove the box that normally covered the Emergency Power Cutoff switch (which was actually a Big Red Button) that would instantly drop power to the whole room. I'm sure you can guess where this is going...
  One of the paper hangers bumped into the BRB, and *poof*, there went the power to the room. In the data center, we were in the middle of a shift change. My coworkers and I were standing around discussing handoff, and whatnot. Suddenly, we heard a huge Boom as a crapload of switches tripped all around us. Then we heard the drives and fans spinning down.
  As I said, this was a ways back. In the data center we had 12 HP-3000/70 minicomputers, a couple of VAX 11/780s, and a water-cooled IBM 3090 mainframe that were our main systems in the room. The disk drives on the HPs were disk packs of 16" platters sitting in drives the size of a small washtub. They produced a lot of noise. Each HP3K had about 8 or 9 of these things daisy-chained behind the system itself.
  The room was loud. All the time. Well, when the power dropped, all those drives started to spin down. We were all just kind of standing around looking at each other, not knowing what had happened. You could hear the pitch of all those drives winding down, becoming a lower and lower note, until finally - silence.
  Simon and Garfunkel had a song called "The Sound of Silence" many years even further back into the dim reaches of time from when all this was taking place. This was the first instance in my life when I really understood what silence actually sounded like. It was eerie. You never heard silence in the computer room. You have UPS, generators and all kinds of other things to make sure you never actually heard silence.
  So, there we were, standing around with our mouths hanging open, and listening to the eerie silence. The moment broke, and we quickly determined what had happened. Rather than just cut the power back on, we went through and powered off all the drives and such so we could slowly bring everything back up in an orderly fashion.
  One thing that I learned that day was that HP-3000 minicomputers contain a battery designed to allow the things to ride through such catastrophes. Out of the twelve HPs, once we had powered back on all of the drives, nine of them just started executing their next instruction and continued on as if nothing whatsoever had happened. Three needed to be coldstarted, which wasn't a really big deal. Within 30 minutes or so of power being brutally disconnected we had all of them running smoothly, or at least on the way up.
  The two DECs weren't quite so resilient, but after checking their dirty disks, they came back up as well.
  An IBM 3090 does not like to have its power just cut off. It really doesn't. We ended up having issues and it took about 24 hours to return to normal operational status.
  The entire event was kind of cool to run through. Gave me a new respect for HP engineering. For many of our users, all they experienced was that their terminal froze for about 20 minutes, then continued on where it had stopped.
  I don't know if the paper hanger lost his job, but we lost several thousand user hours of time while they were sitting staring at their frozen terminals.
  It was certainly an interesting experience, and I'll never forget the Sound of Silence in the Computer Room.
  
  --
  This is an ex-parrot!
14. Re:Did they try... by ghoul · 2017-06-02 05:29 · Score: 4, Insightful
  
  Managers get paid to take the blame and the stress while workers get paid to do the work.
  
  --
  **Life is too short to be serious**
15. Re: Did they try... by lactose99 · 2017-06-02 06:02 · Score: 3, Insightful
  
  "we didn't budget for that"
  "well does your budget include a multi-day downtime when the primary site goes offline?"
  "now how could the primary site possibly go offline?"
  Unfortunately I run into this far more than I should in this industry.
  
  --
  Fully licensed blockchain psychiatrist
16. Re:Did they try... by lactose99 · 2017-06-02 06:04 · Score: 3, Informative
  
  Even a brand new IT graduate knows computers should be plugged into UPS devices that protect against this.
  And you'd be surprised how many shops think this knowledge is someone else's problem and subsequently don't add it to any server installation docs, then look for a scapegoat when systems go tits-up like this.
  
  --
  Fully licensed blockchain psychiatrist
17. Re: Did they try... by sabri · 2017-06-02 08:17 · Score: 4, Interesting
  
  I pay people to not screw up so if you do I'm terminating you and finding someone competent.
  Which would be stupid. What do you think the chances are that this guy will repeat this mistake?
  
  Here is a story a friend of mine once told me. He was working on an AS migration of a major telco, when he made a big boo-boo causing a huge outage for hundreds of thousands of subscribers, making headline news. The next morning he got called into his boss's office, expecting to be fired. He was not. The reason why?
  
  His boss argued that this mistake made him more valuable, since he would not be making that mistake ever ever again.
  
  --
  I'm not a complete idiot... Some parts are missing.
N+1 guess not by silas_moeckel · 2017-06-02 02:13 · Score: 3, Insightful

So it was all running in a single DC with a single power bus? There's plenty of room at real datacenters; they need to stop running out of a closet somewhere.

--
No sir I dont like it.
Re: LOL by haemish · 2017-06-02 02:13 · Score: 5, Insightful

Right. It's not the poor guy that turned off the power supply. It's the shit-for-brains managers who wouldn't let the engineers put in redundant power supplies and hired cheap labour that had no clue how to architect for fault tolerance.
Yeah, yeah... blame the contractor... by __aaclcg7560 · 2017-06-02 02:14 · Score: 5, Insightful

This is human error because a contractor accidentally turned off a power supply that caused a world-wide outage? It should be operational error for allowing such a single point of failure to exist.
What the heck does this switch do? by JoeyRox · 2017-06-02 02:14 · Score: 4, Funny

Not sure, Bob - just flip it so that we can go get some lunch. I'm starving.
not the contractor's fault by ooloorie · 2017-06-02 02:20 · Score: 4, Insightful

When your business depends on your IT infrastructure like that, turning off the power to a single machine or data center shouldn't bring down your operation; that's just stupid and bad design. Good enterprise software provides resilience, automatic failover, and geographically distributed operations. Companies need to use that.
And they should actually have tests every few months where they do shut down parts of their infrastructure randomly.
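
One way to picture the random shutdown tests suggested above is a small picker that chooses what to switch off for a drill. This is only a sketch: the service and site names are invented, and the rule of only targeting services that still have a spare replica is an assumption made here to keep the drill survivable, not something stated in the story.

```python
import random


def pick_drill_target(replicas_by_service, rng=random):
    """Pick one (service, replica) to shut down for a failover drill.

    Only services with at least two replicas are eligible, so a drill never
    removes the last copy of anything. Returns None if nothing is eligible.
    """
    candidates = [
        (service, replica)
        for service, replicas in sorted(replicas_by_service.items())
        if len(replicas) >= 2
        for replica in replicas
    ]
    if not candidates:
        return None
    return rng.choice(candidates)


# Hypothetical layout: which sites host a copy of each service.
layout = {
    "booking": ["dc1", "dc2"],
    "checkin": ["dc1", "dc2", "dc3"],
    "legacy": ["dc1"],  # single copy: never picked, and a finding in itself
}

print(pick_drill_target(layout))
```

Anything the picker refuses to touch, like the single-copy "legacy" entry here, is exactly the kind of single point of failure the drill is meant to surface.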
Re:How is this a thing by Archangel Michael · 2017-06-02 02:44 · Score: 3, Funny

Worker: The sign says "Do not use"
Manager: I don't care what it says, flip the switch
Worker: That's a really stupid idea
Manager: Do it, or you're fired
Worker:
Manager: Well, now you really screwed things up, you're fired!

--
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
How does one DR test in a 24/7 business? by zerofoo · 2017-06-02 02:45 · Score: 4, Interesting

I've worked in banking and real estate businesses where we had the luxury of being able to DR failover test things like redundant databases, WAN connections, power supplies...etc - knowing that if something failed we had time to put it back together - before the business and customers would notice the outage.
How does one actually fail-over test things in production in a 24/7 business - especially one that spans time zones all across the world?
Are lab simulations simply enough? I've never seen a lab environment that could truly replicate a production environment.
1. Re:How does one DR test in a 24/7 business? by silas_moeckel · 2017-06-02 02:53 · Score: 4, Insightful
  
  You do it in production because none of it should cause a massive failure. They bought a DR site and failed to test it. Working at some big shops the DR site was prod every other quarter.
  
  --
  No sir I dont like it.
Re: LOL by thegarbz · 2017-06-02 03:05 · Score: 5, Insightful

who wouldn't let the engineers put in redundant power supplies
That's an interesting assumption. Have you seen anything even remotely indicating that the data centre didn't have redundant power? No amount of redundancy has ever withstood some numbnuts pushing a button. But I'm interested to see your knowledge of the detailed design of this datacentre.
Hell, we had an outage on a 6kV dual-fed sub the other day thanks to someone in another substation working on the wrong circuit. He was testing intertrips to a completely different substation and applied power to an intertrip signal. Realising he had hit the wrong circuit (A), he immediately moved to the one he was supposed to do (B). Both were in the wrong cubicle, so he successfully knocked out both redundant feeds to a 6kV sub and took down a portion of the chemical plant in the process.
Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.
Re: LOL by chispito · 2017-06-02 03:50 · Score: 3, Insightful

Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.
It isn't armchair engineering. The CEO should accept full responsibility because that's what it means to be at the top of the reporting chain when such a devastating preventable outage occurs. If he was misled by his direct reports, then he should fire them and take full responsibility for not firing them sooner. Maybe he resigns, maybe he doesn't--the point is that he must own the failure, whatever the logical conclusion.

--
The Daddy casts sleep on the Baby. The Baby resists!
Been there ... by CaptainDork · 2017-06-02 05:06 · Score: 3, Interesting

... I was walking behind the server rack and unknowingly brushed up against the power cord to the Novell 3.1 server.
Later, when my boss asked me for an outage report, I told her, "I wish you hadn't asked that."
I made damned sure that plug was tied to the server after that.

--
It little behooves the best of us to comment on the rest of us.
Re:And we should believe this? by markana · 2017-06-02 06:07 · Score: 3, Informative

We had an entire data center shut down this way. Facilities *insisted* that the BRB (Big Red Button) not have any sort of shroud or cover over it. Just in case someone couldn't figure out how to get to the button in a dire emergency.
So one day, they've got a clueless photographer taking pictures of the racks. He was backing up to get the perfect framing and... well, you can guess the rest.
Now, the button has a shroud that you have to reach into to hit it, and non-essential personnel are banned from the rooms. Total cost of the outage (even with the geo-redundant systems kicking in) was over $1M.
Just another day in the life of IT.