British Airways IT Outage Caused By Contractor Who Accidentally Switched off Power (independent.ie)

Posted by msmash on Friday June 2, 2017 @02:00AM from the getting-to-the-bottom-of-things dept.

An anonymous reader shares a report: A contractor doing maintenance work at a British Airways data centre inadvertently switched off the power supply, knocking out the airline's computer systems and leaving 75,000 people stranded last weekend, according to reports. A BA source told The Times the power supply unit that sparked the IT failure was working perfectly but was accidentally shut down by a worker.

14 of 262 comments

Did they try... by 110010001000 · 2017-06-02 02:03 · Score: 5, Funny

...turning it on again?
1. Re: Did they try... by Anonymous Coward · 2017-06-02 02:16 · Score: 5, Funny
  
  Textbook example of a "career changing event"
2. Re:Did they try... by Zocalo · 2017-06-02 02:28 · Score: 5, Interesting
  
  Apparently that was what turned the major outage into a prolonged major outage. The sequence of events now seems to be that the contractor turned off the power (presumably killing one or more phases), obviously leading to a large scale hardware shutdown. Someone (the same contractor, most likely) tried to restore power, as you would, only to find that the surge from all that hardware switching back on at the same time, which is when each device is often close to its maximum power draw, overloaded the system and caused physical damage to the hardware.
  
  While there's a lot of mocking of BA going on at the moment, that's actually a pretty easy situation to get into if you've expanded a DC - or even a regular equipment room - over several years without proper power management, and BA is far from the first company to be caught out. So, if you are responsible for some IT equipment rooms, here are two things to consider: what's the combined total power draw of all the equipment in each room on power on (don't forget to include any UPS units topping up their batteries!), and what's the maximum power load that can be supplied to each room? If you can't answer both of those, or at least be certain that the latter exceeds the former in each case, then you've potentially got exactly the same situation as BA.
  
  None of which excuses BA from not having the ability to successfully failover between redundant DCs in the event of a catastrophic outage at one facility, of course.
  
  --
  UNIX? They're not even circumcised! Savages!
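
The two questions in the comment above (simultaneous power-on draw versus what the room can be supplied with) can be sketched as a quick check. This is only an illustration: the `Device` type, the `inrush_factor` multiplier and every wattage below are made-up assumptions, not figures from BA or from the comment.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    steady_watts: float   # typical running draw (hypothetical figure)
    inrush_factor: float  # assumed power-on draw as a multiple of steady draw


def power_on_draw(devices):
    """Worst-case draw if every device switches on at the same moment."""
    return sum(d.steady_watts * d.inrush_factor for d in devices)


def room_is_safe(devices, supply_watts):
    """True when the room's supply exceeds the simultaneous power-on draw."""
    return supply_watts > power_on_draw(devices)


# Hypothetical room; note the UPS recharging is counted as a load too.
room = [
    Device("rack-a servers", 8000, 1.5),
    Device("rack-b storage", 5000, 2.0),
    Device("ups recharge", 2000, 1.0),
]

print(power_on_draw(room))        # 24000.0
print(room_is_safe(room, 20000))  # False
```

With these invented numbers the steady load (15 kW) fits a 20 kW supply, but the simultaneous power-on draw does not, which is the trap the comment describes.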
3. Re:Did they try... by sycodon · 2017-06-02 02:34 · Score: 5, Insightful
  
  Bullshit.
  Even a brand new IT graduate knows computers should be plugged into UPS devices that protect against this.
  Handling power outages is about as basic an IT task as they come. Basic Lock Out practices that prevent power from accidentally being turned off are also Server Maintenance 101.
  For this to actually have been the cause means their IT organization was run by rank amateurs.
  
  --
  When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
4. Re:Did they try... by Archangel+Michael · 2017-06-02 02:42 · Score: 5, Interesting
  
  There are two ways to engineer power for a datacenter: 1) you can engineer for maximum efficiency/lowest cost, or 2) you can engineer for redundancy/max safety. Penny Pinchers always choose the former, and IT guys usually want the latter.
  Here is the real equation: Cost * likelihood of catastrophic event. If you think 100,000 * a .0000001 chance of catastrophe, you err on the side of savings. On the other hand, if you think $25 * 100.00 chance of catastrophe, you err on the side of cost.
  My guess is that they didn't account for business losses when plugging into that (obviously oversimplified) formula. This is why you leave penny pinching idiots out of the decision making, because when all you see is cost, and you don't properly evaluate the catastrophic losses in the event of disaster, then you're just an idiot that nobody should listen to.
  I get that there are budgets and such, but here is my one question I (IT guy) ask the "business" decision makers: If you lost everything, how much would it cost you? Most people undervalue the data inside the databases and documents, because they have no way of quantifying how much all that data is worth.
  Data is the biggest unaccounted-for asset of a business.
  
  --
  Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
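
The "cost * likelihood" formula in the comment above, and the point about leaving business losses out of it, can be sketched like this. All figures are hypothetical and chosen only to show how the decision flips; the function names are illustrative.

```python
def expected_loss(cost_of_event, probability):
    """The comment's (deliberately oversimplified) formula: cost * likelihood."""
    return cost_of_event * probability


def worth_mitigating(cost_of_event, probability, mitigation_cost):
    """True when the expected loss outweighs what the mitigation costs."""
    return expected_loss(cost_of_event, probability) > mitigation_cost


# Hypothetical inputs, not BA's numbers.
probability = 0.01               # assumed yearly chance of a major outage
mitigation_cost = 200_000        # assumed yearly cost of extra redundancy
hardware_only = 100_000          # cost if you only count damaged kit
with_business_loss = 50_000_000  # cost once lost business and data are counted

print(worth_mitigating(hardware_only, probability, mitigation_cost))
print(worth_mitigating(with_business_loss, probability, mitigation_cost))
```

Counting only hardware, the redundancy looks like waste; counting business losses, the same formula says to pay for it.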
5. Re:Did they try... by swb · 2017-06-02 03:06 · Score: 5, Insightful
  
  I think they also suffer from what I call "efficiency savings hoarding".
  If you have a process that requires 10 labor inputs to achieve and you buy a machine that reduces it to 5 labor inputs, your ongoing savings isn't really 5 labor inputs. You have to spend some of that labor savings in keeping the machine maintained and operational and investing in its replacement when it reaches end of life.
  When I started working for a company in 1993, they had some 40 secretarial positions whose workload was about half spent doing correspondence and scheduling meetings. By 2001, thanks to a widely deployed email/calendaring system, they had cut about 30 of those positions because internal meetings could be automatically planned via email and the bulk of internal correspondence had shifted from paper memos to email.
  Yet when it came time to expand/replace the email system due to growth it was seen as a "cost". I actually got the project approved by arguing that the cost of the replacement was actually being paid for by the savings realized from fewer administrative staff -- they still had ample savings (the project was less than 1 administrative FTE). But the efficiency gain from the project wasn't free on an ongoing basis.
  Too many businesses gain efficiencies and savings from automation, but assume these are permanent gains whose maintenance incurs no costs.
  I have an existing client with a large, internally developed kind of ERP system that supports a couple of thousand remote workers. The system is aging out (software versions, resources, performance issues all identified by their own internal developer) and of course the owner is balking at investing in it without realizing that the "free money" from reduced in-office staff needed to process faxes, etc, needs to be applied to maintaining the system to keep achieving the savings.
6. Re:Did they try... by I'm+New+Around+Here · 2017-06-02 03:27 · Score: 5, Funny
  
  > Remember, kids, don't plug your "servers" into a $5 power strip and hope for the best.
  Yes, buy a Monster power strip for $50. It has gold plated vacuum tubes for efficient power control. ;^)
  
  --
  If you think I voted for Trump because of this post, you're wrong. I voted for Dr. Jill Stein of the Green Party. Again.
7. Re:Did they try... by Drakonblayde · 2017-06-02 03:32 · Score: 5, Informative
  
  Not entirely true...
  So it depends on what kind of UPS you're employing. If it's the really big ones, you know, the ones the size of generators, then you don't plug stuff directly into them. They tend to be centralized and distribute power to PDU's that are in the racks themselves. The servers plug into the PDU's in the racks, and those PDU's have on/off switches. My fat ass has bumped the power switch on PDU's more than once trying to squeeze into tight spaces between racks. UPS's aren't employed to protect against human error, they're designed to protect against loss of main power.
  If your data center is small enough, you can get away with UPS's mounted in the rack and plug your servers directly into them, but when you're talking about scale, that's just not feasible or cost effective.
  Somehow I doubt British Airways' data center is of the 'couple cabinets in a colo' variety, and they've probably got the big UPS setup.
  Most likely the fault lies with whoever architected the data center. I'll bet either there's very little room between the racks, or the PDU's are mounted in a way they can be accidentally bumped (probably either mid rack or at the bottom). I personally have taken to mounting PDU's at the top of the rack on the backside just to minimize any potential human contact with them.
8. Re: Did they try... by LS1+Brains · 2017-06-02 03:33 · Score: 5, Insightful
  
  I've been in a few of these "career changing events" over the years. If I make a mistake, I step forward, take responsibility and fix the problem (if I can).
  
  As an IT Manager/Director, THANK YOU. Everyone screws up at some point, it's what you do after that really matters.
9. Re: Did they try... by jellomizer · 2017-06-02 05:00 · Score: 5, Insightful
  
  You mean for the Executive who didn't approve the hot offsite failover solution?
  You know, the stuff that normal large organizations have to make sure their business can stay operational.
  
  --
  If something is so important that you feel the need to post it on the internet... It probably isn't that important.
10. Re:Did they try... by zeugma-amp · 2017-06-02 05:22 · Score: 5, Interesting
  
  Many moons ago I was working in a datacenter and we had a crew in the hallway hanging wallpaper. In order to do so, they had to remove the box that normally covered the Emergency Power Cutoff switch (which was actually a Big Red Button) that would instantly drop power to the whole room. I'm sure you can guess where this is going...
  One of the paper hangers bumped into the BRB, and *poof*, there went the power to the room. In the data center, we were in the middle of a shift change. My coworkers and I were standing around discussing handoff, and whatnot. Suddenly, we heard a huge Boom as a crapload of switches tripped all around us. Then we heard the drives and fans spinning down.
  As I said, this was a ways back. In the data center we had 12 HP-3000/70 minicomputers, a couple of VAX 11/780s, and a water-cooled IBM 3090 mainframe that were our main systems in the room. The disk drives on the HPs were disk packs of 16" platters sitting in drives the size of a small washtub. They produced a lot of noise. Each HP3K had about 8 or 9 of these things daisy-chained behind the system itself.
  The room was loud. All the time. Well, when the power dropped, all those drives started to spin down. We were all just kind of standing around looking at each other, not knowing what had happened. You could hear the pitch of all those drives winding down, becoming a lower and lower note, until finally - silence.
  Simon and Garfunkel had a song called "The Sound of Silence" many years even further back into the dim reaches of time from when all this was taking place. This was the first instance in my life when I really understood what silence actually sounded like. It was eerie. You never heard silence in the computer room. You have UPS, generators and all kinds of other things to make sure you never actually hear silence.
  So, there we were, standing around with our mouths hanging open, and listening to the eerie silence. The moment broke, and we quickly determined what had happened. Rather than just cut the power back on, we went through and powered off all the drives and such so we could slowly bring everything back up in an orderly fashion.
  One thing that I learned that day was that HP-3000 minicomputers contain a battery designed to allow the things to ride through such catastrophes. Out of the twelve HPs, once we had powered back on all of the drives, nine of them just started executing their next instruction and continued on as if nothing whatsoever had happened. Three needed to be coldstarted, which wasn't a really big deal. Within 30 minutes or so of power being brutally disconnected we had all of them running smoothly, or at least on the way up.
  The two DECs weren't quite so resilient, but after checking their dirty disks, they came back up as well.
  An IBM 3090 does not like to have its power just cut off. It really doesn't. We ended up having issues and it took about 24 hours to return to normal operational status.
  The entire event was kind of cool to run through. Gave me a new respect for HP engineering. For many of our users, all they experienced was that their terminal froze for about 20 minutes, then continued on where it had stopped.
  I don't know if the paper hanger lost his job, but we lost several thousand user hours of time while they were sitting staring at their frozen terminals.
  It was certainly an interesting experience, and I'll never forget the Sound of Silence in the Computer Room.
  
  --
  This is an ex-parrot!
Re: LOL by haemish · 2017-06-02 02:13 · Score: 5, Insightful

Right. It's not the poor guy that turned off the power supply. It's the shit-for-brains managers who wouldn't let the engineers put in redundant power supplies and hired cheap labour that had no clue how to architect for fault tolerance.
Yeah, yeah... blame the contractor... by __aaclcg7560 · 2017-06-02 02:14 · Score: 5, Insightful

This is human error because a contractor accidentally turned off a power supply that caused a world-wide outage? It should be operational error for allowing such a single point of failure to exist.
Re: LOL by thegarbz · 2017-06-02 03:05 · Score: 5, Insightful

who wouldn't let the engineers put in redundant power supplies
That's an interesting assumption. Have you seen anything even remotely indicating that the data centre didn't have redundant power? No amount of redundancy has ever withstood some numbnuts pushing a button. But I'm interested to see your knowledge of the detailed design of this datacentre.
Hell, we had an outage on a 6kV dual-fed sub the other day thanks to someone in another substation working on the wrong circuit. He was testing intertrips to a completely different substation and applied power to an intertrip signal. Realising he had hit the wrong circuit (A), he immediately moved to the one he was supposed to do (B). Both were in the wrong cubicle, so he knocked out both redundant feeds to a 6kV sub and took down a portion of the chemical plant in the process.
Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.