British Airways IT Outage Caused By Contractor Who Accidentally Switched off Power (independent.ie)
An anonymous reader shares a report: A contractor doing maintenance work at a British Airways data centre inadvertently switched off the power supply, knocking out the airline's computer systems and leaving 75,000 people stranded last weekend, according to reports. A BA source told The Times the power supply unit that sparked the IT failure was working perfectly but was accidentally shut down by a worker.
...turning it on again?
Because I'm having déjà vu.
#DeleteFacebook
Seems like this 'test' to see if the UPS would kick in didn't work.
So the CEO _should_ resign after all.
they didn't just switch over to their DR site.
Floor got cleaned cheaply and everyone got home early. Long live outsourcing!
Of course I didn't RTFA! With respect to outsourcing there's no difference between daily tasks like cleaning and strategic tasks like planning. Both need to be done in the short and the long term. I can understand outsourcing occasional tasks, but daily and strategic work will always be needed. Outsourcing those tasks is a sign of utterly bad management.
I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
Been there, sort of done that.
Years ago I was in the basement of a 5-star hotel in South Africa, busiest time of the week, everyone was checking out, and I had to install a simple little Novell Netware to internet gateway machine, and there was one spare port on the power strip. Something shouted out in my head, "Don't put it in that one!", but I thought "The machine supplied tests fine, the cable is approved... what could possibly go..." *BLAM*, everything went down and took a few hours to get back up as the Netware "mirror" servers decided to argue about who comes up first. No idea why, something was wrong with the power strip in the rack I suppose.
Needless to say, I'd hate to be the poor chap who took down BA like that, might be a little hard finding work, unless it's retelling their story at a geek-comedy club.
I guess it cost too much to add monitoring and remote management.
So it was all running in a single DC with a single power bus? There's plenty of room at real datacenters; they need to stop running out of a closet somewhere.
No sir I dont like it.
That doesn't help if there is one master switch, in case of (for example) fire, and he activated it.
This is human error because a contractor accidentally turned off a power supply that caused a world-wide outage? It should be operational error for allowing such a single point of failure to exist.
No sure Bob - just flip it so that we can go get some lunch. I'm starving.
"Just kidding!"
I found the culprit: https://youtu.be/9WYGdstEVJQ?t...
More like an extension cord stretched across a busy walkway just waiting for someone to trip on it.
. . . . the power was turned off by a FORMER contractor.
Then again, BA probably promoted him to executive VP.. .
Human Error accounts for 99% of actual power outages in my experience. It's ALWAYS some idiot throwing the wrong switch, unplugging the wrong thing, yanking the wrong wires or spilling something in the wrong place...
You simply cannot engineer around stupid well enough to fix it, regardless of how hard you try..
That being said... For a mission critical system in a multi-million dollar company like BA where was the backup site in a different geographic location that was configured to take over in the not-so-uncommon event of an outage? I don't care if it WAS a human that messed up and turned everything off, you need a contingency plan to deal with such things. Why? Because outages WILL happen no matter how much engineering and resources you pile into your primary location.
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
When your business depends on your IT infrastructure like that, turning off the power to a single machine or data center shouldn't bring down your operation; that's just stupid and bad design. Good enterprise software provides resilience, automatic failover, and geographically distributed operations. Companies need to use that.
And they should actually have tests every few months where they do shut down parts of their infrastructure randomly.
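A minimal sketch of that kind of drill, assuming a toy model in which the service stays up as long as any site still has a healthy node. The site names, node names, and the `drill` helper are made up for illustration; nothing here reflects how BA's systems are actually laid out.

```python
import random


def service_up(healthy_by_site):
    """The service is up if at least one site still has a healthy node."""
    return any(nodes for nodes in healthy_by_site.values())


def drill(healthy_by_site, rng):
    """Pick one site at random, pretend it lost power, and report
    whether the service would survive on the remaining sites."""
    victim = rng.choice(sorted(healthy_by_site))
    remaining = {s: n for s, n in healthy_by_site.items() if s != victim}
    return victim, service_up(remaining)


if __name__ == "__main__":
    rng = random.Random()  # unseeded on purpose: the drill should be unpredictable
    layout = {"site_a": {"a1", "a2"}, "site_b": {"b1"}}  # hypothetical layout
    victim, survived = drill(layout, rng)
    print(f"shut down {victim}: service {'survived' if survived else 'DOWN'}")
```

A real drill would pull actual power or network instead of filtering a dictionary, but the pass/fail question is the same: does the service survive losing whichever site the dice picked?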
He gutted the knowledgeable staff and replaced with inexperienced outsourced help.
Incoming power would/should have been the first thing checked.
At the moment, there is no reasonable way to tell between various scenarios.
It could go all the way from 'worker pressed big red button despite being told not to, signs telling him not to, and having signed an agreement not to', to 'worker followed what they believed was procedure and did what 99% of people would have done', to 'worker did precisely as instructed and are being scapegoated'.
The first thing I think of is anything happening at that location - flood, bomb, larger grid outage lasting more than a day or so - and BA is finished.
Heck if you were a terrorist now you know exactly where to attack that would truly hose an entire company that brings in a lot of money (and people) to England...
"There is more worth loving than we have strength to love." - Brian Jay Stanley
of Johnny unplugging the extension cord from the wall and the lights on the runway going out. "Just kidding!"
https://datacenteroverlords.files.wordpress.com/2017/01/airplane.jpg
This is just one step up from the cleaner killing a patient because they unplugged the life support machine to vacuum in the room.
Pull the other one, it's got bells on it.
Summation 2
The more important question is why it took the best part of two days to get things up and running again.
As for the power outage - A UPS test to check if power transferred to battery/generator that failed maybe?
Switches such as that should be locked out, requiring multiple people to allow access.
If you have a switch like that accessible so that just anyone can flick it off, you are an idiot.
When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
Sounds like a load of baloney to me and really explains nothing. Sounds, in fact, like a cover up from someone who doesn't understand the implications of their lie.
It still doesn't explain why everything went down so catastrophically. Why was there only one power source? What about back up servers and other redundant systems? Why was it so easy for a contractor to switch the power off? Was he following procedure? What about redundancy? Why couldn't he just switch it back on again (I know, but if it's such a simple system that it doesn't need redundancy then surely switching it back on would fix it). What about redundancy?
At the end of the day, unless the contractor was working way outside allowed procedures - e.g. deliberately switching it off for a laugh - then the fault lies way over his head.
(I know I'm preaching to the converted here - it just grinds my gears)
Worker: The sign says "Do not use"
Manager: I don't care what it says, flip the switch
Worker: That's a really stupid idea
Manager: Do it, or you're fired
Worker:
Manager: Well, now you really screwed things up, you're fired!
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
I've worked in banking and real estate businesses where we had the luxury of being able to DR failover test things like redundant databases, WAN connections, power supplies...etc - knowing that if something failed we had time to put it back together - before the business and customers would notice the outage.
How does one actually fail-over test things in production in a 24/7 business - especially one that spans time zones all across the world?
Are lab simulations simply enough? I've never seen a lab environment that could truly replicate a production environment.
It was not working perfectly at all - there was a single point of failure, poor design with no redundancy responsible for critical infrastructure, clearly approved by senior management.
So no, it wasn't a contractor responsible for the outage. It was the CEO who did not ensure there were redundancies in place on critical infrastructure, business continuity was not tested and disaster recovery was not a thing.
'flick it off' may include 'opened the interlocks and keyed in the code as he believed he was doing the correct thing'.
This could be a personal failure due to stupidity, a training failure, or he was in fact instructed to turn it off, and though he protested, is now getting scapegoated.
:)
Will $CURRENT_YEAR be the year of the Linux Desktop?
It's good practice to make things so simple that no one could possibly mess them up. It works in programming - look at how many JavaScript frameworks abstract an already sandboxed development environment to a point where "signalling intent" is basically all the developer needs to do. Or in hardware -- we're using HPE servers and there is literally a "don't remove this drive" light that comes on when a drive fails in a RAID set. That had to be a customer-requested change after one too many data-loss events stemming from someone replacing the wrong disk.
But at some point, all that abstraction meets the real world and the man behind the curtain really does need full control of whatever system they're in charge of. A favorite example of mine is a project we're doing in Azure -- the developers have full faith in the magic box that will never fail and is so simple that we don't need to know how it works. Sure we might not need to know the exact implementation details, but it doesn't absolve you from knowing what is and is not possible in the realm of compute, network and storage combinations. I've dealt with support tickets where the Microsoft personnel are quite obviously looking in whatever monster Hyper-V / SCVMM console is controlling their back-end to solve something complex.
At the data center level, you can only idiot-proof so much. Some operations person is actually going to need to control the system directly at some point and have access to the Big Red Lever. You can put a million fail-safes in place to avoid routine problems, but when automatic processes fail, you need at least one smart person who knows _everything_ that could go wrong. When you outsource this function to the lowest bidder, don't expect to get a super-genius in that role. Your typical body shop outsourcer isn't going to pay people enough to stay on to learn the ins and outs of an environment.
That's their hard luck. But if it wasn't for accidents like this, they wouldn't have a business so they're not really in a position to complain.
We also know one other thing: no one up in management will accept responsibility. All upper managers will be shielded from personal responsibility so their reputations and wealth will be preserved. Even if they retire early, they will never see their retirement be reduced because they screwed up. It will always be the case that they will be able to go out and get new positions, often as overpaid consulting parasites.
Everyone else, employees, customers, stockholders will lose out, but the insiders will barely feel a bump.
Why is Snark Required?
No, it isn't.
Sometimes someone, despite proper training, management, and instruction does something that goes against all of that training, to the point that no reasonable person given the same instruction and training would have done the same thing.
In some cases you actually do need emergency global 'off' switches that are never meant to be used in normal operation.
If you don't you are probably in violation of the local fire codes. Though the 2011 updates to the NEC did remove the "shall be readily accessible at the principal exit door" language from the emergency power off requirements instead allowing "shall be located at approved locations readily accessible in case of fire to authorized personnel and emergency responders", so a bunch of jurisdictions will just be using that by now and not requiring the big red button at the exit... You can do away with it entirely if you check a bunch of other boxes.
Of course their data center probably isn't in the US in the first place, so they likely live under a whole different set of requirements for the off switch.
contractor: "so, I guess I'm pretty much done with this company right?"
CEO: "Not at all! We just spent 1 billion $ educating you!"
contractor in tears: "oh thank you"
CEO: "I was joking, dumbass. This is the real world. You're fired and we're going to sue you for 2 billion $".
As I've pointed out earlier, they should have been able to fail-over to another data center. So the fact that they didn't have these procedures and/or hadn't tested them is a management failure. The localized problem, though, should not be blamed on management.
What definitely needs to be done in-house is whatever your company is supposed to be good at. Ford designs and assembles cars - they shouldn't outsource the design and assembly of cars because that's what they DO - if they stop making cars, they are no longer doing anything and have no reason to exist. Ford is not in the business of making cleaning products, so they probably shouldn't make the cleaning products they use. They should outsource that, buying cleaning products from SC Johnson or someone. Ford is not in the business of cleaning carpets, so that's also a candidate for outsourcing.
Once you have a list of items that can be outsourced because they aren't your "core competencies", the "make or buy" decision becomes mostly a matter of arithmetic. For the same budget cost, will you get it done better by hiring people to do it, or by hiring a company to do it? Equivalently, for the same level of quality, does it cost less to pay in-house people to do it or to pay an outside source? Probably, you'll find that it's better to get an operating system from an outside source, not make your own.
While there is no hard and fast rule, a rule of thumb is to consider the company next door. If you could easily buy the same product or service from the same vendor that the company next door uses, and it would serve your purpose, you should probably do so. General purpose things like office supplies, office cleaning, and payroll services should be purchased, not manufactured in house, because there is no competitive advantage to be gained from having better office supplies than the other company.
Shit happens and most competent companies plan for it by having redundant live backup systems.
I can't believe that BA didn't have a live backup system at another site to fail over to.
Really, this costs money but these cheap bastards don't seem to have a clue.
I don't read your sig. Why are you reading mine?
Business critical systems should operate in an active/active high-availability scenario in at least two separate locations. That way the loss of any one node has zero effect except perhaps a transaction retry and reduced performance.
Systems of the next lower level of criticality should have real-time replication to a separate location, so that if a node fails the recovery time is simply what it takes to boot the replacement node.
At further lower levels of criticality you start getting into things like virtualization clusters to mitigate hardware failures, supported by point-in-time backups to mitigate data failures. The IT department's Minecraft server can just be a spare desktop machine sitting on an admin's desk.
(There are additional considerations for all levels of criticality too, of course, like SAN volume snapshots, and backups too of course.)
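For the top tier, "zero effect except perhaps a transaction retry" can be sketched roughly like this. It is a toy model only: `Node`, `NodeDown`, and `submit` are hypothetical names, and a real active/active setup also has to keep the data consistent between sites, which this ignores entirely.

```python
class NodeDown(Exception):
    """Raised when a node cannot take the transaction."""


class Node:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def handle(self, txn):
        if not self.up:
            raise NodeDown(self.name)
        return f"{txn} committed on {self.name}"


def submit(txn, nodes):
    """Try each active node in turn. A dead node costs one retry,
    not an outage; only losing every node is fatal."""
    attempts = 0
    for node in nodes:
        attempts += 1
        try:
            return node.handle(txn), attempts
        except NodeDown:
            continue  # fall through to the next site
    raise RuntimeError("all nodes down")


if __name__ == "__main__":
    # site_a has just lost power; site_b is still serving.
    nodes = [Node("site_a", up=False), Node("site_b")]
    print(submit("booking-1", nodes))
```

With one of the two sites dark, the caller sees a second attempt and a successful commit, which is the "retry and reduced performance" case rather than a weekend of stranded passengers.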
The janitor just tripped over the extension cord?
“He’s not deformed, he’s just drunk!”
Should be: "British Airways IT outage caused by FORMER CONTRACTOR who accidentally switched off the power".
... setup if their entire order processing can be turned off by a single guy.
I wouldn't even feel guilty if this happened to me. I'd just be surprised and say "Whooops ... guess that was the wrong switch/command/ansible script/whatever procedure."
We suffer more in our imagination than in reality. - Seneca
> In some cases you actually do need emergency global 'off' switches that are never meant to be used in normal operation.
Yes, if you run a simple experiment, and there is the possibility for harm, a single red button is a good idea.
But if shutting down the server room costs $100 000 000, then a single red button is not a good idea. Instead, you have two parallel power distribution systems, with some physical separation, and there are two off switches. Of course there should be a sign that explains how to use the switch, and I guess that is where this story eventually leads.
(Hopefully) an honest, albeit very consequential mistake. I've done the same thing when I was working on the backside of a server cabinet - the PDU was right there by my shoulder and I swiped it on accident. No UPS in the cabinet (a mistake not of my own but the ones who built it out). Fortunately everything came back on. Good thing to have BIOS settings to 'stay off' after a power failure (so you can turn them back on individually and not overdraw power). I feel bad for the guy who did this, it was probably his last day working there.
It is pitch black. You are likely to be eaten by a grue.
I would not be too worried.
a) It is not clear that the contractor is the only person to blame.
b) Maybe there is some small print that was violated.
c) Even if they have to pay, there is probably a limit of 5 Million...
A few years ago a sys admin at Boeing's main site in Washington flipped a main power switch ("the big red button"). He wanted to restart the network hardware for the machines in that server closet, to solve a network issue (not a shutdown). He had no idea it was a single point of failure, a doomsday switch (when in doubt, ask more than one person!). The entire system went down, and took 24+ hours to restart, effectively shutting down Boeing's production of airplanes for a day (manufacturing typically requires lots of servers for automation, etc.). Ouch.
But Nike's famous flameout was much worse. Several years ago they replaced their ERP system (basically, it analyzes sales to keep their factories making the right products well in advance of need, so availability meets demand). Despite many red flags, the head of Nike had the ERP company deploy the new system in an absurdly short time, not in a proof-of-concept or limited deployment, nor an A-B comparison with the legacy, but instead global. While the new system worked, it had never been tested at scale, and it turned out it couldn't handle a serious load. Worse yet, it wasn't obvious that it wasn't handling the load. The effect was that the system lost track of the Nike products that were selling the most, e.g., Air Jordan shoes, so continued manufacturing wasn't triggered for the most popular products. Meanwhile, products that barely sold at all continued to register in the system, and since Nike had accidentally left some legacy triggers in place, unpopular products were manufactured double the needed amount. Months later, as stores ran out of Air Jordans and similar, it turned out none were being made, and couldn't be available for several more months. But stores were being shipped products that nobody wanted to buy. In a short time, Nike lost at least $100M, and nearly went bankrupt (in revenge, Nike bankrupted the company that did the new ERP system, despite the fact that the company had very clearly told Nike the short timelines were impossible to meet). Very recently, Nike has finally replaced their legacy ERP systems with best-of-breed software (e.g., based on JustEnough) that is tested to death for both accuracy and scalability. E.g., their unit testing has code coverage of close to 100% (I know, because I spent more time writing tests than services there). And they have a huge infrastructure team that leverages AWS scalability (Lambda and similar) to the extreme.
Of course. I am just waiting for the statement by BA that it was "not their fault", and they are therefore not paying compensation to passengers...
In our secure rooms, we have an EPO button. It's LARGE, red, and inside a cover that you have to lift to hit it.
And this contractor turned off the *entire* power for an *entire* datacenter? Yep, yep, not our fault, not your fault, it's gotta be the fault of that guy over there pushin' a broom!
True story.
In the late 90's I worked at a small startup and was the main IT guy. Each night we had to send out large files, this is back around 98 or so when a 256k bonded business class ISDN or something like that cost us about $1,000 a month. So, this thing needed to be sending data all night long.
I kept having to go back into work because for some strange reason the line would sometimes go down, only after hours, and the crappy old software we were forced to use by the client for the uploads would just fail. I had to manually restart the file transfer.
This happened about once a week for months.
We then got a client that needed better security. So, among other things from the audit we did, we got an electronic lock for the server room door.
Week or two later without any failures my boss stops by with a guy I did not recognize.
"Hi, this is Bob, he is the manager of the cleaning company and he says his workers need a key for something?" Fun conversation..
The cleaning lady was ignoring (or unable to read) the signs saying keep out and such, was going up the ramp, around the server racks, over next to the network rack by the wall, unplugging the network power cord, and then proceeding to vacuum the spotless room I thought no one but me ever went into... Then plugging it back in when done.
My fault for not locking it obviously.
Have they never heard of multiple servers with the ability to handle server down events for one machine?
-- Tigger warning: This post may contain tiggers! --
Sure, that may have been the proverbial last drop. But the actual root-cause is that their systems were not able to cope with outages that must be expected. And the responsibility for that is straight with top management. Their utterly dishonest smoke-screen is just more proof that they should be removed immediately for gross incompetence.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
... I was walking behind the server rack and unknowingly brushed up against the power cord to the Novell 3.1 server.
Later, when my boss asked me for an outage report, I told her, "I wish you hadn't asked that."
I made damned sure that plug was tied to the server after that.
It little behooves the best of us to comment on the rest of us.
DR tests are expensive and may show the DR site does not work, making them even more expensive. The bonus for upper management is obviously far more important.
He will not have to. An outage of this magnitude from a single cause like this can only happen if gross negligence was rampant on the other side.
I'm Navy trained, and every fail-critical system should be designed with the assumption that the greatest threat is the incompetence of your own employees. No single switch should be able to collapse a critical system. The contractor who physically pulled the plug is not the responsible party. The engineers who designed a power system without single-fault tolerance and/or the managers who implemented inadequate supervision, training, and procedural compliance are responsible.
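One rough way to check "no single switch should be able to collapse the system" on paper, assuming you can list every path by which power reaches the critical load: any component that sits on all of the paths is a single point of failure. The component names and the `single_points_of_failure` helper below are hypothetical.

```python
def single_points_of_failure(paths):
    """paths: list of sets of component names. The load is powered if
    every component on at least one path is on, so a component that
    appears on every path takes the load down by itself."""
    if not paths:
        return set()
    return set.intersection(*(set(p) for p in paths))


if __name__ == "__main__":
    # Two UPS units, but both hang off the same master switch.
    shared = [{"main_switch", "ups_a"}, {"main_switch", "ups_b"}]
    print(single_points_of_failure(shared))

    # Two fully separate feeds: no single component kills both.
    independent = [{"feed_a", "ups_a"}, {"feed_b", "ups_b"}]
    print(single_points_of_failure(independent))
```

The first layout looks redundant on a slide and still reports `main_switch`; the second reports nothing, which is what single-fault tolerance is supposed to mean.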
They are life safety equipment, dumbass.
A data center with backup generators, automatic transfer switches and the whole nine yards goes down hard during routine testing of equipment and stays that way for the better part of a day because it couldn't handle inrush currents. Panic-turning on the little switch you turned off not only didn't work but caused physical damage.
What if power supplies had a current sensor with a random timer to flip a relay and avoid inrush synchronization, or what if the power system was designed appropriately to deal with the problem in the first place? What if there was an off-site backup system?
Saying a contractor caused it is like saying a customer withdrawing $0 from an ATM caused a bank's transaction system to crash. MULTIPLE design failures on many scales CAUSED this outage.
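The random-timer idea can be sketched as follows, assuming a crude model where each supply draws a fixed surge for a fixed time after its relay closes. The 40 A surge, 10 ms duration, and 10 s window are made-up illustrative numbers, not measurements from any real PSU.

```python
import random


def peak_inrush(start_times, inrush_amps, inrush_seconds):
    """Peak combined inrush current, given each supply's power-on time."""
    events = []
    for t in start_times:
        events.append((t, 1))                    # surge begins
        events.append((t + inrush_seconds, -1))  # surge ends
    # At equal timestamps (t, -1) sorts before (t, 1), so a surge that
    # ends exactly when another begins is not counted as overlapping.
    events.sort()
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak * inrush_amps


def staggered_starts(count, window_seconds, rng):
    """Give each supply an independent random delay within the window."""
    return [rng.uniform(0, window_seconds) for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random()
    together = peak_inrush([0.0] * 100, 40.0, 0.01)
    spread = peak_inrush(staggered_starts(100, 10.0, rng), 40.0, 0.01)
    print(f"all at once: {together} A, staggered over 10 s: {spread} A")
```

With everything switched on at the same instant the surges stack into one spike; spread over a window much longer than the surge itself, only a handful overlap at any moment.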
there really should be a shield cover over The Big Red Button so prevalent in data centers at the door. the damn thing always scared me, I never got within a foot of the bugger. always felt safer leaning on the Halon tank.
if this is supposed to be a new economy, how come they still want my old fashioned money?
As I'm sure others have posted, IT SHOULD NOT HAVE MATTERED!!
The fact that there was no redundant system anyway: fail!
The fact that turning it on again did not restore service: fail!
We can all laugh at the clown that turned off the power supply, but c'mon, we all know that this wasn't the *true* problem here!
No sane person who operates critical infrastructure goes without a backup system and built-in redundancy. Also, you should not be able to switch off power in a computing facility, or even a single rack in there, without proper permission.
In addition to having no backup system, they also did not have an emergency plan. Maybe they are both.
I mean I don't have any reason to doubt it either, just seems convenient that a dude named Ben just happens to get the blame...
"UNIX is very simple, it just needs a genius to understand its simplicity." -Dennis Ritchie
It was triggered by the contractor.
The cause is in the system design and testing that allowed that trigger to cause too much pain.
Example:
If you fired the CFO based on capriciousness and lack of understanding of what a CFO does you don't get to dodge responsibility saying the delay in getting out the financial quarterlies is because someone didn't order enough paper...
As a CEO, especially in 2017, you should know better than to trust outsourcing solely based on a sales pitch and perhaps a free lunch. These systems are delicate, and frankly even a midlevel IT position takes 2-3 months to get up to speed. If you are taking over a large scale organization, in essence you are in a de facto freeze for 6-9 months depending on how much turnover there is. You can't ITILv3 your way out of the complexities of these systems...