British Airways IT Outage Caused By Contractor Who Accidentally Switched off Power (independent.ie)
An anonymous reader shares a report: A contractor doing maintenance work at a British Airways data centre inadvertently switched off the power supply, knocking out the airline's computer systems and leaving 75,000 people stranded last weekend, according to reports. A BA source told The Times the power supply unit that sparked the IT failure was working perfectly but was accidentally shut down by a worker.
...turning it on again?
Because I'm having déjà vu.
#DeleteFacebook
Seems like this 'test' to see if the UPS would kick in didn't work.
So the CEO _should_ resign after all.
How is this a thing??
I run a small company (i.e. less than 100 people) and have redundant power to every server and switch on different physical breakers.
they didn't just switch over to their DR site.
Floor got cleaned cheaply and everyone got home early. Long live outsourcing!
Of course I didn't RTFA! With respect to outsourcing, there's no difference between daily tasks like cleaning and strategic tasks like planning. Both need to be done in the short and long term. I can understand outsourcing occasional tasks, but daily and strategic work will always be needed. Outsourcing those tasks is a sign of utterly bad management.
I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
Been there, sort of done that.
Years ago I was in the basement of a 5-star hotel in South Africa, busiest time of the week, everyone was checking out, and I had to install a simple little Novell Netware to internet gateway machine, and there was one spare port on the power strip. Something shouted out in my head, "Don't put it in that one!", but I thought "The machine supplied tests fine, the cable is approved... what could possibly go..." *BLAM*, everything went down and took a few hours to get back up as the Netware "mirror" servers decided to argue about who comes up first. No idea why, something was wrong with the power strip in the rack I suppose.
Needless to say, I'd hate to be the poor chap who took down BA like that, might be a little hard finding work, unless it's retelling their story at a geek-comedy club.
I guess it cost too much to add monitoring and remote management.
So it was all running in a single DC with a single power bus? There is plenty of room at real datacenters; they need to stop running out of a closet somewhere.
No sir, I don't like it.
This is human error because a contractor accidentally turned off a power supply that caused a world-wide outage? It should be operational error for allowing such a single-point of failure to exist.
No sure Bob - just flip it so that we can go get some lunch. I'm starving.
"Just kidding!"
I found the culprit: https://youtu.be/9WYGdstEVJQ?t...
. . . . the power was turned off by a FORMER contractor.
Then again, BA probably promoted him to executive VP...
Human Error accounts for 99% of actual power outages in my experience. It's ALWAYS some idiot throwing the wrong switch, unplugging the wrong thing, yanking the wrong wires or spilling something in the wrong place...
You simply cannot engineer around stupid well enough to fix it, regardless of how hard you try..
That being said... For a mission critical system in a multi-million dollar company like BA where was the backup site in a different geographic location that was configured to take over in the not-so-uncommon event of an outage? I don't care if it WAS a human that messed up and turned everything off, you need a contingency plan to deal with such things. Why? Because outages WILL happen no matter how much engineering and resources you pile into your primary location.
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
When your business depends on your IT infrastructure like that, turning off the power to a single machine or data center shouldn't bring down your operation; that's just stupid and bad design. Good enterprise software provides resilience, automatic failover, and geographically distributed operations. Companies need to use that.
And they should actually have tests every few months where they do shut down parts of their infrastructure randomly.
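The random-shutdown testing suggested above can be sketched roughly as follows. This is only an illustration, not how BA or any particular tool does it: the node names, `service_available` and `chaos_round` are all hypothetical, and "availability" is simplified to "at least one node is still up".

```python
import random

def service_available(nodes):
    """The service counts as up while at least one node is still running."""
    return any(nodes.values())

def chaos_round(nodes, rng):
    """Shut down one randomly chosen running node; report the victim and whether service survived."""
    running = [name for name, up in nodes.items() if up]
    victim = rng.choice(running)
    nodes[victim] = False  # simulate pulling the plug on that node
    return victim, service_available(nodes)

# Hypothetical two-site deployment; the names are illustrative only.
nodes = {"site-a": True, "site-b": True}
victim, survived = chaos_round(nodes, random.Random())
print(f"shut down {victim}; service still available: {survived}")
```

With a single node in the dictionary the same round reports that the service did not survive, which is the single-point-of-failure case this thread is complaining about.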
The contractor's insurer is not going to be happy about having to pay out for the losses caused.
Shit happens
He gutted the knowledgeable staff and replaced with inexperienced outsourced help.
Incoming power would/should have been the first thing checked.
The first thing I think of is anything happening at that location - flood, bomb, larger grid outage lasting more than a day or so - and BA is finished.
Heck if you were a terrorist now you know exactly where to attack that would truly hose an entire company that brings in a lot of money (and people) to England...
"There is more worth loving than we have strength to love." - Brian Jay Stanley
of Johnny unplugging the extension cord from the wall and the lights on the runway going out. "Just kidding!"
https://datacenteroverlords.files.wordpress.com/2017/01/airplane.jpg
Yeah, the verdict is in, and still let Indian IT scumbags to go hell!
Where do these nasty feelings come from?
This is just one step up from the cleaner killing a patient because they unplugged the life support machine to vacuum in the room.
Pull the other one, it's got bells on it.
Summation 2
Regardless of all the things done, it is a design flaw that a single point of failure can bring down the entire system.
I've been at two places where a DC lost power (and AC in a third case). We were down for about a minute for failover to the DR site. What's wrong with BA?
The more important question is why it took the best part of two days to get things up and running again.
As for the power outage - A UPS test to check if power transferred to battery/generator that failed maybe?
Sounds like a load of baloney to me and really explains nothing. Sounds, in fact, like a cover up from someone who doesn't understand the implications of their lie.
It still doesn't explain why everything went down so catastrophically. Why was there only one power source? What about backup servers and other redundant systems? Why was it so easy for a contractor to switch the power off? Was he following procedure? What about redundancy? Why couldn't he just switch it back on again (I know, but if it's such a simple system that it doesn't need redundancy then surely switching it back on would fix it). What about redundancy?
At the end of the day, unless the contractor was working way outside allowed procedures - e.g. deliberately switching it off for a laugh - then the fault lies way over his head.
(I know I'm preaching to the converted here - it just grinds my gears)
We had a shut off switch in our data center that also got "accidentally" turned off.
The switch had a lock hole, but we didn't want to padlock it in case there was an emergency need to shutdown quickly (e.g. water or fire hazards).
So instead, we put a carabiner through the hole, with a note attached describing the purpose and a number to call for questions on removing the carabiner.
I've worked in banking and real estate businesses where we had the luxury of being able to DR failover test things like redundant databases, WAN connections, power supplies...etc - knowing that if something failed we had time to put it back together - before the business and customers would notice the outage.
How does one actually fail-over test things in production in a 24/7 business - especially one that spans time zones all across the world?
Are lab simulations simply enough? I've never seen a lab environment that could truly replicate a production environment.
It was not working perfectly at all - there was a single point of failure, poor design with no redundancy responsible for critical infrastructure, clearly approved by senior management.
So no, it wasn't a contractor responsible for the outage. It was the CEO who did not ensure there were redundancies in place on critical infrastructure; business continuity was not tested and disaster recovery was not a thing.
:)
Will $CURRENT_YEAR be the year of the Linux Desktop?
It's good practice to make things so simple that no one could possibly mess them up. It works in programming - look at how many JavaScript frameworks abstract an already sandboxed development environment to a point where "signalling intent" is basically all the developer needs to do. Or in hardware -- we're using HPE servers and there is literally a "don't remove this drive" light that comes on when a drive fails in a RAID set. That had to be a customer-requested change after one too many data-loss events stemming from someone replacing the wrong disk.
But at some point, all that abstraction meets the real world and the man behind the curtain really does need full control of whatever system they're in charge of. A favorite example of mine is a project we're doing in Azure -- the developers have full faith in the magic box that will never fail and is so simple that we don't need to know how it works. Sure we might not need to know the exact implementation details, but it doesn't absolve you from knowing what is and is not possible in the realm of compute, network and storage combinations. I've dealt with support tickets where the Microsoft personnel are quite obviously looking in whatever monster Hyper-V / SCVMM console is controlling their back-end to solve something complex.
At the data center level, you can only idiot-proof so much. Some operations person is actually going to need to control the system directly at some point and have access to the Big Red Lever. You can put a million fail-safes in place to avoid routine problems, but when automatic processes fail, you need at least one smart person who knows _everything_ that could go wrong. When you outsource this function to the lowest bidder, don't expect to get a super-genius in that role. Your typical body shop outsourcer isn't going to pay people enough to stay on to learn the ins and outs of an environment.
Love how the blame has shifted on this one as time goes along:
1 - The corporate policies of outsourcing and hiring unskilled workers are to blame: CEO and higher ups responsible.
2 - A power outage with the data center is to blame: We couldn't have planned for this.
3 - It was THIS GUY'S fault!
contractor: "so, I guess I'm pretty much done with this company right?"
CEO: "Not at all! We just spent 1 billion $ educating you!"
contractor in tears: "oh thank you"
CEO: "I was joking, dumbass. This is the real world. You're fired and we're going to sue you for 2 billion $".
What definitely needs to be done in-house is whatever your company is supposed to be good at. Ford designs and assembles cars - they shouldn't outsource the design and assembly of cars because that's what they DO - if they stop making cars, they are no longer doing anything and have no reason to exist. Ford is not in the business of making cleaning products, so they probably shouldn't make the cleaning products they use. They should outsource that, buying cleaning products from SC Johnson or someone. Ford is not in the business of cleaning carpets, so that's also a candidate for outsourcing.
Once you have a list of items that can be outsourced because they aren't your "core competencies", the "make or buy" decision becomes mostly a matter of arithmetic. For the same budget cost, will you get it done better by hiring people to do it, or by hiring a company to do it? Equivalently, for the same level of quality, does it cost less to pay in-house people to do it or to pay an outside source? Probably, you'll find that it's better to get an operating system from an outside source, not make your own.
While there is no hard and fast rule, a rule of thumb is to consider the company next door. If you could easily buy the same product or service from the same vendor that the company next door uses, and it would serve your purpose, you should probably do so. General purpose things like office supplies, office cleaning, and payroll services should be purchased, not produced in house, because there is no competitive advantage to be gained from having better office supplies than the other company.
Shit happens and most competent companies plan for it by having redundant live backup systems.
I can't believe that BA didn't have a live backup system at another site to fail over to.
Really, this costs money but these cheap bastards don't seem to have a clue.
I don't read your sig. Why are you reading mine?
Business critical systems should operate in an active/active high-availability scenario in at least two separate locations. That way the loss of any one node has zero effect except perhaps a transaction retry and reduced performance.
Systems of the next lower level of criticality should have real-time replication to a separate location, so that if a node fails the recovery time is simply what it takes to boot the replacement node.
At further lower levels of criticality you start getting into things like virtualization clusters to mitigate hardware failures, supported by point-in-time backups to mitigate data failures. The IT department's Minecraft server can just be a spare desktop machine sitting on an admin's desk.
(There are additional considerations for all levels of criticality too, of course, like SAN volume snapshots and backups.)
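The tiering described above could be written down as a simple lookup, so every system gets an explicit criticality level and a matching recovery approach. This is a rough sketch; the tier names (`BUSINESS_CRITICAL`, `IMPORTANT`, `STANDARD`, `DISPOSABLE`) and `strategy_for` are made up for illustration, and only the strategy descriptions come from the comment.

```python
from enum import Enum

class Tier(Enum):
    """Hypothetical criticality levels, most critical first."""
    BUSINESS_CRITICAL = 1
    IMPORTANT = 2
    STANDARD = 3
    DISPOSABLE = 4

# Recovery approach per tier, paraphrasing the levels described above.
STRATEGY = {
    Tier.BUSINESS_CRITICAL: "active/active high availability in at least two locations",
    Tier.IMPORTANT: "real-time replication to a separate location",
    Tier.STANDARD: "virtualization cluster plus point-in-time backups",
    Tier.DISPOSABLE: "single spare machine, no redundancy",
}

def strategy_for(tier):
    """Look up the recovery approach assigned to a criticality tier."""
    return STRATEGY[tier]

print(strategy_for(Tier.BUSINESS_CRITICAL))
```

The point of spelling it out is that a booking or check-in system ending up in the bottom tier becomes a visible decision rather than an accident.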
Blame it on the contractor, not the solutions architects who didn't properly plan for disaster recovery. Or the product owners who decided that DR was too expensive.
Is the datacenter running on a raspberry pi? How can you just turn off the power on a datacenter. Is it like just running on a raspberry pi that someone needed to use the power supply to charge their phone? Sheeesh
The janitor just tripped over the extension cord?
“He’s not deformed, he’s just drunk!”
Unlikely - suspect a lie - something else also happened ..
All my cleaners had security vets and 2 hour induction courses - no exceptions.
Don't unplug anything, or push any buttons, and no mobile phones or Coke.
Do not pull up floor tiles - do not move the backup trolley.
Every datacentre has strict protocols and passes as to who can do what.
There is no 'accidental' off switch - none. The AirCon and Chillers and distribution box are all armed and locked. BTW PDP-11's had flawless power fail recovery - Digital got it right.
#1 hazard is water leaks and burst piping
#2 is Diesel generator not starting up - how to prime it?
#3 Is false fire alarm, oddly caused by fog
#4 is non-redundant DNS or failover equipment
#5 is locked out the building - passcard box failed
#6 Bored employee or BOFH rebooting a server because it creates excitement - on ICL's and 360's the PSW reset button worked a treat.
#7 Loose cricket or soccer ball on the operations room floor at velocity hitting expensive equipment.
#8 Scheduled power company outage (roadworks) that the ops manager was not on the list - or the cleaner/mailman threw out that letter.
There can be an EMERGENCY switch, but they are covered in mesh to prevent a broomstick whack, because HALON gas tanks with explosive bolts can be expensive.
So, is this datacentre in India - or did the cleaner not have any induction?
An angry cleaner is plausible - but this leads on to other delayed concerns, like demagnetizing backup tapes, or switching drive label numbers around
Power back on, everything normal. The exception to the rule is MS boxes just after security or AV updates. But that is covered - surely.
Should be: "British Airways IT outage caused by FORMER CONTRACTOR who accidentally switched off the power".
... setup if their entire order processing can be turned off by a single guy.
I wouldn't even feel guilty if this happened to me. I'd just be surprised and say "Whooops ... guess that was the wrong switch/command/ansible script/whatever procedure."
We suffer more in our imagination than in reality. - Seneca
(Hopefully) an honest, albeit very consequential mistake. I've done the same thing when I was working on the backside of a server cabinet - the PDU was right there by my shoulder and I swiped it on accident. No UPS in the cabinet (a mistake not of my own but the ones who built it out). Fortunately everything came back on. Good thing to have BIOS settings to 'stay off' after a power failure (so you can turn them back on individually and not overdraw power). I feel bad for the guy who did this, it was probably his last day working there.
It is pitch black. You are likely to be eaten by a grue.
Is this true?
Yes. This man, has no dick.
A few years ago a sys admin at Boeing's main site in Washington flipped a main power switch ("the big red button"). He wanted to restart the network hardware for the machines in that server closet, to solve a network issue (not a shutdown). He had no idea it was a single point of failure, a doomsday switch (when in doubt, ask more than one person!). The entire system went down, and took 24+ hours to restart, effectively shutting down Boeing's production of airplanes for a day (manufacturing typically requires lots of servers for automation, etc.). Ouch.
But Nike's famous flameout was much worse. Several years ago they replaced their ERP system (basically, it analyzes sales to keep their factories making the right products well in advance of need, so availability meets demand). Despite many red flags, the head of Nike had the ERP company deploy the new system in an absurdly short time, not in a proof-of-concept or limited deployment, nor an A-B comparison with the legacy, but instead global. While the new system worked, it had never been tested at scale, and it turned out it couldn't handle a serious load. Worse yet, it wasn't obvious that it wasn't handling the load. The effect was that the system lost track of the Nike products that were selling the most, e.g., Air Jordan shoes, so continued manufacturing wasn't triggered for the most popular products. Meanwhile, products that barely sold at all continued to register in the system, and since Nike had accidentally left some legacy triggers in place, unpopular products were manufactured double the needed amount. Months later, as stores ran out of Air Jordans and similar, it turned out none were being made, and couldn't be available for several more months. But stores were being shipped products that nobody wanted to buy. In a short time, Nike lost at least $100M, and nearly went bankrupt (in revenge, Nike bankrupted the company that did the new ERP system, despite the fact that the company had very clearly told Nike the short timelines were impossible to meet). Very recently, Nike has finally replaced their legacy ERP systems with best-of-breed software (e.g., based on JustEnough) that is tested to death for both accuracy and scalability. E.g., their unit testing has code coverage of close to 100% (I know, because I spent more time writing tests than services there). And they have a huge infrastructure team that leverages AWS scalability (Lambda and similar) to the extreme.
In our secure rooms, we have an EPO button. It's LARGE, red, and inside a cover that you have to lift to hit.
And this contractor turned off the *entire* power for an *entire* datacenter? Yep, yep, not our fault, not your fault, it's gotta be the fault of that guy over there pushin' a broom!
True story.
In the late 90's I worked at a small startup and was the main IT guy. Each night we had to send out large files, this is back around 98 or so when a 256k bonded business class ISDN or something like that cost us about $1,000 a month. So, this thing needed to be sending data all night long.
I kept having to go back into work because for some strange reason the line would sometimes go down, only after hours, and the crappy old software we were forced to use by the client for the uploads would just fail. I had to manually restart the file transfer.
This happened about once a week for months.
We then got a client that needed better security. So, among other things from the audit we did, we got an electronic lock for the server room door.
Week or two later without any failures my boss stops by with a guy I did not recognize.
"Hi, this is Bob, he is the manager of the cleaning company and he says his workers need a key for something?" Fun conversation..
The cleaning lady was ignoring (or unable to read) the signs saying keep out and such, was going up the ramp, around the server racks, over next to the network rack by the wall, unplugging the network power cord, and then proceeding to vacuum the spotless room I thought no one but me ever went into... Then plugging it back in when done.
My fault for not locking it obviously.
Have they never heard of multiple servers with the ability to handle server down events for one machine?
-- Tigger warning: This post may contain tiggers! --
Sure, that may have been the proverbial last drop. But the actual root-cause is that their systems were not able to cope with outages that must be expected. And the responsibility for that is straight with top management. Their utterly dishonest smoke-screen is just more proof that they should be removed immediately for gross incompetence.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
... I was walking behind the server rack and unknowingly brushed up against the power cord to the Novell 3.1 server.
Later, when my boss asked me for an outage report, I told her, "I wish you hadn't asked that."
I made damned sure that plug was tied to the server after that.
It little behooves the best of us to comment on the rest of us.
I'm Navy trained, and every fail-critical system should be designed with the assumption that the greatest threat is the incompetence of your own employees. No single switch should be able to collapse a critical system. The contractor who physically pulled the plug is not the responsible agent. The engineers who designed a power system without single-fault tolerance and/or the managers who implemented inadequate supervision, training, and procedural compliance are responsible.
A data center with backup generators, automatic transfer switches and the whole nine yards goes down hard during routine testing of equipment and stays that way for the better part of a day because it couldn't handle inrush currents. Turning the little switch you turned off back on in a panic not only didn't work but caused physical damage.
What if power supplies had a current sensor with a random timer to flip a relay and avoid inrush synchronization, or what if the power system was designed appropriately to deal with the problem in the first place? What if there was an off-site backup system?
Saying a contractor caused it is like saying a customer withdrawing $0 from an ATM caused a bank's transaction system to crash. MULTIPLE design failures on many scales CAUSED this outage.
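The random-timer idea above (delay each supply's relay by a random amount so the inrush spikes don't all land at once) can be illustrated with a small simulation. This is a sketch under simplifying assumptions: each supply's inrush is modelled as a fixed-length window, and the function names, the 100-supply count, the 10 s spread and the 0.1 s window are all invented for the example, not taken from real hardware.

```python
import random

def staggered_start_delays(n_supplies, max_delay_s, rng):
    """Give each supply an independent random power-on delay in [0, max_delay_s]."""
    return [rng.uniform(0.0, max_delay_s) for _ in range(n_supplies)]

def peak_simultaneous_starts(delays, inrush_window_s):
    """Largest number of supplies whose inrush windows overlap at any instant."""
    starts = sorted(delays)
    peak = 0
    left = 0
    for right, t in enumerate(starts):
        # Drop supplies whose inrush window has already ended by time t.
        while t - starts[left] >= inrush_window_s:
            left += 1
        peak = max(peak, right - left + 1)
    return peak

# Illustrative numbers only: 100 supplies, 0.1 s inrush window.
everything_at_once = [0.0] * 100
spread_out = staggered_start_delays(100, 10.0, random.Random())
print("no stagger:", peak_simultaneous_starts(everything_at_once, 0.1))
print("random stagger:", peak_simultaneous_starts(spread_out, 0.1))
```

Without the stagger every supply draws its inrush current in the same window; with random delays only a handful overlap, which is the whole point of the relay timer.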
there really should be a shield cover over The Big Red Button so prevalent in data centers at the door. the damn thing always scared me, I never got within a foot of the bugger. always felt safer leaning on the Halon tank.
if this is supposed to be a new economy, how come they still want my old fashioned money?
My fat ass has bumped the power switch on PDUs more than once trying to squeeze into tight spaces between racks.
Always mount your PDUs at the top of the rack. Or at least buy the ones that have the cover over the switch.
As I'm sure others have posted, IT SHOULD NOT HAVE MATTERED!!
The fact that there was no redundant system anyway: fail!
The fact that turning it on again did not restore service: fail!
We can all laugh at the clown that turned off the power supply, but c'mon, we all know that this wasn't the *true* problem here!
No sane person operating critical infrastructure goes without a backup system and built-in redundancy. Also, you cannot switch off power in a computing facility, or even a single rack in it, without proper permission.
In addition to having no backup system, they also did not have an emergency plan. Maybe they are both.
I mean I don't have any reason to doubt it either, just seems convenient that a dude named Ben just happens to get the blame...
"UNIX is very simple, it just needs a genius to understand its simplicity." -Dennis Ritchie
It was triggered by the contractor.
The cause is in the system design and testing that allowed that trigger to cause too much pain.
Funny,
Reminds me of that song by Prince and the New Power Generation
Or no power to the travelling people
Example:
If you fired the CFO based on capriciousness and lack of understanding of what a CFO does you don't get to dodge responsibility saying the delay in getting out the financial quarterlies is because someone didn't order enough paper...
As a CEO, especially in 2017, you should know better than to trust outsourcing solely based on a sales pitch and perhaps a free lunch. These systems are delicate, and frankly even a midlevel IT position takes 2-3 months to get up to speed. If you are taking over a large scale organization, in essence you are in a de facto freeze for 6-9 months depending on how much turnover there is. You can't ITILv3 your way out of the complexities of these systems...
eom