British Airways IT Outage Caused By Contractor Who Accidentally Switched off Power (independent.ie)
An anonymous reader shares a report: A contractor doing maintenance work at a British Airways data centre inadvertently switched off the power supply, knocking out the airline's computer systems and leaving 75,000 people stranded last weekend, according to reports. A BA source told The Times the power supply unit that sparked the IT failure was working perfectly but was accidentally shut down by a worker.
...turning it on again?
Floor got cleaned cheaply and everyone got home early. Long live outsourcing!
Of course I didn't RTFA! With respect to outsourcing, there's no difference between daily tasks like cleaning and strategic tasks like planning. Both need to be done, short and long term. I can understand outsourcing occasional tasks, but daily and strategic stuff will always be needed. Outsourcing those tasks is a sign of utterly bad management.
I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
The new article has more details.
they didn't just switch over to their DR site.
You forgot the mic drop.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
So it was all running in a single DC with a single power bus? There's plenty of room at real datacenters; they need to stop running out of a closet somewhere.
No sir, I don't like it.
Right. It's not the poor guy that turned off the power supply. It's the shit-for-brains managers who wouldn't let the engineers put in redundant power supplies and hired cheap labour that had no clue how to architect for fault tolerance.
This is human error because a contractor accidentally turned off a power supply that caused a world-wide outage? It should be operational error for allowing such a single point of failure to exist.
No sure Bob - just flip it so that we can go get some lunch. I'm starving.
"Just kidding!"
When your business depends on your IT infrastructure like that, turning off the power to a single machine or data center shouldn't bring down your operation; that's just stupid and bad design. Good enterprise software provides resilience, automatic failover, and geographically distributed operations. Companies need to use that.
And they should actually have tests every few months where they do shut down parts of their infrastructure randomly.
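For what it's worth, the kind of scheduled "pull the plug" test described above can be sketched as a toy simulation. This is only a minimal sketch: the `Node` class, the site names, and the "service is up if any node is up" rule are all assumptions made for illustration, not anything BA or any particular product actually does.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    site: str  # hypothetical site label, e.g. "site_a"
    up: bool = True


def service_available(nodes):
    # Assumption: the service counts as available if at least one node is up.
    return any(n.up for n in nodes)


def run_drill(nodes, rng):
    """Take one randomly chosen site offline, check the service, then restore it."""
    sites = sorted({n.site for n in nodes})
    target = rng.choice(sites)
    affected = [n for n in nodes if n.site == target and n.up]
    for n in affected:
        n.up = False
    try:
        survived = service_available(nodes)
    finally:
        # Always bring the site back, even if the check itself blows up.
        for n in affected:
            n.up = True
    return target, survived


if __name__ == "__main__":
    fleet = [Node("db1", "site_a"), Node("db2", "site_b")]
    site, ok = run_drill(fleet, random.Random())
    print(f"took down {site}: service {'survived' if ok else 'went dark'}")
```

With everything in a single site, the drill reports that the service went dark, which is exactly the finding you want from a test rather than from a bank holiday weekend.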
of Johnny unplugging the extension cord from the wall and the lights on the runway going out. "Just kidding!"
https://datacenteroverlords.files.wordpress.com/2017/01/airplane.jpg
Worker: The sign says "Do not use"
Manager: I don't care what it says, flip the switch
Worker: That's a really stupid idea
Manager: Do it, or you're fired
Worker:
Manager: Well, now you really screwed things up, you're fired!
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
I've worked in banking and real estate businesses where we had the luxury of being able to DR failover test things like redundant databases, WAN connections, power supplies...etc - knowing that if something failed we had time to put it back together - before the business and customers would notice the outage.
How does one actually fail-over test things in production in a 24/7 business - especially one that spans time zones all across the world?
Are lab simulations simply enough? I've never seen a lab environment that could truly replicate a production environment.
who wouldn't let the engineers put in redundant power supplies
That's an interesting assumption. Have you seen anything even remotely indicating that the data centre didn't have redundant power? No amount of redundancy has ever withstood some numbnuts pushing a button. But I'm interested to see your knowledge of the detailed design of this datacentre.
Hell, we had an outage on a 6kV dual-fed sub the other day thanks to someone in another substation working on the wrong circuit. He was testing intertrips to a completely different substation and applied power to an intertrip signal. Realising he had hit the wrong circuit (A), he immediately moved to the one he was supposed to do (B). Both were in the wrong cubicle, so he successfully knocked out both redundant feeds to a 6kV sub and took down a portion of the chemical plant in the process.
Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.
What definitely needs to be done in-house is whatever your company is supposed to be good at. Ford designs and assembles cars - they shouldn't outsource the design and assembly of cars because that's what they DO - if they stop making cars, they are no longer doing anything and have no reason to exist. Ford is not in the business of making cleaning products, so they probably shouldn't make the cleaning products they use. They should outsource that, buying cleaning products from SC Johnson or someone. Ford is not in the business of cleaning carpets, so that's also a candidate for outsourcing.
Once you have a list of items that can be outsourced because they aren't your "core competencies", the "make or buy" decision becomes mostly a matter of arithmetic. For the same budget cost, will you get it done better by hiring people to do it, or by hiring a company to do it? Equivalently, for the same level of quality, does it cost less to pay in-house people to do it or to pay an outside source? Probably, you'll find that it's better to get an operating system from an outside source, not make your own.
While there is no hard and fast rule, a rule of thumb is to consider the company next door. If you could easily buy the same product or service from the same vendor that the company next door uses, and it would serve your purpose, you should probably do so. General purpose things like office supplies, office cleaning, and payroll services should be purchased, not manufactured in house, because there is no competitive advantage to be gained from having better office supplies than the other company.
Business critical systems should operate in an active/active high-availability scenario in at least two separate locations. That way the loss of any one node has zero effect except perhaps a transaction retry and reduced performance.
Systems of the next lower level of criticality should have real-time replication to a separate location, so that if a node fails the recovery time is simply what it takes to boot the replacement node.
At further lower levels of criticality you start getting into things like virtualization clusters to mitigate hardware failures, supported by point-in-time backups to mitigate data failures. The IT department's Minecraft server can just be a spare desktop machine sitting on an admin's desk.
(There are additional considerations for all levels of criticality too, of course, like SAN volume snapshots and backups.)
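To make the "zero effect except perhaps a transaction retry" part of the active/active case concrete, here is a minimal sketch of the client side only. The `SiteDown` exception, `make_site`, and `submit` are hypothetical names invented for this example, and the sketch says nothing about how the two locations keep their data in sync, which is the hard part.

```python
class SiteDown(Exception):
    """Raised by a (hypothetical) site handler when that location is unreachable."""


def make_site(name, up=True):
    # Stand-in for a real endpoint in one location; purely illustrative.
    def handle(transaction):
        if not up:
            raise SiteDown(name)
        return f"{name} committed {transaction}"
    return handle


def submit(transaction, sites):
    """Send the transaction to the first site that answers, retrying on the next one."""
    failed = []
    for site in sites:
        try:
            return site(transaction)
        except SiteDown as exc:
            failed.append(str(exc))  # remember which site failed, then retry elsewhere
    raise SiteDown(f"all sites failed: {', '.join(failed)}")


if __name__ == "__main__":
    sites = [make_site("site_a", up=False), make_site("site_b")]
    print(submit("booking-1", sites))
```

With `site_a` down, the caller just sees the transaction land on `site_b` after one retry. Only when every location is down does the error reach the caller.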
Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.
It isn't armchair engineering. The CEO should accept full responsibility because that's what it means to be at the top of the reporting chain when such a devastating preventable outage occurs. If he was misled by his direct reports, then he should fire them and take full responsibility for not firing them sooner. Maybe he resigns maybe he doesn't--the point is that he must own the failure, whatever the logical conclusion.
The Daddy casts sleep on the Baby. The Baby resists!
> In some cases you actually do need emergency global 'off' switches that are never meant to be used in normal operation.
Yes, if you run a simple experiment, and there is the possibility for harm, a single red button is a good idea.
But if shutting down the server room costs $100 000 000, then a single red button is not a good idea. Instead, you have two parallel power distribution systems, with some physical separation, and there are two off switches. Of course there should be a sign that explains how to use the switch, and I guess that is where this story eventually leads.
(Hopefully) an honest, albeit very consequential mistake. I've done the same thing when I was working on the backside of a server cabinet - the PDU was right there by my shoulder and I swiped it on accident. No UPS in the cabinet (a mistake not of my own but the ones who built it out). Fortunately everything came back on. Good thing to have BIOS settings to 'stay off' after a power failure (so you can turn them back on individually and not overdraw power). I feel bad for the guy who did this, it was probably his last day working there.
It is pitch black. You are likely to be eaten by a grue.
In our secure rooms, we have an EPO button. It's LARGE, red, and inside a cover that you have to lift to hit it.
And this contractor turned off the *entire* power for an *entire* datacenter? Yep, yep, not our fault, not your fault, it's gotta be the fault of that guy over there pushin' a broom!
Have they never heard of multiple servers with the ability to handle server down events for one machine?
-- Tigger warning: This post may contain tiggers! --
Sure, that may have been the proverbial last drop. But the actual root-cause is that their systems were not able to cope with outages that must be expected. And the responsibility for that is straight with top management. Their utterly dishonest smoke-screen is just more proof that they should be removed immediately for gross incompetence.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
... I was walking behind the server rack and unknowingly brushed up against the power cord to the Novell 3.1 server.
Later, when my boss asked me for an outage report, I told her, "I wish you hadn't asked that."
I made damned sure that plug was tied to the server after that.
It little behooves the best of us to comment on the rest of us.
The CEO should accept full responsibility
Hah, the CEO is probably trying to figure out how to give himself more stock options now that they're cheaper. These greedy fuckers can never think past their multi-million payouts.
Seven puppies were harmed during the making of this post.