British Airways Says IT Collapse Came After Servers Damaged By Power Problem (reuters.com)
A huge IT failure that stranded 75,000 British Airways passengers followed damage to servers that were overwhelmed when the power returned after an outage, the airline said on Wednesday. From a report: BA is seeking to limit the damage to its reputation and has apologised to customers after hundreds of flights were canceled over a long holiday weekend. The airline provided a few more details of the incident in its latest statement on Wednesday. While there was a power failure at a data center near London's Heathrow airport, the damage was caused by an overwhelming surge once the electricity was restored, it said. "There was a total loss of power at the data center. The power then returned in an uncontrolled way causing physical damage to the IT servers," BA said in a statement. "It was not an IT issue, it was a power issue."
Pretty sure UPSes and backup power supplies kinda do fall under that...
"It was not an IT issue, it was a power issue."
Assuming it was not a lightning strike, it's still your fuckup if "power issues" can damage/take down your IT.
It absolutely is an IT issue if you cannot automatically recover from power events in a single data center...
The power surge was the direct cause. The fundamental cause was the failure of management to ensure they had an appropriate disaster recovery plan.
A proper IT staff would have built in safeguards against power outages and power surges.
For a company the size of British Airways I would expect that they would have a hot failover in a different country. Or at least a different geographic location.
In short they cheaped out on IT and now they are paying for it.
Even if UPS and surge protection do not count, having a redundant system in a different data centre ready to take over regardless of the cause of the outage definitely does fall under IT. It is insane that a major company like BA did not have any such redundancy for such an important, mission-critical application. It would have cost far less than the £100 million estimated cost of this incident, not to mention avoiding the appalling publicity.
If the power hadn't come back at the datacenter, would that still be a power issue? If an earthquake destroys the datacenter, is that an earthquake issue? If your system collapses when a datacenter goes offline (for whatever reason), you're at fault, not the datacenter. This seems like a classic case of having a single point of failure.
An ill-considered plan to save a few dimes has cost them several dollars.
The CEO should have foreseen this and should be let go. As should other executives who approved the offshoring plan.
Offshoring can work, but excessive staffing cuts to save a few extra dollars are begging for something like this to happen.
Infrastructure people should be located on site with the hardware, and there should be multiple hardware systems *with* failover testing on a monthly basis. (Not quarterly; that fails. Only monthly is often enough that the failover is seamless, and there is a good argument for doing a daily failover.)
This is what happens when you treat your IT staff like your janitorial staff.
Outsourcing is part of the problem, but you're right, it derives from the mentality that IT is a cost center that must be minimized at every possible turn. It's outdated thinking, going back to the days when if your office network went down, there'd be a bit of inconvenience, but the planes still flew, and it wasn't a big deal. Today, IT is a business-critical area, because when your network goes down, the planes stop flying, and you stop making money, never mind the lingering effects from the terrible publicity or the angry customers. It's not something you can afford to skimp on, on any level.
Unfortunately it will probably take several shocks like this, and some high level careers ending as a result, before they start to wise up.
How would this 'admission' make anyone more comfortable about this business?
The business doesn't have to worry about that. It's safe regardless; too-big-to-fail public+private yada yada. This is BA we're talking about.
These "stories" are just the public narrative writing process, guided to affix/deflect blame to/from the appropriate parties as the scapegoats are singled out. The BA execs know they have maybe 72 hours or so before this story falls out of the news cycle so they're using that window to make the headlines they need to muddy the waters. Until now the only narrative that has had any play is the "outsourcing did it" one, and that hits too close to management, so they're making this stuff up and putting it out through their MSM channels.
"We were lucky we had people on the site who knew what trouble sounds like and were willing to isolate the room"
You weren't lucky; it's called having good, well-trained and well-practised staff on-site. And based on what everyone has been saying, this is something that was severely lacking at BA.