British Airways Says IT Collapse Came After Servers Damaged By Power Problem (reuters.com)
A huge IT failure that stranded 75,000 British Airways passengers followed damage to servers that were overwhelmed when the power returned after an outage, the airline said on Wednesday. From a report: BA is seeking to limit the damage to its reputation and has apologised to customers after hundreds of flights were canceled over a long holiday weekend. The airline provided a few more details of the incident in its latest statement on Wednesday. While there was a power failure at a data center near London's Heathrow airport, the damage was caused by an overwhelming surge once the electricity was restored, it said. "There was a total loss of power at the data center. The power then returned in an uncontrolled way causing physical damage to the IT servers," BA said in a statement. "It was not an IT issue, it was a power issue."
Pretty sure UPSes and backup power supplies kinda do fall under that...
We all know that this outage was caused by bad faith outsourcing to unqualified persons. Who are they kidding?
https://www.theguardian.com/bu...
Oh yeah, power surges are to blame! haha no.
The dangers of knowledge trigger emotional distress in human beings.
"It was not an IT issue, it was a power issue."
Assuming it was not a lightning strike, it's still your fuckup if "power issues" can damage/take down your IT.
to buy surge protectors...
Then it is an IT issue. Infrastructure should have equipment in line to prevent this kind of situation. It was poor IT management and planning that caused this.
It absolutely is an IT issue if you cannot automatically recover from power events in a single data center...
How big a current spike was this? Don't UPSes act as surge protectors and filters too?
If the power was out and the system was running on standby power, it should not switch back until mains power is up and is stable. Also, a large power surge would certainly have damaged other systems in the surrounding area, right? No, no, it was absolutely an IT issue and heads need to roll here for poor management decisions with regard to staffing and outsourcing. Make no mistake, you don't get to the upper echelons of management without a gift for deflecting blame - we cannot allow that to happen here. Keep digging.
The power surge was the direct cause. The fundamental cause was the failure of management to ensure they had an appropriate disaster recovery plan.
Some might argue that good IT would have redundant servers in more than one location for mission critical infrastructure.
Others might argue that mission critical IT includes line-conditioned power that would handle surges.
They were hacked, but have to deny it, for strategic reasons, I'm sure...
The issue was caused by the outage. Bringing up all of the affected servers at once caused more damage, and was the result of poor planning. The DR site's data was inconsistent with production so it could not be brought online.
Therefore the production site needed to be repaired and data restored to the last known good state. And that takes a lot longer than whatever RTO/RPO that BA claims they would be able to meet.
And the difference is...?
Anyone whose Server Farm can be brought down from a power outage does NOT know what they are doing, or care enough about it to bother.
How would this 'admission' make anyone more comfortable about this business?
Power issues of this kind are IT issues.
When designing a server location you must take power into consideration; i.e., do I have enough battery to keep all critical servers and supporting hardware up until the generator has kicked in, plus extra just in case the generator has a glitch or two of its own? Is the battery rated at the correct surge protection to keep systems from glitching when the power does return? Is the generator more than enough to power everything between re-fuelings? Is the generator rated enough to run everything at less than 80% load? Have I staged non-critical servers and equipment to power down and power back up in stages, to spread out the demand for power? Is there a backup facility that I can spin up or switch to? For this kind of operation you would want to switch to a new site in minutes to prevent business loss.
Again, these are IT problems.
Now a battery going dead, a power supply frying, circuit breakers tripping; these are power issues. Poor disaster recovery plans are not.
Even if UPS and surge protection do not count, having a redundant system in a different data centre ready to take over regardless of the cause of the outage definitely does fall under IT. It is insane that a major company like BA did not have any such redundancy for such an important, mission critical application. It would have cost far less than the £100 million estimated cost of this incident not to mention avoiding the appalling publicity.
If the power hadn't come back at the datacenter, would that still be a power issue? If an earthquake destroys the datacenter, is that an earthquake issue? If your system collapses when a datacenter goes offline (for whatever reason), you're at fault, not the datacenter. This seems like a classic case of having a single point of failure.
Strange, in my data centers we run diesel generators for power backup when utility is down. It would be fairly impossible for a power disruption to do something to our equipment. Sounds fake as fuck to me. I mean, any half-assed IT manager would have thought of a redundant power source, be it diesel or battery.
It's amazing how many 'power' issues there are with remote Indian support centers. If it truly is the power issues, why aren't there rigorous disaster plans, given that these power issues are so common? If they are in place, and they still aren't helping, then why are they building these data centers / support centers there anyway? If the country has an unstable power grid or is prone to natural disasters that cause issues, once again, why are there data centers there in the first place?
It sounds like someone is blowing smoke up someone's butt. The question is, where is that smoke starting....
No good deed goes unpunished.
"Those union electricians told us we could run all these servers without upgrading the circuit breakers. It's not an IT problem, it's a union problem!"
Really? So, they are completely illustrating that their IT efforts are a "cost center" and that IT is a "necessary evil" that they provide minimal effort to. Everyone knows that a serious "Data Center" has multiple protective measures in place, so who is this service provider? I wonder how they treat their aircraft? This is so blatantly obvious it hurts those who know IT. Forget about the outsourcing questions.
was the fact that you apparently have no redundancy on extremely mission-critical servers.
Every server wasn't connected to a UPS? And the return of the power overwhelmed the UPSes?
And just how did management decide to "save money" on the power for the servers?
I worked as a dev for a pretty big social network company. We were a not-quite also-ran, peaking at Alexa 108 globally, and for a while we were beating the pants off of Facebook. This was in the pre-AWS days when startups still ran their own servers. Early on, we had apparent power failures on two successive Saturday nights. Right when our database scrubbing processes started.
I suggested to our sysadmins that *maybe* it was because all of the disk heads were starting to move at once, and *maybe* it would go away if we staggered the processes across servers.
Yep, problem solved. Our power feeds were rated for average power draw, not peak power draw on all servers in a rack, and peak power came when all of the disks started seeking simultaneously.
It seems the same thing happened at BA, except no one thought to stagger-start the servers. For us, this was the first big system we ever built, so, OK, chalk it up to growing pains (and the problem never, ever happened again). But BA? Shame on them.
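A rough Python sketch of the stagger-start idea described above. Everything here is a made-up illustration, not the poster's actual setup or BA's: the function name, the wattages and the settle delay are all assumptions. It starts servers in batches so that servers already running (at average draw) plus the batch being started (at peak draw) never exceed what the feed is rated for.

```python
def stagger_schedule(servers, avg_watts, peak_watts, feed_watts, settle_seconds):
    """Plan start-up batches that keep total draw within the feed rating.

    Assumption (illustrative only): a server draws peak_watts while starting
    and avg_watts once it has settled, settle_seconds later.
    Returns a list of (start_offset_seconds, [server names]) tuples.
    """
    schedule = []
    started = 0
    offset = 0
    while started < len(servers):
        # Power left over after the already-running servers take their share.
        headroom = feed_watts - started * avg_watts
        batch_size = int(headroom // peak_watts)
        if batch_size < 1:
            raise ValueError("feed cannot carry another server at peak draw")
        batch = servers[started:started + batch_size]
        schedule.append((offset, batch))
        started += len(batch)
        offset += settle_seconds
    return schedule


if __name__ == "__main__":
    rack = [f"db{i:02d}" for i in range(10)]  # hypothetical host names
    for offset, batch in stagger_schedule(rack, 200, 500, 2400, 30):
        print(f"t+{offset:>3}s start {batch}")
```

With a feed rated only a little above the average load, the batches shrink as more machines come up, which is exactly why starting everything at once trips the feed while a staggered start does not.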
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
In other words: "We used $10 MILLION WORTH OF EXPENSIVE SERVERS like a CHILD would use a PAPER CLIP IN AN ELECTRIC SOCKET."
It has none of these problems. Stabilizers on the power (and UPSs (Unicorns of Perplexing Size?) of course), surge protection on every I/O port. And... A FUCKING REDUNDANT SYSTEM! It is the 21st century, how hard can it be? (And yes, it has a pretty big database (well, about 9 Gig, since it's from the 80s; it was amazingly massive back then) sampling/storing thousands of signals in real time. And no, it's NOT stinking SQL.)
Pretty hard it seems, you CHEAP bastards... Instead of buying that personal yacht you are never going to use you should have given the IT guys the budget they wanted!
it's not our DC so we don't deal with the power part it's the DC that we outsourced to that does the power part.
The fact that this company even says that makes me question whether they understand what IT is and how to fix this going forward.
Most companies have a plan (or should have) to handle a location taking a hit... especially a company this large. The fact that a single DC collapse could bring this company to its knees is so bad it's almost negligent IMO.
--Hired Net Grunt
Delta tried this last year and got called to the carpet on it. BA needs to learn from others' mistakes...
--I like turtles...
Big DC power systems are not really the IT guys' job; they are more the domain of infrastructure staff and electricians, and some of that equipment is not an easy swap, even more so if a fail-safe tripped and killed all power.
Fukushima was precipitated because although there were multiple backup generators, they were all located in the same place (basement) and all relied on a single fuel supply. When the tsunami flooded the basement and contaminated the fuel, all that redundancy was useless. Likewise, BA had multiple backup servers, but they all relied on a single power supply which fried the backup servers when it went bad.
Reliability does not come from redundancy per se. It comes from keeping all possible modes of equipment failure independent of each other. Adding redundant equipment can help achieve this, but redundancy is undone if that equipment is all vulnerable to a single mode of failure. When the Space Shuttle's solid rocket boosters showed signs of burning through the two O-rings sealing each mated section, NASA tried to fix it by increasing redundancy by adding a third O-ring. But they completely failed to account for an outside factor (cold weather) affecting all three O-rings simultaneously. And the Challenger blew up.
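To put rough numbers on the point about independence, here is a small Python sketch. The probabilities are purely illustrative assumptions, not figures for BA, Fukushima or the Shuttle. It models a system that goes down either when a shared (common-mode) cause hits every unit at once, or when all redundant units happen to fail independently.

```python
def p_system_down(p_unit, n_units, p_common=0.0):
    """Probability that every redundant unit is down at the same time.

    p_unit   -- chance that one unit fails on its own (independent failures)
    n_units  -- number of redundant units
    p_common -- chance of a shared cause (flood, surge, cold) that takes out
                all units together; 0.0 means fully independent units
    """
    return p_common + (1.0 - p_common) * p_unit ** n_units


if __name__ == "__main__":
    # Illustrative numbers only.
    print(p_system_down(0.01, 3))         # three independent units
    print(p_system_down(0.01, 3, 0.001))  # same units sharing one power feed
    print(p_system_down(0.01, 4, 0.001))  # a fourth unit barely helps
```

Once a common cause is present, it dominates the result: adding more units behind the same power feed barely moves the number, while removing the shared dependency does.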
Our small school was able to cobble together enough money to afford an APC RM6000 to protect our small data room.
We recently had an intermittent power leg (it was broken at the service pole cut-offs). The wind would blow the cable and cause arcing - and lots of power weirdness on that leg.
Our UPS simply did what it needed to do to keep reliable power going to our IT systems. If we had a generator, we would have failed-over onto that until the power company fixed the service.
Surely an organization the size of BA can afford better and more redundant systems than this.
I suspect BA is passing the buck here.
Some DCs have their own substations, and it may have been something inside the DC's power system that failed.
Offshore IT is what led to this mess.
Twitter supports and protects racists - by smearing their critics with the "Hate Speech" label.
Seriously, I had not seen so many issues in airline computers until the last 2 years. What is different? Why, outsourcing to India.
I prefer the "u" in honour as it seems to be missing these days.
I don't think he's digging his hole fast enough. Feel free to borrow my shovel.
Or, perhaps a better solution would be for someone else at BA to clonk him over the head from behind with a little statuette or something so he just stops talking.
"I was able to protect my puddly shit at my workplace with equipment I bought at Frys, so BA should have been able to protect its 12,000 servers just like I did."
Scaling up is hard. Just because you were able to do it with your install doesn't mean it would be just as easy for a larger install.
That said, they should have done a better job at BA. Even though testing power isn't part of a smaller DC's MO, it should be for a company the size of BA...at least in their dev environment.
I call bullshit on the "it was not an IT problem" as it was DEFINITELY an "IT Management Problem".
How do you even BUILD a large data center (or co-locate it at a commercial NOC like AT&T) without massive UPS and stand-by generators to handle power issues?
A full power outage should cause ZERO downtime in a large Data Center. If it causes any, FIRE THE IT MANAGEMENT who approved the shitty design to "save a few pounds" because the DC owner/operator has guaranteed a five-nines SLA IN THE CONTRACT (which isn't worth the paper it is printed on without these capabilities).
You only pay for the Power Used (and square footage) in large data center leases anyway (unless you have huge bandwidth requirements) and the UPS and Generators normally do not count against that. Unfortunately when you co-locate with someone like AT&T it states in the contract that THEY ALONE determine if they meet the five-nines SLA, and for some reason they NEVER miss that mark (funny, that).
This reeks of IT Management covering their asses for bad decisions with the Board of Directors.
Sometimes a meteor strike takes out your data center, it happens. The answer is to design smaller, smarter systems that are more resilient.
BA carries about 50M passengers per year, roughly 137k per day. Over an 8-hour business day, that's about 17k bookings per hour. Let's say one booking consists of 100 datasets written to the database, and maybe 10 times that in reads. This works out to about 500 writes and 5000 reads (most of which can be cached) per second. Actual average loads are going to be even lower. I'm not talking about extended services or analytics, which can happen on machines of lesser importance, this is just about the core business of taking a reservation and issuing a ticket.
It's not a high transaction volume, and it's not a lot of data. You do not need an entire data center for this. One server with a stock DB can do that. The structure of this core is simple enough so you could replicate it to hot spares around the globe. Heck, you could even lower the load by caching reads at the airport and at the web server.
There is no unavoidable technical reason this power failure had to be that catastrophic.
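The back-of-the-envelope estimate above is easy to re-run. A minimal Python sketch, using only the inputs stated in the comment (50M passengers a year, an 8-hour day, 100 writes per booking, 10 reads per write); the function and key names are just for illustration.

```python
def booking_load(passengers_per_year=50_000_000,
                 business_hours_per_day=8,
                 writes_per_booking=100,
                 reads_per_write=10):
    """Rough database load, treating one passenger as one booking."""
    per_day = passengers_per_year / 365
    per_hour = per_day / business_hours_per_day
    writes_per_sec = per_hour * writes_per_booking / 3600
    reads_per_sec = writes_per_sec * reads_per_write
    return {
        "bookings_per_day": per_day,
        "bookings_per_hour": per_hour,
        "writes_per_sec": writes_per_sec,
        "reads_per_sec": reads_per_sec,
    }


if __name__ == "__main__":
    for name, value in booking_load().items():
        print(f"{name:>18}: {value:,.0f}")
```

The result lands a little under 500 writes and 5000 reads per second, in line with the figures quoted, and squeezing all traffic into 8 hours already makes it a pessimistic estimate.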
No Single Point of Failure is a basic, and possibly the most important, principle of engineering. When this fundamental principle has not been followed, then it IS an IT problem.
The point they were making is they were a "puddly shit" shop and even they recognized the value of investing in power infrastructure. They weren't stating that infrastructure was easy or BA should do the same as them, merely that one would think they would also understand the value proposition of investing in infrastructure and proper DR execution.
I really think DR solutions should be pitched as IT insurance. Businesses wouldn't operate their offices without insurance against fires, etc so why do so many companies run their IT infrastructure with hedged bets for when their systems go down?
First, there is a cut in mains power to the data centre. No biggie, the batteries take the load. The backup generators then start to spin up and then supply power and the datacentre keeps running.
No lights went off, no computers crashed, business kept running.
But then, mains power is available again. How do you transition from your own generated power back to grid power? You can't just flick a switch. For a start you should ensure that the phase of the two power sources match - on all 3 phases. If you don't do that, I can imagine a power surge would be very likely.
But just like every outfit tests its backups - but very few test their restores, I guess BA had tested their failover process - but never got around to failing back. Or that the one time they did fail-back they got lucky.
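The fail-back check described above could look something like this in Python. This is only a sketch: the tolerances are invented for illustration and are not taken from any electrical standard or from BA's installation, and real transfer switches do this in dedicated hardware. The idea is simply to refuse the transfer unless voltage, frequency and all three phase angles of the two sources agree closely enough.

```python
from dataclasses import dataclass


@dataclass
class Source:
    volts: float       # RMS voltage
    hertz: float       # frequency
    phase_deg: tuple   # angle of each of the three phases, in degrees


def phase_gap(a, b):
    """Smallest angle between two phase angles, handling 359 vs 1 degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def safe_to_transfer(generator, mains,
                     max_volt_pct=5.0, max_hz=0.2, max_phase_deg=10.0):
    """True only if the two sources are close enough to switch between.

    All three thresholds are hypothetical defaults for illustration.
    """
    volt_pct = abs(generator.volts - mains.volts) / mains.volts * 100
    if volt_pct > max_volt_pct:
        return False
    if abs(generator.hertz - mains.hertz) > max_hz:
        return False
    return all(phase_gap(g, m) <= max_phase_deg
               for g, m in zip(generator.phase_deg, mains.phase_deg))


if __name__ == "__main__":
    gen = Source(231.0, 50.05, (2.0, 122.0, 242.0))
    grid = Source(230.0, 50.00, (359.0, 119.0, 239.0))
    print(safe_to_transfer(gen, grid))
```

A check like this only proves something if the fail-back path is actually exercised, which is the restore-testing point made above.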
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
Can they replace those servers with low power Raspberry Pi machines?
It is if it is set up and administered right.
we did monthly failovers between different physical sites. A blown DC at one site wouldn't have made a difference.
Our failovers involved a couple hours of oncall for about 150 staff. Most of the time only a half dozen were working, but a couple times a year it would involve most of the staff (and a lot of IT people) for part of that. A database would be out of sync or messed up and that would fall to the IT staff to fix. It became less common over time.
Did you miss that they fixed the power problems and then the IT systems were messed up for a long time afterwards, indicating poor disaster planning and low staff skill?
A company as big as BA should have had a separate failover site and been doing regular failovers.
She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
The utility providers for all of BA's major operations centers in England are all on record as saying there were no power surges, anomalies, etc. This wasn't "we're unaware of...", they all went back over their logs and categorically denied it (seems like they weren't happy about BA trying to pin any bit of this sh*t show on them). As many have pointed out above and elsewhere, none of this passes the sniff test. BA's taking a beating for this, not just over stranding passengers but over how they handled the stranded passengers. Many of their communications to passengers have failed to mention BA's obligations as well as refunds and options passengers are legally entitled to. They even had the nerve to point most people to their toll customer service line instead of their toll-free one, charging people 35 pence/min to sit on hold while they were trying to get their travel plans sorted out. Even nickel-and-diming low cost carriers (LCCs) like Ryanair aren't stupid enough to try something like that after a system-wide disruption.
As opposed to another type of servers? Do they have building & grounds servers? Operations servers? Receptionist servers?
Just curious...
"A plan fiendishly clever in its intricacies"- Homer Simpson
... what one sows.
or both?
"It was not an IT issue, it was a power issue." - BA
This is why they had IT issues, everyone. They obviously do not know what counts as IT, or what counts at ALL.
This is why it was easy for them to justify offshoring.
This is why they fucked up.
Heads should roll and IT change management should kick in for damage control.
I bet their systems are also greatly vulnerable and they wouldn't even know.
"I was able to protect my puddly shit at my workplace with equipment I bought at Frys, so BA should have been able to protect its 12,000 servers just like I did."
Scaling up is hard. Just because you were able to do it with your install doesn't mean it would be just as easy for a larger install.
That said, they should have done a better job at BA. Even though testing power isn't part of a smaller DC's MO, it should be for a company the size of BA...at least in their dev environment.
Disclaimer: I'm a s/w dev in an unrelated area, so zero direct experience.
Anyway, "scaling up is hard" ... agree. However, why are their IT systems not sharded into multiple data centers ... preferably each also containing capacity to failover another site? Let's say they had a data center at the 2 sites in the UK and 2 more sites outside the UK. If one of the data centers goes down a peer takes over (or the load gets split 3 ways).
OK, then on a regular basis they could deliberately fail a datacenter & practice taking the load elsewhere.
I understand that having multiple datacenters means additional cost, but redundancy always comes at a cost. Sounds like they had a non-functioning RAID-1 (where D == datacenter) when they could have had RAID-5 (or 6)
I also understand that implementing this comes with additional s/w complexity ... but if you're big enough it'd seem to be worth it.
Meh, enough hand-waving.
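For what it's worth, the "peer takes over or the load gets split 3 ways" idea above can be sketched in a few lines of Python. The site names, loads and capacities are hypothetical, and a real failover involves data consistency, not just arithmetic; this only checks whether the surviving sites have enough spare capacity to absorb a failed one.

```python
def redistribute(load_by_site, capacity_by_site, failed_site):
    """Split the failed site's load evenly across the surviving sites.

    Returns (new_load_by_site, overloaded_sites).
    """
    survivors = [s for s in load_by_site if s != failed_site]
    if not survivors:
        raise ValueError("no surviving site to take the load")
    share = load_by_site[failed_site] / len(survivors)
    new_load = {s: load_by_site[s] + share for s in survivors}
    overloaded = [s for s in survivors if new_load[s] > capacity_by_site[s]]
    return new_load, overloaded


if __name__ == "__main__":
    # Hypothetical: four sites, each at 25 units of load with room for 40.
    sites = ["uk-1", "uk-2", "eu-1", "us-1"]
    load = {s: 25.0 for s in sites}
    capacity = {s: 40.0 for s in sites}
    new_load, overloaded = redistribute(load, capacity, "uk-1")
    print(new_load, overloaded)
```

With four sites the survivors absorb the failure comfortably; with only two sites each running near capacity, the survivor ends up overloaded, which is roughly the non-functioning RAID-1 situation.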
This is not surprising. I was actually hired because the IT department took too long.
One project my manager was involved with was cancelled after 3 years because the goal had already passed. They never even started!
If a power surge caused the issue, then surely BA will sue the power company. If the power company can demonstrate there was no surge, surely they will sue BA for defamation.
And there was no power surge (not outside the DCs anyway).
A large number of ex-BA IT staff have commented in fora about the historic robustness of the system. However, over the last 5 years BA has systematically gutted its IT staff and outsourced just about everything to India.
The CIO of BA (and IAG) is a manager whose last claim to fame was being the person responsible for ramming through the highly contentious (as in strike-causing) cabin crew contracts which stripped out many rights in 2011.
He has ZERO IT background and was charged with reducing the IT bill by $90 million per year.
Make of that what you will.
The problem is really quite simple: corporate drones think that throwing tantrums at a problem will get it fixed. Tantrums include not only screaming at people, but also throwing money at a problem. There's this thing called human capital where most qualified people will naturally reward an employer's loyalty to them with their loyalty to the cause of the employer. Yet the corporate world is treating humans like replaceable cogs, and that's what they get: stuff that's held together by good wishes and chewing gum. Why? Because in such a work atmosphere, nothing better will ever flourish.
A successful API design takes a mixture of software design and pedagogy.
Most companies automatically fall under the "oops, a power outage" cover story, when in fact they had been hacked. I know of at least one major company that's done this recently (inside info). Their premise was that a power outage or incompetence is much better PR than being hacked or compromised. See a pattern? There have been a few in the media recently. British Airways of all people, to have a power issue, no redundancy, etc. I am not buying it 100%.