British Airways Says IT Collapse Came After Servers Damaged By Power Problem (reuters.com)
A huge IT failure that stranded 75,000 British Airways passengers followed damage to servers that were overwhelmed when the power returned after an outage, the airline said on Wednesday. From a report: BA is seeking to limit the damage to its reputation and has apologised to customers after hundreds of flights were canceled over a long holiday weekend. The airline provided a few more details of the incident in its latest statement on Wednesday. While there was a power failure at a data center near London's Heathrow airport, the damage was caused by an overwhelming surge once the electricity was restored, it said. "There was a total loss of power at the data center. The power then returned in an uncontrolled way causing physical damage to the IT servers," BA said in a statement. "It was not an IT issue, it was a power issue."
Pretty sure UPSes and backup power supplies kinda do fall under that...
We all know that this outage was caused by bad faith outsourcing to unqualified persons. Who are they kidding?
https://www.theguardian.com/bu...
Oh yeah, power surges are to blame! haha no.
The dangers of knowledge trigger emotional distress in human beings.
"It was not an IT issue, it was a power issue."
Assuming it was not a lightning strike, it's still your fuckup if "power issues" can damage/take down your IT.
It absolutely is an IT issue if you cannot automatically recover from power events in a single data center...
How big a current spike was this? Don't UPSes act as surge protectors and filters too?
The power surge was the direct cause. The fundamental cause was the failure of management to ensure they had an appropriate disaster recovery plan.
And the difference is...?
Anyone whose Server Farm can be brought down from a power outage does NOT know what they are doing, or care enough about it to bother.
How would this 'admission' make anyone more comfortable about this business?
Power issues of this kind are IT issues.
When designing a server location you must take power into consideration; i.e., do I have enough battery to keep all critical servers and supporting hardware up until the generator has kicked in, plus extra just in case the generator has a glitch or two of its own? Is the battery rated at the correct surge protection to keep systems from glitching when the power does return? Is the generator more than enough to power everything between refuelings? Is the generator rated to run everything at less than 80% load? Have I staged non-critical servers and equipment to power down and power back up in stages, to spread out the demand for power? Is there a backup facility that I can spin up or switch to? For this kind of operation you would want to switch to a new site in minutes to prevent business loss.
Again, these are IT problems.
Now a battery going dead, a power supply frying, circuit breakers tripping; these are power issues. Poor disaster recovery plans are not.
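To make the checklist above concrete, here is a minimal sketch of two of the checks (battery runtime versus generator start time, and generator load under 80%). Every field name and the 3x retry margin are assumptions picked for illustration; nothing here comes from BA or from any vendor's sizing tool.

```python
from dataclasses import dataclass


@dataclass
class PowerPlan:
    critical_load_kw: float      # load the UPS batteries must carry
    total_load_kw: float         # load the generator must carry
    battery_kwh: float           # usable battery energy
    generator_kw: float          # generator rated output
    generator_start_min: float   # time for the generator to take the load
    retry_margin: float = 3.0    # assumed allowance for a failed start or two


def battery_runtime_min(plan: PowerPlan) -> float:
    """Minutes the batteries can carry the critical load."""
    return plan.battery_kwh / plan.critical_load_kw * 60.0


def generator_load_fraction(plan: PowerPlan) -> float:
    """Fraction of the generator rating used by the total load."""
    return plan.total_load_kw / plan.generator_kw


def check(plan: PowerPlan) -> list[str]:
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    needed_min = plan.generator_start_min * plan.retry_margin
    if battery_runtime_min(plan) < needed_min:
        problems.append("battery runtime is shorter than generator start time plus margin")
    if generator_load_fraction(plan) >= 0.80:
        problems.append("generator would run at 80% load or more")
    return problems


if __name__ == "__main__":
    plan = PowerPlan(critical_load_kw=100, total_load_kw=150,
                     battery_kwh=25, generator_kw=200, generator_start_min=2)
    print(check(plan) or "plan passes both checks")
```

A real design would also have to cover battery aging, fuel supply and staged restart, which this sketch leaves out.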
Even if UPS and surge protection do not count, having a redundant system in a different data centre ready to take over regardless of the cause of the outage definitely does fall under IT. It is insane that a major company like BA did not have any such redundancy for such an important, mission-critical application. It would have cost far less than the £100 million estimated cost of this incident, not to mention avoiding the appalling publicity.
If the power hadn't come back at the datacenter, would that still be a power issue? If an earthquake destroys the datacenter, is that an earthquake issue? If your system collapses when a datacenter goes offline (for whatever reason), you're at fault, not the datacenter. This seems like a classic case of having a single point of failure.
"Those union electricians told us we could run all these servers without upgrading the circuit breakers. It's not an IT problem, it's a union problem!"
Really? So, they are completely illustrating that their IT efforts are a "cost center" and that IT is a "necessary evil" that they provide minimal effort to. Everyone knows that a serious "Data Center" has multiple protective measures in place, so who is this service provider? I wonder how they treat their aircraft? This is so blatantly obvious it hurts those who know IT. Forget about the outsourcing questions.
was the fact that you apparently have no redundancy on extremely mission-critical servers.
Every server wasn't connected to a UPS? And the return of the power overwhelmed the UPSes?
And just how did management decide to "save money" on the power for the servers?
I worked as a dev for a pretty big social network company. We were a not-quite also-ran, peaking at Alexa 108 globally, and for a while we were beating the pants off of Facebook. This was in the pre-AWS days when startups still ran their own servers. Early on, we had apparent power failures on two successive Saturday nights. Right when our database scrubbing processes started.
I suggested to our sysadmins that *maybe* it was because all of the disk heads were starting to move at once, and *maybe* it would go away if we staggered the processes across servers.
Yep, problem solved. Our power feeds were rated for average power draw, not peak power draw on all servers in a rack, and peak power came when all of the disks started seeking simultaneously.
It seems the same thing happened at BA, except no one thought to stagger-start the servers. For us, this was the first big system we ever built, so, OK, chalk it up to growing pains (and the problem never, ever happened again). But BA? Shame on them.
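A rough sketch of the staggering idea described above, assuming every server in the rack has the same idle and peak draw and that a burst (a scrub job, or a boot) lasts a fixed number of seconds. The function names and the wattages in the example are made up for illustration, not taken from our old setup or from BA.

```python
def max_concurrent_bursts(n_servers: int, idle_w: float, peak_w: float,
                          feed_limit_w: float) -> int:
    """How many servers may be at peak draw at once while the rest sit idle."""
    if peak_w <= idle_w:
        raise ValueError("peak draw must exceed idle draw")
    headroom_w = feed_limit_w - n_servers * idle_w
    if headroom_w < 0:
        raise ValueError("feed cannot even carry the idle load")
    return min(n_servers, int(headroom_w // (peak_w - idle_w)))


def staggered_offsets(n_servers: int, burst_s: int, concurrent: int) -> list[int]:
    """Start offset in seconds per server, letting `concurrent` servers burst at a time."""
    if concurrent < 1:
        raise ValueError("feed has no headroom for even one burst")
    return [(i // concurrent) * burst_s for i in range(n_servers)]


if __name__ == "__main__":
    # Hypothetical rack: 20 servers, 200 W idle, 350 W peak, 5000 W feed.
    k = max_concurrent_bursts(20, idle_w=200, peak_w=350, feed_limit_w=5000)
    print("servers allowed at peak together:", k)
    print("start offsets (s):", staggered_offsets(20, burst_s=600, concurrent=k))
```

With those example numbers, starting everything at once would need 20 x 350 W = 7000 W on a 5000 W feed, while running the bursts in groups keeps the peak under the limit.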
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
In other words: "We used $10 MILLION WORTH OF EXPENSIVE SERVERS like a CHILD would use a PAPER CLIP IN AN ELECTRIC SOCKET."
The DC wasn't in India. The staff operating that DC (i.e patching, configuring, monitoring) were in India. The Indian support staff I work with follow instructions to the letter, so if the instructions are not completely accurate it falls on the team who made the instructions. Garbage In = Garbage Out.
It's not our DC, so we don't deal with the power part; it's the DC that we outsourced to that does the power part.
The fact that this company even says that makes me question whether they understand what IT is and how to fix this going forward.
Most companies have a plan (or should have) to handle a location taking a hit... especially a company this large. The fact that a single DC collapse could bring this company to its knees is so bad it's almost negligent IMO.
--Hired Net Grunt
Delta tried this last year and got called to the carpet on it. BA needs to learn from others' mistakes...
--I like turtles...
Big DC power systems are not really handled by IT guys; it's more infrastructure people and electricians, and some of that stuff is not an easy swap, even more so if a fail-safe tripped and killed all power.
Our small school was able to cobble together enough money to afford an APC RM6000 to protect our small data room.
We recently had an intermittent power leg (it was broken at the service pole cut-offs). The wind would blow the cable and cause arcing - and lots of power weirdness on that leg.
Our UPS simply did what it needed to do to keep reliable power going to our IT systems. If we had a generator, we would have failed-over onto that until the power company fixed the service.
Surely an organization the size of BA can afford better and more redundant systems than this.
I suspect BA is passing the buck here.
Some DCs have their own substations, and it may have been something inside the DC's power system that failed.
Offshore IT is what led to this mess.
Twitter supports and protects racists - by smearing their critics with the "Hate Speech" label.
Seriously, I had not seen so many issues with airline computers until the last 2 years. What is different? Why, outsourcing to India.
I prefer the "u" in honour as it seems to be missing these days.
I don't think he's digging his hole fast enough. Feel free to borrow my shovel.
Or, perhaps a better solution would be for someone else at BA to clonk him over the head from behind with a little statuette or something so he just stops talking.
"I was able to protect my puddly shit at my workplace with equipment I bought at Frys, so BA should have been able to protect its 12,000 servers just like I did."
Scaling up is hard. Just because you were able to do it with your install doesn't mean it would be just as easy for a larger install.
That said, they should have done a better job at BA. Even though testing power isn't part of a smaller DC's MO, it should be for a company the size of BA...at least in their dev environment.
Sometimes a meteor strike takes out your data center, it happens. The answer is to design smaller, smarter systems that are more resilient.
BA carries about 50M passengers per year, roughly 137k per day. Over an 8-hour business day, that's about 20k bookings per hour. Let's say one booking consists of 100 datasets written to the database, and maybe 10 times that in reads. This works out to about 500 writes and 5000 reads (most of which can be cached) per second. Actual average loads are going to be even lower. I'm not talking about extended services or analytics, which can happen on machines of lesser importance, this is just about the core business of taking a reservation and issuing a ticket.
It's not a high transaction volume, and it's not a lot of data. You do not need an entire data center for this. One server with a stock DB can do that. The structure of this core is simple enough so you could replicate it to hot spares around the globe. Heck, you could even lower the load by caching reads at the airport and at the web server.
There is no unavoidable technical reason this power failure had to be that catastrophic.
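The back-of-envelope numbers above are easy to recheck. This sketch just redoes the arithmetic with the same assumptions (one booking per passenger, all bookings squeezed into an 8-hour day, 100 writes per booking, 10 reads per write); the function and parameter names are made up for illustration.

```python
def booking_load(passengers_per_year: float = 50_000_000,
                 business_hours_per_day: float = 8,
                 writes_per_booking: float = 100,
                 reads_per_write: float = 10) -> dict:
    """Rough database load if every passenger is one booking."""
    per_day = passengers_per_year / 365
    per_hour = per_day / business_hours_per_day
    writes_per_sec = per_hour * writes_per_booking / 3600
    reads_per_sec = writes_per_sec * reads_per_write
    return {
        "bookings_per_day": per_day,
        "bookings_per_hour": per_hour,
        "writes_per_sec": writes_per_sec,
        "reads_per_sec": reads_per_sec,
    }


if __name__ == "__main__":
    for name, value in booking_load().items():
        print(f"{name}: {value:,.0f}")
```

Without rounding the hourly figure up to 20k, the write rate comes out a little under 500 per second, so the estimate holds as an order of magnitude.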
First, there is a cut in mains power to the data centre. No biggie, the batteries take the load. The backup generators then start to spin up and then supply power and the datacentre keeps running.
No lights went off, no computers crashed, business kept running.
But then, mains power is available again. How do you transition from your own generated power back to grid power? You can't just flick a switch. For a start you should ensure that the phase of the two power sources match - on all 3 phases. If you don't do that, I can imagine a power surge would be very likely.
But just as every outfit tests its backups while very few test their restores, I guess BA had tested their failover process but never got around to failing back. Or the one time they did fail back, they got lucky.
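For what it's worth, the "match the phase on all 3 phases" condition above can be written down as a simple check. This is only a sketch: real transfer switches and synchronizers do this in dedicated hardware, and the tolerances below (5% voltage, 0.2 Hz, 10 degrees) are illustrative assumptions, not values from any standard or from BA's site.

```python
from dataclasses import dataclass


@dataclass
class Source:
    voltage_v: float
    frequency_hz: float
    phase_deg: tuple[float, float, float]  # angle of each of the three phases


def angle_diff_deg(a: float, b: float) -> float:
    """Smallest signed difference between two angles, in (-180, 180]."""
    d = (a - b) % 360.0
    return d - 360.0 if d > 180.0 else d


def safe_to_transfer(gen: Source, grid: Source,
                     max_dv_frac: float = 0.05,
                     max_df_hz: float = 0.2,
                     max_dphase_deg: float = 10.0) -> bool:
    """True only if voltage, frequency and all three phase angles are close enough."""
    if abs(gen.voltage_v - grid.voltage_v) > max_dv_frac * grid.voltage_v:
        return False
    if abs(gen.frequency_hz - grid.frequency_hz) > max_df_hz:
        return False
    return all(abs(angle_diff_deg(g, m)) <= max_dphase_deg
               for g, m in zip(gen.phase_deg, grid.phase_deg))


if __name__ == "__main__":
    grid = Source(230.0, 50.0, (0.0, 120.0, 240.0))
    gen = Source(231.0, 50.05, (355.0, 115.0, 235.0))
    print("safe to transfer:", safe_to_transfer(gen, grid))
```

Note that a generator with two phases swapped fails the check even when its voltage and frequency are perfect, which is the kind of thing a never-tested fail-back would not catch.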
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
It is if it is set up and administered right.
We did monthly failovers between different physical sites. A blown DC at one site wouldn't have made a difference.
Our failovers involved a couple hours of oncall for about 150 staff. Most of the time only a half dozen were working, but a couple times a year it would involve most of the staff (and a lot of IT people) for part of that. A database would be out of sync or messed up and that would fall to the IT staff to fix. It became less common over time.
Did you miss that they fixed the power problems and then the IT systems were messed up for a long time afterwards, indicating poor disaster planning and low staff skill?
A company as big as BA should have had a separate failover site and been doing regular failovers.
She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
The utility providers for all of BA's major operations centers in England are all on record as saying there were no power surges, anomalies, etc. This wasn't "we're unaware of...", they all went back over their logs and categorically denied it (seems like they weren't happy about BA trying to pin any bit of this sh*t show on them). As many have pointed out above and elsewhere, none of this passes the sniff test. BA's taking a beating for this, not just over stranding passengers but over how they handled the stranded passengers. Many of their communications to passengers have failed to mention BA's obligations as well as refunds and options passengers are legally entitled to. They even had the nerve to point most people to their toll customer service line instead of their toll-free one, charging people 35 pence/min to sit on hold while they were trying to get their travel plans sorted out. Even nickel-and-diming low cost carriers (LCCs) like Ryanair aren't stupid enough to try something like that after a system-wide disruption.
As opposed to another type of servers? Do they have building & grounds servers? Operations servers? Receptionist servers?
Just curious...
"A plan fiendishly clever in its intricacies"- Homer Simpson
If a power surge caused the issue, then surely BA will sue the power company. If the power company can demonstrate there was no surge, surely they will sue BA for defamation.
And there was no power surge (not outside the DCs anyway).
A large number of ex-BA IT staff have commented in fora about the historic robustness of the system; however, over the last 5 years BA has systematically gutted its IT staff and outsourced just about everything to India.
The CIO of BA (and IAG) is a manager whose last claim to fame was being the person responsible for ramming through the highly contentious (as in strike-causing) cabin crew contracts which stripped out many rights in 2011.
He has ZERO IT background and was charged with reducing the IT bill by $90 million per year.
Make of that what you will.
The problem is really quite simple: corporate drones think that throwing tantrums at a problem will get it fixed. Tantrums include not only screaming at people, but also throwing money at a problem. There's this thing called human capital where most qualified people will naturally reward an employer's loyalty to them with their loyalty to the cause of the employer. Yet the corporate world is treating humans like replaceable cogs, and that's what they get: stuff that's held together by good wishes and chewing gum. Why? Because in such a work atmosphere, nothing better will ever flourish.
A successful API design takes a mixture of software design and pedagogy.