British Airways Says IT Collapse Came After Servers Damaged By Power Problem (reuters.com)
A huge IT failure that stranded 75,000 British Airways passengers followed damage to servers that were overwhelmed when the power returned after an outage, the airline said on Wednesday. From a report: BA is seeking to limit the damage to its reputation and has apologised to customers after hundreds of flights were canceled over a long holiday weekend. The airline provided a few more details of the incident in its latest statement on Wednesday. While there was a power failure at a data center near London's Heathrow airport, the damage was caused by an overwhelming surge once the electricity was restored, it said. "There was a total loss of power at the data center. The power then returned in an uncontrolled way causing physical damage to the IT servers," BA said in a statement. "It was not an IT issue, it was a power issue."
Pretty sure UPSes and backup power supplies kinda do fall under that...
We all know that this outage was caused by bad faith outsourcing to unqualified persons. Who are they kidding?
https://www.theguardian.com/bu...
Oh yeah, power surges are to blame! haha no.
The dangers of knowledge trigger emotional distress in human beings.
"It was not an IT issue, it was a power issue."
Assuming it was not a lightning strike, it's still your fuckup if "power issues" can damage/take down your IT.
It absolutely is an IT issue if you cannot automatically recover from power events in a single data center...
The power surge was the direct cause. The fundamental cause was the failure of management to ensure they had an appropriate disaster recovery plan.
Even if UPS and surge protection do not count, having a redundant system in a different data centre ready to take over regardless of the cause of the outage definitely does fall under IT. It is insane that a major company like BA did not have any such redundancy for such an important, mission critical application. It would have cost far less than the £100 million estimated cost of this incident not to mention avoiding the appalling publicity.
If the power hadn't come back at the datacenter, would that still be a power issue? If an earthquake destroys the datacenter, is that an earthquake issue? If your system collapses when a datacenter goes offline (for whatever reason), you're at fault, not the datacenter. This seems like a classic case of having a single point of failure.
"Those union electricians told us we could run all these servers without upgrading the circuit breakers. It's not an IT problem, it's a union problem!"
How big a current spike was this?
1.21 Jiggawatts, and it sent them back to 1985.
Really? So, they are completely illustrating that their IT efforts are a "cost center" and that IT is a "necessary evil" that they provide minimal effort to. Everyone knows that a serious "Data Center" has multiple protective measures in place, so who is this service provider? I wonder how they treat their aircraft? This is so blatantly obvious it hurts those who know IT. Forget about the outsourcing questions.
They should do, but it depends a lot on the precise design of the UPS, and the nature of the power transient.
While many industrial UPS systems are double-conversion systems (essentially, the critical load is powered from the battery bus/inverter, and fails over to mains in the event of an inverter/battery malfunction), they are sometimes operated in standby mode (the critical load is powered from mains, and fails over to the battery bus/inverter in the event of a mains failure), as this mode is more energy-efficient and needs less cooling.
Even so, double-conversion UPS systems are not necessarily immune to mains voltage fluctuation (even when operated in double-conversion mode), depending on whether they try to follow mains voltage, or whether the voltage transient exceeds design limits.
If you are interested in some of the dynamics of this, it's worth looking at the incident at the Forsmark nuclear power plant in Sweden. In this case, unexpectedly large grid voltage fluctuations resulted in the double conversion UPSs suffering an output bus overvoltage, which resulted in triggering of output overvoltage protection and disconnection of the critical loads. A less well protected device could have exposed critical loads to a prolonged overvoltage. This incident required particular design changes for nuclear grade UPS systems, such that mains voltage fluctuations, even beyond the anticipated range, should not result in a critical load disconnection.
Great Scott!!
I worked as a dev for a pretty big social network company. We were a not-quite also-ran, peaking at Alexa 108 globally, and for a while we were beating the pants off of Facebook. This was in the pre-AWS days when startups still ran their own servers. Early on, we had apparent power failures on two successive Saturday nights. Right when our database scrubbing processes started.
I suggested to our sysadmins that *maybe* it was because all of the disk heads were starting to move at once, and *maybe* it would go away if we staggered the processes across servers.
Yep, problem solved. Our power feeds were rated for average power draw, not peak power draw on all servers in a rack, and peak power came when all of the disks started seeking simultaneously.
It seems the same thing happened at BA, except no one thought to stagger-start the servers. For us, this was the first big system we ever built, so, OK, chalk it up to growing pains (and the problem never, ever happened again). But BA? Shame on them.
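For anyone who wants to see the stagger idea in numbers, here is a minimal sketch. It assumes each server briefly draws a peak wattage when it starts (or when its disks all seek) and then settles to an average draw. The function name and all the figures are made up for illustration; they are not from our setup or from BA's.

```python
def stagger_offsets(n_servers, avg_watts, peak_watts, feed_watts, peak_seconds):
    """Return a start offset in seconds for each server.

    Assumption: a server draws peak_watts for peak_seconds after it starts,
    then avg_watts. Servers are started in batches sized so that
    (already-running servers at average) + (one batch at peak) stays
    within the feed rating.
    """
    steady = n_servers * avg_watts
    if steady > feed_watts:
        raise ValueError("feed cannot carry the steady-state load")
    extra = peak_watts - avg_watts  # additional draw per server while peaking
    if extra <= 0:
        batch = n_servers  # no peak above average, start everything at once
    else:
        batch = min(n_servers, int((feed_watts - steady) // extra))
    if batch < 1:
        raise ValueError("no headroom for even one server at peak")
    # Server i waits for all earlier batches to finish their peak window.
    return [(i // batch) * peak_seconds for i in range(n_servers)]


# Illustrative figures: 40 servers, 200 W average, 350 W peak, 10 kW feed.
offsets = stagger_offsets(40, 200, 350, 10_000, 30)
print(sorted(set(offsets)))
```

With these made-up numbers, starting everything at once would ask for 40 x 350 W = 14 kW from a 10 kW feed, while the staggered schedule starts at most 13 servers per 30-second window.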
85 mph won't cut it. Gotta get that baby up to 88!
In other words: "We used $10 MILLION WORTH OF EXPENSIVE SERVERS like a CHILD would use a PAPER CLIP IN AN ELECTRIC SOCKET."
How would this 'admission' make anyone more comfortable about this business?
The business doesn't have to worry about that. It's safe regardless; too-big-to-fail public+private yada yada. This is BA we're talking about.
These "stories" are just the public narrative writing process, guided to affix/deflect blame to/from the appropriate parties as the scapegoats are singled out. The BA execs know they have maybe 72 hours or so before this story falls out of the news cycle so they're using that window to make the headlines they need to muddy the waters. Until now the only narrative that has had any play is the "outsourcing did it" one, and that hits too close to management, so they're making this stuff up and putting it out through their MSM channels.
Maw! Fire up the karma burner!
It's not our DC, so we don't deal with the power part; the DC we outsourced to handles that.
Big DC power systems aren't really an IT job. That's more infrastructure people and electricians, and some of that equipment is not an easy swap, even more so if a fail-safe tripped and killed all power.
>UPS undersized
>Power fails, UPS quickly dies
>power comes back or comes back with problems (open neutral, flipped phase, overvoltage, etc.)
>idiots try and bring back everything at once
>UPS trips from inrush from cold start
>or UPS reports a power problem, which they ignore
>idiots flip big lever from "UPS" to "BYPASS"
>all protection...bypassed
Boom
I've seen this scenario play out a few times.
I have to return some videotapes...
Our small school was able to cobble together enough money to afford an APC RM6000 to protect our small data room.
We recently had an intermittent power leg (it was broken at the service pole cut-offs). The wind would blow the cable and cause arcing - and lots of power weirdness on that leg.
Our UPS simply did what it needed to do to keep reliable power going to our IT systems. If we had a generator, we would have failed-over onto that until the power company fixed the service.
Surely an organization the size of BA can afford better and more redundant systems than this.
I suspect BA is passing the buck here.
They do, but some surge protection devices have a limited number of surges they can absorb before they have to be replaced. If there were a number of surges, it's certainly feasible for the protection chain to fail at some point.
An anecdote from a few weeks ago with a data center I help manage. It has a backup generator, automatic switch gear and a Schneider Electric Galaxy double conversion UPS. Yes we don't have two, but we ain't an airline. We do have another data center on another site to take over if needed though.
So a few weeks back our phones go wild with texts fired off by the UPS tossing SNMP traps around. One sprint later, the UPS console is showing no input power and our in-house electricians lay rubber from one end of the campus to the other to get to the sub in time. As we wait for the UPS to hit that magic 5 minutes when it triggers the auto-shutdown sequences on the servers, the sparkies discover the sub's output is fine and the generator isn't running.
Then all shit breaks loose, ten power cycles on the UPS input, some lasting long enough to switch from battery to mains, some not. With ten minutes left on the batteries, the UPS gives up, shuts the inverter and charger down and switches the load to static bypass. Room goes silent except for the UPS alarms, and then the eleventh return cycle comes and goes in about three seconds. We hear PSU fans starting and then winding down. I dropped the master breaker on the DB and isolated the room from the UPS. Down until the sparkies figure it out. There goes three hours of our lives.
Turns out that the automatic switch gear had some arc damage on the utility-side contactor feeding the control boards, probably caused by the eight months of load-shedding (read: utility-driven power cuts to ration power) we had experienced two years ago. That was enough to drop the voltage in one sensor to below the trigger threshold and caused that contactor and the main load contactor to open. Before it could start the generator up, the control board then decided the utility had returned, so it closed the contactors again. And open again, and close again. The sound of a 3-phase 480V 500A contactor switching twice a second is enough to make the sparkies use words a sailor would be proud of.
We had to lock out the sensors, rig a temporary bypass on the contactors to power the room from the generator feed side and replace the damaged contactors before we were fully safe again. We lost 2 PSUs out of 90 and no data. We were lucky.
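The contactor chatter described above is the kind of thing a hold-off timer plus hysteresis on the "utility is back" decision is meant to prevent. Below is a minimal sketch of that idea. It is not the logic of our switch gear or of any real controller: the function, the thresholds and the voltage samples are all hypothetical.

```python
def count_transfers(samples, drop_v, restore_v, hold_s):
    """Count source changes for a simple transfer controller.

    samples: list of (time_s, sensed_volts) readings of the utility feed.
    The load leaves the utility when the reading falls below drop_v, and
    returns only after the reading has stayed at or above restore_v for
    hold_s seconds without interruption.
    """
    on_utility = True
    good_since = None  # time the utility reading last became good
    transfers = 0
    for t, volts in samples:
        if on_utility:
            if volts < drop_v:
                on_utility = False
                good_since = None
                transfers += 1
        elif volts >= restore_v:
            if good_since is None:
                good_since = t
            if t - good_since >= hold_s:
                on_utility = True
                transfers += 1
        else:
            good_since = None  # reading dipped again, restart the timer
    return transfers


# Hypothetical sensor flapping around the threshold twice a second for 10 s.
flapping = [(i * 0.5, 440 if i % 2 == 0 else 470) for i in range(20)]
print(count_transfers(flapping, drop_v=450, restore_v=450, hold_s=0))
print(count_transfers(flapping, drop_v=450, restore_v=460, hold_s=60))
```

With no hold-off the sketch changes source on every sample; with a 60-second hold-off it leaves the utility once and stays off until the reading has been stable.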
I relate this to show that no matter how good the power protection architecture is, multiple UPSes, twin feeds etc, shit can and does happen. We were lucky we had people on the site who knew what trouble sounds like and were willing to isolate the room.
So I'm willing to accept that BA lost a data center to power problems. But I'm not willing to accept that the loss of a single data center can shut down global operations. BA must have multiple redundant data centers with a seamless failover mechanism. And that is a failure of IT pure and simple.
"I was able to protect my puddly shit at my workplace with equipment I bought at Frys, so BA should have been able to protect its 12,000 servers just like I did."
Scaling up is hard. Just because you were able to do it with your install doesn't mean it would be just as easy for a larger install.
That said, they should have done a better job at BA. Even though testing power isn't part of a smaller DC's MO, it should be for a company the size of BA...at least in their dev environment.
If I read this right, they are claiming that putting a huge load on the system (bringing up power to too many servers at once) resulted in excessive voltage on the power rails.
In my understanding of physics, increasing the current usually results in reduced voltage. So where did the over-voltage come from?
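To put rough numbers on that: a minimal sketch, assuming a purely resistive model in which the supply has a fixed open-circuit voltage and a small series resistance. The values are illustrative only, and the model ignores inductive transients and whatever a UPS or generator regulator might do.

```python
def load_voltage(source_volts, source_ohms, load_amps):
    """Voltage at the load for an ideal source behind a series resistance."""
    return source_volts - load_amps * source_ohms


# Illustrative: 480 V supply with 0.02 ohm of source resistance.
for amps in (0, 250, 500):
    print(amps, "A ->", load_voltage(480.0, 0.02, amps), "V")
```

In this simple model more current can only pull the voltage down, never push it above the open-circuit value.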
Or are they saying their UPS generators were somehow incapable of limiting their output voltage? Pretty strange generators, not suitable for the task?
None of this sounds right, which is why I reject it outright as a CYA claim by the CEO. I expect that the technicians responsible for rebuilding have been told that, if they talk about it, they will find that their own jobs have been outsourced. But still, perhaps some anonymous leaks will happen.
It is if it is set up and administered right.
We did monthly failovers between different physical sites. A blown DC at one site wouldn't have made a difference.
Our failovers involved a couple of hours of on-call for about 150 staff. Most of the time only a half dozen were working, but a couple of times a year it would involve most of the staff (and a lot of IT people) for part of that. A database would be out of sync or messed up and that would fall to the IT staff to fix. It became less common over time.
Did you miss that they fixed the power problems and then the IT systems were messed up for a long time afterwards, indicating poor disaster planning and low staff skill?
A company as big as BA should have had a separate failover site and been doing regular failovers.
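The core of a setup like that can be sketched in a few lines. This is a toy illustration only: the site names are hypothetical, and real failover also involves data replication, DNS or traffic switching and a lot of runbook work that is not shown here.

```python
def choose_active(sites, healthy, current):
    """Keep the current site if it is healthy, else the first healthy one."""
    if current in healthy:
        return current
    for site in sites:  # sites is listed in priority order
        if site in healthy:
            return site
    raise RuntimeError("no healthy site available")


def drill_target(sites, current):
    """Next site in rotation for a scheduled failover drill."""
    return sites[(sites.index(current) + 1) % len(sites)]


sites = ["site-a", "site-b"]  # hypothetical site names
print(choose_active(sites, healthy={"site-b"}, current="site-a"))
print(drill_target(sites, current="site-a"))
```

Rotating the active site on a schedule, as in `drill_target`, is what makes the unplanned case routine: the path used in an emergency is the same one exercised every month.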
The utility providers for all of BA's major operations centers in England are on record as saying there were no power surges, anomalies, etc. This wasn't "we're unaware of..."; they all went back over their logs and categorically denied it (seems like they weren't happy about BA trying to pin any bit of this sh*t show on them). As many have pointed out above and elsewhere, none of this passes the sniff test. BA's taking a beating for this, not just over stranding passengers but over how they handled the stranded passengers. Many of their communications to passengers have failed to mention BA's obligations as well as the refunds and options passengers are legally entitled to. They even had the nerve to point most people to their toll customer service line instead of their toll-free one, charging people 35 pence/min to sit on hold while they were trying to get their travel plans sorted out. Even nickel-and-diming low-cost carriers (LCCs) like Ryanair aren't stupid enough to try something like that after a system-wide disruption.
"We were lucky we had people on the site who knew what trouble sounds like and were willing to isolate the room"
You weren't lucky; it's called having good, well-trained/practised staff on-site. And based on what everyone has been saying, this is something that was severely lacking at BA.