British Airways Says IT Collapse Came After Servers Damaged By Power Problem (reuters.com)

Posted by msmash on Wednesday May 31, 2017 @04:00AM from the no-backup dept.

A huge IT failure that stranded 75,000 British Airways passengers followed damage to servers that were overwhelmed when the power returned after an outage, the airline said on Wednesday. From a report: BA is seeking to limit the damage to its reputation and has apologised to customers after hundreds of flights were canceled over a long holiday weekend. The airline provided a few more details of the incident in its latest statement on Wednesday. While there was a power failure at a data center near London's Heathrow airport, the damage was caused by an overwhelming surge once the electricity was restored, it said. "There was a total loss of power at the data center. The power then returned in an uncontrolled way causing physical damage to the IT servers," BA said in a statement. "It was not an IT issue, it was a power issue."

3 of 189 comments

Power of the almighty dollar by mfh · 2017-05-31 04:04 · Score: 5, Informative

We all know that this outage was caused by bad-faith outsourcing to unqualified persons. Who are they kidding?
https://www.theguardian.com/bu...
Oh yeah, power surges are to blame! haha no.

--
The dangers of knowledge trigger emotional distress in human beings.

Re:Not IT... Riiiight... by Tailhook · 2017-05-31 04:17 · Score: 4, Informative

Not to mention failover to alternative sites.
These are transparent lies. The real issue is well known now, but it's uncomfortable for all involved, so they're making stuff up.

--
Maw! Fire up the karma burner!

Re:It _was_ an IT issue by Anonymous Coward · 2017-05-31 04:26 · Score: 5, Informative

BA has a DR site independent of the primary that suffered the power issue. But volume groups were not being mirrored correctly to the DR site. When they brought the DR site online, they were getting 3 or more destinations when scanning boarding passes. And since the integrity of the DR site was an issue, it could not be used.
Then the only option was to fix the primary DC, which would have involved installing new servers / routers / switches / etc, configuring them, restoring the data to the last known good state and then bringing it back online. Good luck to anyone trying to deploy new/replacement equipment en masse during the chaos of a disaster. And then restoring data!
That takes days, not hours... unlike whatever RTO/RPO they claimed to be able to meet.
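
To make the point above about mirroring and RPO concrete, here is a minimal sketch of the kind of check that could flag a DR site as unsafe before anyone tries to fail over to it. Everything in it is an assumption for illustration: the `VolumeGroupState` record, the `dr_problems` function, the field names, and the sample values are hypothetical and do not describe BA's actual systems or tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class VolumeGroupState:
    """Hypothetical snapshot of one mirrored volume group."""
    name: str
    last_write: datetime        # most recent write on the primary
    last_replicated: datetime   # most recent write confirmed at the DR site
    checksum_primary: str       # content checksum taken at the primary
    checksum_dr: str            # content checksum taken at the DR site


def dr_problems(groups, rpo):
    """Return (name, reason) for each volume group that is unsafe to fail over to."""
    problems = []
    for g in groups:
        lag = g.last_write - g.last_replicated
        if lag > rpo:
            # The DR copy is older than the recovery point objective allows.
            problems.append((g.name, f"replication lag {lag} exceeds RPO {rpo}"))
        elif lag == timedelta(0) and g.checksum_primary != g.checksum_dr:
            # The mirror reports itself as caught up, but the contents differ.
            problems.append((g.name, "checksum mismatch between primary and DR"))
    return problems


# Made-up sample data: one healthy group, one lagging, one silently corrupt.
t = datetime(2020, 1, 1, 9, 0)
groups = [
    VolumeGroupState("vg_ok", t, t, "a1", "a1"),
    VolumeGroupState("vg_lagging", t, t - timedelta(hours=2), "b1", "b0"),
    VolumeGroupState("vg_corrupt", t, t, "c1", "c9"),
]
for name, reason in dr_problems(groups, rpo=timedelta(minutes=15)):
    print(name, "->", reason)
```

The design point is that a check like this has to run continuously during normal operation. Discovering a bad mirror only when the DR site is brought online leaves rebuilding the primary as the only option.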