Slashdot Mirror


British Airways Says IT Collapse Came After Servers Damaged By Power Problem (reuters.com)

A huge IT failure that stranded 75,000 British Airways passengers followed damage to servers that were overwhelmed when the power returned after an outage, the airline said on Wednesday. From a report: BA is seeking to limit the damage to its reputation and has apologised to customers after hundreds of flights were canceled over a long holiday weekend. The airline provided a few more details of the incident in its latest statement on Wednesday. While there was a power failure at a data center near London's Heathrow airport, the damage was caused by an overwhelming surge once the electricity was restored, it said. "There was a total loss of power at the data center. The power then returned in an uncontrolled way causing physical damage to the IT servers," BA said in a statement. "It was not an IT issue, it was a power issue."

12 of 189 comments (clear)

  1. Not IT... Riiiight... by Anonymous Coward · · Score: 5, Insightful

    Pretty sure UPS's and backup power supplies kinda do fall under that...

  2. Power of the almighty dollar by mfh · · Score: 5, Informative

    We all know that this outage was caused by bad faith outsourcing to unqualified persons. Who are they kidding?

    https://www.theguardian.com/bu...

    Oh yeah, power surges are to blame! haha no.

    --
    The dangers of knowledge trigger emotional distress in human beings.
    1. Re:Power of the almighty dollar by sycodon · · Score: 5, Insightful

      This is what happens when you treat your IT staff like your Janitorial staff.

      --
      When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
  3. Direct cause by Anonymous Coward · · Score: 5, Insightful

    The power surge was the direct cause. The fundamental cause was the failure of management to ensure they had an appropriate disaster recovery plan.

  4. Redundant System by Roger+W+Moore · · Score: 5, Insightful

    Even if UPS and surge protection do not count, having a redundant system in a different data centre ready to take over regardless of the cause of the outage definitely does fall under IT. It is insane that a major company like BA did not have any such redundancy for such an important, mission critical application. It would have cost far less than the £100 million estimated cost of this incident not to mention avoiding the appalling publicity.

    1. Re: Redundant System by tysonedwards · · Score: 5, Funny

      Come on... It's apparent, the power surge was so severe it crossed the VPN Tunnels when they re-opened and traveled into another city and damaged those systems too!

      --
      Thirty four characters live here.
    2. Re:Redundant System by anegg · · Score: 5, Funny

      They obviously only got around to implementing the first half of their Disaster Recovery solution. They will implement the Recovery half next year.

  5. Re:"It wasn't me, it was the one armed man!" by TWX · · Score: 5, Insightful

    Yep.

    We have a Caterpillar generator the size of a schoolbus (and given its coloring I've had to restrain myself from sticking a stop-sign on the side as a prank) and a sophisticated transfer switch with power monitoring. When we lose power the batteries hold the DC over until the generator kicks in, and then when power is restored we do not switch back to grid immediately. I am not the person that deals with the power, but as I understand it, the generator and transfer switch monitors the grid for some time before switching back to grid, and there are power conditioners in between. On top of that, the system monitors grid power continuously and will intentionally island the system if there's a significant enough fault.

    This is not for something as critical as an airline's control system either. I do not find any reasonable excuse to blame power; you're supposed to assume that power is dirty and unreliable and to work around it.

    --
    Do not look into laser with remaining eye.
  6. Re:It _was_ an IT issue by Anonymous Coward · · Score: 5, Informative

    BA has a DR site independent of the primary that suffered the power issue. But volume groups were not being mirrored correctly to the DR site. When they brought the DR site online, they were getting 3 or more destinations when scanning boarding passes. And since the integrity of the DR site was an issue, it could not be used.

    Then the only option is to fix the primary DC, which would have involved installing new servers / routers / switches / etc, configuring them, restoring the data to the last known good state and then bringing it back online. Good luck to anyone trying to deploy new/replacement equipment en masse during the chaos of a disaster. And then restoring data!

    Takes days, not hours... unlike whatever RTO/RPO they claimed to be able to meet.

  7. Re:"It wasn't me, it was the one armed man!" by Anonymous Coward · · Score: 5, Interesting

    Sounds great...when it works. I bet you've never looked at the code that controls a big automated transfer switch. I have. It's a mess. It's so bad that the very first install Eaton did with our new model, which was in Digital Forest in Tukwila, WA near Seattle, we had three failures in the first ninety days due to bad software. It shut an entire data center down even though utility power was not down, battery power good, and generator working. The guy we dispatched the third time had spent two years in Uganda so he was experienced with bad power. He claimed that power from Seattle City Light was worse than Uganda. The power was so bad that the software in the ATS decided to disconnect everything.

    The second time power was restored, because of the bad software, it switched to generator power before the generator was running fully. The voltage dropped and took out quite a few older pieces of equipment and stalled the engine. In other words, the opposite problem BA had.

  8. Re:Don't UPSes also act as surge protectors? by bruce_the_loon · · Score: 5, Interesting

    They do, but some surge protection devices have a limited number of surges they can absorb before they have to be replaced. If there were a number of surges, it's certainly feasible for the protection chain to fail at some point.

    An anecdote from a few weeks ago with a data center I help manage. It has a backup generator, automatic switch gear and a Schneider Electric Galaxy double conversion UPS. Yes we don't have two, but we ain't an airline. We do have another data center on another site to take over if needed though.

    So a few weeks back our phones go wild with texts fired off by the UPS tossing SNMP traps around. One sprint later, the UPS console is showing no input power and our in-house electricians lay rubber from one end of the campus to the other to get to the sub in time. As we wait for the UPS to hit that magic 5 minutes when it triggers the auto-shutdown sequences on the servers, the sparkies discover the sub's output is fine and the generator isn't running.

    Then all shit breaks loose, ten power cycles on the UPS input, some lasting long enough to switch from battery to mains, some not. With ten minutes left on the batteries, the UPS gives up, shuts the inverter and charger down and switches the load to static bypass. Room goes silent except for the UPS alarms, and then the eleventh return cycle comes and goes in about three seconds. We hear PSU fans starting and then winding down. I dropped the master breaker on the DB and isolated the room from the UPS. Down until the sparkies figure it out. There goes three hours of our lives.

    Turns out that the automatic switch gear had some arc damage on the utility-side contactor feeding the control boards, probably caused by the eight months of load-shedding (read utility driven power cuts to ration power) we had experienced two years ago. That was enough to drop the voltage in one sensor to below the trigger threshold and caused that contactor and the main load contractor to open. Before it could start the generator up, the control board then decided the utility had returned, so it closed the contractors again. And open again, and close again. The sound of a 3-phase 480V 500A contactor switching twice a second is enough to make the sparkies use words a sailor would be proud of.

    We had to lock out the sensors, rig a temporary bypass on the contactors to power the room from the generator feed side and replace the damaged contactors before we were fully safe again. We lost 2 PSUs out of 90 and no data. We were lucky.

    I relate this to show that no matter how good the power protection architecture is, multiple UPSes, twin feeds etc, shit can and does happen. We were lucky we had people on the site who knew what trouble sounds like and were willing to isolate the room.

    So I'm willing to accept that BA lost a data center to power problems. But I'm not willing to accept that the loss of a single data center can shut down global operations. BA must have multiple redundant data centers with a seamless failover mechanism. And that is a failure of IT pure and simple.

    --
    Trying to become famous by taking photos. Visit my homepage please.
  9. Re:"It wasn't me, it was the one armed man!" by Anonymous Coward · · Score: 5, Interesting

    I worked in a center that had a big diesel-powered UPS unit the size of a shipping container. It was there about 3 years before we had a power outage. It detected it and span up, engaged the clutch and ... the drive belt snapped. Oops. Under voltage. So rev faster. Still undervoltage, so MOAR revs. Now, in addition to the power outage we've got a big UPS that's on fire.