Slashdot Mirror


British Airways Says IT Collapse Came After Servers Damaged By Power Problem (reuters.com)

A huge IT failure that stranded 75,000 British Airways passengers followed damage to servers that were overwhelmed when the power returned after an outage, the airline said on Wednesday. From a report: BA is seeking to limit the damage to its reputation and has apologised to customers after hundreds of flights were canceled over a long holiday weekend. The airline provided a few more details of the incident in its latest statement on Wednesday. While there was a power failure at a data center near London's Heathrow airport, the damage was caused by an overwhelming surge once the electricity was restored, it said. "There was a total loss of power at the data center. The power then returned in an uncontrolled way causing physical damage to the IT servers," BA said in a statement. "It was not an IT issue, it was a power issue."

32 of 189 comments

  1. Not IT... Riiiight... by Anonymous Coward · · Score: 5, Insightful

    Pretty sure UPSes and backup power supplies kinda do fall under that...

    1. Re:Not IT... Riiiight... by Tailhook · · Score: 4, Informative

      Not to mention fail over to alternative sites.

      These are transparent lies. The real issue is well known now, but it's uncomfortable for all involved so they're making stuff up.

      --
      Maw! Fire up the karma burner!
    2. Re:Not IT... Riiiight... by sycodon · · Score: 4, Funny

      Well, India has a notoriously unreliable electrical grid.

      --
      When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
    3. Re:Not IT... Riiiight... by ShanghaiBill · · Score: 4, Insightful

      Well, India has a notoriously unreliable electrical grid.

      If the power goes down daily or weekly, you learn to deal with it, and your backup generators and fail-over systems become robust. If power goes down once a decade, it causes bigger problems.

  2. Power of the almighty dollar by mfh · · Score: 5, Informative

    We all know that this outage was caused by bad faith outsourcing to unqualified persons. Who are they kidding?

    https://www.theguardian.com/bu...

    Oh yeah, power surges are to blame! haha no.

    --
    The dangers of knowledge trigger emotional distress in human beings.
    1. Re:Power of the almighty dollar by jellomizer · · Score: 4, Insightful

      A proper IT staff would have built in safeguards against power outages and power surges.
      For a company the size of British Airways I would expect that they would have a hot fail over in a different country. Or at least a different geographic location.

      In short they cheaped out on IT and now they are paying for it.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    2. Re:Power of the almighty dollar by Maxo-Texas · · Score: 4, Insightful

      An ill-considered plan to save a few dimes has cost them several dollars.

      The CEO should have foreseen this and should be let go. As should other executives who approved the offshoring plan.

      Offshoring can work, but excessive staffing cuts to save a few extra dollars are begging for something like this to happen.

      Infrastructure people should be located on site with the hardware and there should be multiple hardware systems *with* fail over testing on a monthly basis. (not quarterly. that fails. only monthly is often enough that the failover is seamless and there is a good argument for doing a daily failover.)

      --
      She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
    3. Re:Power of the almighty dollar by sycodon · · Score: 5, Insightful

      This is what happens when you treat your IT staff like your Janitorial staff.

      --
      When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
    4. Re:Power of the almighty dollar by AC5398 · · Score: 3, Interesting

      And yet, if you laid your janitorial staff off you'd be up to your neck in filth and garbage in no time at all.

      Management who don't rise through the ranks typically have absolutely no respect for the work that 'the ranks' perform.

  3. Direct cause by Anonymous Coward · · Score: 5, Insightful

    The power surge was the direct cause. The fundamental cause was the failure of management to ensure they had an appropriate disaster recovery plan.

  4. Redundant System by Roger+W+Moore · · Score: 5, Insightful

    Even if UPS and surge protection do not count, having a redundant system in a different data centre ready to take over regardless of the cause of the outage definitely does fall under IT. It is insane that a major company like BA did not have any such redundancy for such an important, mission critical application. It would have cost far less than the £100 million estimated cost of this incident not to mention avoiding the appalling publicity.

    1. Re: Redundant System by tysonedwards · · Score: 5, Funny

      Come on... It's apparent, the power surge was so severe it crossed the VPN Tunnels when they re-opened and traveled into another city and damaged those systems too!

      --
      Thirty four characters live here.
    2. Re:Redundant System by GameboyRMH · · Score: 4, Interesting

      This. The BA outage is the second most hilariously inept cause of an outage I've ever seen, after a local government office that was down for over a week because one rackmount server was dropped in transit.

      --
      "When information is power, privacy is freedom" - Jah-Wren Ryel
    3. Re:Redundant System by anegg · · Score: 5, Funny

      They obviously only got around to implementing the first half of their Disaster Recovery solution. They will implement the Recovery half next year.

  5. It _was_ an IT issue by matthiasvegh · · Score: 4, Insightful

    If the power hadn't come back at the datacenter, would that still be a power issue? If an earthquake destroys the datacenter, is that an earthquake issue? If your system collapses when a datacenter goes offline (for whatever reason), you're at fault, not the datacenter. This seems like a classic case of having a single point of failure.

    1. Re:It _was_ an IT issue by Anonymous Coward · · Score: 5, Informative

      BA has a DR site independent of the primary that suffered the power issue. But volume groups were not being mirrored correctly to the DR site. When they brought the DR site online, they were getting 3 or more destinations when scanning boarding passes. And since the integrity of the DR site was an issue, it could not be used.

      Then the only option is to fix the primary DC, which would have involved installing new servers / routers / switches / etc, configuring them, restoring the data to the last known good state and then bringing it back online. Good luck to anyone trying to deploy new/replacement equipment en masse during the chaos of a disaster. And then restoring data!

      Takes days, not hours... unlike whatever RTO/RPO they claimed to be able to meet.
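
      A rough sketch of the kind of check that would catch mirroring drift before the DR site is needed: compare a digest of each volume on the primary against its copy on the replica. The volume names and contents here are made up for illustration, and the dicts are stand-ins for real storage; nothing below reflects how BA's systems actually work.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a volume's contents."""
    return hashlib.sha256(data).hexdigest()


def find_mirror_drift(primary: dict, replica: dict) -> list:
    """Return sorted names of volumes that are missing from the replica
    or whose contents differ from the primary."""
    drifted = []
    for name, data in primary.items():
        if name not in replica or digest(replica[name]) != digest(data):
            drifted.append(name)
    return sorted(drifted)


# Hypothetical volumes: one has drifted, one is missing, one matches.
primary = {
    "vg_bookings": b"flight=XX000;gate=B32",
    "vg_crew": b"roster-v7",
    "vg_baggage": b"belt=4",
}
replica = {
    "vg_bookings": b"flight=XX000;gate=A10",
    "vg_crew": b"roster-v7",
}

print(find_mirror_drift(primary, replica))
```

      Running something like this on a schedule is cheap; the point is that a replica nobody verifies is only assumed to be a replica.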

    2. Re:It _was_ an IT issue by Chris+Mattern · · Score: 4, Insightful

      Okay, they weren't flaming incompetents that didn't have a failover site. They were flaming incompetents that had a failover site that didn't work, because apparently they never tested it. Glad we cleared that up.

  6. Re:"It wasn't me, it was the one armed man!" by TWX · · Score: 5, Insightful

    Yep.

    We have a Caterpillar generator the size of a schoolbus (and given its coloring I've had to restrain myself from sticking a stop-sign on the side as a prank) and a sophisticated transfer switch with power monitoring. When we lose power the batteries hold the DC over until the generator kicks in, and then when power is restored we do not switch back to grid immediately. I am not the person that deals with the power, but as I understand it, the generator and transfer switch monitors the grid for some time before switching back to grid, and there are power conditioners in between. On top of that, the system monitors grid power continuously and will intentionally island the system if there's a significant enough fault.

    This is not for something as critical as an airline's control system either. I do not find any reasonable excuse to blame power; you're supposed to assume that power is dirty and unreliable and to work around it.
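
    A minimal sketch of the "monitor the grid for some time before switching back" idea, as I understand it. The nominal voltage, tolerance band, and number of good samples required are all assumptions picked for illustration, not values from any real transfer switch.

```python
def retransfer_index(voltage_samples, nominal=230.0, tolerance=0.10,
                     stable_samples=5):
    """Return the index of the sample at which it is acceptable to switch
    back to grid power, or None if the grid never stays in tolerance
    for `stable_samples` consecutive readings."""
    low = nominal * (1.0 - tolerance)
    high = nominal * (1.0 + tolerance)
    run = 0  # consecutive in-tolerance readings seen so far
    for i, volts in enumerate(voltage_samples):
        if low <= volts <= high:
            run += 1
            if run >= stable_samples:
                return i
        else:
            run = 0  # any bad reading restarts the waiting period
    return None


# Grid comes back, spikes once, then settles.
samples = [0, 0, 231, 229, 260, 230, 231, 229, 230, 232]
print(retransfer_index(samples))  # 9
```

    The spike at the fifth reading resets the count, so the load stays on the generator until five clean readings in a row have been seen.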

    --
    Do not look into laser with remaining eye.
  7. Next excuse.... by __aaclcg7560 · · Score: 3, Funny

    "Those union electricians told us we could run all these servers without upgrading the circuit breakers. It's not an IT problem, it's a union problem!"

  8. Re:Don't UPSes also act as surge protectors? by Pascoea · · Score: 4, Funny

    How big a current spike was this?

    1.21 Jiggawatts, and it sent them back to 1985.

  9. Re:ID10Ts by Fire_Wraith · · Score: 4, Insightful

    Outsourcing is part of the problem, but you're right, it derives from the mentality that IT is a cost center that must be minimized at every possible turn. It's outdated thinking, going back to the days where if your office network went down, there'd be a bit of inconvenience, but the planes still flew, and it wasn't a big deal. Today, IT is a business critical area, because when your network goes down, the planes stop flying, and you stop making money, never mind the lingering effects from the terrible publicity or the angry customers. It's not something you can afford to skimp on, on any level.

    Unfortunately it will probably take several shocks like this, and some high level careers ending as a result, before they start to wise up.

  10. Sounds like an IT problem to me. by pz · · Score: 4, Interesting

    I worked as a dev for a pretty big social network company. We were a not-quite also-ran, peaking at Alexa 108 globally, and for a while we were beating the pants off of Facebook. This was in the pre-AWS days when startups still ran their own servers. Early on, we had apparent power failures on two successive Saturday nights. Right when our database scrubbing processes started.

    I suggested to our sysadmins that *maybe* it was because all of the disk heads were starting to move at once, and *maybe* it would go away if we staggered the processes across servers.

    Yep, problem solved. Our power feeds were rated for average power draw, not peak power draw on all servers in a rack, and peak power came when all of the disks started seeking simultaneously.

    It seems the same thing happened at BA, except no one thought to stagger-start the servers. For us, this was the first big system we ever built, so, OK, chalk it up to growing pains (and the problem never, ever happened again). But BA? Shame on them.
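
    To put rough numbers on the staggering argument, here is a small sketch. The wattages, the surge duration, and the server count are invented for illustration; the only claim is the shape of the result, namely that peak draw drops when the surges do not overlap.

```python
def peak_draw(start_offsets, steady_watts, surge_watts, surge_seconds):
    """Highest total draw (watts) at any whole second.

    Each server draws `steady_watts` except during the `surge_seconds`
    after its own start offset, when it draws `surge_watts`.
    """
    horizon = max(start_offsets) + surge_seconds + 1
    peak = 0
    for t in range(horizon):
        total = 0
        for start in start_offsets:
            surging = start <= t < start + surge_seconds
            total += surge_watts if surging else steady_watts
        peak = max(peak, total)
    return peak


servers = 10
all_at_once = [0] * servers                    # every job starts together
staggered = [i * 60 for i in range(servers)]   # one job per minute

print(peak_draw(all_at_once, 300, 450, 60))  # 4500
print(peak_draw(staggered, 300, 450, 60))    # 3150
```

    A feed sized for the average would trip in the first case and hold in the second, which matches what we saw.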

    --

    Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
  11. Re:No, Where we REALLY screwed up was this: by Tailhook · · Score: 4, Insightful

    How would this 'admission' make anyone more comfortable about this business?

    The business doesn't have to worry about that. It's safe regardless; too-big-to-fail public+private yada yada. This is BA we're talking about.

    These "stories" are just the public narrative writing process, guided to affix/deflect blame to/from the appropriate parties as the scapegoats are singled out. The BA execs know they have maybe 72 hours or so before this story falls out of the news cycle so they're using that window to make the headlines they need to muddy the waters. Until now the only narrative that has had any play is the "outsourcing did it" one, and that hits too close to management, so they're making this stuff up and putting it out through their MSM channels.

    --
    Maw! Fire up the karma burner!
  12. Re:"It wasn't me, it was the one armed man!" by Anonymous Coward · · Score: 5, Interesting

    Sounds great...when it works. I bet you've never looked at the code that controls a big automated transfer switch. I have. It's a mess. It's so bad that the very first install Eaton did with our new model, which was in Digital Forest in Tukwila, WA near Seattle, we had three failures in the first ninety days due to bad software. It shut an entire data center down even though utility power was not down, battery power good, and generator working. The guy we dispatched the third time had spent two years in Uganda so he was experienced with bad power. He claimed that power from Seattle City Light was worse than Uganda. The power was so bad that the software in the ATS decided to disconnect everything.

    The second time power was restored, because of the bad software, it switched to generator power before the generator was running fully. The voltage dropped and took out quite a few older pieces of equipment and stalled the engine. In other words, the opposite problem BA had.

  13. it's not our DC so we don't deal with the power by Joe_Dragon · · Score: 3, Funny

    it's not our DC so we don't deal with the power part it's the DC that we outsourced to that does the power part.

  14. Re:Don't UPSes also act as surge protectors? by bruce_the_loon · · Score: 5, Interesting

    They do, but some surge protection devices have a limited number of surges they can absorb before they have to be replaced. If there were a number of surges, it's certainly feasible for the protection chain to fail at some point.

    An anecdote from a few weeks ago with a data center I help manage. It has a backup generator, automatic switch gear and a Schneider Electric Galaxy double conversion UPS. Yes we don't have two, but we ain't an airline. We do have another data center on another site to take over if needed though.

    So a few weeks back our phones go wild with texts fired off by the UPS tossing SNMP traps around. One sprint later, the UPS console is showing no input power and our in-house electricians lay rubber from one end of the campus to the other to get to the sub in time. As we wait for the UPS to hit that magic 5 minutes when it triggers the auto-shutdown sequences on the servers, the sparkies discover the sub's output is fine and the generator isn't running.

    Then all shit breaks loose, ten power cycles on the UPS input, some lasting long enough to switch from battery to mains, some not. With ten minutes left on the batteries, the UPS gives up, shuts the inverter and charger down and switches the load to static bypass. Room goes silent except for the UPS alarms, and then the eleventh return cycle comes and goes in about three seconds. We hear PSU fans starting and then winding down. I dropped the master breaker on the DB and isolated the room from the UPS. Down until the sparkies figure it out. There goes three hours of our lives.

    Turns out that the automatic switch gear had some arc damage on the utility-side contactor feeding the control boards, probably caused by the eight months of load-shedding (read utility driven power cuts to ration power) we had experienced two years ago. That was enough to drop the voltage in one sensor to below the trigger threshold and caused that contactor and the main load contactor to open. Before it could start the generator up, the control board then decided the utility had returned, so it closed the contactors again. And open again, and close again. The sound of a 3-phase 480V 500A contactor switching twice a second is enough to make the sparkies use words a sailor would be proud of.

    We had to lock out the sensors, rig a temporary bypass on the contactors to power the room from the generator feed side and replace the damaged contactors before we were fully safe again. We lost 2 PSUs out of 90 and no data. We were lucky.

    I relate this to show that no matter how good the power protection architecture is, multiple UPSes, twin feeds etc, shit can and does happen. We were lucky we had people on the site who knew what trouble sounds like and were willing to isolate the room.

    So I'm willing to accept that BA lost a data center to power problems. But I'm not willing to accept that the loss of a single data center can shut down global operations. BA must have multiple redundant data centers with a seamless failover mechanism. And that is a failure of IT pure and simple.
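
    For anyone wondering how a sensor sitting right at its trigger threshold turns into a contactor chattering open and closed, here is a toy model. It is not a claim about how our switch gear's control board was programmed; the per-unit readings and thresholds are invented. It only shows why a single threshold chatters and why separate drop-out and pick-up thresholds (hysteresis) would not.

```python
def count_switches(readings, drop_out, pick_up):
    """Count contactor state changes for a series of sensed voltages.

    The contactor starts closed, opens when a reading falls below
    `drop_out`, and closes again once a reading reaches `pick_up`.
    """
    closed = True
    switches = 0
    for volts in readings:
        if closed and volts < drop_out:
            closed = False
            switches += 1
        elif not closed and volts >= pick_up:
            closed = True
            switches += 1
    return switches


# Sensed voltage (per unit) hovering around a 0.90 threshold.
hovering = [0.91, 0.89, 0.91, 0.89, 0.91, 0.89, 0.91]

print(count_switches(hovering, drop_out=0.90, pick_up=0.90))  # 6
print(count_switches(hovering, drop_out=0.85, pick_up=0.95))  # 0
```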

    --
    Trying to become famous by taking photos. Visit my homepage please.
  15. Re:"It wasn't me, it was the one armed man!" by Anonymous Coward · · Score: 5, Interesting

    I worked in a center that had a big diesel-powered UPS unit the size of a shipping container. It was there about 3 years before we had a power outage. It detected it and span up, engaged the clutch and ... the drive belt snapped. Oops. Under voltage. So rev faster. Still undervoltage, so MOAR revs. Now, in addition to the power outage we've got a big UPS that's on fire.

  16. Re:"It wasn't me, it was the one armed man!" by citylivin · · Score: 4, Insightful

    Until your voltage regulator starts dying and only gives your equipment 80 volts and no one notices the under voltage condition during normal maintenance and testing of the generator.

    The facilities maintenance people test the generators monthly, but it was not standard practice to test the voltage every single time the generator was tested.

    It is now.

    But the point is that systems fail in all sorts of fun ways in the real world. You learn, you change, you adapt, as I'm sure BA is doing. All it takes is one major incident to stop people from dragging their feet. I'm sure that is occurring now at British Airways.

    --
    As a potential lottery winner, I totally support tax cuts for the wealthy
  17. Re:"It wasn't me, it was the one armed man!" by Thelasko · · Score: 4, Insightful

    I am not the person that deals with the power, but as I understand it, the generator and transfer switch monitors the grid for some time before switching back to grid, and there are power conditioners in between.

    I used to design the diesel engines used in some of those systems, and have seen them in use. Although your system may monitor the grid to ensure reliability, it's most likely making sure it's not switching between two power sources that are out of phase.

    When we would connect one of our gensets to the power grid, we had to match the phase before we could close the switches. To do this, the engine speed was modified to run the generator at slightly above or below the frequency of the grid. If the phase wasn't matched, the power grid would try to force the generator into phase suddenly. It's assumed the power available from the grid is infinite in these types of systems. Therefore an incredible amount of current would flow through the generator and also provide a mechanical jerk to the engine if the switches were closed out of phase. Something will break in a spectacular fashion if this isn't done carefully.

    Honestly, this could be what happened to BA.
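
    A back-of-the-envelope sketch of the phase matching described above. Running the generator slightly off grid frequency makes the phase offset drift at 360 degrees per second for every hertz of difference, so you can work out how long until the two line up. The 10 degree closing window and the example frequencies are assumptions for illustration, not figures from any real synchroniser.

```python
def seconds_until_in_phase(grid_hz, gen_hz, gen_lead_deg):
    """Seconds until the generator's phase lead over the grid drifts to zero.

    Returns 0.0 if already in phase, or None if the two frequencies are
    equal, in which case the offset never changes.
    """
    lead = gen_lead_deg % 360.0
    if lead == 0.0:
        return 0.0
    slip_deg_per_s = 360.0 * (gen_hz - grid_hz)
    if slip_deg_per_s == 0.0:
        return None
    if slip_deg_per_s > 0:
        remaining = 360.0 - lead  # lead keeps growing until it wraps to 0
    else:
        remaining = lead          # lead shrinks toward 0
    return remaining / abs(slip_deg_per_s)


def safe_to_close(gen_lead_deg, window_deg=10.0):
    """True if the phase offset is within the assumed closing window."""
    lead = gen_lead_deg % 360.0
    return min(lead, 360.0 - lead) <= window_deg


# Generator 0.1 Hz fast and 90 degrees ahead: about 7.5 s until in phase.
print(seconds_until_in_phase(50.0, 50.1, 90.0))
print(safe_to_close(90.0), safe_to_close(355.0))
```

    Closing the switch at the 90 degree point instead of waiting is the "something will break" case.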

    --
    One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".
  18. Re:BIG DC power systems are not really IT guys mor by Maxo-Texas · · Score: 4, Interesting

    It is if it is set up and administered right.

    we did monthly failovers between different physical sites. A blown DC at one site wouldn't have made a difference.

    Our failovers involved a couple hours of oncall for about 150 staff. Most of the time only a half dozen were working, but a couple times a year it would involve most of the staff (and a lot of IT people) for part of that. A database would be out of sync or messed up and that would fall to the IT staff to fix. It became less common over time.

    Did you miss that they fixed the power problems and then the IT systems were messed up for a long time afterwards, indicating poor disaster planning and low staff skill?

    A company as big as BA, should have had a separate failover site and been doing regular failovers.

    --
    She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
  19. Re:Don't UPSes also act as surge protectors? by phorm · · Score: 3, Insightful

    "We were lucky we had people on the site who knew what trouble sounds like and were willing to isolate the room"

    You weren't lucky, it's called having good, well-trained/practised staff on-site. And based on what everyone has been saying this is something that was severely lacking at BA

  20. Re:"It wasn't me, it was the one armed man!" by kevmeister · · Score: 4, Interesting

    And sometimes **it happens.

    I worked as a Senior Network Engineer for a large national backbone provider to the US DOE. At the facilities we owned WE were in charge of oversight of the power system and regular testing. We had one experienced power engineer on staff to oversee everything, though the facility's plant engineering people did all of the actual heavy work.

    Back in 2009 we had just completed our annual full transfer test where we switched over to UPS, let the generator fire up, transferred to generator power, and then reversed the process. Everything worked perfectly. The following week we lost power. UPS kicked in, but the generator refused to start. One week earlier everything worked perfectly in the test case where we could have backed out before UPS died. No such luck that day. Our staff lost the ability to monitor the network and the laboratory where we were located lost Internet connectivity as did several other smaller facilities in the area. Took us about an hour to get a trailered generator in place and get things back on-line.

    No matter how carefully you plan and test, sometimes you still lose.

    --
    Kevin Oberman, Network Engineer, Retired