Slashdot Mirror


British Airways Says IT Collapse Came After Servers Damaged By Power Problem (reuters.com)

A huge IT failure that stranded 75,000 British Airways passengers followed damage to servers that were overwhelmed when the power returned after an outage, the airline said on Wednesday. From a report: BA is seeking to limit the damage to its reputation and has apologised to customers after hundreds of flights were canceled over a long holiday weekend. The airline provided a few more details of the incident in its latest statement on Wednesday. While there was a power failure at a data center near London's Heathrow airport, the damage was caused by an overwhelming surge once the electricity was restored, it said. "There was a total loss of power at the data center. The power then returned in an uncontrolled way causing physical damage to the IT servers," BA said in a statement. "It was not an IT issue, it was a power issue."

17 of 189 comments

  1. Not IT... Riiiight... by Anonymous Coward · · Score: 5, Insightful

    Pretty sure UPSes and backup power supplies kinda do fall under that...

    1. Re:Not IT... Riiiight... by ShanghaiBill · · Score: 4, Insightful

      Well, India has a notoriously unreliable electrical grid.

      If the power goes down daily or weekly, you learn to deal with it, and your backup generators and fail-over systems become robust. If power goes down once a decade, it causes bigger problems.

  2. "It wasn't me, it was the one armed man!" by Anonymous Coward · · Score: 2, Insightful

    "It was not an IT issue, it was a power issue."

    Assuming it was not a lightning strike, it's still your fuckup if "power issues" can damage/take down your IT.

    1. Re:"It wasn't me, it was the one armed man!" by TWX · · Score: 5, Insightful

      Yep.

      We have a Caterpillar generator the size of a schoolbus (and given its coloring I've had to restrain myself from sticking a stop-sign on the side as a prank) and a sophisticated transfer switch with power monitoring. When we lose power the batteries hold the DC over until the generator kicks in, and then when power is restored we do not switch back to grid immediately. I am not the person that deals with the power, but as I understand it, the generator and transfer switch monitor the grid for some time before switching back to grid, and there are power conditioners in between. On top of that, the system monitors grid power continuously and will intentionally island the system if there's a significant enough fault.

      This is not for something as critical as an airline's control system either. I do not find any reasonable excuse to blame power; you're supposed to assume that power is dirty and unreliable and to work around it.
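A minimal sketch of the "monitor the grid for some time before switching back" behaviour described above. It is not any vendor's transfer switch logic: the nominal values, tolerances, hold-off period, and the names `GridSample`, `grid_healthy`, and `retransfer_time` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# All thresholds below are illustrative assumptions, not values from the thread.
NOMINAL_VOLTS = 230.0    # assumed nominal supply voltage
NOMINAL_HZ = 50.0        # assumed nominal grid frequency
VOLT_TOLERANCE = 0.10    # accept +/-10% of nominal voltage
HZ_TOLERANCE = 0.5       # accept +/-0.5 Hz around nominal
STABLE_SECONDS = 300.0   # grid must stay healthy this long before retransfer


@dataclass
class GridSample:
    t: float      # time of the measurement, in seconds
    volts: float  # measured grid voltage
    hz: float     # measured grid frequency


def grid_healthy(sample):
    """True if a single measurement is within the assumed tolerances."""
    volts_ok = abs(sample.volts - NOMINAL_VOLTS) <= NOMINAL_VOLTS * VOLT_TOLERANCE
    hz_ok = abs(sample.hz - NOMINAL_HZ) <= HZ_TOLERANCE
    return volts_ok and hz_ok


def retransfer_time(samples):
    """Return the first time the grid has been continuously healthy for
    STABLE_SECONDS, or None if that never happens in the given samples."""
    healthy_since = None
    for sample in samples:
        if not grid_healthy(sample):
            healthy_since = None  # any bad reading restarts the hold-off timer
            continue
        if healthy_since is None:
            healthy_since = sample.t
        if sample.t - healthy_since >= STABLE_SECONDS:
            return sample.t
    return None
```

The point of the hold-off timer is that a single bad reading resets it, so a grid that flickers back on and off never triggers a switch back while the generator is still carrying the load.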

      --
      Do not look into laser with remaining eye.
    2. Re:"It wasn't me, it was the one armed man!" by citylivin · · Score: 4, Insightful

      Until your voltage regulator starts dying and only gives your equipment 80 volts and no one notices the undervoltage condition during normal maintenance and testing of the generator.

      The facilities maintenance people test the generators monthly, but it was not standard practice to test the voltage every single time the generator was tested.

      It is now.

      But the point is that systems fail in all sorts of fun ways in the real world. You learn, you change, you adapt, as I'm sure BA is doing. All it takes is one major incident to stop people from dragging their feet. I'm sure that is occurring now at British Airways.
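A minimal sketch of the kind of check that would catch the undervoltage case described above, if voltage readings were logged on every generator test. The 120 V nominal and 5% tolerance are assumptions (the comment only mentions the 80 V fault), and `check_generator_test` is a hypothetical helper, not part of any real maintenance tool.

```python
# Illustrative assumptions: the thread does not state the nominal voltage
# or the acceptable tolerance for this site.
NOMINAL_VOLTS = 120.0
TOLERANCE = 0.05  # accept readings within +/-5% of nominal


def check_generator_test(readings_volts):
    """Return (index, volts) for every reading outside the tolerance band.

    An empty list means every reading taken during the test was acceptable.
    """
    low = NOMINAL_VOLTS * (1 - TOLERANCE)
    high = NOMINAL_VOLTS * (1 + TOLERANCE)
    return [
        (i, volts)
        for i, volts in enumerate(readings_volts)
        if not low <= volts <= high
    ]
```

Under these assumptions, a test run that logged 119.5 V, 121.0 V, and then 80.0 V would flag only the last reading, which is the failure that went unnoticed when the monthly test only confirmed that the generator started.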

      --
      As a potential lottery winner, I totally support tax cuts for the wealthy
    3. Re:"It wasn't me, it was the one armed man!" by Thelasko · · Score: 4, Insightful

      I am not the person that deals with the power, but as I understand it, the generator and transfer switch monitor the grid for some time before switching back to grid, and there are power conditioners in between.

      I used to design the diesel engines used in some of those systems, and have seen them in use. Although your system may monitor the grid to ensure reliability, it's most likely making sure it's not switching between two power sources that are out of phase.

      When we would connect one of our gensets to the power grid, we had to match the phase before we could close the switches. To do this, the engine speed was modified to run the generator at slightly above or below the frequency of the grid. If the phase wasn't matched, the power grid would try to force the generator into phase suddenly. It's assumed the power available from the grid is infinite in these types of systems. Therefore an incredible amount of current would flow through the generator and also provide a mechanical jerk to the engine if the switches were closed out of phase. Something will break in a spectacular fashion if this isn't done carefully.

      Honestly, this could be what happened to BA.
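A minimal sketch of the phase-matching condition described above: only allow the switch to close when generator and grid agree closely enough in frequency, phase angle, and voltage. The limits and the names `phase_difference_deg` and `safe_to_close` are assumptions for illustration; real synchronizing equipment uses its own settings and logic.

```python
# Illustrative limits only; real sync-check settings depend on the installation.
MAX_PHASE_DEG = 10.0   # assumed maximum phase angle difference, in degrees
MAX_SLIP_HZ = 0.1      # assumed maximum frequency difference, in Hz
MAX_VOLT_DIFF = 0.05   # assumed maximum voltage difference, as a fraction of grid


def phase_difference_deg(gen_deg, grid_deg):
    """Signed angle from grid to generator, wrapped into [-180, 180)."""
    return (gen_deg - grid_deg + 180.0) % 360.0 - 180.0


def safe_to_close(gen_hz, grid_hz, gen_deg, grid_deg, gen_volts, grid_volts):
    """True only if frequency, phase, and voltage are all within the limits."""
    slip_ok = abs(gen_hz - grid_hz) <= MAX_SLIP_HZ
    phase_ok = abs(phase_difference_deg(gen_deg, grid_deg)) <= MAX_PHASE_DEG
    volts_ok = abs(gen_volts - grid_volts) <= grid_volts * MAX_VOLT_DIFF
    return slip_ok and phase_ok and volts_ok
```

Running the generator slightly faster or slower than the grid, as the comment describes, makes the phase difference drift slowly through zero; a check like this would simply wait for the moment all three conditions hold before allowing the switch to close.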

      --
      One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".
  3. Not an IT Issue by Anonymous Coward · · Score: 2, Insightful

    It absolutely is an IT issue if you cannot automatically recover from power events in a single data center...

  4. Direct cause by Anonymous Coward · · Score: 5, Insightful

    The power surge was the direct cause. The fundamental cause was the failure of management to ensure they had an appropriate disaster recovery plan.

  5. Re:Power of the almighty dollar by jellomizer · · Score: 4, Insightful

    A proper IT staff would have built in safeguards against power outages and power surges.
    For a company the size of British Airways I would expect that they would have a hot failover in a different country. Or at least a different geographic location.

    In short, they cheaped out on IT and now they are paying for it.

    --
    If something is so important that you feel the need to post it on the internet... It probably isn't that important.
  6. Redundant System by Roger+W+Moore · · Score: 5, Insightful

    Even if UPS and surge protection do not count, having a redundant system in a different data centre ready to take over regardless of the cause of the outage definitely does fall under IT. It is insane that a major company like BA did not have any such redundancy for such an important, mission-critical application. It would have cost far less than the £100 million estimated cost of this incident, not to mention avoiding the appalling publicity.

  7. It _was_ an IT issue by matthiasvegh · · Score: 4, Insightful

    If the power hadn't come back at the datacenter, would that still be a power issue? If an earthquake destroys the datacenter, is that an earthquake issue? If your system collapses when a datacenter goes offline (for whatever reason), you're at fault, not the datacenter. This seems like a classic case of having a single point of failure.

    1. Re:It _was_ an IT issue by Chris+Mattern · · Score: 4, Insightful

      Okay, they weren't flaming incompetents that didn't have a failover site. They were flaming incompetents that had a failover site that didn't work, because apparently they never tested it. Glad we cleared that up.

  8. Re:Power of the almighty dollar by Maxo-Texas · · Score: 4, Insightful

    An ill-considered plan to save a few dimes has cost them several dollars.

    The CEO should have foreseen this and should be let go. As should other executives who approved the offshoring plan.

    Offshoring can work, but excessive staffing cuts to save a few extra dollars are begging for something like this to happen.

    Infrastructure people should be located on site with the hardware, and there should be multiple hardware systems *with* failover testing on a monthly basis. (Not quarterly. That fails. Only monthly is often enough that the failover is seamless, and there is a good argument for doing a daily failover.)

    --
    She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
  9. Re:Power of the almighty dollar by sycodon · · Score: 5, Insightful

    This is what happens when you treat your IT staff like your Janitorial staff.

    --
    When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
  10. Re:ID10Ts by Fire_Wraith · · Score: 4, Insightful

    Outsourcing is part of the problem, but you're right, it derives from the mentality that IT is a cost center that must be minimized at every possible turn. It's outdated thinking, going back to the days where if your office network went down, there'd be a bit of inconvenience, but the planes still flew, and it wasn't a big deal. Today, IT is a business-critical area, because when your network goes down, the planes stop flying, and you stop making money, never mind the lingering effects from the terrible publicity or the angry customers. It's not something you can afford to skimp on, on any level.

    Unfortunately it will probably take several shocks like this, and some high level careers ending as a result, before they start to wise up.

  11. Re:No, Where we REALLY screwed up was this: by Tailhook · · Score: 4, Insightful

    How would this 'admission' make anyone more comfortable about this business?

    The business doesn't have to worry about that. It's safe regardless; too-big-to-fail public+private yada yada. This is BA we're talking about.

    These "stories" are just the public narrative writing process, guided to affix/deflect blame to/from the appropriate parties as the scapegoats are singled out. The BA execs know they have maybe 72 hours or so before this story falls out of the news cycle so they're using that window to make the headlines they need to muddy the waters. Until now the only narrative that has had any play is the "outsourcing did it" one, and that hits too close to management, so they're making this stuff up and putting it out through their MSM channels.

    --
    Maw! Fire up the karma burner!
  12. Re:Don't UPSes also act as surge protectors? by phorm · · Score: 3, Insightful

    "We were lucky we had people on the site who knew what trouble sounds like and were willing to isolate the room"

    You weren't lucky, it's called having good, well-trained/practised staff on-site. And based on what everyone has been saying, this is something that was severely lacking at BA.