British Airways Says IT Collapse Came After Servers Damaged By Power Problem (reuters.com)

Not IT... Riiiight... by Anonymous Coward · 2017-05-31 04:03 · Score: 5, Insightful

Pretty sure UPSes and backup power supplies kinda do fall under that...

Re:Not IT... Riiiight... by Tailhook · 2017-05-31 04:17 · Score: 4, Informative

Not to mention failover to alternative sites.
These are transparent lies. The real issue is well known now, but it's uncomfortable for all involved so they're making stuff up.

--
Maw! Fire up the karma burner!
Re:Not IT... Riiiight... by sycodon · 2017-05-31 04:25 · Score: 4, Funny

Well, India has a notoriously unreliable electrical grid.

--
When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
Re:Not IT... Riiiight... by HornWumpus · 2017-05-31 04:40 · Score: 2

When they _fire_ the CEO, CTO and Director of IT, they should publicly announce 'It wasn't a management issue, it was power.'

--
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
Re:Not IT... Riiiight... by ShanghaiBill · 2017-05-31 04:57 · Score: 4, Insightful

Well, India has a notoriously unreliable electrical grid.
If the power goes down daily or weekly, you learn to deal with it, and your backup generators and fail-over systems become robust. If power goes down once a decade, it causes bigger problems.
Re: Not IT... Riiiight... by thundercattt · 2017-05-31 08:52 · Score: 2

That was my first thought. A setup that size would have to have UPS backup setups upon backups. Baloney.
Re:Not IT... Riiiight... by rholtzjr · 2017-05-31 09:14 · Score: 2

I am pretty sure it was the lack of them in this case. Even if a power surge happens, PDU/UPS equipment pretty much handles any power-related issues. This sounds more like someone was dinking around in the data center and pulled/shorted the wrong wire(s). Even if this did not happen, PDU/UPS equipment was designed to prevent what happened, so yeah, it WAS AN IT PROBLEM.
Re:Not IT... Riiiight... by dgatwood · 2017-06-01 07:30 · Score: 1

The problem is, AFAIK, like most legacy airlines, all of those critical systems tie into a central mainframe. Redundancy is supposed to be built into the mainframe hardware itself, and replication is infeasible because the mainframe-based databases don't really provide that capability. And like many companies, rather than spend the extra money to increase IT staffing so that they can properly transition to redundant clusters that are actually long-term-maintainable, they took the "If it ain't broke, don't fix it" approach, assuming that as long as the system was working, nothing more than minimal maintenance would be required, and outsourced their IT.
I'd like to believe that this will force them to rethink that strategy and invest in systems that go between the front-end systems and the mainframe to let them slowly replicate it onto more standard hardware, along with people trained to handle such a large-scale IT task, but I'm not holding my breath. Give it another ten years, and it will happen again, but next time, the damage will be too severe and they'll be down for a year while they rebuild all their systems from scratch.

--
Check out my sci-fi/humor trilogy at PatriotsBooks.
Re:Not IT... Riiiight... by rholtzjr · 2017-06-01 13:53 · Score: 1

I already went through this once for a big insurance company back in 2003. We had set up a remote failover site that could take over within an hour at most in the event of a catastrophic failure. This is not really rocket science. This included a mainframe component (with data replication) and all the other supporting systems, ranging from Windows to UNIX systems with multiple flavors of databases (all replicating data). When the primary site went down for whatever reason (we assumed the worst to be total destruction of the primary site), the router detected the failure and rerouted all traffic to the remote site. We even went with two different carriers for network redundancy. The only single point of failure would have been if the entire network infrastructure in the country went down. But we still had functionality at a different location and could relocate any employees from anywhere in the country to the new site. We even practiced the failure event once a year.
No, this is just not doing due diligence on having a viable High Availability/Disaster Recovery response plan.
Re:Not IT... Riiiight... by DontBeAMoran · 2017-06-02 02:07 · Score: 1

O'BRIEN: In order to bring the system up to Starfleet code, I had to pull out the couplings to make room for a secondary backup.
GILORA: Starfleet code requires a second backup?
O'BRIEN: In case the first backup fails.
GILORA: What are the chances that a primary system and its backup would both fail at the same time?
O'BRIEN: It's not likely, but in a crunch, I wouldn't want to be caught without a second backup.

--
#DeleteFacebook
Re: Not IT... Riiiight... by Brockmire · 2017-06-02 08:43 · Score: 1

For future reference, it deserved a whoosh.

Power of the almighty dollar by mfh · 2017-05-31 04:04 · Score: 5, Informative

We all know that this outage was caused by bad faith outsourcing to unqualified persons. Who are they kidding?

https://www.theguardian.com/bu...

Oh yeah, power surges are to blame! haha no.

--
The dangers of knowledge trigger emotional distress in human beings.

Re:Power of the almighty dollar by wyHunter · 2017-05-31 04:10 · Score: 2

Pound?
Re:Power of the almighty dollar by jellomizer · 2017-05-31 04:13 · Score: 4, Insightful

A proper IT staff would have built in safeguards against power outages and power surges.
For a company the size of British Airways I would expect that they would have a hot failover in a different country. Or at least a different geographic location.
In short, they cheaped out on IT and now they are paying for it.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Re:Power of the almighty dollar by Maxo-Texas · 2017-05-31 04:26 · Score: 4, Insightful

An ill-considered plan to save a few dimes has cost them several dollars.
The CEO should have foreseen this and should be let go. As should other executives who approved the offshoring plan.
Offshoring can work- but excessive staffing cuts to save a few extra dollars are begging for something like this to happen.
Infrastructure people should be located on site with the hardware and there should be multiple hardware systems *with* failover testing on a monthly basis. (Not quarterly; that fails. Only monthly is often enough that the failover is seamless, and there is a good argument for doing a daily failover.)

--
She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
Re:Power of the almighty dollar by sycodon · 2017-05-31 04:26 · Score: 5, Insightful

This is what happens when you treat your IT staff like your Janitorial staff.

--
When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
Re:Power of the almighty dollar by rikkards · 2017-05-31 05:28 · Score: 1

This is what happens when you outsource: you lose control of what is outside your grasp and you take them at face value.
A CEO who did the right thing would have been pushed out before that, because he was costing the shareholders too much money.
Re:Power of the almighty dollar by AC5398 · 2017-05-31 05:30 · Score: 3, Interesting

And yet, if you laid your janitorial staff off you'd be up to your neck in filth and garbage in no time at all.
Management who don't rise through the ranks typically have absolutely no respect for the work that 'the ranks' perform.
Re:Power of the almighty dollar by sycodon · 2017-05-31 06:06 · Score: 1

This is true.

--
When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
Re:Power of the almighty dollar by Anonymous Coward · 2017-05-31 07:04 · Score: 1

Contractors can certainly be a part of the failure even if it was an alleged power issue. Managers love to shop for an answer they like. At one company I worked at, we had an infrastructure architect who designed one heck of a great network with the proper level of redundancy and failover capability. The senior managers loathed paying for equipment and capacity that mostly sat idle when not used for surge demand or brought up during maintenance windows. When things did fail, everything happened transparently and no one ever saw so much as a blip in availability.
So what did the manager do? Fired the architect and brought in a contract company. Things work perfectly, right? What's the harm in saving a few bucks? The south Asian contractors said yes and nodded their heads to every idea to cut costs here and there. After all, they wanted to keep the customer happy. Then one night, there was a catastrophic failure. Sure enough, things crumbled and cratered. Rather than having in-house folks with knowledge and skin in the game to get things running again, indifferent techs 10 time zones away took their time trying to figure things out. The 5-day outage cost a hell of a lot more money than was saved. The company never really recovered and has been going downhill ever since.
Re:Power of the almighty dollar by MangoCats · 2017-05-31 08:56 · Score: 1

Why do you think it took them this long to come out with an explanation?
Re:Power of the almighty dollar by billybiro · 2017-06-01 19:30 · Score: 1

This is what happens when you treat your IT staff like your Janitorial staff.
It's often worse than this. I rarely see Janitorial staff having to "make do" with old threadbare mops and having to wash floors with water with no detergent. They are usually very well stocked for the specific things that they need to get their jobs done. IT staff? Not so much.
Re:Power of the almighty dollar by billybiro · 2017-06-01 19:34 · Score: 1

The CEO should have foreseen this and should be let go. As should other executives who approved the offshoring plan.
Yes, they'll probably be "let go". Right after they collect their multi million golden parachute, receive a hearty slap on the back from fellow members of the old boys network and walk right into another high paying CxO position in another multi-national organisation with nary a drop of ink to blot their copybook.
Re: Power of the almighty dollar by Brockmire · 2017-06-02 08:49 · Score: 1

In case you missed it, this is about outsourcing, not layoffs. Janitors get changed like underwear. No one notices. It's not exactly a skill, so much as no better option.

"It wasn't me, it was the one armed man!" by Anonymous Coward · 2017-05-31 04:04 · Score: 2, Insightful

"It was not an IT issue, it was a power issue."

Assuming it was not a lightning strike, it's still your fuckup if "power issues" can damage/take down your IT.

Re:"It wasn't me, it was the one armed man!" by TWX · 2017-05-31 04:19 · Score: 5, Insightful

Yep.
We have a Caterpillar generator the size of a schoolbus (and given its coloring I've had to restrain myself from sticking a stop-sign on the side as a prank) and a sophisticated transfer switch with power monitoring. When we lose power the batteries hold the DC over until the generator kicks in, and then when power is restored we do not switch back to grid immediately. I am not the person that deals with the power, but as I understand it, the generator and transfer switch monitor the grid for some time before switching back to grid, and there are power conditioners in between. On top of that, the system monitors grid power continuously and will intentionally island the system if there's a significant enough fault.
This is not for something as critical as an airline's control system either. I do not find any reasonable excuse to blame power; you're supposed to assume that power is dirty and unreliable and to work around it.
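A rough sketch of the "wait before switching back" behaviour described above, in Python. Everything here is an assumption for illustration: the class name `RetransferTimer`, the 480 V nominal, the 10% tolerance window and the 300-second hold are made up, and a real transfer switch controller looks at far more than one voltage reading.

```python
from dataclasses import dataclass


@dataclass
class RetransferTimer:
    """Hypothetical model: stay on generator until the grid has been
    within tolerance for `hold_s` consecutive seconds."""
    nominal_v: float = 480.0   # assumed nominal voltage
    tolerance: float = 0.10    # assumed +/-10% acceptance window
    hold_s: float = 300.0      # assumed stability period before retransfer
    stable_for: float = 0.0
    on_generator: bool = True

    def grid_ok(self, volts: float) -> bool:
        return abs(volts - self.nominal_v) <= self.tolerance * self.nominal_v

    def sample(self, volts: float, dt: float) -> str:
        """Feed one grid voltage reading covering `dt` seconds."""
        if self.grid_ok(volts):
            self.stable_for += dt
        else:
            # Any bad reading resets the clock and islands the load.
            self.stable_for = 0.0
            self.on_generator = True
        if self.on_generator and self.stable_for >= self.hold_s:
            self.on_generator = False
        return "generator" if self.on_generator else "grid"
```

The point of the hold period is that a grid which comes back for a few seconds and drops again never gets the load back; the counter simply restarts.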

--
Do not look into laser with remaining eye.
Re:"It wasn't me, it was the one armed man!" by jellomizer · 2017-05-31 04:25 · Score: 1

A proper IT infrastructure can deal with a direct lightning strike as well.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Re:"It wasn't me, it was the one armed man!" by Anonymous Coward · 2017-05-31 04:40 · Score: 5, Interesting

Sounds great...when it works. I bet you've never looked at the code that controls a big automated transfer switch. I have. It's a mess. It's so bad that on the very first install Eaton did with our new model, at Digital Forest in Tukwila, WA near Seattle, we had three failures in the first ninety days due to bad software. It shut an entire data center down even though utility power was not down, battery power was good, and the generator was working. The guy we dispatched the third time had spent two years in Uganda so he was experienced with bad power. He claimed that power from Seattle City Light was worse than Uganda. The power was so bad that the software in the ATS decided to disconnect everything.
The second time power was restored, because of the bad software, it switched to generator power before the generator was running fully. The voltage dropped and took out quite a few older pieces of equipment and stalled the engine. In other words, the opposite of the problem BA had.
Re:"It wasn't me, it was the one armed man!" by TWX · 2017-05-31 04:42 · Score: 1

Depends. Unfortunately airline stock tends to perform almost regardless of what an airline does simply because when people need to travel there are only so many options and among all airlines across the planet there are only so many seats going so many directions. As long as people want or need to travel the airlines will generate revenue, even those that make terrible mistakes or do terrible things to passengers from time to time, so long as they manage to get flying again.

--
Do not look into laser with remaining eye.
Re:"It wasn't me, it was the one armed man!" by drinkypoo · 2017-05-31 04:44 · Score: 2

The guy we dispatched the third time had spent two years in Uganda so he was experienced with bad power. He claimed that power from Seattle City Light was worse than Uganda. The power was so bad that the software in the ATS decided to disconnect everything.
Probably true. When the first grid-tie inverters were invented, they kept shutting themselves off because as it turned out, the utilities were totally incapable of producing power as clean as they claimed they were, and as they were demanding that the inverter provide. Making better power than utilities in the US is trivial.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:"It wasn't me, it was the one armed man!" by Anonymous Coward · 2017-05-31 05:45 · Score: 5, Interesting

I worked in a center that had a big diesel-powered UPS unit the size of a shipping container. It was there about 3 years before we had a power outage. It detected it and spun up, engaged the clutch and ... the drive belt snapped. Oops. Undervoltage. So rev faster. Still undervoltage, so MOAR revs. Now, in addition to the power outage we've got a big UPS that's on fire.
Re:"It wasn't me, it was the one armed man!" by citylivin · 2017-05-31 06:06 · Score: 4, Insightful

Until your voltage regulator starts dying and only gives your equipment 80 volts and no one notices the undervoltage condition during normal maintenance and testing of the generator.
The facilities maintenance people test the generators monthly, but it was not standard practice to test the voltage every single time the generator was tested.
It is now.
But the point is that systems fail in all sorts of fun ways in the real world. You learn, you change, you adapt, as I'm sure BA is doing. All it takes is one major incident to stop people from dragging their feet. I'm sure that is occurring now at British Airways.

--
As a potential lottery winner, I totally support tax cuts for the wealthy
Re:"It wasn't me, it was the one armed man!" by amorsen · 2017-05-31 06:29 · Score: 1

[..]sophisticated transfer switch with power monitoring[..]
Those break. Way more than they should. Often with interesting results that aren't just "power went off".
And you fundamentally can't make them redundant. You can have two of them on completely separate feeds of course, feeding into different power supplies on the servers. That sometimes helps, except when the overvoltage is sufficiently great to get through the protections of the power supply.

--
Finally! A year of moderation! Ready for 2019?
Re:"It wasn't me, it was the one armed man!" by Thelasko · 2017-05-31 06:47 · Score: 4, Insightful

I am not the person that deals with the power, but as I understand it, the generator and transfer switch monitor the grid for some time before switching back to grid, and there are power conditioners in between.
I used to design the diesel engines used in some of those systems, and have seen them in use. Although your system may monitor the grid to ensure reliability, it's most likely making sure it's not switching between two power sources that are out of phase.

When we would connect one of our gensets to the power grid, we had to match the phase before we could close the switches. To do this, the engine speed was modified to run the generator at slightly above or below the frequency of the grid. If the phase wasn't matched, the power grid would try to force the generator into phase suddenly. It's assumed the power available from the grid is infinite in these types of systems. Therefore an incredible amount of current would flow through the generator and also provide a mechanical jerk to the engine if the switches were closed out of phase. Something will break in a spectacular fashion if this isn't done carefully.
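The "match the phase before closing the switches" check above can be sketched in a few lines of Python. This is only an illustration: the function names and the limits (0.1 Hz of slip, 10 degrees of phase error) are assumptions, not values from any real synchronizer, and a real sync check would also compare voltage magnitudes.

```python
def phase_error_deg(gen_phase_deg: float, grid_phase_deg: float) -> float:
    """Smallest signed angle from grid to generator, in (-180, 180]."""
    diff = (gen_phase_deg - grid_phase_deg) % 360.0
    return diff - 360.0 if diff > 180.0 else diff


def safe_to_close(gen_hz: float, grid_hz: float,
                  gen_phase_deg: float, grid_phase_deg: float,
                  max_slip_hz: float = 0.1,       # assumed limit
                  max_angle_deg: float = 10.0     # assumed limit
                  ) -> bool:
    """Permit closing only when frequency and phase are both near the grid's."""
    slip = abs(gen_hz - grid_hz)
    angle = abs(phase_error_deg(gen_phase_deg, grid_phase_deg))
    return slip <= max_slip_hz and angle <= max_angle_deg
```

Running the generator slightly fast or slow, as described, makes the phase error drift slowly through zero, and the switch is closed while `safe_to_close` holds.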

Honestly, this could be what happened to BA.

--
One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".
Re:"It wasn't me, it was the one armed man!" by aaarrrgggh · 2017-05-31 07:30 · Score: 1

Sounds like 365 Main: the problem there was multiple small blips which were each too small to start the engine, but in aggregate depleted the flywheel below the minimum needed to start the engine.
Re:"It wasn't me, it was the one armed man!" by TWX · 2017-05-31 07:38 · Score: 2

We test monthly. It's also a way to replenish the fuel before it becomes nonviable.

--
Do not look into laser with remaining eye.
Re:"It wasn't me, it was the one armed man!" by nwf · 2017-05-31 08:43 · Score: 1

A proper IT infrastructure can deal with a direct lightning strike as well.
At what cost? I doubt it's worth it for most businesses. There are too many disasters to plan for: lightning, flood, earthquake, tornado, high winds, several combined. It's probably impossible to protect against everything unless you have Federal Government money.
I've yet to see a surge suppression system that's affordable to a mid-scale business that can take a direct hit, anyway. Plus you get EM induced voltage that fries networking and other stuff, including the power system control circuitry. I've seen that myself.

--
I don't know, but it works for me.
Re:"It wasn't me, it was the one armed man!" by gweihir · 2017-05-31 09:41 · Score: 1

Yep.
We have a Caterpillar generator the size of a schoolbus (and given its coloring I've had to restrain myself from sticking a stop-sign on the side as a prank) and a sophisticated transfer switch with power monitoring. When we lose power the batteries hold the DC over until the generator kicks in, and then when power is restored we do not switch back to grid immediately. I am not the person that deals with the power, but as I understand it, the generator and transfer switch monitor the grid for some time before switching back to grid, and there are power conditioners in between. On top of that, the system monitors grid power continuously and will intentionally island the system if there's a significant enough fault.
This is not for something as critical as an airline's control system either. I do not find any reasonable excuse to blame power; you're supposed to assume that power is dirty and unreliable and to work around it.
That is how it is done. It is well-known that power often comes back up "unclean" after a failure.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:"It wasn't me, it was the one armed man!" by gweihir · 2017-05-31 09:42 · Score: 1

It is in fact a standard scenario.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:"It wasn't me, it was the one armed man!" by phorm · 2017-05-31 10:11 · Score: 2

Strange, in the last place I worked with a big DC, they regularly tested the generator (I think monthly, and even from floors away you could *hear* it), and UPS systems. In my five years there, I'd not heard of an outage due to any of the many power failures in our area.
Re:"It wasn't me, it was the one armed man!" by kevmeister · 2017-05-31 11:27 · Score: 4, Interesting

And sometimes **it happens.
I worked as a Senior Network Engineer for a large national backbone provider to the US DOE. At the facilities we owned WE were in charge of oversight of the power system and regular testing. We had one experienced power engineer on staff to oversee everything, though the facility's plant engineering people did all of the actual heavy work.
Back in 2009 we had just completed our annual full transfer test where we switched over to UPS, let the generator fire up, transferred to generator power, and then reversed the process. Everything worked perfectly. The following week we lost power. UPS kicked in, but the generator refused to start. One week earlier everything worked perfectly in the test case where we could have backed out before UPS died. No such luck that day. Our staff lost the ability to monitor the network and the laboratory where we were located lost Internet connectivity as did several other smaller facilities in the area. Took us about an hour to get a trailered generator in place and get things back on-line.
No matter how carefully you plan and test, sometimes you still lose.

--
Kevin Oberman, Network Engineer, Retired
Re:"It wasn't me, it was the one armed man!" by dbIII · 2017-05-31 13:58 · Score: 1

Indeed - good idea.
Sometimes Murphy is still against you.
A power station I did some work at had a 20MW emergency generator (old jet engine) to kick things off (conveyors and crushers require a lot of juice) and it was tested monthly for around 25 years and maintained carefully. The only time it was needed (due to a fairly rare set of circumstances) it didn't work. A second one was installed later as a backup to the backup but neither was needed again for the remaining life of the power station.
I think those two little 20MW gas turbines, which were adjusted to run on natural gas, are still in use from time to time to cover peaks. They may be old but running time is what matters.
Re:"It wasn't me, it was the one armed man!" by dbIII · 2017-05-31 14:00 · Score: 1

But the point is that systems fail in all sorts of fun ways in the real world. You learn, you change, you adapt, as I'm sure BA is doing.
Yes, and by outsourcing to someone who has not learned your lessons you have to get through all those mistakes a second time.
Re:"It wasn't me, it was the one armed man!" by tibit · 2017-06-01 02:35 · Score: 1

It's almost as if they could save money by pushing their shit out to a competent datacenter. I don't really see why BA, of all people, needs their own. Their shit could run on AWS just fine I'm sure. It's not magic. They deal with the same shit they dealt with 5 decades ago when it ran on slow big iron.

--
A successful API design takes a mixture of software design and pedagogy.
Re:"It wasn't me, it was the one armed man!" by marcansoft · 2017-06-02 02:42 · Score: 1

All of those disasters are trivial to plan for.
Step 1: have two datacenters in different locations
Step 2: test that you can fail over to the other site regularly
That's it. That takes care of every single disaster you have listed, with one solution. There is no excuse to not have two sites for a company as big as BA.
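The two steps above can be sketched as a toy drill in Python. The site names and the `is_healthy` callback are hypothetical stand-ins; a real setup would probe actual services and move real traffic, which this does not attempt.

```python
def pick_active(sites, is_healthy):
    """Return the first healthy site in priority order, or None."""
    for site in sites:
        if is_healthy(site):
            return site
    return None


def failover_drill(sites, is_healthy):
    """Pretend each site in turn is down and record which site takes over."""
    results = {}
    for down in sites:
        results[down] = pick_active(
            sites, lambda s: s != down and is_healthy(s)
        )
    return results


if __name__ == "__main__":
    sites = ["dc-a", "dc-b"]          # hypothetical site names
    print(failover_drill(sites, lambda s: True))
```

Any `None` in the drill result means losing that one site takes everything down, which is exactly what a regular test is supposed to catch before a real outage does.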
Re:"It wasn't me, it was the one armed man!" by CharlieG · 2017-06-02 06:19 · Score: 1

Back when I was a kid, I got taken on a tour of a "high reliability" data center. Mind you, this is in the era of drum memory. The setup for this company was, frankly, insane.
Walked into the data center - TWO mainframes, side by side, totally redundant. I thought that was cool, as either machine could cover for the other, with NO loss in performance. The guy laughed and said "Well, there is another data center with two machines across town, so if this building goes down, we're good" (this was NYC). He said there were another two data centers in London, Chicago, Frankfurt, Berlin, Tokyo, Hong Kong, and finally Alice Springs, ALL redundant. Yes, performance would take a hit if too many went down, but he was saying the DR plan included limited global nuclear war, which was why there was a data center in Alice Springs. I gather there were regular tests. Never heard of them losing any data.

--
-- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
Re:"It wasn't me, it was the one armed man!" by Agripa · 2017-06-02 08:24 · Score: 1

When we would connect one of our gensets to the power grid, we had to match the phase before we could close the switches.
EE here. I've only dealt with small to medium generators - you know, 500W to 100kW. I've never heard of connecting a backup generator to the power grid. The small to medium sized transfer switches most absolutely definitely do not do this, and in fact have large gaps between the contacts to ensure that never happens. I can't envision why anyone would want to try to sync a generator to the power grid, unless you're part of and powering the grid, but never for a backup generator.
They might synchronize but not ever connect to the grid to support large rotating loads. All of my UPSes either always synchronize or in some cases have a selectable optional mode where synchronization is not done.

Not an IT Issue by Anonymous Coward · 2017-05-31 04:06 · Score: 2, Insightful

It absolutely is an IT issue if you cannot automatically recover from power events in a single data center...

Don't UPSes also act as surge protectors? by Jamlad · 2017-05-31 04:06 · Score: 1

How big a current spike was this? Don't UPSes act as surge protectors and filters too?

Re:Don't UPSes also act as surge protectors? by Pascoea · 2017-05-31 04:24 · Score: 4, Funny

How big a current spike was this?
1.21 Jiggawatts, and it sent them back to 1985.
Re:Don't UPSes also act as surge protectors? by ChumpusRex2003 · 2017-05-31 04:27 · Score: 2

They should do, but it depends a lot on the precise design of the UPS, and the nature of the power transient.

While many industrial UPS systems are dual conversion systems (essentially, the critical load is powered from the battery bus/inverter, and fails over to mains in the event of an inverter/battery malfunction), they are sometimes operated in standby mode (the critical load is powered from mains, and fails over to the battery bus/inverter in the event of a mains failure) as this saves energy due to improved energy efficiency and lower cooling demand in this mode.

Even so, dual conversion UPS systems are not necessarily immune to mains voltage fluctuation (even when operated in dual conversion mode) - depending on whether they try to follow mains voltage, or whether the voltage transient exceeds design limits.

If you are interested in some of the dynamics of this, it's worth looking at the incident at the Forsmark nuclear power plant in Sweden. In this case, unexpectedly large grid voltage fluctuations resulted in the double conversion UPSs suffering an output bus overvoltage, which resulted in triggering of output overvoltage protection and disconnection of the critical loads. A less well protected device could have exposed critical loads to a prolonged overvoltage. This incident required particular design changes for nuclear grade UPS systems, such that mains voltage fluctuations, even beyond the anticipated range, should not result in a critical load disconnection.
Re: Don't UPSes also act as surge protectors? by Anonymous Coward · 2017-05-31 04:29 · Score: 1

The BA chairman was quoted as saying "Great Scott, getting a Tesla to go 85 MPH is no easy task."
Re:Don't UPSes also act as surge protectors? by Salgak1 · 2017-05-31 04:31 · Score: 2

Great Scott!!
Re: Don't UPSes also act as surge protectors? by Anonymous Coward · 2017-05-31 04:36 · Score: 2, Funny

85 mph won't cut it. Gotta get that baby up to 88!
Re:Don't UPSes also act as surge protectors? by CrAlt · 2017-05-31 05:05 · Score: 2

>UPS undersized
>Power fails, UPS quickly die
>power comes back or comes back with problems (open neutral, flipped phase, overvoltage, etc.)
>idiots try and bring back everything at once
>UPS trips from inrush from cold start
>or UPS says there is a power problem they ignore
>idiots flip big lever from "UPS" to "BYPASS"
>all protection...bypassed
Boom
I've seen this scenario play out a few times.

--
I have to return some videotapes...
Re:Don't UPSes also act as surge protectors? by bruce_the_loon · 2017-05-31 05:33 · Score: 5, Interesting

They do, but some surge protection devices have a limited number of surges they can absorb before they have to be replaced. If there were a number of surges, it's certainly feasible for the protection chain to fail at some point.
An anecdote from a few weeks ago with a data center I help manage. It has a backup generator, automatic switch gear and a Schneider Electric Galaxy double conversion UPS. Yes we don't have two, but we ain't an airline. We do have another data center on another site to take over if needed though.
So a few weeks back our phones go wild with texts fired off by the UPS tossing SNMP traps around. One sprint later, the UPS console is showing no input power and our in-house electricians lay rubber from one end of the campus to the other to get to the sub in time. As we wait for the UPS to hit that magic 5 minutes when it triggers the auto-shutdown sequences on the servers, the sparkies discover the sub's output is fine and the generator isn't running.
Then all shit breaks loose, ten power cycles on the UPS input, some lasting long enough to switch from battery to mains, some not. With ten minutes left on the batteries, the UPS gives up, shuts the inverter and charger down and switches the load to static bypass. Room goes silent except for the UPS alarms, and then the eleventh return cycle comes and goes in about three seconds. We hear PSU fans starting and then winding down. I dropped the master breaker on the DB and isolated the room from the UPS. Down until the sparkies figure it out. There goes three hours of our lives.
Turns out that the automatic switch gear had some arc damage on the utility-side contactor feeding the control boards, probably caused by the eight months of load-shedding (read: utility-driven power cuts to ration power) we had experienced two years ago. That was enough to drop the voltage in one sensor to below the trigger threshold and caused that contactor and the main load contactor to open. Before it could start the generator up, the control board then decided the utility had returned, so it closed the contactors again. And open again, and close again. The sound of a 3-phase 480V 500A contactor switching twice a second is enough to make the sparkies use words a sailor would be proud of.
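For anyone wondering why a reading sitting right at a trigger threshold makes a contactor chatter like that, here is a small Python illustration of the usual countermeasure, hysteresis (separate drop-out and restore thresholds). This is not how that particular control board works; the class name and the per-unit thresholds are assumptions chosen only to show the effect.

```python
class UtilitySensor:
    """Hypothetical 'utility present' detector with separate thresholds."""

    def __init__(self, drop_below=0.85, restore_above=0.95):
        self.drop_below = drop_below        # assumed per-unit drop-out level
        self.restore_above = restore_above  # assumed per-unit restore level
        self.present = True

    def update(self, per_unit_volts):
        if self.present and per_unit_volts < self.drop_below:
            self.present = False
        elif not self.present and per_unit_volts > self.restore_above:
            self.present = True
        return self.present


def count_transitions(sensor, readings):
    """Count how many times the sensor's decision flips over the readings."""
    count = 0
    state = sensor.present
    for reading in readings:
        new_state = sensor.update(reading)
        if new_state != state:
            count += 1
        state = new_state
    return count
```

With both thresholds set to the same value, a reading wobbling around it flips the decision on nearly every sample; with a gap between them, the same wobble causes no flips at all, while a genuine drop and recovery still registers as two clean transitions.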
We had to lock out the sensors, rig a temporary bypass on the contactors to power the room from the generator feed side and replace the damaged contactors before we were fully safe again. We lost 2 PSUs out of 90 and no data. We were lucky.
I relate this to show that no matter how good the power protection architecture is, multiple UPSes, twin feeds etc, shit can and does happen. We were lucky we had people on the site who knew what trouble sounds like and were willing to isolate the room.
So I'm willing to accept that BA lost a data center to power problems. But I'm not willing to accept that the loss of a single data center can shut down global operations. BA must have multiple redundant data centers with a seamless failover mechanism. And that is a failure of IT pure and simple.

--
Trying to become famous by taking photos. Visit my homepage please.
Re:Don't UPSes also act as surge protectors? by phorm · 2017-05-31 10:13 · Score: 3, Insightful

"We were lucky we had people on the site who knew what trouble sounds like and were willing to isolate the room"
You weren't lucky; it's called having good, well-trained/practised staff on-site. And based on what everyone has been saying, this is something that was severely lacking at BA.
Re:Don't UPSes also act as surge protectors? by dbIII · 2017-05-31 14:06 · Score: 1

BA must have multiple redundant data centers with a seamless failover mechanism
While that's how you would do it, the reality of what BA did seems to be less sensible.
Re:Don't UPSes also act as surge protectors? by Anne+Thwacks · 2017-05-31 18:39 · Score: 1

BA must have multiple redundant data centers with a seamless failover mechanism.
AND test them every month.
Not just to make sure the hardware works, but to make sure EVERYONE involved knows what a failure of electrical failover looks like, that the dual-redundancy also works, and that everyone knows where all the controls are and what they do (hardware and software).
It may be expensive relative to your pay grade, but in the greater scheme of the world's largest airlines, it's totally piddling. Have you seen the price of a new aeroplane? Or even the annual fuel bill for flying a transatlantic route? Hell, you could buy a complete redundant data centre for the cost of flying Trump across DC! (Unless you buy your data centres from the wrong guy).

--
Sent from my ASR33 using ASCII

Direct cause by Anonymous Coward · 2017-05-31 04:07 · Score: 5, Insightful

The power surge was the direct cause. The fundamental cause was the failure of management to ensure they had an appropriate disaster recovery plan.

Re:Direct cause by Anonymous Coward · 2017-05-31 05:53 · Score: 1

British Airways is up and running today. Their disaster recovery plan has worked just fine. What they were lacking was a sufficient incident response plan.
Re:Direct cause by Anonymous Coward · 2017-05-31 06:46 · Score: 1

You mean business continuity. Sounds like their Disaster Recovery Plan worked fine. They are online.

No, Where we REALLY screwed up was this: by bbsguru · 2017-05-31 04:13 · Score: 1

So instead of being incompetent at software, they are claiming to be incompetent at hardware.
And the difference is...?

Anyone whose Server Farm can be brought down from a power outage does NOT know what they are doing, or care enough about it to bother.

How would this 'admission' make anyone more comfortable about this business?

Re:No, Where we REALLY screwed up was this: by Tailhook · 2017-05-31 04:39 · Score: 4, Insightful

How would this 'admission' make anyone more comfortable about this business?
The business doesn't have to worry about that. It's safe regardless; too-big-to-fail public+private yada yada. This is BA we're talking about.
These "stories" are just the public narrative writing process, guided to affix/deflect blame to/from the appropriate parties as the scapegoats are singled out. The BA execs know they have maybe 72 hours or so before this story falls out of the news cycle so they're using that window to make the headlines they need to muddy the waters. Until now the only narrative that has had any play is the "outsourcing did it" one, and that hits too close to management, so they're making this stuff up and putting it out through their MSM channels.

--
Maw! Fire up the karma burner!

Poor disaster recovery plan. by bob4u2c · 2017-05-31 04:14 · Score: 1

Power issues of this kind are IT issues.

When designing a server location you must take power into consideration; i.e., do I have enough battery to keep all critical servers and supporting hardware up until the generator has kicked in, plus extra just in case the generator has a glitch or two of its own? Is the battery rated at the correct surge protection to keep systems from glitching when the power does return? Is the generator more than enough to power everything between re-fuelings? Is the generator rated enough to run everything at less than 80% load? Have I staged non-critical servers and equipment to power down and power back up to spread the need for power out? Is there a backup facility that I can spin up or switch to? For this kind of operation you would want to switch to a new site in minutes to prevent business loss.

Again, these are IT problems.

Now a battery going dead, a power supply frying, circuit breakers tripping; these are power issues. Poor disaster recovery plans are not.
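To put rough numbers on the checklist above, here is a minimal sketch. Everything in it is an assumption for illustration: `check_power_budget` is a hypothetical helper, the loads and ratings are made up, and real sizing has to account for inverter losses, battery ageing and inrush current, which this ignores.

```python
def check_power_budget(load_kw, generator_kw, battery_kwh,
                       generator_start_s, max_generator_load=0.80,
                       safety_factor=3.0):
    """Return simple pass/fail checks for a server room power design.

    All parameters are illustrative assumptions:
      load_kw            critical load the UPS has to carry
      generator_kw       continuous rating of the generator
      battery_kwh        usable battery energy at that load
      generator_start_s  time for the generator to start and accept load
      safety_factor      multiple of the start time the battery must cover,
                         to allow for a glitch or two during start-up
    """
    # Battery runtime in seconds at the given load (idealised).
    runtime_s = battery_kwh / load_kw * 3600.0
    required_s = generator_start_s * safety_factor
    generator_fraction = load_kw / generator_kw
    return {
        "battery_runtime_s": runtime_s,
        "battery_ok": runtime_s >= required_s,
        "generator_load_fraction": generator_fraction,
        "generator_ok": generator_fraction <= max_generator_load,
    }


# Made-up example: 100 kW of load, 150 kW generator, 10 kWh of battery.
result = check_power_budget(load_kw=100, generator_kw=150,
                            battery_kwh=10, generator_start_s=60)
print(result)
```

The 80% ceiling comes straight from the question in the comment; the safety factor of 3 is only a placeholder for "plus extra just in case".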

Redundant System by Roger+W+Moore · 2017-05-31 04:15 · Score: 5, Insightful

Even if UPS and surge protection do not count, having a redundant system in a different data centre ready to take over regardless of the cause of the outage definitely does fall under IT. It is insane that a major company like BA did not have any such redundancy for such an important, mission critical application. It would have cost far less than the £100 million estimated cost of this incident not to mention avoiding the appalling publicity.

Re: Redundant System by tysonedwards · 2017-05-31 04:24 · Score: 5, Funny

Come on... It's apparent, the power surge was so severe it crossed the VPN Tunnels when they re-opened and traveled into another city and damaged those systems too!

--
Thirty four characters live here.
Re:Redundant System by Anonymous Coward · 2017-05-31 04:30 · Score: 1

DR sites are useless when production data isn't being mirrored over correctly. That's one of the issues here.
Re: Redundant System by sound+vision · 2017-05-31 05:18 · Score: 1

"An innovation in Power-over-IP"
Re:Redundant System by GameboyRMH · 2017-05-31 05:41 · Score: 4, Interesting

This. The BA outage is the second most hilariously inept cause of an outage I've ever seen, after a local government office that was down for over a week because one rackmount server was dropped in transit.

--
"When information is power, privacy is freedom" - Jah-Wren Ryel
Re:Redundant System by bmk67 · 2017-05-31 06:54 · Score: 1

...which is also an IT issue.
Re: Redundant System by oobayly · 2017-05-31 07:05 · Score: 1

What about the RBS banking outage?
https://en.m.wikipedia.org/wik...
Re:Redundant System by lactose99 · 2017-05-31 07:10 · Score: 1

Then it's not DR; one of the core facets of DR includes testing it to ensure recovery is actually recovery.

--
Fully licensed blockchain psychiatrist
Re:Redundant System by anegg · 2017-05-31 07:19 · Score: 5, Funny

They obviously only got around to implementing the first half of their Disaster Recovery solution. They will implement the Recovery half next year.
Re:Redundant System by aaarrrgggh · 2017-05-31 07:21 · Score: 1

I would give 10:1 odds that they had a voltage dip, transferred to generator and failed coming back because their batteries were no good. Unclear if they lost power once or twice, or if it was the servers auto restarting, but the kind of damage they allude to typically is when you are 7 years into your "10-year" VRLA batteries. Also known as cost cutting...
Re:Redundant System by Afty0r · 2017-05-31 07:50 · Score: 2

It would have cost far less than the £100 million estimated cost of this incident
I agree that they should do it, but it is unlikely that the one-off cost of implementing always-on redundant systems would be this cheap, the scale and scope of the IT systems involved in the airline industry is enormous and it's likely it would cost significantly more than that. There are also ongoing costs to consider. Source: Work in software development, have seen projects in organisations way smaller and simpler than British Airways with projected costs higher than that for less benefit.
Re:Redundant System by MangoCats · 2017-05-31 08:53 · Score: 1

Ticketing and scheduling systems are not life-safety critical, therefore they don't get the budget for double redundancy. It's aerospace-think come to the company comptroller's office, imposed on IT that made this failure happen.
Also, that "£100 million estimated cost of this incident" is less than the development, rollout and ten years of additional maintenance costs of a full double redundant geographically diverse scheduling system. For an industry that can't even make a baggage sorting system work properly, there's a lot of fear about "upgrading" anything. Historically, big failures like this come around less than once a decade.
Re:Redundant System by gweihir · 2017-05-31 09:33 · Score: 2

It is pretty clear that BA leadership screwed up massively here and yes, it is most decidedly an IT problem. The described power-outage scenario is a complete standard one and competent planning prepares for it. Now they are trying to misdirect (i.e. lie) in order to make it appear like this was a natural disaster and of course, they could not have done anything about that. Dishonorable, untrue, but nicely demonstrates the defective characters of the people in power at BA.
The only right thing to do is kick out the ones responsible (including the CEO) with a performance review that makes sure they never get any other leadership position. Otherwise these people will continue to do damage.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Redundant System by gweihir · 2017-05-31 09:34 · Score: 1

"most hilariously inept" covers it well.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Redundant System by Zaelath · 2017-05-31 10:40 · Score: 2

It's always fun trying to sell a DR failover test to a 24/7 company.
- So what's this for?
- To make sure you can recover quickly in the event of a disaster.
- What if that fails, worst case?
- Well, your warm site fails to take over, so you have a planned outage now instead of an unplanned outage later.
- How long an outage?
- Well, if we fail to bring up the warm site, and fail to fall back to the current production site, there may be some lost transactions and we'd need to shut down long enough to make sure the databases are correct. Say, a day?
- You want to shut down for a day? How likely is this disaster we're talking about?
- Well, you have RAID, clustering, UPS, generator backup, probably a 1 in 10 year event?
- Ok, then no. We'll take the day outage in 10 years.
- Wait.. I said ...
- Thanks, goodbye.
Re:Redundant System by LinuxIsGarbage · 2017-05-31 10:46 · Score: 1

Doesn't everyone use closed transition (make before break) transfer switches these days? Failing that even with shit batteries I'd think a break before make transfer switch should be able to be absorbed by weak batteries on a double conversion UPS, or by power supplies on the computer hardware running with UPS in bypass (or a standby UPS).
I have seen the following oddities with emergency power devices (oddly all Eaton):
~10 years ago an Eaton(IIRC) closed transition transfer switch with a firmware bug. There was planned work taking utility power down, so the generator (2MVA) was fired up, load shifted to generator, then when utility lost power it crashed and the load lost generator power.
Eaton Powerware... not the 9155 but looks close to it, maybe 9355?. Anyways 10kVA double conversion UPS with external bypass switch and included 460-120 stepdown transformer:
-With utility failure shit batteries cause the breaker to trip shortly after switching to battery, and the UPS to fault. Upon restoration of utility power, the load fails to come up until manual intervention on the UPS.
-Eaton service guy comes to service the UPS. Places the external bypass switch to bypass (I think it was on transition to bypass, may have been transition from bypass), UPS dumps the load. Bypass should have been activated on the soft keys on the menu to match the desired state before throwing the manual switch. If it was going back to normal, I think it was switching from bypass, to service is fine (which allows control power to the UPS), allow it to power up, put the soft menu to bypass, then move the switch to normal, then move soft keys from bypass to normal.
Regardless even an event like that shouldn't cause a multi-hour failure.
Re:Redundant System by HornWumpus · 2017-05-31 12:35 · Score: 1

I know a utility that saved a little money on the power that ran the electric natural gas pumps for their biggest generator (pumps were in another utility's area) by putting the pumps on a curtailable contract (like 'peak corps' and other programs that cut off your AC power when demand is highest).
So that means, when the demand was highest, that utility shut down the supply pumps to PG&^H^H^H a large utility's biggest generator. Brilliant!

--
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
Re:Redundant System by wlj · 2017-05-31 12:47 · Score: 1

I remember hearing about something like this happening at the University I went to - but it happened back in the 1960s! Does anybody use motor-generator systems any more?
Re:Redundant System by AHuxley · 2017-05-31 14:49 · Score: 1

"hospitals backup power to be reviewed" (Oct 11, 2016)
"... failed in the blackout because of a fuel pump issue"
http://www.news.com.au/nationa...

--
Domestic spying is now "Benign Information Gathering"
Re:Redundant System by stoatwblr · 2017-06-01 02:13 · Score: 1

The powercos in the area have come out and categorically stated there was no form of power hit, dip or other problem on the public side of the meters.
Re:Redundant System by stoatwblr · 2017-06-01 02:20 · Score: 1

"Ticketing and scheduling systems are not life-safety critical"
Loading, aircraft balancing (centre of gravity) and fuel load calculations ARE.
All of these were affected. Plus BA's entire VOIP system.
Re:Redundant System by RockDoctor · 2017-06-01 05:28 · Score: 1

costs of a full double redundant geographically diverse scheduling system.
The system(s) under discussion were not geographically diverse. They're all somewhere a little SW of London - probably fairly close to but not at Heathrow. Just one data centre.
Cheapskates.

--
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
Re:Redundant System by RockDoctor · 2017-06-01 05:48 · Score: 1

Loading, aircraft balancing (centre of gravity) and fuel load calculations ARE.
And unless I misunderstood the lessons of that plane which came down because of loading X pounds of fuel which the pilot logged as X kilograms of fuel ... it is the absolute duty of the pilot commanding to personally check such calculations. Whether they do it with an abacus, or with a piece of chalk on the runway, it's their responsibility to check the sums are right. And if the IT system that usually does it is up shit creek, then you-the-pilot still have the personal responsibility to do the calculations some other way and get it right. Or don't take off. Don't even taxi anywhere.
Planes have weight sensors in the wheel mounts. Because that is the data that the pilot needs for doing those calculations.
The flight incident (a.k.a "good landing") was the Gimli Glider.

--
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
Re:Redundant System by dgatwood · 2017-06-01 07:18 · Score: 1

Ticketing and scheduling systems are not life-safety critical ...
Tell that to the cancer patient who misses his/her chemo appointment at some regional cancer center because of a flight cancellation. When you have enough people depending on a service, if an outage lasts long enough, every service becomes life-safety critical, though perhaps somewhat less so in a country like England that is geographically small enough to not depend on air travel to get from one part of the country to another (and even less so post-Brexit).

--
Check out my sci-fi/humor trilogy at PatriotsBooks.
Re:Redundant System by aaarrrgggh · 2017-06-01 10:06 · Score: 1

Almost every time we have seen major issues, it is auto-restart of systems followed by a power failure during boot. (Almost) Everything is designed to survive the first hard crash, but crashing during boot or initialization often leaves systems in an unstable condition.

Taking hours to do an orderly restoration is not uncommon; even for a simple system you might need to bring your network core up first, restore DHCP services, get domain controllers up and synchronized, and audit all your filesystems before secondary systems boot... It becomes an order of magnitude more complex when you have some hardware failures and data corruption. Still shows they had a crap business continuity plan, but hey...

(While Eaton makes crap transfer switches, their UPSs are generally better than Schneider/APC/MGE or Liebert/Emerson/Vertiv in my experience. Closed transition switches, even when properly commissioned and tested, can create a number of issues, especially when you have neutrals on the switch as you would in Europe.)
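The orderly restoration described above is really a dependency ordering problem, so here is a small sketch of it. The dependency map is an assumption that mirrors the order listed in the comment (network core, DHCP, domain controllers, filesystem audit, secondary systems); a real site would have far more nodes and a less tidy graph.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service lists what must be up first.
dependencies = {
    "network_core": [],
    "dhcp": ["network_core"],
    "domain_controllers": ["network_core", "dhcp"],
    "filesystem_audit": ["domain_controllers"],
    "secondary_systems": ["filesystem_audit"],
}


def restore_order(deps):
    """Return a start-up order in which every service follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = restore_order(dependencies)
for step, service in enumerate(order, start=1):
    print(f"{step}. bring up {service}")
```

Having the order written down as data, rather than living in someone's head, is the kind of thing that helps when the restore is happening during a second power failure mid-boot.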
Re:Redundant System by OffTheWallSoccer · 2017-06-02 02:20 · Score: 1

Well said!
Re:Redundant System by stoatwblr · 2017-06-12 03:32 · Score: 1

Yes, but the BA systems were supposed to measure mass of everything _before_ loading and allow sensible placement to ensure CoG limits are OK, etc.
Having to do it the long way can take hours if you're totally dependent on the automated system and have to break out manual methods, now multiply that out by a few hundred flights and you've got severe gate gridlock.
As you say, you can't even taxi until you have this all sorted, and if you can't do that _until_ all the baggage, freight and squishies are onboard - when the system won't let them on because boarding passes aren't being printed - then you have a version of a Mexican standoff on your hands.
Re:Redundant System by RockDoctor · 2017-06-14 06:38 · Score: 1

Having to do it the long way can take hours if you're totally dependent on the automated system and have to break out manual methods,
I didn't say it would be easy.
In my business, I have done the job manually in the past, and will probably have to do it again in the future. It's harder - for sure - but doable. It scares the pants off the sprogs (those with less than 20 years experience) when their computers go down leaving them with a sharp pencil and a notepad (dead-tree variety). And they protest that the rest of the operation will have to stop until their computerised data-acquisition system gets fixed (which they're not taught how to do - it's got to be a technician visit) ... which their Boss promptly revokes when told that if the rest of the operation stops working, their company will be billed for the down-time. Which would be between 1 and 2 million USD/day. Then they learn pretty damned quick.

--
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"

It _was_ an IT issue by matthiasvegh · 2017-05-31 04:16 · Score: 4, Insightful

If the power hadn't come back at the datacenter, would that still be a power issue? If an earthquake destroys the datacenter, is that an earthquake issue? If your system collapses when a datacenter goes offline (for whatever reason), you're at fault, not the datacenter. This seems like a classic case of having a single point of failure.

Re:It _was_ an IT issue by Anonymous Coward · 2017-05-31 04:26 · Score: 5, Informative

BA has a DR site independent of the primary that suffered the power issue. But volume groups were not being mirrored correctly to the DR site. When they brought the DR site online, they were getting 3 or more destinations when scanning boarding passes. And since the integrity of the DR site was an issue, it could not be used.
Then the only option is to fix the primary DC, which would have involved installing new servers / routers / switches / etc, configuring them, restoring the data to the last known good state and then bringing it back online. Good luck to anyone trying to deploy new/replacement equipment en masse during the chaos of a disaster. And then restoring data!
Takes days, not hours... unlike whatever RTO/RPO they claimed to be able to meet.
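For what a routine mirror check might look like, here is a toy sketch. It is not how BA's storage replication works (nothing in the thread says what they ran); the volume group names and contents are invented, and `find_mirror_drift` is a hypothetical helper that just compares checksums of two in-memory copies.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a blob."""
    return hashlib.sha256(data).hexdigest()


def find_mirror_drift(primary: dict, replica: dict) -> list:
    """Return names of volumes that are missing or different on the replica."""
    drifted = []
    for name, data in primary.items():
        if name not in replica or digest(replica[name]) != digest(data):
            drifted.append(name)
    return sorted(drifted)


# Invented example data: one volume is stale and one was never copied.
primary = {"vg_bookings": b"LHR->JFK", "vg_crew": b"roster-v2", "vg_bags": b"tags"}
replica = {"vg_bookings": b"LHR->JFK", "vg_crew": b"roster-v1"}

print(find_mirror_drift(primary, replica))
```

The point is only that drift between primary and DR is cheap to detect if somebody actually runs the comparison on a schedule, long before the DR site is needed.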
Re:It _was_ an IT issue by Chris+Mattern · 2017-05-31 04:30 · Score: 4, Insightful

Okay, they weren't flaming incompetents that didn't have a failover site. They were flaming incompetents that had a failover site that didn't work, because apparently they never tested it. Glad we cleared that up.
Re:It _was_ an IT issue by fluffernutter · 2017-06-01 01:13 · Score: 1

But volume groups were not being mirrored correctly to the DR site.
Ok I know I'm late to the party but that tells me that it was an IT issue right there.

--
Laws are rules for the court, but merely a bottom bar to hit for life. Think beyond laws in your actions always.
Re:It _was_ an IT issue by stoatwblr · 2017-06-01 02:27 · Score: 1

"But volume groups were not being mirrored correctly to the DR site. When they brought the DR site online, they were getting 3 or more destinations when scanning boarding passes"
The interesting part is that this part of the problem started happening on FRIDAY - around 18 hours BEFORE the total outage caused by the supposed power outage/surge.

Next excuse.... by __aaclcg7560 · 2017-05-31 04:23 · Score: 3, Funny

"Those union electricians told us we could run all these servers without upgrading the circuit breakers. It's not an IT problem, it's a union problem!"

Re:Next excuse.... by IMightB · 2017-05-31 04:44 · Score: 1

Don't really think you can blame this on unions.... what's your agenda? Most of the time issues like this are caused by management thinking "we already spent X million dollars for server clusters on one site. But it costs X much more for each server to have dual power supplies, then X much more for each DataCenter power bus and redundant backups, then you're telling me we have to spend X times 2 for an additional DataCenter?!?!?! and testing etc etc. I thought this is what a HA cluster is for!"
Re:Next excuse.... by IMightB · 2017-05-31 07:04 · Score: 1

Umm yeah, reread it indeed WOOSH... I'm going to blame it on Obama... Thanks Obama!

ID10Ts by U8MyData · 2017-05-31 04:26 · Score: 2

Really? So, they are completely illustrating that their IT efforts are a "cost center" and that IT is a "necessary evil" that they provide minimal effort to. Everyone knows that a serious "Data Center" has multiple protective measures in place, so who is this service provider? I wonder how they treat their aircraft? This is so blatantly obvious it hurts those who know IT. Forget about the outsourcing questions.

Re:ID10Ts by Fire_Wraith · 2017-05-31 04:34 · Score: 4, Insightful

Outsourcing is part of the problem, but you're right, it derives from the mentality that IT is a cost center that must be minimized at every possible turn. It's outdated thinking, going back to the days where if your office network went down, there'd be a bit of inconvenience, but the planes still flew, and it wasn't a big deal. Today, IT is a business critical area, because when your network goes down, the planes stop flying, and you stop making money, never mind the lingering effects from the terrible publicity or the angry customers. It's not something you can afford to skimp on, on any level.

Unfortunately it will probably take several shocks like this, and some high level careers ending as a result, before they start to wise up.

But what *was* an IT issue... by Chris+Mattern · 2017-05-31 04:26 · Score: 1

was the fact that you apparently have no redundancy on extremely mission-critical servers.

You have *got* to be kidding... by whitroth · 2017-05-31 04:31 · Score: 1

Every server wasn't connected to a UPS? And the return of the power overwhelmed the UPSes?

And just how did management decide to "save money" on the power for the servers?

Re:You have *got* to be kidding... by whoever57 · 2017-05-31 04:56 · Score: 1

And the return of the power overwhelmed the UPSes?
No, they appear to be saying that turning the power on somehow damaged the computers:
"The power then returned in an uncontrolled way causing physical damage to the IT servers"
I don't believe this. The CEO is just protecting his own ass after outsourcing IT.

--
The real "Libtards" are the Libertarians!
Re:You have *got* to be kidding... by sconeu · 2017-05-31 05:29 · Score: 1

No, they appear to be saying that turning the power on somehow damaged the computers:
"The power then returned in an uncontrolled way causing physical damage to the IT servers"

Right, which says that the servers weren't connected to UPSes. Because if they were, then the UPS would have filtered the power surge.

--
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
Re:You have *got* to be kidding... by whoever57 · 2017-05-31 06:34 · Score: 2

If I read this right, they are claiming that putting a huge load on the system (bringing up power to too many servers at once) resulted in excessive voltage on the power rails.
In my understanding of physics, increasing the current usually results in reduced voltage. So where did the over-voltage come from?
Or are they saying that their UPS generators were somehow incapable of limiting their output voltage? Pretty strange generators, not suitable for the task?
None of this sounds right, which is why I reject it outright as a CYA claim by the CEO. I expect that the technicians responsible for rebuilding have been told that, if they talk about it, they will find that their own jobs have been outsourced. But still, perhaps some anonymous leaks will happen.

--
The real "Libtards" are the Libertarians!

Sounds like an IT problem to me. by pz · 2017-05-31 04:36 · Score: 4, Interesting

I worked as a dev for a pretty big social network company. We were a not-quite also-ran, peaking at Alexa 108 globally, and for a while we were beating the pants off of Facebook. This was in the pre-AWS days when startups still ran their own servers. Early on, we had apparent power failures on two successive Saturday nights. Right when our database scrubbing processes started.

I suggested to our sysadmins that *maybe* it was because all of the disk heads were starting to move at once, and *maybe* it would go away if we staggered the processes across servers.

Yep, problem solved. Our power feeds were rated for average power draw, not peak power draw on all servers in a rack, and peak power came when all of the disks started seeking simultaneously.

It seems the same thing happened at BA, except no one thought to stagger-start the servers. For us, this was the first big system we ever built, so, OK, chalk it up to growing pains (and the problem never, ever happened again). But BA? Shame on them.
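A minimal sketch of the staggering idea, for anyone who wants to see it spelled out. The server names, batch size and delay are made up, and `staggered_schedule` is a hypothetical helper; the same approach applies whether the thing being staggered is a disk scrub or a power-on after an outage.

```python
def staggered_schedule(servers, batch_size, delay_s):
    """Yield (start_offset_seconds, batch) so only batch_size servers start together."""
    for i in range(0, len(servers), batch_size):
        yield (i // batch_size) * delay_s, servers[i:i + batch_size]


# Hypothetical fleet of ten database servers, four at a time, two minutes apart.
servers = [f"db{n:02d}" for n in range(1, 11)]
schedule = list(staggered_schedule(servers, batch_size=4, delay_s=120))

for offset, batch in schedule:
    print(f"t+{offset:>4}s: start scrub on {', '.join(batch)}")
```

The batch size should come from the headroom between average and peak draw on the feed, which is exactly the number we had not looked at.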

--

Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.

HOLY CRAP by Moblaster · 2017-05-31 04:38 · Score: 2

In other words: "We used $10 MILLION WORTH OF EXPENSIVE SERVERS like a CHILD would use a PAPER CLIP IN AN ELECTRIC SOCKET."

Comment removed by account_deleted · 2017-05-31 04:38 · Score: 1

Comment removed based on user account deletion

Re:Power? Really? by Anonymous Coward · 2017-05-31 04:45 · Score: 1

The DC wasn't in India. The staff operating that DC (i.e patching, configuring, monitoring) were in India. The Indian support staff I work with follow instructions to the letter, so if the instructions are not completely accurate it falls on the team who made the instructions. Garbage In = Garbage Out.

it's not our DC so we don't deal with the power by Joe_Dragon · 2017-05-31 04:51 · Score: 3, Funny

It's not our DC, so we don't deal with the power part; it's the DC that we outsourced to that does the power part.

Not sure that means what you think it means..... by Lord_Rion · 2017-05-31 04:52 · Score: 1

The fact that this company even says that makes me question whether they understand what IT is and how to fix this going forward.

Most companies have a plan (or should have) to handle a location taking a hit... especially a company this large. The fact that a single DC collapse could bring this company to its knees is so bad it's almost negligent IMO.

--
--Hired Net Grunt

Last year... by Muckluck · 2017-05-31 04:55 · Score: 1

Delta tried this last year and got called to the carpet on it. BA needs to learn from others' mistakes...

--

--I like turtles...

BIG DC power systems are not really IT guys more by Joe_Dragon · 2017-05-31 04:57 · Score: 2

BIG DC power systems are not really an IT guys' thing; they're more Infrastructure / electricians, and some of that stuff is not an easy swap, even more so if a fail-safe tripped and killed all power.

I'm not buying it by zerofoo · 2017-05-31 05:12 · Score: 2

Our small school was able to cobble together enough money to afford an APC RM6000 to protect our small data room.

We recently had an intermittent power leg (it was broken at the service pole cut-offs). The wind would blow the cable and cause arcing - and lots of power weirdness on that leg.

Our UPS simply did what it needed to do to keep reliable power going to our IT systems. If we had a generator, we would have failed-over onto that until the power company fixed the service.

Surely an organization the size of BA can afford better and more redundant systems than this.

I suspect BA is passing the buck here.

Re:I'm not buying it by Cesare+Ferrari · 2017-05-31 05:53 · Score: 2

A report I read suggested they had around 500 cabinets of machines (not sure if this was across both sites or the primary). Estimating 2 kW/cabinet brings you into MW territory for the lot, so this is a non-trivial amount of machinery to keep running in a power failure situation. The failure description suggested that it was a surge issue, so it's not clear if this was just stupidity on their part (not staggering restart) or something else going wrong within the site (bad failover to generators etc).
Either way, their IT infrastructure wasn't up to the job, and clearly their DR planning didn't get them out of the hole quickly. However, their DR planning did get them running again within a few days which is more than most companies can manage. Well done to the guys actually doing the work - lots of long shifts and stress, and let's hope they get traction with management to put some decent process in place for the future.
Re:I'm not buying it by Shimbo · 2017-05-31 06:31 · Score: 2

However, their DR planning did get them running again within a few days which is more than most companies can manage.
Most companies wouldn't manage to recover in a few days from an actual disaster. However, all that seems to have happened is that they fried a few servers. Doesn't take a lot of planning to get some spares in and recover some toasted machines. Not knocking the guys on the ground, who probably had to work quite hard to do it but trying to fixup the primary site because the failover was dysfunctional is no evidence at all for a good DR plan.
Also, we don't know where the surge came from, or how it was able to break any local redundancy. That all looks like poor FM.

some DC's have there own sub station's and it may by Joe_Dragon · 2017-05-31 05:13 · Score: 1

Some DCs have their own substations, and it may have been something inside the DC's power system that failed.

Bullshit. by sethstorm · 2017-05-31 05:43 · Score: 1

Offshore IT is what led to this mess.

--
Twitter supports and protects racists - by smearing their critics with the "Hate Speech" label.

amazing how airlines are all having issues by WindBourne · 2017-05-31 05:43 · Score: 1

Seriously, I had not seen so many issues in airline computers until the last 2 years. What is different? Why, outsourcing to India.

--
I prefer the "u" in honour as it seems to be missing these days.

Here's a shovel by ilsaloving · 2017-05-31 05:50 · Score: 1

I don't think he's digging his hole fast enough. Feel free to borrow my shovel.

Or, perhaps a better solution would be for someone else at BA to clonk him over the head from behind with a little statuette or something so he just stops talking.

Yes, because small scale deployments always map up by mveloso · 2017-05-31 06:05 · Score: 2

"I was able to protect my puddly shit at my workplace with equipment I bought at Frys, so BA should have been able to protect its 12,000 servers just like I did."

Scaling up is hard. Just because you were able to do it with your install doesn't mean it would be just as easy for a larger install.

That said, they should have done a better job at BA. Even though testing power isn't part of a smaller DC's MO, it should be for a company the size of BA...at least in their dev environment.

The Reason Is Architectural Bloat by ud0 · 2017-05-31 06:21 · Score: 1

Sometimes a meteor strike takes out your data center, it happens. The answer is to design smaller, smarter systems that are more resilient.

BA carries about 50M passengers per year, or about 137k per day. Over an 8-business-hour day, that's roughly 17k bookings per hour. Let's say one booking consists of 100 datasets written to the database, and maybe 10 times that in reads. This works out to about 500 writes and 5000 reads (most of which can be cached) per second. Actual average loads are going to be even lower. I'm not talking about extended services or analytics, which can happen on machines of lesser importance; this is just about the core business of taking a reservation and issuing a ticket.
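The arithmetic above can be checked with a short script. It is a rough sketch using only the comment's own assumptions (one booking per passenger, an 8-hour booking day, 100 writes per booking, 10 reads per write); the function and constant names are invented for illustration.

```python
PASSENGERS_PER_YEAR = 50_000_000   # figure quoted in the comment
BUSINESS_HOURS_PER_DAY = 8         # assumed booking window
WRITES_PER_BOOKING = 100           # assumed datasets written per booking
READS_PER_WRITE = 10               # assumed read amplification


def core_load(passengers_per_year=PASSENGERS_PER_YEAR,
              hours_per_day=BUSINESS_HOURS_PER_DAY,
              writes_per_booking=WRITES_PER_BOOKING,
              reads_per_write=READS_PER_WRITE):
    """Return (bookings/day, bookings/hour, writes/sec, reads/sec)."""
    per_day = passengers_per_year / 365
    per_hour = per_day / hours_per_day
    per_second = per_hour / 3600
    writes = per_second * writes_per_booking
    reads = writes * reads_per_write
    return per_day, per_hour, writes, reads


per_day, per_hour, writes, reads = core_load()
print(f"{per_day:,.0f} bookings/day, {per_hour:,.0f} bookings/hour")
print(f"{writes:,.0f} writes/s, {reads:,.0f} reads/s")
```

With these inputs the result lands a little under 500 writes and 5000 reads per second, which is the order of magnitude the comment is arguing from.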

It's not a high transaction volume, and it's not a lot of data. You do not need an entire data center for this. One server with a stock DB can do that. The structure of this core is simple enough so you could replicate it to hot spares around the globe. Heck, you could even lower the load by caching reads at the airport and at the web server.

There is no unavoidable technical reason this power failure had to be that catastrophic.

Uncontrolled RETURN of power? by petes_PoV · 2017-05-31 07:10 · Score: 1

So what would have actually happened?

First, there is a cut in mains power to the data centre. No biggie, the batteries take the load. The backup generators then start to spin up and then supply power and the datacentre keeps running.
No lights went off, no computers crashed, business kept running.

But then, mains power is available again. How do you transition from your own generated power back to grid power? You can't just flick a switch. For a start you should ensure that the phase of the two power sources match - on all 3 phases. If you don't do that, I can imagine a power surge would be very likely.
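A minimal sketch of the sort of check described above, run before transferring the load back to the grid. Everything here is an illustrative assumption: the class and function names are hypothetical, and the tolerance values are placeholders rather than figures from any real switchgear or from BA's site.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Source:
    voltage: float                            # RMS volts
    frequency: float                          # hertz
    phases_deg: Tuple[float, float, float]    # angle of each of the 3 phases


def phase_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute angle between two phase angles, in degrees (0-180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def safe_to_transfer(gen: Source, grid: Source,
                     max_volt_pct: float = 5.0,     # placeholder tolerance
                     max_freq_hz: float = 0.2,      # placeholder tolerance
                     max_phase_deg: float = 10.0    # placeholder tolerance
                     ) -> bool:
    """True only if voltage, frequency and all three phase angles match."""
    volt_ok = abs(gen.voltage - grid.voltage) / grid.voltage * 100.0 <= max_volt_pct
    freq_ok = abs(gen.frequency - grid.frequency) <= max_freq_hz
    phase_ok = all(phase_difference(g, m) <= max_phase_deg
                   for g, m in zip(gen.phases_deg, grid.phases_deg))
    return volt_ok and freq_ok and phase_ok


grid = Source(400.0, 50.0, (0.0, 120.0, 240.0))
in_sync = Source(398.0, 50.05, (3.0, 123.0, 243.0))
out_of_sync = Source(400.0, 50.0, (180.0, 300.0, 60.0))
print(safe_to_transfer(in_sync, grid))      # True
print(safe_to_transfer(out_of_sync, grid))  # False
```

The point of the sketch is only that the transfer is gated on all three conditions at once; a generator that is 180 degrees out of phase fails the check even when its voltage and frequency are perfect.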

But just as every outfit tests its backups while very few test their restores, I guess BA had tested their failover process but never got around to failing back. Or the one time they did fail back, they got lucky.

--
politicians are like babies' nappies: they should both be changed regularly and for the same reasons

Re:Uncontrolled RETURN of power? by aaarrrgggh · 2017-05-31 07:42 · Score: 1

Because your batteries gave their last gasp to get onto generator, or because the impact event was not the initial loss of power but the thermal damage from restoration of power.

After people hitting the big red button (or the fire alarm doing it automatically), this is the most common failure mode for a data center.
Re:Uncontrolled RETURN of power? by aaarrrgggh · 2017-06-01 10:10 · Score: 1

Famous last words. VRLAs fail open-circuit or very high impedance under load. Without proper cell-level monitoring you really don't know if you have any battery left when under load. Best course of action is generally to ride on generator while you have someone pull all the jars with an Alber meter and live with reduced voltage or parallel strings.

Re:BIG DC power systems are not really IT guys mor by Maxo-Texas · 2017-05-31 08:14 · Score: 4, Interesting

It is if it is set up and administered right.

We did monthly failovers between different physical sites. A blown DC at one site wouldn't have made a difference.

Our failovers involved a couple of hours of on-call for about 150 staff. Most of the time only a half dozen were working, but a couple of times a year it would involve most of the staff (and a lot of IT people) for part of that. A database would be out of sync or messed up and that would fall to the IT staff to fix. It became less common over time.

Did you miss that they fixed the power problems and then the IT systems were messed up for a long time afterwards, indicating poor disaster planning and low staff skill?

A company as big as BA should have had a separate failover site and been doing regular failovers.

--
She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.

BA utility providers have already called BS on this by Anonymous Coward · 2017-05-31 08:42 · Score: 2, Informative

The utility providers for all of BA's major operations centers in England are on record as saying there were no power surges, anomalies, etc. This wasn't "we're unaware of..."; they all went back over their logs and categorically denied it (seems like they weren't happy about BA trying to pin any bit of this sh*t show on them). As many have pointed out above and elsewhere, none of this passes the sniff test. BA's taking a beating for this, not just over stranding passengers but over how they handled the stranded passengers. Many of their communications to passengers have failed to mention BA's obligations as well as the refunds and options passengers are legally entitled to. They even had the nerve to point most people to their toll customer service line instead of their toll-free one, charging people 35 pence/min to sit on hold while they were trying to get their travel plans sorted out. Even nickel-and-diming low-cost carriers (LCCs) like Ryanair aren't stupid enough to try something like that after a system-wide disruption.

IT servers??? by acoustix · 2017-05-31 09:39 · Score: 1

As opposed to another type of servers? Do they have building & grounds servers? Operations servers? Receptionist servers?

Just curious...

--
"A plan fiendishly clever in its intricacies"- Homer Simpson

Awaiting lawsuit for clarification by MacDork · 2017-05-31 16:38 · Score: 1

If a power surge caused the issue, then surely BA will sue the power company. If the power company can demonstrate there was no surge, surely they will sue BA for defamation.

There is no spoon by stoatwblr · 2017-06-01 02:17 · Score: 1

And there was no power surge (not outside the DCs anyway).

A large number of ex-BA IT staff have commented in fora about the historic robustness of the system. However, over the last 5 years BA has systematically gutted its IT staff and outsourced just about everything to India.

The CIO of BA (and IAG) is a manager whose last claim to fame was being the person responsible for ramming through the highly contentious (as in strike-causing) cabin crew contracts which stripped out many rights in 2011.

He has ZERO IT background and was charged with reducing the IT bill by $90 million per year.

Make of that what you will.

Re:There is no spoon by stoatwblr · 2017-06-01 02:44 · Score: 1

An airline is a very large, very complicated IT network and logistical operation. Operating aircraft and feeding the self-loading freight is actually secondary (think wet-leasing).
Functional, reliable and resilient IT is the absolute core of the business. It's not a cost centre. If you fuck this up then your company is dead. This isn't like freight ops. You can't book two passengers in seat 13A as one example of the degree of error tolerance required.
IAG (BA's parent company) has lost sight of this fact. Alex Cruz's cost cutting in IT was explicitly blamed by Vueling (another IAG subsidiary) for the huge meltdown they had just after he left them and switched to running BA. The difference being that Vueling is a low-cost airline with limited flight ranges and much lower customer expectations.
The corporate knowledge to deal with IT problems has been systematically removed from the company. Given that, and the way the people involved were removed, it's not surprising that Alex Cruz's "appeals" to muck in were met with a collective yawn and "not unless you pay us a lot of money" from the recently redundant.
I'd be very surprised if the total costs falling out from this cockup are less than three times the supposed IT "savings" made in BA's cutbacks. By the time regulator fines (for failing to provide legally required support for passengers) are added in, it could be 10 times the supposed savings.

Re:BIG DC power systems are not really IT guys mor by tibit · 2017-06-01 02:21 · Score: 1

The problem is really quite simple: corporate drones think that throwing tantrums at a problem will get it fixed. Tantrums include not only screaming at people, but also throwing money at a problem. There's this thing called human capital where most qualified people will naturally reward an employer's loyalty to them with their loyalty to the cause of the employer. Yet the corporate world is treating humans like replaceable cogs, and that's what they get: stuff that's held together by good wishes and chewing gum. Why? Because in such a work atmosphere, nothing better will ever flourish.

--
A successful API design takes a mixture of software design and pedagogy.
