Delta Air Lines Grounded Around the World After Computer Outage (cnn.com)
Delta Air Lines says it has suffered a computer outage throughout its system, and is warning of "large-scale" cancellations after passengers were unable to check in and departures were grounded globally. The No. 2 U.S. carrier said in a statement Monday that it had "experienced a computer outage that has impacted flights scheduled for this morning. Flights awaiting departure are currently delayed. Flights en route are operating normally." A power outage in Atlanta at about 2.30 a.m. local time is said to be the cause of the computer outage. CNN reports: "Large-scale cancellations are expected today," Delta said. While flights already in the air were operating normally, just about all flights yet to take off were grounded. The number of flights and passengers affected by the problem was not immediately available. But Delta, on average, operates about 15,000 daily flights, carrying an average of 550,000 daily passengers during the summer. Getting information on the status of flights was particularly frustrating for passengers. "We are aware that flight status systems, including airport screens, are incorrectly showing flights on time," said the airline. "We apologize to customers who are affected by this issue, and our teams are working to resolve the problem as quickly as possible."
A power outage in Atlanta at about 2.30 a.m. local time is said to be the cause of the computer outage.
Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.
Even IF one of your data centers has a power outage (which should not happen as you should have backup generators and batteries that give power until the generators are spun up), you should always have at least ONE other backup data center to take over if something really fails for you.
You would think they would have a backup for the backup power. But like someone earlier said, this outage sounds suspicious.
According to the captain of this morning's JFK-SLC flight, a routine scheduled switch to the backup generator at 2:30 a.m. caused a fire that destroyed both the backup and the primary. Firefighters took a while to extinguish the fire. Power is now back up and 400 of the 500 servers have rebooted; they are still waiting on the last 100 before the whole system is fully functional.
Last time I worked with the airline industry, they were still heavily reliant upon mainframe systems. That means putting redundant equipment at diverse datacenters is more costly. It's not like spinning up a new rack of x86 VMware servers.
Even IF one of your data centers has a power outage (which should not happen as you should have backup generators and batteries that give power until the generators are spun up),
Actually, what I'm hearing is that a fire in the backup generator took out the primary generator. So, this is a case in which the backup was the problem, not the solution.
Actually, what I'm hearing is that a fire in the backup generator took out the primary generator.
Shouldn't have any effect on the BACKUP DATA CENTER. One facility can go down. It happens. It should take a thermonuclear war to take out several if they are doing it right.
Most of y'all probably don't know what you're talking about. Here's what's going to happen:
1) Delta will file a loss-of-business / data system failure claim after things are stable again
2) They'll haggle with their insurer long after this little story is forgotten (and yeah, lots o' heartache today, but it's still probably going to be little.)
3) Delta will get a settlement of some dollar amount
4) Some bean counter will eventually tally the cost of that policy versus the payout versus how much all those redundant backups would have cost. The accountant will most likely conclude that it was a smart idea to have bought that insurance policy and NOT paid out the multimillions of dollars IT was asking for in redundant systems.
5) The insurance company will note the payout as a blip on its financials (probably already expected by the actuaries.) Insurance company will keep making profit.
The little air traveller is screwed and blued, but Delta and its insurer will keep flying. Doing business today without a data loss rider on your business insurance would be the really stupid idea, much more so than wasting money on redundant systems that are more expensive than said rider.
While on the surface it may appear their IT department is "incompetent", as one person pointed out, other factors could have contributed to the outage. Management may not have approved proper testing, or a second datacenter in a completely different location. The generator(s) may have been improperly maintained. IT may request that things be done or placed a certain way, but that doesn't mean the facilities team cares or understands why, and they may do it their own way anyway. Like, why have two generators located right next to each other? They probably shared the same resource for operating as well.
It takes an event like this for people to realize the importance of listening to the people who implement and maintain their infrastructure. I'm sure anyone who saw this happening is digging through their memos and pulling out the multiple requests for disaster recovery solutions to prevent these things. Not to show them, haha I told you so, but to cover their ass when they start looking for someone to fire.
It's easy to point out IT as the scapegoat but sometimes they just have to deal with what they're given by the higher ups.
Bullcrap. A boo-boo this massive is BY DEFINITION a management fuck-up. It is management's [only] job to ensure all departments are doing their jobs competently. They don't get to say "well gosh, engineering told us they knew what they were doing". Yeah, it isn't EASY, but it's why they get the obscene compensation levels.
For any IT discussion on slashdot, as time T increases, the probability of a neckbeard blaming "MBAs" approaches 1
Yeah, it's sort of a riff on Godwin's law. If you blame "MBAs" for a problem, that person has no fact based arguments left so the argument is over and the person doing it loses the argument. It's basically scapegoating and tribalism at its worst.
Management is a pretty easy target. Management has to make decisions with imperfect information (like playing poker) whereas engineers are used to working with greater certainty (more like playing chess) and it's hard for many of them to wrap their heads around the difference. Engineers who don't actually know any better seem to think MBA is shorthand for management incompetence. Never mind that an MBA is a degree, not a person or even a category of people. It's as stupid and incoherent as saying CS = incompetent programmers. I happen to be an engineer but I'm also a certified accountant. I have degrees in both engineering and business and I use both in my day job running a manufacturing plant. I can say with absolute confidence that there are just as many engineering school graduates who are bad at their jobs as there are business school graduates who are bad at their jobs. I run into both routinely. And just as many who are good at their jobs as well. Just because you may have run into some of the bad ones doesn't grant the right to paint the rest with the same brush.
Accountants don't have a good idea of lost business opportunity or lost customers.
So while the basics may make financial sense, that doesn't actually mean it was a good idea.
There are two types of people in the world: Those who crave closure
I used to work on one of these systems.
The flight planning system takes inputs from several sources - weather forecasts, notices about airspace closures, etc. (NOTAMs), and booking info - and creates an optimal flight plan for the aircraft.
A modern airline doesn't have enough flight planning staff to take over manually if the system fails, so if your flight planning goes out, your fleet is gradually grounded.
The large number of servers is due to the optimization problem. You need to take into account the flight conditions and fuel costs in different locations in order to decide your route, altitude, and fuel loading. Since fuel is a huge percentage of the operating cost of the airline, it pays to invest a little extra computing power into optimizing these and save a bit of fuel on each flight.
Our system had lots of redundancy but, with all the data feeds, there are lots of moving parts. It's not hard to imagine a scenario where, for example, you get everything transferred over to your disaster recovery site, but for some reason the weather feed isn't coming in and you can't make flight plans.
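To make that failure mode concrete, here is a minimal sketch of a readiness check a disaster recovery site might run before it starts producing flight plans. The feed names, the staleness limits, and the timestamps are all made up for illustration; none of it comes from any real airline system.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical required feeds and how old each may be before we refuse to plan.
MAX_AGE = {
    "weather": timedelta(hours=1),
    "notams": timedelta(hours=1),
    "bookings": timedelta(minutes=15),
}


def stale_feeds(last_update, now):
    """Return the sorted names of required feeds that are missing or too old."""
    stale = []
    for name, limit in MAX_AGE.items():
        seen = last_update.get(name)
        if seen is None or now - seen > limit:
            stale.append(name)
    return sorted(stale)


def can_plan_flights(last_update, now):
    """The site is only usable for flight planning if every feed is fresh."""
    return not stale_feeds(last_update, now)


# Arbitrary example: the servers failed over fine, but weather never arrived.
now = datetime(2016, 1, 1, 12, 0, tzinfo=timezone.utc)
dr_site = {
    "notams": now - timedelta(minutes=10),
    "bookings": now - timedelta(minutes=5),
}

print(stale_feeds(dr_site, now))       # ['weather']
print(can_plan_flights(dr_site, now))  # False
```

The point of the sketch is that "the DR site is up" and "the DR site can make flight plans" are separate checks, and one missing input fails the second even when the first passes.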
It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail. - Abraham Maslow
This story brought to you courtesy of paperless tickets. Yes they are cheaper, yes it is simpler if people can print their own tickets, but the IT has to be up and running.
I remember an airline IT outage back in September 2004. There was a bug in the OS's error-handling routine for a particular class of error. This had all been tested with this particular OS level and had worked, but they had been forced to change the OS configuration to accommodate some new software, and the bug was in place. Moving to new discs required a reboot, and an additional configuration error caused problems. If it had been fixed within (I think) 90 minutes all would have been fine. The outage was 8 hours.
Passengers turned up at the airports with their paper tickets and were allowed to board. Any pre-allocated seating was ignored. People were laughing about flying the way things used to be, a good time was had by most.
Then came paperless tickets. The next outage had effects more like those we see in this case.
Mielipiteet omiani - Opinions personal, facts suspect.
Of course it's the MBAs' fault. Their very raison d'être is calculating the costs of additional redundancy, and comparing that against the costs of a global operations failure and the ensuing loss of business due to carrier unreliability. Then, presenting this data to a decision maker for action.
There are only two ways that they can get off. One way is if the decision maker chose to accept the risk, knowing it fully. The other way is if the IT department didn't advise them of the risk. I evaluate the chances of the IT department being dumb enough to not know what would happen as near zero.
You're left with MBAs who failed to present the business case properly or a CEO who is a retard. Choose one.
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
Or they ran the numbers and calculated that even if they have an outage like this, the cost of that outage would be less than the cost of preventing it. If all you care about is the bottom line, you might not care if you inconvenience a bunch of customers for a few days.
Blimey, I wouldn't do that running a bog-standard streaming service, never mind an airline with 100 million a day of revenue.
500 servers is about 50 racks. About 500,000 a year, plus about 2,000,000 for kit, 4,000,000 for software and licenses, and 250,000 for interconnect. So capex 6,000,000 and opex, call it 1,000,000 per annum.
I normally rate a major DC failure (more than 10 min) at about once every 5 years.
Easy business case.
Also, generator and UPS failover is tough to test with one DC. Which hit this one bad.
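The back-of-envelope business case above can be worked through as a rough sketch. The revenue, capex, opex, and failure-rate figures are the ones quoted in the comment; the assumption that one major failure costs a full day of revenue, and that capex is written off over five years, are mine and are only there to make the arithmetic concrete.

```python
# Figures quoted in the comment (currency unspecified).
DAILY_REVENUE = 100_000_000
CAPEX = 6_000_000
OPEX_PER_YEAR = 1_000_000
YEARS_BETWEEN_FAILURES = 5

# Assumptions for illustration only, not from the thread.
REVENUE_DAYS_LOST_PER_FAILURE = 1.0
CAPEX_LIFETIME_YEARS = 5


def annual_cost_of_second_dc(capex, lifetime_years, opex_per_year):
    """Capex spread evenly over its lifetime, plus yearly running cost."""
    return capex / lifetime_years + opex_per_year


def expected_annual_outage_loss(daily_revenue, days_lost, years_between_failures):
    """Loss per failure times the expected number of failures per year."""
    return daily_revenue * days_lost / years_between_failures


cost = annual_cost_of_second_dc(CAPEX, CAPEX_LIFETIME_YEARS, OPEX_PER_YEAR)
loss = expected_annual_outage_loss(
    DAILY_REVENUE, REVENUE_DAYS_LOST_PER_FAILURE, YEARS_BETWEEN_FAILURES
)
# Days of revenue a failure must cost before the second DC pays for itself.
break_even_days = cost * YEARS_BETWEEN_FAILURES / DAILY_REVENUE

print(f"second DC per year:       {cost:,.0f}")    # 2,200,000
print(f"expected loss per year:   {loss:,.0f}")    # 20,000,000
print(f"break-even days of revenue lost: {break_even_days:.2f}")  # 0.11
```

Under these assumptions the second DC pays for itself if a failure costs more than about 0.11 days (under three hours) of revenue. A different write-off period or a smaller loss per failure would move the numbers, so treat it as a sanity check rather than a real cost model.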
Without Federal requirements there is no way a corporation is going to spend that kind of money.
A few failures like this one and they'll dig into the couch cushions to find the change for it. Having a backup data center for stuff that will shut the company down is not exactly a tough thing to justify. This shutdown alone would probably justify the cost in a single day.
They have legal protections in place to assure they retain their terminal slots, so while they aren't making money now they won't lose in the long run.
Perhaps but if they managed their IT properly they wouldn't have to lose money now. They can buy the insurance or they can take the risk of serious illness so to speak. Their choice and their funeral. Sounds like they rolled the dice and came up snake eyes today.
The only businesses with total data recovery sites and plans to actually use them are banks, and that is because they are required to by the FDIC.
Not true. Some medical practices have them. Some internet firms have them (at least for the mission critical stuff). Some bits of the military and government have them. Insurance companies have them. Stock exchanges have them. And there are more as well. If it's valuable enough you have a backup data center of some sort.
It couldn't possibly be that they predicted exactly this and presented it clearly to upper management who then decided they could get a really fat bonus for keeping costs down and deploy the golden parachute before the inevitable disaster.
Somebody has lots of 'splaining to do, surely. Power up the deflectors.
Off the top of my head I can name over 20 companies that have full failover to a backup DC. One of them is an Airline that everyone knows the name of.
Hell, I have configured stretch clusters for companies so that in the event of a DC failure the secondary DC is available with 0 downtime and the failover is automatic. So it is done, it is normal operating procedure/best practice, and there is no reason the SECOND LARGEST AIRLINE IN THE US IS NOT DOING IT!!!
If you want to argue that some small company of 1000 people is not doing it that is fine but there is no excuse beyond management failing to do their job for this one. I think the board needs to look into it and start cutting people from the top down.