Delta Air Lines Grounded Around the World After Computer Outage (cnn.com)

Posted by msmash on Monday August 8, 2016 @01:00AM from the computer-glitch dept.

Delta Air Lines says it has suffered a computer outage throughout its system, and is warning of "large-scale" cancellations after passengers were unable to check in and departures were grounded globally. The No. 2 U.S. carrier said in a statement Monday that it had "experienced a computer outage that has impacted flights scheduled for this morning. Flights awaiting departure are currently delayed. Flights en route are operating normally." A power outage in Atlanta at about 2.30 a.m. local time is said to be the cause of the computer outage. CNN reports: "Large-scale cancellations are expected today," Delta said. While flights already in the air were operating normally, just about all flights yet to take off were grounded. The number of flights and passengers affected by the problem was not immediately available. But Delta, on average, operates about 15,000 daily flights, carrying an average of 550,000 daily passengers during the summer. Getting information on the status of flights was particularly frustrating for passengers. "We are aware that flight status systems, including airport screens, are incorrectly showing flights on time," said the airline. "We apologize to customers who are affected by this issue, and our teams are working to resolve the problem as quickly as possible."

18 of 239 comments

Incompetent IT by sjbe · 2016-08-08 01:06 · Score: 5, Interesting

A power outage in Atlanta at about 2.30 a.m. local time is said to be the cause of computer outage.
Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.
1. Re:Incompetent IT by NotInHere · 2016-08-08 01:14 · Score: 5, Insightful
  
  Probably the higher-ups who decided that redundancy is not required are long gone and doing something different now. They could show off how nicely they could cut so many costs to their bosses and probably got a big bonus for the two quarters they were employed before going to the next job.
2. Re:Incompetent IT by mjwx · 2016-08-08 01:16 · Score: 5, Insightful
  
  A power outage in Atlanta at about 2.30 a.m. local time is said to be the cause of computer outage.
  Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.
  I wouldn't be so fast to lay this at the feet of IT.
  
  I'm certain they wanted to make it robust, distributed and redundant, but that all costs money. When PHBs with MBAs see IT as a cost centre, they see all this redundancy as "waste" to be cut back. Budgets are reduced and so are capabilities.
  
  This is the kind of stupidity I see from American companies all the time. Here in Europe, computer downtime like this for a mere hour costs millions of pounds for an airline as they become liable not just for refunds, but also for extra costs as travel insurers pay large sums of money to get people where they're supposed to go. The reinsurers will then send their lawyers to present the airline with a nice bill.
  
  --
  Calling someone a "hater" only means you can not rationally rebut their argument.
3. Re:Incompetent IT by tripleevenfall · 2016-08-08 01:40 · Score: 4, Funny
  
  "Johnson, get in here"
  "Yes sir?"
  "You said you apped this in the cloud. How does the cloud go down?"
  "Well, er... "
  "Where are the damn synergies? I was told there would be synergies!"
4. Re:Incompetent IT by tripleevenfall · 2016-08-08 01:43 · Score: 4, Insightful
  
  For any IT discussion on Slashdot, as time T increases, the probability of a neckbeard blaming "MBAs" approaches 1.
5. Re:Incompetent IT by Kjella · 2016-08-08 02:52 · Score: 5, Insightful
  
  Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.
  Scaling out is easy if you're Facebook or Google and nobody cares about a perfectly consistent truth. If you run transaction processing like airplane tickets, people damn well want to know whether their ticket got booked, and Delta wants to know whether it got paid. They want ACID compliance, not "eventual consistency" NoSQL. That usually leads to mainframes and 99.99999% uptime systems with redundant power, network links etc., not clusters and distribution. Maybe also a hot failover next to it hooked up by a fat pipe. But if shit hits the fan big time in the data center, it goes down. Doesn't look like it took them *that* long to scramble what I assume is their cold backup online.
  The passengers aren't happy, but hey, sometimes shit happens with planes or crew or airports or whatnot, leading to delays and cancellations. I've had a rescheduled flight and a night in a hotel because KLM got delayed and wasn't allowed to lift off because the destination airport was closing. It sucks, but this is a fact of life for airlines. It becomes a big story because it happened to lots of people at once, but over, say, a year, how big a deal is it really? I'm sure they'll do a post mortem, but I'd be surprised if they moved away from a centralized architecture.
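  To make the ACID point concrete, here is a minimal sketch of an atomic booking, assuming a toy SQLite schema. The table names, the `book_seat` helper, and the flight number are all made up for illustration and have nothing to do with Delta's actual systems. The seat assignment and the payment record either both commit or neither does.

```python
import sqlite3

def book_seat(conn, flight, seat, passenger, fare):
    """Reserve a seat and record the payment as one transaction.

    Returns True if the booking committed, False if the seat was taken.
    """
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            cur = conn.execute(
                "UPDATE seats SET passenger = ? "
                "WHERE flight = ? AND seat = ? AND passenger IS NULL",
                (passenger, flight, seat),
            )
            if cur.rowcount != 1:
                # Nobody gets charged for a seat they did not get.
                raise ValueError("seat already taken")
            conn.execute(
                "INSERT INTO payments (passenger, flight, amount) VALUES (?, ?, ?)",
                (passenger, flight, fare),
            )
        return True
    except ValueError:
        return False

def demo():
    """Two passengers race for the same (hypothetical) seat."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE seats (flight TEXT, seat TEXT, passenger TEXT)")
    conn.execute("CREATE TABLE payments (passenger TEXT, flight TEXT, amount REAL)")
    conn.execute("INSERT INTO seats VALUES ('DL100', '12A', NULL)")
    conn.commit()
    first = book_seat(conn, "DL100", "12A", "alice", 300.0)
    second = book_seat(conn, "DL100", "12A", "bob", 300.0)
    payments = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
    return first, second, payments
```

  With a single database this is easy. The hard part is keeping the same guarantee once the data is spread across sites, which is the trade-off against "eventual consistency".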
  
  --
  Live today, because you never know what tomorrow brings
Re:Shouldn't have upgraded to W10 ! by NotInHere · 2016-08-08 01:12 · Score: 4, Insightful

Even IF one of your data centers has a power outage (which should not happen as you should have backup generators and batteries that give power until the generators are spun up), you should always have at least ONE other backup data center to take over if something really fails for you.
Report: Fire destroyed generators by McGruber · 2016-08-08 01:18 · Score: 4, Informative

A fire at the datacenter caused the outage, according to a post from "walterD" in Flyertalk.com's "Delta computers down ..." thread:

According to the flight captain of JFK-SLC this morning, a routine scheduled switch to the backup generator this morning at 2:30am caused a fire that destroyed both the backup and the primary. Firefighters took a while to extinguish the fire. Power is now back up and 400 out of the 500 servers rebooted, still waiting for the last 100 to have the whole system fully functional.
1. Re:Report: Fire destroyed generators by pz · 2016-08-08 01:40 · Score: 4, Insightful
  
  Here's the thing that amazes me.
  500 servers.
  The airline runs on 500 servers.
  I was part of an early social networking site that, at its peak had 20 M users, with about 10K actively using the site at any given moment. We ran with 200 servers and had really very excellent render time (this was getting on to a decade ago, and if our page loads ever got above 1 second it was considered a near crisis; our email/messaging system, that I wrote, handled 150 M messages per day). It just can't be that hard to run an airline site compared to running a web site that peaked at Alexa 100. They need 500 servers? Five HUNDRED servers? And with the resources of a multi-billion dollar company, they're STILL ALL IN ONE LOCATION?
  They need a new IT team. Or a new management to give them the support they need.
  
  --
  
  Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
2. Re:Report: Fire destroyed generators by Critical+Facilities · 2016-08-08 01:50 · Score: 4, Insightful
  
  Interesting. I manage Enterprise Data Centers for a living. My expertise is the Facility Infrastructure (generators, UPS, switchgear, etc). What's being described in that post you linked to sounds very possible. I'd just about lay money down that this was a failure in an Automatic Transfer Switch. And as others have said, I pretty much guarantee that due to the corporate attitude of "facilities is just an expense center on a spreadsheet", there's been pressure to trim costs, including decreasing the frequency of predictive maintenance like Infrared Thermography.
  
  A well maintained ATS should be able to function flawlessly for many, many years (like 20 years). To have faulted so badly that it took out the whole switch (which would definitely make the primary and generator feeds inaccessible) sure sounds like deferred or non-existent maintenance to me.
3. Re:Report: Fire destroyed generators by raftpeople · 2016-08-08 02:09 · Score: 5, Insightful
  
  "It just can't be that hard to run an airline site compared to running a web site that peaked at Alexa 100" - You clearly have no idea what you are talking about. Go learn about the complexities of running an airline, the different software required, the number of users and systems supported, etc.
4. Re:Report: Fire destroyed generators by Critical+Facilities · 2016-08-08 02:41 · Score: 4, Informative
  
  Well, to be clear, I'm just speculating here, but I'm not implying that the GENERATORS blew up, I'm speculating that the ATS blew up. It is a very common topology to have multiple Generators connect to one main bus, and then have that bus connect to the Data Center via an ATS. In other words, yes, there is/are redundant Generator(s), but they all connect to one central bus, which then connects to the UPS Systems via the ATS and other switchgear.
  
  The failure rate of ATSs is pretty low (when they're maintained), so it often becomes a value engineering decision during design. Yes, you could have each Generator connect via its own ATS, thus distributing the risk, but in so doing you increase your construction costs, increase your maintenance costs, etc. The bean counters don't like that, and it becomes hard to convince them that it's worth it when you can't come up with statistical proof that a failure of the ATS is likely.
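  The trade-off between the two topologies can be put into rough numbers. The sketch below assumes that failures are independent and that any single surviving generator path can carry the load. Both assumptions are simplifications (a fire that spreads is exactly the kind of correlated failure this ignores), and the function names and probabilities are purely illustrative.

```python
def p_loss_shared_ats(p_gen, p_ats, n):
    """Probability of losing backup power when n generators share one bus
    and reach the load through a single ATS."""
    all_generators_fail = p_gen ** n
    # Power survives only if the ATS works AND at least one generator runs.
    return 1 - (1 - p_ats) * (1 - all_generators_fail)

def p_loss_per_generator_ats(p_gen, p_ats, n):
    """Probability of losing backup power when each of n generators has
    its own ATS, so every path must fail for power to be lost."""
    path_fails = 1 - (1 - p_gen) * (1 - p_ats)
    return path_fails ** n

# Assumed per-event failure probabilities, for illustration only.
shared = p_loss_shared_ats(p_gen=0.05, p_ats=0.01, n=2)
distributed = p_loss_per_generator_ats(p_gen=0.05, p_ats=0.01, n=2)
print(f"shared ATS: {shared:.4%}, one ATS per generator: {distributed:.4%}")
```

  With a shared ATS the result can never drop below the ATS's own failure probability, no matter how many generators sit behind it. That floor is the single point of failure being argued about.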
Backup data center? by sjbe · 2016-08-08 01:52 · Score: 5, Insightful

Actually, what I'm hearing is that a fire in the backup generator took out the primary generator.
Shouldn't have any effect on the BACKUP DATA CENTER. One facility can go down. It happens. It should take a thermonuclear war to take out several if they are doing it right.
For those claiming bad managers and saving money: by Anonymous Coward · 2016-08-08 01:53 · Score: 4, Interesting

Most of y'all probably don't know what you're talking about. Here's what's going to happen:
1) Delta will file a loss-of-business / data system failure claim after things are stable again
2) They'll haggle with their insurer long after this little story is forgotten (and yeah, lots o' heartache today, but it's still probably going to be little.)
3) Delta will get a settlement of some dollar amount
4) Some bean counter will eventually tally the cost of that policy versus the payout versus how much all those redundant backups would have cost. The accountant will most likely conclude that it was a smart idea to have bought that insurance policy and NOT paid out the multimillions of dollars IT was asking for in redundant systems.
5) The insurance company will note the payout as a blip on its financials (probably already expected by the actuaries.) Insurance company will keep making profit.
The little air traveller is screwed and blued, but Delta and its insurer will keep flying. Doing business today without a data loss rider on your business insurance would be the really stupid idea, much more so than wasting money on redundant systems that are more expensive than said rider.
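The bean counter's tally in step 4 amounts to comparing two expected annual costs. A minimal sketch follows; every figure in it is a placeholder rather than an estimate of Delta's real numbers, and the function names are invented for illustration.

```python
def expected_annual_cost_insured(premium, outage_prob, outage_loss, payout):
    """Pay the premium every year; when an outage hits, absorb whatever
    part of the loss the insurer's payout does not cover."""
    uncovered_loss = max(outage_loss - payout, 0)
    return premium + outage_prob * uncovered_loss

def expected_annual_cost_redundant(redundancy_cost, residual_outage_prob, outage_loss):
    """Pay for the redundant site every year; outages become rarer
    but are assumed to cost the same when they still happen."""
    return redundancy_cost + residual_outage_prob * outage_loss

# Placeholder figures in millions of dollars per year.
insured = expected_annual_cost_insured(
    premium=2, outage_prob=0.1, outage_loss=100, payout=80
)
redundant = expected_annual_cost_redundant(
    redundancy_cost=10, residual_outage_prob=0.01, outage_loss=100
)
print("insurance is cheaper" if insured < redundant else "redundancy is cheaper")
```

Which side wins depends entirely on the inputs. Raise the outage probability or shrink the payout and the comparison flips, and the model deliberately leaves out anything the policy would not pay for.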
Re:Arguing for resources is part of the job by fnj · 2016-08-08 02:12 · Score: 5, Insightful

Bullcrap. A boo-boo this massive is BY DEFINITION a management fuck-up. It is management's [only] job to ensure all departments are doing their jobs competently. They don't get to say "well gosh, engineering told us they knew what they were doing". Yeah, it isn't EASY, but it's why they get the obscene compensation levels.
Sounds like a problem with flight planning by Ami+Ganguli · 2016-08-08 02:36 · Score: 5, Informative

I used to work on one of these systems.
The flight planning system takes inputs from several sources - weather forecasts, notices about airspace closures, etc. (NOTAMs), and booking info - and creates an optimal flight plan for the aircraft.
A modern airline doesn't have enough flight planning staff to take over manually if the system fails, so if your flight planning goes out, your fleet is gradually grounded.
The large number of servers is due to the optimization problem. You need to take into account the flight conditions and fuel costs in different locations in order to decide your route, altitude, and fuel loading. Since fuel is a huge percentage of the operating cost of the airline, it pays to invest a little extra computing power into optimizing these and save a bit of fuel on each flight.
Our system had lots of redundancy but, with all the data feeds, there are lots of moving parts. It's not hard to imagine a scenario where, for example, you get everything transferred over to your disaster recovery site, but for some reason the weather feed isn't coming in and you can't make flight plans.
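The failover scenario above can be sketched as a readiness check: after moving to the disaster recovery site, planning stays blocked until every input feed is fresh. This is only an illustration. The feed names, the 30-minute threshold, and the function names are assumptions, not details of any real flight planning product.

```python
from datetime import datetime, timedelta

# Feeds the planner needs before it can produce a plan (illustrative names).
REQUIRED_FEEDS = ("weather", "notams", "bookings")

def stale_feeds(last_update, now, max_age=timedelta(minutes=30)):
    """Return the required feeds that are missing or older than max_age.

    last_update maps a feed name to the datetime of its latest data.
    """
    return [
        name
        for name in REQUIRED_FEEDS
        if name not in last_update or now - last_update[name] > max_age
    ]

def can_plan_flights(last_update, now, max_age=timedelta(minutes=30)):
    """Planning is allowed only when no required feed is stale."""
    return not stale_feeds(last_update, now, max_age)

# Example: the DR site is up, but the weather feed never reconnected.
now = datetime(2016, 8, 8, 6, 0)
feeds = {
    "weather": now - timedelta(hours=2),
    "notams": now - timedelta(minutes=5),
    "bookings": now - timedelta(minutes=1),
}
print(stale_feeds(feeds, now), can_plan_flights(feeds, now))
```

One missing feed is enough to block planning even though the servers themselves are healthy.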

--
It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail. - Abraham Maslow
Paperless Tickets by Vlad_the_Inhaler · 2016-08-08 02:41 · Score: 4, Interesting

This story brought to you courtesy of paperless tickets. Yes they are cheaper, yes it is simpler if people can print their own tickets, but the IT has to be up and running.
I remember an airline IT outage back in September 2004. There was a bug in the OS's error-handling routine for a particular class of error. This had all been tested with this particular OS level and had worked, but they had been forced to change the OS configuration to accommodate some new software and the bug was in place. Moving to new discs required a reboot, and an additional configuration error caused problems. If it had been fixed within (I think) 90 minutes all would have been fine. The outage was 8 hours.
Passengers turned up at the airports with their paper tickets and were allowed to board. Any pre-allocated seating was ignored. People were laughing about flying the way things used to be, a good time was had by most.
Then came paperless tickets. The next outage had effects more like those we see in this case.

--
Mielipiteet omiani - Opinions personal, facts suspect.
Insurance by sjbe · 2016-08-08 03:46 · Score: 4, Informative

Without Federal requirements there is no way a corporation is going to spend that kind of money.
A few failures like this one and they'll dig into the couch cushions to find the change for it. Having a backup data center for stuff that will shut the company down is not exactly a tough thing to justify. This shutdown alone would probably justify the cost in a single day.

They have legal protections in place to assure they retain their terminal slots, so while they aren't making money now they won't lose in the long run.
Perhaps but if they managed their IT properly they wouldn't have to lose money now. They can buy the insurance or they can take the risk of serious illness so to speak. Their choice and their funeral. Sounds like they rolled the dice and came up snake eyes today.

The only businesses with total data recovery sites and plans to actually use them are Banks, and that is because they are required by the FDIC.
Not true. Some medical practices have them. Some internet firms have them (at least for the mission critical stuff). Some bits of the military and government have them. Insurance companies have them. Stock exchanges have them. And there are more as well. If it's valuable enough you have a backup data center of some sort.