Delta Air Lines Grounded Around the World After Computer Outage (cnn.com)
Delta Air Lines says it has suffered a computer outage throughout its system, and is warning of "large-scale" cancellations after passengers were unable to check in and departures were grounded globally. The No. 2 U.S. carrier said in a statement Monday that it had "experienced a computer outage that has impacted flights scheduled for this morning. Flights awaiting departure are currently delayed. Flights en route are operating normally." A power outage in Atlanta at about 2.30 a.m. local time is said to be the cause of the computer outage. CNN reports: "Large-scale cancellations are expected today," Delta said. While flights already in the air were operating normally, just about all flights yet to take off were grounded. The number of flights and passengers affected by the problem was not immediately available. But Delta, on average, operates about 15,000 daily flights, carrying an average of 550,000 daily passengers during the summer. Getting information on the status of flights was particularly frustrating for passengers. "We are aware that flight status systems, including airport screens, are incorrectly showing flights on time," said the airline. "We apologize to customers who are affected by this issue, and our teams are working to resolve the problem as quickly as possible."
More than likely an underfunded IT department. IT people often know what's needed for a reliable system, but the higher-ups just see them as a cost center and won't provide them with a sufficient budget.
Even IF one of your data centers has a power outage (which should not happen as you should have backup generators and batteries that give power until the generators are spun up), you should always have at least ONE other backup data center to take over if something really fails for you.
Probably the higher-ups who decided that redundancy is not required are long gone and doing something different now. They could show off how nicely they could cut so many costs to their bosses and probably got a big bonus for the two quarters they were employed before going to the next job.
A power outage in Atlanta at about 2.30 a.m. local time is said to be the cause of the computer outage.
Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.
I wouldn't be so fast to lay this at the feet of IT.
I'm certain they wanted to make it robust, distributed and redundant, but that all costs money. When PHBs with MBAs see IT as a cost centre, they see all this redundancy as "waste" to be cut back. Budgets are reduced and so are capabilities.
This is the kind of stupidity I see from American companies all the time. Here in Europe, computer downtime like this for a mere hour costs millions of pounds for an airline as they become liable not just for refunds, but also for extra costs as travel insurers pay large sums of money to get people where they're supposed to go. The reinsurers will then send their lawyers to present the airline with a nice bill.
Here's the thing that amazes me.
500 servers.
The airline runs on 500 servers.
I was part of an early social networking site that, at its peak, had 20 M users, with about 10K actively using the site at any given moment. We ran with 200 servers and had excellent render times (this was getting on to a decade ago, and if our page loads ever got above 1 second it was considered a near crisis; our email/messaging system, which I wrote, handled 150 M messages per day). It just can't be that hard to run an airline site compared to running a web site that peaked at Alexa 100. They need 500 servers? Five HUNDRED servers? And with the resources of a multi-billion dollar company, they're STILL ALL IN ONE LOCATION?
They need a new IT team. Or a new management to give them the support they need.
For any IT discussion on Slashdot, as time T increases, the probability of a neckbeard blaming "MBAs" approaches 1.
"with about 10K actively using the site at any given moment."
You actually think Delta only has 10k actively using their systems at any given moment? They probably have that many ticket counter staff logged in, not even counting customers, technicians, pilots, and so on.
Yeah, I get the 'why is your backup in the same building as your primary', but they probably need 500 servers.
From the sound of things, I'd say the cabbie played you. I wouldn't be a bit surprised if this is a scam he runs regularly.
Interesting. I manage Enterprise Data Centers for a living. My expertise is the Facility Infrastructure (generators, UPS, switchgear, etc). What's being described in that post you linked to sounds very possible. I'd just about lay money down that this was a failure in an Automatic Transfer Switch (ATS). And as others have said, I pretty much guarantee that due to the corporate attitude of "facilities is just an expense center on a spreadsheet", there's been pressure to trim costs, including decreasing the frequency of predictive maintenance like Infrared Thermography.
A well maintained ATS should be able to function flawlessly for many, many years (like 20 years). To have faulted so badly that it took out the whole switch (which would definitely make the primary and generator feeds inaccessible) sure sounds like deferred or non-existent maintenance to me.
Actually, what I'm hearing is that a fire in the backup generator took out the primary generator.
Shouldn't have any effect on the BACKUP DATA CENTER. One facility can go down. It happens. It should take a thermonuclear war to take out several if they are doing it right.
"It just can't be that hard to run an airline site compared to running a web site that peaked at Alexa 100" - You clearly have no idea what you are talking about. Go learn about the complexities of running an airline, the different software required, the number of users and systems supported, etc.
Bullcrap. A boo-boo this massive is BY DEFINITION a management fuck-up. It is management's [only] job to ensure all departments are doing their jobs competently. They don't get to say "well gosh, engineering told us they knew what they were doing". Yeah, it isn't EASY, but it's why they get the obscene compensation levels.
On the contrary, after going through bankruptcies in recent years and shedding debt, pensions, etc., plus with the current low fuel prices, most airlines are currently swimming in cash.
Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.
Scaling out is easy if you're Facebook or Google and nobody cares about a perfectly consistent truth. If you run transaction processing like airline ticketing, people damn well want to know whether their ticket got booked and Delta wants to know whether it got paid; they want ACID compliance, not "eventual consistency" NoSQL. That usually leads to mainframes and 99.99999% uptime systems with redundant power, network links etc., not clusters and distribution. Maybe also a hot failover next to it hooked up by a fat pipe. But if shit hits the fan big time in the data center, it goes down. Doesn't look like it took them *that* long to scramble what I assume is their cold backup online.
The passengers aren't happy, but hey, sometimes shit happens with planes or crew or airports or whatnot, leading to delays and cancellations. I've had a rescheduled flight and a night in a hotel because KLM got delayed and wasn't allowed to take off because the destination airport was closing. It sucks, but this is a fact of life for airlines. It becomes a big story because it happened to lots of people at once, but over, say, a year, how big a deal is it really? I'm sure they'll do a post mortem, but I'd be surprised if they moved away from a centralized architecture.
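To put the "99.99999% uptime" figure mentioned above in perspective, here is a rough sketch of how an availability percentage translates into allowed downtime per year. The function name and the 365.25-day year are my own choices for illustration, nothing Delta or any vendor has published.

```python
# Rough arithmetic only: converts an availability percentage into the
# downtime budget it implies over one year (assumed to be 365.25 days).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the downtime (in seconds) allowed per year at this availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.99, 99.999, 99.99999):
    seconds = downtime_seconds_per_year(pct)
    print(f"{pct}% uptime allows about {seconds:,.1f} seconds of downtime per year")
```

Seven nines works out to roughly 3 seconds a year, so a multi-hour outage burns through that budget many times over, whatever the hardware was rated for.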
Or they ran the numbers and calculated that even if they have an outage like this, the cost of that outage would be less than the cost of preventing it. If all you care about is the bottom line, you might not care if you inconvenience a bunch of customers for a few days.
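If they did run the numbers, the comparison would look something like the sketch below. Every figure in it is made up for illustration (outage frequency, cost per outage, price of a second data center, how much risk the second site removes); the only anchor is the "tens of millions of dollars" guess from earlier in the thread.

```python
# Hypothetical break-even check: is paying for redundancy cheaper than
# eating the occasional outage? All inputs are illustrative assumptions.

def expected_annual_outage_cost(outages_per_year: float, cost_per_outage: float) -> float:
    """Expected yearly loss from outages if nothing is done."""
    return outages_per_year * cost_per_outage


def prevention_pays_off(
    outages_per_year: float,
    cost_per_outage: float,
    annual_prevention_cost: float,
    risk_reduction: float,
) -> bool:
    """True if the expected loss avoided exceeds what prevention costs per year.

    risk_reduction is the fraction of outage risk removed (0.0 to 1.0).
    """
    avoided = expected_annual_outage_cost(outages_per_year, cost_per_outage) * risk_reduction
    return avoided > annual_prevention_cost


# Made-up scenario A: one outage every five years at $50M each,
# versus $15M a year for a second site that removes 90% of the risk.
print(prevention_pays_off(0.2, 50_000_000, 15_000_000, 0.9))

# Made-up scenario B: same odds, but each outage costs $150M.
print(prevention_pays_off(0.2, 150_000_000, 15_000_000, 0.9))
```

With these invented inputs, scenario A says skip the second site and scenario B says build it, which is the whole point: the answer depends entirely on what you plug in for the cost of an outage, and that number is easy to lowball.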
I used to live in Point Roberts, WA and power was very unreliable. I worked from home half the week, so I bought two UPSes (one for the computer, one for the cable modem and router), and kept a charged car battery in the house with a 12V inverter which would give me by my calculations about 10 hours on my laptop (on top of my laptop's 5 hour battery). I had plans to buy a generator as well.
One day the power went out. The UPS kicked in. Power usually came back within a couple minutes so I kept working. After about 10 min, the UPS began warning it was nearly drained. So I shut down the desktop and switched to my laptop. Unfortunately I hadn't charged it so I got a low battery warning after about an hour. I lugged out the car battery, clamped on the leads for the inverter, plugged the laptop into the inverter, and fired it up. I was back in business again.
Got on the laptop, logged in to work. 30 seconds later the Internet went down. No cable TV as well. The battery keeping the cable company's equipment powered must've died.
You can make all your systems redundant, distributed, and robust. But unless you control all the network lines between you and all the places you need to communicate with, you're not in total control over the reliability of the system. (And if you're curious, I was without power for 3 days. I had to move my refrigerator's contents outside to keep them cool since it was winter, and use a wood stove to keep the house warm and cook my meals. I dropped plans to buy a generator since there was no point if my Internet connection would only last about 90 minutes.)
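The point about not controlling the network can be put in numbers. A minimal sketch, assuming failures are independent (they often are not) and using availability figures I invented for the example: redundant parts combine in parallel, but anything you depend on with no alternative sits in series and caps the whole chain.

```python
from math import prod


def parallel_availability(parts: list[float]) -> float:
    """Availability of redundant parts: down only if every part is down."""
    return 1 - prod(1 - a for a in parts)


def series_availability(parts: list[float]) -> float:
    """Availability of a chain: up only if every part is up."""
    return prod(parts)


# Invented numbers: two independent backup power sources at 99% each,
# and a single cable Internet link at 95% with no alternative.
home_power = parallel_availability([0.99, 0.99])
isp_link = 0.95
end_to_end = series_availability([home_power, isp_link])

print(f"power (redundant): {home_power:.4f}")
print(f"end to end:        {end_to_end:.4f}")
```

However many batteries get added on the power side, the end-to-end figure can never exceed the 95% assumed for the single Internet link.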