Delta Air Lines Grounded Around the World After Computer Outage (cnn.com)
Delta Air Lines says it has suffered a computer outage throughout its system, and is warning of "large-scale" cancellations after passengers were unable to check in and departures were grounded globally. The No. 2 U.S. carrier said in a statement Monday that it had "experienced a computer outage that has impacted flights scheduled for this morning. Flights awaiting departure are currently delayed. Flights en route are operating normally." A power outage in Atlanta at about 2.30 a.m. local time is said to be the cause of computer outage. CNN reports: "Large-scale cancellations are expected today," Delta said. While flights already in the air were operating normally, just about all flights yet to take off were grounded. The number of flights and passengers affected by the problem was not immediately available. But Delta, on average, operates about 15,000 daily flights, carrying an average of 550,000 daily passengers during the summer. Getting information on the status of flights was particularly frustrating for passengers. "We are aware that flight status systems, including airport screens, are incorrectly showing flights on time," said the airline. "We apologize to customers who are affected by this issue, and our teams are working to resolve the problem as quickly as possible."
A power outage in Atlanta at about 2.30 a.m. local time is said to be the cause of computer outage.
Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.
The C level exec who most likely ignored the recommendation for a backup should be fired.
But he won't. At worst, he'll be asked to resign, he'll get a big fat bonus and find another high paying cushy job.
Most likely, there will be a goat somewhere who'll get escorted out of the building and will have a real hard time finding another job - because that's how it works for us peons. Unemployed means no good.
...I'd like to welcome you aboard Single Point of Failure Airlines.
the auto install of windows updates drains your battery and does not stop for battery mode or ups shut down commands.
Ha ha ha...very funny...not......
This was a power issue (cue the 'the IT staff needs to be hung by their scrotums for such shitty power infrastructure' comments).
You're messin' with my Zen Thing, man.....
A power outage at one location takes down the entire global delta network?
Even IF one of your data centers has a power outage (which should not happen as you should have backup generators and batteries that give power until the generators are spun up), you should always have at least ONE other backup data center to take over if something really fails for you.
You would think they would have a backup for the backup power. But like someone earlier said, this outage sounds suspicious.
According to the flight captain of JFK-SLC this morning, a routine scheduled switch to the backup generator this morning at 2:30am caused a fire that destroyed both the backup and the primary. Firefighters took a while to extinguish the fire. Power is now back up and 400 out of the 500 servers rebooted, still waiting for the last 100 to have the whole system fully functional.
...and life is a fucking nightmare o/~
(https://youtu.be/vzeOsEkzeA0
John Mulaney's stand up bit on Delta. It's worth it.)
I'll bet you dollars to donuts that the IT folks squealed like stabbed piglets that they needed a backup system alternative.
But the management chain did not want to swallow the costs.
Who knows? Maybe the costs of dealing with this fiasco will be cheaper than having a backup system . . . ?
Schroedinger's Brexit: The UK is both in and out of the EU at the same time!
http://www.tagesschau.de/wirts...
I'd bet good money that someone at Amazon is ALREADY meeting with them.
And will still save them money, even with migration costs. Of course, the already-abused IT staff will get downsized, and the C-level who signs off on it will get a fat bonus. . .
Let me guess:
They have multiple servers distributed throughout various geographic locations, redundant "hot spare" servers ready to kick in at a moment's notice, a direct terabit link to the internet backbone, redundant power supplies with generator backup, bomb-proof bunkers and a team of highly trained special ops to guard their servers...but their flight scheduling software is still some quick Cobol hack running on a 386 (upgraded from a VAX in the mid 90s) wheezing away in a (very hot, very dusty) cabinet on their head office, connected to the "net" via an RS232 cable, at 12k baud...which has just died of natural causes (i.e., rust)?
I mean, other than that, I have no clue how the second largest US airline (according to the TFS) can manage to have a world-wide computer blackout in this day and age.
Last time I worked with the airline industry, they were still heavily reliant upon mainframe systems. That means putting redundant equipment at diverse datacenters is more costly. It's not like spinning up a new rack of x86 VMWare servers.
Even IF one of your data centers has a power outage (which should not happen as you should have backup generators and batteries that give power until the generators are spun up),
Actually, what I'm hearing is that a fire in the backup generator took out the primary generator. So, this is a case in which the backup was the problem, not the solution.
Minnesota seems like a good place to house them....
My eyes reflect the stars and a smile lights up my face.
I wouldn't be so fast to lay this at the feet of IT.
It's the fault of IT (which includes IT management) unless you have evidence to the contrary. If they didn't adequately present the argument for why a robust network is a valuable asset then shame on them. Preventing the entire company from shutting down and losing millions of dollars per minute is a trivial argument to make. So yeah, IT carries most if not all of the blame here. If they couldn't make that argument then they suck at their job.
I'm certain they wanted to make it robust, distributed and redundant but that all costs money. When PHBs with MBAs see IT as a cost centre, they see all this redundancy as "waste" to be cut back. Budgets are reduced and so are capabilities.
[sarcasm] Ahh, yes the MBA scapegoat. Couldn't possibly be that the folks who designed the network did a shit job of it. Clearly they must have been undercut by some bean counter somewhere. [/sarcasm] In a company the size of Delta if a power outage in a single location causes a company wide failure, that is almost certainly a technical screw up and not a budgetary one. Making the argument for some equipment to make the network resistant to power outages is a trivial financial argument to make. If the IT engineers had a single point of failure like that and they weren't able to justify whatever upgrades were necessary then they are bad at their job. Either they didn't see the problem or they failed to justify the resources to fix it. Either way they are incompetent and take the lion's share of the blame.
Still though, this begs to be something hosted in a datacenter/cloud with an online shadow in the background of another location replicating everything and ready to take over at a moment's notice, or something similar. Pretty standard these days, but airlines are so tight for money that they end up sometimes shooting their own feet...
Airlines are making record profits these days. Arguing that they don't have the money to properly set up the system that runs the whole company is ridiculous.
Actually, what I'm hearing is that a fire in the backup generator took out the primary generator.
Shouldn't have any effect on the BACKUP DATA CENTER. One facility can go down. It happens. It should take a thermonuclear war to take out several if they are doing it right.
Most of y'all probably don't know what you're talking about. Here's what's going to happen:
1) Delta will file a loss-of-business / data system failure claim after things are stable again
2) They'll haggle with their insurer long after this little story is forgotten (and yeah, lots o' heartache today, but it's still probably going to be little.)
3) Delta will get a settlement of some dollar amount
4) Some bean counter will eventually tally the cost of that policy versus the payout versus how much all those redundant backups would have cost. The accountant will most likely conclude that it was a smart idea to have bought that insurance policy and NOT paid out the multimillions of dollars IT was asking for in redundant systems.
5) The insurance company will note the payout as a blip on its financials (probably already expected by the actuaries.) Insurance company will keep making profit.
The little air traveller is screwed and blued, but Delta and its insurer will keep flying. Doing business today without a data loss rider on your business insurance would be the really stupid idea, much more so than wasting money on redundant systems that are more expensive than said rider.
While on the surface it may appear their IT department is "incompetent" as one person pointed out, other factors could have contributed to the outage. Management not approving proper tests to be done or another datacenter in a completely different location. Improper maintenance on the generator(s). While IT may request things be done or placed a certain way, doesn't mean the facilities team care or understand why and do it their own way anyways. Like why have two generators located right next to each other? They probably shared the same resource for operating as well.
It takes an event like this for people to realize the importance of listening to the people who implement and maintain their infrastructure. I'm sure anyone who saw this happening is digging through their memos and pulling out the multiple requests for disaster recovery solutions to prevent these things. Not to show them, haha I told you so, but to cover their ass when they start looking for someone to fire.
It's easy to point out IT as the scapegoat but sometimes they just have to deal with what they're given by the higher ups.
I'll bet you dollars to donuts that the IT folks squealed like stabbed piglets that they needed a backup system alternative.
I'll take that bet. I'm betting they either overlooked something technical or they are just really bad at making financial arguments. Since a key part of engineering is being able to justify what you want to do in financial terms my guess is that they just weren't very good at their job. Justifying equipment to prevent an outage that would cost millions of dollars per minute is trivial.
Who knows? Maybe the costs of dealing with this fiasco will be cheaper than having a backup system . . . ?
Maybe but I doubt it. Given that Delta and other airlines are experiencing record profits, it's hard to see them not understanding the math of a system-wide shut down and what that would cost them.
IT folks usually put in the requirements for the power infrastructure, but I've almost never seen them handle it.
Often, it's building/maintenance who handles it.
And as with any project, it's probably that upper management didn't want to pay for the level of redundancy that IT said was required.
There are two types of people in the world: Those who crave closure
I don't know about the Delta operations center, but I do know that United has both a backup site (its pre-merger operations center) with a one-for-one replication of all operator positions and systems that they can fail over to in the event of an emergency, and a smaller site that they can switch over to faster, though with reduced capability. I would assume that all major airlines have similar setups, but something clearly didn't work at Delta. It isn't clear whether the problem here was with the backup systems in place or the management decisions made in response to the power failure.
For any IT discussion on slashdot, as time T increases, the probability of a neckbeard blaming "MBAs" approaches 1
Yeah, it's sort of a riff on Godwin's law. If you blame "MBAs" for a problem, that person has no fact based arguments left so the argument is over and the person doing it loses the argument. It's basically scapegoating and tribalism at its worst.
Management is a pretty easy target. Management has to make decisions with imperfect information (like playing poker) whereas engineers are used to working with greater certainty (more like playing chess) and it's hard for many of them to wrap their head around the difference. Engineers who don't actually know any better seem to think MBA is shorthand for management incompetence. Never mind that a MBA is a degree, not a person or even a category of people. It's as stupid and incoherent as saying CS = incompetent programmers. I happen to be an engineer but I'm also a certified accountant. I have degrees in both engineering and business and I use both in my day job running a manufacturing plant. I can say with absolute confidence that there are just as many engineering school graduates who are bad at their jobs as there are business school graduates who are bad at their jobs. I run into both routinely. And just as many who are good at their jobs as well. Just because you may have run into some of the bad ones doesn't grant the right to paint the rest with the same brush.
Accountants don't have a good idea of lost business opportunity or lost customers.
So while the basics may make financial sense, that doesn't actually mean it was a good idea.
There are two types of people in the world: Those who crave closure
Considering the crap "disservice" Delta usually demonstrates (multi-hour delays are a "standard feature" of Delta), this can only improve things...
Remember, kids "IT Doesn't Matter".
non union h1b electronic F*cked and they will use this to blame IT and can more USC's
I used to work on one of these systems.
The flight planning system takes inputs from several sources - weather forecasts, notices about airspace closures, etc. (NOTAMs), and booking info - and creates an optimal flight plan for the aircraft.
A modern airline doesn't have enough flight planning staff to take over manually if the system fails, so if your flight planning goes out, your fleet is gradually grounded.
The large number of servers is due to the optimization problem. You need to take into account the flight conditions and fuel costs in different locations in order to decide your route, altitude, and fuel loading. Since fuel is a huge percent of the operating cost of the airline, it pays to invest a little extra computing power into optimizing these and save a bit fuel on each flight.
Our system had lots of redundancy but, with all the data feeds, there are lots of moving parts. It's not hard to imagine a scenario where, for example, you get everything transferred over to your disaster recovery site, but for some reason the weather feed isn't coming in and you can't make flight plans.
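To make that failure mode concrete, here is a minimal sketch of the kind of readiness check a disaster recovery site might run before it starts producing flight plans. The feed names, the 30-minute staleness threshold, and the function names are all assumptions for illustration, not a description of any real airline system.

```python
from datetime import datetime, timedelta

# Hypothetical feed names, loosely based on the inputs listed above.
REQUIRED_FEEDS = ("weather", "notams", "bookings")


def stale_feeds(last_update, now, max_age=timedelta(minutes=30)):
    """Return the required feeds that are missing or older than max_age.

    last_update maps a feed name to the datetime of its latest record.
    """
    stale = []
    for feed in REQUIRED_FEEDS:
        seen = last_update.get(feed)
        if seen is None or now - seen > max_age:
            stale.append(feed)
    return stale


def can_plan_flights(last_update, now, max_age=timedelta(minutes=30)):
    """The site is only usable for planning if every required feed is fresh."""
    return not stale_feeds(last_update, now, max_age)


# Example: everything failed over, but the weather feed never reconnected.
now = datetime(2016, 8, 8, 6, 0)
dr_site = {
    "weather": now - timedelta(hours=3),
    "notams": now - timedelta(minutes=5),
    "bookings": now - timedelta(minutes=1),
}
print(stale_feeds(dr_site, now))
print(can_plan_flights(dr_site, now))
```

The point of the sketch is that "the servers are up at the DR site" and "the DR site can do its job" are separate checks, and one stale input is enough to fail the second.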
It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail. - Abraham Maslow
People who depend on glass infrastructures shouldn't throw stones.
And, as already pointed out, if insurance and lawsuits cost less than robust equipment, you can expect much more of this as the world goes online. This is what I love about the "corporation". Nobody is ever held responsible. It's almost as if by design :-/
“He’s not deformed, he’s just drunk!”
This story brought to you courtesy of paperless tickets. Yes they are cheaper, yes it is simpler if people can print their own tickets, but the IT has to be up and running.
I remember an airline IT outage back in September 2004. There was a bug in the OS's error-handling routine for a particular class of error. This had all been tested with this particular OS level and had worked, but they had been forced to change the OS configuration to accommodate some new software and the bug was in place. Moving to new discs required a reboot, and an additional configuration error caused problems. If it had been fixed within (I think) 90 minutes all would have been fine. The outage was 8 hours.
Passengers turned up at the airports with their paper tickets and were allowed to board. Any pre-allocated seating was ignored. People were laughing about flying the way things used to be, a good time was had by most.
Then came paperless tickets. The next outage had effects more like those we see in this case.
Mielipiteet omiani - Opinions personal, facts suspect.
What the fuck happened to failover? It is standard in the compute industry, so why don't Delta and Southwest use it? Someone call IBM or Oracle today.
Who knows? Maybe the costs of dealing with this fiasco will be cheaper than having a backup system . . . ?
By the time the Bean Counters get done? Depend on it. The books aren't going to show the future revenue lost because people swore off Delta in disgust and anyone who depends on surveys to obtain intangible data is going to get what they deserve. Even allowing for the fact that many people don't want to waste time on a survey to begin with, you can't survey people who thought "Delta? Those screwups?" and never even considered the company. Well you can, if you're into blanket surveys, but those are worth even less than the customer surveys.
And yes, if you detect an anti-survey bias there, you're right. To me, surveys are what you do when you're too out of touch to actually watch and listen to customers (Strike One), put blinders on your perceptions by virtue of only asking the questions your bean-counters think are worth asking (Strike 2), and are often only answered when the querent is either A) pissed, B) a "professional" survey answerer (limited, atypical population) or C) couldn't get away fast enough without actually gnawing off body parts (Strike 3).
So the drop in revenue over the long haul will be blamed on something more measurable and bonuses to the real offenders will continue unabated.
Of course it's the MBAs' fault. Their very raison d'être is calculating the costs of additional redundancy, and comparing that against the costs of a global operations failure and the ensuing loss of business due to carrier unreliability. Then, presenting this data to a decision maker for action.
There are only two ways that they can get off. One way is if the decision maker chose to accept the risk, knowing it fully. The other way is that if the IT department didn't advise them of the risk. I evaluate the chances of the IT department being dumb enough to not know what would happen as near zero.
You're left with MBAs who failed to present the business case properly or a CEO who is a retard. Choose one.
HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
This is not a response to your uninformed comment. It's to MANY uninformed comments.
1) They have a DR plan. Doesn't mean they chose to execute it (time to failover vs expected time to recover and many other decisions).
2) The accountants don't run the decisions.
3) The MBAs don't run the decisions.
4) IT doesn't run the decisions.
5) Decisions are made in a company at many levels, with the input of many department heads. Have you guys never worked in a large enterprise before?
I guarantee the cloud infrastructure guys are salivating at the opportunity to convince the MBAs to ditch Delta's data center. What they won't mention is how much it would cost to actually implement instant failover capability in a cloud environment. I'm not anti-cloud, but I do think a business as large as Delta isn't going to see a lot of cost savings over what they're paying now for equipment. Microsoft and Amazon don't give away capacity for free, and you often pay dearly for certain key elements (IaaS, network connections, etc.). The MBAs don't see this though; they only see CapEx vs. OpEx and "we can fire 90% of the IT department."
IT cost isn't exactly something airlines spend willingly. Unless it directly affects safety or increases revenue/reduces cost, they want nothing to do with it. I guarantee the proposal for a redundant data center, or even a cloud-based DR location was floated, looked at and rejected as being too expensive. Airline IT is a web of third-party dependencies, each of which has a few single points of failure. Although, bad luck for them, this one seems like a straight power outage and/or transformer/generator failure. At least it seems like they didn't fry their computing equipment if they were able to get back online in a few hours. Sadly, I have experience with this and have seen companies dismiss the cost of a $20K server colo and network connection as excessive. People seem to forget that you need to guard against downtime unless you're some Web 2.0 startup social media company...if it costs $XX,000 per minute of downtime, you have to be willing to eat that or pay for DR.
Some of the IT staff most likely don't have a scrotum you insensitive clod. Why aren't they getting their fair share of hanging.
It has for years. The consolidation of the commercial airline industry in the US has resulted in companies that suck uniformly, and prices that remain high uniformly. I have been able to fly Bristol to Madrid, in Europe for less than $50 each way. In the US, Denver to San Francisco, which is roughly the same distance, will set you back by $200 round trip, if you are very lucky. Usually, it will be more like $300.
You would think they would have a backup for the backup power. But like someone earlier said, this outage sounds suspicious.
Or if you are down for 2 days ($200 million), and the cost of having a fully redundant system is more than $200 million (equipment, people, process, ...), from a business sense, it may make more sense to just accept an occasional outage.
Yeah, but IT should be involved in testing it. Had one very embarrassed data center manager when the IT manager said something like "So if I hit this red button, we will be running uninterrupted on the batteries and nothing should go wrong right?" When receiving an enthusiastic affirmative he went ahead and hit the red button. As you might guess the data center immediately powered off, hard...
And before you mention how reckless this was, this was in an outage window that was scheduled ahead of time, was just after backups were complete and verified, and the IT manager had gotten the OK to do it from the higher ups. Basically everyone agreed they needed to know for sure it was going to keep running if the power was cut and were OK with the potential outage at a known time to hopefully prevent an unplanned outage at a random time.
Blimey I wouldn't do that and running a bog standard stream service never mind an airline with 100 million a day of revenue.
500 servers is about 50 racks. About 500,000 a year plus about 2,000,000 for kit and 4,000,000 for software and licenses and 250,000 for interconnect. So capex 6,000,000 and opex call it 1,000,000 per annum.
I normally rate a major DC failure (more than 10 min) at about once every 5 years.
Easy business case.
Also, generator and UPS failover is tough to test with one DC. Which hit this one bad.
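For anyone who wants to see the business case as arithmetic, here is a rough sketch using the figures quoted above. Two things are assumptions rather than numbers from the comment: the capex is written off over 5 years, and each major outage is taken to cost one full day of the 100 million a day revenue. The function names are made up for the example.

```python
def annual_dr_cost(capex, opex_per_year, amortization_years):
    """Yearly cost of the second site: amortized capex plus running cost."""
    return capex / amortization_years + opex_per_year


def expected_annual_outage_loss(loss_per_outage, years_between_outages):
    """Average yearly loss if a major outage happens once every N years."""
    return loss_per_outage / years_between_outages


# Figures from the comment above: capex 6,000,000 and opex 1,000,000 a year.
# Assumption: capex is written off over 5 years.
dr_cost = annual_dr_cost(6_000_000, 1_000_000, 5)

# Assumption: one major DC failure costs a full day of revenue (100,000,000),
# and happens about once every 5 years.
outage_loss = expected_annual_outage_loss(100_000_000, 5)

print(f"DR site per year:         {dr_cost:,.0f}")
print(f"Expected outage per year: {outage_loss:,.0f}")
print(f"Worth it: {outage_loss > dr_cost}")
```

Under those assumptions the second site costs about 2.2 million a year against an expected 20 million a year in outage losses. The conclusion flips only if the real cost of redundancy is much higher than the comment's estimate, which is what the mainframe and full-recovery comments elsewhere in the thread argue.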
Well the testing is assuming that it's there at all.
There are two types of people in the world: Those who crave closure
Without Federal requirements there is no way a corporation is going to spend that kind of money.
A few failures like this one and they'll dig into the couch cushions to find the change for it. Having a backup data center for stuff that will shut the company down is not exactly a tough thing to justify. This shutdown alone would probably justify the cost in a single day.
They have legal protections in place to assure they retain their terminal slots, so while they aren't making money now they won't lose in the long run.
Perhaps but if they managed their IT properly they wouldn't have to lose money now. They can buy the insurance or they can take the risk of serious illness so to speak. Their choice and their funeral. Sounds like they rolled the dice and came up snake eyes today.
The only businesses with total data recovery sites and plans to actually use them are Banks, and that is because they are required by the FDIC.
Not true. Some medical practices have them. Some internet firms have them (at least for the mission critical stuff). Some bits of the military and government have them. Insurance companies have them. Stock exchanges have them. And there are more as well. If it's valuable enough you have a backup data center of some sort.
Delta has demonstrated that it, one of the world's largest airlines, doesn't co-locate its critical infrastructure in redundant data centers with fail-over mechanisms. Delta's inability to operate has ripple effects in the operations of other airlines as well. Now criminals know that Atlanta is an Achilles' heel, and to cripple the world's air transportation systems, they need only attack its power grid. Obviously, market incentives are not sufficient to make them have a more robust infrastructure. I think the FAA needs to step in here and regulate a little sanity into the system.
Also same as trying to fish for your torrent sites a few days ago when Slashdot was outed as a FBI site now.
They want your responses to these stories. They are trying to plan for unexpected responses in future false flag events in USA.
Hack the thermostat? Airlines grounded? Where do you get your torrents from? How would you respond?
Please have a seat and tell us, we are not FBI we are just stealth Slashdot submitters. Also, have you tried Microsoft Anniversary 10 yet? IT IS FREE.
In the summer of 2003, the Great North-East Blackout hit New England and other areas in the U.S. and parts of Canada. My wife and I were in Montreal at the time. When we tried to fly home non-stop to California from Trudeau International Airport (called Dorval International Airport at that time) via Air Canada on an early morning flight, we instead found ourselves flying in the late afternoon to Dulles in Washington, DC, changing planes, and then flying home. We arrived at our house more than 12 hours late.
No, Montreal and the rest of the province of Quebec were not affected by the blackout. Air Canada's computers, however, were in Toronto. Toronto and much of the province of Ontario were indeed blacked-out. While other airlines continued normal operations out of Montreal, Air Canada could not confirm reservations or issue boarding passes. Air Canada had no remote backup facilities.
Apparently, Delta Air Lines learned no lesson from Air Canada's experience 13 years ago.
Do they have an entire recovery DC or space in someone else's DC? Most businesses have plans to recover certain applications or move them to run on backup/development hardware. I worked for years in contingency recovery, and most places I've supported have space to recover applications should they fail, but few have the dedicated space or a plan to recover an entire infrastructure should a failure occur, and fewer have a plan to move BACK to the original space when the problem is fixed. The cost to maintain duplicate hardware/space for everything, plus the people to recover it in an emergency, is ENORMOUS, and the logistics to do so are extremely complicated. Recovering the front end in a leased or rented space supported by another entity is very different from a full structure recovery move. Heck, most places don't even have the offsite data, e.g. full application code plus FULL data backups, needed to recover from scratch.
errr....umm...*whooosh* *whoosh* Is this thing on ?
The Illinois Tollway is loading I-90 up with backup power all along the new smart highway part. How redundant is that system? If it fails people can end up with free tolls.
There is a great deal of difference in recovering certain applications or having multiple sites running a subset of one facet of your operation. A full structure recovery requires the hardware, staff and FULL data, e.g. full application and user data available to recover from scratch. That kind of overhead is enormous. Recovering mission critical stuff is par for the course, but recovering everything in a DC needed to do day to day operations in the event of a full infrastructure failure is a different beast entirely.
errr....umm...*whooosh* *whoosh* Is this thing on ?
No, their computers upgraded to win10 overnight. So none of their custom software worked anymore.... :-)
It's fun watching every department point to every other department for blame.
IT - It's upper management's fault. We assumed they took care of it. Or now, it's facilities' fault.
Upper MGMT - It's IT's fault. We assumed if it still needed doing they would have told us.
Just need IT and Upper MGMT to talk first so they can sync the blame on facilities.
Your analysis rings true ...
And nothing will change at Delta unless there is a noticeable drop in ticket sales that can be directly attributed to this outage.
The inconvenience to passengers is worth a small bucket of warm piss unless passengers stop buying tickets.
Namely, they always shovel out heaps of number two whenever something goes terribly wrong. Their response policy is ALWAYS to tell LIES. It's POLICY to SPIN ALL NEGATIVE PRESS ATTENTION AT ALL TIMES. The truth will only make it worse because they know they are prone to major fuck ups and they have lots of enemies. They just don't want any of their cheap, stupid, or dishonest screw ups to look like they are willing and able to constantly screw up service for their customers, since it is a calculated risk they willingly take, and will take again, because it's worth being unreliable if it saves them the very high cost of total reliability and service. Delta didn't get to number two by acting like the best. They got there with compromised performance where it counts the most: the customers' ruined travel plans. What's a random act of system breakdown worth? It's very valuable because it buys them time to actually go offline and get some very serious technical work done in a few hours that might otherwise take weeks, and great resources and planning, to implement in parallel with smooth daily operations. When an event like a power failure in Atlanta is blamed for their problem, are we to believe it is reasonable to design their entire airline system to fail every time that event occurs? Is that how it goes with every other airline that experiences power failures? Not if they plan and provide contingencies for all known possible, albeit improbable, events. These jerks KNEW that this would be EXACTLY what happens IF a power outage occurs in Atlanta at 2AM on a SUNDAY. They probably can mitigate the collapse if it's during business hours, while all hands are on deck, but there is no way that they are going to PAY THE WADS OF CASH it costs to have a hair trigger response team on standby 24/7 in case of an off hour power outage.
Nor will they pay the heaps for backup power systems capable of managing a seamless power supply for their fat, power sucking, energy wasting corporate consumption of all available amps in every facility and every overclocked, full throttled server farm and network that is required. They could have told us 10 years ago that this is what will happen under these exact circumstances. That's how it was designed, because it's a great cost savings to gamble that the power failure won't happen, but if it does, they can handle it with the resources available to them during normal operations, which is where it counts most. They make a calculation that it's worth the cost savings not to worry about 24/7 contingency response because it involves a huge labor cost. They are not going to piss away profits just to be sure a customer is never inconvenienced. Fuck the customer and their frequent fickle flying. If you buy #2 you better be ready to eat some #2. That's what you get for burning a thousand hours of jet fuel in an hour because it's half the price of buying a tank of gas and driving all day to travel the same distance. Let's face it, you can just go fuck off until Greyhound or Amtrak is a better alternative. The fact is that they really don't need to worry about it, because it costs less to piss off customers, which is always, than it does to be certain that their customers are always happy, which is never. As long as they are number 2 in the duopoly, then we can just eat shit when the gap occurs. There are a lot of gaps in their design and implementation. They have no intention of filling them all. You just better hope you get lucky and don't need to travel when the dice crap out. But let's be honest, a power failure in summer in Atlanta is an event that you can COUNT ON happening. So clearly, this is how the system is designed to function in this exact situation. What makes you think that kind of design concept is an accident or unforeseen when clearly it's a calculated risk?
They mitigate bad luck with bullshit to deny that other future failures are guaranteed in the likely event of unusual dice. It's inevitable; it's hard to predict exactly when they will screw over the customer next, but you can be sure it will happen again. They will act as surprised as the customer when it happens, and everyone can go on with their pretense and deny any fault, because we allow fate to decide when we win or lose the bet.
My mechanical toilet is working well.
Perhaps Delta should send over their Crack-Team of Engineers and Scientists to figure out how it works.
Ha ha
"A spokesman for [Georgia power] said the problem for Delta was a failed 'switch gear' and that Georgia Power sent workers to assist the airline early Monday morning." from the WSJ
Somebody has lots of 'splaining to do, surely. Power up the deflectors.
(which should not happen as you should have backup generators and batteries that give power until the generators are spun up)
We have had outages due to power problems during planned maintenance on the UPS system (maintenance that was meant to allow future UPS work to be done without impact, by introducing live switching of the UPS between feeds). There was also an outage to some systems because they had been upgraded (e.g. routers with line cards added) to the point where one PDU could not supply sufficient power, even though total power was less than the "guaranteed" figure. To avoid a recurrence, power supplies were moved to other PDUs.
So, yes, while power failures shouldn't have impact, even in environments with supposedly robust frameworks (e.g. ITIL), mistakes happen or the impact of a change is not fully identified/understood (possibly due to the complexity of modelling the environment down to which of 8 power supplies on a device are connected to which of 4 PDUs in a cabinet which has 2 different feeds of the 6 feeds available in one DC in a campus with 3 DCs etc. etc.), resulting in unexpected failure modes.
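The PDU failure mode described above can be checked with a small model. This is only a sketch under stated assumptions: the device list, the wattages, the PDU names, and the rule that a device splits its draw evenly across its live supplies are all hypothetical, not anything from a real facility.

```python
from collections import defaultdict

def pdu_loads(devices, failed=None):
    """Watts drawn from each PDU.

    devices: list of (draw_watts, [pdu_name per power supply]).
    Assumption: a device shares its draw evenly across supplies
    whose PDU is still live.
    """
    loads = defaultdict(float)
    for draw_watts, pdus in devices:
        live = [p for p in pdus if p != failed]
        if not live:
            continue  # every supply is on the failed PDU: device is down
        for p in live:
            loads[p] += draw_watts / len(live)
    return dict(loads)

def overloaded_after_failure(devices, capacity_watts):
    """Map each single PDU failure to the PDUs it would push over capacity."""
    problems = {}
    all_pdus = {p for _, pdus in devices for p in pdus}
    for failed in sorted(all_pdus):
        loads = pdu_loads(devices, failed)
        over = sorted(p for p, w in loads.items() if w > capacity_watts[p])
        if over:
            problems[failed] = over
    return problems

# Hypothetical cabinet: two upgraded routers, each fed from PDUs A and B.
devices = [(3000, ["A", "B"]), (2400, ["A", "B"])]
capacity = {"A": 3000, "B": 3000}

print(pdu_loads(devices))                           # fine while both PDUs are up
print(overloaded_after_failure(devices, capacity))  # either failure overloads the other
```

In this made-up cabinet the total draw (5400 W) is under the combined capacity (6000 W), yet losing either PDU overloads the survivor, which is the kind of gap that only shows up when you model which supply is plugged in where.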
(Apologies in advance, still no preview on the "mobile" interface).
Didn't they have a monthly or quarterly "mains fail test"? Our environmental team's performance contracts require this ...
The talks have already happened. It's 'are they documented' or did upper mgmt just mention it in the hallway that there's no funding for backups so they couldn't be held accountable.
There are two types of people in the world: Those who crave closure
Doesn't
Ever
Leave
The
Airport
It's 'are they documented' or did upper mgmt just mention it in the hallway that there's no funding for backups so they couldn't be held accountable.
Unless IT asked for the stuff they needed in the hallways, there is documentation at least that IT was trying to prepare for this. If IT asked for this in that hallway and was fine with a no in that hallway, the IT folks didn't really think it was important.
Heck, I learned pretty low in the chain and at a very early age to ask for things in writing when they went against what I thought was correct - it only took a few times being burned to learn that. And also to know which battles were worth fighting and which weren't (this would clearly land in the former category - at the very least above the "was told no in the hallway" category).
Even IF one of your data centers has a power outage (which should not happen as you should have backup generators and batteries that give power until the generators are spun up), you should always have at least ONE other backup data center to take over if something really fails for you.
FWIW, Google's standard -- a mantra which Google SREs have pounded into my head -- is "n + 2". You don't have a reliable system unless you have enough capacity to operate it when you lose two of your components (the definition of "component" here is context-dependent; they're whatever your points of failure are). Why do you need two extras, rather than just one? Because inevitably there will be some time you have to take one of them down for maintenance or upgrade or something. If you only have "n + 1", then during that window of time you're down to "n", meaning exactly the capacity you need to handle the load... and if something goes wrong you then have "<n", i.e. not enough to operate. OTOH, if you have more than "n + 2", and the individual systems are reasonably reliable, then you're probably wasting resources.
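To make the arithmetic concrete, here is a minimal sketch of the "n + 2" reasoning. The function names and the load figures are made up for illustration; nothing here is Google's actual tooling.

```python
import math

def units_needed(peak_load: float, unit_capacity: float) -> int:
    """n: the number of identical components needed to carry peak load."""
    return math.ceil(peak_load / unit_capacity)

def survives(deployed: int, n: int, lost: int) -> bool:
    """True if at least n components remain after `lost` are out of service."""
    return deployed - lost >= n

# Hypothetical numbers: 900 units of load, 100 units of capacity per component.
n = units_needed(peak_load=900.0, unit_capacity=100.0)

# One component down for maintenance plus one unplanned failure = 2 lost.
print(survives(deployed=n + 1, n=n, lost=2))  # n + 1 is not enough
print(survives(deployed=n + 2, n=n, lost=2))  # n + 2 still carries the load
```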
I suppose at the DC level "n + 1" is probably adequate if your other processes are structured so that you never take an entire DC offline intentionally.
I think this is a good philosophy for anyone who is operating a piece of critical, bet-your-business computing infrastructure. Like, say, the database that allows one of the two or three largest airlines in the world to fly.
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
If you have lower priority processes that can tolerate being offline for the duration of a planned outage, you only really need something more like "n + 1.5." The .5 only has to be able to handle the critical work for long enough to get your full backup system back online.
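One way to read "n + 1.5" is sketched below, with capacity measured in fractional units. The 80% critical share and the unit counts are assumptions chosen for the example, not figures from any real system.

```python
def remaining_capacity(deployed_units: float, units_lost: float) -> float:
    """Capacity left, in units, after maintenance and/or failures."""
    return deployed_units - units_lost

def covers(capacity_units: float, full_load_units: float,
           critical_fraction: float = 1.0) -> bool:
    """True if the capacity can carry the given share of the full load."""
    return capacity_units >= full_load_units * critical_fraction

n = 4.0              # hypothetical: full load needs 4 units
deployed = n + 1.5   # the "n + 1.5" configuration

# One unit in planned maintenance, then one unplanned failure.
left = remaining_capacity(deployed, units_lost=2.0)

print(covers(left, n))                         # full load no longer fits
print(covers(left, n, critical_fraction=0.8))  # critical work alone still fits
```

The trade is explicit here: lower priority work is shed during the overlap, and the configuration only holds if the critical share really is small enough to fit in what remains.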
A transfer switch is almost always a single point of failure in a power system. It is the one place where primary power, backup power and the load come together.
On the contrary, after going through bankruptcies in recent years and shedding debt, pensions, etc., plus with the current low fuel prices, most airlines are currently swimming in cash.
What are you talking about? Those companies are public, so it's easy to see the numbers. Stop making shit up based on your faulty guesswork.
Delta has a book value of *negative* 3 billion, with cash flow down 115 million last year alone. American Airlines is also deep in the red: book value negative 700 million, cash flow down 600 million. Those are not companies where you can easily get a budget upgrade.
lucm, indeed.
If you have lower priority processes that can tolerate being offline for the duration of a planned outage, you only really need something more like "n + 1.5." The .5 only has to be able to handle the critical work for long enough to get your full backup system back online.
Absolutely. "n + 2" is a rule of thumb for critical systems... and it's also just a starting point. Thinking hard about your system may point out that you need even more, or maybe that you can get away with a little less. The rule in Google is that n + 2 is the default and then you can make arguments about why you need more or less.
Chances are that they got hacked.
Self-importance and self-indulgence is the root of ALL evil.
The IT staff? Or the management? "Oh I saved 10m by not having a backup data center." "Here's your 10K bonus!"
They say it simply didn't happen. Numerous analyst articles say Delta, like many other airlines, understandably operates aging and complex systems. Their system had just had enough.
Chewbacon
The Bible is like Wikipedia: written by a bunch of people and verifiable by questionable sources.
Things can still go wrong. In NYC back in the 1990s, they had a 911 test. Con-Ed turned off power to the 911 center. No problem, they have a big diesel to kick in to keep their big mother IBM mainframes running. Well that diesel was in pieces on the floor because it had hit max time and needed to be rebuilt. No problem, it had a failover to another big mother diesel, which fired right up and took the load. No problem. The problem is that it only had about 5 minutes worth of diesel. They hadn't moved the diesel feed to the other engine. Down she went! It took about 45 minutes to get the system back running again. In that time they had a number of heart attacks and if memory serves me, at least one guy died that they think otherwise would have survived.
I'm seeing a lot more stupid stuff happening. Cloud, big centers, etc. They put all their systems on a SAN. I've seen a relatively simple SAN problem take a few hundred machines out. Supposedly, that can't happen. Well, it did and does. I've also seen a technician delete a whole rack's worth of storage with one mouse click. Then there is management software. Now instead of screwing up just one machine, we can do a few thousand at a pop.
Now customers want to use a Software Defined Data Center (SDDC). Probably short skirt sales - everything is controlled by it: SAN, network, routing, switches, VPN, blades... the whole shebang! What could go wrong? Seems like the word clusterfuck was made for this.
Grumble.
An anonymous coward stated:
> First off you need a minimum of 2x the floor space in a min 2 different geographic locations.
> Second you need a min 2x the hardware at both locations. blah blah blah
> You need 2 x the number of blah blah blah
> Blah blah blah
Today you can do DR (Disaster Recovery) in AWS or other cloud infrastructure without needing 2x blah blah blah.
You do need 2x for *just* the database that stores truth, kept redundant synchronously or, in this case, near-synchronously, which is probably good enough (OK to lose a few hundred or even a thousand transactions, I would guess, just NOT OK to lose the entire system for a day. Jeeeesh.).
Almost all other systems can stay quiescent and not use actual cycles or energy until needed for recovery.
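As a rough back-of-the-envelope for the "few hundred or even a thousand transactions" guess, the exposure of a near-synchronous replica is roughly write rate times replication lag. The rate and lag below are invented for illustration; I have no idea what Delta's real numbers are.

```python
def transactions_at_risk(tx_per_second: float, replication_lag_seconds: float) -> float:
    """Writes committed on the primary but not yet on the replica at failure time."""
    return tx_per_second * replication_lag_seconds

# Hypothetical: 200 writes per second, replica trailing by 5 seconds.
print(transactions_at_risk(tx_per_second=200, replication_lag_seconds=5))

# Fully synchronous replication is the zero-lag case: nothing is at risk.
print(transactions_at_risk(tx_per_second=200, replication_lag_seconds=0))
```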
-- Dennis Allard
I know someone who flew home on Monday. They left on Monday and got home on Monday. Shit happens sometimes, but this wasn't a catastrophe even for everyone that was affected.