Slashdot Mirror


Delta Air Lines Grounded Around the World After Computer Outage (cnn.com)

Delta Air Lines says it has suffered a computer outage throughout its system, and is warning of "large-scale" cancellations after passengers were unable to check in and departures were grounded globally. The No. 2 U.S. carrier said in a statement Monday that it had "experienced a computer outage that has impacted flights scheduled for this morning. Flights awaiting departure are currently delayed. Flights en route are operating normally." A power outage in Atlanta at about 2:30 a.m. local time is said to be the cause of the computer outage. CNN reports: "Large-scale cancellations are expected today," Delta said. While flights already in the air were operating normally, just about all flights yet to take off were grounded. The number of flights and passengers affected by the problem was not immediately available. But Delta, on average, operates about 15,000 daily flights, carrying an average of 550,000 daily passengers during the summer. Getting information on the status of flights was particularly frustrating for passengers. "We are aware that flight status systems, including airport screens, are incorrectly showing flights on time," said the airline. "We apologize to customers who are affected by this issue, and our teams are working to resolve the problem as quickly as possible."

239 comments

  1. Incompetent IT by sjbe · · Score: 5, Interesting

    A power outage in Atlanta at about 2:30 a.m. local time is said to be the cause of the computer outage.

    Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.

    1. Re:Incompetent IT by Anonymous Coward · · Score: 1

      It's money-making time for some consultant out there...

    2. Re:Incompetent IT by Anonymous Coward · · Score: 2, Insightful

      More than likely an underfunded IT department. IT people often know what's needed for a reliable system, but the higher-ups just see them as a cost center and won't provide them with a sufficient budget.

    3. Re:Incompetent IT by Anonymous Coward · · Score: 0

      Or invested some $$ on a generator and fuel contracts.

    4. Re:Incompetent IT by NotInHere · · Score: 5, Insightful

      Probably the higher-ups who decided that redundancy is not required are long gone and doing something different now. They could show off how nicely they could cut so many costs to their bosses and probably got a big bonus for the two quarters they were employed before going to the next job.

    5. Re:Incompetent IT by mjwx · · Score: 5, Insightful

      A power outage in Atlanta at about 2:30 a.m. local time is said to be the cause of the computer outage.

      Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.

      I wouldn't be so fast to lay this at the feet of IT.

      I'm certain they wanted to make it robust, distributed and redundant, but that all costs money. When PHBs with MBAs see IT as a cost centre, they see all this redundancy as "waste" to be cut back. Budgets are reduced and so are capabilities.

      This is the kind of stupidity I see from American companies all the time. Here in Europe, computer downtime like this for a mere hour costs millions of pounds for an airline as they become liable not just for refunds, but also for extra costs as travel insurers pay large sums of money to get people where they're supposed to go. The reinsurers will then send their lawyers to present the airline with a nice bill.

      --
      Calling someone a "hater" only means you can not rationally rebut their argument.
    6. Re:Incompetent IT by Anonymous Coward · · Score: 0

      You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.

      IT is viewed as a cost-center; a necessary evil of doing business, a line item in the budget with a negative value.
      Anything that managers can do to reduce that cost, they will do it.
      Delta Air Lines may finally be realizing the high cost of low IT prices.

    7. Re:Incompetent IT by Anonymous Coward · · Score: 2, Interesting

      AFAIK pretty much all airlines run scheduling software from a single company (I remember reading an article about how Southwest moved from an in-house system to the same as everyone else due to complexity issues), so it's not so much the airlines but this 3rd party that seems to have somewhat fragile software.

      Still though, this begs to be something hosted in a datacenter/cloud with an online shadow in the background of another location replicating everything and ready to take over at a moment's notice, or something similar. Pretty standard these days, but airlines are so tight for money that they end up sometimes shooting their own feet...

    8. Re:Incompetent IT by ilguido · · Score: 0

      Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust.

      If they're clueless enough to buy 11,000 Surface tablets from Microsoft, it is not that amazing. They are (were) the poster child of Microsoft services for airlines: Microsoft Dynamics and Delta Air Lines: Innovative technology and personal service equals empowered employees and happy travelers

    9. Re:Incompetent IT by tripleevenfall · · Score: 4, Funny

      "Johnson, get in here"

      "Yes sir?"

      "You said you apped this in the cloud. How does the cloud go down?"

      "Well, er... "

      "Where are the damn synergies? I was told there would be synergies!"

    10. Re:Incompetent IT by tripleevenfall · · Score: 4, Insightful

      For any IT discussion on Slashdot, as time T increases, the probability of a neckbeard blaming "MBAs" approaches 1.

    11. Re:Incompetent IT by Anonymous Coward · · Score: 0

      Why is there this vocal minority of Europeans who somehow think they're better than the US? All of those costs you say European airlines have to face are the same in the US. Right now Delta is going to be paying for a lot of hotel rooms. And I've spent plenty of time in Europe and seen that they do exactly the same. Honestly, a lot of what Europeans do seems a lot worse. To this day I'm pissed off about a taxi I had to take to the airport in Madrid. I told the guy before I got in that I had no cash and asked whether a credit card was fine. He said yes, and then when we got there his credit card machine didn't work, and he wouldn't let me leave with my bags until I paid him. That left me scrambling to find an ATM while I was late for my flight, only to get back and find that the fucker had gone through my bags. Don't say you can take something and then hold me responsible when your shit doesn't work. Some Europeans are worse than anything I've ever come across in the States.

    12. Re:Incompetent IT by Anonymous Coward · · Score: 0

      As far as I can tell the Surface is a great machine. I've known like 50 people who own them and they've all said they love it. I've never bought one because it costs more than I'm willing to pay for a tablet, but if it's what you need, the people who've gotten them have only had positive things to say. Maybe you just have an irrational bias against anything Microsoft because you don't like some things about them?

    13. Re:Incompetent IT by Kierthos · · Score: 3, Insightful

      From the sound of things, I'd say the cabbie played you. I wouldn't be a bit surprised if this is a scam he runs regularly.

      --
      Mr. Hu is not a ninja.
    14. Re:Incompetent IT by lucm · · Score: 1

      Totally agree. On one hand, airlines are not swimming in cash so everything requires a tedious business case. But also it's a fact that many organizations require a major incident before believing those birds of ill omen in IT.

      --
      lucm, indeed.
    15. Re:Incompetent IT by Anonymous Coward · · Score: 0

      It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle.

      Aside from some butterfly chaos theory effect, it didn't.

      Seriously, it didn't. That flight was fine. Their issue is not with the flights, it's on the ground, and with the paperwork. Passenger manifests. Luggage tracking. Maintenance logs. All sorts of fun with insurance liability and financial accountability.

      If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.

      From what I'm seeing, this seems not to be a network problem, but an electrical fault of some kind. One that occurred when testing the switchover to the backup.

      I don't doubt something like it has happened before, but as it is, I'm not sure it's something Delta could have anticipated except by being off-site redundant, and even then, I'm not sure what they'd have had to do differently.

    16. Re:Incompetent IT by Joe_Dragon · · Score: 1

      And if you did not have a bag, what would the cabbie do, call the cops?

      What about the rules saying that they must take cards? The machine is "broken" because they don't want to pay the fees.

    17. Re:Incompetent IT by Anonymous Coward · · Score: 0

      Just as long as they got The Low Price[TM] for their IT, it's all good, right? Executive bonuses all 'round!

    18. Re:Incompetent IT by jbengt · · Score: 3, Insightful

      On one hand, airlines are not swimming in cash so everything requires a tedious business case.

      On the contrary, after going through bankruptcies in recent years and shedding debt, pensions, etc., plus with the current low fuel prices, most airlines are currently swimming in cash.

    19. Re:Incompetent IT by stealth_finger · · Score: 1

      Probably the higher-ups who decided that redundancy is not required are long gone and doing something different now. They could show off how nicely they could cut so many costs to their bosses and probably got a big bonus for the two quarters they were employed before going to the next job.

      Would that have been before or after they pointed out all these planes have two engines, we could cut costs massively by removing one from each?

      --
      Wanna buy a shirt?
      https://www.redbubble.com/people/stealthfinger/shop?asc=u
    20. Re:Incompetent IT by Kjella · · Score: 5, Insightful

      Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.

      Scaling out is easy if you're Facebook or Google and nobody cares about a perfectly consistent truth. If you run transaction processing like airplane tickets people damn well like to know if they got their ticket booked and Delta want to know if they got paid, they want ACID compliance not "eventual consistency" NoSQL. That usually leads to mainframes and 99.99999% uptime systems with redundant power, network links etc. not clusters and distribution. Maybe also a hot failover next to it hooked up by a fat pipe. But if shit hits the fan big time in the data center, it goes down. Doesn't look like it took them *that* long to scramble what I assume is their cold backup online.
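
      For a sense of scale, the 99.99999% uptime figure mentioned above can be turned into a yearly downtime budget. The snippet below is only back-of-the-envelope arithmetic: the function name and the list of comparison percentages are illustrative assumptions, not anything from Delta or a particular vendor.

```python
# Rough downtime budget implied by an availability percentage.
# Assumption: a year of 365.25 days; the helper name is made up for illustration.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the seconds per year a system may be down at this availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Comparison points chosen for illustration only.
    for pct in (99.9, 99.99, 99.999, 99.99999):
        print(f"{pct}% uptime -> {downtime_seconds_per_year(pct):,.1f} s of downtime per year")
```

      Seven nines leaves roughly three seconds of downtime a year, so a multi-hour outage uses up that budget many thousands of times over.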

      The passengers aren't happy, but hey, sometimes shit happens with planes or crew or airports or whatnot, leading to delays and cancellations. I've had a rescheduled flight and a night in a hotel because KLM got delayed and wasn't allowed to lift off because the destination airport was closing. It sucks, but this is a fact of life for airlines. It becomes a big story because it happened to lots of people at once, but over, say, a year, how big a deal is it really? I'm sure they'll do a post mortem but I'd be surprised if they moved away from a centralized architecture.

      --
      Live today, because you never know what tomorrow brings
    21. Re: Incompetent IT by Anonymous Coward · · Score: 0

      A company the size of Delta does not need to be using somebody else's insecure, run by third world tech support 'cloud' to make their stuff work. They just need to, you know, make their stuff work.

    22. Re:Incompetent IT by NormalVisual · · Score: 1

      AFAIK pretty much all airlines run scheduling software from a single company (I remember reading an article about how Southwest moved from an in-house system to the same as everyone else due to complexity issues), so it's not so much the airlines but this 3rd party that seems to have somewhat fragile software.

      Dunno about the scheduling package, but most airlines contract with one of the major providers of reservations management services. At the time I worked in the field (little more than 10 years ago), the big names were Worldspan, Sabre, Navitaire, and a couple of others. I remember a HUGE clusterfuck that happened when Navitaire went down, and just completely screwed one of our major customers, grounding flights all over the country for several hours. Listening to the Navitaire folks and the airline folks screaming and pointing fingers at each other on the conference call was a hoot (once we'd shown that the problem wasn't at our end, of course), although I'm guessing the thousands of people stranded all over the country wouldn't have thought so.

      The point is that an airline can experience a system failure somewhere and not have it be due to anything they did/didn't do. In that particular case, the airline hadn't done anything wrong, and their end of the system was up and working properly. I'm sure Navitaire wrote a big check after that incident.

      --
      Please stand clear of the doors, por favor mantenganse alejado de las puertas
    23. Re:Incompetent IT by JaredOfEuropa · · Score: 2

      It's not always due to cost, sometimes it's plain stupidity. I did some work for a company that experienced a similar outage (not an airline company but one equally dependent on their datacenter). They had a new DC and spent good money on it, with redundant systems and power, top notch fire suppression systems, spare no expense. One day the mains power failed, the backup generator dutifully kicked in, died, and the secondary backup tried to start and failed. Turns out they had 2 backup generators. Hooked up to the same Diesel tank. Which was empty. The cost of adding a second tank would have been trivial, not to mention the paltry cost of having someone periodically check that there actually is some fuel in there.

      By the way, this was a European company.

      --
      If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
    24. Re:Incompetent IT by shuz · · Score: 2

      I know people get upset about these kinds of things and airlines have really high public exposure to failure. But processes do fail. I don't work for Delta and don't have any affiliation with them. But I work in the sector and have felt the sting of system failure. Don't be quick to judge; hindsight is 20/20. An example of what could have caused this is a complex network + storage device failure. It is reasonable for devices that never get turned off to fail to turn back on if they ever lose power. I'm sure Delta has a DR site, but the DR site may also have experienced a failure if it was in close proximity to the main site. Also, failing over to a DR site can often take many hours. This is all the price to be paid for the efficiency of computing. A year from now few people except for employees at Delta will even remember that it happened.

      To all the folks around the world affected by this, hang in there. If they are offline after 24 hours then it is probably time to question what is going on. To the fine IT folks at Delta: I've been there. Good luck to you, and don't forget to rotate folks out for rest. A freshly rested brain works faster.

      --
      There is or can be built a machine that can simulate any physical object. -Church-Turing principle
    25. Re:Incompetent IT by Voyager529 · · Score: 2

      "Where are the damn synergies? I was told there would be synergies!"

      Johnson: "Sir, the synergies are configured, just as you ordered. When one part of the system goes down, the whole system goes down. They work together that way, just like you asked."

    26. Re:Incompetent IT by Anonymous Coward · · Score: 0

      Yep, that's stupidity.

      We test our backup generators every week (if the generators are running when I get to work, it must be Thursday). It's only a few bucks worth of diesel fuel, and it makes sure everything is working.

      (Last place I worked had power feed from two different substations, so we'd have to lose both before the generators kicked in -- those got tested regularly too, just not as frequently.)

    27. Re:Incompetent IT by Anonymous Coward · · Score: 0

      Scaling out is easy if you're Facebook or Google and nobody cares about a perfectly consistent truth. If you run transaction processing like airplane tickets people damn well like to know if they got their ticket booked and Delta want to know if they got paid, they want ACID compliance not "eventual consistency" NoSQL. That usually leads to mainframes and 99.99999% uptime systems with

      Exactly! It seems everyone these days is so quick to say "Facebook does it." Facebook doesn't have a need for accurate data. If a post that you are taking a crap goes missing nobody will care. Also a passenger on a flight is not a single ACID transaction -- it's many! The transaction rate for Delta must be insanely high. Distributed replication of that is likely to be very difficult.

    28. Re: Incompetent IT by Anonymous Coward · · Score: 0

      I had the same thing happen to me going to the Amsterdam airport. Then I started taking Uber so I knew a credit card would work. Then Uber was banned there...

    29. Re:Incompetent IT by Joe_Dragon · · Score: 1

      Sounds like the case of this one place that let their diesel tank run dry just from the test runs every X days. Then, when they really needed it, it ran out, as no one had set up an auto-refill contract.

    30. Re:Incompetent IT by Anonymous Coward · · Score: 0

      Would that have been before or after they pointed out all these planes have two engines, we could cut costs massively by removing one from each?

      Er, like when aircraft went from four engines to three to two?

    31. Re:Incompetent IT by funwithBSD · · Score: 1

      I have consulted with a large airline, can't remember if they are #1 or #2 right now, and believe me, it is a very very complex system.

      We were brought in to reduce the complexity and increase the resiliency in case of a disaster/failure.

      The current environment, which is geographically paired mainframes with mid-range "helper" apps and data caches is about the best it can be.

      Getting everything coordinated from meals ordered, drinks loaded, fuel, baggage, load balance plan, seating, payments, etc, is incredibly complex for a large airline.

      Fuel saving from accurately tracking how much baggage is on board and not overstocking meals/drinks is significant, so while a PITA, it is worth it.

      --
      Never answer an anonymous letter. - Yogi Berra
    32. Re:Incompetent IT by Anonymous Coward · · Score: 0

      You don't have to have your data centers located next to each other, and good DR planning dictates that you _don't_. I've worked with/for several banks and their customers are even less tolerant of transactional mistakes than airline customers are. They all ran multiple data centers in different regions so a single natural disaster couldn't take them out. For example, a major US bank had three primary data centers: Southwest, East Coast and Midwest (banks are generally sensitive about where their data centers are; you can generally find out with a little bit of sleuthing, but they don't advertise it). Two were always online and sharing load. The third was down for repairs but had to be able to come back online in a set amount of time (hours) or it was the data center manager's ass. They had a rotating schedule of which data center would be down when, so upgrades could be planned weeks or months in advance.

      An absolutely essential piece is having accurate timekeeping and making sure all computers are synced to it. External NTP sources are not accurate enough; most data centers used in this manner have local NTP servers that sync with GPS signals (e.g. Epsilon GPS Clocks). Google actually uses atomic clocks to ensure consistency in Spanner, which is their globally distributed database.

    33. Re:Incompetent IT by ilguido · · Score: 2

      As far as I can tell the surface is a great machine. I've known like 50 people who own them and they've all said they love it.

      Now tell me how many of them were aircraft pilots, air stewards or the like.

      Maybe you just have an irrational bias against anything Microsoft because you don't like some things about them?

      I have a very rational bias against silly business decisions, Mr. Coward.
      I am pretty sure that the money spent on those 11,000 tablets could have been better spent on backup servers or other essential IT equipment, not on something that looks like a pure marketing decision.

    34. Re:Incompetent IT by Anonymous Coward · · Score: 0

      I call bullshit. If you know 50 people who own a Surface, you either work at Microsoft or somewhere where Microsoft gave them out for free (NFL etc).

    35. Re:Incompetent IT by Anonymous Coward · · Score: 0

      Scaling out is easy if you're Facebook or Google and nobody cares about a perfectly consistent truth. If you run transaction processing like airplane tickets people damn well like to know if they got their ticket booked and Delta want to know if they got paid, they want ACID compliance not "eventual consistency" NoSQL. That usually leads to mainframes and 99.99999% uptime systems with redundant power, network links etc. not clusters and distribution.

      I would expect that a hybrid system composed of an eventually consistent scalable cluster that handles flights and flight reservations for all reservations more than two weeks in advance of a flight date, coupled with a transactional system that handles flights and reservations for anything inside of two weeks, would probably be a better solution than just throwing a mass of redundant mainframes & ACID-compliant code at the whole problem.

      After all, when you go to book a flight six months in advance, do you really care if it takes 24 hours to resolve and confirm or deny a reservation for a flight, especially if there's only one chance in a million that a resolution conflict occurs during booking at that point?
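
      The hybrid split proposed here could be sketched as a simple router. This is only an illustration of the idea in this comment: the two-week cutoff comes from the text above, while the function name and the backend labels are hypothetical and do not describe any real airline system.

```python
from datetime import date, timedelta

# Cutoff taken from the comment: bookings inside two weeks go to the
# transactional system, everything further out to the scalable cluster.
TRANSACTIONAL_WINDOW = timedelta(days=14)


def backend_for(flight_date: date, today: date) -> str:
    """Pick a (hypothetical) backend label for a reservation request."""
    if flight_date - today <= TRANSACTIONAL_WINDOW:
        # Close to departure: needs strict ACID guarantees.
        return "transactional"
    # Far in the future: can tolerate slower conflict resolution.
    return "eventually_consistent"


if __name__ == "__main__":
    today = date(2016, 8, 8)
    print(backend_for(date(2016, 8, 10), today))  # inside the two-week window
    print(backend_for(date(2017, 2, 8), today))   # about six months out
```

      The hard part such a sketch skips is the handover: every reservation has to migrate from one store to the other as its flight date crosses the cutoff.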

    36. Re:Incompetent IT by CanadianMacFan · · Score: 1

      It doesn't have any impact on the flight that is already in the air. However once the plane lands the computer has the instructions for where the plane is going next, how much fuel to put into it, where to route the luggage and any cargo that was in it, what to load into the plane for the new trip, what supplies to replenish, what passengers are supposed to get on board, who is supposed to work on the plane, etc. Without all that information that plane is grounded.

    37. Re:Incompetent IT by Anonymous Coward · · Score: 0

      Outsource IT admin of your Windows server farm to India and, yes, this is a perfectly plausible occurrence.

    38. Re: Incompetent IT by Anonymous Coward · · Score: 0

      Your first mistake is believing the story you're getting from the media is in any way accurate, complete or relevant.

    39. Re:Incompetent IT by Tablizer · · Score: 1

      Yes, but banks can be run nearly independently of each other, while airlines cannot: each flight is potentially inter-related. Banks don't fly and cannot land in the same city during the same time period.

      One could perhaps split out airlines into artificial groupings, but that may create inefficiencies that an integrated (centralized) approach wouldn't have. There may be a trade-off between efficiency and independence, and this airline may have chosen efficiency over less down-time.

      Also it's probably impossible to keep transactions in-sync for spare systems separated by many miles without some delay. Light doesn't travel fast enough. There probably has to be a primary system with the other 2 regions you mentioned lagging. The more transactions involved, the harder it may be to have reliable up-to-date spares. It's possible airlines have more transactions than banks and that Delta did have spares, but they were not up-to-date enough to be transparently useful.
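
      The speed-of-light point can be put into rough numbers. The sketch below assumes signals travel through fiber at about two thirds of the speed of light in vacuum and uses a made-up 3,000 km separation between sites; it ignores routing, switching and disk latency, so real figures would be worse.

```python
# Lower bound on synchronous replication latency between two distant sites.
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_FACTOR = 2 / 3               # assumption: typical slowdown in optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round trip over fiber, in milliseconds."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


def max_serial_commits_per_s(distance_km: float) -> float:
    """Commits per second if each one waits for the remote acknowledgement in turn."""
    return 1000 / min_round_trip_ms(distance_km)


if __name__ == "__main__":
    distance_km = 3000  # hypothetical distance between two data centers
    print(f"round trip: {min_round_trip_ms(distance_km):.1f} ms")
    print(f"serial commits: {max_serial_commits_per_s(distance_km):.1f} per second")
```

      Under these assumptions each strictly serialized commit pays about 30 ms. Independent transactions can overlap in flight, so the limit bites hardest on updates that contend for the same record, such as the last seats on one flight.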

      I don't know the transaction rate of airlines compared to banks. This info may include flight positions and status, in addition to ticket purchases.

    40. Re:Incompetent IT by Tablizer · · Score: 1

      This is the kind of stupidity I see from American companies all the time. Here in Europe...

      The business mindset between the countries may be different with different trade-offs.

      Perhaps the USA is best at being "cowboys", breaking into new industries, while Europe is better at longer-term infrastructure.

    41. Re:Incompetent IT by Anonymous Coward · · Score: 0

      They should have automatic failover to another system that was kept in sync with all the transactions going on in the main system. You'd lose all the transactions that were in progress (the website and booking system should really be on a different system than the flight info too), but all the flights and stuff would continue. This is basic IT stuff for high risk systems. I learned it in my distributed computing class as a software engineer in college. Do IT degrees really not teach that? A duplicate system shouldn't be too costly. You'd need to keep around spare parts anyway. It sounds like all the airlines are one IT fuckup from complete bankruptcy. No one noticed the backups stopped working and the power glitches? All company data gone.

      Centralized architecture can work on this scale, but you do have to put some thought into handling failure cases.

    42. Re:Incompetent IT by MachineShedFred · · Score: 2

      Swimming in cash or not, if your entire enterprise hits the pause button stranding thousands of people in places they don't want to be because of a failure of your disaster recovery / business continuity plan, that's a universally bad thing, and an abject failure to plan or realize the potential of a multi-hour data center loss.

      Someone fucked up.

      --
      Slashdot still doesn't support Unicode after it was added to the HTML standard in 1997.
    43. Re:Incompetent IT by Tablizer · · Score: 1

      [PHB:] How does the cloud go down? ... Where are the damn synergies? I was told there would be synergies!

      Engineer: "The synergies got transposed in the Flux Capacitor and will need a Matrix Realignment with the Cloudification Quantum Dot Decryption Cycler Stack."

      PHB: "Bahh, whaddever, just fix it fast, or you're fired!"

    44. Re:Incompetent IT by Anonymous Coward · · Score: 0

      Yeah, like banks haven't solved this problem already... /s

      Oh, they have.

    45. Re: Incompetent IT by Anonymous Coward · · Score: 0

      Unfortunately companies like Delta fell into the trap of

      "We are an Airline, not an IT company!"

      After which they outsourced everything and are now realizing that was not a good idea.

    46. Re:Incompetent IT by swillden · · Score: 1

      "Where are the damn synergies? I was told there would be synergies!"

      Johnson: "Sir, the synergies are configured, just as you ordered. When one part of the system goes down, the whole system goes down. They work together that way, just like you asked."

      Clearly we need a new buzzword for this, er, feature. How about "fully-synergistic non-performance"?

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    47. Re:Incompetent IT by swillden · · Score: 1

      Scaling out is easy if you're Facebook or Google and nobody cares about a perfectly consistent truth.

      Actually, outside of web search, basically all of Google's systems also require perfectly consistent truth. Gmail can perhaps allow a bit of latency between when an email arrives and when it shows in your inbox, but once it's there it always has to be there. Google Docs can't occasionally not find your doc or show you an outdated version. And, of course, Google also runs a massive transactional payment system, used to collect all of those tens of billions of dollars and pay all of the suppliers, app developers, etc.

      That usually leads to mainframes and 99.99999% uptime systems with redundant power, network links etc. not clusters and distribution.

      It can also be done with clusters and distribution, including ACID compliance. Google's usual database solution for that sort of application is sharded, replicated MySQL. The data is sharded into separate database instances based on some key and then each shard is replicated. For example, suppose you sharded your database by the first letter of each customer's last name. Assuming the Latin alphabet (not a workable assumption; this is a trivial example), you end up with 26 databases and when you want to find "Gates, Bill", you know you have to look in the "G" database, which itself is replicated (writes must reach all replicas to succeed, requiring two-phase commits; reads can come from whatever physical shard is nearest/first). "Virtual sharding", where you shard your primary key into a very large number of data sets and then map the virtual shards onto a smaller number of physical databases, makes this more flexible, so you can reallocate your physical shards as needed, and even incorporate replication into the sharding.

      What about when you need to do a lookup based on something other than your primary key? Well, it's the same indexing problem faced by all databases, except moved to a higher layer, and amenable to the same sort of sharding.

      Such systems are more complicated to build and run than the mainframe approach, but they can also scale to much larger sizes without losing their integrity.
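
      A minimal sketch of the "virtual sharding" idea described above, not Google's actual implementation: keys hash into a large fixed number of virtual shards, and a lookup table assigns each virtual shard to a physical database. The shard count, the database names and the helper functions are all hypothetical, and replication and two-phase commit are left out entirely.

```python
import hashlib

NUM_VIRTUAL_SHARDS = 1024                              # assumption: fixed up front
PHYSICAL_DATABASES = ["db-a", "db-b", "db-c", "db-d"]  # hypothetical names


def virtual_shard(key: str) -> int:
    """Map a primary key to a stable virtual shard number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_VIRTUAL_SHARDS


# Initial assignment of virtual shards to physical databases (round-robin).
shard_map = {
    v: PHYSICAL_DATABASES[v % len(PHYSICAL_DATABASES)]
    for v in range(NUM_VIRTUAL_SHARDS)
}


def database_for(key: str) -> str:
    """Return the physical database currently holding this key's shard."""
    return shard_map[virtual_shard(key)]


def move_virtual_shard(shard: int, new_database: str) -> None:
    """Reassign one virtual shard; the data copy itself is not modelled here."""
    shard_map[shard] = new_database


if __name__ == "__main__":
    key = "Gates, Bill"
    print(key, "->", virtual_shard(key), "->", database_for(key))
```

      Because a key's virtual shard never changes, rebalancing only edits the lookup table (plus copying that shard's rows), which is what makes reallocating physical shards cheaper than re-hashing every key.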

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    48. Re:Incompetent IT by The+Grim+Reefer · · Score: 1

      Swimming in cash or not, if your entire enterprise hits the pause button stranding thousands of people in places they don't want to be because of a failure of your disaster recovery / business continuity plan, that's a universally bad thing, and an abject failure to plan or realize the potential of a multi-hour data center loss. Someone fucked up.

      It depends. If the cost of putting the stranded passengers up in hotels and re-booking flights costs less than the redundant systems and the cost of upkeep for DR and salaries, then this turns out to be a net positive for Delta. Or that's how the bean counters will look at it.

      Air travel in the US sucks compared to the EU and especially compared to several of the Asian carriers. Customer satisfaction doesn't mean squat to any of the large domestic carriers these days. Of course most travelers would probably choose to fly on a plane sitting on a board resting on two cinder blocks if it would save them $5. So the market gave people what they wanted. This is just an extension of that.

    49. Re:Incompetent IT by Solandri · · Score: 3, Insightful

      I used to live in Point Roberts, WA and power was very unreliable. I worked from home half the week, so I bought two UPSes (one for the computer, one for the cable modem and router), and kept a charged car battery in the house with a 12V inverter which would give me by my calculations about 10 hours on my laptop (on top of my laptop's 5 hour battery). I had plans to buy a generator as well.

      One day the power went out. The UPS kicked in. Power usually came back within a couple minutes so I kept working. After about 10 min, the UPS began warning it was nearly drained. So I shut down the desktop and switched to my laptop. Unfortunately I hadn't charged it so I got a low battery warning after about an hour. I lugged out the car battery, clamped on the leads for the inverter, plugged the laptop into the inverter, and fired it up. I was back in business again.

      Got on the laptop, logged in to work. 30 seconds later the Internet went down. No cable TV as well. The battery keeping the cable company's equipment powered must've died.

      You can make all your systems redundant, distributed, and robust. But unless you control all the network lines between you and all the places you need to communicate with, you're not in total control over the reliability of the system. (And if you're curious, I was without power for 3 days. I had to move my refrigerator's contents outside to keep them cool since it was winter, and use a wood stove to keep the house warm and cook my meals. I dropped plans to buy a generator since there was no point if my Internet connection would only last about 90 minutes.)

    50. Re:Incompetent IT by TigerPlish · · Score: 1

      Clearly we need a new buzzword for this, er, feature. How about "fully-synergistic non-performance"?

      Asymmetric performance.

      --
      The "Civilized World" jumped the shark ca. 1973.
    51. Re:Incompetent IT by Anonymous Coward · · Score: 0

      That's because MBAs ARE the cause of the problem. *strokes neckbeard*

    52. Re: Incompetent IT by Anonymous Coward · · Score: 0

      Or maybe they just got unlucky with a fire in the transfer switch.

    53. Re:Incompetent IT by Anonymous Coward · · Score: 1

      Having had my hands in designing a few of these sorts of systems, it is not as 'easy' as slapping a few bits of Cisco/HP/Dell kit together and calling it a day.

      First off you need a minimum of 2x the floor space in a min 2 different geographic locations.
      Second you need a min 2x the hardware at both locations. Oh and make sure your DCs can have fail over power and separate power systems.
      You need 2 x the number of people running it. 1 set for each location. Support 'can' remote in to each location, preferably onsite but remote location 'can' work but wears out your support staff.
      Next you need to design the system to be able to handle what we called 'split brain'. Where half your data is out of the wrong data stores and the system is cross pointing to the wrong data centers. That takes time and proper design of the software, hardware, and network infrastructure.
      Your software and QA guys need to have their own set to play with to make sure. Preferably two sets each. Does not have to be geo redundant. Just virtual redundant to simulate.
      Oh and your external network better be able to handle it. So that means playing with the proper providers of switched networks, AND managing them and holding them accountable for fuckups. Plan on them not being up to task and you have to be redundant.
      Don't forget your upgrade plans. How will you fail between systems while you upgrade in place (both hardware, firmware, and software)? Oh and your QA should be testing that plan as well.
      Also to your end customers and employees? It looks totally transparent. So you better have a decent network guy and load balancer guy.
      Also be sure to *TEST* your fail over systems. Do they actually fail over? Do they actually come back up? Do they actually get the right data from the right place? It is not just enough to have them. You want to make sure all of your plans actually work. You can even do it during the day, instead of 3AM on a Sunday in a massive conference call. Oh and does the system fail properly while people are using it?
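
      The failover questions above can be turned into a scripted drill. This is only a toy sketch: the `Site` and `Cluster` classes, the site names and the synchronous replication model are all hypothetical stand-ins for whatever your real infrastructure does. It checks the same three things: does it fail over, does it come back up, and is the right data in the right place afterwards.

```python
class Site:
    """One data center holding a full copy of the data (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Cluster:
    """Two sites; the first one that is up serves traffic."""

    def __init__(self, primary, standby):
        self.sites = [primary, standby]

    def active(self):
        for site in self.sites:
            if site.up:
                return site
        raise RuntimeError("no site available")

    def write(self, key, value):
        self.active()  # raises if every site is down
        for site in self.sites:
            if site.up:
                site.data[key] = value  # replicate to every live site

    def read(self, key):
        return self.active().data[key]

    def recover(self, site):
        """Resync a failed site from the live one, then bring it back."""
        site.data = dict(self.active().data)
        site.up = True


def run_failover_drill():
    primary, standby = Site("site-a"), Site("site-b")
    cluster = Cluster(primary, standby)
    results = {}

    cluster.write("flight-100", "on time")
    primary.up = False  # simulate losing the primary
    results["fails_over"] = cluster.active() is standby
    results["right_data_after_failover"] = cluster.read("flight-100") == "on time"

    cluster.write("flight-100", "delayed")  # people keep using it
    cluster.recover(primary)
    results["comes_back_up"] = cluster.active() is primary
    results["right_data_after_failback"] = cluster.read("flight-100") == "delayed"
    return results
```

      A real drill would pull actual power or network links. The value of scripting it is that the same checks get run every time, not just remembered during an outage.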

      Last of all you *need* and *want* to make sure your VP and up management is on board. If they are not, *none* of the junk above matters. Don't bother and find another job.

      It takes a fairly seasoned hand to build systems like this. Your fresh off the plane (hehe) h1b probably will not cut it no matter how much smoke the temp agency blows up your ass. You can in place train them. But expect it to take time.

      Some of the newer techs like docker and vmware can help mitigate some of these issues. But not all of them. You need to test them and find the holes. So you can either mitigate them or minimize them.

      That is the sort of system you want to build for a thing like this. Your customers and your fellow employees *expect* it to 'just work'.

      I feel for the dudes at that Atlanta data center. Return to service is just the first step. "Does not happen again" is the next step and that takes a lot of humility and fortitude to make it happen.

    54. Re: Incompetent IT by HornWumpus · · Score: 1

      It's the old Dilbert:

      They outsourced everything they aren't good at, unfortunately one of the things they aren't good at is knowing what they are good at.

      --
      John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
    55. Re:Incompetent IT by lucm · · Score: 1

      Of course most travelers would probably choose to fly on a plane sitting on a board resting on two cinder blocks if it would save them $5. So the market gave people what they wanted.

      Most travelers... or their employers! There's a guy I know who spends 14h per week in transit (flight + layover); if the company could put him in the cargo hold to save $5 they'd do it. When there's a plane delay he has to VPN in from the airport (but not the lounge of course). Can you imagine yourself spending an afternoon on those cheap airport plastic seats, debugging an Oracle BPEL workflow via a shaky remote desktop connection while waiting for your flight?

      Karma like this, he was probably a serial killer or a pedophile in his previous life.

      --
      lucm, indeed.
    56. Re:Incompetent IT by SvnLyrBrto · · Score: 1

      > his credit card machine didn't work

      That scam is one among many reasons I've long since quit using the legacy taxi companies and switched to Uber and Lyft. They made their own bed with stunts like that, and they can go lay in it.

      --
      Imagine all the people...
    57. Re:Incompetent IT by swillden · · Score: 1

      Asymmetric performance.

      I like how effectively that obscures the truth (something all good buzzwords should do).

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    58. Re:Incompetent IT by Anonymous Coward · · Score: 0

      They aren't good at learning from their mistakes, either. My family was supposed to be flying to Florida from DC shortly before Christmas one year, and a snowstorm in the midwest caused a lot of reroutes and cancelations. It turned out that when the number of ticket changes exceeded a certain number for a month, their systems went down, like they did that morning. They ended up in pretty much the same situation then...

    59. Re:Incompetent IT by Anonymous Coward · · Score: 0

      Well, they would be if they weren't giving their cash to the board of defectives who continue to revel in their incompetent decision making.

    60. Re:Incompetent IT by Anonymous Coward · · Score: 0

      I was in the Alaska Airlines data center a few years back when it went down hard. The official notification was that it was a computer glitch. What actually happened was we were doing an upgrade to the power grid. They were migrating the mainframe to its own grid for better growth and new redundancy. The contractors screwed up and took down the whole thing. It was electricians who took it down. It's possible Delta's problem was human caused from an upgrade or service in process.

    61. Re:Incompetent IT by wyHunter · · Score: 1

      Clouds go down all the time. It's called 'rain.'

    62. Re:Incompetent IT by RespekMyAthorati · · Score: 1

      How could this happen?
      I've never heard of a server farm that wasn't powered by an uninterruptible power supply.

  2. CIO is a dumbass by Anonymous Coward · · Score: 0

    The C level exec who most likely ignored the recommendation for a backup should be fired.

    But he won't. At worst, he'll be asked to resign, he'll get a big fat bonus and find another high paying cushy job.

    Most likely, there will be a goat somewhere who'll get escorted out of the building and will have a real hard time finding another job - because that's how it works for us peons. Unemployed means no good.

  3. On behalf of our flight crew... by Anonymous Coward · · Score: 1

    ...I'd like to welcome you aboard Single Point of Failure Airlines.

  4. Re:Shouldn't have upgraded to W10 ! by Joe_Dragon · · Score: 1, Informative

    the auto install of windows updates drains your battery and does not stop for battery mode or ups shut down commands.

  5. Re:Shouldn't have upgraded to W10 ! by bev_tech_rob · · Score: 1

    Ha ha ha...very funny...not......

    This was a power issue (cue the 'the IT staff needs to be hung by their scrotums for such shitty power infrastructure' comments).

    --
    You're messin' with my Zen Thing, man.....
  6. Suspicious... by Anonymous Coward · · Score: 0

    A power outage at one location takes down the entire global Delta network?

  7. Re:Shouldn't have upgraded to W10 ! by NotInHere · · Score: 4, Insightful

    Even IF one of your data centers has a power outage (which should not happen as you should have backup generators and batteries that give power until the generators are spun up), you should always have at least ONE other backup data center to take over if something really fails for you.

  8. That's about $100 Million per day in lost revenue by billrp · · Score: 3, Interesting

    You would think they would have a backup for the backup power. But like someone earlier said, this outage sounds suspicious.

  9. Report: Fire destroyed generators by McGruber · · Score: 4, Informative
    A fire at the datacenter caused the outage, according to a post from "walterD" in Flyertalk.com's "Delta computers down ..." thread:

    According to the flight captain of JFK-SLC this morning, a routine scheduled switch to the backup generator this morning at 2:30am caused a fire that destroyed both the backup and the primary. Firefighters took a while to extinguish the fire. Power is now back up and 400 out of the 500 servers rebooted, still waiting for the last 100 to have the whole system fully functional.

    1. Re:Report: Fire destroyed generators by ilguido · · Score: 0
    2. Re:Report: Fire destroyed generators by pz · · Score: 4, Insightful

      Here's the thing that amazes me.

      500 servers.

      The airline runs on 500 servers.

      I was part of an early social networking site that, at its peak had 20 M users, with about 10K actively using the site at any given moment. We ran with 200 servers and had really very excellent render time (this was getting on to a decade ago, and if our page loads ever got above 1 second it was considered a near crisis; our email/messaging system, that I wrote, handled 150 M messages per day). It just can't be that hard to run an airline site compared to running a web site that peaked at Alexa 100. They need 500 servers? Five HUNDRED servers? And with the resources of a multi-billion dollar company, they're STILL ALL IN ONE LOCATION?

      They need a new IT team. Or a new management to give them the support they need.

      --

      Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
    3. Re:Report: Fire destroyed generators by bfpierce · · Score: 3, Insightful

      "with about 10K actively using the site at any given moment."

      You actually think Delta only has 10k actively using their systems at any given moment? They probably have that many ticket counter staff logged in, not even counting customers, technicians, pilots, and so on.

      Yeah, I get the 'why is your backup in the same building as your primary', but they probably need 500 servers.

    4. Re:Report: Fire destroyed generators by Critical+Facilities · · Score: 4, Insightful

      Interesting. I manage Enterprise Data Centers for a living. My expertise is the Facility Infrastructure (generators, UPS, switchgear, etc). What's being described in that post you linked to sounds very possible. I'd just about lay money down that this was a failure in an Automatic Transfer Switch. And as others have said, I pretty much guarantee that due to the corporate attitude of "facilities is just an expense center on a spreadsheet", there's been pressure to trim costs........including decreasing frequency of predictive maintenance like Infrared Thermography.

      A well maintained ATS should be able to function flawlessly for many, many years (like 20 years). To have faulted so badly that it took out the whole switch (which would definitely make the primary and generator feeds inaccessible) sure sounds like deferred or non-existent maintenance to me.

    5. Re:Report: Fire destroyed generators by swb · · Score: 2

      Calm down. Your social media site wasn't flying a half million people around the world in pressurized aluminum cans every day.

      Not even counting future travel reservations or queries, how many DB transactions do you think they handle per passenger per day alone? And none of that counts any other potential transactions, such as service info, flight data such as aircraft telemetry, employee data, regulatory information and so on.

      500 servers sounds almost too low, especially when you consider that probably more than a few are either legacy systems or run some kind of specialized software to move data between new systems and external legacy systems.

    6. Re:Report: Fire destroyed generators by burtosis · · Score: 1

      It's odd because I can't seem to find any news on a fire in Atlanta this morning. If "firefighters took a while to extinguish the fire" and it grounded Delta flights worldwide, you would probably expect at least a blurb on it somewhere.

    7. Re:Report: Fire destroyed generators by silas_moeckel · · Score: 1

      At 500 servers that's not really a whole lot of data center real estate. A well-built DC would not have the A and B bus gen sets next to each other, though all bets are off once you get some overzealous firefighters on site. Having no DR setup in place is laughable.

      --
      No sir I dont like it.
    8. Re:Report: Fire destroyed generators by raftpeople · · Score: 5, Insightful

      "It just can't be that hard to run an airline site compared to running a web site that peaked at Alexa 100" - You clearly have no idea what you are talking about. Go learn about the complexities of running an airline, the different software required, the number of users and systems supported, etc.

    9. Re:Report: Fire destroyed generators by swb · · Score: 1

      My question would be why primary and secondary generators were placed close enough that a fire with one would so easily affect another. I get that there are some serious temptations, including not wanting to run main power feeds very far or shared fuel storage.

      But the kind of proximity that would pose a dual generator fire risk seems like a bad idea.

      Ironically, this even crossed my mind at a year old data center I was at last week. Both backup generators were fairly close together and I wondered what kind of fire it would take to make both generators offline.

      Maybe there's some kind of protection against this. In the data center I was at, the generators appeared enclosed in those metal utility buildings; perhaps these are sturdy enough to contain fires or have significant fire suppression abilities inside them that would keep a fire from spreading.

    10. Re:Report: Fire destroyed generators by cdrudge · · Score: 2

      What exactly is "a server" though? A server can be anything from a single processor with a reasonable amount of memory to many multi-proc multi-core beasts with more memory than most people have disk space. Toss in virtualization and does 1 server = 1 physical machine, or 1 server = 1 virtual machine?

      Being an older airline, I'd be surprised if there wasn't one or more large mainframes in the mix as well, something your social networking site probably didn't have.

    11. Re:Report: Fire destroyed generators by Anonymous Coward · · Score: 0

      Have you had any experience since your 'early social networking site'?

      I'm guessing not. Your scale was not a world wide ticketing, planning, billing system.

      Delta's system is probably thousands of times larger than yours ever was.

    12. Re:Report: Fire destroyed generators by bad-badtz-maru · · Score: 1

      A reasonable takeaway from this might be that overly thorough testing of backup power systems can carry excessive risk.

    13. Re:Report: Fire destroyed generators by Anonymous Coward · · Score: 0

      2nd major Airline Outage in a week? IMO: hackers managed to find the off button "shutdown -h now" and turned their network off.

      If they only need 500 servers, how many of them are VMs running their software? They maybe need 25 physical servers - about half a rack worth, to handle their entire load. If that's the case, why in hell don't they have some redundancy in their network? The only thing that would need real time updating is the passenger/cargo manifests along with the aircraft - everything else would be fine on a 5 minute delay. The main reason I ask is what in hell is Delta going to do when a major hurricane takes Atlanta out for months? Go bankrupt?

    14. Re:Report: Fire destroyed generators by Critical+Facilities · · Score: 4, Informative

      Well, to be clear, I'm just speculating here, but I'm not implying that the GENERATORS blew up, I'm speculating that the ATS blew up. It is a very common topology to have multiple Generators connect to one main bus, and then have that bus connect to the Data Center via an ATS. In other words, yes, there is/are redundant Generator(s), but they all connect to one central bus, which then connects to the UPS Systems via the ATS and other switchgear.

      The failure rate of ATSs is pretty low (when they're maintained), so it often becomes a value engineering decision during design. Yes, you could have each Generator connect via its own ATS, thus distributing the risk, but in so doing you increase your construction costs, increase your maintenance costs, etc. The bean counters don't like that, and it becomes hard to convince them that it's worth it when you can't come up with statistical proof that a failure of the ATS is likely.
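
      A back-of-the-envelope way to see that tradeoff, with loudly made-up numbers: the availability figures below are illustrative assumptions, not measured data for any real generator or ATS, and the math assumes failures are independent (a fire that spreads violates that). Components in series multiply; redundant paths in parallel fail only if every path fails.

```python
from math import prod


def series(*avail):
    """All components must work: availabilities multiply."""
    return prod(avail)


def parallel(*avail):
    """Any one path suffices: the system fails only if every path fails."""
    return 1 - prod(1 - a for a in avail)


# Illustrative assumptions only, not real equipment data.
GEN = 0.99   # assumed availability of one generator
ATS = 0.999  # assumed availability of one automatic transfer switch

# Topology A: two generators on a common bus, feeding one shared ATS.
shared_ats = series(parallel(GEN, GEN), ATS)

# Topology B: each generator has its own ATS, two independent paths.
split_ats = parallel(series(GEN, ATS), series(GEN, ATS))

print(f"shared ATS: {shared_ats:.6f}")
print(f"split ATS:  {split_ats:.6f}")
```

      Under these assumptions the shared-ATS design can never be more available than the ATS itself, no matter how many generators sit behind it. Whether the gap justifies the extra construction and maintenance cost is exactly the argument with the bean counters.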

    15. Re:Report: Fire destroyed generators by Critical+Facilities · · Score: 1

      Totally agree that no DR process is absurd.....especially for an airline. While I was at EDS, we supported Continental and US Air, and those guys regularly tested their DR schemes. And my point was that the Generators weren't affected directly, I'm theorizing that the ATS that connects the generators to the Data Center failed. It's very common to have several generators connect to a common bus (usually a paralleling piece of switchgear) and then have that bus tie to the Data Center via an ATS. In this case, both the Utility and the Generator power go through the ATS, thus making it a single point of failure. This is a very common design in spite of the single point of failure due to cost and historical performance of ATSs.

    16. Re:Report: Fire destroyed generators by silas_moeckel · · Score: 2

      Once the fire dept is onsite all bets are off. They will kill power from the other generators etc to ensure crew safety. This is where prep is key so that they feel safe working an electrical fire without killing all power.

      --
      No sir I dont like it.
    17. Re:Report: Fire destroyed generators by silas_moeckel · · Score: 1

      Not really; failing to do thorough testing carries excessive risk. It would be very rare indeed that proper thermal imaging would not have caught this before it was a fire. That takes humans vs a control panel that runs the gen sets once a week etc.

      --
      No sir I dont like it.
    18. Re: Report: Fire destroyed generators by Anonymous Coward · · Score: 0

      Don't bother trying to explain how the world works to someone who thinks infrastructure just magically appears in some 'cloud' somewhere and isn't hard work, or who compares every problem to those of running useless things like social media.

    19. Re:Report: Fire destroyed generators by shuying · · Score: 1

      You clearly have not the slightest idea of the complexity of running one of the world's top airlines. Your social networking site is peanuts.

    20. Re:Report: Fire destroyed generators by swb · · Score: 1

      I was going on a previous post's claims of a generator fire, rather than an ATS failure.

      I would think an ATS failure resulting in fire would be pretty darn hard to recover from in a timely fashion due to what I would expect would be some major electrical rework to replace the ATS, housing, and feeds, and related switchgear.

      I would guess that a "modern" data center design would isolate these components enough that even if the ATS melted to slag in place it would be a matter of just replacing the ATS. At a legacy or private data center, I can see lots of these components housed in proximity with little isolation, and a catastrophic failure resulting in lots of power system damage.

      Generator fire(s), while putting generators offline, would seem to be easier to recover from barring data center facility damage. Switch back to utility power and await delivery from Caterpillar of temporary backup power units which would be put back onto the generator bus.

    21. Re:Report: Fire destroyed generators by shuz · · Score: 1

      I should have seen this before my original posting. It makes perfect sense why they have been down so long. I've experienced this as well only it was at 16:00 instead of 02:00. *sad face*

      --
      There is or can be built a machine that can simulate any physical object. -Church-Turing principle
    22. Re:Report: Fire destroyed generators by jbengt · · Score: 1

      I'm working on the fueling design for proposed new Emergency Generators at O'Hare (currently out for bid after 3 years back and forth on where to put them and what to connect to them, in spite of simultaneous proclamations on how urgent it is to get the project done, since they have had failures during testing of the existing system.) When I first saw this on the local Chicago news, which obviously showed pictures of lines at O'Hare, I wondered if it had anything to do with our project - thankfully, not.

      Anyway, those metal "utility buildings" are probably not fire rated; from your description, they might be just the enclosures that outdoor generators come with. At O'Hare, the existing generators are inside, all located in a non-fire rated room. The six new generators are going into a new building, but separated into two rooms with 4-hour fire-rated concrete block construction, with fueling "day tanks" in separate rated rooms, also. Main electrical gear is going into a separate new building, again separated into 2 fire-rated rooms. Most of the Automatic Transfer Switches are distributed throughout the airport, and most of them are in fire-rated electrical rooms. Almost all of these spaces have fire-suppression systems (FM insisted on sprinklers, rather than CO2, in spite of having 4160 Volt generators). However, if there is a fire, generators automatically shut down. Hopefully, they could limp by with just 3 generators, but I wouldn't be surprised if a fire caused all six to be shut down, especially if the fire department decides that shutting them down manually is safer for their personnel.

    23. Re:Report: Fire destroyed generators by quetwo · · Score: 1

      From the reports I've been seeing, it wasn't the ATS that failed, but rather a generator that caught on fire -- and in order to extinguish the fire safely, they had to cut commercial power.

      Freak accidents like that happen. But what also happens is that companies that big invest in redundant systems in geo-redundant locations. What happens if a tornado, sharknado or other natural disaster happens and takes out the physical servers? Does Delta just cancel flights for the next month while they rebuild?

    24. Re:Report: Fire destroyed generators by jbengt · · Score: 1

      A reasonable takeaway from this might be that overly thorough testing of backup power systems can carry excessive risk.

      As can underly thorough testing.
      I was involved in a project where, due to shitty power infrastructure in South Dakota, an entire call center was put on emergency generators, not just the life safety and computer equipment. It was faithfully tested every week, no problem. But the first time it was really needed, it failed, which cost tens of thousands in lost productivity. Turns out they weren't testing it under load, since load banks for testing cost money. At least that company (Discover) had useful distributed back-up, so no data was lost, and customers were unaware of the outage.

    25. Re: Report: Fire destroyed generators by Anonymous Coward · · Score: 0

      It irks me that people think that social media is some kind of new, hot technology. It's the most simplistic (and useless, IMHO) thing out there.

    26. Re:Report: Fire destroyed generators by Joe_Dragon · · Score: 1

      It's a servenado; just hope that your data is encrypted or it may be raining identity theft.

    27. Re:Report: Fire destroyed generators by Anonymous Coward · · Score: 0

      Keeping the Primary and backup generators close enough to each other for one to destroy the other, sounds like bad design to me. Of course it isn't the first and won't be the last act of stupidity we'll see with critical infrastructure design. What was it that caused the Fukushima disaster, oh yeah backup generators in a basement inside of a tsunami zone. Someone did have the foresight to recognize this and they did set up some generators on a hill outside of the flood area, but left all of the electrical transfer equipment in that same basement they knew could be flooded.

    28. Re:Report: Fire destroyed generators by silas_moeckel · · Score: 1

      It's common; it's also a bad idea. Granted, getting utilities to feed from 2 substations via diverse paths is a pita unless the location was picked for that purpose. While I know it's all too common, it's far better to split utility and run a gen set(s) per UPS and keep physical separation between them, thus their own ATS gear. None of that matters if the fire trucks roll in and insist on everything being shut down.

      --
      No sir I dont like it.
    29. Re:Report: Fire destroyed generators by Anonymous Coward · · Score: 0

      A reasonable takeaway from this might be that overly thorough testing of backup power systems can carry excessive risk.

      This was the failure mode that led to the Chernobyl disaster.

    30. Re:Report: Fire destroyed generators by lamer01 · · Score: 2

      I worked for one of the largest airline reservation systems. It is a very complicated space, many degrees of complexity above your run of the mill social networking website. Unfortunately, the underlying technology goes back many decades (it is mainframe based; I am not sure what these other 500 servers they mentioned do). I think that with the newer tech out there, it could probably be re-engineered to be totally fault tolerant but it would be a massive undertaking in $$$$$. To give you some clarity of the complexity, the system even calculates the weight distribution of planes as passengers check in and clears them for take off accordingly.

    31. Re: Report: Fire destroyed generators by Anonymous Coward · · Score: 0

      Every datacenter I've ever worked in had all of the generators in the same room. If you were lucky, the A and B wiring went through different switchboards.

    32. Re:Report: Fire destroyed generators by Anonymous Coward · · Score: 0

      They are not running a "site" on those 500 servers, wow.

    33. Re: Report: Fire destroyed generators by Anonymous Coward · · Score: 0

      Commercial data centers don't normally advertise where they are located. They're not going to be issuing any press releases.

    34. Re:Report: Fire destroyed generators by Tablizer · · Score: 1

      Keeping the Primary and backup generators close enough to each other for one to destroy the other, sounds like bad design to me.

      Also, shouldn't they have halon (or equivalent) fire retardant systems to automatically put fires out, or is that only for computers themselves?

      It might be the power has to be shut down if a fire is detected, and the backup generator was on the same power line or group as the one with the fire.

      Like you said, they probably should be spread apart from each other but maybe the real-estate of the area would make that very expensive. Or, they just didn't plan right.

    35. Re:Report: Fire destroyed generators by swb · · Score: 1

      those metal "utility buildings" are probably not fire rated; from your description, they might be just the enclosures that outdoor generators come with.

      It was kind of hard to tell whether they were just OEM enclosures or something purpose built, as they had the same generic institutional bland paint scheme as the building. I will say that they didn't look significantly larger than the stock enclosure you'd expect around a generator. But again, hard to judge because they're slightly larger than a shipping container and I've also seen a similar size box with OEM graphics and paint on them.

      I kind of wonder if generator manufacturers have a "fire containment" enclosure option with built-in fire suppression that would also close all the air intake/exhausts to help extinguish a fire, as well as being made of a material that would withstand a brief fire.

    36. Re:Report: Fire destroyed generators by Anonymous Coward · · Score: 1

      Most large generator systems operate on an exclusion zone concept for safety, that is generally why you see them outside of the building they are supplying, and in the case of generators with large on-site fuel they are usually a few dozen feet from any flammable structures. I suppose it wouldn't be too silly if you had multiple primary backup generators to cluster them but clustering them with secondary backup generators as well would be extremely foolish. Even if there were space constraints, isolating each generator from the others with some reinforced masonry walls would be a relatively cheap and virtually foolproof method of keeping a fire in one of them from disabling the other(s).

    37. Re:Report: Fire destroyed generators by PPH · · Score: 1

      Yeah, but that social networking site is a relatively simple app, built at one time with homogeneous platforms, systems and tool chains. The primary problem being spinning up more identical server instances to meet demand.

      Delta's systems consist of numerous unique applications, written over decades. Probably in different languages and on different server hardware. Much of it was probably inherited from ancestor airlines that Delta merged with or absorbed.

      It's the same in banking and other businesses that have grown their IT systems over many years and in different organizations. It's a mish-mash of stuff. Nobody is quite sure how it works. Building a hot backup with the database synchronization problems of heterogeneous systems is a huge undertaking. It's not the server rack or data center costs that will kill you. It's not 500x the same thing. It's several hundred systems, each of which will need a failover scheme uniquely designed, tested and implemented. And software projects of this magnitude get out of hand very quickly.

      --
      Have gnu, will travel.
    38. Re:Report: Fire destroyed generators by axafg00b · · Score: 1

      Being a ditto-head here (and in the same field) having an ATS fail sounds right. No matter how well one tries to design redundancy and resiliency into a data center, there will always be that one weak spot. I would hope that Delta management and IT conduct a serious failure assessment (without it devolving into a witch hunt) to understand the failure process here and determine how to change it. I would like to see a better discussion of the exact cause so those of us in the data center industry can use the information to assess our own facilities.

      And yes, regular service and maintenance might have helped catch this issue before it caused the outage.

      --
      I think, therefore I am - Rene Descartes; I yam what I yam, an' that's what I yam - Popeye
    39. Re: Report: Fire destroyed generators by Anonymous Coward · · Score: 0

      The only credible report I've seen came from the power company representative who said the outage was due to a failure in Delta's switchgear.

    40. Re:Report: Fire destroyed generators by houghi · · Score: 1

      Don't forget the connection to external systems for reservations and even other airlines. Delta works together with others and I do not like to share my seat 13A on Delta with somebody who booked 13A via KLM at almost the same time on the same flight and a third who booked via a travel agent.

      --
      Don't fight for your country, if your country does not fight for you.
    41. Re:Report: Fire destroyed generators by david_thornley · · Score: 1

      Modern mainframes are extremely fault-tolerant by themselves. IBM's been making them better and better over the decades. For certain classes of work, like running very large volumes of relatively simple transactions with extremely high reliability, which sounds like a lot of stuff Delta's doing, they're wonderful.

      --
      "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
    42. Re:Report: Fire destroyed generators by McGruber · · Score: 1

      Back at you, Zorak on Flyertalk.

    43. Re:Report: Fire destroyed generators by Anonymous Coward · · Score: 0

      They did get a new IT team, when they bought up Northwest Airlines. The running joke was that Northwest IT had an IT test VM farm on the side that could run all of Delta, but since Delta was wearing the pants they ended up just firing most of the Northwest IT guys since they couldn't work with all the ancient cruft Delta was keeping alive, rather than doing the sane thing and migrating to Northwest's VM platform. Sad but true.

  10. o/~ Because we're Delta airlines... by Dixie_Flatline · · Score: 1

    ...and life is a fucking nightmare o/~

    (https://youtu.be/vzeOsEkzeA0
    John Mulaney's stand up bit on Delta. It's worth it.)

  11. Re:Incompetent IT *management* by PolygamousRanchKid+ · · Score: 1

    I'll bet you dollars to donuts that the IT folks squealed like stabbed piglets that they needed a backup system alternative.

    But the management chain did not want to swallow the costs.

    Who knows? Maybe the costs of dealing with this fiasco will be cheaper than having a backup system . . . ?

    --
    Schroedinger's Brexit: The UK is both in and out of the EU at the same time!
  12. Mega lag: They are back in the air again! by burni2 · · Score: 1
  13. Re:Let this be a lesson to all by Salgak1 · · Score: 0

    I'd bet good money that someone at Amazon is ALREADY meeting with them.

    And will still save them money, even with migration costs. Of course, the already-abused IT staff will get downsized, and the C-level who signs off on it will get a fat bonus. . .

  14. Let me guess by Anonymous Coward · · Score: 0

    Let me guess:

    They have multiple servers distributed throughout various geographic locations, redundant "hot spare" servers ready to kick in at a moment's notice, a direct terabit link to the internet backbone, redundant power supplies with generator backup, bomb-proof bunkers and a team of highly trained special ops to guard their servers...but their flight scheduling software is still some quick Cobol hack running on a 386 (upgraded from a VAX in the mid 90s) wheezing away in a (very hot, very dusty) cabinet in their head office, connected to the "net" via an RS232 cable, at 12k baud...which has just died of natural causes (i.e., rust)?

    I mean, other than that, I have no clue how the second largest US airline (according to the TFS) can manage to have a world-wide computer blackout in this day and age.

  15. Mainframes in the airlines by Zondar · · Score: 2

    Last time I worked with the airline industry, they were still heavily reliant upon mainframe systems. That means putting redundant equipment at diverse datacenters is more costly. It's not like spinning up a new rack of x86 VMWare servers.

    1. Re:Mainframes in the airlines by Anonymous Coward · · Score: 1

      I would wager it is still significantly less costly than a single flying tin can and should appear as less than a rounding error in their budget.

    2. Re:Mainframes in the airlines by shuz · · Score: 2

      Many planes are leased or rotated off of budget after a certain maintenance schedule. Airlines run very thin profit margins despite how it may appear. Think about all the choices you have when flying. The Northwest airlines portion of Delta used to run mainframes in Minnesota. I don't know what they use in Atlanta. Mainframes can be much more efficient than a bunch of Oracle/Microsoft DBs running on VMware. It isn't a trivial task to fail over to DR for most companies. One of the scariest things is DB sync lag. If the database in DR falls too far behind the primary DB, then the records of hundreds if not millions of people (purchased tickets, transfers, baggage logistics, etc.) might be lost. The chaos from that might well outstrip the chaos from delaying all flights until the primary DC can be recovered, or at minimum until networking from the primary DC/databases can be restored to the DR site and a 100% sync status can be confirmed. Even if everything seems perfect, going to DR is really scary. Please take everything that a company does in this situation, IT and management wise, with a massive grain of salt. One last thing that is really hard to swallow as an airline customer: you don't have the "right" to fly. You have a privilege. Any business has a monetary incentive to give you that privilege. It would be bad business otherwise. But at the end of the day no business has the legal responsibility to serve you. With the exception of health care and health insurance in the US.
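
The sync-lag worry above can be sketched as a simple gate in front of a failover decision. This is only an illustration, not how Delta or any particular database does it: the `ReplicaStatus` fields, the sequence numbers and the `max_lag` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    """Snapshot of a DR replica; all fields are hypothetical illustration values."""
    primary_commit_seq: int   # last transaction sequence committed on the primary
    replica_applied_seq: int  # last transaction sequence applied on the DR replica

    @property
    def lag(self) -> int:
        # Number of committed transactions the DR copy has not yet applied.
        return max(0, self.primary_commit_seq - self.replica_applied_seq)


def safe_to_fail_over(status: ReplicaStatus, max_lag: int = 0) -> bool:
    """Allow failover only if the DR copy is within max_lag transactions."""
    return status.lag <= max_lag


in_sync = ReplicaStatus(primary_commit_seq=1_000_000, replica_applied_seq=1_000_000)
behind = ReplicaStatus(primary_commit_seq=1_000_000, replica_applied_seq=998_500)

print(safe_to_fail_over(in_sync))             # True
print(behind.lag, safe_to_fail_over(behind))  # 1500 False
```

With `max_lag=0` the gate only opens on a confirmed 100% sync, which matches the "wait until sync is confirmed" choice described above; a looser threshold trades lost transactions for a faster recovery.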

      --
      There is or can be built a machine that can simulate any physical object. -Church-Turing principle
    3. Re:Mainframes in the airlines by sexconker · · Score: 1

      But at the end of the day no business has the legal responsibility to serve you. With the exception of health care and health insurance in the US.

      And cake shops, and restaurants, and landlords / property managers, and the Post Office, and the marriage license window at city hall, and gas station air pumps, and...

    4. Re:Mainframes in the airlines by david_thornley · · Score: 1

      A lot of businesses have an obligation to serve certain people when they're open for business. That's not the same thing. If a bakery shop shuts down for a day, they lose business, and quite likely future business, but that's their call to make (unless they're doing it for a specifically illegal reason). If they refuse to serve someone for certain reasons, which vary from state to state but generally include race and religion, while they're serving other people, that's illegal.

      --
      "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
  16. Fire when not ready by XXongo · · Score: 2

    Even IF one of your data centers has a power outage (which should not happen as you should have backup generators and batteries that give power until the generators are spun up),

    Actually, what I'm hearing is that a fire in the backup generator took out the primary generator. So, this is a case in which the backup was the problem, not the solution.

    1. Re:Fire when not ready by Joe_Dragon · · Score: 1

      Also, with a fire it's more likely that someone hits the red button or some automated system kills the power. And if the firemen on site say to kill all power, it happens, likely with a hard power-off.

  17. Time to move the servers by The-Ixian · · Score: 1

    Minnesota seems like a good place to house them....

    --
    My eyes reflect the stars and a smile lights up my face.
    1. Re:Time to move the servers by swb · · Score: 1

      IIRC, NWA, which was merged into Delta a few years back, had a backhoe outage when a fiber trunk got cut.

  18. Arguing for resources is part of the job by sjbe · · Score: 0

    I wouldn't be so fast to lay this at the feet of IT.

    It's the fault of IT (which includes IT management) unless you have evidence to the contrary. If they didn't adequately present the argument for why a robust network is a valuable asset then shame on them. Preventing the entire company from shutting down and losing millions of dollars per minute is a trivial argument to make. So yeah, IT carries most if not all of the blame here. If they couldn't make that argument then they suck at their job.

    I'm certain they wanted to make it robust, distributed and redundant but that all costs money. When PHB's with MBA's see IT as a cost centre, they see all this redundancy as "waste" to be cut back. Budgets are reduced and so are capabilities.

    [sarcasm] Ahh, yes the MBA scapegoat. Couldn't possibly be that the folks who designed the network did a shit job of it. Clearly they must have been undercut by some bean counter somewhere. [/sarcasm] In a company the size of Delta if a power outage in a single location causes a company wide failure, that is almost certainly a technical screw up and not a budgetary one. Making the argument for some equipment to make the network resistant to power outages is a trivial financial argument to make. If the IT engineers had a single point of failure like that and they weren't able to justify whatever upgrades were necessary then they are bad at their job. Either they didn't see the problem or they failed to justify the resources to fix it. Either way they are incompetent and take the lion's share of the blame.

    1. Re:Arguing for resources is part of the job by Kierthos · · Score: 1

      Or it could have been both. Never discount the idea of the penny-pinching MBA and an incompetent IT staff.

      Now, mind you, I'm inclined to side with the IT guys, as I am one myself, and even though I work for a much smaller company, I've seen some bone-headed decisions regarding purchases.

      And let's face it, IT, generally speaking, is not in the business of making things harder for themselves. Whereas company execs are often so insulated from the immediate consequences of their actions that it could take years for some decision to be a problem.

      It's entirely possible that none of the people involved in the decisions that led to a single-site point of failure even work for Delta any more.

      --
      Mr. Hu is not a ninja.
    2. Re:Arguing for resources is part of the job by fnj · · Score: 5, Insightful

      Bullcrap. A boo-boo this massive is BY DEFINITION a management fuck-up. It is management's [only] job to ensure all departments are doing their jobs competently. They don't get to say "well gosh, engineering told us they knew what they were doing". Yeah, it isn't EASY, but it's why they get the obscene compensation levels.

    3. Re:Arguing for resources is part of the job by Anonymous Coward · · Score: 0

      Lol, wut. What is management's job other than to, like, you know, manage the business? If management's only job is to sit in meetings, require other people to research what is best for the company's future, and get to blame others if the decisions made are bad... Holy crap, what a cushy job.

    4. Re: Arguing for resources is part of the job by Anonymous Coward · · Score: 1

      Damn what a corporate apologist. It is not IT's job to set business priorities. That honor belongs to the MBAs you seem to worship. Now, if they said 'make it so this never goes down' and gave IT everything they said they wanted and it still failed we can talk about IT's role in this. If IT took that mandate and incorrectly implemented it we can talk about blaming them.

      What usually happens though is executives either never set the priority or failed to fund what was said was needed. There's a long history of this across management in general.

    5. Re:Arguing for resources is part of the job by Nkwe · · Score: 3, Insightful

      Or they ran the numbers and calculated that even if they have an outage like this, the cost of that outage would be less than the cost of preventing it. If all you care about is the bottom line, you might not care if you inconvenience a bunch of customers for a few days.

    6. Re:Arguing for resources is part of the job by sjames · · Score: 2

      It couldn't possibly be that they predicted exactly this and presented it clearly to upper management who then decided they could get a really fat bonus for keeping costs down and deploy the golden parachute before the inevitable disaster.

    7. Re:Arguing for resources is part of the job by Kreplock · · Score: 1

      Never discount the idea of the penny-pinching MBA and an incompetent IT staff.

      Or years of MBA penny-pinching accruing an incompetent IT staff over time.

    8. Re:Arguing for resources is part of the job by Kreplock · · Score: 1

      But they could have another outage tomorrow... Ah well, it will get sorted out one way or another eventually.

    9. Re: Arguing for resources is part of the job by JackieBrown · · Score: 1

      Now, if they said 'make it so this never goes down' and gave IT everything they said they wanted and it still failed we can talk about IT's role in this. If IT took that mandate and incorrectly implemented it we can talk about blaming them.

      What usually happens though is executives either never set the priority or failed to fund what was said was needed.

      So in your world, IT needs to be told that there should be backups for system outages (for an airline no less) and once they are told this, they need to be provided an unlimited budget.

      I guess short of that, IT's job is to reset passwords?

    10. Re:Arguing for resources is part of the job by swillden · · Score: 1

      Or they ran the numbers and calculated that even if they have an outage like this, the cost of that outage would be less than the cost of preventing it. If all you care about is the bottom line, you might not care if you inconvenience a bunch of customers for a few days.

      Such a calculation is a perfectly rational and even correct approach to the question. However, the computation of the expected cost of the outage must also take into account the present value of all of the future lost business resulting from the customers who had a terrible day *and* all of the rest of the world who read about it on CNN. I think it's vanishingly unlikely that an accurate accounting of those costs would be less than the cost of sufficient redundancy to prevent the outage.
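
A rough sketch of that calculation, assuming a flat yearly loss of future business and a fixed discount rate. Every number below is a made-up placeholder rather than a Delta figure, and the function names are hypothetical.

```python
def present_value(annual_amount: float, years: int, discount_rate: float) -> float:
    """Present value of a fixed amount lost at the end of each year for `years` years."""
    return sum(annual_amount / (1 + discount_rate) ** t for t in range(1, years + 1))


def expected_outage_cost(
    p_outage_per_year: float,
    direct_cost: float,
    lost_business_per_year: float,
    years: int,
    discount_rate: float,
) -> float:
    """Expected yearly cost of running without redundancy.

    Direct cost (refunds, rebooking, overtime) plus the present value of
    future business lost to customers who walk away, weighted by the
    chance of an outage in a given year.
    """
    cost_if_it_happens = direct_cost + present_value(
        lost_business_per_year, years, discount_rate
    )
    return p_outage_per_year * cost_if_it_happens


# Placeholder inputs, purely for illustration.
expected = expected_outage_cost(
    p_outage_per_year=0.05,
    direct_cost=150e6,
    lost_business_per_year=20e6,
    years=5,
    discount_rate=0.08,
)
redundancy_cost_per_year = 10e6  # hypothetical yearly cost of a hot backup site

print(f"expected outage cost per year: {expected:,.0f}")
print("redundancy pays for itself:", expected > redundancy_cost_per_year)
```

The answer swings entirely on the inputs, which is the point: leave the lost-future-business term out and the same formula can make skipping redundancy look cheap.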

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    11. Re: Arguing for resources is part of the job by david_thornley · · Score: 1

      If the airline hired IT people who didn't realize the need for backups, top management really screwed up. If they hired competent IT people who presented a backup and recovery plan that cost a reasonable amount and then decided not to fund it, it's top management's call, and certainly not IT's fault. If they hired seemingly competent IT people who created a backup and recovery plan that was adequately funded by the company, start blaming IT.

      --
      "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
  19. Record profits by sjbe · · Score: 1

    Still though, this begs to be something hosted in a datacenter/cloud with an online shadow in the background of another location replicating everything and ready to take over at a moment's notice, or something similar. Pretty standard these days, but airlines are so tight for money that they end up sometimes shooting their own feet...

    Airlines are making record profits these days. Arguing that they don't have the money to properly set up the system that runs the whole company is ridiculous.

    1. Re:Record profits by sjames · · Score: 1

      Most of the crying and bankruptcies we saw before were actually just a scam to shaft long time employees on their pensions. They're fine now and they were fine then, it's just that now there's more money for executive bonuses and the hookers and blow fund is overflowing.

  20. Backup data center? by sjbe · · Score: 5, Insightful

    Actually, what I'm hearing is that a fire in the backup generator took out the primary generator.

    Shouldn't have any effect on the BACKUP DATA CENTER. One facility can go down. It happens. It should take a thermonuclear war to take out several if they are doing it right.

    1. Re:Backup data center? by Archfeld · · Score: 0

      Without Federal requirements there is no way a corporation is going to spend that kind of money. They have legal protections in place to assure they retain their terminal slots, so while they aren't making money now they won't lose in the long run. The only businesses with total data recovery sites and plans to actually use them are Banks, and that is because they are required by the FDIC.

      --
      errr....umm...*whooosh* *whoosh* Is this thing on ?
    2. Re:Backup data center? by Anonymous Coward · · Score: 0

      And companies that need their websites to make money?

      You are making things up completely, my employer and all of our competitors have multiple levels of redundancy (multiple datacenters, multiple availability zones per datacenter). If I want to deploy a little REST api our policies and systems enforce 9 total VM's, 3 per AZ and 3 total AZ's. Our competitors are at least as serious about uptime as we are.
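
As a minimal sketch, a placement rule like the one described (3 VMs in each of 3 AZs, 9 total) could be checked as below. The zone names, the `(vm, zone)` tuple layout and the `meets_policy` helper are hypothetical, not any cloud provider's API.

```python
from collections import Counter


def meets_policy(placements, min_zones=3, min_per_zone=3):
    """Return True if at least min_zones zones each hold min_per_zone or more VMs.

    placements is a list of (vm_name, zone_name) pairs.
    """
    per_zone = Counter(zone for _vm, zone in placements)
    populated = [zone for zone, count in per_zone.items() if count >= min_per_zone]
    return len(populated) >= min_zones


# Nine VMs spread evenly over three zones: satisfies the rule.
spread = [(f"api-{i}", f"az-{'abc'[i % 3]}") for i in range(9)]
# Nine VMs all in one zone: same total, but one power failure takes them all out.
stacked = [(f"api-{i}", "az-a") for i in range(9)]

print(meets_policy(spread))   # True
print(meets_policy(stacked))  # False
```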

      You clearly don't know what you are talking about.

    3. Re:Backup data center? by Anonymous Coward · · Score: 0

      This.

      The last three big companies I've worked for each had backup data centers in a different state, with dedicated high speed links (eg fiber loop) to replicate the databases. The general consensus here is that if there's a disaster that takes out both data centers, we have bigger things to worry about (like thermonuclear war or the Yellowstone supervolcano exploding). (Actually we run mostly active-active, with both datacenters taking traffic.)

      If you're a smaller company, you can have an emergency backup data center on eg AWS, with plans to scale that up quickly if you need to.

    4. Re:Backup data center? by MachineShedFred · · Score: 1

      Yeah, except that they do. Lots of them.

      It's called "having a disaster recovery / business continuity plan"

      --
      Slashdot still doesn't support Unicode after it was added to the HTML standard in 1997.
    5. Re: Backup data center? by buchanmilne · · Score: 1

      And companies that need their websites to make money?

      For some I know of, the IT staff have done everything in their power (identified risks, probability of the risk, impact in the event, possible mitigations, cost and time to implement mitigations etc.). "Business" (aka bean-counters) have decided that they accept the risk of the current status (business continuity plan takes more than a business week to restore basic services in the event of a complete failure in the primary site).

      You are making things up completely, my employer and all of our competitors have multiple levels of redundancy (multiple datacenters, multiple availability zones per datacenter). If I want to deploy a little REST api our policies and systems enforce 9 total VM's, 3 per AZ and 3 total AZ's. Our competitors are at least as serious about uptime as we are.

      And this is the cheapest possible service to provide redundancy for. When you have a few hundred TB of data changing at a rate of 100GB/day, this kind of solution (put it all in the "cloud", just in case, without a full BCM plan covering issues such as "where will the call centre work from and will they be able to accept calls") will get very expensive very quickly, and normally can't be approved by someone lower than CFO+CIO.
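
For scale, here is the raw arithmetic behind moving that much data. It covers transfer time only, not storage, licensing or the non-IT parts of a BCM plan; the 300 TB total and the 1 Gbit/s link are assumed stand-ins for "a few hundred TB" and an unspecified connection.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_mbit_per_s(gb_changed_per_day: float) -> float:
    """Average link rate (megabits/s) needed to ship one day's changes within that day."""
    bits_per_day = gb_changed_per_day * 1e9 * 8  # decimal gigabytes to bits
    return bits_per_day / SECONDS_PER_DAY / 1e6


def initial_seed_days(total_tb: float, link_gbit_per_s: float) -> float:
    """Days needed to copy the full data set once over a fully used link."""
    bits_total = total_tb * 1e12 * 8  # decimal terabytes to bits
    return bits_total / (link_gbit_per_s * 1e9) / SECONDS_PER_DAY


print(round(sustained_mbit_per_s(100), 2))    # 9.26
print(round(initial_seed_days(300, 1.0), 1))  # 27.8
```

These are averages over a perfectly smooth day; real change rates are bursty, so the peak requirement would be higher.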

      (Sorry if there are formatting issues, mobile still doesn't have preview ...)

    6. Re:Backup data center? by nightfire-unique · · Score: 1

      Given they were down for less than a day, it's entirely possible they *do* have a backup DC, but chose not to transfer. Transferring a thousand services to backup carries an immense risk, and they may have decided that it's safer to simply repair the power systems and bring the original DC back online.

      --
      A government is a body of people notably ungoverned - AC
  21. For those claiming bad managers and saving money: by Anonymous Coward · · Score: 4, Interesting

    Most of y'all probably don't know what you're talking about. Here's what's going to happen:

    1) Delta will file a loss-of-business / data system failure claim after things are stable again
    2) They'll haggle with their insurer long after this little story is forgotten (and yeah, lots o' heartache today, but it's still probably going to be little.)
    3) Delta will get a settlement of some dollar amount
    4) Some bean counter will eventually tally the cost of that policy versus the payout versus how much all those redundant backups would have cost. The accountant will most likely conclude that it was a smart idea to have bought that insurance policy and NOT paid out the multimillions of dollars IT was asking for in redundant systems.
    5) The insurance company will note the payout as a blip on its financials (probably already expected by the actuaries.) Insurance company will keep making profit.

    The little air traveller is screwed and blued, but Delta and its insurer will keep flying. Doing business today without a data loss rider on your business insurance would be the really stupid idea, much more so than wasting money on redundant systems that are more expensive than said rider.

  22. Surface vs Actual by 1080bogus · · Score: 2

    While on the surface it may appear their IT department is "incompetent" as one person pointed out, other factors could have contributed to the outage. Management not approving proper tests to be done or another datacenter in a completely different location. Improper maintenance on the generator(s). While IT may request that things be done or placed a certain way, that doesn't mean the facilities team cares or understands why, and they may do it their own way anyway. Like why have two generators located right next to each other? They probably shared the same resource for operating as well.

    It takes an event like this for people to realize the importance of listening to the people who implement and maintain their infrastructure. I'm sure anyone who saw this happening is digging through their memos and pulling out the multiple requests for disaster recovery solutions to prevent these things. Not to show them, haha I told you so, but to cover their ass when they start looking for someone to fire.

    It's easy to point out IT as the scapegoat but sometimes they just have to deal with what they're given by the higher ups.

    1. Re:Surface vs Actual by david_thornley · · Score: 1

      It's possible, although unlikely, for genuinely unforeseen situations to occur, and disaster plans usually have a limit on how much of a disaster they're good for.

      --
      "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
  23. Either way IT looks bad by sjbe · · Score: 1

    I'll bet you dollars to donuts that the IT folks squealed like stabbed piglets that they needed a backup system alternative.

    I'll take that bet. I'm betting they either overlooked something technical or they are just really bad at making financial arguments. Since a key part of engineering is being able to justify what you want to do in financial terms my guess is that they just weren't very good at their job. Justifying equipment to prevent an outage that would cost millions of dollars per minute is trivial.

    Who knows? Maybe the costs of dealing with this fiasco will be cheaper than having a backup system . . . ?

    Maybe but I doubt it. Given that Delta and other airlines are experiencing record profits, it's hard to see them not understanding the math of a system-wide shut down and what that would cost them.

    1. Re:Either way IT looks bad by Anonymous Coward · · Score: 0

      just really bad at making financial arguments

      Sorry, but mgmt never wants to hear any financial arguments. They'll tell you your budget. And make the choices for you.

      Since a key part of engineering is being able to justify what you want to do in financial terms

      BULL SHIT. It's mgmt's job to identify what's important to the business and fund it properly (people, time and/or money). That is their ENTIRE fucking job.

    2. Re:Either way IT looks bad by Anonymous Coward · · Score: 0

      Classic MBA finger pointing. He probably had a report sail through his desk outlining this exact situation and the cost to prevent it, but it's "the engineer's" fault he didn't read it because it wasn't synergizing the paradigm-shifters enough to compel him to give up his morning golf.

      That, or construction, installation, staffing and certification of a backup data-center would cost significantly more than this outage, in which case it's "the engineer's" fault the economics works out that it's cheaper to have occasional outages.

    3. Re:Either way IT looks bad by Anonymous Coward · · Score: 0

      This so much this.

      I was told my budget. I was told here is the hardware to buy. No "I do not need 1.5 million in hardware" would be listened to. So I was stuck building 1.5 million in hardware racks (real fun for a software guy). So I spent their money, built the racks no one is using, and left the company. How do I know they are not using it? I asked the dudes who ended up with it. "I don't think anyone has logged in for a month".

      Many times 'I will lose my budget next year' is all they are worried about. Instead of 'budgets go up and down and here is the justification for 3x the spend this year' or '2x less this year'.

  24. Re:Shouldn't have upgraded to W10 ! by NatasRevol · · Score: 1

    IT folks usually put in the requirements for the power infrastructure, but I've almost never seen them handle it.

    Often, it's building/maintenance who handles it.

    And as with any project, it's probably that upper management didn't want to pay for the level of redundancy that IT said was required.

    --
    There are two types of people in the world: Those who crave closure
  25. Re:Shouldn't have upgraded to W10 ! by Anonymous Coward · · Score: 0

    I don't know about the Delta operations center, but I do know that United has both a backup site (its pre-merger operations center) with a one-for-one replication of all operator positions and systems that they can fail over to in the event of an emergency and a smaller site that they can switch over to faster though with reduced capability. I would assume that all major airlines have similar setups, but something clearly didn't work at Delta. It isn't clear whether the problem here was with the backup systems in place or the management decisions made in response to the power failure.

  26. A version of Godwin's law by sjbe · · Score: 3

    For any IT discussion on slashdot, as time T increases, the probability of a neckbeard blaming "MBAs" approaches 1

    Yeah, it's sort of a riff on Godwin's law. If someone blames "MBAs" for a problem, that person has no fact-based arguments left, so the argument is over and the person doing it loses the argument. It's basically scapegoating and tribalism at its worst.

    Management is a pretty easy target. Management has to make decisions with imperfect information (like playing poker) whereas engineers are used to working with greater certainty (more like playing chess) and it's hard for many of them to wrap their head around the difference. Engineers who don't actually know any better seem to think MBA is shorthand for management incompetence. Never mind that an MBA is a degree, not a person or even a category of people. It's as stupid and incoherent as saying CS = incompetent programmers. I happen to be an engineer but I'm also a certified accountant. I have degrees in both engineering and business and I use both in my day job running a manufacturing plant. I can say with absolute confidence that there are just as many engineering school graduates who are bad at their jobs as there are business school graduates who are bad at their jobs. I run into both routinely. And just as many who are good at their jobs as well. Just because you may have run into some of the bad ones doesn't grant the right to paint the rest with the same brush.

    1. Re:A version of Godwin's law by torkus · · Score: 2

      If you want to throw blame around...let's give it to the 1% crowd.

      I'll even justify it...watch!

      Redundancy and proper backup costs $. Odds of occurrence are quite low and pointy-haired people have this habit of cutting budgets to meet spending targets and save money, and all that. Why? Oh, because their bosses say so...the execs and board. Why? Because the company can get an extra $xyz in EPS by cutting budgets back and taking the low % risk on themselves in the short-ish term.

      So yeah, we close down the secondary datacenter and justify it with getting a redundant backup generator or something ... save a chunk of money, improve the company's margin by a smidge (which is considered impressive given how much they've already squeezed) .. and the stock market reacts to the 'innovative savings' positively which raises the stock price by a bit.

      That 'bit' matters when you own 6- or 7-figure $ in stock of that company...and your yearly $millions bonus is tied to the same. It's the same reason companies habitually gut their employee base ... not because the company is about to be insolvent but because their stock price sucks (generally due to not having 'enough' profit). /rant

      --
      You can get rich if you own a politician, but you have to be rich to buy one in the first place.
    2. Re:A version of Godwin's law by jbengt · · Score: 2

      Management has to make decisions with imperfect information (like playing poker) whereas engineers are used to working with greater certainty (more like playing chess) and it's hard for many of them to wrap their head around the difference.

      As a mechanical engineer in the construction industry, I can testify that working with imperfect information is the normal situation for us. And that "management" often requires engineers to boil down extremely imperfect and uncertain cost data into a singular "hard" number like Return On Investment so they can fool themselves into thinking they are making a fact-based decision. (OK, the better ones realize that there's a huge range of possible outcomes depending on external factors, and try to take account of error bars and the like, but those decision makers are relatively rare.)

    3. Re:A version of Godwin's law by Anonymous Coward · · Score: 0

      Deconstructing this.

      1: Any kind of redundancy and preparedness requires some very basic planning. If you are going to prepare for a riot and stash some gear away, take regular inventories because stuff that sits disappears. If you are going to set up redundant transformers for the electrical grid, basic maintenance like testing the transformer oil to make sure they are not going to explode and wind up on the 7 o'clock news is needed. If you, as an MBA, an individual with 6 years of managerial training, cannot figure out how to break your department down into programs and, from the data-center perspective, break the data center into different service levels, place the apps in those service levels, and then plan out how to deliver those service levels inexpensively, and if you do not understand that your technical doohickey needs redundancy and geeks checking it, we have seriously hooked a blender into your head and turned the sucker to max. I get that we've got politics going on but some things transcend politics. You define the service levels, get the stakeholders in the room, let them throw darts to tell you what goes where.

      2: Any manager that blames their "crappy staff" rather than writing a post-mortem and doing the root cause analysis into what happened and why is just a crappy manager. Blame fixes nothing. Those people generally flush money down the drain on mistake after mistake and make the jobs of the good managers and good sales and staff infinitely more difficult. If those people all ended up in a mass grave tomorrow, the world would be a far better place.

      3: You get paid big bucks for everyone to ask "What the flying fsck!" when things like this happen. Because it is your job to take the blame, provide the assurances and, where necessary, polish your turds. If you do it right, the media goes buck wild for a few days and it's a 15 minute meeting with the investors.

      4: There's a special class of managers who go from failure to failure. If you, as a company, are rewarding high level managers for anything but staying on board for the long haul and making sure things run right, well, sorry to say, it must be awesome watching the investors fight with each other to try to get each other's money through your org. It's part of being in the new aristocracy to get money floating from a company doing "work" when in reality you're just getting cuts of the investors' money in a legal way.

    4. Re:A version of Godwin's law by Seahawk · · Score: 1

      All I get is that MBA's are as bad as Hitler! ;)

  27. Re:For those claiming bad managers and saving mone by NatasRevol · · Score: 2

    Accountants don't have a good idea of lost business opportunity or lost customers.

    So while the basics may make financial sense, that doesn't actually mean it was a good idea.

    --
    There are two types of people in the world: Those who crave closure
  28. Huh? Did Something Change? by Anonymous Coward · · Score: 0

    Considering the crap "disservice" Delta usually demonstrates (multi-hour delays are a "standard feature" of Delta), this can only improve things...

    1. Re: Huh? Did Something Change? by ryanov · · Score: 1

      I have flown Delta consistently several times a month for at least the last two years. Very few, maybe 3-5 total, were late arriving, and at least a couple of those times it was out of their hands.

  29. Re:Incompetent IT *management* by Anonymous Coward · · Score: 0

    Remember, kids "IT Doesn't Matter".

  30. non union h1b electronic F* uped and they will use by Anonymous Coward · · Score: 0

    non union h1b electronic F*cked and they will use this to blame IT and can more USC's

  31. Sounds like a problem with flight planning by Ami+Ganguli · · Score: 5, Informative

    I used to work on one of these systems.

    The flight planning system takes inputs from several sources - weather forecasts, notices about airspace closures, etc. (NOTAMs), and booking info - and creates an optimal flight plan for the aircraft.

    A modern airline doesn't have enough flight planning staff to take over manually if the system fails, so if your flight planning goes out, your fleet is gradually grounded.

    The large number of servers is due to the optimization problem. You need to take into account the flight conditions and fuel costs in different locations in order to decide your route, altitude, and fuel loading. Since fuel is a huge percentage of the operating cost of the airline, it pays to invest a little extra computing power into optimizing these and save a bit of fuel on each flight.

    Our system had lots of redundancy but, with all the data feeds, there are lots of moving parts. It's not hard to imagine a scenario where, for example, you get everything transferred over to your disaster recovery site, but for some reason the weather feed isn't coming in and you can't make flight plans.
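
    The failure mode described above (servers failed over, one input feed missing) can be sketched roughly as follows. This is only an illustration: the feed names, the `FeedStatus` type and the staleness limits are made up for the example and are not taken from any real flight planning system.

```python
from dataclasses import dataclass

# Hypothetical feed names, loosely based on the inputs listed in the comment.
REQUIRED_FEEDS = ("weather", "notams", "bookings")


@dataclass
class FeedStatus:
    name: str
    age_seconds: float      # time since the last message arrived
    max_age_seconds: float  # illustrative staleness limit for this feed


def missing_or_stale(statuses):
    """Return the required feeds that are absent or too old to trust."""
    by_name = {s.name: s for s in statuses}
    problems = []
    for name in REQUIRED_FEEDS:
        status = by_name.get(name)
        if status is None or status.age_seconds > status.max_age_seconds:
            problems.append(name)
    return problems


def can_plan_flights(statuses):
    """Planning is possible only when every required feed is fresh."""
    return not missing_or_stale(statuses)


# DR site after a failover: servers are up, but the weather feed never reconnected.
dr_site = [
    FeedStatus("notams", age_seconds=30, max_age_seconds=600),
    FeedStatus("bookings", age_seconds=5, max_age_seconds=60),
]
print(missing_or_stale(dr_site))  # ['weather']
print(can_plan_flights(dr_site))  # False
```

    The point of the sketch is that "the servers are running at the DR site" and "we can make flight plans" are separate checks; a failover test that only covers the first one can pass while the second still fails.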

    --
    It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail. - Abraham Maslow
    1. Re:Sounds like a problem with flight planning by cozytom · · Score: 1

      It could be everything too.

      Flight planning and flight following is one system, typically, but then there is crew and maintenance. So it becomes a huge data sync problem. Start moving planes, and you need to know where the crew is, or where they will spend the night, so they can be where they need to be the next day. Same with maintenance. If a plane has a C or D check tonight, the plane needs to be where the mechanics are, or the plane may not be able to fly tomorrow (except as a ferry flight).

      Start moving one cog, and all the gears need to mesh, or the whole thing will get really expensive.

  32. The old saying is true by fustakrakich · · Score: 1

    People who depend on glass infrastructures shouldn't throw stones.

    And, as already pointed out, if insurance and lawsuits cost less than robust equipment, you can expect much more of this as the world goes online. This is what I love about the "corporation". Nobody is ever held responsible. It's almost as if by design :-/

    --
    “He’s not deformed, he’s just drunk!”
    1. Re:The old saying is true by ISoldat53 · · Score: 1

      If more than one person is responsible, no one is at fault.

  33. Paperless Tickets by Vlad_the_Inhaler · · Score: 4, Interesting

    This story brought to you courtesy of paperless tickets. Yes they are cheaper, yes it is simpler if people can print their own tickets, but the IT has to be up and running.
    I remember an airline IT outage back in September 2004. There was a bug in the OS's error-handling routine for a particular class of error. This had all been tested with this particular OS level and had worked, but they had been forced to change the OS configuration to accommodate some new software and the bug was in place. Moving to new discs required a reboot, and an additional configuration error caused problems. If it had been fixed within (I think) 90 minutes all would have been fine. The outage was 8 hours.
    Passengers turned up at the airports with their paper tickets and were allowed to board. Any pre-allocated seating was ignored. People were laughing about flying the way things used to be, a good time was had by most.

    Then came paperless tickets. The next outage had effects more like those we see in this case.

    --
    Mielipiteet omiani - Opinions personal, facts suspect.
    1. Re:Paperless Tickets by thegarbz · · Score: 1

      Yeah sure. The problem was lack of tickets when a system wasn't even able to assign gates, scheduled arrivals, departures, or basically get up any information on any plane, let alone those needed to print your paper tickets. Also how did you boot up your new servers without power again? The entire datacentre was out including backup power. The only thing which would have avoided this was a parallel datacentre and one would assume that someone did a cost-benefit analysis to eliminate this redundancy.

      Please apply some thought before you post. Yes paperless tickets have introduced an additional failure, but so far not a single incident that has made the news has been a result of paperless tickets.

    2. Re:Paperless Tickets by orgelspieler · · Score: 1

      When United had its little hiccup a couple of years ago, I was on my way to Tulsa. I needed to bump my flight time up. Right as I got approved to be on the new flight, their computer system went down. The lady at the counter said, "I guess we'll have to do this the old-fashioned way." She hand-wrote my boarding pass. I still have it stapled to my wall. The lady at the gate looked at it, chuckled, and let me pass. I got on the flight, and the pilot said, "Well, they've told us we can't leave because the corporate computers are having some kind of a problem. But screw those guys, I got clearance from the tower so we're leaving anyway!" To which the whole plane erupted in applause. Here's to people that still know how to do things the old-fashioned way.

  34. WTF? by Anonymous Coward · · Score: 0

    What the fuck happened to failover? It is standard in the computing industry, so why don't Delta and Southwest use it? Someone call IBM or Oracle today.

  35. Re:Incompetent IT *management* by RabidReindeer · · Score: 1

    Who knows? Maybe the costs of dealing with this fiasco will be cheaper than having a backup system . . . ?

    By the time the Bean Counters get done? Depend on it. The books aren't going to show the future revenue lost because people swore off Delta in disgust and anyone who depends on surveys to obtain intangible data is going to get what they deserve. Even allowing for the fact that many people don't want to waste time on a survey to begin with, you can't survey people who thought "Delta? Those screwups?" and never even considered the company. Well you can, if you're into blanket surveys, but those are worth even less than the customer surveys.

    And yes, if you detect an anti-survey bias there, you're right. To me, surveys are what you do when you're too out of touch to actually watch and listen to customers (Strike One), put blinders on your perceptions by virtue of only asking the questions your bean-counters think are worth asking (Strike 2), and are often only answered when the querent is either A) pissed, B) a "professional" survey answerer (limited, atypical population) or C) couldn't get away fast enough without actually gnawing off body parts (Strike 3).

    So the drop in revenue over the long haul will be blamed on something more measurable and bonuses to the real offenders will continue unabated.

  36. This is unadulterated bullshit by HBI · · Score: 2

    Of course it's the MBAs' fault. Their very raison d'être is calculating the costs of additional redundancy, and comparing that against the costs of a global operations failure and the ensuing loss of business due to carrier unreliability. Then, presenting this data to a decision maker for action.

    There are only two ways that they can get off. One way is if the decision maker chose to accept the risk, knowing it fully. The other way is that if the IT department didn't advise them of the risk. I evaluate the chances of the IT department being dumb enough to not know what would happen as near zero.

    You're left with MBAs who failed to present the business case properly or a CEO who is a retard. Choose one.

    --
    HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
    1. Re:This is unadulterated bullshit by Anonymous Coward · · Score: 0

      Have you considered that the MBAs and CEOs thought about the costs of an extended downtime relative to the costs of maintaining redundancy over a period of time and found out that the costs would be lower if an extended downtime occurred?

      If this is the case then we're probably in the undesirable situation of too many costs being externalized from the airline.

    2. Re:This is unadulterated bullshit by schnell · · Score: 2

      Just out of curiosity, is anything ever IT's fault? Or is it always the evil MBAs? Is there any chance that we, the collective Slashdot audience, have absolutely no clue what the internal funding, competency, vendor choices and strategy of Delta are?

      --
      "95% of all Slashdot .sig quotes are incorrect or completely fabricated." -Benjamin Franklin
    3. Re:This is unadulterated bullshit by SvnLyrBrto · · Score: 1

      Well, there are two possibilities in a situation like this. Either IT is incompetent and utterly incapable of properly designing and building reliability and redundancy into their systems. Or some obstructionist elsewhere in the corporate org chart... and these people often do hold MBA degrees... has maladministered to deny IT the resources necessary to do the job correctly, and they were forced to make compromises that reduced said reliability and redundancy.

      While the first is certainly not unknown; in my own experience, the latter is more common.

      --
      Imagine all the people...
    4. Re:This is unadulterated bullshit by david_thornley · · Score: 1

      I don't know how Delta operates internally, but it's the fault of management. Either they hired incompetents, or they hired competent people and didn't budget for what those people told them was necessary, or they deliberately decided to risk downtime. It may also be IT's fault, but it's the job of management to hire people who can do the job.

      --
      "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
    5. Re:This is unadulterated bullshit by RespekMyAthorati · · Score: 1

      Or is it always the evil MBAs?

      Well, given that it's PHB's (who are mostly MBA's) who hire the IT staff, and decide whether to pay for the best IT talent available, then yes it is always the incompetent MBA's fault.

  37. Re:For those claiming bad managers and saving mone by Anonymous Coward · · Score: 0

    This is not a response to your uninformed comment. It's to MANY uninformed comments.

    1) They have a DR plan. Doesn't mean they chose to execute it (time to failover vs expected time to recover and many other decisions).
    2) The accountants don't run the decisions.
    3) The MBAs don't run the decisions.
    4) IT doesn't run the decisions.
    5) Decisions are made in a company at many levels, with the input of many department heads. Have you guys never worked in a large enterprise before?

  38. Cue the Cloud Consultants in 3, 2, 1.... by ErichTheRed · · Score: 1

    I guarantee the cloud infrastructure guys are salivating at the opportunity to convince the MBAs to ditch Delta's data center. What they won't mention is how much it would cost to actually implement instant failover capability in a cloud environment. I'm not anti-cloud, but I do think a business as large as Delta isn't going to see a lot of cost savings over what they're paying now for equipment. Microsoft and Amazon don't give away capacity for free, and you often pay dearly for certain key elements (IaaS, network connections, etc.). The MBAs don't see this though; they only see CapEx vs. OpEx and "we can fire 90% of the IT department."

    IT cost isn't exactly something airlines spend willingly. Unless it directly affects safety or increases revenue/reduces cost, they want nothing to do with it. I guarantee the proposal for a redundant data center, or even a cloud-based DR location was floated, looked at and rejected as being too expensive. Airline IT is a web of third-party dependencies, each of which has a few single points of failure. Although, bad luck for them, this one seems like a straight power outage and/or transformer/generator failure. At least it seems like they didn't fry their computing equipment if they were able to get back online in a few hours. Sadly, I have experience with this and have seen companies dismiss the cost of a $20K server colo and network connection as excessive. People seem to forget that you need to guard against downtime unless you're some Web 2.0 startup social media company...if it costs $XX,000 per minute of downtime, you have to be willing to eat that or pay for DR.

    1. Re:Cue the Cloud Consultants in 3, 2, 1.... by Anonymous Coward · · Score: 0

      Who knows, they might even sell the data center to a cloud company. Hey, 500 servers ... but probably too old for convenient transfer to other sites/purposes.

    2. Re:Cue the Cloud Consultants in 3, 2, 1.... by guruevi · · Score: 1

      500 servers is like 10-20 racks. That's a very small datacenter in the company's basement or perhaps 1 mainframe with 500 instances that was recently (in the last decade) converted to a cluster. Either way, if a worldwide system is located in a single datacenter, I'd say the latter is probably the case, which is currently IBM's modus operandi when a customer wants to upgrade an old mainframe.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
  39. Re:Shouldn't have upgraded to W10 ! by I4ko · · Score: 1

    Some of the IT staff most likely don't have a scrotum, you insensitive clod. Why aren't they getting their fair share of hanging?

  40. Delta sucks by OneHundredAndTen · · Score: 0

    It has for years. The consolidation of the commercial airline industry in the US has resulted in companies that suck uniformly, and prices that remain high uniformly. I have been able to fly Bristol to Madrid, in Europe for less than $50 each way. In the US, Denver to San Francisco, which is roughly the same distance, will set you back by $200 round trip, if you are very lucky. Usually, it will be more like $300.

    1. Re:Delta sucks by orgelspieler · · Score: 1

      You could always fly Frontier, if you value money over your self-respect and sanity. They sum it up pretty well with their new slogan: "Frontier -- for when Spirit just isn't shitty enough for you!"

  41. Re:That's about $100 Million per day in lost reven by Nkwe · · Score: 1

    You would think they would have a backup for the backup power. But like someone earlier said, this outage sounds suspicious.

    Or if you are down for 2 days ($200 million), and the cost of having a fully redundant system is more than $200 million (equipment, people, process, ...), from a business sense, it may make more sense to just accept an occasional outage.

  42. Re:Shouldn't have upgraded to W10 ! by Anonymous Coward · · Score: 1

    Yeah, but IT should be involved in testing it. Had one very embarrassed data center manager when the IT manager said something like "So if I hit this red button, we will be running uninterrupted on the batteries and nothing should go wrong right?" When receiving an enthusiastic affirmative he went ahead and hit the red button. As you might guess the data center immediately powered off, hard...

    And before you mention how reckless this was, this was in an outage window that was scheduled ahead of time, was just after backups were complete and verified, and the IT manager had gotten the OK to do it from the higher ups. Basically everyone agreed they needed to know for sure it was going to keep running if the power was cut and were OK with the potential outage at a known time to hopefully prevent an unplanned outage at a random time.

  43. Single data centre for critical resource? by MarkH · · Score: 2

    Blimey, I wouldn't do that, and I'm running a bog-standard stream service, never mind an airline with 100 million a day of revenue.

    500 servers is about 50 racks. About 500,000 a year plus about 2,000,000 for kit and 4,000,000 for software and licenses and 250,000 for interconnect. So capex 6,000,000 and opex call it 1,000,000 per annum.

    I normally rate a major DC failure (more than 10 min) at about once every 5 years.

    Easy business case.

    Also, generator and UPS failover is tough to test with one DC, which hit this one bad.
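
    For anyone who wants to check the "easy business case", here is a rough sketch using only the figures quoted in this comment. It assumes one major failure per five-year period and that lost revenue scales with outage time; both are simplifications, and the currency is whatever the comment intended.

```python
# Figures quoted in the comment above; nothing here is verified airline data.
CAPEX = 6_000_000            # one-off cost of the second data centre
OPEX_PER_YEAR = 1_000_000    # running cost per year
DAILY_REVENUE = 100_000_000  # airline revenue per day
YEARS_BETWEEN_FAILURES = 5   # one major DC failure roughly every 5 years


def redundancy_cost(years):
    """Total spend on the second data centre over the given period."""
    return CAPEX + OPEX_PER_YEAR * years


def break_even_fraction_of_day(years):
    """Fraction of one day's revenue an outage must cost to justify the spend."""
    return redundancy_cost(years) / DAILY_REVENUE


print(redundancy_cost(YEARS_BETWEEN_FAILURES))             # 11000000
print(break_even_fraction_of_day(YEARS_BETWEEN_FAILURES))  # 0.11
```

    On these numbers the second site pays for itself if the one expected failure in five years would otherwise cost about 11% of a single day's revenue, which is a little over two and a half hours of a 24-hour day.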

  44. Re:Shouldn't have upgraded to W10 ! by NatasRevol · · Score: 1

    Well the testing is assuming that it's there at all.

    --
    There are two types of people in the world: Those who crave closure
  45. Insurance by sjbe · · Score: 4, Informative

    Without Federal requirements there is no way a corporation is going to spend that kind of money.

    A few failures like this one and they'll dig into the couch cushions to find the change for it. Having a backup data center for stuff that will shut the company down is not exactly a tough thing to justify. This shutdown alone would probably justify the cost in a single day.

    They have legal protections in place to assure they retain their terminal slots, so while they aren't making money now they won't lose in the long run.

    Perhaps but if they managed their IT properly they wouldn't have to lose money now. They can buy the insurance or they can take the risk of serious illness so to speak. Their choice and their funeral. Sounds like they rolled the dice and came up snake eyes today.

    The only businesses with total data recovery sites and plans to actually use them are Banks, and that is because they are required by the FDIC.

    Not true. Some medical practices have them. Some internet firms have them (at least for the mission critical stuff). Some bits of the military and government have them. Insurance companies have them. Stock exchanges have them. And there are more as well. If it's valuable enough you have a backup data center of some sort.

    1. Re:Insurance by funwithBSD · · Score: 1

      I work with BCRS (Business Continuity and Resiliency Services) at IBM on a regular basis as an IBM Cloud Architect.

      ALL KINDS of companies have Recovery plans, but it sure is more likely in the industries you mentioned.

      I know of at least one airline competitor that has multiple sites for their business, and came to IBM to get 2 more built because the current two were too close for comfort after Hurricane Sandy proved how large of an area a single disaster could impact.

      --
      Never answer an anonymous letter. - Yogi Berra
    2. Re:Insurance by gordguide · · Score: 1

      Every Airline ... every company, every academic organization, every government agency, almost, for that matter ... had taken all the necessary steps to ensure a local power outage would not affect the system as a whole, back in the pre-Y2K period. My ex worked for the Federal Prison System; every jail and every administrative and secure hospital facility (such as Maximum Security Psychiatric care facilities) was able to operate at 100% with local power failure.

      There were tested, redundant, working generator units at such insignificant areas as purely administrative bureaucratic sites, where the staff might be doing things like designing a building, and could easily survive a week off work with zero effect on the system.

      Interesting Tidbit: Any locks that could be affected by power failure are by law designed to fail open; it's a fire safety thing. You can't specify a fail closed lock even if you wanted to. But Prison Systems are exempt from the fire regulations, and all their locks fail closed.

      So, we can pretty much guarantee that 16 years ago Delta had the tested hardware and software in place such that a power failure in Atlanta could not disrupt their computerized flight system. And a little more than a decade later, that system is nowhere to be found.

  46. Security by infernalC · · Score: 1

    Delta has demonstrated that it, one of the world's largest airlines, doesn't co-locate its critical infrastructure in redundant data centers with fail-over mechanisms. Delta's inability to operate has ripple effects in the operations of other airlines as well. Now criminals know that Atlanta is an Achilles' heel, and to cripple the world's air transportation systems, they need only attack its power grid. Obviously, market incentives are not sufficient to make them have a more robust infrastructure. I think the FAA needs to step in here and regulate a little sanity into the system.

    1. Re:Security by Anonymous Coward · · Score: 0

      Obviously, market incentives are not sufficient to make them have a more robust infrastructure. I think the FAA needs to step in here and regulate a little sanity into the system.

      Are you willing to pay more for your airline tickets? That's what you will get when you want "gubament regalations" to step into something.

      Having government regulations and oversight doesn't mean a thing. Seriously bro. Companies "cut corners" all the time on US federal contracts and only some of them are caught later, sometimes much later, IF AT ALL.

      And what happens to violators of "gubament regalations"? They pay a fine.

      magic word: cousin [oh how appropriate for "gubament regalations"]

    2. Re:Security by infernalC · · Score: 1

      Yes, I am willing to pay a little more to decrease catastrophic risk. We already do that with many other products.

    3. Re:Security by infernalC · · Score: 1

      I doubt that IT costs are a burdensome percentage of fares. I bet it's mostly fuel, equipment and labor. A 737 costs about $50 million, and I'm guessing another million a year to maintain over a 20-yearish lifespan. Assuming 1000-ish flights a year with 150 paying seats on the flight, you're talking about $25 per ticket to pay for the plane. Fuel is about $5/gal, with average per seat mpg of 80-ish. So we're talking $50 per seat for fuel... up to $75. Add in labor, airport costs, taxes, etc... I'm just willing to bet that IT costs are less than 2% of a plane ticket. I bet adding proper redundancy would just be a drop in the bucket.
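
      The back-of-envelope numbers in this comment can be reproduced with a short sketch. Every figure is the commenter's rough guess rather than real airline data, and the 800-mile flight length is an assumption added here because it is what the quoted $50 fuel figure implies at 80 seat-mpg and $5/gal.

```python
# All inputs are the commenter's rough guesses, not verified airline data.
PLANE_PRICE = 50_000_000
MAINTENANCE_PER_YEAR = 1_000_000
LIFESPAN_YEARS = 20
FLIGHTS_PER_YEAR = 1_000
SEATS_PER_FLIGHT = 150
FUEL_PRICE_PER_GALLON = 5.0
SEAT_MILES_PER_GALLON = 80


def plane_cost_per_ticket():
    """Purchase plus lifetime maintenance, spread over every seat ever sold."""
    lifetime_cost = PLANE_PRICE + MAINTENANCE_PER_YEAR * LIFESPAN_YEARS
    lifetime_seats = FLIGHTS_PER_YEAR * SEATS_PER_FLIGHT * LIFESPAN_YEARS
    return lifetime_cost / lifetime_seats


def fuel_cost_per_seat(flight_miles):
    """Fuel cost for one seat on a flight of the given length."""
    return flight_miles / SEAT_MILES_PER_GALLON * FUEL_PRICE_PER_GALLON


print(round(plane_cost_per_ticket(), 2))  # 23.33
print(fuel_cost_per_seat(800))            # 50.0
```

      So the "$25 per ticket" for the plane is a rounding up of roughly $23, and plane plus fuel lands near the $75 quoted above before labor, airport costs and taxes.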

    4. Re:Security by lamer01 · · Score: 1

      Flew to the Caribbean recently. $450 per ticket. Almost half of that was various taxes and fees. So, the govt overhead is far more significant than IT.

  47. SAME AS THERMOSTAT STORY ABOVE by Anonymous Coward · · Score: 0

    Also same as trying to fish for your torrent sites a few days ago when Slashdot was outed as a FBI site now.

    They want your responses to these stories. They are trying to plan for unexpected responses in future false flag events in USA.

    Hack the thermostat? Airlines grounded? Where do you get your torrents from? How would you respond?

    Please have a seat and tell us, we are not FBI we are just stealth Slashdot submitters. Also, have you tried Microsoft Anniversary 10 yet? IT IS FREE.

  48. Some Lessons Are NEVER Learned by DERoss · · Score: 1

    In the summer of 2003, the Great North-East Blackout hit New England and other areas in the U.S. and parts of Canada. My wife and I were in Montreal at the time. When we tried to fly home non-stop to California from Trudeau International Airport (called Dorval International Airport at that time) via Air Canada on an early morning flight, we instead found ourselves flying in the late afternoon to Dulles in Washington, DC, changing planes, and then flying home. We arrived at our house more than 12 hours late.

    No, Montreal and the rest of the province of Quebec were not affected by the blackout. Air Canada's computers, however, were in Toronto. Toronto and much of the province of Ontario were indeed blacked-out. While other airlines continued normal operations out of Montreal, Air Canada could not confirm reservations or issue boarding passes. Air Canada had no remote backup facilities.

    Apparently, Delta Air Lines learned no lesson from Air Canada's experience 13 years ago.

    1. Re:Some Lessons Are NEVER Learned by guruevi · · Score: 1

      They probably did but then tried to shave the costs down again when a new generation of managers came along 2 years later "why do we have a datacenter doing nothing most of the time, let's go to the SAAS/MAAS/PAAS model (which I think was the buzz word for shared hosting 10 years ago)"

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
  49. application recovery vs infrastructure recovery by Archfeld · · Score: 1

    Do they have an entire recovery DC or space in someone else's DC? Most businesses have plans to recover certain applications or move them to run on backup/development hardware. I worked for years in Contingency recovery and most places I've supported have space to recover applications should they fail, but few have the dedicated space or a plan to recover an entire infrastructure should a failure occur, and fewer have a plan to move BACK to the original space when the problem is fixed. The cost to maintain duplicate hardware/space for everything, plus the people to recover it in an emergency, is ENORMOUS, and the logistics to do so extremely complicated. Recovering the front end in a leased or rented space supported by another entity is very different than a full structure recovery move. Heck, most places don't even have the offsite data, e.g. full application code plus FULL data backups, needed to recover from scratch.

    --
    errr....umm...*whooosh* *whoosh* Is this thing on ?
    1. Re:application recovery vs infrastructure recovery by Anon-Admin · · Score: 2

      Off the top of my head I can name over 20 companies that have full failover to a backup DC. One of them is an Airline that everyone knows the name of.

      Hell, I have configured stretch clusters for companies so that in the event of a DC failure the secondary DC is available with 0 down time and the failover is automatic. So it is done, it is normal operating procedures/best practices, and there is no reason the SECOND LARGEST AIRLINE IN THE US IS NOT DOING IT!!!

      If you want to argue that some small company of 1000 people is not doing it that is fine but there is no excuse beyond management failing to do their job for this one. I think the board needs to look into it and start cutting people from the top down.

  50. The Il toll way is loading I-90 up with backup p by Joe_Dragon · · Score: 1

    The IL tollway is loading I-90 up with backup power all along the new smart highway part. How redundant is that system? If it fails, people can end up with free tolls.

  51. application vs infrastructure recovery by Archfeld · · Score: 1

    There is a great deal of difference in recovering certain applications or having multiple sites running a subset of one facet of your operation. A full structure recovery requires the hardware, staff and FULL data, e.g. full application and user data available to recover from scratch. That kind of overhead is enormous. Recovering mission critical stuff is par for the course, but recovering everything in a DC needed to do day to day operations in the event of a full infrastructure failure is a different beast entirely.

    --
    errr....umm...*whooosh* *whoosh* Is this thing on ?
  52. Definitely win10 ! by portal2 · · Score: 1

    No, their computers upgraded to win10 overnight. So none of their custom software worked anymore.... :-)

  53. Re:Shouldn't have upgraded to W10 ! by JackieBrown · · Score: 1

    It's fun watching every department point to every other department for blame.

    IT - It's upper management's fault. We assumed they took care of it. Or now, it's facilities' fault.
    Upper MGMT - It's IT's fault. We assumed if it still needed doing they would have told us.

    Just need IT and Upper MGMT to talk first so they can sync the blame on facilities.

  54. Re:For those claiming bad managers and saving mone by Anonymous Coward · · Score: 0

    Your analysis rings true ...

    And nothing will change at Delta unless there is a noticeable drop in ticket sales that can be directly attributed to this outage.
    The inconvenience to passengers is worth a small bucket of warm piss unless passengers stop buying tickets.

  55. Delta is number TWO in more ways than ONE by Bob_Who · · Score: 1

    Namely, they always shovel out heaps of number two whenever something goes terribly wrong. Their response policy is ALWAYS to tell LIES. It's POLICY to SPIN ALL NEGATIVE PRESS ATTENTION AT ALL TIMES. The truth will only make it worse because they know they are prone to major fuck ups and they have lots of enemies. They just don't want any of their cheap, stupid, or dishonest screw ups to look like they are willing and able to constantly screw up service for their customers since it is a calculated risk they willingly commit, and will commit again, because it's worth being unreliable if it saves them the very high cost of total reliability and service. Delta didn't get to number two by acting like the best. They got there with compromised performance where it counts the most: the customers' ruined travel plans. What's a random act of system breakdown worth? It's very valuable because it buys them time to actually go off line, and get some very serious technical work done in a few hours that might otherwise take weeks, and great resources and planning, in order to implement in parallel to smooth daily operations. When an event like a power failure in Atlanta is blamed for their problem, then are we to believe it is reasonable to design their entire airline system to fail every time that event occurs? Is that how it goes with every other airline that experiences power failures? Not if they plan and provide contingencies for all known possible, albeit improbable, events. These jerks KNEW that this would be EXACTLY what happens IF a power outage occurs in Atlanta at 2AM on a SUNDAY. They probably can mitigate the collapse if it's during business hours, while all hands are on deck, but there is no way that they are going to PAY THE WADS OF CASH it costs to have a hair trigger response team on standby 24/7 in case of an off hour power outage.
    Nor will they pay the heaps for backup power systems capable of managing a seamless power supply for their fat power sucking energy wasting corporate consumption of all available amps in every facility and overclocked full throttled server farm and network that is required. They could have told us 10 years ago that this is what will happen under these exact circumstances. That's how it was designed because it's a great cost savings to gamble that the power failure won't happen, but if it does, they can handle it with the resources available to them during normal operations, which is where it counts most. They make a calculation that it's worth the cost savings not to worry about 24/7 contingency response because it involves a huge labor cost. They are not going to piss away profits just to be sure a customer is never inconvenienced. Fuck the customer and their frequent fickle flying. If you buy #2 you better be ready to eat some #2. That's what you get for burning a thousand hours of jet fuel in an hour because it's half the price of buying a tank of gas and driving all day to travel the same distance. Let's face it, you can just go fuck off until Greyhound or Amtrak is a better alternative. The fact is that they really don't need to worry about it, because it costs less to piss off customers, which is always, than it does to be certain that their customers are always happy, which is never. As long as they are number 2 in the duopoly, then we can just eat shit when the gap occurs. There are a lot of gaps in their design and implementation. They have no intention of filling them all. You just better hope you get lucky and don't need to travel when the dice crap out. But let's be honest, a power failure in summer in Atlanta is an event that you can COUNT ON happening. So clearly, this is how the system is designed to function in this exact situation. What makes you think that kind of design concept is an accident or unforeseen when clearly it's a calculated risk?
    They mitigate bad luck with bullshit to deny other future failures are guaranteed in the likely event of unusual dice. It's inevitable, but it's hard to predict exactly when they will screw over the customer next, but you can be sure it will happen again. They will act as surprised as the customer when it happens, and everyone can go on with their pretense, and deny any fault because we allow fate to decide when we win or lose the bet.

    1. Re:Delta is number TWO in more ways than ONE by lifeisshort · · Score: 1

      As you may be aware, hitting this large key called 'enter' can split text into things called 'paragraphs', thus making a post a bit easier to read. Just saying.

    2. Re:Delta is number TWO in more ways than ONE by Bob_Who · · Score: 1

      You're right, it's word salad on a cereal box.

      It's like reading the ingredients of Lucky Charms while eating Captain Crunch - totally Gertrude Stein.

      I've been having flash backs....

  56. IoT by Anonymous Coward · · Score: 0

    My mechanical toilet is working well.

    Perhaps Delta should send over their Crack-Team of Engineers and Scientists to figure out how it works.

    Ha ha

  57. Failed switch gear by Anonymous Coward · · Score: 0

    "A spokesman for [Georgia power] said the problem for Delta was a failed “switch gear” and that Georgia Power sent workers to assist the airline early Monday morning." from the WSJ

  58. Re:Shouldn't have upgraded to W10 ! by Kreplock · · Score: 2

    Somebody has lots of 'splaining to do, surely. Power up the deflectors.

  59. Re: Shouldn't have upgraded to W10 ! by buchanmilne · · Score: 1

    (which should not happen as you should have backup generators and batteries that give power until the generators are spun up)

    We have had outages due to power problems during planned maintenance on the UPS system (which was to allow future UPS maintenance to be done without impact by introducing live switching of UPS between feeds). There was an outage to some systems because they had been upgraded (e.g. routers with line cards added) to the point where one PDU could not supply sufficient power, even though total power was less than "guaranteed". To avoid a recurrence, power supplies were moved to other PDUs.

    So, yes, while power failures shouldn't have impact, even in environments with supposedly robust frameworks (e.g. ITIL), mistakes happen or the impact of a change is not fully identified/understood (possibly due to the complexity of modelling the environment down to which of 8 power supplies on a device are connected to which of 4 PDUs in a cabinet which has 2 different feeds of the 6 feeds available in one DC in a campus with 3 DCs etc. etc.), resulting in unexpected failure modes.
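
    For what it's worth, the single-PDU failure mode above can be checked with a few lines of code. This is only a sketch under stated assumptions: the device names, wattages and 3000 W ratings are made up, `overloaded_pdus` is a hypothetical helper, and each device is assumed to split its draw evenly across whichever of its PDUs are still live.

```python
from collections import defaultdict

def overloaded_pdus(devices, pdu_capacity_w, failed_pdu):
    """Return {pdu: load_w} for every surviving PDU pushed past its capacity.

    devices: list of (name, draw_w, [pdu, ...]) tuples. Assumption: a device
    splits its draw evenly across the PDUs it is wired to that are still live.
    """
    load = defaultdict(float)
    for name, draw_w, pdus in devices:
        live = [p for p in pdus if p != failed_pdu]
        if not live:
            continue  # device is simply dark; it adds no load anywhere
        for p in live:
            load[p] += draw_w / len(live)
    return {p: w for p, w in load.items() if w > pdu_capacity_w[p]}

# Hypothetical cabinet: two PDUs rated 3000 W each (6000 W combined).
capacity = {"pdu-a": 3000, "pdu-b": 3000}
devices = [
    ("router-1", 2200, ["pdu-a", "pdu-b"]),  # grew after line cards were added
    ("switch-1", 600, ["pdu-a", "pdu-b"]),
    ("server-1", 500, ["pdu-a", "pdu-b"]),
]

print(overloaded_pdus(devices, capacity, failed_pdu=None))     # both PDUs live
print(overloaded_pdus(devices, capacity, failed_pdu="pdu-a"))  # pdu-b carries everything
```

    With both PDUs live nothing is flagged, because the 3300 W total is well under the combined rating. Take pdu-a away and pdu-b is asked for all 3300 W against a 3000 W rating, which is the kind of thing nobody notices until maintenance day.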

    (Apologies in advance, still no preview on the "mobile" interface).

  60. Re: Shouldn't have upgraded to W10 ! by buchanmilne · · Score: 1

    Didn't they have a monthly or quarterly "mains fail test"? Our environmental team's performance contracts require this ...

  61. Re:Shouldn't have upgraded to W10 ! by NatasRevol · · Score: 1

    The talks have already happened. It's 'are they documented' or did upper mgmt just mention it in the hallway that there's no funding for backups so they couldn't be held accountable.

    --
    There are two types of people in the world: Those who crave closure
  62. D E L T A by sexconker · · Score: 1

    Doesn't
    Ever
    Leave
    The
    Airport

    1. Re:D E L T A by ThatsMyNick · · Score: 1

      Except, you know, all the time it does.

  63. Re:Shouldn't have upgraded to W10 ! by JackieBrown · · Score: 1

    It's 'are they documented' or did upper mgmt just mention it in the hallway that there's no funding for backups so they couldn't be held accountable.

    Unless IT asked for the stuff they needed in the hallways, there is documentation at least that IT was trying to prepare for this. If IT asked for this in that hallway and was fine with a no in that hallway, the IT folks didn't really think it was important.

    Heck, I learned pretty low in the chain and at a very early age to ask for things in writing that went against what I thought was correct - only took a few times being burned to learn that. And also to know which battles were worth fighting and which weren't (this would clearly land in the former category - at the very least above the "was told no in the hallway" category).

  64. Re:Shouldn't have upgraded to W10 ! by swillden · · Score: 1

    Even IF one of your data centers has a power outage (which should not happen as you should have backup generators and batteries that give power until the generators are spun up), you should always have at least ONE other backup data center to take over if something really fails for you.

    FWIW, Google's standard -- a mantra which Google SREs have pounded into my head -- is "n + 2". You don't have a reliable system unless you have enough capacity to operate it when you lose two of your components (the definition of "component" here is context-dependent; they're whatever your points of failure are). Why do you need two extras, rather than just one? Because inevitably there will be some time you have to take one of them down for maintenance or upgrade or something. If you only have "n + 1", then during that window of time you're down to "n", meaning exactly the capacity you need to handle the load... and if something goes wrong you then have "<n", i.e. not enough to operate. OTOH, if you have more than "n + 2", and the individual systems are reasonably reliable, then you're probably wasting resources.

    I suppose at the DC level "n + 1" is probably adequate if your other processes are structured so that you never take an entire DC offline intentionally.

    I think this is a good philosophy for anyone who is operating a piece of critical, bet-your-business computing infrastructure. Like, say, the database that allows one of the two or three largest airlines in the world to fly.
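
    The arithmetic behind "n + 2" is simple enough to sketch. This is an illustration only: `can_serve` is a made-up helper, n = 4 is an arbitrary number, and all units are assumed to be interchangeable and equal in capacity.

```python
def can_serve(units, needed, failed=0, in_maintenance=0):
    """True if the units still running cover the `needed` units of load."""
    return units - failed - in_maintenance >= needed

n = 4  # hypothetical: units required to carry peak load

for spare in (1, 2):
    ok = can_serve(n + spare, n, failed=1, in_maintenance=1)
    print(f"n + {spare}: survives one failure during maintenance? {ok}")
```

    With one unit down for maintenance and one unplanned failure, "n + 1" leaves n - 1 units running, which is not enough, while "n + 2" leaves exactly n.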

    --
    Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  65. Re:Shouldn't have upgraded to W10 ! by Anonymous Coward · · Score: 0

    If you have lower priority processes that can tolerate being offline for the duration of a planned outage, you only really need something more like "n + 1.5." The .5 only has to be able to handle the critical work for long enough to get your full backup system back online.

  66. Re: Shouldn't have upgraded to W10 ! by Anonymous Coward · · Score: 0

    A transfer switch is almost always a single point of failure in a power system. It is the one place where primary power, backup power and the load come together.

  67. your excel is broken by lucm · · Score: 1

    On one hand, airlines are not swimming in cash so everything requires a tedious business case.

    On the contrary, after going through bankruptcies in recent years and shedding debt, pensions, etc., plus with the current low fuel prices, most airlines are currently swimming in cash.

    What are you talking about? Those companies are public; it's easy to see the numbers. Stop making shit up based on your faulty guesswork.

    Delta has a book value of *negative* 3 billion, cash flow down 115 million last year alone. American Airlines is also in dark red, book value negative 700 million, cash flow down 600 million. Those are not companies where you can easily get a budget upgrade.

    --
    lucm, indeed.
  68. Re:Shouldn't have upgraded to W10 ! by swillden · · Score: 1

    If you have lower priority processes that can tolerate being offline for the duration of a planned outage, you only really need something more like "n + 1.5." The .5 only has to be able to handle the critical work for long enough to get your full backup system back online.

    Absolutely. "n + 2" is a rule of thumb for critical systems... and it's also just a starting point. Thinking hard about your system may point out that you need even more, or maybe that you can get away with a little less. The rule in Google is that n + 2 is the default and then you can make arguments about why you need more or less.

    --
    Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  69. They got hacked - chances are. by martinfb · · Score: 1

    Chances are that they got hacked.

    --


    Self-importance and self-indulgence is the root of ALL evil.
    1. Re:They got hacked - chances are. by Anonymous Coward · · Score: 0

      NSA false flag / test of the emergency broadcast system this is only a test.

  70. Re:Shouldn't have upgraded to W10 ! by wyHunter · · Score: 1

    The IT staff? Or the management? "Oh I saved 10m by not having a backup data center." "Here's your 10K bonus!"

  71. Georgia power denies outage by Chewbacon · · Score: 1

    They say it simply didn't happen. Numerous analyst articles say Delta, like many other airlines, understandably operates aging and complex systems. Their system just had enough.

    --
    Chewbacon
    The Bible is like Wikipedia: written by a bunch of people and verifiable by questionable sources.
  72. Re:That's about $100 Million per day in lost reven by ebvwfbw · · Score: 1

    Things can still go wrong. In NYC back in the 1990s, they had a 911 test. Con-Ed turned off power to the 911 center. No problem, they have a big diesel to kick in to keep their big mother IBM mainframes running. Well that diesel was in pieces on the floor because it had hit max time and needed to be rebuilt. No problem, it had a failover to another big mother diesel, which fired right up and took the load. No problem. The problem is that it only had about 5 minutes worth of diesel. They hadn't moved the diesel feed to the other engine. Down she went! It took about 45 minutes to get the system back running again. In that time they had a number of heart attacks and if memory serves me, at least one guy died that they think otherwise would have survived.

    I'm seeing a lot more stupid stuff happening. Cloud, big centers, etc. They put all their systems on a SAN. I've seen a relatively simple SAN problem take a few hundred machines out. Supposedly - that can't happen. Well it did and does. I've also seen a technician delete a whole rack worth of storage with one mouse click. Then there is management software. Now instead of screwing up just one machine, we can do a few thousand at a pop.

    Now customers want to use a Software Defined Data Center (SDDC). Probably short skirt sales - Everything is controlled by it, SAN, network, routing, switches, VPN, blades... whole Shebang! What could go wrong? Seems like the word clusterfuck was made for this.

  73. Wonder if they were using VMWare VSAN? by jbgeek · · Score: 1

    Grumble.

  74. No need for 2x (was: Re:Incompetent IT) by dgallard · · Score: 1

    An anonymous coward stated:

    > First off you need a minimum of 2x the floor space in a min 2 different geographic locations.
    > Second you need a min 2x the hardware at both locations. blah blah blah
    > You need 2 x the number of blah blah blah
    > Blah blah blah

    Today you can do DR (Disaster Recovery) in AWS or other cloud infrastructure without needing 2x blah blah blah.

    You do need 2x for *just* the database that stores truth and keeps it redundant synchronously or, in this case, near-synchronously, which is probably good enough (OK to lose a few hundred or even a thousand transactions, I would guess; just NOT OK to lose the entire system for a day. Jeeeesh.).
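
    A back-of-the-envelope sketch of what "near synchronous" puts at risk. The transaction rate and replication lag below are invented for illustration and are not Delta's numbers; `transactions_at_risk` is a hypothetical helper.

```python
def transactions_at_risk(tx_per_second, replication_lag_seconds):
    """Upper bound on committed transactions not yet copied to the replica."""
    return tx_per_second * replication_lag_seconds

# Hypothetical figures: 200 tx/s with the replica running 5 seconds behind.
print(transactions_at_risk(200, 5))  # 1000
```

    Losing that many bookings hurts, but it can be reconciled by hand, which is a very different problem from losing the whole system for a day.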

    Almost all other systems can stay quiescent and not use actual cycles or energy until needed for recovery.

    -- Dennis Allard

  75. Re: For those claiming bad managers and saving mon by ryanov · · Score: 1

    I know someone who flew home on Monday. They left on Monday and got home on Monday. Shit happens sometimes, but this wasn't a catastrophe even for everyone that was affected.