Slashdot Mirror


Risk Management - A Cautionary Tale

Mr. Ghost writes "By now many people have heard about the fiasco and financial blunder Comair had over the 2004 Christmas holiday. An article on CIO provides a timeline of the decisions that led up to the system failure costing the division of Delta Airlines $20 million. The article points out the need for proper risk management and what can occur when a risk analysis is not performed or ignored. It goes on to mention that although this was a very public failure, this type of system failure can occur in other companies." From the article: "The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events..."

23 of 203 comments

  1. Why didn't the CIO yell louder? by winkydink · · Score: 5, Insightful

    Yes, senior management was distracted, but it's the CIO's job to warn senior management and the board about risks to the business as well as their likelihood of happening.

    --

    "I'd rather be a lightning rod than a seismometer." -Ken Kesey

    1. Re:Why didn't the CIO yell louder? by Knara · · Score: 4, Insightful

      I'd agree, but the fact that it was written in FORTRAN and they didn't have a single maintenance developer (even if it wasn't that developer's primary role) assigned to it who *knew* FORTRAN suggests that there was a whole lot of "buhhhhhh??" going on in that particular IT department.

  2. Article text by daVinci1980 · · Score: 4, Informative

    Site is already sluggish.

    Bound To Fail
    The crash of a critical legacy system at Comair is a classic risk management mistake that cost the airline $20 million and badly damaged its reputation.
    BY STEPHANIE OVERBY

    When Eric Bardes joined the Comair IT department in 1997, one of the very first meetings he attended was called to address the replacement of an aging legacy system the regional airline utilized to manage flight crews. The application, from SBS International, was one of the oldest in the company (11 years old at the time), was written in Fortran (which no one at Comair was fluent in) and was the only system left that ran on the airline's old IBM AIX platform (all other applications ran on HP Unix).

    SBS came in to make a pitch for its new Maestro crew management software. One of the flight crew supervisors at the meeting had used Maestro, a first-generation Windows application, at a previous job. He found it clumsy, to put it kindly. "He said he wouldn't wish the application on his worst enemy," Bardes recalls. The existing crew management system wasn't exactly elegant, but all the business users had grown adept at operating it, and a great number of Comair's existing business processes had sprung from it. The consensus at the meeting was that if Comair was going to shoulder the expense of replacing the old crew management system, it should wait for a more satisfactory substitute to come along.

    And wait they did. The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events: managing the approach of Y2K, the purchase of the independent carrier by Delta in 2000, a pilot strike that grounded the airline in 2001, and finally, 9/11 and the ensuing downturn that ravaged the airline industry.

    A replacement system from Sabre Airline Solutions was finally approved last year, but the switch didn't happen soon enough. Over the holidays, the legacy system failed, bringing down the entire airline, canceling or delaying 3,900 flights, and stranding nearly 200,000 passengers. The network crash cost Comair and its parent company, Delta Air Lines, $20 million, damaged the airline's reputation and prompted an investigation by the Department of Transportation.

    Chances are, the whole mess could have been avoided if Comair or Delta had done a comprehensive analysis of the risk that this critical system posed to the airline's daily operations and had taken steps to mitigate that risk. But a look inside Comair reveals that senior executives there did not consider a replacement system an urgent priority, and IT did little to disrupt that sense of complacency. Though everyone seemed to know that there was a need to deal with the aging applications and architecture that supported the growing regional carrier--and the company even created a five-year strategic plan for just that purpose--a lack of urgency prevailed.

    After the acquisition by Delta, former employees say Comair IT executives didn't do the kind of thorough management analysis that might have persuaded the parent airline to invest in a replacement system before it was too late. Instead, Delta kept a lid on capital expenditures at Comair, with unfortunate consequences. The failure of the almost 20-year-old scheduling system not only saddled Delta with a plethora of customer service and financial headaches that the airline could ill afford but it also provides a cautionary tale for any company that thinks it can operate on its legacy systems for just...one...more...day.

    The five-year plan that wasn't
    Today, Cincinnati-based Comair is a regional airline that operates in 117 cities and carries about 30,000 passengers on 1,130 flights a day, with three or four crew members on each. But back in 1984, when Jim Dublikar joined the company as director of finance and risk management, Comair had

    --
    I currently have no clever signature witticism to add here.
  3. Blowing smoke up your donkey by Anonymous Coward · · Score: 5, Funny

    --------------------- Cut Here ---------------------
    Posts above this line have not RTFA.

    1. Re:Blowing smoke up your donkey by Anonymous Coward · · Score: 5, Funny

      Posts under this line haven't RTFA either.
      --------------------- Cut Here ---------------------

  4. Why did this system fail? by hellfire · · Score: 4, Interesting

    Okay, like many slashdotters, I have a short attention span and I don't remember this "public" story about Comair committing this blunder.

    I have a real question. Why did Comair's system fail in the first place? Was it due to a design flaw requiring its replacement in 2004? Was it an irreplaceable piece of hardware which died?

    The Article smacks of FUD, only because systems fail for a reason. The article conveniently leaves out the reason for the failure. I think this is critical to any risk analysis. For example, if I have a 20 year old system that I can't get parts for, that's a high risk system. However, if I can get parts for a 20 year old system, then the risk is lower.

    I don't like the idea of making assumptions that just because a system is 20 years old, that it absolutely must be replaced. I also don't like the assumption in the article that I already know the facts, so here's the analysis for you. I want the facts to back it up so I can come to my own conclusion.

    --

    "All great wisdom is contained in .signature files"

    1. Re:Why did this system fail? by Jayfar · · Score: 4, Informative
      The article conveniently leaves out the reason for the failure.

      No, the article conveniently explained that the sw had a limit of 32000 schedule changes per month. A severe winter storm necessitated enough changes to make the system fall over.

  5. Re:Yep by airrage · · Score: 4, Insightful

    To me, when you look at code you always want to rewrite it, thinking you could do it better. But if you look at what they had to work with, you realize most coders, at any given time, write pretty good code.

    This software has been working for over 20 years! What will your code look like in 20 years? I doubt it has the same track record. I'm not sure foresight was a problem. I think they did the best they could with language and hardware of the day.

    The Comair meltdown wasn't a software problem if you ask me; it was that the business changed.

    --
    "This isn't a study in computer science, it's a study in human behavior"
  6. Re:Yep by EnronHaliburton2004 · · Score: 4, Insightful

    How could nobody in 11 years see that the changes were counted with a 16 bit signed integer?

    If this company was run by a typical big company, somebody DID complain about this 16-bit signed integer. Chances are, they were told to shut up about it and not rock the boat. This frequently happens when someone points out a bug which would require a fundamental change to the system.

    Most companies only like employees who think inside the box, despite telling people to think outside the box.
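
    A minimal sketch of the overflow being described, assuming the change counter behaved like a plain signed 16-bit integer. The function names are hypothetical and nothing here is taken from the actual SBS application, which was written in Fortran and is not shown in this thread; Python is used only for illustration.

```python
INT16_MIN = -32768
INT16_MAX = 32767  # largest value a signed 16-bit integer can hold


def wrap_int16(value: int) -> int:
    """Reduce an integer to the range of a signed 16-bit two's-complement value."""
    return (value - INT16_MIN) % 65536 + INT16_MIN


def count_changes_unchecked(changes: int) -> int:
    """Count changes one at a time in a simulated 16-bit counter with no overflow check."""
    counter = 0
    for _ in range(changes):
        counter = wrap_int16(counter + 1)
    return counter


def count_changes_checked(changes: int) -> int:
    """Same counter, but fail loudly at the limit instead of wrapping silently."""
    counter = 0
    for _ in range(changes):
        if counter == INT16_MAX:
            raise OverflowError(f"change counter is already at {INT16_MAX}")
        counter += 1
    return counter
```

    Under these assumptions the unchecked counter holds 32767 after 32,767 changes and wraps to a negative value on the next one. How the real system reacted at its limit is not described in this thread.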

  7. Risk Management is Complex by justanyone · · Score: 4, Interesting


    I used to work in the Risk Management department of the capital markets division of a large international bank as a programmer.

    When I started, 4 years ago, the reports generated were basically compilations by a cut-and-paste-monkey staff (despite being highly trained, very conscientious individuals) of reports generated by other departments. I was part of a team that reformed the IT basis for creating risk reporting, and found that while there was a lot of expertise and complex methods available, what was actually implemented was much, much smaller, for the simple reason that it was tough to get the right reports generated given the inputs the department was given.

    The project I worked on parsed the input data from the Excel spreadsheet inputs and loaded it to a database, where it could then be queried intelligently and nice reports generated. These reports were growing very fast in complexity, building towards the best toolsets available for determining the actual risk the bank was taking.

    Several points about this job were fascinating:
    1. How many departments are so caught up in the minutiae of "getting the report out" that they don't have time to examine the contents of it;
    2. How much money can be made by knowing what the actual risk is. If you don't know the risk, you estimate high, and put lots of dollars in a reserve account. If you do know the risk accurately, you usually can greatly lower reserves to accurately meet even very bad case estimated losses, and use the rest of the money to fund interest-generating ventures.
    3. How much the banking consolidation trend is increasing, due to the repeal of Glass-Steagall allowing multi-state banks to gobble and grow. This makes a consumer's life better because of more resources being available (auto-bill-pay, check images, etc.).

    It was a fun job. Then I found another one where I get to play with Python!

    -- Kevin

  8. software decays by ecklesweb · · Score: 4, Interesting

    One of the interesting quotes from the article:

    Unfortunately, you can't see a crew management system age the way you can see an airplane rust. But they do.

    I find that an interesting if not slightly obvious insight. The interesting part is that you can know that software is decaying, but I don't know of any effective way to measure that decay. I don't even know of any particularly good ways to characterize the decay. It's not as if new defects are being introduced into code that's not changing. But the environment in which the software operates changes, and that change is analogous to weather corroding a piece of physical equipment. Every time the OS gets a patch, the filesystem changes, a shared library is upgraded, or the underlying hardware changes, there's a chance of triggering a failure in the software.

    Can it be proven, or should we otherwise reasonably believe, that the probability of catastrophic system failure approaches 1 as the age of the system increases? Maybe a good topic for a research paper...
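
    One way to frame the question, as a sketch only: if every period (say, every month) carries some fixed, independent chance p > 0 that an environmental change triggers a fatal failure, then the chance of at least one failure within n periods is 1 - (1 - p)^n, which tends to 1 as n grows. The constant p and the independence between periods are both assumptions made here for illustration; real systems need not behave this way, and the function name is hypothetical.

```python
def cumulative_failure_probability(p_per_period: float, periods: int) -> float:
    """Probability of at least one failure in `periods` independent periods,
    each with the same failure probability `p_per_period` (an assumed model)."""
    if not 0.0 <= p_per_period <= 1.0:
        raise ValueError("p_per_period must be between 0 and 1")
    if periods < 0:
        raise ValueError("periods must be non-negative")
    # Surviving every period has probability (1 - p)^n; failure is the complement.
    return 1.0 - (1.0 - p_per_period) ** periods
```

    Under this model the answer is yes for any p above zero, however small, and no when p is exactly zero. The hard part is the one the post raises: nobody knows how to measure p.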

    1. Re:software decays by hawaiian717 · · Score: 4, Insightful
      The only "decay" in software should happen as a result of changing business requirements.

      Exactly. This software would have failed the month after it was installed if Comair had needed to do 32,001 changes in that month. But when it was installed, Comair wasn't that big, so having to do that many changes was not something that was considered. Now that Comair has grown considerably, the business requirement has changed but the application has not kept up.

      --
      End of Line.
  9. /.ed by christoofar · · Score: 4, Funny

    Wow. Looks like even the mag for CIOs can't keep up with a /. DDoS attack. Maybe the CIO for CIO should be fired?

  10. Crew Scheduling system? How about Aircraft maint by NetNinja · · Score: 4, Insightful

    If the crew scheduling system was old as the hills how old is the system used to track aircraft maintenance? Oh wait that issue will be addressed when we crash an aircraft.

    Maintenance manuals and procedures are written in blood. The next tragedy will be no different.

  11. Re:Yep by tomhudson · · Score: 4, Funny
    Hindsight is 20/20.
    You mean like this story (the lesson being that what seems like a good thing at the time can become an unmitigated disaster):
    I like monkeys.

    The pet store was selling them for five cents a piece. I thought that
    odd since they were normally a couple thousand each. I decided not to
    look a gift horse in the mouth. I bought 200. I like monkeys.

    I took my 200 monkeys home. I have a big car. I let one drive. His
    name was Sigmund. He was retarded. In fact, none of them were really
    bright. They kept punching themselves in their genitals. I laughed.
    Then they punched my genitals. I stopped laughing.

    I herded them into my room. They didn't adapt very well to their new
    environment. They would screech, hurl themselves off of the couch at
    high speeds and slam into the wall. Although humorous at first, the
    spectacle lost its novelty halfway into its third hour.

    Two hours later I found out why all the monkeys were so inexpensive:
    they all died. No apparent reason. They all just sorta' dropped dead.
    Kinda' like when you buy a goldfish and it dies five hours later. Damn
    cheap monkeys.

    I didn't know what to do. There were 200 dead monkeys lying all over my
    room, on the bed, in the dresser, hanging from my bookcase. It looked
    like I had 200 throw rugs.

    I tried to flush one down the toilet. It didn't work. It got stuck.
    Then I had one dead, wet monkey and 199 dead, dry monkeys.

    I tried pretending that they were just stuffed animals. That worked for
    a while, that is until they began to decompose. It started to smell real
    bad.

    I had to pee but there was a dead monkey in the toilet and I didn't want
    to call the plumber. I was embarrassed.

    I tried to slow down the decomposition by freezing them. Unfortunately
    there was only enough room for two monkeys at a time so I had to change
    them every 30 seconds. I also had to eat all the food in the freezer so
    it didn't all go bad.

    I tried burning them. Little did I know my bed was flammable. I had to
    extinguish the fire.

    Then I had one dead, wet monkey in my toilet, two dead, frozen monkeys in
    my freezer, and 197 dead, charred monkeys in a pile on my bed. The odor
    wasn't improving.

    I became agitated at my inability to dispose of my monkeys and to use the
    bathroom. I severely beat one of my monkeys. I felt better.

    I tried throwing them away but the garbage man said that the city wasn't
    allowed to dispose of charred primates. I told him that I had a wet
    one. He couldn't take that one either. I didn't bother asking about the
    frozen ones.

    I finally arrived at a solution. I gave them out as Christmas gifts. My
    friends didn't know quite what to say. They pretended that they liked
    them but I could tell they were lying. Ingrates. So I punched them in
    the genitals.

    I like monkeys
    Same thing with the code in question. It seemed good when it was written, but it didn't stand the test of time, and ended up with a lot of people getting a swift kick in the you-know-whats.

    Or for another example of hindsight and the law of unanticipated consequences, just sing the first few bars of "Alice's Restaurant".

  12. A game of Jenga by lake2112 · · Score: 4, Insightful

    Unfortunately, upper management commonly abides by the if-it-ain't-broke-don't-fix-it mentality. With many systems there is a huge amount of pressure to fix bugs and outstanding issues; once that is done, they work on money-making initiatives. I see it as a game of Jenga. Pieces are removed from the bottom to create a taller structure. Instead of reinforcing the base, there is a constant push to make the tower taller until it comes crashing down.

  13. Re:Interesting Technical Detail ... by josecanuc · · Score: 4, Insightful

    Exactly... The article author seems to point to the fact that the software was old and just waiting to die...

    Because NO ONE knew of the particular limit that was exceeded, those who were supposed to calculate risk never knew what the tipping point was.

    All they could say was "our software is old, someday it may not work any more, but I cannot say for what reason, because I do not know FORTRAN."

    How the hell can you calculate risk if your only input is the chronological age of a software system?

  14. Re:Yep by Marillion · · Score: 5, Insightful

    First thing: 32767 changes are a lot. A whole f*ck*ng lot. It averages over 1310 changes per day. For a company that flies over 1300 flights a day, it means they averaged a change every flight every day. That's insanely high.
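
    The averages above can be checked with a few lines of arithmetic. This is a sketch: the 25-day window is an assumption (the post does not say how many days the counter had been running), chosen because it reproduces the quoted figure of just over 1310 per day. The function name is hypothetical.

```python
CHANGE_LIMIT = 32767    # largest signed 16-bit value, the limit discussed in this thread
FLIGHTS_PER_DAY = 1300  # figure quoted in the post


def average_changes_per_day(total_changes: int, days: int) -> float:
    """Average number of schedule changes per day over a window of `days` days."""
    return total_changes / days


# Assumed window: about 25 days into the month when the limit was reached.
per_day = average_changes_per_day(CHANGE_LIMIT, 25)

# Roughly one change per flight per day at the quoted flight count.
per_flight = per_day / FLIGHTS_PER_DAY

# Spread over a full 31-day month the same total gives a lower daily average.
per_day_full_month = average_changes_per_day(CHANGE_LIMIT, 31)
```

    With the assumed 25-day window, `per_day` comes out between 1310 and 1311 and `per_flight` just above 1, matching the "a change every flight every day" estimate.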

    I'm personally getting sick of people asking about backup systems. It was a problem with the data. Too much of it. Given the safety and government oversight that hinges on this data, you don't mess with it. Any backup system, whether one or one hundred backup systems, when presented with the same data, would also fail.

    The DOT report issued back in March (sorry don't have karma link handy) said neither Comair nor SBS (the closed source vendor that supplied the application) were aware of the limit.

    Eric Bardes (Yes, the one from TFA)

    --
    This is a boring sig
  15. Re:Yep by mankey+wanker · · Score: 4, Insightful

    Is there no way to moderate a post simply "odd"?

  16. Re:Interesting Technical Detail ... by BattleTroll · · Score: 4, Insightful
    "How the hell can you calculate risk if your only input is the chronological age of a software system?"

    That wasn't the only input in this case. In fact, you don't have to know the gory details of the implementation to determine risk, just the business impact of a problem to the system.
    • Since no one at the company understood the language used, it stands to reason no one understood what the system was doing. Risk: Medium
    • The system was mission critical to the performance of almost every other function of the airline. If the system was lost, the airline was hosed. Risk: Critical
    • They had no failover plan in place in case the system went down. Risk: High
    • No load tests were possible since they only had the one system in place. Without load testing the only way to find out the system fails under load is to wait until it fails in production. Risk: High
    It stands to reason there were other risks involved that weren't identified in the article.
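
    The list above amounts to a small risk register. A minimal sketch of how it could be recorded and ranked follows; the class names, the numeric ordering of the levels, and the "High or worse" threshold are assumptions added for illustration, while the four entries and their ratings are the ones given in the post.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Risk ratings, ordered so that a larger value means a worse risk."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Risk:
    description: str
    level: Level


REGISTER = [
    Risk("No one at the company understands the implementation language", Level.MEDIUM),
    Risk("System is mission critical to almost every other function", Level.CRITICAL),
    Risk("No failover plan if the system goes down", Level.HIGH),
    Risk("No load testing possible with only one system in place", Level.HIGH),
]


def ranked(register: list[Risk]) -> list[Risk]:
    """Return the risks sorted from worst to least severe."""
    return sorted(register, key=lambda risk: risk.level, reverse=True)


def needs_mitigation(register: list[Risk], threshold: Level = Level.HIGH) -> list[Risk]:
    """Return the risks rated at or above `threshold`."""
    return [risk for risk in register if risk.level >= threshold]
```

    None of the entries requires reading the Fortran source, which is the point being made: each rating follows from business impact rather than from implementation detail.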
  17. Re:Interesting Technical Detail ... by Peter+La+Casse · · Score: 4, Insightful
    Software does not age.

    Software does age. As a program grows older, people change it, its inputs and how it is used, and the older a program gets, the less the people making the changes are likely to understand it.

    In addition, some bugs don't manifest themselves under usage patterns from 20 years ago, or when the software is run on hardware from 20 years ago, but they do manifest themselves under usage patterns or on hardware that's in use now. The more you change, especially without understanding all of the ramifications of that change, the greater the risk for error.

    That's what software aging is.

  18. Re:Interesting Technical Detail ... by ScuzzMonkey · · Score: 5, Insightful

    "They had no failover plan in place in case the system went down."

    With that, you've hit the heart of the matter, and what the article should have focused on rather than the "old software breaks down" BS. This was a bug which could have hit at ANY time since the software was installed; it was an overflow, not a rusting subroutine that fell off. I can't personally see any way that they could have foreseen this particular problem but when you have a system that is so critical to your operation, you don't look for problems it might have--you look for alternatives to fall back to when it DOES have problems.

    You never see them coming. But you'd better plan for them anyway.

    --
    No relation to Happy Monkey
  19. Re:Yep by Shotgun · · Score: 5, Insightful

    I worked for IBM, coding in the mainframe networking department. Their motto should have been, "Don't change anything...it's working."

    I got irritated. I would find stuff that was just STUPID. Horrendously mangled logic. Algorithms from other parts of the code applied completely wrong. Whenever I tried to improve the code I got the "It's working. Don't change anything" line. I left, determined to find a job where I could actually write code.

    That was several years ago. I've gotten smarter since. I've worked on several large-scale, 5-9's systems. After several major and minor fuck-ups, now I know....

    If it's working, don't change anything.

    --
    Aah, change is good. -- Rafiki
    Yeah, but it ain't easy. -- Simba