Slashdot Mirror


Risk Management - A Cautionary Tale

Mr. Ghost writes "By now many people have heard about the fiasco and financial blunder Comair had over the 2004 Christmas holiday. An article on CIO provides a timeline of the decisions that led up to the system failure costing the division of Delta Airlines $20 million. The article points out the need for proper risk management and what can occur when a risk analysis is not performed or is ignored. It goes on to mention that although this was a very public failure, this type of system failure can occur in other companies." From the article: "The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events..."

21 of 203 comments

  1. Why didn't the CIO yell louder? by winkydink · · Score: 5, Insightful

    Yes, senior management was distracted, but it's the CIO's job to warn senior management and the board about risks to the business as well as their likelihood of happening.

    --

    "I'd rather be a lightning rod than a seismometer." -Ken Kesey

    1. Re:Why didn't the CIO yell louder? by Knara · · Score: 4, Insightful

      I'd agree, but the fact that it was written in FORTRAN and they didn't have a single maintenance developer (even if it wasn't that developer's primary role) assigned to it that *knew* FORTRAN suggests that there was a whole lot of "buhhhhhh??" going on in that particular IT department.

  2. Interesting Technical Detail ... by rewinn · · Score: 3, Insightful

    From the article:

    As it turned out, the crew management application, unbeknownst to anyone at Comair, could process only a set number of changes--32,000 per month--before shutting down.

    Sounds like some sort of overflow problem. Hmmm....

    The big issue is, of course, the business units and IT playing "After you, Alfonse..." but it's fun to seek out the pebble that set off the avalanche.
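A minimal sketch of the kind of overflow being guessed at here. The article gives only the 32,000-per-month figure; the signed 16-bit counter is an assumption made in this thread, and `record_change` is a hypothetical name used purely for illustration.

```python
import ctypes

INT16_MAX = 2**15 - 1  # 32767, the largest value a signed 16-bit integer can hold


def record_change(counter: ctypes.c_int16) -> int:
    """Increment a 16-bit signed counter the way fixed-width storage would."""
    # ctypes performs no overflow checking, so the value wraps silently.
    counter.value += 1
    return counter.value


changes_this_month = ctypes.c_int16(INT16_MAX - 1)
print(record_change(changes_this_month))  # 32767, still fine
print(record_change(changes_this_month))  # -32768, the counter has wrapped
```

Nothing in the increment itself signals a problem; whatever reads the counter next simply sees a negative number, which is one plausible way a system ends up "shutting down" at a limit nobody documented.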

    1. Re:Interesting Technical Detail ... by josecanuc · · Score: 4, Insightful

      Exactly... The article author seems to point to the fact that the software was old and just waiting to die...

      Because of the fact that NO ONE knew of the particular limit that was exceeded, those who were supposed to calculate risk never knew what the tipping point was.

      All they could say was "our software is old, someday it may not work any more, but I cannot say for what reason, because I do not know FORTRAN."

      How the hell can you calculate risk if your only input is the chronological age of a software system?

    2. Re:Interesting Technical Detail ... by BattleTroll · · Score: 4, Insightful
      "How the hell can you calculate risk if your only input is the chronological age of a software system?"

      That wasn't the only input in this case. In fact, you don't have to know the gory details of the implementation to determine risk, just the business impact of a problem to the system.
      • Since no one at the company understood the language used, it stands to reason no one understood what the system was doing. Risk: Medium
      • The system was mission critical to the performance of almost every other function of the airline. If the system was lost, the airline was hosed. Risk: Critical
      • They had no failover plan in place in case the system went down. Risk: High
      • No load tests were possible since they only had the one system in place. Without load testing the only way to find out the system fails under load is to wait until it fails in production. Risk: High
      It stands to reason there were other risks involved that weren't identified in the article.
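The list above amounts to a small risk register built from business impact alone. A rough sketch, assuming a three-level scale taken from the ratings in the comment; the `Risk` and `Severity` types and the `prioritized` helper are hypothetical names, not anything from the article.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Risk:
    description: str
    severity: Severity


# Entries paraphrase the bullets above; the ratings are the commenter's.
RISKS = [
    Risk("No one on staff understands the implementation language", Severity.MEDIUM),
    Risk("System is mission critical to almost every airline function", Severity.CRITICAL),
    Risk("No failover plan if the system goes down", Severity.HIGH),
    Risk("No second system, so no load testing before production", Severity.HIGH),
]


def prioritized(risks):
    """Return risks ordered from most to least severe (ties keep input order)."""
    return sorted(risks, key=lambda r: r.severity, reverse=True)


for risk in prioritized(RISKS):
    print(f"{risk.severity.name:<8} {risk.description}")
```

None of the entries requires reading a line of FORTRAN, which is the point being made: the register is driven by what happens to the business if the system stops, not by how the system is written.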
    3. Re:Interesting Technical Detail ... by Peter La Casse · · Score: 4, Insightful
      "Software does not age."

      Software does age. As a program grows older, people change it, its inputs and how it is used, and the older a program gets, the less the people making the changes are likely to understand it.

      In addition, some bugs don't manifest themselves under usage patterns from 20 years ago, or when the software is run on hardware from 20 years ago, but they do manifest themselves under usage patterns or on hardware that's in use now. The more you change, especially without understanding all of the ramifications of that change, the greater the risk for error.

      That's what software aging is.

    4. Re:Interesting Technical Detail ... by ScuzzMonkey · · Score: 5, Insightful

      "They had no failover plan in place in case the system went down."

      With that, you've hit the heart of the matter, and what the article should have focused on rather than the "old software breaks down" BS. This was a bug which could have hit at ANY time since the software was installed; it was an overflow, not a rusting subroutine that fell off. I can't personally see any way that they could have foreseen this particular problem but when you have a system that is so critical to your operation, you don't look for problems it might have--you look for alternatives to fall back to when it DOES have problems.

      You never see them coming. But you'd better plan for them anyway.

      --
      No relation to Happy Monkey
  3. Re:Yep by airrage · · Score: 4, Insightful

    To me, when you look at code you always want to rewrite it, thinking you could do it better. But if you look at what they had to work with, you realize most coders, at a given time, write pretty good code.

    This software has been working for over 20 years! What will your code look like in 20 years? I doubt it has the same track record. I'm not sure foresight was a problem. I think they did the best they could with language and hardware of the day.

    The Comair meltdown wasn't a software problem if you ask me, it was that the business changed.

    --
    "This isn't a study in computer science, its a study in human behavior"
  4. Re:Yep by EnronHaliburton2004 · · Score: 4, Insightful

    How could nobody in 11 years see that the changes were counted with a 16 bit signed integer?

    If this company was run like a typical big company, somebody DID complain about this 16-bit signed integer. Chances are, they were told to shut up about it and not rock the boat. This frequently happens when someone points out a bug which would require a fundamental change to the system.

    Most companies only like employees who think inside the box, despite telling people to think outside the box.

  5. Crew Scheduling system? How about Aircraft maint by NetNinja · · Score: 4, Insightful

    If the crew scheduling system was as old as the hills, how old is the system used to track aircraft maintenance? Oh wait, that issue will be addressed when we crash an aircraft.

    Maintenance manuals and procedures are written in blood. The next tragedy will be no different.

  6. I told you so ... NOT! by argoff · · Score: 3, Insightful

    It is always easy to say "I told you so" after the fact, but the reality is that this failure has far more to do with the company's attitude about technology than with the failure of somebody to say "look out!". In fact, by the sounds of it, the entire application could probably be run on 2 souped-up PCs running in parallel in different co-locations over the internet - the hardware and infrastructure would not cost a lot.

    Even worse, when these types of failures happen, the ole "policy and procedure" routine kicks in.

    To tell a story, one time I went to a boarding school, and at the beginning of the year they had almost no rules, and then whenever something went wrong they added a new rule. Well, needless to say, at the end of the year there were so many rules, people could get reprimanded for flushing the toilet twice instead of once! Not having their shoes tied left over right, etc .....

    Well I grew up and found the same is true in companies. How much you wanna bet they are gonna lose more than 20 million from too many piled-up policies and procedures that keep anyone from getting anything done?

  7. Risk management by uweg · · Score: 3, Insightful
    Well, the problem starts with being born or getting up in the morning. And a system that has been running for 20 years normally doesn't start to stink by itself.

    OTOH, what does "Risk management" in IT really mean, besides drawing nice PowerPoints and putting a chapter "Risk analysis" into change request forms, that are normally filled in with "No risk, no fun!" or "If I make a very big mistake, it will extinguish mankind"?

  8. A game of Jenga by lake2112 · · Score: 4, Insightful

    Unfortunately, it is commonly seen that upper management abides by the if-it-ain't-broke-don't-fix-it mentality. With many systems there is a huge amount of pressure to fix bugs and outstanding issues; once that is done they work on money-making initiatives. I see it as a game of Jenga. Pieces are removed from the bottom to create a taller structure. Instead of reinforcing the base, there is a constant push to make the tower taller until it comes crashing down.

  9. Re:Yep by Marillion · · Score: 5, Insightful

    First thing: 32767 changes is a lot. A whole f*ck*ng lot. It averages over 1310 changes per day. For a company that flies over 1300 flights a day, it means they averaged a change every flight every day. That's insanely high.

    I'm personally getting sick of people asking about backup systems. It was a problem with the data. Too much of it. Given the safety and government oversight that hinges on this data, you don't mess with it. Any backup system, whether one or one hundred backup systems, when presented with the same data, would also fail.

    The DOT report issued back in March (sorry don't have karma link handy) said neither Comair nor SBS (the closed source vendor that supplied the application) were aware of the limit.

    Eric Bardes (Yes, the one from TFA)

    --
    This is a boring sig
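A quick check of the arithmetic in the comment above. Dividing the limit by the quoted daily average suggests the counter would fill in roughly 25 days; that 25-day figure is derived here from the comment's own numbers and is not stated in the article. `days_until_limit` is a hypothetical helper.

```python
INT16_MAX = 32767        # limit assumed in this thread (signed 16-bit)
CHANGES_PER_DAY = 1310   # average quoted in the comment above


def days_until_limit(limit: int, changes_per_day: int) -> float:
    """Days of sustained activity before the counter reaches its limit."""
    return limit / changes_per_day


print(round(days_until_limit(INT16_MAX, CHANGES_PER_DAY), 2))  # 25.01
```

The same division also illustrates the backup argument: a second copy of the application fed the same stream of changes would reach the same limit on the same day.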
  10. Re:Yep by mankey wanker · · Score: 4, Insightful

    Is there no way to moderate a post simply "odd"?

  11. Old? by Nemi · · Score: 3, Insightful
    Age of the software should make no difference. The problem in this particular case was that the system could only handle 32,000 transactions a month (the programmer obviously used the wrong data type). That could be a problem with software of any age. Age had nothing to do with it failing.

    This article rings more as a sales article than anything else - only it isn't selling anything. Which puts it squarely in the "wtf" category for me.

  12. Some flaws in the article... by CatsupBoy · · Score: 3, Insightful
    Ok, the bottom line, they should have upgraded. Fine, we can all agree on that.

    Now, first the article states:
    [The application] was the only system left that ran on the airline's old IBM AIX platform (all other applications ran on HP Unix).
    First off, an IBM AIX platform can be very new. Just because the application is old and possibly has bugs in it doesn't mean the OS and hardware aren't updated, or that HP Unix is any better.

    Secondly, the following scenario makes perfect business sense:
    SBS came in to make a pitch for its new Maestro crew management software [...] The existing crew management system wasn't exactly elegant, but all the business users had grown adept at operating it, and a great number of Comair's existing business processes had sprung from it.
    The article sets this up as the root of all their problems. Good grief!!! Don't waste resources on an inferior product, for goodness' sake! If the product doesn't perform any better, and there are no known issues with the current product, forget it, it's a waste of money.

    Then a series of unfortunate events led to 4 more years of no funding for a replacement product. So what, the business is under a financial crunch, why go back and fix something that isn't broken (that they know of)? The business still needs to survive, doesn't it? I'm guessing they maintained the hardware and OS, otherwise we'd be here talking about how stupid they were for not updating maintenance contracts.
  13. Re:software decays by adjuster · · Score: 3, Insightful

    But the environment in which the software operates changes, and that change is analogous to weather corroding a piece of physical equipment. Every time the OS gets a patch, the filesystem changes, a shared library is upgraded, the underlying hardware changes, there's a chance of triggering a failure in the software.

    It's rather sad, to me, that we design these wonderful machines that can perform logical operations in great quantities with a high degree of repeatability and low occurrence of failure, then create a culture around them that encourages sloppiness, and ultimately introduces a large measure of uncertainty into the operation of these machines. I am baffled at the perverse desire-- nay need-- that people seem to have to make software suffer from entropy.

    The only "decay" in software should happen as a result of changing business requirements. There's no reason, provided the business requirements don't change, that a well designed and properly implemented piece of software should not be usable in perpetuity. There may be changes in the underlying hardware and operating system software, but provided that the application is sufficiently abstracted from the underlying platform (or, provided that an emulation-layer for the original platform can be constructed) there's no reason other than changing business requirements for software to be "thrown away".

    Let's put this a different way: How does a patch to the underlying operating system cause an application to fail? If the patch changes the behaviour of the underlying operating system in such a manner as to return unexpected values to the application, the patch is the cause of the failure. A flawed patch doesn't make an application "age" or "decay"-- it's simply a flawed patch. An application has to make assumptions about the underlying operating system. These assumptions are based on the API documentation-- the contract between the operating system and the application. When the OS violates the terms of the contract, that doesn't mean the application "decayed"-- it means some moron who coded the operating system patch messed up, and the operating system manufacturer/maintainer didn't perform good regression testing.

    We should be designing software systems with 10 to 20 year usability goals. It would do a lot for the frustration level that the "suits" have with IT if we stopped being proponents of hugely expensive but "throwaway" systems, and started designing systems with an eye for longevity.

    --
    The Attitude Adjuster, I hate me, you can too.
  14. Re:software decays by hawaiian717 · · Score: 4, Insightful
    The only "decay" in software should happen as a result of changing business requirements.

    Exactly. This software would have failed the month after it was installed if Comair had needed to do 32,001 changes in that month. But when it was installed, Comair wasn't that big, so having to do that many changes was not something that was considered. Now that Comair has grown considerably, the business requirement has changed but the application has not kept up.

    --
    End of Line.
  15. Re:Yep by Shotgun · · Score: 5, Insightful

    I worked for IBM, coding in the mainframe networking department. Their motto should have been, "Don't change anything...it's working."

    I got irritated. I would find stuff that was just STUPID. Horrendously mangled logic. Algorithms from other parts of the code applied completely wrong. Whenever I tried to improve the code I got the "It's working. Don't change anything" line. I left, determined to find a job where I could actually write code.

    That was several years ago. I've gotten smarter since. I've worked on several large-scale, 5-9's systems. After several major and minor fuck-ups, now I know....

    If it's working, don't change anything.

    --
    Aah, change is good. -- Rafiki
    Yeah, but it ain't easy. -- Simba
  16. You're too young to understand by Anonymous Coward · · Score: 3, Insightful

    These systems were written when a computer had maybe 4M of main memory. So if you double the size of your counter, that means you can hold....1/2 as many events.

    So as a programmer, you make a choice. You either make the counter smaller, or you limit the system in some other way.

    Computers today have 3 orders of magnitude more memory, and the choice between a short and a long is easy to make. But back then, it wasn't.

    To help you understand, if a programmer from that era used a long int, he'd better have a damned good reason. Although, he should have made it an unsigned int and got double the range. See? You're not old enough to feel in your gut the need to save *BITS*.

    Back when I learned to code in the late 70's, we used assembler (BAL 360), and we saved space by making all numbers packed and then stripped off the sign byte. You did a MVO to the same memory location, and it had a side effect of shifting the packed number one nibble (1/2 byte) to the right, erasing the sign bit. We did that because a 40M disk pack on an IBM 370/148 cost about $40,000 and we couldn't waste it. Now I have a thumb drive with 1G on it. You just don't understand.
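The short-versus-long trade-off described in this comment can be put in rough numbers. This is only an illustration: it assumes 2-byte and 4-byte integers, and pretends the whole 4M of memory is filled with nothing but counters, which no real system would do. `counters_that_fit` is a hypothetical helper.

```python
import struct

MEMORY_BYTES = 4 * 1024 * 1024  # the "maybe 4M" of main memory from the comment

# "<" selects the struct module's standard sizes, independent of the host platform.
SHORT_BYTES = struct.calcsize("<h")  # 2 bytes
LONG_BYTES = struct.calcsize("<l")   # 4 bytes


def counters_that_fit(width_bytes: int, memory_bytes: int = MEMORY_BYTES) -> int:
    """How many fixed-width counters fit in the given amount of memory."""
    return memory_bytes // width_bytes


print(counters_that_fit(SHORT_BYTES))  # 2097152
print(counters_that_fit(LONG_BYTES))   # 1048576, half as many

SIGNED_SHORT_MAX = 2**15 - 1    # 32767
UNSIGNED_SHORT_MAX = 2**16 - 1  # 65535, the range gained by dropping the sign
print(SIGNED_SHORT_MAX, UNSIGNED_SHORT_MAX)
```

Doubling the counter width halves how many values fit in the same memory, while switching from signed to unsigned roughly doubles the usable range at no cost in space, which is the alternative the comment suggests.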