Risk Management - A Cautionary Tale
Mr. Ghost writes "By now many people have heard about the fiasco and financial blunder Comair had over the 2004 Christmas holiday. An article on CIO provides a timeline of the decisions that led up to the system failure costing the division of Delta Airlines $20 million. The article points out the need for proper risk management and what can occur when a risk analysis is not performed or ignored. It goes on to mention that although this was a very public failure, this type of system failure can occur in other companies." From the article: "The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events..."
Hindsight is 20/20.
Yes, senior management was distracted, but it's the CIO's job to warn senior management and the board about risks to the business as well as their liklihood of happening.
"I'd rather be a lightning rod than a seismometer." -Ken Kesey
From the article:
Sounds like some sort of overflow problem. Hmmm....
The big issue is, of course, the business units and IT playing "After you, Alfonse..." but it's fun to seek out the pebble that set off the avalanche.
--- Attorneys Assisting Citizen-Soldiers & Families -
And even though she was screaming from the highest mountain to anyone and everyone that would listen that doom was rushing towards them. That bad, bad things were going to happen. She was still made the sacrificial goat when the fecal material hit the rotating blades.
And this was for a federal agency.
Scary no?
If the crew scheduling system was old as the hills how old is the system used to track aircraft maintenance? Oh wait that issue will be addressed when we crash an aircraft.
Maintenance manuals and procedures are written in blood. The next tragedy will be no different.
It is always easy to say "I told you" so after the fact, but the reality is that this failure has far more to do with the companies attitude about technology than failure of somebody to say "look out!". In fact by the sounds of it, the entire application could probably be ran on 2 souped up PS'c running in parallel in different co-locations over the internet - the hardware and infrastructure would not cost alot.
.....
Even worse, is when these types of failures happen, then comes in the ole "policy and procedure" routine kicks in.
To tell a story, one time I went to a boarding school, and at the beginning of the year they had almost no rules, and then when ever something went wrong they added a new rule. Well needless to say at the end of the year there were so many rules, people could get repramanded for flushing the toilet twice instead of once! Not having their shoes tied left over right, etc
Well I grew up and found the same is true in companies, how much you wanna bet they are gonna loose more than 20 million from too many piled up policy and procedures that keep anyone from getting anything done?
OTOH, what does "Risk management" in IT really mean, besides drawing nice PowerPoints and putting a chapter "Risk analysis" into change request forms, that are normally filled in with "No risk, no fun!" or "If I make a very big mistake, it will extinguish mankind"?
Unfortunately, it is commonly seen that upper management abides by the if-it-aint-broke, dont fix it mentality. With many systems there is a huge amount of pressure to fix bugs/ outstanding issues, once that is done they work on money-making initiatives. I see it as a game of Jenga. Pieces are removed from the bottom, to create a taller structure. Instead of reinforcing the base there is a constant push to make the tower taller until it comes crashing down.
> I don't like the idea of making assumptions that just because a system is 20 years old, that it absolutely must be replaced.
>...if I have a 20 year old system that I can't get parts for, that's a high risk system.
> However, if I can get parts for a 20 year old system, then the risk is lower.
Good points. The article does contain some facts, though. The system was Fortran based, ran only on one aging hardware platform, and no one at Comair knew Fortran. Those are risk factors with older software.
A better lesson than the article's implied "don't use old software" lesson would be, "periodically review legacy systems against changed business conditions and environment. Do not assume that software and hardware will continue to function when the business environment changes significantly."
This article rings more as a sales article than anything else - only it isn't selling anything. Which puts it squarely in the "wtf" category for me.
Now, first the article states: First off, IBM AIX platform can be very new. Just because the application is old and possibly has bugs in it, doesnt mean the OS and hardware inst updated, or that HP Unix is any better.
Secondly, the following scenario makes perfect business sense: The article sets this up as the root of all thier problems. Good grief!!! dont waste resources on an inferior product for goodness sakes! If the product doesnt perform any better, and there are no known issues with the current product, forget it, its a waste of money.
Then a series of unfortunate events lead to 4 more years of no funding for a replacement product. So what, the business is under a financial crunch, why go back and fix something that isnt broken (that they know of)? The business still needs to survive dont they? I'm guessing they maintained the hardware and OS, otherwise we'd be here talking about how stupid they were for not updating maintenance contracts.
But the environment in which the software operates changes, and that change is analagous to weather corroding a pieces of physical equipment. Every time the OS gets a patch, the filesystem changes, a shared library is upgraded, the underlying hardware changes, there's a chance of triggering a failure in the software.
It's rather sad, to me, that we design these wonderful machines that can perform logical operations in great quantities with a high degree of repeatability and low occurance of failure, then create a culture around them that encourages sloppiness, and ultimately introduces a large measure of uncertainty into the operation of these machines. I am baffled at the perverse desire-- nay need-- that people seem to have to make software suffer from entropy.
The only "decay" in software should happen as a result of changing business requirements. There's no reason that, provided the business requirements don't change, that a well designed and properly implemented piece of software should not be usable in perpetuity. There may be changes in the underlying hardware and operating system software, but provided that the application is sufficiently abstracted from the underlying platform (or, provided that an emulation-layer for the original platform can be constructed) there's no reason other than changing business requirements for software to be "thrown away".
Let's put this a different way: How does a patch to the underlying operating system cause an application to fail? If the patch changes the behaviour of the underlying operating system in such a manner as to return unexepected values to the application, the patch is the cause of the failure. A flawed patch doesn't make an application "age" or "decay"-- it's simply a flawed patch. An application has to make assumptions about the underlying operating system. These assumptions are based on the API documentation-- the contact between the operating system and the application. When the OS violates the terms of the contract, that doesn't mean the application "decayed"-- it means some moron who coded the operating system patch messed up, and the operating system manufacturer/maintainer didn't perform good regression testing.
We should be designing software systems with 10 to 20 year usability goals. It would do a lot for the frustration level that the "suits" have with IT if we stopped being proponents of hugely expensive but "throwaway" systems, and started designing systems with an eye for longevity.
The Attitude Adjuster, I hate me, you can too.
IT people know that technology will be obsolete in a short time but most business people always see technology as flashy cost reducers and they never plan on retiring the systems from the get-go. It's an annoyance but it is not suprising in an industry where duct taping old systems is preferred over structural improvements through architecture.
Exactly. This software would have failed the month after it was installed if Comair had needed to do 32,001 changes in that month. But when it was installed, Comair wasn't that big, so having to do that many changes was not something that was considered. Now that Comair has grown considerably, the business requirement has changed but the application has not kept up.
End of Line.
They've also historically had fairly large IT shops. That has given them a lot of time and manpower over the past four decades to write custom software for themselves, and that has resulted in many unique airline-specific systems, sometimes running on interesting combinations of hardware.
One of the main problems with a "central IT shop" for the airlines is the fact that, operationally, each airline is somewhat unique in terms of the internal operational procedures they use, and many of the software applications at each airline are very tightly tied to that airline's own local set of procedures and business rules.
I worked for ten years at Northwest Airlines on a flight operations system that was originally written at United Airlines in the mid-1960's, and we had to make a lot of fundamental changes to displays and other things so our pilots and flight dispatchers could use their own in-house terminology, and so that the software would match the largely paper-driven procedures that it was replacing.
Even were the airline industry not in its current financial bind, the prospect of replacing some of those systems isn't one to be taken lightly -- not only are the systems at a major airline closely intertwined with unique procedures, but they also tend to be tightly tied together in terms of data with lots of real-time message passing going on not only between the airline's internal systems but also between the airline and various third parties (ACARS messages, weather info, flight plan information, reservations info, etc.).
It's a very interesting industry from an IT perspective, at least when it isn't in a death spiral...
Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
The Theorem Theorem: If If, Then Then.
"As the article says, a lot of resistance to upgrades comes from employees who know how to do things a certain way, and won't retool without much screaming and kicking."
And why is that a bad thing? If the software is a good tool for the task at hand, they should keep using it. In fact, the article clearly says that this program was in many ways superior to newer programs on the market - which is why they didn't upgrade earlier. They say they were able to create good workflows based around the software - extending the ideas of the software's design to other processes.
What the article fails to discuss is that you could have a brand new piece of software which fails just as badly. Software doesn't age, and this wasn't a hardware failure. If they tried to do 32,769 crew changes the first month they used the software....it would have failed just as it did now. And buying something new (just because it is new) doesn't mean it is bug-free. If anything, conventional wisdom would imply that the older software is less buggy than new software because it has years of usage. Whoever wrote this program was obviously an idiot and didn't consider "what happens if there are more than 32,768 changes a month?" But most people write shoddy software, and managers don't catch them on it. Do you think that's changed in 20 years? Have programmers become better? I doubt it. The new software they bought almost certainly has some bug lurking in it, ready to cause havoc.
The real issue is software that is not maintainable, mostly because noone has (or can use) the source code for it. In that sense, just because software is old, doesn't make it a Legacy Application. Lack of maintainability makes it a Legacy Application. What confuses me is, it sounds like they had the source code for the current application. How hard would it have been to go hire a Fortran programmer to review it, since noone in the organization was familiar with Fortran? And, it certainly should serve as a warning to anyone willing to use software critical to a business process without source code. You can't count on software being maintained if the company discontinues support, or goes bankrupt, or just doesn't feel like it. And if you can't count on someone else to support a critical system for you, you better make sure you can support it yourself. (It doesn't have to be FOSS, but you ought to have access to the source code for your own use)
...IMHO, can be found in the following single line from The Fine Article:
But after nearly 15 years in use, the business had grown accustomed to the SBS system, and much of Comair's crew management business processes had grown directly out of it.
(emphasis added)
Talk about putting the cart in front of the horse. This system would never have been replaced before it's crash--the cost of readjusting process and any other attached technology would have dwarfed simply updating the software. There was no business case you could make that would appear to justify the expense. Other than the little matter of "your company won't function if something goes wrong", of course...
Also, you'd never find a decent replacement product--since it's functionality would have to mirror those same system-driven business processes.
The truly major oversight was in letting the package drive how Comair did this part of it's business in the first place. Done otherwise, the meltdown might still have happened, for plenty of reasons outlined in the article. But left this way, this result was pre-ordained. No amount of planning or "risk assessment" was going to counter the inertia created by this process/technology inversion.
First, the longer a piece of software is in use, the greater the chance of finding an obscure or unlikely error condition. The older a piece of sofware, the more of its bugs will become apparent, and the more likely it is that a crippling bug will be found. Old software breaks down.
Second, operating constraints change over time. If a piece of software meets its initial demands, greater and greater demands are placed on it over time. If a piece of software is kept in use for many years, it will likely find itself handling a workload far in excess of what was imagined when the software was first created. When Comair first began using this software, it probably didn't have the business volume to make the transaction limit a problem. Because Comair's business grew over many years, but the software was not grown along with it, what was originally an unimportant design constraint turned out to be a major bug. Old software does not grow to meet new demands. Old software breaks down.
Old software doesn't rust. It doesn't develop stress fractures. It doesn't corrode or go stale. But in its own very real and very important way, old software does degrade over time; if not in itself, in its relationship to the constant growth in the demands placed upon it. Old software breaks down.
Any sufficiently well-organized community is indistinguishable from Government.
That's their job. Companies exist to make money - end of story. Technology for technologies sake is foolish and wasteful unless you're in an R&D department.
That being said, all technology spends (e.g. upgrades, redesigns, rewrites, replacements, etc.) can and should be boiled down to dollars that either fall into a profit/loss or risk/benefit catagory (hopefully over 3-5 years). If a CIO isn't doing this (or having it done for them) they should be fired.
If you think more subtlety is needed, then I hope you're a low to mid level SA/DBA/Developer because you don't understand the "business" side of the company that employs you. I'm not trying to be rude, but that's the brutal reality of the business world. On a lighter side, there'a always academia... if you'd prefer politics to dollars.
I haven't seen the code, but here's a guess. This is from a guy who started programming on an IBM mainframe and Burroughs minis in the late 70's,
When this code was written, hard disk space and memory space in mini computers and mainframes were at a premium
So you didn't use a fullword when a half would do. That would be the equivalent today of storing a lossless music file on your iPod Mini.
So they knew they were doing a few hundred events a month; the idea of going beyond 32K per month was probably absurd when it was written, and the programmer couldn't waste space, and anyway, when they got to that point, somebody would just patch the code and they'd be on their way.
Meanwhile, those programmers retired 10 years ago and this disaster erupted.
If you learned to program during the era this was written, it makes perfect sense.
These systems were written when a computer had maybe 4M of main memory. So if you double the size of your counter, that means you can hold....1/2 as many events.
So as a programmer, you make a choice. You either make the counter smaller, or you limit the system in some other way.
Computers today have 3 orders of magnitude more memory, and the choice between a short and a long is easy to make. But back then, it wasn't.
To help you understand, if a programmer from that era used a long int, he'd better have a damned good reason. Although, he should have made it an unsigned int and got double the space . See? You're not old enough to feel in your gut the need to save *BITS*.
Back when I learned to code in the late 70's, we used assembler (BAL 360), and we saved space by making all number packed and then stripped off the sign byte. You did a MVO to the same memory location, and it had a side effect of shifting the packed number on nibble (1/2 byte) to the right, erasing the sign bit. We did that because a 40M disk pack on an IBM 370/148 cost about $40,000 and we couldn't waste it. Now I have a thumb drive with 1G on it. You just don't understand.