Risk Management - A Cautionary Tale
Mr. Ghost writes "By now many people have heard about the fiasco and financial blunder Comair had over the 2004 Christmas holiday. An article on CIO provides a timeline of the decisions that led up to the system failure costing the division of Delta Airlines $20 million. The article points out the need for proper risk management and what can occur when a risk analysis is not performed or ignored. It goes on to mention that although this was a very public failure, this type of system failure can occur in other companies." From the article: "The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events..."
Yes, senior management was distracted, but it's the CIO's job to warn senior management and the board about risks to the business as well as their liklihood of happening.
"I'd rather be a lightning rod than a seismometer." -Ken Kesey
From the article:
Sounds like some sort of overflow problem. Hmmm....
The big issue is, of course, the business units and IT playing "After you, Alfonse..." but it's fun to seek out the pebble that set off the avalanche.
--- Attorneys Assisting Citizen-Soldiers & Families -
To me, when you look at code you always want to rewrite it thinking you could do it better. But if you look at what they had to work with, you realize most coders write (at a given time), write pretty good code.
This software has been working for over 20 years! What will your code look like in 20 years? I doubt it has the same track record. I'm not sure foresight was a problem. I think they did the best they could with language and hardware of the day.
The comair meltdown wasn't a software problem if you ask me, it was the business changed.
"This isn't a study in computer science, its a study in human behavior"
How could nobody in 11 years see that the changes were counted with a 16 bit signed integer?
If this company was run by a typical big company, somebody DID complain about this 16-bit signed integer. Chances are, they were told to shut up about it and not rock the boat. This frequently happens when someone points out a bug which would require a fundamental change to the system.
Most companies only like employees who think inside the box, despite telling people to think outside the box.
94% of Repubs and 21% of Dems voted to renew the Patriot Act
If the crew scheduling system was old as the hills how old is the system used to track aircraft maintenance? Oh wait that issue will be addressed when we crash an aircraft.
Maintenance manuals and procedures are written in blood. The next tragedy will be no different.
It is always easy to say "I told you" so after the fact, but the reality is that this failure has far more to do with the companies attitude about technology than failure of somebody to say "look out!". In fact by the sounds of it, the entire application could probably be ran on 2 souped up PS'c running in parallel in different co-locations over the internet - the hardware and infrastructure would not cost alot.
.....
Even worse, is when these types of failures happen, then comes in the ole "policy and procedure" routine kicks in.
To tell a story, one time I went to a boarding school, and at the beginning of the year they had almost no rules, and then when ever something went wrong they added a new rule. Well needless to say at the end of the year there were so many rules, people could get repramanded for flushing the toilet twice instead of once! Not having their shoes tied left over right, etc
Well I grew up and found the same is true in companies, how much you wanna bet they are gonna loose more than 20 million from too many piled up policy and procedures that keep anyone from getting anything done?
OTOH, what does "Risk management" in IT really mean, besides drawing nice PowerPoints and putting a chapter "Risk analysis" into change request forms, that are normally filled in with "No risk, no fun!" or "If I make a very big mistake, it will extinguish mankind"?
Unfortunately, it is commonly seen that upper management abides by the if-it-aint-broke, dont fix it mentality. With many systems there is a huge amount of pressure to fix bugs/ outstanding issues, once that is done they work on money-making initiatives. I see it as a game of Jenga. Pieces are removed from the bottom, to create a taller structure. Instead of reinforcing the base there is a constant push to make the tower taller until it comes crashing down.
First thing 32767 changes are a lot. A whole f*ck*ng lot. It averages over 1310 changes per day. For a company that flys over 1300 flights a day, it means they averaged a change every flight every day. That's insanely high.
I'm personally getting sick of people asking about backup systems. It was a problem with the data. Too much of it. Given the safety and goverment oversight that hinges on this data, you don't mess with it. Any backup system, whether one or one hundred backup systems, when presented with the same data, would also fail.
The DOT report issued back in March (sorry don't have karma link handy) said neither Comair nor SBS (the closed source vendor that supplied the application) were aware of the limit.
Eric Bardes (Yes, the one from TFA)
This is a boring sig
Is there no way to moderate a post simply "odd"?
This article rings more as a sales article than anything else - only it isn't selling anything. Which puts it squarely in the "wtf" category for me.
Now, first the article states: First off, IBM AIX platform can be very new. Just because the application is old and possibly has bugs in it, doesnt mean the OS and hardware inst updated, or that HP Unix is any better.
Secondly, the following scenario makes perfect business sense: The article sets this up as the root of all thier problems. Good grief!!! dont waste resources on an inferior product for goodness sakes! If the product doesnt perform any better, and there are no known issues with the current product, forget it, its a waste of money.
Then a series of unfortunate events lead to 4 more years of no funding for a replacement product. So what, the business is under a financial crunch, why go back and fix something that isnt broken (that they know of)? The business still needs to survive dont they? I'm guessing they maintained the hardware and OS, otherwise we'd be here talking about how stupid they were for not updating maintenance contracts.
But the environment in which the software operates changes, and that change is analagous to weather corroding a pieces of physical equipment. Every time the OS gets a patch, the filesystem changes, a shared library is upgraded, the underlying hardware changes, there's a chance of triggering a failure in the software.
It's rather sad, to me, that we design these wonderful machines that can perform logical operations in great quantities with a high degree of repeatability and low occurance of failure, then create a culture around them that encourages sloppiness, and ultimately introduces a large measure of uncertainty into the operation of these machines. I am baffled at the perverse desire-- nay need-- that people seem to have to make software suffer from entropy.
The only "decay" in software should happen as a result of changing business requirements. There's no reason that, provided the business requirements don't change, that a well designed and properly implemented piece of software should not be usable in perpetuity. There may be changes in the underlying hardware and operating system software, but provided that the application is sufficiently abstracted from the underlying platform (or, provided that an emulation-layer for the original platform can be constructed) there's no reason other than changing business requirements for software to be "thrown away".
Let's put this a different way: How does a patch to the underlying operating system cause an application to fail? If the patch changes the behaviour of the underlying operating system in such a manner as to return unexepected values to the application, the patch is the cause of the failure. A flawed patch doesn't make an application "age" or "decay"-- it's simply a flawed patch. An application has to make assumptions about the underlying operating system. These assumptions are based on the API documentation-- the contact between the operating system and the application. When the OS violates the terms of the contract, that doesn't mean the application "decayed"-- it means some moron who coded the operating system patch messed up, and the operating system manufacturer/maintainer didn't perform good regression testing.
We should be designing software systems with 10 to 20 year usability goals. It would do a lot for the frustration level that the "suits" have with IT if we stopped being proponents of hugely expensive but "throwaway" systems, and started designing systems with an eye for longevity.
The Attitude Adjuster, I hate me, you can too.
Exactly. This software would have failed the month after it was installed if Comair had needed to do 32,001 changes in that month. But when it was installed, Comair wasn't that big, so having to do that many changes was not something that was considered. Now that Comair has grown considerably, the business requirement has changed but the application has not kept up.
End of Line.
I worked for IBM, coding in the mainframe networking department. Their motto should have been, "Don't change anything...it's working."
I got irritated. I would find stuff that was just STUPID. Horrendously mangled logic. Algorithms from other parts of the code applied completely wrong. Whenever I tried to improve the code I got the "It's working. Don't change anything" line. I left, determined to find a job where I could actually write code.
That was several years ago. I've gotten smarter since. I've worked on several large-scale, 5-9's systems. After several major and minor fuck-ups, now I know....
If it's working, don't change anything.
Aah, change is good. -- Rafiki
Yeah, but it ain't easy. -- Simba
These systems were written when a computer had maybe 4M of main memory. So if you double the size of your counter, that means you can hold....1/2 as many events.
So as a programmer, you make a choice. You either make the counter smaller, or you limit the system in some other way.
Computers today have 3 orders of magnitude more memory, and the choice between a short and a long is easy to make. But back then, it wasn't.
To help you understand, if a programmer from that era used a long int, he'd better have a damned good reason. Although, he should have made it an unsigned int and got double the space . See? You're not old enough to feel in your gut the need to save *BITS*.
Back when I learned to code in the late 70's, we used assembler (BAL 360), and we saved space by making all number packed and then stripped off the sign byte. You did a MVO to the same memory location, and it had a side effect of shifting the packed number on nibble (1/2 byte) to the right, erasing the sign bit. We did that because a 40M disk pack on an IBM 370/148 cost about $40,000 and we couldn't waste it. Now I have a thumb drive with 1G on it. You just don't understand.