Risk Management - A Cautionary Tale
Mr. Ghost writes "By now many people have heard about the fiasco and financial blunder Comair had over the 2004 Christmas holiday. An article on CIO provides a timeline of the decisions that led up to the system failure, which cost the Delta Air Lines subsidiary $20 million. The article points out the need for proper risk management and what can occur when a risk analysis is not performed or is ignored. It goes on to mention that although this was a very public failure, this type of system failure can occur in other companies." From the article: "The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events..."
Okay, like many slashdotters, I have a short attention span and I don't remember this "public" story about Comair committing this blunder.
I have a real question. Why did Comair's system fail in the first place? Was it due to a design flaw requiring its replacement in 2004? Was it an irreplaceable piece of hardware which died?
The article smacks of FUD, if only because systems fail for a reason, and the article conveniently leaves out the reason for this failure. I think that is critical to any risk analysis. For example, if I have a 20-year-old system that I can't get parts for, that's a high-risk system. However, if I can still get parts for a 20-year-old system, then the risk is lower.
I don't like the assumption that just because a system is 20 years old, it absolutely must be replaced. I also don't like the article's assumption that I already know the facts and only need its analysis. I want the facts that back up the analysis so I can come to my own conclusion.
"All great wisdom is contained in .signature files"
I used to work in the Risk Management department of the capital markets division of a large international bank as a programmer.
When I started, 4 years ago, the reports generated were basically compilations of other departments' reports, assembled by a cut-and-paste-monkey staff (who were nonetheless highly trained, very conscientious individuals). I was part of a team that reformed the IT basis for creating risk reporting. We found that while there was a lot of expertise and many complex methods available, what was actually implemented was much smaller in scope, for the simple reason that it was tough to generate the right reports from the inputs the department was given.
The project I worked on parsed the input data from the Excel spreadsheets and loaded it into a database, where it could then be queried intelligently to generate nice reports. These reports were growing very fast in complexity, building towards the best toolsets available for determining the actual risk the bank was taking.
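A minimal sketch of that parse-load-query pipeline, not the bank's actual code. It assumes the spreadsheet has been exported to CSV (reading Excel files directly would need a third-party library), and the table name, column names, and figures are all hypothetical:

```python
import csv
import io
import sqlite3

# Hypothetical CSV export of one spreadsheet tab; every name and number
# here is invented for illustration.
SAMPLE_CSV = """desk,instrument,exposure_usd
rates,swap,1200000
rates,bond,800000
fx,forward,500000
"""


def load_exposures(conn, csv_text):
    """Parse the CSV text and load its rows into an 'exposures' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS exposures "
        "(desk TEXT, instrument TEXT, exposure_usd REAL)"
    )
    rows = [
        (r["desk"], r["instrument"], float(r["exposure_usd"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO exposures VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def exposure_by_desk(conn):
    """Once the data is in a database, a report is a single query."""
    cur = conn.execute(
        "SELECT desk, SUM(exposure_usd) FROM exposures "
        "GROUP BY desk ORDER BY desk"
    )
    return dict(cur.fetchall())


conn = sqlite3.connect(":memory:")
loaded = load_exposures(conn, SAMPLE_CSV)
report = exposure_by_desk(conn)
print(loaded, report)
```

The point is the shape of the thing: once the inputs live in tables instead of spreadsheets, each new report is another query rather than another round of cut-and-paste.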
Several points about this job were fascinating:
1. How many departments are so caught up in the minutiae of "getting the report out" that they don't have time to examine the contents of it;
2. How much money can be made by knowing what the actual risk is. If you don't know the risk, you estimate high and put lots of dollars in a reserve account. If you do know the risk accurately, you can usually lower reserves greatly while still covering even very bad-case estimated losses, and use the rest of the money to fund interest-generating ventures.
3. How much the banking consolidation trend is increasing, due to the repeal of Glass-Steagall allowing multi-state banks to gobble and grow. This makes a consumer's life better because more resources become available (auto-bill-pay, check images, etc.).
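To put rough numbers on point 2, here is a toy comparison, not how any real bank sets reserves. It assumes a hypothetical blanket rule (reserve a flat 20% of exposure when the risk is unknown) versus a reserve sized to cover 99% of a set of estimated loss scenarios. Every figure and function name is made up:

```python
def blanket_reserve(exposure, haircut_pct=20):
    """Risk unknown: reserve a flat, deliberately high share of exposure."""
    return exposure * haircut_pct // 100


def measured_reserve(loss_scenarios, confidence_pct=99):
    """Risk measured: reserve enough to cover confidence_pct percent of the
    estimated loss scenarios (a simple empirical quantile)."""
    ordered = sorted(loss_scenarios)
    n = len(ordered)
    # Ceiling of confidence_pct * n / 100, shifted to a zero-based index.
    index = (confidence_pct * n + 99) // 100 - 1
    return ordered[max(0, min(n - 1, index))]


exposure = 1_000_000
# Hypothetical estimated losses: 1,000 up to 100,000 in steps of 1,000.
scenarios = [i * 1_000 for i in range(1, 101)]

unknown = blanket_reserve(exposure)
known = measured_reserve(scenarios)
freed = unknown - known
print(unknown, known, freed)
```

With these invented numbers the blanket rule ties up 200,000 while the measured reserve needs 99,000, so about 101,000 is freed for other uses. The size of the gap depends entirely on the assumed inputs.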
It was a fun job. Then I found another one where I get to play with Python!
-- Kevin
Unitarian Church: Freethinkers Congregate!
One of the interesting quotes from the article:
Unfortunately, you can't see a crew management system age the way you can see an airplane rust. But they do.
I find that an interesting, if slightly obvious, insight. The interesting part is that you can know that software is decaying, but I don't know of any effective way to measure that decay. I don't even know of any particularly good ways to characterize the decay. It's not as if new defects are being introduced into code that's not changing. But the environment in which the software operates changes, and that change is analogous to weather corroding a piece of physical equipment. Every time the OS gets a patch, the filesystem changes, a shared library is upgraded, or the underlying hardware changes, there's a chance of triggering a failure in the software.
Can it be proven, or should we otherwise reasonably believe, that the probability of catastrophic system failure approaches 1 as the age of the system increases? Maybe a good topic for a research paper...
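As a back-of-the-envelope sketch rather than a proof: assume each environmental change (an OS patch, a library upgrade, and so on) independently triggers a catastrophic failure with some fixed probability p. Both the independence and the constant p are assumptions made here for illustration. Under them, the chance of at least one failure after n changes is 1 - (1 - p)^n, which does approach 1 as n grows for any p > 0:

```python
def prob_failure_by(n_changes, p_per_change):
    """Probability of at least one catastrophic failure after n_changes
    environmental changes, ASSUMING each change independently triggers a
    failure with the same probability p_per_change."""
    if not 0.0 <= p_per_change <= 1.0:
        raise ValueError("p_per_change must be between 0 and 1")
    return 1.0 - (1.0 - p_per_change) ** n_changes


# Hypothetical 1% chance of catastrophe per environmental change.
for n in (0, 10, 100, 1000):
    print(n, round(prob_failure_by(n, 0.01), 4))
```

Whether real systems behave this way is exactly the open question: p is neither constant nor easy to measure, and if maintenance keeps it at zero the probability never rises at all.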