Risk Management - A Cautionary Tale
Mr. Ghost writes "By now many people have heard about the fiasco and financial blunder Comair had over the 2004 Christmas holiday. An article on CIO provides a timeline of the decisions that led up to the system failure costing the division of Delta Airlines $20 million. The article points out the need for proper risk management and what can occur when a risk analysis is not performed or ignored. It goes on to mention that although this was a very public failure, this type of system failure can occur in other companies." From the article: "The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events..."
but all it takes for a good number of companies to get egg on their face is one careless mid-level that is too casual with passwords (and/or takes their work home on laptops with info unencrypted)...
I'm sure "SlashdotMedia" will improve on all the wonders that Dice Holdings blessed us all with
How could nobody in 11 years see that the changes were counted with a 16 bit signed integer? The company grows, I would think that making sure the sw can keep up with the numbers would require very little foresight, yet from the article, it seems that the only considerations were in the UI? I wonder if this was a hw limit or a sw limit...
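For anyone wondering what that limit looks like in practice, here is a minimal sketch. It assumes the counter is a plain two's-complement 16-bit signed value that wraps on overflow; how the real crew system reacted when it hit the limit isn't stated here, and the function names are made up for illustration.

```python
# Sketch only: models a 16-bit signed counter with wraparound.
# The wrap-on-overflow behaviour is an assumption, not a known
# detail of the actual crew management system.

INT16_MAX = 2**15 - 1  # 32767, the largest 16-bit signed value


def wrap_int16(n: int) -> int:
    """Interpret an arbitrary integer as a 16-bit two's-complement value."""
    return (n + 2**15) % 2**16 - 2**15


def headroom(count: int) -> int:
    """How many more changes fit before the counter reaches its maximum."""
    return INT16_MAX - count


if __name__ == "__main__":
    print(wrap_int16(INT16_MAX))      # still fits
    print(wrap_int16(INT16_MAX + 1))  # one more change wraps negative
    print(headroom(30000))            # changes left after 30,000
```

A check as simple as `headroom()` run once a month would have flagged a busy month long before the counter ran out.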
Pay no attention to that man behind the curtain.
Um...like making sure you run your Windows Updates. Because if you don't, you're gonna regret it.
Then again, even if you do, you're still going to regret it.
So, I guess the moral of the analogy is that it's better to patch your system and risk your hardware not working properly than having spyware or a virus on your system.
IGB: More fun than eating oatmeal!
Okay, like many slashdotters, I have a short attention span and I don't remember this "public" story about Comair committing this blunder.
I have a real question. Why did Comair's system fail in the first place? Was it due to a design flaw requiring its replacement in 2004? Was it an irreplaceable piece of hardware which died?
The article smacks of FUD, only because systems fail for a reason. The article conveniently leaves out the reason for the failure. I think this is critical to any risk analysis. For example, if I have a 20-year-old system that I can't get parts for, that's a high-risk system. However, if I can get parts for a 20-year-old system, then the risk is lower.
I don't like the idea of making assumptions that just because a system is 20 years old, that it absolutely must be replaced. I also don't like the assumption in the article that I already know the facts, so here's the analysis for you. I want the facts to back it up so I can come to my own conclusion.
"All great wisdom is contained in .signature files"
The Laws of Software Process: A New Model for the Production and Management of Software
"Armour, a consultant in software development, reveals a new structure for software development that redefines the nature and purpose of software. He explains how, in the modern knowledge economy, software systems are not products in the classical sense, but are the modern channels for the conveyance of information. From this perspective, he examines programming languages, quality, cost estimation, and project management, and demonstrates how to overcome common problems that afflict software development and use. The book is distributed by CRC.Copyright © 2004 Book News, Inc., Portland, OR"
I used to work in the Risk Management department of the capital markets division of a large international bank as a programmer.
When I started, 4 years ago, the reports generated were basically compilations by a cut-and-paste-monkey staff (despite being highly trained, very conscientious individuals) of reports generated by other departments. I was part of a team that reformed the IT basis for creating risk reporting, and found that while there was a lot of expertise and complex methods available, what was actually implemented was much, much smaller for the simple reason that it was tough to get the right reports generated given the inputs the department was given.
The project I worked on parsed the input data from the Excel spreadsheet inputs and loaded it to a database, where it could then be queried intelligently and nice reports generated. These reports were growing very fast in complexity, building towards the best toolsets available for determining the actual risk the bank was taking.
Several points about this job were fascinating:
1. How many departments are so caught up in the minutiae of "getting the report out" that they don't have time to examine the contents of it;
2. How much money can be made by knowing what the actual risk is. If you don't know the risk, you estimate high, and put lots of dollars in a reserve account. If you do know the risk accurately, you usually can greatly lower reserves to accurately meet even very bad case estimated losses, and use the rest of the money to fund interest-generating ventures.
3. How much the banking consolidation trend is increasing, due to the repeal of Glass-Steagall allowing multi-state banks to gobble and grow. This makes a consumer's life better because more resources are available (auto-bill-pay, check images, etc.).
It was a fun job. Then I found another one where I get to play with Python!
-- Kevin
Unitarian Church: Freethinkers Congregate!
One of the interesting quotes from the article:
Unfortunately, you can't see a crew management system age the way you can see an airplane rust. But they do.
I find that an interesting, if slightly obvious, insight. The interesting part is that you can know that software is decaying, but I don't know of any effective way to measure that decay. I don't even know of any particularly good ways to characterize the decay. It's not as if new defects are being introduced into code that's not changing. But the environment in which the software operates changes, and that change is analogous to weather corroding a piece of physical equipment. Every time the OS gets a patch, the filesystem changes, a shared library is upgraded, or the underlying hardware changes, there's a chance of triggering a failure in the software.
Can it be proven, or should we otherwise reasonably believe, that the probability of catastrophic system failure approaches 1 as the age of the system increases? Maybe a good topic for a research paper...
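As a back-of-the-envelope take on that question: if you assume (and it is a big assumption) that every environment change independently has the same small chance p of triggering a failure, then the chance of at least one failure after n changes is 1 - (1 - p)^n, which does head toward 1 as n grows. A rough sketch, with hypothetical names and an invented p:

```python
# Sketch under a strong assumption: each environment change (OS patch,
# library upgrade, hardware swap) independently triggers a failure with
# the same probability p. Real systems are unlikely to be this tidy.


def failure_probability(p: float, changes: int) -> float:
    """Probability of at least one failure after `changes` changes."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if changes < 0:
        raise ValueError("changes must be non-negative")
    return 1.0 - (1.0 - p) ** changes


if __name__ == "__main__":
    # p = 0.01 is an invented figure purely for illustration.
    for n in (0, 10, 100, 1000):
        print(n, round(failure_probability(0.01, n), 4))
```

Under that model the answer is yes for any p greater than zero, but the model says nothing about whether p is really constant or the changes really independent, which is where the research paper would have to start.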
As the article says, a lot of resistance to upgrades comes from employees who know how to do things a certain way, and won't retool without much screaming and kicking. I suspect that this is often the problem, and other problems -- distractions like strikes and the Y2K bug, management that doesn't pay sufficient attention to the problem -- are just secondary.
Here's some personal experience that isn't nearly the same scale, but neatly illustrates what I mean. I once worked for a pubs department that delivered copy to printshops as raw Postscript. There was a push from management to upgrade to Acrobat-generated PDF. This should have been a no-brainer -- print shops hate dealing with raw Postscript, and the existing process relied on an ancient, unsupported printer driver that ran only on Windows 98. But the people who managed the process just totally balked, claiming that tight schedules left them no extra time to learn Acrobat. A lame excuse? Sure. But it took a new pubs manager, and escalation to the do-it-or-you're-fired level, to get the change made.
I think this kind of issue had a lot to do with the failure of IBM's famous plan to use Unix or Linux for all their internal bureaucratic needs. Too many people dug in their heels, claiming that they couldn't possibly retool their Windows-based workflow.
When you talk about this stuff, somebody always says, "If people can't get with the program, they should be fired!" Well, it often comes to that, as it almost did with the PDF issue. But you can't just arbitrarily fire everybody who resists policy and process changes. It's expensive, there are legal ramifications -- and you risk destroying the very corporate infrastructure you're trying to save.
From what I can gather of the airline industry in general, it's a bunch of assorted systems that are sort of held together by duct tape and spit. If ever an industry needed open standards, mandated interoperability and thorough design and code auditing, I'd say that'd be the one. It seems to me that there really needs to be one central IT shop which rolls out all the software for airline and FAA IT needs and all airlines should go through that single central clearinghouse.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
I'm sorry, but this story and your comment annoy me greatly.
Here's the situation. The company had an old green screen application that was working just fine. It was old, but it did what the company needed. There was no hint that there was any fault.
Now, one day the company had to cancel 90% of its flights - and whammo some double byte counter overflowed.
What's all this crap in the article about old software "getting brittle"? This wasn't brittle aging software, this was software that was hit by an event that took it outside of its design parameters.
How would *you* have judged the risk of this software failing? How would that risk compare with the risk of installing a new untested package?
You have 2 flight crew and some "flight attendants." Let's call the number of crew on the plane 6. When a flight is rescheduled, you have 6 transactions removing them from the old flight, and 6 more transactions adding them to the new flight. Total 12 transactions. When you have bad snow days causing cancellation or rescheduling of 1,000 flights, then you just used up 1/3 of your transactions for the month. Since all the transactions are serialized, restoring from backup tapes would just cause the same crash again.
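The arithmetic above can be checked in a few lines. This is only a sketch: the crew size of 6 is the illustrative figure from this comment, the monthly limit of 32,767 assumes the 16-bit signed counter mentioned elsewhere in the thread, and the function names are invented.

```python
# Sketch of the transaction arithmetic. Assumptions: 6 crew per flight
# (an illustrative figure) and a monthly limit of 32,767 changes
# (a 16-bit signed counter).

INT16_MAX = 32767
CREW_PER_FLIGHT = 6


def transactions_for(rescheduled_flights: int, crew: int = CREW_PER_FLIGHT) -> int:
    """Each crew member is removed from one flight and added to another."""
    return rescheduled_flights * crew * 2


def share_of_limit(rescheduled_flights: int, limit: int = INT16_MAX) -> float:
    """Fraction of the monthly counter consumed by these reschedules."""
    return transactions_for(rescheduled_flights) / limit


def flights_until_limit(limit: int = INT16_MAX, crew: int = CREW_PER_FLIGHT) -> int:
    """Whole rescheduled flights that fit before the counter is exhausted."""
    return limit // (crew * 2)


if __name__ == "__main__":
    print(transactions_for(1000))          # 12000
    print(round(share_of_limit(1000), 3))  # a bit over one third
    print(flights_until_limit())           # 2730
```

So under these assumptions, roughly three bad snow days of 1,000 reschedules each would be enough to exhaust the month.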