Risk Management - A Cautionary Tale

Posted by Zonk on Tuesday May 3, 2005 @05:25AM from the watch-out-for-falling-anvils dept.

Mr. Ghost writes "By now many people have heard about the fiasco and financial blunder Comair had over the 2004 Christmas holiday. An article on CIO provides a timeline of the decisions that led up to the system failure costing the division of Delta Airlines $20 million. The article points out the need for proper risk management and what can occur when a risk analysis is not performed or is ignored. It goes on to mention that although this was a very public failure, this type of system failure can occur in other companies." From the article: "The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events..."

Why didn't the CIO yell louder? by winkydink · 2005-05-03 05:29 · Score: 5, Insightful

Yes, senior management was distracted, but it's the CIO's job to warn senior management and the board about risks to the business as well as their likelihood of happening.

--
"I'd rather be a lightning rod than a seismometer." -Ken Kesey
1. Re:Why didn't the CIO yell louder? by Knara · 2005-05-03 08:56 · Score: 4, Insightful
  
  I'd agree, but the fact that it was written in FORTRAN and they didn't have a single maintenance developer (even if it wasn't that developer's primary role) assigned to it that *knew* FORTRAN suggests that there was a whole lot of "buhhhhhh??" going on in that particular IT department.
Re:Yep by airrage · 2005-05-03 05:45 · Score: 4, Insightful

To me, when you look at code you always want to rewrite it, thinking you could do it better. But if you look at what they had to work with, you realize that most coders, at any given time, write pretty good code.

This software has been working for over 20 years! What will your code look like in 20 years? I doubt it has the same track record. I'm not sure foresight was a problem. I think they did the best they could with language and hardware of the day.

The Comair meltdown wasn't a software problem, if you ask me; it was that the business changed.

--
"This isn't a study in computer science, it's a study in human behavior"
Re:Yep by EnronHaliburton2004 · 2005-05-03 05:48 · Score: 4, Insightful

How could nobody in 11 years see that the changes were counted with a 16 bit signed integer?

If this company was run like a typical big company, somebody DID complain about this 16-bit signed integer. Chances are, they were told to shut up about it and not rock the boat. This frequently happens when someone points out a bug which would require a fundamental change to the system.

Most companies only like employees who think inside the box, despite telling people to think outside the box.
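
For anyone wondering what running out of room in a 16-bit signed integer looks like, here is a minimal Python sketch. It is not the Comair code (that was FORTRAN and closed source). The helper name `increment_int16` is hypothetical, and silent wrap-around is only one possible overflow behavior, assumed here purely for illustration.

```python
import ctypes

# Largest value a signed 16-bit integer can hold.
INT16_MAX = 2**15 - 1  # 32767


def increment_int16(value: int) -> int:
    """Add one to a counter stored as a signed 16-bit integer.

    Hypothetical helper: ctypes.c_int16 does no overflow checking,
    so a result past INT16_MAX wraps around to a negative number.
    """
    return ctypes.c_int16(value + 1).value


if __name__ == "__main__":
    print(increment_int16(INT16_MAX - 1))  # last value that still fits
    print(increment_int16(INT16_MAX))      # one change too many
```

The counter behaves perfectly for every value up to the limit, which is why nothing looks wrong until the one increment that does not fit.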

--
94% of Repubs and 21% of Dems voted to renew the Patriot Act
Crew Scheduling system? How about Aircraft maint by NetNinja · 2005-05-03 05:54 · Score: 4, Insightful

If the crew scheduling system was as old as the hills, how old is the system used to track aircraft maintenance? Oh wait, that issue will be addressed when we crash an aircraft.

Maintenance manuals and procedures are written in blood. The next tragedy will be no different.
A game of Jenga by lake2112 · 2005-05-03 06:06 · Score: 4, Insightful

Unfortunately, upper management commonly abides by the if-it-ain't-broke-don't-fix-it mentality. With many systems there is a huge amount of pressure to fix bugs and outstanding issues; once that is done, the work shifts to money-making initiatives. I see it as a game of Jenga. Pieces are removed from the bottom to create a taller structure. Instead of reinforcing the base, there is a constant push to make the tower taller until it comes crashing down.
Re:Interesting Technical Detail ... by josecanuc · 2005-05-03 06:30 · Score: 4, Insightful

Exactly... The article author seems to point to the fact that the software was old and just waiting to die...

Because NO ONE knew of the particular limit that was exceeded, those who were supposed to calculate risk never knew what the tipping point was.

All they could say was "our software is old, someday it may not work any more, but I cannot say for what reason, because I do not know FORTRAN."

How the hell can you calculate risk if your only input is the chronological age of a software system?
Re:Yep by Marillion · 2005-05-03 06:31 · Score: 5, Insightful

First thing: 32767 changes is a lot. A whole f*ck*ng lot. It averages over 1310 changes per day. For a company that flies over 1300 flights a day, it means they averaged a change every flight every day. That's insanely high.
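
The arithmetic behind those numbers can be sketched in a few lines of Python. The 25-day window is an assumption inferred from the figures quoted above (32767 changes at a bit over 1310 per day); it is not stated in the comment, and the variable names are made up for illustration.

```python
# Signed 16-bit limit on the change counter.
INT16_MAX = 2**15 - 1  # 32767

# Assumption: the counter had been accumulating for about 25 days
# when it hit the limit. Inferred from the quoted rate, not sourced.
DAYS_IN_WINDOW = 25

# Figure quoted in the comment (it says "over 1300").
FLIGHTS_PER_DAY = 1300

changes_per_day = INT16_MAX / DAYS_IN_WINDOW
changes_per_flight = changes_per_day / FLIGHTS_PER_DAY

print(f"changes per day:    {changes_per_day:.1f}")
print(f"changes per flight: {changes_per_flight:.2f}")
```

Under that assumption the result is roughly one change per flight per day, which matches the claim in the comment.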

I'm personally getting sick of people asking about backup systems. It was a problem with the data. Too much of it. Given the safety and government oversight that hinges on this data, you don't mess with it. Any backup system, whether one or one hundred backup systems, when presented with the same data, would also fail.

The DOT report issued back in March (sorry don't have karma link handy) said neither Comair nor SBS (the closed source vendor that supplied the application) were aware of the limit.

Eric Bardes (Yes, the one from TFA)

--
This is a boring sig
Re:Yep by mankey+wanker · 2005-05-03 06:37 · Score: 4, Insightful

Is there no way to moderate a post simply "odd"?
Re:software decays by hawaiian717 · 2005-05-03 07:02 · Score: 4, Insightful

The only "decay" in software should happen as a result of changing business requirements.
Exactly. This software would have failed the month after it was installed if Comair had needed to do 32,001 changes in that month. But when it was installed, Comair wasn't that big, so having to do that many changes was not something that was considered. Now that Comair has grown considerably, the business requirement has changed but the application has not kept up.

--
End of Line.
Re:Interesting Technical Detail ... by BattleTroll · 2005-05-03 07:24 · Score: 4, Insightful
"How the hell can you calculate risk if your only input is the chronological age of a software system?"

That wasn't the only input in this case. In fact, you don't have to know the gory details of the implementation to determine risk, just the business impact of a problem to the system.
- Since no one at the company understood the language used, it stands to reason no one understood what the system was doing. Risk: Medium
- The system was mission critical to the performance of almost every other function of the airline. If the system was lost, the airline was hosed. Risk: Critical
- They had no failover plan in place in case the system went down. Risk: High
- No load tests were possible since they only had the one system in place. Without load testing the only way to find out the system fails under load is to wait until it fails in production. Risk: High
It stands to reason there were other risks involved that weren't identified in the article.
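
The list above is effectively a small risk register, and it can be written down without reading a line of FORTRAN. A minimal Python sketch follows. The descriptions and ratings come from the list; the numeric `SEVERITY` scale (including the "Low" level), the `Risk` class, and the `prioritize` helper are assumptions added for illustration.

```python
from dataclasses import dataclass

# Assumed ordering of the rating labels; "Low" is not used in the
# list above and is included only to complete the scale.
SEVERITY = {"Low": 1, "Medium": 2, "High": 3, "Critical": 4}


@dataclass(frozen=True)
class Risk:
    description: str
    rating: str


RISKS = [
    Risk("No one at the company understands the language used", "Medium"),
    Risk("System is mission critical to almost every airline function", "Critical"),
    Risk("No failover plan if the system goes down", "High"),
    Risk("No load testing possible with only one system in place", "High"),
]


def prioritize(risks):
    """Return the risks ordered from most to least severe."""
    return sorted(risks, key=lambda r: SEVERITY[r.rating], reverse=True)


if __name__ == "__main__":
    for risk in prioritize(RISKS):
        print(f"{risk.rating:<8} {risk.description}")
```

None of the inputs require knowing where the overflow was; they only require knowing what the business loses when the system stops.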
Re:Interesting Technical Detail ... by Peter+La+Casse · 2005-05-03 07:32 · Score: 4, Insightful

Software does not age.

Software does age. As a program grows older, people change it, its inputs and how it is used, and the older a program gets, the less the people making the changes are likely to understand it.

In addition, some bugs don't manifest themselves under usage patterns from 20 years ago, or when the software is run on hardware from 20 years ago, but they do manifest themselves under usage patterns or on hardware that's in use now. The more you change, especially without understanding all of the ramifications of that change, the greater the risk for error.

That's what software aging is.
Re:Interesting Technical Detail ... by ScuzzMonkey · 2005-05-03 07:51 · Score: 5, Insightful

"They had no failover plan in place in case the system went down."

With that, you've hit the heart of the matter, and what the article should have focused on rather than the "old software breaks down" BS. This was a bug which could have hit at ANY time since the software was installed; it was an overflow, not a rusting subroutine that fell off. I can't personally see any way that they could have foreseen this particular problem but when you have a system that is so critical to your operation, you don't look for problems it might have--you look for alternatives to fall back to when it DOES have problems.

You never see them coming. But you'd better plan for them anyway.

--
No relation to Happy Monkey
Re:Yep by Shotgun · 2005-05-03 08:07 · Score: 5, Insightful

I worked for IBM, coding in the mainframe networking department. Their motto should have been, "Don't change anything...it's working."

I got irritated. I would find stuff that was just STUPID. Horrendously mangled logic. Algorithms from other parts of the code applied completely wrong. Whenever I tried to improve the code I got the "It's working. Don't change anything" line. I left, determined to find a job where I could actually write code.

That was several years ago. I've gotten smarter since. I've worked on several large-scale, 5-9's systems. After several major and minor fuck-ups, now I know....

If it's working, don't change anything.

--
Aah, change is good. -- Rafiki
Yeah, but it ain't easy. -- Simba