Risk Management - A Cautionary Tale

Why didn't the CIO yell louder? by winkydink · 2005-05-03 05:29 · Score: 5, Insightful

Yes, senior management was distracted, but it's the CIO's job to warn senior management and the board about risks to the business as well as their likelihood of happening.

--

"I'd rather be a lightning rod than a seismometer." -Ken Kesey

Re:Why didn't the CIO yell louder? by Anonymous Coward · 2005-05-03 05:37 · Score: 2, Informative

I can't help but commiserate with the folks at Comair. Technology projects can be hard enough without having to deal with labor unions - which is really key to understanding Comair's problem. I was the project manager in the late '90s at TWA, hired to implement just a portion of what Comair is trying to replace. Scheduling systems hit at the heart of the pilots' work rules and they won't give up a single work rule without a fight. That was true even when the union was the instigator of the change. Even after the agreement on work rules, there were unique training issues, legal agreements, and of course egos. Pilots are a very confident class of people (great skill for flying planes) and that confidence is evident even when they are negotiating on things about which they have little knowledge. It is very hard to get an agreement on how to change the airline.
My project was to implement a new scheduling system for the pilots. It eventually took a complete restart on the project and a little over 3 years. I had to do things as a project manager that I would have never dreamed would be part of a technology project. I gave speeches to the Union governing council and was part of the official negotiation team. One year of that project was used in just negotiations with the pilot union.
I ended up both loving and hating that project. I even quit at one time, but came back after a few weeks. I was constantly frustrated by the lack of progress that was being made in negotiations, the feeling that I was the whipping boy, and the anger that was projected at me by some who thought they were being "forced" to make a change. Even near the end of the project, my boss commented that he was ready to give me a "real" project as soon as I wrapped that one up. No matter what happened, no one believed that it was all that tough. If you think it is tough to turn a company, try doing it with a union. On the other side of the spectrum, though, is that we ended up with a successful installation. We took a survey (2 months into the roll out) of the pilots and the union agreed that we got a 94% acceptance rating for that project. I also had to admit that I grew a lot on that project -- as a PM and as a person. I believe that a lot of my people, negotiating, finance and legal skills are all due to that project.
So, given the struggle that it takes to get the unions to agree to even mutually beneficial change, the company is left in the position of trying to get old work rules to fit with modern technology. That is the wrong way to do it and they will find themselves starting over several times before they realize that, as painful as it might be, you have to update the rules if you want to update the technology. The PM and CIO have to learn to be a salesman, negotiator and technocrat all at the same time.
Overall, I feel sorry for the Comair CIO. Your project has about a zero percent chance of succeeding unless you have just the right business people tied into the project to pull it off.
Darrell Hamilton Strategic Director LabCorp

Blame the unions!!!
Everything is their fault, right?
Re:Why didn't the CIO yell louder? by winkydink · 2005-05-03 05:41 · Score: 1

My experience is different from yours. An effective CIO manages well both "up and down". Bad executives can exist in any part of any organization. That trait isn't limited to IT.

--
"I'd rather be a lightning rod than a seismometer." -Ken Kesey
Re:Why didn't the CIO yell louder? by winkydink · 2005-05-03 05:45 · Score: 3, Funny

...The PM and CIO have to learn to be a salesman, negotiator and technocrat all at the same time.

How is this different from what a good PM or CIO does every day? Darrell Hamilton is a "Strategic Director"? Strategic Director of what? Blame avoidance & CYA?

--
"I'd rather be a lightning rod than a seismometer." -Ken Kesey
Re:Why didn't the CIO yell louder? by mr_z_beeblebrox · 2005-05-03 05:54 · Score: 1

but it's the CIO's job to warn senior management and the board about risks to the business as well as their likelihood of happening

The CIO's job is defined by the investors and management, not by a Slashdot post or even a standard definition. If the CIO is given many other things and told they are his priorities by the proper people, those things are his priorities. However, a good CIO would make risk management a priority with his company and, if he could not, he would seek employment elsewhere while he still had a good reference.
Re:Why didn't the CIO yell louder? by Angostura · 2005-05-03 07:40 · Score: 3, Interesting

I'm sorry, but this story and your comment annoy me greatly.

Here's the situation. The company had an old green screen application that was working just fine. It was old, but it did what the company needed. There was no hint that there was any fault.

Now, one day the company had to cancel 90% of its flights - and whammo some double byte counter overflowed.

What's all this crap in the article about old software "getting brittle"? This wasn't brittle aging software, this was software that was hit by an event that took it outside of its design parameters.

How would *you* have judged the risk of this software failing? How would that risk compare with the risk of installing a new untested package?
Re:Why didn't the CIO yell louder? by gr8_phk · 2005-05-03 08:00 · Score: 1

Because the system was in fact working, and there was no way for him to know about 16bit values being used in the software. It was a latent problem that would not degrade over time, it just completely broke one day.
Perhaps the docs for the software would indicate this problem. Did anyone RTFM at Comair?
Re:Why didn't the CIO yell louder? by Knara · 2005-05-03 08:56 · Score: 4, Insightful

I'd agree, but the fact that it was written in FORTRAN and they didn't have a single maintenance developer (even if it wasn't that developer's primary role) assigned to it who *knew* FORTRAN suggests that there was a whole lot of "buhhhhhh??" going on in that particular IT department.
Re:Why didn't the CIO yell louder? by gmajor · 2005-05-03 09:09 · Score: 1

At an old job, we supported a Fortune 500 company that had an application written in assembly. Nobody at the client company knew any assembly. They hadn't the faintest idea how it worked - they just knew that it did.

I suspect this situation repeats itself in many companies.
Re:Why didn't the CIO yell louder? by Knara · 2005-05-03 10:10 · Score: 1

Oh I know it does. I'm just saying that even though it's common, it's a disaster waiting to happen.
Re:Why didn't the CIO yell louder? by dmhayden · 2005-05-04 01:28 · Score: 1

I agree completely. The article (and the company) seem to be placing blame in completely the wrong place. A risk assessment may have shown that the software was old and no one was fluent in fortran, but it probably would not have shown the actual error that caused the shutdown.

The only way to detect this sort of problem is to look back at the requirements document (assuming that there is one) for the software and see if the requirements and assumptions still hold true. But who does that?

I understand the concept of software getting brittle, but perhaps the analogy isn't so good. Really, the requirements and data that the software must handle change, so we ask it to do more and more stuff, and using input that is farther and farther from the original design constraints. Eventually, something like this happens.

risk management 101 by CKnight · 2005-05-03 05:30 · Score: 1

How do you strike a balance between risk mitigation and ease of use for users? Sure, you can run a backup of your data and applications every 5 minutes, but then of course, no work gets done.

It seems like a professional discipline in itself. Risk+ certification anyone?

--
http://www.watacrackaz.com

Re:risk management 101 by linuxbert · 2005-05-03 05:43 · Score: 3, Informative

You perform a TRA - Threat and Risk Assessment. And you are quite right, it is a profession all of its own.

For the do-it-yourselfers: http://www.cse-cst.gc.ca/en/publications/gov_pubs/itsg/itsg04.html Grab the PDF, and it will guide you through the process.
Re:risk management 101 by homer_ca · 2005-05-03 06:27 · Score: 1

The short version of that is, you think through all the hypothetical scenarios then consider the severity and probability of each scenario. It does take some creative thinking to find all the disaster scenarios, and it takes some thorough analysis to find all the cascading failures resulting from a certain disaster.
Just off the top of my head, I can recall three disasters that disrupted air travel and required mass crew reschedulings: '99 Midwest blizzard (75 planes stuck on the runway in Detroit, passengers couldn't unload), 9/11, and the Northeast blackout. I wonder how close they came to the 32K limit those times. This isn't exactly a far-fetched scenario.
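The "severity and probability of each scenario" step described above can be sketched in a few lines. This is only an illustration: the function names, the 0-to-1 probability scale, the 1-to-5 severity scale and every number in the table are placeholders invented for the example, not figures from the TRA guide or from any airline.

```python
def risk_score(probability: float, severity: int) -> float:
    """Simple score: likelihood (0..1) multiplied by severity (1..5)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if not 1 <= severity <= 5:
        raise ValueError("severity must be between 1 and 5")
    return probability * severity


# Placeholder values for illustration only; they are not real estimates.
SCENARIOS = {
    "severe winter storm forces mass crew rescheduling": (0.10, 5),
    "regional power blackout": (0.05, 4),
    "single workstation failure": (0.30, 1),
}


def rank_scenarios(scenarios):
    """Return (name, score) pairs, highest score first."""
    scored = [(name, risk_score(p, s)) for name, (p, s) in scenarios.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, score in rank_scenarios(SCENARIOS):
    print(f"{score:.2f}  {name}")
```

The ranking is only as good as the scenario list, which is the point made above: the hard part is thinking of the disaster scenarios and their cascading failures in the first place.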

Analyze all you want, by Megaweapon · 2005-05-03 05:30 · Score: 2, Interesting

but all it takes for a good number of companies to get egg on their face is one careless mid-level that is too casual with passwords (and/or takes their work home on laptops with info unencrypted)...

--
I'm sure "SlashdotMedia" will improve on all the wonders that Dice Holdings blessed us all with

Re:Yep by Monf · 2005-05-03 05:33 · Score: 3, Interesting

yes, but: As it turned out, the crew management application, unbeknownst to anyone at Comair, could process only a set number of changes--32,000 per month--before shutting down. And that's exactly what happened.

How could nobody in 11 years see that the changes were counted with a 16 bit signed integer? The company grows, I would think that making sure the sw can keep up with the numbers would require very little foresight, yet from the article, it seems that the only considerations were in the UI? I wonder if this was a hw limit or a sw limit...

--
Pay no attention to that man behind the curtain.

We need an analogy here... by LegendOfLink · 2005-05-03 05:39 · Score: 3, Interesting

Um...like making sure you run your Windows Updates. Because if you don't, you're gonna regret it.

Then again, even if you do, you're still going to regret it.

So, I guess the moral of the analogy is that it's better to patch your system and risk your hardware not working properly than having spyware or a virus on your system.

--
IGB: More fun than eating oatmeal!

Interesting Technical Detail ... by rewinn · 2005-05-03 05:40 · Score: 3, Insightful

From the article:

As it turned out, the crew management application, unbeknownst to anyone at Comair, could process only a set number of changes--32,000 per month--before shutting down.

Sounds like some sort of overflow problem. Hmmm....

The big issue is, of course, the business units and IT playing "After you, Alfonse..." but it's fun to seek out the pebble that set off the avalanche.

--
--- Attorneys Assisting Citizen-Soldiers & Families -

Re:Interesting Technical Detail ... by autark · 2005-05-03 05:56 · Score: 1

Previous /. articles confirmed it was integer overflow. Seems the global variable that counted the number of crew changes was a signed 16-bit integer, which allows for a high value of 32,767. There followed an interesting discussion on why a counter (which should never be negative) was declared as signed instead of unsigned. An unsigned int would have allowed for 65,535 crew changes, which would have saved Comair's bacon in December.
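For anyone who wants to see the two limits side by side, here is a small sketch that simulates 16-bit counters with Python's ctypes. It is not the Comair code (which was Fortran, and whose exact overflow behavior isn't described here); the helper names are made up, and the silent wraparound shown is just the usual two's-complement behavior.

```python
import ctypes

INT16_MAX = 2**15 - 1    # 32,767: largest signed 16-bit value
UINT16_MAX = 2**16 - 1   # 65,535: largest unsigned 16-bit value


def bump_signed16(count: int) -> int:
    """Add one to a counter stored in a signed 16-bit slot (wraps silently)."""
    return ctypes.c_int16(count + 1).value


def bump_unsigned16(count: int) -> int:
    """Add one to a counter stored in an unsigned 16-bit slot (wraps silently)."""
    return ctypes.c_uint16(count + 1).value


print(bump_signed16(32766))    # 32767
print(bump_signed16(32767))    # -32768  (signed counter wraps negative)
print(bump_unsigned16(32767))  # 32768   (unsigned still has room)
print(bump_unsigned16(65535))  # 0       (but it wraps too, just later)
```

Note that the unsigned version only moves the cliff from 32,767 to 65,535; it doesn't remove it.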
Re:Interesting Technical Detail ... by UdoKeir · 2005-05-03 06:23 · Score: 2, Insightful

And yet the idiot from EDS has this to say:

"These systems are just like physical assets," says Mike Childress, former Delta CTO and now vice president of applications and industry frameworks for EDS. "They become brittle with age, and you have to take great care in maintaining them."

You can easily run software for 20 years and it will not fail so long as you don't exceed its operating parameters. That's also assuming you can source replacement kit for hardware failures.

Software does not age.
Re:Interesting Technical Detail ... by arkanes · 2005-05-03 06:29 · Score: 1

It's pretty common to use signed ints to allow for the easy use of magic numbers like -1. It might even have been simpler - assume a function like changeCrew(), which returns the number of crew changes or a negative error code. It's unfortunate that this sort of thing gets done in high reliability enterprise systems, because the alternatives are more robust even if they are more awkward to use, but they're such common C idioms that I suppose it's to be expected.
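A rough sketch of the two styles, written in Python rather than C for brevity. Everything here is hypothetical: change_crew_magic, change_crew_explicit, ChangeResult and the error code are invented for illustration and have nothing to do with the actual crew management code.

```python
from dataclasses import dataclass
from typing import Optional

ERR_UNKNOWN_CREW = -1  # hypothetical "magic number" error code


def change_crew_magic(roster: set, crew_id: str, changes_so_far: int) -> int:
    """Magic-number style: a result >= 0 is the new count, < 0 is an error.

    The count and the error share one return value, which is why the
    counter has to be a signed type in a language like C.
    """
    if crew_id not in roster:
        return ERR_UNKNOWN_CREW
    return changes_so_far + 1


@dataclass(frozen=True)
class ChangeResult:
    count: int                   # never used to carry an error
    error: Optional[str] = None  # None means success


def change_crew_explicit(roster: set, crew_id: str, changes_so_far: int) -> ChangeResult:
    """More awkward to call, but the count and the error never share a slot."""
    if crew_id not in roster:
        return ChangeResult(count=changes_so_far, error="unknown crew id")
    return ChangeResult(count=changes_so_far + 1)


roster = {"crew-17", "crew-42"}
print(change_crew_magic(roster, "crew-99", 10))     # -1
print(change_crew_explicit(roster, "crew-99", 10))  # count=10, error set
```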
Re:Interesting Technical Detail ... by josecanuc · 2005-05-03 06:30 · Score: 4, Insightful

Exactly... The article author seems to point to the fact that the software was old and just waiting to die...

Because NO ONE knew of the particular limit that was exceeded, those who were supposed to calculate risk never knew what the tipping point was.

All they could say was "our software is old, someday it may not work any more, but I cannot say for what reason, because I do not know FORTRAN."

How the hell can you calculate risk if your only input is the chronological age of a software system?
Re:Interesting Technical Detail ... by operagost · 2005-05-03 06:33 · Score: 1

It may have been a language limitation. We're talking about a 20-year-old FORTRAN application here. I think I've experienced a similar problem with DEC BASIC -- even the very latest versions on the Alpha and Itanium only have signed integers.

--

Gamingmuseum.com: Give your 3D accelerator a rest.
Re: Interesting Technical Detail ... by Alwin Henseler · 2005-05-03 06:38 · Score: 2, Insightful

Sounds like some sort of overflow problem. Hmmm....
That depends. I suppose you could call the software involved here mission-critical. In that case one might expect limits like the ~32000/month to be documented (not in this case if I read it right). If that limit had been documented, then the failure would not have been overflow, but not RTFM/using the system out-of-spec, which is management/operator error.

Also it matters how exceeding a limit is handled (graceful degradation). Did this system say: "I'm stuffed, I can't take no more input for a while" or did it say "Oh dear, I'm confused, I'll be doing totally silly crap now". In the first case, the failure is partial and you can still get some work done. In the latter case (=what happened here?), you're totally screwed once you reach that limit.
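The "I'm stuffed" behavior from the first case can be sketched like this. It is a toy model only: the class and exception names are invented, and the default limit simply reuses the signed 16-bit maximum discussed elsewhere in this thread.

```python
class CapacityExceeded(Exception):
    """Raised instead of letting the counter run past its documented limit."""


class MonthlyChangeLog:
    def __init__(self, limit: int = 32767):
        self.limit = limit
        self.count = 0

    def headroom(self) -> int:
        """How many more changes can be recorded this month."""
        return self.limit - self.count

    def record(self) -> int:
        """Record one change, or refuse loudly if the limit has been reached."""
        if self.count >= self.limit:
            raise CapacityExceeded(
                f"monthly limit of {self.limit} changes reached; "
                "existing data is intact, new changes are refused"
            )
        self.count += 1
        return self.count


log = MonthlyChangeLog(limit=2)
log.record()
log.record()
try:
    log.record()
except CapacityExceeded as exc:
    print("refused:", exc)
print("count still", log.count)  # count still 2
```

With something like headroom() exposed, operators could at least watch the remaining capacity shrink before the limit is hit, rather than finding out when everything stops.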
Re:Interesting Technical Detail ... by Donny Smith · 2005-05-03 06:56 · Score: 1

>>"These systems are just like physical assets," .... They become brittle with age,

>That's also assuming you can source replacement kit for hardware failures.

And how the hell is that different from what he said?

(Systems = hardware + software)
Re:Interesting Technical Detail ... by Jivecat · 2005-05-03 07:17 · Score: 1

How the hell can you calculate risk if your only input is the chronological age of a software system?

You gave a second input factor yourself... Number of IT staff who know FORTRAN = 0. Seems to me the equation is:

[Age of program] / [# of people to support it] = [infinitely high risk] ... or a DIV/0 error.

--
"For a successful technology, reality must take precedence over public relations, for nature cannot be fooled."--Feynman
Re:Interesting Technical Detail ... by BattleTroll · 2005-05-03 07:24 · Score: 4, Insightful

"How the hell can you calculate risk if your only input is the chronological age of a software system?"

That wasn't the only input in this case. In fact, you don't have to know the gory details of the implementation to determine risk, just the business impact of a problem with the system.
- Since no one at the company understood the language used, it stands to reason no one understood what the system was doing. Risk: Medium
- The system was mission critical to the performance of almost every other function of the airline. If the system was lost, the airline was hosed. Risk: Critical
- They had no failover plan in place in case the system went down. Risk: High
- No load tests were possible since they only had the one system in place. Without load testing the only way to find out the system fails under load is to wait until it fails in production. Risk: High
It stands to reason there were other risks involved that weren't identified in the article.
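The list above is effectively a small risk register, and it can be written down as one. The sketch below just records the four items with the ratings given and sorts them; the Risk enum, the REGISTER name and the wording of each entry are shorthand made up for the example.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# The four items from the list above, with the ratings given there.
REGISTER = [
    ("No one on staff understands the implementation language", Risk.MEDIUM),
    ("System is mission critical to almost every airline function", Risk.CRITICAL),
    ("No failover plan if the system goes down", Risk.HIGH),
    ("No load testing possible with only one system", Risk.HIGH),
]


def ranked(register):
    """Highest rating first; equal ratings keep their original order."""
    return sorted(register, key=lambda item: item[1], reverse=True)


for description, rating in ranked(REGISTER):
    print(f"{rating.name:<8} {description}")
```

None of these entries requires reading a line of FORTRAN, which is the argument being made: the business impact is enough to rate them.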
Re:Interesting Technical Detail ... by Peter La Casse · 2005-05-03 07:32 · Score: 4, Insightful

Software does not age.

Software does age. As a program grows older, people change it, its inputs and how it is used, and the older a program gets, the less the people making the changes are likely to understand it.

In addition, some bugs don't manifest themselves under usage patterns from 20 years ago, or when the software is run on hardware from 20 years ago, but they do manifest themselves under usage patterns or on hardware that's in use now. The more you change, especially without understanding all of the ramifications of that change, the greater the risk for error.

That's what software aging is.
Re:Interesting Technical Detail ... by ScuzzMonkey · 2005-05-03 07:51 · Score: 5, Insightful

"They had no failover plan in place in case the system went down."

With that, you've hit the heart of the matter, and what the article should have focused on rather than the "old software breaks down" BS. This was a bug which could have hit at ANY time since the software was installed; it was an overflow, not a rusting subroutine that fell off. I can't personally see any way that they could have foreseen this particular problem but when you have a system that is so critical to your operation, you don't look for problems it might have--you look for alternatives to fall back to when it DOES have problems.

You never see them coming. But you'd better plan for them anyway.

--
No relation to Happy Monkey
Re:Interesting Technical Detail ... by UdoKeir · 2005-05-03 07:56 · Score: 1

That's not aging, that's modification. We're talking about the same software running on the same hardware for 20 years. If nothing changes, it will continue to function.

Besides, from the text of the article I'd say nobody was modifying the software, since nobody in the company knew FORTRAN.

As for your explanation about running software on new hardware, you're changing the operating parameters.
Re:Interesting Technical Detail ... by ScuzzMonkey · 2005-05-03 07:58 · Score: 1

Who exactly was changing the code, since no one there supposedly knew FORTRAN?

And usage patterns are not conveniently tied to time, but rather, well, usage. The airline could have hit it big two months after this package was deployed and run into the exact same bug.

I think it is an easy to sell analogy to people who work on airplanes, but in fact software does not "age", and treating it as if it does is a fundamental risk factor in and of itself because doing so invites a complete misunderstanding of why software DOES fail.

--
No relation to Happy Monkey
Re:Interesting Technical Detail ... by UdoKeir · 2005-05-03 08:02 · Score: 1

Hardware is a physical asset. It's not like a physical asset, it is a physical fucking asset.

If you can't replace the hardware, your system hasn't become "brittle". A lack of compatible hardware does not cause your software to become "brittle".

You changed the operating parameters.
Re:Interesting Technical Detail ... by Peter La Casse · 2005-05-03 08:28 · Score: 2, Insightful

That's not aging, that's modification.

That's what software aging is. When people talk about software aging, they're not talking about something that doesn't exist, they're talking about the effect that I described: ongoing changes with less and less understanding of the system.

We're talking about the same software running on the same hardware for 20 years. If nothing changes, it will continue to function.

Change is inevitable. It is common and reasonable to expect change in the hardware, the inputs, the business models and the code itself. It's not reasonable to expect the same software to run on the same hardware under the same conditions for 20 years, even though it does happen in some extreme cases (like space probes.)
Re:Interesting Technical Detail ... by Peter La Casse · 2005-05-03 08:40 · Score: 1

Who exactly was changing the code, since no one there supposedly knew FORTRAN?

It is the system that changes when software ages; the system is comprised of software, hardware, data, documentation, users and business practices. It's not necessary for every one of those to change in order for change to occur.

in fact software does not "age"

This is a commonly held myth, and it leads people to think that maintenance of software-based systems is not necessary. That's a big mistake.

Software doesn't wear out, but a software-based system does deteriorate. That's what people are talking about when they refer to "software aging".
Re:Interesting Technical Detail ... by ScuzzMonkey · 2005-05-03 08:54 · Score: 1

Then it would be "system aging" not "software aging"; the mistake is in your choice of terminology. "Aging" in fact is still a terrible term to use to describe this process, since it has little or nothing to do with time and everything to do with modification and utilization. It's hardly a foregone conclusion that software-based systems will deteriorate without maintenance. You're making a poor generalization based on inaccurate assumptions of utilization across the board.

It escapes me why people feel the need to come up with inaccurate terms for things that are already easily described and then try to defend them when the inevitable misunderstanding results.

--
No relation to Happy Monkey
Re:Interesting Technical Detail ... by Peter La Casse · 2005-05-03 09:54 · Score: 1

Then it would be "system aging" not "software aging"; the mistake is in your choice of terminology.
It's a software-based system. In addition to being one of its components, "software" is also a synonym.
"Aging" in fact is still a terrible term to use to describe this process, since it has little or nothing to do with time and everything to do with modification and utilization.

It does have to do with time: as time progresses, poor modifications increase in number, increasing the bug count.

You're making a poor generalization based on inaccurate assumptions of utilization across the board.

I'm making a gross generalization, which of course looks poor if you consider atypical cases. In general, software does age, because it does change, and a common trend in legacy systems is for the developers making the changes to understand the system less and less as time goes on. This is true even when the developer is the creator; I understand the program that I put into maintenance mode 2 years ago much less now than I did when I worked on it every day. That's why good documentation is so important.

In addition, it's not my generalization; I'm simply parroting stuff that other people have discovered and documented.

It escapes me why people feel the need to come up with inaccurate terms for things that are already easily described and then try to defend them when the inevitable misunderstanding results.

What better term do you suggest? Bitrot? It's the same thing.
Re: Interesting Technical Detail ... by CrackHappy · 2005-05-03 11:57 · Score: 1

When reading your comment, I just about crapped myself laughing.

"I'm stuffed, I can't take no more input for a while"
I get the image of an overweight NY truck driver sitting at a greasy spoon holding his belly groaning.

"Oh dear, I'm confused, I'll be doing totally silly crap now"
This one is Winnie the Pooh on LSD. Man - that was hilarious, thanks!

--
1f u c4n r34d th1s u r34lly n33d t0 g37 l41d Capitalization really works: i helped my uncle jack off a horse
Re:Interesting Technical Detail ... by kupci · 2005-05-03 14:27 · Score: 1

Personally I think the 'age' metaphor is a poor way to comprehend this stuff, but if it helps you to understand this by thinking it is the software that ages, that's great.
Change is inevitable. It is common and reasonable to expect change in the hardware, the inputs, the business models and the code itself.

Exactly. So another way to think of it is this way: the problem is that the software doesn't age. It stays exactly the same, but everything else changes. So to understand this from a requirements perspective is more accurate.

For COMAIR, if they had installed the software, and had a great year, and then this snowstorm hit or whatever, and they needed to make 32K changes.. same deal.

One final example - this from the retail industry, where everything is driven off of SKUs. Company X buys this razzle-dazzle software that has a snazzy interface. They initially try to have SKUs by size, and realize they are going to quickly run out of SKUs, since the limit is 99,999. The problem was, the software company had never implemented this software at such a large store - they immediately broke the 'new' software. What was the thinking behind having a limit like 99,999? Anyway, moral of the story - this had absolutely nothing to do with 'age', and nothing to do with 'bugs' - this was a design limitation. And there will always be resource limitations. This is why you need to work with DBAs, for example, to decide how to configure your database.

It's not reasonable to expect the same software to run on the same hardware under the same conditions for 20 years, even though it does happen in some extreme cases (like space probes.)

Sure it is. You'd be surprised. Here's just one example of many. Your WI driver's license information, if you have one, is stored in a database system written in assembler. It was originally written for another state, and is probably pushing 30-40 years old (we are talking pre-SQL here). Why is it still in use? Isn't it too 'old'? Well, many reasons, I'll guess at a few: requirements haven't changed much. It's paid for. It works. It's supported (somehow) by IBM. It is extremely fast. Well designed. Talk about ROI.
Re:Interesting Technical Detail ... by Peter La Casse · 2005-05-03 14:56 · Score: 1

So another way to think of it is this way: the problem is that the software doesn't age. It stays exactly the same, but everything else changes.

In general, software is one of the things that changes. A significant source of "aging" is from changes made by a maintainer who is not familiar with the system. I'm talking about the general case, not the situation with Comair; I don't know what the maintenance history of the crew management system was.

It's not reasonable to expect the same software to run on the same hardware under the same conditions for 20 years, even though it does happen in some extreme cases (like space probes.)
Sure it is. You'd be surprised. Here's just one example of many. Your WI driver's license information, if you have one, is stored in a database system written in assembler. It was originally written for another state, and is probably pushing 30-40 years old (we are talking pre-SQL here). Why is it still in use? Isn't it too 'old'? Well, many reasons, I'll guess at a few: requirements haven't changed much. It's paid for. It works. It's supported (somehow) by IBM. It is extremely fast. Well designed. Talk about ROI.

It's likely that the hardware in question has had components wear out and be replaced during that span of time. It's also likely that the software has undergone some amount of modification, if only shortly after being deployed. The users have certainly changed, the business practices of the DMV may have changed and the system utilization has certainly increased as the population has increased.

I'm a big fan of building software that works for a long time, but it doesn't happen very often.
Re:Interesting Technical Detail ... by Nefarious Wheel · 2005-05-03 16:14 · Score: 1

How the hell can you calculate risk if your only input is the chronological age of a software system?
Ancient development meme referred to as "the rule of bit decay". This means that any piece of software, if left to itself, will eventually fail to work. Implied was the idea that the software doesn't change, but the surrounding factors do; e.g. a tape-based system works fine until someone replaces the tape drive with a slightly different model, forcing idiosyncratic code to fail where it depended upon idiosyncratic features of the hardware that were factored out in the next engineering change.

People try to avoid this with software change, but it rarely works -- hard to motivate people to check old code exhaustively.

--
Do not mock my vision of impractical footwear

Article text by daVinci1980 · 2005-05-03 05:40 · Score: 4, Informative

Site is already sluggish.

Bound To Fail
The crash of a critical legacy system at Comair is a classic risk management mistake that cost the airline $20 million and badly damaged its reputation.
BY STEPHANIE OVERBY

When Eric Bardes joined the Comair IT department in 1997, one of the very first meetings he attended was called to address the replacement of an aging legacy system the regional airline utilized to manage flight crews. The application, from SBS International, was one of the oldest in the company (11 years old at the time), was written in Fortran (which no one at Comair was fluent in) and was the only system left that ran on the airline's old IBM AIX platform (all other applications ran on HP Unix).

SBS came in to make a pitch for its new Maestro crew management software. One of the flight crew supervisors at the meeting had used Maestro, a first-generation Windows application, at a previous job. He found it clumsy, to put it kindly. "He said he wouldn't wish the application on his worst enemy," Bardes recalls. The existing crew management system wasn't exactly elegant, but all the business users had grown adept at operating it, and a great number of Comair's existing business processes had sprung from it. The consensus at the meeting was that if Comair was going to shoulder the expense of replacing the old crew management system, it should wait for a more satisfactory substitute to come along.

And wait they did. The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events: managing the approach of Y2K, the purchase of the independent carrier by Delta in 2000, a pilot strike that grounded the airline in 2001, and finally, 9/11 and the ensuing downturn that ravaged the airline industry.

A replacement system from Sabre Airline Solutions was finally approved last year, but the switch didn't happen soon enough. Over the holidays, the legacy system failed, bringing down the entire airline, canceling or delaying 3,900 flights, and stranding nearly 200,000 passengers. The network crash cost Comair and its parent company, Delta Air Lines, $20 million, damaged the airline's reputation and prompted an investigation by the Department of Transportation.

Chances are, the whole mess could have been avoided if Comair or Delta had done a comprehensive analysis of the risk that this critical system posed to the airline's daily operations and had taken steps to mitigate that risk. But a look inside Comair reveals that senior executives there did not consider a replacement system an urgent priority, and IT did little to disrupt that sense of complacency. Though everyone seemed to know that there was a need to deal with the aging applications and architecture that supported the growing regional carrier--and the company even created a five-year strategic plan for just that purpose--a lack of urgency prevailed.

After the acquisition by Delta, former employees say Comair IT executives didn't do the kind of thorough management analysis that might have persuaded the parent airline to invest in a replacement system before it was too late. Instead, Delta kept a lid on capital expenditures at Comair, with unfortunate consequences. The failure of the almost 20-year-old scheduling system not only saddled Delta with a plethora of customer service and financial headaches that the airline could ill afford but it also provides a cautionary tale for any company that thinks it can operate on its legacy systems for just...one...more...day.

The five-year plan that wasn't
Today, Cincinnati-based Comair is a regional airline that operates in 117 cities and carries about 30,000 passengers on 1,130 flights a day, with three or four crew members on each. But back in 1984, when Jim Dublikar joined the company as director of finance and risk management, Comair had

--
I currently have no clever signature witticism to add here.

Blowing smoke up your donkey by Anonymous Coward · 2005-05-03 05:42 · Score: 5, Funny

--------------------- Cut Here ---------------------
Posts above this line have not RTFA.

Re:Blowing smoke up your donkey by Anonymous Coward · 2005-05-03 06:26 · Score: 5, Funny

Posts under this line haven't RTFA either.
--------------------- Cut Here ---------------------

Why did this system fail? by hellfire · 2005-05-03 05:43 · Score: 4, Interesting

Okay, like many slashdotters, I have a short attention span and I don't remember this "public" story about Comair committing this blunder.

I have a real question. Why did Comair's system fail in the first place? Was it due to a design flaw requiring its replacement in 2004? Was it an irreplaceable piece of hardware which died?

The article smacks of FUD, only because systems fail for a reason. The article conveniently leaves out the reason for the failure. I think this is critical to any risk analysis. For example, if I have a 20 year old system that I can't get parts for, that's a high risk system. However, if I can get parts for a 20 year old system, then the risk is lower.

I don't like the idea of making assumptions that just because a system is 20 years old, that it absolutely must be replaced. I also don't like the assumption in the article that I already know the facts, so here's the analysis for you. I want the facts to back it up so I can come to my own conclusion.

--

"All great wisdom is contained in .signature files"

Re:Why did this system fail? by ggvaidya · 2005-05-03 05:48 · Score: 2, Interesting

Why did Comair's system fail in the first place?

If I understand the article correctly, the database could only handle 32,000-odd transactions in a month. In December 2004, rescheduling caused by bad weather caused the database to hit its limit exactly on Christmas Day, and everything shut down. It wasn't until December 29th that everything was back up again.

Oh, and they're still using the old system: they've divided the database up, with each half having its own 32,000-transaction limit, but that's about it.
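A rough sketch of what that split buys, assuming the cap is the signed 16-bit maximum of 32,767 per partition. The partition names and the load figures are made up for illustration; nothing here is taken from the actual system.

```python
# Hypothetical sketch: LIMIT, the partition names and the loads are assumptions.
LIMIT = 32_767  # assumed per-partition cap (largest signed 16-bit value)


def split_load(total_changes, percent_by_partition):
    """Return the number of changes each partition receives for a given split."""
    return {
        name: total_changes * pct // 100
        for name, pct in percent_by_partition.items()
    }


def over_limit(counts, limit=LIMIT):
    """List the partitions whose change count exceeds the cap."""
    return [name for name, n in counts.items() if n > limit]


# 40,000 changes in a month would have broken the single database.
even = split_load(40_000, {"half_a": 50, "half_b": 50})
skewed = split_load(40_000, {"half_a": 90, "half_b": 10})

print(over_limit(even))    # no partition is over the cap
print(over_limit(skewed))  # one partition still blows through it
```

The split only helps if the load actually spreads across both halves. Each half keeps the same hard limit, so a lopsided month can still fill one of them.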
Re:Why did this system fail? by Jayfar · 2005-05-03 05:52 · Score: 4, Informative

The article conveniently leaves out the reason for the failure.
No, the article conveniently explained that the sw had a limit of 32000 schedule changes per month. A severe winter storm necessitated enough changes to make the system fall over.
Re:Why did this system fail? by ggvaidya · 2005-05-03 05:53 · Score: 1

Ah, I remember the good ol' days when you could tell a True Geek from a mere Brilliant Programmer by whether or not he had memorized the maximum value of a signed 16-bit integer ...

Of course, 16-bit is now passé. Anybody know the maximum value of a 32-bit integer? Quickly, no looking allowed!
Re:Why did this system fail? by dejamatt · 2005-05-03 05:54 · Score: 1

I have a real question. Why did Comair's system fail in the first place? Was it due to a design flaw requiring its replacement in 2004? Was it an irreplaceable piece of hardware which died?

If I remember correctly, the ID number to index schedule changes (which was a 16-bit integer) overflowed. I think they said they never expected more than 32,000 changes in a month.

http://www.cincypost.com/2004/12/28/comp12-28-2004.html
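For anyone who hasn't watched a 16-bit counter roll over, here is a minimal sketch of the behaviour. It assumes the ID really was a signed 16-bit field that wraps silently; the function name is made up, and the original code was Fortran, not Python.

```python
import ctypes

INT16_MAX = 2**15 - 1  # 32767, the largest signed 16-bit value


def next_change_id(current):
    """Increment an ID the way a signed 16-bit field would, wrapping silently.

    ctypes.c_int16 does no overflow checking, so it truncates to 16 bits.
    """
    return ctypes.c_int16(current + 1).value


last_ok = next_change_id(INT16_MAX - 1)  # still a valid ID
wrapped = next_change_id(INT16_MAX)      # one step too far

print(last_ok, wrapped)
```

The last valid ID is 32767. The very next increment lands on -32768, which is no use as an index and is the kind of value that makes downstream code fall over.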
Re:Why did this system fail? by LiENUS · 2005-05-03 05:54 · Score: 1

4294967295. I've been trying to memorize the maximum value of a 64-bit integer, but that's quite a bit harder.
Re:Why did this system fail? by jsfetzik · 2005-05-03 06:06 · Score: 1

The system had an internal 'flaw' that limited it to a maximum of 32,000 changes per month before it would crash. Due to increases in size over the years and the large number of scheduling changes made during the Christmas holiday season this maximum was reached and the system crashed.

So there were really two design/coding flaws that caused the crash. First, the limit on the number of changes. Second, the lack of proper error handling when the maximum number of changes was reached. So it took both of these to bring the system down in a very messy way.
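To make the second flaw concrete, here is a minimal sketch of a counter that keeps the cap but refuses cleanly when it is reached. The class names, the tiny limit of 3 and the "manual queue" fallback are all hypothetical; this is not how the real system was built.

```python
class ChangeLimitReached(Exception):
    """Raised when the change counter cannot accept another entry."""


class ChangeCounter:
    def __init__(self, limit=32_767):
        self.limit = limit
        self.count = 0

    def record(self):
        # The cap (flaw 1) is still there, but hitting it is now an explicit,
        # recoverable error instead of a silent overflow (flaw 2).
        if self.count >= self.limit:
            raise ChangeLimitReached(
                f"{self.count} changes recorded; limit is {self.limit}"
            )
        self.count += 1
        return self.count


def apply_change(counter, pending):
    """Record a change, or park it in a hypothetical manual queue."""
    try:
        return counter.record()
    except ChangeLimitReached:
        pending.append("needs manual handling")
        return None


counter = ChangeCounter(limit=3)  # tiny limit so the behaviour is visible
pending = []
results = [apply_change(counter, pending) for _ in range(5)]
print(results, len(pending), counter.count)
```

With either flaw fixed the outcome is survivable. A bigger counter postpones the problem, while clean error handling turns a crash into a backlog.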
Re:Why did this system fail? by voidptr · 2005-05-03 06:10 · Score: 1

7FFFFFFF or FFFFFFFF depending if it's signed or not.

What, you wanted it in decimal?
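For those who did want it in decimal, a small sketch that computes the limits instead of memorizing them, assuming ordinary two's-complement integers:

```python
def int_limits(bits):
    """Return (signed_max, unsigned_max) for a two's-complement integer width."""
    return 2 ** (bits - 1) - 1, 2 ** bits - 1


for bits in (16, 32, 64):
    signed_max, unsigned_max = int_limits(bits)
    print(
        f"{bits}-bit: signed {signed_max:,} (0x{signed_max:X}), "
        f"unsigned {unsigned_max:,} (0x{unsigned_max:X})"
    )
```

For 32 bits that gives 2,147,483,647 signed (7FFFFFFF) and 4,294,967,295 unsigned (FFFFFFFF).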

--
This .sig for unofficial government use only. Official use subject to $500 fine.
Re:Why did this system fail? by ggvaidya · 2005-05-03 06:24 · Score: 1

I said signed. And yes, decimal.

Btw, I'm afraid I looked at your sig while at home :|. Where do I have to pay my 500$ fine? Much thanks.
Re:Why did this system fail? by QuestorTapes · 2005-05-03 06:40 · Score: 2, Insightful

> I don't like the idea of making assumptions that just because a system is 20 years old, that it absolutely must be replaced.

>...if I have a 20 year old system that I can't get parts for, that's a high risk system.
> However, if I can get parts for a 20 year old system, then the risk is lower.

Good points. The article does contain some facts, though. The system was Fortran based, ran only on one aging hardware platform, and no one at Comair knew Fortran. Those are risk factors with older software.

A better lesson than the article's implied "don't use old software" lesson would be, "periodically review legacy systems against changed business conditions and environment. Do not assume that software and hardware will continue to function when the business environment changes significantly."
Re:Why did this system fail? by Junta · 2005-05-03 06:56 · Score: 1

So it does sound like, essentially, they likely didn't know specifically how it was going to break, but simply had the mindset of 'it's old, something's going to break, we need to refresh this'. That would be dangerous thinking if it is true; it just happened to be the case this time.

The reaction most people are having is to say 'code is 20 years old, throw it out and redo it right!' which is a really bad philosophy for proven systems. In this case, for example, the prudent response is to examine the code and review all the unhandled limitations/wraparounds of this nature, making them all more cleanly handled and if unable to address modern needs, conservatively work to up the limits. What you have is 20 years of proven reliability; code doesn't automatically age and become crap without being touched.

Their solution is probably the best short term solution, divide things up so that the limit is not hit, while preserving the proven reliability aspect of the code.

At a minimum, there should be some amount of work to make sure the value doesn't wrap, and produces some more acceptable effect even if not incrementing the size of the data type.

--
XML is like violence. If it doesn't solve the problem, use more.
Re:Why did this system fail? by Bravo_Two_Zero · 2005-05-03 07:49 · Score: 1

One of the major drivers to replacing older systems is in-house programming knowledge. It's not enough that you may not have Cobol/Fortran/Business Basic developers on hand who intimately know the legacy code. You may not have *any* competent developers on staff at all for those languages, because the market for them might be the size of an ant's navel. Heck, you might not even own the code itself.

Even if you do have a couple, they'll be older and likely not replaceable at retirement. Documentation is helpful, but not an absolute answer. Compare a legacy app to, say, Latin. Documentation in Latin is better than nothing, but then you have to learn Latin, learn it fluently and learn its syntactical gotchas before you can read it.

It's very true you can't assume a 20-year-old system is bad based purely on age. That's exactly right. In Comair's case, though, it's an older package, likely with poor ongoing vendor support. And, the article notes the lack of internal hardware knowledge (AIX) as well.

It's absolutely not FUD to say that poor risk analysis of the costs of not moving forward leads to business disasters. It doesn't always, but that's the point of the risk analysis. There were probably a dozen ways to mitigate the risk to a more acceptable level, but Comair didn't take those steps.

--

Amateurs discuss tactics. Professionals discuss logistics.
Re:Why did this system fail? by Fulcrum+of+Evil · 2005-05-03 09:10 · Score: 1

I don't like the idea of making assumptions that just because a system is 20 years old, that it absolutely must be replaced. I also don't like the assumption in the article that I already know the facts, so here's the analysis for you. I want the facts to back it up so I can come to my own conclusion.

How about this: every few years, reexamine the limitations and requirements of the system. Upgrade or replace the system when it gets too close to those limits.

--
"We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"

Re:Yep by code_chick · 2005-05-03 05:44 · Score: 1

Not to be completely cynical - but I'm not surprised they missed the bug... They (or their software vendor) probably don't have the QA staff and/or expertise to do regression testing. But then again, who does? Why is it that very few software companies seem to consider QA important?

I have a friend that lived this nightmare. by zorkmid · 2005-05-03 05:44 · Score: 2, Insightful

Even though she was screaming from the highest mountain to anyone and everyone who would listen that doom was rushing towards them, that bad, bad things were going to happen, she was still made the sacrificial goat when the fecal material hit the rotating blades.

And this was for a federal agency.

Scary no?

Re:I have a friend that lived this nightmare. by ggvaidya · 2005-05-03 05:50 · Score: 1

Most. graphical. slashdot. post. ever. :|
Re:I have a friend that lived this nightmare. by operagost · 2005-05-03 06:35 · Score: 1

If she'd screamed in writing, then she'd be collecting hefty sums of severance pay right now.

--

Gamingmuseum.com: Give your 3D accelerator a rest.

Re:Yep by airrage · 2005-05-03 05:45 · Score: 4, Insightful

To me, when you look at code you always want to rewrite it thinking you could do it better. But if you look at what they had to work with, you realize most coders (at a given time) write pretty good code.

This software has been working for over 20 years! What will your code look like in 20 years? I doubt it has the same track record. I'm not sure foresight was a problem. I think they did the best they could with language and hardware of the day.

The Comair meltdown wasn't a software problem, if you ask me; it was that the business changed.

--
"This isn't a study in computer science, it's a study in human behavior"

Do it right the first time. by Anonymous Coward · 2005-05-03 05:45 · Score: 1, Interesting

The Laws of Software Process: A New Model for the Production and Management of Software

"Armour, a consultant in software development, reveals a new structure for software development that redefines the nature and purpose of software. He explains how, in the modern knowledge economy, software systems are not products in the classical sense, but are the modern channels for the conveyance of information. From this perspective, he examines programming languages, quality, cost estimation, and project management, and demonstrates how to overcome common problems that afflict software development and use. The book is distributed by CRC. Copyright © 2004 Book News, Inc., Portland, OR"

Re:Yep by EnronHaliburton2004 · 2005-05-03 05:48 · Score: 4, Insightful

How could nobody in 11 years see that the changes were counted with a 16 bit signed integer?

If this company was run by a typical big company, somebody DID complain about this 16-bit signed integer. Chances are, they were told to shut up about it and not rock the boat. This frequently happens when someone points out a bug which would require a fundamental change to the system.

Most companies only like employees who think inside the box, despite telling people to think outside the box.

--
94% of Repubs and 21% of Dems voted to renew the Patriot Act

Risk Management is Complex by justanyone · 2005-05-03 05:48 · Score: 4, Interesting

I used to work in the Risk Management department of the capital markets division of a large international bank as a programmer.

When I started, 4 years ago, the reports generated were basically compilations by a cut-and-paste-monkey staff (despite being highly trained, very conscientious individuals) of reports generated by other departments. I was part of a team that reformed the IT basis for creating risk reporting, and found that while there was a lot of expertise and complex methods available, what was actually implemented was much much smaller for the simple reason that it was tough to get the right reports generated given the inputs the department was given.

The project I worked on parsed the input data from the Excel spreadsheet inputs and loaded it to a database, where it could then be queried intelligently and nice reports generated. These reports were growing very fast in complexity, building towards the best toolsets available for determining the actual risk the bank was taking.

Several points about this job were fascinating:
1. How many departments are so caught up in the minutiae of "getting the report out" that they don't have time to examine the contents of it;
2. How much money can be made by knowing what the actual risk is. If you don't know the risk, you estimate high, and put lots of dollars in a reserve account. If you do know the risk accurately, you usually can greatly lower reserves to accurately meet even very bad case estimated losses, and use the rest of the money to fund interest-generating ventures.
3. How much the banking consolidation trend is increasing, due to the repeal of Glass-Steagall allowing multi-state banks to gobble and grow. This makes a consumer's life better because of more resources being available (auto-bill-pay, check images, etc.).

It was a fun job. Then I found another one where I get to play with Python!

-- Kevin

--
Unitarian Church: Freethinkers Congregate!

Re:Risk Management is Complex by Tackhead · 2005-05-03 05:59 · Score: 2, Interesting

> I used to work in the Risk Management department of the capital markets division of a large international bank [jpm[*cough*].com] as a programmer
>
>[...]
>
>It was a fun job. Then I found another one where I get to play with Python!
Huh? The story's supposed to end with the line "VAXen, my children, just don't belong some places." :-)
Re:Risk Management is Complex by jayloden · 2005-05-03 08:13 · Score: 1

just out of curiosity, where do you work now that you get to work with Python?

-Jay
Re:Risk Management is Complex by Skjellifetti · 2005-05-03 08:44 · Score: 1

2. How much money can be made by knowing what the actual risk is. If you don't know the risk, you estimate high, and put lots of dollars in a reserve account. If you do know the risk accurately, you usually can greatly lower reserves to accurately meet even very bad case estimated losses, and use the rest of the money to fund interest-generating ventures.

This approach usually defines risk independently (typically as variance around a mean) for each individual item. The items are then observed (or just assumed) to have low enough correlations so that if one item fails, enough others are OK so that the reserves need only cover the probability of a small number of failures.

Hedge fund LTCM took this strategy to the limit by borrowing heavily -- in effect reducing reserves to a large negative number. LTCM essentially made money by noting when the cost differences of various instruments were out of line with their historic risk patterns. But even Nobel Prize winners can't estimate all potentially significant risks correctly. Once in a great while, everything crashes together and reserves aren't enough, let alone negative reserves. When the Russian debt crisis happened in the fall of 1998, all of the instruments LTCM dealt in suddenly moved in lockstep as investors around the world fled to safer things like T-bills. When LTCM died, they almost took the capital markets down with them because large international banks had loaned LTCM their play money based on those risk models.
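A toy illustration of why "moving in lockstep" wrecks reserves sized on low correlations. It assumes n items with identical standard deviation and one shared pairwise correlation, and that reserves scale with the standard deviation of the total. That is a textbook simplification, not LTCM's or any bank's actual model.

```python
import math


def portfolio_std(n, sigma, rho):
    """Std dev of the summed loss of n items, each with std `sigma` and the
    same pairwise correlation `rho` (a simplifying assumption)."""
    # Var(sum) = n * sigma^2 + n * (n - 1) * rho * sigma^2
    variance = n * sigma**2 + n * (n - 1) * rho * sigma**2
    return math.sqrt(variance)


independent = portfolio_std(100, 1.0, 0.0)  # uncorrelated items
lockstep = portfolio_std(100, 1.0, 1.0)     # everything moves together

print(independent, lockstep, lockstep / independent)
```

With 100 uncorrelated items the total risk is 10 units; when the same items move together it is 100. Reserves sized for the first case are ten times too small for the second.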

--
FreeSpeech.org
Re:Risk Management is Complex by justanyone · 2005-05-03 09:12 · Score: 1

usually defines risk independently

The reality of modern banking risk management (in my experience at Bank One, which became JPMC) was that there were many different measures of risk attached to each exposure. The popular ones are the standard short term 'delta', or DV01, which measures a specific 1-day interest rate risk, gamma, vega, etc.

There's also something called stress testing, and it usually involves lots of cycle time to run (we ran it over weekends). This would take several scenarios, including the 1987 crash, the LTCM scenario, and even some aspects of the Barings Bank scenario (one trader sacked the bank, and other banks were exposed because Barings was on the other end of some financial deals).

The QED people at Bank One were (and are) serious brainiacs that worked very hard coming up with scenarios and models that would severely test, in as real-world ways as possible, a deal or set of deals. I highly respected this group; they were fun to work near. Some conversations included how their models coped with "out-there" scenarios like large scale (9/11 and significantly larger) attacks and their effects on the markets.

Suffice to say, the larger US banks have put significant effort into making sure the risks they're taking on are accurately measured. But, that task is very large, and challenging computationally, organizationally, mathematically, and personally. JPMC (Bank One) wasn't perfect at the risk measurement game, but they were trying hard.

--
Unitarian Church: Freethinkers Congregate!

Fortran job openings? by joschm0 · 2005-05-03 05:49 · Score: 1

I would love to work on an old Fortran system again, especially if it's in Fortran IV. Yes indeed, those were the good old days.

--
01/20/09

Re:Yep by Zemplar · 2005-05-03 05:50 · Score: 3, Funny

"I wonder if this was a hw limit or a sw limit..."

I can assure you it was indeed a hardware or software limit.

software decays by ecklesweb · 2005-05-03 05:51 · Score: 4, Interesting

One of the interesting quotes from the article:

Unfortunately, you can't see a crew management system age the way you can see an airplane rust. But they do.

I find that an interesting if not slightly obvious insight. The interesting part is that you can know that software is decaying, but I don't know of any effective way to measure that decay. I don't even know of any particularly good ways to characterize the decay. It's not as if new defects are being introduced into code that's not changing. But the environment in which the software operates changes, and that change is analogous to weather corroding a piece of physical equipment. Every time the OS gets a patch, the filesystem changes, a shared library is upgraded, the underlying hardware changes, there's a chance of triggering a failure in the software.

Can it be proven, or should we otherwise reasonably believe, that the probability of catastrophic system failure approaches 1 as the age of the system increases? Maybe a good topic for a research paper...
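One way to frame the question is a back-of-the-envelope sketch. It assumes every year carries the same, independent chance p of an environment change triggering a catastrophic failure. That assumption is a strong one and the 2% figure is invented, so treat this as a thought experiment and not a measurement.

```python
def failure_probability(p_per_year, years):
    """Chance of at least one failure within `years`, assuming an independent,
    constant per-year failure probability (a strong assumption)."""
    return 1 - (1 - p_per_year) ** years


# Invented figure: a 2% chance per year of a catastrophic failure.
for years in (1, 10, 20, 50):
    print(years, round(failure_probability(0.02, years), 3))
```

Under that model the probability does approach 1 for any p greater than zero: 1 - (1 - p)^n tends to 1 as n grows. If p is exactly zero it stays at zero forever, so the real argument is about whether p can be driven to zero.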

Re:software decays by HawkinsD · 2005-05-03 06:13 · Score: 1

Maybe now's a good time to put in a plug for the RISKS "Forum On Risks To The Public In Computers And Related Systems."
It sounds academic, but it's full of level-headed dissection of all kinds of software-related disasters, ranging from the hilarious, like the USS Yorktown dead in the water after a divide by zero, to the horrifying. The contributors are skeptical but polite, and I learn new stuff with every issue.

--
Never attribute to malice that which can be explained by mere idiocy.
Re:software decays by ecklesweb · 2005-05-03 06:47 · Score: 1

I disagree. I think bringing in the topic of changing requirements over time muddies the issue. Yes, requirements do change over time and maintenance intended to implement new or changed requirements can cause defects and failures. However, at issue here is whether a piece of software, fully functioning at time "A" and not [significantly] modified in terms of its own code base, will decay and fail at some time "B" in the future.

Assuming for the moment that the software program itself remains static, can you characterize and measure decay of the program, and is it the result of anything besides a changing (system? operating?) environment in whose context the software program runs?
Re:software decays by adjuster · 2005-05-03 06:55 · Score: 3, Insightful

But the environment in which the software operates changes, and that change is analogous to weather corroding a piece of physical equipment. Every time the OS gets a patch, the filesystem changes, a shared library is upgraded, the underlying hardware changes, there's a chance of triggering a failure in the software.

It's rather sad, to me, that we design these wonderful machines that can perform logical operations in great quantities with a high degree of repeatability and low occurrence of failure, then create a culture around them that encourages sloppiness, and ultimately introduces a large measure of uncertainty into the operation of these machines. I am baffled at the perverse desire-- nay need-- that people seem to have to make software suffer from entropy.

The only "decay" in software should happen as a result of changing business requirements. There's no reason, provided the business requirements don't change, that a well-designed and properly implemented piece of software should not be usable in perpetuity. There may be changes in the underlying hardware and operating system software, but provided that the application is sufficiently abstracted from the underlying platform (or, provided that an emulation-layer for the original platform can be constructed) there's no reason other than changing business requirements for software to be "thrown away".

Let's put this a different way: How does a patch to the underlying operating system cause an application to fail? If the patch changes the behaviour of the underlying operating system in such a manner as to return unexpected values to the application, the patch is the cause of the failure. A flawed patch doesn't make an application "age" or "decay"-- it's simply a flawed patch. An application has to make assumptions about the underlying operating system. These assumptions are based on the API documentation-- the contract between the operating system and the application. When the OS violates the terms of the contract, that doesn't mean the application "decayed"-- it means some moron who coded the operating system patch messed up, and the operating system manufacturer/maintainer didn't perform good regression testing.

We should be designing software systems with 10 to 20 year usability goals. It would do a lot for the frustration level that the "suits" have with IT if we stopped being proponents of hugely expensive but "throwaway" systems, and started designing systems with an eye for longevity.

--
The Attitude Adjuster, I hate me, you can too.
Re:software decays by hawaiian717 · 2005-05-03 07:02 · Score: 4, Insightful

The only "decay" in software should happen as a result of changing business requirements.
Exactly. This software would have failed the month after it was installed if Comair had needed to do 32,001 changes in that month. But when it was installed, Comair wasn't that big, so having to do that many changes was not something that was considered. Now that Comair has grown considerably, the business requirement has changed but the application has not kept up.

--
End of Line.
Re:software decays by qwijibo · 2005-05-03 07:08 · Score: 2, Informative

The problem can also occur because the original application is tested against the real system, not the documented API. So a bug fix to the underlying system can both be correcting a bug and create an application error.

Throwaway systems are cost effective in the short term. That makes them popular with people who look at this quarter's stock price as both a goal and the duration of their attention.
Re:software decays by Coryoth · 2005-05-03 07:10 · Score: 1

But the environment in which the software operates changes, and that change is analogous to weather corroding a piece of physical equipment. ... The interesting part is that you can know that software is decaying, but I don't know of any effective way to measure that decay.

There's a relatively easy way to measure such "decay". When first designing the software, do proper requirements gathering and write a full formal requirements specification (there are specification languages specifically for this purpose). As a maintenance check you run a new requirements gathering process and develop a new formal specification of requirements. If you used the right tools (that is, the right specification language) to do this you will actually be able to compare the requirements used to develop the software with the requirements now in a precise way, getting a very specific measure of how requirements have deviated over time.

If you're really good you developed the software by refining the requirements specification and retained the specification refinements as documentation of the code. Thus you will be able to get a complete measure of how much the change in requirements affects the code, and thus if you update the code (similarly updating the refinement specifications) you can be sure you've properly covered all changes that need to be made, and no effects have gone unnoticed.

All of this is, of course, work. It is not dissimilar work to bothering to have a proper specification of your plane, having regular complete maintenance checks to make sure parts aren't worn beyond their tolerances, and fixing them if they are. People are willing to do that work on the planes, but aren't willing to do the similar work for software. This is mostly a matter of expectations, and the general perceptions of software.

Yes a plane crashing is a lot worse than your booking and ticketing system crashing, so you want to take care. Then again the software caused an awful lot of trouble and lost the company a remarkably large amount of money, so maybe that care was worth taking after all.

Jedidiah.

--
Craft Beer Programming T-shirts
Re:software decays by Shotgun · 2005-05-03 07:48 · Score: 1

I find it a rather humorous 'insight'. If the airplanes are 'rusting' then they don't need to worry about software.

Modern airplanes don't rust. They die of metal fatigue, which aluminum is much more prone to than 4130 steel.

--
Aah, change is good. -- Rafiki
Yeah, but it ain't easy. -- Simba
Re:software decays by NotPeteMcCabe · 2005-05-03 10:21 · Score: 1

And the reason the application has not kept up is because, when the app was first written, the assumptions made were not documented and reviewed on an ongoing basis. Suppose there had been a spec that showed that the system can handle 32k changes in a day but no more. Every month or year you examine your change volume to see how close you are to crashing your system. When it starts getting close, you fix it.
This type of approach would have prevented this problem completely. But I was a programmer for seven years and nothing I saw during that time makes me believe that anything like what I've suggested could ever happen in the real world.
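The periodic review described above can be as small as the following sketch. The limit, the 75% warning threshold and the monthly figures are all made-up illustrations, not Comair data.

```python
def headroom_report(period_counts, limit=32_767, warn_at=0.75):
    """Compare the busiest period on record against a documented hard limit."""
    peak = max(period_counts)
    utilization = peak / limit
    return {
        "peak": peak,
        "utilization": round(utilization, 3),
        "action_needed": utilization >= warn_at,
    }


# Invented change counts for four review periods, growing with the business.
history = [9_800, 11_200, 14_500, 26_000]
report = headroom_report(history)
print(report)
```

Here the busiest period sits at roughly 79% of the cap, so the review flags it long before a bad storm can push the count over. The hard part is not the arithmetic; it is writing the limit down somewhere and actually running the check.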
Re:software decays by CharlieG · 2005-05-03 12:10 · Score: 1

The BIG problem is this. You design an application to do X, per the spec, and you write a really GOOD program to do X. It works GREAT

Now someone comes along and says "we need the program to do X + Y". You stop, you think, and say "well, I can add in the Y functions HERE. It may not be perfect, but there is no budget or time to refactor the entire system to do X + Y" Then along comes someone who wants to add Z. Again, the process is "We need it next week, and for N dollars" - the RIGHT answer may be to actually re-write the whole thing, but the company won't give you the money, so you graft on Z

Keep doing this for 20 years, and sometimes 100s of releases, and what you find is you have things that make NO sense, but WORK "gee, this routine calculates this number, and changes it to a string, and the calling routine changes back to a number." Why? Because some OTHER routine wanted a string 20 years ago, and no one needed a number

The time and effort to redesign your legacy system the right way is USUALLY called "replacement" - it's easier to get "New" than to fix old - trust me, it's the "gee whiz" factor. It's "New", so it MUST be better

--
-- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
Re:software decays by adjuster · 2005-05-03 13:26 · Score: 1

Now someone comes along and says "we need the program to do X + Y". You stop, you think, and say "well, I can add in the Y functions HERE. It may not be perfect, but there is no budget or time to refactor the entire system to do X + Y" Then along comes someone who wants to add Z. Again, the process is "We need it next week, and for N dollars"

Keep doing this for 20 years, and sometimes 100s of releases, and what you find is you have things that make NO sense, but WORK

You're an apologist for poor software engineering practices. Think about what a civil engineer would do when working on a public works project, with respect to safety. Software developers should be doing that with respect to maintainability. The fact that the "legacy systems" designation even exists is proof enough that adopting better software engineering practices and implementing lifecycle processes for software would result in a better end result for all involved.

The problem may start w/ "the suits" wanting the application to "do X", but it ends with the developers who give it to them w/o thinking about the consequences.

*sigh* I suppose it is human nature to steal from the future to bankroll today...

--
The Attitude Adjuster, I hate me, you can too.
Re:software decays by CharlieG · 2005-05-03 23:02 · Score: 1

Actually, in civil eng, it's done too: additions to houses that don't quite mesh perfectly, etc. Thing is, on a house, you do 1-2 additions over the life of the house; in software, that happens dozens of times.

An apologist for it? No, I'm just looking at the reality of the real world. When they ask for X + Y and you want to refactor it, say you come in with an internal cost of 100k; in the meantime, your outsourced competition says "I'll do it for 20K" (they won't be around to pick up the wreck in 10-20 years, so...). Who's going to GET the work, and who's going to be saying "would you like fries with that?"

--
-- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso

Re:Yep by Anonymous Coward · 2005-05-03 05:52 · Score: 2, Insightful

Well, IIRC, the problem was that the 32K limit was known but not considered important, and the crew assignment *changes* were WAY under that number.

But during the holidays, the storm forced a *lot* of crew *changes*, several times over, and they went outside the limit.

It seems that the feeling was "there is no way we have 32K *changes* in a single month", and /that/ was proved wrong.

/.ed by christoofar · 2005-05-03 05:52 · Score: 4, Funny

Wow. Looks like even the mag for CIOs can't keep up with a /. DDoS attack. Maybe the CIO for CIO should be fired?

Crew Scheduling system? How about Aircraft maint by NetNinja · 2005-05-03 05:54 · Score: 4, Insightful

If the crew scheduling system was as old as the hills, how old is the system used to track aircraft maintenance? Oh wait, that issue will be addressed when we crash an aircraft.

Maintenance manuals and procedures are written in blood. The next tragedy will be no different.

Here's the by LouCifer · 2005-05-03 05:55 · Score: 1

Mirrordot mirror of article for your viewing pleasure.

--
Religion is for people afraid of going to hell.

Re:Yep by ucblockhead · 2005-05-03 05:55 · Score: 2, Insightful

Probably because when the code was written, there were only 25 planes, so 32,000 changes "would never be reached". There's probably a decent chance the line had a comment that said something like "// Make sure to change this if we ever need more than 32,000 changes!".

Then, no one looked at the code for 11 years.

That's how this happens. Not because people are stupid, but because people simply aren't looking at the old crufty code. They're too busy with new projects.

--
The cake is a pie

Re:Yep by Monf · 2005-05-03 05:58 · Score: 1

The Comair meltdown wasn't a software problem if you ask me; it was that the business changed.

I should clarify: by sw problem or hw problem, I am not criticizing the code, rather was the os (or the version of Fortran, or the db system) the limitation, or was it the hardware or both?

As to risk assessment, ensuring that the system can handle the numbers should be a top priority and should be watched: this was well known by 2004 and had been at the forefront of things to look at since Y2K. You're exactly right, it was that the volume of the business changed, and that should have been a deciding factor in rushing a changeover, not that the UI wasn't pretty enough or exactly like the old one...

--
Pay no attention to that man behind the curtain.

It's a legacy system by morryveer · 2005-05-03 05:59 · Score: 3, Funny

Legacy == Bad, gonna die, just like dear Grandad. Should've rewritten it in Java, that'd fix it!

32767 + 1 = -32767, or maybe zero, or maybe NaN by Tangurena · 2005-05-03 06:02 · Score: 1

With a signed 16-bit integer, you have 1 bit for the sign, and 15-bits for the rest of the number. Depending upon any error handling by the compiler, you could get NaN (not a number), maybe zero, maybe -32767, or maybe just a core dump. In any event, the result is not what you are expecting.

Re:32767 + 1 = -32767, or maybe zero, or maybe NaN by guitaristx · 2005-05-03 07:31 · Score: 1

Actually, twos-complement makes darn sure that you don't ever have an NaN value in an integer. Since most hardware represents signed values using twos complement, I'd say that NaN would be right out.
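
For what it's worth, here is a minimal sketch of that wraparound in Python. It assumes two's-complement 16-bit storage with no overflow trapping, which is only a guess about the system in question; `wrap_int16` is a made-up helper, not anything from the actual application.

```python
INT16_MIN = -(1 << 15)     # -32768
INT16_MAX = (1 << 15) - 1  # 32767


def wrap_int16(value: int) -> int:
    """Reduce an integer to what a 16-bit two's-complement slot would hold."""
    return (value - INT16_MIN) % (1 << 16) + INT16_MIN


if __name__ == "__main__":
    print(wrap_int16(INT16_MAX))      # 32767
    print(wrap_int16(INT16_MAX + 1))  # -32768
```

Under those assumptions the counter lands on -32768: not NaN, not zero, and not -32767 either. Hardware or a compiler that traps on overflow would behave differently.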

--
I pity the foo that isn't metasyntactic
Re:32767 + 1 = -32767, or maybe zero, or maybe NaN by colinrichardday · 2005-05-03 10:57 · Score: 1

And why did they need a signed integer? Did they have negative changes?

Re:Yep by tomhudson · 2005-05-03 06:04 · Score: 4, Funny

Hindsight is 20/20.

You mean like this story (the lesson being that what seems like a good thing at the time can become an unmitigated disaster):

I like monkeys. The pet store was selling them for five cents a piece. I thought that odd since they were normally a couple thousand each. I decided not to look a gift horse in the mouth. I bought 200. I like monkeys. I took my 200 monkeys home. I have a big car. I let one drive. His name was Sigmund. He was retarded. In fact, none of them were really bright. They kept punching themselves in their genitals. I laughed. Then they punched my genitals. I stopped laughing. I herded them into my room. They didn't adapt very well to their new environment. They would screech, hurl themselves off of the couch at high speeds and slam into the wall. Although humorous at first, the spectacle lost its novelty halfway into its third hour. Two hours later I found out why all the monkeys were so inexpensive: they all died. No apparent reason. They all just sorta' dropped dead. Kinda' like when you buy a goldfish and it dies five hours later. Damn cheap monkeys. I didn't know what to do. There were 200 dead monkeys lying all over my room, on the bed, in the dresser, hanging from my bookcase. It looked like I had 200 throw rugs. I tried to flush one down the toilet. It didn't work. It got stuck. Then I had one dead, wet monkey and 199 dead, dry monkeys. I tried pretending that they were just stuffed animals. That worked for a while, that is until they began to decompose. It started to smell real bad. I had to pee but there was a dead monkey in the toilet and I didn't want to call the plumber. I was embarrassed. I tried to slow down the decomposition by freezing them. Unfortunately there was only enough room for two monkeys at a time so I had to change them every 30 seconds. I also had to eat all the food in the freezer so it didn't all go bad. I tried burning them. Little did I know my bed was flammable. I had to extinguish the fire. Then I had one dead, wet monkey in my toilet, two dead, frozen monkeys in my freezer, and 197 dead, charred monkeys in a pile on my bed. 
The odor wasn't improving. I became agitated at my inability to dispose of my monkeys and to use the bathroom. I severely beat one of my monkeys. I felt better. I tried throwing them away but the garbage man said that the city wasn't allowed to dispose of charred primates. I told him that I had a wet one. He couldn't take that one either. I didn't bother asking about the frozen ones. I finally arrived at a solution. I gave them out as Christmas gifts. My friends didn't know quite what to say. They pretended that they liked them but I could tell they were lying. Ingrates. So I punched them in the genitals. I like monkeys.

Same thing with the code in question. It seemed good when it was written, but it didn't stand the test of time, and ended up with a lot of people getting a swift kick in the you-know-whats.

Or for another example of hindsight and the law of unanticipated consequences, just sing the first few bars of "Alice's Restaurant".

I told you so ... NOT! by argoff · 2005-05-03 06:05 · Score: 3, Insightful

It is always easy to say "I told you so" after the fact, but the reality is that this failure has far more to do with the company's attitude about technology than the failure of somebody to say "look out!". In fact, by the sounds of it, the entire application could probably be run on 2 souped-up PCs running in parallel in different co-locations over the internet - the hardware and infrastructure would not cost a lot.

Even worse, when these types of failures happen, the ole "policy and procedure" routine kicks in.

To tell a story, one time I went to a boarding school, and at the beginning of the year they had almost no rules, and then whenever something went wrong they added a new rule. Well, needless to say, at the end of the year there were so many rules, people could get reprimanded for flushing the toilet twice instead of once! Not having their shoes tied left over right, etc .....

Well, I grew up and found the same is true in companies. How much you wanna bet they are gonna lose more than 20 million from too many piled-up policies and procedures that keep anyone from getting anything done?

Risk management by uweg · 2005-05-03 06:05 · Score: 3, Insightful

Well, the problem starts with being born or getting up in the morning. And a system that has been running for 20 years normally doesn't start to stink by itself.

OTOH, what does "Risk management" in IT really mean, besides drawing nice PowerPoints and putting a chapter "Risk analysis" into change request forms, that are normally filled in with "No risk, no fun!" or "If I make a very big mistake, it will extinguish mankind"?

Re:Yep by code_chick · 2005-05-03 06:05 · Score: 2, Informative

Sorry - I mistakenly drifted into the IT section of Slashdot... You IT guys are all so threatened by real developers! (since you're all just developer wannabes) And a female developer - that's the scariest of all! I won't make this mistake again... I wouldn't want to subject you to crying in the fetal position.

A game of Jenga by lake2112 · 2005-05-03 06:06 · Score: 4, Insightful

Unfortunately, it is commonly seen that upper management abides by the if-it-ain't-broke-don't-fix-it mentality. With many systems there is a huge amount of pressure to fix bugs and outstanding issues; once that is done, they work on money-making initiatives. I see it as a game of Jenga. Pieces are removed from the bottom to create a taller structure. Instead of reinforcing the base, there is a constant push to make the tower taller until it comes crashing down.

Bonehead, it's the same reason... by Gruneun · 2005-05-03 06:07 · Score: 1

that your comment has failed: Lack of attention

What if they won't listen? by swb · 2005-05-03 06:14 · Score: 3, Informative

I work in a business that isn't defined by technology (at least not historically), and I don't think that management actually listens or comprehends when it comes to a lot of IT issues.

When they do listen, they tend to reduce it to profit/loss and destroy the subtlety of the information and its meaning. CIOs that "push" issues, especially when they're expensive, tend to get canned as gadflies, big spenders or for not being "team players".

When it comes to technology, managers often don't care and don't want to know, except when it costs money.

Re:What if they won't listen? by kelleher · 2005-05-03 10:37 · Score: 2, Insightful

When it comes to technology, managers often don't care and don't want to know, except when it costs money.

That's their job. Companies exist to make money - end of story. Technology for technology's sake is foolish and wasteful unless you're in an R&D department.

That being said, all technology spends (e.g. upgrades, redesigns, rewrites, replacements, etc.) can and should be boiled down to dollars that either fall into a profit/loss or risk/benefit category (hopefully over 3-5 years). If a CIO isn't doing this (or having it done for them) they should be fired.

If you think more subtlety is needed, then I hope you're a low to mid level SA/DBA/Developer because you don't understand the "business" side of the company that employs you. I'm not trying to be rude, but that's the brutal reality of the business world. On a lighter side, there's always academia... if you'd prefer politics to dollars.
Re:What if they won't listen? by swb · 2005-05-04 02:40 · Score: 1

That presumes that management generally has given broad authority to the CIO to make substantive changes in business process, organizational structure and other critical areas of business operation.

That broad of control isn't usually given out, thus it's incumbent on more senior management to have a good operational understanding of technology's role in the business. It's not just thumbs up or down based on cost.

There's more to being in management than just wearing a suit, playing golf, and collecting a huge paycheck.
Re:What if they won't listen? by kelleher · 2005-05-04 03:34 · Score: 1

That presumes that management generally has given broad authority to the CIO to make substantive changes in business process, organizational structure and other critical areas of business operation.

Uh? Where the heck did you get that idea?

The CIO's job is to provide technical leadership and direction to support the business processes, structure, and operation. To do this, they need to provide the "best cost" (notice I didn't say lowest - it's not always best) solutions to the business and justify them in profit/loss or risk/benefit. If part of the case involves changing a business process, then that should stand on the merits (yes, this means dollars) of the case.

That broad of control isn't usually given out, thus it's incumbent on more senior management to have a good operational understanding of technology's role in the business. It's not just thumbs up or down based on cost.

Yes, but the keyword is "operational". Where the significant characteristics of two possible solutions are identical (e.g. functionality, maintainability, reliability, vendor viability, or whatever) then cost rules. If that's not the case, then the differences need to be weighted and/or costed. And yes, everything (and I mean everything) can be boiled down to dollars and is at successful companies on a regular basis.
Re:What if they won't listen? by swb · 2005-05-04 04:30 · Score: 1

I don't disagree with your analysis much, although I do think you overemphasize the monetary quantification of all decisions and the ability of leadership to accurately provide dollar valuations of all possible situations and choices.

Show me your guide to correctly, accurately and predictably dollar-quantifying worker morale, brand image and any other number of intangibles, and I'll give you my title to the Brooklyn Bridge.

Furthermore, how often are the significant portions of ANY decision identical? About as often as an economist or political scientist runs into the prototypical "rational man". You're addressing some ideal-world framework for how it ought to work, when it seldom comes close to that.

Thus my original complaint; too many management leaders simply choose not to understand the breadth and depth of IT decision making beyond crude dollar valuations. Yes, they are important, but they don't tell the whole story.

Re:Yep by Blakey+Rat · 2005-05-03 06:29 · Score: 1

Then again, this article also mentions a lengthy Y2K analysis that occurred from 1998-2000... it's kind of surprising that the oversight never occurred to anybody during that, when they were almost certainly digging through that old code.

Then again, the Y2K expert digging through the code probably didn't know enough about the business to realize that the signed 16-bit value wasn't sufficient, so he probably glossed right over it without making any suggestions at all.

--
Comment of the year

Re:Yep by Marillion · 2005-05-03 06:31 · Score: 5, Insightful

First thing: 32767 changes is a lot. A whole f*ck*ng lot. It averages over 1310 changes per day. For a company that flies over 1300 flights a day, it means they averaged a change every flight every day. That's insanely high.
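
A quick check of that arithmetic, as a rough sketch. The divisor isn't stated above; the 1310 figure works out if the counter filled up over about 25 days, so that number is an assumption here, and `average_per_day` is just an illustrative helper.

```python
INT16_MAX = 32767  # largest value a signed 16-bit counter can hold


def average_per_day(total_changes: int, days: int) -> float:
    """Average number of changes per day over the given number of days."""
    return total_changes / days


if __name__ == "__main__":
    # Assumed: limit reached after roughly 25 days of the month.
    print(round(average_per_day(INT16_MAX, 25), 1))  # 1310.7
    # For comparison: spread over a full 31-day month.
    print(round(average_per_day(INT16_MAX, 31), 1))  # 1057.0
```

Either way the average is on the order of one change per flight per day.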

I'm personally getting sick of people asking about backup systems. It was a problem with the data. Too much of it. Given the safety and government oversight that hinges on this data, you don't mess with it. Any backup system, whether one or one hundred backup systems, when presented with the same data, would also fail.

The DOT report issued back in March (sorry don't have karma link handy) said neither Comair nor SBS (the closed source vendor that supplied the application) were aware of the limit.

Eric Bardes (Yes, the one from TFA)

--
This is a boring sig

FORTRAN data types by TheOldBear · 2005-05-03 06:33 · Score: 1

It's been close to thirty years since I last wrote anything in FORTRASH^H^H^H^HRAN, but I seem to recall that INTEGER data types [variables beginning with the letters 'I' through 'N' by default] were all signed. In fact, all FORTRAN datatypes are signed [at least through F77 & ratfor, I've never looked at F90]

--
Caution: Do not stare into laser with remaining eye.

Legacy addiction is the big problem by fm6 · 2005-05-03 06:34 · Score: 3, Interesting

The Slashdot headline is misleading (as usual). This is only incidentally about risk management. The real subject is, Legacy Applications -- and how you get rid of them before they bite you in the ass.

As the article says, a lot of resistance to upgrades comes from employees who know how to do things a certain way, and won't retool without much screaming and kicking. I suspect that this is often the problem, and other problems -- distractions like strikes and the Y2K bug, management that doesn't pay sufficient attention to the problem -- are just secondary.

Here's some personal experience that isn't nearly the same scale, but neatly illustrates what I mean. I once worked for a pubs department that delivered copy to printshops as raw Postscript. There was a push from management to upgrade to Acrobat-generated PDF. This should have been a no-brainer -- print shops hate dealing with raw Postscript, and the existing process relied on an ancient, unsupported printer driver that ran only on Windows 98. But the people who managed the process just totally balked, claiming that tight schedules left them no extra time to learn Acrobat. A lame excuse? Sure. But it took a new pubs manager, and escalation to the do-it-or-you're-fired level, to get the change made.

I think this kind of issue had a lot to do with the failure of IBM's famous plan to use Unix or Linux for all their internal bureaucratic needs. Too many people dug in their heels, claiming that they couldn't possibly retool their Windows-based workflow.

When you talk about this stuff, somebody always says, "If people can't get with the program, they should be fired!" Well, it often comes to that, as it almost did with the PDF issue. But you can't just arbitrarily fire everybody who resists policy and process changes. It's expensive, there are legal ramifications -- and you risk destroying the very corporate infrastructure you're trying to save.

Re:Yep by mankey+wanker · 2005-05-03 06:37 · Score: 4, Insightful

Is there no way to moderate a post simply "odd"?

Hmm... by Greyfox · 2005-05-03 06:39 · Score: 2, Interesting

From what I can gather of the airline industry in general, it's a bunch of assorted systems that are sort of held together by duct tape and spit. If ever an industry needed open standards, mandated interoperability and thorough design and code auditing, I'd say that'd be the one. It seems to me that there really needs to be one central IT shop which rolls out all the software for airline and FAA IT needs and all airlines should go through that single central clearinghouse.

--

I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

Graceful Degradation ... by rewinn · 2005-05-03 06:46 · Score: 1

>it matters how exceeding a limit is handled (graceful degradation)

Your point on correct software design is exceedingly well taken ...

... but I just love the term "Graceful Degradation". Is it from Faulkner, or a New Wave band?

--
--- Attorneys Assisting Citizen-Soldiers & Families -

Really? by Conspiracy_Of_Doves · 2005-05-03 06:47 · Score: 1

It sounded to me like the moral was to run something other than Windows.

Of course, who am I to talk? I run windows myself, but mainly because I mostly play games on my computer.

--

Technoli

Old? by Nemi · 2005-05-03 06:52 · Score: 3, Insightful

Age of the software should make no difference. The problem in this particular case was that the system could only handle 32,000 transactions a month (the programmer obviously used the wrong data type). That could be a problem with software of any age. Age had nothing to do with it failing.

This article rings more as a sales article than anything else - only it isn't selling anything. Which puts it squarely in the "wtf" category for me.

Re:Old? by jimicus · 2005-05-03 08:04 · Score: 1

Age in this case is misleading. As you say, it could have happened with anything.

What it should be emphasising is the importance of risk evaluation in the context of "disaster recovery". Had the business sat down to write a proper disaster recovery plan on the basis of "OK, what happens if this system goes completely kaput and all we have left are the offsite backups?" then it would have become clear that here was a business critical system which had no coherent DR plan.
Re:Old? by ntrfug · 2005-05-03 10:16 · Score: 1

But it IS selling something.

It's selling "Risk Assessment", and the moral of the story is that if you don't hire expensive professional "Risk Assessors" you too could lose $20 million.
Re:Old? by kybred · 2005-05-03 11:19 · Score: 1

Age of the software should make no difference. The problem in this particular case was that the system could only handle 32,000 transactions a month (the programmer obviously used the wrong data type). That could be a problem with software of any age. Age had nothing to do with it failing.

So, when (if) you write software, you consider all the things that could change in 20 years and make sure it can handle them? 20 years ago, the difference in storage use between 'signed short' and 'signed long' (I don't remember the Fortran names for them) may have been enough to make the difference between having enough and running out of memory on the machines of that time.
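
To put rough numbers on that tradeoff, a small sketch assuming 2-byte and 4-byte integer widths. The table size below is purely hypothetical, chosen only to show the ratio; nothing here comes from the actual system.

```python
import struct

# Standard sizes in bytes ("<" selects fixed sizes, independent of platform).
SHORT_BYTES = struct.calcsize("<h")  # 16-bit signed integer
LONG_BYTES = struct.calcsize("<i")   # 32-bit signed integer


def table_bytes(entries: int, bytes_per_entry: int) -> int:
    """Storage needed for a table of fixed-width integers."""
    return entries * bytes_per_entry


if __name__ == "__main__":
    entries = 100_000  # hypothetical number of stored integers
    print(table_bytes(entries, SHORT_BYTES))  # 200000
    print(table_bytes(entries, LONG_BYTES))   # 400000
```

Doubling the width doubles the footprint, which is trivial today but may not have been on a machine with a few hundred kilobytes of memory.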

kybred

Some flaws in the article... by CatsupBoy · 2005-05-03 06:53 · Score: 3, Insightful

Ok, the bottom line, they should have upgraded. Fine, we can all agree on that.

Now, first the article states:

[The application] was the only system left that ran on the airline's old IBM AIX platform (all other applications ran on HP Unix).

First off, the IBM AIX platform can be very new. Just because the application is old and possibly has bugs in it doesn't mean the OS and hardware aren't updated, or that HP Unix is any better.

Secondly, the following scenario makes perfect business sense:

SBS came in to make a pitch for its new Maestro crew management software [...] The existing crew management system wasn't exactly elegant, but all the business users had grown adept at operating it, and a great number of Comair's existing business processes had sprung from it.

The article sets this up as the root of all their problems. Good grief!!! Don't waste resources on an inferior product, for goodness' sake! If the product doesn't perform any better, and there are no known issues with the current product, forget it; it's a waste of money.

Then a series of unfortunate events led to 4 more years of no funding for a replacement product. So what? The business is under a financial crunch, so why go back and fix something that isn't broken (that they know of)? The business still needs to survive, doesn't it? I'm guessing they maintained the hardware and OS, otherwise we'd be here talking about how stupid they were for not updating maintenance contracts.

Re:Some flaws in the article... by tomarseneault · 2005-05-03 07:39 · Score: 1

I agree that the risk of upgrading their system in that fiscal environment might have been more, in their minds, than the risk of keeping what's been working fine for 20 years, but according to the article they did not even have a backup system, and this was a mission-critical application. It sounds like they still do not have a backup and instead just split it so that the 32k limit is harder to reach. The failure here was in Disaster Recovery Planning rather than Risk Management Assessments.

The airline industry has lots of Fortran systems. by Richard+Steiner · 2005-05-03 06:54 · Score: 1

Probably not many job openings, though. :-(

I'm working mostly in F77 now. It's a good language for what it does.

--
Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
The Theorem Theorem: If If, Then Then.

Application lifecycles by ehiris · 2005-05-03 06:58 · Score: 2, Insightful

IT people know that technology will be obsolete in a short time but most business people always see technology as flashy cost reducers and they never plan on retiring the systems from the get-go. It's an annoyance but it is not surprising in an industry where duct taping old systems is preferred over structural improvements through architecture.

Re:Yep by alienw · 2005-05-03 07:01 · Score: 2, Insightful

That's exactly the same reasoning that leads to problems like this. We've all heard that 640k should be enough for everybody. A good rule for counters and bit values is to go several orders of magnitude above what you think will be the maximum possible value for that counter.

One or two changes per flight is unlikely, but possible. Yeah, it's insanely high. Yeah, such a thing might only occur once every 15 years. However, the value should have been a 32-bit unsigned integer instead of a 16-bit signed integer, because thousands of changes in a month is well within the domain of possible values. Also, these types of limits should be conspicuously published in a specifications sheet, just like every other industry does.
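
As a rough sketch of that rule of thumb, the check below asks whether a counter type still fits after padding the expected peak by a couple of orders of magnitude. The peak of 10,000 changes a month and the helper names are made up for illustration, not taken from the article.

```python
def int_range(bits: int, signed: bool) -> tuple[int, int]:
    """Smallest and largest values representable in an integer of this width."""
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def has_headroom(expected_peak: int, bits: int, signed: bool,
                 orders_of_magnitude: int = 2) -> bool:
    """True if the type holds the expected peak times 10**orders_of_magnitude."""
    _, largest = int_range(bits, signed)
    return expected_peak * 10 ** orders_of_magnitude <= largest


if __name__ == "__main__":
    peak = 10_000  # hypothetical busiest month
    print(int_range(16, signed=True))             # (-32768, 32767)
    print(int_range(32, signed=False))            # (0, 4294967295)
    print(has_headroom(peak, 16, signed=True))    # False
    print(has_headroom(peak, 32, signed=False))   # True
```

The exact padding factor is a judgment call; the point is that the check is cheap to write down and easy to publish alongside the limit.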

Re:Yep by _xeno_ · 2005-05-03 07:02 · Score: 1

Eric Bardes (Yes, the one from TFA)

5 digit user ID, huh? We can also blame this on Slashdot, right? :)

--
You are in a maze of twisty little relative jumps, all alike.

Airlines have been using IT for 40+ years. by Richard+Steiner · 2005-05-03 07:09 · Score: 2, Insightful

They've also historically had fairly large IT shops. That has given them a lot of time and manpower over the past four decades to write custom software for themselves, and that has resulted in many unique airline-specific systems, sometimes running on interesting combinations of hardware.

One of the main problems with a "central IT shop" for the airlines is the fact that, operationally, each airline is somewhat unique in terms of the internal operational procedures they use, and many of the software applications at each airline are very tightly tied to that airline's own local set of procedures and business rules.

I worked for ten years at Northwest Airlines on a flight operations system that was originally written at United Airlines in the mid-1960's, and we had to make a lot of fundamental changes to displays and other things so our pilots and flight dispatchers could use their own in-house terminology, and so that the software would match the largely paper-driven procedures that it was replacing.

Even were the airline industry not in its current financial bind, the prospect of replacing some of those systems isn't one to be taken lightly -- not only are the systems at a major airline closely intertwined with unique procedures, but they also tend to be tightly tied together in terms of data with lots of real-time message passing going on not only between the airline's internal systems but also between the airline and various third parties (ACARS messages, weather info, flight plan information, reservations info, etc.).

It's a very interesting industry from an IT perspective, at least when it isn't in a death spiral...

--
Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
The Theorem Theorem: If If, Then Then.

Any application, once written, is "legacy". by Richard+Steiner · 2005-05-03 07:12 · Score: 1

The trick is to determine whether or not the cost to convert a given system is actually worth it.

If a rewrite effort requires 50,000 or 100,000 man years to complete, you're talking serious money...

--
Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
The Theorem Theorem: If If, Then Then.

Re:Any application, once written, is "legacy". by fm6 · 2005-05-03 07:40 · Score: 1

You're hiding behind a silly quibble over the definition of the word "legacy". Nobody calls an application that rolled out last year a "legacy application". There may be a gray area, but there is clearly a lot of crap out there that's "legacy" in the worst sense of the word. Software that doesn't generate or use the kind of data that fits modern workflows. Software that only runs on ancient platforms that you keep around for the sole purpose of running them. Software that requires constant attention by aging employees who aren't easily replaced.
Sure, you need to figure the cost of replacing your legacy systems -- but only as an offset against the cost of not replacing them. And most of all, you need to think of the future of your company, and put it ahead of your own professional inertia.

CIO's are often very distant from such issues. by Richard+Steiner · 2005-05-03 07:20 · Score: 1

I've worked in three large IT shops now, and the CIO of each company would typically know *very* little about a given application besides its name (if even that), much less info about specific features or flaws therein.

When one works in an environment with several hundred in-house applications, it's easy for something to get lost in the shuffle, particularly if the application in question isn't normally a source of issues or is using a technology which isn't "mainstream" for the company...

--
Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
The Theorem Theorem: If If, Then Then.

Not necessarily. by Richard+Steiner · 2005-05-03 07:24 · Score: 1

The Fortran V and F77 stuff we have running here (as well as the stuff running in those languages at my former employer) doesn't have that problem.

Sometimes using an old 36-bit mainframe architecture (where an INT is 36-bits) is an advantage. :-)

--
Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
The Theorem Theorem: If If, Then Then.

Unmaintained code (and shoddy work) is the problem by Anonymous Coward · 2005-05-03 07:24 · Score: 2, Insightful

"As the article says, a lot of resistance to upgrades comes from employees who know how to do things a certain way, and won't retool without much screaming and kicking."

And why is that a bad thing? If the software is a good tool for the task at hand, they should keep using it. In fact, the article clearly says that this program was in many ways superior to newer programs on the market - which is why they didn't upgrade earlier. They say they were able to create good workflows based around the software - extending the ideas of the software's design to other processes.

What the article fails to discuss is that you could have a brand new piece of software which fails just as badly. Software doesn't age, and this wasn't a hardware failure. If they tried to do 32,769 crew changes the first month they used the software....it would have failed just as it did now. And buying something new (just because it is new) doesn't mean it is bug-free. If anything, conventional wisdom would imply that the older software is less buggy than new software because it has years of usage. Whoever wrote this program was obviously an idiot and didn't consider "what happens if there are more than 32,768 changes a month?" But most people write shoddy software, and managers don't catch them on it. Do you think that's changed in 20 years? Have programmers become better? I doubt it. The new software they bought almost certainly has some bug lurking in it, ready to cause havoc.

The real issue is software that is not maintainable, mostly because no one has (or can use) the source code for it. In that sense, just because software is old, doesn't make it a Legacy Application. Lack of maintainability makes it a Legacy Application. What confuses me is, it sounds like they had the source code for the current application. How hard would it have been to go hire a Fortran programmer to review it, since no one in the organization was familiar with Fortran? And, it certainly should serve as a warning to anyone willing to use software critical to a business process without source code. You can't count on software being maintained if the company discontinues support, or goes bankrupt, or just doesn't feel like it. And if you can't count on someone else to support a critical system for you, you better make sure you can support it yourself. (It doesn't have to be FOSS, but you ought to have access to the source code for your own use)

The truly relevant cautionary tale... by EricTheGreen · 2005-05-03 07:53 · Score: 2, Insightful

...IMHO, can be found in the following single line from The Fine Article:

But after nearly 15 years in use, the business had grown accustomed to the SBS system, and much of Comair's crew management business processes had grown directly out of it.

(emphasis added)

Talk about putting the cart in front of the horse. This system would never have been replaced before its crash--the cost of readjusting process and any other attached technology would have dwarfed simply updating the software. There was no business case you could make that would appear to justify the expense. Other than the little matter of "your company won't function if something goes wrong", of course...

Also, you'd never find a decent replacement product--since its functionality would have to mirror those same system-driven business processes.

The truly major oversight was in letting the package drive how Comair did this part of its business in the first place. Done otherwise, the meltdown might still have happened, for plenty of reasons outlined in the article. But left this way, this result was pre-ordained. No amount of planning or "risk assessment" was going to counter the inertia created by this process/technology inversion.

Alternate history... by Awful+Truth · 2005-05-03 07:53 · Score: 1

It's just as easy to imagine the alternate scenario: After two years, the 10 person team charged with rebuilding this legacy software (at a cost of $2.5 million) finally implements their solution...and it crashes. Flights are cancelled, money is lost, business and IT journalists point to the project as an example of whatever their favorite current issue is.

We often draw the wrong conclusions in the IT business, because they support the projects we like. Rather than scrap the legacy system, the head of IT could have hired a couple of programmers with some Fortran language skills (it's not ancient Akkadian, after all) to maintain the existing system. Or, if the CIO is truly budget-conscious, pay a bonus of $5,000 each to the first two developers who become proficient in the language and the system.

Re:You made lots of people cry. by phasm42 · 2005-05-03 07:59 · Score: 1

Yes, clearly this is all his fault.

That last bit was sarcasm, by the way.

--
"No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner

Not a lot. At all. by Tangurena · 2005-05-03 08:02 · Score: 2, Interesting

You have 2 flight crew and some "flight attendants." Let's call the number of crew on the plane 6. When a flight is rescheduled, you have 6 transactions removing them from the old flight, and 6 more transactions adding them to the new flight. Total 12 transactions. When you have bad snow days causing cancellation or rescheduling of 1,000 flights, then you just used up 1/3 of your transactions for the month. Since all the transactions are serialized, restoring from backup tapes would just have caused the same crash again.
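
The arithmetic above can be checked with a short sketch. It assumes the monthly counter is a signed 16-bit integer (maximum 32,767), which is the usual reading of the limit but is not confirmed here, and it uses the post's round number of 6 crew per plane. The function names are made up for illustration.

```python
# Back-of-the-envelope check of the figures in the post above.
# Assumption: the monthly counter is a signed 16-bit integer.
INT16_MAX = 2**15 - 1  # 32767

CREW_PER_FLIGHT = 6  # the post's round number, not a real staffing figure
# Each crew member is removed from the old flight and added to the new one.
TRANSACTIONS_PER_RESCHEDULE = CREW_PER_FLIGHT * 2


def transactions_used(flights_rescheduled: int) -> int:
    """Transactions consumed by rescheduling the given number of flights."""
    return flights_rescheduled * TRANSACTIONS_PER_RESCHEDULE


def fraction_of_budget(flights_rescheduled: int) -> float:
    """Share of the assumed monthly limit consumed."""
    return transactions_used(flights_rescheduled) / INT16_MAX


def reschedules_until_overflow() -> int:
    """Smallest number of rescheduled flights that exceeds the limit."""
    return INT16_MAX // TRANSACTIONS_PER_RESCHEDULE + 1


print(transactions_used(1000))             # 12000
print(round(fraction_of_budget(1000), 2))  # 0.37
print(reschedules_until_overflow())        # 2731
```

Under these assumptions, roughly 2,700 rescheduled flights in one month are enough to run out, which is about two days' worth of a 1,300-flight schedule.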

Re:Yep by Shotgun · 2005-05-03 08:07 · Score: 5, Insightful

I worked for IBM, coding in the mainframe networking department. Their motto should have been, "Don't change anything...it's working."

I got irritated. I would find stuff that was just STUPID. Horrendously mangled logic. Algorithms from other parts of the code applied completely wrong. Whenever I tried to improve the code I got the "It's working. Don't change anything" line. I left, determined to find a job where I could actually write code.

That was several years ago. I've gotten smarter since. I've worked on several large-scale, 5-9's systems. After several major and minor fuck-ups, now I know....

If it's working, don't change anything.

--
Aah, change is good. -- Rafiki
Yeah, but it ain't easy. -- Simba

Am I? Consider this... by Richard+Steiner · 2005-05-03 08:08 · Score: 1

A former employer of mine hired a contractor to write a small system for them, and when it was done his contract was over so he left.

The software was written in a modern language on a modern platform, but the employer did not have any of its own expertise in that language. Some of the folks there took shots at making small changes, but for the most part the thing was a black box.

Was it a legacy application or not?

My point: there's a HUGE grey area.

Even the data supposedly "locked" on so-called legacy systems is often easily freed, but many times the easiest solution from a technical perspective (i.e., actually buying a license for a relational database on a mainframe) is considered "too expensive" to implement, and the platform is still blamed even though the data being locked away is a financial decision, not a technical one.

Besides, well-designed systems (in my experience) don't require constant attention regardless of age.

The mainframe system I worked on at NWA certainly had its flaws, but most of its limitations were due to the stubbornness of (and misconceptions held by) upper management when it came to the platform in question, not the system itself.

(As an aside, in response to your lecture at the bottom -- I'd love to free my career from the shackles of older technology, but corporate hiring practices over the past decade make that highly impractical. IT software workers are labelled based on their last platform of expertise, not on their knowledge base.

Solve that issue, and you'll see a lot less CYA on the part of legacy programmers...)

--
Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
The Theorem Theorem: If If, Then Then.

Great now what? by Martigan80 · 2005-05-03 08:34 · Score: 1

Now if we can only tell the military that Ada is dead we'll be in business!

--
This SIG pulled due to lack of funding. (This damn war is costing too much!)

Re:Yep by Cyn · 2005-05-03 08:34 · Score: 1

Most companies only like employees who think inside the box, despite telling people to think outside the box.

This is exactly why people who can "think outside the box" are hired. They know exactly what the box is, and exactly what thinking inside it consists of - and they are then informed that's exactly what they are to do - think inside it and not shake things up.

Those who can't think outside the box might accidentally do so, causing horrible things to happen like things get fixed. Obviously, those who can think outside it still might do this, but they can approach the dissemination of such information carefully - in an 'outside the box' manner. Not that they necessarily will.

What were we talking about?

--
cyn, free software and *nix operating systems enthusiast.

Re:Yep by Skjellifetti · 2005-05-03 09:13 · Score: 1

For a company that flies over 1300 flights a day, it means they averaged a change every flight every day. That's insanely high.

Depends on what's meant by a "change." Sure, 1300 planes (100% of flights) rerouted per day for a month is insane, but if one flight attendant being replaced by another is a "change," then 1300 per day during/after a major snowstorm across half the US might not be all that insane. Just having to cancel/delay/reroute ~20% of your daily flights for a month might be enough to hit the limit.

--
FreeSpeech.org

The problem wasn't "aging" code... by TheLoneGundam · 2005-05-03 09:19 · Score: 1

The problem was the failure of the people working with it to look at and understand it. Probably no one wanted to work on it, because it wasn't "sexy". People who write for CIOs and PHBs tend to write as though anything older than five years is at risk of failing, when in reality, if it has worked for five years straight, hang onto it. Constant change is the source of more failures than "aging" code, in my experience.

Re:Risky business by Gary+Destruction · 2005-05-03 09:21 · Score: 1

If it's at will employment, they can harp on any mistakes the employee made as an excuse to terminate them. People have been fired just because someone in upper management didn't like them.

SDLC? by Gary+Destruction · 2005-05-03 09:24 · Score: 1

What SDLC model were they using for that application?

32767 may be a lot, but it's not enough by Anonymous Coward · 2005-05-03 09:31 · Score: 1, Informative

IT'S OBVIOUSLY NOT ENOUGH.

The root cause was not a problem with too much data.

The root cause was not addressing the problem of what would happen when an undefined amount of data was fed into the system.

I'm personally getting sick of programmers blaming something in "the data" for their code puking on its shoes. No input no matter how insane should ever cause your system to fail to do what it was designed to do! If your code can't handle incorrect input, spit out an error and move on. Don't crash. Don't stop processing data. Keep working properly!!!!

Data from an uncontrolled source can never be trusted. Anyone who depends on certain characteristics or amounts of uncontrolled data inputs is a f*****g idiot.

And I'm not sorry at all if my standards are too high for you.
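
A minimal sketch of the "spit out an error and move on" approach described above. Everything here is hypothetical: `ChangeLog`, `CapacityError`, and `process` are invented names, and nothing is known about how the real scheduling system stored its changes. The point is only that a documented limit is checked explicitly and bad input is rejected without stopping the rest of the batch.

```python
INT16_MAX = 2**15 - 1  # assumed limit, for illustration only


class CapacityError(Exception):
    """Raised when a change cannot be recorded without exceeding the limit."""


class ChangeLog:
    """Hypothetical monthly change log with an explicit, checked limit."""

    def __init__(self, limit=INT16_MAX):
        self.limit = limit
        self.changes = []

    def record(self, change):
        # Reject malformed input instead of trusting the caller.
        if not isinstance(change, str) or not change.strip():
            raise ValueError("change must be a non-empty string")
        # Fail loudly at the limit instead of silently overflowing.
        if len(self.changes) >= self.limit:
            raise CapacityError(f"limit of {self.limit} changes reached")
        self.changes.append(change)


def process(log, incoming):
    """Record what can be recorded; collect rejects instead of crashing."""
    rejected = []
    for item in incoming:
        try:
            log.record(item)
        except (ValueError, CapacityError) as exc:
            rejected.append((item, str(exc)))
    return rejected


log = ChangeLog(limit=2)  # tiny limit so the example is easy to follow
rejected = process(log, ["swap crew A", "", "swap crew B", "swap crew C"])
print(log.changes)  # ['swap crew A', 'swap crew B']
print(rejected)     # the empty string and the change past the limit
```

Whether rejecting changes past the limit is acceptable is a business decision; the sketch only shows that the failure can be made visible and recoverable.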

"Old software breaks down" is not BS by susano_otter · 2005-05-03 10:20 · Score: 2, Insightful

First, the longer a piece of software is in use, the greater the chance of finding an obscure or unlikely error condition. The older a piece of software, the more of its bugs will become apparent, and the more likely it is that a crippling bug will be found. Old software breaks down.

Second, operating constraints change over time. If a piece of software meets its initial demands, greater and greater demands are placed on it over time. If a piece of software is kept in use for many years, it will likely find itself handling a workload far in excess of what was imagined when the software was first created. When Comair first began using this software, it probably didn't have the business volume to make the transaction limit a problem. Because Comair's business grew over many years, but the software was not grown along with it, what was originally an unimportant design constraint turned out to be a major bug. Old software does not grow to meet new demands. Old software breaks down.

Old software doesn't rust. It doesn't develop stress fractures. It doesn't corrode or go stale. But in its own very real and very important way, old software does degrade over time; if not in itself, in its relationship to the constant growth in the demands placed upon it. Old software breaks down.

--

Any sufficiently well-organized community is indistinguishable from Government.

Re:Am I? Consider this... by fm6 · 2005-05-03 10:45 · Score: 1

Bored now. I enjoy a good debate -- but only with people who respond to what I actually say.

I shouldn't headshrink someone I haven't met -- but you sound like you're struggling to rationalize your own technological footdragging. Perhaps you should remember that I'm not the person you need to convince. You should be worrying about explaining to your boss why you insisted on hanging onto that obsolete system until it collapsed of its own weight.

I haven't seen the code, but... by Anonymous Coward · 2005-05-03 11:00 · Score: 1, Insightful

I haven't seen the code, but here's a guess. This is from a guy who started programming on an IBM mainframe and Burroughs minis in the late 70's.

When this code was written, hard disk space and memory space in mini computers and mainframes were at a premium.

So you didn't use a fullword when a half would do. That would be the equivalent today of storing a lossless music file on your iPod Mini.

So they knew they were doing a few hundred events a month; the idea of going beyond 32K per month was probably absurd when it was written, and the programmer couldn't waste space, and anyway, when they got to that point, somebody would just patch the code and they'd be on their way.

Meanwhile, those programmers retired 10 years ago and this disaster erupted.

If you learned to program during the era this was written, it makes perfect sense.

You're too young to understand by Anonymous Coward · 2005-05-03 11:07 · Score: 3, Insightful

These systems were written when a computer had maybe 4M of main memory. So if you double the size of your counter, that means you can hold....1/2 as many events.

So as a programmer, you make a choice. You either make the counter smaller, or you limit the system in some other way.

Computers today have 3 orders of magnitude more memory, and the choice between a short and a long is easy to make. But back then, it wasn't.

To help you understand, if a programmer from that era used a long int, he'd better have a damned good reason. Although, he should have made it an unsigned int and got double the space. See? You're not old enough to feel in your gut the need to save *BITS*.

Back when I learned to code in the late 70's, we used assembler (BAL 360), and we saved space by making all numbers packed and then stripping off the sign byte. You did a MVO to the same memory location, and it had a side effect of shifting the packed number one nibble (1/2 byte) to the right, erasing the sign bit. We did that because a 40M disk pack on an IBM 370/148 cost about $40,000 and we couldn't waste it. Now I have a thumb drive with 1G on it. You just don't understand.
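
The short-versus-long trade-off above, as a rough sketch. It assumes 16-bit and 32-bit two's-complement integers and takes the post's "maybe 4M" as 4 MiB devoted entirely to one kind of field, which is a simplification. The helper names are made up; Python's `ctypes` integer types are used only because they wrap without overflow checking, the way a fixed-width field would.

```python
import ctypes
import struct


def wrap_signed16(n):
    """Value a signed 16-bit field would hold after storing n."""
    return ctypes.c_int16(n).value


def wrap_unsigned16(n):
    """Value an unsigned 16-bit field would hold after storing n."""
    return ctypes.c_uint16(n).value


MEMORY_BYTES = 4 * 1024 * 1024  # the post's "maybe 4M", taken as 4 MiB


def fields_that_fit(fmt):
    """How many fields of the given struct format fit in MEMORY_BYTES."""
    return MEMORY_BYTES // struct.calcsize(fmt)


print(wrap_signed16(32767))      # 32767, the largest signed 16-bit value
print(wrap_signed16(32767 + 1))  # -32768, one more wraps negative
print(wrap_unsigned16(65535))    # 65535, unsigned doubles the range
print(wrap_unsigned16(65536))    # 0, but it still wraps eventually

print(fields_that_fit("<h"))     # 2097152 two-byte fields
print(fields_that_fit("<i"))     # 1048576 four-byte fields, half as many
```

Going unsigned only postpones the wraparound; a wider field removes it for practical purposes at the cost of half the capacity.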

Disaster Recovery issue, how.... by cykim · 2005-05-03 11:30 · Score: 1

I don't understand all these people pounding on their chests about how this was due to disaster recovery planning, or lack thereof.... From what I gather, there should have been some sort of usage stats that should have been monitored. However, if the limit wasn't documented and wasn't readily known to the developer or the users, how was anyone supposed to flag it? There should have been a software lifespan that was originally projected, not because of software aging, etc, but because no one can realistically expect that a piece of software should last forever, even if it's working just fine. On a general level, it will probably cost you less to stay current than it will to suffer something like Comair did, but that's a blanket statement that is not always true. Either way, it's a complicated issue.

There probably was some sort of disaster recovery plan in place, but I'm not sure since it wasn't mentioned. Even if we had a time delayed secondary site up that accepted changes and was accessible, could someone please point out to me how this would have been avoided? I'm gonna say that this system was prolly not coded to be a distributed or clustered system that would share the load between two sites so that neither site would suffer the fate of having more than 32,000 changes in a month. Even if it were, that only means we're still sitting on a ticking time bomb until the error shows up or the system is upgraded, and you hope that a similar type of limit isn't imposed on the new app. How does a DR plan let them avoid a fatal flaw in their app, short of upgrading/code change/etc? I'm gonna guess that if the primary failed, and they were too worried about getting the DR site up, rather than actually figuring out what the issue was, they would've walked right into the same problem at the DR site.

We're talking about a system that affects millions upon millions of dollars, and it's a definite risk for someone to stick their neck out and be on the line for the success or failure of the successor to an app that had apparently, for better or worse, become a lynchpin within the organization. Does that excuse some of their obvious circular laziness that got them to where they went? No, but it is also understandable that not everyone has the cojones to stick their neck out on something that is so high profile. Few people are leaders and propagate change, most are policy enforcers that only appear to lead. So should it really be a surprise that there are hundreds (more likely thousands) of companies that are potentially going to be the next Comair?

The number of times I had to use "should" is a strong indicator that there are a number of things that could have been done to avoid the issue, but bottom line, the problem in this particular case was poor coding on the part of the developer, and nothing else. This problem, as mentioned before, could have appeared at any time during its use, and the article I think puts completely the wrong slant on what needs to be done to correct issues like this.

Re:Yep by Specter · 2005-05-03 13:09 · Score: 1

Exactly. The article was obviously written by someone who's never had to manage a mission critical application in their life. If it's not broken, don't fix it. I can assure you that the cumulative cost in downtime due to "fixing" even minor things in the system over the years would easily outpace the $20M they lost in the one big outage.

Hot shot wet-behind-the-ears noobs don't understand that the purpose of the software is to support the business process. Unless there's a business case (i.e. an ROI) for making a change to a system that's working, you don't change the system. (Even if the code is ugly, and 20 years old, and written in an unfashionable language.)

The risk analysis story here isn't: why didn't they spend a bunch of unnecessary money to replace a system that was working fine. The _real_ risk analysis story here is: why Comair didn't have a disaster recovery plan in place for a system that was critical to their line of business?

The old school... by Kjella · 2005-05-03 19:14 · Score: 1

...remember that as far back as the 1980s, computer memory was *expensive*. In 1986 I got my Commodore 64 with 64kB RAM. Use 32bit instead of 16bit? You just took up 1/2000 of my memory instead of 1/4000. Same reason we had two-digit dates and the entire y2k-problem.

Today? Use 32bit, hell 64bit if you like. 32k changes would take up all of 512kB of memory, wohoo. It's not long ago since we had the 2GB limit (FAT, AVI), 4GB limit (FAT32). All due to bit-hogging, can't use 64bits for file size. (2/4GB = signed/unsigned 32bit).

It's not longer ago than 1998 that I was in a class where they told people to choose smaller units to save memory (and no, it wasn't an embedded systems class). I go for extreme overkill. 64bit dates (no year 2038 problem), 64bit file sizes, 64bit calculations on "unlimited" data, e.g. sum(x) where x is some sort of list. Of course not for bit values and such.

Is it needed? Probably not. But now, unlike then, the cost of doing so is near zero. It is much easier to err on the side of caution then.
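
The limits mentioned above all fall out of the field width, as this small sketch shows. It assumes the conventional reading of each limit: a signed 32-bit count of seconds since 1970-01-01 UTC for the year 2038 problem, and signed or unsigned 32-bit byte counts for the 2GB and 4GB file size limits. The variable names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Year 2038: last second representable in a signed 32-bit Unix timestamp.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
last_32bit_second = EPOCH + timedelta(seconds=2**31 - 1)
print(last_32bit_second.isoformat())  # 2038-01-19T03:14:07+00:00

# File sizes: a 32-bit byte count tops out at 2 GiB (signed) or 4 GiB (unsigned).
GIB = 2**30
signed_32_limit_gib = 2**31 // GIB
unsigned_32_limit_gib = 2**32 // GIB
print(signed_32_limit_gib, unsigned_32_limit_gib)  # 2 4

# A signed 64-bit byte count pushes the ceiling to 8 EiB.
EIB = 2**60
signed_64_limit_eib = 2**63 // EIB
print(signed_64_limit_eib)  # 8
```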

Kjella

--
Live today, because you never know what tomorrow brings

Re:Yep by DingerX · 2005-05-03 19:14 · Score: 1

I love all this sanctimonious stuff.
Was it bad programming practice? Perhaps.
Can a company rely on its closed-source software being free of bad code, or a bad assumption?

Just how likely is 32k crew changes in a month anyway?
The largest airline in the world, last I checked, was United, and they have ca. 2400 flights a day, spread all over the globe. Comair has 1300 flights, all geographically centered. Worse, they're all centered on CVG, more specifically the Delta terminal, which has grown so rapidly, it's more like a tumor than a concourse. In my experience, the terminal infrastructure doesn't handle delays well either (I've dropped people off at CVG two and a half hours before their scheduled departure and they've missed their flight because of lines through security).
So we have weather that affects practically the entire operating area of a 1300-flight airline -- something in itself inconceivable outside of Comair -- and a home base that, to an outsider, has a straining infrastructure. I'm inclined to say that it's not only "insanely high", but something that would only happen in a "perfect storm" of circumstances.

So you hit an "undocumented software limitation".
Vendors don't rate software to weather perfect storms. This kind of failure, mutatis mutandis, could happen with software you purchased yesterday.
So I'd like to say it was a matter of "the company bought a scheduling package suited to a small airline, built everything else around it, and when it got big, the scheduling package was forced to do what it could not and puked." or even "the company should have known of the 32,767-transaction limitation."

Yes, you can have backups. You can buy a second, completely different software package, and train all your personnel to use it too, and double the workload. Or you can bring in a fleet of temps, teach them in 2 hours the ins-and-outs of FAA crew scheduling minima, break out the phones and chalkboards, and hope that by the end of the day you don't break too many laws.

How easy is emulating old mainframes? by ockegheim · 2005-05-03 20:26 · Score: 1

(or, provided that an emulation-layer for the original platform can be constructed)

I'm definitely no expert in these matters, but if a company has an old mainframe system, how difficult would it be to emulate it on a reliable modern system, and throw all sorts of scenarios at it? If anything came up they'd be able to debug it (by hiring a Fortran guy or girl). That would be much cheaper than a new system, and management wouldn't have to learn a new system.

--
I’m old enough to remember 16K of memory being described as “whopping”

RTFM? by budgenator · 2005-05-04 01:43 · Score: 1

Considering the age of the system, I'd say they were lucky they still had the source code. More than one company found that their source had disappeared during the Y2K conversions.

--
Apocalypse Cancelled, Sorry, No Ticket Refunds

Re:Yep by alienw · 2005-05-04 01:50 · Score: 1

You do have a point about the software not being intended for their needs. Of course, I'm sure the vendor made quite a few changes as they grew, because there is simply no way a program can scale that well.

However, this is simply an example of the sloppiness of the software development process in general. In my opinion, any mission-critical software package should come with a full disclosure of its limitations. This shouldn't be something that's left to some codemonkey who did the database layout. This is one of the main parameters of such a system -- what kind of transaction volume it can handle. It should be carefully documented and explained. If it was, it wouldn't have been a problem in the first place.

Re:New software is the silver bullet? by susano_otter · 2005-05-04 06:50 · Score: 1

Thank you. I am now even more convinced that software breakdowns and hardware breakdowns share significant similarities.

Please allow me to rephrase your post:

Old cars don't break down so much if they're well maintained.

New cars break down a lot if they're poorly maintained.

Therefore, a new car is no better than an old car.

I'm totally leaving out the part where you tried to convince me to keep my old software because Windows sucks. I hope you don't mind.

--

Any sufficiently well-organized community is indistinguishable from Government.
