Risk Management - A Cautionary Tale
Mr. Ghost writes "By now many people have heard about the fiasco and financial blunder Comair had over the 2004 Christmas holiday. An article on CIO provides a timeline of the decisions that led up to the system failure costing the division of Delta Airlines $20 million. The article points out the need for proper risk management and what can occur when a risk analysis is not performed or ignored. It goes on to mention that although this was a very public failure, this type of system failure can occur in other companies." From the article: "The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events..."
Hindsight is 20/20.
Yes, senior management was distracted, but it's the CIO's job to warn senior management and the board about risks to the business as well as their likelihood of happening.
"I'd rather be a lightning rod than a seismometer." -Ken Kesey
How do you strike a balance between risk mitigation and ease of use for users? Sure, you can run a backup of your data and applications every 5 minutes, but then, of course, no work gets done.
It seems like a professional discipline in itself. Risk+ certification anyone?
http://www.watacrackaz.com
but all it takes for a good number of companies to get egg on their face is one careless mid-level that is too casual with passwords (and/or takes their work home on laptops with info unencrypted)...
I'm sure "SlashdotMedia" will improve on all the wonders that Dice Holdings blessed us all with
It's exactly things like this we should avoid!! I can see signs of civil war in this!
My Goodness!!! John Titor HAS predicted a civil war in 2005!
Um...like making sure you run your Windows Updates. Because if you don't, you're gonna regret it.
Then again, even if you do, you're still going to regret it.
So, I guess the moral of the analogy is that it's better to patch your system and risk your hardware not working properly than having spyware or a virus on your system.
IGB: More fun than eating oatmeal!
From the article:
Sounds like some sort of overflow problem. Hmmm....
The big issue is, of course, the business units and IT playing "After you, Alfonse..." but it's fun to seek out the pebble that set off the avalanche.
--- Attorneys Assisting Citizen-Soldiers & Families -
Site is already sluggish.
Bound To Fail
The crash of a critical legacy system at Comair is a classic risk management mistake that cost the airline $20 million and badly damaged its reputation.
BY STEPHANIE OVERBY
When Eric Bardes joined the Comair IT department in 1997, one of the very first meetings he attended was called to address the replacement of an aging legacy system the regional airline utilized to manage flight crews. The application, from SBS International, was one of the oldest in the company (11 years old at the time), was written in Fortran (which no one at Comair was fluent in) and was the only system left that ran on the airline's old IBM AIX platform (all other applications ran on HP Unix).
SBS came in to make a pitch for its new Maestro crew management software. One of the flight crew supervisors at the meeting had used Maestro, a first-generation Windows application, at a previous job. He found it clumsy, to put it kindly. "He said he wouldn't wish the application on his worst enemy," Bardes recalls. The existing crew management system wasn't exactly elegant, but all the business users had grown adept at operating it, and a great number of Comair's existing business processes had sprung from it. The consensus at the meeting was that if Comair was going to shoulder the expense of replacing the old crew management system, it should wait for a more satisfactory substitute to come along.
And wait they did. The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events: managing the approach of Y2K, the purchase of the independent carrier by Delta in 2000, a pilot strike that grounded the airline in 2001, and finally, 9/11 and the ensuing downturn that ravaged the airline industry.
A replacement system from Sabre Airline Solutions was finally approved last year, but the switch didn't happen soon enough. Over the holidays, the legacy system failed, bringing down the entire airline, canceling or delaying 3,900 flights, and stranding nearly 200,000 passengers. The network crash cost Comair and its parent company, Delta Air Lines, $20 million, damaged the airline's reputation and prompted an investigation by the Department of Transportation.
Chances are, the whole mess could have been avoided if Comair or Delta had done a comprehensive analysis of the risk that this critical system posed to the airline's daily operations and had taken steps to mitigate that risk. But a look inside Comair reveals that senior executives there did not consider a replacement system an urgent priority, and IT did little to disrupt that sense of complacency. Though everyone seemed to know that there was a need to deal with the aging applications and architecture that supported the growing regional carrier--and the company even created a five-year strategic plan for just that purpose--a lack of urgency prevailed.
After the acquisition by Delta, former employees say Comair IT executives didn't do the kind of thorough management analysis that might have persuaded the parent airline to invest in a replacement system before it was too late. Instead, Delta kept a lid on capital expenditures at Comair, with unfortunate consequences. The failure of the almost 20-year-old scheduling system not only saddled Delta with a plethora of customer service and financial headaches that the airline could ill afford but it also provides a cautionary tale for any company that thinks it can operate on its legacy systems for just...one...more...day.
The five-year plan that wasn't
Today, Cincinnati-based Comair is a regional airline that operates in 117 cities and carries about 30,000 passengers on 1,130 flights a day, with three or four crew members on each. But back in 1984, when Jim Dublikar joined the company as director of finance and risk management, Comair had
I currently have no clever signature witticism to add here.
--------------------- Cut Here ---------------------
Posts above this line have not RTFA.
Okay, like many slashdotters, I have a short attention span and I don't remember this "public" story about Comair committing this blunder.
I have a real question. Why did Comair's system fail in the first place? Was it due to a design flaw requiring its replacement in 2004? Was it an irreplaceable piece of hardware which died?
The Article smacks of FUD, only because systems fail for a reason. The article conveniently leaves out the reason for the failure. I think this is critical to any risk analysis. For example, if I have a 20 year old system that I can't get parts for, that's a high risk system. However, if I can get parts for a 20 year old system, then the risk is lower.
I don't like the idea of making assumptions that just because a system is 20 years old, that it absolutely must be replaced. I also don't like the assumption in the article that I already know the facts, so here's the analysis for you. I want the facts to back it up so I can come to my own conclusion.
"All great wisdom is contained in .signature files"
And even though she was screaming from the highest mountain to anyone and everyone that would listen that doom was rushing towards them. That bad, bad things were going to happen. She was still made the sacrificial goat when the fecal material hit the rotating blades.
And this was for a federal agency.
Scary no?
The Laws of Software Process: A New Model for the Production and Management of Software
"Armour, a consultant in software development, reveals a new structure for software development that redefines the nature and purpose of software. He explains how, in the modern knowledge economy, software systems are not products in the classical sense, but are the modern channels for the conveyance of information. From this perspective, he examines programming languages, quality, cost estimation, and project management, and demonstrates how to overcome common problems that afflict software development and use. The book is distributed by CRC.Copyright © 2004 Book News, Inc., Portland, OR"
I used to work in the Risk Management department of the capital markets division of a large international bank as a programmer.
When I started, 4 years ago, the reports generated were basically compilations by a cut-and-paste-monkey staff (despite being highly trained, very conscientious individuals) of reports generated by other departments. I was part of a team that reformed the IT basis for creating risk reporting, and found that while there was a lot of expertise and complex methods available, what was actually implemented was much, much smaller, for the simple reason that it was tough to get the right reports generated given the inputs the department was given.
The project I worked on parsed the input data from the Excel spreadsheet inputs and loaded it to a database, where it could then be queried intelligently and nice reports generated. These reports were growing very fast in complexity, building towards the best toolsets available for determining the actual risk the bank was taking.
Several points about this job were fascinating:
1. How many departments are so caught up in the minutiae of "getting the report out" that they don't have time to examine the contents of it;
2. How much money can be made by knowing what the actual risk is. If you don't know the risk, you estimate high, and put lots of dollars in a reserve account. If you do know the risk accurately, you usually can greatly lower reserves to accurately meet even very bad case estimated losses, and use the rest of the money to fund interest-generating ventures.
3. How much the banking consolidation trend is increasing, due to the repeal of Glass-Steagall allowing multi-state banks to gobble and grow. This makes a consumer's life better because of more resources being available (auto-bill-pay, check images, etc.).
It was a fun job. Then I found another one where I get to play with Python!
-- Kevin
Unitarian Church: Freethinkers Congregate!
One word: repeatedly.
Document! Document! Document!
Might have still got fired, but it makes it easier to get unemployment, and keep a black mark from showing on your work history.
I would love to work on an old fortran system again, especially if it's in fortran IV. Yes indeed, those were the good old days.
01/20/09
One of the interesting quotes from the article:
Unfortunately, you can't see a crew management system age the way you can see an airplane rust. But they do.
I find that an interesting, if slightly obvious, insight. The interesting part is that you can know that software is decaying, but I don't know of any effective way to measure that decay. I don't even know of any particularly good ways to characterize the decay. It's not as if new defects are being introduced into code that's not changing. But the environment in which the software operates changes, and that change is analogous to weather corroding a piece of physical equipment. Every time the OS gets a patch, the filesystem changes, a shared library is upgraded, or the underlying hardware changes, there's a chance of triggering a failure in the software.
Can it be proven, or should we otherwise reasonably believe, that the probability of catastrophic system failure approaches 1 as the age of the system increases? Maybe a good topic for a research paper...
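One way to make that question concrete, as a rough sketch rather than a proof: assume (and it is only an assumption) that each environmental change, such as an OS patch or a library upgrade, independently carries some small fixed probability p of triggering a catastrophic failure. Then the chance of at least one failure after n changes is 1 - (1 - p)^n, which tends to 1 as n grows for any p > 0. The function name and the sample value of p below are made up for illustration.

```python
def cumulative_failure_probability(p: float, n: int) -> float:
    """Chance of at least one failure across n independent changes,
    each with an (assumed) fixed failure probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical figure: a 0.1% chance of trouble per change.
    for n in (10, 100, 1000, 10000):
        print(n, round(cumulative_failure_probability(0.001, n), 4))
```

Real systems will not satisfy the independence assumption exactly, so this only shows why "eventually something breaks" is plausible, not how fast it happens.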
Wow. Looks like even the mag for CIOs can't keep up with a /. DDoS attack. Maybe the CIO for CIO should be fired?
If the crew scheduling system was old as the hills how old is the system used to track aircraft maintenance? Oh wait that issue will be addressed when we crash an aircraft.
Maintenance manuals and procedures are written in blood. The next tragedy will be no different.
Mirrordot mirror of article for your viewing pleasure.
Religion is for people afraid of going to hell.
Legacy == Bad, gonna die, just like dear Grandad. Should've rewritten it in Java, that'd fix it!
With a signed 16-bit integer, you have 1 bit for the sign, and 15-bits for the rest of the number. Depending upon any error handling by the compiler, you could get NaN (not a number), maybe zero, maybe -32767, or maybe just a core dump. In any event, the result is not what you are expecting.
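For anyone who wants to see the wraparound case, here is a small sketch. It simulates a signed 16-bit counter in Python using two's-complement arithmetic; what the actual Fortran code or its compiler did on overflow is not something this thread establishes, so treat the wrapping behaviour as an assumption. The helper names are invented for the example.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def wrap_int16(value: int) -> int:
    """Truncate an integer to signed 16 bits, two's-complement style."""
    return (value + 2**15) % 2**16 - 2**15


def checked_increment(counter: int) -> int:
    """Increment a 16-bit counter, refusing to overflow silently."""
    if counter >= INT16_MAX:
        raise OverflowError("16-bit counter is full")
    return counter + 1


if __name__ == "__main__":
    # Silent wraparound: one past the maximum becomes the minimum.
    print(wrap_int16(INT16_MAX + 1))  # -32768
    try:
        checked_increment(INT16_MAX)
    except OverflowError as exc:
        print("refused:", exc)
```

The checked version is the point: the limit is the same either way, but one of them tells you about it.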
It is always easy to say "I told you so" after the fact, but the reality is that this failure has far more to do with the company's attitude about technology than with the failure of somebody to say "look out!". In fact, by the sounds of it, the entire application could probably be run on 2 souped-up PCs running in parallel in different co-locations over the internet - the hardware and infrastructure would not cost a lot.
.....
Even worse, when these types of failures happen, the ole "policy and procedure" routine kicks in.
To tell a story, one time I went to a boarding school, and at the beginning of the year they had almost no rules, and then whenever something went wrong they added a new rule. Well, needless to say, at the end of the year there were so many rules that people could get reprimanded for flushing the toilet twice instead of once! Not having their shoes tied left over right, etc.
Well, I grew up and found the same is true in companies. How much you wanna bet they are gonna lose more than 20 million from too many piled-up policies and procedures that keep anyone from getting anything done?
OTOH, what does "Risk management" in IT really mean, besides drawing nice PowerPoints and putting a chapter "Risk analysis" into change request forms, that are normally filled in with "No risk, no fun!" or "If I make a very big mistake, it will extinguish mankind"?
Unfortunately, upper management commonly abides by the if-it-ain't-broke-don't-fix-it mentality. With many systems there is a huge amount of pressure to fix bugs and outstanding issues; once that is done, they work on money-making initiatives. I see it as a game of Jenga. Pieces are removed from the bottom to create a taller structure. Instead of reinforcing the base, there is a constant push to make the tower taller until it comes crashing down.
that your comment has failed: Lack of attention
I work in a business that isn't defined by technology (at least not historically), and I don't think that management actually listens or comprehends when it comes to a lot of IT issues.
When they do listen, they tend to reduce it to profit/loss and destroy the subtlety of the information and its meaning. CIOs that "push" issues, especially when they're expensive, tend to get canned as gadflys, big spenders or for not being "team players".
When it comes to technology, managers often don't care and don't want to know, except when it costs money.
1- What is important to your organisation?
2- Evaluate this : You lose one part of your data or structure. Can you get it back? Are you SURE? In how much time?
There is way more to it than this, but this is a start...
No sig for now.
The article doesn't make it at all clear that bringing in a new software system would bring its own risks.
For example, a limit like the "max 32,000 changes per month" in the old system could well have existed in any new system, as could any number of bugs that take a long time to shake out.
The implicit message of the article seems to be "the system was old and that caused it to fail", which doesn't seem correct to me.
It's been close to thirty years since I last wrote anything in FORTRASH^H^H^H^HRAN, but I seem to recall that INTEGER data types [variables beginning with the letters 'I' through 'N' by default] were all signed. In fact, all FORTRAN datatypes are signed [at least through F77 & ratfor, I've never looked at F90]
Caution: Do not stare into laser with remaining eye.
As the article says, a lot of resistance to upgrades comes from employees who know how to do things a certain way, and won't retool without much screaming and kicking. I suspect that this is often the problem, and other problems -- distractions like strikes and the Y2K bug, management that doesn't pay sufficient attention to the problem -- are just secondary.
Here's some personal experience that isn't nearly the same scale, but neatly illustrates what I mean. I once worked for a pubs department that delivered copy to printshops as raw Postscript. There was a push from management to upgrade to Acrobat-generated PDF. This should have been a no-brainer -- print shops hate dealing with raw Postscript, and the existing process relied on an ancient, unsupported printer driver that ran only on Windows 98. But the people who managed the process just totally balked, claiming that tight schedules left them no extra time to learn Acrobat. A lame excuse? Sure. But it took a new pubs manager, and escalation to the do-it-or-you're-fired level, to get the change made.
I think this kind of issue had a lot to do with the failure of IBM's famous plan to use Unix or Linux for all their internal bureaucratic needs. Too many people dug in their heels, claiming that they couldn't possibly retool their Windows-based workflow.
When you talk about this stuff, somebody always says, "If people can't get with the program, they should be fired!" Well, it often comes to that, as it almost did with the PDF issue. But you can't just arbitrarily fire everybody who resists policy and process changes. It's expensive, there are legal ramifications -- and you risk destroying the very corporate infrastructure you're trying to save.
From what I can gather of the airline industry in general, it's a bunch of assorted systems that are sort of held together by duct tape and spit. If ever an industry needed open standards, mandated interoperability and thorough design and code auditing, I'd say that'd be the one. It seems to me that there really needs to be one central IT shop which rolls out all the software for airline and FAA IT needs and all airlines should go through that single central clearinghouse.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
>it matters how exceeding a limit is handled (graceful degradation)
Your point on correct software design is exceedingly well taken ...
... but I just love the term "Graceful Degradation". Is it from Faulkner, or a New Wave band?
It sounded to me like the moral was to run something other than Windows.
Of course, who am I to talk? I run windows myself, but mainly because I mostly play games on my computer.
Technoli
This article rings more as a sales article than anything else - only it isn't selling anything. Which puts it squarely in the "wtf" category for me.
Now, first the article states: First off, the IBM AIX platform can be very new. Just because the application is old and possibly has bugs in it doesn't mean the OS and hardware aren't updated, or that HP Unix is any better.
Secondly, the following scenario makes perfect business sense: The article sets this up as the root of all their problems. Good grief!!! Don't waste resources on an inferior product, for goodness' sake! If the product doesn't perform any better, and there are no known issues with the current product, forget it; it's a waste of money.
Then a series of unfortunate events led to 4 more years of no funding for a replacement product. So what? The business is under a financial crunch, so why go back and fix something that isn't broken (that they know of)? The business still needs to survive, doesn't it? I'm guessing they maintained the hardware and OS, otherwise we'd be here talking about how stupid they were for not updating maintenance contracts.
Probably not many job openings, though. :-(
I'm working mostly in F77 now. It's a good language for what it does.
Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
The Theorem Theorem: If If, Then Then.
IT people know that technology will be obsolete in a short time, but most business people always see technology as a flashy cost reducer and they never plan on retiring the systems from the get-go. It's an annoyance, but it is not surprising in an industry where duct-taping old systems is preferred over structural improvements through architecture.
Plan ahead. Wish I thought that one up
Ovaltine,
What's the deal with ovaltine.
The cup is round, the can is round,
why don't they call it roundtine?
There's some standardization going on.
http://www.iata.org/whatwedo/fuel/datastandards.htm
http://www-306.ibm.com/software/ebusiness/jstart/casestudies/faa.shtml
They've also historically had fairly large IT shops. That has given them a lot of time and manpower over the past four decades to write custom software for themselves, and that has resulted in many unique airline-specific systems, sometimes running on interesting combinations of hardware.
One of the main problems with a "central IT shop" for the airlines is the fact that, operationally, each airline is somewhat unique in terms of the internal operational procedures they use, and many of the software applications at each airline are very tightly tied to that airline's own local set of procedures and business rules.
I worked for ten years at Northwest Airlines on a flight operations system that was originally written at United Airlines in the mid-1960's, and we had to make a lot of fundamental changes to displays and other things so our pilots and flight dispatchers could use their own in-house terminology, and so that the software would match the largely paper-driven procedures that it was replacing.
Even were the airline industry not in its current financial bind, the prospect of replacing some of those systems isn't one to be taken lightly -- not only are the systems at a major airline closely intertwined with unique procedures, but they also tend to be tightly tied together in terms of data with lots of real-time message passing going on not only between the airline's internal systems but also between the airline and various third parties (ACARS messages, weather info, flight plan information, reservations info, etc.).
It's a very interesting industry from an IT perspective, at least when it isn't in a death spiral...
The trick is to determine whether or not the cost to convert a given system is actually worth it.
If a rewrite effort requires 50,000 or 100,000 man years to complete, you're talking serious money...
I've lived this too (also Feds). I warned the management until they were sick of me, and was demoted. By the time the system crashed and cost them 10 Million Bucks, all the managers and the CIO had moved on to better jobs with promotions! I laugh, I cry, I now work elsewhere.
I've worked in three large IT shops now, and the CIO of each company would typically know *very* little about a given application besides its name (if even that), much less info about specific features or flaws therein.
When one works in an environment with several hundred in-house applications, it's easy for something to get lost in the shuffle, particularly if the application in question isn't normally a source of issues or is using a technology which isn't "mainstream" for the company...
The Fortran V and F77 stuff we have running here (as well as the stuff running in those languages at my former employer) doesn't have that problem.
:-)
Sometimes using an old 36-bit mainframe architecture (where an INT is 36-bits) is an advantage.
"As the article says, a lot of resistance to upgrades comes from employees who know how to do things a certain way, and won't retool without much screaming and kicking."
And why is that a bad thing? If the software is a good tool for the task at hand, they should keep using it. In fact, the article clearly says that this program was in many ways superior to newer programs on the market - which is why they didn't upgrade earlier. They say they were able to create good workflows based around the software - extending the ideas of the software's design to other processes.
What the article fails to discuss is that you could have a brand new piece of software which fails just as badly. Software doesn't age, and this wasn't a hardware failure. If they tried to do 32,769 crew changes the first month they used the software....it would have failed just as it did now. And buying something new (just because it is new) doesn't mean it is bug-free. If anything, conventional wisdom would imply that the older software is less buggy than new software because it has years of usage. Whoever wrote this program was obviously an idiot and didn't consider "what happens if there are more than 32,768 changes a month?" But most people write shoddy software, and managers don't catch them on it. Do you think that's changed in 20 years? Have programmers become better? I doubt it. The new software they bought almost certainly has some bug lurking in it, ready to cause havoc.
The real issue is software that is not maintainable, mostly because no one has (or can use) the source code for it. In that sense, just because software is old doesn't make it a Legacy Application. Lack of maintainability makes it a Legacy Application. What confuses me is, it sounds like they had the source code for the current application. How hard would it have been to go hire a Fortran programmer to review it, since no one in the organization was familiar with Fortran? And it certainly should serve as a warning to anyone willing to use software critical to a business process without source code. You can't count on software being maintained if the company discontinues support, or goes bankrupt, or just doesn't feel like it. And if you can't count on someone else to support a critical system for you, you'd better make sure you can support it yourself. (It doesn't have to be FOSS, but you ought to have access to the source code for your own use.)
...IMHO, can be found in the following single line from The Fine Article:
But after nearly 15 years in use, the business had grown accustomed to the SBS system, and much of Comair's crew management business processes had grown directly out of it.
(emphasis added)
Talk about putting the cart in front of the horse. This system would never have been replaced before its crash--the cost of readjusting process and any other attached technology would have dwarfed simply updating the software. There was no business case you could make that would appear to justify the expense. Other than the little matter of "your company won't function if something goes wrong", of course...
Also, you'd never find a decent replacement product--since its functionality would have to mirror those same system-driven business processes.
The truly major oversight was in letting the package drive how Comair did this part of its business in the first place. Done otherwise, the meltdown might still have happened, for plenty of reasons outlined in the article. But left this way, this result was pre-ordained. No amount of planning or "risk assessment" was going to counter the inertia created by this process/technology inversion.
We often draw the wrong conclusions in the IT business, because they support the projects we like. Rather than scrap the legacy system, the head of IT could have hired a couple of programmers with some Fortran language skills (it's not ancient Akkadian, after all) to maintain the existing system. Or, if the CIO is truly budget-conscious, pay a bonus of $5,000 each to the first two developers who become proficient in the language and the system.
Yes, clearly this is all his fault.
That last bit was sarcasm, by the way.
"No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner
You have 2 flight crew and some "flight attendants." Let's call the number of crew on the plane 6. When a flight is rescheduled, you have 6 transactions removing them from the old flight, and 6 more transactions adding them to the new flight. Total: 12 transactions. When you have bad snow days causing cancellation or rescheduling of 1,000 flights, then you have just used up more than 1/3 of your transactions for the month. Since all the transactions are serialized, restoring from backup tapes would just crash again.
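The back-of-the-envelope numbers above can be checked with a few lines. This is only a sketch of the arithmetic in this comment: the 6-crew figure is the guess made above, the limit is assumed to be that of a signed 16-bit counter, and the function names are made up.

```python
INT16_LIMIT = 32_767  # assumed ceiling: largest signed 16-bit value


def transactions_for_reschedule(flights: int, crew_per_flight: int = 6) -> int:
    """Each crew member is removed from the old flight and added to the new one."""
    return flights * crew_per_flight * 2


def flights_until_limit(crew_per_flight: int = 6, limit: int = INT16_LIMIT) -> int:
    """How many rescheduled flights fit under the monthly ceiling."""
    return limit // (crew_per_flight * 2)


if __name__ == "__main__":
    used = transactions_for_reschedule(1000)
    print(used, f"{used / INT16_LIMIT:.0%} of the monthly limit")
    print(flights_until_limit(), "rescheduled flights exhaust the counter")
```

Under these assumptions, 1,000 rescheduled flights cost 12,000 transactions, and fewer than three such storm days in one month would exhaust the counter.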
A former employer of mine hired a contractor to write a small system for them, and when it was done his contract was over so he left.
The software was written in a modern language on a modern platform, but the employer did not have any of its own expertise in that language. Some of the folks there took shots at making small changes, but for the most part the thing was a black box.
Was it a legacy application or not?
My point: there's a HUGE grey area.
Even the data supposedly "locked" on so-called legacy systems is often easily freed, but many times the easiest solution from a technical perspective (i.e., actually buying a license for a relational database on a mainframe) is considered "too expensive" to implement, and the platform is still blamed even though the data being locked away is a financial decision, not a technical one.
Besides, well-designed systems (in my experience) don't require constant attention regardless of age.
The mainframe system I worked on at NWA certainly had its flaws, but most of its limitations were due to the stubbornness of (and misconceptions held by) upper management when it came to the platform in question, not the system itself.
(As an aside, in response to your lecture at the bottom -- I'd love to free my career from the shackles of older technology, but corporate hiring practices over the past decade make that highly impractical. IT software workers are labelled based on their last platform of expertise, not on their knowledge base.
Solve that issue, and you'll see a lot less CYA on the part of legacy programmers...)
Now if we can only tell the military that Ada is dead we'll be in business!
This SIG pulled due to lack of funding. (This damn war is costing too much!)
Okay, some people would argue that blogs are such a blot on the web as to warrant this action. Point is, um, let's see, Delta should have been able to do due diligence on the core software of new acquisitions. Shouldn't they?
The problem was the failure of the people working with it to look at and understand it. Probably no one wanted to work on it, because it wasn't "sexy". People who write for CIOs and PHBs tend to write as though anything older than five years is at risk of failing.. when in reality, if it has worked for five years straight, hang onto it. Constant change is the source of more failures than "aging" code, in my experience.
What SDLC model were they using for that application?
IT'S OBVIOUSLY NOT ENOUGH .
The root cause was not a problem with too much data.
The root cause was not addressing the problem of what would happen when an undefined amount of data was fed into the system.
I'm personally getting sick of programmers blaming something in "the data" for their code puking on its shoes. No input no matter how insane should ever cause your system to fail to do what it was designed to do! If your code can't handle incorrect input, spit out an error and move on. Don't crash. Don't stop processing data. Keep working properly!!!!
Data from an uncontrolled source can never be trusted. Anyone who depends on certain characteristics or amounts of uncontrolled data inputs is a f*****g idiot.
And I'm not sorry at all if my standards are too high for you.
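To make the "spit out an error and move on" point concrete, here is a minimal Python sketch. The record format (`crew_id,delta_minutes`) and the names `parse_change` and `process_changes` are made up for illustration; nothing here is taken from the actual Comair system.

```python
import logging

logger = logging.getLogger("crew_changes")


def parse_change(raw):
    """Parse a hypothetical 'crew_id,delta_minutes' record.

    Raises ValueError on any malformed input instead of guessing.
    """
    crew_id, delta = raw.split(",")  # wrong field count raises ValueError
    crew_id = crew_id.strip()
    if not crew_id:
        raise ValueError("empty crew id")
    return crew_id, int(delta)  # non-numeric delta raises ValueError


def process_changes(lines):
    """Process every record; log and set aside bad ones, never stop the batch."""
    accepted, rejected = [], []
    for line in lines:
        try:
            accepted.append(parse_change(line))
        except ValueError as exc:
            logger.warning("rejected %r: %s", line, exc)
            rejected.append(line)
    return accepted, rejected
```

Bad records end up in `rejected` for someone to look at later, and the good ones still get processed.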
First, the longer a piece of software is in use, the greater the chance of finding an obscure or unlikely error condition. The older a piece of software, the more of its bugs will become apparent, and the more likely it is that a crippling bug will be found. Old software breaks down.
Second, operating constraints change over time. If a piece of software meets its initial demands, greater and greater demands are placed on it over time. If a piece of software is kept in use for many years, it will likely find itself handling a workload far in excess of what was imagined when the software was first created. When Comair first began using this software, it probably didn't have the business volume to make the transaction limit a problem. Because Comair's business grew over many years, but the software was not grown along with it, what was originally an unimportant design constraint turned out to be a major bug. Old software does not grow to meet new demands. Old software breaks down.
Old software doesn't rust. It doesn't develop stress fractures. It doesn't corrode or go stale. But in its own very real and very important way, old software does degrade over time; if not in itself, in its relationship to the constant growth in the demands placed upon it. Old software breaks down.
Any sufficiently well-organized community is indistinguishable from Government.
I shouldn't headshrink someone I haven't met -- but you sound like you're struggling to rationalize your own technological footdragging. Perhaps you should remember that I'm not the person you need to convince. You should be worrying about explaining to your boss why you insisted on hanging onto that obsolete system until it collapsed of its own weight.
I haven't seen the code, but here's a guess. This is from a guy who started programming on an IBM mainframe and Burroughs minis in the late 70's.
When this code was written, hard disk space and memory space in mini computers and mainframes were at a premium.
So you didn't use a fullword when a half would do. That would be the equivalent today of storing a lossless music file on your iPod Mini.
So they knew they were doing a few hundred events a month; the idea of going beyond 32K per month was probably absurd when it was written, and the programmer couldn't waste space, and anyway, when they got to that point, somebody would just patch the code and they'd be on their way.
Meanwhile, those programmers retired 10 years ago and this disaster erupted.
If you learned to program during the era this was written, it makes perfect sense.
These systems were written when a computer had maybe 4M of main memory. So if you double the size of your counter, that means you can hold....1/2 as many events.
So as a programmer, you make a choice. You either make the counter smaller, or you limit the system in some other way.
Computers today have 3 orders of magnitude more memory, and the choice between a short and a long is easy to make. But back then, it wasn't.
To help you understand, if a programmer from that era used a long int, he'd better have a damned good reason. Although, he should have made it an unsigned int and got double the space. See? You're not old enough to feel in your gut the need to save *BITS*.
Back when I learned to code in the late 70's, we used assembler (BAL 360), and we saved space by making all numbers packed and then stripping off the sign byte. You did an MVO to the same memory location, and it had the side effect of shifting the packed number one nibble (1/2 byte) to the right, erasing the sign bit. We did that because a 40M disk pack on an IBM 370/148 cost about $40,000 and we couldn't waste it. Now I have a thumb drive with 1G on it. You just don't understand.
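For anyone who has never hit a halfword limit, a small Python sketch of the signed vs. unsigned point. It uses `ctypes` fixed-width integers to mimic a 16-bit counter; the function names are made up, and this is only an illustration of the arithmetic, not a reconstruction of the actual code.

```python
import ctypes


def next_count_signed16(count):
    """Increment a counter stored in a signed 16-bit field (wraps past 32767)."""
    return ctypes.c_int16(count + 1).value


def next_count_unsigned16(count):
    """Increment a counter stored in an unsigned 16-bit field (wraps past 65535)."""
    return ctypes.c_uint16(count + 1).value


print(next_count_signed16(32767))    # wraps to -32768
print(next_count_unsigned16(32767))  # 32768, the unsigned field still has room
print(next_count_unsigned16(65535))  # wraps to 0
```

Same two bytes of storage either way; the unsigned field just moves the cliff from 32767 to 65535.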
"How would *you* have judged the risk of this software failing? "
Actually, most well-run companies set up a portfolio and have someone with a lot of experience and savvy do a risk assessment.
In some cases, it's easy...2 year old system, lots of programmers around to support it, new technology blah blah. Low risk for the most part.
A 30 year old green screen application with almost no support in house or out, and a lack of source to judge how well the system works? That's high risk. You have to put together a cost justification to replace it. The business then makes the decision to replace it based on the amount of risk the CEO and/or board is willing to take.
In this case, this might have been done, but everybody is doing a CYA and the CIO might not be at fault at all! We don't know enough to make that judgement, and the people who really know are keeping their mouths shut.
I know all this because I do this for a living. It's part of the world of an IT architect.
"How the hell can you calculate risk if your only input is the chronological age of a software system?"
e val-service.html
I do this for a living for our company and here's how you do it:
1) First do an assessment of the impact of the software failing
2) Then make a determination of the amount of support you have for the program to fix it in the event of a failure
3) Do you have the source code? Do you have people who can understand it? Do you have people who routinely update and fix the code?
4) Is the business logic in the program well understood and documented?
5) Do you routinely run load and stress analysis on the system to determine its limits?
That tells you a ton. The most important question is #1. Will the business fail if the program breaks? This is partly IT's job, but it's really the business owner's job to give you that information.
If the business owner feels the business will stop without it, then it's up to IT to recommend a course of action to mitigate the risk, and ultimately it's up to the CEO/Board to concur or reject the recommendation (or come up with their own).
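The five questions above can be sketched as a simple checklist in Python. The field names, the `risk_level` function, and especially the thresholds are my own assumptions for illustration; a real assessment weighs these answers with judgement, not a lookup.

```python
from dataclasses import dataclass


@dataclass
class SystemAssessment:
    name: str
    business_stops_if_it_fails: bool  # question 1: impact of failure
    support_available: bool           # question 2: can someone fix it?
    source_and_maintainers: bool      # question 3: source code and people who know it
    logic_documented: bool            # question 4: business logic understood
    load_tested: bool                 # question 5: limits known from stress tests


def risk_level(a):
    """Rough rating: impact first, then count the unanswered support questions."""
    gaps = sum(
        not ok
        for ok in (
            a.support_available,
            a.source_and_maintainers,
            a.logic_documented,
            a.load_tested,
        )
    )
    if a.business_stops_if_it_fails:
        # Hypothetical threshold: any gap on a critical system is high risk.
        return "high" if gaps >= 1 else "medium"
    return "medium" if gaps >= 3 else "low"


legacy = SystemAssessment("legacy crew scheduler", True, False, False, False, False)
recent = SystemAssessment("two-year-old intranet app", False, True, True, True, True)
print(risk_level(legacy), risk_level(recent))
```

Note that age never appears as an input. The rating comes from impact and supportability, which is the point being made to the quoted question.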
http://www.sei.cmu.edu/products/services/sw.risk.
TPF is still big in the airline industry. If you want a dead-end job doing IBM assembler programming in a loader that we laughingly call an operating system, then go for it.
OTOH, IBM is adding a POSIX layer into it and Apache is already ported, so there's hope for the old girl yet.
Nah, just kidding. IBM still can't figure out how to do structured I/O yet in this system.
I don't understand all these people pounding on their chest about how this was due to disaster recovery planning, or lack thereof.... From what I gather, there should have been some sort of usage stats that should have been monitored. However, if the limit wasn't documented and wasn't readily known to the developer or the users, how was anyone supposed to flag it? There should have been a software lifespan that was originally projected, not because of software aging, etc, but because no one can realistically expect that a piece of software should last forever, even if it's working just fine. On a general level, it will probably cost you less to stay current than it will to suffer something like Comair did, but that's a blanket statement that is not always true. Either way, it's a complicated issue. There probably was some sort of disaster recovery plan in place, but I'm not sure since it wasn't mentioned. Even if we had a time delayed secondary site up that accepted changes and was accessible, could someone please point out to me how this would have been avoided? I'm gonna say that this system was prolly not coded to be a distributed or clustered system that would share the load between two sites so that neither site would then suffer the fate of having more than 32000 changes on that day, but that only means that we're still a ticking time bomb before the error shows up or the system is upgraded and you hope that a similar type of limit isn't imposed on the new app. How does a DR plan let them avoid a fatal flaw in their app, short of upgrading/code change/etc? I'm gonna guess that if the primary failed, and they were too worried about getting the DR site up, rather than actually figuring out what the issue was, they would've walked right into the same problem at the DR site.
We're talking about a system that affects millions upon millions of dollars, and it's a definite risk for someone to stick their neck out and be on the line for the success or failure of the successor to an app that had apparently, for better or worse, become a lynchpin within the organization. Does that excuse some of their obvious circular laziness that got them to where they went? No, but it is also understandable that not everyone has the cojones to stick their neck out on something that is so high profile. Few people are leaders and propagate change, most are policy enforcers that only appear to lead. So should it really be a surprise that there are hundreds (more likely thousands) of companies that are potentially going to be the next Comair? The number of times I had to use "should" is a strong indicator that there are a number of things that could have been done to avoid the issue, but bottom line, the problem in this particular case was poor coding on the part of the developer, and nothing else. This problem, as mentioned before, could have appeared at any time during its use, and the article I think puts completely the wrong slant on what needs to be done to correct issues like this.
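On the usage-stats point, this is roughly what such a check could look like in Python, with the big caveat already raised: it only helps once somebody knows the limit exists. The function name, the 80% warning threshold, and the 32767 figure (a signed 16-bit maximum, as guessed elsewhere in this thread) are all assumptions.

```python
def headroom_alert(current, limit, warn_fraction=0.8):
    """Compare a usage counter against a known hard limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    used = current / limit
    if used >= 1.0:
        return "exceeded"
    if used >= warn_fraction:
        return "warning"
    return "ok"


MONTHLY_CHANGE_LIMIT = 32767  # assumed limit, not confirmed by the article

print(headroom_alert(10_000, MONTHLY_CHANGE_LIMIT))  # ok
print(headroom_alert(27_000, MONTHLY_CHANGE_LIMIT))  # warning
print(headroom_alert(32_767, MONTHLY_CHANGE_LIMIT))  # exceeded
```

A warning at 80% would have given someone weeks of notice, but a DR site running the same code would trip the same check at the same count.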
I'm not too young to understand. I've been doing this for almost two decades.
Why should a system that needs to track crew changes on airline schedules be limited to the amount of the schedule it can fit in memory? That's another thoughtless limitation on a system that should not have all that strict performance requirements attached. It's not like it's doing real-time processing.
The system was 11 years old. Assuming it began design 15 years ago, even that long ago they had already invented the "disk drive".
And we both know the real reason it was a 16-bit field was someone selected "int" as the type without really thinking of the implications. Hell, they weren't even smart enough to use "unsigned int", and that's a very revealing point.
If they were scrounging for memory they would have used an "unsigned int" instead.
They didn't.
So I call BS.
Then you didn't test your code very well. Again, this wasn't a bug, this was by design.
This has less to do with the code, and is all about the shoddy attitude of management towards the software, most apparent because of their lack of a backup system, and secondly that they had no one familiar with the code, and did not know of the design constraints.
Don't think that you can go out and buy a brand spanking new software system (trading in your rusty beater and getting a new Honda) and things will be rosy. You need to know what you are doing along with that. Here's a simple example: If your website is designed to handle a certain number of hits (let's say Microsoft IIS running on an Access database just for kicks), it will fall over if someone posts the link at slashdot. In fact, let's say they had invested in the new Windows system, but scrimped on security and backup, and their site is hacked. Situation any different?
Old software breaks down ... and so does new software. Period.
Notice that Comair is *STILL* running their old software. Now they have a backup system, a fix to the 32K change limit, and probably some additional familiarity with how the system works.
Damn straight. I've read a bunch of the modded up posts, most fall hook line & sinker for the 'old software' bit, apparently absorbed from the CIO article in a 'Snow Crash' way, i.e. stream of data right past the 'does this make sense' checker. I think there are certainly reasons to replace software, but I didn't like the fact that this article was such a sales pitch, in effect one should print the article, hand to their manager and request System X to be replaced, or suffer dire consequences.
But interestingly, they still haven't replaced the software - they've got a workaround, and a backup, and maybe some better documentation.
The fact that they didn't have a backup is almost unbelievable, especially for an airline. What seems to me always happens in these cases is that the blame falls heavily on the software vendor, and not on the users.
What is more remarkable though, and unsaid, was that the vendor a) still existed, b) still supported and was able to respond to this system...at Christmas.
That is impressive, and kind of lost in the details in the article. I recall a similar issue with a bank. I've forgotten the exact details, but apparently the bank was running on a bunch of IBM controllers (or some sort of hardware). This hardware was redundant - if one CPU failed, there was a spare that would take over. But these folks, rather than replace the CPU when it failed, simply continued to run on the spare. Finally, the spare failed too, so they frantically called IBM, and somehow IBM managed to get them the required part (13 or 14 years old no less), so they were down for only 11 hours or so. Yet... they are still suing IBM! The bank apparently has no technical people, or won't hire any, but they will hire lawyers...
...remember that as far back as the 1980s, computer memory was *expensive*. In 1986 I got my Commodore 64 with 64kB RAM. Use 32bit instead of 16bit? You just took up 1/16000 of my memory instead of 1/32000. Same reason we had two-digit dates and the entire y2k-problem.
Today? Use 32bit, hell 64bit if you like. 32k changes would take up all of 512kB of memory, wohoo. It's not long ago since we had the 2GB limit (FAT, AVI), 4GB limit (FAT32). All due to bit-hogging, can't use 64bits for file size. (2/4GB = signed/unsigned 32bit).
It's not longer ago than 1998 that I was in a class where they told people to choose smaller units to save memory (and no, it wasn't an embedded systems class). I go for extreme overkill. 64bit dates (no year 2038 problem), 64bit file sizes, 64bit calculations on "unlimited" data, e.g. sum(x) where x is some sort of list. Of course not for bit values and such.
Is it needed? Probably not. But now, unlike then, the cost of doing so is near zero. It is much easier to err on the side of caution then.
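The limits mentioned above all fall out of the same arithmetic. A quick Python sketch (helper names are mine) that derives the 2GB/4GB file size limits and the year 2038 date from the width of the field:

```python
from datetime import datetime, timezone


def signed_max(bits):
    """Largest value a signed two's-complement field of this width can hold."""
    return 2 ** (bits - 1) - 1


def unsigned_max(bits):
    """Largest value an unsigned field of this width can hold."""
    return 2 ** bits - 1


GIB = 2 ** 30

print(signed_max(16), unsigned_max(16))  # 32767 65535
print((signed_max(32) + 1) / GIB)        # 2.0 -> the 2GB limit
print((unsigned_max(32) + 1) / GIB)      # 4.0 -> the 4GB limit

# A signed 32-bit count of seconds since 1970 runs out here:
print(datetime.fromtimestamp(signed_max(32), tz=timezone.utc))  # 2038-01-19 03:14:07+00:00
```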
Kjella
Live today, because you never know what tomorrow brings
I'm definitely no expert in these matters, but if a company has an old mainframe system, how difficult would it be to emulate it on a reliable modern system and throw all sorts of scenarios at it? If anything came up they'd be able to debug it (by hiring a Fortran guy or girl). That would be much cheaper than a new system, and management wouldn't have to learn a new system.
I’m old enough to remember 16K of memory being described as “whopping”
Considering the age of the system, I'd say they were lucky they still had the source code. More than one company found that their source had disappeared during the Y2K conversions.
Apocalypse Cancelled, Sorry, No Ticket Refunds