Comair Done In by 16-Bit Counter
Gogo Dodo writes "According to the Cincinnati Post, the Comair system crash was caused by an overflowed 16-bit counter. Perhaps Comair should have paid for the software upgrade to MaestroCrew." You heard it here first...
Well, not this specific problem, but businesses have a common problem of outgrowing the systems that run their business. OTOH, this was an outsourced solution, so this case is pretty hard to explain away, other than sheer incompetence.
It's interesting because it provides a lesson in software design: arbitrary limits will trip you up eventually. It's not as if nobody knew to avoid them before, though.
I'm scared of numbers that can't be written as a fraction. It's an irrational fear.
This assumes that they had the resources. Given the current competitive environment in terms of consumer price and fuel costs, it would not be surprising if IT got the short end of things.
A Pirate and a Puritan look the same on a balance sheet.
Now my question would be, since they're owned by Delta, why wouldn't Comair flights be handled within Delta's own reservation/flight tracking system?
p.s. I've traveled through CVG, on Delta, during the holidays. Not anymore... One weather-delayed flight and the whole system falls apart.
I'd have a personalized plate on my car, but "toxic bachelor" won't fit into 7 letters.
So it turned out to be problematic to use a signed 16-bit integer.
...
But the real problem is a lack of error checking. It sounds like the code had something like:
int num_crew_changes; /* 16 bits on the original platform, so at most 32767 */
crew_change_list[++num_crew_changes] = blah; /* incremented with no bounds check */
And the counter wrapped and the system crashed.
The code should have said:
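if (num_crew_changes == INT_MAX) /* INT_MAX from <limits.h>: 32767 for a 16-bit int */
{
ERROR(E1234, "too many crew changes");
return; /* refuse the change instead of letting the counter wrap */
}
crew_change_list[++num_crew_changes] = blah;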
The system is still degraded after 32767 crew changes. It might be so degraded as to be unusable. But at least the company would know the extent of the degradation and could pull out the appropriate "Plan B". It's much safer and better to work around a known problem of known scope than to work around a system crash when you don't know the exact problem.
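For anyone who wants to try the guard, here is a minimal, self-contained sketch. It assumes a signed 16-bit counter (int16_t standing in for the old 16-bit int), and the names crew_change, add_crew_change and MAX_CREW_CHANGES are hypothetical, not anything from the actual Comair or vendor code. The check runs before the counter is touched, so the 32,768th change is rejected with an error instead of wrapping the counter negative.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical capacity: the largest value a signed 16-bit counter can hold. */
#define MAX_CREW_CHANGES INT16_MAX

/* Hypothetical record type standing in for whatever the real system stored. */
typedef struct {
    int flight_id;
    int crew_id;
} crew_change;

static crew_change crew_change_list[MAX_CREW_CHANGES];
static int16_t num_crew_changes = 0;

/* Returns 0 on success, -1 if the table is full. The limit is checked
   before the counter is incremented, so the counter can never wrap. */
int add_crew_change(crew_change c)
{
    if (num_crew_changes >= MAX_CREW_CHANGES) {
        fprintf(stderr, "E1234: too many crew changes (%d)\n", (int)num_crew_changes);
        return -1;
    }
    crew_change_list[num_crew_changes++] = c;
    return 0;
}
```

Rejecting the change is only the minimum; what the caller does with that -1 (page someone, fall back to manual scheduling) is the "Plan B" part.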
Rubbish. Don't judge yesteryear's programs by today's standards. Back then 4MB of RAM cost more than $200. That's how important memory conservation was. In 1989, using an int was a perfectly acceptable choice. If you were programming back then, you'd know how loath programmers were to use longs when they didn't have to. (Granted, an unsigned int would've worked better here, but that 64K limit could've also been reached.)
The software spec probably says something to the effect of "Don't attempt to schedule more than 32,767 crew changes." If you're running software that's more than a decade old you need to know what the limits of your software are.
Have fun: Join D.N.A. (National Dyslexics Association)
Ok dude do the math.
;-)
A sells tickets at a $0 loss.
B sells tickets at a $75 loss.
B gains many customers. However, the more customers B has, the more money it loses. Recall that EVERY SEAT costs them $75. Eventually B just runs out of money and raises its prices.
Now A and B sell at the same price. Customers notice the price hike and get upset [because for some reason people think air travel is a god-given right, so they get insanely upset at everything].
Sure, some won-over customers will stay with B, but many will spread out [many are also not particularly loyal; they just use whatever cheaptickets.com tells them to].
Tell me I'm wrong. Tell me that most airlines haven't been filing for bankruptcy protection. Come on, tell me.
Tom
Someday, I'll have a real sig.
unsigned short numberScheduleChanges;
fixes the problem.
You do realize that you've just fallen into the same trap, right? That doesn't fix the problem worth a damn. I mean, sure, it doubles the number of changes. And yes, 64,000 should be enough. But, hey, 32,000 should have been enough too, right?
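To make the point concrete: in C, unsigned arithmetic wraps modulo 2^N, so an unchecked unsigned short doesn't go negative at the limit, it quietly goes back to 0. A minimal sketch, assuming a 16-bit unsigned short (typical, though the standard only guarantees at least 16 bits); next_schedule_change is a hypothetical helper, not code from the real system:

```c
#include <limits.h>

/* Increments a counter the way the "fixed" declaration would, with no bounds
   check. Converting the result back to unsigned short is defined to wrap
   modulo USHRT_MAX + 1, so USHRT_MAX + 1 silently becomes 0. */
unsigned short next_schedule_change(unsigned short numberScheduleChanges)
{
    numberScheduleChanges++;
    return numberScheduleChanges;
}
```

If that counter is also used as a table index, the likely result is silently overwriting the first entries rather than crashing, which is arguably harder to notice.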
Programs have internal limits. That's kosher. What's not appropriate is allowing the user base to exceed them or (for something like this) come close to exceeding them, without giving some kind of warning that notifies people of an impending problem and provides possible solutions (purge data, etc). Now you may point out that adding that kind of safeguard increases the cost and complexity of software. Yup. That's why true enterprise software is expensive. Because that's what you're paying for.
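A minimal sketch of the kind of early warning described above, assuming a hypothetical hard capacity of 32,767 and an arbitrarily chosen warning level at 90% of it; record_one, CAPACITY, WARN_LEVEL and the status names are made up for illustration:

```c
#include <stdio.h>

/* Hypothetical limits: hard capacity plus a warning level at roughly 90%. */
#define CAPACITY 32767
#define WARN_LEVEL (CAPACITY / 10 * 9)

typedef enum { COUNT_OK, COUNT_WARNING, COUNT_FULL } count_status;

/* Try to record one more item. At the hard limit, returns COUNT_FULL and
   leaves *count unchanged. Once usage reaches the warning level, still
   records the item but prints a warning and returns COUNT_WARNING. */
count_status record_one(int *count)
{
    if (*count >= CAPACITY)
        return COUNT_FULL;
    (*count)++;
    if (*count >= WARN_LEVEL) {
        fprintf(stderr, "warning: %d of %d slots used; purge old data or upgrade\n",
                *count, CAPACITY);
        return COUNT_WARNING;
    }
    return COUNT_OK;
}
```

A real system would presumably warn once (or once a day) rather than on every call past the threshold, and route the message to someone who can act on it.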
Another alternative would have been for the software to wrap around and start purging existing records to make room for new ones. Either way, there should have been some defined strategy for the boundary condition, and there wasn't.
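The wrap-and-purge alternative is essentially a ring buffer. A minimal sketch, with a hypothetical four-slot log of ints standing in for real records (ring_log and its functions are made-up names); whether silently dropping the oldest record is acceptable depends entirely on the data, and it may well not be for live crew assignments:

```c
#include <stddef.h>

/* Hypothetical fixed-size log that, when full, overwrites its oldest entry
   instead of overflowing: the "wrap and purge" strategy. */
#define LOG_SIZE 4

typedef struct {
    int entries[LOG_SIZE];
    size_t head;   /* index of the oldest entry */
    size_t count;  /* number of valid entries, never more than LOG_SIZE */
} ring_log;

/* Append a value; returns 1 if the oldest entry was purged to make room, else 0. */
int ring_log_add(ring_log *rl, int value)
{
    if (rl->count < LOG_SIZE) {
        rl->entries[(rl->head + rl->count) % LOG_SIZE] = value;
        rl->count++;
        return 0;
    }
    /* Full: overwrite the oldest entry and advance head past it. */
    rl->entries[rl->head] = value;
    rl->head = (rl->head + 1) % LOG_SIZE;
    return 1;
}

/* Return the i-th oldest entry (0 = oldest); the caller ensures i < count. */
int ring_log_get(const ring_log *rl, size_t i)
{
    return rl->entries[(rl->head + i) % LOG_SIZE];
}
```

The return value of ring_log_add is the "defined strategy" part: the caller always knows when a purge happened and can log or report it.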
The other thing that the software vendor should have done when pushing their upgrade is point out that the previous version wouldn't allow flights to continue in that situation, but the new version expanded it to (some large number). Instead, they probably said, "We're 32-bit!" or something totally meaningless to the people evaluating the business case for the upgrade.
You're special forces then? That's great! I just love your olympics!
Am I the only one who reads POS as 'piece of shit' regardless of the fact I know it means point of sale?
It always fits perfectly in the context as well, as this example proves.
Arbitrary being the key word. The limits probably weren't arbitrary when they were put in. The system probably had an expected life, and instead of maintaining their infrastructure, the people tasked with running the company probably gave themselves pay raises while postponing payments into the employee pension fund. What stories like this are really about is the complete worthlessness of MBAs. They exist for the sole purpose of diffusing responsibility and obstructing accountability.
Very rarely does anyone have the luxury of designing for something with a hundred year life expectancy and a budget to match.
When business won't give IT the money needed to keep business's systems operational (be it for manpower, software upgrades, or electricity) and makes the final decision on purchases, something's going to have to give.
Business decides to buy a software package. After a while, upgrades come out, and the old version keeps getting pushed to its limits. IT advises business of this and says that an upgrade/replacement will resolve the problem, but business refuses to authorize said upgrade/replacement.
How do you propose IT "make it work" when their hands are tied? Even worse, IT will take the blame when it wasn't even their decision to make.
The only place where shaving bits made sense for us was on data records: we had a hash file with 2.1 million records, each 29 bytes long, and they all had to fit on a single 80MB hard drive. We squeezed every single bit out of those records, including developing a 3-byte integer to handle amounts that we told them could never exceed $99,999.99 (among other things, larger amounts would not have printed correctly). But they were read-only records to us: we never wrote more than a few thousand rows of data, and we had plenty of space for the day's processing. And when they did have the odd line item that exceeded $100,000.00, they figured out how to break it up into multiple smaller items.
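The 3-byte amount trick might look something like this. It's a sketch under assumptions: amounts stored as whole cents, least-significant byte first, and pack_cents24/unpack_cents24 are hypothetical names, not the poster's actual code. 24 bits hold values up to 16,777,215, so $99,999.99 (9,999,999 cents) fits with room to spare, and 2.1 million 29-byte records come to about 60.9 million bytes before any hash-file overhead.

```c
#include <stdint.h>

/* Largest amount the format promised to hold: $99,999.99, stored as cents. */
#define MAX_CENTS 9999999UL

/* Pack an amount in cents into 3 bytes, least significant byte first.
   Returns 0 on success, or -1 (writing nothing) if the amount exceeds
   the documented limit. */
int pack_cents24(uint32_t cents, unsigned char out[3])
{
    if (cents > MAX_CENTS)
        return -1;
    out[0] = (unsigned char)(cents & 0xFF);
    out[1] = (unsigned char)((cents >> 8) & 0xFF);
    out[2] = (unsigned char)((cents >> 16) & 0xFF);
    return 0;
}

/* Reassemble the 3 bytes into an amount in cents. */
uint32_t unpack_cents24(const unsigned char in[3])
{
    return (uint32_t)in[0] | ((uint32_t)in[1] << 8) | ((uint32_t)in[2] << 16);
}
```

The packer refuses anything over the documented limit instead of truncating it, which is the same boundary-condition discipline the rest of this thread is arguing about.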
And we got bit more than once by overflows. It took like three separate f-ups to get this guy to acknowledge that he needed to stop being stingy with the bytes. Even then, he'd still try to sneak in some memory "savings", but at least he stopped arguing when we called him on them.
John
I suspect the attitude of debating a limit without knowing the business context your design choice exists in is what created this error to begin with.
I work in the industry, so I might be able to provide an alternate viewpoint. Essentially, what happened to the airline industry is that the market changed from a luxury market (high profit margins, low competition) to a commodity market (low profit margins, intense competition). Unfortunately, the airlines had all made long-term deals with the trade unions that presumed stagnant market conditions, so when the market changed, they could not change with it.
The smaller carriers all have one thing in common: no unions. They do not pay their pilots as much, and their pilots do not get paid if they don't fly. The number one expense for an airline is fuel, but the number two expense is labor: pilots, stewardesses, mechanics, and baggage handlers. There was no way for older airlines to meet the new market conditions (fly more for less profit per flight) without paying people less. The problem is that no one wants to be paid less, so instead they get rid of the least powerful people (who also happen to be the least paid). This is also specified in the union contract: all layoffs must be FILO (first in, last out).
Essentially, the major carriers are hamstrung by the unions, and they will not survive long term. Unions work by artificially limiting labor supply - but that doesn't work if there is not enough work.
The unions say how evil it is that they are getting pay cuts, but where exactly do they expect the money to come from? The government really should not prop up certain providers when others are eager to take their place. Competition works for the most part. Air travel is becoming a commodity market, like cars. Market transitions cause upheavals and change the market leaders, especially if the current leaders cannot change their business structure.
I have to say that I totally disagree about management being incompetent: the current management (at least the upper-level people I deal with) is extremely good. They may even get the airline to survive and adapt to the current market conditions. But what has really destroyed the airlines is the changing market, and the unions preventing the old airlines from changing with the times. The only thing management could have done would have been to reject the union contracts earlier. But I doubt that was possible.
Unions seem to believe that society owes them a living. The problem is that society (except in the form of government) is not a person, and so recognizes no debts. Fighting that is totally ineffective because there is no one to fight.
while (sig==sig) sig=!sig;