Comair Done In by 16-Bit Counter
Gogo Dodo writes "According to the Cincinnati Post, the Comair system crash was caused by an overflowed 16-bit counter. Perhaps Comair should have paid for the software upgrade to MaestroCrew." You heard it here first...
This was Y32k!
FLR
It seems that 16 bits and 640K wasn't enough for them after all.
Striking fear in the authors of godawful fanfiction, I am here, appearing in darkness, Tuxedo Jack!
...I heard it on BugTraq first...
Maybe the existing system was working just fine? Upgrades too expensive?
Perhaps this was something that they never anticipated in a thousand years?
I bet *now* they'll upgrade, but until this particularly hairy situation arose, they didn't really see a need to upgrade a computer scheduling system that had been working great for them.
Dunno why this is interesting, aside from seeing "16 bit" in the headline.
I believe this will answer your question:
Tom Carter, a computer consultant with Clover Link Systems of Los Angeles, said the application has a hard limit of 32,000 changes in a single month.
"This probably seemed like plenty to the designers, but when the storms hit last week, they caused many, many crew reassignments, and the value of 32,000 was exceeded," he said.
So it sounds like a signed int.
It was a joke! When you give me that look it was a joke.
Well, not this specific problem, but businesses have a common problem of outgrowing the systems that run their business. OTOH, this was an outsourced solution, so this case is pretty hard to explain away, other than sheer incompetence.
That's what you get for using a buggy and OLD OS for such important tasks. Grats!
Signed. Had the developers used unsigned then it might never have overflowed at all (They were going to replace the system in a few months anyway.)
Since TFA quoted someone who cited a limit of 32,000 reassignments, I would guess signed (and that they were just truncating--if it's really a 16-bit counter, the upper bound would be 32,767).
Here's the original post:
Hi,
On Christmas Day last Saturday, Comair Airlines had to completely stop
flying
all of its planes due to computer problems. Comair blamed the computer
problems on their pilot scheduling software being overloaded after bad
weather earlier in the week forced many flights to be rescheduled. Comair
now hopes to have all of its 1,100 daily flights restored by tomorrow.
An article which was published today at the Cincinnati Post Web site
provides some interesting details of a software failure in Comair's pilot
scheduling software:
How it happened
http://www.cincypost.com/2004/12/28/comp12-28-200
According to the article, Comair is running a 15-year old scheduling
software package from SBS International (www.sbsint.com). The software has
a hard limit of 32,000 schedule changes per month. With all of the bad
weather last week, Comair apparently hit this limit and then was unable to
assign pilots to planes.
It sounds like 16-bit integers are being used in the SBS International
scheduling software to identify transactions. Given that the software is 15
years old, this design decision perhaps was made to save on memory usage.
In retrospect, 16-bit integers were probably not a good choice.
An anonymous message posted to Slashdot the day after Christmas first
described the software failure at Comair:
http://slashdot.org/comments.pl?sid=134005&cid=11
Earlier this year, an overflow of a 32-bit counter in Windows shut down air
traffic control over southern California for 3 hours:
Microsoft server crash nearly causes 800-plane pile-up
http://www.techworld.com/opsys/news/index.cfm?New
This problem occurred because of a known design flaw in older versions of
Windows:
http://tinyurl.com/5n9gc
Richard M. Smith
http://www.ComputerBytesMan.com
from information week
...
...
"The computer failure that grounded an airline's entire fleet over the Christmas weekend and stranded thousands of travelers was due to creaky software that couldn't count higher than 32,768."
According to the Post, the software -- which tracks all details of crew scheduling, including how long they have flown (an FAA regulation restricts airtime), and logs every change -- has a 16-bit counter that limits the number of changes to 32,768 in any given month.
to be fair (although it's not an excuse), but 32K crew changes in a month? that's like 1,000 a day? that's crazy!...
"Facts are meaningless. You could use facts to prove anything that's even remotely true." - Homer Simpson
This was a horrible chain of events that severely inconvenienced a lot of people for Christmas, and I would be hoppin' mad if I was in any of their places. However, let's not jump on ComAir too hard, IMHO. From TFA:
"This probably seemed like plenty to the designers, but when the storms hit last week, they caused many, many crew reassignments, and the value of 32,000 was exceeded," he said.
It's true, it was an extreme confluence of circumstances... horrid weather (heck, there was snow in some Texas town for the first time in like 80 years or something, read it in some glurge article) coupled with the winter holidays. They should redesign their system and admit that they've grown to a level where their system is unable to handle extreme circumstances, and this should serve as a great wake-up call for them.
In the past I've always chuckled at the thought of 'upgrading for the sake of upgrading', but I suppose this is one case where an earlier upgrade could have saved them millions and made a lot of people's holidays better.
"There's no success like failure, and failure's no success at all."
- Bob Dylan
I stopped using 16 bit ints for anything 10 (or more) years ago when I had the joy of migrating systems from a 16 bit OS to a 32 bit OS.
Best Slashdot Co
what Initech handles?
Yeahhhhhh! Mmmmmmkay!
Did you get that memo?
Yell & scream & rant & rave... it's no use... you need a shaaaave ~ Bugs Bunny
just RTFA linked in the summary ("Comair system crash")...
"Facts are meaningless. You could use facts to prove anything that's even remotely true." - Homer Simpson
The human slashdot editors were replaced long ago. I think it's some google news beta program that currently posts the stories.
Coder's Stone: The programming language quick ref for iPad
Since 2^16 = 65536, I'm guessing signed.
Comair Says Back to Normal Daily Schedule - Los Angeles Times (subscription) - 16 hours ago
Comair Downed By Computer Counting Limit - Information Week - 17 hours ago
Comair Faces Government Investigation - InternetNews.com - Dec 28, 2004
Which should pretty quickly get you to a summary account of the problem. A moment or two listening to the news on any reasonably competent radio station in the US would do it, too. And no, Slashdot is not a news reporting site: it's an aggregator.
"Perhaps Comair should have paid for the software upgrade to MaestroCrew." (in the Simpsons Comic Book Guy voice)
The owls are not what they seem
I thought the Comair crashed when Nicholas Cage steered it into the window of a Las Vegas casino.
Don't blame Durga. I voted for Centauri.
what, the link [http://it.slashdot.org/it/04/12/26/052212.shtml?t id=128]
doesn't work 4 u?
That when you are talking about an airline, a COMPUTER crash is by far the least traumatic kind you can have.
"Wow. Now THAT'S a lot of angry Indians." - Lt. Col. George Armstrong Custer
Now my question would be, since they're owned by Delta, why wouldn't Comair flights be handled within Delta's own reservation/flight tracking system?
p.s. I've traveled through CVG, on Delta, during the holidays. Not anymore... One weather-delayed flight and the whole system falls apart.
I'd have a personalized plate on my car, but "toxic bachelor" won't fit into 7 letters.
Heard it here first? Sorry, I heard it first on the web/digest version of comp.risks... http://catless.ncl.ac.uk/Risks Excellent place to read about the risks of modern computing equipment and the risks to society by using same usually from mistakes like the 16 bit counter.
It looks like it was a signed 16-bit value from the detail in the article (32,767 maximum schedule changes a month). Why they would possibly need the sign is beyond me. Regardless, having 32767 schedule changes in a month? Must track every flight in the world.
Never hit your grandmother with a shovel, for it leaves a bad impression on her mind...
It could have worked if it weren't for the two's complement; they would be good for twice what they had. I think programming languages should make numbers unsigned unless asked otherwise; that way we can take advantage of that extra bit. For things like counters, where negative numbers just won't happen, it is like having a 15-bit number taking 16 bits of space.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
When the shuttle on the screen blows up, and is accompanied by a very loud explosion sound outside the building, the kid looks sheepish and sneaks away.
Don't blame Durga. I voted for Centauri.
So it turned out to be problematic to use a signed 16-bit integer.
...
But the real problem is a lack of error checking. It sounds like the code had something like:
int num_crew_changes;
crew_change_list[++num_crew_changes] = blah;
And the counter wrapped and the system crashed.
The code should have said:
if (num_crew_changes == MAXINT)
{
ERROR(E1234, "too many crew changes");
}
The system is still degraded after 32767 crew changes. It might be so degraded as to be unusable. But at least the company would know the extent of the degradation and could pull out the appropriate "Plan B". It's much safer and better to work around a known problem of known scope than to work around a system crash when you don't know the exact problem.
I wrote a small POS system for multiple businesses in my city. It's using a counter that has the same limit to count the number of transactions in a day.
My thinking behind it is that a transaction takes, on average, about 30s to complete, start to finish. There are 86400 seconds in a day (24h * 60m * 60s), or 2880 "groups" of 30 seconds (prev math / 30).
As such, there wouldn't be a case where I'd hit the magic 32767 limit. Maybe I should remove it anyway though. Suggestions?
It's times like this when you begin to realize that the Vic-20 (duct-taped to the bulletin board and surrounded by haywires) might not be the best choice anymore as mission-critical hub of your operations.
Don't blame Durga. I voted for Centauri.
I wonder how fast this CIO is going to be on his butt.
"Well... we were holistically mitigating our financial stance outside the box of current processes while trying to forecast our future technological stability within the transport industry."
"Well... you're fired! NEXT?!"
Having once done tech support for the Maestro program used by Comair (and other scheduling software for other airlines as well), I think the software is junk. The employees undoubtedly said "I told you so!" when it broke, because they hated it as much as the support team did. IMO the airline didn't bother upgrading because they didn't think the old version was broken enough or outdated enough to warrant it.
Hmmm.
why would anybody make an event counter a signed value?
short numberScheduleChanges;
hello?
unsigned short numberScheduleChanges;
fixes the problem.
there are 3 kinds of people:
* those who can count
* those who can't
Yeah, but that's my point -- the majority of Slashdot's summaries don't adequately explain what they're pointing to; covering Who / What / Where / When would do it, but one or more of these is almost always missing.
I, and most other people, don't have time to read every single article just to figure out what the Slashdot editor was posting about. Moreover, in a lot of cases -- not this one, but lots of others -- the befuddled herd of Slashdot visitors has trampled the site in question, so it's not even possible to look up the original article. Admittedly, that isn't the case here, but it's true more often than not.
All of this could be avoided easily if the Slashdot editorial staff were forced to sit through the first week of an introductory journalism class, where the grad student teaching the class on a professor's behalf will drill in the mantra of Who / What / Where / When until the students finally get it.
DO NOT LEAVE IT IS NOT REAL
Why was Comair using signed shorts to track their scheduling changes anyway? It seems to me that a company of that magnitude should expect to run into more than 32000 schedule changes within one month more than once. I mean, I can understand that the counter was probably designed with space constraints in mind, but for Christ's sake, it would've only been two extra bytes to fix this. That brings the total up to some 4 billion unsigned if I'm not mistaken. Technically, they could've used just three bytes, but then again, I wouldn't expect them to because how many languages have 24-bit integers built in as primitives? Of course, like someone else said, I guess we can't blame this all on the programmers either. I just wouldn't consider it very comforting that such a system could become crippled just because the programmers didn't think to allocate enough memory to allow for enough flexibility in scheduling.
Maybe Maestro should just die. My friend is a flight attendant for Southwest and has to use Maestro to plan her schedule. To use it she has to citrix into their main server and wait for an open client (I assume they have either a license or horrible programming restriction on concurrent users). On the very day that the new schedules are posted, it can take hours to log in. It's a joke.
This stuff could be handled by a team of a dozen web based programmers (Java? C? ASP? LAMP? You pick.) in a few months. It's not difficult.
Not only that, a lot of Oldfield's game involves piloting glider- or airplane-like avatars.
Don't blame Durga. I voted for Centauri.
seems possible with the ever-decreasing quality of journalism seen here...
"yeah what 'appened right was this computer-me-thingy went all pear-shaped and these silly buggers Comair went and got themselves 'Done In'..."
Better than Mr. Darl McBride. He would've said something like...
"Well... we were holistically mitigating our funds so we would have the money to sue the pants off every person who has ever ridden our planes, while trying to forecast how we plan to make a profit....."
2b || !2b =?
Hey everybody! Comair is hiring Unix System Administrators and IT Software Engineers! http://www.comair.com/hr/other/
I can't stand it when someone posts a URL to some mailing list, telling everyone to go and look at it, without telling us why we should care about it.
When taken without neighboring information, the only clues that Slashdot gave about the article were that it was in the 'IT' section, and had a 'bug' picture next to it, so we know it was a technology problem, which most computer geeks would have known from 'overflowed 16-bit counter'.
It doesn't take that much extra effort to add a little more detail so that people can make a decision if it's of sufficient interest to spend our time reading the article:
Build it, and they will come^Hplain.
The "Comair system crash" link was just a link to a previous slashdot story (no change for /.'ing that), and already contained a summary...it would have been (-1) redundant to provide a new summary for what originally happened, when you can just link to the original summary...
"Facts are meaningless. You could use facts to prove anything that's even remotely true." - Homer Simpson
Thank you. So if the article just said something like...
...then it would have been enough. That one amendment -- "airline shutdown in the midwest last weekend" -- would have been enough to make the article perfectly clear to anyone who wasn't up to date with the story so far.
Is this really so much to expect? I wasn't trolling, honest, I'm just increasingly frustrated with how addled the editorial review of articles is around here. They've had years to figure this stuff out, but it's just as bad now as it was when Slashdot got started. Are they ever going to start taking their jobs as editors seriously & professionally? (And this isn't meant to be a blanket complaint -- Pudge for one seems to be aware of these basics and tries to do right with his articles, but other editors consistently muck this up...)
DO NOT LEAVE IT IS NOT REAL
"The computer software that crashed and grounded Comair's entire fleet on Christmas Day was an antiquated system due to be replaced in the coming months."
:P
First paragraph. I had just forgotten about it by the time I got to the *end* of the article. 6am + ADD - caffeine = me missing that bit. My bad.
If the system is designed that the only way to know if the change will work is to make it, and then if you don't like it take the change back, then I could see how very quickly this could create a problem.
The "job action" shows the laziness and greed. That's what labor unions are all about.
...is the management of software companies who ceased using real computer scientists to design and write their apps because disposable code monkeys work for so much cheaper. And outsourcing is even cheaper... in the short term (which seems to be all that matters in this industry anymore)
Yeah, you got it. Programming should be left to white Americans!
back in the early 80's. There was a big financial company that had an automated system that watched the prices of certain commodities and issued automated trade orders. The transactions were stored in arrays addressed by 16 bit signed integers, with the (now) highly predictable result on the first day that trading volume exceeded 16384 transactions. Since in C arrays are just syntactic sugar for pointer arithmetic, the system started executing trades based on "data" from random bits of heap memory. This apparently went on for some time before a human being figured out something had gone wrong, and (reportedly) the company lost billions in a single day. This might be somewhat exaggerated, since the event now has passed into folklore.
In any case, this is one of those incidents like the Therac-25 accidents that experienced programmers should always have in mind.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Probably so the crews from both parts of the airline don't get mixed-n-matched - there is strict seniority in airline hierarchies, and the system Delta has probably can't handle two seniority lines at once for the same job. Your seniority affects what schedules you get - the more senior you are, the more likely you're going to get the schedule you want to fly. New pilots/cabin crew tend to get all the shitty routes no one else wants to fly.
Oolite: Elite-like game. For Mac, Linux and Windows
Great news. Now that $8/hr mouth breathers get to treat everyone like a terrorist, it's good to know that infrastructure is incredibly fragile and built like shit so that the economic effects are more or less the same. Maybe we should apply some of that you're-a-terrorist mentality to the people who run the fucking airlines...?
Nah!~ let's just fuck with the customers.
All Hail Maximus Primate Bush!
Delta Airlines has also been gutting its IT (and other) staff in recent years, while offering expensive protections to executives and many other employees making over $200,000/yr.
I'm wondering when Delta & Comair identified the troublesome software package as something that they should upgrade? The article says that it's going to be upgraded in upcoming weeks, so when did the upgrade get approved?
Is Comair going to try to go after SBS International for damages?
Has this issue ever come up before or did the powers-that-be know about the possibility of major issues beforehand?
It'd be interesting to see what happens as a result of this.
You'll be happy to know that after this mishap, all the major airlines have upgraded all their critical systems to XP Media Center Edition (TM). Not to worry, however: in order to increase system reliability, menu animation has been turned off.
I wonder if the poor sap that wrote the software is reading through all these comments and just smacking his head. How do you go home at night and tell your wife that you signed an int, and it ruined Christmas for a whole lotta people?
Remember last February the Martian rovers were sidelined by a computer crash. The reputed problem was the huge flash memory appeared to fill due to not properly building the free inode list in the OS. This was the realtime VxWorks OS from Wind River used by NASA on space probes for 20 years. The Spirit rover went into a perpetual reboot cycle for a day (rebooting is its default response to a severe crash). Then it took two weeks to diagnose, repair and test the software patch.
So the point is, that bugs can appear even in highly tested OSes.
I worked at a bank in the early 90s that had a trading system based on SQL Server and the client was written in Visual Basic 3. Apart from every other bad design choice in this system (I inherited it when the designers got promoted and started working on another, even bigger system), the all important record counter was an integer, so when trade 32768 was posted, the application crashed, and simply could not be started again, because the first thing it did was try to show the current total (it was written for operators to use, not traders). Worse was that the counter variable wasn't a global, and it was often times a stack variable, and always with a different name (sometimes iCounter, sometimes iCount, sometimes x).
The upshot was that I was able to convince management to totally scrap it and allow me to write a new one. The downside was that the idiot who designed the original system went on to spend 100 million dollars on this new, grandiose system that too was eventually scrapped, but he knew long before that his turkey wasn't going to fly, so he quit and became a lead architect at some other company.
*Sigh*...okay, back to coding.
I hear ya. A lot of the time I'm at work and all I have time for is to post some two-liner bullshit reply to the article.
now THAT'S an upgrade longgggggggggggggggg overdue due to corporate bureaucracy failing to commit a budget
Arbitrary limits will trip you up eventually. It's not as if nobody knew to avoid them before
For one thing, there is an "arbitrary limit" on the number of atoms in the visible universe. At any given time, there is an "arbitrary limit" on the amount of memory available for a given price in dollars, and there is also an "arbitrary limit" on the amount of arithmetic that can be done in one second for a given price in dollars. Arithmetic in variable-length integers (called bignums) allocated on the heap often carries significant performance overhead compared to arithmetic in machine word-sized integers (called ints) in fixed-size data structures. On a given computer system, it may have been an engineering decision to use ints rather than bignums.
You kind of answered a question I posed elsewhere - "How long has the airline been aware of the fact that this software has potentially serious issues?"
"some time ago.."
But my experience in the IT industry is like 'Does it work? Then it's ready'. And also 'Can you do it faster? Then do it faster and patch the program later'. There's a misconception that everything should be done quickly in the IT industry, when actually everything should be done as close to perfection as possible, because otherwise it takes even more time: the programs will have to be fixed sooner or later.
There probably isn't any reason to. Comair, as a regional jet carrier, has separate crew contracts and crew rules than Delta, a mainline carrier. Thus they operate completely different types of jets, with different crew staffing requirements. The FAA crew rules might even be different. While it might make sense from a consolidation standpoint to merge the two systems of Comair and Delta, since in reality there would be no interaction and no overlap between the two systems (an RJ pilot isn't suddenly going to jump over to fly a 757) the expense isn't worth it.
p.s. I've traveled through CVG, on Delta, during the holidays. Not anymore... One weather-delayed flight and the whole system falls apart.
Then I hope you also avoid United/United Express/Ted at O'Hare/Denver, Continental at Newark/Houston, Northwest in Detroit, USAirways in Cincinnati, American at O'Hare/Dallas... etc. etc. Every airline, not just Delta, uses hubs, and ground stops at any of these airports will cause significant delays. That's just the reality of air travel these days; if you're really worried, book non-stop travel (and pay up to 10x more).
Crash the computer system. :)
Or fly Southwest, which doesn't use the hub-and-spoke system, and get a direct flight to anywhere they fly for 10x less. Or maybe not 10x less, but in any case cheaper than any of the bigger airlines they compete against. There are alternatives, provided you are travelling between cities Southwest or an airline like them (JetBlue, Song) goes to...
DO NOT LEAVE IT IS NOT REAL
...was over at US Air, where a large number of selfish and uncaring union employees decided to mess up a huge number of their customers' Christmas plans.
While stupid stuff like this software problem is embarrassing, and the SBS and the people who wrote the software should hang their heads in shame it was still unintentional and is nowhere as shameful as the deliberate sabotage done to US Air's customers by their union employees.
Stand Fast,
tjg.
This is one time I'm glad airlines have limits on their liability to passengers.
Can you imagine what would happen if they didn't? Everyone would sue them for a ruined Christmas and Comair would be in Chapter 11 by tomorrow.
If Comair can afford it, they should give refunds to all affected passengers. I hope they've already comp'd the hotels and incidentals. Every executive and the highest-level IT people should surrender their 2004 bonuses as a gesture of contrition.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
If and when I absolutely need to fly somewhere, I most likely will pay extra, for the non-stop and possibly for the first class/business section, or just chuck it all and book a flight on a charter airline instead. I hate the commercial airline 'experience' just that much.
I'd have a personalized plate on my car, but "toxic bachelor" won't fit into 7 letters.
Makes you wonder - perhaps important systems like that should be written in a higher-level, fully memory-managed language, even if it's slower. Buying extra fast hardware to run it means nothing to a decent sized company, but losing a million because of over-complex code with potentially 100's of little bugs means a lot...
This comment does not represent the views or opinions of the user.
Obviously, computer crash victims complain about the airline long and loud. Plane crash victims tend to be silent.
Don't blame Durga. I voted for Centauri.
...my aunt and uncle had their flight from Denver cancelled when a baggage truck blew across the tarmac and into their plane.
Facts do not cease to exist because they are ignored. - Aldous Huxley
Look at who is doing what. This one is entirely the union's fault. They are the ones loafing and taking it easy while bags pile up and customers get angry.
In this case, a 32-bit counter would have prevented the crash, but really, how many coders check for overflow in their 32-bit counters? In long running systems, you *have* to. And it's easier to overflow a 32-bit counter than you think; it depends how often it gets incremented (that is, what is being counted).
In languages with graceful promotion from integer to bignum, this is a non-issue. Not for most languages, however. (And lest anyone think I'm oversimplifying here, such promotion is not so simple, in that bignums take a variable amount of memory. To handle this truly transparently you need some form of automatic memory management, which is a no-no for many embedded systems.)
16 bits will get you a cup of coffee...
I think their downtime now is probably costing them MUCH more than $200.
Take off every sig. For great justice.
This stuff could be handled by a team of a dozen web based programmers (Java? C? ASP? LAMP? You pick.) in a few months. It's not difficult.
...).
I think you're underestimating the complexity of the problem. Sure it's easy to just assign people randomly, but there are a lot of constraints on the system:
1) Crewmembers should only work for so long before resting (security).
2) Crewmembers should only work for so many hours a month.
3) Crewmembers should end up home at some point.
4) All planes should have so many crewmembers (pilots, flight attendants,
6) The airline should spend as little money as possible on salaries.
And there are probably a lot more. Anyway, the point is that the algorithm for assigning crew in a (near) optimal way is probably quite tricky - a wild guess would be that it requires some sort of linear programming.
Somewhere in his awkward choice of words is a point. Speeding an application to market by skimping on the quality of the work causes this problem. Real computer scientists balk at leaving problems half done, monkeys don't. The outsourcing thing at the end, yeah, I don't agree with that as a mantra.
You are checking your backups, aren't you?
Damn Warez Copy...I thought it didn't have limits!
Yes, they had to have used a signed 16-bit integer counter. A little background for those who may not know how to count in binary:
If you know your binary, you know you can actually store a number of about twice that if you fill all the zeroes with ones:
But, if that's a signed integer, you'll need to reserve a position for +/-, leaving only fifteen binary places for the integer.
So as soon as the Comair system needed to count 32768 flight crew changes for that month, it crashed. It counted -0, or zero.
-jh
I've seen CIOs and VPs of IS do extremely stupid and/or damaging things to the departments they were in charge of and then leave the company with more laurels dangling from their CVs.
It doesn't seem to matter that the real-world impact of their actions and policy decisions actually hurt the company.
Very frustrating to see this at multiple companies and (in some cases) multiple times at one company.
Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
The Theorem Theorem: If If, Then Then.
Seems like there was another example of this sort of thing on November 2nd, 2004 as well. IIRC, some North Carolina machines dumped 3000+ votes due to a similar problem.
--
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
mod parent down.
One poster already noted; The wholly-owned carriers fly different equipment and are staffed by pilots who are members of a different seniority force.
Moreover, typically the crew tracking system is integrated with the flight operations/dispatch system, and the maintenance control system, and the route planning system, and the trip optimization system. You wouldn't want to try to integrate all those functions into the parent carriers system unless you *had* to.
Finally, CFR 14 Part 121 says that each certificated carrier has to have their own dispatchers on staff. Comair, et al, are technically independent carriers -- they have their own certificate (DOT license to run an airline), and therefore have to staff their own flight operations (dispatch) office.
Therefore, Comair cannot integrate their staff with Delta's, even if they wanted to. Of course, that doesn't mean they couldn't still use Delta's operations software, but it just shows how separate the airlines actually must operate -- making the advantage of merging systems specious at best.
\FAA licensed aircraft dispatcher
-- Experience is a wonderful thing. It enables you to recognize a mistake when you make it again.
You want to fly to Los Angeles in a month and purchase the ticket then. The price you pay reflects the value of being able to make that choice then and assuring the airline of a seat being filled in a month. The value of the seat changes as time goes on such that 1 day before the plane leaves the seat is now worth a lot more to someone that has to get to Los Angeles the next day, no matter what the cost. Of course there is the other aspect as well - the seat has no value once the plane leaves.
Managing this changing value is what makes airline ticket prices incredibly complicated.
It sounds more like a signed short.
Hey, in 1989, Commodore's Amiga was 4 years old. And the Lattice C compiler commonly used on it considered an int to be 32 bits.
what-- were they expecting negatives also?
Generally, software is written to some specification. This specification may be perfectly reasonable (though often not) at the time of implementation. However, the environment is not static. What was reasonable at some time in the past may not be reasonable later. The software then needs to be modified or rewritten entirely.
Anyone who uses software must constantly evaluate it under actual working conditions to verify that it still meets current and foreseeable requirements. One would think that there were various reports as to how many reassignments occurred on a monthly basis and it is unlikely, though not impossible, that the system was overly stressed from the day it was implemented.
Sounds like someone didn't monitor the high-water marks the system had reached against its "known" limitations (of course, that individual or those individuals may have been laid off or retired).
Bad risk management. Shame on management. That's part of their job.
Back in the 80s when I was a C programmer (K&R, thank you, the one true C), C integer types were not standardized. "Integers" were defined to be the most natural size for a machine (typically a data word); "shorts" were defined to be no larger than ints, but possibly smaller (and thus possibly more space efficient). This reflected the philosophy of C-as-portable-assembler: if you were indexing an array of character representations of digits, for example, there was no reason not to use a short. It was conceivable that, since arrays were essentially immutable pointers and array indices were merely offsets against those pointers, you might want to reference a negative offset from an array base in some kind of clever trick.
Various C implementations used short/integer sizes like 8/16 (for microprocessors like the 8080), 16/16, 16/32. These days, there are some minimal assumptions we can count on. ANSI C specifies the following as minimal data sizes for char/short/int/long: 8/16/16/32. In practice, IIRC, most modern compilers use 8/16/32/32, in other words a 32-bit int. GCC, I think, uses 8/16/32/64, at least on 64-bit Unix targets.
The problem with this airline scenario, I would expect, is a kind of primitive cousin of cut-and-paste coding. This is where the programmer is pasting something like this from his mental scrapbook:
int i;
TRANSACTION trArray[];
It's very easy to do something like this. A really conscientious programmer asks himself whether the index value is indexing something that doesn't have a prescribed limit in the specifications (in this case I'm guessing it was probably indexing a file position). If there is no prescribed limit he uses an unsigned long. If there is a prescribed limit that would allow an integer index, he still uses an unsigned long unless he is indexing something which logically can't grow larger, or until the profiler forces him.
Which brings me to what I find curious about this. Either the programmer chose to index the value by a signed short (which would be almost inconceivably stupid as opposed to unforgivably negligent), or he was using a C compiler with a 16-bit integer, which, while possible under ANSI IIRC, seems terribly archaic.
Java, of course, uses 32 bit ints. But you aren't completely safe from this sort of thing. For example FileInputStream has two methods of interest here:
This is very safe, since it uses a long, which in Java is 64 bits; even signed, there is little chance of overflow.
However consider this:
What happens when a programmer decides to skip around in a LARGE file using this API? If he decides to skip forward by more than 2,147,483,647 bytes, the signed int will silently be converted to a negative offset, at least as of Java 1.4. Granted, the possibility is slim in most applications.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
If any post on here is a troll, this ain't it. heh.
You should over-engineer that counter to a ridiculous degree. It's not a bridge made out of expensive steel, it's just bits and bytes. Memory is cheap, disk space is cheap, and bandwidth is becoming cheap. If something calls for the number of meters in an Olympic pool (50), size it to accommodate the number of millimeters across the known universe. Over-engineer first, then slim it down only if it causes performance problems.
Picture this: In five years, the stores have RFID tags or faster machines that reduce the POS transaction time to 1 second. The stores are popular and quadruple in number. To match their new inventory system, they artificially track each item bought as a separate sale (average 5 items per customer). Voila, you are up to 1,728,000 possible transactions in a day already.
Customers will always find new and unpredictable ways to use and abuse their software, rather than go through the pain and expense of an upgrade. By 2009, the client will have forgotten your name, but they'll (probably) never have to hunt you down if that counter goes to 100 million or so.
You shall see a cow on the roof of a cotton house.
There were two modes of operation, a very low power mode - intended for long, slow radiation treatment, and a very high power mode - intended for very short bursts of radiation treatment. If a technician accidentally selected the high power mode, then switched back to the low power mode and hit the radiate button immediately, the device would deliver the high power beam, but the display would show it as a low power beam.
Patients would complain about a burning sensation, and then die. Happened quite a few times before anyone figured out what was going wrong.
--- There are two kinds of people, those who accept dogmas and know it, and those who accept dogmas and don't know it
Certainly there are "tricks" that they could have employed, but nothing as simple as just using a different datatype.
The really interesting thing about the incident was that the bug did not show up under deliberate testing. You could try the exact sequence of operations that produced a fatality and not be able to reproduce the effect. The key was that experienced users entered the setup very quickly. I don't recall whether the issue was that the safety mechanism didn't kick in, or whether the input was not properly recorded. IIRC the Therac 25 had its own custom rolled hardware monitor where today we would use an embedded operating system, and the fault was ultimately tied to that, not to incorrect specifications for the user input routine, or incorrect programming of the user input routine.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Henceforth, all numeric variables shall be of type BigNum.
"A great democracy must be progressive or it will soon cease to be a great democracy." --Theodore Roosevelt
I don't think many people know, but 1/3 of Pittsburgh flight attendants called in sick on Christmas Day. That alone is a cataclysmic problem since that is a main USAir hub.
Maybe USAir doesn't want anyone to know how much they suck as an employer, aside from how they suck as a corporation. Their employees don't seem to ever smile - they probably have no reason to.
As a side note, the JetBlue CEO has been known to frequent flights, handing out cookies, talking to passengers, handling baggage, and cleaning up trash after the flight. Every single employee looks happy...
In my day, our ints had 16 bits and we LIKED it!
It was a joke! When you give me that look it was a joke.
IIRC, what happened is that the operators would go back up on the display to change values that were inputted erroneously. However, the new values that were displayed on the screen were never passed along to the underlying code. So the operator thought "Whoops, lemme change that power setting", hit the up arrow and changed the value and then went along like normal. The machine didn't know the power was supposed to change, and blasted the patients with radiation.
I'm sure someone else will chime in with a link to the correct version of the story.
It's pure bullshit. The effect on the US of neglecting to micromanage the companies themselves, the way they believe they can micromanage their own customers to the edge of outright criminal behavior, is the same. On what's the busiest travel day of the year the entire system comes to a screeching halt, no different than the US declaring the grounding of all aircraft on 9-11. But hey, I got modded down by a dumbass hillbilly redneck who thinks failure is a good idea.
It's possible that the capacity limits in the software were more complicated than what type was selected for a few variables. How many other things were dependent on the maximum number of changes per month? Increasing the limit may have resulted in a program that could no longer meet other requirements for memory usage and run-time. Crew scheduling sounds like one of those problems that doesn't scale linearly. The worst-case scenarios when the software was designed may not match present conditions.
Mea navis aericumbens anguillis abundat
What I am not reading here and in the press is who at Comair knew the software had this drop dead limitation and why were they not prepared for it? My suspicion is that the FAA has such tight controls on software change management at the airlines that the Comair IT managers knew the problem could occur, but were powerless to modify the package to prevent the meltdown. -Steve
"The same process has already taken place in other industries that once paid a living wage... legacy industries like textiles, meat packing and restaurants."
IT
who gives a shit if 4MB of ram was $200??
If that happens to be one of the considerations
why they went with 16-bit integers... well
Shame on them for being retarded.
Write the damn code so it will WORK (without crap like this) and pay for the fuckin hardware to run it.
"Probably never. Our CIO is an idiot when it comes to technology."
Well if he was a genius at technology. You and your department would be out of a job.
This is a normal part of life. Nobody wants to put money into the old system because it is going to be replaced, but the system is still being used, and the old system will want to die just before it goes out the door, as its last hurrah, so no one will forget it.
Uses? The union would force EA to outsource all its work.
I've written longer, descriptive submissions before that have either been rejected and had some shorter submission put in their place, or had the original article chopped up like what happened here. My original article said that Comair outsourced the system to SBS. The /. editors seem to like short, concise submissions, which unfortunately means you have to follow a bunch of links to get the whole story.
Ahh, I see. So it's not your fault then, and we have another datum point to suggest that the editors are indeed being willfully obtuse with the house editorial style. How nice...
DO NOT LEAVE IT IS NOT REAL
Some of it is, but I'm not taking all of the blame. :-) I will readily admit that I didn't put in something like your suggested amendment. However, I do know that the /. editors have a habit of changing submissions. At least they do with mine, and I imagine they do it to others.
B*llshit - that's what I say. The company didn't hire enough engineers to write a proper test plan and execute it. Or they wrote a marvelous test plan but didn't test according to it. In the end, their system crashed because they didn't test the software properly.
This happens over and over again in commercial applications.
No, incrementing 0111 1111 1111 1111 (32767) would yield 1000 0000 0000 0000 which is -32768 in decimal. Assuming two's complement, which has no negative zero, of course.
Stumbling in the dark
I hear slavering of jaws
Eaten by a grue.
Has anyone discovered why a reporting variable is able to bring down the live system?
If you are rescheduling things, then of what use is it to be counting the number of times it has been done before? Surely it would be safer to break this out into a reporting program and execute an SQL query IF you actually needed that sort of information.
Probably a good reason for it, I just can't see it.
Compare the price of one 747 refuel: that would have paid for 1 GB of RAM in 1989.
The RAM limitation is in the server database, not the client software. Surely they could afford to spend a little more, don't you think? It's not like they let planes run out of fuel over the Pacific, is it?
It's just called CHEAP-ASS[tm] management. If they could, they would water down the fuel with anything, but it's illegal.
And surely they could have recompiled the server software to support unsigned 16-bit back in 1999.
This is all a symptom of Mr Manager doing cost cutting and getting great bonuses.
Liberty freedom are no1, not dicks in suits.
Why per month? Couldn't they just purge when it hit 32,000 and continue fine? Who cares if they lose the previous changes; back them up to a massive text file report.
They should have had a graceful recovery/continue, unless the software tech support was on holiday in Phuket.
Liberty freedom are no1, not dicks in suits.
Why can't these big fat-cat silly airlines in the USA do what Qantas does? If unions are the issue, can't the big airline just start a small subsidiary or a small separate partnered airline? Then the new one can easily 'compete' with the parent, take on non-union labor, and reuse the parent's facilities.
Or are there FAA regulations preventing this? If so, change the laws.
Liberty freedom are no1, not dicks in suits.
The sad news....The IT guy who pushed for newer software probably got fired for not alerting them about the "counter problem" and the M.A. idiot who cut the IT budgets and refused to spend a million on nerds for new software got promoted and awarded a trip to the golf course paid for with the cut IT budget to fix the problem the IT guy missed.
Unions are also used to seeing promises that pay will go back to normal in good times broken. The reality of capitalism is that if you don't have your act together enough to meet a known wages bill your company is probably going to expire.
Unions are not the problem. Some unreasonable bastard in a particular union may be, but you get that in all kinds of organisations. It sounds like there is an "us or them" attitude going on, where each group hates the other, which can lead to all kinds of problems and the end of the company if it isn't sorted out.
The USA has all kinds of protections to stop better run airlines coming in from overseas to create even more intense competition. The land that gave us ValuJet and the mess that was United in its final years really needs to get its act together, stop blaming the unions, and see if they can do as well as any of a score of airlines that would be happy to come in as soon as deregulation happens and show how airlines work in the rest of the world.
When I go to the USA I'd better catch a bus. I bet the bus company's scheduling software is less than fifteen years old and has been updated if the company has grown - that's what most places do for business critical applications, and it has nothing at all to do with unions.
It has happened before and it will happen again. The biggest case of a transaction counter overflowing a 16 bit signed integer was in 1985 at the Bank of New York - they came up 32 billion dollars short and nearly caused a collapse of the financial system. A description of the problem - Washington Post, 13 December 1985, p. D7 - Computer Snarled N.Y. Bank - $32 Billion Overdraft Resulted From Snafu - can be found in a message at: http://www.mirrors.wiretapped.net/security/info/textfiles/risks-digest/1/risks-1.31
Whoops, my bad. Got my binary math wrong.
..it wasn't labelled something like "TrialMinutesLeft" or something was it? :P
One (of many) big problems there is a safety-critical system using input rather than configuration state to control the status display. Hell, the status display should not even be aware of most input commands, except by way of detecting the resulting configuration changes, and perhaps having the input system notify the status display that "something has changed, check status".
Snowden and Manning are heroes.
Unions seem to believe that society owes them a living.
So do corporations, apparently.
The only thing management could have done would be to have rejected the union contracts earlier.
If management made contracts that they couldn't keep, they screwed up and their companies should go out of business. That's the way the free market works.
In fact, in reality, big airlines are screwing up not only on their labor contracts, they are poorly run, heavily government subsidized businesses offering mostly uncompetitive transportation options. The sooner the big airlines go out of business, the better off we all are.
Unions work by artificially limiting labor supply - but that doesn't work if there is not enough work.
Unions are a free market mechanism by which employees get together and bargain collectively. That's not very different from shareholders forming corporations in order to make collective contracts. Unions "artificially limit the labor supply" about as much as car makers "artificially limit the car supply" when they set a price for their new SUVs.
What it comes down to is that you want intrusive government: you want government to prop up failing, uncompetitive businesses, you want government to protect management from the consequences of their bad decisions, and you want government to prevent employees from associating freely and making a free-market choice to bargain collectively. Come on, admit it to yourself: you are just another one of those typical intrusive-government Republicans.
Actually my friend, I heard it on Bugtraq first.