Slashdot Mirror


Comair Done In by 16-Bit Counter

Gogo Dodo writes "According to the Cincinnati Post, the Comair system crash was caused by an overflowed 16-bit counter. Perhaps Comair should have paid for the software upgrade to MaestroCrew." You heard it here first...

25 of 441 comments (clear)

  1. Forget Y2k... by Cytlid · · Score: 4, Funny

    This was Y32k!

    --
    FLR
  2. Well... by Tuxedo+Jack · · Score: 5, Funny

    It seems that 16 bits and 640K wasn't enough for them after all.

    --

    Striking fear in the authors of godawful fanfiction, I am here, appearing in darkness, Tuxedo Jack!
  3. Re:Signed or unsigned by Vengeance · · Score: 4, Informative

    I believe this will answer your question:

    Tom Carter, a computer consultant with Clover Link Systems of Los Angeles, said the application has a hard limit of 32,000 changes in a single month.

    "This probably seemed like plenty to the designers, but when the storms hit last week, they caused many, many crew reassignments, and the value of 32,000 was exceeded," he said.


    So it sounds like a signed int.

    --
    It was a joke! When you give me that look it was a joke.
  4. Bugtraq covered this as well.. by EvilStein · · Score: 5, Informative

    Here's the original post:

    Hi,

    On Christmas Day last Saturday, Comair Airlines had to completely stop
    flying
    all of its planes due to computer problems. Comair blamed the computer
    problems on their pilot scheduling software being overloaded after bad
    weather earlier in the week forced many flights to be rescheduled. Comair
    now hopes to have all of its 1,100 daily flights restored by tomorrow.

    An article which was published today at the Cincinnati Post Web site
    provides some interesting details of a software failure in Comair's pilot
    scheduling software:

    How it happened
    http://www.cincypost.com/2004/12/28/comp12-28-2004 .html

    According to the article, Comair is running a 15-year old scheduling
    software package from SBS International (www.sbsint.com). The software has
    a hard limit of 32,000 schedule changes per month. With all of the bad
    weather last week, Comair apparently hit this limit and then was unable to
    assign pilots to planes.

    It sounds like 16-bit integers are being used in the SBS International
    scheduling software to identify transactions. Given that the software is 15
    years old, this design decision perhaps was made to save on memory usage.
    In retrospect, 16-bit integers were probably not a good choice.

    An anonymous message posted to Slashdot the day after Christmas first
    described the software failure at Comair:

    http://slashdot.org/comments.pl?sid=134005&cid=111 85556

    Earlier this year, an overflow of a 32-bit counter in Windows shut down air
    traffic control over southern California for 3 hours:

    Microsoft server crash nearly causes 800-plane pile-up
    http://www.techworld.com/opsys/news/index.cfm?News ID=2275

    This problem occurred because of a known design flaw in older versions of
    Windows:

    http://tinyurl.com/5n9gc

    Richard M. Smith
    http://www.ComputerBytesMan.com

    1. Re:Bugtraq covered this as well.. by dmccarty · · Score: 5, Insightful
      It sounds like 16-bit integers are being used in the SBS International scheduling software to identify transactions. Given that the software is 15 years old, this design decision perhaps was made to save on memory usage. In retrospect, 16-bit integers were probably not a good choice.

      Rubbish. Don't judge yesteryear's programs by today's standards. Back then 4MB RAM cost more than $200. That's how important memory conservation was. In 1989 using an int was a perfectly acceptable choice. If you were programming back then you'd know how loathe programmers were to use longs when they didn't have to. (Granted an unsigned int would've worked better here, but that 64K limit could've also been reached.)

      The software spec probably says something to the effect of "Don't attempt to schedule more than 32,767 crew changes." If you're running software that's more than a decade old you need to know what the limits of your software are.

      --
      Have fun: Join D.N.A. (National Dyslexics Association)
    2. Re:Bugtraq covered this as well.. by imsabbel · · Score: 5, Informative

      200$ for 4MB? Thas more 1994 than 1989...

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
  5. From Another article... by bje2 · · Score: 4, Interesting

    from information week

    "The computer failure that grounded an airline's entire fleet over the Christmas weekend and stranded thousands of travelers was due to creaky software that couldn't count higher than 32,768." ...

    According to the Post, the software -- which tracks all details of crew scheduling, including how long they have flown (an FAA regulation restricts airtime), and logs every change -- has a 16-bit counter that limits the number of changes to 32,768 in any given month. ...

    to be fair (although it's not an excuse), but 32K crew changes in a month? that's like 1,000 a day? that's crazy!...

    --

    "Facts are meaningless. You could use facts to prove anything that's even remotely true." - Homer Simpson
    1. Re:From Another article... by Anonymous Coward · · Score: 5, Funny

      >... 32K crew changes in a month? that's like 1,000 a day? that's crazy!

      You arent by any chance the original developer of this software?

  6. Let's not be too hard.. by Staplerh · · Score: 4, Interesting

    This was a horrible chain of events that severely inconvenienced a lot of people for Christmas, and I would be hoppin' mad if I was in any of their places. However, let's not jump on ComAir too hard, IMHO. From TFA:

    "This probably seemed like plenty to the designers, but when the storms hit last week, they caused many, many crew reassignments, and the value of 32,000 was exceeded," he said.

    It's true, it was an extreme connection of circumstances... horrid weather (heck, there was snow in some Texas town for the first time in like 80 years or something, read it in some glurge article) coupled with the winter holidays. They should redesign their system and admit that they've grown to a level where their system is unable to hand extreme circumstances, and this should serve as a great wake-up call for them.

    In the past I've always chuckled at the thought of 'upgrading for the sake of upgrading', but I suppose this is one case where an earlier upgrade could have saved them millions and made a lot of people's holidays better.

    --
    "There's no success like failure, and failure's no success at all."
    - Bob Dylan
  7. So after Y2K is this ... by adzoox · · Score: 4, Funny

    what Initech handles?

    Yeahhhhhh! Mmmmmmkay!

    Did you get that memo?

    --
    Yell & scream & rant & rave... it's no use... you need a shaaaave ~ Bugs Bunny
  8. Re:Maybe it had "worked just fine" for them? by kirun · · Score: 5, Insightful

    It's interesting because it provides a lesson in software design - arbitary limits will trip you up eventually. It's not as if nobody knew to avoid them before, though.

    --
    I'm scared of numbers that can't be written as a fraction. It's an irrational fear.
  9. Re:Maybe it had "worked just fine" for them? by jedidiah · · Score: 5, Insightful

    This assumes that they had the resources. Given the current competitive environment in terms of consumer price and fuel costs, it would not be surprising if IT got the short end of things.

    --
    A Pirate and a Puritan look the same on a balance sheet.
  10. Let's try to remember by CodeWanker · · Score: 5, Funny

    That when you are talking about an airline, a COMPUTER crash is by far the least traumatic kind you can have.

    --


    "Wow. Now THAT'S a lot of angry Indians." - Lt. Col. George Armstrong Custer
  11. Re:Comair? by buckeyeguy · · Score: 5, Insightful
    Potential trolls aside, Comair is a regional air carrier, based at the Greater Cinci airport, that was bought up by Delta, and turned into their secondary route provider. They handle both short and medium-range non-stop flights (i.e. Ohio to Atlanta or Orlando). So it's more closely-related than the code-sharing arrangement that some carriers have.

    Now my question would be, since they're owned by Delta, why wouldn't Comair flights be handled within Delta's own reservation/flight tracking system?

    p.s. I've traveled through CVG, on Delta, during the holidays. Not anymore... One weather-delayed flight and the whole system falls apart.

    --
    I'd have a personalized plate on my car, but "toxic bachelor" won't fit into 7 letters.
  12. Re:Common problem by Anonymous Coward · · Score: 4, Insightful

    That's not true.

    Even if a system is outsourced it doesn't provide a company with 100% stable system. Frequently businesses define the type of system they want hardware/software and the amount they're willing to pay for it.

    I work in a company that provides outsourced solutions. Monthly we provide info to businesses about their system. Also, we frequently make recommendations to augment the systems to improve performance. Businesses often choose to ignore our reports and recommendations.

    Nothing's more frustrating then a meeting with a business having them tell us we mucked it up and in return we drop off the last 6 months of recommendations on upgrades to provide them additional hardware for their growing requirements and question why they choose to ignore it.

    Now I'm not saying the provider didn't muck up. But, what I am saying is your statement that it's all the provider's fault may not be the case as the airlines probably choose to stay on that system as it 'met' their needs as they saw them.

  13. Error checking is the real culprit by Anonymous Coward · · Score: 5, Insightful

    So it turned out to be problematic to use a signed 16-bit integer.

    But the real problem is a lack of error checking. It sounds like the code had something like:

    int num_crew_changes; ...
    crew_change_list[++num_crew_changes] = blah;

    And the counter wrapped and the system crashed.

    The code should have said:

    if (num_crew_changes == MAXINT)
    {
    ERROR(E1234, "too many crew changes");
    }

    The system is still degraded after 32767 crew changes. It might be so degraded as to be unusable. But at least the company would know the extent of the degradation and could pull out the appropriate "Plan B". It's much safer and better to work around a known problem of known scope than to work around a system crash when you don't know the exact problem.

  14. It's times like this... by AtariAmarok · · Score: 4, Funny

    It's times like this when you begin to realize that the Vic-20 (duct-taped to the bulletin board and surrounded by haywires) might not be the best choice anymore as mission-critical hub of your operations.

    --
    Don't blame Durga. I voted for Centauri.
  15. Re:Maybe it had "worked just fine" for them? by Remlik · · Score: 5, Informative

    bet *now* they'll upgrade, but until this particularly hairy situation arose, they didn't really see a need to upgrade a computer scheduling system that had been working great for them.

    RTFA RTFA RTFA - The new system goes live in January. Good god its like herding cats around here.

    Gotta love /. when you can get moded +5 insightful without RTFA AND posting verbal vomit....

    --
    Apple free since 1990!
  16. Maestro sucks. by Anonymous Coward · · Score: 4, Informative

    Maybe Maestro should just die. My friend is a flight attendant for Southwest and has to use Maestro to plan her schedule. To use it she has to citrix into their main server and wait for an open client (I assume they have either a license or horrible programming restriction on concurrent users). On the very day that the new schedules are posted, it can take hours to log in. It's a joke.

    This stuff could be handled by a team of a dozen web based programmers (Java? C? ASP? LAMP? You pick.) in a few months. It's not difficult.

  17. ComAir Now Hiring IT People by JavaDev04 · · Score: 5, Funny

    Hey everybody! Comair is hiring Unix System Administrators and IT Software Engineers! http://www.comair.com/hr/other/

  18. There was a high profile example of this problem by hey! · · Score: 4, Interesting

    back in the early 80's. There was a big financial company that had an automated system that watched the prices of certain commodities and issued automated trade orders. The transactions where stored in arrays addressed by 16 bit signed integers, with the (now) highly predictable result on the first day that trading volume exceeded 16384 transactions. Since in C arrays are just syntactic sugar for pointer arithmetic, the system started executing trades based on "data" from random bits of heap memory. This apprently went on for some time before a human being figured out something had gone wrong, and (reportedly) the company lost billions in a single day. This might be somewhat exaggerated, since the event now has passed into folklore.

    In any case, this is one of those incidents like the Therac-25 accidents that experienced programmers should always have in mind.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  19. Re:Maybe it had "worked just fine" for them? by aoty · · Score: 4, Informative

    My wife works for Comair here in Cincinnati. The computer system under discussion was in the process of being upgraded prior to the crash. Comair's IT recognized weaknesses in the current system some time ago. The upgrade just happened to be taking a little longer than anticipated. Timing is a bitch, isn't it?

  20. Re:actually... by nuclearspike · · Score: 5, Funny

    I heard it from the ComAir desk at the airport when I was trying to get home. :(

  21. Happened to me too by wandazulu · · Score: 4, Interesting

    I worked at a bank in the early 90s that had a trading system based on SQL Server and the client was written in Visual Basic 3. Apart from every other bad design choice in this system (I inherited it when the designers got promoted and started working on another, even bigger system), the all important record counter was an integer, so when trade 32768 was posted, the application crashed, and simply could not be started again, because the first thing it did was try to show the current total (it was written for operators to use, not traders). Worse was that the counter variable wasn't a global, and it was often times a stack variable, and always with a different name (sometimes iCounter, sometimes iCount, sometimes x).

    The upshot was that I was able to convince management to totally scrap it and allow me to write a new one. The downside was that the idiot who designed the original system went on to spend 100 million dollars on this new, grandious system that too was eventually scrapped, but he knew long before that his turkey wasn't going to fly, so he quit and became a lead architect at some other company.

    *Sigh*...okay, back to coding.

  22. Re:It shows what unions are all about by WhiplashII · · Score: 4, Insightful

    I work in the industry, so I might be able to provide an alternate viewpoint. Esentially what happened to the airline industry is that the market changed from a luxury market (high profit margins, low competition) to a commodity market (low profit margins, intense competition). Unfortunately, the airlines had all made long term deals with the trade unions that presumed stagnate market conditions - so when the market changed, they could not change with it.

    The smaller carriers all have one thing in common - no unions. They do not pay their pilots as much, and their pilots do not get paid if they don't fly. The number one expense for an airline is fuel, but the number two expense are the pilots, stewardesses, mechanics, and baggage handlers. There was no way for older airlines to meet the new market conditions (fly more for less profit per flight) without paying people less. The problem is that no one wants to be paid less, so instead they get rid of the least powerful people (who also happen to be the least paid). This is also specified in the union contract... all laying off actions must be FILO.

    Essentially, the major carriers are hamstrung by the unions, and they will not survive long term. Unions work by artificially limiting labor supply - but that doesn't work if there is not enough work.

    The unions say how evil it is that they are getting pay cuts, but where exactly do they expect the money to come from? The government really should not prop up certain providers when others are eager to take there place. Competition works for the most part. Air travel is becoming a commodity market, like cars. Market transitions cause upheavals, and change the market leaders - especially if the current leaders cannot change their bussiness structure.

    I have to say that I totally disagree about management being incompetant - the current management (at least the upper level ones I deal with) are extremely good. They may even get the airline to survive and change to the current market conditions. But what has really destroyed the airlines is the changing markets, and the unions preventing the old airlines to change with the times. The only thing management could have done would be to have rejected the union contracts earlier. But I doubt if that was possible.

    Unions seem to believe that society owes them a living. The problem is that society (except in the form of government) is not a person, and so recognizes no debts. Fighting that is totally ineffective because there is no one to fight.

    --
    while (sig==sig) sig=!sig;