Slashdot Mirror


Comair Done In by 16-Bit Counter

Gogo Dodo writes "According to the Cincinnati Post, the Comair system crash was caused by an overflowed 16-bit counter. Perhaps Comair should have paid for the software upgrade to MaestroCrew." You heard it here first...

77 of 441 comments

  1. Forget Y2k... by Cytlid · · Score: 4, Funny

    This was Y32k!

    --
    FLR
    1. Re:Forget Y2k... by stupidfoo · · Score: 3, Informative

      RTFA

      It was a signed integer. The problem occurred at 2^15 (32768) (although the article reported it as 32,000)

  2. Well... by Tuxedo+Jack · · Score: 5, Funny

    It seems that 16 bits and 640K wasn't enough for them after all.

    --

    Striking fear in the authors of godawful fanfiction, I am here, appearing in darkness, Tuxedo Jack!
    1. Re:Well... by HiThere · · Score: 2, Interesting

      Actually, it *MIGHT* have. No guarantees.

      But it would have been much easier to fix.

      The problem here is that even though "with enough eyes, all bugs are shallow", this application is specialized enough that there might well not have been enough eyes. Still, if it were open, then the people who work there might have spent some time looking through it. And *MIGHT* have found the problem.

      OTOH, with open source, when the problem manifested, it could have been debugged and recompiled starting immediately. This might well have saved them several days, certainly several hours.

      OTOH, if it were FOSS, then it would be their job to maintain it. This would increase the chance of the problem being fixed, but decrease their ability to point fingers at someone else and say "It's all their fault!", which is the capability that they seem to most desire.

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
  3. actually... by erroneus · · Score: 2, Funny

    ...I heard it on BugTraq first...

    1. Re:actually... by nuclearspike · · Score: 5, Funny

      I heard it from the ComAir desk at the airport when I was trying to get home. :(

  4. Re:Signed or unsigned by Vengeance · · Score: 4, Informative

    I believe this will answer your question:

    Tom Carter, a computer consultant with Clover Link Systems of Los Angeles, said the application has a hard limit of 32,000 changes in a single month.

    "This probably seemed like plenty to the designers, but when the storms hit last week, they caused many, many crew reassignments, and the value of 32,000 was exceeded," he said.


    So it sounds like a signed int.

    --
    It was a joke! When you give me that look it was a joke.
  5. Common problem by confusion · · Score: 3, Insightful

    Well, not this specific problem, but businesses have a common problem of outgrowing the systems that run their business. OTOH, this was an outsourced solution, so this case is pretty hard to explain away, other than sheer incompetence.

    1. Re:Common problem by Anonymous Coward · · Score: 4, Insightful

      That's not true.

      Even if a system is outsourced, that doesn't provide a company with a 100% stable system. Frequently businesses define the type of system they want (hardware/software) and the amount they're willing to pay for it.

      I work in a company that provides outsourced solutions. Monthly we provide info to businesses about their system. Also, we frequently make recommendations to augment the systems to improve performance. Businesses often choose to ignore our reports and recommendations.

      Nothing's more frustrating than a meeting where a business tells us we mucked it up, and in return we drop off the last 6 months of recommendations on upgrades to provide them additional hardware for their growing requirements and ask why they chose to ignore them.

      Now I'm not saying the provider didn't muck up. But what I am saying is that your statement that it's all the provider's fault may not be the case, as the airline probably chose to stay on that system because it 'met' their needs as they saw them.

    2. Re:Common problem by plover · · Score: 3, Insightful
      My point was more that it is much much harder to upgrade a system when it's managed internally.

      Only if it's done wrong.

      The "value" of outsourcing in that particular example is that it forces the company to completely spec out the system requirements. No changes without documentation. With poorly controlled internal development, changes happen in the hallways or the cafeteria: "Hey, Rick, did you add the code to handle the offline situation?" "Oh, right, I'll just put in a return value for you." This leads to code that doesn't match its spec, making it harder to maintain. Outsourcing tends to enforce a good interface between spec and code (which is what your claim seems to be).

      Internally developed programs don't necessarily receive the same amount of attention to detail because the programmers typically have an idea about the business domain of the problem, and can work more with less documentation. In some organizations, this leads to "fast and loose" -- great for response time, not so great for maintainability.

      I think the "value" of outsourcing in a case like ComAir's is one of liability: ComAir will probably try to play "let's blame the vendor." Or, maybe they'll offer up for sacrifice only the one guy who signed the contract with the vendor, and not an entire division. But, when a failure reaches this magnitude, I don't think they'll get off that easy.

      --
      John
  6. Bugtraq covered this as well.. by EvilStein · · Score: 5, Informative

    Here's the original post:

    Hi,

    On Christmas Day last Saturday, Comair Airlines had to completely stop flying
    all of its planes due to computer problems. Comair blamed the computer
    problems on their pilot scheduling software being overloaded after bad
    weather earlier in the week forced many flights to be rescheduled. Comair
    now hopes to have all of its 1,100 daily flights restored by tomorrow.

    An article which was published today at the Cincinnati Post Web site
    provides some interesting details of a software failure in Comair's pilot
    scheduling software:

    How it happened
    http://www.cincypost.com/2004/12/28/comp12-28-2004.html

    According to the article, Comair is running a 15-year old scheduling
    software package from SBS International (www.sbsint.com). The software has
    a hard limit of 32,000 schedule changes per month. With all of the bad
    weather last week, Comair apparently hit this limit and then was unable to
    assign pilots to planes.

    It sounds like 16-bit integers are being used in the SBS International
    scheduling software to identify transactions. Given that the software is 15
    years old, this design decision perhaps was made to save on memory usage.
    In retrospect, 16-bit integers were probably not a good choice.

    An anonymous message posted to Slashdot the day after Christmas first
    described the software failure at Comair:

    http://slashdot.org/comments.pl?sid=134005&cid=11185556

    Earlier this year, an overflow of a 32-bit counter in Windows shut down air
    traffic control over southern California for 3 hours:

    Microsoft server crash nearly causes 800-plane pile-up
    http://www.techworld.com/opsys/news/index.cfm?NewsID=2275

    This problem occurred because of a known design flaw in older versions of
    Windows:

    http://tinyurl.com/5n9gc

    Richard M. Smith
    http://www.ComputerBytesMan.com

    1. Re:Bugtraq covered this as well.. by dmccarty · · Score: 5, Insightful
      It sounds like 16-bit integers are being used in the SBS International scheduling software to identify transactions. Given that the software is 15 years old, this design decision perhaps was made to save on memory usage. In retrospect, 16-bit integers were probably not a good choice.

      Rubbish. Don't judge yesteryear's programs by today's standards. Back then 4MB RAM cost more than $200. That's how important memory conservation was. In 1989 using an int was a perfectly acceptable choice. If you were programming back then you'd know how loath programmers were to use longs when they didn't have to. (Granted an unsigned int would've worked better here, but that 64K limit could've also been reached.)

      The software spec probably says something to the effect of "Don't attempt to schedule more than 32,767 crew changes." If you're running software that's more than a decade old you need to know what the limits of your software are.

      --
      Have fun: Join D.N.A. (National Dyslexics Association)
    2. Re:Bugtraq covered this as well.. by imsabbel · · Score: 5, Informative

      $200 for 4MB? That's more 1994 than 1989...

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
    3. Re:Bugtraq covered this as well.. by GarrettZilla · · Score: 2, Interesting

      I disagree. Which takes more memory - adding a couple more bytes to this counter, or putting in the code to check for maximum value exceeded and emit a message saying "Cannot perform more than 32767 crew reassignments in a single month."? Or did they just press on into unspecified behavior after an integer overflow?

      Hell, just putting the word "unsigned" in front of "int" (do you really need to tally a negative number of crew reassignments?) would have prevented this particular problem and given double the capacity, all else being equal. If you're worried about memory usage, it's certainly not a good idea to waste a bunch of bits on unneeded signs.

      By 1989, we certainly knew that the world was in the habit of using software for ten years or more. Software was being modified all the time as larger memory spaces and requirements came along. It was practice long before that to be explicit about memory-related design decisions because you knew it would be your problem in five years to update the software. Unless you just ran away and quit before the problem came up, and it was somebody else's worry.

      --
      Ecce potestas casei!
    4. Re:Bugtraq covered this as well.. by plover · · Score: 3, Insightful
      In 1988 I was constantly having this argument with one of our other developers. He insisted on using a char when enumerating 80 or 90 status codes, or a short when conditions were "unlikely" that we'd need a long. We both grew up programming in the '70s (at which point I'd have agreed with you -- back then we only had 16Kwords to play in.) Yes, our 2MB boxes were pretty tight on memory, but even in the 1980s it was obvious that saving a single byte in the executable was a false economy, if it risked stability.

      The only place where shaving bits made sense for us was on data records: we had a hash file with 2.1 million records, each 29 bytes long, and they all had to fit on a single 80MB hard drive. We squeezed every single bit out of those records (including developing a 3-byte integer to handle amounts that we told them could never exceed $99,999.99; among other things, larger amounts would not have printed correctly). But they were read-only records to us: we never wrote more than a few thousand rows of data, and we had plenty of space for the day's processing. And when they did have the odd line item that exceeded $100,000.00, they figured out to break it up into multiple smaller items.

      And we got bit more than once by overflows. It took like three separate f-ups to get this guy to acknowledge that he needed to stop being stingy with the bytes. Even then, he'd still try to sneak in some memory "savings", but at least he stopped arguing when we called him on them.

      --
      John
    5. Re:Bugtraq covered this as well.. by fm6 · · Score: 2, Interesting
      If you're running software that's more than a decade old you need to know what the limits of your software are.
      Indeed. I get the impression that Boeing is very unmotivated when it comes to keeping its IS technology up to date. Until recently, they were still using slips of paper to track the process of assembling their airplanes!

      What's particularly disturbing is that nothing was done about this during the big Y2K push 5 years ago. Of course, the official goal of Y2K efforts was to make sure your computers didn't crash on 1/1/2000. But it's pretty hard to separate Y2K bugs from other clock bugs, and I think most places didn't even try. Easier to fix or document the bugs than to classify them. I was involved with the Y2K effort at SGI, and we looked at everything from leap year bugs to the Unix 32-bit clock overflow -- which won't occur until 2038!

  7. From Another article... by bje2 · · Score: 4, Interesting

    from information week

    "The computer failure that grounded an airline's entire fleet over the Christmas weekend and stranded thousands of travelers was due to creaky software that couldn't count higher than 32,768." ...

    According to the Post, the software -- which tracks all details of crew scheduling, including how long they have flown (an FAA regulation restricts airtime), and logs every change -- has a 16-bit counter that limits the number of changes to 32,768 in any given month. ...

    to be fair (although it's not an excuse), but 32K crew changes in a month? that's like 1,000 a day? that's crazy!...

    --

    "Facts are meaningless. You could use facts to prove anything that's even remotely true." - Homer Simpson
    1. Re:From Another article... by Anonymous Coward · · Score: 5, Funny

      >... 32K crew changes in a month? that's like 1,000 a day? that's crazy!

      You aren't by any chance the original developer of this software?

    2. Re:From Another article... by mikesmind · · Score: 3, Interesting

      Legacy systems will often contain such hard limits. Usually, they are buried deep in the code and sometimes no one knows that they exist. Any point where such hard limits exist must be discovered. A solution needs to then be designed for each situation. If you are a manager or a maintainer of such a system, it is your responsibility to do this. When you are questioned, just point out the Comair computer disaster.

      --
      www.mikesmind.com - www.daddyworkathome.com - www.freetofarm.org - www.tenfoottable.com
    3. Re:From Another article... by tsangc · · Score: 3, Insightful
      to be fair (although it's not an excuse), but 32K crew changes in a month? that's like 1,000 a day? that's crazy!...


      I would suspect the attitude of debating a limit without knowing the business context your design choice exists in is probably what created this error to begin with.

    4. Re:From Another article... by jnhtx · · Score: 2, Insightful
      to be fair (although it's not an excuse), but 32K crew changes in a month? that's like 1,000 a day? that's crazy!...

      Well, figure 3 crewmembers per flight, something like 1500 flights per day, cancel most flights for one snow and ice day near the end of the month.

      Maybe not so crazy.

  8. Let's not be too hard.. by Staplerh · · Score: 4, Interesting

    This was a horrible chain of events that severely inconvenienced a lot of people for Christmas, and I would be hoppin' mad if I was in any of their places. However, let's not jump on ComAir too hard, IMHO. From TFA:

    "This probably seemed like plenty to the designers, but when the storms hit last week, they caused many, many crew reassignments, and the value of 32,000 was exceeded," he said.

    It's true, it was an extreme confluence of circumstances... horrid weather (heck, there was snow in some Texas town for the first time in like 80 years or something, read it in some glurge article) coupled with the winter holidays. They should redesign their system and admit that they've grown to a level where their system is unable to handle extreme circumstances, and this should serve as a great wake-up call for them.

    In the past I've always chuckled at the thought of 'upgrading for the sake of upgrading', but I suppose this is one case where an earlier upgrade could have saved them millions and made a lot of people's holidays better.

    --
    "There's no success like failure, and failure's no success at all."
    - Bob Dylan
    1. Re:Let's not be too hard.. by danheskett · · Score: 2, Informative

      It was an unofficial job action (aka not a strike) - about 1/3 of the flight crew personnel called in sick, or did not show for work.

      It was a very, very selfish thing to do - stranding thousands of people on Christmas to complain about pay cuts. Will it be effective? Time will tell...

    2. Re:Let's not be too hard.. by afidel · · Score: 2, Informative

      They HAD outgrown their current system, and they knew it. That's why the new system was scheduled to go online in the next couple months. Unfortunately they met with a perfect storm of problems just at the wrong time. If you've ever worked with retail you know that NOTHING gets changed from mid November to early January unless god and the CEO both say it has to be so, I imagine airlines are pretty much the same. Heck airlines probably have an even larger freeze window since few people book flights at the last minute for holiday travel.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  9. So after Y2K is this ... by adzoox · · Score: 4, Funny

    what Initech handles?

    Yeahhhhhh! Mmmmmmkay!

    Did you get that memo?

    --
    Yell & scream & rant & rave... it's no use... you need a shaaaave ~ Bugs Bunny
  10. Re:Comair? by bje2 · · Score: 2, Insightful

    just RTFA linked in the summary ("conair system crash")...

    --

    "Facts are meaningless. You could use facts to prove anything that's even remotely true." - Homer Simpson
  11. Re:Maybe it had "worked just fine" for them? by kirun · · Score: 5, Insightful

    It's interesting because it provides a lesson in software design - arbitrary limits will trip you up eventually. It's not as if nobody knew to avoid them before, though.

    --
    I'm scared of numbers that can't be written as a fraction. It's an irrational fear.
  12. Re:Comair? by slapout · · Score: 2, Funny

    The human slashdot editors were replaced long ago. I think it's some google news beta program that currently posts the stories.

    --
    Coder's Stone: The programming language quick ref for iPad
  13. Re:Maybe it had "worked just fine" for them? by jedidiah · · Score: 5, Insightful

    This assumes that they had the resources. Given the current competitive environment in terms of consumer price and fuel costs, it would not be surprising if IT got the short end of things.

    --
    A Pirate and a Puritan look the same on a balance sheet.
  14. Re:Signed or unsigned by Evangelion · · Score: 2, Insightful


    Since 2^16 = 65536, I'm guessing signed.

  15. Let's try to remember by CodeWanker · · Score: 5, Funny

    That when you are talking about an airline, a COMPUTER crash is by far the least traumatic kind you can have.

    --


    "Wow. Now THAT'S a lot of angry Indians." - Lt. Col. George Armstrong Custer
    1. Re:Let's try to remember by thetroll123 · · Score: 2, Funny

      COMPUTER crash is by far the least traumatic kind

      What about two of those little baggage carts crashing in the arrivals area? Surely that's even less traumatic?

  16. Re:Comair? by buckeyeguy · · Score: 5, Insightful
    Potential trolls aside, Comair is a regional air carrier, based at the Greater Cinci airport, that was bought up by Delta, and turned into their secondary route provider. They handle both short and medium-range non-stop flights (i.e. Ohio to Atlanta or Orlando). So it's more closely-related than the code-sharing arrangement that some carriers have.

    Now my question would be, since they're owned by Delta, why wouldn't Comair flights be handled within Delta's own reservation/flight tracking system?

    p.s. I've traveled through CVG, on Delta, during the holidays. Not anymore... One weather-delayed flight and the whole system falls apart.

    --
    I'd have a personalized plate on my car, but "toxic bachelor" won't fit into 7 letters.
  17. Damn you 2s Complement! by jellomizer · · Score: 2, Insightful

    It could have worked if it weren't for 2s complement; they would be good for twice what they had. I think programming languages should make numbers unsigned unless asked otherwise; that way we can take advantage of that extra bit. Using a signed type for things like counters, where negative numbers just won't happen, is like having a 15-bit number taking 16 bits of space.

    --
    If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    1. Re:Damn you 2s Complement! by rat_love_cat · · Score: 2, Informative

      Having fixed point numbers default to unsigned is not a good idea because, at least with C's unsigned rules, it's easy to end up with a huge number if a negative number is ever generated. This has bitten me enough times in C that I avoid unsigned unless I'm damn sure that it can never go negative (and even then I check all subtractions real carefully).

  18. Don't mod me up. Mod up the Zucker brothers. by AtariAmarok · · Score: 3, Funny
    Can't help but remember the scene in one of the "Airplane" movies where a kid sneaks into the hi-tech air traffic control room. He sees one of the airline's shuttle-like planes on a screen, and grabs the nearest joystick and begins to (he thinks) play a videogame.

    When the shuttle on the screen blows up, and is accompanied by a very loud explosion sound outside the building, the kid looks sheepish and sneaks away.

    --
    Don't blame Durga. I voted for Centauri.
  19. Error checking is the real culprit by Anonymous Coward · · Score: 5, Insightful

    So it turned out to be problematic to use a signed 16-bit integer.

    But the real problem is a lack of error checking. It sounds like the code had something like:

    int num_crew_changes; ...
    crew_change_list[++num_crew_changes] = blah;

    And the counter wrapped and the system crashed.

    The code should have said:

    if (num_crew_changes == MAXINT)
    {
        ERROR(E1234, "too many crew changes");
    }

    The system is still degraded after 32767 crew changes. It might be so degraded as to be unusable. But at least the company would know the extent of the degradation and could pull out the appropriate "Plan B". It's much safer and better to work around a known problem of known scope than to work around a system crash when you don't know the exact problem.

    1. Re:Error checking is the real culprit by MadHungarian1917 · · Score: 2, Interesting

      Although the application crashed that is not why Comair needed to cancel all their flights.

      In aviation there is a concept of "Legality" which basically states that the FAA in cooperation with the ICAO www.icao.int has set the allowable hours a pilot or F/A can work in a day. Since this is a crew scheduling application the airline was no longer able to determine the flight status of their crews and by FAA regulations needed to stop flying until they could once again determine their crew(s) flight status.

      The only plan B the FAA allows is flight cancellation unless the airline has an approved greaseboard process.

      The int overflow was just the nail in Ben Franklin's
      famous poem

      For want of a nail a shoe was lost
      for want of a shoe a horse was lost
      for want of a horse a rider was lost
      for want of a rider the battle was lost
      for want of a battle the kingdom was lost
      and all for the want of a horseshoe nail.
      (or long counter)

  20. It's times like this... by AtariAmarok · · Score: 4, Funny

    It's times like this when you begin to realize that the Vic-20 (duct-taped to the bulletin board and surrounded by haywires) might not be the best choice anymore as mission-critical hub of your operations.

    --
    Don't blame Durga. I voted for Centauri.
    1. Re:It's times like this... by tomstdenis · · Score: 3, Insightful

      Ok dude do the math.

      A sells tickets for $0 loss.
      B sells tickets for $75 loss.

      B gains many customers. However, the more customers the more loss they incur. Recall EVERY SEAT costs them $75. Eventually B just runs out of money and ups the costs.

      Now A and B sell at the same cost. Customers notice the price hike and get upset [because for some reason people think air travel is a god given right so they get insanely upset at everything].

      Sure some won-over customers will stay with B but many will spread out [many are also not particularly loyal they just use whatever cheaptickets.com tells them to].

      Tell me I'm wrong. Tell me that most airlines haven't been filing for protection. Come on, tell me ;-)

      Tom

      --
      Someday, I'll have a real sig.
    2. Re:It's times like this... by tomstdenis · · Score: 2, Insightful

      So you're saying the solution is to lower the price so they can recoup costs.

      My point is the **real** solution is to

      get this, this is a doozy

      ====> **** NOT FLY THE FUCKING PLANE IN THE FIRST PLACE **** <====

      If you're over supplying the true demand then you're always going to waste money. Don't make 90 million gizmos when there is only demand for 1 million gizmos.

      The demand for air travel only surged when discount rates appeared. Discount rates only appeared to fill seats [re: artificial demand].

      Essentially if your flight is an emergency or longer than say 3-4 hours then it's a "need". People used to go by bus and train remember? You could take a one hour flight to the next city or spend a day on a bus. When flights were 400-500 dollars a seat people used to take the bus. Now that the same flight may cost 150 or so it's more attractive to fly.

      But just because you CAN do something doesn't mean you SHOULD.

      I mean they COULD sell the tickets for a buck each. That would fill up planes quick too. By your logic is that a good idea?

      Tom

      --
      Someday, I'll have a real sig.
  21. New CIO? by Mr.+BS · · Score: 2, Interesting

    I wonder how fast this CIO is going to be on his butt.

    "Well... we were holistically mitigating our financial stance outside the box of current processes while trying to forecast our future technological stability within the transport industry."

    "Well... you're fired! NEXT?!"

    1. Re:New CIO? by mslinux · · Score: 3, Interesting

      I wonder how fast this CIO is going to be on his butt.

      Probably never. Our CIO is an idiot when it comes to technology. He has a law degree from a big college. He earns six figures for sitting in his office and trading stocks (his stocks) all day. I'll never forget the day he picked up a WordPerfect Office11 box looked at it and then said, "So Tom... who is it that makes Word?"

      These guys are dumber than dirt, but they're well-connected in the "good ole boy club"

  22. Once did IT support for Comair by Anonymous Coward · · Score: 3, Informative

    Having once done tech support for the Maestro program used by Comair (and other scheduling software for other airlines as well), I think the software is junk. The employees undoubtedly said "I told you so!" when it broke, because they hated it as much as the support team did. IMO the airline didn't bother upgrading because they didn't think the old version was broken enough or outdated enough to warrant it.

  23. Re:Maybe it had "worked just fine" for them? by TopShelf · · Score: 2, Insightful

    According to the article, the system was on track to be replaced in the coming months...

    That said, it's very true that many businesses get by "just fine" with existing, antiquated systems. Justifying system upgrades can be difficult from a conventional cost-benefit standpoint, when a large part of the benefit is based on preventing theoretical problems like this one.

    --
    Stop by my site where I write about ERP systems & more
  24. Playing with fire by ravingidiot · · Score: 2, Insightful

    Why was conair using signed shorts to track their scheduling changes anyway? It seems to me that a company of that magnitude should expect to run into more than 32000 schedule changes within one month more than once. I mean, I can understand that the counter was probably designed with space constraints in mind, but for Christ's sake, it would've only been two extra bytes to fix this. That brings the total up to some 4 billion unsigned if I'm not mistaken. Technically, they could've used just three bytes, but then again, I wouldn't expect them to because how many languages have 24-bit integers built in as primitives? Of course like someone else said, I guess we can't blame this all on the programmers either. I just wouldn't consider it very comforting that such a system could become crippled because the programmers didn't think to allocate enough memory to allow for enough flexibility in scheduling.

  25. Re:Maybe it had "worked just fine" for them? by Remlik · · Score: 5, Informative

    bet *now* they'll upgrade, but until this particularly hairy situation arose, they didn't really see a need to upgrade a computer scheduling system that had been working great for them.

    RTFA RTFA RTFA - The new system goes live in January. Good god it's like herding cats around here.

    Gotta love /. when you can get modded +5 insightful without RTFA AND posting verbal vomit....

    --
    Apple free since 1990!
  26. Maestro sucks. by Anonymous Coward · · Score: 4, Informative

    Maybe Maestro should just die. My friend is a flight attendant for Southwest and has to use Maestro to plan her schedule. To use it she has to citrix into their main server and wait for an open client (I assume they have either a license or horrible programming restriction on concurrent users). On the very day that the new schedules are posted, it can take hours to log in. It's a joke.

    This stuff could be handled by a team of a dozen web based programmers (Java? C? ASP? LAMP? You pick.) in a few months. It's not difficult.

  27. Re:Wasn't it Nic Cage? by Zorilla · · Score: 2, Funny

    I knew Delta should have left the bunny alone!

    --

    It would be cool if it didn't suck.
  28. ComAir Now Hiring IT People by JavaDev04 · · Score: 5, Funny

    Hey everybody! Comair is hiring Unix System Administrators and IT Software Engineers! http://www.comair.com/hr/other/

    1. Re:ComAir Now Hiring IT People by dgb2n · · Score: 2, Insightful

      Hey everybody! Comair is hiring Unix System Administrators and IT Software Engineers! http://www.comair.com/hr/other/

      Read to the bottom. They're also hiring a "Staff Scheduler". Only a high school diploma required and 1 year of experience. Maybe they should raise their qualification requirements for this one given recent difficulties ....

  29. I did RTFA by EvilStein · · Score: 2, Funny

    "The computer software that crashed and grounded Comair's entire fleet on Christmas Day was an antiquated system due to be replaced in the coming months."

    First paragraph. I had just forgotten about it by the time I got to the *end* of the article. 6am + ADD - caffeine = me missing that bit. My bad. :P

  30. Re:Maybe it had "worked just fine" for them? by jcr · · Score: 2, Funny

    Maybe the existing system was working just fine?

    Apparently not.

    -jcr

    --
    The only title of honor that a tyrant can grant is "Enemy of the State."
  31. Re:65535+2 post by arkanes · · Score: 2, Insightful

    Hypothetical: There's some function that accepts a crew change and returns either the number of schedule changes to date or an error code. The error code is a negative value. This is a really common paradigm in C code.

  32. Re:Maybe it had "worked just fine" for them? by EvilStein · · Score: 2, Funny

    Oh, you're quite welcome. Be sure to stay tuned for my next opinion piece regarding "World Peace in 6 Easy Steps."

  33. There was a high profile example of this problem by hey! · · Score: 4, Interesting

    back in the early 80's. There was a big financial company that had an automated system that watched the prices of certain commodities and issued automated trade orders. The transactions were stored in arrays addressed by 16-bit signed integers, with the (now) highly predictable result on the first day that trading volume exceeded 16384 transactions. Since in C arrays are just syntactic sugar for pointer arithmetic, the system started executing trades based on "data" from random bits of heap memory. This apparently went on for some time before a human being figured out something had gone wrong, and (reportedly) the company lost billions in a single day. This might be somewhat exaggerated, since the event now has passed into folklore.

    In any case, this is one of those incidents like the Therac-25 accidents that experienced programmers should always have in mind.
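
    One way to avoid that failure mode, sketched under assumptions: the index is an unsigned size_t checked against the array's capacity before every write. MAX_TRADES, trades, trade_count and append_trade are hypothetical names, not details of the system in the story:

    ```c
    #include <assert.h>
    #include <stddef.h>

    #define MAX_TRADES 40000u          /* hypothetical capacity, above the 16-bit signed range */

    static long trades[MAX_TRADES];    /* hypothetical transaction store */
    static size_t trade_count = 0;     /* size_t cannot go negative */

    /* Appends one trade. Returns 0 on success, or -1 when the array is
       full, so a busy day is refused rather than written out of bounds. */
    int append_trade(long price)
    {
        assert(trade_count <= MAX_TRADES);   /* invariant: never past the end */
        if (trade_count == MAX_TRADES)
            return -1;
        trades[trade_count++] = price;
        return 0;
    }
    ```

    The caller still has to decide what to do when append_trade returns -1, but at least the failure is reported instead of silently reading or writing stray memory.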

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  34. Re:How old was this software? by adlaiff6 · · Score: 2, Insightful

    The article said it was 15 years old. I guess 16-bit systems are really named for their expiration date.

  35. Re:Maybe it had "worked just fine" for them? by aoty · · Score: 4, Informative

    My wife works for Comair here in Cincinnati. The computer system under discussion was in the process of being upgraded prior to the crash. Comair's IT recognized weaknesses in the current system some time ago. The upgrade just happened to be taking a little longer than anticipated. Timing is a bitch, isn't it?

  36. Re:65535+2 post by R.Caley · · Score: 2, Insightful
    Regardless, having 32767 schedule changes in a month? Must track every flight in the world.

    The article says they had 1100 flights on one day. That's 34100 per month. So basically they seem to have had on average about one crew schedule change per flight.

    Now, that is a very bad failure rate, but given that any one crew change probably causes a mass of knock-on changes (Fred misses this flight, so you have to substitute John, and then someone has to take over what John should be doing for the rest of the day, and Fred won't be in the right place to do what he was supposed to do this afternoon, and John has reached his flight-hours limit so you have to get Harry in early and...), a good flu epidemic going around the fleet plus some bad weather delays and technical faults all coming in the same month would probably do it.

    --
    _O_
    .|<
    The named which can be named is not the true named
  37. Re:unsigned by rjstanford · · Score: 3, Insightful

    unsigned short numberScheduleChanges;

    fixes the problem.


    You do realize that you've just fallen into the same trap, right? That doesn't fix the problem worth a damn. I mean, sure it doubles the number of changes. And yes, 64,000 should be enough. But, hey, 32,000 should have been enough too, right?

    Programs have internal limits. That's kosher. What's not appropriate is allowing the user base to exceed them or - for something like this - come close to exceeding them, without giving some kind of warning that notifies people of an impending problem and provides possible solutions (purge data, etc). Now you may point out that adding that kind of security increases the cost and complexity of software. Yup. That's why true enterprise software is expensive. Because that's what you're paying for.

    Another alternative would have been for the software to wrap and start purging existing records to make room for new ones. Either way, there should have been some defined strategy for the boundary condition, and there wasn't.
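
    A minimal sketch of the early-warning idea, assuming the hard limit is the 32,767 discussed in this thread and an arbitrary 80% warning threshold. The names check_capacity, CHANGE_LIMIT and WARN_PERCENT are hypothetical:

    ```c
    #include <assert.h>

    #define CHANGE_LIMIT 32767   /* hard limit discussed in the thread */
    #define WARN_PERCENT 80      /* hypothetical warning threshold */

    typedef enum { COUNTER_OK, COUNTER_WARN, COUNTER_FULL } counter_status;

    /* Reports how close a usage count is to the hard limit, so operators
       can be told to purge data or upgrade before the limit is reached. */
    counter_status check_capacity(long used)
    {
        assert(used >= 0);
        if (used >= CHANGE_LIMIT)
            return COUNTER_FULL;
        if (used * 100 >= (long)CHANGE_LIMIT * WARN_PERCENT)
            return COUNTER_WARN;
        return COUNTER_OK;
    }
    ```

    The check itself is a few lines; the expensive part is wiring COUNTER_WARN to something a human will actually see and act on.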

    The other thing that the software vendor should have done when pushing their upgrade is point out that the previous version wouldn't allow flights to continue in that situation, but the new version expanded it to (some large number). Instead, they probably said, "We're 32 bit!" or something totally meaningless to the people evaluating the business case for the upgrade.

    --
    You're special forces then? That's great! I just love your olympics!
  38. Re:Hmmmm.... Maybe I should edit my software by aldoman · · Score: 3, Insightful

    Am I the only one who reads POS as 'piece of shit' regardless of the fact I know it means point of sale?

    It always fits perfectly in the context as well, as this example proves.

  39. Happened to me too by wandazulu · · Score: 4, Interesting

    I worked at a bank in the early 90s that had a trading system based on SQL Server and the client was written in Visual Basic 3. Apart from every other bad design choice in this system (I inherited it when the designers got promoted and started working on another, even bigger system), the all important record counter was an integer, so when trade 32768 was posted, the application crashed, and simply could not be started again, because the first thing it did was try to show the current total (it was written for operators to use, not traders). Worse was that the counter variable wasn't a global, and it was often times a stack variable, and always with a different name (sometimes iCounter, sometimes iCount, sometimes x).

    The upshot was that I was able to convince management to totally scrap it and allow me to write a new one. The downside was that the idiot who designed the original system went on to spend 100 million dollars on this new, grandiose system that too was eventually scrapped, but he knew long before that his turkey wasn't going to fly, so he quit and became a lead architect at some other company.

    *Sigh*...okay, back to coding.

  40. Re:Maybe it had "worked just fine" for them? by Anonymous Coward · · Score: 3, Insightful

    Arbitrary being the key word. The limits probably weren't arbitrary when they were put in. The system probably had an expected life, and instead of maintaining their infrastructure the people tasked with running the company probably gave themselves pay raises while postponing payments into the employee pension fund. What stories like this are really about, are the complete worthlessness of MBAs. They exist for the sole purpose of diffusing responsibility and obstructing accountability.

    Very rarely does anyone have the luxury of designing for something with a hundred year life expectancy and a budget to match.

  41. Re:Maybe it had "worked just fine" for them? by Meostro · · Score: 2, Funny

    Is one step ??? and another Profit!?

  42. Re:Maybe it had "worked just fine" for them? by barzok · · Score: 3, Insightful

    When business won't give IT the money needed to keep business's systems operational (be it for manpower, software upgrades, or electricity) and makes the final decision in purchases, something's going to have to give.

    Business decides to buy a software package. After a while, upgrades come out, and the old version keeps getting pushed to the limits. IT advises business of this, and says that an upgrade/replacement will resolve the problem, but business refuses to authorize said upgrade/replacement.

    How do you propose IT "make it work" when their hands are tied? Even worse, IT will take the blame when it wasn't even their decision to make.

  43. Re:Comair? by amabbi · · Score: 2, Interesting
    Now my question would be, since they're owned by Delta, why wouldn't Comair flights be handled within Delta's own reservation/flight tracking system?

    There probably isn't any reason to. Comair, as a regional jet carrier, has different crew contracts and crew rules from Delta, a mainline carrier. They also operate completely different types of jets, with different crew staffing requirements. The FAA crew rules might even be different. While it might make sense from a consolidation standpoint to merge the two systems of Comair and Delta, since in reality there would be no interaction and no overlap between the two systems (an RJ pilot isn't suddenly going to jump over to fly a 757) the expense isn't worth it.

    p.s. I've traveled through CVG, on Delta, during the holidays. Not anymore... One weather-delayed flight and the whole system falls apart.

    Then I hope you also avoid United/United Express/Ted at O'Hare/Denver, Continental at Newark/Houston, Northwest in Detroit, USAirways in Cincinnati, American at O'Hare/Dallas... etc. etc. Every airline, not just Delta, uses hubs, and ground stops at any of these airports will cause significant delays. That's just the reality of air travel these days; if you're really worried, book non-stop travel (and pay up to 10x more).

  44. Re:Maybe it had "worked just fine" for them? by EvilStein · · Score: 2, Funny

    I don't know, but isn't there a way that we can blame Microsoft or SCO for this whole Comair mess? :)

  45. Re:The far bigger snafu... by Anonymous Coward · · Score: 2, Funny
    ' The only way for employees to fight back is to hit the company where it hurts the most, by striking at a busy time. '

    If they hate their jobs so much, why don't they just quit and go elsewhere? This "strike" is like pouring water in a sinking ship. They'll be the first to suffer the negative effects. Not only does it damage the company so that it has less money to pay these employees: it also tells the company to do whatever it can to get rid of these employees.

  46. Re:There was a high profile example of this proble by sharkey · · Score: 2, Funny

    Seems like there was another example of this sort of thing on November 2nd, 2004 as well. IIRC, some North Carolina machines dumped 3000+ votes due to a similar problem.

    --
    "Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
  47. Re:Comair? by AceyMan · · Score: 2, Informative

    mod parent down.

    As one poster already noted, the wholly owned carriers fly different equipment and are staffed by pilots who are members of a different seniority force.

    Moreover, typically the crew tracking system is integrated with the flight operations/dispatch system, and the maintenance control system, and the route planning system, and the trip optimization system. You wouldn't want to try to integrate all those functions into the parent carrier's system unless you *had* to.

    Finally, 14 CFR Part 121 says that each certificated carrier has to have their own dispatchers on staff. Comair et al. are technically independent carriers -- they have their own certificate (DOT license to run an airline), and therefore have to staff their own flight operations (dispatch) office.

    Therefore, Comair cannot integrate their staff with Delta's, even if they wanted to. Of course, that doesn't mean they couldn't still use Delta's operations software, but it just shows how separately the airlines actually must operate -- making the advantage of merging systems specious at best.

    \FAA licensed aircraft dispatcher

    --
    -- Experience is a wonderful thing. It enables you to recognize a mistake when you make it again.
  48. You've missed one dimension by cdrguru · · Score: 2, Insightful
    The dimension of time. An airline seat changes in value to the consumer as time goes on. Currently, we're told that the optimal price point changes at 14 days, but I believe it to be a little different than that in reality.

    You want to fly to Los Angeles in a month and purchase the ticket then. The price you pay reflects the value of being able to make that choice then and assuring the airline of a seat being filled in a month. The value of the seat changes as time goes on such that 1 day before the plane leaves the seat is now worth a lot more to someone that has to get to Los Angeles the next day, no matter what the cost. Of course there is the other aspect as well - the seat has no value once the plane leaves.

    Managing this changing value is what makes airline ticket prices incredibly complicated.

  49. Re:It shows what unions are all about by WhiplashII · · Score: 4, Insightful

    I work in the industry, so I might be able to provide an alternate viewpoint. Essentially what happened to the airline industry is that the market changed from a luxury market (high profit margins, low competition) to a commodity market (low profit margins, intense competition). Unfortunately, the airlines had all made long term deals with the trade unions that presumed stagnant market conditions - so when the market changed, they could not change with it.

    The smaller carriers all have one thing in common - no unions. They do not pay their pilots as much, and their pilots do not get paid if they don't fly. The number one expense for an airline is fuel, but the number two expense is the pilots, stewardesses, mechanics, and baggage handlers. There was no way for older airlines to meet the new market conditions (fly more for less profit per flight) without paying people less. The problem is that no one wants to be paid less, so instead they get rid of the least powerful people (who also happen to be the least paid). This is also specified in the union contract... all laying off actions must be FILO.

    Essentially, the major carriers are hamstrung by the unions, and they will not survive long term. Unions work by artificially limiting labor supply - but that doesn't work if there is not enough work.

    The unions say how evil it is that they are getting pay cuts, but where exactly do they expect the money to come from? The government really should not prop up certain providers when others are eager to take their place. Competition works for the most part. Air travel is becoming a commodity market, like cars. Market transitions cause upheavals, and change the market leaders - especially if the current leaders cannot change their business structure.

    I have to say that I totally disagree about management being incompetent - the current management (at least the upper level ones I deal with) are extremely good. They may even get the airline to survive and change to the current market conditions. But what has really destroyed the airlines is the changing markets, and the unions preventing the old airlines from changing with the times. The only thing management could have done would be to have rejected the union contracts earlier. But I doubt if that was possible.

    Unions seem to believe that society owes them a living. The problem is that society (except in the form of government) is not a person, and so recognizes no debts. Fighting that is totally ineffective because there is no one to fight.

    --
    while (sig==sig) sig=!sig;
  50. SIGNED 16-bit!! by adam31 · · Score: 2, Interesting

    what-- were they expecting negatives also?

    1. Re:SIGNED 16-bit!! by pclminion · · Score: 2, Informative
      Using a signed integer allows you to distinguish between error and non-error conditions. In UNIX for example, system calls return a negative value on error, so these calls often are declared to return a signed int even though the number they return in a non-error condition will always be positive.

      This might be viewed as laziness depending on the circumstances. Obviously, it seems weird to waste half of the integer space just so you can return -1 on error, but if you need to report many various error conditions, using negative numbers to do so makes things a little easier because you can just check if the return value was negative in order to detect an error (instead of comparing the return value one by one against each possible error code).

      However, the entire problem is mitigated if you just switch to a slightly different calling convention. If you need a function to return some value which is always positive, but you still want to indicate possible error conditions, forget about using the return value to return the result. Instead pass a variable by pointer or reference, stick the result in there, and return 0 on success or -1 on failure.
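
      A minimal sketch of that calling convention, assuming a counter kept in an unsigned short. The function name add_changes and its parameters are made up for illustration:

      ```c
      #include <assert.h>
      #include <limits.h>
      #include <stddef.h>

      /* Adds delta to total and stores the sum through result.
         Returns 0 on success, or -1 if the sum would not fit in an
         unsigned short; *result is left untouched on failure. */
      int add_changes(unsigned short total, unsigned short delta,
                      unsigned short *result)
      {
          assert(result != NULL);
          if (delta > USHRT_MAX - total)
              return -1;
          *result = (unsigned short)(total + delta);
          return 0;
      }
      ```

      The return value carries only the status, so the whole unsigned range stays available for the result.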

      Unfortunately many programmers regard this as ugly, so we're stuck with silly crap like wasting half the integer space just in order to report errors.

  51. Coding practice by hey! · · Score: 2, Insightful
    Well, there's a curious thing about this story.

    Back in the 80s when I was a C programmer (K&R, thank you, the one true C), C integer types were not standardized. "Integers" were defined to be the most natural size for a machine (typically a data word), "shorts" were defined to be no larger than ints, but possibly smaller (and thus possibly more space efficient). This reflected the philosophy of C-as-portable-assembler: if you were indexing an array of character representations of digits, for example, there was no reason not to use a short. It was conceivable that, since arrays were essentially immutable pointers and array indices were merely offsets against those pointers, you might want to reference a negative offset from an array base in some kind of clever trick.

    Various C implementations used short/integer sizes like 8/16 (for microprocessors like the 8080), 16/16, 16/32. These days, there are some minimal assumptions we can count on. ANSI C specifies the following as minimal data sizes for char/short/int/long: 8/16/16/32. In practice, IIRC, most modern compilers use 8/16/32/32, in other words a 32-bit int. GCC, I think, uses 8/16/32/64.

    The problem with this airline scenario, I would expect, is a kind of primitive cousin of cut-and-paste coding. This is where the programmer is pasting something like this from his mental scrapbook:

    int i; // index into transaction array
    TRANSACTION trArray[]; ...

    It's very easy to do something like this. A really conscientious programmer asks himself whether the index value is indexing something that doesn't have a prescribed limit in the specifications (in this case I'm guessing it was probably indexing a file position). If there is no prescribed limit he uses an unsigned long. If there is a prescribed limit that would allow an integer index, he still uses an unsigned long unless he is indexing something which logically can't grow larger, or until the profiler forces him.

    Which brings me to what I find curious about this. Either the programmer chose to index the value by a signed short (which would be almost inconceivably stupid as opposed to unforgivably negligent), or he was using a C compiler with a 16-bit integer, which while possible under ANSI IIRC, seems terribly archaic.
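
    The width assumptions above can be pinned down at compile time instead of being left to the compiler's defaults. A minimal sketch, assuming a C11 compiler; fits_in_short is a made-up helper name:

    ```c
    #include <assert.h>   /* static_assert (C11) */
    #include <limits.h>

    /* Refuse to build on a compiler whose int is the 16-bit minimum. */
    static_assert(INT_MAX >= 2147483647, "int is narrower than 32 bits");

    /* Returns 1 if n can be stored in a short without loss, 0 otherwise. */
    int fits_in_short(long n)
    {
        return n >= SHRT_MIN && n <= SHRT_MAX;
    }
    ```

    With a check like this, a port to a compiler with a narrower int fails at build time rather than on the day a counter runs out of room.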

    Java, of course, uses 32 bit ints. But you aren't completely safe from this sort of thing. For example FileInputStream has two methods of interest here:

    long skip(long n)


    this is very safe, since it uses a long, which in Java is 64 bits; even signed, there is little chance of overflow.

    However consider this:

    int read(byte[] b, int off, int len)


    What happens when a programmer decides to skip around in a LARGE file using this API? If he decides to skip forward by more than 2,147,483,647 bytes the signed int will silently be converted to a negative offset, at least as of Java 1.4. Granted the possibility is slim in most applications.
    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  52. Re:Maybe it had "worked just fine" for them? by jedidiah · · Score: 2, Insightful

    As they like to say in my particular part of the retail world: You build for Easter Sunday.

    THIS INCLUDES COMPUTING.

    Part of serving the business is being built to capacity. This is no different than a correctly tooled factory or having enough warm bodies.

    Your mentality simply falls from the illusion that IT isn't an integral part of the business. You can always choose to be "penny wise". However, that always comes with inherent risk.

    The question you need to ask your CIO is: Do you feel lucky?

    --
    A Pirate and a Puritan look the same on a balance sheet.
  53. Blaming the unions for a bad management decision? by dbIII · · Score: 2, Interesting
    Essentially, the major carriers are hamstrung by the unions,
    It's often easier to blame the unions than fix bad management practices, and in a lot of cases competitors that have workers in the same union operate well and get the job done instead of throwing their hands in the air, blaming the unions, and refusing to move into this century.

    Unions are also used to seeing promises that pay will go back to normal in good times broken. The reality of capitalism is that if you don't have your act together enough to meet a known wages bill your company is probably going to expire.

    Unions are not the problem - some unreasonable bastard in a particular union may be - but you get that in all kinds of organisations. It sounds like there is an "us or them" attitude going on, where each group hates the other, which can lead to all kinds of problems and the end of the company if it isn't sorted out.

    The USA has all kinds of protections to stop better run airlines coming in from overseas to create even more intense competition. The land that gave us ValuJet and the mess that was United in its final years really needs to get its act together, stop blaming the unions, and see if they can do as well as any of a score of airlines that would be happy to come in as soon as deregulation happens and show how airlines work in the rest of the world.

    When I go to the USA I'd better catch a bus; I bet the bus company's scheduling software is less than fifteen years old and has been updated if the company has grown - that's what most places do for business critical applications, and it has nothing at all to do with unions.