Slashdot Mirror


Comair System Crashes; Passengers Stranded

Broerman writes "30,000 people have had their flights cancelled by Comair this weekend thanks to a computer system shutdown. It appears that, due to weather and other problems, flights began to be cancelled on Thursday and the backlog choked the system. 1,100 flights have been cancelled so far, including all flights through 12/26. Does anyone know what platform their system was based on? What kind of system just totally crashes? The official statement is that 'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.' It seems highly improbable that a system would crash because it had too many reservations. The system should only be able to hold as many reservations as it has flights/seats. It would seem more likely that the system was overloaded with use and that caused a meltdown. When you add in the problems experienced by US Airways, this hasn't been a Merry Christmas for many."

22 of 398 comments

  1. Re:Official my arse... by Saven+Marek · · Score: 2, Informative

    You know I think it was. btw the system being used by Comair?

    It's one of SCO's last large-scale deployments. You know who to blame now.

    Online Anime Gallery's

  2. Re:Happens all the time... by hughk · · Score: 4, Informative
    I have a lot of friends working at a large airline.

    Yes, but it is mostly recoverable. The heavy iron handles things like backend reservations, checkin and cargo. Smaller systems handle things like weight/balance and fuel and PCs are typically used for the front-ends.

    Weight/balance calcs can be done more or less by hand if necessary, however a larger fuel margin is needed. Checkin can be done by hand (you have seen those sticky label systems). However to lose reservations is a major problem.

    --
    See my journal, I write things there
  3. Crew assignment is a hard problem by rsilva · · Score: 5, Informative

    'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.'

    I am only trying to make sense of the above quote from the official statement.

    Crew assignment is a hard problem; it is usually formulated as an MILP (Mixed Integer Linear Programming) problem.

    Such problems may be very hard to solve in reasonable time. Maybe (I'm shooting in the dark here) the first delays made the crew assignment problems grow too large to be solved in reasonable time. This would generate a snowball effect, as the assignment problems would keep on growing, making the system "crash".

    We may never know what really happened but this would be a nice example for my classes :-)
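
    To give a feel for why the search space blows up, here is a minimal brute-force sketch in Python. It is not how Comair's software or any real MILP solver works, and the cost table is made up. The plain one-to-one assignment problem shown here can actually be solved efficiently; it is the extra side constraints (rest rules, qualifications, getting crews home) that push real crew scheduling into MILP territory. The sketch only illustrates how fast the number of candidate assignments grows.

```python
from itertools import permutations
from math import factorial


def best_assignment(cost):
    """Try every one-to-one crew-to-flight assignment and keep the cheapest.

    cost[i][j] is a made-up cost of putting crew i on flight j.
    Returns (lowest total cost, assignment, number of assignments examined).
    """
    best_cost, best_perm, examined = None, None, 0
    for perm in permutations(range(len(cost))):
        examined += 1
        total = sum(cost[crew][flight] for crew, flight in enumerate(perm))
        if best_cost is None or total < best_cost:
            best_cost, best_perm = total, perm
    return best_cost, best_perm, examined


# Hypothetical 3-crew, 3-flight cost table.
cost = [
    [4, 2, 8],
    [4, 3, 7],
    [3, 1, 6],
]
lowest, assignment, examined = best_assignment(cost)
print(lowest, assignment, examined)

# The number of candidate assignments grows factorially with problem size.
for n in (3, 10, 20):
    print(n, factorial(n))
```

    With 3 crews there are only 6 candidates to check; with 10 there are already more than 3.6 million, which is why exhaustive search stops being an option almost immediately.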

  4. System Tracked Crew Location, Not Reservations by reallocate · · Score: 5, Informative

    Of course, a techie didn't write the PR release. Who in their right mind would let a techie anywhere near a PR release?

    BTW, Comair, a Delta feeder headquartered outside Cincinnati, says the system that crashed was used to monitor crew locations and track working hours to ensure no one went over the legal maximum. Comair says the system crashed as a result of massive crew rescheduling following a record snow in their service area on Wednesday. There is no backup.

    --
    -- Slashdot: When Public Access TV Says "No"
  5. Re:Travel tip by xlation · · Score: 5, Informative

    From: http://www.fly.faa.gov/FAQ/faq.html

    The term "Rule 240" refers to a rule that existed before airline deregulation. There is no longer an actual Rule 240. The term, as it is now used, refers to each airline's "conditions of carriage" policy. You would need to contact the airlines to obtain this.

  6. whole story? by confusion · · Score: 4, Informative
    This Comair story is all I'm seeing getting press. I think it's a lot bigger than that.
    My sister flew Delta on Dec 23rd from Detroit to Atlanta. Plane was 2 hours late, but no big thing. Waited 5 hours for her luggage, with no dice. By the time we got in line for luggage services, there were at least 600 people in the line already.
    Talking to other passengers from 10+ different flights from different cities, no one got their luggage that night. Apparently, it wasn't just Atlanta - the local news in Tampa and Detroit had segments on how the airports had taken over parts of taxiways to sort through seas of bags that didn't make it on to planes.
    It's been 2 days, and Delta has no idea where the stuff from that flight is. I'm guessing it isn't just Comair that got hit by some computer problems.

    Jerry
    http://www.syslog.org/

    1. Re:whole story? by HeghmoH · · Score: 2, Informative

      These days, "we hate the customer" seems to be the motto of all of the big airlines.

      This summer, I was flying from Paris to Ft. Lauderdale via Philadelphia on USAir. The Paris->Philadelphia leg was handled by the same plane that does USAir's Philadelphia->Paris flight that same day. The incoming flight was about four hours late, so of course our outgoing flight was also four hours late. Sucks, but what can you do.

      So we get into Philadelphia at about 9PM instead of 4:30PM and everybody rushes to get any last-minute connections they can. I was already stuffed and had to wait for the next day's flight, but a lot of people had chances to make late flights to their destination. We all get off the plane, go through customs, get to USAir's rebooking desk.

      Two people are working this desk.

      An airplane with three hundred people comes in four hours late. USAir knew that this flight would be late almost a full day in advance, since it was a cascade effect from the other flight's delay. And yet, none of USAir's genius managers had the presence of mind to call in a few extra employees that night to speed things along.

      A lot of people missed connections they otherwise could have made, because they had to wait in line for an hour to get new tickets.

      Since USAir obviously hates their customers exceptionally strongly, I won't be flying with them again.

      This isn't really an isolated incident, either, just the most recent bad one. The entire industry has a serious problem with this, and I have a feeling it's going to take a couple of high-profile bankruptcies before they get a grip on it.

      --
      Mod down posts with a "Free Mac Mini/iPod" sig, they're spam!
    2. Re:whole story? by winwar · · Score: 2, Informative

      "I have little sympathy for people that whine about holiday travel when they didn't plan for things like this."

      Okay troll, I'll bite. Maybe he had a limited amount of time off. Maybe that was the most convenient time to fly. Whatever. It doesn't matter.

      He shouldn't have to plan for weather, high traffic, and/or computer screwups. That is the airline's JOB. You know, the people who took the money and agreed to get him from point A to point B. Bad weather in the winter? From the massive effects it has on the airlines, you'd think this is the first time they have ever experienced it.... Running a computer system they KNOW will fail under load?!? Other airlines running out of deicing fluid?!? Excuse me, it IS THE AIRLINE'S FAULT. When your system is such that one winter storm will screw it up, and it happens repeatedly, and you do nothing to change it, it is broken and your fault.

      But they don't care. And that was the grandparent's point. Admit it is your fault, refund his money, and let him make other plans.

      People accept that there will be problems; lying to them just pisses them off and guarantees that they WON'T believe you if it ever really isn't your fault. Accepting blame tends to build respect.

  7. From old information... by gminks · · Score: 5, Informative
    According to this article [written in 1995], Delta and AT&T created a new company called TransQuest Information Solutions.

    This article outlines how this joint venture re-vamped Delta's IT systems (again remember, this is 1995):


    During 1995 and 1996, TransQuest reengineered Delta's systems to migrate them from Hitachi mainframes running Natural, Adabas, and DB2 to an open systems environment. The new systems are written in C++ and access Sybase databases of reusable and distributed objects. The systems run primarily on Sun, HP and AT&T servers under UNIX with clients running under UNIX, MS-DOS, and Windows. The clients are connected to the servers over high bandwidth TCP/IP frame relay networks.

    Job titles for the company's 1,100 computer professionals include Systems Engineer and Software Engineer 1 through 8. Staff members recently developed an aircraft weight balance system that can be accessed by pilots to determine how luggage and fuel have been distributed within the aircraft for balance during a flight. This system was developed in C++ on AT&T and HP UNIX servers and will be available on 40,000 devices to 2,000 users.


    The trail runs dry here; job postings stopped around 2001.

    Which really raises suspicions that all the code is written and maintained offshore. The question now becomes who is handling this for Delta.

    One of Tata's spinoffs, Airline Financial Support Services, is described as


    "an example of an external service provider that handles a wide range of back-office functions for the airlines. AFS handles sales, refund, traffic and cargo; performs fare audits; manages yields and revenues by performing departure and post-departure processing checks; books crews; deals with overbooked flights and wait-lists; administers frequent flyer programs; draws up flight navigation charts, such as landing or route facility charts; and provides customer care." This according to ebstrategy.com


    Wipro handles some of Delta's inbound reservation calls in India and the Philippines.

    In conclusion, it would appear that either Tata's AFS arm or Wipro do the IT for Delta airlines.
  8. Possible system OS by Anonymous Coward · · Score: 1, Informative

    Judging by www.comair.com and their job ops, it's probably HP-UX or Windows. More than likely the Unix flavor rather than Windows. Why down for a couple of days? Probably a database restore. Never happens in TPF. Those mainframe systems crash and are back up with very little database degradation. By the way, in the job ops, if you want to be a crew scheduler, you only need a HS diploma!

  9. Re:Someone's gotta say it... by Anonymous Coward · · Score: 2, Informative

    I've done contract programming for Comair. They use HP/UX for the servers that handle most things except HR which is mostly Windows. The system that went down was a COTS app that handles crew scheduling. It is a bid system that allows crew to bid for flights based on seniority w/ constraints to match FAA rules.

    Their IT director is really sharp, but he faces some real problems. First, IIRC, they only created a dedicated IT shop about three years ago. Second, their budget is small compared to the task they have to perform. Comair is an airline and airlines have been in real trouble since 9/11.

  10. Probably TPF by Anonymous Coward · · Score: 1, Informative

    More than likely it is TPF as Delta is a TPF shop.

    TPF (http://www-306.ibm.com/software/htp/tpf/index.html) has been around since the '60s and is used by all the major airlines, most of the large hotels and, most bizarrely, NYC 911.

  11. Re:Fire away! by [Xorian] · · Score: 5, Informative

    Someone from Comair (who shall remain anonymous) provided me with some details which people here would be interested in:

    The computer system in question runs AIX. The box itself is still up and running just fine; this is purely an application error. This application was not written in-house at Comair, but by another large aerospace company -- SBS (http://www.sbsint.com/, owned by Boeing.) This bit of software does not use an external database, it tracks everything itself. It is a dedicated system responsible only for flight crew assignments. (The blather in the original submission about passenger reservations is way off-base. Those functions are handled by a completely different system.)

    The great majority of Comair's traffic flows through the midwest, and the central base of operations is in Cincinnati. The midwest was hit by a major snowstorm this week, causing many, many crew reassignments. It appears right now that the application in question has a hard limit of 32,000 changes per month (ouch). Consider that Comair runs 1,100 flights a day and there are usually 3 crew members on each aircraft. A big storm like this can cause problems for days after the snow stops falling. That's a whole lot of crew changes.

    In Comair's defense, this has never happened before and is unlikely to happen again. The crew system was already on the chopping block long before this incident, with its replacement scheduled to go live in January. If this freak storm had happened a month later, this likely never would have occurred.

    --
    CVS is teh suck. Use Vesta instead.
  12. Re:Fire away! by [Xorian] · · Score: 4, Informative


    Just to be absolutely clear: I've only ever communicated with this person on-line, and I can't verify who they are in real life or that they actually work for Comair. It seemed credible though, and it seemed worth posting to de-bunk the slashdot knee-jerk reaction of blaming Microsoft. To me, an application using a 16-bit integer for something seems like a very likely explanation.
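
    For anyone wondering why a 16-bit integer would line up with a limit of roughly 32,000, here is a small Python sketch. It assumes a signed 16-bit counter, which is only a guess on my part and not something confirmed about the actual application. The days-to-limit figure uses the 1,100 flights and 3 crew members per aircraft quoted above and assumes, as a worst case, that every crew assignment changes every day.

```python
import ctypes

INT16_MAX = 2**15 - 1  # largest value a signed 16-bit integer can hold


def bump(counter):
    """Add one to a counter stored as a signed 16-bit integer.

    ctypes.c_int16 does no overflow checking, so the value wraps
    around the same way a 16-bit variable typically would.
    """
    return ctypes.c_int16(counter + 1).value


at_limit = bump(INT16_MAX - 1)   # reaches the maximum
wrapped = bump(at_limit)         # one more change wraps to a negative number
print(INT16_MAX, at_limit, wrapped)

# Worst-case illustration: every crew assignment changes every day.
changes_per_day = 1100 * 3
days_until_limit = INT16_MAX // changes_per_day
print(changes_per_day, days_until_limit)
```

    Whether the real code wrapped, refused new records, or failed some other way is not something this sketch can say; it only shows where a number near 32,000 would come from.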


    --
    CVS is teh suck. Use Vesta instead.
  13. Re:Southwest refuses to drink the Kool-aid by Anonymous Coward · · Score: 4, Informative
    Actually, the only thing that makes these sorts of problems easier for Southwest is the consolidated fleet types. With nothing but 737's, you don't add complexity to the scheduler for things like pilot and f/a qualifications.

    What happened to Comair here could happen to just about any airline. There is no comprehensive suite of software that handles crew scheduling, aircraft scheduling, reservations, and the myriad of other functions that are needed to run an airline.

    Reservations, for other than tiny airlines, are still managed by large TPF mainframes. TPF is a very "bare bones" operating system that runs on IBM mainframes, and was written specifically to deal with high volume / high transaction rate systems. Personally, I've seen 5 attempts at 3 different airlines to replace it with something modern (like Unix with an RDBMS). Each attempt failed miserably, and the airline went back to TPF. Note that TPF is not MVS, OS/390, or any other more mainstream mainframe OS. It's purpose-built.

    Unfortunately, this means that all of the other applications have to interface with TPF via screen scraping. To further compound the problem, no "suites" exist to handle the following functions, so most airlines have to "sew together" best of breed solutions for these basic functions:

    • Crew Scheduling - F/A's and pilots bid on slots to fly, this system takes those bids and turns them into a schedule.
    • Aircraft Scheduling - Tracks which tail numbers are flying which flights for the dispatchers
    • Optimization - Different optimizers to do things like:
      • Fuel Tankering - Use the jets as "tankers" so that you buy fuel where it's cheapest for flights later in the day
      • Crew Optimization - "Traveling Salesman" type solver to incur lowest labor cost, get crews back to home base, etc
      • Schedule Optimization - Use the aircraft in the most cost efficient way to cover all of your scheduled flights.
      • Maintenance Optimization - Pull aircraft in for Scheduled Maintenance at the optimum time.
      • Reaccommodation - When things go wrong (weather, mechanicals, whatever), pull in all of the above variables to crank out a new schedule, crewing, mx schedule, etc.
    • Booking Engines, for the internet and reservations agents
    • Point of Sale and Boarding functions for agents, skycaps, and kiosks
    • Interline functions where other airlines sell your tickets, and transfers for baggage, etc
    Anyhow, this list isn't comprehensive, but shows enough of the disparate pieces that you can imagine why these "glitches" happen. Very few of the items from the list above come from the same vendor, or even run on the same platforms.
  14. Yep, you are right! by Anonymous Coward · · Score: 5, Informative

    Your statements are accurate.

    I was a unix sys admin there, but left for greener pastures during the dot-com craze. The non-redundant hardware at the time ran AIX, and had a great support contract from IBM. The SBS application however, always had monthly issues, at least at that airline. They were looking for a replacement then, and I'm not surprised they still haven't replaced it.

  15. Simple Solution by jeephistorian · · Score: 2, Informative

    Take Amtrak!

    Amtrak receives around $500 million for a total budget, while air travel receives around $15 billion in subsidies. Take the train and save everyone money!

    _____________

    --
    Huh?
  16. Re:Fire away! by Anonymous Coward · · Score: 5, Informative
    If it was the crew scheduling system, and it was SBS's Maestro Crew scheduling system, I can fill in some details.

    Maestro is delivered on AIX, uses a rather old version of Informix for its database, and is tied together using the TUXEDO TP monitor from BEA.

    The business logic is written in C, and abstracted away using Tuxedo.

    In the case of a major schedule disruption, this program isn't responsible for "solving" the problem, but is responsible for being the system of record for holding the new crew schedule.

    My guess is that the changes to the crew schedule were large enough that some piece of the system was overwhelmed. ( For example, a transaction that was too large and overran the rollback buffers in Informix ).

    Without the system of record in place, a manual process would be very difficult. You would have to figure out:

    • Which crews were in which locations
    • What aircraft each crew member was qualified on.
    • How long they had flown already that day. ( Legalities about how much time you can fly before you need mandatory rest )
    • Which routes to send those crews on
    • How to get the crews back to a specific city to run the next day's schedule
    Of course, any mistakes you made doing this manually would overflow into other systems. For example, you might send an aircraft that's due maintenance to a city with no maintenance facilities.

    Also, for those that were critical of the system not being highly available... this doesn't sound like the kind of problem that HACMP and replicated databases would have helped with. The hot standby would have choked at the exact same point.
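
    To make the manual checks listed above a bit more concrete, here is a rough Python sketch of just the first three (location, aircraft qualification, hours already flown). The record layout, the names, and the 8-hour default are all made up for illustration and have nothing to do with how Maestro actually stores its data.

```python
from dataclasses import dataclass


@dataclass
class CrewMember:
    name: str
    location: str              # airport code where the crew member is right now
    qualified_on: frozenset    # aircraft types this crew member may fly
    hours_flown_today: float


def eligible_crew(crew, origin, aircraft_type, flight_hours, max_hours=8.0):
    """Return crew who are at the origin, qualified on the aircraft type,
    and would stay within max_hours after flying this leg.

    max_hours is a placeholder limit, not a statement of the actual rule.
    """
    return [
        member for member in crew
        if member.location == origin
        and aircraft_type in member.qualified_on
        and member.hours_flown_today + flight_hours <= max_hours
    ]


# Hypothetical crew records.
crew = [
    CrewMember("A", "CVG", frozenset({"CRJ"}), 2.0),
    CrewMember("B", "CVG", frozenset({"EMB"}), 1.0),   # wrong aircraft type
    CrewMember("C", "ATL", frozenset({"CRJ"}), 0.0),   # wrong city
    CrewMember("D", "CVG", frozenset({"CRJ"}), 7.5),   # would exceed the limit
]
print([m.name for m in eligible_crew(crew, "CVG", "CRJ", 1.5)])
```

    Even this toy filter needs accurate, current data on every crew member, which is exactly what disappears when the system of record is down.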

  17. Re:Fire away! by Daa · · Score: 5, Informative

    Just to give you an idea, here is the applicable FAA reg for crew scheduling, and the pilots' contract may have additional terms that must be met.

    121.471 Flight time limitations and rest requirements: All flight crewmembers.

    (a) No certificate holder conducting domestic operations may schedule any flight crewmember and no flight crewmember may accept an assignment for flight time in scheduled air transportation or in other commercial flying if that crewmember's total flight time in all commercial flying will exceed--

    (1) 1,000 hours in any calendar year;

    (2) 100 hours in any calendar month;

    (3) 30 hours in any 7 consecutive days;

    (4) 8 hours between required rest periods.

    (b) Except as provided in paragraph (c) of this section, no certificate holder conducting domestic operations may schedule a flight crewmember and no flight crewmember may accept an assignment for flight time during the 24 consecutive hours preceding the scheduled completion of any flight segment without a scheduled rest period during that 24 hours of at least the following:

    (1) 9 consecutive hours of rest for less than 8 hours of scheduled flight time.

    (2) 10 consecutive hours of rest for 8 or more but less than 9 hours of scheduled flight time.

    (3) 11 consecutive hours of rest for 9 or more hours of scheduled flight time.

    (c) A certificate holder may schedule a flight crewmember for less than the rest required in paragraph (b) of this section or may reduce a scheduled rest under the following conditions:

    (1) A rest required under paragraph (b)(1) of this section may be scheduled for or reduced to a minimum of 8 hours if the flight crewmember is given a rest period of at least 10 hours that must begin no later than 24 hours after the commencement of the reduced rest period.

    (2) A rest required under paragraph (b)(2) of this section may be scheduled for or reduced to a minimum of 8 hours if the flight crewmember is given a rest period of at least 11 hours that must begin no later than 24 hours after the commencement of the reduced rest period.

    (3) A rest required under paragraph (b)(3) of this section may be scheduled for or reduced to a minimum of 9 hours if the flight crewmember is given a rest period of at least 12 hours that must begin no later than 24 hours after the commencement of the reduced rest period.

    (4) No certificate holder may assign, nor may any flight crewmember perform any flight time with the certificate holder unless the flight crewmember has had at least the minimum rest required under this paragraph.

    (d) Each certificate holder conducting domestic operations shall relieve each flight crewmember engaged in scheduled air transportation from all further duty for at least 24 consecutive hours during any 7 consecutive days.

    (e) No certificate holder conducting domestic operations may assign any flight crewmember and no flight crewmember may accept assignment to any duty with the air carrier during any required rest period.

    (f) Time spent in transportation, not local in character, that a certificate holder requires of a flight crewmember and provides to transport the crewmember to an airport at which he is to serve on a flight as a crewmember, or from an airport at which he was relieved from duty to return to his home station, is not considered part of a rest period.

    (g) A flight crewmember is not considered to be scheduled for flight time in excess of flight time limitations if the flights to which he is assigned are scheduled and normally terminate within the limitations, but due to circumstances beyond the control of the certificate holder (such as adverse weather conditions), are not at the time of departure expected to reach their destination within the scheduled time.
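
    As a rough illustration of what a scheduler has to encode, paragraphs (a), (b) and (c)(1)-(3) quoted above could be sketched in Python like this. The function names are made up, and the sketch ignores paragraphs (d) through (g) and any contract terms, so treat it as a reading aid and not a compliance tool.

```python
def within_flight_time_limits(year_hours, month_hours, week_hours, since_rest_hours):
    """Paragraph (a): total flight time must not exceed any of the four caps."""
    return (
        year_hours <= 1000
        and month_hours <= 100
        and week_hours <= 30
        and since_rest_hours <= 8
    )


def required_rest_hours(scheduled_flight_hours):
    """Paragraph (b): consecutive hours of rest required in the 24-hour window."""
    if scheduled_flight_hours < 8:
        return 9
    if scheduled_flight_hours < 9:
        return 10
    return 11


def reduced_rest_hours(scheduled_flight_hours):
    """Paragraph (c)(1)-(3): (minimum reduced rest, compensatory rest that must
    begin within 24 hours of the start of the reduced rest period)."""
    reductions = {9: (8, 10), 10: (8, 11), 11: (9, 12)}
    return reductions[required_rest_hours(scheduled_flight_hours)]


print(required_rest_hours(7.5), required_rest_hours(8.5), required_rest_hours(9))
print(reduced_rest_hours(8.5))
print(within_flight_time_limits(900, 95, 30, 8))
```

    Every crew reassignment has to be re-checked against rules like these, per crew member, which gives some idea of the bookkeeping involved.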

  18. Re:Fire away! by Anonymous Coward · · Score: 5, Informative
    No. It is the version of SBS that pre-dated Maestro. It was brought into Comair in the early 1980s. It's written in FORTRAN and uses whatever record management system came with the compiler.

    As such it used some very interesting data representations. For example, it tracked time using julian minutes. There are 44640 minutes in a 31 day month. That's small enough to fit in a 16-bit unsigned variable. This approach, nearly taboo by modern standards, was a God-send during Y2K. The system never needed to know what year it was. It became the running wisecrack, "You can't have a Y2K problem if you don't have a 'Y'".
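
    A quick Python sketch of the arithmetic, for the curious. It assumes "julian minutes" means minutes elapsed since the start of the month, which is my reading of the description and not something checked against the FORTRAN source. Note that 44,640 fits in an unsigned 16-bit value but would not fit in a signed one.

```python
MINUTES_PER_DAY = 24 * 60
UINT16_MAX = 2**16 - 1   # 65,535
INT16_MAX = 2**15 - 1    # 32,767


def to_month_minute(day, hour, minute):
    """Encode a day of the month (1-31) and a time as minutes since the month began."""
    return (day - 1) * MINUTES_PER_DAY + hour * 60 + minute


def from_month_minute(value):
    """Decode minutes since the month began back into (day, hour, minute)."""
    day_index, rest = divmod(value, MINUTES_PER_DAY)
    hour, minute = divmod(rest, 60)
    return day_index + 1, hour, minute


minutes_in_31_days = 31 * MINUTES_PER_DAY
last_minute = to_month_minute(31, 23, 59)
print(minutes_in_31_days, last_minute)
print(last_minute <= UINT16_MAX, last_minute <= INT16_MAX)
print(from_month_minute(to_month_minute(24, 22, 0)))
```

    There is no year or month anywhere in the encoding, which matches the wisecrack about not having a 'Y'.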

    The Aircraft to Flight assignments is another system, but the two share information.

  19. Some clarification by Anonymous Coward · · Score: 1, Informative

    Well... to try and provide a little clarification here, as I work for Comair. Here's the skinny:

    Crew and aircraft scheduling is done through a software package called SBS Track. This very same software package is used by many other airlines, including the two I worked for before coming to Comair. I don't know if their systems have the same hard-coded limit that ours does or not. This software package has _nothing_ to do with reservations, or anything concerning passengers whatsoever. It is simply the software we use to schedule our aircraft and crews to fly the list of flights that Delta wants us to fly.

    Crew scheduling is done by creating "pairings". A pairing is a sequence of flights that comprise a crewmember's trip. Anytime a change is made, a new pairing is generated, with the new sequence of flights. The system has a hard-coded limit of 32k pairings ("transactions" is what the IT folks call it) in a calendar month. As of 10:00 pm on 12/24, that limit was reached. Crew Scheduling was unable to create any new pairings, unable to track who would be flying what airplane to where, and basically unable to keep the airline flying at that point.

    It was not any kind of a hardware failure, there are backups for that. It is simply a software limitation, that when it was coded many years ago, nobody realistically thought it would ever be reached. Why they hardcoded a limit into it in the first place is beyond my knowledge. :)

    A major part of the problem is Comair's concentration in Cincinnati. CVG is our only crew base, and it is the largest single crew base of any airline in the world. Over 1800 pilots and 1100 flight attendants in one base. Not even any of the majors have a single base that large. Several of our software packages are woefully inadequate, and replacements have been sought for some time.

    As for getting things up and running on paper, this is a monumental task. Scheduling for 160+ aircraft and 2900+ crewmembers, and compliance with all FAA regulations, maintenance requirements, crew rest requirements, and contractual requirements is incredibly complex. In addition, we have crews and aircraft stranded across the country due to the weather that moved through that caused this whole mess in the first place. Add to that the very limited number of people who actually have the knowledge of all the requirements for scheduling, and coming up with a full schedule for the next day would be nearly impossible.

    Jan. 1 starts a new month, and the system will return to full functionality then. Until that date, however, our operations will be very limited.
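
    For the programmers reading, the behaviour described above (a new pairing for every change, a hard cap per calendar month, and a fresh count when the month rolls over) can be modelled with a toy Python class. The class name, the error type, and the default of 32,000 are all made up for illustration; the post only says "32k", and nothing here reflects the actual SBS code.

```python
class PairingLog:
    """Toy model: each schedule change creates a pairing, capped per calendar month."""

    def __init__(self, monthly_limit=32_000):  # "32k" in the post; exact value assumed
        self.monthly_limit = monthly_limit
        self.current_month = None
        self.count = 0

    def add_pairing(self, year, month):
        """Record one new pairing and return the running count for that month."""
        if (year, month) != self.current_month:
            # A new calendar month starts the count over.
            self.current_month = (year, month)
            self.count = 0
        if self.count >= self.monthly_limit:
            raise RuntimeError("monthly pairing limit reached")
        self.count += 1
        return self.count


# Tiny limit so the effect is visible without 32,000 iterations.
log = PairingLog(monthly_limit=3)
for _ in range(3):
    log.add_pairing(2004, 12)
try:
    log.add_pairing(2004, 12)
    blocked = False
except RuntimeError:
    blocked = True
print(blocked, log.add_pairing(2005, 1))
```

    In this model nothing is wrong with the stored data once the cap is hit; the log simply refuses new entries until the month changes.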

  20. Oh jesus christ... by Anonymous Coward · · Score: 1, Informative

    Some of you have no clue.

    The BS&T quotient on your average travel application is on the relatively nuts scale. Expedia, Travelocity, hotwire, priceline, whatever -- I'd ask that some of you with simple solutions go and speak to the lead travel-server dev for the product.

    You'll probably have to change pants after the conversation. Travel is stable, reliable, and generally rock-solid. The algo's for selecting airline flight prices or hotel room block-reservations are known and well-tested. The methods and protocols of communication are well-documented and generally straightforward.

    Until recently, it was all on hardware (and I'm speaking generally about the large travel providers -- Worldspan and Sabre come to mind) that was considered arcane. Ancient versions of Netware on an X.25 pad; screen-scrapers on top of it. Have fun trying to modernize!

    This does not surprise me in the slightest. We are stressing our ancient systems more than ever these days, and it should not be a surprise when the occasional ancient application (ctime, folks) gets floor'd and dies a bloody death.

    It'll be patched in a month.