Comair System Crashes; Passengers Stranded

Posted by timothy on Saturday December 25, 2004 @11:08PM from the for-very-high-values-of-too-many dept.

Broerman writes "30,000 people have had their flights cancelled by Comair this weekend thanks to a computer system shutdown. It appears that, due to weather and other problems, flights began to be cancelled on Thursday and the backlog choked the system. 1,100 flights have been cancelled so far, including all flights through 12/26. Does anyone know what platform their system was based on? What kind of system just totally crashes? The official statement is that 'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.' It seems highly improbable that a system would crash because it had too many reservations. The system should only be able to hold as many reservations as it has flights/seats. It seems more likely that the system was overloaded with use and that caused a meltdown. When you add in the problems experienced by US Airways, this hasn't been a Merry Christmas for many."

18 of 398 comments

Fire away! by weeksie · 2004-12-25 23:11 · Score: 5, Funny

Anybody know what they were running? I'd like to see this flamewar get started as soon as possible.
1. Re:Fire away! by mirko · 2004-12-25 23:16 · Score: 5, Insightful
  
  There was recently a big card problem here in Europe.
  It did not come from a particular OS, but simply from a partition being filled by index tablespace extents.
  So it could just be that they ran out of space and it froze the whole application.
  
  --
  Trolling using another account since 2005.
2. Re:Fire away! by [Xorian] · 2004-12-26 05:56 · Score: 5, Informative
  
  Someone from Comair (who shall remain anonymous) provided me with some details which people here would be interested in:
  
  The computer system in question runs AIX. The box itself is still up and running just fine; this is purely an application error. This application was not written in-house at Comair, but by another large aerospace company -- SBS (http://www.sbsint.com/, owned by Boeing.) This bit of software does not use an external database, it tracks everything itself. It is a dedicated system responsible only for flight crew assignments. (The blather in the original submission about passenger reservations is way off-base. Those functions are handled by a completely different system.)
  
  The great majority of Comair's traffic flows through the midwest, and the central base of operations is in Cincinnati. The midwest was hit by a major snowstorm this week, causing many, many crew reassignments. It appears right now that the application in question has a hard limit of 32,000 changes per month (ouch). Consider that Comair runs 1,100 flights a day and there are usually 3 crew members on each aircraft. A big storm like this can cause problems for days after the snow stops falling. That's a whole lot of crew changes.
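
A rough back-of-the-envelope check of those figures, as a sketch only. It assumes the reported cap counts one change per crew member per affected flight, which is not confirmed anywhere in this thread, and the names below are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the comment above.
# Assumption (not confirmed by the source): the monthly cap counts one
# "change" per crew member per affected flight.
FLIGHTS_PER_DAY = 1_100        # quoted above
CREW_PER_FLIGHT = 3            # "usually 3 crew members"
MONTHLY_CHANGE_LIMIT = 32_000  # reported hard limit

def days_until_limit(changes_per_crew_slot: float) -> float:
    """Days of full-schedule disruption before the cap is reached."""
    slots_per_day = FLIGHTS_PER_DAY * CREW_PER_FLIGHT
    return MONTHLY_CHANGE_LIMIT / (slots_per_day * changes_per_crew_slot)

crew_slots_per_day = FLIGHTS_PER_DAY * CREW_PER_FLIGHT
print(crew_slots_per_day)               # 3300
print(round(days_until_limit(1.0), 1))  # 9.7
print(round(days_until_limit(2.0), 1))  # 4.8
```

Under those assumptions, a disruption that touches every crew slot once a day would use up the month's allowance in under ten days, and faster if crews are reassigned more than once.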
  
  In Comair's defense, this has never happened before and is unlikely to happen again. The crew system was already on the chopping block long before this incident, with its replacement scheduled to go live in January. If this freak storm had happened a month later, this likely never would have occurred.
  
  --
  CVS is teh suck. Use Vesta instead.
3. Re:Fire away! by Anonymous Coward · 2004-12-26 06:48 · Score: 5, Informative
  If it was the crew scheduling system, and it was SBS's Maestro Crew scheduling system, I can fill in some details.
  Maestro is delivered on AIX, uses a rather old version of Informix for its database, and is tied together using the TUXEDO TP monitor from BEA.
  The business logic is written in C, and abstracted away using Tuxedo.
  In the case of a major schedule disruption, this program isn't responsible for "solving" the problem, but it is responsible for being the system of record that holds the new crew schedule.
  My guess is that the changes to the crew schedule were large enough that some piece of the system was overwhelmed (for example, a transaction that was too large and overran the rollback buffers in Informix).
  Without the system of record in place, a manual process would be very difficult. You would have to figure out:
  
  Which crews were in which locations
  
  What aircraft each crew member was qualified on.
  
  How long they had flown already that day. ( Legalities about how much time you can fly before you need mandatory rest )
  
  Which routes to send those crews on
  
  How to get the crews back to a specific city to run the next day's schedule
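
To make the first three questions in that list concrete, here is a minimal sketch of the kind of record a system of record would have to hold. Everything in it is hypothetical: the `CrewMember` fields, the `eligible_crew` helper, the aircraft type labels and the 8-hour default are illustrative assumptions, not details of the SBS software. Routing and repositioning (the last two questions) are not modelled.

```python
from dataclasses import dataclass, field

@dataclass
class CrewMember:
    name: str
    location: str                                    # airport the crew member is at now
    qualified_on: set = field(default_factory=set)   # aircraft types
    hours_flown_today: float = 0.0

def eligible_crew(crew, airport, aircraft_type, flight_hours, max_hours=8.0):
    """Return crew at `airport`, qualified on `aircraft_type`, who can fly
    `flight_hours` more without exceeding `max_hours` (an assumed limit)."""
    return [
        c for c in crew
        if c.location == airport
        and aircraft_type in c.qualified_on
        and c.hours_flown_today + flight_hours <= max_hours
    ]

roster = [
    CrewMember("A", "CVG", {"CRJ"}, 2.0),
    CrewMember("B", "CVG", {"CRJ"}, 7.5),   # too close to the limit
    CrewMember("C", "ORD", {"CRJ"}, 0.0),   # wrong airport
    CrewMember("D", "CVG", {"EMB"}, 1.0),   # not qualified on this type
]
names = [c.name for c in eligible_crew(roster, "CVG", "CRJ", 1.5)]
print(names)  # ['A']
```

Even this toy version needs current location, qualifications and hours for every crew member, which is exactly the information that is hard to reconstruct by hand once the system is down.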
  
  Of course, any mistakes you made doing this manually would overflow into other systems. For example, you might send an aircraft that's due maintenance to a city with no maintenance facilities.
  Also, for those that were critical of the system not being highly available... this doesn't sound like the kind of problem that HACMP and replicated databases would have helped. The hot standby would have choked at the exact same point.
4. Re:Fire away! by Daa · 2004-12-26 07:41 · Score: 5, Informative
  
  Just to give you an idea, here is the applicable FAA reg for crew scheduling, and the pilots' contract may have additional terms that must be met.
  
  121.471 Flight time limitations and rest requirements: All flight crewmembers.
  
  (a) No certificate holder conducting domestic operations may schedule any flight crewmember and no flight crewmember may accept an assignment for flight time in scheduled air transportation or in other commercial flying if that crewmember's total flight time in all commercial flying will exceed--
  
  (1) 1,000 hours in any calendar year;
  
  (2) 100 hours in any calendar month;
  
  (3) 30 hours in any 7 consecutive days;
  
  (4) 8 hours between required rest periods.
  
  (b) Except as provided in paragraph (c) of this section, no certificate holder conducting domestic operations may schedule a flight crewmember and no flight crewmember may accept an assignment for flight time during the 24 consecutive hours preceding the scheduled completion of any flight segment without a scheduled rest period during that 24 hours of at least the following:
  
  (1) 9 consecutive hours of rest for less than 8 hours of scheduled flight time.
  
  (2) 10 consecutive hours of rest for 8 or more but less than 9 hours of scheduled flight time.
  
  (3) 11 consecutive hours of rest for 9 or more hours of scheduled flight time.
  
  (c) A certificate holder may schedule a flight crewmember for less than the rest required in paragraph (b) of this section or may reduce a scheduled rest under the following conditions:
  
  (1) A rest required under paragraph (b)(1) of this section may be scheduled for or reduced to a minimum of 8 hours if the flight crewmember is given a rest period of at least 10 hours that must begin no later than 24 hours after the commencement of the reduced rest period.
  
  (2) A rest required under paragraph (b)(2) of this section may be scheduled for or reduced to a minimum of 8 hours if the flight crewmember is given a rest period of at least 11 hours that must begin no later than 24 hours after the commencement of the reduced rest period.
  
  (3) A rest required under paragraph (b)(3) of this section may be scheduled for or reduced to a minimum of 9 hours if the flight crewmember is given a rest period of at least 12 hours that must begin no later than 24 hours after the commencement of the reduced rest period.
  
  (4) No certificate holder may assign, nor may any flight crewmember perform any flight time with the certificate holder unless the flight crewmember has had at least the minimum rest required under this paragraph.
  
  (d) Each certificate holder conducting domestic operations shall relieve each flight crewmember engaged in scheduled air transportation from all further duty for at least 24 consecutive hours during any 7 consecutive days.
  
  (e) No certificate holder conducting domestic operations may assign any flight crewmember and no flight crewmember may accept assignment to any duty with the air carrier during any required rest period.
  
  (f) Time spent in transportation, not local in character, that a certificate holder requires of a flight crewmember and provides to transport the crewmember to an airport at which he is to serve on a flight as a crewmember, or from an airport at which he was relieved from duty to return to his home station, is not considered part of a rest period.
  
  (g) A flight crewmember is not considered to be scheduled for flight time in excess of flight time limitations if the flights to which he is assigned are scheduled and normally terminate within the limitations, but due to circumstances beyond the control of the certificate holder (such as adverse weather conditions), are not at the time of departure expected to reach their destination within the scheduled time.
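
The numeric limits in paragraphs (a), (b) and (c) as quoted above can be sketched as a few small functions. This is only an illustration of the quoted text: the function and parameter names are invented here, paragraphs (d) through (g) are not modelled, and nothing below reflects how any real scheduling system implements the rule.

```python
# Sketch of the numeric limits quoted above from 121.471(a), (b) and (c).
# Function and parameter names are illustrative, not from any real system.

def exceeds_flight_time_limits(year_h, month_h, seven_day_h, since_rest_h):
    """True if any total exceeds a limit in paragraph (a)."""
    return (year_h > 1000 or month_h > 100
            or seven_day_h > 30 or since_rest_h > 8)

def required_rest_hours(scheduled_flight_h):
    """Minimum consecutive rest under paragraph (b), given the scheduled
    flight time in the 24-hour window."""
    if scheduled_flight_h < 8:
        return 9
    if scheduled_flight_h < 9:
        return 10
    return 11

def reduced_rest_hours(scheduled_flight_h):
    """(reduced minimum rest, compensatory rest) under paragraph (c)."""
    normal = required_rest_hours(scheduled_flight_h)
    return {9: (8, 10), 10: (8, 11), 11: (9, 12)}[normal]

print(required_rest_hours(7.5))   # 9
print(reduced_rest_hours(8.5))    # (8, 11)
print(exceeds_flight_time_limits(900, 101, 20, 5))  # True
```

Every reassignment after a storm has to pass checks like these for each crew member involved, which is part of why the rescheduling volume grows so quickly.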
5. Re:Fire away! by Anonymous Coward · 2004-12-26 08:24 · Score: 5, Informative
  
  No. It is the version of SBS that pre-dated Maestro. It was brought into Comair in the early 1980s. It's written in FORTRAN and uses whatever record management system came with the compiler.
  As such, it used some very interesting data representations. For example, it tracked time using julian minutes. There are 44,640 minutes in a 31-day month. That's small enough to fit in a 16-bit unsigned variable. This approach, nearly taboo by modern standards, was a godsend during Y2K. The system never needed to know what year it was. It became the running wisecrack: "You can't have a Y2K problem if you don't have a 'Y'."
  Aircraft-to-flight assignment is handled by another system, but the two share information.
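
The 16-bit arithmetic in that comment is easy to check. The sketch below assumes "julian minutes" means minutes elapsed since the start of the month, which is inferred from the 31-day figure rather than stated; `minute_of_month` is a made-up helper, not the FORTRAN system's actual encoding.

```python
import struct

MINUTES_PER_DAY = 24 * 60

def minute_of_month(day, hour, minute):
    """Minutes elapsed since the start of the month (day is 1-based)."""
    return (day - 1) * MINUTES_PER_DAY + hour * 60 + minute

longest_month = 31 * MINUTES_PER_DAY
print(longest_month)            # 44640
print(longest_month <= 0xFFFF)  # True: fits in unsigned 16 bits
print(longest_month <= 0x7FFF)  # False: too big for signed 16 bits

# Packing as an unsigned 16-bit value ("H") round-trips cleanly.
last = minute_of_month(31, 23, 59)
packed = struct.pack("<H", last)
print(len(packed), struct.unpack("<H", packed)[0])  # 2 44639
```

Note that the value only fits if the field is unsigned; a signed 16-bit field tops out at 32,767.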
Happens all the time... by Anonymous Coward · 2004-12-25 23:12 · Score: 5, Interesting

When I lived in Chicago, they would lose their radar system on what seemed like a strong wind. And I got stuck in Denver overnight once because the computer system they use to calculate the weight of departing flights crashed. I have a feeling these kinds of crashes are much more common than most people think.
stating the obvious by Anonymous Coward · 2004-12-25 23:20 · Score: 5, Insightful

"Does anyone know what platform their system was based on? What kind of system just totally crashes?"

A stab in the dark here but I'm assuming a system without foresight and redundancy?
blaming the system can backfire by ext42fs · 2004-12-25 23:24 · Score: 5, Insightful

It's not the OS, it's the people behind it who are to blame. Yes, stupidity and MSW often go together, but in a few years one will probably occasionally see a massive Linux outage due to... similarly stupid people.
Crew assignment is a hard problem by rsilva · 2004-12-25 23:59 · Score: 5, Informative

'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.'

I am only trying to make sense of the comment above from the official statement.

Crew assignment is a hard problem; it is usually formulated as an MILP (Mixed Integer Linear Programming) problem.

Such problems may be very hard to solve in reasonable time. Maybe (I'm shooting in the dark here) the first delays made the crew assignment problems grow too large to be solved in reasonable time. This would generate a snowball effect, as the assignment problems would keep on growing, making the system "crash".

We may never know what really happened but this would be a nice example for my classes :-)
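
For a classroom-style illustration of how fast such a search space grows, here is a small sketch. It is not an MILP solver: it swaps in plain brute force over every one-to-one crew-to-flight assignment, and the cost matrix is invented. A bare assignment problem like this one can in fact be solved efficiently by specialised algorithms; it is the rest, qualification and repositioning constraints of real crew scheduling that make the problem much harder.

```python
from itertools import permutations
from math import factorial

def best_assignment(cost):
    """Try every one-to-one crew-to-flight assignment; cost[c][f] is the
    (made-up) cost of putting crew c on flight f."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[c][p[c]] for c in range(n)))
    return best, sum(cost[c][best[c]] for c in range(n))

cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
print(best_assignment(cost))  # ((1, 0, 2), 5)

# Number of candidate assignments the brute-force search must examine.
for n in (3, 10, 20):
    print(n, factorial(n))
```

Three crews mean 6 candidates, ten mean 3,628,800, and twenty are already far beyond anything brute force can enumerate, so a disruption that enlarges the problem even modestly can push a naive search out of reach.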
System Tracked Crew Location, Not Reservations by reallocate · 2004-12-26 00:38 · Score: 5, Informative

Of course, a techie didn't write the PR release. Who in their right mind would let a techie anywhere near a PR release?

BTW, Comair, a Delta feeder headquartered outside Cincinnati, says the system that crashed was used to monitor crew locations and track working hours to ensure no one went over the legal maximum. Comair says the system crashed as a result of massive crew rescheduling following a record snow in their service area on Wednesday. There is no backup.

--
-- Slashdot: When Public Access TV Says "No"
Re:Travel tip by xlation · 2004-12-26 00:42 · Score: 5, Informative

From: http://www.fly.faa.gov/FAQ/faq.html

The term "Rule 240" refers to a rule that existed before airline deregulation. There is no longer an actual Rule 240. The term, as it is now used, refers to each airline's "conditions of carriage" policy. You would need to contact the airlines to obtain this.
Re:My theory? by rlauzon · 2004-12-26 00:45 · Score: 5, Funny

Probably not. It's an old story (quickly retold):

Army base computer going down every night. So the grunt in charge of it stayed the night to see what was happening. When the computers went down, he heard the hum of the floor buffer.

The janitor had plugged his floor buffer into the same power as the computers and it caused the crashes. It was quickly fixed by telling the janitor to not do that and putting locking covers on the power outlets.

But they dreaded telling the base commander what the issue was. So they told him it was "a buffer problem."
Not surprising, coming from Comair by Anonymous Coward · 2004-12-26 00:54 · Score: 5, Interesting

Some of my co-workers are on contract developing Java software for Comair.

Comair are very tied to particular systems, and don't want to change even when the developers have pointed out problems. Case in point: a J2EE-based employee portal, based on Novell exteNd (Novell Portal Service) and a one-way HPUX server. NPS runs in Tomcat, which is servicing requests (via mod_jk) through Apache. No other application shares the machine, and Comair will only consider vertical scaling, not horizontal.

The application creates at least two threads per connection, and when the thread count goes beyond a relatively low threshold (between 300 and 400), Tomcat deadlocks. It's not because they're running out of space in the allocated JVM heap, and they've tuned mod_jk to allow for heavy load. The current solution is to restart Tomcat when the system locks up.

Novell's support has been less than stellar, so the Java contracting group was informally asked what to do. We had all kinds of useful suggestions, from dumping NPS for another portal implementation, to creating custom thread-pools, to using JDK 1.4 new I/O and a minimally-threaded design, and even using round-robin DNS and a group of independent portal servers to share the load. Comair are wedded to particular minimal cost solutions, however, and it shows.
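
The thread-pool suggestion can be sketched in general terms. The example below is Python rather than Java and says nothing about Tomcat or NPS configuration; it only illustrates the idea of serving many requests with a fixed, small number of worker threads instead of creating threads per connection. `MAX_WORKERS` and `handle_request` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # assumed cap, far below the 300-400 threads mentioned above

peak = 0
active = 0
lock = threading.Lock()

def handle_request(request_id):
    """Stand-in for servicing one portal request."""
    global peak, active
    with lock:
        active += 1
        peak = max(peak, active)
    result = request_id * 2  # placeholder work
    with lock:
        active -= 1
    return result

# 1,000 requests are queued, but never more than MAX_WORKERS run at once.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(handle_request, range(1000)))

print(len(results), peak <= MAX_WORKERS)  # 1000 True
```

The design choice is that excess requests wait in a queue instead of each claiming new threads, so the thread count stays bounded no matter how many connections arrive.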

At least when the portal crashes, it only impacts employees and not passengers.
From old information... by gminks · 2004-12-26 01:51 · Score: 5, Informative

According to this article [written in 1995], Delta and AT&T created a new company called TransQuest Information Solutions.

This article outlines how this joint venture re-vamped Delta's IT systems (again remember, this is 1995):

During 1995 and 1996, TransQuest reengineered Delta's systems to migrate them from Hitachi mainframes running Natural, Adabas, and DB2 to an open systems environment. The new systems are written in C++ and access Sybase databases of reusable and distributed objects. The systems run primarily on Sun, HP and AT&T servers under UNIX with clients running under UNIX, MS-DOS, and Windows. The clients are connected to the servers over high bandwidth TCP/IP frame relay networks.

Job titles for the company's 1,100 computer professionals include Systems Engineer and Software Engineer 1 through 8. Staff members recently developed an aircraft weight balance system that can be accessed by pilots to determine how luggage and fuel have been distributed within the aircraft for balance during a flight. This system was developed in C++ on AT&T and HP UNIX servers and will be available on 40,000 devices to 2,000 users.

The trail runs dry here; job postings stopped around 2001.

That raises suspicions that all the code is now written and maintained offshore. The question then becomes who is handling this for Delta.

One of Tata's spinoffs, Airline Financial Support Services, is described as

"an example of an external service provider that handles a wide range of back-office functions for the airlines. AFS handles sales, refund, traffic and cargo; performs fare audits; manages yields and revenues by performing departure and post-departure processing checks; books crews; deals with overbooked flights and wait-lists; administers frequent flyer programs; draws up flight navigation charts, such as landing or route facility charts; and provides customer care." This is according to ebstrategy.com.

Wipro handles some of Delta's inbound reservation calls in India and the Philippines.

In conclusion, it would appear that either Tata's AFS arm or Wipro do the IT for Delta airlines.
Re:30,000? by edp · 2004-12-26 04:47 · Score: 5, Funny

"30,000 passengers? Getting dangerously close to an integer overflow there."
That is not a bug but an accurate model of reality. When you strand 32,768 passengers, they will turn negative.
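
For anyone who wants the joke spelled out: a signed 16-bit integer tops out at 32,767, and one more wraps around to a negative value. A minimal sketch, where `as_int16` is a made-up helper name:

```python
import ctypes

def as_int16(n):
    """Interpret n modulo 2**16 as a signed 16-bit integer."""
    return ctypes.c_int16(n).value

print(as_int16(30_000))  # 30000
print(as_int16(32_767))  # 32767
print(as_int16(32_768))  # -32768
```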
Re:My theory? by jridley · 2004-12-26 04:49 · Score: 5, Funny

A friend was sysadmin at a manufacturing plant, and the janitor kept plugging a very large, power-hungry floor polisher into the power-conditioned sockets. He was actually blowing power supplies. Each one cost several thousand dollars in service calls and downtime to replace.

My friend put "COMPUTER USE ONLY" stickers OVER the power-conditioned sockets. The janitor ripped them off to plug in, and blew another power supply.

My friend finally confronted the janitor, who was a really obstinate PITA. He stood there and said "Yeah, I did it, and I'm gonna keep doing it, and I don't give a damn about you or your fu*kin' computers."

This was an automotive union shop, where it is very difficult to get people fired.

But, in a show of karma rarely witnessed by mortals, the VP of the division was standing within earshot but out of sight. When the janitor finished saying he didn't give a damn that he was costing the company $10,000 a week because he was too lazy to go get an extension cord, the VP walked around the corner and said hi. I don't know whether the guy ran to his car or the VP kicked his ass right over the top of it.
Yep, you are right! by Anonymous Coward · 2004-12-26 06:41 · Score: 5, Informative

Your statements are accurate.

I was a Unix sysadmin there, but left for greener pastures during the dot-com craze. The non-redundant hardware at the time ran AIX, and had a great support contract from IBM. The SBS application, however, always had monthly issues, at least at that airline. They were looking for a replacement then, and I'm not surprised they still haven't replaced it.