Comair System Crashes; Passengers Stranded
Broerman writes "30,000 people have had their flights cancelled by Comair this weekend thanks to a computer system shutdown. It appears that, due to weather and other problems, flights began to be cancelled on Thursday and the backlog choked the system. 1,100 flights have been cancelled so far, including all flights through 12/26. Does anyone know what platform their system was based on? What kind of system just totally crashes? The official statement is that 'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.' It seems highly improbable that a system would crash because it had too many reservations. The system should only be able to hold as many reservations as it has flights/seats. It seems more likely that the system was overloaded with use and that caused a meltdown. When you add in the problems experienced by US Airways, this hasn't been a Merry Christmas for many."
Anybody know what they were running? I'd like to see this flamewar get started as soon as possible.
When I lived in Chicago, they would lose their radar system on what seemed like a strong wind. And I got stuck in Denver overnight once because the computer system they use to calculate the weight of departing flights crashed. I have a feeling these kinds of crashes are much more common than most people think.
Sounds like my Mother wrote the official statement. A techy would never report something in that way.
:p
Besides, it's pretty obvious their OS wasn't digitally signed.
The janitor pulled out the plug for the mainframe and used it to drive his floor polisher.
Simon.
"Does anyone know what platform their system was based on? What kind of system just totally crashes?"
A stab in the dark here but I'm assuming a system without foresight and redundancy?
They obviously didn't take mcbride's "license or we will have you shut down" threats seriously enough.
It's not the OS; it's the people behind it who are to blame. Yes, stupidity and MSW often go together, but in a few years one will probably occasionally see a massive Linux outage due to... similarly stupid people.
Back on May 1st of this year Delta's internal traffic monitoring system grounded them worldwide when it was hit by a worm (forget which one). Yours truly was flying that day. I spent 7 hours on a runway in Cleveland. (Talk about adding insult to injury.) Comair is a regional carrier of Delta's. I wonder who handles Delta's IT needs?
I have seen the major hub for an airline closed because of snow for just a couple of hours in the early morning, but the resulting chaos of rescheduling/rebooking caused the reservations system to crash after just a few minutes of uptime. The same would keep happening after restarts.
It is normal to test a system up to several times normal load, but they were seeing peaks at over 100x. The old, 3270 emulator based system would have slowly got through it, but the newer system died.
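To put rough numbers on that, here is a minimal sketch of a fixed-capacity system under normal load and under a spike. The rates and the `simulate_backlog` helper are made up for illustration; nothing here comes from the airline's actual system.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, ticks):
    """Return the queue length after each tick for a fixed-rate server."""
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick                 # new requests arrive
        backlog -= min(backlog, capacity_per_tick)   # serve what we can
        history.append(backlog)
    return history


# Hypothetical figures: the system was sized for 3x normal load,
# then hit with 100x normal load.
normal = simulate_backlog(arrivals_per_tick=100, capacity_per_tick=300, ticks=5)
storm = simulate_backlog(arrivals_per_tick=10_000, capacity_per_tick=300, ticks=5)
print(normal)  # [0, 0, 0, 0, 0]
print(storm)   # [9700, 19400, 29100, 38800, 48500]
```

Once arrivals exceed capacity, the backlog only grows, so a restart just refills the queue unless the incoming load is throttled first.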
See my journal, I write things there
'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.'
.
:-)
I am only trying to make sense of the comment quoted above from the official statement.
Crew assignment is a hard problem; it is usually formulated as a MILP (Mixed Integer Linear Programming) problem.
Such problems may be very hard to solve in reasonable time. Maybe (I'm shooting in the dark here) the first delays made the crew assignment problems grow too large to be solved in reasonable time. This would generate a snowball effect, as the assignment problems would keep on growing, making the system "crash".
We may never know what really happened but this would be a nice example for my classes
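A toy version for such a class might look like the sketch below. It is plain brute force, not a MILP solver, and the flight durations, remaining crew hours and the `assign_crews` helper are all invented for illustration. It assigns each flight to one crew so that no crew exceeds its remaining legal hours.

```python
from itertools import product


def assign_crews(flight_hours, crew_hours_left):
    """Brute-force search: try every flight-to-crew mapping and return the
    first one that keeps every crew within its remaining legal hours,
    together with the number of mappings tried."""
    crews = range(len(crew_hours_left))
    tried = 0
    for mapping in product(crews, repeat=len(flight_hours)):
        tried += 1
        used = [0.0] * len(crew_hours_left)
        for flight, crew in enumerate(mapping):
            used[crew] += flight_hours[flight]
        if all(u <= left for u, left in zip(used, crew_hours_left)):
            return mapping, tried
    return None, tried


# Hypothetical data: four flights (hours each) and two crews (hours left).
flights = [2.0, 3.0, 1.5, 2.5]
crews_left = [4.0, 5.0]
mapping, tried = assign_crews(flights, crews_left)
search_space = len(crews_left) ** len(flights)
print(mapping, tried, search_space)
```

The search space is crews to the power of flights: 16 mappings here, but already about 10^30 for 30 flights and 10 crews. Real solvers prune that space aggressively, but a wave of cancellations that forces a re-solve for every crew at once is exactly the case where they can run out of time.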
30,000 passengers? Getting dangerously close to an integer overflow there.
Of course, a techie didn't write the PR release. Who in their right mind would let a techie anywhere near a PR release?
BTW, Comair, a Delta feeder headquartered outside Cincinnati, says the system that crashed was used to monitor crew locations and track working hours to ensure no one went over the legal maximum. Comair says the system crashed as a result of massive crew rescheduling following a record snow in their service area on Wednesday. There is no backup.
-- Slashdot: When Public Access TV Says "No"
From: http://www.fly.faa.gov/FAQ/faq.html
The term "Rule 240" refers to a rule that existed before airline deregulation. There is no longer an actual Rule 240. The term, as it is now used, refers to each airline's "conditions of carriage" policy. You would need to contact the airlines to obtain this.
My sister flew Delta on Dec 23rd from Detroit to Atlanta. Plane was 2 hours late, but no big thing. Waited 5 hours for her luggage, with no dice. By the time we got in line for luggage services, there were at least 600 people in the line already.
Talking to other passengers from 10+ different flights from different cities, no one got their luggage that night. Apparently, it wasn't just Atlanta - the local news in Tampa and Detroit had segments on how the airports had taken over parts of taxiways to sort through seas of bags that didn't make it on to planes.
It's been 2 days, and Delta has no idea where the stuff from that flight is. I'm guessing it isn't just Comair that got hit by some computer problems.
Jerry
http://www.syslog.org/
Some of my co-workers are on contract developing Java software for Comair.
Comair are very tied to particular systems, and don't want to change even when the developers have pointed out problems. Case in point: a J2EE-based employee portal, based on Novell exteNd (Novell Portal Service) and a one-way HPUX server. NPS runs in Tomcat, which is servicing requests (via mod_jk) through Apache. No other application shares the machine, and Comair will only consider vertical scaling, not horizontal.
The application creates at least two threads per connection, and when the thread count goes beyond a relatively low threshold (between 300 and 400), Tomcat deadlocks. It's not because they're running out of space in the allocated JVM heap, and they've tuned mod_jk to allow for heavy load. The current solution is to restart Tomcat when the system locks up.
Novell's support has been less than stellar, so the Java contracting group was informally asked what to do. We had all kinds of useful suggestions, from dumping NPS for another portal implementation, to creating custom thread-pools, to using JDK 1.4 new I/O and a minimally-threaded design, and even using round-robin DNS and a group of independent portal servers to share the load. Comair are wedded to particular minimal cost solutions, however, and it shows.
At least when the portal crashes, it only impacts employees and not passengers.
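For what it's worth, the "custom thread pool" suggestion boils down to something like the sketch below. It is in Python rather than Java for brevity, and the cap of 8 workers, `handle_request` and `serve` are hypothetical stand-ins, not anything from the actual portal. The point is that the thread count stays bounded however many requests arrive.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # hypothetical cap, far below the 300-400 thread danger zone


def handle_request(request_id):
    """Stand-in for portal work; returns the name of the thread that ran it."""
    time.sleep(0.01)
    return threading.current_thread().name


def serve(request_count, max_workers=MAX_WORKERS):
    """Run all requests through a fixed-size pool; extra requests wait in
    the executor's queue instead of each getting a new thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(handle_request, range(request_count)))


names = serve(200)
distinct_threads = len(set(names))
print(len(names), distinct_threads)
```

Two hundred requests complete, but never on more than `MAX_WORKERS` threads. The cost is that requests queue under load, which is usually preferable to a deadlock and a restart.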
Somewhere deep in the code is a comment that says:
// I don't need to check for this condition because
// my asshole manager Steve Johnson says it'll
// never happen
{friggin' slash - When I say plain old text, I mean plain old text!}
This article outlines how this joint venture re-vamped Delta's IT systems (again remember, this is 1995):
The trail runs dry here, job postings stopped around 2001.
Which really raises suspicions that all the code is written and maintained offshore. The question now becomes who is handling this for Delta.
One of Tata's spinoffs, Airline Financial Support Services, is described as
Wipro handles some of Delta's inbound reservation calls in India and the Philippines.
In conclusion, it would appear that either Tata's AFS arm or Wipro do the IT for Delta airlines.
Occasionally, however, the head IT guy gets over-ridden by management or by available finances. I've been there, saying "we need to spend money on this" and having to make do with much less money, or even with a cut in funding. You need to document the problem in advance to cover your ass, and get it in print and saved offsite to protect yourself from that kind of mistake. I've done that, too. It helped protect me from a nasty lawsuit because I demonstrated where I had told a consulting client, in print, when the systems would start failing and the resulting legal liabilities, and gotten it signed by the company notary.
I have watched the operation at Atlanta for over 21 years, and I've seen how cutthroat the competition for a major hub is, but it feels like watching two dogs fight over two bones--you can't tell if they're fighting out of greed or stupidity. Southwest doesn't even fly into Atlanta--they know that only a pyrrhic victory would be possible under those circumstances. Management at the other airlines has been criminally incompetent ever since airline deregulation, but it's the passengers, employees and shareholders who pay the penalty time and again.
What happened to Comair here could happen to just about any airline. There is no comprehensive suite of software that handles crew scheduling, aircraft scheduling, reservations, and the myriad of other functions that are needed to run an airline.
Reservations, for other than tiny airlines, are still managed by large TPF mainframes. TPF is a very "bare bones" operating system that runs on IBM mainframes, and was written specifically to deal with high volume / high transaction rate systems. Personally, I've seen 5 attempts at 3 different airlines to replace it with something modern (like Unix with an RDBMS). Each attempt failed miserably, and the airline went back to TPF. Note that TPF is not MVS, OS/390, or any other more mainstream mainframe OS. It's purpose built.
Unfortunately, this means that all of the other applications have to interface with TPF via screen scraping. To further compound the problem, no "suites" exist to handle the following functions, so most airlines have to "sew together" best of breed solutions for these basic functions:
- Crew Scheduling - F/A's and pilots bid on slots to fly; this system takes those bids and turns them into a schedule.
- Aircraft Scheduling - Tracks which tail numbers are flying which flights for the dispatchers
- Optimization - Different optimizers to do things like:
  - Fuel Tankering - Use the jets as "tankers" so that you buy fuel where it's cheapest for flights later in the day
  - Crew Optimization - "Traveling Salesman" type solver to incur lowest labor cost, get crews back to home base, etc
  - Schedule Optimization - Use the aircraft in the most cost efficient way to cover all of your scheduled flights.
  - Maintenance Optimization - Pull aircraft in for Scheduled Maintenance at the optimum time.
- Reaccommodation - When things go wrong (weather, mechanicals, whatever), pull in all of the above variables to crank out a new schedule, crewing, mx schedule, etc
- Booking Engines, for the internet and reservations agents
- Point of Sale and Boarding functions for agents, skycaps, and kiosks
- Interline functions where other airlines sell your tickets, and transfers for baggage, etc
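To make one of those optimizers concrete, the fuel tankering decision reduces to arithmetic like the sketch below. The prices, the 4% burn penalty and the `tankering_saving` helper are all assumptions for illustration; a real optimizer would also weigh landing weight limits, the rest of the day's flights and more.

```python
def tankering_saving(extra_fuel_kg, price_origin, price_dest, burn_penalty):
    """Saving from hauling extra_fuel_kg to the destination instead of
    buying it there. burn_penalty is the assumed fraction of the carried
    fuel that gets burned just to haul the extra weight."""
    loaded_kg = extra_fuel_kg / (1.0 - burn_penalty)  # load more to arrive with enough
    cost_tankered = loaded_kg * price_origin
    cost_bought_at_dest = extra_fuel_kg * price_dest
    return cost_bought_at_dest - cost_tankered


# Hypothetical prices per kg: cheap at origin, expensive at destination.
saving = tankering_saving(2000, price_origin=0.60, price_dest=0.75, burn_penalty=0.04)
# Nearly equal prices: the burn penalty makes tankering a loss.
loss = tankering_saving(2000, price_origin=0.60, price_dest=0.61, burn_penalty=0.04)
print(round(saving, 2), round(loss, 2))
```

A positive result means tanker, a negative one means buy at the destination. In this simplified model the break-even is `price_dest == price_origin / (1 - burn_penalty)`.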
Anyhow, this list isn't comprehensive, but it shows enough of the disparate pieces that you can imagine why these "glitches" happen. Very few of the items from the list above come from the same vendor, or even run on the same platforms.
Your statements are accurate.
I was a Unix sys admin there, but left for greener pastures during the dot-com craze. The non-redundant hardware at the time ran AIX, and had a great support contract from IBM. The SBS application, however, always had monthly issues, at least at that airline. They were looking for a replacement then, and I'm not surprised they still haven't replaced it.