Comair System Crashes; Passengers Stranded
Broerman writes "30,000 people have had their flights cancelled by Comair this weekend thanks to a computer system shutdown. It appears that, due to weather and other problems, flights began to be cancelled on Thursday and the backlog choked the system. 1,100 flights have been cancelled so far, including all flights through 12/26. Does anyone know what platform their system was based on? What kind of system just totally crashes? The official statement is that 'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.' It seems highly improbable that a system would crash because it had too many reservations. The system should only be able to hold as many reservations as it has flights/seats. It would seem that it's more likely that the system was overloaded with use and that caused a meltdown. When you add in the problems experienced by US Airways, this hasn't been a Merry Christmas for many."
Anybody know what they were running? I'd like to see this flamewar get started as soon as possible.
When I lived in Chicago, they would lose their radar system on what seemed like a strong wind. And I got stuck in Denver overnight once because the computer system they use to calculate the weight of departing flights crashed. I have a feeling these kinds of crashes are much more common than most people think.
Sounds like my Mother wrote the official statement. A techy would never report something in that way.
:p
Besides, it's pretty obvious their OS wasn't digitally signed.
Yep, it was Windows XP. ;)
I don't know. Frankly, it has less to do with the platform than the custom software that runs on it.
A blog like any other.
They're a bunch of incompetent boobs. The news keeps reporting on a "computer glitch" or a "computer malfunction". That's bullshit. This happened because some human(s) fucked up.
Linking to their home page will surely help the situation..
The janitor pulled out the plug for the mainframe and used it to drive his floor polisher.
Simon.
"Does anyone know what platform their system was based on? What kind of system just totally crashes?"
A stab in the dark here but I'm assuming a system without foresight and redundancy?
That doesn't need answering.
# cat
Damn, my RAM is full of llamas.
They obviously didn't take mcbride's "license or we will have you shut down" threats seriously enough.
It's not the OS, it's the people behind it who are to blame. Yes, stupidity and MSW often go together, but in a few years one will probably occasionally see a massive Linux outage due to... similarly stupid people.
Sounds like Comair could have used a little virtualized scalability and third party audited builds.
See Twelve Step TrustABLE IT : VLSBs in VDNZs From TBAs.
and also The ActiveGrid(TM) Grid Application Server and Grid Computing in general.
Back on May 1st of this year Delta's internal traffic monitoring system grounded them worldwide when it was hit by a worm (forget which one). Yours truly was flying that day. I spent 7 hours on a runway in Cleveland. (Talk about adding insult to injury.) Comair is a regional carrier of Delta's. I wonder who handles Delta's IT needs?
Too bad the airline will go bust because of this. But then all airlines are losing billions except for Southwest.
'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.'
.
:-)
I am only trying to make sense of the comment quoted above from the official statement.
Crew assignment is a hard problem; it is usually an MILP (Mixed Integer Linear Programming) problem.
Such problems may be very hard to solve in reasonable time. Maybe (I'm shooting in the dark here) the first delays made the crew assignment problems grow too large to be solved in reasonable time. This would generate a snowball effect, as the assignment problems would keep on growing, making the system "crash".
We may never know what really happened but this would be a nice example for my classes
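A toy illustration of the snowball idea above, in Python. This is not a MILP solver and certainly not whatever Comair ran; it is a brute-force enumeration over a made-up instance (the function name, flight lengths and duty-hour limits are all assumptions) that shows how the number of candidate crew assignments grows as more flights need to be covered.

```python
from itertools import product


def count_feasible_assignments(flight_hours, crew_hours_left):
    """Brute-force every way to give each flight one crew.

    Returns (assignments examined, assignments where no crew exceeds
    its remaining legal duty hours). Illustrative only: real crew
    schedulers use integer programming, not enumeration.
    """
    crews = range(len(crew_hours_left))
    examined = 0
    feasible = 0
    for assignment in product(crews, repeat=len(flight_hours)):
        examined += 1
        used = [0.0] * len(crew_hours_left)
        for flight, crew in enumerate(assignment):
            used[crew] += flight_hours[flight]
        if all(u <= limit for u, limit in zip(used, crew_hours_left)):
            feasible += 1
    return examined, feasible


if __name__ == "__main__":
    # Hypothetical numbers: three crews with 4 duty hours left each,
    # and a growing backlog of 2-hour flights to re-crew.
    crew_hours_left = [4.0, 4.0, 4.0]
    for n_flights in (2, 4, 6, 8):
        examined, feasible = count_feasible_assignments(
            [2.0] * n_flights, crew_hours_left
        )
        print(n_flights, "flights:", examined, "candidates,", feasible, "feasible")
```

With three crews, every extra flight triples the number of candidates to examine, while the duty-hour limit eventually leaves no feasible assignment at all. A real solver prunes far better than this, but the growth is the point.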
...slashdotted reservations?
Find a job you like and you will never work a day in your life.
30,000 passengers? Getting dangerously close to an integer overflow there.
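For anyone who missed the joke: it presumably assumes a signed 16-bit counter, whose largest value is 32,767. Nothing in this thread says what integer width the Comair system actually used, so the sketch below (Python, with a hypothetical helper name) only shows the arithmetic.

```python
INT16_MAX = 2**15 - 1  # largest value a signed 16-bit integer can hold


def wrap_int16(value):
    """Wrap an integer the way a signed 16-bit two's-complement counter would."""
    return (value + 2**15) % 2**16 - 2**15


if __name__ == "__main__":
    passengers = 30_000
    print("headroom:", INT16_MAX - passengers)
    print("one past the maximum wraps to:", wrap_int16(INT16_MAX + 1))
```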
Of course, a techie didn't write the PR release. Who in their right mind would let a techie anywhere near a PR release?
BTW, Comair, a Delta feeder headquartered outside Cincinnati, says the system that crashed was used to monitor crew locations and track working hours to ensure no one went over the legal maximum. Comair says the system crashed as a result of massive crew rescheduling following a record snow in their service area on Wednesday. There is no backup.
-- Slashdot: When Public Access TV Says "No"
From: http://www.fly.faa.gov/FAQ/faq.html
The term "Rule 240" refers to a rule that existed before airline deregulation. There is no longer an actual Rule 240. The term, as it is now used, refers to each airline's "conditions of carriage" policy. You would need to contact the airlines to obtain this.
As a preliminary finding that may or may not give us a clue as to what the internet system was running, Netcraft reports that www.comair.com is running Apache on HP-UX.
So don't assume that the internal system was Windows just yet. Then again, don't assume that it wasn't.
Hopefully I didn't put any [] around my words.
My sister flew Delta on Dec 23rd from Detroit to Atlanta. Plane was 2 hours late, but no big thing. Waited 5 hours for her luggage, with no dice. By the time we got in line for luggage services, there were at least 600 people in the line already.
Talking to other passengers from 10+ different flights from different cities, no one got their luggage that night. Apparently, it wasn't just Atlanta - the local news in Tampa and Detroit had segments on how the airports had taken over parts of taxiways to sort through seas of bags that didn't make it on to planes.
It's been 2 days, and Delta has no idea where the stuff from that flight is. I'm guessing it isn't just Comair that got hit by some computer problems.
Jerry
http://www.syslog.org/
Some of my co-workers are on contract developing Java software for Comair.
Comair are very tied to particular systems, and don't want to change even when the developers have pointed out problems. Case in point: a J2EE-based employee portal, based on Novell exteNd (Novell Portal Service) and a one-way HPUX server. NPS runs in Tomcat, which is servicing requests (via mod_jk) through Apache. No other application shares the machine, and Comair will only consider vertical scaling, not horizontal.
The application creates at least two threads per connection, and when the thread count goes beyond a relatively low threshold (between 300 and 400), Tomcat deadlocks. It's not because they're running out of space in the allocated JVM heap, and they've tuned mod_jk to allow for heavy load. The current solution is to restart Tomcat when the system locks up.
Novell's support has been less than stellar, so the Java contracting group was informally asked what to do. We had all kinds of useful suggestions, from dumping NPS for another portal implementation, to creating custom thread-pools, to using JDK 1.4 new I/O and a minimally-threaded design, and even using round-robin DNS and a group of independent portal servers to share the load. Comair are wedded to particular minimal cost solutions, however, and it shows.
At least when the portal crashes, it only impacts employees and not passengers.
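One of the suggestions above, a custom thread pool, can be sketched roughly as follows. This is Python rather than the Java the portal is written in, the handler is a stand-in, and it is not claimed to fix the Tomcat deadlock described; it only shows the idea of capping the thread count instead of spawning threads per connection.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def serve_requests(n_requests, max_workers):
    """Handle n_requests on a fixed-size pool.

    Returns the results and the set of worker thread names actually
    used, so the cap on threads can be checked.
    """
    seen_threads = set()
    lock = threading.Lock()

    def handle(request_id):
        # Hypothetical stand-in for real portal work.
        with lock:
            seen_threads.add(threading.current_thread().name)
        return request_id * 2

    # Requests beyond max_workers wait in the executor's queue
    # instead of each getting threads of their own.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(handle, range(n_requests)))
    return results, seen_threads


if __name__ == "__main__":
    results, threads = serve_requests(n_requests=1000, max_workers=8)
    print(len(results), "requests served by", len(threads), "threads")
```

However many requests arrive, the number of worker threads never exceeds `max_workers`, which keeps the process well below a threshold like the 300 to 400 threads mentioned above.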
Somewhere deep in the code is a comment that says:
// I don't need to check for this condition because
// my asshole manager Steve Johnson says it'll
// never happen
{friggin' slash - When I say plain old text, I mean plain old text!}
that in the name of sensationalism reporters haven't said, "terrorism is probably not to blame but the Dept. of Homeland Security is looking into it." It seems that after Sep. 11th, the news wants to try to connect everything even remotely bad with terrorism, and of course the Dept. of Homeland Security encourages them by using as vague of language as possible. Are people that easily frightened?
Monstar L
http://home.hccnet.nl/jaap.kranenburg/fun/xx/images/fun20020415.jpg
This article outlines how this joint venture re-vamped Delta's IT systems (again remember, this is 1995):
The trail runs dry here, job postings stopped around 2001.
Which really raises suspicions that all the code is written and maintained offshore. The question now becomes who is handling this for Delta.
One of Tata's spinoffs, Airline Financial Support Services, is described as
Wipro handles some of Delta's inbound reservation calls in India and the Philippines.
In conclusion, it would appear that either Tata's AFS arm or Wipro do the IT for Delta airlines.
This is a worst case scenario for a system of that nature because of so many dependent calculations and calls to other systems. It takes more than just having a plane and a crew...which is a lot of work all by itself. It has to have a gate and connecting flights. Then multiply all that by 30,000 people, roughly 120 plane loads, and complicate it by some airports being closed. I bet you could actually watch the lights get dimmer in the server room. Still when you know the potential peak demand you have reserve capacity. Slow is okay, stop is unacceptable.
That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage
Hopefully someone from Comair reads /. and will not be able to resist spilling the beans. This sounds like a lawsuit in the making. It was not weather related - it was someone trying to either save a buck by writing crappy software or having poor operational procedures. This is a Sarbanes-Oxley event - and hopefully, the truth will come out about what happened, and why the backup procedures were either not in place or did not work. I don't want to see them go bankrupt, but they should be held accountable.
I have watched the operation at Atlanta for over 21 years, and I've seen how cutthroat the competition for a major hub is, but it feels like watching two dogs fight over two bones--you can't tell if they're fighting out of greed or stupidity. Southwest doesn't even fly into Atlanta--they know that only a pyrrhic victory would be possible under those circumstances. Management at the other airlines has been criminally incompetent ever since airline deregulation, but it's the passengers, employees and shareholders who pay the penalty time and again.
The problem with your analysis is that point-to-point flying doesn't work when you start talking about international travel. It's just not possible to fly passengers to, say, Germany or Japan from every domestic airport. The way you do it is to accumulate passengers at a major hub on the coast and then fly from there.
The hub-spoke system is easier to manage, and can be profitable if the airlines realize that they aren't unlimited resources, and decentralize the hubs on a limited basis.
Anyways, Southwest doesn't drink anyone's koolaid; they run all their own in-house designed systems (I am not sure they are even on Sabre anymore), including web apps. It's an interesting concept, but it probably causes their IT managers to pull their hair out.
From Yahoo Jobs:
Software Engineer Cincinnati, OH $40K -$50K
What happened to Comair here could happen to just about any airline. There is no comprehensive suite of software that handles crew scheduling, aircraft scheduling, reservations, and the myriad of other functions that are needed to run an airline.
Reservations, for other than tiny airlines, are still managed by large TPF mainframes. TPF is a very "bare bones" operating system that runs on IBM mainframes, and was written specifically to deal with high volume / high transaction rate systems. Personally, I've seen 5 attempts at 3 different airlines to replace it with something modern (like Unix with an RDBMS). Each attempt failed miserably, and the airline went back to TPF. Note that TPF is not MVS, OS/390, or any other more mainstream mainframe OS. It's purpose-built.
Unfortunately, this means that all of the other applications have to interface with TPF via screen scraping. To further compound the problem, no "suites" exist to handle the following functions, so most airlines have to "sew together" best of breed solutions for these basic functions:
- Crew Scheduling - F/As and pilots bid on slots to fly; this system takes those bids and turns them into a schedule.
- Aircraft Scheduling - Tracks which tail numbers are flying which flights for the dispatchers
- Optimization - Different optimizers to do things like:
  - Fuel Tankering - Use the jets as "tankers" so that you buy fuel where it's cheapest for flights later in the day
  - Crew Optimization - "Traveling Salesman" type solver to incur lowest labor cost, get crews back to home base, etc.
  - Schedule Optimization - Use the aircraft in the most cost-efficient way to cover all of your scheduled flights.
  - Maintenance Optimization - Pull aircraft in for scheduled maintenance at the optimum time.
- Reaccommodation - When things go wrong (weather, mechanicals, whatever), pull in all of the above variables to crank out a new schedule, crewing, mx schedule, etc.
- Booking Engines, for the internet and reservations agents
- Point of Sale and Boarding functions for agents, skycaps, and kiosks
- Interline functions where other airlines sell your tickets, and transfers for baggage, etc.
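To make one item from the list concrete, here is a back-of-the-envelope version of the fuel tankering decision in Python. It is a simplification, not any airline's optimizer: the function name, the prices, and the `carry_burn_fraction` parameter (the share of tankered fuel burned just hauling it) are all assumptions for illustration.

```python
def tankering_savings(fuel_needed_kg, price_origin, price_dest, carry_burn_fraction):
    """Money saved by loading the next leg's fuel at the origin
    instead of buying it at the destination.

    carry_burn_fraction is an assumed, illustrative parameter: the
    fraction of the extra fuel that is burned carrying it there.
    A negative result means tankering costs more than it saves.
    """
    if not 0 <= carry_burn_fraction < 1:
        raise ValueError("carry_burn_fraction must be in [0, 1)")
    # Load enough that fuel_needed_kg is still on board after the burn penalty.
    loaded_kg = fuel_needed_kg / (1 - carry_burn_fraction)
    cost_tankering = loaded_kg * price_origin
    cost_at_destination = fuel_needed_kg * price_dest
    return cost_at_destination - cost_tankering


if __name__ == "__main__":
    # Hypothetical prices per kg at two airports.
    print(tankering_savings(1000, price_origin=0.60, price_dest=0.80,
                            carry_burn_fraction=0.05))
```

Tankering only pays off when the price gap between the two airports outweighs the cost of the fuel burned carrying the extra weight.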
Anyhow, this list isn't comprehensive, but shows enough of the disparate pieces that you can imagine why these "glitches" happen. Very few of the items from the list above come from the same vendor, or even run on the same platforms.
Your statements are accurate.
I was a unix sys admin there, but left for greener pastures during the dot-com craze. The non-redundant hardware at the time ran AIX, and had a great support contract from IBM. The SBS application, however, always had monthly issues, at least at that airline. They were looking for a replacement then, and I'm not surprised they still haven't replaced it.
Take Amtrak!
Amtrak receives around $500 million for a total budget, while air travel receives around $15 billion in subsidies. Take the train and save everyone money!
_____________
Huh?
I sent a summary of these Slashdot comments to my cousin who works at American Airlines hq in Dallas. Here's his response!
---
"ugh... I worked 9pm-1am yesterday (xmas day). I spent the first two
hours of my shift calling people to tell them their flight was
cancelled and reschedule them. Most of them were taking flights out to
Miami and the Caribbean to spend New Years Eve partying on the beach.
Honestly, I had little pity telling them they were going to miss out on
one day of tanning especially since they seem to 'blame' the weather on
us.
"One hour into my shift our reference system went down. No IT people
were willing to come in and fix it. I had the system up for booking
flights and making reservations, but I could not look up any of our
rules and regulations. Ah well, enjoy your xmas off IT guys!! Enjoy
the weather in Cabo San Lucas!! Cheers!!
"Fortunately, we have a backup of all our html files saved as text
files. However each text file can only hold several hundred text
characters. So, when I want to look up our baggage policies the normal
html file is called BAG INFO. In the backup system BAG INFO is
separated into 10 or 20 text files and I have to 'page' through them by
typing BAG INFO P2, BAG INFO P3, BAG INFO P4. The text files are not
indexed and are not searchable. It took me 10 minutes to find and
advise someone how big a bag they can take to Puerto Rico.
"After I started taking incoming calls again, there were people calling
in on Christmas day to book their trips for Spring Break. There were
over 100 calls on hold to talk to us, and there were people sitting on
hold for half an hour to ask me how much it would cost to book a trip
to Fort Lauderdale in March. Couldn't that wait until the day after
Christmas?
"Yes, the airline industry does not prepare for emergencies as well as
it could for the holidays when people want to travel in record numbers.
However, I think the general public could try to have their own backup
plans in place as well and realize that the travel industry in general
does not have the equipment or the staff to handle everyone in the
country wanting to travel all at once in one week. Do people stock
their refrigerators year round with enough food to feed everyone in
their families at one meal like they do at Christmas?
"Even though we try to accommodate everyone as best as we can on the
holidays, we want to have a holiday just as badly as everyone
else. Working in the travel industry should not indenture us
to be your slaves over holidays. The public needs to have a little bit
of compassion and realize how much we give up in our own personal lives
just to help you get where you are going. Frankly, the way most people
treat me on the phones I don't think they deserve our help and
compassion. And don't call on Christmas day to book flights in March.
That phone call is making someone work on a day they shouldn't have to.
"anyways.... heh..... guess i had a bad night at work last night, huh
"MERRY XMAS!"
So let's think this one through for a second. The people who work there say the system that failed runs on AIX and that it's the application that's gone whoopsie. So they obviously must be lying, since everyone knows that the minute an application is ported to AIX all the bugs fall out of it.
Of course with this type of thinking there is no way that reputations are ever going to change since every computer error is attributed to Windows even if it has nothing to do with the issue.
I suspect that the HR advert is for a completely unrelated job.
I also would hazard a guess that the real problem at the place now is not the system anymore. The system is probably back up, but they are now having to deal with planes that are in the wrong places and crews that have no flying hours left because of decisions that were taken manually while the system was down.
Looking for an Information Security student project suggestion?
Try http://dotcrimeManifesto.com/