Comair System Crashes; Passengers Stranded
Broerman writes "30,000 people have had their flights cancelled by Comair this weekend thanks to a computer system shutdown. It appears that, due to weather and other problems, flights began to be cancelled on Thursday and the backlog choked the system. 1,100 flights have been cancelled so far, including all flights through 12/26. Does anyone know what platform their system was based on? What kind of system just totally crashes? The official statement is that 'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.' It seems highly improbable that a system would crash because it had too many reservations. The system should only be able to hold as many reservations as it has flights/seats. It seems more likely that the system was overloaded with use and that caused a meltdown. When you add in the problems experienced by US Airways, this hasn't been a Merry Christmas for many."
When I lived in Chicago, they would lose their radar system on what seemed like a strong wind. And I got stuck in Denver overnight once because the computer system they use to calculate the weight of departing flights crashed. I have a feeling these kinds of crashes are much more common than most people think.
Sounds like Comair could have used a little virtualized scalability and third party audited builds.
See Twelve Step TrustABLE IT : VLSBs in VDNZs From TBAs.
and also The ActiveGrid(TM) Grid Application Server and Grid Computing in general.
Well, judging by the IT jobs they're advertising on their web site, it looks like a combination Windows/Linux/UNIX shop.
At any rate, I suspect they'll be looking for a new IT director Real Soon.
-jcr
The only title of honor that a tyrant can grant is "Enemy of the State."
Back on May 1st of this year Delta's internal traffic monitoring system grounded them worldwide when it was hit by a worm (I forget which one). Yours truly was flying that day. I spent 7 hours on a runway in Cleveland. (Talk about adding insult to injury.) Comair is a regional carrier of Delta's. I wonder who handles Delta's IT needs?
FAA's Rule 240 says that if your flight gets canceled for any reason other than weather, the airline has to get you on the next available flight to your destination, regardless of carrier. So if you're stuck in an airport bar reading this article go talk to your airline!
Too bad the airline will go bust because of this. But then all airlines are losing billions except for Southwest.
As a preliminary finding that may or may not give us a clue as to what the internet system was running, Netcraft reports that www.comair.com is running Apache on HP-UX.
So don't assume that the internal system was Windows just yet. Then again, don't assume that it wasn't.
Hopefully I didn't put any [] around my words.
Some of my co-workers are on contract developing Java software for Comair.
Comair are very tied to particular systems, and don't want to change even when the developers have pointed out problems. Case in point: a J2EE-based employee portal, based on Novell exteNd (Novell Portal Service) and a one-way HPUX server. NPS runs in Tomcat, which is servicing requests (via mod_jk) through Apache. No other application shares the machine, and Comair will only consider vertical scaling, not horizontal.
The application creates at least two threads per connection, and when the thread count goes beyond a relatively low threshold (between 300 and 400), Tomcat deadlocks. It's not because they're running out of space in the allocated JVM heap, and they've tuned mod_jk to allow for heavy load. The current solution is to restart Tomcat when the system locks up.
Novell's support has been less than stellar, so the Java contracting group was informally asked what to do. We had all kinds of useful suggestions, from dumping NPS for another portal implementation, to creating custom thread-pools, to using JDK 1.4 new I/O and a minimally-threaded design, and even using round-robin DNS and a group of independent portal servers to share the load. Comair are wedded to particular minimal cost solutions, however, and it shows.
At least when the portal crashes, it only impacts employees and not passengers.
I'm surprised that, in the name of sensationalism, reporters haven't said, "terrorism is probably not to blame but the Dept. of Homeland Security is looking into it." It seems that after Sep. 11th, the news wants to try to connect everything even remotely bad with terrorism, and of course the Dept. of Homeland Security encourages them by using language that is as vague as possible. Are people that easily frightened?
Monstar L
Personally I think that Delta was being a bunch of assholes about the whole thing...
Seeing that my 7pm flight was cancelled for the 23rd I spent 20 minutes redialing from two different phones until I got past a busy signal. After 50 minutes on hold I got through to a representative who scheduled me for the 24th's 7pm flight. I spent the rest of the time rearranging time off from work, the dog's time to be spent at the kennel, car rental stuff, and phone calls to my fiance who would meet me at the airport, and to family we were supposed to see.
At 7am on the 24th the flight was already cancelled. At this point I didn't give a shit anymore. Delta was saying I would have to use my tickets by the 15th of January because "it wasn't their fault". I knew it wasn't the fucking weather down there as plenty of people were saying it was fine in the area. So I call again and get through after redialing for 65 minutes. I get through to a rep after 50 more minutes in queue. She tells me she can't do anything but schedule me for the 25th at 7pm so I'd have to get in queue for the reissue desk. Fine...
After 2 hours and 11 minutes in queue (with no hold music or sound for that matter) someone calls on my home line at 5:15pm from Delta to tell me my 7pm flight is cancelled (cute, I would have been at the airport by then). I tell that rep to get me into the reissue queue as I've been on hold with them for 2 hours.
I finally get through and tell them I want my money back. They tell me I need to speak to customer service. After waiting on hold (with the reissue rep) for 25 minutes the reissue rep offers to refund my money.
We can't fly out for New Years as the kennel is booked and I'd feel horrible asking someone to watch our dog in our house for more than 1 night. So basically we have to wait quite some time to fly down there again.
It was a little bit of a pain in the ass to wait on hold and be jerked around for two days for something that was their fault when they continually claimed it wasn't. BAD WAY TO TRY AND PLEASE A CUSTOMER.
Thanks for ruining our Christmas.
Interesting...
Job postings might give some insight into what they are using: Comair, Inc. jobs.
This is a worst case scenario for a system of that nature because of so many dependent calculations and calls to other systems. It takes more than just having a plane and a crew...which is a lot of work all by itself. It has to have a gate and connecting flights. Then multiply all that by 30,000 people, roughly 120 plane loads, and complicate it by some airports being closed. I bet you could actually watch the lights get dimmer in the server room. Still when you know the potential peak demand you have reserve capacity. Slow is okay, stop is unacceptable.
That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage
I have watched the operation at Atlanta for over 21 years, and I've seen how cutthroat the competition for a major hub is, but it feels like watching two dogs fight over two bones--you can't tell if they're fighting out of greed or stupidity. Southwest doesn't even fly into Atlanta--they know that only a pyrrhic victory would be possible under those circumstances. Management at the other airlines has been criminally incompetent ever since airline deregulation, but it's the passengers, employees and shareholders who pay the penalty time and again.
It's far harder than that alone since you also have to get the aircraft back to the right city (many are in the wrong city due to airport shutdowns due to the weather). Obviously you want to optimize the number of passengers carried along for those flights, but at the same time you'll be "burning" allowed worktime for the crew.
Even worse the crew and aircraft are independent variables. Obviously you need a crew to operate a flight, but the crew may end up in the "wrong" city for the usual schedule. It may be better to leave a plane on the ground and fly its crew "deadhead" to the "right" city than to have them fly a load of passengers to the "wrong" city.
There are reasonably efficient algorithms to solve these problems, but we spent most of my second-semester graduate-level algorithms class studying them (network flows). The algorithms most developers would come up with (including me, after a decade of experience and a graduate-level algorithms class) are extremely inefficient and scale horribly.
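To give a flavor of the network-flow approach, here is a toy sketch: plain maximum flow (Edmonds-Karp) used to match crews to flights they are eligible to operate. The crew names, flight names, and eligibility lists are invented, and a real crew scheduler also has to handle duty-time limits, deadheading, aircraft positions, and costs, which this deliberately ignores.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: push flow along shortest augmenting paths found by BFS."""
    # residual[u][v] is the remaining capacity on edge u -> v.
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual[u][v] += cap
            residual[v][u] += 0  # ensure the reverse edge exists
    total = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual
        # Find the bottleneck along the path, then update residual capacities.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical data: which crew is legal and in position for which flight.
eligible = {"crewA": ["F1", "F2"], "crewB": ["F1"], "crewC": ["F2", "F3"]}

capacity = {"source": {crew: 1 for crew in eligible}}
for crew, flights in eligible.items():
    capacity[crew] = {flight: 1 for flight in flights}
    for flight in flights:
        capacity[flight] = {"sink": 1}  # each flight needs exactly one crew

covered, residual = max_flow(capacity, "source", "sink")

# A crew -> flight edge carries flow when its reverse residual edge is positive.
assignment = {
    crew: flight
    for crew, flights in eligible.items()
    for flight in flights
    if residual[flight][crew] > 0
}
print(covered, assignment)
```

Note that a naive first-come assignment could hand F1 to crewA and leave crewB idle; the augmenting paths are what undo that kind of early bad choice, so all three flights end up covered.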
The bottom line is that it's easy to imagine a system that has no problem with perturbations from the regular schedule but is totally overwhelmed when starting from scratch. I hope the bean counter who saved the company a few bucks by insisting on far more modest hardware gets canned for his costly lack of foresight, but we all know that IT will catch the heat.
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
I think you are overthinking it. My point is simply that a company that can not be trusted to keep their computers fully functional, can not be trusted to keep their aircraft fully functional. This is based on the premise that it is easier to keep the computers running than the aircraft, which I can easily assume, based upon my own experience.
I also don't eat at diners where the help isn't properly groomed. Same principle: if you can't take care of simple stuff, you probably can't take care of something more important and/or complex.
Tequila: It's not just for breakfast anymore!
The problem with your analysis is that point-to-point flying doesn't work when you start talking about international travel. It's just not possible to fly passengers to, say, Germany or Japan from every domestic airport. The way you do it is to accumulate passengers at a major hub on the coast and then fly from there.
I worked on a car dealers' wide area network for a short time. Their entire network, all connections to other dealerships, internet connectivity, not to mention their Novell network, dealership inventory, parts, and tie-in to the manufacturer(s) was tied to a single router. They had problems, and I finally drove out there, and found the router "installed" in the drop ceiling above the mechanics' bathroom. The opposite side of that wall was the backer board for the telephone lines, located in a broom closet. I pulled the router down, and the inside had green mildew on the board. Routinely, the housekeeping service would unplug the 25 foot ORANGE extension cord plugged into the single-socket bathroom outlet! I advised the general manager about these problems, told them that they'd best extend their demarc, move the router to a better location, but they never bothered to fix it.
From Yahoo Jobs:
Software Engineer Cincinnati, OH $40K -$50K
My wife says things just snowballed.
Crew assignment is a hard problem...
Records keeping, very tricky. You would not want to try that with any old database, no sir, it might pop a window. Just thinking about how every other airline has managed this tricky problem since before computers makes my head hurt.
We may never know what really happened but this would be a nice example for my classes :-)
Yeah, it's a real class act for those 30,000 people sitting around in airports for Christmas, employees doing the same and those who have to recover from this disaster. Management is going to be happy about the publicity they just earned while their huge capital investment in AIRPLANES sits idle during a time of year that's supposed to be their most profitable because their far too expensive M$ "solution" "melted". A chain is only as strong as its weakest link. Employees, I'm sure, are also stranded for Christmas. For the New Year they get to ponder layoffs. What a happy company for you to dissect at your leisure next semester. Season's Best!
Here's what I'll bet you might learn: WHEN SOMETHING MELTS, YOU LOSE YOUR ASS IF YOU DEPEND ON IT. MICROSOFT MELTS AND HAS POOR OR NO FAIL OVER CAPABILITY, SO YOU BETTER NOT DEPEND ON IT.
Friends don't help friends install M$ junk.
SQL transactions generally last seconds and involve operations like "open tr, is there space in this flight?, reserve space, close tr". Not "open tr, wait for flight to fill up, close tr". Rescheduling or canceling flights probably isn't accomplished using transactions: it's application level logic.
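Here is roughly what that kind of short reservation transaction looks like, as a sketch using Python's built-in sqlite3 module. The `flights` table, the flight number FLT100, and the `reserve_seat` function are invented for illustration; I have no idea what Comair's schema or database actually is.

```python
import sqlite3

# In-memory database with a hypothetical one-table schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (flight_no TEXT PRIMARY KEY, seats_left INTEGER NOT NULL)"
)
conn.execute("INSERT INTO flights VALUES ('FLT100', 2)")
conn.commit()


def reserve_seat(conn, flight_no):
    """Check for space and reserve it in one short transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE flights SET seats_left = seats_left - 1 "
            "WHERE flight_no = ? AND seats_left > 0",
            (flight_no,),
        )
        # rowcount is 1 if a seat was taken, 0 if the flight was already full.
        return cur.rowcount == 1


# Two seats available, three attempts: the third is refused.
results = [reserve_seat(conn, "FLT100") for _ in range(3)]
seats_left = conn.execute(
    "SELECT seats_left FROM flights WHERE flight_no = 'FLT100'"
).fetchone()[0]
print(results, seats_left)
```

The check and the update happen in a single statement, so the transaction is open for a moment rather than for as long as the flight takes to fill up.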
My personal diagnosis: I think it has nothing to do with the backlog, and that the system just melted under high strain (of millions of people trying to book other flights). Either that, or they ran out of disk space.
I sent a summary of these Slashdot comments to my cousin who works at American Airlines hq in Dallas. Here's his response!
---
"ugh... I worked 9pm-1am yesterday (xmas day). I spent the first two
hours of my shift calling people to tell them their flight was
cancelled and reschedule them. Most of them were taking flights out to
Miami and the Caribbean to spend New Years Eve partying on the beach.
Honestly, I had little pity telling them they were going to miss out on
one day of tanning especially since they seem to 'blame' the weather on
us.
"One hour into my shift our reference system went down. No IT people
were willing to come in and fix it. I had the system up for booking
flights and making reservations, but I could not look up any of our
rules and regulations. Ah well, enjoy your xmas off IT guys!! Enjoy
the weather in Cabo San Lucas!! Cheers!!
"Fortunately, we have a backup of all our html files saved as text
files. However each text file can only hold several hundred text
characters. So, when I want to look up our baggage policies the normal
html file is called BAG INFO. In the backup system BAG INFO is
separated into 10 or 20 text files and I have to 'page' through them by
typing BAG INFO P2, BAG INFO P3, BAG INFO P4. The text files are not
indexed and are not searchable. It took me 10 minutes to find and
advise someone how big a bag they can take to Puerto Rico.
"After I started taking incoming calls again, there were people calling
in on Christmas day to book their trips for Spring Break. There were
over 100 calls on hold to talk to us, and there were people sitting on
hold for half an hour to ask me how much it would cost to book a trip
to Fort Lauderdale in March. Couldn't that wait until the day after
Christmas?
"Yes, the airline industry does not prepare for emergencies as well as
it could for the holidays when people want to travel in record numbers.
However, I think the general public could try to have their own backup
plans in place as well and realize that the travel industry in general
does not have the equipment or the staff to handle everyone in the
country wanting to travel all at once in one week. Do people stock
their refrigerators year round with enough food to feed everyone in
their families at one meal like they do at Christmas?
"Even though we try to accommodate everyone as best as we can on the
holidays, we want to have a holiday just as badly as everyone
else. Working in the travel industry should not indenture us
to be your slaves over holidays. The public needs to have a little bit
of compassion and realize how much we give up in our own personal lives
just to help you get where you are going. Frankly, the way most people
treat me on the phones I don't think they deserve our help and
compassion. And don't call on Christmas day to book flights in March.
That phone call is making someone work on a day they shouldn't have to.
"anyways.... heh..... guess i had a bad night at work last night, huh
"MERRY XMAS!"