Slashdot Mirror


Comair System Crashes; Passengers Stranded

Broerman writes "30,000 people have had their flights cancelled by Comair this weekend thanks to a computer system shutdown. It appears that, due to weather and other problems, flights began to be cancelled on Thursday and the backlog choked the system. 1,100 flights have been cancelled so far, including all flights through 12/26. Does anyone know what platform their system was based on? What kind of system just totally crashes? The official statement is that 'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.' It seems highly improbable that a system would crash because it had too many reservations. The system should only be able to hold as many reservations as it has flights/seats. It would seem that it's more likely that the system was overloaded with use and that caused a meltdown. When you add in the problems experienced by US Airways, this hasn't been a Merry Christmas for many."

23 of 398 comments

  1. Happens all the time... by Anonymous Coward · · Score: 5, Interesting

    When I lived in Chicago, they would lose their radar system on what seemed like a strong wind. And I got stuck in Denver overnight once because the computer system they use to calculate the weight of departing flights crashed. I have a feeling these kinds of crashes are much more common than most people think.

    1. Re:Happens all the time... by Greyfox · · Score: 4, Interesting
      From looking at the various terminals that the airline people use, I suspect that most of those airline systems are held together with duct tape and library paste and no one really understands how the whole system works anymore. We see that a lot in non-IT industries (And a few IT ones, too.) Of course, the folks using the IBM ones are not ever supposed to go down...

      I moonlighted as an AS/400 operator for a cruise line for a while. We had the system go down once because the janitor turned off the air conditioner in the closet the AS/400 lived in. They didn't dedicate a more secure facility for the computer because the computer wasn't demonstrably central to how the company made money. Turns out they couldn't launch a ship without it. Oops. I suspect that mentality is also prevalent throughout the non-IT industries. They don't know how important their computers are to their business models until those computers die on them.

      --

      I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

    2. Re:Happens all the time... by HiThere · · Score: 2, Interesting

      There were many of them that did, however, crash. But the reason you don't hear about it much is that most of them weren't designed to be running all of the time, but only occasionally. If one crashed (and was a known good program) you'd just re-run it. Frequently that was your only choice, as you might not have anything but the binary. (Sloppy contracts often left consultants with the only copy of the source.)

      I did hear of one company that went out of business because their accounting system was written in a combination of those languages (plus a bit of assembler, and some binary patches). When it was done, they let the consultants go. A few years later the consultants didn't have a copy of the source anymore, and some tax law changes took effect. Oops! (That's not exactly a crash, but it wiped out the whole company.)

      OTOH, when I was writing fortran I had frequent crashes. I never got programs as solid as I later did with C. But they were "good enough". (Actually, a bit better than good enough. I was criticized for "gold plating" code that didn't need it.)

      A new degree of error frequency, however, entered with dynamic memory allocation. This allowed memory leaks that had previously been the province of the compiler (and assembly language subroutines). One must write very disciplined C code to avoid memory allocation problems unless you just don't do dynamic memory allocation. And as multi-tasking operating systems became common it also became more common to have interaction problems. Etc.

      But I can guarantee you that it's quite as possible to have those problems with PL/1 if you use a multi-tasking OS. And likewise if you use Java or Python, or similar language with constraints on pointer use you can avoid those problems. (This doesn't get rid of other problems. Thread synchronization problems are still problematical ... though you might check out Erlang or Inferno. I think they both claim to have general solutions. [The Erlang solution has been ported to Python under the name of Candygram, but I haven't checked it out yet.])

      But if you haven't heard of the older programs failing, it's because they are older, and the flakey ones have been retired or repaired.

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
  2. Scalability and Twelve Step TrustABLE IT by NZheretic · · Score: 2, Interesting

    Sounds like Comair could have used a little virtualized scalability and third party audited builds.
    See Twelve Step TrustABLE IT : VLSBs in VDNZs From TBAs.
    and also The ActiveGrid(TM) Grid Application Server and Grid Computing in general.

  3. Re:Someone's gotta say it... by jcr · · Score: 2, Interesting

    Well, judging by the IT jobs they're advertising on their web site, it looks like a combination Windows/Linux/UNIX shop.

    At any rate, I suspect they'll be looking for a new IT director Real Soon.

    -jcr

    --
    The only title of honor that a tyrant can grant is "Enemy of the State."
  4. This is getting a little too common for them. by jhobbs · · Score: 4, Interesting

    Back on May 1st of this year Delta's internal traffic monitoring system grounded them worldwide when it was hit by a worm (forget which one). Yours truly was flying that day. I spent 7 hours on a runway in Cleveland. (Talk about adding insult to injury.) Comair is a regional carrier of Delta's. I wonder who handles Delta's IT needs?

  5. Travel tip by Anonymous Coward · · Score: 1, Interesting

    FAA's Rule 240 says that if your flight gets canceled for any reason other than weather, the airline has to get you on the next available flight to your destination, regardless of carrier. So if you're stuck in an airport bar reading this article go talk to your airline!

  6. But management saved 13.7% by hiring H1-Visas by Soyobob · · Score: 2, Interesting

    Too bad the airline will go bust because of this. But then all airlines except Southwest are losing billions.

  7. I don't know about their internal system... by Glowing Fish · · Score: 2, Interesting

    As a preliminary finding that may or may not give us a clue as to what the internet system was running, Netcraft reports that www.comair.com is running Apache on HP-UX.
    So don't assume that the internal system was Windows just yet. Then again, don't assume that it wasn't.

    --
    Hopefully I didn't put any [] around my words.
  8. Not surprising, coming from Comair by Anonymous Coward · · Score: 5, Interesting

    Some of my co-workers are on contract developing Java software for Comair.

    Comair are very tied to particular systems, and don't want to change even when the developers have pointed out problems. Case in point: a J2EE-based employee portal, based on Novell exteNd (Novell Portal Service) and a one-way HPUX server. NPS runs in Tomcat, which is servicing requests (via mod_jk) through Apache. No other application shares the machine, and Comair will only consider vertical scaling, not horizontal.

    The application creates at least two threads per connection, and when the thread count goes beyond a relatively low threshold (between 300 and 400), Tomcat deadlocks. It's not because they're running out of space in the allocated JVM heap, and they've tuned mod_jk to allow for heavy load. The current solution is to restart Tomcat when the system locks up.

    Novell's support has been less than stellar, so the Java contracting group was informally asked what to do. We had all kinds of useful suggestions, from dumping NPS for another portal implementation, to creating custom thread-pools, to using JDK 1.4 new I/O and a minimally-threaded design, and even using round-robin DNS and a group of independent portal servers to share the load. Comair are wedded to particular minimal cost solutions, however, and it shows.

    At least when the portal crashes, it only impacts employees and not passengers.

  9. I'm surprised by antifoidulus · · Score: 2, Interesting

    that in the name of sensationalism reporters haven't said, "terrorism is probably not to blame but the Dept. of Homeland Security is looking into it." It seems that after Sep. 11th, the news wants to try to connect everything even remotely bad with terrorism, and of course the Dept. of Homeland Security encourages them by using as vague of language as possible. Are people that easily frightened?

  10. Re:whole story? by garcia · · Score: 4, Interesting

    Personally I think that Delta was being a bunch of assholes about the whole thing...

    Seeing that my 7pm flight was cancelled for the 23rd I spent 20 minutes redialing from two different phones until I got past a busy signal. After 50 minutes on hold I got through to a representative who scheduled me for the 24th's 7pm flight. I spent the rest of the time rearranging time off from work, the dog's time to be spent at the kennel, car rental stuff, and phone calls to my fiance who would meet me at the airport, and to family we were supposed to see.

    At 7am on the 24th the flight was already cancelled. At this point I didn't give a shit anymore. Delta was saying I would have to use my tickets by the 15th of January because "it wasn't their fault". I knew it wasn't the fucking weather down there as plenty of people were saying it was fine in the area. So I call again and get through after redialing for 65 minutes. I get through to a rep after 50 more minutes in queue. She tells me she can't do anything but schedule me for the 25th at 7pm so I'd have to get in queue for the reissue desk. Fine...

    After 2 hours and 11 minutes in queue (with no hold music or sound for that matter) someone calls on my home line at 5:15pm from Delta to tell me my 7pm flight is cancelled (cute, I would have been at the airport by then). I tell that rep to get me into the reissue queue as I've been on hold with them for 2 hours.

    I finally get through and tell them I want my money back. They tell me I need to speak to customer service. After waiting on hold (with the reissue rep) for 25 minutes the reissue rep offers to refund my money.

    We can't fly out for New Years as the kennel is booked and I'd feel horrible asking someone to watch our dog in our house for more than 1 night. So basically we have to wait quite some time to fly down there again.

    It was a little bit of a pain in the ass to wait on hold and be jerked around for two days for something that was their fault when they continually claimed wasn't. BAD WAY TO TRY AND PLEASE A CUSTOMER.

    Thanks for ruining our Christmas.

  11. Re:Fire away! by Deviate_X · · Score: 3, Interesting

    Interesting...

    Job postings might give some insight into what they are using: Comair, Inc. jobs.

  12. I'd like to know by HangingChad · · Score: 2, Interesting
    Not just the database platform and front end but who built it. This just has E-D-S stamped all over it. Everybody has a system go down once in a while, but it just seems like EDS has had more than their share.

    This is a worst case scenario for a system of that nature because of so many dependent calculations and calls to other systems. It takes more than just having a plane and a crew...which is a lot of work all by itself. It has to have a gate and connecting flights. Then multiply all that by 30,000 people, roughly 120 plane loads, and complicate it by some airports being closed. I bet you could actually watch the lights get dimmer in the server room. Still when you know the potential peak demand you have reserve capacity. Slow is okay, stop is unacceptable.

    --
    That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage
  13. Southwest refuses to drink the Kool-aid by Oswald · · Score: 4, Interesting
    This computer problem of Comair's just demonstrates how unworkable the hub-and-spoke system of flight scheduling is. It's a flawed concept, foisted on a naive public by an industry locked in some sort of mass psychosis. In the pursuit of minor economies of scale, the big airlines treat their passengers like packages (hey! it works for Fedex, and their cargo can't even walk itself to the next gate...), treat airport runways and air traffic controllers like unlimited resources, and waste vast amounts of jet fuel. The fact that Southwest Airlines (which does not use a hub-and-spoke scheduling system) is profitable, and the rest of our major airlines are either in, just out of, or about to go into, bankruptcy doesn't seem to dent their thick skulls.

    I have watched the operation at Atlanta for over 21 years, and I've seen how cutthroat the competition for a major hub is, but it feels like watching two dogs fight over two bones--you can't tell if they're fighting out of greed or stupidity. Southwest doesn't even fly into Atlanta--they know that only a pyrrhic victory would be possible under those circumstances. Management at the other airlines has been criminally incompetent ever since airline deregulation, but it's the passengers, employees and shareholders who pay the penalty time and again.

  14. Re:Crew assigment is a hard problem by coyote-san · · Score: 4, Interesting

    It's far harder than that alone since you also have to get the aircraft back to the right city (many are in the wrong city due to airport shutdowns due to the weather). Obviously you want to optimize the number of passengers carried along for those flights, but at the same time you'll be "burning" allowed worktime for the crew.

    Even worse the crew and aircraft are independent variables. Obviously you need a crew to operate a flight, but the crew may end up in the "wrong" city for the usual schedule. It may be better to leave a plane on the ground and fly its crew "deadhead" to the "right" city than to have them fly a load of passengers to the "wrong" city.

    There are reasonably efficient algorithms to solve these problems, but we spent most of my second-semester graduate-level algorithms class studying them (network flows). The algorithms most developers would come up with (including me after a decade of experience and a graduate-level algorithms class) are extremely inefficient and scale horribly.
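    As a rough illustration of the network-flow idea, here is a toy sketch. It uses the textbook Edmonds-Karp max-flow algorithm, not anything known about Comair's system, and it ignores duty-time limits, aircraft positions, and everything else that makes the real problem hard. The crew names, flight names, airport codes, and the helper `crewed_flights` are all made up for the example. Each crew can staff one flight departing from the city it is currently in, and the maximum flow is the number of flights that can be crewed.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly push flow along shortest augmenting paths."""
    # Residual capacities, including zero-capacity reverse edges.
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual[u][v] += cap
            residual[v][u] += 0
    total = 0
    while True:
        # Breadth-first search for a path that still has spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Find the bottleneck on the path, then update the residual graph.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        total += bottleneck


def crewed_flights(crew_city, flight_origin):
    """Hypothetical helper: how many flights can get a crew right now?"""
    capacity = defaultdict(dict)
    for crew, city in crew_city.items():
        capacity["source"][crew] = 1  # each crew works at most one flight
        for flight, origin in flight_origin.items():
            if origin == city:
                capacity[crew][flight] = 1  # crew is where the flight departs
    for flight in flight_origin:
        capacity[flight]["sink"] = 1  # each flight needs exactly one crew
    return max_flow(capacity, "source", "sink")


flights = {"F1": "CVG", "F2": "ATL", "F3": "ATL"}
stranded = crewed_flights({"crew_A": "CVG", "crew_B": "CVG", "crew_C": "ATL"}, flights)
after_deadhead = crewed_flights({"crew_A": "CVG", "crew_B": "ATL", "crew_C": "ATL"}, flights)
```

    With two crews stuck in one city only two of the three flights can be staffed; moving one crew to the other city first lets all three go, which is the deadheading trade-off described above in miniature.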

    The bottom line is that it's easy to imagine a system that has no problem with perturbations from the regular schedule but is totally overwhelmed when starting from scratch. I hope the bean counter who saved the company a few bucks by insisting on far more modest hardware gets canned for his costly lack of foresight, but we all know that IT will catch the heat.

    --
    For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
  15. Re:System Tracked Crew Location, Not Reservations by Pharmboy · · Score: 2, Interesting

    I think you are overthinking it. My point is simply that a company that can not be trusted to keep their computers fully functional, can not be trusted to keep their aircraft fully functional. This is based on the premise that it is easier to keep the computers running than the aircraft, which I can easily assume, based upon my own experience.

    I also don't eat at diners where the help isn't properly groomed. Same principle: if you can't take care of simple stuff, you probably can't take care of something more important and/or complex.

    --
    Tequila: It's not just for breakfast anymore!
  16. Re:Southwest refuses to drink the Kool-aid by HR · · Score: 3, Interesting

    The problem with your analysis is that point-to-point flying doesn't work when you start talking about international travel. It's just not possible to fly passengers to, say, Germany or Japan from every domestic airport. The way you do it is to accumulate passengers at a major hub on the coast and then fly from there.

  17. Car dealer by Anonymous Coward · · Score: 1, Interesting

    I worked on a car dealers' wide area network for a short time. Their entire network, all connections to other dealerships, internet connectivity, not to mention their Novell network, dealership inventory, parts, and tie-in to the manufacturer(s) was tied to a single router. They had problems, and I finally drove out there, and found the router "installed" in the drop ceiling above the mechanics' bathroom. The opposite side of that wall was the backer board for the telephone lines, located in a broom closet. I pulled the router down, and the inside had green mildew on the board. Routinely, the housekeeping service would unplug the 25 foot ORANGE extension cord plugged into the single-socket bathroom outlet! I advised the general manager about these problems, told them that they'd best extend their demarc, move the router to a better location, but they never bothered to fix it.

  18. Er, I think I found the problem...they pay squat! by SharpNose · · Score: 2, Interesting

    From Yahoo Jobs:

    Software Engineer Cincinnati, OH $40K -$50K

  19. A better snow job. They need it. by twitter · · Score: 1, Interesting
    I am only trying to make sense out of the above comment from the official statement above.

    My wife says things just snowballed.

    Crew assignment is a hard problem...

    Records keeping, very tricky. You would not want to try that with any old database, no sir, it might pop a window. Just thinking about how every other airline has managed this tricky problem since before computers makes my head hurt.

    We may never know what really happened but this would be a nice example for my classes :-)

    Yeah, it's a real class act for those 30,000 people sitting around in airports for Christmas, employees doing the same and those who have to recover from this disaster. Management is going to be happy about the publicity they just earned while their huge capital investment in AIRPLANES sits idle during a time of year that's supposed to be their most profitable because their far too expensive M$ "solution" "melted". A chain is only as strong as its weakest link. Employees, I'm sure, are also stranded for Christmas. For the New Year they get to ponder layoffs. What a happy company for you to dissect at your leisure next semester. Season's Best!

    Here's what I'll bet you might learn: WHEN SOMETHING MELTS, YOU LOSE YOUR ASS IF YOU DEPEND ON IT. MICROSOFT MELTS AND HAS POOR OR NO FAIL OVER CAPABILITY, SO YOU BETTER NOT DEPEND ON IT.

    --

    Friends don't help friends install M$ junk.

  20. Re:Fire away! by pVoid · · Score: 2, Interesting
    I don't think they keep a SQL transaction running for as long as the flight hasn't taken off.

    SQL transactions generally last seconds and involve operations like "open tr, is there space in this flight?, reserve space, close tr". Not "open tr, wait for flight to fill up, close tr". Rescheduling or canceling flights probably isn't accomplished using transactions: it's application level logic.
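    A short sketch of that "open tr, is there space?, reserve space, close tr" pattern, using Python's built-in `sqlite3` module. The `flights` and `reservations` tables and the `reserve_seat` function are invented for illustration and say nothing about what Comair actually ran. The space check and the decrement are folded into one conditional UPDATE so the whole transaction is a couple of statements long.

```python
import sqlite3


def reserve_seat(conn, flight_id, passenger):
    """Reserve one seat in a single short transaction; True on success."""
    with conn:  # commits on normal exit, rolls back if an exception escapes
        # "Is there space?" and "reserve space" in one statement:
        # the row is only updated while seats remain.
        cur = conn.execute(
            "UPDATE flights SET seats_left = seats_left - 1 "
            "WHERE id = ? AND seats_left > 0",
            (flight_id,),
        )
        if cur.rowcount != 1:
            return False  # flight is full or does not exist
        conn.execute(
            "INSERT INTO reservations (flight_id, passenger) VALUES (?, ?)",
            (flight_id, passenger),
        )
        return True


# Hypothetical schema and data: one flight with two seats left.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (id TEXT PRIMARY KEY, seats_left INTEGER)")
conn.execute("CREATE TABLE reservations (flight_id TEXT, passenger TEXT)")
conn.execute("INSERT INTO flights VALUES ('F100', 2)")
conn.commit()

outcomes = [reserve_seat(conn, "F100", name) for name in ("ann", "bob", "cy")]
```

    Nothing here stays open while the flight fills up: each booking is its own brief transaction, and the third attempt simply fails once the seats are gone.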

    My personal diagnosis: I think it has nothing to do with the backlog, and that the system just melted under high strain (of millions of people trying to book other flights). Either that, or they ran out of disk space.

  21. response from an AA employee by dan_bethe · · Score: 2, Interesting

    I sent a summary of these Slashdot comments to my cousin who works at American Airlines hq in Dallas. Here's his response!

    ---

    "ugh... I worked 9pm-1am yesterday (xmas day). I spent the first two
    hours of my shift calling people to tell them their flight was
    cancelled and reschedule them. Most of them were taking flights out to
    Miami and the Caribbean to spend New Years Eve partying on the beach.
    Honestly, I had little pity telling them they were going to miss out on
    one day of tanning especially since they seem to 'blame' the weather on
    us.

    "One hour into my shift our reference system went down. No IT people
    were willing to come in and fix it. I had the system up for booking
    flights and making reservations, but I could not look up any of our
    rules and regulations. Ah well, enjoy your xmas off IT guys!! Enjoy
    the weather in Cabo San Lucas!! Cheers!!

    "Fortunately, we have a backup of all our html files saved as text
    files. However each text file can only hold several hundred text
    characters. So, when I want to look up our baggage policies the normal
    html file is called BAG INFO. In the backup system BAG INFO is
    separated into 10 or 20 text files and I have to 'page' through them by
    typing BAG INFO P2, BAG INFO P3, BAG INFO P4. The text files are not
    indexed and are not searchable. It took me 10 minutes to find and
    advise someone how big a bag they can take to Puerto Rico.

    "After I started taking incoming calls again, there were people calling
    in on Christmas day to book their trips for Spring Break. There were
    over 100 calls on hold to talk to us, and there were people sitting on
    hold for half an hour to ask me how much it would cost to book a trip
    to Fort Lauderdale in March. Couldn't that wait until the day after
    Christmas?

    "Yes, the airline industry does not prepare for emergencies as well as
    it could for the holidays when people want to travel in record numbers.
    However, I think the general public could try to have their own backup
    plans in place as well and realize that the travel industry in general
    does not have the equipment or the staff to handle everyone in the
    country wanting to travel all at once in one week. Do people stock
    their refrigerators year round with enough food to feed everyone in
    their families at one meal like they do at Christmas?

    "Even though we try to accommodate everyone as best as we can on the
    holidays, we want to have a holiday just as badly as everyone
    else. Working in the travel industry should not indenture us
    to be your slaves over holidays. The public needs to have a little bit
    of compassion and realize how much we give up in our own personal lives
    just to help you get where you are going. Frankly, the way most people
    treat me on the phones I don't think they deserve our help and
    compassion. And don't call on Christmas day to book flights in March.
    That phone call is making someone work on a day they shouldn't have to.

    "anyways.... heh..... guess i had a bad night at work last night, huh

    "MERRY XMAS!"