Slashdot Mirror


Comair System Crashes; Passengers Stranded

Broerman writes "30,000 people have had their flights cancelled by Comair this weekend thanks to a computer system shutdown. It appears that, due to weather and other problems, flights began to be cancelled on Thursday and the backlog choked the system. 1,100 flights have been cancelled so far, including all flights through 12/26. Does anyone know what platform their system was based on? What kind of system just totally crashes? The official statement is that 'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.' It seems highly improbable that a system would crash because it had too many reservations. The system should only be able to hold as many reservations as it has flights/seats. It would seem more likely that the system was overloaded with use and that caused a meltdown. When you add in the problems experienced by US Airways, this hasn't been a Merry Christmas for many."

25 of 398 comments

  1. Official my arse... by Omicron32 · · Score: 4, Insightful

    Sounds like my Mother wrote the official statement. A techy would never report something in that way.

    Besides, it's pretty obvious their OS wasn't digitally signed. :p

  2. Someone's gotta say it... by mOoZik · · Score: 3, Insightful

    Yep, it was Windows XP. ;)

    I don't know. Frankly, it has less to do with the platform than the custom software that runs on it.

    1. Re:Someone's gotta say it... by Pharmboy · · Score: 3, Insightful

      You would think so. The IT Director is responsible for making sure everything IT works. Not to do it himself, but to make sure it is done and done right. I can't see how someone can argue with that. Even if it IS the janitor unplugging the UPS to plug in a floor buffer.

      Whether it is the cooling system for the computers, the operating system, the applications or simple hardware issues, it HAS to be the IT Director's responsibility. I mean, who the hell else?

      --
      Tequila: It's not just for breakfast anymore!
    2. Re:Someone's gotta say it... by Antique+Geekmeister · · Score: 4, Insightful

      Occasionally, however, the head IT guy gets over-ridden by management or by available finances. I've been there, saying "we need to spend money on this" and having to make do with much less money, or even with a cut in funding. You need to document the problem in advance to cover your ass, and get it in print and saved offsite to protect yourself from that kind of mistake. I've done that, too. It helped protect me from a nasty lawsuit because I demonstrated where I had told a consulting client, in print, when the systems would start failing and the resulting legal liabilities, and gotten it signed by the company notary.

  3. It doesn't matter... by Anonymous Coward · · Score: 2, Insightful

    They're a bunch of incompetent boobs. The news keeps reporting on a "computer glitch" or a "computer malfunction". That's bullshit. This happened because some human(s) fucked up.

  4. Re:Fire away! by mirko · · Score: 5, Insightful

    There recently was a big card problem here, in Europe.
    It did not come from a particular OS but just because a partition got filled by index tablespace extents.
    So, it could just be that they ran out of space and it froze the whole application.

    --
    Trolling using another account since 2005.
  5. stating the obvious by Anonymous Coward · · Score: 5, Insightful

    "Does anyone know what platform their system was based on? What kind of system just totally crashes?"

    A stab in the dark here but I'm assuming a system without foresight and redundancy?

  6. blaming the system can backfire by ext42fs · · Score: 5, Insightful

    It's not the OS, it's the people behind it who are to blame. Yes, stupidity and MSW often go together, but in a few years one will probably occasionally see a massive Linux outage due to... similarly stupid people.

  7. Failure due to inability/unwillingness to test/QA by Anonymous Coward · · Score: 1, Insightful

    It is not easy to do real world extreme situation testing on large systems, but I wish people would at least try.

    It is fun to say Windows, blah, blah but given the number of buffer overflow problems found in programs/packages on all platforms, I would say that many programmers of every stripe severely underestimate the real world range/type/size of data their programs will encounter when in non-typical situations.

    To whoever wrote/maintains/admins this software: Global "climate change" means weather "events" will be more frequent and more extreme in coming years, and another terrorist event on US soil may cause days of air travel disruption. Please "refactor" your shit with those things in mind. You're on the East Coast and Midwest, for god's sake; you're going to get storms that will shut down regions for days at a time. What happens when the FAA finds some issue with an aircraft part or maint. procedure and grounds your whole damn fleet to have it fixed?

  8. Re:Scalability and Twelve Step TrustABLE IT by hughk · · Score: 4, Insightful
    No, it's more difficult in the airline industry. The system by default tries to keep as many planes in the air earning money as possible. If you have an outage which disrupts this choreography, there is a tremendous knock-on effect as passengers/urgent cargo must be rebooked.

    I have seen the major hub for an airline closed because of snow for just a couple of hours in the early morning, but the resulting chaos of rescheduling/rebooking caused the reservations system to crash after just a few minutes of uptime. The same would keep happening after restarts.

    It is normal to test a system up to several times normal load, but they were seeing peaks at over 100x. The old, 3270 emulator based system would have slowly got through it but the newer system died.

    --
    See my journal, I write things there
  9. Re:I'm surprised by HangingChad · · Score: 3, Insightful
    It seems that after Sep. 11th, the news wants to try to connect everything even remotely bad with terrorism

    What else do they have to do? They've got this huge-ass budget and all those people watching a lot of honest citizens. It was 10 years between the first attempt on the World Trade Center and the second. We've built and paid for this entire monster agency for an event that might be 10 or 15 years away. What are they going to do in the meantime? Grope women at the airport. They have to do something to justify their existence; otherwise we'd have to admit we over-reacted to 9-11.

    --
    That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage
  10. Re:No manual process? by aggles · · Score: 2, Insightful

    Hopefully someone from Comair reads /. and will not be able to resist spilling the beans. This sounds like a lawsuit in the making. It was not weather related - it was someone trying to save a buck, either by writing crappy software or by having poor operational procedures. This is a Sarbanes-Oxley event - and hopefully, the truth will come out about what happened, and why the backup procedures were either not in place or did not work. I don't want to see them go bankrupt, but they should be held accountable.

  11. Re:I'm surprised by Quixote · · Score: 1, Insightful
    Are people that easily frightened?

    As Nov 2 showed, yes they are.

  12. Re:Happens all the time... by budgenator · · Score: 2, Insightful

    Not too hard to imagine. I see a system that's a combination of Fortran 66, COBOL, and C all sort of working together over the years. All parts have had numerous patches and changes applied over the years until no one understands it anymore, with each iteration making the system more fragile. Now they are lucky if they have the source code for the current build.
    Each time the industry is making money and IT is flush, a project is started to examine all the code in the system and refactor and rewrite it to modern standards, and each time the project gets just past the planning phase the economy takes a dump and the team gets laid off.
    Now that the problem has had an economic impact on the company, the PHB is going to send it off to India, to some kid 6 months out of college who is going to have to google the meaning of a GOTO statement used in the million lines of code that are older than he is.

    --
    Apocalypse Cancelled, Sorry, No Ticket Refunds
  13. Re:Southwest refuses to drink the Kool-aid by PPGMD · · Score: 3, Insightful
    It isn't that easy. For the longest time Southwest was the hardest airline to book a flight on because they had no web system that could figure out its route system (only 5 years later did they release one). Up until about July of this year, to book a flight on the web you needed a route map and schedule to figure out what cities you had to go through if there was no direct flight option.

    The hub-spoke system is easier to manage, and can be profitable if the airlines realize that they aren't unlimited resources, and decentralize the hubs on a limited basis.

    Anyways, Southwest doesn't drink anyone's koolaid. They run all their own in-house-designed systems (I am not sure they are even on Sabre anymore), including web apps. It's an interesting concept, but it probably causes their IT managers to pull their hair out.

  14. Re:System Tracked Crew Location, Not Reservations by logicnazi · · Score: 2, Insightful

    Do you also refuse to eat at a relative's house if their computer is virus laden or crash prone? After all, if they can't be trusted to keep their computer working, why should you trust them to make safe, sanitary food?

    Perhaps if computer usage/programming had evolved to the level of personal hygiene, namely that routine effort anyone could do would prevent computer crashes, your point would be convincing. However, in practice we realize even the best professional programmers make errors, even buffer overflows (and we don't even really know it's an 'error'; perhaps the program exited gracefully after realizing the demands exceeded its capacity and simply hadn't been programmed to handle a situation of this size). So unlike your hygiene example, this hardly impeaches the basic organizational discipline/competency.

    Had this really been a computer engaged in flight critical tasks I would feel quite differently. Programming error or even an unanticipated shutdown is not acceptable in systems necessary for real-time flight control. Since this was instead a system to reassign crew and guarantee compliance with federal labour law, I feel much differently. In fact, if this system had been subject to a rigorous source code review by an outside team to check for bugs, or linked into some sort of failover system with differently programmed systems accomplishing the same task, I would worry that their priorities are being misplaced.

    Arguably an airline, given their limited budgets, which puts too much redundancy into their non-critical systems has an incorrect set of priorities.

    --

    If you liked this thought maybe you would find my blog nice too:

  15. Re:Happens all the time... by phil+reed · · Score: 2, Insightful

    Of course, the folks using the IBM ones are not ever supposed to go down...
    There's a difference between the machine crashing and the application crashing.

    --

    ...phil
    "For a list of the ways which technology has failed to improve our quality of life, press 3."
  16. Re:Fire away! by theonetruekeebler · · Score: 2, Insightful
    Based on those postings, I'm guessing the application is based on either Oracle or Sybase on HP-UX.

    My preliminary diagnosis: blown rollback segment. With too many flights being cancelled, the simultaneous rescheduling of all those crew resulted in a SQL transaction that exceeded the size of what the DBMS could undo. So an uncommitted statement failed and the application code either was not prepared for such a possibility or could only handle it by timing out. Scheduling tasks could no longer move forward, and right now some poor DBA is hoping to Christ that he printed out that e-mail he wrote asking for more disk space...
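
One common way to keep a mass update from outgrowing the database's undo space is to commit in bounded batches instead of one giant transaction. A minimal sketch of that idea follows, using Python's built-in sqlite3 purely as a stand-in; nothing is known here about Comair's actual DBMS or schema, and the `crew` table, `reassign_crew` function, and `batch_size` value are all hypothetical.

```python
import sqlite3


def reassign_crew(conn, assignments, batch_size=500):
    """Apply (crew_id, flight_id) updates, committing every batch_size rows.

    Hypothetical illustration: no single transaction ever holds more than
    batch_size pending row changes, however long the assignment list is.
    """
    commits = 0
    cur = conn.cursor()
    for start in range(0, len(assignments), batch_size):
        batch = assignments[start:start + batch_size]
        try:
            cur.executemany(
                "UPDATE crew SET flight_id = ? WHERE crew_id = ?",
                [(flight_id, crew_id) for crew_id, flight_id in batch],
            )
            conn.commit()
            commits += 1
        except sqlite3.Error:
            # Undo only the failed batch, then surface the error to the caller
            # instead of hanging or timing out silently.
            conn.rollback()
            raise
    return commits


def demo():
    """Build a throwaway in-memory table and reassign 2000 crew members."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE crew (crew_id INTEGER PRIMARY KEY, flight_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO crew VALUES (?, NULL)", [(i,) for i in range(2000)]
    )
    conn.commit()
    assignments = [(i, 1000 + i) for i in range(2000)]
    commits = reassign_crew(conn, assignments, batch_size=500)
    assigned = conn.execute(
        "SELECT COUNT(*) FROM crew WHERE flight_id IS NOT NULL"
    ).fetchone()[0]
    return commits, assigned
```

The trade-off is that the reassignment is no longer atomic as a whole, so the application has to tolerate a partially applied schedule if a later batch fails.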

    --
    This is not my sandwich.
  17. Re:Happens all the time... by Greyfox · · Score: 2, Insightful
    Funny how you never really hear about the applications written in COBOL, Fortran and PL/1 crashing. You get the impression that all those applications run for years at a time without so much as a hiccup. It's only with the invasion of GUIs and "modern" design techniques and languages that you start hearing about crashes like this. Granted the newer applications tend to be more ambitious about what they do...

    I'd love to see some uptime numbers for past systems versus the systems we have today. I wonder if they'd show the downward trend that I suspect they would.

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  18. Re:System Tracked Crew Location, Not Reservations by benjamindees · · Score: 2, Insightful

    Perhaps there's a better principle you could apply, namely that anyone, be it company or person, only has a finite amount of resources (time, money) at their disposal, and chooses to dedicate them to specific tasks.

    Perhaps unkempt diners are more concerned with the quality of their food than the ambiance. Perhaps the IT guy with twelve certifications knows more about getting certifications than about working on computers. Perhaps the vendor that sends you a Christmas card every year is pulling employees off of doing real work in order to make it look to you like they have their shit together. Perhaps the antisocial guy with the unkempt hair and the socks-with-sandals is more concerned with proving his latest theory than with what you think of him.

    Perhaps appearances can be deceiving.

    --
    "I assumed blithely that there were no elves out there in the darkness"
  19. Re:Fire away! by Anonymous Coward · · Score: 1, Insightful

    Ahh. I'm surprised the "pre Maestro" stuff still exists. In fact, I think SBS's preferred platform for the older stuff was Ultrix. If COMAIR waited this long to address replacing this ancient FORTRAN spiderweb, they made their own bed. I think SBS released Maestro to replace that stuff in 1993 or so.

  20. McPay. by Anonymous Coward · · Score: 1, Insightful

    "From Yahoo Jobs:

    Software Engineer Cincinnati, OH $40K -$50K"

    That's more than I make at McDonalds.

  21. Re:Simple Solution by the+pickle · · Score: 3, Insightful

    Sure, that's eminently practical. I can take 48 hours to get from Detroit to LA, or I can take six (including travel time and check-in time at both airports).

    p

  22. Re:Fire away! by pVoid · · Score: 2, Insightful
    Do you not know what a rollback segment is? It's what makes you run out of disk space while updating a table 1300 times larger than you thought it would be

    Yes, but you pretty much spelled out what my point was in that the n^2 complexity issue is unrelated to transactional operations. That is, a transaction is a transaction, it is scalable, so it doesn't matter whether the actual operation for computing stuff is O(n^2), the transaction is still a fixed cost. On a side point: I don't agree that because the problem is 1300 times more complex, the updates are 1300 times bigger. The problem is still based on n elements: it just happens that computing the solution of a problem with n elements takes n^2 time... the end result though is still n elements to update.

    That being said, I am fairly confident modern relational databases are scalable to the point of being able to handle a 500 fold increase (if only by simply slowing down to a crawl - but not crashing). If anything, it's probably internal application logic that wasn't able to handle the added computational complexity and at a certain point hit a hard limit of its scalability (some fixed sized arrays, or indexes of some sort).

    My comment about 'ran out of disk space' was more along the lines of "it's either an application fault, or something mundane like someone forgot to check if they had sufficient disk space" (something which can happen anytime due to neglect)
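
The distinction drawn above between compute cost and update size can be shown with a toy example. This is only a sketch under made-up assumptions: the `reschedule` function and its conflict rule are invented for illustration and have nothing to do with the real scheduling software.

```python
def reschedule(crew_ids):
    """Toy O(n^2) scheduler: compares every crew member with every other one.

    Returns (comparisons, rows_to_update). The comparison count grows as n^2,
    while the number of rows that would be written back grows only as n.
    """
    crew_ids = list(crew_ids)
    comparisons = 0
    updates = {}
    for a in crew_ids:
        conflicts = 0
        for b in crew_ids:
            comparisons += 1
            # Made-up conflict rule, purely to give the inner loop some work.
            if a != b and a % 7 == b % 7:
                conflicts += 1
        # Exactly one result row per crew member, however much work it took.
        updates[a] = conflicts
    return comparisons, len(updates)
```

Going from 10 to 100 crew members multiplies the comparisons by 100 (from 100 to 10,000) but the rows to update by only 10 (from 10 to 100), which is why a heavier computation does not by itself imply a proportionally larger transaction.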

  23. Re:System Tracked Crew Location, Not Reservations by Anonymous Coward · · Score: 1, Insightful
    I also don't eat at diners where the help isn't properly groomed. Same principal: if you can't take of simple stuff, you probably can't take of something more important and/or complex.
    Principle, not principal. You also left out the word "care", not once but twice. If you can't take care of simple stuff like grammar and spelling, you probably can't take care of something more important and/or complex (like programming, maybe?).