Comair System Crashes; Passengers Stranded
Broerman writes "30,000 people have had their flights cancelled by Comair this weekend thanks to a computer system shutdown. It appears that, due to weather and other problems, flights began to be cancelled on Thursday and the backlog choked the system. 1,100 flights have been cancelled so far, including all flights through 12/26. Does anyone know what platform their system was based on? What kind of system just totally crashes? The official statement is that 'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.' It seems highly improbable that a system would crash because it had too many reservations. The system should only be able to hold as many reservations as it has flights/seats. It seems more likely that the system was overloaded with use and that caused a meltdown. When you add in the problems experienced by US Airways, this hasn't been a Merry Christmas for many."
Anybody know what they were running? I'd like to see this flamewar get started as soon as possible.
When I lived in Chicago, they would lose their radar system on what seemed like a strong wind. And I got stuck in Denver overnight once because the computer system they use to calculate the weight of departing flights crashed. I have a feeling these kinds of crashes are much more common than most people think.
"Does anyone know what platform their system was based on? What kind of system just totally crashes?"
A stab in the dark here but I'm assuming a system without foresight and redundancy?
It's not the OS, it's the people behind it who are to blame. Yes, stupidity and MSW often go together, but in a few years one will probably see the occasional massive Linux outage due to... similarly stupid people.
'There was a cumulative effect with the canceled flights and trying to get crew assigned that caused the system to be overwhelmed.'
:-)
I am only trying to make sense of the comment quoted above from the official statement.
Crew assignment is a hard problem; it is usually formulated as a MILP (Mixed Integer Linear Programming) problem.
Such problems may be very hard to solve in reasonable time. Maybe (I'm shooting in the dark here) the first delays made the crew assignment problems grow too large to be solved in reasonable time. This would generate a snowball effect, as the assignment problems would keep on growing, making the system "crash".
We may never know what really happened, but this would be a nice example for my classes.
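To give a feel for why the problem size matters, here is a minimal Python sketch. It is not a MILP solver and has nothing to do with whatever Comair actually ran: it brute-forces a toy one-crew-per-flight assignment over a made-up cost matrix. The function names and the numbers are invented for illustration. The plain one-to-one version shown here can be solved efficiently by smarter algorithms; it is the side constraints of real crew scheduling (duty-hour limits, crew locations) that push it toward MILP. The point is only how fast the number of candidate assignments grows.

```python
from itertools import permutations
from math import factorial


def best_assignment(cost):
    """Try every way of giving each flight a distinct crew (brute force).

    cost[c][f] is a hypothetical cost of putting crew c on flight f.
    Returns (total_cost, assignment), where assignment[f] is the crew
    chosen for flight f.
    """
    n = len(cost)
    best = None
    for perm in permutations(range(n)):  # n! candidate assignments
        total = sum(cost[crew][flight] for flight, crew in enumerate(perm))
        if best is None or total < best[0]:
            best = (total, perm)
    return best


def search_space(n):
    """Number of one-to-one assignments of n crews to n flights."""
    return factorial(n)


# Made-up 3x3 example: rows are crews, columns are flights.
cost = [
    [9, 2, 7],
    [6, 4, 3],
    [5, 8, 1],
]
total, assignment = best_assignment(cost)
print(total, assignment)

# The candidate count explodes as rescheduling touches more flights.
for n in (3, 10, 20):
    print(n, search_space(n))
```

Three flights give 6 candidates, ten give 3,628,800, and twenty are already far beyond exhaustive search, which is the snowball effect in miniature.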
Of course, a techie didn't write the PR release. Who in their right mind would let a techie anywhere near a PR release?
BTW, Comair, a Delta feeder headquartered outside Cincinnati, says the system that crashed was used to monitor crew locations and track working hours to ensure no one went over the legal maximum. Comair says the system crashed as a result of massive crew rescheduling following a record snow in their service area on Wednesday. There is no backup.
-- Slashdot: When Public Access TV Says "No"
From: http://www.fly.faa.gov/FAQ/faq.html
The term "Rule 240" refers to a rule that existed before airline deregulation. There is no longer an actual Rule 240. The term, as it is now used, refers to each airline's "conditions of carriage" policy. You would need to contact the airlines to obtain this.
Probably not. It's an old story (quickly retold):
Army base computer going down every night. So the grunt in charge of it stayed the night to see what was happening. When the computers went down, he heard the hum of the floor buffer.
The janitor had plugged his floor buffer into the same power as the computers and it caused the crashes. It was quickly fixed by telling the janitor to not do that and putting locking covers on the power outlets.
But they dreaded telling the base commander what the issue was. So they told him it was "a buffer problem."
Some of my co-workers are on contract developing Java software for Comair.
Comair are very tied to particular systems, and don't want to change even when the developers have pointed out problems. Case in point: a J2EE-based employee portal, based on Novell exteNd (Novell Portal Service) and a one-way HPUX server. NPS runs in Tomcat, which is servicing requests (via mod_jk) through Apache. No other application shares the machine, and Comair will only consider vertical scaling, not horizontal.
The application creates at least two threads per connection, and when the thread count goes beyond a relatively low threshold (between 300 and 400), Tomcat deadlocks. It's not because they're running out of space in the allocated JVM heap, and they've tuned mod_jk to allow for heavy load. The current solution is to restart Tomcat when the system locks up.
Novell's support has been less than stellar, so the Java contracting group was informally asked what to do. We had all kinds of useful suggestions, from dumping NPS for another portal implementation, to creating custom thread-pools, to using JDK 1.4 new I/O and a minimally-threaded design, and even using round-robin DNS and a group of independent portal servers to share the load. Comair are wedded to particular minimal cost solutions, however, and it shows.
At least when the portal crashes, it only impacts employees and not passengers.
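For anyone wondering what the "custom thread-pools" suggestion buys you, here is a rough sketch. It is in Python rather than the Java/Tomcat stack described above, and it is not the portal's code: `handle_request`, `serve`, and the cap of 8 workers are all made up for illustration. The idea is that a fixed pool of workers pulls requests from a queue, so the thread count stays bounded no matter how many requests arrive, instead of growing with every connection.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

MAX_WORKERS = 8  # hypothetical cap, far below the 300-400 thread trouble zone


def handle_request(request_id):
    """Stand-in for real request handling; reports which thread ran it."""
    return request_id, threading.current_thread().name


def serve(request_ids, max_workers=MAX_WORKERS):
    """Run every request on a fixed-size pool instead of a thread per request."""
    with ThreadPoolExecutor(max_workers=max_workers,
                            thread_name_prefix="portal") as pool:
        # map() queues the work and returns results in submission order.
        return list(pool.map(handle_request, request_ids))


results = serve(range(500))
threads_used = {name for _, name in results}
print(len(results), "requests handled by", len(threads_used), "threads")
```

The trade-off is that excess requests wait in the queue rather than each getting a thread, so latency rises under load but the process does not run away with threads.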
This article outlines how this joint venture re-vamped Delta's IT systems (again remember, this is 1995):
The trail runs dry here, job postings stopped around 2001.
Which really raises suspicions that all the code is written and maintained offshore. The question now becomes who is handling this for Delta.
One of Tata's spinoffs, Airline Financial Support Services, is described as
Wipro handles some of Delta's inbound reservation calls in India and the Philippines.
In conclusion, it would appear that either Tata's AFS arm or Wipro do the IT for Delta airlines.
That is not a bug but an accurate model of reality. When you strand 32,768 passengers, they will turn negative.
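For those who missed the joke: 32,767 is the largest value a signed 16-bit integer can hold, and one more wraps around to -32,768. Here is a small Python sketch of that wraparound. Python integers do not overflow on their own, so the helper `to_int16` (a made-up name) emulates two's-complement 16-bit storage by hand. Nothing here says which counter, if any, actually overflowed in the real system.

```python
def to_int16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    return ((value + 2**15) % 2**16) - 2**15


stranded = 32767                    # largest value a signed 16-bit field can hold
one_more = to_int16(stranded + 1)   # wraps around to the most negative value
print(stranded, "->", one_more)
```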
A friend was sysadmin at a manufacturing plant, and the janitor kept plugging into the power conditioned sockets with a very large, power-hungry floor polisher. He was actually blowing power supplies. Every one cost several thousand dollars in service calls to replace the power supply and downtime.
My friend put "COMPUTER USE ONLY" stickers OVER the power-conditioned sockets. The janitor ripped them off to plug in, and blew another power supply.
My friend finally confronted the janitor, who was a really obstinate PITA. He stood there and said "Yeah, I did it, and I'm gonna keep doing it, and I don't give a damn about you or your fu*kin' computers."
This was an automotive union shop, where it was very difficult to get people fired.
But, in a show of karma rarely witnessed by mortals, the VP of the division was standing within earshot but out of sight. When the janitor finished saying he didn't give a damn that he was costing the company $10,000 a week because he was too lazy to go get an extension cord, the VP walked around the corner and said hi. I don't know whether the guy ran to his car or the VP kicked his ass right over the top of it.
Your statements are accurate.
I was a Unix sysadmin there, but left for greener pastures during the dot-com craze. The non-redundant hardware at the time ran AIX, and had a great support contract from IBM. The SBS application, however, always had monthly issues, at least at that airline. They were looking for a replacement then, and I'm not surprised they still haven't replaced it.