Windows Upgrade, FAA Error Cause LAX Shutdown
fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
Agreed. A well-written AT script, something like this: Each M T W Th R S Su 12:45 AM shutdown /l /r /y /c
Would do the trick... We have used that exact script for YEARS to reboot a troublesome NT4 BDC at a remote location every night.
While we knew that this was not a great solution, no one needed to access the server at that time of night. Any right-minded IT person should be able to see the flaw in the FAA's logic.
Coding my way to the next BSOD!
Search Microsoft's Knowledge Base for "49.7 days", and you'll find a few bugs, all of them related to storing uptime in milliseconds in an unsigned 32-bit integer. Two were reported in Windows 2000:
That rpcss.exe issue looks like a prime suspect. The OS doesn't crash, but, given the time-sensitive nature of air traffic control data, it's quite possible that the applications running on that server would degrade to the point of failure.
Both look like they were found, or at least entered into the KB, after the release of Windows 2000 Service Pack 4 (Nov. 2003), and hotfixes are available for both.
Note to Microsoft (or anyone else storing milliseconds, for that matter): unsigned 64-bit int! Instead of having to reboot every 49.7 days, you'll have to reboot every 213,503,982,334 days, give or take a leap-second.
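For anyone who wants to see the arithmetic, here is a minimal Python sketch of a millisecond uptime counter wrapping around. It only models an unsigned counter of a given width; the function names (`wrap`, `elapsed_ms`, `days_until_wrap`) are made up for illustration and are not any Windows API.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def wrap(ticks: int, bits: int) -> int:
    """Truncate a millisecond count to an unsigned counter of the given width."""
    return ticks % (1 << bits)


def elapsed_ms(start: int, now: int, bits: int = 32) -> int:
    """Elapsed time between two counter readings, tolerating a single wrap."""
    return (now - start) % (1 << bits)


def days_until_wrap(bits: int) -> float:
    """How many days a millisecond counter of this width runs before wrapping."""
    return (1 << bits) / MS_PER_DAY


# Two readings of a 32-bit counter, just before and just after the wrap.
before = wrap(2**32 - 1, 32)
after = wrap(2**32 + 5, 32)

print(after < before)             # True: a naive comparison sees time run backwards
print(elapsed_ms(before, after))  # 6: modular subtraction still gives the right answer
print(days_until_wrap(32))        # roughly 49.7 days
print(2**64 // MS_PER_DAY)        # 213503982334 whole days for a 64-bit counter
```

Code that compares raw tick values breaks at the wrap, while code that only ever subtracts two readings modulo the counter width survives one wrap. Widening the counter to 64 bits just makes the problem moot in practice.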
This sig intentionally left blank.
and yes, 49.7 is an approximate value
The exact value is 49 and 59,929/84,375 days, or 49 days, 17 hours, 2 minutes, and 47.296 seconds (exact).
Hey, news for nerds, what did you expect...
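The exact figure above is easy to check. A short sketch, assuming the counter is 2^32 milliseconds wide as described earlier in the thread (the variable names are just for illustration):

```python
from fractions import Fraction

MS_PER_DAY = 86_400_000  # milliseconds in a day
total_ms = 2**32         # range of an unsigned 32-bit millisecond counter

# Peel off whole days, hours, and minutes; what is left is the seconds part.
days, rem_ms = divmod(total_ms, MS_PER_DAY)
hours, rem_ms = divmod(rem_ms, 3_600_000)
minutes, rem_ms = divmod(rem_ms, 60_000)
seconds = Fraction(rem_ms, 1000)

# The leftover part of a day as an exact, fully reduced fraction.
day_fraction = Fraction(total_ms, MS_PER_DAY) - days

print(days, hours, minutes, float(seconds))
print(day_fraction)
```

Using `Fraction` keeps everything exact, so the 59,929/84,375 remainder comes out already reduced instead of as a rounded float.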
Learn to love Alaska
Pilot here, and this has been a well-known peccadillo of the tracking system for SoCal Approach for a few years. It's an application problem that came into being after an upgrade of the application, not the OS. It's a memory allocation error that retains some of the old tracking data on the system, so the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS. Look guys, I'm a Linux user and all, but let's not run around blaming M$ for problems with buggy software apps.
"Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
I used to write aviation message handling systems. We migrated from Tru64 (now extinct) to Linux and have had much better performance, maintainability, hardware support, and reliability.
Of course, the code leap from Tru64 to Linux is quite small, which is the biggest reason why Linux was chosen.
Aviation expects 99.9999% uptime with absolutely no message loss, and we would achieve that with hot-standbys and MySQL mirroring. All circuits were split and would simultaneously enter both servers. Only the primary server would route the message.
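To make the split-circuit idea concrete, here is a very rough Python sketch. It is not the system described above: the `Server` class, `deliver`, and `fail_over` are hypothetical names, the database mirroring is left out entirely, and failure detection is assumed to happen somewhere else.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical message handler; every server stores, only the primary routes."""
    name: str
    is_primary: bool = False
    received: list = field(default_factory=list)
    routed: list = field(default_factory=list)

    def ingest(self, message: str) -> None:
        self.received.append(message)    # both servers keep a copy
        if self.is_primary:
            self.routed.append(message)  # only the primary forwards it


def deliver(message: str, servers: list) -> None:
    """Model a split circuit: the same message enters every live server."""
    for server in servers:
        server.ingest(message)


def fail_over(failed: Server, standby: Server) -> None:
    """Promote the hot standby once the primary is known to be down."""
    failed.is_primary = False
    standby.is_primary = True


primary = Server("a", is_primary=True)
standby = Server("b")

deliver("msg-1", [primary, standby])
fail_over(primary, standby)
deliver("msg-2", [standby])              # the failed box no longer receives anything

print(primary.routed, standby.received, standby.routed)
```

Since the standby has been receiving every message all along, promoting it loses nothing that arrived before the failure. Working out which of those messages the old primary had already routed is the hard part, and this sketch does not attempt it.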
No, we didn't require the customer to reboot. The system could run for years at a time.
Putting mission critical applications on Windows 95 is just plain stupid.
There is a rather more extreme case of this with the FAA - when first deployed, the cargo doors of the DC-10 were unsafe, with a failure mode that was likely to make the plane uncontrollable in flight.
This occurred in flight, and through luck (which allowed some degree of control) and extraordinary airmanship, the plane was landed safely. (This is known as "The Windsor Incident.")
McDonnell-Douglas didn't want to do a proper redesign of the door mechanism, and the FAA head was a 'companies know best' political appointee, so the result was that McD added little windows to the door so that the guy closing the door could look to see it had all engaged properly. (This was over vigorous opposition by the NTSB, who recognized the inadequacy of the fix.)
The situation: A single failure (not looking, or looking but not noticing an unsafe condition) by a non-safety-trained, close-to-minimum-wage employee could cause the deaths of hundreds of people.
Result: over 300 dead when a Turkish Airlines DC-10 crashed near Paris. The guy who closed the door hadn't even been told he was supposed to check the little windows.
Safety critical systems must be tolerant of human error. If a single omission by a human leads to a hazardous situation, this is primarily the fault of the system, not the human.
Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.