Slashdot Mirror


Windows Upgrade, FAA Error Cause LAX Shutdown

fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"

5 of 862 comments

  1. 32 bit timer by charnov · · Score: 5, Interesting

    This old error was from the use of a 32 bit 1 ms increment timer (comes out to 49.7 days until rollover). AFAIK, this was fixed in Win2k and above when the timer got bumped to 64 bit. Maybe whoever set up LAX was using some ancient legacy middleware that used the old timer. This is just bizarre. In both locations that I have worked the last three years, none of the Win2k or Win2k3 servers went down ever. Sounds like bad consultants.
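A minimal sketch of the arithmetic behind the 49.7 day figure, assuming a 32-bit counter that ticks once per millisecond as described above. The function names are hypothetical and this is not the actual Windows or FAA code; it only illustrates how such a counter wraps and how modular subtraction keeps elapsed-time calculations correct across the wrap.

```python
# Illustrative only: models a 32-bit millisecond tick counter.
MS_PER_DAY = 24 * 60 * 60 * 1000
COUNTER_MODULUS = 2 ** 32  # number of distinct values a 32-bit counter can hold


def days_until_rollover(modulus: int = COUNTER_MODULUS) -> float:
    """Days before a 1 ms counter with this modulus wraps back to zero."""
    return modulus / MS_PER_DAY


def elapsed_ms(start_tick: int, now_tick: int, modulus: int = COUNTER_MODULUS) -> int:
    """Wrap-safe elapsed time between two readings (modular subtraction)."""
    return (now_tick - start_tick) % modulus


print(round(days_until_rollover(), 1))  # 49.7

# A reading taken 1000 ms before the wrap, and another 5000 ms later.
start = COUNTER_MODULUS - 1000
now = (start + 5000) % COUNTER_MODULUS
print(now - start)             # naive difference is negative after the wrap
print(elapsed_ms(start, now))  # 5000
```

Software that compares raw tick values without the modular step is the kind of code that misbehaves once the counter wraps; whether that is what happened at LAX is not something this sketch can show.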

    --
    [RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
  2. Check out this little pile of bullshit by Trailer+Trash · · Score: 5, Interesting

    The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999

    Okay, bullshit. If I have to reboot a server every month, .0000001 of a month is- oh, let's be generous and only count months with 31 days- about .26 seconds. That's a damned fast boot time for Win2K.

    Maybe they left off a percent sign?
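A rough check of that estimate, plus the reverse calculation: what availability a monthly reboot would leave you with. The 120 second reboot time is purely an assumed figure for illustration, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def allowed_downtime_seconds(availability: float, period_days: int) -> float:
    """Downtime budget, in seconds, for a given availability over a period."""
    return (1.0 - availability) * period_days * SECONDS_PER_DAY


def availability_with_reboot(reboot_seconds: float, period_days: int) -> float:
    """Best-case availability if the only downtime is one reboot per period."""
    return 1.0 - reboot_seconds / (period_days * SECONDS_PER_DAY)


# Budget for 0.9999999 over a 31-day month: a little over a quarter of a second.
print(allowed_downtime_seconds(0.9999999, 31))

# Assumed 120 s reboot once per 31-day month (hypothetical figure).
print(availability_with_reboot(120, 31))
```

Under that assumed reboot time the result is a bit over four nines, well short of the seven claimed.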

  3. Lessons from other Aviation Authorities by MosesJones · · Score: 5, Interesting


    I worked for around 5 years in Air Traffic Control projects, both in delivery of radar processing and displays and in R&D for next generation systems.

    Let me give you an overview of the failure approach of just one of those systems.

    1) Everything on Unix, ruggedised releases of UNIX

    2) Every box must be able to FAIL ON ITS OWN

    3) Every box must have a direct replacement, or replacements, which carry the SAME LOAD.

    4) ZERO total system downtime allowed, partial systems failures are allowed, but core systems must keep running.

    5) 5 stages of power supply failure, double mains, double generation and lastly a great big warehouse of car batteries if all else fails.

    6) 4 Years of testing of FULL system before live.

    This is what is normal when safety is the primary concern. What the FAA decision sounds like is a cost driven process which chose the cheapest solution that "could" meet the requirements.

    The idea of a safety critical (if it fails people could die) system that requires a reboot is fine in only one case: if it can be non-operational on a regular basis, in which case the reboot should be done in EVERY non-operational window (say every week). That is okay for some hospital scanners that are certified for 12 hour runs. It's not okay for a 24/7 system that controls objects flying around at 500 miles an hour.

    Welcome to the US... we will be landing slightly quicker than expected.

    --
    An Eye for an Eye will make the whole world blind - Gandhi
  4. Uptime: From one of the article links by Mateito · · Score: 5, Interesting
    The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999.

    Whoah! 7 nines uptime!

    About 3 seconds of downtime per year.

    Somebody is on drugs if they sold that. Somebody is on even stronger drugs if they bought that story.

    "5 nines", for all intents and purposes, is as good as it gets, with "6 nines" seen as the holy grail. The top HA system I've ever dealt with (running a Telco's billing operation spanning 4 countries!) quoted a figure of 0.999996. To nobody's surprise, it did not run Windows.
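A quick sketch of what these availability figures mean in seconds of downtime per year, assuming a 365-day year and counting all downtime against the figure. The function name and labels are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def yearly_downtime_seconds(availability: float) -> float:
    """Seconds of downtime per year permitted by an availability figure."""
    return (1.0 - availability) * SECONDS_PER_YEAR


figures = [
    ("5 nines", 0.99999),
    ("6 nines", 0.999999),
    ("7 nines", 0.9999999),
    ("quoted telco figure", 0.999996),
]

for label, availability in figures:
    print(f"{label:>20}: {yearly_downtime_seconds(availability):8.1f} s/year")
```

Under these assumptions seven nines allows roughly 3 seconds a year, six nines about half a minute, and five nines a little over five minutes.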

    Wonder how much their failure clause is going to set them back?

  5. Downtime vs Failure by burnin1965 · · Score: 5, Interesting

    I'm not sure exactly what downtime for routine maintenance on an AIX system running DBase has to do with a Windows bug that causes a system failure. However, in response, there is a difference between planned downtime, where a service is made unavailable while routine maintenance is performed, and downtime (planned or unplanned) that exists only because of a flaw in the system.

    It appears that in this case Windows has a flaw which they try to work around with routine maintenance during planned downtime.

    In your case I would say you have planned downtime for routine maintenance to work around the need for an appropriate system to handle the work load.

    I suppose what is the same between these two cases is that you both need to change your system to something that is more appropriate for the task at hand. And to be more specific in the FAA case, Windows should not be allowed for use in any application where life, limb, or property is at risk. Hmm, I suppose that may rule out just about every use. :P

    burnin