Windows Upgrade, FAA Error Cause LAX Shutdown

Posted by michael on Tuesday September 21, 2004 @09:48AM from the first-woodpecker-to-come-along dept.

fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"

32 bit timer by charnov · 2004-09-21 09:57 · Score: 5, Interesting

This old error was from the use of a 32 bit 1 ms increment timer (comes out to 49.7 days until rollover). AFAIK, this was fixed in Win2k and above when the timer got bumped to 64 bit. Maybe whoever set up LAX was using some ancient legacy middleware that used the old timer. This is just bizarre. In both locations that I have worked the last three years, none of the Win2k or Win2k3 servers went down ever. Sounds like bad consultants.
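The 49.7-day figure follows directly from the counter width. Here is a minimal sketch, assuming an unsigned 32-bit counter that increments once per millisecond; the function names are illustrative and not any Windows API:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def rollover_days(bits: int, tick_ms: int = 1) -> float:
    """Days until an unsigned counter of the given width wraps back to zero."""
    return (2 ** bits) * tick_ms / MS_PER_DAY


def elapsed_ms(start: int, now: int, bits: int = 32) -> int:
    """Elapsed ticks between two readings, correct across a single wrap."""
    return (now - start) % (2 ** bits)


print(round(rollover_days(32), 1))        # 49.7
print(elapsed_ms(2 ** 32 - 100, 50))      # 150, even though the counter wrapped
```

Code that compares raw tick values without the modular subtraction can misbehave at the wrap, which is one plausible way a 49.7-day limit surfaces in software layered on top of such a timer.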

--
[RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
Check out this little pile of bullshit by Trailer Trash · 2004-09-21 09:58 · Score: 5, Interesting

The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999

Okay, bullshit. If I have to reboot a server every month, .0000001 of a month is- oh, let's be generous and only count months with 31 days- about .26 seconds. That's a damned fast boot time for Win2K.

Maybe they left off a percent sign?
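The arithmetic is easy to check. A rough sketch, assuming availability is measured over a single 31-day month and that every second of a reboot counts as downtime (the helper name is made up for illustration):

```python
def downtime_budget_seconds(availability: float, period_seconds: float) -> float:
    """Seconds of downtime permitted in a period at the given availability."""
    return (1.0 - availability) * period_seconds


SECONDS_PER_31_DAYS = 31 * 24 * 60 * 60  # 2,678,400 seconds

budget = downtime_budget_seconds(0.9999999, SECONDS_PER_31_DAYS)
print(f"{budget:.3f} s")  # 0.268 s
```

A monthly reboot would have to finish in about a quarter of a second to stay inside that budget.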

--
Do you have ESP?
Lessons from other Aviation Authorities by MosesJones · 2004-09-21 10:06 · Score: 5, Interesting

I worked for around 5 years in Air Traffic Control projects, both in delivery of radar processing and displays and in R&D for next generation systems.

Let me give you an overview of the failure approach of just one of those systems.

1) Everything on Unix, ruggedised releases of UNIX

2) Every box must be able to FAIL ON ITS OWN

3) Every box must have a direct replacement, or replacements, which carry the SAME LOAD.

4) ZERO total system downtime allowed, partial systems failures are allowed, but core systems must keep running.

5) 5 stages of power supply failure, double mains, double generation and lastly a great big warehouse of car batteries if all else fails.

6) 4 Years of testing of FULL system before live.

This is what is normal when safety is the primary concern. What the FAA decision sounds like is a cost driven process which chose the cheapest solution that "could" meet the requirements.

The idea of a safety critical (if it fails people could die) system that requires a reboot is fine in only one case: if it can be non-operational on a regular basis, in which case the reboot should be done in EVERY non-operational window (say every week). This is therefore okay for some hospital scanners that are certified for 12 hour runs. It's not okay for a 24/7 system that controls objects flying around at 500 miles an hour.

Welcome to the US... we will be landing slightly quicker than expected.

--
An Eye for an Eye will make the whole world blind - Gandhi
Seen this week at various airports by whoever57 · 2004-09-21 10:07 · Score: 4, Interesting

This week, while flying, I saw:
1. Windows-based terminal used by the public to print tickets (I think) with a "you have chosen to download a file, what do you want to do with it: save, open" or similar (I don't recall the exact wording).

2. A Windows-based machine that was part of the baggage scanning setup at Chicago-O'Hare going through a scandisk process. OK, this may have been due to operators turning the machine off using the power switch, but should not such a machine use a read-only boot drive/partition?

Do you feel more secure?

--
The real "Libtards" are the Libertarians!
You insensitive clod by rutledjw · 2004-09-21 10:09 · Score: 4, Interesting
As a PHB, I resemble that remark! Clearly you do not appreciate the fine art which is combining management and technical decision-making. Neither does my parent corp.
I have the distinct, but sadly not unusual, pleasure of watching my company execute a brilliant strategy of:
1. Outsourcing Data Center Operations (systems that used to be down for seconds a year are now down for days and in some cases weeks per year)
2. Outsourcing development to India (which has been a mess I won't use the foul language to describe) _AND_
3. Squeezing remaining people to make up for items 1 and 2!
Since becoming a PHB (although I still do architecture work - thankfully), I've found that mindless, boneheaded, sweeping decisions are usually driven by some empty-suit, bean-counting, incompetent, barely literate, sh!t-for-brains sycophant who found themselves in an executive position purely by accident. We're "encouraged" to support their "strategies". Indeed...
It's a much higher order PHB. Kinda like a 4th degree black-belt, but not.
--

Computer Science is Applied Philosophy
Re:Anyone want to clue them in to scheduled jobs? by mekkab · 2004-09-21 10:10 · Score: 4, Interesting

It's obviously lunacy for any company to replace a proven system, which has given years of reliable service with some piece of trash that crashes if left running for over a month

What if that proven system is decaying out from under you? HDs failing, memory going bad... Tell you what, can you get me new boards for an IBM RT PC? I highly doubt it.

What about "olde" mainframes running assembler code? The pool of expertise is drying up... sometimes you need to pitch the hardware.

--
In the future, I would want to not be isolated from my friends in the Space Station.
Space Shuttle accidents and software bugs by BlueUnderwear · 2004-09-21 10:17 · Score: 4, Interesting

Was at JAOO today, and on the closing panel discussion for the Test-Driven Development track, Mr Kevlin Henney was praising NASA's rigorous software testing procedures. He was so proud of them that he let out a "and in both space shuttle crashes, software was not to blame". Well, this may be correct if he was thinking only about the flight software... but there is other software than what rides in the shuttle itself...

--
Say no to software patents.
Uptime: From one of the article links by Mateito · 2004-09-21 10:28 · Score: 5, Interesting

The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999.
Whoa! 7 nines uptime!
About 3 seconds of downtime per year.
Somebody is on drugs if they sold that. Somebody is on even stronger drugs if they bought that story.
"5 nines", for all intents and purposes, is as good as it gets, with "6 nines" seen as the holy grail. The top HA system I've ever dealt with (running a Telco's billing operation spanning 4 countries!) quoted a figure of 0.999996. To nobody's surprise, it did not run Windows.
Wonder how much their failure clause is going to set them back?
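For comparison, here is a quick sketch of what each count of nines allows per year, assuming a 365-day year and treating availability as simple uptime divided by total time (both are simplifying assumptions, and the function name is illustrative):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years ignored


def yearly_downtime_seconds(nines: int) -> float:
    """Downtime per year permitted by an availability of `nines` nines."""
    unavailability = 10.0 ** -nines
    return unavailability * SECONDS_PER_YEAR


for n in (5, 6, 7):
    print(f"{n} nines: {yearly_downtime_seconds(n):.1f} s/year")
```

Under those assumptions five nines allows a little over five minutes a year, six nines about half a minute, and seven nines only a few seconds.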

--
Norman Cook's Ode to Sl
Not necessarily Windows' fault by DunbarTheInept · 2004-09-21 10:29 · Score: 4, Interesting

While I hate MS as much as the next guy, this might not really be directly their fault. Unix systems are often installed with the instruction that they get reboots regularly. Often there is a problem that is caused by application code, not the OS. If you have a memory leak in an application that runs and stays up all the time, it's going to cause the system to get horribly unusable in the long run regardless of whether it's UNIX or Windows. While a reboot might be overkill when it was just one application misbehaving, a reboot is a guaranteed way to kill and reset the responsible program no matter which one it is. At a previous place of employment we told the customer to do monthly reboots mainly because we didn't trust *our own* code to be that perfect.

--
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
Downtime vs Failure by burnin1965 · 2004-09-21 11:55 · Score: 5, Interesting

I'm not sure exactly what downtime for routine maintenance on an AIX system running DBase has to do with a Windows bug that causes a system failure. However, in response, there is a difference between planned downtime, where a service is made unavailable while routine maintenance is performed, and downtime (planned or unplanned) that exists only because of a flaw in the system.

It appears that in this case Windows has a flaw which they try to work around with routine maintenance during planned downtime.

In your case I would say you have planned downtime for routine maintenance to work around the need for an appropriate system to handle the work load.

I suppose what is the same between these two cases is that you both need to change your system to something that is more appropriate for the task at hand. And to be more specific in the FAA case, Windows should not be allowed for use in any application where life, limb, or property is at risk. Hmm, I suppose that may rule out just about every use. :P

burnin