Windows Upgrade, FAA Error Cause LAX Shutdown

← Back to Stories (view on slashdot.org)

Windows Upgrade, FAA Error Cause LAX Shutdown

Posted by michael on Tuesday September 21, 2004 @09:48AM from the first-woodpecker-to-come-along dept.

fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibility related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"

21 of 862 comments (clear)

Min score:

Reason:

Sort:

A hit for the other team... by LostCluster · 2004-09-21 09:52 · Score: 3, Interesting

When a ball drops on a baseball field at the midpoint between two positions, it's scored a "hit" for the opposition rather than an "error" against either player. Still, a hit for the other side is a bad thing for the entire team.

This mess was big enough that there's a large enough supply of blame to give some to everybody involved.

- No system should require a manual reboot on a regular basis... there should at least be a script capable of accomplishing that. But somehow, one got implemented. Blame whoever bought it.
- Windows shouldn't have had a faw that required monthly reboots. Blame Microsoft.
- Somebody should have done the reboots like they were told to. Blame that poor smuck.

Bottom line is that everybody's at fault because had any one piece in the chain done their job properly the failure wouldn't have happened, but a cascade of mistakes lead to the ball hitting the grass instead of a glove.
Re:Anyone want to clue them in to scheduled jobs? by TykeClone · 2004-09-21 09:54 · Score: 3, Interesting

at sucks. Very, very much.
I've got an NT server that would hang after 2 weeks. I set up an at job to restart that service nightly and do not have that problem.
I've also got several linux servers that just plain run (and some NT/2000 servers as well).
That being said, rebooting sometimes does clear up many evils. We have a speakerphone (around 10 years old - no OS) that just wouldn't work one day. After looking at it, I unplgged it and plugged it back in (I rebooted it!) and it worked. No good reason, it just helps.

--
A fine is a tax you pay for doing wrong and a tax is a fine you pay for doing all right.
32 bit timer by charnov · 2004-09-21 09:57 · Score: 5, Interesting

This old error was from the use of a 32 bit 1 ms increment timer (comes out to 49.7 days until rollover). AFAIK, this was fixed in Win2k and above when the timer got bumped to 64 bit. Maybe whoever set up LAX was using some ancient legacy middleware that used the old timer. This is just bizarre. In both locations that I have worked the last three years, none of the Win2k or Win2k3 servers went down ever. Sounds like bad consultants.

--
[RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
Check out this little pile of bullshit by Trailer+Trash · 2004-09-21 09:58 · Score: 5, Interesting

The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999

Okay, bullshit. If I have to reboot a server every month, .0000001 of a month is- oh, let's be generous and only count months with 31 days- about .26 seconds. That's a damned fast boot time for Win2K.

Maybe they left off a percent sign?

--
Do you have ESP?
We used to joke by multiplexo · 2004-09-21 09:59 · Score: 3, Interesting

that no one would ever run into the 49.7 day bug on a Windows system because the chances of having that much uptime were slim to none. Having a system where you know that things are broken and you have to reboot it every 30 days to keep it from breaking down is a bad thing, deploying such a system into a production environment is even worse (but it's been done, I don't know how many times I wrote cron jobs to kill bad pieces of software and restart them) but deploying such a system in an environment where lives are at stake is completely inexcusable, regardless of whether or not it is closed or open source. This is similar to having a circuit in your house that overheats because occasionally too much load is placed on it. The idiot solution is to reset the breaker when it trips, the correct solution is to put in a bigger circuit that can handle the peak load. This vendor provided the idiot solution to this problem and should be punished for it, this never should have been deployed, I can only hope that they won't blame the technician for failing to do something that he wouldn't have had to do if the system had been designed properly.
I also love the statement that the system was upgraded from UNIX to Windows. Isn't this kind of like upgrading from being in very good health but not being good looking to being somewhat good looking but suffering from cancer, AIDS and heart disease?

--
cheap labor conservatives - they want to keep you hungry enough to be thankful for minimum wage.
Lessions from other Aviation Authorities by MosesJones · 2004-09-21 10:06 · Score: 5, Interesting

I worked for around 5 years in Air Traffic Control projects, both in delivery of radar processing and displays and in R&D for next generation systems.

Let me give you an overview of the failure approach of just one of those systems.

1) Everything on Unix, ruggedised releases of UNIX

2) Every box must be able to FAIL ON ITS OWN

3) Every box must have a direct replacement, or replacements, which carry the SAME LOAD.

4) ZERO total system downtime allowed, partial systems failures are allowed, but core systems must keep running.

5) 5 stages of power supply failure, double mains, double generation and lastly a great big warehouse of car batteries if all else fails.

6) 4 Years of testing of FULL system before live.

This is what is normal when safety is the primary concern. What the FAA decision sounds like is a cost driven process which chose the cheapest solution that "could" meet the requirements.

The idea of a safety critical (if it fails people could die) system that requires a reboot is fine in only one case... if it can be non-operational on a regular basis, in which case it should be done EVERY non-operational window (say every week) , this is therefore okay for some hospital scanners that are certified for 12 hour runs. Its not okay for a 24/7 system that controls objects flying around at 500 miles an hour.

Welcome to the US... we will be landing slightly quicker than expected.

--
An Eye for an Eye will make the whole world blind - Gandhi
Seen this week at various airports by whoever57 · 2004-09-21 10:07 · Score: 4, Interesting

This week, while flying, I saw:
1. Windows-based terminal used by the public to print tickets (I think) with a "you have chosen to download a file, what do you want to do with it: save, open" or similar (I don't recall the exact wording).

2. A windows-based machine that was part of the baggage scanning setup at Chicago-O'Hare going through a scandisk process. OK, this may have been due to operators turing the machine off using the power switch, but should not such a machine use a read-only boot drive/partition?

Do you feel more secure?

--
The real "Libtards" are the Libertarians!
You insensitive clod by rutledjw · 2004-09-21 10:09 · Score: 4, Interesting
As a PHB, I resemble that remark! Clearly you do not appreciate the fine art which is combining management and technical decision-making. Neither does my parent corp.
I have the distinct, but sadly not unusual, pleasure of watching my company execute a brilliant strategy of:
1. Outsouring Data Center Operations (systems that used to down for seconds a year are now down for days and in some cases weeks per year)
2. Outsource development to India (which has been a mess I won't use the foul language to describe) _AND_
3. Squeeze remaining people to make up for items 1 and 2!
Since becoming a PHB (although I still do architecture work - thankfully), I've found that mindless boneheaded, sweeping decisions, are usually driven by some empty-suit, bean-counting, incompetent, barely literate, sh!t-for-brains syncophant who found themselves in an executive position purely by accident. We're "encouraged" to support their "strategies". Indeed...
It's a much higher order PHB. Kinda like a 4th degree black-belt, but not.
--

Computer Science is Applied Philosophy
Re:Anyone want to clue them in to scheduled jobs? by mekkab · 2004-09-21 10:10 · Score: 4, Interesting

It's obviously lunacy for any company to replace a proven system, which has given years of reliable service with some piece of trash that crashes if left running for over a month

What if that proven systen is decaying out from under you? HD's failing, memory going bad... Tell you what, can you get me new boards for an IBM RT pc? I highly doubt it.

What about "olde" mainframes running assembler code? The pool of expertise is drying up... sometimes you need to pitch the hardware.

--
In the future, I would want to not be isolated from my friends in the Space Station.
Space Shuttle accidents and software bugs by BlueUnderwear · 2004-09-21 10:17 · Score: 4, Interesting

Was at JAOO today, and on the closing panel discussion for the Test-Driven Development track, Mr Kevlin Henney was praising NASA's rigorous software testing procedures. He was so proud of them that he let out a "and in both space shuttle crashes, software was not to blame". Well, this may be correct if he was thinking only about the flight software... but there is other software than what rides in the shuttle itself...

--
Say no to software patents.
Uptime: From one of the artticle links by Mateito · 2004-09-21 10:28 · Score: 5, Interesting

The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999.
Whoah! 7 nines uptime!
22 seconds of downtime per year.
Somebody is on drugs if they sold that. Somebody is on even stronger drugs if they bought that story.
"5 nines", for all intents and purposes, is as good as it gets, with "6 nines" seen as the holy grail. The top HA system I've ever dealt with (running a Telco's billing operation spanning 4 countries!) quoted a figure of 0.999996. To nobody's suprise, it did not run Windows.
Wonder how much their failure clause is going to set them back?

--
Norman Cook's Ode to Sl
Not necessarily Windows' fault by DunbarTheInept · 2004-09-21 10:29 · Score: 4, Interesting

While I hate MS as much as the next guy, this might not really be directly their fault. Unix systems are often installed with the instruction taht they get reboots regularly. Often there is a problem that is caused by application code not the OS. If you have a memory leak in an application that runs and stays up all the time, it's going to cause the system to get horribly unusalbe in the long run regardless of whether it's UNIX or Windows. While a reboot might be overkill when it was just one application misbehaving, a reboot is a guaranteed way to kill and reset the responsible program no matter which one it is. At a previous place of employment we told the customer to do monthly reboots mainly because we didn't trust *our own* code to be that perfect.

--
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
Re:Seen this week at various airports by Anonymous Coward · 2004-09-21 10:33 · Score: 3, Interesting

It probably should. My company uses XP Embedded for a few systems, and doesn't have any software-related problems on them. Ever. The only problems we have are when people snap off antennae that we use for the wireless connections, or something similar. There's no reason that they shouldn't be using something like this to scan baggage. It sounds like someone at O'Hare didn't do their homework.

The only drawback to XP Embedded, for my company at least, is that the Windows license costs us more than the solid-state drive that we run it from. Looking into Linux for new installations as an alternative, but it doens't make much sense to replace strong, stable XP systems that never fail.
I don't blame the OS per se... by WebCowboy · 2004-09-21 11:37 · Score: 3, Interesting

...but I blame a lot of people for carelessness and incompetence (except for the actual techie that forgot to reboot last month--that is an honest mistake).

* Bill Gates and developers of Win2000 for the convoluted, kludgy API they designed for their OS

* Product managers at Harris--the crap-for-brains who actually thought changing out robust UNIX servers that weren't really THAT old with consumer-grade PCs running an unproven OS was an UPGRADE to a critical, safety related system. WHAT THE HELL WERE THEY THINKING? In one of the article links (the Harris press release), Harris touted SEVEN NINES reliability! If that was a criteria they should've NEVER considered Windows...Not even BillG himself would say Win2k could provide that sort of uptime!

* Retarded developers at Harris who used an API call that tracks milliseconds in a 32 bit integer despite the fact that bugs related to the use of said function call were WELL KNOWN by that time.

* Dough-heads at LAX and the FAA who, upon finding the error early in development, decided it was OK to rely on MANUAL MONTHLY REBOOTS as a workaround to a potentially fatal problem. They should've run the "upgraded" windows machines in parallel with the UNIX servers for much longer, and failing that they should've IMMEDIATELY restored the old UNIX servers to service as soon as the problem was discovered, and to refuse the upgrade (and revoke payment to Harris) until the problem was properly resolved (and NOT just worked around with a kludge like an email reminder to reboot, or a reboot script or a shutdown warning either).

I'm surprised that this sort of error got into such a critical system, and at the way it was handled. I would've certainly tested the new system in parallel for long enough to catch this sort of error and kept the old system around for longer as a standby (in my experience, replacements of critical systems were often tested in parallel for 3 months to a year). I also would've acted much more decisively in resolving the problem if it did slip through the cracks, given a system crash could put lives in danger.

Maybe my girlfriends fear of flying is more justified than I thought if these are the kind of clowns we trust our safety to...
Downtime vs Failure by burnin1965 · 2004-09-21 11:55 · Score: 5, Interesting

I'm not sure exactly what downtime for routine maintenance on an AIX system running DBase has to do with a Windows bug that causes a system failure. However, in response, there is a difference between planned downtime where a service is made unavailable while planned routine maintenance is performed and planned downtime or an unplanned failure due to a flaw in the system.

It appears that in this case Windows has a flaw which they try to work around with routine maintenance during planned downtime.

In your case I would say you have planned downtime for routine maintenance to work around the need for an appropriate system to handle the work load.

I suppose what is the same between these two cases is that you both need to change your system to something that is more appropriate for the task at hand. And to be more specific in the FCC case, Windows should not be allowed for use in any application where life, limb, or property is at risk. Hmm, I suppose that may rule out just about every use. :P

burnin
Re:Repent, Sinners! by Turmio · 2004-09-21 12:21 · Score: 3, Interesting

You have to love a system that requires downtime as part of uptime. How many Linux users have this problem?
Actually I was hit by the max 497 days uptime bug of Linux 2.4 (and with a desktop machine no less). The box at work did run for about 650 days but anyway well after the mile stone of half way journey for 2nd consecutive uptime reset. Then it was time for me to change rooms. I wasn't at office that day and my co-worker just unplugged the box. Was I pissed or not? Yes I was.
Re:Now even the submitters aren't reading the arti by AstroDrabb · 2004-09-21 12:25 · Score: 3, Interesting

The shutdown is not a crash but a scheduled event to bring the servers down to flush data.
That is MS PHB speek to "assure" other PHB's that it was not MS's fault. What _modern_ server OS needs to reboot to flush freakin data! Why do you think technical details are never released in these types of press releases?
The reboot was to reset the logic flaw in the MS system timer. Read my post here on it. It has affected other MS made apps on MS Windows 2000 servers. So if MS's programmers get affected by it, you can expect non-MS employeed programmers to get affected too since they do not have the same level of access to the proprietary OS.

--
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Re:Repent, Sinners! by dgatwood · 2004-09-21 14:06 · Score: 3, Interesting

As another link in this discussion noted, an unpatched Win2k does, in fact, require a reboot every 47.5 days because a certain process goes nuts and eats 60% of the CPU. The fact that MS has a patch for the problem does not mean that the problem does not exist.

--
Check out my sci-fi/humor trilogy at PatriotsBooks.
49.7-day bug not exclusive to Windows. by Temporal · 2004-09-21 15:23 · Score: 3, Interesting

It may seem suspicious that the max uptime of the LAX system is the same as the max uptime of a Windows 95 box... until you realize that 49.7 days is 2^32 milliseconds. If you have a piece of software that counts milliseconds using a 32-bit integer, it will inevitably roll over after 49.7 days and -- unless designed to compensate for it -- will probably crash. Windows 95 is certainly not the only piece of software that counts milliseconds in a 32-bit integer.

That said, the Windows GetTickCount() system call returns a timer value as a 32-bit count of milliseconds since the system was booted. Now, any good programmer knows better than to use GetTickCount() -- there are other, better, more robust ways to tell time in Windows -- but it would not surprise me if a newbie had made the mistake of using this system call in the LAX software, thus leading to the problems.

In other words, the Windows timer is not at fault, but it is possible that one of the programmers was confused by the convoluted Win32 API and made a programming error as a result.
Re:Fire the Department of the Interior's IT staff. by tbogart · 2004-09-21 15:41 · Score: 3, Interesting

Looking at the www.faa.gov home page, it says "Department of Transportation". However, having been a systems engineer and administrator in a couple of stints at one of the DOI Bureaus ... you don't want to know.
Hypocrites? by coronaride · 2004-09-21 18:58 · Score: 3, Interesting

This is not addressed to the parent, but is for everyone who responded to the parent -

I'm throwing stones, now - especially after reading this incredibly long and geeky thread about shutting down your OS variants. God bless you for having multiple ways of shutting down/halting/suspending/restarting your computer in user/superuser/megauser/whosyourdaddyuser modes, but shame on you for being a stickler on MS's decision to place a Shutdown option on the "Start" menu when you can't even agree on how to shut your own damned computers down!

It's hypocritical, pharisitical, and parasitical (I like alliterations, even when they're not in context...makes me feel like Don King) to bring up such an argument as "Please press the Start button to shut down (stop) the computer". I'm not saying that "Start" is the most incredible choice for a button, but it makes sense. If you are shutting down your computer, you START THE SHUTDOWN PROCESS.

--
Those who can, do. Those who can't, go into business for themselves.