Windows Upgrade, FAA Error Cause LAX Shutdown
fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibility related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibility related to an old Windows 95 bug.
/sarcasm
Okay... a Win95 bug leads to the LAX shutdown because the *same* bug was later found in Win2k? Yup, closed source is the answer, Mr. Gates. I hereby repent my sins of Open Source Freedom and agree that security by obscurity is the answer!
a technician didn't reboot the system monthly as he should have
You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)
The dangers of knowledge trigger emotional distress in human beings.
"Upgrade" from Unix to Windows, eh. You keep using that word. I do not think it means what you think it means.
Use Ctrl-C instead of ESC in Vim!
I thought switching to Windows from *nix saved time, money, and hassle! Haven't you guys seen those banner ads here?
This old error was from the use of a 32 bit 1 ms increment timer (comes out to 49.7 days until rollover). AFAIK, this was fixed in Win2k and above when the timer got bumped to 64 bit. Maybe whoever set up LAX was using some ancient legacy middleware that used the old timer. This is just bizarre. In both locations that I have worked the last three years, none of the Win2k or Win2k3 servers went down ever. Sounds like bad consultants.
[RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999
Okay, bullshit. If I have to reboot a server every month, .0000001 of a month is- oh, let's be generous and only count months with 31 days- about .26 seconds. That's a damned fast boot time for Win2K.
Maybe they left off a percent sign?
Do you have ESP?
Agreed. A well written AT script something like this: Each M T W Th R S Su 12:45 AM shutdown /l /r /y /c
Would do the trick... We have used that exact script for YEARS to nightly reboot a troublesome NT4 BDC at a remote location.
While we knew that this was not a great solution, no one needed to access the server at that time of night. Any right minded IT person should be able to see the flaw in the FAA's logic.
Coding my way to the next BSOD!
Microsoft isn't the only one to have issues like that. But it has been patched and there should have been more than enough time for the FAA to test and deploy the patch on the few legacy machines running Windows 95.
I simply blame the FAA for wasting money away every year, billions are sunk into the system, but rarely does anything come out of it, Lockheed can deploy a complete new system to every airport for the amount of money that is being dumped into the old TRACONs and towers for MX.
I remember back when that bug was announced. Seems it was at least a couple of years after Windows 95 had been out. I guess they had to work through a lot of other bugs to get Windows 95 to make it long enough for this bug to occur.
Unknown host pong.
"Why did they move from Unix to Windows in the first place?"
Maybe they didn't want to have to reboot on January 19, 2038
OK, I know it's violation of /. policy to actually read a referenced article. My bad. But, according to the software.silicon.com article:
This sounds to me like more of a problem with the application, not the OS. The "system" crashed after 49.7 days, which is about 4 million seconds, which is about 4 billion milliseconds, which is (obviously) MAX_ULONG. I suspect the application is using a ULONG to store a timeout value and got pissed-off when it rolled over.
I worked for around 5 years in Air Traffic Control projects, both in delivery of radar processing and displays and in R&D for next generation systems.
Let me give you an overview of the failure approach of just one of those systems.
1) Everything on Unix, ruggedised releases of UNIX
2) Every box must be able to FAIL ON ITS OWN
3) Every box must have a direct replacement, or replacements, which carry the SAME LOAD.
4) ZERO total system downtime allowed, partial systems failures are allowed, but core systems must keep running.
5) 5 stages of power supply failure, double mains, double generation and lastly a great big warehouse of car batteries if all else fails.
6) 4 Years of testing of FULL system before live.
This is what is normal when safety is the primary concern. What the FAA decision sounds like is a cost driven process which chose the cheapest solution that "could" meet the requirements.
The idea of a safety critical (if it fails people could die) system that requires a reboot is fine in only one case... if it can be non-operational on a regular basis, in which case it should be done EVERY non-operational window (say every week) , this is therefore okay for some hospital scanners that are certified for 12 hour runs. Its not okay for a 24/7 system that controls objects flying around at 500 miles an hour.
Welcome to the US... we will be landing slightly quicker than expected.
An Eye for an Eye will make the whole world blind - Gandhi
It sounds to me like an application they were running was badly designed to use GetTickCount() as a long-term counter. If so, it's not Win2k's fault.
giant advertisement:
I'm thinking that maybe "the guy that almost crashed a bunch of planes" is not the name they were looking for.(I'm not making this up- that's really the ad I'm seeing.)
314-15-9265
How can you intimate blaming the software company here?
You are joking, right? The majority of accidents happen due to human error. This is supposed to be mission critical software (and there's more than just money at stake). Yet, it relies on needless human intervention once a month! This is simply unacceptable for a piece of software in such a position. The main blame lies in the hands of the comany that provided it, the person who decided to switch to it and the person who decided to bring the new system online and remove the old one despite this flaw. The tecnician is almost irrelevent, since this happening was an inevitibility. It would have happened sooner or later because the system left room in there for human error to happen.
And yet, you still don't blame a company which ships mission critical software which leaves such a huge hole open for human errors. I hope our nuclear power plants are running on better designed stuff.
SJW n. One who posts facts.
Whoah! 7 nines uptime!
22 seconds of downtime per year.
Somebody is on drugs if they sold that. Somebody is on even stronger drugs if they bought that story.
"5 nines", for all intents and purposes, is as good as it gets, with "6 nines" seen as the holy grail. The top HA system I've ever dealt with (running a Telco's billing operation spanning 4 countries!) quoted a figure of 0.999996. To nobody's suprise, it did not run Windows.
Wonder how much their failure clause is going to set them back?
Norman Cook's Ode to Sl
Each M T W Th R S Su 12:45 AM shutdown /l /r /y /c
We have used that exact script for YEARS to nightly reboot a troublesome NT4 BDC at a remote location.
Does it work on Friday? You might want to check on that...
BTM
That was the turning point of my life--I went from negative zero to positive zero.
Search Microsoft's Knowledge Base for "49.7 days", and you'll find a few bugs, all of them related to storing uptime in milliseconds in an unsigned 32-bit integer. Two were reported in Windows 2000:
That rpcss.exe issue looks like a prime suspect. The OS doesn't crash, but, given the time-sensitive nature of air traffic control data, it's quite possible that the applications running on that server would degrade to the point of failure.
Both look like they were found, or at least entered into the KB, after the release of Windows 2000 Service Pack 4 (Nov. 2003), and hotfixes are available for both.
Note to Microsoft (or anyone else storing milliseconds, for that matter): unsigned 64-bit int! Instead of having to reboot every 49.7 days, you'll have to reboot every 213,503,982,334 days, give or take a leap-second.
This sig intentionally left blank.
and yes, 49.7 is an approximate value
The exact value is 49 and 59,929/84,375 days, or 49 days, 17 hours, 2 minutes, and 47.296 seconds (exact).
Hey, news for nerds, what did you expect...
Learn to love Alaska
Ladies and Gentlemen, at this time the Captain would like to ask you to remain seated with your seatbelt firmly fastened, however if there are any computer technicians flying with us today, especially if they know what to do when a 'Fatal Exception has occured at 0029:C02FDEC6', would that person please come forward to the cabin immediately?
Liberals call everyone Nazis yet they are the closest thing to it.
A system was deployed where the application (not the OS) failed after a finite time was deployed knowing it was faulty. An under-trained technician failed to reboot the server as scheduled. There was a backup which we don't have details on. It failed to work as well.
I don't see what the OS has to do with this. It could have been written for *NIX, OS/2, or any other OS. The lessons are two:
Don't deploy flawed software.
Make sure redundant systems work.
As an aside, since we don't know what the backup was, we could hypothetically say that it was the UNIX system that previously was primary that was relegated to backup duty. In that case, it would be a failure of Windows and UNIX at the same time. So, is it that UNIX sucks and is worthless for any important systems, or is it that the people that screwed this up would have screwed up something, no matter what OS they were working with?
Learn to love Alaska
Pilot here, and this has been a well known pecadillo of the tracking system for SoCal Approach for a few years. It's an application problem that came into being after an upgrade of the application, not the OS. It's a memory allocation error that retains some of the old tracking on the system, thus, the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS. Look guys, I'm a Linux user and all, but let's not run around blaming M$ for problems with buggy software apps.
"Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
Three words:
Returns the number of milliseconds since the machine was last booted.
From reading the article, one would surmise that this function is used to assign a timestamp to a particular flight plan or other record. After the machine has been running for 49.7 days, the GetTickCount() function rolls over to zero, which could cause a whole plethora of problems. Almost certainly those problems would include things like corruption of data, lost records, old records showing up as new, application crashes, and, of course, swarms of locusts. The only fix is to reboot.
The developers cleverly noticed the potential disaster before it crashed any planes, and as a workaround, instituted a policy requiring the servers to be rebooted at monthly intervals. Failure to do so would result in the calamities described above.
So while the problem wasn't the old Win95 bug, it was the same crappy windows API that caused both. The POSIX-compliant gettimeofday() function uses a 64-bit structure and does not suffer from the same flaw, and can be relied upon for at least the next 30 years or so (which isn't amazing, but it's a lot better than 50 days).
Note that the FAA insists that they're currently implementing a better solution than "reboot every month". Better hurry, guys, you've only got 47.3 days left.
"With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea...."
RFC 1925
I used to write aviation message handling systems. We migrated from Tru64 (now extinct) to Linux and have had much better: performance, maintainability, hardware support, and reliability.
Of course, the code leap from Tru64 to Linux is quite small, which is the biggest reason why Linux was chosen.
Aviation expects 99.9999% uptime with absolutely no message loss, and we would achieve that with hot-standbys and MySQL mirroring. All circuits were split and would simultaneously enter both servers. Only the primary server would route the message.
No, we didn't require the customer to reboot. The system could run for years at a time.
Putting mission critical applications on Windows 95 is just plain stupid.
There is a rather more extreme case of this with the FAA - when first deployed, the cargo doors of the DC-10 were unsafe, with a failure mode that was likely to make the plane uncontrolable in flight.
This occured in flight, and through luck (which allowed some degree of control) and extraordinary airmanship, the plane was landed safely. (This is known as "The Windsor Incident.")
McDonnell-Douglas didn't want to do a proper redesign of the door mechanism, and the FAA head was a 'companies know best' political appointee, so the result was McD added little windows to the door so that the guy closing the door could look to see it had all engaged properly. (This was over vigourous opposition by the NTSB, who recognized the inadequacy of the fix.)
The situation: A single failure (not looking, or looking but not noticing an unsafe condition) by a non-safety trained close to minimum wage employee could cause the deaths of hundreds of people.
Result: over 300 dead when a Turkish Airlines DC-10 crashed near Paris. The guy who closed the door hadn't even been told he was supposed to check the little windows.
Safety critical systems must be tolerant of human error. If a single omission by a human leads to a hazardous situation, this is primarily the fault of the system, not the human.
Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.
I'm not sure exactly what downtime for routine maintenance on an AIX system running DBase has to do with a Windows bug that causes a system failure. However, in response, there is a difference between planned downtime where a service is made unavailable while planned routine maintenance is performed and planned downtime or an unplanned failure due to a flaw in the system.
:P
It appears that in this case Windows has a flaw which they try to work around with routine maintenance during planned downtime.
In your case I would say you have planned downtime for routine maintenance to work around the need for an appropriate system to handle the work load.
I suppose what is the same between these two cases is that you both need to change your system to something that is more appropriate for the task at hand. And to be more specific in the FCC case, Windows should not be allowed for use in any application where life, limb, or property is at risk. Hmm, I suppose that may rule out just about every use.
burnin
Part of being on the ball in any tech department means having the system up to date. If you don't have it up to date, and an error FOR WHICH A PATCH EXISTS gives you trouble, everyone else in the company should rip your head off. That's inexcusable.
If you install an unpatched version of an OS, and leave it as such, it's your own dumb fault. If a patch is out that fixes the problem, then the problem doesn't exist as far as anyone with half a brain is concerned.
My apologies for the abrasive manner of the response, but patches are around for a reason: to fix known problems.
Patches, do ya have 'em?
Look behind you...