Slashdot Mirror


Windows Upgrade, FAA Error Cause LAX Shutdown

fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"

19 of 862 comments

  1. Why is the FAA using off the shelf software? by Samir+Gupta · · Score: 4, Informative
    This is not an attack on Microsoft.

    But most off-the-shelf software has disclaimers expressly stating that it is not to be used in mission-critical situations. E.g.:

    "Java technology is not fault tolerant and is not designed, manufactured, or intended for use or resale as on-line control equipment in hazardous environments requiring fail-safe performance, such as in the operation of nuclear facilities, aircraft navigation or communication systems, air traffic control, direct life support machines, or weapons systems, in which the failure of Java technology could lead directly to death, personal injury, or severe physical or environmental damage."

    --
    -- Samir Gupta, Ph. D. Head, New Technology Research Group, Nintendo Co. Ltd., Kyoto, Japan.
  2. Why 49.7 days? by FirstTimeCaller · · Score: 4, Informative

    Because there are 4294080000 milliseconds in that time period. Just enough to cause a roll-over when using a 32-bit counter (and yes, 49.7 is an approximate value).

    Very few Win95 systems ever made it that long without a reboot... but you would've thought that it would've been fixed by Windows 2000.

    --
    Wanted: witty unique signature. Must be willing to relocate.
    1. Re:Why 49.7 days? by Holi · · Score: 4, Informative

      This issue has nothing to do with the Win95 bug. That was just the submitter's opinion (which happens to be very wrong).

      --
      Sorry, teleporters just kill you and then make a copy. A perfect, soul-less copy.
    2. Re:Why 49.7 days? by AK+Marc · · Score: 5, Informative

      and yes, 49.7 is an approximate value

      The exact value is 49 and 59,929/84,375 days, or 49 days, 17 hours, 2 minutes, and 47.296 seconds (exact).
      Hey, news for nerds, what did you expect...
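
      For anyone who wants to check the arithmetic above, here is a minimal sketch that assumes nothing more than a counter incremented once per millisecond. The names `MS_PER_DAY` and `rollover_period` are made up for this illustration and are not part of any Windows API.

```python
from fractions import Fraction

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def rollover_period(bits=32):
    """Return (days, hours, minutes, seconds) until a millisecond
    counter that is `bits` wide wraps back to zero."""
    total_ms = 2 ** bits
    days, rem = divmod(total_ms, MS_PER_DAY)
    hours, rem = divmod(rem, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds = Fraction(rem, 1000)  # keep the fractional second exact
    return days, hours, minutes, seconds


days, hours, minutes, seconds = rollover_period()
print(days, hours, minutes, float(seconds))

# Fractional part of a day left over after the whole days, as an exact ratio
print(Fraction(2 ** 32, MS_PER_DAY) - days)
```

      Using `Fraction` rather than floating point keeps the leftover part of a day exact, so it can be compared directly with the ratio quoted above.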

  3. Re:Anyone want to clue them in to scheduled jobs? by dbottaro · · Score: 5, Informative

    Agreed. A well-written AT script, something like this: Each M T W Th F S Su 12:45 AM shutdown /l /r /y /c

    Would do the trick... We have used that exact script for YEARS to nightly reboot a troublesome NT4 BDC at a remote location.

    While we knew that this was not a great solution, no one needed to access the server at that time of night. Any right minded IT person should be able to see the flaw in the FAA's logic.

    --
    Coding my way to the next BSOD!
  4. Re:32 bit timer by Draknor · · Score: 4, Informative

    Parent is right - it's not a bug in Windows itself, but rather in a piece of software running on Windows - from (one of the) FAs:

    Richard Riggs, an advisor to the technicians union, said the FAA - the American aviation regulator - had been planning to fix the program for some time. "They should have done it before they fielded the system," he said.

    (emphasis added)

  5. no such thing as a Windows 2000 49.7 day bug by art123 · · Score: 4, Informative

    There is no such thing as a Windows 2000 49.7 day bug that causes an OS problem.

    The problem here is the software made by Harris does not handle a rollover of the GetTickCount() function turning back to 0. This function counts the number of milliseconds since the OS was last booted so it should be obvious to anybody that the returned unsigned 4 byte integer cannot go on forever.

    So the badly written Harris software has this bug and their solution (which was really not that bad of a work around) was to manually reboot the system every 30 days, but as a fail-safe, they had a scheduled task to do a reboot on the 49th day just in case. The 49th day came because of procedural error.

    There is nothing Microsoft could do to prevent this.
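
    The Harris source is not public, so the following is only a guess at the kind of mistake that breaks at rollover, not their actual code. It emulates a 32-bit tick counter in Python; `MASK32`, `elapsed_ms`, `timed_out_naive` and `timed_out_safe` are hypothetical names for this sketch.

```python
MASK32 = 0xFFFFFFFF  # a DWORD holds values 0 .. 2**32 - 1


def elapsed_ms(start, now):
    """Ticks elapsed between two 32-bit readings.

    Masking the difference mimics unsigned 32-bit subtraction, which
    stays correct as long as the counter wrapped at most once."""
    return (now - start) & MASK32


def timed_out_naive(start, now, limit):
    """Buggy pattern: compares against an absolute deadline that may
    lie beyond the largest value the counter can ever hold."""
    return now >= start + limit


def timed_out_safe(start, now, limit):
    """Wrap-tolerant pattern: compares elapsed time, not absolute time."""
    return elapsed_ms(start, now) >= limit


start = MASK32 - 100  # reading taken 101 ticks before the counter hits 0
now = 400             # reading taken after the counter has wrapped

print(elapsed_ms(start, now))            # true elapsed time across the wrap
print(timed_out_naive(start, now, 500))  # deadline is unreachable
print(timed_out_safe(start, now, 500))
```

    In the naive version the deadline `start + limit` exceeds 2**32 - 1, so no later reading can ever satisfy it and the timeout silently never fires. Comparing elapsed time instead sidesteps that, provided the two readings are less than 49.7 days apart.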

  6. Re:Check out this little pile of bullshit by larien · · Score: 4, Informative
    Welcome to planned vs unplanned downtime; in many cases, a 10 hour outage can still give you a 100% availability if you planned that outage. What they're probably quoting is 0.0000001 unplanned downtime.

    Lies, damned lies and availability stats...
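
    To make the planned vs unplanned distinction concrete, here is a rough sketch of how the same 10-hour outage can be reported two ways. The `availability` function and its parameters are invented for this example; real SLAs define the terms in their own fine print.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability(downtime_hours, period_hours=HOURS_PER_YEAR, planned_hours=0.0):
    """Fraction of scheduled service time during which the system was up.

    Hours declared as planned are removed from both the downtime and the
    scheduled service time, which is how an outage can 'disappear'."""
    scheduled = period_hours - planned_hours
    unplanned = downtime_hours - planned_hours
    return (scheduled - unplanned) / scheduled


# Same 10-hour outage, reported two ways
print(availability(10))                    # counted as unplanned
print(availability(10, planned_hours=10))  # declared as planned maintenance
```

    Declaring the outage as planned takes it out of the denominator as well as the numerator, so the reported figure comes out as a clean 100%.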

  7. Re:2K is based on NT kernel by LostCluster · · Score: 4, Informative

    As many others have pointed out here, it's the same bug that brought down Windows 9x reappearing.

    Just like the "Y2K glitch" was a platform-independent problem based upon the 2-digit-year shorthand causing logical flaws, if you store time in a 32-bit variable by the millisecond... you'll hit the hard limit after about 49.7 days, which is why that number can show up in kernels other than Win9x. If there's no proper handling of that rollover, things go haywire.

  8. Re:Retard by Keith+Russell · · Score: 5, Informative

    Search Microsoft's Knowledge Base for "49.7 days", and you'll find a few bugs, all of them related to storing uptime in milliseconds in an unsigned 32-bit integer. Two were reported in Windows 2000:

    That rpcss.exe issue looks like a prime suspect. The OS doesn't crash, but, given the time-sensitive nature of air traffic control data, it's quite possible that the applications running on that server would degrade to the point of failure.

    Both look like they were found, or at least entered into the KB, after the release of Windows 2000 Service Pack 4 (Nov. 2003), and hotfixes are available for both.

    Note to Microsoft (or anyone else storing milliseconds, for that matter): unsigned 64-bit int! Instead of having to reboot every 49.7 days, you'll have to reboot every 213,503,982,334 days, give or take a leap-second.
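
    The 64-bit figure can be checked with a couple of lines of integer arithmetic. This is just a sketch, and `days_until_wrap` is a name made up for it.

```python
MS_PER_DAY = 86_400_000  # milliseconds in a day


def days_until_wrap(bits):
    """Whole days before an unsigned millisecond counter of the given
    width wraps around to zero."""
    return 2 ** bits // MS_PER_DAY


print(days_until_wrap(32))
print(days_until_wrap(64))
```

    Python integers are arbitrary precision, so 2**64 is computed exactly rather than being rounded through a float.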

    --
    This sig intentionally left blank.
  9. Re:32 bit timer by djwolf · · Score: 3, Informative

    The timer has not been widened to 64 bits; it hasn't been changed for the sake of API compatibility. Microsoft does give you some warning though:

    GetTickCount

    The GetTickCount function retrieves the number of milliseconds that have elapsed since the system was started. It is limited to the resolution of the system timer. To obtain the system timer resolution, use the GetSystemTimeAdjustment function.

    DWORD GetTickCount(void);

    Parameters
    This function has no parameters.
    Return Values
    The return value is the number of milliseconds that have elapsed since the system was started.

    Remarks
    The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days.

    If you need a higher resolution timer, use a multimedia timer or a high-resolution timer.

    To obtain the time elapsed since the computer was started, retrieve the System Up Time counter in the performance data in the registry key HKEY_PERFORMANCE_DATA. The value returned is an 8-byte value. For more information, see Performance Monitoring.

    Example Code
    The following example demonstrates how to use this function to wait for a time interval to pass. Due to the nature of unsigned arithmetic, this code works correctly if the return value wraps one time. If the difference between the two calls to GetTickCount is more than 49.7 days, the return value could wrap more than one time and this code will not work; use the system time instead.

    DWORD dwStart = GetTickCount();

    // Stop if this has taken too long
    if( GetTickCount() - dwStart >= TIMELIMIT )
        Cancel();
    Note that TIMELIMIT is defined as the time interval of interest to the application, in milliseconds.

    Requirements
    Client: Requires Windows XP, Windows 2000 Professional, Windows NT Workstation, Windows Me, Windows 98, or Windows 95.
    Server: Requires Windows Server 2003, Windows 2000 Server, or Windows NT Server.
    Header: Declared in Winbase.h; include Windows.h.
    Library: Use Kernel32.lib.

    --
    ---- I like compilers
  10. The article is light on details... by Ayanami+Rei · · Score: 4, Informative

    It's probably not a Microsoft problem if the system is running on NT; it uses a 64-bit time.

    It _could_ be that an important part of the system is running Windows 95 interfaced to a 2k domain that implements the rest of the system.
    That really isn't Microsoft's fault that they didn't patch that critical machine to fix the flaw... or that they felt they needed to run Windows 95 (gag) in such a critical portion of the system.

    It _could_ be that a user-land air traffic control related application itself calls a deprecated API to return the time in milliseconds, which overflows/wraps around, causing the software to crash.
    OR
    It _could_ be that the user-land air traffic control software just mis-casts the time from the modern API into a 32-bit data structure, which wraps around, causing the software to crash.
    In the latter two cases the article writer or LAX's press staff may have incorrectly drawn the connection to the famous Windows 95 problem... even when it wasn't Microsoft's fault in that case.
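
    The mis-cast scenario is easy to picture with a toy example. This is purely hypothetical (nobody here has seen the actual air traffic control code), and `to_dword` is a made-up helper that mimics storing a 64-bit millisecond count in a 32-bit field.

```python
MS_PER_DAY = 86_400_000  # milliseconds in a day


def to_dword(ticks64):
    """Keep only the low 32 bits, as an unchecked cast from a 64-bit
    value to a 32-bit unsigned field would."""
    return ticks64 & 0xFFFFFFFF


day49 = 49 * MS_PER_DAY  # still fits in 32 bits
day50 = 50 * MS_PER_DAY  # no longer fits

print(to_dword(day49) == day49)           # value survives the cast intact
print(to_dword(day50))                    # high bits are silently dropped
print(to_dword(day50) < to_dword(day49))  # time appears to run backwards
```

    The OS clock is fine in this scenario; it is the application's own 32-bit field that makes the uptime appear to jump backwards somewhere between day 49 and day 50.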

    I really don't see how Microsoft could be to blame here at all...

    --
    THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
  11. It was the app, not the OS by Teahouse · · Score: 5, Informative

    Pilot here, and this has been a well-known peccadillo of the tracking system for SoCal Approach for a few years. It's an application problem that came into being after an upgrade of the application, not the OS. It's a memory allocation error that retains some of the old tracking on the system; thus, the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS. Look guys, I'm a Linux user and all, but let's not run around blaming M$ for problems with buggy software apps.

    --
    "Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
  12. Re:Anyone want to clue them in to scheduled jobs? by Anonymous Coward · · Score: 5, Informative

    I used to write aviation message handling systems. We migrated from Tru64 (now extinct) to Linux and have had much better performance, maintainability, hardware support, and reliability.

    Of course, the code leap from Tru64 to Linux is quite small, which is the biggest reason why Linux was chosen.

    Aviation expects 99.9999% uptime with absolutely no message loss, and we would achieve that with hot-standbys and MySQL mirroring. All circuits were split and would simultaneously enter both servers. Only the primary server would route the message.

    No, we didn't require the customer to reboot. The system could run for years at a time.

    Putting mission critical applications on Windows 95 is just plain stupid.

  13. Re:Heh by Michael+Woodhams · · Score: 5, Informative

    There is a rather more extreme case of this with the FAA - when first deployed, the cargo doors of the DC-10 were unsafe, with a failure mode that was likely to make the plane uncontrollable in flight.

    This occurred in flight, and through luck (which allowed some degree of control) and extraordinary airmanship, the plane was landed safely. (This is known as "The Windsor Incident.")

    McDonnell-Douglas didn't want to do a proper redesign of the door mechanism, and the FAA head was a 'companies know best' political appointee, so the result was McD added little windows to the door so that the guy closing the door could look to see it had all engaged properly. (This was over vigorous opposition by the NTSB, who recognized the inadequacy of the fix.)

    The situation: A single failure (not looking, or looking but not noticing an unsafe condition) by a non-safety trained close to minimum wage employee could cause the deaths of hundreds of people.

    Result: over 300 dead when a Turkish Airlines DC-10 crashed near Paris. The guy who closed the door hadn't even been told he was supposed to check the little windows.

    Safety critical systems must be tolerant of human error. If a single omission by a human leads to a hazardous situation, this is primarily the fault of the system, not the human.

    --
    Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.
  14. Re:Repent, Sinners! by Phillup · · Score: 3, Informative

    Only if that is what you have run level 6 configured to do.

    All the init 6 command does is initialize run level 6. You can have run level 6 configured any way you want.

    It isn't hard wired to shut down. (On debian run level 6 does a reboot... run level 0 halts the system.)

    --

    --Phillip

    Can you say BIRTH TAX
  15. Re:Anyone want to clue them in to scheduled jobs? by agallagh42 · · Score: 3, Informative
    "Since when does Windows 2000 include a "shutdown" command?"

    Uh, since about 2000 I believe. :)
    C:\>shutdown /?
    Usage: shutdown [-i | -l | -s | -r | -a] [-f] [-m \\computername] [-t xx] [-c "comment"] [-d up:xx:yy]

    No args Display this message (same as -?)
    -i Display GUI interface, must be the first option
    -l Log off (cannot be used with -m option)
    -s Shutdown the computer
    -r Shutdown and restart the computer
    -a Abort a system shutdown
    -m \\computername Remote computer to shutdown/restart/abort
    -t xx Set timeout for shutdown to xx seconds
    -c "comment" Shutdown comment (maximum of 127 characters)
    -f Forces running applications to close without warning
    -d [u][p]:xx:yy The reason code for the shutdown
    u is the user code
    p is a planned shutdown code
    xx is the major reason code (positive integer less than 256)
    yy is the minor reason code (positive integer less than 65536)

    C:\>
    --
    Carpe Cerevisi - Seize the Beer
  16. Re:I Hate to Say It by AstroDrabb · · Score: 4, Informative
    Funny, nowhere in the doc for GetTickCount() does it say it is deprecated and not to use it. The only thing it does say is "If you need a higher resolution timer, use a multimedia timer or a high-resolution timer." I don't know what the program needs since I did not write it nor have I seen the code. Maybe they didn't need a high-res timer and wanted a tick count for how long the system has been up? I don't think that is too much to ask from an OS.

    The GetSystemTimeAsFileTime() function retrieves the current system date and time. The information is in Coordinated Universal Time (UTC) format. It doesn't tell you how long the system has been up.

    Oh, and if MS did not think this is a problem why did they fix it in a WinNT service pack? Also, right in that link MS says

    Microsoft has confirmed that this is a problem in Windows NT 4.0 and Windows NT Server 4.0, Terminal Server Edition. This problem was first corrected in Windows NT 4.0 Service Pack 4.0 and Windows NT Server 4.0, Terminal Server Edition Service Pack 4.

    MS also didn't seem to fix it in Win2000 Server, and their own engineers got hurt by it, specifically with Rpcss.exe, which according to MS:

    SYMPTOMS
    The Rpcss.exe process consumes 60 percent or more of CPU time, and system performance and network performance are affected. This symptom typically occurs 49.7 days after the server is started.
    CAUSE
    This problem occurs because a call to the GetTickCount timer function causes the function to overflow 49.7 days after the server is started.
    If GetTickCount is "deprecated" as you state, why in the world are MS's own programmers using it in rpcss.exe? According to this site:
    rpcss.exe is an executable of Microsoft Windows Operating System. It is responsible for Remote Procedure Call services on the local machine. These are public services available to the local network. This program is important for the stable and secure running of your computer and should not be terminated.

    Still not convinced and want to apologize for MS? Well, here is some more of MS's software that is affected by it in Windows 2000 servers (what this FAA project is using).
    Print Spooler Stops Scheduling Print Jobs

    The Print Spooler service may stop scheduling print jobs to specific Simple Port Monitor (SPM) ports. Although incoming jobs are queuing into the spooler, print jobs may not start. Note that this symptom occurs 49.7 days after you start the Print Spooler service.

    There are a bunch of MS apps affected by this logic flaw that has been passed from version to version of MS OSes. If this flaw affected all these MS developers who have far more access to proprietary docs, I don't see how other developers would not stumble over it as well since they do not have access to the proprietary OS.

    --
    If Tyranny and Oppression come to this land,
    it will be in the guise of fighting a foreign enemy. -James Madison
  17. Re:Repent, Sinners! by multipartmixed · · Score: 3, Informative

    > since non-SQL formats like DBase have always been
    > a little funky when they start having to deal
    > with million-record tables.

    Oh, yes, SQL the magic bullet. I have a database problem! No matter what it is, I can solve it by migrating to a database system which uses SQL!

    > It's amazing how ugly legacy databases can be
    > compared to today's tech.

    Yes, today's tech! SQL, the magic bullet! Why, we should use Oracle! It's SQL and thus must be modern! It's only been around since 1979!

    Wait!

    1979 was a long time ago.

    Oh, dear?

    Could it be that Oracle is not modern tech? But, how could it not be? It uses SQL, the magic bullet!

    Hint: query language and scalability are not related.
    Hint II: RDBMS is no magic bullet, either.

    --

    Do daemons dream of electric sleep()?