Slashdot Mirror


Windows Upgrade, FAA Error Cause LAX Shutdown

fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"

23 of 862 comments

  1. Repent, Sinners! by mfh · · Score: 5, Insightful

    The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug.

    Okay... a Win95 bug leads to the LAX shutdown because the *same* bug was later found in Win2k? Yup, closed source is the answer, Mr. Gates. I hereby repent my sins of Open Source Freedom and agree that security by obscurity is the answer! /sarcasm

    a technician didn't reboot the system monthly as he should have

    You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

    --
    The dangers of knowledge trigger emotional distress in human beings.
    1. Re:Repent, Sinners! by (H)elix1 · · Score: 5, Insightful

      You have to love a system that requires downtime as part of uptime. How many Linux users have this problem?

      All right, I cannot throw the first stone here. I can raise my hand as an AIX C programmer back in the day...

      We inherited a huge ball of spaghetti wire, nasty stuff that had memory leaks. Rather than taking the time to fix it, the powers that be determined it was better to keep working on new features than to hash out the issues. At first the server needed recycling once a quarter, then once a month, and as time ticked by it became a weekly 'fix'. Lord knows I added to the mix as well, as they picked 'cheap' and 'build it fast' (not to be confused with running fast), skipping 'do it right' entirely. That is how it happens... stuff gets rushed before its time. OSS is more immune than the typical commercial gig, but any time a deadline comes without enough time to finish, something is going to give. Downtime is just duct tape.

    2. Re:Repent, Sinners! by pchan- · · Score: 5, Insightful

      where do you want to go today?

      dear microsoft,

      the above question was posed in a line of your advertisements. well, after spending an hour and a half on a plane on the runway in oakland, and another hour on the runway in l.a. (sunday night), i think i have the answer. i want to go home. sounds like a simple enough request, or so i thought.

      but here is what i really want: i would like you (microsoft, inc.), to stop selling your products to mission critical and infrastructure operations until such a time as they are ready for that job. when my desktop computer at work crashes (admittedly a rare occurrence nowadays), i am inconvenienced. when hundreds of thousands of travellers in airports across the world are delayed because one of the busiest airports in the world is shut down due to a 10 year old known bug in your operating systems that has not been fixed, that is simply not acceptable. i realize that buyers of software and IT systems are easily suckered or bribed into using your systems, that is why i am appealing directly to you. please exit this market before we are forced to legislate you out.

      thanks,
      pc

    3. Re:Repent, Sinners! by claar · · Score: 4, Insightful

      Bah, what a cop out. If "we" won't accept criticisms similar to our own, we have no right to criticize in the first place.

      Yes, init 6 is counter-intuitive. I remember that it actually did confuse me a bit the first time I heard of it. Does that mean we need to remove or change it? Nah, let 'em use `shutdown -r` or `alias restart="init 6"`. But just don't be an apologist for Linux, it just makes "us" look hypocritical.

      --
      I'd give my right arm to be ambidextrous...
    4. Re:Repent, Sinners! by 47Ronin · · Score: 4, Insightful

      Personally, I use "reboot".

      "shutdown -r now" also works (r stands for reboot). To shut down, use -h (for halt).


      Personally I use sudo reboot because I would never log in as root for security/safety reasons.

      --
      Those who laugh at you for you having a Mac.. are the people who constantly call you to fix their PC.
    5. Re:Repent, Sinners! by Awptimus+Prime · · Score: 4, Insightful

      You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

      Well, in the past 10 years I have had a number of clients who have had Linux, Unix, Windows, and Mac systems that were critical to their day to day routine and they did nightly/weekly/monthly reboots as part of their maintenance.

      I guess when you grow up and get out of high school, you will find that your linux box running as a DSL router is not a good example of a production server.

    6. Re:Repent, Sinners! by autopr0n · · Score: 4, Insightful

      Yes, but maybe that was controlled by a cron-job and not some poor person manually initiating it every night? Just like an automated reboot is also not too scary on any decent Unix, but a manual action in MS-world?

      a) This could easily have been done as a scheduled task in Windows 2000.

      b) This could have been done by their code, in Windows 2000 and Windows 95.

      c) Windows 2000 does not require a reboot after 49.7 days. Maybe their software relied on GetTickCount() or something.

      The problem lies with the developers of the software, not Microsoft.

      --
      autopr0n is like, down and stuff.
    7. Re:Repent, Sinners! by ckaminski · · Score: 4, Insightful

      Thankfully, Chicken Little, planes do NOT fall out of the sky during a total air traffic control outage, but control regresses to pencil and paper.

      Your plane *WILL* land. It may be at a different airport, and sooner or later than planned, but you will get on the ground in one piece.

    8. Re:Repent, Sinners! by Awptimus+Prime · · Score: 4, Insightful

      and these are heavy used mail servers.. no need to reboot on a nightly basis!! good grief (charley brown)

      Right, the code used for mail serving is some of the most mature server code out there. This is far more reliable than say a Linux box set up with proprietary, closed src, business applications with their own bugs.

      My feelings are the article may have mistakenly blamed Windows for a problem with one of the server applications running on it. It is not typical for even Win2k to hang unexpectedly when running good hardware and well-written code.

      I say fuck it. There is no point in ever trying to defend logic when it stands in the way of the Microsoft bash-fests on /..

      Just to clarify, I am not saying Windows servers can and will run as reliably as a properly configured BSD, Solaris, or Linux box. I am just trying to state that Windows is reliable, if properly configured, but will probably not win an uptime competition. Big whoop. Reboot your shit during maintenance windows, regardless of OS, and you run a much better chance of finding pending hardware failures. It is much better to powercycle that database server and get an error detecting the SCSI bus during a maintenance window than for it to happen at 5:30AM on a Monday or during your vacation.

      Then again, I could be overly anal. I just like to avoid the reputations gained by those before me. :)

    9. Re:Repent, Sinners! by pchan- · · Score: 5, Insightful

      see what you've done, now i had to go and rtfa just to respond. here's a choice quote:

      The servers are timed to shut down after 49.7 days of use in order to prevent a data overload, a union official told the LA Times. To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days.

      now, let's do a little math. the number of milliseconds in 49.7 days = (49.7 * 24 * 60 * 60 * 1000) = 4,294,080,000. recognize that number? that's right, it's 2^32 (actually, 2^32 is 4,294,967,296, but it's pretty damn close). and why is that significant, you ask? because at 2^32 milliseconds, the 32-bit unsigned int used by some versions of windows to keep the time since boot overflows back to zero, and bad things begin to happen.
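The arithmetic in the paragraph above can be checked in a few lines. This is only a sketch of the calculation; `days_until_wrap` is a made-up helper name, and the assumption is a counter that ticks once per millisecond and is stored as an unsigned 32-bit integer.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def days_until_wrap(bits: int = 32) -> float:
    """Days before an unsigned millisecond counter of `bits` width wraps to zero."""
    return (2 ** bits) / MS_PER_DAY


# Milliseconds in exactly 49.7 days, using integer math to avoid float rounding.
ms_in_49_7_days = 497 * MS_PER_DAY // 10

print(f"2^32 ms            = {2 ** 32:,}")
print(f"49.7 days in ms    = {ms_in_49_7_days:,}")
print(f"days until wrap    = {days_until_wrap():.4f}")
```

The two millisecond figures differ by under a million, which is roughly fifteen minutes, so "49.7 days" is just the wrap point rounded to one decimal place.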

      is the problem microsoft's fault? goddamn right it is. in software that runs A MAJOR AIRPORT and controls the flight control and radar systems that affect thousands of lives in the air, an error like this is just not an option. the people who put this system into production ought to be fired. i don't know what the right os for this task is. solaris? aix? vms? something with provable uptime and reliability, something that can deliver uptime of longer than a month and a half, that's for sure.

      I'm sure Linux doesn't store time in an infinite bit counter either.

      i don't recall advocating linux for the job. maybe it can do it, maybe not. and in regards to being free, when my life is on the line, they better spend every god-damn dollar they can to make sure that critical systems do not fail under any circumstances. microsoft was absolutely the wrong choice in this case.

    10. Re:Repent, Sinners! by Atzanteol · · Score: 4, Insightful

      You don't work with other people much do you? It's probably for the best.

      These things cost money. Migrating apps that use the old DB to the new one, testing, bugs introduced in the migration, etc. If it works most companies will stick with it and not risk spending large amounts of money for no 'gain' (in their mind).

      --
      "Ignorance more frequently begets confidence than does knowledge"

      - Charles Darwin
  2. Re:A hit for the other team... by PPGMD · · Score: 5, Insightful
    The Patriot missile system had a similar problem. Its timing broke down after a period of time without a reboot (it was a much shorter cycle, either one day or one week).

    Microsoft isn't the only one to have issues like that. But it has been patched and there should have been more than enough time for the FAA to test and deploy the patch on the few legacy machines running Windows 95.

    I simply blame the FAA for wasting money every year. Billions are sunk into the system, but rarely does anything come out of it. Lockheed could deploy a complete new system to every airport for the amount of money that is being dumped into the old TRACONs and towers for MX.

  3. Don't be so hasty to blame the OS... by Ann+Elk · · Score: 5, Insightful

    OK, I know it's a violation of /. policy to actually read a referenced article. My bad. But, according to the software.silicon.com article:

    Richard Riggs, an advisor to the technicians union, said the FAA - the American aviation regulator - had been planning to fix the program for some time. "They should have done it before they fielded the system," he said.

    This sounds to me like more of a problem with the application, not the OS. The "system" crashed after 49.7 days, which is about 4 million seconds, which is about 4 billion milliseconds, which is (obviously) MAX_ULONG. I suspect the application is using a ULONG to store a timeout value and got pissed-off when it rolled over.

  4. Re:Anyone want to clue them in to scheduled jobs? by Ann+Elk · · Score: 4, Insightful
    It's obviously lunacy for any company to replace a proven system, which has given years of reliable service...

    It's obvious you have never toured an ARTCC (Air Route Traffic Control Center). The system that is being replaced was barely hanging together by voodoo and chicken wire. It was designed back in the 60's to handle maybe 1/10th the current capacity. It is in dire need of replacement.

    That said, I'm not convinced Windows (or Linux for that matter) is an appropriate OS for an application that practically defines the phrase "mission critical".

  5. Re:Why 49.7 days? by PhrostyMcByte · · Score: 5, Insightful

    It sounds to me like an application they were running was badly designed to use GetTickCount() as a long-term counter. If so, it's not Win2k's fault.
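As a rough illustration of the difference between using a wrapping tick counter for short intervals and using it as a long-term counter, here is a sketch. `tick_count` is a hypothetical stand-in that simulates a 32-bit millisecond counter; it is not the real Windows call, and nothing here is taken from the FAA application.

```python
WRAP = 2 ** 32  # a 32-bit counter holds values 0 .. 2^32 - 1


def tick_count(ms_since_boot: int) -> int:
    """Simulated 32-bit millisecond tick counter (hypothetical stand-in)."""
    return ms_since_boot % WRAP


def elapsed_naive(start: int, now: int) -> int:
    """Plain subtraction: goes badly wrong once the counter has wrapped."""
    return now - start


def elapsed_wrap_safe(start: int, now: int) -> int:
    """Subtraction modulo 2^32: correct for intervals shorter than one wrap."""
    return (now - start) % WRAP


start = tick_count(WRAP - 1000)  # sampled 1 second before the rollover
now = tick_count(WRAP + 500)     # sampled 0.5 seconds after the rollover

print("naive    :", elapsed_naive(start, now))      # large negative number
print("wrap-safe:", elapsed_wrap_safe(start, now))  # 1500 ms
```

The modular version only works for intervals shorter than the wrap period, which is why a counter like this suits timeouts but not anything that has to stay meaningful for months.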

  6. Re:2K is based on NT kernel by gl4ss · · Score: 4, Insightful

    so what if it is "completely different os"? that's the whole point, if it were continuation of the win95 line it would have been fixed!

    now the bug was present in both codebases, but fixed just in one.

    that's at least how the article and the writeup make it sound like.

    --
    world was created 5 seconds before this post as it is.
  7. Re:If it's in the job description... by serviscope_minor · · Score: 5, Insightful

    How can you intimate blaming the software company here?

    You are joking, right? The majority of accidents happen due to human error. This is supposed to be mission critical software (and there's more than just money at stake). Yet, it relies on needless human intervention once a month! This is simply unacceptable for a piece of software in such a position. The main blame lies in the hands of the company that provided it, the person who decided to switch to it and the person who decided to bring the new system online and remove the old one despite this flaw. The technician is almost irrelevant, since this happening was an inevitability. It would have happened sooner or later because the system left room in there for human error to happen.

    And yet, you still don't blame a company which ships mission critical software which leaves such a huge hole open for human errors. I hope our nuclear power plants are running on better designed stuff.

    --
    SJW n. One who posts facts.
  8. What failed? by AK+Marc · · Score: 5, Insightful

    A system in which the application (not the OS) failed after a finite time was deployed by people who knew it was faulty. An under-trained technician failed to reboot the server as scheduled. There was a backup which we don't have details on. It failed to work as well.

    I don't see what the OS has to do with this. It could have been written for *NIX, OS/2, or any other OS. The lessons are two:
    Don't deploy flawed software.
    Make sure redundant systems work.

    As an aside, since we don't know what the backup was, we could hypothetically say that it was the UNIX system that previously was primary that was relegated to backup duty. In that case, it would be a failure of Windows and UNIX at the same time. So, is it that UNIX sucks and is worthless for any important systems, or is it that the people that screwed this up would have screwed up something, no matter what OS they were working with?

  9. ...Blame the API instead by tyler_larson · · Score: 5, Insightful
    This sounds to me like more of a problem with the application, not the OS.

    Three words:

    GetTickCount()

    Returns the number of milliseconds since the machine was last booted.

    From reading the article, one would surmise that this function is used to assign a timestamp to a particular flight plan or other record. After the machine has been running for 49.7 days, the GetTickCount() function rolls over to zero, which could cause a whole plethora of problems. Almost certainly those problems would include things like corruption of data, lost records, old records showing up as new, application crashes, and, of course, swarms of locusts. The only fix is to reboot.
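To make the "old records showing up as new" failure concrete, here is a small sketch. Everything in it is assumed for illustration: the record names are invented, and the article does not say how the real application used the counter.

```python
WRAP = 2 ** 32  # size of a 32-bit millisecond counter

# (description, true milliseconds since boot) for three made-up records
events = [
    ("filed before rollover", WRAP - 2000),
    ("amended before rollover", WRAP - 1000),
    ("filed after rollover", WRAP + 500),
]

# Stamp each record with what a wrapping 32-bit tick counter would report.
stamped = [(name, ms % WRAP) for name, ms in events]

# Sorting by stamp is supposed to give oldest-first order.
ordered = sorted(stamped, key=lambda record: record[1])

for name, stamp in ordered:
    print(f"{stamp:>13,}  {name}")
```

The record written last sorts first, because its stamp restarted near zero. Any logic that picks "the newest record" or expires "the oldest" by comparing stamps would make the wrong choice from that point on.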

    The developers cleverly noticed the potential disaster before it crashed any planes, and as a workaround, instituted a policy requiring the servers to be rebooted at monthly intervals. Failure to do so would result in the calamities described above.

    So while the problem wasn't the old Win95 bug, it was the same crappy windows API that caused both. The POSIX-compliant gettimeofday() function uses a 64-bit structure and does not suffer from the same flaw, and can be relied upon for at least the next 30 years or so (which isn't amazing, but it's a lot better than 50 days).
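The "next 30 years or so" figure lines up with the well-known limit of a signed 32-bit seconds counter. The sketch below assumes the seconds field returned by gettimeofday() is a signed 32-bit value counting from 1 January 1970, which was common on systems of that era but is not guaranteed by POSIX; the 2004 starting year is also an assumption about when the comment was written.

```python
from datetime import datetime, timezone

# Largest value a signed 32-bit seconds counter can hold.
MAX_INT32 = 2 ** 31 - 1

# The last representable moment, counting seconds from the Unix epoch.
last_second = datetime.fromtimestamp(MAX_INT32, tz=timezone.utc)

# Years of headroom, assuming the comment dates from 2004.
years_left = last_second.year - 2004

print("last representable time:", last_second.isoformat())
print("years of headroom from 2004:", years_left)
```

On platforms where the seconds field is 64 bits wide, the limit is far beyond any practical concern.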

    Note that the FAA insists that they're currently implementing a better solution than "reboot every month". Better hurry, guys, you've only got 47.3 days left.

    --
    "With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea...."
    RFC 1925
  10. Pilot or no... by juuri · · Score: 4, Insightful

    ... how does a single app bring down the entire OS? You mean the app can't be restarted and brought back up with the same state at a moment's notice in a mere minute or two?

    Crappy design, regardless of who is at fault.

    --
    --- I do not moderate.
  11. Fire the Department of the Interior's IT staff... by Dr.Dubious+DDQ · · Score: 4, Insightful

    The FAA is under the auspices of the US Department of the Interior, aren't they? You know, the same department that was ordered by a court to take ALL of their systems off line because they were apparently unable to secure them? TWICE? (No, wait, the latter link says THREE times, most recently March 2004...!)

    Is there some secret plot to make them look bad, or is the Department of the Interior riddled with incompetence? I certainly don't feel real secure about the safety of our airlines right now - and it's got nothing to do with "terrorists"...

    (Not to say that terrorism isn't a real concern, but I'm somewhat less worried that their intentional plots will slip through observation by the authorities than "accidental" screwed up software being deployed by the FAA...)

  12. Re:Space Shuttle accidents and software bugs by GlassHeart · · Score: 4, Insightful
    The only regret you'll have from paying for too much quality is the money. You'll have everything to regret from spending on too little quality.

    That's a nice thing for a professor to advocate, but real world projects like the space shuttle do not have an infinite budget to accomplish the assigned task. Therefore, spending too much money on one aspect can mean that another is sacrificed and becomes the point of failure. Therefore, while being responsible for the part that never failed is an understandable source of pride, it may actually reveal a misallocation of resources.

    Engineering is about spending the least amount of time and money to achieve the required quality. Nobody said anything about spending too little.

  13. Incorrect. by jwigum · · Score: 5, Insightful

    Part of being on the ball in any tech department means having the system up to date. If you don't have it up to date, and an error FOR WHICH A PATCH EXISTS gives you trouble, everyone else in the company should rip your head off. That's inexcusable.

    If you install an unpatched version of an OS, and leave it as such, it's your own dumb fault. If a patch is out that fixes the problem, then the problem doesn't exist as far as anyone with half a brain is concerned.

    My apologies for the abrasive manner of the response, but patches are around for a reason: to fix known problems.

    Patches, do ya have 'em?

    --

    Look behind you...