Slashdot Mirror


Windows Upgrade, FAA Error Cause LAX Shutdown

fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"

69 of 862 comments

  1. Repent, Sinners! by mfh · · Score: 5, Insightful

    The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug.

    Okay... a Win95 bug leads to the LAX shutdown because the *same* bug was later found in Win2k? Yup, closed source is the answer, Mr. Gates. I hereby repent my sins of Open Source Freedom and agree that security by obscurity is the answer! /sarcasm

    a technician didn't reboot the system monthly as he should have

    You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

    --
    The dangers of knowledge trigger emotional distress in human beings.
    1. Re:Repent, Sinners! by Da+Twink+Daddy · · Score: 5, Funny

      You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

      Sure,

      init 6
      doesn't sound like it should start (initialize) anything...
    2. Re:Repent, Sinners! by (H)elix1 · · Score: 5, Insightful

      You have to love a system that requires downtime as part of uptime. How many Linux users have this problem?

      All right, I cannot throw the first stone here. I can raise my hand as an AIX C programmer back in the day...

      We inherited a huge ball of spaghetti wire, nasty stuff that had memory leaks. Rather than taking the time to fix it, the powers that be determined it was better to keep working on new features rather than hash out the issues. At first it happened once a quarter, then once a month, and as time ticked by a weekly 'fix' to recycle the server. Lord knows I added to the mix as well, as they picked 'cheap' and 'build it fast' (not to be confused with running fast), skipping the entire 'do it right'. That is how it happens... stuff gets rushed before its time. OSS is more immune than the typical commercial gig, but anytime a deadline comes without enough time to finish, something is going to give. Downtime is just duct tape.

    3. Re:Repent, Sinners! by pchan- · · Score: 5, Insightful

      where do you want to go today?

      dear microsoft,

      the above question was posed in a line of your advertisements. well, after spending an hour and a half on a plane on the runway in oakland, and another hour on the runway in l.a. (sunday night), i think i have the answer. i want to go home. sounds like a simple enough request, or so i thought.

      but here is what i really want: i would like you (microsoft, inc.), to stop selling your products to mission critical and infrastructure operations until such a time as they are ready to do so. when my desktop computer at work crashes (admittedly a rare occurrence nowadays), i am inconvenienced. when hundreds of thousands of travellers in airports across the world are delayed because one of the busiest airports in the world is shut down due to a 10 year old known bug in your operating systems that has not been fixed, that is simply not acceptable. i realize that buyers of software and IT systems are easily suckered or bribed into using your systems, that is why i am appealing directly to you. please exit this market before we are forced to legislate you out.

      thanks,
      pc

    4. Re:Repent, Sinners! by claar · · Score: 4, Insightful

      Bah, what a cop out. If "we" won't accept criticisms similar to our own, we have no right to criticize in the first place..

      Yes, init 6 is counter-intuitive. I remember that it actually did confuse me a bit the first time I heard of it. Does that mean we need to remove or change it? Nah, let 'em use `shutdown -r` or `alias restart="init 6"`. But just don't be an apologist for Linux, it just makes "us" look hypocritical.

      --
      I'd give my right arm to be ambidextrous...
    5. Re:Repent, Sinners! by 47Ronin · · Score: 4, Insightful

      Personally, I use "reboot".

      "shutdown -r now" also works (r stands for reboot). To shut down, use -h (for halt).


      Personally i use sudo reboot because I would never login as root for security/safety reasons.

      --
      Those who laugh at you for you having a Mac.. are the people who constantly call you to fix their PC.
    6. Re:Repent, Sinners! by Hatta · · Score: 4, Funny

      Personally i use sudo reboot because I would never login as root for security/safety reasons.

      Funny, those are the only reasons I ever log in as root.

      --
      Give me Classic Slashdot or give me death!
    7. Re:Repent, Sinners! by Awptimus+Prime · · Score: 4, Insightful

      You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

      Well, in the past 10 years I have had a number of clients who have had Linux, Unix, Windows, and Mac systems that were critical to their day to day routine and they did nightly/weekly/monthly reboots as part of their maintenance.

      I guess when you grow up and get out of high school, you will find that your linux box running as a DSL router is not a good example of a production server.

    8. Re:Repent, Sinners! by secolactico · · Score: 4, Funny

      Golly gee-whiz, if someone is too stupid to migrate a million-record dBase table to SQL, he only deserves a real good whacking (and a career re-orientation into would you like grits with that ???)...

      Most of the time it is not because of the inability of the database tech, but the "hey, it's been working so far" attitude of the decision makers.

      Maybe the powers that be are allergic to Open Source solutions and commercial databases can be expensive. Maybe the client applications are tied to the current system and porting them would be too expensive (example, POS systems).

      I can imagine the conversation:

      - "We are closed at night anyway"
      - "Yes, boss, but recovering from a failure (knock on wood) can be too difficult in the current system"
      - "Well, that's what we are paying you for"
      - "Yes, sir. Thank you, sir. Would you like grits with that?"

      --
      No sig
    9. Re:Repent, Sinners! by autopr0n · · Score: 4, Insightful

      Yes, but maybe that was controlled by a cron-job and not some poor person manually initiating it every night? Just like an automated reboot is also not too scary on any decent Unix, but a manual action in MS-world?

      a) This could easily have been done as a scheduled task in Windows 2000.

      b) This could have been done by their code, in Windows 2000 and Windows 95.

      c) Windows 2000 does not require a reboot after 49.7 days. Maybe their software relied on GetTickCount() or something.

      The problem lies with the developers of the software, not Microsoft.

      --
      autopr0n is like, down and stuff.
    10. Re:Repent, Sinners! by ckaminski · · Score: 4, Insightful

      Thankfully, Chicken Little, planes do NOT fall out of the sky during a total air traffic control outage, but control regresses to pencil and paper.

      Your plane *WILL* land. It may be at a different airport, and sooner or later than planned, but you will get on the ground in one piece.

    11. Re:Repent, Sinners! by Awptimus+Prime · · Score: 4, Insightful

      and these are heavy used mail servers.. no need to reboot on a nightly basis!! good grief (charley brown)

      Right, the code used for mail serving is some of the most mature server code out there. This is far more reliable than say a Linux box set up with proprietary, closed src, business applications with their own bugs.

      My feelings are the article may have mistakenly blamed Windows for a problem with one of the server applications running on it. It is not typical for even Win2k to hang unexpectedly when running good hardware and well-written code.

      I say fuck it. There is no point in ever trying to defend logic when it stands in the way of the Microsoft bash-fests on /..

      Just to clarify, I am not saying Windows servers can and will run as reliably as a properly configured BSD, Solaris, or Linux box. I am just trying to state that Windows is reliable, if properly configured, but will probably not win an uptime competition. Big whoop. Reboot your shit during maintenance windows, regardless of OS, you run a much better chance of finding pending hardware failures. It is much better to powercycle that database server and get an error detecting the SCSI bus during a maintenance window than for it to happen at 5:30AM on a Monday or during your vacation.

      Then again, I could be overly anal. I just like to avoid the reputations gained by those before me. :)

    12. Re:Repent, Sinners! by pchan- · · Score: 5, Insightful

      see what you've done, now i had to go and rtfa just to respond. here's a choice quote:

      The servers are timed to shut down after 49.7 days of use in order to prevent a data overload, a union official told the LA Times. To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days.

      now, let's do a little math. the number of milliseconds in 49.7 days = (49.7 * 24 * 60 * 60 * 1000) = 4,294,080,000. recognize that number? that's right, it's 2^32 (actually, this is: 4,294,967,296, but it's pretty damn close). and why is that significant, you ask? because at 2^32, the unsigned int used by some versions of windows to keep the time since boot overflows back to zero, and bad things begin to happen.

      is the problem microsoft's fault? goddamn right it is. in software that runs A MAJOR AIRPORT and controls the flight control and radar systems that affect thousands of lives in the air, an error like this is just not an option. the people who put this system into production ought to be fired. i don't know what the right os for this task is. solaris? aix? vms? something with provable uptime and reliability, something that can deliver uptime of longer than a month and a half, that's for sure.

      I'm sure Linux doesn't store time in an infinite bit counter either.

      i don't recall advocating linux for the job. maybe it can do it, maybe not. and in regards to being free, when my life is on the line, they better spend every god-damn dollar they can to make sure that critical systems do not fail under any circumstances. microsoft was absolutely the wrong choice in this case.
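
      The arithmetic in this comment can be checked with a short sketch. It assumes, as the comment does, an uptime counter that ticks once per millisecond and is held in an unsigned 32-bit integer; the `tick` helper is hypothetical and only simulates that wrap, it is not any Windows API.

```python
# Assumption: a millisecond uptime counter stored in an unsigned 32-bit integer.
MS_PER_DAY = 24 * 60 * 60 * 1000
COUNTER_MODULUS = 2 ** 32  # an unsigned 32-bit value holds 0 .. 2**32 - 1

# Integer form of 49.7 * MS_PER_DAY, avoiding floating-point rounding.
ms_in_49_7_days = 497 * MS_PER_DAY // 10
shortfall_ms = COUNTER_MODULUS - ms_in_49_7_days


def tick(counter: int, elapsed_ms: int) -> int:
    """Advance a simulated 32-bit millisecond counter, wrapping on overflow."""
    return (counter + elapsed_ms) % COUNTER_MODULUS


print(ms_in_49_7_days)               # 4294080000
print(COUNTER_MODULUS)               # 4294967296
print(shortfall_ms)                  # 887296 ms, under 15 minutes short of the wrap
print(tick(COUNTER_MODULUS - 1, 1))  # 0: one more millisecond rolls over to zero
```

      So "49.7 days" is the rounded figure; the counter actually wraps a little under 15 minutes after that mark.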

    13. Re:Repent, Sinners! by Atzanteol · · Score: 4, Insightful

      You don't work with other people much do you? It's probably for the best.

      These things cost money. Migrating apps that use the old DB to the new one, testing, bugs introduced in the migration, etc. If it works most companies will stick with it and not risk spending large amounts of money for no 'gain' (in their mind).

      --
      "Ignorance more frequently begets confidence than does knowledge"

      - Charles Darwin
  2. "Upgrade"? by thelenm · · Score: 5, Funny

    "Upgrade" from Unix to Windows, eh. You keep using that word. I do not think it means what you think it means.

    --
    Use Ctrl-C instead of ESC in Vim!
  3. Why is the FAA using off the shelf software? by Samir+Gupta · · Score: 4, Informative
    This is not an attack on Microsoft.

    But most off the shelf software have disclaimers expressly stating they are not to be used in mission critical situations. Eg:

    "technology is not fault tolerant and is not designed, manufactured, or intended for use or resale as on-line control equipment in hazardous environments requiring fail-safe performance, such as in the operation of nuclear facilities, aircraft navigation or communication systems, air traffic control, direct life support machines, or weapons systems, in which the failure of Java technology could lead directly to death, personal injury, or severe physical or environmental damage."

    --
    -- Samir Gupta, Ph. D. Head, New Technology Research Group, Nintendo Co. Ltd., Kyoto, Japan.
  4. What?! by ottergoose · · Score: 5, Funny

    I thought switching to Windows from *nix saved time, money, and hassle! Haven't you guys seen those banner ads here?

  5. Heather Locklear by Billy+Donahue · · Score: 4, Funny


    To the rescue!
    http://www.nbc.com/LAX/

    --
    -- The Funk, The Whole Funk, And Nothing But The Funk
  6. Upgrade from UNIX to Windows.. by Anonymous Coward · · Score: 4, Funny

    "This happened after an upgrade from Unix to Windows."

    That's the funniest thing I heard all day. Windows is an upgrade from unix. I almost choked on my coffee.

    1. Re:Upgrade from UNIX to Windows.. by Mateito · · Score: 4, Funny
      I almost choked on my coffee.

      Try preparing the coffee with some sort of liquid. I recommend water.

      You don't get the instant caffeine high like you do with chewing the beans*, but it does go down easier**

      *Yes, I do this. Chocolate coated coffee beans rock

      **Unless it's Starbucks, which needs a shot of snotberry flavoring to make it tolerable.

  7. Re:Heh by Nuclear+Elephant · · Score: 4, Funny

    It's an upgrade because it helps to create thousands of jobs for full-time system power cycling engineers.

  8. Why 49.7 days? by FirstTimeCaller · · Score: 4, Informative

    Because there are 4294080000 milliseconds in that time period. Just enough to cause a roll-over when using a 32-bit counter (and yes, 49.7 is an approximate value).

    Very few Win95 systems ever made it that long without a reboot... but you would've thought that it would've been fixed by Windows 2000.

    --
    Wanted: witty unique signature. Must be willing to relocate.
    1. Re:Why 49.7 days? by Holi · · Score: 4, Informative

      It was. This issue has nothing to do with the Win95 bug; it was just the submitter's opinion (which happens to be very wrong).

      --
      Sorry, teleporters just kill you and then make a copy. A perfect, soul-less copy.
    2. Re:Why 49.7 days? by PhrostyMcByte · · Score: 5, Insightful

      It sounds to me like an application they were running was badly designed to use GetTickCount() as a long-term counter. If so, it's not Win2k's fault.

    3. Re:Why 49.7 days? by AK+Marc · · Score: 5, Informative

      and yes, 49.7 is an approximate value

      The exact value is 49 and 59,929/84,375 days, or 49 days, 17 hours, 2 minutes, and 47.296 seconds (exact).
      Hey, news for nerds, what did you expect...
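
      For anyone who wants to verify the exact figure, here is a small sketch. The only assumption is the one already in the thread: a counter that rolls over after 2^32 milliseconds.

```python
from fractions import Fraction

MS_PER_DAY = 24 * 60 * 60 * 1000
ROLLOVER_MS = 2 ** 32  # assumed rollover point of a 32-bit millisecond counter

# Exact rational number of days, then split into whole and fractional parts.
days_exact = Fraction(ROLLOVER_MS, MS_PER_DAY)
whole_days = days_exact.numerator // days_exact.denominator
fractional_day = days_exact - whole_days

# The same quantity broken down into days, hours, minutes, seconds.
days, rem = divmod(ROLLOVER_MS, MS_PER_DAY)
hours, rem = divmod(rem, 3_600_000)
minutes, rem = divmod(rem, 60_000)
seconds, millis = divmod(rem, 1000)

print(whole_days, fractional_day)  # 49 59929/84375
print(f"{days} d {hours} h {minutes} min {seconds}.{millis:03d} s")  # 49 d 17 h 2 min 47.296 s
```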

  9. They said Windows 98 or Better by www.sorehands.com · · Score: 4, Funny

    So I installed Linux.

  10. 32 bit timer by charnov · · Score: 5, Interesting

    This old error was from the use of a 32 bit 1 ms increment timer (comes out to 49.7 days until rollover). AFAIK, this was fixed in Win2k and above when the timer got bumped to 64 bit. Maybe whoever set up LAX was using some ancient legacy middleware that used the old timer. This is just bizarre. In both locations that I have worked the last three years, none of the Win2k or Win2k3 servers went down ever. Sounds like bad consultants.

    --
    [RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
    1. Re:32 bit timer by Draknor · · Score: 4, Informative

      Parent is right - it's not a bug in Windows itself, but rather a piece of software running on Windows - from (one of the)FA's:

      Richard Riggs, an advisor to the technicians union, said the FAA - the American aviation regulator - had been planning to fix the program for some time. "They should have done it before they fielded the system," he said.

      (emphasis added)

  11. Re:I Hate to Say It by multimed · · Score: 4, Funny
    No way is it Microsoft's fault. It even says so in their EULA...

    I'm still amused & surprised the poster left off the quotes as in "upgrade" from Unix to Windows.

    --
    Vote Quimby.
  12. Check out this little pile of bullshit by Trailer+Trash · · Score: 5, Interesting

    The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999

    Okay, bullshit. If I have to reboot a server every month, .0000001 of a month is- oh, let's be generous and only count months with 31 days- about .26 seconds. That's a damned fast boot time for Win2K.

    Maybe they left off a percent sign?
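
    The 0.26-second figure can be reproduced with a quick calculation. This is a sketch that takes the quoted availability of 0.9999999 at face value and, like the comment, uses a 31-day month as the reboot interval.

```python
from fractions import Fraction

# Quoted availability, parsed exactly so that 1 - 0.9999999 is exactly 1e-7.
UNAVAILABILITY = 1 - Fraction("0.9999999")
SECONDS_PER_31_DAY_MONTH = 31 * 24 * 60 * 60

# Downtime the quoted availability would allow in one 31-day month.
allowed_downtime_s = float(UNAVAILABILITY * SECONDS_PER_31_DAY_MONTH)
print(allowed_downtime_s)  # 0.26784
```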

    1. Re:Check out this little pile of bullshit by larien · · Score: 4, Informative
      Welcome to planned vs unplanned downtime; in many cases, a 10 hour outage can still give you a 100% availability if you planned that outage. What they're probably quoting is 0.0000001 unplanned downtime.

      Lies, damned lies and availability stats...

  13. Re:Anyone want to clue them in to scheduled jobs? by dbottaro · · Score: 5, Informative

    Agreed. A well written AT script something like this: Each M T W Th R S Su 12:45 AM shutdown /l /r /y /c

    Would do the trick... We have used that exact script for YEARS to nightly reboot a troublesome NT4 BDC at a remote location.

    While we knew that this was not a great solution, no one needed to access the server at that time of night. Any right minded IT person should be able to see the flaw in the FAA's logic.

    --
    Coding my way to the next BSOD!
  14. Re:A hit for the other team... by PPGMD · · Score: 5, Insightful
    The patriot missile system had a similar problem. Its timing broke down after a period of time without a reboot (it was a much shorter cycle, either one day or one week).

    Microsoft isn't the only one to have issues like that. But it has been patched and there should have been more than enough time for the FAA to test and deploy the patch on the few legacy machines running Windows 95.

    I simply blame the FAA for wasting money away every year, billions are sunk into the system, but rarely does anything come out of it, Lockheed can deploy a complete new system to every airport for the amount of money that is being dumped into the old TRACONs and towers for MX.

  15. 49.7 days by k4_pacific · · Score: 5, Funny

    I remember back when that bug was announced. Seems it was at least a couple of years after Windows 95 had been out. I guess they had to work through a lot of other bugs to get Windows 95 to make it long enough for this bug to occur.

    --
    Unknown host pong.
  16. Re:Migration by legirons · · Score: 5, Funny

    "Why did they move from Unix to Windows in the first place?"

    Maybe they didn't want to have to reboot on January 19, 2038
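
    The date in the joke is where a signed 32-bit count of seconds since the Unix epoch runs out. A minimal sketch, assuming a 32-bit signed `time_t`:

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_SIGNED_32 = 2 ** 31 - 1  # largest value of a signed 32-bit seconds counter

# Last moment representable before the counter overflows.
last_representable = UNIX_EPOCH + timedelta(seconds=MAX_SIGNED_32)
print(last_representable.isoformat())  # 2038-01-19T03:14:07+00:00
```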

  17. Don't be so hasty to blame the OS... by Ann+Elk · · Score: 5, Insightful

    OK, I know it's a violation of /. policy to actually read a referenced article. My bad. But, according to the software.silicon.com article:

    Richard Riggs, an advisor to the technicians union, said the FAA - the American aviation regulator - had been planning to fix the program for some time. "They should have done it before they fielded the system," he said.

    This sounds to me like more of a problem with the application, not the OS. The "system" crashed after 49.7 days, which is about 4 million seconds, which is about 4 billion milliseconds, which is (obviously) MAX_ULONG. I suspect the application is using a ULONG to store a timeout value and got pissed-off when it rolled over.

  18. Lessons from other Aviation Authorities by MosesJones · · Score: 5, Interesting


    I worked for around 5 years in Air Traffic Control projects, both in delivery of radar processing and displays and in R&D for next generation systems.

    Let me give you an overview of the failure approach of just one of those systems.

    1) Everything on Unix, ruggedised releases of UNIX

    2) Every box must be able to FAIL ON ITS OWN

    3) Every box must have a direct replacement, or replacements, which carry the SAME LOAD.

    4) ZERO total system downtime allowed, partial systems failures are allowed, but core systems must keep running.

    5) 5 stages of power supply failure, double mains, double generation and lastly a great big warehouse of car batteries if all else fails.

    6) 4 Years of testing of FULL system before live.

    This is what is normal when safety is the primary concern. What the FAA decision sounds like is a cost driven process which chose the cheapest solution that "could" meet the requirements.

    The idea of a safety critical (if it fails people could die) system that requires a reboot is fine in only one case... if it can be non-operational on a regular basis, in which case it should be done EVERY non-operational window (say every week); this is therefore okay for some hospital scanners that are certified for 12 hour runs. It's not okay for a 24/7 system that controls objects flying around at 500 miles an hour.

    Welcome to the US... we will be landing slightly quicker than expected.

    --
    An Eye for an Eye will make the whole world blind - Gandhi
  19. Seen this week at various airports by whoever57 · · Score: 4, Interesting

    This week, while flying, I saw:
    1. Windows-based terminal used by the public to print tickets (I think) with a "you have chosen to download a file, what do you want to do with it: save, open" or similar (I don't recall the exact wording).

    2. A windows-based machine that was part of the baggage scanning setup at Chicago-O'Hare going through a scandisk process. OK, this may have been due to operators turning the machine off using the power switch, but should not such a machine use a read-only boot drive/partition?

    Do you feel more secure?

    --
    The real "Libtards" are the Libertarians!
  20. no such thing as a Windows 2000 49.7 day bug by art123 · · Score: 4, Informative

    There is no such thing as a Windows 2000 49.7 day bug that causes an OS problem.

    The problem here is the software made by Harris does not handle a rollover of the GetTickCount() function turning back to 0. This function counts the number of milliseconds since the OS was last booted so it should be obvious to anybody that the returned unsigned 4 byte integer cannot go on forever.

    So the badly written Harris software has this bug and their solution (which was really not that bad of a work around) was to manually reboot the system every 30 days, but as a fail-safe, they had a scheduled task to do a reboot on the 49th day just in case. The 49th day came because of procedural error.

    There is nothing Microsoft could do to prevent this.
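
    To illustrate what "handling the rollover" can look like, here is a sketch in Python that simulates a 32-bit tick counter. It is not the Harris code, which is not shown anywhere in this thread; the function names are hypothetical, and the modular subtraction is just one common idiom for measuring intervals across a wrap.

```python
# Assumption: ticks behave like GetTickCount() values, i.e. unsigned 32-bit
# millisecond counts that wrap to zero after 2**32 - 1.
TICK_MODULUS = 2 ** 32


def naive_elapsed(start_tick: int, now_tick: int) -> int:
    """Plain subtraction: goes negative once the counter has wrapped."""
    return now_tick - start_tick


def wrap_safe_elapsed(start_tick: int, now_tick: int) -> int:
    """Subtract modulo 2**32, as unsigned 32-bit arithmetic would."""
    return (now_tick - start_tick) % TICK_MODULUS


start = TICK_MODULUS - 500  # sampled 500 ms before the rollover
now = 1_500                 # sampled 1500 ms after the rollover
print(naive_elapsed(start, now))      # -4294965296
print(wrap_safe_elapsed(start, now))  # 2000
```

    The modular form only gives the right answer for intervals shorter than one full wrap (about 49.7 days), so it suits timeouts and short measurements rather than long-term clocks.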

  21. You insensitive clod by rutledjw · · Score: 4, Interesting
    As a PHB, I resemble that remark! Clearly you do not appreciate the fine art which is combining management and technical decision-making. Neither does my parent corp.

    I have the distinct, but sadly not unusual, pleasure of watching my company execute a brilliant strategy of:

    1. Outsourcing Data Center Operations (systems that used to be down for seconds a year are now down for days and in some cases weeks per year)
    2. Outsource development to India (which has been a mess I won't use the foul language to describe) _AND_
    3. Squeeze remaining people to make up for items 1 and 2!

    Since becoming a PHB (although I still do architecture work - thankfully), I've found that mindless, boneheaded, sweeping decisions are usually driven by some empty-suit, bean-counting, incompetent, barely literate, sh!t-for-brains sycophant who found themselves in an executive position purely by accident. We're "encouraged" to support their "strategies". Indeed...

    It's a much higher order PHB. Kinda like a 4th degree black-belt, but not.

    --

    Computer Science is Applied Philosophy
  22. Re:Anyone want to clue them in to scheduled jobs? by mekkab · · Score: 4, Interesting

    It's obviously lunacy for any company to replace a proven system, which has given years of reliable service with some piece of trash that crashes if left running for over a month

    What if that proven system is decaying out from under you? HD's failing, memory going bad... Tell you what, can you get me new boards for an IBM RT pc? I highly doubt it.

    What about "olde" mainframes running assembler code? The pool of expertise is drying up... sometimes you need to pitch the hardware.

    --
    In the future, I would want to not be isolated from my friends in the Space Station.
  23. Re:Anyone want to clue them in to scheduled jobs? by Ann+Elk · · Score: 4, Insightful
    It's obviously lunacy for any company to replace a proven system, which has given years of reliable service...

    It's obvious you have never toured an ARTCC (Air Route Traffic Control Center). The system that is being replaced was barely hanging together by voodoo and chicken wire. It was designed back in the 60's to handle maybe 1/10th the current capacity. It is in dire need of replacement.

    That said, I'm not convinced Windows (or Linux for that matter) is an appropriate OS for an application that practically defines the phrase "mission critical".

  24. gettickcount maybe? by plopez · · Score: 4, Funny

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp

    Sounds like whoever wrote the software/OS module they were relying on used this gem. I hereby dub whosoever was so silly as to do this as a 'code monkey, first class'.

    --
    putting the 'B' in LGBTQ+
  25. Ouch, poor ad placement by Eric+Seppanen · · Score: 5, Funny
    Headline:
    Microsoft server crash nearly causes 800-plane pile-up
    failure to restart system caused data overload

    giant advertisement:

    Make a name for yourself with Windows Server System
    I'm thinking that maybe "the guy that almost crashed a bunch of planes" is not the name they were looking for.

    (I'm not making this up- that's really the ad I'm seeing.)

    --
    314-15-9265
  26. Space Shuttle accidents and software bugs by BlueUnderwear · · Score: 4, Interesting
    Was at JAOO today, and on the closing panel discussion for the Test-Driven Development track, Mr Kevlin Henney was praising NASA's rigorous software testing procedures. He was so proud of them that he let out a "and in both space shuttle crashes, software was not to blame". Well, this may be correct if he was thinking only about the flight software... but there is other software than what rides in the shuttle itself...

    --
    Say no to software patents.
    1. Re:Space Shuttle accidents and software bugs by GlassHeart · · Score: 4, Insightful
      The only regret you'll have from paying for too much quality is the money. You'll have everything to regret from spending on too little quality.

      That's a nice thing for a professor to advocate, but real world projects like the space shuttle do not have an infinite budget to accomplish the assigned task. Therefore, spending too much money on one aspect can mean that another is sacrificed and becomes the point of failure. Therefore, while being responsible for the part that never failed is an understandable source of pride, it may actually reveal a misallocation of resources.

      Engineering is about spending the least amount of time and money to achieve the required quality. Nobody said anything about spending too little.

  27. Re:2K is based on NT kernel by gl4ss · · Score: 4, Insightful

    so what if it is "completely different os"? that's the whole point, if it were continuation of the win95 line it would have been fixed!

    now the bug was present in both codebases, but fixed just in one.

    that's at least how the article and the writeup make it sound like.

    --
    world was created 5 seconds before this post as it is.
  28. Re:If it's in the job description... by serviscope_minor · · Score: 5, Insightful

    How can you intimate blaming the software company here?

    You are joking, right? The majority of accidents happen due to human error. This is supposed to be mission critical software (and there's more than just money at stake). Yet, it relies on needless human intervention once a month! This is simply unacceptable for a piece of software in such a position. The main blame lies in the hands of the company that provided it, the person who decided to switch to it and the person who decided to bring the new system online and remove the old one despite this flaw. The technician is almost irrelevant, since this happening was an inevitability. It would have happened sooner or later because the system left room in there for human error to happen.

    And yet, you still don't blame a company which ships mission critical software which leaves such a huge hole open for human errors. I hope our nuclear power plants are running on better designed stuff.

    --
    SJW n. One who posts facts.
  29. Re:2K is based on NT kernel by LostCluster · · Score: 4, Informative

    As many others have pointed out here, it's the same bug that brought down Windows 9x reappearing.

    Just like the "Y2K glitch" was a platform independent problem based upon the 2-digit-year shorthand causing logical flaws, if you store time in a 32-bit variable by the millisecond... you'll hit the hard limit after about 49.7 days which is why that number can show up in kernels other than Win9x. If there's no proper handling of that rollover, things go haywire.

  30. Uptime: From one of the article links by Mateito · · Score: 5, Interesting
    The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999.

    Whoah! 7 nines uptime!

    About 3 seconds of downtime per year.

    Somebody is on drugs if they sold that. Somebody is on even stronger drugs if they bought that story.

    "5 nines", for all intents and purposes, is as good as it gets, with "6 nines" seen as the holy grail. The top HA system I've ever dealt with (running a Telco's billing operation spanning 4 countries!) quoted a figure of 0.999996. To nobody's surprise, it did not run Windows.

    Wonder how much their failure clause is going to set them back?
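
    For a sense of scale, the availability figures mentioned in this comment can be turned into yearly downtime budgets. This is a sketch assuming a 365-day year; the helper name is hypothetical and the figures are passed as strings so they are parsed exactly.

```python
from fractions import Fraction

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def downtime_seconds_per_year(availability: str) -> float:
    """Seconds of downtime per year permitted by an availability figure."""
    return float((1 - Fraction(availability)) * SECONDS_PER_YEAR)


for label, availability in [
    ("5 nines", "0.99999"),
    ("quoted telco system", "0.999996"),
    ("6 nines", "0.999999"),
    ("7 nines", "0.9999999"),
]:
    print(f"{label}: {downtime_seconds_per_year(availability):.2f} s/year")
```

    Each extra nine cuts the budget by a factor of ten, which is why seven nines leaves only a few seconds per year.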

  31. Not necessarily Windows' fault by DunbarTheInept · · Score: 4, Interesting

    While I hate MS as much as the next guy, this might not really be directly their fault. Unix systems are often installed with the instruction that they get reboots regularly. Often there is a problem that is caused by application code not the OS. If you have a memory leak in an application that runs and stays up all the time, it's going to cause the system to get horribly unusable in the long run regardless of whether it's UNIX or Windows. While a reboot might be overkill when it was just one application misbehaving, a reboot is a guaranteed way to kill and reset the responsible program no matter which one it is. At a previous place of employment we told the customer to do monthly reboots mainly because we didn't trust *our own* code to be that perfect.

    --

    Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.

  32. Re:Anyone want to clue them in to scheduled jobs? by Billy+the+Mountain · · Score: 5, Funny

    Each M T W Th R S Su 12:45 AM shutdown /l /r /y /c

    We have used that exact script for YEARS to nightly reboot a troublesome NT4 BDC at a remote location.

    Does it work on Friday? You might want to check on that...

    BTM

    --
    That was the turning point of my life--I went from negative zero to positive zero.
  33. Re:Retard by Keith+Russell · · Score: 5, Informative

    Search Microsoft's Knowledge Base for "49.7 days", and you'll find a few bugs, all of them related to storing uptime in milliseconds in an unsigned 32-bit integer. Two were reported in Windows 2000:

    That rpcss.exe issue looks like a prime suspect. The OS doesn't crash, but, given the time-sensitive nature of air traffic control data, it's quite possible that the applications running on that server would degrade to the point of failure.

    Both look like they were found, or at least entered into the KB, after the release of Windows 2000 Service Pack 4 (Nov. 2003), and hotfixes are available for both.

    Note to Microsoft (or anyone else storing milliseconds, for that matter): unsigned 64-bit int! Instead of having to reboot every 49.7 days, you'll have to reboot every 213,503,982,334 days, give or take a leap-second.

    --
    This sig intentionally left blank.
  34. Re:Anyone want to clue them in to scheduled jobs? by Dun+Malg · · Score: 4, Funny
    Such a script could, if I'm not mistaken, be used to reboot the machine. One would think this would be an ideal way to hide the problem very nicely.

    For a real-time application like air traffic control, you really can't automate reboots like that. You need someone standing there to say "crap! crap! crap!" and take the necessary actions when the system decides it doesn't want to reboot properly.*

    *even if they don't know what to do, they can at least shout "crap!", which is more than a system stuck at the BIOS screen with an "elbow parity error" can say.

    --
    If a job's not worth doing, it's not worth doing right.
  35. 64 bit int by Alien54 · · Score: 4, Funny
    Note to Microsoft (or anyone else storing milliseconds, for that matter): unsigned 64-bit int! Instead of having to reboot every 49.7 days, you'll have to reboot every 213,503,982,334 days, give or take a leap-second.

    That's every 584,942,417 years. Which is simply not going to be good enough in my book.
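
    Both figures can be checked with integer arithmetic. A small sketch, assuming plain 365-day years (which is what reproduces the year count) and using made-up names:

```python
# Days and years until a millisecond counter of a given width wraps around.
MS_PER_DAY = 24 * 60 * 60 * 1000
DAYS_PER_YEAR = 365  # assumption: plain 365-day years, no leap days


def wrap_days(bits: int) -> int:
    """Whole days before an unsigned counter of `bits` bits overflows."""
    return 2 ** bits // MS_PER_DAY


days_32 = wrap_days(32)
days_64 = wrap_days(64)
years_64 = days_64 // DAYS_PER_YEAR

print(f"32-bit: {days_32:,} whole days")
print(f"64-bit: {days_64:,} days, or {years_64:,} years")
```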

    --
    "It is a greater offense to steal men's labor, than their clothes"
  36. The article is light on details... by Ayanami+Rei · · Score: 4, Informative

    It's probably not a Microsoft problem if the system is running on NT, which uses a 64-bit time.

    It _could_ be that an important part of the system is running Windows 95 interfaced to a 2k domain that implements the rest of the system.
    That really isn't Microsoft's fault that they didn't patch that critical machine to fix the flaw... or that they felt they needed to run Windows 95 (gag) in such a critical portion of the system.

    It _could_ be that a user-land air traffic control related application itself calls a deprecated API to return the time in milliseconds, which overflows/wraps around, causing the software to crash.
    OR
    It _could_ be that the user-land air traffic control software just mis-casts the time from the modern API into a 32-bit data structure, which wraps around, causing the software to crash.
    In the latter two cases the article writer or LAX's press staff may have incorrectly drawn the connection to the famous Windows 95 problem... even when it wasn't Microsoft's fault in that case.
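
    The mis-cast case is easy to picture. A minimal sketch, simulating in Python what happens when a wide millisecond count is stored into a 32-bit field (the helper name and the 50-day uptime are made up for illustration):

```python
# Simulate storing a 64-bit millisecond uptime into an unsigned 32-bit field.
MS_PER_DAY = 86_400_000


def to_uint32(value: int) -> int:
    """Keep only the low 32 bits, as a narrowing cast to unsigned 32-bit would."""
    return value & 0xFFFFFFFF


uptime_ms = 50 * MS_PER_DAY      # hypothetical uptime: 50 days, fine in 64 bits
stored_ms = to_uint32(uptime_ms)  # what the 32-bit structure actually holds

print(f"real uptime:   {uptime_ms / MS_PER_DAY:.2f} days")
print(f"stored uptime: {stored_ms / 3_600_000:.2f} hours")
```

    The OS hands back a perfectly good value and the application throws the top bits away, so after 49.7 days the stored uptime jumps back to near zero.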

    I really don't see how Microsoft could be to blame here at all...

    --
    THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
  37. Pray You Never Hear This by craXORjack · · Score: 5, Funny

    Ladies and Gentlemen, at this time the Captain would like to ask you to remain seated with your seatbelt firmly fastened, however if there are any computer technicians flying with us today, especially if they know what to do when a 'Fatal Exception has occurred at 0029:C02FDEC6', would that person please come forward to the cabin immediately?

    --
    Liberals call everyone Nazis yet they are the closest thing to it.
  38. Re:Anyone want to clue them in to scheduled jobs? by thrills33ker · · Score: 4, Funny

    "This wasn't an ARTCC. Besides, the ARTCC's are all on DSR now, and a bunch have URET on top of that."

    Well, I'm glad you cleared that up!

  39. What failed? by AK+Marc · · Score: 5, Insightful

    A system was knowingly deployed in which the application (not the OS) fails after a finite time. An under-trained technician failed to reboot the server as scheduled. There was a backup, which we don't have details on. It failed to work as well.

    I don't see what the OS has to do with this. It could have been written for *NIX, OS/2, or any other OS. The lessons are two:
    Don't deploy flawed software.
    Make sure redundant systems work.

    As an aside, since we don't know what the backup was, we could hypothetically say that it was the UNIX system that previously was primary that was relegated to backup duty. In that case, it would be a failure of Windows and UNIX at the same time. So, is it that UNIX sucks and is worthless for any important systems, or is it that the people that screwed this up would have screwed up something, no matter what OS they were working with?

  40. It was the app, not the OS by Teahouse · · Score: 5, Informative

    Pilot here, and this has been a well known peccadillo of the tracking system for SoCal Approach for a few years. It's an application problem that came into being after an upgrade of the application, not the OS. It's a memory allocation error that retains some of the old tracking on the system, thus, the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS. Look guys, I'm a Linux user and all, but let's not run around blaming M$ for problems with buggy software apps.

    --
    "Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
  41. ...Blame the API instead by tyler_larson · · Score: 5, Insightful
    This sounds to me like more of a problem with the application, not the OS.

    Three words:

    GetTickCount()

    Returns the number of milliseconds since the machine was last booted.

    From reading the article, one would surmise that this function is used to assign a timestamp to a particular flight plan or other record. After the machine has been running for 49.7 days, the GetTickCount() function rolls over to zero, which could cause a whole plethora of problems. Almost certainly those problems would include things like corruption of data, lost records, old records showing up as new, application crashes, and, of course, swarms of locusts. The only fix is to reboot.
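
    A rough sketch of the failure, simulating a 32-bit millisecond tick counter in Python. The tick values are made up, and nothing here is the FAA's actual code. A naive comparison of two ticks goes wrong across the rollover, while subtracting modulo 2^32 still gives the right interval, provided the real interval is shorter than 49.7 days:

```python
# Elapsed time between two readings of a wrapping 32-bit millisecond counter.
MASK32 = 0xFFFFFFFF


def naive_elapsed(start: int, now: int) -> int:
    """Plain subtraction: goes negative once the counter has wrapped."""
    return now - start


def wrap_safe_elapsed(start: int, now: int) -> int:
    """Subtract modulo 2**32, like unsigned 32-bit arithmetic in C."""
    return (now - start) & MASK32


start_tick = 0xFFFFFF00  # hypothetical reading taken 256 ms before the rollover
now_tick = 0x00000100    # hypothetical reading taken 256 ms after the rollover

print("naive:    ", naive_elapsed(start_tick, now_tick), "ms")
print("wrap-safe:", wrap_safe_elapsed(start_tick, now_tick), "ms")
```

    With the naive version, a record stamped just before the rollover suddenly looks like it comes from the future, which is exactly the "old records showing up as new" kind of mess.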

    The developers cleverly noticed the potential disaster before it crashed any planes, and as a workaround, instituted a policy requiring the servers to be rebooted at monthly intervals. Failure to do so would result in the calamities described above.

    So while the problem wasn't the old Win95 bug, it was the same crappy Windows API that caused both. The POSIX-compliant gettimeofday() function uses a 64-bit structure and does not suffer from the same flaw, and can be relied upon for at least the next 30 years or so (which isn't amazing, but it's a lot better than 50 days).

    Note that the FAA insists that they're currently implementing a better solution than "reboot every month". Better hurry, guys, you've only got 47.3 days left.

    --
    "With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea...."
    RFC 1925
  42. Pilot or no... by juuri · · Score: 4, Insightful

    ... how does a single app bring down the entire OS? You mean the app can't be restarted and brought back up with the same state at a moment's notice in a mere minute or two?

    Crappy design, regardless of who is at fault.

    --
    --- I do not moderate.
  43. Re:Anyone want to clue them in to scheduled jobs? by Anonymous Coward · · Score: 5, Informative

    I used to write aviation message handling systems. We migrated from Tru64 (now extinct) to Linux and have had much better performance, maintainability, hardware support, and reliability.

    Of course, the code leap from Tru64 to Linux is quite small, which is the biggest reason why Linux was chosen.

    Aviation expects 99.9999% uptime with absolutely no message loss, and we would achieve that with hot-standbys and MySQL mirroring. All circuits were split and would simultaneously enter both servers. Only the primary server would route the message.

    No, we didn't require the customer to reboot. The system could run for years at a time.

    Putting mission critical applications on Windows 95 is just plain stupid.

  44. Re:Heh by Michael+Woodhams · · Score: 5, Informative

    There is a rather more extreme case of this with the FAA - when first deployed, the cargo doors of the DC-10 were unsafe, with a failure mode that was likely to make the plane uncontrollable in flight.

    This occurred in flight, and through luck (which allowed some degree of control) and extraordinary airmanship, the plane was landed safely. (This is known as "The Windsor Incident.")

    McDonnell-Douglas didn't want to do a proper redesign of the door mechanism, and the FAA head was a 'companies know best' political appointee, so the result was McD added little windows to the door so that the guy closing the door could look to see it had all engaged properly. (This was over vigorous opposition by the NTSB, who recognized the inadequacy of the fix.)

    The situation: A single failure (not looking, or looking but not noticing an unsafe condition) by a non-safety trained close to minimum wage employee could cause the deaths of hundreds of people.

    Result: over 300 dead when a Turkish Airlines DC-10 crashed near Paris. The guy who closed the door hadn't even been told he was supposed to check the little windows.

    Safety critical systems must be tolerant of human error. If a single omission by a human leads to a hazardous situation, this is primarily the fault of the system, not the human.

    --
    Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.
  45. Re:Heh by InfiniteWisdom · · Score: 4, Funny

    Surely you mean Microsoft Server Cycling Engineers (MSCE)

  46. Downtime vs Failure by burnin1965 · · Score: 5, Interesting

    I'm not sure exactly what downtime for routine maintenance on an AIX system running DBase has to do with a Windows bug that causes a system failure. However, in response, there is a difference between planned downtime where a service is made unavailable while planned routine maintenance is performed and planned downtime or an unplanned failure due to a flaw in the system.

    It appears that in this case Windows has a flaw which they try to work around with routine maintenance during planned downtime.

    In your case I would say you have planned downtime for routine maintenance to work around the need for an appropriate system to handle the work load.

    I suppose what is the same between these two cases is that you both need to change your system to something that is more appropriate for the task at hand. And to be more specific in the FAA case, Windows should not be allowed for use in any application where life, limb, or property is at risk. Hmm, I suppose that may rule out just about every use. :P

    burnin

  47. Fire the Department of the Interior's IT staff... by Dr.Dubious+DDQ · · Score: 4, Insightful

    The FAA is under the auspices of the US Department of the Interior, aren't they? You know, the same department that was ordered by a court to take ALL of their systems off line because they were apparently unable to secure them? TWICE? (No, wait, the latter link says THREE times, most recently March 2004...!)

    Is there some secret plot to make them look bad, or is the Department of the Interior riddled with incompetence? I certainly don't feel real secure about the safety of our airlines right now - and it's got nothing to do with "terrorists"...

    (Not to say that terrorism isn't a real concern, but I'm somewhat less worried that their intentional plots will slip through observation by the authorities than "accidental" screwed up software being deployed by the FAA...)

  48. Re:I Hate to Say It by AstroDrabb · · Score: 4, Informative
    Funny, nowhere in the doc for GetTickCount() does it say it is deprecated and not to use it. The only thing it does say is "If you need a higher resolution timer, use a multimedia timer or a high-resolution timer." I don't know what the program needs since I did not write it nor have I seen the code. Maybe they didn't need a high-res timer and wanted a tick count for how long the system has been up? I don't think that is too much to ask from an OS.

    The GetSystemTimeAsFileTime() function retrieves the current system date and time. The information is in Coordinated Universal Time (UTC) format. It doesn't tell you how long the system has been up.

    Oh, and if MS did not think this is a problem why did they fix it in a WinNT service pack? Also, right in that link MS says

    Microsoft has confirmed that this is a problem in Windows NT 4.0 and Windows NT Server 4.0, Terminal Server Edition. This problem was first corrected in Windows NT 4.0 Service Pack 4.0 and Windows NT Server 4.0, Terminal Server Edition Service Pack 4.

    MS also didn't seem to fix it in Win2000 Server and their own engineers got hurt by it, specifically with Rpcss.exe which according to MS

    SYMPTOMS
    The Rpcss.exe process consumes 60 percent or more of CPU time, and system performance and network performance are affected. This symptom typically occurs 49.7 days after the server is started.
    CAUSE
    This problem occurs because a call to the GetTickCount timer function causes the function to overflow 49.7 days after the server is started.
    If GetTickCount is "deprecated" as you state, why in the world are MS's own programmers using it in rpcss.exe? According to this site
    rpcss.exe is an executable of Microsoft Windows Operating System. It is responsible for Remote Procedure Call services on the local machine. These are public services available to the local network. This program is important for the stable and secure running of your computer and should not be terminated.

    Still not convinced and want to apologize for MS? Well here are some more of MS's software that are affected by it in Windows 2000 servers (what this FAA project is using).
    Print Spooler Stops Scheduling Print Jobs

    The Print Spooler service may stop scheduling print jobs to specific Simple Port Monitor (SPM) ports. Although incoming jobs are queuing into the spooler, print jobs may not start. Note that this symptom occurs 49.7 days after you start the Print Spooler service.

    There are a bunch of MS apps affected by this logic flaw that has been passed from version to version of MS OSes. If this flaw affected all these MS developers who have far more access to proprietary docs, I don't see how other developers would not stumble over it as well since they do not have access to the proprietary OS.

    --
    If Tyranny and Oppression come to this land,
    it will be in the guise of fighting a foreign enemy. -James Madison
  49. Incorrect. by jwigum · · Score: 5, Insightful

    Part of being on the ball in any tech department means having the system up to date. If you don't have it up to date, and an error FOR WHICH A PATCH EXISTS gives you trouble, everyone else in the company should rip your head off. That's inexcusable.

    If you install an unpatched version of an OS, and leave it as such, it's your own dumb fault. If a patch is out that fixes the problem, then the problem doesn't exist as far as anyone with half a brain is concerned.

    My apologies for the abrasive manner of the response, but patches are around for a reason: to fix known problems.

    Patches, do ya have 'em?

    --

    Look behind you...