Slashdot Mirror


Air Traffic Snafu: FAA System Runs Out of Memory

minstrelmike writes: Over the weekend, hundreds of flights were delayed or canceled in the Washington, D.C. area after air traffic systems malfunctioned. Now, the FAA says the problem was related to a recent software upgrade at a local radar facility. The software had been upgraded to display customized windows of reference data that were supposed to disappear once deleted. Unfortunately, the systems ended up running out of memory. The FAA's report is vague about whether it was operator error or software error: "... as controllers adjusted their unique settings, those changes remained in memory until the storage limit was filled." Wonder what programming language they used?

49 of 234 comments

  1. But, but, but... by Anonymous Coward · · Score: 3, Funny

    But nobody should ever need anything more than 640k!

    1. Re:But, but, but... by Richard+Steiner · · Score: 2, Informative

      One advantage of many airline online transaction systems: An applications programmer cannot do a malloc equivalent.

      Programs are created with a fixed memory size, and complex applications are simply a series of program modules which pass data between each other via common memory areas or memory-mapped files.

      Memory leaks in such an environment are quite rare.

      --
      Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
      The Theorem Theorem: If If, Then Then.
  2. Software error ... by gstoddart · · Score: 5, Informative

    You can make the argument that if the software allowed the operators to crash the system, it's a software fault.

    You can also make the argument that stuff like this should have been tested in parallel with the live system so this wasn't a possibility.

    I mean, my god, what are the change management and testing practices which allowed this to only be discovered in your real system?

    I've been around a few systems which had to do with aircraft ... and the rules and practices surrounding them are pretty paranoid and rigorous, because the stakes are so high. For an actual air traffic system I'm stunned this happened.

    I guess I'm not surprised, but I am stunned.

    --
    Lost at C:>. Found at C.
    1. Re:Software error ... by AlecC · · Score: 2

      The loose description sounds like something not being garbage collected when it should have been. So no single change caused the problem. It might well have been caused by controllers playing with a new toy, in a way they would never do once it had settled in and testers would not do. It is difficult to observe heap leakage - even if you check free space after a run, it is not clear what the right value is.

      --
      Consciousness is an illusion caused by an excess of self consciousness.
    2. Re:Software error ... by Anonymous Coward · · Score: 5, Insightful

      No, no, no, no, no! The concept of garbage collecting is a reaction to poor coding practices and reliance on it is laziness. Software engineers responsible for real-time, public safety software should be capable of managing memory in their code!

    3. Re:Software error ... by U2xhc2hkb3QgU3Vja3M · · Score: 5, Insightful

      Not only should they be "capable" of managing memory in their code, it should be part of the software design itself.

    4. Re:Software error ... by fahrbot-bot · · Score: 2

      I mean, my god, what are the change management and testing practices which allowed this to only be discovered in your real system?

      Don't know, probably just Government ineptitude. Let's ask free-market leaders how they handle things: Toyota (brake/accelerator pedals) or Chrysler / GM (remote access) or Boeing (Li-ion batteries) ... -- oh wait.

      --
      It must have been something you assimilated. . . .
    5. Re:Software error ... by dwpro · · Score: 4, Insightful

      Software engineers responsible for real-time, public safety software should be capable of managing memory in their code

      And surgeons responsible for cutting open live human beings should be capable of not leaving tools in the person they're operating on, but it still happens. Professionals make mistakes. Garbage collection is a useful tool to make it more difficult to screw up.

      --
      Millions long for immortality who do not know what to do with themselves on a rainy Sunday afternoon. -- Susan Ertz
    6. Re:Software error ... by jamstar7 · · Score: 5, Insightful

      Couple things to keep in mind.

      The civilian aircraft control system has been chronically underfunded for decades, since Reagan fired PATCO. One of the things they were on strike for was for better equipment to do their jobs better, easier, and with less stress. Even in the 80's, the computers and radars were dinosaurs best kept in a museum. Upgrades since then have always been a day late and a dollar short.

      The airspace above the US is the busiest in the world, and it's just getting worse. They don't even report near-misses anymore to the media unless the pilots can see each other giving them the finger. They're that common.

      Nothing will be done until 3 or 4 planes do a mid-air and the public outcry is so bad that people are ready to march on the FAA's office with torches and pitchforks. Then there will be a massive round of public firings to appease the crowd, a slight boost in funding to the FAA, followed by further deregulation of the airlines.

      Personally, with all the deregulation already, I'm surprised more planes don't shed parts along the way.

      --
      Understanding the scope of the problem is the first step on the path to true panic.
    7. Re:Software error ... by SpeedBump0619 · · Score: 4, Informative

      Professionals make mistakes. Garbage collection is a useful tool to make it more difficult to screw up.

      I get this. And as a software engineer I fully agree. However, in practical terms, there shouldn't be any dynamic memory management happening at all.

      It's a real-time system. It *must* interact, on time, with all the planes that are in its domain. That should be a bounded, predictable load, or there's no way to guarantee responsiveness. Given that, an analysis should have been done on the maximum number of elements the system supported. Those elements should have been preallocated (into a pool if you want to treat them "dynamically") before actual operation began. If/When the pool allocator ran out of items it should do two things: allocate (dynamically) more, and scream bloody murder to everyone who would listen regarding the unexpected allocation.
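
A minimal sketch of the pool policy described above, written in Python only for illustration (a real-time system would more likely do this in C or Ada). The class name `TrackPool`, the record layout, and the logger name are hypothetical and not taken from any FAA system.

```python
import logging

logger = logging.getLogger("track_pool")  # hypothetical logger name


class TrackPool:
    """Fixed pool of reusable records, sized from a worst-case load analysis."""

    def __init__(self, capacity):
        # Preallocate every record before operation begins.
        self._free = [{"id": None, "data": None} for _ in range(capacity)]
        self.capacity = capacity
        self.overflow_allocations = 0

    def acquire(self):
        if self._free:
            return self._free.pop()
        # Pool exhausted: keep the system running, but make noise about it.
        self.overflow_allocations += 1
        logger.critical(
            "pool exhausted (capacity=%d); unexpected allocation #%d",
            self.capacity, self.overflow_allocations,
        )
        return {"id": None, "data": None}

    def release(self, record):
        # Reset the record and return it to the pool for reuse.
        record["id"] = None
        record["data"] = None
        self._free.append(record)


pool = TrackPool(capacity=2)
a = pool.acquire()
b = pool.acquire()
c = pool.acquire()   # exceeds the analysed maximum: logged as critical
pool.release(a)
d = pool.acquire()   # reuses the record that was just released
```

The useful property is that `overflow_allocations` stays at zero for as long as the load analysis holds, so any non-zero value is a signal worth investigating rather than a silent slide toward exhaustion.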

      This is (one of) the reason(s) I generally haven't liked garbage collected languages for real time systems. There's rarely ever a way to guard against unexpected allocations, because *every* allocation is blind.

    8. Re:Software error ... by Anonymous Coward · · Score: 5, Insightful

      Pretty sure the defined time frame of "since Reagan fired PATCO" involves being critical of every administration since then that hasn't resolved the issues. I suppose one wouldn't want basic reading comprehension to get in the way of a good partisan knee jerk lash out though.

      Americans are fucking weird.

    9. Re:Software error ... by Rei · · Score: 5, Insightful

      So, I actually am a programmer for an ATC system...

      First off, this isn't as bad as it sounds as far as safety goes. One first needs to ask themselves, "what is the purpose of an ATC system?". The simple answer is, "don't ever let two aircraft exist in the same location at the same time". So any two aircraft can be separated in a) time, b) location, or c) altitude, and so long as they meet the minimum safety distances, that's all okay. Complicating this is the great variety of hardware on the aircraft, communications methods and protocols, and gaps in the information available to you, plus the wide variety in ATC systems and how they talk to each other. And there's a lot of potential instability at each stage. So basically ATC systems are massive collections of "special cases" that need to be handled on top of the basics. Maybe some line in Denmark is garbling messages that lead to you being fed bogus data. Maybe some aircraft in India's buggy hardware is for some reason spamming everyone on the network. Maybe you've got two different systems handling radar data and one says the radars are all fine and the other says they're not. Maybe the aircraft says they were at X point at Y time but some radar says something different. These are the sorts of things we have to deal with on a weekly if not daily interval, and they lead what seems like it should be very simple pieces of software to become really huge systems.

      As mentioned, there can be lots of instability. Yep, it's true, these things can be rather buggy - both hardware and software. They're usually old designs that may have been poor design from the beginning, but have had to be continually patched and patched over the course of decades. Don't like that? Throw some more funds in for new ATC systems designed from scratch, otherwise this is going to continue to be the reality (yeah, new subsystems do come in every now and then for various purposes, but old systems are slow to go away).

      So, instability and bugs can sound scary. But remember the goals of an ATC system: separation. So let's just say that you lose the whole system for a long time - what do you do? Well, you basically revert to paper, and you've got a LOT more phone calls to make. You have to allow for more separation, and because of the increased workload, you can't handle as many planes. So you have to greatly reduce the number of planes in your region - they have to divert or wait. It's big delays, which costs big money. But it's not like we just start guessing whether planes are going to run into something or not.

      Our software here is predominantly old C code with a little bit of C++, and miscellaneous like yacc and lex. There are changelog entries dating back to the 80s - though that's the manual changelogs, it didn't go under revision control until the late 90s. Its core uses macros to an annoying degree to emulate object-oriented design in C; macros can be nested dozens of layers deep. It makes bugs very hard to find sometimes, but it's the core of the software, so it's not something that can be easily changed. So we do our best. Yes, there are "WONTFIX" bugs that we know about, and operators have documented procedures for working around them (usually involving restarting some module - the system is very modular, you don't have to restart the whole thing to fix a part that's acting up). But we always prioritize fixing the things that get in the way of their work the most - there's a lot of direct back and forth. Again, safety always takes top priority, then throughput. Everything else is way down below on the priority list.

      Changes work through the following process. A report of a bug or feature request is made. Someone analyses it and if they think it's worth working on writes up a task and assigns it to a programmer. The programmer works on the task and when they think it's ready they submit it for code review. Another programmer looks through all of the code and tries to see if they have any complaints. After any necessary back and forth to get things r

      --
      "99 dead duelists of Dios on the wall. 99 dead duelists of Dios! Take one's ring, pass it around..."
    10. Re:Software error ... by operagost · · Score: 2

      As far as safety goes, the private airlines are heavily regulated. And maintenance is not under the purview of the air traffic controllers, so this is a red herring and "shedding parts" is mere hyperbole in any case.

      If the systems were allegedly "dinosaurs" in the 1980s, I would think they'd be causing "mid-airs" on a regular basis right now. That they are not tells me that the systems have been upgraded.

      Reagan went into office SUPPORTING PATCO. They actually endorsed him over Jimmy Carter, who had ignored them. But they decided to test Reagan after only 7 months in office by illegally striking per Federal law. They certainly had concerns, but striking is illegal. These are the realities. There are now two organizations representing controllers, so they are by no means unrepresented.

      --

      Gamingmuseum.com: Give your 3D accelerator a rest.
    11. Re:Software error ... by Rob+Riggs · · Score: 3, Interesting

      Surgeons leave tools in patients because they have no process when operating on a patient. Read the Checklist Manifesto sometime and read what the author has to say about best practices in the operating room. Everyone makes mistakes. The process we follow is what allows us to catch those mistakes, and prevent any mistakes from re-occurring.

      --
      the growth in cynicism and rebellion has not been without cause
    12. Re:Software error ... by operagost · · Score: 2

      So, when Obama and the Dems controlled everything, when Clinton and the Dems controlled everything, and neither of them fixed the problem you describe as being "Reagan fired PATCO" (which has little or nothing to do with the code and systems today, being 30+ years ago), you still chose to blame something that can only be described as tangential to the current problem.

      Indeed, Clinton had an opportunity to help loosen the union restrictions in Taft-Hartley (passed with veto override by both parties in 1947 and subsequently used by the President who vetoed it), but he did not do so. And, of course, neither has Obama as you stated. I guess Obama needs a third term to have time to get things like this done.

      --

      Gamingmuseum.com: Give your 3D accelerator a rest.
    13. Re:Software error ... by tomhath · · Score: 2, Informative

      The civilian aircraft control system has been chronically underfunded for decades, since Reagan fired PATCO.

      Reagan initiated and appropriately funded a complete overhaul of the control system.

      The illegal strike by the air traffic controllers is irrelevant.

    14. Re: Software error ... by Anonymous Coward · · Score: 2, Insightful

      Nice job trying to dodge responsibility, right wing whacko. Your side is so busy funneling money to their big contractor buddies for stuff that makes news when it actually works (see, pretty much every defense program ever). They then blew off concerns from people doing actual work to make a political point, and your complaint is that the other side can't fix your screwups fast enough?

      How about your people stop sabotaging agencies that do useful things just so they can say that government never does anything right? (Because if government does do something right, conservatives will jump in and fix that for you.)

    15. Re:Software error ... by phantomfive · · Score: 4, Informative

      You are trying to be sarcastic, but the MISRA standard for embedded systems includes these rules:

      1) absolutely no recursion. it could lead to stack overflows.
      2) absolutely no local variables. it could lead to stack overflows.
      3) absolutely no use of malloc or free. it could lead to stack overflows.

      So yeah, that has been an accepted approach for many years.

      --
      "First they came for the slanderers and i said nothing."
    16. Re:Software error ... by jittles · · Score: 2

      Software engineers responsible for real-time, public safety software should be capable of managing memory in their code

      And surgeons responsible for cutting open live human beings should be capable of not leaving tools in the person they're operating on, but it still happens. Professionals make mistakes. Garbage collection is a useful tool to make it more difficult to screw up.

      Until the entire air traffic system grinds to a halt at the same time every day while java garbage collects everything. No, garbage collection is not the answer. There are more performant ways to manage memory.

    17. Re:Software error ... by phantomfive · · Score: 4, Informative

      Garbage collection is a useful tool to make it more difficult to screw up.

      Recently I've seen a lot of memory leaks in Java and Javascript. People stick things in a hash table or a queue, then forget to remove them (angular.js also has gotchas to watch for avoiding memory leaks). Because programmers in those languages don't think about memory, they end up with more memory leaks than programmers in C.

      For a system that needs high reliability, garbage collection is not the answer, and can make things worse.

      --
      "First they came for the slanderers and i said nothing."
    18. Re:Software error ... by Anonymous Coward · · Score: 2, Insightful

      The reading from one such as this stops when they see a negative comment about Saint Reagan.

    19. Re:Software error ... by Minwee · · Score: 2

      Only because there is no moderation for "Wrong Website For That Kind Of Thing".

    20. Re:Software error ... by swillden · · Score: 3, Insightful

      No, no, no, no, no! The concept of garbage collecting is a reaction to poor coding practices and reliance on it is laziness. Software engineers responsible for real-time, public safety software should be capable of managing memory in their code!

      Garbage collection is a red herring. The notion that "real" software engineers must use manual deallocation is just as silly as the idea that garbage collection eliminates memory leaks. Though GC actually does eliminate dangling pointer bugs... by turning them into memory leaks.

      Garbage collection is a viable and reasonable strategy for handling deallocation -- in fact it can be significantly more efficient than manual deallocation, in terms of cycles spent on deallocation -- but it's not a panacea. It doesn't eliminate the need to think about object lifetimes or memory consumption. It reduces the amount of development effort focused on those issues, trading it instead for management of GC times. Whether that tradeoff is a net benefit depends on the context and system requirements.

      And that is what real software engineers do. They don't choose their tools based on which is the manliest and best for proving their coding prowess. They choose based on the nature of the problem and the resources available. Where GC interruptions can be tolerated, or safely scheduled, GC is a tool that automates away significant engineering effort. That's a good thing. Hard real-time systems generally don't tolerate GC very well, but virtually anything that interacts with people does tolerate brief (50 ms or less) GC pauses, and that's actually quite easy to achieve.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    21. Re:Software error ... by PuckSR · · Score: 2

      It was relevant because it was one of the issues highlighted at the time.
      Also, Reagan announced the overhaul as a RESPONSE to the strike, but it wasn't given the type of fast-track authorization that would have made it useful

    22. Re:Software error ... by turbidostato · · Score: 2

      "but in what possible universe do you imagine 3 planes colliding in total."

      I would bet that, within civil aviation, it's easier to have three planes colliding mid-air than just two or, at least, three involved with two crashing and a near-miss.

    23. Re:Software error ... by rrr00bb5454 · · Score: 2

      But there is also the issue of some reasonable level of proof that the code is robust; akin to the assurances you get from a good compiler that the machine language behaves like the source code. If you work within truly large C code bases (I estimate that I'm on one right now), the completely manual approach is just not good enough. Garbage collection isn't the only answer of course, but tooling is essential. In the future, higher languages are definitely going to play a role. C/C++ aren't keeping up with changes being created by multi-core. Innovations like LLVM help to keep making progress, but ultimately, embedded systems are going to look something like Rust while everything else is going to move up to a higher level abstraction. The abstraction just has to be high enough that we can get away from compilers being utterly blind, where we can ask the compiler if code is memory safe or conforms to protocols in its interfaces. (See Coq related projects producing subsets of C that can be proven correct)

    24. Re:Software error ... by geoskd · · Score: 2

      delete or delete [] every time you allocate, or for that matter initialize a bunch of variables to 0 manually every single time?

      It's really not that difficult unless you are not very clear on how your memory is being used and how the algorithms around it perform. With a clear understanding of the system being designed, cleaning up your memory allocation is a no brainer. The only time it presents anything remotely complicated is when the programmer doesn't understand the system they are working on. Under those circumstances, I would posit that the programmer should probably not be programming that particular project. I am well aware that stringent guidelines to that effect would knock 50% or more of the programmers out of the pool, and I am OK with that. Anyone with any sense at all should be OK with that, especially for safety critical systems.

      --
      I wish I had a good sig, but all the good ones are copyrighted
  3. Programming language doesn't matter by rockmuelle · · Score: 4, Insightful

    You can have poor memory management in any language.

    Sure, historically C/C++ have been known for memory leaks due to memory that's not freed, but in Java/Python/pick-your-favorite-garbage-collected-language or using smart pointers in C++, all you need to do is have a container that keeps a reference to everything and nothing will go away. It's not hard to do this.

    Based on the summary, it sounds like that's what happened. Some monitor views just kept a list of everything and the developer forgot to purge the lists when things went out of, er, scope.
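
A small sketch of that failure mode in a garbage-collected language, assuming a display that keeps a second list of every window it ever created. The `Display` and `ReferenceWindow` classes are hypothetical stand-ins, not the actual ERAM code; the point is only that a forgotten container reference defeats the collector.

```python
import gc
import weakref


class ReferenceWindow:
    """Stand-in for a customized window of reference data (hypothetical)."""

    def __init__(self, name):
        self.name = name


class Display:
    def __init__(self):
        self.visible = {}   # windows currently shown
        self.history = []   # every window ever created (the leak)

    def open_window(self, name):
        window = ReferenceWindow(name)
        self.visible[name] = window
        self.history.append(window)   # extra reference, easy to forget
        return weakref.ref(window)    # lets us observe collection

    def close_window_leaky(self, name):
        # Removes the window from view but leaves it in the history list.
        del self.visible[name]

    def close_window_fixed(self, name):
        # Drops both references, so the collector can reclaim the window.
        window = self.visible.pop(name)
        self.history.remove(window)


display = Display()

leaked = display.open_window("approach-1")
display.close_window_leaky("approach-1")
gc.collect()
leaked_still_alive = leaked() is not None   # the history list still holds it

freed = display.open_window("approach-2")
display.close_window_fixed("approach-2")
gc.collect()
freed_is_gone = freed() is None
```

After the leaky close the window is invisible to the user but still reachable, which is consistent with the FAA's description of changes that "remained in memory until the storage limit was filled".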

    -Chris

    1. Re:Programming language doesn't matter by tomhath · · Score: 2

      It's very simple to prove a student's program consisting of a few modules and a couple hundred lines of code. Expand that out to hundreds of programmers, thousands of modules and tens of millions of lines of code and it's not so simple anymore.

  4. QA process? by scsirob · · Score: 2

    I don't care what language they use. It could be BASIC for all I care.

    What I do care about is what their QA process looks like. How did this not get caught in testing??

    --
    To Terminate, or not to Terminate, that's the question - SCSIROB
    1. Re:QA process? by Brett+Buck · · Score: 5, Insightful

      It didn't get caught in testing because testing is by far the most expensive and time-consuming part of the development process, and is always the first thing to get cut/trimmed/"streamlined". Just like it has been forever.

    2. Re:QA process? by bobbied · · Score: 2

      It didn't get caught in testing because testing is by far the most expensive and time-consuming part of the development process, and is always the first thing to get cut/trimmed/"streamlined". Just like it has been forever.

      There is one more reason... Testing is the LAST thing you do before a release, so as the schedule slips to the right the last task on the schedule ALWAYS gets squeezed into smaller and smaller schedules. Less time means less testing.

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
    3. Re:QA process? by Brett+Buck · · Score: 2

      Well, a big part of the distinction you are making (life-critical VS commodity software) has gotten lost by the various programming "cargo cults", where every problem is the same and every solution fits into some sort of stupid ritual.

  5. Re:This is why we like C by kelemvor4 · · Score: 3, Funny

    Wut.

    There aren't any memory leaks when you write in C?

    Maybe you have them, but I don't. That's because my C-peen is larger than yours.

  6. As this is the FAA by RevWaldo · · Score: 5, Funny

    I'm betting someone just got some of the punch cards mixed up.

    .

  7. Language: ADA by JumboMessiah · · Score: 5, Informative

    While everyone speculates on GC vs heap vs what flavor is my coffee, ERAM approach systems use ADA as the language of choice.

    reference

  8. Re:This is why we like C by U2xhc2hkb3QgU3Vja3M · · Score: 3, Funny

    He said ancient, not precambrian.

  9. Actually, it was a bunch of 68K assembly. by tlambert · · Score: 2

    I kinda doubt that, My understanding is most of the US's air-traffic control systems (and software) is ancient.

    Somehow, I doubt it was 2,000,000 lines of assembly language.

    Actually, the old system was a bunch of 68K assembly. Nowhere near 2,000,000 lines of it. I know one of the guys who wrote some of it.

  10. Re:This is why we like C by gstoddart · · Score: 2

    Well, whatever this was coded in, it was recently upgraded ... you know, not ancient.

    The language written in doesn't matter. It was a new change, insufficiently tested, and which failed in the real world with a corner case nobody anticipated.

    That's a pretty large failure of coding, testing, and deployment.

    It's a bunch of things, but really you'd expect the people responsible for it would have been a LOT more paranoid and rigorous about it.

    It's an air traffic system after all, around Washington for crying out loud ... which means there's probably some security people going completely apoplectic.

    I mean, the movie scenario of knowing there's an update to the ATC system around Washington and then all of the fanciful plot devices you can add in are suddenly slightly more plausible ... if "air traffic control offline around Washington" isn't begging to have a Bruce Willis movie, I don't know what is.

    You want to bet someone at DHS didn't have a couple of extra Rolaids when this happened?

    --
    Lost at C:>. Found at C.
  11. Implemented in Ada 2005 by deppli · · Score: 2

    A quick search reveals Lockheed Martin used Ada 2005 primarily to implement ERAM. Ada's Vital Role in New US Air Traffic Control Systems http://www.iaeng.org/publicati... "The new Ada 2005 has introduced more robust capabilities based on user experience. The language offers particular innovations which helps make safety assurance less costly and further improves high integrity ... real-time, and object-oriented language. Now it offers more safety and portability than Java, and better efficiency and flexibility than of C/C++"

    1. Re:Implemented in Ada 2005 by Anonymous Coward · · Score: 2, Informative

      The backend code is implemented in Ada but all of the display code is implemented in a mix of C and C++

  12. Nobody by U2xhc2hkb3QgU3Vja3M · · Score: 5, Funny

    Nobody expects the ERROR: OUT OF MEMORY.

  13. Re:As this is the FAA Easy Fix, Punched tape by BoRegardless · · Score: 2

    No more mixed cards

  14. You Said by dcw3 · · Score: 2, Informative

    But, you said that 8G was enough!

    --
    Just another day in Paradise
  15. And the language is...... by jeremyp · · Score: 4, Informative

    Ada and Java apparently

    http://dl.acm.org/citation.cfm...

    --
    All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
  16. Re:This is why we like C by Vlad_the_Inhaler · · Score: 4, Interesting

    I have actually seen something similar to this before, also involving an Air Traffic Control.

    They were having some problem in handling "Large Messages", I am not sure of the exact details / circumstances - I was only peripherally involved. Anyway, the programmer wrote these to a file, then they were processed asynchronously and deleted. This minor change was tested - as usual at the site - by someone shooting an hour's production traffic through the test system and checking for unexpected aborts or other abnormalities. All was fine, the spooling file was 1% full.
    The patch went online. 4 days later (it was a Sunday morning and it was snowing) the file hit some limit and refused to accept new messages. At that moment things went "Keystone Cops".

    • All department heads were informed, except programming. Given that only the one patch had been applied in the previous week, not very helpful. Headless chickens ran around trying to find a solution.
    • Standard practice in this type of situation was to switch to the backup/standby system. Since ATC data is very short lived, the backup system had an empty database which would then be populated dynamically. All "Station Chiefs" had to approve this step. One refused because he could not see any problem. Finally someone managed to make him understand what the problem was, then it was "oh yes, we are seeing that as well". His was the smallest station of course.
    • Standard procedure was also to switch to manual control - rather than automated - and cancel short-haul flights. The railways could take up the slack. This was done.

    The switch was duly made and everything was working again.
    It turned out that the deletion of the processed records had a bug. One hour of live data left the file 1% full. 100 hours . . . do the math. It took 5 or 10 minutes for the programmer to fix the problem, he could have done it live on the Sunday if anyone had bothered to tell him what was going on.

    One of the lessons from that is also relevant here - one hour of live data left the file 1% full. I'd bet that they were testing that the new feature worked, not looking for hidden side-effects.
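
The arithmetic behind that lesson can be sketched as a check a test run could apply: project how long a resource lasts from the fill level seen in a short test. This assumes roughly linear growth, and the function names and the 30-day requirement are made up for illustration.

```python
def hours_until_full(fill_fraction, test_hours):
    """Project hours to exhaustion from the fill level seen in a short test.

    Assumes the resource fills at a constant rate (a simplification).
    """
    if fill_fraction <= 0:
        return float("inf")
    growth_per_hour = fill_fraction / test_hours
    return 1.0 / growth_per_hour


def soak_test_passes(fill_fraction, test_hours, required_hours):
    """True only if the projected lifetime covers the required uptime."""
    return hours_until_full(fill_fraction, test_hours) >= required_hours


# Numbers from the story: one hour of traffic left the spool file 1% full.
projected = hours_until_full(fill_fraction=0.01, test_hours=1)
projected_days = projected / 24

# Hypothetical requirement: run 30 days without intervention.
ok_for_30_days = soak_test_passes(0.01, 1, required_hours=30 * 24)
```

With those inputs the projection is about 100 hours, a little over four days, which lines up with the failure arriving "4 days later". A test that only asks "did it abort?" passes; one that asks "how long until this fills up?" does not.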

    --
    Mielipiteet omiani - Opinions personal, facts suspect.
  17. Re:This is why we like C by gstoddart · · Score: 3, Insightful

    It took 5 or 10 minutes for the programmer to fix the problem, he could have done it live on the Sunday if anyone had bothered to tell him what was going on.

    I have seen far too many occasions where some hotshot made an out of band code change, broke prod, and then said "oh, it's just a quick fix".

    It would have to be one hell of an emergency to have live changes on a prod system be anything other than a hanging offense. I've seen more problems caused by it, than things fixed by it.

    I've experienced several outages caused by someone who was either thinking "it's just a quick fix", or was trying to sneak in a fix for something which shouldn't have left their desk in the first place.

    --
    Lost at C:>. Found at C.
  18. Primary Buffer Panel by Anomalyst · · Score: 2

    I'm surprised more planes don't shed parts along the way.

    Did the Primary Buffer Panel just fall off my gorram ship for no apparent reason?

    --
    There is no right to feel safe thru security vaudeville at the expense of everyone's freedom, privacy and tax money.
  19. Late-80s Development Process Failures by billstewart · · Score: 2

    I think this system that failed is part of the same one I helped bid on upgrading in the late 80s. (We were the lucky ones who lost the bid; IBM were the poor suckers who won it.) The Advanced Automation System was supposed to have a budget of something like 4 years and $4B, or maybe it was $7B, but either way it ran way way over that, in both years and billions, before being restructured, partly because the problem is really hard, partly because the specs were extremely unrealistic, and partly because we were required to use DOD-STD-2167 software development methodology, a very heavy clumsy version of waterfall process.

    The important requirement was that if anything went wrong and two airplanes crashed and fell out of the sky, mobs of citizens and Congresscritters would descend on FAA headquarters with torches and pitchforks and budget cuts, so everything had to be ultra-conservatively speced to prevent that from happening. I'm extremely annoyed to hear the FAA saying that except for this failure, they've been running 99.99% reliability this year. Four 9s? The specs we were supposed to meet were 8 9s, and since nobody was willing to ask the FAA to define a failure event, our management was conservatively aiming for 10 9s. (An average system controlled about 100 radars, and the big difference is whether a "failure" means "all the radars are out" or "any single radar is out".) This kind of reliability meant that duplicating everything wasn't good enough, you had to triplicate every piece of equipment, or double-double it, because otherwise the possibility of one piece failing while you had its backup down for preventative maintenance for 5 minutes blew your numbers for the year. (No matter that the radars were connected back to the data center over circuits that had 3.5-4 9s, just because of the usual risk of physical damage.) We later found out that the FAA shut down the then-current 1960s system for four hours a night, running on the backup equipment (which was a 1970s transistorized upgrade to the 1940s/1950s version) to keep the backup system reliable and operators trained.
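
To put numbers on the "9s", here is a rough sketch that converts an availability target into a yearly downtime budget and shows how redundancy compounds. It assumes a 365-day year and, for the redundancy formula, fully independent failures, which is optimistic for real equipment. The function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def allowed_downtime_seconds(nines):
    """Yearly downtime budget for an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


def parallel_availability(single, copies):
    """Availability of redundant units when any one working copy suffices.

    Assumes failures are independent (an idealisation).
    """
    return 1.0 - (1.0 - single) ** copies


four_nines = allowed_downtime_seconds(4)    # roughly 53 minutes per year
eight_nines = allowed_downtime_seconds(8)   # roughly a third of a second
maintenance_window = 5 * 60                 # the 5-minute example, in seconds

duplicated = parallel_availability(0.999, 2)
triplicated = parallel_availability(0.999, 3)
```

At eight 9s the whole yearly budget is under a second, so a five-minute window with no working backup is hundreds of times larger than the allowance. That is the reasoning behind triplicating equipment rather than merely duplicating it.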

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks