Slashdot Mirror


Air Traffic Snafu: FAA System Runs Out of Memory

minstrelmike writes: Over the weekend, hundreds of flights were delayed or canceled in the Washington, D.C. area after air traffic systems malfunctioned. Now, the FAA says the problem was related to a recent software upgrade at a local radar facility. The software had been upgraded to display customized windows of reference data that were supposed to disappear once deleted. Unfortunately, the systems ended up running out of memory. The FAA's report is vague about whether it was operator error or software error: "... as controllers adjusted their unique settings, those changes remained in memory until the storage limit was filled." Wonder what programming language they used?

10 of 234 comments (clear)

  1. Software error ... by gstoddart · · Score: 5, Informative

    You can make the argument that if the software allowed the operators to crash the system, it's a software fault.

    You can also make the argument that stuff like this should have been tested in parallel with the live system so this wasn't a possibility.

    I mean, my god, what are the change management and testing practices which allowed this to only be discovered in your real system?

    I've been around a few systems which had to do with aircraft ... and the rules and practices surrounding them are pretty paranoid and rigorous, because the stakes are so high. For an actual air traffic system I'm stunned this happened.

    I guess I'm not surprised, but I am stunned.

    --
    Lost at C:>. Found at C.
    1. Re:Software error ... by Anonymous Coward · · Score: 5, Insightful

      No, no, no, no, no! The concept of garbage collecting is a reaction to poor coding practices and reliance on it is laziness. Software engineers responsible for real-time, public safety software should be capable of managing memory in their code!

    2. Re:Software error ... by U2xhc2hkb3QgU3Vja3M · · Score: 5, Insightful

      Not only should they be "capable" of managing memory in their code, it should be part of the software design itself.

    3. Re:Software error ... by jamstar7 · · Score: 5, Insightful

      Couple things to keep in mind.

      The civilian aircraft control system has been chronically underfunded for decades, since Reagan fired PATCO. One of the things they were on strike for was for better equipment to do their jobs better, easier, and with less stress. Even in the 80's, the computers and radars were dinosaurs best kept in a museum. Upgrades since then have always been a day late and a dollar short.

      The airspace above the US is the busiest in the world, and it's just getting worse. They don't even report near-misses anymore to the media unless the pilots can see each other giving them the finger. They're that common.

      Nothing will be done until 3 or 4 planes do a mid-air and the public outcry is so bad that people are ready to march on the FAA's office with torches and pitchforks. Then there will be a massive round of public firings to appease the crowd, a slight boost in funding to the FAA, followed by further deregulation of the airlines.

      Personally, with all the deregulation already, I'm surprised more planes don't shed parts along the way.

      --
      Understanding the scope of the problem is the first step on the path to true panic.
    4. Re:Software error ... by Anonymous Coward · · Score: 5, Insightful

      Pretty sure the defined time frame of "since Reagan fired PATCO" involves being critical of every administration since then that hasn't resolved the issues. I suppose one wouldn't want basic reading comprehension to get in the way of a good partisan knee jerk lash out though.

      Americans are fucking weird.

    5. Re:Software error ... by Rei · · Score: 5, Insightful

      So, I actually am a programmer for an ATC system...

      First off, this isn't as bad as it sounds as far as safety goes. One first needs to ask themselves, "what is the purpose of an ATC system?". The simple answer is, "don't ever let two aircraft exist in the same location at the same time". So any two aircraft can be separated in a) time, b) location, or c) altitude, and so long as they meet the minimum safety distances, that's all okay. Complicating this is the great variety of hardware on the aircraft, communications methods and protocols, and gaps in the information available to you, plus the wide variety in ATC systems and how they talk to each other. And there's a lot of potential instability at each stage. So basically ATC systems are massive collections of "special cases" that need to be handled on top of the basics. Maybe some line in Denmark is garbling messages that lead to you being fed bogus data. Maybe some aircraft in India's buggy hardware is for some reason spamming everyone on the network. Maybe you've got two different systems handling radar data and one says the radars are all fine and the other says they're not. Maybe the aircraft says they were at X point at Y time but some radar says something different. These are the sorts of things we have to deal on a weekly if not daily interval, and they lead what seems like it should be very simple pieces of software to become really huge systems.

      As mentioned, there can be lots of instability. Yep, it's true, these things can be rather buggy - both hardware and software. They're usually old designs that may have been poor design from the beginning, but have had to be continually patched and patched over the course of decades. Don't like that? Throw some more funds in for new ATC systems designed from scratch, otherwise this is going to continue to be the reality (yeah, new subsystems do come in every now and then for various purposes, but old systems are slow to go away).

      So, instability and bugs can sound scary. But remember the goals of an ATC system: separation. So let's just say that you lose the whole system for a long time - what do you do? Well, you basically revert to paper, and you've got a LOT more phone calls to make. You have to allow for more separation, and because of the increased workload, you can't handle as many planes. So you have to greatly reduce the number of planes in your region - they have to divert or wait. It's big delays, which costs big money. But it's not like we just start guessing whether planes are going to run into something or not.

      Our software here is predominantly old C code with a little bit of C++, and miscellaneous like yacc and lex. There are changelog entries dating back to the 80s - though that's the manual changelogs, it didn't go under revision control until the late 90s. Its core uses macros to an annoying degree to emulate object-oriented design in C; macros can be nested dozens of layers deep. It makes bugs very hard to find sometimes, but it's the core of the software, so it's not something that can be easily changed. So we do our best. Yes, there are "WONTFIX" bugs that we know about, and operators have documented procedures for working around them (usually involving restarting some module - the system is very modular, you don't have to restart the whole thing to fix a part that's acting up). But we always prioritize fixing the things that get in the way of their work the most - there's a lot of direct back and forth. Again, safety always takes top priority, then throughput. Everything else is way down below on the priority list.

      Changes work through the following process. A report of a bug or feature request is made. Someone analyses it and if they think it's worth working on writes up a task and assigns it to a programmer. The programmer works on the task and when they think it's ready they submit it for code review. Another programmer looks through all of the code and tries to see if they have any complaints. After any necessary back and forth to get things r

      --
      "99 dead duelists of Dios on the wall. 99 dead duelists of Dios! Take one's ring, pass it around..."
  2. As this is the FAA by RevWaldo · · Score: 5, Funny

    I'm betting someone just got some of the punch cards mixed up.

    .

  3. Language: ADA by JumboMessiah · · Score: 5, Informative

    While everyone speculates on GC vs heap vs what flavor is my coffee, ERAM approach systems use ADA as the language of choice.

    reference

  4. Re:QA process? by Brett+Buck · · Score: 5, Insightful

    It didn't get caught in testing because testing is by far the most expensive and time-consuming part of the development process, and is always the first thing to get cut/trimmed/"streamlined". Just like it has been forever.

  5. Nobody by U2xhc2hkb3QgU3Vja3M · · Score: 5, Funny

    Nobody expects the ERROR: OUT OF MEMORY.