Slashdot Mirror


Air Traffic Snafu: FAA System Runs Out of Memory

minstrelmike writes: Over the weekend, hundreds of flights were delayed or canceled in the Washington, D.C. area after air traffic systems malfunctioned. Now, the FAA says the problem was related to a recent software upgrade at a local radar facility. The software had been upgraded to display customized windows of reference data that were supposed to disappear once deleted. Unfortunately, the systems ended up running out of memory. The FAA's report is vague about whether it was operator error or software error: "... as controllers adjusted their unique settings, those changes remained in memory until the storage limit was filled." Wonder what programming language they used?

12 of 234 comments

  1. Programming language doesn't matter by rockmuelle · · Score: 4, Insightful

    You can have poor memory management in any language.

    Sure, historically C/C++ have been known for memory leaks due to memory that's not freed, but in Java/Python/pick-your-favorite-garbage-collected-language or using smart pointers in C++, all you need to do is have a container that keeps a reference to everything and nothing will go away. It's not hard to do this.

    Based on the summary, it sounds like that's what happened. Some monitor views just kept a list of everything and the developer forgot to purge the lists when things went out of, er, scope.
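    A minimal sketch of that failure mode in a garbage-collected language, assuming nothing about the FAA's actual code. The names MonitorView, LeakyRegistry, and PurgingRegistry are hypothetical and exist only for illustration. The point is that a "closed" view cannot be reclaimed while a container still holds a reference to it.

```python
import gc
import weakref


class MonitorView:
    """Hypothetical stand-in for a customized display window."""

    def __init__(self, name):
        self.name = name


class LeakyRegistry:
    """Keeps a strong reference to every view ever opened."""

    def __init__(self):
        self._views = []

    def open_view(self, name):
        view = MonitorView(name)
        self._views.append(view)
        return view

    def close_view(self, view):
        # Bug (deliberate, for illustration): the view stays in self._views,
        # so the collector can never reclaim it.
        pass


class PurgingRegistry(LeakyRegistry):
    """Same registry, but closing a view drops the stored reference."""

    def close_view(self, view):
        self._views.remove(view)


def is_reclaimed_after_close(registry):
    """Open and close one view, then report whether it was reclaimed."""
    view = registry.open_view("sector-12")
    probe = weakref.ref(view)  # a weak reference does not keep the view alive
    registry.close_view(view)
    del view                   # drop the caller's own reference
    gc.collect()
    return probe() is None


print(is_reclaimed_after_close(LeakyRegistry()))    # False
print(is_reclaimed_after_close(PurgingRegistry()))  # True
```

    The only difference between the two registries is one line in close_view, which is about how easy this kind of leak is to write.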

    -Chris

  2. Re:Software error ... by Anonymous Coward · · Score: 5, Insightful

    No, no, no, no, no! The concept of garbage collecting is a reaction to poor coding practices and reliance on it is laziness. Software engineers responsible for real-time, public safety software should be capable of managing memory in their code!

  3. Re:Software error ... by U2xhc2hkb3QgU3Vja3M · · Score: 5, Insightful

    Not only should they be "capable" of managing memory in their code, it should be part of the software design itself.

  4. Re:QA process? by Brett Buck · · Score: 5, Insightful

    It didn't get caught in testing because testing is by far the most expensive and time-consuming part of the development process, and is always the first thing to get cut/trimmed/"streamlined". Just like it has been forever.

  5. Re:Software error ... by dwpro · · Score: 4, Insightful

    Software engineers responsible for real-time, public safety software should be capable of managing memory in their code

    And surgeons responsible for cutting open live human beings should be capable of not leaving tools in the person they're operating on, but it still happens. Professionals make mistakes. Garbage collection is a useful tool to make it more difficult to screw up.

    --
    Millions long for immortality who do not know what to do with themselves on a rainy Sunday afternoon. -- Susan Ertz

  6. Re:Software error ... by jamstar7 · · Score: 5, Insightful

    Couple things to keep in mind.

    The civilian aircraft control system has been chronically underfunded for decades, since Reagan fired PATCO. One of the things they were on strike for was for better equipment to do their jobs better, easier, and with less stress. Even in the 80's, the computers and radars were dinosaurs best kept in a museum. Upgrades since then have always been a day late and a dollar short.

    The airspace above the US is the busiest in the world, and it's just getting worse. They don't even report near-misses anymore to the media unless the pilots can see each other giving them the finger. They're that common.

    Nothing will be done until 3 or 4 planes do a mid-air and the public outcry is so bad that people are ready to march on the FAA's office with torches and pitchforks. Then there will be a massive round of public firings to appease the crowd, a slight boost in funding to the FAA, followed by further deregulation of the airlines.

    Personally, with all the deregulation already, I'm surprised more planes don't shed parts along the way.

    --
    Understanding the scope of the problem is the first step on the path to true panic.

  7. Re:Software error ... by Anonymous Coward · · Score: 5, Insightful

    Pretty sure the defined time frame of "since Reagan fired PATCO" involves being critical of every administration since then that hasn't resolved the issues. I suppose one wouldn't want basic reading comprehension to get in the way of a good partisan knee jerk lash out though.

    Americans are fucking weird.

  8. Re:Software error ... by Rei · · Score: 5, Insightful

    So, I actually am a programmer for an ATC system...

    First off, this isn't as bad as it sounds as far as safety goes. One first needs to ask themselves, "what is the purpose of an ATC system?". The simple answer is, "don't ever let two aircraft exist in the same location at the same time". So any two aircraft can be separated in a) time, b) location, or c) altitude, and so long as they meet the minimum safety distances, that's all okay. Complicating this is the great variety of hardware on the aircraft, communications methods and protocols, and gaps in the information available to you, plus the wide variety in ATC systems and how they talk to each other. And there's a lot of potential instability at each stage. So basically ATC systems are massive collections of "special cases" that need to be handled on top of the basics. Maybe some line in Denmark is garbling messages, which leads to you being fed bogus data. Maybe some aircraft in India's buggy hardware is for some reason spamming everyone on the network. Maybe you've got two different systems handling radar data and one says the radars are all fine and the other says they're not. Maybe the aircraft says they were at X point at Y time but some radar says something different. These are the sorts of things we have to deal with on a weekly if not daily basis, and they cause what seem like they should be very simple pieces of software to become really huge systems.
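    The basic separation rule described above can be sketched roughly as follows. This is a toy under stated assumptions, not how any real ATC system works: the Track type, the flat x/y coordinates, and the two minima are all hypothetical values chosen for illustration, and the check compares positions at a single instant, so it covers separation by location and altitude but not by time.

```python
import math
from dataclasses import dataclass

# Assumed minima for this sketch only; real minima depend on airspace,
# equipment, and procedures.
MIN_LATERAL_NM = 5.0
MIN_VERTICAL_FT = 1000.0


@dataclass(frozen=True)
class Track:
    """Hypothetical snapshot of one aircraft at a single instant."""
    callsign: str
    x_nm: float         # east-west position, nautical miles
    y_nm: float         # north-south position, nautical miles
    altitude_ft: float


def is_separated(a, b, min_lateral_nm=MIN_LATERAL_NM,
                 min_vertical_ft=MIN_VERTICAL_FT):
    """Two tracks are separated if EITHER the lateral or the vertical minimum holds."""
    lateral = math.hypot(a.x_nm - b.x_nm, a.y_nm - b.y_nm)
    vertical = abs(a.altitude_ft - b.altitude_ft)
    return lateral >= min_lateral_nm or vertical >= min_vertical_ft


def conflicts(tracks):
    """Return callsign pairs that violate both minima at the same time."""
    pairs = []
    for i, a in enumerate(tracks):
        for b in tracks[i + 1:]:
            if not is_separated(a, b):
                pairs.append((a.callsign, b.callsign))
    return pairs


tracks = [
    Track("A", 0.0, 0.0, 30000.0),
    Track("B", 3.0, 0.0, 30000.0),   # too close to A laterally and vertically
    Track("C", 3.0, 0.0, 32000.0),   # same spot as B, but 2000 ft higher
    Track("D", 10.0, 0.0, 30000.0),  # same level as A and B, far enough away
]
print(conflicts(tracks))  # [('A', 'B')]
```

    Everything listed above as a complication (garbled messages, conflicting radar feeds, stale position reports) is about whether the inputs to a check like this can be trusted, which is where the real size of these systems comes from.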

    As mentioned, there can be lots of instability. Yep, it's true, these things can be rather buggy - both hardware and software. They're usually old designs that may have been poor designs from the beginning, but have had to be continually patched and patched over the course of decades. Don't like that? Throw some more funds in for new ATC systems designed from scratch, otherwise this is going to continue to be the reality (yeah, new subsystems do come in every now and then for various purposes, but old systems are slow to go away).

    So, instability and bugs can sound scary. But remember the goals of an ATC system: separation. So let's just say that you lose the whole system for a long time - what do you do? Well, you basically revert to paper, and you've got a LOT more phone calls to make. You have to allow for more separation, and because of the increased workload, you can't handle as many planes. So you have to greatly reduce the number of planes in your region - they have to divert or wait. It's big delays, which costs big money. But it's not like we just start guessing whether planes are going to run into something or not.

    Our software here is predominantly old C code with a little bit of C++, and miscellaneous tools like yacc and lex. There are changelog entries dating back to the 80s - though those are the manual changelogs; it didn't go under revision control until the late 90s. Its core uses macros to an annoying degree to emulate object-oriented design in C; macros can be nested dozens of layers deep. It makes bugs very hard to find sometimes, but it's the core of the software, so it's not something that can be easily changed. So we do our best. Yes, there are "WONTFIX" bugs that we know about, and operators have documented procedures for working around them (usually involving restarting some module - the system is very modular, you don't have to restart the whole thing to fix a part that's acting up). But we always prioritize fixing the things that get in the way of their work the most - there's a lot of direct back and forth. Again, safety always takes top priority, then throughput. Everything else is way down below on the priority list.

    Changes work through the following process. A report of a bug or feature request is made. Someone analyses it and if they think it's worth working on writes up a task and assigns it to a programmer. The programmer works on the task and when they think it's ready they submit it for code review. Another programmer looks through all of the code and tries to see if they have any complaints. After any necessary back and forth to get things r

    --
    "99 dead duelists of Dios on the wall. 99 dead duelists of Dios! Take one's ring, pass it around..."

  9. Re: Software error ... by Anonymous Coward · · Score: 2, Insightful

    Nice job trying to dodge responsibility, right wing whacko. Your side is so busy funneling money to their big contractor buddies for stuff that makes news when it actually works (see, pretty much every defense program ever). They then blew off concerns from people doing actual work to make a political point, and your complaint is that the other side can't fix your screwups fast enough?

    How about your people stop sabotaging agencies that do useful things just so they can say that government never does anything right? (Because if government does do something right, conservatives will jump in and fix that for you.)

  10. Re:Software error ... by Anonymous Coward · · Score: 2, Insightful

    The reading from one such as this stops when they see a negative comment about Saint Reagan.

  11. Re:This is why we like C by gstoddart · · Score: 3, Insightful

    It took 5 or 10 minutes for the programmer to fix the problem, he could have done it live on the Sunday if anyone had bothered to tell him what was going on.

    I have seen far too many occasions where some hotshot made an out of band code change, broke prod, and then said "oh, it's just a quick fix".

    It would have to be one hell of an emergency to have live changes on a prod system be anything other than a hanging offense. I've seen more problems caused by it than things fixed by it.

    I've experienced several outages caused by someone who was either thinking "it's just a quick fix", or was trying to sneak in a fix for something which shouldn't have left their desk in the first place.

    --
    Lost at C:>. Found at C.

  12. Re:Software error ... by swillden · · Score: 3, Insightful

    No, no, no, no, no! The concept of garbage collecting is a reaction to poor coding practices and reliance on it is laziness. Software engineers responsible for real-time, public safety software should be capable of managing memory in their code!

    Garbage collection is a red herring. The notion that "real" software engineers must use manual deallocation is just as silly as the idea that garbage collection eliminates memory leaks. Though GC actually does eliminate dangling pointer bugs... by turning them into memory leaks.

    Garbage collection is a viable and reasonable strategy for handling deallocation -- in fact it can be significantly more efficient than manual deallocation, in terms of cycles spent on deallocation -- but it's not a panacea. It doesn't eliminate the need to think about object lifetimes or memory consumption. It reduces the amount of development effort focused on those issues, trading it instead for management of GC times. Whether that tradeoff is a net benefit depends on the context and system requirements.

    And that is what real software engineers do. They don't choose their tools based on which is the manliest and best for proving their coding prowess. They choose based on the nature of the problem and the resources available. Where GC interruptions can be tolerated, or safely scheduled, GC is a tool that automates away significant engineering effort. That's a good thing. Hard real-time systems generally don't tolerate GC very well, but virtually anything that interacts with people does tolerate brief (50 ms or less) GC pauses, and that's actually quite easy to achieve.
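    As a rough illustration of "safely scheduled" collection, here is a sketch using Python's standard gc module. The helper names gc_deferred and handle_burst are hypothetical. One caveat: CPython frees most objects immediately through reference counting, and the gc module only controls the collector for reference cycles, so this shows the scheduling idea rather than a way to meet hard real-time deadlines.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_deferred():
    """Suspend automatic cycle collection for a latency-sensitive section."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state even if the section raised.
        if was_enabled:
            gc.enable()


def handle_burst(messages):
    """Process a burst without collector pauses, then collect at a quiet moment."""
    with gc_deferred():
        results = [m.upper() for m in messages]  # stand-in for real work
    # The burst is over, so pay the collection cost now, at a time we chose.
    unreachable = gc.collect()
    return results, unreachable


results, unreachable = handle_burst(["climb", "descend"])
print(results)  # ['CLIMB', 'DESCEND']
```

    Whether deferring collection like this is acceptable depends on how much garbage can pile up during the deferred section, which is the same memory-consumption question raised above.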

    --
    Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.