Air Traffic Snafu: FAA System Runs Out of Memory

Posted by Soulskill on Wednesday August 19, 2015 @02:32AM from the must-run-in-chrome dept.

minstrelmike writes: Over the weekend, hundreds of flights were delayed or canceled in the Washington, D.C. area after air traffic systems malfunctioned. Now, the FAA says the problem was related to a recent software upgrade at a local radar facility. The software had been upgraded to display customized windows of reference data that were supposed to disappear once deleted. Unfortunately, the systems ended up running out of memory. The FAA's report is vague about whether it was operator error or software error: "... as controllers adjusted their unique settings, those changes remained in memory until the storage limit was filled." Wonder what programming language they used?

Software error ... by gstoddart · 2015-08-19 02:38 · Score: 5, Informative

You can make the argument that if the software allowed the operators to crash the system, it's a software fault.
You can also make the argument that stuff like this should have been tested in parallel with the live system so this wasn't a possibility.
I mean, my god, what are the change management and testing practices which allowed this to only be discovered in your real system?
I've been around a few systems which had to do with aircraft ... and the rules and practices surrounding them are pretty paranoid and rigorous, because the stakes are so high. For an actual air traffic system I'm stunned this happened.
I guess I'm not surprised, but I am stunned.

--
Lost at C:>. Found at C.
1. Re:Software error ... by Anonymous Coward · 2015-08-19 02:57 · Score: 5, Insightful
  
  No, no, no, no, no! The concept of garbage collecting is a reaction to poor coding practices and reliance on it is laziness. Software engineers responsible for real-time, public safety software should be capable of managing memory in their code!
2. Re:Software error ... by U2xhc2hkb3QgU3Vja3M · 2015-08-19 03:03 · Score: 5, Insightful
  
  Not only should they be "capable" of managing memory in their code, it should be part of the software design itself.
3. Re:Software error ... by dwpro · 2015-08-19 03:31 · Score: 4, Insightful
  
  Software engineers responsible for real-time, public safety software should be capable of managing memory in their code
  And surgeons responsible for cutting open live human beings should be capable of not leaving tools in the person they're operating on, but it still happens. Professionals make mistakes. Garbage collection is a useful tool to make it more difficult to screw up.
  
  --
  Millions long for immortality who do not know what to do with themselves on a rainy Sunday afternoon. -- Susan Ertz
4. Re:Software error ... by jamstar7 · 2015-08-19 03:32 · Score: 5, Insightful
  
  Couple things to keep in mind.
  
  The civilian aircraft control system has been chronically underfunded for decades, since Reagan fired PATCO. One of the things they were on strike for was for better equipment to do their jobs better, easier, and with less stress. Even in the 80's, the computers and radars were dinosaurs best kept in a museum. Upgrades since then have always been a day late and a dollar short.
  
  The airspace above the US is the busiest in the world, and it's just getting worse. They don't even report near-misses anymore to the media unless the pilots can see each other giving them the finger. They're that common.
  
  Nothing will be done until 3 or 4 planes do a mid-air and the public outcry is so bad that people are ready to march on the FAA's office with torches and pitchforks. Then there will be a massive round of public firings to appease the crowd, a slight boost in funding to the FAA, followed by further deregulation of the airlines.
  
  Personally, with all the deregulation already, I'm surprised more planes don't shed parts along the way.
  
  --
  Understanding the scope of the problem is the first step on the path to true panic.
5. Re:Software error ... by SpeedBump0619 · 2015-08-19 03:53 · Score: 4, Informative
  
  Professionals make mistakes. Garbage collection is a useful tool to make it more difficult to screw up.
  I get this. And as a software engineer I fully agree. However, in practical terms, there shouldn't be any dynamic memory management happening at all.
  It's a real-time system. It *must* interact, on time, with all the planes that are in its domain. That should be a bounded, predictable load, or there's no way to guarantee responsiveness. Given that, an analysis should have been done on the maximum number of elements the system supported. Those elements should have been preallocated (into a pool if you want to treat them "dynamically") before actual operation began. If/when the pool allocator ran out of items it should do two things: allocate (dynamically) more, and scream bloody murder to everyone who would listen regarding the unexpected allocation.
  This is (one of) the reason(s) I generally haven't liked garbage collected languages for real time systems. There's rarely ever a way to guard against unexpected allocations, because *every* allocation is blind.
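  The pool behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not anything from the FAA system: `Track`, `TrackPool`, the logger name and the capacity are all made-up names and numbers. The pool is filled once before operation starts; if it ever runs dry it still hands out a record, but it counts the event and logs it at CRITICAL level so the sizing analysis gets revisited.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("track_pool")  # hypothetical logger name


@dataclass
class Track:
    """Hypothetical per-aircraft record; the fields are illustrative only."""
    callsign: str = ""
    in_use: bool = False


class TrackPool:
    """Pool of records that is allocated up front, before operation begins."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.overflow_allocations = 0
        # All expected records are created here, not during operation.
        self._free = [Track() for _ in range(capacity)]

    def acquire(self, callsign: str) -> Track:
        if self._free:
            track = self._free.pop()
        else:
            # Pool exhausted: the load analysis was wrong. Keep working,
            # but make the unexpected allocation impossible to miss.
            self.overflow_allocations += 1
            log.critical(
                "track pool exhausted (capacity=%d); dynamic allocation #%d",
                self.capacity,
                self.overflow_allocations,
            )
            track = Track()
        track.callsign = callsign
        track.in_use = True
        return track

    def release(self, track: Track) -> None:
        # Reset the record and return it to the free list for reuse.
        track.callsign = ""
        track.in_use = False
        self._free.append(track)
```

  With `TrackPool(capacity=2)`, the third `acquire` call without an intervening `release` is the one that logs and bumps `overflow_allocations`. A real-time implementation would more likely do this in C or Ada with static storage; the Python version only shows the control flow.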
6. Re:Software error ... by Anonymous Coward · 2015-08-19 04:02 · Score: 5, Insightful
  
  Pretty sure the defined time frame of "since Reagan fired PATCO" involves being critical of every administration since then that hasn't resolved the issues. I suppose one wouldn't want basic reading comprehension to get in the way of a good partisan knee jerk lash out though.
  Americans are fucking weird.
7. Re:Software error ... by Rei · 2015-08-19 04:09 · Score: 5, Insightful
  
  So, I actually am a programmer for an ATC system...
  First off, this isn't as bad as it sounds as far as safety goes. One first needs to ask themselves, "what is the purpose of an ATC system?". The simple answer is, "don't ever let two aircraft exist in the same location at the same time". So any two aircraft can be separated in a) time, b) location, or c) altitude, and so long as they meet the minimum safety distances, that's all okay. Complicating this is the great variety of hardware on the aircraft, communications methods and protocols, and gaps in the information available to you, plus the wide variety in ATC systems and how they talk to each other. And there's a lot of potential instability at each stage. So basically ATC systems are massive collections of "special cases" that need to be handled on top of the basics. Maybe some line in Denmark is garbling messages that lead to you being fed bogus data. Maybe some aircraft in India's buggy hardware is for some reason spamming everyone on the network. Maybe you've got two different systems handling radar data and one says the radars are all fine and the other says they're not. Maybe the aircraft says they were at X point at Y time but some radar says something different. These are the sorts of things we have to deal with on a weekly if not daily basis, and they lead what seem like they should be very simple pieces of software to become really huge systems.
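  As a rough illustration of the "basics" before any special cases pile on, here is a minimal Python sketch of a separation check at a single instant. Everything in it is an assumption for illustration: the `Position` type, the flat x/y coordinates in nautical miles, and the default thresholds are placeholders, not values taken from any real ATC system. Separation in time (the same point at different times) is left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """Hypothetical aircraft position on a flat plane (an assumption)."""
    x_nm: float         # east-west offset in nautical miles
    y_nm: float         # north-south offset in nautical miles
    altitude_ft: float  # altitude in feet


def is_separated(
    a: Position,
    b: Position,
    min_lateral_nm: float = 5.0,      # placeholder threshold
    min_vertical_ft: float = 1000.0,  # placeholder threshold
) -> bool:
    """Two aircraft are fine if they are far enough apart either
    laterally or vertically; they do not need both."""
    lateral = math.hypot(a.x_nm - b.x_nm, a.y_nm - b.y_nm)
    vertical = abs(a.altitude_ft - b.altitude_ft)
    return lateral >= min_lateral_nm or vertical >= min_vertical_ft
```

  The check itself is a few lines; the size of real systems comes from deciding which position report to trust when the inputs disagree.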
  As mentioned, there can be lots of instability. Yep, it's true, these things can be rather buggy - both hardware and software. They're usually old designs that may have been poor design from the beginning, but have had to be continually patched and patched over the course of decades. Don't like that? Throw some more funds in for new ATC systems designed from scratch, otherwise this is going to continue to be the reality (yeah, new subsystems do come in every now and then for various purposes, but old systems are slow to go away).
  So, instability and bugs can sound scary. But remember the goals of an ATC system: separation. So let's just say that you lose the whole system for a long time - what do you do? Well, you basically revert to paper, and you've got a LOT more phone calls to make. You have to allow for more separation, and because of the increased workload, you can't handle as many planes. So you have to greatly reduce the number of planes in your region - they have to divert or wait. It's big delays, which costs big money. But it's not like we just start guessing whether planes are going to run into something or not.
  Our software here is predominantly old C code with a little bit of C++, and miscellaneous tools like yacc and lex. There are changelog entries dating back to the 80s - though that's the manual changelogs, it didn't go under revision control until the late 90s. Its core uses macros to an annoying degree to emulate object-oriented design in C; macros can be nested dozens of layers deep. It makes bugs very hard to find sometimes, but it's the core of the software, so it's not something that can be easily changed. So we do our best. Yes, there are "WONTFIX" bugs that we know about, and operators have documented procedures for working around them (usually involving restarting some module - the system is very modular, you don't have to restart the whole thing to fix a part that's acting up). But we always prioritize fixing the things that get in the way of their work the most - there's a lot of direct back and forth. Again, safety always takes top priority, then throughput. Everything else is way down below on the priority list.
  Changes work through the following process. A report of a bug or feature request is made. Someone analyses it and if they think it's worth working on writes up a task and assigns it to a programmer. The programmer works on the task and when they think it's ready they submit it for code review. Another programmer looks through all of the code and tries to see if they have any complaints. After any necessary back and forth to get things r
  
  --
  "99 dead duelists of Dios on the wall. 99 dead duelists of Dios! Take one's ring, pass it around..."
8. Re:Software error ... by phantomfive · 2015-08-19 04:41 · Score: 4, Informative
  
  You are trying to be sarcastic, but the MISRA standard for embedded systems includes these rules:
  
  1) absolutely no recursion. it could lead to stack overflows.
  2) absolutely no local variables. it could lead to stack overflows.
  3) absolutely no use of malloc or free. it could lead to stack overflows.
  
  So yeah, that has been an accepted approach for many years.
  
  --
  "First they came for the slanderers and i said nothing."
9. Re:Software error ... by phantomfive · 2015-08-19 04:57 · Score: 4, Informative
  
  Garbage collection is a useful tool to make it more difficult to screw up.
  Recently I've seen a lot of memory leaks in Java and JavaScript. People stick things in a hash table or a queue, then forget to remove them (angular.js also has gotchas to watch for if you want to avoid memory leaks). Because programmers in those languages don't think about memory, they end up with more memory leaks than programmers in C.
  
  For a system that needs high reliability, garbage collection is not the answer, and can make things worse.
  
  --
  "First they came for the slanderers and i said nothing."
Programming language doesn't matter by rockmuelle · 2015-08-19 02:46 · Score: 4, Insightful

You can have poor memory management in any language.
Sure, historically C/C++ have been known for memory leaks due to memory that's not freed, but in Java/Python/pick-your-favorite-garbage-collected-language or using smart pointers in C++, all you need to do is have a container that keeps a reference to everything and nothing will go away. It's not hard to do this.
Based on the summary, it sounds like that's what happened. Some monitor views just kept a list of everything and the developer forgot to purge the lists when things went out of, er, scope.
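To make that concrete, here is a minimal Python sketch of that kind of leak. It is a guess at the general shape only, not the FAA code: `Window`, `Display` and the `history` list are hypothetical names. A closed window is removed from the visible list, but a second list still holds a reference, so the garbage collector can never reclaim it.

```python
import gc
import weakref


class Window:
    """Hypothetical stand-in for a controller's customized reference window."""

    def __init__(self, title):
        self.title = title


class Display:
    def __init__(self):
        self.visible = []
        self.history = []  # the container that keeps a reference to everything

    def open(self, title):
        window = Window(title)
        self.visible.append(window)
        self.history.append(window)
        return window

    def close_leaky(self, window):
        # Bug: the window disappears from the screen but stays in history.
        self.visible.remove(window)

    def close_fixed(self, window):
        # Fix: purge every container that references the window.
        self.visible.remove(window)
        self.history.remove(window)


def survives_close(close_method_name):
    """Return True if a closed window is still alive after a GC pass."""
    display = Display()
    window = display.open("reference data")
    probe = weakref.ref(window)  # a weak reference does not keep it alive
    getattr(display, close_method_name)(window)
    del window
    gc.collect()
    return probe() is not None


print(survives_close("close_leaky"))  # True: history still references it
print(survives_close("close_fixed"))  # False: nothing references it anymore
```

The garbage collector is working as designed in both cases; the leaky version just never makes the object unreachable.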
-Chris
As this is the FAA by RevWaldo · 2015-08-19 02:53 · Score: 5, Funny

I'm betting someone just got some of the punch cards mixed up.

.

--
Prisencolinensinainciusol. Ol Rait!
Language: Ada by JumboMessiah · 2015-08-19 03:01 · Score: 5, Informative

While everyone speculates on GC vs heap vs what flavor is my coffee, ERAM approach systems use Ada as the language of choice.
reference
Re:QA process? by Brett+Buck · 2015-08-19 03:03 · Score: 5, Insightful

It didn't get caught in testing because testing is by far the most expensive and time-consuming part of the development process, and is always the first thing to get cut/trimmed/"streamlined". Just like it has been forever.
Nobody by U2xhc2hkb3QgU3Vja3M · 2015-08-19 03:11 · Score: 5, Funny

Nobody expects the ERROR: OUT OF MEMORY.
And the language is...... by jeremyp · 2015-08-19 03:37 · Score: 4, Informative

Ada and Java apparently
http://dl.acm.org/citation.cfm...

--
All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
Re:This is why we like C by Vlad_the_Inhaler · 2015-08-19 03:43 · Score: 4, Interesting
I have actually seen something similar to this before, also involving an Air Traffic Control system.
They were having some problem in handling "Large Messages", I am not sure of the exact details / circumstances - I was only peripherally involved. Anyway, the programmer wrote these to a file, then they were processed asynchronously and deleted. This minor change was tested - as usual at the site - by someone shooting an hour's production traffic through the test system and checking for unexpected aborts or other abnormalities. All was fine, the spooling file was 1% full.
The patch went online. 4 days later (it was a Sunday morning and it was snowing) the file hit some limit and refused to accept new messages. At that moment things went "Keystone Cops".
- All department heads were informed, except programming. Given that only the one patch had been applied in the previous week, that was not very helpful. Headless chickens ran around trying to find a solution.
- Standard practice in this type of situation was to switch to the backup/standby system. Since ATC data is very short lived, the backup system had an empty database which would then be populated dynamically. All "Station Chiefs" had to approve this step. One refused because he could not see any problem. Finally someone managed to make him understand what the problem was, then it was "oh yes, we are seeing that as well". His was the smallest station of course.
- Standard procedure was also to switch to manual control - rather than automated - and cancel short-haul flights. The railways could take up the slack. This was done.
The switch was duly made and everything was working again.
It turned out that the deletion of the processed records had a bug. One hour of live data left the file 1% full. 100 hours . . . do the math. It took 5 or 10 minutes for the programmer to fix the problem, he could have done it live on the Sunday if anyone had bothered to tell him what was going on.
One of the lessons from that is also relevant here - one hour of live data left the file 1% full. I'd bet that they were testing that the new feature worked, not looking for hidden side-effects.
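The arithmetic is simple enough to put in a test harness. The following Python sketch is only an illustration: the function names and the 30-day uptime requirement are my assumptions, and it assumes the file fills linearly from empty. It projects the time until a resource is full from what a short test run left behind.

```python
import math


def hours_until_full(fill_fraction_after_test, test_hours):
    """Project time to 100% full, assuming linear growth from empty."""
    rate_per_hour = fill_fraction_after_test / test_hours
    if rate_per_hour <= 0:
        return math.inf  # nothing accumulated during the test
    return 1.0 / rate_per_hour


def passes_soak_check(fill_fraction_after_test, test_hours, required_uptime_hours):
    """Pass only if the projected time to full exceeds the required uptime."""
    projected = hours_until_full(fill_fraction_after_test, test_hours)
    return projected > required_uptime_hours


# One hour of traffic left the spool file 1% full.
hours = hours_until_full(0.01, 1.0)
print(round(hours), "hours, about", round(hours / 24, 1), "days")

# Hypothetical requirement: run 30 days (720 hours) without intervention.
print(passes_soak_check(0.01, 1.0, 720))
```

A projection of roughly 100 hours lines up with the file hitting its limit about 4 days after the patch went online. A check like this only helps if the test looks at what is left over at the end, not just at whether anything aborted.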
--
Mielipiteet omiani - Opinions personal, facts suspect.