Air Traffic Snafu: FAA System Runs Out of Memory
minstrelmike writes: Over the weekend, hundreds of flights were delayed or canceled in the Washington, D.C. area after air traffic systems malfunctioned. Now, the FAA says the problem was related to a recent software upgrade at a local radar facility. The software had been upgraded to display customized windows of reference data that were supposed to disappear once deleted. Unfortunately, the systems ended up running out of memory. The FAA's report is vague about whether it was operator error or software error: "... as controllers adjusted their unique settings, those changes remained in memory until the storage limit was filled." Wonder what programming language they used?
But nobody should ever need anything more than 640k!
You can make the argument that if the software allowed the operators to crash the system, it's a software fault.
You can also make the argument that stuff like this should have been tested in parallel with the live system so this wasn't a possibility.
I mean, my god, what are the change management and testing practices which allowed this to only be discovered in your real system?
I've been around a few systems which had to do with aircraft ... and the rules and practices surrounding them are pretty paranoid and rigorous, because the stakes are so high. For an actual air traffic system I'm stunned this happened.
I guess I'm not surprised, but I am stunned.
Lost at C:>. Found at C.
You can have poor memory management in any language.
Sure, historically C/C++ have been known for memory leaks due to memory that's not freed, but in Java/Python/pick-your-favorite-garbage-collected-language, or using smart pointers in C++, all you need to do is have a container that keeps a reference to everything and nothing will go away. It's not hard to do this.
Based on the summary, it sounds like that's what happened. Some monitor views just kept a list of everything and the developer forgot to purge the lists when things went out of, er, scope.
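To make that concrete, here is a minimal Python sketch of the pattern. It is not the FAA's code; the class and method names (ReferenceWindow, Display, history) are made up for illustration. A window is "closed" but a forgotten list still references it, so the garbage collector can never reclaim it.

```python
import gc
import weakref


class ReferenceWindow:
    """Hypothetical stand-in for a controller's customized display window."""

    def __init__(self, name):
        self.name = name
        self.payload = bytearray(1024)  # pretend reference data


class Display:
    """Hypothetical display manager, invented for this example."""

    def __init__(self):
        self.visible = {}  # windows currently on screen
        self.history = []  # every window ever opened (the leak)

    def open_window(self, name):
        window = ReferenceWindow(name)
        self.visible[name] = window
        self.history.append(window)
        return window

    def close_window_leaky(self, name):
        # Removes the window from the screen, but self.history still
        # holds a reference, so the object stays alive.
        del self.visible[name]

    def close_window_fixed(self, name):
        # Drops every reference the display holds.
        window = self.visible.pop(name)
        self.history.remove(window)


def survives_close(close_method_name):
    """Return True if the window object is still alive after closing."""
    display = Display()
    window = display.open_window("sector-12")
    probe = weakref.ref(window)  # lets us observe the object without owning it
    del window                   # drop the caller's own reference
    getattr(display, close_method_name)("sector-12")
    gc.collect()                 # for runtimes without reference counting
    return probe() is not None


if __name__ == "__main__":
    print("leaky close keeps object alive:", survives_close("close_window_leaky"))
    print("fixed close keeps object alive:", survives_close("close_window_fixed"))
```

No language feature saves you here: the collector is doing exactly what it was told, because the object is still reachable.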
-Chris
Wut.
There aren't any memory leaks when you write in C?
Maybe you have them, but I don't. That's because my C-peen is larger than yours.
I'm betting someone just got some of the punch cards mixed up.
.
Prisencolinensinainciusol. Ol Rait!
While everyone speculates on GC vs heap vs what flavor is my coffee, ERAM approach systems use Ada as the language of choice.
reference
It didn't get caught in testing because testing is by far the most expensive and time-consuming part of the development process, and is always the first thing to get cut/trimmed/"streamlined". Just like it has been forever.
He said ancient, not precambrian.
Nobody expects the ERROR: OUT OF MEMORY.
Ada and Java apparently
http://dl.acm.org/citation.cfm...
All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
I have actually seen something similar to this before, also involving an air traffic control system.
They were having some problem in handling "Large Messages", I am not sure of the exact details / circumstances - I was only peripherally involved. Anyway, the programmer wrote these to a file, then they were processed asynchronously and deleted. This minor change was tested - as usual at the site - by someone shooting an hour's production traffic through the test system and checking for unexpected aborts or other abnormalities. All was fine, the spooling file was 1% full.
The patch went online. 4 days later (it was a Sunday morning and it was snowing) the file hit some limit and refused to accept new messages. At that moment things went "Keystone Cops".
The switch was duly made and everything was working again.
It turned out that the deletion of the processed records had a bug. One hour of live data left the file 1% full. 100 hours . . . do the math. It took the programmer 5 or 10 minutes to fix the problem; he could have done it live on the Sunday if anyone had bothered to tell him what was going on.
One of the lessons from that is also relevant here - one hour of live data left the file 1% full. I'd bet that they were testing that the new feature worked, not looking for hidden side-effects.
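The arithmetic is simple enough to automate as a check on any soak test. A rough sketch, assuming the file fills at a constant rate (the function name and the linear-growth assumption are mine, not anything from the site's actual test procedure):

```python
def hours_until_full(fill_fraction_observed, test_hours):
    """Project how long a resource lasts, assuming linear growth.

    fill_fraction_observed: fraction of capacity used at the end of the
        test run (0.01 means 1% full).
    test_hours: how long the test ran.
    """
    rate_per_hour = fill_fraction_observed / test_hours
    if rate_per_hour <= 0:
        return float("inf")  # no growth observed, no projected limit
    return 1.0 / rate_per_hour


if __name__ == "__main__":
    # Figures from the story above: one hour of traffic left the file 1% full.
    hours = hours_until_full(0.01, 1.0)
    print(f"projected time to full: {hours:.0f} hours ({hours / 24:.1f} days)")
```

A one-hour test that ends at 1% looks like a pass unless someone asks whether that 1% ever goes back down.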
Mielipiteet omiani - Opinions personal, facts suspect.
I have seen far too many occasions where some hotshot made an out of band code change, broke prod, and then said "oh, it's just a quick fix".
It would have to be one hell of an emergency to have live changes on a prod system be anything other than a hanging offense. I've seen more problems caused by it than things fixed by it.
I've experienced several outages caused by someone who was either thinking "it's just a quick fix", or was trying to sneak in a fix for something which shouldn't have left their desk in the first place.
Lost at C:>. Found at C.