Air Traffic Snafu: FAA System Runs Out of Memory
minstrelmike writes: Over the weekend, hundreds of flights were delayed or canceled in the Washington, D.C. area after air traffic systems malfunctioned. Now, the FAA says the problem was related to a recent software upgrade at a local radar facility. The software had been upgraded to display customized windows of reference data that were supposed to disappear once deleted. Unfortunately, the systems ended up running out of memory. The FAA's report is vague about whether it was operator error or software error: "... as controllers adjusted their unique settings, those changes remained in memory until the storage limit was filled." Wonder what programming language they used?
But nobody should ever need anything more than 640k!
You can make the argument that if the software allowed the operators to crash the system, it's a software fault.
You can also make the argument that stuff like this should have been tested in parallel with the live system so this wasn't a possibility.
I mean, my god, what are the change management and testing practices which allowed this to only be discovered in your real system?
I've been around a few systems which had to do with aircraft ... and the rules and practices surrounding them are pretty paranoid and rigorous, because the stakes are so high. For an actual air traffic system I'm stunned this happened.
I guess I'm not surprised, but I am stunned.
Lost at C:>. Found at C.
You can have poor memory management in any language.
Sure, historically C/C++ have been known for memory leaks due to memory that's not freed, but in Java/Python/pick-your-favorite-garbage-collected-language, or using smart pointers in C++, all you need to do is have a container that keeps a reference to everything and nothing will go away. It's not hard to do this.
Based on the summary, it sounds like that's what happened. Some monitor views just kept a list of everything and the developer forgot to purge the lists when things went out of, er, scope.
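To make that concrete, here is a minimal Python sketch of the failure mode described above. Every name in it (View, LeakyRegistry, PurgingRegistry, churn) is hypothetical and invented for illustration; nothing here is taken from the FAA's software, which by other accounts in this thread isn't even written in Python.

```python
class View:
    """Hypothetical stand-in for a controller's customized reference window."""

    def __init__(self, name):
        self.name = name


class LeakyRegistry:
    """Keeps a strong reference to every view ever opened and never drops one."""

    def __init__(self):
        self._views = []

    def open(self, name):
        view = View(name)
        self._views.append(view)
        return view

    def close(self, view):
        # Bug: the window is gone as far as the user can tell, but the list
        # still references the object, so the garbage collector can't free it.
        pass

    def retained(self):
        return len(self._views)


class PurgingRegistry(LeakyRegistry):
    """Same registry, but close() actually removes the reference."""

    def close(self, view):
        self._views.remove(view)


def churn(registry, cycles):
    """Open and close `cycles` views, then report how many are still held."""
    for i in range(cycles):
        view = registry.open(f"window-{i}")
        registry.close(view)
    return registry.retained()


if __name__ == "__main__":
    print("leaky:  ", churn(LeakyRegistry(), 1000))
    print("purging:", churn(PurgingRegistry(), 1000))
```

The leaky version holds on to one object per open/close cycle, so memory use grows with how much the operators use the feature, not with how many windows are on screen.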
-Chris
I don't care what language they use. It could be BASIC for all I care.
What I do care about is what their QA process looks like. How did this not get caught in testing??
To Terminate, or not to Terminate, that's the question - SCSIROB
Wut.
There aren't any memory leaks when you write in C?
Maybe you have them, but I don't. That's because my C-peen is larger than yours.
I'm betting someone just got some of the punch cards mixed up.
Prisencolinensinainciusol. Ol Rait!
While everyone speculates on GC vs heap vs what flavor is my coffee, ERAM approach systems use Ada as the language of choice.
reference
He said ancient, not precambrian.
I kinda doubt that. My understanding is that most of the US's air-traffic control systems (and software) are ancient.
Somehow, I doubt it was 2,000,000 lines of assembly language.
Actually, the old system was a bunch of 68K assembly. Nowhere near 2,000,000 lines of it. I know one of the guys who wrote some of it.
Well, whatever this was coded in, it was recently upgraded ... you know, not ancient.
The language written in doesn't matter. It was a new change, insufficiently tested, and which failed in the real world with a corner case nobody anticipated.
That's a pretty large failure of coding, testing, and deployment.
It's a bunch of things, but really you'd expect the people responsible for it would have been a LOT more paranoid and rigorous about it.
It's an air traffic system after all, around Washington for crying out loud ... which means there's probably some security people going completely apoplectic.
I mean, the movie scenario of knowing there's an update to the ATC system around Washington and then all of the fanciful plot devices you can add in are suddenly slightly more plausible ... if "air traffic control offline around Washington" isn't begging to have a Bruce Willis movie, I don't know what is.
You want to bet someone at DHS didn't have a couple of extra Rolaids when this happened?
Lost at C:>. Found at C.
A quick search reveals that Lockheed Martin primarily used Ada 2005 to implement ERAM. From "Ada's Vital Role in New US Air Traffic Control Systems" (http://www.iaeng.org/publicati...): "The new Ada 2005 has introduced more robust capabilities based on user experience. The language offers particular innovations which helps make safety assurance less costly and further improves high integrity [...] real-time, and object-oriented language. Now it offers more safety and portability than Java, and better efficiency and flexibility than of C/C++"
Nobody expects the ERROR: OUT OF MEMORY.
No more mixed cards
But, you said that 8G was enough!
Just another day in Paradise
Ada and Java apparently
http://dl.acm.org/citation.cfm...
All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
I have actually seen something similar to this before, also involving an Air Traffic Control system.
They were having some problem in handling "Large Messages", I am not sure of the exact details / circumstances - I was only peripherally involved. Anyway, the programmer wrote these to a file, then they were processed asynchronously and deleted. This minor change was tested - as usual at the site - by someone shooting an hour's production traffic through the test system and checking for unexpected aborts or other abnormalities. All was fine, the spooling file was 1% full.
The patch went online. 4 days later (it was a Sunday morning and it was snowing) the file hit some limit and refused to accept new messages. At that moment things went "Keystone Cops".
The switch was duly made and everything was working again.
It turned out that the deletion of the processed records had a bug. One hour of live data left the file 1% full. 100 hours . . . do the math. It took 5 or 10 minutes for the programmer to fix the problem; he could have done it live on the Sunday if anyone had bothered to tell him what was going on.
One of the lessons from that is also relevant here - one hour of live data left the file 1% full. I'd bet that they were testing that the new feature worked, not looking for hidden side-effects.
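For anyone who wants the math done for them, a rough Python sketch follows. It assumes the file starts empty and fills at a constant rate, which is a simplification. The function names are hypothetical, and the second function is only one way a test could look for the side effect instead of just checking that nothing aborted.

```python
def hours_until_full(fill_fraction, test_hours):
    """Extrapolate linearly from a test run to the time the file fills up.

    fill_fraction: how full the file was after the test (0.01 means 1%).
    test_hours:    how long the test ran.
    Assumes the file started empty and grows at a constant rate.
    """
    if fill_fraction <= 0 or test_hours <= 0:
        raise ValueError("need a positive fill fraction and test duration")
    return test_hours / fill_fraction


def returns_to_baseline(pending_counts, tolerance=0):
    """Check that the spool drains after processing.

    pending_counts: records left in the file after each processing cycle.
    If deletion works, the last sample should be back near the first one.
    """
    return pending_counts[-1] - pending_counts[0] <= tolerance


if __name__ == "__main__":
    hours = hours_until_full(0.01, 1.0)
    print(f"full after about {hours:.0f} hours ({hours / 24:.1f} days)")
    print("drains properly:", returns_to_baseline([0, 10, 20, 30]))
```

With 1% after one hour, that comes to about 100 hours, a little over four days, which matches when the file hit its limit.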
Mielipiteet omiani - Opinions personal, facts suspect.
I have seen far too many occasions where some hotshot made an out of band code change, broke prod, and then said "oh, it's just a quick fix".
It would have to be one hell of an emergency to have live changes on a prod system be anything other than a hanging offense. I've seen more problems caused by it than things fixed by it.
I've experienced several outages caused by someone who was either thinking "it's just a quick fix", or was trying to sneak in a fix for something which shouldn't have left their desk in the first place.
Lost at C:>. Found at C.
I'm surprised more planes don't shed parts along the way.
Did the Primary Buffer Panel just fall off my gorram ship for no apparent reason?
There is no right to feel safe thru security vaudeville at the expense of everyone's freedom, privacy and tax money.
I think this system that failed is part of the same one I helped bid on upgrading in the late 80s. (We were the lucky ones who lost the bid; IBM were the poor suckers who won it.) The Advanced Automation System was supposed to have a budget of something like 4 years and $4B, or maybe it was $7B, but either way it ran way way over that, in both years and billions, before being restructured, partly because the problem is really hard, partly because the specs were extremely unrealistic, and partly because we were required to use DOD-STD-2167 software development methodology, a very heavy, clumsy version of the waterfall process.
The important requirement was that if anything went wrong and two airplanes crashed and fell out of the sky, mobs of citizens and Congresscritters would descend on FAA headquarters with torches and pitchforks and budget cuts, so everything had to be ultra-conservatively specced to prevent that from happening. I'm extremely annoyed to hear the FAA saying that except for this failure, they've been running 99.99% reliability this year. Four 9s? The specs we were supposed to meet were 8 9s, and since nobody was willing to ask the FAA to define a failure event, our management was conservatively aiming for 10 9s. (An average system controlled about 100 radars, and the big difference is whether a "failure" means "all the radars are out" or "any single radar is out".) This kind of reliability meant that duplicating everything wasn't good enough; you had to triplicate every piece of equipment, or double-double it, because otherwise the possibility of one piece failing while you had its backup down for preventative maintenance for 5 minutes blew your numbers for the year. (No matter that the radars were connected back to the data center over circuits that had 3.5-4 9s, just because of the usual risk of physical damage.) We later found out that the FAA shut down the then-current 1960s system for four hours a night, running on the backup equipment (which was a 1970s transistorized upgrade to the 1940s/1950s version) to keep the backup system reliable and operators trained.
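To put numbers on the 9s, here is a small Python sketch. It assumes a 365-day year, and the redundancy function assumes that units fail independently of each other, which is an idealization real installations only approximate. The function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def allowed_downtime_seconds(nines):
    """Yearly downtime budget for an availability of `nines` 9s.

    Four 9s means 99.99% available, i.e. an unavailability of 10**-4.
    """
    return SECONDS_PER_YEAR * 10.0 ** -nines


def combined_unavailability(unit_unavailability, copies):
    """Unavailability of `copies` redundant units, any one of which suffices.

    Assumes failures are independent (an assumption, not a given).
    """
    return unit_unavailability ** copies


if __name__ == "__main__":
    for n in (4, 8, 10):
        print(f"{n} nines: {allowed_downtime_seconds(n):.4f} s/year")
    # A single 5-minute maintenance window, measured against the 8-nines budget.
    print("5 min vs 8 nines:", 300 / allowed_downtime_seconds(8))
    # Two and three copies of a four-nines unit, under the independence assumption.
    print("duplicated: ", combined_unavailability(1e-4, 2))
    print("triplicated:", combined_unavailability(1e-4, 3))
```

Four 9s allows roughly 53 minutes of downtime a year; eight 9s allows about a third of a second. That is why one 5-minute window with no live backup blows the whole year's numbers.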
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks