Air Traffic Snafu: FAA System Runs Out of Memory
minstrelmike writes: Over the weekend, hundreds of flights were delayed or canceled in the Washington, D.C. area after air traffic systems malfunctioned. Now, the FAA says the problem was related to a recent software upgrade at a local radar facility. The software had been upgraded to display customized windows of reference data that were supposed to disappear once deleted. Unfortunately, the systems ended up running out of memory. The FAA's report is vague about whether it was operator error or software error: "... as controllers adjusted their unique settings, those changes remained in memory until the storage limit was filled." Wonder what programming language they used?
But nobody should ever need anything more than 640k!
disappears.
You can make the argument that if the software allowed the operators to crash the system, it's a software fault.
You can also make the argument that stuff like this should have been tested in parallel with the live system so this wasn't a possibility.
I mean, my god, what are the change management and testing practices which allowed this to only be discovered in your real system?
I've been around a few systems which had to do with aircraft ... and the rules and practices surrounding them are pretty paranoid and rigorous, because the stakes are so high. For an actual air traffic system I'm stunned this happened.
I guess I'm not surprised, but I am stunned.
Lost at C:>. Found at C.
So sloppy, untested, programming crashed the system. I'm willing to bet that if you take the hood off the system, it's written using high level languages instead of C.
Don't see that on the articles
Why VBA, of course
You can have poor memory management in any language.
Sure, historically C/C++ have been known for memory leaks due to memory that's not freed, but in Java/Python/pick-your-favorite-garbage-collected-language, or using smart pointers in C++, all you need to do is have a container that keeps a reference to everything and nothing will go away. It's not hard to do this.
Based on the summary, it sounds like that's what happened. Some monitor views just kept a list of everything and the developer forgot to purge the lists when things went out of, er, scope.
-Chris
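The leak pattern described in the comment above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the FAA's code: the `Display` and `WindowSettings` classes, the `history` list, and the "sector-12" name are all hypothetical. A weak reference is used as a probe to see whether the object is still alive after the window is closed.

```python
import gc
import weakref


class WindowSettings:
    """Hypothetical stand-in for one controller's customized reference window."""

    def __init__(self, name):
        self.name = name


class Display:
    """Hypothetical display that remembers every window it has ever opened."""

    def __init__(self):
        self.history = []   # the leak: close_leaky never removes entries
        self.visible = {}

    def open(self, name):
        window = WindowSettings(name)
        self.history.append(window)
        self.visible[name] = window
        return window

    def close_leaky(self, name):
        # Removes the window from view, but self.history still references it.
        self.visible.pop(name, None)

    def close_fixed(self, name):
        # Drops every reference the display holds, so the object can be freed.
        window = self.visible.pop(name, None)
        if window is not None:
            self.history.remove(window)


leaky_display = Display()
leaky_probe = weakref.ref(leaky_display.open("sector-12"))
leaky_display.close_leaky("sector-12")
gc.collect()
leaked = leaky_probe() is not None   # still reachable through history

fixed_display = Display()
fixed_probe = weakref.ref(fixed_display.open("sector-12"))
fixed_display.close_fixed("sector-12")
gc.collect()
freed = fixed_probe() is None        # nothing references it any more
```

The garbage collector is working correctly in both cases; in the leaky version the "deleted" window is simply still reachable, so it is never garbage.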
From the way it reads the system could have allowed for, say, 256 spiffy windows. If they weren't getting deleted as expected they could have drained that pool of spiffy windows no matter how much RAM they had.
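A rough sketch of that scenario, assuming a fixed pool of window slots. The `WindowPool` class and the size of 256 are illustrative only (256 is just the figure guessed at above); nothing here is taken from the actual system.

```python
class WindowPool:
    """Fixed-size pool of window slots; the default size is illustrative."""

    def __init__(self, size=256):
        self.free = list(range(size))
        self.in_use = set()

    def acquire(self):
        if not self.free:
            raise RuntimeError("window pool exhausted")
        slot = self.free.pop()
        self.in_use.add(slot)
        return slot

    def release(self, slot):
        self.in_use.remove(slot)
        self.free.append(slot)


def churn(pool, cycles, release_on_close):
    """Open and close one window per cycle; return how many cycles succeeded."""
    for done in range(cycles):
        try:
            slot = pool.acquire()
        except RuntimeError:
            return done
        if release_on_close:
            pool.release(slot)
    return cycles


# Closing a window without returning its slot drains the pool after 256 opens,
# regardless of how much RAM the machine has.
leaky_cycles = churn(WindowPool(), 1000, release_on_close=False)

# Returning the slot on close lets the same pool serve any number of cycles.
healthy_cycles = churn(WindowPool(), 1000, release_on_close=True)
```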
https://www.usajobs.gov/GetJob/ViewDetails/413146100
I don't care what language they use. It could be BASIC for all I care.
What I do care about is what their QA process looks like. How did this not get caught in testing??
To Terminate, or not to Terminate, that's the question - SCSIROB
I'm betting someone just got some of the punch cards mixed up.
.
Prisencolinensinainciusol. Ol Rait!
Depends a little on the OS... a while back it was a combination of OS/400, AIX, and MVS | OS/390 | z/OS.
While everyone speculates on GC vs heap vs what flavor is my coffee, ERAM approach systems use Ada as the language of choice.
reference
I wonder why people tend to put question marks at the end of sentences that begin with "I wonder".
"I eat". What do you eat? "I eat whole-grain wheat toast for breakfast."
"I wonder". What do you wonder? "I wonder what programming language they were using?"
Is it some valley girl thing?
THIS.
...than crashing. Well-designed systems do not die when running out of memory - they recognize the issue, and either at the general OS level or at the specific application level, begin shifting the memory requirements to storage. Yes, they run (much) slower - but it gives an opportunity for some system more aware of the big picture than the application (i.e. the operator) to prioritize and recover. As others have alluded to - how did this situation not get found in a proper testing process?
I kinda doubt that. My understanding is that most of the US's air-traffic control systems (and software) are ancient.
Somehow, I doubt it was 2,000,000 lines of assembly language.
Actually, the old system was a bunch of 68K assembly. Nowhere near 2,000,000 lines of it. I know one of the guys who wrote some of it.
A quick search reveals Lockheed Martin used Ada 2005 primarily to implement ERAM. Ada's Vital Role in New US Air Traffic Control Systems http://www.iaeng.org/publicati... "The new Ada 2005 has introduced more robust capabilities based on user experience. The language offers particular innovations which helps make safety assurance less costly and further improves high integrity" ... "real-time, and object-oriented language. Now it offers more safety and portability than Java, and better efficiency and flexibility than of C/C++"
Nobody expects the ERROR: OUT OF MEMORY.
Heard that somewhere before :)
No more mixed cards
But, you said that 8G was enough!
Just another day in Paradise
This is just indicative of America's crumbling infrastructure due to extreme ineptitude at the elected leadership level.
Even if the manual says that Win 95 no longer needs to be rebooted everyday like Windows 3.11, it's still a good idea.
Ada and Java apparently
http://dl.acm.org/citation.cfm...
All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
the whole thing is probably coded with a stone tablet and chisel.
If this is still based on the 20+ year-old system it's Ada.
I hope not! Too many systems are! Banks are still using Windows on ATMs!
it was F- , pronounced F minus , and it is an indicator of programmer skill and attention.
if this is supposed to be a new economy, how come they still want my old fashioned money?
Very funny guys.
If you can't handle explicit memory management, then you can't write in MISRA-C, and if you can't write in MISRA-C, you should probably be kept as far as possible away from coding on life support systems, because you are not going to be good at it.
The idea that they could "run out of memory" in the first place implies heap allocation, at the very least, which is prohibited in the MISRA-C standard.
You may remember the Toyota acceleration bug in the ECC; it was due to a practice that would have been caught, before the bug started ending up in dead people, had they been enforcing their own coding standards (which were a partial intersection with the MISRA-C coding standard, which was under development and nearing completion at the time).
And surgeons responsible for cutting open live human beings should be capable of not leaving tools in the person they're operating on, but it still happens. Professionals make mistakes. Garbage collection is a useful tool to make it more difficult to screw up.
If only there were some kind of portable machine one could use in order to look for metal left in a patient's body...
If only there were a requirement to do things like count gauze pads before and after surgeries, and then account for any numerical discrepancy as being ON (not IN) the patient, and/or going into the biohazard disposal unit. You know: a documented procedure.
Oh, wait: there is.
Did they get Walter O'Brien to race under a plane, so they can upload the correct software to reset the system?
Calm down, Mr. Trump, it's time for your medication again.
One of the major impacts on climate change is inefficient jet planes.
Why not double the fee for non-787 fuel sipping jet planes (or turboprops), so that the impact cost has a real cost?
There's your money.
And take any extra and put it into high speed passenger and freight rail lines along the dense West urban I-5 corridor that produces 50 percent of all US GDP.
-- Tigger warning: This post may contain tiggers! --
I wonder what 'programmer' they used.
In case of a software error, it should be "software programming environment", not just "programming environment".
It's not just the programming language: a language you like may have bad libraries or wrong programming logic, and, vice versa, a language you may not like may have good programming logic and good libraries.
ADA: https://www.google.com/search?...
I'm surprised more planes don't shed parts along the way.
Did the Primary Buffer Panel just fall off my gorram ship for no apparent reason?
There is no right to feel safe thru security vaudeville at the expense of everyone's freedom, privacy and tax money.
And I wonder how many 'most excellent' H1B programmers worked on this snafu.
Maybe they could [garbage collection] have used [garbage collection] java [garbage collection] [garbage collection] which would [garbage collection] [garbage collection] have ensured [garbage collection][garbage collection] lots of memory [garbage collection] [garbage collection] is always [garbage collection] available, at the [garbage collection] [garbage collection] small [garbage collection] [garbage collection] [garbage collection] price [garbage collection] [garbage collection] [garbage collection] of a runtime [garbage collection] [garbage collection] [garbage collection] hit and non [garbage collection] [garbage collection] realtime [garbage collection] [garbage collection] [garbage collection] priority. What [garbage collection] [garbage collection] could [garbage collection] [garbage collection] [garbage collection] possible [garbage collection] [garbage collection] [garbage collection] [garbage collection] happen.
There is a reason java [garbage collection] isn't used for critical "realtime" systems. imagine the garbage collector dropping events or being "slightly behind".
I bet somebody was running Firefox.
I think this system that failed is part of the same one I helped bid on upgrading in the late 80s. (We were the lucky ones who lost the bid; IBM were the poor suckers who won it.) The Advanced Automation System was supposed to have a budget of something like 4 years and $4B, or maybe it was $7B, but either way it ran way way over that, in both years and billions, before being restructured, partly because the problem is really hard, partly because the specs were extremely unrealistic, and partly because we were required to use DOD-STD-2167 software development methodology, a very heavy clumsy version of waterfall process.
The important requirement was that if anything went wrong and two airplanes crashed and fell out of the sky, mobs of citizens and Congresscritters would descend on FAA headquarters with torches and pitchforks and budget cuts, so everything had to be ultra-conservatively spec'd to prevent that from happening. I'm extremely annoyed to hear the FAA saying that except for this failure, they've been running 99.99% reliability this year. Four 9s? The specs we were supposed to meet were 8 9s, and since nobody was willing to ask the FAA to define a failure event, our management was conservatively aiming for 10 9s. (An average system controlled about 100 radars, and the big difference is whether a "failure" means "all the radars are out" or "any single radar is out".) This kind of reliability meant that duplicating everything wasn't good enough; you had to triplicate every piece of equipment, or double-double it, because otherwise the possibility of one piece failing while you had its backup down for preventative maintenance for 5 minutes blew your numbers for the year. (No matter that the radars were connected back to the data center over circuits that had 3.5-4 9s, just because of the usual risk of physical damage.) We later found out that the FAA shut down the then-current 1960s system for four hours a night, running on the backup equipment (which was a 1970s transistorized upgrade to the 1940s/1950s version) to keep the backup system reliable and operators trained.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
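To put the "9s" in the comment above into concrete terms, here is a small back-of-the-envelope calculation. It assumes a 365-day year and treats availability as simply the fraction of the year the system is up; the function name is made up for this sketch, and the FAA's actual definition of a failure event is, as the comment says, unknown.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years ignored


def downtime_budget_seconds(nines):
    """Yearly downtime allowed at an availability of `nines` nines.

    For example, nines=4 means 99.99% availability.
    """
    return SECONDS_PER_YEAR / 10 ** nines


four_nines = downtime_budget_seconds(4)    # about 3153.6 s, roughly 52.6 minutes
eight_nines = downtime_budget_seconds(8)   # about 0.315 s
ten_nines = downtime_budget_seconds(10)    # about 0.003 s

# A single 5-minute maintenance window, if it ever turned into a real outage,
# would be roughly 950 times the whole yearly budget at eight 9s.
maintenance_window = 5 * 60
overrun_factor = maintenance_window / eight_nines
```

At eight 9s the whole year's allowance is under a third of a second, which is why even a brief exposure with the backup offline was enough to threaten the numbers.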
I just wrote a rant about AAS in a comment above. What a disaster that was, and the system we got out of it was partly so late and unreliable because the FAA way way overspec'd the first version.
Bill Stewart
I don't know what the current FAA system is written in, but back in the late 80s when I was working on a previous attempt to upgrade it, we had to write in Ada. (Actually, we mostly had to write in DOD-STD-2167 development methodology, and no I don't mean 2167A, but if we'd gotten far enough in the process to be coding, it would have been in Ada, generally emulating systems originally written in JOVIAL.)
Bill Stewart