Slashdot Mirror


Air Traffic Snafu: FAA System Runs Out of Memory

minstrelmike writes: Over the weekend, hundreds of flights were delayed or canceled in the Washington, D.C. area after air traffic systems malfunctioned. Now, the FAA says the problem was related to a recent software upgrade at a local radar facility. The software had been upgraded to display customized windows of reference data that were supposed to disappear once deleted. Unfortunately, the systems ended up running out of memory. The FAA's report is vague about whether it was operator error or software error: "... as controllers adjusted their unique settings, those changes remained in memory until the storage limit was filled." Wonder what programming language they used?

2 of 234 comments (clear)

  1. Re:This is why we like C by Vlad_the_Inhaler · · Score: 4, Interesting

    I have actually seen something similar to this before, also involving an Air Traffic Control.

    They were having some problem in handling "Large Messages", I am not sure of the exact details / circumstances - I was only peripherally involved. Anyway, the programmer wrote these to a file, then they were processed asynchronously and deleted. This minor change was tested - as usual at the site - by someone shooting an hour's production traffic through the test system and checking for unexpected aborts or other abnormalities. All was fine, the spooling file was 1% full.
    The patch went online. 4 days later (it was a Sunday morning and it was snowing) the file hit some limit and refused to accept new messages. At that moment things went "Keystone Cops".

    • All department heads were informed, except programming. Given that only one the patch had been applied in the previous week, not very helpful. Headless chickens ran around trying to find a solution.
    • Standard practice in this type of situation was to switch to the backup/standby system. Since ATC data is very short lived, the backup system had an empty database which would then be populated dynamically. All "Station Chiefs" had to approve this step. One refused because he could not see any problem. Finally someone managed to make him understand what the problem was, then it was "oh yes, we are seeing that as well". His was the smallest station of course.
    • Standard procedure was also to switch to manual control - rather than automated - and cancel short-haul flights. The railways could take up the slack. This was done.

    The switch was duly made and everything was working again.
    It turned out that the deletion of the processed records had a bug. One hour of live data left the file 1% full. 100 hours . . . do the math. It took 5 or 10 minutes for the programmer to fix the problem, he could have done it live on the Sunday if anyone had bothered to tell him what was going on.

    One of the lessons from that is also relevant here - one hour of live data left the file 1% full. I'd bet that they were testing that the new feature worked, not looking for hidden side-effects.

    --
    Mielipiteet omiani - Opinions personal, facts suspect.
  2. Re:Software error ... by Rob+Riggs · · Score: 3, Interesting

    Surgeons leave tools in patients because they have no process when operating on a patient. Read the Checklist Manifesto sometime and read what the author has to say about best practices in the operating room. Everyone makes mistakes. The process we follow is what allows us to catch those mistakes, and prevent any mistakes from re-occurring.

    --
    the growth in cynicism and rebellion has not been without cause