Patching Software on Another Planet

Posted by Soulskill on Saturday July 6, 2013 @03:47AM from the no-do-overs dept.

An anonymous reader writes "Sixteen years ago, the Mars Pathfinder lander touched down on Mars and began collecting data about the atmosphere and geology of the Red Planet. Its original mission was planned to last somewhere between a week and a month, but it only took a few days for software problems to crop up. The engineers responsible for the system were forced to diagnose the problem and issue a patch for a device that was millions of miles away. From the article: 'The Pathfinder's applications were scheduled by the VxWorks RTOS. Since VxWorks provides pre-emptive priority scheduling of threads, tasks were executed as threads with priorities determined by their relative urgency. The meteorological data gathering task ran as an infrequent, low priority thread, and used the information bus synchronized with mutual exclusion locks (mutexes). Other higher priority threads took precedence when necessary, including a very high priority bus management task, which also accessed the bus with mutexes. Unfortunately in this case, a long-running communications task, having higher priority than the meteorological task, but lower than the bus management task, prevented it from running. Soon, a watchdog timer noticed that the bus management task had not been executed for some time, concluded that something had gone wrong, and ordered a total system reset.'"

Sounds like this was noticed earlier ... by xmas2003 · 2013-07-06 03:51 · Score: 4, Interesting

From TFA: "Engineers later confessed that system resets had occurred during pre-flight tests. They put these down to a hardware glitch and returned to focusing on the mission-critical landing software"

Very surprised by this ... even if it was a hardware glitch, wouldn't you want to track that down before launch? Especially since in the harsh space environment (bit flips even with hardened RAM/CPU), you want your hardware to be as reliable as possible.

--
Hulk SMASH Celiac Disease
1. Re:Sounds like this was noticed earlier ... by mlts · 2013-07-06 04:13 · Score: 4, Interesting
  
  Devil's advocate here:
  If it were my guess, there are so many priorities of glitches, and with a limited budget, if it isn't something that actively shuts down operations, resources are spent on other things.
  The one good thing in this equation is the watchdog circuits. Without these in place, it can mean the hardware goes down and never comes to life again.
  It is extremely hard to get working operating systems and patch management here on Earth [1]... much less having systems that are made to work where there is no way to walk up to the machine, and re-flash a new OS via the JTAG ports.
  [1]: Patch management has had issues on every OS I've used. AIX reports problems via lppchk, which means force-installing LPPs; RedHat gets RPM glitches, possibly forcing a rebuild of the DB; Windows sometimes simply refuses to install an update from WU; and so on. With this in mind, trying to patch a machine millions of miles away is very daunting for even the best of the best.
2. Re:Sounds like this was noticed earlier ... by girlintraining · 2013-07-06 04:58 · Score: 4, Interesting
  
  "If it were my guess, there are so many priorities of glitches, and with a limited budget, if it isn't something that actively shuts down operations, resources are spent on other things."
  Devil here: This isn't a budget problem, this is a management problem. Going all the way back to the Challenger disaster, NASA has shown a pattern of disregard for proper engineering practice. Richard Feynman chewed their ass out in Appendix F of the Challenger report to Congress, and it was so scathing that both Congress and NASA tried to kick him off the board and discard his results... prompting the entire senior engineering staff of all branches of the Shuttle project to sign a petition saying: Either publish this, or face our wrath.
  This isn't a technical problem -- this is management having shitty project management skills. If the budget is insufficient, then the project scope has to be reduced. It's just that simple. This is not the engineers' fault, nor is it the fault of the technology... this is management trying to do too much with too little.
  
  --
  #fuckbeta #iamslashdot #dicemustdie
Priority inversion bug by BitZtream · 2013-07-06 04:13 · Score: 4, Interesting

This problem is known as priority inversion. It's a common concern in schedulers when critical functions run in their own threads. It's something that they should have known about and tested against. Or they could have used more traditional IO approaches and let the VxWorks IO system, which already has protection against priority inversion by design, do its job.
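To make the failure concrete, here is a minimal Python sketch of the scenario from the summary. It is not the flight software: it is a toy tick-based scheduler, and the task names, tick counts, the `simulate` function, and the 10-tick watchdog limit are all assumptions picked for illustration. The `inherit` flag models priority inheritance, one common protection against priority inversion, in which the mutex holder temporarily runs at the priority of the most urgent task waiting on that mutex.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    name: str
    priority: int      # larger number = more urgent
    release: int       # tick at which the task becomes ready
    work: int          # ticks of CPU time the task needs
    needs_bus: bool    # True if the work is done while holding the bus mutex
    remaining: int = field(init=False)
    finished_at: Optional[int] = None

    def __post_init__(self):
        self.remaining = self.work


def simulate(inherit, watchdog_limit=10, horizon=40):
    """Return (reset_tick or None, {task name: finish tick or None})."""
    # Hypothetical workload: all numbers are made up for the example.
    meteo = Task("meteo", priority=1, release=0, work=4, needs_bus=True)
    comms = Task("comms", priority=2, release=1, work=15, needs_bus=False)
    bus_mgmt = Task("bus_mgmt", priority=3, release=2, work=2, needs_bus=True)
    tasks = [meteo, comms, bus_mgmt]
    owner = None  # task currently holding the bus mutex, if any

    for tick in range(horizon):
        ready = [t for t in tasks if t.release <= tick and t.remaining > 0]
        if not ready:
            break

        def effective_priority(task):
            # With inheritance, the mutex holder borrows the priority of
            # the most urgent task that is waiting for the mutex.
            if inherit and task is owner:
                waiting = [t.priority for t in ready
                           if t.needs_bus and t is not owner]
                return max([task.priority] + waiting)
            return task.priority

        # A task that needs the bus cannot run while another task holds it.
        runnable = [t for t in ready
                    if not (t.needs_bus and owner is not None and t is not owner)]
        current = max(runnable, key=effective_priority)
        if current.needs_bus:
            owner = current          # take (or keep) the mutex
        current.remaining -= 1
        if current.remaining == 0:
            current.finished_at = tick + 1
            if owner is current:
                owner = None         # release the mutex

        # Watchdog: reset if the bus manager has been pending too long.
        pending = bus_mgmt.release <= tick and bus_mgmt.remaining > 0
        if pending and tick + 1 - bus_mgmt.release >= watchdog_limit:
            return tick + 1, {t.name: t.finished_at for t in tasks}

    return None, {t.name: t.finished_at for t in tasks}


if __name__ == "__main__":
    print("no inheritance:", simulate(inherit=False))
    print("inheritance:   ", simulate(inherit=True))
```

With `inherit=False`, the medium-priority communications task keeps the mutex holder off the CPU, the bus manager stays blocked, and the simulated watchdog fires. With `inherit=True`, the meteorological task is boosted as soon as the bus manager starts waiting, releases the mutex quickly, and no reset occurs.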

--
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
Re:Boring story by Antique+Geekmeister · 2013-07-06 04:48 · Score: 5, Interesting

Actually, it's very interesting. It shows that even with the very extensive testing and layers of planning and managerial processes to prevent such errors, they can still creep in. And it shows that very expensive, one-off projects remain vulnerable to subtle design errors, so the tools to do field updates are _critical_.
Note that designing for spacecraft can be a real art form: they have extremely limited computational resources, due to the inherent risks of bit errors in increasingly small modern silicon exposed to radiation and temperature changes, and you cannot simply shield the electronics: the shielding adds weight and itself becomes radioactive over time. So you often wind up using quite old but far more stable technologies. That means tools that may be considered quite obsolete by the time your design phase is complete and the device is ready for launch. And by the time it arrives _on Mars_, the technology is very obsolete indeed.
My respect for the programmers and designers of interplanetary spacecraft is enormous: systems like Voyager and the Mars Rover, Spirit, that exceed their lifespans by years fill me with pride as an engineer that we could build so well. And the obligatory XKCD on the subject:
http://www.xkcd.com/695/
The problem was well known when the story was new by Cryptosmith · 2013-07-06 07:35 · Score: 5, Interesting

This is a rambling bit of history. Move on if that's not your thing. I love reading about problems like the Pathfinder problems. Trust me - such things often happen on Earth-bound systems, too.
Back in '79, I was working on a multiprocessing router for the ancient ARPANET. At the time the net had over sixty routers distributed across the continent. Actually we called them "imps" - well, "IMPS" but I'll use the modern term "router." We had a lot of the same problems as Pathfinder without ever leaving the atmosphere.
By then all ARPANET routers were remotely maintained. They all ran continuously and we did all software maintenance in Cambridge, MA. By then the basic software was really reliable. They rarely crashed on their own, and we mostly sent updates to tweak performance or to add new protocol features. Once in a while we'd have to use a "magic modem" message to restart a dead machine and to reload things. The software rarely broke so badly that we'd have to have someone on-site load up a paper tape. So remote maintenance was well established by then.
The multiprocessor didn't run "threads"; it ran "strips." Each was a non-preemptive task designed to execute quickly enough not to monopolize the processor. If you wrote software for a Mac before OS X, you know how this works. A multi-step process might involve a sequence of strips executed one after the other.
Debugging the multiprocessor code was a bit of a challenge because we could lock out multi-step processes in several different ways. While we could put our test router on the network for live testing, this didn't guarantee that we'd get the same traffic the software would get at other sites. For example, we had software to connect computer terminals directly to hosts through the router (the original "terminal access controllers"). This software ran at a lower priority than router-to-router packet handling. It was possible for a busy router to give all the bandwidth to the packets and essentially lock out the host traffic. Such problems might not show up until updated software was loaded into a busy site.
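For readers who haven't seen this style of scheduling, here is a rough Python sketch of run-to-completion strips and of the lock-out described above. It is an illustration only, not the original router code: the `Dispatcher` class, the two priority levels, the strip names, and the one-strip-per-tick timing are all assumptions. Each strip is a plain function that runs to completion and may return the next strip of a multi-step process.

```python
from collections import deque

HIGH, LOW = 0, 1  # router-to-router packets outrank host/terminal traffic


class Dispatcher:
    """Non-preemptive dispatcher: always serves the most urgent queue first."""

    def __init__(self):
        self.queues = {HIGH: deque(), LOW: deque()}

    def post(self, priority, strip):
        self.queues[priority].append(strip)

    def run_one(self):
        for priority in (HIGH, LOW):
            if self.queues[priority]:
                strip = self.queues[priority].popleft()
                successor = strip()  # runs to completion; never preempted
                if successor is not None:
                    self.post(priority, successor)  # next step of the process
                return True
        return False


def make_packet_strip(log, n):
    def forward_packet():
        log.append(f"packet {n}")
        return None  # single-step process
    return forward_packet


def make_host_session(log):
    # A two-step process: one strip hands off to the next.
    def read_keystroke():
        log.append("host: read keystroke")
        return send_to_host

    def send_to_host():
        log.append("host: send to host")
        return None

    return read_keystroke


def simulate(packet_every, ticks=10):
    """Run one strip per tick; a packet arrives every `packet_every` ticks."""
    log = []
    dispatcher = Dispatcher()
    dispatcher.post(LOW, make_host_session(log))
    n = 0
    for tick in range(ticks):
        if packet_every and tick % packet_every == 0:
            dispatcher.post(HIGH, make_packet_strip(log, n))
            n += 1
        dispatcher.run_one()
    return log


if __name__ == "__main__":
    print(simulate(packet_every=3))  # light load: host session completes
    print(simulate(packet_every=1))  # busy site: host session never runs
```

On the lightly loaded "test" router the host session finishes almost immediately; at the simulated busy site the same code never gets a turn, which is why this kind of bug tends to stay hidden until the software reaches a heavily used machine.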
Uploading a patch involved assembly language. We'd generally add new code virus style. First you load the new code into some spare RAM. Once the code is loaded, we patch the working program so that it jumps to the patch the next time it executes. The patch jumps back to an appropriate spot in the program once the new code has executed. We sent the patches in a series of data packets with special addressing to talk to a "packet core" program that loaded them.
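The jump-out, jump-back technique can be sketched in Python with a toy machine. Everything here is hypothetical: the four-instruction set, the memory layout, and the `load_patch` helper stand in for the real assembly language and the "packet core" loader, whose details are not given above. The key point the sketch tries to preserve is the ordering: the new code is written into spare memory first, and only then does a single write redirect the running program.

```python
def build_program():
    """Toy program computing (acc + 1) * 3, followed by spare RAM."""
    return [
        ("ADD", 1),   # 0
        ("MUL", 3),   # 1
        ("HALT",),    # 2
        ("NOP",),     # 3  spare RAM starts here
        ("NOP",),     # 4
        ("NOP",),     # 5
        ("NOP",),     # 6
    ]


def run(memory, acc=0, max_steps=100):
    """Execute from address 0 until HALT and return the accumulator."""
    pc = 0
    for _ in range(max_steps):
        op, *args = memory[pc]
        if op == "HALT":
            return acc
        if op == "JMP":
            pc = args[0]
            continue
        if op == "ADD":
            acc += args[0]
        elif op == "MUL":
            acc *= args[0]
        elif op != "NOP":
            raise ValueError(f"unknown instruction {op!r}")
        pc += 1
    raise RuntimeError("step limit exceeded")


def load_patch(memory, spare_at, hook_at, new_code, return_to):
    """Install new_code in spare RAM, then hook the running program to it."""
    body = list(new_code) + [("JMP", return_to)]  # patch jumps back when done
    if spare_at + len(body) > len(memory):
        raise ValueError("patch does not fit in spare RAM")
    # Step 1: load the new code. The program is still untouched and safe.
    for offset, instruction in enumerate(body):
        memory[spare_at + offset] = instruction
    # Step 2: one write makes the patch live the next time hook_at executes.
    memory[hook_at] = ("JMP", spare_at)


if __name__ == "__main__":
    memory = build_program()
    print("before patch:", run(memory))
    # The patch re-does the displaced ADD 1, adds a new step, and returns
    # to address 1 so the rest of the original program runs unchanged.
    load_patch(memory, spare_at=3, hook_at=0,
               new_code=[("ADD", 1), ("ADD", 5)], return_to=1)
    print("after patch: ", run(memory))
```

Note that the patch has to repeat the instruction it displaced (`ADD 1` here), since the jump overwrote it. If the hook were written before the patch body was fully loaded, the program could jump into half-written code, which is why the order of the two steps matters.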
The bottom line: these are the sort of challenges that kept a lot of us working as programmers for a long time. And they pop up again every time someone starts another system from scratch.