Writing Code for Spacecraft
CowboyRobot writes "In an article subtitled 'And you think *your* operating system needs to be reliable,' Queue has an interview with the developer of the OS that runs on the Mars Rovers. Mike Deliman, chief engineer of operating systems at Wind River Systems, has quotes like, 'Writing the code for spacecraft is no harder than for any other realtime life- or mission-critical application. The thing that is hard is debugging a problem from another planet.' and, 'The operating system and kernel fit in less than 2 megabytes; the rest of the code, plus data space, eventually exceeded 30 megabytes.'"
I wonder whether they will be releasing the source. It could be an interesting read.
yes >
All software has bugs. What happens when they have an update halfway through the trip? Who installs it remotely? And I guess having a sysop reboot is out of the question...
CBB
free ipod and free gmail!
That's all well and good, but don't forget that this kernel only has to interface with one set of hardware.
Something like the Linux kernel has to know about hundreds or thousands of different devices, which is why it's so big.
I would like to think that this article embodies the reasons that John Carmack got into space program development to begin with.
In the beginning he got into 3d game applications for a similar reason. The cutting edge is always the very outer area of human development, and Carmack makes a good example of a programmer who has taken aim at the edge of what is known to programmers. Maybe Mr. Carmack would care to comment?
Much like the way id Software develops engines, spacecraft programming is new and innovative, although the difference is that spacecraft systems have no room for error.
The dangers of knowledge trigger emotional distress in human beings.
I can get Linux on a 1.44 MB floppy and run a system from it. 2 megs ain't that hard.
Slashdot 1|0 Productivity
I used to write embedded applications using OS-9 (NOT MacOS 9) on 68000-based systems as a sub-contractor for Nuclear Electric (nuclear power stations company in the UK before it became BNFL). Our development system - complete with OS/Kernel and compilers - had only about a meg of memory; the final embedded systems often only had 512K if we were lucky.
Okay, so this was some 14 years ago - but it was doing a lot of work. 2 megabytes is a lot of memory! There's a phenomenal amount of code and data that can be stored in 2 meg. Maybe it's good by current standards, but - personally - I would suggest that current standards are a bad place to start from.
The ways of gods are mysteriously indistinguishable from chance.
Well, beta testing, anyone? :-D
I would pick something like QNX or NetBSD for any critical app.
Okay, let's turn NetBSD into a real-time OS. Add some "hardening" features like watchdogs etc. Hmm... what should we call it? Perhaps: SpaceBSD?
cpghost at Cordula's Web.
I worked on a satellite mission where we had some trouble. Due to an error the satellite wound up pointing 16 degrees away from the sun in a higher-than-expected orbit of 443 miles (714 kilometers) above Earth.
The misalignment meant the spacecraft was unable to look directly at the sun's center to record the amount of radiation streaming toward Earth. To accurately measure sunlight, the darn thing needed to be pointed to within a quarter of a degree of dead center.
It took about four and a half months to fix that problem, due to uplink difficulties. Ground controllers first had to slow the spacecraft's spin in order to transmit a series of software "patches" and then gradually speed it up to see how well the commands worked.
Then things were fixed.
Moral of the story: it is a tough job indeed!
In my experience mutexes, semaphores, etc. always cause trouble. There is nearly always another way to write things.
And you'll never ever see me coding an infinite wait for a mutex. That's just asking for trouble.
Bad: in Windows, FindNextChangeNotification() requires those IPC operations and it always gives me grief.
Good: The Linux File Activity Monitor (FAM). Lets you open and read a pipe of actions. Nice!
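For anyone wondering what "no infinite wait" looks like in practice, here is a minimal sketch in Python (not VxWorks or Win32 code). The function name update_shared_state and the 0.1 second timeout are made up for illustration. The only real API used is threading.Lock.acquire(timeout=...), which returns False if the lock could not be taken in time.

```python
import threading


def update_shared_state(lock: threading.Lock, timeout_s: float) -> bool:
    """Try to take the lock, but give up after timeout_s seconds.

    Hypothetical helper: returns True if the critical section ran,
    False if the lock could not be acquired in time.
    """
    if not lock.acquire(timeout=timeout_s):
        # Could not get the lock: report failure so the caller can
        # log it, retry, or fall back, instead of hanging forever.
        return False
    try:
        # ... the critical section would go here ...
        return True
    finally:
        lock.release()


lock = threading.Lock()

# Lock is free, so the call succeeds right away.
free_result = update_shared_state(lock, timeout_s=0.1)

# Simulate a holder that never lets go: the call times out instead of blocking.
lock.acquire()
held_result = update_shared_state(lock, timeout_s=0.1)
lock.release()
```

What to do on a timeout is the hard part, of course. The sketch only shows that the caller gets control back and can decide.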
Perhaps not surprisingly for anyone who has heard about the management at NASA, C++ was selected for the successors to the Remote Agent on the grounds that it is supposed to be more reliable (this despite the fact that the Remote Agent was originally to be developed in C++, an effort that was abandoned after a year of failure). This caused more than a few people to be upset (including a very personal account by one of the aforementioned designers). Clearly the debugging facilities of Common Lisp are far superior to those of static systems like C++, something which is very useful in diagnosing unexpected error conditions in spacecraft software (read the first question on p. 3 of the interview to see what pains the JPL staff went through to adapt similar, ad-hoc methods to VxWorks). It's also clear from this interview (question: "How is application programming done for a spacecraft?" Answer: "Much the same as for anything else: software requirements are written, with specifications and test plans, then the software is written and tested, problems are fixed, and eventually it's sent off to do its job.") that NASA has in no way tried to adapt formal verification methods for its software, preferring instead to rely on the "tried and true" (at failing, maybe) poke-and-test development "methods."
Clearly, formal verification methods to eliminate bugs before critical software is deployed, and deployment in a system with advanced debugging facilities, are a clear win for spacecraft software, and should be adopted as the standard model of development. Unfortunately, like in many other software development enterprises, inertia keeps outdated, inadequate systems going despite a strong failure correlation rate.
In the great CONS chain of life, you can either be the CAR or be in the CDR.
Why, in the 21st century, is it necessary to fit something like the Mars rover code in 2MB of memory? If something like a Gameboy Advance or a PDA can hold anywhere from 64MB to a couple of gigs, what is holding NASA back, with their gigantic budget and all?
I can't imagine it would be the cost of the memory... I mean I know it costs much much more to make chips to a very strict specification, but if you are already producing so few units, isn't your cost of production going to be extraordinarily high whether you are making 64KB chips or 2MB or even 64MB?
This is not to say that I don't have admiration for fitting all that code in such a small space, but is there a reason they feel the need to do so?
About five years ago, I worked for a major test equipment manufacturer who was contracted to deliver a test system for POTS lines (which could eventually do ADSL prequalification) to a national telco in a major European country. The idea was to test every POTS line in the system (millions of them) every night to detect early signs of degradation so repair crews could be dispatched before dialtone was completely lost.
As you can imagine, this involved a distributed system of test heads in each central office, networked back to a central command and control site. The system worked well, but had one flaw: downloading new firmware to the test heads was fraught with problems, and often led to the test head "locking up", even though a backup copy of firmware was always present, along with a hardware watchdog timer (though it was possible to lock out the watchdog interrupt, particularly when reprogramming flash, so it was a less than perfect watchdog). In these situations, one had to dispatch a "truck roll" to the affected central office, and replace EPROMs by hand.
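To make the "less than perfect watchdog" point concrete, here is a toy, tick-driven model in Python. It is only a sketch under stated assumptions: the Watchdog class, its fields, and the tick counts are all hypothetical, and nothing here reflects the actual test-head hardware. It shows a hung firmware getting reset when the watchdog is live, and never getting reset when the watchdog interrupt is locked out.

```python
class Watchdog:
    """Toy watchdog: counts down every tick and resets the system at zero.

    Hypothetical model for illustration only, not real hardware behaviour.
    """

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.masked = False   # True models "watchdog interrupt locked out"
        self.resets = 0

    def kick(self) -> None:
        # Healthy firmware calls this regularly to restart the countdown.
        self.remaining = self.timeout_ticks

    def tick(self) -> None:
        self.remaining -= 1
        if self.remaining <= 0:
            if not self.masked:
                self.resets += 1          # the reset fires and the system reboots
            self.remaining = self.timeout_ticks


def run_hung_firmware(wd: Watchdog, ticks: int) -> int:
    """Simulate firmware that has locked up and never kicks the watchdog."""
    for _ in range(ticks):
        wd.tick()
    return wd.resets


live = Watchdog(timeout_ticks=5)
resets_live = run_hung_firmware(live, ticks=12)      # expiries at ticks 5 and 10

locked_out = Watchdog(timeout_ticks=5)
locked_out.masked = True                             # e.g. while reprogramming flash
resets_locked_out = run_hung_firmware(locked_out, ticks=12)
```

In the second case the hang is never recovered, which is the situation that ends in a truck roll.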
Needless to say, the customer was pissed. More worrying was that until we fixed the software download problem (which we were unable to reproduce in the lab), we'd be paying for truck rolls all over the country. This was a not insignificant amount of money.
Management frittered away time, instead of authorizing a root cause analysis, by requesting tweaks to TCP/IP operating parameters, and testing to see if the problem was getting better or worse. This did not prove illuminating, time was wasted, and the customer was getting royally angry.
Finally, a small team of us were permitted to undertake a root cause analysis to find and fix the problem: the engineer responsible for the embedded flash file system, the telecom engineer on the control side, and me, responsible for the embedded O/S and TCP/IP stack (inherited from the supplier of the embedded O/S). We wanted a month. We got two weeks. Remember, deploying experimental software to live COs requires so many layers of approval, it isn't funny, and we were worried that would be our biggest bottleneck.
Finally, the controller telecom engineer was able to reproduce the problem, by attempting to download software from our controllers to deployed equipment in a single central office (getting permission was a feat in itself -- while there was little danger of affecting telephone service, this was a live CO).
The problem was clear: the data network was slow (9600 b/s over an X.25 PVC, carrying PPP-encapsulated TCP/IP), resulting in the use of large MTUs to minimize packetizing overhead (latency wasn't an issue - throughput was). Because of the way the controller's TCP/IP stack worked, it misestimated the packet/ack round trip time: it used a one byte payload for the first packet, and full MTUs after that. The resulting packet ACK timeout and retransmissions exposed an inconsistency between controller and embedded TCP/IP stacks that caused the embedded system to lock up.
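A rough back-of-the-envelope sketch of why a one-byte first packet gives a badly wrong round trip estimate on a 9600 b/s link. The link speed is from the description above. The 1500-byte MTU and the 40 bytes of TCP/IP headers are assumptions for illustration (the actual MTU isn't stated), and PPP/X.25 framing, the returning ACK, and propagation delay are all ignored.

```python
LINK_BPS = 9600        # link speed given in the description above
HEADER_BYTES = 40      # assumption: 20-byte IP header + 20-byte TCP header, no options
MTU_BYTES = 1500       # assumption: the real "large MTU" value is not stated


def serialization_seconds(packet_bytes: int, link_bps: int = LINK_BPS) -> float:
    """Time just to clock a packet of this size onto the wire."""
    return packet_bytes * 8 / link_bps


probe_s = serialization_seconds(HEADER_BYTES + 1)   # first packet: one-byte payload
full_s = serialization_seconds(MTU_BYTES)           # every later packet: a full MTU
ratio = full_s / probe_s

print(f"one-byte packet: {probe_s:.3f} s on the wire")
print(f"full-MTU packet: {full_s:.3f} s on the wire")
print(f"full MTU takes about {ratio:.0f}x longer")
```

Under those assumptions, a retransmission timer seeded from the first packet would expire long before a full-MTU packet has even finished going out, which fits the timeouts and retransmissions described.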
Great. Now, how to fix it?
The fix wasn't a big deal (I implemented a fix in the embedded TCP/IP code since we didn't have source to the controller TCP/IP stack), but deploying it was: remember we couldn't download the code successfully, and we didn't want to pay for a truck roll.
At this point, I proposed something daring: download a small patch, in as few packets as possible (we could send three full MTUs safely), which would patch the existing code in place and be good enough to reliably download a complete replacement.
The thought of "self-modifying code" freaked management out to no end: it went against every rule in the book. But all three of us stood our ground: the only other alternative was a truck roll to each central office in the country. Reluctantly, we were allowed to proceed with that fix.
At this point, we had about ten days left. I had managed to get approval to pipeline the dev and tes
You could've hired me.
Well said! And ditto.
I do embedded software for a living as well, and run like heck away from any project involving WindRiver.
WindRiver is great for those people who don't know what they are doing in the embedded space. And it's useful as a red flag for telling one as such.
But for people who actually know what they are doing, and who actually do understand OS's, Linux solutions are a far better choice. The time-to-market is absolutely unbeatable; as well as all the choices that one has in order to get a product out. Plus the reliability is also the best.
Sorry if that sounds like a troll; it's not meant to be. It's just my own first hand experience in this space.
Sure you can. We make that kind of software. The reason you won't ever see it as open source is because the various instruments on the spacecraft are covered by confidentiality agreements (or worse, in case of military hardware). And as hardware goes it is typically rather obscure stuff, requiring significant domain knowledge as well to emulate correctly.
Another issue is that these systems are rather CPU-intensive - we have a 16-CPU box for the spacecraft instruments plus a dedicated PC to emulate the flight computer itself. But you could run it on simpler hardware if you are willing to run at less than realtime speed.
Interestingly, the closest we ever get to seeing the actual flight software is binary images of it. While that is a lot closer than most slashdotters are likely to get, it is still far removed from being able to do something useful with it.
Of course the other good reason why this isn't going to be open source is because of price. For details you should really contact a salesperson, but let me give you a clue: (raises little finger to mouth) "Mwuhahaha!" ;-)
NASA may consider using a new OS after it has finished V&V in house and by an independent testing company (per NASA procedures) and it has flown in space successfully. An order of magnitude estimate is ten times the development cost.
VxWorks is a well known OS with lots of experienced users. Priority inversion is a known problem; just set the SEM_INVERSION_SAFE flag in semMCreate() to fix it.
Besides making the OS, Wind River also sells Tornado(R) and other tools for developing, debugging, and testing embedded realtime code running in the target computer. Anyone who has ever done embedded and realtime code knows good tools are mandatory with any complex system.
VxWorks runs in a flat address space. There is no segment protection, but code does get extensively reviewed and tested so bad pointers are not a problem. Preventing memory fragmentation requires good design (the solutions are well known) and more reviews and testing.
The last time I priced a run time license (most satellites need two licenses), it was noise ($400) compared to the labor required to build a spacecraft.
I am a VxWorks user in the space business.
_Richard
It was around the first or second month of operation this year, but Spirit was unusable for a couple of weeks due to an OS failure. The symptom was that Spirit tried to reboot itself about 20 times in a row, a default practice if something drastic happens. It was traced (according to the rumor mill) to flash memory overflow. Supposedly the VxWorks file management system improperly updated its flash memory free-inode list, so the memory appeared to run out of space.
The nice thing about software is that JPL was able to upload a patch and get both rovers working properly again. They also reconfigured the Galileo mission with software patches to bypass the broken high-gain antenna and use the hundred times slower low-gain antenna, and achieved most of the mission objectives.