Writing Code for Spacecraft

← Back to Stories (view on slashdot.org)

Posted by michael on Saturday November 20, 2004 @06:37AM from the carried-the-bits-uphill-one-by-one-in-the-snow dept.

CowboyRobot writes "In an article subtitled, "And you think *your* operating system needs to be reliable." Queue has an interview with the developer of the OS that runs on the Mars Rovers. Mike Deliman, chief engineer of operating systems at Wind River Systems, has quotes like, 'Writing the code for spacecraft is no harder than for any other realtime life- or mission-critical application. The thing that is hard is debugging a problem from another planet.' and, 'The operating system and kernel fit in less than 2 megabytes; the rest of the code, plus data space, eventually exceeded 30 megabytes.'"

11 of 204 comments (clear)

Reinventing the wheel. by Anonymous Coward · 2004-11-20 06:49 · Score: 5, Funny

Should have just used WinCE, with a few of the productivity apps cut out. Adding a copy of pocket Auto-route, with some Martian JPEGS would have helped navigation as well.
Re:Summary of OS code by caramelcarrot · 2004-11-20 06:50 · Score: 5, Funny

c:\rover\code\main.cpp(3) : error C2106: '=' : left operand must be l-value

Not quite bug free yet.
Re:Summary of OS code by zeath · 2004-11-20 06:56 · Score: 5, Funny

roveros.c: 1: non-lvalue in assignment
make: *** [roveros] Error 1 I'm sorry, your rover is lost in space. Insert $1 billion and press any key to try again.
Re:hard to imagine.. by Vardamir · 2004-11-20 06:58 · Score: 5, Interesting

Yes, here is an email my OS prof sent our class on the subject:

Subject: What really happened on Mars Rover Pathfinder

The Mars Pathfinder mission was widely proclaimed as "flawless" in the early
days after its July 4th, 1997 landing on the Martian surface. Successes
included its unconventional "landing" -- bouncing onto the Martian surface
surrounded by airbags, deploying the Sojourner rover, and gathering and
transmitting voluminous data back to Earth, including the panoramic pictures
that were such a hit on the Web. But a few days into the mission, not long
after Pathfinder started gathering meteorological data, the spacecraft began
experiencing total system resets, each resulting in losses of data. The
press reported these failures in terms such as "software glitches" and "the
computer was trying to do too many things at once".

This week at the IEEE Real-Time Systems Symposium I heard a fascinating
keynote address by David Wilner, Chief Technical Officer of Wind River
Systems. Wind River makes VxWorks, the real-time embedded systems kernel
that was used in the Mars Pathfinder mission. In his talk, he explained in
detail the actual software problems that caused the total system resets of
the Pathfinder spacecraft, how they were diagnosed, and how they were
solved. I wanted to share his story with each of you.

VxWorks provides preemptive priority scheduling of threads. Tasks on the
Pathfinder spacecraft were executed as threads with priorities that were
assigned in the usual manner reflecting the relative urgency of these tasks.

Pathfinder contained an "information bus", which you can think of as a
shared memory area used for passing information between different components
of the spacecraft. A bus management task ran frequently with high priority
to move certain kinds of data in and out of the information bus. Access to
the bus was synchronized with mutual exclusion locks (mutexes).

The meteorological data gathering task ran as an infrequent, low priority
thread, and used the information bus to publish its data. When publishing
its data, it would acquire a mutex, do writes to the bus, and release the
mutex. If an interrupt caused the information bus thread to be scheduled
while this mutex was held, and if the information bus thread then attempted
to acquire this same mutex in order to retrieve published data, this would
cause it to block on the mutex, waiting until the meteorological thread
released the mutex before it could continue. The spacecraft also contained
a communications task that ran with medium priority.

Most of the time this combination worked fine. However, very infrequently
it was possible for an interrupt to occur that caused the (medium priority)
communications task to be scheduled during the short interval while the
(high priority) information bus thread was blocked waiting for the (low
priority) meteorological data thread. In this case, the long-running
communications task, having higher priority than the meteorological task,
would prevent it from running, consequently preventing the blocked
information bus task from running. After some time had passed, a watchdog
timer would go off, notice that the data bus task had not been executed for
some time, conclude that something had gone drastically wrong, and initiate
a total system reset.

This scenario is a classic case of priority inversion.

HOW WAS THIS DEBUGGED?

VxWorks can be run in a mode where it records a total trace of all
interesting system events, including context switches, uses of
synchronization objects, and interrupts. After the failure, JPL engineers
spent hours and hours running the system on the exact spacecraft replica in
their lab with tracing turned on, attempting to replicate the precise
conditions under which they believed that the reset occurred. Early in the
morning, after all but one engineer had gone
Re:hmm... by Infinityis · 2004-11-20 07:08 · Score: 5, Funny

Not gonna happen, for one big reason. I could just see the Slashdot headline:

Mars Rover HaX0r3d and OS replaced with Linux.

Shortly thereafter, Micro$oft claims that they can enforce patent infringement on Mars...
Other options being considered by Dominic_Mazzoni · 2004-11-20 07:08 · Score: 5, Insightful

For those who are wondering, JPL is very aware of the shortcomings of VxWorks and has seriously considered other alternatives for every mission. Keep in mind that the choice of OS has to be made years before launch, so at the time the OS for the 2004 Mars Rovers was decided on, many options that are possibilities today were not contenders. Also keep in mind that in spite of many shortcomings, VxWorks is a known quantity. JPL has been working with it for years and had a lot of in-house expertise with it.

There are a few groups at JPL that have been actively experimenting with other options, including RTLinux and a few different variants of hard-real-time Java (basically Java with explicit memory management and no garbage collection).
Huh, its easy.. by adeyadey · 2004-11-20 07:13 · Score: 5, Funny

you are in a red rocky landscape..

GO NORTH..

you are in a red rocky landscape..

DIG.

ok. you see some red sand.
it is getting dark.

GO NORTH..

you were eaten by a grue.

--
"You lied to me! There is a Swansea!"
my satellite debugging experience by nil5 · 2004-11-20 07:25 · Score: 5, Interesting

I worked on a satellite mission where we had some trouble. Due to an error the satellite wound up pointing 16 degrees away from the sun in a higher-than-expected orbit of 443 miles (714 kilometers) above Earth.

The misalignment meant the spacecraft was unable to look directly at the sun's center to record the amount of radiation streaming toward Earth. To accurately measure sunlight, the darn thing needed to be pointed to within a quarter of a degree of dead center.

It took about four and a half months to fix that problem, due to uplink difficulties. Ground controllers from first had to slow the spacecraft's spin in order to transmit a series of software "patches" and then gradually speed it up to see how well the commands worked.

Then things were fixed.

Moral of the story: it is a tough job indeed!
Re:Efficiency by Brett+Buck · 2004-11-20 07:55 · Score: 5, Informative

> "The operating system and kernel fit in less than 2
> megabytes; the rest of the code, plus data space,
> eventually exceeded 30 megabytes." This should be used as
> the example for efficient coding

You've GOT to be kidding, right? 2 meg of OS code? That's ULTRABLOAT compared to most spacecraft. In fact, for the vast majority of the space age, that would have exceeded the resources of the computer by several orders of magnitude.

I've done this kind of programming for a living (for 10 years, moved up to controls design) but the last system I programmed for has 372k of memory, total. That includes data, code, OS, everything. Runs at 432 KIPS. And it performs what it probably one of the most complex in-flight autonomous control operations ever.

Most are even more restrictive. For example, 8K of PROM and 1k of volatile memory (and 28 WORDS) of non-volatile memory. This more than adequate for most applications, if you do it right.

Many spacecraft OS's are more akin to this:

hardware interrupt
external electronics power up processor.
external electronics set PC = 80hex
run
{execute all the code}
halt
power down

Once every 1/4 a second for 15 years.

The project I am currently working on uses VxWorks (and so we were quite interested in the Mars Rover problem) and it's so bloated with unnecessary features it's absurd. This is not a Windows box, it's a spacecraft processor.

I can't argue with the 30 meg of data space. Using the memory as a data recorder would be quite useful and a good picture takes a lot of space. But it's alarming to me that you could figure out how to waste maybe 4-5 meg on code. If you started with a bare home-brew OS, I would guess (and I get paid for this sort of guess) that you could do the entire flight code in 512K, with maybe 8k of data space, excluding the science data.

Only recently have space-qualified rad-hard processors with this kind of capability become available. Until then, if you said you needed 2 meg for the OS alone, you would have gotten fired on the sopt and referred to mental health professionals. The availability of these processors enabled people to use high-level languages with tremendous overhead (like C++) to be used. And this was only done for employee retention purposes during the bubble. For years it was done at the assembler or even machine level. It's still not at all uncommon to do, and we've done MANY flight code patches, with only a processor handbook, an engineering paper pad, and by setting individual bits one-by-one.

Brett
Open source spaceware by relaxrelax · 2004-11-20 07:56 · Score: 5, Insightful

If that was open source, there are so many space nerds who are programmers that flaws of that magnitude would never get by the army of testers.

Many would help out simply because hey it's the *space program* and that's good enough for them. Other would want their name listed next to some obscure bug fix on a NASA site; it's good for the ego or your CV.

Simply put, even a binary distribution of that code would allow unlimited free testing for crashes. Why wouldn't NASA do it?

Because there are still people in washington that think code mysteriously get damaged by being public - even if such code isn't modifiable by the public who reads it.

This is evidence of advanced cluelessness in Washington and maybe independant anti-free-source advocates (spelled M-i-c-r-o-s-o-f-t) are at cause.

But I've learned not to bash. Never explain by Microsoft malice what could be explained by stupidity. Such as using DOS on a space thing...

--
Microsoft is pure dog-ma. FreeBSD is pure cat-ma.
Similar, though terrestrial, problems by renehollan · 2004-11-20 09:52 · Score: 5, Interesting

I'll leave out the names to protect the guilty.
About five years ago, I worked for a major test equipment manufacturer who was contracted to deliver a test system for POTS lines (which could eventually do ADSL prequalification) to a national telco in a major European country. The idea was to test every POTS line in the system (millions of them) every night to detect early signs of degradation so repair crews could be dispatched before dialtone was completely lost.
As you can imagine, this involved a distributed system of test heads in each central office, networked back to a central command and control site. The sysem worked well, but had one flaw: downloading new firmware to the test heads was fraught with problems, and often led to the test head "locking up", even though a backup copy of firmware was always present, along with a hardware watchdog timer (though it was possible to lock out the watchdog interrupt, particularly when reprogramming flash, so it was a less than perfect watchdog). In these situations, one had to dispatch a "truck roll" to the affected central office, and replace EPROMs by hand.
Needless to say, the customer was pissed. More worrying was that even if we fixed the software download problem (which we were unable to reproduce in the lab), was that we'd be paying for truck rolls all over the country. This was a not insignificant amount of money.
Management frittered away time, instead of authorizing a root cause analysis, by requesting tweaks to TCP/IP operating parameters, and testing to see if the problem was getting better or worse. This did not prove illuminating, time was wasted, and the customer was getting royally angry.
Finally, a small team of us were permitted to undertake a root cause analysis to find and fix the problem: the engineer responsible for the embedded flash file system, the telecom engineer on the control side, and I: responsible for the embedded O/S, and TCP/IP stack (inherited from the supplier of the embedded O/S). We wanted a month. We got two weeks. Remember, deploying experimental software to live COs requires so many layers of approval, it isn't funny, and we were worried that would be our biggest bottleneck.
Finally, the controller telecom engineer was able to reproduce the problem, by attempting to download software from our controllers to deployed equipment in a single central office (getting permission was a feat in itself -- while there was little danger of affecting telephone service, this was a live CO).
The problem was clear: the data network was slow (9600 b/s over an X.25 PVC, carrying PPP-encapsulated TCP/IP), resulting in the use of large MTUs to minimize packetizing overhead (latency wasn't an issue - throughput was). Because of the way the controller's TCP/IP stack worked, it misestimated the packet/ack round trip time: it used a one byte payload for the first packet, and full MTUs after that. The resulting packet ACK timeout and retransmissions exposed an inconsistency between controller and embedded TCP/IP stacks that caused the embedded system to lock up.
Great. Now, how to fix it?
The fix wasn't a big deal (I implemented a fix in the embedded TCP/IP code since we didn't have source to the controller TCP/IP stack), but deploying it was: remember we couldn't download the code sucessfully, and we didn't want to pay for a truck roll.
At this point, I proposed something daring: download a small patch, in as few packets as possible (we could send three full MTUs safely). which would patch the existing code in place, which would be good enough to reliably download a complete replacement.
The thought of "self-modifying code" freaked management out to no end: it went against every rule in the book. But all three of us stood our ground: the only other alternative was a truck roll to each central office in the country. Reluctantly, we were allowed to proceed with that fix.
At this point, we had about ten days left. I had managed to get approval to pipeline the dev and tes

--
You could've hired me.