Writing Code for Spacecraft
CowboyRobot writes "In an article subtitled 'And you think *your* operating system needs to be reliable,' Queue has an interview with the developer of the OS that runs on the Mars Rovers. Mike Deliman, chief engineer of operating systems at Wind River Systems, has quotes like, 'Writing the code for spacecraft is no harder than for any other realtime life- or mission-critical application. The thing that is hard is debugging a problem from another planet,' and, 'The operating system and kernel fit in less than 2 megabytes; the rest of the code, plus data space, eventually exceeded 30 megabytes.'"
I wonder whether they will be releasing the source. It could be an interesting read.
yes >
All software has bugs. What happens when they have an update halfway through the trip? Who installs it remotely? And I guess having a sysop reboot is out of the question...
CBB
free ipod and free gmail!
"The operating system and kernel fit in less than 2 megabytes; the rest of the code, plus data space, eventually exceeded 30 megabytes." This should be used as the example for efficient coding.
Requiem
while (1 == 1) { Dig(); Picture(); }
The interviewer George Neville-Neil co-authored "The Design and Implementation of the FreeBSD Operating System" with Marshall Kirk McKusick.
cpghost at Cordula's Web.
Too bad about their compiler/assembler line; it is not half as reliable as their Mars rover software...
...rover codes you!
Should have just used WinCE, with a few of the productivity apps cut out. Adding a copy of pocket Auto-route, with some Martian JPEGS would have helped navigation as well.
I would like to think that this article embodies the reasons that John Carmack got into space program development to begin with.
In the beginning he got into 3d game applications for a similar reason. The cutting edge is always the very outer area of human development, and Carmack makes a good example of a programmer who has taken aim at the edge of what is known to programmers. Maybe Mr. Carmack would care to comment?
Much like how Id Software develops engines, spacecraft programming is new and innovative, although the difference is that spacecraft systems have no room for error.
The dangers of knowledge trigger emotional distress in human beings.
a beowulf cluster of morons who think this joke is funny...
a beowulf cluster of rovers :P
Don't you mean a convoy of rovers?
Kevin
Wasn't the rover's OS loaded with problems? Go read the past news from last February here on Slashdot.
VxWorks does not even offer memory protection, and the RAM can get fragmented. Not to sound trollish, but I would pick something like QNX or NetBSD for any critical app or embedded device.
It's amazing the engineers fixed it and got it to work reliably, but a better, more mission-critical operating system would have been a better choice.
http://saveie6.com/
My linux kernel comes in at 1.7 meg and that's a fairly large kernel from what I've seen.
From Mr.Marvin
Olympus Mons Coast.
DEAR SIR/MADAM,
I AM HAPPY TO WRITE AND SEND THIS MESSAGE TO YOU.
AND I STRONGLY BELIEVE THAT THIS MESSAGE WOULD COME TO YOU AS A SURPRISE BUT I HOPE YOU WILL CONSIDER IT AS A CALL FROM A FAMILY IN DARE NEED AND GIVE IT URGENT CONSIDERATION. MY NAME IS MR marvin, A CITIZEN OF MARS AND THE SON OF LATE DR. FIDELIS GUBWANO WHO BEFORE HIS DEATH WAS THE MANAGER OF MARTIAN FINANCIAL TRUST CORPORATION (M.F.T.C). UPON HIS DEATH HE $60,000,000 (SIXTY MILLION U.S. DOLLARS) IN A THE OLYMPUS MONS BRANCH OF THE MARTIAN PLANETARY BANKING SYSTEM. I BELIEVE YOU TO BE AN HONEST AND TRUSTWORTY CITIZEN AND CAPABLE OF ASSISTING ME IN REMOVING THE MONEY FROM THIS ACCOUNT.
#include <stdio.h>
int main() {
    printf("Hello World!\n");
    return 0;
}
marsrover.c: 3: You are no longer on the planet Earth.
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
Remember some time ago Spirit was continuously rebooting due to a flash memory problem? The use of a FAT file system in the embedded system was partly responsible for the mess.
The problem, Denise said, was in the file system the rover used. In DOS, a directory structure is actually stored as a file. As that directory tree grows, the directory file grows, as well. The Achilles' heel, Denise said, was that deleting files from the directory tree does not reduce the size of the directory file. Instead, deleted files are represented within the directory by special characters, which tell the OS that the files can be replaced with new data.
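To make the directory-growth point concrete, here is a toy Python model. It is only a sketch, not real FAT code and certainly not the rover's file system. It assumes the usual FAT conventions of 32-byte directory entries and a 0xE5 marker for deleted entries; the class and function names are made up for illustration.

```python
# Toy model of a FAT-style directory table (illustrative only, not real FAT code).
ENTRY_BYTES = 32   # a FAT directory entry occupies 32 bytes
DELETED = 0xE5     # FAT marks a deleted entry with 0xE5 in the first name byte


class ToyDirectory:
    """Hypothetical directory whose table never shrinks, only reuses slots."""

    def __init__(self):
        self.entries = []  # each slot holds a file name or the DELETED marker

    def create(self, name):
        # Reuse the first tombstoned slot if there is one, otherwise grow the table.
        for i, entry in enumerate(self.entries):
            if entry == DELETED:
                self.entries[i] = name
                return
        self.entries.append(name)

    def delete(self, name):
        # Deleting only overwrites the slot with a marker; the slot itself stays.
        self.entries[self.entries.index(name)] = DELETED

    def size_bytes(self):
        return ENTRY_BYTES * len(self.entries)


def demo_directory_growth(file_count=1000):
    d = ToyDirectory()
    names = [f"log{i}.dat" for i in range(file_count)]
    for name in names:
        d.create(name)
    peak = d.size_bytes()
    for name in names:
        d.delete(name)
    after_delete = d.size_bytes()
    d.create("new.dat")  # lands in a tombstoned slot, so no growth
    return peak, after_delete, d.size_bytes()


print(demo_directory_growth())
```

With the default 1000 files, all three sizes come out the same (32,000 bytes): the table stays at its high-water mark even when every file is gone.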
By itself, the cancerous file might not have been an issue. Combined with a "feature" of a third-party piece of software used by the onboard Wind River embedded OS, however, the glitch proved nearly fatal.
According to Denise, the Spirit rover contains 256 Mbytes of flash memory, a nonvolatile memory that can be written and rewritten thousands of times. The rover also contains 128 Mbytes of DRAM, 96 Mbytes of which are used for data, such as buffering image files in preparation for transmitting them to Earth. The other 32 Mbytes are used for code storage. An additional 11 Mbytes of EEPROM memory are used for additional program code storage.
The undisclosed software vendor required that data stored in flash memory be mirrored in RAM. Since the rover's flash memory was twice the size of the system RAM, a crash was almost inevitable, Denise said.
Moving an actuator, for example, generates a large number of tiny data files. After the rover rebooted, the OS's heap memory would be a hair's breadth away from a crash, as the system RAM would be nearly full, Denise said. Adding another data file would generate a memory allocation command to a nonexistent memory address, prompting a fatal error.
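A back-of-the-envelope check of the sizes quoted above, sketched in Python. It assumes a strict 1:1 mirror of in-use flash inside the 96 Mbytes of DRAM set aside for data, which is a simplification of what the article describes; the function name is hypothetical.

```python
FLASH_MB = 256      # flash size quoted in the article
DRAM_MB = 128       # total DRAM quoted in the article
DRAM_DATA_MB = 96   # portion of DRAM the article says is used for data


def mirror_headroom_mb(flash_in_use_mb, ram_budget_mb=DRAM_DATA_MB):
    """RAM left after mirroring the in-use flash; a negative value means it cannot fit."""
    return ram_budget_mb - flash_in_use_mb


for used_mb in (32, 96, 128, FLASH_MB):
    print(f"{used_mb:3d} MB of flash in use -> {mirror_headroom_mb(used_mb):4d} MB of RAM headroom")
```

Under that assumption the headroom hits zero once 96 Mbytes of flash are in use, long before the flash itself is full.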
Source: DOS Glitch Nearly Killed Mars Rover
BTW, there is another interview with Mike Deliman I read some time ago in PCWorld.
For those who are wondering, JPL is very aware of the shortcomings of VxWorks and has seriously considered other alternatives for every mission. Keep in mind that the choice of OS has to be made years before launch, so at the time the OS for the 2004 Mars Rovers was decided on, many options that are possibilities today were not contenders. Also keep in mind that in spite of many shortcomings, VxWorks is a known quantity. JPL has been working with it for years and had a lot of in-house expertise with it.
There are a few groups at JPL that have been actively experimenting with other options, including RTLinux and a few different variants of hard-real-time Java (basically Java with explicit memory management and no garbage collection).
you are in a red rocky landscape..
GO NORTH..
you are in a red rocky landscape..
DIG.
ok. you see some red sand.
it is getting dark.
GO NORTH..
you were eaten by a grue.
"You lied to me! There is a Swansea!"
I can't get through to acmqueue.com. Can someone post an alternate link to the article?
Of course, a wise man knows the difference between "the" and "than".
At the bottom of the
Don't hold your breath though.
I worked on a satellite mission where we had some trouble. Due to an error the satellite wound up pointing 16 degrees away from the sun in a higher-than-expected orbit of 443 miles (714 kilometers) above Earth.
The misalignment meant the spacecraft was unable to look directly at the sun's center to record the amount of radiation streaming toward Earth. To accurately measure sunlight, the darn thing needed to be pointed to within a quarter of a degree of dead center.
It took about four and a half months to fix that problem, due to uplink difficulties. Ground controllers first had to slow the spacecraft's spin in order to transmit a series of software "patches" and then gradually speed it up to see how well the commands worked.
Then things were fixed.
Moral of the story: it is a tough job indeed!
Do the Debian.
In my experience mutexes, semaphores, etc. always cause trouble. There is nearly always another way to write things.
And you'll never ever see me coding an infinite wait for a mutex. That's just asking for trouble.
Bad: in Windows, FindNextChangeNotification() requires those IPC operations, and it always gives me grief.
Good: The Linux File Activity Monitor (FAM). Lets you open and read a pipe of actions. Nice!
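On the point about never coding an infinite wait for a mutex, here is a minimal Python sketch of the bounded alternative. The helper name and the half-second default are assumptions for illustration; the only library call it relies on is the standard `Lock.acquire(timeout=...)`.

```python
import threading


def with_lock_or_give_up(lock, work, timeout_s=0.5):
    """Run work() while holding lock.

    Returns (True, result) on success, or (False, None) if the lock could not
    be taken within timeout_s seconds, so the caller can log, retry or recover
    instead of hanging forever.
    """
    if not lock.acquire(timeout=timeout_s):
        return False, None
    try:
        return True, work()
    finally:
        lock.release()


shared_lock = threading.Lock()
print(with_lock_or_give_up(shared_lock, lambda: "did the work"))
```

The caller has to decide what a timeout means (retry, report, reset), which is exactly the decision an infinite wait lets you avoid making.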
Okay, I've got to call foul on this WindRiver marketing ploy. They're trading on the last days of being able to get away with saying that something mystical and special and super-high quality is going on behind the walls of trade secret and proprietary software.
I used VxWorks on a reasonably large project several years ago. It's a fine piece of work, but nothing special; it's nowhere close to the quality of a recent Linux kernel.
About half-way through our project we developed a need for a local filesystem on our box. We bought a FAT filesystem add-on from Wind River that was annoyingly poor quality: lots of bizarre little problems, memory leaks, and of course no source to look at. In the end we didn't use it; we put together our own filesystem from freely available sources.
When I read the articles about VxWorks filesystem problems nearly borking the entire Mars rover mission I laughed and laughed. I'm sure that it was the same crappy code (although I don't really know for sure).
For me it's a case study on why you shouldn't use closed source software: you can't evaluate the quality of the code on the other side of the trade-secret barrier, and you wind up trusting things like glossy brochures.
jeff
If that was open source, there are so many space nerds who are programmers that flaws of that magnitude would never get by the army of testers.
Many would help out simply because hey it's the *space program* and that's good enough for them. Others would want their name listed next to some obscure bug fix on a NASA site; it's good for the ego or your CV.
Simply put, even a binary distribution of that code would allow unlimited free testing for crashes. Why wouldn't NASA do it?
Because there are still people in Washington who think code mysteriously gets damaged by being public, even if such code isn't modifiable by the public who reads it.
This is evidence of advanced cluelessness in Washington, and maybe independent anti-free-source advocates (spelled M-i-c-r-o-s-o-f-t) are the cause.
But I've learned not to bash. Never explain by Microsoft malice what could be explained by stupidity. Such as using DOS on a space thing...
Microsoft is pure dog-ma. FreeBSD is pure cat-ma.
Contiki - multitasking kernel, TCP/IP stack, GUI, themeable window system, web server, web browser, etc. Runs in 40k RAM (yes, only 40960 bytes!). That's efficient coding.
Perhaps not surprisingly for anyone who has heard about the management at NASA, C++ was selected for the successors to the Remote Agent on the grounds that it is supposed to be more reliable (this despite the fact that the Remote Agent was originally to be developed in C++, an effort that was abandoned after a year of failure). This caused more than a few people to be upset (including a very personal account by one of the aforementioned designers). Clearly the debugging facilities of Common Lisp are far superior to static systems like C++, something which is very useful in diagnosing unexpected error conditions in spacecraft software (read the first question on p. 3 of the interview to see what pains the JPL staff went through to adapt similar, ad-hoc methods to VxWorks). It's also clear from this interview (question: "How is application programming done for a spacecraft?" Answer: "Much the same as for anything else: software requirements are written, with specifications and test plans, then the software is written and tested, problems are fixed, and eventually it's sent off to do its job.") that NASA has in no way tried to adopt formal verification methods for its software, preferring instead to rely on the "tried and true" (at failing, maybe) poke-and-test development "methods."
Clearly, using formal verification methods to eliminate bugs before critical software is deployed, and deploying it on a system with advanced debugging facilities, is a win for spacecraft software and should be adopted as the standard model of development. Unfortunately, as in many other software development enterprises, inertia keeps outdated, inadequate systems going despite a strong correlation with failure.
In the great CONS chain of life, you can either be the CAR or be in the CDR.
Explain to me how the source code for a computer designed to operate a slow-moving, 4- or 6-wheeled vehicle used to take pictures and to sample temperature, radiation and other scientific data could be adapted for use on an aircraft with a cruising speed of about 84 miles per hour.a /uav.htm
Also, China already has its own UAV. "China's armed forces have operated the Chang Hong (CH-1) long-range, air- launched autonomous reconnaissance drone since the 1980s. China developed the CH-1 by reverse-engineering US Firebee reconnaissance drones recovered during the Vietnam War. An upgraded version of the system was displayed at the 2000 Zhuhai air show and is being offered for export. A PRC aviation periodical reported that the CH-1 can carry a TV, daylight still, or infrared camera." (from http://www.globalsecurity.org/military/world/chin
I have gas, but my car uses petrol.
I remember running Windows 95, and if my PC cost a few million dollars I wouldn't want a copy of Win95 within 100ft of it.
System.out.println(syynnapse.getSig());
Writing the code for spacecraft is no harder than for any other realtime life- or mission-critical application. The thing that is hard is debugging a problem from another planet: you can't put your hands on the malfunctioning system to see what's going on; you must use intuition and experience.
System.out.println(syynnapse.getSig());
Hands down for any Mission Critical application.
Why, in the 21st century, is it necessary to fit something like the Mars rover code in 2MB of memory? If something like a Gameboy Advance or a PDA can hold anywhere from 64MB to a couple of gigs, what is holding NASA back, with their gigantic budget and all?
I can't imagine it would be the cost of the memory... I mean I know it costs much much more to make chips to a very strict specification, but if you are already producing so few units, isn't your cost of production going to be extraordinarily high whether you are making 64KB chips or 2MB or even 64MB?
This is not to say that I don't have admiration for fitting all that code in such a small space, but is there a reason they feel the need to do so?
Writing Code for Spacecraft
My first thought was "Spacecraft? is that a new Starcraft clone I hadn't heard about?". It was then I realized I've been hanging out on the Game Programming Wiki too much lately.
YHBT ;)
I'd point out how stupid your argument is, but I don't think I really have to. It speaks for itself.
It's been a long time.
Exactly!
The problem is that most /.ers are used to thinking of an OS as something that needs to run any arbitrary program under any arbitrary conditions and survive any arbitrary crash in those programs.
For a Rover, none of those are true. They know exactly what code is going to be run. They know exactly where it's going to sit in memory. And they test it. (This is the part that /.ers can't quite understand.) They test these programs far more rigorously than any bog-standard x86 Linux OSS program ever gets tested. Those programs have their problems, but they will be mistakes in logic (metric/imperial conversions, or thread priority inversions), not segfaults because of derefing a null pointer.
I wonder how many undergrad CS degree programs still teach correctness proofs? Not "yeah, I ran it lots of times and it didn't crash," but "I ran it 100,000 times with 100,000 different inputs, all random, and it didn't crash, but while it was running I also sat down and mathematically proved the code is correct."
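For what the "100,000 random inputs" half of that looks like, here is a small Python sketch. The function under test (`clamp`) and the checked properties are invented for illustration; this is the testing half only and is no substitute for the proof.

```python
import random


def clamp(value, low, high):
    """Function under test: force value into the closed range [low, high]."""
    return max(low, min(value, high))


def random_test(trials=100_000, seed=1234):
    """Check two properties of clamp() on randomly generated inputs."""
    rng = random.Random(seed)  # fixed seed so any failure can be replayed
    for _ in range(trials):
        low, high = sorted((rng.randint(-10**6, 10**6), rng.randint(-10**6, 10**6)))
        value = rng.randint(-2 * 10**6, 2 * 10**6)
        result = clamp(value, low, high)
        # Property 1: the result always lies inside the range.
        assert low <= result <= high, (value, low, high, result)
        # Property 2: values already inside the range pass through unchanged.
        if low <= value <= high:
            assert result == value, (value, low, high, result)
    return trials


print(random_test(), "random cases passed")
```

Passing every trial only says no counterexample turned up among the inputs tried; the proof is what covers the inputs nobody generated.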
Embedded programming is just plain different than "normal" programming. It's usually a mistake to try to generalize from one to the other.
(All that said, the next version of VxWorks is advertised to optionally support a "traditional Unix" process model, and I think protected memory boundaries are one of the features. In case your embedded app needs to run arbitrary third-party software which probably doesn't get stress-tested at JPL :-), you can turn all that stuff on and live with the overhead.)
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
...the memory inside the Gameboy Advance and whatnot isn't radiation-hardened.
The grandparent poster needs to RTFA, and note what had to be done to protect circuits from Marvin the Martian's cosmic rays. The chips get physically bigger (sometimes a lot bigger), and that builds up quickly.
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
About five years ago, I worked for a major test equipment manufacturer who was contracted to deliver a test system for POTS lines (which could eventually do ADSL prequalification) to a national telco in a major European country. The idea was to test every POTS line in the system (millions of them) every night to detect early signs of degradation so repair crews could be dispatched before dialtone was completely lost.
As you can imagine, this involved a distributed system of test heads in each central office, networked back to a central command and control site. The system worked well, but had one flaw: downloading new firmware to the test heads was fraught with problems, and often led to the test head "locking up", even though a backup copy of firmware was always present, along with a hardware watchdog timer (though it was possible to lock out the watchdog interrupt, particularly when reprogramming flash, so it was a less than perfect watchdog). In these situations, one had to dispatch a "truck roll" to the affected central office, and replace EPROMs by hand.
Needless to say, the customer was pissed. More worrying was that even if we fixed the software download problem (which we were unable to reproduce in the lab), we'd be paying for truck rolls all over the country. This was a not insignificant amount of money.
Management frittered away time, instead of authorizing a root cause analysis, by requesting tweaks to TCP/IP operating parameters, and testing to see if the problem was getting better or worse. This did not prove illuminating, time was wasted, and the customer was getting royally angry.
Finally, a small team of us was permitted to undertake a root cause analysis to find and fix the problem: the engineer responsible for the embedded flash file system, the telecom engineer on the control side, and me, responsible for the embedded O/S and TCP/IP stack (inherited from the supplier of the embedded O/S). We wanted a month. We got two weeks. Remember, deploying experimental software to live COs requires so many layers of approval, it isn't funny, and we were worried that would be our biggest bottleneck.
Finally, the controller telecom engineer was able to reproduce the problem, by attempting to download software from our controllers to deployed equipment in a single central office (getting permission was a feat in itself -- while there was little danger of affecting telephone service, this was a live CO).
The problem was clear: the data network was slow (9600 b/s over an X.25 PVC, carrying PPP-encapsulated TCP/IP), resulting in the use of large MTUs to minimize packetizing overhead (latency wasn't an issue - throughput was). Because of the way the controller's TCP/IP stack worked, it misestimated the packet/ack round trip time: it used a one byte payload for the first packet, and full MTUs after that. The resulting packet ACK timeout and retransmissions exposed an inconsistency between controller and embedded TCP/IP stacks that caused the embedded system to lock up.
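A rough worked example of why timing a one-byte packet gives a useless round-trip estimate on a 9600 b/s link, sketched in Python. Only the link speed and the one-byte first payload come from the description above. The 1500-byte MTU, the 40 bytes of TCP/IP headers, the timeout rule (a fixed multiple of the first measurement) and the decision to ignore PPP/X.25 framing and propagation delay are all assumptions for illustration, not details of the actual stacks.

```python
LINK_BPS = 9600      # link speed, from the post
HEADER_BYTES = 40    # assumption: 20-byte IP header + 20-byte TCP header, no options
MTU_BYTES = 1500     # assumption: the post only says "large MTUs"
RTO_FACTOR = 2.0     # assumption: retransmit timeout = factor * first measured RTT


def wire_time_s(nbytes):
    """Seconds needed just to clock nbytes onto the link."""
    return nbytes * 8 / LINK_BPS


def round_trip_s(payload_bytes):
    """Serialization time of one data packet plus a header-only ACK coming back."""
    return wire_time_s(HEADER_BYTES + payload_bytes) + wire_time_s(HEADER_BYTES)


first_rtt = round_trip_s(1)                        # what the controller measured
full_rtt = round_trip_s(MTU_BYTES - HEADER_BYTES)  # what a full-MTU packet needs
timeout = RTO_FACTOR * first_rtt

print(f"RTT measured with a 1-byte payload: {first_rtt:.4f} s")
print(f"timeout derived from that sample:   {timeout:.4f} s")
print(f"RTT for a full-MTU packet:          {full_rtt:.4f} s")
print("spurious retransmission expected:", full_rtt > timeout)
```

Under these assumptions the first sample is about 0.07 s while a full-MTU exchange needs about 1.3 s, so the sender times out and retransmits long before the ACK could possibly arrive.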
Great. Now, how to fix it?
The fix wasn't a big deal (I implemented a fix in the embedded TCP/IP code since we didn't have source to the controller TCP/IP stack), but deploying it was: remember we couldn't download the code successfully, and we didn't want to pay for a truck roll.
At this point, I proposed something daring: download a small patch, in as few packets as possible (we could send three full MTUs safely), which would patch the existing code in place and be good enough to reliably download a complete replacement.
The thought of "self-modifying code" freaked management out to no end: it went against every rule in the book. But all three of us stood our ground: the only other alternative was a truck roll to each central office in the country. Reluctantly, we were allowed to proceed with that fix.
At this point, we had about ten days left. I had managed to get approval to pipeline the dev and tes
You could've hired me.
Put it on SourceForge and watch what branches appear :)
If they release [part of?] the source then they should also release their test cases as well, and then award cash prizes to whoever is first to find and confirm input datasets that result in a new bug, by posting it to an online forum. Of course, this would probably be most useful for the next rover design, but may require extra work to set up that may make the effort more expensive than doing it yourself (in the short term). But, if even one major bug is found this way I think the effort could easily pay for itself. Surely a metric unit conversion error would be spotted easily this way.
Of course, this is in an Ideal World where the OS is not platform-specific and could be run under Linux (similar to how an instance of Linux can run under itself to allow quicker testing of kernel patches).
From then on, outgoing communications to the rover would probably need to be encrypted :) but it is probably just as well, as long as they don't give out the key, the communications frequencies, the exact location, etc.
Unfortunately I don't know any of the details offhand.
I as well have had the misfortune to pick WindRiver as the core OS for my project, and have had no end of problems.
Part of the problem in my case was that VxWorks is for smaller embedded systems, which my project is NOT. I need fast disk storage, I need graphics, I need networking, I need things that VxWorks just doesn't provide very well.
Were I able to change one decision about the design of my project, I would have gone with Linux instead.
WRS *used* to have something to offer, in that they provided a real-time OS and hardware driver bundles (board support packages in WRS-speak). However, they no longer provide great value in that area - Linux has far better hardware support, and for any reasonably complex project will scale down as well as VxWorks will scale up.
www.eFax.com are spammers
You cite two /. articles in the "Publications" section of your resume. What kind of response has this received in interviews?
Vista:XPSP2::ME:98SE
They should try debugging from my QA lab. That'll give them a run for their money.
NASA may consider using a new OS after it has finished V&V in house and by an independent testing company (per NASA procedures) and it has flown in space successfully. An order of magnitude estimate is ten times the development cost.
VxWorks is a well-known OS with lots of experienced users. Priority inversion is a known problem; just set the SEM_INVERSION_SAFE flag in semMCreate() to fix it.
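For anyone who hasn't run into priority inversion, here is a toy tick-based scheduler in Python that shows what an inversion-safe mutex buys you. It is a sketch, not VxWorks code: the three tasks, their priorities and their work amounts are invented, and each lock-using task is assumed to hold the lock for its entire run.

```python
def simulate(inherit):
    """Return {task name: finish tick} for a fixed three-task scenario.

    inherit=True models priority inheritance: the lock owner temporarily runs
    at the priority of the highest-priority task blocked on that lock.
    """
    tasks = {
        "low":  {"prio": 1, "arrive": 0, "work": 3, "needs_lock": True},
        "med":  {"prio": 2, "arrive": 1, "work": 5, "needs_lock": False},
        "high": {"prio": 3, "arrive": 1, "work": 1, "needs_lock": True},
    }
    owner = None   # name of the task currently holding the lock
    finish = {}
    tick = 0
    while len(finish) < len(tasks):
        ready = [n for n, t in tasks.items() if t["arrive"] <= tick and n not in finish]
        # A task is blocked if it needs the lock and some other task holds it.
        blocked = [n for n in ready if tasks[n]["needs_lock"] and owner not in (None, n)]

        def effective_prio(name):
            prio = tasks[name]["prio"]
            if inherit and name == owner:
                prio = max([prio] + [tasks[b]["prio"] for b in blocked])
            return prio

        runnable = [n for n in ready if n not in blocked]
        running = max(runnable, key=effective_prio)
        if tasks[running]["needs_lock"]:
            owner = running
        tasks[running]["work"] -= 1
        tick += 1
        if tasks[running]["work"] == 0:
            finish[running] = tick
            if owner == running:
                owner = None
    return finish


print("without inheritance:", simulate(inherit=False))
print("with inheritance:   ", simulate(inherit=True))
```

Without inheritance the medium-priority task runs to completion while the high-priority task sits blocked behind the low-priority lock holder; with inheritance the holder is boosted, releases the lock quickly, and the high-priority task finishes well before the medium one.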
Besides making the OS, Wind River also sells Tornado(R) and other tools for developing, debugging, and testing embedded realtime code running in the target computer. Anyone who has ever done embedded and realtime code knows good tools are mandatory with any complex system.
VxWorks runs in a flat address space. There is no segment protection, but code does get extensively reviewed and tested, so bad pointers are not a problem. Preventing memory fragmentation requires good design (the solutions are well known) and more reviews and testing.
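One of the commonly used designs for avoiding fragmentation is a pool of fixed-size blocks allocated up front, sketched here in Python. This is an illustration picked for the example, not something the post or VxWorks prescribes, and the class and method names are hypothetical.

```python
class FixedBlockPool:
    """Preallocated pool of equal-sized blocks; block handles are integer indices."""

    def __init__(self, block_size, block_count):
        self.block_size = block_size
        self.memory = bytearray(block_size * block_count)  # one up-front allocation
        self.free_blocks = list(range(block_count))        # indices of unused blocks

    def alloc(self):
        """Return a free block index, or None when the pool is exhausted."""
        if not self.free_blocks:
            return None
        return self.free_blocks.pop()

    def free(self, index):
        """Hand a block back; any freed block can satisfy any later alloc()."""
        if index in self.free_blocks:
            raise ValueError(f"block {index} is already free")
        self.free_blocks.append(index)

    def view(self, index):
        """Writable view of the bytes belonging to one block."""
        start = index * self.block_size
        return memoryview(self.memory)[start:start + self.block_size]


pool = FixedBlockPool(block_size=64, block_count=4)
handle = pool.alloc()
pool.view(handle)[:5] = b"hello"
pool.free(handle)
```

Because every block is the same size, freeing and reallocating in any order never leaves holes too small to use; the cost is wasted space inside blocks and a hard upper limit chosen at design time.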
The last time I priced a run time license (most satellites need two licenses), it was noise ($400) compared to the labor required to build a spacecraft.
I am a VxWorks user in the space business.
_Richard
I don't think VxWorks has or had a C language interpreter. VxWorks is in transition from a Tcl shell to a Java-based shell. Prior to Tcl, VxWorks had some type of shell script language. C code is hard enough to parse in batch mode; I would not want to try to write a runtime interpreter.
Can anyone confirm this, or fill in what I don't remember?
_Richard
"...and even though I chose the wrong tool for the job, it's still the tool's fault for not doing everything I need."
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
Honestly, it never came up.
Look, a resume proves you can "talk the talk".
An interview is your opportunity to prove that you can "walk the walk" as well.
You could've hired me.
My Win2k machine has a 1,702,800 byte NTOSKRNL.EXE, and that's not compressed. Using NTFS compression it gets down to 1,286,144 bytes.
A few links away from the article is the first bboard post that proposed the smiley as joke marker:
19-Sep-82 11:44 Scott E Fahlman :-)
From: Scott E Fahlman
I propose that the following character sequence for joke markers:
:-)
Read it sideways. Actually, it is probably more economical to mark
things that are NOT jokes, given current trends. For this, use
:-(
It was around the first or second month of operation this year, but Spirit was unusable for a couple of weeks due to an OS failure. The symptom was that Spirit tried to reboot itself about 20 times in a row, a default practice if something drastic happens. It was traced (according to the rumor mill) to flash memory overflow. Supposedly the VxWorks file management system improperly updated its flash memory free-inode list, so the memory appeared to run out of space.
The nice thing about software is that JPL was able to upload a patch and get both rovers working properly again. They reconfigured the Galileo mission to bypass the broken high-gain antenna and use the hundred-times-slower low-gain antenna with software patches, and achieved most of the mission objectives.
I hope no one overlooked the "radiation hardening" part of the article. This is something the average person, and even a lot of techs I talk to, don't realize is important. Speed is not the only variable in the equation. I'd much rather have a chip that doesn't fall to pieces on me while I'm flying through space. In fact I think it's time for us normal people to get used to thinking about quality again. We are soon going to be forced into harsh elements where we must be able to depend, absolutely, on the hardware being reliable. It's time we start now getting used to the performance loss some might have because of it; or get ready to ditch thin again.
My sig is as boring as you...
If you had read the article, you would have discovered that JPL had full source code to VxWorks. The article belabors the fact that the folks at WindRiver went out of their way to make sure that JPL could compile the entire system from scratch.
I'm as fervent a WindRiver basher as the next guy. But at least bash them for things they are *guilty* of. Sheesh!