Debugging The Spirit Rover
icebike writes "eeTimes has a story on how the Mars Rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack, and the rover could have just as likely been lost forever. Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?"
'How do you diagnose an embedded system that has rendered itself unobservable?'
The way you do this is by having an exact duplicate of the remote system so you can set up a test with conditions as close to those under which the remote system is currently operating. You can then do a series of carefully controlled test solutions to determine the optimum prior to trying it on the "live" system.
This is the way I set up all my production systems and, barring catastrophic hardware failure (self-immolating disks and a router which just folded when its power supply burped) I've had perfect uptime.
(well, ok.. there was that one time, late at night, when I typed "reboot" in the wrong window.. but that happens...)
I have something in common with Stephen Hawking...
Another factor in this is the safety of the flash ram. It is rad-hardened and built with tons of extra error correction which again, requires years of testing and special design considerations. And is extremely expensive.
-
Uhmm... we DID build a 'twin' of the rover, hardware and all. Give us a bit more credit, will ya? :-P What you may not realize is that exposure to radiation on the surface of Mars, solar wind while in transit and other factors such as thermal expansion / contraction, etc. are slowly degrading the rovers in nondeterministic ways. It is not nearly as simple as 'running the commands in the testbed' at JPL to diagnose any problems which occur.
To me, if this were a Unix-like system, it sounds like they ran out of inodes. Running out of inodes is very different than running out of disk space.
If you think runing out of disk space can be hard to trouble shoot, try running out of inodes.
Score: -1, Didn't Read Article
The rovers were extensively tested before launch. For example, NASA took about 100000 pictures with the test panoramic cameras under varying conditions to see how they would react. NASA put a test rover on a tilting platform to see how far over the rover tilt before it capsized, to find out at what angle the electric motors could no longer drive the rover up a hill, etc.
This limitation of the filesystem was known about ahead of time. If you had read the article, you'd have known that. They had a utility to clean out the rover's filesystem, but a storm at the Deep Space Network site that was supposed to transmit it prevented the second half of the utility from being uploaded to the rover. And before you say anything else, the article also mentioned that the people involved had thought of this possibility ahead of time.
What really surprises me is that NASA did not verify the software. Software verification is essentially mathematically proving the software. It is tedious and expensive but we are talking about NASA and the Mars. Infact even beloved MS formally verifies device drivers before use ( believe it or not !!) If the original program was correct they wouldnt have to reupload it and the entire problem ...gone.
You realize that the onboard computer is basically the same one as used on the Mars Pathfinder lander, right? Same CPU, same amount of RAM, even the same OS. I wouldn't be surprised if they used the same (or similar) circuit diagrams for certain things.
The point is to use well known and well tested hardware. The whole point of Mars Pathfinder was to develop a system whose design could be re-used for other Mars landers and rovers.
Lastly, what exactly are you going to do with greater flash capacity? The point of having any flash memory on the rovers at all is not for long term storage, but rather just to hold onto data until it can be transmitted to Earth, after which it gets deleted.
Despite what some idiot posted a few posts up, they did NOT run out of room on the flash drive. Rather, the problem is more akin to running out of i-nodes. Mounting the flash filesystem, reading all its metadata and whatnot, took up more RAM than was allocated for it, due to the high number of files it had to deal with (most of which were accumulated on the way to Mars, and were going to be deleted).
is because when the batteries got drained the os went into a stable "safe mode" state. If they made a long lasting powersupply this project was doomed(.f) and they never found out what the real problem was.
It was the inability to build the RAM-based directory structure of the files in the Flash memory.
Why couldn't they build the directory structure? They had too many files, the size of the files doesn't matter here, only the number of files.
In other words, they ran out of RAM, not Flash.
Exercise left for the readers: Why can a Unix file system that is out of inodes have much less than 100% disk usage and still not be able to create a file?
If you're really worried about your remote server being unreachable, here's what I would suggest doing:
Have a hardware watchdog. If the machine is lost or confused, it reboots itself.
Have it come up in a known state, fire off a few broadcast packets to the sysadmins, and run sshd but basically nothing else. Stay there for a minute or so.
If nobody's tried to log in and halt the boot process, carry on booting. With luck the problem was transient. Worst case the problem still exists, you reboot, and the admins get another chance to log in.
From the description of how they got Spirit back, it looks like this is exactly how it was set up.
Who'da thunk it!!
455fe10422ca29c4933f95052b792ab2
Vx-Works
A highly respected embedded OS.
YAW.
Your head of state is a corrupt weasel, I hope you're happy.
What if it turns the antenna the wrong way and looses connectivity? What if it gets hit by lightning? What if it falls in a hole? (go Beagle!)
:)
There is a low gain omni-directional antenna that can be used as backup. Infact I think they use it most of the time for commands and just use the high-gain for data transfer back to Earth. Which makes sense, they never need to send large amounts of data to the rover.
No lightning has ever been detected on Mars. Tho it's not impossible, it is very very unlikely. No proper observations of the night side of Mars has been done tho, so they may just be missing it.
And Opportunity did fall into a hole
Actually though, it's not too bad an analogy. While Earth based servers aren't absolutely unreachable like SPirit, they are often remote, and there are expenses associated with visiting them in person.
Various schemes now exist to help deal with that. Many boards have a small management processor (bmc, server management board, IPMI, whatever) that is used for remote diagnostics and reconfiguration when the main board won't even boot.
Meanwhile, LinuxBIOS supports two complete BIOS images. One 'old reliable' that once working is never changed, and one that can be upgraded freely. Coupled with a watchdog card or timer, it's decently managable in the field. That work is continuing.
Meanwhile, IBM is pushing the 'blue button' that forces a software reload from an image partition.
In that sense, the problem is strongly analogous. Most of us will not, however, encounter the exact problem that Spirit had, though some embedded device developers just might.
It's not that hard to pull off off this sort of seemingly amazing remote recovery with pure off-the-shelf tech if you plan for it in advance and are willing to pay a modest premium.
You need remote serial console access -- ideally including firmware/bios serial console access -- and remote power cycling, controlled by a small embedded system, either in separate units (APC masterswitch, terminal servers) or as part of the system unit (common on Sun gear as "LOM"/"ALOM"/etc.; some of this is also creeping into x86 mobos). All this lets you regain control of the system remotely.
Then it becomes a matter of hardening the system to let you recover from various other insults. Never let go with both hands: Mirrored disks (protecting against hardware failure) and multiple bootable partitions (protecting against software or human error) can both be used; netbooting is also a nice capability to have when you've got a bunch of servers in the same place.
Disclaimer: I bet you can do much of the above with other people's gear, but I work for Sun and I know it works for me...
And frankly, if your car isn't lasting ten years, then you bought junk in the first place. Of the four cars I've owned, not one has had a lifetime of less than ten years. Three of them were already older than that when then they came to me, and none lasted me less than four years. (Other than the one that got re-possesed, but I had that one three years.) But then I invest in regular maintenance, don't leadfoot, etc...
Here's a link to the NASA press release describing all the details to that fix of the Galileo orbiter. I remembered it because I sometimes work at JPL and walked into a lab where a JPL-er was packing up what looked like a home-brew old time reel-to-reel tape player. It turned out that it was the sister device to the Galileo flight system and the guy I was talking to was one of the brains who had figured out the fix! JPL press release