Debugging The Spirit Rover
icebike writes "eeTimes has a story on how the Mars Rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack, and the rover could have just as likely been lost forever. Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?"
Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?
As a former co-worker (hi, jwalker!) used to say when people tried to draw ridiculous analogies, "It's exactly like that...only different."
A programmer is a machine for converting coffee into code.
Man, I have a hard enough time debugging programs running on my local machine.
I don't think I want to learn too much from this, as the solution was the equivalent of rm -rf... On a side note, I wonder when the 40-minute ssh delay jokes will begin again.
drunk chemists
at least it wasn't a blue screen?
man rover?
The Human Cow - bringing you scrumtrelescence since 1995
I don't get it, couldn't NASA afford the on-site warranty?
Wow, I didn't expect the rover had 128MiB of RAM, or 256MiB of flash. Funny to think they had to run chkdsk from so far away :)
That's the thing that amazes me. Any technology having to do with space seems that much more advanced.
Here on earth we can't even build cars that require no maintenance and last more than 10 years.
I hope they use SSH or something... who's to say that on a future mission some hax0r doesn't grab control of a space probe and have it send goatse.cx pics back??
After all, the probe communicates using known frequencies. There may be problems picking up the return signal without an expensive antenna, I suppose. But then again, maybe some hax0r can build one cheaply and/or do what Captain Midnight did ( www.signaltonoise.net/library/captmidn.htm ).
All it takes is a transmitter out in the middle of nowhere in Africa or on some island.
I wouldn't worry about signal jamming though, as that would probably be discovered easily.
The Martians are pissed that the repair labor was outsourced to Earth.
Table-ized A.I.
If it was the hardware that got fried and they miraculously fixed that, I would understand but this was just a software glitch.
I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.
As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.
Sounds like NASA forgot to empty the rover's recycle bin. =)
Steal This Sig
...would have been to have "fixed" the problem before the hardware left earth. This "bug" (or more accurately, known limitation of the filesystem) should have been discovered here on earth if the rover had been properly tested.
The only real bug was the inability of the system to properly handle running out of file entries (or more specifically, consuming too much RAM as the number of file entries increased). However, the software should never have stressed the filesystem to that degree in the first place.
Dan East
Better known as 318230.
Granted, mainstream media have to keep their coverage dumbed down if Joe Public is going to read it. But what really bugs me is the lack of follow-up. We hear about poorly understood events as they are unfolding, then never hear about them later when they are completely understood.
A recent example is the gangway between ship and shore at the QM2's drydock. It collapsed killing lots of people, an investigation was launched. Why did it collapse? At the time it wasn't known. I'm sure it's known now, but there's been absolutely no followup.
This article about the rover is great not so much because of the level of detail but because it reports on an event with the benefit of hindsight.
Slashdot monitor for your Mozilla sidebar or Active Desktop.
What filesystem is used? Is wear leveling being used? The directory structure is apparently stored in RAM during the day (why else would it use so much RAM?), that is a good thing for reducing wear on the flash system. But what's the number of writes on the flash chips? When will that number be reached?
"It's too bad that stupidity isn't painful." - Anton LaVey
'How do you diagnose an embedded system that has rendered itself unobservable?'
The way you do this is by having an exact duplicate of the remote system so you can set up a test with conditions as close to those under which the remote system is currently operating. You can then do a series of carefully controlled test solutions to determine the optimum prior to trying it on the "live" system.
This is the way I set up all my production systems and, barring catastrophic hardware failure (self-immolating disks and a router which just folded when its power supply burped) I've had perfect uptime.
(well, ok.. there was that one time, late at night, when I typed "reboot" in the wrong window.. but that happens...)
I have something in common with Stephen Hawking...
With all of the money we spend on this stuff, couldn't someone have written an exception handler for this? Haven't we learned our lessons in the past about unhandled exceptions?
The article states that they are working on one now. A bit late eh? Lucky indeed.
If you RTFA you will realize that I'm not lying in the least when I say that, effectively, they ran out of flash-based "disk" space! They forgot to delete old files when updating the programs in the flash memory (which is mounted like a filesystem, or hard disk), and the OS was failing because it wanted to use that space. So it rebooted, and still had insufficient disk space, and rebooted again . . . lather rinse repeat. There was no signal because it was stuck in a reboot loop because they ran out of disk. Wow.
They fixed it by telling it to boot without using the flash (safe mode :) ), then used low-level (direct access) flash utilities to remove the old files. Reboot, mount, disk check / corruption repair, voila it works again.
We have a big 1TB NetApp server where I work, and we have so much disk space that people get lazy and don't delete files or archive old projects. Then they get really confused when jobs fail, not thinking of disk space until they have checked everything else first. But it happens, and it's usually surprisingly hard to debug (they check a lot of other things first, sometimes even upgrading tool versions!). It's really kinda funny, in an expensive and mildly embarrassing way, that the Spirit had the same problem.
everything in moderation
"The outcome strikes me as an extremely Lucky Hack..."
The outcome does not strike me as a "Lucky Hack." They made the system flexible, that flexibility got them into some trouble, and it's also what got them out of it. Anyone else agree?
Yeah, that was HAL's excuse too.
Seriously, hats off to all the JPL programmers. Proving to the Martians that there is indeed intelligent life on Earth, very intelligent.
My pet peeve when I'm doing remote troubleshooting is 'ifconfig eth0 down'...oops. At least NASA is smarter than that.
Peter.
You know what I hate? Wait, what do you like? I hate that!
Well, considering the problem was that they had too many files on the flash, why would they want even more? They should have had more RAM, not flash.
MoFscker
They should just have ticked the "autoaccept and minimize" checkbox .
Don't you know it is now both immoral and criminal to think beyond the next quarterly report?
Another factor in this is the safety of the flash ram. It is rad-hardened and built with tons of extra error correction which again, requires years of testing and special design considerations. And is extremely expensive.
-
Your post is the only thing that strikes me as a "Lucky Hack" here. They included the ability in the design to remotely disable booting from flash and upload new boot images, in what way is that a "hack"? All this is just foresight in design to include as many possible recovery modes as they could.
Basically, they rebooted from a recovery image (sent via radio) and then proceeded to do low-level fixes on flash memory and then ran a chkdsk. If I do something similar via recovery disk or CD, I don't get a lot of people telling me that it was a "Lucky Hack" that I could boot off of CD!!!
"There is more worth loving than we have strength to love." - Brian Jay Stanley
I know next to nothing about programming, but I'm a fairly good armchair quarterback.
"Spirit attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception..."
For an agency that usually tries to think of everything, doesn't this seem like a stupid lack of planning? To not have any error handling to catch something that is trying to allocate more memory than what is available? From a layman's perspective, this seems like a rookie goof. Please correct me if I'm wrong.
"...just in case, the team is working on an exception-handler routine that will more gracefully recover from an allocation failure."
I think anything would be more graceful than 'totally puke and get stuck in a futile reboot cycle'.
Our tax dollars at work...
One lesson we can learn from the Spirit problems that really and truly does directly apply on earth:
Just in case of a worst case scenario, always make sure you have physical access to the machines.
Great article! This is just the sort of thing that has always impressed me about NASA and the JPL. Just when mere mortals might give it up and walk away, they figure out the problem. I can only imagine how wild the party must have been after they fixed Spirit; the scientists and engineers I've worked with in the past could really put away the booze.
Seriously though, the key lessons to take away from this are:
1) Gather all of the clues you can.
2) Take those clues and build a model.
With luck and care, the model should get you closer to what may have gone wrong. And in this case it apparently did just that. Now that's geek cool!
BTW, I know that generally you want to prevent this sort of thing from happening. But in reality most software ships with bugs and launch windows to Mars are non-negotiable.
To the making of books there is no end, so let's get started
There also needs to be a way to load bootstrap code remotely. For instance, having a TCP/IP enabled BIOS be able to run TFTP or some other protocol to load a netboot floppy image. Then you could give it a LILO command instructing it where to find a boot image, preferably one on a server in the same hosting center.
Operating System not found. Press any key to continue.
Damn! Left the floppy in!
What surprises me is that they don't have a 'twin' of the rover's computer system set up on earth. When commands are run on the rover, the same commands could be run on the computer system on earth. Then, if the rover's software fails (as it did), the software on earth would (theoretically) fail in a similar way, and be MUCH easier to debug. Of course, the systems wouldn't be identical (without building an entire duplicate and expensive rover), and the data gathered wouldn't be identical, but the twin could be carefully planned and fed dummy data that approximately mirrored the data the rover was gathering. For example, the twin could be fed dummy pictures about as often as the rover took a real picture.
From the article: "[The] transmission that uploaded the utility was a partial failure: Only one of the utility program's two parts was received successfully. The second part was not received, and so in accordance with the communications protocol it was scheduled for retransmission on sol 19." NASA could have simulated a half-failed transfer on the twin computer on earth, and then watched carefully using traditional debugging tools to make sure the failed transmission didn't cause a software failure (which it did).
Again, from the article: "The data management team's calculations had not made any provision for leftover directories from a previous load still sitting in the flash file system." However, if they had a twin computer system to watch, they would have seen the failure occur on earth as it did in space. Debugging a system you can hook a serial debugger to is bound to be much easier than debugging a system a million miles away.
Stupid like a fox!
It's not even that they forgot to delete old files. The program they sent to delete the old files failed to upload correctly, and they ran out of space before they could retransmit the delete program.
To me, if this were a Unix-like system, it sounds like they ran out of inodes. Running out of inodes is very different than running out of disk space.
If you think running out of disk space can be hard to troubleshoot, try running out of inodes.
Regardless of your personal experience, it is Ford's habit to replace reliable vehicles with unreliable ones. The classical example of this is the Festiva. Those little things just went and went, got excellent reviews in Consumer Reports, and really upset a few Ford corporate executives.
They replaced the vehicle with the Aspire, which Ford dealership auto mechanics quickly nicknamed the "expire" due to its regular need for maintenance. They still sold quite a number of them due to the reputation of the previous vehicle.
Wake up - the future is arriving faster than you think.
Wrong wrong wrong, as I'm sure someone else will post. He spins a good yarn but he's just a machine room flunky and hasn't RTFA himself.
One simple rule for its versus it's
I can't believe that this is the state of the art at NASA - no wonder Shuttles fall from the sky.
I had pretty much the same post - the originator of the story confuses luck with skill, a mistake I find very annoying and committed all too frequently. I'll fully admit when I've been lucky, but I also want recognition for foresight when I've had some! NASA deserves at least that much respect.
Duh. That's what they have been keeping a secret. They have a DB9 serial link strung from here to the landing site. It's not as cool as you all make it out to be.
First wxWindows, now Vx-works?
Before doing something risky, type this:
sleep 600 && reboot &
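Now if your risky maneuver makes the ssh session unusable, just wait 10 minutes (the 600 seconds given to sleep) for the machine to reboot. If everything went fine, kill the background job before it fires.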
This is great for fiddling with firewalls by remote control... through the firewall.
Oh... You say you're not using a POSIX-like system? That's not supported. Sorry.
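For what it's worth, the same dead-man's-switch idea can be sketched outside the shell. This is only an illustration in Python: the class name `DeadMansSwitch` and the `action` callback are made up here, and on a real box the action would be whatever puts the machine back into a reachable state (a reboot, restoring the old firewall rules, and so on).

```python
import threading


class DeadMansSwitch:
    """Run a recovery action after a delay unless disarmed first.

    Hypothetical helper for illustration; 'action' is any callable,
    e.g. one that restores a known-good configuration.
    """

    def __init__(self, delay_seconds, action):
        self._timer = threading.Timer(delay_seconds, action)
        # Daemon thread: do not keep the process alive just for the timer.
        self._timer.daemon = True

    def arm(self):
        """Start the countdown before doing the risky change."""
        self._timer.start()

    def disarm(self):
        """Cancel the countdown once the change is known to be safe."""
        self._timer.cancel()
```

Arm it before the risky change and disarm it only once you have confirmed you can still get in. If you lock yourself out, you never reach the disarm call and the recovery action runs by itself.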
They could have set it up out in my backyard to take pictures of the piles of crap and rocks out there, and if they wanted to simulate the solar radiation, they could have my girlfriend give it one of her famous looks... cause those are lethal enough to burn a hole in your soul.
-SF
That must have been some feat to get the arm on the rover to press Ctrl, Alt and Delete at the same time!
What really surprises me is that NASA did not verify the software. Software verification is essentially mathematically proving the software correct. It is tedious and expensive, but we are talking about NASA and Mars. In fact, even beloved MS formally verifies device drivers before use (believe it or not!!). If the original program had been correct, they wouldn't have had to reupload it, and the entire problem would be gone.
Today we salute YOU, Mr. Super Wizard Windows Reinstaller.
Only YOU can fully appreciate the difficulty of running a format c: command, while swilling a room temperature can of Red Bull.
"Hey this stuff is hard now!"
While NASA is too preoccupied with things like faraway rovers, you take your vocational tech school fueled arrogance directly to the place where it will make the absolute least possible impact: A Slashdot discussion thread.
"Loggin' on now!"
Your unique eye for obviousness allows you to sling turds of obtuseness every which way, and then brag about how you were RIGHT as soon as one of your pronouncements hit true - regardless of how many times you were wrong before.
"See I told you sooooooo!!"
And if some idiot rocket scientist has the unmitigated gall to not bow down to your obvious Geniusdom, you unleash your fury down upon him with all the tenacity and mercilessness of a rabid pit bull with a tender buttock locked in its jaws.
"Total anonymity!"
So keep clicking away, oh Marauder of the Mousepad. Because when the results you so desire finally come about years from now, you can say it was because YOU demanded it.
"How come they haven't fired that dumbass head of NASA yet?"
(Bud Light Beer, Anheuser Busch, St. Louis Missouri.)
Using the low-level commands, about a thousand files and their directories -- the leftovers from the initial launch load -- were removed.
I think that means they deleted the useless stuff they wanted to delete anyways but didn't get to delete before the crash. I also remember news about science data from before the crash that was received after they got the rover working again.
As for how critical it is, well yeah, it seems the rover didn't need the contents of the flash file system. The operating system and other software was in the same flash memory but I assume that any sane designer would put in some hardware write protect interlock that's not easy to defeat accidentally.
You realize that missions to Mars can only be launched once every two years, right? If they miss their launch window, they've got to wait two years before they can launch again.
You also realize that NASA did do a test mission, right? They built a test rover and put it out in a desert somewhere. They used the mission to test the hardware, test the software, and to help train the team.
MEGABYTES. You mean MEGABYTES. Don't use that revisionist MiB crap - 1024^2 is MEGA in computers and always has been. Don't let storage manufacturers redefine the language for their own gain!
Disclaimer: IANAL. This post is, however, legal advice, and creates an attorney-client relationship.
You realize that the onboard computer is basically the same one as used on the Mars Pathfinder lander, right? Same CPU, same amount of RAM, even the same OS. I wouldn't be surprised if they used the same (or similar) circuit diagrams for certain things.
The point is to use well known and well tested hardware. The whole point of Mars Pathfinder was to develop a system whose design could be re-used for other Mars landers and rovers.
Lastly, what exactly are you going to do with greater flash capacity? The point of having any flash memory on the rovers at all is not for long term storage, but rather just to hold onto data until it can be transmitted to Earth, after which it gets deleted.
Despite what some idiot posted a few posts up, they did NOT run out of room on the flash drive. Rather, the problem is more akin to running out of i-nodes. Mounting the flash filesystem, reading all its metadata and whatnot, took up more RAM than was allocated for it, due to the high number of files it had to deal with (most of which were accumulated on the way to Mars, and were going to be deleted).
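To make the "number of files, not size of files" point concrete, here is a toy model in Python. Every number and name in it is made up for illustration (`ENTRY_BYTES`, the RAM budget, the function names); none of it is taken from the article or from the real flight software. The only thing it is meant to show is that the cost of mounting scales with the file count.

```python
# Hypothetical per-file bookkeeping cost when the directory structure
# is rebuilt in RAM at mount time. Not a real figure from the rover.
ENTRY_BYTES = 512


def mount_ram_needed(file_count, entry_bytes=ENTRY_BYTES):
    """RAM needed to hold the in-memory directory for file_count files."""
    return file_count * entry_bytes


def can_mount(file_count, ram_budget_bytes, entry_bytes=ENTRY_BYTES):
    """True if the directory structure fits in the RAM set aside for it.

    Note that file sizes never appear here: a flash full of a few huge
    files mounts fine, while many tiny files can exceed the budget.
    """
    return mount_ram_needed(file_count, entry_bytes) <= ram_budget_bytes
```

With a made-up budget of 1,000,000 bytes, `can_mount(1000, 1_000_000)` holds while `can_mount(3000, 1_000_000)` does not, no matter how much free flash is left.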
is because when the batteries got drained the OS went into a stable "safe mode" state. If they had made a long-lasting power supply, this project would have been doomed and they would never have found out what the real problem was.
Actually, they used VxWorks because it was the same OS used for the lander on the Mars Pathfinder mission. Since they were using the same CPU and same basic computer design as the Mars Pathfinder lander, they probably figured, "Why not use the same OS?"
Of course they tested it! Something is not quite right about this story though. How could such a seemingly simple problem have been missed?
-- Exposing the hype of Gentoo zealots. Modded into the ground to suppress opinion.
Here's what happened according to the article. They launched the ship with an OS image in flash, and soon realized that they needed to update it. So shortly after launch they sent another complete OS image. They knew they'd have to delete the first image, but they didn't do it right away. At that point there was plenty of room in the flash memory so having two OS images was not a problem.
After a few days on Mars, they were starting to fill up the flash, so they planned to go ahead and delete the old launch OS image, its directories and files. This is a complicated process so they uploaded a special program to do it on Sol 15. And apparently they informed the rest of the team that the memory would be free and available after that point, so the rest of the team made plans to start filling it up with pictures.
However, the upload on sol 15 failed, and was rescheduled for sol 19. Now, here's the big mistake (which the article glosses over): They forgot to tell the rest of the team that all that memory wasn't going to be freed up as planned, not for a few more days. So instead, Spirit is moving around now, taking lots of pictures, storing them in flash, and all the people involved with that think they have plenty of room. Little do they know that they are running out of flash space. Finally, the morning of Sol 19, shortly before the memory cleaning program was going to be sent down, it happened. The flash memory was exhausted. This triggered a sequence of events which put the craft into a failure loop.
The big problem here, then, was the failure on the part of the group which was supposed to clean out the launch OS image to tell the rest of the team that it wasn't going to happen as scheduled, so the memory wasn't going to be available. It wasn't really Murphy's Law, but rather a failure to communicate among the team. This is an institutional problem which will hopefully be fixed.
There are lots of 91 escorts on the road. Pieces of junk? Yes, but they are still running.
The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
I wonder if the guy that put this article summary up read the article. They very clearly stated therein what the difficulty was and the unique confluence of events--specific to the rover's hardware and OS architecture--that led to the shutdown. What on earth (or on Mars) could we possibly take away from this experience that would lead to some ability to troubleshoot systems remotely?
I just don't see it...and I'm in computers for a living.
sev
but have you considered the following argument: shut up.
yer seems fishy to me
>>> I know next to nothing about programming
Absolutely correct.
It was the inability to build the RAM-based directory structure of the files in the Flash memory.
Why couldn't they build the directory structure? They had too many files, the size of the files doesn't matter here, only the number of files.
In other words, they ran out of RAM, not Flash.
Exercise left for the readers: Why can a Unix file system that is out of inodes have much less than 100% disk usage and still not be able to create a file?
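A hint for the exercise: blocks and inodes are counted separately, so one can hit 100% while the other is nowhere near it. A small Python sketch that reports both, assuming a Unix-like system where `os.statvfs` is available (the function names `usage_percent` and `filesystem_report` are just illustrative):

```python
import os


def usage_percent(total, free):
    """Percentage of a resource in use; 0.0 if the filesystem reports none."""
    if total == 0:
        return 0.0
    return 100.0 * (total - free) / total


def filesystem_report(path="/"):
    """Report block usage and inode usage for the filesystem holding path.

    Uses os.statvfs, so this only works on Unix-like systems.
    """
    st = os.statvfs(path)
    return {
        "block_use_pct": usage_percent(st.f_blocks, st.f_bfree),
        "inode_use_pct": usage_percent(st.f_files, st.f_ffree),
    }
```

A filesystem holding a huge number of tiny files can show a modest `block_use_pct` with `inode_use_pct` at 100, and at that point file creation fails even though space is left. This is the same information `df` and `df -i` print.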
Why do car interiors turn to shit after 5 years? Why do repair costs start to outweigh buying a new car? Why do car manufacturers offer buy backs?
The reason is simple. The entire car industry depends on people buying new cars every 5-10 years. That's why cars are only really made to last 5 years, and that's why warranties only cover a car for up to 100 thousand miles.
It's nothing short of a conspiracy.
> Actually, they used VxWorks because it was the same OS used for the lander on the Mars Pathfinder mission. Since they were using the same CPU and same basic computer design as the Mars Pathfinder lander, they probably figured, "Why not use the same OS?"
Actually they use VxWorks because WindRiver gives JPL major discounts...
Can I get an eye poke?
Dog House Forum
WindRiver may give JPL large discounts, but I doubt that's the only reason VxWorks is running on the MERs.
Years ago, when JPL was designing the Mars Pathfinder mission, they asked Wind River to do an "affordable" port of VxWorks to the RAD6000 (a radiation-hardened RS6000), and they agreed. Since the computers on the two MERs are very similar to the computer on the Mars Pathfinder lander, it makes sense that they'd use the same OS that they used on the MPF lander.
I would think the fact that JPL knows VxWorks very well by now would be a major factor in deciding to use VxWorks for the MERs.
> "We recognized early in the planning process that the flash file system had a limited capacity for files."
;)
Wow, geniuses. As opposed to the regular file systems we use here on Earth, which have unlimited "capacity" for files?
> "But there were also directories of files already placed into the file system in the launch load,"
More advanced high-tech speak.. NOT!
All joking aside, when they try to make it easy-to-understand for laymen (or politicians?), they make it sound retarded to us.
I think the problem was actually with the imperial-to-metric conversion functions, they're just covering it up to avoid further embarrassment
Must-not-watch TV!
I am currently considering buying a 1980 MkII Escort. 118,000 k's on the clock, and it still goes sweet as. One of my friends drives an '85 that has no problems either.
Making the moon less necessary since 1998.
The JPL is a pretty viral license. It forces you to spread their space probes from your planet to all your customer's planets. This is un-solar systematic! What's next? Calling GNUpiter Jupiter instead?
One word: outsourcing.
When I worked at JPL, every 6 months to a year there'd be talks of layoffs because the headcount was too high; people would leave and return to the same projects as contractors, then get a higher hourly wage for doing the same work with less accountability.
The whole reason for that lost probe (feet vs meters, anyone?) was because of a political squabble between two teams (one JPL-internal, one outside contractors as I recall) who simply failed to cooperate productively. The whole management structure inside that world is screwed. People's project leads are not the same as their section/department leads, so the reporting chain is a mes{h,s}. Time and energy is wasted in contract(or) management, all in the name of "reduced costs" even though having all the work done in-house would eliminate a full layer or two of mid-level management waste.
NASA/JPL are totally hamstrung by beancounters who think they're saving the public's money, but truly can't see the big picture, missing the forest for the trees. (Either that, or they *do* see the big picture, and are busily lining their own pockets with the excess that gets tossed around thru all the churn.)
-- *My* journal is more interesting than *yours*...
Could this have not been said more succinctly with a simple quote? Namely:
"What we have here, is failure to communicate."
Mal-2
How is the Riemann zeta function like Trump rallies? Both have an endless number of trivial zeros.
"The irony of it was that the operating system was doing exactly what we'd told it to do"
Funny, that's how it was explained to me by my computer science teacher my freshman year in high school. He said, "The problem with computers is that they do exactly what we tell them to."
I belong to the ______ generation.
"We discovered a system log in which the problem was documented,"
Those guys are running a very expensive experiment, are logging it and they have no idea what and where they are logging??
Slashdot: stuff for news, nerds that matter, matter for news, stuff that nerd
I did read the article, and my comments are completely accurate. Unfortunately you must not have made it to the 3rd paragraph, and neither did the mods that modded you up and me down.
The problem was discovered after launch. The first few fixes made the problem worse by stressing the filesystem even further.
It doesn't matter that they were trying to fix the problem. THAT WAS NOT MY POINT. The problem should have been identified and fixed before the craft was launched.
Yes, they may have taken "around" 100000 pictures. Does that mean they sequentially stored every picture in an actual rover file system? I get the impression they were only testing the cameras or the capture software, not the holistic system.
Did they first simulate filling the filesystem with files generated during the actual trip to mars? Apparently not, because the system would have failed if they had actually put the rover software through a launch to end of mission simulation here on earth when the software was developed.
Although dead now, it was alive at its 21st birthday.
A mighty fine piece of hardware. Might still have been running, had I not driven it into another car.
They couldn't even find the start button. Arnie had to turn it on for them many years later and it was only a single button. I don't think that the Martians could figure out three buttons...
From excellent karma to terrible karma with a single +5 funny post...
Uh, the parent post is correct and modded down. Then in this same thread there is a +5 post that is totally wrong. Hello?
Engineers are some of the worst programmers. This is especially true when it comes to testing. Ugh.
Who would not have tested the file system filling up?! I mean, it's not like you'll have easy access to the device if there's a problem. You need to test everything you can. Testing something as simple as filling up the file system is routine. Unfortunately this is typical of the software from many programmers, especially engineer types.
Seriously, from a developer viewpoint, that is all wrong.
I have worked on projects in which there was simply so much logging going on that you couldn't tell head from toe anymore. When a problem arose, scanning the logfiles proved very cumbersome indeed. Every developer had his own stuff logged, which sometimes proved interesting and sometimes proved utter crap (no one wants to know variable XYZ is increased by 1 for 24943 times).
You should develop a well-thought-out logging strategy that increases the logging verbosity on a problem basis, not simply log everything that happens and hope you get some useful information.
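One way to get "verbose only when there is a problem" is to keep recent debug records in a small in-memory ring and write them out only when an error shows up. A minimal Python sketch using the standard `logging` module; `RingBufferHandler`, `demo` and the logger name are made-up names for this example, and the capacity and trigger level are arbitrary.

```python
import collections
import logging


class RingBufferHandler(logging.Handler):
    """Hold the last few records in memory; dump them on a serious record."""

    def __init__(self, target, capacity=100, trigger_level=logging.ERROR):
        super().__init__(level=logging.DEBUG)
        self.target = target                # handler that really writes
        self.trigger_level = trigger_level
        self.buffer = collections.deque(maxlen=capacity)  # old records fall off

    def emit(self, record):
        self.buffer.append(record)
        if record.levelno >= self.trigger_level:
            # A problem occurred: flush the recent context, oldest first.
            for buffered in self.buffer:
                self.target.handle(buffered)
            self.buffer.clear()


def demo():
    """Log five debug lines and one error; return what actually got written."""
    lines = []

    class ListHandler(logging.Handler):
        def emit(self, record):
            lines.append(record.getMessage())

    logger = logging.getLogger("rover.demo")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()
    logger.addHandler(RingBufferHandler(ListHandler(), capacity=3))
    for i in range(5):
        logger.debug("step %d", i)
    logger.error("allocation failed")
    return lines
```

In the demo nothing is written while things go well; when the error arrives, only the error and the two debug lines just before it come out, because the ring holds three records.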
Not lost forever, but lost until we travel to Mars and retool it as an extraterrestrial barbeque grill.
"I'll say it again for the logic-impaired." -- Larry Wall.
...and I'm not saying that just because we agree. Yours are good additional insights (hence your "insightful" mods up! :-)
I agree with the reply-post below too, saying that if they'd made their system a bit more fault-tolerant, then the problem might have been more easily recovered from. Sixty reboots in a row in a day seems a little excessive! Don't they have counters to detect that very thing? Don't they have a failsafe/debug OS burned into ROM (not flash) to load automatically in just such an event? Such are the risks when you're reloading a whole new OS remotely!
However, maybe they do have such things, or equivalent. I don't think their method of recovery was "accidental" (or a hack) either, although I'm making assumptions and I haven't seen their spec. The key is that they recovered from the error... and I now assume that they have recovered completely.
What I found interesting was NASA's initial assessment that the flash ROM was failing -- a hardware failure. The media jumped all over that and reported it, so the rest of us were thinking, "Great, the rover is crippled and will never be the same. :-( "
Now, turns out it was just a software error. Where's the mainstream media now? ("EE Times" is hardly mainstream!) Can the rover's recovery now be considered a "complete recovery"?
If this story goes mainstream, will it make NASA look bad for screwing up... or look good for making a full recovery? I'm not sure. (Of course, smart people make mistakes too, lots of them, but the key to being smart is covering your ass beforehand! :-) )
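The reboot counter asked about above is simple enough to sketch. This is a toy simulation in Python, not a description of what the rover actually does: the threshold, the mode names and the `boot_succeeds` callback are all assumptions made up for the example. On real hardware the counter would have to live somewhere that survives a reset.

```python
MAX_CONSECUTIVE_FAILURES = 3  # hypothetical threshold, chosen arbitrarily


def choose_boot_mode(consecutive_failures, limit=MAX_CONSECUTIVE_FAILURES):
    """Fall back to a minimal 'safe' image after too many failed boots."""
    return "safe" if consecutive_failures >= limit else "normal"


def simulate_boots(boot_succeeds, attempts, limit=MAX_CONSECUTIVE_FAILURES):
    """Try booting up to 'attempts' times; return the modes that were tried.

    boot_succeeds(mode) is a stand-in for the real boot: it returns True
    if a boot in that mode comes up cleanly.
    """
    failures = 0
    history = []
    for _ in range(attempts):
        mode = choose_boot_mode(failures, limit)
        history.append(mode)
        if boot_succeeds(mode):
            break  # system is up; a real system would reset the counter here
        failures += 1
    return history
```

If the normal image always crashes and the safe image works, the sequence is three normal attempts followed by one safe boot, rather than sixty identical reboots.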
They were sloppy.
Just imagine the poor bastard who had to carry the tomsrtbt floppy all those millions of miles up there, and stick it in the floppy drive!
Well, we all have to do stuff like that every day, and God help us if we run out of Magic Wands!
You might not be able to recover from every possible thing that might ever go wrong, but there is no excuse for not checking for a file creation error. If there is only one error you check for, it should be that. And there is no excuse for dying after not being able to create a file either. You should simply report the error and return to idle.
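In Python terms, "check for the file creation error, report it, and return to idle" might be sketched like this. The function name `try_write` and the (ok, reason) return convention are my own assumptions, not anything from the rover's flight software.

```python
import errno
import os


def try_write(path, data):
    """Try to create a file; report failure to the caller instead of dying.

    Returns (True, "ok") on success, or (False, reason) if the file could
    not be created or written (full device, missing directory, ...).
    """
    try:
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes really reached storage
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False, "storage full"
        return False, exc.strerror or str(exc)
    return True, "ok"


def handle_request(path, data):
    """Caller stays alive either way and goes back to waiting for work."""
    ok, reason = try_write(path, data)
    if not ok:
        print(f"could not save {path}: {reason}")  # report, then return to idle
    return ok
```

Nothing clever here, which is rather the point: the failure path is a return value the caller has to look at, not an unhandled crash.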
> What on earth (or on Mars) could we possibly take away from this experience?
Rule 3: Never ignore the return value from open.
cat
error: No space left on device
Dude, take a look at jffs, it has built-in wear-levelling and is used extensively by those of us with a Zaurus, and probably hundreds more appliances as well.
I want to delete my account but Slashdot doesn't allow it.
"Repair costs for things like Cadallacs and BMW's are not cheap for TCO! "
Take another look at BMWs in CR. They're generally the top rated car in their class, and they have average to above average warranty.
Hondas are good; I've owned 7 or 8. But BMW's are better. Hondas you can drive for 15 years. BMW's... you WANT to drive for 15 years.
That's the difference.
I've owned my BMW for 3 years and it has been absolutely flawless. Good gas mileage, excellent design, it delivers a driving experience unmatched by *anything* on the planet.
Why you think BMW's are the poster child for bad design and manufacturing process is a mystery.
isn't an uncommon thing really. I can't find the article now, but I read something about Cassini (the probe to Saturn) a couple of years ago that said that the flight software for the orbital mission wasn't even written yet.
I wish that my computers contained a backup copy of the baseline OS somewhere. There have been a few times that I wished I could just flush them quickly. Yeah, I know. Quit using Windows, right? I will when I can get games I like on *nix.
"Well Ranger Brad, I'm a scientist. I don't believe in anything." - Dr. Roger Fleming
It's not that hard to pull off this sort of seemingly amazing remote recovery with pure off-the-shelf tech if you plan for it in advance and are willing to pay a modest premium.
You need remote serial console access -- ideally including firmware/bios serial console access -- and remote power cycling, controlled by a small embedded system, either in separate units (APC masterswitch, terminal servers) or as part of the system unit (common on Sun gear as "LOM"/"ALOM"/etc.; some of this is also creeping into x86 mobos). All this lets you regain control of the system remotely.
Then it becomes a matter of hardening the system to let you recover from various other insults. Never let go with both hands: Mirrored disks (protecting against hardware failure) and multiple bootable partitions (protecting against software or human error) can both be used; netbooting is also a nice capability to have when you've got a bunch of servers in the same place.
Disclaimer: I bet you can do much of the above with other people's gear, but I work for Sun and I know it works for me...
Welcome to the world of users-as-entropy-pool.
1. Get 900 million USD.
2. Hire a team of rocket scientists.
3. Build duplicates of all equipment, let QA team run tests for a couple of years.
Yeah, I learned something, thanks!
You can't handle the truth.
"Blue screen on Mars? That's original!"
Hah! It may have been a hardware problem, but the "bug" involved was... yes, it's absolutely true... a Martian relieving itself on one of the Spirit motherboards: reported here. Instead of gee-whiz technology, what they really need is anti-whiz -- or maybe just some kitty litter.
The en route time for Cassini to get to Saturn was 7 years; rather than push back an already long mission they launched with feature-incomplete code. They knew they had 7 years to get the software fully functional and debugged; they've updated it remotely from millions of miles away a number of times now.
I'm sure the rovers did the same thing... Develop the launch/cruise software before you launch (and of course try to get as much of the entry/landing code done as you can!), and then uplink the final code before it's needed. Therefore it doesn't surprise me one bit that the JPL engineer knew there were shortcomings in the launch software.
Hell, I develop BIOS for servers and we do it all the time. The BIOS image we give the hardware engineers for initial bringup is usually *way* short of features that will be there when it actually gets used by the customers!
--Rob
Wow, you're a whiney little bitch. Suck it up and move on, pee-wee.
It must suck to be wrong all the time and resort to childish comebacks like the one you just made.
The Mars Rover software team was lucky that their complete incompetence did not cost them the mission.
The government already funds way more medical research than it does NASA. Health care in the United States costs more than a trillion a year. If they can't deliver all of these miracles you promise with the money that they have, they can't do it at all.
Testing everything on the ground is silly because you cannot duplicate either Mars or Deep Space on earth. NASA didn't get lucky - they did it right by design. If you can reprogram the craft while it is in flight, and have a robust capability to do so, then that is way more useful than a Mars simulation on the ground.
This is my sig.
Apparently you read the article but missed the point entirely - their programmers and testers screwed up, plain and simple. They never anticipated this scenario and they were just lucky that it was remotely fixable. The programmers are not the 'heroes', but fortunate bumbling fools making a very junior QA mistake.
What is the optimal amount of RAM and flash to have? As much as possible is not the answer. The more you have, the more you increase your chances of getting corruption from radiation. Enough to do the job is probably the answer.
What I would like to see is redundant RAM and flash to avoid corruption. Something like a RAID disk array, but for RAM and flash.
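For what it's worth, the usual name for "RAID for RAM" is triple modular redundancy: keep three copies, take a majority vote on every read, and repair the odd one out. That is a technique I'm swapping in here, not something the article says the rover does. A toy Python sketch, with the bank layout and function names assumed for illustration:

```python
from collections import Counter


def vote(copies):
    """Return the value held by a strict majority of the copies."""
    value, votes = Counter(copies).most_common(1)[0]
    if votes * 2 <= len(copies):
        raise ValueError("no majority among redundant copies")
    return value


def read_and_scrub(banks, address):
    """Read one byte by majority vote and rewrite any bank that disagrees.

    'banks' is a list of bytearrays holding identical data (three of them
    for classic TMR). Scrubbing on read keeps single upsets from piling up.
    """
    good = vote([bank[address] for bank in banks])
    for bank in banks:
        if bank[address] != good:
            bank[address] = good
    return good


if __name__ == "__main__":
    banks = [bytearray(b"spirit") for _ in range(3)]
    banks[1][2] ^= 0x10                      # simulate a radiation bit flip
    print(chr(read_and_scrub(banks, 2)))     # majority still says 'i'
```

The catch is the same as with the parent's point about size: three copies means three times the cells exposed to radiation, so you are trading capacity and exposure for the ability to outvote a single corrupted copy.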
If I have seen further, it is because I have stood on the toes of giants.
It was the inability to build the RAM-based directory structure of the files in the Flash memory.
Fine, but the point remains that it was a major screwup on NASA's part. They never communicated the fact that the memory-clearing download had failed and therefore the flash was still full of many more files (from the original launch OS image) than the rest of the team thought. That's why the flash was allowed to fill up (in terms of file count or data, it's not important). The engineers are very careful not to exceed the limits of their flash file system, but they were misinformed about what those limits were, since they had not been told that the erasure download had failed.
It was not Murphy's Law, it was not a failure to simulate or verify or test the software. It was a simple internal communications failure, where the right hand didn't know what the left hand was doing. The article lets NASA off too lightly by not emphasizing this failure (which is ultimately management's fault for not making sure that everyone on the team is aware of all crucial data).
Space Communications Protocol Standards (SCPS)
http://www.scps.org/
http://www.scps.org/Documents/SCPSoverview.PDF
One simple rule for its versus it's
...as I suggested before, they wouldn't have to reboot at all. The rover never would have been lost! They have to stop using DOS for all their spacecraft!
"The fact that they filled up the flash memory with too many files that were accumulated during the cruise phase of the mission between earth and mars was something that they should have known would happen."
How easy it is to describe a test after a problem appears. Not so easy when there are a million different things you want to test and some untested script statement (i.e. cp $EARTH_FILES /tmp/blah) failed _silently_.
even worse to test and detect are periodic restarts due to some constant input
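The silent-failure half of this is at least cheap to guard against: make every script step report its exit status instead of carrying on regardless. A minimal Python sketch, where `run_step` is a hypothetical wrapper of my own and the commands are placeholders, not anything JPL runs:

```python
import subprocess
import sys


def run_step(argv):
    """Run one script step; return (ok, detail) rather than failing silently."""
    try:
        done = subprocess.run(argv, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        # Non-zero exit status: surface it along with whatever went to stderr.
        return False, f"exit status {exc.returncode}: {exc.stderr.strip()}"
    except OSError as exc:
        # The command could not even be started (missing binary, permissions).
        return False, f"could not start: {exc}"
    return True, done.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a copy step that fails; a real script would pass its own argv.
    ok, detail = run_step([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(ok, detail)
```

It does nothing for the periodic-restart case, which really is harder, but it turns "failed silently" into "failed loudly" for the price of one wrapper.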
Wow. That just set my mind spinning! Are there any pictures of what mars looks like at night? If not, how can we ask NASA to take some?
Doesn't VxWorks have a way to run an automatic scheduled task?
Why would this task have to be manually sent all the time?
the computer is online
i am not at it
what a waste of resources
Off the record, the main reason for JPL not transitioning away from VxWorks is budgetary. They do not have the budget to switch to something else, despite the engineers feeling that some other OSes would be superior for their needs. Think about how much time and money it would take to retool, retrain, recode and reverify using a new OS.
Incidentally, on the record, JPL doesn't use Wind River's toolchain (i.e. Diab etc.) for Spirit; they use compiler/debugger/etc. from Green Hills Software, Inc. Only the OS itself is Wind River. I don't happen to have any links to back that up ATM but that at least has been stated publicly.
anybody not factoring lifecycle costs into a car purchase is a dummy. But there are more variables than just the brand of automaker.
I've owned 2 BMWs, an Audi, a VW, a Ford Bronco II, a Toyota Celica, and a 1970 Oldsmobile Cutlass.
The most reliable of all those cars ?
The 1970 Oldsmobile. It did nothing. There was nothing to go wrong. Well, nothing until i blew up the 27 year old engine. Threw a rod through the block. Paid some guys $900 to do a motor swap. Reliability went to shit after that.. but once it was running it ran great (4bbl carb in the 13 year newer motor)
The Ford Bronco was the worst i think. It was a hodgepodge of hacked up stuff by the time I got it.
The Celica was rock solid but I let a friend borrow it and she reduced the clutch to literally nothing.. stranding her on an onramp. Probably not the car's fault, but it's a knock against its perceived reliability. I was able to drive the thing no problem (and very hard, i might add) so i don't know how she managed to put the car in an undrivable state in just a matter of hours.
I did the clutch replacement myself on that car. I am never again owning a transverse engined FWD japanese car. Working on them is pure bullshit.
I think your position that vehicle TCO should trump looks or other factors is silly. It's up to each person to decide what's important to them.
For instance, i'll likely never own a future toyota or honda product because with the exception of the MR2, Supra, S2000, and NSX, they make FWD economy appliance cars that are at best uninspiring to drive and at worst unsafe (inadequate brakes, woeful suspensions, inadequate acceleration on automatic 4 cylinder models, etc). The models i've excepted are rarities in their model range (and have higher maintenance costs than run of the mill honda/toyota vehicles)
If you were to make TCO the only factor in a car purchase, you'd ride the bus. Owning and insuring a car is an expensive proposition regardless of what you buy. And the cars you're talking about are nothing more than appliance transportation. When i get in a car, the drive is the point, not the destination. Driving in an uninspiring car is torture for me, so when i lived in seattle i'd often take the bus even though we had 3 cars. When you factor in the joy of driving, or unfortunately, the desire for worldly status, it becomes clear why people buy BMWs - they are subjectively and objectively more fun to drive than Camrys.
There are other concerns - i.e. honda and toyota do not offer station wagons at all, much less station wagons with manual gearboxes. VW does, so now we have a VW. Yes, it will probably end up being more expensive than a honda accord. It also fits the requirements; an accord does not.
BTW - my Audi and BMWs are from the 1980s.
My first BMW was a 1980 528 5 speed. I bought it with 220,000 miles on the original drivetrain, and drove it _Very_ hard and sold it with 240k miles on it. The second bmw is an 88 model with 111k miles on it (i bought it with 98k). I drive it extremely hard (it has a 6900 rpm redline that i am very familiar with).
The Audi is an 88 model that i just bought with 192k miles on it (now has 196k)
German engines are notoriously bulletproof and well made. The 88 BMW uses a detuned race engine of which fewer than 5000 examples exist in North America. Yeah, that one is expensive to maintain and work on, but it's also more motor than anything you can get from honda, even today (except for the NSX motor). And it's 15 years old (and a 25 year old design).
My opinions are my own, and do not necessarily represent those of my employer.
www.qnx.com - free beer demo
I suspect the reason you don't see QNX used more is that it isn't American, it's Canadian, but maybe that's just the canuck in me coming out. QNX is a really interesting OS, and no, I don't have any affiliations with the company.
(RT) Linux is a long way from true RTOS performance, or at least it was when I last looked at it.
..don't panic
Another interesting thing about the RA experiment is the way the error was found and fixed. Because the RA was written in Lisp, it had interactive debugging and loading features, and the race condition was diagnosed without having to stop the experiment, and patched without having to reload the whole system. The same project team member (Erann Gat) said it "proved invaluable in finding and fixing the problem."
In the great CONS chain of life, you can either be the CAR or be in the CDR.
Scary to think of all that hardware flying around up there run by a dilbert culture!
[Gentoo is hyped. Modded into the ground to suppress opinion]
# ping rover.nasa.gov
Request timed out
# ping rover.nasa.gov
Request timed out
# ping rover.nasa.gov
Request timed out
# ping rover.nasa.gov
PING rover.nasa.gov (192.168.1.143): 56 data bytes
64 bytes from 192.168.1.143: icmp_seq=0 ttl=50 time=6043.446 ms
"It's better to be a pirate then join the Navy"
are a fat dicklicking faggot. HUAGLUALAUGHLAUGH, HTH, HAND. Don't use so many caps.