Slashdot Mirror


Debugging The Spirit Rover

icebike writes "eeTimes has a story on how the Mars Rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack, and the rover could have just as likely been lost forever. Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?"

130 of 390 comments

  1. Oh, sure... by inertia187 · · Score: 5, Funny

    Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?

    As a former co-worker (hi, jwalker!) used to say when people tried to draw ridiculous analogies, "It's exactly like that...only different."

    --
    A programmer is a machine for converting coffee into code.
    1. Re:Oh, sure... by JWSmythe · · Score: 4, Insightful

      It sounded like the same type of questions non-technical bosses always ask about technical matters.

      "We're ordering this brand new hardware that you've never tested before. Can you guarantee it will never crash?"

      "Will this database server handle the load of our brand new project?" (without an accurate growth estimate)

      "A server 2000 miles away just went down. What happened?" (no ping, no nothing) Hmmm.. Power/NIC/CPU/CPU fan/hard disks?

      It really sounds like they did some decent advance planning on those probes, but from other stories I read, they were shooting for 90 days of reliability, which in itself was a hard one to do. What if it turns the antenna the wrong way and loses connectivity? What if it gets hit by lightning? What if it falls in a hole? (go Beagle!)

      Sure, relate this to your web server colocated somewhere you're not. Cross your fingers, hold your breath, and hope there aren't a few fatal systems failures, or a bit of human error. I've been responsible for a bit of that in the past, but at least my equipment wasn't a few million miles away.

      --
      Serious? Seriousness is well above my pay grade.
    2. Re:Oh, sure... by FrostedWheat · · Score: 3, Informative

      What if it turns the antenna the wrong way and loses connectivity? What if it gets hit by lightning? What if it falls in a hole? (go Beagle!)

      There is a low-gain omnidirectional antenna that can be used as backup. In fact, I think they use it most of the time for commands and just use the high-gain for data transfer back to Earth. Which makes sense, since they never need to send large amounts of data to the rover.

      No lightning has ever been detected on Mars. Though it's not impossible, it is very, very unlikely. No proper observations of the night side of Mars have been done, though, so they may just be missing it.

      And Opportunity did fall into a hole :)

    3. Re:Oh, sure... by sjames · · Score: 3, Informative

      Actually though, it's not too bad an analogy. While Earth-based servers aren't absolutely unreachable like Spirit, they are often remote, and there are expenses associated with visiting them in person.

      Various schemes now exist to help deal with that. Many boards have a small management processor (bmc, server management board, IPMI, whatever) that is used for remote diagnostics and reconfiguration when the main board won't even boot.

      Meanwhile, LinuxBIOS supports two complete BIOS images: one 'old reliable' that, once working, is never changed, and one that can be upgraded freely. Coupled with a watchdog card or timer, it's decently manageable in the field. That work is continuing.

      Meanwhile, IBM is pushing the 'blue button' that forces a software reload from an image partition.

      In that sense, the problem is strongly analogous. Most of us will not, however, encounter the exact problem that Spirit had, though some embedded device developers just might.
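
The two-image idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `Image`, `boots_ok`, and `select_boot_image` are hypothetical, and the retry loop stands in for a hardware watchdog reset; it is not LinuxBIOS code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Image:
    """A bootable firmware/OS image (hypothetical model)."""
    name: str
    # Stand-in for "the system came up and serviced the watchdog in time".
    boots_ok: Callable[[], bool]


def select_boot_image(primary: Image, fallback: Image,
                      max_attempts: int = 3) -> Optional[str]:
    """Try the upgradable image a few times, then the known-good one."""
    for _ in range(max_attempts):
        if primary.boots_ok():
            return primary.name
        # On real hardware the watchdog would expire here and reset the board.
    if fallback.boots_ok():
        return fallback.name
    # Both images failed: only out-of-band help (e.g. a management
    # processor) can recover the machine now.
    return None


broken = Image("upgradable", lambda: False)
golden = Image("old-reliable", lambda: True)
print(select_boot_image(broken, golden))
```

The point of the design is that the fallback image is never written to in the field, so a bad upgrade can cost a few failed boots but not the machine.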

  2. Local Debugging by webmaestro · · Score: 3, Funny

    Man, I have a hard enough time debugging programs running on my local machine.

    1. Re:Local Debugging by srichand · · Score: 5, Funny

      In other news stories, the Microsoft Corporation decided to sue NASA, apparently since the right to crash systems was only theirs. Not to be left behind, SCO insisted that the code that caused the failure was unethically copied from their source repositories. This has indeed caused a flutter in the space communities.

  3. I dont know about learning much.... by detritus` · · Score: 4, Funny

    I don't think I want to learn too much from this, as the solution was the equivalent of rm -rf... On a side note, I wonder when the 40 min ssh delay jokes will begin again.

  4. well by whackco · · Score: 4, Funny

    at least it wasn't a blue screen?

  5. Like this? by The+Human+Cow · · Score: 4, Funny

    man rover?

    --
    The Human Cow - bringing you scrumtrelescence since 1995
  6. Remote debugging? by Nimloth · · Score: 4, Funny

    I don't get it, couldn't NASA afford the on-site warranty?

    1. Re:Remote debugging? by kfg · · Score: 5, Funny

      Yeah, but they thought they could save a few bucks and got the Gateway consumer version.

      "Oh, you've got the on-site warranty, huh? Ok, first thing you have to do is ship it to South Dakota. . ."

      Oh, hey, looks just like Mars.

      KFG

    2. Re:Remote debugging? by operagost · · Score: 2, Funny

      When you get the on-site warranty, make sure they tell you WHICH site!

      --

      Gamingmuseum.com: Give your 3D accelerator a rest.
  7. lots of mem of an embedded system by millette · · Score: 4, Funny

    Wow, I didn't expect the rover had 128MiB of RAM, or 256MiB of flash. Funny to think they had to run chkdsk from so far away :)

    1. Re:lots of mem of an embedded system by You're+All+Wrong · · Score: 3, Informative

      VxWorks

      A highly respected embedded OS.

      YAW.

      --
      Your head of state is a corrupt weasel, I hope you're happy.
  8. Space Technology by superpulpsicle · · Score: 5, Insightful

    That's the thing that amazes me. Any technology having to do with space seems that much more advanced.

    Here on earth we can't even build cars that require no maintenance and last more than 10 years.

    1. Re:Space Technology by Anonymous Coward · · Score: 2, Insightful

      Yeah offer to pay $800 million for a custom built car, and you can bet it will last 90 days too.

    2. Re:Space Technology by beeplet · · Score: 5, Insightful

      Actually any technology making it into space is more likely to be 10 years out of date... Getting anything certified for space is a long process. The technology in space isn't more advanced, just much better documented and well-understood.

    3. Re:Space Technology by Billly+Gates · · Score: 4, Insightful

      The Japanese started that.

      They make a lot of money from loyal customers. But I admit my 13 year old 91 Honda Civic with 140k miles is getting on my nerves with repair costs. Would a 91 Ford Escort still be running today? I think not.

      I will buy only Toyatas and Honda's for that reason.

      It amazes me that consumers are too stupid to read Consumer Reports and buy cars on looks. Repair costs for things like Cadillacs and BMWs are not cheap for TCO! Yes, consumer products have TCO too, and we, not just businesses, should look at that as well.

    4. Re:Space Technology by kfg · · Score: 5, Insightful

      Ten years out of date, but ten years more reliable for the effort.

      Sort of like Debian.

      Cutting edge ain't always what it's cracked up to be.

      KFG

    5. Re:Space Technology by kfg · · Score: 4, Interesting

      No. You can't make a mechanical device like a car that requires no maintenance. Bearings wear out. Hoses and belts have a limited lifespan even if you never drive the car, etc. This is the real world. We will obey the laws of thermodynamics. Entropy always wins.

      What you can do is make it require less maintenance, make that maintenance cheaper to perform, and make the car last until you hit something really hard so long as you maintain it. You should be able to hand your car down to your kids.

      Other than that you're bang on though.

      I wonder what we can learn from that about maintaining our computers?

      KFG

    6. Re:Space Technology by alwaystheretrading · · Score: 3, Funny
      Here's an example of the Mars Rover's 10 year old networking technology:

      Ring, Ring, Ring....
      "Welcome to the Mars Rover answering system. For English press 1, Para Espanol prensa 2"
      BEEP

      "You selected English. To leave a message for Spirit press 1. To leave a message for Opportunity press 2"
      BEEP

      "You selected Spirit. Transferring now." CLICK "I'm sorry, Spirit is unavailable at this time. To leave a message press 1. To return to the main menu press 2"
      BEEP

      "Hi this is the Spirit rover. I can't come to the phone right now but if you'll leave a message I'll get back to you." BEEEEEP
      "Spirit, this is NASA. Please phone home when you get a chance. I think your fax machine has jammed and we need you to re-send. Thanks, bye"

    7. Re:Space Technology by Zakabog · · Score: 2, Informative

      I have a 90 Ford Mustang, you insensitive clod. Still runs strong today, has like 107,000 miles on it and I'm sure it'd destroy your Civic in a race ;-P. The only money I've really been spending is on a tune up, and new tires (old tires were crappy and leaking air.) And besides when someone buys a Cadillac or BMW (and god damn it it's Toyota, what the hell is Toyata) they don't care about the price. When you're going to spend $30,000 on a "cheap" BMW 3 series you're not gonna care that it's going to cost you x amount more than a cheap Japanese car.

      Cadillacs I don't really know too well, but I know a BMW doesn't need a whole lot of repairs. Most German cars are VERY well built. Much better than Japanese cars too. And what good's a car that'll last you forever if you don't like the piece of shit in the first place. I just bought a new car (My Mustang's in NY, my sister drives it now, my grandmother didn't like the idea of me driving across country in a 1990 Mustang with 300+ rwhp, on such long straight roads, top speeds 145 btw), I could have gone with a VERY cheap Honda Civic, it would probably last me most of my adult life but why would I want such a piece...? I bought a fully loaded 2004 Nissan Sentra SE-R SpecV, it's a quick car, with low insurance and great looks. I wouldn't have bought anything less, I didn't look into TCO at all, it didn't really matter to me. I don't want a car that'll last me forever if I don't like it. And most people let the dealer pick out the car they want, they don't really realize it but they don't care, the average person wants to get from point A to point B, and the salesman is gonna try to sell them a car that costs a lot of money, not caring about the life of the car.

    8. Re:Space Technology by DerekLyons · · Score: 3, Informative
      That's the thing that amazes me. Any technology having to do with space seems that much more advanced.

      Here on earth we can't even build cars that require no maintenance and last more than 10 years.
      Most of the stuff in space that lasts ten years usually has no moving parts, which is what generates much of the maintenance requirements on your car. Nor does it have parts to get fouled, corroded, or otherwise mucked up by the environment of or the operation of the car.

      And frankly, if your car isn't lasting ten years, then you bought junk in the first place. Of the four cars I've owned, not one has had a lifetime of less than ten years. Three of them were already older than that when they came to me, and none lasted me less than four years. (Other than the one that got repossessed, but I had that one three years.) But then I invest in regular maintenance, don't leadfoot, etc...
  9. do they use SSH ? by Anonymous Coward · · Score: 5, Funny

    I hope they use SSH or something... who's to say that on a future mission some hax0r doesn't grab control of a space probe and have it send goatse.cx pics back??

    All it takes is a transmitter out in the middle of nowhere in Africa or on some island... after all, the probe communicates using known frequencies. There may be problems picking up the return signal without an expensive antenna, I suppose. But then again maybe some hax0r can build one cheaply and/or do what Captain Midnight did ( www.signaltonoise.net/library/captmidn.htm ).

    I wouldn't worry about signal jamming though, as that will probably be discovered easily.

    1. Re:do they use SSH ? by mcbridematt · · Score: 5, Insightful

      I don't think they would bother using anything to do with TCP. Anything you do send you will have to wait 9 minutes for. Just imagine the ping times:

      Pinging mars-rover with 32 bytes of data:
      request timed out
      request timed out
      request timed out
      64 bytes from mars-rover: icmp_seq=0 ttl=64 time=32400ms :(

      If it has anything to do with current internet protocols, it would be UDP.

    2. Re:do they use SSH ? by AhBeeDoi · · Score: 4, Funny

      ttl=64
      I realize that Mars is a long way away, but how many routers do you think exist between here and there?

    3. Re:do they use SSH ? by Anonymous Coward · · Score: 4, Insightful

      UDP would be even worse. Interplanetary transmission is difficult, so some packet loss is likely. Under UDP the packets would just disappear; it's an unreliable protocol. TCP would of course be too inefficient. I'd expect them to use a custom protocol designed for the specific application, since their situation is totally unlike anything you'll face on Earth.

    4. Re:do they use SSH ? by Phil+Karn · · Score: 2, Informative
      As challenging as the links are, they are very well modeled; the signal-to-noise ratio can usually be accurately predicted to a fraction of a dB. This allows the telecom team to confidently schedule downlink sessions at the highest data rate that the link can handle without a significant risk of data loss.

      Because very strong forward error correction coding is used, the link tends to be "brittle"; as long as you stay just under the maximum allowable data rate, it will work perfectly. So a lot of work goes into making those accurate link predictions.

      But data can still be lost if the signal-to-noise ratio takes an unexpected dip. The most likely cause is rain at the earth station site, as the weather is not as easily predicted and water is a strong absorber of X-band radio energy. Most of the DSN sites are in deserts for just this reason. But even if data is lost, it can be retransmitted later as it is stored on the rover until explicitly deleted.
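
The last point above, that data stays on the rover until it is explicitly deleted, is what makes retransmission after a rain fade possible. A minimal sketch of that policy follows; the class `DownlinkStore` and its method names are hypothetical and do not describe the actual flight software.

```python
class DownlinkStore:
    """Keeps every data product until the ground explicitly deletes it."""

    def __init__(self) -> None:
        self._products: dict[str, bytes] = {}

    def record(self, product_id: str, payload: bytes) -> None:
        self._products[product_id] = payload

    def transmit(self, product_id: str) -> bytes:
        # Sending does NOT remove the product: the link may have dropped
        # it, and the sender cannot know that for many minutes.
        return self._products[product_id]

    def delete(self, product_id: str) -> None:
        # Only called on an explicit command, after the ground has
        # confirmed a clean copy.
        del self._products[product_id]

    def pending(self) -> list[str]:
        return sorted(self._products)


store = DownlinkStore()
store.record("pancam-0001", b"image bytes")
store.transmit("pancam-0001")   # first pass, possibly lost to weather
store.transmit("pancam-0001")   # retransmission on a later pass
store.delete("pancam-0001")     # ground confirms receipt
print(store.pending())
```

The trade-off is storage: nothing is freed automatically, so someone has to keep issuing the deletes.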

  10. Pissed Martians by Tablizer · · Score: 5, Funny

    The Martians are pissed that the repair labor was outsourced to Earth.

  11. What's the big deal?? by prakslash · · Score: 4, Insightful
    Unless you are a lay person, I don't understand what the big deal is.

    If it was the hardware that got fried and they miraculously fixed that, I would understand but this was just a software glitch.

    I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.

    As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.

    1. Re:What's the big deal?? by Gizzmonic · · Score: 5, Funny

      I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.

      As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.


      You are too humble, friend. What you do routinely and without thinking, is nothing less than a miracle of modern science. A miracle that you take part in every day. And because of men like you, we don't have to rely on the abacus anymore. We sent a pentium to the Moon, and soon, Mars will be colonized by G5s. America salutes you, for all the things that you do.....

      Like a rock! I was strong as I could be!

      Ooooooohh! Like a rock!

      --
      (-1, Raw and Uncut is the only way to read)
    2. Re:What's the big deal?? by dellis78741 · · Score: 4, Insightful

      The tricky part here was that the 'hardware connectivity' depended on 'software functionality'. Try maintaining a machine a block away if the communication link requires both ends to point a satellite dish at an orbiting satellite and that pointing relies on software functioning correctly.

      --
      ======= ~\_/~\_O Burmese
    3. Re:What's the big deal?? by FTL · · Score: 4, Insightful
      I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.

      As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.

      There are some fundamental differences, my friend:

      • If you screw up leaving the computer unbootable, you get local tech support to check the console and fix it. NASA on the other hand doesn't have tech support on Mars.
      • If you hose the server, it means a day's worth of reinstallation. If NASA hoses their rover, they just lost $300,000,000.
      • You can poke around the system and see what's wrong. NASA has a harder time since their lag time is 20 minutes.
      • You can download core dumps. NASA were operating on the low-bandwidth antenna, which meant looking at file sizes, time stamps, and selected lines, but not file contents.
      • You have your boss breathing down your neck (hoping for success), NASA have the international media breathing down their necks (hoping for a disaster).
      --
      Slashdot monitor for your Mozilla sidebar or Active Desktop.
    4. Re:What's the big deal?? by updog · · Score: 4, Insightful
      There is a big difference between this, and your example of forcing a controlled reboot of your remote machines.

      Spirit was in a constant reboot cycle, and the fact that they could even communicate with it long enough to bypass the problem was an accomplishment (and lucky).

      It would be more similar to your remote data-center machine suddenly going offline and you have no idea why, and you are unable to ssh to it, and you fix it by running through potential scenarios and finding that the problem could have been due to mounting a certain partition, then discovering that there's an exploit in ICMP that allows you to hack the kernel so it doesn't mount that partition.

    5. Re:What's the big deal?? by amRadioHed · · Score: 4, Insightful

      Are you forgetting that the latency when communicating with Mars averages around 1200000 ms? I'd say that when you have to wait 20 minutes to see the result of anything you do, you're going to have to substantially change your debugging strategy.
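
For scale, the delay follows directly from distance divided by the speed of light. The sketch below uses rounded, approximate figures for the closest and farthest Earth-Mars separations (about 55 and 400 million km); those two distances are assumptions chosen for illustration, not values from the article.

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum


def one_way_delay_s(distance_km: float) -> float:
    """Signal travel time in seconds for a given distance."""
    return distance_km / C_KM_PER_S


def round_trip_delay_min(distance_km: float) -> float:
    """Time in minutes from sending a command to seeing its result."""
    return 2 * one_way_delay_s(distance_km) / 60


# Approximate extremes of the Earth-Mars distance (assumed, rounded).
for label, km in [("closest", 55e6), ("farthest", 400e6)]:
    print(f"{label}: one way {one_way_delay_s(km) / 60:.1f} min, "
          f"round trip {round_trip_delay_min(km):.1f} min")
```

Whatever the exact figure on a given day, every "type a command, look at the result" cycle costs minutes rather than milliseconds, before counting the wait for a scheduled communication window.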

      --
      We hope your rules and wisdom choke you / Now we are one in everlasting peace
    6. Re:What's the big deal?? by afidel · · Score: 5, Interesting

      Actually I remember NASA doing a hardware repair from most of the way across the solar system. One of the deep space probes was starting to have a problem sending signals. Some bright mind at NASA looked at the circuit diagram and figured out that a single component (resistor, cap, can't remember) was starting to fail, and that there was a way to recondition the part. So they came up with a program that intentionally overstressed that component path, and the extra energy heated up the part and reconditioned it so that the unit was back to working condition.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    7. Re:What's the big deal?? by NymblZ · · Score: 3, Insightful

      As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.

      That's just it - consider the stress those rovers are enduring or might encounter: subzero temperatures down to -200F, out-of-the-blue (red?) sandstorms, gamma radiation, and who knows what else out there that could suddenly fsck with the systems or scramble internal data? Your average Dell rack will never have to deal with any of those things.

      --
      -- NymblZ
      Ignorance is a sty in the mind's eye
    8. Re:What's the big deal?? by cookiepus · · Score: 4, Insightful

      I'd say that when you have to wait 20 minutes to see the result of anything you do you're going to have to substantially change your debugging strategy.

      Please! Back in the day people would write programs on paper, mail them in an envelope to a computing center somewhere, and get results weeks later.

      THAT was pressure not to fuck up.

    9. Re:What's the big deal?? by jelle · · Score: 3, Insightful

      But at NASA, you have a local replica of the whole system sitting in the lab next door, you're in a team of professionals that if necessary can calculate the most probable results of particular radiation hitting your system under a given angle, or can tell you the power usage and temperature effect of the system components given a particular subroutine, or can dream low-level correct assembly for the platform under study, plus the vendor has a couple of on-line support guys sitting in chairs in the corner of your office waiting for your activation command (which is the word "huh?")...

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
    10. Re:What's the big deal?? by Matrix9180 · · Score: 4, Insightful

      Did you RTFA? The rover was rebooting over and over because it was using up all of its memory... then eventually the batteries were low so it went into a sort of 'safe mode' where only the absolute minimum was loaded, and that's when NASA was able to communicate with it again...

      It was nothing like what you described, just a VERY well designed system (though it would have been somewhat better had the system been able to go straight to "safe mode" after the initial critical error (running out of memory))

      Did the people with mod points RTFA? Score 5 Insightful?

      And no, I'm not new to /. ;)

      --
      120chars for a sig is teh suck
    11. Re:What's the big deal?? by You're+All+Wrong · · Score: 3, Interesting

      There was also pressure not to drop your stack of punched cards in those days!

      (hint - draw a diagonal line across their top edges so you can get them in order again quickly.)

      Some people seem not to know why "batch" files were so-called, it seems.

      YAW.

      --
      Your head of state is a corrupt weasel, I hope you're happy.
    12. Re:What's the big deal?? by Lobo_Louie · · Score: 3, Funny

      We salute you and Bud Light salutes you, Mr. Three Finger Remote Computer Rebooter. >You are too humble, friend. What you do routinely and without thinking, is nothing less than a miracle of modern science.

    13. Re:What's the big deal?? by YetAnotherDave · · Score: 2

      >> then eventually the batteries were low so it went into a sort of 'safe mode'

      No, what I read in TFA was that they interrupted the boot cycle, and told it to start without mounting the flash filesystem. Pretty normal practice for dealing with a sick filesystem under vxWorks.

      What's impressive is that they can maintain a stable enough network connection to another fucking planet to do this. I routinely do it with systems running vxWorks on _this_ planet, but even then keeping a reliable connection is the tricky part...
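
The recovery described above, starting up without mounting the suspect flash filesystem, can be modeled as a simple fallback. Everything here is a hypothetical sketch: `boot`, `boot_with_recovery`, and the reset counter are illustrative names, and per the comment the real skip was commanded from the ground rather than triggered automatically.

```python
def boot(mount_flash: bool, flash_is_healthy: bool) -> str:
    """Model one boot attempt; raises if mounting exhausts memory."""
    if mount_flash:
        if not flash_is_healthy:
            # Stand-in for the mount consuming all available RAM.
            raise MemoryError("out of memory while mounting flash")
        return "normal"
    # Flash left unmounted: fewer features, but the system stays up
    # long enough to be diagnosed.
    return "degraded"


def boot_with_recovery(flash_is_healthy: bool, max_resets: int = 3) -> str:
    """Retry normal boots, then fall back to booting without flash."""
    resets = 0
    while resets < max_resets:
        try:
            return boot(mount_flash=True, flash_is_healthy=flash_is_healthy)
        except MemoryError:
            resets += 1  # the system resets and tries again
    return boot(mount_flash=False, flash_is_healthy=flash_is_healthy)


print(boot_with_recovery(flash_is_healthy=False))
```

Counting resets and degrading automatically is one way to get the behaviour an earlier commenter wished for, going straight to a safe mode after repeated failures.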

    14. Re:What's the big deal?? by DerekLyons · · Score: 2, Insightful
      But at NASA, you have a local replica of the whole system sitting in the lab next door.
      A lab that resembles the rover on Mars less than you might think. You see, the lab is rebooted frequently, and equally frequently has its configuration reset to test one thing or another. The rover on Mars has been running for months. (This is the difference that led to the problems they are currently debugging.)
      you're in a team of professionals that if necessary can calculate the most probable results of particular radiation hitting your system under a given angle, or can tell you the power usage and temperature effect of the system components given a particular subroutine,
      Of course none of the guys have access to or experience with a system that has been exposed to severe environments, operated for months at a time, etc. *Except* for the two irreplaceable examples sitting on the Martian surface. Nor are they *entirely* certain of the effects of those exposures and changes, *except* by examining the two irreplaceable examples sitting on the Martian surface.

      Certainly the JPL/NASA guys are smart, experienced with other probes, and have massive resources backing them up. But they also have some heavy odds against them.
    15. Re:What's the big deal?? by rossumtech · · Score: 3, Informative

      Here's a link to the NASA press release describing all the details to that fix of the Galileo orbiter. I remembered it because I sometimes work at JPL and walked into a lab where a JPL-er was packing up what looked like a home-brew old time reel-to-reel tape player. It turned out that it was the sister device to the Galileo flight system and the guy I was talking to was one of the brains who had figured out the fix! JPL press release

    16. Re:What's the big deal?? by Phil+Karn · · Score: 2, Informative
      An earlier example was Voyager 2. This spectacularly successful mission almost didn't make it even to Jupiter. Its primary command receiver failed, and the AFC (automatic frequency control) in its backup also failed. That meant the receiver was listening only to a single frequency with almost no tolerance for error. And the precise frequency was a function of component drift, which was in turn mainly a function of receiver temperature.

      The failed components never recovered, but JPL was able to work around it. They constructed an elaborate thermal model of the spacecraft to predict the precise temperature (and therefore the operating frequency) of the command receiver. Everything but the kitchen sink went into this model: the effect of attitude on solar heating, the self-generated heat from the electronics, the effect of turning various instruments on or off, the time lags due to structural heat capacities, everything. And it has worked fine ever since.

      JPL doesn't get nearly the credit they deserve for their track record in rescuing missions from seemingly fatal failures like these. There's still a pervasive public myth (sustained by the human space flight side of NASA) that only humans in space can fix things when they break. But they seriously overestimate the astronauts' abilities, and they greatly underestimate what a bunch of really smart people can often do from the ground.

  12. Uh-oh by z0ink · · Score: 5, Funny
    "We recognized early in the planning process that the flash file system had a limited capacity for files."

    Sounds like NASA forgot to empty the rover's recycle bin. =)
    --
    Steal This Sig
    1. Re:Uh-oh by LnxAddct · · Score: 3, Funny

      I've thought long and hard on this topic, and yes, on Windows it is accurately called the recycle bin because you don't get rid of the junk you put in there; it gets reused in some other part of your system. You put junk in, the junk is modified into other junk and then sent back to create new system DLLs. In Linux (and I believe Macs) it is accurately called the trash can because what we put in there is thrown out for good; we don't have our junk recycled to create more, but different, junk :)
      Regards,
      Steve

    2. Re:Uh-oh by brendan_orr · · Score: 3, Funny

      Nah, Linux, Mac OS X, *BSD, and other *nix users have /dev/null as a trash can.

    3. Re:Uh-oh by AhBeeDoi · · Score: 3, Funny

      Nah, Linux, Mac OS X, *BSD, and other *nix users have /dev/null as a trash can.
      Trash can? More like a neutron star, 'cause anything you put in it is totally and absolutely gone.

  13. The proper fix... by Dan+East · · Score: 3, Insightful

    ...would have been to have "fixed" the problem before the hardware left earth. This "bug" (or more accurately, known limitation of the filesystem) should have been discovered here on earth if the rover had been properly tested.

    The only real bug was the inability of the system to properly handle running out of file entries (or more specifically, consuming too much RAM as the number of file entries increased). However, the software should never have stressed the filesystem to that degree in the first place.

    Dan East

    --
    Better known as 318230.
    1. Re:The proper fix... by Chester+K · · Score: 4, Funny

      The only real bug was the inability of the system to properly handle running out of file entries (or more specifically, consuming too much RAM as the number of file entries increased). However, the software should never have stressed the filesystem to that degree in the first place.

      When you can write an embedded operating system that can gracefully and automatically recover from every possible thing that might ever go wrong, perhaps you should send your resume to NASA.

      --

      NO CARRIER
    2. Re:The proper fix... by KewlPC · · Score: 5, Informative

      Score: -1, Didn't Read Article

      The rovers were extensively tested before launch. For example, NASA took about 100000 pictures with the test panoramic cameras under varying conditions to see how they would react. NASA put a test rover on a tilting platform to see how far over the rover could tilt before it capsized, to find out at what angle the electric motors could no longer drive the rover up a hill, etc.

      This limitation of the filesystem was known about ahead of time. If you had read the article, you'd have known that. They had a utility to clean out the rover's filesystem, but a storm at the Deep Space Network site that was supposed to transmit it prevented the second half of the utility from being uploaded to the rover. And before you say anything else, the article also mentioned that the people involved had thought of this possibility ahead of time.

    3. Re:The proper fix... by cetialphav · · Score: 2, Insightful

      So it's clear that their on-the-ground testing didn't catch the first bug, despite the rigorous testing you described. Which makes one wonder if they really did such rigorous testing. The grandparent is right.

      But this was probably intentional. The flight to Mars is a long one, so there is plenty of time to test while the rover is in transit. Before launch, you need to make sure that the hardware works and is reliable. Since they can upload new versions of software, they can do much of the testing after the launch. This is one of the things that allowed them to hit aggressive launch windows.

      This looks like it was less a technical failure and more a communications failure. Other rover operations were dependent on the utilities running to clear up flash space. When that did not happen on time, the right people were not told and so they assumed there would be more space available.

  14. Hindsight by FTL · · Score: 5, Insightful
    The article (I know, I know, this is Slashdot) is really good. It contains everything that is missing from traditional media. The story, the background, technical details, and follow through.

    Granted, mainstream media have to keep their coverage dumbed down if Joe Public is going to read it. But what really bugs me is the lack of follow-up. We hear about poorly understood events as they are unfolding, then never hear about them later when they are completely understood.

    A recent example is the gangway between ship and shore at the QM2's drydock. It collapsed killing lots of people, an investigation was launched. Why did it collapse? At the time it wasn't known. I'm sure it's known now, but there's been absolutely no followup.

    This article about the rover is great not so much because of the level of detail but because it reports on an event with the benefit of hindsight.

    --
    Slashdot monitor for your Mozilla sidebar or Active Desktop.
    1. Re:Hindsight by Jeremy+Erwin · · Score: 2, Informative

      I'm sure there will be at least some mention of the results of the investigation when it is completed and various persons are prosecuted. In the meantime, here's a relatively recent article on the investigation into the collapse.

    2. Re:Hindsight by addaon · · Score: 2, Funny

      I think this is exactly what he means. We get the beginning of the story, but then, no followup!

      --

      I've had this sig for three days.
    3. Re:Hindsight by Anonymous Coward · · Score: 5, Interesting

      I'm a journalism undergrad at a large university. One of the points I brought up with some of our administrators is that the innumeracy and scientific illiteracy of the graduates of our program is appalling. I think this is one reason why many important stories don't get reported accurately or in depth: the writers simply don't understand the story, and don't want to understand the story. They actually feel that math and science are somehow beneath them, and that the average reader doesn't need to be bothered with the facts. So we get vagueness instead of specifics in the articles we read.

      I suggested we allow j-students to substitute math or hard science minors in place of the foreign language requirement. Most graduates of college foreign language programs don't translate at a level any higher than Babelfish. It seems wasteful to force people to spend so much time learning a language that most will never use, when that time could be more productively spent introducing them to the languages of math and science, which they will undoubtedly use in the future. We'd get better reporting that way, and isn't that what going to j-school is all about? Science and technology are too important to our day-to-day lives and governance to be left to illiterates.

  15. What the article doesn't say by Mr2cents · · Score: 4, Insightful

    What filesystem is used? Is wear leveling being used? The directory structure is apparently stored in RAM during the day (why else would it use so much RAM?), that is a good thing for reducing wear on the flash system. But what's the number of writes on the flash chips? When will that number be reached?

    --
    "It's too bad that stupidity isn't painful." - Anton LaVey
    1. Re:What the article doesn't say by afidel · · Score: 4, Interesting

      Never. The rovers are only going to operate for ~100 days, and the number of writes for modern flash RAM is 100K cycles minimum, over a million typical. So unless they are really screwing something up, that shouldn't be a limitation. Also, distributing file placement shouldn't be a software function; good CF cards do it in the controller logic.
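
      A back-of-the-envelope check of that claim, as a rough sketch. It assumes the 100K-cycle minimum and ~100-day lifetime quoted above, and it pessimistically treats every write as landing on the same block (no wear leveling at all). The function name is made up for illustration.

```python
# Worst-case flash endurance estimate: every write lands on the same block.
MIN_ERASE_CYCLES = 100_000   # minimum endurance figure quoted above
MISSION_DAYS = 100           # approximate planned lifetime quoted above


def writes_per_day_to_wear_out(cycles: int, days: int) -> float:
    """Rewrites per day of a single block needed to exhaust it within the mission."""
    return cycles / days


budget = writes_per_day_to_wear_out(MIN_ERASE_CYCLES, MISSION_DAYS)
print(f"{budget:.0f} rewrites of the same block per day")
```

      So even with no wear leveling, one block would have to be rewritten about a thousand times a day, every day, to hit the minimum figure.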

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    2. Re:What the article doesn't say by hyc · · Score: 2, Interesting

      Except that fully mirrored RAM would use way too much power, something that no space probe can afford.

      --
      -- *My* journal is more interesting than *yours*...
  16. Mod this "redundant" by Penguinshit · · Score: 5, Informative


    'How do you diagnose an embedded system that has rendered itself unobservable?'

    The way you do this is by having an exact duplicate of the remote system so you can set up a test with conditions as close to those under which the remote system is currently operating. You can then do a series of carefully controlled test solutions to determine the optimum prior to trying it on the "live" system.

    This is the way I set up all my production systems and, barring catastrophic hardware failure (self-immolating disks and a router which just folded when its power supply burped) I've had perfect uptime.

    (well, ok.. there was that one time, late at night, when I typed "reboot" in the wrong window.. but that happens...)

  17. Ran out of flash disk space. No, really. by randyest · · Score: 2, Insightful

    If you RTFA you will realize that I'm not lying in the least when I say that, effectively, they ran out of flash-based "disk" space! They forgot to delete old files when updating the programs in the flash memory (which is mounted like a filesystem, or hard disk), and the OS was failing because it wanted to use that space. So it rebooted, and still had insufficient disk space, and rebooted again . . . lather rinse repeat. There was no signal because it was stuck in a reboot loop because they ran out of disk. Wow.

    They fixed it by telling it to boot without using the flash (safe mode :) ), then used low-level (direct access) flash utilities to remove the old files. Reboot, mount, disk check / corruption repair, voila it works again.

    We have a big 1TB NetApp server where I work, and we have so much disk space that people get lazy and don't delete files or archive old projects, then they get really confused when jobs fail, not thinking of disk space until checking everything else first. But it happens, and it's usually surprisingly hard to debug (they check a lot of other things first, sometimes even upgrading tool versions!). It's really kinda funny, in an expensive and mildly embarrassing way, that the Spirit had the same problem.

    --
    everything in moderation
  18. Lucky Hack? by electromaggot · · Score: 5, Insightful

    "The outcome strikes me as an extremely Lucky Hack..."

    The outcome does not strike me as a "Lucky Hack." They made the system flexible, that flexibility got them into some trouble, and it's also what got them out of it. Anyone else agree?

  19. All these worlds.... by dmeranda · · Score: 4, Funny
    "The irony of it was that the operating system was doing exactly what we'd told it to do," Klemm lamented.

    Yeah, that was HAL's excuse too.

    Seriously, hats off to all the JPL programmers. Proving to the Martians that there is indeed intelligent life on Earth, very intelligent.

  20. Remote debugging pet peeve by Peter+McC · · Score: 5, Funny

    My pet peeve when I'm doing remote troubleshooting is 'ifconfig eth0 down'...oops. At least NASA is smarter than that.

    Peter.

    --
    You know what I hate? Wait, what do you like? I hate that!
  21. rebooting on mars... by segment · · Score: 4, Interesting
    Interesting reading:

    Rebooting on Mars

    By Matthew Fordahl, The Associated Press

    It's a PC user's nightmare: You're almost done with a lengthy e-mail, or about to finish a report at the office, and the computer crashes for no apparent reason. It tries to restart but never quite finishes booting. Then it crashes again. And again.

    Getting caught in such a loop is frustrating enough on Earth. But imagine what it's like when the computer is 200 million miles away on Mars. That's what mission controllers faced when the Mars rover Spirit stopped communicating last month.

    ...

    Tech support for an $820 million mission is a cautious affair. Tools to recover from and fix any problem must be built into the system before launch. The systems' behaviors need to be completely understood and predictable.

    "Luckily, during the design period, we anticipated that we might get into a situation like this," said Glenn Reeves, who oversees the software aboard the Mars rovers Spirit and Opportunity at NASA's Jet Propulsion Laboratory.

    For stability, reliability and predictability, mission designers did not bust the budget and design the hardware or software from scratch. Instead, they turned to hardware and software that's been used in space before and has a proven track record on Earth as well.

    "The advantage of using commercial software is it's well-known, and it's well deployed," said Mike Deliman, an engineer at Alameda-based Wind River Systems Inc., which made the rovers' operating system. "It has been used throughout the world in hundreds of thousands of applications."

    The operating system, VxWorks, has its roots in software developed to help Francis Ford Coppola gain more control over a film editing system. But the developers, David Wilner and Jerry Fiddler, saw a greater potential and eventually formed Wind River, named for the mountains in Wyoming. VxWorks became a formal product in 1987.

    rest of article

    1. Re:rebooting on mars... by JWSmythe · · Score: 2, Funny


      I wonder how many Microsoft salesmen were pushing for putting WinXP on it.. :)

      --
      Serious? Seriousness is well above my pay grade.
    2. Re:rebooting on mars... by hazem · · Score: 2, Funny

      It's a PC user's nightmare: You're almost done with a lengthy e-mail, or about to finish a report at the office, and the computer crashes for no apparent reason. It tries to restart but never quite finishes booting. Then it crashes again. And again.

      Gee... that sounds a lot like the last worm to hit my mom's Dell Laptop running Windows XP.

    3. Re:rebooting on mars... by Anonymous Coward · · Score: 2, Informative

      1. It's not a known broken OS. It's an OS that doesn't have any failsafe to protect against running out of storage, and user error caused it to allocate too many files. The people who were keeping track of old files from a failed transfer weren't talking to the guys that allocated new files, so nobody knew how many files were actually allocated and they ran out.

      2. That's not what "begs the question" means. http://skepdic.com/begging.html

      3. Based on 1 and 2, it is proved by example that you=monkey puppet.

  22. not really... by rebelcool · · Score: 3, Informative
    on projects such as this, the design specs would've been frozen several years ago, and then would've been conservative for the time, using proven technology.

    Another factor in this is the safety of the flash ram. It is rad-hardened and built with tons of extra error correction which again, requires years of testing and special design considerations. And is extremely expensive.

    --

    -

  23. Lucky Hack? by SuperKendall · · Score: 5, Insightful

    Your post is the only thing that strikes me as a "Lucky Hack" here. They included the ability in the design to remotely disable booting from flash and upload new boot images, in what way is that a "hack"? All this is just foresight in design to include as many possible recovery modes as they could.

    Basically, they rebooted from a recovery image (sent via radio) and then proceeded to do low-level fixes on Flash memory and then ran a chkdsk. If I do something similar via recovery disk or CD, I don't get a lot of people telling me that it was a "Lucky Hack" that I could boot off of CD!!!

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
  24. NASA Rocks! by blueZhift · · Score: 5, Interesting

    Great article! This is just the sort of thing that has always impressed me about NASA and the JPL. Just when mere mortals might give it up and walk away, they figure out the problem. I can only imagine how wild the party must have been after they fixed Spirit; the scientists and engineers I've worked with in the past could really put away the booze.

    Seriously though, the key lessons to take away from this are.

    1) Gather all of the clues you can.

    2) Take those clues and build a model.

    With luck and care, the model should get you closer to what may have gone wrong. And in this case it apparently did just that. Now that's geek cool!

    BTW, I know that generally you want to prevent this sort of thing from happening. But in reality most software ships with bugs and launch windows to Mars are non-negotiable.

    1. Re:NASA Rocks! by roskakori · · Score: 2, Interesting

      Seriously though, the key lessons to take away from this are.

      1) Gather all of the clues you can.

      2) Take those clues and build a model.

      you forgot this one:

      0) predict failure scenarios in the design phase, think them through, and design accordingly.

      when you read the article, you will notice that a lot of plans and tools already existed that allowed them to trace the problem. this is one of the major differences between armchair coding and reliability engineering.

  25. Remote safe mode by Megane · · Score: 3, Interesting
    The first thing needed to achieve remote maintainability on the order of space probes is some way to access a machine remotely when it's not running the full OS. A KVM switch isn't going to work over long distances. The BIOS needs a way to run over the network. Same for the kernel boot messages. Whether it's through a serial console and SSH server, or through the BIOS running TCP/IP, what we have now isn't enough. A separate console server could also control a power cycle/reset switch circuit.

    There also needs to be a way to load bootstrap code remotely. For instance, having a TCP/IP enabled BIOS be able to run TFTP or some other protocol to load a netboot floppy image. Then you could give it a LILO command instructing it where to find a boot image, preferably one on a server in the same hosting center.

    --
    #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
  26. Re:NASA should have simulated... by updog · · Score: 3, Informative
    The fact that they filled up the flash memory with too many files that were accumulated during the cruise phase of the mission between earth and mars was something that they should have known would happen. Apparently you didn't read the article. Because of a communication failure, a utility that was supposed to delete the old files didn't get completely uploaded. The utility was scheduled for retransmission, but the filesystem filled up before it got re-transmitted.

  27. whoops by usillyman · · Score: 5, Funny

    Operating System not found. Press any key to continue.
    Damn! Left the floppy in!

  28. Could an earthbout 'twin' computer help? by AaronStJ · · Score: 4, Interesting

    What surprises me is that they don't have a 'twin' of the rover's computer system set up on earth. When commands are run on the rover, the same commands could be run on the computer system on earth. Then, if the rover's software fails (as it did), the software on earth would (theoretically) fail in a similar way, and be MUCH easier to debug. Of course, the systems wouldn't be identical (without building an entire duplicate and expensive rover), and the data gathered wouldn't be identical, but the twin could be carefully planned and fed dummy data that approximately mirrored the data the rover was gathering. For example, the twin could be fed dummy pictures about as often as the rover took a real picture.

    From the article "[The] transmission that uploaded the utility was a partial failure: Only one of the utility program's two parts was received successfully. The second part was not received, and so in accordance with the communications protocol it was scheduled for retransmission on sol 19." NASA could have simulated a half failed transfer on the twin computer on earth, and then watched carefully using traditional debugging tools to make sure the failed transmission didn't cause a software failure (which it did).

    Again, from the article "The data management team's calculations had not made any provision for leftover directories from a previous load still sitting in the flash file system." However, if they had a twin computer system to watch, they would have seen the failure occur on earth as it did in space. Debugging a system you can hook a serial debugger to is bound to be much easier than debugging a system a million miles away.

    --
    Stupid like a fox!
    1. Re:Could an earthbout 'twin' computer help? by Anonymous Coward · · Score: 4, Informative

      Uhmm... we DID build a 'twin' of the rover, hardware and all. Give us a bit more credit, will ya? :-P What you may not realize is that exposure to radiation on the surface of Mars, solar wind while in transit and other factors such as thermal expansion / contraction, etc. are slowly degrading the rovers in nondeterministic ways. It is not nearly as simple as 'running the commands in the testbed' at JPL to diagnose any problems which occur.

    2. Re:Could an earthbout 'twin' computer help? by Gogo+Dodo · · Score: 2, Informative
      They do have a twin system here, but having one here isn't quite the same as the two on Mars. You can't replicate everything on the two Mars rovers such as the science data files.

      When Spirit was turned around on its lander, they tested the moves on its twin here, hence the long delay getting off the lander.

  29. There is a significant lesson to learn, here .. by Anonymous Coward · · Score: 3, Insightful

    .. namely, "Do Not Use VxWorks". Use something stable instead. eCos comes to mind. So does everyone's favorite OS these days, which has RTOS support. Having been a frustrated VxWorks user in the past, I'd no more entrust my mission-critical services to it than I would to Microsoft. -- TTK

  30. Ran out of INODES. No really. by dorko · · Score: 5, Informative
    If you RTFA you will realize that I'm not lying in the least when I say that, effectively, they ran out of flash-based "disk" space!
    Well, I did read the article and I wouldn't say it quite like that. The article says: "Spirit attempted to allocate more files than the RAM-based directory structure could accommodate." Furthermore, the article says that the low-level file manipulation commands "worked directly on the flash memory without mounting the volume or building the directory table in RAM."

    To me, if this were a Unix-like system, it sounds like they ran out of inodes. Running out of inodes is very different than running out of disk space.

    If you think running out of disk space can be hard to troubleshoot, try running out of inodes.
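
    To make the distinction concrete on a Unix-like box (not on the rover, whose flash file system is a different beast), here is a small sketch that reports free space and free inodes separately. The helper name `fs_headroom` and the choice of "/" are just for illustration; `os.statvfs` is only available on Unix.

```python
import os


def fs_headroom(path: str = "/") -> dict:
    """Report free bytes and free inodes for the filesystem holding `path`."""
    st = os.statvfs(path)  # Unix-only call
    return {
        "free_bytes": st.f_bavail * st.f_frsize,  # space available to unprivileged users
        "free_inodes": st.f_favail,               # file slots available to unprivileged users
        "total_inodes": st.f_files,               # some filesystems report 0 here
    }


if __name__ == "__main__":
    info = fs_headroom("/")
    print(info)
    if info["total_inodes"] and info["free_inodes"] == 0:
        print("Out of inodes: creating files fails even though free_bytes > 0")
```

    A check that only looks at free bytes will happily report plenty of room while every attempt to create a file fails, which is what makes this one so annoying to track down.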

    1. Re:Ran out of INODES. No really. by Concerned+Onlooker · · Score: 3, Funny
      If you think running out of disk space can be hard to troubleshoot, try running out of inodes.

      That's why I always keep a spare bag or two of inodes on hand, just in case. They're small so they don't take up too much space in the closet. I store them next to those f-stops I used to use for photography.

      --
      http://www.rootstrikers.org/
  31. mod parent down by ChrisCampbell47 · · Score: 2

    Wrong wrong wrong, as I'm sure someone else will post. He spins a good yarn but he's just a machine room flunky and hasn't RTFA himself.

  32. Yes, see my post by SuperKendall · · Score: 2, Insightful

    I had pretty much the same post - the originator of the story confuses luck with skill, a mistake I find very annoying and committed all too frequently. I'll fully admit when I've been lucky, but I also want recognition for foresight when I've had some! NASA deserves at least that much respect.

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
  33. Lots of copper by Anonymous Coward · · Score: 2, Funny

    Duh. That's what they have been keeping a secret. They have a DB9 serial link strung from here to the landing site. It's not as cool as you all make it out to be.

  34. Does Microsoft know about this? by superyooser · · Score: 2, Interesting
    The operating system is Wind River Systems' Vx-Works version 5.3.1, used with its flash file system extension.

    First wxWindows, now Vx-works?

  35. Great trick for ssh administration by nsayer · · Score: 4, Insightful

    Before doing something risky, type this:

    sleep 600 && reboot &

    Now if your risky maneuver makes the ssh session unusable, just wait 5 minutes for the machine to reboot.

    This is great for fiddling with firewalls by remote control... through the firewall. :-)

    Oh... You say you're not using a POSIX-like system? That's not supported. Sorry. :-)

    1. Re:Great trick for ssh administration by Anonymous Coward · · Score: 2, Informative
      No, if sleep finishes successfully it'll reboot. If you kill sleep, it'll exit with some big code (on Linux it would be 130). sleep exiting with code 130 will cause && to not execute the consequent.

      Ergo, if everything works and you don't want to reboot anymore, you just do a little % followed by a little ctrl-C and it's all good.

      Incidentally, sleep 600 will make the machine sleep for 10 minutes, not 5 minutes, as the OP said :)
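
      The same dead-man's-switch idea, sketched in Python rather than shell: a timer runs a rollback action unless it gets cancelled first. `DeadMansSwitch` and the rollback callback are hypothetical names for illustration; a real action might restore a saved firewall config or trigger the reboot.

```python
import threading


class DeadMansSwitch:
    """Run `action` after `delay_s` seconds unless cancel() is called first."""

    def __init__(self, delay_s: float, action):
        self.fired = threading.Event()
        self._action = action
        self._timer = threading.Timer(delay_s, self._fire)

    def _fire(self):
        self._action()
        self.fired.set()  # set only after the action has completed

    def arm(self):
        self._timer.start()
        return self

    def cancel(self):
        # Analogous to killing the backgrounded `sleep` so `reboot` never runs.
        self._timer.cancel()


if __name__ == "__main__":
    # 600 seconds is 10 minutes, matching `sleep 600`.
    switch = DeadMansSwitch(600, lambda: print("rolling back")).arm()
    # ... make the risky change, confirm the session still works ...
    switch.cancel()
```

      If the session dies before `cancel()` is reached, the timer fires on its own, which is the whole point.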

    2. Re:Great trick for ssh administration by Helvick · · Score: 2, Funny

      You mean like the time I sent myself 22000 mails because the OS in question didn't have an implementation of sleep.

  36. Brilliant! by frenchs · · Score: 3, Funny

    They could have set it up out in my backyard to take pictures of the piles of crap and rocks out there and if they wanted to simulate the solar radiation, they could have my girlfriend give it one of her famous looks... cause those are lethal enough to burn a hole in your soul.

    -SF

  37. How'd they do it? by alwaystheretrading · · Score: 5, Funny

    That must have been some feat to get the arm on the rover to press Ctrl, Alt and Delete at the same time!

    1. Re:How'd they do it? by oohgodyeah · · Score: 5, Funny

      Maybe it's all lies and the Martians hit Ctrl+Alt+Del...

      --

      - OohGodYeah!
    2. Re:How'd they do it? by inertia187 · · Score: 2, Funny

      NASA: Ok, let's do this. I want you to press CTRL-ALT-DEL.
      Spirit: But I only have two fingers, you insensitive clod!

      --
      A programmer is a machine for converting coffee into code.
    3. Re:How'd they do it? by rjamestaylor · · Score: 5, Funny
      Actually, a friend of mine is a system admin with JPL and he had to drive out to the San Bernardino soundstage where the rovers are being filmed and reboot the computer at 4AM. The funny thing is he left a tool chest and sleeping bag (he was using it to minimize footprints and body impression, not sleep on the job!) where the Opportunity rover was scheduled to peek over the horizon and the ensuing photo of the tool chest / sleeping bag on the horizon had to be quickly -- and deftly, I must say -- explained away as being Opportunity's back shell and parachute.

      Just another day in the life of a sys admin!

      --
      -- @rjamestaylor on Ello
  38. Verifying the software !!! by vinit79 · · Score: 4, Informative

    What really surprises me is that NASA did not verify the software. Software verification is essentially mathematically proving the software. It is tedious and expensive, but we are talking about NASA and Mars. In fact, even beloved MS formally verifies device drivers before use (believe it or not!!). If the original program was correct they wouldn't have to reupload it, and the entire problem... gone.

    1. Re:Verifying the software !!! by jonastullus · · Score: 2, Interesting

      you are quite aware that software verification is far from being usable for any languages found in the wild?!
      first you need a model to verify your software against and as a matter of fact the model will again be written in some pseudo-language so that you not only double the workload but also introduce slight incompatibilities between the implementation language and the model language!
      and then somebody still has to prove that there are no errors in your model for which you will need a meta-model, etc, etc, etc.

      i'm not saying that verification is not practical or that it wouldn't be nice to have it, but there are obstacles that won't be solved for quite some years to come!

    2. Re:Verifying the software !!! by WayneConrad · · Score: 5, Interesting

      Software verification is essentially mathematically proving the software....

      I've been hearing how great formal verification is since I started this gig. Three decades later, it's still not what Yourdon and his buddies thought it would be. When the first computer scientists were budded from mathematics departments, their mathematical discipline allowed them to do wonderful things, some of which we're still catching up with. But it also gave them some disturbing habits, the worst of which is the insistence that formal verification is the best way to write code, and anyone not doing so must be a fool.

      Formal verification is a powerful tool, but as you say, it is expensive and applies to only a limited set of problems. If it were so cheap and so widely applicable, we'd be using it everywhere.

      We've poured decades of funding into formal verification, but the useful tools keep coming from other avenues of research. I think it's time to stop beating the formal verification drum.

  39. Bud Light Presents...Real Men of Genius. by Blaede · · Score: 4, Funny

    Today we salute YOU, Mr. Super Wizard Windows Reinstaller.

    Only YOU can fully appreciate the difficulty of running a format c: command, while swilling a room temperature can of Red Bull.

    "Hey this stuff is hard now!"

    While NASA is too preoccupied with things like faraway rovers, you take your vocational tech school fueled arrogance directly to the place where it will make the absolute least possible impact: A Slashdot discussion thread.

    "Loggin' on now!"

    Your unique eye for obviousness allows you to sling turds of obtuseness every which way, and then brag about how you were RIGHT as soon as one of your pronouncements hit true - regardless of how many times you were wrong before.

    "See I told you sooooooo!!"

    And if some idiot rocket scientist has the unmitigated gall to not bow down to your obvious Geniusdom, you unleash your fury down upon him with all the tenacity and mercilessness of a rabid pit bull with a tender buttock locked in its jaws.

    "Total anonymity!"

    So keep clicking away, oh Marauder of the Mousepad. Because when the results you so desire finally come about years from now, you can say it was because YOU demanded it.

    "How come they haven't fired that dumbass head of NASA yet?"

    (Bud Light Beer, Anheuser Busch, St. Louis Missouri.)

  40. They didn't just randomly delete stuff by enosys · · Score: 4, Insightful
    From the article:

    Using the low-level commands, about a thousand files and their directories -- the leftovers from the initial launch load -- were removed.

    I think that means they deleted the useless stuff they wanted to delete anyways but didn't get to delete before the crash. I also remember news about science data from before the crash that was received after they got the rover working again.

    As for how critical it is, well yeah, it seems the rover didn't need the contents of the flash file system. The operating system and other software was in the same flash memory but I assume that any sane designer would put in some hardware write protect interlock that's not easy to defeat accidentally.

    1. Re:They didn't just randomly delete stuff by edesio · · Score: 2, Informative

      It seems to have two different flash memories: a larger one for new files and a smaller one for programs. This would make it easier to manage.

      "...Separately, about 230 Mbytes are used to implement a flash file system..."

  41. Re:NASA should have simulated... by KewlPC · · Score: 2, Insightful

    You realize that missions to Mars can only be launched once every two years, right? If they miss their launch window, they've got to wait two years before they can launch again.

    You also realize that NASA did do a test mission, right? They built a test rover and put it out in a desert somewhere. They used the mission to test the hardware, test the software, and to help train the team.

  42. Re:only 120 megs ram? by KewlPC · · Score: 5, Informative

    You realize that the onboard computer is basically the same one as used on the Mars Pathfinder lander, right? Same CPU, same amount of RAM, even the same OS. I wouldn't be surprised if they used the same (or similar) circuit diagrams for certain things.

    The point is to use well known and well tested hardware. The whole point of Mars Pathfinder was to develop a system whose design could be re-used for other Mars landers and rovers.

    Lastly, what exactly are you going to do with greater flash capacity? The point of having any flash memory on the rovers at all is not for long term storage, but rather just to hold onto data until it can be transmitted to Earth, after which it gets deleted.

    Despite what some idiot posted a few posts up, they did NOT run out of room on the flash drive. Rather, the problem is more akin to running out of i-nodes. Mounting the flash filesystem, reading all its metadata and whatnot, took up more RAM than was allocated for it, due to the high number of files it had to deal with (most of which were accumulated on the way to Mars, and were going to be deleted).
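
    A toy model of that failure mode, as a sketch only: files persist in flash, but mounting has to build a directory table in a fixed amount of RAM. The class, method names and slot counts are all invented for illustration and have nothing to do with the real VxWorks flash file system API.

```python
class MountError(Exception):
    pass


class ToyFlashFS:
    """Toy model: files live in flash, but mounting builds a directory table in RAM."""

    def __init__(self, ram_table_slots: int):
        self.ram_table_slots = ram_table_slots  # invented limit, not a real rover figure
        self.flash_files = set()                # persists across "reboots"
        self.mounted = False

    def mount(self):
        # Mounting needs one RAM slot per file already stored in flash.
        if len(self.flash_files) > self.ram_table_slots:
            raise MountError("directory table does not fit in RAM")
        self.mounted = True

    def raw_delete(self, name: str):
        # Low-level removal that works on flash directly, without mounting.
        self.flash_files.discard(name)


fs = ToyFlashFS(ram_table_slots=4)
fs.flash_files = {f"cruise_{i}" for i in range(6)}  # leftovers push past the limit

try:
    fs.mount()
except MountError as err:
    print("mount failed:", err)  # a system that reboots on this failure would loop

for name in ("cruise_0", "cruise_1", "cruise_2"):
    fs.raw_delete(name)          # clean up without ever mounting
fs.mount()
print("mounted:", fs.mounted)
```

    Note that rebooting changes nothing in this model, because the file count lives in flash; only deleting files without mounting gets the count back under the RAM limit.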

  43. The only reason nasa got it back to work by PipoDeClown · · Score: 3, Informative

    is because when the batteries got drained the OS went into a stable "safe mode" state. If they had built a longer-lasting power supply, this project would have been doomed and they would never have found out what the real problem was.

  44. What we can learn: by sakusha · · Score: 4, Insightful
    It appears that we still haven't learned the biggest lesson of all. I still remember back around 1970, there was a big sign on the wall next to the IBM 370s at my university, written on a primitive pen plotter, it said:

    Computers never make mistakes, they do exactly what humans tell them to do. All "computer errors" are human errors.

    1. Re:What we can learn: by ColaMan · · Score: 2, Interesting

      Unless you own an early pentium.

      --

      You are in a twisty maze of processor lines, all alike.
      There is a lot of hype here.
    2. Re:What we can learn: by roman_mir · · Score: 2, Insightful

      Really? What about external factors like a radiation spike that kills some of your hardware, will the computer do what you tell it to do then correctly?

    3. Re:What we can learn: by swillden · · Score: 2, Informative

      You still haven't learned the lesson. Those are not errors or mistakes, they are malfunctions. A properly designed computer system can easily detect malfunctions

      Guess you'd better get over to NASA and set up a series of lectures so that you can impart your vast expertise and wisdom.

      but that same system will happily execute any human-designed code containing massive errors.

      Interesting that you point out the code as being human-designed. Who designed the hardware? God?

      You're just the kind of computer geek I abhor, always looking for excuses instead of solutions to your own mistakes.

      And you're just the kind of self-assured idiot who amuses me endlessly with your clueless but oh-so-confident assertions.

      In the real world, hardware defects do exist, some designed into the hardware, others induced by external effects or damage. Software errors are certainly far more common, but that's mostly just because there's vastly more software.

      Even without the effects of space travel, hardware contains flaws and, indeed, much of the job of low-level software is to work around those flaws. It's not uncommon for a significant percentage of the code in a device driver to be dedicated to working around various hardware defects.

      Anyone who's spent considerable time working around custom and embedded computing hardware knows that defects often turn out to be *both* hardware and software-based. Insignificant hardware bugs interact with insignificant software bugs to produce major problems. Hardware defects aren't limited to those environments, either. Spend a little time searching the LKML archives for "ACPI" and reading what you find, or even just look through the Linux kernel configuration help and see how many configuration options you find that implement software hacks to work around problems with particular pieces of hardware.

      When you factor in the rather unique and harsh operating environment of this hardware and software, and consider the amount and depth of testing that certainly went into the development process, it's not in the least bit unusual that the programmers should be surprised that the flaw was purely a software error. If I'd been in those engineers' shoes, I also would have expected something far more complex. I'm sure they went into it, quite reasonably, assuming that some hardware component had failed and that they were going to have to implement a software workaround.

      I'm sure the prevailing sentiment when they finally discovered the actual nature of the problem was "Hallelujah! This is something we can fix!", not "Uh, oh, I can't blame this on anyone else." That's certainly how I would have felt, anyway.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  45. Re:WindRiver's fault by KewlPC · · Score: 2, Informative

    Actually, they used VxWorks because it was the same OS used for the lander on the Mars Pathfinder mission. Since they were using the same CPU and same basic computer design as the Mars Pathfinder lander, they probably figured, "Why not use the same OS?"

  46. Re:OT:lots of mem of an embedded system by millette · · Score: 2, Interesting

    I'm serious. http://physics.nist.gov/cuu/Units/binary.html for all the groovy details. If anything, it's a move away from the HD manufacturers' lingo.

  47. Re:Ran out of flash disk space. No, really. by SiliconEntity · · Score: 2, Informative

    Here's what happened according to the article. They launched the ship with an OS image in flash, and soon realized that they needed to update it. So shortly after launch they sent another complete OS image. They knew they'd have to delete the first image, but they didn't do it right away. At that point there was plenty of room in the flash memory so having two OS images was not a problem.

    After a few days on Mars, they were starting to fill up the flash, so they planned to go ahead and delete the old launch OS image, its directories and files. This is a complicated process so they uploaded a special program to do it on Sol 15. And apparently they informed the rest of the team that the memory would be free and available after that point, so the rest of the team made plans to start filling it up with pictures.

    However, the upload on sol 15 failed, and was rescheduled for sol 19. Now, here's the big mistake (which the article glosses over): They forgot to tell the rest of the team that all that memory wasn't going to be freed up as planned, not for a few more days. So instead, Spirit is moving around now, taking lots of pictures, storing them in flash, and all the people involved with that think they have plenty of room. Little do they know that they are running out of flash space. Finally, the morning of Sol 19, shortly before the memory cleaning program was going to be sent down, it happened. The flash memory was exhausted. This triggered a sequence of events which put the craft into a failure loop.

    The big problem here, then, was the failure on the part of the group which was supposed to clean out the launch OS image to tell the rest of the team that it wasn't going to happen as scheduled, so the memory wasn't going to be available. It wasn't really Murphy's Law, but rather a failure to communicate among the team. This is an institutional problem which will hopefully be fixed.

  48. Re:Ran out of flash disk space. No, really. by dorko · · Score: 4, Informative
    > [T]hey are running out of flash space. ... The flash memory was exhausted.

    No, no, NO!

    It was the inability to build the RAM-based directory structure of the files in the Flash memory.

    Why couldn't they build the directory structure? They had too many files. The size of the files doesn't matter here, only the number of files.

    In other words, they ran out of RAM, not Flash.

    Exercise left for the readers: Why can a Unix file system that is out of inodes have much less than 100% disk usage and still not be able to create a file?
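
    A hint for the exercise, as a minimal Python sketch: a filesystem tracks free data blocks and free inodes separately, and creating a file needs one of each kind of resource. The `FsStats` tuple, the helper names, and the numbers are made up for illustration; only the field names are borrowed from what `os.statvfs()` reports on Unix.

```python
from collections import namedtuple

# Stand-in for the os.statvfs() fields that matter here (illustrative only).
FsStats = namedtuple("FsStats", "f_blocks f_bfree f_files f_ffree")


def fs_usage(stats):
    """Return (block_usage, inode_usage) as fractions between 0 and 1."""
    # Some filesystems report 0 total inodes; treat that as "not tracked".
    block_usage = (stats.f_blocks - stats.f_bfree) / stats.f_blocks if stats.f_blocks else 0.0
    inode_usage = (stats.f_files - stats.f_ffree) / stats.f_files if stats.f_files else 0.0
    return block_usage, inode_usage


def can_create_file(stats):
    """A new file needs a free inode, however many data blocks are free."""
    return stats.f_ffree > 0


# Hypothetical filesystem: only 40% of the blocks used, but every inode taken.
example = FsStats(f_blocks=1000, f_bfree=600, f_files=500, f_ffree=0)
```

    On a real Unix box you could pass `os.statvfs("/")` to the same two functions. The rover's case is only an analogy to this: there the per-file bookkeeping lived in RAM rather than in an inode table.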

  49. Re:One reasonable anology by zcat_NZ · · Score: 5, Informative

    If you're really worried about your remote server being unreachable, here's what I would suggest doing:

    Have a hardware watchdog. If the machine is lost or confused, it reboots itself.

    Have it come up in a known state, fire off a few broadcast packets to the sysadmins, and run sshd but basically nothing else. Stay there for a minute or so.

    If nobody's tried to log in and halt the boot process, carry on booting. With luck the problem was transient. Worst case the problem still exists, you reboot, and the admins get another chance to log in.

    From the description of how they got Spirit back, it looks like this is exactly how it was set up.

    Who'da thunk it!!
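
    The boot-side half of the steps above can be sketched roughly like this in Python. It is an illustration only, not how Spirit or any particular init system does it: `staged_boot` and its callback parameters are hypothetical names, and the hardware watchdog itself is outside the sketch.

```python
import time


def staged_boot(start_minimal, admin_intervened, continue_boot,
                grace_seconds=60.0, poll_seconds=1.0,
                clock=time.monotonic, sleep=time.sleep):
    """Bring up a minimal known state, then give the admins a window to step in.

    Returns "held" if an admin halted the boot, "continued" otherwise.
    """
    start_minimal()                  # e.g. network, a broadcast to the admins, sshd
    deadline = clock() + grace_seconds
    while clock() < deadline:
        if admin_intervened():       # e.g. a flag file created over ssh
            return "held"            # stay in the minimal state for debugging
        sleep(poll_seconds)
    continue_boot()                  # nobody objected: start everything else
    return "continued"
```

    The clock and sleep functions are parameters only so the waiting logic can be exercised without actually waiting a minute.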

    --
    455fe10422ca29c4933f95052b792ab2
  50. Re:WindRiver's fault by KewlPC · · Score: 3, Insightful

    WindRiver may give JPL large discounts, but I doubt that's the only reason VxWorks is running on the MERs.

    Years ago, when JPL was designing the Mars Pathfinder mission, they asked Wind River to do an "affordable" port of VxWorks to the RAD6000 (a radiation-hardened RS6000), and they agreed. Since the computers on the two MERs are very similar to the computer on the Mars Pathfinder lander, it makes sense that they'd use the same OS that they used on the MPF lander.

    I would think the fact that JPL knows VxWorks very well by now would be a major factor in deciding to use VxWorks for the MERs.

  51. JPL by EachLennyAPenny · · Score: 3, Funny

    The JPL is a pretty viral license. It forces you to spread their space probes from your planet to all your customer's planets. This is un-solar systematic! What's next? Calling GNUpiter Jupiter instead?

  52. Re:NASA should have simulated... by hyc · · Score: 2, Interesting

    One word: outsourcing.

    When I worked at JPL, every 6 months to a year there'd be talks of layoffs because the headcount was too high; people would leave and return to the same projects as contractors, then get a higher hourly wage for doing the same work with less accountability.

    The whole reason for that lost probe (feet vs meters, anyone?) was because of a political squabble between two teams (one JPL-internal, one outside contractors as I recall) who simply failed to cooperate productively. The whole management structure inside that world is screwed. People's project leads are not the same as their section/department leads, so the reporting chain is a mes{h,s}. Time and energy is wasted in contract(or) management, all in the name of "reduced costs" even though having all the work done in-house would eliminate a full layer or two of mid-level management waste.

    NASA/JPL are totally hamstrung by beancounters who think they're saving the public's money, but truly can't see the big picture, missing the forest for the trees. (Either that, or they *do* see the big picture, and are busily lining their own pockets with the excess that gets tossed around thru all the churn.)

    --
    -- *My* journal is more interesting than *yours*...
  53. Re:Ran out of flash disk space. No, really. by Mal-2 · · Score: 2, Insightful

    Could this have not been said more succinctly with a simple quote? Namely:

    "What we have here, is failure to communicate."

    Mal-2

    --
    How is the Riemann zeta function like Trump rallies? Both have an endless number of trivial zeros.
  54. Hmmmm by ziggy_zero · · Score: 3, Insightful

    "The irony of it was that the operating system was doing exactly what we'd told it to do"

    Funny, that's how it was explained to me by my computer science teacher my freshman year in high school. He said, "The problem with computers is that they do exactly what we tell them to."

    --
    I belong to the ______ generation.
  55. Re:One reasonable anology by Fallen_Knight · · Score: 3, Interesting

    Considering the distance, I'd say a while. A couple of hours doesn't make much difference when you've got a billion $$ probe on another planet; it surviving is more important than a fast boot time, heh. And you can always log in and tell it to continue booting.

  56. Discovered a system log ? by thrill12 · · Score: 3, Interesting

    "We discovered a system log in which the problem was documented,"
    Those guys are running a very expensive experiment, are logging it and they have no idea what and where they are logging??

    --
    Slashdot: stuff for news, nerds that matter, matter for news, stuff that nerd
    1. Re:Discovered a system log ? by heneon · · Score: 2, Funny
      > Those guys are running a very expensive experiment, are logging it and they have no idea what and where they are logging??

      When building the rover, they probably just put the VxWorks installation CD in, selected "Typical Install for inter-planetary missions", clicked "Next" a couple of times and got the OS running in no time.

      Now, if the NASA engineers are anything like me, they didn't bother to check what was being logged and where... after all, when a problem arises, then you can go to /var/log/ to see if there's anything interesting.

    2. Re:Discovered a system log ? by roskakori · · Score: 2, Insightful
      maybe the guy really is from PR and doesn't know how to carefully phrase sentences targeted at a technical audience, but these also hit my eye:
      "It was recognized just after [the June 2003] launch that there were some serious shortcomings in the code that had been put into the launch load of software," said JPL data management engineer Roger Klemm.
      i know this is common within the software industry, but if this happens on such a project, it looks like plain incompetence to me.
      Klemm said that with the leftover directories and their files removed, the system is now functioning well. But just in case, the team is working on an exception-handler routine that will more gracefully recover from an allocation failure.

      allocation errors are the easiest to predict. even if you don't handle them gracefully (which often can be near to impossible), most of the time you can log them. of course, a reliable, redundant log facility is one of the most crucial components of such a system...

      writing this from my armchair, of course i can't really judge their competence and claim i could have done better. still, the article makes me suspicious.
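
      a rough Python sketch of the "at least log it" idea from the comment above: catch the allocation failure, record it in every log target that still works, and hand the caller a clear "no" instead of crashing. `guarded_allocate`, `record_failure` and the sink callables are hypothetical names; real flight software would look nothing like this.

```python
def record_failure(message, sinks):
    """Write the message to every sink; return how many writes succeeded."""
    written = 0
    for sink in sinks:
        try:
            sink(message)
            written += 1
        except Exception:
            # One broken log target must not stop the redundant ones.
            continue
    return written


def guarded_allocate(allocate, size, sinks):
    """Try to allocate; on failure, log it and return None instead of raising."""
    try:
        return allocate(size)
    except MemoryError:
        record_failure(f"allocation of {size} bytes failed", sinks)
        return None
```

      the caller still has to decide what to do with a `None`, which is the hard part the comment mentions.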
  57. Parent should be modded down by Dan+East · · Score: 2, Informative

    I did read the article, and my comments are completely accurate. Unfortunately you must not have made it to the 3rd paragraph, and neither did the mods that modded you up and me down.

    The problem was discovered after launch. The first few fixes made the problem worse by stressing the filesystem even further.

    It doesn't matter that they were trying to fix the problem. THAT WAS NOT MY POINT. The problem should have been identified and fixed before the craft was launched.

    Yes, they may have taken "around" 100000 pictures. Does that mean they sequentially stored every picture in an actual rover file system? I get the impression they were only testing the cameras or the capture software, not the holistic system.

    Did they first simulate filling the filesystem with files generated during the actual trip to mars? Apparently not, because the system would have failed if they had actually put the rover software through a launch to end of mission simulation here on earth when the software was developed.

    Dan East

    --
    Better known as 318230.
  58. Logging should not be limited ? by thrill12 · · Score: 3, Insightful

    Seriously, from a developer viewpoint, that is all wrong.
    I have worked on projects in which there was simply so much logging going on that you couldn't tell head from toe anymore. When a problem arose, scanning the logfiles proved very cumbersome indeed. Every developer had his own stuff logged, which sometimes proved interesting and sometimes proved utter crap (no one wants to know that variable XYZ was increased by 1 24943 times).

    You should develop a well-thought-out logging strategy that increases the logging verbosity on a per-problem basis, rather than simply logging everything that happens and hoping you get some useful information.
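
    One possible reading of "verbosity on a problem basis", sketched with Python's standard `logging` module: keep only the most recent detailed records in memory and pass them on when something at ERROR level or above shows up. `FlushOnErrorHandler` is a made-up class for illustration. The stock `logging.handlers.MemoryHandler` is similar, but it also flushes whenever its buffer fills, which is why this sketch uses a bounded deque instead.

```python
import collections
import logging


class FlushOnErrorHandler(logging.Handler):
    """Keep the last `capacity` records; forward them only when an error arrives."""

    def __init__(self, target, capacity=100, trigger=logging.ERROR):
        super().__init__(level=logging.DEBUG)
        self.target = target          # the handler that really writes the log
        self.trigger = trigger        # severity that counts as "a problem"
        self.recent = collections.deque(maxlen=capacity)

    def emit(self, record):
        self.recent.append(record)    # older records silently fall off the front
        if record.levelno >= self.trigger:
            for buffered in self.recent:
                self.target.handle(buffered)
            self.recent.clear()
```

    In normal operation nothing reaches the log file; when an error occurs you get it together with the few records that led up to it.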

    --
    Slashdot: stuff for news, nerds that matter, matter for news, stuff that nerd
  59. Not lost forever by PeekabooCaribou · · Score: 2, Funny

    Not lost forever, but lost until we travel to Mars and retool it as an extraterrestrial barbeque grill.

    --
    "I'll say it again for the logic-impaired." -- Larry Wall.
  60. Good posts! by electromaggot · · Score: 2, Interesting

    ...and I'm not saying that just because we agree. Yours are good additional insights (hence your "insightful" mods up! :-)

    I agree with the reply-post below too, saying that if they'd made their system a bit more fault-tolerant, then the problem might have been more easily recovered from. Sixty reboots in a row in a day seems a little excessive! Don't they have counters to detect that very thing? Don't they have a failsafe/debug OS burned into ROM (not flash) to load automatically in just such an event? Such are the risks when you're reloading a whole new OS remotely!
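
    The kind of counter asked about above could look something like this minimal Python sketch. It is an assumption about how one might do it, not a description of what the rover has: `store` stands in for any storage that survives a reset, and the names and the threshold of 3 are arbitrary.

```python
def boot(store, max_failures=3):
    """Count boots that never reached a healthy state; pick the boot mode."""
    count = store.get("unfinished_boots", 0) + 1
    store["unfinished_boots"] = count          # persisted before anything risky runs
    return "safe" if count > max_failures else "normal"


def mark_healthy(store):
    """Call once the system has run long enough to be considered stable."""
    store["unfinished_boots"] = 0
```

    With these defaults, three boots in a row that die before `mark_healthy()` make the fourth one come up in "safe" mode, whatever that means for the system in question (a ROM image, a minimal OS, a mode that skips mounting flash).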

    However, maybe they do have such things, or equivalent. I don't think their method of recovery was "accidental" (or a hack) either, although I'm making assumptions and I haven't seen their spec. The key is that they recovered from the error... and I now assume that they have recovered completely.

    What I found interesting was NASA's initial assessment that the flash ROM was failing -- a hardware failure. The media jumped all over that and reported it, so the rest of us were thinking, "Great, the rover is crippled and will never be the same. :-( "

    Now, turns out it was just a software error. Where's the mainstream media now? ("EE Times" is hardly mainstream!) Can the rover's recovery now be considered a "complete recovery"?

    If this story goes mainstream, will it make NASA look bad for screwing up... or look good for making a full recovery? I'm not sure. (Of course, smart people make mistakes too, lots of them, but the key to being smart is covering your ass beforehand! :-)

  61. Except just one thing: by Chemisor · · Score: 3, Insightful

    > What on earth (or on Mars) could we possibly take away from this experience?

    Rule 3: Never ignore the return value from open.

  62. Re:One reasonable anology by sjames · · Score: 3, Interesting

    > well, this presupposes that what caused the problem in the first place also didn't mess up the hardware watchdog as well.

    Nothing's perfect. It also presupposes that the sun didn't explode and vaporize the Earth and that God didn't get ticked off and squish it with his thumb, So What?

    A watchdog is a VERY simple device. A simple countdown timer, a control register with associated address decode, etc. It's quite unlikely to fail. When the timer hits zero, it strobes reset. Any access to the port address resets the countdown timer.
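
    To show just how little logic that is, here is a toy software model in Python. A real watchdog is hardware; the `Watchdog` class, its method names and the tick counts are invented for illustration.

```python
class Watchdog:
    """Toy model: a countdown timer that strobes reset when it reaches zero."""

    def __init__(self, timeout_ticks, reset_line):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.reset_line = reset_line      # callable standing in for the reset pin

    def kick(self):
        """Models any access to the watchdog's port address."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Models one clock period of the countdown timer."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.reset_line()             # strobe reset
            self.remaining = self.timeout_ticks
```

    Healthy software calls `kick()` more often than every `timeout_ticks` ticks; software that is lost or confused stops doing so and gets reset.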

    Some dual processor boards are even set up to alternate which is the boot processor, so they can come up with a single failed CPU.

    There is always some sort of problem that precludes recovery. No amount of software or clever design can help you if the device is destroyed. However, that doesn't mean don't even try.

  63. you, too, can have this capability on earth... by sommerfeld · · Score: 4, Informative

    It's not that hard to pull off this sort of seemingly amazing remote recovery with pure off-the-shelf tech if you plan for it in advance and are willing to pay a modest premium.

    You need remote serial console access -- ideally including firmware/bios serial console access -- and remote power cycling, controlled by a small embedded system, either in separate units (APC masterswitch, terminal servers) or as part of the system unit (common on Sun gear as "LOM"/"ALOM"/etc.; some of this is also creeping into x86 mobos). All this lets you regain control of the system remotely.

    Then it becomes a matter of hardening the system to let you recover from various other insults. Never let go with both hands: Mirrored disks (protecting against hardware failure) and multiple bootable partitions (protecting against software or human error) can both be used; netbooting is also a nice capability to have when you've got a bunch of servers in the same place.

    Disclaimer: I bet you can do much of the above with other people's gear, but I work for Sun and I know it works for me...

  64. Launching with incomplete code is common by rarose · · Score: 4, Interesting

    The en route time for Cassini to get to Saturn was 7 years; rather than push back an already long mission they launched with feature-incomplete code. They knew they had 7 years to get the software fully functional and debugged; they've updated it remotely from millions of miles away a number of times now.

    I'm sure the rovers did the same thing... Develop the launch/cruise software before you launch (and of course try to get as much of the entry/landing code done as you can!), and then uplink the final code before it's needed. Therefore it doesn't surprise me one bit that the JPL engineer knew there were shortcomings in the launch software.

    Hell, I develop BIOS for servers and we do it all the time. The BIOS image we give the hardware engineers for initial bringup is usually *way* short of features that will be there when it actually gets used by the customers!

    --
    --Rob