Slashdot Mirror


Mars Rover Spirit Back Online

Skyshadow writes "Just in time for the arrival of its twin, the Spirit Mars Rover is back in working order. Programmers at the JPL have traced the problem to the rover's flash RAM, which it uses to maintain its filesystems. They are using a ramdisk in the rover's RAM to bypass the bad flash memory, and are working on a workaround for the bad flash. Good news, but the rover is still potentially weeks away from full operational status."

21 of 386 comments

  1. Checksums by Anonymous Coward · · Score: 1, Insightful

    Sounds to me like they need to send back checksums of the contents of the Flash memory and figure out if part of it got corrupted somehow. Then re-flashing that section would probably fix the problem.
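
    A minimal sketch of that idea, assuming the ground team holds a known-good copy of the flash image and the rover can report one checksum per block. The block size, the use of CRC-32, and the function names are illustrative assumptions, not anything JPL has described.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical block size; real flash erase blocks vary


def block_checksums(image: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return one CRC-32 per fixed-size block of the image."""
    return [
        zlib.crc32(image[offset:offset + block_size])
        for offset in range(0, len(image), block_size)
    ]


def corrupted_blocks(reference: list[int], reported: list[int]) -> list[int]:
    """Indices of blocks whose reported checksum differs from the reference."""
    return [i for i, (ref, rep) in enumerate(zip(reference, reported)) if ref != rep]


# Ground copy of the image versus a remote copy with one flipped byte in block 2.
ground_image = bytes(range(256)) * 64  # 16 KiB, i.e. four blocks
remote_image = bytearray(ground_image)
remote_image[2 * BLOCK_SIZE + 10] ^= 0xFF

bad = corrupted_blocks(
    block_checksums(ground_image),
    block_checksums(bytes(remote_image)),
)
print(bad)
```

    Only the checksums cross the link, so the mismatching block indices tell you which sections would need re-flashing.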

  2. The mission is not yet out of danger by Space_Soldier · · Score: 1, Insightful

    The status has been upgraded from critical to serious condition. Opportunity will most likely have the same problem since they are twin brothers and had an identical build process. They better figure out what is wrong with this rover before sending Opportunity to investigate its part of Mars.

    1. Re:The mission is not yet out of danger by endersdouble · · Score: 1, Insightful

      Y'know, I don't think you are right about that, actually. Defective flash RAM just happens sometimes...just because the same parts and process built Opportunity does not mean that the same part will have a flaw. What they should have done is test the RAM on Earth, but even so, most likely it'll be fine (we hope.) And even if it isn't, what are you going to do now?

  3. Monday morning quarterback by GGardner · · Score: 5, Insightful

    If I was sending an embedded control computer to another planet, I would have chosen an OS with memory protection, not VxWorks. VxWorks is like DOS, and early versions of Windows, where one pointer problem in one task can corrupt the whole system. Sure, we don't know that's the problem now, but it would be nice to know for sure that it wasn't.

  4. Software / Hardware Breakthrough? by Saeed+al-Sahaf · · Score: 4, Insightful
    This is remarkable, and a testament to good software / hardware integration. It is true that I think this money could have been better spent elsewhere in terms of our understanding of the universe, but still, these types of projects and the hardships that come with them teach miles of experience in remote software / hardware problems.

    I do seriously wonder if these types of projects will tell us anything more than esoteric wonders of Mars, but from a strictly engineering standpoint, perhaps it's worth it after all.

    --
    "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
  5. Re:Static Discharge? by juglugs · · Score: 2, Insightful

    Doubt it

    I'd hope that the RAM is in a shielded box given the amount of radiation it's getting from the sun and the rest of space.

    Could be Soft Errors caused by Alpha particles though - depends on the technology used in the flash - unlikely, but possible...

    --
    This sig is in Spanish when you're not looking....
  6. Re:Not "online" at all... by Mr.+Darl+McBride · · Score: 3, Insightful
    You can read all about it at: Spaceflight Now - where you can continue to follow the status of both spirit and opportunity

    Nicely karma-whored. That's the link from the article. :)

  7. Re:Where is the redundancy? by cperciva · · Score: 4, Insightful

    It's not just a $5 flash ROM. If they wanted control redundancy, they would need extra flash RAM, RAM, ROM, CPU, motherboard, arbitration hardware, and arbitration software.

    Also keep in mind that this isn't a $5 flash ROM chip. When you consider the hostile environment, the testing, the power, and the fuel required to get everything to Mars, that flash ROM probably cost at least fifty thousand dollars.

  8. Re:heh... /. was right! by cubicledrone · · Score: 4, Insightful

    Amazing, isn't it? Writing comments correctly debugging an $800 million spacecraft on another planet without even looking at it, and most programmers still can't rent a fuckin' job.

    Now let's all sing the company song...

    --
    Business isn't willing to pay for products, innovation and careers, so we get brands, mortgage commercials and layoffs.
  9. Re:Monday morning quarterback: RTOS tradeoffs by GGardner · · Score: 4, Insightful
    memory protection adds overhead and can affect real-time performance

    This is the conventional wisdom, and in my experience, this particular nugget causes more embedded and real time software projects to fail than any other.

    First off, on a modern PowerPC processor, memory protection (that is, without virtual memory support) can be implemented very cheaply. If you can do it just with the IBAT/DBAT registers, it should be a constant-time overhead, which is good enough for hard-real time. Oddly enough, I can't find a single reference on the net that measures the cost of memory protection alone on a modern CPU. Anyone? Anyone?

    Secondly, though the rover certainly may have some software components that have hard-real time requirements, that doesn't mean that every single line of code does. Typically, less than 1 percent of the code in a real time system is hard real time. In that case, you can run the real-time code in ISRs, or perhaps in a dual-mode system, like RT-Linux, or in high-priority kernel threads (as with QNX). In any of these situations, you can run all the rest of the code in protected memory space.

  10. Re:Warranty by questamor · · Score: 4, Insightful

    Curiously, is there any difference with flashram on Spirit, and the stuff we have here? I didn't know about any radiation hardened flash ram... or even if there's any difference between the physical chips themselves in CF, SD, MemorySticks etc.

    The nasa report mentioned the problem seems to be revolving around the software that accesses the flashram. It could be filesystem corruption, or a physical problem with the flash ram itself, or even a broken interface to the flash ram. It's about the equivalent of having a machine a thousand miles away and just seeing that a certain drive won't mount, at the moment. Finding out whether there's a problem with the SCSI card it's connected to, or the drive itself, or a filesystem corruption, or a head crash... that comes in the next few weeks

  11. Re:Where is the redundancy? by Anonymous Coward · · Score: 1, Insightful

    I'm watching NASA tv at the moment and they're explaining possibilities now. At the moment, they only have a very broad explanation of what's going wrong. However the newest knowledge is:

    There are two separate flash memories on Spirit. At the moment, part of the problem is software which can read part of the flash memories as some of the operational software which is kept in flash ram seems to be coming up before the system reboots.

    The system is rebooting no matter which flash memory is being accessed, it has the same bug both ways, so the flash ram itself looks to be OK, but the interface between the flash ram and the software looks to be causing resets.

    Even if there were a dozen 1GB flashrams, it looks like they'd still have this problem. Perhaps they'd need many, all on different controllers, and even an entire backup computer. At 100 watts total power available for the rover, an entire extra computer may be a bit much to fit.

    At the moment the family of components that are involved with the hardware addressing of the flash memories looks to be where the problem is.

  12. Relative positions of Earth and Mars by LouisvilleDebugger · · Score: 3, Insightful

    The present series of orbiters/landers (Nozomi, Mars Express, Spirit, Opportunity) were launched at such a time as to take advantage of the most optimal Mars-Earth configuration for something like 60,000 years. I believe the bottom line is that it was a time you could get the most science there for the least cost of launch.

    Shame on my fellow American who said we should strip Beagle 2 and leave it up on cinderblocks. If Beagle is ever discovered to have soft landed, I would think the only proper thing to do would be to restore whatever's wrong with it, and let it complete its mission. (HAL, V'Ger, anyone?) Given the discussion of things like the effects of radiation exposure on electronics, you'd just have to be interested to know what a 50-or-150-year-old "dead" lander might be able to wake up and do.

    If Spirit's problems aren't resolved, the Mars Scorecard should at least reflect that Beagle was the less expensive failure.

    (Disclaimer: I visited England for the first time last year, and falling in love with the whole place doesn't begin to describe it. R.I.P. Beagle 2. *sniff*)

  13. Remote nonsense by fm6 · · Score: 2, Insightful
    Just wait until we have Interplanetary, Interstellar, Intergalactic Remote Desktop. I'm only half-joking.
    No you're not. All these Mars glitches are exactly why real space exploration entails sending an actual carbon-based unit, not a glorified laptop.

    Consider that an interstellar probe will take years to receive updated instructions. By which time, any fix will probably be irrelevant. Plus if they're more than 30 light-years away (practically next door by galactic standards) the guy who sent out the instructions probably won't live long enough to find out if they worked!

  14. Re:2 years ago, back at NASA R&D... by CrazyJoel · · Score: 2, Insightful

    isn't the budget for social welfare something like 300 billion dollars?

    If only we spent so much on NASA. They only get 12 billion.

    --

    Such is the infinite Grace of Popeye.
  15. Re:Where is the redundancy? by grozzie2 · · Score: 2, Insightful
    Really, they shouldn't have one of anything

    I think you folks all missed the point completely. They have full dual redundancy on EVERYTHING in the MER program. Not only are the computer systems somewhat an issue, there's little issues like landing in one piece, etc etc. To that end, they built 2 full systems, packaged them on 2 different rockets, and fired them off a month apart from each other. This gave full dual redundancy to every system and every component, from the initial launch igniters, to every bit of hardware that landed on the surface. Then to maximize the redundancy, they set them to land on different halves of the planet, serious physical isolation of component set A and component set B.

    If one complete set of hardware arrives on the surface, and returns scientific data, the mission is considered a success. The real issue and difference of this program is, they went with dual redundancy in everything, from launchers to arrival, not just separate systems mounted on the same physical hardware host.

    This type of full redundancy does make a lot of sense when you consider, the highest risk portion of the mission is the entry and landing phase, followed closely by the launch phase. Dual redundant systems mounted on the same rover platform may well give for better chances of success whilst on the surface portion of the mission, but leaves a huge single point of failure during launch, and another one during entry and landing.

    Take a peek over at NASA TV, and you will see what real mission redundancy is all about, second lander about to enter Martian atmosphere.

  16. Re:The epitome of remote administration by IgnoramusMaximus · · Score: 2, Insightful
    All the expense and complexity, gone!

    I assume you are being really, really sarcastic. In reality the PCs multiplied complexity and expense by orders of magnitude. And only now after decades of chaos and misery all the turd-brained suckers/managers who were responsible for this snake-oil sales bonanza which created the likes of Microsoft are now retreating to the only sane method of enterprise computing: centralized storage and processing. After billions of dollars wasted and con-men very rich by now, the circle is now complete. Of course the managerdiots are now seeing this old idea as "new" after someone smart re-labeled those old concepts with new sales tags like "thin client" or "data warehousing" etc. They would never have allowed someone to hint that they have been taken for a ride and the "priesthood", unlike them, actually had a clue. It is a sad testament to the depths of human stupidity.

    PCs are the greatest thing ever for game players, home computer users and many other applications like engineering or science. They make no sense on administrative workers' desks, yet those presently constitute something like 80% of business PC deployment and come with hordes of MCSEs without whom they would come to a grinding halt within days.

  17. Re:Follow the status? by datan · · Score: 2, Insightful

    just wondering something. when they say 'currently' do they mean now or light-time ago? eg. they confirmed cruise stage separation less than a minute after it "happened"
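
    For a rough sense of the gap between something happening at Mars and Earth hearing about it, here is a small worked calculation. The distance below is an illustrative assumption, not the actual Earth-Mars distance on any particular date; only the speed of light is a hard number.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # exact by definition


def one_way_light_time_s(distance_km: float) -> float:
    """Seconds a radio signal needs to cover the given distance."""
    return distance_km / SPEED_OF_LIGHT_KM_S


# Assumed, illustrative distance; the real figure changes from day to day.
ASSUMED_DISTANCE_KM = 200_000_000

delay_s = one_way_light_time_s(ASSUMED_DISTANCE_KM)
delay_min = delay_s / 60
print(f"one-way delay: {delay_s:.0f} s (about {delay_min:.1f} minutes)")
```

    At that assumed distance a signal takes on the order of eleven minutes one way, so any event confirmed on Earth had already happened at Mars by that margin.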

  18. Cut it out! by Dun+Malg · · Score: 3, Insightful

    OK, you dorks (you know who you are) need to stop postulating about the memory failures having to do with static electricity, martian dust, or lack of redundancy. This is JPL and (the one case of metric vs. standard aside) they thought of all the obvious stuff during the design stage. Do you really think they're slapping their foreheads and saying "the dust! we forgot about the dust!" over in the design lab? Get real, people.

    --
    If a job's not worth doing, it's not worth doing right.
  19. Re:Monday morning quarterback: RTOS tradeoffs by AaronW · · Score: 3, Insightful

    As someone who has programmed VxWorks (including AE) for several years, I can say AE is a buggy piece of crap. We moved to AE for our project and eventually had to dump it since it was so buggy and slow. Also, as far as flash filesystems go, VxWorks ONLY SUPPORTS FAT, and not even FAT32, so it isn't a very robust filesystem. Not only that, because it's FAT there is no wear level support. I believe there also isn't the equivalent of chkdsk either. I also imagine that it can't handle faults in the filesystem (as if anything ever could deal with faults in a FAT filesystem very well).

    With VxWorks you can often get away without any filesystem because all the code is linked together in one big monolithic file. Separate tasks are not separate files (although you can have loadable object files).

    Yes, AE does provide memory protection domains, but it still doesn't clean up after a task dies. Sure, you can free the memory, but not open files, semaphores, pipes, or other things. Malloc in AE is improved over the braindead implementation in standard VxWorks, but it still has a long way to go. For example, it can't free up open file descriptors, semaphores, or other items associated with a task because a task usually isn't associated with it. So if you have a task that acquired a semaphore and dies, that semaphore will never be released.
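
    The stuck-semaphore case can be sketched with an analogy in Python threads. This is not VxWorks code: `threading.Lock` merely stands in for a semaphore that has no owner tracking, and the names are made up for illustration.

```python
import threading

# Stands in for a semaphore guarding a shared resource.
resource_lock = threading.Lock()


def worker_that_dies() -> None:
    resource_lock.acquire()
    # The worker ends here without ever calling release().


worker = threading.Thread(target=worker_that_dies)
worker.start()
worker.join()

# The worker is gone, but nothing released the lock on its behalf.
still_held = resource_lock.locked()
can_acquire = resource_lock.acquire(timeout=0.1)
print(still_held, can_acquire)
```

    Every later task that waits on that lock blocks until something outside the dead task releases it, which in the situation described above means a reboot.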

    Hell, Wind River couldn't even get malloc right! Their malloc has got to be the worst implementation I've ever seen! They place free blocks in sorted order (smallest to largest) in a linked list after attempting to combine a new free block with neighboring free blocks. The next time you allocate, it walks the entire linked list until it finds a block large enough! In our case we wound up with tens or even hundreds of thousands of small blocks causing our watchdog timer to kick in because malloc became impossibly slow. AE improves this to use a tree instead of a list, but it still fragments. I ripped out the Wind River implementation and replaced it with Doug Lea's dlmalloc and all our malloc problems were solved, and the fragmentation went from tens of thousands of fragments to only a few dozen.
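
    A toy model of why that sorted free list gets slow, assuming a plain list of free-block sizes in ascending order. It is not Wind River's code or dlmalloc; the binary search is only a stand-in for a tree-style lookup, and the block counts are invented for illustration.

```python
import bisect


def linear_first_fit(sorted_sizes: list[int], request: int) -> tuple[int, int]:
    """Walk a size-sorted free list; return (index of first fit or -1, nodes visited)."""
    visited = 0
    for index, size in enumerate(sorted_sizes):
        visited += 1
        if size >= request:
            return index, visited
    return -1, visited


def indexed_first_fit(sorted_sizes: list[int], request: int) -> int:
    """Binary search over the same sorted sizes, standing in for a tree lookup."""
    index = bisect.bisect_left(sorted_sizes, request)
    return index if index < len(sorted_sizes) else -1


# 100,000 small 16-byte fragments plus one 4096-byte block at the end.
free_sizes = [16] * 100_000 + [4096]

index, visited = linear_first_fit(free_sizes, 1024)
tree_index = indexed_first_fit(free_sizes, 1024)
print(index, visited, tree_index)
```

    Both lookups land on the same block, but the list walk touches every small fragment first, so each allocation costs time proportional to the amount of fragmentation.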

    For an RTOS being pushed for networking it isn't very good there either. It comes with an ancient BSD TCP/IP stack. If you have a device and want to see if it runs VxWorks, just run nmap against it. If it says TCP sequence number guessing is trivial, you can bet it's probably running VxWorks.

    In today's world, VxWorks doesn't cut it any more. Any complex project should choose a real OS like QNX or even embedded Linux over VxWorks. For realtime, Linux usually isn't very good, but Timesys appears to have solved that problem nicely.

    VxWorks isn't even that good at realtime. Usually you can't get any better resolution than half the system tick rate (usually 10ms), so you can't get better than 20ms of resolution in many cases.

    I've also heard many rumours that Wind River is dropping AE, or at least not pushing it. We're not the only ones to have been burned by it. I've heard of only one other company that used it, and they were also burned. I think it was a startup that went out of business.

    In VxWorks, all tasks share the same memory space. Think of every "task" as really a thread and you get the idea. In other words, if a "task" dies, the only way to clean up the system is to reboot.

    Also, VxWorks doesn't scale. The more tasks you have, the slower it runs (i.e. no O(1) scheduler). And with the shared memory, the more complex the code, the harder it is to debug and develop a stable system.

    QNX would have been a much better solution. In QNX, the core OS is very small, and if a task dies it can easily be restarted. In QNX, everything is a task with memory protection. The TCP/IP stack is separate from the core OS, for example, as are all the other drivers. If a driver crashes, it won't take the OS with it. Context switching in QNX is also very fast, faster than VxWorks even though memory protection is involved.

    -Aaron

    --
    This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
  20. Re:heh... /. was right! by HermanAB · · Score: 2, Insightful

    Company song? "You load 16 tons, what do you get? A little bit older and deeper in debt..."?

    --
    Oh well, what the hell...