Mars Rover Spirit Back Online
Skyshadow writes "Just in time for the arrival of its twin, the Spirit Mars Rover is back in working order. Programmers at the JPL have traced the problem to the rover's flash RAM, which it uses to maintain its filesystems. They are using a ramdisk in the rover's RAM to bypass the bad flash memory, and are working on a workaround for the bad flash. Good news, but the rover is still potentially weeks away from full operational status."
They signed up for Mars Online with 3000 free hours. What they didn't realize was that the free 3000 hours only applied to the first month of service. Once they paid their MOL bill, they got hooked back up. All the probe's friends on Mars use MOL!
They should boot faster, using Linux. Then they'd only be ten seconds away :-)
"'I pass the test,' she said. 'I will diminish, and go into the West, and remain Galadriel.'"
- JRR Tolkien.
/riff/Move over Rover, let the ramdisk take over!/riff/
Wonder where they got their flash RAM from?
--
I think they should return the bad flash part to where they got it and exchange it for a new part... although getting the memory back to the store by the 30 day warranty might be a little difficult.
I hope they bought the extended warranty.
/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i
it was their AOL bill that wasn't paid? hmmmm...
Linux with kernel panic...
During all of the "Spirit is broken" columns, I kept reading /. comments saying that it was likely a memory error due to the non-consistent errors...I guess a million monkeys with a typewriter can be correct :-)
Doh!
Engineers guessed that Spirit's troubles were in its Flash memory and set about sending the rover a complex series of instructions to see if they could get it to bypass the corrupted memory. Theisinger said engineers sent Spirit a command just before its daily "waking up," telling it to shut down and restart in what is known as "cripple mode," using RAM instead of Flash for its start-up instructions.
Some people may take this sort of thing for granted, but I for one find it remarkable that we can essentially reboot and perhaps even fix a system that is on a whole other planet.
Just wait until we have Interplanetary, Interstellar, Intergalactic Remote Desktop. I'm only half-joking.
The coolest voice ever.
If I understand this properly, they've got a damaged filesystem on the flash RAM. Not really a big problem, you just have to send someone over to the console to boot it up in single-user mode and run fsck. ... oh yeah, sending someone over to the console is a little bit difficult here. :)
Shouldn't they have like 5 flash RAMs? Really, they shouldn't have one of anything. In my computer if my BIOS fries, I pop open the box and replace it. If it fries on Mars, obviously I kiss my megamillion dollar project goodbye, all for a $5 flash ROM.
Engineer 1: Ho-hum.. Little bit of ... whatever it is, 'ere... Hand me that thingamajig, will you?
Engineer 2: Yah, sure... Hey, remember that employee last month who got laid off within a week?
Engineer 1: Who? Vincent?
Engineer 2: Yeah, Vinnie... With the Italian accent?
Engineer 1: Yeah, him. What about the guy?
Engineer 2: Well, he has this offer on cheap RAM we just CAN'T resist!
Engineer 1: Really now? But-
Engineer 2: Look, our budget is already comparable to social welfare. We need to save some loot.
Engineer 1: Fair enough, buy the crap and hand me the other twisty-turny thingy over there? I need to screw on this name tag reading... "Spirit"?
Engineer 2: Look, it's either that or my wife's name.
Hate me!
If I was sending an embedded control computer to another planet, I would have chosen an OS with memory protection, not VxWorks. VxWorks is like DOS, and early versions of Windows, where one pointer problem in one task can corrupt the whole system. Sure, we don't know that's the problem now, but it would be nice to know for sure that it wasn't.
Is there a chance that the problem could've been caused by electrostatic discharge? Rover bounces on rubber airbags on sand, bags fold up, Rover rolls off, Rover touches rock - zap!??
...will apparently cause one out of every trillion bits on Earth to flip randomly... I guess with less of an atmosphere, it is a bigger problem on Mars! ;)
I do seriously wonder if these types of projects will tell us anything more than esoteric wonders of Mars, but from a strictly engineering standpoint, perhaps it's worth it after all.
"Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
Here is the link to the real story. The one given in the /. article is getting pushed down spaceflight's page.
I have a friend who works in the field. Space travel hoses electronics bad. Triple redundancy and over-engineering is the name of the game. This is nice to hear. I would imagine that something went wrong in transit or on landing, but they can keep going.
Great ideas often receive violent opposition from mediocre minds. - Albert Einstein
I know a lot of people are using flash RAM in smaller computers for booting Linux or whatnot. If they are writing their logs and other things to that flash, be aware that you can only write to it so many times before it fails.
Was NASA writing to that flash or just reading? A RAM drive in flash sounds like it will access/write thousands of times a minute(?). That should wear it out quickly.
If I was sending an embedded control computer to another planet, I would have chosen an OS with memory protection, not VxWorks.
Actually, they might have protected memory if they use VxWorks AE RTOS/Tornado Tools 3.0. Spirit uses VxWorks, but I don't know what version they used or when they had to commit to a particular version of VxWorks.
Also, as the article mentions, memory protection adds overhead and can affect real-time performance. Hard real-time software cannot afford to have a complex layered structure and lots of conditional code that adds unpredictable delays. For that reason, many really real-time applications run very close to the hardware (for better or for worse.)
Two wrongs don't make a right, but three lefts do.
I mean, like, Beagle isn't using its flash RAM anymore, just go and jack some off it. While you're at it, TAG the Beagle with some PRO-US graffiti :) Hell, maybe it's got nicer rims too.
Seriously, can you imagine the first manned expedition seeing the Beagle jacked up, tagged, up on little Martian cinderblocks? All that, and we'd already have a head start on building Martian cities.
Nicely karma-whored. That's the link from the article. :)
I've been unable to find any hard information on the design of the MER memory systems. If anyone can point me to a technical brief I'd be very happy.
From what I've pieced together the MER system is something like this:
One RAD6000 PowerPC CPU.
Connected, probably via CompactPCI, to 128 MB of ECC SDRAM.
256 MB of flash. No info on what make of flash, but likely Intel since they are the biggest. There was some info from the press conference that there are actually two flash chips and that the flight software is redundantly stored on each. So does this mean that there is actually 128 MB of redundant flash? Also it was said that they had problems even with the redundancy; could they possibly have overwritten something? We all know that even a redundant RAID does not stop filesystem corruption.
No information on how the flash is connected (parallel or serial?) or on how the redundancy works.
Btw, I guess flash is rather radiation hard since they require 10 - 20V to erase / write.
They don't. See DoD Bids About 3/4 of the way down the page.
Title: Rad Hard Flash Technology
Abstract: The highest density radiation hardened non-volatile (NV) memory currently available is a 256 kbit EEPROM based on SONOS technology. One of the major limitations in developing rad hard NV memory has been the cost in bringing up the NV technology in a dedicated rad hard process facility, especially when weighed against the limited market size. One way to bring radiation hardening to an advanced electronic product on a cost-effective basis is to leverage the commercial product by applying the hardening to the commercial fab instead of bringing the commercial technology to the rad hard fab. NV flash memory technology is popular in the commercial marketplace, with densities up to 256 Mbit in production. Unfortunately, flash memory is not available, at any density, in total dose rad hard versions. And, most commercial flash memories are so soft that impractical amounts of shielding are required to survive even moderate radiation environments. This effort will be the first step in developing rad hard flash technology at a commercial fab. Rad hard flash technology will be a near-term solution to the problem of high density NV memory for space applications. It will enable the development of rad hard flash memories and embedded NV memory for rad hard ASICs.
Flash...the weakest link...
They should stick with purple next time.
...and it's amazing NASA could press it at the right time from 124 million miles away (1.3 AU). Although I wonder how many times NASA did have to press it before they got the timing right -- we only know about the success :-)
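For a sense of what "the right time" means at that range, here is a rough back-of-the-envelope sketch of the signal delay. It takes the 124 million mile figure from the comment above at face value (an assumption, since the Earth-Mars distance changes daily) and uses only the defined values of the mile, the speed of light and the astronomical unit. The function name is made up for illustration.

```python
MILES_TO_KM = 1.609344              # international mile, exact by definition
SPEED_OF_LIGHT_KM_S = 299_792.458   # exact by definition of the metre
AU_KM = 149_597_870.7               # astronomical unit in kilometres


def one_way_light_time_s(distance_miles: float) -> float:
    """Seconds a radio signal needs to cover the given distance."""
    return distance_miles * MILES_TO_KM / SPEED_OF_LIGHT_KM_S


# Assumption: the distance quoted in the comment, not an ephemeris value.
distance_miles = 124e6
delay_s = one_way_light_time_s(distance_miles)
distance_au = distance_miles * MILES_TO_KM / AU_KM

print(f"distance: {distance_au:.2f} AU")
print(f"one-way delay: {delay_s / 60:.1f} min")
print(f"round trip:    {2 * delay_s / 60:.1f} min")
```

So the quoted distance does work out to about 1.3 AU, a command takes on the order of eleven minutes to arrive, and any confirmation takes about as long to come back.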
I have had some tough calls in my time but I have never had to walk a robot 283 million miles away through brain surgery. Man, I am glad I did not get that call. This is going to blow their call averages all to hell. I raise a cup of Joe to you, Rover Help Desk man.
Papa Legba come and open the gate
This is the last image received prior to the recent issues with Spirit...
This is the conventional wisdom, and in my experience, this particular nugget causes more embedded and real time software projects to fail than any other.
First off, on a modern PowerPC processor, memory protection (that is, without virtual memory support) can be implemented very cheaply. If you can do it just with the IBAT/DBAT registers, it should be a constant-time overhead, which is good enough for hard-real time. Oddly enough, I can't find a single reference on the net that measures the cost of memory protection alone on a modern CPU. Anyone? Anyone?
Secondly, though the rover certainly may have some software components that have hard-real time requirements, that doesn't mean that every single line of code does. Typically, less than 1 percent of the code in a real time system is hard real time. In that case, you can run the real-time code in ISRs, or perhaps in a dual-mode system, like RT-Linux, or in high-priority kernel threads (as with QNX). In any of these situations, you can run all the rest of the code in protected memory space.
I remember in the last thread about the rover, someone opined that it was bad memory, then proceeded to give a half dozen reasons why. Totally nailed it.
Yeah, in the future NASA should just submit an Ask Slashdot whenever something goes wrong..
Opportunity is fast approaching the red planet. It should be an interesting night at JPL. Excellent work guys, good luck.
[alk]
Here's a rant by a JPL guy about appropriate technologies for software on deep space probes. He recounts one story of a failed probe "100 million dollars, and 100 million miles away".
They fixed it. The fact there was a lisp REPL running on the spacecraft helped.
That's cool:
(unwind-protect
    (progn (do-science) (talk-to-earth))
  (wait-in-repl-for-earth))
where the russian cosmonaut says "American components, Russian components. They're all made in Taiwan!"
IANANE (I am not a NASA engineer), but.....
I'm not drunk, I just have a speech impediment. And a stomach virus. And an inner ear infection.
I'm watching NASA tv at the moment and they're explaining possibilities now. At the moment, they only have a very broad explanation of what's going wrong. However the newest knowledge is;
There are two separate flash memories on Spirit. At the moment, part of the problem is software which can read part of the flash memories as some of the operational software which is kept in flash ram seems to be coming up before the system reboots.
The system is rebooting no matter which flash memory is being accessed, it has the same bug both ways, so the flash ram itself looks to be OK, but the interface between the flash ram and the software looks to be causing resets.
Even if there were more backup flash RAMs, it looks like they'd still have this problem. Perhaps many of them, all on different controllers, or even an entire backup computer, would have prevented this. At 100 watts total power available for the rover, an entire extra computer may be a bit much to fit. But then sending two rovers would also negate problems, and that's just what they've done.
It seems most likely at the moment, according to NASA, that the family of components that are involved with the hardware addressing of the flash memories looks to be where the problem is.
I don't know that much about VxWorks, but I heard that one of its main assets is having a very small, tight multitasking kernel.
They were able to regain the system, despite loss of a major computational component. Remotely. Through a debug link. That sure says a helluva lot for the robustness of the OS and how they configured it.
Good job, JPL.
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
The Spirit is willing, but the flash is weak.
(Posted by Jane Slee and John Stracke in separate usenet postings.)
They didn't use Windows CE. Remember the diplomat months back that got locked in his 7 series BMW because of a computer crash? :)
There is a big difference between standard flash and radiation hardened flash. In fact we are designing a project with one of these VME buss units as a storage array.
The present series of orbiters/landers (Nozomi, Mars Express, Spirit, Opportunity) were launched at such a time as to take advantage of the most optimal Mars-Earth configuration for something like 60,000 years. I believe the bottom line is that it was a time you could get the most science there for the least cost of launch.
Shame on my fellow American who said we should strip Beagle 2 and leave it up on cinderblocks. If Beagle is ever discovered to have soft landed, I would think the only proper thing to do would be to restore whatever's wrong with it, and let it complete its mission. (HAL, V'Ger, anyone?) Given the discussion of things like the effects of radiation exposure on electronics, you'd just have to be interested to know what a 50-or-150-year-old "dead" lander might be able to wake up and do.
If Spirit's problems aren't resolved, the Mars Scorecard should at least reflect that Beagle was the less expensive failure.
(Disclaimer: I visited England for the first time last year, and falling in love with the whole place doesn't begin to describe it. R.I.P. Beagle 2. *sniff*)
So... I wonder if they'll consider validating MRAM more quickly if Flash is found to be more error prone.
You know how NASA works. The Space Shuttle running on 486's and whatnot. I understand the science behind that reasoning, as sad as a 66 MHz processor seems to us geeks nowadays, but I wonder if MRAM will prove more flexible and stable for future space missions.
"...Well, there's egg and bacon; egg sausage and bacon; egg and spam; egg bacon and spam; egg bacon sausage and spam..."
Consider that an interstellar probe will take years to receive updated instructions. By which time, any fix will probably be irrelevant. Plus, if they're more than 30 light-years away (practically next door by galactic standards), the guy who sent out the instructions probably won't live long enough to find out if they worked!
This is like a reality tv show, I love Nasa Tv!
With the exception that this is actually real...
Alito: A vote for Alito is a punch in the eye to put that bitch back in her place!
Flash RAID array.
(Can this even be done?)
Weeks of coding saves hours of planning.
If you don't get it on cable, you can watch NASA TV here.
Litigious bastards
Opportunity will most likely have the same problem since they are twin brothers and had an identical build process.
I quote from my post a couple of days ago:
Parent: So even if Spirit gives up the ghost, her kin can carry on the flame (albeit in a less interesting location).
Me: Not if the problem is due to a design fault. That's the drawback of sending multiple identical probes: if one is intrinsically fucked, they all are.
I now bask, contented, in the glow of my own brilliance....
Tubal-Cain smokes the white owl.
With all the radiation and very high energy particles zipping through the spacecraft on its way there, I'm surprised any computerized spacecraft get anywhere intact.
"It's so convenient to have a system where everyone is a criminal" - A. Hitler
A little too much grandstanding though.
I've noticed that a few people stand facing the cameras a lot, gesticulating wildly as if talking about something important.
I also saw one guy go from reading a magazine and sipping a martini to furiously typing away at a keyboard as the camera panned across the room!
Oh, and he has another quote I liked too:
But maybe I just like it because that's how I tend to fix things too.
just wondering something. when they say 'currently' do they mean now or light-time ago? eg. they confirmed cruise stage separation less than a minute after it "happened"
Well Done NASA.. bringing space to us is the next best thing to taking us to space.
0508 GMT (12:08 a.m. EST)
A good signal is still being received! Unlike the Spirit landing where signal was lost immediately after touchdown, Opportunity continues to talk to Earth.
0506 GMT (12:06 a.m. EST)
After a short loss of signal from the rover, a strong signal is now being received as Opportunity arrives on Mars!
0505 GMT (12:05 a.m. EST)
BOUNCING ON MARS! Mission Control has received a signal of Opportunity bouncing on the surface of Mars.
Besides robot exploration software would be handy right here. It would be neat to be able to send a research bot out in the deserts, deep oceans and jungle canopies of the world. Machines can go where we can't.
Individually you can be damn annoying sometimes, but I'm constantly amazed and delighted by the collective intelligence of the /. pack.
That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage
You have two or more running in parallel. While one is running, the next reloads from ROM. When it's loaded and synchronized, you switch to it, and load the next one. You do that in series, over and over, so you're only using any particular FPGA for a couple of seconds at a time, and their configurations are constantly being refreshed. It's a very simple idea that can be done now.
--- Ban humanity.
OK, you dorks (you know who you are) need to stop postulating about the memory failures having to do with static electricity, martian dust, or lack of redundancy. This is JPL and (the one case of metric vs. standard aside) they thought of all the obvious stuff during the design stage. Do you really think they're slapping their foreheads and saying "the dust! we forgot about the dust!" over in the design lab? Get real, people.
If a job's not worth doing, it's not worth doing right.
I was going to mod this one up, but I decided to give this reply some more emphasis by actually replying with some thoughtful encouraging words instead.
It would be nice to be able to have some folks at JPL throw down the source code and engineering schematics and say to the geek/space/engineering community at large "We have a problem here and could use your suggestions to see if we can get this fixed."
This (the Mars missions) is obviously a big hit, as measured by replies on Slashdot, the number of hits on the website at JPL, stories in mainstream media, and other reasonable metrics to gauge popularity of a project. I'm sure that there are several geeks out there that wouldn't mind digging into the source code.
The only reason I could see the engineers not wanting to do that is that it would open them up to obvious scrutiny for poor engineering and coding. (Whadda you mean the global variable named temp is the only variable? We also have temp2, temp3, and temp4. What do the numbers in those mean? You can get it from context, can't you?) That, and some people just aren't used to allowing others into their "domain".
Being 100% funded by public money should also be further reason for why this should be opened up. I also totally agree.
As someone who has programmed VxWorks (including AE) for several years, I can say AE is a buggy piece of crap. We moved to AE for our project and eventually had to dump it since it was so buggy and slow. Also, as far as flash filesystems go, VxWorks ONLY SUPPORTS FAT, and not even FAT32, so it isn't a very robust filesystem. Not only that, because it's FAT there is no wear-leveling support. I believe there isn't an equivalent of chkdsk either. I also imagine that it can't handle faults in the filesystem (as if anything ever could deal with faults in a FAT filesystem very well).
With VxWorks you can often get away without any filesystem because all the code is linked together in one big monolithic file. Separate tasks are not separate files (although you can have loadable object files).
Yes, AE does provide memory protection domains, but it still doesn't clean up after a task dies. Sure, you can free the memory, but not open files, semaphores, pipes, or other things. Malloc in AE is improved over the braindead implementation in standard VxWorks, but it still has a long way to go. For example, it can't free up open file descriptors, semaphores, or other items associated with a task because a task usually isn't associated with it. So if you have a task that acquired a semaphore and dies, that semaphore will never be released.
Hell, Wind River couldn't even get malloc right! Their malloc has got to be the worst implementation I've ever seen! They place free blocks in sorted order (smallest to largest) in a linked list after attempting to combine a new free block with neighboring free blocks. The next time you allocate, it walks the entire linked list until it finds a block large enough! In our case we wound up with tens or even hundreds of thousands of small blocks causing our watchdog timer to kick in because malloc became impossibly slow. AE improves this to use a tree instead of a list, but it still fragments. I ripped out the Wind River implementation and replaced it with Doug Lea's dlmalloc and all our malloc problems were solved, and the fragmentation went from tens of thousands of fragments to only a few dozen.
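To illustrate why the allocator described above gets slow, here is a toy model in Python. It is not Wind River's code and not dlmalloc: it only assumes what the comment says (free blocks kept in a list sorted smallest to largest, and a linear walk to the first block that is large enough) and compares that with a binary search over the same sorted sizes. The function name `first_fit_steps` and the block sizes are invented for the example.

```python
import bisect


def first_fit_steps(sorted_free_sizes, request):
    """Walk a size-sorted free list from the smallest block upward.

    Returns (index of the first block that fits, number of nodes visited),
    or (None, nodes visited) if no block is large enough.
    """
    visited = 0
    for index, size in enumerate(sorted_free_sizes):
        visited += 1
        if size >= request:
            return index, visited
    return None, visited


# Hypothetical fragmented heap: 100,000 tiny fragments and one large block.
free_sizes = [16] * 100_000 + [4096]

index, visited = first_fit_steps(free_sizes, 1024)
print(f"linear walk: found block {index} after visiting {visited} nodes")

# Binary search over the same sorted sizes, roughly what an ordered tree buys you.
print(f"binary search: found block {bisect.bisect_left(free_sizes, 1024)}")
```

With tens of thousands of small fragments ahead of it, every large allocation pays for a walk over all of them, which fits the watchdog timeouts described. An ordered structure shortens the search but does nothing about the fragmentation itself.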
For an RTOS being pushed for networking it isn't very good there either. It comes with an ancient BSD TCP/IP stack. If you have a device and want to see if it runs VxWorks, just run nmap against it. If it says TCP sequence number guessing is trivial, you can bet it's probably running VxWorks.
In today's world, VxWorks doesn't cut it any more. Any complex project should choose a real OS like QNX or even embedded Linux over VxWorks. For realtime, Linux usually isn't very good, but Timesys appears to have solved that problem nicely.
VxWorks isn't even that good at realtime. Usually you can't get any better resolution than half the system tick rate (usually 10ms), so you can't get better than 20ms of resolution in many cases.
I've also heard many rumours that Wind River is dropping AE, or at least not pushing it. We're not the only ones to have been burned by it. I've heard of only one other company that used it, and they were also burned. I think it was a startup that went out of business.
In VxWorks, all tasks share the same memory space. Think of every "task" as really a thread and you get the idea. In other words, if a "task" dies, the only way to clean up the system is to reboot.
Also, VxWorks doesn't scale. The more tasks you have, the slower it runs (i.e. no O(1) scheduler). And with the shared memory, the more complex the code, the harder it is to debug and develop a stable system.
QNX would have been a much better solution. In QNX, the core OS is very small, and if a task dies it can easily be restarted. In QNX, everything is a task with memory protection. The TCP/IP stack is separate from the core OS, for example, as are all the other drivers. If a driver crashes, it won't take the OS with it. Context switching in QNX is also very fast, faster than VxWorks even though memory protection is involved.
-Aaron
This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
I can tell you that AE is in many ways WORSE than the standard VxWorks. It has a lot more bugs and is quite a bit slower. Think of regular VxWorks with memory protection hacked in, not designed in from the ground up.
As a VxWorks programmer for the last 5 years, I can honestly say VxWorks is a PoS that is losing market share at a tremendous rate to the likes of embedded Linux and QNX. Wind River decided to spend tons of money buying add-in products like Routerware instead of improving their RTOS. It was a huge waste of money and now they're paying for it. They're losing money hand over fist and have had a lot of layoffs lately. They were good at one time, but they have fallen far behind the curve now in embedded RTOS design, especially for complex systems.
VxWorks comes with support for a FAT flash file system, a completely broken malloc implementation, an ancient BSD TCP/IP stack, poor RT support, no memory protection, and no way to clean up after a task that dies. Not only that, it usually costs a fortune, but I've heard they're willing to sell it very cheap now because they're desperate.
I looked into embedded Linux for our next generation hardware and software and Timesys appears to have a very nice solution with hard real-time support. The kernel is fully preemptable using semaphores instead of spinlocks and has priority inversion support. They also offer resource reservations, so I can say "I want this task to be guaranteed 5.73ms of execution time every 9.8ms" where after 5.73ms the task either gives up the CPU entirely, or else changes to a non-RT priority to not starve other tasks. It's really quite clever. Not only that, unlike RT-Linux there isn't a separate API for RT vs non-RT tasks. Monta Vista Linux is soft real-time. It cannot guarantee context switching time, nor does it deal with priority inversion. In RT, priority inversion can be a major problem (see the first Mars rover for an example).
For an example of priority inversion say you have 3 tasks, a low priority, medium priority, and a high priority. The low priority task acquires a mutex semaphore to protect a critical section and starts processing. It is interrupted by a medium priority task. Meanwhile, a high-priority task unblocks and attempts to grab the mutex. The high priority task will block until the medium priority task blocks so that the low priority task can release the semaphore. A common solution is priority inheritance. With priority inheritance, as soon as the high priority task attempts to acquire the mutex semaphore, the low priority task has its priority bumped to that of the high priority task until it releases the semaphore. In this way, the low priority task will interrupt the medium priority task so that the high priority task won't have to wait as long.
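The scenario above can be played out in a small tick-based simulation. This is only a sketch under simplifying assumptions, not how VxWorks or any real RTOS schedules: one CPU, one mutex, strict fixed-priority preemption, and the low and high priority tasks hold the mutex for all of their work. The task names, arrival times and work amounts are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int       # larger number = higher priority
    arrival: int        # tick at which the task becomes ready
    work: int           # ticks of CPU time still needed
    needs_mutex: bool   # True if all of its work is done holding the mutex
    finished_at: int = -1


def simulate(inheritance: bool) -> dict:
    """Run the three-task scenario and return each task's finish time in ticks."""
    tasks = [
        Task("low", 1, 0, 3, True),
        Task("medium", 2, 1, 5, False),
        Task("high", 3, 2, 1, True),
    ]
    owner = None  # task currently holding the mutex, if any
    tick = 0
    while any(t.work > 0 for t in tasks):
        ready = [t for t in tasks if t.arrival <= tick and t.work > 0]
        # A task is blocked if it needs the mutex and someone else holds it.
        blocked = [t for t in ready
                   if t.needs_mutex and owner is not None and owner is not t]
        runnable = [t for t in ready if t not in blocked]

        def effective_priority(t):
            # With inheritance, the mutex owner runs at the priority of its
            # highest-priority waiter.
            if inheritance and t is owner and blocked:
                return max(t.priority, max(w.priority for w in blocked))
            return t.priority

        if runnable:
            current = max(runnable, key=effective_priority)
            if current.needs_mutex:
                owner = current
            current.work -= 1
            if current.work == 0:
                current.finished_at = tick + 1
                if owner is current:
                    owner = None  # releasing the mutex unblocks waiters
        tick += 1
    return {t.name: t.finished_at for t in tasks}


print("no inheritance:  ", simulate(inheritance=False))
print("with inheritance:", simulate(inheritance=True))
```

In this toy run the high priority task finishes at tick 9 without inheritance, because it has to sit out the medium task's entire run, and at tick 5 with inheritance, because the low priority task is boosted just long enough to release the mutex.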
QNX is also a very good alternative. Very fast context switching and extremely robust memory protection. I think with QNX you can even buy a license suitable for use in medical devices (i.e. you absolutely cannot afford to have the OS crash for any reason).
I've heard rumours that Wind River is dropping AE since nobody is using it. After our experience, I pity whoever tries it.
Also, unless you get the source to VxWorks, which usually costs a lot of $$$, debugging is a complete nightmare, especially when you hit a bug in the Wind River code (and there's a lot of them). Hell, they couldn't even implement malloc right!
Wind River is coming out with version 6 of VxWorks, but it is basically an enhanced version of AE. I'm not holding my breath.
-Aaron
This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.