The 100-Million Mile Network
mykepredko writes "eWeek has an article on the network and radio topology of the two Mars rovers and how they communicate with satellites in Mars' orbit as well as the Earth. The article ends by giving four rules for maintaining a space network: a) Automate processes, b) Bulletproof your gear, c) Be persistent and d) Simulate potential problems, which are probably good rules for any network."
It appears that one of the direct results of NASA research will be better networks, both on Earth and elsewhere. Just about anything and everything applied to a deep space network can be applied right here at home. I'm also wondering about wireless network tech resulting from all of this.
I was wondering if NASA has actually disclosed the details of what they believe was the malfunction of the Spirit rover?
As someone who has developed backup and recovery systems for embedded systems using vxWorks and flash memory, I have my own theory of what could have gone wrong.
There is an intermittent problem that can occur when using a combination of vxWorks 5.5, dosFs2 and flash memory.
The problem goes like this: when file A is written to flash memory formatted with FAT16, the FAT is updated to record which clusters are occupied by file A and hence no longer available as free space. When file B later starts writing to the flash, it checks the FAT to see which clusters are free to write to.
Now a timing problem can occur when a process writing files in sequential order closes the file handle to A, opens a new handle for B and starts writing to B. The problem exists because the clusters used by A have not been written to the FAT by the time file B starts writing, so B sees them as free. The consequence is that some of the data belonging to A is overwritten, breaking A's cluster chain. Once this has occurred, the FAT and file A's cluster chain are corrupt. After that first corruption more follow, with the rate of errors growing exponentially until the flash memory can no longer function for disk I/O.
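To make the timing problem concrete, here is a toy model of it in Python. This is only a sketch of the idea, not dosFs2 code: the ToyVolume and ToyFile classes, the 16-entry table and the run() helper are all made up for illustration, and real FAT16 reserves its first two entries, which this model ignores.

```python
# Toy model only: 0 means "free" in the on-disk table, anything else "in use".
FREE, END = 0, 0xFFFF

class ToyVolume:
    def __init__(self, n_clusters=16):
        self.on_disk_fat = [FREE] * n_clusters  # the table an allocator consults
        self.data = [None] * n_clusters         # which file last wrote each cluster

    def free_clusters(self):
        """Clusters that the on-disk FAT currently marks as free."""
        return [i for i, entry in enumerate(self.on_disk_fat) if entry == FREE]

class ToyFile:
    def __init__(self, vol, name):
        self.vol, self.name = vol, name
        self.clusters = []  # chain known only to this handle until flush()

    def write(self, n_clusters):
        """Grab free clusters according to the on-disk FAT and write into them."""
        candidates = [c for c in self.vol.free_clusters() if c not in self.clusters]
        picked = candidates[:n_clusters]
        for c in picked:
            self.vol.data[c] = self.name
        self.clusters.extend(picked)

    def flush(self):
        """Commit this file's cluster chain to the on-disk FAT."""
        for cur, nxt in zip(self.clusters, self.clusters[1:] + [END]):
            self.vol.on_disk_fat[cur] = nxt

def run(flush_before_next_file):
    """Write A then B; return (clusters shared by A and B, owners of A's clusters)."""
    vol = ToyVolume()
    a = ToyFile(vol, "A")
    a.write(3)
    if flush_before_next_file:
        a.flush()          # FAT is up to date before B allocates
    b = ToyFile(vol, "B")
    b.write(3)             # B trusts whatever the on-disk FAT says
    a.flush()              # the late update, if it was not done above
    b.flush()
    overlap = sorted(set(a.clusters) & set(b.clusters))
    return overlap, [vol.data[c] for c in a.clusters]
```

With run(False), B allocates the same clusters A is sitting on, so A's data is overwritten and both chains point at the same clusters. With run(True), A's chain is in the table before B looks, and the two files stay apart.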
Now, as the problem only occurs rarely, it is very hard to reproduce in a lab. Also, as the rate of corruption is exponential, catching the original culprit is even harder. I spent weeks trying to catch and diagnose the problem before eventually catching it.
Unfortunately, once the flash had started to become corrupt, the only way to correct it was to reformat the flash memory.
As for solving the problem: before closing the handle of a file that had been written to flash memory, an ioctl call was made to the dosFs2 library to write the size of the file to the disk. Once this solution was in place, the problem never raised its head again.
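One way to catch this kind of damage early, rather than after it has snowballed, is to walk every file's cluster chain and look for clusters claimed by two files. The sketch below shows the idea in Python on a simplified table; find_cross_links, the list-based FAT and the name-to-first-cluster dictionary are hypothetical stand-ins, not what was actually used on the target.

```python
END = 0xFFFF  # end-of-chain marker in this simplified table

def find_cross_links(fat, start_clusters):
    """Report clusters that belong to more than one file's chain.

    fat            -- list where fat[i] is the cluster following i, or END
    start_clusters -- dict of file name -> first cluster (stand-in for a directory)
    Returns a list of (cluster, first_owner, second_owner) tuples.
    """
    owner = {}
    conflicts = []
    for name, cluster in start_clusters.items():
        seen = set()
        while cluster != END:
            if cluster in seen:
                # The chain loops back on itself, which is also corruption.
                conflicts.append((cluster, name, name))
                break
            seen.add(cluster)
            if cluster in owner and owner[cluster] != name:
                # Another file already claimed this cluster: a cross-link.
                conflicts.append((cluster, owner[cluster], name))
                break  # everything after this point is shared as well
            owner[cluster] = name
            cluster = fat[cluster]
    return conflicts

# File A occupies 2 -> 3 -> 4; file B starts at 5 but its chain runs into 3.
fat = [0] * 8
fat[2], fat[3], fat[4] = 3, 4, END
fat[5] = 3
print(find_cross_links(fat, {"A": 2, "B": 5}))
```

Running a check like this after each batch of writes narrows down which write first broke a chain, instead of finding out only once the volume is unusable.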
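The general pattern is "commit the file to storage before closing the handle". As a rough analog only, here is how that looks in Python on a desktop OS; this is not the dosFs2 ioctl, and write_and_commit is a name invented for the example.

```python
import os

def write_and_commit(path, payload):
    """Write payload, then force it out to storage before the handle is closed."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()             # drain Python's own user-space buffer
        os.fsync(f.fileno())  # ask the OS to commit the file to storage
    return os.path.getsize(path)
```

The point of the ordering is that the next file is not opened until the previous one is fully committed, which closes the window described above.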