ISS Computer Failure
A number of readers wrote us with news of the computer problems on the International Space Station. Space.com has one of the better writeups on the failure of Russian computers that control the ISS's attitude and some life-support systems. Two out of six computers in a redundant system cannot be rebooted. The space shuttle Atlantis may have its mission extended until the problem is fixed. A NASA spokesman was optimistic that the problem can be resolved; worst-case scenario would be for the shuttle to evacuate everyone onboard the ISS. Engineers are working on the theory (among others) that the failure may have been triggered by new solar panels installed earlier in Atlantis's mission.
Hopefully they're starting with their DFMEA documentation... "guessing" at the problem and having "theories" is probably not a good way to go. Also, it's apparently a common-mode failure, which you shouldn't have in a safety-critical system; generally this is avoided by having different computer hardware and/or completely different code to do the same tasks.
Quite unfortunate that it seems like systems engineering is lacking in more and more disciplines recently, although I suppose it makes good systems engineers more valuable.
My list for this would be something like: "Computer doesn't boot." Possible reasons: "No Power", "Insufficient power", "Corrupt memory", "Broken circuits", etc. Then you go down that tree further and find the root cause. The most disturbing thing is that they had such a major common-mode failure...whatever happened to the "no single points of failure" mantra?
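A minimal sketch of that kind of fault tree in Python, for illustration only. The top-level causes are the ones listed above; the two sub-causes under "Insufficient power" and the `Fault` class itself are hypothetical placeholders, not anything taken from the actual ISS failure analysis.

```python
from dataclasses import dataclass, field


@dataclass
class Fault:
    """One node in a fault tree: a symptom plus its possible causes."""
    name: str
    causes: list = field(default_factory=list)

    def root_causes(self):
        """Walk the tree depth first and return the leaves (candidate root causes)."""
        if not self.causes:
            return [self.name]
        leaves = []
        for cause in self.causes:
            leaves.extend(cause.root_causes())
        return leaves


tree = Fault("Computer doesn't boot", [
    Fault("No power"),
    Fault("Insufficient power", [
        # Hypothetical sub-causes, invented for this example.
        Fault("Supply voltage sags under load"),
        Fault("Noisy power bus"),
    ]),
    Fault("Corrupt memory"),
    Fault("Broken circuits"),
])

# Every leaf is something to confirm or rule out before moving on.
candidates = tree.root_causes()
for candidate in candidates:
    print(candidate)
```

The point is only that each branch gets expanded until you reach something you can test directly, rather than jumping straight to a favourite theory.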
* sigh *
"There are a dozen opinions on a matter until you know the truth. Then there is only one." - C.S. Lewis (paraphrase)
No, no--I know it sounds crazy. But hear me out. Maybe we could actually pursue something NEW--you know, dare to violate that 30-year-old sacrosanct NASA policy of just repeating themselves over and over again and wasting trillions of $ on contractors and grandiose promises which never amount to squat.
Just a thought.
SJW: Someone who has run out of real oppression, and has to fake it.
The stated worst case scenario is that the ISS will need to be evacuated, but if the remaining gyros are being overwhelmed, might the station enter an unrecoverable spin state before the problem is resolved?
What do you mean they cut the power? How can they cut the power, man? They're animals!
Sort of related: the trains on my line in the UK are run using some sort of Java-based system (we know because they were very buggy to begin with and the website used to give surprisingly honest updates on progress). Anyway, now and then it still goes a bit loopy and we have to sit in the station while the driver warns us over the Tannoy, "I'm just rebooting the train, back in a few minutes," and sure enough, the power drops, lights go out, fans stop, then whoosh, it's on again, the displays start scrolling logos and welcome messages, and one by one you can hear the subsystems power up. Quite cool, if you're sad like me.
I want a list of atrocities done in your name - Recoil
No.
On NASA's manned space equipment you will find no software that is not controlled by NASA. These folks don't just run a few tests. They spend thousands of dollars per SLOC in testing. They actually mathematically prove their software's correctness. Perhaps the Russian agency's quality isn't quite as high, but I still doubt their (or anyone else's) systems onboard the ISS have any OS at all. Most likely they are all custom embedded systems.
I'd counsel against jumping to conclusions about the cause of this solely based on the Russian origin of these systems. I remember a lot of people did that with the early Ariane crash based on it being written in Ada, and ended up looking pretty silly when the problem turned out to be some ported code that wasn't rewritten properly for the new platform.
The first piece of the space station was Zarya, the Russian control module that was launched into orbit November 20, 1998. A few weeks later, on December 4, 1998, the U.S. module Unity was launched into space. On December 7, 1998, the two modules were connected.
That makes the ISS just over 8 years in service.
How old is Atlantis?
Space Shuttle Atlantis has completed 27 flights, spent 220.40 days in space, completed 3,468 orbits, and flown 89,908,732 miles in total, as of September 2006. Atlantis visited Mir in 1997!
Atlantis is 23 years old as of last April. 21 years in service. More than twice as old as the ISS.
Now, tell again - which is the real bucket of bolts? ISS or Atlantis?
Many of NASA's computers on spacecraft use a long-tested real-time operating system called VxWorks from Wind River. It doesn't necessarily have the fancy stuff in modern *nixes, but it is fairly reliable. Even that has been known to fail. The flash memory driver in the Martian Rovers had a bad free-list routine which shut them down for several weeks near the beginning of their mission after the flash memory filled up. A fix was uploaded. Flash memory was relatively new and hadn't been tested as much as the rest of the system.
Try here.
Service restart isn't the problem. The problem is copying kernel state.
The kernel holds a lot of information, such as which processes are running, memory allocation, drivers etc. For a true in-place switchover to a new kernel (i.e., all programs keep running as if nothing happened), all that information has to be copied over.
The other option is to load the new kernel image to memory, shut down all processes and unload drivers, jump to new kernel and start a standard initialization. That would be the same as doing a 'shutdown -r', except that the new kernel is loaded by the old kernel instead of by the BIOS.
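To make the difference between the two options concrete, here is a toy Python model. It is only a sketch: the kernel is reduced to a dictionary, and the field names and the `attitude_ctl` process are made up for the example. A real kernel's state is far larger, and its layout can change between versions, which is exactly why the copying approach is hard.

```python
import copy


def in_place_switchover(old_kernel, new_version):
    """Option 1: carry all bookkeeping over so running programs never notice."""
    new_kernel = copy.deepcopy(old_kernel)  # processes, memory, drivers all copied
    new_kernel["version"] = new_version
    return new_kernel


def restart_switchover(old_kernel, new_version):
    """Option 2: stop everything, jump to the new kernel, initialise from scratch."""
    # old_kernel is deliberately ignored: none of its state survives the jump.
    return {"version": new_version, "processes": [], "drivers": [], "allocations": {}}


# Hypothetical state of the running kernel.
old = {
    "version": "1.0",
    "processes": ["init", "attitude_ctl"],
    "drivers": ["serial"],
    "allocations": {"attitude_ctl": 4096},
}

copied = in_place_switchover(old, "2.0")
fresh = restart_switchover(old, "2.0")

print(copied["processes"])  # the same processes are still listed
print(fresh["processes"])   # nothing is running until init starts things again
```

In the toy version the copy is one line; in practice every structure the old kernel maintains would need a translation into whatever the new kernel expects.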
If J.K.R wrote Windows: Puteulanus fenestra mortalis!