The Last Stop For Space Station-Bound Software
Normally I avoid slide show type articles, but this one is actually pretty interesting. It starts "This NASA lab contains a recreation of the computer systems found onboard the International Space Station. It is the place where the final bug testing takes place before software is uploaded to the station and where software engineers recreate bugs that occur onboard the station in an attempt to help fix them."
They're testing their software in an accurate environment? They're not pushing beta testing onto their users?! Preposterous!
I have a bad feeling about this...
Don't forget... HAL 9000 had a ground-based testbed "twin", SAL 9000. Too bad they didn't try lying to it about its critical and super-secret mission before doing so to HAL. They only tried that afterward, as a debugging replay.
Welcome to the Panopticon. Used to be a prison, now it's your home.
Well, I'm not sure about that... but there certainly have been some interesting times with the computers on board. There are some computers which basically are like HAL and run most of the station... they are known as C&C computers, and since they are so critical, there are 3 of them. And they are constantly in hot standby.
Interestingly enough, the crap hit the fan during stage 6a, and being here, I can tell you that everyone was looking to blame someone else's subsystem. The Canadian robot arm was being installed, so people were very suspicious it had something to do with it, but it basically turned out to be hard drive problems on all 3 C&C computers.
The cool story here is that the idea that you could have a triple-redundant system fail seemed so far off that it was almost thought impossible. Even still, some engineer had this idea to write this little program which would jump into action if it ever saw all 3 C&C computers offline. The program was called "Mighty Mouse".
During the episode, things went really bad. Lights were out. Comm was out. At one point people were trying to confirm whether commands were coming up and down by looking out the windows and seeing lights get turned on and off. But... Mighty Mouse saved the day. It kept cycling power to the 3 C&C computers until it saw a healthy one, and for a while C&C2 came back online and let the ground controllers get some data and start fixing some issues.
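For the curious, here is a rough sketch of what a last-ditch recovery loop of that kind might look like. To be clear, this is not the actual flight software, which I have never seen. The function name, the `FakeStation` class, and the "recovers after N power cycles" behavior are all made up for illustration. The only thing taken from the story is the idea: do nothing while any unit is up, and keep cycling power once all of them are down.

```python
from typing import Callable, Optional, Sequence


def recover_first_healthy(
    units: Sequence[str],
    is_healthy: Callable[[str], bool],
    power_cycle: Callable[[str], None],
    max_rounds: int = 10,
) -> Optional[str]:
    """Return a healthy unit, power-cycling units only if all are offline."""
    for _ in range(max_rounds):
        # If any unit is up, report it and leave the hardware alone.
        for unit in units:
            if is_healthy(unit):
                return unit
        # Every unit is offline: cycle power on each one, then re-check.
        for unit in units:
            power_cycle(unit)
    # Final check after the last round of power cycles.
    for unit in units:
        if is_healthy(unit):
            return unit
    return None


class FakeStation:
    """Hypothetical stand-in for hardware: a unit recovers after N cycles."""

    def __init__(self, cycles_needed: dict):
        # None means the unit never recovers, no matter how often it is cycled.
        self.cycles_needed = dict(cycles_needed)
        self.cycles_done = {name: 0 for name in cycles_needed}

    def is_healthy(self, name: str) -> bool:
        needed = self.cycles_needed[name]
        return needed is not None and self.cycles_done[name] >= needed

    def power_cycle(self, name: str) -> None:
        self.cycles_done[name] += 1
```

Usage would be something like `recover_first_healthy(["C&C1", "C&C2", "C&C3"], station.is_healthy, station.power_cycle)` with `station = FakeStation({"C&C1": None, "C&C2": 2, "C&C3": None})`. The `max_rounds` cap is my own addition so the toy version terminates; a real watchdog would presumably keep trying.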
All is fine now. I believe they have replaced the hard drives with space hardened solid state drives... but it was one of those interesting periods during early space station construction where software was an integral part. You can read all about this here:
http://spaceflightnow.com/station/stage6a/010426fd8/index2.html
The cool story here is that the idea that you could have a triple-redundant system fail seemed so far off that it was almost thought impossible.
Heh, well it would be, if the estimated probability of failure for each was truly an independent random variable. The excrement usually hits the fan when it isn't. Like, say, a hard drive with an unknown defect where a certain access pattern can make it fail. 3 machines doing the same work means they could all fail for the same reason at about the same time.
Or an old example that was basically just doing "redundancy" wrong, the telco that laid two fiber optic cables -- you know, cus the CTO read that redundancy was important! -- directly side by side. So when the inevitable backhoe came along and accidentally cut one, it cut both.
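To put rough numbers on the independence point, here is a back-of-the-envelope sketch. It uses a simple beta-factor style model, which is my choice and not anything from the ISS analysis: a fraction `beta` of each unit's failure probability `p` is assumed to come from one shared cause that takes out all three at once, and the rest is independent. The values of `p` and `beta` are invented purely for illustration.

```python
def p_all_fail_independent(p: float, n: int = 3) -> float:
    """Probability that all n units fail when failures are independent."""
    return p ** n


def p_all_fail_common_cause(p: float, beta: float, n: int = 3) -> float:
    """Probability that all n units fail under a simple beta-factor model.

    Assumption: a shared event with probability beta * p kills every unit,
    and each unit also fails on its own with probability (1 - beta) * p.
    """
    shared = beta * p
    independent = (1.0 - beta) * p
    # All units are down if the shared event happens, or if it does not
    # happen and every unit fails independently.
    return shared + (1.0 - shared) * independent ** n


if __name__ == "__main__":
    p, beta = 0.01, 0.1  # hypothetical numbers, not measured data
    naive = p_all_fail_independent(p)
    realistic = p_all_fail_common_cause(p, beta)
    print(f"independent:  {naive:.3e}")
    print(f"common cause: {realistic:.3e}")
    print(f"ratio:        {realistic / naive:.0f}x")
```

With those made-up inputs, the common-cause figure comes out roughly a thousand times larger than the naive one-in-a-million estimate, because the shared term `beta * p` dominates. That is the whole problem with identical machines running identical workloads on identical drives.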
All is fine now. I believe they have replaced the hard drives with space hardened solid state drives...
Huh... Do you have any idea what the failure mode was? If it was a solar storm or other such event, then it would make perfect sense that all 3 systems conked out at the same time in defiance of all probability.
Anyway, that is pretty cool. And it shows the kind of folks who work at NASA that they decided that triple redundancy just wasn't enough and they needed a last-ditch recovery mechanism.
The enemies of Democracy are