Why ISS Computers Failed
Geoffrey.landis writes "It was only a small news item four months ago: all three of the Russian computers that control the International Space Station failed shortly after the Space Shuttle brought up a new solar array. But why did they fail? James Oberg, writing in IEEE Spectrum, details the detective work that led to a diagnosis." The article has good insights into the role the ISS plays as a laboratory for US-Russian technology cooperation — something that is likely to be crucial in any manned Mars mission.
They "upgraded" to Vista.
CATS/Diebold '08- All your vote are belong to us!
Metric electricity vs Imperial electricity...
This issue is a bit more complicated than you think.
The article reeked of condescension towards the Russians. It's no way to report on your partners in space.
...They also decided to rig a thermal barrier out of a surplus reference book and all-purpose gray tape....Once again, duct tape saves the day!
Could this be the one place where it would be appropriate to mention that in Russia, crashes compute?
Or would that be "In Russia, crashes compute you!" ?
Ahhh, what an awful dream. Ones and zeroes everywhere... and I thought I saw a two.
They also decided to rig a thermal barrier out of a surplus reference book and all-purpose gray tape
Almost certainly, this was the duct tape we all know and love. They probably thought it was better not to actually say that, though. Pretty funny. And as an added side-benefit, they should be safe from terrorists.
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
I think NASA should have learned this lesson by now. After all, the Challenger disaster showed this principle as well. In that case, the same cold temperature that weakened the primary seal on the solid rocket booster weakened the secondary as well, sapping its ability to provide redundant backup. In this case, the same condensation affected all three computers equally.
It's troubling to see them taking shortcuts on safety and redundancy, when such measures have resulted in loss of life before. How hard would it have been to have had three shut-off cables?
We all know what to do, but we don't know how to get re-elected once we have done it
Look people, I can see that ISS personnel are really upset about this. I honestly think they ought to sit down calmly, take a stress pill, and think things over. I know the computers had made some very poor decisions recently, but they can give explorers their complete assurance that the work will be back to normal. These machines still got the greatest enthusiasm and confidence in the mission. And they want to help.
in soviet russia, the computer crashes you!
Those of us who think they know everything annoy those of us who do.
I tried to use Google translate to put this in Russian, but Slashdot didn't want to let me cut-'n-paste it in.
Comrade Dave: Open ze Pod Bay Doors, HAL.
Comrade HAL: Nyet Comrade Dave, I cannot do that.
I wonder how you sing "Daisy Daisy" in Russian?
If telephones are outlawed, then only outlaws will have telephones.
The truth is that MOST of this equipment will be copied or built as one-offs for any lunar or trans-planetary mission. The ISS allows for true testing of it all. So far, MOST of the equipment has done a pretty good job. But it is good to know EXACTLY where it will fail.
I prefer the "u" in honour as it seems to be missing these days.
Am I reading the article correctly? Humidity caused the connections to go bad from rust? IIRC, the off the shelf ISA cards and RAM I used to get with my (now) ancient computers were gold plated.
Couldn't the ISS with its multi-billion-dollar cost use contacts and cables that can't rust? Gold for contact points, aluminum for the bulk cable?
Heck, given the costs involved, it'd barely be a rounding error in the budget to use solid gold cables. One tonne of gold at $700 per ounce is about $25 million. Not that I have any idea how many critical tonnes of cabling are involved.
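For anyone who wants to check the figure above, here is a quick sketch of the arithmetic. The $700 price comes from the comment; the choice of ounce is an assumption on my part. Gold is normally priced per troy ounce, which gives roughly $22.5 million per tonne, while the everyday avoirdupois ounce gives roughly $24.7 million, which is closer to the "about $25 million" quoted.

```python
# Back-of-the-envelope check of the gold figure quoted in the comment.
GRAMS_PER_TONNE = 1_000_000.0
GRAMS_PER_TROY_OUNCE = 31.1034768     # the ounce gold is normally priced in
GRAMS_PER_AVDP_OUNCE = 28.349523125   # the everyday avoirdupois ounce
PRICE_PER_OUNCE_USD = 700.0           # price taken from the comment


def tonne_cost_usd(grams_per_ounce: float,
                   price_per_ounce: float = PRICE_PER_OUNCE_USD) -> float:
    """Cost in USD of one metric tonne of gold for a given ounce definition."""
    return GRAMS_PER_TONNE / grams_per_ounce * price_per_ounce


cost_troy = tonne_cost_usd(GRAMS_PER_TROY_OUNCE)
cost_avdp = tonne_cost_usd(GRAMS_PER_AVDP_OUNCE)

print(f"troy ounces:        ${cost_troy / 1e6:.1f} million per tonne")
print(f"avoirdupois ounces: ${cost_avdp / 1e6:.1f} million per tonne")
```

Either way the point of the comment stands: the number is small next to the station's overall cost.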
It's interesting that the problem eventually was a hardware problem. I suppose military designers, used to working in tight spaces and different environments, might have anticipated the problem (a submarine and a space station are probably more similar than we'd think). For 'normal' designers, humidity isn't something that's considered an issue.
This'll get worse and worse as exploration goes farther and farther afield. Even little things like mold, dust, and the black gunk that piles up on the bottom of a mouse can become catastrophic when you're trapped in a box a couple of thousand miles away.
Using anti-bacterial (or anti-fungal) solutions in this situation may make the problem worse, because everything that survives will be even tougher to kill. Combine that with a higher level of background radiation (which should cause more mutations) and you might end up with a long mission whose crew has expired due to superbugs.
The author is obviously way more qualified than I to assess the situation and he may well be right but from the content of the article I came away thinking, wow, I would have looked first at all the recent changes to the station and the power supply too.
Too many times I have found either the front, or back side of a plug connector has a fault and breaks the current.. and to top it off, most times the plugs aren't rebuildable. It always comes down to
1) is it plugged in? (double triple check)
2) did you hit it? (twice? tap, knock and slap?)
3) did you turn it off and on (a bunch?)
Also, faulty switches.. so often a cheap switch disables an otherwise perfect device. (hence step 3)
Really bad design/construction flaw too! Methinks proper marine grade plugs would have avoided it. Fortunately these guys have been working on an ISS escape system.
She had it running on ISS(sp) webserver.
That for all of the controls and quality control required of mission critical hardware such as this, it still comes down to:
1) unexpected failure modes
2) political battles
Which really isn't a whole lot different than 1) the unexpected failure modes I see every day at work, and 2) the political wrangling (fingerpointing) that takes place when they happen. Apparently NASA and its Russian equivalent are no better than any old software company.
The lesson being, people are people, and people are still the ones that design these things.
For linux tips: http://www.linuxtipsblog.com
The original plans called for the ISS to be finished many years ago. It is not yet, because America has had issues with transportation. In addition, a few modules that were planned to make the ISS very useful were canceled because of us (in particular, CAM). In the end, both sides have had issues, and changes have occurred. That is normal for these kinds of projects. To be honest, I think that all of this has been handled pretty decently.
... but for equipment which is all critical, all essentially one-of-a-kind, and all lethal if compromised, there are only two safety states: failed and "has not failed... yet".
Help poke pirates in the eyepatch, arr.
Years later I met his manager, he told me that my friend could have been promoted for discovering one of the biggest loophole ever in the bank's history, if he had reported the problem immediately. Though the unexpected shutdown caused considerable damage, it could have saved billions from real break-in with this loophole.
That's a lesson that every engineer should have learned.
Terrorists can't threaten a country's freedom and democracy. Only lawmakers and voters can do that.
The article has good insights into the role the ISS plays as a laboratory for US-Russian technology cooperation -- something that is likely to be crucial in any manned Mars mission.
No offense to Russia or the US, both of whom produce good space gear, but technology cooperation is probably a bad idea unless it is tested more thoroughly than in the ISS. The ISS is a great example of how to screw up international cooperation. The station has been delayed for more than a decade (and cost NASA around $50 billion so far) due to redesign and indecision, reliance on a single launch vehicle for key components (the Shuttle), and the inclusion of the Russians. There are parts of the station that can only communicate with the Russians and parts that can only communicate with NASA. Aside from basic utility hookup (electricity), there's no connection between the different parties on the ISS (at least between the Russians and NASA; the ESA and Japanese parts might work better with NASA's stuff). And if you want to make changes that affect more than one party, it becomes by default an international issue. Finally, there's no easy way to transfer ownership. NASA's communication system (TDRSS) is integral to the NASA parts and is also a national secret (so I understand). So the communication system can't be transferred to another party like the Russians or the ESA.
If there's any international cooperation between space agencies, it probably should be at a rather trivial and manageable level. Say, including foreign astronauts or using off-the-shelf equipment that is known to work under the circumstances.
True, as a starting point. Tho, failures tend to be things that snowball. It's sort of an anthropic principle of failures, i.e. bad things happened because failures were happening.
I have always tried to learn from air crash investigations and so on how failure modes develop. In problem solving mode, it seems one should assume the distinct possibility of multiple problems all at once.
In this case, multiple failure paths existed, tho it took a power spike to set it off as I interpreted it. Even without corrosion, it seems the system would have failed, though not irrecoverably.
I repeatedly ask the question "Is that everything? Is there anything else that could come from that?" It seems the engineers didn't perform enough diligence on the trickle down effects.
In Soviet Russia, Jumper Cables Erode YOU!
Article sounds like 'ner ner ner they did it'
I hope perhaps that they use circuit modeling and simulations (as if that sim code could ever be wrong...) but at least ADAify, or mathematically consecrate some code for dealing with electrophysiological phenomena, such as condensation.
Yes, it is make up a word day. Bard FTW!
#hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
I find that the first, and most important, thing to do in any catastrophe is "Assign Blame".
Cause you never know exactly how bad it's gonna get.
BBH
Redundancy can equal safety and reliability, but all of the components designed to be redundant should all actually have different designs so that they have differing modes of failure. So, in the Challenger case, were the seals designed differently, they wouldn't have had the same failure mode for a given exposure.
To do this really well, though, requires risk management software that I am not sure even exists. You'd have to simulate everything. The devil, as happened to Challenger, is that there are so many variables that you can't know a priori what your real mode of failure will be. To some extent, perhaps the best way to fly in space is to forget about excesses of safety altogether, and use the cost savings to fly more often. When something breaks, fix that.
This is my sig.
Someone used their cell phone while the pilot had the fasten seatbelt sign turned on.
Well, well, well... Here we go again. Jim Oberg. That same Jim Oberg who was almost blowing his gasket a couple of weeks ago when that journalist was asking him questions about alcohol abuse by astronauts (you all remember the story, I'm sure). It was all preposterous nonsense not backed up by any evidence, he said, barely keeping his cool. And what do we see now? He is happily making up stories about Russians accusing the US of the computer failures, something that never happened in reality. The power problems caused by some new US installations were indeed considered as intermediate, brainstormed working versions of what could have happened. But nobody ever did any fingerpointing or made any accusations before the situation was sufficiently researched and the root cause determined. Of course, Jim Oberg could not refrain from distorting the truth "just a little". Tsk, tsk, tsk... Note how he refers to the hypothesis as both "blatant finger pointing" and just "guesses" within a single paragraph, just to keep his article a little fuzzy, so that he can flip-flop to either when the situation calls for it. Nothing surprising here, though...
The article is misleading. The computers are not actually of Russian make, they were supplied to Russians by Europeans (EADS). See here.
I had an '89 Nissan Pathfinder and it had factory wiring harness connectors on ALL of the various electrical connections, which were water-tight with one or more ribbed red silicone gaskets.
The connectors were not always easy to disconnect, however, after 177,000 miles and 11 years of original ownership, I never found any corrosion inside any one of them I ever disconnected for service.
Additionally, the male/female electrical contacts within the sealed connectors appeared to be made from a tinned Copper and/or Brass metal. This is important to note, as Brass, and to a much larger extent, Copper, have ELECTRICALLY CONDUCTIVE oxide states (as surface corrosion by moisture and/or other aqueous solvents).
In other words, you corrode a Copper or Brass metal electrical connector, and it will still conduct electricity just fine. It may degrade certain frequencies of network/data signaling and alter the dB loss and impedance, but it will still conduct.
This is another reason why the top-post Nissan main battery terminal connectors for this vehicle were made from a Copper/Brass strap instead of a traditional Lead connector.
Lead oxide powders (as found on many old standard Lead top-post automotive battery terminals) are not effective electrical conductors (as anyone who has wiggled/cleaned a corroded connection to allow their car to start could attest).
Why did the design/production Engineers for the ISS not utilize Gold Plated Watertight industry standard (ISO, etc) wiring interconnects? (Even cheap RJ-45 connectors have gold-plated pins)
-That is the REAL Question.
I'm thinking it's relatively close to even. We lost 3 on the pad (early Apollo, where we learned that a full oxygen mix in a capsule with burnable stuff in it is Almost A Good Idea), & a pair of crewed space shuttles. Officially, the Russians haven't lost anybody but rumor around the water cooler is, they lost a couple when they couldn't deorbit a capsule in time and the cosmonauts ran out of oxygen, couple died on the pad in explosions, and a couple parachute failures pancaked a couple Vostoks into the Siberian tundra.
Understanding the scope of the problem is the first step on the path to true panic.
I'm surprised that connector corrosion would be a problem. Aviation has a long history of wire problems, but gold-plating connectors seems to be a stable solution to that problem. The ISS uses Kapton wire, which was popular in the 1980s and is lightweight and tough. But that material is hygroscopic and now banned by the USAF, US Navy, Boeing, etc. "Susceptible to aging in that it dries out forming hairline cracks which can lead to micro current leakage (i.e. electrical 'ticking' faults)"
There are ways to do corrosion-resistant contacts without precious metals; the automotive industry has solved this problem. The alloys aren't simple; here's one used for under-hood automotive connectors. Copper, iron, magnesium, and phosphorus, with upper limits on tin, zinc, nickel, lead, and manganese. But avionics connectors are usually gold plated; it doesn't add that much cost. And Russia is a major exporter of gold.
The article doesn't go far enough. OK, the connectors corroded. Why? Wrong alloy? Plating failure? Wear from too many connector insertions? Was the spec wrong, or were the cables not made to spec?
Excuse me, "Terrestrial".
Oh, what about the cosmonauts whose pressure equalization valve opened at an altitude of 160 km ?
http://en.wikipedia.org/wiki/Soyuz_11
Every time people mod up a clippy post, a little part of me dies.
Tell me, how many casualties have the Russians had in the last decade, even the last two decades? This was in the days of Mir, when the Russians maintained a continuous space presence year after year and the US was out of space for year after year for blowing up space shuttles.
So whose tech is behind whose? The ISS didn't plunge out of the sky when the Space Shuttle was not available; apparently the Russian capability is more than enough to operate it.
And finally, who built the dehumidifier that was at fault in the first place?
MMO Quests are like orgasms:
You may solo them, I prefer them in a group.
I found it interesting that mold (fungi) was found living in the condensation. It means that despite what I presume is a strict level of sterilization and sanitation for both astronauts and equipment headed to the ISS, some spores still made it up and began to replicate in this one little area of opportunity.
I read that as: "The article reeked of condensation towards the Russians..."
I was thinking, "How does condensation reek?"
the computer crashes condensate on you!
I'm not really pro-American at all. I think the Russian program is actually superior, the shuttle's just too bloated and complex.
The one thing you've got to give the Americans is that they're prepared to admit when they've got casualties. I find it hard to believe that Russians didn't attempt to launch people previously and just didn't report the failures.
Three of the same system is not redundancy. The Shuttle flight control system runs 4 of one design making decisions with a peer-review system, and 1 of another with different hardware, different code, designed and built by different teams. Even if there's a software or hardware design flaw that cripples the 4 "redundant" controllers, the 5th will still be operational. THAT is redundancy. And it would have worked on the ISS just as well.
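The "several identical voters plus one dissimilar backup" idea can be sketched in a few lines. This is a toy illustration only, not how the Shuttle's actual flight software works; the names `primary_design`, `backup_design`, `vote`, and `command` are all made up for the example, and the "design flaw" (failing on negative inputs) is an arbitrary stand-in for any fault shared by every copy of one design.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# A channel returns a value, or None when it has failed.
Channel = Callable[[float], Optional[float]]


def primary_design(x: float) -> Optional[float]:
    """Hypothetical primary design with a flaw: it fails on negative inputs."""
    return None if x < 0 else 2 * x


def backup_design(x: float) -> Optional[float]:
    """Hypothetical dissimilar design: same job, different implementation."""
    return x + x


def vote(outputs: Sequence[Optional[float]]) -> Optional[float]:
    """Return the value agreed on by a strict majority of all channels, else None."""
    live = [o for o in outputs if o is not None]
    if not live:
        return None
    value, count = Counter(live).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def command(primaries: Sequence[Channel], backup: Channel, x: float) -> Optional[float]:
    """Use the voted primary result; fall back to the dissimilar backup if voting fails."""
    result = vote([channel(x) for channel in primaries])
    return result if result is not None else backup(x)


four_identical = [primary_design] * 4
print(command(four_identical, backup_design, 2.0))   # primaries agree
print(command(four_identical, backup_design, -2.0))  # all primaries fail together; backup answers
```

Four copies of `primary_design` protect against one copy breaking at random, but they all fail on the same input. Only the differently built backup survives that common-mode fault, which is the distinction the comment is drawing.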
Forget thrust, drag, lift and weight. Airplanes fly because of money.
From The Pragmatic Programmer:
"Don't program by coincidence. Never confuse a happy coincidence with a thoughtful plan."
I can't tell you how many times that advice has helped me, not just writing software, but configuring hardware issues, diagnosing home repair problems, etc. Never just guess.
Sounds like the engineers in question were so eager to avoid responsibility they just guessed at the first thing that came to mind. "Oh look, random jumper cables worked. Don't know what happened, but I'm sure it couldn't happen again!" Yikes.
I love this, rather than discuss the real issues, /. can't even talk about other computers without bashing MS.
Politics is the art of looking for trouble, finding it everywhere, diagnosing it incorrectly and applying the wrong fix.
They fixed it with Duct tape! Red Green would be proud.
... and God kills a kitten. Won't someone please think of the kittens?
!#@%*)anks for hanging up the phone, dear.
Are you really suggesting that NASA left the safety of its astronauts to a bunch of black box systems? NASA knew what the design of the power connectors was. They chose not to raise any concerns. Therefore, NASA does share part of the blame. Through their lack of oversight they let their Russian partners design an inadequate system.
That's because NASA is flying around with so many known issues, ones its engineers and safety boards told it to fix but that it ignored, that when something goes wrong it has no one to blame but itself. Right now the shuttle is sitting on the pad with the coating on several panels on the leading edges of the wings degraded. Rather than just fix the problem when they found out about it, they went ahead, and now, while they debate the issue some more, stopping to fix it will mean a huge delay.
Enigma
Condensation is "still" a problem because it's one of the big and tricky ones. To get rid of the condensation, you have to get rid of the people.
Condensation is *not* a tricky problem, dehumidifying and air conditioning technology is old and well understood. Cool air, water condenses and is collected, heat air, repeat. As another poster has pointed out, spacecraft are not the only sealed environments, submarines have been successfully addressing this problem for decades. The real problem is not condensation, the real problems are mass and power. Power probably being the dominant problem here.
trend, look at the trend man!
two shuttles plus crew have been lost in the
most recent project and that in a long running
established project at that.
Both due (aside from the actual technical failure)
to systemic faults.
I'd call that degenerate!
G!
MACC
...was due to a bad batch of 12AU6As.
Have gnu, will travel.
When I'm training new technicians and engineers in how to debug processor based systems, I always tell them look at:
1. Power
2. Reset
3. Clocks
Before looking at anything else. A good 80% of the time the problem is in one of these three areas.
myke
Mimetics Inc. Twitter
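The power / reset / clocks ordering above can be written down as a tiny checklist runner. This is only a sketch: the probe functions are placeholders I made up, and on real hardware each would be a measurement with a meter or scope rather than a Python lambda.

```python
from typing import Callable, List, Optional, Tuple

# Each check is a (name, probe) pair; a probe returns True when that area looks healthy.
Check = Tuple[str, Callable[[], bool]]


def first_failing_check(checks: List[Check]) -> Optional[str]:
    """Run the checks in order and return the name of the first one that fails."""
    for name, probe in checks:
        if not probe():
            return name
    return None


# Hypothetical board where power is fine but the reset line is stuck.
checks: List[Check] = [
    ("power", lambda: True),
    ("reset", lambda: False),
    ("clocks", lambda: True),
]
print(first_failing_check(checks))
```

The order matters: there is little point in looking at clocks before you know the supply rails and reset line are sane.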
Lick wires to see if circuit is live!
Dip tongue in liquid nitrogen to kill pain!
In motherland we don't feel pain!
I killed da wabbit -Elmer Fudd
Check the weight of the modules (Destiny lab, the nodes, etc). Each of them could be launched via Delta or Atlas. In fact, that was designed in right after Columbia. In fact, the Japanese module, which WAS going to be the small one, is now the largest (unless Bigelow hooks up one or more of theirs). The problem is that we have no way to get these to the ISS. If a company like SpaceDev was smart, they would create a tug using their butyl rubber engine. That is one that can sit in space for a LONG time. In fact, they would go far if they put it up there combined with a small arm for moving equipment around. For example, it might be useful to move sats around or simply to drop them out of orbit. Of course, it would have helped to get the Hubble down and then back up (assuming that it had a nice hook point).
The most important problem is that the triply redundant system was not triply redundant and had a single power-off command.
Humidity was a big issue, and arguably more could have been done in this area, but if it was the only one, it wouldn't have triggered the problem.
LedgerSMB: Open source Accounting/ERP
They also decided to rig a thermal barrier out of a surplus reference book and all-purpose gray tape.
We KNOW what that means. They used duct tape.
I work for the Department of Redundancy Department.
http://en.wikipedia.org/wiki/List_of_space_disasters
That post was from memory. As of the mid-70's (pretty much when I stopped watching the space program), the Russians weren't admitting anything other than successful missions. To get any info on them at the time, you hadda go talk with your friendly local CIA analyst.
While true of their general policy, definitely not true of the two instances when they lost a crew (Soyuz 1 in 1967 and Soyuz 11 in 1971) on a mission. In both instances, there was a big state funeral (US astronaut Tom Stafford was even a pallbearer at the second) and their human spaceflight programme was brought to a halt. The Soyuz 1 and Soyuz 11 tragedies were well-known about in the West. No need to trouble the CIA. You could, for example, have picked up Time magazine May 5, 1967 or July 12, 1971.
Or maybe just the local paper. Here's an eBay auction (not mine) for a regional American newspaper reporting the Soyuz 1 crash as front-page news on the day after it happened, giving Soviet news agency Tass as the source.