Mars Polar Lander Had Fatal Design Flaw

Posted by CmdrTaco on Wednesday February 16, 2000 @03:00AM from the segmentation-fault dept.

GSearle writes, "Spacedaily.com reports on a design flaw that may have caused the Mars Polar Lander to cut its engines immediately upon firing and plummet 1800 meters to the surface. The problem lies with a sensor that detects when the lander has reached the ground. The sensor may have been triggered prematurely when the lander's legs locked into position."

16 comments

Testing Processes by Dolohov · 2000-02-15 23:18 · Score: 1

I'm curious, then, about NASA's testing processes for these things. It's a shame these things are too expensive to try out in Earth's orbit, where breakage is immediately obvious.
So do NASA engineers test on computer? Run tests on individual parts? Or do they just rely on calculation and past performance? If anyone has any information, I'd be glad for it.
1. Re:Testing Processes by CyberDong · 2000-02-16 01:20 · Score: 2
  
  So do NASA engineers test on computer?
  Undoubtedly, some of the testing is done on computer, but most of it is probably done on actual hardware. However, the problem here isn't one of testing, it's one of communication. The article points out that the deployment testing group knew of the problem, but the descent control group did not account for it.
  Standard engineering practice is to gather the staff involved and hold a brainstorming session known as an FMEA (Failure Mode and Effects Analysis). At this session, the idea is to identify all the possible ways in which something can go wrong, and determine the outcome if it were to happen. The most critical items are given the highest priority to ensure that they do not occur. Surely the consequences of sensor failure were identified at the session.
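  To make the prioritisation step concrete, here is a minimal sketch of how an FMEA worksheet is commonly ranked, using the classic risk priority number (severity x occurrence x detection, each scored 1 to 10). The failure modes and every score below are invented for illustration; nothing here comes from the actual lander review.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureMode:
    description: str
    severity: int    # 1 (negligible) .. 10 (loss of mission)
    occurrence: int  # 1 (remote) .. 10 (almost certain)
    detection: int   # 1 (almost certainly caught) .. 10 (almost certainly missed)

    @property
    def rpn(self) -> int:
        # Classic risk priority number: higher means "deal with this first".
        return self.severity * self.occurrence * self.detection


def rank(modes: List[FailureMode]) -> List[FailureMode]:
    """Return the failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical worksheet entries; the scores are made up for this example.
worksheet = [
    FailureMode("touchdown switch fails open", 6, 2, 4),
    FailureMode("touchdown switch reads closed before landing", 10, 4, 7),
    FailureMode("leg fails to deploy", 9, 2, 3),
]

for mode in rank(worksheet):
    print(f"{mode.rpn:4d}  {mode.description}")
```

  With scores like these, a premature "closed" reading lands at the top of the list, which is the kind of item the session is supposed to force everyone to account for.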
  Since the second group did not account for the possibility of the sensor being prematurely activated, they must not have been informed of the results of the testing by the first group. What they need to work on is their inter-group communications, not their hardware testing.
  Of course, I'm crediting the design group with following reasonable procedures. They may not have done it exactly like that, but it's a pretty standard approach in industry. Every part on an automobile goes through such an analysis (even down to the headlight switch), so it just stands to reason that a multi-million dollar one-off space probe should...
  
  - - - -
2. Re:Testing Processes by OmicronTheta · 2000-02-16 12:26 · Score: 1
  
  >Of course, I'm crediting the design group with following reasonable procedures.
  
  You should dig out your old copy of Surely You're Joking, Mr. Feynman and read it again, then. One of the major points of that book is that many of NASA's woes come from processes designed by managers focused on expediency and politics rather than good engineering practice.
  
  --
  Cuiusvis hominis est errare, nullius nisi insipientis in errore perseverare
3. Re:Testing Processes by Anonymous Coward · 2000-02-16 16:38 · Score: 0
  
  IMHO there were several things going wrong.
  
  - Groups didn't communicate their findings.
  
  The first group didn't tell the second that
  the correct end state wasn't reached
  consistently and reproducibly.
  
  OUCH!
  
  - Why was that group not _forced_ to communicate
  their findings, anyway?
  
  A management fault.
  
  - The second group didn't ask whether their initial
  state was actually there. This is unwise if your
  mission depends on ONE BIT of information.
  
  (Depending on one bit of information is a
  big risk in itself, that's why discussion of
  other methods to control descent is a good
  thing to do.)
  
  - There was only module level testing.
  No system level testing was done.
  
  OUCH!
  
  This is a management fault.
  
  => IMHO most disasters involving machines
  have multiple failures as their cause.
  
  Here we see lots of individual failures:
  - A risky design decision
  - An inadequate test plan
  - People not communicating their findings
  - People not asking, even if they know there's
  a risk
  
  Christian
programming error!?! by milliyear · 2000-02-16 00:10 · Score: 1

Somebody forgot to clear the bit. How would you like to be the one responsible for that little piece of code? (Actually, I'd love to work on something like that!)

Kind of adds new meaning to the term 'System Crash', eh?
The good news is... by Rantage · 2000-02-16 00:18 · Score: 1

...that the Deep Space Two microprobes survived impact, cushioned from the fall by the main body of the probe...

--
Online gaming for motivated, sportsmanlike players: www.steelmaelstrom.org.
Drive-by development syndrome by gelfling · 2000-02-16 01:16 · Score: 1

Do it on the cheap, cut corners, rush people and accept no dissension. Oh, and throw in several different development groups that don't communicate well together and aren't required to work in tandem. The strongest possible example of why we DON'T want to send people into space unless/until it's necessary to take the next step forward. If any of you are old enough to remember Challenger, what's the lesson to take away from that? Test the entire system to ensure that it can function under operational conditions that are not optimal but can be invoked by the PEOPLE who manage it.
1. Re:Drive-by development syndrome by CyberDong · 2000-02-16 01:33 · Score: 1
  
  Do it on the cheap, cut corners, rush people and accept no dissension.
  Reminds me of an interview with an astronaut several years ago. He commented on how scary it really is to be sitting on top of a billion dollars worth of hardware, all supplied by the lowest bidder...
  remember Challenger
  Challenger was a good example of the "accept no dissent" philosophy. One of the project engineers pointed out to them that they were operating outside the parameters for the o-rings, but was ignored. The higher-ups decided that the risk was minimal, and launched anyway. Then again, astronauts & test-pilots are very aware of the risks involved in their professions. They accept the risks (and the danger pay, and the glory) before they climb aboard.
  
  - - - -
2. Re:Drive-by development syndrome by Tassach · 2000-02-16 01:50 · Score: 1
  
  Do it on the cheap, cut corners, rush people and accept no dissension. Oh, and throw in several different development groups that don't communicate well together and aren't required to work in tandem.
  
  Unfortunately, this is business as usual in Government. Plus, the separate groups are usually highly politicized. This is true in the private sector as well, but usually to a lesser degree than it is in Government.
  If the report is correct that the crash was due to a faulty sensor, that is an example of bad engineering. Every engineer learns (or should learn) the lesson early in their career that you should never design a system with a single point of failure if at all possible. For a component as critical as this, there should have been redundant sensors. I know that on a spacecraft, every gram of mass is important; but in this case the designers should have sacrificed mission payload to ensure that the lander arrived intact. Saving mass on the landing system so you can pack in one more sensor doesn't do you any good if you lose the entire platform due to a landing failure.
  
  That being said, we have to remember that this is rocket science - it's not supposed to be easy. It's easy to play "armchair quarterback" and find flaws in hindsight. Preventing them in the first place is much harder. The cost of the mission will not be a total loss if we learn from this mistake and apply those lessons to the next effort.
  
  "The axiom 'An honest man has nothing to fear from the police'
  
  --
  Why is it that the proponents of "one nation under God" are so eager to get rid of "liberty and justice for all"?
3. Re:Drive-by development syndrome by CyberDong · 2000-02-16 02:03 · Score: 1
  
  never design a system with a single point of failure
  There was a sensor on each leg... they all failed. Lack of redundancy wasn't the problem, it was a lack of communication. The rockets are designed simply to slow the descent, so at least one of the leg sensors should have been activated at touch-down.
  It's easy to play "armchair quarterback"
  How apropos. Playing armchair quarterback with a failed touchdown problem... ;-)
  
  - - - -
4. Re:Drive-by development syndrome by JJ · 2000-02-17 02:48 · Score: 1
  
  While I agree with most of the arguments presented, I wish to add that they were all equally true during the Apollo project and it got off the ground, yes with the death of three astronauts, but on a really cramped schedule.
  NASA needs to be depoliticized, maybe even split in two and completely revamped. There are two major goals in space exploration: manned space flight and planetary exploration. More often than not these two are in conflict, at least for resources.
  Manned spaceflight has a few major goals: a) regular, dependable service for satellites, b) a space station, c) a lunar station, d) Mars landing . . . The military controls a big portion of this side and to be honest, why not all?
  Planetary exploration has much smaller goals but many more and more exotic ones. This should still be handled by consortia of universities, be solely directed toward science, and probably needs exclusively civilian leadership/staffing.
  
  --
  So long and thanks for all the fish . . . !!!
Design changes to prevent similar failures by Tassach · 2000-02-16 02:33 · Score: 2

I re-read the original article and have some observations on how I would have made the software more fault-tolerant.

The lander had 3 legs, each with a simple switch. If any one of the switches read "closed", the engines would be shut down (presumably to keep it from cooking itself with blow-back and/or flipping over). This was a simplification of earlier landers, which used a radar altimeter to tell the engines when to shut off.

The first change I would have made would be to design the software to scan the switches repeatedly, rather than just once. Before the landing sequence begins, scan all the switches 5 (or more) times. They should all read open each time; if any reads closed, disregard further input from that sensor. (If all 3 failed closed, you'd be SOL and would have to rely on a back-up system.) Then, after the legs deploy and the engines start firing, poll the switches at a rate of say 10x per second. Instead of shutting down the engines as soon as any 1 switch reads closed, only shut down the engines after you have the same switch read closed for 3 consecutive pollings (even better would be to require 3 consecutive closed readings from 2 sensors [assuming that you did not find 2 failures earlier]).
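Roughly, the scheme above could be sketched like this. It is an illustration only, not the flight software: the function names, the tuple-of-booleans representation of a scan (True meaning "closed"), and the simulated readings are all assumptions made for the example.

```python
from typing import Iterable, Optional, Sequence, Set

NUM_LEGS = 3            # one touchdown switch per leg
PRECHECK_SCANS = 5      # scans taken before the landing sequence begins
CONSECUTIVE_NEEDED = 3  # closed readings in a row before a switch is believed


def find_stuck_switches(precheck: Iterable[Sequence[bool]]) -> Set[int]:
    """Return the legs whose switch read closed (True) in any pre-check scan."""
    stuck: Set[int] = set()
    for scan in precheck:
        for leg, closed in enumerate(scan):
            if closed:
                stuck.add(leg)
    return stuck


def cutoff_poll_index(polls: Iterable[Sequence[bool]],
                      ignored: Set[int],
                      votes_needed: int = 1) -> Optional[int]:
    """Return the index of the poll at which the engines would be cut, or None.

    A switch counts only after CONSECUTIVE_NEEDED closed readings in a row;
    votes_needed is how many such switches must agree (1, or 2 for the
    stricter variant).
    """
    streak = [0] * NUM_LEGS
    for i, scan in enumerate(polls):
        for leg, closed in enumerate(scan):
            if leg in ignored:
                continue  # switch failed the pre-check, disregard it
            streak[leg] = streak[leg] + 1 if closed else 0
        confirmed = sum(1 for s in streak if s >= CONSECUTIVE_NEEDED)
        if confirmed >= votes_needed:
            return i
    return None


# Simulated descent: a two-poll glitch on leg 0, then real contact on all legs.
precheck = [(False, False, False)] * PRECHECK_SCANS
polls = [(True, False, False)] * 2 + [(False, False, False)] * 2 + [(True, True, True)] * 3
shutdown_at = cutoff_poll_index(polls, find_stuck_switches(precheck))
```

At 10 polls per second, requiring 3 consecutive readings means any closure shorter than about 0.3 s is ignored; whether the real leg-deployment transient was that short is not something the article settles.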

I think that the logic should be to keep firing the engines until you are SURE it is safe to stop; I think this is a safer failure mode than to risk having the engines shut down prematurely. This approach leaves you vulnerable to blow-back damage from the exhaust and/or flipping over after touchdown if the switch system fails completely.

To combat these failure modes, you would need additional sensors to detect blow-back and tipping. One possible sensor to detect blow-back could be a series of strips made of a metal with a sufficiently low melting point as to melt under prolonged exposure to exhaust gases; tipping could be detected via a mechanism similar to a common mercury switch or an aircraft artificial horizon. Also, the fact that you have a limited fuel supply limits the amount of damage you can suffer from not shutting down in time. Even if all your cut-off systems fail, the engine will stop firing when it runs out of fuel. While this would be a bad failure, it would probably not result in a total loss. Even if some of the more sensitive instruments got cooked or the lander flipped over, you could still probably get SOME usable data back. A semi-functional lander is much better than a smoking crater!

"The axiom 'An honest man has nothing to fear from the police'

--
Why is it that the proponents of "one nation under God" are so eager to get rid of "liberty and justice for all"?
1. Re:Design changes to prevent similar failures by CyberDong · 2000-02-16 03:10 · Score: 1
  
  a series of strips made of a metal with a sufficiently low melting point as to melt under prolonged exposure to exhaust gases
  I don't know what metal you're proposing, but lead would make a reasonable choice here. Of course, the cost of lifting lead makes it prohibitive... And then there's the issue of re-entry heat. While the shields should absorb most of the initial heat, this thing's still going to be falling pretty quick. The sensor could activate due to the heat of friction, or just from falling through the exhaust...
  tipping could be detected via a mechanism similar to a common mercury switch
  It's a pretty safe bet that something falling on a parachute will activate the tipping sensor well before grounding.
  So now we have a situation where the legs sensors are tripped, the metal strips are melted, and the tipping sensor's activated, all prior to grounding. Not good.
  Additional (read redundant) safety sensors are undesirable. How many levels of backup is enough? Aside from the weight issue, there's Murphy's Law to consider. More sensors means more possible failure modes as well.
  scan the switches repeatedly, rather than just once
  This is the best solution from a practical and simplistic point of view. Oh yeah, and have the different departments communicate. Maybe they ought to install something like bugzilla to ensure that all groups are aware of defects found by others.
  
  - - - -
2. Re:Design changes to prevent similar failures by Wigs · 2000-02-16 10:30 · Score: 1
  
  Well, it's obvious that there are plenty of ways to fix this problem. However, these kinds of ways are exactly what NASA is trying to avoid.
  This was a simplification of earlier landers, which used a radar altimeter to tell the engines when to shut off.
  A radar altimeter isn't all that complex, nor does it take up much space, but it was scrubbed in an attempt to make the craft cheaper and less complex.
  To combat these failure modes, you would need additional sensors to detect blow-back and tipping. One possible sensor to detect blow-back could be a series of strips made of a metal with a sufficiently low melting point as to melt under prolonged exposure to exhaust gases; tipping could be detected via a mechanism similar to a common mercury switch or an aircraft artificial horizon.
  That's probably a lot more trouble than it's worth. To figure out how to connect all that will take valuable time and money that the space program just doesn't have.
  The article pointed out that the craft would only be slowed to 2.4 m/s. That's pretty fast, and the lander must have been fairly durable. Rather than implementing all of these possible fixes, something much easier could have been done.
  The article suggests that they had a good idea of exactly where the lander would be, how fast it would be going, and what type of conditions it would be running into. Figuring out whether or not the legs would have been affected by the deceleration due to the parachute is a rather trivial physics problem. Rather than come up with a complex solution, a simple delay in the timing of the landing sequence could have been included. Instead of allowing the engines to shut down whenever a switch was activated, only allow the switches to become active after a certain period of time had passed. This would have allowed the engines to operate normally and only be shut off when the switches could be triggered, when the engineers would be sure that the lander would be at a realistic height above the surface.
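  A minimal sketch of that arming delay, assuming the software sees a stream of (seconds since engine start, any switch closed) samples. The 30-second figure and the function name are placeholders for illustration, not values from the article.

```python
from typing import Iterable, Optional, Tuple

ARM_DELAY_S = 30.0  # hypothetical arming delay; the real value would come from the descent profile


def engine_cutoff_time(samples: Iterable[Tuple[float, bool]],
                       arm_delay_s: float = ARM_DELAY_S) -> Optional[float]:
    """Return the time at which the engines would be cut, or None if never.

    Readings taken before the switches are armed are discarded outright.
    They are not remembered, so an early glitch cannot trigger a shutdown
    the moment arming happens.
    """
    for t, closed in samples:
        if t < arm_delay_s:
            continue  # switches not armed yet: ignore the reading entirely
        if closed:
            return t
    return None


# Simulated descent: a glitch at 0.5 s (e.g. legs locking), real contact at 36.2 s.
descent = [(0.5, True), (1.0, False), (35.0, False), (36.2, True)]
cutoff = engine_cutoff_time(descent)
```

  The point of discarding early readings rather than latching them is that a switch tripped during deployment has no effect once the gate opens.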
  Wigs
  --The purpose of Physics 7A is to make the engineers realize that they're not perfect, and to make the rest of the people realize that they're not engineers.
3. Re:Design changes to prevent similar failures by Niko. · 2000-02-19 13:49 · Score: 1
  
  This would have allowed the engines to operate normally and only be shut off when the switches could be triggered, when the engineers would be sure that the lander would be at a realistic height above the surface.
  
  I hope you're not suggesting that the engineers get into the landing-sequence control loop? They are many light-seconds away and lose a lot of dexterity because of it.
Testing? we don't need no stinking testing!! by JJ · 2000-02-16 22:22 · Score: 1

That seems to be the current NASA head's (Dan Goldin, Clinton's hand-picked point man) thought. Years back I worked on NASA's PVO data. I was shown a film of the testing, done in a huge water tank, of each of the critical stages of the mission. I understand this has been completely scrubbed because 'it costs too damn much'. Well, failure costs even more.

--
So long and thanks for all the fish . . . !!!