Mars Failures: Bad Luck or Bad Programs?
HobbySpacer writes "One European mission is on its way to Mars and two US landers will soon launch. They face tough odds for success. Of 34 Mars missions since the start of the space age, 20 have failed. This article looks at why Mars is so hard. It reports, for example, that a former manager on the Mars Pathfinder project believes that 'Software is the number one problem'. He says that since the mid-70s 'software hasn't gone anywhere. There isn't a project that gets their software done.'" Or maybe it has to do with being an incredible distance away, in an inhuman climate. One or the other.
Before complaining about the lack of manned missions to Mars any time soon.
Because software is one of the only things that could, and should, be theoretically perfect.
Maths (especially maths based on 1 or 0) is either right or wrong; it seems to be only when humans get involved that things go wrong and mistakes happen.
I am with NASA on this one (it's almost always a good idea to stick with NASA). From what I remember of fubar'd Mars missions, it's been screw-ups by the programmers.
Just as in the NFL when a receiver drops an easy pass and someone yells that he gets paid to catch passes like that, programmers get PAID not to fuck things up.
The ultimate network admin tool needs HELP!
...is "garbage in, garbage out," right? One of the mottos, anyway.
If you underestimate the resources you need to do software right, of course you'll have problems -- either getting it done on time, or getting the quality to the level it needs to be (or both).
That problem is hardly unique to the space programs. And of course, it would be a little tricky trying to upload a software patch to a hunk of solar-powered metal tens of millions of miles away.
I wonder how much NASA et al. really tap the resources they should be tapping -- I mean, there ARE areas of industry where mission-critical or life-critical software has been developed and deployed for some time now. Maybe it's just a question of getting the right kind of experience in-house...
Xentax
You shouldn't verb words.
... on the last two trips to Mars that failed. Communication and incompetence on Earth were the problem. Exactly how do scientists screw up and get the unit system wrong?
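On the "how do you get the unit system wrong" question: the widely reported cause of the Mars Climate Orbiter loss was thruster impulse data produced in pound-force seconds being consumed as if it were newton seconds. One common defensive habit is to make values carry their units, so a mismatch is converted or rejected instead of silently summed. A minimal Python sketch, assuming a hypothetical `Impulse` type with string unit tags (an illustration, not NASA's actual software):

```python
from dataclasses import dataclass

# 1 lbf = 0.45359237 kg * 9.80665 m/s^2 = 4.4482216152605 N (exact by definition),
# so 1 lbf*s = 4.4482216152605 N*s.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Hypothetical impulse type that carries its unit along with its value."""
    value: float
    unit: str  # "N*s" or "lbf*s"

    def to_newton_seconds(self) -> float:
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


def total_impulse_n_s(firings: list) -> float:
    """Sum thruster firings in SI units, converting each one explicitly."""
    return sum(f.to_newton_seconds() for f in firings)


def naive_total(raw_values: list) -> float:
    """The bug pattern: bare numbers with no record of which unit they are in."""
    return sum(raw_values)


# The same physical firing, reported in English units:
firing = Impulse(10.0, "lbf*s")
print(naive_total([firing.value]))   # read as N*s: low by a factor of about 4.45
print(total_impulse_n_s([firing]))   # converted: about 44.48 N*s
```

An unknown unit tag raises an error instead of being guessed, which is the point of the design.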
What we need is a bit of competition between nations. Let's face it, without Kennedy wanting to 'beat the Russians' to the moon, there would have been no Apollo programme. Nowadays we throw unmanned stuff around and expect it to perform flawlessly with (comparatively) little monetary backing and none of the incentives of older space programmes.
However, just throwing money at the problem isn't going to solve it. I'd suggest throwing away the rulebook and starting over for unmanned systems: better craft, fewer multimillion-dollar single units, and more cheaper devices that can carry out multiple landings at once.
For once, it might be worth imagining a Beowulf cluster of those things, because with many cheaper devices, the mission would most likely have a modicum of success.
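The many-cheap-probes intuition can be put in numbers. Assuming each probe succeeds independently with the same probability (a big assumption: a shared software bug would sink every copy at once), the chance that at least one succeeds is 1 - (1 - p)^n. A sketch with made-up probabilities; `p_common_flaw` is a hypothetical parameter for a design flaw shared by all copies:

```python
def p_at_least_one(p_success: float, n_probes: int) -> float:
    """Chance that at least one of n independent probes succeeds."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be between 0 and 1")
    return 1.0 - (1.0 - p_success) ** n_probes


def p_at_least_one_shared_flaw(p_success: float, n_probes: int,
                               p_common_flaw: float) -> float:
    """Same, but with a chance that one shared design flaw dooms every copy.

    Here p_success is the per-probe success chance given no shared flaw.
    """
    return (1.0 - p_common_flaw) * p_at_least_one(p_success, n_probes)


for n in (1, 2, 4, 8):
    print(n, round(p_at_least_one(0.5, n), 4),
          round(p_at_least_one_shared_flaw(0.5, n, 0.2), 4))
```

The second column shows the catch: once a common-mode failure dominates, adding probes cannot push the odds past 1 - p_common_flaw.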
It's interesting that he blames the problems of software on external pressures such as management hassling of coders, but there is no mention of project delivery methodology. I would be interested to know what methods they use. Are they using continuous integration techniques, unit testing, agile methodologies, XP? These things, in my experience, are crucial to low-bug software. Also, who are they employing to write their software? Rocket scientists or coders? In my experience, domain expertise counts for very little when it comes to writing rock-solid code.
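For anyone wondering what "unit testing" looks like in practice, here is a minimal sketch using Python's standard `unittest` module. The mission-clock helpers are hypothetical; the only real-world fact baked in is the length of a Mars solar day (24 h 39 min 35.244 s):

```python
import unittest

# Length of a Mars solar day (sol): 24 h 39 min 35.244 s.
SECONDS_PER_SOL = 24 * 3600 + 39 * 60 + 35.244


def sols_to_seconds(sols: float) -> float:
    """Hypothetical mission-clock helper: convert sols to SI seconds."""
    return sols * SECONDS_PER_SOL


def seconds_to_sols(seconds: float) -> float:
    """Inverse of sols_to_seconds."""
    return seconds / SECONDS_PER_SOL


class MissionClockTests(unittest.TestCase):
    def test_one_sol(self):
        self.assertAlmostEqual(sols_to_seconds(1.0), 88775.244, places=6)

    def test_sol_is_longer_than_earth_day(self):
        # Guards against someone "simplifying" the constant to 86400.
        self.assertGreater(SECONDS_PER_SOL, 86400)

    def test_round_trip(self):
        for sols in (0.0, 0.5, 1.0, 90.0):
            self.assertAlmostEqual(seconds_to_sols(sols_to_seconds(sols)), sols, places=9)
```

Saved to a file, this runs with `python -m unittest <file>`; a continuous integration server would run the same command on every commit.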
----
I think that is part of the difficulty...
With 512 BYTES of RAM you can literally look at the entire contents. You can be aware of every single bit on the system.
Now that we have gigabytes of RAM, and even more of other storage, it is simply impossible to sort through every bit. This is where errors roll in.
I'm not sure what to do about it, but I see why there is difficulty.
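The scale difference is easy to put in numbers. A back-of-the-envelope sketch, assuming one gibibyte as a stand-in for "gigabytes" and a hypothetical inspection rate of one bit per second:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def bits(num_bytes: int) -> int:
    """Number of individual bits in a memory of the given size."""
    return num_bytes * 8


def review_seconds(num_bytes: int, bits_per_second: float = 1.0) -> float:
    """Time to eyeball every bit at a (hypothetical) steady pace."""
    return bits(num_bytes) / bits_per_second


SMALL = 512          # 512 bytes, as in the post
LARGE = 1024 ** 3    # one gibibyte, standing in for "gigabytes of RAM"

print(bits(SMALL))                               # 4096
print(bits(LARGE))                               # 8589934592
print(bits(LARGE) // bits(SMALL))                # 2097152
print(review_seconds(SMALL) / 60)                # about 68 minutes
print(review_seconds(LARGE) / SECONDS_PER_YEAR)  # about 272 years
```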
Sticks and Stones may break my bones, but copyright will always protect me.
It looks as if the testing and debugging starts at the beginning and works through the mission. I suppose this will eventually work, but it seems to be an expensive way to do it.
Well, there are a lot of reasons things go wrong. Landing a spacecraft on a different planet is inherently difficult, and when you read about how MER-1 and MER-2 will land, it's amazing that they can work at all.
The flip side is this: after Mars Observer spectacularly failed in 1993 ("Martians"), NASA started to go with faster, cheaper, better. The idea was, instead of a single $1 billion mission every 5 years with a 90% chance of success, why not two $200 million missions every two years, with an 80% chance of success? Everyone loves this idea when it works (Pathfinder), but when a cheap spacecraft fails, the public doesn't care if it cost $10 million or $10 billion; all we know is that NASA is wasting money.
So, the answer is, NASA has hit some bad luck. But the idea of faster, cheaper, better is ultimately a cost-effective one, so if we can solve these software problems (I mean, can't someone independently design a landing simulator?), and NASA can get an 80-90% success rate, we'll be getting a lot more science for the dollar. But NASA-haters will always have some missions to point to as a "waste" of money, and will try to cut funding because it's mismanaged; other space junkies will insist that anything under 100% is unacceptable, even if costs have to double to move from 80% to 100%. I don't know which attitude is more damaging.
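The faster-cheaper-better argument is an expected-value calculation. A rough sketch using the post's own hypothetical figures over an assumed 10-year horizon; the horizon, the `Strategy` class, and the "double cost for 100%" line are assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    cost_per_mission_musd: float  # millions of US dollars
    missions_per_cycle: int
    cycle_years: float
    p_success: float

    def _missions(self, horizon_years: float) -> int:
        # Only whole launch cycles inside the horizon are counted.
        cycles = int(horizon_years // self.cycle_years)
        return cycles * self.missions_per_cycle

    def expected_successes(self, horizon_years: float) -> float:
        return self._missions(horizon_years) * self.p_success

    def total_cost_musd(self, horizon_years: float) -> float:
        return self._missions(horizon_years) * self.cost_per_mission_musd

    def successes_per_billion(self, horizon_years: float) -> float:
        cost_busd = self.total_cost_musd(horizon_years) / 1000
        if cost_busd == 0:
            raise ValueError("horizon is shorter than one launch cycle")
        return self.expected_successes(horizon_years) / cost_busd


HORIZON_YEARS = 10  # assumption: compare strategies over one decade
big = Strategy("one $1B mission every 5 years", 1000, 1, 5, 0.90)
cheap = Strategy("two $200M missions every 2 years", 200, 2, 2, 0.80)
doubled = Strategy("same, at double cost for 100%", 400, 2, 2, 1.00)

for s in (big, cheap, doubled):
    print(f"{s.name}: {s.successes_per_billion(HORIZON_YEARS):.2f} "
          "expected successes per $1B")
```

With these inputs the big-mission approach gives about 0.9 expected successes per billion dollars, the cheap approach about 4, and paying double for a perfect record about 2.5.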
NASA has had a "good" track record since Observer; unfortunately, the highest-profile missions have generally failed. If MER-1 and MER-2 are both successful, and SIRTF flies this summer, then everyone should get off the back of NASA's unmanned program for a while.
Seriously. Space is tough, as the US has experienced with both Challenger and Columbia, and those only had to reach orbit. Going even further away in space is tougher. So much can go wrong, and so little can be done to correct it. Certainly, a blunder like the English-vs-metric units bug is huge, but they try. I'm not so sure any private corporation that had been asked to do the same would fare any better. They are pushing limits, where you fail and (hopefully) learn from your mistakes.
Which is why we should continue to try. Giving up, saying "space travel is just too costly and risky," is a big cop-out. If we could send people to another celestial body (the Moon) in 1969 with the equivalent of a pocket calculator but not now, what does that say of our technology? Or sociology? Sure, you could take the narrow-minded approach and say "and what does that bring us? The ability to jump from rock to rock in our solar system?" If so, you might as well ask why people decided to go to the poles (just ice) or whatever. You're still missing the point.
Kjella
Live today, because you never know what tomorrow brings
In my years at NASA Goddard I saw a dysfunctional management operate in ignorance of reality.
There was much praise of the employee who "went the extra mile", "put in long hours" and "served the customer" (that applied to contractor employees). There was also very little thought paid to the consequences of those practices.
What's the first thing to go when you're tired? It's not your body -- it's your mind. That's right -- if you're staying at work until you're feeling tired, you're making mistakes that need to be corrected later. The more tired you are, the more mistakes you make. The more tired you are, the less you can actually do.
I witnessed people who wore their exhaustion as a badge of honor and who, when they got into management, insisted that others emulate their bad example. The result that I saw was people who should have been kept out of management becoming increasingly dominant. This was accentuated by the "faster, better, cheaper" ideology promulgated by former NASA administrator Goldin. This ideology was used to get rid of more experienced (and thus costly) people who were aware of the consequences of trying to squeeze more work out of fewer people.
It could take a long time for NASA to recover from this culture. The project failures of the past few years and the loss of Columbia could be turning points -- or they could be used by incompetents to justify even more dysfunctional behavior.
"Beer is proof God loves us and wants us to be happy." -- B. Franklin
Yes, programmers have erred. To err is human; to allow errors to propagate into mission failures is a failure of systems engineering, and I think that is where the real blame lies. A lot of the problem is that spacecraft systems engineers often have a very amateurish grasp of software, if any at all.
For example, on Mars Climate Orbiter, a junior programmer failed to properly understand the requirements. However, systems engineering failed to:
Helium balloons want to be free.
Space exploration isn't easy.
Look at the Space Shuttle. The Space Shuttle has never had a catastrophic computer failure, but every line of code on that truck has survived review by a group of programmers. They've examined it, line by line, multiple times, in order to ensure that it's exactly right, because the cost of failure is 7 astronauts and a multibillion-dollar orbiter.
The new Mars programs, however, are part of the streamlined "do it on the cheap" NASA. NASA put the Mars Rover down using mostly off-the-shelf and open-source software and a small amount of home-brew stuff. No matter how good open source software gets, it still hasn't undergone the level of review that the Space Shuttle code has seen. No matter how popular an off-the-shelf package is, it's not cost-effective for the manufacturer to give it that sort of treatment. NASA can't afford to do that level of code review because that costs them the ability to do some other program.
NASA is simply trying to do more with less in the unmanned launches, and the cost of that is we need to expect some failures. These failures are unfortunately very visible...
-JDF
I have to really disagree with this. NASA is used to dealing with alien climates and terrain and astronomical distances. NASA is also used to dealing with problems. They have some of the best problem solvers out there, and when something goes wrong, they tend to pinpoint why. When NASA says A, B, and C are the causes of failure, I believe them. When NASA cannot figure out why something went wrong, I worry.
What I'm trying to say is, distance and inhuman conditions shouldn't have that much of an effect on how well a probe works. We built Voyager 1 and 2, didn't we? They worked even better than expected. And they encountered climates and conditions which make Mars look easy.
NASA has dealt with so many varying circumstances and climates over the years, and been so blunt about their mistakes, that I find it hard to believe they would blame the failures of an entire class of missions on something "easy." And yes, blaming failures on software is an easy way out: how many times have you heard someone say "Oh! It must be the software!" when something doesn't go as expected?
Now, I know this guy doesn't speak for NASA as a whole, but as a NASA trained administrator, and the head of some very large projects, I'm willing to take his opinions at face value. If he says it looks like software has really been a cause of failure, who am I to laugh at his expertise and belittle his explanations? I might not like his explanation, but I buy it.
---
"Of course, that's just my opinion. I could be wrong." --Dennis Miller
Most PHBs haven't figured it out yet: SOFTWARE IS HARD. It's amazingly complicated. It's also notoriously hard to come up with realistic estimates.
PHBs also haven't figured out that developers aren't interchangeable widgets. If you know C, it doesn't mean you'll be immediately productive in Korn shell scripting, and vice versa.
PHBs also haven't figured out that experience is key. There are exceptions, but generally speaking, a young hotshot isn't going to be as productive as an experienced professional. Sure, the young hotshot might get v1.0 done first, but it'll be buggy, unreliable, unscalable, hard to maintain, etc.
The "problem with software" is almost entirely a management issue, imho.
-Teckla
We haven't seen software failures taking out manned missions; two shuttles failed from the high stresses of takeoff and re-entry. Just a guess, but the engineering standards are probably much higher for the manned programs, and more people review the code. Also, keep in mind that NASA has been experimenting with the idea of saving money with faster-paced development, which means some reduction in review and other QA standards, particularly on unmanned planetary missions. It may even be that this method is cost-effective in spite of some high-profile failures.
Yes, but can your computer recover from a triple memory failure? Can you rewire your computer remotely to fall back on a redundant system? Frankly, I keep the covers off my case to keep my CPU from overheating.
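The kind of redundancy being described is often built as majority voting across copies (triple modular redundancy when there are three). This is a generic sketch of the voting idea, not how any particular spacecraft does it. Tolerating f failed copies by simple majority takes 2f + 1 copies, so riding out three simultaneous memory failures this way would need seven:

```python
from collections import Counter
from collections.abc import Hashable, Sequence


def majority_vote(copies: Sequence[Hashable]) -> Hashable:
    """Return the value held by a strict majority of redundant copies.

    With 2f + 1 copies, up to f corrupted copies can be outvoted.
    """
    if not copies:
        raise ValueError("no copies to vote on")
    value, count = Counter(copies).most_common(1)[0]
    if 2 * count <= len(copies):
        raise RuntimeError("no majority: too many copies disagree")
    return value


print(majority_vote([42, 42, 7]))              # one bad copy out of three
print(majority_vote([5, 5, 5, 5, 9, 8, 7]))    # three bad copies out of seven
```

When no value has a strict majority the voter refuses to guess, which is where a fallback system would have to take over.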
State of the art is not always measured in Gigahertz.
"Learning is not compulsory... neither is survival."
--Dr. W. Edwards Deming
I think the primary problem is that the technology to build and design probes changes too quickly, and affects design.
I always thought that there should be a way to build a probe's navigation and propulsion systems in a standardized way, so that avionics software wouldn't need to change that much.
Sort of a standardized platform if you will for doing solar system exploration.
This platform would consist of a number of parts that would not change, and could be reusable in a number of different configurations for building a probe, depending on what its job was.
Cameras, photometers, spectrometers, and power sources could all be packaged in the same way, depending on the probe's job.
Every probe that NASA launches is always customized and built around cost and included packages.
I am not so sure that is the best way to go about it as you have to reinvent all the software to manage the probe every time you build one.
Probes should be cheap, produced in high volume (thousands), and interchangeable.
With a standardized approach, failure rates should come down a bit and costs should be reduced.
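The standardized-platform idea maps onto a familiar software pattern: a fixed bus that talks to every payload through one interface, so the core software stays the same while the instrument list changes. A minimal sketch; every class name and power figure here is hypothetical:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


class Instrument(ABC):
    """Hypothetical standard interface every payload instrument implements."""

    @abstractmethod
    def power_draw_watts(self) -> float: ...

    @abstractmethod
    def take_reading(self) -> dict: ...


class Camera(Instrument):
    def power_draw_watts(self) -> float:
        return 5.0  # made-up figure for illustration

    def take_reading(self) -> dict:
        return {"type": "image", "pixels": 1024 * 1024}


class Spectrometer(Instrument):
    def power_draw_watts(self) -> float:
        return 3.0  # made-up figure for illustration

    def take_reading(self) -> dict:
        return {"type": "spectrum", "bins": 256}


@dataclass
class ProbeBus:
    """The unchanging platform: same code, different payload lists."""
    power_budget_watts: float
    payload: list = field(default_factory=list)

    def add(self, instrument: Instrument) -> None:
        used = sum(i.power_draw_watts() for i in self.payload)
        if used + instrument.power_draw_watts() > self.power_budget_watts:
            raise ValueError("payload exceeds the bus power budget")
        self.payload.append(instrument)

    def collect(self) -> list:
        return [i.take_reading() for i in self.payload]


bus = ProbeBus(power_budget_watts=10.0)
bus.add(Camera())
bus.add(Spectrometer())
print(bus.collect())
```

A new instrument only has to implement the two methods; the bus code, and whatever testing it has already had, stays untouched.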
-Hack
Got Geometrodynamics? Aw, too hard to figure out? Too bad.
So then they spent what, twice as much? Three times as much as a QC regime would have cost? All to actually design, build, and install corrective optics on the Hubble to correct for the aberrations in the mirror.
Probably is; then, as a result, budgets STILL get cut. There's no money to do things "right".
and the Viking landed. Dad points out that the budget for Viking was in the neighborhood of 1 billion dollars, and that was when a Mustang Mach 1 cost just over 4 grand. The space program doesn't have the money now to do the missions the right way, which is unfortunate... the developments of NASA when they had tons of money were numerous and wonderful (e.g. Tang!)
... and the Mars vehicles. The Shuttle carries people. You can afford to cut corners a little if no one's going to get killed.
Sean
We like to pick on these simple glitches only because it is poetic to do so. Saying the MPL failed because a programmer failed to initialize a variable sounds much more interesting, and is much easier for a reporter to remember, than saying MPL failed because a programmer failed to initialize a variable which determined how close to the planet the retro-rockets would turn off, and that this was observed in the testing laboratory, but the test data was not analyzed until after the crash.
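The "failed to initialize a variable" failure mode is easy to reproduce in miniature. This is a toy model, not MPL's flight software: the class, the altitudes, and the 10 m steps are all invented for illustration. A touchdown flag gets latched by a spurious jolt early in descent, and whether it is cleared when touchdown sensing is armed decides where the engine shuts off:

```python
class DescentController:
    """Hypothetical descent logic with a latched touchdown flag (toy model)."""

    def __init__(self, reset_latch_when_armed: bool):
        self.reset_latch_when_armed = reset_latch_when_armed
        self.touchdown_latched = False
        self.armed = False
        self.engine_on = True

    def on_leg_sensor(self, signal: bool) -> None:
        if signal:
            self.touchdown_latched = True  # remembered until explicitly cleared

    def arm_touchdown_logic(self) -> None:
        if self.reset_latch_when_armed:
            self.touchdown_latched = False  # discard stale, spurious signals
        self.armed = True

    def step(self) -> None:
        if self.armed and self.touchdown_latched:
            self.engine_on = False


def simulate(reset_latch_when_armed: bool) -> int:
    """Return the altitude in metres at engine cutoff (-1 if it never cuts off)."""
    ctl = DescentController(reset_latch_when_armed)
    for altitude in range(1500, -1, -10):  # toy descent in 10 m steps
        if altitude == 1500:
            ctl.on_leg_sensor(True)        # spurious jolt when the legs deploy
        if altitude == 40:
            ctl.arm_touchdown_logic()
        if altitude == 0:
            ctl.on_leg_sensor(True)        # real touchdown
        ctl.step()
        if not ctl.engine_on:
            return altitude
    return -1


print(simulate(reset_latch_when_armed=False))  # cuts off at 40 m
print(simulate(reset_latch_when_armed=True))   # cuts off at touchdown, 0 m
```

Any test run that printed the first result would have flagged the problem, which is the commenter's point: a test can only help if someone looks at its output.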
0xfeedface
No; for instance, the Mars Pathfinder spacecraft had "128 Mbyte mass memory" and used a RAD6000 computer.
The grandparent post's point still stands. 128MB is one huge mass of program and data to debug. I know I wouldn't stake my reputation on a "bug free" multi-megabyte program--only a fool would.
Remember, the number of possible paths through a program can grow exponentially with the size of the program.
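One way to make the "exponential" claim concrete: a straight run of n independent if-statements already has 2^n distinct paths, so adding code multiplies, rather than adds to, the cases a tester would have to cover. A minimal illustration; it counts paths, it does not measure any real program:

```python
from itertools import product


def count_paths(num_branches: int) -> int:
    """Count execution paths through num_branches independent if-statements
    in sequence by enumerating every true/false combination."""
    return sum(1 for _ in product((False, True), repeat=num_branches))


for n in (1, 5, 10, 20):
    print(n, count_paths(n), 2 ** n)  # the enumeration matches 2**n
```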
This is why I will never trust Windows for anything more than a gaming platform (millions of lines of hastily-written code == one hell of a buggy program). I would bet that any recent version of Windows has several hundred thousand bugs in it.
From a complexity standpoint, UNIX is an order of magnitude better than Windows but is still big enough to have lots of bugs. Linux is similar to UNIX in complexity.
No software in wide use today is bug-free. I have never seen software that was bug-free. Even the printf() call in a "Hello World" program probably has bugs in it, regardless of whether the "Hello World" program exposes them.
Personally, I would never feel confident enough to write software that puts human life directly at risk, unless there are fail-safe non-software-controlled mechanisms in place. Sometimes, we just have to put software aside and let real Engineers do what they do best. And, yes, there is no such thing as a Software Engineer (it is still very much a made-up job title that anyone can have, even me:).
Healthcare article at Kuro5hin
Most of the failures occurred during the orbital entry phase, during which time the transmitter is shut off, so they don't have up-to-the-second data on the reason for the failure.
That's why some folks at NASA develop more sophisticated control software that can take care of failures. The RAX experiment on the DS1 probe successfully demonstrated that this approach is viable.
However, at the moment the project is undergoing a major rewrite in C++, notorious for its 'safety', for reasons having very little to do with engineering...
Lisp is the Tengwar of programming languages.
Software can be done right. Anyone who doesn't believe this either (a) does not know how many millions of lines of software are involved in avionics and air traffic control, (b) never flies on an airplane, or (c) has a death wish. Of course I guess there's also a fourth possibility - when all else fails, blame the software. The space shuttle's record proves that software can be dependable, but also illustrates that making it that way is very, very expensive. Just a matter of priorities.
I think one of the factors contributing to the poor Mars success rate is orbital mechanics. The launch window to Mars opens for only a month or so, about every 26 months. This is the longest interval between window openings for launches from Earth to any other planet; windows to the other planets recur more often (roughly every 19 months for Venus, and about every 13 months or less for the rest). Since missing the launch window means waiting another two years or so, this undoubtedly creates enormous schedule pressure on any team preparing a spacecraft for launch to Mars.
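The roughly two-year figure falls out of the synodic period, the time between repeats of the same Earth-planet geometry: 1/|1/T_Earth - 1/T_planet|. Launch windows recur on roughly that cycle (exact dates depend on the trajectory). A sketch using rounded published sidereal orbital periods:

```python
# Sidereal orbital periods in days (rounded published values).
ORBITAL_PERIOD_DAYS = {
    "Mercury": 87.97,
    "Venus": 224.70,
    "Earth": 365.256,
    "Mars": 686.98,
    "Jupiter": 4332.59,
    "Saturn": 10759.22,
    "Uranus": 30688.5,
    "Neptune": 60182.0,
}


def synodic_period_days(planet: str) -> float:
    """Days between successive repeats of the Earth-planet alignment."""
    t_earth = ORBITAL_PERIOD_DAYS["Earth"]
    t_planet = ORBITAL_PERIOD_DAYS[planet]
    return 1.0 / abs(1.0 / t_earth - 1.0 / t_planet)


for name in ORBITAL_PERIOD_DAYS:
    if name != "Earth":
        print(f"{name}: {synodic_period_days(name):.0f} days")
```

Mars comes out near 780 days, Venus near 584, Mercury near 116, and the outer planets between about 367 and 399.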
As I see it, the problem is this:
1. The distance from Earth to Mars is about 35,000,000 miles at the very closest, and something like 48,000,000 miles at a less favorable close approach (the average distance is much larger still).
2. The velocity of light is constant at 186,000 miles/second.
3. This means it takes roughly 6 to 9 minutes for a radio signal to make the round trip to the spacecraft and back.
4. If the spacecraft encounters difficulties that would require it to report, receive instructions, report back, and receive additional instructions if necessary, then we are talking about a 12 to 18 minute process, just for minor corrections.
5. This is akin to remotely driving an unmanned car with messages transmitted by carrier pigeon.
6. So, for all practical purposes, the landing craft must be autonomous, which means that the software must be reliable, fast, and comprehensive.
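The light-time arithmetic in the list can be checked directly. Using the post's figures (186,000 miles/second; 35 and 48 million miles), a single round trip comes out at about 6.3 and 8.6 minutes, and two command/response exchanges at roughly 12.5 and 17.2 minutes:

```python
SPEED_OF_LIGHT_MI_S = 186_000  # miles per second, the figure used in the post


def round_trip_minutes(distance_miles: float) -> float:
    """Radio round-trip time, to the spacecraft and back, in minutes."""
    return 2 * distance_miles / SPEED_OF_LIGHT_MI_S / 60


for d in (35_000_000, 48_000_000):
    rt = round_trip_minutes(d)
    # One command/response exchange is one round trip; the
    # report -> instructions -> report -> instructions cycle is two.
    print(f"{d:,} mi: {rt:.1f} min round trip, {2 * rt:.1f} min for two exchanges")
```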
I don't know about you folks, but I haven't seen any software that I would trust to drive my car from my house to the office unmanned (about 7 blocks), much less take millions of dollars worth of hardware millions of miles from home and expect it to get there safely.
In my opinion, manned missions make more sense because they have a significantly greater chance of success even though the cost is also significantly higher.
(1) Schedule realistically, so that tasks can be completed without overtime. This may mean some things just cannot be done in the desired time period. Learn to accept that.
(2) Hire and retain sufficient staff, so that the work can be shared between multiple people. This may mean that some of the time the company will be overstaffed. Accept that too.
Obviously both these suggestions come with a pricetag, but lost missions aren't free either...