ESA: European Mars Lander Crash Caused By 1-Second Glitch (space.com)
An anonymous reader quotes a report from Space.com: The European Space Agency (ESA) on Nov. 23 said its Schiaparelli lander's crash landing on Mars on Oct. 19 followed an unexplained saturation of its inertial measurement unit (IMU), which delivered bad data to the lander's computer and forced a premature release of its parachute. Polluted by the IMU data, the lander's computer apparently thought it had either already landed or was just about to land. The parachute system was released, the braking thrusters were fired only briefly and the on-ground systems were activated. Instead of being on the ground, Schiaparelli was still 2.3 miles (3.7 kilometers) above the Mars surface. It crashed, but not before delivering what ESA officials say is a wealth of data on entry into the Mars atmosphere, the functioning and release of the heat shield and the deployment of the parachute -- all of which went according to plan. In its Nov. 23 statement, ESA said the saturation reading from Schiaparelli's inertial measurement unit lasted only a second but was enough to play havoc with the navigation system. ESA said the sequence of events "has been clearly reproduced in computer simulations of the control system's response to the erroneous information." ESA's director of human spaceflight and robotic exploration, David Parker, said in a statement that ExoMars teams are still sifting through the voluminous data harvest from the Schiaparelli mission, and that an external, independent board of inquiry, now being created, would release a final report in early 2017.
Man, if I had a nickel for every time some kind of sensory saturation forced a premature release...
I've calculated my velocity with such exquisite precision that I have no idea where I am.
They should've accounted for +-2 seconds of random delays in sensory data in their simulations.
They're blaming lag?
https://en.wikipedia.org/wiki/...
How in hell did they test their Kalman filter to allow such bad data to reach the decision logic? (I assume they used one.)
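For readers unfamiliar with how a filter can keep bad data away from the decision logic, one common technique is innovation gating: a measurement that lands implausibly far from the prediction is rejected instead of blended in. The following is only a generic scalar sketch. The function name, the gate value, and the numbers are assumptions for illustration, and nothing here is known about how Schiaparelli's navigation filter was actually built.

```python
def gated_update(x, p, z, r, gate=9.0):
    """One scalar Kalman measurement update with an outlier gate.

    x, p: prior estimate and its variance
    z, r: measurement and its variance
    gate: hypothetical limit on the normalized innovation squared
    Returns (new_x, new_p, accepted).
    """
    innovation = z - x
    s = p + r  # innovation variance
    if innovation * innovation / s > gate:
        # Measurement is implausibly far from the prediction: keep the prior.
        return x, p, False
    k = p / s  # Kalman gain
    return x + k * innovation, (1.0 - k) * p, True


# A plausible reading is blended in; a wild one is rejected.
print(gated_update(3700.0, 100.0, 3705.0, 100.0))
print(gated_update(3700.0, 100.0, -50.0, 100.0))
```

A gate like this only helps if the bad value enters the filter as a measurement. If the saturated IMU output drives the prediction step itself, a different kind of check is needed.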
Government cannot make man richer, but it can make him poorer. - Ludwig von Mises
Overflows and bad data problems happened to ESA before.
"Obligatory" Dark Star reference.
Brings to mind the failure of the first Ariane 5 launcher because control software spat an Ada stack trace over a line which was supposed to only contain kinematic data.
http://michaelsmith.id.au
When the altitude stops changing for a whole second the filter is going to have to be a long one! And that ain't desirable for responsive control.
The real question is how could the sensory processor have overloaded in the first place? My money is on simple code bloat. I.e., they used a bunch of generic libraries that use further libraries that use further libraries that use further libraries that use further libraries that use further libraries ...
So they didn't correlate the IMU data with ranging radar or even barometric altitude information so as to avoid this?
I know weight and volume are at a premium on such craft but a barometric sensor (even one capable of operating in Mars's rarefied atmosphere) is the size of a thumbnail and weighs just a fraction of a gram.
Sigh!
Should've used metric seconds.
systemd is Roko's Basilisk.
What kind of IMUs are normally used in these craft? The same kind used in aircraft and weapons?
if they used metric seconds.
... $1000 quadcopters back here on Earth ship with multiple IMUs for redundancy, since the bloody things are about as trustworthy as your average politician.
Having made that glib remark, I'm sure it either did have redundancy, or if it didn't that was for a good reason (e.g. risk of failure deemed too low to warrant the weight penalty in adding redundancy). I would also like to think that they're using somewhat more reliable IMUs than those found in quads.
Yes, you have to wonder why on a mission of this expense and complexity the height above the ground is essentially done by mathematical dead reckoning. Would adding a ranging radar really have added so much to the weight and/or required package size that it was infeasible to include it? Obviously they must have considered it and I'd be interested to know why in the end it was not seen as a viable part of the solution.
If only we can remove the human-factor from our experiments, we can make things perfect!
I knew it was going to be a bad idea...
Hidden malloc() calls are a good example of the bloat problem I'm referring to.
Why wasn't the IMU sensor doubled by other ways of detection? There was no fallback in case it malfunctioned.
No basic sanity checks? As in "This phase must last at least X seconds", or "No switching to landing behavior if altitude measurement from 1 second ago still said '2 miles above surface'"?
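A minimal sketch of the two checks described above, assuming altitude samples arrive with timestamps. The thresholds, the `AltitudeSample` type, and the function name are all made up for illustration and do not come from any ESA documentation.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
MIN_CHUTE_PHASE_S = 10.0      # the parachute phase must last at least this long
MAX_DESCENT_RATE_MPS = 500.0  # fastest altitude change treated as plausible


@dataclass
class AltitudeSample:
    t: float         # seconds since parachute deployment
    altitude: float  # metres above the surface


def may_switch_to_landing(prev, cur, touchdown_altitude=2.0):
    """Allow the switch to landing behaviour only if basic sanity checks pass."""
    if cur.t < MIN_CHUTE_PHASE_S:
        return False  # this phase has not lasted long enough yet
    dt = cur.t - prev.t
    if dt <= 0:
        return False  # samples out of order, do not trust them
    if abs(cur.altitude - prev.altitude) / dt > MAX_DESCENT_RATE_MPS:
        return False  # altitude changed faster than the vehicle can fall
    return cur.altitude <= touchdown_altitude


# One second ago the estimate said 3.7 km; now it claims to be below ground.
print(may_switch_to_landing(AltitudeSample(40.0, 3700.0),
                            AltitudeSample(41.0, -50.0)))  # False
```

The hard part in practice is what to do after the check refuses, since the descent continues either way.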
If the landing struts are subject to a compressive force, you've probably landed. If not, you haven't. Why wouldn't the computer make use of this?
Am I missing something, or is this a stupid design?
"[T]he erroneous information generated an estimated altitude that was negative," ESA said.
Which resulted in an actual altitude that was negative.
A brief burst was enough.
Margaret Hamilton (who just received the Presidential Medal of Freedom) and J. Halcombe Laning's clever software design overcame such an event on the Apollo 11 Moon landing. From Wikipedia:
In one of the critical moments of the Apollo 11 mission, the Apollo Guidance Computer together with the on-board flight software averted an abort of the landing on the Moon. Three minutes before the Lunar lander reached the Moon's surface, several computer alarms were triggered. The computer was overloaded with interrupts caused by incorrectly phased power supplied to the lander's rendezvous radar.[18][19][5] The program alarms indicated "executive overflows", meaning the guidance computer could not complete all of its tasks in real time and had to postpone some of them.[20] The asynchronous executive designed by J. Halcombe Laning [18][21] allowed the computer to cope with the increased demand by prioritizing tasks. Hamilton's priority alarm displays interrupted the astronauts' normal displays to warn them that there was an emergency “giving the astronauts a go/no go decision (to land or not to land)”.[22] Jack Garman, a NASA computer engineer in mission control, recognized the meaning of the errors that were presented to the astronauts by the priority displays and shouted, "Go, go!" And on they went.
Generally such mission critical systems include redundancies and fail-safes. In this case the probe should have looked at its mission profile and said "wait a minute, I'm not supposed to be this low yet?" and taken a look at its other systems (radar, LiDAR, etc) to confirm/dispute the location being provided by the IMU. Putting all of your eggs in one basket (the IMU) was about as daft as a car that loses all steering/braking if its engine stalls.
They have never been to Mars. Mars Curiosity is on Devon Island.
I wonder whether making the source code of these probes available to the public, for vetting would help spot bugs like these? I am also curious whether releasing the code would be problematic for any reason?
Jumpstart the tartan drive.
Lots of just plain old ignorant comments here. I say this in a nonpejorative sense - if you've not worked on flight software, there's no way you could know.
1) Space is unforgiving, hardware designs change very, very slowly. Project schedules move fast and have limited budgets. Just because you can buy a MEMS based IMU for your quadcopter does not mean that you can get one for a spacecraft that will work reliably from -40 to +80C, withstand the vibe tests, the pyroshock, etc. Oh, yeah, and it (and the surrounding electronics) has to not suffer any ill effects from a stray high energy particle: Single Event Effects, generally, but Latch-up, gate rupture, etc. are the issues one would worry about - bit flips and SEFI are something software can potentially handle, but destructive latchup is, well, destructive. Parts made for consumer or automotive high-rel applications just don't worry about this kind of thing. Sure, you could test your brand new whiz-bang IMU in an accelerator, but that costs money and schedule.
2) Yes, not range checking or reasonableness checking the data from the IMU was a definite design problem. Or, maybe, the entry descent and landing algorithm was such that if you DID get a hiccup and detect it, the missing data means all is lost anyway, so why bother. There's a basic principle in spacecraft design that you don't add hardware or software (which might fail, and costs time, money, and maybe mass or power) to give you information which you cannot use.
3) With respect to the landing radar - it's very possible that it has poor accuracy up high and is only used in the final descent stages.
4) Fault handling is tricky - you can easily go down a rat maze of low probability events generating code (and hardware) to handle obscure corner cases, thereby increasing your test costs and time, and potentially introducing other faults. For a lot of plausible error scenarios, it's likely you're going to fail for other reasons, so there's no point in trying to do things like estimate state from other sensors.
Summary - Entry, Descent, and Landing on another planet is really, really hard. It's even harder when you have time and budget constraints.
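To make the range check from point 2 concrete, here is a minimal sketch that measures how long a sensor stays pegged at its full-scale value. The function name, the sample values, and the full-scale figure are assumptions for illustration, not the lander's real numbers.

```python
def longest_saturated_run(rates, full_scale):
    """Length, in samples, of the longest run with |rate| at or beyond full scale."""
    longest = current = 0
    for rate in rates:
        if abs(rate) >= full_scale:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # the run is broken by an in-range sample
    return longest


# Hypothetical gyro rates in deg/s with an assumed full scale of 200 deg/s.
samples = [10.0, 200.0, 200.0, 200.0, 15.0, -200.0]
print(longest_saturated_run(samples, 200.0))  # 3
```

Detecting the run is the cheap part. As point 2 says, it is only worth doing if the control system can do something useful with that knowledge.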
Was someone inside your systems where the firmware and other software was being developed? Did the vehicle only accept encrypted and signed software updates sent back from ESA, or was it possible for a malicious actor to compromise the vehicle with different software?
Don't think for a second that there's not a space-race going on, and that sabotage of this kind is unthinkable. This is a genuinely concerning question that must be asked.
The Schiaparelli EDM lander is an example of the typical one-off missions that humanity does these days. It's worth noting that they could have built and launched two or more of these vehicles for much less than the first and already be correcting the erroneous code on a second spacecraft. Then they wouldn't have to wait years for a replacement mission and have a much better chance of mission success.
-1.#INF -1.#INF -1.#INF -1.#INF
yay, we're on the surface! Deploy the chutes!
FUUUUUUUUUUUUUUUUUUUUCKKKK!!
Software to land a probe on Mars is quite similar, if not identical, to software to put a (nuclear) warhead on a target. That's an important strategic capability for "first world" nations - otherwise you're in the category of Saddam firing Scuds, which are basically V2s with newer parts, and quite literally cannot hit the broad side of a barn (albeit from 100 km away).
So, the hard parts of solving the problem (after you've done the basic college physics part) are likely to not be open source. Things like handling the rapidly changing aerodynamic effects at hypersonic speeds are a long way from "considering air as an incompressible fluid": Your state estimator has inputs with wildly varying magnitudes (dynamic range is huge compared to, say, a quadcopter), with equally wildly varying uncertainties. The transitions as you go through various deployments are also something that does not have a lot of commonality with other applications: airplanes do have deployments (landing gear, flaps, stores), quads and hexes do not; and the deployment of flaps on a plane occurs in a fairly narrow, and well understood, set of dynamic conditions.
This is not a bugginess issue.
Point is the layers create bloat. Any hidden dynamic memory allocations that occur, by whatever system call, are just one more part of the bloat.
Fucking off by one errors!
So say they're doing some kind of weighted average of an altitude computation from the inertial navigation unit and an altitude computation from the doppler radar altimeter.
They should have some code in there saying: If these two values that we're averaging are wildly off from each other, let's not take the average. Instead, let's go into some exception handling code which uses some kind of heuristic (and a little time perhaps) to determine which of the two instruments should become the solely trusted source of the altitude value.
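A rough sketch of that idea, with one possible heuristic swapped in for "some kind of heuristic": when the two sources disagree wildly, trust whichever is closer to the altitude predicted from the last trusted state. The function name, the weights, and the disagreement limit are assumptions for illustration.

```python
def fuse_altitude(imu_alt, radar_alt, predicted_alt,
                  imu_weight=0.5, max_disagreement=200.0):
    """Return (altitude, status) from two altitude sources, in metres.

    predicted_alt is the altitude expected from the last trusted state.
    imu_weight and max_disagreement are hypothetical tuning values.
    """
    if abs(imu_alt - radar_alt) <= max_disagreement:
        fused = imu_weight * imu_alt + (1.0 - imu_weight) * radar_alt
        return fused, "fused"
    # The sources disagree wildly: do not average them. Trust whichever
    # one is closer to the predicted altitude.
    if abs(imu_alt - predicted_alt) <= abs(radar_alt - predicted_alt):
        return imu_alt, "imu_only"
    return radar_alt, "radar_only"


print(fuse_altitude(3700.0, 3650.0, 3680.0))  # (3675.0, 'fused')
print(fuse_altitude(-50.0, 3700.0, 3720.0))   # (3700.0, 'radar_only')
```

This only works if the prediction itself is independent of the failing sensor, which is exactly what is hard to guarantee when the IMU feeds the state estimate.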
Sounds like a lack of hazard analysis / fault tree analysis and/or fault-tolerant design in the design process.
Where are we going and why are we in a handbasket?
Interesting. Was Java garbage collecting?
Guess they should have used a uint32_t rather than an int32_t. I would like to think that the system should be able to handle a flood of data even if the data were right. There is not a lot that you can do with a flood of incorrect data.
You've hit the nail on the head for flight systems development - for the vast majority of detectable errors, there's no reasonable recovery possible, so why bother checking, especially for low probability errors. The chance of introducing some new problem from the code you're adding for the check/recovery is probably greater than the event you're checking for. Add that to the non-zero time/budget to rigorously define, code and test that corner case.
If you're running a real time system with a time critical calculation, and your math check throws an exception, what are you going to do? Substitute the value from the last time tick? Or are you going to write a more complex control system which can deal with missing data? That then has to be tested, debugged, etc.
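For what the first option looks like, here is a minimal sketch of "substitute the value from the last time tick", with a cap on how many ticks in a row it may be used. The class name and the limit are hypothetical, and even this small wrapper adds state and branches that would need testing, which is the point being made above.

```python
import math


class HoldLastGood:
    """Reuse the previous good value when a calculation fails, up to a limit."""

    def __init__(self, initial, max_holds=3):
        self.last_good = initial
        self.max_holds = max_holds  # hypothetical limit on consecutive holds
        self.holds = 0

    def update(self, compute):
        try:
            value = compute()
            if not math.isfinite(value):
                raise ArithmeticError("non-finite result")
        except ArithmeticError:
            self.holds += 1
            if self.holds > self.max_holds:
                raise  # stale for too long: escalate instead of hiding it
            return self.last_good
        self.last_good = value
        self.holds = 0
        return value


h = HoldLastGood(100.0, max_holds=2)
print(h.update(lambda: 90.0))       # 90.0
print(h.update(lambda: 1.0 / 0.0))  # 90.0, division by zero is held over
```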