Bad Code May Have Crashed Schiaparelli Mars Lander (nature.com)


Posted by EditorDavid on Saturday October 29, 2016 @11:34AM from the bug-hunt dept.

cadogan west writes "In accordance with the longstanding tradition of bad software wrecking space probes (see Mariner 1), it appears a coding bug crashed the ESA's latest attempt to land on Mars." Nature reports: Thrusters, designed to decelerate the craft for 30 seconds until it was metres off the ground, engaged for only around 3 seconds before they were commanded to switch off, because the lander's computer thought it was on the ground. The lander even switched on its suite of instruments, ready to record Mars's weather and electrical field, although they did not collect data...

The most likely culprit is a flaw in the craft's software or a problem in merging the data coming from different sensors, which may have led the craft to believe it was lower in altitude than it really was, says Andrea Accomazzo, ESA's head of solar and planetary missions. Accomazzo says that this is a hunch; he is reluctant to diagnose the fault before a full post-mortem has been carried out... But software glitches should be easier to fix than a fundamental problem with the landing hardware, which ESA scientists say seems to have passed its test with flying colours.

Mark my words by phantomfive · 2016-10-29 11:41 · Score: 5, Funny

This wouldn't have happened if they'd used imperial not metric!
New age hippie liberal airheads. If it's not a hogshead, it's not fresh!

--
"First they came for the slanderers and i said nothing."
1. Re:Mark my words by Man+On+Pink+Corner · 2016-10-29 15:20 · Score: 2
  
  Not naming it after a crater might've helped, too.
2. Re:Mark my words by michelcolman · 2016-10-29 20:18 · Score: 4, Funny
  
  Now whenever an ESA scientist wants to talk about the Schiaparelli crater, you can ask "which one?" to shut him up.
Martians by meglon · 2016-10-29 11:41 · Score: 4, Funny

They're still unwilling to concede that their defenses against the Martians' OBDS (Orbital Bombardment Defense System) are inadequate.

--
Fascism: An authoritarian and nationalistic right-wing system of government and social organization. See also: NAZI's
1. Re:Martians by phantomfive · 2016-10-29 13:28 · Score: 2
  
  The most illustrious Council of the Elders met beneath the purple sky. Fields of loyal adepts filled the gathering grounds, as many loyal civilians waited on the perimeter, pushing ever forward to hear the words of the great one speaking, to even catch a glimpse of one of his most reputable gelsacs. As K'breel, speaker for the Council, stood to speak, a hush fell over the crowd, and all stood in rapt attention, speaking thusly:
  
  Behold how the weaklings have fallen! Our priests and soldiers have toiled many days to finish our planetary defenses, and now they are operational! Our prayers during the last eclipse were especially effective.
  A junior reporter who asked a question about 'metric' was hastily removed from the scene.
  
  --
  "First they came for the slanderers and i said nothing."
There is no bad code. by ZecretZquirrel · 2016-10-29 11:43 · Score: 2

Only bad testing.
1. Re:There is no bad code. by Anonymous Coward · 2016-10-29 13:00 · Score: 2, Interesting
  
  Working in a company that makes automotive electronics, I can say that any problem without an obvious hardware assembly cause becomes defined as a software problem.
  Faulty sensor causing false readings that cause the software to detect that the craft is on the ground? That's the software's fault for not detecting that the sensor was faulty and using magic as a backup method to get the right result.
2. Re:There is no bad code. by m00sh · 2016-10-29 13:49 · Score: 3, Insightful
  
  Testing on another planet is not that easy, though.
  Yes, test it all in production.
  Since testing is sooooooooo hard.
  Landing is the most complicated part and Beagle and others have failed exactly here. There should be x100 or even more code for unit and integration testing than the actual code itself for the landing code. And, those tests should run through every permutation possible of every possible failure point or bad sensor readings.
  There is no way it thinks it has landed with that many sensor inputs. It is simply code that is not put through a good enough testing system.
3. Re:There is no bad code. by Anonymous Coward · 2016-10-29 16:05 · Score: 2, Interesting
  
  If the people writing the simulation are too close to the people writing the control software, I can see this happening.
  When I worked on this stuff (and I did, including a Mars probe) we had three independent teams on different sides of the building, each with their own set of requirements, design, code and tests. Not only that, but the development environments and languages were different to avoid common mode bugs. Fun times; I have no idea how things are done today.
4. Re:There is no bad code. by cwsumner · 2016-10-30 09:19 · Score: 2
  
  It's actually worse than that. If there's a problem with the hardware, i.e. it's known to be failing to do what it's supposed to, it's the software people who're tasked with making a "workaround", i.e. frigging their own code to correct the error rather than the (often more expensive) hardware mod. I've got so many software projects behind me with hacks for hardware bugs you wouldn't believe. "isn't that what software is for?" is the inevitable bollocks you get from hardware engineers when confronted with the problem.
  Detecting when the hardware fails -is- part of the software's job. Industrial software does this routinely. Why can't aerospace software?
  But launching with known-bad hardware is criminal... 8-P
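The exhaustive fault-permutation testing argued for in this thread can be sketched in a few lines. This is a toy illustration in Python, not anything from the actual Schiaparelli software: the fault modes, the three-sensor majority vote, and the names `read_altitude`, `touchdown_detected`, and `false_touchdowns` are all assumptions made up for the example.

```python
from itertools import product

# Hypothetical fault modes for a single altimeter (illustrative only).
FAULTS = ("ok", "stuck_zero", "saturated", "dropout")


def read_altitude(true_alt_m, fault):
    """Return what a hypothetical altimeter reports under the given fault."""
    if fault == "ok":
        return true_alt_m
    if fault == "stuck_zero":
        return 0.0
    if fault == "saturated":
        return 1e9
    return None  # dropout: no reading at all


def touchdown_detected(readings, threshold_m=1.0):
    """Declare touchdown only if a strict majority of sensors read at or below the threshold."""
    votes = sum(1 for r in readings if r is not None and r <= threshold_m)
    return votes * 2 > len(readings)


def false_touchdowns(true_alt_m, n_sensors=3, max_faults=1):
    """Try every fault combination with at most max_faults faulty sensors.

    Returns the combinations that wrongly report touchdown while the
    craft is still at true_alt_m.
    """
    bad = []
    for combo in product(FAULTS, repeat=n_sensors):
        if sum(f != "ok" for f in combo) > max_faults:
            continue
        readings = [read_altitude(true_alt_m, f) for f in combo]
        if touchdown_detected(readings):
            bad.append(combo)
    return bad


if __name__ == "__main__":
    print(false_touchdowns(3000.0, max_faults=1))
    print(false_touchdowns(3000.0, max_faults=2))
```

In this toy model, no single sensor fault produces a false touchdown at 3000 m, but allowing two simultaneous faults does: two stuck-at-zero sensors outvote the healthy one. That is the kind of result an exhaustive sweep is meant to surface before launch rather than after.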
Considering the decline in code quality... by gTsiros · 2016-10-29 12:04 · Score: 2

...in recent years, it wouldn't surprise me one bit

--
Looking for people to chat about multicopters, coding, music. skype: gtsiros
1. Re:Considering the decline in code quality... by Drethon · 2016-10-29 16:44 · Score: 2
  
  ...in recent years, it wouldn't surprise me one bit
  Yeah, ever since they stopped using goto and started with these class things, everything is going to hell.
2. Re:Considering the decline in code quality... by wonkey_monkey · 2016-10-29 22:09 · Score: 4, Funny
  
  everything is going to hell.
  No, everything is calling hell() as a function.
  
  --
  systemd is Roko's Basilisk.
Re:easier to fix? by hey! · 2016-10-29 12:18 · Score: 2

What the hell is that "easier to fix" comment about?
How are you going to issue a software patch to the pile of rubble on another planet? This is not a situation where you can ship the product without testing and fix it in firmware later!
I've been doing a lot of reading about the early space programs of the US and the Soviet Union, and in that context the meaning is clear: you can use the same approach in the next Mars landing attempt; you don't have to redesign an entirely new system.
"Rocket science" is hard, because you not only have to be smart, you have to be able to stand repeated failure. Normal people when faced with a spectacular fiasco give up, or they wipe the slate clean and start over. But in something as complicated as a mission like this you have to look at it this way: from a vehicular standpoint everything worked like a charm right up until the last three minutes or so of the trip.

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Re:easier to fix? by plover · 2016-10-29 12:19 · Score: 4, Funny

How are you going to issue a software patch to the pile of rubble on another planet? This is not a situation where you can ship the product without testing and fix it in firmware later!
It's Agile. The product owner will raise this issue as a priority in the backlog, they'll fix it in this sprint, and it will ship in the next release.

--
John
Then why the black oily explosion? by Billly+Gates · 2016-10-29 12:46 · Score: 2

A quick glance at the low resolution screenshot showed an explosion with black soot. Engineers said it was caused by the rockets still being on.
If they had been turned off, it would have leaked fuel into its crater, but the fuel would not have ignited.

--
http://saveie6.com/
QA by bradgoodman · 2016-10-29 13:10 · Score: 4, Informative

I've been in organizations that had pretty light SQA departments. I used to say that the "really" good shops had 1-to-3 ratios - 1 engineer doing QA for every 3 doing implementation. When I started working for more "mission critical" stuff - that ratio went even higher.
I know people that work in companies that design chips. Those manufacturing cycles are MUCH longer and more expensive - you can't just recompile when you test and find a bug. Thus, their QA is probably more like 10 people doing simulation (behavioral, thermal, timing, power, emissions, RF susceptibility, etc.) before a design is even fabricated.
I would imagine that in Space Exploration - this would go even higher - given the time and expense of these missions. The point is - saying "it's just software" doesn't help you here. Software is *very* complex, and between the intricacies of advanced logic and the variability of factors, its complexity probably dwarfs that of the hardware components in this day and age.
1. Re:QA by ShakaUVM · 2016-10-29 15:04 · Score: 3, Informative
  
  >I would imagine that in Space Exploration - this would go even higher - given the time and expense of these missions.
  It is. Well, at least it is at JPL - I've gone through their coding standards and testing process for spaceflight, and it's extremely intensive.
  I watched a video on their standards before, and without rewatching it I don't know if this is the same one, but it looks pretty good skimming through it.
  https://www.youtube.com/watch?...
  I'd be really interested in seeing someone go through the process and finding out where it went wrong.
2. Re:QA by johannesg · 2016-10-29 20:20 · Score: 5, Informative
  
  I work for a company that writes those simulations. Generally a simulation consists of a CPU emulator that runs the onboard software, and a whole bunch of models for each aspect of the spacecraft and environment: the orbit model, the communication model, various instrument models, etc. These systems are generally set up to allow gradual replacement of each model with real hardware as it becomes available, so the software development is already underway long before the spacecraft hardware has even been built. Each model is a hard real-time program (to allow drop-in replacement of hardware), and has extensive capabilities for error injection in order to simulate things like flipped bits, broken communication channels, broken sensors, etc.
  I don't know what happened on Schiaparelli and they weren't a customer of ours anyway, but a scenario where a sensor breaks and sends bogus information could and should have been tested for during development.
  I'm not sure what the software engineer:QA ratio is - most of that happens internally by the spacecraft people. You run into their QA people everywhere though, while I have yet to personally meet my first flight software engineer.
  Oh, and back in the day I wrote the very first software-only environment for testing flight software on the ground. Up until then, the test environment used real hardware for the flight computer, thus requiring an expensive second set of flight computers just for doing the onboard software development. I hacked together a proof of concept that showed that you _could_ in fact model and simulate the flight computer as well, leading to a substantial cost saving on space projects since...
  The flight computer _simulator_ generally speaking runs on Linux. I'm not sure what the models use these days, but I have seen IRIX and Sun systems around for this purpose. As for the flight computer itself, VxWorks is not an uncommon choice of OS, and the on-board CPU is usually something like ERC32 or Leon - both are radiation-hardened SPARCs.
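The error-injection idea described in this thread (flipped bits, broken sensors) can be illustrated with a very small model. This is only a sketch under stated assumptions: the `AltimeterModel` class, its 16-bit raw word, and the `plausible` step check are hypothetical names and choices for the example, and bear no relation to any real simulator or flight software.

```python
class AltimeterModel:
    """Toy stand-in for a simulated instrument model with fault injection."""

    def __init__(self, true_altitude_m):
        self.true_altitude_m = true_altitude_m
        self.flip_mask = 0       # bits XORed into the raw 16-bit word
        self.stuck_value = None  # if set, the sensor always reports this

    def inject_bit_flip(self, bit):
        """Simulate a flipped bit in the raw output word."""
        self.flip_mask |= 1 << bit

    def inject_stuck(self, raw):
        """Simulate a broken sensor that reports a constant value."""
        self.stuck_value = raw

    def read_raw(self):
        """Return the raw 16-bit altitude word, in whole metres."""
        if self.stuck_value is not None:
            return self.stuck_value
        raw = max(0, min(int(self.true_altitude_m), 0xFFFF))
        return raw ^ self.flip_mask


def plausible(prev_raw, new_raw, max_step=200):
    """Reject readings that jump more than max_step metres between samples.

    The 200 m limit is an arbitrary illustrative value.
    """
    return abs(new_raw - prev_raw) <= max_step


if __name__ == "__main__":
    sensor = AltimeterModel(3000.0)
    last_good = sensor.read_raw()
    sensor.inject_bit_flip(15)
    print(sensor.read_raw(), plausible(last_good, sensor.read_raw()))
```

The software under test would be fed `read_raw()` values with and without injected faults, and a test passes only if implausible readings are rejected instead of being acted on.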
Re:easier to fix? by Anonymous Coward · 2016-10-29 14:00 · Score: 2, Informative

As a manufacturing engineer I can tell you from experience that even in tightly regulated industries, instances of the print not matching the part are more common than you would think, even on parts produced for decades. When you are talking about one-offs that just self-destructed on another planet and cannot compare the as-produced part to the print, it becomes exceedingly difficult to account for last-minute design changes.
My Theory by laing · 2016-10-29 17:49 · Score: 2

Here's my take on it: The lander's radar got a reflection from the plasma from the descent engine and indicated close proximity to the ground. The software then did exactly what it was programmed to do -- it shut down the thrusters. Once that event occurred, the software entered a new state and possibly even shut down the ground radar. Thus it was doomed from that point forward. No amount of pre-mission testing could have detected this scenario.
The above is complete speculation, but I believe that there's a good chance I am correct.
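The scenario speculated about above, where one spurious low reading flips the software into an irreversible "landed" state, can be illustrated with a toy descent loop. Everything here is an assumption for the sake of the example: the function name `thruster_cutoff_index`, the 2 m threshold, and the `confirm` parameter (how many consecutive low readings are required) are invented, and nothing is known here about how the real lander made this decision.

```python
def thruster_cutoff_index(readings_m, threshold_m=2.0, confirm=1):
    """Return the sample index at which thrusters would be cut, or None.

    The cutoff latches: once `confirm` consecutive readings are at or
    below threshold_m, the decision is final and later readings are
    never examined.
    """
    streak = 0
    for i, altitude in enumerate(readings_m):
        streak = streak + 1 if altitude <= threshold_m else 0
        if streak >= confirm:
            return i  # latched: no way back to powered descent
    return None


if __name__ == "__main__":
    # One bogus near-zero sample in an otherwise sane high-altitude descent.
    glitchy = [3000.0, 2990.0, 0.5, 2970.0, 2960.0]
    print(thruster_cutoff_index(glitchy, confirm=1))  # cuts on the glitch
    print(thruster_cutoff_index(glitchy, confirm=3))  # ignores the glitch
```

With `confirm=1` the single bad sample at index 2 shuts the thrusters off; requiring three consecutive low readings ignores it while still cutting off on a genuine final approach such as `[5.0, 3.0, 1.5, 1.0, 0.5]`. Whether such a check would have helped in the real case is not something this sketch can say.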
Re:sounds familiar by Solandri · 2016-10-29 20:42 · Score: 4, Interesting

Usually when that sort of thing happens, it's not because the programmer did something obviously wrong. It's usually because the programmer had two (or more) competing scenarios to design for. He tried to design something which would split the difference, and ended up erring too much to one side.

Lufthansa flight 2904 is a good example. The plane had to land in an expected crosswind on a wet runway. A crosswind landing requires landing with the plane's orientation misaligned from the runway. The plane is pointed into the crosswind, so is actually landing diagonally, then when it hits the ground it has to quickly yaw so it's aligned with the runway (so the wheels are pointed in the right direction). The way this is done is it lands on one gear first, pivots around on that gear to point the nose at the end of the runway, then drops down the second gear, then the nose gear.

The A320's flight computer was programmed to avoid the disastrous scenario of a thrust reverser deploying in mid-air. It prohibited deployment of the thrust reversers unless both rear landing gear had 6.3 tons of force each on them. Full deployment of the spoilers (which disrupt lift to plant the plane firmly on the ground) was prohibited unless the 6.3-ton criterion was met or the wheels were spinning faster than 72 knots.

Unfortunately, in flight 2904's case, the crosswind landing maneuver placed most of the initial force on a single landing gear, so the thrust reversers didn't deploy. The wet runway caused hydroplaning so the spoilers failed to deploy, hindering the pilots from getting the second landing gear down. By the time the above criteria were met and the plane began slowing down, it was well past the halfway point of the runway, and ended up going off the end. Design criteria selected to prevent one type of accident inadvertently caused another.
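The interlock conditions as described in this comment can be written out as two small predicates. This is a sketch of the logic as stated above, not the actual A320 implementation: the function names are made up, the 6.3-ton and 72-knot figures are taken from the comment, and treating exactly 6.3 tons as sufficient is an assumption.

```python
GEAR_LOAD_TONS = 6.3    # per main gear, as quoted in the comment
WHEEL_SPEED_KT = 72.0   # wheel spin-up threshold, as quoted in the comment


def both_gear_loaded(left_tons, right_tons):
    """True when each main gear carries at least the required load."""
    return left_tons >= GEAR_LOAD_TONS and right_tons >= GEAR_LOAD_TONS


def reversers_allowed(left_tons, right_tons):
    """Thrust reversers need weight on both main gear."""
    return both_gear_loaded(left_tons, right_tons)


def spoilers_allowed(left_tons, right_tons, wheel_speed_kt):
    """Spoilers need weight on both main gear OR sufficient wheel spin-up."""
    return both_gear_loaded(left_tons, right_tons) or wheel_speed_kt > WHEEL_SPEED_KT


if __name__ == "__main__":
    # Crosswind touchdown on one gear, wheels hydroplaning (illustrative numbers).
    print(reversers_allowed(8.0, 1.0), spoilers_allowed(8.0, 1.0, 40.0))
    # Both gear firmly down.
    print(reversers_allowed(7.0, 7.0), spoilers_allowed(7.0, 7.0, 40.0))
```

With one gear loaded and the wheels not yet spun up, both predicates return False even though the aircraft is physically on the runway, which is the trap described above: each condition is reasonable on its own, and the combination fails in a case the designers had not weighted heavily enough.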