Design, Hardware, Software Errors Doomed Japanese Hitomi Spacecraft (scientificamerican.com)
Reader Required Snark writes: The Japanese space agency JAXA said its recently launched X-Ray observation satellite Hitomi has been destroyed. After a successful launch on February 17, contact with the satellite was lost on March 28. Of the 10-year expected life span, only three days of observations were collected. Preliminary inquiry points to multiple failures in design, hardware and software. After the launch it was discovered that the star tracker stabilization didn't work in a low magnetic flux area over the South Atlantic. When the backup gyroscopic spin stabilization took control, the spin increased instead of stopping. An internal magnetic limit feature in the gyroscope failed, causing the spin to get worse. Finally, a thruster-based control started, but because of a software failure the spin increased further. The solar panels broke off, leaving the satellite without a long-term power supply. It seems that untested software had been uploaded for thrust control just before the breakup. This is a major loss for astronomical research. Two previous attempts by Japan to launch a high-resolution X-ray calorimeter had also failed, and the next planned sensor of this type is not scheduled until 2028 by the ESA. Just building a replacement unit would take 3 to 5 years and cost $50 million, without the cost of a satellite or launch.
Design, Hardware, Software Error
Oh, is that all?
Better known as 318230.
Only got 3 days of data? Damn, that's gotta hurt.
Also, the "Design, Hardware, Software Error" bit is funny in a way...I mean, what else was left to screw up? This was like the Trifecta of Fuckups.
Just cruising through this digital world at 33 1/3 rpm...
on Reddit's TIFU: https://www.reddit.com/r/tifu/
... that you find they were wired backwards.
Subject says it all.
It seems that untested software had been uploaded for thrust control just before the breakup.
See what happens when you don't disable the GWX settings.
It must have been something you assimilated. . . .
From the TFA
Dan McCammon, an astronomer at the University of Wisconsin–Madison, helped to design and build Hitomi’s premiere scientific instrument, an X-ray calorimeter that measures the energy of X-ray photons with exquisite precision. He has been working on the technology for more than three decades, flying versions of it on the ASTRO-E mission, which failed on launch in 2000, and the Suzaku spacecraft, in which a helium leak rendered the instrument useless weeks after its 2005 launch.
Does anybody else think that their insurance company may not pay out ?
Re-appoint your entire senior software team, especially the lead. Examine the engineering background of the rest.
Hardware fails, that's completely inevitable. Software of the kind we're talking about is meant to limit the impact of independent hardware failures, which it can do because its own failure modes can be given however many fractional 9's of perfect reliability you desire, limited only by available resources.
From the reports, it seems clear that the probe's software was not designed to do that, and the failures of process which started off the event were also not designed using defensive and self-corrective principles.
In other words, this was entirely a people problem, a failure caused by using system software designers who lack the engineering mindset and extreme cautions needed when handling systems of this kind.
The poor software should actually have been caught in an external design audit in advance of launch, and in simulation. Investigate why it wasn't, and you'll probably find yet another people problem.
If the satellite is being designed and built by a government organisation, in the name of the advancement of human knowledge, should we be encouraging the software to be open source? Have there been examples of such initiatives?
Jumpstart the tartan drive.
We're totally colonizing the universe any day now.
Space is dead. It's a radiation-blasted vacuum. Nobody is going to live there. Ever. Get over it, Space Nutters. We should kill all astrophysicists and burn all scifi books. Like in Europe.
Europe got bored of that and the sport is now found elsewhere in the world. I for one welcome space nutters, since they give us something else to talk about :) I would burn the trolls, but since I don't consider myself a violent person, I will settle for making a sport of them.
Jumpstart the tartan drive.
Those are called political and budget pressure by managers who have no clue on engineering ---
Software uploaded without testing? There is no way they could have gotten this far without testing. I am sure there is no engineer in Japan who does not test thoroughly. Actually Japanese code is famous for being of the best quality -
This was caused by politics, bureaucracy and plain bad management.
Why would it be an excellent time? None of the fuck-up dates from TFA are 30 April.
Been on many software projects. From experience, it sounds like a project I am on now. You can't upload untested software on the fly if you want something to work. It's hard to convince non-engineers of this but perhaps this will teach someone... I make sure I protest every time I am asked to release untested software to a production environment, so that when they follow the paper trail they know it really was a bad decision and that the developers did everything they could to stop it from happening ahead of time.
First part is 100% correct. I don't know how you arrived at that conclusion though.
It seems that untested software had been uploaded for thrust control just before the breakup.
Note to self: Don't ask your girlfriend questions you don't want the answers to - again.
It must have been something you assimilated. . . .
This is just one of the more spectacular examples. I have heard of managers of large software teams that "do not believe in testing", I have seen Internet-reachable critical software that got a security evaluation only after deployment, because it was finished only a few days before deployment, and quite a few more things of similar utter incompetence. My guess is that the people responsible for these completely ridiculous screwups are "managers" that think they know how it all works (while being clueless), and that have eliminated all resistance to their views by firing anybody actually competent.
This is a dangerous and completely unacceptable regression. Humanity needs to be good at engineering if it is to have a future.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Interesting I just watched a true-story-based jdrama about the development of rockets in Japan called "Shitamachi Rocket" which blew my mind.
Does this mean we should readjust our opinions on Japanese technology back to the 'only make shit' opinion people had prior to the... 60s?
" star tracker experienced glitches whenever it passed over the eastern coast of South America"
"Somewhere along the way, the problems with the star tracker caused Hitomi to rely instead on another method, a set of gyroscopes, to calculate its orientation in space. But those gyroscopes were reporting, erroneously, that the spacecraft was rotating at a rate of about 20 degrees each hour. Tiny motors known as reaction wheels began to turn to counteract the supposed rotation."
So let me get this straight...2 major pieces of hardware failed to perform as designed yet the third leg of this stool, the control software, was identified as the cause of this catastrophe.
To be sure, there is no excuse for untested software in any production system, let alone one so critical. However, I find the article's headline, and analysis, dubious when there were other design (hardware) based flaws that, if caught earlier, would have prevented this tragedy.
All Engineering disciplines have design flaws that escape into "production." It's only software design flaws that get everyone so breathless. Why?
I'm only guessing (because the article's information content sucks) that what was actually untested was this particular chain of events. My professional guess is that the software was tested except for this failure edge case. You can't test software forever. You have to ship it at some point, and you always ship it with a set of known presumptions.
Sad moment for stellar science...
I'd like to see a more thorough investigation of this set of incidents. That means no one involved gets to skip out by Seppuku. One of the problems with having a number of backup systems is that people tend to think "well, if it breaks, there's a backup system" - not realizing that each time a backup system is added, complexity is added, and that overall reliability can go down instead of up. I don't know if over-reliance on backup systems, and failure to manage complexity, was the cause here, but it's the only thing other than "bad luck" or "sabotage" that can explain this disaster from a country which has many talented engineers.
"Well, I don’t think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error."
FastWorks at its finest. Wonder how many executives got their bonus because it launched?
I would kill all the trolls. It's the right thing to do. Gandalf told me so.
I'd have thought that for a spacecraft control system it would be one of the first pieces of code you'd test! It's equivalent to putting a car into Drive and finding yourself going backwards!
"Ask yourself why an antenna won't deploy on a deep space probe."
"Or ask how they could launch a $6Billion telescope without testing its mirror."
'The Arrival'
https://www.youtube.com/watch?...
Uh, Linux geek since 1999.
So... what *modern* development methodology and platform did they use?
:T:R:A:N:S:
Lol, the T in TIFU is really more of a guideline, in that it's OK to completely ignore it. Same goes for the I.
TIFU is really more like TITOASWIOSEFU: Today I Thought Of A Story Where I or Someone Else Fucked Up. Not quite as catchy.
Only crack the nuts that crack. You don't put the ones that don't crack in the sack.
What the heck is happening to Japan? You'd think they'd have better management and knowhow than this, but I guess their standards are slipping severely.
" It seems that untested software had been uploaded for thrust control just before the breakup"
Microsoft shill alert! A$$hole in the room, and it is not me!
"cos Arianespace is totaly not a thing, ESA was closed down years ago and Darmstadt is only known for its football team.
Watch this Heartland Institute video
We all know countries in the far east have great number crunchers who work hard, but space, aeronautics, the large scale stuff needs more. I worry that a future of space dominated by Asia will lead to a big orbital debris field, much like how the Pacific is becoming a big nuclear waste and refuse pool thanks to Japan. Leave it to the big boys in Europe and North America. It takes more (a lot more) than just crunching the numbers. And xenophobic, proud, stubborn inadequate people in Japan doesn't help the situation.
Flight software developer here (I *am* a rocket engineer, of sorts)
Spacecraft stuff is not made in sufficient volumes to have standardized interfaces beyond the basic electrical interface. Sure, there's a 24V DC power, perhaps a few discretes, some discrete telemetry (voltage, current, temperature), and some kind of data interface (MIL-STD-1553, RS-422 serial, or SpaceWire are most likely).
So you will be writing some custom software to deal with this almost one-of-a-kind interface.
Typically, you are inheriting control code from some previous spacecraft as well - flight software is expensive, the stuff you have to do is pretty much the same every time, so there's a good case for re-use. And perhaps not all the corner cases of that inherited software are understood. Or perhaps your little shim layer that sits between old code and new device and translates "new device format" data into "old device format" data has some issues.
Flight software typically doesn't have lots of extra capability: you have to test it over the entire range, so it tends to be "Do we have a specific requirement for that? Yes: build it and test it. No, it's nice to have: don't build it." So your idea of "incorporate lots of flexibility against potential future devices" would be a non-starter: what requirement would you design against for that "potential future device"? How would you justify that particular requirement, as opposed to another? Say your existing software MUST handle 100 byte messages from the reaction wheel controller, and you want to say "why don't we code it for 1000 bytes to make room for expansion?" That extra space comes at a cost: memory costs money, and testing to 1000 costs more than testing to 100. And ultimately, someone will say "well, why not 2000? or 500?" - unless there's some natural "breakpoint" in the cost function, there's no good rationale.
And with any sort of self checking, you have to trade off the failure probability of the self checker.
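As a rough illustration of that trade-off, here is a toy probability sketch. The model and every number in it are assumptions made up for the example (not figures from any real spacecraft): a unit fails with some probability, a self-checker catches a real failure with some probability, and the checker itself can wrongly take down a healthy unit.

```python
def loss_probability(p_fail, p_detect=0.0, p_false_trip=0.0):
    """Probability of losing the unit under a toy model (all inputs are assumptions).

    p_fail:       chance the unit itself fails.
    p_detect:     chance the self-checker catches a real failure and recovers.
    p_false_trip: chance the self-checker wrongly takes down a healthy unit.
    """
    missed_failure = p_fail * (1.0 - p_detect)      # real failure, not caught
    false_alarm = (1.0 - p_fail) * p_false_trip     # healthy unit, checker misfires
    return missed_failure + false_alarm


def checker_helps(p_fail, p_detect, p_false_trip):
    """True if adding the checker lowers the loss probability in this model."""
    with_checker = loss_probability(p_fail, p_detect, p_false_trip)
    without_checker = loss_probability(p_fail)
    return with_checker < without_checker


if __name__ == "__main__":
    # A reliable checker helps; a trigger-happy one makes things worse.
    print(checker_helps(0.01, 0.9, 0.001))
    print(checker_helps(0.01, 0.9, 0.02))
```

In this model the checker only pays off when p_fail * p_detect exceeds (1 - p_fail) * p_false_trip, which is one way to read "trade off the failure probability of the self checker".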
Not that bad? Half the cost and probably half the time of developing Star Citizen.
The summary says the star tracker didn't work in "an area of low magnetic flux" (the South Atlantic Anomaly). The true issue is that the SAA is a high radiation area and the radiation caused an SEU in the star tracker. The Scientific American article was a bit mixed up about dumping the momentum stored in the reaction wheels. The text is a bit jumbled, but I believe the article was referring to magnetic torque rods which produce a force vs. Earth's magnetic field, but they only work if the spacecraft is stable. The spacecraft was never stable because the IRU (gyroscopes) provided erroneous information. In the end, the ACS issue (probably a sign error) is what killed the spacecraft.
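To show why a sign error in attitude control is so destructive, here is a toy single-axis sketch. It is not Hitomi's control law: the function name, gain, inertia and time step are all made up for illustration, and the "sign error" itself is only the guess stated above.

```python
def simulate_rate_damping(initial_rate, gain, sign, steps=50, dt=1.0, inertia=1.0):
    """Propagate a single-axis spin rate under a proportional rate controller.

    sign=-1 gives the intended damping torque (it opposes the measured rate).
    sign=+1 models a hypothetical sign error, so the torque reinforces the spin.
    All parameters are illustrative assumptions.
    """
    rate = initial_rate
    for _ in range(steps):
        torque = sign * gain * rate          # commanded torque from measured rate
        rate += (torque / inertia) * dt      # simple Euler integration
    return rate


if __name__ == "__main__":
    start = 1.0  # arbitrary rate units
    print("correct sign:", simulate_rate_damping(start, gain=0.1, sign=-1))
    print("flipped sign:", simulate_rate_damping(start, gain=0.1, sign=+1))
```

With the correct sign the rate decays toward zero; with the sign flipped the same controller makes the spin grow every step, which matches the "spin increased instead of stopping" description in the summary.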
http://spaceflight101.com/h-iia-astro-h/hitomi-failure-chain/
1. They launch the spacecraft.
2. They find out that Star tracker stabilization doesn't work in an area that is also a communications blackout.
3. They upload a patch, continue deployment of the boom and start a reorientation maneuver (for the next image target), as they go into blackout.
4. The maneuver completes when the satellite is in the high radiation region that is problematic.
5. The IRU (inertial guidance) error is high presumably because of the maneuver, the Star tracker data that is supposed to override and correct it is invalid. The temporal integration algorithm that is supposed to correct the error (21.7 degree-per-hour roll) doesn't have time to work because the IRU takes action.
6. Reaction wheels spin the spacecraft based on erroneous data. There was a limiter issue but that doesn't seem to materially matter.
7. Because the reaction wheels are near saturation it enters safe mode.
8. A sun sighting that is supposed to provide correct attitude doesn't.
9. The thrusters fire without solar information (presumably using the flawed IRU information since that is the only information available). Now the OP thread article says because of software error, but the only error seems to be no source to correct the IRU information. Presumably the thrusters were trying to finish cancelling the non-existent spin the reaction wheels couldn't.
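Steps 5 and 6 can be sketched with a toy single-axis simulation: the spacecraft is truly at rest, the rate sensor reports a constant false rate of 21.7 degrees per hour, and a controller drives the reported rate to zero. The gain, time step and function name are assumptions for illustration only, and the sketch ignores reaction wheel saturation, safe mode and the thrusters.

```python
def simulate_bias_response(bias_deg_per_hour=21.7, hours=1.0, dt_s=1.0, gain=0.05):
    """Return (true rate in deg/hour, attitude error in deg) after `hours`.

    Assumes the spacecraft starts at rest, the gyro adds a constant false rate,
    and nothing (star tracker, sun sensor) corrects the estimate.
    """
    true_rate = 0.0                        # deg/s, actual rotation
    attitude = 0.0                         # deg, accumulated pointing error
    bias = bias_deg_per_hour / 3600.0      # deg/s, false rate reported by the gyro
    steps = int(hours * 3600 / dt_s)
    for _ in range(steps):
        measured = true_rate + bias            # what the controller believes
        true_rate -= gain * measured * dt_s    # it nulls the *measured* rate
        attitude += true_rate * dt_s
    return true_rate * 3600.0, attitude


if __name__ == "__main__":
    rate, error = simulate_bias_response()
    print(f"true rate: {rate:.2f} deg/hour, attitude error: {error:.2f} deg")
```

In this simplified model the controller ends up rotating a stationary spacecraft at roughly the bias rate in the opposite direction, so on its own a 21.7 degree-per-hour error stays a slow drift. The escalation described in steps 7 to 9 is outside what this sketch covers.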
A couple of questions:
1. Once they found out there was a Star Tracker problem, why did they continue business as usual?
2. Why start a reorientation maneuver on a spacecraft with orientation problems going into a blackout region that is problematic for the Star Tracker, particularly with a new software patch?
3. Can the Star Tracker even function with a high rate of spin, given the solar sensor couldn't?
4. Why wasn't the IRU information more accurate and why wasn't it able to correct a low but false spin rate? IE the worst case should be the spacecraft rotating at 21.7 degrees per hour.
It seems to me that spacecraft stabilization is pretty important. It doesn't seem it was that important to the ground support staff before the incident.