Crowdfunded, Solar-powered Spacecraft Goes Silent
Last week saw the successful launch of the Planetary Society's LightSail spacecraft, the solar-powered satellite that runs Linux and was crowdfunded on Kickstarter. The spacecraft worked flawlessly for two days, but then fell silent, and the engineering team has been working hard on a fix ever since. They've pinpointed the problem: a software glitch. "Every 15 seconds, LightSail transmits a telemetry beacon packet. The software controlling the main system board writes corresponding information to a file called beacon.csv. If you're not familiar with CSV files, you can think of them as simplified spreadsheets—in fact, most can be opened with Microsoft Excel. As more beacons are transmitted, the file grows in size. When it reaches 32 megabytes—roughly the size of ten compressed music files—it can crash the flight system." Unfortunately, the only way to clear that CSV file is to reboot LightSail. It can be done remotely, but as anyone who deals with crashing computers understands, remote commands don't always work. The command has been sent a few dozen times already, but LightSail remains silent. The best hope may now be that the system spontaneously reboots on its own.
I’m usually the first to defend others when some bug like this makes it through testing. Hindsight always being 20/20, only takes one bug amongst a million good bits of code, etc. But this just seems like something that even basic testing should have caught.
Did they not run this thing on the ground for a few weeks? That’s just basic testing, especially for something that is going to be inaccessible for a while. Also, the fact that some critical bit of processing relies on stuff being written to (and then presumably read back from) a CSV file is very worrying.
This sounds like some very shoddy work.
I know the average IQ at /. has gone down over the years, but I think the explanation of what a CSV file is is slightly too much dumbing down.
You'd think that something as small as 32MB would have been tested before they launched the thing... It doesn't sound like it takes very long to fill up 32MB either.
How much v Could a LightSail see If a LightSail could c s v
Roll your log files. I smell a DevOps debacle.
putting the 'B' in LGBTQ+
and you are an idiot for using it.
It came across a tachyon eddy and is at warp speed on its way to the Cardassian homeworld.
Well, how do you test it before you're happy ? If the beacon is 40 bytes, and transmitted every 15 seconds, it would take half a year before you fill up 32 MB. That's a long time for testing.
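For what it's worth, the estimate above can be checked in a few lines. This is only a sketch of the arithmetic: the 40-byte row size is the poster's guess rather than a published LightSail figure, 32 MB is taken to mean 32 MiB, and the function names are made up for illustration.

```python
# Rough fill-time estimate for beacon.csv.
# Assumptions: 40 bytes per beacon row (the poster's guess) and a
# 32 MiB limit; only the 15-second interval comes from the summary.
BEACON_BYTES = 40
BEACON_INTERVAL_S = 15
LIMIT_BYTES = 32 * 1024 * 1024
SECONDS_PER_DAY = 86_400


def days_to_fill(limit_bytes, record_bytes, interval_s):
    """Days until a file growing by record_bytes every interval_s reaches limit_bytes."""
    return limit_bytes / record_bytes * interval_s / SECONDS_PER_DAY


def record_bytes_for(limit_bytes, days, interval_s):
    """Average row size needed to reach limit_bytes after the given number of days."""
    return limit_bytes / (days * SECONDS_PER_DAY / interval_s)


fill_days = days_to_fill(LIMIT_BYTES, BEACON_BYTES, BEACON_INTERVAL_S)
implied_row = record_bytes_for(LIMIT_BYTES, 2, BEACON_INTERVAL_S)
print(f"40-byte rows: about {fill_days:.0f} days to reach the limit")
print(f"two-day fill: about {implied_row:.0f} bytes per row")
```

With 40-byte rows that works out to roughly five months, so "half a year" is the right ballpark. If the file really did hit the limit after about two days in orbit, as the summary suggests, each row would have to average close to 3 KB, so the real rows are presumably a lot bigger than 40 bytes.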
This is the kind of mistake you shouldn't even make in the first place.
Actually this particular failure wasn't as obvious an oversight as you may think. The reason it happened was that in an existing system one particular set of parameters was logged in miles, since those values weren't responsible for flight control (which NASA mostly uses metric for). Later on, portions of this design were reused and an engineer decided to use the originally non-essential values as a feed into the navigation system.
The problem in this case is that when you have something large and complex (a spacecraft) and a large organization with many projects (NASA/JPL), the younger generation tends to just rely on what's in place without doing the research they should.
That being said, there were many times this particular error could have been caught on the ground and wasn't, and that's a process failure. The "process" should have caught it.
Now get off my lawn!
No, they mean MB, because even though they've crowdfunded a tiny satellite launch they are still not as autistic as you.
No. They need programmers and sysadmins who knew what they were doing. E.g. roll log files and/or put logs on a non-critical partition. Systems Administration 101 for systems where memory and disk space are at a premium. It was a rookie mistake.
putting the 'B' in LGBTQ+
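As a rough illustration of the "roll log files" advice, here is a minimal sketch using the `RotatingFileHandler` from Python's standard library. Nothing here reflects how LightSail's flight software is actually written; the 4 KB cap, the backup count, the row format and the `make_beacon_logger` helper are all assumptions picked to keep the demo small.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

MAX_BYTES = 4096   # per-file cap (hypothetical, tiny for the demo)
BACKUPS = 2        # keep beacon.csv.1 and beacon.csv.2, discard anything older


def make_beacon_logger(path):
    """Return a logger whose output file is rolled before it reaches MAX_BYTES."""
    logger = logging.getLogger("beacon_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo rows out of the root logger
    handler = RotatingFileHandler(path, maxBytes=MAX_BYTES, backupCount=BACKUPS)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger, handler


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "beacon.csv")
    logger, handler = make_beacon_logger(path)
    for i in range(5000):
        logger.info("%d,12.3,4.56,OK", i)  # made-up telemetry row
    handler.close()
    logger.removeHandler(handler)
    files = sorted(os.listdir(d))
    total = sum(os.path.getsize(os.path.join(d, name)) for name in files)

print(files, total)
```

The point of the design is that total disk use is bounded by `MAX_BYTES * (BACKUPS + 1)` no matter how long the thing runs. The trade-off is that the oldest rows are thrown away, which is usually acceptable for a beacon log.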
First off.. LightSail isn't a NASA mission.. it's a low budget cubesat and cubesats tend to trade risk for rigor.
NASA does run stuff for days/weeks/etc in testing. And you'll note that the Mars rover flash file system thing was able to be recovered from, thanks to smart people at JPL realizing that you always need a way to recover. This is not necessarily the case for cubesats, often built by enthusiastic grad students whose hair is not yet grey from living through near and actual disasters in flight projects: them young-uns just don't know any better.
As a practical matter, "running for weeks on the ground" isn't practical: As an experienced software developer, I'm sure you know how real projects are always running tight for time: and a space mission where the launch date is determined well in advance can't just say "oh, I guess we'll slip the release a few weeks". You're building the spacecraft and verifying that everything works as well as you can: you verify that you can wiggle all the interfaces, you verify that the debugger/backdoor capabilities that will allow you to recover work; you verify the watchdogs. And you get what test time you can, before you ship to launch.
Don't forget that for a lot of the testing, you reset the system state to a known starting point (that means wiping the non-volatile memory).
And then you test, if you can, during the 8-9 months the spacecraft is on the way to Mars (which is WHY Spirit had the issue: they got a lot more test time on the software in flight than they had during the 3 year buildup of the spacecraft on the ground; log files got bigger, etc.)
when some bug like this makes it through testing
Testing? what testing? If it compiles, it works. Every hacker knows this.
I have to say, when I read that the spacecraft ran Linux and had died, I naturally assumed that someone had left the auto-update enabled and it was busy trying to apply about 50 million kernel patches.
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
and not as a verb. Using "hope" as a verb in spaceflight hasn't always gone very well in the past.
"Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."
Shaka, when the walls fell
Coming up next on Slashdot... Linux is an operating system, kinda like Windows or Mac OS, but built by a bunch of neckbeards, and uses about the same amount of space as 10 compressed music files. Some versions use less, some use more depending upon how it's configured.
Wow; I think it's time to move on from Slashdot. Taco would be spinning in his grave, assuming he was dead.
If telephones are outlawed, then only outlaws will have telephones.
You write your test so that it sends the 40 bytes to the csv file every 10 milliseconds instead of every 15 seconds.
The moment that you think of doing that, is the moment that you realize that the file will grow too big.
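The accelerated test described above could look something like this. It is only a sketch: the 40-byte row is the size guessed earlier in the thread, the limit is scaled down to 1 MiB so the demo finishes quickly, and `beacons_until_limit` is a hypothetical helper, not anything from the LightSail code base. Rather than waiting even 10 milliseconds, it writes rows back to back.

```python
import os
import tempfile

RECORD = b"x" * 39 + b"\n"   # 40-byte stand-in for one beacon row (assumed size)
FLIGHT_INTERVAL_S = 15       # beacon interval quoted in the summary


def beacons_until_limit(path, limit_bytes, record=RECORD):
    """Append rows with no delay and count how many fit within limit_bytes."""
    count = 0
    with open(path, "ab") as f:
        while f.tell() + len(record) <= limit_bytes:
            f.write(record)
            count += 1
    return count


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "beacon.csv")
    n = beacons_until_limit(path, limit_bytes=1024 * 1024)  # scaled-down limit
    size = os.path.getsize(path)

# Translate the row count back into time at the real beacon rate.
hours_at_flight_rate = n * FLIGHT_INTERVAL_S / 3600
print(f"{n} rows, {size} bytes, about {hours_at_flight_rate:.0f} hours in flight")
```

In a real test rig the loop would drive the actual logging code instead of a stand-in, and the interesting part is what the software does on the write after the loop stops.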
We don't normally test our spacecraft systems, but when we do, we do it after launch.
Speaking as an engineer working on software that is on the Orion spacecraft, I can say that rigorous testing is budgeted into the project from the beginning because it helps to avoid most of the problems like this. The testing that goes on with flight software is orders of magnitude more than you find for a traditional commercial product. You have to. The consequences of failure are, obviously, a lot more significant.
That being said, it's impossible to catch every single possible bug, especially as systems get more and more complex. But there are strategies that help reduce your risk. For example, you don't just run off to kernel.org and throw the latest stable release on a board. You pick operating systems that are maybe a bit harder to use (i.e. limited in what they can do) but are far better suited to real-time embedded work. And you certainly don't blindly append to a file without verifying that you're not going to overflow your space. And you always have an automated recovery plan for any dynamically allocated space in the event of an overflow.
This kind of failure is caused by amateurs making amateur mistakes. It was caused by application programmers who don't understand the consequence of failure in a constrained environment where you can't just click a mouse to restart the program. It was caused by poor planning and a lack of understanding of the environment in which they were designing. This was caused by hiring coders instead of experienced engineers. It was caused by trying to do it cheap rather than spending the money to do it right. They got what they paid for.
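To make the "don't blindly append" point concrete, here is a minimal sketch of a size check before each write, with an automatic recovery step when the budget is exhausted. The recovery policy (drop the oldest rows until the file is at most half the budget), the 1000-byte limit and the `append_bounded` name are assumptions for illustration, not a description of any real flight software.

```python
import os
import tempfile


def append_bounded(path, row, limit_bytes):
    """Append row to path, never letting the file grow past limit_bytes."""
    data = row.encode("ascii") + b"\n"
    if len(data) > limit_bytes:
        raise ValueError("row is larger than the whole budget")
    size = os.path.getsize(path) if os.path.exists(path) else 0
    if size + len(data) > limit_bytes:
        # Recovery: drop the oldest rows until the file plus the new row
        # fits in half the budget (or nothing is left to drop).
        with open(path, "rb") as f:
            lines = f.readlines()
        drop = 0
        while drop < len(lines) and size + len(data) > limit_bytes // 2:
            size -= len(lines[drop])
            drop += 1
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.writelines(lines[drop:])
        os.replace(tmp, path)  # swap the trimmed file into place
    with open(path, "ab") as f:
        f.write(data)


LIMIT = 1000  # hypothetical budget, tiny for the demo
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "beacon.csv")
    max_seen = 0
    for i in range(500):
        append_bounded(path, f"{i},OK", LIMIT)
        max_seen = max(max_seen, os.path.getsize(path))
    with open(path, "rb") as f:
        last_line = f.readlines()[-1]

print(max_seen, last_line)
```

Trimming to half the budget rather than to just under the limit means the expensive rewrite happens only occasionally instead of on every beacon once the file is full.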
Last week [a week is approximately the amount of time between new 'Keeping up with the Kardashians' episodes] saw the successful launch of the Planetary Society's LightSail spacecraft, the solar-powered satellite that runs Linux [Linux is like Windows for smart people] and was crowdfunded on Kickstarter [Kickstarter is a place to buy digital watches]. The spacecraft worked flawlessly for two days, but then fell silent, and the engineering team has been working hard on a fix ever since. They've pinpointed the problem: a software [software is like what you download from the app store] glitch. "Every 15 seconds, LightSail transmits a telemetry beacon packet [a telemetry beacon packet is like a tweet]. The software controlling the main system board writes corresponding information to a file called beacon.csv. If you're not familiar with CSV files, you can think of them as simplified spreadsheets; in fact, most can be opened with Microsoft Excel. As more beacons are transmitted, the file grows in size. When it reaches 32 megabytes, roughly the size of ten compressed music files [32 MB is also approximately the size of 13 iPhone 6 selfies], it can crash the flight system [the satellite's Twitter feed blows up]." Unfortunately, the only way to clear that CSV file is to reboot LightSail [like holding down the power and home buttons on your iPhone at once -- don't try this unless instructed by someone at the Genius Bar]. It can be done remotely, but as anyone who deals with crashing computers understands, remote commands don't always work [like when Siri plays Billy Ray instead of Miley]. The command has been sent a few dozen times already, but LightSail remains silent. The best hope may now be that the system spontaneously reboots on its own [like when you drop your phone in the pool and it still works].
Meanwhile, at Planetary Society's headquarters...
Well, Jason. What have you got to say?
Well, Mr Nye...
Doctor! It's Doctor Nye.
But I thought those were honorary degrees.
It is DOCTOR Nye. Say it! SAY IT!
Y..Yes. D..D..Doctor Nye.
So, what happened to our bird, Jason?
As you know, um... Doctor Nye... We used a kickstarter campaign to fund the satellite's development and testing.
Get to the point, Jason.
We ran out of funds. If we had one more donor, we would have been able to complete the final testing.
So we lost the satellite and now face public humiliation because one anonymous person was too cowardly to donate?
Yes. Um.. Doctor Nye. That's about the size of it.
Well, Jason. That fellow had best pray that he and I never cross paths. You may go.
When our name is on the back of your car, we're behind you all the way!
Today, though, dynamic memory allocation is a reasonable thing. Granted you want to make sure it can't fail, and that "out of memory" is handled appropriately.
I don't completely disagree but you might watch the CPPCON 2014 presentation on the Curiosity rover for some insights into how the industry actually does things. One thing I noticed right off; rad hardened hardware is way behind the latest thing from Intel.
Actually, NASA had a "file system full" problem on one of the Mars rovers, almost exactly the same problem that Lightsail has. Fortunately they were able to fix it remotely.
This is something that should have been done with any number of RTOSes. FreeRTOS is a good start, I prefer ChibiOS/RT.
some people still do code for the Z80 in 16KB of memory.
It is still in widespread use from robotic control systems to hardened consoles to law enforcement (it's the processing unit in portable breathalysers). Some modern mobile phones (some Ericsson models) still use the Z80. Some musical synthesisers use the Z80 in realtime voice processing. The Harvard Zed SBC uses a Z80 core.
Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
I'll never understand how groups (Especially NASA) can spend millions, or even BILLIONS on projects like these and not even complete the sorts of rudimentary testing that those of us in the professional software fields have to do every day.
This is not a NASA project, so you've made a stunningly basic error in your first sentence. Not looking too good for attention to detail for someone "in the professional software field".
Regardless, if you want to see how NASA does software, or if you are even remotely interested in the best practices by which true mission-critical software gets written, you won't find a more interesting story than this one on the creation of space shuttle software:
Some of my favourite parts begin with the following quote:
That is how one writes software. NASA cannot be beaten when lives matter.