Crowdfunded, Solar-powered Spacecraft Goes Silent
Last week saw the successful launch of the Planetary Society's LightSail spacecraft, the solar-powered satellite that runs Linux and was crowdfunded on Kickstarter. The spacecraft worked flawlessly for two days, but then fell silent, and the engineering team has been working hard on a fix ever since. They've pinpointed the problem: a software glitch. "Every 15 seconds, LightSail transmits a telemetry beacon packet. The software controlling the main system board writes corresponding information to a file called beacon.csv. If you're not familiar with CSV files, you can think of them as simplified spreadsheets—in fact, most can be opened with Microsoft Excel. As more beacons are transmitted, the file grows in size. When it reaches 32 megabytes—roughly the size of ten compressed music files—it can crash the flight system." Unfortunately, the only way to clear that CSV file is to reboot LightSail. It can be done remotely, but as anyone who deals with crashing computers understands, remote commands don't always work. The command has been sent a few dozen times already, but LightSail remains silent. The best hope may now be that the system spontaneously reboots on its own.
One report I read made it sound like they were aware of the bug for a while. It's possible that they had to launch with an old version of the software because the patch wasn't ready yet, and being a secondary payload on a launch you have no say whatsoever as to the launch date. They probably expected to be able to upload the patch after launch, but the log filled up faster than expected.
That being said, it is shoddy programming to blindly write to a log on any platform, and especially on a resource-constrained embedded one like this, so somebody definitely goofed. All I am saying is that it probably was caught by testing, but couldn't be fixed in time due to various constraints. It was a dumb move on the developer's part to skip due diligence and to rely too heavily on QA in the first place.
Their current plan is to wait for charged particles to hit the electronics and force a reboot.
Spacecraft are susceptible to charged particles zipping through deep space, many of which get trapped inside Earth’s magnetic field. If one of these particles strikes an electronics component in just the right way, it can cause a reboot. This is not an uncommon occurrence for CubeSats, or even larger spacecraft, for that matter. Cal Poly’s experience with CubeSats suggests most experience a reboot in the first three weeks; I spoke with another CubeSat team that rebooted after six.
First off.. LightSail isn't a NASA mission.. it's a low budget cubesat and cubesats tend to trade risk for rigor.
NASA does run stuff for days/weeks/etc in testing. And you'll note that the Mars rover flash file system thing was able to be recovered from, thanks to smart people at JPL realizing that you always need a way to recover. This is not necessarily the case for cubesats, often built by enthusiastic grad students whose hair is not yet grey from living through near and actual disasters in flight projects: them young-uns just don't know any better.
As a practical matter, "running for weeks on the ground" isn't practical: As an experienced software developer, I'm sure you know how real projects are always running tight for time: and a space mission where the launch date is determined well in advance can't just say "oh, I guess we'll slip the release a few weeks". You're building the spacecraft and verifying that everything works as well as you can: you verify that you can wiggle all the interfaces, you verify that the debugger/backdoor capabilities that will allow you to recover work; you verify the watchdogs. And you get what test time you can, before you ship to launch.
Don't forget that for a lot of the testing, you reset the system state to a known starting point (that means wiping the non-volatile memory).
And then you test, if you can, during the 8-9 months the spacecraft is on the way to Mars (which is WHY Spirit had the issue: they got a lot more test time on the software in flight than they had during the 3 year buildup of the spacecraft on the ground; log files got bigger, etc.)
It's not so much the capacity of local storage. You also have to consider that this is a system which is unlikely to be touched by a human being ever again, so whatever goes up has to be physically resilient. Super-compact flash storage such as micro/SD would be out; I'd go a couple generations back and use Compact Flash with slightly lower capacity to take advantage of larger dies. This is why NASA went on a shopping trip very recently for Pentium I and Pentium Pro chips for space systems: by virtue of their architecture, they're fairly hard against the environment. Back to topic: from a cursory search around, it appears that it's an issue with the kernel, sysvinit and/or php, all of which at some point or another default shared and allocated memory spaces for various purposes (including scripting and logging) to 32MB.
Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
No, csv sure as hell is NOT a Microsoft format.
This has nothing whatsoever to do with Microsoft, as much as you seem to want to blame them.
Lost at C:>. Found at C.
I wish people would stop thinking that hard drive manufacturers are the "source" of this so-called "problem". Digital communication speeds never used base 2, clock speeds didn't either.
People are simply stuck on terms like "Mebibyte" because they either don't want to accept the fact that mega is an SI prefix or because they don't like how the IEC units sound. Get over it.
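For what it's worth, here is the arithmetic behind the SI vs. IEC distinction, applied to the "32 megabytes" figure from the summary. The summary doesn't say which unit it actually means, so both readings below are assumptions, not a statement about LightSail's real limit.

```python
# SI prefix: mega = 10**6. IEC prefix: mebi = 2**20.
MB = 10**6
MiB = 2**20

limit_si = 32 * MB    # 32 megabytes (SI reading)
limit_iec = 32 * MiB  # 32 mebibytes (IEC reading)

# Absolute and relative gap between the two readings
gap_bytes = limit_iec - limit_si
gap_percent = 100 * gap_bytes / limit_si

print(limit_si, limit_iec, gap_bytes, round(gap_percent, 2))
```

The two readings differ by about 1.5 million bytes, a bit under 5 percent.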
Speaking as an engineer working on software that is on the Orion spacecraft, I can say that rigorous testing is budgeted into the project from the beginning because it helps to avoid most of the problems like this. The testing that goes on with flight software is orders of magnitude more than you find for a traditional commercial product. You have to. The consequences of failure are, obviously, a lot more significant.
That being said, it's impossible to catch every single possible bug, especially as systems get more and more complex. But there are strategies that help reduce your risk. For example, you don't just run off to kernel.org and throw the latest stable release on a board. You pick operating systems that are maybe a bit harder to use (i.e. limited in what they can do) but are far better suited to real-time embedded work. And you certainly don't blindly append to a file without verifying that you're not going to overflow your space. And you always have an automated recovery plan for any dynamically allocated space in the event of an overflow.
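To make the "don't blindly append" point concrete, here is a minimal sketch of a size-capped log with automatic rotation. It is not LightSail's flight code: the function name `append_beacon`, the 1024-byte default cap, and the single `.1` backup file are all hypothetical choices for illustration.

```python
import os

MAX_BYTES = 1024  # hypothetical cap, chosen only for illustration


def append_beacon(path, row, max_bytes=MAX_BYTES):
    """Append one CSV row, rotating the file first if the row would exceed the cap.

    Returns True if a rotation happened. A single row larger than max_bytes
    is still written; a real system would need a policy for that case too.
    """
    line = (",".join(str(v) for v in row) + "\n").encode("utf-8")
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        size = 0  # first write: nothing to rotate

    rotated = False
    if size > 0 and size + len(line) > max_bytes:
        # Recovery step: keep one old generation, replacing any earlier one,
        # so total disk use stays bounded at roughly 2 * max_bytes.
        os.replace(path, path + ".1")
        rotated = True

    with open(path, "ab") as f:
        f.write(line)
    return rotated
```

Calling `append_beacon("beacon.csv", [timestamp, voltage, temperature])` once per beacon would keep the live file under the cap without needing a reboot to clear it. Whether to rotate, truncate, or use a ring buffer is a design choice; rotation is shown here only because it is the simplest to reason about.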
This kind of failure is caused by amateurs making amateur mistakes. It was caused by application programmers who don't understand the consequence of failure in a constrained environment where you can't just click a mouse to restart the program. It was caused by poor planning and a lack of understanding of the environment in which they were designing. This was caused by hiring coders instead of experienced engineers. It was caused by trying to do it cheap rather than spending the money to do it right. They got what they paid for.
These guys did not launch a satellite, ULA did. Basically LightSail simply took a ride on an Atlas 5 that was deploying the X-37B and was thrown out as a secondary payload. Pretty much anybody can do that. A lot of CubeSats are made by college students.
Also, describing CSV and measuring files in songs makes me want to punch Bill Nye, and I love Bill Nye.
I'm a good cook. I'm a fantastic eater. - Steven Brust