Computer Date Glitch May Limit Next Shuttle Launch
n3hat writes "Reuters reports that the next Space Shuttle mission may have to be deferred if it gets too close to the New Year because the onboard computers do not handle the changing of the date in the same way as the ground computers. From the article: '"The shuttle computers were never envisioned to fly through a year-end changeover," space shuttle program manager Wayne Hale told a briefing. The problem, according to Hale, is that the shuttle's computers do not reset to day one, as ground-based systems that support shuttle navigation do. Instead, after December 31, the 365th day of the year, shuttle computers figure January 1 is just day 366.'"
How many times is this going to bite us in the ass? Ada solves all these sorts of problems, and soooo many of my tax dollars went into its creation? I understand that the space shuttle is a limited platform, but why aren't any of the lessons learned in Ada being applied?
Is there a reason these aren't built on standard parts and operating systems? If they ran their shuttles on something like Debian stable it would be a rock solid platform and probably end up saving them lots of money. Or am I missing something here.
*ducks and runs for cover*
Seriously though, they never "envisioned" a mission occurring over the end-of-year? Let me guess: a defense (space) contractor designed the systems.
Please help metamoderate.
Pardon my ignorance, but is this really serious enough that it should actually cause a delay? I mean, if it's simply a matter of figuring out what the date is, I'm sure that the astronauts and engineers involved in the project know at LEAST basic mathematics, and can determine that if it's, say, Day 367 on the shuttle computer, then 367-365 = 2, AKA January 2nd, 2007.
I'd say the article missed something; the whole concept sounds far too ridiculous to stand on its own.
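For what it's worth, the subtraction above can be sketched in a few lines of Python. This is only an illustration: the function name `running_day_to_date` and the assumption that day 1 means January 1 of a known base year are mine, not anything taken from the actual flight software.

```python
from datetime import date, timedelta


def running_day_to_date(base_year: int, running_day: int) -> date:
    """Map a day counter that never resets to a calendar date.

    Assumption (hypothetical): day 1 is January 1 of base_year and the
    counter simply keeps counting past December 31.
    """
    if running_day < 1:
        raise ValueError("running_day starts at 1")
    return date(base_year, 1, 1) + timedelta(days=running_day - 1)


print(running_day_to_date(2006, 366))  # 2007-01-01
print(running_day_to_date(2006, 367))  # 2007-01-02
```

Doing the conversion by hand is easy; the hard part is whether every system exchanging timestamps agrees on which convention is in use.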
Sorry to sound so skeptical... but am I the only one who is worried about the capability of missiles (and other defence systems) to handle war through a year-end changeover?
hilarious
Well, we know that programmers get confused with numbers one time or another since we're used to starting things at index 0. The shuttle's programmer must have left an extra ctr++ there :)
(or maybe he watched so much Star Trek that he thought he should follow intergalactic star dates)
Three cheers for the Y0.001K problem!
I have discovered a truly remarkable
Oh, shit! You mean we're not supposed to be following intergalactic star dates?? No wonder those programs I wrote have so many date bugs...
My blog
The shuttle runs on three modified IBM 360 systems. We're pushing 35, almost 40 year old systems here.
Do you know how many eligible 35 year old computer bachelors there are out there? I'll tell you: none. Of course the shuttle computers can't get a date.
I read it quickly and thought it said, "The shuttle computers were never envisioned to fly through a year-end hangover".
I couldn't figure out for the life of me why they'd let mission critical crew drink bubbly in space... or why the computer would give a damn.
You can't win, Darth. If you mod me down, I shall become more powerful than you could possibly imagine.
Granted, the work they do is very impressive and the process is very exacting. But come on...they haven't been able to fix a simple year rollover event in 30 years?!?
From the Fast Company article:
I would say that requiring a reboot every year on December 31 is a pretty huge error. In this case, it is forcing NASA to launch earlier than they otherwise would wish. And this isn't the first time this type of problem has caused problems. The New Scientist has a similar article that goes into more detail:
So, they made the software so it does not kill anyone. Who needs fancy features like precise yearly timing?
Seriously, though, it's worked fine. The software has not killed anyone. They can either fix it and modify a very critical system on an enormously complex vehicle, or they can move the launch date around a few days, which they seem to do for every launch anyway. B is probably safer and more predictable.
The problem seems obvious. If the shuttle computer is allowed to think it is the 366th day of the year, it will obviously turn evil and try to destroy the earth using the vast orbiting nuclear arsenal, while we sit helpless on the surface. We can't allow this to happen.
Apparently the below article was full of shit:
They write the right stuff
???
The end-of-year rollover depends on the leap year and leap second (if any), and has traditionally been a source of problems.
Mea navis aericumbens anguillis abundat
Could it simply be that the date is a hard concept? You've got months with uneven number of days in them, including one month that can have an extra day added to it based on a somewhat complex concept (every 4 years, except if it's divisible by 100, UNLESS that year also happens to be divisible by 400). Calculating how many days there are between now and some future date, without using magic numbers? Heck, even software in the 90's couldn't get it right that there was a Feb 29, 2000.
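The leap year rule described above fits in one line of Python. A minimal sketch; `is_leap_year` and `days_in_year` are names I made up for illustration, and the standard library's `calendar.isleap` already does the same job.

```python
import calendar


def is_leap_year(year: int) -> bool:
    """Gregorian rule: every 4 years, except centuries, unless divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_year(year: int) -> int:
    """Number of days in the given year, so the last day number is 365 or 366."""
    return 366 if is_leap_year(year) else 365


for year in (1900, 2000, 2004, 2006):
    # Cross-check against the standard library implementation.
    assert is_leap_year(year) == calendar.isleap(year)
    print(year, is_leap_year(year), days_in_year(year))
```

Note that 2000 is a leap year (divisible by 400) while 1900 is not, which is exactly the case some 90's software got wrong.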
Every date math equation I've seen has all sorts of weird magic numbers in them where it isn't clear how those numbers were obtained. This may work just fine in day to day computations, but oddball bugs in date calculations can lead to some very weird errors. Look at the C library sometime for the date functions. It's quite impressive.
Perhaps when the shuttles were designed, the inability to schedule across the new year was acceptable to avoid introducing odd bugs in the program to keep the software provably correct. Ground systems, which can be repaired in the middle of a mission easily, can be a little less bug-free, since a miscalculation won't cause the Earth to suddenly veer off course.
Do you even remember why the Y2K thing happened? People saved space back in the day by using a 2 digit year. Hell, in the 1970's, people were using a one digit year to save even more memory and storage space. The Space Shuttle uses very old technology for its computer systems (read: 1970's level technology), and doesn't have much memory. That extra 10 lines of code could make it oversized.
Additionally, making a change to space (or even military) software requires a shitload of paperwork and testing. It's a wild guess, but it would probably take almost a year to get that "ten lines of code" into the Shuttle and cost more money than it's worth to just not have the birds up in space at an end-of-year scenario.
There are only 10 kinds of people in this world... those who understand binary and those who don't
Nah, everyone knows geeks are useless at dates because they never get any. Predictable failure, that one.
"I've got more toys than Teruhisa Kitahara."
...and I'm surprised that so many of the techie gurus around here are buying it.
I work with military navigation software, and that is sorta remotely applicable to this. Here's my thoughts:
You people with your "WTF NASA SUXORS THIS IS EASY FIX!!!11!!1!one!!" need to stop and think for a second. This is a space application that carries HUMAN BEINGS! Think about how hard it will be to get this "easy fix" qualified, proven, documented, etc. It's not an easy task. A formal qualification test on the systems I work on (military land- and air-, but not space-based navigation software) can take months, and require all sorts of tests and documentation. Anything that isn't formally tested (i.e. run in a van, on a plane, etc) must be shown to not fail in any way; all exceptions handled, no bad data can cause an undesirable state, etc. I would hate to see the type of scrutiny that the Shuttle software goes through (although I could probably call somebody in our Space division across the street and find out).
Second, I don't know exact specifics, but based on the information provided, I think this "glitch" will have to do with the date/time difference between ground stations and the Shuttle computers. Things like message time stamping between the Earth and the Shuttle, etc, will be wrong, and things could be garbled or just dropped altogether. The navigation systems themselves should not be terribly impacted since the date will just roll to the next day. Inertial instrument samples will continue to flow in and be correctly time stamped, be it the 366th or 400th or 500th day.
Microsoft can't get a lot of things right. On the other hand, I sure hope Microsoft's software isn't anywhere near a space shuttle, that would be a disaster waiting to happen.
...which is more than many software development processes would reveal. Chances are that this known restriction is on a check-list which every shuttle mission has to be checked against, and the list would exist precisely because the software development and verification process is so solid and conservative.
Opinions vary, but I don't think I'd ever recommend working to the same standards, unless the customer actually had good reason to require it. (NASA does.) Even aside from your own code, doing it properly would require an extensive understanding of any and every third party library and system the code interacts with, which could add orders of magnitude to the development time and cost, even if it's open source and open hardware. I don't like hacks and yucky untested code any more than most people, but at some point it can just make sense to avoid extensive and pedantic formal development processes in favour of just getting it to work.
A lot of development processes (perhaps most) wouldn't have stopped the shuttle launching, even if this were reported as a bug. Chances are that it'd be forgotten about (if not fixed straight away), and someone would stumble on it again accidentally. Many bugs aren't even reported until someone's stumbled on them at least once. This is fine in most situations. Once it becomes a problem again, you can go and look it up, quickly find out everything that's known about it from before, apply any known workarounds, and spend time to fix it if necessary. The point, though, is that many systems wouldn't be sure to keep you informed about the restriction in a way that actively prevents someone stumbling on it later.
I still agree that it seems a little strange that this problem wasn't fixed ages ago. Realistically, though, the Shuttle was never expected to fly this long. It sounds a lot like a compromise that was made in the earlier days when computers were more limited, probably even more so for the restricted range of systems that are certified to work under such conditions. Any update is likely to be very expensive and time consuming, simply because the software development and verification process is so solid and reliable.
From the article you quoted, it sounds more like they dropped a spacewalk (for Hubble maintenance, probably not safety-critical) so they could return sooner and avoid encountering the bug. To me it sounds like they did what they should have done, with safety as a priority.
Launching spacecraft is an industry in which the stakeholders usually prepare for possible or likely delays. NASA has to delay launches all the time for all sorts of reasons. I'm not sure why a possible software problem would be treated any differently. If the problem is with the managers dangerously forcing early launches, NASA should really be fixing their managers as a priority over fixing a known bug with a known workaround. Weighing it out, it's probably a lot cheaper, easier and safer to simply delay the occasional launch for a few more days, especially given that the Shuttle's remaining days are limited. Why risk the safety of future launches by making changes that will soon become obsolete?
Anyway, those are just my own thoughts. I don't work in software development where the process is quite so strict, but I'm sure they know what they're doing when they don't fix something like this.
In other news: a computer glitch may elect a President.
I originally found it hard to believe the shuttle hasn't been in orbit over new year's before.
http://en.wikipedia.org/wiki/List_of_space_shuttle_missions
The closest I could find was STS-103, the HST servicing mission in '99. Launched December 19th and lasted 7d 23h.
http://www.fastcompany.com/magazine/06/writestuff.html is a really good read on how the shuttle software is actually made. It's the most reliable software in the world with the most exacting design process.
How many other groups can deliver a half million lines of code with only 1 error? (And no, not this issue.) And as far as this being an error or bug, it really isn't. It's a known design restriction on a system that just works. Do you really want to go redesign a large chunk and possibly introduce life threatening bugs, or work within the known design window for the system?
Seems like a piss poor way to run a circus.
Still, in the patriotic spirit of our times I took out some date changing code I wrote years ago when I was first learning C and sent it to NASA. Never let it be said that in times of national need that I did nothing.
Ken
Why is the date critical to the operation of the shuttle? Do the astronauts forget what day they are supposed to land or something? If the day flips to 366 - so what? Now, do not get me wrong. I'm sure there is a *good* reason why a date rollover to a non-existing day would cause a problem, but I can't seem to find out what that problem would be. Does the computer lock up? Does it lose its ability to navigate? Does the life support system shut off? Do they even know? Maybe it is a case of 'we don't know what would happen, so rather than find out let's just not do it.'
Also, from what I understand, there are 4 computers aboard the craft. So, why not reboot one computer at a time to update the date until all are updated?
I want the cowboy astronauts back. The boys that flew Apollo 13 and duct-taped their space craft together and rode it home. I think they are more scientist than hot-shot now-a-days. Kind of a shame, it was the ego driven pilot that sorta made it all romantic in a way, now we send accountants into space that get freaked out over a little date change procedure.
"Yeah, you're missing something. Such as the fact that the Shuttle was designed a quarter century ago, "
I can't believe this was moderated as +5, Insightful.
The shuttle was designed WELL OVER a quarter century ago. A quarter century ago, they had done so much design and testing that they were able to have the maiden flight (STS-1, Columbia launched in April 1981). Shuttle design and specification requirements analysis began in October of 1968. VMS, CP/M, PC-DOS, and 4BSD did not exist when the Shuttle was designed.
You must be thinking of Multics, which was the closest thing to a modern operating system that existed in the 1960s.
Seriously, you have no idea how old the Shuttle design is. I have no idea why they keep using it after the great work done 20 years ago by Richard Feynman who showed that NASA's shuttle design was about 1/100 flights unreliable. For the record, we've sent up 200 missions and had 2 shuttles blown up. The Space Shuttle is a piece of garbage, and NASA has wasted billions exploring low Earth orbit, rather than do something more useful.
--
Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
Imagine you are a member of the shuttle design team and you can make a choice (for the next 20 years) to either know for sure that you're with the kids at home on X-mas and New Year .... or you can suggest a software feature that could result in your New Year's Eve being spoiled down the road because you have to be for days in a dumb control room. Hey, what would you do??
And I still remember, when I was a kid, that we had that Apollo flight during X-mas. I think it was the one that would for the first time go behind the moon. Someone in the control room that year made it into an important enough person on the Shuttle program so that this WOULD NEVER HAPPEN AGAIN. :-)
Browsers shouldn't have a back button!! It's all about going forward...
I would say that requiring a reboot every year on December 31 is a pretty huge error.
I wouldn't. When you're designing something like Shuttle software that has to work absolutely flawlessly 100% of the time, you don't put in any frills. And on something that is only ever in space for 10-15 consecutive days at most, year-end handling is most certainly a frill. (If you are a professional software developer, it ought to be obvious just how many things could break by adding a feature like that. If the original design calls for a monotonically increasing day number, for example, there's very likely to be some code that relies on that, so you have to go through the entire system, checking everything that even touches the day counter to ensure it can handle a reset from 365 to 1--and then check everything that uses those routines, and so on and so on.)
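To make the "code that relies on that" point concrete, here is a toy sketch of what breaks when a counter that was assumed to only increase suddenly resets. Everything in it (the `chronological` helper, the event names, the day numbers) is hypothetical and has nothing to do with the real Shuttle code.

```python
def chronological(events):
    """Return event names ordered by day number, earliest first.

    Correct only if the day counter never decreases.
    """
    return [name for day, name in sorted(events)]


# Counter keeps running past 365: the order comes out right.
running = [(364, "orbit adjust"), (365, "stow payload"), (366, "deorbit prep")]
print(chronological(running))
# ['orbit adjust', 'stow payload', 'deorbit prep']

# Counter resets to 1 at New Year: the last event now sorts first.
reset = [(364, "orbit adjust"), (365, "stow payload"), (1, "deorbit prep")]
print(chronological(reset))
# ['deorbit prep', 'orbit adjust', 'stow payload']
```

Fixing this one helper is trivial. Finding every comparison, sort and subtraction in the system that quietly makes the same assumption is the expensive part.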
I suspect this is routine to NASA, and the reporter just blew it out of proportion. After all, Windows can handle end-of-year rollover, so if the Shuttle can't then it's broken, right?
TFA carefully does NOT say that anything actually will fail, but that something might fail. Thank you, Fallon: your link (http://www.fastcompany.com/magazine/06/writestuff.html) is a good explanation. (However, the "on-board shuttle group" is actually called the "on-board systems group").
It's like this: A clock rollover (such as at midnight or the last day of the month or year) always sets something back to zero. That resetting is a risk: Is there something somewhere that doesn't take the rollover into account? It may be an obvious bug, or not so obvious - what if the problem is dynamic? For example, what if system A sends some data and rolls over, and system B rolls over and receives the data? Then it looks like stale data, but isn't. How do you test for dynamic conditions like this?
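A rough sketch of that system A / system B case, assuming timestamps are (day-of-year, second-of-day) pairs and the receiver rejects anything that is not recent. The names `to_seconds` and `looks_stale` and the 5-second limit are invented for illustration.

```python
SECONDS_PER_DAY = 86_400


def to_seconds(day_of_year: int, second_of_day: int) -> int:
    """Flatten a (day, second) timestamp into a single second count."""
    return day_of_year * SECONDS_PER_DAY + second_of_day


def looks_stale(sent, received, max_age_s: int = 5) -> bool:
    """True if the message age is negative or older than max_age_s seconds."""
    age = to_seconds(*received) - to_seconds(*sent)
    return not (0 <= age <= max_age_s)


sent = (365, 86_399)  # system A stamps the message one second before midnight

# Receiver kept counting (day 366): the message is 2 seconds old, accepted.
print(looks_stale(sent, (366, 1)))  # False

# Receiver rolled over to day 1: the same message appears to come from
# roughly a year in the future, so it is rejected although it is fresh.
print(looks_stale(sent, (1, 1)))  # True
```

The bug only shows up for messages in flight during the rollover, which is why it is so hard to catch with ordinary testing.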
Dodging this bullet is far, far cheaper than testing for it.
The only time I know of that a shuttle flight software bug affected a flight was uh...STS 2 or 3 or thereabouts. The shuttle often flies an updated load on one or two of its computers before the load is installed on all of them. On this mission, a new load on one GPC dumped (crashed) at T -9 seconds or so, causing everything to shut down automatically. The shuttle launched a day or two later, after the new load was rolled back.
Funny thing was, the same bug had occurred in the training simulators before launch, but was written off as a lack of fidelity of the simulator itself, not a bug in the flight software.
After that, the astronauts really began to appreciate running the real GPCs with the real flight software in the simulators.
PS: Although I work at NASA, this message is my own expression, and not that of NASA or my employer. I am a programmer only, not anyone with any kind of authority or insight except for my experiences here.
Pavlov wouldn't be so famous if he'd used a can opener instead of a bell.
Actually, the estimated failure rate for the shuttle program was 1 in 35, though the shuttles themselves may have been designed to withstand 100 launch/landing cycles*. This was a bit of an issue when the 25th mission resulted in a failure (since most of the population does not understand statistics).
And, for the record, there have been 117 launches, according to wiki, which I will take as accurate enough for this discussion (far less than 200).
*yes, IWAAE (I was an aerospace engineer) working for NASA, and was involved with shuttle payloads and structural reliability analyses.
Is it just my observation, or are there way too many stupid people in the world?
So, what if, oh, say, the CO2 scrubbers need to work differently depending on how many days the mission has been run. So, they keep track of the first day number, and the current day number. The amount of CO2 scrubbing then is varied based on elapsed days.
^^and here's the key -- it's something you don't know about^^
Now, you make your little 5-second fix, and send seven astronauts into space.
New Year's Eve rolls around, and suddenly the mission started on day 360 and it's now day 1. Holy crap, says the scrubber, we have to scrub as though it's a 359-day mission, instead of a lousy 6.
Scrubbers go into overtime, and break. (Or, scrubber math is done in eight bits, and they think the shuttle's still on the ground and not ready to launch for another ~100 days due to integer roll-over. Or any other set of unforeseen possibilities.)
Next, astronauts die of CO2 poisoning because the scrubber subsystem has been compromised.
Great fix, mister five-second-coder.
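The arithmetic in that scenario checks out, by the way. A small sketch, where `elapsed_days` and `wrap_int8` are made-up helpers and the scrubber logic is purely the hypothetical one described above:

```python
def elapsed_days(start_day: int, current_day: int) -> int:
    """Elapsed mission days, assuming the day counter only ever increases."""
    return current_day - start_day


def wrap_int8(value: int) -> int:
    """Wrap an integer into the signed 8-bit range [-128, 127]."""
    return (value + 128) % 256 - 128


# Counter keeps running: launched on day 360, now day 366.
print(elapsed_days(360, 366))  # 6

# Counter reset to 1 at New Year: same moment, nonsense result.
print(elapsed_days(360, 1))  # -359

# The same nonsense squeezed into signed eight-bit math.
print(wrap_int8(elapsed_days(360, 1)))  # -103
```

So the eight-bit version really does land at about minus 100 days, i.e. a mission that supposedly hasn't launched yet.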
Do daemons dream of electric sleep()?
If I recall, the shuttle computers were originally based on the IBM 360 (or maybe 370) mainframe architecture standard. The 360 series was the first real effort at standardization of components and instruction set so that upgrading your machine does not mean upgrading your software or peripheral equipment. And like the IBM PC after it, this bit IBM on the ass by allowing an open market to emerge where clones (Amdahl Computers, Hitachi) and third party peripheral manufacturers could compete against IBM. (This is where the term FUD originates... Amdahl, one of the original 360 architects who left to found a clone manufacturer, created the term FUD to describe aggressive IBM sales tactics to discredit third parties and intimidate customers into staying with IBM.)
I have no idea if the flight computers are still the same or not, but NASA long ago ditched their 360 complex for flight operations. (I think it was in use roughly until the mid 80s? Maybe early 90s?)
What you've failed to realize is that the flaw isn't so much that they decided to not do a rollover, it's that the ground computers do a rollover, and the shuttle computers don't.
AccountKiller
Hundreds of comments and not a single one mentions that NASA is a CMMI Level 5 organization. For those that don't know (and apparently, that's a lot of you), CMMI, aka Capability Maturity Model Integration, is a software ENGINEERING methodology for developing processes and technologies around IT systems. It is a very in-depth methodology for developing software and comes about as close to "engineering" as you can get in software development.
Here is a list of participants in this program.
And here is a general overview of what CMMI is.
And just to put it into perspective, when I was last working with CMMI, there were only 3 companies certified at level 5: NASA, Motorola, and another one I can't remember. I am sure that has changed but nonetheless, it's a big deal and shows a serious effort to do things in a controlled, measurable, testable way.
I only bring this up to counter the ridiculous "solutions" that some have proposed on this site.
"I can fix that in 3 lines of code".
Well, great. That might work at YOUR company. But please don't do that at NASA. Despite what many think here, NASA is a top-notch software development house. And I would expect nothing less given what is at stake.
I think the third one you're thinking of is Diebold?