Space Shuttle Software: Not For Hacks
Jeff Evarts writes: "This article in Fast Company talks about the process the Shuttle Group uses to make software. At first it seems too predictable: a very cool project but no hacks, no pizza-and-coke all-nighters, etc. Then, however, it goes on to talk about why: they have an informed customer, they talk to that customer until they have a very clear idea of what is wanted, they have a budget focused on prevention, and they focus on fixing the process and not blaming the individual.
As someone who's done more than his share of late-nighters, I found it an interesting view into the mission-critical environment. Maybe there are a few software firms out there that would rather spend some of their money on better processes than on technical support engineers. Maybe a little more market research and a little less marketing, too. A good read."
These guys are "pretty thorough" the way Vlad the Impaler was "a little unbalanced." Still, you have to wonder how they can claim single-digit errors among thousands of lines of code, but I guess the proof is in the rocket-powered pudding. And lucky for them, their target platform was recently upgraded.
While the caliber of this group seems unbeatable, it's too bad NASA doesn't apply this rigid development model to its unmanned spacecraft. I still don't understand how a difference in units (English vs. metric) managed to go undetected!
The only thing I could think of after hearing that such an error caused a multimillion-dollar craft to crash was IDIOTS - any scientist should be using SI units today.
--Aaron Greenberg
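One way a units mix-up like the one described above can be made harder to miss is to carry the unit in the type rather than in a bare number. The following is only a minimal sketch under stated assumptions: the `Impulse` class, its constructors, and `total_impulse` are hypothetical names invented for illustration, and impulse was picked as the example quantity; nothing here is taken from any actual flight software.

```python
from dataclasses import dataclass
from typing import Iterable

# Exact by definition: 1 lbf = 0.45359237 kg * 9.80665 m/s^2
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse value stored internally in newton-seconds (SI)."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The conversion happens once, at the boundary, and is visible.
        return cls(value * LBF_S_TO_N_S)


def total_impulse(burns: Iterable[Impulse]) -> Impulse:
    """Sum impulses, rejecting bare numbers whose unit is unknown."""
    total = 0.0
    for burn in burns:
        if not isinstance(burn, Impulse):
            raise TypeError("expected Impulse, got a value with no unit attached")
        total += burn.newton_seconds
    return Impulse(total)
```

With this shape, a caller holding a figure in pound-force seconds has to say so (`Impulse.from_pound_force_seconds(...)`), and a bare float passed to `total_impulse` fails loudly instead of being silently treated as SI.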
What's needed is a "meta-process", a process to develop the software process and keep it directed towards the goal. I would suggest that a democratic meta-process, where developers themselves work together to evolve the procedures they will use, would work better than decrees from clueless management.
Well, that's one set of religions. Others - such as Zen Buddhism - would say that such rules, or "process", are things to ultimately be transcended. The enlightened person, the sage or bodhisattva, does not refrain from killing based on some religious law; he simply acts. The practice of these religions is designed to help lead ordinary people to that state of enlightenment. Perhaps that should be the goal of software development practices, as well - to help lead ordinary programmers into that state where they are enlightened enough to be simply incapable of producing flawed software.
Tom Swiss | the infamous tms | my blog
You cannot wash away blood with blood
Don't forget why the Ariane 5 rocket blew up in 1996: a conversion error caused a software shutdown that led to the self-destruct of the rocket.
"The internal SRI software exception was caused during execution of a data conversion from a 64-bit floating-point number to a 16-bit signed integer value. The value of the floating-point number was greater than what could be represented by a 16-bit signed integer. The result was an operand error. The data conversion instructions (in Ada code) were not protected from causing operand errors, although other conversions of comparable variables in the same place in the code were protected."
What was the estimate, about $8,000,000,000 of uninsured losses, including 10 years of work for the scientists with satellites on board?
I wonder how many other maiden voyages have started off so poorly, other than the Titanic, that is.
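The conversion failure quoted above (a 64-bit float that does not fit in a 16-bit signed integer) is easy to reproduce in miniature. This is only an illustrative sketch in Python, not the Ada code from the report: the function names are hypothetical, and clamping to the integer limits is just one possible protection policy, assumed here for the example.

```python
import math
import struct

INT16_MIN = -(2 ** 15)   # -32768
INT16_MAX = 2 ** 15 - 1  # 32767


def to_int16_unprotected(value: float) -> int:
    """Convert with no range guard.

    struct.pack raises struct.error when the value does not fit in 16 bits,
    which stands in for the unhandled operand error in the quoted report.
    """
    packed = struct.pack("<h", int(value))
    return struct.unpack("<h", packed)[0]


def to_int16_protected(value: float) -> int:
    """Convert with an explicit range guard (assumed policy: saturate)."""
    if math.isnan(value):
        raise ValueError("cannot convert NaN to an integer")
    clamped = max(float(INT16_MIN), min(float(INT16_MAX), value))
    return int(clamped)  # truncates toward zero, like a plain int() cast
```

Whether to saturate, raise, or fall back to a safe mode is a design decision that depends on the system; the point of the sketch is only that the out-of-range case is decided explicitly rather than left to whatever the runtime does.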
and how slowly they are being developed? I don't mean that it's a bad thing -- it's good that the Shuttle program allows them to do it at a reasonable pace and with reasonable requirements, but if everyone else wasn't under constant pressure, and if everyone else's software wasn't a victim of feature bloat, poorly documented and even worse implemented protocols, and a never-ending stream of bullshit coming from management, everyone else would write robust software, too. Well, not really everyone -- some "programmers" wouldn't be able to do anything because they have no skill, no education or are plain dumb, but a reasonably geeky and educated programmer can pull off something like that in ideal conditions -- and those guys _are_ working in ideal conditions.
Contrary to the popular belief, there indeed is no God.
I think the point of that exercise is to promote a sense of well defined accountability and confidence, up and down the management chain. Sure, in theory, the project manager should be ultimately accountable. But all too often she can, post facto, dodge responsibility for failure by (accurately) claiming that other project stakeholders failed to provide their inputs to the project correctly. In Mr. Keller's case, he would not sign the certificate if he felt that failure was a possibility, for any reason. This also gives the decision makers a well defined "emergency brake" that perhaps could have averted a *Challenger* like disaster, where some line managers said STOP, while some higher-ups said GO!
I worked on some mission-critical/life-critical stuff about 2 years ago. It was aircraft related, and since it was basically carrying the data which made the plane fly it was critical by any definition. The process we followed was absolutely document driven. User specs were examined, questions asked and the user asked to add definition and clarification for several iterations of the document. Then the software requirements and so on were followed, each document with quite a bit of iteration. Eventually we found that documentation and design would typically take 50% of the project. Testing would take about 30 to 35%, and the actual implementation hardly took any time at all.
Now in the commercial world, I find that the process is VASTLY different. Implementation has started shortly after user specs have hit the desk, before design or documentation has begun. As a result, the system we currently have is very patchy in places. Its mission is a lot less critical, but the bugs slow us down tremendously. The bugs are due to the process. The process is requirements driven, not documentation driven. But it seems that the current system I'm working on has about the same complexity as the one I used to work on. Even though we are supposed to be pushing it all out the door faster, the bugs are slowing us to the point where we have approximately the same rate of progress as the mission-critical project!! Lesson: If you do it by the documentation, you will push it out faster and cleaner (and more bug-free!!!)
I don't know a software company that wouldn't implement such a strategy to ensure that their software was perfect if they had the budget to do so. As with all things of this nature it comes down to the money vs. quality contest. The better the quality, the more it costs to produce, but unfortunately it's not an even rise up the scale. It may cost you twice as much to improve quality by 50%, but it might cost you another four times as much to get the next increase in quality of only 25%. Even the article points out that the Shuttle software is the most expensive in the world and it is still run on old computers. Give me the same scale of budget/time and I'll give you a Windows operating system that a fanatical Linux user would be hard pressed to complain about. Or, even better, I'll use the funds to set up an open source group to make Linux as versatile and useful across the board, from beginners to the "Linux gurus".
I don't work with the FSW people, so I'm not sure about the details of their work flow, but I think it's safe to say that new code goes through several readings, probably both at the pseudo-code and code levels.
Schedule is driven by the planned date for launch, and worked backward from there. For example, if you're going to launch a mission at date L, then the crew begins training at L minus X months, which means that the software has to be ready for the SMS at L minus Y months, which means you have to begin design at L minus Z months, etc. I'm not sure what X, Y, Z and related time deltas are, but I believe they probably start planning at least a couple of years in advance.
--Jim
Sort out that closing italics tag! The front page article only has the first paragraph, and the second paragraph has the tag. All the headlines are italicised!
My god! Where's all my karma going?
...that we could arrange for a situation where the requirements are all fixed and locked down, and documented, before any coding begins? In industry jobs, I've never seen a project that wasn't having some marketing group force "critical" changes the whole time something was being written.
:)
:)
You get what you pay for, and take the time for. These days, most people and companies seem quite willing to settle for "bad, buggy programs now" rather than "better programs, later". Of course, without organization (also common), it's possible to wait and get nothing later, too. Process is expensive in terms of people involved and time, but it's a lot cheaper in the long run than the alternative.
Open-source projects actually follow this - every successful open project I've seen has a definite hierarchy of people managing patches, controlling what winds up in the latest sub-point build, and making key architectural decisions so nothing derails them. Oh - and there's no one who'll fire you if marketing's last-minute changes aren't rushed through.
The problem with this argument is that while many companies think that they can't afford to do it, what they really can't afford is NOT to do it. Software is becoming more complex - it's the nature of the beast. For the most part, design is not; we are all still using procedures that were brought into being at the 'dawn of the computer age', with the exception of higher order languages and more focus on OO.
You are correct in that it may be expensive, THE FIRST TIME. This is called a 'learning curve' and the cost is amortized over the number of times you use this technique. You may also say that the process itself is expensive but that is incorrect, or at least only partially correct. The process allows errors to be caught EARLY, which reduces cost. Please don't tell me that you believe a code-compile-fix routine can catch these sorts of errors as early as a well thought out design.
Also, rigorous design allows for flexibility - this may sound contradictory but consider the use of design patterns. They are NOT things that can just be thrown into the code ad hoc; they require thought and intelligence. A good upfront design means the ability to use these tools. Consequently, use of these design patterns allows for a certain level of flexibility in satisfying the lower to medium level nasty customer requests, and certainly helps on the more egregious ones. Does a code now, look later approach allow this? (if you think so, I have this bridge I'd like to sell you ...)
In short, yes, using these techniques is expensive. But they also produce code that cuts development time (i.e., no being stuck in the debug/extra-request phase for 2 years) and once people get used to the process, the extra cost/load is minimal.
their process ensures it will be. The vast majority of software development is performed in an environment where individual "heroes" are the primary reason projects succeed. The Space Shuttle Onboard Software processes will seem to almost all of us to be "common sense", but how many of us work in a place where management mandates these things to ensure quality? Their environment is "ideal" because they have made it so. Unfortunately, many managers' (and too many developers', also) attitudes can be described as "get it done", and it shows!
They were rated CMM level 5 in 1988 - one of the first organizations anywhere rated at that level of software process maturity. Another good description of their processes (and how they created them) is in the book "The Capability Maturity Model - Guidelines for Improving the Software Process" (ISBN 0-201-54664-7) in Chapter 6, "A High-Maturity Example: Space Shuttle Onboard Software".
As far as making software error-free, a quote from the book will help illustrate the difference in attitude they have (it's talking about a graph). "These data include failures occurring during NASA's testing, during use on flight simulators, during flight, or during any use by other contractors. Any behavior of the software that deviates from the requirements in any way, however benign, constitutes a failure. Contrast this level of commitment with the cavalier attitude toward users in most warranties offered by vendors of personal computer software."
The best place to find more about the CMM is their web site at http://www.sei.cmu.edu/
I think your problem here is that you still subscribe to the fallacy that "Code like Hell" programming is faster than doing things properly.
It isn't.
Many organisations are starting to find this out and are moving to proper professional engineering practices that improve reliability, increase schedule predictability and, more importantly, reduce costs.
A couple of hundred years ago people built houses & bridges the way we build software - work until it's done. These days we have architects and project managers that build houses faster, more reliably and ON BUDGET.
This is the way the wind's blowing. It's a lot less heroic but it's the future.
It's not just space shuttle code that needs extreme reliability. The embedded systems in civilian aircraft are not interrupt-driven because of the reliability issues associated with interrupt-driven code - interrupts make the software too hard to debug thoroughly (because there are so many combinations and timings of input signals to test), make faults difficult to replicate and have the potential to go wrong on a spurious set of input signals. This sort of problem doesn't really matter too much in a home or corporate computing environment, but it would be a major disaster if a plane carrying a few hundred people were to crash into a city with a population of a few million, just because of a software error. These things need 100.00 per cent reliability, so obviously software hacks are frowned upon.
Bullshit.
NASA didn't just have a solid process, they had MONEY. They BOUGHT that quality, by hiring an order of magnitude more testers than you'd find in the commercial world. By budgeting several years of development time rather than weeks or months. By reducing the number of lines of code that any one developer is responsible for.
There's a lot to learn from a highly structured development process like NASA's. But don't kid yourself that the quality they produced is simply because they 'had the right process' or had better management.
But then the Japanese came along with a radical new idea: if there are defective parts coming down the line, then we should figure out why they were created defectively in the first place and fix that. Then the number of defective parts at the end of the line would be less, thus you would need *fewer* inspectors and *less* time at the end of the assembly line. (Ironically, this principle came from an American named W. Edwards Deming; unfortunately American companies were too successful during his lifetime for them to take him seriously :-) So the Japanese were able to build cheaper cars quicker than the Americans while actually having higher quality.
I think that's very analogous to the current argument. Under the current system of coding, you basically hack together something that sorta works, and then use sophisticated debuggers/development tools to figure out which parts are buggy. Using that system, it's true that higher quality requires more cost and time.
But I think the point of this article is that that is the wrong way to approach programming. First, figure out why defective code gets written in the first place (be it poor client specifications, poor management, poor documentation, whatever) and then fix those processes, and you'll turn out quality code without having to spend any more time or money!
As a practical example, I first learned C under a CompSci Ph.D. who was a quality fanatic. In order to teach me to code properly, he would give me projects and then not allow me to use a debugger. Nothing at all. Zilch. Nada. The only thing I was allowed was to place print statements within my program wherever I wanted to see what was going on. As a result, I spent *a lot* of my time planning my code out, and reviewing it over and over again before even compiling it, because I knew that if there were bugs in it, I couldn't just fire up a debugger and take a look.
And secondly, if there were bugs, I couldn't just trace through the entire program or create a watch list of every variable. I had to study the bug and understand it, look at the code and figure out where the bug most likely was, and then use selective print statements to look at the most suspicious stuff. That way, when I encountered bugs, I'd be forced to actually understand what the bug was and then analyze my code to figure out where that error most likely was.
If this sounds like a programming class from hell, believe me, it was incredible! I couldn't believe how much of my code worked the first time it compiled. And when there were bugs, I actually fixed the underlying flaw in the logic rather than just applying a temporary patch. What's more, since the rest of my program was well planned and documented, there were no "hidden" effects: if I found a bug, I knew exactly which parts of the program it affected, and perhaps more importantly, *how* it affected those parts. Thus they were very easy to fix.
Believe it or not, it took me less time to program this way than using debuggers, and the resulting code was much more stable and understood.
If you look at commercial software these days, it's not uncommon for the debugging period to take longer than the actual coding. In other words, there are more quality inspectors than there are assembly workers, and the time the code spends in inspection stations is longer than it spends being produced. It's tough to say that this is the "efficient" method of programming...
If you want to see where this is heading, just turn once again to the car industry: once American companies got their asses kicked by the Japanese, they adopted their techniques, and Surprise! Cars now come out of their factories with higher quality, in less time, and at less cost (adjusted for inflation and new features :-). Who would've believed it? :-)
No one wants to write buggy code...
Well, mister know-it-all...how do you go about getting really obnoxious amounts of money out of the customer?
A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
Even if you consider a project management approach, often you will wind up rewriting the code from scratch anyhow. From personal experience I can tell you that relying too heavily on user input in the proposal/design phase can cause a project to completely lose focus. Many rewrites are the result of "feature creep" associated with pandering to the user's every whim. The most successful projects I have seen started with a narrowly defined, strict set of goals. Even at that, the trend seems to be at least one major rewrite by the time the software reaches its third version. Code reuse is highly overrated.
That's because pizza-and-coke all-nighters are a direct byproduct of poor planning, either by the engineer implementing the code, the architect creating the design (if there even is such a person) or the person making the engineer's schedule. And the result is usually hastily written, incompletely tested software that is typical of most product offerings for use on the desktop.
The process of authoring mission critical, man rated software is so far removed from the ad hoc, informal, duct-tape-it-together approach that most programmers use that no direct comparison can be made. I've seen both ends of the software development spectrum and they each have their uses. You can't launch a shuttle with a bunch of last minute kernel patches and some stuff that was written the night before the launch date. But you can't compete in the commercial software marketplace with code that takes 2 or 3 years to specify, design, implement, test, and integrate, either.
Stand in awe of the people who have the skill and discipline to write software of this quality. Learn what you can from their process and try and use the lessons they've learned. Their stuff doesn't break, because when it does, people die. If O/S developers had that same attitude about their code, we'd never see blue screens of death, kernel panics, or any of the other flakiness we tolerate on our desktop machines.
Shut up and eat your vegetables!!!
US Military test pilots aren't stupid people. Most of them have advanced degrees in aeronautics or aeronautical engineering -- at the insistence of the military or aerospace firm they work for.
I suspect that, upon seeing the "computer restart" button, the test pilot evaluating the aircraft would start asking a series of questions:
1. What is the failure rate of the computers; i.e., how often will that button have to be pressed?
2. What is the time elapsed between the computer failing and the computer operational, including the reaction time of the pilot or weapons officer? Assume that the pilot and weapons officer are already a) flying the plane, b) lining up on target, c) watching for SAM sites, ground fire, enemy aircraft, and d) coordinating with friendly aircraft.
3. How does the computer controlled, fly-by-wire system function during the timeframe covered in question 2? Will it fly steady (given that many modern fighter airframes are inherently unstable in flight, and rely on active computer control)? Will I have any control over the plane until it restarts?
4. If this happens in a dogfight, what are the chances of recovery and survival?
Or not... In truth, I suspect the first few questions would really be something like "You're kidding me, right? Do you think I'm crazy? Would you be willing to fly this deathtrap?"
Every time I read a history of a programme and find a line "completely re-wrote the code", I begin having second thoughts about how really good the programme is.
There were several occasions last year when a co-worker and I ended up trashing pages and pages of code to rewrite it with the same functionality, but modular, and in some cases it ended up smaller.
My company used consultants who wrote terrible code. Let's use this example...there is a program that calculates x days ago. The consultant's program went and tried to calculate leap years and all of that. Our program that replaced it used system library calls to get the date, and then simply subtracted the proper number of seconds. Other ones were hardcoded scripts to run SQL on our database; we replaced those with a Perl script that took the SQL as a parameter.
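The "x days ago" replacement described above can be sketched in a few lines. This is an illustration only, not the poster's actual program: the function name `days_ago` and the choice to work in UTC are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60


def days_ago(days: int, now: Optional[datetime] = None) -> datetime:
    """Return the moment `days` days before `now` (default: current UTC time).

    The library handles month lengths and leap years; the caller only
    subtracts a number of seconds.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - timedelta(seconds=days * SECONDS_PER_DAY)


# Example: one day before 1 March 2000 is 29 February (2000 is a leap year).
reference = datetime(2000, 3, 1, tzinfo=timezone.utc)
print(days_ago(1, reference).date())
```

Working in UTC sidesteps daylight-saving shifts, where a local calendar day is not always 86,400 seconds long; a program that needs local calendar dates would have to handle that case separately.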
So there are times where a re-write is better than maintaining the code. I guess the biggest case in point is Mozilla versus Navigator. Basically I agree that if projects were planned and used software engineering principles we would most likely end up with good products. Granted, game programs seem to be done best when they're a hack. But how many times have you seen long term maintenance of games?
"If you insist on using Windoze you're on your own."
Here's NASA's own history on bugs in that software:
- So, despite the well-planned and well-manned verification effort, software bugs exist. Part of the reason is the complexity of the real-time system, and part is because, as one IBM manager said, "we didn't do it up front enough," the "it" being thinking through the program logic and verification schemes. Aware that effort expended at the early part of a project on quality would be much cheaper and simpler than trying to put quality in toward the end, IBM and NASA tried to do much more at the beginning of the Shuttle software development than in any previous effort, but it still was not enough to ensure perfection.
Read the NASA history. They had a 200-page known-bug list in 1983, although they did fix most of them during the long downtime after the Challenger explosion. The Shuttle's user interface is awful. The thing has hex keyboards! Some astronaut comments include
This project should not be held up as a great example of software engineering. Even NASA doesn't think it is.
Solidifying a contract like that works when the client actually knows what they want. More often they have absolutely no clue of what they want/need, and require the programmer to help them along that stage as well.
With this type of client (and I've dealt with many), taking the proper long stage of design and discussion doesn't work at all. The client immediately changes their tune after seeing initial results. Not so much to add features, but because the features they actually requested were not the ones they needed, or didn't work within their business practices.
Doug
Venn ist das nurnstuck git und Slotermeyer? Ya! Beigerhund das oder die Flipperwaldt gersput!
'Course if they started writing space shuttle code like that, it would be "Goodbye World"...
All opinions are my own - until criticized
This software is the work of 260 women and men...
Commercial programs of equivalent complexity would have been written by 7 or 8 people.
If more projects worked like that, there would also be a lot less software in the world. Say goodbye to whatever you're running to watch slashdot, you couldn't afford it. (You also couldn't afford the hardware to run it on, because faultless software is of little utility without faultless hardware.)
I would suggest that if every software project were SEI-5, there would be no internet and people would be doing papers on typewriters.
My third programming job was my first experience with software engineering. I'd had 4 years of experience at two other jobs -- one where I wrote code for an InterLibrary Loan book lending database, and one where I worked on an e-commerce package. There was not a thing at either place that would qualify as a spec, and there was no process in place for engineering. I didn't know anyone who used specs. I assumed that this was something that was taught by computer science professors, but wasn't actually practiced by anyone.
:)
Then I got a job at the Waterford Institute. Their process probably wasn't as tight as the space shuttle's, but there WAS a process, and there were specs. Nice specs. Nearly pseudo-code.
We were programming educational activities for kids learning math. Activities were created by design teams consisting of an educator, an artist, a tech writer, and a programmer. The tech writer would document everything that went on at the meetings, and distill it into a spec. The design team would meet regularly over a period of several months, refining the spec until it was solid.
The spec described various states of the software. When a user did something, the state of the software changed, and did something accordingly. I'd never seen software described this way, but it made a big impression on me, and it made things easy to write and debug.
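A spec written as states and transitions maps almost directly onto a table-driven state machine. The sketch below is hypothetical: the state names, event names, and the `next_state` helper are invented to illustrate the idea for a math activity and are not taken from the actual specs described here.

```python
from enum import Enum, auto


class State(Enum):
    SHOW_PROBLEM = auto()
    AWAIT_ANSWER = auto()
    SHOW_FEEDBACK = auto()
    DONE = auto()


# Each row reads like one line of a spec:
# "in state X, when event Y happens, go to state Z".
TRANSITIONS = {
    (State.SHOW_PROBLEM, "problem_shown"): State.AWAIT_ANSWER,
    (State.AWAIT_ANSWER, "answer_submitted"): State.SHOW_FEEDBACK,
    (State.SHOW_FEEDBACK, "next_problem"): State.SHOW_PROBLEM,
    (State.SHOW_FEEDBACK, "quit"): State.DONE,
}


def next_state(state: State, event: str) -> State:
    """Look up the transition; anything the table does not list is an error."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not allowed in state {state.name}") from None
```

Because every legal transition sits in one table, a reviewer can check the code against the spec row by row, and an unexpected event fails immediately instead of leaving the program in an undefined state.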
('course, the platform we were writing on was Java, which kept changing, and in-house developers were writing our own object library, which kept changing too, so your code would work one day and then wouldn't the next, so everything wasn't perfect. But hey, I was impressed with the specs.)
Tweet, tweet.
I saw this article a while back linked from here. Incredibly cool stuff . . . the part about "blueprinting software" and "how we design software in the future" was especially cool. It makes one aspire to code to a higher standard.
That said, something I was curious about that the article didn't answer, and that I don't see mentioned here yet-- what language is all of this done in? Ada would be my guess, or is there something even better than that?
iSKUNK!
Want to know what a Shuttle GPC looks like? Check out http://www.ksc.nasa.gov/mirrors/images/images/pao/STS39/10064134.htm.
--Jim
I thought they were the only group to achieve SEI Level 5. If not, then who else has? I'd love to go and correct one of my lecturers.
When the Capability Maturity Model for Software was published by the SEI there was only one ML-5 organization; at the time they were known as the IBM Onboard Shuttle group. Thankfully, times are changing.
According to the SEI's 1999 survey, 61 organizations reported a Maturity Level of 4 or 5. Of those, 40 were Level 4 groups and 21 were Level 5. The survey goes on to mention that as of 15-Feb-2000, some 71 organizations reported that they were Level 4 or 5. Those that gave their consent are listed in Appendix A.
I think the last US fighter not to rely on computer controls was probably the F-15. To be inherently unstable is a feature...not a bug. Wasn't it a software flaw that caused the prototypes of both the F-22 and the Saab Gripen to crash on landings? Although the F-22 was a walk-away crash and fire...the Gripen was a bit more spectacular if I remember it right.
The B-1B has seven of the GCUs that the Shuttle has. So it couldn't fly at all either. The FA-18E has a number of PowerPC-chipped flight control computers...the FA-18E is the first US fighter to use Cat-5 Ethernet to connect the computers together instead of obscure military cabling...at least that's what I read.
IMHO the biggest problem with the F-16 is the fact that it has a single engine. If you look, single-engine jets crash more than twice as often as twin-engine jets. A single point of failure will get you every time.
I work for SAIC and we use the same processes (SEI) in our software development. Our clients include banks, airlines, brokerages, the IRS, etc. We made >$5 billion last year alone doing this. It costs a bundle to set it up initially and requires a ton of training to make sure people do it right but the result is outstanding software and very, very few all-nighters.
"Shredded cabbage and mayo go good together." Cole's Law
Not as many as should have.
Every time I read a history of a programme and find a line "completely re-wrote the code", I begin having second thoughts about how really good the programme is.
With the ever-faster-growing complexity of programmes, it becomes more and more difficult for humans, even aided with computers, to keep track of the project. But if you teach everyone how the computer logic works, the programming would become only about writing the necessary simple code (ha! hackers, get this!).
Would the next generation of programmers write in "logic language" instead of C++? Who knows, but it would IMHO make the programmes more robust and even better.
I can see many Dilbert-fans wondering if that is a bug or a feature.
1 Space Shuttle Endeavor
1 Launch Pad
1 Houston Mission Control Station
4 Astronauts
Shine on, you crazy diamond.
I had the time.
I had the patience.
Well this is cool. It proves that you can't write perfect software*. However you can come close.
If only everybody would do it this way, not just some cool company.
This probably even produces better software than the "open source" way. OpenBSD is the only open software project that comes close, it really is kind of sad. People need to relax to do it right, down with stress!
Well, if you meet someone who works at some dot com (there are quite a lot of them here in Stockholm), they are always really, really stressed. That might impress the stock market but not really anyone else... That is the reason everybody talks about "When will the bubble burst?" and I can tell you this:
The "bubble" (which consists of overstressed people) will burst very soon. The more relaxed people will take it easy.
* Well you can, but Hello World! isn't really THAT complex.
It's called new wave but it's just the same.
This software is bug-free. It is perfect, as perfect as human beings have achieved. Consider these stats: the last three versions of the program -- each 420,000 lines long -- had just one error each. The last 11 versions of this software had a total of 17 errors. Commercial programs of equivalent complexity would have 5,000 errors.
How can they be sure it's bug free? If the last 14 versions had 20 errors, did they think it was bug free each time - only to find more bugs? At 500k lines of code you can't prove it all mathematically and human checkers are.. well human.
One way to measure how many bugs your code has is to purposefully introduce a bug and tell people to find it. Then you count how many new bugs they found along with the bug you introduced and scale that by the lines of code you have. But this technique won't work if you only have 1 or 2 bugs that people are actively looking for in the first place. So, my question is - how can they be sure it is bug free?
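For what it's worth, here is a rough sketch of the seeding arithmetic. It is only an illustration: the function names are made up for the example, and it assumes reviewers are just as likely to find a seeded bug as a real one, which is a big assumption. The idea is that the fraction of seeded bugs found estimates the detection rate, and the count of real bugs found is scaled up by that rate.

```python
def estimate_total_defects(seeded: int, seeded_found: int, real_found: int) -> float:
    """Estimate how many real defects exist, from a seeding experiment.

    Assumption (hypothetical model): seeded and real defects are found
    at the same rate, so real_found / total_real ~= seeded_found / seeded.
    """
    if seeded <= 0:
        raise ValueError("need at least one seeded defect")
    if not 0 <= seeded_found <= seeded:
        raise ValueError("seeded_found must be between 0 and seeded")
    if real_found < 0:
        raise ValueError("real_found cannot be negative")
    if seeded_found == 0:
        # No seeded defect was found, so the detection rate is unknown.
        raise ValueError("no seeded defects found; cannot estimate")
    return real_found * seeded / seeded_found


def estimate_remaining_defects(seeded: int, seeded_found: int, real_found: int) -> float:
    """Estimated real defects still hiding after the review."""
    return estimate_total_defects(seeded, seeded_found, real_found) - real_found


if __name__ == "__main__":
    # 20 bugs seeded, reviewers found 15 of them plus 6 real ones.
    print(estimate_total_defects(20, 15, 6))      # 8.0
    print(estimate_remaining_defects(20, 15, 6))  # 2.0
```

Note what happens with tiny numbers: seed 1 bug, find it, find no real bugs, and the estimate is 0 remaining. With so few data points the number tells you almost nothing, which is exactly the problem with applying this to code that has only 1 or 2 bugs left.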
-- Virtual Windows Project
So people don't see lines like this in the code:
#Shuttle Waste Dump
#
#I dunno WHY this works, but it does!
Call on God, but row AWAY from the rocks!
I can almost hear the moans from the pizza-and-coke crowd when they read this: "Where's the fun? Where's the creativity?". But they're under the mistaken assumption that putting lines of code into the editor is the only fun thing about developing software.
IMHO, software development is full of fun activities. What about analysis and design? In my experience, that's where the creativity really comes into play. Just talking to the customer, understanding the problem and making a working design is really difficult, and hence rewarding when you pull it off.
And what about the process itself? Software development is a young discipline, where individuals and small groups really can make an impact. Nobody really knows how to make good software. Maybe you'll be the one to find out? As the man says, in the shuttle software group, people use their creativity on improving the process.
And last, but not least, I bet those guys have a really good feeling when they talk to the customer after delivery. Not like some people I know, who just hide. ;)
If you can't see the fun of these other activities, maybe you shouldn't be working in this field...
A)bort, R)etry or S)elf-destruct?
One way of increasing the reliability of software is to use n-version programming, whereby you implement several versions of the software, written by different people, and then create a voter system that constantly compares the data of each program and forwards the consensus one. Even if none of the programs agree, the voter 'knows' that something is amiss and can alert the pilot/engineer/whatever. I'm doing my PhD on this, and I know that NASA has implemented quite a few n-version systems, as well as the more tried and trusted multiple-redundant hardware. I heard somewhere that the space shuttle code costs the equivalent of $100,000 a line (feel free to tell me I'm wrong if you know the 'true' figure) so it might be worth considering. Certainly a number of prominent academics reckon that you can get a 45:1 improvement in a software system by implementing 3 channels as opposed to a single good system. Blah, anyway, that's my $.02 worth.
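To make the voter idea concrete, here is a minimal sketch. It is not how NASA or anyone else actually implements it; the function names and the three toy "versions" (integer square root, one of them deliberately wrong) are invented for illustration. The voter forwards a value only if a strict majority of versions agree, and returns None otherwise so the caller can raise the alarm.

```python
import math
from collections import Counter
from typing import Callable, Hashable, List, Optional, Sequence, Tuple


def vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value a strict majority agrees on, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def run_n_version(
    versions: Sequence[Callable[[int], int]], x: int
) -> Tuple[Optional[int], List[int]]:
    """Run every version on the same input and vote on the results."""
    outputs = [version(x) for version in versions]
    return vote(outputs), outputs


# Three hypothetical, independently written versions of integer square root.
def isqrt_builtin(n: int) -> int:
    return math.isqrt(n)


def isqrt_loop(n: int) -> int:
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r


def isqrt_faulty(n: int) -> int:
    return n // 2  # deliberate bug, standing in for a design fault


if __name__ == "__main__":
    consensus, outputs = run_n_version([isqrt_builtin, isqrt_loop, isqrt_faulty], 16)
    print(outputs)    # [4, 4, 8]
    print(consensus)  # 4
    print(vote([1, 2, 3]))  # None: no agreement, alert the operator
```

The catch, of course, is that this only helps if the versions fail independently. If all the teams misread the same spec the same way, the voter will happily forward the wrong consensus.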
Let's see,
- half a million LOC (that's small)
- under development for 20 years
- new requirements are avoided at all cost
So it is a small, long-lived project with a nearly unlimited budget. No wonder they can afford to have such a process in place. But now, realistically, how long does it take to set up such a project from scratch? How about having a customer who does not know what he wants? How about deadlines of less than 10 years from now?
I honestly believe that this way of delivering software is optimal for nothing else but long-lived, multi-billion dollar projects. In any other case you'll end up with something that is delivered years too late, indeed matches the requirements of 10 years ago, and is close to useless.
Unfortunately many software companies are in a situation where they can't afford to wait for perfect software. Take mobile phones as an example. Typically these things become obsolete within half a year after introduction. The software process is what determines time to market. Speed is everything. If you can deliver the software one month earlier, you can sell the phone one month longer.
Of course testing, requirement specs and software designs are useful for any project, but it's usually not feasible to do them properly.
Jilles
If that's so, it's an interesting illustration of the overall system's requirements imposing lower quality standards on components of that system.
To wit: the article (I presume; haven't read it, but have read similar ones on the same topic) discusses the importance of achieving a 100% quality rate on a given chunk of software.
Now, that software is merely one component in a much larger system.
Actually, these larger systems nest "outwards". I.e. the shuttle itself is a larger system than the software it contains, but so is NASA a larger system than the shuttle; so is the US government larger than NASA; so is the USA larger than the government; so is the planet's population larger than the USA; etc.
In this case, I can suggest specific reasons for the 100% quality requirement that might otherwise go unnoticed:
-
-
-
(Yes, there's some overlap there, but these are subtly different points, that might apply independently in other projects. E.g. a not-publicly-visible project might have no risk of embarrassment should it fail in one way vs. another, but have a huge risk of $$$ loss.)
Failure resulting in death of participants, and especially of non-participants (humans), is not an option.
However, failure resulting in not launching, not even building it in the first place, especially not building it within some timeframe, is an option. That is, failure of the "commitment to quality" approach to actually deliver the component on a "timely" basis is an acceptable option.
The world generally will admire a program such as the space shuttle less if it crashes and burns frequently, killing/maiming people and destroying equipment, than if it succeeds on the extremely rare occasions on which it is tried -- perhaps even less than if it never happened in the first place.
A delay in a shuttle launch costs, overall, far less than the cumulative risks of premature shuttle launches. (Challenger demonstrated that.)
Compare these elements to fighter aircraft, where the software is part of a somewhat different set of larger systems:
The deaths of participants and non-participants are expected by most everyone of this sort of system and the activities around which it revolves.
On the contrary, the sorts of failures that result from failing to launch a fighter plane, or never having designed it in the first place, are generally not so well-tolerated.
The world will likely fear a non-existent fighter plane, even one that has 100% success in its flight-control software (doesn't require rebooting) but is launched extremely rarely (it's hard to build) or too late, far less than it will a large fleet of existing, dangerous fighters that have even a 10% "kill" rate of its pilots per year.
A delay in a fighter-plane deployment can literally cause lost wars. In that sense, the loss of pilots due to poor design is a calculated positive compared to the loss of a nation's (and/or its peoples') freedom.
Of course, I'm making pretty much everything up, above, so don't bother arguing details or interpretations with me -- I have no idea whether they're correct or not.
But, they're probably correct enough to illustrate why it's probably okay for us to be using highly buggy computers on a poorly designed (for the way it's being used now, anyway) Internet rather than, as another post on this thread put it, using typewriters and plain paper.
Not that there aren't wonderful advantages to deploying 100% correct software components in a large-scale, much-buggier system! "Creeping quality" is not a bad thing at all, since it allows people working on the system to worry less about various portions of it as they try to debug it.
But, the effort to deploy such perfect components may well outweigh the utility of doing so, overall, given the pertinent timeframe.
In particular, when trying to deploy such a perfect component in a large, buggy system, it can be hard figuring out which component can be made so "perfect" and still be useful in that (presumably speedily-evolving) system by the time it's ready!
So maybe it's appropriate to view almost everything we deal with on the Internet as a very early alpha-stage prototype after all. ;-)
Practice random senselessness and act kind of beautiful.
I'm working on a large NASA project now. I have determined that the purpose of this project is not to produce a working software system, but rather to produce a wall full of loose-leaf binders of incomprehensible documentation that no one will ever refer to again.
The process says we must have code reviews - great! But instead of being an analysis of the logic of my code, it turns into a check against the local code formatting standards - "You can't declare two variables with one declaration, use int a; int b; instead of int a,b;" (yes, that's an actual standard around here) instead of "Hey, if foo is true and bar is negative, you're going to dereference a garbage pointer here!"
The forms are observed, but the meaning is forgotten, like Christians going to church on Sunday then cutting people off and flipping them the bird on the drive home.
"Process" won't save us. Which doesn't mean that a certain amount of it can't help, but there is no silver bullet.
Tom Swiss | the infamous tms | my blog
You cannot wash away blood with blood
So the way that Microsoft Flight Simulator keeps crashing is actually a feature?
I happen to work just down the hall from the guys who maintain and upgrade the shuttle Flight Software (FSW), and I can tell you they have a rigorous design, inspection, and test sequence that they go through before they fly new or modified code. The story around here (which I have no reason to doubt) is that the FSW team was one of the first SEI level-5 certified shops in the nation.
I can also tell you that NASA avoids having to make unnecessary changes to the FSW. For example, the new "glass cockpit" recently discussed here on Slashdot: when these upgrades were designed, they chose to design the interface to the new display modules to exactly mimic the interface to the old instruments. In other words, they are true plug-and-play replacements; one significant reason for this was so the flight software didn't have to be modified.
Likewise, people often ask why the shuttle continues to use such antiquated General Purpose Computers: slow, 16-bit machines designed back in the seventies. There are many reasons, but a big reason is that new hardware would almost certainly require massive changes to the flight software. And rewriting and recertifying all that software would be a huge task. The current FSW works reliably; if it ain't broke...
Huzzah! As I type, we just launched Atlantis. Go, baby, go!
--Jim
If everyone would simply use VDM/Z or Larch/CLU for all their development work, it would be much easier for us to prove our software is correct, and then all bugs would be a thing of the past.
It really is that simple. Don't these people remember what they were taught at college?
Notice that the article is similar to ISO (International Organization for Standardization): everything revolves around the 'process'. The result is determined by the process. I used to work for a company that had a documented process for everything... from software development right through to filling out your wage timesheet! I think the important thing to note is that it all depends on the 'culture' and type of the organization. If people accept this style of operation then it's great. For an organization that has to program software that directly deals with lives at stake, there must be a 'process' to ensure the s/w is written perfectly (and tested).
I have come across fellow workers who absolutely hate this type of practice... well, they're probably best suited for development of systems that aren't life-critical.
"C is a great, if complicated language. It's simple, yet can get complicated very easily..."
It's complicated.
It's simple.
It's complicated again.
The article gets worse from there.
--
Peter
Some of my most successful programs (read: they actually worked, or thereabouts) came about because I was in a funny mood and decided to actually plan them out. From what I hear about the real world, some (but by no means all or even most) programmers look down on clients just because they don't know much about programming. They assume that just because they have a certain expertise over others that they somehow know more than them in general.
The good thing about the way software is written here is that the requirements are written down and sorted out before they even do the planning. How many programmers, groups, firms etc. can say that? I will admit, though, that a major problem is changing requirements, something that just doesn't happen in the same way for NASA. It might just be better if people decided to wait a bit before jumping in to the programming. They'll save themselves more time and money in the long run.
I never quite understand why it is an act of macho bravado to work all night and live off pizza. It indicates two things: 1) a badly run project and 2) poor maintainability in the code.
In one of my previous incarnations I worked on display systems for Air Traffic Control, where the quality level was also very high, where the performance requirements were exacting and the specifications precise.
Some would think that this means simple and boring... Of course not. Having to display a track from reception at the Radar to the display in 1/10th of a second isn't easy by any stretch of the imagination, and to do it so it works 100% of the time means you have to understand the problem properly rather than coding and patching.
If only more projects worked like that, there would be far fewer bugs in the world.
An Eye for an Eye will make the whole world blind - Gandhi