Ultra-Stable Software Design in C++?

Posted by Cliff on Saturday February 4, 2006 @03:35PM from the failure-minimization dept.

null_functor asks: "I need to create an ultra-stable, crash-free application in C++. Sadly, the programming language cannot be changed due to reasons of efficiency and availability of core libraries. The application can be naturally divided into several modules, such as GUI, core data structures, a persistent object storage mechanism, a distributed communication module and several core algorithms. Basically, it allows users to crunch a god-awful amount of data over several computing nodes. The application is meant to primarily run on Linux, but should be portable to Windows without much difficulty." While there's more to this, what strategies should a developer take to ensure that the resulting program is as crash-free as possible? "I'm thinking of decoupling the modules physically so that, even if one crashes/becomes unstable (say, the distributed communication module encounters a segmentation fault, has a memory leak or a deadlock), the others remain alive, detect the error, and silently re-start the offending 'module'. Sure, there is no guarantee that the bug won't resurface in the module's new incarnation, but (I'm guessing!) it at least reduces the number of absolute system failures.

How can I actually implement such a decoupling? What tools (System V IPC/custom socket-based message-queue system/DCE/CORBA? my knowledge of options is embarrassingly trivial :-( ) would you suggest should be used? Ideally, I'd want the function call abstraction to be available just like in, say, Java RMI.

And while we are at it, are there any software _design patterns_ that specifically tackle the stability issue?"

You're not the first one.... by DetrimentalFiend · 2006-02-04 15:36 · Score: 4, Insightful

I'd hate to say it, but you might want to SERIOUSLY consider managed code. You could build some of the parts in C++ if need be, but doing it purely in C++ seems like a bad idea to me. You're asking for a silver bullet that just doesn't exist...but managed code is getting faster and can be pretty stable.
1. Re:You're not the first one.... by gadzook33 · 2006-02-04 16:10 · Score: 5, Insightful
  
  Ah, another true believer. I work heavily in both managed and unmanaged code (c/c++/c#) hybrid solutions. In my experience, a well designed C++ program is as stable as a well designed C# program. Who cares if it "crashes" if it doesn't do what you want? The worst program is one that seems to be working but is generating invalid results. Don't let anyone convince you that C# is going to provide more reliable execution. We use C# for its nice GUIs; C++ for cross-platform portability.
2. Re:You're not the first one.... by mr_tenor · 2006-02-04 16:45 · Score: 5, Insightful
  
  WTF? I love Haskell as much as the next programming-language-theory fanboy, but saying "Haskell or one of the other functional languages might be a good idea." in reply to the OP strongly suggests to me that you are just making stuff up and/or are copy/pasting things that you have read elsewhere out of context.
  
  If not, then great! Please post some references to literature which demonstrates how what you've suggested is sane and/or possible :)
3. Re:You're not the first one.... by Duhavid · 2006-02-04 17:21 · Score: 4, Insightful
  
  I read his comment as saying that C# would not guarantee
  a good result ( correctness ). And it won't.
  
  What the guy really needs is a great team and some decent
  process to backstop that team. Not a silver bullet.
  
  --
  emt 377 emt 4
4. Re:You're not the first one.... by SchroedingersCat · 2006-02-04 18:10 · Score: 5, Insightful
  
  Here is the clue: if the code *relies* on being *managed* then the design is not stable. Well-designed system will not need a garbage collector, and poorly-designed system will not be saved by the garbage collector.
5. Re:You're not the first one.... by asmussen · 2006-02-04 18:46 · Score: 4, Funny
  
  What? You need to use ones AND zeros??? Loser...
  
  --
  Shawn Asmussen
6. Re:You're not the first one.... by chiph · 2006-02-05 01:50 · Score: 4, Insightful
  
  Absolutely.
  What I'm hearing is the guy's boss telling him "And it'd better not crash!"
  
  Typically, when absolute reliability is needed (nuclear power plants, spacecraft, pacemakers), you start subtracting libraries which aren't known to be absolutely reliable, yet in this case they're adding them. In addition, he's wanting it to run on multiple platforms, which radically increases your testing workload.
  
  On top of that, he admits he's got no experience in the techniques needed to produce reliable software. Probably has a short deadline, too.
  
  My crystal ball says he's doomed to failure.
  
  Chip H.
7. Re:You're not the first one.... by The_Wilschon · 2006-02-05 04:18 · Score: 4, Interesting
  
  More: http://www.cs.indiana.edu/~jsobel/c455-c511.updated.txt about a guy who wrote the "Fast Multiplication" algorithm very simply in Scheme, and then transformed it (using correctness preserving transformations, which are much much easier to do in "Haskell or one of the other functional languages" than in C/C++ and friends) into Scheme code that was as optimized as he could come up with, and which furthermore had a pretty much 1-1 correspondence with C statements. He then rewrote it in C (including perfect "goto"s!), and beat all but one person in his class on the speed of the algorithm. Furthermore, he spent significantly less time working on (read debugging) his code than anyone else in the class.
  
  --
  SIGSEGV caught, terminating
  
  wait... not that kind of sig.
8. Re:You're not the first one.... by synthespian · 2006-02-05 06:11 · Score: 4, Funny
  
  What, praytell, is the difference between a functional language like Haskell and a well-designed C++ template library?
  
  Referential transparency.
  That comp.lang.functional thread is interesting because there are guys from Ericsson elaborating on some real-world aspects of referential transparency. As you know, Ericsson uses the functional programming language Erlang for their switches. See more in: Welcome to a Smarter Way of Programming. Of course, you can't take their use of Erlang seriously, because they're from Sweden, and Sweden, being a fucked-up third-world country with no tech at all, is not an example for America. The mighty AT&T pushed C++, and now the world is a better, safer place, where software errors are a thing of the past.
  
  --
  Main difference between the BSD license and the GPL license: one is from California and the other is from Massachusetts
I'm gonna take a guess, but.. by Anonymous Coward · 2006-02-04 15:41 · Score: 5, Funny

try not to de-reference any NULL pointers and you should be ok..
1. Re:I'm gonna take a guess, but.. by arivanov · 2006-02-04 19:50 · Score: 5, Insightful
  
  Well... Someone modded this as funny. Wrong... It is the first comment I have seen so far on this article that comes anywhere near being insightful.
  The secret of stable system design is designing from failure. Designing and implementing defensively. If you want to design an ultrastable system you start with the failure analysis for every component, following with failure analysis of modules and the entire thing as it grows.
  This in the world of C++ (and C for that matter) quite often means checking paranoiacally everything everywhere for NULLs before doing anything about it.
  Designing and writing from failure means that every system or library call should be assumed to fail first and all failures handled cleanly. This may be quite painful because it usually requires the development of special tools like wrappers around malloc, file calls, etc. that return error conditions which are nearly impossible to achieve on a live system.
  Only after all codepaths for "bad" results have been handled, the actual "normal" codepaths should be written. This unfortunately is not the way code is written in 99% of the shops out there. Most design and implement from success first and add failure handling later.
  Just ask in your shop: "Where is our memalloc wrapper that simulates a failed memory allocation? I need to link versus it to do some testing to see how our app handles NULLs in a few places". The usual answer you will get is "Ugh? WTF you are talking about Dude... We do not smoke that stuff here... Just go and write the code you have been assigned to write..."
  And the results are quite bloody obvious.
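
A minimal sketch of the kind of allocation wrapper described above, assuming C++11. The names `fault::alloc`, `fault::fail_countdown` and `make_buffers` are hypothetical, made up for this illustration; a real project would more likely hook its allocator at link time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

namespace fault {
    // Hypothetical test hook: -1 disables injection; a value k >= 0 means
    // "let k more allocations succeed, then fail the next one once".
    long fail_countdown = -1;

    // Wrapper around std::malloc that can be told to fail on demand.
    void* alloc(std::size_t bytes) {
        if (fail_countdown == 0) {
            fail_countdown = -1;   // fail exactly once, then behave normally
            return nullptr;
        }
        if (fail_countdown > 0) --fail_countdown;
        return std::malloc(bytes);
    }
}

// Example code under test. It must cope with either allocation failing:
// on failure it returns false and owns no memory.
bool make_buffers(std::size_t n, char*& a, char*& b) {
    a = static_cast<char*>(fault::alloc(n));
    if (a == nullptr) return false;
    b = static_cast<char*>(fault::alloc(n));
    if (b == nullptr) {
        std::free(a);              // undo the first allocation
        a = nullptr;
        return false;
    }
    return true;
}
```

A test would set `fault::fail_countdown` to 0, then 1, and so on, and check that `make_buffers` reports failure each time without leaking.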
  
  --
  Baker's Law: Misery no longer loves company. Nowadays it insists on it
  http://www.sigsegv.cx/
Here's your best bet. by neo · 2006-02-04 15:42 · Score: 5, Interesting

1. Write the whole thing in Python.
2. Once it's bullet-proof, replace each function and object with C++ code.
3. Profit.
1. Re:Here's your best bet. by YGingras · 2006-02-04 16:24 · Score: 5, Informative
  
  This is really good advice but it needs more details:
  
  1) Wrap your legacy libs with SWIG
  2) Code a working prototype in Python
  3) Profile it (never skip this step)
  4) Use SWIG to write the bottleneck parts in C++
  5) Use Valgrind to ensure you are still OK memory wise
  6) Profit!!
They Write the Right Stuff by Pentclass · 2006-02-04 15:43 · Score: 5, Interesting

Follow NASA's advice... http://www.fastcompany.com/online/06/writestuff.html
Re:Get another programmer by Philip+K+Dickhead · 2006-02-04 15:44 · Score: 5, Funny

Make sure his name is something like "Bjarne" or "Knuth".

--
"Speaking the Truth in times of universal deceit is a revolutionary act." -- George Orwell
Don't get too fancy... by Pyromage · 2006-02-04 15:47 · Score: 4, Informative

First, consider how complex you want to make the system. The decoupling is a good idea, I think. However, I don't think that having modules automatically restart one another is a good idea; it introduces a whole slew of other problems. At most I'd say use a watchdog process (principle of single responsibility).
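
One way a watchdog like that might look, as a rough sketch only. It assumes a POSIX system (`fork`, `waitpid`), and the function name `supervise` and its "attempt number" parameter are invented for the example. The module runs in a child process; the parent only waits and restarts.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs `module` in a child process and restarts it if it crashes or exits
// with a non-zero status. Returns the number of restarts that were needed,
// or -1 if the module still failed after `max_restarts` restarts.
int supervise(const std::function<int(int)>& module, int max_restarts) {
    for (int attempt = 0; attempt <= max_restarts; ++attempt) {
        const pid_t pid = fork();
        if (pid < 0) return -1;              // could not create a process
        if (pid == 0) {
            // Child: run the module, then leave without running the
            // parent's atexit handlers or flushing its stdio buffers.
            _exit(module(attempt));
        }
        int status = 0;
        if (waitpid(pid, &status, 0) < 0) return -1;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            return attempt;                  // clean exit: done
        }
        // Killed by a signal (e.g. SIGSEGV) or non-zero exit: try again.
    }
    return -1;
}
```

Because the child has its own address space, a segmentation fault there cannot corrupt the watchdog. A deadlocked child would need a timeout on top of this, which the sketch leaves out.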

Furthermore, you're crunching large amounts of data, so I'm guessing batch processing. If you can have the application not be a server, then you simplify things a lot. Make it a utility that takes data on standard input and runs whatever analysis you need, and duct tape it together with cron or a simple program that watches for new input files.

Also, I'd like to suggest that you consider whether other languages could be efficient for the task. For example, Java is pretty good numerically, and as far as your libraries go, see if you can use SWIG to generate JNI wrappers. Also, then you get Java RMI.

Next, get them down to one platform. It's *way* easier to develop software with tight constraints on a single platform (versus multiple platforms). Investigate QNX: a reliable operating system (though admittedly quirky) with a beautiful IPC API. In any case, make sure you get a well-tested library with message queues, etc. You don't want to be using raw sockets; you could but that's just another pain in the ass on top of everything else.

Last, figure out what the cost of a failure is. Getting that last few percent of reliability is very very expensive. Unless you're a pacemaker or respirator, the cost of failure is probably not as high as the cost of five nines of uptime.
Don't code to impress. by jellomizer · 2006-02-04 15:47 · Score: 5, Informative

When coding something that needs to be stable, you need to put your ego aside and concentrate on the task at hand. Stick with tried and true methods; don't go with any algorithm that you are not 100% comfortable with, even if it makes the code less ugly. Follow good practices: write many functions/methods and make each one as simple as possible, because simple functions are easier to check for bugs. Secondly, document it like you never want to touch the code again (in code and out of code). You want to know what is going on at all times, and the bigger the code gets, the greater the chance you will get lost in it. When working in a team, if you change someone else's code, document that you made the change.

Next take into account what causes most Crashes.
Bad/Overflow memory allocation.
Memory leaks.
Endless loops.
Bad calls to the hardware.
Bad calls to the OS.
Deadlock

If you are going to decouple modules keep in mind that you will need to do as much processing as possible with minimum message passing and allow for mirrors so if one system is down another can take its place, without killing the network.

For IPC I tend to like TCP/IP Client server. But that is because it tends to offer a common platform independence and allows for expansion across the network. Or try other Server Methods such as a good SQL server Where you can put all the shared data in one spot and get it back. But not knowing the actual requirements it may just be a stupid idea.
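
TCP hands the receiver a byte stream, not messages, so a client/server design needs some framing on top. Below is a small sketch of length-prefixed framing, with the socket calls left out so it stays portable. The helper names `frame` and `try_unframe` and the 4-byte big-endian header are assumptions for illustration, not something from the comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Prefixes the payload with its length as a 4-byte big-endian integer.
std::string frame(const std::string& payload) {
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    std::string out;
    out.push_back(static_cast<char>((n >> 24) & 0xFF));
    out.push_back(static_cast<char>((n >> 16) & 0xFF));
    out.push_back(static_cast<char>((n >> 8) & 0xFF));
    out.push_back(static_cast<char>(n & 0xFF));
    out += payload;
    return out;
}

// Tries to take one complete message off the front of `buffer`, which holds
// whatever bytes have arrived so far. Returns true and fills `message` if a
// whole frame is present; otherwise leaves `buffer` untouched and returns
// false, meaning "wait for more bytes".
bool try_unframe(std::string& buffer, std::string& message) {
    if (buffer.size() < 4) return false;          // header not complete yet
    std::uint32_t n = 0;
    for (int i = 0; i < 4; ++i) {
        n = (n << 8) | static_cast<unsigned char>(buffer[i]);
    }
    if (buffer.size() - 4 < n) return false;      // body not complete yet
    message = buffer.substr(4, n);
    buffer.erase(0, 4 + static_cast<std::size_t>(n));
    return true;
}
```

A production version would also reject lengths above some agreed maximum, so that a corrupt or hostile header cannot make the receiver buffer gigabytes.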

I would suggest that you also ask in places other than Slashdot. While there are many experts on this topic, there are also an equal if not greater number of kids on here who think they know what they are talking about, or who have their ego invested in this technology or method.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Fault Tolerance Vs. Stability by aeroz3 · 2006-02-04 15:59 · Score: 5, Insightful

I think perhaps what you REALLY mean here by stability is Fault Tolerance. It's impossible to write code that has zero defects, outside of any trivial examples. Real Code Has Real Defects. Now, as you talk about modular design and being able to restart modules, you're talking about, not stability, but fault tolerance; the ability of the application to recognize and recover from faults. For instance, you can't necessarily guarantee that the module on machine A running task B won't die, hell the computer could accidentally fry, but if your application was Fault Tolerant then the application would kick off another process somewhere else on computer C to rerun job B. Stable systems aren't built necessarily by trying to write defect-free code, but by recognizing that defects will occur and architecting the system in such a way that it can recover from them. Here you need to be concerned about things like transactions, data roll-back, consistency, techniques (active vs. passive, warm vs. cold). The key thing is before you even write a LINE of this C++ code, make sure that you have a complete, comprehensive ARCHITECTURE for your application that will gracefully handle faults.
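
The "rerun job B on computer C" idea can be sketched in a few lines, assuming C++11. Here a node is modelled as an in-process callable that throws when it fails; `Node`, `Outcome` and `run_with_failover` are hypothetical names, and a real system would dispatch over the network instead.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// A node runs a job and returns its result, or throws if the node fails.
using Node = std::function<int(const std::string& job)>;

struct Outcome {
    bool ok;           // true if some node finished the job
    int result;        // meaningful only when ok is true
    std::size_t node;  // index of the node that succeeded
};

// Tries each node in turn and returns the first successful result.
Outcome run_with_failover(const std::string& job, const std::vector<Node>& nodes) {
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        try {
            const int r = nodes[i](job);
            return Outcome{true, r, i};
        } catch (const std::exception&) {
            // Node i failed: fall through and rerun the same job elsewhere.
        }
    }
    return Outcome{false, 0, nodes.size()};
}
```

This is only safe when a job can be rerun from scratch without side effects, which is where the transactions and roll-back mentioned above come in.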
test with valgrind! by graveyhead · 2006-02-04 16:25 · Score: 4, Interesting

valgrind -v ./myapp [args]

It gives you massive amounts of great information about the memory usage of your program.

The other day I spent nearly 3 hours trying to decode what was happening from walking the backtrace in gdb. Couldn't for the life of me figure out what was happening. Valgrind figured out the problem on the first run and after that, I had a solution in a few minutes.

Highly recommended software, and installed by default on several distributions, AFAIK.

Enjoy!

--
std::disclaimer<std::legalese> sig=new std::disclaimer; sig->dump(); delete sig;
Forget it. by Pig+Hogger · 2006-02-04 16:39 · Score: 4, Funny

Forget it, with C and C++.
Those are low-level programming-jock languages disguised as high-level languages. As long as the punks who program them will have pissing contests in code obfuscation, you can count on having buffer overflows and memory leaks.
Unit Testing and Smart Pointers by pjkundert · 2006-02-04 16:42 · Score: 4, Insightful

60,000+ lines of communications protocol and remote industrial control and telemetry code. No memory leaks, and less than 5 defects installed into production.
The reasons? A unit test suite that implements several million test cases (mostly pseudo-random probes -- the actual test code is about 1/3 the size of the functional code). In fact, the "defects" that hit production were more "oversights"; stuff that didn't get accounted for and hence didn't get implemented.
Just as importantly; every dynamically allocated object just got assigned to a "smart pointer" (see Boost's boost::shared_ptr implementation).
Quite frankly, compared to any Java implementation I've seen, I can't say that "Garbage Collection" would give me anything I didn't get from smart pointers -- and I had sub-millisecond determinism, and objects that destructed precisely when the last reference to them was discarded. The only drawback: loops of self-referencing objects, which are very simple to avoid, and dead trivial if you use Boost's Weak Pointer implementation.
We didn't have access to Boost (which I Highly Recommend using, instead of our reference counted pointer) when we first started the project, so we implemented our own Smart Pointers and Unit Testing frameworks.
I've since worked on "Traditional" C++ applications, and it is literally "night and day" different; trying to do raw dynamic memory allocation without reference counting smart pointers is just insane (for anything beyond the most trivial algorithm). And developing without Unit Testing feels like being beaten with a bat, with a sack tied around your head...
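
A small sketch of the shared/weak pointer point, using `std::shared_ptr` and `std::weak_ptr` from C++11 as stand-ins for the Boost classes named above (an assumption: the standard versions were modelled on Boost's). The `Node` type and `live_nodes` counter are made up for the example.

```cpp
#include <cassert>
#include <memory>

// Counts live Node objects so destruction can be observed.
int live_nodes = 0;

struct Node {
    std::shared_ptr<Node> next;  // owning link
    std::weak_ptr<Node> prev;    // non-owning back link: no reference cycle
    Node() { ++live_nodes; }
    ~Node() { --live_nodes; }
};

// Builds two nodes that refer to each other, then lets go of both.
void make_pair_and_drop() {
    std::shared_ptr<Node> a = std::make_shared<Node>();
    std::shared_ptr<Node> b = std::make_shared<Node>();
    a->next = b;   // a keeps b alive
    b->prev = a;   // b can find a, but does not keep it alive
}                  // last strong reference to a goes away here, then b follows
```

If `prev` were a `shared_ptr` as well, the two nodes would keep each other alive forever and `live_nodes` would stay at 2 after the function returns. That is the loop the comment warns about.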

--
-- -pjk Perry Kundert perry@kundert.ca http://kundert.2y.net
Congratulations! Nice Work! by aendeuryu · 2006-02-04 16:53 · Score: 5, Funny

"I need to create an ultra-stable, crash-free application in C++. Sadly, the programming language cannot be changed...

From zero to flame war in under 20 words. Well done!
Agreed. and a few more thoughts. by jd · 2006-02-04 17:43 · Score: 4, Insightful
- Where they exist, use fault-tolerant components for interconnects. Making things fault-tolerant is tough, so re-using such stuff will simplify the task. Best of all, use stuff with a significant history behind it, because communication will be the biggest headache and bugs there will be hard to pinpoint exactly.
- When coding, assume that anything can crash. I don't care if you use exception handling, reactive methods or a purple pizza, but you want components to be able to recover from failure (by restarting if need be) and you want anything that talks to it (and the data!) to be able to survive a loss of connection and handle the condition in a predictable way. (This may mean resending to another node, waiting for the old one to reset, buying said pizza over the Internet, whatever.)
- Keep It Simple! The more layers, the greater the likelihood of bugs. (There are exceptions - if you're using CORBA, then the ACE ORB is heavyweight but generally considered pretty solid. That's partly because it has a decent amount of maintenance and has been around a while. I would probably not go for lesser-known ORBs, though.) The more complexity you can avoid, the more certainty you can have that the code is solid.
- Analyze, Specify, Design, Implement, Validate. There are no "perfect" techniques to Software Engineering, but a few things generally hold up fairly well. The first of these is to keep the steps in the process as clean and methodical as practical. There will be some overlap, but in general you can't implement good code until you know what good code you want implemented.
- Testing Is Important. There are more schools of thought on testing than there are programmers. (At last count, at least three times as many.) Even if nobody is quite sure what role testing has, most seem fairly convinced it has got a role. One popular creed states that design should be from the top down and testing from the bottom up. (ie: test at the level of the components that call nothing else, then build up step by step.) Another states that since you have a specification (you do, don't you? :), you can write the tests according to the specification first, then write the code to comply with the tests. You can even follow both approaches, if it helps you feel better. Just pick something and stick to it. My preferred testing method is to check "typical" conditions, boundary (extreme) conditions and erroneous conditions.
- Never assume that some other coder's assumptions about the compiler's assumptions of what was assumed by someone else entirely bears any resemblance to what you think. Computers know all about luck and hope and how to utterly crush them when you're not looking.
Yes, some of those do conflict. How to keep things simple AND have fault-tolerance, for example. That's where a good design comes in handy, because you can get a better feel for where you should make the trade-off between certainty of working, certainty of working later on and getting some sleep this side of 2008. It's all a matter of weighing the options and investing time in the place most likely to benefit.
(Because everything is a trade-off, anything listed above may not apply. But then, it may not need to. If you've tested a component thoroughly along all boundaries, a good sample of valid conditions and a good sample of erroneous conditions, AND everything has been kept as simple as possible so that really weird cases are unlikely to crop up, then you may decide you can simplify or eliminate fault-tolerant components. There is no point in catching errors that won't occur. In fact, that adds complexity and violates the Keep It Simple rule.)
Oh, and as this is a networked system, testing should include testing network I/O. Use packet generators if necessary, to see how the system handles erroneous packets or massive packet floods. You don't want "perfect" responses (unless you can define what "perfect" means), you want reliable responses. If X occur
--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Listen to what he said!! by logicnazi · 2006-02-04 18:08 · Score: 5, Insightful

Jesus christ people he is asking you how he should go about building an ultra-stable application in C++. He told you he *has* to build it in C++ because there are critical libraries and other components that are only available in C++. Telling him he shouldn't build it in C++ anyway just isn't helpful.

I hate to break it to people but there *are* libraries, especially for types of scientific computing, that are only (reasonably) available in C++ or sometimes FORTRAN. Not only would abandoning these libraries mean he would completely have to reinvent the wheel but it also might cause serious compatibility problems, not to mention a much greater ongoing maintenance responsibility (he can't just check his program to make sure things still work when someone fixes a library bug).

Moreover, the idea that because he is considering using CORBA, IPC or whatever else speed can't matter enough to require C/C++ is dead wrong. It is true that whatever *parts* of the process are done using these components may not require huge amounts of speed but this doesn't mean one of these components isn't doing something very processor heavy.

In particular what he says sounds like the situation in some areas of scientific computing. If one is writing a program to do some sort of simulation or similar math intensive operations speed can be *very* important in the critical parts of the code but (in some cases) transferring information to the GUI or other components need not be particularly speedy (increasing by an order of magnitude may make a small difference in overall runtime). Imagine a program that does some kind of weather, or nuclear detonation simulation. The cross-processor communication and the core simulation kernel need to be very fast but the GUI and data input components need not be particularly fast. Also it is my understanding that the critical libraries in this area are often only available (at least freely) with C/C++ or FORTRAN bindings.

Anyway I think it is important to distinguish several different goals: ultra-stability, minimal downtime, and minimal data/computation loss. For instance, for a climate simulation that may run on a supercomputer for months, it is very important to have minimal data/computation loss (i.e. if something goes bad you don't lose months of very valuable supercomputer time) but you need not have ultra-stability or minimal downtime. As long as when any node crashes the simulation can easily be restarted without loss of data there is no problem. On the other hand if you are running a website like slashdot it is minimal downtime that is important; it doesn't really matter if some of the web server processes are rebooted once in a while. If, on the other hand, you are writing code to monitor a nuclear power plant it is ultra-stability that is important (though I can't at the moment think of something that requires distributed processing and ultra-stability but I'm probably just missing something).

So I think the answer depends on what sort of stability you want. If it is important that no individual *node* crashes (though the GUI/other non-core components can crash) then you should pursue the separation you described above. I have to admit I'm not an expert here but the client-server model (like mysql, X etc.) seems to work well in this context. However, this depends a lot on what sort of data you need to transfer. If you just need to send the core setup commands and get back mostly unstructured info (say a grid of temperatures or other simple datasets) then I would suggest sticking with one of the simpler abstractions and don't get lost in CORBA. On the other hand if you need to send back and forth real objects with significant structure then creating your own serialization system/bindings is just asking for bugs.

On the other hand if what you want is minimal data/computation loss, downtime, or any other property where it is the overall system you care about not a crash at any particular node then I suggest concentrating less on dividing any one node into comp

--
If you liked this thought maybe you would find my blog nice too:
Don't reinvent the wheel by DigitalCrackPipe · 2006-02-04 18:13 · Score: 4, Insightful

Avoid the latest "big thing" for the core of your project. It's usually specialized, non-portable, etc. The standard template library for C++ (for example) is here to stay, with tested algorithms that are safer and faster than you can usually write (because they are optimized for the platform you compile on). For the GUI, on the other hand, you may be better off with a GUI-based language/tool. That's less likely to be portable, but that's the way GUIs work.

Next, spend some time upfront on your design, with things like use cases, sequence diagrams, and other visualization tools to help you understand just what you want to happen in best case situations as well as failures. The level of detail/formality required is a moving target, so update as needed. You should have a solid error detection/correction plan so that you can design each component to follow it. Also design for test and with logging - it will help you while debugging, while testing, and while fixing the bug the customer is seeing.

Make sure management will allow sufficient time for testing. A lot more lip service goes into support for testing than actual schedule and money. Your test plan should be as bulletproof as your design.

That's my 2 cents. And a random book recommendation: books like Scott Meyers' "Effective " provide info on effective/error reducing ways to use the language/libraries, but won't help you get started with the architecture.
robust software by avitzur · 2006-02-04 18:26 · Score: 4, Interesting

Way back in 1993, thanks to a three month schedule delay in shipping the original Apple Power PC hardware, Graphing Calculator 1.0 had the luxury of four months of QA, during which a colleague and I added no features and did an exhaustive code review. It was also the only substantial PowerPC native application, so everyone with prototype hardware played with it a lot, and as a result that product had more thorough QA than anything I have worked on before or since. It also helped that we started with a mature ten year old code base which had been heavily tested while shipping for years. On top of that, a complete lack of any management or marketing pressure on features allowed us to focus solely on stability for months.

As a result, for ten years Apple technical support would tell customers experiencing unexplained system problems to run the Graphing Calculator Demo mode overnight, and if it crashed, they classified that as a *hardware* failure. I like to think of that as the theoretical limit of software robustness.

Sadly, it was a unique and irreproducible combination of circumstance which allowed so much effort to be focused on quality. Releases after 1.0 were not nearly so robust.
Re:inline code by countach · 2006-02-04 19:16 · Score: 4, Insightful

Perl? Fuck. He wants a stable app with good code. Sheesh.
good coding techniques by Pr0xY · 2006-02-04 19:40 · Score: 4, Informative

first and foremost, use good coding techniques. This means use exception handling where appropriate, use standard containers over hand rolled data structures (prefer std::string over char arrays, this will help prevent almost all common string based buffer overflows alone), and follow good style guidelines.
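
To make the container advice concrete, here is a short sketch. The helper names `greet` and `element_or` are invented for illustration; `std::string` and `std::vector::at` are standard.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Builds a greeting with no fixed-size buffer, so a long name cannot
// overflow anything: the string grows as needed.
std::string greet(const std::string& name) {
    return "Hello, " + name + "!";
}

// Returns v[index], or `fallback` when the index is out of range.
// at() throws std::out_of_range instead of reading past the end the way
// an unchecked array access would.
int element_or(const std::vector<int>& v, std::size_t index, int fallback) {
    try {
        return v.at(index);
    } catch (const std::out_of_range&) {
        return fallback;
    }
}
```

With a `char buf[64]` and `strcpy`, the first function would be a classic overflow waiting for a 100-character name. With `std::string` the same input is just a longer string.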

As for GUI programming, if you are strictly tied to c++, i would recommend Qt (www.trolltech.com). They have a fabulous API (takes getting used to, but it makes sense once you do). Nice part about Qt is that it is source portable to just about every major platform (X11, Win32, Mac).

It is possible to write reliable, fault-tolerant code in c++ (realize please that perfect code is impossible in any language), it just has to be well thought out and done right.

proxy
A few guidelines by pornking · 2006-02-04 19:51 · Score: 4, Insightful

The suggestions I have seen here so far seem to boil down to "Don't do it that way". Sometimes that's not possible. If it truly has to be C++, and it truly has to be as fast as possible and as bug free as possible, there are a few guidelines that can help:

1. Unless the GUI will be I/O bound, and that's unlikely, try to write it in a safer language that has better GUI support.

2. Make all your classes small and simple, and create test harnesses that are as complete as possible. Try to make the classes simple enough that they can be individually tested in such a way that all code paths are exercised.

3. Check your arguments. This includes checking for invalid combinations, and arguments that are invalid given the state of the object.

4. Don't use new or pointers directly. If there may be multiple references to an object, then reference count it and create handle classes that hold the references so all instantiation is controlled, and all destruction is implicit. Make these handles STL compatible, and never pass around pointers to them.

5. Try to design the application to fail fast and recover from failure. For example, maintain the state of work being done in discrete transactions that can be aborted if a failure is detected. This can be on disk or in memory depending on your performance needs. This could be combined with the ability to restart the app in a new process and have it pick up where the last one left off.

6. Have the app keep track of its memory usage, and be prepared to recover from memory leaks, possibly by restarting as in item 5.

7. If the compiler you're using supports structured exceptions, then use them. They can degrade performance a bit, but they can also enable you to recover from NULL pointer exceptions.

8. If you have multiple threads, then to avoid both the performance hit from context switches and the chance of deadlocks, don't let them access the same data directly. Instead, have them communicate through lock free queue structures. That way, all your main threads can pretty much spin freely. Spawn worker threads for any I/O or other operations that can block. A context switch can take as much time as thousands of instructions. You want to use as much of every time slice as possible.

9. Keep the number of main threads down to the number of CPUs or less. That way, except for the times when the CPU is being used by the OS or other processes (which should be relatively rare), each non-blocked thread gets its own CPU.

10. Have an experienced QA team that understands its job goes beyond unit testing.
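
To make item 8 a bit more concrete, here is a minimal sketch of a single-producer/single-consumer lock-free queue. It assumes a C++11 compiler (std::atomic and std::thread are newer than this thread), and the names SpscQueue and sum_through_queue are made up for illustration. In a real system you would more likely pull a well-tested queue from a library than roll your own.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Single-producer / single-consumer ring buffer (illustrative, not a library
// type). Exactly one thread may call push() and exactly one other thread may
// call pop(); no locks are taken. One slot is kept empty to tell "full" from
// "empty", so the usable capacity is Capacity - 1.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2, "need at least one usable slot");
public:
    // Returns false when the queue is full instead of blocking.
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        slots_[tail] = value;
        tail_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    // Returns false when the queue is empty instead of blocking.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = slots_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);  // free the slot
        return true;
    }

private:
    T slots_[Capacity];
    std::atomic<std::size_t> head_{0};  // next slot to read, advanced by the consumer
    std::atomic<std::size_t> tail_{0};  // next slot to write, advanced by the producer
};

// Hypothetical demo: a producer thread sends 1..n, the calling thread sums them.
long sum_through_queue(int n) {
    assert(n >= 0);
    SpscQueue<int, 64> q;
    std::thread producer([&q, n] {
        for (int i = 1; i <= n; ++i)
            while (!q.push(i)) std::this_thread::yield();  // queue full: retry
    });
    long total = 0;
    int received = 0;
    int value = 0;
    while (received < n) {
        if (q.pop(value)) { total += value; ++received; }
        else std::this_thread::yield();                    // queue empty: retry
    }
    producer.join();
    return total;
}
```

Neither push() nor pop() ever blocks, so what to do on a full or empty queue is the caller's decision (the demo just yields and retries). The two threads never touch the same slot at the same time, which is the point of item 8.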

Now here are a few that are always important, but for what you want to do, they become critical.

11. Have the design laid out at least roughly before you start.

12. If at all possible, don't let requirements change in midstream.

13. Overestimate the time it will take very generously. You will probably still be crunched.
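
Going back to items 5 and 6, the restart-on-failure idea can be sketched roughly like this. It is POSIX-only (fork/waitpid), so it covers the Linux side but not Windows, and run_supervised and flaky_module are hypothetical names, not part of any library. A real module would also reload its last committed transaction on startup, which is left out here.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Runs work(attempt) in a child process and relaunches it whenever the child
// dies abnormally (killed by a signal, or non-zero exit code).
// Returns the number of launches it took to get a clean exit, or -1 if
// max_attempts is exhausted or fork()/waitpid() fails.
int run_supervised(int (*work)(int attempt), int max_attempts) {
    assert(work != nullptr);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        pid_t pid = fork();
        if (pid < 0) return -1;                // could not create the child
        if (pid == 0) _exit(work(attempt));    // child: run the module, then exit

        int status = 0;
        while (waitpid(pid, &status, 0) < 0) { // parent: wait for the child
            if (errno != EINTR) return -1;     // retry only if interrupted
        }
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) return attempt;
        // Otherwise the child crashed or reported failure: loop and restart it.
    }
    return -1;
}

// Hypothetical flaky module: crashes on its first two launches, then succeeds.
int flaky_module(int attempt) {
    if (attempt < 3) std::abort();             // simulate a crash (SIGABRT)
    return 0;
}
```

A crash in the child cannot corrupt the parent's memory, which is what makes the restart safe. The attempt limit matters: as the question itself notes, the bug may resurface in the new incarnation, and an unbounded restart loop would just hide that.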

--
pornking
Obvious ! by Thomas Miconi · 2006-02-04 23:30 · Score: 5, Funny

What this guy really needs is the time-tested, tried-and-true Waterfall development process !

Thomas-
I don't know why this dominates the first page... by hummassa · 2006-02-05 00:05 · Score: 4, Insightful

of the thread... I'm appalled. I'm replying here, though I would really rather reply to the main post, to maximize the chances of you reading this.

Question 1: what strategies should a developer take to insure that the resulting program is as crash-free as possible?

Answer:

a. Use OO techniques and keep all objects in your system extremely simple; furthermore, keep all methods in your system extremely short, well-contained, and well-defined.

b. Don't use C++ arrays, ever. Especially not for strings. Use and abuse the STL.
copy( istream_iterator<int>( cin ), istream_iterator<int>(), back_inserter( v ) );
is just plain beautiful IMH?O.

c. Extensively check the behaviour of your constructors and destructors.

d. Make an object-lifecycle diagram for each class you write. In the diagram, relate it to the neighboring classes (parents, children, siblings, classes it takes part in design patterns with, classes it aggregates or value-aggregates, and classes that aggregate or value-aggregate it).

e. Use smart pointers, carefully, wherever possible. Remember std::auto_ptr is your best friend -- its limitations are a defining part of its strength. boost::shared_ptr is also a good friend, and its cousin boost::intrusive_ptr is even friendlier -- but use those (and their other cousins scoped_{ptr,array}, shared_array, weak_ptr) only in the (rare) cases where auto_ptr does not apply.

f. As a corollary to (e) above, use boost. This is really an extension of (b), too.
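
As a rough illustration of (b) and (e): the sketch below wraps the copy/istream_iterator one-liner in a function and shows single-owner cleanup through a smart pointer. One assumption to flag loudly: it uses std::unique_ptr, which needs C++11. std::auto_ptr was deprecated in C++11 and removed in C++17, and unique_ptr covers the same single-owner role. read_ints, Resource and make_resource are illustrative names only.

```cpp
#include <algorithm>
#include <cassert>
#include <istream>
#include <iterator>
#include <memory>
#include <sstream>
#include <vector>

// Item (b): read whitespace-separated ints into a vector. No raw array, no
// manual size bookkeeping; the vector grows as needed.
std::vector<int> read_ints(std::istream& in) {
    std::vector<int> v;
    std::copy(std::istream_iterator<int>(in), std::istream_iterator<int>(),
              std::back_inserter(v));
    return v;
}

// Item (e): a hypothetical resource that counts live instances in a counter
// supplied by the caller, so the cleanup can be observed.
struct Resource {
    explicit Resource(int& live) : live_(live) { ++live_; }
    ~Resource() { --live_; }
    int& live_;
};

// The return type says the caller is the sole owner; the object is destroyed
// automatically when the returned pointer goes out of scope.
// (unique_ptr stands in for auto_ptr here; that substitution is an assumption.)
std::unique_ptr<Resource> make_resource(int& live) {
    return std::unique_ptr<Resource>(new Resource(live));
}
```

Passing std::cin to read_ints gives the original one-liner's behaviour; an std::istringstream works the same way, which makes the function easy to unit test.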

Question 2: How can I actually implement such a decoupling?

Answer:

I would use a simple, socket-based, take-my-data, gimme-my-results scheme. It would be network-distributable, and it would be easy to detect whether some service is alive via timeouts... If you want something more sophisticated/RMI-like, SOAP (with binary or compressed XML) may be an option. The simpler the better IMHO.
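
A minimal sketch of the timeout part of such a scheme, assuming POSIX sockets and poll() (so Linux as written, not Windows). read_with_timeout is a hypothetical helper; the wire format and the restart policy are left out.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>

// Waits up to timeout_ms for data on a connected socket.
// Returns the number of bytes read, 0 if the peer closed the connection,
// or -1 on timeout or error.
long read_with_timeout(int fd, char* buf, std::size_t len, int timeout_ms) {
    assert(buf != nullptr);
    pollfd p;
    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;
    const int ready = poll(&p, 1, timeout_ms);
    if (ready <= 0) return -1;  // 0 means timed out, negative means poll failed
    return static_cast<long>(read(fd, buf, len));
}
```

A caller that gets -1 can treat the service as hung, and one that gets 0 can treat it as crashed or shut down; in either case it can reconnect or ask for the service to be restarted.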

Question 3: are there any software _design patterns_ that specifically tackle the stability issue?

Answer:

All of them? IMHO, DPs can be a huge tool for increasing the stability of a system. Take a look here [WARNING: PDF] (and in the bibliography) for some ideas.

I know many of my posts were self-marketing lately, but if you need someone to work with you, I'll be happy to send you my resume... write me at hmassa (at) gmail.

--
It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048
Have you ever flown on a commercial airline? by EMB Numbers · 2006-02-05 04:37 · Score: 4, Informative

Have you ever flown on a commercial airline in the last 30 years? If so, you trusted your life to software.
There is a standard called DO-178B Level A that applies to aircraft software upon which lives depend. There is a saying in the commercial avionics business: "Nobody has ever died from software failure on an airplane, yet." There have been some accidents where software played a role, but I won't quibble with that now.

The point is that safety-critical software is developed routinely. It has been developed in assembly language. It has certainly been developed in Ada, C, and subsets of C++. It is expensive. Validation of avionics software and certification in an aircraft can easily cost an order of magnitude more than just writing the software, and writing the software using required processes and producing required artifacts is not cheap either.