Ultra-Stable Software Design in C++?

← Back to Stories (view on slashdot.org)

Ultra-Stable Software Design in C++?

Posted by Cliff on Saturday February 4, 2006 @03:35PM from the failure-minimization dept.

null_functor asks: "I need to create an ultra-stable, crash-free application in C++. Sadly, the programming language cannot be changed due to reasons of efficiency and availability of core libraries. The application can be naturally divided into several modules, such as GUI, core data structures, a persistent object storage mechanism, a distributed communication module and several core algorithms. Basically, it allows users to crunch a god-awful amount of data over several computing nodes. The application is meant to primarily run on Linux, but should be portable to Windows without much difficulty." While there's more to this, what strategies should a developer take to insure that the resulting program is as crash-free as possible? "I'm thinking of decoupling the modules physically so that, even if one crashes/becomes unstable (say, the distributed communication module encounters a segmentation fault, has a memory leak or a deadlock), the others remain alive, detect the error, and silently re-start the offending 'module'. Sure, there is no guarantee that the bug won't resurface in the module's new incarnation, but (I'm guessing!) it at least reduces the number of absolute system failures.

How can I actually implement such a decoupling? What tools (System V IPC/custom socket-based message-queue system/DCE/CORBA? my knowledge of options is embarrassingly trivial :-( ) would you suggest should be used? Ideally, I'd want the function call abstraction to be available just like in, say, Java RMI.

And while we are at it, are there any software _design patterns_ that specifically tackle the stability issue?"

14 of 690 comments (clear)

Min score:

Reason:

Sort:

I'm gonna take a guess, but.. by Anonymous Coward · 2006-02-04 15:41 · Score: 5, Funny

try not to de-reference any NULL pointers and you should be ok..
1. Re:I'm gonna take a guess, but.. by arivanov · 2006-02-04 19:50 · Score: 5, Insightful
  
  Well... Someone modded this as funny. Wrong... It is the first comment so far I have seen on this article that comes anywhere near being insightfull.
  The secret of stable system design is designing from failure. Designing and implementing defensively. If you want to design an ultrastable system you start with the failure analysis for every component, following with failure analysis of modules and the entire thing as it grows.
  This in the world of C++ (and C for that matter) quite often means checking paranoiacally everything everywhere for NULLs before doing anything about it.
  Designing and writing from failure means that every system or library call should be assumed to fail first and all failures handled cleanly. This may be quite painfull because it usually requires the development of special tools like wrappers around malloc, file calls, etc that return error conditions which are nearly impossible to achieve on a live system.
  Only after all codepaths for "bad" results have been handled, the actual "normal" codepaths should be written. This unfortunately is not the way code is written in 99% of the shops out there. Most design and implement from success first and add failure handling later.
  Just ask in your shop: "Where is our memalloc wrapper that simulates a failed memory allocation? I need to link versus it to do some testing to see how our app handles NULLs in a few places". The usual answer you will get is "Ugh? WTF you are talking about Dude... We do not smoke that stuff here... Just go and write the code you have been assigned to write..."
  And the results are quite bloody obvious.
  
  --
  Baker's Law: Misery no longer loves company. Nowadays it insists on it
  http://www.sigsegv.cx/
Here's your best bet. by neo · 2006-02-04 15:42 · Score: 5, Interesting

1. Write the whole thing in Python.
2. Once it's bullet-proof, replace each function and object with C++ code.
3. Profit.
1. Re:Here's your best bet. by YGingras · 2006-02-04 16:24 · Score: 5, Informative
  
  This is really good advice but it needs more details:
  
  1) Wrap your legacy libs with SWIG
  2) Code a working prototype in Python
  3) Profile it (never skip this step)
  4) Use SWIG to write the bottle neck parts in C++
  5) Use Valgrind to ensure you are still OK memory wise
  6) Profit!!
They Write the Right Stuff by Pentclass · 2006-02-04 15:43 · Score: 5, Interesting

Follow NASA's advice... http://www.fastcompany.com/online/06/writestuff.ht ml
Re:Get another programmer by Philip+K+Dickhead · 2006-02-04 15:44 · Score: 5, Funny

Make sure his name is something like "Bjarne" or "Knuth".

--
"Speaking the Truth in times of universal deceit is a revolutionary act." -- George Orwell
Don't code to impress. by jellomizer · 2006-02-04 15:47 · Score: 5, Informative

When coding something that needs to be stable, you need to keep your ego aside and concentrate on the task at hand. Stick with tried and true methods don't go with any algorithm that you are not 100% comfortable with even if it makes the code less ugly. Be sure to follow good practices make many function/methods, and make each one as simple as possible, makes it easier to check each function for bugs when they are simple. Secondly document it like you never want to touch the code again (in code and out of code), you want to know what is going on at all time and the bigger it gets the larger chance you could get lost in your own code. When working in a team and you are in someone else's code document that you did the change.

Next take into account what causes most Crashes.
Bad/Overflow memory allocation.
Memory leaks.
Endless loops.
Bad calls to the hardware.
Bad calls to the OS.
Deadlock

If you are going to decouple modules keep in mind that you will need to do as much processing as possible with minimum message passing and allow for mirrors so if one system is down and other can take its place, without killing the network.

For IPC I tend to like TCP/IP Client server. But that is because it tends to offer a common platform independence and allows for expansion across the network. Or try other Server Methods such as a good SQL server Where you can put all the shared data in one spot and get it back. But not knowing the actual requirements it may just be a stupid idea.

I would suggest that you also ask in other places other then Slashdot. While there are many experts on this topic there are also equal if not greater amount of kids on there who think they know what they are talking about, or they have there ego in this technology/or method.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Fault Tolerance Vs. Stability by aeroz3 · 2006-02-04 15:59 · Score: 5, Insightful

I think perhaps what you REALLY mean here by stability is Fault Tolerance. It's impossible to write code that has zero defects, outside of any trivial examples. Real Code Has Real Defects. Now, as you talk about modular design and being able to restart modules, you're talking about, not stability, but fault tolerance; the ability of the application to recognize and recover from faults. For instance, you can't necessarily guarantee that the module on machine A running task B won't die, hell the computer could accidently fry, but if your application was Fault Tolerant then the application would kick off another process somewhere else on computer C to rerun job B. Stable systems aren't built necessarily by trying to write defect-free code, but by recognizing that defects will occur and architecting the system in such a way that it can recover from them. Here you need to be concerned about things like transactions, data roll-back, consistency, techniques (active vs. passive, warm vs. cold). The key thing is before you even write a LINE of this C++ code, make sure that you have a complete, comprehensive ARCHITECTURE for your application that will gracefully handle faults.
Re:You're not the first one.... by gadzook33 · 2006-02-04 16:10 · Score: 5, Insightful

Ah, another true believer. I work heavily in both managed and unmanaged code (c/c++/c#) hybrid solutions. In my experience, a well designed C++ program is as stable as a well designed C# program. Who cares if it "crashes" if it doesn't do what you want? The worst program is one that seems to be working but is generating invalid results. Don't let anyone convince you that C# is going to provide more reliable execution. We use C# for its nice GUIs; C++ for cross-platform portability.
Re:You're not the first one.... by mr_tenor · 2006-02-04 16:45 · Score: 5, Insightful

WTF? I love Haskell as much as the next programming-language-theory fanboy, but saying "Haskell or one of the other functional languages might be a good idea." in reply to the OP strongly suggests to me that you are just making stuff up and/o are copy/pasting things that you have read elsewhere out of context

If not, then great! Please post some references to literature which demonstrates how what you've suggested is sane and/or possible :)
Congratulations! Nice Work! by aendeuryu · 2006-02-04 16:53 · Score: 5, Funny

"I need to create an ultra-stable, crash-free application in C++. Sadly, the programming language cannot be changed...

From zero to flame war in under 20 words. Well done!
Listen to what he said!! by logicnazi · 2006-02-04 18:08 · Score: 5, Insightful

Jesus christ people he is asking you how he should go about building an ultra-stable application in C++. He told you he *has* to build it in C++ because there are critical libraries and other components that aren't availible in C++. Telling him he shouldn't build it in C++ anyway just isn't helpfull.

I hate to break it to people but there *are* libraries, especially for types of scientific computing, that are only (reasonably) availible in C++ or sometimes FORTRAN. Not only would abandoning these libraries mean he would completely have to reinvent the wheel but also might cause serious compatibility problems not to mention a much greater ongoing maintenence responsibility (he can't just check his program to make sure things still work when someone fixes a library bug).

Moreover, the idea that because he is considering using CORBA, IPC or whatever else speed can't matter enough to require C/C++ is dead wrong. It is true that whatever *parts* of the process are done using these components may not require huge amounts of speed but this doesn't mean one of these components isn't doing something very processor heavy.

In particular what he says sounds like the situation in some areas of scientific computing. If one is writing a program to do some sort of simulation or similar math intensive operations speed can be *very* important in the critical parts of the code but (in some cases) transfering information to the GUI or other components need not be particularly speedy (increasing by an order of magnitude may make a small difference in overall runtime). Imagine a program that does some kind of weather, or nuclear detonation simulation. The cross-processor communication and the core simulation kernel need to be very fast but the GUI and data input components need not be particularly fast. Also it is my understanding that often the critical libraries in this area are often only availible (at least freely) with C/C++ or fortran bindings.

Anyway I think it is important to distingush several different goals, ultra-stability, minimal downtime, and minimal data/computation loss. For instance a climate simulation that may run on a supercomputer for months it is very important to have minimal data/computation loss (i.e. if something goes bad you don't lose months of very valuable supercomputer time) but you need not have ulta-stability or minimal downtime. As long as when any node crashes the simulation can easily be restarted without loss of data there is no problem. On the other hand if you are running a website like slashdot it is minimal downtime that is important it doesn't really matter if some of the web server processes are rebooted once in awhile. If, on the other hand, you are writing code to monitor a nuclear power plant it is ultra-stability that is important (though I can't at the moment think of something that requires distributed processing and ultra-stability but I'm probably just missing something).

So I think the answer depends on what sort of stability you want. If it is important that no individual *node* crashes (though the GUI/other non-core components can crash) then you should pursue the seperation you described above. I have to admit I'm not an expert here but the client-server model (like mysql, X etc.) seems to work well in this context. However, this depends alot on what sort of data you need to transfer. If you just need to send the core setup commands and get back mostly unstructured info (say a grid of tempratures or other simple datasets) then I would suggest sticking with one of the simpler abstractions and don't get lost in CORBA. On the other hand if you need to send back and forth real objects with significant structure then creating your own serialization system/bindings is just asking for bugs.

On the other hand if what you want is minimal data/computation loss, downtime, or any other property where it is the overall system you care about not a crash at any particular node then I suggest concentrating less on dividing any one node into comp

--
If you liked this thought maybe you would find my blog nice too:
Re:You're not the first one.... by SchroedingersCat · 2006-02-04 18:10 · Score: 5, Insightful

Here is the clue: if the code *relies* on being *managed* then the design is not stable. Well-designed system will not need a garbage collector, and poorly-designed system will not be saved by the garbage collector.
Obvious ! by Thomas+Miconi · 2006-02-04 23:30 · Score: 5, Funny

What this guy really needs is the time-tested, tried-and-true Waterfall development process !

Thomas-