Ultra-Stable Software Design in C++?
null_functor asks: "I need to create an ultra-stable, crash-free application in C++. Sadly, the programming language cannot be changed due to reasons of efficiency and availability of core libraries. The application can be naturally divided into several modules, such as GUI, core data structures, a persistent object storage mechanism, a distributed communication module and several core algorithms. Basically, it allows users to crunch a god-awful amount of data over several computing nodes. The application is meant to primarily run on Linux, but should be portable to Windows without much difficulty." While there's more to this, what strategies should a developer take to ensure that the resulting program is as crash-free as possible?
"I'm thinking of decoupling the modules physically so that, even if one crashes/becomes unstable (say, the distributed communication module encounters a segmentation fault, has a memory leak or a deadlock), the others remain alive, detect the error, and silently re-start the offending 'module'. Sure, there is no guarantee that the bug won't resurface in the module's new incarnation, but (I'm guessing!) it at least reduces the number of absolute system failures.
How can I actually implement such a decoupling? What tools (System V IPC/custom socket-based message-queue system/DCE/CORBA? my knowledge of options is embarrassingly trivial :-( ) would you suggest should be used? Ideally, I'd want the function call abstraction to be available just like in, say, Java RMI.
And while we are at it, are there any software _design patterns_ that specifically tackle the stability issue?"
> Sadly, the programming language cannot be changed due to reasons of efficiency and availability of core libraries.
You can easily embed C/C++ in other languages. Take a look at Inline::CPP, for example. With code like:
use Inline CPP;
print "9 + 16 = ", add(9, 16), "\n";
print "9 - 16 = ", subtract(9, 16), "\n";
__END__
__CPP__
int add(int x, int y) {
    return x + y;
}
int subtract(int x, int y) {
    return x - y;
}
you can put the parts that need to be fast in C++, and the parts that need to be easy in Perl. (If you do the GUI in Perl, you won't have to worry about portability or memory allocation. And the app will be fast, because the computation logic is written in C++.)
> The application can be naturally divided into several modules, such as GUI, core data structures, a persistent object storage mechanism, a distributed communication module and several core algorithms.
Yup. There's no need for the GUI to know how to do computations, remember. The more separate components you have, the more reliable your application (can) be. Make sure you have good specs for communication between components. Ideally, someone will be able to write one component without having the other one to "test" with. For testing, write unit tests that emulate the specs... and make sure your tests are correct!
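To make the "write one component without having the other one" idea concrete, here is a minimal sketch. It assumes a C++11 compiler, and every name in it (ObjectStore, FakeObjectStore, describe) is hypothetical, invented for illustration rather than taken from the original question. The spec is an abstract interface; the fake is an in-memory stand-in that other components can be tested against.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical spec for the persistent object store. Other components
// depend only on this interface, never on a concrete implementation.
class ObjectStore {
public:
    virtual ~ObjectStore() = default;
    virtual void put(const std::string& key, const std::string& value) = 0;
    // Spec: throws std::out_of_range if the key is unknown.
    virtual std::string get(const std::string& key) const = 0;
};

// In-memory stand-in that follows the same spec, so other components can
// be written and tested before the real store exists.
class FakeObjectStore : public ObjectStore {
public:
    void put(const std::string& key, const std::string& value) override {
        data_[key] = value;
    }
    std::string get(const std::string& key) const override {
        return data_.at(key);  // std::map::at throws std::out_of_range
    }
private:
    std::map<std::string, std::string> data_;
};

// Example of a component written purely against the spec.
inline std::string describe(const ObjectStore& store, const std::string& key) {
    try {
        return key + "=" + store.get(key);
    } catch (const std::out_of_range&) {
        return key + " is missing";
    }
}
```

A unit test for describe() only needs a FakeObjectStore; the real storage module can be swapped in later as long as it honors the same interface.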
My other car is first.
First, consider how complex you want to make the system. The decoupling is a good idea, I think. However, I don't think that having modules automatically restart one another is a good idea; it introduces a whole slew of other problems. At most I'd say use a watchdog process (principle of single responsibility).
Furthermore, you're crunching large amounts of data, so I'm guessing batch processing. If you can have the application not be a server, then you simplify things a lot. Make it a utility that takes data on standard input and runs whatever analysis you need, and duct tape it together with cron or a simple program that watches for new input files.
Also, I'd like to suggest that you consider whether other languages could be efficient for the task. For example, Java is pretty good numerically, and as far as your libraries go, see if you can use SWIG to generate JNI wrappers. Also, then you get Java RMI.
Next, get them down to one platform. It's *way* easier to develop software with tight constraints on a single platform (versus multiple platforms). Investigate QNX: a reliable operating system (though admittedly quirky) with a beautiful IPC API. In any case, make sure you get a well-tested library with message queues, etc. You don't want to be using raw sockets; you could but that's just another pain in the ass on top of everything else.
Last, figure out what the cost of a failure is. Getting that last few percent of reliability is very very expensive. Unless you're a pacemaker or respirator, the cost of failure is probably not as high as the cost of five nines of uptime.
When coding something that needs to be stable, you need to set your ego aside and concentrate on the task at hand. Stick with tried and true methods; don't go with any algorithm that you are not 100% comfortable with, even if it makes the code less ugly. Follow good practices: write many functions/methods and make each one as simple as possible, because simple functions are easier to check for bugs. Second, document it like you never want to touch the code again (in code and out of code). You want to know what is going on at all times, and the bigger the code gets, the greater the chance that you get lost in it. When you work in a team and change someone else's code, document that you made the change.
Next, take into account what causes most crashes:
Bad or overflowing memory allocation.
Memory leaks.
Endless loops.
Bad calls to the hardware.
Bad calls to the OS.
Deadlock.
If you are going to decouple modules, keep in mind that you will need to do as much processing as possible with minimum message passing, and allow for mirrors so that if one system is down another can take its place without killing the network.
For IPC I tend to like TCP/IP client/server, because it offers platform independence and allows for expansion across the network. Or try other server methods, such as a good SQL server where you can put all the shared data in one spot and get it back. But without knowing the actual requirements, that may just be a stupid idea.
I would suggest that you also ask in places other than Slashdot. While there are many experts on this topic here, there is an equal if not greater number of kids who think they know what they are talking about, or who have their ego invested in a particular technology or method.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Test, test, test. Use test driven development if you can. Have a good test harness and use it all the time. Do the stereotypical "random input" tests. Test each and every component to destruction white-box style and then double test all your interfaces. Design by contract can help here. Use a tinderbox that runs continual builds. Maintain strict version controls. Maintain code discipline (getting rid of C++ helps here too). Realize that you probably do not (currently) have the skills to produce the kind of product you are talking about and be willing to commit the time and effort to tear up mistakes, to start over and to teach yourself. What you are attempting to do is not easy.
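As a rough illustration of the "random input" and design-by-contract points, here is a small sketch, assuming a C++11 compiler. The function under test (saturating_add) and the checker (random_input_failures) are made-up examples, not anything from the original post. The idea is to state postconditions that must hold for every input, then throw a lot of random inputs at the function and count violations.

```cpp
#include <cstdint>
#include <limits>
#include <random>

// Hypothetical function under test: adds two 32-bit values and clamps the
// result to the int32 range instead of overflowing.
inline std::int32_t saturating_add(std::int32_t a, std::int32_t b) {
    const std::int64_t wide = static_cast<std::int64_t>(a) + b;
    if (wide > std::numeric_limits<std::int32_t>::max())
        return std::numeric_limits<std::int32_t>::max();
    if (wide < std::numeric_limits<std::int32_t>::min())
        return std::numeric_limits<std::int32_t>::min();
    return static_cast<std::int32_t>(wide);
}

// Feeds `rounds` random input pairs to saturating_add and counts how many
// results violate the contract. A fixed seed keeps any failure reproducible.
inline int random_input_failures(int rounds, unsigned seed) {
    const std::int32_t lo = std::numeric_limits<std::int32_t>::min();
    const std::int32_t hi = std::numeric_limits<std::int32_t>::max();
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::int32_t> dist(lo, hi);
    int failures = 0;
    for (int i = 0; i < rounds; ++i) {
        const std::int32_t a = dist(rng);
        const std::int32_t b = dist(rng);
        const std::int32_t r = saturating_add(a, b);
        // Postcondition 1: adding a non-negative value never decreases the
        // result, adding a negative value never increases it.
        bool ok = (b >= 0) ? (r >= a) : (r <= a);
        // Postcondition 2: unless the result sits at a limit, subtracting b
        // again must give back a (checked in 64 bits to avoid overflow).
        if (r != lo && r != hi)
            ok = ok && (static_cast<std::int64_t>(r) - b == a);
        if (!ok) ++failures;
    }
    return failures;
}
```

A test harness would call random_input_failures() with a large round count on every build and fail the build on any nonzero result.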
This is really good advice but it needs more details:
1) Wrap your legacy libs with SWIG
2) Code a working prototype in Python
3) Profile it (never skip this step)
4) Use SWIG to write the bottleneck parts in C++
5) Use Valgrind to ensure you are still OK memory wise
6) Profit!!
There are no silver bullets, no, but there are tools that can help. Splint is a good example of something you can employ for static checking of buffer overflows and various dynamic memory errors such as misuse of null pointers, dead storage, memory leaks, and dangerous aliasing. Note that Splint checks C only, so it applies to the C parts of your code base rather than to C++ proper. It doesn't make your code bulletproof, but it can catch a lot of errors that you probably wouldn't otherwise spot. There's a nice paper about what it can do (warning PDF).
Jedidiah.
I've dealt with software that automatically restarts a dead process, and in my experience, it doesn't work so well. If you want ultra-stable software, you want to know what caused the crash and why.
For your situation, where I guess you're doing lots of time consuming computing, I'd think you should also set checkpoints, save intermediate results, or something, so if it does crash, you can restart in the middle instead of going back to 0. (A standard practice when I was analyzing large databases for corruption, a task that could take days)
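A checkpoint scheme along these lines might look like the sketch below. It assumes a POSIX filesystem, where std::rename replaces an existing file atomically (on Windows, rename fails if the target exists, so a different replace call would be needed). The Progress struct, the file format, and the function names are all hypothetical, and the "work" is a stand-in sum.

```cpp
#include <cstdio>
#include <fstream>
#include <iomanip>
#include <limits>
#include <string>

// Hypothetical job state: the next item to process and the running total.
struct Progress {
    long next_item = 0;
    double partial_sum = 0.0;
};

// Writes the checkpoint to a temporary file and then renames it over the
// old one, so a crash mid-write leaves the previous checkpoint intact.
inline bool save_checkpoint(const std::string& path, const Progress& p) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::trunc);
        out << p.next_item << ' '
            << std::setprecision(std::numeric_limits<double>::max_digits10)
            << p.partial_sum << '\n';
        out.flush();
        if (!out) return false;
    }  // file is closed here, before the rename
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Returns the saved state, or a fresh Progress if no usable checkpoint exists.
inline Progress load_checkpoint(const std::string& path) {
    Progress loaded;
    std::ifstream in(path);
    if (in >> loaded.next_item >> loaded.partial_sum) return loaded;
    return Progress();
}

// Sums 0..n-1 as stand-in work, checkpointing every `interval` items and
// resuming from the last checkpoint if one is found.
inline double run_job(const std::string& path, long n, long interval) {
    Progress p = load_checkpoint(path);
    while (p.next_item < n) {
        p.partial_sum += static_cast<double>(p.next_item);
        ++p.next_item;
        if (p.next_item % interval == 0) save_checkpoint(path, p);
    }
    save_checkpoint(path, p);
    return p.partial_sum;
}
```

This sketch does not force the data to disk (no fsync), so it protects against a crash of the process but not necessarily against a power failure.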
Do you even lift?
These aren't the 'roids you're looking for.
The use of a managed language will not necessarily result in more stable code. Recovering from SIGSEGV by installing a POSIX handler, or detecting the death of a forked child process, can be done in C++ with the same ease as catching a NullPointerException in Java.
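For the forked-child variant, a minimal POSIX-only sketch could look like this (it will not build on Windows as written). ChildResult and run_isolated are hypothetical names, and the exit code 125 for an escaped exception is an arbitrary choice. The parent runs the risky work in a child process and inspects how the child ended.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>
#include <csignal>

// How the child ended. term_signal is only meaningful when the child was
// killed by a signal (a segfault would show up here as SIGSEGV).
struct ChildResult {
    bool exited_normally;
    int exit_code;
    int term_signal;
};

// Runs `work` (a callable returning an int exit code) in a forked child and
// reports the outcome to the parent. The parent survives a crashing child.
template <typename Work>
ChildResult run_isolated(Work work) {
    ChildResult result = {false, -1, 0};
    const pid_t pid = fork();
    if (pid < 0) return result;  // could not fork
    if (pid == 0) {
        int code = 125;  // reported if work() throws
        try {
            code = work();
        } catch (...) {
        }
        _exit(code);  // leave without running the parent's atexit handlers
    }
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return result;  // real error, not an interruption
    }
    if (WIFEXITED(status)) {
        result.exited_normally = true;
        result.exit_code = WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        result.term_signal = WTERMSIG(status);
    }
    return result;
}
```

The parent can then log term_signal and decide whether to restart the module or give up, instead of silently restarting it.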
I would agree that having to write memory management code is error prone, but it is possible to be careful (i.e., use auto pointers, STL vectors instead of arrays, etc.). You do need to be very good with C++, however.
My suggestion to the author is to look at the application servers that support C++ components. I worked with a small, relatively unknown server, but even that product had a feature that kept the server up even when some C++ component (running as a forked process) crashed.
First and foremost, use good coding techniques. This means: use exception handling where appropriate, use standard containers over hand-rolled data structures (prefer std::string over char arrays; this alone will help prevent almost all common string-based buffer overflows), and follow good style guidelines.
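A small sketch of what that looks like in practice, assuming a C++11 compiler; make_label and parse_port are hypothetical helpers written for this example. Strings grow on demand instead of overflowing a fixed buffer, and bad input is reported through exceptions rather than ignored.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Builds a label without any fixed-size buffer: std::string grows as
// needed, so an unexpectedly long `name` cannot overflow anything.
inline std::string make_label(const std::string& prefix, const std::string& name) {
    return prefix + ": " + name;
}

// Parses a TCP port number. Throws std::invalid_argument for text that is
// not a clean number and std::out_of_range for values outside 1..65535.
inline int parse_port(const std::string& text) {
    std::size_t used = 0;
    int value = 0;
    try {
        value = std::stoi(text, &used);
    } catch (const std::exception&) {
        throw std::invalid_argument("not a number: " + text);
    }
    if (used != text.size())
        throw std::invalid_argument("trailing characters: " + text);
    if (value < 1 || value > 65535)
        throw std::out_of_range("port out of range: " + text);
    return value;
}
```

The caller decides where to catch: close to the input for a friendly error message, or at a module boundary to log and shut down cleanly.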
As for GUI programming, if you are strictly tied to C++, I would recommend Qt (www.trolltech.com). They have a fabulous API (it takes getting used to, but it makes sense once you do). The nice part about Qt is that it is source-portable to just about every major platform (X11, Win32, Mac).
It is possible to write reliable, fault-tolerant code in C++ (please realize that perfect code is impossible in any language); it just has to be well thought out and done right.
proxy
There is no silver bullet for what you describe other than sound development practices.
True, but it should be pointed out that C++ is well-equipped to make such sound development practices easy. Consider the major sources of instability in C programs:
In my experience, doing the above religiously will ensure you never see segmentation faults. The next step, of course, is to make sure your code correctly implements the desired functionality. C++ is no different from Java or any other OO language in this respect. Clear requirements definition, modularity, clean separation of concerns and testing, both automated and manual, are the basic keys to generating correct and maintainable code in any language.
[*] A story: I once asked a guy on my team to write a little program to monitor a bank of modems, accepting incoming calls and exchanging data with the callers. He spent two weeks and produced nearly 10,000 lines of code
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
You can avoid some of the pitfalls of C++'s need for manual memory management and other problems by simply avoiding them. For instance, never do memory management yourself. How? By using STL containers to do it all for you. Next, avoid fixed arrays. Again, let STL do it for you. And, above all else, never do anything where you don't restrict the length. Since you're using STL for arrays, you're good to go there, and you won't end up running off the end of a character array (because you don't use them!). So what you're left with is doing I/O properly. Always limit the amount of data you read to the buffer size you have allocated.
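The "limit what you read to the buffer you allocated" rule can be sketched like this, assuming a C++11 compiler; read_bounded is a hypothetical helper name. The buffer is a std::vector, and the read length is taken from the vector's own size so the two cannot drift apart.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>
#include <vector>

// Reads at most `max_bytes` from `in`. The buffer is sized first and the
// read is capped at that size, so the stream can never write past the end.
inline std::vector<char> read_bounded(std::istream& in, std::size_t max_bytes) {
    std::vector<char> buffer(max_bytes);
    in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    // Keep only the bytes that actually arrived (may be fewer than asked).
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}
```

With a std::istringstream holding "hello world", read_bounded(in, 5) returns the five bytes of "hello" and leaves the rest in the stream for a later call.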
I'm sure that there's tons I've left out, but this has worked reasonably well for me. The only problem is that STL can be slow. Sure, map may be O(log(n)), but the constants are huge. Unfortunately, for practical reasons, performance and security are often inversely proportional.
Have you flown on a commercial airline in the last 30 years? If so, you trusted your life to software.
There is a standard called DO-178B Level A that applies to aircraft software upon which lives depend. There is a saying in the commercial avionics business: "Nobody has ever died from software failure on an airplane, yet." There have been some accidents where software played a role, but I won't quibble with that now.
The point is that safety-critical software is developed routinely. It has been developed in assembly language. It has certainly been developed in Ada, C, and subsets of C++. It is expensive. Validation of avionics software and certification in an aircraft can easily cost an order of magnitude more than just writing the software, and writing the software using required processes and producing required artifacts is not cheap either.
So what he needs to do is develop a design that is robust in the face of errors. In other words, it needs to be fault tolerant. There are well-known design practices for doing this (checkpoints, watchdogs, rollbacks, etc.) as well as design patterns for robust distributed computation (see, for one example, Joe Armstrong's thesis on making reliable systems in the presence of software errors).
No, the situation the OP is in is not ideal. But it's also not impossible to work with, and there are techniques that can help him to get closer to achieving his goals within the constraints placed upon him.
1) Learn to use STL.
Do *all* memory management via STL vector/string.
2) Don't ever type "new[]/delete[]".
Just don't do it. Not. Ever. Use std::vector instead.
"Arrays are evil" - the C++ FAQ.
PS: You can still use malloc()/free() but only as a last resort in low-level classes which are designed for data storage.
3) Get a reference-counted pointer and use it.
Automatic memory management...'nuff said.
4) Attach an alarm bell to your "~" key.
If you're writing destructors for classes which don't control system resources (eg. files) then you're probably doing something wrong - see notes 1, 2 and 3.
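Points 1 through 4 can be sketched together as follows. This assumes a C++11 compiler and uses std::shared_ptr as the reference-counted pointer, which is my substitution (the post does not name one). Samples and owners_after_sharing are hypothetical names. Note that the class needs no destructor and never types new[] or delete[].

```cpp
#include <cstddef>
#include <memory>
#include <numeric>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// No new[]/delete[] and no hand-written destructor: the string and vector
// members release their own memory when a Samples object goes away.
class Samples {
public:
    Samples(std::string name, std::size_t count)
        : name_(std::move(name)), values_(count, 0.0) {}

    // at() is bounds-checked and throws std::out_of_range on a bad index.
    void set(std::size_t i, double v) { values_.at(i) = v; }

    double total() const {
        return std::accumulate(values_.begin(), values_.end(), 0.0);
    }
    const std::string& name() const { return name_; }

private:
    std::string name_;
    std::vector<double> values_;
};

// Two owners share one Samples object; it is destroyed automatically when
// the last shared_ptr referring to it goes out of scope.
inline long owners_after_sharing() {
    std::shared_ptr<Samples> a = std::make_shared<Samples>("run-1", 3);
    std::shared_ptr<Samples> b = a;  // reference count goes from 1 to 2
    return a.use_count();
}
```

If a class like this ever does need a destructor, that is the alarm bell from point 4: it probably owns a real system resource such as a file or socket.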
No sig today...