Ultra-Stable Software Design in C++?

← Back to Stories (view on slashdot.org)

Ultra-Stable Software Design in C++?

Posted by Cliff on Saturday February 4, 2006 @03:35PM from the failure-minimization dept.

null_functor asks: "I need to create an ultra-stable, crash-free application in C++. Sadly, the programming language cannot be changed due to reasons of efficiency and availability of core libraries. The application can be naturally divided into several modules, such as GUI, core data structures, a persistent object storage mechanism, a distributed communication module and several core algorithms. Basically, it allows users to crunch a god-awful amount of data over several computing nodes. The application is meant to primarily run on Linux, but should be portable to Windows without much difficulty." While there's more to this, what strategies should a developer take to insure that the resulting program is as crash-free as possible? "I'm thinking of decoupling the modules physically so that, even if one crashes/becomes unstable (say, the distributed communication module encounters a segmentation fault, has a memory leak or a deadlock), the others remain alive, detect the error, and silently re-start the offending 'module'. Sure, there is no guarantee that the bug won't resurface in the module's new incarnation, but (I'm guessing!) it at least reduces the number of absolute system failures.

How can I actually implement such a decoupling? What tools (System V IPC/custom socket-based message-queue system/DCE/CORBA? my knowledge of options is embarrassingly trivial :-( ) would you suggest should be used? Ideally, I'd want the function call abstraction to be available just like in, say, Java RMI.

And while we are at it, are there any software _design patterns_ that specifically tackle the stability issue?"

14 of 690 comments (clear)

Min score:

Reason:

Sort:

inline code by jrockway · 2006-02-04 15:41 · Score: 3, Informative

> Sadly, the programming language cannot be changed due to reasons of efficiency and availability of core libraries.

You can easily embed C/C++ in other languages. Take a look at Inline::CPP, for example. With code like:

use Inline CPP; print "9 + 16 = ", add(9, 16), "\n"; print "9 - 16 = ", subtract(9, 16), "\n"; __END__ __CPP__ int add(int x, int y) { return x + y; } int subtract(int x, int y) { return x - y; }

you can put the parts that need to be fast in C++, and the parts that need to be easy in Perl. (If you do the GUI in perl, you won't have to worry about portability or memory allocation. And the app will be fast, because the computation logic is written in C++.)

> The application can be naturally divided into several modules, such as GUI, core data structures, a persistent object storage mechanism, a distributed communication module and several core algorithms.

Yup. There's no need for the GUI to know how to do computations, remember. The more separate components you have, the more reliable your application (can) be. Make sure you have good specs for communication between components. Ideally, someone will be able to write one component without having the other one to "test" with. For testing, write unit tests that emulate the specs... and make sure your tests are correct!

--
My other car is first.
Don't get too fancy... by Pyromage · 2006-02-04 15:47 · Score: 4, Informative

First, consider how complex you want to make the system. The decoupling is a good idea, I think. However, I don't think that having modules automatically restart one another is a good idea; it introduces a whole slew of other problems. At most I'd say use a watchdog process (principle of single responsibility).

Furthermore, you're crunching large amounts of data, so I'm guessing batch processing. If you can have the application not be a server, then you simplify things a lot. Make it a utility that takes data on standard input and runs whatever analysis you need, and duct tape it together with cron or a simple program that watches for new input files.

Also, I'd like to suggest that you consider whether other languages could be efficient for the task. For example, Java is pretty good numerically, and as far as your libraries go, see if you can use SWIG to generate JNI wrappers. Also, then you get Java RMI.

Next, get them down to one platform. It's *way* easier to develop software with tight constraints on a single platform (versus multiple platforms). Investigate QNX: a reliable operating system (though admittedly quirky) with a beautiful IPC API. In any case, make sure you get a well-tested library with message queues, etc. You don't want to be using raw sockets; you could but that's just another pain in the ass on top of everything else.

Last, figure out what the cost of a failure is. Getting that last few percent of reliability is very very expensive. Unless you're a pacemaker or respirator, the cost of failure is probably not as high as the cost of five nines of uptime.
Don't code to impress. by jellomizer · 2006-02-04 15:47 · Score: 5, Informative

When coding something that needs to be stable, you need to keep your ego aside and concentrate on the task at hand. Stick with tried and true methods don't go with any algorithm that you are not 100% comfortable with even if it makes the code less ugly. Be sure to follow good practices make many function/methods, and make each one as simple as possible, makes it easier to check each function for bugs when they are simple. Secondly document it like you never want to touch the code again (in code and out of code), you want to know what is going on at all time and the bigger it gets the larger chance you could get lost in your own code. When working in a team and you are in someone else's code document that you did the change.

Next take into account what causes most Crashes.
Bad/Overflow memory allocation.
Memory leaks.
Endless loops.
Bad calls to the hardware.
Bad calls to the OS.
Deadlock

If you are going to decouple modules keep in mind that you will need to do as much processing as possible with minimum message passing and allow for mirrors so if one system is down and other can take its place, without killing the network.

For IPC I tend to like TCP/IP Client server. But that is because it tends to offer a common platform independence and allows for expansion across the network. Or try other Server Methods such as a good SQL server Where you can put all the shared data in one spot and get it back. But not knowing the actual requirements it may just be a stupid idea.

I would suggest that you also ask in other places other then Slashdot. While there are many experts on this topic there are also equal if not greater amount of kids on there who think they know what they are talking about, or they have there ego in this technology/or method.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Re:You're not the first one.... by arkanes · 2006-02-04 15:50 · Score: 3, Informative

I have to agree. You say it can't be changed due to efficency and core library issues, and then list a bunch of components for which there are no core libraries in C++ and which are rarely if ever CPU bound. Further, if you're asking this question then you aren't a highly experienced guru in this field and thats what you need to be to write these sort of applications in C++. Managed Code (tm), as in .NET, is not your only option but you should look at something higher level and more robust than C++. Haskell or one of the other functional languages might be a good idea.
Test, test, test. Use test driven development if you can. Have a good test harness and use it all the time. Do the stereotypical "random input" tests. Test each and every component to destruction white-box style and then double test all your interfaces. Design by contract can help here. Use a tinderbox that runs continual builds. Maintain strict version controls. Maintain code discipline (getting rid of C++ helps here too). Realize that you probably do not (currently) have the skills to produce the kind of product you are talking about and be willing to commit the time and effort to tear up mistakes, to start over and to teach yourself. What you are attempting to do is not easy.
Re:Here's your best bet. by YGingras · 2006-02-04 16:24 · Score: 5, Informative

This is really good advice but it needs more details:

1) Wrap your legacy libs with SWIG
2) Code a working prototype in Python
3) Profile it (never skip this step)
4) Use SWIG to write the bottle neck parts in C++
5) Use Valgrind to ensure you are still OK memory wise
6) Profit!!
Re:Development Practices by Coryoth · 2006-02-04 17:03 · Score: 3, Informative

There are no silver bullets no, but there are tools that can help. Splint is a good example of something you can employ to make static checking for buffer overflows, and various dynamic memory errors like misuse of null pointers, dead storage, memory leaks, and dangerous aliasing in C and C++. It doesn't make your code bullet proof, but it can catch a lot of errors that you probably wouldn't otherwise spot. There's a nice paper about what it can do (warning PDF).

Jedidiah.

--
Craft Beer Programming T-shirts
my experience by larry+bagina · 2006-02-04 17:53 · Score: 3, Informative

I'd say try to have lots of little programs that do one specific thing -- easier to test and verify. Also, if (when) a bug is found it should be easier to fix, and hopefully will have less impact than if it was a monolithic application.
I've dealt with software that automatically restarts a dead process, and in my experience, it doesn't work so good. If you want ultra-stable software, you want to know what caused the crash and why.
For your situation, where I guess you're doing lots of time consuming computing, I'd think you should also set checkpoints, save intermediate results, or something, so if it does crash, you can restart in the middle instead of going back to 0. (A standard practice when I was analyzing large databases for corruption, a task that could take days)

--
Do you even lift?
These aren't the 'roids you're looking for.
Re:You're not the first one.... by SEAlex · 2006-02-04 18:44 · Score: 3, Informative

The use of managed language will not necessarily result in a more stable code. Recovering form SIGSEGV by installing a POSIX handler or detecting the death of the forked child process in C++ can be done with the same ease as catching NullPointer Runtime exception in Java.

I would agree that having to write memory management code is error prone, but it is possible to be careful (i.e., use auto pointers, stl vectors instead of arays, etc). You do need to be very good with C++, however.

My suggestion to the author is to look at the application servers that support C++ components. I worked with a small, relatively unknown server, but even that product had a feature that kept the server up even when some C++ component (running as a forked process) crashed.
good coding techniques by Pr0xY · 2006-02-04 19:40 · Score: 4, Informative

first and foremost, use good coding techniques. This means use exception handling where appropriate, use standard containers over hand rolled data structures (prefer std::string over char arrays, this will help prevent almost all common string based buffer overflows alone), and follow good style guidelines.

As for a GUI programming, if you are strictly tied to c++, i would recommend QT (www.trolltech.com) they have a fabulous API (takes getting used to, but it makes sense once you do). Nice part about QT is that it is source portable to just about every major platform (X11, Win32, Mac).

It is possible to write reliable, fault tolerate code in c++ (realize please that perfect code is impossible in any language), it just has to be well thought out and done right.

proxy
Re:Development Practices by swillden · 2006-02-05 01:49 · Score: 3, Informative
There is no silver bullet for what you describe other than sound development practices.
True, but it should be pointed out that C++ is well-equipped to make such sound development practices easy. Consider the major sources of instability in C programs:
1. Buffer overflows. This arises in C because you must allocate and use arrays for buffers. In C++, you should almost never allocate a simple array for use as a buffer in application code. For strings, use std::string. For collections of various sorts, used the STL collections. If you do actually need a buffer for some reason, wrap it in an appropriate class, keep it simple enough that you can easily verify every mode of operation and test the hell out of it.
2. Dereferencing dangling or NULL pointers. There are several things you should do to avoid them:
  - Use smart pointers. Don't use naked pointers. Use appropriate smart pointer classes.
  - Don't create uninitialized pointers. At the very least, every pointer created should immediately be set to NULL if it can't be set to point at a valid object. Smart pointer classes can make sure everything is automatically NULLed.
  - Check memory allocations. Actually, on systems with virtual memory like Linux or Windows, memory allocations don't fail -- or, rather, if they do, you already have really big problems. It's a good idea to check anyway.
  - Check every pointer for NULL before dereferencing. This one is overkill for most applications, but if you can't ensure that your pointers aren't NULL, check them every time. If you use exceptions, have your smart pointers throw on a NULL dereference, but be warned that exception handling must be designed into your application from the beginning. It's hard to retrofit.
3. Memory leaks. Other resources can leak as well. To avoid this, you should almost never explicitly free resources. Use "Resource Acquisition Is Initialization", which means that every resource you allocate should be immediately wrapped in an object whose destructor will free that resource (i.e. a smart pointer). If the memory you allocate doesn't have a simple lifetime (which is actually fairly rare, in my experience), use a reference counting smart pointer or similar.
4. Incorrect library interface use. When you call library code, even if the library itself is solid and well-debugged, you have to use the library APIs as the library authors intended them to be used. Often, that means you have to violate the above recommendations. If the violations are too frequent and egregious, wrap the library in a layer that gives your application a cleaner API. If they're not too bad, just code all of the library calls carefully and review that code thoroughly. Above all, you *must* understand how the library APIs work, especially with respect to memory management and callbacks.
5. Concurrency. Multi-threaded programming introduces a lot of weird and unexpected failure modes, at least until you really understand it. In general, most application code should avoid concurrency. However, there are some cases where it simplifies the code so much that it is a net win[*]. In those cases, all I can say is: code carefully, and make sure you know what you're doing.
In my experience, doing the above religiously will ensure you never see segmentation faults. The next step, of course, is to make sure your code correctly implements the desired functionality. C++ is no different from Java or any other OO language in this respect. Clear rquirements definition, modularity, clean separation of concerns and testing, both automated an manual, are the basic keys to generating correct and maintainable code in any language.
[*] A story: I once asked a guy on my team to write a little program to monitor a bank of modems, accepting incoming calls and exchanging data with the callers. He spent two weeks and produced nearly 10,000 lines of code
--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Use STL; avoid pointers; avoid fixed buffers. by Theovon · 2006-02-05 03:43 · Score: 3, Informative

You can avoid some of the pitfalls of C++'s need for manual memory management and other problems by simply avoiding them. For instance, never do memory management yourself. How? By using STL containers to do it all for you. Next, avoid fixed arrays. Again, let STL do it for you. And, above all else, never do anything where you don't restrict the length. Since you're using STL for arrays, you're good to go there, and you won't end up running off the end of a character array (because you don't use them!). So what you're left with is doing I/O properly. Always limit the the amount of data you read to the buffer size you have allocated.

I'm sure that there's tons I've left out, but this has worked reasonably well for me. The only problem is that STL can be slow. Sure, map may be O(log(n)), but the constants are huge. Unfortulately, for practical reasons, performance and security are often inversely proportional.
Have you have flown on a commercial airline? by EMB+Numbers · 2006-02-05 04:37 · Score: 4, Informative

Have you have flown on a commercial airline in thelast 30 years? If so, you trusted your life to software.
Thare is a standard called DO-178B Level A that applies to aircraft software upon which lives depend. There is a saying in the commercial avionics business: "Nobody has ever died from software failure on an airplane, yet." There have been some accidents where software played a role, but I won't quibble with that now.

The point is that safety critical software is developed routinely. It has been developed in asembly language. It has certainly been developed in Ada, C, and sub-sets of C++. It is expensive. Validation of avionics software and certification in an aircraft can easilly cost an order of magnitude more that just writing the software, and writing the software using required processes and producing required artifacts is not cheap either.
Re:Listen to what he said!! by GileadGreene · 2006-02-05 05:04 · Score: 3, Informative

So what he needs to do is develop a design that is robust in the face of errors. In other words, it needs to be fault tolerant. There are well-known design practices for doing this (checkpoints, watchdogs, rollbacks, etc.) as well as design patterns for robust distributed computation (see, for one example, Joe Armstrong's thesis on making reliable systems in the presence of software errors.

No, the situation the OP is in is not ideal. But it's also not impossible to work with, and there are techniques that can help him to get closer to achieving his goals within the constraints placed upon him.
Learn to use STL and don't *ever* type "new[]" by Joce640k · 2006-02-05 05:35 · Score: 3, Informative

1) Learn to use STL.

Do *all* memory management via STL vector/string.

2) Don't ever type "new[]/delete[]".

Just don't do it. Not. Ever. Use std::vector instead.

"Arrays are evil" - the C++ FAQ.

PS: You can still use malloc()/free() but only as a last resort in low-level classes which are designed for data storage.

3) Get a reference-counted pointer and use it.

Automatic memory management...'nuff said.

4) Attach an alarm bell to your "~" key.

If you're writing destructors for classes which don't control system resources (eg. files) then you're probably doing something wrong - see notes 1, 2 and 3.

--
No sig today...