Slashdot Mirror


A Diagnosis of Self-Healing Systems

gManZboy writes "We've been hearing about self-healing systems for a while, but (as usual) so far it's more hype than reality. Well, it looks like Mike Shapiro (from Sun's Solaris Kernel group) has been doing a little actual work in this direction. His prognosis is that there's a long way to go before we get fully self-healing systems. In this article he talks a little about what he's done and points out some alternative approaches to his own, as well as what's left to do."

1 of 149 comments

  1. UNIX is the problem. Tandem was the solution. by Animats · Score: 5, Interesting
    There are operating systems for which "self-healing" is quite feasible, but UNIX is all wrong for it.

    The most successful example is Tandem. For decades, systems that have to keep running have run on Tandem's operating system. For an overview of how they did it, see Jim Gray's 1985 paper "Why Do Computers Stop and What Can Be Done About It?"

    The basic concepts are:

    • All the permanent state is in a database with proper atomic restart and recovery mechanisms.
    • Flat "files" are implemented on top of the database, not the other way round.
    • When applications fail, they are usually restarted completely, with any in-process transactions being backed out.
    • Applications with long-running state are tracked by a watching program on another machine which periodically receives state updates from the first program. If the first program fails, the watching program restarts it from a previous good state.
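
    The last two points can be sketched in a few lines of Python. This is only an illustration of the checkpoint-and-restart idea, not Tandem's actual mechanism: the names Worker, Watcher and run are hypothetical, and two in-process objects stand in for programs on separate machines. The worker "crashes" partway through an update, and the watcher restarts it from the last state it received, which throws away the half-finished work.

```python
import copy


class Worker:
    """Hypothetical long-running program that keeps a running total."""

    def __init__(self, state=None):
        self.state = state if state is not None else {"processed": 0, "total": 0}

    def process(self, value):
        # The update happens in two steps, so a failure in between
        # leaves the in-memory state inconsistent.
        self.state["total"] += value
        if value < 0:
            raise ValueError("simulated crash in the middle of an update")
        self.state["processed"] += 1


class Watcher:
    """Stands in for the watching program on another machine."""

    def __init__(self):
        self.checkpoint = None

    def receive(self, state):
        # Keep a private copy so later changes in the worker cannot touch it.
        self.checkpoint = copy.deepcopy(state)

    def restart(self):
        # Build a fresh worker from the last known good state.
        return Worker(copy.deepcopy(self.checkpoint))


def run(values, checkpoint_every=2):
    """Feed values to a worker, restarting it from a checkpoint on failure."""
    watcher = Watcher()
    worker = Worker()
    watcher.receive(worker.state)  # initial good state
    restarts = 0
    for i, value in enumerate(values, start=1):
        try:
            worker.process(value)
        except ValueError:
            # The failed worker is discarded along with its partial update.
            worker = watcher.restart()
            restarts += 1
            continue
        if i % checkpoint_every == 0:
            watcher.receive(worker.state)  # periodic state update
    return worker.state, restarts


if __name__ == "__main__":
    print(run([5, 3, -1, 4]))
```

    Note the trade-off in this sketch: anything done since the last checkpoint is lost on restart, so how often the worker sends state updates decides how much work gets repeated or dropped.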

    Every time you use an ATM or trade a stock, a Tandem cluster is involved somewhere.

    Tandem's problem was that they had rather expensive proprietary hardware. You also needed extra hardware to allow for fail-operational systems. But it all really does work. HP still sells Tandem, but under Carly Fiorina it's being neglected, like most other high technology at HP.