Slashdot Mirror


A Diagnosis of Self-Healing Systems

gManZboy writes "We've been hearing about self-healing systems for a while, but, as usual, so far it's more hype than reality. Well, it looks like Mike Shapiro (from Sun's Solaris Kernel group) has been doing a little actual work in this direction. His prognosis is that there's a long way to go before we get fully self-healing systems. In this article he talks a little bit about what he's done, points out some alternative approaches to his own, and covers what's left to do."

5 of 149 comments

  1. Had this 3 years ago by shoppa · · Score: 4, Interesting
    According to a documentary movie from 3 years ago, we already had this. HAL 9000 sent an astronaut out to help repair the antenna azimuth control board.

    Which turned out not to be faulty... hmmm...

    Some IBM mainframes are already at this level of self-diagnosis. Where I work, when a drive in the RAID array fails, the array phones IBM to report the fault and an IBM repairman shows up with a spare drive. We don't know that a drive failed until the field service tech shows up!

  2. Re:The challenge of a truly self-healing system by grahamsz · · Score: 4, Informative

    Plenty of Sun's boxes have redundant power supplies.

    If something goes wrong with one, the system should detect either too little or too much DC voltage or current coming from it, and switch to its backup.
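    A minimal sketch of that failover check, for illustration only. The 12 V nominal rail, the 5% tolerance, the 40 A limit, and the names `Reading`, `is_healthy`, and `select_supply` are all assumptions, not anything from Sun's hardware. It checks a voltage band and an over-current limit, then falls back to any supply that still reads healthy.

```python
from dataclasses import dataclass

# Hypothetical limits for a nominal 12 V rail; real limits are hardware-specific.
NOMINAL_VOLTS = 12.0
TOLERANCE = 0.05   # accept +/- 5% around nominal
MAX_AMPS = 40.0    # assumed over-current limit


@dataclass
class Reading:
    volts: float
    amps: float


def is_healthy(r: Reading) -> bool:
    """True when voltage is inside the band and current is within the limit."""
    low = NOMINAL_VOLTS * (1 - TOLERANCE)
    high = NOMINAL_VOLTS * (1 + TOLERANCE)
    return low <= r.volts <= high and 0.0 <= r.amps <= MAX_AMPS


def select_supply(active: str, readings: dict[str, Reading]) -> str:
    """Keep the active supply if it is healthy, otherwise fail over."""
    if is_healthy(readings[active]):
        return active
    for name, reading in readings.items():
        if name != active and is_healthy(reading):
            return name
    raise RuntimeError("no healthy power supply available")
```

    The point of the sketch is that the decision lives entirely below the application: nothing above `select_supply` needs to know a supply was swapped.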

    Your suggestion doesn't make much sense. Should mozilla know what to do if a usb mouse fails or is removed unexpectedly? Of course not, the mozilla developers expect that this will be taken care of.

    Likewise when a correctable memory or disk error occurs... The memory controller or disk firmware should deal with it and the application should be none the wiser.

  3. UNIX is the problem. Tandem was the solution. by Animats · · Score: 5, Interesting
    There are operating systems for which "self-healing" is quite feasible, but UNIX is all wrong for it.

    The most successful example is Tandem. For decades, systems that have to keep running have run on Tandem's operating system. For an overview of how they did it, see the 1985 paper Why Computers Stop and What Can Be Done About It.

    The basic concepts are:

    • All the permanent state is in a database with proper atomic restart and recovery mechanisms.
    • Flat "files" are implemented on top of the database, not the other way round.
    • When applications fail, they are usually restarted completely, with any in-process transactions being backed out.
    • Applications with long-running state are tracked by a watching program on another machine which periodically receives state updates from the first program. If the first program fails, the watching program restarts it from a previous good state.
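    The last bullet can be sketched roughly as follows. This is a single-process simulation, not Tandem's actual mechanism: `Worker`, `Watcher`, `run`, and the `checkpoint_every` parameter are hypothetical names, and on a real system the watcher would live on another machine and receive state over the network.

```python
import copy


class Worker:
    """A long-running task whose whole state is one small dict."""

    def __init__(self, state=None):
        self.state = state if state is not None else {"processed": 0}

    def step(self, fail=False):
        if fail:
            raise RuntimeError("simulated crash")
        self.state["processed"] += 1


class Watcher:
    """Holds the last good checkpoint and restarts the worker from it."""

    def __init__(self):
        self.checkpoint = None

    def receive(self, state):
        # Deep copy so later mutations by the worker cannot alter the checkpoint.
        self.checkpoint = copy.deepcopy(state)

    def restart(self):
        return Worker(copy.deepcopy(self.checkpoint))


def run(steps, crash_at, checkpoint_every=2):
    """Run until `steps` items are processed; crash once on attempt `crash_at`."""
    watcher = Watcher()
    worker = Worker()
    watcher.receive(worker.state)
    restarts = 0
    attempt = 0
    while worker.state["processed"] < steps:
        attempt += 1
        try:
            worker.step(fail=(attempt == crash_at))
        except RuntimeError:
            # Work done since the last checkpoint is lost and will be redone.
            worker = watcher.restart()
            restarts += 1
            continue
        if worker.state["processed"] % checkpoint_every == 0:
            watcher.receive(worker.state)
    return worker.state["processed"], restarts
```

    With `run(5, crash_at=4)` the worker crashes after its third item, resumes from the checkpoint taken at two items, and still finishes all five with one restart.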

    Every time you use an ATM or trade a stock, somewhere a Tandem cluster was involved.

    Tandem's problem was that they had rather expensive proprietary hardware. You also needed extra hardware to allow for fail-operational systems. But it all really does work. HP still sells Tandem, but since Carly, it's being neglected, like most other high technology at HP.

    1. Re:UNIX is the problem. Tandem was the solution. by rlp · · Score: 4, Interesting

      Tandem had a FT Unix division in Austin. One of the teams I managed was responsible for an embedded expert system that monitored faults in the redundant components of the system. Every component was replicated. Each logical CPU actually consisted of four processors - two pairs running in lock-step. If one CPU in a pair disagreed with its counterpart, the pair would be taken out of service. The expert system monitored transient faults and would "predict" that a component was going to fail, and could take it out of service. The system had a modem that would "phone home" in the event of a component failure, and a service tech would be dispatched with a part - often before the customer knew there was a problem.
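      A rough sketch of the two ideas in that paragraph, under loud assumptions: CPUs are modelled as plain Python callables, and the "expert system" is replaced by a simple count of transient faults inside a time window. The names `LockstepPair`, `LogicalCpu`, and `FaultPredictor`, and the threshold and window values, are hypothetical and not taken from the Tandem product.

```python
from collections import deque


class LockstepPair:
    """Two CPUs that must agree; on disagreement the pair retires itself."""

    def __init__(self, name, cpu_a, cpu_b):
        self.name = name
        self.cpu_a = cpu_a
        self.cpu_b = cpu_b
        self.in_service = True

    def execute(self, x):
        a, b = self.cpu_a(x), self.cpu_b(x)
        if a != b:
            self.in_service = False
            return False, None
        return True, a


class LogicalCpu:
    """One logical CPU built from several lock-step pairs."""

    def __init__(self, pairs):
        self.pairs = pairs

    def execute(self, x):
        # Run every pair still in service; disagreeing pairs retire themselves.
        outcomes = [p.execute(x) for p in self.pairs if p.in_service]
        for agreed, value in outcomes:
            if agreed:
                return value
        raise RuntimeError("all pairs out of service")


class FaultPredictor:
    """Flags a component after `threshold` transient faults within `window` seconds."""

    def __init__(self, threshold=3, window=60.0):
        self.threshold = threshold
        self.window = window
        self.faults = {}

    def record(self, component, timestamp):
        recent = self.faults.setdefault(component, deque())
        recent.append(timestamp)
        # Forget faults that have aged out of the window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.threshold
```

      A caller would take a component out of service when `record` returns `True`, and at that point trigger whatever "phone home" step the site uses.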

      The machines used MIPS processors (supporting SMP) and ran a Tandem variant of System V UNIX. Combine this with a decent transactional database, and application software capable of check-pointing itself, and you have a very robust system. Albeit a very expensive one.

      Tandem was bought out by Compaq, and then by HP. When I left, Tandem had quite a few interesting ideas they were working on, but near as I can tell, they never saw the light of day.

      --
      [Insert pithy quote here]
  4. One of my self-healing systems by skinfitz · · Score: 4, Interesting

    I have it so that if one of our firewalls detects an attempt to access gator.com, it enrols the machine into an Active Directory system group, which the SMS server queries to automatically de-spyware it with SpyBot.

    I'd call that a self healing system. I'm a network admin though so my perception of these things tends to be on a larger scale.
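    The flow described above can be sketched in miniature. Everything here is a stand-in: the log format (`host domain` per line), the in-memory `quarantine` set playing the role of the directory group, and the `clean` callback playing the role of the cleanup job are assumptions for illustration, not real firewall, Active Directory, or SMS interfaces.

```python
BLOCKLIST = {"gator.com"}


def parse_log_line(line):
    """Parse a hypothetical 'src_host dst_domain' firewall log line."""
    host, domain = line.split()
    return host, domain.lower()


def is_blocked(domain):
    """Match a blocklisted domain or any of its subdomains."""
    return any(domain == d or domain.endswith("." + d) for d in BLOCKLIST)


def flag_hosts(log_lines, quarantine):
    """Detection step: add every offending host to the quarantine group."""
    for line in log_lines:
        host, domain = parse_log_line(line)
        if is_blocked(domain):
            quarantine.add(host)


def remediate(quarantine, clean):
    """Remediation step: clean each flagged host, then empty the group."""
    cleaned = []
    for host in sorted(quarantine):
        clean(host)
        cleaned.append(host)
    quarantine.clear()
    return cleaned
```

    Keeping detection and remediation as separate steps that only share the group mirrors the setup in the comment: the firewall never has to know how the cleanup is done.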