A Diagnosis of Self-Healing Systems
gManZboy writes "We've been hearing about self-healing systems for a while, but (as is usual), so far it's more hype than reality. Well it looks like Mike Shapiro (from Sun's Solaris Kernel group) has been doing a little actual work in this direction. His prognosis is that there's a long way to go before we get fully self-healing systems. In this article he talks a little bit about what he's done, points out some alternative approaches to his own, as well as what's left to do."
Which turned out not to be faulty... hmmm...
Some IBM mainframes are already at this level of self-diagnosis. Where I work, IBM repairmen show up with spare drives for the RAID array when they fail and the array phones IBM to report the fault. We don't know that a drive failed until the field service tech shows up!
Plenty of Sun's boxes have redundant power supplies.
If something goes wrong with one, the system should detect either too little or too much DC voltage or current coming from it, and switch to it's backup.
Your suggestion doesn't make much sense. Should mozilla know what to do if a usb mouse fails or is removed unexpectedly? Of course not, the mozilla developers expect that this will be taken care of.
Likewise when an correctably memory or disk error occurs... The memory controller or disk firmware should deal with it and the application should be none-the-wiser.
The most successful example is Tandem. For decades, systems that have to keep running have run on Tandem's operating system. For an overview of how they did it, see the 1985 paper Why Computers Stop and What Can Be Done About It.
The basic concepts are:
Every time you use an ATM or trade a stock, somewhere a Tandem cluster was involved.
Tandem's problem was that they had rather expensive proprietary hardware. You also needed extra hardware to allow for fail-operational systems. But it all really does work. HP still sells Tandem, but since Carly, it's being neglected, like most other high technology at HP.
I have it so that if one of our firewalls detects an attempt to access gator.com it enrols the machine into an active directory system group which the SMS server queries to automatically de-spyware it with SpyBot.
I'd call that a self healing system. I'm a network admin though so my perception of these things tends to be on a larger scale.