A Diagnosis of Self-Healing Systems
gManZboy writes "We've been hearing about self-healing systems for a while, but (as is usual), so far it's more hype than reality. Well it looks like Mike Shapiro (from Sun's Solaris Kernel group) has been doing a little actual work in this direction. His prognosis is that there's a long way to go before we get fully self-healing systems. In this article he talks a little bit about what he's done, points out some alternative approaches to his own, as well as what's left to do."
Neither the applications nor the OS should depend on the other providing any failover or self-healing services; they should always be prepared to go it alone if necessary (as it might be the failover system). Services that crash should restart themselves, etc. This part is pretty well done by most enterprise-grade server software. It's the operating systems we're waiting to play catch-up.
And I'm still waiting to see any box that can replace its own power supply after someone flips the 115/230 switch. Once we get that, then we'll have truly self-healing systems. And all you BOFHs out there might be looking for a new career...
How am I supposed to fit a pithy, relevant quote into 120 characters?
Which turned out not to be faulty... hmmm...
Some IBM mainframes are already at this level of self-diagnosis. Where I work, when a drive in the RAID array fails, the array phones IBM to report the fault and IBM repairmen show up with spare drives. We don't know that a drive failed until the field service tech shows up!
TiVo has had self-healing Linux systems out there for five years now. There are virtually no complaints of TiVo software failure (hard drives certainly go bad from time to time, but very rarely does the OS get itself into a state it can't fix), so the notion that self-healing systems are still years off is silly. They may not be extremely advanced yet, but they're certainly out there.
It's very easy to make a system self-healing when you are running in a completely controlled environment.
Indeed my TiVo very rarely crashes and always recovers, but the same is also true of every embedded system I've used, be it a cellphone, weather station or alarm system.
Now if I screw around modding my TiVo then it's entirely possible to crash it, and it doesn't recover very well from that...
If self healing = MS Office keeps putting another icon in my start menu whenever I start Word, then I don't want self healing.
How many times do I have to move their icons to a submenu before they realise I don't want my root menu cluttered up with crap?
While a self-healing system sounds nifty, today's systems aren't even good enough to be healed manually.
Uninstalling applications is often not handled by the OS and has to be done by the application itself, which leaves behind partial installations, config files and registry entries that haven't been properly cleaned up.
Files aren't versioned, so every change made to a file simply erases the former content forever, which is not so good if the former content was important.
Undelete? Nope, we don't have that either. We have this hack of a trashcan, but that won't help you much if some program deleted the file.
Checking the integrity of an installed piece of software isn't possible either. Sure, there are third-party solutions, but again that should be something the OS provides by default.
Well, there are millions more reasons why today's systems suck and why it is often easier to simply reinstall from scratch than to try to actually fix the mess. And yep, that is true for Linux, Windows and MacOS alike, for some more than for the others, but that's it.
If your future depended on merely fixing computers, it was a bad one in the first place.
I don't know why Windows doesn't just have a reset button that returns all the settings to their original condition. It's a bitch to reinstall it twice a year, you know.
Well, this seems to be where computing services are heading: IBM is doing extensive research on Self-Configuring, Self-Healing, Self-Optimizing, and Self-Protecting computing systems, which it calls 'Autonomic'.
Check out: Autonomic Computing
The most successful example is Tandem. For decades, systems that have to keep running have run on Tandem's operating system. For an overview of how they did it, see the 1985 paper Why Computers Stop and What Can Be Done About It.
The basic concepts are:
Every time you use an ATM or trade a stock, somewhere a Tandem cluster was involved.
Tandem's problem was that they had rather expensive proprietary hardware. You also needed extra hardware to allow for fail-operational systems. But it all really does work. HP still sells Tandem, but since Carly, it's being neglected, like most other high technology at HP.
Click here to ruin the joke.
I have it set up so that if one of our firewalls detects an attempt to access gator.com, it enrols the machine into an Active Directory system group, which the SMS server queries to automatically de-spyware it with SpyBot.
I'd call that a self healing system. I'm a network admin though so my perception of these things tends to be on a larger scale.
This is really something that, IMHO, calls for more interaction between the best of the futurists, science-fiction writers, and coders, and other complexity thinkers.
In order for any system to have an understanding of and proper diagnosis of its own operation, it needs to be able to conceptualize its relationship to other systems around it. Am I important? What functions do I provide? What level of error is proper to report to my administrator? Do I have a history of hardware problems? Has chip 2341 on motherboard 12 been acting up intermittently? If so, is it getting worse or better? How have I been doing over the last few days? Is there a new virus going around that is similar to something I've had before?
What good is a self-diagnosing system without a memory of its prior actions?
All of these questions imply some sort of context that will require the system to use symbols to represent "things" in the "world" around it. Clearly, the largest (though perhaps not qualitatively different) symbol will be a "self" symbol.
From there, all you have to do is follow Hofstadter's path and you'll arrive at a system with emergent self-awareness or consciousness.
The end result of this will be something a) very complex and b) designed/grown by itself. You'll have either the computer from the U.S.S. Enterprise or H.A.L.
Side question: What is CYC doing these days?
Hire a Linux system administrator, systems engineer,
How about just systems that fail *verbosely*, so admins can quickly diagnose them? Once the patient can complain properly, we can get to work replacing the admin doctors with "self-healing" metasystems that use those diagnostics. It will be a lot easier just mimicking the best admins' best practices by automating them, than all this screwing around trying to compile marketsprach like "self-healing" without understanding how it even works in nature.
--
make install -not war
The former could be considered self-repair, but it is limited as you don't have to have much in the way of an error to totally swamp most error-correction codes.
The second form isn't really self-repair as much as it is damage control. This is just as important as self-repair, as you can't do much repair work if your software can't run.
On the whole, "normal" systems don't need any kind of self-repair, beyond the basic error-correction codes. Instead, you are likely better off to have a "hot fail-over" system - two systems running in parallel with the same data, only one of them is kept "silent". Both take input from the same source(s), and so should have identical states at all times, with no synchronization required.
If the "active" one fails, just "unsilence" the other one and restore the first one's state. If the "silent" one fails, all you do is copy the state over.
However, computers are deterministic. Two identical machines, performing identical operations, will always produce identical results. Therefore, in order to have a meaningful hot fail-over of the kind described, the two can't be identical. They have to be different enough to not fail under identical conditions, but be similar enough that you can trivially switch the output from one to the other without anybody noticing.
eg: The use of a Linux box on an AMD running Roxen, and an OpenBSD box on an Intel running Apache, would be pretty much guaranteed not to have common points of failure. If you used a keepalive daemon for each box to monitor the other's health, you could easily ensure that only one box was "talking" at a time, even if both were receiving.
The added complexity is minimal, which is always good for reliability, and the result is as good as or better than any existing software self-repair method out there.
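The keepalive arrangement described above can be sketched roughly as follows. This is only an illustration: the `Node` and `FailoverPair` names, the three-second timeout and the injectable `clock` are assumptions made for the example, not part of any real keepalive daemon.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    """One of the two boxes; both receive input, only the active one talks."""
    name: str
    last_heartbeat: float = 0.0
    silent: bool = True


@dataclass
class FailoverPair:
    primary: Node
    standby: Node
    timeout: float = 3.0  # hypothetical: seconds of silence before a box counts as dead
    clock: Callable[[], float] = time.monotonic

    def __post_init__(self) -> None:
        now = self.clock()
        self.primary.last_heartbeat = now
        self.standby.last_heartbeat = now
        self.primary.silent = False
        self.standby.silent = True

    def heartbeat(self, node: Node) -> None:
        """Record a keepalive message received from `node`."""
        node.last_heartbeat = self.clock()

    def active(self) -> Node:
        """Return the box that is currently allowed to talk."""
        return self.standby if self.primary.silent else self.primary

    def check(self) -> Node:
        """Unsilence the other box if the active one has missed its keepalive."""
        now = self.clock()
        current = self.active()
        other = self.standby if current is self.primary else self.primary
        current_dead = now - current.last_heartbeat > self.timeout
        other_alive = now - other.last_heartbeat <= self.timeout
        if current_dead and other_alive:
            current.silent, other.silent = True, False
            return other
        return current
```

Exactly one box is non-silent at any time. Restoring the failed box's state before it rejoins is left out of the sketch.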
Now, you can't always use such solutions. Anything designed to work in space, these days, uses a combination of the above techniques to extend the lifetime of the computer. By dynamically monitoring the health of the components, re-routing data flow as needed, and repairing data/code stored in transistors that have become damaged, you ensure the system will keep functioning.
Transistors get destroyed by radiation quite easily. If you didn't have some kind of self-repair/damage-control, you'd either be using chips with transistors which may or may not work, or you'd have to scrub the entire chip after a single transistor went.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Files aren't versioned
Undelete?
Checking the integrity of an installed piece of software
During the desktop's formative years, the raw drive space needed to actually implement these kinds of things just wasn't available. This is why things like file versioning (popular on large systems like VMS, where the universities/companies running it had the money for the storage requirements) and permanent storage of "unwanted" files just didn't appear.
The third problem is a bit tougher without some extra metadata and hardcore discussions on exactly what should be monitored/done/etc (personally, I don't think this is a kernel-level operation). Something must be stored somewhere so that the system can identify a modified binary. At some time (before change, in which case the operation is stopped? After change? Monthly?) someone (root? file owner? script kiddie currently logged in as root?) has to be notified (syslog? message to terminal? email?) that something (virus? script kiddie? make install? dpkg? rpm?) has altered the (executable? configuration? library? manpage?). As you can see, it's one thing to say "oh yeah the OS should do this" and another entirely to define what this is.
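For the "something must be stored somewhere" part, a minimal userland sketch is a manifest of file hashes that gets re-checked later. The function names here are made up for illustration, and the sketch deliberately leaves open the policy questions above (when to check, whom to notify).

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large binaries need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths: Iterable[Path]) -> Dict[str, str]:
    """Record the known-good hash of every file at install time."""
    return {str(p): sha256_of(Path(p)) for p in paths}


def find_changes(manifest: Dict[str, str]) -> Dict[str, str]:
    """Compare the current files against the manifest and report differences."""
    changes = {}
    for name, recorded in manifest.items():
        path = Path(name)
        if not path.is_file():
            changes[name] = "missing"
        elif sha256_of(path) != recorded:
            changes[name] = "modified"
    return changes
```

The manifest itself has to live somewhere an attacker or a careless `make install` can't rewrite it, which is the same "who is trusted?" question all over again.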
The second problem is tough as well, but there are replacements for libc's unlink() function (either as a patch or as an LD_PRELOAD library that overrides libc's function) that move the files to a pre-defined trashcan, and every dynamically linked application will use them.
The first problem is mostly just a lack of demand. Nobody cares, so nobody made a filesystem that can do it. Both ext*fs and reiserfs are extendable (with optional features, reiserfs more so than ext), so if you care, do it yourself, but again there are questions you'll have to be prepared to answer (and since you insist on doing this at the kernel level, you have to have THE answer): If a program writes 1MB to a file 1 byte at a time, is that one million revisions? If you're writing a document and you hit save after every paragraph, is that a revision? How are you going to tell this apart at the kernel level?
If I have been able to see further than others, it is because I bought a pair of binoculars.
Fault Tolerance implies the ability not just to detect the fault (e.g. a failed CPU), but to keep the processes running as if nothing happened. This is possible with Stratus and Tandem boxes. It is generally not possible with common x86/Power/SPARC boxes (unless you put a lot of software on top of two boxes to make them look like one big virtual system).
"Self Healing", in this context, is the system's ability to detect a fault (hardware or software), deal with it (restart a process, isolate hardware, etc) and then get on with life (in a possibly degraded mode). In a way, the venerable Veritas Cluster Server is an example of a "self healing" system (it detects a failure of a service group and restarts it, on another node if needed).
Note that with "self healing" systems, the process may die, and end users may notice a failure. But the system is 'back online' sooner than if it required manual intervention. Compare this to a Fault Tolerant system that never went down in the first place.
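The detect, restart, get-on-with-life loop can be sketched in a few lines. This is a toy supervisor under stated assumptions: the service is modelled as a Python callable that raises on failure, and `supervise`, `max_restarts` and `backoff` are names invented for the example, not how any cluster product does it.

```python
import time
from typing import Callable


def supervise(service: Callable[[], None],
              max_restarts: int = 3,
              backoff: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `service`, restarting it after each crash.

    Returns the number of restarts that were needed. Once `max_restarts`
    is exhausted the failure is escalated instead of being retried forever.
    """
    restarts = 0
    while True:
        try:
            service()
            return restarts
        except Exception as error:
            if restarts >= max_restarts:
                raise RuntimeError(f"giving up after {restarts} restarts") from error
            restarts += 1
            print(f"service failed ({error!r}); restart {restarts}/{max_restarts}")
            sleep(backoff * restarts)  # wait a little longer after each crash
```

Users may still notice the gap between the crash and the restart, which is the difference from fault tolerance described above.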
The space shuttle, as old as it is, has an absolutely incredible computer system that is self healing.
The Shuttle has many thousands of sensors and backup sensors. Each sensor feeds into one of many computer systems. These computer systems talk to each other more as a committee than by just passing data amongst themselves. If a computer discovers a fault, another computer will see that fault as well. Each one combines data gathered from other computer systems throughout the shuttle, and each computer system will literally cast a vote on what the best solution should be for the particular fault discovered.
If one computer system suffers a partial or complete failure, the remaining systems will work around the failed system.
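The committee idea boils down to majority voting over redundant answers. The sketch below is a generic illustration, not the Shuttle's actual flight software: `majority_vote` and `dissenters` are hypothetical names, and a computer that has failed outright is represented as `None`.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def majority_vote(answers: Sequence[Optional[Hashable]]) -> Hashable:
    """Return the answer that a strict majority of responding computers agree on.

    `None` stands for a computer that has failed and casts no vote, so the
    remaining systems simply work around it.
    """
    live = [a for a in answers if a is not None]
    if not live:
        raise RuntimeError("no computer responded")
    winner, count = Counter(live).most_common(1)[0]
    if count * 2 <= len(live):
        raise RuntimeError("no majority among responding computers")
    return winner


def dissenters(answers: Sequence[Optional[Hashable]], winner: Hashable) -> List[int]:
    """Indexes of computers that responded but disagreed with the majority."""
    return [i for i, a in enumerate(answers) if a is not None and a != winner]
```

With five computers where one is dead and one disagrees, the other three still carry the vote, and `dissenters` tells you which one to distrust.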
This computer system has managed to keep our astronauts alive for every mission, except those two that suffered from a catastrophic mechanical failure. In the second of those (Columbia), the computers kept the craft flying until it broke apart completely.
I say not bad for a system designed over 20 years ago!
Good security is based upon reality and common sense. Common sense is a function of having common knowledge.
But I read in 1958 that we would have self healing systems "within a decade" - surely we must have had them for over 30 years!
Sent from my ASR33 using ASCII
The current zSeries machines come with 16 CPUs and L2 & L1 packaged together on a board.
But only 12 CPUs are used.
Each "CPU" is actually two CPUs and a comparator. When the CPUs come up with a different answer the CPU is shut down and processing is taken over by one of the four free CPUs on the board.
You will never know it happened until you run one of the maintenance utilities.
In the way of IBM this technology will probably appear on top end pSeries (AIX/Linux) and iSeries boxes in a couple of years.
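In software terms, the two-CPUs-and-a-comparator trick looks something like the sketch below. It is a toy model only: `LockstepPair`, `Board` and `ComparatorMismatch` are invented names, the "CPUs" are plain Python functions, and real hardware does this per instruction rather than per call.

```python
from typing import Callable, List


class ComparatorMismatch(Exception):
    """Raised when the two halves of a lockstep pair disagree."""


class LockstepPair:
    """Two copies of the same computation plus a comparator."""

    def __init__(self, core_a: Callable[[int], int], core_b: Callable[[int], int]) -> None:
        self.core_a = core_a
        self.core_b = core_b

    def run(self, value: int) -> int:
        result_a = self.core_a(value)
        result_b = self.core_b(value)
        if result_a != result_b:
            raise ComparatorMismatch(f"{result_a!r} != {result_b!r}")
        return result_a


class Board:
    """Active pairs plus a pool of spares that take over after a mismatch."""

    def __init__(self, active: List[LockstepPair], spares: List[LockstepPair]) -> None:
        self.active = list(active)
        self.spares = list(spares)
        self.swaps = 0  # only visible if you go looking, like the maintenance utilities

    def execute(self, slot: int, value: int) -> int:
        while True:
            try:
                return self.active[slot].run(value)
            except ComparatorMismatch:
                if not self.spares:
                    raise  # nothing left to fail over to
                self.active[slot] = self.spares.pop()  # retire the bad pair
                self.swaps += 1
```

The caller of `execute` just gets the right answer back; the only trace of the failure is the `swaps` counter.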
Old COBOL programmers never die. They just code in C.