A Diagnosis of Self-Healing Systems
gManZboy writes "We've been hearing about self-healing systems for a while, but (as is usual), so far it's more hype than reality. Well it looks like Mike Shapiro (from Sun's Solaris Kernel group) has been doing a little actual work in this direction. His prognosis is that there's a long way to go before we get fully self-healing systems. In this article he talks a little bit about what he's done, points out some alternative approaches to his own, as well as what's left to do."
TiVo has had self-healing Linux systems out there for five years now. There are virtually no complaints of TiVo software failure (hard drives certainly go bad from time to time, but very rarely does the OS get itself into a state it can't fix), so the notion that self-healing systems are still years off is silly. They may not be extremely advanced yet, but they're certainly out there.
It's very easy to make a system self-healing when you are running in a completely controlled environment.
Indeed my TiVo very rarely crashes and always recovers, but the same is also true of every embedded system I've used, be it a cellphone, weather station or alarm system.
Now if I screw around modding my TiVo then it's entirely possible to crash it, and it doesn't recover very well from that...
If self-healing = MS Office keeps putting another icon in my Start menu whenever I start Word, then I don't want self-healing.
How many times do I have to move their icons to a submenu before they realise I don't want my root menu cluttered up with crap?
While a self-healing system sounds nifty, today's systems aren't even good enough to be healed manually.
Uninstalling applications is often not handled by the OS and has to be done by the application itself, resulting in incomplete removals that leave behind config files, registry entries and whatever else hasn't been properly cleaned up.
Files aren't versioned, so every change made to a file simply erases the former content forever, which is not so good if the former content was important.
Undelete? Nope, we don't have that either. We have this hack of a Trashcan, but that won't help you much if some program deleted the file.
Checking the integrity of an installed piece of software isn't possible either. Sure, there are third-party solutions, but again that should be something the OS provides by default.
Well, there are millions more reasons why today's systems suck and why it is often easier to simply reinstall from scratch than to try to actually fix the mess. And yep, that is true for Linux, Windows and MacOS alike, for some more than for the others, but that's it.
If your future depended on merely fixing computers, it was a bad one in the first place.
How about just systems that fail *verbosely*, so admins can quickly diagnose them? Once the patient can complain properly, we can get to work replacing the admin doctors with "self-healing" metasystems that use those diagnostics. It will be a lot easier just mimicking the best admins' best practices by automating them, than all this screwing around trying to compile marketsprach like "self-healing" without understanding how it even works in nature.
--
make install -not war
Knowing HP, your systems are probably being replaced by Tandem-branded PCs with ECC RAM and software RAID. A rescue DVD will provide instant system rebuilds so downtime is never more than two days.
-- "Makes Little Debbie look like a pile of puke!" - Moe Szyslak
Files aren't versioned
Undelete?
Check of integrity of an installed piece of software
During the desktop's formative years, the raw drive space needed to actually implement these kinds of things just wasn't available. This is why things like file versioning (popular on large systems like VMS, where the universities/companies running it had the money for the storage requirements) and permanent storage of "unwanted" files just didn't appear.
The third problem is a bit tougher without some extra metadata and hardcore discussions on exactly what should be monitored/done/etc (personally, I don't think this is a kernel-level operation). Something must be stored somewhere so that the system can identify a modified binary. At some time (before the change, in which case the operation is stopped? After the change? Monthly?) someone (root? file owner? script kiddie currently logged in as root?) has to be notified (syslog? message to terminal? email?) that something (virus? script kiddie? make install? dpkg? rpm?) has altered the (executable? configuration? library? manpage?). As you can see, it's one thing to say "oh yeah the OS should do this" and another entirely to define what "this" is.
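To make the "something must be stored somewhere" part concrete, here is a minimal Python sketch, assuming the stored metadata is just a SHA-256 digest per file. The function names (file_digest, build_manifest, verify) are made up for illustration, and the sketch deliberately answers none of the policy questions above (when to check, whom to notify, how).

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Compare the tree under root against a previously stored manifest.

    Returns (modified, missing, added) as sorted lists of relative paths.
    """
    current = build_manifest(root)
    modified = sorted(
        k for k in manifest if k in current and current[k] != manifest[k]
    )
    missing = sorted(k for k in manifest if k not in current)
    added = sorted(k for k in current if k not in manifest)
    return modified, missing, added
```

A manifest like this only helps if it is itself kept somewhere an attacker (or a careless make install) cannot rewrite it, which is exactly the kind of question the sketch leaves open.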
The second problem is tough as well, but there are replacements for libc's unlink() function (either as a patch to libc or as an LD_PRELOAD library that overrides libc's function) that move files to a pre-defined trashcan instead of deleting them, and every dynamically linked application will pick them up.
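The real thing would be a small C library interposed in front of unlink(), but the policy it applies can be sketched in a few lines of Python. This is only an illustration under assumptions: trash_unlink and the trash_dir argument are hypothetical names, and the timestamp prefix is just one way to keep same-named files from clobbering each other.

```python
import os
import shutil
import time


def trash_unlink(path, trash_dir):
    """Move path into trash_dir instead of deleting it; return the new location.

    trash_dir stands in for the pre-defined trashcan directory (an assumption).
    """
    os.makedirs(trash_dir, exist_ok=True)
    base = os.path.basename(path)
    # Prefix with a nanosecond timestamp so same-named files rarely collide.
    stamp = time.time_ns()
    dest = os.path.join(trash_dir, f"{stamp}-{base}")
    n = 0
    # Fall back to a counter if the clock returns the same value twice.
    while os.path.exists(dest):
        n += 1
        dest = os.path.join(trash_dir, f"{stamp}-{n}-{base}")
    shutil.move(path, dest)
    return dest
```

Note that anything statically linked, or anything that makes the system call directly, would bypass an LD_PRELOAD interposer entirely.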
The first problem is mostly just a lack of demand. Nobody cares, so nobody made a filesystem that can do it. Both ext*fs and reiserfs are extensible (through optional features, reiserfs more so than ext), so if you care, do it yourself. But again there are questions you'll have to be prepared to answer (and since you insist on doing this at the kernel level, you have to have THE answer): If a program writes 1MB to a file 1 byte at a time, is that one million revisions? If you're writing a document and you hit save after every paragraph, is each save a revision? How are you going to tell these apart at the kernel level?
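For comparison, here is what one possible answer looks like in user space, as a rough Python sketch: one revision per open/close cycle, no matter how many writes happen in between. VersionedWriter, next_revision_path and the ".~N~" backup naming are assumptions picked for illustration, not how any particular filesystem does it.

```python
import os
import shutil


def next_revision_path(path):
    """Return the first unused backup name: path.~1~, path.~2~, ..."""
    n = 1
    while os.path.exists(f"{path}.~{n}~"):
        n += 1
    return f"{path}.~{n}~"


class VersionedWriter:
    """Keep one revision per open/close cycle, however many writes occur."""

    def __init__(self, path):
        self.path = path
        self.writes = 0
        # Preserve the previous content once, before any new data is written.
        if os.path.exists(path):
            shutil.copy2(path, next_revision_path(path))
        self._f = open(path, "wb")

    def write(self, data):
        # Count individual writes to show they do not create revisions.
        self.writes += 1
        return self._f.write(data)

    def close(self):
        self._f.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
        return False
```

With this policy the million one-byte writes cost a single revision, but only because the application layer knows what a "save" is. The kernel just sees the stream of write() calls, which is the point of the questions above.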
If I have been able to see further than others, it is because I bought a pair of binoculars.
The space shuttle, as old as it is, has an absolutely incredible computer system that is self healing.
The Shuttle has many thousands of sensors and backup sensors. Each sensor feeds into one of many computer systems. These computer systems talk to each other more as a committee than by just passing data amongst themselves. If a computer discovers a fault, another computer will see that fault as well. Each combines data gathered from other computer systems throughout the shuttle, and each computer system literally casts a vote on what the best solution should be for the particular fault discovered.
If one computer system suffers a partial or complete failure, the remaining systems will work around the failed system.
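The committee idea can be sketched as plain majority voting among redundant units. This is a toy Python illustration, not the Shuttle's actual flight software: majority_vote, the unit names in the usage note, and the convention that a failed unit reports None are all assumptions.

```python
from collections import Counter


def majority_vote(readings):
    """Return (value, dissenters) for a strict majority among redundant units.

    readings maps a unit name to its output, or None if the unit has failed.
    Outputs must be hashable. Raises ValueError when no value is backed by
    more than half of the live units.
    """
    # Failed units are simply left out, so the rest work around them.
    live = {unit: v for unit, v in readings.items() if v is not None}
    if not live:
        raise ValueError("no live units")
    value, count = Counter(live.values()).most_common(1)[0]
    if count * 2 <= len(live):
        raise ValueError("no majority among live units")
    dissenters = sorted(unit for unit, v in live.items() if v != value)
    return value, dissenters
```

For example, majority_vote({"a": 1, "b": 1, "c": 2, "d": None}) picks 1 and flags "c" as the odd one out, while "d" is ignored as failed.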
This computer system has managed to keep our astronauts alive for every mission except the two that suffered catastrophic mechanical failures. In the second of those (Columbia), the computers kept the craft flying until it broke apart completely.
I say not bad for a system designed over 20 years ago!
Good security is based upon reality and common sense. Common sense is a function of having common knowledge.