A Diagnosis of Self-Healing Systems

Posted by michael on Tuesday December 21, 2004 @11:40AM from the heal-thyself dept.

gManZboy writes "We've been hearing about self-healing systems for a while, but (as usual) so far it's more hype than reality. Well, it looks like Mike Shapiro (from Sun's Solaris Kernel group) has been doing a little actual work in this direction. His prognosis is that there's a long way to go before we get fully self-healing systems. In this article he talks a little bit about what he's done and points out some alternative approaches to his own, as well as what's left to do."

One system this will never work on by mboverload · 2004-12-21 11:46 · Score: 0, Insightful

This will never work on Windows. With all the registry crap it has, I don't see anything like this working. The registry is a nightmare to fix if anything goes wrong; it is ALWAYS easier to reinstall. In fact, I'm reinstalling XP tomorrow because of all the crap and bugs it has accumulated. I do this at least twice a year, and it's a shame.
TiVo by Radak · 2004-12-21 11:51 · Score: 3, Insightful

TiVo has had self-healing Linux systems out there for five years now. There are virtually no complaints of TiVo software failure (hard drives certainly go bad from time to time, but very rarely does the OS get itself into a state it can't fix), so the notion that self-healing systems are still years off is silly. They may not be extremely advanced yet, but they're certainly out there.
Not really by grahamsz · 2004-12-21 11:57 · Score: 2, Insightful

It's very easy to make a system self-healing when you are running in a completely controlled environment.

Indeed my TiVo very rarely crashes and always recovers, but the same is also true of every embedded system I've used, be it a cellphone, weather station or alarm system.

Now if I screw around modding my TiVo then it's entirely possible to crash it, and it doesn't recover very well from that...
if by Kanasta · 2004-12-21 11:58 · Score: 2, Insightful

If self healing = MS Office keeps putting another icon in my start menu whenever I start Word, then I don't want self healing.

How many times do I have to move their icons to a submenu before they realise I don't want my root menu cluttered up with crap?
How about systems that I can manually heal first? by grumbel · 2004-12-21 12:00 · Score: 3, Insightful

While a self-healing system sounds nifty, today's systems aren't even good enough to be healed manually.

Uninstalling applications is often not handled by the OS and has to be done by the application itself, resulting in incomplete uninstalls, with config files and registry entries that haven't been properly cleaned up.

Files aren't versioned, so every change made to a file simply erases the former content forever. Not so good if the former content might have been important.

Undelete? Nope, we don't have that either. We have this hack of a Trashcan, but that won't help you much if some program deleted the file.

Checking the integrity of an installed piece of software isn't possible either. Sure, there are third-party solutions, but again that should be something the OS provides by default.

Well, there are millions more reasons why today's systems suck and why it is often easier to simply reinstall from scratch than to try to actually fix the mess. And yep, that is true for Linux, Windows and MacOS alike. Sure, for some more than for the others, but that's it.
Re:As a Tech... by Rew190 · 2004-12-21 12:01 · Score: 2, Insightful

If your future depended on merely fixing computers, it was a bad one in the first place.
Where does it hurt? by Doc+Ruby · 2004-12-21 13:13 · Score: 2, Insightful

How about just systems that fail *verbosely*, so admins can quickly diagnose them? Once the patient can complain properly, we can get to work replacing the admin doctors with "self-healing" metasystems that use those diagnostics. It will be a lot easier just mimicking the best admins' best practices by automating them, than all this screwing around trying to compile marketsprach like "self-healing" without understanding how it even works in nature.

--
--
make install -not war
Re:UNIX is the problem. Tandem was the solution. by upsidedown_duck · 2004-12-21 13:23 · Score: 2, Insightful

Knowing HP, your systems are probably being replaced by Tandem-branded PCs with ECC RAM and software RAID. A rescue DVD will provide instant system rebuilds so downtime is never more than two days.

--
-- "Makes Little Debbie look like a pile of puke!" - Moe Szyslak
Re:How about systems that I can manually heal firs by Qzukk · 2004-12-21 13:33 · Score: 2, Insightful

Files aren't versioned

Undelete?

Check of integrity of an installed piece of software

During the desktop's formative years, the raw drive space needed to actually implement these kinds of things just wasn't available. This is why things like file versioning (popular on large systems like VMS, where the universities/companies running it had the money for the storage requirements) and permanent storage of "unwanted" files just didn't appear.

The third problem is a bit tougher without some extra metadata and hardcore discussions on exactly what should be monitored/done/etc (personally, I don't think this is a kernel-level operation). Something must be stored somewhere so that the system can identify a modified binary. At some time (before change, in which case the operation is stopped? After change? Monthly?) someone (root? file owner? script kiddie currently logged in as root?) has to be notified (syslog? message to terminal? email?) that something (virus? script kiddie? make install? dpkg? rpm?) has altered the (executable? configuration? library? manpage?). As you can see, it's one thing to say "oh yeah the OS should do this" and another entirely to define what this is.
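To make the "something must be stored somewhere" point concrete, here is a minimal user-space sketch in Python, assuming a simple in-memory manifest of SHA-256 digests. The function names (`hash_file`, `build_manifest`, `find_changes`) are made up for illustration, and the sketch deliberately leaves every policy question above (when to check, whom to notify, how) unanswered.

```python
import hashlib
import os


def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Record a known-good digest for each path (hypothetical manifest format)."""
    return {path: hash_file(path) for path in paths}


def find_changes(manifest):
    """Compare files on disk with the manifest.

    Returns a dict mapping each changed path to "modified" or "missing".
    """
    changes = {}
    for path, expected in manifest.items():
        if not os.path.exists(path):
            changes[path] = "missing"
        elif hash_file(path) != expected:
            changes[path] = "modified"
    return changes
```

Calling `build_manifest` right after an install and `find_changes` later would flag altered or removed files, but it cannot say whether the change came from a virus or from `make install`; that is exactly the definitional problem described above.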

The second problem is tough as well, but there are replacements for libc's unlink() function (either as a patch to libc or as an LD_PRELOAD library that overrides libc's function) that move the files to a pre-defined trashcan, and every dynamically linked application will use them.
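The idea behind such an unlink() replacement can be sketched in a few lines. This is a Python analogue for illustration only, not an LD_PRELOAD library and not any particular existing patch; the function name `trash_unlink` and the timestamp-prefix naming scheme are assumptions.

```python
import os
import shutil
import time


def trash_unlink(path, trash_dir):
    """Move path into trash_dir instead of deleting it; return the new location."""
    os.makedirs(trash_dir, exist_ok=True)
    # Prefix with a nanosecond timestamp so same-named files rarely collide.
    name = "%d-%s" % (time.time_ns(), os.path.basename(path))
    target = os.path.join(trash_dir, name)
    shutil.move(path, target)
    return target
```

A real interposer would have to do the same work in C behind the unlink() symbol, and would also need to decide when the trashcan itself gets emptied.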

The first problem is mostly just a lack of demand. Nobody cares, so nobody made a filesystem that can do it. Both ext*fs and reiserfs are extensible (with optional features; reiserfs more so than ext), so if you care, do it yourself. But again there are questions you'll have to be prepared to answer (and since you insist on doing this at the kernel level, you have to have THE answer): If a program writes 1MB to a file 1 byte at a time, is that one million revisions? If you're writing a document and you hit save after every paragraph, is that a revision? How are you going to tell these apart at the kernel level?
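One possible answer to the "one million revisions" question is to count a revision per open/close cycle rather than per write. The Python sketch below illustrates that policy in user space; the class name `VersionedWriter` and the `.~N~` suffix are assumptions made for the example, and it is not how any particular filesystem does it.

```python
import os
import shutil


class VersionedWriter:
    """Write a file, keeping the previous content as a numbered revision on close."""

    def __init__(self, path):
        self.path = path
        self._tmp = path + ".tmp"
        self._handle = open(self._tmp, "wb")

    def write(self, data):
        # Any number of write() calls lands in the same pending revision.
        self._handle.write(data)

    def close(self):
        self._handle.close()
        if os.path.exists(self.path):
            # Preserve the old content under the next free revision number.
            shutil.move(self.path, self._next_revision_name())
        os.replace(self._tmp, self.path)

    def _next_revision_name(self):
        number = 1
        while os.path.exists("%s.~%d~" % (self.path, number)):
            number += 1
        return "%s.~%d~" % (self.path, number)
```

Under this policy a program writing 1MB one byte at a time produces a single revision, while hitting save after every paragraph produces one per save. Whether that is the right answer is still the open question.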

--
If I have been able to see further than others, it is because I bought a pair of binoculars.
worst case? by bird603568 · 2004-12-21 14:40 · Score: 1, Insightful

This sounds good until somebody makes a worm that uses an exploit (for the sake of argument, say one has been or will be found). The worm tricks the server into thinking it is severely messed up, so it orders a boatload of parts or shuts down, or both. The tech shows up and it's just a worm. Now you have these parts and have to pay up. Also the server shut down, so now there's lost time. Did I mention it's a worm, so it spreads? That's just the worst case, but it could be great unless you fix broken servers for a living.
We already have this... by JRHelgeson · 2004-12-21 15:11 · Score: 2, Insightful

The space shuttle, as old as it is, has an absolutely incredible computer system that is self healing.

The Shuttle has many thousands of sensors and backup sensors. Each sensor feeds into one of many computer systems. These computer systems talk to each other more as a committee than by just passing data amongst themselves. If a computer discovers a fault, another computer will see that fault as well. Each will combine data gathered from other computer systems throughout the shuttle, and each computer system will literally cast a vote on what the best solution should be for the particular fault discovered.

If one computer system suffers a partial or complete failure, the remaining systems will work around the failed system.
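The voting idea can be sketched as a plain majority vote. This Python example is a generic illustration of the technique, not the Shuttle's actual algorithm; the function name `vote` and the computer names in the usage note are invented for the example.

```python
from collections import Counter


def vote(readings):
    """Majority-vote over a dict of {computer_name: proposed_value}.

    Returns (winner, dissenters). winner is None when no value is
    proposed by a strict majority; then every computer is listed.
    """
    if not readings:
        return None, []
    counts = Counter(readings.values())
    value, count = counts.most_common(1)[0]
    if count * 2 <= len(readings):
        return None, sorted(readings)
    dissenters = sorted(name for name, v in readings.items() if v != value)
    return value, dissenters
```

With `vote({"gpc1": "shutdown_pump", "gpc2": "shutdown_pump", "gpc3": "ignore"})` the majority proposal wins and `gpc3` is reported as the dissenter, which is the point at which the remaining systems could start working around it.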

This computer system has managed to keep our astronauts alive for every mission, except the two that suffered a catastrophic mechanical failure. In the second of these (Columbia), the computers kept the craft flying until it broke apart completely.

I say not bad for a system designed over 20 years ago!

--
Good security is based upon reality and common sense. Common sense is a function of having common knowledge.