Slashdot Mirror


A Diagnosis of Self-Healing Systems

gManZboy writes "We've been hearing about self-healing systems for a while but, as usual, so far it's more hype than reality. Well, it looks like Mike Shapiro (from Sun's Solaris Kernel group) has been doing a little actual work in this direction. His prognosis is that there's a long way to go before we get fully self-healing systems. In this article he talks a little bit about what he's done, points out some alternative approaches to his own, and describes what's left to do."

10 of 149 comments

  1. The challenge of a truly self-healing system by IO+ERROR · · Score: 3, Funny
    Your operating system provides threads as a programming primitive that permits applications to scale transparently and perform better as multiple processors, multiple cores per die, or more hardware threads per core are added. Your operating system also provides virtual memory as a programming abstraction that allows applications to scale transparently with available physical memory resources. Now we need our operating systems to provide the new abstractions that will enable self-healing activities or graceful degradation in service without requiring developers to rewrite applications or administrators to purchase expensive hardware that tries to work around the operating system instead of with it.

    Neither the applications nor the OS should depend on the other providing any failover or self-healing services; they should always be prepared to go it alone if necessary (as it might be the failover system). Services that crash should restart themselves, etc. This part is pretty well done by most enterprise-grade server software. It's the operating systems we're waiting to play catch-up.

    And I'm still waiting to see any box that can replace its own power supply after someone flips the 115/230 switch. Once we get that, then we'll have truly self-healing systems. And all you BOFHs out there might be looking for a new career...

    --
    How am I supposed to fit a pithy, relevant quote into 120 characters?
    1. Re:The challenge of a truly self-healing system by grahamsz · · Score: 4, Informative

      Plenty of Sun's boxes have redundant power supplies.

      If something goes wrong with one, the system should detect either too little or too much DC voltage or current coming from it, and switch to its backup.

      Your suggestion doesn't make much sense. Should Mozilla know what to do if a USB mouse fails or is removed unexpectedly? Of course not; the Mozilla developers expect that the OS will take care of this.

      Likewise when a correctable memory or disk error occurs... The memory controller or disk firmware should deal with it and the application should be none the wiser.

  2. Had this 3 years ago by shoppa · · Score: 4, Interesting
    According to a documentary movie from 3 years ago, we already had this. HAL 9000 sent an astronaut out to help repair the antenna azimuth control board.

    Which turned out not to be faulty... hmmm...

    Some IBM mainframes are already at this level of self-diagnosis. Where I work, IBM repairmen show up with spare drives for the RAID array when they fail and the array phones IBM to report the fault. We don't know that a drive failed until the field service tech shows up!

    1. Re:Had this 3 years ago by jomas1 · · Score: 3, Interesting

      Some IBM mainframes are already at this level of self-diagnosis. Where I work, IBM repairmen show up with spare drives for the RAID array when they fail and the array phones IBM to report the fault. We don't know that a drive failed until the field service tech shows up!

      Interesting. Where I work this happens too except instead of IBM techs we get sent techs who work for the city and instead of finding out that they were sent for some good reason, 90% of the time it turns out that the techs were sent for no reason. The techs usually don't even know that a machine called in a service request and waste a lot of time asking me why they were called.

      If the future holds more of this I hope I die soon.

  3. TiVo by Radak · · Score: 3, Insightful

    TiVo has had self-healing Linux systems out there for five years now. There are virtually no complaints of TiVo software failure (hard drives certainly go bad from time to time, but very rarely does the OS get itself into a state it can't fix), so the notion that self-healing systems are still years off is silly. They may not be extremely advanced yet, but they're certainly out there.

  4. How about systems that I can manually heal first? by grumbel · · Score: 3, Insightful

    While a self-healing system sounds nifty, today's systems aren't even good enough to be healed manually.

    Uninstalling applications is often not handled by the OS and has to be done by the application itself, resulting in incomplete uninstalls, with config files and registry entries that haven't been properly cleaned up, and so on.

    Files aren't versioned, so every change made to a file simply erases the former content forever, which is not so good if the former content was important.

    Undelete? Nope, we don't have that either. We have this hack of a Trashcan, but that won't help you much if some program deleted the file.

    Checking the integrity of an installed piece of software isn't possible either. Sure, there are third-party solutions, but again, that should be something the OS provides by default.
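
    For what it's worth, the integrity check itself is not complicated. A rough sketch in Python, assuming the installer recorded a manifest of SHA-256 hashes at install time (the function names and the manifest layout here are made up for illustration, not any OS's actual mechanism):

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Compare the files under root against a previously recorded manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "modified": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        "extra": sorted(set(current) - set(manifest)),
    }
```

    Record `build_manifest(install_dir)` at install time, call `verify(install_dir, manifest)` later, and anything listed as missing or modified is a candidate for repair. The sketch says nothing about protecting the manifest itself, which is the hard part.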

    Well, there are millions more reasons why today's systems suck and why it is often easier to simply reinstall from scratch than to try to actually fix the mess. And yep, that is true for Linux, Windows and MacOS alike; more for some than for others, but that's it.

  5. UNIX is the problem. Tandem was the solution. by Animats · · Score: 5, Interesting
    There are operating systems for which "self-healing" is quite feasible, but UNIX is all wrong for it.

    The most successful example is Tandem. For decades, systems that have to keep running have run on Tandem's operating system. For an overview of how they did it, see the 1985 paper "Why Computers Stop and What Can Be Done About It".

    The basic concepts are:

    • All the permanent state is in a database with proper atomic restart and recovery mechanisms.
    • Flat "files" are implemented on top of the database, not the other way round.
    • When applications fail, they are usually restarted completely, with any in-process transactions being backed out.
    • Applications with long-running state are tracked by a watching program on another machine which periodically receives state updates from the first program. If the first program fails, the watching program restarts it from a previous good state.
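
    The last bullet can be sketched in a few lines of Python. This is only a toy, assuming both programs live in one process; `Worker`, `Watcher` and `run_demo` are hypothetical names for illustration, not anything from Tandem's software:

```python
import copy


class Worker:
    """Hypothetical long-running application with in-memory state."""

    def __init__(self, state=None):
        self.state = state if state is not None else {"processed": 0}
        self.alive = True

    def step(self):
        if not self.alive:
            raise RuntimeError("worker has failed")
        self.state["processed"] += 1

    def checkpoint(self):
        # Send a copy so later mutations cannot alter the saved state.
        return copy.deepcopy(self.state)


class Watcher:
    """Hypothetical watcher; in a real system it runs on another machine."""

    def __init__(self):
        self.last_good = None

    def receive(self, snapshot):
        self.last_good = snapshot

    def restart(self):
        # Restart from the last good state; work done since then is lost.
        return Worker(copy.deepcopy(self.last_good))


def run_demo():
    worker, watcher = Worker(), Watcher()
    for i in range(1, 8):
        worker.step()
        if i % 3 == 0:              # periodic state update to the watcher
            watcher.receive(worker.checkpoint())
    worker.alive = False            # simulate a crash after 7 steps
    try:
        worker.step()
    except RuntimeError:
        worker = watcher.restart()  # resume from the checkpoint taken at step 6
    return worker.state["processed"]
```

    Here the restarted worker resumes with a count of 6, not 7: the step made after the last state update is gone, which is why the in-process transactions have to be backed out rather than assumed complete.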

    Every time you use an ATM or trade a stock, somewhere a Tandem cluster was involved.

    Tandem's problem was that they had rather expensive proprietary hardware. You also needed extra hardware to allow for fail-operational systems. But it all really does work. HP still sells Tandem, but since Carly, it's being neglected, like most other high technology at HP.

    1. Re:UNIX is the problem. Tandem was the solution. by rlp · · Score: 4, Interesting

      Tandem had an FT Unix division in Austin. One of the teams I managed was responsible for an embedded expert system that monitored faults in the redundant components of the system. Every component was replicated. Each logical CPU actually consisted of four processors - two pairs running in lock-step. If one CPU in a pair disagreed with its counterpart, the pair would be taken out of service. The expert system monitored transient faults and would "predict" that a component was going to fail, and could take it out of service. The system had a modem that would "phone home" in the event of a component failure, and a service tech would be dispatched with a part - often before the customer knew there was a problem.
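
      A rough software analogy of the two lock-step pairs, in Python. The real thing was done in hardware with both pairs running at once; this sketch tries the pairs in turn, and `LockstepPair` and `LogicalCPU` are hypothetical names for illustration only:

```python
class LockstepPair:
    """Two CPUs that run the same operation and compare results."""

    def __init__(self, cpu_a, cpu_b):
        self.cpus = (cpu_a, cpu_b)
        self.in_service = True

    def execute(self, x):
        a, b = (cpu(x) for cpu in self.cpus)
        if a != b:
            # A pair can tell that it disagrees, but not which CPU is wrong,
            # so the whole pair is taken out of service.
            self.in_service = False
            return None
        return a


class LogicalCPU:
    """One logical CPU backed by several lock-step pairs."""

    def __init__(self, pairs):
        self.pairs = list(pairs)

    def execute(self, x):
        for pair in self.pairs:
            if not pair.in_service:
                continue
            result = pair.execute(x)
            if pair.in_service:
                return result
        raise RuntimeError("no lock-step pair left in service")
```

      With one pair containing a faulty CPU and one healthy pair, `execute` still returns the right answer and the bad pair is marked out of service, which is the point at which the real system would phone home.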

      The machines used MIPS processors (supporting SMP) and ran a Tandem variant of System V UNIX. Combine this with a decent transactional database, and application software capable of check-pointing itself, and you have a very robust system. Albeit a very expensive one.

      Tandem was bought out by Compaq, and then by HP. When I left, Tandem had quite a few interesting ideas they were working on, but near as I can tell, they never saw the light of day.

      --
      [Insert pithy quote here]
  6. One of my self-healing systems by skinfitz · · Score: 4, Interesting

    I have it so that if one of our firewalls detects an attempt to access gator.com, it enrols the machine into an Active Directory system group, which the SMS server queries to automatically de-spyware it with SpyBot.

    I'd call that a self healing system. I'm a network admin though so my perception of these things tends to be on a larger scale.

  7. It's a long way by jd · · Score: 3, Interesting
    ...from what we have now to the Liberator (DSV-2) from Blake's 7, the Ultimate in self-repairing systems. At the moment, most "self-repair" is in the form of software error-correction and bypassing faulty hardware. (The "badmem" patches for Linux do this, for example.)


    The former could be considered self-repair, but it is limited: it doesn't take much of an error to totally swamp most error-correction codes.
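
    To make the "swamping" concrete, here is a small Python sketch using a Hamming(7,4) code. That particular code is my choice for illustration only; real memory and storage systems use other, usually stronger, codes:

```python
def encode(d):
    """Encode 4 data bits as a 7-bit Hamming codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(c):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(c)  # work on a copy so the caller's codeword is untouched
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the suspect bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

    Flip any single bit of a codeword and `decode` hands back the original data. Flip two bits, say the first two of `encode([1, 0, 1, 1])`, and the decoder "corrects" a third, innocent bit and returns wrong data without complaint.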


    The second form isn't really self-repair as much as it is damage control. This is just as important as self-repair, as you can't do much repair work if your software can't run.


    On the whole, "normal" systems don't need any kind of self-repair beyond the basic error-correction codes. Instead, you are likely better off with a "hot fail-over" system: two systems running in parallel with the same data, with one of them kept "silent". Both take input from the same source(s), and so should have identical states at all times, with no synchronization required.


    If the "active" one fails, just "unsilence" the other one and restore the first one's state. If the "silent" one fails, all you do is copy the state over.


    However, computers are deterministic. Two identical machines, performing identical operations, will always produce identical results. Therefore, in order to have a meaningful hot fail-over of the kind described, the two can't be identical. They have to be different enough to not fail under identical conditions, but be similar enough that you can trivially switch the output from one to the other without anybody noticing.


    E.g., a Linux box on an AMD running Roxen and an OpenBSD box on an Intel running Apache would be pretty much guaranteed not to have common points of failure. If you used a keepalive daemon for each box to monitor the other's health, you could easily ensure that only one box was "talking" at a time, even if both were receiving.
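
    A minimal sketch of the keepalive logic in Python, assuming a single monitor object and timestamps passed in by hand instead of real daemons exchanging packets. `Node`, `KeepAlive` and the 3-second timeout are hypothetical, chosen only for illustration:

```python
class Node:
    """One of the two boxes; only the active one is allowed to "talk"."""

    def __init__(self, name, active=False):
        self.name = name
        self.active = active
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        self.last_heartbeat = now


class KeepAlive:
    """Promote the silent node when the active one stops sending heartbeats."""

    def __init__(self, node_a, node_b, timeout=3.0):
        self.nodes = (node_a, node_b)
        self.timeout = timeout

    def check(self, now):
        a, b = self.nodes
        active, silent = (a, b) if a.active else (b, a)
        active_ok = now - active.last_heartbeat <= self.timeout
        silent_ok = now - silent.last_heartbeat <= self.timeout
        if not active_ok and silent_ok:
            # Fail over: silence the dead node, unsilence the healthy one.
            active.active, silent.active = False, True
            active = silent
        return active.name  # the node that should be talking right now
```

    If the active box stops sending heartbeats while the silent one keeps going, `check` swaps their roles once the timeout passes. If both go quiet, nothing is swapped, since promoting a dead box helps nobody.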


    The added complexity is minimal, which is always good for reliability, and the result is as good as or better than any existing software self-repair method out there.


    Now, you can't always use such solutions. Anything designed to work in space, these days, uses a combination of the above techniques to extend the lifetime of the computer. By dynamically monitoring the health of the components, re-routing data flow as needed, and repairing data/code stored in transistors that have become damaged, you ensure the system will keep functioning.


    Transistors get destroyed by radiation quite easily. If you didn't have some kind of self-repair/damage-control, you'd either be using chips with transistors which may or may not work, or you'd have to scrub the entire chip after a single transistor went.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)