Slashdot Mirror


A Diagnosis of Self-Healing Systems

gManZboy writes "We've been hearing about self-healing systems for a while, but (as is usual), so far it's more hype than reality. Well it looks like Mike Shapiro (from Sun's Solaris Kernel group) has been doing a little actual work in this direction. His prognosis is that there's a long way to go before we get fully self-healing systems. In this article he talks a little bit about what he's done, points out some alternative approaches to his own, as well as what's left to do."

149 comments

  1. The challenge of a truly self-healing system by IO+ERROR · · Score: 3, Funny
    Your operating system provides threads as a programming primitive that permits applications to scale transparently and perform better as multiple processors, multiple cores per die, or more hardware threads per core are added. Your operating system also provides virtual memory as a programming abstraction that allows applications to scale transparently with available physical memory resources. Now we need our operating systems to provide the new abstractions that will enable self-healing activities or graceful degradation in service without requiring developers to rewrite applications or administrators to purchase expensive hardware that tries to work around the operating system instead of with it.

    Neither the applications nor the OS should depend on the other providing any failover or self-healing services; they should always be prepared to go it alone if necessary (as it might be the failover system). Services that crash should restart themselves, etc. This part is pretty well done by most enterprise-grade server software. It's the operating systems we're waiting to play catch-up.
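
    A minimal sketch of the "services that crash should restart themselves" idea, assuming the service is just a Python callable. The name `supervise` and its parameters are hypothetical, not taken from any real init system or from the article:

```python
import time
from typing import Callable


def supervise(service: Callable[[], None],
              max_restarts: int = 3,
              base_delay: float = 0.1,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `service`, restarting it with exponential backoff if it raises.

    Returns the number of restarts that were needed. Once `max_restarts`
    is exhausted the last error is re-raised, so the layer above can
    decide what to do (fail over, page someone, and so on).
    """
    restarts = 0
    while True:
        try:
            service()
            return restarts              # clean exit, nothing left to heal
        except Exception:
            if restarts >= max_restarts:
                raise                    # give up and escalate
            sleep(base_delay * (2 ** restarts))
            restarts += 1
```

    The backoff keeps a service that dies instantly from spinning in a tight restart loop, and re-raising after the limit means this layer does not pretend to fix problems that belong to another one.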

    And I'm still waiting to see any box that can replace its own power supply after someone flips the 115/230 switch. Once we get that, then we'll have truly self-healing systems. And all you BOFHs out there might be looking for a new career...

    --
    How am I supposed to fit a pithy, relevant quote into 120 characters?
    1. Re:The challenge of a truly self-healing system by grahamsz · · Score: 4, Informative

      Plenty of Sun's boxes have redundant power supplies.

      If something goes wrong with one, the system should detect either too little or too much DC voltage or current coming from it, and switch to it's backup.

      Your suggestion doesn't make much sense. Should mozilla know what to do if a usb mouse fails or is removed unexpectedly? Of course not, the mozilla developers expect that this will be taken care of.

      Likewise when a correctable memory or disk error occurs... The memory controller or disk firmware should deal with it and the application should be none-the-wiser.

    2. Re:The challenge of a truly self-healing system by Anonymous Coward · · Score: 0

      Missed Joke... nice job.

    3. Re:The challenge of a truly self-healing system by MooseGuy529 · · Score: 1
      Should mozilla know what to do if a usb mouse fails or is removed unexpectedly? Of course not, the mozilla developers expect that this will be taken care of.

      Of course not... the point is not that each layer (peripheral, BIOS, kernel, application) can handle errors in all other layers. The point is that Mozilla should be designed to be able to recover from crashes without help from the kernel, BIOS, or anything else. Likewise, if a USB mouse somehow gets "confused" (protocol-wise) it should take the initiative to re-register itself with the system; if the BIOS notices that the system is frozen (through a watchdog), it should reboot it; if the kernel notices that some part of the BIOS has stopped responding (e.g. IDE bus won't work) it should reboot. The idea is for every layer to be able to recover from errors that spring from its own responsibilities--in other words, MySQL is responsible for keeping a database, so it should be able to do that relatively robustly, but if the underlying media is corrupted, it should not take it upon itself to rebuild the RAID array!

      --

      Tired of free iPod sigs? Subscribe to my blacklist

    4. Re:The challenge of a truly self-healing system by isurge · · Score: 1

      .... my thoughts have always been the computer is a neuron .. so we designed our system (server farm) so that neurons are backed up and if one fails we simply restore it to a fresh server (fully automated -- simply follows the logic path we humans follow) we use DCH or Distributed Cheap Hardware -- aka we go to a local computer store/online and buy the biggest bang for buck stuff(linux only) we can find average server life time is about 2.5 years or about $10 a month --- our only missing part is an automatic purchasing system ... and that is because our suppliers ordering systems vary so much ... my 2 cents .... validation: our servers presently average about 60 connections per second and have went as high as 1400 connections per second (Nov 3rd 2004) - no it is not Google size but that is not our goal.

    5. Re:The challenge of a truly self-healing system by Anonymous Coward · · Score: 0

      switch to it's backup

      "its".

    6. Re:The challenge of a truly self-healing system by Anonymous Coward · · Score: 0

      Far better for the power supply to detect whether the line voltage is at 120 or 240 volts, and whether it is at 50 or 60 hertz, and adjust its configuration accordingly. I'm kind of surprised they don't do this already.

    7. Re:The challenge of a truly self-healing system by Anonymous Coward · · Score: 0

      and have went as high as

      "have gone as".

    8. Re:The challenge of a truly self-healing system by The+Raven · · Score: 1
      Your suggestion doesn't make much sense. Should mozilla know what to do if a usb mouse fails or is removed unexpectedly? Of course not, the mozilla developers expect that this will be taken care of.

      No, but Mozilla could be written to survive a memory access failure. It could be written so that it does not assume that drives and RAM are infallible.

      Likewise when a correctable memory or disk error occurs... The memory controller or disk firmware should deal with it and the application should be none-the-wiser.
      Should, yes. But the application could also be written to survive data losses gracefully.

      This is no different than driving on a highway. Most people can get by MOST of the time by assuming their neighboring cars will drive carefully and sanely. But it is the drivers who are careful and sane THEMSELVES, who do not get into situations where a mistake by another driver will cause them to crash, who are cautious and prudent... those people avoid accidents even when their neighbors are NOT careful or sane.

      And if every application (driver) was cautious and sane, and did not ASSUME that their neighboring applications would never crash, and did not assume the system (roadway) would always be obstruction free, then the highway (computer) could continue to flow smoothly even when problems did occur.

      Self reliance is the one key to self-healing. This self reliance is another form of redundancy... the hardware checks the data, and the driver checks the data, and the OS checks the data, and the application checks the data, and any of them are capable of handling a situation in which the data passed to it is wrong.

      Raven
      --
      "I will trust Google to 'do no evil' until the founders no longer run it." Hello Alphabet.
  2. As a Tech... by ZSpade · · Score: 0

    So in the future, instead of out-sourcing our tech jobs to India, they'll simply in-source them to the computer itself.

    What do I care, I'm out a job either way.

    --
    Go ahead and call me unreliable; reliable is just a synonym for predictable.
    1. Re:As a Tech... by Rew190 · · Score: 2, Insightful

      If your future depended on merely fixing computers, it was a bad one in the first place.

    2. Re:As a Tech... by ZSpade · · Score: 0

      True, glorified service men, that's what we are. We all have dreams, I wanna be a writer, I'm just pointing out that one of my day jobs might be disappearing in the not so distant future.

      --
      Go ahead and call me unreliable; reliable is just a synonym for predictable.
    3. Re:As a Tech... by StikyPad · · Score: 1

      Don't be such a pious jackass. Aside from your questionable use of the past tense in reference to the future.. Simply because someone picks a different career path or has a different vision of what they want out of life than your own, doesn't make their future bad. Not everyone has the luxury of picking a trade with future job security -- some people just need jobs, now. If people weren't teaching your children, paving your roads, constructing your buildings, fixing your car, or whatever other occupation you consider to lead to a "bad future," you'd be doing these things yourself or going without.

    4. Re:As a Tech... by Rew190 · · Score: 1

      With all due respect, it sounds like you had a knee-jerk reaction. Keep in mind we're talking about a computer-repair career. Are you denying that computers are becoming more and more reliable? That software is generally being made easier to use for an end user and more stable? Are you denying that as years pass, more and more households will either be computer literate enough to fix their own computers or know someone else who can? That computers are becoming ridiculously cheap (read: disposable)?

      Do you believe these trends will reverse themselves in the future?

      Do you honestly believe that getting into a career just fixing computers (I used the term "merely" to emphasize that, NOT to belittle the ability to repair a computer) is a good idea? If one chooses to be ignorant of these facts and dive in anyhow that's one thing, but it's another thing to blame the market one might have wanted to get into because he couldn't see it happening right in front of himself.

      If people weren't teaching your children, paving your roads, constructing your buildings, fixing your car, or whatever other occupation you consider to lead to a "bad future," you'd be doing these things yourself or going without.

      Of course, absolutely. We need roads, we need teachers, we need construction workers, we need mechanics. Believing that you can start a career in this day and age in computer repair when computers are nearly disposable and will only run more and more stable over time is naive. If you didn't get the gist of it, substitute "bad future" with "unlikely successful longer-term career."

    5. Re:As a Tech... by Rew190 · · Score: 1

      Absolutely, it sucks that market is disappearing from a monetary perspective, but it's cool to see the technology/society (in the form of computer literacy) evolving with it.

      I got the impression from your original post that you want to get into repair as a career, sorry for the misinterpretation.

    6. Re:As a Tech... by StikyPad · · Score: 1

      Computer repair in the present consists largely of replacing cards and/or peripherals. I don't see the need for that, especially in large corporations, going away any time soon. The term "component replacement" has evolved to mean swapping out a hard drive rather than replacing a failed or dirty head, but at some level repairs will always be required. And there are still people who troubleshoot specialized computers down to the component level, because it's still more cost effective to spend 16 hours hunting down a faulty capacitor than replacing a $10,000 board. Or, in some cases, systems are simply obsolete and no board replacement is possible, and no system has been designed to replace the functionality of the equipment which is no longer manufactured. True, a larger skillset is almost always beneficial in the long term, and may make it easier to find a job; however some people may give a higher priority to finding satisfaction from their work than the possible monetary benefits. As programmers and engineers are becoming more and more ubiquitous, this philosophy is seen more and more in things like OSS.

    7. Re:As a Tech... by ZSpade · · Score: 0

      Naw, it totally sounded like it. As things go, more and more jobs will be replaced by technology, and if computers are able to fix themselves, well then our society is gonna have to change, a lot.

      --
      Go ahead and call me unreliable; reliable is just a synonym for predictable.
    8. Re:As a Tech... by Rew190 · · Score: 1

      I don't see the need for that, especially in large corporations, going away any time soon.

      I definitely do. A job like that is more for a computer engineer/systems administrator who takes care of things like that between his other duties, i.e. maintaining the network, analyzing security, and other things that are generally outside the scope of a computer repair dude. I can't really see companies hiring full time guys who only do computer-repair. Also keep in mind the speed of computers and their prices; most companies aren't going to be doing a lot of frantic upgrading for running MS Office. I can't see too many needs to upgrade hardware given what most users will be doing.

      And there are still people who troubleshoot specialized computers down to the component level, because it's still more cost effective to spend 16 hours hunting down a faulty capacitor than replacing a $10,000 board.

      Yes, but those people are going to be computer engineers or admins, not people who ONLY do computer repair.

      True, a larger skillset is almost always beneficial in the long term, and may make it easier to find a job; however some people may give a higher priority to finding satisfaction from their work than the possible monetary benefits.

      Yeah, absolutely. I was looking at it from a monetary/job availability view.

    9. Re:As a Tech... by Anonymous Coward · · Score: 0

      Too bad we'll eventually have computers do our writing for us too.

  3. Had this 3 years ago by shoppa · · Score: 4, Interesting
    According to a documentary movie from 3 years ago, we already had this. HAL 9000 sent an astronaut out to help repair the antenna azimuth control board.

    Which turned out not to be faulty... hmmm...

    Some IBM mainframes are already at this level of self-diagnosis. Where I work, IBM repairmen show up with spare drives for the RAID array when they fail and the array phones IBM to report the fault. We don't know that a drive failed until the field service tech shows up!

    1. Re:Had this 3 years ago by Anonymous Coward · · Score: 0

      IBM ~ HAL

      UH OH!

    2. Re:Had this 3 years ago by Anonymous Coward · · Score: 0

      According to a documentary movie from 3 years ago, we already had this. HAL 9000 sent an astronaut out to help repair the antenna azimuth control board.

      Which turned out not to be faulty... hmmm...


      Thus proving it's at least as easy to program artificial hypochondria as it is artificial intelligence.

      "I'm sorry, I can't do that, Dave. It hurts too much. I think I'm coming down with a virus."

    3. Re:Had this 3 years ago by jomas1 · · Score: 3, Interesting

      Some IBM mainframes are already at this level of self-diagnosis. Where I work, IBM repairmen show up with spare drives for the RAID array when they fail and the array phones IBM to report the fault. We don't know that a drive failed until the field service tech shows up!

      Interesting. Where I work this happens too except instead of IBM techs we get sent techs who work for the city and instead of finding out that they were sent for some good reason, 90% of the time it turns out that the techs were sent for no reason. The techs usually don't even know that a machine called in a service request and waste a lot of time asking me why they were called.

      If the future holds more of this I hope I die soon.

    4. Re:Had this 3 years ago by Anonymous Coward · · Score: 0

      It's a conspiracy I tells ya. I mean we all know the IBM shifted right thing, and that Arthur C. Clarke says that it is just coincidence, but come on, and now this article.

      *puts tinfoil hat on*

    5. Re:Had this 3 years ago by Aardpig · · Score: 1

      HAL 9000 sent an astronaut out to help repair the antenna azimuth control board.

      Unfortunately, the astronaut (one 'Dave') wasn't able to comply, because HAL refused to open the pod-bay door.

      --
      Tubal-Cain smokes the white owl.
    6. Re:Had this 3 years ago by NeuralAbyss · · Score: 1

      HAL9000? Bah.. they should've used a Mac.. it would have been compatible with the iPod bay door..

    7. Re:Had this 3 years ago by rednaxel · · Score: 2, Interesting
      I did R & D for an elevator factory 12 years ago, and back then we made a box that called home when something went wrong. The system scanned some critical points of the circuit and, if the readings were not in the expected pattern, an external modem was used to call the maintenance and send a full report of the readings, indicating the cause of the failure.
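
      A rough sketch of that scan-and-call-home loop, in Python rather than the 80C51 assembly of the original box. The sensor names, the ranges in `EXPECTED`, and the `send` callback standing in for the modem are all made up for illustration:

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical sensor names with their acceptable (low, high) ranges.
EXPECTED: Dict[str, Tuple[float, float]] = {
    "door_close_speed": (0.0, 0.3),
    "motor_current": (2.0, 15.0),
}


def diagnose(readings: Dict[str, float]) -> List[str]:
    """Return one fault line per reading that is missing or out of range."""
    faults = []
    for name, (low, high) in EXPECTED.items():
        value = readings.get(name)
        if value is None:
            faults.append(f"{name}: no reading")
        elif not low <= value <= high:
            faults.append(f"{name}: {value} outside [{low}, {high}]")
    return faults


def call_home(readings: Dict[str, float],
              send: Callable[[dict], None]) -> bool:
    """Send a full report only when something looks wrong."""
    faults = diagnose(readings)
    if faults:
        send({"faults": faults, "readings": dict(readings)})
    return bool(faults)
```

      Shipping the complete set of readings along with the faults is what lets the service desk guess at the cause before anyone drives out.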

      For example, a broken door sensor could make the door fail to slow down when closing, and the only symptom would be the louder sound of the door slamming. However, in a few days other parts would be damaged, increasing the cost of the repair and rendering the elevator out of service.

      The tech could get in the building before the elevator stopped working. According to the marketing guys, it would give us an image of excellence in hardware and service.

      All this was written in 80C51 Assembly using less than 16 Kb. The PC code for the field service central was written in C, and featured a nice EGA graphic (640x350 in 4 pages) of the electric circuit. In real-time mode (when the central called the elevator) the graph could show the relays, interruptors, buttons, etc all animated. We could even tell how many people entered the elevator by the number of times the door sensor was activated, or which buttons were pushed. Cool!

      --
      If you can read this, thank an english teacher.
    8. Re:Had this 3 years ago by ozmanjusri · · Score: 2, Funny

      If the future holds more of this I hope I die soon.

      Your support request has been logged and a field technician has been sent to solve your problem.

      Thank you for using IBM.

      --
      "I've got more toys than Teruhisa Kitahara."
  4. One system this will never work on by mboverload · · Score: 0, Insightful

    This will never work on Windows. With all the registry crap it has, I don't see anything like this working. The registry is a nightmare to fix if anything goes wrong, it is ALWAYS easier to reinstall. In fact, I'm reinstalling XP tomorrow because of all the crap and bugs it has accumulated. I do this at least twice a year, and it's a shame.

    1. Re:One system this will never work on by Anonymous Coward · · Score: 0

      sounds like an end user problem

    2. Re:One system this will never work on by Anonymous Coward · · Score: 1, Funny

      Perhaps that's a feature Micro$oft are going to provide in Longhorn: self-reinstallation. :p

    3. Re:One system this will never work on by Anonymous Coward · · Score: 0

      Um..

      Use XP's System Restore. You can roll back to a previous working state.

      This will fix any registry issues you have.

      Personally I'm a Linux fan. I like it a lot. But you can tweak a Windows platform [2000 or XP] to be pretty darn stable if you've a mind to. In my own experience, I find I have to reinstall Mac OS X far more often than Windows, simply because of the kinds of software I run on it (Poisoned and co.). Tends to be the same thing for Windows.

    4. Re:One system this will never work on by jproudfo · · Score: 1

      They already do this. Check out the OS deployment feature pack for SMS 2003. :)

      http://www.microsoft.com/downloads/details.aspx?FamilyId=3E51FD48-C412-48C9-942D-648914C2759E&displaylang=en

    5. Re:One system this will never work on by TapeCutter · · Score: 1

      As someone who works on commercial monitoring software I can say from experience that HKEY_PERFORMANCE_DATA is full of design holes and bugs. Reinstalling won't help and it is not all MS's fault, many of the problems come from third party DLLs (eg: Eventlog source "perflib", error 1008). However MS's overblown design does make it worse and makes it hard to pinpoint many problems with the performance counters themselves. On a severely degraded Windows machine, just when you are most likely to be looking at performance, the counters will turn themselves off! Permanently!!! Well maybe not permanent but at least until you use the secret decoder ring and regedit (methodical FULL re-install would help here).

      *nix systems? Version nightmares, way too many inconsistencies to even start complaining.

      Mainframes? They are the "Buddhist Monks" of computer self contemplation. This explains why it is supported by IBM but practically ignored by MS. Whoever gets it right first will own the farm (pun intended).

      --
      And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
  5. I've had plenty of luck with self healing by Neil+Blender · · Score: 1

    I use a little method I like to call the crontab coupled with shell scripts.
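
    One way the cron-driven approach might look, sketched in Python instead of shell. It assumes the watched service touches a heartbeat file while it is healthy; the file path, the `restart` callback, and the crontab entry in the comment are all hypothetical:

```python
import os
import time
from typing import Callable, Optional

# Hypothetical crontab entry to run the check every five minutes:
#   */5 * * * * /usr/bin/python3 /usr/local/bin/heal.py


def heartbeat_is_stale(path: str, max_age: float,
                       now: Optional[float] = None) -> bool:
    """True if the heartbeat file is missing or older than max_age seconds."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) > max_age
    except OSError:
        return True                      # no file at all counts as dead


def check_and_heal(path: str, max_age: float,
                   restart: Callable[[], None],
                   now: Optional[float] = None) -> bool:
    """Call restart() if the service looks dead; report whether it was called."""
    if heartbeat_is_stale(path, max_age, now):
        restart()
        return True
    return False
```

    In a real script `restart` would shell out to whatever restarts the service; it is left as a callback here so the check can be tried without touching anything.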

  6. Self-Healing Systems by MonkeyCookie · · Score: 1

    It looks like the T1000 won't be appearing any time soon: at least not until Skynet comes online.

  7. Self-healing == human brain by Anonymous Coward · · Score: 0

    "And I'm still waiting to see any box that can replace its own power supply after someone flips the 115/230 switch. Once we get that, then we'll have truly self-healing systems"

    You may know of the JTAG boundary scan, originally implemented as a form of system test to go beyond the limits of 'nail bed' hardware testing. Developing on this circa 1993 were several ideas that extended the hardware self-test model to include self-heal redundancy, such that components that failed the test could switch out and switch in a secondary circuit. What I am reading here is encouraging: instead of having a separate 'test and monitoring' circuit for the 'main' device, the device itself (and its software) becomes a coarse grain distributed system in which each node can both operate as part of the device's global behaviour and monitor (test) and switch other nodes within the system. There is one other device that uses this highly distributed and non-specialised approach to fault tolerance; we call it the brain.

  8. TiVo by Radak · · Score: 3, Insightful

    TiVo has had self-healing Linux systems out there for five years now. There are virtually no complaints of TiVo software failure (hard drives certainly go bad from time to time, but very rarely does the OS get itself into a state it can't fix), so the notion that self-healing systems are still years off is silly. They may not be extremely advanced yet, but they're certainly out there.

    1. Re:TiVo by daeg · · Score: 1

      Yes, self healing and reactive systems exist for specific implementations, but not as a whole. If the kernel or some low level drivers in a TiVo went bad somehow, I doubt the TiVo could repair that. I doubt we will see self-healing desktop boxes or general purpose servers for at least a few more years. I think that is what the article is trying to drive home.

    2. Re:TiVo by Daniel+Ellard · · Score: 1
      That's a nice accomplishment, but it's not the same thing. TiVo has complete control over the software they run on their boxes, and I'm sure they test it quite carefully before shipping. This isn't unique to TiVo; you could say the same thing about the software that runs in your cell phone, DVD player, etc.

      These people are in a different domain; they don't know what apps their system will run, or what mistakes the sysadmin will make, or what worms someone will write next month -- they're preparing a reactive defense against the unknown.

      --
      Disclaimer: I work for a company, but I don't speak for them.
    3. Re:TiVo by Anonymous Coward · · Score: 0

      Someone at work gave me a TiVo that, after an "upgrade" would no longer boot (green screen of death). I attempted to fix it by putting in a different disk and loading a fresh image on it. It worked for a while, but eventually gave the green screen of death again. So much for self-healing.

    4. Re:TiVo by sysadmn · · Score: 1

      There's a big, big difference between designing a system that can do one thing well, and a system that can do many things well. Worst case with TiVo, it restarts itself and you wonder why a few minutes of your program didn't record. Worst case with Solaris, your system restarts and everyone wonders where eBay went :-0

      --
      Envy my 5 digit Slashdot User ID!
    5. Re:TiVo by rainwalker · · Score: 1

      Must be nice...my DishPVR 501 crashes all the time. Normally if you're careful you can avoid it, but woe to the user that tries to change timer settings while it's recording something, or elects to stop watching something as it's being recorded and goes to the PVR menu. I can crash it about 30% of the time doing either of these common activities. Stupid thing takes almost 4 minutes to reboot, too, so you miss a big chunk of whatever you were trying to watch, plus it commonly crashes again because it gets confused about being in the middle of a timer.

      Of course, I have no idea what OS it uses, but I suspect it's unpatched Win95 from its behavior.

    6. Re:TiVo by Anonymous Coward · · Score: 0
      There are virtually no complaints of TiVo software failure (hard drives certainly go bad from time to time, but very rarely does the OS get itself into a state it can't fix), so the notion that self-healing systems are still years off is silly. They may not be extremely advanced yet, but they're certainly out there.

      There is a big difference between a simple system that doesn't break, and a complex system that is capable of "self-healing." Don't confuse "good" or "reliable" with "self-healing."

      Actually, I think we can see from this how companies that don't implement self-healing features will dilute the terminology...
  9. GNU Hurd by Anonymous Coward · · Score: 0

    This seems to be the closest current system design that could possibly allow a self healing system when the hurd is setup as a cluster of servers.

  10. the real question is by Anonymous Coward · · Score: 0

    if someone wrote a virus that exploited a vulnerability in KDE, would they call it "the Klap?"

  11. Remote Monitoring by grahamsz · · Score: 1

    Is a little bit different from self-healing, but they are in the same vein.

    I believe Sun are working on systems that will attempt to spot failure trends, so they can proactively identify other customers who may run into similar problems and then either have the system fix itself or send someone out to deal with it.

    The other mindset i've seen with RAID disks, is why bother replacing them. Disks are getting to the point that it's probably cheaper just to leave the dead one in there and power up a spare than to dispatch someone to install a new one.

    1. Re:Remote Monitoring by gomoX · · Score: 1

      Actually, decent RAID systems have hot spares installed. When a disk dies, the hot spare takes its place - you just remove the dead drive in order for the replacement to become a new hot spare, not to get the array back to its original state.
      You can't just "not replace dead drives" unless you have like 400 SCSI controllers on your machine, therefore providing an insane amount of hot spares for future failures.
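
      A toy model of the hot-spare bookkeeping, just to make the point about finite spares concrete. The `Array` class and the disk names are invented for this sketch and do not model any real controller:

```python
from typing import List


class Array:
    """Toy RAID set: active disks plus a small pool of hot spares."""

    def __init__(self, active: List[str], spares: List[str]):
        self.width = len(active)         # how many disks a healthy array has
        self.active = list(active)
        self.spares = list(spares)
        self.dead: List[str] = []

    def fail(self, disk: str) -> bool:
        """Mark a disk dead; return True if a hot spare took its place."""
        idx = self.active.index(disk)
        self.dead.append(disk)
        if self.spares:
            self.active[idx] = self.spares.pop(0)
            return True
        self.active.pop(idx)             # no spare left: array runs degraded
        return False

    def replace_dead(self, new_disk: str) -> None:
        """Swap a dead drive for a new one during routine maintenance."""
        if not self.dead:
            raise ValueError("no dead drive to replace")
        self.dead.pop(0)
        if len(self.active) < self.width:
            self.active.append(new_disk)  # degraded array: rebuild onto it
        else:
            self.spares.append(new_disk)  # healthy array: becomes a hot spare
```

      With one spare, the first failure is absorbed and the second leaves the array degraded, which is why the dead drives still have to come out eventually.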

      --
      My english is sow-sow. Sowhat?
    2. Re:Remote Monitoring by grahamsz · · Score: 1

      But replacing dead drives can now become part of routine maintenance instead of paging out an engineer.

  12. My fav self-healing method by xv4n · · Score: 1
    1. Re:My fav self-healing method by CmdrObvious · · Score: 0

      My favorite self-healing method is to take two aspirin, and call myself in the morning. It works every time, and saves myself thousands in doctor bills. ...wait, that is self-diagnostic too! I am saving even more!!

    2. Re:My fav self-healing method by ThisNukes4u · · Score: 1

      Sorry, but that only works with windows.

      --
      thisnukes4u.net
    3. Re:My fav self-healing method by Frogbert · · Score: 1

      What the hell? INT 18 calls BASIC if it is in ROM... Why the hell don't BIOSes have that anymore? I feel ripped.

    4. Re:My fav self-healing method by lachlan76 · · Score: 1
      You didn't read the description, did you ;)

      DESCRIPTION : This function reboots the system. It happens that this function crashes the computer instead of rebooting it. This is especially true with resident programs. A more secure way to do a warm reboot is to put the value 1234h at memory location 0040:0072h and then to make a far jump to FFFF:0000h. To do a cold reboot, put the value 0000h at 0040:0072h instead of 1234h.
  13. The blue button by Otter · · Score: 1
    Networks and servers, they tell me, can self-defend, self-diagnose, self-heal, and even have enough computing power left over from all this introspection to perform their owner-assigned tasks.

    After repeated viewing of those Thinkpad commercials where the techs tell the hysterical PHB to press the blue button on startup and thereby enable IBM to magically resurrect his hard drive, I summoned up the courage to try it. (The curly haired guy in those ads is also in one of my favorite commercials ("Please stay on the line...Playing with the queen of hearts...") so I figured I owed it to him.)

    I gingerly pressed the button on my T40 and -- nothing happened. Maybe it's the firewall.

    Yikes, now all the people who flipped out last month when I said I prefer my TiBook to the T40 are going to go even more nuts! "You ignorant asshole troll, don't you know the only thing that matters on a laptop is a magic blue button...!"

    1. Re:The blue button by 0racle · · Score: 1

      Apples do have a magic button to access the HDD if the OS can not boot.

      --
      "I use a Mac because I'm just better than you are."
    2. Re:The blue button by Anonymous Coward · · Score: 0

      You need to have the IBM support software installed. It's not very good though.

    3. Re:The blue button by Anonymous Coward · · Score: 0

      open apple - S

      and then someone screams "ARRRRRRRRRRRRRRRRG"

      because single usermode is scary.

    4. Re:The blue button by 0racle · · Score: 1

      Single User Mode exists in any Unix/Unix-like OS. I was talking about when the OS would refuse to start at all, no single user, no login, nothing. The magic button I was thinking of is the 't' key, which enters target disk mode; you would also need a FireWire cable and another Mac.

      --
      "I use a Mac because I'm just better than you are."
  14. Not really by grahamsz · · Score: 2, Insightful

    It's very easy to make a system self-healing when you are running in a completely controlled environment.

    Indeed my TiVo very rarely crashes and always recovers, but the same is also true of every embedded system i've used - be it a cellphone, weather station or alarm system.

    Now if i screw around modding my tivo then it's entirely possible to crash it and it doesn't recover very well from that...

    1. Re:Not really by arkanes · · Score: 1

      I had a cellphone once that would crash regularly. Some crappy samsung thing, I think. Drove me batty.

    2. Re:Not really by static0verdrive · · Score: 1

      Exactly. People don't go installing software on their TiVos weekly, or surfing the net with IE, or opening e-mail attachments with malicious code designed for that OS...etc. It's almost outside the realm of comparables.

      --
      ========
      77 77 77 2e 6d 65 6c 76 69 6e 73 2e 63 6f 6d
  15. if by Kanasta · · Score: 2, Insightful

    if self healing = ms office keeps putting another icon in my start menu whenever I start word, then I don't want self healing.

    How many times do I have to move their icons to a submenu before they realise I don't want my root menu cluttered up with crap?

    1. Re:if by Anonymous Coward · · Score: 0

      Well, if we are talking about a self-healing Windows OS, then yes. If we are talking about a self-healing computer, then I imagine the first thing it will do is wipe out the Windows partition and install Linux.

    2. Re:if by elid · · Score: 0, Troll

      Does that make Windows self-destructive?

    3. Re:if by upsidedown_duck · · Score: 1

      if self healing = ms office keeps putting another icon in my start menu whenever I start word

      It's much better than that. Self-healing means that disks in a RAID array can detect corrupted blocks of data using checksums and correct them from good mirrors on-the-fly. With multiple mirrors with checksums proving whether there is a problem or not, corrupt data files should be a thing of the past (on systems with RAID). It seems failing drives would be detected sooner, also.
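
      A minimal Python sketch of that read path, assuming each block has a checksum stored separately from the mirrors. The names `checksum` and `read_and_repair` and the in-memory "disks" are hypothetical, not any real RAID or filesystem API:

```python
import hashlib
from typing import List


def checksum(block: bytes) -> str:
    """Checksum of one block (SHA-256 here; the choice of hash is an assumption)."""
    return hashlib.sha256(block).hexdigest()


def read_and_repair(mirrors: List[List[bytes]], checksums: List[str], index: int) -> bytes:
    """Return a verified copy of block `index` and overwrite any corrupt mirror copy."""
    good = None
    for disk in mirrors:
        if checksum(disk[index]) == checksums[index]:
            good = disk[index]
            break
    if good is None:
        # Every mirror failed verification: nothing left to heal from.
        raise IOError(f"block {index}: no mirror holds a valid copy")
    for disk in mirrors:
        if disk[index] != good:
            disk[index] = good  # repair on the fly from the good mirror
    return good
```

      The read fails only when every mirror holds a bad copy of the same block; otherwise the caller never sees the corruption.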

      --
      -- "Makes Little Debbie look like a pile of puke!" - Moe Szyslak
    4. Re:if by Anonymous Coward · · Score: 0

      I think I speak for all of us when I say:

      What the hell are you talking about?

    5. Re:if by Aighearach · · Score: 1
      How many times do I have to move their icons to a submenu before they realise I don't want my root menu cluttered up with crap?

      However many it takes you to realize the problem is user error and upgrade to an OS that works well.

      If you compare computers to biological systems (something people should spend more time doing; the biological systems are usually more robust) then self-healing is something like the concept of "radiant health." Before you worry about that, first you have to reach a state of health. If you're sick, you've already screwed up. Windows crashes a lot, and generally has a low level of health. Look at average downtime. That's downtime, not to mention less severe chronic illnesses. Radiant health isn't really in the future here.

      Whereas when IBM talks about self-healing systems, well, their software is pretty darn healthy, solid. Like a well-disciplined monk, perhaps they are indeed in a position to be seeking Radiant Health.

  16. How about systems that I can manually heal first? by grumbel · · Score: 3, Insightful

    While a self-healing system sounds nifty, today's systems aren't even good enough to be healed manually.

    Uninstalling applications is often not handled by the OS and has to be done by the application itself, resulting in incomplete uninstalls, with config files and registry entries that haven't been properly cleaned up.

    Files aren't versioned, so every change made to a file simply erases the former content forever, which is not so good if the former content was important.

    Undelete? Nope, we don't have that either. We have this hack of a Trashcan, but that won't help you much if some program deleted the file.

    Checking the integrity of an installed piece of software isn't possible either. Sure, there are third-party solutions, but again that should be something the OS provides by default.

    Well, there are millions more reasons why today's systems suck and why it is often easier to simply reinstall from scratch than to try to actually fix the mess. And yep, that is true for Linux, Windows and MacOS alike; sure, for some more than for the others, but that's it.

  17. Sod the stupid machines by Timesprout · · Score: 1

    If they break they can be fixed or replaced. I want self-healing (preferably Wolverine type) for me.

    --
    Do not try to read the dupe, thats impossible. Instead, only try to realize the truth
    What truth?
    There is no dupe
  18. Reset button by mboverload · · Score: 2, Interesting

    I don't know why Windows doesn't just have a reset button for all the settings to return it to its original condition. It's a bitch to reinstall it twice a year, you know.

    1. Re:Reset button by Anonymous Coward · · Score: 0

      XP has system restore which does exactly that.

  19. Actually... by Anonymous Coward · · Score: 0

    I recently bought a refurbished amd64 HP computer on the cheap.

    I was surprised and almost impressed by the "self-healing" nature of the pre-installation.

    When booting the computer, you can press F10 to tell the BIOS to get the customized OS files on the hidden partition of the hard drive and overwrite the existing files on C:. This can also be done right from Windows by clicking an icon. Note that it could well be a chemotherapy type of self-healing (with the registry especially)!

    I promptly installed Gentoo though, so I wonder if I can make pressing F10 do emerge sync; emerge world automatically!

    HP computers are probably a good purchase for regular Windows home users.

  20. I'm confused by SpeedBump0619 · · Score: 1

    Just out of curiosity, can anyone define what the precise difference is between "Fault Tolerance" and "Self Healing"?

    Explain to me how any of the failure responses I see discussed in the article or in these discussions qualifies as "Healing"? Almost all fault tolerant systems isolate failing components or programs from the rest of the system (killing rogue processes counts as isolation). Quarantine is not an attempt to heal, it is an attempt to tolerate. Are there actually any non-quarantine "self healing" systems out there today?

    1. Re:I'm confused by Anonymous Coward · · Score: 0

      Just out of curiosity, can anyone define what the precise difference is between "Fault Tolerance" and "Self Healing"?

      I would consider "fault tolerance" to mean that if something breaks the system keeps running, but the component remains broken. So if you're considering software, you have databases running on two hosts and if one segfaults all requests go to the other one. The first is still down until an admin restarts it.

      Self healing is that the database segfaults, the system notices it (perhaps through a monitoring process), and restarts it (if it's determined that doing so won't corrupt data).

      "Fault tolerance" is having redundancy; "self healing" means that the system takes care of itself so that an outside entity (like an admin) doesn't have to.

      At least, this is how I understand these terms.
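
      To make the distinction concrete, here is a rough Python sketch of the "notice it and restart it" half. Everything in it is an assumption for illustration: `supervise` is a hypothetical helper, the service is modeled as a plain callable that raises when it crashes, and `safe_to_restart` stands in for whatever check decides that a restart won't corrupt data.

```python
def supervise(service, max_restarts=3, safe_to_restart=lambda: True):
    """Run `service`; on a crash, restart it without waiting for an admin.

    Returns the number of restarts that were needed. Gives up (so a human
    can be paged) after `max_restarts` or when a restart is judged unsafe.
    """
    restarts = 0
    while True:
        try:
            service()
            return restarts
        except Exception as crash:
            if restarts >= max_restarts or not safe_to_restart():
                raise RuntimeError("giving up; page an admin") from crash
            restarts += 1
```

      A purely fault-tolerant setup would stop at routing requests to the second host; the loop above is the part that brings the broken component back.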

    2. Re:I'm confused by MickLinux · · Score: 1

      I think that when most people talk about self-healing, they mean fault tolerant. An example is the Tandem systems mentioned a little below this. Yet I also think that self-healing and fault tolerance are a bit the same.

      However, if you want to understand what self-healing really means (and does not mean), consider that our DNA are self-healing.

      Now, I do not claim to understand the mechanisms whereby the DNA is self healing. I am aware that there is a recent article that points out how the DNA breaks get healed after damaging radiation.

      http://story.news.yahoo.com/news?tmpl=story&cid=571&ncid=751&e=1&u=/nm/20041220/hl_nm/tech_mobilephone_health_dc

      I might imagine that the DNA causes the production of RNA. In the event of a break, the RNA is then used to repair the DNA. But if the wrong strand of RNA is used, you get an error.

      But those DNA breaks are healed with errors intact, sometimes.

      Moreover, with bacteria at least, the healing actually causes the bacteria to reengineer its own DNA, so that poisons become food. How this happens, I cannot even imagine.

      How might we do this with computers? By overly stuffing memory and HDD with redundant data records, and then in the event of a crash, trying to put the data pieces back together. Even better, it might be interesting to use P2P plus online storage to have computers back each other up. That is, suppose you use a RAID system. Then when your computer crashes, it guesses the gap length, then goes to other drives (and also gets on the internet and asks other computers) to look for identical strings. Then it stuffs the strings, and checks to see if the system works.

      --
      Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
    3. Re:I'm confused by segfaultcoredump · · Score: 2, Informative

      Fault Tolerance implies the ability to not just detect the fault (i.e. a failed CPU), but to keep the processes running as if nothing happened. This is possible with Stratus and Tandem boxes. It is generally not possible with common x86/Power/SPARC boxes (unless you put a lot of software on top of two boxes to make them look like one big virtual system).

      "Self Healing", in this context, is the system's ability to detect a fault (hardware or software), deal with it (restart a process, isolate hardware, etc.) and then get on with life (in a possibly degraded mode). In a way, the venerable Veritas Cluster System is an example of a "self healing" system (it detects a failure of a service group and restarts it, on another node if needed).

      Note that with "self healing" systems, the process may die, and end users may notice a failure. But the system is 'back online' sooner than if it required manual intervention. Compare this to a Fault Tolerant system that never went down in the first place.

    4. Re:I'm confused by bo0ork · · Score: 1

      Memwatch will repair its data structures if they're thrashed. http://en.wikipedia.org/wiki/Memwatch

      --
      Does everything include nothing?
    5. Re:I'm confused by Anonymous Coward · · Score: 0

      Fault Tolerance: You have two legs. You break one. You can hop.

      Self Healing: You have two legs. You break one. You delete the broken leg and generate a new one. You start running again.

  21. Similar to IBM's Autonomic Computing by bhadreshl · · Score: 2, Informative

    Well, this seems to be where computing services are heading: IBM is doing extensive research on Self-Configuring, Self-Healing, Self-Optimizing, and Self-Protecting computing systems, which it calls 'Autonomic'.

    Check out: Autonomic Computing

  22. UNIX is the problem. Tandem was the solution. by Animats · · Score: 5, Interesting
    There are operating systems for which "self-healing" is quite feasible, but UNIX is all wrong for it.

    The most successful example is Tandem. For decades, systems that have to keep running have run on Tandem's operating system. For an overview of how they did it, see the 1985 paper Why Computers Stop and What Can Be Done About It.

    The basic concepts are:

    • All the permanent state is in a database with proper atomic restart and recovery mechanisms.
    • Flat "files" are implemented on top of the database, not the other way round.
    • When applications fail, they are usually restarted completely, with any in-process transactions being backed out.
    • Applications with long-running state are tracked by a watching program on another machine which periodically receives state updates from the first program. If the first program fails, the watching program restarts it from a previous good state.
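
    The last bullet can be sketched in a few lines of Python. This is only an illustration of the checkpoint-and-restart idea, not Tandem's actual mechanism: `Watcher` and `run_job` are hypothetical names, and the "other machine" is just an object in the same process.

```python
class Watcher:
    """Stands in for the watching program on another machine."""

    def __init__(self):
        self.last_good = None

    def checkpoint(self, state):
        # Keep a private copy so later mutation by the worker cannot corrupt it.
        self.last_good = dict(state)


def run_job(items, watcher, start_state=None, fail_at=None):
    """Sum `items`, sending a state update to the watcher after each step.

    `fail_at` simulates a crash just before processing that index.
    """
    state = dict(start_state) if start_state is not None else {"next": 0, "total": 0}
    while state["next"] < len(items):
        i = state["next"]
        if i == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += items[i]
        state["next"] = i + 1
        watcher.checkpoint(state)
    return state["total"]
```

    After a crash, the watcher calls `run_job(items, watcher, start_state=watcher.last_good)` and the work resumes from the last state it received instead of from scratch.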

    Every time you use an ATM or trade a stock, somewhere a Tandem cluster was involved.

    Tandem's problem was that they had rather expensive proprietary hardware. You also needed extra hardware to allow for fail-operational systems. But it all really does work. HP still sells Tandem, but since Carly, it's being neglected, like most other high technology at HP.

    1. Re:UNIX is the problem. Tandem was the solution. by rlp · · Score: 4, Interesting

      Tandem had an FT Unix division in Austin. One of the teams I managed was responsible for an embedded expert system that monitored faults in the redundant components of the system. Every component was replicated. Each logical CPU actually consisted of four processors: two pairs running in lock-step. If one CPU in a pair disagreed with its counterpart, the pair would be taken out of service. The expert system monitored transient faults and would "predict" that a component was going to fail, and could take it out of service. The system had a modem that would "phone home" in the event of a component failure, and a service tech would be dispatched with a part, often before the customer knew there was a problem.

      The machines used MIPS processors (supporting SMP) and ran a Tandem variant of System V UNIX. Combine this with a decent transactional database, and application software capable of check-pointing itself, and you have a very robust system. Albeit a very expensive one.

      Tandem was bought out by Compaq, and then by HP. When I left, Tandem had quite a few interesting ideas they were working on, but near as I can tell, they never saw the light of day.

      --
      [Insert pithy quote here]
    2. Re:UNIX is the problem. Tandem was the solution. by burns210 · · Score: 1

      When can such monitoring capabilities be added to things like Linux, Mac OS X (Server) or FreeBSD? Granted, some hardware is required for part of these redundant services, but not all.

    3. Re:UNIX is the problem. Tandem was the solution. by upsidedown_duck · · Score: 2, Insightful


      Knowing HP, your systems are probably being replaced by Tandem-branded PCs with ECC RAM and software RAID. A rescue DVD will provide instant system rebuilds so downtime is never more than two days.

      --
      -- "Makes Little Debbie look like a pile of puke!" - Moe Szyslak
    4. Re:UNIX is the problem. Tandem was the solution. by Anonymous Coward · · Score: 0
      If the first program fails, the watching program restarts it from a previous good state.


      ...and it may fail again, because the same data is treated in the same way (that's how programs work), and therefore might lead to the same problem again and again.

      If the OS kernel itself is buggy, the system might even crash, reboot, restart the application in the last saved state, reproduce the error, and crash again. (That's why you can IPL an AS/400 in attended IPL mode, so you can IPL the system without continuing certain applications/transactions)

      Self-healing techniques can help to keep the system running in case of hardware failures (provided that the hardware is capable of detecting and surviving hardware failures, and the software can deallocate faulted components and migrate active tasks to functional components), but I have never seen a piece of software that can transparently "heal" a buggy piece of code in another program.
    5. Re:UNIX is the problem. Tandem was the solution. by Bert64 · · Score: 1

      Typical Compaq, buy up some nice technology and then completely ruin it...
      They did the same to DEC

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
  23. Joke Spoiler by IO+ERROR · · Score: 2
    Plenty of Sun's boxes have redundant power supplies.

    Click here to ruin the joke.

    --
    How am I supposed to fit a pithy, relevant quote into 120 characters?
    1. Re:Joke Spoiler by kurzweilfreak · · Score: 1

      The latest BOFH, though usually not as good as the earlier stuff.

      --

      kurzweil_freak

      5th Kyu Genbukan Ninpo/KJJR student

      Be the darkness that allows the light to shine.

  24. So how's the health of the system that monitors by melted · · Score: 1

    the health of the rest of the system monitored, and what are you gonna do if it comes to wrong conclusions?

    1. Re:So how's the health of the system that monitors by Anonymous Coward · · Score: 0

      Instead of a hierarchical system, each level consists of peers that monitor each other. Instead of one memory bank, you have 4. If one bank does not return the same information as the rest, that bank is automatically taken offline and either a reserve is brought into operation, or a warning is sent that fault-tolerance has been degraded.

      With redundant everything, each component can compare its own health with that of its peers. If anomalies are detected, that component is quarantined.

    2. Re:So how's the health of the system that monitors by Anonymous Coward · · Score: 0

      Have three monitoring systems monitoring your main system and each other. It requires the consensus of greater than 50% of the active monitoring systems before a component can be diagnosed as faulty.

      If there are only two active monitoring systems, and one diagnoses the other as faulty, they should wait until the third system comes back on line to tiebreak. If the third system is not automatically back on line within a certain amount of time, the sysadmin gets paged.

      Race conditions and continual restarts of the same subsystem should also be flagged as undesirable behaviour.
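
      A small Python sketch of that voting rule, under the assumption that each monitor reports True (faulty), False (healthy) or None (the monitor itself is offline). The function name `diagnose` and the three result strings are made up for the example:

```python
def diagnose(votes):
    """Apply the 'more than 50% of active monitors' rule.

    `votes` maps monitor name -> True (faulty), False (healthy) or None (offline).
    """
    active = [v for v in votes.values() if v is not None]
    faulty = sum(1 for v in active if v)
    healthy = len(active) - faulty
    if 2 * faulty > len(active):
        return "faulty"
    if 2 * healthy > len(active):
        return "healthy"
    # A tie (or no active monitors): wait for the tiebreaker, then page the sysadmin.
    return "undecided"
```

      With two active monitors that disagree, neither side has a strict majority, so the result is "undecided", which is exactly the wait-for-the-third-system case.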

  25. Ya, but... by Anonymous Coward · · Score: 0

    In Soviet Russia, self-healing systems diagnose themselves.

    1. Re:Ya, but... by Anonymous Coward · · Score: 0

      No, In Soviet Russia Self-Healing Systems Heal YOU.

  26. Everything I Know I Learned From TOS by Anonymous Coward · · Score: 0
    So how's the health of the system that monitors the health of the rest of the system monitored, and what are you gonna do if it comes to wrong conclusions?
    "Norman coordinate"
  27. First step to self-awareness and AI? by G4from128k · · Score: 1

    Self-healing would seem to be a critical step toward a self-aware artificial intelligence. Self-healing requires an ability for introspection that is sufficient to identify and correct corrupted internal states. Code that is able to introspect its own behavior and internal structure could lead it to interesting outcomes if tied to a learning algorithm (even a simple hill-climbing algorithm).

    It is then a small step to go from simple feedback self-healing mechanisms to feed-forward control mechanisms. A feed-forward system would learn (or be told) how certain precursor states and actions lead to faulty or non-faulty operating states. A system that learns that particular states and invocations lead to crashes would "learn" to avoid those invocations or correct those states proactively.

    Such a system might even begin to show emotional states in the sense that an emotional state is a summary of the condition of the system that guides future action. "Unhappy" systems would spend extra CPU cycles on introspection to try to understand and correct accumulating faulty conditions. Such systems might even get "angry" at other machines such as machines that send spam or worm packets and refuse to communicate with them.

    --
    Two wrongs don't make a right, but three lefts do.
  28. Misread... by brain007 · · Score: 0

    Did anyone else read this as 'A Diagnosis of Self-Heating Systems?'

    I think my laptop could cook an egg...

  29. full redundancy (almost) always works by mo · · Score: 1

    From TFA: One approach is simply to make an individual system the unit of recovery; if anything fails, either restart the whole thing or fail-over to another system providing redundancy. Unfortunately, with the increasing physical resources available to each system, this approach is inherently wasteful: Why restart a whole system if you can disable a particular processor core, restart an individual application, or refrain from using a bit of your spacious memory or a particular I/O path until a repair is truly needed?

    Because using stuff like stonith or heartbeat works for many more types of failures. Bad network cable? Yup. Power supply? yup. Server Catch on Fire? Yup.

    I'm not saying it wouldn't be nice to have the OS route around bad memory blocks or bad processor cores due to some fancy-pants algorithm (without having to rewrite my app). But you're still going to need a redundant server for when somebody trips over the power cord.

    1. Re:full redundancy (almost) always works by Anonymous Coward · · Score: 0

      > But you're still going to need a redundant server for when somebody trips over the power cord.

      But you can have a robot that actually plugs it back on. Self-healing.

      The redundant server is fault-tolerant. Once all spare servers go dark, you run out of medicine.

    2. Re:full redundancy (almost) always works by Anonymous Coward · · Score: 0

      I'm not saying it wouldn't be nice to have the OS route around bad memory blocks or bad processor cores due to some fancy-pants algorithm (without having to rewrite my app).

      Solaris has done this (at least on SPARC) since at least release 8. You could offline and pull CPU/memory boards out of an E3500 (which are now quite dated) while the rest of the system runs. The OS scheduler will relocate processes on the fly. There does have to be some hardware support for this though.

      Linux is only now beginning to do this.

      People wonder why you would run Solaris and not Linux, or what makes Solaris "better" than Linux. Those that need this feature need it bad, and there's no easy way to do it in Linux (if there's a way to do it at all).

  30. One of my self-healing systems by skinfitz · · Score: 4, Interesting

    I have it so that if one of our firewalls detects an attempt to access gator.com it enrols the machine into an active directory system group which the SMS server queries to automatically de-spyware it with SpyBot.

    I'd call that a self healing system. I'm a network admin though so my perception of these things tends to be on a larger scale.

    1. Re:One of my self-healing systems by Bert64 · · Score: 2, Interesting

      That's like curing the symptoms and not the cause.
      Your systems shouldn't have gotten infected with spyware in the first place, and the fact that they did shows you have bigger problems. What if they get infected with something more malicious than gator? Or how about something that's not detected by the spyware removal tools?

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
    2. Re:One of my self-healing systems by skinfitz · · Score: 2, Interesting

      I agree completely - we do not allow admin or Power User rights on our systems, and typically if a machine has gator on it, it usually has other problems too. In fact I'll guarantee that if any machine has gator on it, it usually has LOTS of other problems.

      Tracking the symptoms like this alerts me to these problems - running SpyBot on a machine never hurts, and I'll do other things too like have a script email me the list of administrators on the machine and perhaps change the password.

      As for more malicious, I have used the same technique with Snort sensors around the network logging into a database. Another script queries the database and takes the appropriate action du jour - for example during Nimda I had scripts that would scan the database and clean infected machines.

      Always worth putting in the extra time to automate these things as you have a solution for the future and can sit back and admire your work.

      As for curing the symptoms and not the cause, this frees up my time to tackle the cause. If I ran around manually cleaning up systems my time would go nowhere.

  31. The need for a "self" symbol by Etcetera · · Score: 2, Interesting

    HAL: I've just picked up a fault in the AE35 unit. It's going to go 100% failure in 72 hours.

    This is really something that, IMHO, calls for more interaction between the best of the futurists, science-fiction writers, and coders, and other complexity thinkers.

    In order for any system to have an understanding of and proper diagnosis of its own operation, it needs to be able to conceptualize its relationship to other systems around it. Am I important? What functions do I provide? What level of error is proper to report to my administrator? Do I have a history of hardware problems? Has chip 2341 on motherboard 12 been acting up intermittently? If so, is it getting worse or better? How have I been doing over the last few days? Is there a new virus going around that is similar to something I've had before?

    What good is a self-diagnosing system without a memory of its prior actions?

    All of these questions imply some sort of context that will require the system to use symbols to represent "things" in the "world" around it. Clearly, the largest (though perhaps not qualitatively different) symbol will be a "self" symbol.

    From there, all you have to do is follow Hofstadter's path and you'll arrive at a system with emergent self-awareness or consciousness.

    The end result of this will be something a) very complex and b) designed/grown by itself. You'll have either the computer from the U.S.S. Enterprise or H.A.L.

    Side question: What is CYC doing these days?
  32. Self-Healing Data Transfer by Orasis · · Score: 1

    For swarmstreaming, we use the Tree Hash EXchange format (THEX) to provide cryptographic integrity verification down to a single 1KB resolution so we can automatically repair the corruption.
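
    A hedged Python sketch of the general idea, not of our actual code: hash the data in 1 KB blocks, combine the block hashes into a tree, and compare against trusted hashes to find exactly which blocks need to be fetched again. As I understand the format, THEX proper uses the Tiger hash; SHA-256 is swapped in here because it ships with `hashlib`, and the function names are hypothetical.

```python
import hashlib

BLOCK = 1024  # 1 KB leaf resolution


def leaf_hashes(data: bytes):
    """Hash each 1 KB block; the 0x00 prefix marks a leaf node."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)] or [b""]
    return [hashlib.sha256(b"\x00" + b).digest() for b in blocks]


def root_hash(leaves):
    """Combine leaf hashes pairwise (0x01 prefix marks an internal node)."""
    level = list(leaves)
    while len(level) > 1:
        nxt = [hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired node is promoted unchanged
        level = nxt
    return level[0]


def corrupt_blocks(data: bytes, trusted_leaves):
    """Indices of the blocks whose hash differs from the trusted leaf hash."""
    return [i for i, (got, want) in enumerate(zip(leaf_hashes(data), trusted_leaves))
            if got != want]
```

    Only the blocks returned by `corrupt_blocks` have to be downloaded again, which is what makes the repair automatic and cheap.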

  33. Where does it hurt? by Doc+Ruby · · Score: 2, Insightful

    How about just systems that fail *verbosely*, so admins can quickly diagnose them? Once the patient can complain properly, we can get to work replacing the admin doctors with "self-healing" metasystems that use those diagnostics. It will be a lot easier just mimicking the best admins' best practices by automating them, than all this screwing around trying to compile marketsprach like "self-healing" without understanding how it even works in nature.

    --

    --
    make install -not war

    1. Re:Where does it hurt? by ElvenSmith · · Score: 1

      I can second that. If just the error message is good/detailed enough, it cuts the time spent going thru obscure log files etc...
      But I doubt many developers think from a user's point of view...

    2. Re:Where does it hurt? by TTK+Ciar · · Score: 1

      That is more or less how things have evolved here at The Internet Archive.

      Unixes and the services that run on them can be configured to be very verbose in their errors and warnings, and error messages can be used as triggers to check various logs and system states for additional information, but in a nontrivial cluster there are major problems with humans trying to digest this flow of information and make sense of it all.

      Better tools help, but beyond a point it just makes sense to try and make the tools themselves react correctly to the symptoms and correct the problems.

      This Sun guy's article overlooks the existing mechanisms and methods already available for implementing self-healing systems (though, he's spot-on regarding the need for better abstraction). Hmmm, I write these kinds of tools at The Archive, maybe I should write my own article?

      Regarding the "how do you know the monitor is healthy?" issue: you never really know, but you can have the monitor monitor itself and fairly reliably diagnose many of its own problems, and/or have pools of monitoring peers (qv keepalived), and start with "gentle" solutions which won't hurt anything (much) in the case of a misdiagnosis, and ramp up from there.

      -- TTK

    3. Re:Where does it hurt? by Doc+Ruby · · Score: 1

      You have taken the diagnostics tools to the next level at TIA, keeping the human in the loop, but expanding both the complexity of the diagnostic data, and the tools to process that data. So both the senses and manipulators of the human are bionic :). Maybe that's why TIA is so cool, and reliable. Thanks a lot - with your content, you are keeping the computers as tools in the service of human intercommunication, instead of descending into communications primarily with the machine, as they appear to prefer at Sun. On both sides of the Archive, which winds up a medium for humans to communicate with each other, and ourselves, transcending time and space in every way (except perhaps actually sending a stream of me back to 1970-23-01::Honolulu-Civic-Center ;).

      --

      --
      make install -not war

  34. Synonym for this by Anonymous Coward · · Score: 0

    Extortion.

  35. Re:How about systems that I can manually heal firs by upsidedown_duck · · Score: 1


    Mostly, that's because Windows is a piece of shit.

    --
    -- "Makes Little Debbie look like a pile of puke!" - Moe Szyslak
  36. It's a long way by jd · · Score: 3, Interesting
    ...from what we have now to the Liberator (DSV-2) from Blake's 7, the Ultimate in self-repairing systems. At the moment, most "self-repair" is in the form of software error-correction and bypassing faulty hardware. (The "badmem" patches for Linux do this, for example.)


    The former could be considered self-repair, but it is limited as you don't have to have much in the way of an error to totally swamp most error-correction codes.


    The second form isn't really self-repair as much as it is damage control. This is just as important as self-repair, as you can't do much repair work if your software can't run.


    On the whole, "normal" systems don't need any kind of self-repair, beyond the basic error-correction codes. Instead, you are likely better off to have a "hot fail-over" system - two systems running in parallel with the same data, only one of them is kept "silent". Both take input from the same source(s), and so should have identical states at all times, with no synchronization required.


    If the "active" one fails, just "unsilence" the other one and restore the first one's state. If the "silent" one fails, all you do is copy the state over.


    However, computers are deterministic. Two identical machines, performing identical operations, will always produce identical results. Therefore, in order to have a meaningful hot fail-over of the kind described, the two can't be identical. They have to be different enough to not fail under identical conditions, but be similar enough that you can trivially switch the output from one to the other without anybody noticing.


    eg: The use of a Linux box on an AMD running Roxen, and an OpenBSD box on an Intel running Apache, would be pretty much guaranteed not to have common points of failure. If you used a keepalive daemon for each box to monitor the other's health, you could easily ensure that only one box was "talking" at a time, even if both were receiving.


    The added complexity is minimal, which is always good for reliability, and the result is as good or better than any existing software self-repair method out there.
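
    Roughly, in Python, and only as a toy model: both boxes receive every input, only the "active" one talks, and a health check swaps the roles. `Replica` and `FailoverPair` are hypothetical names, and a real setup would use a keepalive daemon over the network rather than an `alive` flag.

```python
class Replica:
    """One box; its state is just a running total of the inputs it has seen."""

    def __init__(self, name):
        self.name = name
        self.state = 0
        self.alive = True

    def receive(self, value):
        if self.alive:
            self.state += value


class FailoverPair:
    def __init__(self, active, silent):
        self.active, self.silent = active, silent

    def feed(self, value):
        # Both boxes take input from the same source, so no sync is needed.
        for box in (self.active, self.silent):
            box.receive(value)

    def check(self):
        # Keepalive result: if the talking box died, unsilence the other one.
        if not self.active.alive and self.silent.alive:
            self.active, self.silent = self.silent, self.active

    def recover_silent(self):
        # Bring the failed box back by copying the state over from the active one.
        self.silent.state = self.active.state
        self.silent.alive = True
```

    The model treats the two boxes as interchangeable; the whole point of the post above is that in practice they should differ in hardware and software so they don't fail on the same input.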


    Now, you can't always use such solutions. Anything designed to work in space, these days, uses a combination of the above techniques to extend the lifetime of the computer. By dynamically monitoring the health of the components, re-routing data flow as needed, and repairing data/code stored in transistors that have become damaged, you ensure the system will keep functioning.


    Transistors get destroyed by radiation quite easily. If you didn't have some kind of self-repair/damage-control, you'd either be using chips with transistors which may or may not work, or you'd have to scrub the entire chip after a single transistor went.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    1. Re:It's a long way by liangzai · · Score: 1
      However, computers are deterministic. Two identical machines, performing identical operations, will always produce identical results.

      Fuck no! Computers have souls, man. That is because they are so complicated that the deterministic model no longer holds; there's a non-deterministic layer that gives the machines their personal features.

  37. I'd rather have self-healing coffee. by dpbsmith · · Score: 1

    I thought I saw an article about that earlier, but on second glance it turned out to be about self-heating coffee. Yawn.

    Now, as for self-heating systems...

  38. Re:How about systems that I can manually heal firs by upsidedown_duck · · Score: 1


    Sorry about that, I just said the first thing that came to my mind.

    --
    -- "Makes Little Debbie look like a pile of puke!" - Moe Szyslak
  39. The article is a pretty good roadmap... by CodeWanker · · Score: 1

    First of all, it has the phrase that pays: "graceful degradation"

    Next, it talks about verbose and useful errors, so that a techy can make intelligent decisions about terminating a process, restarting it, altering a file, or some other fix. Presumably, once a tech marks a problem "successfully fixed" by a certain set of actions enough times, the system will try that series of actions before throwing an error message.

    What will be nice is when the system recognizes what it is it's doing, so it'll have a "what" area and a "how" area in its makeup so it knows what's involved to accomplish the task. Then, if the "How" gets damaged, it can refer to the "what" to reconfigure available resources to meet the desired outcome. And, if the "What" gets damaged, it can be rebuilt by analyzing the "How" part. THAT will be realio coolio.

    --


    "Wow. Now THAT'S a lot of angry Indians." - Lt. Col. George Armstrong Custer
  40. Re:How about systems that I can manually heal firs by Anonymous Coward · · Score: 0

    Files aren't versioned, so every change done to a file will simply erase the former content forever, not so good if the former content might have been important.

    VMS had this years ago.

    Windows XP also has the 'restore' utility (which I've had to use a couple of times when one of their updates hosed the system). This is one feature, though, that is both useful and worked when I needed it to.

  41. Re:How about systems that I can manually heal firs by Qzukk · · Score: 2, Insightful

    Files aren't versioned

    Undelete?

    Check of integrity of an installed piece of software

    During the desktop's formative years, the raw drive space needed to actually implement these kinds of things just wasn't available. This is why things like file versioning (popular on large systems like VMS, where the universities/companies running it had the money for the storage requirements) and permanent storage of "unwanted" files just didn't appear.

    The third problem is a bit tougher without some extra metadata and hardcore discussions on exactly what should be monitored/done/etc (personally, I don't think this is a kernel-level operation). Something must be stored somewhere so that the system can identify a modified binary. At some time (before change, in which case the operation is stopped? After change? Monthly?) someone (root? file owner? script kiddie currently logged in as root?) has to be notified (syslog? message to terminal? email?) that something (virus? script kiddie? make install? dpkg? rpm?) has altered the (executable? configuration? library? manpage?). As you can see, it's one thing to say "oh yeah the OS should do this" and another entirely to define what this is.

    The second problem is tough as well, but there are patches to libc's unlink() function (either as a patch or as an LD_PRELOAD library to override libc's function) that move the files to a pre-defined trashcan, and that every dynamically linked application will use.
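
    For illustration, here is the same "move it instead of deleting it" idea at the application level. This is only a sketch of the behaviour, not the libc patch or LD_PRELOAD library mentioned above: `trash_unlink`, the trash directory layout and the timestamp prefix are all assumptions made up for the example.

```python
import os
import shutil
import time


def trash_unlink(path, trash_dir):
    """Move `path` into `trash_dir` instead of deleting it.

    Returns the new location so the caller could restore the file later.
    """
    os.makedirs(trash_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    base = os.path.basename(path)
    dest = os.path.join(trash_dir, f"{stamp}-{base}")
    # Avoid clobbering an earlier trashed file with the same name and stamp.
    n = 0
    while os.path.exists(dest):
        n += 1
        dest = os.path.join(trash_dir, f"{stamp}-{n}-{base}")
    shutil.move(path, dest)
    return dest
```

    Unlike an unlink() override, this only protects programs that choose to call it, and like any trashcan it does nothing for files that are overwritten rather than deleted.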

    The first problem is mostly just a lack of demand. Nobody cares, so nobody made a filesystem that can do it. Both ext*fs and reiserfs are extendable (with optional options; Reiserfs more so than ext), so if you care, do it yourself, but again there are questions you'll have to be prepared to answer (and since you insist on doing this at the kernel level, you have to have THE answer): If a program writes 1MB to a file 1 byte at a time, is that one million revisions? If you're writing a document and you hit save after every paragraph, is that a revision? How are you going to tell this apart at the kernel level?

    --
    If I have been able to see further than others, it is because I bought a pair of binoculars.
  42. Get a *real* OS by Anonymous Coward · · Score: 0
    While a self healing system sounds nifty, today's systems aren't even good enough to be healed manually.

    Only when you limit your OS choices to turds. See below.

    Uninstalling applications is often not handled by the OS and has to be done by application itself, resulting in incomplete installations, config files and registry entries that haven't been properly cleaned up and whatever.

    Why in hell does the OS have to be involved in an application install?!?!?!

    Oh, right. Because there's this single-point-of-failure, dumbest-idea-ever-to-fly-out-of-a-monkey's-butt registry where all applications need to put their configuration data.

    Why!?!?!?

    What "problem" did this solve?

    Files aren't versioned, so every change done to a file will simply erase the former content forever, not so good if the former content might have been important.

    VMS. Old news. Also lots of disk space needed


    Undelete? Nope, we don't have that either, we have this hack of a Trashcan, but that won't help you much if some program deleted the file.

    And why not a Trashcan inside the trashcan. Even better, a popup that will ask if I'm really sure I want to stop using this wonderful Fisher-Price GUI.

    Check of integrity of an installed piece of software isn't possible either, sure there are third-party solutions, but again that should be something that the OS provides by default

    Just how in hell is the OS supposed to determine that a third-party application's file hasn't had its "integrity" destroyed? How on God's good Earth would the OS even define integrity?

    Well, there are millions more reasons why today's systems suck and why it is often easier to simply reinstall from scratch than to try to actually fix the mess, and yep, that is true for Linux, Windows and MacOS alike, sure for some more than for the others, but that's it.

    Yeah, but it's really only Windows that sucks. Real apps on real computers don't need a dumbass "registry" so the "OS" can tell them where to find their files and data (but only on that one computer....), they're smart enough to do it themselves (it's actually really easy, because an app always knows where it's located, and can use that information to find other data in a completely relocatable manner that doesn't depend on anything other than itself. Cool idea: limit your app's impact on the box it's installed on.....)

    1. Re:Get a *real* OS by grumbel · · Score: 1

      ### Only when you limit your OS choices to turds. See below.

      Feel free to suggest one; neither Windows, MacOSX, Linux (any distro) nor FreeBSD gets the job done properly.

      ### Why in hell does the OS have to be involved in an application install?!?!?!

      If not the OS, then who else should take care of it? If applications are free to dump themselves anywhere, without anybody making sure they do it properly, it's just a matter of time until one app does bad things.

      ### VMS. Old news. Also lots of disk space needed

      VMS, I know; current OSes, however, are still behind that. About disk space: yep, plenty. My HD and those of basically EVERYBODY else have at least a few hundred megabytes free; often that would be more than enough for recovery of an accidentally overwritten file.

      ### And why not a Trashcan inside the trashcan.

      The trashcan only works if the application uses it; most, however, don't and never will. Besides that, it can't catch overwritten files, only deleted ones.

      ### Just how in hell is the OS supposed to determine that a third-party application's file hasn't had its "integrity" destroyed? How on God's good Earth would the OS even define integrity?

      The application comes with a bunch of MD5 or whatever checksums which the OS can check against.
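
      A rough sketch of what such a check could look like, using only the Python standard library. The manifest format (a plain mapping from file path to expected MD5 hex digest) and the function names are assumptions for the example; real package formats store this metadata in their own ways.

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest):
    """Check every file in `manifest` (path -> expected hex digest).

    Returns the list of paths that are missing, unreadable or modified.
    """
    damaged = []
    for path, expected in manifest.items():
        try:
            actual = file_md5(path)
        except OSError:
            damaged.append(path)      # missing or unreadable counts as damaged
            continue
        if actual != expected:
            damaged.append(path)
    return damaged
```

      This only detects that a file changed; it cannot tell a legitimate upgrade from corruption, and it is only as trustworthy as the place the manifest itself is stored.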

      ### Yeah, but it's really only Windows that sucks.

      No, all OSes do, some just more than others.

      ### it's actually really easy, because an app always knows where it's located

      No, the app does not know where it is located. The only way to do that portably on Unix today is hardcoded paths inside the binary, which simply causes endless trouble in case one wants to relocate an application. The registry is of course not the best solution out there, but hardcoding configuration values into a binary is for sure just an ugly hack.

    2. Re:Get a *real* OS by Bert64 · · Score: 1

      I very much like the OSX way of installing apps. Each app is really a directory, which contains all the files the app requires. The only things stored outside of this directory are user-specific settings, which are stored in your home directory, under a subdirectory indicating the name of the application. If you want to remove the app, you delete the directory. No other apps have any reason to overwrite application directories they don't own, nor do they need to overwrite system files and libraries; any libs the app needs can be self-contained.

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
    3. Re:Get a *real* OS by Kent+Recal · · Score: 1

      Agreed.
      Now if Apple could just return to /home instead of /Users and generally put some sanity into their directory clutter (why the heck do I need /Library and /System/Library?) I'd be happy.

    4. Re:Get a *real* OS by fimbulvetr · · Score: 1

      Yeah, most Linux apps did this long before OSX was out.

  43. worst case? by bird603568 · · Score: 1, Insightful

    This sounds good until somebody makes a worm that uses an exploit that (for the sake of argument, say there is one) was (or will be, or might be) found. The worm tricks the server into thinking it is severely messed up, so it orders a boatload of parts or shuts down or both. The tech shows up and it's just a worm. Now you have these parts and have to pay up. Also, the server shut down, so now there's lost time. Did I mention it's a worm, so it spreads? That's just the worst case, but it could be great unless you fix broken servers for a living.

    1. Re:worst case? by Anonymous Coward · · Score: 0

      neigh. it would detect the worm. 'cause that's, sort of like, its job.

  44. Why the term "self-healing?" by starseeker · · Score: 1

    That makes it sound like people want computers to be able to mechanically fix themselves when they break.

    Wouldn't a "self-healing" system just be good at a) reporting what hardware is actually broken on the machine, b) automating well-defined responses to well-defined problems and c) building parallel, fault tolerant hardware at all levels of the system?

    As far as I know, even the best AI research hasn't come up with software that can diagnose and fix unknown, first time, bizarre problems. Ultimately, it all seems to be about providing better error reporting and automation of traditional technology, not some magical PC that can fix itself and reprogram itself with no human intervention. Which is great, don't get me wrong, but why the term "self-healing?" Does someone know the rationale for that?

    --
    "I object to doing things that computers can do." -- Olin Shivers, lispers.org
  45. Oooh ooooh solaris by Anonymous Coward · · Score: 0

    Solaris was mentioned!! Solaris was mentioned!!
    Time to get all slashdotty and proclaim the
    end of the company. Ooooh Solaris was mentioned!

  46. We already have this... by JRHelgeson · · Score: 2, Insightful

    The space shuttle, as old as it is, has an absolutely incredible computer system that is self healing.

    The Shuttle has many thousands of sensors and backup sensors. Each sensor feeds into one of many computer systems. These computer systems talk to each other as more of a committee rather than just passing data amongst themselves. If a computer discovers a fault, another computer will see that fault as well; it will combine data gathered from other computer systems throughout the shuttle and each computer system will literally cast a vote on what the best solution should be for the particular fault discovered.

    If one computer system suffers a partial or complete failure, the remaining systems will work around the failed system.
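
    The committee idea boils down to majority voting among redundant units. The following is a generic sketch of that technique, not the Shuttle's actual software: `majority_vote` and the unit names are invented for the example, and a unit that has failed outright is modelled simply as reporting `None`.

```python
from collections import Counter


def majority_vote(readings):
    """Pick the value a strict majority of responding units agree on.

    `readings` maps a unit name to its reported value, or to None if the
    unit did not answer. Returns (agreed value, sorted list of units that
    disagreed or were silent). Raises ValueError if there is no majority.
    """
    answered = {unit: v for unit, v in readings.items() if v is not None}
    if not answered:
        raise ValueError("no unit answered")
    value, count = Counter(answered.values()).most_common(1)[0]
    if count * 2 <= len(answered):
        raise ValueError("no majority among responding units")
    outvoted = sorted(unit for unit, v in readings.items() if v != value)
    return value, outvoted
```

    For example, `majority_vote({"gpc1": 10, "gpc2": 10, "gpc3": 11, "gpc4": None})` returns `(10, ["gpc3", "gpc4"])`: the two agreeing units win, and the dissenting and silent units are flagged so the rest can work around them.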

    This computer system has managed to keep our astronauts alive for every mission, except the two that suffered catastrophic mechanical failures. In the second of those (Columbia), the computers kept the craft flying until it broke apart completely.

    I say not bad for a system designed over 20 years ago!

    --
    Good security is based upon reality and common sense. Common sense is a function of having common knowledge.
  47. Re:How about systems that I can manually heal firs by grumbel · · Score: 1

    ### Something must be stored somewhere so that the system can identify a modified binary.

    Well, the system ultimately knows when something changes, since it is the one that changes it. You are right that one needs some metadata; in most cases, however, that already comes with the packages (deb/rpm) one installs. There just isn't a standard way to automatically check for these changes. However, this problem can be solved completely in userspace with a cron job; it would just be nice to have a standard way to do it.

    ### During the desktop's formative years, the raw drive space needed to actually implement these kinds of things just wasn't available.

    While that might be true, that was at least 10 years ago. Now that we count in gigabytes, we have more space than needed and in most cases more than we can ever legally fill. And even if the hard disk runs full from time to time, it isn't much of a problem: a versioning filesystem should simply scale back and automatically discard older versions. But as long as there is still free space on the hard disk, I see little reason not to use it for something useful.

    ### The second problem is tough as well, but there are patches to libc's unlink() function

    The trouble with that patch is the same as with the trashcans: they only catch what you actually delete, not what you overwrite, fill with 0 bytes or destroy by other operations. I don't have any hard numbers, but in my experience deleted files don't happen that often compared to overwritten ones.

    ### If a program writes 1MB to a file 1 byte at a time, is that one million revisions?

    If it does 'fopen(.., "a"); write() close();' then yep, one million revisions. A new version should be created once the file handle is opened for writing, something that is hard to catch from userspace. Storing the new version could then happen in some copy-on-write manner; that's why the kernel needs to play at least some role.

    ### How are you going to tell this apart at the kernel level?

    Since you only need to listen to open() and close(), how many bytes are written in between is not much of an issue. And about the "nobody cares" part: well, the programmer might not, but the users who yet again have lost data might care a lot. As with all things that are only useful when things go wrong, it of course isn't much good for marketing.

  48. Re:How about systems that I can manually heal firs by myowntrueself · · Score: 1

    "and another entirely to define what this is."

    That would be "Survive, damn you! Survive!"

    There is this internal conflict we must have, where on the one hand we want our technology to have a survival instinct; so that it is motivated to look after itself while we are not.

    A bit like a human baby figuring out that sometimes mummy is not looking this way and it has to get out of the way of the reversing SUV by itself.

    On the other hand, the prospect of computers that have a survival instinct is (or bloody should be) a bit scary.

    The real problem we have is very much like facing the emancipation of slaves; on the one hand you'd rather not have the expense of maintaining slaves and couldn't they just take care of themselves. On the other hand you *know* how you treated them and worry that they will extract revenge.

    Or read Stanislaw Lem 'Non Serviam'. A true classic of AI literature.

    Then there's the low level side where a filesystem is filling up and what do you start deleting first? If the *computer* could choose, for its own health and wellbeing, where would *it* start?

    Ooops there goes the pr0n...

    ;)

    --
    In the free world the media isn't government run; the government is media run.
  49. nice systems by mshurpik · · Score: 1

    Well, I've seen some nice systems. When I see some nice fully systems and some quality fully self- systems, I'll be ready for the advent of fully self-healing systems. I expect we will get there one step at a time.

  50. too late by Anne+Thwacks · · Score: 2, Funny

    But I read in 1958 that we would have self healing systems "within a decade" - surely we must have had them for over 30 years!

    --
    Sent from my ASR33 using ASCII
  51. All about dependability by thrill12 · · Score: 1

    Having thought this thing through, I guess it's all about different levels of (inter-)dependability. One program relying on the other, etc..
    I guess if you work this out up to a low enough level, including the hardware, you can actually make the system heal itself.
    You could probably start at the root of the whole system, power, and build your way up from there in a sort of tree. However, other environmental issues for your system could exist that make a power failure seem like Christmas.
    It could be fire, it could be the air conditioning unit next to (one of) your machines leaking water. In short, self-healing systems are great, but total self-healing will not be achieved unless you can, somehow, get all environmental concerns charted and handled. Even then you probably have to be honest and admit that the system can and will fail in certain (very) far-out conditions. In that case, the system will closely approach a truly intelligent system. As this does not yet exist, why should total self-healing systems?

    --
    Slashdot: stuff for news, nerds that matter, matter for news, stuff that nerd
  52. IBMs been there done that by supersnail · · Score: 2, Informative
    .... given away the tshirts.

    The current zSeries machines come with 16 cpus and L2 & L1 packaged together on a board.
    But only 12 cpus are used.

    Each "cpu" is actually two cpus and a comparator. When the cpus come up with a different answer the cpu is shut down and processing is taken over by one of the four free cpus on the board.

    You will never know it happened until you run one of the maintenance utilities.
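
    In software terms, the scheme described above (duplicate execution, a comparator, and sparing) might be modelled like this. It is a toy model for illustration only, not how the hardware or its firmware actually works: `LockstepPair`, `Board` and the idea of representing a core as a plain Python function are all assumptions.

```python
class LockstepMismatch(Exception):
    """Raised when the two halves of a lockstep pair disagree."""


class LockstepPair:
    """Two cores that run the same work, plus a comparator."""

    def __init__(self, core_a, core_b):
        self.core_a = core_a
        self.core_b = core_b

    def run(self, x):
        a = self.core_a(x)
        b = self.core_b(x)
        if a != b:                      # the comparator
            raise LockstepMismatch(f"{a!r} != {b!r}")
        return a


class Board:
    """Active pairs plus spare pairs that take over on a mismatch."""

    def __init__(self, active, spares):
        self.active = list(active)
        self.spares = list(spares)
        self.retired = []               # only visible to "maintenance"

    def run(self, slot, x):
        try:
            return self.active[slot].run(x)
        except LockstepMismatch:
            if not self.spares:
                raise                   # nothing left to fail over to
            self.retired.append(self.active[slot])
            self.active[slot] = self.spares.pop()
            return self.active[slot].run(x)
```

    The caller of `Board.run` gets a correct answer either way; the only trace of the failure is the entry in `retired`, which matches the "you will never know until you run a maintenance utility" behaviour.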

    In the way of IBM this technology will probably appear on top end pSeries (AIX/Linux) and iSeries boxes in a couple of years.

    --
    Old COBOL programmers never die. They just code in C.
  53. TMI, anyone? by Anonymous Coward · · Score: 0

    To understand the problem with systems that fail verbosely, investigate the Three Mile Island disaster. When the alarm board lights up like a Christmas tree, it's hard to figure out what the root cause is amidst the chaos.

    BTW, the whole TMI nuke plant had exactly 2 phone lines to the outside world. Between the governor calling to find out what happened, the media clamoring for answers, and other miscellaneous activity, the guy who designed the plant couldn't get through to tell them how to keep it from blowing up! Would you want to be on the wrong side of a Treo 600 while on vacation after the server coughs up a 20MB log file?

    1. Re:TMI, anyone? by Doc+Ruby · · Score: 1

      No, I'd want to be on an X-server and a T3 if I couldn't be at the console. And I'd want to use log analysis tools as complex under the hood as the processes that generated the logfile, with a clear UI. That's the other half of what I posted: verbose error messages, and tools to match. My Treo600 looks pretty simple, but inside its guts is one of the most complex devices ever built, to interface my face to the faces of billions of people around the world. If the log analysis tools, and my training, were appropriately matched in complexity, the Treo could be good enough. TMI suffered from a mismatch of complexity in reporting to analysis, with no one experienced enough to fill the gap.

      --
      make install -not war

  54. To repair or replace, that is the question. by Anonymous Coward · · Score: 0

    I don't know if we can predict the socio-economic forces that will (or will not) obsolete repairmen. I recently discovered that, much to my surprise, automobile mechanics in Thailand (a developing country by all accounts) often resort to whole engine replacement rather than any of the component-level repairs or rebuilds we are familiar with in the US. On top of that, maintenance in general is poorly done. My expectation was that much more effort would be made to keep machines going and repair them to avoid replacement, as is done in the poorer urban and rural areas of the US.

  55. Re:How about systems that I can manually heal firs by Anonymous Coward · · Score: 0

    Hm, I guess that explains why my porn directory suddenly got a new hot_computer_on_computer_action/ subdirectory.

  56. LOL, One for Microsoft by Anonymous Coward · · Score: 0

    This'll probably get modded flamebait but it is true. If you're telling me you unix guys have to start services with a number of scripts, that these services have no dependencies, that you cannot restart them automatically (or run a resume script, or whatever) ...

    Then I am glad my data center spends its $$$ on Microsoft Windows. We've been doing this since NT!

  57. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  58. Re:How about systems that I can manually heal firs by Anonymous Coward · · Score: 0

    Uninstalling applications is often not handled by the OS and has to be done by application itself, resulting in incomplete installations, config files and registry entries that haven't been properly cleaned up and whatever.
    Gentoo has this ...
    Files aren't versioned, so every change done to a file will simply erase the former content forever, not so good if the former content might have been important.
    this too
    Check of integrity of an installed piece of software isn't possible either, sure there are third-party solutions, but again that should be something that the OS provides by default
    Do you mean the md5sum of an installation package/archive before you install it? Or you somehow want to test the program after install?

  59. IBM Autonomic Computing by mroshea · · Score: 1

    IBM are basing their future application self-healing abilities on a whole branch of research they have been investing in for years, which they call Autonomic Computing.

    It's not all pie in the sky either - they've already released preliminary Autonomic Computing Toolkits as part of their Emerging Technologies Toolkit. Start by looking at the Logging and Trace components, and then maybe look at the Solution Install pieces - they underpin the whole framework.

    It will take a generation, or two (10-15 years) before complete IBM systems (hardware, OS, middleware, databases, applications etc) are close to autonomic: every aspect has to buy in and adopt the Autonomic Computing framework. Given their extensive software catalog, IBM themselves will probably take 10-15 years to complete that task, but they face a significantly larger hurdle convincing major 3rd party vendors (e.g. Oracle, SAP etc) to wire their products into the new autonomic services.

    My guess is, in a mixed vendor (hardware/OS/application) environment, you won't see this for many, many years to come. Pure IBM shops may be able to rely on Autonomic systems within 5-10 years if they are using the latest of everything.

  60. Re:How about systems that I can manually heal firs by grumbel · · Score: 1

    ### Do you mean the md5sum of an installation package/archive before you install it? Or you somehow want to test the program after install?

    Basically both. If one wants a self-healing system, the system first needs a way to find out that something is wrong in the first place. If there isn't a way to detect that some files got broken, then a self-healing system can do nothing. Besides that, it might of course also help to detect some cracker attacks or corrupt hard disks more easily.

  61. Re:How about systems that I can manually heal firs by Anonymous Coward · · Score: 0

    ### if one wants a self healing system, the system needs first a way to find out that something is wrong in the first place. If there isn't a way to detect that some files got broken, then a self-healing system can do nothing. Besides that, it might of course also help to detect some cracker attacks or corrupt harddisks easier.

    The OS can't possibly know whether an application is healthy or not (there are other applications that can check this: rootkit hunter is one of them). The reason being you can't define a healthy application: is allowing myhost.example.com to connect from the internet on port 22 using tcp a sign of "disease" or that the system is healthy since it allows only one host to connect? What if the host is a malicious user who denied every other user the right to access the system? What you want is more in the field of AI ... and we are *cough* months away from that :)