Slashdot Mirror


Self-Repairing Computers

Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. You all have experienced a PC crash or the disappearance of a large Internet site. What to do to improve the situation? This Scientific American article describes a new method called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery by using what these researchers call micro-rebooting; using better tools to pinpoint problems in multicomponent systems; build an "undo" function (similar to those in word-processing programs) for large computing systems; and injecting test errors to better evaluate systems and train operators. Check this column for more details or read the long and dense original article if you want to know more."

33 of 208 comments (clear)

  1. This would be great by CausticWindow · · Score: 4, Funny

    coupled with self debugging code.

    --
    How small a thought it takes to fill a whole life
  2. Managerspeak by CvD · · Score: 3, Insightful

    I haven't read the long and dense article, but this sounds like managerspeak, PHB-talk. The concepts described are all very high level, requiring a whole plethora of yet unwritten code to roll back changes in a large system. This will require a lot of work, including rebuilding a lot of those large systems from the ground up.

    I don't think anybody (any company) is willing to undertake such an enterprise, having to re-architect/redesign whole systems from ground up. Systems that work these days, but aren't 100% reliable.

    Will it be worth it? For those systems to have a smaller boot up time after failure? I don't think so, but ymmv.

    Cheers,

    Costyn.

    1. Re:Managerspeak by gilesjuk · · Score: 5, Interesting

      Not to mention that the ROC system itself will need to be rock solid. It's no good to have a recovery system that needs to recover itself, which would then recover itself and so on :)

    2. Re:Managerspeak by Bazzargh · · Score: 3, Insightful

      I haven't read the long and dense article

      Yet you feel qualified to comment....

      requiring a whole plethora of yet unwritten code

      You do realize they have running code for (for example) an email server (actually a proxy) which uses these principals? NB this was based on proxying sendmail, so they didn't "re-architect/redesign whole systems from ground up". This isn't the only work they've done either.

      As for 'will it be worth it', if you'd read the article you'd find their economic justifications. This has a good explanation of the figures. Note in particular that a large proportion of the failure they are concerned about is operator error, hence why they emphasise system rollback as a recovery technique, as opposed to software robustness.

    3. Re:Managerspeak by sjames · · Score: 4, Interesting

      There are allready steps in place towards recoverability in currently running system. That's what filesystem journaling is all about. Journaling doesn't do anything that fsck can't do EXCEPT that replaying the journal is much faster. Vi recovery files are another example. As the article pointed out, 'undo' in any app is an example.

      Life critical systems are often actually two seperate programs, 'old reliable' which is primarily designed not to allow a dangerous ondition, and the 'latest and greatest' which has optimal performance as it's primary goal. Should 'old reliable' detect that 'latest and greatest' is about to do something dangerous, it will take over and possibly reboot 'latest and greatest'.

      Transaction based systems feature rollback, volume managers support snapshot, and libraries exist to support application checkpointing. EROS is an operating system based on transactions and persistant state. It's designed to support this sort of reliability.

      HA clustering and server farms are another similar approach. In that case, they allow individual transactions to fail and individual machines to crash, but overall remain available.

      Apache has used a simple form of this for years. Each server process has a maximum service count associated with it. It will serve that many requests, then be killed and a new process spawned. The purpose is to minimize the consequences of unfixed memory leaks.

      Many server daemons support a reload method where they re-read their config files without doing a complete restart. Smart admins make a backup copy of the config files to roll back to should their changes cause a system failure.

      Also as the article points out, design for testing (DFT) has been around in hardware for a while as well. That's what JTAG is for. JTAG itself will be more useful once reasonably priced tools become available. Newer motherboards have JTAG ports built in. They are intended for monitor boards, but can be used for debugging as well (IMHO, they would be MORE useful for debugging than for monitoring, but that's another post!). Built in watchdog timers are becoming more common as well. ECC RAM is now manditory on many server boards.

      It WILL take a lot of work. It IS being done NOW in a stepwise manner. IF/when healthy competition in software is restored, we will see even more of this. When it comes down to it, nobody likes to lose work or time and software that prevents that will be preferred to that which doesn't.

  3. Interesting choice by sql*kitten · · Score: 4, Insightful
    From the article:

    We decided to focus our efforts on improving Internet site software. ...
    Because of the constant need to upgrade the hardware and software of Internet sites, many of the engineering techniques used previously to help maintain system dependability are too expensive to be deployed.

    (etc)

    Translation: "when we started this project, we thought we'd be able to spin it off into a hot IPO and get rich!!"
  4. /etc/rc.d ? by graveyhead · · Score: 4, Interesting
    Frequently, only one of these modules may be encountering trouble, but when a user reboots a computer, all the software it is running stops immediately. If each of its separate subcomponents could be restarted independently, however, one might never need to reboot the entire collection. Then, if a glitch has affected only a few parts of the system, restarting just those isolated elements might solve the problem.
    OK, how is this different from the scripts in /etc/rc.d that can start, stop, or restart all my system services? Any daemon process needs this feature, right? It doesn't help if the machine has locked up entirely.

    Maybe I just don't understand this part. The other points all seem very sensible.
    --
    std::disclaimer<std::legalese> sig=new std::disclaimer; sig->dump(); delete sig;
    1. Re:/etc/rc.d ? by Surak · · Score: 4, Insightful

      Exactly. It isn't. I think the people who wrote this are looking at Windows machines, where restarting individual subcomponents is often impossible.

      If my Samba runs in trouble and gets its poor little head confused, I can restart the Samba daemon. There's no equivalent on Windows -- if SMB-based filesharing goes down on an NT box, you're restarting the computer, there is no other choice.

    2. Re:/etc/rc.d ? by Mark+Hood · · Score: 3, Interesting

      It's different (in my view) in that you can go even lower than that... Imagine you're running a webserver, and you get 1000 hits a minute (say).

      Now say that someone manages to hang a session, because of a software problem. Eventually the same bug will hang another one, and another until you run out of resources.

      Just being able to stop the web server & restart to clear it is fine, but it is still total downtime, even if you don't need to reboot the PC.

      Imagine you could restart the troublesome session and not affect the other 999 hits that minute... That's what this is about.

      Alternatively, making a config change that requires a reboot is daft - why not apply it for all new sessions from now on? If you get to a point where people are still logged in after (say) 5 minutes you could terminate or restart their sessions, perhaps keeping the data that's not changed...

      rc.d files are a good start, but this is about going further.

      --
      Liked this comment? Why not buy me something nice
    3. Re:/etc/rc.d ? by Surak · · Score: 3, Interesting

      Yes. I'm typing this on last night's build of Mozilla Firebird running under Windows NT 4.0. Sure you can stop and start the workstation and/or server services. Ever done it? How stable is NT after that?

      I can tell you that on *nix restarting the Samba daemon happens seamlessly.

    4. Re:/etc/rc.d ? by delta407 · · Score: 4, Insightful
      There's no equivalent on Windows -- if SMB-based filesharing goes down on an NT box, you're restarting the computer, there is no other choice.
      How about restarting the "Server" service?

      Depending on how file sharing "goes down", you may need to restart a different service. Don't be ignorant: it is usually possible to fix an NT box while it's running. However, it's usually easier to reboot, and if it's not too big of a big deal, Windows admins usually choose to reboot rather to go in and figure out what processes they have to kick.
  5. hmmmmm by Shishio · · Score: 5, Funny

    the disappearance of a large Internet site.

    Yeah, I wonder what could ever bring down a large Internet site?
    Ahem.

    --
    Twelve fingers or one, its how you play. ~Gattaca (Vincent)
  6. test errors by paulmew · · Score: 3, Funny

    "Last, computer scientists should develop the ability to inject test errors" Ah, so that explains those BSOD's It's not a fault, it's a feature....

  7. ROC detail by rleyton · · Score: 5, Informative

    For a much better, and more detailed, discussion of Recovery Oriented Computing, you're better off visiting the ROC group at Berkeley, specifically David Paterson's writings.

    --
    ooooooh! What does this button do? - DeeDee, Dexters Lab.
  8. it will not work now by KingRamsis · · Score: 4, Insightful

    Computers still rely on the original John von Neumann architecture they are not redundant in anyway, there will be always a single point of failure for ever, no matter what you hear about RAID, redundant power suppliers etc.. etc.. basically the self-healing system is based on the same concept, compare that to a natural thing like the nervous system of humans now that is redundant and self healing, a fly has more wires in it's brain than all of the internet nodes, cut your finger and after a couple of days a fully automated autonomous transparent healing system will fix it, if we ever need to create self healing computers we need to radically change what is a computer, we need to break from the John von Neumann not because anything wrong with it but because it is reaching it's limits quickly, we need truly parallel autonomous computers with replicated capacity that increase linearly by adding more hardware, and software paradigms that take advantage of that, try make a self-healing self-fixing computer today and you will end up with a every complicated piece of software that will fail in real life.

  9. Various levels of rebooting... by jkrise · · Score: 4, Funny

    Micro-rebooting: Restart service.
    Mini-rebooting: Restart Windows 98
    Rebooting : Switch off/on power
    Macro-rebooting: BSOD.
    Mega-rebooting: BSOD--> System crash--> reload OS from Recovery CD--> Reinstall apps --> reinstall screen savers --> reinstall Service Packs --> Say your prayers --> Reboot ---> Curse --> Repeat.

    --
    If you keep throwing chairs, one day you'll break windows....
  10. uunnschulding sme.. by danalien · · Score: 3, Insightful

    but if end-users got a better computer education, I think most of the problems would be fixed.

    I find it quite funny that "a ground course in computer"-courses we have (here in sweden) only educate people in how to use word/excel/powerpoint/etc... nothing _fundamental_ about how to opporate a computer. It`s like learning how to use the cigaret lighter in your car, and declareing yourself as someone who can drive a car. And now you want a quick fix for your incompentance in driving "the car".

    --
    I don't claim I know more than I know, and if you know you know more than I know, then by all means, let me know.
  11. Compulsory M$ joke by Rosco+P.+Coltrane · · Score: 3, Funny
    Third, programmers ought to build systems that support an "undo" function (similar to those in word-processing programs), so operators can correct their mistakes. Last, computer scientists should develop the ability to inject test errors; these would permit the evaluation of system behavior and assist in operator training.

    [WARNING]
    You have installed Microsoft[tm] Windows[tm]. Would you like to undo your mistake, or are you simply injecting test errors on your system ?

    [Undo] [Continue testing]

    --
    "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
  12. Hmm. by mfh · · Score: 4, Insightful
    Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex

    I think that's a big fat lie.

    --
    The dangers of knowledge trigger emotional distress in human beings.
  13. Write scripts for it... by ndogg · · Score: 4, Insightful

    and cron them in.

    This concept isn't particularily new. It's easy to write a script that will check a partiular piece of the system by running some sort of diagnostic command (e.g. netstat), parse the output, and make sure everything looks normal. If something doesn't look normal, just stop the process and restart, or whatever you need to do to get some service back up an running, or secured, or whatever is needed to make the system normal again.

    Make sure that script is part of a crontab that's run somewhat frequently, and things should recover on their own as soon as they fail (well, within the time-frame that you have the script running within your crontab.)

    "Undo" feature? That's what backups are for.

    Of course, the article was thinking that this would be built into the software, but I don't think that is that much better of a solution. In fact, I would say that that would make things more complicated than anything.

    --
    // file: mice.h
    #include "frickin_lasers.h"
  14. Second paragraph by NewbieProgrammerMan · · Score: 4, Insightful

    The second paragraph of the "long and dense article" strikes me as hyperbole. I haven't noticed that my computer's "operation has become brittle and unreliable" or that it "crash[es] or freeze[s] up regularly." I have not experienced the "annual outlays for maintenance, repairs and operations" that "far exceed total hardware and software costs, for both individuals and corporations."

    Since this is /. I feel compelled to say this: "Gee, sounds like these guys are Windows users." Haha. But, to be fair, I have to say that - in my experience, at least - Windows2000 has been pretty stable both at home and at work. My computers seem to me to have become more stable and reliable over the years.

    But maybe my computers have become more stable because I learned to not tweak on them all the time. As long as my system works, I leave it the hell alone. I don't install the "latest and greatest M$ service pack" (or Linux kernel, for that matter) unless it fixes a bug or security vulnerability that actually affects me. I don't download and install every cutesy program I see. My computer is a tool I need to do my job - and since I've started treating it as such, it seems to work pretty damn well.

    --
    [b.belong('us') for b in bases if b.owner() == 'you']
  15. Re:No clue by Gordonjcp · · Score: 4, Informative

    Well, yeah. That's basically a watchdog timer. It's very common in embedded stuff, because it's cheap to implement - in fact, many microcontrollers have it built into the hardware. In microcontrollers they're very simple - a counter counts up (say) 1024 clock pulses, and if it rolls over then reset the CPU. In normal operation then every time round the main loop you'd write to a specified IO port to kick the watchdog once every millisecond or so - this resets the counter. It's crude but effective, and is very commonly used in things like ECUs for automotive electrickery - although the software is simple enough to be thoroughly tested (BMW 735i's aside) there's still dirty power and mechanically harsh environment to deal with. And your ABS ECU doesn't have , does it?

  16. I used systems like this by Mark+Hood · · Score: 5, Interesting

    they were large telecomms phone switches.

    When I left the company in question, they had recently introduced a 'micro-reboot' feature that allowed you to only clear the registers for one call - previously you had to drop all the calls to solve a hung channel or if you hit a software error.

    The system could do this for phone calls, commands entered on the command line, even backups could be halted and started without affecting anything else.

    Yes, it requires extensive development, but you can do it incrementally - we had thousadnds of software 'blocks' which had this functionality added to them whenever they were opened for other reasons, we never added this feature unless we were already making major changes.

    Patches could be introduced to the running system, and falling back was simplicity itself - the same went for configuration changes.

    This stuff is not new in the telecomms field, where 'five nines' uptime is the bare minimum. Now the telco's are trying to save money, they're looking at commodity PCs & open standard solutions, and shuddering - you need to reboot everything to fix a minor issue? Ugh!

    As for introducing errors to test stability, I did this, and I can vouch for it's effects. I made a few patches that randomly caused 'real world' type errors (call dropped, congestion on routes, no free devices) and let it run for a weekend as an automated caller tried to make calls. When I came in on Monday I'd caused 2,000 failures which boiled down to 38 unique faults. The system had not rebooted once, so only those 2,000 calls had even noticed a problem. Once the software went live, the customer spotted 2 faults in the first month, where previously they'd found 30... So I swear by 'negative testing'.

    Nice to see the 'PC' world finally catching up :)

    If people want more info, then write to me.

    Mark

    --
    Liked this comment? Why not buy me something nice
  17. "Managerspeak"?! by No+Such+Agency · · Score: 3, Insightful

    Somebody has to suggest the weird ideas, even if they sound stupid and impractical now. Of course we won't be retrofitting our existing systems in six months, I think this is a bigger vision than that.

    Rather than trying to eliminate computer crashes--probably an impossible task--our team concentrates on designing systems that recover rapidly when mishaps do occur.

    The goal here is clearly to make the stability of the operating system and software less critical, so we don't have to hope and pray that a new installation doesn't overwrite a system file with a weird buggy version, or that our OS won't decide to go tits-up in the middle of an important process. Since all us good Slashdotters KNOW there will still be crufty, evil OS's around in 10 years, even if WE aren't using them :-)

    --
    Freedom: "I won't!"
  18. ACID ROC? by shic · · Score: 3, Insightful

    I wonder... is there a meaningful distinction between ROC and the classical holy-grail of ACID systems(i.e. systems which meet Atomic, Consistent, Isolated and Durable assumptions commonly cited in the realm of commercial RDBMS?) Apart from the 'swish' buzzword re-name that isn't even an acronym?

    Professionals in the field, while usually in agreement about the desirability of systems which pass the ACID test, most admit that while the concepts are well understood, the real-world cost of the additional software complexity often precludes strict ACID compliance in typical systems. I would certainly be interested if there were more to ROC than evaluating the performance of existing well understood ACID-related techniques but can't find anything more than the "hype." For example, has ROC suggested designs to resolve distributed incoherence due to hardware failure? Classified non-trivial architectures immune to various classes of failure? Discovered a cost effective approach to ACID?

  19. The Hurd by rf0 · · Score: 3, Interesting

    Wouldn't some sort of software solution be the Hurd (if/when it becomes ready) in that as each system is a micro-kernel you just restart that bit of the operating system. As said in another post this is like /etc/rc.d but at a lower level.

    Or you could just have some sort of failover setup.

    Rus

  20. Magic Server Pixie Dust by thynk · · Score: 3, Funny

    Didn't IBM come out with some Magic Server Pixie Dust that did this sort of thing already, or am I mistaken?

    --

    Good judgment comes from experience, and a lot of that comes from bad judgment.
  21. Self-diagnostics by 6hill · · Score: 4, Interesting
    I've done some work on high availability computing (incl. my Master's thesis) and one of the more interesting problems is the one you described here -- true metaphysics. The question as it is usually posed goes, How does one self-diagnose? Can a computer program distinguish between a malfunctioning software or malfunctioning software monitoring software -- is the problem in the running program or in the actual diagnostic software? How do you run diagnostics on diagnostics running diagnostics on diagnostics... ugh :).

    My particular system of research finally wound up relying on the Windows method: if uncertain, erase and reboot. It didn't have to be 99.999% available, after all. There are other ways with which to solve this in distributed/clustered computing, such as voting: servers in the cluster vote for each other's sanity (i.e. determine if the messages sent by one computer make sense to at least two others). However, even not this system is rock solid (what if two computers happen to malfunction in the same manner simultaneously? what if the malfunction is contagious? or widespread in the cluster?).

    So, self-correcting is an intriguing question, to say the least. I'll be keenly following what the ROC fellas come up with.

    1. Re:Self-diagnostics by jtheory · · Score: 3, Insightful

      There are other ways with which to solve this in distributed/clustered computing, such as voting: servers in the cluster vote for each other's sanity (i.e. determine if the messages sent by one computer make sense to at least two others). However, even not this system is rock solid (what if two computers happen to malfunction in the same manner simultaneously? what if the malfunction is contagious? or widespread in the cluster?).

      We can learn some lessons from how human society works. If your messages don't make sense to most other people, or if you start damaging a lot of other people, you get separated from the rest and possibly "rebooted" (some call this "electroshock therapy") or even deactivated (some call this "Welcome to Texas").

      The difference here is that if the computers in the cluster are all running the same programs, they will contain the exact same coding flaw that they will all concur is the only sane answer (in human terms, this is called "religion"). So we're protected from hardward malfunctions, but not bugs in software or hardware.

      That's why this stuff is so hard to do. It may be possible to use selective program restarts to temporarily keep service up in spite of a nasty memory leak, but nothing is really "repaired"; it's just providing a few more fingers to plug holes in the dam while the river keeps rising. So... do you get into providing alternative services for the ones malfunctioning?

      Interesting stuff (maybe I'll even read the article now).

      --
      There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
  22. Does SCI AM review articles properly nowadays? by panurge · · Score: 3, Insightful
    The authors either don't seem to know much about the current state of the art or are just ignoring it. And as for unreliability - well, it's true that the first Unix box I ever had (8 user with VT100 terminals) could go almost as long without rebooting as my most recent small Linux box, but there's a bit of a difference in traffic between 8 19200 baud serial links and two 100baseT ports, not to mention the range of applications being supported.
    Or the factor of 1000 to 1 in hard disk sizes.
    Or the 20:1 price difference.

    I think a suitable punishment would be to lock the authors in a museum somewhere that has a 70s mainframe, and let them out when they've learned how to swap disk packs, load the tapes, splice paper tape, connect the Teletype, sweep the chad off the floor, stack a card deck or two and actually run an application...those were the days, when computing kept you fit.

    --
    Panurge has posted for the last time. Thanks for the positive moderations.
    1. Re:Does SCI AM review articles properly nowadays? by NearlyHeadless · · Score: 4, Insightful
      The authors either don't seem to know much about the current state of the art or are just ignoring it.

      I have to say that I am just shocked at the inane reactions on slashdot to this interesting article. Here we have a joint project of two of the most advanced CS departments in the world. David Patterson's name, at least, should be familiar to anyone who has studied computer science in the last two decades since he is co-author of the pre-eminent textbook on computer architecture.

      Yet most of the comments (+5 Insightful) are (1) this is pie in the sky, (2) they must just know Windows, har-de-har-har, (3) Undo is for wimps, that is what backups are for, (4) this is just "managerspeak".


      Grow up people. They are not just talking about operating systems, they do know what they are talking about. Some of their research involved hugely complex J2EE systems that run on, yes, Unix systems. Some of their work involves designing custom hardware--"ROC-1 hardware prototype, a 64-node cluster with special hardware features for isolation, redundancy, monitoring, and diagnosis."


      Perhaps you should just pause for a few minutes to think about their research instead of trying to score Karma points.

  23. DWIM by PhilHibbs · · Score: 3, Funny

    We've had RISC, MMX, VLIW, SSI, maybe it's time for DWIM processors.

  24. Nothing new. by pmz · · Score: 3, Insightful

    micro-rebooting; using better tools to pinpoint problems in multicomponent systems; build an "undo" function...

    I think they just invented Lisp :). I don't program in Lisp, but have seen people who are very good at it. Quite impressive.