Self-Repairing Computers
Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. You all have experienced a PC crash or the disappearance of a large Internet site. What to do to improve the situation? This Scientific American article describes a new method called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery by using what these researchers call micro-rebooting; using better tools to pinpoint problems in multicomponent systems; building an "undo" function (similar to those in word-processing programs) for large computing systems; and injecting test errors to better evaluate systems and train operators. Check this column for more details or read the long and dense original article if you want to know more."
For a much more detailed discussion of Recovery Oriented Computing, you're better off visiting the ROC group at Berkeley, specifically David Patterson's writings.
ooooooh! What does this button do? - DeeDee, Dexters Lab.
Well, yeah. That's basically a watchdog timer. It's very common in embedded stuff, because it's cheap to implement - in fact, many microcontrollers have it built into the hardware. In microcontrollers they're very simple - a counter counts up (say) 1024 clock pulses, and if it rolls over, it resets the CPU. In normal operation, every time round the main loop (once every millisecond or so) you'd write to a specified IO port to kick the watchdog - this resets the counter. It's crude but effective, and is very commonly used in things like ECUs for automotive electrickery - although the software is simple enough to be thoroughly tested (BMW 735i's aside), there's still dirty power and a mechanically harsh environment to deal with. And your ABS ECU doesn't have , does it?
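To make the kick-or-reset behaviour concrete, here is a minimal software model of it. It is only a sketch: a real watchdog is a hardware counter, and the names `WatchdogTimer`, `run`, `loop_period` and `hang_at` are made up for illustration. The 1024-tick timeout is the figure from the comment above.

```python
class WatchdogTimer:
    """Toy software model of a hardware watchdog counter (illustrative only)."""

    def __init__(self, timeout_ticks=1024):
        self.timeout_ticks = timeout_ticks
        self.counter = 0
        self.resets = 0

    def kick(self):
        # The main loop calls this to prove it is still alive.
        self.counter = 0

    def tick(self):
        # One clock pulse. Returns True if the counter rolled over,
        # which on real hardware would reset the CPU.
        self.counter += 1
        if self.counter >= self.timeout_ticks:
            self.counter = 0
            self.resets += 1
            return True
        return False


def run(wd, total_ticks, loop_period, hang_at=None):
    """Simulate a main loop that kicks every loop_period ticks.

    If hang_at is given, the loop stops kicking from that tick on,
    until the watchdog fires and "reboots" it.
    """
    for t in range(total_ticks):
        hung = hang_at is not None and t >= hang_at
        if not hung and t % loop_period == 0:
            wd.kick()
        if wd.tick():
            # The reset restarts the firmware, which clears the hang.
            hang_at = None
    return wd.resets


healthy = run(WatchdogTimer(), total_ticks=5000, loop_period=100)
crashed = run(WatchdogTimer(), total_ticks=5000, loop_period=100, hang_at=2000)
print(healthy, crashed)
```

A healthy loop never lets the counter reach the timeout, so the watchdog stays quiet; a hung loop stops kicking and gets reset after at most one timeout period.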
build an "undo" function (similar to those in word-processing programs) for large computing systems
This is called "the sysadmin thinks ahead."
Essentially, when any sysadmin worth a pile of beans makes any changes whatsoever, he makes sure there's a backup plan before making his changes live. Whether it means running the service on a non-standard port to test, running it on the development server to test, making backups of the configuration and/or the binaries in question, or making backups of the entire system every night, he is thinking "what happens if this doesn't work?" before making any changes. It doesn't matter if it's a web server running on a lowly Pentium 2 or Google - the sysadmin is paid to think about actions before making them. Having an "undo" function like this won't replace the sysadmin, although I can imagine a good many PHBs trying before realizing that just because you can back out of stupid mistakes doesn't mean you can keep them from happening in the first place.
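The "back up the configuration first" habit can be sketched in a few lines. This is just one way to do it, not how the ROC researchers implement undo; `change_with_undo` and the caller-supplied `validate` check are hypothetical names, and the `.bak` suffix is an arbitrary choice.

```python
import shutil
from pathlib import Path


def change_with_undo(config_path, new_text, validate):
    """Write new_text to config_path, restoring the old file if validate fails.

    validate is a caller-supplied function (hypothetical here) that takes
    the path and returns True when the new configuration is acceptable.
    """
    config_path = Path(config_path)
    backup_path = config_path.with_name(config_path.name + ".bak")

    # The backup plan comes first, before anything goes live.
    shutil.copy2(config_path, backup_path)
    config_path.write_text(new_text)

    try:
        ok = bool(validate(config_path))
    except Exception:
        # A validator that blows up counts as a failed change.
        ok = False

    if not ok:
        # "Undo": put the known-good file back.
        shutil.copy2(backup_path, config_path)
    return ok
```

As the comment says, this only lets you back out of a bad change. Deciding what `validate` should check is still the sysadmin's job.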
"No problem. I have the capacity to do infinite work so long as you don't mind that my quality approaches zero."-Dilbert