Self-Repairing Computers
Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. You have all experienced a PC crash or the disappearance of a large Internet site. What can be done to improve the situation? This Scientific American article describes a new approach called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery using what these researchers call micro-rebooting; better tools to pinpoint problems in multicomponent systems; an "undo" function (similar to those in word-processing programs) for large computing systems; and injection of test errors to better evaluate systems and train operators. Check this column for more details or read the long and dense original article if you want to know more."
In my experience, the best setup is a pair of computers running in parallel, balanced by a third computer that watches for problems and, when the live system crashes, switches service seamlessly to its partner. The monitor then reboots the failed system and lets it rebuild its dataset from its partner.
In effect, this shows how important massive parallelism is for truly stable systems: a cluster forms the virtual computer, and we get away from the idea of a computer as a single machine.
After all, individual computers suffer hardware failures too!
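To make the paired-computer idea concrete, here is a rough Python sketch of a monitor that fails over between two nodes and then resyncs the crashed one from its partner. The `Node` and `Monitor` classes and all their method names are made up for illustration; a real setup would use heartbeats over a network and proper replication, not in-memory dicts.

```python
class Node:
    """Hypothetical stand-in for one computer in the pair."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.data = {}

    def write(self, key, value):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        self.data[key] = value

    def reboot_from(self, partner):
        # "Reboot" and rebuild the dataset from the partner's copy.
        self.data = dict(partner.data)
        self.healthy = True


class Monitor:
    """Hypothetical watcher that balances the pair and handles failover."""

    def __init__(self, first, second):
        self.live, self.standby = first, second

    def check(self):
        # If the live node has crashed, promote the standby and repair the other.
        if not self.live.healthy:
            if not self.standby.healthy:
                raise RuntimeError("both nodes are down")
            failed = self.live
            self.live, self.standby = self.standby, failed
            failed.reboot_from(self.live)

    def write(self, key, value):
        self.check()
        self.live.write(key, value)
        # Mirror every write so either node can take over.
        if self.standby.healthy:
            self.standby.write(key, value)


a, b = Node("a"), Node("b")
monitor = Monitor(a, b)
monitor.write("x", 1)
a.healthy = False          # simulate a crash of the live node
monitor.write("y", 2)      # monitor fails over to b, then resyncs a
```

Note that the monitor here is itself a single point of failure, which is one more argument for pushing toward clusters rather than a single watcher.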