Self-Repairing Computers

Posted by Hemos on Sunday May 11, 2003 @10:52PM from the repairing-the-box dept.

Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. You all have experienced a PC crash or the disappearance of a large Internet site. What to do to improve the situation? This Scientific American article describes a new method called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery by using what these researchers call micro-rebooting; using better tools to pinpoint problems in multicomponent systems; building an "undo" function (similar to those in word-processing programs) for large computing systems; and injecting test errors to better evaluate systems and train operators. Check this column for more details or read the long and dense original article if you want to know more."

5 of 208 comments

/etc/rc.d ? by graveyhead · 2003-05-11 23:01 · Score: 4, Interesting

Frequently, only one of these modules may be encountering trouble, but when a user reboots a computer, all the software it is running stops immediately. If each of its separate subcomponents could be restarted independently, however, one might never need to reboot the entire collection. Then, if a glitch has affected only a few parts of the system, restarting just those isolated elements might solve the problem.
OK, how is this different from the scripts in /etc/rc.d that can start, stop, or restart all my system services? Any daemon process needs this feature, right? It doesn't help if the machine has locked up entirely.

Maybe I just don't understand this part. The other points all seem very sensible.

--
std::disclaimer<std::legalese> sig=new std::disclaimer; sig->dump(); delete sig;

Re:Managerspeak by gilesjuk · 2003-05-11 23:02 · Score: 5, Interesting

Not to mention that the ROC system itself will need to be rock solid. It's no good to have a recovery system that needs to recover itself, which would then recover itself and so on :)

I used systems like this by Mark+Hood · 2003-05-11 23:24 · Score: 5, Interesting

they were large telecomms phone switches.

When I left the company in question, they had recently introduced a 'micro-reboot' feature that allowed you to only clear the registers for one call - previously you had to drop all the calls to solve a hung channel or if you hit a software error.

The system could do this for phone calls, commands entered on the command line, even backups could be halted and started without affecting anything else.
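To make the idea concrete, here is a minimal Python sketch of a per-component "micro-reboot". It is not the switch software described above; the `Component` and `Supervisor` classes and the call names are hypothetical, and a real system would detect hung components itself rather than have a flag set by hand.

```python
class Component:
    """A hypothetical restartable unit, e.g. the state held for one call."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.healthy = True
        self.restarts = 0

    def restart(self):
        # Clear only this component's state; nothing else is touched.
        self.state = {}
        self.healthy = True
        self.restarts += 1


class Supervisor:
    """Owns all components and restarts only the ones that are unhealthy."""

    def __init__(self, names):
        self.components = {name: Component(name) for name in names}

    def micro_reboot(self):
        restarted = []
        for component in self.components.values():
            if not component.healthy:
                component.restart()
                restarted.append(component.name)
        return restarted


sup = Supervisor(["call-1", "call-2", "call-3"])
sup.components["call-1"].state["digits"] = "555"
sup.components["call-2"].healthy = False  # simulate a hung channel
restarted = sup.micro_reboot()
```

Only `call-2` is restarted here; the state held for `call-1` survives, which is the whole point compared with dropping every call.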

Yes, it requires extensive development, but you can do it incrementally. We had thousands of software 'blocks' which had this functionality added to them whenever they were opened for other reasons; we never added this feature unless we were already making major changes.

Patches could be introduced to the running system, and falling back was simplicity itself - the same went for configuration changes.

This stuff is not new in the telecomms field, where 'five nines' uptime is the bare minimum. Now that the telcos are trying to save money, they're looking at commodity PCs & open standard solutions, and shuddering - you need to reboot everything to fix a minor issue? Ugh!

As for introducing errors to test stability, I did this, and I can vouch for its effects. I made a few patches that randomly caused 'real world' type errors (call dropped, congestion on routes, no free devices) and let it run for a weekend as an automated caller tried to make calls. When I came in on Monday I'd caused 2,000 failures which boiled down to 38 unique faults. The system had not rebooted once, so only those 2,000 calls had even noticed a problem. Once the software went live, the customer spotted 2 faults in the first month, where previously they'd found 30... So I swear by 'negative testing'.
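A toy version of that kind of fault-injection run might look like the Python sketch below. Everything in it is an assumption made for illustration: the `place_call` handler, the fault rate, and the three fault names (borrowed from the examples above). It shows the shape of the harness, which injects random errors, contains each failure to one call, and groups failures by fault. It does not reproduce the numbers quoted above.

```python
import random
from collections import Counter

# Fault names taken from the examples in the comment; the rest is hypothetical.
FAULTS = ["call dropped", "route congested", "no free devices"]


class InjectedFault(Exception):
    """Raised when the harness deliberately breaks a call."""


def place_call(rng, fault_rate):
    """Hypothetical call handler with a randomly injected fault."""
    if rng.random() < fault_rate:
        raise InjectedFault(rng.choice(FAULTS))
    return "connected"


def run_campaign(attempts, fault_rate, seed=0):
    """Make many calls, counting completions and grouping failures by fault."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    failures = Counter()
    completed = 0
    for _ in range(attempts):
        try:
            place_call(rng, fault_rate)
            completed += 1
        except InjectedFault as exc:
            # The failure is contained to this one call; the loop keeps going.
            failures[str(exc)] += 1
    return completed, failures


completed, failures = run_campaign(attempts=10_000, fault_rate=0.2)
print(completed, sum(failures.values()), sorted(failures))
```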

Nice to see the 'PC' world finally catching up :)

If people want more info, then write to me.

Mark

--
Liked this comment? Why not buy me something nice

Self-diagnostics by 6hill · 2003-05-11 23:45 · Score: 4, Interesting

I've done some work on high availability computing (incl. my Master's thesis) and one of the more interesting problems is the one you described here -- true metaphysics. The question as it is usually posed goes, How does one self-diagnose? Can a computer program distinguish between malfunctioning software and malfunctioning software-monitoring software -- is the problem in the running program or in the actual diagnostic software? How do you run diagnostics on diagnostics running diagnostics on diagnostics... ugh :).
My particular system of research finally wound up relying on the Windows method: if uncertain, erase and reboot. It didn't have to be 99.999% available, after all. There are other ways with which to solve this in distributed/clustered computing, such as voting: servers in the cluster vote for each other's sanity (i.e. determine if the messages sent by one computer make sense to at least two others). However, even this system is not rock solid (what if two computers happen to malfunction in the same manner simultaneously? what if the malfunction is contagious? or widespread in the cluster?).
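The voting rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the `votes` table, the node names, and the `sane_nodes` helper are hypothetical, and the quorum of two comes from the "at least two others" wording above.

```python
def sane_nodes(votes, quorum=2):
    """Return the nodes that at least `quorum` other nodes vouch for.

    votes[a][b] is True if node a thinks node b's messages make sense.
    """
    nodes = set(votes)
    result = set()
    for target in nodes:
        approvals = sum(
            1
            for voter in nodes
            if voter != target and votes[voter].get(target, False)
        )
        if approvals >= quorum:
            result.add(target)
    return result


votes = {
    "a": {"b": True, "c": False, "d": True},
    "b": {"a": True, "c": False, "d": True},
    "c": {"a": False, "b": False, "d": False},  # c distrusts everyone
    "d": {"a": True, "b": True, "c": False},
}
healthy = sane_nodes(votes)
```

The weakness mentioned above shows up directly in this rule: two nodes that fail in the same way can vouch for each other and count toward one another's quorum.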
So, self-correcting is an intriguing question, to say the least. I'll be keenly following what the ROC fellas come up with.

Re:Managerspeak by sjames · 2003-05-12 00:44 · Score: 4, Interesting

There are already steps in place towards recoverability in currently running systems. That's what filesystem journaling is all about. Journaling doesn't do anything that fsck can't do EXCEPT that replaying the journal is much faster. Vi recovery files are another example. As the article pointed out, 'undo' in any app is an example.

Life critical systems are often actually two separate programs, 'old reliable' which is primarily designed not to allow a dangerous condition, and the 'latest and greatest' which has optimal performance as its primary goal. Should 'old reliable' detect that 'latest and greatest' is about to do something dangerous, it will take over and possibly reboot 'latest and greatest'.
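As a rough Python sketch of that arrangement: the two controllers, the setpoint of 70, the gains, and the safety limit of 10 are all made-up values for illustration, not taken from any real system. The safety check sits in front of the fast controller and hands control to the conservative one whenever the proposed command is out of bounds.

```python
SETPOINT = 70.0    # hypothetical target value
SAFE_LIMIT = 10.0  # hypothetical bound on a safe command


def latest_and_greatest(reading):
    """Tuned for performance: aggressive gain, no safety clamp."""
    return (SETPOINT - reading) * 5.0


def old_reliable(reading):
    """Tuned for safety: gentle gain, output always clamped to the limit."""
    command = (SETPOINT - reading) * 0.5
    return max(-SAFE_LIMIT, min(SAFE_LIMIT, command))


def is_safe(command):
    return abs(command) <= SAFE_LIMIT


def control_step(reading):
    """Use the fast controller unless its command would be dangerous."""
    proposed = latest_and_greatest(reading)
    if is_safe(proposed):
        return proposed, "latest"
    return old_reliable(reading), "fallback"


near = control_step(69.0)  # small error: fast controller is trusted
far = control_step(40.0)   # large error: fast controller is overruled
```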

Transaction based systems feature rollback, volume managers support snapshot, and libraries exist to support application checkpointing. EROS is an operating system based on transactions and persistent state. It's designed to support this sort of reliability.

HA clustering and server farms are another similar approach. In that case, they allow individual transactions to fail and individual machines to crash, but overall remain available.

Apache has used a simple form of this for years. Each server process has a maximum service count associated with it. It will serve that many requests, then be killed and a new process spawned. The purpose is to minimize the consequences of unfixed memory leaks.
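In Apache the limit is set with the MaxRequestsPerChild directive. The mechanism itself can be sketched in Python as follows; the `Worker` and `RecyclingPool` classes are hypothetical stand-ins (a single in-process worker rather than real child processes), and the growing `leaked` list plays the role of an unfixed memory leak.

```python
class Worker:
    """Hypothetical worker whose memory use grows with every request."""

    def __init__(self):
        self.served = 0
        self.leaked = []  # stands in for an unfixed memory leak

    def handle(self, request):
        self.leaked.append(request)  # the leak grows on each request
        self.served += 1
        return f"handled {request}"


class RecyclingPool:
    """Replaces the worker after it has served max_requests requests."""

    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.worker = Worker()
        self.respawns = 0

    def handle(self, request):
        reply = self.worker.handle(request)
        if self.worker.served >= self.max_requests:
            # Discard the old worker, and its leaked memory with it.
            self.worker = Worker()
            self.respawns += 1
        return reply


pool = RecyclingPool(max_requests=3)
replies = [pool.handle(f"req-{i}") for i in range(7)]
```

However long the pool runs, the leak never grows past `max_requests` entries, which bounds the damage without anyone having to find the leak.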

Many server daemons support a reload method where they re-read their config files without doing a complete restart. Smart admins make a backup copy of the config files to roll back to should their changes cause a system failure.
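The backup-and-roll-back habit can be automated. The sketch below is one possible shape for it in Python, under stated assumptions: `apply_config` is a hypothetical helper, the config happens to be JSON so that `json.loads` can act as the validity check, and the step where the daemon is told to re-read its file is left out.

```python
import json
import shutil
import tempfile
from pathlib import Path


def apply_config(path, new_text, validate):
    """Write new_text to path; restore the backup if validate() rejects it."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # keep the known-good copy
    path.write_text(new_text)
    try:
        validate(path.read_text())
    except Exception:
        shutil.copy2(backup, path)  # roll back to the known-good copy
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "daemon.conf"
    conf.write_text('{"workers": 4}')
    ok = apply_config(conf, '{"workers": 8}', json.loads)  # valid change
    bad = apply_config(conf, '{"workers": ', json.loads)   # broken change
    final_text = conf.read_text()
```

After the broken change is rejected, the file still holds the last configuration that passed validation.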

Also as the article points out, design for testing (DFT) has been around in hardware for a while as well. That's what JTAG is for. JTAG itself will be more useful once reasonably priced tools become available. Newer motherboards have JTAG ports built in. They are intended for monitor boards, but can be used for debugging as well (IMHO, they would be MORE useful for debugging than for monitoring, but that's another post!). Built-in watchdog timers are becoming more common as well. ECC RAM is now mandatory on many server boards.

It WILL take a lot of work. It IS being done NOW in a stepwise manner. IF/when healthy competition in software is restored, we will see even more of this. When it comes down to it, nobody likes to lose work or time and software that prevents that will be preferred to that which doesn't.