Self-Repairing Computers

Posted by Hemos on Sunday May 11, 2003 @10:52PM from the repairing-the-box dept.

Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. You all have experienced a PC crash or the disappearance of a large Internet site. What can be done to improve the situation? This Scientific American article describes a new method called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery by using what these researchers call micro-rebooting; using better tools to pinpoint problems in multicomponent systems; building an "undo" function (similar to those in word-processing programs) for large computing systems; and injecting test errors to better evaluate systems and train operators. Check this column for more details or read the long and dense original article if you want to know more."

4 of 208 comments

Re:Managerspeak by gilesjuk · 2003-05-11 23:02 · Score: 5, Interesting

Not to mention that the ROC system itself will need to be rock solid. It's no good to have a recovery system that needs to recover itself, which would then recover itself and so on :)
hmmmmm by Shishio · 2003-05-11 23:03 · Score: 5, Funny

the disappearance of a large Internet site.

Yeah, I wonder what could ever bring down a large Internet site?
Ahem.

--
Twelve fingers or one, it's how you play. ~Gattaca (Vincent)
ROC detail by rleyton · 2003-05-11 23:04 · Score: 5, Informative

For a much better, and more detailed, discussion of Recovery Oriented Computing, you're better off visiting the ROC group at Berkeley, specifically David Patterson's writings.

--
ooooooh! What does this button do? - DeeDee, Dexter's Lab.
I used systems like this by Mark+Hood · 2003-05-11 23:24 · Score: 5, Interesting

they were large telecomms phone switches.

When I left the company in question, they had recently introduced a 'micro-reboot' feature that allowed you to only clear the registers for one call - previously you had to drop all the calls to solve a hung channel or if you hit a software error.

The system could do this for phone calls, commands entered on the command line, even backups could be halted and started without affecting anything else.
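
For illustration only, here is a minimal Python sketch of the per-call idea described above. It is not the switch's actual code: the `Switch` and `Call` classes, the `registers` dictionary, and the sample digits are all hypothetical names made up for this example. The point is simply that recovery is scoped to one call's state instead of the whole system.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """State for one call. All names here are hypothetical."""
    call_id: int
    registers: dict = field(default_factory=dict)
    hung: bool = False


class Switch:
    def __init__(self):
        self.calls = {}

    def connect(self, call_id):
        self.calls[call_id] = Call(call_id)

    def micro_reboot(self, call_id):
        """Reset the state of a single call; every other call is untouched."""
        self.calls[call_id] = Call(call_id)

    def full_reboot(self):
        """The old behaviour: fixing one hung channel drops every call."""
        dropped = len(self.calls)
        self.calls.clear()
        return dropped


switch = Switch()
for cid in (1, 2, 3):
    switch.connect(cid)
    switch.calls[cid].registers["digits"] = "555-010%d" % cid

switch.calls[2].hung = True   # simulate a hung channel on call 2
switch.micro_reboot(2)        # recover call 2 only; calls 1 and 3 keep their state
```

Keeping each call's state in its own object is what makes the narrow reset possible; if the state were shared, the only safe recovery would be `full_reboot()`.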

Yes, it requires extensive development, but you can do it incrementally - we had thousands of software 'blocks' which had this functionality added to them whenever they were opened for other reasons; we never added this feature unless we were already making major changes.

Patches could be introduced to the running system, and falling back was simplicity itself - the same went for configuration changes.

This stuff is not new in the telecomms field, where 'five nines' uptime is the bare minimum. Now that the telcos are trying to save money, they're looking at commodity PCs & open standard solutions, and shuddering - you need to reboot everything to fix a minor issue? Ugh!

As for introducing errors to test stability, I did this, and I can vouch for its effects. I made a few patches that randomly caused 'real world' type errors (call dropped, congestion on routes, no free devices) and let it run for a weekend as an automated caller tried to make calls. When I came in on Monday I'd caused 2,000 failures which boiled down to 38 unique faults. The system had not rebooted once, so only those 2,000 calls had even noticed a problem. Once the software went live, the customer spotted 2 faults in the first month, where previously they'd found 30... So I swear by 'negative testing'.
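
A rough Python sketch of this kind of fault-injection run, under stated assumptions: the three fault names come from the paragraph above, but `place_call`, `run_campaign`, the 10% fault rate, and the fixed seed are invented for illustration and say nothing about how the original patches worked. It shows the shape of the technique: inject random failures into an automated call loop, keep the loop running, and tally what surfaced.

```python
import random
from collections import Counter

# Fault names taken from the comment; the rate and everything else are made up.
FAULTS = ("call dropped", "congestion on routes", "no free devices")


class InjectedFault(Exception):
    pass


def place_call(rng, fault_rate):
    """Hypothetical call path that sometimes fails on purpose."""
    if rng.random() < fault_rate:
        raise InjectedFault(rng.choice(FAULTS))
    return "connected"


def run_campaign(attempts, fault_rate=0.1, seed=0):
    """Make many automated calls and tally which injected faults surfaced."""
    rng = random.Random(seed)
    failures = Counter()
    completed = 0
    for _ in range(attempts):
        try:
            place_call(rng, fault_rate)
            completed += 1
        except InjectedFault as fault:
            # A failed call is recorded; the loop (the "system") keeps running.
            failures[str(fault)] += 1
    return completed, failures


completed, failures = run_campaign(attempts=1000)
print(completed, sum(failures.values()), sorted(failures))
```

In a real system the interesting output is not the injected faults themselves but the unexpected failures they provoke elsewhere, which is what the tally would be grouped by.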

Nice to see the 'PC' world finally catching up :)

If people want more info, then write to me.

Mark

--
Liked this comment? Why not buy me something nice