Self-Repairing Computers
Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. You all have experienced a PC crash or the disappearance of a large Internet site. What to do to improve the situation? This Scientific American article describes a new method called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery by using what these researchers call micro-rebooting; using better tools to pinpoint problems in multicomponent systems; build an "undo" function (similar to those in word-processing programs) for large computing systems; and injecting test errors to better evaluate systems and train operators. Check this column for more details or read the long and dense original article if you want to know more."
coupled with self debugging code.
How small a thought it takes to fill a whole life
Is Ctrl-Alt-Del ROC too? :)
I haven't read the long and dense article, but this sounds like managerspeak, PHB-talk. The concepts described are all very high level, requiring a whole plethora of yet unwritten code to roll back changes in a large system. This will require a lot of work, including rebuilding a lot of those large systems from the ground up.
I don't think anybody (any company) is willing to undertake such an enterprise, having to re-architect/redesign whole systems from the ground up. Systems that work these days, but aren't 100% reliable.
Will it be worth it? For those systems to have a smaller boot up time after failure? I don't think so, but ymmv.
Cheers,
Costyn.
The Official Steve Ballmer Webpage
Translation: "when we started this project, we thought we'd be able to spin it off into a hot IPO and get rich!!"
Maybe I just don't understand this part. The other points all seem very sensible.
std::disclaimer<std::legalese> sig=new std::disclaimer; sig->dump(); delete sig;
the disappearance of a large Internet site.
Yeah, I wonder what could ever bring down a large Internet site?
Ahem.
Twelve fingers or one, its how you play. ~Gattaca (Vincent)
"Last, computer scientists should develop the ability to inject test errors" Ah, so that explains those BSODs. It's not a fault, it's a feature....
For a much better, and more detailed, discussion of Recovery Oriented Computing, you're better off visiting the ROC group at Berkeley, specifically David Patterson's writings.
ooooooh! What does this button do? - DeeDee, Dexters Lab.
Heal thy-self!
Sometimes I wish I was a plumber, then I'd know how to deal with other people's shit.
Computers still rely on the original John von Neumann architecture; they are not redundant in any way. There will always be a single point of failure, no matter what you hear about RAID, redundant power supplies, etc. Basically the self-healing system is based on the same concept. Compare that to a natural thing like the human nervous system: now that is redundant and self-healing. A fly has more wires in its brain than all of the Internet has nodes. Cut your finger and after a couple of days a fully automated, autonomous, transparent healing system will fix it. If we ever need to create self-healing computers we need to radically change what a computer is. We need to break from the von Neumann architecture, not because there is anything wrong with it but because it is reaching its limits quickly. We need truly parallel autonomous computers with replicated capacity that increases linearly by adding more hardware, and software paradigms that take advantage of that. Try to make a self-healing, self-fixing computer today and you will end up with a very complicated piece of software that will fail in real life.
Micro-rebooting: Restart service.
Mini-rebooting: Restart Windows 98
Rebooting : Switch off/on power
Macro-rebooting: BSOD.
Mega-rebooting: BSOD--> System crash--> reload OS from Recovery CD--> Reinstall apps --> reinstall screen savers --> reinstall Service Packs --> Say your prayers --> Reboot ---> Curse --> Repeat.
If you keep throwing chairs, one day you'll break windows....
I wonder if this [PDF!] cool new feature will help there.
Sounds a lot like "micro-rebooting" to me...
my
but if end-users got a better computer education, I think most of the problems would be fixed.
I find it quite funny that the "basic computer course" classes we have (here in Sweden) only educate people in how to use Word/Excel/PowerPoint/etc... nothing _fundamental_ about how to operate a computer. It's like learning how to use the cigarette lighter in your car, and declaring yourself someone who can drive a car. And now you want a quick fix for your incompetence in driving "the car".
I don't claim I know more than I know, and if you know you know more than I know, then by all means, let me know.
[WARNING]
You have installed Microsoft[tm] Windows[tm]. Would you like to undo your mistake, or are you simply injecting test errors on your system ?
[Undo] [Continue testing]
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
I think that's a big fat lie.
The dangers of knowledge trigger emotional distress in human beings.
and cron them in.
This concept isn't particularly new. It's easy to write a script that will check a particular piece of the system by running some sort of diagnostic command (e.g. netstat), parse the output, and make sure everything looks normal. If something doesn't look normal, just stop the process and restart it, or do whatever else is needed to get the service back up and running, secured, or otherwise back to normal.
Make sure that script is part of a crontab that's run somewhat frequently, and things should recover on their own as soon as they fail (well, within the time-frame that you have the script running within your crontab.)
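A minimal sketch of that kind of check, assuming netstat-style text as input. The function names, the sample line format and the `restart` callback are made up for illustration; a real script would capture the output of the actual diagnostic command and call whatever restarts the service on that system.

```python
def port_is_listening(netstat_output: str, port: int) -> bool:
    """Return True if a LISTEN line in netstat-style text is bound to `port`."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: proto recv-q send-q local-address foreign-address state
        if len(fields) >= 6 and "LISTEN" in fields:
            local_address = fields[3]
            if local_address.rsplit(":", 1)[-1] == str(port):
                return True
    return False


def check_and_restart(netstat_output: str, port: int, restart) -> str:
    """Run the health check and call `restart()` only if the port is not listening."""
    if port_is_listening(netstat_output, port):
        return "ok"
    restart()
    return "restarted"
```

Run from cron, the worst-case recovery time is roughly the interval between runs, which is the trade-off mentioned above.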
"Undo" feature? That's what backups are for.
Of course, the article was thinking that this would be built into the software, but I don't think that is that much better of a solution. In fact, I would say that that would make things more complicated than anything.
// file: mice.h
#include "frickin_lasers.h"
Windows Installer was an effort in self "repairing" or "healing", whatever you would like to call it. However, am I the only one who has seen errors like "Please insert Microsoft Office XP CD.." blah blah, when nothing is wrong, and you have to cancel out of it just to use something totally unrelated, like say Excel or Word?
The Office 2000 self-repairing installation is another notorious one: if you remove something, the installer thinks it has been removed in error and tries to reinstall it...
Oh well, let's wish the recovery-oriented computing guys luck...
Solid!
The second paragraph of the "long and dense article" strikes me as hyperbole. I haven't noticed that my computer's "operation has become brittle and unreliable" or that it "crash[es] or freeze[s] up regularly." I have not experienced the "annual outlays for maintenance, repairs and operations" that "far exceed total hardware and software costs, for both individuals and corporations."
Since this is /. I feel compelled to say this: "Gee, sounds like these guys are Windows users." Haha. But, to be fair, I have to say that - in my experience, at least - Windows2000 has been pretty stable both at home and at work. My computers seem to me to have become more stable and reliable over the years.
But maybe my computers have become more stable because I learned to not tweak on them all the time. As long as my system works, I leave it the hell alone. I don't install the "latest and greatest M$ service pack" (or Linux kernel, for that matter) unless it fixes a bug or security vulnerability that actually affects me. I don't download and install every cutesy program I see. My computer is a tool I need to do my job - and since I've started treating it as such, it seems to work pretty damn well.
[b.belong('us') for b in bases if b.owner() == 'you']
Here's the strategy:
1. Every system will have a spare 2GB filesystem partition, where I copy all the files of the 'root' filesystem after a successful installation, drivers, personalised settings, blah blah.
2. Every day, during shutdown, users are prompted to 'copy' changed files to this 'backup OS partition'. A script handles this - only changed files are updated.
3. After the first installation, a copy of the installed version is put onto a CD.
4. On a server with 4*120GB IDE disks, I've got "data" (home dirs) of about 200 systems in the network - updated once a quarter.
Now, for self-repairing:
1. If user messes up with settings, kernel etc., boot tomsrtbt, run a script to recopy changed files back to root filesystem -> restart. (20 mins)
2. If disk drive crashes, install from CD of step 3, and restore data from server.(40 mins)
Foolproof system, so far - and yes, lots of foolish users around.
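The "only changed files are updated" step could look something like this. It is a rough sketch, not the poster's actual script: `sync_changed` and the directory layout are hypothetical, and it compares file contents byte by byte rather than trusting timestamps.

```python
import filecmp
import shutil
from pathlib import Path


def sync_changed(src: Path, dst: Path) -> list:
    """Copy files from src to dst only when missing or different; return what was copied."""
    copied = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src)
        target = dst / rel
        # shallow=False forces a content comparison instead of a stat() comparison
        if not target.exists() or not filecmp.cmp(path, target, shallow=False):
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also preserves timestamps
            copied.append(str(rel))
    return copied
```

Restoring is the same call with the arguments swapped, which matches the "recopy changed files back" step in the repair procedure.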
If you keep throwing chairs, one day you'll break windows....
Well, yeah. That's basically a watchdog timer. It's very common in embedded stuff, because it's cheap to implement - in fact, many microcontrollers have it built into the hardware. In microcontrollers they're very simple - a counter counts up (say) 1024 clock pulses, and if it rolls over then reset the CPU. In normal operation then every time round the main loop you'd write to a specified IO port to kick the watchdog once every millisecond or so - this resets the counter. It's crude but effective, and is very commonly used in things like ECUs for automotive electrickery - although the software is simple enough to be thoroughly tested (BMW 735i's aside) there's still dirty power and mechanically harsh environment to deal with. And your ABS ECU doesn't have , does it?
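For anyone who hasn't met one, here is a toy software model of that counter. It is only a simulation under assumed names (`Watchdog`, `tick`, `kick`); on a real microcontroller the counting and the reset happen in hardware.

```python
class Watchdog:
    """Toy model of a hardware watchdog: reset the CPU if not kicked in time."""

    def __init__(self, limit: int = 1024):
        self.limit = limit   # clock pulses allowed between kicks
        self.count = 0       # pulses seen since the last kick
        self.resets = 0      # how many times the "CPU" was reset

    def kick(self) -> None:
        """Called from the main loop: writing the IO port clears the counter."""
        self.count = 0

    def tick(self) -> None:
        """One clock pulse; if the counter reaches the limit, reset the CPU."""
        self.count += 1
        if self.count >= self.limit:
            self.resets += 1
            self.count = 0
```

As long as the main loop keeps calling `kick()` more often than every `limit` ticks, `resets` stays at zero; a hung loop stops kicking and gets reset.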
they were large telecomms phone switches.
:)
When I left the company in question, they had recently introduced a 'micro-reboot' feature that allowed you to only clear the registers for one call - previously you had to drop all the calls to solve a hung channel or if you hit a software error.
The system could do this for phone calls, commands entered on the command line, even backups could be halted and started without affecting anything else.
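In very rough terms the idea looks like the sketch below. The `Channel` and `Exchange` classes are invented for illustration and have nothing to do with the real switch software; the point is only that a failure clears one unit of state and leaves the rest alone.

```python
class Channel:
    """One call's worth of state that can be cleared independently."""

    def __init__(self, name: str):
        self.name = name
        self.state = {}
        self.restarts = 0

    def restart(self) -> None:
        """Micro-reboot: throw away only this channel's state."""
        self.state = {}
        self.restarts += 1


class Exchange:
    """Runs work per channel and micro-reboots only the channel that failed."""

    def __init__(self, names):
        self.channels = {name: Channel(name) for name in names}

    def handle(self, name: str, work):
        channel = self.channels[name]
        try:
            return work(channel)
        except Exception:
            channel.restart()  # every other channel keeps its state
            return None
```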
Yes, it requires extensive development, but you can do it incrementally - we had thousands of software 'blocks' which had this functionality added to them whenever they were opened for other reasons; we never added this feature unless we were already making major changes.
Patches could be introduced to the running system, and falling back was simplicity itself - the same went for configuration changes.
This stuff is not new in the telecomms field, where 'five nines' uptime is the bare minimum. Now that the telcos are trying to save money, they're looking at commodity PCs & open standard solutions, and shuddering - you need to reboot everything to fix a minor issue? Ugh!
As for introducing errors to test stability, I did this, and I can vouch for its effects. I made a few patches that randomly caused 'real world' type errors (call dropped, congestion on routes, no free devices) and let it run for a weekend as an automated caller tried to make calls. When I came in on Monday I'd caused 2,000 failures which boiled down to 38 unique faults. The system had not rebooted once, so only those 2,000 calls had even noticed a problem. Once the software went live, the customer spotted 2 faults in the first month, where previously they'd found 30... So I swear by 'negative testing'.
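A small sketch of what such a fault-injection run might look like, assuming a stand-in `place_call` function and a made-up fault rate. None of this is the original patch set; it only shows the pattern of injecting random errors, letting the run continue, and tallying failures by kind afterwards.

```python
import random
from collections import Counter

# Illustrative fault kinds, taken from the examples named in the comment
FAULTS = ("call dropped", "route congested", "no free device")


class InjectedFault(Exception):
    """Raised by the test harness to simulate a real-world failure."""


def place_call(rng: random.Random, fault_rate: float) -> str:
    """Stand-in for the code under test; fails randomly at the given rate."""
    if rng.random() < fault_rate:
        raise InjectedFault(rng.choice(FAULTS))
    return "connected"


def run_campaign(n_calls: int, fault_rate: float, seed: int = 0):
    """Make n_calls attempts, surviving every injected fault, and tally the results."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    failures = Counter()
    connected = 0
    for _ in range(n_calls):
        try:
            place_call(rng, fault_rate)
            connected += 1
        except InjectedFault as fault:
            failures[str(fault)] += 1  # only this call notices the problem
    return connected, failures
```

The tally is what lets a few thousand raw failures be boiled down to a short list of distinct faults to fix.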
Nice to see the 'PC' world finally catching up
If people want more info, then write to me.
Mark
Liked this comment? Why not buy me something nice
Somebody has to suggest the weird ideas, even if they sound stupid and impractical now. Of course we won't be retrofitting our existing systems in six months, I think this is a bigger vision than that.
:-)
Rather than trying to eliminate computer crashes--probably an impossible task--our team concentrates on designing systems that recover rapidly when mishaps do occur.
The goal here is clearly to make the stability of the operating system and software less critical, so we don't have to hope and pray that a new installation doesn't overwrite a system file with a weird buggy version, or that our OS won't decide to go tits-up in the middle of an important process. Since all us good Slashdotters KNOW there will still be crufty, evil OS's around in 10 years, even if WE aren't using them
Freedom: "I won't!"
I wonder... is there a meaningful distinction between ROC and the classical holy grail of ACID systems (i.e. systems which meet the Atomic, Consistent, Isolated and Durable assumptions commonly cited in the realm of commercial RDBMSs)? Apart from the 'swish' buzzword re-name that isn't even an acronym?
Professionals in the field usually agree on the desirability of systems which pass the ACID test, but most admit that while the concepts are well understood, the real-world cost of the additional software complexity often precludes strict ACID compliance in typical systems. I would certainly be interested if there were more to ROC than evaluating the performance of existing, well understood ACID-related techniques, but I can't find anything more than the "hype." For example, has ROC suggested designs to resolve distributed incoherence due to hardware failure? Classified non-trivial architectures immune to various classes of failure? Discovered a cost-effective approach to ACID?
Wouldn't some sort of software solution be the Hurd (if/when it becomes ready), in that, since each subsystem runs as a separate server on top of a microkernel, you just restart that bit of the operating system? As said in another post this is like /etc/rc.d but at a lower level.
Or you could just have some sort of failover setup.
Rus
Cheap UK and US VPS
Didn't IBM come out with some Magic Server Pixie Dust that did this sort of thing already, or am I mistaken?
Good judgment comes from experience, and a lot of that comes from bad judgment.
My particular system of research finally wound up relying on the Windows method: if uncertain, erase and reboot. It didn't have to be 99.999% available, after all. There are other ways to solve this in distributed/clustered computing, such as voting: servers in the cluster vote for each other's sanity (i.e. determine if the messages sent by one computer make sense to at least two others). However, even this system is not rock solid (what if two computers happen to malfunction in the same manner simultaneously? what if the malfunction is contagious? or widespread in the cluster?).
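A bare-bones illustration of the voting idea, assuming each node simply reports a value and the cluster trusts whatever a strict majority agrees on. The `majority_vote` helper is hypothetical and far simpler than a real consensus protocol.

```python
from collections import Counter


def majority_vote(replies: dict):
    """Given {node: value}, return (agreed_value, dissenting_nodes).

    Returns (None, []) when no value is held by a strict majority.
    """
    if not replies:
        return None, []
    value, count = Counter(replies.values()).most_common(1)[0]
    if count * 2 <= len(replies):
        return None, []  # tie or plurality only: nobody can be trusted
    dissenters = sorted(node for node, v in replies.items() if v != value)
    return value, dissenters
```

With replies `{"a": 1, "b": 1, "c": 2}` node `c` is flagged as the odd one out. With `{"a": 9, "b": 9, "c": 1}`, where `a` and `b` are broken in the same way, the healthy node `c` is the one that gets outvoted, which is exactly the weakness described above.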
So, self-correcting is an intriguing question, to say the least. I'll be keenly following what the ROC fellas come up with.
My first "PC" was a PDP-11/20, with paper tape reader and linc tape storage. Anyone who tries to tell me that operating today's computers is much more complex needs to take some serious drugs.
What is more complex is what today's computers do, and increasing their reliability or making them goal oriented are both laudable goals. What will not be accomplished is making the things that these computers actually do less complex.
Don't take life too seriously; it isn't permanent.
"But operating them is much more complex."
You're saying the computers of today are more complex to operate than those of 20 years ago?
What was the popular platform 20 years ago.... (1983)? The MacOS had not yet debuted, but the PC XT had. The Apple ][ was the main competitor.
So you had a DOS command line and an AppleDOS command line. Was that really simpler than pointing and clicking in XP and OSX today? I mean, you can actually have your *mother* operate a computer today.
I'm not sure I agree with the premise.
You were mistaken. Which is odd, since memory shouldn't be a problem for you
Washing machines have a lifetime of around 15-20 years, I guess; computers about 1-3 years.
;-) but i hope you got the point, no time to ask my living dictionary.
This is because the technical computer stuff is so new every year and so...
1: It's too expensive to make it failsafe; development would take too long.
2: You can't refine/redesign and resell, because of new technology.
3: If it just works no one will buy new systems, so they have to fail every now and then.
Other consumer products have a much longer development cycle. Cars, for example, shouldn't fail, and if they do they should be fairly easy to repair. Cars have also been around for, I don't know, like a hundred years, and have they changed much? Computers, heck, just buy a new one or hire a PC Repair Man (Dutch only) to do your fixing.
excuse me for my bad english
build an "undo" function (similar to those in word-processing programs) for large computing systems
This is called "the sysadmin thinks ahead."
Essentially, when any sysadmin worth a pile of beans makes any changes whatsoever, he makes sure there's a backup plan before making his changes live. Whether it means running the service on a non-standard port to test, running it on the development server to test, making backups of the configuration and/or the binaries in question, or making backups of the entire system every night. She is thinking "what happens if this doesn't work?" before making any changes. It doesn't matter if it's a web server running on a lowly Pentium 2 or Google - the sysadmin is paid to think about actions before making them. Having things like this won't replace the sysadmin, although I can imagine a good many PHBs trying before realizing that just because you can back out of stupid mistakes, doesn't mean you can keep them from happening in the first place.
"No problem. I have the capacity to do infinite work so long as you don't mind that my quality approaches zero."-Dilbert
Or the factor of 1000 to 1 in hard disk sizes.
Or the 20:1 price difference.
I think a suitable punishment would be to lock the authors in a museum somewhere that has a 70s mainframe, and let them out when they've learned how to swap disk packs, load the tapes, splice paper tape, connect the Teletype, sweep the chad off the floor, stack a card deck or two and actually run an application...those were the days, when computing kept you fit.
Panurge has posted for the last time. Thanks for the positive moderations.
We've had RISC, MMX, VLIW, SSI, maybe it's time for DWIM processors.
so it is basically two synchronized computers. It probably cost 3x the normal, and if you wiped out the self-correcting logic the system was likely to die. You mentioned that they managed to duplicate everything; did they duplicate the self-correcting logic itself?
the primary immediately hands over the responsibility to the redundant/backup
is there an effective way to judge which processor is correct? you need an odd number of processors to do that or an odd split on an even number of processors.
I'm not saying that this system is flawed; actually, the way you described it here, it is certainly far more reliable than the usual servers. What I'm trying to point out is that the concept itself is the bottleneck.
But operating them is much more complex.
I disagree. Feature for feature, modern computers are much more reliable and easy to use than their vacuum-tube, punch card, or even command-line predecessors. How many mom and pop technophobes do you think could hope to operate such a machine? Nowadays anybody can operate a computer, even my 85 year old grandmother who had never touched one until a few months ago. Don't mistake feature-overload for feature-complexity.
Flying is easy, just throw yourself at the ground and miss. -Douglas Adams
Do we have to keep using this tired old notion of little old (middle-aged, for the /. crowd) ladies cringing in terror when faced with a computer?
My mother has a B.Math in CS, acquired more than a quarter century ago. Her father is pushing eighty, and he upgrades his computer more often than I do. When he's not busy golfing, he's scanning photographs for digital retouching. (In his age bracket, a man who can remove double chins and smooth wrinkles is very popular.)
The notion that women and/or the elderly are unable to use computers is a generalization that just doesn't hold much water anymore. Maybe some of these people are frightened of (or frustrated with) computers because their exposure to technology is through the 'typical'* arrogant, smug, condescending /.er--concealing his embarrassment over being unable to get a girlfriend behind clouds of technobabble.
*How does it feel to be the target of an unfair stereotype?
~Idarubicin
Oh yeah. My TRS-80 used to NEVER crash twenty years ago when I accessed LARGE INTERNET SITES.
I object to that article, and to the next reply.
micro-rebooting; using better tools to pinpoint problems in multicomponent systems; build an "undo" function...
:). I don't program in Lisp, but have seen people who are very good at it. Quite impressive.
I think they just invented Lisp
Healthcare article at Kuro5hin