Self-Repairing Computers
Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. You all have experienced a PC crash or the disappearance of a large Internet site. What to do to improve the situation? This Scientific American article describes a new method called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery by using what these researchers call micro-rebooting; using better tools to pinpoint problems in multicomponent systems; build an "undo" function (similar to those in word-processing programs) for large computing systems; and injecting test errors to better evaluate systems and train operators. Check this column for more details or read the long and dense original article if you want to know more."
coupled with self debugging code.
How small a thought it takes to fill a whole life
Is Ctrl-Alt-Del ROC too? :)
I haven't read the long and dense article, but this sounds like managerspeak, PHB-talk. The concepts described are all very high level, requiring a whole plethora of yet unwritten code to roll back changes in a large system. This will require a lot of work, including rebuilding a lot of those large systems from the ground up.
I don't think anybody (any company) is willing to undertake such an enterprise, having to re-architect/redesign whole systems from ground up. Systems that work these days, but aren't 100% reliable.
Will it be worth it? For those systems to have a smaller boot up time after failure? I don't think so, but ymmv.
Cheers,
Costyn.
The Official Steve Ballmer Webpage
Translation: "when we started this project, we thought we'd be able to spin it off into a hot IPO and get rich!!"
Maybe I just don't understand this part. The other points all seem very sensible.
std::disclaimer<std::legalese> sig=new std::disclaimer; sig->dump(); delete sig;
the disappearance of a large Internet site.
Yeah, I wonder what could ever bring down a large Internet site?
Ahem.
Twelve fingers or one, it's how you play. ~Gattaca (Vincent)
"Last, computer scientists should develop the ability to inject test errors" Ah, so that explains those BSODs. It's not a fault, it's a feature...
For a much better, and more detailed, discussion of Recovery Oriented Computing, you're better off visiting the ROC group at Berkeley, specifically David Patterson's writings.
ooooooh! What does this button do? - DeeDee, Dexters Lab.
Heal thy-self!
Sometimes I wish I was a plumber, then I'd know how to deal with other people's shit.
Computers still rely on the original John von Neumann architecture. They are not redundant in any way, and there will always be a single point of failure forever, no matter what you hear about RAID, redundant power supplies, etc. The self-healing system is basically built on the same concept. Compare that to something natural like the human nervous system: now that is redundant and self-healing. A fly has more wires in its brain than the Internet has nodes. Cut your finger, and after a couple of days a fully automated, autonomous, transparent healing system will have fixed it. If we ever want to create self-healing computers, we need to radically change what a computer is. We need to break from the von Neumann architecture, not because anything is wrong with it but because it is quickly reaching its limits. We need truly parallel, autonomous computers with replicated capacity that increases linearly as hardware is added, and software paradigms that take advantage of that. Try to make a self-healing, self-fixing computer today and you will end up with a very complicated piece of software that will fail in real life.
Micro-rebooting: Restart service.
Mini-rebooting: Restart Windows 98
Rebooting : Switch off/on power
Macro-rebooting: BSOD.
Mega-rebooting: BSOD--> System crash--> reload OS from Recovery CD--> Reinstall apps --> reinstall screen savers --> reinstall Service Packs --> Say your prayers --> Reboot ---> Curse --> Repeat.
If you keep throwing chairs, one day you'll break windows....
I wonder if this [PDF!] cool new feature will help there.
Sounds a lot like "micro-rebooting" to me...
my
but if end-users got a better computer education, I think most of the problems would be fixed.
I find it quite funny that the "basic computing" courses we have (here in Sweden) only teach people how to use Word/Excel/PowerPoint/etc... nothing _fundamental_ about how to operate a computer. It's like learning how to use the cigarette lighter in your car and declaring yourself someone who can drive a car. And now you want a quick fix for your incompetence in driving "the car".
I don't claim I know more than I know, and if you know you know more than I know, then by all means, let me know.
Theoretically, I don't see why you shouldn't be able to do it in hardware. Suppose, for example, an entire OS has been written to report to some piece of hardware which processes it has running, and each of these processes has to report its status to that piece of hardware. If a report comes in concerning problems, or the report fails to come in altogether, the chip then takes action to remedy the situation, for example by restarting that particular process.
Disclaimer: all uses of the word process in this post are due to a total lack of knowledge concerning *nix and more than is good for me with 2K/XP.
People replying to my sig annoy me. That's why I change it all the time.
[WARNING]
You have installed Microsoft[tm] Windows[tm]. Would you like to undo your mistake, or are you simply injecting test errors on your system ?
[Undo] [Continue testing]
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
I think that's a big fat lie.
The dangers of knowledge trigger emotional distress in human beings.
and cron them in.
This concept isn't particularly new. It's easy to write a script that checks a particular piece of the system by running some sort of diagnostic command (e.g. netstat), parsing the output, and making sure everything looks normal. If something doesn't look normal, just stop the process and restart it, or do whatever is needed to get the service back up and running, secured, or otherwise back to normal.
Make sure that script is part of a crontab that's run somewhat frequently, and things should recover on their own as soon as they fail (well, within the time-frame that you have the script running within your crontab.)
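A minimal sketch of that kind of cron-driven check, in Python. The port number and the restart command are placeholders, not anything from the article, and `netstat -ltn` assumes a Linux-style netstat; adjust for whatever your system actually ships.

```python
import subprocess


def port_is_listening(netstat_output, port):
    """Return True if a LISTEN line in netstat-style output names the port."""
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        if "LISTEN" in fields and any(f.endswith(suffix) for f in fields):
            return True
    return False


def check_and_restart(port, restart_cmd, run=subprocess.run):
    """Run the diagnostic, parse it, and restart the service if it looks dead.

    `run` is injectable so the logic can be exercised without touching
    a real system. Returns True if a restart was attempted.
    """
    result = run(["netstat", "-ltn"], capture_output=True, text=True)
    if port_is_listening(result.stdout, port):
        return False
    # Nothing is listening: issue the (assumed) restart command.
    run(restart_cmd, check=False)
    return True
```

Saved as a script that calls something like `check_and_restart(80, ["service", "httpd", "restart"])` (a hypothetical service and command), it could be run from a crontab line such as `*/5 * * * * /usr/local/bin/check_httpd.py`, so the worst-case recovery delay is the cron interval.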
"Undo" feature? That's what backups are for.
Of course, the article was thinking that this would be built into the software, but I don't think that is that much better of a solution. In fact, I would say that that would make things more complicated than anything.
// file: mice.h
#include "frickin_lasers.h"
Sounds like a great way to lure in customers for another product. What happens when part of this ROC fucks up? No coding is perfect. Also, would it be cost effective? I doubt it...
Pls No Negative Modding!
Windows Installer was an effort at self-"repairing" or "healing", whatever you would like to call it. However, am I the only one who has seen errors like "Please insert Microsoft Office XP CD.." blah blah, when nothing is wrong, and you have to cancel out of it just to use something totally unrelated, like say Excel or Word?
The Office 2000 self-repairing installation is another notorious one: if you remove something, the installer thinks it has been removed in error and tries to reinstall it...
Oh well, let's wish the recovery-oriented computing guys luck...
Solid!
The second paragraph of the "long and dense article" strikes me as hyperbole. I haven't noticed that my computer's "operation has become brittle and unreliable" or that it "crash[es] or freeze[s] up regularly." I have not experienced the "annual outlays for maintenance, repairs and operations" that "far exceed total hardware and software costs, for both individuals and corporations."
Since this is /. I feel compelled to say this: "Gee, sounds like these guys are Windows users." Haha. But, to be fair, I have to say that - in my experience, at least - Windows2000 has been pretty stable both at home and at work. My computers seem to me to have become more stable and reliable over the years.
But maybe my computers have become more stable because I learned to not tweak on them all the time. As long as my system works, I leave it the hell alone. I don't install the "latest and greatest M$ service pack" (or Linux kernel, for that matter) unless it fixes a bug or security vulnerability that actually affects me. I don't download and install every cutesy program I see. My computer is a tool I need to do my job - and since I've started treating it as such, it seems to work pretty damn well.
[b.belong('us') for b in bases if b.owner() == 'you']
Here's the strategy:
1. Every system will have a spare 2GB filesystem partition, where I copy all the files of the 'root' filesystem, after successful instln., drivers, personalised settings, blah blah.
2. Every day, during shutdown, users are prompted to 'copy' changed files to this 'backup OS partition'. A script handles this - only changed files are updated.
3. After the 1st instln. a copy of the installed version is put onto a CD.
4. On a server with 4*120GB IDE disks, I've got "data" (home dirs) of about 200 systems in the network - updated once a quarter.
Now, for self-repairing:
1. If user messes up with settings, kernel etc., boot tomsrtbt, run a script to recopy changed files back to root filesystem -> restart. (20 mins)
2. If disk drive crashes, install from CD of step 3, and restore data from server.(40 mins)
Foolproof system, so far - and yes, lots of foolish users around.
If you keep throwing chairs, one day you'll break windows....
I think the first good use of ROC would be to clean up the errors and problems in Windows. Of course, the only way ROC could possibly clean up all the problems with Windows is to delete Windows altogether, but hey, we'd do it ourselves sooner or later anyway.
Well, yeah. That's basically a watchdog timer. It's very common in embedded stuff, because it's cheap to implement - in fact, many microcontrollers have it built into the hardware. In microcontrollers they're very simple - a counter counts up (say) 1024 clock pulses, and if it rolls over then reset the CPU. In normal operation then every time round the main loop you'd write to a specified IO port to kick the watchdog once every millisecond or so - this resets the counter. It's crude but effective, and is very commonly used in things like ECUs for automotive electrickery - although the software is simple enough to be thoroughly tested (BMW 735i's aside) there's still dirty power and mechanically harsh environment to deal with. And your ABS ECU doesn't have , does it?
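To make the counter-and-kick idea concrete, here is a toy software simulation in Python. Real watchdogs live in hardware; the class name, the 1024-tick limit taken from the example above, and the `simulate` helper are only illustrative assumptions.

```python
class WatchdogTimer:
    """Toy model of a hardware watchdog: counts ticks, resets on rollover."""

    def __init__(self, limit=1024):
        self.limit = limit      # ticks allowed between kicks
        self.count = 0
        self.resets = 0         # how many times the "CPU" was reset

    def kick(self):
        """What the main loop does by writing to the watchdog's IO port."""
        self.count = 0

    def tick(self):
        """One clock pulse. Returns True if the watchdog fired a reset."""
        self.count += 1
        if self.count >= self.limit:
            self.resets += 1
            self.count = 0      # the system restarts from a clean state
            return True
        return False


def simulate(ticks, kick_every, limit=1024):
    """Run for `ticks` pulses; the main loop kicks every `kick_every` ticks.

    kick_every=0 models a hung main loop that never kicks.
    """
    wd = WatchdogTimer(limit)
    for t in range(1, ticks + 1):
        if kick_every and t % kick_every == 0:
            wd.kick()
        wd.tick()
    return wd.resets
```

With a healthy loop, `simulate(5000, 1000)` never lets the counter reach 1024, so no reset happens; with a hung loop, `simulate(5000, 0)` fires a reset every 1024 ticks.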
they were large telecomms phone switches.
:)
When I left the company in question, they had recently introduced a 'micro-reboot' feature that allowed you to only clear the registers for one call - previously you had to drop all the calls to solve a hung channel or if you hit a software error.
The system could do this for phone calls, commands entered on the command line, even backups could be halted and started without affecting anything else.
Yes, it requires extensive development, but you can do it incrementally. We had thousands of software 'blocks' which had this functionality added to them whenever they were opened for other reasons; we never added this feature unless we were already making major changes.
Patches could be introduced to the running system, and falling back was simplicity itself - the same went for configuration changes.
This stuff is not new in the telecomms field, where 'five nines' uptime is the bare minimum. Now the telcos are trying to save money, they're looking at commodity PCs & open standard solutions, and shuddering - you need to reboot everything to fix a minor issue? Ugh!
As for introducing errors to test stability, I did this, and I can vouch for its effects. I made a few patches that randomly caused 'real world' type errors (call dropped, congestion on routes, no free devices) and let it run for a weekend as an automated caller tried to make calls. When I came in on Monday I'd caused 2,000 failures which boiled down to 38 unique faults. The system had not rebooted once, so only those 2,000 calls had even noticed a problem. Once the software went live, the customer spotted 2 faults in the first month, where previously they'd found 30... So I swear by 'negative testing'.
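For anyone who wants to try this kind of negative testing on ordinary software, here is a rough Python sketch. It is not the switch code described above: `place_call`, the fault names and the failure rate are all made up for illustration.

```python
import random
from collections import Counter

# Illustrative fault types, loosely modelled on the errors named above.
FAULTS = ("call dropped", "route congested", "no free device")


class InjectedFault(Exception):
    """Raised by the wrapper instead of running the real operation."""


def make_flaky(func, rate, rng):
    """Wrap `func` so that a fraction `rate` of calls fail with a test error."""
    def wrapper(*args, **kwargs):
        if rng.random() < rate:
            raise InjectedFault(rng.choice(FAULTS))
        return func(*args, **kwargs)
    return wrapper


def place_call(number):
    """Stand-in for the real operation under test (hypothetical)."""
    return f"connected to {number}"


def soak_test(calls, rate, seed=0):
    """Drive many calls through the flaky wrapper and tally the outcomes."""
    rng = random.Random(seed)           # seeded so a run can be repeated
    flaky_call = make_flaky(place_call, rate, rng)
    failures = Counter()
    completed = 0
    for n in range(calls):
        try:
            flaky_call(n)
            completed += 1
        except InjectedFault as fault:
            # A robust system loses this one call and carries on.
            failures[str(fault)] += 1
    return completed, failures
```

The point of the tally is the same as in the weekend run: many individual failures collapse into a short list of distinct fault types to investigate, while the loop itself never has to restart.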
Nice to see the 'PC' world finally catching up
If people want more info, then write to me.
Mark
Liked this comment? Why not buy me something nice
So in fact it's not talking about rebooting machine vs restarting services, it's talking about both of the above vs restarting subcomponents.
But hey, if you want to start talking about rebooting failed SMB services on Windows then go right ahead - you're in front of a friendly audience after all.
Somebody has to suggest the weird ideas, even if they sound stupid and impractical now. Of course we won't be retrofitting our existing systems in six months, I think this is a bigger vision than that.
:-)
Rather than trying to eliminate computer crashes--probably an impossible task--our team concentrates on designing systems that recover rapidly when mishaps do occur.
The goal here is clearly to make the stability of the operating system and software less critical, so we don't have to hope and pray that a new installation doesn't overwrite a system file with a weird buggy version, or that our OS won't decide to go tits-up in the middle of an important process. Since all us good Slashdotters KNOW there will still be crufty, evil OS's around in 10 years, even if WE aren't using them
Freedom: "I won't!"
hmmmm....Recovery Oriented Computing......This just screams Linux.
Recovery Oriented Computing is nothing new. Most developers (well, *nix developers) have been heading down this route for years. Particularly as more hardcore OO languages with exception structures (i.e. Java... and in many respects C++) come to the surface, it becomes easier to isolate and identify the exception that occurred and take appropriate action to keep the server going.
However, this method of coding is still growing... there are no really solid, accepted methods of isolating and identifying problems... however, in the next few years you will probably see this trend move to the next level as algorithms for identification and localization are developed and widely adopted.
Of course, if you're running on a Windows platform this is kinda pointless... rebooting at least once every 30 days really eliminates any chance of long-term running and the need for large-scale localization and identification.
rm -rf /*
^Z
just for fun!
I wonder... is there a meaningful distinction between ROC and the classical holy grail of ACID systems (i.e. systems which meet the Atomic, Consistent, Isolated and Durable assumptions commonly cited in the realm of commercial RDBMSs)? Apart from the 'swish' buzzword rename that isn't even an acronym?
Professionals in the field usually agree about the desirability of systems which pass the ACID test, but most admit that while the concepts are well understood, the real-world cost of the additional software complexity often precludes strict ACID compliance in typical systems. I would certainly be interested if there were more to ROC than evaluating the performance of existing, well-understood ACID-related techniques, but I can't find anything more than the "hype." For example, has ROC suggested designs to resolve distributed incoherence due to hardware failure? Classified non-trivial architectures immune to various classes of failure? Discovered a cost-effective approach to ACID?
My experience is the best system is paired computers running in parallel that are balanced by another computer that watches for problems and switches the crashed system from Live to the other computer seamlessly. It then reboots the system with problems and allows it to recreate its dataset from its partner.
In effect this points the way to the importance of massive parallelism required for totally stable systems so that clusters form the virtual computer and we get away from the idea of a computer as a single machine.
After all, individual computers suffer hardware failure too!
---- The Open Source Record Label : : LOCARECORDS.COM
Wouldn't some sort of software solution be the Hurd (if/when it becomes ready) in that as each system is a micro-kernel you just restart that bit of the operating system. As said in another post this is like /etc/rc.d but at a lower level.
Or you could just have some sort of failover setup.
Rus
Cheap UK and US VPS
Didn't IBM come out with some Magic Server Pixie Dust that did this sort of thing already, or am I mistaken?
Good judgment comes from experience, and a lot of that comes from bad judgment.
My particular system of research finally wound up relying on the Windows method: if uncertain, erase and reboot. It didn't have to be 99.999% available, after all. There are other ways to solve this in distributed/clustered computing, such as voting: servers in the cluster vote for each other's sanity (i.e. determine if the messages sent by one computer make sense to at least two others). However, not even this system is rock solid (what if two computers happen to malfunction in the same manner simultaneously? what if the malfunction is contagious? or widespread in the cluster?).
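A stripped-down Python sketch of the voting idea, assuming each node simply reports one value and the cluster trusts whatever a strict majority agrees on. The function name and the dict-of-replies shape are illustrative, not from any real cluster manager.

```python
from collections import Counter


def majority_vote(replies):
    """Pick the value a strict majority of nodes agree on.

    `replies` maps node name -> reported value.
    Returns (value, dissenters). If no strict majority exists, returns
    (None, all nodes), because nobody can be trusted over anybody else.
    """
    if not replies:
        return None, []
    value, count = Counter(replies.values()).most_common(1)[0]
    if count * 2 <= len(replies):
        return None, sorted(replies)
    dissenters = sorted(n for n, v in replies.items() if v != value)
    return value, dissenters
```

Note that `majority_vote({"a": 41, "b": 41, "c": 42})` happily returns 41 and flags "c" even if 42 was the right answer, which is exactly the weakness mentioned above: two nodes failing the same way outvote the healthy one.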
So, self-correcting is an intriguing question, to say the least. I'll be keenly following what the ROC fellas come up with.
In databases, you have your actions and when a sequence of events starts, they are committed at the end of the event cycle. When you change things, there is a sequence of events that leads to a "stable" state. When the stable state has arrived, you commit. When you decide that it is no good anyway, there is the possibility of a rollback: everything is rolled back to the last known good state.
In practice it would mean that changes are logged and possibly only effectuated after logging. This does result in overhead and in potential vulnerabilities (both to hackers and to errors).
Things like this also reek of what "standardised" hardware and software would look like. How else can you control the quality of such a system? NB this does not mean that Linux or BSD is inferior; it would only be more obvious and visible what went right and what went wrong.
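As a small illustration of the log-then-change idea, here is a Python sketch of a settings store with commit and rollback. The class and method names are invented for the example; a real database does far more (durability, isolation, concurrency).

```python
class UndoableConfig:
    """In-memory key/value settings with a simple undo log."""

    def __init__(self, initial=None):
        self._state = dict(initial or {})
        self._log = []   # (key, existed_before, old_value), oldest first

    def set(self, key, value):
        """Log the old value first, then apply the change."""
        self._log.append((key, key in self._state, self._state.get(key)))
        self._state[key] = value

    def commit(self):
        """Declare the current state stable; earlier changes can no longer be undone."""
        self._log.clear()

    def rollback(self):
        """Undo every change since the last commit, newest first."""
        while self._log:
            key, existed, old = self._log.pop()
            if existed:
                self._state[key] = old
            else:
                self._state.pop(key, None)

    def snapshot(self):
        """Return a copy of the current settings."""
        return dict(self._state)
```

The overhead mentioned above is visible even here: every write costs an extra log entry, and the log itself becomes one more thing that has to be correct.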
Thanks,
Gerard
You now don't reboot(tm) but you micro-reboot(tm), i.e. the system will do that for you! Remember the times when you were writing that important report in MS(r) Word(tm), and the system crashed, and you had to press Ctrl-Alt-Del(tm) to reboot(tm)? No more! No more pressing awkward buttons... The system is intelligent enough to do that for you :)
My first "PC" was a PDP-11/20, with paper tape reader and linc tape storage. Anyone who tries to tell me that operating today's computers is much more complex needs to take some serious drugs.
What is more complex is what today's computers do, and increasing their reliability or making them goal oriented are both laudable goals. What will not be accomplished is making the things that these computers actually do less complex.
Don't take life too seriously; it isn't permanent.
"But operating them is much more complex."
You're saying the computers of today are more complex to operate than those of 20 years ago?
What was the popular platform 20 years ago.... (1983). The MacOS had not yet debuted, but the PC XT had. The Apple ][ was the main competitor.
So you had a DOS command line and an AppleDOS command line. Was that really simpler than pointing and clicking in XP and OSX today? I mean, you can actually have your *mother* operate a computer today.
I'm not sure I agree with the premise.
You were mistaken. Which is odd, since memory shouldn't be a problem for you
net stop workstation
net start workstation
When NT services blow chunks, they often leave crap in kernel space that prevents them being stopped/started.
I hope things have improved with Windows XP.
thank God the internet isn't a human right.
Wouldn't better coding and better hardware be more efficient? This sounds a little silly. Perhaps, come quantum computers, maybe. Think of all the SA's that fix things that break all day who will be jobless.
Rob
Self-Repairing Computers
Finally, this provides us with the long-awaited answer to the following situations:
Reed: Captain, direct hit on the power supply!
Archer: That'll teach those cyborgs for flooding our inbox with p0rn!
T'Pol: Captain, their server is mysteriously repairing itself, we're still being flooded.
for any other series:
TOS:
%s/Reed/Chekov/g
%s/Archer/Kirk/g
%s/T'Pol/Spock/g
TNG:
%s/Reed/Worf/g
%s/Archer/Picard/g
%s/T'Pol/Data/g
DS9:
%s/Reed/Kira/g
%s/Archer/Sisko/g
%s/T'Pol/Dax/g
VGR:
%s/Reed/Tuvok/g
%s/Archer/Janeway/g
%s/T'Pol/Kim/g
Since the B&B messed up the timelines anyway, they'll probably pour it into an episode, they seem to be out of inspiration anyhow...
Genius doesn't work on an assembly line basis. You can't simply say, "Today I will be brilliant."
Washing machines have a lifetime of around 15-20 years, I guess; computers about 1-3 years.
;-) but i hope you got the point, no time to ask my living dictionary.
This is because the technical computer stuff is so new every year and so...
1: It's too expensive to make it failsafe; development would take too long.
2: You can't refine/redesign and resell, because of new technology.
3: If it just works, no one will buy new systems, so they have to fail every now and then.
Other consumer products, meanwhile, have a much longer development cycle. Cars, for example, shouldn't fail, and if they do it should be fairly easy to repair them. Cars have also been around for, I don't know, like a hundred years, and have they changed much? Computers, heck, just buy a new one or hire a PC Repair Man (Dutch only) to do your fixing.
excuse me for my bad english
build an "undo" function (similar to those in word-processing programs) for large computing systems
This is called "the sysadmin thinks ahead."
Essentially, when any sysadmin worth a pile of beans makes any changes whatsoever, he makes sure there's a backup plan before making his changes live. Whether it means running the service on a non-standard port to test, running it on the development server to test, making backups of the configuration and/or the binaries in question, or making backups of the entire system every night. She is thinking "what happens if this doesn't work?" before making any changes. It doesn't matter if it's a web server running on a lowly Pentium 2 or Google - the sysadmin is paid to think about actions before making them. Having things like this won't replace the sysadmin, although I can imagine a good many PHBs trying before realizing that just because you can back out of stupid mistakes, doesn't mean you can keep them from happening in the first place.
"No problem. I have the capacity to do infinite work so long as you don't mind that my quality approaches zero."-Dilbert
there will be always a single point of failure for ever
Well, yes and no. Single points of failure are extremely difficult to find in the first place, not to mention remove, but it can be done on the hardware side. I could mention the servers formerly known as Compaq Himalaya, nowadays part of HP's NonStop Enterprise Division in some manner. Duplicated everything, from processors and power sources to I/O and all manner of computing doo-dads. Scalable from 2 to 4000 processors.
They are (or were, when I did my research piece on the Himalayas) also self-correcting in the sense that the two processors do lock-step processing and if the two differ in their opinions, the primary immediately hands over the responsibility to the redundant/backup -- data self-correcting on the assembly level. Of course, this doesn't prevent software from being a point of failure or from functioning incorrectly, but one or a cluster of these is as close as you're going to get without automated hotswapping or nanobot parts building, or other such sci-fi notions.
So you had a DOS command line and an AppleDOS command line. Was that really simpler than pointing and clicking in XP and OSX today? I mean, you can actually have your *mother* operate a computer today.
This is true; however, keep in mind that none of the DOS operating systems had a kernel, nor were any of them truly multitasking until Windows 95 for the Windows world (shudders). And the debut of Unix 20 years ago.
Also keep in mind all the new technologies such as networking (that's a whole post of changes on its own), hardware, Bluetooth, FireWire, USB: a huge number of new technologies that have evolved to meet the ever-expanding demands we place on systems.
Some of the popular platforms from 20 years ago, such as the PC XT, are now used in calculators today. The very definition of a computer has changed in 20 years, so the operating systems are orders of magnitude more complex... 20 years ago the PC world was still in its infancy. Since then, everything outside the very definition of the PC has changed... and notebook and handheld technologies are pushing that.
That being said, it's not really fair to compare operating systems from 20 years ago to operating systems of today... it's just a different world, and the very definition of an operating system is no longer the same.
Or the factor of 1000 to 1 in hard disk sizes.
Or the 20:1 price difference.
I think a suitable punishment would be to lock the authors in a museum somewhere that has a 70s mainframe, and let them out when they've learned how to swap disk packs, load the tapes, splice paper tape, connect the Teletype, sweep the chad off the floor, stack a card deck or two and actually run an application...those were the days, when computing kept you fit.
Panurge has posted for the last time. Thanks for the positive moderations.
In databases, you have your actions and when a sequence of events starts, they are committed at the end of the event cycle. When you change things, there is a sequence of events that leads to a "stable" state. When the stable state has arrived, you commit.
This is actually exactly what iptables does... there is even a commit command at the end of every ruleset after all exceptional circumstances have been handled.
We've had RISC, MMX, VLIW, SSI, maybe it's time for DWIM processors.
I wouldn't worry about your English. It's better than some native speakers I've seen.
Rus
Cheap UK and US VPS
Computers sure are a lot more complicated, you can't argue with that. But the article just said they're more complicated to operate.
I guess typing "C:>DIR" is easier than clicking on Explorer and selecting "DETAIL" view.
It's free, it's flexible, it's powerful and it is extremely popular. It's even pretty damn easy to set up. No other OSS comes close.
You know it makes sense.
Ahhhh! Undo! Undo!
My RS "Color Computer" ran 0.9MHz, 8bits, and OS was BASIC. I had Telewriter-64 and some Spreadsheet.
Just in clock speed alone, 3e9/0.9e6 = 3333. A 32-bit machine would make that 4X, or 13,333, which is over 10,000. For the functions it had, it was more complex to use than MS Word or Open Office word. Only problem is that it still does not type any faster.
It's about having OS hooks to allow for introspection, subsystem management, etc. on a more fine-grained level.
The software can tell the OS: I have three major components (even though I'm single-threaded), each of which requires such and such devices, such and such memory, etc., and if anything falls outside the parameters I give you, then call this MAGIC FUNCTION and I'll give it a good whack to make it right again.
Or if such and such hardware device I needed fails, I can take corrective action. Maybe I start listening on network card eth1 when before I was listening to eth0.
etc.
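Roughly what I have in mind, sketched in Python. No OS offers exactly this interface as far as I know; `Supervisor`, `register` and the toy `Listener` are all invented names, and the "OS" here is just an object that polls health probes.

```python
class Supervisor:
    """Hypothetical stand-in for the OS-side registry of components."""

    def __init__(self):
        self._components = {}

    def register(self, name, is_healthy, repair):
        """Record a health probe and the 'magic function' to call on failure."""
        self._components[name] = (is_healthy, repair)

    def sweep(self):
        """Probe every component once; repair the unhealthy ones."""
        repaired = []
        for name, (is_healthy, repair) in self._components.items():
            if not is_healthy():
                repair()
                repaired.append(name)
        return repaired


class Listener:
    """Toy network listener that can fail over between interfaces."""

    def __init__(self, interfaces):
        self.interfaces = list(interfaces)
        self.active = self.interfaces[0]
        self.down = set()       # interfaces currently considered dead

    def is_healthy(self):
        return self.active not in self.down

    def fail_over(self):
        """Corrective action: switch to the first interface still working."""
        for nic in self.interfaces:
            if nic not in self.down:
                self.active = nic
                return
        raise RuntimeError("no working interface left")
```

Usage would be along the lines of `sup.register("listener", l.is_healthy, l.fail_over)` followed by periodic `sup.sweep()` calls: when eth0 is marked down, the sweep calls the repair function and the listener moves to eth1 without the whole program restarting.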
Black holes are where the Matrix raised SIGFPE
Many of these issues are best addressed at the hardware level, IMO. First of all, the software people don't have to worry about it then! ;-) For instance, look at RAID as a good example of reliable hardware (especially redundant RAIDS;). It is possible, using ECC memory and cache, and multiple CPUs, to be quite sure you're getting the correct results for a given calculation. You can also provide failover for continuous uptime.
Some of the rest of the article addressed issues of recovering from software errors as well. The first step is encouraging use of languages that don't constantly result in mechanical errors (stack exploits, wild pointers, freeing already freed space etc.). Many such solutions exist, from "safe" languages like LISP and Ada to managed languages like Java and Java--++ (C#). It is a much better approach to be able to design software as though the system is reliable, rather than working around an unreliable system.
All that said, an interesting approach to server software I ran across recently is Prevayler. A nice, simple, lightweight object persistence scheme. There is also a good article on it here. Prevayler is able to recover from system crashes quickly using a saved state and a journal file. Neat stuff!
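To show the general saved-state-plus-journal idea (not the actual library's API, which is Java and which I am not reproducing here), a toy Python version might look like this. The class, file names and JSON command format are all assumptions made for the sketch.

```python
import json
import os
import tempfile


class JournaledCounter:
    """Keeps totals in memory; recovers from a snapshot plus a command journal."""

    def __init__(self, directory):
        self.snapshot_path = os.path.join(directory, "snapshot.json")
        self.journal_path = os.path.join(directory, "journal.log")
        self.totals = {}
        self._recover()

    def _recover(self):
        # 1. Load the last saved state, if any.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path) as f:
                self.totals = json.load(f)
        # 2. Replay every command journaled since that snapshot.
        if os.path.exists(self.journal_path):
            with open(self.journal_path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def _apply(self, cmd):
        key = cmd["key"]
        self.totals[key] = self.totals.get(key, 0) + cmd["amount"]

    def add(self, key, amount):
        """Journal the command to disk first, then change memory."""
        cmd = {"key": key, "amount": amount}
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(cmd) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(cmd)

    def take_snapshot(self):
        """Save the full state, then start a fresh journal.

        Simplification: a crash between the two steps would replay old
        commands on top of the new snapshot; a real system guards this.
        """
        tmp = self.snapshot_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.totals, f)
        os.replace(tmp, self.snapshot_path)
        open(self.journal_path, "w").close()


def crash_and_recover_demo():
    """Write some data, 'crash', and return what a fresh instance recovers."""
    with tempfile.TemporaryDirectory() as d:
        store = JournaledCounter(d)
        store.add("apples", 3)
        store.take_snapshot()
        store.add("apples", 2)
        store.add("pears", 1)
        del store               # in-memory state is gone, files remain
        return JournaledCounter(d).totals
```

Recovery time depends on how many commands have piled up since the last snapshot, which is why taking snapshots regularly keeps restarts quick.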
Galileo: "The Earth revolves around the Sun!"
Score: -1 100% Flamebait
So, what would these transformations be other than... instructions? You could show me a list of "transformations" that the input data is to undergo to generate an output, and I'd show you a list of "instructions" that tell the computer what to do to the input data to generate an output.
Furthermore, what you want is impossible-- "all possible combinations doable by the [yet uncounted] transformations." That's an arbitrarily large amount of work that requires an arbitrarily large machine and time to accomplish it.
Kinda like a Connection Machine, huh? Those are real new.
Hrm. I suppose you've never noticed that memory buses are now specified by what amounts to a bandwidth number, as are IDE (ATA) bus family members. As to the "classical type" computer, again your prototype is the Connection Machine, circa 1983.
But operating them is much more complex.
I disagree. Feature for feature, modern computers are much more reliable and easy to use than their vacuum-tube, punch card, or even command-line predecessors. How many mom and pop technophobes do you think could hope to operate such a machine? Nowadays anybody can operate a computer, even my 85 year old grandmother who has never touched one until a few months ago. Don't mistake feature-overload for feature-complexity.
Flying is easy, just throw yourself at the ground and miss. -Douglas Adams
...has some of this mythical, magical "pixie dust".
:-)
It has "chipkill" ECC parity memory so that bit errors get autocorrected via the usual ECC method, plus if any chips go bad, the system recognizes that and maps around the faulty chips, while keeping on running.
It has multiple processors and is able to disable an individual cpu should it go bad... still the system will keep running. Even has a pair of "service processors" to manage the general purpose processors.
It has multiple power supplies.
It has a pair of mirrored hard drives for the AIX operating system exclusively to reside upon... even swapspace is mirrored.
It has a big RAID5 array for data and apps with dual SSA controller cards and redundant cabling.
It has multiple network interfaces, naturally.
It even has dual graphics controller cards fer crying out loud.
Of course all the filesystems are JFS or JFS2 journalled filesystems.
The Oracle database engine running on it uses multiple transaction logs for transactions and rollback capability.
The financial application proggies running on it... well, I won't go there today
Pixie Dust? Well... uptime on it today is 329 days, 10 hours and some minutes. I've never needed to reboot it since the day it was installed and powered up. I've even applied several o/s patches, all of which were done hot.
This machine is proving to be as stable as some of the FreeBSD boxes I used to run years ago.
Pixie Dust indeed.
The moment you buy them, they add to the profit ...
To make components reset themselves or to let them memorize states for the purpose of undoing work is the approach of those not involved.
The need to reset a component is because it has reached a state where it stops responding to any input. Or in other words, the component depended on receiving correct input without checking the input according to its state and thus locked itself up.
An undo operation on the other hand would lead to components accepting any input and to reach any state (even the undefined one) but with the need to memorize their previous states. Other components making use of them now would have to ignore the operability of these components and to memorize the previously issued actions on their part to be able to undo them.
The only component being able to start an undo would be the button on the GUI the user can click on.
It is a very interesting concept, giving all power to the user in front and to let him/her decide whether the computer is in an invalid state or not. It would be a radical change in the history of computer science. A user would not anymore be a slave to the blue screen (or a kernel panic) demanding a confirmation of the unavoidable reset!
Everything would have to be redesigned and reimplemented. Reuse of old, existing components would of course be impossible and errors in the final product are only because of imperfect programmers and will be solved through updates and newer releases.
Sven
Do we have to keep using this tired old notion of little old (middle-aged, for the /. crowd) ladies cringing in terror when faced with a computer?
My mother has a B.Math in CS, acquired more than a quarter century ago. Her father is pushing eighty, and he upgrades his computer more often than I do. When he's not busy golfing, he's scanning photographs for digital retouching. (In his age bracket, a man who can remove double chins and smooth wrinkles is very popular.)
The notion that women and/or the elderly are unable to use computers is a generalization that just doesn't hold much water anymore. Maybe some of these people are frightened of (or frustrated with) computers because their exposure to technology is through the 'typical'* arrogant, smug, condescending /.er--concealing his embarrassment over being unable to get a girlfriend behind clouds of technobabble.
*How does it feel to be the target of an unfair stereotype?
~Idarubicin
Oh yeah. My TRS-80 used to NEVER crash twenty years ago when I accessed LARGE INTERNET SITES.
I object to that article, and to the next reply.
How big is the check I'm writing right now?
How fast is it?
With these as your evaluation function, you are guaranteed to get systems with little redundancy and little or no internal safety checks.
One regrettable example of this is the market for personal finance programs. The feature that sells Quicken is quick-fill - the heuristic automatic data entry that makes entering transactions fast. Never mind that Quicken's register file is fragile - it frequently loses track of balances (requiring the moral equivalent of fsck), and every few years the accumulated unfixed cruft causes a major failure, requiring insane fixes like exporting all your data as QIF files and reimporting it into a new register.
If Quicken were back-ended into a real database, with real transactions, real consistency checks, and real crash recovery, all this would go away. But it would make Quicken slower and require more hardware horsepower to run it - the marketplace would punish them for improving their lives.
What the original article is proposing is:
We accept that systems will always suck
Therefore, we should build multi-level suckage damage control into them
Another possibility is:
We accept that there is a tradeoff between system speed and safety
Therefore, we take the speed hit where safety is important
To a Lisp hacker, XML is S-expressions in drag.
I thought Unisys had modified their AT&T UNIX(r) to perform on the fly save points, and when an error occurred, the OS would roll back to the savepoint and re-execute the steps again. The theory was that these errors would only occur if there were several events happening at the same time. By rolling back and re-executing the steps, one or more of the events would not be happening at that time.
They claimed to reduce kernel panics by 80% this way.
I am not sure how an event could not occur when re-executing the same steps, since it's the "same steps". It's been a few years since I was told about this, and I may be remembering incorrectly.
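For what it's worth, here is a rough sketch of the savepoint-and-retry idea as described above. It is not Unisys's implementation (I have no idea what theirs looked like); `run_with_savepoint`, `transfer` and the `interrupts` iterator are made-up names for illustration. The point is that the steps are the same on the retry, but the outside event that collided with them the first time usually is not there the second time.

```python
import copy

def run_with_savepoint(state, step, retries=3):
    """Run step(state) in place; on an exception, restore the savepoint and retry."""
    savepoint = copy.deepcopy(state)
    for _ in range(retries):
        try:
            step(state)
            return state
        except Exception:
            # Roll back any partial changes before trying the same steps again.
            state.clear()
            state.update(copy.deepcopy(savepoint))
    raise RuntimeError("step still failing after %d attempts" % retries)

# Hypothetical transient event: it collides with the step on the first try only.
interrupts = iter([True, False])

def transfer(state):
    state["balance"] -= 10               # partial update happens first
    if next(interrupts):
        raise RuntimeError("simulated race with another event")
    state["log"].append("debit 10")

state = {"balance": 100, "log": []}
run_with_savepoint(state, transfer)
print(state)  # {'balance': 90, 'log': ['debit 10']}
```

Without the rollback, the failed first attempt would have left the balance debited twice and logged once. A deterministic bug would of course fail on every retry, which is presumably why the claimed reduction was 80% and not 100%.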
You can lose something that is loose, so tighten the loose item so you don't lose it.
micro-rebooting - Apache has been doing that for years.
undo - transaction rollbacks in data bases.
injecting test errors - how does this differ from automated testing suites?
better tools for pinpointing problems - just an incremental improvement.
Nothing really new here, just an extension of existing technology. All of these have been solved in other areas a long time ago.
my $.02
putting the 'B' in LGBTQ+
And the debut of Unix 20 years ago.
Just to set the record straight, I think you mean more than 30 years ago, unless you're talking about the debut of XENIX.
this one rubs me the wrong way.
This does not seem to be leading edge research. Someone else posted their suspicion that the team was working with win32 systems looking for a better way.
I agree with them.
I think these guys are looking at win32 systems and wishing they were Unix ones.
1. 'Micro Reboots' - Can you say '/etc/init.d'? For example, my Linux machine sometimes chokes on sound. I could reboot the machine, but instead I just restart the sound service: '/etc/init.d/sound restart'
2. 'Better tools to pinpoint the sources of faults' Can you say '/var/adm/SYSLOG or Messages?' Anything you want to know about the machine ends up there. If you take a proactive approach to your log, you are going to notice things before they bring the whole thing down. Maybe we could have better logging, but still this is execution of what we know, not anything new.
3. 'We need an Undo' This one is very easy to set up yourself. It could be automated to a point, but really, isn't this just a backup? Too much undo and you can't get anything done. Not enough and you need to know something to get the machine running again. Seems to me that a quality analysis of the system and its potential faults would yield a list of data to be incrementally watched and archived to achieve the same results.
4. I will give them a little credit for this one, though I am not sure I agree. This part of the puzzle would happen as part of #3 for the most part.
I have a big problem with #4 in that it makes the assumption that we are smarter than the tech. I am not sure we are. We can build it, but we don't always have a clue as to what it will do because there are too many subtle interactions to account for.
Better to let the machine bitch about the issue and be well prepared to deal with it. After a while, you and your machine will understand each other's issues and all will be fine.
So, in the end, these folks are wishing they had a well planned and configured Unix system when they actually have something less...
Why not take those first three ideas and build a Linux that executes them nicely? Maybe people will prefer it to what we have now, maybe not. That would be some research.
This just isn't as big of a deal as it looks. (Sorry guys)
Blogging because I can...
" The notion that women and/or the elderly are unable to use computers is a generalization that just doesn't hold much water anymore. "
Thus proving the original poster's point.
Which, I'm sure, you didn't intend on doing. But I only pointed it out because I am the typical arrogant, smug, condescending /.er. Why? Because I'm smarter than you, richer than you, better looking than you, and yes, I get more women than you.
Life's a bitch, and then you marry one.
Huh? How is double-clicking an icon more complex than typing in LOAD "*",8,1 or having to load something from a tape drive? If by complex you mean you can do more, I agree, but as far as ease of use goes, it's nowhere near as complex or counter-intuitive.
Manipulate the moderator system! Mod someone as "overrated" today.
This item claims that computers have become 10000 times faster in the last 20 years, but I hasten to disagree. Application of Moore's Law suggests they have become 10321 times faster... ;-)
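For anyone wondering where 10321 comes from, it falls out if you assume the "doubles every 18 months" flavor of Moore's Law (that doubling period is my assumption; the figure is not explained above). A quick check, with `moore_speedup` being a name I made up:

```python
def moore_speedup(years, doubling_period_years=1.5):
    """Speedup after `years` if performance doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)

print(round(moore_speedup(20)))       # 10321
print(round(moore_speedup(20, 2.0)))  # 1024, if you prefer the 24-month version
```

So the joke only works with the 18-month doubling period; the two-year version gives you a mere thousandfold.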
The authors state as a given that crashes will happen.
I remember a few years ago (well, more than a few) I had this conversation with IBM MVS architects in an argument about whether MVS should boot faster. Their position was that it shouldn't crash in the first place.
So what's widely acknowledged to be the least likely system to crash? IBM Mainframes.
John Roth
Consumers have proven that they will buy just about any piece of crap shoved under their nose, provided that it's cheap.
micro-rebooting; using better tools to pinpoint problems in multicomponent systems; build an "undo" function...
I think they just invented Lisp :). I don't program in Lisp, but have seen people who are very good at it. Quite impressive.
Healthcare article at Kuro5hin
To expand on this idea, we could do a "run it 3 times" system.
You would have three operating systems, each running a Java-type processor. Send the Java instructions to all 3 machines, and hopefully they should return identical results. If two match and one is different, then throw out the one that differs.
Now a bug has to screw up at least 2 of the three operating systems, which is much less likely.
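The voting part of this is only a few lines. A minimal sketch, assuming each "machine" can be modeled as a plain function and that results are hashable values you can compare directly; `vote`, `run_three_times` and the replicas below are hypothetical names, not any real system's API:

```python
from collections import Counter

def vote(results):
    """Return the value a strict majority of replicas agree on, else raise."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value

def run_three_times(replicas, *args):
    """Send the same input to every replica and keep the majority answer."""
    return vote([replica(*args) for replica in replicas])

# Hypothetical replicas: two healthy, one with a bug.
healthy_a = lambda x: x + 1
healthy_b = lambda x: x + 1
buggy = lambda x: x + 2

print(run_three_times([healthy_a, healthy_b, buggy], 41))  # 42
```

Note what this does not cover: if all three replicas run the same buggy code, they will happily agree on the wrong answer, so the scheme only helps when the failures are independent.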
excitingthingstodo.blogspot.com
Just write a process that has a sleep function and checks for other copies of itself and the cron daemon. If it doesn't see the cron daemon, start it. Run arbitrary numbers of this process for added guarantee.
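Something like the loop below, as a sketch only. How you actually detect and start the cron daemon depends on the OS, so `is_alive` and `start` are left as hypothetical callables you would supply, and the loop is bounded by `checks` so it can be run as a demo instead of forever:

```python
import time

def watchdog(is_alive, start, checks, interval_seconds=0.0):
    """Poll is_alive() `checks` times; call start() whenever the target is down.

    Returns the number of restarts performed.
    """
    restarts = 0
    for _ in range(checks):
        if not is_alive():
            start()
            restarts += 1
        time.sleep(interval_seconds)
    return restarts

# Stand-in for the daemon: a flag instead of a real process.
daemon = {"running": False}

def fake_start():
    daemon["running"] = True

restarts = watchdog(lambda: daemon["running"], fake_start, checks=3)
print(restarts, daemon["running"])  # 1 True
```

Checking for other copies of the watchdog itself is the same loop pointed at a different target.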
If I am not mistaken, I read somewhere that IBM is using this in their mainframe processors. Sounds good, but do you want a faster computer chip or more safety features? The quandary.
Check this out for more details - http://www.research.ibm.com/journal/rd/435/spainhower.html
Circuit-level fault detection and recovery including "instruction-retry capability"!
I'm not the original poster you responded to and I'm posting anonymously so you can't accuse me of trying to score karma.
I'm of the opinion that a large number of the naysayers do have valid points. Detecting and responding to failures can be done in a relatively easy and cost-effective manner on today's hardware, especially compared to a complete overhaul and redesign that relies on untested methods and practices. If software and hardware are designed to fail safe, with intelligently designed journaling software and multiple redundant hardware, anything else is overkill.
Given the law of diminishing returns, the solutions that businesses buy and that individuals use are those that work well enough to do the job at hand - and no more. If companies like Amazon.com and Google.com can keep websites up 24x7 with no noticeable downtime on cheap commodity hardware, why would they need this technology?
Okay, perhaps there are some military or medical applications that could use this, but it's an unproven solution that's bound to be vastly more expensive than the one it replaces, for very little improved reliability.
I wish them luck, perhaps when they've sold their first couple of installations it'll be worth revisiting. No offense to my former CS department (which is of no relation to the article), but there's a reason that the phrases "that's academic" and "if you can't do - teach" are both not compliments. It may not be realistic to automatically give them credit merely because it is a university research project. The environment of a CS department sometimes is not the best for new ideas - ideological inbreeding and isolation from the real world (where the bottom line is $$$) sometimes gives strange tilts to the work that comes out of them.
"Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex."
Spoken like someone that has never had to choose an interrupt for their new sound card.
So this should help reduce the number of railroad accidents in the future, right?
There are 0x40000000 types of people: those who understand 32-bit IEEE 754 floating point, and those who don't.
"On board the ship, everything was as it had been for millennia. deeply dark and silent.
Click, hum.
At least, almost everything.
Click, click, hum.
Click, hum, click, hum, click, hum.
Click, click, click, click, click, hum.
Hmmm.
A low-level supervising program woke up a slightly higher-level supervising program deep in the ship's semisomnolent cyberbrain and reported to it that whenever it went click all it got was a hum.
The higher-level supervising program asked it what it was supposed to get, and the low-level supervising program said that it couldn't remember what it was meant to get, exactly, but thought it was probably more of a sort of distant satisfied sigh, wasn't it? It didn't know what this hum was. Click, hum, click, hum. That was all it was getting.
The higher-level supervising program considered this and didn't like it. It asked the low-level supervising program what exactly it was supervising and the low-level supervising program said it couldn't remember that either, just that it was something that was meant to go click, sigh every ten years or so, which usually happened without fail. It had tried to consult its error look-up table but couldn't find it, which was why it had alerted the higher-level supervising program of the problem.
The higher-level supervising program went to consult one of its own look-up tables to find out what the low-level supervising program was meant to be supervising.
It couldn't find the look-up table.
Odd.
It looked again. All it got was an error message. It tried to look up the error message in its error message look-up table and couldn't find that either. It allowed a couple of nanoseconds to go by while it went through all this again. Then it woke up its sector function supervisor.
The sector function supervisor hit immediate problems. It called its supervising agent, which hit problems too. Within a few millionths of a second virtual circuits that had lain dormant, some for years, some for centuries, were flaring into life throughout the ship. Something, somewhere, had gone terribly wrong, but none of the supervising programs could tell what it was. At every level, vital instructions were missing, and the instructions about what to do in the event of discovering that vital instructions were missing, were also missing.
Small modules of software - agents - surged through the logical pathways, grouping, consulting, regrouping. They quickly established that the ship's memory, all the way back to its central mission module, was in tatters. No amount of interrogation could determine what it was that had happened. Even the central mission module itself seemed to be damaged.
This made the whole problem very simple to deal with, in fact. Replace the central mission module. There was another one, a backup, an exact duplicate of the original. It had to be physically replaced because, for safety reasons, there was no link whatsoever between the original and its backup. Once the central mission module was replaced it could itself supervise the reconstruction of the rest of the system in every detail, and all would be well.
Robots were instructed to bring the backup central mission module from the shielded strong room, where they guarded it, to the ship's logic chamber for installation.
This involved the lengthy exchange of emergency codes and protocols as the robots interrogated the agents as to the authenticity of the instructions. At last the robots were satisfied that all procedures were correct. They unpacked the backup central mission module from its storage housing, carried it out of the storage chamber, fell out of the ship and went spinning off into the void.
This provided the first major clue as to what it was that was wrong.
-- DNA, MH (hhgg5)
"Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."
"Undo" feature? That's what backups are for.
Perhaps you missed the point. Any person who has been administering computers for 10 years should be able to write that script and perform those backups and get it right in about a month or so.
Or, you could make automatic recovery and undo features part of the operating system. It's not easy, but it only has to be done once, and it would just do the Right Thing.
Linux is a better Unix than we had 30 years ago, but we really need a new generation of operating systems, where the shotgun at least has a safety. I wonder if it can be done within the Linux framework, or whether we are talking about a whole new operating system
http://roc.cs.berkeley.edu/
I haven't read the SciAm article so I'm not sure what spin they put on it, but it's actually a very reasonable idea.
The idea is two-fold:
(a) When trying to maximize reliability, it might actually be better in terms of total downtime to reduce recovery time rather than improve reliability. Take a system which crashes 5 times a year on average and takes an hour to go back online each time it crashes. Your total downtime is 5 hours/year. If you fix one place where the system crashes your total downtime will go down to 4 hours/year. But maybe for the same effort you can reduce the recovery time from 1 hour to 45 minutes. That's 3.75 hours/year of downtime. This is the kind of tradeoff that a lot of reliability engineering people don't think about, but should.
At the limit, if you had a file server that could recover within 5 seconds, who cares if it crashes twice a day? That's a short enough interval that the clients will automatically retry and succeed.
(b) You have to design the recovery path anyway, since you have to assume that sometimes your system crashes. You could also design a clean shutdown / startup path OR you could put all of that effort into making your recovery path that much faster and more effective.
Not having a "clean" shutdown path also has the benefit that every time you restart the system for any reason you are testing your recovery logic.
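The arithmetic in (a) is easy to play with. A tiny sketch using the numbers from the example above; the function name is mine, and it assumes downtime is simply crash count times recovery time:

```python
def annual_downtime_hours(crashes_per_year, recovery_hours):
    """Expected downtime per year = number of crashes x time to recover from each."""
    return crashes_per_year * recovery_hours

baseline = annual_downtime_hours(5, 1.0)          # 5 crashes, 1 hour each
fewer_crashes = annual_downtime_hours(4, 1.0)     # fix one crash site
faster_recovery = annual_downtime_hours(5, 0.75)  # recover in 45 minutes

print(baseline, fewer_crashes, faster_recovery)   # 5.0 4.0 3.75

# The file server at the limit: two crashes a day, 5 seconds to recover.
file_server = annual_downtime_hours(2 * 365, 5 / 3600)
print(round(file_server, 2))                      # 1.01
```

So the twice-a-day crasher still totals only about an hour a year, and that hour is spread over 730 blips short enough for clients to retry through.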
And your ABS ECU doesn't have CTRL ALT DELETE , does it?
;o)
Since BMW were looking at using embedded WinCE in their cars, one day it may well do.
Just another reason not to drive a BMW then
Beep beep.
This article has no specifics...but clearly a Turing Machine does have a state where it cannot continue to function, and I believe that this can be proved using the diagonal method and set theory...
As a professor, I can't help but think that some Slashdot responses are disguised pleas for help. Therefore, let me offer some guidance:
* The factor of 10,000 performance improvement in 20 years is not the focus of the article, but if you are interested in where it came from, please see a book. On page 3 of Computer Architecture: A Quantitative Approach, 3rd edition (http://www.amazon.com/exec/obidos/ASIN/1558605967/), Figure 1.1 shows a performance improvement of a factor of 1.58 per year between 1984 and 2001 using a few generations of the SPEC benchmarks. That is a factor of 2400. If we add 3 more years at the same rate, we get a factor of 9600. QED.
* Backup is one of the 3Rs of system administrator undo that we are pursuing, but it is not all of them. The 3Rs are Rewind, Repair, and Replay. Backup gives us Rewind, but not Repair or Replay. It is also different from ACID transactions, which operate at a very low level of the system. We are interested in undo of higher level "verbs" that correspond to high-level user actions. If you want to learn about our undo ideas before you need to reply, see http://roc.cs.berkeley.edu/papers/sigops-ew2002-undo.pdf.
* TMR stands for Triple Modular Redundancy, which is an effective but expensive technique to protect from hardware failures. If hardware failures were the leading problem, then TMR would be the path to follow. Hardware errors are responsible for only 15% of the outages, as those who have read the Scientific American article already know. TMR and systems like HP's (nee Tandem's) NonStop do not address operator error.
* We are focused on Internet style applications that are considerably above the operating system, but the problems we have documented about operators being a major source of outages include all systems, including Linux systems. To learn about hard to find data about causes of failures before you reply, please see http://roc.cs.berkeley.edu/papers/usits03.pdf.
* We agree that the telephone industry did many fine things to make communication dependable, and that there is much to learn and emulate from them. If computers were as reliable as telephony, we could be much prouder of our field.
* Our focus is on Internet services, so the cost of ownership is probably higher for such servers than for PCs. I wouldn't be surprised, however, if you multiplied a typical white-collar hourly pay rate times the average number of hours that one spends administering a PC and got similar results.
* For those of you who were not using computers in 1983, that is the era of open source UNIX software (BSD) on 32-bit computers (VAX). Sound familiar? Punched cards had been passé for quite a while in 1983.
* For those wanting to read something with more technical depth, see http://roc.cs.berkeley.edu/papers/ROC_TR02-1175.pdf. For the Slashdot readers who only have time for a quick overview, see the Scientific American article (www.sciam.com/article.cfm?chanID=sa006&articleID=000DAA41-3B4E-1EB7-BDC0809EC588EEDF). For those who only have time to read Slashdot, may God protect you on your journey towards technical obsolescence.
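To make the difference between a plain backup and the 3Rs concrete, here is a toy model. It is only an illustration of the Rewind / Repair / Replay idea as described above, not the design from the papers; the `UndoableSystem` class and the mail-style verbs are invented for the example.

```python
import copy

class UndoableSystem:
    """Toy model: state is a dict, operator and user actions are logged as verbs."""

    def __init__(self, state):
        self.state = state
        self.snapshot = copy.deepcopy(state)  # the "backup"
        self.log = []                         # verbs applied since the snapshot

    def apply(self, verb, *args):
        self.log.append((verb, args))
        verb(self.state, *args)

    def rewind_repair_replay(self, repair):
        self.state = copy.deepcopy(self.snapshot)  # Rewind: restore the backup
        repaired_log = repair(self.log)            # Repair: fix the history
        self.log = []
        for verb, args in repaired_log:            # Replay: redo the good work
            self.apply(verb, *args)

# Hypothetical high-level verbs for a mail service.
def add_user(state, name):
    state[name] = []

def deliver(state, name, message):
    if name in state:                  # mail to a missing mailbox is lost
        state[name].append(message)

def delete_user(state, name):
    state.pop(name, None)

system = UndoableSystem({})
system.apply(add_user, "alice")
system.apply(deliver, "alice", "hi")
system.apply(delete_user, "alice")           # operator error
system.apply(deliver, "alice", "lunch?")     # lost: the mailbox is gone
print(system.state)                          # {}

# Repair here means dropping the mistaken verb from the history.
system.rewind_repair_replay(
    lambda log: [(verb, args) for verb, args in log if verb is not delete_user]
)
print(system.state)                          # {'alice': ['hi', 'lunch?']}
```

Restoring the backup alone would have returned the empty snapshot; it is the replay of the logged verbs that brings back both messages, including the one that arrived after the mistake.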
If only more people realized this.
Compaq was working on this technology ages ago. The idea was that the computer would self-report imminent failures. It has moved up a notch, but only a notch. Micro-rebooting - there's a concept! Narf!
All Ad hominem replies happily ignored as the sender shall be deemed to lack the faculties to comprehend the equation.
Norton's Crashguard (aka CrashHard) was supposed to help you recover from crashes in Mac OS 8, but in actuality it caused more crashes than it cured.
The parent post is from someone who actually knows what they're talking about, and it's got a score of 1 right now. Would any moderators care to correct this?