Slashdot Mirror


Self-Repairing Computers

Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. You all have experienced a PC crash or the disappearance of a large Internet site. What to do to improve the situation? This Scientific American article describes a new method called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery by using what these researchers call micro-rebooting; using better tools to pinpoint problems in multicomponent systems; building an "undo" function (similar to those in word-processing programs) for large computing systems; and injecting test errors to better evaluate systems and train operators. Check this column for more details or read the long and dense original article if you want to know more."

208 comments

  1. This would be great by CausticWindow · · Score: 4, Funny

    coupled with self debugging code.

    --
    How small a thought it takes to fill a whole life
    1. Re:This would be great by ThundaGaiden · · Score: 1

      And they had better make sure that the self
      debugging can handle threaded applications

      It'll be great, but I can tell you one thing
      straight off...

      I really pity the first person who has to code
      it, ha ha ha, it's not going to be me.

  2. This post by nother_nix_hacker · · Score: 2, Funny

    Is Ctrl-Alt-Del ROC too? :)

  3. Managerspeak by CvD · · Score: 3, Insightful

    I haven't read the long and dense article, but this sounds like managerspeak, PHB-talk. The concepts described are all very high level, requiring a whole plethora of yet unwritten code to roll back changes in a large system. This will require a lot of work, including rebuilding a lot of those large systems from the ground up.

    I don't think anybody (any company) is willing to undertake such an enterprise, having to re-architect/redesign whole systems from ground up. Systems that work these days, but aren't 100% reliable.

    Will it be worth it? For those systems to have a smaller boot up time after failure? I don't think so, but ymmv.

    Cheers,

    Costyn.

    1. Re:Managerspeak by gilesjuk · · Score: 5, Interesting

      Not to mention that the ROC system itself will need to be rock solid. It's no good to have a recovery system that needs to recover itself, which would then recover itself and so on :)

    2. Re:Managerspeak by TopShelf · · Score: 2, Funny

      Speaking for the PHB's, this sounds very exciting. I can't wait until they have self-upgrading computers as well. No more replacing hardware every 3 years!

      --
      Stop by my site where I write about ERP systems & more
    3. Re:Managerspeak by Bazzargh · · Score: 3, Insightful

      I haven't read the long and dense article

      Yet you feel qualified to comment....

      requiring a whole plethora of yet unwritten code

      You do realize they have running code for (for example) an email server (actually a proxy) which uses these principles? NB this was based on proxying sendmail, so they didn't "re-architect/redesign whole systems from ground up". This isn't the only work they've done either.

      As for 'will it be worth it', if you'd read the article you'd find their economic justifications. This has a good explanation of the figures. Note in particular that a large proportion of the failure they are concerned about is operator error, hence why they emphasise system rollback as a recovery technique, as opposed to software robustness.

    4. Re:Managerspeak by sjames · · Score: 4, Interesting

      There are already steps in place towards recoverability in currently running systems. That's what filesystem journaling is all about. Journaling doesn't do anything that fsck can't do EXCEPT that replaying the journal is much faster. Vi recovery files are another example. As the article pointed out, 'undo' in any app is an example.

      Life-critical systems are often actually two separate programs: 'old reliable', which is primarily designed not to allow a dangerous condition, and the 'latest and greatest', which has optimal performance as its primary goal. Should 'old reliable' detect that 'latest and greatest' is about to do something dangerous, it will take over and possibly reboot 'latest and greatest'.
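
      A minimal sketch of that two-program arrangement, in Python. Everything here is hypothetical and for illustration only: the heater scenario, the `SAFE_MAX_POWER` limit, and the function names are assumptions, not taken from any real life-critical system.

```python
SAFE_MAX_POWER = 0.8  # hypothetical safety limit for a heater, 0.0 to 1.0


def latest_and_greatest(temperature: float) -> float:
    """Performance-tuned controller; assumed to misbehave on some inputs."""
    return (100.0 - temperature) / 10.0


def old_reliable(temperature: float) -> float:
    """Conservative controller that never leaves the safe range."""
    return 0.5 if temperature < 60.0 else 0.0


def is_safe(power: float) -> bool:
    return 0.0 <= power <= SAFE_MAX_POWER


def supervise(temperature: float) -> tuple:
    """Use the fast controller unless it proposes something dangerous."""
    proposed = latest_and_greatest(temperature)
    if is_safe(proposed):
        return proposed, "latest_and_greatest"
    # A real system might also restart the fast controller at this point.
    return old_reliable(temperature), "old_reliable"


for reading in (95.0, 20.0, 110.0):
    print(reading, supervise(reading))
```

      The point of the design is that only `is_safe` and `old_reliable` have to be trusted; the optimized path can be as clever as it likes.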

      Transaction-based systems feature rollback, volume managers support snapshots, and libraries exist to support application checkpointing. EROS is an operating system based on transactions and persistent state. It's designed to support this sort of reliability.

      HA clustering and server farms are another similar approach. In that case, they allow individual transactions to fail and individual machines to crash, but overall remain available.

      Apache has used a simple form of this for years. Each server process has a maximum service count associated with it. It will serve that many requests, then be killed and a new process spawned. The purpose is to minimize the consequences of unfixed memory leaks.
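
      The recycling idea can be sketched in a few lines of Python. This is not Apache's code or API: `Worker`, `Pool`, and `max_requests` are hypothetical names, and a real server would kill and fork an actual process rather than swap an object. In Apache the limit is set with the `MaxRequestsPerChild` directive.

```python
from dataclasses import dataclass, field
from itertools import count

_worker_ids = count(1)


@dataclass
class Worker:
    """Stand-in for a server child process (hypothetical)."""
    worker_id: int = field(default_factory=lambda: next(_worker_ids))
    served: int = 0

    def handle(self, request: str) -> str:
        self.served += 1
        return f"worker {self.worker_id} served {request}"


class Pool:
    """Replaces its worker after a fixed number of requests."""

    def __init__(self, max_requests: int):
        self.max_requests = max_requests
        self.worker = Worker()
        self.recycled = 0

    def handle(self, request: str) -> str:
        reply = self.worker.handle(request)
        if self.worker.served >= self.max_requests:
            # Discard the old worker, and with it any memory it leaked.
            self.worker = Worker()
            self.recycled += 1
        return reply


pool = Pool(max_requests=3)
replies = [pool.handle(f"req{i}") for i in range(7)]
print("\n".join(replies))
```

      No request ever sees the restart; the cost of the leak is bounded by how much one worker can leak in `max_requests` requests.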

      Many server daemons support a reload method where they re-read their config files without doing a complete restart. Smart admins make a backup copy of the config files to roll back to should their changes cause a system failure.

      Also, as the article points out, design for testing (DFT) has been around in hardware for a while as well. That's what JTAG is for. JTAG itself will be more useful once reasonably priced tools become available. Newer motherboards have JTAG ports built in. They are intended for monitor boards, but can be used for debugging as well (IMHO, they would be MORE useful for debugging than for monitoring, but that's another post!). Built-in watchdog timers are becoming more common as well. ECC RAM is now mandatory on many server boards.

      It WILL take a lot of work. It IS being done NOW in a stepwise manner. IF/when healthy competition in software is restored, we will see even more of this. When it comes down to it, nobody likes to lose work or time and software that prevents that will be preferred to that which doesn't.

    5. Re:Managerspeak by cyberlync · · Score: 1

      As with many other things, it's really just a matter of choosing the right tool for the job. In this case, it sounds a lot like Erlang may be the right tool, with its live reload, built-in fault tolerance, and distributed nature. It may still be a lot of work, but picking the right tool would move it from impossible (C, C++) to just difficult.

      --
      I'm a programmer, I don't have to spell correctly; I just have to spell consistently
    6. Re:Managerspeak by violent.ed · · Score: 0

      PHB?
      Pimply Half Brains?

      --
      - You're not paranoid, they really are after you.
    7. Re:Managerspeak by CvD · · Score: 0

      Pointy Haired Bosses. You should read Dilbert sometime. :-)

      Cheers,

      Costyn.

    8. Re:Managerspeak by Anonymous Coward · · Score: 0
    9. Re:Managerspeak by Salamander · · Score: 1

      The key is not to build the system hierarchically, with one "big brain" that watches everyone else but nobody watching it back. A more robust approach is to have several peers all watching each other and using a more "democratic" method to determine who's faulty. It's more difficult to design and implement the necessary protocols, but it's not impossible. The folks at Berkeley have quite a bit of experience with this stretching from OceanStore back (at least) to NOW and, having met them, I have full confidence that they know what they're doing.
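
      A rough sketch of the "peers watching each other" idea, assuming a plain majority vote. This is an illustration only, not the protocol used by OceanStore or NOW, and it is far weaker than a real Byzantine agreement scheme. The peer names and the `faulty_peers` function are hypothetical.

```python
def faulty_peers(observations: dict) -> set:
    """observations[watcher][target] is True if watcher saw target respond.

    A peer is suspected when more than half of the *other* peers
    report that it failed to respond.
    """
    suspected = set()
    for target in observations:
        votes = [
            seen[target]
            for watcher, seen in observations.items()
            if watcher != target and target in seen
        ]
        if votes and votes.count(False) > len(votes) / 2:
            suspected.add(target)
    return suspected


observations = {
    "A": {"B": True, "C": False, "D": True},
    "B": {"A": True, "C": False, "D": True},
    "C": {"A": False, "B": False, "D": False},  # faulty peer blames everyone
    "D": {"A": True, "B": True, "C": False},
}
print(faulty_peers(observations))
```

      Because no single watcher's opinion is decisive, the faulty peer's accusations against A, B, and D are outvoted.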

      --
      Slashdot - News for Herds. Stuff that Splatters.
    10. Re:Managerspeak by Anonymous Coward · · Score: 0

      This is nothing new (maybe in the PC or UNIX world) since we have had methodologies for this in the mainframe world for over 40 years. I have worked with many of these types of methodologies over the last 20 years. Hopefully, someone will realize the benefit this work has had in the past and not try to re-invent the wheel yet again.

      But, again it is not new.

  4. Interesting choice by sql*kitten · · Score: 4, Insightful
    From the article:

    We decided to focus our efforts on improving Internet site software. ...
    Because of the constant need to upgrade the hardware and software of Internet sites, many of the engineering techniques used previously to help maintain system dependability are too expensive to be deployed.

    (etc)

    Translation: "when we started this project, we thought we'd be able to spin it off into a hot IPO and get rich!!"
  5. /etc/rc.d ? by graveyhead · · Score: 4, Interesting
    Frequently, only one of these modules may be encountering trouble, but when a user reboots a computer, all the software it is running stops immediately. If each of its separate subcomponents could be restarted independently, however, one might never need to reboot the entire collection. Then, if a glitch has affected only a few parts of the system, restarting just those isolated elements might solve the problem.
    OK, how is this different from the scripts in /etc/rc.d that can start, stop, or restart all my system services? Any daemon process needs this feature, right? It doesn't help if the machine has locked up entirely.

    Maybe I just don't understand this part. The other points all seem very sensible.
    --
    std::disclaimer<std::legalese> sig=new std::disclaimer; sig->dump(); delete sig;
    1. Re:/etc/rc.d ? by jvervloet · · Score: 1

      ... and is this undo feature a big improvement compared to e.g. regular backups?

    2. Re:/etc/rc.d ? by oliverthered · · Score: 1

      ....or journaling/transactions.

      --
      thank God the internet isn't a human right.
    3. Re:/etc/rc.d ? by Surak · · Score: 4, Insightful

      Exactly. It isn't. I think the people who wrote this are looking at Windows machines, where restarting individual subcomponents is often impossible.

      If my Samba runs into trouble and gets its poor little head confused, I can restart the Samba daemon. There's no equivalent on Windows -- if SMB-based filesharing goes down on an NT box, you're restarting the computer, there is no other choice.

    4. Re:/etc/rc.d ? by Mark+Hood · · Score: 3, Interesting

      It's different (in my view) in that you can go even lower than that... Imagine you're running a webserver, and you get 1000 hits a minute (say).

      Now say that someone manages to hang a session, because of a software problem. Eventually the same bug will hang another one, and another until you run out of resources.

      Just being able to stop the web server & restart to clear it is fine, but it is still total downtime, even if you don't need to reboot the PC.

      Imagine you could restart the troublesome session and not affect the other 999 hits that minute... That's what this is about.

      Alternatively, making a config change that requires a reboot is daft - why not apply it for all new sessions from now on? If you get to a point where people are still logged in after (say) 5 minutes you could terminate or restart their sessions, perhaps keeping the data that's not changed...

      rc.d files are a good start, but this is about going further.

      --
      Liked this comment? Why not buy me something nice
    5. Re:/etc/rc.d ? by platypus · · Score: 1

      How about killing just the worker process which hangs?

    6. Re:/etc/rc.d ? by GigsVT · · Score: 2, Insightful

      Apache sorta does this with its thread pool.

      That aside, wouldn't the proper solution be to fix the bug, rather than covering it up by treating the symptom?

      I think this ROC could only encourage buggier programs.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
    7. Re:/etc/rc.d ? by 42forty-two42 · · Score: 1

      HURD is an even better example - TCP breaking? Reboot it! Of course, you have a single-threaded filesystem, but that's okay, right?

    8. Re:/etc/rc.d ? by 42forty-two42 · · Score: 1
      Imagine you could restart the troublesome session and not affect the other 999 hits that minute...
      So delete the offending session from the database.
    9. Re:/etc/rc.d ? by Bluelive · · Score: 2

      rc.d doesn't detect failures in the daemons, it doesn't resolve dependencies between daemons, and so on. rc.d is a step in the right direction, but it isn't a solution to the whole problem set.

    10. Re:/etc/rc.d ? by the-dude-man · · Score: 1

      That's what the goal is, but Apache is also trying to keep these threads locked down, i.e. if someone tries a buffer overrun, we can't simply 'return', because they may have overwritten the return address. So kill the thread immediately, flush the stack, and don't give them a chance to get to that pointer.

      Yes, fixing the bug is the proper solution. However, the idea behind this is that you can never catch 100% of the bugs; that is the one thing you can guarantee with any piece of software. Because of this, have systems to handle the bugs and then fix them. That way you still can (and should) fix the bug, but you haven't incurred a lot of downtime in the process.

    11. Re:/etc/rc.d ? by the-dude-man · · Score: 1

      In order to isolate that session in memory (without affecting other users), you need some of the very concepts we are talking about. Also, the goal is to make it more stable for end users, so we want to kill the session only if we can't fix the bug.

    12. Re:/etc/rc.d ? by 42forty-two42 · · Score: 1

      Put all the data for a session in a few tables, link it all to the SID. Then just do a SQL query to kill it if a sanity check somewhere detects corruption.
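
      As a sketch of that suggestion, using Python's built-in `sqlite3` module and an in-memory database. The table and column names (`sessions`, `session_data`, `sid`, and so on) are made up for the example; the post doesn't specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sessions (sid TEXT PRIMARY KEY, username TEXT);
    CREATE TABLE session_data (sid TEXT, field TEXT, content TEXT);
    INSERT INTO sessions VALUES ('s1', 'alice'), ('s2', 'bob');
    INSERT INTO session_data VALUES
        ('s1', 'cart', 'book'), ('s2', 'cart', '???'), ('s2', 'page', '7');
    """
)

# Every table that holds per-session state, keyed by sid (hypothetical schema).
SESSION_TABLES = ("session_data", "sessions")


def kill_session(conn: sqlite3.Connection, sid: str) -> None:
    """Remove one session's rows everywhere, in a single transaction."""
    with conn:  # commits on success, rolls back on error
        for table in SESSION_TABLES:
            conn.execute(f"DELETE FROM {table} WHERE sid = ?", (sid,))


# Pretend a sanity check flagged session 's2' as corrupt.
kill_session(conn, "s2")
remaining = conn.execute("SELECT sid FROM sessions ORDER BY sid").fetchall()
print(remaining)
```

      Only the flagged session disappears; the other session's rows are untouched, which is the "restart one session, not the server" behaviour discussed above.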

    13. Re:/etc/rc.d ? by Mark+Hood · · Score: 1

      Exactly.

      This is what happened in the telco system I mentioned. Sure, we need to fix the bug, but when the system spots it and cleans up it also produces a report. This allows a patch to be created and loaded (on the fly, usually) which solves the bug without affecting anyone else. In the meantime, the bug only affects the people who trigger it, not everyone logged in at once!

      --
      Liked this comment? Why not buy me something nice
    14. Re:/etc/rc.d ? by GigsVT · · Score: 1

      I'm just afraid that report will get ignored if the application seems to work fine.

      Have you ever taken a look at your console window when running Gnome or KDE apps? Thousands of warnings and non-fatal errors, that the programmer apparently doesn't care about.

      --
      I've had enough abrasive sigs. Kittens are cute and fuzzy.
    15. Re:/etc/rc.d ? by Imperator · · Score: 1

      Or you restart the appropriate services, like the Server service (and possibly some others in your vaguely described situation). Come on, have you ever actually used NT?

      --

      Gates' Law: Every 18 months, the speed of software halves.
    16. Re:/etc/rc.d ? by the-dude-man · · Score: 1

      Yes... there are a lot of warnings, and some non-fatal errors... however, some of these are in X and some of these are in KDE... but the reasons for them are in the code. The warnings you get are because some of the code is portable, i.e. it's designed to compile on ppc, mips32/64 and x86. The warnings you are getting are largely a result of the code having to be hacked up a little so it will compile/run correctly on other architectures (so you can't always do things the way the compiler wants), and because coders in these projects use idioms... for example, I know the struct I am pointing at will terminate with 0, so when I do an equality test I don't make an int cast... this generates one of those warnings you see... and it's just because I used an idiom that works, but that the compiler warns about.

    17. Re:/etc/rc.d ? by Surak · · Score: 3, Interesting

      Yes. I'm typing this on last night's build of Mozilla Firebird running under Windows NT 4.0. Sure you can stop and start the workstation and/or server services. Ever done it? How stable is NT after that?

      I can tell you that on *nix restarting the Samba daemon happens seamlessly.

    18. Re:/etc/rc.d ? by Anonymous Coward · · Score: 0

      Sounds like something that could be worked out of snapshots, if your fs supports snapshots. For now, maybe have your system make a snap every 5 minutes and keep for 10, every hour and keep for 2, every day and keep for 2. Write script to automate recovery process.
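
      One possible reading of that schedule, sketched in Python: keep 5-minute snapshots for 10 minutes, hourly ones for 2 hours, and daily ones for 2 days. The units are an assumption (the post only gives bare numbers), and `TIERS` and `should_keep` are hypothetical names. The sketch only decides what to keep; taking and deleting snapshots is left to whatever the filesystem provides.

```python
from datetime import datetime, timedelta

# (interval between snapshots, how long snapshots on that interval are kept)
TIERS = [
    (timedelta(minutes=5), timedelta(minutes=10)),
    (timedelta(hours=1), timedelta(hours=2)),
    (timedelta(days=1), timedelta(days=2)),
]


def should_keep(taken: datetime, now: datetime) -> bool:
    """True if a snapshot taken at `taken` is still covered by some tier."""
    age = now - taken
    midnight = taken.replace(hour=0, minute=0, second=0, microsecond=0)
    since_midnight = taken - midnight
    for interval, keep_for in TIERS:
        on_boundary = since_midnight % interval == timedelta(0)
        if on_boundary and age <= keep_for:
            return True
    return False


now = datetime(2003, 5, 12, 12, 0)
for taken in (
    datetime(2003, 5, 12, 11, 55),
    datetime(2003, 5, 12, 11, 45),
    datetime(2003, 5, 12, 11, 0),
    datetime(2003, 5, 11, 0, 0),
):
    print(taken, "keep" if should_keep(taken, now) else "drop")
```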

    19. Re:/etc/rc.d ? by NearlyHeadless · · Score: 1
      OK, how is this different from the scripts in /etc/rc.d that can start, stop, or restart all my system services? Any daemon process needs this feature, right? It doesn't help if the machine has locked up entirely.
      If you're really interested, take a look at http://www.stanford.edu/~candea/research.html, especially JAGR: An Autonomous Self-Recovering Application Server, built on top of JBOSS.
    20. Re:/etc/rc.d ? by oliverthered · · Score: 1

      multisession CDROM's support versioning.
      If you overwrite a file then the old file is still there, windows will only display the latest version of the file though.

      --
      thank God the internet isn't a human right.
    21. Re:/etc/rc.d ? by delta407 · · Score: 4, Insightful
      There's no equivalent on Windows -- if SMB-based filesharing goes down on an NT box, you're restarting the computer, there is no other choice.
      How about restarting the "Server" service?

      Depending on how file sharing "goes down", you may need to restart a different service. Don't be ignorant: it is usually possible to fix an NT box while it's running. However, it's usually easier to reboot, and if it's not too big a deal, Windows admins usually choose to reboot rather than go in and figure out what processes they have to kick.
    22. Re:/etc/rc.d ? by Mark+Hood · · Score: 1

      Well, in the telco world alarms are monitored 24/7, and each report comes out as a minor alarm.

      Trust me, the customer who paid for it won't ignore 'warnings'!

      --
      Liked this comment? Why not buy me something nice
    23. Re:/etc/rc.d ? by darkwhite · · Score: 1

      Ever done it? How stable is NT after that?

      Many times. My 2K and XP boxes are perfectly stable after restarting every service except the essential ones (RPC and something else). Not that I've ever had SMB fail on me.

      There are many unstable and wrong things in Windows. The kernel and core services are not one of them. If you're getting problems there, something is very wrong with your setup.

    24. Re:/etc/rc.d ? by Anonymous Coward · · Score: 0

      Yeah. Samba. One service. Try restarting your ext2 module on a live system and see how far you get. You can do the equivalent on NT. Try not to cherry-pick your examples too much, OK?

    25. Re:/etc/rc.d ? by KPU · · Score: 1

      /etc/rc.d/nfs reload comes to mind. It doesn't restart the servers. It just tells them that the config files are updated and it's the server's responsibility to reread configs when it can. Or use privoxy's model which monitors config files (using select) and rereads on change without any problems (every new connection uses the new config).
      Also consider what happens when I restart network over ssh. Network stops, network starts, and my ssh session still works.
      We already have standards for this with the /etc/rc.d system. It's up to the daemons to support this.

    26. Re:/etc/rc.d ? by iabervon · · Score: 1

      The way web traffic works, having a few seconds of complete downtime isn't a big deal, especially if you're not closing the listening socket. If you have 1000 hits/minute, and you HUP your web server, and it stops accepting connections (but doesn't reject them) for 6 seconds, killing all of the serving threads after 3 seconds (to let the non-stuck threads finish serving), and restarting them after reading the new configuration, 100 hits will take up to 6 seconds longer than they should, and then you'll have 110% of your average load while it catches up (less than many bursts anyway), and probably nobody will notice, especially if they're remote clients. For web servers and such, completely resetting the "business" portion of the server is fast and not noticeable to the client. About the only thing that would be noticed would be closing the listening socket with connections open (which, by TCP rules, is supposed to prevent you from opening it again for a long time, and breaks connections and refuses new ones).
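
      The arithmetic in that paragraph can be checked quickly. The inputs are the post's own figures; the catch-up time is an extra derived number and rests on the assumption that the server works off the backlog using only the additional 10% above its average rate.

```python
from fractions import Fraction

hits_per_minute = 1000               # figure from the post
pause_seconds = 6                    # figure from the post
catchup_load = Fraction(110, 100)    # "110% of your average load"

rate = Fraction(hits_per_minute, 60)  # average hits per second
delayed_hits = rate * pause_seconds   # hits that queue up during the pause

# Assumption: the backlog drains at the extra 10% of the average rate.
catchup_seconds = delayed_hits / (rate * (catchup_load - 1))

print(f"hits delayed by the pause: {delayed_hits}")
print(f"seconds to clear the backlog: {catchup_seconds}")
```

      With these figures, 100 hits are delayed and the backlog clears in 60 seconds.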

      It would matter more for protocols where you actually have persistent connections, but such protocols are relatively rare, and are generally designed to be restartable these days (with the exception of ssh, which actually does separate out connections entirely, like you suggest).

    27. Re:/etc/rc.d ? by buttahead · · Score: 1

      a restart of samba isn't the only type of restart they are looking at. in your example, an /etc/rc.d/init.d/smb restart kills and starts nmbd and smbd. they are talking about finer control.

      think of the dependencies between processes. if my application depends on serviceA, serviceB, and serviceC, I should be able to restart each service on its own without the other services or the primary application failing. also, if a restart of the primary application is needed, we should be able to restart all the associated services or leave them running depending on their current stability.

      this can be done with an rc script, but in addition to a simple "start|stop|restart" choice, we also need "stop all services|stop service(A|B|C)|start all services|start service(A|B|C)|restart all services|restart service(A|B|C)|restart service(A|B|C) if service(A|B|C) is being restarted".
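
      One way to sketch the dependency bookkeeping behind that last option, using `graphlib` from the Python standard library (3.9+). The dependency map, including serviceB depending on serviceA, is invented for the example, and `restart_plan` is a hypothetical helper, not part of any rc system.

```python
from graphlib import TopologicalSorter

# service -> services it depends on (hypothetical layout)
DEPENDS_ON = {
    "app": {"serviceA", "serviceB", "serviceC"},
    "serviceA": set(),
    "serviceB": {"serviceA"},
    "serviceC": set(),
}


def dependents_of(service: str, depends_on: dict) -> set:
    """Everything that directly or indirectly depends on `service`."""
    found = set()
    stack = [service]
    while stack:
        current = stack.pop()
        for name, deps in depends_on.items():
            if current in deps and name not in found:
                found.add(name)
                stack.append(name)
    return found


def restart_plan(service: str, depends_on: dict) -> list:
    """Order in which to restart `service` and whatever sits on top of it."""
    affected = dependents_of(service, depends_on) | {service}
    # Keep only the edges between affected services, then sort them.
    subgraph = {name: depends_on[name] & affected for name in affected}
    return list(TopologicalSorter(subgraph).static_order())


print(restart_plan("serviceA", DEPENDS_ON))
print(restart_plan("serviceC", DEPENDS_ON))
```

      Restarting serviceC touches only serviceC and the app, while serviceA and serviceB keep running; that is the finer-grained control the post is asking for.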

    28. Re:/etc/rc.d ? by Imperator · · Score: 1

      About as stable as it was before that. :)

      Seriously though, NT is a bad example to use when discussing reliable software. Samba is indeed a better example, though even then I'd comment that if one part of it crashes, so does much of it--last I checked it was only a few daemons. If Samba's browser crashes, it should keep serving files. That was the idea of the article I think.

      --

      Gates' Law: Every 18 months, the speed of software halves.
    29. Re:/etc/rc.d ? by Surak · · Score: 1

      this can be done with an rc script, but in addition to a simple "start|stop|restart" choice, we also need "stop all services|stop service(A|B|C)|start all services

      I see what you're saying. Actually, Gentoo Linux does this to some extent. It tracks service dependencies through rc scripts, so that if you restart, say net.eth0, it will restart all of the services that depend on net.eth0 (such as samba, nfs, bind, ssh, etc.), rather than leaving some of those services in a confused state. (Of course, this has the annoying side effect of bumping you off your ssh connection when you do it from remote, but c'est la vie.)

    30. Re:/etc/rc.d ? by dubl-u · · Score: 1

      That aside, wouldn't the proper solution be to fix the bug, rather than covering it up by treating the symptom?

      Wouldn't the proper solution be to do both? I do a lot of automated testing of my code, both during development and as monitoring of existing systems. I find this improves quality, as it gives me multiple ways to detect and isolate bugs.

      I think the trick is to make the failure-recovery actions visible. For example, I once was having problems with Apache seizing up on a high-volume site. I wrote something that hit the site every few seconds; when the lockups happened Apache would be immediately restarted and the team would get email.

      Since we were getting notice of failures, it gave us good data about what was going on, and prompted us to fix the problem. Better, because we knew we had a safety net, we were much braver in our experimentation, letting us learn more quickly.
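
      A minimal sketch of that kind of safety net, with the recovery action made visible. This is not the poster's script: `check_once` and its three callbacks are hypothetical, and the probe, restart, and notification are passed in as plain functions so the logic can be run without a real web server or mail setup.

```python
from typing import Callable


def check_once(
    probe: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
) -> bool:
    """Probe the site once; on failure, restart it and tell the team."""
    try:
        healthy = probe()
        reason = "probe reported failure"
    except Exception as exc:  # a timeout or refused connection also counts
        healthy = False
        reason = f"probe raised {exc!r}"
    if not healthy:
        restart()
        notify(f"site restarted: {reason}")
    return healthy


# Stand-ins: a real probe would fetch a page with a timeout, a real
# restart would call the init script, and notify would send email.
events = []
responses = iter([True, False, True])
results = [
    check_once(
        probe=lambda: next(responses),
        restart=lambda: events.append("restart"),
        notify=events.append,
    )
    for _ in range(3)
]
print(results, events)
```

      In production this would run in a loop every few seconds; the important part is that every automatic restart leaves a record somebody will read.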

    31. Re:/etc/rc.d ? by Anonymous Coward · · Score: 0

      Windows admins usually choose to reboot rather than go in and figure out what processes they have to kick.


      This is culturally as well as technically different from Unix system administration. There's little incentive to investigate problems in Windows, because in general you can't fix them even if you understand them.

      Case in point: I found a bug in RedHat Linux 9 while installing it last week. It relates to how output returned by ethtool is parsed by one of the network scripts. It was an annoyance to investigate, of course, but eminently worth figuring out.

      The point is that now it's fixed. And, since I got the distro for free, I'm more than happy to help get it fixed for everyone.
  6. hmmmmm by Shishio · · Score: 5, Funny

    the disappearance of a large Internet site.

    Yeah, I wonder what could ever bring down a large Internet site?
    Ahem.

    --
    Twelve fingers or one, its how you play. ~Gattaca (Vincent)
    1. Re:hmmmmm by Anonymous Coward · · Score: 0

      btw, your web-browser micro-rebooted 7 times while loading this page.

    2. Re:hmmmmm by Anonymous Coward · · Score: 0

      You slashdotted slashdot!!!

  7. test errors by paulmew · · Score: 3, Funny

    "Last, computer scientists should develop the ability to inject test errors" Ah, so that explains those BSODs. It's not a fault, it's a feature....

    1. Re:test errors by Sanga · · Score: 1

      It is the mice's plan. They are testing us by creating a maze of windows and the BSOD is a dead end :-)

      42

  8. ROC detail by rleyton · · Score: 5, Informative

    For a much better, and more detailed, discussion of Recovery Oriented Computing, you're better off visiting the ROC group at Berkeley, specifically David Patterson's writings.

    --
    ooooooh! What does this button do? - DeeDee, Dexters Lab.
  9. Computer.... by Viceice · · Score: 2, Funny

    Heal thy-self!

    --
    Sometimes I wish I was a plumber, then I'd know how to deal with other people's shit.
    1. Re:Computer.... by Anonymous Coward · · Score: 0

      Heal thy-self!

      Ahh... Pixie Dust!

      Use it regularly, and servers solve their own problems.

    2. Re:Computer.... by timmie... · · Score: 1

      Heal thy-self!

      Dammit Jim! i'm a computer not a doctor!

  10. it will not work now by KingRamsis · · Score: 4, Insightful

    Computers still rely on the original John von Neumann architecture they are not redundant in anyway, there will be always a single point of failure for ever, no matter what you hear about RAID, redundant power suppliers etc.. etc.. basically the self-healing system is based on the same concept, compare that to a natural thing like the nervous system of humans now that is redundant and self healing, a fly has more wires in it's brain than all of the internet nodes, cut your finger and after a couple of days a fully automated autonomous transparent healing system will fix it, if we ever need to create self healing computers we need to radically change what is a computer, we need to break from the John von Neumann not because anything wrong with it but because it is reaching it's limits quickly, we need truly parallel autonomous computers with replicated capacity that increase linearly by adding more hardware, and software paradigms that take advantage of that, try make a self-healing self-fixing computer today and you will end up with a every complicated piece of software that will fail in real life.

    1. Re:it will not work now by torpor · · Score: 2, Interesting

      So what are some of the other paradigms which might be proferred instead of von Neumann?

      My take is that for as long as CPU design is instruction-oriented instead of time-oriented, we won't be able to have truly trusty 'self-repairable' computing.

      Give every single datatype in the system its own tightly-coupled timestamp as part of its inherent existence, and then we might be getting somewhere ... the biggest problems with existing architectures for self-repair are in the area of keeping track of one thing: time.

      Make time a fundamental to the system, not just an abstract datatype among all other datatypes, and we might see some interesting changes...

      --
      ; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
    2. Re:it will not work now by Anonymous Coward · · Score: 0

      I think you need to learn the concept of the full-stop. That has to be the least readable sentence I have seen in a looooooong time!

    3. Re:it will not work now by the-dude-man · · Score: 1

      Well yes and no.

      I don't think ROC will ever yield servers that can heal themselves... rather, it will yield servers that are able to take corrective measures for a wide array of problems... there really is no way to make a completely redundant system. Well, there may be, but as you said, we are nowhere near there yet.

      ROC may someday evolve into that. For the moment, however, it's really a constantly expanding range of exceptional situations that a system can handle by design, using structures such as exceptions and the like.

    4. Re:it will not work now by KingRamsis · · Score: 0

      well instead of commenting on my english just extract the knowledge in the post, english is not my native language.

    5. Re:it will not work now by the_duke_of_hazzard · · Score: 0

      Dude, please self-repair your grammar. I could only follow what you were saying with some effort.

    6. Re:it will not work now by KingRamsis · · Score: 2, Interesting

      well the man who answers this question will certainly become the von Neumann of the century, you need to do some serious out of the box thinking, first you throw away the concept of the digital computer as you know it, personally I think there will be a split in computer science, there will be generally two computer types: the "classical" von Neumann and a new and different type of computer, the classical computer will be useful as a controller of some sort for the newer one, it is difficult to come up with the working principle of that computer, let me elaborate: it is like a missing piece of the puzzle, you know what it looks like but you are not certain what exactly will be printed on it, but I can summarize its features:
      1. It must be data oriented with no concept of instructions (just routing information), data flows in the system and transformed in a non-linear way, and the output will be all possible computations doable by the transformations.
      2. It must be based on a fully interconnected grid of very simple processing elements.
      3. The performance of said computer will be measured in terms of bandwidth not the usual MIPS. As you can see you will need a classical type computer to operate the described computer above so it will not totally replace it.
      I believe that we should look into nature more closely, we stole the design of the plane straight from birds' wings, and the helicopter from the dragonfly, and there is a lot more that was inspired by mother nature, one of the relevant examples that always fascinated me was the fly brain, each eye is a processor on its own, they work independently, conveying information to a more concise layer and so on, even human vision is based on a similar concept of retina cells, there is no "pixel" concept, each layer that processes vision emphasizes one aspect of vision like texture, color, outline, shadowing, movement, etc. Finally, will such a computer be useful? Can we just write a plain spreadsheet on it and send it by email to someone and then resume our saved DOOM game?
      well it is possible but we need also to redefine what we can do with a computer because the classical von Neumann computer that we are stuck with for the last half a century certainly limited our imagination on what can be done with a computer.

    7. Re:it will not work now by swillden · · Score: 1

      english is not my native language.

      Your English is fine. You just need to learn to break it into sentence-sized chunks.

      just extract the knowledge in the post

      Sorry, not interested. I have better things to do. If you want people to read what you write, you should do your best to make it easy for them. Otherwise they'll spend their time more efficiently, reading the ideas of someone who cares enough to make themselves understandable.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    8. Re:it will not work now by Anonymous Coward · · Score: 0

      You speak a language without full-stops? Inuit perhaps?

    9. Re:it will not work now by poot_rootbeer · · Score: 1

      Computers still rely on the original John von Neumann architecture they are not redundant in anyway, there will be always a single point of failure for ever, no matter what you hear about RAID, redundant power suppliers etc.. etc.. basically the self-healing system is based on the same concept, compare that to a natural thing like the nervous system of humans now that is redundant and self healing, a fly has more wires in it's brain than all of the internet nodes, cut your finger and after a couple of days a fully automated autonomous transparent healing system will fix it, if we ever need to create self healing computers we need to radically change what is a computer, we need to break from the John von Neumann not because anything wrong with it but because it is reaching it's limits quickly, we need truly parallel autonomous computers with replicated capacity that increase linearly by adding more hardware, and software paradigms that take advantage of that, try make a self-healing self-fixing computer today and you will end up with a every complicated piece of software that will fail in real life.

      That was the longest sentence I've ever read, with the exception of Finnegans Wake.

    10. Re:it will not work now by the_duke_of_hazzard · · Score: 0

      Not read Ulysses then?

  11. Various levels of rebooting... by jkrise · · Score: 4, Funny

    Micro-rebooting: Restart service.
    Mini-rebooting: Restart Windows 98
    Rebooting : Switch off/on power
    Macro-rebooting: BSOD.
    Mega-rebooting: BSOD--> System crash--> reload OS from Recovery CD--> Reinstall apps --> reinstall screen savers --> reinstall Service Packs --> Say your prayers --> Reboot ---> Curse --> Repeat.

    --
    If you keep throwing chairs, one day you'll break windows....
    1. Re:Various levels of rebooting... by Fluid+Truth · · Score: 1

      The four Rs of computing (at least, lately):

      Retry
      Reboot
      Reinstall
      Repeat

      ;-)

      --
      Apparently, of the rich, by the rich, for the rich.
  12. !RTFA, but by the_real_tigga · · Score: 2, Interesting

    I wonder if this [PDF!] cool new feature will help there.

    Sounds a lot like "micro-rebooting" to me...

    --
    my .sig is better than yours.
    1. Re:!RTFA, but by Anonymous Coward · · Score: 0

      It seems to me that this (kexec) is an attempt at circular features: the operating system and BIOS use time to autodetect all the components in the system, then kexec "speeds" rebooting by eliminating the discovery step in the BIOS.

      Surely the real answer is to (*gasp*) pre-configure the machine!!! I know that it's not fashionable in the current Plug'N'Play world, but we are talking about a system whose hardware is pretty much locked down anyway... Why re-discover all the hardware every time you boot up?

      The old argument about POST routines (which are no longer called that ... wonder why?) being required to detect faulty hardware is no longer reasonable... I mean how many modern systems actually do a memory check anyway? "Enable Quick Boot" or the equivalent in your BIOS killed it off years ago...

      Rather than kexec to skip the autodetection, surely it is cleaner to simply document the system configuration, and then use that list to initialize only the components which need to be initialized, without the overhead of re-discovering them.

      Or maybe I'm just too old fashioned...

      --
      Posting Anonymously because I'm too lazy to log in.

  13. uunnschulding sme.. by danalien · · Score: 3, Insightful

    but if end-users got a better computer education, I think most of the problems would be fixed.

    I find it quite funny that the "ground course in computers" courses we have (here in Sweden) only educate people in how to use Word/Excel/PowerPoint/etc... nothing _fundamental_ about how to operate a computer. It's like learning how to use the cigarette lighter in your car, and declaring yourself someone who can drive a car. And now you want a quick fix for your incompetence in driving "the car".

    --
    I don't claim I know more than I know, and if you know you know more than I know, then by all means, let me know.
    1. Re:uunnschulding sme.. by Anonymous Coward · · Score: 0
      I find it quite funny that the "ground course in computers" courses we have (here in Sweden) only educate people in how to use Word/Excel/PowerPoint/etc... nothing _fundamental_ about how to operate a computer. It's like learning how to use the cigarette lighter in your car, and declaring yourself someone who can drive a car.
      Nonsense. Most people only need to learn Word, Excel, Powerpoint and their company's in-house stuff, just like most car owners only need to learn to drive.

      With cars, the in-depth course is learning to be a mechanic or how to design engines. There's a similar thing for computers: being a sysadmin or developer. 99.99% of people don't need to know that stuff.
    2. Re:uunnschulding sme.. by Lord+Kholdan · · Score: 1

      Let's say there are a billion computer users today and it'd take 100 hours on average to make them at least somewhat computer savvy. Let's say teaching them costs $5/hour and they'd earn $10/hour if they were working instead of studying. It'd cost $1,500 billion to train them!

      I think it'd be much cheaper to just write stable software, even in the really long run.
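
For anyone checking the estimate above, a small sketch of the arithmetic. Every figure is the comment's own assumption (user count, hours, hourly rates), not measured data.

```python
# Figures taken from the comment above; none of them are measured data.
users = 1_000_000_000          # "a billion computer users"
hours_per_user = 100           # training time per user
teaching_cost_per_hour = 5     # dollars paid for instruction
lost_wages_per_hour = 10       # dollars not earned while studying

# Each training hour costs the instruction fee plus the wages given up.
total = users * hours_per_user * (teaching_cost_per_hour + lost_wages_per_hour)
print(f"${total / 1e9:,.0f} billion")  # prints "$1,500 billion"
```

The total scales linearly with each input, so halving the training time or the user count halves the bill.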

  14. Re:No clue by Jedi+Alec · · Score: 1

    Theoretically, I don't see why you shouldn't be able to do it in hardware, if for example an entire OS has been written to report to some piece of hardware what processes it has running, and each of these processes needs to report to that piece of hardware on its status. If a report comes in concerning problems, or the report fails to come in altogether, the chip then takes action to remedy the situation, by for example restarting that particular process.

    Disclaimer: all uses of the word process in this post are due to a total lack of knowledge concerning *nix and more than is good for me with 2K/XP.
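
The reporting scheme described above can be sketched in software. This is only an illustration under stated assumptions: the class name `HeartbeatMonitor`, the process names and the 5-second timeout are all made up here, and a real version would live in hardware or in the kernel rather than in a Python object.

```python
class HeartbeatMonitor:
    """Tracks the last status report per process and flags ones needing a restart."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds allowed between reports
        self.last_report = {}       # process name -> (timestamp, healthy flag)

    def report(self, name, now, healthy=True):
        """Called by a process to report its status at time `now`."""
        self.last_report[name] = (now, healthy)

    def needs_restart(self, now):
        """Return processes that reported a problem or stopped reporting."""
        return sorted(
            name
            for name, (seen, healthy) in self.last_report.items()
            if not healthy or now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=5)
monitor.report("mailer", now=0)
monitor.report("webserver", now=0)
monitor.report("webserver", now=4)                 # still reporting on time
monitor.report("database", now=4, healthy=False)   # reports a problem
stale = monitor.needs_restart(now=6)
print(stale)  # ['database', 'mailer']
```

Both failure modes from the comment show up: "database" reported a problem, and "mailer" simply stopped reporting.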

    --

    People replying to my sig annoy me. That's why I change it all the time.
  15. Compulsory M$ joke by Rosco+P.+Coltrane · · Score: 3, Funny
    Third, programmers ought to build systems that support an "undo" function (similar to those in word-processing programs), so operators can correct their mistakes. Last, computer scientists should develop the ability to inject test errors; these would permit the evaluation of system behavior and assist in operator training.

    [WARNING]
    You have installed Microsoft[tm] Windows[tm]. Would you like to undo your mistake, or are you simply injecting test errors on your system ?

    [Undo] [Continue testing]

    --
    "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
  16. Hmm. by mfh · · Score: 4, Insightful
    Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex

    I think that's a big fat lie.

    --
    The dangers of knowledge trigger emotional distress in human beings.
    1. Re:Hmm. by Anonymous Coward · · Score: 0

      I think that's a big fat lie.

      True. A computer does pretty much the same thing it did 20 years ago but on a bigger, grander scale. Most of that 10,000-fold speed increase is sucked up by software - making operating them easier, not by what the user does with the machine.

    2. Re:Hmm. by Anonymous Coward · · Score: 0

      My old TI99/4A needs specific arbitrary commands to be entered via the keyboard to be operated. It has very little concept of "file system", and nothing approaching the desktop-metaphor of my PowerBook's OS X GUI. It also requires extensive knowledge of either BASIC or assembler languages if you want to use it for anything other than a primitive arcade gaming machine, while my PowerBook's software, videogames and administration tools all follow the same interface guidelines.

      And even if you consider the command line, my PowerBook tries to correct the input and auto-completes filepaths and command names, while my TI99/4A shits itself or returns obscure error codes if I misspell anything.

    3. Re:Hmm. by Technician · · Score: 1

      Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex

      Let's see. IBM PC XT 4.7 Megahertz to Pentium 4 at 3 Gigahertz. (3,000 Megahertz) It seems a little shy of 10,000 times unless you factor going from an 8 bit processor to a 32 bit processor. That's 4X the bandwidth. I don't think they missed the mark by much. 10,000 times or 12,000 times, what's the diff?

      --
      The truth shall set you free!
    4. Re:Hmm. by vofka · · Score: 1

      But even there, you are forgetting to factor in Multiple Pipelining (At best, the P4 can 'complete' 9 Instructions per Cycle, though it doesn't usually get that good), and shorter Instruction Execution times, for example, a 32-bit Relative CALL on a 386 takes a minimum of 7 Cycles, whereas on a Pentium Class system it takes only One Cycle...

      So, it's a lot more complex than just comparing clock-for-clock, or even clock and bus-width... 10,000 times is probably a very low estimate of how much power has increased in 20 years, just for x86 alone - and that doesn't factor in other architectures such as SPARC or PPC!

      --
      Disclaimer: I meant what I thought, not what I wrote! What? You can't read my Mind? Oh dear!
    5. Re:Hmm. by gl4ss · · Score: 1

      hmm... is it just me but does windows/beos/whatever look more complex to operate than ms dos 1.0 and dos based programs in general?

      --
      world was created 5 seconds before this post as it is.
    6. Re:Hmm. by justin_speers · · Score: 1

      I agree with the original post actually, I think you misinterpreted it.

      Computers may be (approximately) 10,000 times faster, but is operating them really more complex?

    7. Re:Hmm. by Anonymous Coward · · Score: 0

      I think that's a big fat lie.

      Nonsense! It's a small, skinny lie.

    8. Re:Hmm. by default+luser · · Score: 1

      The processor architectures themselves aren't 5 orders of magnitude different.

      IPC of the 8088: ~1/12 ( somewhere around 300-400k IPS for the 4.7MHz version )

      Typically agreed upon IPC for the Pentium IV: 6

      Today's architectures, despite pipelining, multiple ALUs, fast caching and all the other tricks, are not even 100 times faster than their origins. For reference, we didn't even top 1.0 IPC until the 486.

      It's the combination of speed-tuned architectures plus the small processes that create today's super fast processors. You could not easily clock out an 8088 at multiple GHz, and probably couldn't even enter the hundreds of MHz range without some heavy modification.
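
A rough sketch of the numbers in the comment above, separating the architectural gain (IPC) from the clock gain. The IPC figures are the comment's; the 4.7 MHz and 3 GHz clocks are the ones quoted elsewhere in this thread, so the combined figure is only as good as those assumptions.

```python
# IPC figures from the comment above; clock speeds as quoted elsewhere in the thread.
ipc_8088 = 1 / 12          # roughly one instruction every 12 cycles
ipc_p4 = 6                 # "typically agreed upon" figure from the comment
clock_8088_hz = 4.7e6
clock_p4_hz = 3.0e9

ips_8088 = ipc_8088 * clock_8088_hz          # instructions per second
ipc_ratio = ipc_p4 / ipc_8088                # gain from architecture alone
clock_ratio = clock_p4_hz / clock_8088_hz    # gain from clock speed alone
total_ratio = ipc_ratio * clock_ratio        # both effects combined

print(f"8088: about {ips_8088:,.0f} instructions/s")
print(f"IPC ratio {ipc_ratio:.0f}x, clock ratio {clock_ratio:.0f}x, "
      f"combined {total_ratio:,.0f}x")
```

On these inputs the architecture accounts for well under 100x, as the comment says, and most of the remaining gain comes from the clock.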

      --

      Man is the animal that laughs.
      And occasionally whores for Karma.

    9. Re:Hmm. by mr3038 · · Score: 2, Insightful
      IBM PC XT 4.7 Megahertz to Pentium 4 at 3 Gigahertz. (3,000 Megahertz) It seems a little shy of 10,000 times unless you factor going from an 8 bit processor to a 32 bit processor.

      You don't need to go that far back in history to see a really big difference. Just compare the FPU speed of i287 and Athlon. i287 took a minimum of 90 cycles for FMUL, a minimum of 70 cycles for FADD and at least 30 cycles for a floating point load. Compare that to Athlon that can do two loads, FMUL and FADD every cycle. So, something that took i287 at least 90+70+2*30 = 220 cycles, Athlon can do every clock cycle. In addition to that, Athlon is running at 2GHz instead of 10MHz. So one could argue that current Athlon is 2000/10*220 = 44000 times faster than about a 20 year old FPU (when was 287 released anyway?). In addition to that, we have MMX, SSE and SSE2 that can further boost best case scenarios but I think it's safe to say that current x86 CPUs are at least 10000 times faster than 20 year old ones. Not to mention more advanced caches -- not too many years ago L2 cache was external and optional. Of course, if you compare a 20 year old Cray and a CPU inside a modern portable device the difference is much smaller.

      --
      _________________________
      Spelling and grammar mistakes left as an exercise for the reader.
    10. Re:Hmm. by tricorn · · Score: 1

      Don't forget the speed advantages in a system (not CPU, which is only part of a system) of vast amounts of much faster memory, both volatile and persistent. Twenty years ago, the microcomputer world was in the process of moving from 64K of RAM and 160-320K of slow floppy disk storage, to 512KB-1MB of RAM and 5-10MB hard drives. It is amazing to me that now I can read up the contents of my entire first hard disk drive into a small portion of RAM in 1-2 seconds, the entire amount of RAM on that machine was less than 1% of what I have now, and I can emulate it at least 50 times faster than it actually ran, and that's on a 2-year-old laptop. In 20 years, will we have computers where 1000GB of RAM is normal, I can read persistent storage at a rate of 50-100GB/sec, and I can emulate a current CPU as if it ran at 50GHz? You'll be able to do real-time stereo-vision HD rendering of things that currently take several minutes per frame. Let's not forget communications, either - 300-1200bps was the norm, 9600bps was about the best you could do over the phone lines.

      Computer systems are a lot more complicated today than they were 20 years ago. However, a lot of that complexity is arguably necessary to support the increased capabilities (that were simply not possible with the hardware limitations of 20 years ago compared to today). On the other hand, a lot of the complexity, bloat and poor performance is attributable to poor development models. It may make economic sense for a development group to use tools and methodologies that produce bloated slow buggy complex programs, but at least some of that is due to the end consumer not knowing how much it is costing them, or at any rate not being able to do anything about it.

  17. Write scripts for it... by ndogg · · Score: 4, Insightful

    and cron them in.

    This concept isn't particularly new. It's easy to write a script that will check a particular piece of the system by running some sort of diagnostic command (e.g. netstat), parse the output, and make sure everything looks normal. If something doesn't look normal, just stop the process and restart, or whatever you need to do to get some service back up and running, or secured, or whatever is needed to make the system normal again.

    Make sure that script is part of a crontab that's run somewhat frequently, and things should recover on their own as soon as they fail (well, within the time-frame that you have the script running within your crontab.)
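
A minimal sketch of the kind of check described above, in Python rather than shell. It assumes the column layout printed by Linux net-tools `netstat -ltn`. The names `is_listening` and `check_and_restart` and the restart command are hypothetical and must be supplied for the service in question.

```python
import subprocess

def is_listening(netstat_output, port):
    """Return True if any line shows a socket in LISTEN state on `port`."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) >= 6 and fields[-1] == "LISTEN" and fields[3].endswith(f":{port}"):
            return True
    return False

def check_and_restart(port, restart_cmd, netstat_cmd=("netstat", "-ltn")):
    """Run the diagnostic, and run `restart_cmd` if the port is not listening."""
    result = subprocess.run(netstat_cmd, capture_output=True, text=True)
    if is_listening(result.stdout, port):
        return "ok"
    subprocess.run(restart_cmd)   # restart_cmd is whatever brings the service back
    return "restarted"

# Parsing demonstrated on canned output, so no real netstat is needed here.
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:5432          0.0.0.0:*               LISTEN
"""
print(is_listening(sample, 22))   # True
print(is_listening(sample, 80))   # False
```

A crontab entry such as `*/5 * * * * /usr/local/bin/check_service.py` (path made up) would run it every five minutes, which is also the worst-case delay before a failure is noticed.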

    "Undo" feature? That's what backups are for.

    Of course, the article was thinking that this would be built into the software, but I don't think that is that much better of a solution. In fact, I would say that that would make things more complicated than anything.

    --
    // file: mice.h
    #include "frickin_lasers.h"
    1. Re:Write scripts for it... by the-dude-man · · Score: 1

      You're quite right....most large systems are maintained by shell scripts and the crontab

      However, this is inherently limited to finding the errors, some errors (ie /var/run has incorrect permissions) can't be solved by restarting the service, this concept is about identifying the problem and then taking correct measures.

      What you described is a primitive version of this, it will handle most of the *dumb* errors, not persistent errors that could be outside of the program's control. ROC is more/less an evolution of what you described

    2. Re:Write scripts for it... by poot_rootbeer · · Score: 1

      and cron them in.

      What happens if the cron daemon dies?

  18. Go figure... by qat · · Score: 1

    Sounds like a great way to lure in customers for another product. What happens when part of this ROC fucks up? No coding is perfect. Also, would it be cost effective? I doubt it...

    --
    Pls No Negative Modding!
  19. Self Repairing gone bad by UndercoverBrotha · · Score: 2, Insightful

    Windows Installer was an effort in self "repairing" or "healing", whatever you would like to call it. However, am I the only one who has seen errors like "Please insert Microsoft Office XP CD.." blah blah, when nothing is wrong, and you have to cancel out of it just to use something totally unrelated, like say Excel or Word?

    The Office 2000 self-repairing installation is another notorious one: if you remove something, the installer thinks it has been removed in error and tries to reinstall it...

    Oh well, let's wish the recovery-oriented computing guys luck...

    --
    Solid!
    1. Re:Self Repairing gone bad by swankypimp · · Score: 1
      This week I took a look at my sister's chronically gimpy machine. It had Gateway's "GoBack" software on it, which lets the OS return to a bootable state if it gets completely hosed (the "system restore" option on newer versions of Windows is similar, but GoBack loads right after the BIOS POST, before the machine tries to boot the OS).

      The problem is that GoBack interprets easily recoverable errors as catastrophic. The machine didn't shutdown properly? GoBack to previously saved state. BSOD lockup? GoBack to previously saved state. The end result was that files were written to the hard disk but the system didn't keep track of them. The files were still there, and I could access them from a DOS prompt, but Windows Explorer had no clue where they were. The same thing happened with recently-installed programs, which utterly cocked things up. Windows only "knew" about them subconsciously, or something.

      Of course, this (and the installers you mentioned) are cheap consumer grade products, and the server grade ones these people are researching would be much better. Because GoBack exists mainly as a "why buy Gateway over Dell" marketing tool, while real ROC would exist on mission critical servers. I just felt like ranting about Gateway GoBack for a while. (Finally I just uninstalled it rather than troubleshoot the thing. I still have some "hidden" directories, though. If I ever need a place to hide my porno stash, now I have an option. Shrug.)

      --

      --All your stolen base are belong to Rickey Henderson
    2. Re:Self Repairing gone bad by Unregistered · · Score: 1

      I would just like to point out that i do rarely lose xp/2k systems (except from the ever-so-satisfying mkreiserfs /dev/hda3). While the recovery is far from perfect it does tend to leave the system at least bootable.

  20. Second paragraph by NewbieProgrammerMan · · Score: 4, Insightful

    The second paragraph of the "long and dense article" strikes me as hyperbole. I haven't noticed that my computer's "operation has become brittle and unreliable" or that it "crash[es] or freeze[s] up regularly." I have not experienced the "annual outlays for maintenance, repairs and operations" that "far exceed total hardware and software costs, for both individuals and corporations."

    Since this is /. I feel compelled to say this: "Gee, sounds like these guys are Windows users." Haha. But, to be fair, I have to say that - in my experience, at least - Windows2000 has been pretty stable both at home and at work. My computers seem to me to have become more stable and reliable over the years.

    But maybe my computers have become more stable because I learned to not tweak on them all the time. As long as my system works, I leave it the hell alone. I don't install the "latest and greatest M$ service pack" (or Linux kernel, for that matter) unless it fixes a bug or security vulnerability that actually affects me. I don't download and install every cutesy program I see. My computer is a tool I need to do my job - and since I've started treating it as such, it seems to work pretty damn well.

    --
    [b.belong('us') for b in bases if b.owner() == 'you']
    1. Re:Second paragraph by OeLeWaPpErKe · · Score: 0

      The problem with windows instability is not that windows is badly written (well, general opinion is that it is, and to a degree, that might be correct, or not). It's an inherent problem in commercial software.

      first symptom :
      Ever notice how every program under windows installs itself in a "marketing oriented" way? (e.g. Paint Shop Pro in a subdir "Jasc" in the start menu). And PSP is a very well-behaved program, yet it clobbers up the start menu unnecessarily.

      second symptom:
      Now let's look at kazaa. It installs hooks in a lot of systems pointing back at itself. It disables a few features, enables some others, to prevent those hooks from being too easily detected. Now obviously, if Kazaa fails, all hooked services (IE most notably) take a hit.

      third symptom:
      Software on windows does not cooperate with each other (I mean truly different software, ie non-ms soft with non-ms soft from another company). On the contrary, if a certain program does something to another program, it generally could be called sabotage (kazaa, spyware, "helpful" toolbars in ie, disabling anti-virus soft, blocking debuggers, sabotaging the crash handler, taking over the entire screen, changing screen resolution without asking, ...). The developers of both programs don't know each other, and don't care in the slightest.

      Compare the situation on linux. You still have the situation that both developers of 2 programs interacting don't know each other and (in most cases) don't care. But, since it's open source, they make a patch, and send it in. Those patches are generally accepted, and the software thus learns to cooperate (first with kde/gnome then with other programs in its class, or vertical integration (eg procmail knowing about different mailservers and acting upon that)).

      Windows (and macos and ) will continue to suck in this manner, because (I think) there is no economically viable solution for this (except perhaps having all software made by a single software company, in essence creating an open-source culture inside a huge abusive monopoly). Companies will not give out source code freely, and they will not adapt their program to 15 other programs.

    2. Re:Second paragraph by Reziac · · Score: 1

      Same here. I beat on Windows and various apps til they're all how I want them, then leave 'em the hell alone. I don't always upgrade to the latest and greatest, nor patch just because the patch exists (the best wisdom is to apply only those patches and SPs that address a problem YOU are experiencing). And I don't install random programs on a work box. I've had hardworking Windows setups go as long as 7 years without a single reinstall, and no loss of good behaviour either. Hell, it's rare that I even restart Win9* more than once or twice a month, and half the time it's because I want to use pure DOS. And I see BSODs seldom to never.

      I'll do the same with linux when/if I ever get more involved with it than one test box. I don't need to experiment with every new kernel subversion that comes down the pipe.

      So long as whatever's on the machine works, does the job required of it, and is well-behaved, why screw around with it, and maybe screw it up?

      BTW, Intuit is on my shitlist not because of their activation BS, but because their installer forcibly upgraded IE and thereby broke my setup. I didn't give anyone permission to muck about with system files, but they did anyway. And thereby lost a good customer (and gained a tireless critic :)

      --
      ~REZ~ #43301. Who'd fake being me anyway?
    3. Re:Second paragraph by Anonymous Coward · · Score: 0

      I've been doing large scale system administration for over 20 years, and I have to express serious reservations about the second paragraph as well.

      On the contrary, in my experience individual computer systems have become significantly less brittle and unreliable over the period, this despite a corresponding increase in hardware and software complexity. We used to routinely find compiler bugs, for example.

      So, in some ways it's astonishing how much the quality of these systems has improved. That speaks to a sustained effort by countless designers and implementors. In my opinion, society does not acknowledge this nearly enough, and likewise seems to overlook the difficulty of the problem. I think all computing professionals suffer when that happens.

      In that sense, the article makes a valid point. If we can forgive the hyperbole, we might concede that computer systems as we presently use them exhibit more brittleness and unpredictability than in the past. My own experience supports this observation also! I see an epidemic of increasingly complex sites being administered to decreasing standards of consistency. Indeed, most sites report that they do not know how their systems are configured.

      So, while I don't necessarily buy the whole ROC paradigm, I have to point out that it does have some merit. For example, in order for a subsystem to be meaningfully recoverable, it must have some model of how it should be configured. Merely to acknowledge the need for such a model would be a huge practical step forward. ROC seems also to reinforce the concept of modularity, which we see very poorly expressed in some operating systems.

  21. I already do this with Linux... by jkrise · · Score: 2, Interesting

    Here's the strategy:
    1. Every system will have a spare 2GB filesystem partition, where I copy all the files of the 'root' filesystem, after successful installation, drivers, personalised settings, blah blah.
    2. Every day, during shutdown, users are prompted to 'copy' changed files to this 'backup OS partition'. A script handles this - only changed files are updated.
    3. After the first installation, a copy of the installed version is put onto a CD.
    4. On a server with 4*120GB IDE disks, I've got "data" (home dirs) of about 200 systems in the network - updated once a quarter.

    Now, for self-repairing:
    1. If user messes up with settings, kernel etc., boot tomsrtbt, run a script to recopy changed files back to root filesystem -> restart. (20 mins)
    2. If disk drive crashes, install from CD of step 3, and restore data from server.(40 mins)
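
The "only changed files are updated" step above might look something like the following. This is a sketch, not the poster's actual script: the function name `sync_changed` and the size-or-mtime test for "changed" are assumptions, and the demo uses temporary directories in place of real partitions.

```python
import os
import shutil
import tempfile

def sync_changed(src_root, backup_root):
    """Copy files from src_root into backup_root when missing or changed.

    A file counts as changed if its size differs or its modification time
    is newer than the backup copy. Returns the relative paths copied.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(backup_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if os.path.exists(dst):
                s, d = os.stat(src), os.stat(dst)
                if s.st_size == d.st_size and s.st_mtime <= d.st_mtime:
                    continue  # backup is already current
            shutil.copy2(src, dst)  # copy2 preserves the modification time
            copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return sorted(copied)

# Demo on throwaway directories standing in for the root and backup partitions.
with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as backup:
    with open(os.path.join(root, "fstab"), "w") as f:
        f.write("v1")
    first = sync_changed(root, backup)     # everything is new
    second = sync_changed(root, backup)    # nothing changed since last run
    with open(os.path.join(root, "fstab"), "w") as f:
        f.write("version 2")
    third = sync_changed(root, backup)     # size changed, so it is copied again
print(first, second, third)
```

Restoring is the same call with the two arguments swapped, which is roughly what the tomsrtbt recovery step amounts to.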

    Foolproof system, so far - and yes, lots of foolish users around.

    --
    If you keep throwing chairs, one day you'll break windows....
  22. First use for this by buyo-kun · · Score: 1

    I think the first good use of ROC would be to clean up the errors and problems in Windows. Of course the only thing the ROC could possibly do to clean up all the problems with Windows is to delete Windows altogether, but hey, we'd do it ourselves sooner or later anyway.

  23. Re:No clue by Gordonjcp · · Score: 4, Informative

    Well, yeah. That's basically a watchdog timer. It's very common in embedded stuff, because it's cheap to implement - in fact, many microcontrollers have it built into the hardware. In microcontrollers they're very simple - a counter counts up (say) 1024 clock pulses, and if it rolls over then reset the CPU. In normal operation then every time round the main loop you'd write to a specified IO port to kick the watchdog once every millisecond or so - this resets the counter. It's crude but effective, and is very commonly used in things like ECUs for automotive electrickery - although the software is simple enough to be thoroughly tested (BMW 735i's aside) there's still dirty power and mechanically harsh environment to deal with. And your ABS ECU doesn't have , does it?
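
To make the counter-and-kick mechanism above concrete, here is a toy simulation. It is a sketch only: real watchdogs are hardware peripherals, the class name is made up, and the "reset" here just increments a tally instead of restarting a CPU.

```python
class WatchdogTimer:
    """Counter-based watchdog: resets the CPU if not kicked before rollover."""

    def __init__(self, limit=1024):
        self.limit = limit     # clock pulses before the counter rolls over
        self.count = 0
        self.resets = 0

    def kick(self):
        """Main loop writes to the watchdog port; this clears the counter."""
        self.count = 0

    def tick(self):
        """One clock pulse. On rollover, reset the CPU (only tallied here)."""
        self.count += 1
        if self.count >= self.limit:
            self.count = 0
            self.resets += 1


wd = WatchdogTimer(limit=1024)

# Healthy main loop: kicks every 1000 pulses, so the counter never rolls over.
for pulse in range(1, 5001):
    wd.tick()
    if pulse % 1000 == 0:
        wd.kick()
healthy_resets = wd.resets

# Hung main loop: no kicks for 3000 pulses.
for pulse in range(3000):
    wd.tick()
hung_resets = wd.resets - healthy_resets
print(healthy_resets, hung_resets)  # 0 2
```

The kick interval must be shorter than the rollover limit; here 1000 pulses against a limit of 1024 leaves very little margin.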

  24. I used systems like this by Mark+Hood · · Score: 5, Interesting

    they were large telecomms phone switches.

    When I left the company in question, they had recently introduced a 'micro-reboot' feature that allowed you to only clear the registers for one call - previously you had to drop all the calls to solve a hung channel or if you hit a software error.

    The system could do this for phone calls, commands entered on the command line, even backups could be halted and started without affecting anything else.

    Yes, it requires extensive development, but you can do it incrementally - we had thousands of software 'blocks' which had this functionality added to them whenever they were opened for other reasons, we never added this feature unless we were already making major changes.

    Patches could be introduced to the running system, and falling back was simplicity itself - the same went for configuration changes.

    This stuff is not new in the telecomms field, where 'five nines' uptime is the bare minimum. Now the telcos are trying to save money, they're looking at commodity PCs & open standard solutions, and shuddering - you need to reboot everything to fix a minor issue? Ugh!

    As for introducing errors to test stability, I did this, and I can vouch for its effects. I made a few patches that randomly caused 'real world' type errors (call dropped, congestion on routes, no free devices) and let it run for a weekend as an automated caller tried to make calls. When I came in on Monday I'd caused 2,000 failures which boiled down to 38 unique faults. The system had not rebooted once, so only those 2,000 calls had even noticed a problem. Once the software went live, the customer spotted 2 faults in the first month, where previously they'd found 30... So I swear by 'negative testing'.
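
A toy version of the 'negative testing' run described above. It is a sketch under loud assumptions: `place_call`, the 20% fault rate and the three fault names are all invented for illustration and have nothing to do with the actual switch software.

```python
import random
from collections import Counter

# The three error types mentioned in the comment, used as injected faults.
FAULT_TYPES = ["call dropped", "congestion on route", "no free devices"]

class InjectedFault(Exception):
    """Raised when the test harness deliberately breaks a call."""

def place_call(rng, fault_rate):
    """Hypothetical call setup that sometimes raises an injected fault."""
    if rng.random() < fault_rate:
        raise InjectedFault(rng.choice(FAULT_TYPES))
    return "connected"

def run_campaign(calls, fault_rate, seed=0):
    """Make `calls` attempts; each failure is contained to its own call."""
    rng = random.Random(seed)
    faults = Counter()
    connected = 0
    for _ in range(calls):
        try:
            place_call(rng, fault_rate)
            connected += 1
        except InjectedFault as fault:
            faults[str(fault)] += 1   # record and carry on, no restart
    return connected, faults

connected, faults = run_campaign(calls=10_000, fault_rate=0.2)
print(connected, sum(faults.values()), len(faults))
```

Grouping failures by type in the `Counter` is the same boiling-down step as going from 2,000 failures to 38 unique faults.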

    Nice to see the 'PC' world finally catching up :)

    If people want more info, then write to me.

    Mark

    --
    Liked this comment? Why not buy me something nice
    1. Re:I used systems like this by the-dude-man · · Score: 1

      I've been striving to work this kind of stability into my client's software for years! To a certain extent, a lot of it's there; the problem with the PC world is you have to do an update every 3 days just to prevent someone from rooting your box with all the remote exploits floating around out there.

      I usually use large sets of negative data to isolate the problem...but there are just some things that users can cause that, in an integrated world like the PC world, will just take things down.

      That's not to say that you can't keep a box up for several years. I have a client that has outright refused to update their kernel and has not rebooted a Red Hat box I set up 5 years ago. The kernel is sheltered enough from the real world that as long as it does what we want...it's fine. And services are updated almost daily via scripting, and most of the kernel is modularized so parts of the kernel can be updated to keep up with the services.

      So I can keep an operating system up for a very long time...my concern has now turned to keeping services up....there are just some things that will take down a service no matter what I do (i.e. DoSing the socket until it explodes), and I can't seem to find a way around them without restarting the service to correct the problem...this is problematic because there is an indefinite number of other users that we don't want to affect....telecom has been doing this for years so I would be interested in hearing any coding tricks you may have up your sleeve :)

  25. Or by gazbo · · Score: 1
    You're just making assumptions. Snippet:

    The most common way to fix Web site faults today is to reboot the entire system, which takes anywhere from 10 seconds (if the application alone is rebooted) to a minute (if the whole thing is restarted). According to our initial results, micro-rebooting just the necessary subcomponents takes less than a second.

    So in fact it's not talking about rebooting machine vs restarting services, it's talking about both of the above vs restarting subcomponents.
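
    The distinction can be illustrated with a small Python sketch. It is not the ROC researchers' implementation; `Component`, `Application`, and the retry-once policy are assumptions made purely for illustration. Only the failing subcomponent is restarted, and the rest of the application is left alone.

```python
class Component:
    """A hypothetical subcomponent with its own health flag and restart count."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is wedged")
        return f"{self.name} handled {request}"

    def restart(self):
        # Stand-in for reinitialising just this component.
        self.healthy = True
        self.restarts += 1


class Application:
    """Routes requests to components and micro-reboots the one that fails."""

    def __init__(self, names):
        self.components = {name: Component(name) for name in names}

    def dispatch(self, name, request):
        component = self.components[name]
        try:
            return component.handle(request)
        except RuntimeError:
            # Micro-reboot: restart only the failing component, then retry once.
            component.restart()
            return component.handle(request)
```

    A whole-application restart would instead rebuild every entry in `components`, which is where the claimed difference in recovery time would come from.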

    But hey, if you want to start talking about rebooting failed SMB services on Windows then go right ahead - you're in front of a friendly audience after all.

    1. Re:Or by Anonymous Coward · · Score: 0

      Doesn't Apache coupled with a good database already provide much of what this article discusses? Apache already monitors child processes and reboots these sub-components when necessary. A good database provides transactions enabling a roll-back should something go south.

  26. "Managerspeak"?! by No+Such+Agency · · Score: 3, Insightful

    Somebody has to suggest the weird ideas, even if they sound stupid and impractical now. Of course we won't be retrofitting our existing systems in six months, I think this is a bigger vision than that.

    Rather than trying to eliminate computer crashes--probably an impossible task--our team concentrates on designing systems that recover rapidly when mishaps do occur.

    The goal here is clearly to make the stability of the operating system and software less critical, so we don't have to hope and pray that a new installation doesn't overwrite a system file with a weird buggy version, or that our OS won't decide to go tits-up in the middle of an important process. Since all us good Slashdotters KNOW there will still be crufty, evil OS's around in 10 years, even if WE aren't using them :-)

    --
    Freedom: "I won't!"
    1. Re:"Managerspeak"?! by _typo · · Score: 1

      so we don't have to hope and pray that a new installation doesn't overwrite a system file with a weird buggy version, or that our OS won't decide to go tits-up in the middle of an important process. Since all us good Slashdotters KNOW there will still be crufty, evil OS's around in 10 years, even if WE aren't using them :-)

      Then maybe the solution isn't using additional bug-prone software to try to recover fast from failures but to actually replace the crufty, evil OS's.

      --

      Pedro Côrte-Real.

    2. Re:"Managerspeak"?! by cloudmaster · · Score: 2, Insightful

      It might be a better use of time to write code that works correctly and is properly tested before release, rather than doing all of that on some other piece of meta-code that's likely to have a bunch o' problems too.

    3. Re:"Managerspeak"?! by fgodfrey · · Score: 2, Interesting
      No, it's not (well, debugging software is definitely good, but writing "self healing" code is important too). An operating system is an incredibly complex piece of software. At Cray and SGI a *very* large amount of testing goes on before release, but software still gets released with bugs. Even if you were, by some miracle, to get a perfect OS, hardware still breaks. In a large system, hardware breaks quite often. Having an OS that can recover from a software or hardware failure on a large system is essential to keeping the system running.


      The software that I'm responsible for, in fact, is specifically designed to detect, report, and try to work around errors. We have code to detect a processor hang (through software or hardware failure) and remove it from the running OS image, etc. The Cray T3E (which I didn't work on) can warm-reboot an individual processor on either a software or hardware panic/hang and reintegrate it into the running OS.

      --
      Go Badgers! -- #include "std/disclaimer.h"
    4. Re:"Managerspeak"?! by cloudmaster · · Score: 1

      I have several personal computers that run specific tasks, and they never break. Not "often" - never. It's not impossible. The "impossible" part is putting more time into planning and development to avoid the little problems that crop up, and then putting more time into testing to find whatever problems happen to occur. That's not nearly as glamorous as developing some big "new" system that'll find faults after they happen, though, so it probably won't happen.

      I'm certainly not saying that error detection is a bad thing or a waste of time. I will say, however, that not nearly enough time is devoted to planning or quality control, and that it'd make more sense to make the systems more stable to begin with. As an analogy, a car with a convertible top probably is easier to extract dead bodies from after a brake failure - but it would be a better use of time to make the brakes work properly the first time instead of making the convertible top easier to cut open... :)

    5. Re:"Managerspeak"?! by Anonymous Coward · · Score: 0

      I like your analogy but find it flawed. For instance, if the human body were simply more resilient then we wouldn't need brakes at all.

    6. Re:"Managerspeak"?! by fgodfrey · · Score: 1

      Right, I have those also. The key there is "personal computers" which have 1 processor (maybe 2 or 4 if you have more spare cash than I do). These are quite small. There is no way that you are going to be able to build a machine of the size I usually work with (256 processors is on the low end of that) that "never" breaks. It *is* impossible. The statistics eventually catch up with you. Even if the software is perfect, the hardware *will* break, it's only a matter of time (and not as much as you might think). That's where all the error detection, correction, etc. is absolutely required.
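
      A rough Python illustration of why the statistics catch up with large machines. It assumes independent failures and a purely hypothetical per-processor failure probability; real hardware failures are often correlated, so this is only a sketch of the arithmetic, and `prob_any_failure` is a name made up here.

```python
def prob_any_failure(p_single, n):
    """Probability that at least one of n independent parts fails.

    p_single is the (assumed) chance that one part fails in some period.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    # All n parts survive with probability (1 - p)^n; take the complement.
    return 1.0 - (1.0 - p_single) ** n


if __name__ == "__main__":
    for processors in (1, 4, 256, 2048):
        print(processors, round(prob_any_failure(0.01, processors), 4))
```

      With a hypothetical 1% chance of any given processor failing in some period, a single-processor box sees a failure 1% of the time, while a 256-processor machine has roughly a 92% chance of at least one failure in that same period.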

      --
      Go Badgers! -- #include "std/disclaimer.h"
  27. already done? by the-dude-man · · Score: 1

    Hmmmm....Recovery Oriented Computing......This just screams Linux.

    Recovery Oriented Computing is nothing new. Most developers (well, *nix developers) have been heading down this route for years. Particularly as more hardcore OO languages (i.e. Java...and in many respects C++) come to the surface with exception structures, it becomes easier to isolate and identify the exception that occurred and take appropriate action to keep the server going.

    However, this method of coding is still growing...there are no real solid / accepted methods of isolating and identifying problems...however, in the next few years you will probably see this trend move to the next level as algorithms for identification and localization are developed and widely adopted.

    Of course if you're running on a Windows platform this is kinda pointless...rebooting at least once every 30 days really eliminates any chance of long-term running and the need for large-scale localization and identification.

  28. Excellent by hdparm · · Score: 1, Funny
    we could in that case:

    rm -rf /*

    ^Z

    just for fun!

    1. Re:Excellent by Anonymous Coward · · Score: 0

      [1]+ Stopped rm -rf /*

  29. ACID ROC? by shic · · Score: 3, Insightful

    I wonder... is there a meaningful distinction between ROC and the classical holy grail of ACID systems (i.e. systems which meet the Atomic, Consistent, Isolated and Durable assumptions commonly cited in the realm of commercial RDBMS)? Apart from the 'swish' buzzword re-name that isn't even an acronym?

    Professionals in the field, while usually in agreement about the desirability of systems which pass the ACID test, most admit that while the concepts are well understood, the real-world cost of the additional software complexity often precludes strict ACID compliance in typical systems. I would certainly be interested if there were more to ROC than evaluating the performance of existing well understood ACID-related techniques but can't find anything more than the "hype." For example, has ROC suggested designs to resolve distributed incoherence due to hardware failure? Classified non-trivial architectures immune to various classes of failure? Discovered a cost effective approach to ACID?

    1. Re:ACID ROC? by Anonymous Coward · · Score: 0

      I prefer Acid Jazz.

    2. Re:ACID ROC? by Anonymous Coward · · Score: 0

      I'm a graduate student working with the ROC project at Stanford, and although we are not addressing distributed data incoherence, we are addressing your last point -- a cost-effective approach to ACID.

      As you point out earlier in your post, many tradeoffs are made in industrial settings because of the expense of ACID semantics. It should be noted that there is a lot of data and related applications that do not require ACID semantics; two of us here at Stanford ROC are looking at data that does not require ACID semantics.

      I am looking at session state and temporary application state (semi-persistent state), and building a self-managing session state store whose components can be rebooted proactively without losing data, reducing availability, or degrading performance.

      My colleague Andrew Huang is working on a store with good performance and fast recovery that is focused on persistent data that is accessed with a hashtable API.

      Our hope is that these different stores will allow systems developers to have better choices than the current standard choices of "DB or filesystem or in memory" and simultaneously provide performance that is at least equal to current solutions, yet be easier to manage, tune, and maintain than existing solutions.

    3. Re:ACID ROC? by bling0 · · Score: 1

      Oops, I'm not an anonymous coward. I thought I had logged in, but I guess not.

    4. Re:ACID ROC? by shic · · Score: 1
      I'm interested... though I have to say that I'm at least sceptical it's even possible to store state without degrading performance in the general case. (I can't see any way you can overcome the cost of bandwidth to stable storage.)

      I'm very interested in the idea of designing new kinds of reliable store (if anything in IT can be really new these days!) I've even tinkered with this myself. The hash-table-like store is of particular interest to me... I guess this is done using trees in a way reminiscent of the grand plans Will Phillips discussed for the Tux2 file system. Do you have some URLs for papers/prototypes?

  30. Not going to work by locarecords.com · · Score: 1, Offtopic
    This is pie in the sky.

    My experience is the best system is paired computers running in parallel that are balanced by another computer that watches for problems and switches the crashed system from Live to the other computer seamlessly. It then reboots the system with problems and allows it to recreate its dataset from its partner.

    In effect this points the way to the importance of massive parallelism required for totally stable systems so that clusters form the virtual computer and we get away from the idea of a computer as a single machine.

    After all, individual computers suffer hardware failure too!

    --
    ---- The Open Source Record Label : : LOCARECORDS.COM
  31. The Hurd by rf0 · · Score: 3, Interesting

    Wouldn't some sort of software solution be the Hurd (if/when it becomes ready) in that as each system is a micro-kernel you just restart that bit of the operating system. As said in another post this is like /etc/rc.d but at a lower level.

    Or you could just have some sort of failover setup.

    Rus

    1. Re:The Hurd by sql*kitten · · Score: 1

      Wouldn't some sort of software solution be the Hurd (if/when it becomes ready) in that as each system is a micro-kernel you just restart that bit of the operating system. As said in another post this is like /etc/rc.d but at a lower level.

      QNX, I believe, already does this, and has been in production use throughout the world for years.

  32. Magic Server Pixie Dust by thynk · · Score: 3, Funny

    Didn't IBM come out with some Magic Server Pixie Dust that did this sort of thing already, or am I mistaken?

    --

    Good judgment comes from experience, and a lot of that comes from bad judgment.
    1. Re:Magic Server Pixie Dust by the-dude-man · · Score: 1

      That was just a gimmick for that commercial...what they were actually selling is BSD boxes running ports that can update themselves, rebuild the kernel according to pre-defined specs, and reboot when necessary to implement the changes, but designed not to reboot whenever possible (so they build the kernel to be very modular and only update the modules as needed until something in the base needs to be updated, then rebuild the kernel and reboot), and using some tweaked-out bash scripting to respawn services that died.

      This isn't really ROC, since if it's a system-wide error, restarting the service just causes it to die again. They are selling more or less very well set up BSD boxes (I sell similar solutions to people). However, this is not ROC because they still need to be administered; the goal though was for someone with minimal knowledge of Linux to be able to handle the day-to-day operations of the server. Since these

  33. Self-diagnostics by 6hill · · Score: 4, Interesting
    I've done some work on high availability computing (incl. my Master's thesis) and one of the more interesting problems is the one you described here -- true metaphysics. The question as it is usually posed goes, How does one self-diagnose? Can a computer program distinguish between a malfunctioning software or malfunctioning software monitoring software -- is the problem in the running program or in the actual diagnostic software? How do you run diagnostics on diagnostics running diagnostics on diagnostics... ugh :).

    My particular system of research finally wound up relying on the Windows method: if uncertain, erase and reboot. It didn't have to be 99.999% available, after all. There are other ways with which to solve this in distributed/clustered computing, such as voting: servers in the cluster vote for each other's sanity (i.e. determine if the messages sent by one computer make sense to at least two others). However, even this system is not rock solid (what if two computers happen to malfunction in the same manner simultaneously? what if the malfunction is contagious? or widespread in the cluster?).

    So, self-correcting is an intriguing question, to say the least. I'll be keenly following what the ROC fellas come up with.

    1. Re:Self-diagnostics by Anonymous Coward · · Score: 0

      [citation]
      Can a computer program distinguish between a malfunctioning software or malfunctioning software monitoring software -- is the problem in the running program or in the actual diagnostic software?
      [end citation]
      Don't worry! In 2 years, Palladium will solve it all for you! It will clearly define trusted software, so that you will always know which parts of a system are *absolutely perfect*. Hail Microsoft!
      Oh, wait...

    2. Re:Self-diagnostics by Zirnike · · Score: 1

      Ummm... why not use the shuttle method? 3 monitors, take a poll to determine if something needs to be rebooted. If all 3 agree, that's easy. If 2 agree, reboot the problem app, then reboot the 3rd monitor, and reboot one of the other 2 as it comes back online. (so you have 3 fresh monitor programs)
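
      A minimal Python sketch of the 2-of-3 poll described above. The function name `majority_verdict` and the verdict strings are hypothetical; it only shows the voting step, not how the monitors themselves would be rebooted.

```python
from collections import Counter


def majority_verdict(reports):
    """Return (verdict, dissenters) for a dict of monitor name -> verdict.

    verdict is None when no strict majority exists, in which case every
    monitor is listed as suspect.
    """
    if not reports:
        return None, []
    counts = Counter(reports.values())
    verdict, votes = counts.most_common(1)[0]
    if votes * 2 <= len(reports):
        # No strict majority: nobody can be trusted over anybody else.
        return None, sorted(reports)
    dissenters = sorted(name for name, v in reports.items() if v != verdict)
    return verdict, dissenters
```

      If two monitors say "reboot" and one says "ok", the verdict is "reboot" and the lone dissenter is the monitor to restart first. As the parent comment notes, this does not help when a majority is wrong in the same way.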

      --
      I'm not shy, I'm stalking my prey
    3. Re:Self-diagnostics by jtheory · · Score: 3, Insightful

      There are other ways with which to solve this in distributed/clustered computing, such as voting: servers in the cluster vote for each other's sanity (i.e. determine if the messages sent by one computer make sense to at least two others). However, even this system is not rock solid (what if two computers happen to malfunction in the same manner simultaneously? what if the malfunction is contagious? or widespread in the cluster?).

      We can learn some lessons from how human society works. If your messages don't make sense to most other people, or if you start damaging a lot of other people, you get separated from the rest and possibly "rebooted" (some call this "electroshock therapy") or even deactivated (some call this "Welcome to Texas").

      The difference here is that if the computers in the cluster are all running the same programs, they will contain the exact same coding flaw that they will all concur is the only sane answer (in human terms, this is called "religion"). So we're protected from hardware malfunctions, but not bugs in software or hardware.

      That's why this stuff is so hard to do. It may be possible to use selective program restarts to temporarily keep service up in spite of a nasty memory leak, but nothing is really "repaired"; it's just providing a few more fingers to plug holes in the dam while the river keeps rising. So... do you get into providing alternative services for the ones malfunctioning?

      Interesting stuff (maybe I'll even read the article now).

      --
      There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
    4. Re:Self-diagnostics by spectral · · Score: 1

      Hey yeah, something like this happened in the hitchhiker's guide..

      Ignoring that, have three completely separate monitoring programs? Don't use the same code base, and therefore you don't run into software problems, and minimize the effect of hardware problems (since the software would probably be interacting with the hardware in different ways, if there's a div bug in the CPU, the errors won't necessarily be the same between all three SEPARATE software programs). If they all monitor the same thing and produce the same output (or at least, output that's understandable to the other two), the likelihood of two messing up at the same time, and producing the same wrong output is rather small.

      However, it's probably possible to knock two out so they produce DIFFERENT output, so you have one valid source, and two screwed up ones, but screwed up in different ways.. Then you basically have to reboot and hope it fixes them.

      Would this work, or am I missing something here? Again, not perfect reliability, but it does tend to make it so that there's a bit of safety from software bugs..

    5. Re:Self-diagnostics by 6hill · · Score: 1

      Yeah, that'd be one of the other methods (server voting was presented just as an example). In reality, having attempted building a simple monitoring software capable of this (i.e. all monitors can boot one another), it turns out it ain't so simple at all :). Self-awareness and building enough checks but not too much into the rebooting mechanism...well, that was one complex mo-fo of a NFA to draw out. In practical cluster applications, it is often easier just to boot the whole machine and have everything start anew.

  34. Sounds like commitment control by GerardM · · Score: 1

    In databases, you have your actions, and when a sequence of events starts, it is committed at the end of the event cycle. When you change things, there is a sequence of events that leads to a "stable" state. When the stable state has arrived, you commit. When you decide that it is no good anyway there is the possibility of a roll-back; everything is rolled back to a last known good state.

    In practice it would mean that changes are logged and possibly after logging changes are effectuated. This does result in overhead and in potential vulnerabilities (both for hackers and for errors).
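
    The commit/roll-back cycle described above can be sketched in Python with the standard `sqlite3` module. The `accounts` table, the `transfer` function, and the "no negative balance" rule are hypothetical, chosen only to give the transaction something to protect.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move amount between accounts; commit only if the end state is valid."""
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.commit()  # stable state reached: make the changes permanent
        return True
    except ValueError:
        conn.rollback()  # back to the last known good state
        return False
```

    Both updates either land together or not at all, which is the "last known good state" guarantee. The overhead mentioned above comes from the logging the database does to make that roll-back possible.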

    Things like this also reek of what "standardised" hardware and software would look like. How else can you control the quality of such a system? NB this does not mean that Linux or BSD is inferior, it would only be more obvious and visible what went right and what went wrong.
    Thanks,
    Gerard

  35. micro-rebooting... by amanpatelhotmail.com · · Score: 1, Funny
    Is just one of the cool new features in MS Windows(r) Longhorn(tm)

    You now don't reboot(tm) but you micro-reboot(tm) i.e. the system will do that for you! Remember the times when you are writing that important report under MS(r) Word(tm); and the system crashed, and you had to press Ctrl-Alt-Del(tm) to reboot(tm). No more! No more pressing awkward buttons... The system is intelligent enough to do that for you :)

    1. Re:micro-rebooting... by Unregistered · · Score: 1

      dude, xp does that also. It flashes the stop error and reboots to save you the trouble of rebooting or even reading the stop error.

  36. "operating them is much more complex" by NReitzel · · Score: 2, Funny
    Are you crazy?

    My first "PC" was a PDP-11/20, with paper tape reader and linc tape storage. Anyone who tries to tell me that operating today's computers is much more complex needs to take some serious drugs.

    What is more complex is what today's computers do, and increasing their reliability or making them goal oriented are both laudable goals. What will not be accomplished is making the things that these computers actually do less complex.

    --

    Don't take life too seriously; it isn't permanent.

  37. Ah, youth... by tkrotchko · · Score: 2, Insightful

    "But operating them is much more complex."

    You're saying the computers of today are more complex to operate than those of 20 years ago?

    What was the popular platform 20 years ago.... (1983). The MacOS had not yet debuted, but the PC XT had. The Apple ][ was the main competitor.

    So you had a DOS command line and an AppleDOS command line. Was that really simpler than pointing and clicking in XP and OSX today? I mean, you can actually have your *mother* operate a computer today.

    I'm not sure I agree with the premise.

    --
    You were mistaken. Which is odd, since memory shouldn't be a problem for you
  38. net start/stop by oliverthered · · Score: 1

    net stop workstation
    net start workstation

    when NT services blow chunks, they often leave crap in kernel space that prevents them being stopped/started.

    I hope things have improved with Windows XP.

    --
    thank God the internet isn't a human right.
  39. Good Code and Hardware by caffeinex36 · · Score: 1

    Wouldn't better coding and better hardware be more efficient? This sounds a little silly. Perhaps, come quantum computers, maybe. Think of all the SA's that fix things that break all day who will be jobless.

    Rob

  40. The long wondered about origin of ... by den_erpel · · Score: 0, Funny

    Self-Repairing Computers

    Finally, this provides us with the long awaited answer to the following situations:

    Reed: Captain, direct hit on the power supply!
    Archer: That'll teach those cyborgs for flooding our inbox with p0rn!
    T'Pol: Captain, their server is mysteriously repairing itself, we're still being flooded.

    for any other series:
    TOS:
    %s/Reed/Chekov/g
    %s/Archer/Kirk/g
    %s/T'Pol/Spock/g
    TNG:
    %s/Reed/Worf/g
    %s/Archer/Picard/g
    %s/T'Pol/Data/g
    DS9:
    %s/Reed/Kira/g
    %s/Archer/Sisko/g
    %s/T'Pol/Dax/g
    VGR:
    %s/Reed/Tuvok/g
    %s/Archer/Janeway/g
    %s/T'Pol/Kim/g

    Since the B&B messed up the timelines anyway, they'll probably pour it into an episode, they seem to be out of inspiration anyhow...

    --
    Genius doesn't work on an assembly line basis. You can't simply say, "Today I will be brilliant."
  41. A computer is no washmachine, but why ? by Quazion · · Score: 2, Insightful

    Washing machines have a lifetime of around 15-20 years I guess, computers about 1-3 years.
    This is because the technical computer stuff is so new every year, and so...

    1: It's too expensive to make it failsafe; development would take too long.
    2: You can't refine/redesign and resell, because of new technology.
    3: If it just works no one will buy new systems, so they have to fail every now and then.

    Other consumer products have a much longer development cycle. Cars, for example, shouldn't fail, and if they do they should be fairly easy to repair. Cars have also been around for, I don't know, something like a hundred years, and have they changed much? With computers, heck, just buy a new one or hire a PC Repair Man (Dutch only) to do your fixing.

    Excuse me for my bad English ;-) but I hope you got the point, no time to ask my living dictionary.

    1. Re:A computer is no washmachine, but why ? by cellocgw · · Score: 1

      Well.... fundamentally, computers become allegedly obsolete in 3 yrs or less as much because software gets upgraded every time a new computer shows up, and the upgrades are not back-compatible. Washing machines, and cars, and AA-batteries have reached limitations imposed by various physical laws; CPUs and scanners and printers and video displays have not.
      Also, while it generally is cost-effective to repair cars, anything over very minor problems w/ washing machines, stoves, refrigerators, 35mm cameras, TVs, phones, etc. makes it far more cost-effective to buy new rather than repair.

      --
      https://app.box.com/WitthoftResume Code: https://github.com/cellocgw
    2. Re:A computer is no washmachine, but why ? by Quazion · · Score: 1

      Indeed true, you just put it in nice words, but you forget the point that consumers expect a product that works and is fully done, instead of downloading firmware and other software updates. The cycle is getting too short. I had one customer to whom I offered an on-site guarantee for 1-3 years; she replied, "Do computers break?" Which shows people think they're buying a freaking washing machine, while it often costs even more :)

  42. But I do that already... by edunbar93 · · Score: 2, Informative

    build an "undo" function (similar to those in word-processing programs) for large computing systems

    This is called "the sysadmin thinks ahead."

    Essentially, when any sysadmin worth a pile of beans makes any changes whatsoever, he makes sure there's a backup plan before making his changes live. Whether it means running the service on a non-standard port to test, running it on the development server to test, making backups of the configuration and/or the binaries in question, or making backups of the entire system every night. She is thinking "what happens if this doesn't work?" before making any changes. It doesn't matter if it's a web server running on a lowly Pentium 2 or Google - the sysadmin is paid to think about actions before making them. Having things like this won't replace the sysadmin, although I can imagine a good many PHBs trying before realizing that just because you can back out of stupid mistakes, doesn't mean you can keep them from happening in the first place.
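
    The "back up the configuration before you touch it" habit can be sketched in a few lines of Python. Everything here is an assumption for illustration: `apply_with_undo`, the `.bak` suffix, and the caller-supplied `validate` check are hypothetical, and a real change would also need to reload the service.

```python
import shutil
from pathlib import Path


def apply_with_undo(config_path, new_text, validate):
    """Write new_text to a config file, restoring the old copy if it fails.

    validate is a caller-supplied function that takes the file contents and
    returns True when the new configuration is acceptable.
    """
    path = Path(config_path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # the backup plan comes first
    path.write_text(new_text)
    if validate(path.read_text()):
        return True
    shutil.copy2(backup, path)  # undo: put the known-good file back
    return False
```

    This only backs out of a mistake after the fact; as the comment says, it does nothing to stop the mistake from being made in the first place.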

    --
    "No problem. I have the capacity to do infinite work so long as you don't mind that my quality approaches zero."-Dilbert
  43. SPOFs by 6hill · · Score: 1

    there will be always a single point of failure for ever

    Well, yes and no. Single points of failure are extremely difficult to find in the first place, not to mention remove, but it can be done on the hardware side. I could mention the servers formerly known as Compaq Himalaya, nowadays part of HP's NonStop Enterprise Division in some manner. Duplicated everything, from processors and power sources to I/O and all manner of computing doo-dads. Scalable from 2 to 4000 processors.

    They are (or were, when I did my research piece on the Himalayas) also self-correcting in the sense that the two processors do lock-step processing and if the two differ in their opinions, the primary immediately hands over the responsibility to the redundant/backup -- data self-correcting on the assembly level. Of course, this doesn't prevent software from being a point of failure or from functioning incorrectly, but one or a cluster of these is as close as you're going to get without automated hotswapping or nanobot parts building, or other such sci-fi notions.

    1. Re:SPOFs by KingRamsis · · Score: 2, Insightful

      so it is basically two synchronized computers, it probably cost 3x the normal, and if you wiped out the self-correcting logic the system was likely to die, you mentioned that they managed to duplicate everything did they duplicate the self-correcting logic itself?


      the primary immediately hands over the responsibility to the redundant/backup
      is there an effective way to judge which processor is correct? you need an odd number of processors to do that or an odd split on an even number of processors.
      I'm not saying that this system is flawed actually the way you described here it is certainly far more reliable than the usual servers, what I'm trying to point out is that the concept itself is the bottleneck.

    2. Re:SPOFs by 6hill · · Score: 1

      so it is basically two synchronized computers, it probably cost 3x the normal, and if you wiped out the self-correcting logic the system was likely to die, you mentioned that they managed to duplicate everything did they duplicate the self-correcting logic itself?

      Uh...? No self-correcting logic itself, merely hardware duplication. The processor checks were (IIRC) implemented with checksums or some such integrity checks, so this is not in essence a self-correcting system in anything but the assembly level (i.e. things are processed correctly) -- it does not in any way take a stance in the relative "correctness" of the software that runs on it, merely on how its instructions are actualised on the hardware.

      As for price, this is Compaq/HP enterprise division we're talking about, so it'll be 10x the cost of any ordinary dual-processor computer. However, lock-step != synchronised. In synch computing, there will be a delay between an error occurring and its detection; in the Compaq solution, no data is transferred out of the processor's sandbox before its integrity has been verified.

    3. Re:SPOFs by DazzaJ · · Score: 1

      You should probably read up a bit on Nonstop servers (nonstop.compaq.com) before saying why they will not work!

      A Nonstop server is fault tolerant at the hardware and software level. It is a shared nothing parallel processing system (MPP) rather than the SMP systems you are used to. I won't describe it in detail as the HP website does a better job.

      To answer the question "is there an effective way to judge which processor is correct?", the answer is no. When an error is detected in either of the two lock stepped CPUs (or any hardware or software component for that matter) the component is immediately killed. This is called Fail-Fast. This stops the error corrupting other parts of the system. Due to its unique design the Nonstop server application will keep running despite the death of hardware and software components.

      For example, certain software processes run fully Nonstop e.g. the Transaction manager. The Nonstop process "state" is constantly mirrored to another processor. If the processor running the primary Nonstop process fails fast, the backup process takes over immediately in the alternate processor. This is typically sub second.
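
      A toy Python sketch of the two ideas above, lock-step comparison with fail-fast and a backup taking over from a mirrored checkpoint. It is in no way the NonStop implementation; `lockstep`, `run_with_backup`, and the running-total "state" are hypothetical, and real systems do this in hardware and system software rather than in application code.

```python
class FailFast(Exception):
    """Raised as soon as the two lock-stepped units disagree."""


def lockstep(unit_a, unit_b, value):
    """Run the same step on two units and refuse to emit a disputed result."""
    a, b = unit_a(value), unit_b(value)
    if a != b:
        # No attempt to decide who is right: kill the pair immediately.
        raise FailFast(f"units disagree on {value!r}: {a!r} != {b!r}")
    return a


def run_with_backup(primary, backup, inputs):
    """Process inputs on the primary pair, switching to the backup on FailFast.

    state stands in for the checkpoint that is mirrored after each step.
    """
    state = 0
    active = primary
    took_over = False
    for value in inputs:
        try:
            state += lockstep(*active, value)
        except FailFast:
            if took_over:
                raise  # the backup failed too; nothing left to fail over to
            active, took_over = backup, True
            state += lockstep(*active, value)  # redo the step on the backup
    return state, took_over
```

      Because the checkpoint is only advanced after a verified step, the backup resumes from the last agreed state and the disputed result never leaves the failed pair.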

      There is far more to the architecture, but you get the idea.

  44. Re:Ah, youth... by the-dude-man · · Score: 1

    So you had a DOS command line and an AppleDOS command line. Was that really a simpler than pointing and clicking in XP and OSX today? I mean, you can actually have your *mother* operate a computer today.

      This is true, however, keep in mind that none of the DOS operating systems had a kernel, nor were any of them truly multitasking until Windows 95 for the Windows world (shudders). And the debut of Unix 20 years ago.

      Also keep in mind all the new technologies such as networking (that's a whole post of changes on its own), hardware and Bluetooth, FireWire, USB, a huge number of new technologies that have evolved to meet the ever-expanding demands we place on systems.

      Some of the popular platforms from 20 years ago such as the PC XT are now used in calculators today. The very definition of a computer has changed in 20 years, so the operating systems are orders of magnitude more complex...20 years ago the PC world was still in its infancy. Since then, everything outside the very definition of the PC has changed...and notebook and handheld technologies are pushing that.

      That being said, it's not really fair to compare operating systems from 20 years ago to operating systems of today....it's just a different world, and the very definition of an operating system is no longer the same.

  45. Does SCI AM review articles properly nowadays? by panurge · · Score: 3, Insightful
    The authors either don't seem to know much about the current state of the art or are just ignoring it. And as for unreliability - well, it's true that the first Unix box I ever had (8 user with VT100 terminals) could go almost as long without rebooting as my most recent small Linux box, but there's a bit of a difference in traffic between 8 19200 baud serial links and two 100baseT ports, not to mention the range of applications being supported.
    Or the factor of 1000 to 1 in hard disk sizes.
    Or the 20:1 price difference.

    I think a suitable punishment would be to lock the authors in a museum somewhere that has a 70s mainframe, and let them out when they've learned how to swap disk packs, load the tapes, splice paper tape, connect the Teletype, sweep the chad off the floor, stack a card deck or two and actually run an application...those were the days, when computing kept you fit.

    --
    Panurge has posted for the last time. Thanks for the positive moderations.
    1. Re:Does SCI AM review articles properly nowadays? by NearlyHeadless · · Score: 4, Insightful
      The authors either don't seem to know much about the current state of the art or are just ignoring it.

      I have to say that I am just shocked at the inane reactions on slashdot to this interesting article. Here we have a joint project of two of the most advanced CS departments in the world. David Patterson's name, at least, should be familiar to anyone who has studied computer science in the last two decades since he is co-author of the pre-eminent textbook on computer architecture.

      Yet most of the comments (+5 Insightful) are (1) this is pie in the sky, (2) they must just know Windows, har-de-har-har, (3) Undo is for wimps, that is what backups are for, (4) this is just "managerspeak".


      Grow up people. They are not just talking about operating systems, they do know what they are talking about. Some of their research involved hugely complex J2EE systems that run on, yes, Unix systems. Some of their work involves designing custom hardware--"ROC-1 hardware prototype, a 64-node cluster with special hardware features for isolation, redundancy, monitoring, and diagnosis."


      Perhaps you should just pause for a few minutes to think about their research instead of trying to score Karma points.

    2. Re:Does SCI AM review articles properly nowadays? by kscguru · · Score: 1
      Being a student at the same school as one of these professors, I can assure you that they know EXACTLY what they are talking about.

      This isn't small-scale Unix or Linux boxes having multi-year uptimes. There are two more important applications: 1) dynamically scripted web servers, and 2) clusters.

      Dynamically scripted web servers - if one session corrupts some information, how far up the chain do things restart? Right now, I see apache or maybe just the CGI program itself rebooting. These guys are talking about rebooting single components of that CGI program - single components of JBoss, for example. OS kernels or regular applications are nice and stable - but you know the absolute junk that comes in over an HTTP connection?

      An application they were talking about was designing for failure. There was a Java implementation a few years ago that, instead of using a garbage collector, simply leaked memory until it crashed. Ran incredibly fast, except the VM had to be rebooted every few hours. Cheapen the cost of that reboot, cluster appropriately, and Java-without-GC becomes extremely efficient - more efficient than ANY other dynamic scripting out there.

      Or another example: the clusters that (insert major search engine) run. Statistically, each machine fails every so often - when there are thousands of machines, there's no way to avoid it. So, instead of going through a hard crash-reboot cycle whenever something fails, these clusters (today!) do a rolling reboot - every so often, one machine gives away its connections, reboots, then rejoins the cluster. If this rolling reboot is designed as a feature IN ADVANCE, reliability soars, and each reboot is much less disruptive (0 lost connections!)

      Actually, I realize ROC is another point entirely. It's not about lengthening the time between reboots! In fact, this prof ignores that detail entirely. It's about minimizing the time between failure and restart - making the reboot as quick as possible, so the failure doesn't hurt as badly. Sure, your Linux box runs a year without rebooting, but then it crashes, runs a full fsck on reboot, and whatever else = hours of downtime. If the ROC computer automatically did less than a second of downtime/maintenance every day, it would have less downtime than your Linux machine by all the industry ways of measuring.
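The usual way to put numbers on this is availability = MTTF / (MTTF + MTTR). A rough sketch follows; the specific figures (one crash a year costing 2 hours, versus one 1-second micro-reboot a day) are assumptions chosen to match the example above, not measurements.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def yearly_downtime_seconds(mttf_hours, mttr_hours):
    """Expected seconds of downtime accumulated over one year."""
    return (1 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR * 3600


# Assumed: one crash per year costing 2 hours of fsck and restart.
long_uptime_box = yearly_downtime_seconds(HOURS_PER_YEAR, 2.0)
# Assumed: one planned micro-reboot per day costing 1 second.
micro_reboot_box = yearly_downtime_seconds(24.0, 1.0 / 3600)

print(f"long-uptime box:  {long_uptime_box:.0f} s/year")
print(f"micro-reboot box: {micro_reboot_box:.0f} s/year")
```

With these assumed inputs the long-uptime box loses roughly two hours a year and the micro-rebooting box roughly six minutes, even though it "fails" 365 times as often.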

      --

      A witty [sig] proves nothing. --Voltaire

    3. Re:Does SCI AM review articles properly nowadays? by Rick.C · · Score: 1
      Just because this article describes "a joint project of two of the most advanced CS departments in the world" doesn't mean we shouldn't examine it critically. Bowing before recognized authority and unquestioningly accepting whatever they say is certainly not the scientific way.

      As a non-academic (I got a BS in Psych in 1968 and have been a mainframe sysprog since 1971), I have noticed that many in the academic community have limited real-world experience. They tend to be thinkers, not doers. Although they publish their results among their peers, there seems to be little or no mechanism for assimilating experience from outside the academic community.

      So when they come out with what they feel is a revelation, others react with a ho-hum "Been there, done that." I implemented two fault tolerant systems in 1976-77. One shepherded an application system, making backups, storing data redundantly, handling errors, restoring, restarting, etc., all under automatic program control. The other was a monitor for a multi-application online transaction processing system that provided the graceful shutdown and component "micro reboot" function described in the article. That was 25 years ago.

      Did I "publish my research" back then? No, it was the property of the company I worked for and besides, I felt that if -I- could figure this stuff out (a Psych major, remember?), then it must not be rocket science.

      My point here is that -nobody- should be held in such high esteem that their pronouncements must be accepted as gospel. This is especially true of "the most advanced CS departments in the world." And as scientists, I bet they'd be the first to agree with that statement.
      --
      You were 80% angel, 10% demon. The rest was hard to explain. - Over The Rhine
      "Math in a song is good."-Linford
  46. Not Just In DataBases by the-dude-man · · Score: 1

    In databases, you have your actions, and when a sequence of events starts, they are committed at the end of the event cycle. When you change things, there is a sequence of events that leads to a "stable" state. When the stable state has arrived, you commit.

    This is actually exactly what iptables does...there is even a commit command at the end of every ruleset after all exceptional circumstances have been handled.
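The commit-at-a-stable-state pattern described above looks roughly like this with Python's built-in `sqlite3` module. The `accounts` table, the `transfer` function, and the names and amounts are made up for illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 20)])
    conn.commit()
    return conn


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


def transfer(conn, src, dst, amount):
    """Move money between accounts; either every step is kept or none is."""
    try:
        with conn:  # commits if the block finishes, rolls back on an exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            (left,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                   (src,)).fetchone()
            if left < 0:
                raise ValueError("insufficient funds")  # not a stable state
        return True
    except ValueError:
        return False


conn = make_db()
print(transfer(conn, "alice", "bob", 30), balances(conn))
print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer fails halfway through, and the rollback leaves both balances exactly as they were after the first one.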

  47. DWIM by PhilHibbs · · Score: 3, Funny

    We've had RISC, MMX, VLIW, SSI, maybe it's time for DWIM processors.

  48. English by rf0 · · Score: 1

    I wouldn't worry about your English. It's better than that of some native speakers I've seen.

    Rus

  49. Dude, read the article... by Anonymous Coward · · Score: 0

    Computers sure are a lot more complicated, you can't argue with that. But the article just said they're more complicated to operate.

    I guess typing "C:>DIR" is easier than clicking on explorer and selecting "DETAIL" view.

  50. [OT] by gazbo · · Score: 0
    Apache. Well, Apache's a good example for everything, really. You Slashdot zealots can take your Linuces and Mozillae and shove them up your ass; if you want a flagship to parade for OSS, for God's sake use Apache.

    It's free, it's flexible, it's powerful and it is extremely popular. It's even pretty damn easy to set up. No other OSS comes close.

    You know it makes sense.

  51. OSQ by Hell+O'World · · Score: 1


    Ahhhh! Undo! Undo!

  52. Re:Hmm. What about atari,commodore,color-computer? by lcsjk · · Score: 1

    My RS "Color Computer" ran 0.9MHz, 8bits, and OS was BASIC. I had Telewriter-64 and some Spreadsheet.
    Just in clock speed alone, 3e9/0.9e6 = 3333. A 32 bit machine would make that 4X or 13,333 which is over 10,000. For the functions it had, it was more complex to use than MS Word or Open Office word. Only problem is that it still does not type any faster.

  53. It's more than that... by moogla · · Score: 1

    It's about having OS hooks to allow for introspection, subsystem management, etc. on a more fine-grained level.

    The software can tell the OS: I have three major components (even though I'm singly-threaded), and they each require such and such devices, and such and such memory, etc. If anything looks outside the parameters I can give you, then call this MAGIC FUNCTION and I'll give it a good whack to make it right again.
    Or if such and such hardware device I needed fails, I can take corrective action. Maybe I start listening on network card eth1 when before I was listening to eth0.

    etc.

    --
    Black holes are where the Matrix raised SIGFPE
  54. Better than recovering from a crash... by Glock27 · · Score: 1
    Don't crash in the first place.

    Many of these issues are best addressed at the hardware level, IMO. First of all, the software people don't have to worry about it then! ;-) For instance, look at RAID as a good example of reliable hardware (especially redundant RAIDS;). It is possible, using ECC memory and cache, and multiple CPUs, to be quite sure you're getting the correct results for a given calculation. You can also provide failover for continuous uptime.

    Some of the rest of the article addressed issues of recovering from software errors as well. The first step is encouraging use of languages that don't constantly result in mechanical errors (stack exploits, wild pointers, freeing already freed space etc.). Many such solutions exist, from "safe" languages like LISP and Ada to managed languages like Java and Java--++ (C#). It is a much better approach to be able to design software as though the system is reliable, rather than working around an unreliable system.

    All that said, an interesting approach to server software I ran across recently is Prevayler. A nice, simple, lightweight object persistence scheme. There is also a good article on it here. Prevayler is able to recover from system crashes quickly using a saved state and a journal file. Neat stuff!
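The "saved state plus journal" recovery scheme mentioned above can be sketched generically. This is not the actual API of that library; `JournaledStore`, its file names, and the JSON format are assumptions made for the example.

```python
import json
import os
import tempfile


class JournaledStore:
    """Hypothetical key-value store recovered from a snapshot plus a journal."""

    def __init__(self, directory):
        self.snapshot_path = os.path.join(directory, "snapshot.json")
        self.journal_path = os.path.join(directory, "journal.log")
        self.data = {}
        self._recover()

    def _recover(self):
        # Load the last snapshot, then replay every journaled change after it.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path) as f:
                self.data = json.load(f)
        if os.path.exists(self.journal_path):
            with open(self.journal_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value

    def set(self, key, value):
        # Write the change to the journal (and to disk) before applying it.
        with open(self.journal_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value

    def snapshot(self):
        # Write the snapshot atomically, then start a fresh journal.
        tmp_path = self.snapshot_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.data, f)
        os.replace(tmp_path, self.snapshot_path)
        open(self.journal_path, "w").close()


with tempfile.TemporaryDirectory() as directory:
    store = JournaledStore(directory)
    store.set("a", 1)
    store.snapshot()
    store.set("b", 2)                      # only in the journal
    recovered = JournaledStore(directory)  # simulates a restart after a crash
    print(recovered.data)
```

If the process dies between writing the snapshot and truncating the journal, replaying the old journal over the new snapshot is harmless here because each entry simply sets a key to a value.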

    --
    Galileo: "The Earth revolves around the Sun!"
    Score: -1 100% Flamebait
  55. A Real Nostradamus by JCMay · · Score: 1

    1. It must be data oriented with no concept of instructions (just routing information), data flows in the system and transformed in a non-linear way, and the output will be all possible computations doable by the transformations.


    So, what would these transformations be other than... instructions? You could show me a list of "transformations" that the input data is to undergo to generate an output, and I'd show you a list of "instructions" that tell the computer what to do to the input data to generate an output.

    Furthermore, what you want is impossible-- "all possible combinations doable by the [yet uncounted] transformations." That's an arbitrarily large amount of work that requires an arbitrarily large machine and time to accomplish it.


    2. It must be based on a fully interconnected grid of very simple processing elements.


    Kinda like a Connection Machine, huh? Those are real new.


    3. The performance of said computer will be measured in terms of bandwidth not the usual MIPS. As you can see you will need a classical type computer to operate the described computer above so it will not totally replace it.


    Hrm. I suppose you've never noticed that memory buses are now specified by what amounts to a bandwidth number, as are IDE (ATA) bus family members. As to the "classical type" computer, again your prototype is the Connection Machine, circa 1983.
    1. Re:A Real Nostradamus by KingRamsis · · Score: 1

      So, what would these transformations be other than... instructions? You could show me a list of "transformations" that the input data is to undergo to generate an output, and I'd show you a list of "instructions" that tell the computer what to do to the input data to generate an output.

      no...instructions are dynamic, given to the computer as a series of parameterized steps, however transformations are static functions that the data may or may not pass by and typically transformations are applied to whole data in hardware (kinda like vector instructions)

      Furthermore, what you want is impossible-- "all possible combinations doable by the [yet uncounted] transformations." That's an arbitrarily large amount of work that requires an arbitrarily large machine and time to accomplish it.

      Let's just say I have a data bus that is connected simultaneously to two black boxes. The transformation will happen in parallel, but I can choose to discard the result that I don't want, so no waste of time will occur as you think.

      Kinda like a Connection Machine [sunysb.edu], huh? Those are real new.
      Well, first of all I didn't claim that I invented this concept, and secondly the CM computer was ahead of its time, but the problem is that most research is done on enhancing the implementation rather than replacing it.

      Hrm. I suppose you've never noticed that memory buses are now specified by what amounts to a bandwidth number, as are IDE (ATA) bus family members. As to the "classical type" computer

      I don't see how this contradicts my point of view? Of course data transfer components are measured in bandwidth, not MIPS.

      A final word: there are millions of great solutions waiting to solve future problems; it is just a matter of technology catching up. To give you an example, IIRC the math behind wavelet compression was done in the 18th century by someone who never imagined that his work would be used to compress images on digital computers. Maybe the people of Thinking Machines were on to something, but the crude technologies of their time didn't help them.

    2. Re:A Real Nostradamus by maraist · · Score: 1

      I always hate deciding who to reply to, knowing that most likely only the replyee will read the post.

      In any case, what is being described in these two posts is a simple combinational logic machine.

      I couldn't get a feel for whether either of you posters knew of their existence. I'll summarize by saying:

      * For combinational logic machines (reprogrammable-FPGA or write-once-ASIC), the "instructions" are provided to the machine at design time, not at run-time. They embed themselves as bus interconnects such that data exists as signal lines (possibly initiated from a memory-latch), and propagates through logic-modules (of varying complexity contingent upon the feature-set). Thus the original poster was correct in his assessment of time-critical / bandwidth-sensitive design. A given program requires a specific timing sequence (unless latches are fully utilized), and a minimum bandwidth (number of interconnects). Thus it may or may not run successfully on a given piece of hardware.

      * As you pointed out, this sort of technology is not new. All CPU's are designed with a similar approach (though there is greater freedom in bandwidth / substrate-module-complexity etc).

      * Graphics cards, specialized chess machines, etc. all use this sort of logic-level programming today with over-the-counter resources.

      --
      -Michael
    3. Re:A Real Nostradamus by JCMay · · Score: 1

      Yep, I know what combinational logic machines are-- I've built several. They're logic systems where the output is dependent only upon the current states of the inputs: that is, they're memoryless. They contrast with sequential logic systems that have "memory" and the current outputs are dependent not only on the current input values, but also on previous output values.

      My point was that his fuzzy concept of "operations" is nothing more than the idea of "instructions." A short while ago I did a paper design and simulator for a "one-instruction" computer based on Douglas Jones' Ultimate RISC. I've seen several people argue that it's not a "one-instruction" machine at all; the memory-mapped ALU operations are individual instructions where the opcode is encoded in the destination address.

  56. But operating them is much more complex? by fbg111 · · Score: 2, Insightful

    But operating them is much more complex.

    I disagree. Feature for feature, modern computers are much more reliable and easy to use than their vacuum-tube, punch card, or even command-line predecessors. How many mom and pop technophobes do you think could hope to operate such a machine? Nowadays anybody can operate a computer, even my 85 year old grandmother who had never touched one until a few months ago. Don't mistake feature-overload for feature-complexity.

    --
    Flying is easy, just throw yourself at the ground and miss. -Douglas Adams
  57. My RS6000 actually... by Anonymous Coward · · Score: 0

    ...has some of this mythical, magical "pixie dust".

    It has "chipkill" ECC parity memory so that bit errors get autocorrected via the usual ECC method, plus if any chips go bad, the system recognizes that and maps around the faulty chips, while keeping on running.

    It has multiple processors and is able to disable an individual cpu should it go bad... still the system will keep running. Even has a pair of "service processors" to manage the general purpose processors.

    It has multiple power supplies.

    It has a pair of mirrored hard drives for the AIX operating system exclusively to reside upon... even swapspace is mirrored.

    It has a big RAID5 array for data and apps with dual SSA controller cards and redundant cabling.

    It has multiple network interfaces, naturally.

    It even has dual graphics controller cards fer crying out loud.

    Of course all the filesystems are JFS or JFS2 journalled filesystems.

    The Oracle database engine running on it uses multiple transaction logs for transactions and rollback capability.

    The financial application proggies running on it... well, I won't go there today :-)

    Pixie Dust? Well... uptime on it today is 329 days, 10 hours and some minutes. I've never needed to reboot it since the day it was installed and powered up. I've even applied several o/s patches, all of which were done hot.

    This machine is proving to be as stable as some of the FreeBSD boxes I used to run years ago.

    Pixie Dust indeed.

  58. Self-paying computers by sdack · · Score: 1, Insightful

    The moment you buy them, they add to the profit ...

    To make components reset themselves or to let them memorize states for the purpose of undoing work is the approach of those not involved.

    The need to reset a component is because it has reached a state where it stops responding to any input. Or in other words, the component depended on receiving correct input without checking the input according to its state and thus locked itself up.
    An undo operation on the other hand would lead to components accepting any input and to reach any state (even the undefined one) but with the need to memorize their previous states. Other components making use of them now would have to ignore the operability of these components and to memorize the previously issued actions on their part to be able to undo them.
    The only component being able to start an undo would be the button on the GUI the user can click on.

    It is a very interesting concept, giving all power to the user in front and to let him/her decide whether the computer is in an invalid state or not. It would be a radical change in the history of computer science. A user would not anymore be a slave to the blue screen (or a kernel panic) demanding a confirmation of the unavoidable reset!
    Everything would have to be redesigned and reimplemented. Reuse of old, existing components would of course be impossible and errors in the final product are only because of imperfect programmers and will be solved through updates and newer releases.

    Sven

  59. Re:Ah, youth... by Idarubicin · · Score: 2, Interesting
    I mean, you can actually have your *mother* operate a computer today.

    Do we have to keep using this tired old notion of little old (middle-aged, for the /. crowd) ladies cringing in terror when faced with a computer?

    My mother has a B.Math in CS, acquired more than a quarter century ago. Her father is pushing eighty, and he upgrades his computer more often than I do. When he's not busy golfing, he's scanning photographs for digital retouching. (In his age bracket, a man who can remove double chins and smooth wrinkles is very popular.)

    The notion that women and/or the elderly are unable to use computers is a generalization that just doesn't hold much water anymore. Maybe some of these people are frightened of (or frustrated with) computers because their exposure to technology is through the 'typical'* arrogant, smug, condescending /.er--concealing his embarrassment over being unable to get a girlfriend behind clouds of technobabble.

    *How does it feel to be the target of an unfair stereotype?

    --
    ~Idarubicin
  60. Oh yeah. by schnitzi · · Score: 2, Funny
    Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. You all have experienced a PC crash or the disappearance of a large Internet site.


    Oh yeah. My TRS-80 used to NEVER crash twenty years ago when I accessed LARGE INTERNET SITES.

    --



    I object to that article, and to the next reply.
  61. MORE bubblegum and spit vs. engineering by alispguru · · Score: 1
    Why are desktop computing systems fragile? Because in the marketplace, they are judged on exactly two criteria:

    How big is the check I'm writing right now?

    How fast is it?

    With these as your evaluation function, you are guaranteed to get systems with little redundancy and little or no internal safety checks.

    One regrettable example of this is the market for personal finance programs. The feature that sells Quicken is quick-fill - the heuristic automatic data entry that makes entering transactions fast. Never mind that Quicken's register file is fragile - it frequently loses track of balances (requiring the moral equivalent of fsck), and every few years the accumulated unfixed cruft causes a major failure, requiring insane fixes like exporting all your data as QIF files and reimporting it into a new register.

    If Quicken were back-ended into a real database, with real transactions, real consistency checks, and real crash recovery, all this would go away. But it would make Quicken slower and require more hardware horsepower to run it - the marketplace would punish them for improving their lives.

    What the original article is proposing is:

    We accept that systems will always suck

    Therefore, we should build multi-level suckage damage control into them

    Another possibility is:

    We accept that there is a tradeoff between system speed and safety

    Therefore, we take the speed hit where safety is important

    --

    To a Lisp hacker, XML is S-expressions in drag.
  62. Hasn't this already been done? by cjmnews · · Score: 1

    I thought Unisys had modified their AT&T UNIX(r) to perform on the fly save points, and when an error occurred, the OS would roll back to the savepoint and re-execute the steps again. The theory was that these errors would only occur if there were several events happening at the same time. By rolling back and re-executing the steps, one or more of the events would not be happening at that time.

    They claimed to reduce kernel panics by 80% this way.

    I am not sure how an event could not occur when re-executing the same steps, since it's the "same steps". It's been a few years since I was told about this, and I may be remembering incorrectly.

    --
    You can lose something that is loose, so tighten the loose item so you don't lose it.
  63. nothing new here by plopez · · Score: 1

    micro-rebooting - Apache has been doing that for years.

    undo - transaction rollbacks in data bases.

    injecting test errors - how does this differ from automated testing suites?

    better tools for pinpointing problems - just an incremental improvement.

    Nothing really new here, just an extension of existing technology. All of these have been solved in other areas a long time ago.

    my $.02

    --
    putting the 'B' in LGBTQ+
  64. Re:Ah, youth... by Corgha · · Score: 1

    And the debut of Unix 20 years ago.

    Just to set the record straight, I think you mean more than 30 years ago, unless you're talking about the debut of XENIX.

  65. Normally I like Sci Am articles, but by PotatoHead · · Score: 1

    this one rubs me the wrong way.

    This does not seem to be leading edge research. Someone else posted their suspicion that the team was working with win32 systems looking for a better way.

    I agree with them.

    I think these guys are looking at win32 systems and wishing they were Unix ones.

    1. 'Micro Reboots' - Can you say '/etc/init.d' ? Example, my Linux machine sometimes chokes on sound. I could reboot the machine, but instead, I just restart the sound service. '/etc/init.d/sound restart'

    2. 'Better tools to pinpoint the sources of faults' Can you say '/var/adm/SYSLOG or Messages?' Anything you want to know about the machine ends up there. If you take a proactive approach to your log, you are going to notice things before they bring the whole thing down. Maybe we could have better logging, but still this is execution of what we know, not anything new.

    3. 'We need an Undo' This one is very easy to setup yourself. It could be automated to a point, but really isn't this just a backup. Too much undo and you can't get anything done. Not enough and you need to know something to get the machine running again. Seems to be that a quality analysis of the system and its potential faults would yield a list of data to be incrementally watched and archived to achieve the same results.

    4. I will give them a little credit for this one, though I am not sure I agree. This part of the puzzle would happen as part of #3 for the most part.

    I have a big problem with #4 in that it makes the assumption that we are smarter than the tech. I am not sure we are. We can build it, but we don't always have a clue as to what it will do because there are too many subtle interactions to account for.

    Better to let the machine bitch about the issue and be well prepared to deal with it. After a while, you and your machine will understand each others issues and all will be fine.

    So, in the end, these folks are wishing they had a well planned and configured Unix system when they actually have something less...

    Why not take those first three ideas and build a Linux that executes them nicely? Maybe people will prefer it to what we have now --maybe not. That would be some research.

    This just isn't as big of a deal as it looks. (Sorry guys)

  66. Re:Ah, youth... by Anonymous Coward · · Score: 0

    " The notion that women and/or the elderly are unable to use computers is a generalization that just doesn't hold much water anymore. "

    Thus proving the original poster's point.

    Which, I'm sure, you didn't intend on doing. But I only pointed it out because I am the typical arrogant, smug condescending /.er. Why? Because I'm smarter than you, richer than you, better looking than you, and yes, I get more women than you.

    Life's a bitch, and then you marry one.

  67. What? by t0ny · · Score: 1
    Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex

    Huh? How is double-clicking an icon more complex than typing in LOAD "*",8,1 or having to load something from a tape drive? If by complex you mean you can do more, I agree, but as far as ease of use goes, it's nowhere near as complex or counter-intuitive.

    --

    Manipulate the moderator system! Mod someone as "overrated" today.

  68. Speed claims by grahamlee · · Score: 1

    This item claims that computers have become 10000 times faster in the last 20 years, but I hasten to disagree. Application of Moore's Law suggests they have become 10321 times faster... ;-)
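For anyone checking the arithmetic: the figure above falls out of assuming the popular "doubling every 18 months" reading of Moore's Law, which is an assumption about what the poster meant. The function name below is made up for the example.

```python
def moore_speedup(years, doubling_period_years):
    """Growth factor after `years` if capacity doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


print(round(moore_speedup(20, 1.5)))  # 18-month doubling -> 10321
print(round(moore_speedup(20, 2.0)))  # 24-month doubling -> 1024
```

The result is very sensitive to the doubling period: stretching it from 18 to 24 months drops the 20-year factor by about a factor of ten.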

  69. Reliability by john_roth · · Score: 1

    The authors state as a given that crashes will happen.

    I remember a few years ago (well, more than a few) I had this conversation with IBM MVS architects in an argument about whether MVS should boot faster. Their position was that it shouldn't crash in the first place.

    So what's widely acknowledged to be the least likely system to crash? IBM Mainframes.

    John Roth

  70. Whatever by Anonymous Coward · · Score: 0

    Consumers have proven that they will buy just about any piece of crap shoved under their nose, provided that it's cheap.

  71. Nothing new. by pmz · · Score: 3, Insightful

    micro-rebooting; using better tools to pinpoint problems in multicomponent systems; build an "undo" function...

    I think they just invented Lisp :). I don't program in Lisp, but have seen people who are very good at it. Quite impressive.

  72. Write 3 times... by gurps_npc · · Score: 1
    Was an old version of backup. Everything was written 3 times and read 3 times. When a single read came back different, the odd read was thrown out and rewritten the way the other two were written.

    To extend this idea, we could do a RUN 3 times system.

    You would have three operating systems, each running a java type processor. Send the java instructions to all 3 machines, and hopefully they should return identical results. If two match and one is different, then throw out the odd result.

    Now a bug has to screw up at least 2 of the three operating systems, much less likely.
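The voting step of such a run-it-three-times scheme is short to write down. A minimal sketch, assuming the three results are plain comparable values; the `vote` function name is hypothetical.

```python
from collections import Counter


def vote(results):
    """Return the value a strict majority of replicas agree on, else raise."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among results: {results!r}")
    return value


print(vote([42, 42, 41]))   # one faulty replica is outvoted
try:
    vote([1, 2, 3])         # all three disagree, so nothing can be trusted
except RuntimeError as error:
    print(error)
```

Note that voting only masks a fault when the replicas fail independently; three copies of the same buggy code fed the same input will happily agree on the wrong answer.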

    --
    excitingthingstodo.blogspot.com
  73. I know! Run two cron daemons! by KPU · · Score: 1

    Just write a process that has a sleep function and checks for other copies of itself and the cron daemon. If it doesn't see the cron daemon, start it. Run arbitrary numbers of this process for added guarantee.

  74. IBM mainframes have this 'UNDO' circuitry by zymano · · Score: 1

    If I am not mistaken, I read somewhere that IBM is using this in their mainframe processors. Sounds good, but do you want a faster computer chip or more safety features? The quandary.

  75. Re:Magic Server Pixie Dust...yes by Anonymous Coward · · Score: 0

    Check this out for more details - http://www.research.ibm.com/journal/rd/435/spainhower.html

    Circuit-level fault detection and recovery including "instruction-retry capability"!

  76. Still doesn't mean it's worthwhile. by Anonymous Coward · · Score: 0

    I'm not the original poster you responded to and I'm posting anonymously so you can't accuse me of trying to score karma.

    I'm of the opinion that a large number of the naysayers do have valid points. Detecting and responding to failures can be done in a relatively easy and cost-effective manner on today's hardware. Especially compared to a complete overhaul and redesign that relies on untested methods and practices. If software and hardware is designed to fail safe, with intelligently designed journaling software and multiple redundant hardware - anything else is overkill.

    Given the law of diminishing returns, the solutions that businesses buy and that individuals use are those that work well enough to do the job at hand - and no more. If companies like Amazon.com and Google.com can keep websites up 24x7 with no noticeable downtime on cheap commodity hardware, why would they need this technology?

    Okay, perhaps there are some military or medical applications that could use this, but it's an unproven solution that's bound to be vastly more expensive than the one it replaces, for very little improved reliability.

    I wish them luck, perhaps when they've sold their first couple of installations it'll be worth revisiting. No offense to my former CS department (which has no relation to the article), but there's a reason that the phrases "that's academic" and "if you can't do - teach" are both not compliments. It may not be realistic to automatically give them credit merely because it is a university research project. The environment of a CS department sometimes is not the best for new ideas - ideological inbreeding and isolation from the real world (where the bottom line is $$$) sometimes give strange tilts to the work that comes out of them.

  77. What? by TheLastUser · · Score: 1

    "Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex."

    Spoken like someone that has never had to choose an interrupt for their new sound card.

  78. Evaluate train operators? by mypalmike · · Score: 1
    ...injecting test errors to better evaluate systems and train operators
    So this should help reduce the number of railroad accidents in the future, right?

    --
    There are 0x40000000 types of people: those who understand 32-bit IEEE 754 floating point, and those who don't.
    1. Re:Evaluate train operators? by Anonymous Coward · · Score: 0

      No, just to evaluate people who operate systems and trains. Yearly performance review.

  79. (*Yawn*) Prior Art by jpellino · · Score: 1

    "On board the ship, everything was as it had been for millennia, deeply dark and silent.
    Click, hum.
    At least, almost everything.
    Click, click, hum.
    Click, hum, click, hum, click, hum.
    Click, click, click, click, click, hum.
    Hmmm.
    A low-level supervising program woke up a slightly higher-level supervising program deep in the ship's semisomnolent cyberbrain and reported to it that whenever it went click all it got was a hum.
    The higher-level supervising program asked it what it was supposed to get, and the low-level supervising program said that it couldn't remember what it was meant to get, exactly, but thought it was probably more of a sort of distant satisfied sigh, wasn't it? It didn't know what this hum was. Click, hum, click, hum. That was all it was getting.
    The higher-level supervising program considered this and didn't like it. It asked the low-level supervising program what exactly it was supervising and the low-level supervising program said it couldn't remember that either, just that it was something that was meant to go click, sigh every ten years or so, which usually happened without fail. It had tried to consult its error look-up table but couldn't find it, which was why it had alerted the higher-level supervising program of the problem.
    The higher-level supervising program went to consult one of its own look-up tables to find out what the low-level supervising program was meant to be supervising.
    It couldn't find the look-up table.
    Odd.
    It looked again. All it got was an error message. It tried to look up the error message in its error message look-up table and couldn't find that either. It allowed a couple of nanoseconds to go by while it went through all this again. Then it woke up its sector function supervisor.
    The sector function supervisor hit immediate problems. It called its supervising agent, which hit problems too. Within a few millionths of a second virtual circuits that had lain dormant, some for years, some for centuries, were flaring into life throughout the ship. Something, somewhere, had gone terribly wrong, but none of the supervising programs could tell what it was. At every level, vital instructions were missing, and the instructions about what to do in the event of discovering that vital instructions were missing, were also missing.
    Small modules of software - agents - surged through the logical pathways, grouping, consulting, regrouping. They quickly established that the ship's memory, all the way back to its central mission module, was in tatters. No amount of interrogation could determine what it was that had happened. Even the central mission module itself seemed to be damaged.
    This made the whole problem very simple to deal with, in fact. Replace the central mission module. There was another one, a backup, an exact duplicate of the original. It had to be physically replaced because, for safety reasons, there was no link whatsoever between the original and its backup. Once the central mission module was replaced it could itself supervise the reconstruction of the rest of the system in every detail, and all would be well.
    Robots were instructed to bring the backup central mission module from the shielded strong room, where they guarded it, to the ship's logic chamber for installation.
    This involved the lengthy exchange of emergency codes and protocols as the robots interrogated the agents as to the authenticity of the instructions. At last the robots were satisfied that all procedures were correct. They unpacked the backup central mission module from its storage housing, carried it out of the storage chamber, fell out of the ship and went spinning off into the void.
    This provided the first major clue as to what it was that was wrong.
    -- DNA, MH (hhgg5)

    --
    "Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."
  80. Put an admin out of a job by JWhitlock · · Score: 1
    This concept isn't particularly new. It's easy to write a script that will check a particular piece of the system by running some sort of diagnostic command (e.g. netstat), parse the output, and make sure everything looks normal. If something doesn't look normal, just stop the process and restart, or whatever you need to do to get some service back up and running, or secured, or whatever is needed to make the system normal again.

    "Undo" feature? That's what backups are for.

    Perhaps you missed the point. Any person who has been administering computers for 10 years should be able to write that script and perform those backups and get it right in about a month or so.

    Or, you could make automatic recovery and undo features part of the operating system. It's not easy, but it only has to be done once, and it would just do the Right Thing.

    Linux is a better Unix than we had 30 years ago, but we really need a new generation of operating systems, where the shotgun at least has a safety. I wonder if it can be done within the Linux framework, or whether we are talking about a whole new operating system.

  81. See the project homepage by jrstewart · · Score: 1

    http://roc.cs.berkeley.edu/

    I haven't read the SciAm article so I'm not sure what spin they put on it, but it's actually a very reasonable idea.

    The idea is two-fold:

    (a) When trying to maximize reliability, it might actually be better in terms of total downtime to reduce recovery time rather than improve reliability. Take a system which crashes 5 times a year on average and takes an hour to go back online each time it crashes. Your total downtime is 5 hours/year. If you fix one place where the system crashes your total downtime will go down to 4 hours/year. But maybe for the same effort you can reduce the recovery time from 1 hour to 45 minutes. That's 3.75 hours/year of downtime. This is the kind of tradeoff that a lot of reliability engineering people don't think about, but should.

    At the limit, if you had a file server that could recover within 5 seconds, who cares if it crashes twice a day? That's a short enough interval that the clients will automatically retry and succeed.

    (b) You have to design the recovery path anyway, since you have to assume that sometimes your system crashes. You could also design a clean shutdown / startup path OR you could put all of that effort into making your recovery path that much faster and more effective.

    Not having a "clean" shutdown path also has the benefit that every time you restart the system for any reason you are testing your recovery logic.
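
    The arithmetic in (a) can be checked with a few lines of Python. This is only a sketch: `annual_downtime_hours` is a hypothetical helper, and it assumes every crash costs the same recovery time.

```python
def annual_downtime_hours(crashes_per_year: float, recovery_hours: float) -> float:
    """Total yearly downtime if every crash takes the same time to recover from."""
    return crashes_per_year * recovery_hours


# Numbers taken from the example in (a).
baseline = annual_downtime_hours(5, 1.0)          # 5 crashes, 1 hour each
fewer_crashes = annual_downtime_hours(4, 1.0)     # one crash source fixed
faster_recovery = annual_downtime_hours(5, 0.75)  # recovery cut to 45 minutes

print(baseline, fewer_crashes, faster_recovery)
```

    Under these assumptions the faster-recovery option gives the lowest total, which is the tradeoff the post is pointing at.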

  82. Re:No clue by Realistic_Dragon · · Score: 1

    And your ABS ECU doesn't have CTRL ALT DELETE, does it?

    Since BMW were looking at using embedded WinCE in their cars, one day it may well do.

    Just another reason not to drive a BMW then ;o)

    --
    Beep beep.
  83. I have been doing this for 4 years by Anonymous Coward · · Score: 0

    This article has no specifics...but clearly a Turing Machine does have a state where it cannot continue to function, and I believe that this can be proved using the diagonal method and set theory...

  84. ROC, Slashdot, and Education by DavePatterson · · Score: 1

    As a professor, I can't help but think that some Slashdot responses are disguised pleas for help. Therefore, let me offer some guidance:

    * The factor of 10,000 performance improvement in 20 years is not the focus of the article, but if you are interested in where it came from, please see a book. On page 3 of Computer Architecture: A Quantitative Approach, 3rd edition, (http://www.amazon.com/exec/obidos/ASIN/1558605967/), Figure 1.1 shows a performance improvement of a factor of 1.58 per year between 1984 and 2001 using a few generations of the SPEC benchmarks. That is a factor of 2400. If we add 3 more years at the same rate, we get a factor of 9600. QED.

    * Backup is one of the 3Rs of system administrator undo that we are pursuing, but it is not all of them. The 3Rs are Rewind, Repair, and Replay. Backup gives us Rewind, but not Repair or Replay. It is also different from ACID transactions, which operate at a very low level of the system. We are interested in undo of higher level "verbs" that correspond to high-level user actions. If you want to learn about our undo ideas before you need to reply, see http://roc.cs.berkeley.edu/papers/sigops-ew2002-undo.pdf.

    * TMR stands for Triple Modular Redundancy, which is an effective but expensive technique to protect from hardware failures. If hardware failures were the leading problem, then TMR would be the path to follow. Hardware errors are responsible for only 15% of the outages, as those who have read the Scientific American article already know. TMR and systems like HP's (nee Tandem's) NonStop do not address operator error.

    * We are focused on Internet style applications that are considerably above the operating system, but the problems we have documented about operators being a major source of outages include all systems, including Linux systems. To learn about hard to find data about causes of failures before you reply, please see http://roc.cs.berkeley.edu/papers/usits03.pdf.

    * We agree that the telephone industry did many fine things to make communication dependable, and that there is much to learn and emulate from them. If computers were as reliable as telephony, we could be much prouder of our field.

    * Our focus is on Internet services, so the cost of ownership is probably higher for such servers than for PCs. I wouldn't be surprised, however, that if you multiplied a typical white-collar hourly pay rate times the average number of hours that one spends administering a PC, you may get similar results.

    * For those of you who were not using computers in 1983, that is the era of open source UNIX software (BSD) on 32-bit computers (VAX). Sound familiar? Punched cards had been passé for quite a while in 1983.

    * For those wanting to read something with more technical depth, see http://roc.cs.berkeley.edu/papers/ROC_TR02-1175.pdf. For the Slashdot readers who only have time for a quick overview, see the Scientific American article (www.sciam.com/article.cfm?chanID=sa006&articleID=000DAA41-3B4E-1EB7-BDC0809EC588EEDF). For those who only have time to read Slashdot, may God protect you on your journey towards technical obsolescence.
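
    The compounding behind the first point above can be reproduced in a couple of lines of Python. `cumulative_speedup` is a hypothetical helper; the 1.58 annual factor and the year counts are the ones quoted from Figure 1.1.

```python
def cumulative_speedup(annual_factor: float, years: int) -> float:
    """Compound a constant per-year improvement factor over a number of years."""
    return annual_factor ** years


# 1984 to 2001 is 17 years of 1.58x per year: a little under 2400.
seventeen_years = cumulative_speedup(1.58, 17)

# Three more years at the same rate: roughly 9400, the same ballpark
# as the rounded 9600 quoted above (2400 times about 4).
twenty_years = cumulative_speedup(1.58, 20)

print(round(seventeen_years), round(twenty_years))
```

    The small gap between the computed value and 9600 comes from rounding 1.58 cubed up to 4 and the 17-year figure up to 2400.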

  85. Agreed by mr_zorg · · Score: 1

    If only more people realized this.

  86. This is an OLD idea by !Squalus · · Score: 1

    Compaq was working on this technology ages ago. The idea was that the computer would self-report imminent failures. It has moved up a notch, but only a notch. Micro-rebooting - there's a concept! Narf!

    --
    All Ad hominem replies happily ignored as the sender shall be deemed to lack the faculties to comprehend the equation.
  87. Reminds me of Norton's Crashguard by maccentric · · Score: 1

    Norton's Crashguard (aka CrashHard) was supposed to help you recover from crashes in Mac OS 8, but in actuality it caused more crashes than it cured.

  88. Vote parent up; it's from a prof on the project! by ToastyKen · · Score: 1

    The parent post is from someone who actually knows what they're talking about, and it's got a score of 1 right now. Would any moderators care to correct this?