Slashdot Mirror


UNIX Process Cryogenics?

shawarma asks: "Due to a recent power outage, I've had to shut down a server running a process that had been running for ages calculating something. The job it was doing would have been done in a few days, I think, but I had to shut it down before the UPS ran out of juice. This got me thinking: Why can't I freeze down the process and thaw it back up at a later time? It ought to be possible to take all the connected memory pages and save them in some way, preserve file handles and pointers, and everything. Maybe net-connections would die, but that's understandable. Has any work been done in this field? If not, shouldn't there be? I'd like to contribute in some way, but I think it's a bit over my head." Laptops have been doing this in some form for years: most laptops, when they run out of power or when told to by the user, will go into "suspend" mode, which is similar to what the poster is describing. Outside of laptops, however, I haven't seen this done. Sleeping processes also do something similar, sending their memory pages into swap so other running processes can use the memory. What, if anything, is preventing someone from taking this a step further?

163 of 555 comments

  1. the mode you are speaking of by Stone+Rhino · · Score: 2, Informative

    is not suspend, it is hibernate. Suspend will power down the computer except for the energy needed to keep the RAM alive. Hibernate will save all data from memory to disk. I, personally, use neither on my laptop.

    --


    Remember, there were no nuclear weapons before women were allowed to vote.
    1. Re:the mode you are speaking of by Timbo · · Score: 2, Informative

      What you refer to as suspend is what most people (and APM) call standby. What you call hibernate is what APM refers to as suspend. I believe Windows uses the term hibernate to refer to a software suspend function.

    2. Re:the mode you are speaking of by SilentChris · · Score: 2, Informative

      *Sigh*. More people with very little experience with laptops. Read the mini-faq, people.

  2. Saving application state by cheezehead · · Score: 2, Insightful

    Of course, you could write your application so that it saves state at regular intervals (aka checkpointing). Especially with calculations you should be able to store intermediate results.
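
A minimal sketch of the checkpointing idea above, assuming the calculation's state fits in a small JSON document. The names `save_checkpoint`, `load_checkpoint`, and `sum_of_squares`, and the `every` and `stop_at` parameters, are invented for illustration and are not from any library. The checkpoint is written to a temporary file and renamed into place so that a power cut in the middle of a save should leave the previous checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over path in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before the rename
    os.replace(tmp, path)     # readers see the old file or the new one


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def sum_of_squares(n, path, every=1000, stop_at=None):
    """Sum i*i for i in range(n), checkpointing every `every` iterations.

    stop_at simulates an outage: the loop abandons work at that index.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    i, total = state["i"], state["total"]
    while i < n:
        if stop_at is not None and i == stop_at:
            return None  # work done since the last checkpoint is lost
        total += i * i
        i += 1
        if i % every == 0:
            save_checkpoint(path, {"i": i, "total": total})
    return total
```

Calling `sum_of_squares` again with the same path after an interruption picks up from the last saved index rather than from zero.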

    --

    MSN 8: Now Microsoft even has bugs in their ad campaigns.

    1. Re:Saving application state by Binestar · · Score: 2

      Seems to me this guy has figured out that as well. Only he only lost his "paper" once, and wants a solution to "save" it while working on it.

      Hibernation just happens to be the version of saving he wants to implement.

      --
      Do you Gentoo!?
  3. External dependencies by interiot · · Score: 3, Insightful

    External dependencies might include open files (what if you freeze, and then delete the file?), open TCP sockets to daemons elsewhere that wouldn't get frozen, sub processes, etc... These would probably have to be revived, but how?

  4. We do it in Condor by epaulson · · Score: 5, Informative

    http://www.cs.wisc.edu/condor/

    Free-as-in-beer, on most major UNIX platforms. Check out our publications, we have several that give all the details you'd need to write it yourself.

    Plenty of others, too - libckpt, there was a "Checkpointing Threaded Programs" paper at USENIX this past summer... there are some kernel patches that can do it, most of them under the GPL.

    1. Re:We do it in Condor by dsouth · · Score: 5, Informative

      As the poster said, there are plenty of others:

      • SGI IRIX and Cray UNICOS provide kernel-level checkpoint-restart.
      • Condor provides user-level checkpoint restart and process migration by manipulating libraries at runtime.
      • esky provides user-level checkpoint restart under Solaris and Linux via runtime library manipulation.
      • crak provides kernel-level checkpoint restart for linux.
      • cocheck provides user-level checkpoint-restart.
      • libckpt provides user-level checkpoint-restart.


      I'm sure I left several out. Checkpoint-restart has been part of the high-performance computing scene for years. Having been a sysadmin on large, high-performance computing platforms for the last few years of my professional life, my experiences with checkpoint-restart have been a mixed bag. All of the existing systems have limitations. Depending on the application, those limitations can be no problem, or they can be deal-breakers.
    2. Re:We do it in Condor by Anonymous Coward · · Score: 2, Funny
      The label "free as in beer" is misleading, due to the cultural differences between Wisconsin and other parts of the world.

      The people of Wisconsin are fat, stupid, drunken oafs. They consider themselves "America's Dairyland", although this title was taken from them many years ago by the state of California. This is not the only false claim to fame that the state keeps. Green Bay Packers fans consider their city to be "Titletown, U.S.A.", because of the numerous NFL championships the team has won. The numbers may seem impressive, but the majority of them were won when the NFL was a small league, and there was no playoff for the championship.

      Getting back to the fatness, they do produce and consume a lot of dairy, but this is not why they are called "cheeseheads". A little known fact is that most of Wisconsin's citizens are inbred, and even those that aren't inbred frequently suffer birth defects, due to maternal alcoholism. This results in a condition that produces small holes in the skull, where fluids escape and eventually congeal into small, yellow lumps, hence the term "cheesehead". Hence, the traditional Packer "cheesehead hat" is actually a symbol of Wisconsin's perseverance in the face of a world that looks down upon inbreeding.

      Getting to the point, Wisconsinites _crave_ beer to feed their alcoholism, so much so that beer is an extremely valuable commodity, despite the abundance of breweries throughout the state. In fact, the Leinenkugel's brewery of Chippewa Falls goes so far as to indicate the value of its beer on the label of their original lager -- "Leinie's Original" is "Good as Gold".

      So you see, we still haven't found an English word or phrase quite as good as "libre" -- "free as in beer" can be just as ambiguous as the word "free" is by itself.

  5. OS X needs this especially by kilgore_47 · · Score: 5, Interesting

    for the "Classic" environment. It seems so stupid watching macos9 boot up in a window when you want to use a classic program; Apple ought to save the state of the classic environment in to a file that could be quickly reloaded into ram when classic is called for. As the blurb said, laptops have had the suspend feature for years; would it really be so hard to apply the same concept elsewhere?

    --
    ___
    The way to see by faith is to shut the eye of reason. --Ben Franklin
    1. Re:OS X needs this especially by medcalf · · Score: 2

      Well, OS X certainly can sleep (both OS X and Classic go to sleep), putting to sleep also all processes. As to hibernating the Classic environment, I don't know how useful that would really be in the long run.

      --
      -- Two men say they're Jesus. One of them must be wrong. - Dire Straits
    2. Re:OS X needs this especially by Masker · · Score: 2

      Errrr... Without protected memory spaces, I _don't_ think that this is what you want. You'd actually be setting yourself up for more problems. You don't want to save the system's memory state unless you can be sure that it's relatively clean & safe...

      --

      ---------The early bird gets the worm, but the second mouse gets the cheese.

    3. Re:OS X needs this especially by iso · · Score: 2

      I think what he means is save the clean boot-up state of the classic environment (provided nothing has changed in the System folder since the last boot of classic). That way when classic needs to boot, OS X could just throw up a booted classic environment memory state in a matter of seconds instead of booting classic from scratch each time.

      - j

    4. Re:OS X needs this especially by medcalf · · Score: 2

      You'd have to define what you mean by "nothing has changed in the System folder", since prefs, for example, can change all the time. I suppose if you checked the image against the latest modification time of all files in the system folder, and threw away the image if the image was older than any file, it would work, but it seems that it could be pretty time consuming to do.
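
The modification-time check described above could be sketched roughly as follows. This is only an illustration: `newest_mtime` and `image_is_fresh` are hypothetical names, and nothing here reflects how Classic or OS X actually work.

```python
import os


def newest_mtime(folder):
    """Return the latest modification time of folder or anything under it."""
    latest = os.lstat(folder).st_mtime
    for root, dirs, files in os.walk(folder):
        for name in dirs + files:
            entry = os.path.join(root, name)
            # lstat so that a dangling symlink does not raise
            latest = max(latest, os.lstat(entry).st_mtime)
    return latest


def image_is_fresh(image_path, folder):
    """True if the saved image is at least as new as everything in folder."""
    if not os.path.exists(image_path):
        return False
    return os.path.getmtime(image_path) >= newest_mtime(folder)
```

The walk touches every file once, so its cost grows with the size of the folder, which is the time concern raised above.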

      --
      -- Two men say they're Jesus. One of them must be wrong. - Dire Straits
    5. Re:OS X needs this especially by Quixotic+Raindrop · · Score: 2, Interesting

      Which is funny, because VMware has exactly this capability.

      It needs some refinement, and sometimes it's slow when it picks back up again, but it generally works in my experience. It is obviously not only possible, but implementable using current technology.

      --
      Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
    6. Re:OS X needs this especially by suwain_2 · · Score: 2
      Maybe I'm misunderstanding you, but if not... I just got an idea that's just a slight twist of yours.

      Why not just suspend the entire system to the hard drive...? The system could simply read the way your memory 'should be', and quickly copy it over. I don't have a lot of experience with how things boot, but this seems like a good idea to me...? It should be limited only by your hard drive's speed...?

      --
      ________________________________________________
      suwain_2 :: quality slashdot p
    7. Re:OS X needs this especially by Suppafly · · Score: 2

      Why not just figure out a better way to run old apps than to boot up basically the entire old os.. windows2000 can run dos and win3.1 win95 etc apps without loading the entire old kernel/os. Realistically, Apple could do the same if they spent a little bit more time on the problem.. but then you wouldn't have all the other cool advances to MacOS..

      OS X needs this especially for the "Classic" environment. It seems so stupid watching macos9 boot up in a window when you want to use a classic program; Apple ought to save the state of the classic environment in to a file that could be quickly reloaded into ram when classic is called for. As the blurb said, laptops have had the suspend feature for years; would it really be so hard to apply the same concept elsewhere?

    8. Re:OS X needs this especially by ncc74656 · · Score: 5, Interesting
      Well, OS X certainly can sleep (both OS X and Classic go to sleep), putting to sleep also all processes. As to hibernating the Classic environment, I don't know how useful that would really be in the long run.

      I don't know how directly comparable this example might be, but I used to use VMware (under Linux) to suspend Win98 when I didn't need it. If I needed to do something under Win98 (like browse the web), VMware would load up Win98 where I last left it. It saved the minute or so of waiting for the VM to POST and load Win98.

      (If VMware provided better support for DirectX, I might not have needed to switch my home workstation from Linux to Win2K. It's been more than a year since I checked, though, so things might've improved.)

      --
      20 January 2017: the End of an Error.
    9. Re:OS X needs this especially by passion · · Score: 2

      You'd have to define what you mean by "nothing has changed in the System folder", since prefs, for example, can change all the time.

      Preferences get written and re-written all the time. In fact, classic versions of Mac OS can be booted w/out anything in the Preferences folder. You drop a good point here, but this is a poor example. I see no reason for it not to work just fine, and would love to see Apple implement this.

      --
      - passion
    10. Re:OS X needs this especially by Dwonis · · Score: 2

      It was called suspend-to-disk until Microsoft called it hibernate.

    11. Re:OS X needs this especially by The+Raven · · Score: 2

      Without protected memory space? Maybe I'm misunderstanding your disagreement, but OSX *does* have protected memory. It is OS9 and prior that do not.

      --
      "I will trust Google to 'do no evil' until the founders no longer run it." Hello Alphabet.
    12. Re:OS X needs this especially by Lazaru5 · · Score: 2

      W2K can run older Windows apps because it's all still Win32. OSX is Unix/Carbon/Cocoa and can't just run old MacOS apps at will, hence the Classic OS boots.

      A closer comparison would be how FreeBSD supports Linux binaries via a "thunking" layer -- translating Linux syscalls to BSD syscalls. Again, this is fairly easy to do since it's still the Unix API. The old Trumpet 32bit Winsock did the same thing by translating 32bit calls to 16bit ones.

      Apple could probably have easily made a thunking layer that would at least run classic binaries using Mach syscalls, but drawing the windows themselves might not have been as easy to support natively.

      Then again, can't LinuxPPC and friends run classic apps? Or PPC BeOS?

      --

      --
      My comments and opinions completely reflect those of anyone and anything I am remotely associated with.
    13. Re:OS X needs this especially by Jace+of+Fuse! · · Score: 2

      IBM has been calling it Rapid-Resume for years.

      --

      "Everything you know is wrong. (And stupid.)"

      Moderation Totals: Wrong=2, Stupid=3, Total=5.
  6. Re:Use Windows XP by mindstrm · · Score: 2

    They do exist in other systems, or at least, they work on other systems.

    My laptop has no problems suspending/hibernating linux.

    The question here is about process hibernation, not the whole box.

  7. BeOS? by ScumBiker · · Score: 2

    I had Be installed for a while and I thought it would do that. I do know I never lost anything due to it crashing. Of course, it didn't crash much. I think using a journaled file system or at least soft-updates would be a good start. Frankly, I have no idea how to code something similar to Win XP hibernate. Shouldn't be that hard though.

    --
    --- Think of it as evolution in action ---
  8. Search on "Checkpointing" by crow · · Score: 3, Redundant

    What you want is known as "checkpointing."

    There have been a number of projects that do this under Unix over the years. Many of them do it for the purpose of process migration. Others do it just for recovery.

    One such project that I used in the early 90s was Condor.

    The typical approach is to do something along the lines of forcing a core dump and then doing some magic to restart the process from the core file.

    1. Re:Search on "Checkpointing" by duplicate-nickname · · Score: 2, Informative

      The condor project is still alive and well: http://www.cs.wisc.edu/condor/ and should do what this guy wants to accomplish (but not what he's asking).


  9. Re:Use Windows XP by Ewan · · Score: 2, Funny

    The difference is that suspending a laptop is done using hardware, but the suspend mode in WindowsXP is done in software, so desktop PCs can do it without additional functionality.

    Ewan

  10. Hmm, VMWare can do this in a different way. by GeorgieBoy · · Score: 5, Interesting

    VMware suspends to disk. You can go as far as suspending the Virtual Machine, not Virtual Memory. Then copy the "data" files to another machine and resume the same suspended virtual machine like nothing ever happened, as long as the same basic hardware exists on the host system (e.g. NIC, sound, serial ports, etc).

    While this isn't quite what you are looking for, it spawns an idea of the level this can be taken to. Think of how neat it is for distributed applications. Of course, something like this has to exist somewhere. . .

  11. Extended core dump? by The+G · · Score: 5, Interesting

    Almost all of the stuff you need is already in a core dump. Perhaps the appropriate approach to this is to try to extend the core-dumping mechanism to also dump other pieces of state. Then you would just need a way to reconstruct process state from a core dump, which most runtime debuggers can almost do anyway.

    I suspect that all the pieces of a solution are written and it's just a tricky pick-choose-and-integrate problem.

    And damn but I'd love to have this ability.
    --G

    1. Re:Extended core dump? by ADRA · · Score: 2, Interesting

      You forget that the kernel has created a sandbox for this core to live in. If the sandbox wakes up with a different environment, byebye process.

      Simple example

      # ./bigwasteoftime &
      ./bigwasteoftime[1]
      # hibernate bigwasteoftime
      # exit

      The program is tied to the console which no longer exists, and if woken up, which process is it childed to? What if bigwasteoftime knew its parent before hibernation, and tried to modify it?

      As it stands, you cannot guarantee its stability.

      --
      Bye!
    2. Re:Extended core dump? by ianezz · · Score: 4, Interesting
      GNU Emacs basically does this to reduce initialization times.

      When compiling Emacs from the sources, the initial executable file is only a (relatively) small virtual machine executing elisp bytecode.

      Then, it is started, and several basic elisp packages are loaded and initialized.

      Once initialized, it makes a dump of itself on a file on disk (IIRC actually dumping core by sending a fatal signal to itself).

      The dump is prepended with an appropriate loader which restores the Emacs process (in its initialized status) in memory, and the resulting file is used as the main Emacs binary (what you can usually find in /usr/bin).

      This works for Emacs because it knows when it is checkpointed, and special care is taken not to do anything that depends on parts of the running environment that can't be fully restored.

    3. Re:Extended core dump? by Dwonis · · Score: 2

      How about this? Trap SIGSTOP, but then stop anyway. (let the kernel handle the rest). When your process wakes up, re-initialize whatever you have to.

  12. hhgttg by Score0,+Overrated · · Score: 3, Funny

    The job it was doing would have been done in a few days,

    In that case, Arthur Dent should know the answer.

    1. Re:hhgttg by libertynews · · Score: 2

      No, no, we already know the answer (42). It's the bloody question that is so elusive!

      Brian

      --
      Remember Lexington Green!
  13. eros-os by ischarlie · · Score: 2, Interesting

    back in the day there was a post:

    http://slashdot.org/article.pl?sid=99/10/28/0151212&mode=thread

    about an operating system with "journaled" processes of a sort, that would automatically back up images of its processes.

  14. you can by Lumpy · · Score: 5, Informative

    It's called software suspend for linux. look for it on freshmeat.net

    --
    Do not look at laser with remaining good eye.
    1. Re:you can by Lumpy · · Score: 5, Informative

      AHA! I knew I still had it
      http://falcon.sch.bme.hu/~seasons/linux/swsusp.html

      this is what you need.

      --
      Do not look at laser with remaining good eye.
    2. Re:you can by Anonymous Coward · · Score: 2, Insightful

      Talk about the ultimate in karma whoring. Instead of just having one post modded to +5, you get two by delaying the posting of your link. It's almost criminal.

    3. Re:you can by i_am_nitrogen · · Score: 3, Informative

      There's just one tiny little problem with that. It only supports ext2. Try it with a journalling filesystem, and ... bye bye Linux partition!
      At least, last time I checked that's how it was. There may have been improvements made. It would require somewhat major changes to the VM and each filesystem in the current Linux implementation to get it working with journalled systems, or if Linux finally gets a journal-capable VM (similar to IRIX's, perhaps), it would just require some VM changes if it's done right.

      (Begin semi-OT stuff)
      Oh, and please, please everyone ask Linus not to rip out memory zones just because it's a BSD-like idea.

      Kernel 2.6 will probably be able to support hibernation without funkiness in the filesystems themselves, just a good VM setup. The new framebuffer system (Ruby) will rock, too (think 'echo "640x480-16@60" > /dev/gfx/fb/0/mode'), especially because DRI is going to be separated from X so console applications can take advantage of OpenGL as well.

    4. Re:you can by scrytch · · Score: 2

      Insightful my ass.

      Yunno, some people hit the 50 cap long ago. Some never cared. I thought this whinging over so-called "karma whoring" had died long ago (I was thinking of changing my sig), but I guess there are some people still left who are socially stunted enough that they cannot conceive of others partaking in conversation for the fun or edification instead of pleas for attention. I thought I was kind of messed up, but I can't say that I feel particularly validated or not based on some score I have on slashdot.

      --
      I've finally had it: until slashdot gets article moderation, I am not coming back.
  15. Re:Use Windows XP by Lukey+Boy · · Score: 2

    He's talking about on a process-level, as in freeze a lengthy game of Asteroids and restore it later. Hibernation is system-wide, not on a process-by-process basis. And Linux has that too ;-) Note: this comment is reused!

  16. process migration is the term you want by Danny+Rathjens · · Score: 2, Interesting

    There has been a lot of work done on "process migration". That is moving processes from machine to machine.
    Obviously those techniques would apply to what you are asking about.
    google has lots of links about it

  17. it's encrypted in your brain waves! by spacefem · · Score: 5, Funny

    I once had an enormous computer working out a very important question but it was destroyed by Volgons five minutes before it was finished. I feel your pain.

    1. Re:it's encrypted in your brain waves! by medcalf · · Score: 2
      I once had an enormous computer working out a very important question but it was destroyed by Volgons[sic] five minutes before it was finished. I feel your pain.

      That must have annoyed the Vogons, who were coming to do the same thing. Not to mention the mice!

      --
      -- Two men say they're Jesus. One of them must be wrong. - Dire Straits
    2. Re:it's encrypted in your brain waves! by Don+Negro · · Score: 2

      Also, it was encoded, not encrypted.

      But what's a few letters among severe geeks.

      --

      Don Negro
      Perl 6 will give you the big knob. -- Larry Wall

  18. I took a quick look... by eXtro · · Score: 2, Funny
    through my engineering library and I found a similar situation. A massive computer system, completely one of a kind, was destroyed prior to providing the solution to the problem for which it was designed. Recalculating the solution from scratch would take far too long, but there was one possibility. One of its computational units was still intact and the answer was surmised to be embedded deep within its memory.


    I think the same solution would apply here: Find Arthur Dent.

  19. No need, my good man by JohnTheFisherman · · Score: 2, Offtopic

    The answer is 42. :D

  20. Re:Really worth the effort? by b_pretender · · Score: 4, Insightful
    Good point. He should also create numerical algorithms with log files that keep track of how far they are getting and track results.

    This sounds like common sense to me. You never know when the disk is going to poop, the power shut off, the network reset.

    At my old job, we were required to record the status of all jobs that took longer than an hour (on a 6 cpu SGI). They never crashed on their own, but I would usually interrupt them if the requirements changed or whatever. If they ever did crash, then there was a record of exactly where they left off.
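
One way to keep such a record is an append-only progress log that is read back on restart. The sketch below assumes each job has a unique name and a result that fits on one line. `completed_jobs`, `run_jobs`, and the tab-separated format are invented here for illustration.

```python
import os


def completed_jobs(log_path):
    """Return the set of job names already recorded in the log."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path) as log:
        return {line.split("\t", 1)[0] for line in log}


def run_jobs(jobs, log_path, work):
    """Run each (name, arg) job not yet in the log and record its result."""
    done = completed_jobs(log_path)
    with open(log_path, "a") as log:
        for name, arg in jobs:
            if name in done:
                continue  # finished in an earlier run
            result = work(arg)
            log.write(f"{name}\t{result}\n")
            log.flush()
            os.fsync(log.fileno())  # make the record survive a power cut
```

After an interruption, running `run_jobs` with the same job list and log path skips whatever is already recorded. A line cut off in the middle of a write is not handled in this sketch.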

  21. Resurrecting core files by robbo · · Score: 2

    I've always wondered how hard it would be to resurrect a core file. One would think that there's enough info in a complete core to reopen all the open fd's, and possibly even reinitiate network connects. Everything else is there-- program counter, stack, heap, etc. As such, one could 'kill -ABRT' the process and revive it again later. Has anyone seen this done?

    --
    So long, and thanks for all the Phish
  22. Suspend by selectspec · · Score: 4, Informative

    You can't just serialize and page out one process. Under every process are a slew of kernel objects and kernel crud including the virtual to physical mappings of your address space. It would be quite a challenge to isolate all of this and somehow persist it.

    To make suspend work, you'd have to dump your entire memory image to disk. Then you swap in the entire image, kernel and user pages alike.

    --

    Someone you trust is one of us.

    1. Re:Suspend by arkanes · · Score: 2

      Which is exactly how Windows does it. This even seems to work with memory-intensive games that manage their own swap, like Diablo 2.

  23. This CAN be trivially done on any un*x i know... by ugen · · Score: 2, Redundant

    1) Produce the core dump of a process
    2) Use the core and process image to restart it
    (for example in the debugger such as gdb, if you
    don't want to write specialized software).

    To the best of my knowledge perl "compiler" uses
    precisely this technique to produce perl "executables" - dumps them out as a core right
    after compilation and reuses it later on.

    You can do this to a kernel as well, if you
    REALLY want to.

    However, since indeed many things may be dependent
    on state of kernel, files, network connections, devices etc. etc. doing this is not advisable.

    Good coding practice for long-running processes is
    to actually spend some time on writing the state
    saving functionality to support process restart.

    Anyway, (call it a flame if ya will) but the fact
    that /. posts this as a relevant question is very
    disquieting - level of technical knowledge here
    gets reduced day after day.

  24. Solaris Suspend & Resume by morcheeba · · Score: 3, Informative

    I've used the Suspend/Resume feature on a sun box. IIRC, it mostly worked, but with a minor hitch that made me worry enough to never do it again. This suspend/resume is just like the laptop version -- save a copy of all memory to disk -- not the cryogenic per-process version you're talking about.

    The per-process sounds neat, but usable only if you've got a simple critical task you're running. For a more complicated application, multiple processes may be working together, and you'd have to suspend all of them at the same time.
    One big question I would have would be file handles... if you restore a process that thinks it owns file handle #5 and some other process is already using it, it would be awkward to get either process to use a different handle.
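
On Unix, descriptor numbers are per-process, so a restored process only has to rebuild its own table, not compete with other processes for handle #5. A rough sketch of putting a file back on a specific descriptor number with `os.dup2` follows. `restore_fd` is a hypothetical helper, and it assumes the file still exists at the same path, which is exactly the kind of external dependency raised elsewhere in this thread.

```python
import os


def restore_fd(path, wanted_fd, offset, flags=os.O_RDONLY):
    """Reopen path and make it available on descriptor number wanted_fd.

    dup2 silently closes whatever this process had open on wanted_fd,
    so a real restorer would do this before opening anything else.
    """
    fd = os.open(path, flags)
    if fd != wanted_fd:
        os.dup2(fd, wanted_fd)  # duplicate onto the exact number needed
        os.close(fd)            # drop the temporary descriptor
    os.lseek(wanted_fd, offset, os.SEEK_SET)  # put the file position back
    return wanted_fd
```

The saved state would need to record the path, open flags, and file offset for every descriptor, and this approach does nothing for pipes or sockets.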

  25. Future of Process Management by gehrehmee · · Score: 3, Interesting

    First, let me say that what the poster is suggesting sounds a little more sophisticated than a simple re-implementation of XP's hibernate function, although functionality like that under UNIX would certainly be invaluable. It sounds like the poster wants control over individual processes, something that I consider far more interesting.
    What's said here is certainly very reasonable. But the extensions of what's being suggested are even more fantastic. Once a process is completely removed from memory, with file handles and storage and status all kept away safely, is there any reason that the process is really tied to that computer? Why wouldn't it be possible to take that 'frozen' process, transfer it to another machine with access to the same filesystem on some level (some translation of file handles would likely be necessary), and thaw it there, allowing someone to move a running process to another machine? Need to replace your web server's only CPU, but don't want downtime? Move the process to a backup machine, replace the original's hardware, and move the process back.
    I even thought I had heard that someone was working on just such a project, or at least thinking about the details of implementing it. (I'm just getting started in learning UNIX internals myself). Anybody have more references to information on this sort of thing?

    --
    "You know, Hobbes, some days even my lucky rocketship underpants don't help" -- Calvin
  26. different approach: Savepoints by esonik · · Score: 2, Interesting

    A different solution, which is very common for long running processes, is to use savepoints, i.e. save the state of the process regularly to a file at suitable points of the algorithm. Once your process dies or you killed it, you can restart from that savepoint. If your state information is very large, you can stretch the save interval to reasonably long times, e.g. several hours. Typically you don't mind losing some hours of calculations due to an occasional power outage.

    Of course this solution is not as general as the "process cryogenics" you describe, but it's also easier to implement because you have more information about the problem.

    1. Re:different approach: Savepoints by jgerman · · Score: 2

      Yes, this is similar to what I've done in applications, especially easy in an OO environment. Coded correctly you can view your process as a virtual machine, one that has a fixed instruction set. Serializing all of the data and dumping it to file will allow you to pick up where you left off. Of course this is per application, but it is relatively simple to build into your app when you write it.
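
A small sketch of that object-oriented approach, assuming all of the process's state lives in plain attributes of one object. The `Accumulator` class and its `to_json`/`from_json` methods are made up for this example.

```python
import json


class Accumulator:
    """A tiny 'virtual machine' whose whole state is two integers."""

    def __init__(self, step=0, total=0):
        self.step = step
        self.total = total

    def run(self, steps):
        """Execute `steps` more iterations of the fixed instruction."""
        for _ in range(steps):
            self.total += self.step
            self.step += 1

    def to_json(self):
        """Serialize every piece of state needed to resume."""
        return json.dumps({"step": self.step, "total": self.total})

    @classmethod
    def from_json(cls, text):
        """Rebuild an object that continues where the saved one stopped."""
        return cls(**json.loads(text))
```

An object rebuilt with `from_json` and run for the remaining steps should end in the same state as one that was never interrupted. Writing the string to a file is left out to keep the sketch short.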

      --
      I'm the big fish in the big pond bitch.
  27. No reason not to by NaturePhotog · · Score: 2

    There's no reason why you can't do it either in an app by saving state or in the OS by saving memory to disk as on a laptop.

    GEOS had the concept of state-saving in the OS circa 1990, so it's nothing new. The UI saves its state, what apps are running, what windows are open, etc. and restores it exactly as you left it when you restart. If an app has extra data to save, such as where it was in a lengthy computation, it can save it, too.

    A slightly different approach than brute-force writing out all of used memory, but both work quite well with the speed of current hard drives.

  28. Checkpoint/restart by td · · Score: 3, Interesting

    This facility is called checkpoint/restart. It was a feature of OS/360 and other operating systems in the 1960s. In some very early versions of Unix, core files were restartable. Usually it's pretty easy for programs to save enough state to be restartable on a case by case basis, except when it's just about impossible (like when networks reconfigure) so it's not a popular system feature these days (hard to implement in a general way, doesn't do a very good job in the cases that can be handled easily.)

    A friend of mine (Hugh Redelmeier) ran a very long (~400 day) computation on a PDP-11 in the mid-1970s. The program ran stand-alone, and part of the test plan involved flipping the power switch on and off a few times -- very amusing to watch the program keep on running right through power failures. (Main memory on the machine in question was magnetic cores, which are non-volatile.)

    --
    -Tom Duff
    1. Re:Checkpoint/restart by shaper · · Score: 2

      I was peripherally involved in some early efforts to include checkpoint/restart in POSIX with respect to standardizing fault tolerance and high availability features. I was a US DoD employee at the time. The military's interest was to be able (in a semi-portable standard way) to reset to a known good previous state in the case of some arbitrary failure mode in safety critical systems, i.e. flight controls, stores (weapons) management, etc. AFAIK, the POSIX standards efforts never went very far due to many different, sometimes conflicting needs. The more business-oriented high availability people had needs for very similar OS functionality that was markedly different in character from the military's viewpoint. My involvement ended in the early to mid 90's, so my understanding of the situation may be more than a little stale.

    2. Re:Checkpoint/restart by td · · Score: 2

      I made a version of the same program and reran it a few years ago on an SGI Octane. It took about 8 days.

      --
      -Tom Duff
  29. VMWare by Creedo · · Score: 2, Informative

    VMware does this for the VMs it hosts. Works great.

    Creed

    --
    All that is necessary for the triumph of good is that evil men do nothing.
  30. Build in persistence yourself. by blair1q · · Score: 5, Insightful

    Any program that you intend to run for more than a day or two you should checkpoint its intermediate results to disk, even if this adds 100% to the run time.

    --Blair

    P.S. Alternatively, you could write a program to have the rebooted computer pull scrabble tiles from a bag structure and print them to the screen. You might at least get some clue as to whether it was asking the right question.

    1. Re:Build in persistence yourself. by dillon_rinker · · Score: 3, Insightful

      Re-read the comment you replied to; it suggests something subtly different from what you suggest. Checkpointing intermediate results is not the same thing as checkpointing processes. To take a much oversimplified example, I write a program to multiply a two-digit number by a one-digit number. My program does the following:

      1. Multiply ones digits
      2. Multiply tens digit by ones digit
      3. Multiply previous result by ten
      4. Add results from steps 1 & 3
      5. Display previous result.

      If my program crashes at any point before step 5, I have to start all over. So, I save my intermediate results at step 1, step 2, step 3, and save my final result at step 4. This is checkpointing my intermediate steps.

      Your suggestion, on the other hand, is to periodically save the entire system state. This is checkpointing the processes.

      I see a need for both types of checkpointing - applications periodically checkpointing data (like the autosave feature in the market-leading word processor) and system-state saves (like the sleep feature of some laptops). Reliability and recoverability should be engineered in at all layers.
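
      A minimal sketch of the first kind (checkpointing intermediate results), using the five steps above. The file name, the JSON layout, and the function names are assumptions made for illustration; nothing in the thread prescribes them.

```python
import json
import os

def load_state(path, inputs):
    """Return saved intermediate results for these inputs, or a fresh state."""
    try:
        with open(path) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {}
    if state.get("inputs") != inputs:
        state = {"inputs": inputs}  # stale or missing checkpoint: start over
    return state

def save_state(path, state):
    """Write to a temporary file and rename it over the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def multiply(two_digit, one_digit, path="multiply.ckpt.json"):
    tens, ones = divmod(two_digit, 10)
    state = load_state(path, [two_digit, one_digit])
    steps = [
        ("step1", lambda s: ones * one_digit),         # multiply ones digits
        ("step2", lambda s: tens * one_digit),         # tens digit by ones digit
        ("step3", lambda s: s["step2"] * 10),          # previous result by ten
        ("step4", lambda s: s["step1"] + s["step3"]),  # add steps 1 and 3
    ]
    for name, compute in steps:
        if name not in state:          # skip work a previous run already saved
            state[name] = compute(state)
            save_state(path, state)
    return state["step4"]              # step 5 (display) is left to the caller
```

      If the program dies after step 2, the next run finds step1 and step2 in the file and only redoes steps 3 and 4. Note that this saves the data, not the process: nothing about the stack, registers or open files is preserved.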

    2. Re:Build in persistence yourself. by Erasmus+Darwin · · Score: 2
      "Any program that you intend to run for more than a day or two you should checkpoint its intermediate results to disk, even if this adds 100% to the run time."

      That seems rather wasteful. The whole point of checkpointing is to avoid having to waste time recalculating things. Since you're trading off between two potential wastes of time, it's a more complicated issue than you make it out to be.

      For example, imagine a scenario where you have many, many jobs to run. Each job takes a week to run. Your goal is to run the most jobs in a given time period. Checkpointing doubles the run-time, kicking it up to two weeks. Finally, let's say there's a 1% chance per day that the system will go down for the day.

      That means there's a 93.2% chance that we'll make it through a non-checkpointed job without failing. Even if we do fail, there's a 93.2% chance that we'll make it through the rerun. If we make it through either time, our worst case scenario is to tie with the checkpointed job.

      Still, occasionally, a non-checkpointed job will hit multiple failures and take longer than a checkpointed one. But under the constraints I provided, it should be clear that checkpointing's going to lose in the long run.

      All that being said, there are certainly scenarios where checkpointing is the better choice, such as when it's more important to get the jobs done within a certain deadline or when the failure rate is higher. But it's absurd to declare checkpointing to always be the optimal solution.
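
      The arithmetic can be checked with a short sketch. The model is an assumption layered on the scenario above: a down day is lost entirely, a non-checkpointed job restarts from zero afterwards, and a checkpointed job loses only the down day itself.

```python
def p_clean_run(days, p_fail):
    """Probability that `days` consecutive days all pass without a failure."""
    return (1 - p_fail) ** days

def expected_days_no_checkpoint(days, p_fail):
    """Expected calendar days until `days` consecutive good days occur.

    Standard result for the waiting time to a run of successes, with
    per-day success probability q = 1 - p_fail.
    """
    if p_fail == 0:
        return days
    q = 1 - p_fail
    return (1 - q ** days) / (p_fail * q ** days)

def expected_days_checkpoint(days, p_fail, overhead=1.0):
    """Expected calendar days when only down days are lost.

    `overhead` is the fractional slowdown from checkpointing
    (1.0 means the run time doubles, as in the quoted claim).
    """
    return days * (1 + overhead) / (1 - p_fail)

if __name__ == "__main__":
    print(p_clean_run(7, 0.01))
    print(expected_days_no_checkpoint(7, 0.01))
    print(expected_days_checkpoint(7, 0.01))
```

      Under these assumptions the non-checkpointed week-long job averages roughly 7.3 days against a little over 14 for the checkpointed one, which matches the conclusion above. Raising p_fail or lowering overhead shifts the balance the other way.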

    3. Re:Build in persistence yourself. by blair1q · · Score: 2

      If you know the MTBF on your computing system (including every necessary system all the way back to the watershed that's driving the hydroelectric plant) then yes, you can do a cost-benefit analysis.

    But if you're a yahoo at J. Random University who's just writing in his thesis, you're going to type :w every few words, no?

      If the system it runs on is out of your control, and you have no idea of the probability of a crash in the next few weeks, and you only have one or two shots to get it done, you need to maximize the robustness.

    But yeah, you're right, don't get paralytic about it. Just organize your data and state info into a data structure that can be serialized to a file and read back in later.

      --Blair

    4. Re:Build in persistence yourself. by gotan · · Score: 2

      As long as the 'interrupted run' is not due to some moron switching off their workstation when an application of theirs hangs (in a university environment, say), what you describe is under control of the person running the program. Also, in the case you describe, altering parameters during runtime seems quite common. Yet I wonder why it isn't known in advance when (after which iteration) someone might want to change the parameters (so you could make the program stop after that iteration and dump the data then).

      Also a better idea might be to have the program look for new/changed parameters itself (by means of a special input file, and maybe sending a signal), or make it stop itself and write a dump by some mechanism controlled from outside (a signal or a 'stop'-file it looks for).

      Then I wonder how writing a dump would produce a 100% overhead (that would mean the program spends half its time writing a dump; it should alternate between dump files then, so there's at least one valid dump at any time) and still be worth it. Usually, on large supercomputers handling numerically intense programs such as you describe, there are means of fast I/O too, which provide the necessary bandwidth to dump that data fast, or at least nonblocking.

      --
      "By the way if anyone here is in advertising or marketing... kill yourself." -- Bill Hicks
  31. User Control by Skweetis · · Score: 2, Interesting
    It would be neat if this could be controlled by the user. Ideally, this would be done by a process signal. To actually cause a process to hibernate, a user would do a kill -HIB $PID or something like that. Then the kernel would save the process information to a file (somewhere under /var maybe?) until it is restored.

    This next one would complicate things a bit: the user should also be able to wake up the process the same way, i.e. kill -WAK $PID. This means that an index of hibernated processes also needs to be kept synchronized between the kernel process tables and a file on disk, to be preserved between reboots.

    Maybe I'll write another kernel patch...

  32. Been there, done that by jstott · · Score: 2, Informative
    Look at the makefile for emacs--the emacs executable is essentially a memory dump of a partially initialized emacs process. Perl's dump and undump work the same way.

    For long-running processes, rather than shut down the process when the UPS kicks in, I've always found it easier to have the program snapshot its data tables periodically (say every half-hour) and build a "resume from disk" feature into the program. This lets you restart the program from its last check-point even in the event of uncontrolled program termination (e.g. kill -9 and the like).
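
    One way the snapshot-and-resume idea might look, as a sketch only: the toy workload (a sum of squares), the pickle format, and the file name are assumptions, and the half-hour default simply mirrors the interval mentioned above.

```python
import os
import pickle
import time

def save_snapshot(path, data):
    """Write the snapshot to a temp file, sync it, then rename it into place."""
    tmp = f"{path}.tmp"
    with open(tmp, "wb") as f:
        pickle.dump(data, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # the old snapshot stays intact until this succeeds

def load_snapshot(path, default):
    """Return the last snapshot, or `default` if none has been written yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

def long_job(n, path="job.snapshot", interval=1800.0):
    """Sum i*i for i < n, resuming from the last snapshot if one exists."""
    data = load_snapshot(path, {"next_i": 0, "total": 0})
    last_save = time.monotonic()
    for i in range(data["next_i"], n):
        data["total"] += i * i
        data["next_i"] = i + 1
        if time.monotonic() - last_save >= interval:
            save_snapshot(path, data)
            last_save = time.monotonic()
    save_snapshot(path, data)
    return data["total"]
```

    Because the snapshot on disk is always a complete one, a kill -9 costs at most one interval of work; the next invocation of long_job picks up from next_i.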

    -JS

    --
    Vanity of vanities, all is vanity...
  33. The hardware will be a big issue.... by King_TJ · · Score: 2

    The main reason this "suspend" feature works relatively well for a laptop is because the hardware is a "given". The laptop has to have a certain video card and motherboard chipset, specific type of hard drive, floppy, CD-ROM and sound device. (In fact, when laptops fail to come back up properly from a suspend, it's almost always the one "add-on" card people have in laptops, the PCMCIA network adapter, that causes the problem.)

    3Com PCMCIA cards are about the only ones I've used that allow the laptop to power them down and back up again, and resume network activity without a complete machine reboot.

  34. Hibernation comments are missing the point by ry4an · · Score: 5, Insightful

    The comments to the effect of "it's called hibernation, and has done it for years" are missing the point. That hibernation is a BIOS supported dump to disk. It's a feature on most laptops and works with just about any OS -- it's worked on my Linux laptop for years.

    I think the feature to be discussed is Operating System (not BIOS) level support of the hibernation of a single process. It'd be nice if I could do a:

    kill -HIBERNATE `cat /var/longoperation.pid`

    and have that program get frozen to disk. Then if I could resurrect just that process later it'd be a handy feature for the long running program that you want to postpone until after you've done whatever you needed to do in single user mode.

    1. Re:Hibernation comments are missing the point by Hrunting · · Score: 5, Insightful

      And if you have something like that, you open yourself up to a wealth of potential problems in the program. Take this simple perl script.

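      #!/usr/bin/perl
      use strict;
      use warnings;

      my $pid = $$;      # PID read once, at startup
      print "$pid\n";    # stale if the process was frozen and restored in between
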
      If you stop it between those two $pid commands, there's no guarantee that you're going to get the same pid value back. Programs would have to be specifically programmed to handle this sort of thing (there are other examples, this is just the most basic; network programs particularly would have problems).

    2. Re:Hibernation comments are missing the point by eries · · Score: 2

      And what an incredible debugging tool. I know that my process is producing buggy output after running for four hours. Solution: run for four hours, hibernate, copy and re-run the last five minutes as many times as you want.

    3. Re:Hibernation comments are missing the point by The+Smith · · Score: 2, Informative
      You mean like: run for four hours, force a core dump by pressing Ctrl-\, and then re-run the last five minutes as many times as you want?

      You don't need hibernation for that.

    4. Re:Hibernation comments are missing the point by gorilla · · Score: 3, Insightful

      There are lots of other issues. If a program has a socket, or a device open, what should happen? Should the OS reopen the socket? What if the remote end is requiring status? No point reopening an FTP session if the application thinks it's already sent the userid/password but the server doesn't. What if it's a device, e.g. a modem, and it is locked?

    5. Re:Hibernation comments are missing the point by redback · · Score: 2

      Ever used Windows 2000 or Windows XP?

      They have hibernate. Completely hardware-independent hibernate.

      It works on anything that has proper drivers.

      You will find it under power in the control panel in w2k, and it's on by default in wxp.

    6. Re:Hibernation comments are missing the point by enkidu · · Score: 2

      Sorry, that is not correct. The state of most programs is not represented by the "memory/stack space" of the process + the register status alone. You have to remember that the kernel is also part of the space in which most processes run. Add in network sockets and device handles and inter-process semaphores and hibernation gets really complicated really quickly. The way around that is to restrict yourself to a small(er) set of system calls, which is what Condor does, I believe.

      In fact most "checkpoint anytime" systems allow you to delineate atomic sections of code where checkpointing/hibernation should not happen. The only way to allow true checkpoint/hibernation anywhere is to build it explicitly into the kernel.

      --

      There is no trap so deadly as the trap you set for yourself
      -Raymond Chandler, The Long Goodbye
    7. Re:Hibernation comments are missing the point by Dwonis · · Score: 2

      Actually, you only need to send a SIGSTOP to the applications themselves, then get the kernel to swap out the process completely and save the result somewhere.
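
      The SIGSTOP half of that is easy to try from user space. The sketch below (Unix only; the helper names are made up for illustration) freezes and thaws a process with SIGSTOP and SIGCONT. It does not write anything to disk, so the frozen process still lives in RAM and swap and would not survive a reboot; the "save the result somewhere" step is the part that needs kernel support.

```python
import os
import signal

def freeze(pid):
    """Stop a process; SIGSTOP cannot be caught or ignored by the target."""
    os.kill(pid, signal.SIGSTOP)

def thaw(pid):
    """Let a stopped process continue from exactly where it was stopped."""
    os.kill(pid, signal.SIGCONT)

def wait_until_stopped(child_pid):
    """Block until a child of this process is reported as stopped.

    Only works for direct children, since it relies on waitpid().
    Returns True if the child stopped, False if it exited instead.
    """
    _, status = os.waitpid(child_pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)
```

      From a shell the equivalent is kill -STOP and kill -CONT on the PID.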

  35. That will not be easy by bartman · · Score: 2, Interesting

    There are big problems with such an approach, mainly with device usage. Basically they are all the problems that you would have with process migration, plus a few more because of temporal discontinuity.

    If you are using a scanner, or a mouse, or whatever, that device may not be there or may not be available when the process is brought back. Furthermore you may have a file descriptor opened on a local (or network shared) file which no longer exists or has changed drastically.

    There are further non-device-dependent problems with shared memory, opened-but-unlinked files, parent PID, IPC resources.

    Having said all of the above... I suppose that for the very rare case that your program is completely memory and CPU dependent you could retire and recover a task.

    my $0.02

    --
    -- bartman
  36. Apple Tried this with OS 9 by zaius · · Score: 3, Interesting
    Apple implemented this feature in early versions of OS 9, but took it out after they realized that some laptops would never "unfreeze" without the user hitting a reset switch buried deep inside the laptop.

    The idea was that when you put your computer to sleep, instead of keeping the SDRAM (or whatever the laptop had) powered to preserve the memory contents, it would write it all to a special sector on the hard drive that the firmware knew to read from when starting from sleep. This allowed sleep to be even more low-power than it already is, since a hard drive does not require power to retain data.

  37. EPCKPT by cmason · · Score: 5, Informative
    EPCKPT is a checkpoint/restart utility built into the Linux kernel. Checkpointing is the ability to save an image of the state of a process (or group of processes) at a certain point during its lifetime.

    --

    --
    "If you are an idealist it doesn't matter what you do or what goes on around you, because it isn't real anyway."-R.P.W.
  38. This would be useful for more than just blackouts by EccentricAnomaly · · Score: 2

    If you could sleep processes you could run some intensive job at a high priority when you're not logged into your workstation and then sleep the processes when you log in. This way you could run some job that takes weeks or months but not bog down a workstation that you need for doing daily work on.

    Yeah, you could "nice" down the process so that it doesn't slow things down while you're logged in... but then system processes at higher priorities might slow down your number crunching when you're not logged in... It'd be best to be able to run it at high priority at night only.... ya know, use those unused cycles.

    --
    There are 10 types of people in this world, those who can count in binary and those who can't.
  39. Application level solution. by Mark+Imbriaco · · Score: 2, Interesting

    One fairly simple alternative is to simply have the application save its own state to a "checkpoint" file periodically. This approach has been used in other applications for a long time in the form of auto-save files (e.g. emacs) and would be easily adapted to a long running program like the one you describe.

    Just because the OS doesn't support it automagically it doesn't mean that you can't solve it for yourself with a little bit of extra work and planning.

  40. Re:Really worth the effort? by kdawg6000 · · Score: 3, Informative

    If you are a grad student who has been waiting for a month for a job to finish...this could be very important. I was in an engineering department where jobs that ran for weeks were not uncommon (fortunately most of mine only took a day or two). A shutdown of a critical machine could set someone back months.

  41. Software suspend by Timbo · · Score: 2, Informative

    Linux software suspend may be of interest.

  42. Re:This CAN be trivially done on any un*x i know.. by zaius · · Score: 2

    So, you mean that the next time my app segfaults and dumps core, I can say it was a feature designed to allow it to be restarted...? Cool. Seriously though, how can you restart a core (obviously not one from a segfault) using gdb?

  43. Volgons? by wiredog · · Score: 3, Offtopic

    The bastard children of Vogons and Vorlons?

    1. Re:Volgons? by crawling_chaos · · Score: 2

      I wonder what their poetry sounds like? At least it would be set to music, I suppose...

      --
      You can only drink 30 or 40 glasses of beer a day, no matter how rich you are.
      -- Colonel Adolphus Busch
  44. App Specific "Resume" by 4of12 · · Score: 2

    Long ago and far away (about 15 years ago) I recall that TeX was frequently built in a fashion that required running the binary on some "initialization" information. That process took some nontrivial amount of time back in those days (I'm sure now it would be an eyeblink), and the program could be made to \dump its state in some way.

    Then, when you ran TeX in everyday circumstances, the digested initialization file was read in by the application as part of the usual startup process.

    I'm probably botching the explanation of how this really worked, but I guess my point is that the "resume" function had to be coded into the specific application.

    --
    "Provided by the management for your protection."
  45. Windows 2000 and Hibernation by doorbot.com · · Score: 5, Informative
    If you have a Windows 2000 or XP machine you can enable hibernation. However, this is not a "power management" feature... it has been separated from ACPI and/or proprietary disk partitions and will work on all computers, even servers, whether they have ACPI/APM/nothing for power management.

    Once you've enabled it, you create a hibernation file on the C: drive. Hibernation should only take place when there is minimal disk activity (e.g., don't hibernate while trying to save your Word document). The system saves the contents of RAM to the hard drive, and then shuts down. When the machine boots, a flag has been set (I assume) indicating the system should resume from hibernation... so the hibernation file is read from disk and written to RAM and you're back up and running, in less time than it takes to boot. Plus it keeps your uptime from resetting back to zero.

    Some things to note:

    You will need WHQL certified drivers, or at least properly-written drivers. I have a SB Audigy and the first drivers I used (the ones on the included CD) caused a blue screen on resume from hibernation. When an updated driver was released, it fixed this issue.

    Applications need to be properly-written as well, as there is some sort of Win32 suspend signal that is sent to apps just before the system hibernates, so the app must support this and the resume command when the system is restored.

    Hibernation works great on my laptop and on my workstation, and I especially like the fact that I don't need to create a separate partition or install special drivers to make it work (you can even use it on an NTFS formatted drive).

    1. Re:Windows 2000 and Hibernation by Joe+U · · Score: 2, Funny

      Creative releasing drivers that cause a bluescreen?

      Who would have thought it was possible?

      Rule 1 with hibernation, no creative products.

    2. Re:Windows 2000 and Hibernation by dublin · · Score: 3, Interesting

      This is not strictly speaking a W2K function. The real kicker here for Linux folks is that the easiest way to do hibernation in the modern world is to use ACPI, which Linux doesn't do very well. (See this week's LWN for a timely discussion.)

      APM BIOSes can also do this, but they aren't as standard: Often the implementation details are specific to the hardware. For instance, Phoenix BIOSes (at least as of two years ago, I haven't messed with this stuff much since then) tend to want to put the STD (suspend-to-disk) data in a special file in a Windows partition, while some others (Dell for sure, since I used to work this stuff for them) save this info in a special STD partition (type 84, IIRC) which is a more generic solution, but requires more knowledge when setting up the box. (When was the last time you thought you might need an STD partition when building your box? BTW, they should be at a minimum, PhysicalMemorySize + 1 MB for state info, video register settings, etc.)

      --
      "The future's good and the present is nothing to sneeze at." - Roblimo's last ./ post
    3. Re:Windows 2000 and Hibernation by doorbot.com · · Score: 3, Interesting

      This is not strictly speaking a W2K function.

      Agreed, and as you go on to explain, and I believe I alluded to in my post, there are many proprietary implementations via the BIOS or DOS drivers, etc.

      My point was that Windows 2000 separates the hibernation feature from the BIOS. As far as the BIOS can tell, the system is booting normally... but once the BIOS loads the NTLDR, Windows takes over of course and handles the hibernation. This is why it works so well and does not have all of the "stupid issues" such as custom drivers, partitions, or the like. The end result is not an MS-only function, but the implementation is, as far as I can tell.

    4. Re:Windows 2000 and Hibernation by denzo · · Score: 3, Interesting
      Not according to Microsoft (on their knowledgebase). This article states that Win2k needs ACPI to support OS hibernation, and that the BIOS has to support it. Although Microsoft has been known to contradict itself.

      And simply having WHQL-certified drivers doesn't necessarily mean it'll work. I had a Future Domain SCSI controller in my computer that loaded with the default Win2k WHQL driver, but I could never hibernate it. When I swapped it out with an Adaptec 2940UW, I was able to enable Hibernation in my Control Panel settings.

    5. Re:Windows 2000 and Hibernation by EvlG · · Score: 2

      The need for 100% kosher drivers and apps is the real kicker here.

      Lots and lots and lots of people don't have great (or even good) drivers for some hardware.

      Apps suck even more - the whole Windows platform is full of the people doing X, Y, and Z in different ways to skirt different OS bugs or other pet peeves they didn't want to deal with.

      I've never gotten Hibernate to work properly for just those reasons - apps and drivers on windows suck.

    6. Re:Windows 2000 and Hibernation by dublin · · Score: 2

      FWIW, I don't think this is Windows-only. Hibernation should work in any OS that understands how the APM or ACPI BIOS APIs work. Sadly, the only Linux I've found that even comes close to understanding these correctly is Corel. (They actually did a great deal of the "hard stuff" right - I hated to see them fade away before making their mark...)

      (FWIW, I prefer to use the terms "suspend-to-RAM" (S2R or STR) and suspend-to-disk (S2D or STD), since there's no ambiguity about what's going on that way, as there can be with terms like sleep, suspend, snooze, and hibernate.)

      --
      "The future's good and the present is nothing to sneeze at." - Roblimo's last ./ post
  46. Process-saving is known, but not what you want by Seth+Finkelstein · · Score: 4, Informative
    The idea of saving the state of a process is very well-known. Take a look at anything from emacs dumping to the gcore(1) program. It's been used in everything from saved games of Rogue to saved states of Perl.

    But isn't it overkill for a data-crunching operation? As many other people have noted, it would seem you're much better off checkpointing your data to disk, rather than relying on low-level OS process wizardry.

    Sig: What Happened To The Censorware Project (censorware.org)

    1. Re:Process-saving is known, but not what you want by bfields · · Score: 2
      But isn't it overkill for a data-crunching operation? As many other people have noted, it would seem you're much better off checkpointing your data to disk, rather than relying on low-level OS process wizardry.

      Perhaps the process is running software he didn't write, in which case this might not be so easy.

      ---Bruce F.

  47. Re:Really worth the effort? by NetJunkie · · Score: 3

    No, I wouldn't design a totally new memory dump system, I'd keep logs. Have the app keep track of where it is so that should the system restart it can pick back up again. That could be done without new BIOS and memory systems.... And you could do it TODAY with your existing hardware setup.

  48. Already available for Linux by HishamMuhammad · · Score: 2, Interesting

    There is a kernel patch to do this. It's called Software Suspend. It is also part of the FOLK project (Functionality Overloaded Linux Kernel, a project to merge the largest possible amount of patches into the kernel).

  49. Bad coding? by oolon · · Score: 2

    Surely if this process takes so long to execute, the person who wrote it should have made it save its state every once in a while. Problems like these could have been avoided! SETI@home, to name but one, does exactly this.

    James

  50. You sure of that? by bastion_xx · · Score: 4, Funny

    My Intel processor puts it somewhere around 41.99999999967

    1. Re:You sure of that? by Sj0 · · Score: 2

      No, you're wrong. My AMD processor says it's... oops, the heatsink fell off and the CPU fried itself.

      Sorry, that's AMD's fault right there. They shouldn't be letting retards with no motor skills try to put together a modern PC. BAD AMD! BAD!

      People should really lighten up -- I suppose you'd be blaming AMD if a power surge caused your power supply to send 1000V DC into your board, destroying all the components? Electronics are designed to be used in a certain way (i.e. a CPU which runs at such a high speed, and does more computations than the entire computing world 25 years ago, runs hot and was designed to run with a heatsink), and if someone isn't functional enough (alternate: is mentally retarded or is suffering from Parkinson's disease) to ensure that their heatsink doesn't fall off, maybe they shouldn't be allowed around complex electronics.

      Oops! My heatsink fell off and shorted out my video card, destroying it. BAD NVIDIA! BAD!

      --
      It's been a long time.
  51. Re:Use Windows XP by rlowe69 · · Score: 4, Redundant

    This comment is far from (Score:4, Informative) ... it's not even relevant. We're not talking about the whole OS hibernating, we're talking about saving the execution state of an executing process so that it can be resurrected later and continued (ie. if a reboot is necessary).

    --
    ----- rL
  52. Re:This CAN be trivially done on any un*x i know.. by xyzzy · · Score: 3, Informative

    You can't. The previous poster was making it sound too easy. Real checkpointing needs to save Kernel state as well -- file handles, device driver state, you name it. It isn't as simple as saving the in-memory image of the process.

  53. Cryogenic freeze / Hibernation by Dr_Marvin_Monroe · · Score: 2, Interesting

    I think that this might also be a really good bug fix/hacking tool. I can also remember something like this for the Apple II in years gone by. You could press a button and take a snapshot of all memory in the system. Then you could write the executable part to disk and pick up where you left off. Good for freezing a copy of a game or whatever.

    This would also be good for tracking down bugs using the "before and after" technique.

    Such a program could be tied into the UPS monitor in such a way as to save everything that couldn't be stopped.

  54. CDC Cyber 205 by epepke · · Score: 5, Interesting

    As usual, this is ancient. Back at FSU, we had a CDC Cyber 205, a vector pipeline supercomputer, back in 1985. Any process could be crashed for a shutdown, and it produced a file that worked exactly like an executable and resumed computation from the time it was crashed.

  55. How hard could this be to experiment with? by Nelson · · Score: 5, Interesting
    I've thought about this for booting issues. I have a server that's all journaled and everything and it periodically gets bumped. Boot time is still on the order of 2 to 4 minutes for a full Linux server install. With my current stats that means I'm probably going to miss a hit or two on one of the web pages, all things being equal. A good portion of that is just icing though, things that are there "just in case" or get used infrequently. (Okay, I can screw with the init order and the problem essentially goes away or I can switch hardware but we're nerds and geeks so let's just explore this)


    I was thinking about this and here was my dirty hacky idea. You need kexec, lobos, or something similar (actually a fairly modified version of it); you'll need on the order of 8MB of disk space and some kernel mods, which might not be that extensive.


    I was thinking we develop some driver or process that consumes all of the memory and CPU in a system. It forces all of the processes to swap out; it would probably need to be a driver of sorts on current Linux systems. Then it could dump the kcore out to a file somewhere, sync it, and hibernate. Then when the kernel boots up, if the right arg is passed in, it could either load this image back into RAM in place of the kernel and then jump into it (easier said than done) early in the boot (page tables are made long before you have access to the drives and such, so the logistics of this would need to be figured out), or it could boot up and use a different swap partition and then have some kind of tool like kexec to load that image back into RAM and start it up. Or something; somehow you should be able to recover the state of the system. File handles and everything would be there.


    The harder part would be hardware and network transparency. You'd need to modify all of your drivers to make sure that the hardware could be reset and they could deal with it. I think it's a little easier for the network side because it would be similar to simply unplugging the network cable: you have open sockets that are talking to nothing and some software can deal with that pretty well. There is also some kind of system integrity or robustness piece that is needed; if the system somehow changes when you bring your old image back it could break things, munge files, etc.

  56. doesnt SETI@home do this, sorta? by Pharmboy · · Score: 3, Informative
    seti@home kinda does it.

    the seti@home client uses its *.sah files to save the state of a calculation. of course, this is program dependent, not OS dependent. I guess if you have the source files for the program doing the counting.....

    --
    Tequila: It's not just for breakfast anymore!
  57. STANDALONE CONDOR CHECKPOINTING by Anonymous Coward · · Score: 5, Informative

    STANDALONE CONDOR CHECKPOINTING:

    Using the Condor checkpoint library without the remote system call functionality and outside of the Condor system is known as
    "standalone" mode checkpointing.

    To link in standalone mode, follow the instructions for linking Condor executables, but replace condor_syscall_lib.a with libckpt.a. If you
    have installed Condor version 5.62 or above, you can easily link your program for standalone checkpointing using the condor_compile
    utility with the little-known "-condor_standalone" option. For example:

    condor_compile -condor_standalone <compiler> [options/files....]

    where <compiler> is any of cc, f77, gcc, g++, ld, etc. Just enter "condor_compile" by itself to see a usage summary, and/or refer to
    the condor_compile man page for additional information.

    Once your program is relinked with the Condor standalone-checkpointing library (libckpt.a), your program will sport two new command
    line arguments: "_condor_ckpt <file name>" and "_condor_restart <file name>".

    If the command line looks like:

    exec_name -_condor_ckpt <file name> ...

    then we set up to checkpoint to the given file name.

    If the command line looks like:

    exec_name -_condor_restart <file name> ...

    then we effect a restart from the given file name.

    Any Condor command line options are removed from the head of the command line before main() is called. If we aren't given
    instructions on the command line, by default we assume we are an original invocation, and that we should write any checkpoints to the
    name by which we were invoked with a "ckpt" extension.

    To cause a program to checkpoint and exit, send it a SIGTSTP signal. For example, in C you would add the following line to your code:

    kill( getpid(), SIGTSTP );

    Note that most Unix shells are configured to send a TSTP signal to the foreground process when the user enters a Ctrl-Z. To cause a
    program to write a periodic checkpoint (i.e., checkpoint and continue running), send it a SIGUSR2:

    kill( getpid(), SIGUSR2 );

    In addition to the command-line parameters interface described above, a C interface is also provided for restarting a program from a
    checkpoint file. The prototypes are:

    void init_image_with_file_name( char *ckpt_name );

    void init_image_with_file_descriptor( int fd );

    void restart( );

    The init_image_with_file_name() and init_image_with_file_descriptor() functions are used to specify the location of the checkpoint file.
    Only one of the two must be used. The restart() function causes the process image from the specified file to be read and restored.

  58. Search in the slashdot archives for kernel patches by Alan · · Score: 5, Informative

    I think it was somewhere in the list of patches from the -mjc tree (see here) that there was a patch for the entire kernel for linux. Basically it lets the system save its state, and then restore it if it detects that it was shut down at that point. I'm not sure if this is what you want (and I couldn't get it working), but it's certainly a step in the right direction to what you're looking for.

    Just found it here, it's the 'swsusp' patch.

  59. Java has lightweight persistence... by bernz · · Score: 2, Interesting

    If you use the java.io serialization facilities (java.io.Serializable) right, you can create lightweight persistence and should be able to freeze and resume processes in the same application, provided you handle threading right with it.

  60. Doesn't matter... by Anonymous Coward · · Score: 2, Funny

    The answer would have been 42 once the processing was complete. So who cares? Get a bigger UPS :-)

  61. Darwin/MacOS X by Duck_Taffy · · Score: 4, Informative

    Here's a mutation of FreeBSD that can do exactly that. I've put my laptop to sleep in the middle of installing software while running MacOS X and brought it back up several hours later to resume installation with no problems. The same function works on my G4 tower. Yes, it does drop network connections. However, it does use a trickle charge to power the LEDs and presumably to keep the processor alive, and possibly some memory. Paging several hundred megabytes in a couple of seconds would be quite the task! One item of note is that all Apple machines have a special piece of hardware known as the PMU (Power Management Unit). In the desktops, it's parted out onto the motherboard and into the power supply, but in the laptops it's a separate card which controls both sleep and the charging of the battery. Perhaps other UNIX machines would need a similar device for this function to work properly.

    --
    Karma: Ran over your dogma.
  62. problematic by S.+Allen · · Score: 2

    Easier said than done. If this wasn't part of the application's design or if it's relatively sophisticated, making these changes can be non-trivial. And (shock/horror) if you don't have the source code, it's impossible without OS assistance.

  63. Solid-state memory by kenneth_martens · · Score: 2

    I think this problem is more easily solved in hardware than in software. With recent advances in solid-state memory, hopefully a standard can be worked out so that solid-state memory can replace or complement volatile memory (i.e., RAM as we know it.) Solid-state memory would survive a power outage, and you could pick up where you left off.

    The disadvantages are speed (solid-state memory is getting faster all the time, but it is still slower than volatile RAM), cost, and lack of current standardized implementations (I'm not even sure there are any working implementations.)

    For some background research in solid-state memory, check out this site (it's a bit old, but still interesting).

  64. It is possible...but it could be messy... by Mysticalfruit · · Score: 3, Interesting

    What if the process has forked off a bunch of children? Are you going to archive all the children at the same time? What if the process has a whole bunch of files in /tmp, are you going to roll them up into the freeze state as well? What if you're using pthreads? Are you going to keep the state for each thread? How about file pointers?

    I think the better solution is to write a new signal called "SIGFREEZE" and have programs just write code that could handle such an event. Let the program figure out how to save their own stuff.

    A good example would be a program that was calculating pi. The programmer would have to implement a signal handler that, when it received a SIGFREEZE, would stop its computing and write what it's currently working on out to a file. The other thing the programmer should be doing is periodically writing their data out to a file anyway. Then the programmer should implement a command line option that would facilitate reloading from a saved state.
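    A rough Python sketch of that idea follows. It is only an illustration under stated assumptions: most systems have no SIGFREEZE, so SIGUSR1 stands in for it, and the names freeze, on_freeze, install_handler, compute_pi and the file pi_state.json are all made up for this example.

```python
import json
import os
import signal

freeze = {"requested": False}


def on_freeze(signum, frame):
    # Do as little as possible in the handler: just raise a flag that the
    # main loop checks at a point where its state is consistent.
    freeze["requested"] = True


def install_handler():
    # SIGUSR1 stands in for the hypothetical SIGFREEZE.
    signal.signal(signal.SIGUSR1, on_freeze)


def save_state(state, path):
    with open(path, "w") as f:
        json.dump(state, f)


def load_state(path):
    # Start from scratch when there is no saved state to reload.
    if not os.path.exists(path):
        return {"k": 0, "total": 0.0}
    with open(path) as f:
        return json.load(f)


def compute_pi(terms, path="pi_state.json"):
    """Leibniz series for pi; returns None if frozen before finishing."""
    state = load_state(path)
    while state["k"] < terms:
        if freeze["requested"]:
            save_state(state, path)
            return None
        k = state["k"]
        state["total"] += (-1) ** k / (2 * k + 1)
        state["k"] = k + 1
    return 4 * state["total"]
```

    A program would call install_handler() at startup and then compute_pi(); sending it SIGUSR1 (for example with kill -USR1 and its pid) makes the loop save its state and return, and running it again picks up from the saved term count.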

    That's my take on it...

    If you see any problems with it... bring it on.

    --
    Yes Francis, the world has gone crazy.
    1. Re:It is possible...but it could be messy... by ameoba · · Score: 2

      Uhh... Why bother with a new signal when you can just write the program to save a checkpoint when receiving one of the normal ones? It's not like handling signals is -that- hard.

      --
      my sig's at the bottom of the page.
    2. Re:It is possible...but it could be messy... by scrytch · · Score: 2

      > I think the better solution is to write a new signal called "SIGFREEZE"

      Which is not only how Solaris does it, it's what Solaris calls it. The counterpart signal is SIGTHAW. The signals are advisory though, the process isn't required to implement all the freeze/thaw logic in userspace.

      --
      I've finally had it: until slashdot gets article moderation, I am not coming back.
  65. Re:Really worth the effort? by sketerpot · · Score: 2
    Why not do some work once and save all the application developers a lot of work? This is a good idea.

    This could be done without doing anything to your BIOS; you could just dump all the memory allocated to a certain program to disk and put that process in a list of hibernating processes. What's so hard about that?

  66. But it gains you nothing by DunbarTheInept · · Score: 2
    But VMware is typically running things twice as slow as native, so you gain nothing at all by running the project under vmware. Consider: Without a way to checkpoint the program, what happens if you have to start over near the end of the run because you had to kill it? You end up taking twice as long overall - the first aborted run plus the full run time again from scratch a second time. So in the worst case scenario, where the program is killed *just as* it was about to finish, you get performance as bad as running under VMware without a crash.

    It only is worth it if you expect to have to halt the program more than once. Assuming only one halt and restart, VMware is still slower.
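    To put numbers on that argument, here is a small Python sketch. The 2x slowdown is just the figure assumed in this post, and native_time and vm_time are hypothetical helper names; restarts are assumed to begin from scratch every time.

```python
def native_time(run_time, kill_points):
    """Total wall time when each kill forces a restart from scratch.

    kill_points lists, for each aborted attempt, the fraction of the job
    (0.0 to 1.0) that had completed when it was killed.
    """
    wasted = sum(fraction * run_time for fraction in kill_points)
    return wasted + run_time


def vm_time(run_time, slowdown=2.0):
    """Total wall time inside a VM that can suspend, so no work is lost."""
    return run_time * slowdown
```

    For a 72-hour job, a single kill at the very end costs native_time(72, [1.0]), which equals vm_time(72); any earlier single kill leaves native ahead, while two kills at the three-quarter mark, native_time(72, [0.75, 0.75]), already exceed the VM figure.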

    --

    Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.

  67. File Descriptors are per-process by parc · · Score: 3, Informative

    A file descriptor is a per-process entity. Yes, there's a big table of file descriptors that exists for the entire system, but file descriptor 5 for process a is not file descriptor 5 for process b. Not even if they point to the same file/pipe. A case in point is FD 0, aka stdin. Every process starts out with a stdin on FD 0.

    More important is how do you tell the kernel what file descriptor 5 pointed to? What if the file/pipe doesn't exist any more?

    1. Re:File Descriptors are per-process by jelle · · Score: 2, Insightful

      Just return an error message. The application has to be able to deal with lost connections anyway.

      Note that you can SIGSTOP a process, then it will be on hold, may even become completely swapped out. Then you can SIGCONT the same process to let it run again.

      So you could send it a SIGSTOP and force it to swapout. That is just checkpointing until the next reboot... Of course you need more info to restore the process from the swap when the system reboots, but it's a start as to how to implement checkpointing.
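      The stop/continue half of this can be sketched in Python as below. It does nothing about surviving a reboot, which is the hard part. The sleeping child is a made-up stand-in for the real job, and pause, resume and demo are hypothetical names.

```python
import os
import signal
import subprocess
import sys


def pause(pid):
    """Suspend a process; SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Let a stopped process continue where it left off."""
    os.kill(pid, signal.SIGCONT)


def demo():
    # A sleeping child process stands in for the long-running job.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        pause(child.pid)
        # WUNTRACED makes waitpid report a stopped (not exited) child.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)
        resume(child.pid)
        still_running = child.poll() is None
        return stopped, still_running
    finally:
        # Clean up the stand-in job so the demo leaves nothing behind.
        child.kill()
        child.wait()
```

      From a shell the same thing is kill -STOP and kill -CONT on the pid.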

      I'm sure there is more than one road to Rome.

      --
      --- Hindsight is 20/20, but walking backwards is not the answer.
  68. Re:Really worth the effort? by uchian · · Score: 2

    Of course, if you're running some job which could take a month to finish... you code it so that it can pick up where it left off, or at least where it will only have lost a couple of hours' worth of work at the most.

    Or is that too sensible?

    (and if it's a proprietary package and it can't pick up from where it left off, find a different one).

  69. Re-crashing problem by DunbarTheInept · · Score: 2

    My concern with that is this: Let's say something buggy is making the system crash. Then if the persistent OS does its job with perfect accuracy, it's just going to end up re-creating the conditions that caused the crash, and Boom - crash again. The only way to avoid this is to NOT succeed at the goal of re-creating the conditions before the crash.

    --

    Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.

  70. VMWare isn't a solution to a cpu bound process by brer_rabbit · · Score: 2

    While I love VMWare, it does consume a substantial amount of CPU/memory. The problem is a job like what the original poster described is usually CPU or IO bound, and VMWare just starves the process of what it needs even more.

    Granted, it is a solution, but your job that ran in 3 days just got pushed out to a week. It's just a tradeoff.

    What the poster really needs is to rewrite the program to drop intermediate data along the way. If you have hourly checkpoints you can minimize the amount of data lost. How to implement checkpoints is left as an exercise to the reader :)
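    For readers who want a head start on that exercise, one possible shape in Python is sketched below. The toy workload (summing integers), the JSON state format and the names save_atomically, load_or_default and run are assumptions for illustration only.

```python
import json
import os
import time


def save_atomically(state, path):
    # Write to a temp file, then rename it over the old checkpoint, so a
    # power cut mid-write never leaves a half-written file as the only copy.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_or_default(path, default):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, interval_s=3600.0):
    """Toy job: sum the integers below total_steps, checkpointing as it goes."""
    state = load_or_default(path, {"i": 0, "acc": 0})
    last_save = time.monotonic()
    while state["i"] < total_steps:
        state["acc"] += state["i"]
        state["i"] += 1
        if time.monotonic() - last_save >= interval_s:
            save_atomically(state, path)
            last_save = time.monotonic()
    save_atomically(state, path)
    return state["acc"]
```

    With the default interval, at most about an hour of work is lost on a crash; shrinking interval_s trades extra disk writes for less lost work.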

  71. Checkpointing? by rnturn · · Score: 2

    If memory serves me (hey, it is Friday after all and both brain cells are pretty tired) we looked into something like what the poster was asking about years ago. In those days, we were running some simulations on a PDP-11/70 that took 7-10 days to complete. In the event of a general power failure we wouldn't have been able to run on backup power for very long. DEC's RSX had a feature whereby a task could be checkpointed to disk. Then, presumably, it could be reloaded and resumed at the same state it was in at the time of the checkpoint. We never did implement it since it would have introduced too much delay into the project schedule (adding it to the simulation, testing, etc.) but it sounds like the sort of thing that could be useful in current day OSs. Anyone know of any general purpose operating systems today that have this feature? I haven't heard of any and wonder (not too seriously, mind you) if anyone sells core memory for a PC architecture computer. Of course, it wouldn't be very fast but you'd worry a lot less about power failures that are longer than the UPS's ability to provide power.

    --
    CUR ALLOC 20195.....5804M
  72. Think of VMware as a process wrapper by Binx+Bolling · · Score: 2, Insightful

    This is why VMware suspend works the way it does. It provides a consistent virtualized hardware interface, regardless of the details of the real hardware. The original question referred to individual process saving, and VMware suspend is similar to the whole OS suspend feature in laptops. Nevertheless, if you consider VMware to be a wrapper for individual processes that you want to be able to checkpoint, it turns out to be quite a nice solution to the original problem with zero programming required, and just a little pocket money to implement.

    bb

  73. dump core, then pick it up in gdb and 'c' by PaulBu · · Score: 2

    you can always dump core of the process
    (e.g., kill -SIGSEGV), then load the core file
    into gdb (gdb program corefile) and
    issue 'cont'.

    The OS state would be gone though (so, no
    files besides stdin/stdout), but for purely
    computational process that might work as a
    one-time shot. At least you could save main
    arrays from gdb and read them in into a modified
    program.

  74. I'm surprised nobody has searched fm yet... by cduffy · · Score: 2

    ...and found esky, a purely userspace checkpoint/resume implementation.

  75. Solaris has done this for a while. by cgleba · · Score: 2

    I remember an option in Solaris 7 that lets you dump memory to swap, shut down the computer and when you restart it reads swap and drops you back into the exact same state as you were in before.

    Pretty cool because you could restore to a full X-session with all the programs and documents you were working on before undisturbed.

    I don't know if this is what you were looking for. . .

  76. Re:Suspending your application code might be simpl by Suppafly · · Score: 2


    I think laptops work by getting the applications and OS into a safe and simpler state and then saving that state. I suspect they cannot save any arbitrary application you could write - just the applications they routinely run.


    If you've ever used a laptop with this feature you'd realize what you just said is totally wrong. The hibernate function of these laptops is managed by hardware, not software, and so is OS and program agnostic. When you close the lid or hit the sleep button, it dumps the entire state of the RAM into a special partition and turns off. When you revive it, you are back exactly where you left off, regardless of whether you are running Windows or Linux, or whether you are playing Quake 3 or cracking RC5 stuff.

    By managing this stuff in hardware, it's actually less complicated and works 100% of the time, as opposed to the Windows software solution that often refuses to 'wake' after being put in sleep mode and is dependent on the power supply being on and supporting the feature.

    If someone were to add a feature like this to a large multiuser mainframe type system, it would definitely make more sense to go with a hardware based solution that dumped the system state to a disk or multiple disks to ensure that it always worked and not just some of the time for some of the apps.

  77. Perspective on solution by rcj4747 · · Score: 2, Insightful

    At first this seems like a nice idea. It would be elegant to be able to halt processes and resume them later without them consuming resources in the interim.

    Before going forward ask yourself what the practical application of this work could be. If you have to reboot systems with long running computational work going on you may need more reliable hardware or better management of the system to increase uptime. Furthermore, adding "suspend/resume" functionality to a single process within its own code would probably be far better as needed.

    Secondly, think of the concerns you face in implementing this as a generalized solution for user processes. Here are the problems with this concept that I can see.

    First, file handles, file system pointers, and network connections may not exist when the process is restarted. Let's say that there is processing of NFS data being done and when the process is resumed that mount is no longer accessible. You get an error from NFS like EIO or the like and the process dies.

    Secondly, the hardware may no longer be available. What if the process was using a PCMCIA card which has been removed? The process dies. In a simpler case, a process could have a tty open for I/O and that tty may no longer be owned by the user when the process is restarted.

    This requires saving a lot of system state and does little to guarantee that the process can be restarted successfully and safely. Furthermore, the dependencies for a single process (being fairly complex) would require a good knowledge of the process by the user to determine the feasibility of suspending and resuming the process.

    It seems that this would not be accessible to average users of the system, even if it were possible to create in a generic sense.

    It does stand as a good question to start someone thinking about unix internals though.

  78. You can do OS-independent process-hybernation... by eyefish · · Score: 2, Interesting

    Something many people not familiar with J2EE (Java 2 Enterprise Edition) don't know is that when you have an application running in a Java container, it, and the state of all its processes, get automatically saved and restored whenever the container, the OS, or the machine crashes. True, in practice some diligence is required from the programmer (for example, when you need to set objects to a specific state upon re-instantiation), but the functionality is there, is OS-independent, and it's been proven and used daily in heavy-duty environments for a few years now.

  79. Re:Use Windows XP - OT by ADRA · · Score: 2, Informative

    Yeah, not really relevant to the main topic, but any modern PC does have suspend support built in, so the no-additional-software thing is a pretty moot point.

    Hibernation IS a software thing: when the OS receives or generates a shutdown-hibernate event, it writes all available memory and state to disk and shuts down, setting a flag so that on the next boot the OS knows it was hibernated.

    --
    Bye!
  80. If you really want this... by J.C.B. · · Score: 2

    ...why not just boot up classic at startup? My brother set his computer to do this, you can too if you don't want to wait.

  81. Look at KeyKOS by DV · · Score: 2, Interesting

    Basically that was one of the ideas behind the research on micro-kernels. If the state of the system gets small and centralized enough, one could make not only a single process persistent but the full system persistent.

    KeyKOS was a very promising system offering this at the time. One could not checkpoint the connections outside of the machine, but their demo was a BSD machine with X11, whose powerplug was violently removed. When replugged the state of all processes saved at the last checkpoint was resumed and the system would continue ... Including X-Windows !!!!

    Now wait for the Patent to expire, put it in Linux and watch the world of computing change.

    It was very promising at the time I was doing my PhD 10 years ago, I don't know why this never "made it"

    Daniel

  82. Re:Use Windows XP by taliver · · Score: 2, Insightful

    It's not possible to hibernate a single process.

    Wow, so the fact that it's been done here is just a red herring?

    Does Virtual Memory mean anything to you?

    --

    I demand a million helicopters and a DOLLAR!

  83. This is why EROS was invented... by jcr · · Score: 2

    Check out http://www.eros-os.org.

    EROS processes persist until you take them down. They persist across power loss, system upgrades, etc, etc.

    -jcr

    --
    The only title of honor that a tyrant can grant is "Enemy of the State."
  84. Sun Already Does This by Anonymous Coward · · Score: 4, Interesting

    Sun already implements a system suspend/unsuspend in Solaris that works on all boxes but the Blade 100's.

    10 years ago I worked on a Unisys Unix box that did it automatically, meaning you could pull the power out of the wall without any warning and then plug it back in later. When the system rebooted, it would say "there's been a power failure, recovering" and then put all the processes back to the way they were before. Even with an open vi session where I was actively typing, I wouldn't lose more than a character or two.

    I found out the machine had it quite by accident because my loser boss turned the box off one evening without doing a proper shutdown... Once I saw what it did, this required further testing :-)

    Still, what would be even better is if it could be done on a per process basis. I can think of many reasons why you might want to suspend a process for a few days and bring it back later (say something you only wanted to run outside of work hours), but had no intention of shutting the whole box down. And this should be implemented in the kernel, not by hacking each program to provide this functionality.

  85. Cray UNICOS by Huusker · · Score: 2

    What if the process has forked off a bunch of children? Are you going to archive all the children at the same time? What if the process has a whole bunch of files in /tmp, are you going to roll them up into the freeze state as well? What if you're using pthreads? Are you going to keep the state for each thread? How about file pointers?

    Back in the 80s, Cray UNICOS had a cadillac checkpoint package. It could track child procs, save /tmp files, save threads, save pipe data, and pass down SIGCKPT for user-controlled checkpoint.

    Of course at $1000/hour you want to damn sure be able to save your work :-)

  86. Palm sort of does that by josepha48 · · Score: 2
    Palm OS sort of does that.

    On a palm you can shut it off and when you turn it on it is where you left the device at. I think it would be neat too if this could be the way operating systems worked. Ideally one would be able to turn off the computer in the middle of an app and it would turn on at the same place it was left at.

    Of course the palm does not do multitasking, multiprocessing or anything like that and when you close an app it is usually sent back to its initial state.

    Maybe the way to do what this user wants is to take journaling to the next step, and rather than have a journaling file system, have a database file system where stuff is done in commits like a jfs. Then one could do rollbacks as well. This would require the whole system to be rethought though.

    --

    Only 'flamers' flame!

  87. Re:Can you have both? by mlheur · · Score: 2, Interesting

    what if the OS had a hook in it to like
    `kill -FREEZE <pid>`
    No new hardware, only done once, will work on all processes.

    And as described previously, the FREEZE signal would cause the process to dump execution code, memory pages, FD's etc. etc. to a dump file.

    reboot the system.

    Then find some way to execute that dump file which will in turn load FD's, pages, execution code, and resume with the IP (instruction pointer, not IP Addr. for those not arch inclined) in the same spot?

    /me isn't much of a kernel hacker so I don't know the details of how to do it, but that's my high level solution.

  88. Re:Really worth the effort? by harlows_monkeys · · Score: 3, Insightful

    There are more than power problems to worry about with a long running process. There are other hardware failures, scheduled downtime, and system crashes to contend with. Just because in this instance it was a power failure that made him wish he had this ability doesn't mean it wouldn't be useful in other circumstances.

  89. Re:Hibernation? by -douggy · · Score: 2

    Acer TravelMates (well, my 312T) do the same thing in SUSE Linux. If you shut the case it goes into suspend mode (function key + F3). To hibernate fully I needed to leave a 40MB FAT32 partition to dump the RAM to, as the BIOS didn't seem to want to dump to the Linux partition.

    Obviously I don't use the laptop for large numerical simulations, but I just tested it with a Fortran number-cruncher program running and it woke up fine.

  90. Re:Yeah, CDC's NOS/BE could do this 25 years ago by swb · · Score: 3, Insightful

    Why are software techniques shit today compared to yesterday?

    Because we're hopelessly caught up in trying to reinvent a somewhat limited computing paradigm (unix). No one, except for some CompSci projects that never really go anywhere, has any real interest in making a new operating system that builds on the lessons of all the previous operating systems and includes reasonable features like process checkpointing/suspension.

    I'd bet there are patent considerations as well -- maybe many of the good OS features are not reproducible due to existing patents.

  91. My experience with windows xp hibernation by Afrosheen · · Score: 2

    Step 1: Clean, fresh install of XP Pro corporate.
    Step 2: The requisite reboots until everything works.
    Step 3: Leave the office, set computer to hibernate for fun.

    I. Results
    A. Blue screen of death upon return to office.
    B. Reboot yielded '/windows/config file is missing or corrupt'.
    C. Much cursing and a swearing off of anything Microsoft.

    XP isn't as wonderful as people would have you believe. A short trip to google inquiring about repairing this mess will result in endless posts.

  92. Dumb mice by WyldOne · · Score: 2

    And if those mice were so smart how come they didn't think about it? Even I know that hardware fails.

    --

    make Linux, not Microsoft. sin(beast) = -0.809016994374947424102293417182819
  93. VMware does this, easily and effectively. by PatJensen · · Score: 2
    I just deployed a FreeBSD 4.4 virtual machine onto an IBM NetVista using VMware Workstation 3.0, which can safely put any PC-based OS into a hibernation mode on demand with one click.

    This hibernation mode snapshot can be duplicated or even put on other machines in the event of a system failure. The virtual machine will then come back online like nothing ever happened, with hardware devices effectively still attached and processes still running.

    It works really slick, you can perform other tasks and come back to your virtual machine later without slow boot times. This will also work on Linux, Solaris, and Windows platforms. I'd highly recommend VMware for on-demand OS access.

    -Pat

  94. Re:Use Windows XP by rlowe69 · · Score: 2
    Why wouldn't you want to save the state of the whole machine when power failure is imminent?

    Because that's not what he asked. He asked, and I quote:

    Why can't I freeze down the process and thaw it back up at a later time? It ought to be possible to take all the connected memory pages and save them in some way, preserve file handles and pointers, and everything.


    This has different implications. Let's say that you have to turn off your system to replace a noisy fan, but you have a process going that could take a few days (a render farm or cluster is a place where this might happen). You'd like to pause it and then resume it once the computer is back on. In order to do that, you'd have to save EVERY piece of information associated with the running process like memory used, files, etc. THIS is what the guy is talking about, not hibernating the whole computer (which, if the computer is running many processes, could be an extremely bad use of hard disk space, not to mention time consuming - time is something you don't have when running off a UPS).

    Cliff's only made the situation worse by saying "Laptops have been doing this in some form for years", but really "in some form" is a generalized stretch. It seems likely to me that it's much more complicated to save and restore one specific process than it is to save and restore all of them in one big dump back into memory when the system recovers.
    --
    ----- rL
  95. A case for Python by defile · · Score: 3, Informative

    Python supports a concept that it calls 'pickling' (which is also known as Object Serialization).

    It's extremely easy to save the state of any object along with the objects it references to disk with literally a couple of lines of code (like, 3). You cannot pickle whole processes, but it's effortless to write some skeleton code to resume the process from its last pickle. You can also define specific methods in each object that are called on pickle/unpickle for special cases (restoring network connections, for example).
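    As a rough illustration (not anyone's production code), the sketch below pickles a made-up Simulation class. The class, its host argument and the save/load helpers are hypothetical; __getstate__ and __setstate__ are the standard pickle hooks for special cases like network connections, and the unconnected socket is only a stand-in for a real connection.

```python
import pickle
import socket


class Simulation:
    """Made-up long-running job whose state we want to persist."""

    def __init__(self, host):
        self.host = host
        self.step = 0
        self.results = []
        self.conn = self._connect()

    def _connect(self):
        # A real program would connect to self.host here. An unconnected
        # socket is enough to show the problem: sockets cannot be pickled.
        return socket.socket()

    def advance(self):
        self.results.append(self.step ** 2)
        self.step += 1

    def __getstate__(self):
        # Called on pickle: leave the live connection out of the saved state.
        state = self.__dict__.copy()
        del state["conn"]
        return state

    def __setstate__(self, state):
        # Called on unpickle: restore attributes, then reconnect.
        self.__dict__.update(state)
        self.conn = self._connect()


def save(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

    The skeleton code for resuming is then just: call load() if a pickle file exists, otherwise build a fresh Simulation, and call save() every so often in the main loop.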

    The fact that it's an interpreted language shouldn't deter you. Python integrates easily with modules compiled from C, allowing you to accelerate time critical aspects of your code while rapidly developing the not so critical aspects.** Python was designed to solve the problems you're working on.

    Oh, and if you're short on time, don't worry; Python is extremely easy to learn.

    ** As most programmers have found, about 90% of their program's execution is spent in 5% of their code.

    1. Re:A case for Python by scrytch · · Score: 2

      Pickling is really trivial. Hell, I've worked on a pickler for C++. Perl has a plethora of picklers (say that 10 times fast), including Data::Dumper, FreezeThaw, and Storable. Then there's Java's Serializable. It's really not a terribly interesting problem, orthogonal persistence is. Orthogonal persistence where you have an interpreter or other such runtime environment that can be started and stopped and moved around in the meantime, possibly with multiple processes attached to the runtime simultaneously. The "orthogonal" means that you don't do anything special (like pickling) to persist objects and retrieve them from the persistent store, they're just there when you want them, and their lifecycle is indefinite when you create them.

      Any programmable MUD is an example of such orthogonal persistence. Squeak and Self would be others. Personally I wouldn't mind such an environment for Python, but I'm not holding my breath.

      --
      I've finally had it: until slashdot gets article moderation, I am not coming back.
    2. Re:A case for Python by defile · · Score: 2

      Python is by no means unique, heavens no. I just think they're features that are probably most accessible to the masses because of Python's popularity and ease of use.

      I really ought to get my hands on Squeak.

  96. my windows PC does it by Gumber · · Score: 2

    I just hibernate and system state is written to disk.

  97. Uh huh... by global_diffusion · · Score: 2, Funny

    GNU Emacs basically does this to reduce initialization times.

    I heard about this. But, my dear boy, I do believe that VI does this better and with more cryptic keyboard commands.

  98. Re:BTW, simple UPS in PSU? by man_ls · · Score: 2

    I'm thinking an internal UPS that operates sort-of like the "shed" switch in small aircraft. When you have to use it to conserve power, it kills all power to all instruments on the copilot side, some lights, and other stuff, leaving only the radios and nav instruments operating - enough to fly but not enough to do much else.

    The question is - how would Windows react to suddenly having its, say...CD-ROM and floppy drives just cease to exist while the OS was running? I've accidentally pulled power cables on drives that weren't in use, had no active handles on them, and hadn't even had media inserted into them that session, and they still caused massive problems in Win2K.

    I don't think most OSes would react well to having the power to everything except the processor and hdd0 shed from under them, even to conserve power while a savestate took place.

  99. EROS is an entire operating system based on it by jeske · · Score: 2, Informative
    EROS is a research operating system built around the idea of making all processes persistent at all times.

    EROS' predecessor, KeyKOS, made waves at USENIX when they did a demo of a UNIX system + Xwindows which would instantly restore the running state of all software when rebooted. It was basically a UNIX port to KeyKOS, and since everything in KeyKOS was persistent, so was everything in the UNIX.

    One interesting caveat with this type of OS is that you really need to use ECC memory, because bit errors can get saved to disk and propagated forever!

  100. My thoughts on what would be needed. by Vulture_ · · Score: 2, Interesting
    I know this has already been done, but I thought I'd throw in my understanding of what would have to happen:
    • The process' core would need to be dumped. This should be fairly straightforward, since some of them do this a lot already... ;)
    • The process' registers (main CPU registers, FP registers, and any other registers that might exist on whatever exotic arch you use) would need to be saved. This is already done by the kernel for context switching to other processes. Gdb also can fetch and change these, so it can be done from userland.
    • Last but not least, all of the kernel level state of the process would need to be saved. That involves saving:
      • The signal handler table.
      • The pending signals. (Since the process hasn't handled its pending signals yet, it needs to handle them at some point in the future.)
      • The state of whatever syscall, if any, the process was in at the time of freezing. (Or you can set errno = EINTR on most syscalls, if this isn't possible.) This would be rather interesting to implement -- what if you're using a different kernel when you thaw? You can't just save the pc of the syscall then...
      • The file descriptor table. This includes network sockets, which would probably have to be closed (set errno = EPIPE and send SIGPIPE on next read from a thusly closed socket?), for obvious reasons.
      • The System V IPC state. This means message queues, shared memory, and semaphores, all of which would have to somehow be recreated.
      • Any child processes, most likely.
      • More interestingly, any threads, which might be hard to tell apart from ordinary processes since they are ordinary processes with some exceptions.
      • Everything else...

    As you can see, freezing and thawing UNIX processes could get quite nightmarish if you account for all of the possibilities. (Most processes don't use SysV IPC, for instance.) Even the most (seemingly) trivial of syscalls would need to be modified (all socket functions, for instance).
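
    To get a feel for how much of that list is visible from userland, here is a minimal sketch, assuming a Linux box with /proc mounted. The function name inventory is made up for illustration. It only lists signal state, thread count and open descriptors; it doesn't save or restore anything, and on a system without /proc it just returns empty tables.

```python
import os

# Fields in /proc/<pid>/status that describe signal and thread state (Linux only).
STATUS_FIELDS = ("Threads", "SigPnd", "ShdPnd", "SigBlk", "SigIgn", "SigCgt")


def inventory(pid):
    """Collect a rough list of kernel-side state for one process.

    Returns a dict with 'status' (selected /proc status fields) and
    'fds' (descriptor number -> target). Both stay empty when /proc is
    unavailable or unreadable, e.g. on a non-Linux system.
    """
    state = {"status": {}, "fds": {}}
    base = "/proc/%d" % pid

    try:
        with open(os.path.join(base, "status")) as f:
            for line in f:
                key, _, value = line.partition(":")
                if key in STATUS_FIELDS:
                    state["status"][key] = value.strip()
    except OSError:
        pass

    fd_dir = os.path.join(base, "fd")
    try:
        names = os.listdir(fd_dir)
    except OSError:
        names = []
    for name in names:
        try:
            state["fds"][int(name)] = os.readlink(os.path.join(fd_dir, name))
        except (OSError, ValueError):
            # The descriptor may have been closed between listdir and readlink.
            continue
    return state
```

    Calling inventory(os.getpid()) shows the signal masks as hex strings and each descriptor's target (a file path, or something like a socket or pipe label). Everything that doesn't show up here, such as in-flight syscalls and SysV IPC, is the part that makes a real freezer hard.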

    Note that it's a lot easier to freeze and thaw a virtual machine, because it's so much more self-contained -- all you need to save then is:

    • Core.
    • Registers.
    • The state of any simulated hardware devices (virtual screen, for instance).
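
    To show how little a VM needs, here is a toy sketch. ToyVM and its three-instruction "architecture" are invented for this example and don't correspond to any real emulator. The whole machine is core, registers and one simulated device, so freezing is just serializing those three things.

```python
import json


class ToyVM:
    """A made-up accumulator machine: core (memory), registers, one device."""

    def __init__(self, program):
        self.core = list(program)           # "core": the program lives in memory
        self.regs = {"pc": 0, "acc": 0}     # registers
        self.screen = []                    # simulated output device

    def step(self):
        """Execute one instruction; return False once the machine halts."""
        op, arg = self.core[self.regs["pc"]]
        if op == "halt":
            return False
        if op == "add":
            self.regs["acc"] += arg
        elif op == "print":
            self.screen.append(self.regs["acc"])
        self.regs["pc"] += 1
        return True

    def freeze(self):
        """Serialize the complete machine state to a string."""
        return json.dumps(
            {"core": self.core, "regs": self.regs, "screen": self.screen}
        )

    @classmethod
    def thaw(cls, blob):
        """Rebuild a machine from a string produced by freeze()."""
        data = json.loads(blob)
        vm = cls([tuple(instr) for instr in data["core"]])
        vm.regs = data["regs"]
        vm.screen = data["screen"]
        return vm


program = [("add", 2), ("add", 3), ("print", 0),
           ("add", 10), ("print", 0), ("halt", 0)]
vm = ToyVM(program)
vm.step()
vm.step()                       # run part of the program
frozen = vm.freeze()            # "power off"
vm2 = ToyVM.thaw(frozen)        # "power on", possibly in another process
while vm2.step():
    pass
print(vm2.screen)               # [5, 15]
```

    The thawed machine picks up at the saved pc and finishes the program, and nothing outside the VM object had to be touched. That is exactly what a real process doesn't give you.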
    --

    The only way the typical /.er can pick up a chick is with a forklift. -- AC

  101. MacOnLinux also by mbrubeck · · Score: 2

    MacOnLinux has the same feature, for those of us not in Intel land.

  102. SQL-like Rollback by mattr · · Score: 2

    Suse on my Dell Inspiron 7.5K used to work with the suspend key, but no longer (X just hangs).
    But ancient software is involved.

    That said, rather than hibernation I'd prefer a software-UPS or time-rollback widget. How viable would it be to keep a very high frequency incremental save of state (even just the contents of a limited number of folders would be useful)?

    It would be useful to be able to send your machine backwards in time without requiring everything to be in a database or versioning system that requires explicit saves. I'd like to be able to remove the effects of every command in the history of all shells in reverse, in the right order, and have high-granularity access to previous states of a filesystem.

    If I could do that for all the relevant accounts on various machines it would be like never having to worry. I could leave the desk when I want to, kick the power cord or make meatheaded mistakes, and could keep a less paranoid number of full backups. I'd be worried about the life of my hard disk though. Does this already exist?
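
    For the limited "contents of a few folders" case, a rough sketch of the idea might look like this. SnapshotStore is a made-up name, not an existing tool. It assumes the store directory lives outside the folder being tracked, keeps the snapshot list only in memory, and ignores permissions, symlinks and empty directories. Each snapshot writes only file contents it hasn't seen before, which is what keeps frequent snapshots cheap.

```python
import hashlib
import os


class SnapshotStore:
    """Keeps file contents keyed by SHA-256, plus a list of snapshots."""

    def __init__(self, store_dir):
        self.store_dir = store_dir
        self.snapshots = []                 # each entry: {relative path: digest}
        os.makedirs(store_dir, exist_ok=True)

    def snapshot(self, folder):
        """Record the current state of folder; return the snapshot index."""
        manifest = {}
        for root, _dirs, files in os.walk(folder):
            for name in files:
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    data = f.read()
                digest = hashlib.sha256(data).hexdigest()
                blob = os.path.join(self.store_dir, digest)
                # Incremental part: unchanged content is not written again.
                if not os.path.exists(blob):
                    with open(blob, "wb") as out:
                        out.write(data)
                manifest[os.path.relpath(path, folder)] = digest
        self.snapshots.append(manifest)
        return len(self.snapshots) - 1

    def rollback(self, folder, index):
        """Make the files in folder match snapshot number index."""
        manifest = self.snapshots[index]
        # Remove files that did not exist at snapshot time.
        for root, _dirs, files in os.walk(folder):
            for name in files:
                path = os.path.join(root, name)
                if os.path.relpath(path, folder) not in manifest:
                    os.remove(path)
        # Restore the recorded contents.
        for rel, digest in manifest.items():
            path = os.path.join(folder, rel)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(os.path.join(self.store_dir, digest), "rb") as src:
                data = src.read()
            with open(path, "wb") as dst:
                dst.write(data)
```

    Call snapshot() on a timer and rollback(folder, n) takes the folder back to state n. This re-reads every file on every snapshot, so the worry about disk wear stands, and it does nothing about undoing the side effects of shell commands outside the folder.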

  103. Take it one step further by Allnighterking · · Score: 2

    What I'd like would really be one step further in the chain. Something like my Palm or the old Canon Cat. Turn it off, come back a week, month or year later and voila. You are right back at the same point you left, as if you never turned it off. The basics as I see it would be that ram gets written to swap as an image (which is what the Canon Cat did). Then when you restart the box by turning it on, ram gets re-initialized from the swap file back to the state it was in before power off. The other option would mean adding a small battery pack to a desktop. If you hit the power button on a box or pull the tail from the wall, ram is maintained by the battery until you re-power the box (or the battery finally goes south). As I see it there shouldn't be any reason why a box, once run through startup, shouldn't be able to maintain its running state almost indefinitely. In fact if you could get Linux to do this one thing..... it would be on desktops so fast you wouldn't believe it. Unless you change hardware, what is the difference that occurs that requires the full init sequence anyway? The Green Peacers would love it because people wouldn't mind turning off their comp since it's "instantly on". The only downside would be that you wouldn't want to stay logged in, but then what's the diff between being logged in with the monitor off and being logged in with an instant-on feature? Course it would mean uptimes in years instead of days.....

    --

    I'm sorry, I'm too tired to be witty at the moment so this message will have to do.

  104. Re:Really worth the effort? by uchian · · Score: 2

    And what do you do when the program is so obscure that you can't just "find a different one"? Write a new one from scratch?

    Well, if it just lost you a month's worth of work, then you're in exactly the right frame of mind to go out and do so!

    And of course, if it was open source you wouldn't have to write it from scratch...

    But seriously, if a program is so mission critical (or deadline critical) that it is important not to lose a month's worth of work, and the software has no safeguards to prevent this from happening, and if you can't add any yourself AND you go ahead and use the software anyway... well you're a fool and deserve everything you get.

    Or at the least, learn a nice important lesson. And then go and rewrite the software.

    And the same deal works no matter what the timescale. If the software isn't up to scratch, then get or make some that is.
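
    For what it's worth, the kind of safeguard meant here can be quite small. A minimal sketch, assuming the job's progress fits in a small dict: the function names and the sum-of-squares "calculation" are placeholders, and stop_at exists only to fake a power cut.

```python
import json
import os


def load_checkpoint(path):
    """Return the saved state, or a fresh starting state if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_i": 0, "total": 0}


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it into place.

    The rename done by os.replace is atomic on POSIX, so an interrupted
    write should leave the previous checkpoint in place rather than a
    half-written file.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(path, limit, every=1000, stop_at=None):
    """Sum the squares of 0..limit-1, checkpointing every `every` steps.

    stop_at simulates an interruption: the function returns None there
    without saving, just as a power cut would.
    """
    state = load_checkpoint(path)
    for i in range(state["next_i"], limit):
        if stop_at is not None and i == stop_at:
            return None
        state["total"] += i * i
        state["next_i"] = i + 1
        if state["next_i"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

    Run it, kill it, run it again with the same path, and it resumes from the last checkpoint instead of from zero. At worst you lose `every` steps of work, not a month.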

  105. Software Suspend (for Linux) by MikeBabcock · · Score: 2

    Check out the software suspend patch for Linux. It allows the system to be suspended by SysRq-D (or shutdown -z) into swap space and resumed (or not) at the next reboot.

    --
    - Michael T. Babcock (Yes, I blog)