Slashdot Mirror


UNIX Process Cryogenics?

shawarma asks: "Due to a recent power outage, I've had to shut down a server running a process that had been running for ages calculating something. The job it was doing would have been done in a few days, I think, but I had to shut it down before the UPS ran out of juice. This got me thinking: Why can't I freeze down the process and thaw it back up at a later time? It ought to be possible to take all the connected memory pages and save them in some way, preserve file handles and pointers, and everything. Maybe net-connections would die, but that's understandable. Has any work been done in this field? If not, shouldn't there be? I'd like to contribute in some way, but I think it's a bit over my head.." Laptops have been doing this in some form for years: most laptops, when they run out of power, or when told by the user will go into "suspend" mode which is similar to what the poster is describing, however outside of laptops, I haven't seen this done. Sleeping processes also do something similar, sending their memory pages into swap so other running processes can use the memory. What, if anything, is preventing someone from taking this a step further?

26 of 555 comments (clear)

  1. We do it in Condor by epaulson · · Score: 5, Informative

    http://www.cs.wisc.edu/condor/

    Free-as-in-beer, on most major UNIX platforms. Check out our publications, we have several that give all the details you'd need to write it yourself.

    Plenty of others, too - libckpt, there was a "Checkpointing Threaded Programs" paper at USENIX this past summer... there are some kernel patches that can do, most of them under the GPL.

    1. Re:We do it in Condor by dsouth · · Score: 5, Informative

      As the poster said, there are plenty of others:

      • SGI IRIX and Cray UNICOS provide kernel-level checkpoint-restart.
      • Condor provides user-level checkpoint restart and process migration by manipulating libraries at runtime.
      • esky provides user-level checkpoint restart under Solaris and Linux via runtime library manipulation.
      • crak provides kernel-level checkpoint restart for linux.
      • cocheck provides user-level checkpoint-restart.
      • libckpt provides user-level checkpoint-restart.


      I'm sure I left serveral out. Checkpoint-restart has been part of the high-performance computing scene for years. Having been a systdmin on large, high-performance, computing platforms for the last few years of my professional life, my experiences with checkpoint-restart have been a mixed bag. All of the existing systems have limitations. Depending on the application, those limitations can be no problem, or they can be deal-breakers.
  2. OS X needs this especially by kilgore_47 · · Score: 5, Interesting

    for the "Classic" environment. It seems so stupid watching macos9 boot up in a window when you want to use a classic program; Apple ought to save the state of the classic environment in to a file that could be quickly reloaded into ram when classic is called for. As the blurb said, laptops have had the suspend feature for years; would it really be so hard to apply the same concept elsewhere?

    --
    ___
    The way to see by faith is to shut the eye of reason. --Ben Franklin
    1. Re:OS X needs this especially by ncc74656 · · Score: 5, Interesting
      Well, OS X certainly can sleep (both OS X and Classic go to sleep), putting to sleep also all processes. As to hibernating the Classic environment, I don't know how useful that would really be in the long run.

      I don't know how directly comparable this example might be, but I used to use VMware (under Linux) to suspend Win98 when I didn't need it. If I needed to do something under Win98 (like browse the web), VMware would load up Win98 where I last left it. It saved the minute or so of waiting for the VM to POST and load Win98.

      (If VMware provided better support for DirectX, I might not have needed to switch my home workstation from Linux to Win2K. It's been more than a year since I checked, though, so things might've improved.)

      --
      20 January 2017: the End of an Error.
  3. Hmm, VMWare can do this in a different way. by GeorgieBoy · · Score: 5, Interesting

    VMware suspends to disk. You can go as far as suspending the Virtual Machine, not Virtual Memory. Then copy the "data" files to another machine and resume the same suspended virtual machine like nothing ever happened, as long as the same basic hardware exists on the host system (e.g. NIC, sound, serial ports, etc).

    While this isn't quite what you are looking for, it spawn an idea of the level this can be taken to. Think of how neat it is for distributed applications. Of course, something like this has to exist somewhere. . .

  4. Extended core dump? by The+G · · Score: 5, Interesting

    Almost all of the stuff you need is already in a core dump. Perhaps the appropriate approach to this is to try to extend the core-dumping mechanism to also dump other pieces of state. Then you would just need a way to reconstruct process state from a core dump, which most runtime debuggers can almost do anyway.

    I suspect that all the pieces of a solution are written and it's just a tricky pick-choose-and-integrate problem.

    And damn but I'd love to have this ability.
    --G

    1. Re:Extended core dump? by ianezz · · Score: 4, Interesting
      GNU Emacs basically does this to reduce initialization times.

      When compiling Emacs from the sources, the initial executable file is only a (relatively) small virtual machine executing elisp bytecode.

      Then, it is started, and several basic elisp packages are loaded and initialized.

      Once initialized, it makes a dump of itself on a file on disk (IIRC actually dumping core by sending a fatal signal to itself).

      The dump is prepended with an appropriate loader which restore the Emacs process (in its initialized status) in memory, and the resulting file is used as the main Emacs binary (what you can usually find in /usr/bin).

      This works for Emacs because it knows when it is checkpointed, and special care is taken not to do anything that depends on parts of the running environment that can't be fully restored.

  5. you can by Lumpy · · Score: 5, Informative

    It's called software suspend for linux. look for it on freshmeat.net

    --
    Do not look at laser with remaining good eye.
    1. Re:you can by Lumpy · · Score: 5, Informative

      AHA! I knew I still had it
      http://falcon.sch.bme.hu/~seasons/linux/swsusp.htm l

      this is what you need.

      --
      Do not look at laser with remaining good eye.
  6. it's encrypted in your brain waves! by spacefem · · Score: 5, Funny

    I once had an enourmous computer working out a very important question but it was destroyed by Volgons five minutes before it was finished. I feel your pain.

  7. Re:Really worth the effort? by b_pretender · · Score: 4, Insightful
    Good point. He should also create numerical algorithms with log files that keep track of how far they are getting and track results.

    This sounds like common sense to me. You never know when the disk is going to poop, the power shut off, the network reset.

    At my old job, we were required to record the status of all jobs that took longer than an hour (on a 6 cpu SGI). They never crashed on their own, but I would usually interrupt them if the requirements changed or whatever. If they ever did crash, then there was a record of exactly where they left off.

  8. Suspend by selectspec · · Score: 4, Informative

    You can't just serialize and page out one process. Under every process are a slew of kernel objects and kernel crud including the virtual to physical mappings of your address space. It would be quite a challenge to isolate all of this and somehow persist it.

    To make suspend work, you'd have to dump your entire memory image to disk. Then you swap in the entire image, kernel and user pages alike.

    --

    Someone you trust is one of us.

  9. Build in persistence yourself. by blair1q · · Score: 5, Insightful

    Any program that you intend to run for more than a day or two you should checkpoint its intermediate results to disk, even if this adds 100% to the run time.

    --Blair

    P.S. Alternatively, you could write a program to have the rebooted computer pull scrabble tiles from a bag structure and print them to the screen. You might at least get some clue as to whether it was asking the right question.

  10. Hibernation comments are missing the point by ry4an · · Score: 5, Insightful

    The comments to the effect of "it's called hibernation, and has done it for years" are missing the point. That hibernation is a BIOS supported dump to disk. It's a feature on most laptops and works with just about any OS -- it's worked on my Linux laptop for years.

    I think the feature to be discussed is Operating System (not BIOS) level support of the hibernation of a single process. It'd be nice if I could do a:

    kill -HIBERNATE `cat /var/longoperation.pid`

    and have that program get frozen to disk. Then if I could resurrect just that process later it'd be a handy feature for the long running program that you want to postpone until after you've done whatever you needed to do in single user mode.

    1. Re:Hibernation comments are missing the point by Hrunting · · Score: 5, Insightful

      And if you have something like that, you open yourself up to a wealth of potential problems in the program. Take this simple perl script.

      #!perl

      use strict;

      my $pid = $$;
      print $pid


      If you stop it between those two $pid commands, there's no guarantee that you're going to get the same pid value back. Programs would have to be specifically programmed to handle this sort of thing (there are other examples, this is just the most basic; network programs particularly would have problems).

  11. EPCKPT by cmason · · Score: 5, Informative
    EPCKPT is a checkpoint/restart utility built into the Linux kernel. Checkpointing is the ability to save an image of the state of a process (or group of processes) at a certain point during its lifetime.

    --

    --
    "If you are an idealist it doesn't matter what you do or what goes on around you, because it isn't real anyway."-R.P.W.
  12. Windows 2000 and Hibernation by doorbot.com · · Score: 5, Informative
    If you have a Windows 2000 or XP machine you can enable hibernation. However, this is not a "power management" feature... it has been separated from ACPI and/or proprietary disk partitions and will work on all computers, even servers, whether they have ACPI/APM/nothing for power management.

    Once you've enabled it, you create a hibernation file on the C: drive. Hibernation should only take place when there is minimal disk activity (eg, don't hibernate while trying to save your Word document). The system saves the contents on RAM to the hard drive, and then shuts down. When the machine boots, a flag was set (I assume) indicating the system should resume from hibernation... so the hibernation file is read from disk and written to RAM and you're back up and running, in less time than it takes to boot. Plus it keeps your uptime from resetting back to zero.

    Some things to note:

    You will need WHQL certified drivers, or at least properly-written drivers. I have a SB Audigy and the first drivers I used (the ones on the included CD) caused a blue screen on resume from hibernation. When a updated driver was released, it fixed this issue.

    Applications need to be properly-written as well, as there is some sort of Win32 suspend signal that is sent to apps just before the system hibernates, so the app must support this and the resume command when the system is restored.

    Hibernation works great on my laptop and on my workstation, and I especially like the fact that I don't need to create a separate partition or install special drivers to make it work (you can even use it on an NTFS formatted drive).

  13. Process-saving is known, but not what you want by Seth+Finkelstein · · Score: 4, Informative
    The idea of saving the state of a process is very well-known. Take a look at anything from emacs dumping to the gcore(1) program. It's been used in everything from saved games of Rogue to saved states of PERL.

    But isn't it overkill for a data-crunching operation? As many other people have noted, it would seem you're much better off checkpointing your data to disk, rather than relying on low-level OS process wizardry.

    Sig: What Happened To The Censorware Project (censorware.org)

  14. You sure of that? by bastion_xx · · Score: 4, Funny

    My Intel processor puts it somewhere around 41.99999999967

  15. Re:Use Windows XP by rlowe69 · · Score: 4, Redundant

    This comment is far from (Score:4, Informative) ... it's not even relevant. We're not talking about the whole OS hibernating, we're talking about saving the execution state of an executing process so that it can be resurrected later and continued (ie. if a reboot is necessary).

    --
    ----- rL
  16. CDC Cyber 205 by epepke · · Score: 5, Interesting

    As usual, this is ancient. Back at FSU, we had a CDC Cyber 205, a vector pipeline supercomputer, back in 1985. Any process could be crashed for a shutdown, and it produced a file that worked exactly like an executable and resumed computation from the time it was crashed.

  17. How hard could this be to experiment with? by Nelson · · Score: 5, Interesting
    I've thought about this for booting issues. I have a server that's all journaled and everything and it's periodically get's bumped. Boot time is still on the order of 2 to 4 minutes for a full Linux server install. With my current stats that means I'm probably going to miss a hit or two on one of the web pages, all things being equal. A good portion of that is just icing though, things that are there "just in case" or get used infrequently. (Okay, I can screw with the init order and the problem essentially goes away or I can switch hardware but we're nerds and geeks so let's just explore this)


    I was thinking about this and here was my dirty hacky idea. You need kexec, lobos, or something similar (actually a fairly modified version of it) you'll need on the order of 8MB of disk space and some kernel mods, which might not be that extensive.


    I was thinking we develop some driver or process that consumes all of the memory and CPU in a system. It forces all of the processes to swap out, it would probably need to be a driver of sorts on current linux systems. Then it could dump the kcore out to a file somewhere, sync it, and hibernate. Then when the kernel boots up, if the right arg is passed in it could either load this image back in to ram in place of the kernel and then jump into it (easier said than done) early in the boot (page tables are made long before you have access to the drives and such so the logistics of this would need to be figured out) or it could boot up and use a different swapper partition and then have some kind of tool like kexec to load that image back in to ram and start it up. Or something, some how you should be able to recover the state of the system. File handles and everything would be there.


    The harder part would be hardware and network transparency. You'd need to modify all of your drivers to make sure that the hardware could be reset and they could deal with it. I think it's a little easier for the network side because it would be similar to simply unplugging the network cable, you have open sockets that are talking to nothing and some software can deal with that pretty well. There is also some kind of system integrity or robustness piece that is needed, if the system some how changes when you bring your old image back it could break things, munge files, etc..

  18. STANDALONE CONDOR CHECKPOINTING by Anonymous Coward · · Score: 5, Informative

    STANDALONE CONDOR CHECKPOINTING:

    Using the Condor checkpoint library without the remote system call functionality and outside of the Condor system is known as
    "standalone" mode checkpointing.

    To link in standalone mode, follow the instructions for linking Condor executables, but replace condor_syscall_lib.a with libckpt.a. If you
    have installed Condor version 5.62 or above, you can easily link your program for standalone checkpointing using the condor_compile
    utility with the little-known "-condor_standalone" option. For example:

    condor_compile -condor_standalone [options/files....]

    where is any of cc, f77, gcc, g++, ld, etc. Just enter "condor_compile" by itself to see a usage summary, and/or refer to
    the condor_compile man page for additional information.

    Once your program is relinked with the Condor standalone-checkpointing library (libckpt.a), your program will sport two new command
    line arguments: "_condor_ckpt " and "_condor_restart ".

    If the command line looks like:

    exec_name -_condor_ckpt ..

    then we set up to checkpoint to the given file name.

    If the command line looks like:

    exec_name -_condor_restart ...

    then we effect a restart from the given file name.

    Any Condor command line options are removed from the head of the command line before main() is called. If we aren't given
    instructions on the command line, by default we assume we are an original invocation, and that we should write any checkpoints to the
    name by which we were invoked with a "ckpt" extension.

    To cause a program to checkpoint and exit, send it a SIGTSTP signal. For example, in C you would add the following line to your code:

    kill( getpid(), SIGTSTP );

    Note that most Unix shells are configured to send a TSTP signal to the foreground process when the user enters a Ctrl-Z. To cause a
    program to write a periodic checkpoint (i.e., checkpoint and continue running), sent it a SIGUSR2:

    kill( getpid(), SIGUSR2 );

    In addition to the command-line parameters interface described above, a C interface is also provided for restarting a program from a
    checkpoint file. The prototypes are:

    void init_image_with_file_name( char *ckpt_name );

    void init_image_with_file_descriptor( int fd );

    void restart( );

    The init_image_with_file_name() and init_image_with_file_descriptor() functions are used to specify the location of the checkpoint file.
    Only one of the two must be used. The restart() function causes the process image from the specified file to be read and restored.

  19. Search in the slashdot archives for kernel patches by Alan · · Score: 5, Informative

    I think it was somewhere in the list of patches from the -mjc tree (see here) that there was a patch for the entire kernel for linux. Basically it let the system save it's state, and then restore it if it detects that it was shut down at that point. I'm not sure if this is what you want (and I couldn't get it working), but it's certainly a step in the right direction to what you're looking for.

    Just found it here, it's the 'swsusp' patch.

  20. Darwin/MacOS X by Duck_Taffy · · Score: 4, Informative

    Here's a mutation of FreeBSD that can do exactly that. I've put my laptop to sleep in the middle of installing software while running MacOS X and brought it back up several hours later to resume installation with no problems. The same function works on my G4 tower. Yes, it does drop network connections. However, it does use a trickle charge to power the LED's and presumably to keep the processor alive, and possibly some memory. Paging several hundred megabytes in a couple of seconds would be quite the task! One item of note is that all Apple machines have a special piece of hardware known as the PMU (Power Management Unit). In the desktops, it's parted out onto the mother board and into the power supply, but in the laptops it's a seperate card which controls both sleep and the charging of the battery. Perhaps other UNIX machines would need a similar device for this function to work properly.

    --
    Karma: Ran over your dogma.
  21. Sun Already Does This by Anonymous Coward · · Score: 4, Interesting

    Sun already implements a system suspend/unsuspend in Solaris that works on all boxes but the Blade 100's.

    10 years ago I worked on a Unisys Unix box that did it automatically, meaning you could pull the power out of the wall without any warning and then plug it back in later. When the system rebooted, it would say "there's been a power failure, recovering" and then put all the processes back to the way their before. Even with an open vi session where I was actively typing, I wouldn't lose more than a character or two.

    I found out the machine had it quite by accident because my loser boss turned the box off one evening without doing a proper shutdown... Once I saw what it did, this required further testing :-)

    Still, what would be even better is if it could be done on a per process basis. I can think of many reason why you might want to suspend a process for a few days and bring it back later (say something you only wanted to run outside of work hours), but had no intention of shutting the whole box down. And this should be implemented in the kernel, not hacking each program to provide this functionality.