UNIX Process Cryogenics?
shawarma asks: "Due to a recent
power outage, I've had to shut down a server running a process that had
been running for ages calculating something. The job it was doing would
have been done in a few days, I think, but I had to shut it down before the
UPS ran out of juice. This got me thinking: Why can't I freeze down the
process and thaw it back up at a later time? It ought to be possible to take
all the connected memory pages and save them in some way, preserve file
handles and pointers, and everything. Maybe net-connections would die,
but that's understandable. Has any work been done in this field? If not,
shouldn't there be? I'd like to contribute in some way, but I think it's a bit
over my head." Laptops have been doing this in some form for years:
most laptops, when they run out of power or when told to by the user, will
go into "suspend" mode, which is similar to what the poster is describing.
Outside of laptops, however, I haven't seen this done. Sleeping processes
also do something similar: their memory pages get sent to swap so other
running processes can use the memory. What, if anything, is preventing
someone from taking this a step further?
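For a job like the poster's, the low-tech alternative to freezing the whole process is application-level checkpointing: the program itself periodically writes out the small amount of state it needs to resume. This is a different technique from the transparent checkpointing asked about, and the sketch below is only a minimal illustration under assumptions. The names job_state, save_checkpoint, load_checkpoint and run_job are hypothetical, and the "job" is a stand-in loop that sums integers.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical job state: everything needed to resume the computation. */
struct job_state {
    unsigned long next_i;   /* next iteration to run */
    unsigned long sum;      /* running result so far */
};

/* Write the state to a temp file, then rename it over `path`, so a process
 * killed mid-write leaves the previous checkpoint intact. (Surviving a power
 * loss would also need fsync.) Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, const struct job_state *st)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    int ok = fwrite(st, sizeof *st, 1, f) == 1;
    if (fclose(f) != 0)
        ok = 0;
    if (!ok) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read a checkpoint back. Returns 0 on success, -1 if none is usable. */
int load_checkpoint(const char *path, struct job_state *st)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    struct job_state tmp;
    int ok = fread(&tmp, sizeof tmp, 1, f) == 1;
    fclose(f);
    if (!ok)
        return -1;
    *st = tmp;
    return 0;
}

/* Run iterations [st->next_i, limit), saving a checkpoint every `interval`
 * iterations and once more at the end. Returns 0 on success. */
int run_job(const char *path, struct job_state *st,
            unsigned long limit, unsigned long interval)
{
    assert(interval > 0);
    while (st->next_i < limit) {
        st->sum += st->next_i;          /* stand-in for the real work */
        st->next_i++;
        if (st->next_i % interval == 0 && save_checkpoint(path, st) != 0)
            return -1;
    }
    return save_checkpoint(path, st);
}
```

A main() would call load_checkpoint() first, start from a zeroed job_state if that fails, and then call run_job(). The obvious limitation is that this only works when you can change the program and its state is easy to serialize, which is exactly the gap the transparent approaches in the replies try to close.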
http://www.cs.wisc.edu/condor/
Free-as-in-beer, on most major UNIX platforms. Check out our publications; we have several that give all the details you'd need to write it yourself.
Plenty of others, too: libckpt, and there was a "Checkpointing Threaded Programs" paper at USENIX this past summer. There are also some kernel patches that can do it, most of them under the GPL.
It's called Software Suspend for Linux. Look for it on freshmeat.net.
Once you've enabled it, the system creates a hibernation file on the C: drive. Hibernation should only take place when there is minimal disk activity (e.g., don't hibernate while trying to save your Word document). The system saves the contents of RAM to the hard drive, and then shuts down. When the machine boots, a flag has been set (I assume) indicating that the system should resume from hibernation, so the hibernation file is read from disk and written back to RAM and you're back up and running in less time than it takes to boot. Plus it keeps your uptime from resetting back to zero.
Some things to note:
You will need WHQL certified drivers, or at least properly-written drivers. I have a SB Audigy and the first drivers I used (the ones on the included CD) caused a blue screen on resume from hibernation. An updated driver, once it was released, fixed this issue.
Applications need to be properly-written as well, as there is some sort of Win32 suspend signal that is sent to apps just before the system hibernates, so the app must support this and the resume command when the system is restored.
Hibernation works great on my laptop and on my workstation, and I especially like the fact that I don't need to create a separate partition or install special drivers to make it work (you can even use it on an NTFS formatted drive).
STANDALONE CONDOR CHECKPOINTING:
..
...
Using the Condor checkpoint library without the remote system call functionality and outside of the Condor system is known as
"standalone" mode checkpointing.
To link in standalone mode, follow the instructions for linking Condor executables, but replace condor_syscall_lib.a with libckpt.a. If you
have installed Condor version 5.62 or above, you can easily link your program for standalone checkpointing using the condor_compile
utility with the little-known "-condor_standalone" option. For example:
condor_compile -condor_standalone <compiler> [options/files....]
where <compiler> is any of cc, f77, gcc, g++, ld, etc. Just enter "condor_compile" by itself to see a usage summary, and/or refer to
the condor_compile man page for additional information.
Once your program is relinked with the Condor standalone-checkpointing library (libckpt.a), your program will sport two new command
line arguments: "_condor_ckpt <filename>" and "_condor_restart <filename>".
If the command line looks like:
exec_name -_condor_ckpt <filename>
then we set up to checkpoint to the given file name.
If the command line looks like:
exec_name -_condor_restart <filename>
then we effect a restart from the given file name.
Any Condor command line options are removed from the head of the command line before main() is called. If we aren't given
instructions on the command line, by default we assume we are an original invocation, and that we should write any checkpoints to the
name by which we were invoked with a "ckpt" extension.
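To make the argument handling above concrete, here is a rough sketch of the described behaviour. It is not Condor's actual code: parse_ckpt_args, struct ckpt_opts and the CKPT_* names are hypothetical, and joining the default name with a dot (".ckpt") is an assumption, since the text only says "a ckpt extension".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names; this only mirrors the behaviour described in the text. */
enum ckpt_mode { CKPT_ORIGINAL, CKPT_TO_FILE, CKPT_RESTART };

struct ckpt_opts {
    enum ckpt_mode mode;
    char file[4096];        /* checkpoint file to write to or restart from */
};

/* Inspect the head of the command line, record what it asks for, and remove
 * the option so the program's own main() never sees it. argv must be
 * NULL-terminated as it is in main(). Returns the new argc, or -1 on error. */
int parse_ckpt_args(int argc, char **argv, struct ckpt_opts *out)
{
    assert(argc >= 1 && argv != NULL && out != NULL);
    int consumed = 0;
    const char *name = NULL;

    out->mode = CKPT_ORIGINAL;
    if (argc >= 2 && strcmp(argv[1], "-_condor_ckpt") == 0)
        out->mode = CKPT_TO_FILE;
    else if (argc >= 2 && strcmp(argv[1], "-_condor_restart") == 0)
        out->mode = CKPT_RESTART;

    if (out->mode != CKPT_ORIGINAL) {
        if (argc < 3)
            return -1;              /* option given without a file name */
        name = argv[2];
        consumed = 2;
    }

    /* Default: invocation name plus an extension (the dot is an assumption). */
    int n = name ? snprintf(out->file, sizeof out->file, "%s", name)
                 : snprintf(out->file, sizeof out->file, "%s.ckpt", argv[0]);
    if (n < 0 || (size_t)n >= sizeof out->file)
        return -1;

    /* Shift the remaining arguments (and the NULL terminator) down. */
    for (int i = 1; i + consumed <= argc; i++)
        argv[i] = argv[i + consumed];
    return argc - consumed;
}
```

With this sketch, "exec_name -_condor_ckpt state.ckpt --fast" would leave the program seeing only "exec_name --fast", while a plain "exec_name --fast" would default to checkpointing into "exec_name.ckpt".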
To cause a program to checkpoint and exit, send it a SIGTSTP signal. For example, in C you would add the following line to your code:
kill( getpid(), SIGTSTP );
Note that most Unix shells are configured to send a TSTP signal to the foreground process when the user enters a Ctrl-Z. To cause a
program to write a periodic checkpoint (i.e., checkpoint and continue running), send it a SIGUSR2:
kill( getpid(), SIGUSR2 );
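If you aren't linking against libckpt, you can still borrow the same signal convention in your own program. The sketch below is a stand-in, not how the Condor library works internally: the handlers only set flags, and the work loop is expected to poll them at a safe point and save its own state. install_ckpt_handlers and poll_ckpt_request are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <unistd.h>

/* Flags set from the signal handler and read from the work loop. */
static volatile sig_atomic_t want_periodic_ckpt = 0;
static volatile sig_atomic_t want_ckpt_and_exit = 0;

/* Handler does the minimum: record which request arrived. */
static void on_signal(int signo)
{
    if (signo == SIGUSR2)
        want_periodic_ckpt = 1;     /* checkpoint and keep running */
    else if (signo == SIGTSTP)
        want_ckpt_and_exit = 1;     /* checkpoint and exit */
}

/* Install the handlers for both signals. Returns 0 on success, -1 on error. */
int install_ckpt_handlers(void)
{
    struct sigaction sa;
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;       /* restart interrupted system calls */
    if (sigaction(SIGUSR2, &sa, NULL) != 0)
        return -1;
    if (sigaction(SIGTSTP, &sa, NULL) != 0)
        return -1;
    return 0;
}

/* Call at a safe point in the work loop.
 * Returns 0: nothing requested, 1: checkpoint and continue,
 *         2: checkpoint and exit. Clears the request it reports. */
int poll_ckpt_request(void)
{
    if (want_ckpt_and_exit) {
        want_ckpt_and_exit = 0;
        return 2;
    }
    if (want_periodic_ckpt) {
        want_periodic_ckpt = 0;
        return 1;
    }
    return 0;
}
```

Note that catching SIGTSTP this way means Ctrl-Z no longer suspends the job from the shell; it requests a checkpoint-and-exit instead, matching the convention described above.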
In addition to the command-line parameters interface described above, a C interface is also provided for restarting a program from a
checkpoint file. The prototypes are:
void init_image_with_file_name( char *ckpt_name );
void init_image_with_file_descriptor( int fd );
void restart( );
The init_image_with_file_name() and init_image_with_file_descriptor() functions are used to specify the location of the checkpoint file.
Call exactly one of the two. The restart() function causes the process image from the specified file to be read and restored.
I think it was somewhere in the list of patches from the -mjc tree (see here) that there was a patch against the whole Linux kernel. Basically it lets the system save its state, and then restore it on boot if it detects that it was shut down at that point. I'm not sure if this is what you want (and I couldn't get it working), but it's certainly a step in the right direction for what you're looking for.
Just found it here, it's the 'swsusp' patch.