Slashdot Mirror


UNIX Process Cryogenics?

shawarma asks: "Due to a recent power outage, I've had to shut down a server running a process that had been running for ages calculating something. The job it was doing would have been done in a few days, I think, but I had to shut it down before the UPS ran out of juice. This got me thinking: Why can't I freeze down the process and thaw it back up at a later time? It ought to be possible to take all the connected memory pages and save them in some way, preserve file handles and pointers, and everything. Maybe net-connections would die, but that's understandable. Has any work been done in this field? If not, shouldn't there be? I'd like to contribute in some way, but I think it's a bit over my head." Laptops have been doing this in some form for years: most laptops, when they run out of power or when told to by the user, will go into "suspend" mode, which is similar to what the poster is describing. Outside of laptops, however, I haven't seen this done. Sleeping processes also do something similar, sending their memory pages into swap so other running processes can use the memory. What, if anything, is preventing someone from taking this a step further?

32 of 555 comments

  1. OS X needs this especially by kilgore_47 · · Score: 5, Interesting

    for the "Classic" environment. It seems so stupid watching macos9 boot up in a window when you want to use a classic program; Apple ought to save the state of the classic environment into a file that could be quickly reloaded into RAM when classic is called for. As the blurb said, laptops have had the suspend feature for years; would it really be so hard to apply the same concept elsewhere?

    --
    ___
    The way to see by faith is to shut the eye of reason. --Ben Franklin
    1. Re:OS X needs this especially by Quixotic+Raindrop · · Score: 2, Interesting

      Which is funny, because VMware has exactly this capability.

      It needs some refinement, and sometimes it's slow when it picks back up again, but it generally works in my experience. It is obviously not only possible, but implementable using current technology.

      --
      Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
    2. Re:OS X needs this especially by Anonymous Coward · · Score: 1, Interesting

      No, Apple could not do this even if they spent "a little more time on the problem," and they are already spending lots of time on the problem. Fundamentally, many of the ways old Mac programs are written cannot work on a system where applications can be preempted and where an application's memory can be swapped out at any point. To clarify, Windows 2000 cannot run all DOS, Windows 3.1, and Windows 9x applications. That is part of the reason it took MS so long to transition to the NT kernel ... getting apps up to speed. For reasons beyond the scope of this thread, Apple is making a much faster transition than MS did.

    3. Re:OS X needs this especially by ncc74656 · · Score: 5, Interesting
      Well, OS X certainly can sleep (both OS X and Classic go to sleep), which also puts all processes to sleep. As to hibernating the Classic environment, I don't know how useful that would really be in the long run.

      I don't know how directly comparable this example might be, but I used to use VMware (under Linux) to suspend Win98 when I didn't need it. If I needed to do something under Win98 (like browse the web), VMware would load up Win98 where I last left it. It saved the minute or so of waiting for the VM to POST and load Win98.

      (If VMware provided better support for DirectX, I might not have needed to switch my home workstation from Linux to Win2K. It's been more than a year since I checked, though, so things might've improved.)

      --
      20 January 2017: the End of an Error.
  2. Hmm, VMWare can do this in a different way. by GeorgieBoy · · Score: 5, Interesting

    VMware suspends to disk. You can go as far as suspending the Virtual Machine, not Virtual Memory. Then copy the "data" files to another machine and resume the same suspended virtual machine like nothing ever happened, as long as the same basic hardware exists on the host system (e.g. NIC, sound, serial ports, etc).

    While this isn't quite what you are looking for, it gives an idea of the level this can be taken to. Think of how neat it is for distributed applications. Of course, something like this has to exist somewhere...

  3. Extended core dump? by The+G · · Score: 5, Interesting

    Almost all of the stuff you need is already in a core dump. Perhaps the appropriate approach to this is to try to extend the core-dumping mechanism to also dump other pieces of state. Then you would just need a way to reconstruct process state from a core dump, which most runtime debuggers can almost do anyway.

    I suspect that all the pieces of a solution are written and it's just a tricky pick-choose-and-integrate problem.

    And damn but I'd love to have this ability.
    --G

    1. Re:Extended core dump? by ADRA · · Score: 2, Interesting

      You forget that the kernel has created a sandbox for this core to live in. If the sandbox wakes up with a different environment, byebye process.

      Simple example

      # ./bigwasteoftime &
      ./bigwasteoftime[1]
      # hibernate bigwasteoftime
      # exit

      The program is tied to the console which no longer exists, and if woken up, which process is it childed to? What if bigwasteoftime knew its parent before hibernation, and tried to modify it?

      As it stands, you cannot guarantee its stability.

      --
      Bye!
    2. Re:Extended core dump? by ianezz · · Score: 4, Interesting
      GNU Emacs basically does this to reduce initialization times.

      When compiling Emacs from the sources, the initial executable file is only a (relatively) small virtual machine executing elisp bytecode.

      Then, it is started, and several basic elisp packages are loaded and initialized.

      Once initialized, it makes a dump of itself to a file on disk (IIRC actually dumping core by sending a fatal signal to itself).

      The dump is prepended with an appropriate loader which restores the Emacs process (in its initialized state) in memory, and the resulting file is used as the main Emacs binary (what you can usually find in /usr/bin).

      This works for Emacs because it knows when it is checkpointed, and special care is taken not to do anything that depends on parts of the running environment that can't be fully restored.

  4. eros-os by ischarlie · · Score: 2, Interesting

    back in the day there was a post:

    http://slashdot.org/article.pl?sid=99/10/28/0151212&mode=thread

    about an operating system with "journaled" processes of a sort, that would automatically back up images of its processes.

  5. process migration is the term you want by Danny+Rathjens · · Score: 2, Interesting

    There has been a lot of work done on "process migration", that is, moving processes from machine to machine.
    Obviously those techniques would apply to what you are asking about.
    Google has lots of links about it.

  6. Future of Process Management by gehrehmee · · Score: 3, Interesting

    First, let me say that what the poster is suggesting sounds a little more sophisticated than a simple re-implementation of XP's hibernate function, although functionality like that under UNIX would certainly be invaluable. It sounds like the poster wants control over individual processes, something that I consider far more interesting.
    What's said here is certainly very reasonable. But the extensions of what's being suggested are even more fantastic. Once a process is completely removed from memory, with file handles and storage and status all kept away safely, is there any reason that the process is really tied to that computer? Why wouldn't it be possible to take that 'frozen' process, transfer it to another machine with access to the same filesystem on some level (some translation of file handles would likely be necessary), and thaw it there, allowing someone to move a running process to another machine? Need to replace your web server's only CPU, but don't want downtime? Move the process to a backup machine, replace the original's hardware, and move the process back.
    I even thought I had heard that someone was working on just such a project, or at least thinking about the details of implementing it. (I'm just getting started in learning UNIX internals myself). Anybody have more references to information on this sort of thing?

    --
    "You know, Hobbes, some days even my lucky rocketship underpants don't help" -- Calvin
  7. different approach: Savepoints by esonik · · Score: 2, Interesting

    A different solution, which is very common for long running processes, is to use savepoints, i.e. save the state of the process regularly to a file at suitable points of the algorithm. Once your process dies or you kill it, you can restart from that savepoint. If your state information is very large, you can stretch the save interval to reasonably long times, e.g. several hours. Typically you don't mind losing some hours of calculations due to an occasional power outage.

    Of course this solution is not as general as the "process cryogenics" you describe, but it's also easier to implement because you have more information about the problem.
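
    A minimal sketch of the savepoint idea, assuming the whole state fits in a small JSON file. The function names, the file layout, and the sum-of-squares workload are all made up for illustration; the `stop_after` parameter only simulates the process being killed partway through.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write the state atomically so a crash mid-write cannot corrupt the savepoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())      # push the bytes to disk before renaming
    os.replace(tmp_path, path)      # atomic rename over the old savepoint

def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no savepoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_i": 0, "total": 0}

def run(path, n, save_every=1000, stop_after=None):
    """Sum i*i for i < n, resuming from the savepoint at `path` if present.

    `stop_after` (hypothetical) aborts after that many steps to simulate a kill.
    """
    state = load_checkpoint(path)
    i, total = state["next_i"], state["total"]
    steps = 0
    while i < n:
        if stop_after is not None and steps >= stop_after:
            return None             # "killed": work since the last savepoint is lost
        total += i * i
        i += 1
        steps += 1
        if i % save_every == 0:
            save_checkpoint(path, {"next_i": i, "total": total})
    return total
```

    Stretching `save_every` trades lost work after a failure against time spent writing state, which is the interval trade-off described above.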

  8. Checkpoint/restart by td · · Score: 3, Interesting

    This facility is called checkpoint/restart. It was a feature of OS/360 and other operating systems in the 1960s. In some very early versions of Unix, core files were restartable. Usually it's pretty easy for programs to save enough state to be restartable on a case by case basis, except when it's just about impossible (like when networks reconfigure) so it's not a popular system feature these days (hard to implement in a general way, doesn't do a very good job in the cases that can be handled easily.)

    A friend of mine (Hugh Redelmeier) ran a very long (~400 day) computation on a PDP-11 in the mid-1970s. The program ran stand-alone, and part of the test plan involved flipping the power switch on and off a few times -- very amusing to watch the program keep on running right through power failures. (Main memory on the machine in question was magnetic cores, which are non-volatile.)

    --
    -Tom Duff
  9. User Control by Skweetis · · Score: 2, Interesting
    It would be neat if this could be controlled by the user. Ideally, this would be done by a process signal. To actually cause a process to hibernate, a user would do a kill -HIB $PID or something like that. Then the kernel would save the process information to a file (somewhere under /var maybe?) until it is restored.

    This next one would complicate things a bit: the user should also be able to wake up the process the same way, i.e. kill -WAK $PID. This means that an index of hibernated processes also needs to be kept synchronized between the kernel process tables and a file on disk, to be preserved between reboots.
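
    The HIB and WAK signals above are hypothetical. The closest thing that already exists is the SIGSTOP/SIGCONT pair, which pauses and resumes a process but leaves it in memory, so nothing survives a reboot. A small sketch of that existing behavior, with made-up helper names:

```python
import os
import signal
import subprocess
import sys

def freeze(pid):
    """Pause a process with SIGSTOP. It stays in RAM/swap; nothing is written to disk."""
    os.kill(pid, signal.SIGSTOP)

def thaw(pid):
    """Resume a process previously paused with SIGSTOP."""
    os.kill(pid, signal.SIGCONT)

def demo():
    """Start a short-lived child, freeze it, confirm it stopped, then thaw it."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(0.5)"]
    )
    freeze(child.pid)
    # WUNTRACED makes waitpid report a stopped child, not only an exited one.
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)
    thaw(child.pid)
    exit_code = child.wait()
    return was_stopped, exit_code
```

    What the proposal adds on top of this is the part between `freeze` and `thaw`: writing the stopped process out to a file and keeping an index of it across reboots.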

    Maybe I'll write another kernel patch...

  10. That will not be easy by bartman · · Score: 2, Interesting

    There are big problems with such an approach, mainly with device usage. Basically they are all the problems that you would have with process migration, plus a few more because of temporal discontinuity.

    If you are using a scanner, or a mouse, or whatever, that device may not be there or may not be available when the process is brought back. Furthermore you may have a file descriptor opened on a local (or network shared) file which no longer exists or has changed drastically.

    There are further non-device-dependent problems with shared memory, opened-but-unlinked files, parent PID, IPC resources.

    Having said all of the above... I suppose that for the very rare case where your program depends only on memory and CPU, you could retire and recover a task.

    my $0.02

    --
    -- bartman
  11. Apple Tried this with OS 9 by zaius · · Score: 3, Interesting
    Apple implemented this feature in early versions of OS 9, but took it out after they realized that some laptops would never "unfreeze" without the user hitting a reset switch buried deep inside the laptop.

    The idea was that when you put your computer to sleep, instead of keeping the SDRAM (or whatever the laptop had) powered to preserve the memory contents, it would write it all to a special sector on the hard drive that the firmware knew to read from when starting from sleep. This allowed sleep to be even more low-power than it already is, since a hard drive does not require power to retain data.

  12. Application level solution. by Mark+Imbriaco · · Score: 2, Interesting

    One fairly simple alternative is to simply have the application save its own state to a "checkpoint" file periodically. This approach has been used in other applications for a long time in the form of auto-save files (e.g. Emacs) and would be easily adapted to a long running program like the one you describe.

    Just because the OS doesn't support it automagically it doesn't mean that you can't solve it for yourself with a little bit of extra work and planning.

  13. Already available for Linux by HishamMuhammad · · Score: 2, Interesting

    There is a kernel patch to do this. It's called Software Suspend. It is also part of the FOLK project (Functionality Overloaded Linux Kernel, a project to merge the largest possible amount of patches into the kernel).

  14. Cryogenic freeze / Hibernation by Dr_Marvin_Monroe · · Score: 2, Interesting

    I think that this might also be a really good bug fix/hacking tool. I can also remember something like this for the Apple II in years gone by. You could press a button and take a snapshot of all memory in the system. Then you could write the executable part to disk and pick up where you left off. Good for freezing a copy of a game or whatever.

    This would also be good for tracking down bugs using the "before and after" technique.

    Such a program could be tied into the UPS monitor in such a way as to save everything that couldn't be stopped.

  15. CDC Cyber 205 by epepke · · Score: 5, Interesting

    As usual, this is ancient. Back at FSU, we had a CDC Cyber 205, a vector pipeline supercomputer, back in 1985. Any process could be crashed for a shutdown, and it produced a file that worked exactly like an executable and resumed computation from the time it was crashed.

  16. How hard could this be to experiment with? by Nelson · · Score: 5, Interesting
    I've thought about this for booting issues. I have a server that's all journaled and everything and it periodically gets bumped. Boot time is still on the order of 2 to 4 minutes for a full Linux server install. With my current stats that means I'm probably going to miss a hit or two on one of the web pages, all things being equal. A good portion of that is just icing though, things that are there "just in case" or get used infrequently. (Okay, I can screw with the init order and the problem essentially goes away or I can switch hardware but we're nerds and geeks so let's just explore this)


    I was thinking about this and here was my dirty hacky idea. You need kexec, lobos, or something similar (actually a fairly modified version of it). You'll need on the order of 8MB of disk space and some kernel mods, which might not be that extensive.


    I was thinking we develop some driver or process that consumes all of the memory and CPU in a system. It forces all of the processes to swap out; it would probably need to be a driver of sorts on current Linux systems. Then it could dump the kcore out to a file somewhere, sync it, and hibernate. Then when the kernel boots up, if the right arg is passed in, it could either load this image back into RAM in place of the kernel and then jump into it (easier said than done) early in the boot (page tables are made long before you have access to the drives and such, so the logistics of this would need to be figured out), or it could boot up and use a different swap partition and then have some kind of tool like kexec load that image back into RAM and start it up. Or something; somehow you should be able to recover the state of the system. File handles and everything would be there.


    The harder part would be hardware and network transparency. You'd need to modify all of your drivers to make sure that the hardware could be reset and they could deal with it. I think it's a little easier for the network side because it would be similar to simply unplugging the network cable: you have open sockets that are talking to nothing, and some software can deal with that pretty well. There is also some kind of system integrity or robustness piece that is needed: if the system somehow changes when you bring your old image back, it could break things, munge files, etc.

  17. Java has lightweight persistence... by bernz · · Score: 2, Interesting

    If you utilize the java.io.Serializable stuff right, you can create a lightweight persistence layer and should be able to freeze and resume processes within the same application, if you handle threading right with it.

  18. It is possible...but it could be messy... by Mysticalfruit · · Score: 3, Interesting

    What if the process has forked off a bunch of children? Are you going to archive all the children at the same time? What if the process has a whole bunch of files in /tmp, are you going to roll them up into the freeze state as well? What if you're using pthreads? Are you going to keep the state for each thread? How about file pointers?

    I think the better solution is to write a new signal called "SIGFREEZE" and have programs just write code that could handle such an event. Let the program figure out how to save their own stuff.

    A good example would be a program that was calculating pi. The programmer would have to implement a signal handler that, when it received a SIGFREEZE, would stop its computing and write what it's currently working on out to a file. The other thing the programmer should be doing is periodically writing their data out to a file anyway. Then the programmer should implement a command line option that would facilitate reloading from a saved state.
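
    A rough sketch of that pi example. SIGFREEZE does not exist, so SIGUSR1 stands in for it here; the state file name, the function names, and the choice of the (slow) Leibniz series are all assumptions made for illustration.

```python
import json
import os
import signal

STATE_FILE = "pi_state.json"   # hypothetical default location for the saved state
freeze_requested = False

def on_freeze(signum, frame):
    """Signal handler: only set a flag; the main loop does the actual saving."""
    global freeze_requested
    freeze_requested = True

def compute_pi(terms, state_file=STATE_FILE):
    """Sum the Leibniz series up to `terms`, resuming from state_file if it exists.

    Returns (estimate, finished). On SIGUSR1 (standing in for the imaginary
    SIGFREEZE) the loop saves its position and returns with finished=False.
    """
    global freeze_requested
    freeze_requested = False
    signal.signal(signal.SIGUSR1, on_freeze)

    k, acc = 0, 0.0
    if os.path.exists(state_file):          # the "reload from saved state" path
        with open(state_file) as f:
            saved = json.load(f)
        k, acc = saved["k"], saved["acc"]

    while k < terms:
        if freeze_requested:
            with open(state_file, "w") as f:
                json.dump({"k": k, "acc": acc}, f)
            return 4 * acc, False
        acc += (-1) ** k / (2 * k + 1)
        k += 1
    return 4 * acc, True
```

    From a shell this would be driven with the real `kill -USR1 <pid>`; running the program again with the same state file picks up where it left off.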

    That's my take on it...

    If you see any problems with it... bring it on.

    --
    Yes Francis, the world has gone crazy.
  19. Re:Windows 2000 and Hibernation by dublin · · Score: 3, Interesting

    This is not strictly speaking a W2K function. The real kicker here for Linux folks is that the easiest way to do hibernation in the modern world is to use ACPI, which Linux doesn't do very well. (See this week's LWN for a timely discussion.)

    APM BIOSes can also do this, but they aren't as standard: Often the implementation details are specific to the hardware. For instance, Phoenix BIOSes (at least as of two years ago, I haven't messed with this stuff much since then) tend to want to put the STD (suspend-to-disk) data in a special file in a Windows partition, while some others (Dell for sure, since I used to work this stuff for them) save this info in a special STD partition (type 84, IIRC) which is a more generic solution, but requires more knowledge when setting up the box. (When was the last time you thought you might need an STD partition when building your box? BTW, they should be at a minimum, PhysicalMemorySize + 1 MB for state info, video register settings, etc.)

    --
    "The future's good and the present is nothing to sneeze at." - Roblimo's last ./ post
  20. You can do OS-independent process-hibernation... by eyefish · · Score: 2, Interesting

    Something many people not familiar with J2EE (Java 2 Enterprise Edition) may not know is that when you have an application running in a Java container, it and the state of all its processes get automatically saved and restored whenever the container, the OS, or the machine crashes. True, in practice some diligence is required from the programmer (for example, when you need to set objects to a specific state upon re-instantiation), but the functionality is there, is OS-independent, and has been proven and used daily in heavy-duty environments for a few years now.

  21. Re:Use Windows XP by Anonymous Coward · · Score: 1, Interesting
    Actually, modern VM makes this "hard part" completely trivial. Each process has its own address space. The only possible foul up would be shared library mappings, but I suspect that's easy to fix.

    Also, smart programming is not a valid requirement. Much critical long running code is written by noncomputer people, e.g. physicists.

  22. Look at KeyKOS by DV · · Score: 2, Interesting

    Basically that was one of the ideas behind the research on micro-kernels. If the state of the system gets small and centralized enough, one could make not only a single process persistent but the full system persistent.

    KeyKOS was a very promising system offering this at the time. One could not checkpoint the connections outside of the machine, but their demo was a BSD machine with X11, whose power plug was violently removed. When replugged, the state of all processes saved at the last checkpoint was resumed and the system would continue ... Including X-Windows !!!!

    Now wait for the Patent to expire, put it in Linux and watch the world of computing change.

    It was very promising at the time I was doing my PhD 10 years ago, I don't know why this never "made it"

    Daniel

  23. Re:Windows 2000 and Hibernation by doorbot.com · · Score: 3, Interesting

    This is not strictly speaking a W2K function.

    Agreed, and as you go on to explain, and I believe I alluded to in my post, there are many proprietary implementations via the BIOS or DOS drivers, etc.

    My point was that Windows 2000 separates the hibernation feature from the BIOS. As far as the BIOS can tell, the system is booting normally... but once the BIOS loads the NTLDR, Windows takes over of course and handles the hibernation. This is why it works so well and does not have all of the "stupid issues" such as custom drivers, partitions, or the like. The end result is not a MS-only function, but the implementation is, as far as I can tell.

  24. Sun Already Does This by Anonymous Coward · · Score: 4, Interesting

    Sun already implements a system suspend/unsuspend in Solaris that works on all boxes but the Blade 100's.

    10 years ago I worked on a Unisys Unix box that did it automatically, meaning you could pull the power out of the wall without any warning and then plug it back in later. When the system rebooted, it would say "there's been a power failure, recovering" and then put all the processes back to the way they were before. Even with an open vi session where I was actively typing, I wouldn't lose more than a character or two.

    I found out the machine had it quite by accident because my loser boss turned the box off one evening without doing a proper shutdown... Once I saw what it did, this required further testing :-)

    Still, what would be even better is if it could be done on a per process basis. I can think of many reasons why you might want to suspend a process for a few days and bring it back later (say something you only wanted to run outside of work hours), but had no intention of shutting the whole box down. And this should be implemented in the kernel, not by hacking each program to provide this functionality.

  25. Re:Windows 2000 and Hibernation by denzo · · Score: 3, Interesting
    Not according to Microsoft (on their knowledgebase). This article states that Win2k needs ACPI to support OS hibernation, and that the BIOS has to support it. Although Microsoft has been known to contradict itself.

    And simply having WHQL-certified drivers doesn't necessarily mean it'll work. I had a Future Domain SCSI controller in my computer that loaded with the default Win2k WHQL driver, but I could never hibernate it. When I swapped it out with an Adaptec 2940UW, I was able to enable Hibernation in my Control Panel settings.

  26. Re:Can you have both? by mlheur · · Score: 2, Interesting

    what if the OS had a hook in it to like
    `kill -FREEZE <pid>`
    No new hardware, only done once, will work on all processes.

    And as described previously, the FREEZE signal would cause the process to dump execution code, memory pages, FD's etc. etc. to a dump file.

    reboot the system.

    Then find some way to execute that dump file which will in turn load FD's, pages, execution code, and resume with the IP (instruction pointer, not IP Addr. for those not arch inclined) in the same spot?

    /me isn't much of a kernel hacker so I don't know the details of how to do it, but that's my high level solution.

  27. My thoughts on what would be needed. by Vulture_ · · Score: 2, Interesting
    I know this has already been done, but I thought I'd throw in my understanding of what would have to happen:
    • The process' core would need to be dumped. This should be fairly straightforward, since some of them do this a lot already... ;)
    • The process' registers (main CPU registers, FP registers, and any other registers that might exist on whatever exotic arch you use) would need to be saved. This is already done by the kernel for context switching to other processes. Gdb also can fetch and change these, so it can be done from userland.
    • Last but not least, all of the kernel level state of the process would need to be saved. That involves saving:
      • The signal handler table.
      • The pending signals. (Since the process hasn't handled its pending signals yet, it needs to handle them at some point in the future.)
      • The state of whatever syscall, if any, the process was in at the time of freezing. (Or you can set errno = EINTR on most syscalls, if this isn't possible.) This would be rather interesting to implement -- what if you're using a different kernel when you thaw? You can't just save the pc of the syscall then...
      • The file descriptor table. This includes network sockets, which would probably have to be closed (set errno = EPIPE and send SIGPIPE on the next write to a thusly closed socket?), for obvious reasons.
      • The System V IPC state. This means message queues, shared memory, and semaphores, all of which would have to somehow be recreated.
      • Any child processes, most likely.
      • More interestingly, any threads, which might be hard to tell apart from ordinary processes since they are ordinary processes with some exceptions.
      • Everything else...

    As you can see, freezing and thawing UNIX processes could get quite nightmarish if you account for all of the possibilities. (Most processes don't use SysV IPC, for instance.) Even the most (seemingly) trivial of syscalls would need to be modified (all socket functions, for instance).
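
    To get a feel for just the file descriptor item in that list, here is a small sketch that reads a process's descriptor table from userland. It assumes a Linux-style /proc with `fd` and `fdinfo` directories; the function names are made up, and this only inspects state, it does not restore anything.

```python
import os

def snapshot_fds(pid="self"):
    """Map each open descriptor number to what it points at (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    table = {}
    for name in os.listdir(fd_dir):
        try:
            table[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue    # descriptor was closed between listdir and readlink
    return table

def snapshot_offsets(pid, fd):
    """Parse /proc/<pid>/fdinfo/<fd>, which holds the file offset and open flags."""
    info = {}
    with open(f"/proc/{pid}/fdinfo/{fd}") as f:
        for line in f:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    return info
```

    Regular files show up as paths that could in principle be reopened and seeked back to the saved `pos`; sockets and pipes show up as entries like `socket:[...]`, which is where the hard cases in the list begin.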

    Note that it's a lot easier to freeze and thaw a virtual machine, because it's so much more self-contained -- all you need to save then is:

    • Core.
    • Registers.
    • The state of any simulated hardware devices (virtual screen, for instance).
    --

    The only way the typical /.er can pick up a chick is with a forklift. -- AC