Slashdot Mirror


UNIX Process Cryogenics?

shawarma asks: "Due to a recent power outage, I've had to shut down a server running a process that had been running for ages calculating something. The job it was doing would have been done in a few days, I think, but I had to shut it down before the UPS ran out of juice. This got me thinking: Why can't I freeze down the process and thaw it back up at a later time? It ought to be possible to take all the connected memory pages and save them in some way, preserve file handles and pointers, and everything. Maybe net-connections would die, but that's understandable. Has any work been done in this field? If not, shouldn't there be? I'd like to contribute in some way, but I think it's a bit over my head." Laptops have been doing this in some form for years: most laptops, when they run out of power or when told to by the user, will go into "suspend" mode, which is similar to what the poster is describing. Outside of laptops, however, I haven't seen this done. Sleeping processes also do something similar: their memory pages are sent to swap so other running processes can use the memory. What, if anything, is preventing someone from taking this a step further?

17 of 555 comments

  1. Saving application state by cheezehead · · Score: 2, Insightful

    Of course, you could write your application so that it saves state at regular intervals (aka checkpointing). Especially with calculations you should be able to store intermediate results.
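
    A minimal sketch of that in Python, assuming the calculation is a simple loop whose whole state fits in a small dictionary. The function names, the JSON file and the `stop_at` parameter (which only simulates a shutdown) are made up for illustration:

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh starting state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_i": 0, "total": 0}


def sum_of_squares(n, path, every=1000, stop_at=None):
    """Sum i*i for i in range(n), saving state every `every` iterations.

    stop_at is a hypothetical test hook: the loop gives up (returns None)
    when it reaches that index, as if the machine had been shut down.
    """
    state = load_checkpoint(path)
    i, total = state["next_i"], state["total"]
    while i < n:
        if stop_at is not None and i == stop_at:
            return None  # pretend the power went out here
        total += i * i
        i += 1
        if i % every == 0:
            save_checkpoint(path, {"next_i": i, "total": total})
    return total
```

    Running it again with the same checkpoint file picks up from the last saved index rather than from zero. Writing to a temporary file and renaming it means a crash in the middle of a save should leave the previous checkpoint intact.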

    --

    MSN 8: Now Microsoft even has bugs in their ad campaigns.

  2. External dependencies by interiot · · Score: 3, Insightful

    External dependencies might include open files (what if you freeze, and then delete the file?), open TCP sockets to daemons elsewhere that wouldn't get frozen, subprocesses, etc... These would probably have to be revived, but how?

  3. Re:Really worth the effort? by b_pretender · · Score: 4, Insightful

    Good point. He should also write his numerical algorithms to keep log files that track how far they have gotten and record their results.

    This sounds like common sense to me. You never know when the disk is going to poop, the power shut off, the network reset.

    At my old job, we were required to record the status of all jobs that took longer than an hour (on a 6-CPU SGI). They never crashed on their own, but I would usually interrupt them if the requirements changed or whatever. If they ever did crash, then there was a record of exactly where they left off.

  4. Build in persistence yourself. by blair1q · · Score: 5, Insightful

    Any program that you intend to run for more than a day or two should checkpoint its intermediate results to disk, even if this adds 100% to the run time.

    --Blair

    P.S. Alternatively, you could write a program to have the rebooted computer pull scrabble tiles from a bag structure and print them to the screen. You might at least get some clue as to whether it was asking the right question.

    1. Re:Build in persistence yourself. by dillon_rinker · · Score: 3, Insightful

      Re-read the comment you replied to; it suggests something subtly different from what you suggest. Checkpointing intermediate results is not the same thing as checkpointing processes. To take a much oversimplified example, I write a program to multiply a two-digit number by a one-digit number. My program does the following:

      1. Multiply ones digits
      2. Multiply tens digit by ones digit
      3. Multiply previous result by ten
      4. Add results from steps 1 & 3
      5. Display previous result.

      If my program crashes at any point before step 5, I have to start all over. So, I save my intermediate results at step 1, step 2, step 3, and save my final result at step 4. This is checkpointing my intermediate steps.
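
      A rough Python sketch of that kind of step-by-step checkpointing, assuming one checkpoint file per job. The step names, the file format and the `crash_before` parameter (which only simulates a crash) are invented for illustration:

```python
import json
import os


def load_steps(path):
    """Return the results saved so far, or an empty dict on a fresh start."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_steps(path, steps):
    """Save all finished steps, replacing the old file in one rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(steps, f)
    os.replace(tmp, path)


def multiply(two_digit, one_digit, path, crash_before=None):
    """Multiply using the four steps above, skipping any step already saved."""
    steps = load_steps(path)
    tens, ones = divmod(two_digit, 10)

    def run(name, compute):
        if name not in steps:
            if name == crash_before:
                raise RuntimeError("simulated crash before " + name)
            steps[name] = compute()
            save_steps(path, steps)
        return steps[name]

    s1 = run("step1", lambda: ones * one_digit)   # multiply ones digits
    s2 = run("step2", lambda: tens * one_digit)   # multiply tens digit by ones digit
    s3 = run("step3", lambda: s2 * 10)            # multiply previous result by ten
    return run("step4", lambda: s1 + s3)          # add results from steps 1 and 3
```

      If the first run dies before step 3, a second run with the same file reuses the saved results of steps 1 and 2 and only redoes the rest. Note that the file is keyed by step name only, so reusing it for different inputs would give wrong answers.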

      Your suggestion, on the other hand, is to periodically save the entire system state. This is checkpointing the processes.

      I see a need for both types of checkpointing - applications periodically checkpointing data (like the autosave feature in the market-leading word processor) and system-state saves (like the sleep feature of some laptops). Reliability and recoverability should be engineered in at all layers.

  5. Hibernation comments are missing the point by ry4an · · Score: 5, Insightful

    The comments to the effect of "it's called hibernation, and has done it for years" are missing the point. That hibernation is a BIOS-supported dump to disk. It's a feature on most laptops and works with just about any OS -- it's worked on my Linux laptop for years.

    I think the feature to be discussed is Operating System (not BIOS) level support of the hibernation of a single process. It'd be nice if I could do a:

    kill -HIBERNATE `cat /var/longoperation.pid`

    and have that program get frozen to disk. Then if I could resurrect just that process later it'd be a handy feature for the long running program that you want to postpone until after you've done whatever you needed to do in single user mode.

    1. Re:Hibernation comments are missing the point by Hrunting · · Score: 5, Insightful

      And if you have something like that, you open yourself up to a wealth of potential problems in the program. Take this simple perl script.

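      #!/usr/bin/env perl

      use strict;
      use warnings;

      my $pid = $$;     # cache the current process ID
      print "$pid\n";   # if frozen and thawed in between, this may no longer be our PID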

      If you stop it between those two $pid commands, there's no guarantee that you're going to get the same pid value back. Programs would have to be specifically programmed to handle this sort of thing (there are other examples, this is just the most basic; network programs particularly would have problems).
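
      The same trap can be sketched in Python. Nothing here actually freezes a process; as a stand-in, `fork` is used, because a forked child also inherits a cached PID that no longer matches its own. The function names are made up for illustration:

```python
import os

cached_pid = os.getpid()  # same idea as `my $pid = $$;` in the Perl script


def pid_is_stale():
    """True if the cached value no longer matches the PID the OS reports."""
    return cached_pid != os.getpid()


def check_in_child():
    """Fork, and report via the exit status whether the child sees a stale cached PID."""
    child = os.fork()
    if child == 0:
        # In the child: cached_pid still holds the parent's PID.
        os._exit(1 if pid_is_stale() else 0)
    _, status = os.waitpid(child, 0)
    return os.WEXITSTATUS(status) == 1
```

      A program that wants to survive being thawed under a new PID would presumably have to ask the OS each time instead of trusting a cached copy.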

    2. Re:Hibernation comments are missing the point by Anonymous Coward · · Score: 1, Insightful

      No one said process-level hibernation would be easy, just that it would be nice. You've pointed out that the OS would at least have to provide some sort of pid reclamation system for it to be tenable.

    3. Re:Hibernation comments are missing the point by gorilla · · Score: 3, Insightful

      There are lots of other issues. If a program has a socket or a device open, what should happen? Should the OS reopen the socket? What if the remote end is requiring status? No point reopening an FTP session if the application thinks it's already sent the userid/password but the server doesn't. What if it's a device, e.g. a modem, and it is locked?

  6. Re:you can by Anonymous Coward · · Score: 2, Insightful

    Talk about the ultimate in karma whoring. Instead of just having one post modded to +5, you get two by delaying the posting of your link. It's almost criminal.

  7. Think of VMware as a process wrapper by Binx Bolling · · Score: 2, Insightful

    This is why VMware suspend works the way it does. It provides a consistent virtualized hardware interface, regardless of the details of the real hardware. The original question referred to individual process saving, and VMware suspend is similar to the whole OS suspend feature in laptops. Nevertheless, if you consider VMware to be a wrapper for individual processes that you want to be able to checkpoint, it turns out to be quite a nice solution to the original problem with zero programming required, and just a little pocket money to implement.

    bb

  8. Perspective on solution by rcj4747 · · Score: 2, Insightful

    At first this seems like a nice idea. It would be elegant to be able to halt processes and resume them later without them consuming resources in the interim.

    Before going forward, ask yourself what the practical application of this work could be. If you have to reboot systems that have long-running computational work going on, you may need more reliable hardware or better management of the system to increase uptime. Furthermore, adding "suspend/resume" functionality to a single process within its own code, as needed, would probably be far better.

    Secondly, think of the concerns you face in implementing this as a generalized solution for user processes. Here are the problems with this concept that I can see.

    First, file handles, file system pointers, and network connections may not exist when the process is restarted. Let's say that NFS data is being processed and, when the process is resumed, that mount is no longer accessible. You get an error from NFS like EIO or the like and the process dies.

    Secondly, the hardware may no longer be available. What if the process was using a PCMCIA card which has been removed? The process dies. In a simpler case, a process could have a tty open for I/O, and that tty may no longer be owned by the user when the process is restarted.

    This requires saving a lot of system state and does little to guarantee that the process can be restarted successfully and safely. Furthermore, the dependencies for a single process (being fairly complex) would require a good knowledge of the process by the user to determine the feasibility of suspending and resuming the process.

    It seems that this would not be accessible to average users of the system, even if it were possible to create in a generic sense.

    It does stand as a good question to start someone thinking about unix internals though.

  9. Re:OS X needs this especially by Anonymous Coward · · Score: 1, Insightful

    my personal preference is to not run Classic apps...I think Apple made a smart call saying "why work hard on something that will be useless in 2 years anyway?"

  10. Re:Use Windows XP by taliver · · Score: 2, Insightful

    It's not possible to hibernate a single process.

    Wow, so the fact that it's been done here is just a red herring?

    Does Virtual Memory mean anything to you?

    --

    I demand a million helicopters and a DOLLAR!

  11. Re:Really worth the effort? by harlows_monkeys · · Score: 3, Insightful

    There are more than power problems to worry about with a long-running process. There are other hardware failures, scheduled downtime, and system crashes to contend with. Just because in this instance it was a power failure that made him wish he had this ability doesn't mean it wouldn't be useful in other circumstances.

  12. Re:Yeah, CDC's NOS/BE could do this 25 years ago by swb · · Score: 3, Insightful

    Why are software techniques shit today compared to yesterday?

    Because we're hopelessly caught up in trying to reinvent a somewhat limited computing paradigm (unix). No one, except for some CompSci projects that never really go anywhere, has any real interest in making a new operating system that builds on the lessons of all the previous operating systems and includes reasonable features like process checkpointing/suspension.

    I'd bet there are patent considerations as well -- maybe many of the good OS features are not reproducible due to existing patents.

  13. Re:File Descriptors are per-process by jelle · · Score: 2, Insightful

    Just return an error message. The application has to be able to deal with lost connections anyway.

    Note that you can SIGSTOP a process, then it will be on hold, may even become completely swapped out. Then you can SIGCONT the same process to let it run again.

    So you could send it a SIGSTOP and force it to swap out. That is just checkpointing until the next reboot... Of course you need more info to restore the process from the swap when the system reboots, but it's a start as to how to implement checkpointing.
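
    For what it's worth, the stop/continue part can be tried from Python on a UNIX-like system. This is only a sketch: the child that sleeps for 30 seconds is a stand-in for the real long-running job, the function name is made up, and nothing here forces a swap-out or survives a reboot:

```python
import os
import signal
import subprocess
import sys


def stop_and_continue():
    """Stop a child with SIGSTOP, confirm it stopped, then resume it with SIGCONT."""
    # A child Python process that just sleeps stands in for the long-running job.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        os.kill(child.pid, signal.SIGSTOP)
        # WUNTRACED makes waitpid report a child that has stopped,
        # not only one that has exited.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        was_stopped = os.WIFSTOPPED(status)

        os.kill(child.pid, signal.SIGCONT)
        # poll() returns None while the child is still alive.
        still_running = child.poll() is None
    finally:
        child.kill()
        child.wait()
    return was_stopped, still_running
```

    From a shell the equivalent is `kill -STOP` followed later by `kill -CONT` on the same pid.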

    I'm sure there is more than one road to Rome.

    --

    --- Hindsight is 20/20, but walking backwards is not the answer.