UNIX Process Cryogenics?
shawarma asks: "Due to a recent
power outage, I've had to shut down a server running a process that had
been running for ages calculating something. The job it was doing would
have been done in a few days, I think, but I had to shut it down before the
UPS ran out of juice. This got me thinking: Why can't I freeze down the
process and thaw it back up at a later time? It ought to be possible to take
all the connected memory pages and save them in some way, preserve file
handles and pointers, and everything. Maybe net-connections would die,
but that's understandable. Has any work been done in this field? If not,
shouldn't there be? I'd like to contribute in some way, but I think it's a bit
over my head.." Laptops have been doing this in some form for years:
most laptops, when they run low on power or when told to by the user, will
go into "suspend" mode, which is similar to what the poster is describing.
Outside of laptops, however, I haven't seen this done. Sleeping processes
also do something similar, sending their memory pages into swap so other
running processes can use the memory. What, if anything, is preventing
someone from taking this a step further?
Of course, you could write your application so that it saves state at regular intervals (aka checkpointing). Especially with calculations you should be able to store intermediate results.
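A minimal sketch of what that kind of checkpointing could look like, in Python. Everything here is an assumption for illustration: the file name, the helper names (load_checkpoint, save_checkpoint, run), and the toy workload (summing squares) standing in for the real calculation. The stop_at parameter only simulates the power going out.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "longjob.checkpoint.json"  # hypothetical file name


def load_checkpoint(path):
    """Return the saved state, or a fresh starting state if no file exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_i": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temp file first, then rename it over the old checkpoint,
    so a crash in the middle of a save never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(path, n, every=1000, stop_at=None):
    """Sum the squares of 0..n-1, saving state every `every` steps.
    If stop_at is given, bail out at that step without saving
    (a stand-in for an abrupt shutdown) and return None."""
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if stop_at is not None and i == stop_at:
            return None
        state["total"] += i * i
        state["next_i"] = i + 1
        if state["next_i"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    print(run(CHECKPOINT_PATH, 1_000_000))
```

If the job is killed partway through, running it again picks up from the last saved step instead of from zero. At most `every` steps of work are lost, which is the trade-off between checkpoint overhead and lost time.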
MSN 8: Now Microsoft even has bugs in their ad campaigns.
External dependencies might include open files (what if you freeze, and then delete the file?), open TCP sockets to daemons elsewhere that wouldn't get frozen, subprocesses, etc... These would probably have to be revived, but how?
This sounds like common sense to me. You never know when the disk is going to poop, the power shut off, the network reset.
At my old job, we were required to record the status of all jobs that took longer than an hour (on a 6 cpu SGI). They never crashed on their own, but I would usually interrupt them if the requirements changed or whatever. If they ever did crash, then there was a record of exactly where they left off.
If you intend to run a program for more than a day or two, you should have it checkpoint its intermediate results to disk, even if this adds 100% to the run time.
--Blair
P.S. Alternatively, you could write a program to have the rebooted computer pull scrabble tiles from a bag structure and print them to the screen. You might at least get some clue as to whether it was asking the right question.
The comments to the effect of "it's called hibernation, and has done it for years" are missing the point. That hibernation is a BIOS supported dump to disk. It's a feature on most laptops and works with just about any OS -- it's worked on my Linux laptop for years.
I think the feature to be discussed is Operating System (not BIOS) level support of the hibernation of a single process. It'd be nice if I could do a:
kill -HIBERNATE `cat /var/longoperation.pid`
and have that program get frozen to disk. Then if I could resurrect just that process later it'd be a handy feature for the long running program that you want to postpone until after you've done whatever you needed to do in single user mode.
Talk about the ultimate in karma whoring. Instead of just having one post modded to +5, you get two by delaying the posting of your link. It's almost criminal.
This is why VMware suspend works the way it does. It provides a consistent virtualized hardware interface, regardless of the details of the real hardware. The original question referred to individual process saving, and VMware suspend is similar to the whole OS suspend feature in laptops. Nevertheless, if you consider VMware to be a wrapper for individual processes that you want to be able to checkpoint, it turns out to be quite a nice solution to the original problem with zero programming required, and just a little pocket money to implement.
bb
At first this seems like a nice idea. It would be elegant to be able to halt processes and resume them later without them consuming resources in the interim.
Before going forward, ask yourself what the practical application of this work could be. If you have to reboot systems with long running computational work going on, you may need more reliable hardware or better management of the system to increase uptime. Furthermore, adding "suspend/resume" functionality to a single process within its own code, where it is needed, would probably be far better.
Secondly, think of the concerns you face in implementing this as a generalized solution for user processes. Here are the problems with this concept that I can see.
First, file handles, file system pointers, and network connections may not exist when the process is restarted. Let's say that there is processing of NFS data being done and when the process is resumed that mount is no longer accessible. You get an error from NFS such as EIO or ESTALE and the process dies.
Secondly, the hardware may no longer be available. What if the process was using a PCMCIA card which has been removed? The process dies. In a simpler case, a process could have a tty open for I/O and that tty may no longer be owned by the user when the process is restarted.
This requires saving a lot of system state and does little to guarantee that the process can be restarted successfully and safely. Furthermore, the dependencies for a single process (being fairly complex) would require a good knowledge of the process by the user to determine the feasibility of suspending and resuming the process.
It seems that this would not be accessible to average users of the system, even if it were possible to create in a generic sense.
It does stand as a good question to start someone thinking about unix internals though.
my personal preference is to not run Classic apps...I think Apple made a smart call saying "why work hard on something that will be useless in 2 years anyway?"
It's not possible to hibernate a single process.
Wow, so the fact that it's been done here is just a red herring?
Does Virtual Memory mean anything to you?
I demand a million helicopters and a DOLLAR!
There are more than power problems to worry about with a long running process. There are other hardware failures, scheduled downtime, and system crashes to contend with. Just because in this instance it was a power failure that made him wish he had this ability doesn't mean it wouldn't be useful in other circumstances.
Why are software techniques shit today compared to yesterday?
Because we're hopelessly caught up in trying to reinvent a somewhat limited computing paradigm (unix). No one, except for some CompSci projects that never really go anywhere, has any real interest in making a new operating system that builds on the lessons of all the previous operating systems and includes reasonable features like process checkpointing/suspension.
I'd bet there are patent considerations as well -- maybe many of the good OS features are not reproducible due to existing patents.
Just return an error message. The application has to be able to deal with lost connections anyway.
Note that you can SIGSTOP a process. It will then be on hold, and may even become completely swapped out. Later you can SIGCONT the same process to let it run again.
So you could send it a SIGSTOP and force it to swap out. That is just checkpointing until the next reboot... Of course you need more info to restore the process from the swap when the system reboots, but it's a start as to how to implement checkpointing.
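For anyone who wants to try the SIGSTOP/SIGCONT part, here is a rough Python sketch. The function names and the sleeping child process are made up for the example. It uses os.waitpid to confirm the state changes, which only works on a child of the calling process; for an unrelated PID you would just call os.kill and check with ps. It does nothing about forcing the pages to swap or surviving a reboot.

```python
import os
import signal
import subprocess
import sys


def stop_and_continue(pid):
    """Freeze a child process with SIGSTOP, then thaw it with SIGCONT.
    Returns (stopped, continued) as reported by the kernel via waitpid."""
    os.kill(pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid report a stopped (not just exited) child.
    _, status = os.waitpid(pid, os.WUNTRACED)
    stopped = os.WIFSTOPPED(status)

    os.kill(pid, signal.SIGCONT)
    # WCONTINUED makes waitpid report that the child resumed.
    _, status = os.waitpid(pid, os.WCONTINUED)
    continued = os.WIFCONTINUED(status)
    return stopped, continued


def demo():
    """Spawn a throwaway child that just sleeps, pause and resume it."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        return stop_and_continue(child.pid)
    finally:
        child.terminate()
        child.wait()


if __name__ == "__main__":
    print(demo())
```

While the child is stopped it gets no CPU time, but it still holds its memory, open files and sockets, which is why this alone does not survive a reboot.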
I'm sure there is more than one road to Rome.
--- Hindsight is 20/20, but walking backwards is not the answer.