UNIX Process Cryogenics?
shawarma asks: "Due to a recent
power outage, I've had to shut down a server running a process that had
been running for ages calculating something. The job it was doing would
have been done in a few days, I think, but I had to shut it down before the
UPS ran out of juice. This got me thinking: Why can't I freeze down the
process and thaw it back up at a later time? It ought to be possible to take
all the connected memory pages and save them in some way, preserve file
handles and pointers, and everything. Maybe net-connections would die,
but that's understandable. Has any work been done in this field? If not,
shouldn't there be? I'd like to contribute in some way, but I think it's a bit
over my head.." Laptops have been doing this in some form for years:
most laptops, when they run out of power or when told to by the user, will
go into "suspend" mode, which is similar to what the poster is describing.
Outside of laptops, however, I haven't seen this done. Sleeping processes
also do something similar, sending their memory pages into swap so other
running processes can use the memory. What, if anything, is preventing
someone from taking this a step further?
Windows XP has two features that are improved over Windows 2000: suspend and hibernate. Suspend is a low-power standby mode that keeps all of your applications up and running for when you come back. Hibernate actually saves everything to disk and shuts the computer all the way off. When you come back, everything you were working on is there.
Good luck.
Is what you're asking for not just hibernation?
Fully available on Win2k/XP etc, works just a treat.
No idea of anything comparable elsewhere, e.g. for Linux, but the concept is neither new nor unheard of.
No Comment.
is not suspend, it is hibernate. Suspend will power down the computer except for the energy needed to keep the RAM alive. Hibernate will save all data from memory to disk. I, personally, use neither on my laptop.
Remember, there were no nuclear weapons before women were allowed to vote.
How often is this a problem? If it happens a lot fix the power problem...not the problems after.
I don't see it worth the time and effort to set something like this up.
But it's called hibernation. Basically all the processes are suspended, then the system memory is copied to disk. The tricky part is getting the devices to hibernate. The way MS handles it is that all the active devices have to support the hibernation calls or the entire system won't hibernate.
I'm sure other OSes have this too; I wouldn't be surprised if someone has done it with Linux.
-Jon
this is my sig.
If memory serves me right we used to freeze, backup, thaw and go on with life.
Of course, you could write your application so that it saves state at regular intervals (aka checkpointing). Especially with calculations you should be able to store intermediate results.
MSN 8: Now Microsoft even has bugs in their ad campaigns.
I think that would exceed the recommended operating temperatures for your hardware. But on the up side, we might see the head (?) of your box on Futurama.. ;-)
Light cup, beer drink, thin so chain, neck turtle fat, man I won't say it again
External dependencies might include open files (what if you freeze, and then delete the file?), open TCP sockets to daemons elsewhere that wouldn't get frozen, subprocesses, etc... These would probably have to be revived, but how?
http://www.cs.wisc.edu/condor/
Free-as-in-beer, on most major UNIX platforms. Check out our publications, we have several that give all the details you'd need to write it yourself.
Plenty of others, too - libckpt, there was a "Checkpointing Threaded Programs" paper at USENIX this past summer... there are some kernel patches that can do it, most of them under the GPL.
for the "Classic" environment. It seems so stupid watching Mac OS 9 boot up in a window when you want to use a Classic program; Apple ought to save the state of the Classic environment into a file that could be quickly reloaded into RAM when Classic is called for. As the blurb said, laptops have had the suspend feature for years; would it really be so hard to apply the same concept elsewhere?
___
The way to see by faith is to shut the eye of reason. --Ben Franklin
Sounds like a great feature that has actually been implemented on some platforms. But until it starts catching on as a trend and other people figure out its usefulness, it unfortunately won't reach the general masses. Customer demand and survival of the fittest will dictate whether someone picks up the ball and runs with the idea. Try setting up an advocacy website and mailing list to turn your words into actions.
I'd think (I used 'think'!) that if you had control of UPS software that talked to the OS, and of the OS itself (yay, Linux!), it shouldn't be a hard process.
The UPS control software says it's running outta juice, the OS then saves all the memory to disk and sets a flag, so on startup it maps all the memory back.
Then again, I'm not a big assembly level programmer, so I'm sure it's more complex than this...
Good quote, too many chars. Seriously, the slashdot 120 char limit sucks!
I had Be installed for a while and I thought it would do that. I do know I never lost anything due to it crashing. Of course, it didn't crash much. I think using a journaled file system or at least soft-updates would be a good start. Frankly, I have no idea how to code something similar to Win XP hibernate. Shouldn't be that hard though.
--- Think of it as evolution in action ---
What you want is known as "checkpointing."
There have been a number of projects that do this under Unix over the years. Many of them do it for the purpose of process migration. Others do it just for recovery.
One such project that I used in the early 90s was Condor.
The typical approach is to do something along the lines of forcing a core dump and then doing some magic to restart the process from the core file.
This has been done in GNU Emacs for years - at the process level. I used to use some commercial EDA (Unix) software which required some of the source from emacs (unexec.c rings a bell) with some modifications.
VMware suspends to disk. You can go as far as suspending the Virtual Machine, not Virtual Memory. Then copy the "data" files to another machine and resume the same suspended virtual machine like nothing ever happened, as long as the same basic hardware exists on the host system (e.g. NIC, sound, serial ports, etc).
While this isn't quite what you are looking for, it gives an idea of the level this can be taken to. Think of how neat it is for distributed applications. Of course, something like this has to exist somewhere. . .
IT's "quick start" feature saves the contents of RAM to disk, just like XP's Hibernate function. When you start it up, the system is just how you left it, apps and all.
Almost all of the stuff you need is already in a core dump. Perhaps the appropriate approach to this is to try to extend the core-dumping mechanism to also dump other pieces of state. Then you would just need a way to reconstruct process state from a core dump, which most runtime debuggers can almost do anyway.
I suspect that all the pieces of a solution are written and it's just a tricky pick-choose-and-integrate problem.
And damn but I'd love to have this ability.
--G
The job it was doing would have been done in a few days,
In that case, Arthur Dent should know the answer.
back in the day there was a post:
http://slashdot.org/article.pl?sid=99/10/28/0151212&mode=thread
about an operating system with "journaled" processes of a sort, that would automatically back up images of its processes.
many laptops i've seen have built in hibernation stuff in the bios, which did something like create a partition of ~300MB and stored the current state of the system (ram contents, etc) to that partition, which it would reload to memory when it is started again.
i'm sure there are more details to it. i've seen this done on a number of IBM ThinkPads...
would it be possible to create patches for current BIOS revisions that could hack in support for something like this?
No, Beowulf clusters can't imagine in Soviet Russia.
It's called software suspend for linux. look for it on freshmeat.net
Do not look at laser with remaining good eye.
Search the web for the keyword "checkpointing". There's some publicly available checkpointing support from MIT, and some of the scientific-oriented Linux sites like Beowulf probably have these libraries available.
The idea is that a program doing a long calculation periodically dumps state and can restart from the last saved dump if necessary
The question is about process cryogenics, not about how well your stupid laptops hibernate!
There has been a lot of work done on "process migration". That is moving processes from machine to machine.
Obviously those techniques would apply to what you are asking about.
google has lots of links about it
I once had an enormous computer working out a very important question but it was destroyed by Volgons five minutes before it was finished. I feel your pain.
spacefem.com
Trying to do this over networked file systems would be a pain. Imagine trying to copy a remote file that might be changing. Also, if the process was revived in a different environment, you'd have problems in general, be it a new processor, a new hard drive, anything.
That said, I think this is a Good Idea(TM). But it would have to be implemented on a per-process basis, not just as a general system daemon. Imagine if power failed, and every single process was suddenly "remembered". You'd have to have enough hard drive space, memory.... And if you ran out, it would be hard for the OS to figure out which ones to save and which ones not to.
This
I think the same solution would apply here: Find Arthur Dent.
Chris Kuivenhoven is a thief, beware
This is implemented in BIOS on my laptop, an HP Pavilion. If memory serves it's running a Phoenix BIOS of some sort. It requires about a half-gig partition on the hard drive dedicated to the hibernation process; I think it has to be the first one on the drive. But basically it just copies the contents of the memory to that partition, then loads it back up. Works like a champ, and it's OS independent. Actually, I've found it works even better under Linux than 2k.
Of course, as usual, I could be full of shite...
My Sun SparcStation 5 has this feature. Last time I checked it had an uptime of over a year, despite the fact that we moved offices ten months ago. I just suspended it before moving it.
This is a feature of Connectix Virtual PC which can also host Linux. Of course, it has the advantage that it is simulating all the hardware in software.
The answer is 42. :D
+5:offtopic,but anti-American
Condor http://www.cs.wisc.edu/condor uses a checkpointing mechanism for migrating processes between hosts (works on Unices & Win). Not exactly what you need, but maybe a starting point.
You know what, i bet that something like this could be done really easily in Java. Suspend the VM, store its state, then when the system boots back up you restart the VM with an argument to restore the state. Would also be great for debugging purposes or sending in bug reports (Here's a copy of my VM state when it hung).
....
I'm starting to drool over the possibilities here.
42 - So long and thanks for all the fish.
I've always wondered how hard it would be to resurrect a core file. One would think that there's enough info in a complete core to reopen all the open fd's, and possibly even reinitiate network connects. Everything else is there-- program counter, stack, heap, etc. As such, one could 'kill -ABRT' the process and revive it again later. Has anyone seen this done?
So long, and thanks for all the Phish
You can't just serialize and page out one process. Under every process are a slew of kernel objects and kernel crud including the virtual to physical mappings of your address space. It would be quite a challenge to isolate all of this and somehow persist it.
To make suspend work, you'd have to dump your entire memory image to disk. Then you swap in the entire image, kernel and user pages alike.
Someone you trust is one of us.
(I'm a Solaris user but I assume Linux is the same.) You can type Ctrl-\ to get a coredump of a running process and you can load a coredump with dbx. It seems like that's 90% of the infrastructure. You'd want it to run outside dbx and do it automatically. My guess is you'd have to just remap some addresses, recreate file pointers (assuming said files haven't been modified), reinstate the stack, and go.
This should be even easier to do in a JVM, even not relying on their serialization stuff.
1) Produce the core dump of a process
2) Use the core and process image to restart it
(for example in a debugger such as gdb, if you
don't want to write specialized software).
To the best of my knowledge the perl "compiler" uses
precisely this technique to produce perl "executables" - dumps them out as a core right
after compilation and reuses it later on.
You can do this to a kernel as well, if you
REALLY want to.
However, since indeed many things may be dependent
on the state of the kernel, files, network connections, devices etc. etc. doing this is not advisable.
Good coding practice for long-running processes is
to actually spend some time on writing the state
saving functionality to support process restart.
Anyway, (call it a flame if ya will) but the fact
that /. posts this as a relevant question is very
disquieting - the level of technical knowledge here
gets reduced day after day.
I've used the Suspend/Resume feature on a sun box. IIRC, it mostly worked, but with a minor hitch that made me worry enough to never do it again. This suspend/resume is just like the laptop version -- save a copy of all memory to disk -- not the cryogenic per-process version you're talking about.
The per-process sounds neat, but usable only if you've got a simple critical task you're running. For a more complicated application, multiple processes may be working together, and you'd have to suspend all of them at the same time.
One big question I would have would be file handles... if you restore a process that thinks it owns file handle #5 and some other process is already using it, it would be awkward to get either process to use a different handle.
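On the descriptor numbering worry: UNIX file descriptor numbers are per-process, so another process already using its own descriptor 5 does not collide with the thawed one. What the restorer does need is to get the file back at the same number and the same offset inside the revived process. A minimal sketch of that step, assuming a regular file that still exists at the recorded path; `snapshot_fd` and `restore_fd` are hypothetical helper names, not part of any existing checkpointing tool.

```python
import os

def snapshot_fd(fd, path):
    """Record what is needed to reopen a regular file later.

    The path is passed in by the caller; a real tool would have to
    discover it some other way.
    """
    return {"fd": fd, "path": path, "offset": os.lseek(fd, 0, os.SEEK_CUR)}

def restore_fd(state, flags=os.O_RDONLY):
    """Reopen the file and put it back on its original descriptor number."""
    new_fd = os.open(state["path"], flags)
    os.lseek(new_fd, state["offset"], os.SEEK_SET)
    if new_fd != state["fd"]:
        # dup2 closes whatever currently sits on the target number and
        # makes it refer to the same open file (offset included).
        os.dup2(new_fd, state["fd"])
        os.close(new_fd)
    return state["fd"]
```

This says nothing about sockets, pipes, or files that were deleted or modified in the meantime, which is where the hard cases are.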
HIV Crosses Species Barrier... into Muppets
There is a feature of VirtualPC on Macs that does this. If you try to exit the emulator before shutting down the emulated machine, VirtualPC asks if you want to save the memory. If you say yes, the whole memory of the emulated PC is saved in a file, and you can continue using the PC later, exactly where you left it last time.
The other nice thing about this is that restoring the memory is much faster than rebooting.
You can also save several sessions and start again with the one you want.
Of course it would be nice to do all that for just one process, or maybe even for all of them on a UNIX machine...
-- Slef
First, let me say that what the poster is suggesting sounds a little more sophisticated than a simple re-implementation of XP's hibernate function, although functionality like that under UNIX would certainly be invaluable. It sounds like the poster wants control over individual processes, something that I consider far more interesting.
What's said here is certainly very reasonable. But the extensions of what's being suggested are even more fantastic. Once a process is completely removed from memory, with file handles and storage and status all kept away safely, is there any reason that the process is really tied to that computer? Why wouldn't it be possible to take that 'frozen' process, transfer it to another machine with access to the same filesystem on some level (some translation of file handles would likely be necessary), and thaw it there, allowing someone to move a running process to another machine? Need to replace your web server's only CPU, but don't want downtime? Move the process to a backup machine, replace the original's hardware, and move the process back.
I even thought I had heard that someone was working on just such a project, or at least thinking about the details of implementing it. (I'm just getting started in learning UNIX internals myself). Anybody have more references to information on this sort of thing?
"You know, Hobbes, some days even my lucky rocketship underpants don't help" -- Calvin
Many of the old console emulators do this exact thing. It's really nice to save the state of the game, and load it back up exactly where you left off. It makes it really easy to cheat too!
I was wondering the exact same things about other applications, or the whole OS itself! Wouldn't it be much faster to have an on/off button that would save and load the state of the computer? Then you wouldn't have to boot it up and down, you could just pick up where you left off. Maybe this isn't feasible, I don't know.
All that does is make all your apps still run. and the apps may not be actually DOING anything.
Freezing every calculation, right down to what was in RAM, at that very moment.... I don't know. Might be a little difficult. Or not? We have been able to freeze and thaw light, for god's sake..
Carpe Noctem -=- Seize The Night
So it seems to me that if this were going to be done at an OS level, the OS would need some kind of integration with a database, and apps needing to "freeze" would need a standard method of saving the last completed intermediate phase and deltas into the OS database for later re-activation.
I don't personally know of any software/OS combination that does this well, but am admittedly not an OS know-it-all, and look forward to responses from the rest of the /. community.
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
A different solution, which is very common for long running processes, is to use savepoints, i.e. save the state of the process regularly to a file at suitable points of the algorithm. Once your process dies or you kill it, you can restart from that savepoint. If your state information is very large, you can stretch the save interval to reasonably long times, e.g. several hours. Typically you don't mind losing some hours of calculations due to an occasional power outage.
Of course this solution is not as general as the "process cryogenics" you describe, but it's also easier to implement because you have more information about the problem.
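The savepoint approach can be sketched in a few lines. This is a minimal illustration, not anyone's production code: the loop body (a sum of squares), the state layout, and the function names are placeholders. The one detail worth copying is writing to a temporary file and renaming it over the old savepoint, so that a power cut in the middle of a save still leaves the previous savepoint intact.

```python
import os
import pickle
import time

def save_checkpoint(state, path):
    """Write state to path without ever leaving a half-written file there."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())      # push the data to disk before renaming
    os.replace(tmp, path)         # atomic rename on POSIX file systems

def load_checkpoint(path, default):
    """Return the saved state, or default if there is no savepoint yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

def long_computation(n, path, interval_s=3600.0):
    """Sum of squares below n, saving a savepoint every interval_s seconds."""
    state = load_checkpoint(path, {"i": 0, "total": 0})
    last_save = time.monotonic()
    while state["i"] < n:
        state["total"] += state["i"] ** 2
        state["i"] += 1
        if time.monotonic() - last_save >= interval_s:
            save_checkpoint(state, path)
            last_save = time.monotonic()
    save_checkpoint(state, path)  # final state, so a rerun returns immediately
    return state["total"]
```

With an hourly interval the worst case after an outage is one lost hour, and the save cost is paid only once per interval rather than on every iteration.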
There's no reason why you can't do it either in an app by saving state or in the OS by saving memory to disk as on a laptop.
GEOS had the concept of state-saving in the OS circa 1990, so it's nothing new. The UI saves its state, what apps are running, what windows are open, etc. and restores it exactly as you left it when you restart. If an app has extra data to save, such as where it was in a lengthy computation, it can save it, too.
A slightly different approach than brute-force writing out all of used memory, but both work quite well with the speed of current hard drives.
This is a semi-feature in BeOS: for instance, the first time you boot (or if you modify drivers, etc.), it takes a while because it has to load all of those drivers, do initialization, etc... However, it also appears to save a snapshot of everything required to boot again, because the next time you boot, it only takes 10 to 15 seconds.
I may be wrong about exactly how it does it (i.e. a snapshot), but it works as if it had done that. For application stuff, this is MUCH harder, unless you save out the memory of each application to disk and hope that any hardware they need doesn't change before the next boot. There are lots of little niggles however, and this problem isn't light development.
This facility is called checkpoint/restart. It was a feature of OS/360 and other operating systems in the 1960s. In some very early versions of Unix, core files were restartable. Usually it's pretty easy for programs to save enough state to be restartable on a case by case basis, except when it's just about impossible (like when networks reconfigure) so it's not a popular system feature these days (hard to implement in a general way, doesn't do a very good job in the cases that can be handled easily.)
A friend of mine (Hugh Redelmeier) ran a very long (~400 day) computation on a PDP-11 in the mid-1970s. The program ran stand-alone, and part of the test plan involved flipping the power switch on and off a few times -- very amusing to watch the program keep on running right through power failures. (Main memory on the machine in question was magnetic cores, which are non-volatile.)
-Tom Duff
Vmware does this for the VM's it hosts. Works great.
Creed
All that is necessary for the triumph of good is that evil men do nothing.
For any program that you intend to run for more than a day or two, you should checkpoint its intermediate results to disk, even if this adds 100% to the run time.
--Blair
P.S. Alternatively, you could write a program to have the rebooted computer pull scrabble tiles from a bag structure and print them to the screen. You might at least get some clue as to whether it was asking the right question.
There is some ongoing work on hibernation and process checkpointing. ACPI4Linux is an attempt at implementing the ACPI specification for Linux. This is different from APM though, and the product is quite preliminary. There's also an interesting site on process checkpointing, migration and resumption. Basically, it's implemented as a kernel module that, upon invocation, freezes the scheduler, dumps all process-related information into a separate hibernation partition and shuts off.
HTH,
Shankar
If only you were using Java; you could have included a trigger to stop the process, serialize it, send it to another server, and continue the process until completion.
from Connectix (I think) does this... of course, it's not a "whole system" solution (I run Win2k as host, then Virtual PC with RedHat 7.0) and it saves the state of the linux machine to disk whenever I shut it down. This works pretty well for me, but might not be so great for huge number crunching, as the Virtual PC is always a lot slower than the host OS. Still, it might be worth looking into for some people.
The most straightforward solution to the data loss problem is to design the software to maintain its own restart data. I've spent about a year working on an atmospheric simulation that typically takes several days to run. We wrote the sim program to dump its current state every hour or so; that way, in the event of catastrophe (power outage, OS crash, whatever), the most we'd lose is an hour's worth of computation. Of course, this requires that you have enough access to the innards of the program to do this...
I'd rather be flying
There is or was a project to suspend the whole OS to disk. Details are here: http://falcon.sch.bme.hu/~seasons/linux/swsusp.html
The difference between Canada and the USA is that in Canada healthcare is a right and gun ownership is a privilege.
This next one would complicate things a bit: the user should also be able to wake up the process the same way, i.e. kill -WAK $PID. This means that an index of hibernated processes also needs to be kept synchronized between the kernel process tables and a file on disk, to be preserved between reboots.
Maybe I'll write another kernel patch...
For long-running processes, rather than shut down the process when the UPS kicks in, I've always found it easier to have the program snapshot its data tables periodically (say every half-hour) and build a "resume from disk" feature into the program. This lets you restart the program from its last check-point even in the event of uncontrolled program termination (e.g. kill -9 and the like).
-JS
Vanity of vanities, all is vanity...
The main reason this "suspend" feature works relatively well for a laptop is because the hardware is a "given". The laptop has to have a certain video card and motherboard chipset, specific type of hard drive, floppy, CD-ROM and sound device. (In fact, when laptops fail to come back up properly from a suspend, it's almost always the one "add-on" card people have in laptops, the PCMCIA network adapter, that causes the problem.)
3Com PCMCIA cards are about the only ones I've used that allow the laptop to power them down and back up again, and resume network activity without a complete machine reboot.
You can run Linux in an isolated environment on your computer and when you want to freeze a process, VPC can save the state of the environment. When you thaw it hours or years later, the environment doesn't know any time has passed. Since VPC can run multiple instances on the same machine, you can put the critical process in its own environment.
The comments to the effect of "it's called hibernation, and has done it for years" are missing the point. That hibernation is a BIOS supported dump to disk. It's a feature on most laptops and works with just about any OS -- it's worked on my Linux laptop for years.
I think the feature to be discussed is Operating System (not BIOS) level support of the hibernation of a single process. It'd be nice if I could do a:
kill -HIBERNATE `cat /var/longoperation.pid`
and have that program get frozen to disk. Then if I could resurrect just that process later it'd be a handy feature for the long running program that you want to postpone until after you've done whatever you needed to do in single user mode.
There are big problems with such an approach, mainly with device usage. Basically they are all the problems that you would have with process migration, plus a few more because of the temporal discontinuity.
If you are using a scanner, or a mouse, or whatever, that device may not be there or may not be available when the process is brought back. Furthermore you may have a file descriptor opened on a local (or network shared) file which no longer exists or has changed drastically.
There are further non-device-dependent problems with shared memory, opened-but-unlinked files, parent PID, IPC resources.
Having said all of the above... I suppose that for the very rare case that your program is completely memory and CPU dependent you could retire and recover a task.
my $0.02
-- bartman
The idea was that when you put your computer to sleep, instead of keeping the SDRAM (or whatever the laptop had) powered to preserve the memory contents, it would write it all to a special sector on the hard drive that the firmware knew to read from when starting from sleep. This allowed sleep to be even more low-power than it already is, since a hard drive does not require power to retain data.
--
"If you are an idealist it doesn't matter what you do or what goes on around you, because it isn't real anyway."-R.P.W.
If you could sleep processes you could run some intensive job at a high priority when you're not logged into your workstation and then sleep the processes when you log in. This way you could run some job that takes weeks or months but not bog down a workstation that you need for doing daily work on.
Yeah, you could "nice" down the process so that it doesn't slow things down while you're logged in... but then system processes at higher priorities might slow down your number crunching when you're not logged in... It'd be best to be able to run it at high priority at night only.... ya know, use those unused cycles.
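For the in-memory half of this, the standard job-control signals already do the trick on a POSIX system: SIGSTOP halts a process where it stands and SIGCONT lets it carry on. A small sketch, with `freeze` and `thaw` as made-up wrapper names. Note the limitation: a stopped process still lives in RAM or swap, so this frees the CPU during the day but does not survive a reboot or a power cut.

```python
import os
import signal

def freeze(pid):
    """Stop the process with the given pid; it keeps its memory but gets no CPU."""
    os.kill(pid, signal.SIGSTOP)

def thaw(pid):
    """Let a previously stopped process continue where it left off."""
    os.kill(pid, signal.SIGCONT)
```

Calling `freeze` from a login script and `thaw` from a logout script (or from two cron entries, morning and evening) would give the "high priority at night only" behaviour described above, assuming the caller is allowed to signal the job.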
There are 10 types of people in this world, those who can count in binary and those who can't.
One fairly simple alternative is to simply have the application save its own state to a "checkpoint" file periodically. This approach has been used in other applications for a long time in the form of auto-save files (e.g. Emacs) and would be easily adapted to a long running program like the one you describe.
Just because the OS doesn't support it automagically it doesn't mean that you can't solve it for yourself with a little bit of extra work and planning.
Linux software suspend may be of interest.
Take a look at this: http://falcon.sch.bme.hu/~seasons/linux/swsusp.html
Generally most Laptops can do this, but I think what the poster is going for is a tool which will hibernate a single process. I think this is a very useful idea.
For instance, what if your company runs 1 shift, and you're sitting there thinking, now what could I use this IBM zSeries Linux server for at night...how about trying to factor the RSA-2048 number?...but your implementation of the General Number Field Sieve algorithm consumes massive resources, so you want to hibernate the process during business hours and wake it up at night when the boss goes home, without having to start all over. Then 50 years later when the process is finished you'll have your $200,000 prize from RSA.
So, you mean that the next time my app segfaults and dumps core, I can say it was a feature designed to allow it to be restarted...? Cool. Seriously though, how can you restart a core (obviously not one from a segfault) using gdb?
The bastard children of Vogons and Vorlons?
Best Slashdot Co
The vmware window has a freeze/suspend button that will let you freeze the session and resume later. Taking that a step farther, you can even copy the files for that virtual machine to another host, start vmware back up, and execution will resume right where it left off. A number of Linux/BSD/Win os's supported, too.
Hibernation is great. Much faster boot up is the end result. C'mon, if MS can implement it smoothly, it must be possible in UNIX/LINUX/BSD. It's invaluable for laptops, somewhat less so for desktops, and negligible for servers, except in this guy's situation.
That said, who's gonna have the foresight to NOT strip this feature out of your own install to conserve server resources? Doh!
Often in Error, Never in Doubt.
I'm afraid that this is clearly -1 OFFTOPIC, even if the HHGTTG reference does make you wet your pants with glee. Pull out those mod cannons!
Long ago and far away (about 15 years ago) I recall that TeX was frequently built in a fashion that required running the binary on some "initialization" information. That process took some nontrivial amount of time back in those days (I'm sure now it would be an eyeblink), and the program could be made to \dump its state in some way.
Then, when you ran TeX in everyday circumstances, the digested initialization file was read in by the application as part of the usual startup process.
I'm probably botching the explanation of how this really worked, but I guess my point is that the "resume" function had to be coded into the specific application.
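The general shape of that trick, coded into the application, looks something like the sketch below. It is not how TeX actually stores its format files; `slow_initialize`, `load_or_build`, and the pickle file are stand-ins chosen for illustration. The first run does the expensive setup and dumps the digested result; later runs just read the dump.

```python
import os
import pickle

def slow_initialize():
    """Stand-in for expensive startup work (hypothetical)."""
    return {"macros": {f"m{i}": i * i for i in range(1000)}}

def load_or_build(dump_path):
    """Return the initialized state, reading the dump if one already exists."""
    if os.path.exists(dump_path):
        with open(dump_path, "rb") as f:
            return pickle.load(f)      # fast path: reuse the digested state
    state = slow_initialize()          # slow path: build once...
    with open(dump_path, "wb") as f:
        pickle.dump(state, f)          # ...and dump it for next time
    return state
```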
"Provided by the management for your protection."
A student at my lab who needed several days to
run his simulations got tired of network outages,
unscheduled reboots and such, wiping out his results. So he redesigned his programs to save partial results and states to a file. If he had to restart his sim, it took up where it last saved.
Seti@Home takes about 65 hours per data packet on my machine -- of course with Win98 there are almost daily requested or insisted upon shutdowns. The designers obviously anticipated power-offs (intended or not) and dealt with it. I think the apps that require such runtimes should be designed to deal with such exigencies.
For most purposes, 355/113 is close enough.
I don't know what the issue is. *nix can swap processes to disk. It'll save all of the info in a file (just like a core dump). Solaris can suspend everything (its entire state) and recover that later. I'm pretty sure I've heard my friends talk about the same feature under Linux as well...
Saving a process (all of its pages) has been around for a very long time.
On the other hand, if you have a program that takes days/weeks/months to finish (I do quite frequently) you need periodic checkpoints. There is no way around it. If you're talking about weeks/months, upload those checkpoints to another computer over the net, or burn a CD. The cost of $.50 per CD is nothing compared to the loss of a month of computation.
Once you've enabled it, you create a hibernation file on the C: drive. Hibernation should only take place when there is minimal disk activity (e.g., don't hibernate while trying to save your Word document). The system saves the contents of RAM to the hard drive, and then shuts down. When the machine boots, a flag (I assume) indicates that the system should resume from hibernation, so the hibernation file is read from disk back into RAM and you're back up and running, in less time than it takes to boot. Plus it keeps your uptime from resetting back to zero.
Some things to note:
You will need WHQL certified drivers, or at least properly-written drivers. I have a SB Audigy and the first drivers I used (the ones on the included CD) caused a blue screen on resume from hibernation. When an updated driver was released, it fixed this issue.
Applications need to be properly-written as well, as there is some sort of Win32 suspend signal that is sent to apps just before the system hibernates, so the app must support this and the resume command when the system is restored.
Hibernation works great on my laptop and on my workstation, and I especially like the fact that I don't need to create a separate partition or install special drivers to make it work (you can even use it on an NTFS formatted drive).
But isn't it overkill for a data-crunching operation? As many other people have noted, it would seem you're much better off checkpointing your data to disk, rather than relying on low-level OS process wizardry.
Sig: What Happened To The Censorware Project (censorware.org)
Actually Emacs has been doing this for a long time, believe it or not.
If you've ever built Emacs, you'll see that at one point the makefile runs emacs. Well, it runs an executable that prints out GNU Emacs, (c) etc., loads a whole bunch of things and then ... core dumps!
It seems that Emacs does a lot of stuff when loading, meaning it would take forever to start. To get around this the developers decided to have it load partially during the build, setting up everything once and then checkpointing. They dump the state of the app (which is what a core dump does) and then build an executable that just loads the core dump and continues it. That's why Emacs loads so ... quickly, I guess.
I don't know if you can do a similar thing to other programs via GDB, but I'm sure it is possible to build this into each program.
There is a kernel patch to do this. It's called Software Suspend. It is also part of the FOLK project (Functionality Overloaded Linux Kernel, a project to merge the largest possible amount of patches into the kernel).
The filesystem is the package manager
Surely if this process takes so long to execute, the person who wrote it should have made it save its state every once in a while. Problems like these could have been avoided! SETI@home, to name but one, does exactly this.
James
My Intel processor puts it somewhere around 41.99999999967
http://falcon.sch.bme.hu/~seasons/linux/swsusp.html
You can't. The previous poster was making it sound too easy. Real checkpointing needs to save Kernel state as well -- file handles, device driver state, you name it. It isn't as simple as saving the in-memory image of the process.
I think that this might also be a really good bug fix/hacking tool. I can also remember something like this for the Apple II in years gone by. You could press a button and take a snapshot of all memory in the system. Then you could write the executable part to disk and pick up where you left off. Good for freezing a copy of a game or whatever.
This would also be good for tracking down bugs using the "before and after" technique.
Such a program could be tied into the UPS monitor in such a way as to save everything that couldn't be stopped.
As usual, this is ancient. Back at FSU, we had a CDC Cyber 205, a vector pipeline supercomputer, back in 1985. Any process could be crashed for a shutdown, and it produced a file that worked exactly like an executable and resumed computation from the time it was crashed.
Actually, if you want to play with a persistent programming environment, download a Smalltalk environment. Smalltalk environments are able to serialize themselves to image files. When the image is subsequently deserialized, the state of all the objects in the system at the time of serialization is restored.
--
CPAN rules. - Guido van Rossum
I was thinking about this and here was my dirty hacky idea. You need kexec, lobos, or something similar (actually a fairly modified version of it). You'll need on the order of 8MB of disk space and some kernel mods, which might not be that extensive.
I was thinking we develop some driver or process that consumes all of the memory and CPU in a system. It forces all of the processes to swap out; it would probably need to be a driver of sorts on current Linux systems. Then it could dump the kcore out to a file somewhere, sync it, and hibernate. Then when the kernel boots up, if the right arg is passed in, it could either load this image back into RAM in place of the kernel and then jump into it early in the boot (easier said than done: page tables are made long before you have access to the drives and such, so the logistics of this would need to be figured out), or it could boot up, use a different swap partition, and then have some kind of tool like kexec load that image back into RAM and start it up. Or something. Somehow you should be able to recover the state of the system. File handles and everything would be there.
The harder part would be hardware and network transparency. You'd need to modify all of your drivers to make sure that the hardware could be reset and they could deal with it. I think it's a little easier for the network side because it would be similar to simply unplugging the network cable: you have open sockets that are talking to nothing, and some software can deal with that pretty well. There is also some kind of system integrity or robustness piece that is needed: if the system somehow changes before you bring your old image back, it could break things, munge files, etc.
the seti@home client uses its *.sah files to save the state of a calculation. of course, this is program dependent, not OS dependent. I guess if you have the source files for the program doing the counting.....
Tequila: It's not just for breakfast anymore!
STANDALONE CONDOR CHECKPOINTING:
..
...
Using the Condor checkpoint library without the remote system call functionality and outside of the Condor system is known as
"standalone" mode checkpointing.
To link in standalone mode, follow the instructions for linking Condor executables, but replace condor_syscall_lib.a with libckpt.a. If you
have installed Condor version 5.62 or above, you can easily link your program for standalone checkpointing using the condor_compile
utility with the little-known "-condor_standalone" option. For example:
condor_compile -condor_standalone <your_compiler> [options/files....]
where <your_compiler> is any of cc, f77, gcc, g++, ld, etc. Just enter "condor_compile" by itself to see a usage summary, and/or refer to
the condor_compile man page for additional information.
Once your program is relinked with the Condor standalone-checkpointing library (libckpt.a), your program will sport two new command
line arguments: "_condor_ckpt <ckpt_filename>" and "_condor_restart <ckpt_filename>".
If the command line looks like:
exec_name -_condor_ckpt <ckpt_filename>
then we set up to checkpoint to the given file name.
If the command line looks like:
exec_name -_condor_restart <ckpt_filename>
then we effect a restart from the given file name.
Any Condor command line options are removed from the head of the command line before main() is called. If we aren't given
instructions on the command line, by default we assume we are an original invocation, and that we should write any checkpoints to the
name by which we were invoked with a "ckpt" extension.
To cause a program to checkpoint and exit, send it a SIGTSTP signal. For example, in C you would add the following line to your code:
kill( getpid(), SIGTSTP );
Note that most Unix shells are configured to send a TSTP signal to the foreground process when the user enters a Ctrl-Z. To cause a
program to write a periodic checkpoint (i.e., checkpoint and continue running), send it a SIGUSR2:
kill( getpid(), SIGUSR2 );
In addition to the command-line parameters interface described above, a C interface is also provided for restarting a program from a
checkpoint file. The prototypes are:
void init_image_with_file_name( char *ckpt_name );
void init_image_with_file_descriptor( int fd );
void restart( );
The init_image_with_file_name() and init_image_with_file_descriptor() functions are used to specify the location of the checkpoint file.
Only one of the two must be used. The restart() function causes the process image from the specified file to be read and restored.
I think it was somewhere in the list of patches from the -mjc tree (see here) that there was a whole-system suspend patch for the Linux kernel. Basically it lets the system save its state, and then restore it if it detects that it was shut down at that point. I'm not sure if this is what you want (and I couldn't get it working), but it's certainly a step in the right direction to what you're looking for.
Just found it here, it's the 'swsusp' patch.
I dunno about GDB, but you can do this on command with the "abort" call and the "undump" command. While in your program, call abort(). Run undump on the core file to get an executable. When you run the executable it starts exactly where it left off at the abort().
details here
Woops, after reading that it sounds like it starts off at the top of main() again. But, if you had a flag to indicate where you'd aborted from, you could jump to that immediately and resume operations.
It's a cool little trick; unfortunately I've not yet gotten to use it for anything :)
Your right to not believe: Americans United for Separation of Church and
If you use Java's serialization support (java.io.Serializable) right, you can create a lightweight persistence layer and should be able to freeze and resume work within the same application, provided you handle threading correctly.
...is that way too often they don't wake up when you want them to. I've seen this happen on Macs as well as Compaq boxes. It's an annoyance when a reboot is required; it's even more annoying when you're in the midst of a huge calculation/rendering job and there's a power problem. Hypothetically, you freeze/sleep/whatever your system. Once the crisis has been averted you wipe the sweat from your brow and breathe a sigh of relief, hit the little pulsating moon button and watch as your computer does...nothing.
"The answer to life, the universe and everything is...is...[zzzzzzt]"
"We're going to get lynched, do you know that?"
--Triv
The answer would have been 42 once the processing was complete. So who cares? Get a bigger UPS :-)
Here's a mutation of FreeBSD that can do exactly that. I've put my laptop to sleep in the middle of installing software while running MacOS X and brought it back up several hours later to resume installation with no problems. The same function works on my G4 tower. Yes, it does drop network connections. However, it does use a trickle charge to power the LEDs and presumably to keep the processor alive, and possibly some memory. Paging several hundred megabytes in a couple of seconds would be quite the task! One item of note is that all Apple machines have a special piece of hardware known as the PMU (Power Management Unit). In the desktops, it's parted out onto the motherboard and into the power supply, but in the laptops it's a separate card which controls both sleep and the charging of the battery. Perhaps other UNIX machines would need a similar device for this function to work properly.
Karma: Ran over your dogma.
If memory serves me right, it was even called 'checkpointing' already, although I never used this feature.
NOS/BE = Operating System of CDC (Control Data Corporation) for their CDC6600 and Cyber systems.
Ah, those were the days...
You could do this with VMware. Run another copy of Linux inside a VM, and suspend the VM when you need to shut the box down for a while. Very simple.
This is not the most efficient way to use a computer, of course -- you'd probably want to dedicate your resources to the application instead of to a virtual machine environment -- but technically this does get the job done.
Tired of FB/Google censorship? Visit UNCENSORED!
Easier said than done. If this wasn't part of the application's design or if it's relatively sophisticated, making these changes can be non-trivial. And (shock/horror) if you don't have the source code, it's impossible without OS assistance.
That sounds more like "suspend" than "hibernation". When you hibernate a Windows box, it writes its entire RAM image to disk and turns off. When you turn the box back on again, it actually has to boot the image back into RAM.
In your situation, some power is still necessary to maintain the RAM while the lid is closed.
-- Brian
The most rabid believers in American Exceptionalism are the exact same people whose policies are destroying it.
UNICOS has been doing system checkpoints for years. Checkpoint the system, shut it off, turn it on, restart from checkpoint, everything is exactly as it was when originally checkpointed.
you could save a quick start file which was a snapshot of the running program. To start up again later, it would just read that snapshot and push the whole bloody thing into RAM. Once completed you were running.
it was the difference between a 10 minute startup or a 1 minute startup.
comment directly in my journal
Why can't I freeze down the process and thaw it back up at a later time? It ought to be possible to take all the connected memory pages and save them in some way, preserve file handles and pointers, and everything. Maybe net-connections would die, but that's understandable. Has any work been done in this field?
Yes, in Redmond. It's called "Hibernate Mode", and it's been around for a while now. If the truth hurts, go ahead and mod me down. My karma's capped at 50 right now anyway.
I think this problem is more easily solved in hardware than in software. With recent advances in solid-state memory, hopefully a standard can be worked out so that solid-state memory can replace or complement volatile memory (i.e., RAM as we know it). Solid-state memory would survive a power outage, and you could pick up where you left off.
The disadvantages are speed (solid-state memory is getting faster all the time, but it is still slower than volatile RAM), cost, and lack of current standardized implementations (I'm not even sure there are any working implementations.)
For some background research in solid-state memory, check out this site (it's a bit old, but still interesting).
It depends on what kind of calculations you are doing. But if they are time-sensitive, then besides all the almost purely hardware solutions proposed so far, you'd also need OS system calls that give fake date/time answers (distinct, of course, from the 'real' ones, which would also be available).
If you want per-process "cryogenics", you'll also have to keep track of a separate date/time status for each process, which could cause trouble if they communicate now and then...
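One way to picture the "fake date/time" idea is a per-process clock that simply does not advance while the process is frozen. The sketch below is purely illustrative: ProcessClock and its method names are made up here, and no real OS exposes this interface in this form.

```python
import time


class ProcessClock:
    """Per-process clock that leaves out time spent frozen (hypothetical)."""

    def __init__(self, now=time.time):
        self._now = now            # source of "real" time, injectable
        self._frozen_for = 0.0     # total seconds spent frozen so far
        self._frozen_at = None     # real time of the current freeze, if any

    def freeze(self):
        if self._frozen_at is None:
            self._frozen_at = self._now()

    def thaw(self):
        if self._frozen_at is not None:
            self._frozen_for += self._now() - self._frozen_at
            self._frozen_at = None

    def process_time(self):
        """Time as the thawed process should see it."""
        base = self._now() if self._frozen_at is None else self._frozen_at
        return base - self._frozen_for

    def real_time(self):
        """The 'real' clock, still available alongside the fake one."""
        return self._now()
```

Two processes frozen for different lengths of time would each carry their own offset, which is exactly where the trouble with inter-process communication comes from: their process_time() values no longer agree.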
Not quite.
I know of no laptop manufacturer that calls this process "hibernation". Every laptop manufacturer I know of calls this "suspend to disk" or something similar.
The only product that I know of that calls this process "hibernation" is Windows 2000.
Windows 2000 implements hibernation at the *OS* level. It has nothing to do with your BIOS.
Make your standard slashbot comments about W2K, but this is a feature they got right. Since it sends the same 'ol suspend/unsuspend messages to processes, most of your apps will even reestablish their network connections without any fuss.
I have a fairly loud multi-processor system and the misfortune of a combined bedroom/home office so hibernation is a real lifesaver for me if I want to get any sleep.
Would someone please give the community some insights on how this worked on the HP 3000 or HP MPE? My understanding is that everything in that operating system was a transaction. When powering on, the system would roll back to the last committed transaction and just start right back where it left off.
With this system, the process would just start from where it left off.
A description is in this paper MPE/iX Transaction Manager
hum...
maybe ctrl-z?
speaking from a mac's POV, i'm running mandrake 8.1 under emulation on Virtual PC 5...install under emulation takes a while, so i split it up into 3 days. You can save the entire PC's state, copy it, and run it on another computer and boot back up under that same instance. Possibly you could run a "virtual linux server" that people have been talking about in recent mainframe posts....not 100 or the likes, just one, which i would guess wouldn't be too difficult. once that works, you might be able to save the "virtual server"'s state. shrug
which brings another thought: could you distribute an ISO of linux that was in a saved state, you just put in the CD, turn on the computer, and go. you could limit it to accessing 128 megs of ram, using a NE2000 compat. network card w/dhcp, and a standard vesa2.0 video driver. just boot up and go, no install or partitioning. write to ram as a ramdisk, but you'd lose everything when you shutdown; not bad as a dumb terminal.
it would fall somewhat along the same principles of making a PC a game console, just stick in the disk, turn it on, and go-all the basic hardware works. sound might be an issue, I don't know of any sound standards other than soundblaster 16 support.
moox. for a new generation.
What if the process has forked off a bunch of children? Are you going to archive all the children at the same time? What if the process has a whole bunch of files in /tmp, are you going to roll them up into the freeze state as well? What if you're using pthreads? Are you going to keep the state for each thread? How about file pointers?
I think the better solution is to write a new signal called "SIGFREEZE" and have programs just write code that could handle such an event. Let the program figure out how to save their own stuff.
A good example would be a program that was calculating pi. The programmer would have to implement a signal handler that, when it received a SIGFREEZE, would stop computing and write what it is currently working on out to a file. The other thing the programmer should be doing is periodically writing the data out to a file anyway. Then the programmer should implement a command-line option that would facilitate reloading from a saved state.
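A rough sketch of that handler, with assumptions stated up front: there is no SIGFREEZE, so SIGUSR1 stands in for it (POSIX only), the Leibniz series stands in for a real pi algorithm, and the names leibniz_pi and freeze_requested are invented for illustration. The handler only sets a flag; the main loop does the actual saving at a safe point.

```python
import json
import os
import signal

freeze_requested = False


def _on_freeze(signum, frame):
    """Signal handler: just note the request, do no I/O here."""
    global freeze_requested
    freeze_requested = True


# SIGUSR1 is a stand-in for the proposed SIGFREEZE.
signal.signal(signal.SIGUSR1, _on_freeze)


def leibniz_pi(terms, state_path, hook=None):
    """Approximate pi; return None if frozen before finishing.

    State is reloaded from state_path when that file exists.
    hook, if given, is called with the number of terms done so far.
    """
    global freeze_requested
    state = {"k": 0, "total": 0.0}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    while state["k"] < terms:
        if freeze_requested:
            freeze_requested = False
            with open(state_path, "w") as f:
                json.dump(state, f)
            return None
        k = state["k"]
        state["total"] += (-1.0) ** k / (2 * k + 1)
        state["k"] = k + 1
        if hook is not None:
            hook(state["k"])
    return 4.0 * state["total"]
```

From a shell, `kill -USR1 <pid>` would trigger the save; running the program again with the same state file continues from the saved term.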
That's my take on it...
If you see any problems with it... bring it on.
Yes Francis, the world has gone crazy.
Is there any way to stop an X application, and restart it to a different display?? (i.e. other machine)
DVD Ripping, Divx, VCD, SVCD under Linux
If we would have just stuck with core memory, we wouldn't be having these problems!
Pick One: http://www-rohan.sdsu.edu/~stremler/sigs/sigs.html (Note - disable Javascript first!)
Couldn't it also be possible to hibernate a process, serialize it and send it to another machine with the same architecture to be executed there?
This maybe sounds too crazy, but it should be possible with well designed systems (aka Linux).
Life isn't like a box of chocolates. It's more like a jar of jalapenos. What you do today, might burn your ass tomorrow.
IRIX has the capability to checkpoint and restart just as the original poster is asking for. It can successfully checkpoint and restart very complicated jobs, not just the simple programs that some of the posters have indicated.
There are a number of items which cannot be automatically checkpointed (i.e. open sockets). However, through the use of signals, any application written to cooperate with IRIX's checkpoint/restart will be given an opportunity to gracefully save the portion of its state that the kernel cannot automatically handle.
This is one of those capabilities of big mature UNIXes that is still awaiting implementation on most open-source UNIXes.
Cyrano de Maniac
Windows has supported HIBERNATE for a couple YEARS now. Toshiba was doing it with their laptops even a couple years before that. What's the problem.
Hell, not even a problem:
What's the question? This has been done... AGES ago.
I'm not a prophet or a stone-age man,
I'm just a mortal with potential of a super man.
Actually, I do it all the time on my UNIX laptop (Macintosh PowerBook G4 running MacOS X). I also run a Windows 98 box in emulation (VPC 5.0.2) pretty constantly on the G4, so I have two options: run the process on the emulated box and be able to recover after a shutdown, or run it on the PowerBook and use sleep.
It only is worth it if you expect to have to halt the program more than once. Assuming only one halt and restart, VMware is still slower.
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
A file descriptor is a per-process entity. Yes, there's a big table of file descriptors that exists for the entire system, but file descriptor 5 for process a is not file descriptor 5 for process b. Not even if they point to the same file/pipe. A case in point is FD 0, aka stdin. Every process starts out with a stdin on FD 0.
More important is how do you tell the kernel what file descriptor 5 pointed to? What if the file/pipe doesn't exist any more?
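For regular files, one user-space answer is to record the path, mode and offset at freeze time and reopen at thaw time. This is only a sketch under that assumption (snapshot_file and restore_file are hypothetical helpers); it does nothing for pipes or sockets, and it fails if the file has been removed in the meantime, which is the second question above.

```python
import os


def snapshot_file(f):
    """Capture what is needed to reopen f later: path, mode and offset."""
    return {
        "path": os.path.abspath(f.name),
        "mode": f.mode,
        "offset": f.tell(),
    }


def restore_file(snap):
    """Reopen the file and seek back to the saved offset.

    Raises FileNotFoundError if the file no longer exists.
    """
    mode = snap["mode"]
    if not mode.startswith("r"):
        # Reopening with "w" would truncate the file, so use read/write.
        mode = "r+b" if "b" in mode else "r+"
    f = open(snap["path"], mode)
    f.seek(snap["offset"])
    return f
```

Note that the restored file will usually get a different descriptor number than it had before, so anything that cached the raw number would need fixing up as well.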
Why is it every time there is a hardware problem you guys look for a software solution.
Programmers.
Drop $10,000.00 on a portable generator and the necessary wiring.
Problem solved, with 0 lines of code.
Maintenance Commands sys-suspend(1M)
NAME
sys-suspend - Suspend or shutdown the system and power off
SYNOPSIS
/usr/openwin/bin/sys-suspend [ -fnxh ] [ -d ]
AVAILABILITY
SUNWpmowu
DESCRIPTION
sys-suspend(1M) provides options to suspend or shutdown the
whole system.
A system may be suspended to conserve power or to prepare
the system for transport. The suspend should not be used
when performing any hardware reconfiguration or replacement.
man cpr
man powerd
man power.conf
Sometimes there are advantages of commercial OSs
Hawks
in anima Apparatus
Suspending an entire computer system at any given point is, in principle, possible given that you have thought through how to preserve the entire system state through a reboot and then restore it. Of course, you may also have to suspend and preserve data on other systems too, if you are depending on them. Laptops and Windows can do it fairly reliably for some applications. I think laptops work by getting the applications and OS into a safe and simpler state and then saving that state. I suspect they cannot save any arbitrary application you could write - just the applications they routinely run.
Easier, however, would be to design your application such that it records its state maybe every hour or so. It could write pointers to incoming data, output data, and other important values to a log file. Given that smaller set of information you can resume the application at the last saved state and continue.
Doing that can present a challenge to the design in many cases but I think it would often return your effort when you can stop the machine, debug your code, and continue from the last saved state. You don't have to restart from the beginning all the time...
Above statements are IMHO and your mileage may vary.
My concern with that is this: Let's say something buggy is making the system crash. Then if the persistent OS does its job with perfect accuracy, it's just going to end up re-creating the conditions that caused the crash, and Boom - crash again. The only way to avoid this is to NOT succeed at the goal of re-creating the conditions before the crash.
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
It's quite do-able and has been done a few times. Solaris could do this for the whole machine, SGI IRIX unixen could do this per process, and Crays had this as a key feature a long long time ago. For the first two, the guy who wrote the code used to sit next to me.
While I love VMWare, it does consume a substantial amount of CPU/memory. The problem is a job like what the original poster described is usually CPU or IO bound, and VMWare just starves the process from what it needs even more.
:)
Granted, it is a solution, but your job that ran in 3 days just got pushed out to a week. It's just a tradeoff.
What the poster really needs is to rewrite the program to drop intermediate data along the way. If you have hourly checkpoints you can minimize the amount of data lost. How to implement checkpoints is left as an exercise to the reader
If memory serves me (hey, it is Friday after all and both brain cells are pretty tired) we looked into something like what the poster was asking about years ago. In those days, we were running some simulations on a PDP-11/70 that took 7-10 days to complete. In the event of a general power failure we wouldn't have been able to run on backup power for very long. DEC's RSX had a feature whereby a task could be checkpointed to disk. Then, presumably, it could be reloaded and resumed at the same state it was in at the time of the checkpoint. We never did implement it since it would have introduced too much delay into the project schedule (adding it to the simulation, testing, etc.) but it sounds like the sort of thing that could be useful in current day OSs. Anyone know of any general purpose operating systems today that have this feature? I haven't heard of any and wonder (not too seriously, mind you) if anyone sells core memory for a PC architecture computer. Of course, it wouldn't be very fast but you'd worry a lot less about power failures that are longer than the UPS's ability to provide power.
CUR ALLOC 20195.....5804M
This is why VMware suspend works the way it does. It provides a consistent virtualized hardware interface, regardless of the details of the real hardware. The original question referred to individual process saving, and VMware suspend is similar to the whole OS suspend feature in laptops. Nevertheless, if you consider VMware to be a wrapper for individual processes that you want to be able to checkpoint, it turns out to be quite a nice solution to the original problem with zero programming required, and just a little pocket money to implement.
bb
Back in a previous incarnation, one of the projects that was going on at our place was called HECTOR. I didn't work on it but some of my friends did. It worked on a variety of UN*X flavors. Something similar to what you are talking about was that any processes that ran through it could be suspended, including file handles and sockets, and then be started on another machine (sockets only as long as all the processes connected via sockets were also running over it). It was used primarily with MPICH (but would work independently of MPICH). It was used to load balance a network of workstations and migrated processes around the cluster if node(s) became loaded beyond some threshold. To find links, search on http://www.google.com using "HECTOR RUSS" or "HECTOR ERC" (Dr. Sam Russ was the lead).
Funny, my Win2K laptop can hibernate in the middle of a Winamp track and when I wake it up, it picks up in Winamp exactly where it left off -- before I've even logged in!
Of course, it can only pick up the 802.11 connections half of the time and only know how to use them half of that so I still have to reboot before I can see the world... but that's not the point!
you can always dump core of the process
(e.g., kill -SIGSEGV), then load the core file
into gdb (gdb program corefile) and
issue 'cont'.
The OS state would be gone though (so, no
files besides stdin/stdout), but for purely
computational process that might work as a
one-time shot. At least you could save main
arrays from gdb and read them into a modified
program.
There has been quite a bit of work on process migration. The idea is to transmit the entire state of a process to another computer to continue execution there. If, instead of transmitting the state to another machine, you wrote it to a file, this would do exactly what you are interested in.
Doesn't Windows XP's Hibernate feature do exactly this?
"You are not a beautiful and unique snowflake."...Tyler Durden
This might be completely wrong, but couldn't you use something like vmware and 'suspend virtual machine'.
I'm pretty sure that when you started up your virtual machine, your program would still be running.
I do this all the time on my Solaris Box, I press the power key on the keyboard (sun type 6 keyboard), and the entire system state is paged to disk. Power shuts off..
When I power back on using the power button, it goes through openfirmware, boots the kernel, then restores the system state... paging typically takes about 90 seconds on my Ultra 30 with 1 gig of ram. There are two paging requests so roughly three minutes spent in doing suspend.
If you had checkpointed your calculations, you wouldn't have to redo them from the git-go, now would you?
There has been a utility available on unix for ages called undump that sounds like what you are looking for. It seems like old versions of emacs used to use this to decrease startup time by creating a new executable at the point that all of the initialization was completed. A quick search indicates a copy here.
In the 1980s and maybe much earlier, the NLTSS operating system on Cray supercomputers had this feature to allow stopping/restarting of applications, and it was called checkpointing.
For Windows NT, Lucent had a group that developed Fault-Tolerance software which had a checkpointing feature. This was called SwiFT.
http://www.bell-labs.com/project/swift/
At the same place but under the support, there is some mention of a Unix version.
WhatMeWorry
...and found esky, a purely userspace checkpoint/resume implementation.
I'm wondering: since you can kill a program with SIGABRT or SIGSEGV and get a core dump, would the core dump be enough to restart it again? I know gdb can do this for debugging purposes (although running real code inside gdb to accomplish this end would be quite the inefficient solution). I'm going to play around with options a little bit and see if I can cook something up...
Brian
After tooling with the kernel, I have been told time and time again, if it needs to be in the kernel, then put it in. If not, make it user space.
This process hibernation deal does not need to be in the kernel simply because a program should have the option of sleeping, or what have you, built into the program's construct.
To put in an operating system kludge to support a program's shortcomings is a Microsoft mentality that I would rather not have repeated in Linux.
At the very most, the kernel should give the process the ability to capture all of its relevant data before closing.
Reasons not to implement "random access" process hibernation:
1. File Access:
The operating system would have to guarantee that the file descriptor is stored, and that the referenced file is not unlinked. The alternative is having the program smart enough to realize it was put into hibernation, hence throwing away the advantage of a kernel solution.
2. Mutual Exclusion:
If you have a program that uses OS based mutual exclusion, you will run into several problems.
(a) If the OS does not yank the lock, other programs that share the lock will be stuck forever, which could break like tons of programs. Most shared libraries should use locking to keep the two processes from trouncing on one another during execution in the module, so...
(b) If the OS releases the lock, then when unhibernated, the program could run into serious problems if it thinks it has a lock but in reality does not, or you can run into cyclic lock dependencies and race conditions if the locking code was not written right. These are the same issues involved with preemption.
3. Networking:
Pretty obvious, but if the program and the server do not know about the hibernation, the server should grumble but live on (never trust those clients..), and the client will probably become SOL and defunct. Since there is no notification that the connection was broken, the OS can either send an invalid descriptor (if it isn't stored), or it can be a little smarter and say that the foreign host closed the connection.. This one can be solved, but I think that saving the descriptor and reviving it could be interesting..
I am sure there are things I have not covered (removable media syncs...), but this is too long already. There are a lot of technological factors which would make this very hard for a single kernel fix, but if we tied a unified solution into user space, we could make a slow transition to supporting this.
Bye!
This is yet another suspend alternative. This one is not your thread checkpointing type of solution, but allows for a software suspend with no APM support. I scanned through the messages and did not find a reference to this link, so here it is: http://falcon.sch.bme.hu/~seasons/linux/swsusp.html
I hope it helps.
Except for hardware and driver state synchronization, all you need to do is be able to pause a process, take the code from fork() and copy the process to disk instead of creating a new process, then kill -9 the process. Of course, you will have to iterate in a post-order tree traversal to get the leaf nodes first. I think I could bang this out in Minix in about 20 min; Linux kernel mods would probably take me a day. But you'd have to modify all of the drivers to support "suspend-to-disk" type operation, such as ACPI winbloze type of stuff, because the hardware will have to be reset to a state that matches the software.
The biggest trick the devil pulled was letting lawyers become politicians so they can write the laws.
I remember an option in Solaris 7 that lets you dump memory to swap, shut down the computer and when you restart it reads swap and drops you back into the exact same state as you were in before.
Pretty cool because you could restore to a full X-session with all the programs and documents you were working on before undisturbed.
I don't know if this is what you were looking for. . .
Just change the program to be able to save and load a partial solution onto disk. Problem solved.
From the cpr manpage:
IRIX Checkpoint and Restart (CPR) offers a set of user-transparent software management tools, allowing system administrators, operators, and users with suitable privileges to suspend a job or a set of jobs in mid-execution, and restart them later on. The jobs may be running on a single machine or on an array of networking connected machines. CPR may be used to enhance system availability, provide load and resource control or balancing, and to facilitate simulation or modeling.
There's even an option to restart the process(es) after upgrading the OS.
Some caveats, the following system objects are not checkpoint-safe:
o network socket connections; see socket(2)
o X terminals and X11 client sessions
o special devices such as tape drivers and CDROM
o files opened with setuid credential that cannot be reestablished
o System V semaphores and messages; see semop(2) and msgop(2)
o memory mapped files using the /dev/mmem file; see mmap(2)
o open directories
Of course, you need proprietary SGI hardware.
I think laptops work by getting the applications and OS into a safe and simpler state and then saving that state. I suspect they cannot save any arbitrary application you could write - just the applications they routinely run.
If you've ever used a laptop with this feature you'd realize what you just said is totally wrong.. the hibernate function of these laptops is managed by hardware, not software, and so is OS and program agnostic. When you close the lid or hit the sleep button, it dumps the entire state of the RAM into a special partition and turns off.. when you revive it, you are back exactly where you left off, regardless of whether you are running Windows or Linux, or playing quake3 or cracking rc5 stuff.
By managing this stuff in hardware, it's actually less complicated and works 100% of the time, as opposed to the Windows software solution, which often refuses to 'wake' after being put in sleep mode and is dependent on the power supply being on and supporting the feature.
If someone were to add a feature like this to a large multiuser mainframe type system, it would definitely make more sense to go with a hardware based solution that dumped the system state to a disk or multiple disks to ensure that it always worked and not just some of the time for some of the apps.
MOSIX
Off topic? You've got to be kidding. Flamebait is arguable, but off topic?
While the proprietary guys have been slow to get many of the Open Source Tools - Sun is just introducing gzip, apache in Solaris 2.8; ssh in Solaris 2.9 - they often ARE focussing on things that are dreams in the Open Source only world.
Convergence should benefit both sides.
At first this seems like a nice idea. It would be elegant to be able to halt processes and resume them later without them consuming resources in the interim.
Before going forward, ask yourself what the practical application of this work could be. If you have to reboot systems with long-running computational work going on, you may need more reliable hardware or better management of the system to increase uptime. Furthermore, adding "suspend/resume" functionality to a single process within its own code would probably be far better where needed.
Secondly, think of the concerns you face in implementing this as a generalized solution for user processes. Here are the problems with this concept that I can see.
First, file handles, file system pointers, and network connections may not exist when the process is restarted. Let's say that there is processing of NFS data being done and when the process is resumed that mount is no longer accessible. You get an error from NFS like EIO or the like and the process dies.
Secondly, the hardware may no longer be available. What if the process was using a PCMCIA card which has been removed? The process dies. In a simpler case, a process could have a tty open for I/O and that tty may no longer be owned by the user when the process is restarted.
This requires saving a lot of system state and does little to guarantee that the process can be restarted successfully and safely. Furthermore, the dependencies for a single process (being fairly complex) would require a good knowledge of the process by the user to determine the feasibility of suspending and resuming the process.
It seems that this would not be accessible to average users of the system, even if it were possible to create in a generic sense.
It does stand as a good question to start someone thinking about unix internals though.
...of the fitness!
I love it. I also like living life to the fullness.
Something many people unfamiliar with J2EE (Java 2 Enterprise Edition) don't know is that when you have an application running in a Java container, it and the state of all its processes get automatically saved and restored whenever the container, the OS, or the machine crashes. True, in practice some diligence is required from the programmer (for example, when you need to set objects to a specific state upon re-instantiation), but the functionality is there, is OS-independent, and has been proven and used daily in heavy-duty environments for a few years now.
Free as in beer? If it's not free as in energy it's probably not worth lookin at.
;-)
Yeah, not really relevant to the main topic, but many modern PCs do have suspend support built into them, so the no-additional-software thing is a pretty moot point.
Hibernation IS a software thing, and it just means that when the OS receives or generates a shutdown-hibernate event, the OS writes all available memory and state to disk and shuts down, setting a flag so that on the next boot it knows it was hibernated.
Bye!
This would be great in environments such as universities that still sell time on Crays or similar systems. Let's say you have a simulation that will take 3 days of non-stop computing to complete. Well, if you were able to lease 2 hours a day for one month, you might get it finished. Or if you were able to get time on multiple machines you could transfer a frozen image over to that system. It doesn't sound that hard to implement, just so long as it is well tested to prevent corruption.
We used to have a feature like this in an old HP3000 (model 70?) minicomputer back in the early 1990's. We ran a text-mode accounting system and all users had HP2932 dumb ascii terminals connected via RS232 serial lines. Whenever the machine would encounter a power failure (we had no ups), after the system was restarted, all the users' sessions and the programs they were running would come back to life at the point where they left off. You just had to hit the enter key and all the applications' screens would repaint themselves. You might have lost a few fields worth of data on the immediate screen you were entering data upon, but that was all you lost when the system went down. This type of powerfail recovery was really a nice feature. Sometimes when the O/S (MPE) would crash, we could also recover users' sessions, but usually a hard O/S crash meant that the recoverability info also got corrupted and everybody lost their sessions completely.
With IRIX you can perform CPR and the process will miraculously revive!
http://techpubs.sgi.com/library/dynaweb_bin/ebt-bin/0650/nph-infosrch.cgi/infosrchtpl/SGI_Admin/CPR_OG/%40InfoSearch__BookTextView/110
"Are you an Anonymous Coward or just too lazy to register?"
-- Anonymous Coward
...why not just boot up classic at startup? My brother set his computer to do this, you can too if you don't want to wait.
SGI's IRIX has had this for a while. Ironically enough it's called CPR.
IRIX Checkpoint and Restart (CPR) is a facility for saving the state of running processes, and for later resuming execution where the checkpoint occurred.
See the IRIX Checkpoint and Restart Operation Guide for more details.
OK, Slashdot, let's cut down on the 'feature creep'. Download managers and web-page mirroring software have been doing this for years.
1. Write current state to log at say 15 min. intervals.
2. Continue whenever you want from current log state.
Easy.
Ok, perhaps it is not possible in this instance, but for the vast majority of systems, you can store data as you are going through. If you KNOW the process is going to take ages, it should be implemented in code.
Bad coder, no cookie.
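The two steps above might look roughly like this in Python. This is only a sketch under assumptions of my own: the job (summing integers), the JSON file layout, and the names save_checkpoint, load_checkpoint and run are all made up for illustration, and stop_after just simulates the power going out between checkpoints.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over path in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on POSIX: a crash leaves the old or the new file.
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, limit, interval=900.0, stop_after=None):
    """Sum the integers below limit, checkpointing every interval seconds.

    stop_after (hypothetical, for demonstration) abandons the run after that
    many steps without saving, the way a power cut would.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    last_save = time.monotonic()
    steps = 0
    while state["i"] < limit:
        state["total"] += state["i"]
        state["i"] += 1
        steps += 1
        if time.monotonic() - last_save >= interval:
            save_checkpoint(path, state)
            last_save = time.monotonic()
        if stop_after is not None and steps >= stop_after:
            return None
    save_checkpoint(path, state)
    return state["total"]
```

Running run("job.ckpt", 10**9) again after an interruption picks up from the last saved state, so at most one interval of work is lost. Writing to a temp file and renaming means an outage in the middle of a save can't leave a half-written checkpoint behind.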
If you wanted to save developers the work, you'd have to do this at the hardware level like a notebook does. That way the entire system, OS and all, gets dumped to disk.
If you wanted only specific apps to do it, then the developer of the app would have to handle that...I would think....
That would be sweet - when the load on my home machine gets too high, I just freeze some process, send it over to another machine and finish running it...and eventually have the OS work up to just picking up cycles automatically.
Basically that was one of the ideas behind the research on micro-kernels. If the state of the system gets small and centralized enough, one could make not only a single process persistent but the full system persistent.
KeyKOS was a very promising system offering this at the time. One could not checkpoint the connections outside of the machine, but their demo was a BSD machine with X11, whose power plug was violently removed. When replugged, the state of all processes saved at the last checkpoint was resumed and the system would continue ... Including X-Windows !!!!
Now wait for the Patent to expire, put it in Linux and watch the world of computing change.
It was very promising at the time I was doing my PhD 10 years ago, I don't know why this never "made it"
Daniel
BAXterrr...
come on, thats a complete load of bull
Other cool features of Sprite included a log-structured file system (yeah, everybody has one now, but they didn't 10+ years ago) and RAID.
Why don't standard power supplies come with a small built-in UPS?
I lived at a place for a short time where the fuses blew now and then. It only took me a few seconds to fix it, but it caused loss of some data and an unclean crash.
I only need, say, a minute and a half. When the UPS loses power from the mains, it waits a few seconds to see if it was transient, then alerts the user (the user probably knows, as the lights may have gone out). If power isn't restored within one minute, the system will begin to take down and save large processes. When, finally, the "I'm dying" signal comes, the system will do a clean shutdown.
While I know of a few internal UPSes, some of which seem neat, I only know of one such unit, the Amsdell IPPS, but that company seems rather dead. I know they still exist, but I'm not sure they will ever power an Athlon. Also, it comes with short cables, I need some long ones for my cabinet, and you need to take a wire out of the cabinet. I looked at my mobo, and I have an SMBus; perhaps that could be used for this purpose?
I mean, this should be widespread...
I guess the answer to my question is that most people are so used to crashes, a crash because of power loss isn't such an issue... :-)
Employee of Inrupt, Project Release Manager and Community Manager for Solid
Check out http://www.eros-os.org.
EROS processes persist until you take them down. They persist across power loss, system upgrades, etc, etc.
-jcr
The only title of honor that a tyrant can grant is "Enemy of the State."
When the UPS daemon senses that it's time to shut down, it sends all processes a SIG to warn them. This gives each process a chance to clean up, save state, and exit. Your program just needs to respond in the appropriate way to the SIG your UPS daemon already posts, so it can resume where it left off next time it's started. Doing this on an OS-wide basis, I think, would be overkill.
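Something like the following is what "respond in the appropriate way" could mean for a Python program. It's a sketch, not anybody's actual UPS setup: I'm assuming the warning arrives as SIGTERM (check what your UPS daemon really sends), and the file layout, the run function and its work hook are hypothetical.

```python
import json
import os
import signal

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag; the main loop does the saving."""
    global stop_requested
    stop_requested = True


def run(path, limit, work=lambda i: None):
    """Do `limit` units of work, resuming from path if a saved state exists.

    Returns True when finished, False if a stop was requested and the
    current position was saved instead.
    """
    global stop_requested
    stop_requested = False
    # Assumption: the shutdown warning is SIGTERM. Must run in the main thread.
    signal.signal(signal.SIGTERM, request_stop)

    i = 0
    if os.path.exists(path):
        with open(path) as f:
            i = json.load(f)["i"]

    while i < limit:
        i += 1
        work(i)  # one unit of the real computation
        if stop_requested:
            with open(path, "w") as f:
                json.dump({"i": i}, f)
            return False
    return True
```

The handler does nothing but set a flag, so the state is always written from a known point in the loop rather than from the middle of a half-finished step.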
Sun already implements a system suspend/unsuspend in Solaris that works on all boxes but the Blade 100's.
:-)
10 years ago I worked on a Unisys Unix box that did it automatically, meaning you could pull the power out of the wall without any warning and then plug it back in later. When the system rebooted, it would say "there's been a power failure, recovering" and then put all the processes back to the way they were before. Even with an open vi session where I was actively typing, I wouldn't lose more than a character or two.
I found out the machine had it quite by accident because my loser boss turned the box off one evening without doing a proper shutdown... Once I saw what it did, this required further testing
Still, what would be even better is if it could be done on a per process basis. I can think of many reason why you might want to suspend a process for a few days and bring it back later (say something you only wanted to run outside of work hours), but had no intention of shutting the whole box down. And this should be implemented in the kernel, not hacking each program to provide this functionality.
My sun Ultra 1 (solaris 2.5) has been doing this since the day it was new.
What if the process has forked off a bunch of children? Are you going to archive all the children at the same time? What if the process has a whole bunch of files in /tmp, are you going to roll them up into the freeze state as well? What if you're using pthreads? Are you going to keep the state for each thread? How about file pointers?
Back in the 80s, Cray UNICOS had a cadillac checkpoint package. It could track child procs, save /tmp files, save threads, save pipe data, and pass down SIGCKPT for user-controlled checkpoint.
Of course at $1000/hour you want to damn sure be able to save your work :-)
Windows already has this "feature" (for better or for worse). Of course, if/when it is put into Linux, it will be innovation?
The suspend feature in VMware can just suspend the entire system. The performance hit is usually not too bad for the added features like undoable disks and suspending. This is really helpful when you have a buggy laptop whose suspend freezes half the time.
On a Palm you can shut it off and when you turn it on it is where you left the device. I think it would be neat if operating systems worked this way too. Ideally one would be able to turn off the computer in the middle of an app and it would turn on at the same place it was left.
Of course the palm does not do multitasking, multiprocessing or anything like that and when you close an app it is usually sent back to its initial state.
Maybe the way to do what this user wants is to take journaling a step further: rather than a journaling file system, have a database file system where work is done in commits, like a JFS. Then one could do rollbacks as well. This would require the whole system to be rethought, though.
Only 'flamers' flame!
Couldn't agree more with the article and shawarma. It should be made possible on unix and unix clones. Better still would be, if the running process could be migrated to a different machine with total preservation of its current context, to run uninterrupted.
Although not quite related, in Stackless Python, there is work being done on allowing function invocations to persist in secondary storage and get called again in a later program invocations. They call it "pickling" a function call. In theory, one could pickle the main() function call at some point in its execution and achieve the same effect as suspending the process and recording the state of the program.
If your program dies for any reason, just before computing the answer, the answer was most likely to be 42.
This is the Adams Constant.
;)
When you first start Classic it boots up MacOS 9. However, after a specified period of time of not being used it goes to sleep and consumes 0% CPU and pages all the memory - it's just like turning it off. Just run an OS9 app and it comes back to life.
You don't want it to always start from a saved state because OS9 just isn't reliable enough.
Because you never have to turn off OSX (sleep mode works great) you should never have to launch Classic more than once.
Perhaps some sort of wrapper could be developed that you could run most simple apps in that could be suspended...
Certainly one would think that apps written in languages like Java that are already set to run in a sandbox would be fairly easy to wrap and suspend... as opposed to things like X windows or things that work too closely with the hardware.
The simple way is to simply buy a small generator.
Let sleeping processes lie
MS 2k/xp os's have done it, why can't it be done on UN*X?
Ahem, this should be -1 redundant... this was covered 900 posts ago and is marked a +5 already.
It would help so much in debugging if you could save the current state of a process so you could try "debugging from here" over and over.
Rocky J. Squirrel
My HP Omnibook saves to disk when it runs out of battery. When I switch the laptop back on, it first restores the RAM for a few seconds and then it freezes ;-(
It only restores if the batteries run out in Windows.
So, obviously it is not very good implementation.
What you are really looking for here is twofold: you want to be able to suspend/recover more than one program associated with a particular task (let's say a program with its associated database tasks) AND you need to coordinate the suspension/recovery.
So what you need is a coordinated core system that cores out the processes you tell it to (along with child processes) at a single specific time. The coordinated core system would then be able to restart the processes at a later date.
Wow! That would be cool. How many times have I wanted to freeze a set of processes, reboot a box, and then start them up where they left off. It would be a great debugging tool!
OK, feeling a little giddy.... deep breaths....
For those of you interested, I'm part of a group developing checkpoint/restart for Linux. We're fairly early off in the project, but we're going to be adding this feature to Linux fairly soon. (Hoping to have a patch/module release out in May.)
We're putting two features in: Checkpoint/Restart and Suspend/Resume. Checkpoint/Restart allows you to save a running session or process to disk, and restart it sometime later, on a different node, or after a system reboot. Suspend/Resume does more or less the same thing, but keeps the process data structures in the kernel, without writing them to disk. S/Resume won't work through a reboot, but it's useful for certain applications. You can think of it as a combination of swapping the process to disk and hitting ^Z to nab the process.
We're putting in some signalling mechanisms, to allow the process to catch the checkpoint, restart and continue signals. We're also going in and adding some code to capture data in pipes and FIFOs. It'll work with multi-threaded processes, and full UNIX sessions (so you could checkpoint, say, a login shell and e-mail it to all of your friends. :)
Our checkpoint/restart is meant for scientific applications, but should work on just about anything else. We're going to spend this summer hanging out with the LAM crew to make it work with MPI applications correctly.
For those of you looking for something to download, I'm sorry I can't post a working link right now, or any code. We just got past our requirements document, and we're putting the design spec's together now. The req's doc't is due to be published next month, an implementation survey is coming out in March. If you're interested in having a look at those, drop me a line, and I'll let you know when they're available.
- ERoman at (no spam) lbl dot gov
A good place to start is the technical report from the UW's Condor project here
Hasn't Windows 2000 had this ability since it was released (hibernate)? GASP! Could it be that Linux is missing a feature that Windows has?! Ack! Quick, someone hack the kernel before we are all assimilated by M$!!
*crude laughter heard in the background*
just swap the process out, then write the pages from the swapfile to a regular file. Like the shop manuals say, "assembly is the reverse of disassembly"...
Just junk food for thought...
A lot of people have said that some sort of VM would be ideal for this (VMWare, JVM, etc.). What about User-Mode Linux? Would it be feasible to either add checkpointing to the UML patches, or to load/unload UML in a frozen state?
The more cumbersome Classic OS 9 feels, the more it drives home Apple's point of getting developers to OS X.
If users feel the pain, too, they'll bitch about what a pain Classic is and how everything is cool in OS X.
I bet Apple wants to avoid a prolonged two-system situation, for example by NOT making it nice and comfy.
Judge_Fire
Couldn't we extend this idea further to allow, say, a kernel update on a running system without rebooting? Preserving those precious uptimes everyone drools over? Just a random thought I had while drawing boring network diagrams and thinking of more fun things......
------------
Human Stupidity + Computers = IT
AT&T's SVR4 Powerfail Recovery mode does it well. You can set the Powerfail strategy to either shutdown or recovery using the 'strategy' command or by setting the STRATEGY variable manually in /etc/default/dump. In recovery mode, memory is saved to the dump slice and when the power comes back on, it is restored and continues where it left off. Simple as that. Network connections DO suffer, obviously, but even an active Informix engine continued running after such an outage. Alas, NCR's roadmap is going to kill SVR4 3.02.01 in a short while :(
Many people in the parallel discrete event simulation community do this, sometimes using compiler-assisted tools.
I had a similar thing happen last year. The power went down while I was running something on one of my computers. Not being a techie, I chose the simple option. I had only two of my computers running; all of them have UPSs of one brand or another. I unplugged the important backup and plugged it into the one next to it. I daisy-chained all five backups, and it worked. When the power came back on two hours later, the daisy-chained backups were still running.
Kind of like how the earth was destroyed by the Vogons 15 minutes before the ultimate question would have been produced? If only the mice had compiled the world with the SavePlanetState() function . . .
The previous has been a secret message to my comrades.
Step 1: Clean, fresh install of XP Pro corporate.
Step 2: The requisite reboots until everything works.
Step 3: Leave the office, set computer to hibernate for fun.
I. Results
A. Blue screen of death upon return to office.
B. Reboot yielded '/windows/config file is missing or corrupt'.
C. Much cursing and a swearing off of anything Microsoft.
XP isn't as wonderful as people would have you believe. A short trip to google inquiring about repairing this mess will result in endless posts.
I can't believe no one has mentioned this yet. User mode Linux is a Linux kernel that runs as a process under Linux. You can run many copies of Linux on a single machine. With the checkpoint (or whatever you want to call it) system we're talking about here, you can save the entire state of a system. Very nice.
A couple of years ago, I interviewed with a company called Ejasent (formerly Apera) that had a modified Solaris kernel which allowed them to freeze processes, and then thaw them very quickly (low milliseconds). Their goal was to build an edge server network allowing very quick bring up of common apps (like "start Oracle and look for this book") for browser-based clients. I met with them again a couple of months ago, and they seem to have made a lot of progress, and the technology is now sanctioned by Sun. http://www.ejasent.com
...is a lisp/scheme interpreter making a dump of its state which can later be loaded or executed. I think the scm interpreter uses the code from emacs (which makes sense as it basically is an OS in lisp (vi forever! =)), but the idea goes back quite a long way.
Has any project solved the Holy Grail of distributed computing: migrating a process WITH ACTIVE SOCKET CONNECTIONS to another machine? Clearly this problem would need the assistance of an external router, but it could be done.
And if those mice were so smart how come they didn't think about it? Even I know that hardware fails.
make Linux, not Microsoft. sin(beast) = -0.809016994374947424102293417182819
If you're doing a long term processing job, it makes sense to store results incrementally. No big black magic here.
This hibernation mode snapshot can be duplicated or even put on other machines in the event of a system failure. The virtual machine will then come back online like nothing ever happened, with hardware devices effectively still attached and processes still running.
It works really slick, you can perform other tasks and come back to your virtual machine later without slow boot times. This will also work on Linux, Solaris, and Windows platforms. I'd highly recommend VMware for on-demand OS access.
-Pat
42!
This Sig has been depreciated.
ok, so we know it's easy to have a process write all the contents of its memory space to disk (dereference an invalid pointer in C, and the os will do it for you), so suspending isn't a problem. the problem, which is pointed out in several places in this thread, is starting the process back up again, because it won't start in the same memory space, pointers will be invalid, etc. so, would it be possible to have a program that would resurrect a suspended program, and spoof it to make it think it's in the same memory space it was in before it was suspended? kind of an emergency vm of sorts. does this already exist? if not, why?
is it just me or does it seem like slashdot is being used like a search engine.
"Hrm, I want to know about blah, but I'm too lazy to search on google, so I'm going to ask slashdot cuz it gives me a hard-on to get a post on the front page."
The flow of redundant and pointless questions has ruined slashdot.
Half of the questions on slashdot could be answered by someone who paid even 5 minutes of attention in a 300 level CS course or has a browser and 5 minutes to search!
SGI IRIX has had this feature (checkpoint/restart) for ages. Precisely because it's used in scientific/numerical computations (because of FP performance) and these computations run for ages.
Then why didn't you and some colleagues lift the server and the UPS together (without switching it off), put it in a car, and drive to the nearest place with electricity before the UPS ran out?
Or if the server was too heavy/big to move, then run to the hardware store and buy a generator
Other answer:
If it's a linux machine, then build a mosix cluster, then you can migrate the process to a system that still has power (assuming not all systems in the cluster lost power...)
--- Hindsight is 20/20, but walking backwards is not the answer.
Windows has a hibernate mode: it dumps memory to disk and reloads it afterwards.
That's what you need. Don't waste time implementing this in unix when you can spend 100$ for windows. I mean computer programmer time is more costly than the license for winbloze
--
"I'm not sure exactly what an AS/400 is, however, I'm pretty certain I wouldn't want one up my ass"
There is some great work being done with process migration across heterogeneous machines using checkpointing techniques. If a process is written in a standard language common on many platforms, like C or C++, it's actually quite easy to save the running process data in a text file and then start the same program on another machine. (Even floating-point data can be saved this way, preventing any of the architecture blunders that occur when saving data in binary; i.e. big-endian/little-endian.) There are plenty of libraries already out there that do this, but few of them save the data with different platforms in mind. Being able to do this type of process migration is great when working with architectures across the internet, or any other heterogeneous network environment. There's some work being done at Arizona State University under the direction of Dr Rida Bazzi to make this automatic across a network. That is, when a process fails on one machine, another machine of a different architecture is able to execute the process from the last checkpoint. There's a rough paper at http://www.public.asu.edu/~vidar/fault-tolerance-checkpointing.html that briefly describes checkpointing for fault tolerance.
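To make the "save it as text, floating point included" point concrete, here is a small Python sketch. It is not the ASU project's format, just an illustration I made up: state is a flat dict of ints and floats, and the dump_state / load_state names and the tagged JSON layout are assumptions. Floats go through float.hex(), which writes the exact bits as text, so nothing depends on byte order or on decimal rounding.

```python
import json


def dump_state(state):
    """Encode a dict of ints and floats as architecture-neutral text."""
    encoded = {}
    for name, value in state.items():
        if isinstance(value, float):
            # Hex notation round-trips exactly, e.g. 0.1 -> 0x1.999999999999ap-4
            encoded[name] = {"float": value.hex()}
        else:
            encoded[name] = {"int": value}
    return json.dumps(encoded, sort_keys=True)


def load_state(text):
    """Rebuild the dict written by dump_state, on any machine."""
    decoded = {}
    for name, tagged in json.loads(text).items():
        if "float" in tagged:
            decoded[name] = float.fromhex(tagged["float"])
        else:
            decoded[name] = tagged["int"]
    return decoded
```

The text is bigger than a binary dump, but a checkpoint written on a big-endian box reads back identically on a little-endian one.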
" If you could sleep processes "
Yes you can, just send it signal 19 (SIGSTOP), and wake it up with signal 18 (SIGCONT); those are the numbers on Linux/x86:
killall -19 myprogram
killall -18 myprogram
When you press ctrl-z on a terminal process, much the same thing happens (try for example 'find /', then press ctrl-z: you will get the prompt back and 'find' will be sleeping; type 'fg' to wake up find).
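The same stop/continue trick from a script, as a rough Python sketch for POSIX systems. The pause, resume and demo names are mine, and the child is just a throwaway sleeper. It uses the signal names rather than 19 and 18, since the numbers differ between platforms. Keep in mind that ctrl-z actually sends SIGTSTP, which a program can catch, and that a stopped process still lives in RAM and swap, so this alone won't survive a power cut.

```python
import os
import signal
import subprocess
import sys


def pause(pid):
    """Freeze a process; SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Let a stopped process carry on where it left off."""
    os.kill(pid, signal.SIGCONT)


def demo():
    """Stop and continue a child process, reporting what the kernel saw."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        pause(child.pid)
        # WUNTRACED makes waitpid report a stopped (not exited) child.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)

        resume(child.pid)
        # WCONTINUED makes waitpid report that the child was resumed.
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        continued = os.WIFCONTINUED(status)
        return stopped, continued
    finally:
        child.kill()
        child.wait()
```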
--- Hindsight is 20/20, but walking backwards is not the answer.
There was once a project called Condor.
From an American university. Wisc.edu?
It uses a patched standard C library. It preserves everything, including file handles. It was primarily intended to let processes migrate from machine to machine in a LAN.
I set it up on my campus 10 years ago? Ah, I think it was 1993. It ran pretty well on Sun OS 4.3 and DEC Ultrix.
It tried to find an idle machine and migrated the process to that one. If the machine got loaded, it suspended and froze the process. If there was another machine, it migrated there; otherwise it slept until a machine became available.
Regards,
angel'o'sphere
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Python supports a concept that it calls 'pickling' (which is also known as Object Serialization).
It's extremely easy to save the state of any object along with the objects it references to disk with literally a couple of lines of code (like, 3). You cannot pickle whole processes, but it's effortless to write some skeleton code to resume the process from its last pickle. You can also define specific methods in each object that are called on pickle/unpickle for special cases (restoring network connections, for example).
The fact that it's an interpreted language shouldn't deter you. Python integrates easily with modules compiled from C, allowing you to accelerate time critical aspects of your code while rapidly developing the not so critical aspects.** Python was designed to solve the problems you're working on.
Oh, and if you're short on time, don't worry; Python is extremely easy to learn.
** As most programmers have found, about 90% of their program's execution is spent in 5% of their code.
I just hibernate and system state is written to disk.
I remember working on an NCR tower back in 1988-1990. I used to set up the system for the salespeople to demo at a trade show, or a trainer to use in a class. I even logged in all the terminals and had them sitting in different screens. Then I pulled the plug out of the wall and shipped it across the country. A few days later, someone would plug it back in, connect all the terminals and power them up. If you hit a key to refresh the screen on the terminal, it would all be back exactly as I left it.
I know network connections, etc. would not resume the same way, but if modern systems could do this, think of what a great feature that would be.
Incidentally, if the system ever locked up, there was a reset switch you could poke with a sharp object, though I don't recall ever having to do that.
If you can't spell downtime, use VMS.
Checkpointing is built in to VMS. Here's a reference :
VMS High Availability
Compaq OpenVMS provides integrated and distributed batch processing.
Batch processing permits non-time-critical applications to be scheduled
in the background and processed on any of specified sets of available
systems. OpenVMS also provides for batch restart -- permitting batch jobs
to checkpoint application data and automatically restart after a system
shutdown or failure. This gives you a simple way to schedule your
non-priority tasks to gather available resources across a collection of
nodes, or to schedule high-priority tasks transparently and
automatically, without regard for which specific nodes are available
when the job runs.
I thought that this is what this article was about:
http://slashdot.org/article.pl?sid=02/01/17/1823233&mode=nested
And to be more specific, this link:
http://www.muropaketti.com/artikkelit/cpu/northwood2200/ln2/
Sounds like computer Cryogenics to me!
"The only way to learn a new programming language is by writing programs in it." - Brian Kernighan
This sounds like a kernel task, and I imagine it would re-use aspects of the sleep state. I could see this as simply a hack in the kernel that would accept a particular signal and place the process into a sleep mode, but rather than writing to swap, it would write to a special partition. This sounds reasonable, but I have not played much with the kernel and have only written a few programs that use signals. This signal could be triggered by nut or some other UPS monitoring device, or by syslog, or perhaps just set to run fairly often (like a journaling FS, except for physical memory). Hmmm.. maybe I am just blathering..
Done this.
I just thawed my old laptop (HP4150), and guess what, it was running Debian, had my last irc session (oh, and it was immediately on the network as if it never left), on the windowmanager wmnet thing I can see my last net connection slowly going by, as if it was just a min ago that I last used it, the movie I was playing on it (heh, Crouching Tiger) was still playing in mplayer.. it was cool.
OH, I put it in hibernate sometime around summer of 2001!!
This is what i love about laptops.
no text needed
The problem with auto loading Classic is that it can only be set to start up on log in, as the Classic environment runs as an application too, so it is shut down when you log out. You can set Classic to sleep after a period of inactivity though (check the "Advanced options" in Classic prefs).
My suggestion though, would be to try and replace the Classic apps you use with OSX alternatives; the number of Carbon and (better) Cocoa apps increases every day. I hardly boot Classic any more, let alone restart into OS9.
GNU Emacs basically does this to reduce initialization times.
I heard about this. But, my dear boy, I do believe that VI does this better and with more cryptic keyboard commands.
The complicated part of getting this to work on a per process basis has to do with maintaining a process' notions about parentage, signaling relationships etc.
In other words, simply writing a core file and restoring it isn't enough. If the process had any children, or if any other process on the system "cared" about your PID, then those relationships would have to be restored as well. For some very simple programs it might not be necessary, but to generalize it must be done. These same issues come up when talking about process migration and close coupled clustering (like Compaq/DEC's SSI -- single system image work).
Basically, when you wanted to turn off the system, you would hit the Suspend button on the control panel (or an appropriate set of hotkeys). Your RAM image would then be flushed to a special disk partition, and the system would be powered down. At the next power on time, this partition was checked to determine if it was valid (I forget what the cookies were), and if it was, the boot loader would simply copy the contents of that partition to RAM.
It worked pretty well for me - I went months at a time of several power-on/power-off cycles per day without a real reboot.
Racing is an addiction that makes heroin look like a vague hankering for something crunchy.
Windows ME does this using the "Hibernate" feature.
I've already done this feat with Linux in the past. I run it under VMWare and hit the suspend button. Works like a charm.
EROS' predecessor, KeyKOS, made waves at USENIX when they did a demo of a UNIX system + X Windows which would instantly restore the running state of all software when rebooted. It was basically a UNIX port to KeyKOS, and since everything in KeyKOS was persistent, so was everything in the UNIX.
One interesting caveat with this type of OS is that you really need to use ECC memory, because bit errors can get saved to disk and propagated forever!
As you can see, freezing and thawing UNIX processes could get quite nightmarish if you account for all of the possibilities. (Most processes don't use SysV IPC, for instance.) Even the most (seemingly) trivial of syscalls would need to be modified (all socket functions, for instance).
Note that it's a lot easier to freeze and thaw a virtual machine, because it's so much more self-contained -- all you need to save then is:
The only way the typical /.er can pick up a chick is with a forklift. -- AC
was 42. The question is suspected to be 6 X 9.
Damn Vogons.
MacOnLinux has the same feature, for those of us not in Intel land.
Perhaps, but couldn't that file be redistributed? And thus Apple would be violating its own Intellectual Property Rights, and thus would enter a death spiral, as the DMCA forced Apple's lawyers to sue each other, over and over again, until all corporate assets have been converted into lawyers' fees.
The current Slashdot moderation system is made by gay communists!
But seriously....
The current Slashdot moderation system is made by gay communists!
To be fair, it's only 98% of what you need. The other 1.8% you can get by poking around in /proc/{pid} and reading the state of open file descriptors.
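For the /proc part, something like this sketch would do on Linux (open_files is a name invented here; it relies on /proc/{pid}/fd and /proc/{pid}/fdinfo, so it is Linux only and needs permission to look at the target process). It records where each descriptor points and its current file offset, which is what you would need to reopen and seek the files on restore.

```python
import os

def open_files(pid):
    """Return {fd: {"target": path, "pos": offset}} for a Linux process."""
    result = {}
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to whatever the descriptor refers to.
            target = os.readlink(os.path.join(fd_dir, name))
            # fdinfo holds "key:<tab>value" lines, including the file offset.
            with open(f"/proc/{pid}/fdinfo/{name}") as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
        except OSError:
            continue  # descriptor was closed while we were looking
        result[int(name)] = {"target": target, "pos": int(fields["pos"])}
    return result
```

Sockets and pipes show up with targets like socket:[12345], which is exactly the part that cannot simply be reopened.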
Golly. And people wonder why the bubble finally burst.
I think that the guys talking about Windows [any flavor] or Mac OS X miss the point. Whole-system freeze/restore might be fine on small user boxes/laptops, but it is quite unlikely that larger clusters should do that. It would be nice if the User [!!!] could set this in the shell/environment on a per-process basis, so that in case the machine(s) go down his/her process state gets saved. Imagine a parallel program running on 10 nodes, and one node has to be replaced [because its network card is faulty or somesuch].
I would assume that this is possible given that it works for the whole OS.
Peter
here
Howdy, with RAM prices being so low... why don't we have hard drives with 128+ MiB of cache, their own batteries and a bit of a processor on board that will ensure the data is written even when the rest of the computer is turned off?
Please use [ informative / summarizing ] SUBJECT LINES
Flame me here
Suse on my Dell Inspiron 7.5K used to work with the suspend key, but no longer (X just hangs).
But ancient software is involved.
That said, rather than hibernation I'd prefer a software-UPS or time-rollback widget. How viable would it be to keep a very high frequency incremental save of state (even just the contents of a limited number of folders would be useful)?
It would be useful to be able to send your machine backwards in time without requiring everything to be in a database or versioning system that requires explicit saves. I'd like to be able to remove the effects of every command in the history of all shells in reverse, in the right order, and have high-granularity access to previous states of a filesystem.
If I could do that for all the relevant accounts on various machines it would be like never having to worry. I could leave the desk when I want to, kick the power cord or make meatheaded mistakes, and could keep a less paranoid number of full backups. I'd be worried about the life of my hard disk though. Does this already exist?
I'm sure most (frustrated :-)) IBM Aptiva owners are familiar with this concept. 'Back in the days' it was known as 'Rapid Resume', where you could just push the power button - and the pc would power off completely, and then when you turned it on again, it would be where you left it.
Ahhh, trusty old 2144-z30
from system crashes). You just write a program in their language and presto, you have a persistent application.
hibase.cs.hut.fi
Current status is not very advanced, but...
The main advantage of EROS is that it was designed with provability and reasoning in mind. I was fascinated by the SPIN project, where you could put trusted components in kernel space. For that, they have to be written in a way that lets the compiler prove their innocence, and the Modula-3 type system ensures that. If you look at the EROS papers you'll see that the authors look at the OS from an OS-as-programming-language point of view, which is very similar to SPIN.
Take a look at EROS, delight (erect, excite) your mind. ;)
The clever workaround to this that I used was to scavenge all the UPSes I could, and then plug the UPS of the critical machine into other UPSes, until the power came back on. It took 5 of them to do the job, but the machine never lost power.
Of course, what you really want is "checkpoint and restart", which is about as old as computing itself.
-- Terry
1) Install TeX to get "undump".
2) man gcore
3) man undump
-- Terry
What I'd like would really be one step further in the chain. Something like my Palm or the old Canon Cat. Turn it off, come back a week, month or year later and voila. You are right back at the same point you left, as if you never turned it off. The basics as I see it would be that RAM gets written to swap as an image (which is what the Canon Cat did). Then when you restart the box by turning it on, RAM gets re-initialized from the swap file back to the state it was in before power off. The other option would mean adding a small battery pack to a desktop. If you hit the power button on a box or pull the tail from the wall, RAM is maintained by the battery until you re-power the box (or the battery finally goes south). As I see it there shouldn't be any reason why a box, once run through startup, shouldn't be able to maintain its running state almost indefinitely. In fact if you could get Linux to do this one thing..... it would be on desktops so fast you wouldn't believe it. Unless you change hardware, what is the difference that occurs that requires the full init sequence anyway? The Green Peacers would love it because people wouldn't mind turning off their comp since it's "instantly on". The only down side would be that you wouldn't want to stay logged in, but then what's the diff between being logged in with the monitor off and being logged in with an instant-on feature? Course it would mean uptimes in years instead of days.....
I'm sorry, I'm to tired to be witty at the moment so this message will have to do.
If you use the ext3 FS, and a few other Journaling filesystems under Linux, it can be configured to journal data as well as metadata. This may also work under reiserfs if cache is closed down. Ext3, however, does not require this and works fine for that. Power outages, or red-button reboots are therefore no problem. The OS simply picks up where it left off upon reboot.
Rien n'est plus beau que le creux du 0.
Take a look at the linux-patch "swsusp", it might be something like what you're looking for.
http://folk.sourceforge.net/. that's it
Or what would speak against suspending the process with kill -STOP and reviving it later with kill -CONT ? You can keep connections alive for ca. 360 seconds. :( And you just can't reboot in between.. but hey it's unix, not windows, you don't have to :D
I think some *nix flavors will also have the kernel keep
the connection of suspended processes alive (right?).
A small caveat is that you need to keep the tty open or the standard input/output file descriptors will be lost.
Check out the software suspend patch for Linux. It allows the system to be suspended by SysRq-D (or shutdown -z) into swap space and resumed (or not) at the next reboot.
- Michael T. Babcock (Yes, I blog)
Use APMD.... It can be used on regular machines also.
It features the "sleep" mode and "suspend" mode and will swap all the system info to the HD.
I'm sure it has already been said, but I like to be redundant P-)
Thanks, Steve
Go to http://techpubs.sgi.com and search for cpr. For example, see http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=0650&db=bks&fname=/SGI_Admin/CPR_OG/330&srch=cpr
I work for a large gov't lab where they run calculations for months on the fastest (currently) supercomputer, ASCI White. People who run codes for any period of time (generally greater than a day or two) write intermediate results at specified intervals so that they can resume in the case of an interruption. Seems like a good general solution that is OS feature independent (disclaimer for nitpicking flaming morons: this is with the assumption that the OS one is using has I/O capabilities).
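A toy version of the "write intermediate results at intervals" approach, as a sketch: the run function, the JSON layout and the stop_after knob (there only to fake an interruption) are all invented for this example, and the real codes will look nothing like this. The one detail worth copying is writing to a temporary file and renaming it over the old dump, so a crash in the middle of a write never leaves a half-written checkpoint behind.

```python
import json
import os

def run(total_steps, path, every=1000, stop_after=None):
    """Sum 0..total_steps-1, saving progress every `every` steps."""
    step, acc = 0, 0
    if os.path.exists(path):                 # resume from the last dump
        with open(path) as f:
            saved = json.load(f)
        step, acc = saved["step"], saved["acc"]
    while step < total_steps:
        if stop_after is not None and step >= stop_after:
            return None                      # simulate an interruption
        acc += step
        step += 1
        if step % every == 0:
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"step": step, "acc": acc}, f)
                f.flush()
                os.fsync(f.fileno())         # make sure it really hit the disk
            os.replace(tmp, path)            # atomic swap: old dump or new, never half
    return acc
```

After an interruption you just run the same thing again and it picks up from the last dump, losing at most `every` steps of work.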
An Intro to Scheme and its Implementation - Recursion in Scheme
Of course this is inside a lisp process, BUT if Lisp is your OS...
Due to a recent power outage, I've had to shut down a server running a process that had been running for ages calculating something. The job it was doing would have been done in a few days, I think, but I had to shut it down before the UPS ran out of juice.
That's nothing! I once had a computer that was demolished to make way for a hyperspacial bypass seconds before finishing the program it had been running for millions of years.
If there is hope, it lies in the trolls.
Really if you write a program that is going to take more than an hour, you really should spend a few minutes doing it properly.
Or you could use any computer that can hibernate. I do find it highly amusing that in response to the "use XP" messages, the Linux community replies by saying "Oh, that's not what he meant at all". It isn't? He said he had to shut down his machine because of a power failure. He didn't say "I want to shut down this one single process". Hibernation would quite clearly have satisfied the original post. But be that as it may, there seem to be quite a few non-techies happy to jump about and talk about suspending a single process, without any thought to things like file handles, inter-process communication, and access to devices. I'm sure someone could implement a "hack" that would occasionally manage to save a process, but it would not be reliable enough to risk using. I would agree with the posters who suggest that creating a protocol through which a program can participate in suspending itself would be ideal, and if it can handle being restarted on another machine then perhaps we have moved on to talking about agents...
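One possible shape for such a "participate in your own suspension" protocol, purely as a sketch: using SIGUSR1 as the request, and the names request_checkpoint and work, are assumptions made up here, not any existing standard. The handler only sets a flag; the program saves its own state at a point where it knows the state is consistent, which is the part an outside "hack" cannot know.

```python
import json
import signal

checkpoint_requested = False

def request_checkpoint(signum, frame):
    """Signal handler: only set a flag; the main loop does the real work."""
    global checkpoint_requested
    checkpoint_requested = True

# Assumption for this sketch: SIGUSR1 means "please save yourself and stop".
signal.signal(signal.SIGUSR1, request_checkpoint)

def work(state, path, steps):
    """Advance the computation; save and stop if a checkpoint was requested."""
    global checkpoint_requested
    for _ in range(steps):
        state["n"] += 1
        if checkpoint_requested:
            with open(path, "w") as f:
                json.dump(state, f)
            checkpoint_requested = False
            return False                     # suspended; state is on disk
    return True                              # finished
```

A UPS monitor could then send the signal (kill -USR1 pid) when the battery gets low, and the saved state file could just as well be loaded on a different machine.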
Memory Dumper: http://download.cnet.com/downloads/0-10091-100-4008596.html?tag=st.dl.10001-103-3.lst-7-25.4008596
Were that I say, pancakes?