Logging Unexpected Shutdowns/Crashes w/ Linux?
sweede asks: "I have a dedicated server that seems to reboot more often than it should. In Windows 2000/XP (maybe NT 4.0?), if your computer or server crashes, it will leave a message in the Event Viewer for you to review what went wrong. Is it possible to do something similar in Linux, where a power outage or an unexpected kernel panic will leave a message in /var/log/event (or whatever)? Searching Google for 'kernel trapping' doesn't give me a whole lot of info on the subject."
Same way they know to fsck/chkdsk the drives: if a 'dirty bit' (or file, in your case) exists during boot, shutdown was unclean - log it. Otherwise create it. Only clear it as the last step of a clean shutdown.
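A minimal sketch of that dirty-flag idea, assuming a boot script calls the first function and the last step of an orderly shutdown calls the second. The function names and file paths are hypothetical, chosen only for illustration:

```python
import os
import tempfile
from datetime import datetime, timezone


def check_previous_shutdown(flag_path, log_path):
    """Run early at boot. Returns True if the previous shutdown was unclean."""
    unclean = os.path.exists(flag_path)
    if unclean:
        # The flag survived, so the clean-shutdown step never ran.
        stamp = datetime.now(timezone.utc).isoformat()
        with open(log_path, "a") as log:
            log.write(f"{stamp} unclean shutdown detected\n")
    # (Re)create the flag and force it to disk so it survives a power cut.
    with open(flag_path, "w") as flag:
        flag.write("running\n")
        flag.flush()
        os.fsync(flag.fileno())
    return unclean


def mark_clean_shutdown(flag_path):
    """Run as the very last step of an orderly shutdown."""
    if os.path.exists(flag_path):
        os.remove(flag_path)


if __name__ == "__main__":
    # Demo in a throwaway directory; real paths are a deployment choice.
    demo_dir = tempfile.mkdtemp()
    flag = os.path.join(demo_dir, "running.flag")
    log = os.path.join(demo_dir, "events.log")
    print(check_previous_shutdown(flag, log))  # first boot: no flag yet
    print(check_previous_shutdown(flag, log))  # "crash": flag was never cleared
```

This only tells you that the shutdown was unclean, not why. A power cut and a kernel panic look identical to it.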
Opportunity knocks. Karma hunts you down.
The reason Linux doesn't write anything to the HD after a panic is so that it doesn't mangle/destroy the FS.
Why not reserve a set place on the hard drive and write out error trap information there? There's no reason the filesystem needs to be involved at all. I'm going to guess that's what Windows does.
NO CARRIER
OK. Then how do you guarantee the state of the kernel? If you use BIOS calls, it screws up the memmap even more. That's assuming you can even pass something like that.
$100 question: How do you break out of inserted code that might have had a bug? How do you determine which code had that bug?
Answer those, and then I'll trust a Write_after_system_crash API.
I expected the Indy to kernel panic or turn off. Instead, below the complaints about the missing ethernet cable ("en0: link carrier not detected" or similar), there was a lone status message: "Power failure detected."
No UPS, no power saving devices of any kind, only the filter caps in the power supply between the logic board and the unreliable, crufty power system of a 70 year old house at the mercy of a power strip first used on my (brand new at the time) Atari 800. The other computer on the power strip (350 P2 running RH 7.1) rebooted hard, right in the middle of heavy FS activity. I had to hit the reset button before it would come back up again, too - the brownout hung the POST.
Every cloud has a silver lining (except for the mushroom shaped ones, which have a lining of Iridium & Strontium 90)
OK, I've got an idea. When it panics, it reruns the bootloader (use BIOS calls to read the first HD sector and go from there) and passes it some special flags which basically say "I did a bad thing, clean up after me." The bootloader will unpack another set of routines (checksummed for quality) in the same way it loads the Linux kernel off of the HD, and place them into an area of RAM that's hopefully not used by anything kernel related (app space). It will then read in the page tables and other info still resident in RAM (use the Linux kernel on the HD for reference / symbol tables, or rebuild the crashdump app at the same time as the kernel with the same memory offsets and values), and formulate a meaningful crashdump. It'll then read in the partition table / slice table / disk label / whatever, and find and write over the swap partition with the dump (making sure that the swap partition actually is a swap partition: read its header and such, just in case the partition table was mangled). It will then reboot the computer. Upon reboot, Linux will pick up on the swap partition containing a crash dump (changed magic number?) and copy it to a file on the HD, then reformat the swap partition and mount it as normal, making a note in the syslogs that it crashed and the crashdump can be found at $LOCATION. (And maybe pass control to a different rc file, for a limited or debugging sysinit.)
Every cloud has a silver lining (except for the mushroom shaped ones, which have a lining of Iridium & Strontium 90)
I wonder if /dev/nvram (the small amount of NVRAM available on the RTC) is large enough to store such a dump.
As others have said, the "Linux crash" is probably hardware failure.
The most common cause of serious failure, if the software has been installed correctly and tested, is bad contacts. To fix the problem, just loosen the screws that hold the adapter cards, pull the cards out about 1 millimeter or 1/32 of an inch, push the cards back in fully, and re-tighten the screws. Also, pull all connectors off a similar amount, and push them back on. Do the same with the memory modules. That's all.
The scraping caused by moving the contact points a tiny amount is actually very violent on a micro scale. The scraping removes oxide that causes a contact to lose electrical conduction.
This is reliable information. I've been selling and occasionally repairing PCs since before IBM sold PCs, back in the days when personal computers cost $2300, had two diskette drives and no hard drive, and ran the CP/M operating system.
My guess is that, if you had a penny for every real crash of a stable distribution of Linux, after a few years you might still have to borrow money from your little brother to buy a piece of bubble gum.
Murphy's law will apply, and the thing that causes your system to crash won't be trapped by whatever magic you try to log it with! We recently had a machine that would just power down without warning. I eventually discovered it happened after intensive CPU load for about 20 minutes, figured maybe it was some heating problem, kicked up the sensors package, and spotted the CPU temp heading into egg-frying temperatures. It seems the BIOS would just protect its motherboard by shutting down. The kernel had no chance to report anything.
Windows does something like this too.
At Blue Screen, it will make a dump in the swap partition if so configured. The dump can be a 64k error summary (MiniDump), kernel memory dump, or a full physical memory dump (if swap > physical memory). While there is a slim possibility that doing this might make things worse (if the code to write the dump is corrupted, or the disk driver is corrupted), it is MUCH more likely that the information written will be useful. Also, the swap partition driver is pretty stable and simple, so chances are very good that it won't mess up anything that wasn't already messed up. If you're paranoid, you can turn off this feature.
At clean shutdown, it writes an event to the event log indicating clean shutdown.
At boot, if there is no "clean shutdown" event, it writes an "unexpected shutdown" event to the event log. It estimates the time of the crash based on the last events in the event log. Since Windows has periodic "I am running ok" events recorded to the event log, it can use the last "I am ok" event to guess at the crash time.
At boot, if there is a crash dump in the swap partition, it is recovered and copied to a file for subsequent analysis.
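The boot-time check on the event log can be sketched roughly as follows. This is an illustration of the idea only, not how Windows implements it; the event names and the `estimate_crash_time` function are made up for the example:

```python
from datetime import datetime, timedelta

# Hypothetical event kinds; a real event log has its own identifiers.
CLEAN_SHUTDOWN = "clean-shutdown"
HEARTBEAT = "heartbeat"


def estimate_crash_time(events):
    """events: chronological list of (timestamp, kind) from the previous run.

    Returns None if the run ended with a clean-shutdown event (or logged
    nothing at all). Otherwise returns the timestamp of the last recorded
    event, which is the best available guess at when the machine died.
    """
    if not events:
        return None
    last_time, last_kind = events[-1]
    if last_kind == CLEAN_SHUTDOWN:
        return None
    return last_time


if __name__ == "__main__":
    start = datetime(2003, 1, 1, 12, 0)
    # Heartbeats every five minutes, then nothing: the box went down
    # some time after the last one.
    previous_run = [(start + timedelta(minutes=5 * i), HEARTBEAT) for i in range(4)]
    print(estimate_crash_time(previous_run))
```

The estimate is only as precise as the heartbeat interval: the real crash happened somewhere between the last recorded event and the next one that never arrived.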
Time flies like an arrow. Fruit flies like a banana.
When it panics, it reruns the bootloader (use BIOS calls to read the first HD sector and go from there) [...]
When Linux panics, it usually has a good reason to do so: a damaged descriptor table, overwritten kernel code, hardware that works wrong, and various other catastrophes. Panic means a real panic: you cannot reliably use any hardware. So you cannot rerun the bootloader, and you cannot access the BIOS. You can only hope that a hardware watchdog card notices that the kernel has panicked (because its timeout counter is no longer reset) and reboots the machine.
(BTW: to access the boot loader and the BIOS, you probably would have to drop out of protected mode back into the ugly world of real mode (or V86 mode), causing even more P.I.T.A.)
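For the watchdog part, the user-space side is small. On Linux, a watchdog driver usually shows up as /dev/watchdog: any write resets the countdown, and if the writes stop, the hardware reboots the box. A rough sketch, with hypothetical helper names and a byte buffer standing in for the real device so it is safe to run:

```python
import io
import time


def pet_watchdog(dev, beats, interval_s, sleep=time.sleep):
    """Write to an already-open watchdog device `beats` times.

    Each write resets the hardware countdown. If this loop stops running
    (panic, hang), the timer expires and the machine is reset.
    """
    for _ in range(beats):
        dev.write(b"\0")
        dev.flush()
        sleep(interval_s)


def disarm(dev):
    """'Magic close': writing 'V' right before closing asks the driver to
    stop the timer, on drivers that support it and allow disarming."""
    dev.write(b"V")
    dev.flush()


if __name__ == "__main__":
    # Stand-in buffer for the demo. On a real machine this would be
    # open("/dev/watchdog", "wb", buffering=0), which ARMS the timer.
    fake_dev = io.BytesIO()
    pet_watchdog(fake_dev, beats=3, interval_s=0.1)
    disarm(fake_dev)
    print(fake_dev.getvalue())
```

Note that this only gets you an automatic reboot, not a log of what went wrong.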
Denken hilft.
LTT logs every system call with ns precision to a RAM buffer and then to disk. The events include, for instance, read/write/open operations, system calls, interrupts, process state, disk and network interface operations, and so on. You can add specific events by modifying your application and recompiling it with the LTT library.
LTT is not yet included in the kernel and was not chosen after the "Halloween Freeze". However, the new infrastructure can operate in a "flight recorder" mode that will, for instance, log the last 5 Mb of events that happen on the system.
Of course, when there is a kernel crash, you cannot be certain to have those events on disk, but this is a chicken-and-egg problem.
Anyway, I believe this kind of functionality is in demand for most critical applications. It is also very important in the embedded market, where debugging and optimization are very painful.
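The "flight recorder" idea itself is just a bounded buffer that throws away the oldest events to make room for new ones. A toy sketch of that behaviour, which is not LTT's actual implementation or API (the class name and byte budget are assumptions for the example):

```python
from collections import deque


class FlightRecorder:
    """Keep only the most recent events, up to a fixed byte budget."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.events = deque()

    def record(self, event: bytes):
        self.events.append(event)
        self.used += len(event)
        # Evict the oldest events until we are back under budget.
        while self.used > self.max_bytes and self.events:
            self.used -= len(self.events.popleft())

    def dump(self):
        """Return the surviving events, oldest first."""
        return list(self.events)


if __name__ == "__main__":
    recorder = FlightRecorder(max_bytes=32)
    for i in range(10):
        recorder.record(f"syscall {i}\n".encode())
    # Only the tail end of the history is still there.
    print(recorder.dump())
```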
You can use a serial console or try out some version of the netconsole patch to get the messages on another computer. (Note that netconsole over the internet is probably possible, but it is sent in the clear and can be snooped or modified.) I also recall reading about a patch that keeps a new kernel ready in memory, which could be booted with arguments telling it where to find the log from the old kernel. I even think it included a checksum to prevent booting the new kernel if it had been corrupted.
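On the receiving machine, netconsole output is just plain UDP datagrams, so the listener can be tiny. A rough sketch, assuming the crashing box is configured to send to this host on port 6666 (commonly the netconsole target port, but check your setup); the function names are hypothetical:

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming console messages."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive(sock, count, log):
    """Read `count` datagrams and write them to `log`, one line each,
    prefixed with the sender's address."""
    for _ in range(count):
        data, (addr, _port) = sock.recvfrom(4096)
        text = data.decode("utf-8", "replace").rstrip("\n")
        log.write(f"{addr}: {text}\n")
        log.flush()


# Typical use on the logging host (not run here, it would block):
#     with open("remote-console.log", "a") as log:
#         receive(open_listener(), count=1000, log=log)
```

As the parent says, nothing here is authenticated or encrypted, so treat whatever lands in the log accordingly.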
Do you care about the security of your wireless mouse?