Logging Unexpected Shutdowns/Crashes w/ Linux?

Posted by Cliff on Sunday September 14, 2003 @04:06PM from the finding-evidence-of-the-problem dept.

sweede asks: "I have a dedicated server that seems to reboot more often than it should. In Windows 2000/XP (maybe NT4.0?), if your computer or server crashes it will leave an event message in the Event Viewer for you to review what went wrong. Is it possible to do something similar in Linux, where a power outage or an unexpected kernel panic will leave a message in /var/log/event (or whatever)? Searching Google for 'kernel trapping' doesn't give me a whole lot of info on the subject."

Flag it. by slittle · 2003-09-14 16:31 · Score: 2, Interesting

Same way they know to fsck/chkdsk the drives: if a 'dirty bit' (or file, in your case) exists during boot, shutdown was unclean - log it. Otherwise create it. Only clear it as the last step of a clean shutdown.
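
As a rough illustration of the flag-file idea, here is a minimal Python sketch. The function names, the paths passed in, and the log format are all assumptions made up for the example, not anything a distribution ships. The first function would run once from an init script at boot, and the second as the very last step of a clean shutdown.

```python
import os
import time

def check_and_set_flag(flag_path, log_path):
    """Run once at boot. Returns True if the previous shutdown was unclean."""
    unclean = os.path.exists(flag_path)
    if unclean:
        # The flag survived, so the clean-shutdown step never ran: log it.
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(log_path, "a") as log:
            log.write("%s unclean shutdown detected\n" % stamp)
    else:
        # Clean start: create the flag and force it to disk right away.
        with open(flag_path, "w") as flag:
            flag.write("running\n")
            flag.flush()
            os.fsync(flag.fileno())
    return unclean

def clear_flag(flag_path):
    """Run as the last step of a clean shutdown."""
    if os.path.exists(flag_path):
        os.remove(flag_path)
```

Note that this only records that the shutdown was unclean, not why; the flag is never touched during the crash itself.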

--
Opportunity knocks. Karma hunts you down.
1. Re:Flag it. by Creepy Crawler · 2003-09-14 16:42 · Score: 3, Interesting
  
  You fail to understand what happens to create the "Dirty Bit".
  
  1: System starts up (say clean).
  2: It marks a bit on the partition that system has been started up.
  3: Usage Usage Usage
  4: Send shutdown
  5: System umounts cleanly. Undoes "dirty bit"
  6: Power == 0
  
  On a dirty FS, step 5 is never reached, so when the system comes back up it checks the bit and detects the unclean shutdown. The bit is never written during the unclean shutdown.
  
  Similarly, I see problems when the NT kernel crashes. How exactly does it manage to:
  
  1: Read the partition
  2: Read the program on the partition
  3: Run the insert log program to add log entry
  4: Still have the "blue screen"
  
  I smell nasty data corruption waiting to happen. After all, if you can't guarantee the state of the kernel, is it really justified to read, write, and execute on a crashed kernel?
  --
  
  Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
Re:I'm pretty sure.. by Chester K · 2003-09-14 16:40 · Score: 3, Interesting

That the reason Linux doesn't write anything to the HD after a panic is so that it doesn't mangle/destroy the FS.

Why not reserve a set place on the hard drive and write out error trap information there? There's no reason the filesystem needs to be involved at all. I'm going to guess that's what Windows does.

--

NO CARRIER
Re:I'm pretty sure.. by Creepy Crawler · 2003-09-14 16:48 · Score: 3, Interesting

OK. Then how do you guarantee the state of the kernel? If you use BIOS calls, it screws up the memmap even more. That's assuming you can even pass something like that.

$100 question: How do you break out of inserted code that might have had a bug? How do you determine what code had that bug?

Answer those, and then I'll trust a Write_after_system_crash API.
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
Re:Other OSes by FueledByRamen · 2003-09-14 17:33 · Score: 3, Interesting

Of course, usually you know there was a power failure because your UPS told you so.... I did have one case where we had a very brief outage (or maybe just a brownout). Every machine in the building had rebooted.... except one. That RS/6000 had an eerie log message like "power failure detected". And no, it was not on a UPS. I was rather impressed.

I had a similar interesting experience with an SGI Indy (Irix 6.5.13, or thereabouts). I was booting it up after it'd been sitting for a while, just to see what I had running on there. While it was going, and I was fumbling around for an ethernet cable for it (it takes several minutes at boot to wait for a cable instead of noting its absence and moving on), I kicked the power strip that it was on and the plug wiggled around in the wall socket. I heard a spark jump in the socket, and the monitor it was on (Dell/Sony Trinitron 19") went to half-height mode for a few seconds, spitting and clicking, turning the screen on and off and varying the vertical height randomly.

I expected the Indy to kernel panic or turn off. Instead, below the complaints about the missing ethernet cable ("en0: link carrier not detected" or similar), there was a lone status message: "Power failure detected."

No UPS, no power saving devices of any kind, only the filter caps in the power supply between the logic board and the unreliable, crufty power system of a 70 year old house at the mercy of a power strip first used on my (brand new at the time) Atari 800. The other computer on the power strip (350 P2 running RH 7.1) rebooted hard, right in the middle of heavy FS activity. I had to hit the reset button before it would come back up again, too - the brownout hung the POST.

--
Every cloud has a silver lining (except for the mushroom shaped ones, which have a lining of Iridium & Strontium 90)
Re:I'm pretty sure.. by FueledByRamen · 2003-09-14 17:59 · Score: 2, Interesting

OK, I've got an idea. When it panics, it reruns the bootloader (use BIOS calls to read the first HD sector and go from there) and passes it some special flags which basically say "I did a bad thing, clean up after me." The bootloader will unpack another set of routines (checksummed for quality) in the same way it loads the Linux kernel off of the HD, and place them into an area of RAM that's hopefully not used by anything kernel related (app space). It will then read in the pagetables and other info still resident in RAM (use the Linux kernel on the HD for reference / symbol tables, or rebuild the crashdump app at the same time as the kernel with the same memory offsets and values), and formulate a meaningful crashdump. It'll then read in the partition table / slicetable / disk label / whatever, and find and write over the swap partition with the dump (making sure that the swap partition actually is a swap partition - read its header and such, just in case the partition table was mangled). It will then reboot the computer.

Upon reboot, Linux will pick up on the swap partition containing a crash dump (changed magic number?) and copy it to a file on the HD, then reformat the swap partition and mount it as normal, making a note in the syslogs that it crashed and the crashdump can be found at $LOCATION. (And maybe pass control to a different rc file, for a limited or debugging sysinit.)
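
The "make sure the swap partition actually is a swap partition" and "changed magic number" checks could look something like the sketch below. It assumes the new-style Linux swap header, where the 10-byte signature SWAPSPACE2 sits in the last 10 bytes of the first page, and a 4096-byte page size. The CRASHDUMP1 marker and the function names are purely hypothetical, invented here to stand in for whatever a dump routine would write.

```python
PAGE_SIZE = 4096            # assumed page size; differs on some architectures
SWAP_MAGIC = b"SWAPSPACE2"  # signature of a new-style Linux swap area
DUMP_MAGIC = b"CRASHDUMP1"  # hypothetical marker, invented for this sketch

def classify_first_page(page):
    """Decide what the first page of a partition looks like."""
    if len(page) < PAGE_SIZE:
        return "unknown"
    # The signature occupies the final 10 bytes of the first page.
    magic = page[PAGE_SIZE - 10:PAGE_SIZE]
    if magic == SWAP_MAGIC:
        return "swap"
    if magic == DUMP_MAGIC:
        return "crashdump"
    return "unknown"

def read_first_page(device_path):
    """Read the first page of a block device or image file."""
    with open(device_path, "rb") as dev:
        return dev.read(PAGE_SIZE)
```

At boot, an init script might call classify_first_page(read_first_page(...)) on the swap device and copy the contents out before re-running mkswap when the answer is "crashdump".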

--
Every cloud has a silver lining (except for the mushroom shaped ones, which have a lining of Iridium & Strontium 90)
Re:Other OSes by kinema · 2003-09-14 18:24 · Score: 3, Interesting

I wonder if /dev/nvram (the small amount of NVRAM available on the RTC) is large enough to store such a dump.
"Linux crashes" are probably contact failure. by Futurepower(R) · 2003-09-14 19:37 · Score: 3, Interesting

As others have said, the "Linux crash" is probably hardware failure.

The most common cause of serious failure, if the software has been installed correctly and tested, is bad contacts. To fix the problem, just loosen the screws that hold the adapter cards, pull the cards out about 1 millimeter or 1/32 of an inch, push the cards back in fully, and re-tighten the screws. Also, pull all connectors off a similar amount, and push them back on. Do the same with the memory modules. That's all.

The scraping caused by moving the contact points a tiny amount is actually very violent on a micro scale. The scraping removes oxide that causes a contact to lose electrical conduction.

This is reliable information. I've been selling and occasionally repairing PCs since before IBM sold PCs, back in the days when personal computers cost $2300, had two diskette drives and no hard drive, and ran the CP/M operating system.

My guess is that, if you had a penny for every real crash of a stable distribution of Linux, after a few years you might still have to borrow money from your little brother to buy a piece of bubble gum.
It'll never catch the things you want... by Bazman · 2003-09-14 20:51 · Score: 2, Interesting

Murphy's law will apply, and the thing that causes your system to crash won't be trapped by whatever magic you try to log it with! We recently had a machine that would just power down without warning. I eventually discovered it happened after intensive CPU load for about 20 minutes, figured maybe it was some heating problem, fired up the sensors package and spotted the CPU temp heading into egg-frying temperatures. It seems the BIOS would just protect its motherboard by shutting down. The kernel had no chance to report anything.
Re:Other OSes by cookd · 2003-09-14 20:55 · Score: 2, Interesting

Windows does something like this too.

At Blue Screen, it will make a dump in the swap partition if so configured. The dump can be a 64k error summary (MiniDump), kernel memory dump, or a full physical memory dump (if swap > physical memory). While there is a slim possibility that doing this might make things worse (if the code to write the dump is corrupted, or the disk driver is corrupted), it is MUCH more likely that the information written will be useful. Also, the swap partition driver is pretty stable and simple, so chances are very good that it won't mess up anything that wasn't already messed up. If you're paranoid, you can turn off this feature.

At clean shutdown, it writes an event to the event log indicating clean shutdown.

At boot, if there is no "clean shutdown" event, it writes an "unexpected shutdown" event to the event log. It estimates the time of the crash based on the last events in the event log. Since Windows has periodic "I am running ok" events recorded to the event log, it can use the last "I am ok" event to guess at the crash time.

At boot, if there is a crash dump in the swap partition, it is recovered and copied to a file for subsequent analysis.
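
The periodic "I am running ok" trick is easy to imitate on Linux. The following is only a sketch under assumed names: the heartbeat file, its single-timestamp format, and both functions are made up for illustration. A cron job or small daemon would call the writer every minute or so, and a boot script would read the last value to estimate when the machine went down.

```python
import os
import time

def write_heartbeat(path, now=None):
    """Record the current time; call this periodically while running."""
    now = time.time() if now is None else now
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write("%f\n" % now)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file so a crash mid-write cannot truncate it.
    os.replace(tmp, path)

def last_heartbeat(path):
    """Return the last recorded timestamp, or None if unavailable."""
    try:
        with open(path) as f:
            return float(f.read().strip())
    except (OSError, ValueError):
        return None
```

The crash time can only be narrowed down to the heartbeat interval: the machine died somewhere between the last recorded timestamp and one interval later.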

--
Time flies like an arrow. Fruit flies like a banana.
Re:I'm pretty sure.. by Tux2000 · 2003-09-15 00:02 · Score: 2, Interesting

When it panics, it reruns the bootloader (use BIOS calls to read the first HD sector and go from there) [...]

When Linux panics, it usually has a good reason to do so. Something like a damaged descriptor table, overwritten kernel code, hardware that works wrong, and various other catastrophes. Panic means a real panic: You cannot reliably use any hardware. So you cannot rerun the bootloader, and you cannot access the BIOS. You can only hope that a hardware watchdog card notices that the kernel has panicked (because its timeout counter is no longer reset) and reboots the machine.

(BTW: to access the boot loader and the BIOS, you probably would have to drop out of protected mode back into the ugly world of real mode (or V86 mode), causing even more P.I.T.A.)

--
Denken hilft.
Linux Trace Toolkit by bendl · 2003-09-15 03:18 · Score: 2, Interesting

I'm working on a project called Linux Trace Toolkit (LTT) that is suitable for automatic logging.
LTT logs every system call with nanosecond precision to a RAM buffer and then to disk. The events include, for instance, read/write/open operations, system calls, interrupts, process state, disk and network interface operations and so on. You can add specific events by modifying your application and recompiling it with the LTT library.
LTT is not yet included in the kernel and was not chosen after the "Halloween Freeze"; however, the new infrastructure can operate in a "flight recorder" mode that will, for instance, log the last 5 MB of events that happen on the system.
Of course, when there is a kernel crash, you cannot be certain to have those events on disk, but this is a chicken-and-egg problem.
Anyway, I believe this kind of functionality is in demand for most critical applications. This is very important in the embedded market too, where debugging and optimization are very painful.
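
To make the "flight recorder" idea concrete, here is a small user-space sketch of a byte-bounded ring of events. It is not LTT code and does not use any LTT API; the class and method names are assumptions for the example. It only shows the general technique of keeping the most recent N bytes of events and discarding the oldest.

```python
from collections import deque

class FlightRecorder:
    """Keep only the most recent events, bounded by total size in bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.size = 0
        self.events = deque()

    def record(self, event):
        """Append an event (bytes), evicting the oldest ones if over budget."""
        self.events.append(event)
        self.size += len(event)
        # An event larger than max_bytes ends up evicted as well.
        while self.size > self.max_bytes and self.events:
            self.size -= len(self.events.popleft())

    def dump(self):
        """Return the retained events, oldest first."""
        return list(self.events)
```

For example, FlightRecorder(5 * 1024 * 1024) would retain roughly the last 5 MB of recorded events, which is what you would want to get out of the machine after a crash.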
A few hints by kasperd · 2003-09-15 07:54 · Score: 2, Interesting

You can use a serial console or try out some version of the netconsole patch to get the messages on another computer. (Note that netconsole over the internet is probably possible, but it is sent in the clear and can be snooped or modified.) I also recall reading about some patch to keep a new kernel ready in memory that could be booted with arguments telling it where to find the log from the old kernel; I even think it included a checksum to prevent booting the new kernel if it had been corrupted.
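
On the receiving machine, catching console messages sent over the network can be as simple as a UDP listener. The sketch below assumes the sender emits plain-text UDP datagrams, as netconsole does; the function names are invented for the example, and the address and port must match whatever the sending side was configured with.

```python
import socket

def open_listener(host="127.0.0.1", port=0):
    """Bind a UDP socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock

def receive_lines(sock, count, timeout=2.0):
    """Collect `count` datagrams as (sender_ip, text) pairs."""
    sock.settimeout(timeout)
    lines = []
    for _ in range(count):
        data, addr = sock.recvfrom(4096)
        text = data.decode("utf-8", "replace").rstrip("\n")
        lines.append((addr[0], text))
    return lines
```

A real collector would loop forever and append each line to a log file with a timestamp. As noted above, nothing here is authenticated or encrypted, so it only makes sense on a trusted network.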

--

Do you care about the security of your wireless mouse?