Logging Unexpected Shutdowns/Crashes w/ Linux?

Posted by Cliff on Sunday September 14, 2003 @04:06PM from the finding-evidence-of-the-problem dept.

sweede asks: "I have a dedicated server that seems to reboot more often than it should. In Windows 2000/XP (maybe NT4.0?), if your computer or server crashes it will leave an event message in the Event Viewer for you to review to see what went wrong. Is it possible to do something similar in Linux, where a power outage or an unexpected kernel panic will leave a message in /var/log/event (or whatever)? Searching Google for 'kernel trapping' doesn't give me a whole lot of info on the subject."

I'm pretty sure.. by Creepy+Crawler · 2003-09-14 16:11 · Score: 3, Informative

That the reason Linux doesn't write anything to the HD after a panic is so that it doesn't mangle/destroy the FS.

And if I'm correct, if you turn on serial console, you'll get a Panic output on serial. Add a serial2IP box and you're set.
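
For anyone who wants to try the serial console route: as far as I know the kernel takes console= arguments on its command line, something like "console=tty0 console=ttyS0,9600n8" (the port and speed here are assumptions, use whatever your hardware has). Below is a minimal sketch that lists which consoles a given command line asks for. The helper name console_args is made up for illustration.

```shell
# Print the console= arguments found on a kernel command line.
# With no argument, read the running kernel's /proc/cmdline.
# (console_args is a hypothetical helper name, not a standard tool.)
console_args() {
    cmdline=${1:-$(cat /proc/cmdline)}
    for arg in $cmdline; do
        case $arg in
            console=*) echo "$arg" ;;
        esac
    done
}

# Usage: console_args
#        console_args "ro root=/dev/hda1 console=tty0 console=ttyS0,9600n8"
```

If nothing is printed, no console was requested explicitly and panic output will only reach the default console.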
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
1. Re:I'm pretty sure.. by bobthemonkey13 · 2003-09-14 16:43 · Score: 2, Informative
  
  Or a dot-matrix printer. Seriously, I did this for a while; you can turn on console-on-LPT support in your kernel config, and pass a parameter with your bootloader. It takes a while for the stupid thing to display all the kernel messages at boot, but the sound is priceless. Sadly, the 20-some-year-old printer decided to kick the bucket (still lasted longer than my HP DeskJet, thank you very much), so I switched to a 286 laptop running Minix 1.5 and term, which might be a cheap way to implement the serial2IP idea.
2. Re:I'm pretty sure.. by Anonymous Coward · 2003-09-14 17:00 · Score: 2, Informative
  
  Why not reserve a set place on the hard drive and write out error trap information there?
  
  The IDE/SCSI driver could still corrupt your data. How would you know where to write the info anyway? The kernel could easily calculate and store a block number, but could it trust the stored number after panic() is called? Maybe somebody overwrote that variable, or maybe some bad RAM caused it to spontaneously change.
  
  Kernel panics are generally used as a last resort, when something goes really wrong and there's no sane way to handle it (at least in theory). Deciding whether it's safe to write to disk would be difficult.
  
  Anyway, there are kernel patches that save crash dumps to your swap space or even your video RAM (search for something like "linux crash dump"). I wouldn't use them all the time, but they might be useful if you know your RAM is OK and don't suspect disk/IDE problems. A serial console is safer, but not always practical.
Easy... by icemax · 2003-09-14 16:15 · Score: 3, Informative

'last reboot' should show you all the recent boots

--

__________
Love conquers all... except CANCER
Logging Unexpected Shutdowns/Crashes w/ Linux? by krishnaD · 2003-09-14 16:16 · Score: 3, Informative

/var/log/messages and /var/log/syslog should give you enough info about the kernel. There are also lots of tools that enable various kinds of accounting; check sa.
1. Re:Logging Unexpected Shutdowns/Crashes w/ Linux? by Tux2000 · 2003-09-15 00:05 · Score: 2, Informative
  
  Add a serial or parallel console that writes to paper, i.e. a printer. Disable syslogd and klogd and let all log output go to the console.
  
  --
  Denken hilft.
Kernel Panic on Linux? Sounds like hardware prob. by Radical+Rad · 2003-09-14 16:39 · Score: 2, Informative

After 10 years without ever needing to apply the knowledge, I forgot how. Would the magic sysrq key help? I bet it is a hardware problem though. And what about logging power outages? That is easy to do. APC probably has Linux software already to do this. For other logging there are ample facilities on Linux. Start a syslog server. Point everything to the loopback address.
Other OSes by menscher · 2003-09-14 16:49 · Score: 5, Informative

This will probably be modded down as flame bait, but I can't resist pointing out what some other OSes have done when crashing:
IRIX will core dump to the swap partition. On the next boot it analyzes this core file, which includes various system logs, etc, and saves useful output in /var/adm/crash. You know you've done a good job when the kernel panic causes a panic, called a double panic. I used to be able to trigger those at will. Hrmm, I should test that on the current release.
AIX summarizes the likely causes of failure (power failure, someone pressed the power switch, or power supply died, etc). I've seen (but do not personally use) a similar thing with IRIX that actually assigns a percentage confidence level to its guess.
Of course, usually you know there was a power failure because your UPS told you so.... I did have one case where we had a very brief outage (or maybe just a brownout). Every machine in the building had rebooted.... except one. That RS/6000 had an eerie log message like "power failure detected". And no, it was not on a UPS. I was rather impressed.
Sadly, I don't know how to get any useful information out of linux. And don't give me crap about it never crashing. I can prove otherwise. Too bad I can't figure out why.... Maybe a kernel developer will read this and copy some ideas from the commercial Unix vendors.
1. Re:Other OSes by Ster · 2003-09-14 18:06 · Score: 3, Informative
  
  Mac OS X writes a crash dump to the non-volatile RAM in the event of a panic. Then, after the next successful boot, it reads out the dump and adds it to /Library/Logs/panic.log. If, for some reason, the machine won't come back up, you can probably read the dump from OpenFirmware.
  
  -Ster
2. Re:Other OSes by anthony_dipierro · 2003-09-14 18:37 · Score: 4, Informative
  
  IRIX will core dump to the swap partition.
  
  FreeBSD does this. HP/UX does this. I always assumed Linux did it too, it just wasn't turned on by default. I guess I was wrong.
  
  As a side note, my first job out of college was to analyze core dumps from HP/UX. There's an awful lot you can learn from these things. Not just stack traces, the entire memory of the system is contained in the dump. It's time consuming, but a large portion of the time you can find out *exactly* what went wrong.
3. Re:Other OSes by Tux2000 · 2003-09-14 23:45 · Score: 2, Informative
  
  Nope. RTC memory is something between 128 Bytes (IBM AT) and 2 KBytes (IBM PS/2 series). And each bit of it is used for the BIOS and some hardware stuff (Microchannel requires a lot of memory). Perhaps some machines have a few unused bits. But you can't stuff all your memory into them. You can't compress several megabytes or gigabytes into 10 to 20 bits (at least not losslessly). With a lot of luck and deep knowledge of the machine and BIOS in use, you may be able to store a dirty-or-clean shutdown flag. But as I said, it depends very much on the machine and the BIOS.
  
  --
  Denken hilft.
Try the Linux Kernel Crash Dump (LKCD) patches by bigsteve@dstc · 2003-09-14 17:02 · Score: 5, Informative

If you are adventurous, you could try applying the LKCD patches to your kernel. Start looking here
1. Re:Try the Linux Kernel Crash Dump (LKCD) patches by Anonymous Coward · 2003-09-14 20:20 · Score: 2, Informative
  
  It's located here for the uninitiated.
  
  I took a look at it a while back and it looked interesting. Just checked back at the site and there is a reasonable howto provided by IBM in the doc section which should give you some idea of what/how it works.
  
  It's worth a look, but honestly, it sounds a LOT more like a hardware issue than software. Is the server on a UPS? If not, get it on one. Reseat all your cards, RAM, etc., then see if it crashes as regularly.
Re:Event Log? by Anonymous Coward · 2003-09-14 17:38 · Score: 1, Informative

Exactly what parallel universe are you living in? I've never ever gotten a useful event log after NT/2K goes BSOD.

Apparently, he lives in the same parallel universe I do. I suppose you think the checkbox in Startup and Recovery labeled "Write an event to the system log" is there for looks?
rc.local by bobbozzo · 2003-09-14 19:18 · Score: 2, Informative

As others have mentioned, there are various ways to see when the system rebooted.
If you want to be emailed when the system reboots, put something at the end of /etc/rc.d/rc.local, if you're using something like RedHat (SYSV init, IIRC).
Logwatch will probably let you know if the system rebooted also.
If you want a log of the kernel panic, or something else, that's a lot more complicated, as others have mentioned.
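
The rc.local idea can be sketched roughly like this. It assumes a working mailx-style "mail" command on the box and that somebody actually reads root's mail; boot_notice is a made-up helper name.

```shell
# Build the notification text for a boot-time email.
# (boot_notice is a hypothetical helper; uname -n prints the host name.)
boot_notice() {
    echo "$(uname -n) booted at $(date)"
}

# Lines to place at the end of /etc/rc.d/rc.local, assuming a local
# "mail" command is installed and configured:
#   boot_notice | mail -s "Reboot: $(uname -n)" root
```

This only tells you that a boot happened, not why. Combine it with the wtmp records or a remote log to tell a crash from a clean restart.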

--
Nothing to see here; Move along.
Depends on what it's doing by Sits · 2003-09-14 21:18 · Score: 3, Informative

To the best of my knowledge Linux doesn't automatically reboot after a kernel crash unless you have told it to. If the crash was that severe, this means you can walk up to the crashed machine and read the oops off the screen. If the machine isn't oopsing before the reboot, this suggests some sort of hardware fault (e.g. your CPU is overheating). If it is hardware resetting the machine, it is very unlikely that Linux can tell you what the fault is by itself (e.g. if it was the CPU overheating you will have to find some way to log the temperature to a file and observe the graph up to the crash yourself).
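
Logging the temperature to a file could look something like the sketch below. Where the reading comes from is an assumption: newer kernels expose /sys/class/thermal/thermal_zone0/temp, older ones use other paths, so pass in whatever file your machine provides. log_temp is a made-up name.

```shell
# Append one timestamped reading from a sensor file to a log file.
# $1: sensor file (assumed path, differs between kernels and boards)
# $2: log file to append to
log_temp() {
    reading=$(cat "$1") || return 1
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$reading" >> "$2"
    sync   # flush to disk so the last samples survive a sudden reset
}

# Usage sketch: sample once a minute until the machine dies.
#   while :; do
#       log_temp /sys/class/thermal/thermal_zone0/temp /var/log/temp.log
#       sleep 60
#   done
```

After the next unexplained reboot, the tail of the log shows whether the readings were climbing right before the machine went down.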

Oh and here's a useful way of working out whether there was a crash or not:
last -x | grep "shutdown\|reboot"
Every reboot that doesn't have a matching shutdown was probably a crash (other than the last line).
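
If you don't want to match the lines up by eye, here is a rough sketch that does the pairing. It assumes last prints the newest record first, so the first reboot line it sees is the boot that is still running and gets skipped. find_unclean_boots is a made-up name.

```shell
# Read `last -x` output (newest record first) on stdin and print every
# "reboot" record that has no "shutdown" record logged after it.
# The first reboot line is the currently running boot and is skipped.
find_unclean_boots() {
    awk '
        $1 == "shutdown" { shutdown_seen = 1; next }
        $1 == "reboot" {
            if (boots_seen > 0 && !shutdown_seen) print
            boots_seen++
            shutdown_seen = 0
        }
    '
}

# Usage: last -x | find_unclean_boots
```

Each line printed is a boot that was followed by another boot with no clean shutdown in between, so it probably ended in a crash, a reset or a power cut.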
Here's how: by samjam · 2003-09-14 21:19 · Score: 4, Informative

1) First disable console blanking, that way when you get to the crashed box and plug the monitor in you can see the kernel panic message:
/usr/sbin/setterm -blank 0 -powersave off -powerdown 0

We had some early kernel 2.4 Red Hat boxes crashing like the dickens for a while. It was a kernel problem, and only when it happened on a local machine under our eyes did we realise what had happened.

2) Network syslog:
If you syslog to a central machine, not only does it make error spotting centralised and easier, but it also means you have the last gasps of the crashed machine logged on a machine that is still up.
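
For the network syslog part, a minimal sketch assuming the classic sysklogd syslogd, where an action of "@host" forwards messages over UDP. The host name "loghost" is a placeholder and forward_rule is a made-up helper.

```shell
# Print a syslog.conf rule that forwards every facility and priority
# to a remote host. The host name is whatever you pass in.
# (forward_rule is a hypothetical helper name.)
forward_rule() {
    printf '*.*\t@%s\n' "$1"
}

# Usage (as root), with "loghost" standing in for your log server:
#   forward_rule loghost >> /etc/syslog.conf
# then restart syslogd on the client. The receiving sysklogd must be
# started with -r, otherwise it ignores messages from the network.
```

Forwarding everything is the blunt option. A narrower selector such as kern.* would cut the noise if only kernel messages matter.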

Sam

--
blog.sam.liddicott.com
serial console by treat · 2003-09-14 21:47 · Score: 3, Informative

A serial console (make sure you enable the magic sysrq key! for some reason RedHat disables it by default) is an essential tool for any Linux server you care about. If you don't have the money for a console server, just plug servers into each other.
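
To check the sysrq setting, something along these lines should do. It is a sketch that assumes the usual /proc/sys/kernel/sysrq control file; sysrq_state is a made-up name.

```shell
# Report whether magic SysRq is enabled. $1 is the control file and
# defaults to the standard /proc path. (sysrq_state is hypothetical.)
sysrq_state() {
    value=$(cat "${1:-/proc/sys/kernel/sysrq}") || return 1
    if [ "$value" = "0" ]; then
        echo disabled
    else
        echo "enabled ($value)"
    fi
}

# To enable it until the next boot (as root):
#   echo 1 > /proc/sys/kernel/sysrq
# To make it permanent, set this in /etc/sysctl.conf:
#   kernel.sysrq = 1
```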

If your machine crashes without a panic message, however, you're out of luck. Wait until crash dumps are available - I'm surprised this isn't a 2.6 feature. Until we get crash dumps that work 99% of the time (like on Sparc-Solaris), Linux will continue to suck. At least it sucks less than the alternatives.
Some ideas by Gudlyf · 2003-09-15 07:19 · Score: 4, Informative
Mission Critical Linux does this.
There's also the LKCD (Linux Kernel Crash Dumps) package:
LKCD contains kernel and user level code designed to:
- Save the kernel memory image when the system dies due to a software failure;
- Recover the kernel memory image when the system is rebooted;
- Analyze the memory image to determine what happened when the failure occurred.
--
Trolls lurk everywhere. Mod them down.