Logging Unexpected Shutdowns/Crashes w/ Linux?

Posted by Cliff on Sunday September 14, 2003 @04:06PM from the finding-evidence-of-the-problem dept.

sweede asks: "I have a dedicated server that seems to reboot more often than it should. In Windows 2000/XP (maybe NT4.0?), if your computer or server crashes, it leaves an event message in the Event Viewer so you can review what went wrong. Is it possible to do something similar in Linux, where a power outage or an unexpected kernel panic will leave a message in /var/log/event (or whatever)? Searching Google for 'kernel trapping' doesn't give me a whole lot of info on the subject."

I'm pretty sure.. by Creepy+Crawler · 2003-09-14 16:11 · Score: 3, Informative

That the reason Linux doesn't write anything to the HD after a panic is so that it doesn't mangle/destroy the FS.

And if I'm correct, if you turn on serial console, you'll get a Panic output on serial. Add a serial2IP box and you're set.
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
1. Re:I'm pretty sure.. by Chester+K · 2003-09-14 16:40 · Score: 3, Interesting
  
  That the reason Linux doesn't write anything to the HD after a panic is so that it doesn't mangle/destroy the FS.
  
  Why not reserve a set place on the hard drive and write out error trap information there? There's no reason the filesystem needs to be involved at all. I'm going to guess that's what Windows does.
  
  --
  
  NO CARRIER
2. Re:I'm pretty sure.. by Creepy+Crawler · 2003-09-14 16:48 · Score: 3, Interesting
  
  OK. Then how do you guarantee the state of the kernel? If you use BIOS calls, it screws up the memmap even more. That's assuming you can even pass something like that.
  
  $100 question: How do you break out of inserted code that might have had a bug? How do you determine what code had that bug?
  
  Answer those, and then I'll trust a Write_after_system_crash API.
  --
  
  Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
Easy... by icemax · 2003-09-14 16:15 · Score: 3, Informative

'last reboot' should show you all the recent boots

--

__________
Love conquers all... except CANCER
1. Re:Easy... by Big+Jason · 2003-09-14 18:08 · Score: 4, Funny
  
  Not to be confused with last | reboot, which I've done before. Doh!
Logging Unexpected Shutdowns/Crashes w/ Linux? by krishnaD · 2003-09-14 16:16 · Score: 3, Informative

/var/log/messages and /var/log/syslog should give you enough info about the kernel. There are also lots of tools that enable various kinds of accounting; check sa.
Re:Flag it. by Creepy+Crawler · 2003-09-14 16:42 · Score: 3, Interesting

You fail to understand what happens to create the "Dirty Bit".

1: System starts up (say clean).
2: It marks a bit on the partition that system has been started up.
3: Usage Usage Usage
4: Send shutdown
5: System umounts cleanly. Undoes "dirty bit"
6: Power == 0

On a dirty FS, step 5 is never reached, so when the system comes back on, it checks the bit and detects the unclean shutdown. The bit is never written during the unclean shutdown.
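The steps above can be sketched as a toy model. This is only an illustration of the logic, not real filesystem code; the class and method names are made up for the example.

```python
class Filesystem:
    """Toy model of a filesystem 'dirty bit' (illustrative only)."""

    def __init__(self):
        # Stands in for the on-disk flag, which survives a power loss.
        self.dirty = False

    def mount(self):
        """Mount the filesystem; return True if the last session ended uncleanly."""
        was_unclean = self.dirty  # flag still set means step 5 never ran
        self.dirty = True         # step 2: mark the partition as in use
        return was_unclean

    def unmount(self):
        """Step 5: a clean unmount clears the flag."""
        self.dirty = False


fs = Filesystem()
print(fs.mount())  # first boot
fs.unmount()       # orderly shutdown
print(fs.mount())  # boot after a clean shutdown
# Power is lost here, so unmount() never runs.
print(fs.mount())  # boot after the unclean shutdown
```

Nothing is written at crash time; the unclean shutdown is detected purely because the clearing step was skipped.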

Similarly, I see problems when the NT kernel crashes. How exactly does it manage to:

1: Read the partition
2: Read the program on the partition
3: Run the insert log program to add log entry
4: Still have the "blue screen"

I smell nasty data corruption waiting to happen. After all, if you can't guarantee the state of the kernel, does it really justify reading, writing, and executing on a crashed kernel?
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
Re:Kernel Panic on Linux? Sounds like hardware pro by Drakon · 2003-09-14 16:43 · Score: 4, Funny

If you run 2.6.0-test6 with -mm15 and some home brewed patches, you can have crashes without hardware failure

(one who speaks from experience) :-)

--
Buttsex.
Other OSes by menscher · 2003-09-14 16:49 · Score: 5, Informative

This will probably be modded down as flame bait, but I can't resist pointing out what some other OSes have done when crashing:
IRIX will core dump to the swap partition. On the next boot it analyzes this core file, which includes various system logs, etc, and saves useful output in /var/adm/crash. You know you've done a good job when the kernel panic causes a panic, called a double panic. I used to be able to trigger those at will. Hrmm, I should test that on the current release.
AIX summarizes the likely causes of failure (power failure, someone pressed the power switch, or power supply died, etc). I've seen (but do not personally use) a similar thing with IRIX that actually assigns a percentage confidence level to its guess.
Of course, usually you know there was a power failure because your UPS told you so.... I did have one case where we had a very brief outage (or maybe just a brownout). Every machine in the building had rebooted.... except one. That RS/6000 had an eerie log message like "power failure detected". And no, it was not on a UPS. I was rather impressed.
Sadly, I don't know how to get any useful information out of Linux. And don't give me crap about it never crashing. I can prove otherwise. Too bad I can't figure out why.... Maybe a kernel developer will read this and copy some ideas from the commercial Unix vendors.
1. Re:Other OSes by FueledByRamen · 2003-09-14 17:33 · Score: 3, Interesting
  
  Of course, usually you know there was a power failure because your UPS told you so.... I did have one case where we had a very brief outage (or maybe just a brownout). Every machine in the building had rebooted.... except one. That RS/6000 had an eerie log message like "power failure detected". And no, it was not on a UPS. I was rather impressed.
  I had a similar interesting experience with an SGI Indy (Irix 6.5.13, or thereabouts). I was booting it up after it'd been sitting for a while, just to see what I had running on there. While it was going, and I was fumbling around for an ethernet cable for it (it takes several minutes at boot to wait for a cable instead of noting its absence and moving on), I kicked the power strip that it was on and the plug wiggled around in the wall socket. I heard a spark jump in the socket, and the monitor it was on (Dell/Sony Trinitron 19") went to half-height mode for a few seconds, spitting and clicking, turning the screen on and off and varying the vertical height randomly.
  
  I expected the Indy to kernel panic or turn off. Instead, below the complaints about the missing ethernet cable ("en0: link carrier not detected" or similar), there was a lone status message: "Power failure detected."
  
  No UPS, no power saving devices of any kind, only the filter caps in the power supply between the logic board and the unreliable, crufty power system of a 70 year old house at the mercy of a power strip first used on my (brand new at the time) Atari 800. The other computer on the power strip (350 P2 running RH 7.1) rebooted hard, right in the middle of heavy FS activity. I had to hit the reset button before it would come back up again, too - the brownout hung the POST.
  
  --
  Every cloud has a silver lining (except for the mushroom shaped ones, which have a lining of Iridium & Strontium 90)
2. Re:Other OSes by Ster · 2003-09-14 18:06 · Score: 3, Informative
  
  Mac OS X writes a crash dump to the non-volatile RAM in the event of a panic. Then, after the next successful boot, it reads out the dump and adds it to /Library/Logs/panic.log. If, for some reason, the machine won't come back up, you can probably read the dump from OpenFirmware.
  
  -Ster
3. Re:Other OSes by kinema · 2003-09-14 18:24 · Score: 3, Interesting
  
  I wonder if /dev/nvram (the small amount of NVRAM available on the RTC) is large enough to store such a dump.
4. Re:Other OSes by anthony_dipierro · 2003-09-14 18:37 · Score: 4, Informative
  
  IRIX will core dump to the swap partition.
  
  FreeBSD does this. HP/UX does this. I always assumed Linux did it too, it just wasn't turned on by default. I guess I was wrong.
  
  As a side note, my first job out of college was to analyze core dumps from HP/UX. There's an awful lot you can learn from these things. Not just stack traces, the entire memory of the system is contained in the dump. It's time consuming, but a large portion of the time you can find out *exactly* what went wrong.
5. Re:Other OSes by isorox · 2003-09-14 20:59 · Score: 3, Funny
  
  Nah, if kernel developers read slashdot, nothing would get done!
Try the Linux Kernel Crash Dump (LKCD) patches by bigsteve@dstc · 2003-09-14 17:02 · Score: 5, Informative

If you are adventurous, you could try applying the LKCD patches to your kernel. Start looking here
"Linux crashes" are probably contact failure. by Futurepower(R) · 2003-09-14 19:37 · Score: 3, Interesting

As others have said, the "Linux crash" is probably hardware failure.

The most common cause of serious failure, if the software has been installed correctly and tested, is bad contacts. To fix the problem, just loosen the screws that hold the adapter cards, pull the cards out about 1 millimeter or 1/32 of an inch, push the cards back in fully, and re-tighten the screws. Also, pull all connectors off a similar amount, and push them back on. Do the same with the memory modules. That's all.

The scraping caused by moving the contact points a tiny amount is actually very violent on a micro scale. The scraping removes oxide that causes a contact to lose electrical conduction.

This is reliable information. I've been selling and occasionally repairing PCs since before IBM sold PCs, back in the days when personal computers cost $2300, had two diskette drives and no hard drive, and ran the CP/M operating system.

My guess is that, if you had a penny for every real crash of a stable distribution of Linux, after a few years you might still have to borrow money from your little brother to buy a piece of bubble gum.
Depends on what it's doing by Sits · 2003-09-14 21:18 · Score: 3, Informative

To the best of my knowledge, Linux doesn't automatically reboot after a kernel crash unless you have told it to. If the crash was that severe, this means you can walk up to the crashed machine and read the oops off the screen. If the machine isn't oopsing before the reboot, this suggests some sort of hardware fault (e.g. your CPU is overheating). If it is hardware resetting the machine, it is very unlikely that Linux can tell you what the fault is by itself (e.g. if it was the CPU overheating, you will have to find some way to log the temperature to a file and observe the graph up to the crash yourself).
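As a rough sketch of the "log the temperature to a file" idea: the sensor path below is an assumption (the sysfs thermal zone file found on many current kernels, reporting millidegrees Celsius), and the log file name and function names are made up for illustration. Each sample is flushed and synced so the last reading has a chance of surviving a hard reset.

```python
import os
import time

# Assumption: this sysfs file exists and reports millidegrees Celsius.
SENSOR = "/sys/class/thermal/thermal_zone0/temp"


def format_sample(millidegrees, when):
    """Turn a raw reading and a Unix timestamp into one log line (UTC)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(when))
    return f"{stamp} {millidegrees / 1000:.1f}C\n"


def log_temperature(logfile, sensor=SENSOR):
    """Append one sample and force it to disk before returning."""
    with open(sensor) as f:
        raw = int(f.read().strip())
    with open(logfile, "a") as out:
        out.write(format_sample(raw, time.time()))
        out.flush()
        os.fsync(out.fileno())


def run(logfile, interval_seconds=10):
    """Sample forever; meant to be started from an init script or similar."""
    while True:
        log_temperature(logfile)
        time.sleep(interval_seconds)
```

Calling `run("/var/log/cputemp.log")` (a hypothetical path) would leave a trail of readings; if the last lines before each reboot climb steadily, overheating becomes a plausible suspect.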

Oh and here's a useful way of working out whether there was a crash or not:
last -x | grep "shutdown\|reboot"
Every reboot that doesn't have a matching shutdown was probably a crash (other than the last line).
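The matching rule can be automated along these lines. This is a sketch that assumes `last -x` prints newest entries first and starts boot and shutdown records with `reboot` and `shutdown`; the sample lines are invented for illustration, not real output.

```python
def count_unclean_boots(last_lines):
    """Count boots that have no shutdown record immediately before them.

    `last_lines` is `last -x` output, newest entry first.
    """
    kinds = []
    for line in last_lines:
        fields = line.split()
        # Keep only boot/shutdown records; skip users, runlevels, blanks.
        if fields and fields[0] in ("reboot", "shutdown"):
            kinds.append(fields[0])

    unclean = 0
    # In newest-first order, a "reboot" followed directly by another
    # "reboot" means no shutdown was logged between the two boots.
    # The oldest record is never the "newer" half of a pair, so it is
    # ignored, matching the "other than the last line" caveat.
    for newer, older in zip(kinds, kinds[1:]):
        if newer == "reboot" and older == "reboot":
            unclean += 1
    return unclean


# Invented sample: the Sep 14 boot has no shutdown before it.
sample = """\
reboot   system boot  2.4.20-8  Sun Sep 14 15:58
reboot   system boot  2.4.20-8  Sat Sep 13 02:11
shutdown system down  2.4.20-8  Sat Sep 13 02:09
reboot   system boot  2.4.20-8  Fri Sep 12 09:30
""".splitlines()

print(count_unclean_boots(sample))
```

A nonzero count only says a shutdown record is missing; it does not distinguish a kernel panic from a power cut or a pressed reset button.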
Here's how: by samjam · 2003-09-14 21:19 · Score: 4, Informative

1) First, disable console blanking; that way, when you get to the crashed box and plug the monitor in, you can see the kernel panic message:
/usr/sbin/setterm -blank 0 -powersave off -powerdown 0

We had some early kernel 2.4 Red Hat boxes crashing like the dickens for a while. It was a kernel problem, and only when it happened on a local machine under our eyes did we realise what had happened.

2) Network syslog:
If you syslog to a central machine, not only does it make error spotting centralised and easier, but it also means you have the last gasps of the crashed machine logged on a machine that is still up.
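To show what travels over the wire with network syslog, here is a minimal sketch of a sender. It assumes the classic setup of plain UDP on port 514 with a `<PRI>` prefix (facility * 8 + severity); the tag, message, and function names are placeholders, and in practice the system's own syslog daemon would normally do the forwarding.

```python
import socket


def build_syslog_packet(facility, severity, tag, message):
    """Build a minimal BSD-syslog style datagram: <PRI>tag: message."""
    pri = facility * 8 + severity
    return f"<{pri}>{tag}: {message}".encode("utf-8")


def send_to_loghost(packet, host, port=514):
    """Fire-and-forget one UDP datagram at the central log host."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


# Facility 1 (user-level), severity 2 (critical) gives PRI 10.
packet = build_syslog_packet(1, 2, "watchdog", "still alive")
print(packet)
```

Something like `send_to_loghost(packet, "loghost.example.com")` (a hypothetical host name) would deliver it. Because UDP needs no handshake, a machine in trouble can often still get a final message out, although delivery is never guaranteed.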

Sam

--
blog.sam.liddicott.com
serial console by treat · 2003-09-14 21:47 · Score: 3, Informative

A serial console (make sure you enable the magic sysrq key! for some reason RedHat disables it by default) is an essential tool for any Linux server you care about. If you don't have the money for a console server, just plug servers into each other.

If your machine crashes without a panic message, however, you're out of luck. Wait until crash dumps are available - I'm surprised this isn't a 2.6 feature. Until we get crash dumps that work 99% of the time (like on Sparc-Solaris), Linux will continue to suck. At least it sucks less than the alternatives.
The cleaning team by Tux2000 · 2003-09-15 00:10 · Score: 3, Funny

"Now, where was the power outlet for the vacuum cleaner? Hell, I'll tear out that red cable and plug the vacuum cleaner there."

--
Denken hilft.
Yes! by twistedcubic · 2003-09-15 03:14 · Score: 4, Funny

Is it possible to do something similar in Linux?

Yeah, but we have to wait until our SCO insider funnels us the code.
Some ideas by Gudlyf · 2003-09-15 07:19 · Score: 4, Informative

Mission Critical Linux does this.

There's also the LKCD (Linux Kernel Crash Dumps) package:

LKCD contains kernel and user level code designed to:
- Save the kernel memory image when the system dies due to a software failure;
- Recover the kernel memory image when the system is rebooted;
- Analyze the memory image to determine what happened when the failure occurred.
--
Trolls lurk everywhere. Mod them down.