Logging Unexpected Shutdowns/Crashes w/ Linux?
sweede asks: "I have a dedicated server that seems to reboot more often than it should. In Windows 2000/XP (maybe NT4.0?), if your computer or server crashes it will leave an event message in the Event Viewer for you to review on what went wrong. Is it possible to do something similar in Linux? Where a power outage or an unexpected kernel panic will leave a message in /var/log/event (or whatever)? Searching Google for 'kernel trapping' doesn't give me a whole lot of info on the subject."
The reason Linux doesn't write anything to the HD after a panic is so that it doesn't mangle/destroy the FS.
And if I'm correct, if you turn on serial console, you'll get a Panic output on serial. Add a serial2IP box and you're set.
'last reboot' should show you all the recent boots
__________
Love conquers all... except CANCER
/var/log/messages and /var/log/syslog should give you enough info about the kernel. There are also lots of tools for enabling various kinds of accounting; check sa.
Same way they know to fsck/chkdsk the drives: if a 'dirty bit' (or file, in your case) exists during boot, shutdown was unclean - log it. Otherwise create it. Only clear it as the last step of a clean shutdown.
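A minimal sketch of that flag-file idea, in Python. The paths and function names are made up for illustration; on a real box you would call the boot step from an init script and the shutdown step as the very last action of a clean shutdown.

```python
import os
import time

def on_boot(flag_path, log_path):
    """Run early at boot. Returns True if the previous shutdown was unclean."""
    unclean = os.path.exists(flag_path)
    if unclean:
        # The flag survived, so the clean-shutdown step never ran.
        with open(log_path, "a") as log:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(stamp + " unclean shutdown detected\n")
    # (Re)create the flag so the next boot can make the same check.
    with open(flag_path, "w") as flag:
        flag.write(str(int(time.time())) + "\n")
    return unclean

def on_clean_shutdown(flag_path):
    """Run as the last step of a clean shutdown: clear the dirty flag."""
    if os.path.exists(flag_path):
        os.remove(flag_path)
```

This only tells you that the shutdown was unclean, not why. A power cut, a panic, and a hardware reset all look the same to it.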
Opportunity knocks. Karma hunts you down.
After 10 years without ever needing to apply the knowledge I forgot how. Would the magic sysrq key help? I bet it is a hardware problem though. And what about logging power outages? That is easy to do. APC probably has Linux software already to do this. For other logging there are ample facilities on Linux. Start a syslog server. Point everything to the loopback address.
If you run 2.6.0-test6 with -mm15 and some home brewed patches, you can have crashes without hardware failure
:-)
(one who speaks from experience)
Buttsex.
IRIX will core dump to the swap partition. On the next boot it analyzes this core file, which includes various system logs, etc, and saves useful output in /var/adm/crash. You know you've done a good job when the kernel panic causes a panic, called a double panic. I used to be able to trigger those at will. Hrmm, I should test that on the current release.
AIX summarizes the likely causes of failure (power failure, someone pressed the power switch, or power supply died, etc). I've seen (but do not personally use) a similar thing with IRIX that actually assigns a percentage confidence level to its guess.
Of course, usually you know there was a power failure because your UPS told you so.... I did have one case where we had a very brief outage (or maybe just a brownout). Every machine in the building had rebooted.... except one. That RS/6000 had an eerie log message like "power failure detected". And no, it was not on a UPS. I was rather impressed.
Sadly, I don't know how to get any useful information out of linux. And don't give me crap about it never crashing. I can prove otherwise. Too bad I can't figure out why.... Maybe a kernel developer will read this and copy some ideas from the commercial Unix vendors.
If you are adventurous, you could try applying the LKCD patches to your kernel. Start looking here
Exactly what parallel universe are you living in? I've never ever gotten a useful event log after NT/2K goes BSOD.
Apparently, he lives in the same parallel universe I do. I suppose you think the checkbox in Startup and Recovery labeled "Write an event to the system log" is there for looks?
I also bet he is experiencing a hardware problem. Did he run memtest86? fsck? Were they clean?
--
"Those who cast the votes decide nothing; those who count the votes decide everything." - Josef Stalin
If you want to be emailed if the system reboots, put something at the end of /etc/rc.d/rc.local, if you're using something like RedHat (SYSV init, IIRC).
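Something along these lines could be called from the end of rc.local. It is only a sketch: the function names, the "root@hostname" sender address, and the assumption that a mail server is listening on localhost are all mine, not anything the distribution provides.

```python
import smtplib
import socket
import time
from email.message import EmailMessage

def build_boot_notice(recipient, hostname=None, when=None):
    """Build a short 'this machine just booted' email."""
    hostname = hostname or socket.gethostname()
    when = when or time.strftime("%Y-%m-%d %H:%M:%S")
    msg = EmailMessage()
    msg["From"] = "root@" + hostname  # assumed sender address
    msg["To"] = recipient
    msg["Subject"] = "%s booted at %s" % (hostname, when)
    msg.set_content("If nobody rebooted this machine on purpose, go look at it.\n")
    return msg

def send_boot_notice(recipient, smtp_host="localhost"):
    """Hand the notice to a local mail server (assumed to be running)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(build_boot_notice(recipient))
```

Like any boot-time hook, this fires on every boot, planned or not, so it reports that a reboot happened rather than why.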
Logwatch will probably let you know if the system rebooted also.
If you want a log of the kernel panic, or something else, that's a lot more complicated, as others have mentioned.
Nothing to see here; Move along.
Solaris does the same thing. Actually, I think several commercial Unixes do this. Some even provide some basic analysis tools so that you can pore over the /var/wherever/crash dumps yourself; see which processes were running, which ones were on the CPUs when it crashed, which instruction was executing, etc.
I've always been disappointed that this hasn't been part of Linux. Copying down OOPS text by hand onto paper and then typing it back in after the reboot is needlessly difficult. I don't have terminals sitting around for serial output. I've heard rumours that something like the save-to-swap-space facilities are finally going in, or that there are patches available for the DIY'ers.
And in my particular case, I'm not sure it would help anyhow. My desktop machine occasionally just goes *click* and reboots. If it tries to panic, it may not get time, I dunno, I'm not here to watch it. I do know that when I have gotten OOPSes, I usually don't bother trying to send a useful report in to lkml, because I don't have pen and paper around.
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
As others have said, the "Linux crash" is probably hardware failure.
The most common cause of serious failure, if the software has been installed correctly and tested, is bad contacts. To fix the problem, just loosen the screws that hold the adapter cards, pull the cards out about 1 millimeter or 1/32 of an inch, push the cards back in fully, and re-tighten the screws. Also, pull all connectors off a similar amount, and push them back on. Do the same with the memory modules. That's all.
The scraping caused by moving the contact points a tiny amount is actually very violent on a micro scale. The scraping removes oxide that causes a contact to lose electrical conduction.
This is reliable information. I've been selling and occasionally repairing PCs since before IBM sold PCs, back in the days when personal computers cost $2300, had two diskette drives and no hard drive, and ran the CP/M operating system.
My guess is that, if you had a penny for every real crash of a stable distribution of Linux, after a few years you might still have to borrow money from your little brother to buy a piece of bubble gum.
When will M$ learn that their crappy WinDoze software is just a bug ridden mess of......excuse me, what? Linux....crash??!?! I have dreamed a dream and now that dream is gone....
Murphy's law will apply and the thing that causes your system to crash won't be trapped by whatever magic you try to log it with! We recently had a machine that would just power-down without warning. I eventually discovered it happened after intensive CPU load for about 20 mins, figured maybe it was some heating problem, kicked up the sensors package and spotted the CPU temp heading into egg-frying temperatures. It seems the BIOS would just protect its motherboard by shutting down. The kernel had no chance to report anything.
At work, we used to have a SUN E250 [1]. One day, the power went away. (Turned out to be a problem with the airco.) After the power came back, I checked the logs, and I saw that the machine had been writing messages like "Power lost, running on backup power", and when the power came back "Switching to AC". (The monitor wasn't protected, so there was no way to check the machine during the blackout.)
The PC in the same room didn't have backup power and shut down. It came back with no ill effects.
The RS/6000 in your story probably had a power monitor and a large voltage protector, or maybe it had its own UPS as well.
[1] We still have it somewhere, but I'm no longer using it.
WWTTD?
To the best of my knowledge Linux doesn't automatically reboot after a kernel crash unless you have told it to. If the crash was that severe this means you can walk up to the crashed machine and read the oops off the screen. If the machine isn't oopsing before the reboot this suggests some sort of hardware fault (e.g. your CPU is overheating). If it is hardware resetting the machine it is very unlikely that Linux can tell you what the fault is by itself (e.g. if it was the CPU overheating you will have to find some way to log the temperature to a file and observe the graph up to the crash yourself).
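For the temperature-logging idea, here is a rough Python sketch. It assumes the sensor shows up as a file containing millidegrees Celsius (on many newer kernels that is something like /sys/class/thermal/thermal_zone0/temp, but where your board exposes it, if at all, is an assumption you have to check). The function names and the log format are invented for the example.

```python
import os
import time

def read_temp_c(sensor_path):
    """Read a sensor file that holds millidegrees Celsius as plain text."""
    with open(sensor_path) as f:
        return int(f.read().strip()) / 1000.0

def log_temp(sensor_path, log_path):
    """Append one timestamped reading and push it to disk right away."""
    temp = read_temp_c(sensor_path)
    with open(log_path, "a") as log:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        log.write("%s %.1f\n" % (stamp, temp))
        # Flush and fsync so the last reading has a chance of surviving
        # a sudden reset or power-off.
        log.flush()
        os.fsync(log.fileno())
    return temp
```

Run it from cron or a small loop every few seconds; the last few lines in the log before the gap are the interesting ones.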
Oh and here's a useful way of working out whether there was a crash or not:
last -x | grep "shutdown\|reboot"
Every reboot that doesn't have a matching shutdown was probably a crash (other than the last line).
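The same check can be scripted. A small Python sketch, assuming `last -x` prints newest entries first with "reboot" or "shutdown" as the first word of each line (the function names are made up for the example):

```python
import subprocess

def entry_kind(line):
    """Return the first word of a `last -x` line, or '' for a blank line."""
    fields = line.split()
    return fields[0] if fields else ""

def find_suspect_boots(last_x_lines):
    """Return 'reboot' lines whose previous event was not a 'shutdown'.

    Input is `last -x` output, newest entry first. The oldest entry is
    skipped because there is nothing before it to compare against.
    """
    events = [ln for ln in last_x_lines
              if entry_kind(ln) in ("reboot", "shutdown")]
    suspects = []
    for i, line in enumerate(events):
        if entry_kind(line) != "reboot":
            continue
        # events[i + 1] is whatever happened just before this boot.
        if i + 1 < len(events) and entry_kind(events[i + 1]) == "reboot":
            suspects.append(line)
    return suspects

def read_last_x():
    """Run `last -x` and return its output as a list of lines."""
    result = subprocess.run(["last", "-x"], capture_output=True,
                            text=True, check=True)
    return result.stdout.splitlines()
```

Calling find_suspect_boots(read_last_x()) lists the boots that probably followed a crash or power loss. It can't see past a rotated wtmp file.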
1) First disable console blanking, so that when you get to the crashed box and plug the monitor in you can see the kernel panic message:
/usr/sbin/setterm -blank 0 -powersave off -powerdown 0
We had some early kernel 2.4 redhat boxes crashing like the dickens for a while, it was a kernel problem and only when it happened on a local machine under our eyes did we get to realise what had happened.
2) Network syslog;
If you syslog to a central machine not only does it make error spotting centralised and easier but it means you have the last gasps of the crashed machine logged on a machine that is still up.
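Normally you set this up in the syslog daemon itself, but the idea is easy to show from a script as well. A minimal Python sketch using the standard library's SysLogHandler over UDP; the logger name and the idea of sending a periodic "heartbeat" are assumptions for the example, and the host is whatever your central log machine is.

```python
import logging
import logging.handlers

def make_remote_logger(host, port=514, name="crashwatch"):
    """Return a logger that sends each record to a remote syslog server (UDP)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```

Something like make_remote_logger("loghost").info("heartbeat") from cron gives the central machine a timestamped trail, so the point where the heartbeats stop tells you roughly when the box died. UDP syslog is fire-and-forget, so a lost packet is silently dropped.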
Sam
blog.sam.liddicott.com
Er, how about the 2.4 series kernel with high disk I/O and quotas turned on. That will get you random panics, no problems.
A serial console (make sure you enable the magic sysrq key! for some reason RedHat disables it by default) is an essential tool for any Linux server you care about. If you don't have the money for a console server, just plug servers into each other.
If your machine crashes without a panic message, however, you're out of luck. Wait until crash dumps are available - I'm surprised this isn't a 2.6 feature. Until we get crash dumps that work 99% of the time (like on Sparc-Solaris), Linux will continue to suck. At least it sucks less than the alternatives.
Although not really capable of providing an audit of reboots (for a variety of reasons, already outlined above), Snare for Linux (google for 'snare') is roughly analogous to the Windows Event log.
Snare is capable of monitoring events such as file opens, execve's, setuid/setgid and so on, which may assist in tracking down the problem.
Red.
"Now, where was the power outlet for the vacuum cleaner? Hell, I'll tear out that red cable and plug the vacuum cleaner there."
Thinking helps.
Part of the system startup would scan the dump for the logging buffers, extract the messages and append them to the log file. The file system would have been recovered at this point so a corrupt disk isn't a problem.
I don't know how NT does it, but I would guess something similar given that the architect and some of his team were refugees from Digital's central engineering.
Regrettably, the kernel dump project for linux is somewhat 'on-hold' as Linus would rather follow what the big vendors do (those offering enterprise support will need crash dumps for diagnostic purposes).
See my journal, I write things there
If you're running test kernels (and kernel-hacker specific patchsets to boot!) on a production server, you should be shot. Or at least demoted.
Nonsense.
I have a lot of experience fixing hardware failures. Before I started doing computer work exclusively, I was an electronics design engineer. So, I'm able to understand hardware issues, and have something to contribute in that area.
Perhaps Slashdot people don't have much experience with hardware problems, and are skeptical of anyone who does, because answers to hardware problems are not usually modded up, and are often attacked.
Linux has millions of technically knowledgeable users. Those users know that, if they report a problem with crashing accurately, it will be fixed. I've never reported a Linux crash because I've never seen one. However, I did report a crash in Mozilla before breakfast one day, and the bug was fixed just after breakfast. Linux developers are the same way. So, it is common that users report literally years of uptime.
Now, what chance is there that the person who wrote the Slashdot story is seeing multiple real crashes, due to badly written software in Linux itself, instead of bad hardware or a poorly selected hardware driver? That chance is very, very small, given the circumstances.
I had another problem: the entire machine used to freeze whenever I rsynced the entire /dev/sda2. I checked everything (memory, fsck, hdparm) and none of them showed any problem, but when I just changed the HDD cable everything worked fine.
Is it possible to do something similar in Linux?
Yeah, but we have to wait until our SCO insider funnels us the code.
LTT logs every system call at ns precision into a RAM buffer and then to disk. The events include, for instance, read/write/open operations, system calls, interrupts, process state, disk and network interface operations, and so on. You can add specific events by modifying your application and recompiling it with the LTT library.
LTT is not yet included in the kernel and was not chosen after the "Halloween Freeze". However, the new infrastructure can operate in a "flight recorder" mode that will, for instance, log the last 5 Mb of events that happen on the system.
Of course, when there is a kernel crash, you cannot be certain to have those events on disk, but this is a chicken-and-egg problem.
Anyway, I believe this kind of functionality is in demand for most critical applications. This is very important in the embedded market too, where debugging and optimization are very painful.
I don't know if this applies to your situation but it can't hurt to check the fan. My work Linux machine got in the habit of crashing for a bit. Turned out the CPU fan wasn't working. I haven't had a crash in months now that I've fixed it.
Anyway, Linux machines rarely crash in my experience, and my top suspect is usually hardware when they do.
So what? I click the little checkbox, wait for my next BSOD, and voila, I've got something useful? Head shake time, unless your idea of useful is finding out that WinDoze is the problem and you have to wait for M$ to fix it. Bottom line, I'd rather have to work a little harder to find the problem, and be able to fix it, than to have the problem spelled out in plain English and be at the mercy of the three monkeys in Redmond:
See no source
Hear no source
Speak no source.
"Talk minus action equals nothing" - Joey Shithead, D.O.A.
Something is wrong with that. See below output of my headless EPIA file/print server locked away somewhere deep and dark..
robtu@astra:~$ last reboot
wtmp begins Thu Sep 4 09:47:50 2003
robtu@astra:~$ uptime
21:12:13 up 235 days, 1:48, 1 user, load average: 0.00, 0.00, 0.00
robtu@astra:~$
To Terminate, or not to Terminate, that's the question - SCSIROB
There's also the LKCD (Linux Kernel Crash Dumps) package:
LKCD contains kernel and user level code designed to:
Trolls lurk everywhere. Mod them down.
You can use a serial console or try out some version of the netconsole patch to get the messages on another computer. (Notice that netconsole over the internet is probably possible, but it is sent in clear and can be snooped or modified). I also recall reading about some patch to keep a new kernel ready in memory that could be booted with arguments telling it where to find the log from the old kernel, I even think it included a checksum to prevent booting the new kernel if it had been corrupted.
Do you care about the security of your wireless mouse?
Unexpected shutdowns? Crashes? You must be mistaken. Linux does not crash. Ever.
Now, what was your name and address again?
--
viqsi - See "vixen"
If we do not change our direction we are likely to end up where we are headed.
Actually, the "magic sysrq key" is disabled by default for a damn good reason.
The "magic SysRq key" is a key sequence that allows some basic commands to be passed directly to the kernel. Kernel software developers use this interface to debug their software. Under most circumstances it can also be used to uncleanly reboot the computer, something that is otherwise difficult or expensive to do remotely.
Anyone can dial into a modem and send a break, so if the serial console is attached to a modem we need to disable the magic SysRq key.
So, the SysRq key is disabled because it can be used (remotely) to do bad things, like an unclean shutdown, something you probably don't want people to do with your servers. (Only under certain circumstances -- but it's likely that one wouldn't remember about sysrq, it being mostly unused and all.)
Quotes from The Linux Documentation Project www.tldp.org
You can use a central NetDump server to collect oops message and a dump of physical memory of every Linux box on your network...
r edhat/netdu mp/
t Dump
Check out
http://www.redhat.com/support/wpapers/
another link with the lkcd patches
https://projects.clusterfs.com/lustre/Ne
Type last.
You won't be able to find out why without taking a stroll down /var/log/messages and guessing what led up to the problem, but then again, how often does that happen anyway?
Quick tip, try checking your irq's..
DRACO-
Consider yourself blessed if you are sneezed on by a dragon and only get wet, it could have been a fireball.