Logging Unexpected Shutdowns/Crashes w/ Linux?

I'm pretty sure.. by Creepy+Crawler · 2003-09-14 16:11 · Score: 3, Informative

That the reason Linux doesn't write anything to the HD after a panic is so that it doesn't mangle/destroy the FS.

And if I'm correct, if you turn on serial console, you'll get a Panic output on serial. Add a serial2IP box and you're set.
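
A rough sketch of the serial console setup, assuming the first serial port (ttyS0) at 115200 baud; both values are assumptions, so adjust them to your hardware. The kernel takes `console=` parameters on its boot command line. The helper `has_serial_console` is a made-up name for illustration and only checks whether a given command line names a serial console.

```sh
# Boot parameters, appended to the kernel line in the bootloader config:
#   console=tty0 console=ttyS0,115200n8
# The port (ttyS0) and speed (115200) are assumptions.

# has_serial_console CMDLINE
# Succeeds if the given kernel command line names a ttyS console.
has_serial_console() {
    case " $1 " in
        *" console=ttyS"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a running system:
#   has_serial_console "$(cat /proc/cmdline)" && echo "serial console enabled"
```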

--

Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37

Re:I'm pretty sure.. by Chester+K · 2003-09-14 16:40 · Score: 3, Interesting

That the reason Linux doesn't write anything to the HD after a panic is so that it doesn't mangle/destroy the FS.

Why not reserve a set place on the hard drive and write out error trap information there? There's no reason the filesystem needs to be involved at all. I'm going to guess that's what Windows does.

--

NO CARRIER
Re:I'm pretty sure.. by bobthemonkey13 · 2003-09-14 16:43 · Score: 2, Informative

Or a dot-matrix printer. Seriously, I did this for a while; you can turn on console-on-LPT support in your kernel config, and pass a parameter with your bootloader. It takes a while for the stupid thing to display all the kernel messages at boot, but the sound is priceless. Sadly, the 20-some-year-old printer decided to kick the bucket (still lasted longer than my HP DeskJet, thank you very much), so I switched to a 286 laptop running Minix 1.5 and term, which might be a cheap way to implement the serial2IP idea.
Re:I'm pretty sure.. by Creepy+Crawler · 2003-09-14 16:48 · Score: 3, Interesting

OK. Then how do you guarantee the state of the kernel? If you use BIOS calls, it screws up the memmap even more. That's assuming you can even pass something like that.

100$ question: How do you break out of code inserted that might have had a bug? How do you determine what code had that bug?

Answer those, and then I'll trust Write_after_system_crash api
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
Re:I'm pretty sure.. by Creepy+Crawler · 2003-09-14 16:51 · Score: 2, Insightful

Too true. I knew about those options too, but if he wants logging, he can make a cheap logserver that archives these problems across the whole network.
If it's only 1 computer, I'd probably use a real terminal or a cheapie like your minix box or printer.
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
Re:I'm pretty sure.. by Anonymous Coward · 2003-09-14 17:00 · Score: 2, Informative

Why not reserve a set place on the hard drive and write out error trap information there?

The IDE/SCSI driver could still corrupt your data. How would you know where to write the info anyway? The kernel could easily calculate and store a block number, but could it trust the stored number after panic() is called? Maybe somebody overwrote that variable, or maybe some bad RAM caused it to spontaneously change.

Kernel panics are generally used as a last resort, when something goes really wrong and there's no sane way to handle it (at least in theory). Deciding whether it's safe to write to disk would be difficult.

Anyway, there are kernel patches that save crash dumps to your swap space or even your video RAM (search for something like "linux crash dump"). I wouldn't use them all the time, but they might be useful if you know your RAM is OK and don't suspect disk/IDE problems. A serial console is safer, but not always practical.
Re:I'm pretty sure.. by bucky0 · 2003-09-14 17:15 · Score: 1

Why don't you just write to the swapfile? If you're on the way to rebooting, nothing gets hurt by munging up the swapfile some. Of course, if your hard drive drivers crash, it's no good, but for all other cases, it sounds like a good idea.

--

-Bucky
Re:I'm pretty sure.. by FueledByRamen · 2003-09-14 17:59 · Score: 2, Interesting

OK, I've got an idea. When it panics, it reruns the bootloader (use BIOS calls to read the first HD sector and go from there) and passes it some special flags which basically say "I did a bad thing, clean up after me." The bootloader will unpack another set of routines (checksummed for quality) in the same way it loads the Linux kernel off of the HD, and place them into an area of RAM that's hopefully not used by anything kernel related (app space). It will then read in the pagetables and other info still resident in RAM (use the Linux kernel on the HD for reference / symbol tables, or rebuild the crashdump app at the same time as the kernel with the same memory offsets and values), and formulate a meaningful crashdump. It'll then read in the partition table / slicetable / disk label / whatever, and find and write over the swap partition with the dump (making sure that the swap partition actually is a swap partition - read its header and such, just in case the partition table was mangled). It will then reboot the computer. Upon reboot, Linux will pick up on the swap partition containing a crash dump (changed magic number?) and copy it to a file on the HD, then reformat the swap partition and mount it as normal, making a note in the syslogs that it crashed and the crashdump can be found at $LOCATION. (And maybe pass control to a different rc file, for a limited or debugging sysinit.)
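
The "read its header" check can be sketched from userspace. This assumes the version 2 Linux swap format, which stores the signature "SWAPSPACE2" in the last 10 bytes of the first page, and a 4 KiB page size as on x86. The function name `is_swap_area` is hypothetical, and this only illustrates the sanity check, not the dump code itself.

```sh
# is_swap_area PATH
# Succeeds if PATH carries the signature "SWAPSPACE2" at byte offset 4086,
# i.e. the last 10 bytes of the first 4096-byte page.
# Assumes a 4 KiB page size and the version 2 swap format.
is_swap_area() {
    # Strip NUL bytes so a zero-filled area yields an empty string.
    sig=$(dd if="$1" bs=1 skip=4086 count=10 2>/dev/null | tr -d '\000')
    [ "$sig" = "SWAPSPACE2" ]
}

# Usage (device name is a placeholder):
#   is_swap_area /dev/hda2 && echo "looks like swap"
```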

--
Every cloud has a silver lining (except for the mushroom shaped ones, which have a lining of Iridium & Strontium 90)
Re:I'm pretty sure.. by anthony_dipierro · 2003-09-14 18:31 · Score: 2, Insightful

What's the reason that just about every other unix does write to the HD after panic?
Re:I'm pretty sure.. by You're+All+Wrong · 2003-09-14 20:45 · Score: 2, Insightful

Why do you trust a kernel that has got its knickers in a twist to be able to know where the swap partition is?

I'd be happier with it writing to a floppy, serial, or other isolated subsystem. The difference between your swap partition and your root directory structure might be just 0x10000 in one of the register values, and that's considered too close to be worth risking.

YAW.

--
Your head of state is a corrupt weasel, I hope you're happy.
Re:I'm pretty sure.. by Tux2000 · 2003-09-15 00:02 · Score: 2, Interesting

When it panics, it reruns the bootloader (use BIOS calls to read the first HD sector and go from there) [...]

When Linux panics, it usually has a good reason to do so. Something like a damaged descriptor table, overwritten kernel code, hardware that works wrong, and various other catastrophes. Panic means a real panic: You can not reliably use any hardware. So you can not rerun the bootloader, and you can not access the BIOS. You can only hope that a hardware watchdog card notices that the kernel has panicked (because its timeout counter is no longer reset) and reboots the machine.

(BTW: to access the boot loader and the BIOS, you probably would have to drop out of protected mode back into the ugly world of real mode (or V86 mode), causing even more P.I.T.A.)

--
Denken hilft.
Re:I'm pretty sure.. by Wakko+Warner · 2003-09-15 04:08 · Score: 1

This is why things I like to call "more thought-out" operating systems (Solaris, AIX, IRIX, etc) allow for a separate raw partition simply to write out system dumps for later analysis, safe from the dangers of filesystem corruption. You don't know how useful these tools are until you've had to use them.

- A.P.

--
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
Re:I'm pretty sure.. by Polo · 2003-09-15 07:18 · Score: 1

Why not reserve a set place on the hard drive and write out error trap information there?

Funny suggestion. That's exactly what Solaris does. It writes its kernel core dumps to the swap partition.
Re:I'm pretty sure.. by ctr2sprt · 2003-09-15 07:37 · Score: 1

Obviously it can be done, since pretty much every Unix and clone except Linux already does it. And nobody says you have to trust being able to make crash dumps. Why wouldn't it be a tunable setting? If you don't like it, turn it off. Fact is, some people would find this sort of feature really useful, so why deprive them of it when adding it won't affect you at all?
I suppose I shouldn't be surprised. Linux doesn't implement a feature, ergo that feature is not worth implementing. I'm just glad the people on Slashdot aren't making any decisions on what goes into Linux, or it would be unchanged from what it was 5 years ago.
Re:I'm pretty sure.. by alex_ant · 2003-09-15 09:07 · Score: 1

That would be too elegant and convenient. Linux/Unix doesn't do either of those because they're for morons who like to point and click at stuff like the malformed children they are.
Re:I'm pretty sure.. by idontgno · 2003-09-15 11:52 · Score: 1

Good point. I've been a Solaris admin since God was a corporal, and the crashdump in the swap slice has saved our corporate asses about a zillion times. (Damned UltraSparc II level-0 cache memory bit error...)
I guess I didn't realize that the Linux kernel doesn't do a comparable function. I wonder why not? Solaris seems comfortable locating the swap slice and writing the kernel core dump. Is partition management so much less reliable on x86 boxes? And if so, why is that a limitation for non-x86 Linux kernels, like those for Sun Sparc architectures? Makes you wonder, eh?

--
Welcome to the Panopticon. Used to be a prison, now it's your home.
Re:I'm pretty sure.. by FueledByRamen · 2003-09-15 17:22 · Score: 1

Yeah, it would have to drop back into real mode. The assumption is, though, that there will be enough of the machine (and kernel) left to reboot and write a panic log. If not? Oh well, the machine crashes again. Not like it'll hurt anything (checksumming the crash dump code once when it's loaded and again before it dumps to disk will ensure that no bits have been flipped, and sanity checking both on the inputs and outputs should take care of any weirdness not caught by that). Most of the kernel panics I've experienced were not due to hardware problems or anything that would prevent a second boot such as what I described, but software issues that could have easily dumped a proper log with my method. A device driver taking the system down is a favorite, which always left plenty of Linux intact for me to SysRQ my way to a clean unmount and reboot.

--
Every cloud has a silver lining (except for the mushroom shaped ones, which have a lining of Iridium & Strontium 90)
Re:I'm pretty sure.. by enigma48 · 2003-09-15 19:12 · Score: 1

This doesn't seem to be a difficult problem. I may have read this somewhere but no idea where.

Mark the filesystem as 'not cleanly shutdown' when the OS boots. If it shuts down properly, mark it 'clean shutdown'. The only remaining option, an improper shutdown, will leave the flag in the original state - 'not cleanly shutdown'.

Only thing to watch out for is a proper shutdown not updating the flag correctly, which should be rare - if you get to the 50th item on a shutdown list ("update flag"), you're pretty much set.

Easy... by icemax · 2003-09-14 16:15 · Score: 3, Informative

'last reboot' should show you all the recent boots

--

__________
Love conquers all... except CANCER

Re:Easy... by Anonymous Coward · 2003-09-14 16:52 · Score: 1, Insightful

But it doesn't show stuff after the boot, which is what the author is looking for.
Re:Easy... by Big+Jason · 2003-09-14 18:08 · Score: 4, Funny

Not to be confused with last | reboot, which I've done before. Doh!

Logging Unexpected Shutdowns/Crashes w/ Linux? by krishnaD · 2003-09-14 16:16 · Score: 3, Informative

/var/log/messages and /var/log/syslog should give you enough info about the kernel. There are also lots of tools to enable various kinds of accounting; check sa.

Re:Logging Unexpected Shutdowns/Crashes w/ Linux? by Creepy+Crawler · 2003-09-14 16:21 · Score: 2, Insightful

It's not enough if you're trying to determine what's throwing the system out to lunch.

I'd be apt to turn on hangcheck with a 1min restart + email on my servers. But it's better to know what made them fail.
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
Re:Logging Unexpected Shutdowns/Crashes w/ Linux? by Tux2000 · 2003-09-15 00:05 · Score: 2, Informative

Add a serial or parallel console that writes to paper, i.e. a printer. Disable syslogd and klogd and let all log output go to the console.

--
Denken hilft.

Flag it. by slittle · 2003-09-14 16:31 · Score: 2, Interesting

Same way they know to fsck/chkdsk the drives: if a 'dirty bit' (or file, in your case) exists during boot, shutdown was unclean - log it. Otherwise create it. Only clear it as the last step of a clean shutdown.
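
A minimal sketch of the flag-file variant, assuming two hooks: one run early at boot and one run as the last step of a clean shutdown. The path and function names are made up for illustration. Nothing is written at crash time; the flag simply survives because nobody cleared it.

```sh
# FLAG is a hypothetical marker path on a local, persistent filesystem.
FLAG=${FLAG:-/var/lib/unclean-shutdown.flag}

# Run early at boot (for example from an init script).
# Prints "unclean" if the previous session never cleared the flag,
# "clean" otherwise, then (re)creates the flag for this session.
check_and_set_flag() {
    if [ -e "$FLAG" ]; then
        echo "unclean"
    else
        echo "clean"
    fi
    date > "$FLAG"
}

# Run as the last step of a clean shutdown.
clear_flag() {
    rm -f "$FLAG"
}

# Usage at boot:
#   [ "$(check_and_set_flag)" = unclean ] && logger "previous shutdown was unclean"
```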

--
Opportunity knocks. Karma hunts you down.

Re:Flag it. by Creepy+Crawler · 2003-09-14 16:42 · Score: 3, Interesting

You fail to understand what happens to create the "Dirty Bit".

1: System starts up (say clean).
2: It marks a bit on the partition that system has been started up.
3: Usage Usage Usage
4: Send shutdown
5: System umounts cleanly. Undoes "dirty bit"
6: Power == 0

On a dirty FS, stage #5 is never hit so when the system comes back on, it checks the bit and detects the unclean shutdown. The bit is never written during the unclean shutdown.

In the similar problem, I see problems when NTkern crashes. How exactly does it manage to:

1: Read the partition
2: Read the program on the partition
3: Run the insert log program to add log entry
4: Still have the "blue screen"

I smell nasty data corruption waiting to happen. After all, if you can't guarantee the state of the kernel, does it really justify reading, writing, and executing on a crashed kernel????
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
Re:Flag it. by slittle · 2003-09-14 16:54 · Score: 1

Where did I say to write to disk during a crash?

--
Opportunity knocks. Karma hunts you down.
Re:Flag it. by Anonymous Coward · 2003-09-14 20:18 · Score: 0

> In the similar problem, I see problems when NTkern crashes. How exactly does it manage to:
>
> 1: Read the partition
> 2: Read the program on the partition
> 3: Run the insert log program to add log entry
> 4: Still have the "blue screen"
>
> I smell nasty data corruption waiting to happen. After all
> if you cant guarantee the state of the kernel, does it
> really justify reading, writing, and executing on a crashed
> kernel????

A smart system does steps 1, 2, and half of 3 at boot time, then locks down the code in all ways possible (don't swap, execute only, etc). On top of that, the logging code should make as few assumptions as possible (e.g. by initializing the stack pointer).

That leaves
- writing crash data somewhere: this can easily be done into a pre-allocated block of physical RAM
- executing the logging code: this is not different from executing the 'blue screen' or 'print "PANIC"' code

Kernel Panic on Linux? Sounds like hardware prob. by Radical+Rad · 2003-09-14 16:39 · Score: 2, Informative

After 10 years without ever needing to apply the knowledge I forgot how. Would the magic sysrq key help? I bet it is a hardware problem though. And what about logging power outages? That is easy to do. APC probably has Linux software already to do this. For other logging there are ample facilities on Linux. Start a syslog server. Point everything to the loopback address.

Re:Kernel Panic on Linux? Sounds like hardware pro by Drakon · 2003-09-14 16:43 · Score: 4, Funny

If you run 2.6.0-test6 with -mm15 and some home brewed patches, you can have crashes without hardware failure

(one who speaks from experience) :-)

--
Buttsex.

Other OSes by menscher · 2003-09-14 16:49 · Score: 5, Informative

This will probably be modded down as flame bait, but I can't resist pointing out what some other OSes have done when crashing:

IRIX will core dump to the swap partition. On the next boot it analyzes this core file, which includes various system logs, etc, and saves useful output in /var/adm/crash. You know you've done a good job when the kernel panic causes a panic, called a double panic. I used to be able to trigger those at will. Hrmm, I should test that on the current release.

AIX summarizes the likely causes of failure (power failure, someone pressed the power switch, or power supply died, etc). I've seen (but do not personally use) a similar thing with IRIX that actually assigns a percentage confidence level to its guess.

Of course, usually you know there was a power failure because your UPS told you so.... I did have one case where we had a very brief outage (or maybe just a brownout). Every machine in the building had rebooted.... except one. That RS/6000 had an eerie log message like "power failure detected". And no, it was not on a UPS. I was rather impressed.

Sadly, I don't know how to get any useful information out of Linux. And don't give me crap about it never crashing. I can prove otherwise. Too bad I can't figure out why.... Maybe a kernel developer will read this and copy some ideas from the commercial Unix vendors.

Re:Other OSes by FueledByRamen · 2003-09-14 17:33 · Score: 3, Interesting

Of course, usually you know there was a power failure because your UPS told you so.... I did have one case where we had a very brief outage (or maybe just a brownout). Every machine in the building had rebooted.... except one. That RS/6000 had an eerie log message like "power failure detected". And no, it was not on a UPS. I was rather impressed.

I had a similar interesting experience with an SGI Indy (Irix 6.5.13, or thereabouts). I was booting it up after it'd been sitting for a while, just to see what I had running on there. While it was going, and I was fumbling around for an ethernet cable for it (it takes several minutes at boot to wait for a cable instead of noting its absence and moving on), I kicked the power strip that it was on and the plug wiggled around in the wall socket. I heard a spark jump in the socket, and the monitor it was on (Dell/Sony Trinitron 19") went to half-height mode for a few seconds, spitting and clicking, turning the screen on and off and varying the vertical height randomly.

I expected the Indy to kernel panic or turn off. Instead, below the complaints about the missing ethernet cable ("en0: link carrier not detected" or similar), there was a lone status message: "Power failure detected."

No UPS, no power saving devices of any kind, only the filter caps in the power supply between the logic board and the unreliable, crufty power system of a 70 year old house at the mercy of a power strip first used on my (brand new at the time) Atari 800. The other computer on the power strip (350 P2 running RH 7.1) rebooted hard, right in the middle of heavy FS activity. I had to hit the reset button before it would come back up again, too - the brownout hung the POST.

--
Every cloud has a silver lining (except for the mushroom shaped ones, which have a lining of Iridium & Strontium 90)
Re:Other OSes by Ster · 2003-09-14 18:06 · Score: 3, Informative

Mac OS X writes a crash dump to the non-volatile RAM in the event of a panic. Then, after the next successful boot, it reads out the dump and adds it to /Library/Logs/panic.log. If, for some reason, the machine won't come back up, you can probably read the dump from OpenFirmware.

-Ster
Re:Other OSes by kinema · 2003-09-14 18:24 · Score: 3, Interesting

I wonder if /dev/nvram (the small amount of NVRAM available on the RTC) is large enough to store such a dump.
Re:Other OSes by anthony_dipierro · 2003-09-14 18:37 · Score: 4, Informative

IRIX will core dump to the swap partition.

FreeBSD does this. HP/UX does this. I always assumed Linux did it too, it just wasn't turned on by default. I guess I was wrong.

As a side note, my first job out of college was to analyze core dumps from HP/UX. There's an awful lot you can learn from these things. Not just stack traces, the entire memory of the system is contained in the dump. It's time consuming, but a large portion of the time you can find out *exactly* what went wrong.
Re:Other OSes by larien · 2003-09-14 19:23 · Score: 1

FWIW, Solaris does the same as IRIX, saving the output to /var/crash/`hostname`
From that, you can run some tests on the core files to get some info about what went wrong as well as things like stack traces & process lists. Even without analyzing that, you'll generally get some info in /var/adm/messages.
Linux should have some method of capturing what errors were generated during crashes; count it as one of those "enterprise level" features....
Re:Other OSes by cookd · 2003-09-14 20:55 · Score: 2, Interesting

Windows does something like this too.

At Blue Screen, it will make a dump in the swap partition if so configured. The dump can be a 64k error summary (MiniDump), kernel memory dump, or a full physical memory dump (if swap > physical memory). While there is a slim possibility that doing this might make things worse (if the code to write the dump is corrupted, or the disk driver is corrupted), it is MUCH more likely that the information written will be useful. Also, the swap partition driver is pretty stable and simple, so chances are very good that it won't mess up anything that wasn't already messed up. If you're paranoid, you can turn off this feature.

At clean shutdown, it writes an event to the event log indicating clean shutdown.

At boot, if there is no "clean shutdown" event, it writes an "unexpected shutdown" event to the event log. It estimates the time of the crash based on the last events in the event log. Since Windows has periodic "I am running ok" events recorded to the event log, it can use the last "I am ok" event to guess at the crash time.
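
The periodic "I am running ok" idea can be approximated on Linux with a heartbeat file. This is only a sketch under assumptions: the path and function names are invented here, and something like cron is assumed to call `record_heartbeat` every few minutes. The last heartbeat then gives a lower bound for the crash time.

```sh
# HEARTBEAT is a hypothetical path for the heartbeat log.
HEARTBEAT=${HEARTBEAT:-/var/lib/heartbeat.log}

# Call periodically (for example from cron every few minutes).
record_heartbeat() {
    date '+%Y-%m-%d %H:%M:%S' >> "$HEARTBEAT"
}

# Call at boot: prints the last recorded heartbeat,
# or "unknown" if no heartbeat has been recorded yet.
last_heartbeat() {
    if [ -s "$HEARTBEAT" ]; then
        tail -n 1 "$HEARTBEAT"
    else
        echo "unknown"
    fi
}
```

The file grows without bound in this form; a real setup would rotate or truncate it.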

At boot, if there is a crash dump in the swap partition, it is recovered and copied to a file for subsequent analysis.

--
Time flies like an arrow. Fruit flies like a banana.
Re:Other OSes by isorox · 2003-09-14 20:59 · Score: 3, Funny

Nah, if kernel developers read slashdot, nothing would get done!
Re:Other OSes by Tux2000 · 2003-09-14 23:45 · Score: 2, Informative

Nope. RTC memory is something between 128 Bytes (IBM AT) and 2 KBytes (IBM PS/2 series). And each bit of it is used for the BIOS and some hardware stuff (Microchannel requires a lot of memory). Perhaps some machines have a few unused bits. But you can't stuff all your memory into them. You can't compress several megabytes or gigabytes into 10 to 20 Bits (at least not losslessly). With a lot of luck and deep knowledge of the used machine and BIOS, you may be able to store a dirty-or-clean shutdown flag. But as I said, it depends very much on the machine and the BIOS.

--
Denken hilft.
Re:Other OSes by ColaMan · 2003-09-14 23:55 · Score: 1

I guess that they have some sort of monitor on the "power good" wire that comes from your power supply. Maybe it latches and is then reset? If you get a brief "power bad" and still have enough juice in the capacitors to run things, the kernel notes the latched line and resets it and adds a "power failure detected" log.

Well, that's how I'd do it if *I* was a Quality Computer Manufacturer, anyway ;-)

--

You are in a twisty maze of processor lines, all alike.
There is a lot of hype here.

Try the Linux Kernel Crash Dump (LKCD) patches by bigsteve@dstc · 2003-09-14 17:02 · Score: 5, Informative

If you are adventurous, you could try applying the LKCD patches to your kernel. Start looking here

Re:Try the Linux Kernel Crash Dump (LKCD) patches by Anonymous Coward · 2003-09-14 20:20 · Score: 2, Informative

It's located here for the uninitiated.

I took a look at it a while back and it looked interesting. Just checked back at the site and there is a reasonable howto provided by IBM in the doc section which should give you some idea of what/how it works.

It's worth a look, but honestly, it sounds a LOT more like a hardware issue than software. Is the server on a UPS? If not, get it on one. Reseat all your cards and RAM etc., then see if it crashes as regularly.
Re:Try the Linux Kernel Crash Dump (LKCD) patches by GiMP · 2003-09-15 00:48 · Score: 1

Except this is a dedicated server. It may take a while for him to get anyone on the telephone (or via email) to do this.. and they may even charge him.

The more the client (owner of the server) can do himself, the better.

Re:Event Log? by Anonymous Coward · 2003-09-14 17:38 · Score: 1, Informative

Exactly what parallel universe are you living in? I've never ever get useful event log after the NT/2K goes BSOD.

Apparently, he lives in the same parallel universe I do. I suppose you think the checkbox in Startup and Recovery labeled "Write an event to the system log" is there for looks?

Re:Kernel Panic on Linux? Sounds like hardware pro by StalinJoe · 2003-09-14 18:23 · Score: 1

I also bet he is experiencing a hardware problem. Did he run memtest86? fsck? Were they clean?

--
"Those who cast the votes decide nothing; those who count the votes decide everything." - Josef Stalin

rc.local by bobbozzo · 2003-09-14 19:18 · Score: 2, Informative

As others have mentioned, there are various ways to see when the system rebooted.

If you want to be emailed if the system reboots, put something at the end of /etc/rc.d/rc.local, if you're using something like RedHat (SYSV init, IIRC).
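
Something along these lines could go at the end of rc.local. It is a sketch, not a tested setup: `reboot_notice` is a made-up helper, the recipient address is a placeholder, and it assumes a working `mail` command and local mail delivery at that point in the boot.

```sh
# reboot_notice
# Prints a one-line message naming the host and the current time.
reboot_notice() {
    printf 'Host %s booted at %s\n' "$(uname -n)" "$(date)"
}

# At the end of /etc/rc.d/rc.local (address is a placeholder):
#   reboot_notice | mail -s "reboot: $(uname -n)" admin@example.com
```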

Logwatch will probably let you know if the system rebooted also.

If you want a log of the kernel panic, or something else, that's a lot more complicated, as others have mentioned.

--
Nothing to see here; Move along.

Fairly standard for !(Linux) by devphil · 2003-09-14 19:36 · Score: 1

IRIX will core dump to the swap partition. On the next boot it analyzes this core file, which includes various system logs, etc, and saves useful output in /var/adm/crash.

Solaris does the same thing. Actually, I think several commercial Unixes do this. Some even provide some basic analysis tools so that you can pore over the /var/wherever/crash dumps yourself; see which processes were running, which ones were on the CPUs when it crashed, which instruction was executing, etc.

I've always been disappointed that this hasn't been part of Linux. Copying down OOPS text by hand onto paper and then typing it back in after the reboot is needlessly difficult. I don't have terminals sitting around for serial output. I've heard rumours that something like the save-to-swap-space facilities are finally going in, or that there are patches available for the DIY'ers.

And in my particular case, I'm not sure it would help anyhow. My desktop machine occasionally just goes *click* and reboots. If it tries to panic, it may not get time, I dunno, I'm not here to watch it. I do know that when I have gotten OOPSes, I usually don't bother trying to send a useful report in to lkml, because I don't have pen and paper around.

--
You cannot apply a technological solution to a sociological problem. (Edwards' Law)

Re:Fairly standard for !(Linux) by bigsteve@dstc · 2003-09-15 12:00 · Score: 1

Solaris does the same thing. Actually, I think several commercial Unixes do this.
I recall that even 4.1 and 4.2 BSD had crash dumps and rudimentary crash dump analysis tools. I think it was 4.2 BSD that introduced dumping to the swap partition and the utility that snarfed the dump on reboot. (I could be wrong though: it was 20 years ago).
Warning: gratuitous punctuation detected!

"Linux crashes" are probably contact failure. by Futurepower(R) · 2003-09-14 19:37 · Score: 3, Interesting

As others have said, the "Linux crash" is probably hardware failure.

The most common cause of serious failure, if the software has been installed correctly and tested, is bad contacts. To fix the problem, just loosen the screws that hold the adapter cards, pull the cards out about 1 millimeter or 1/32 of an inch, push the cards back in fully, and re-tighten the screws. Also, pull all connectors off a similar amount, and push them back on. Do the same with the memory modules. That's all.

The scraping caused by moving the contact points a tiny amount is actually very violent on a micro scale. The scraping removes oxide that causes a contact to lose electrical conduction.

This is reliable information. I've been selling and occasionally repairing PCs since before IBM sold PCs, back in the days when personal computers cost $2300, had two diskette drives and no hard drive, and ran the CP/M operating system.

My guess is that, if you had a penny for every real crash of a stable distribution of Linux, after a few years you might still have to borrow money from your little brother to buy a piece of bubble gum.

Re:"Linux crashes" are probably contact failure. by gazbo · 2003-09-14 21:05 · Score: 0, Flamebait

Do you recall the parable of the Good Samaritan? Of course you do. Now bear with me here:
Well the context around it is he is telling it to a Pharisee in answer to the question "Who is your neighbour?"
Well, Jesus tells the story, and when he asks the Pharisee which of the three men was his neighbour, the Pharisee answers: "The one who helped him." Y'see, the thing is that the Pharisees despise the Samaritans so much that he couldn't bring himself to even say their name.
I was just wondering if you saw any parallels between this and the fact you insist writing the quotes around "Linux crash"? Whether it had occurred to you how ridiculous it was that your dogmatic fanaticism about Linux means you can't even bring yourself to write about it crashing?

When will they learn by thebatlab · 2003-09-14 19:48 · Score: 0, Offtopic

When will M$ learn that their crappy WinDoze software is just a bug ridden mess of......excuse me, what? Linux....crash??!?! I have dreamed a dream and now that dream is gone....

It'll never catch the things you want... by Bazman · 2003-09-14 20:51 · Score: 2, Interesting

Murphy's law will apply and the thing that causes your system to crash won't be trapped by whatever magic you try to log it with! We recently had a machine that would just power-down without warning. I eventually discovered it happened after intensive CPU load for about 20 mins, figured maybe it was some heating problem, kicked up the sensors package and spotted the CPU temp heading into egg-frying temperatures. It seems the BIOS would just protect its motherboard by shutting down. The kernel had no chance to report anything.

Re:It'll never catch the things you want... by PapaZit · 2003-09-15 05:36 · Score: 1
Check out the lm_sensors project at http://secure.netroedge.com/~lm78/
There are tools to monitor CPU temperature under Linux. Most PC motherboards made in the last few years have included temperature monitoring. The biggest problem is that most motherboard makers include different "fudge factors" in their setup, so different mobos have different settings and finding the actual temperature can be tricky.
What I do:
- Don't worry about the accuracy of the temperature that's being reported. It has no bearing on reality. Instead, worry about relative temperature. First, get a baseline temperature when the system's stable but busy (half an hour into a large compile is a good time to take a reading).
- When the temperature's 10 above baseline, send a high-priority syslog message (which I have set to log locally, remotely, and wall to all users.) I use "Is it hot in here, or is it just me?" as my warning message. :)
- When the temperature's 20 above baseline, alarm and shut down cleanly.
- When the temperature's 30 above baseline, halt immediately. Better to fsck or restore from backup than melt the processor.
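
The thresholds above can be sketched as a small decision function. How you read the current temperature is left out, since the sensor output format differs between boards; `temp_action` is a hypothetical name and both arguments are whole degrees.

```sh
# temp_action BASELINE CURRENT
# Prints the action for CURRENT relative to BASELINE (whole degrees):
#   ok | warn | shutdown | halt
temp_action() {
    delta=$(( $2 - $1 ))
    if [ "$delta" -ge 30 ]; then
        echo "halt"
    elif [ "$delta" -ge 20 ]; then
        echo "shutdown"
    elif [ "$delta" -ge 10 ]; then
        echo "warn"
    else
        echo "ok"
    fi
}

# Usage, with a baseline of 45 and $current read from your sensors:
#   case "$(temp_action 45 "$current")" in
#       warn) logger "Is it hot in here, or is it just me?" ;;
#   esac
```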
--
Forward, retransmit, or republish anything I say here. Just don't misquote me.

Built-in UPS by ggeens · 2003-09-14 21:16 · Score: 1

At work, we used to have a SUN E250 [1]. One day, the power went away. (Turned out to be a problem with the airco.) After the power came back, I checked the logs, and I saw that the machine had been writing messages like "Power lost, running on backup power", and when the power came back "Switching to AC". (The monitor wasn't protected, so there was no way to check the machine during the blackout.)

The PC in the same room didn't have backup power and shut down. It came back with no ill effects.

The RS/6000 in your story probably had a power monitor and a large voltage protector, or maybe it had its own UPS as well.

[1] We still have it somewhere, but I'm no longer using it.

--
WWTTD?

Depends on what it's doing by Sits · 2003-09-14 21:18 · Score: 3, Informative

To the best of my knowledge Linux doesn't automatically reboot after a kernel crash unless you have told it to. If the crash was that severe, this means you can walk up to the crashed machine and read the oops off the screen. If the machine isn't oopsing before the reboot, this suggests some sort of hardware fault (e.g. your CPU is overheating). If it is hardware resetting the machine, it is very unlikely that Linux can tell you what the fault is by itself (e.g. if it was the CPU overheating you will have to find some way to log the temperature to a file and look at the graph leading up to the crash yourself).

Oh and here's a useful way of working out whether there was a crash or not:
last -x | grep "shutdown\|reboot"
Every reboot that doesn't have a matching shutdown was probably a crash (other than the last line).
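
If you'd rather not eyeball the output, the same check can be sketched in Python. It assumes the usual `last -x` layout (newest entry first, with "reboot" and "shutdown" as the first field), and the function name is made up for the example.

```python
def unclean_boots(last_lines):
    """Return the boot records that were not followed by a shutdown record.

    `last_lines` is the output of `last -x` split into lines, newest first.
    """
    def kind(line):
        fields = line.split()
        return fields[0] if fields else ""

    # Keep only the reboot/shutdown pseudo-user records, oldest first.
    events = [line for line in reversed(last_lines)
              if kind(line) in ("reboot", "shutdown")]

    suspects = []
    for current, following in zip(events, events[1:]):
        # Two boots in a row: the first session never logged a shutdown.
        if kind(current) == "reboot" and kind(following) == "reboot":
            suspects.append(current)
    return suspects
```

The most recent boot is never flagged, since that session is still running and has nothing after it to compare against. Feed it something like the lines from `subprocess.run(["last", "-x"], capture_output=True, text=True).stdout.splitlines()`.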

Re:Depends on what it's doing by stef0x77 · 2003-09-15 03:20 · Score: 1

last -x | grep "shutdown\|reboot"

Don't forget the quotes :)

Here's how: by samjam · 2003-09-14 21:19 · Score: 4, Informative

1) First disable console blanking, so that when you get to the crashed box and plug the monitor in you can see the kernel panic message:
/usr/sbin/setterm -blank 0 -powersave off -powerdown 0

We had some early kernel 2.4 Red Hat boxes crashing like the dickens for a while. It was a kernel problem, and only when it happened on a local machine under our eyes did we realise what had happened.

2) Network syslog:
If you syslog to a central machine, not only does it make error spotting centralised and easier, but it also means you have the last gasps of the crashed machine logged on a machine that is still up.
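
For the system logs themselves the usual route is a forwarding line in syslog.conf (something like `*.* @loghost`, where loghost is a made-up name for your central box). If you also want your own scripts to log to the central machine, here is a minimal Python sketch using the standard library's SysLogHandler over UDP. The logger name and the default port are assumptions for the example.

```python
import logging
import logging.handlers


def remote_logger(host, port=514, name="crashwatch"):
    """Return a logger that forwards its records to a syslog daemon over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # With a (host, port) address the handler sends UDP datagrams by default.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```

Usage would be along the lines of `remote_logger("loghost").error("fan speed dropped")`. UDP is fire-and-forget, so nothing tells you if the central machine isn't listening, and the receiving syslogd has to be set up to accept remote messages.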

Sam

--
blog.sam.liddicott.com

Re:Kernel Panic on Linux? Sounds like hardware pro by Anonymous Coward · 2003-09-14 21:39 · Score: 0

Er, how about the 2.4 series kernel with high disk I/O and quotas turned on. That will get you random panics, no problems.

serial console by treat · 2003-09-14 21:47 · Score: 3, Informative

A serial console (make sure you enable the magic sysrq key! for some reason RedHat disables it by default) is an essential tool for any Linux server you care about. If you don't have the money for a console server, just plug servers into each other.

If your machine crashes without a panic message, however, you're out of luck. Wait until crash dumps are available - I'm surprised this isn't a 2.6 feature. Until we get crash dumps that work 99% of the time (like on Sparc-Solaris), Linux will continue to suck. At least it sucks less than the alternatives.

Linux Auditing by RedPhoenix · 2003-09-14 22:00 · Score: 1

Although not really capable of providing an audit of reboots (for a variety of reasons, already outlined above), Snare for Linux (google for 'snare') is roughly analogous to the Windows Event log.

Snare is capable of monitoring events such as file opens, execve's, setuid/setgid and so on, which may assist in tracking down the problem.

Red.

The cleaning team by Tux2000 · 2003-09-15 00:10 · Score: 3, Funny

"Now, where was the power outlet for the vacuum cleaner? Hell, I'll tear out that red cable and plug the vacuum cleaner there."

--
Denken hilft.

Re:The cleaning team by schon · 2003-09-15 04:38 · Score: 1

Even better:

"Hmm, that box in the corner is beeping. That doesn't sound good - I think I'll turn it off."

The box was a UPS.
Re:The cleaning team by foo12 · 2003-09-15 09:22 · Score: 1

Something similar happened with our film processor (processes film which comes out of the imagesetter). It beeps when it wants chemicals --- cleaning crew turned it off. Know what happens to unheated, unstirred chemicals in a film processor? They solidify into a crystalline mess.

How the big systems do it... by hughk · 2003-09-15 00:10 · Score: 1

I've worked on some of the larger systems like VMS, and when they crashed they would leave anything from a minimal dump of the exec event message queues to a full dump of the system state. The area used for system dumps is pre-allocated and set aside during startup so the exec can locate the file directly by block number without using the file system (or even the regular drivers).

Part of the system startup would scan the dump for the logging buffers, extract the messages and append them to the log file. The file system would have been recovered at this point so a corrupt disk isn't a problem.
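
To make the idea concrete, here is a user-space sketch in Python of a reserved, fixed-size dump area that is found by offset alone. This is not how VMS (or anything else) actually lays out its dump file. The header format, the magic value and the function names are all invented for the illustration, and `dev` is any seekable binary file object (for real use that would be a raw partition or preallocated file).

```python
import struct
import zlib

MAGIC = b"DUMP"
HEADER = struct.Struct("<4sII")  # magic, payload length, CRC32 of payload


def write_dump(dev, offset, size, payload):
    """Write payload into the reserved region [offset, offset + size)."""
    capacity = size - HEADER.size
    if capacity <= 0:
        raise ValueError("reserved region is too small for the header")
    if len(payload) > capacity:
        payload = payload[-capacity:]  # keep the newest bytes
    dev.seek(offset)
    dev.write(HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)))
    dev.write(payload)


def read_dump(dev, offset, size):
    """Return the saved payload, or None if the region holds no valid dump."""
    dev.seek(offset)
    raw = dev.read(size)
    if len(raw) < HEADER.size:
        return None
    magic, length, crc = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:HEADER.size + length]
    if magic != MAGIC or len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload
```

The startup script would call read_dump, append whatever comes back to the normal log file, and then overwrite the header so the same dump isn't picked up twice. The checksum is there so a half-written or stale region is ignored rather than appended as garbage.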

I don't know how NT does it, but I would guess something similar given that the architect and some of his team were refugees from Digital's central engineering.

Regrettably, the kernel dump project for Linux is somewhat 'on hold', as Linus would rather follow what the big vendors do (those offering enterprise support will need crash dumps for diagnostic purposes).

--
See my journal, I write things there

Re:Kernel Panic on Linux? Sounds like hardware pro by tzanger · 2003-09-15 00:18 · Score: 1

If you're running test kernels (and kernel-hacker specific patchsets to boot!) on a production server, you should be shot. Or at least demoted.

Linux is VERY reliable. by Futurepower(R) · 2003-09-15 01:16 · Score: 1

Nonsense.

I have a lot of experience fixing hardware failures. Before I started doing computer work exclusively, I was an electronics design engineer. So, I'm able to understand hardware issues, and have something to contribute in that area.

Perhaps Slashdot people don't have much experience with hardware problems, and are skeptical of anyone who does, because answers to hardware problems are not usually modded up, and are often attacked.

Linux has millions of technically knowledgeable users. Those users know that, if they report a problem with crashing accurately, it will be fixed. I've never reported a Linux crash because I've never seen one. However, I did report a crash in Mozilla before breakfast one day, and the bug was fixed just after breakfast. Linux developers are the same way. So, it is common that users report literally years of uptime.

Now, what chance is there that the person who wrote the Slashdot story is seeing multiple real crashes, due to badly written software in Linux itself, instead of bad hardware or a poorly selected hardware driver? That chance is very, very small, given the circumstances.

Re:Linux is VERY reliable. by TheSunborn · 2003-09-15 04:12 · Score: 1

Remember that the drivers are part of the Linux kernel. Linux does contain drivers of high quality, but it also contains drivers which sometimes crash, and finding out which driver is causing the problems is quite a problem in itself. So he might have a piece of hardware with a bad driver.

Martin
Re:Linux is VERY reliable. by Vlad_the_Inhaler · 2003-09-15 07:37 · Score: 2, Insightful

I built the machine I am writing this on around 18 months ago. After a few months it became totally unstable (after a software upgrade to SuSE 8.0, I think). Now it runs SuSE 8.2 with absolutely no hardware changes and has not died on me for months.
Other people had no problems with that level.
The driver for the Realtek 8139 that came with the early 2.4 kernels used to kill the machine I first ran it on. Kill it stone dead; I had to hit reset to restart. The machine is dual-boot and worked fine under Win95. That problem was fixed in a kernel that came out in late 2001 (?) and that NIC has always worked just fine in this machine since I built it, as did the old 3Com card I replaced it with in the older machine.

Linux is not impervious to quality problems. No OS is.

--
Mielipiteet omiani - Opinions personal, facts suspect.

Re:Kernel Panic on Linux? Sounds like hardware pro by krishnaD · 2003-09-15 01:38 · Score: 1

I had another problem: the entire machine used to freeze whenever I rsynced the whole of /dev/sda2. I checked everything (memory, fsck, hdparm) and none of them showed any problem, but when I just changed the HDD cable everything worked fine.

Yes! by twistedcubic · 2003-09-15 03:14 · Score: 4, Funny

Is it possible to do something similar in Linux?

Yeah, but we have to wait until our SCO insider funnels us the code.

Linux Trace Toolkit by bendl · 2003-09-15 03:18 · Score: 2, Interesting

I'm working on a project called Linux Trace Toolkit (LTT) that is suitable for automatic logging.

LTT logs every system call with nanosecond precision to a RAM buffer and then to disk. The events include, for instance, read/write/open operations, system calls, interrupts, process state, disk and network interface operations and so on. You can add specific events by modifying your application and recompiling it with the LTT library.

LTT is not yet included in the kernel and was not chosen after the "Halloween Freeze". However, the new infrastructure can operate in a "flight recorder" mode that will, for instance, log the last 5 Mb of events that happened on the system.

Of course, when there is a kernel crash, you cannot be certain to have those events on disk, but this is a chicken-and-egg problem.
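
For anyone unfamiliar with the "flight recorder" idea, here is a toy version in Python: keep appending events, and throw away the oldest ones once a fixed byte budget is exceeded. This is not LTT's implementation or API, just the general technique, and the class and method names are made up.

```python
from collections import deque


class FlightRecorder:
    """Keep only the most recent events, up to a fixed byte budget."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.events = deque()

    def record(self, event):
        data = event.encode() if isinstance(event, str) else bytes(event)
        self.events.append(data)
        self.used += len(data)
        # Evict the oldest events until the buffer fits the budget again.
        while self.used > self.max_bytes and self.events:
            self.used -= len(self.events.popleft())

    def snapshot(self):
        """Return the retained events, oldest first."""
        return list(self.events)
```

The point of the design is that recording never blocks on disk: memory use is bounded, and whatever is still in the buffer when something goes wrong is the most recent history.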

Anyway, I believe this kind of functionality is in demand for most critical applications. It is very important in the embedded market too, where debugging and optimization are very painful.

Fan? by shadowpuppy · 2003-09-15 04:28 · Score: 1

I don't know if this applies to your situation but it can't hurt to check the fan. My work Linux machine got in the habit of crashing for a bit. Turned out the CPU fan wasn't working. I haven't had a crash in months now that I've fixed it.

Anyway, Linux machines rarely crash in my experience, and my top suspect is usually hardware when one does.

Re:Event Log? by i_r_sensitive · 2003-09-15 05:23 · Score: 1

Apparently, he lives in the same parallel universe I do. I suppose you think the checkbox in Startup and Recovery labeled "Write an event to the system log" is there for looks?

So what? I click the little checkbox, wait for my next BSOD, and voila, I've got something useful? Head shake time, unless your idea of useful is finding out that WinDoze is the problem and you have to wait for M$ to fix it. Bottom line, I'd rather have to work a little harder to find the problem, and be able to fix it, than to have the problem spelled out in plain English and be at the mercy of the three monkeys in Redmond:

See no source

Hear no source

Speak no source.

--
"Talk minus action equals nothing" - Joey Shithead, D.O.A.
"Talk minus action equals /." -

Re:Doesn't work... by scsirob · 2003-09-15 06:15 · Score: 1

Something is wrong with that. See below output of my headless EPIA file/print server locked away somewhere deep and dark..
robtu@astra:~$ last reboot

wtmp begins Thu Sep 4 09:47:50 2003
robtu@astra:~$ uptime
21:12:13 up 235 days, 1:48, 1 user, load average: 0.00, 0.00, 0.00
robtu@astra:~$

--
To Terminate, or not to Terminate, that's the question - SCSIROB

Some ideas by Gudlyf · 2003-09-15 07:19 · Score: 4, Informative

Mission Critical Linux does this.

There's also the LKCD (Linux Kernel Crash Dumps) package:

LKCD contains kernel and user level code designed to:

Save the kernel memory image when the system dies due to a software failure;
Recover the kernel memory image when the system is rebooted;
Analyze the memory image to determine what happened when the failure occurred.

--
Trolls lurk everywhere. Mod them down.

A few hints by kasperd · 2003-09-15 07:54 · Score: 2, Interesting

You can use a serial console or try out some version of the netconsole patch to get the messages on another computer. (Note that netconsole over the internet is probably possible, but it is sent in the clear and can be snooped or modified.) I also recall reading about a patch that keeps a new kernel ready in memory, to be booted with arguments telling it where to find the log from the old kernel. I think it even included a checksum to prevent booting the new kernel if it had been corrupted.
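
On the receiving end of netconsole you only need something that listens for UDP datagrams and writes them down. A minimal Python sketch follows. The port number (6666) is an assumption, so use whatever your netconsole setup is configured to send to, and the function names are made up.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming console messages."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def collect(sock, count, timeout=5.0):
    """Receive up to `count` datagrams; return (sender, text) pairs."""
    sock.settimeout(timeout)
    messages = []
    try:
        for _ in range(count):
            data, (sender, _port) = sock.recvfrom(65535)
            text = data.decode("utf-8", "replace").rstrip("\n")
            messages.append((sender, text))
    except socket.timeout:
        pass  # nothing more arrived within the timeout
    return messages
```

In practice you would loop forever and append each message to a per-sender file. As said above, there is no authentication, so only trust what arrives on a network you control.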

--

Do you care about the security of your wireless mouse?

You must be mistaken. by Viqsi · 2003-09-15 12:22 · Score: 2, Funny

Unexpected shutdowns? Crashes? You must be mistaken. Linux does not crash. Ever.

Now, what was your name and address again?

--
viqsi - See "vixen"
If we do not change our direction we are likely to end up where we are headed.

sysrq disabled for some reason... by Fareq · 2003-09-15 13:27 · Score: 1

Actually, the "magic sysrq key" is disabled by default for a damn good reason.

The "magic SysRq key" is a key sequence that allows some basic commands to be passed directly to the kernel. Kernel software developers use this interface to debug their software. Under most circumstances it can also be used to uncleanly reboot the computer, something that is otherwise difficult or expensive to do remotely.

Anyone can dial into a modem and send a break, so if the serial console is attached to a modem we need to disable the magic SysRq key

So, the SysRq key is disabled because it can be used (remotely) to do bad things, like an unclean shutdown, something you probably don't want people to do with your servers. (Only under certain circumstances, but it's likely that one wouldn't remember about SysRq, it being mostly unused and all.)

Quotes from The Linux Documentation Project www.tldp.org
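
If you want to know how a given box is set, the switch is exposed as /proc/sys/kernel/sysrq. A small Python sketch for reading it follows (the function names are made up). 0 means disabled and 1 means enabled. Newer kernels also accept other values as a bitmask of permitted functions, which this just reports rather than decodes.

```python
SYSRQ_PATH = "/proc/sys/kernel/sysrq"


def describe_sysrq(raw):
    """Interpret the text read from the sysrq control file."""
    value = int(raw.strip())
    if value == 0:
        return "disabled"
    if value == 1:
        return "enabled"
    # Other values: a bitmask of permitted functions on newer kernels.
    return "restricted (bitmask %d)" % value


def current_sysrq(path=SYSRQ_PATH):
    """Read and describe the setting on the running system."""
    with open(path) as f:
        return describe_sysrq(f.read())
```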

Re:sysrq disabled for some reason... by treat · 2003-09-22 02:06 · Score: 1

So. the SysRq key is disabled because it can be used (remotely) to do bad things, like an unclean shutdown,
So you're saying that this is a massive security hole on every one of Sun's Sparc machines that has gone unnoticed all these years, since they have the same problem?

NetDump & LKCD by grothesk · 2003-09-15 14:25 · Score: 1

You can use a central NetDump server to collect the oops messages and a dump of physical memory from every Linux box on your network...

Check out
http://www.redhat.com/support/wpapers/redhat/netdump/

another link with the lkcd patches
https://projects.clusterfs.com/lustre/NetDump

How to know if a linux box went down.. by DRACO- · 2003-09-16 17:48 · Score: 1

Type last
You won't be able to find out why without taking a stroll down /var/log/messages and guessing what led up to the problem, but then again how often does that happen anyway?

Quick tip: try checking your IRQs.

DRACO-

--
Consider yourself blessed if you are sneezed on by a dragon and only get wet, it could have been a fireball.

Slashdot Mirror

86 comments